Arxiv今日论文 | 2025-03-14

本篇博文主要内容为 2025-03-14 从Arxiv.org论文网站获取的最新论文列表，自动更新，按照NLP、CV、ML、AI、IR五个大方向区分，若需要邮件定时接收，请在评论区留下你的邮箱号。

说明：每日论文数据从Arxiv.org获取，每天早上12:00左右定时自动更新。

友情提示: 如何您需要邮箱接收每日论文数据，请在评论处留下你的邮箱。

【速读】：该论文试图解决在大规模模型仓库中搜索和分析神经网络模型的问题，特别是缺乏系统性导航工具的挑战。随着数百万公开可用神经网络的存在，构建一个全面的“模型地图”（atlas）变得至关重要，但由于大多数模型文档不足，绘制这样的地图极具挑战性。论文的关键解决方案在于提出了一种方法，通过利用基于现实世界主流模型训练实践的高置信度结构先验（structural priors），实现对未记录区域的准确映射，从而扩展现有的“模型地图”。这一方法不仅完善了已有的Hugging Face模型仓库可视化图谱，还展示了其在预测模型属性（如精度）及分析计算机视觉模型趋势等应用中的潜力。

链接: https://arxiv.org/abs/2503.10633
作者: Eliahu Horwitz,Nitzan Kurer,Jonathan Kahana,Liel Amar,Yedid Hoshen
机构: School of Computer Science and Engineering (计算机科学与工程学院), The Hebrew University of Jerusalem (耶路撒冷希伯来大学), Israel (以色列)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:As there are now millions of publicly available neural networks, searching and analyzing large model repositories becomes increasingly important. Navigating so many models requires an atlas, but as most models are poorly documented charting such an atlas is challenging. To explore the hidden potential of model repositories, we chart a preliminary atlas representing the documented fraction of Hugging Face. It provides stunning visualizations of the model landscape and evolution. We demonstrate several applications of this atlas including predicting model attributes (e.g., accuracy), and analyzing trends in computer vision models. However, as the current atlas remains incomplete, we propose a method for charting undocumented regions. Specifically, we identify high-confidence structural priors based on dominant real-world model training practices. Leveraging these priors, our approach enables accurate mapping of previously undocumented areas of the atlas. We publicly release our datasets, code, and interactive atlas.
zh

[NLP-1] SciVerse: Unveiling the Knowledge Comprehension and Visual Reasoning of LMMs on Multi-modal Scientific Problems

【速读】：本文旨在解决大型多模态模型（Large Multi-modal Models, LMMs）在科学问题解决中的能力评估不足问题。为实现这一目标，作者提出了SciVerse，这是一个包含5,735个测试实例的多模态科学评估基准，涵盖五个不同版本，用于全面评估LMMs在三个关键维度上的表现：科学知识理解、多模态内容解析以及链式思维（Chain-of-Thought, CoT）推理。关键解决方案包括设计三种知识难度递增的问题版本（Knowledge-free、-lite、-rich），以测试模型的科学专业知识储备；同时通过Vision-rich和-Vision-only两种标注方式解析模型对文本与图表信息的综合处理能力。此外，提出了一种新的科学CoT评估策略，通过逐步分析模型输出中的知识与逻辑错误来严格评估其推理能力。这些方法系统性地揭示了LMMs在科学领域的专业局限，并为未来改进提供了新视角。

链接: https://arxiv.org/abs/2503.10627
作者: Ziyu Guo,Ray Zhang,Hao Chen,Jialin Gao,Dongzhi Jiang,Jiaze Wang,Pheng-Ann Heng
机构: The Chinese University of Hong Kong (香港中文大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Initially released in September 2024. Project page: this https URL

点击查看摘要

Abstract:The rapid advancement of Large Multi-modal Models (LMMs) has enabled their application in scientific problem-solving, yet their fine-grained capabilities remain under-explored. In this paper, we introduce SciVerse, a multi-modal scientific evaluation benchmark to thoroughly assess LMMs across 5,735 test instances in five distinct versions. We aim to investigate three key dimensions of LMMs: scientific knowledge comprehension, multi-modal content interpretation, and Chain-of-Thought (CoT) reasoning. To unveil whether LMMs possess sufficient scientific expertise, we first transform each problem into three versions containing different levels of knowledge required for solving, i.e., Knowledge-free, -lite, and -rich. Then, to explore how LMMs interpret multi-modal scientific content, we annotate another two versions, i.e., Vision-rich and -only, marking more question information from texts to diagrams. Comparing the results of different versions, SciVerse systematically examines the professional knowledge stock and visual perception skills of LMMs in scientific domains. In addition, to rigorously assess CoT reasoning, we propose a new scientific CoT evaluation strategy, conducting a step-wise assessment on knowledge and logical errors in model outputs. Our extensive evaluation of different LMMs on SciVerse reveals critical limitations in their scientific proficiency and provides new insights into future developments. Project page: this https URL
zh

[NLP-2] ransformers without Normalization CVPR2025

【速读】：该论文试图解决的问题是：传统观点认为归一化层（Normalization Layers）在现代神经网络中不可或缺，而该研究旨在证明Transformer架构即使没有归一化层，通过一种简单技术也能实现相同或更好的性能。
解决方案的关键在于引入了一种名为Dynamic Tanh (DyT) 的元素级操作，其定义为 ( \text{DyT}(x) = \tanh(\alpha x) )，作为归一化层的直接替代方案。DyT 的设计灵感来源于观察到Transformer中的层归一化（Layer Normalization）通常会产生类似tanh的S形输入-输出映射。通过使用DyT替换归一化层，实验表明无需大量超参数调整即可使无归一化Transformer的表现与有归一化的情况相当甚至更优，并且验证了DyT在多种任务设置下的有效性，包括分类、生成、有监督与自监督学习以及视觉与语言模型等。这些结果挑战了关于归一化层必要性的传统认知，并为理解其在深度网络中的作用提供了新视角。

链接: https://arxiv.org/abs/2503.10622
作者: Jiachen Zhu,Xinlei Chen,Kaiming He,Yann LeCun,Zhuang Liu
机构: New York University (纽约大学); Princeton University (普林斯顿大学)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注: CVPR 2025; Project page: this https URL

点击查看摘要

Abstract:Normalization layers are ubiquitous in modern neural networks and have long been considered essential. This work demonstrates that Transformers without normalization can achieve the same or better performance using a remarkably simple technique. We introduce Dynamic Tanh (DyT), an element-wise operation DyT( x ) = \tanh(\alpha x ) , as a drop-in replacement for normalization layers in Transformers. DyT is inspired by the observation that layer normalization in Transformers often produces tanh-like, S -shaped input-output mappings. By incorporating DyT, Transformers without normalization can match or exceed the performance of their normalized counterparts, mostly without hyperparameter tuning. We validate the effectiveness of Transformers with DyT across diverse settings, ranging from recognition to generation, supervised to self-supervised learning, and computer vision to language models. These findings challenge the conventional understanding that normalization layers are indispensable in modern neural networks, and offer new insights into their role in deep networks.
zh

[NLP-3] From TOWER to SPIRE: Adding the Speech Modality to a Text-Only LLM

【速读】：该论文旨在解决如何将大型语言模型（Large Language Models, LLMs）扩展到语音模态的问题，并探索在多语言环境下通过离散化语音输入实现跨模态集成的可行性。论文的关键在于通过语音离散化（speech discretization）和持续预训练（continued pre-training），将多语言LLMs（如TOWER模型）适配到语音任务中，同时保持其在翻译相关任务上的原有性能。这种方案的核心创新点在于将离散化后的语音输入作为一种额外的“翻译语言”进行处理，从而实现了语音转录与翻译功能的结合，验证了在LLM适应过程中集成离散化语音输入的可行性。论文开源了代码与模型以促进社区进一步研究。

链接: https://arxiv.org/abs/2503.10620
作者: Kshitij Ambilduke,Ben Peters,Sonal Sannigrahi,Anil Keshwani,Tsz Kin Lam,Bruno Martins,Marcely Zanon Boito,André F.T. Martins
机构: Paris-Saclay University; Instituto de Telecomunicações (葡萄牙电信研究所); Instituto Superior Técnico, Universidade de Lisboa (里斯本高等理工学院, 里斯本大学); Sapienza University of Rome (罗马一大); University of Edinburgh (爱丁堡大学); INESC-ID; NAVER LABS Europe; ELLIS Unit Lisbon; Unbabel
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large language models (LLMs) have shown remarkable performance and generalization capabilities across multiple languages and tasks, making them very attractive targets for multi-modality integration (e.g., images or speech). In this work, we extend an existing LLM to the speech modality via speech discretization and continued pre-training. In particular, we are interested in multilingual LLMs, such as TOWER, as their pre-training setting allows us to treat discretized speech input as an additional translation language. The resulting open-source model, SPIRE, is able to transcribe and translate English speech input while maintaining TOWER’s original performance on translation-related tasks, showcasing that discretized speech input integration as an additional language is feasible during LLM adaptation. We make our code and models available to the community.
zh

[NLP-4] Siege: Autonomous Multi-Turn Jailbreaking of Large Language Models with Tree Search ICLR2025

【速读】：该论文旨在研究大型语言模型（Large Language Model, LLM）在多轮对话中安全性的逐步退化问题，并提出了一种名为Siege的多轮对抗性框架。传统的单轮越狱攻击依赖于精心设计的单一提示，而Siege通过广度优先搜索的方式，在每一轮对话中扩展多个对抗性提示，这些提示利用了先前响应中的部分合规性漏洞。关键在于，Siege通过追踪这些逐步累积的安全策略泄漏，并将其重新注入后续查询中，揭示了小的妥协如何逐渐积累成完全被禁止的输出。这种方法不仅展示了GPT-3.5-turbo和GPT-4等模型在多轮对话中的安全性退化过程，还强调了开发稳健的多轮测试程序的重要性。

链接: https://arxiv.org/abs/2503.10619
作者: Andy Zhou
机构: Intology AI
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Cryptography and Security (cs.CR)
备注: Accepted to ICLR 2025 Trustworthy LLM

点击查看摘要

Abstract:We introduce Siege, a multi-turn adversarial framework that models the gradual erosion of Large Language Model (LLM) safety through a tree search perspective. Unlike single-turn jailbreaks that rely on one meticulously engineered prompt, Siege expands the conversation at each turn in a breadth-first fashion, branching out multiple adversarial prompts that exploit partial compliance from previous responses. By tracking these incremental policy leaks and re-injecting them into subsequent queries, Siege reveals how minor concessions can accumulate into fully disallowed outputs. Evaluations on the JailbreakBench dataset show that Siege achieves a 100% success rate on GPT-3.5-turbo and 97% on GPT-4 in a single multi-turn run, using fewer queries than baselines such as Crescendo or GOAT. This tree search methodology offers an in-depth view of how model safeguards degrade over successive dialogue turns, underscoring the urgency of robust multi-turn testing procedures for language models.
zh

[NLP-5] Compositional Subspace Representation Fine-tuning for Adaptive Large Language Models ICLR2025

【速读】：该论文试图解决大语言模型在适配多任务时因交叉技能干扰（cross-skill interference）导致某一技能提升而另一技能退化的问题。现有方法如LoRA通过权重层面施加正交性约束，但未能充分解决隐藏状态表示中的干扰问题。论文提出了一种名为组合子空间表示微调（Compositional Subspace Representation Fine-tuning, CS-ReFT）的新方法，其关键是学习多个正交子空间变换，每个子空间专注于特定技能，并通过轻量级路由模块组合这些子空间。通过在隐藏状态而非权重矩阵中隔离这些子空间编辑，CS-ReFT能够更有效地防止跨任务冲突。在AlpacaEval基准测试中，将CS-ReFT应用于Llama-2-7B模型实现了93.94%的胜率，显著优于GPT-3.5 Turbo（86.30%），同时仅需模型参数的0.0098%。

链接: https://arxiv.org/abs/2503.10617
作者: Andy Zhou
机构: Intology AI
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted to ICLR 2025 SCOPE

点击查看摘要

Abstract:Adapting large language models to multiple tasks can cause cross-skill interference, where improvements for one skill degrade another. While methods such as LoRA impose orthogonality constraints at the weight level, they do not fully address interference in hidden-state representations. We propose Compositional Subspace Representation Fine-tuning (CS-ReFT), a novel representation-based approach that learns multiple orthonormal subspace transformations, each specializing in a distinct skill, and composes them via a lightweight router. By isolating these subspace edits in the hidden state, rather than weight matrices, CS-ReFT prevents cross-task conflicts more effectively. On the AlpacaEval benchmark, applying CS-ReFT to Llama-2-7B achieves a 93.94% win rate, surpassing GPT-3.5 Turbo (86.30%) while requiring only 0.0098% of model parameters. These findings show that specialized representation edits, composed via a simple router, significantly enhance multi-task instruction following with minimal overhead.
zh

[NLP-6] ruthPrInt: Mitigating LVLM Object Hallucination Via Latent Truthful-Guided Pre-Intervention

【速读】：该论文旨在解决大型视觉语言模型（Large Vision-Language Models, LVLMs）中的生成式幻觉（Object Hallucination, OH）问题，这是当前可信度挑战中的一个主要难题。论文的关键在于探索LVLM内部状态是否可用作逐token的幻觉指示器，并提出了一种新的方法来缓解OH问题。研究发现，LVLM的内部状态能够高度特异性地表征逐token的幻觉行为，并且不同LVLM在共同潜在子空间中编码了幻觉的通用模式，这表明存在多种LVLM共享的“通用真实性方向”。基于这些发现，论文提出了Truthful-Guided Pre-Intervention (TruthPrInt)，该方法首先学习LVLM解码的真实方向，然后在解码过程中应用基于真实性的推理时干预。此外，还提出了ComnHallu以通过构建和对齐幻觉潜在子空间来增强跨LVLM和跨数据集的幻觉检测迁移能力。实验结果表明，TruthPrInt在多个流行LVLM和OH基准测试的域内及域外场景下显著优于现有最先进的方法。

链接: https://arxiv.org/abs/2503.10602
作者: Jinhao Duan,Fei Kong,Hao Cheng,James Diffenderfer,Bhavya Kailkhura,Lichao Sun,Xiaofeng Zhu,Xiaoshuang Shi,Kaidi Xu
机构: Drexel University (德雷克塞尔大学); University of Electronic Science and Technology of China (电子科技大学); Hong Kong University of Science and Technology (Guangzhou) (香港科技大学（广州）); LLNL (劳伦斯利弗莫尔国家实验室); Lehigh University (莱斯大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 15 pages, 9 figures, the first two authors contributed equally

点击查看摘要

Abstract:Object Hallucination (OH) has been acknowledged as one of the major trustworthy challenges in Large Vision-Language Models (LVLMs). Recent advancements in Large Language Models (LLMs) indicate that internal states, such as hidden states, encode the “overall truthfulness” of generated responses. However, it remains under-explored how internal states in LVLMs function and whether they could serve as “per-token” hallucination indicators, which is essential for mitigating OH. In this paper, we first conduct an in-depth exploration of LVLM internal states in relation to OH issues and discover that (1) LVLM internal states are high-specificity per-token indicators of hallucination behaviors. Moreover, (2) different LVLMs encode universal patterns of hallucinations in common latent subspaces, indicating that there exist “generic truthful directions” shared by various LVLMs. Based on these discoveries, we propose Truthful-Guided Pre-Intervention (TruthPrInt) that first learns the truthful direction of LVLM decoding and then applies truthful-guided inference-time intervention during LVLM decoding. We further propose ComnHallu to enhance both cross-LVLM and cross-data hallucination detection transferability by constructing and aligning hallucination latent subspaces. We evaluate TruthPrInt in extensive experimental settings, including in-domain and out-of-domain scenarios, over popular LVLMs and OH benchmarks. Experimental results indicate that TruthPrInt significantly outperforms state-of-the-art methods. Codes will be available at this https URL.
zh

[NLP-7] VisualWebInstruct: Scaling up Multimodal Instruction Data through Web Search

【速读】：该论文试图解决视觉语言模型（Vision-Language Models, VLMs）在推理聚焦任务上的性能瓶颈问题，主要由于缺乏高质量且多样化的训练数据。论文的关键解决方案是提出VisualWebInstruct，这是一种利用搜索引擎构建跨学科（如数学、物理、金融、化学等）多样化、高质量多模态推理数据集的新方法。通过从30,000个精心挑选的种子图像出发，结合Google图像搜索识别相关网站，并处理来自70多万个独特URL源的HTML内容，最终构建了一个包含约90万问答对的数据集，其中40%为视觉问答对，其余为文本问答对。这一方法有效提升了模型的推理能力，验证其在复杂多模态任务中的显著效果。

链接: https://arxiv.org/abs/2503.10582
作者: Yiming Jia,Jiachen Li,Xiang Yue,Bo Li,Ping Nie,Kai Zou,Wenhu Chen
机构: University of Waterloo (滑铁卢大学); University of Toronto (多伦多大学); UC Santa Barbara (加州大学圣塔芭芭拉分校); CMU (卡内基梅隆大学); NUS (新加坡国立大学); Netmind.ai; Independent (独立研究员)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Technical Report

点击查看摘要

Abstract:Vision-Language Models have made significant progress on many perception-focused tasks, however, their progress on reasoning-focused tasks seem to be limited due to the lack of high-quality and diverse training data. In this work, we aim to address the scarcity issue of reasoning-focused multimodal datasets. We propose VisualWebInstruct - a novel approach that leverages search engine to create a diverse, and high-quality dataset spanning multiple disciplines like math, physics, finance, chemistry, etc. Starting with meticulously selected 30,000 seed images, we employ Google Image search to identify websites containing similar images. We collect and process the HTMLs from over 700K unique URL sources. Through a pipeline of content extraction, filtering and synthesis, we build a dataset of approximately 900K question-answer pairs, with 40% being visual QA pairs and the rest as text QA pairs. Models fine-tuned on VisualWebInstruct demonstrate significant performance gains: (1) training from Llava-OV-mid shows 10-20% absolute point gains across benchmarks, (2) training from MAmmoTH-VL shows 5% absoluate gain. Our best model MAmmoTH-VL2 shows state-of-the-art performance within the 10B parameter class on MMMU-Pro-std (40.7%), MathVerse (42.6%), and DynaMath (55.7%). These remarkable results highlight the effectiveness of our dataset in enhancing VLMs’ reasoning capabilities for complex multimodal tasks.
zh

[NLP-8] Language Models Graph Searching and Supervision Adulteration: When More Supervision is Less and How to Make More More ICLR2025

【速读】：该论文关注于路径-星任务（path-star task），这是一个在图上搜索的极简示例。任务的目标是在一个从起始节点 ( s ) 发散出 ( D ) 条臂的星形图 ( G ) 中，确定目标节点 ( t ) 所在的臂。研究发现，解码器-only 的语言模型（Decoder-only LM）由于学习到的捷径（shortcut）无法有效解决此任务，其表现仅略高于随机猜测 ( 1/D ) 的概率，原因是这些模型吸收了过多的训练监督信号。论文的关键解决方案在于揭示了这种失效现象是由过度监督引起的，并提出了一系列方法，证明通过适当的调整，解码器-only 的语言模型可以成功完成该任务。研究还指出，任务的极简性质导致其难度增加，因为它阻碍了任务的分解。最终，论文的解决方案不仅解决了该特定问题，还为基于下一令牌预测（next-token prediction）训练的语言模型提供了对捷径学习（shortcut learning）及其潜在病理影响的深刻洞察。

链接: https://arxiv.org/abs/2503.10542
作者: Arvid Frydenlund
机构: University of Toronto (多伦多大学); Vector Institute (向量研究所)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: A reduced version of this work has been accepted to the Workshop on Spurious Correlation and Shortcut Learning: Foundations and Solutions (SCSL) at ICLR 2025. Full version under review

点击查看摘要

Abstract:This work concerns the path-star task, a minimal example of searching over a graph. The graph, G , is star-shaped with D arms radiating from a start node, s . A language model (LM) is given G , s , and a target node t , which ends one of the arms and is tasked with generating the arm containing t . The minimal nature of this task means only a single choice needs to be made: which of the D arms contains t ? Decoder-only LMs fail to solve this elementary task above 1/D chance due to a learned shortcut that absorbs training supervision. We show how this pathology is caused by excess supervision and we present a series of solutions demonstrating that the task is solvable via decoder-only LMs. We find that the task’s minimal nature causes its difficulty, as it prevents task decomposition. Our solutions provide insight into the pathology and its implications for LMs trained via next-token prediction. Comments: A reduced version of this work has been accepted to the Workshop on Spurious Correlation and Shortcut Learning: Foundations and Solutions (SCSL) at ICLR 2025. Full version under review Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL) ACMclasses: I.2.7; I.2.8; I.5.0 Cite as: arXiv:2503.10542 [cs.LG] (or arXiv:2503.10542v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2503.10542 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh

[NLP-9] he Impact of Item-Writing Flaws on Difficulty and Discrimination in Item Response Theory

【速读】：该论文试图解决传统验证方法在估计项目难度和区分度时依赖资源密集型试点测试的问题，并探索基于文本特征的项目撰写缺陷（Item-Writing Flaw, IWF）打分准则与项目反应理论（Item Response Theory, IRT）参数之间的关系。论文的关键在于开发了一种自动化方法，使用包含19项标准的IWF量规标注大量多选题（约7000道），并通过分析揭示了IWF数量与IRT难度和区分度参数之间存在显著统计关联，特别是在生命科学和物理科学领域。此外，研究进一步探讨了不同IWF标准对项目质量影响的程度差异。尽管如此，论文指出，IWF虽可用于预测IRT参数（尤其是筛选低难度的多选题），但无法替代传统的数据驱动验证方法。因此，其解决方案的关键在于结合自动化IWF评估与IRT分析，以改进教育评估中的试题质量验证。

链接: https://arxiv.org/abs/2503.10533
作者: Robin Schmucker,Steven Moore
机构: Machine Learning Department (机器学习系), Carnegie Mellon University (卡内基梅隆大学); Human-Computer Interaction (人机交互), Carnegie Mellon University (卡内基梅隆大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注:

点击查看摘要

Abstract:High-quality test items are essential for educational assessments, particularly within Item Response Theory (IRT). Traditional validation methods rely on resource-intensive pilot testing to estimate item difficulty and discrimination. More recently, Item-Writing Flaw (IWF) rubrics emerged as a domain-general approach for evaluating test items based on textual features. However, their relationship to IRT parameters remains underexplored. To address this gap, we conducted a study involving over 7,000 multiple-choice questions across various STEM subjects (e.g., math and biology). Using an automated approach, we annotated each question with a 19-criteria IWF rubric and studied relationships to data-driven IRT parameters. Our analysis revealed statistically significant links between the number of IWFs and IRT difficulty and discrimination parameters, particularly in life and physical science domains. We further observed how specific IWF criteria can impact item quality more and less severely (e.g., negative wording vs. implausible distractors). Overall, while IWFs are useful for predicting IRT parameters–particularly for screening low-difficulty MCQs–they cannot replace traditional data-driven validation methods. Our findings highlight the need for further research on domain-general evaluation rubrics and algorithms that understand domain-specific content for robust item validation.
zh

[NLP-10] Probing LLM s for Multilingual Discourse Generalization Through a Unified Label Set

【速读】：该论文试图解决的问题是如何评估大型语言模型（Large Language Models, LLMs）是否能够捕获可跨语言和跨框架泛化的语篇知识。论文的关键解决方案包括：(1) 构建一个统一的语篇关系标签集，以促进跨语言和跨框架的语篇分析；(2) 设计实验来探测LLMs，验证其是否编码了可泛化的语篇抽象表示。通过多语言语篇关系分类任务，研究发现，尤其是那些使用多语言训练语料库的LLMs，能够在语言和框架之间泛化语篇信息，并且这种泛化能力在中间层表现得最为显著。此外，错误分析揭示了具有挑战性的语篇关系类别。

链接: https://arxiv.org/abs/2503.10515
作者: Florian Eichin,Yang Janet Liu,Barbara Plank,Michael A. Hedderich
机构: MaiNLP (MaiNLP); Center for Information and Language Processing, LMU Munich (慕尼黑大学信息与语言处理中心); Munich Center for Machine Learning (MCML) (慕尼黑机器学习中心)
类目: Computation and Language (cs.CL)
备注: 18 pages, 7 figures, 3 tables, code: this https URL

点击查看摘要

Abstract:Discourse understanding is essential for many NLP tasks, yet most existing work remains constrained by framework-dependent discourse representations. This work investigates whether large language models (LLMs) capture discourse knowledge that generalizes across languages and frameworks. We address this question along two dimensions: (1) developing a unified discourse relation label set to facilitate cross-lingual and cross-framework discourse analysis, and (2) probing LLMs to assess whether they encode generalizable discourse abstractions. Using multilingual discourse relation classification as a testbed, we examine a comprehensive set of 23 LLMs of varying sizes and multilingual capabilities. Our results show that LLMs, especially those with multilingual training corpora, can generalize discourse information across languages and frameworks. Further layer-wise analyses reveal that language generalization at the discourse level is most salient in the intermediate layers. Lastly, our error analysis provides an account of challenging relation classes.
zh

[NLP-11] MMLU-ProX: A Multilingual Benchmark for Advanced Large Language Model Evaluation

【速读】：该论文旨在解决传统基准测试在多语言和文化多样性背景下难以有效评估日益复杂的语言模型的问题。为填补这一空白，论文引入了MMLU-ProX，这是一个涵盖13种类型多样的语言、每种语言包含约11,829个问题的综合性多语言基准测试。解决方案的关键在于其采用的半自动翻译流程：通过最先进的大型语言模型（LLMs）生成翻译后，由专家注释员严格评估以确保概念准确性、术语一致性以及文化相关性。此外，论文还使用5-shot链式思维（CoT）和零样本提示策略全面评估了25种最先进的LLMs，并分析了它们在语言和文化边界上的表现。实验结果显示，从高资源语言到低资源语言，模型性能呈现一致下降趋势，例如，最佳模型在英语上的准确率超过70%，而在斯瓦希里语等语言中降至约40%，这凸显了尽管近期取得进展但仍存在的多语言能力差距。MMLU-ProX是一个持续进行的项目，未来将通过纳入更多语言和评估更多语言模型来提供更全面的能力评估。

链接: https://arxiv.org/abs/2503.10497
作者: Weihao Xuan,Rui Yang,Heli Qi,Qingcheng Zeng,Yunze Xiao,Yun Xing,Junjue Wang,Huitao Li,Xin Li,Kunyu Yu,Nan Liu,Qingyu Chen,Douglas Teodoro,Edison Marrese-Taylor,Shijian Lu,Yusuke Iwasawa,Yutaka Matsuo,Irene Li
机构: The University of Tokyo (东京大学); Duke-NUS Medical School (杜克-新加坡国立大学医学学院); Waseda University (早稻田大学); Northwestern University (西北大学); Carnegie Mellon University (卡内基梅隆大学); Nanyang Technological University (南洋理工大学); Yale University (耶鲁大学); University of Geneva (日内瓦大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Traditional benchmarks struggle to evaluate increasingly sophisticated language models in multilingual and culturally diverse contexts. To address this gap, we introduce MMLU-ProX, a comprehensive multilingual benchmark covering 13 typologically diverse languages with approximately 11,829 questions per language. Building on the challenging reasoning-focused design of MMLU-Pro, our framework employs a semi-automatic translation process: translations generated by state-of-the-art large language models (LLMs) are rigorously evaluated by expert annotators to ensure conceptual accuracy, terminological consistency, and cultural relevance. We comprehensively evaluate 25 state-of-the-art LLMs using 5-shot chain-of-thought (CoT) and zero-shot prompting strategies, analyzing their performance across linguistic and cultural boundaries. Our experiments reveal consistent performance degradation from high-resource languages to lower-resource ones, with the best models achieving over 70% accuracy on English but dropping to around 40% for languages like Swahili, highlighting persistent gaps in multilingual capabilities despite recent advances. MMLU-ProX is an ongoing project; we are expanding our benchmark by incorporating additional languages and evaluating more language models to provide a more comprehensive assessment of multilingual capabilities.
zh

[NLP-12] Source-primed Multi-turn Conversation Helps Large Language Models Translate Documents

【速读】：该论文旨在解决文档级机器翻译中的遗漏错误（omission errors）等挑战，提出了一种利用多轮对话方式处理上下文信息的简单方法。其关键在于将文档分解为段落，并通过迭代翻译这些段落同时保留之前的翻译上下文（key-value缓存，KV cache），从而确保译文连贯性，且无需额外训练模型。此外，论文进一步引入“源端引导”（source-primed）策略，在多轮翻译前提供完整的源文档，以优化翻译效果。实验表明，该多轮方法在代表性大型语言模型（LLMs）中优于单次整体翻译或独立段落翻译，为基于LLMs的文档级翻译建立了坚实的基线。

链接: https://arxiv.org/abs/2503.10494
作者: Hanxu Hu,Jannis Vamvas,Rico Sennrich
机构: University of Zurich (苏黎世大学)
类目: Computation and Language (cs.CL)
备注: 9 pages, 2 figures

点击查看摘要

Abstract:LLMs have paved the way for truly simple document-level machine translation, but challenges such as omission errors remain. In this paper, we study a simple method for handling document-level machine translation, by leveraging previous contexts in a multi-turn conversational manner. Specifically, by decomposing documents into segments and iteratively translating them while maintaining previous turns, this method ensures coherent translations without additional training, and can fully re-use the KV cache of previous turns thus minimizing computational overhead. We further propose a `source-primed’ method that first provides the whole source document before multi-turn translation. We empirically show this multi-turn method outperforms both translating entire documents in a single turn and translating each segment independently according to multiple automatic metrics in representative LLMs, establishing a strong baseline for document-level translation using LLMs.
zh

[NLP-13] LLM s in Disease Diagnosis: A Comparative Study of DeepSeek -R1 and O3 Mini Across Chronic Health Conditions

【速读】：该论文旨在评估基于大型语言模型（LLMs）的两种诊断工具——DeepSeek R1 和 O3 Mini 在医疗诊断中的性能，重点研究其在疾病分类和临床决策支持方面的有效性。论文通过结构化症状与诊断数据集，从疾病级别和类别级别的预测准确性以及置信分数的可靠性两方面进行评估。解决方案的关键在于对比分析这两种模型在不同疾病领域的表现差异，如 DeepSeek R1 在精神健康、神经系统疾病和肿瘤学领域达到 100% 的准确性，而 O3 Mini 在自身免疫疾病分类中表现优异。此外，论文还探讨了呼吸系统疾病分类上的不足，以及置信分数的可靠性差异，并强调了模型在实际应用中的伦理考量，包括偏见、可解释性和数据隐私等问题。这为未来基于 LLM 的医疗诊断系统的改进提供了重要参考。

链接: https://arxiv.org/abs/2503.10486
作者: Gaurav Kumar Gupta,Pranal Pande
机构: Youngstown State University (扬斯敦州立大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 12 pages, 3 figures

点击查看摘要

Abstract:Large Language Models (LLMs) are revolutionizing medical diagnostics by enhancing both disease classification and clinical decision-making. In this study, we evaluate the performance of two LLM- based diagnostic tools, DeepSeek R1 and O3 Mini, using a structured dataset of symptoms and diagnoses. We assessed their predictive accuracy at both the disease and category levels, as well as the reliability of their confidence scores. DeepSeek R1 achieved a disease-level accuracy of 76% and an overall accuracy of 82%, outperforming O3 Mini, which attained 72% and 75% respectively. Notably, DeepSeek R1 demonstrated exceptional performance in Mental Health, Neurological Disorders, and Oncology, where it reached 100% accuracy, while O3 Mini excelled in Autoimmune Disease classification with 100% accuracy. Both models, however, struggled with Respiratory Disease classification, recording accuracies of only 40% for DeepSeek R1 and 20% for O3 Mini. Additionally, the analysis of confidence scores revealed that DeepSeek R1 provided high-confidence predictions in 92% of cases, compared to 68% for O3 Mini. Ethical considerations regarding bias, model interpretability, and data privacy are also discussed to ensure the responsible integration of LLMs into clinical practice. Overall, our findings offer valuable insights into the strengths and limitations of LLM-based diagnostic systems and provide a roadmap for future enhancements in AI-driven healthcare.
zh

[NLP-14] World Modeling Makes a Better Planner: Dual Preference Optimization for Embodied Task Planning

【速读】：该论文旨在解决大型视觉语言模型（Large Vision-Language Models, LVLMs）在执行具身任务规划时面临的依赖约束和效率问题。现有方法主要关注单一的动作选择优化或利用世界模型进行推理，而忽视了通过学习建模环境动态来提升规划能力的重要性。为了解决这些问题，论文提出了一种名为双偏好优化（Dual Preference Optimization, D²PO）的新框架，其关键是通过偏好学习同时优化状态预测和动作选择，使LVLMs能够更好地理解环境动力学以增强规划能力。此外，为了实现无人工标注的数据自动收集，论文引入了一种树搜索机制，用于通过试错法进行广泛的探索。实验结果表明，基于D²PO的方法显著优于现有方法及GPT-4o，在应用于Qwen2-VL (7B)、LLaVA-1.6 (7B) 和 LLaMA-3.2 (11B) 时，实现了更高的任务成功率以及更高效的执行路径。

链接: https://arxiv.org/abs/2503.10480
作者: Siyin Wang,Zhaoye Fei,Qinyuan Cheng,Shiduo Zhang,Panpan Cai,Jinlan Fu,Xipeng Qiu
机构: Fudan University (复旦大学); Shanghai Innovation Institute (上海创新研究院); National University of Singapore (新加坡国立大学); Shanghai Jiao Tong University (上海交通大学)
类目: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注:

点击查看摘要

Abstract:Recent advances in large vision-language models (LVLMs) have shown promise for embodied task planning, yet they struggle with fundamental challenges like dependency constraints and efficiency. Existing approaches either solely optimize action selection or leverage world models during inference, overlooking the benefits of learning to model the world as a way to enhance planning capabilities. We propose Dual Preference Optimization (D ^2 PO), a new learning framework that jointly optimizes state prediction and action selection through preference learning, enabling LVLMs to understand environment dynamics for better planning. To automatically collect trajectories and stepwise preference data without human annotation, we introduce a tree search mechanism for extensive exploration via trial-and-error. Extensive experiments on VoTa-Bench demonstrate that our D ^2 PO-based method significantly outperforms existing methods and GPT-4o when applied to Qwen2-VL (7B), LLaVA-1.6 (7B), and LLaMA-3.2 (11B), achieving superior task success rates with more efficient execution paths.
zh

[NLP-15] Statistical Analysis of Sentence Structures through ASCII Lexical Alignment and PCA

【速读】：该论文试图解决自然语言处理（NLP）中利用句法工具（如词性标注 POS tagging）理解句子结构及其在多样化语料库中分布时所面临的复杂性和挑战。论文提出了一种新颖的统计方法，通过使用美国信息交换标准代码（ASCII codes）表示来自不同来源的11个文本语料库的文字及其词类对齐情况，并借助主成分分析（PCA）压缩版本进行分析，采用直方图及正态性检验（如Shapiro-Wilk和Anderson-Darling测试）评估结果。关键在于通过ASCII编码简化文本处理流程，而非取代句法工具，而是作为资源高效的方法补充其功能，用于评估文本结构平衡性。

链接: https://arxiv.org/abs/2503.10470
作者: Abhijeet Sahdev
机构: New Jersey Institute of Technology
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:While utilizing syntactic tools such as parts-of-speech (POS) tagging has helped us understand sentence structures and their distribution across diverse corpora, it is quite complex and poses a challenge in natural language processing (NLP). This study focuses on understanding sentence structure balance - usages of nouns, verbs, determiners, etc - harmoniously without relying on such tools. It proposes a novel statistical method that uses American Standard Code for Information Interchange (ASCII) codes to represent text of 11 text corpora from various sources and their lexical category alignment after using their compressed versions through PCA, and analyzes the results through histograms and normality tests such as Shapiro-Wilk and Anderson-Darling Tests. By focusing on ASCII codes, this approach simplifies text processing, although not replacing any syntactic tools but complementing them by offering it as a resource-efficient tool for assessing text balance. The story generated by Grok shows near normality indicating balanced sentence structures in LLM outputs, whereas 4 out of the remaining 10 pass the normality tests. Further research could explore potential applications in text quality evaluation and style analysis with syntactic integration for more broader tasks.
zh

[NLP-16] Light-R1: Curriculum SFT DPO and RL for Long COT from Scratch and Beyond

【速读】：该论文旨在解决如何从零开始训练具备长链思维（long Chain-of-Thought, COT）能力的大规模生成式模型，并验证其在数学推理任务中的性能以及跨领域的泛化能力。论文的关键创新在于提出了一种两阶段的监督微调（SFT）结合半按策略的拒绝采样优化（DPO）的课程训练方法，用于从初始不具备长链思维能力的基础模型（如Qwen2.5-32B-Instruct）出发，训练出性能优越的Light-R1系列模型。此外，通过引入强化学习（RL），特别是广义策略优化（GRPO），进一步提升了模型的推理能力。这些方法不仅实现了在数学任务上的最新技术水平（SOTA），还展示了数据集构建对于提升模型性能的重要性，并验证了长链思维模型从头训练的有效性。

链接: https://arxiv.org/abs/2503.10460
作者: Liang Wen,Yunke Cai,Fenrui Xiao,Xin He,Qi An,Zhenyu Duan,Yimin Du,Junchen Liu,Lifu Tang,Xiaowei Lv,Haosheng Zou,Yongchao Deng,Shousheng Jia,Xiangzheng Zhang
机构: Qiyuan Tech (启远科技); Renmin University (中国人民大学)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: all release at this https URL

点击查看摘要

Abstract:This paper presents our work on the Light-R1 series, with models, data, and code all released. We first focus on training long COT models from scratch, specifically starting from models initially lacking long COT capabilities. Using a curriculum training recipe consisting of two-stage SFT and semi-on-policy DPO, we train our model Light-R1-32B from Qwen2.5-32B-Instruct, resulting in superior math performance compared to DeepSeek-R1-Distill-Qwen-32B. Despite being trained exclusively on math data, Light-R1-32B shows strong generalization across other domains. In the subsequent phase of this work, we highlight the significant benefit of the 3k dataset constructed for the second SFT stage on enhancing other models. By fine-tuning DeepSeek-R1-Distilled models using this dataset, we obtain new SOTA models in 7B and 14B, while the 32B model, Light-R1-32B-DS performed comparably to QwQ-32B and DeepSeek-R1. Furthermore, we extend our work by applying reinforcement learning, specifically GRPO, on long-COT models to further improve reasoning performance. We successfully train our final Light-R1-14B-DS with RL, achieving SOTA performance among 14B parameter models in math. With AIME24 25 scores of 74.0 and 60.2 respectively, Light-R1-14B-DS surpasses even many 32B models and DeepSeek-R1-Distill-Llama-70B. Its RL training also exhibits well expected behavior, showing simultaneous increase in response length and reward score. The Light-R1 series of work validates training long-COT models from scratch, showcases the art in SFT data and releases SOTA models from RL. Comments: all release at this https URL Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG) Cite as: arXiv:2503.10460 [cs.CL] (or arXiv:2503.10460v1 [cs.CL] for this version) https://doi.org/10.48550/arXiv.2503.10460 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Submission history From: Haosheng Zou [view email] [v1] Thu, 13 Mar 2025 15:29:22 UTC (811 KB)
zh

[NLP-17] DynaCode: A Dynamic Complexity-Aware Code Benchmark for Evaluating Large Language Models in Code Generation

【速读】：该论文试图解决现有代码基准测试静态且易受记忆效应影响的问题，即现有的代码数据集由固定的预定义问题组成，这使得大型语言模型（LLMs）在训练过程中容易记住特定的测试用例而非泛化到新问题，从而导致数据污染和评价结果的不可靠性。为了解决这些问题，论文提出了DynaCode，这是一种动态且复杂度感知的基准测试方法，克服了静态数据集的局限性。其关键是通过引入一种复杂度感知的度量标准来系统性评估LLMs，该标准同时考虑了代码复杂度和调用图结构，并实现了大规模的多样性生成，在四个不同的代码复杂度级别和16种调用图类型中生成多达1.89亿个独特的嵌套代码问题。

链接: https://arxiv.org/abs/2503.10452
作者: Wenhao Hu,Jinhao Duan,Chunchen Wei,Li Zhang,Yue Zhang,Kaidi Xu
机构: University of Electronic Science and Technology of China (电子科技大学); Drexel University (德雷塞尔大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 16 pages, 11 figures

点击查看摘要

Abstract:The rapid advancement of large language models (LLMs) has significantly improved their performance in code generation tasks. However, existing code benchmarks remain static, consisting of fixed datasets with predefined problems. This makes them vulnerable to memorization during training, where LLMs recall specific test cases instead of generalizing to new problems, leading to data contamination and unreliable evaluation results. To address these issues, we introduce DynaCode, a dynamic, complexity-aware benchmark that overcomes the limitations of static datasets. DynaCode evaluates LLMs systematically using a complexity-aware metric, incorporating both code complexity and call-graph structures. DynaCode achieves large-scale diversity, generating up to 189 million unique nested code problems across four distinct levels of code complexity, referred to as units, and 16 types of call graphs. Results on 12 latest LLMs show an average performance drop of 16.8% to 45.7% compared to MBPP+, a static code generation benchmark, with performance progressively decreasing as complexity increases. This demonstrates DynaCode’s ability to effectively differentiate LLMs. Additionally, by leveraging call graphs, we gain insights into LLM behavior, particularly their preference for handling subfunction interactions within nested code.
zh

[NLP-18] BeamLLM : Vision-Empowered mmWave Beam Prediction with Large Language Models

【速读】：该论文旨在解决毫米波（mmWave）通信系统中高训练开销和低预测延迟的问题。为应对这些挑战，论文提出了一种名为BeamLLM的视觉辅助毫米波波束预测框架。其关键在于结合计算机视觉（Computer Vision, CV）与大型语言模型（Large Language Models, LLMs）的跨模态推理能力，通过重新编程技术从RGB图像中提取用户设备（User Equipment, UE）的位置特征，并将视觉-时间特征映射到LLMs的语义空间，从而实现高效的波束预测。实验结果显示，在标准预测任务中，BeamLLM达到了61.01%的Top-1准确率和97.39%的Top-3准确率，显著优于传统深度学习模型；在少量样本预测场景下，性能下降幅度分别仅为12.56%（Top-1）和5.55%（Top-3）。

链接: https://arxiv.org/abs/2503.10432
作者: Can Zheng,Jiguang He,Guofa Cai,Zitong Yu,Chung G. Kang
机构: School of Electrical Engineering, Korea University, Seoul, Republic of Korea (韩国建国大学电气工程学院); School of Computing and Information Technology, Great Bay University, Dongguan 523000, China (大湾区大学计算与信息技术学院); School of Information Engineering, Guangdong University of Technology, Guangzhou, China (广东工业大学信息工程学院)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注: 6 pages, 7 figures, conference

点击查看摘要

Abstract:In this paper, we propose BeamLLM, a vision-aided millimeter-wave (mmWave) beam prediction framework leveraging large language models (LLMs) to address the challenges of high training overhead and latency in mmWave communication systems. By combining computer vision (CV) with LLMs’ cross-modal reasoning capabilities, the framework extracts user equipment (UE) positional features from RGB images and aligns visual-temporal features with LLMs’ semantic space through reprogramming techniques. Evaluated on a realistic vehicle-to-infrastructure (V2I) scenario, the proposed method achieves 61.01% top-1 accuracy and 97.39% top-3 accuracy in standard prediction tasks, significantly outperforming traditional deep learning models. In few-shot prediction scenarios, the performance degradation is limited to 12.56% (top-1) and 5.55% (top-3) from time sample 1 to 10, demonstrating superior prediction capability.
zh

[NLP-19] VisTai: Benchmarking Vision-Language Models for Traditional Chinese in Taiwan

【速读】：该论文旨在解决现有视觉语言模型（Visual Language Model, VLM）评估基准在处理繁体中文时存在的不足问题。目前大多数评估基准主要针对英语或简体中文设计，忽视了繁体中文特有的语言与文化特性，特别是在台湾和香港等地区广泛使用的语言形式。为填补这一空白，论文提出了一套全面的繁体中文VLM评估基准，包含两个互补组件：VisTai-MCQ（手动整理的多选题考试题库，涵盖21个学术领域）用于测试模型的广博知识与推理能力；以及VisTai-Dialogue（开放对话基准，包含131组图像-问题对），用于评估模型在自由形式对话生成中的表现，并特别关注台湾文化语境下的应用。解决方案的关键在于构建能够有效反映繁体中文特性和区域文化特色的多样化评估任务，从而更准确地衡量和比较不同VLM的表现。

链接: https://arxiv.org/abs/2503.10427
作者: Zhi Rui Tam,Ya-Ting Pai,Yen-Wei Lee
机构: University of Illinois Urbana-Champaign (伊利诺伊大学厄巴纳-香槟分校)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:In this paper, we propose a comprehensive evaluation benchmark for Visual Language Models (VLM) in Traditional Chinese. Our evaluation suite, the first of its kind, contains two complementary components: (1) VisTai-MCQ, a collection of manually curated exam multi-choice questions from 21 academic subjects designed to test the broad knowledge and reasoning capabilities of VLMs; and (2) VisTai-Dialogue, an open dialogue benchmark comprising 131 image-question pairs manually created to evaluate VLMs’ ability in free-form dialogue generation within Taiwanese cultural contexts. These benchmarks address a critical gap in the evaluation landscape, where existing benchmarks predominantly focus on English or Simplified Chinese, neglecting the unique linguistic and cultural aspects of Traditional Chinese used in regions like Taiwan and Hong Kong. Our analysis reveals significant performance differences across various VLMs and highlights specific challenges in processing Traditional Chinese visual content.
zh

[NLP-20] Understanding the Logical Capabilities of Large Language Models via Out-of-Context Representation Learning

【速读】：该论文旨在研究大型语言模型（Large Language Models, LLM）在二元关系（binary relations）上的能力，特别是关注等价关系、不等关系和包含关系及其满足的性质（如自反性/非自反性、对称性/非对称性、传递性以及逻辑复杂度），并探索这些能力如何适用于构建复杂推理基准。论文提出了一种替代基于上下文学习（in-context learning）的方法——上下文外表示学习（out-of-context representation learning），其关键是仅训练新引入标记的表示，而非依赖外部信息或上下文示例。这种方法能够减轻模型中已存在的语言偏见，并避免对上下文信息的过度依赖，从而提供一种更优的评估LLMs在逻辑任务（作为复杂推理基准的基本构件）上能力的方式。

链接: https://arxiv.org/abs/2503.10408
作者: Jonathan Shaki,Emanuele La Malfa,Michael Wooldridge,Sarit Kraus
机构: Bar-Ilan University (巴伊兰大学); University of Oxford (牛津大学)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:We study the capabilities of Large Language Models (LLM) on binary relations, a ubiquitous concept in math employed in most reasoning, math and logic benchmarks. This work focuses on equality, inequality, and inclusion, along with the properties they satisfy, such as ir/reflexivity, a/symmetry, transitivity, and logical complexity (e.g., number of reasoning ``hops’'). We propose an alternative to in-context learning that trains only the representations of newly introduced tokens, namely out-of-context representation learning. This method mitigates linguistic biases already present in a model and, differently from in-context learning, does not rely on external information or illustrations. We argue out-of-context representation learning as a better alternative to in-context learning and fine-tuning to evaluate the capabilities of LLMs on logic tasks that are the building blocks of more complex reasoning benchmarks.
zh

[NLP-21] G-Boost: Boosting Private SLMs with General LLM s

【速读】：该论文试图解决在有限计算资源下，私有小语言模型（Small Language Models, SLMs）效果受限的问题。解决方案的关键在于提出G-Boost框架，通过在过程奖励指导下，让私有SLM与可负担推理成本的一般大型语言模型（Large Language Models, LLMs）进行自适应协同推理，从而显著提升私有SLM的性能。

链接: https://arxiv.org/abs/2503.10367
作者: Yijiang Fan,Yuren Mao,Longbin Lai,Ying Zhang,Zhengping Qian,Yunjun Gao
机构: Zhejiang University (浙江大学); Tongyi Lab Alibaba Group (通义实验室阿里巴巴集团); Zhejiang Gongshang University (浙江工商大学); Alibaba Cloud (阿里云)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Due to the limited computational resources, most Large Language Models (LLMs) developers can only fine-tune Small Language Models (SLMs) on their own data. These private SLMs typically have limited effectiveness. To boost the performance of private SLMs, this paper proposes to ask general LLMs for help. The general LLMs can be APIs or larger LLMs whose inference cost the developers can afford. Specifically, we propose the G-Boost framework where a private SLM adaptively performs collaborative inference with a general LLM under the guide of process reward. Experiments demonstrate that our framework can significantly boost the performance of private SLMs.
zh

[NLP-22] Do I look like a cat.n.01 to you? A Taxonomy Image Generation Benchmark

【速读】：该论文试图解决的问题是如何利用文本到图像（Text-to-Image, T2I）模型在零样本（zero-shot）设置下生成用于分类学概念的图像。尽管基于文本的方法在分类学扩展方面已较为成熟，但视觉维度的潜力尚未被充分探索。为了解决这一问题，论文的关键在于提出了一套全面的分类学图像生成基准（Taxonomy Image Generation Benchmark），用于评估模型理解分类学概念并生成相关高质量图像的能力。该基准涵盖了来自WordNet的常识性和随机采样的概念，以及由大型语言模型（LLM）生成的预测，并引入了9个新颖的分类学相关T2I度量和人类反馈进行评估。此外，论文创新性地采用了GPT-4反馈的成对评估方法。实验结果表明，模型排名在分类学任务中与标准T2I任务存在显著差异，Playground-v2和FLUX模型在多项指标和子集上表现最优，而基于检索的方法表现较差。这些发现突显了自动化整理结构化数据资源的潜力。

链接: https://arxiv.org/abs/2503.10357
作者: Viktor Moskvoretskii,Alina Lobanova,Ekaterina Neminova,Chris Biemann,Alexander Panchenko,Irina Nikishina
机构: 未知
类目: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注: Labeled data and generated image Wordnet are published at this https URL

点击查看摘要

Abstract:This paper explores the feasibility of using text-to-image models in a zero-shot setup to generate images for taxonomy concepts. While text-based methods for taxonomy enrichment are well-established, the potential of the visual dimension remains unexplored. To address this, we propose a comprehensive benchmark for Taxonomy Image Generation that assesses models’ abilities to understand taxonomy concepts and generate relevant, high-quality images. The benchmark includes common-sense and randomly sampled WordNet concepts, alongside the LLM generated predictions. The 12 models are evaluated using 9 novel taxonomy-related text-to-image metrics and human feedback. Moreover, we pioneer the use of pairwise evaluation with GPT-4 feedback for image generation. Experimental results show that the ranking of models differs significantly from standard T2I tasks. Playground-v2 and FLUX consistently outperform across metrics and subsets and the retrieval-based approach performs poorly. These findings highlight the potential for automating the curation of structured data resources.
zh

[NLP-23] A Hybrid Architecture with Efficient Fine Tuning for Abstractive Patent Document Summarization

【速读】：该论文旨在解决专利文本自动摘要生成的问题，特别是应对专利文档因技术性和法律性复杂以及长度较长而导致的传统文本摘要模型在提取关键信息方面的挑战。论文的关键解决方案在于提出了一种结合抽取式和生成式（Abstractive）文本摘要方法的混合框架：首先利用基于LexRank图算法从输入文本中提取重要句子，然后采用经过Low-Ranking Adaptation (LoRA) 微调的双向自回归Transformer (BART) 模型生成抽象摘要，并结合系统化的测试与评估策略。此外，作者还引入了元学习技术以实现生成式组件在多领域专利数据上的领域泛化 (Domain Generalization, DG)。

链接: https://arxiv.org/abs/2503.10354
作者: Nevidu Jayatilleke,Ruvan Weerasinghe
机构: 未知
类目: Computation and Language (cs.CL)
备注: Accepted Paper in the 8th International Research Conference on Smart Computing and Systems Engineering, University of Kelaniya, Sri Lanka. (Pending Publication)

点击查看摘要

Abstract:Automatic patent summarization approaches that help in the patent analysis and comprehension procedure are in high demand due to the colossal growth of innovations. The development of natural language processing (NLP), text mining, and deep learning has notably amplified the efficacy of text summarization models for abundant types of documents. Summarizing patent text remains a pertinent challenge due to the labyrinthine writing style of these documents, which includes technical and legal intricacies. Additionally, these patent document contents are considerably lengthier than archetypal documents, which intricates the process of extracting pertinent information for summarization. Embodying extractive and abstractive text summarization methodologies into a hybrid framework, this study proposes a system for efficiently creating abstractive summaries of patent records. The procedure involves leveraging the LexRank graph-based algorithm to retrieve the important sentences from input parent texts, then utilizing a Bidirectional Auto-Regressive Transformer (BART) model that has been fine-tuned using Low-Ranking Adaptation (LoRA) for producing text summaries. This is accompanied by methodical testing and evaluation strategies. Furthermore, the author employed certain meta-learning techniques to achieve Domain Generalization (DG) of the abstractive component across multiple patent fields.
zh

[NLP-24] New Trends for Modern Machine Translation with Large Reasoning Models

【速读】：该论文试图解决传统机器翻译（Machine Translation, MT）在处理复杂语境、文化适应性和鲁棒性方面的局限性问题。论文指出，基于大型推理模型（Large Reasoning Models, LRMs）特别是利用链式思维推理（Chain-of-Thought reasoning, CoT）的方法，能够将翻译重新定义为一个需要上下文、文化和语言理解与推理的动态任务。解决方案的关键在于通过三个基础转变实现：1）上下文连贯性，即LRMs通过显式推理跨句和复杂甚至缺失的上下文来解决歧义并保持话语结构；2）文化意图性，使模型能够通过推断说话者意图、受众期望和社会语言规范来调整输出；3）自我反思能力，在推理过程中进行自我校正以修正潜在错误，特别是在极端噪声情况下表现出更好的鲁棒性，而非仅仅执行简单的X-Y映射翻译。这种范式转变使得翻译系统不仅作为文本转换器，更成为具备多语言认知能力的代理，能够推理文本之外的意义。

链接: https://arxiv.org/abs/2503.10351
作者: Sinuo Liu,Chenyang Lyu,Minghao Wu,Longyue Wang,Weihua Luo,Kaifu Zhang
机构: Alibaba International Digital Commerce (阿里巴巴国际数字商业); University of Edinburgh (爱丁堡大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Recent advances in Large Reasoning Models (LRMs), particularly those leveraging Chain-of-Thought reasoning (CoT), have opened brand new possibility for Machine Translation (MT). This position paper argues that LRMs substantially transformed traditional neural MT as well as LLMs-based MT paradigms by reframing translation as a dynamic reasoning task that requires contextual, cultural, and linguistic understanding and reasoning. We identify three foundational shifts: 1) contextual coherence, where LRMs resolve ambiguities and preserve discourse structure through explicit reasoning over cross-sentence and complex context or even lack of context; 2) cultural intentionality, enabling models to adapt outputs by inferring speaker intent, audience expectations, and socio-linguistic norms; 3) self-reflection, LRMs can perform self-reflection during the inference time to correct the potential errors in translation especially extremely noisy cases, showing better robustness compared to simply mapping X-Y translation. We explore various scenarios in translation including stylized translation, document-level translation and multimodal translation by showcasing empirical examples that demonstrate the superiority of LRMs in translation. We also identify several interesting phenomenons for LRMs for MT including auto-pivot translation as well as the critical challenges such as over-localisation in translation and inference efficiency. In conclusion, we think that LRMs redefine translation systems not merely as text converters but as multilingual cognitive agents capable of reasoning about meaning beyond the text. This paradigm shift reminds us to think of problems in translation beyond traditional translation scenarios in a much broader context with LRMs - what we can achieve on top of it.
zh

[NLP-25] KV-Distill: Nearly Lossless Learnable Context Compression for LLM s

【速读】：该论文旨在解决长上下文序列到序列任务中，由于标准Transformer自注意力机制的二次复杂度导致难以有效利用长上下文的问题。特别是在生成过程中，临时存储在KV缓存中的表示占用了大量GPU内存，并且其占用随上下文长度线性增长。为了解决这一挑战，论文提出了KV-Distill框架，这是一种针对Transformer模型的压缩方法，通过以问题无关的方式将长上下文的KV缓存蒸馏为显著更短的表示形式。关键在于，KV-Distill能够作为参数高效的适配器对预训练模型进行微调，并在保持预训练模型能力的同时压缩任意长度的上下文片段。论文通过将压缩后的KV缓存与未压缩版本视为学生-教师对，并采用KL散度型的损失函数匹配输出分布，从而实现性能的提升。实验表明，KV-Distill在最坏情况下的抽取任务中优于其他压缩技术，并在长上下文问答和摘要任务中接近未压缩模型的表现，同时能够在特定领域上下文中进一步优化以实现高达99%的压缩率，而下游任务性能得以保留。其通用性还体现在适用于不同规模和架构的模型。

链接: https://arxiv.org/abs/2503.10337
作者: Vivek Chari,Guanghui Qin,Benjamin Van Durme
机构: Johns Hopkins University (约翰斯·霍普金斯大学); Microsoft (微软)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Sequence-to-sequence tasks often benefit from long contexts, but the quadratic complexity of self-attention in standard Transformers renders this non-trivial. During generation, temporary representations -stored in the so-called KV cache-account for a large portion of GPU memory usage and scale linearly with context length. We introduce KV-Distill, a Transformer compression framework that distills long context KV caches into significantly shorter representations in a question-independent fashion. KV-Distill can be trained as a parameter-efficient adaptor for pretrained models, and enables the compression of arbitrary spans of a context while preserving pre-trained model capabilities. We treat a compressed-uncompressed cache as a student-teacher pairing and apply a KL-type divergence to match the generated outputs. KV-Distill outperforms other compression techniques in worst-case extractive tasks and approaches uncompressed performance in long context question answering and summarization, and it can be fine-tuned on domain-specific contexts to reduce lengths by up to 99% while preserving downstream performance. We demonstrate the generalizability of KV-Distill across various model sizes and architectures.
zh

[NLP-26] OSMa-Bench: Evaluating Open Semantic Mapping Under Varying Lighting Conditions

【速读】：本文旨在解决室内环境下光照条件变化对语义映射（Semantic Mapping）性能影响这一关键挑战。解决方案的核心在于提出一个名为OSMa-Bench的动态可配置且高度自动化的评估管道，该管道基于大型语言模型（LLM）/轻量级语言模型（LVLM）。研究引入了一个包含模拟RGB-D序列和精确三维重建真值的新数据集，用于在不同光照条件下严格分析语义映射性能。此外，通过引入场景图（Scene Graph）评估方法，进一步分析模型解析语义结构的能力。这些方法共同为提升机器人系统的鲁棒性和适应性提供了重要参考，同时为未来研究指明方向。

链接: https://arxiv.org/abs/2503.10331
作者: Maxim Popov,Regina Kurkova,Mikhail Iumanov,Jaafar Mahmoud,Sergey Kolyubin
机构: Biomechatronics and Energy-Efficient Robotics (BE2R) Lab (仿生机器人与能源效率实验室), ITMO University (ITMO大学), Saint Petersburg, Russia (俄罗斯)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Robotics (cs.RO)
备注: Project page: this https URL

点击查看摘要

Abstract:Open Semantic Mapping (OSM) is a key technology in robotic perception, combining semantic segmentation and SLAM techniques. This paper introduces a dynamically configurable and highly automated LLM/LVLM-powered pipeline for evaluating OSM solutions called OSMa-Bench (Open Semantic Mapping Benchmark). The study focuses on evaluating state-of-the-art semantic mapping algorithms under varying indoor lighting conditions, a critical challenge in indoor environments. We introduce a novel dataset with simulated RGB-D sequences and ground truth 3D reconstructions, facilitating the rigorous analysis of mapping performance across different lighting conditions. Through experiments on leading models such as ConceptGraphs, BBQ and OpenScene, we evaluate the semantic fidelity of object recognition and segmentation. Additionally, we introduce a Scene Graph evaluation method to analyze the ability of models to interpret semantic structure. The results provide insights into the robustness of these models, forming future research directions for developing resilient and adaptable robotic systems. Our code is available at this https URL.
zh

[NLP-27] Proceedings of the ISCA/ITG Workshop on Diversity in Large Speech and Language Models

【速读】：该论文试图解决的问题是：随着大型语言模型（Large Language Models, LLMs）和相关技术的发展，人类与这些技术交互的有效性受到训练数据所覆盖语言类型的限制，可能导致部分用户因未能适应模型的语言模式而被排除在外。同时，由于商业模型开发倾向于优先支持主流语言，欠代表语言及方言/社会方言可能进一步被边缘化，加剧数字鸿沟。此外，对于许多低资源语言，缺乏足够的数据进一步限制了其在语音和自然语言处理技术中的应用。

解决方案的关键在于从计算机科学和语言学（包括计算语言学和自然语言处理）的角度探讨上述问题，并提出相应的科学贡献以促进更公平和广泛的语言技术普及。

链接: https://arxiv.org/abs/2503.10298
作者: Sebastian Möller,Pia Knoeferle,Britta Schulte,Nils Feldhus
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Machine learning techniques have conquered many different tasks in speech and natural language processing, such as speech recognition, information extraction, text and speech generation, and human machine interaction using natural language or speech (chatbots). Modern techniques typically rely on large models for representing general knowledge of one or several languages (Large Language Models, LLMs), or for representing speech and general audio characteristics. These models have been trained with large amounts of speech and language data, typically including web content. When humans interact with such technologies, the effectiveness of the interaction will be influenced by how far humans make use of the same type of language the models have been trained on or, in other words, if the models are able to generalize to the language used by humans when interacting with the technology. This may lead to some gradual forms of adaptation in human speech and language production, and users who do not adapt may be excluded from efficient use of such technologies. On top of this, as commercial model development follows market needs, under-represented languages and dialects/sociolects may decrease in terms of priorities. Furthermore, for many lesser spoken languages the necessary data is not available, which will worsen a digital divide in speech and language technology usage. The workshop sets out to discuss this problem based on scientific contributions from the perspective of computer science and linguistics (including computational linguistics and NLP).
zh

[NLP-28] Wikipedia is Not a Dictionary Delete! Text Classification as a Proxy for Analysing Wiki Deletion Discussions

【速读】：该论文试图解决自动化内容审核在协作知识库（如维基百科或Wikidata）中的挑战性问题，特别是针对标记为删除的文章讨论进行内容审核的任务。论文的关键在于构建了一个包含多语言维基站点删除标记文章相关讨论的数据库，并利用此数据集评估了一系列语言模型（Language Models, LMs）在不同任务上的表现，包括预测讨论结果及识别单条评论所隐含的政策指向。研究发现，某些情况下自动生成的标签（如“保留”、“删除”或“重定向”）并未有效指导分类器，这可能归因于用户评论中的犹豫或深思熟虑。因此，解决方案的关键在于通过精心设计的数据集和任务设定来揭示现有方法的优势与局限性。

链接: https://arxiv.org/abs/2503.10294
作者: Hsuvas Borkakoty,Luis Espinosa-Anke
机构: 未知
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: Accepted to WNUT-2025

点击查看摘要

Abstract:Automated content moderation for collaborative knowledge hubs like Wikipedia or Wikidata is an important yet challenging task due to multiple factors. In this paper, we construct a database of discussions happening around articles marked for deletion in several Wikis and in three languages, which we then use to evaluate a range of LMs on different tasks (from predicting the outcome of the discussion to identifying the implicit policy an individual comment might be pointing to). Our results reveal, among others, that discussions leading to deletion are easier to predict, and that, surprisingly, self-produced tags (keep, delete or redirect) don’t always help guiding the classifiers, presumably because of users’ hesitation or deliberation within comments.
zh

[NLP-29] VisualPRM: An Effective Process Reward Model for Multimodal Reasoning

【速读】：本文旨在解决现有多模态大型语言模型（Multimodal Large Language Models, MLLMs）在跨尺度和跨家族推理能力上的不足。论文提出了一种名为VisualPRM的高级多模态过程奖励模型（Process Reward Model, PRM），包含80亿参数，通过最佳-of-N（Best-of-N, BoN）评估策略显著提升了不同规模和类型MLLMs的推理性能。关键在于引入了基于过程监督的数据集VisualPRM400K和用于评估PRM的VisualProcessBench基准，后者提供了逐步骤的人类标注正确性标签，以更精确地衡量模型检测多模态推理任务中错误步骤的能力。这一方法使得即使在高度强大的InternVL2.5-78B模型上，也能实现平均5.9点的性能提升。

链接: https://arxiv.org/abs/2503.10291
作者: Weiyun Wang,Zhangwei Gao,Lianjie Chen,Zhe Chen,Jinguo Zhu,Xiangyu Zhao,Yangzhou Liu,Yue Cao,Shenglong Ye,Xizhou Zhu,Lewei Lu,Haodong Duan,Yu Qiao,Jifeng Dai,Wenhai Wang
机构: Fudan University; Shanghai AI Laboratory (上海人工智能实验室); Shanghai Jiaotong University; Tsinghua University (清华大学); Nanjing University (南京大学); The Chinese University of Hong Kong (香港中文大学); SenseTime Research (商汤研究院)
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:We introduce VisualPRM, an advanced multimodal Process Reward Model (PRM) with 8B parameters, which improves the reasoning abilities of existing Multimodal Large Language Models (MLLMs) across different model scales and families with Best-of-N (BoN) evaluation strategies. Specifically, our model improves the reasoning performance of three types of MLLMs and four different model scales. Even when applied to the highly capable InternVL2.5-78B, it achieves a 5.9-point improvement across seven multimodal reasoning benchmarks. Experimental results show that our model exhibits superior performance compared to Outcome Reward Models and Self-Consistency during BoN evaluation. To facilitate the training of multimodal PRMs, we construct a multimodal process supervision dataset VisualPRM400K using an automated data pipeline. For the evaluation of multimodal PRMs, we propose VisualProcessBench, a benchmark with human-annotated step-wise correctness labels, to measure the abilities of PRMs to detect erroneous steps in multimodal reasoning tasks. We hope that our work can inspire more future research and contribute to the development of MLLMs. Our model, data, and benchmark are released in this https URL.
zh

[NLP-30] An Expanded Massive Multilingual Dataset for High-Performance Language Technologies

【速读】：该论文试图解决构建高质量多语言数据集的挑战，特别是在训练先进大型语言模型时对清洁且多样化文本数据的需求。解决方案的关键在于提出HPLT v2，这是一个包含高质量单语和平行语料库的多语言数据集。单语部分包含覆盖193种语言的8万亿个token，而平行部分则包含覆盖51种语言的3.8亿句对。论文详细记录了整个数据处理流程，并开源了代码以重现数据集，同时提供了数据质量和特性的广泛分析。最终，通过评估在HPLT v2上训练的语言模型和机器翻译系统的性能，验证了该数据集的价值。

链接: https://arxiv.org/abs/2503.10267
作者: Laurie Burchell,Ona de Gibert,Nikolay Arefyev,Mikko Aulamo,Marta Bañón,and Pinzhen Chen,Mariia Fedorova,Liane Guillou,Barry Haddow,Jan Hajič,and Jindřich Helcl,Erik Henriksson,Mateusz Klimaszewski,Ville Komulainen,and Andrey Kutuzov,Joona Kytöniemi,Veronika Laippala,Petter Mæhlum,and Bhavitvya Malik,Farrokh Mehryary,Vladislav Mikhailov,Nikita Moghe,and Amanda Myntti,Dayyán O’Brien,Stephan Oepen,Proyag Pal,Jousia Piha,and Sampo Pyysalo,Gema Ramírez-Sánchez,David Samuel,Pavel Stepachev,and Jörg Tiedemann,Dušan Variš,Tereza Vojtěchová,Jaume Zaragoza-Bernabeu
机构: University of Edinburgh (爱丁堡大学); University of Helsinki (赫尔辛基大学); University of Oslo (奥斯陆大学); Prompsit Language Engineering; Charles University (查理大学); University of Turku (图尔库大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Training state-of-the-art large language models requires vast amounts of clean and diverse textual data. However, building suitable multilingual datasets remains a challenge. In this work, we present HPLT v2, a collection of high-quality multilingual monolingual and parallel corpora. The monolingual portion of the data contains 8T tokens covering 193 languages, while the parallel data contains 380M sentence pairs covering 51 languages. We document the entire data pipeline and release the code to reproduce it. We provide extensive analysis of the quality and characteristics of our data. Finally, we evaluate the performance of language models and machine translation systems trained on HPLT v2, demonstrating its value.
zh

[NLP-31] MinorBench: A hand-built benchmark for content-based risks for children

【速读】：该论文试图解决大型语言模型（Large Language Models, LLMs）在儿童使用过程中产生的内容相关风险未被充分研究的问题。论文通过一个中学环境中部署的基于LLM的聊天机器人案例研究，揭示了学生如何使用甚至误用该系统。基于这些发现，论文提出了针对未成年人的内容相关风险的新分类法，并引入了MinorBench，这是一个开源基准，用于评估LLMs拒绝不安全或不适查询的能力。关键解决方案在于设计MinorBench来标准化评估LLMs的安全性，并通过实验验证了不同提示下六个主流LLMs在儿童安全合规方面的显著差异，从而为构建更健壮、以儿童为中心的安全机制提供了实用建议，并强调了定制AI系统以保护年轻用户的紧迫性。

链接: https://arxiv.org/abs/2503.10242
作者: Shaun Khoo,Gabriel Chua,Rachel Shong
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) are rapidly entering children’s lives - through parent-driven adoption, schools, and peer networks - yet current AI ethics and safety research do not adequately address content-related risks specific to minors. In this paper, we highlight these gaps with a real-world case study of an LLM-based chatbot deployed in a middle school setting, revealing how students used and sometimes misused the system. Building on these findings, we propose a new taxonomy of content-based risks for minors and introduce MinorBench, an open-source benchmark designed to evaluate LLMs on their ability to refuse unsafe or inappropriate queries from children. We evaluate six prominent LLMs under different system prompts, demonstrating substantial variability in their child-safety compliance. Our results inform practical steps for more robust, child-focused safety mechanisms and underscore the urgency of tailoring AI systems to safeguard young users.
zh

[NLP-32] ARLED: Leverag ing LED-based ARMAN Model for Abstractive Summarization of Persian Long Documents

【速读】：该论文旨在解决长篇文档自动摘要生成的问题，特别是针对波斯语文本的抽象型文本摘要挑战。传统抽取式方法因简单易用而广泛应用，但常遗漏重要信息，而抽象型摘要能够通过理解文本的深层含义生成更连贯且信息量更大的摘要。然而，长文档摘要任务仍面临技术瓶颈。为应对这一挑战，论文提出的关键解决方案是基于Longformer架构的ARMAN模型，并利用从Ensani网站获取的包含300,000篇全文本波斯语论文的新数据集进行训练与验证，从而实现性能优异的波斯语文本摘要生成。

链接: https://arxiv.org/abs/2503.10233
作者: Samira Zangooei,Amirhossein Darmani,Hossein Farahmand Nezhad,Laya Mahmoudi
机构: 未知
类目: Computation and Language (cs.CL)
备注: 11 pages, 3 tables

点击查看摘要

Abstract:The increasing volume of textual data poses challenges in reading and comprehending large documents, particularly for scholars who need to extract useful information from research articles. Automatic text summarization has emerged as a powerful tool to condense lengthy documents into concise and informative summaries. Depending on the approach used, text summarization can be categorized as either extractive or abstractive. While extractive methods are commonly used due to their simplicity, they often miss important information. On the other hand, Abstractive Summarization can generate more coherent and informative summaries by understanding the underlying meaning of the text. Abstractive techniques have gained attention in various languages, and recent advancements have been achieved through pre-training models such as BERT, BART, and T5. However, the challenge of summarizing long documents remains, and alternative models like Longformer have been introduced to address this limitation. In this context, this paper focuses on abstractive summarization in the Persian language. The authors introduce a new dataset of 300,000 full-text Persian papers obtained from the Ensani website and apply the ARMAN model, based on the Longformer architecture, to generate summaries. The experimental results demonstrate promising performance in Persian text summarization. The paper provides a comprehensive overview of related work, discusses the methodology, presents the experimental results, and concludes with future research directions.
zh

[NLP-33] R.U.Psycho? Robust Unified Psychometric Testing of Language Models

【速读】：该论文旨在解决生成式语言模型（Generative Language Models）在心理测验（psychometric questionnaires）研究中因模型输出不稳定、提示设计敏感性、参数设置多样性以及众多可用模型版本等因素导致的实验严谨性和可重复性不足的问题。论文的关键解决方案在于提出了一种名为“this http URL”的框架，该框架旨在通过降低对编程专业知识的要求，帮助研究人员设计和执行稳健且可重复的生成式语言模型心理测验实验。通过这一框架，研究者能够更系统地记录实验条件，并验证已有文献中的相关发现，从而提升研究结果的可靠性和普适性。该框架已作为Python包公开提供。

链接: https://arxiv.org/abs/2503.10229
作者: Julian Schelb,Orr Borin,David Garcia,Andreas Spitz
机构: University of Konstanz (康斯坦茨大学); Recosys; University of Konstanz (康斯坦茨大学); University of Konstanz (康斯坦茨大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Generative language models are increasingly being subjected to psychometric questionnaires intended for human testing, in efforts to establish their traits, as benchmarks for alignment, or to simulate participants in social science experiments. While this growing body of work sheds light on the likeness of model responses to those of humans, concerns are warranted regarding the rigour and reproducibility with which these experiments may be conducted. Instabilities in model outputs, sensitivity to prompt design, parameter settings, and a large number of available model versions increase documentation requirements. Consequently, generalization of findings is often complex and reproducibility is far from guaranteed. In this paper, we present this http URL, a framework for designing and running robust and reproducible psychometric experiments on generative language models that requires limited coding expertise. We demonstrate the capability of our framework on a variety of psychometric questionnaires, which lend support to prior findings in the literature. this http URL is available as a Python package at this https URL.
zh

[NLP-34] Assessing the validity of new paradigmatic complexity measures as criterial features for proficiency in L2 writings in English

【速读】：该论文旨在解决二语（L2）写作发展中语法和结构复杂性评估的问题。论文通过将语言功能与特定的语法范式相联系，探索学习者英语中的范例性产出，并引入七个微系统（microsystems, MS）作为评估指标。论文的关键解决方案在于采用监督学习框架，利用EFCAMDAT数据集作为基准，同时以法语学习者语料库作为外部测试集，验证这些微系统的有效性。研究发现，尽管单个微系统对评估学习者水平的影响较低，但它们作为一个整体具有显著的影响力。这一方法表明，这些微系统及其测量方法可以用于更广泛的计算机辅助语言学习（CALL）系统中，以支持语言能力评估。

链接: https://arxiv.org/abs/2503.10220
作者: Cyriel Mallart(UR2, LIDILE),Andrew Simpkin,Nicolas Ballier(UPCité, ALTAE (URP 3967)),Paula Lissón(UNIR),Rémi Venant(UM, LIUM),Jen-Yu Li(UR2, LIDILE),Bernardo Stearns(NUI Galway, INSIGHT),Thomas Gaillat(LIDILE, UR2)
机构: 未知
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:This article addresses Second Language (L2) writing development through an investigation of new grammatical and structural complexity metrics. We explore the paradigmatic production in learner English by linking language functions to specific grammatical paradigms. Using the EFCAMDAT as a gold standard and a corpus of French learners as an external test set, we employ a supervised learning framework to operationalise and evaluate seven microsystems. We show that learner levels are associated with the seven microsystems (MS). Using ordinal regression modelling for evaluation, the results show that all MS are significant but yield a low impact if taken individually. However, their influence is shown to be impactful if taken as a group. These microsystems and their measurement method suggest that it is possible to use them as part of broader-purpose CALL systems focused on proficiency assessment.
zh

[NLP-35] Adaptive Inner Speech-Text Alignment for LLM -based Speech Translation

【速读】：该论文旨在解决现有跨模态对齐方法主要关注输入与输出层面的对齐，而忽视模型表征内部更深层次语义对齐的问题。为了解决这一局限性，论文提出了一种自适应内层语音-文本对齐（Adaptive Inner Speech-Text Alignment, AI-STA）方法，通过在大型语言模型（LLMs）的选定层显式对齐语音和文本表示来弥合模态差距。其关键在于利用最优传输（Optimal Transport, OT）理论量化语音和文本表征之间的细粒度差异，并结合跨模态检索技术确定最适合对齐的层，进而对这些层进行联合训练。实验结果表明，AI-STA显著提升了大规模语音-文本模型（LSMs）的翻译性能，优于先前最先进的方法。

链接: https://arxiv.org/abs/2503.10211
作者: Henglyu Liu,Andong Chen,Kehai Chen,Xuefeng Bai,Meizhi Zhong,Yuan Qiu,Min Zhang
机构: Institute of Computing and Intelligence, Harbin Institute of Technology, Shenzhen, China (哈尔滨工业大学（深圳）计算与智能研究所); School of Computer Science and Engineering, Xi’an University of Technology (西安理工大学计算机科学与工程学院)
类目: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注: 12 pages, 7 figures

点击查看摘要

Abstract:Recent advancement of large language models (LLMs) has led to significant breakthroughs across various tasks, laying the foundation for the development of LLM-based speech translation systems. Existing methods primarily focus on aligning inputs and outputs across modalities while overlooking deeper semantic alignment within model representations. To address this limitation, we propose an Adaptive Inner Speech-Text Alignment (AI-STA) method to bridge the modality gap by explicitly aligning speech and text representations at selected layers within LLMs. To achieve this, we leverage the optimal transport (OT) theory to quantify fine-grained representation discrepancies between speech and text. Furthermore, we utilize the cross-modal retrieval technique to identify the layers that are best suited for alignment and perform joint training on these layers. Experimental results on speech translation (ST) tasks demonstrate that AI-STA significantly improves the translation performance of large speech-text models (LSMs), outperforming previous state-of-the-art approaches. Our findings highlight the importance of inner-layer speech-text alignment in LLMs and provide new insights into enhancing cross-modal learning.
zh

[NLP-36] Red Teaming Contemporary AI Models: Insights from Spanish and Basque Perspectives

【速读】：该论文旨在解决大型语言模型在支持特定语言（如西班牙语和巴斯克语）时可能存在的偏差（biases）和安全性问题（safety concerns）。论文的关键解决方案在于通过红队测试（Red Teaming）方法，组织专家对来自不同机构（OpenAI、DeepSeek 和 ALIA）的最新模型进行人工评估。研究通过对670次对话的分析，揭示了这些模型在偏差和安全性方面的漏洞，其中偏差或不安全响应的比例在29.5%到50.6%之间。这表明开发可靠且可信赖的语言模型仍面临持续挑战，而红队测试作为一种主动评估手段，为识别和改进这些问题提供了关键路径。

链接: https://arxiv.org/abs/2503.10192
作者: Miguel Romero-Arjona,Pablo Valle,Juan C. Alonso,Ana B. Sánchez,Miriam Ugarte,Antonia Cazalilla,Vicente Cambrón,José A. Parejo,Aitor Arrieta,Sergio Segura
机构: 未知
类目: oftware Engineering (cs.SE); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:The battle for AI leadership is on, with OpenAI in the United States and DeepSeek in China as key contenders. In response to these global trends, the Spanish government has proposed ALIA, a public and transparent AI infrastructure incorporating small language models designed to support Spanish and co-official languages such as Basque. This paper presents the results of Red Teaming sessions, where ten participants applied their expertise and creativity to manually test three of the latest models from these initiatives \unicodex2013 OpenAI o3-mini, DeepSeek R1, and ALIA Salamandra \unicodex2013 focusing on biases and safety concerns. The results, based on 670 conversations, revealed vulnerabilities in all the models under test, with biased or unsafe responses ranging from 29.5% in o3-mini to 50.6% in Salamandra. These findings underscore the persistent challenges in developing reliable and trustworthy AI systems, particularly those intended to support Spanish and Basque languages.
zh

[NLP-37] PRISM: Preference Refinement via Implicit Scene Modeling for 3D Vision-Language Preference-Based Reinforcement Learning

【速读】：本文针对基于2D的偏好驱动强化学习（Preference-Based Reinforcement Learning, PBRL）存在的局限性提出了解决方案，设计了名为PRISM的新框架。其核心在于统一三维点云建模与面向未来的偏好精化，通过采用三维点云-语言模型（3D Point Cloud-Language Model, 3D-PC-LLM）缓解遮挡和视角偏差，确保更稳定且空间一致的偏好信号。同时，利用链式思维（Chain-of-Thought, CoT）推理引入远期考虑，避免传统静态偏好比较中常见的短视反馈。关键在于结合三维感知与面向未来的推理能力，显著提升了偏好一致性率、加速了策略收敛，并增强了在未见过的机器人环境中的鲁棒泛化性能。实验结果表明，PRISM在机器人操作和自主导航等任务中展现出实际应用潜力，通过将三维几何感知与基于CoT的偏好建模相结合，奠定了可扩展的人机一致强化学习的基础。

链接: https://arxiv.org/abs/2503.10177
作者: Yirong Sun,Yanjun Chen
机构: Department of Computing, The Hong Kong Polytechnic University (香港理工大学); Digital Twin Institute, Eastern Institute of Technology (东方理工学院)
类目: Computation and Language (cs.CL); Robotics (cs.RO)
备注:

点击查看摘要

Abstract:We propose PRISM, a novel framework designed to overcome the limitations of 2D-based Preference-Based Reinforcement Learning (PBRL) by unifying 3D point cloud modeling and future-aware preference refinement. At its core, PRISM adopts a 3D Point Cloud-Language Model (3D-PC-LLM) to mitigate occlusion and viewpoint biases, ensuring more stable and spatially consistent preference signals. Additionally, PRISM leverages Chain-of-Thought (CoT) reasoning to incorporate long-horizon considerations, thereby preventing the short-sighted feedback often seen in static preference comparisons. In contrast to conventional PBRL techniques, this integration of 3D perception and future-oriented reasoning leads to significant gains in preference agreement rates, faster policy convergence, and robust generalization across unseen robotic environments. Our empirical results, spanning tasks such as robotic manipulation and autonomous navigation, highlight PRISM’s potential for real-world applications where precise spatial understanding and reliable long-term decision-making are critical. By bridging 3D geometric awareness with CoT-driven preference modeling, PRISM establishes a comprehensive foundation for scalable, human-aligned reinforcement learning.
zh

[NLP-38] “Well Keep Thinking”: Enhancing LLM Reasoning with Adaptive Injection Decoding

【速读】：该论文试图解决的问题是如何在无需显式提示（explicit prompting）的情况下，诱导大型语言模型（LLMs）展现强大的推理能力。传统方法依赖于劳动密集型的提示工程（如 few-shot 或 zero-shot chain-of-thought (CoT) 提示），而本文提出了一种新颖的解码策略来克服这一局限。解决方案的关键在于通过监控模型生成过程，并在模型可能过早结束推理之前注入特定短语，从而系统性地引导模型继续完成推理过程，避免不成熟的推理结果。实验评估表明，该策略显著提升了 LLM 的推理能力，展示了基于解码干预相对于传统提示技术的潜力。

链接: https://arxiv.org/abs/2503.10167
作者: Hyunbin Jin,Je Won Yeom,Seunghyun Bae,Taesup Kim
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large language models (LLMs) exhibit strong reasoning abilities, often attributed to few-shot or zero-shot chain-of-thought (CoT) prompting. While effective, these methods require labor-intensive prompt engineering, raising the question of whether reasoning can be induced without reliance on explicit prompts. In this work, we unlock the reasoning capabilities of LLMs without explicit prompting. Inspired by zero-shot CoT and CoT-decoding, we propose a novel decoding strategy that systematically nudges LLMs to continue reasoning, thereby preventing immature reasoning processes. Specifically, we monitor the model’s generation and inject a designated phrase whenever it is likely to conclude its response prematurely, before completing the reasoning process. Our experimental evaluations on diverse reasoning benchmarks demonstrate that our proposed strategy substantially improves LLM reasoning capabilities, highlighting the potential of decoding-based interventions as an alternative to traditional prompting techniques.
zh

[NLP-39] Retrieval-Augmented Generation with Hierarchical Knowledge

【速读】：该论文旨在解决现有基于图的 Retrieval-Augmented Generation (RAG) 方法未能充分利用人类认知中固有的层次化知识的问题，这限制了 RAG 系统在语义理解和结构捕捉方面的能力。为了解决这一问题，论文提出了一种新的方法 HiRAG，其关键在于利用层次化知识来增强 RAG 系统在索引和检索过程中的语义理解与结构捕捉能力。实验结果表明，HiRAG 在性能上显著超越了最先进的基准方法。

链接: https://arxiv.org/abs/2503.10150
作者: Haoyu Huang,Yongfeng Huang,Junjie Yang,Zhenyu Pan,Yongqiang Chen,Kaili Ma,Hongzhi Chen,James Cheng
机构: KASMA.ai (卡斯玛人工智能); Department of Computer Science and Engineering, The Chinese University of Hong Kong (香港中文大学计算机科学与工程系)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Graph-based Retrieval-Augmented Generation (RAG) methods have significantly enhanced the performance of large language models (LLMs) in domain-specific tasks. However, existing RAG methods do not adequately utilize the naturally inherent hierarchical knowledge in human cognition, which limits the capabilities of RAG systems. In this paper, we introduce a new RAG approach, called HiRAG, which utilizes hierarchical knowledge to enhance the semantic understanding and structure capturing capabilities of RAG systems in the indexing and retrieval processes. Our extensive experiments demonstrate that HiRAG achieves significant performance improvements over the state-of-the-art baseline methods. The code of our proposed method is available at \hrefthis https URLthis https URL.
zh

[NLP-40] Gumiho: A Hybrid Architecture to Prioritize Early Tokens in Speculative Decoding

【速读】：该论文试图解决自回归语言模型在推测解码（Speculative Decoding, SPD）过程中多令牌生成效率与准确性难以兼顾的问题。现有方法假设序列中的所有标记同等重要，并依赖单一的串行或并行生成范式，这可能导致早期关键标记的处理不够精确，从而限制整体性能。论文的关键在于提出了一种名为Gumiho的混合模型，通过结合串行和并行头结构来优化不同位置标记的处理方式：利用复杂的Transformer架构处理早期标记以提升准确性，同时采用轻量级MLP并行处理后期标记以增强效率。这种策略使得Gumiho能够更高效地分配计算资源，最终实现优于现有方法的整体性能提升。

链接: https://arxiv.org/abs/2503.10135
作者: Jinze Li,Yixing Xu,Haiduo Huang,Xuanwu Yin,Dong Li,Edith C.H. Ngai,Emad Barsoum
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Paper under review

点击查看摘要

Abstract:Speculative decoding (SPD) aims to accelerate the auto-regressive token generation process of a target Large Language Model (LLM). Some approaches employ a draft model with multiple heads to predict a sequence of future tokens, where each head handles a token in the sequence. The target LLM verifies the predicted sequence and accepts aligned tokens, enabling efficient multi-token generation. However, existing methods assume that all tokens within a sequence are equally important, employing identical head structures and relying on a single-generation paradigm, either serial or parallel. To this end, we theoretically demonstrate that initial tokens in the draft sequence are more important than later ones. Building on this insight, we propose Gumiho, a hybrid model combining serial and parallel heads. Specifically, given the critical importance of early tokens, we employ a sophisticated Transformer architecture for the early draft heads in a serial configuration to improve accuracy. For later tokens, we utilize multiple lightweight MLP heads operating in parallel to enhance efficiency. By allocating more advanced model structures and longer running times to the early heads, Gumiho achieves improved overall performance. The experimental results demonstrate that our method outperforms existing approaches, fully validating its effectiveness.
zh

[NLP-41] Cognitive-Mental-LLM : Leverag ing Reasoning in Large Language Models for Mental Health Prediction via Online Text

【速读】：该论文旨在解决传统分类方法在从在线文本预测心理健康结果时缺乏可解释性和鲁棒性的问题。为了解决这一问题，研究的关键在于评估和应用结构化推理技术（包括Chain-of-Thought (CoT)、Self-Consistency (SC-CoT) 和 Tree-of-Thought (ToT)）以提升多心理健康数据集上的分类准确性。通过引入基于推理的提示策略（如Zero-shot CoT和Few-shot CoT），并在平衡准确率、F1分数以及敏感性/特异性等关键性能指标下进行分析，研究发现，与直接预测相比，这些推理增强的技术尤其在复杂情况下表现出更高的分类性能。此外，Few-shot CoT提示策略在所有测试中始终优于其他方法，进一步验证了推理驱动的大型语言模型的有效性。尽管如此，不同数据集间的性能差异揭示了模型可靠性和可解释性面临的挑战。总之，本研究为基于推理的大型语言模型技术在心理健康文本分类中的应用提供了全面基准，并指出了未来改进的方向。

链接: https://arxiv.org/abs/2503.10095
作者: Avinash Patil,Amardeep Kour Gedhu
机构: Ira A. Fulton Schools of Engineering (艾拉·A·富尔顿工程学院), Arizona State University (亚利桑那州立大学); Department of Psychology (心理学系), Santa Clara University (圣克拉拉大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 8 pages, 4 Figures, 3 tables

点击查看摘要

Abstract:Large Language Models (LLMs) have demonstrated potential in predicting mental health outcomes from online text, yet traditional classification methods often lack interpretability and robustness. This study evaluates structured reasoning techniques-Chain-of-Thought (CoT), Self-Consistency (SC-CoT), and Tree-of-Thought (ToT)-to improve classification accuracy across multiple mental health datasets sourced from Reddit. We analyze reasoning-driven prompting strategies, including Zero-shot CoT and Few-shot CoT, using key performance metrics such as Balanced Accuracy, F1 score, and Sensitivity/Specificity. Our findings indicate that reasoning-enhanced techniques improve classification performance over direct prediction, particularly in complex cases. Compared to baselines such as Zero Shot non-CoT Prompting, and fine-tuned pre-trained transformers such as BERT and Mental-RoBerta, and fine-tuned Open Source LLMs such as Mental Alpaca and Mental-Flan-T5, reasoning-driven LLMs yield notable gains on datasets like Dreaddit (+0.52% over M-LLM, +0.82% over BERT) and SDCNL (+4.67% over M-LLM, +2.17% over BERT). However, performance declines in Depression Severity, and CSSRS predictions suggest dataset-specific limitations, likely due to our using a more extensive test set. Among prompting strategies, Few-shot CoT consistently outperforms others, reinforcing the effectiveness of reasoning-driven LLMs. Nonetheless, dataset variability highlights challenges in model reliability and interpretability. This study provides a comprehensive benchmark of reasoning-based LLM techniques for mental health text classification. It offers insights into their potential for scalable clinical applications while identifying key challenges for future improvements.
zh

[NLP-42] Semantic Synergy: Unlocking Policy Insights and Learning Pathways Through Advanced Skill Mapping

【速读】：本文旨在解决从复杂文本信息中提取结构化且可操作的洞察（actionable insights）的问题，特别是在政策文件和简历等多源文档中自动识别和关联技能、职业档案及学习路径的需求。为实现这一目标，系统的关键创新在于结合先进的自然语言处理技术（如基于SentenceTransformer的语义嵌入与分割）、高效的FAISS搜索方法进行技能提取，并通过Dash和Plotly实现交互式可视化，从而建立技能与ESCO职业框架及可持续发展目标学院提供的学习路径之间的强关联关系。这种方法不仅验证了其接近人类水平的准确性（F1分数超过0.95和0.93），还为AE4RIA网络内的深入协作奠定了坚实基础。

链接: https://arxiv.org/abs/2503.10094
作者: Phoebe Koundouri,Conrad Landis,Georgios Feretzakis
机构: Athens University of Economics and Business (雅典经济与商业大学); Denmark Technical University (DTU) (丹麦技术大学); Athena Research Centre (雅典研究与创新中心); UN SDSN (全球气候中心，欧洲中心，希腊中心)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:This research introduces a comprehensive system based on state-of-the-art natural language processing, semantic embedding, and efficient search techniques for retrieving similarities and thus generating actionable insights from raw textual information. The system automatically extracts and aggregates normalized competencies from multiple documents (such as policy files and curricula vitae) and creates strong relationships between recognized competencies, occupation profiles, and related learning courses. To validate its performance, we conducted a multi-tier evaluation that included both explicit and implicit skill references in synthetic and real-world documents. The results showed near-human-level accuracy, with F1 scores exceeding 0.95 for explicit skill detection and above 0.93 for implicit mentions. The system thereby establishes a sound foundation for supporting in-depth collaboration across the AE4RIA network. The methodology involves a multi-stage pipeline based on extensive preprocessing and data cleaning, semantic embedding and segmentation via SentenceTransformer, and skill extraction using a FAISS-based search method. The extracted skills are associated with occupation frameworks (as formulated in the ESCO ontology) and with learning paths offered through the Sustainable Development Goals Academy. Moreover, interactive visualization software, implemented with Dash and Plotly, presents graphs and tables for real-time exploration and informed decision-making by those involved in policymaking, training and learning supply, career transitions, and recruitment. Overall, this system, backed by rigorous validation, offers promising prospects for improved policymaking, human resource development, and lifelong learning by providing structured and actionable insights from raw, complex textual information.
zh

[NLP-43] Representation-based Reward Modeling for Efficient Safety Alignment of Large Language Model

【速读】：该论文旨在解决基于强化学习（Reinforcement Learning, RL）的安全对齐方法（如直接偏好优化 Direct Preference Optimization, DPO）在处理分布偏移（distribution shift）问题时面临的挑战。当前方法通常通过在线采样目标策略来缓解这一问题，但这种方式需要大量的计算资源。论文的关键假设是，在离线策略训练过程中，尽管由策略生成的输出排序会发生变化，但其整体分布相对稳定。基于此假设，论文提出了一种新框架，利用模型内在的安全判断能力提取奖励信号，并用这些信号计算偏好数据的标签置信度以重新排序偏好。实验结果和理论分析表明，该方法不仅有效解决了分布偏移问题，还显著提升了安全性能，同时减少了约300倍的计算开销。

链接: https://arxiv.org/abs/2503.10093
作者: Qiyuan Deng,Xuefeng Bai,Kehai Chen,Yaowei Wang,Liqiang Nie,Min Zhang
机构: Harbin Insititute of Technology (哈尔滨工业大学), Shenzhen, China
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Reinforcement Learning (RL) algorithms for safety alignment of Large Language Models (LLMs), such as Direct Preference Optimization (DPO), encounter the challenge of distribution shift. Current approaches typically address this issue through online sampling from the target policy, which requires significant computational resources. In this paper, we hypothesize that during off-policy training, while the ranking order of output generated by policy changes, their overall distribution remains relatively stable. This stability allows the transformation of the sampling process from the target policy into a re-ranking of preference data. Building on this hypothesis, We propose a new framework that leverages the model’s intrinsic safety judgment capability to extract reward signals, which are then used to calculate label confidence for preferences reordering. Extensive experimental results and theoretical analysis demonstrate that the proposed method effectively addresses the distribution shift issue, remarkably enhancing the safety performance while reducing about 300x computational overheads.
zh

[NLP-44] Why Does Your CoT Prompt (Not) Work? Theoretical Analysis of Prompt Space Complexity its Interaction with Answer Space During CoT Reasoning with LLM s: A Recurrent Perspective

【速读】：该论文旨在解决现有基于Chain-of-Thought (CoT)提示方法在处理多样化推理任务时普遍采用“一刀切”策略（即固定模板如“逐步思考”）所导致的理论计算能力限制问题。当前方法迫使大语言模型(Large Language Models, LLMs)在极其复杂的提示空间中寻找有效的推理路径，且现有的提示设计研究很大程度上依赖试错而非基于理论指导的方法。论文的关键在于提供了一个严格的理论分析，探讨了提示空间（潜在提示结构的空间）与答案空间（LLMs生成的推理解的空间）之间的复杂性和相互作用。研究表明，依赖单一通用提示会负面影响LLMs的理论可计算性，并证明提示复杂度直接影响答案空间中导航的结构和效率。通过理论和实证分析，论文强调了特定任务提示的重要性，表明其显著优于无监督提示生成，突出了人类有指导的提示设计在CoT推理中的必要性。

链接: https://arxiv.org/abs/2503.10084
作者: Xiang Zhang,Juntai Cao,Jiaqi Wei,Chenyu You,Dujian Ding
机构: 未知
类目: Computation and Language (cs.CL)
备注: arXiv admin note: substantial text overlap with arXiv:2410.14198

点击查看摘要

Abstract:Despite the remarkable successes of Large Language Models (LLMs), their fundamental Transformer architecture possesses inherent theoretical limitations that restrict their capability to handle reasoning tasks with increasing computational complexity. Chain-of-Thought (CoT) prompting has emerged as a practical solution, supported by several theoretical studies. However, current CoT-based methods (including ToT, GoT, etc.) generally adopt a “one-prompt-fits-all” strategy, using fixed templates (e.g., “think step by step”) across diverse reasoning tasks. This method forces models to navigate an extremely complex prompt space to identify effective reasoning paths. The current prompt designing research are also heavily relying on trial-and-error rather than theoretically informed guidance. In this paper, we provide a rigorous theoretical analysis of the complexity and interplay between two crucial spaces: the prompt space (the space of potential prompt structures) and the answer space (the space of reasoning solutions generated by LLMs) in CoT reasoning. We demonstrate how reliance on a single universal prompt (e.g. think step by step) can negatively impact the theoretical computability of LLMs, illustrating that prompt complexity directly influences the structure and effectiveness of the navigation in answer space. Our analysis highlights that sometimes human supervision is critical for efficiently navigating the prompt space. We theoretically and empirically show that task-specific prompting significantly outperforms unsupervised prompt generation, emphasizing the necessity of thoughtful human guidance in CoT prompting.
zh

[NLP-45] Information Density Principle for MLLM Benchmarks

【速读】：该论文旨在解决多模态大型语言模型（MLLMs）评估基准的可靠性问题，特别是现有评估机制可能无法充分反映模型在实际应用场景中的表现。论文指出，开发者在选择基准时面临困惑，不确定哪些基准能够有效满足其需求。为解决此问题，论文提出了信息密度（Information Density）这一关键原则，用于衡量基准对MLLM开发的洞察力。信息密度从四个维度进行表征：Fallacy（错误检测能力）、Difficulty（任务难度）、Redundancy（冗余性）和Diversity（多样性）。通过分析超过10,000个样本，研究测量了19个MLLM基准的信息密度，发现较新的基准相较于旧版提供了更多见解，但仍存在提升空间。论文期望这一原则能够推动未来MLLM基准的发展与应用。

链接: https://arxiv.org/abs/2503.10079
作者: Chunyi Li,Xiaozhe Li,Zicheng Zhang,Yuan Tian,Ziheng Jia,Xiaohong Liu,Xiongkuo Min,Jia Wang,Haodong Duan,Kai Chen,Guangtao Zhai
机构: Shanghai Jiao Tong University (上海交通大学); Shanghai AI Lab (上海人工智能实验室); Tongji University (同济大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:With the emergence of Multimodal Large Language Models (MLLMs), hundreds of benchmarks have been developed to ensure the reliability of MLLMs in downstream tasks. However, the evaluation mechanism itself may not be reliable. For developers of MLLMs, questions remain about which benchmark to use and whether the test results meet their requirements. Therefore, we propose a critical principle of Information Density, which examines how much insight a benchmark can provide for the development of MLLMs. We characterize it from four key dimensions: (1) Fallacy, (2) Difficulty, (3) Redundancy, (4) Diversity. Through a comprehensive analysis of more than 10,000 samples, we measured the information density of 19 MLLM benchmarks. Experiments show that using the latest benchmarks in testing can provide more insight compared to previous ones, but there is still room for improvement in their information density. We hope this principle can promote the development and application of future MLLM benchmarks. Project page: this https URL
zh

[NLP-46] Compute Optimal Scaling of Skills: Knowledge vs Reasoning

【速读】：该论文试图解决的问题是：是否计算最优的缩放行为（compute-optimal scaling behavior）在不同技能（skill-dependent）之间存在差异，特别是在知识和推理相关的技能（如基于知识的问答和代码生成）上。此外，论文进一步探讨这种技能依赖的缩放行为是否仅仅是预训练数据混合（pretraining datamix）的 artifacts，还是反映了根本性的差异。

解决方案的关键在于通过广泛的消融实验（extensive ablation）分析不同数据混合下的表现，并验证即使在控制数据混合差异后，知识和代码生成在缩放行为上仍表现出根本性差异。最终，论文还分析了标准计算最优缩放与验证集选择之间的关系，指出验证集的技能组成可能会影响计算最优参数量近50%。这一发现强调了正确设计验证集的重要性。

链接: https://arxiv.org/abs/2503.10061
作者: Nicholas Roberts,Niladri Chatterji,Sharan Narang,Mike Lewis,Dieuwke Hupkes
机构: University of Wisconsin (威斯康星大学); GenAI at Meta (Meta的生成式人工智能部门); Meta (Meta)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Scaling laws are a critical component of the LLM development pipeline, most famously as a way to forecast training decisions such as ‘compute-optimally’ trading-off parameter count and dataset size, alongside a more recent growing list of other crucial decisions. In this work, we ask whether compute-optimal scaling behaviour can be skill-dependent. In particular, we examine knowledge and reasoning-based skills such as knowledge-based QA and code generation, and we answer this question in the affirmative: \textbfscaling laws are skill-dependent . Next, to understand whether skill-dependent scaling is an artefact of the pretraining datamix, we conduct an extensive ablation of different datamixes and find that, also when correcting for datamix differences, \textbfknowledge and code exhibit fundamental differences in scaling behaviour . We conclude with an analysis of how our findings relate to standard compute-optimal scaling using a validation set, and find that \textbfa misspecified validation set can impact compute-optimal parameter count by nearly 50%, depending on its skill composition.
zh

[NLP-47] Using Context to Improve Word Segmentation

【速读】：该论文旨在研究婴儿如何利用统计规律学习词分割，并探索上下文在这一过程中的作用。论文的关键在于通过实现Goldwater等人的两种模型（unigram和bigram模型）来验证上下文是否能够提升基于统计的词分割性能。结果显示，与假设一致，bigram模型优于unigram模型在预测词分割上的表现。此外，论文进一步探讨了儿童可能利用已知词汇来分割新语音序列的基本建模方法。解决方案的关键在于引入bigram模型以充分利用上下文信息，从而提高词分割任务的表现。

链接: https://arxiv.org/abs/2503.10023
作者: Stephanie Hu,Xiaolu Guo
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:An important step in understanding how children acquire languages is studying how infants learn word segmentation. It has been established in previous research that infants may use statistical regularities in speech to learn word segmentation. The research of Goldwater et al., demonstrated that incorporating context in models improves their ability to learn word segmentation. We implemented two of their models, a unigram and bigram model, to examine how context can improve statistical word segmentation. The results are consistent with our hypothesis that the bigram model outperforms the unigram model at predicting word segmentation. Extending the work of Goldwater et al., we also explored basic ways to model how young children might use previously learned words to segment new utterances.
zh

[NLP-48] ExtremeAIGC: Benchmarking LMM Vulnerability to AI-Generated Extremist Content

【速读】：该论文旨在解决大型多模态模型（LMMs）在应对由人工智能生成的极端内容（包括照片级逼真的图像和文本）时的脆弱性问题，现有评估数据集对极端内容的探索有限，缺乏AI生成的图像、多样化的图像生成模型以及对历史事件的全面覆盖，这阻碍了对模型漏洞的完整评估。为填补这一空白，论文引入了ExtremeAIGC，这是一个专门设计的基准数据集和评估框架，用于评估LMMs对抗此类内容的脆弱性。ExtremeAIGC通过使用最先进的图像生成技术精心策划包含文本和图像示例的数据集，模拟现实世界中的事件和恶意使用案例。研究揭示了LMMs令人担忧的弱点，并量化了各种攻击策略的成功率，揭示了当前防御措施的关键漏洞，强调了需要更强大的缓解策略。解决方案的关键在于构建ExtremeAIGC数据集及其评估框架，以系统地识别和量化LMMs在面对AI生成极端内容时的脆弱性。

链接: https://arxiv.org/abs/2503.09964
作者: Bhavik Chandna,Mariam Aboujenane,Usman Naseem
机构: UC San Diego (加州大学圣地亚哥分校); Euromed University of Fez (伊本图菲尔欧洲地中海大学); Macquarie University (麦考瑞大学)
类目: Cryptography and Security (cs.CR); Computation and Language (cs.CL)
备注: Preprint

点击查看摘要

Abstract:Large Multimodal Models (LMMs) are increasingly vulnerable to AI-generated extremist content, including photorealistic images and text, which can be used to bypass safety mechanisms and generate harmful outputs. However, existing datasets for evaluating LMM robustness offer limited exploration of extremist content, often lacking AI-generated images, diverse image generation models, and comprehensive coverage of historical events, which hinders a complete assessment of model vulnerabilities. To fill this gap, we introduce ExtremeAIGC, a benchmark dataset and evaluation framework designed to assess LMM vulnerabilities against such content. ExtremeAIGC simulates real-world events and malicious use cases by curating diverse text- and image-based examples crafted using state-of-the-art image generation techniques. Our study reveals alarming weaknesses in LMMs, demonstrating that even cutting-edge safety measures fail to prevent the generation of extremist material. We systematically quantify the success rates of various attack strategies, exposing critical gaps in current defenses and emphasizing the need for more robust mitigation strategies.
zh

[NLP-49] ake Off the Training Wheels Progressive In-Context Learning for Effective Alignment EMNLP2024

【速读】：该论文试图解决现有研究主要关注分类与简单生成任务的问题，旨在填补In-Context Learning (ICL) 在实际复杂生成任务中的应用空白。论文的关键发现是，Transformer 将从演示样本中学到的任务函数嵌入到分隔符(token separator)的表示中，并且这些分隔符的表示在生成先验响应(prior response tokens)的过程中起着重要作用。基于此，论文提出了一种名为Progressive In-Context Alignment (PICA) 的两阶段方法：第一阶段通过标准ICL生成部分先验响应，并同时从分隔符表示中提取存储任务函数的ICL向量；第二阶段利用该向量指导模型完成零样本生成。这种方法无需额外微调，显著降低了时间成本并提升了对齐性能，在多项实验中超越了传统ICL并达到了与其他对齐方法相当的效果。

链接: https://arxiv.org/abs/2503.09958
作者: Zhenyu Liu,Dongfang Li,Xinshuo Hu,Xinping Zhao,Yibin Chen,Baotian Hu,Min Zhang
机构: Harbin Institute of Technology (Shenzhen)(哈尔滨工业大学（深圳）); Huawei Cloud, Huawei Technologies Ltd. (华为云，华为技术有限公司)
类目: Computation and Language (cs.CL)
备注: 15 pages, 9 figures, published in EMNLP2024

点击查看摘要

Abstract:Recent studies have explored the working mechanisms of In-Context Learning (ICL). However, they mainly focus on classification and simple generation tasks, limiting their broader application to more complex generation tasks in practice. To address this gap, we investigate the impact of demonstrations on token representations within the practical alignment tasks. We find that the transformer embeds the task function learned from demonstrations into the separator token representation, which plays an important role in the generation of prior response tokens. Once the prior response tokens are determined, the demonstrations become this http URL by this finding, we propose an efficient Progressive In-Context Alignment (PICA) method consisting of two stages. In the first few-shot stage, the model generates several prior response tokens via standard ICL while concurrently extracting the ICL vector that stores the task function from the separator token representation. In the following zero-shot stage, this ICL vector guides the model to generate responses without further this http URL experiments demonstrate that our PICA not only surpasses vanilla ICL but also achieves comparable performance to other alignment tuning methods. The proposed training-free method reduces the time cost (e.g., 5.45+) with improved alignment performance (e.g., 6.57+). Consequently, our work highlights the application of ICL for alignment and calls for a deeper understanding of ICL for complex generations. The code will be available at this https URL.
zh

[NLP-50] Developing and Evaluating an AI-Assisted Prediction Model for Unplanned Intensive Care Admissions following Elective Neurosurgery using Natural Language Processing within an Electronic Healthcare Record System

【速读】：该论文旨在解决术后重症监护病房（Intensive Therapy Unit, ITU）收治决策的主观性问题，并提出通过自然语言处理（Natural Language Processing, NLP）技术从电子健康记录（Electronic Health Records, EHRs）中提取关键信息，以预测择期手术患者是否需要入住ITU。论文的关键在于开发了一种基于MedCAT的NLP模型，该模型首先在全院范围的EHR数据上进行训练，随后针对正常颅压脑积水（Normal Pressure Hydrocephalus, NPH）和听神经瘤（Vestibular Schwannoma, VS）患者的特定数据集进行了两次优化，最终实现了概念检测F1分数达到0.93的性能。这一经过优化的模型能够有效提取神经外科患者的临床相关概念，并将其集成到多种AI算法中，包括决策树模型和神经时间序列模型，从而实现对ITU入院的高召回率（0.87），显著降低了人为决策遗漏的风险。

链接: https://arxiv.org/abs/2503.09927
作者: Julia Ive,Olatomiwa Olukoya,Jonathan P. Funnell,James Booker,Sze H M Lam,Ugan Reddy,Kawsar Noor,Richard JB Dobson,Astri M.V. Luoma,Hani J Marcus
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Introduction: Timely care in a specialised neuro-intensive therapy unit (ITU) reduces mortality and hospital stays, with planned admissions being safer than unplanned ones. However, post-operative care decisions remain subjective. This study used artificial intelligence (AI), specifically natural language processing (NLP) to analyse electronic health records (EHRs) and predict ITU admissions for elective surgery patients. Methods: This study analysed the EHRs of elective neurosurgery patients from University College London Hospital (UCLH) using NLP. Patients were categorised into planned high dependency unit (HDU) or ITU admission; unplanned HDU or ITU admission; or ward / overnight recovery (ONR). The Medical Concept Annotation Tool (MedCAT) was used to identify SNOMED-CT concepts within the clinical notes. We then explored the utility of these identified concepts for a range of AI algorithms trained to predict ITU admission. Results: The CogStack-MedCAT NLP model, initially trained on hospital-wide EHRs, underwent two refinements: first with data from patients with Normal Pressure Hydrocephalus (NPH) and then with data from Vestibular Schwannoma (VS) patients, achieving a concept detection F1-score of 0.93. This refined model was then used to extract concepts from EHR notes of 2,268 eligible neurosurgical patients. We integrated the extracted concepts into AI models, including a decision tree model and a neural time-series model. Using the simpler decision tree model, we achieved a recall of 0.87 (CI 0.82 - 0.91) for ITU admissions, reducing the proportion of unplanned ITU cases missed by human experts from 36% to 4%. Conclusion: The NLP model, refined for accuracy, has proven its efficiency in extracting relevant concepts, providing a reliable basis for predictive AI models to use in clinically valid applications.
zh

[NLP-51] PluralLLM : Pluralistic Alignment in LLM s via Federated Learning

【速读】：该论文旨在解决在确保大型语言模型（Large Language Models, LLMs）与多样化人类偏好保持一致的同时，兼顾隐私保护和公平性的问题。现有方法如基于人类反馈的强化学习（Reinforcement Learning from Human Feedback, RLHF）依赖于集中式数据收集，这不仅计算成本高昂，还侵犯隐私。论文提出了一种名为PluralLLM的联邦学习（Federated Learning）方法，其关键是通过联邦平均（Federated Averaging, FedAvg）高效聚合偏好更新，使多个用户群体能够在不共享敏感数据的情况下协作训练基于Transformer的偏好预测器，同时作为奖励模型以对齐LLMs。实验结果显示，PluralLLM实现了比集中式训练更快的收敛速度（快46%）、更高的对齐分数（提升4%），且具有接近的群体公平性指标，证明了联邦偏好学习作为一种可扩展且保护隐私的替代方案的有效性。

链接: https://arxiv.org/abs/2503.09925
作者: Mahmoud Srewa,Tianyu Zhao,Salma Elmalaki
机构: University of California, Irvine (加州大学欧文分校)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Ensuring Large Language Models (LLMs) align with diverse human preferences while preserving privacy and fairness remains a challenge. Existing methods, such as Reinforcement Learning from Human Feedback (RLHF), rely on centralized data collection, making them computationally expensive and privacy-invasive. We introduce PluralLLM a federated learning-based approach that enables multiple user groups to collaboratively train a transformer-based preference predictor without sharing sensitive data, which can also serve as a reward model for aligning LLMs. Our method leverages Federated Averaging (FedAvg) to aggregate preference updates efficiently, achieving 46% faster convergence, a 4% improvement in alignment scores, and nearly the same group fairness measure as in centralized training. Evaluated on a Q/A preference alignment task, PluralLLM demonstrates that federated preference learning offers a scalable and privacy-preserving alternative for aligning LLMs with diverse human values.
zh

[NLP-52] Quantization for OpenAI s Whisper Models: A Comparative Analysis

【速读】：该论文旨在解决自动语音识别（ASR）模型在实际应用中生成幻觉内容导致转录可靠性下降的问题，并探讨大型模型变体在资源受限设备上的部署挑战。论文的关键在于分析不同Whisper模型的相似性和差异性，评估模型量化对延迟的影响，并通过使用LibriSpeech数据集验证量化方法（INT4、INT5、INT8）在降低延迟和模型大小的同时保持转录准确性方面的有效性。研究结果提供了关于不同Whisper模型最佳应用场景以及边缘设备部署可能性的重要见解。所有代码、数据集及实现细节均已公开于GitHub仓库。

链接: https://arxiv.org/abs/2503.09905
作者: Allison Andreyev
机构: Unknown
类目: ound (cs.SD); Computation and Language (cs.CL); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
备注: 7 pages

点击查看摘要

Abstract:Automated speech recognition (ASR) models have gained prominence for applications such as captioning, speech translation, and live transcription. This paper studies Whisper and two model variants: one optimized for live speech streaming and another for offline transcription. Notably, these models have been found to generate hallucinated content, reducing transcription reliability. Furthermore, larger model variants exhibit increased latency and pose challenges for deployment on resource-constrained devices. This study analyzes the similarities and differences between three Whisper models, qualitatively examining their distinct capabilities. Next, this study quantifies the impact of model quantization on latency and evaluates its viability for edge deployment. Using the open source LibriSpeech dataset, this paper evaluates the word error rate (WER) along with latency analysis of whispercpp using 3 quantization methods (INT4, INT5, INT8). Results show that quantization reduces latency by 19% and model size by 45%, while preserving transcription accuracy. These findings provide insights into the optimal use cases of different Whisper models and edge device deployment possibilities. All code, datasets, and implementation details are available in a public GitHub repository: this https URL
zh

[NLP-53] A Rule Based Solution to Co-reference Resolution in Clinical Text

【速读】：该论文旨在构建一个针对生物医学领域的高效共指解析系统。为实现这一目标，论文基于2011年i2b2自然语言处理挑战赛提供的数据集进行实验，该挑战赛涉及医学文档中的共指解析任务，其中临床文本中的概念提及已被标注，并需通过共指链将同一文档内的共指提及链接起来。论文指出，现有共指解析系统的两种主要方法分别是手动构建规则和利用机器学习系统从训练数据集中自动学习。研究结果表明，基于观察训练数据手工设计的规则是一种在生物医学关键领域中实现高性能共指解析的有效方式，论文提出的基于规则的系统在多个医学数据集上的总体性能达到了89.6%。因此，该研究的关键在于通过分析训练数据并手工制定规则来实现高效的共指解析任务。

链接: https://arxiv.org/abs/2503.09896
作者: Ping Chen,David Hinote,Guoqing Chen
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Objective: The aim of this study was to build an effective co-reference resolution system tailored for the biomedical domain. Materials and Methods: Experiment materials used in this study is provided by the 2011 i2b2 Natural Language Processing Challenge. The 2011 i2b2 challenge involves coreference resolution in medical documents. Concept mentions have been annotated in clinical texts, and the mentions that co-refer in each document are to be linked by coreference chains. Normally, there are two ways of constructing a system to automatically discover co-referent links. One is to manually build rules for co-reference resolution, and the other category of approaches is to use machine learning systems to learn automatically from training datasets and then perform the resolution task on testing datasets. Results: Experiments show the existing co-reference resolution systems are able to find some of the co-referent links, and our rule based system performs well finding the majority of the co-referent links. Our system achieved 89.6% overall performance on multiple medical datasets. Conclusion: The experiment results show that manually crafted rules based on observation of training data is a valid way to accomplish high performance in this coreference resolution task for the critical biomedical domain.
zh

[NLP-54] Whats In Your Field? Mapping Scientific Research with Knowledge Graphs and Large Language Models

【速读】：该论文旨在解决科学文献在跨学科知识导航与综合分析方面因数量指数级增长而日益困难的问题。随着文献规模扩大，传统的大型语言模型（Large Language Models, LLMs）难以捕捉大规模研究工作中的详细关系，而基于检索增强生成等无结构方法虽能筛选相关事实，但在数百万事实影响答案时变得成本高昂。为克服这些局限性，论文提出利用LLMs提取结构化表示，并结合科学概念的语义理解与预定义的概念框架，构建一个能够精准回答整个文献集问题的系统。关键在于通过少量人工标注的摘要（仅20篇）从通用框架中提取结构化概念，并将其应用于大规模跨领域文献（如arXiv上的30,000篇论文），从而揭示新兴趋势并通过知识图谱可视化提供新的探索方式。

链接: https://arxiv.org/abs/2503.09894
作者: Abhipsha Das,Nicholas Lourie,Siavash Golkar,Mariel Pettee
机构: 未知
类目: Computation and Language (cs.CL)
备注: 9 pages, 5 pdf figures

点击查看摘要

Abstract:The scientific literature’s exponential growth makes it increasingly challenging to navigate and synthesize knowledge across disciplines. Large language models (LLMs) are powerful tools for understanding scientific text, but they fail to capture detailed relationships across large bodies of work. Unstructured approaches, like retrieval augmented generation, can sift through such corpora to recall relevant facts; however, when millions of facts influence the answer, unstructured approaches become cost prohibitive. Structured representations offer a natural complement – enabling systematic analysis across the whole corpus. Recent work enhances LLMs with unstructured or semistructured representations of scientific concepts; to complement this, we try extracting structured representations using LLMs. By combining LLMs’ semantic understanding with a schema of scientific concepts, we prototype a system that answers precise questions about the literature as a whole. Our schema applies across scientific fields and we extract concepts from it using only 20 manually annotated abstracts. To demonstrate the system, we extract concepts from 30,000 papers on arXiv spanning astrophysics, fluid dynamics, and evolutionary biology. The resulting database highlights emerging trends and, by visualizing the knowledge graph, offers new ways to explore the ever-growing landscape of scientific knowledge. Demo: abby101/surveyor-0 on HF Spaces. Code: this https URL.
zh

[NLP-55] Who Are You Behind the Screen? Implicit MBTI and Gender Detection Using Artificial Intelligence

【速读】：该论文旨在解决从数字互动中隐式检测人口统计特征和人格特质的问题，特别是在个性化技术和心理学研究中，精准识别这些特性变得日益重要。传统的人格预测方法主要依赖于明确的自报告标签，而本文提出的方法直接从Telegram会话数据的语言模式推断人格和性别变量，无需显式标签。论文的关键在于优化基于Transformer的语言模型RoBERTa，通过包含138,866条带MBTI类型标注的消息和195,016条带性别标注的消息的数据集，捕捉反映人格特质和性别差异的复杂语言线索。此外，通过引入置信水平，模型准确率显著提升至86.16%，证明了RoBERTa在隐式识别人格类型方面的持续能力。对于性别分类任务，模型准确率达到74.4%，揭示了性别特定的语言模式。人格维度分析表明，倾向于内向和直觉的人在文本交互中更为活跃。总体而言，论文强调了在实际对话环境中平衡准确性和数据覆盖范围的重要性，同时突出了基于Transformer的模型在隐式人格和性别预测任务中的效率。

链接: https://arxiv.org/abs/2503.09853
作者: Kourosh Shahnazari,Seyed Moein Ayyoubzadeh
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:In personalized technology and psychological research, precisely detecting demographic features and personality traits from digital interactions becomes ever more important. This work investigates implicit categorization, inferring personality and gender variables directly from linguistic patterns in Telegram conversation data, while conventional personality prediction techniques mostly depend on explicitly self-reported labels. We refine a Transformer-based language model (RoBERTa) to capture complex linguistic cues indicative of personality traits and gender differences using a dataset comprising 138,866 messages from 1,602 users annotated with MBTI types and 195,016 messages from 2,598 users annotated with gender. Confidence levels help to greatly raise model accuracy to 86.16%, hence proving RoBERTa’s capacity to consistently identify implicit personality types from conversational text data. Our results highlight the usefulness of Transformer topologies for implicit personality and gender classification, hence stressing their efficiency and stressing important trade-offs between accuracy and coverage in realistic conversational environments. With regard to gender classification, the model obtained an accuracy of 74.4%, therefore capturing gender-specific language patterns. Personality dimension analysis showed that people with introverted and intuitive preferences are especially more active in text-based interactions. This study emphasizes practical issues in balancing accuracy and data coverage as Transformer-based models show their efficiency in implicit personality and gender prediction tasks from conversational texts.
zh

[NLP-56] On the Limitations of Vision-Language Models in Understanding Image Transforms

【速读】：该论文试图解决视觉语言模型（Vision Language Models, VLMs）在基本图像变换（image-level augmentations）理解上的不足问题。研究聚焦于分析CLIP（OpenAI）和SigLIP（Google）等主流VLMs在处理图像级变换时的能力缺陷，并揭示这些模型缺乏对多种图像级增强技术的理解。为解决这一问题，论文的关键方案是创建了一个扩展版的Flickr8k数据集，通过为每张图像添加详细的变换描述来构建配对样本，从而系统性地评估和研究模型在此类任务中的表现。此外，论文进一步探索了这种能力缺失对下游任务（如图像编辑）的影响，并评估了最先进的图像到图像（Image2Image）模型在简单变换下的性能。

链接: https://arxiv.org/abs/2503.09837
作者: Ahmad Mustafa Anis,Hasnain Ali,Saquib Sarfraz
机构: Cohere for AI Community; Arbisoft; Karlsruhe Institute of Technology (卡尔斯鲁厄理工学院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 8 pages, 15 images

点击查看摘要

Abstract:Vision Language Models (VLMs) have demonstrated significant potential in various downstream tasks, including Image/Video Generation, Visual Question Answering, Multimodal Chatbots, and Video Understanding. However, these models often struggle with basic image transformations. This paper investigates the image-level understanding of VLMs, specifically CLIP by OpenAI and SigLIP by Google. Our findings reveal that these models lack comprehension of multiple image-level augmentations. To facilitate this study, we created an augmented version of the Flickr8k dataset, pairing each image with a detailed description of the applied transformation. We further explore how this deficiency impacts downstream tasks, particularly in image editing, and evaluate the performance of state-of-the-art Image2Image models on simple transformations.
zh

[NLP-57] Generative AI for Named Entity Recognition in Low-Resource Language Nepali

【速读】：该论文试图解决低资源语言（如尼泊尔语）中生成式人工智能（Generative AI, GenAI）在命名实体识别（Named Entity Recognition, NER）任务上的性能评估不足的问题。论文的关键在于探索最先进的大型语言模型（Large Language Models, LLMs）在尼泊尔语NER中的应用，并通过采用多种提示技术（prompting techniques）来评估其有效性，从而为低资源语言环境下GenAI模型的应用提供挑战与机遇的洞见，推动尼泊尔语等语言的自然语言处理（Natural Language Processing, NLP）研究进展。

链接: https://arxiv.org/abs/2503.09822
作者: Sameer Neupane(University of Memphis),Jeevan Chapagain(University of Memphis),Nobal B. Niraula(Nowa Lab),Diwa Koirala(Nowa Lab)
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: This paper has been accepted in the FLAIRS Conference 2025

点击查看摘要

Abstract:Generative Artificial Intelligence (GenAI), particularly Large Language Models (LLMs), has significantly advanced Natural Language Processing (NLP) tasks, such as Named Entity Recognition (NER), which involves identifying entities like person, location, and organization names in text. LLMs are especially promising for low-resource languages due to their ability to learn from limited data. However, the performance of GenAI models for Nepali, a low-resource language, has not been thoroughly evaluated. This paper investigates the application of state-of-the-art LLMs for Nepali NER, conducting experiments with various prompting techniques to assess their effectiveness. Our results provide insights into the challenges and opportunities of using LLMs for NER in low-resource settings and offer valuable contributions to the advancement of NLP research in languages like Nepali.
zh

[NLP-58] Attention Reveals More Than Tokens: Training-Free Long-Context Reasoning with Attention-guided Retrieval

【速读】：该论文试图解决大型语言模型（LLMs）在处理复杂推理任务时有效上下文长度不足的问题，尤其是在需要整合长上下文中多部分信息并执行多步推理的情况下。尽管链式思维（Chain-of-Thought, CoT）提示方法显示出减少任务复杂性的潜力，但实证分析表明其未能完全克服这一局限性。研究发现，失败的主要原因是隐式事实的记忆效果不佳，这显著影响了推理性能。有趣的是，生成的CoT标记的内部注意力权重能够有效地关联隐式事实，即使这些事实未被显式召回。基于此洞见，论文提出了一种无需训练的新算法Attrieval，该算法利用注意力权重从长上下文中检索相关事实并将其融入推理过程。此外，研究还发现从CoT标记中选择上下文标记进一步提升了性能。实验结果表明，Attrieval显著增强了多种模型在合成和真实问答数据集上的长上下文推理能力。

链接: https://arxiv.org/abs/2503.09819
作者: Yuwei Zhang,Jayanth Srinivasa,Gaowen Liu,Jingbo Shang
机构: University of California, San Diego (加州大学圣地亚哥分校); Cisco (思科)
类目: Computation and Language (cs.CL)
备注: Work in progress

点击查看摘要

Abstract:Large Language Models (LLMs) often exhibit substantially shorter effective context lengths than their claimed capacities, especially when handling complex reasoning tasks that require integrating information from multiple parts of a long context and performing multi-step reasoning. Although Chain-of-Thought (CoT) prompting has shown promise in reducing task complexity, our empirical analysis reveals that it does not fully resolve this limitation. Through controlled experiments, we identify poor recall of implicit facts as the primary cause of failure, which significantly hampers reasoning performance. Interestingly, we observe that the internal attention weights from the generated CoT tokens can effectively ground implicit facts, even when these facts are not explicitly recalled. Building on this insight, we propose a novel training-free algorithm, Attrieval, which leverages attention weights to retrieve relevant facts from the long context and incorporates them into the reasoning process. Additionally, we find that selecting context tokens from CoT tokens further improves performance. Our results demonstrate that Attrieval enhances long-context reasoning capability notably on both synthetic and real-world QA datasets with various models.
zh

[NLP-59] Communication-Efficient Language Model Training Scales Reliably and Robustly: Scaling Laws for DiLoCo

【速读】：该论文试图解决在大规模机器学习模型训练中，数据并行方法因频繁同步需求导致的显著减速问题，这一问题成为进一步扩展规模的关键障碍。论文的解决方案之关键是提出了一种名为DiLoCo的方法，该方法通过放松同步需求，在不牺牲模型质量的前提下提升训练效率。研究进一步表明，当在固定计算预算下训练大型语言模型（LLMs）时，DiLoCo不仅能够可预测且稳健地随模型规模扩展，而且在调优得当时，其扩展性能优于传统的数据并行训练，甚至在小规模模型中也能表现出色。

链接: https://arxiv.org/abs/2503.09799
作者: Zachary Charles,Gabriel Teston,Lucio Dery,Keith Rush,Nova Fallen,Zachary Garrett,Arthur Szlam,Arthur Douillard
机构: Google Research (谷歌研究); Google Search (谷歌搜索); Google DeepMind (谷歌深思维)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL); Distributed, Parallel, and Cluster Computing (cs.DC)
备注:

点击查看摘要

Abstract:As we scale to more massive machine learning models, the frequent synchronization demands inherent in data-parallel approaches create significant slowdowns, posing a critical challenge to further scaling. Recent work develops an approach (DiLoCo) that relaxes synchronization demands without compromising model quality. However, these works do not carefully analyze how DiLoCo’s behavior changes with model size. In this work, we study the scaling law behavior of DiLoCo when training LLMs under a fixed compute budget. We focus on how algorithmic factors, including number of model replicas, hyperparameters, and token budget affect training in ways that can be accurately predicted via scaling laws. We find that DiLoCo scales both predictably and robustly with model size. When well-tuned, DiLoCo scales better than data-parallel training with model size, and can outperform data-parallel training even at small model sizes. Our results showcase a more general set of benefits of DiLoCo than previously documented, including increased optimal batch sizes, improved downstream generalization with scale, and improved evaluation loss for a fixed token budget.
zh

[NLP-60] Constrained Language Generation with Discrete Diffusion Models

【速读】：该论文旨在解决大型语言模型（LLMs）在生成文本时难以确保输出符合用户定义指令或通用安全指南的问题。传统方法通常依赖于后处理过滤或模型再训练来实现可控生成，但这些方法效率较低且效果有限。论文提出的关键解决方案是Constrained Discrete Diffusion (CDD)，一种将离散扩散模型与可微分优化相结合的新方法，通过直接将约束条件嵌入到离散扩散采样过程中，实现在自然语言生成中高效且灵活地满足多种约束条件，包括毒性抑制、词法层面的字符和序列约束，以及具有特定属性的新分子序列生成等。实验结果表明，该方法在保持流畅性和语义连贯性的同时，能够高保真地满足各类约束需求，优于自回归方法和现有的离散扩散方法。

链接: https://arxiv.org/abs/2503.09790
作者: Michael Cardei,Jacob K Christopher,Thomas Hartvigsen,Brian R. Bartoldson,Bhavya Kailkhura,Ferdinando Fioretto
机构: University of Virginia (弗吉尼亚大学); Lawrence Livermore National Laboratory (劳伦斯利弗莫尔国家实验室)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Constraints are critical in text generation as LLM outputs are often unreliable when it comes to ensuring generated outputs adhere to user defined instruction or general safety guidelines. To address this gap, we present Constrained Discrete Diffusion (CDD), a novel method for enforcing constraints on natural language by integrating discrete diffusion models with differentiable optimization. Unlike conventional text generators, which often rely on post-hoc filtering or model retraining for controllable generation, we propose imposing constraints directly into the discrete diffusion sampling process. We illustrate how this technique can be applied to satisfy a variety of natural language constraints, including (i) toxicity mitigation by preventing harmful content from emerging, (ii) character and sequence level lexical constraints, and (iii) novel molecule sequence generation with specific property adherence. Experimental results show that our constraint-aware procedure achieves high fidelity in meeting these requirements while preserving fluency and semantic coherence, outperforming auto-regressive and existing discrete diffusion approaches.
zh

[NLP-61] Efficient Multi-Task Inferencing: Model Merging with Gromov-Wasserstein Feature Alignment

【速读】：该论文旨在解决在教育领域中为学生响应自动评分时，采用单独神经网络处理每个任务导致存储需求增加、维护成本提高以及冗余计算的问题。为应对这些挑战，论文提出了Gromov-Wasserstein评分模型合并（GW-SMM）方法，其关键是通过Gromov-Wasserstein距离度量特征分布相似性来合并模型。具体而言，该方法首先利用独立模型从学生响应中提取特征，捕捉特定题目的上下文信息及独特的学习表示；接着，利用Gromov-Wasserstein距离量化这些特征分布之间的相似性，确定最兼容的模型进行合并；最后，通过仅融合共享层前的分类头实现模型合并，从而构建统一的特征提取器同时保留独立的分类头以支持题目特定评分。这种方法不仅在多个评估指标上优于人工专家知识和基于GPT-o1的合并方法，还显著减少了存储需求且保持了较高的评分准确性。

链接: https://arxiv.org/abs/2503.09774
作者: Luyang Fang,Ehsan Latif,Haoran Lu,Yifan Zhou,Ping Ma,Xiaoming Zhai
机构: 未知
类目: Computation and Language (cs.CL)
备注: Submitted to AIED2025

点击查看摘要

Abstract:Automatic scoring of student responses enhances efficiency in education, but deploying a separate neural network for each task increases storage demands, maintenance efforts, and redundant computations. To address these challenges, this paper introduces the Gromov-Wasserstein Scoring Model Merging (GW-SMM) method, which merges models based on feature distribution similarities measured via the Gromov-Wasserstein distance. Our approach begins by extracting features from student responses using individual models, capturing both item-specific context and unique learned representations. The Gromov-Wasserstein distance then quantifies the similarity between these feature distributions, identifying the most compatible models for merging. Models exhibiting the smallest pairwise distances, typically in pairs or trios, are merged by combining only the shared layers preceding the classification head. This strategy results in a unified feature extractor while preserving separate classification heads for item-specific scoring. We validated our approach against human expert knowledge and a GPT-o1-based merging method. GW-SMM consistently outperformed both, achieving a higher micro F1 score, macro F1 score, exact match accuracy, and per-label accuracy. The improvements in micro F1 and per-label accuracy were statistically significant compared to GPT-o1-based merging (p=0.04, p=0.01). Additionally, GW-SMM reduced storage requirements by half without compromising much accuracy, demonstrating its computational efficiency alongside reliable scoring performance.
zh

[NLP-62] BiasConnect: Investigating Bias Interactions in Text-to-Image Models

【速读】：该论文试图解决Text-to-Image (TTI) 模型中偏差之间的复杂相互依赖关系问题，这些偏差通常被视为独立存在，但实际上可能彼此深度关联。论文的关键解决方案是引入BiasConnect，这是一种基于反事实框架的新工具，用于分析和量化TTI模型中的偏差交互。BiasConnect通过生成成对因果图揭示特定文本提示下偏差交互的潜在结构，并提供经验估计，表明当修改某一偏差维度时，其他偏差维度如何朝向或偏离理想分布移动。这种方法的估计值与偏差缓解后的相互依赖性观察结果具有较强的相关性（+0.69），从而证明了BiasConnect在选择最优偏差缓解方向、比较不同TTI模型所学习的依赖关系以及理解交叉社会偏见放大效应方面的实用价值。

链接: https://arxiv.org/abs/2503.09763
作者: Pushkar Shukla,Aditya Chinchure,Emily Diana,Alexander Tolbert,Kartik Hosanagar,Vineeth N. Balasubramanian,Leonid Sigal,Matthew A. Turk
机构: Toyota Technological Institute at Chicago (丰田技术研究院（芝加哥）); University of British Columbia (不列颠哥伦比亚大学); Carnegie Mellon University, Tepper School of Business (卡内基梅隆大学，泰珀商学院); Emory University (埃默里大学); University of Pennsylvania, The Wharton School (宾夕法尼亚大学，沃顿商学院); Indian Institute of Technology Hyderabad (海得拉巴印度理工学院)
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:The biases exhibited by Text-to-Image (TTI) models are often treated as if they are independent, but in reality, they may be deeply interrelated. Addressing bias along one dimension, such as ethnicity or age, can inadvertently influence another dimension, like gender, either mitigating or exacerbating existing disparities. Understanding these interdependencies is crucial for designing fairer generative models, yet measuring such effects quantitatively remains a challenge. In this paper, we aim to address these questions by introducing BiasConnect, a novel tool designed to analyze and quantify bias interactions in TTI models. Our approach leverages a counterfactual-based framework to generate pairwise causal graphs that reveals the underlying structure of bias interactions for the given text prompt. Additionally, our method provides empirical estimates that indicate how other bias dimensions shift toward or away from an ideal distribution when a given bias is modified. Our estimates have a strong correlation (+0.69) with the interdependency observations post bias mitigation. We demonstrate the utility of BiasConnect for selecting optimal bias mitigation axes, comparing different TTI models on the dependencies they learn, and understanding the amplification of intersectional societal biases in TTI models.
zh

[NLP-63] Review GIDE – Restaurant Review Gastrointestinal Illness Detection and Extraction with Large Language Models

【速读】：该论文试图解决食品传播的胃肠道（Gastrointestinal, GI）疾病传统监测方法面临的挑战，即许多未与医疗系统交互的病例难以被捕捉的问题。为应对这一挑战，论文提出利用公开的在线餐厅评论文本，并结合大型语言模型（Large Language Models, LLMs）进行疾病监测的扩展。解决方案的关键在于开发了一种新的注释方案，该方案不仅能够识别GI疾病的二元存在与否，还能详细提取症状和食物相关信息。通过评估基于提示的方法及微调RoBERTa模型在疾病检测、症状提取和食物提取三个任务上的表现，研究发现LLMs在所有任务上均实现了超过90%的micro-F1分数，且仅使用提示方法即可超越小型特定任务微调模型的表现。此外，实验表明LLMs在GI疾病检测中的鲁棒性较强，但仍需谨慎解读结果，因为餐厅评论数据本身存在固有限制。

链接: https://arxiv.org/abs/2503.09743
作者: Timothy Laurence,Joshua Harris,Leo Loman,Amy Douglas,Yung-Wai Chan,Luke Hounsome,Lesley Larkin,Michael Borowitz
机构: UK Health Security Agency (英国健康安全局)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 20 pages

点击查看摘要

Abstract:Foodborne gastrointestinal (GI) illness is a common cause of ill health in the UK. However, many cases do not interact with the healthcare system, posing significant challenges for traditional surveillance methods. The growth of publicly available online restaurant reviews and advancements in large language models (LLMs) present potential opportunities to extend disease surveillance by identifying public reports of GI illness. In this study, we introduce a novel annotation schema, developed with experts in GI illness, applied to the Yelp Open Dataset of reviews. Our annotations extend beyond binary disease detection, to include detailed extraction of information on symptoms and foods. We evaluate the performance of open-weight LLMs across these three tasks: GI illness detection, symptom extraction, and food extraction. We compare this performance to RoBERTa-based classification models fine-tuned specifically for these tasks. Our results show that using prompt-based approaches, LLMs achieve micro-F1 scores of over 90% for all three of our tasks. Using prompting alone, we achieve micro-F1 scores that exceed those of smaller fine-tuned models. We further demonstrate the robustness of LLMs in GI illness detection across three bias-focused experiments. Our results suggest that publicly available review text and LLMs offer substantial potential for public health surveillance of GI illness by enabling highly effective extraction of key information. While LLMs appear to exhibit minimal bias in processing, the inherent limitations of restaurant review data highlight the need for cautious interpretation of results.
zh

[NLP-64] Local Look-Ahead Guidance via Verifier-in-the-Loop for Automated Theorem Proving ICLR2025

【速读】：该论文旨在解决基于强化学习（Reinforcement Learning, RL）的自动定理证明（Automatic Theorem Proving, ATP）方法中因依赖完整推理轨迹反馈而导致计算成本高昂及奖励稀疏的问题。传统方法通常需要在推理轨迹完成后才能判断其正确性，导致RL中的稀疏奖励或需要昂贵的合成数据生成。论文的关键解决方案是提出了一种“验证器循环”（verifier-in-the-loop）的设计，与现有依赖整个推理轨迹反馈的方法不同，该设计利用自动化验证器（此处使用Lean作为验证工具）在推理过程的每一步提供中间反馈。实验结果表明，这种逐步局部验证的方式显著提升了模型的整体推理准确性和效率。

链接: https://arxiv.org/abs/2503.09730
作者: Sara Rajaee,Kumar Pratik,Gabriele Cesa,Arash Behboodi
机构: University of Amsterdam; Qualcomm AI Research (高通人工智能研究)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Logic in Computer Science (cs.LO)
备注: Accepted at ICLR 2025 Workshop on Reasoning and Planning for Large Language Models

点击查看摘要

Abstract:The most promising recent methods for AI reasoning require applying variants of reinforcement learning (RL) either on rolled out trajectories from the model, even for the step-wise rewards, or large quantities of human annotated trajectory data. The reliance on the rolled-out trajectory renders the compute cost and time prohibitively high. In particular, the correctness of a reasoning trajectory can typically only be judged at its completion, leading to sparse rewards in RL or requiring expensive synthetic data generation in expert iteration-like methods. In this work, we focus on the Automatic Theorem Proving (ATP) task and propose a novel verifier-in-the-loop design, which unlike existing approaches that leverage feedback on the entire reasoning trajectory, employs an automated verifier to give intermediate feedback at each step of the reasoning process. Using Lean as the verifier, we empirically show that the step-by-step local verification produces a global improvement in the model’s reasoning accuracy and efficiency.
zh

[NLP-65] Have LLM s Made Active Learning Obsolete? Surveying the NLP Community

【速读】：该论文试图探讨在大型语言模型（Large Language Models, LLMs）推动多种数据高效学习方法发展的背景下，主动学习（Active Learning）是否已经过时的问题。为全面解答这一问题，研究不仅依赖文献分析，还结合了实际经验，通过面向自然语言处理（NLP）社区的在线调查，收集关于数据标注相关性的直观洞察，特别是主动学习的最佳实践、面临的障碍以及未来发展趋势。关键在于通过实证研究验证主动学习在当前技术环境中的实际效用，并揭示其相较于十余年前仍存在的持续性挑战，如设置复杂性、成本降低估算及工具支持等。

链接: https://arxiv.org/abs/2503.09701
作者: Julia Romberg,Christopher Schröder,Julius Gonsior,Katrin Tomanek,Fredrik Olsson
机构: GESIS – Leibniz Institute for the Social Sciences (莱布尼茨社会科学研究 institute); Center for Scalable Data Analytics and Artificial Intelligence (ScaDS.AI) Dresden/Leipzig (德累斯顿/莱比锡可扩展数据分析和人工智能中心); TUD Dresden University of Technology (德累斯顿工业大学); Independent Researcher (独立研究员); All Ears (未知)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Supervised learning relies on annotated data, which is expensive to obtain. A longstanding strategy to reduce annotation costs is active learning, an iterative process, in which a human annotates only data instances deemed informative by a model. Large language models (LLMs) have pushed the effectiveness of active learning, but have also improved methods such as few- or zero-shot learning, and text synthesis - thereby introducing potential alternatives. This raises the question: has active learning become obsolete? To answer this fully, we must look beyond literature to practical experiences. We conduct an online survey in the NLP community to collect previously intangible insights on the perceived relevance of data annotation, particularly focusing on active learning, including best practices, obstacles and expected future developments. Our findings show that annotated data remains a key factor, and active learning continues to be relevant. While the majority of active learning users find it effective, a comparison with a community survey from over a decade ago reveals persistent challenges: setup complexity, estimation of cost reduction, and tooling. We publish an anonymized version of the collected dataset
zh

[NLP-66] Probabilistic Reasoning with LLM s for k-anonymity Estimation

【速读】：本文旨在解决在不确定性条件下对用户生成文档的隐私敏感信息进行k-匿名性估计的数值推理任务。论文的关键创新在于提出了一种名为BRANCH的方法，利用大型语言模型（LLMs）将联合概率分布分解，通过将文本信息建模为随机变量来估计匹配给定信息的人口规模（即k值）。该方法的核心在于使用独立的LLMs或检索增强生成系统估算各因子在人口中的发生概率，并将这些概率综合为最终的k值。实验结果显示，该方法在67%的情况下成功估计出正确的k值，比GPT-4o的链式思维推理提高了11%。此外，论文还利用LLMs的不确定性构建了k-匿名性的预测区间，在近92%的情况下包含正确值。

链接: https://arxiv.org/abs/2503.09674
作者: Jonathan Zheng,Sauvik Das,Alan Ritter,Wei Xu
机构: 未知
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 9 pages

点击查看摘要

Abstract:Probabilistic reasoning is a key aspect of both human and artificial intelligence that allows for handling uncertainty and ambiguity in decision-making. In this paper, we introduce a novel numerical reasoning task under uncertainty, focusing on estimating the k-anonymity of user-generated documents containing privacy-sensitive information. We propose BRANCH, which uses LLMs to factorize a joint probability distribution to estimate the k-value-the size of the population matching the given information-by modeling individual pieces of textual information as random variables. The probability of each factor occurring within a population is estimated using standalone LLMs or retrieval-augmented generation systems, and these probabilities are combined into a final k-value. Our experiments show that this method successfully estimates the correct k-value 67% of the time, an 11% increase compared to GPT-4o chain-of-thought reasoning. Additionally, we leverage LLM uncertainty to develop prediction intervals for k-anonymity, which include the correct value in nearly 92% of cases.
zh

[NLP-67] LLM -PS: Empowering Large Language Models for Time Series Forecasting with Temporal Patterns and Semantics

【速读】：该论文致力于解决现有基于大语言模型（Large Language Models, LLMs）的时间序列预测（Time Series Forecasting, TSF）方法性能不佳的问题。其核心在于现有方法通常忽视了时间序列数据的内在特性，而时间序列数据具有语义稀疏性和独特的时序模式，与LLM预训练所使用的文本数据显著不同。为了解决这一问题，论文提出了一种名为LLM-PS的方法，其关键是通过从时间序列数据中学习基本模式（\textit{Patterns}）和有意义的语义（\textit{Semantics}），以增强LLM在TSF中的能力。具体而言，LLM-PS结合了一个多尺度卷积神经网络，用于捕捉时间序列中的短期波动和长期趋势，并引入了一个时序转文本模块，以提取连续时间间隔而非孤立时间点的有价值语义信息。通过整合这些模式和语义，LLM-PS能够有效建模时序依赖性，实现对时间序列的深度理解并提供精准预测。实验结果表明，LLM-PS在短、长期预测任务以及少量样本和零样本设置下均达到了最先进的性能水平。

链接: https://arxiv.org/abs/2503.09656
作者: Jialiang Tang,Shuo Chen,Chen Gong,Jing Zhang,Dacheng Tao
机构: 未知
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Time Series Forecasting (TSF) is critical in many real-world domains like financial planning and health monitoring. Recent studies have revealed that Large Language Models (LLMs), with their powerful in-contextual modeling capabilities, hold significant potential for TSF. However, existing LLM-based methods usually perform suboptimally because they neglect the inherent characteristics of time series data. Unlike the textual data used in LLM pre-training, the time series data is semantically sparse and comprises distinctive temporal patterns. To address this problem, we propose LLM-PS to empower the LLM for TSF by learning the fundamental \textitPatterns and meaningful \textitSemantics from time series data. Our LLM-PS incorporates a new multi-scale convolutional neural network adept at capturing both short-term fluctuations and long-term trends within the time series. Meanwhile, we introduce a time-to-text module for extracting valuable semantics across continuous time intervals rather than isolated time points. By integrating these patterns and semantics, LLM-PS effectively models temporal dependencies, enabling a deep comprehension of time series and delivering accurate forecasts. Intensive experimental results demonstrate that LLM-PS achieves state-of-the-art performance in both short- and long-term forecasting tasks, as well as in few- and zero-shot settings.
zh

[NLP-68] Global Position Aware Group Choreography using Large Language Model

【速读】：该论文旨在解决多人群舞生成（multi-person dance generation）这一相对新颖的研究领域中的挑战。论文的关键在于将群舞生成问题建模为一个序列到序列的翻译任务，并利用大型语言模型（Large Language Models, LLM）的最新进展。解决方案的核心包括设计一个能够将连续特征转化为离散标记的分词器（tokenizer），以及对LLM进行微调以在给定音频标记的情况下预测运动标记。通过适当的输入模态分词方法和精心设计的LLM训练策略，该框架能够生成逼真且多样化的群舞，同时保持与音乐的良好关联性和个体舞者间的一致性。

链接: https://arxiv.org/abs/2503.09645
作者: Haozhou Pang,Tianwei Ding,Lanshan He,Qi Gan
机构: Soul AI (灵魂人工智能)(中国)
类目: Graphics (cs.GR); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Dance serves as a profound and universal expression of human culture, conveying emotions and stories through movements synchronized with music. Although some current works have achieved satisfactory results in the task of single-person dance generation, the field of multi-person dance generation remains relatively novel. In this work, we present a group choreography framework that leverages recent advancements in Large Language Models (LLM) by modeling the group dance generation problem as a sequence-to-sequence translation task. Our framework consists of a tokenizer that transforms continuous features into discrete tokens, and an LLM that is fine-tuned to predict motion tokens given the audio tokens. We show that by proper tokenization of input modalities and careful design of the LLM training strategies, our framework can generate realistic and diverse group dances while maintaining strong music correlation and dancer-wise consistency. Extensive experiments and evaluations demonstrate that our framework achieves state-of-the-art performance.
zh

[NLP-69] Can A Society of Generative Agents Simulate Human Behavior and Inform Public Health Policy? A Case Study on Vaccine Hesitancy

【速读】：该论文试图解决的问题是如何通过生成式人工智能（Generative AI）模拟一个沙盒社会以建模人类行为，从而减少对真实人类试验的过度依赖，用于评估公共卫生政策。论文以疫苗犹豫（vaccine hesitancy）作为案例研究，探讨利用生成式代理（generative agents）模拟与健康相关决策的可行性。解决方案的关键在于提出VacSim框架，该框架基于大型语言模型（Large Language Models, LLMs），包含100个生成式代理，通过模拟人口统计数据、构建社交网络以及调整代理对疫苗的态度来再现社会动态和疾病相关信息的影响。此外，VacSim还设计并评估了多种公共卫生干预措施以缓解疫苗犹豫，并引入模拟预热（simulation warmup）和态度调节（attitude modulation）机制以使模拟结果更贴近现实。实验表明，如Llama和Qwen等模型能够模拟部分人类行为，但也存在一致性挑战，例如与人口统计特征不一致的响应。这一初步探索旨在呼吁关注社会模拟在政策开发中的应用，而非直接提供政策指导。

链接: https://arxiv.org/abs/2503.09639
作者: Abe Bohan Hou,Hongru Du,Yichen Wang,Jingyu Zhang,Zixiao Wang,Paul Pu Liang,Daniel Khashabi,Lauren Gardner,Tianxing He
机构: 未知
类目: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:Can we simulate a sandbox society with generative agents to model human behavior, thereby reducing the over-reliance on real human trials for assessing public policies? In this work, we investigate the feasibility of simulating health-related decision-making, using vaccine hesitancy, defined as the delay in acceptance or refusal of vaccines despite the availability of vaccination services (MacDonald, 2015), as a case study. To this end, we introduce the VacSim framework with 100 generative agents powered by Large Language Models (LLMs). VacSim simulates vaccine policy outcomes with the following steps: 1) instantiate a population of agents with demographics based on census data; 2) connect the agents via a social network and model vaccine attitudes as a function of social dynamics and disease-related information; 3) design and evaluate various public health interventions aimed at mitigating vaccine hesitancy. To align with real-world results, we also introduce simulation warmup and attitude modulation to adjust agents’ attitudes. We propose a series of evaluations to assess the reliability of various LLM simulations. Experiments indicate that models like Llama and Qwen can simulate aspects of human behavior but also highlight real-world alignment challenges, such as inconsistent responses with demographic profiles. This early exploration of LLM-driven simulations is not meant to serve as definitive policy guidance; instead, it serves as a call for action to examine social simulation for policy development.
zh

[NLP-70] Factorio Learning Environment

【速读】：该论文试图解决大型语言模型（Large Language Models, LLMs）在现有基准测试中快速饱和的问题，提出需要新的开放式评估方法。为应对这一挑战，论文引入了基于游戏《Factorio》的Factorio学习环境（Factorio Learning Environment, FLE），用于测试智能体在长期规划、程序合成和资源优化方面的能力。FLE通过指数级难度扩展任务，从基础自动化到处理每秒数百万资源单位的复杂工厂，提供结构化实验室任务和开放性生成地图任务两种设置。论文的关键解决方案在于设计了一个具有渐进挑战性的环境，以揭示LLMs在空间推理和复杂自动化方面的局限性，同时评估其在短期技能与长期策略规划中的表现差异。

链接: https://arxiv.org/abs/2503.09617
作者: Jack Hopkins,Mart Bakler,Akbir Khan
机构: 未知
类目: Multiagent Systems (cs.MA); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) are rapidly saturating existing benchmarks, necessitating new open-ended evaluations. We introduce the Factorio Learning Environment (FLE), based on the game of Factorio, that tests agents in long-term planning, program synthesis, and resource optimization. FLE provides exponentially scaling challenges – from basic automation to complex factories processing millions of resource units per second. We provide two settings: (1) lab-play consisting of eight structured tasks with fixed resources, and (2) open-play with the unbounded task of building the largest factory on an procedurally generated map. We demonstrate across both settings that models still lack strong spatial reasoning. In lab-play, we find that LLMs exhibit promising short-horizon skills, yet are unable to operate effectively in constrained environments, reflecting limitations in error analysis. In open-play, while LLMs discover automation strategies that improve growth (e.g electric-powered drilling), they fail to achieve complex automation (e.g electronic-circuit manufacturing).
zh

[NLP-71] A Unified Framework with Novel Metrics for Evaluating the Effectiveness of XAI Techniques in LLM s

【速读】：该论文试图解决大型语言模型（LLMs）日益复杂化导致的透明性和可解释性挑战问题，提出了解决方案的关键在于引入一个包含四个创新指标的综合评估框架，用于评估五种解释性人工智能（XAI）技术在五个LLMs及两个下游任务中的有效性。通过此框架，论文评估了LIME、SHAP、Integrated Gradients、层归因传播（LRP）和注意力机制可视化（AMV）等技术，并聚焦于人类推理一致性（HA）、鲁棒性、一致性及对比性四项关键指标。研究结果揭示了不同XAI方法的优势与局限，为开发和选择适合LLMs的解释技术提供了重要指导。

链接: https://arxiv.org/abs/2503.05050
作者: Melkamu Abay Mersha,Mesay Gemeda Yigezu,Hassan shakil,Ali Al shami,Sanghyun Byun,Jugal Kalita
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: arXiv admin note: substantial text overlap with arXiv:2501.15374

点击查看摘要

Abstract:The increasing complexity of LLMs presents significant challenges to their transparency and interpretability, necessitating the use of eXplainable AI (XAI) techniques to enhance trustworthiness and usability. This study introduces a comprehensive evaluation framework with four novel metrics for assessing the effectiveness of five XAI techniques across five LLMs and two downstream tasks. We apply this framework to evaluate several XAI techniques LIME, SHAP, Integrated Gradients, Layer-wise Relevance Propagation (LRP), and Attention Mechanism Visualization (AMV) using the IMDB Movie Reviews and Tweet Sentiment Extraction datasets. The evaluation focuses on four key metrics: Human-reasoning Agreement (HA), Robustness, Consistency, and Contrastivity. Our results show that LIME consistently achieves high scores across multiple LLMs and evaluation metrics, while AMV demonstrates superior Robustness and near-perfect Consistency. LRP excels in Contrastivity, particularly with more complex models. Our findings provide valuable insights into the strengths and limitations of different XAI methods, offering guidance for developing and selecting appropriate XAI techniques for LLMs.
zh

[NLP-72] Evaluating the Effectiveness of XAI Techniques for Encoder-Based Language Models

【速读】：该论文试图解决大型语言模型（Large Language Models, LLMs）黑箱特性带来的透明性和可信性问题，通过开发可解释人工智能（eXplainable AI, XAI）技术来应对这一挑战。然而，评估这些XAI技术仍面临困难。为了解决这一问题，论文提出了一种通用的评估框架，使用四个关键指标：人类推理一致性（Human-reasoning Agreement, HA）、鲁棒性（Robustness）、一致性（Consistency）和对比性（Contrastivity）。解决方案的关键在于综合运用这一框架对来自五类不同XAI方法（包括模型简化、扰动法、梯度法、层归因传播以及基于注意力机制的方法）的六种具体技术进行系统评估，并验证其在五个基于编码器的语言模型上的表现，最终揭示出模型简化方法（如LIME）在多方面表现出色，特别是在复杂模型中的HA、鲁棒性和一致性，而基于注意力机制的可视化方法（AMV）则在鲁棒性和一致性方面尤为突出，层归因传播方法（LRP）在对比性评估中表现最佳。

链接: https://arxiv.org/abs/2501.15374
作者: Melkamu Abay Mersha,Mesay Gemeda Yigezu,Jugal Kalita
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:The black-box nature of large language models (LLMs) necessitates the development of eXplainable AI (XAI) techniques for transparency and trustworthiness. However, evaluating these techniques remains a challenge. This study presents a general evaluation framework using four key metrics: Human-reasoning Agreement (HA), Robustness, Consistency, and Contrastivity. We assess the effectiveness of six explainability techniques from five different XAI categories model simplification (LIME), perturbation-based methods (SHAP), gradient-based approaches (InputXGradient, Grad-CAM), Layer-wise Relevance Propagation (LRP), and attention mechanisms-based explainability methods (Attention Mechanism Visualization, AMV) across five encoder-based language models: TinyBERT, BERTbase, BERTlarge, XLM-R large, and DeBERTa-xlarge, using the IMDB Movie Reviews and Tweet Sentiment Extraction (TSE) datasets. Our findings show that the model simplification-based XAI method (LIME) consistently outperforms across multiple metrics and models, significantly excelling in HA with a score of 0.9685 on DeBERTa-xlarge, robustness, and consistency as the complexity of large language models increases. AMV demonstrates the best Robustness, with scores as low as 0.0020. It also excels in Consistency, achieving near-perfect scores of 0.9999 across all models. Regarding Contrastivity, LRP performs the best, particularly on more complex models, with scores up to 0.9371.
zh

[NLP-73] Explainable Artificial Intelligence: A Survey of Needs Techniques Applications and Future Direction

【速读】：该论文试图解决现有解释性人工智能（Explainable Artificial Intelligence, XAI）研究中缺乏全面综述的问题，特别是关于XAI模型的详细数学表示、设计方法学及其相关方面的深入探讨。论文的关键在于通过系统梳理XAI的常见术语与定义、XAI的需求、受益群体、方法分类以及其在不同领域的应用，为XAI研究人员、实践者、AI模型开发者及利益相关者提供一个全面的参考框架，以提升AI模型的信任度、透明性、可问责性和公平性。

链接: https://arxiv.org/abs/2409.00265
作者: Melkamu Mersha,Khang Lam,Joseph Wood,Ali AlShami,Jugal Kalita
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Artificial intelligence models encounter significant challenges due to their black-box nature, particularly in safety-critical domains such as healthcare, finance, and autonomous vehicles. Explainable Artificial Intelligence (XAI) addresses these challenges by providing explanations for how these models make decisions and predictions, ensuring transparency, accountability, and fairness. Existing studies have examined the fundamental concepts of XAI, its general principles, and the scope of XAI techniques. However, there remains a gap in the literature as there are no comprehensive reviews that delve into the detailed mathematical representations, design methodologies of XAI models, and other associated aspects. This paper provides a comprehensive literature review encompassing common terminologies and definitions, the need for XAI, beneficiaries of XAI, a taxonomy of XAI methods, and the application of XAI methods in different application areas. The survey is aimed at XAI researchers, XAI practitioners, AI model developers, and XAI beneficiaries who are interested in enhancing the trustworthiness, transparency, accountability, and fairness of their AI models.
zh

[NLP-74] Building Cooperative Embodied Agents Modularly with Large Language Models ICLR24

【速读】：本文旨在解决多智能体在分散控制、原始感官观测、昂贵通信以及多目标任务下的合作难题，这些任务以具身环境为背景。为应对这些问题，论文提出的关键方案是将大型语言模型（LLMs）的常识知识、推理能力、语言理解及文本生成能力无缝整合到一个受认知启发的模块化框架中，该框架融合感知、记忆与执行功能，构建出名为CoELA（ Cooperative Embodied Language Agent）的协作具身语言代理。CoELA能够高效规划、沟通并与其他智能体合作完成长期任务。实验表明，由GPT-4驱动的CoELA不仅超越了基于规划的方法，还展现出有效的新兴沟通能力；尽管当前开源模型如LLAMA-2表现仍有限，但通过微调，使用自收集数据的CoELA实现了显著性能提升。此外，用户研究显示，采用自然语言交流的CoELA更易赢得人类信任且合作效果更佳。本研究强调了LLMs在多智能体合作领域的重要潜力。

链接: https://arxiv.org/abs/2307.02485
作者: Hongxin Zhang,Weihua Du,Jiaming Shan,Qinhong Zhou,Yilun Du,Joshua B. Tenenbaum,Tianmin Shu,Chuang Gan
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注: ICLR24. The first two authors contributed equally

点击查看摘要

Abstract:In this work, we address challenging multi-agent cooperation problems with decentralized control, raw sensory observations, costly communication, and multi-objective tasks instantiated in various embodied environments. While previous research either presupposes a cost-free communication channel or relies on a centralized controller with shared observations, we harness the commonsense knowledge, reasoning ability, language comprehension, and text generation prowess of LLMs and seamlessly incorporate them into a cognitive-inspired modular framework that integrates with perception, memory, and execution. Thus building a Cooperative Embodied Language Agent CoELA, who can plan, communicate, and cooperate with others to accomplish long-horizon tasks efficiently. Our experiments on C-WAH and TDW-MAT demonstrate that CoELA driven by GPT-4 can surpass strong planning-based methods and exhibit emergent effective communication. Though current Open LMs like LLAMA-2 still underperform, we fine-tune a CoELA with data collected with our agents and show how they can achieve promising performance. We also conducted a user study for human-agent interaction and discovered that CoELA communicating in natural language can earn more trust and cooperate more effectively with humans. Our research underscores the potential of LLMs for future research in multi-agent cooperation. Videos can be found on the project website this https URL.
zh

计算机视觉

[CV-0] GoT: Unleashing Reasoning Capability of Multimodal Large Language Model for Visual Generation and Editing

【速读】：该论文旨在解决现有图像生成与编辑方法主要依赖直接处理文本提示而缺乏对视觉组成和显式操作推理的问题。论文提出了一种名为Generation Chain-of-Thought (GoT) 的新范式，通过在输出图像前进行显式的语言推理过程，将传统的文本到图像生成及编辑转化为由推理引导的框架，以分析语义关系和空间布局。关键在于定义了GoT的表述形式，并构建了一个包含超过900万个样本的大规模GoT数据集，其中详细记录了捕捉语义-空间关系的推理链。此外，论文实现了一个统一框架，结合Qwen2.5-VL用于推理链生成，并通过引入新型语义-空间引导模块增强端到端扩散模型。实验表明，GoT框架在生成和编辑任务上表现出色，显著优于基线方法。该方法还支持交互式视觉生成，允许用户明确修改推理步骤以实现精确的图像调整。

链接: https://arxiv.org/abs/2503.10639
作者: Rongyao Fang,Chengqi Duan,Kun Wang,Linjiang Huang,Hao Li,Shilin Yan,Hao Tian,Xingyu Zeng,Rui Zhao,Jifeng Dai,Xihui Liu,Hongsheng Li
机构: CUHK MMLab (香港中文大学多媒体实验室); HKU (香港大学); SenseTime (商汤科技); Shanghai AI Laboratory (上海人工智能实验室); THU (清华大学); BUAA (北京航空航天大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Dataset and models are released in this https URL

点击查看摘要

Abstract:Current image generation and editing methods primarily process textual prompts as direct inputs without reasoning about visual composition and explicit operations. We present Generation Chain-of-Thought (GoT), a novel paradigm that enables generation and editing through an explicit language reasoning process before outputting images. This approach transforms conventional text-to-image generation and editing into a reasoning-guided framework that analyzes semantic relationships and spatial arrangements. We define the formulation of GoT and construct large-scale GoT datasets containing over 9M samples with detailed reasoning chains capturing semantic-spatial relationships. To leverage the advantages of GoT, we implement a unified framework that integrates Qwen2.5-VL for reasoning chain generation with an end-to-end diffusion model enhanced by our novel Semantic-Spatial Guidance Module. Experiments show our GoT framework achieves excellent performance on both generation and editing tasks, with significant improvements over baselines. Additionally, our approach enables interactive visual generation, allowing users to explicitly modify reasoning steps for precise image adjustments. GoT pioneers a new direction for reasoning-driven visual generation and editing, producing images that better align with human intent. To facilitate future research, we make our datasets, code, and pretrained models publicly available at this https URL.
zh

[CV-1] Studying Classifier(-Free) Guidance From a Classifier-Centric Perspective

【速读】：该论文旨在解决classifier-free guidance在条件生成任务中的理解不足问题。论文的关键在于从classifier guidance入手，重新审视其基本假设，并通过系统研究揭示分类器在引导去噪扩散模型中的作用。研究发现，无论是classifier guidance还是classifier-free guidance，其核心机制均是通过推动去噪扩散轨迹远离决策边界（通常混杂且难以学习的区域）来实现条件生成。基于这一以分类器为中心的理解，论文提出了一种通用的后处理步骤，利用flow-matching技术缩小预训练去噪扩散模型与真实数据分布之间的差距，特别是在决策边界附近。实验结果验证了所提方法的有效性。

链接: https://arxiv.org/abs/2503.10638
作者: Xiaoming Zhao,Alexander G. Schwing
机构: University of Illinois Urbana-Champaign (伊利诺伊大学厄巴纳-香槟分校)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Classifier-free guidance has become a staple for conditional generation with denoising diffusion models. However, a comprehensive understanding of classifier-free guidance is still missing. In this work, we carry out an empirical study to provide a fresh perspective on classifier-free guidance. Concretely, instead of solely focusing on classifier-free guidance, we trace back to the root, i.e., classifier guidance, pinpoint the key assumption for the derivation, and conduct a systematic study to understand the role of the classifier. We find that both classifier guidance and classifier-free guidance achieve conditional generation by pushing the denoising diffusion trajectories away from decision boundaries, i.e., areas where conditional information is usually entangled and is hard to learn. Based on this classifier-centric understanding, we propose a generic postprocessing step built upon flow-matching to shrink the gap between the learned distribution for a pre-trained denoising diffusion model and the real data distribution, majorly around the decision boundaries. Experiments on various datasets verify the effectiveness of the proposed approach.
zh

[CV-2] Distilling Diversity and Control in Diffusion Models

【速读】：该论文旨在解决蒸馏扩散模型（Distilled Diffusion Models）因多样性损失而导致的样本多样性减少问题。尽管蒸馏模型在多样性方面表现较差，但研究表明它们保留了基础模型（Base Models）的核心概念表示。论文的关键创新在于提出了控制蒸馏（Control Distillation）的概念，即通过概念滑块（Concept Sliders）和LoRAs等控制机制可以在基础模型与蒸馏模型之间无缝迁移，而无需重新训练。此外，论文通过引入扩散目标可视化（Diffusion Target Visualization）工具，揭示了蒸馏过程中多样性崩溃的机制，并发现早期扩散步长（timesteps）对输出多样性起决定性作用。基于这些洞察，论文提出了一种混合推理方法——多样性蒸馏（Diversity Distillation），该方法仅在推理的初始关键步骤中使用基础模型，随后切换到高效的蒸馏模型。这一简单修改不仅恢复了从基础模型到蒸馏模型的多样性能力，甚至超越了基础模型的表现，同时保持了接近蒸馏模型的计算效率，且无需额外的训练或模型调整。

链接: https://arxiv.org/abs/2503.10637
作者: Rohit Gandikota,David Bau
机构: Northeastern University (东北大学)
类目: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
备注: Project Page: this https URL

点击查看摘要

Abstract:Distilled diffusion models suffer from a critical limitation: reduced sample diversity compared to their base counterparts. In this work, we uncover that despite this diversity loss, distilled models retain the fundamental concept representations of base models. We demonstrate control distillation - where control mechanisms like Concept Sliders and LoRAs trained on base models can be seamlessly transferred to distilled models and vice-versa, effectively distilling control without any retraining. This preservation of representational structure prompted our investigation into the mechanisms of diversity collapse during distillation. To understand how distillation affects diversity, we introduce Diffusion Target (DT) Visualization, an analysis and debugging tool that reveals how models predict final outputs at intermediate steps. Through DT-Visualization, we identify generation artifacts, inconsistencies, and demonstrate that initial diffusion timesteps disproportionately determine output diversity, while later steps primarily refine details. Based on these insights, we introduce diversity distillation - a hybrid inference approach that strategically employs the base model for only the first critical timestep before transitioning to the efficient distilled model. Our experiments demonstrate that this simple modification not only restores the diversity capabilities from base to distilled models but surprisingly exceeds it, while maintaining nearly the computational efficiency of distilled inference, all without requiring additional training or model modifications. Our code and data are available at this https URL
zh

[CV-3] he Curse of Conditions: Analyzing and Improving Optimal Transport for Conditional Flow-Based Generation

【速读】：该论文旨在解决在条件设置下(minibatch optimal transport)表现不佳的问题。其核心问题是，默认的最优传输映射在训练过程中忽略了条件信息，导致条件偏差的先验分布，而在测试阶段无法利用此偏差先验，仅能采样自无偏的全先验分布，从而造成训练与测试之间的性能差距。为了解决这一问题，论文的关键解决方案是提出条件最优传输(Conditional Optimal Transport, C²OT)，通过在计算最优传输分配时引入条件加权项到代价矩阵中，以修正条件偏差问题。实验表明，这一简单改进在离散和连续条件下均有效，并在多种任务中优于现有基线方法。

链接: https://arxiv.org/abs/2503.10636
作者: Ho Kei Cheng,Alexander Schwing
机构: 未知
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注: Project page: this https URL

点击查看摘要

Abstract:Minibatch optimal transport coupling straightens paths in unconditional flow matching. This leads to computationally less demanding inference as fewer integration steps and less complex numerical solvers can be employed when numerically solving an ordinary differential equation at test time. However, in the conditional setting, minibatch optimal transport falls short. This is because the default optimal transport mapping disregards conditions, resulting in a conditionally skewed prior distribution during training. In contrast, at test time, we have no access to the skewed prior, and instead sample from the full, unbiased prior distribution. This gap between training and testing leads to a subpar performance. To bridge this gap, we propose conditional optimal transport C^2OT that adds a conditional weighting term in the cost matrix when computing the optimal transport assignment. Experiments demonstrate that this simple fix works with both discrete and continuous conditions in 8gaussians-to-moons, CIFAR-10, ImageNet-32x32, and ImageNet-256x256. Our method performs better overall compared to the existing baselines across different function evaluation budgets. Code is available at this https URL
zh

[CV-4] A Frustratingly Simple Yet Highly Effective Attack Baseline: Over 90% Success Rate Against the Strong Black-box Models of GPT -4.5/4o/o1

【速读】：该论文旨在解决现有基于迁移的针对开源大型视觉语言模型（Large Vision-Language Models, LVLMs）的对抗攻击方法在黑盒商业LVLMs上失败的问题。研究发现，失败的对抗扰动通常来源于均匀分布，缺乏清晰的语义细节，导致目标模型产生非预期响应。为克服这一挑战，论文的关键在于通过编码局部区域内的显式语义细节来增强语义清晰度，并聚焦于语义丰富的区域进行修改，而非均匀应用扰动。为此，论文提出了一种简单但非常有效的解决方案：在每次优化步骤中，随机裁剪对抗图像以特定宽高比和尺度，调整大小后将其与目标图像对齐到嵌入空间。实验结果验证了该方法的有效性，所提出的局部聚合扰动显著提升了对抗样本在包括GPT-4.5、GPT-4o等在内的多种商业LVLMs上的迁移能力，成功率达到90%以上，大幅超越现有最先进的攻击方法。

链接: https://arxiv.org/abs/2503.10635
作者: Zhaoyi Li,Xiaohan Zhao,Dong-Dong Wu,Jiacheng Cui,Zhiqiang Shen
机构: VILA Lab (VILA 实验室), MBZUAI ( Mohamed bin Zayed University of Artificial Intelligence )
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Code at: this https URL

点击查看摘要

Abstract:Despite promising performance on open-source large vision-language models (LVLMs), transfer-based targeted attacks often fail against black-box commercial LVLMs. Analyzing failed adversarial perturbations reveals that the learned perturbations typically originate from a uniform distribution and lack clear semantic details, resulting in unintended responses. This critical absence of semantic information leads commercial LVLMs to either ignore the perturbation entirely or misinterpret its embedded semantics, thereby causing the attack to fail. To overcome these issues, we notice that identifying core semantic objects is a key objective for models trained with various datasets and methodologies. This insight motivates our approach that refines semantic clarity by encoding explicit semantic details within local regions, thus ensuring interoperability and capturing finer-grained features, and by concentrating modifications on semantically rich areas rather than applying them uniformly. To achieve this, we propose a simple yet highly effective solution: at each optimization step, the adversarial image is cropped randomly by a controlled aspect ratio and scale, resized, and then aligned with the target image in the embedding space. Experimental results confirm our hypothesis. Our adversarial examples crafted with local-aggregated perturbations focused on crucial regions exhibit surprisingly good transferability to commercial LVLMs, including GPT-4.5, GPT-4o, Gemini-2.0-flash, Claude-3.5-sonnet, Claude-3.7-sonnet, and even reasoning models like o1, Claude-3.7-thinking and Gemini-2.0-flash-thinking. Our approach achieves success rates exceeding 90% on GPT-4.5, 4o, and o1, significantly outperforming all prior state-of-the-art attack methods. Our optimized adversarial examples under different configurations and training code are available at this https URL.
zh

[CV-5] V2Edit: Versatile Video Diffusion Editor for Videos and 3D Scenes

【速读】：该论文旨在解决指令引导下视频与三维场景编辑中原始内容保留与编辑任务完成之间的平衡难题。解决方案的关键在于提出了一种无需训练的渐进策略，将复杂的编辑任务分解为一系列简单的子任务，并通过初始噪声、每步去噪过程中添加的噪声以及文本提示与视频内容间的交叉注意力图这三种关键协同机制对每个子任务进行控制，从而确保原始视频元素的稳健保留同时有效应用所需编辑。此外，通过“渲染-编辑-重构”过程，该方法进一步扩展至三维场景编辑，实现了高质量且三维一致的编辑效果，即使在涉及显著几何变化的任务中亦表现优异。

链接: https://arxiv.org/abs/2503.10634
作者: Yanming Zhang,Jun-Kun Chen,Jipeng Lyu,Yu-Xiong Wang
机构: Zhejiang University (浙江大学); University of Illinois Urbana-Champaign (伊利诺伊大学香槟分校)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project Website: this https URL

点击查看摘要

Abstract:This paper introduces V ^2 Edit, a novel training-free framework for instruction-guided video and 3D scene editing. Addressing the critical challenge of balancing original content preservation with editing task fulfillment, our approach employs a progressive strategy that decomposes complex editing tasks into a sequence of simpler subtasks. Each subtask is controlled through three key synergistic mechanisms: the initial noise, noise added at each denoising step, and cross-attention maps between text prompts and video content. This ensures robust preservation of original video elements while effectively applying the desired edits. Beyond its native video editing capability, we extend V ^2 Edit to 3D scene editing via a “render-edit-reconstruct” process, enabling high-quality, 3D-consistent edits even for tasks involving substantial geometric changes such as object insertion. Extensive experiments demonstrate that our V ^2 Edit achieves high-quality and successful edits across various challenging video editing tasks and complex 3D scene editing tasks, thereby establishing state-of-the-art performance in both domains.
zh

[CV-6] Kolmogorov-Arnold Attention: Is Learnable Attention Better For Vision Transformers?

【速读】：该论文试图解决的问题是如何将 Kolmogorov-Arnold 网络（KANs）有效集成到视觉 Transformer（ViTs）等先进架构中，并评估其在多元机器学习任务中的性能。论文的关键创新在于设计了一种通用可学习的 Kolmogorov-Arnold 注意力机制（KArAt），并通过进一步优化提出了模块化版本 Fourier-KArAt。这些方法通过在 CIFAR-10、CIFAR-100 和 ImageNet-1K 数据集上的实验，展示了与传统 ViT 相当或更优的性能。论文还通过分析损失景观、权重分布、优化路径、注意力可视化及频谱行为，深入探讨了这些架构的性能和泛化能力，旨在鼓励研究社区探索 KANs 在复杂架构中的应用潜力。

链接: https://arxiv.org/abs/2503.10632
作者: Subhajit Maity,Killian Hitsman,Xin Li,Aritra Dutta
机构: 未知
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注: Preprint, Appendix included

点击查看摘要

Abstract:Kolmogorov-Arnold networks (KANs) are a remarkable innovation consisting of learnable activation functions with the potential to capture more complex relationships from data. Although KANs are useful in finding symbolic representations and continual learning of one-dimensional functions, their effectiveness in diverse machine learning (ML) tasks, such as vision, remains questionable. Presently, KANs are deployed by replacing multilayer perceptrons (MLPs) in deep network architectures, including advanced architectures such as vision Transformers (ViTs). In this paper, we are the first to design a general learnable Kolmogorov-Arnold Attention (KArAt) for vanilla ViTs that can operate on any choice of basis. However, the computing and memory costs of training them motivated us to propose a more modular version, and we designed particular learnable attention, called Fourier-KArAt. Fourier-KArAt and its variants either outperform their ViT counterparts or show comparable performance on CIFAR-10, CIFAR-100, and ImageNet-1K datasets. We dissect these architectures’ performance and generalization capacity by analyzing their loss landscapes, weight distributions, optimizer path, attention visualization, and spectral behavior, and contrast them with vanilla ViTs. The goal of this paper is not to produce parameter- and compute-efficient attention, but to encourage the community to explore KANs in conjunction with more advanced architectures that require a careful understanding of learnable activations. Our open-source code and implementation details are available on: this https URL
zh

[CV-7] HybridVLA: Collaborative Diffusion and Autoregression in a Unified Vision-Language-Action Model

【速读】：该论文旨在解决现有视觉-语言-动作（Vision-Language-Action, VLA）模型在连续动作预测中的不连续性问题以及单纯依赖视觉-语言模型特征导致的推理能力局限性。论文的关键创新在于提出了一种名为HybridVLA的统一框架，通过将自回归策略与扩散策略无缝集成于单一的大规模语言模型中，而非简单拼接两者。为弥合不同生成方式之间的差距，作者设计了一种协作训练方法，将扩散建模直接注入到下一个词预测过程中。这种方法不仅使两种动作预测形式相互增强，还在不同任务中表现出差异化的性能。此外，论文进一步提出了协作动作集成机制，以自适应融合这两种预测结果，从而实现更稳健的控制效果。实验结果显示，HybridVLA在多种模拟和真实世界任务中超越了先前最先进的VLA方法，并在未见过的机器人配置中展示了稳定的操控能力。

链接: https://arxiv.org/abs/2503.10631
作者: Jiaming Liu,Hao Chen,Pengju An,Zhuoyang Liu,Renrui Zhang,Chenyang Gu,Xiaoqi Li,Ziyu Guo,Sixiang Chen,Mengzhen Liu,Chengkai Hou,Mengdi Zhao,KC alex Zhou,Pheng-Ann Heng,Shanghang Zhang
机构: State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University (北京大学); Beijing Academy of Artificial Intelligence (BAAI) (北京智源人工智能研究院); CUHK (香港中文大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注:

点击查看摘要

Abstract:Recent advancements in vision-language models (VLMs) for common-sense reasoning have led to the development of vision-language-action (VLA) models, enabling robots to perform generalized manipulation. Although existing autoregressive VLA methods leverage large-scale pretrained knowledge, they disrupt the continuity of actions. Meanwhile, some VLA methods incorporate an additional diffusion head to predict continuous actions, relying solely on VLM-extracted features, which limits their reasoning capabilities. In this paper, we introduce HybridVLA, a unified framework that seamlessly integrates the strengths of both autoregressive and diffusion policies within a single large language model, rather than simply connecting them. To bridge the generation gap, a collaborative training recipe is proposed that injects the diffusion modeling directly into the next-token prediction. With this recipe, we find that these two forms of action prediction not only reinforce each other but also exhibit varying performance across different tasks. Therefore, we design a collaborative action ensemble mechanism that adaptively fuses these two predictions, leading to more robust control. In experiments, HybridVLA outperforms previous state-of-the-art VLA methods across various simulation and real-world tasks, including both single-arm and dual-arm robots, while demonstrating stable manipulation in previously unseen configurations.
zh

[CV-8] UniGoal: Towards Universal Zero-shot Goal-oriented Navigation CVPR2025

【速读】：本文旨在解决通用零样本目标导向导航（universal zero-shot goal-oriented navigation）的问题。现有零样本方法基于特定任务的大语言模型（Large Language Model, LLM）构建推理框架，但其整体流程差异较大，难以在不同类型的导航目标间泛化。为实现通用零样本导航，论文提出了一种统一图表示（uniform graph representation），将不同类型的导航目标（如物体类别、实例图像和文本描述）进行统一建模。同时，将智能体的观测转换为在线维护的场景图（scene graph）。通过这种一致的场景和目标表示方式，相比纯文本保留了更多结构信息，并能够利用LLM进行显式的基于图的推理。关键在于提出了在每个时间步长进行场景图与目标图匹配的方法，并根据不同的匹配状态设计了生成长期探索目标的不同策略。当完全不匹配时，智能体迭代搜索目标子图；部分匹配时，利用坐标投影和锚点对齐推断目标位置；最终通过场景图修正和目标验证实现完美匹配。此外，还引入黑名单机制以确保各阶段切换的鲁棒性。实验结果表明，所提出的UniGoal模型在多个基准数据集上的三种导航任务中实现了最先进的零样本性能，甚至超越了任务专用的零样本方法和监督学习的通用方法。

链接: https://arxiv.org/abs/2503.10630
作者: Hang Yin,Xiuwei Xu,Lingqing Zhao,Ziwei Wang,Jie Zhou,Jiwen Lu
机构: Tsinghua University (清华大学); Nanyang Technological University (南洋理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注: Accepted to CVPR 2025

点击查看摘要

Abstract:In this paper, we propose a general framework for universal zero-shot goal-oriented navigation. Existing zero-shot methods build inference framework upon large language models (LLM) for specific tasks, which differs a lot in overall pipeline and fails to generalize across different types of goal. Towards the aim of universal zero-shot navigation, we propose a uniform graph representation to unify different goals, including object category, instance image and text description. We also convert the observation of agent into an online maintained scene graph. With this consistent scene and goal representation, we preserve most structural information compared with pure text and are able to leverage LLM for explicit graph-based reasoning. Specifically, we conduct graph matching between the scene graph and goal graph at each time instant and propose different strategies to generate long-term goal of exploration according to different matching states. The agent first iteratively searches subgraph of goal when zero-matched. With partial matching, the agent then utilizes coordinate projection and anchor pair alignment to infer the goal location. Finally scene graph correction and goal verification are applied for perfect matching. We also present a blacklist mechanism to enable robust switch between stages. Extensive experiments on several benchmarks show that our UniGoal achieves state-of-the-art zero-shot performance on three studied navigation tasks with a single model, even outperforming task-specific zero-shot methods and supervised universal methods.
zh

[CV-9] Hierarchical Self-Supervised Adversarial Training for Robust Vision Models in Histopathology

【速读】：该论文致力于解决在医疗等关键领域中，针对视觉模型的对抗攻击所带来的可靠性挑战，尤其是在生物医学和显微镜图像领域的鲁棒性提升有限的问题。传统方法忽视了组织病理学图像的层次结构特性（即患者-切片-Patch之间的关系），而这些层次关系蕴含了有价值的判别信号。论文的关键创新在于提出了分层自监督对抗训练（Hierarchical Self-Supervised Adversarial Training, HSAT），通过多级对比学习利用层次结构信息生成对抗样本，并将其整合到对抗训练中以增强模型的鲁棒性。实验表明，HSAT不仅显著提升了白盒设置下的鲁棒性（平均提升54.31%），还大幅降低了黑盒设置下的性能下降幅度（仅3-4%，远优于基线的25-30%）。

链接: https://arxiv.org/abs/2503.10629
作者: Hashmat Shadab Malik,Shahina Kunhimon,Muzammal Naseer,Fahad Shahbaz Khan,Salman Khan
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Adversarial attacks pose significant challenges for vision models in critical fields like healthcare, where reliability is essential. Although adversarial training has been well studied in natural images, its application to biomedical and microscopy data remains limited. Existing self-supervised adversarial training methods overlook the hierarchical structure of histopathology images, where patient-slide-patch relationships provide valuable discriminative signals. To address this, we propose Hierarchical Self-Supervised Adversarial Training (HSAT), which exploits these properties to craft adversarial examples using multi-level contrastive learning and integrate it into adversarial training for enhanced robustness. We evaluate HSAT on multiclass histopathology dataset OpenSRH and the results show that HSAT outperforms existing methods from both biomedical and natural image domains. HSAT enhances robustness, achieving an average gain of 54.31% in the white-box setting and reducing performance drops to 3-4% in the black-box setting, compared to 25-30% for the baseline. These results set a new benchmark for adversarial training in this domain, paving the way for more robust models. Our Code for training and evaluation is available at this https URL.
zh

[CV-10] NIL: No-data Imitation Learning by Leverag ing Pre-trained Video Diffusion Models

【速读】：该论文旨在解决在多样且非常规形态（如人形机器人、四足动物及生物）上获取物理逼真的运动技能的问题。传统方法如强化学习依赖于特定任务和体态，需要大量奖励函数的设计且泛化能力不足；而模仿学习虽提供替代方案，但高度依赖高质量专家演示数据，这类数据对于非人类形态难以获得。论文提出了一种不依赖数据的新方法，通过从2D生成视频中学习三维运动技能，并具备向非常规及非人类形态泛化的潜力。关键在于利用视觉变换器引导模仿学习过程，通过计算视频嵌入之间的成对距离进行基于视频的比较，并结合分割视频帧间的相似度计算作为指导奖励，从而有效替代模仿学习中的数据收集环节，转而使用数据生成技术。

链接: https://arxiv.org/abs/2503.10626
作者: Mert Albaba,Chenhao Li,Markos Diomataris,Omid Taheri,Andreas Krause,Michael Black
机构: ETH Zürich (苏黎世联邦理工学院); Max Planck Institute for Intelligent Systems (马克斯·普朗克智能系统研究所)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
备注:

点击查看摘要

Abstract:Acquiring physically plausible motor skills across diverse and unconventional morphologies-including humanoid robots, quadrupeds, and animals-is essential for advancing character simulation and robotics. Traditional methods, such as reinforcement learning (RL) are task- and body-specific, require extensive reward function engineering, and do not generalize well. Imitation learning offers an alternative but relies heavily on high-quality expert demonstrations, which are difficult to obtain for non-human morphologies. Video diffusion models, on the other hand, are capable of generating realistic videos of various morphologies, from humans to ants. Leveraging this capability, we propose a data-independent approach for skill acquisition that learns 3D motor skills from 2D-generated videos, with generalization capability to unconventional and non-human forms. Specifically, we guide the imitation learning process by leveraging vision transformers for video-based comparisons by calculating pair-wise distance between video embeddings. Along with video-encoding distance, we also use a computed similarity between segmented video frames as a guidance reward. We validate our method on locomotion tasks involving unique body configurations. In humanoid robot locomotion tasks, we demonstrate that ‘No-data Imitation Learning’ (NIL) outperforms baselines trained on 3D motion-capture data. Our results highlight the potential of leveraging generative video models for physically plausible skill learning with diverse morphologies, effectively replacing data collection with data generation for imitation learning.
zh

[CV-11] LHM: Large Animatable Human Reconstruction Model from a Single Image in Seconds

【速读】：该论文旨在解决从单张图像重建可动画化（animatable）三维人体模型的难题，这一问题的核心挑战在于如何在几何结构、外观及形变之间进行解耦。现有基于静态三维人体重建的方法主要依赖合成数据集训练，这限制了其泛化能力；而基于优化的视频方法虽能达到更高的重建精度，但需要严格的捕捉条件且计算开销巨大。为应对这些局限性，本文提出LHM（Large Animatable Human Reconstruction Model），通过前馈过程高效推断高保真度的3D高斯点云表示的人体 avatar。方案的关键在于采用多模态Transformer架构结合注意力机制来有效编码人体位置特征与图像特征，并引入头部特征金字塔编码以聚合多尺度特征，从而实现服装几何与纹理的细节保留，同时提升面部身份特征保存及精细细节恢复的能力。实验表明，LHM能够在数秒内生成逼真的可动画化人体，且无需额外处理面部与手部区域，显著优于现有方法在重建精度与泛化能力方面的表现。

链接: https://arxiv.org/abs/2503.10625
作者: Lingteng Qiu,Xiaodong Gu,Peihao Li,Qi Zuo,Weichao Shen,Junfei Zhang,Kejie Qiu,Weihao Yuan,Guanying Chen,Zilong Dong,Liefeng Bo
机构: Tongyi Lab, Alibaba Group (通义实验室, 阿里巴巴集团)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Project Page: this https URL

点击查看摘要

Abstract:Animatable 3D human reconstruction from a single image is a challenging problem due to the ambiguity in decoupling geometry, appearance, and deformation. Recent advances in 3D human reconstruction mainly focus on static human modeling, and the reliance of using synthetic 3D scans for training limits their generalization ability. Conversely, optimization-based video methods achieve higher fidelity but demand controlled capture conditions and computationally intensive refinement processes. Motivated by the emergence of large reconstruction models for efficient static reconstruction, we propose LHM (Large Animatable Human Reconstruction Model) to infer high-fidelity avatars represented as 3D Gaussian splatting in a feed-forward pass. Our model leverages a multimodal transformer architecture to effectively encode the human body positional features and image features with attention mechanism, enabling detailed preservation of clothing geometry and texture. To further boost the face identity preservation and fine detail recovery, we propose a head feature pyramid encoding scheme to aggregate multi-scale features of the head regions. Extensive experiments demonstrate that our LHM generates plausible animatable human in seconds without post-processing for face and hands, outperforming existing methods in both reconstruction accuracy and generalization ability.
zh

[CV-12] ETCH: Generalizing Body Fitting to Clothed Humans via Equivariant Tightness

【速读】：该论文致力于解决将三维着装人体点云拟合到人体模型上的任务，这一任务具有普遍性但极具挑战性。传统基于优化的方法依赖多阶段流程，对姿态初始化敏感；而近期基于学习的方法在多样化姿态和服装类型上的泛化能力往往不足。论文提出的解决方案——等变贴合拟合（Equivariant Tightness Fitting for Clothed Humans, 简称ETCH），其关键是通过局部近似的SE(3)等变性估计服装表面到人体表面的映射，将紧身程度编码为位移向量，并通过此映射回归姿态不变的人体特征以简化拟合任务为内体标记拟合问题。实验表明，ETCH在宽松服装的拟合精度（提升16.7%~69.5%）和形状精度（平均提升49.9%）方面显著优于现有方法，同时在单次或分布外场景下可减少方向误差达67.2%~89.8%。

链接: https://arxiv.org/abs/2503.10624
作者: Boqian Li,Haiwen Feng,Zeyu Cai,Michael J. Black,Yuliang Xiu
机构: Westlake University (西湖大学); Max Planck Institute for Intelligent Systems (马克斯·普朗克智能系统研究所); Berkeley AI Research (BAIR) (伯克利人工智能研究实验室)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR)
备注: Page: this https URL , Code: this https URL

点击查看摘要

Abstract:Fitting a body to a 3D clothed human point cloud is a common yet challenging task. Traditional optimization-based approaches use multi-stage pipelines that are sensitive to pose initialization, while recent learning-based methods often struggle with generalization across diverse poses and garment types. We propose Equivariant Tightness Fitting for Clothed Humans, or ETCH, a novel pipeline that estimates cloth-to-body surface mapping through locally approximate SE(3) equivariance, encoding tightness as displacement vectors from the cloth surface to the underlying body. Following this mapping, pose-invariant body features regress sparse body markers, simplifying clothed human fitting into an inner-body marker fitting task. Extensive experiments on CAPE and 4D-Dress show that ETCH significantly outperforms state-of-the-art methods – both tightness-agnostic and tightness-aware – in body fitting accuracy on loose clothing (16.7% ~ 69.5%) and shape accuracy (average 49.9%). Our equivariant tightness design can even reduce directional errors by (67.2% ~ 89.8%) in one-shot (or out-of-distribution) settings. Qualitative results demonstrate strong generalization of ETCH, regardless of challenging poses, unseen shapes, loose clothing, and non-rigid dynamics. We will release the code and models soon for research purposes at this https URL.
zh

[CV-13] DriveLMM-o1 : A Step-by-Step Reasoning Dataset and Large Multimodal Model for Driving Scenario Understanding

【速读】：该论文旨在解决现有大型多模态模型（Large Multimodal Models, LMMs）在复杂多步推理任务中的局限性，特别是在自动驾驶场景下的视觉问答（Visual Question Answering, VQA）问题。当前VQA基准通常仅关注最终答案的准确性，而忽视了生成准确答案所需的推理过程，同时缺乏评估实际驾驶场景中逐步推理能力的全面框架。为填补这一空白，论文提出了DriveLMM-o1，这是一个专门设计的新数据集和基准，用于推进自动驾驶领域的逐步视觉推理能力。其关键创新在于构建了一个包含超过18k训练样本和4k测试样本的数据集，涵盖感知、预测和规划等多个方面的多样化问题，并通过引入详细的分步推理注释确保逻辑推导的有效性。此外，论文还开发了一个经过微调的大型多模态模型，显著提升了复杂驾驶场景中的推理性能。最终，所提出的方法在最终答案准确率上提高了+7.49%，并且在推理得分上比先前最佳开源模型提升了3.62%。

链接: https://arxiv.org/abs/2503.10621
作者: Ayesha Ishaq,Jean Lahoud,Ketan More,Omkar Thawakar,Ritesh Thawkar,Dinura Dissanayake,Noor Ahsan,Yuhao Li,Fahad Shahbaz Khan,Hisham Cholakkal,Ivan Laptev,Rao Muhammad Anwer,Salman Khan
机构: Mohamed Bin Zayed University of Artificial Intelligence (阿联酋哈利法大学); Linköping University (林雪平大学); Australian National University (澳大利亚国立大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注: 8 pages, 4 figures, 3 tables, github: this https URL

点击查看摘要

Abstract:While large multimodal models (LMMs) have demonstrated strong performance across various Visual Question Answering (VQA) tasks, certain challenges require complex multi-step reasoning to reach accurate answers. One particularly challenging task is autonomous driving, which demands thorough cognitive processing before decisions can be made. In this domain, a sequential and interpretive understanding of visual cues is essential for effective perception, prediction, and planning. Nevertheless, common VQA benchmarks often focus on the accuracy of the final answer while overlooking the reasoning process that enables the generation of accurate responses. Moreover, existing methods lack a comprehensive framework for evaluating step-by-step reasoning in realistic driving scenarios. To address this gap, we propose DriveLMM-o1, a new dataset and benchmark specifically designed to advance step-wise visual reasoning for autonomous driving. Our benchmark features over 18k VQA examples in the training set and more than 4k in the test set, covering diverse questions on perception, prediction, and planning, each enriched with step-by-step reasoning to ensure logical inference in autonomous driving scenarios. We further introduce a large multimodal model that is fine-tuned on our reasoning dataset, demonstrating robust performance in complex driving scenarios. In addition, we benchmark various open-source and closed-source methods on our proposed dataset, systematically comparing their reasoning capabilities for autonomous driving tasks. Our model achieves a +7.49% gain in final answer accuracy, along with a 3.62% improvement in reasoning score over the previous best open-source model. Our framework, dataset, and model are available at this https URL.
zh

[CV-14] DiT-Air: Revisiting the Efficiency of Diffusion Model Architecture Design in Text to Image Generation

【速读】：该论文旨在研究Diffusion Transformers (DiTs)在文本到图像生成任务中的性能与优化潜力，重点关注模型架构设计、文本条件化策略以及训练协议。论文通过评估多种基于DiT的架构变体（包括PixArt风格和MMDiT变体），并与直接处理拼接文本和噪声输入的标准DiT变体进行比较，发现标准DiT在性能上可媲美专门设计的模型，同时展现出更优的参数效率，尤其是在模型规模扩大时。关键解决方案在于采用分层参数共享策略，将模型大小相较于MMDiT减少了66%，且性能影响极小。此外，通过对文本编码器和变分自编码器（VAE）等核心组件的深入分析，提出了DiT-Air和DiT-Air-Lite两种改进模型，并通过有监督和奖励微调，使DiT-Air在GenEval和T2I CompBench基准测试中达到当前最优性能，而DiT-Air-Lite则以紧凑的模型尺寸保持高度竞争力。

链接: https://arxiv.org/abs/2503.10618
作者: Chen Chen,Rui Qian,Wenze Hu,Tsu-Jui Fu,Lezhi Li,Bowen Zhang,Alex Schwing,Wei Liu,Yinfei Yang
机构: Apple Inc.
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:In this work, we empirically study Diffusion Transformers (DiTs) for text-to-image generation, focusing on architectural choices, text-conditioning strategies, and training protocols. We evaluate a range of DiT-based architectures–including PixArt-style and MMDiT variants–and compare them with a standard DiT variant which directly processes concatenated text and noise inputs. Surprisingly, our findings reveal that the performance of standard DiT is comparable with those specialized models, while demonstrating superior parameter-efficiency, especially when scaled up. Leveraging the layer-wise parameter sharing strategy, we achieve a further reduction of 66% in model size compared to an MMDiT architecture, with minimal performance impact. Building on an in-depth analysis of critical components such as text encoders and Variational Auto-Encoders (VAEs), we introduce DiT-Air and DiT-Air-Lite. With supervised and reward fine-tuning, DiT-Air achieves state-of-the-art performance on GenEval and T2I CompBench, while DiT-Air-Lite remains highly competitive, surpassing most existing models despite its compact size.
zh

[CV-15] OVTR: End-to-End Open-Vocabulary Multiple Object Tracking with Transformer ICLR2025

【速读】：该论文旨在解决现有开放词汇多目标跟踪器在处理未见过类别时受限于框架结构、孤立的帧级感知以及模态交互不足的问题，这些问题阻碍了其在开放词汇分类和跟踪中的性能。论文的关键解决方案是提出OVTR（基于Transformer的端到端开放词汇多目标跟踪器），它首次实现了同时建模运动、外观和类别信息的端到端方法。为实现稳定的分类与连续跟踪，设计了CIP（类别信息传播）策略，建立高阶类别信息先验以指导后续帧；同时引入双分支结构增强泛化能力和模态交互深度，并在解码器中加入保护策略提升性能。实验表明，该方法不仅在开放词汇MOT基准上超越了先前的跟踪器，还实现了更快的推理速度和显著减少的预处理需求，同时展示了较强的跨数据集适应性。

链接: https://arxiv.org/abs/2503.10616
作者: Jinyang Li,En Yu,Sijia Chen,Wenbing Tao
机构: Huazhong University of Science and Technology (华中科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by ICLR 2025

点击查看摘要

Abstract:Open-vocabulary multiple object tracking aims to generalize trackers to unseen categories during training, enabling their application across a variety of real-world scenarios. However, the existing open-vocabulary tracker is constrained by its framework structure, isolated frame-level perception, and insufficient modal interactions, which hinder its performance in open-vocabulary classification and tracking. In this paper, we propose OVTR (End-to-End Open-Vocabulary Multiple Object Tracking with TRansformer), the first end-to-end open-vocabulary tracker that models motion, appearance, and category simultaneously. To achieve stable classification and continuous tracking, we design the CIP (Category Information Propagation) strategy, which establishes multiple high-level category information priors for subsequent frames. Additionally, we introduce a dual-branch structure for generalization capability and deep multimodal interaction, and incorporate protective strategies in the decoder to enhance performance. Experimental results show that our method surpasses previous trackers on the open-vocabulary MOT benchmark while also achieving faster inference speeds and significantly reducing preprocessing requirements. Moreover, the experiment transferring the model to another dataset demonstrates its strong adaptability. Models and code are released at this https URL.
zh

[CV-16] R1-Onevision: Advancing Generalized Multimodal Reasoning through Cross-Modal Formalization

【速读】：本文旨在解决多模态推理（Multimodal Reasoning）中的挑战，特别是视觉与文本信息整合的难题。现有视觉-语言模型在分析和推理视觉内容时表现不佳，导致其在复杂推理任务上的性能受限。此外，缺乏全面的基准测试也阻碍了对多模态推理能力的准确评估。为了解决这些问题，论文提出的关键方案是设计了一个名为R1-Onevision的多模态推理模型，并构建了一个跨模态推理管道，该管道将图像转换为形式化的文本表示，从而实现基于语言的精确推理。这一管道不仅支持模型的学习，还促成了R1-Onevision数据集的创建，该数据集包含多样化领域的详细、分步多模态推理标注。通过监督微调和强化学习进一步优化R1-Onevision模型，以提升其高级推理能力和鲁棒泛化能力。同时，为了全面评估不同级别的多模态推理性能，论文引入了R1-Onevision-Bench基准，覆盖从初中到大学及更高阶段的人类教育层级考试。实验结果表明，R1-Onevision在多个具有挑战性的多模态推理基准测试中达到了最先进的性能，超越了包括GPT-4o和Qwen2.5-VL在内的其他模型。

链接: https://arxiv.org/abs/2503.10615
作者: Yi Yang,Xiaoxuan He,Hongkun Pan,Xiyan Jiang,Yan Deng,Xingtao Yang,Haoyu Lu,Dacheng Yin,Fengyun Rao,Minfeng Zhu,Bo Zhang,Wei Chen
机构: Zhejiang University (浙江大学); WeChat Vision, Tencent Inc. (微信视觉团队, 腾讯公司); Renmin University of China (中国人民大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Code and Model: this https URL

点击查看摘要

Abstract:Large Language Models have demonstrated remarkable reasoning capability in complex textual tasks. However, multimodal reasoning, which requires integrating visual and textual information, remains a significant challenge. Existing visual-language models often struggle to effectively analyze and reason visual content, resulting in suboptimal performance on complex reasoning tasks. Moreover, the absence of comprehensive benchmarks hinders the accurate assessment of multimodal reasoning capabilities. In this paper, we introduce R1-Onevision, a multimodal reasoning model designed to bridge the gap between visual perception and deep reasoning. To achieve this, we propose a cross-modal reasoning pipeline that transforms images into formal textural representations, enabling precise language-based reasoning. Leveraging this pipeline, we construct the R1-Onevision dataset which provides detailed, step-by-step multimodal reasoning annotations across diverse domains. We further develop the R1-Onevision model through supervised fine-tuning and reinforcement learning to cultivate advanced reasoning and robust generalization abilities. To comprehensively evaluate multimodal reasoning performance across different grades, we introduce R1-Onevision-Bench, a benchmark aligned with human educational stages, covering exams from junior high school to university and beyond. Experimental results show that R1-Onevision achieves state-of-the-art performance, outperforming models such as GPT-4o and Qwen2.5-VL on multiple challenging multimodal reasoning benchmarks.
zh

[CV-17] ConsisLoRA: Enhancing Content and Style Consistency for LoRA-based Style Transfer

【速读】：该论文旨在解决风格迁移中的三个主要挑战：内容不一致性、风格错位以及内容泄露。为应对这些挑战，论文的关键创新在于提出了一种名为ConsisLoRA的方法，该方法基于LoRA（Low-Rank Adaptation），通过优化LoRA权重以预测原始图像而非噪声，从而增强内容与风格的一致性。此外，论文还设计了一种两阶段训练策略，将目标图像的内容学习与参考图像的风格学习解耦，并引入逐步损失转换策略以更好地捕捉内容图像的全局结构与局部细节。同时，提出了一种推理引导方法，实现在推理过程中对内容和风格强度的连续控制。通过定性和定量评估，ConsisLoRA显著提升了内容和风格的一致性，同时有效减少了内容泄露现象。

链接: https://arxiv.org/abs/2503.10614
作者: Bolin Chen,Baoquan Zhao,Haoran Xie,Yi Cai,Qing Li,Xudong Mao
机构: Sun Yat-sen University (中山大学); Lingnan University (岭南大学); South China University of Technology (华南理工大学); The Hong Kong Polytechnic University (香港理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Style transfer involves transferring the style from a reference image to the content of a target image. Recent advancements in LoRA-based (Low-Rank Adaptation) methods have shown promise in effectively capturing the style of a single image. However, these approaches still face significant challenges such as content inconsistency, style misalignment, and content leakage. In this paper, we comprehensively analyze the limitations of the standard diffusion parameterization, which learns to predict noise, in the context of style transfer. To address these issues, we introduce ConsisLoRA, a LoRA-based method that enhances both content and style consistency by optimizing the LoRA weights to predict the original image rather than noise. We also propose a two-step training strategy that decouples the learning of content and style from the reference image. To effectively capture both the global structure and local details of the content image, we introduce a stepwise loss transition strategy. Additionally, we present an inference guidance method that enables continuous control over content and style strengths during inference. Through both qualitative and quantitative evaluations, our method demonstrates significant improvements in content and style consistency while effectively reducing content leakage.
zh

[CV-18] CoSTAast: Cost-Sensitive Toolpath Agent for Multi-turn Image Editing

【速读】：本文旨在解决文本到图像模型（如Stable Diffusion和DALL-E 3）在多轮图像编辑任务中的不足。传统搜索算法在寻找工具路径时需要昂贵的探索，而大型语言模型（LLMs）虽然具备子任务规划的先验知识，但可能缺乏对工具能力和成本的精确估计，从而难以确定每个子任务应使用的工具。为结合LLMs和图搜索的优势，论文提出了一种三阶段方法“CoSTA*”。其关键是利用LLMs构建子任务树以剪枝AI工具图，并在小规模子图上执行A搜索来找到成本效益高的工具路径。此外，通过将工具在每个子任务上的总成本和质量相结合来指导A搜索，进一步优化成本与质量的平衡。同时，视觉-语言模型（VLM）评估每个子任务的输出，失败时更新工具的成本和质量，使A搜索能够快速恢复并探索其他路径。此外，CoSTA能够在子任务间自动切换模态，实现更好的成本-质量权衡。实验表明，CoSTA*在具有挑战性的多轮图像编辑基准测试中，以成本和质量两方面均优于现有最先进的图像编辑模型或代理，并可根据用户偏好进行多样化的权衡。

链接: https://arxiv.org/abs/2503.10613
作者: Advait Gupta,NandaKiran Velaga,Dang Nguyen,Tianyi Zhou
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Text-to-image models like stable diffusion and DALLE-3 still struggle with multi-turn image editing. We decompose such a task as an agentic workflow (path) of tool use that addresses a sequence of subtasks by AI tools of varying costs. Conventional search algorithms require expensive exploration to find tool paths. While large language models (LLMs) possess prior knowledge of subtask planning, they may lack accurate estimations of capabilities and costs of tools to determine which to apply in each subtask. Can we combine the strengths of both LLMs and graph search to find cost-efficient tool paths? We propose a three-stage approach “CoSTA*” that leverages LLMs to create a subtask tree, which helps prune a graph of AI tools for the given task, and then conducts A* search on the small subgraph to find a tool path. To better balance the total cost and quality, CoSTA* combines both metrics of each tool on every subtask to guide the A* search. Each subtask’s output is then evaluated by a vision-language model (VLM), where a failure will trigger an update of the tool’s cost and quality on the subtask. Hence, the A* search can recover from failures quickly to explore other paths. Moreover, CoSTA* can automatically switch between modalities across subtasks for a better cost-quality trade-off. We build a novel benchmark of challenging multi-turn image editing, on which CoSTA* outperforms state-of-the-art image-editing models or agents in terms of both cost and quality, and performs versatile trade-offs upon user preference.
zh

[CV-19] OCCUQ: Exploring Efficient Uncertainty Quantification for 3D Occupancy Prediction ICRA2025

【速读】：本文旨在解决自动驾驶系统在应对未见过的对抗性条件或分布偏移时不确定性估计不足的问题，特别是在恶劣天气（如雾天）或传感器损坏（如摄像头缺失或区域特定缺陷）等情况下，现有方法难以有效量化模型的不确定性，从而限制其实际应用。论文的关键在于提出了一种高效的不确定性估计技术的适应方法，通过动态校准模型置信度以利用认识论不确定性（epistemic uncertainty）估计，确保在分布外（Out-of-Distribution, OoD）检测和置信校准方面具有更优性能。与深度集成（Deep Ensembles）和MC-Dropout等常见基准相比，该方法能够在场景级和区域级评估中提供更可靠的不确定性度量，显著提升自动驾驶系统的鲁棒性。

链接: https://arxiv.org/abs/2503.10605
作者: Severin Heidrich,Till Beemelmanns,Alexey Nekrasov,Bastian Leibe,Lutz Eckstein
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted for publication at ICRA 2025

点击查看摘要

Abstract:Autonomous driving has the potential to significantly enhance productivity and provide numerous societal benefits. Ensuring robustness in these safety-critical systems is essential, particularly when vehicles must navigate adverse weather conditions and sensor corruptions that may not have been encountered during training. Current methods often overlook uncertainties arising from adversarial conditions or distributional shifts, limiting their real-world applicability. We propose an efficient adaptation of an uncertainty estimation technique for 3D occupancy prediction. Our method dynamically calibrates model confidence using epistemic uncertainty estimates. Our evaluation under various camera corruption scenarios, such as fog or missing cameras, demonstrates that our approach effectively quantifies epistemic uncertainty by assigning higher uncertainty values to unseen data. We introduce region-specific corruptions to simulate defects affecting only a single camera and validate our findings through both scene-level and region-level assessments. Our results show superior performance in Out-of-Distribution (OoD) detection and confidence calibration compared to common baselines such as Deep Ensembles and MC-Dropout. Our approach consistently demonstrates reliable uncertainty measures, indicating its potential for enhancing the robustness of autonomous driving systems in real-world scenarios. Code and dataset are available at this https URL .
zh

[CV-20] MuDG: Taming Multi-modal Diffusion with Gaussian Splatting for Urban Scene Reconstruction

【速读】：该论文旨在解决基于辐射场（radiance fields）的3D场景重建与新视角合成（NVS, novel view synthesis）在自动驾驶领域中的两个关键挑战：一是基于重建的方法在训练轨迹外显著视点偏差下性能严重下降；二是基于生成的方法难以保证时间一致性与精确场景可控性。为克服这些限制，论文提出了一种名为MuDG的创新框架，其关键是将多模态扩散模型（Multi-modal Diffusion model）与高斯点云拼接（Gaussian Splatting, GS）相结合，用于城市场景重建。通过利用聚合的LiDAR点云以及RGB和几何先验信息来条件化多模态视频扩散模型，MuDG能够为新视角合成逼真的RGB、深度和语义输出，同时提供前馈式的NVS能力，无需针对每个场景进行计算密集型优化，从而增强渲染鲁棒性，尤其在极端视点变化下。实验表明，MuDG在Open Waymo数据集上的重建与合成质量均优于现有方法。

链接: https://arxiv.org/abs/2503.10604
作者: Yingshuang Zou,Yikang Ding,Chuanrui Zhang,Jiazhe Guo,Bohan Li,Xiaoyang Lyu,Feiyang Tan,Xiaojuan Qi,Haoqian Wang
机构: THU (清华大学); MEGVII (商汤科技); SJTU (上海交通大学); HKU (香港大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Recent breakthroughs in radiance fields have significantly advanced 3D scene reconstruction and novel view synthesis (NVS) in autonomous driving. Nevertheless, critical limitations persist: reconstruction-based methods exhibit substantial performance deterioration under significant viewpoint deviations from training trajectories, while generation-based techniques struggle with temporal coherence and precise scene controllability. To overcome these challenges, we present MuDG, an innovative framework that integrates Multi-modal Diffusion model with Gaussian Splatting (GS) for Urban Scene Reconstruction. MuDG leverages aggregated LiDAR point clouds with RGB and geometric priors to condition a multi-modal video diffusion model, synthesizing photorealistic RGB, depth, and semantic outputs for novel viewpoints. This synthesis pipeline enables feed-forward NVS without computationally intensive per-scene optimization, providing comprehensive supervision signals to refine 3DGS representations for rendering robustness enhancement under extreme viewpoint changes. Experiments on the Open Waymo Dataset demonstrate that MuDG outperforms existing methods in both reconstruction and synthesis quality.
zh

[CV-21] Dual-Stage Cross-Modal Network with Dynamic Feature Fusion for Emotional Mimicry Intensity Estimation

【速读】：该论文旨在解决情感模仿强度（EMI）估计中存在的动态关联建模不足、多模态时间信号鲁棒融合困难以及现有方法在模态协同效应挖掘不充分、抗噪能力弱及细粒度对齐能力有限的问题。为应对这些挑战，论文提出了一种双阶段跨模态对齐框架。关键解决方案包括：首先基于改进的CLIP架构构建视觉-文本和音频-文本对比学习网络，在特征空间中通过模态解耦预训练实现初步对齐；其次设计了一个结合时序卷积网络（TCN）和门控双向LSTM的时间感知动态融合模块，分别捕获面部表情的宏观演化模式与声学特征的局部动态变化；创新性地引入一种质量引导的模态融合策略，通过可微权重分配实现遮挡和噪声场景下的模态补偿。实验结果表明，所提方法在Hume-Vidmimic2数据集上的六个情感维度平均皮尔逊相关系数达到0.35，较最佳基线提升40%，消融研究进一步验证了双阶段训练策略与动态融合机制的有效性。

链接: https://arxiv.org/abs/2503.10603
作者: Jun Yu,Lingsi Zhu,Yanjun Chi,Yunxiang Zhang,Yang Zheng,Yongqi Wang,Xilong Lu
机构: University of Science and Technology of China (中国科学技术大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Emotional Mimicry Intensity (EMI) estimation serves as a critical technology for understanding human social behavior and enhancing human-computer interaction experiences, where the core challenge lies in dynamic correlation modeling and robust fusion of multimodal temporal signals. To address the limitations of existing methods in insufficient exploitation of modal synergistic effects, noise sensitivity, and limited fine-grained alignment capabilities, this paper proposes a dual-stage cross-modal alignment framework. First, we construct vision-text and audio-text contrastive learning networks based on an improved CLIP architecture, achieving preliminary alignment in the feature space through modality-decoupled pre-training. Subsequently, we design a temporal-aware dynamic fusion module that combines Temporal Convolutional Networks (TCN) and gated bidirectional LSTM to respectively capture the macro-evolution patterns of facial expressions and local dynamics of acoustic features. Innovatively, we introduce a quality-guided modality fusion strategy that enables modality compensation under occlusion and noisy scenarios through differentiable weight allocation. Experimental results on the Hume-Vidmimic2 dataset demonstrate that our method achieves an average Pearson correlation coefficient of 0.35 across six emotion dimensions, outperforming the best baseline by 40%. Ablation studies further validate the effectiveness of the dual-stage training strategy and dynamic fusion mechanism, providing a novel technical pathway for fine-grained emotion analysis in open environments.
zh

[CV-22] GroomLight: Hybrid Inverse Rendering for Relightable Human Hair Appearance Modeling

【速读】：该论文致力于解决多视角图像下可重光照发型外观建模的问题，现有方法难以在实现照片级真实感渲染的同时具备良好的重光照能力。分析材质模型虽物理基础扎实，但难以捕捉细节；而神经渲染方法虽擅长视图合成，却在新光照条件下的泛化表现不佳。论文的关键解决方案在于提出GroomLight，通过结合传统分析模型与神经渲染的优势，采用扩展的头发BSDF模型捕获主要光传输，并利用光照感知残差模型重建剩余细节，同时设计混合逆向渲染管道优化这两部分，从而实现高保真度的重光照、视图合成及材质编辑。

链接: https://arxiv.org/abs/2503.10597
作者: Yang Zheng,Menglei Chai,Delio Vicini,Yuxiao Zhou,Yinghao Xu,Leonidas Guibas,Gordon Wetzstein,Thabo Beeler
机构: Stanford University (斯坦福大学); Google(谷歌); ETH Zurich (瑞士联邦理工学院)
类目: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
备注: Project Page: this https URL

点击查看摘要

Abstract:We present GroomLight, a novel method for relightable hair appearance modeling from multi-view images. Existing hair capture methods struggle to balance photorealistic rendering with relighting capabilities. Analytical material models, while physically grounded, often fail to fully capture appearance details. Conversely, neural rendering approaches excel at view synthesis but generalize poorly to novel lighting conditions. GroomLight addresses this challenge by combining the strengths of both paradigms. It employs an extended hair BSDF model to capture primary light transport and a light-aware residual model to reconstruct the remaining details. We further propose a hybrid inverse rendering pipeline to optimize both components, enabling high-fidelity relighting, view synthesis, and material editing. Extensive evaluations on real-world hair data demonstrate state-of-the-art performance of our method.
zh

[CV-23] GroundingSuite: Measuring Complex Multi-Granular Pixel Grounding

【速读】：该论文旨在解决像素接地（Pixel Grounding）任务，特别是指表达分割（REF）领域中现有数据集局限性导致的进展瓶颈问题。这些局限性包括有限的目标类别、文本描述多样性不足以及高质量标注的稀缺性。为应对这些问题，论文提出GroundingSuite作为解决方案，其关键是构建了一个包含自动化数据标注框架（利用多个视觉-语言模型代理）、一个大规模训练数据集（涵盖956万条多样化指代表达及其对应分割结果），以及一个精心设计的评估基准（包含3800张图像）。这一方案显著提升了模型性能，在gRefCOCO上的cIoU达到68.9，在RefCOCOm上的gIoU达到55.3，并且其标注效率比当前最先进的方法GLaMM高出4.5倍。

链接: https://arxiv.org/abs/2503.10596
作者: Rui Hu,Lianghui Zhu,Yuxuan Zhang,Tianheng Cheng,Lei Liu,Heng Liu,Longjin Ran,Xiaoxin Chen,Wenyu Liu,Xinggang Wang
机构: School of EIC, Huazhong University of Science & Technology (华中科技大学); vivo AI Lab (vivo人工智能实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Work in progress. Code: this https URL

点击查看摘要

Abstract:Pixel grounding, encompassing tasks such as Referring Expression Segmentation (RES), has garnered considerable attention due to its immense potential for bridging the gap between vision and language modalities. However, advancements in this domain are currently constrained by limitations inherent in existing datasets, including limited object categories, insufficient textual diversity, and a scarcity of high-quality annotations. To mitigate these limitations, we introduce GroundingSuite, which comprises: (1) an automated data annotation framework leveraging multiple Vision-Language Model (VLM) agents; (2) a large-scale training dataset encompassing 9.56 million diverse referring expressions and their corresponding segmentations; and (3) a meticulously curated evaluation benchmark consisting of 3,800 images. The GroundingSuite training dataset facilitates substantial performance improvements, enabling models trained on it to achieve state-of-the-art results. Specifically, a cIoU of 68.9 on gRefCOCO and a gIoU of 55.3 on RefCOCOm. Moreover, the GroundingSuite annotation framework demonstrates superior efficiency compared to the current leading data annotation method, i.e., 4.5 \times faster than the GLaMM.
zh

[CV-24] Poly-MgNet: Polynomial Building Blocks in Multigrid-Inspired ResNets

【速读】：该论文旨在通过引入基于多重网格（Multigrid, MG）理论中的多项式平滑器的新神经网络模块，改进ResNet架构及其扩展框架MgNet的性能。论文的关键在于设计一种新的多项式块（Polynomial Block），它不仅从理论上扩展了MgNet框架至Poly-MgNet，还减少了模型参数量。此外，研究系统分析了多项式系数的选择、多项式的阶数、激活函数与批量归一化的位置等设计因素，并提出基于实部和虚部多项式根构建二次多项式模块的方法，以提升Poly-MgNet的精度表现，同时实现模型准确性与参数量之间的更优权衡。

链接: https://arxiv.org/abs/2503.10594
作者: Antonia van Betteray,Matthias Rottmann,Karsten Kahl
机构: IZMD, University of Wuppertal (伍珀塔尔大学), Germany
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:The structural analogies of ResNets and Multigrid (MG) methods such as common building blocks like convolutions and poolings where already pointed out by He et al.\ in 2016. Multigrid methods are used in the context of scientific computing for solving large sparse linear systems arising from partial differential equations. MG methods particularly rely on two main concepts: smoothing and residual restriction / coarsening. Exploiting these analogies, He and Xu developed the MgNet framework, which integrates MG schemes into the design of ResNets. In this work, we introduce a novel neural network building block inspired by polynomial smoothers from MG theory. Our polynomial block from an MG perspective naturally extends the MgNet framework to Poly-Mgnet and at the same time reduces the number of weights in MgNet. We present a comprehensive study of our polynomial block, analyzing the choice of initial coefficients, the polynomial degree, the placement of activation functions, as well as of batch normalizations. Our results demonstrate that constructing (quadratic) polynomial building blocks based on real and imaginary polynomial roots enhances Poly-MgNet’s capacity in terms of accuracy. Furthermore, our approach achieves an improved trade-off of model accuracy and number of weights compared to ResNet as well as compared to specific configurations of MgNet.
zh

[CV-25] CameraCtrl II: Dynamic Scene Exploration via Camera-controlled Video Diffusion Models

【速读】：本文旨在解决现有带相机条件的视频生成模型在处理大范围相机运动时动态性减弱及视角受限的问题。为应对这一挑战，论文提出的关键解决方案是设计一个逐步扩展动态场景生成的框架——首先通过增强单个视频片段内的动态内容，然后进一步扩展到支持广视角范围内的无缝探索。具体而言，研究构建了一个包含大量动态数据且带有相机参数标注的数据集用于训练，并开发了一种轻量级的相机注入模块与相应的训练方案以保留预训练模型中的动态特性。在此基础上，通过允许用户迭代指定相机轨迹来生成连贯的视频序列，实现了扩展场景探索能力。实验结果表明，CameraCtrl II相比以往方法能够提供显著更宽的空间探索范围。

链接: https://arxiv.org/abs/2503.10592
作者: Hao He,Ceyuan Yang,Shanchuan Lin,Yinghao Xu,Meng Wei,Liangke Gui,Qi Zhao,Gordon Wetzstein,Lu Jiang,Hongsheng Li
机构: The Chinese University of Hong Kong (香港中文大学); ByteDance Seed (字节跳动种子计划); Stanford University (斯坦福大学); ByteDance (字节跳动)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project page: this https URL

点击查看摘要

Abstract:This paper introduces CameraCtrl II, a framework that enables large-scale dynamic scene exploration through a camera-controlled video diffusion model. Previous camera-conditioned video generative models suffer from diminished video dynamics and limited range of viewpoints when generating videos with large camera movement. We take an approach that progressively expands the generation of dynamic scenes – first enhancing dynamic content within individual video clip, then extending this capability to create seamless explorations across broad viewpoint ranges. Specifically, we construct a dataset featuring a large degree of dynamics with camera parameter annotations for training while designing a lightweight camera injection module and training scheme to preserve dynamics of the pretrained models. Building on these improved single-clip techniques, we enable extended scene exploration by allowing users to iteratively specify camera trajectories for generating coherent video sequences. Experiments across diverse scenarios demonstrate that CameraCtrl Ii enables camera-controlled dynamic scene synthesis with substantially wider spatial exploration than previous approaches.
zh

[CV-26] Long Context Tuning for Video Generation

【速读】：该论文旨在解决现实世界中叙事视频生成的需求，即如何生成包含多镜头场景且镜头间视觉与动态一致性的长视频。现有单镜头视频扩散模型无法直接处理多镜头一致性问题，因此论文提出了一种名为Long Context Tuning (LCT) 的训练范式作为解决方案。LCT 的关键是通过扩展预训练单镜头视频扩散模型的上下文窗口，将全注意力机制从单一镜头扩展到整个场景的所有镜头，同时结合交错的三维位置嵌入和异步噪声策略，在不增加额外参数的情况下实现镜头间的联合与自回归生成。此外，经过 LCT 的模型可以通过双向注意力进一步微调为因果上下文注意，从而在高效 KV 缓存的支持下实现自回归生成。实验表明，经过 LCT 的单镜头模型能够生成连贯的多镜头场景，并展现出组合生成和交互式镜头扩展等新兴能力。

链接: https://arxiv.org/abs/2503.10589
作者: Yuwei Guo,Ceyuan Yang,Ziyan Yang,Zhibei Ma,Zhijie Lin,Zhenheng Yang,Dahua Lin,Lu Jiang
机构: The Chinese University of Hong Kong (香港中文大学); ByteDance Seed (字节跳动种子计划); ByteDance (字节跳动)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project Page: this https URL

点击查看摘要

Abstract:Recent advances in video generation can produce realistic, minute-long single-shot videos with scalable diffusion transformers. However, real-world narrative videos require multi-shot scenes with visual and dynamic consistency across shots. In this work, we introduce Long Context Tuning (LCT), a training paradigm that expands the context window of pre-trained single-shot video diffusion models to learn scene-level consistency directly from data. Our method expands full attention mechanisms from individual shots to encompass all shots within a scene, incorporating interleaved 3D position embedding and an asynchronous noise strategy, enabling both joint and auto-regressive shot generation without additional parameters. Models with bidirectional attention after LCT can further be fine-tuned with context-causal attention, facilitating auto-regressive generation with efficient KV-cache. Experiments demonstrate single-shot models after LCT can produce coherent multi-shot scenes and exhibit emerging capabilities, including compositional generation and interactive shot extension, paving the way for more practical visual content creation. See this https URL for more details.
zh

[CV-27] Unlock the Power of Unlabeled Data in Language Driving Model ICRA2025

【速读】：该论文试图解决视觉基础大语言模型（Vision-based Large Language Models, VisionLLMs）在自动驾驶领域中对大规模高质量标注数据高度依赖的问题，这导致数据获取成本高且耗时。为了解决这一问题，论文提出了一种半监督学习方法，旨在利用大量未标注的数据来提升语言驱动模型（Language Driving Model, LDM）的表现。方案的关键在于首先通过基于模板的提示生成场景信息及伪答案，然后采用自一致性优化（Self-Consistency Refinement）方法提高这些伪标注的质量，最终用于进一步训练。实验表明，该方法仅使用5%的标注数据即可取得与全数据集训练模型相竞争的性能，在DriveLM基准测试中，有限标注数据下的LDM达到了44.85%，而结合未标注数据后提升至54.27%，相比之下，全数据集训练的模型性能为60.68%。

链接: https://arxiv.org/abs/2503.10586
作者: Chaoqun Wang,Jie Yang,Xiaobin Hong,Ruimao Zhang
机构: The Chinese University of Hong Kong, Shenzhen (香港中文大学（深圳）); NanJing University (南京大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by ICRA2025

点击查看摘要

Abstract:Recent Vision-based Large Language Models~(VisionLLMs) for autonomous driving have seen rapid advancements. However, such promotion is extremely dependent on large-scale high-quality annotated data, which is costly and labor-intensive. To address this issue, we propose unlocking the value of abundant yet unlabeled data to improve the language-driving model in a semi-supervised learning manner. Specifically, we first introduce a series of template-based prompts to extract scene information, generating questions that create pseudo-answers for the unlabeled data based on a model trained with limited labeled data. Next, we propose a Self-Consistency Refinement method to improve the quality of these pseudo-annotations, which are later used for further training. By utilizing a pre-trained VisionLLM (e.g., InternVL), we build a strong Language Driving Model (LDM) for driving scene question-answering, outperforming previous state-of-the-art methods. Extensive experiments on the DriveLM benchmark show that our approach performs well with just 5% labeled data, achieving competitive performance against models trained with full datasets. In particular, our LDM achieves 44.85% performance with limited labeled data, increasing to 54.27% when using unlabeled data, while models trained with full datasets reach 60.68% on the DriveLM benchmark.
zh

[CV-28] Semantic-Supervised Spatial-Temporal Fusion for LiDAR-based 3D Object Detection ICRA2025

【速读】：本文旨在解决基于 LiDAR 的 3D 对象检测中因点云稀疏性带来的显著挑战，特别是如何高效利用时空信息的问题。传统方法通过长时间序列的 LiDAR 数据来增强输入密度，但如何有效地融合时空信息仍是一个开放问题。为此，论文提出了一种名为语义引导的空间-时间融合（ST-Fusion）的方法，其关键在于引入一种新的融合模块以缓解由于物体运动导致的空间错位，并通过特征级别的语义监督充分挖掘所提融合模块的潜力。具体而言，ST-Fusion 包含空间聚合（SA）模块和时间合并（TM）模块：SA 模块利用具有逐步扩展感受野的卷积层从局部区域聚合目标特征以减轻空间错位；TM 模块则基于注意力机制动态提取前几帧中的目标特征，实现全面的时序表示。此外，在语义监督方面，提出了语义注入方法，通过注入逐点语义标签来丰富稀疏的 LiDAR 数据，用于训练教师模型并在特征级别提供由所提出的目标感知损失监督的重建目标。大量实验验证了该方法在多种基于 LiDAR 的检测器上的有效性和通用性，在 nuScenes 基准测试中提升了约 2.8% 的 NDS 分数。

链接: https://arxiv.org/abs/2503.10579
作者: Chaoqun Wang,Xiaobin Hong,Wenzhong Li,Ruimao Zhang
机构: The Chinese University of Hong Kong, Shenzhen (香港中文大学（深圳）); NanJing University (南京大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by ICRA2025

点击查看摘要

Abstract:LiDAR-based 3D object detection presents significant challenges due to the inherent sparsity of LiDAR points. A common solution involves long-term temporal LiDAR data to densify the inputs. However, efficiently leveraging spatial-temporal information remains an open problem. In this paper, we propose a novel Semantic-Supervised Spatial-Temporal Fusion (ST-Fusion) method, which introduces a novel fusion module to relieve the spatial misalignment caused by the object motion over time and a feature-level semantic supervision to sufficiently unlock the capacity of the proposed fusion module. Specifically, the ST-Fusion consists of a Spatial Aggregation (SA) module and a Temporal Merging ™ module. The SA module employs a convolutional layer with progressively expanding receptive fields to aggregate the object features from the local regions to alleviate the spatial misalignment, the TM module dynamically extracts object features from the preceding frames based on the attention mechanism for a comprehensive sequential presentation. Besides, in the semantic supervision, we propose a Semantic Injection method to enrich the sparse LiDAR data via injecting the point-wise semantic labels, using it for training a teacher model and providing a reconstruction target at the feature level supervised by the proposed object-aware loss. Extensive experiments on various LiDAR-based detectors demonstrate the effectiveness and universality of our proposal, yielding an improvement of approximately +2.8% in NDS based on the nuScenes benchmark.
zh

[CV-29] Autoregressive Image Generation with Randomized Parallel Decoding

【速读】：该论文试图解决传统自回归模型在推理效率和零样本泛化能力上的局限性，这些问题源于其固定的栅格顺序（raster-order）生成方式。为了解决这些问题，论文提出了一种新颖的视觉自回归模型ARPG，其关键在于引入了一个引导解码框架，将位置引导与内容表示显式解耦，并分别以查询（queries）和键值对（key-value pairs）的形式进行编码。通过将这种引导直接融入因果注意力机制中，ARPG实现了完全随机顺序的训练和生成，无需双向注意力。这一方法不仅显著提升了推理效率和零样本任务（如图像修复、扩展和分辨率提升）的性能，还通过共享键值缓存支持并行推理，从而大幅提高了吞吐量并降低了内存消耗。

链接: https://arxiv.org/abs/2503.10568
作者: Haopeng Li,Jinyue Yang,Guoqi Li,Huan Wang
机构: Westlake University (西湖大学); Institute of Automation, Chinese Academy of Sciences (中国科学院自动化研究所)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:We introduce ARPG, a novel visual autoregressive model that enables randomized parallel generation, addressing the inherent limitations of conventional raster-order approaches, which hinder inference efficiency and zero-shot generalization due to their sequential, predefined token generation order. Our key insight is that effective random-order modeling necessitates explicit guidance for determining the position of the next predicted token. To this end, we propose a novel guided decoding framework that decouples positional guidance from content representation, encoding them separately as queries and key-value pairs. By directly incorporating this guidance into the causal attention mechanism, our approach enables fully random-order training and generation, eliminating the need for bidirectional attention. Consequently, ARPG readily generalizes to zero-shot tasks such as image inpainting, outpainting, and resolution expansion. Furthermore, it supports parallel inference by concurrently processing multiple queries using a shared KV cache. On the ImageNet-1K 256 benchmark, our approach attains an FID of 1.94 with only 64 sampling steps, achieving over a 20-fold increase in throughput while reducing memory consumption by over 75% compared to representative recent autoregressive models at a similar scale.
zh

[CV-30] MASQUE: A Text-Guided Diffusion-Based Framework for Localized and Customized Adversarial Makeup

【速读】：该论文旨在解决面部识别技术在政府和商业服务中被滥用引发的隐私和公民权利担忧。现有的反面部识别方法主要通过对抗性扰动人脸图像来保护隐私，其中基于生成式化妆的方法较为流行。然而，这些方法因设计初衷多为模仿特定目标身份，导致规避成功率较低且存在被针对性滥用的风险，同时常引入全局视觉伪影或缺乏适应性以应对多样化的化妆提示，影响用户体验。为克服上述局限，本文开发了MASQUE，这是一种基于扩散模型的新框架，能够根据用户定义的文本提示生成局部对抗性化妆。其关键在于利用精确的null-text反转、定制化的带掩码交叉注意力融合以及基于同一人像图像的成对对抗引导机制，从而实现在无需外部身份信息的情况下获得稳健的规避性能。综合评估表明，MASQUE在开放源码面部识别模型和商业API上的规避成功率显著优于所有基线方法，并具备更高的感知保真度和更强的文本化妆提示适应性。

链接: https://arxiv.org/abs/2503.10549
作者: Youngjin Kwon,Xiao Zhang
机构: CISPA Helmholtz Center for Information Security (CISPA 莱布尼茨信息安全中心)
类目: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR)
备注:

点击查看摘要

Abstract:As facial recognition is increasingly adopted for government and commercial services, its potential misuse has raised serious concerns about privacy and civil rights. To counteract, various anti-facial recognition techniques have been proposed for privacy protection by adversarially perturbing face images, among which generative makeup-based approaches are the most popular. However, these methods, designed primarily to impersonate specific target identities, can only achieve weak dodging success rates while increasing the risk of targeted abuse. In addition, they often introduce global visual artifacts or a lack of adaptability to accommodate diverse makeup prompts, compromising user satisfaction. To address the above limitations, we develop MASQUE, a novel diffusion-based framework that generates localized adversarial makeups guided by user-defined text prompts. Built upon precise null-text inversion, customized cross-attention fusion with masking, and a pairwise adversarial guidance mechanism using images of the same individual, MASQUE achieves robust dodging performance without requiring any external identity. Comprehensive evaluations on open-source facial recognition models and commercial APIs demonstrate that MASQUE significantly improves dodging success rates over all baselines, along with higher perceptual fidelity and stronger adaptability to various text makeup prompts.
zh

[CV-31] Learning Interpretable Logic Rules from Deep Vision Models

【速读】：该论文旨在解决现有深度视觉模型可解释性不足的问题，特别是缺乏因果解释、可视化结果过强信心以及解释模糊性的挑战。为应对这些挑战，论文提出了一种名为VisionLogic的通用框架，其关键是通过因果验证将最后一层神经元转化为谓词，并将其与视觉概念关联起来，从而以逻辑规则的形式提供局部和全局解释，不仅增强了模型行为的透明度，还保留了大部分判别能力。

链接: https://arxiv.org/abs/2503.10547
作者: Chuqin Geng,Yuhe Jiang,Ziyu Zhao,Haolin Ye,Zhaoyue Wang,Xujie Si
机构: McGill University; University of Toronto (多伦多大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 10 pages, 6 figures

点击查看摘要

Abstract:We propose a general framework called VisionLogic to extract interpretable logic rules from deep vision models, with a focus on image classification tasks. Given any deep vision model that uses a fully connected layer as the output head, VisionLogic transforms neurons in the last layer into predicates and grounds them into vision concepts using causal validation. In this way, VisionLogic can provide local explanations for single images and global explanations for specific classes in the form of logic rules. Compared to existing interpretable visualization tools such as saliency maps, VisionLogic addresses several key challenges, including the lack of causal explanations, overconfidence in visualizations, and ambiguity in interpretation. VisionLogic also facilitates the study of visual concepts encoded by predicates, particularly how they behave under perturbation – an area that remains underexplored in the field of hidden semantics. Apart from providing better visual explanations and insights into the visual concepts learned by the model, we show that VisionLogic retains most of the neural network’s discriminative power in an interpretable and transparent manner. We envision it as a bridge between complex model behavior and human-understandable explanations, providing trustworthy and actionable insights for real-world applications.
zh

[CV-32] Lightweight Models for Emotional Analysis in Video

【速读】：该论文旨在解决高效时空特征提取在情感行为分析中的应用问题。论文的关键在于结合MobileNetV4作为视觉主干网络，利用其Universal Inverted Bottleneck (UIB)块实现输入图像序列的分层特征表示，同时引入基于多尺度3D MLP-Mixer的时间聚合模块以捕获时间依赖性。这种组合不仅保证了计算效率，还实现了丰富的语义编码，并通过多分辨率处理保持空间特征的结构完整性。实验结果表明，该方法在ABAW竞赛中表现出色，适用于实时移动和嵌入式计算环境。

链接: https://arxiv.org/abs/2503.10530
作者: Quoc-Tien Nguyen,Hong-Hai Nguyen,Van-Thong Huynh
机构: Dept. of ITS, FPT University (ITS系, FPT大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:In this study, we present an approach for efficient spatiotemporal feature extraction using MobileNetV4 and a multi-scale 3D MLP-Mixer-based temporal aggregation module. MobileNetV4, with its Universal Inverted Bottleneck (UIB) blocks, serves as the backbone for extracting hierarchical feature representations from input image sequences, ensuring both computational efficiency and rich semantic encoding. To capture temporal dependencies, we introduce a three-level MLP-Mixer module, which processes spatial features at multiple resolutions while maintaining structural integrity. Experimental results on the ABAW 8th competition demonstrate the effectiveness of our approach, showing promising performance in affective behavior analysis. By integrating an efficient vision backbone with a structured temporal modeling mechanism, the proposed framework achieves a balance between computational efficiency and predictive accuracy, making it well-suited for real-time applications in mobile and embedded computing environments.
zh

[CV-33] PiSA: A Self-Augmented Data Engine and Training Strategy for 3D Understanding with Large Models

【速读】：该论文旨在解决3D多模态大语言模型(3D MLLMs)因3D数据量有限且质量欠佳而导致潜力未充分挖掘的问题。现有方法通过从2D MLLMs迁移知识来扩充3D指令数据，但仍面临模态与领域差距的挑战。论文的关键解决方案是提出PiSA-Engine（点云自增引擎），这是一种新的框架，用于生成富含3D空间语义的指令点云-语言数据集。PiSA-Engine通过整合现成的2D和3D MLLMs的整体洞见，实现高质量数据的持续生成循环。此外，论文还引入了PiSA-Bench，一个覆盖六个关键方面的综合3D基准，以解决现有基准中语言描述粗糙、类别多样性不足的问题，从而提供更准确的评估。实验结果表明，基于此框架增强的3D MLLM(PointLLM-PiSA)在零样本3D物体描述和生成分类任务上取得了显著性能提升。

链接: https://arxiv.org/abs/2503.10529
作者: Zilu Guo,Hongbin Lin,Zhihao Yuan,Chaoda Zheng,Pengshuo Qiu,Dongzhi Jiang,Renrui Zhang,Chun-Mei Feng,Zhen Li
机构: FNii-Shenzhen (FNii 深圳); SSE, CUHK-Shenzhen (香港中文大学深圳分校 SSE); CUHK (香港中文大学); IHPC, A*STAR, Singapore (新加坡科技研究局信息通讯研究院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Technical Report

点击查看摘要

Abstract:3D Multimodal Large Language Models (MLLMs) have recently made substantial advancements. However, their potential remains untapped, primarily due to the limited quantity and suboptimal quality of 3D datasets. Current approaches attempt to transfer knowledge from 2D MLLMs to expand 3D instruction data, but still face modality and domain gaps. To this end, we introduce PiSA-Engine (Point-Self-Augmented-Engine), a new framework for generating instruction point-language datasets enriched with 3D spatial semantics. We observe that existing 3D MLLMs offer a comprehensive understanding of point clouds for annotation, while 2D MLLMs excel at cross-validation by providing complementary information. By integrating holistic 2D and 3D insights from off-the-shelf MLLMs, PiSA-Engine enables a continuous cycle of high-quality data generation. We select PointLLM as the baseline and adopt this co-evolution training framework to develop an enhanced 3D MLLM, termed PointLLM-PiSA. Additionally, we identify limitations in previous 3D benchmarks, which often feature coarse language captions and insufficient category diversity, resulting in inaccurate evaluations. To address this gap, we further introduce PiSA-Bench, a comprehensive 3D benchmark covering six key aspects with detailed and diverse labels. Experimental results demonstrate PointLLM-PiSA’s state-of-the-art performance in zero-shot 3D object captioning and generative classification on our PiSA-Bench, achieving significant improvements of 46.45% (+8.33%) and 63.75% (+16.25%), respectively. We will release the code, datasets, and benchmark.
zh

[CV-34] NeighborRetr: Balancing Hub Centrality in Cross-Modal Retrieval CVPR2025

【速读】：该论文致力于解决跨模态检索中的“hubness问题”，即在不同模态（如视觉与文本数据）之间建立语义连接时，少量样本（hub节点）过度主导最近邻搜索，导致表示偏差和检索准确性下降的问题。现有方法通常通过后处理归一化技术缓解此问题，但这些方法依赖于先验数据分布，在实际场景中可能不适用。论文的关键解决方案是提出NeighborRetr方法，在训练阶段直接缓解hubness问题，并通过有效平衡hub节点的学习以及自适应调整各类邻居关系，不仅解决了hubness挑战，还提升了检索性能，在多个跨模态检索基准测试中达到了最先进的结果，同时展现出对分布偏移的新领域强泛化能力。

链接: https://arxiv.org/abs/2503.10526
作者: Zengrong Lin,Zheng Wang,Tianwen Qian,Pan Mu,Sixian Chan,Cong Bai
机构: College of Computer Science and Technology, Zhejiang University of Technology (浙江工业大学), Zhejiang, China; College of Computer Science and Technology, East China Normal University (华东师范大学), Shanghai, China
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at CVPR 2025, 18 pages, 7 figures, 13 tables

点击查看摘要

Abstract:Cross-modal retrieval aims to bridge the semantic gap between different modalities, such as visual and textual data, enabling accurate retrieval across them. Despite significant advancements with models like CLIP that align cross-modal representations, a persistent challenge remains: the hubness problem, where a small subset of samples (hubs) dominate as nearest neighbors, leading to biased representations and degraded retrieval accuracy. Existing methods often mitigate hubness through post-hoc normalization techniques, relying on prior data distributions that may not be practical in real-world scenarios. In this paper, we directly mitigate hubness during training and introduce NeighborRetr, a novel method that effectively balances the learning of hubs and adaptively adjusts the relations of various kinds of neighbors. Our approach not only mitigates the hubness problem but also enhances retrieval performance, achieving state-of-the-art results on multiple cross-modal retrieval benchmarks. Furthermore, NeighborRetr demonstrates robust generalization to new domains with substantial distribution shifts, highlighting its effectiveness in real-world applications. We make our code publicly available at: this https URL .
zh

[CV-35] Interactive Multimodal Fusion with Temporal Modeling

【速读】：该论文旨在解决在自然场景下（in-the-wild）面部表情的Valence-Arousal (VA) 估计问题。论文的关键在于提出了一种多模态框架，整合视觉和听觉信息来提升VA估计的准确性。具体而言，视觉分支利用预训练的ResNet模型提取面部图像的空间特征，而音频分支则使用预训练的VGG模型提取语音信号的VGGish和LogMel特征，并通过Temporal Convolutional Networks (TCNs) 进行时间建模。此外，通过跨模态注意力机制实现视觉特征与音频特征的交互，最终将融合后的特征输入回归层以预测VA值。这一方法在Aff-Wild2数据集上的表现证明了其在自然场景下有效实现多模态特征融合的能力。

链接: https://arxiv.org/abs/2503.10523
作者: Jun Yu,Yongqi Wang,Lei Wang,Yang Zheng,Shengfan Xu
机构: University of Science and Technology of China (中国科学技术大学); Macau University of Science and Technology (澳门科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:This paper presents our method for the estimation of valence-arousal (VA) in the 8th Affective Behavior Analysis in-the-Wild (ABAW) competition. Our approach integrates visual and audio information through a multimodal framework. The visual branch uses a pre-trained ResNet model to extract spatial features from facial images. The audio branches employ pre-trained VGG models to extract VGGish and LogMel features from speech signals. These features undergo temporal modeling using Temporal Convolutional Networks (TCNs). We then apply cross-modal attention mechanisms, where visual features interact with audio features through query-key-value attention structures. Finally, the features are concatenated and passed through a regression layer to predict valence and arousal. Our method achieves competitive performance on the Aff-Wild2 dataset, demonstrating effective multimodal fusion for VA estimation in-the-wild.
zh

[CV-36] AudioX: Diffusion Transformer for Anything-to-Audio Generation

【速读】：该论文旨在解决音频与音乐生成领域中现有方法面临的三大主要问题：1) 缺乏跨模态统一能力，各方法孤立运作；2) 高质量多模态训练数据匮乏；3) 难以有效整合多样化输入。为应对这些挑战，论文提出AudioX，这是一种基于Diffusion Transformer的统一模型，适用于Anything-to-Audio和Music Generation任务。其关键创新在于引入了一种多模态掩码训练策略，通过跨模态掩码输入并迫使模型从掩码后的输入中学习，从而生成稳健且统一的跨模态表示。此外，为了缓解数据稀缺问题，作者构建了两个大规模数据集：基于VGGSound的vggsound-caps（包含19万条音频描述）以及基于V2M的V2M-caps（包含600万条音乐描述）。实验结果表明，AudioX不仅在性能上媲美甚至超越了现有的专用模型，还展现出强大的多模态处理能力和统一架构下的多样化生成任务适应性。

链接: https://arxiv.org/abs/2503.10522
作者: Zeyue Tian,Yizhu Jin,Zhaoyang Liu,Ruibin Yuan,Xu Tan,Qifeng Chen,Wei Xue,Yike Guo
机构: Hong Kong University of Science and Technology (香港科技大学); Moonshot AI (Moonshot AI)
类目: Multimedia (cs.MM); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注: The code and datasets will be available at this https URL

点击查看摘要

Abstract:Audio and music generation have emerged as crucial tasks in many applications, yet existing approaches face significant limitations: they operate in isolation without unified capabilities across modalities, suffer from scarce high-quality, multi-modal training data, and struggle to effectively integrate diverse inputs. In this work, we propose AudioX, a unified Diffusion Transformer model for Anything-to-Audio and Music Generation. Unlike previous domain-specific models, AudioX can generate both general audio and music with high quality, while offering flexible natural language control and seamless processing of various modalities including text, video, image, music, and audio. Its key innovation is a multi-modal masked training strategy that masks inputs across modalities and forces the model to learn from masked inputs, yielding robust and unified cross-modal representations. To address data scarcity, we curate two comprehensive datasets: vggsound-caps with 190K audio captions based on the VGGSound dataset, and V2M-caps with 6 million music captions derived from the V2M dataset. Extensive experiments demonstrate that AudioX not only matches or outperforms state-of-the-art specialized models, but also offers remarkable versatility in handling diverse input modalities and generation tasks within a unified architecture. The code and datasets will be available at this https URL
zh

[CV-37] CountPath: Automating Frag ment Counting in Digital Pathology

【速读】：该论文旨在解决传统方法在病理切片标本碎片数量核查过程中耗时费力且主观性强、变异性大的问题。解决方案的关键在于利用自动化技术，通过结合YOLOv9和Vision Transformer模型实现碎片计数的智能化处理，从而提供一种与专家评估性能相当的可靠且高效的方法，同时显著降低人为因素带来的不确定性。

链接: https://arxiv.org/abs/2503.10520
作者: Ana Beatriz Vieira,Maria Valente,Diana Montezuma,Tomé Albuquerque,Liliana Ribeiro,Domingos Oliveira,João Monteiro,Sofia Gonçalves,Isabel M. Pinto,Jaime S. Cardoso,Arlindo L. Oliveira
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 10 pages, 3 figures

点击查看摘要

Abstract:Quality control of medical images is a critical component of digital pathology, ensuring that diagnostic images meet required standards. A pre-analytical task within this process is the verification of the number of specimen fragments, a process that ensures that the number of fragments on a slide matches the number documented in the macroscopic report. This step is important to ensure that the slides contain the appropriate diagnostic material from the grossing process, thereby guaranteeing the accuracy of subsequent microscopic examination and diagnosis. Traditionally, this assessment is performed manually, requiring significant time and effort while being subject to significant variability due to its subjective nature. To address these challenges, this study explores an automated approach to fragment counting using the YOLOv9 and Vision Transformer models. Our results demonstrate that the automated system achieves a level of performance comparable to expert assessments, offering a reliable and efficient alternative to manual counting. Additionally, we present findings on interobserver variability, showing that the automated approach achieves an accuracy of 86%, which falls within the range of variation observed among experts (82-88%), further supporting its potential for integration into routine pathology workflows.
zh

[CV-38] Hoi2Anomaly: An Explainable Anomaly Detection Approach Guided by Human-Object Interaction

【速读】：该论文旨在解决现有图像异常检测（Image Anomaly Detection, IAD）方法在细粒度语义信息不足的问题，导致检测结果易受机器错觉影响且缺乏充分解释性。为应对这一挑战，论文提出了一种名为Hoi2Anomaly的新方法，其关键在于构建包含异常场景下人-物交互（Human-Object Interaction, HOI）对的多模态指令微调数据集，并通过在威胁场景中训练HOI提取器实现异常行为与实体的精确定位与匹配。此外，利用视觉语言预训练（Visual Language Pretraining, VLP）框架进行微调以生成检测结果的解释性内容，从而提升检测精度与可解释性。

链接: https://arxiv.org/abs/2503.10508
作者: Yuhan Wang,Cheng Liu,Daou Zhang,Weichao Wu
机构: University of Beijing Institute of Technology (北京理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:In the domain of Image Anomaly Detection (IAD), Existing methods frequently exhibit a paucity of fine-grained, interpretable semantic information, resulting in the detection of anomalous entities or activities that are susceptible to machine illusions. This deficiency often leads to the detection of anomalous entities or actions that are susceptible to machine illusions and lack sufficient explanation. In this thesis, we propose a novel approach to anomaly detection, termed Hoi2Anomaly, which aims to achieve precise discrimination and localization of anomalies. The proposed methodology involves the construction of a multi-modal instruction tuning dataset comprising human-object interaction (HOI) pairs in anomalous scenarios. Second, we have trained an HOI extractor in threat scenarios to localize and match anomalous actions and entities. Finally, explanatory content is generated for the detected anomalous HOI by fine-tuning the visual language pretraining (VLP) framework. The experimental results demonstrate that Hoi2Anomaly surpasses existing generative approaches in terms of precision and explainability. We will release Hoi2Anomaly for the advancement of the field of anomaly detection.
zh

[CV-39] okenCarve: Information-Preserving Visual Token Compression in Multimodal Large Language Models

【速读】：该论文旨在解决多模态大语言模型（MLLMs）在处理多模态数据输入，尤其是视觉标记时，因高计算成本导致的效率瓶颈问题。现有基于训练的标记压缩方法虽能提升推理效率，但需要昂贵的重新训练过程；而无训练的压缩方法在大幅减少标记数量时难以保持性能。论文的关键发现是MLLM的性能下降与注意力输出矩阵中信息快速丢失密切相关，这一见解引入了一种新的信息保留视角，使得即使在极端标记压缩情况下也能维持性能。基于此，论文提出了一种名为TokenCarve的无训练、即插即用的两阶段标记压缩框架，其关键在于第一阶段采用信息保留引导选择（IPGS）策略去除低信息量标记，第二阶段利用IPGS指导标记合并以最小化信息损失，从而有效缓解了上述挑战。

链接: https://arxiv.org/abs/2503.10501
作者: Xudong Tan,Peng Ye,Chongjun Tu,Jianjian Cao,Yaoxin Yang,Lin Zhang,Dongzhan Zhou,Tao Chen
机构: Fudan University (复旦大学); Shanghai Artificial Intelligence Laboratory (上海人工智能实验室); The Chinese University of Hong Kong (香港中文大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Multimodal Large Language Models (MLLMs) are becoming increasingly popular, while the high computational cost associated with multimodal data input, particularly from visual tokens, poses a significant challenge. Existing training-based token compression methods improve inference efficiency but require costly retraining, while training-free methods struggle to maintain performance when aggressively reducing token counts. In this study, we reveal that the performance degradation of MLLM closely correlates with the accelerated loss of information in the attention output matrix. This insight introduces a novel information-preserving perspective, making it possible to maintain performance even under extreme token compression. Based on this finding, we propose TokenCarve, a training-free, plug-and-play, two-stage token compression framework. The first stage employs an Information-Preservation-Guided Selection (IPGS) strategy to prune low-information tokens, while the second stage further leverages IPGS to guide token merging, minimizing information loss. Extensive experiments on 11 datasets and 2 model variants demonstrate the effectiveness of TokenCarve. It can even reduce the number of visual tokens to 22.2% of the original count, achieving a 1.23x speedup in inference, a 64% reduction in KV cache storage, and only a 1.54% drop in accuracy. Our code is available at this https URL.
zh

[CV-40] OmniSTVG: Toward Spatio-Temporal Omni-Object Video Grounding

【速读】：本文提出了一种新的时空全域目标视频定位任务（Spatio-Temporal Omni-Object Video Grounding, OmniSTVG），旨在从视频中定位文本查询中提到的所有空间和时间上的目标。与仅定位单一目标的经典时空视频接地任务相比，OmniSTVG不仅能够定位任意数量的文本提及目标，还能定位查询中涉及的交互对象，使其在实际场景中的应用更加灵活且实用。为推动OmniSTVG的研究，作者构建了一个名为BOSTVG的大规模基准数据集，包含10,018个视频和1020万个帧，覆盖了来自多样化场景的287个类别。每个序列配有一个自由形式的文本查询，包含从1到10个不等的目标，并通过人工标注确保高质量。此外，为了鼓励未来研究，作者提出了一种名为OmniTube的简单而有效的解决方案，其灵感来源于基于Transformer的经典STVG方法，并针对OmniSTVG进行了专门设计，展示了有前景的结果。论文的关键在于通过构建BOSTVG基准和提出OmniTube方法，突破经典STVG的局限性，实现对查询中所有出现目标的全面定位，从而开辟了STVG的新方向。

链接: https://arxiv.org/abs/2503.10500
作者: Jiali Yao,Xinran Deng,Xin Gu,Mengrui Dai,Bing Fan,Zhipeng Zhang,Yan Huang,Heng Fan,Libo Zhang
机构: University of Chinese Academy of Sciences (中国科学院大学); North China University of Technology (华北理工大学); University of North Texas (北德克萨斯大学); Shanghai Jiao Tong University (上海交通大学); Institute of Software Chinese Academy of Sciences (中国科学院软件研究所)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:In this paper, we propose spatio-temporal omni-object video grounding, dubbed OmniSTVG, a new STVG task that aims at localizing spatially and temporally all targets mentioned in the textual query from videos. Compared to classic STVG locating only a single target, OmniSTVG enables localization of not only an arbitrary number of text-referred targets but also their interacting counterparts in the query from the video, making it more flexible and practical in real scenarios for comprehensive understanding. In order to facilitate exploration of OmniSTVG, we introduce BOSTVG, a large-scale benchmark dedicated to OmniSTVG. Specifically, our BOSTVG consists of 10,018 videos with 10.2M frames and covers a wide selection of 287 classes from diverse scenarios. Each sequence in BOSTVG, paired with a free-form textual query, encompasses a varying number of targets ranging from 1 to 10. To ensure high quality, each video is manually annotated with meticulous inspection and refinement. To our best knowledge, BOSTVG is to date the first and the largest benchmark for OmniSTVG. To encourage future research, we introduce a simple yet effective approach, named OmniTube, which, drawing inspiration from Transformer-based STVG methods, is specially designed for OmniSTVG and demonstrates promising results. By releasing BOSTVG, we hope to go beyond classic STVG by locating every object appearing in the query for more comprehensive understanding, opening up a new direction for STVG. Our benchmark, model, and results will be released at this https URL.
zh

[CV-41] Streaming Generation of Co-Speech Gestures via Accelerated Rolling Diffusion

【速读】：该论文旨在解决实时生成连贯且多样化的共发言语手势（co-speech gestures）的问题，重点在于提升生成效率的同时保持运动的真实性和一致性。论文的关键创新在于引入了“加速滚动扩散”（Accelerated Rolling Diffusion）框架，并进一步提出了“滚动扩散阶梯加速”（Rolling Diffusion Ladder Acceleration, RDLA）方法。RDLA通过将噪声调度重构为逐步梯形结构，实现了多帧同时去噪，显著提高了采样效率，同时保证了运动的一致性，达到了最高2倍的速度提升，且视觉保真度和时间连贯性均得以维持。这一方案可广泛应用于基于扩散模型的任意手势生成任务，展现出其作为通用高效实时高保真共发言语手势合成解决方案的有效性。

链接: https://arxiv.org/abs/2503.10488
作者: Evgeniia Vu,Andrei Boiarov,Dmitry Vetrov
机构: Constructor University (构造者大学), Constructor Tech (构造者科技)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:Generating co-speech gestures in real time requires both temporal coherence and efficient sampling. We introduce Accelerated Rolling Diffusion, a novel framework for streaming gesture generation that extends rolling diffusion models with structured progressive noise scheduling, enabling seamless long-sequence motion synthesis while preserving realism and diversity. We further propose Rolling Diffusion Ladder Acceleration (RDLA), a new approach that restructures the noise schedule into a stepwise ladder, allowing multiple frames to be denoised simultaneously. This significantly improves sampling efficiency while maintaining motion consistency, achieving up to a 2x speedup with high visual fidelity and temporal coherence. We evaluate our approach on ZEGGS and BEAT, strong benchmarks for real-world applicability. Our framework is universally applicable to any diffusion-based gesture generation model, transforming it into a streaming approach. Applied to three state-of-the-art methods, it consistently outperforms them, demonstrating its effectiveness as a generalizable and efficient solution for real-time, high-fidelity co-speech gesture synthesis.
zh

[CV-42] OODD: Test-time Out-of-Distribution Detection with Dynamic Dictionary

【速读】：该论文旨在解决深度学习模型在测试时对分布外（Out-of-Distribution, OOD）样本检测的挑战，特别是在测试时的OOD样本与训练阶段的异常样本存在显著差异的情况下。论文提出了一种名为OODD的新方法，其关键在于无需微调即可动态维护和更新一个基于优先队列的OOD字典，并结合一种用于分布内（In-Distribution, ID）样本的信息性采样策略。此外，为了确保早期测试阶段的稳定性，提出了双OOD稳定机制，利用从ID数据衍生的战略性生成的异常样本。实验结果表明，OODD在CIFAR-100远场OOD检测任务中相比现有最先进的方法提升了26.0%的FPR95指标，并且优化后的KNN-based OOD检测框架实现了3倍的速度提升同时保持检测性能。

链接: https://arxiv.org/abs/2503.10468
作者: Yifeng Yang,Lin Zhu,Zewen Sun,Hengyu Liu,Qinying Gu,Nanyang Ye
机构: Shanghai Jiao Tong University (上海交通大学); Tianjin University (天津大学); The Chinese University of Hong Kong (香港中文大学); Shanghai Artificial Intelligence Laboratory (上海人工智能实验室)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Out-of-distribution (OOD) detection remains challenging for deep learning models, particularly when test-time OOD samples differ significantly from training outliers. We propose OODD, a novel test-time OOD detection method that dynamically maintains and updates an OOD dictionary without fine-tuning. Our approach leverages a priority queue-based dictionary that accumulates representative OOD features during testing, combined with an informative inlier sampling strategy for in-distribution (ID) samples. To ensure stable performance during early testing, we propose a dual OOD stabilization mechanism that leverages strategically generated outliers derived from ID data. To our best knowledge, extensive experiments on the OpenOOD benchmark demonstrate that OODD significantly outperforms existing methods, achieving a 26.0% improvement in FPR95 on CIFAR-100 Far OOD detection compared to the state-of-the-art approach. Furthermore, we present an optimized variant of the KNN-based OOD detection framework that achieves a 3x speedup while maintaining detection performance.
zh

[CV-43] Flow-NeRF: Joint Learning of Geometry Poses and Dense Flow within Unified Neural Representations

【速读】：该论文旨在解决在神经辐射场（Neural Radiance Fields, NeRF）中无姿态先验的情况下学习精确场景重建的挑战，主要由于其固有的几何歧义性。现有方法要么依赖于对应关系先验进行正则化，要么使用现成的流估计器推导分析姿态。然而，联合学习场景几何、相机姿态和密集光流在一个统一的神经表示中的潜力尚未被充分探索。论文提出Flow-NeRF，这是一种统一框架，能够同时优化场景几何、相机姿态和密集光流。解决方案的关键在于设计并构建了一个基于姿态的双向映射以实现流估计，并开发了一种有效的特征增强机制，将规范空间特征传递到世界空间表示中，显著提升场景几何质量。

链接: https://arxiv.org/abs/2503.10464
作者: Xunzhi Zheng,Dan Xu
机构: Department of Computer Science and Engineering, HKUST (香港科技大学计算机科学与工程系)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Learning accurate scene reconstruction without pose priors in neural radiance fields is challenging due to inherent geometric ambiguity. Recent development either relies on correspondence priors for regularization or uses off-the-shelf flow estimators to derive analytical poses. However, the potential for jointly learning scene geometry, camera poses, and dense flow within a unified neural representation remains largely unexplored. In this paper, we present Flow-NeRF, a unified framework that simultaneously optimizes scene geometry, camera poses, and dense optical flow all on-the-fly. To enable the learning of dense flow within the neural radiance field, we design and build a bijective mapping for flow estimation, conditioned on pose. To make the scene reconstruction benefit from the flow estimation, we develop an effective feature enhancement mechanism to pass canonical space features to world space representations, significantly enhancing scene geometry. We validate our model across four important tasks, i.e., novel view synthesis, depth estimation, camera pose prediction, and dense optical flow estimation, using several datasets. Our approach surpasses previous methods in almost all metrics for novel-view view synthesis and depth estimation and yields both qualitatively sound and quantitatively accurate novel-view flow. Our project page is this https URL.
zh

[CV-44] Consistent multi-animal pose estimation in cattle using dynamic Kalman filter based tracking

【速读】：该论文旨在解决现有姿态估计算法在动物行为研究中的局限性，特别是其对单一研究目标的高度定制化导致数据重用可能性受限的问题。论文提出通过结合姿态估计与动物跟踪，构建能够同时捕捉空间和时间维度的高级行为表征，从而避免为不同研究需求反复开发特定算法的需求。论文的关键解决方案是引入KeySORT（Keypoint Simple and Online Realtime Tracking）算法，该算法利用自适应卡尔曼滤波器以无边界框的方式构建轨迹片段，显著提高了检测关节点的时间一致性。此外，通过使用KeySORT构建骨骼结构，生成的姿态坐标时间一致性得到了大幅改善，为动物行为的自动化监测提供了新的机会。

链接: https://arxiv.org/abs/2503.10450
作者: Maarten Perneel,Ines Adriaens,Ben Aernouts,Jan Verwaeren
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Over the past decade, studying animal behaviour with the help of computer vision has become more popular. Replacing human observers by computer vision lowers the cost of data collection and therefore allows to collect more extensive datasets. However, the majority of available computer vision algorithms to study animal behaviour is highly tailored towards a single research objective, limiting possibilities for data reuse. In this perspective, pose-estimation in combination with animal tracking offers opportunities to yield a higher level representation capturing both the spatial and temporal component of animal behaviour. Such a higher level representation allows to answer a wide variety of research questions simultaneously, without the need to develop repeatedly tailored computer vision algorithms. In this paper, we therefore first cope with several weaknesses of current pose-estimation algorithms and thereafter introduce KeySORT (Keypoint Simple and Online Realtime Tracking). KeySORT deploys an adaptive Kalman filter to construct tracklets in a bounding-box free manner, significantly improving the temporal consistency of detected keypoints. In this paper, we focus on pose estimation in cattle, but our methodology can easily be generalised to any other animal species. Our test results indicate our algorithm is able to detect up to 80% of the ground truth keypoints with high accuracy, with only a limited drop in performance when daylight recordings are compared to nightvision recordings. Moreover, by using KeySORT to construct skeletons, the temporal consistency of generated keypoint coordinates was largely improved, offering opportunities with regard to automated behaviour monitoring of animals.
zh

[CV-45] Learning Disease State from Noisy Ordinal Disease Progression Labels

【速读】：本文旨在解决利用带有噪声的序数标签（如“更好”、“更差”或“稳定”）进行疾病进展建模的问题，特别是针对新生血管年龄相关性黄斑变性（nAMD）。研究将不同医疗访问之间的疾病进展建模问题转化为带有序数标签的分类任务。为增强泛化能力，提出的方法包括独立图像编码、反对称logit空间等变性以及序数尺度感知。此外，通过学习不确定性估计以实现损失重新加权来应对标签噪声。关键在于构建一种可解释的疾病表示方法，即使仅使用包含序数疾病进展标签的图像对进行训练，也能在单张图像的nAMD活动分类任务中实现强大的少样本性能。

链接: https://arxiv.org/abs/2503.10440
作者: Gustav Schmidt,Holger Heidrich,Philipp Berens,Sarah Müller
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Learning from noisy ordinal labels is a key challenge in medical imaging. In this work, we ask whether ordinal disease progression labels (better, worse, or stable) can be used to learn a representation allowing to classify disease state. For neovascular age-related macular degeneration (nAMD), we cast the problem of modeling disease progression between medical visits as a classification task with ordinal ranks. To enhance generalization, we tailor our model to the problem setting by (1) independent image encoding, (2) antisymmetric logit space equivariance, and (3) ordinal scale awareness. In addition, we address label noise by learning an uncertainty estimate for loss re-weighting. Our approach learns an interpretable disease representation enabling strong few-shot performance for the related task of nAMD activity classification from single images, despite being trained only on image pairs with ordinal disease progression labels.
zh

[CV-46] EFC: Elastic Feature Consolidation with Prototype Re-balancing for Cold Start Exemplar-free Incremental Learning

【速读】：该论文旨在解决在Exemplar-Free Class Incremental Learning (EFCIL)框架下冷启动场景中的挑战，即在首个任务中因数据不足而难以训练出高质量主干模型的问题。这一挑战尤为困难，因为EFCIL需要高度的模型可塑性，这会导致特征漂移（Feature Drift），而在无样本（Exemplar-Free）设定下很难对此进行补偿。论文的关键解决方案是提出了一种名为Elastic Feature Consolidation++ (EFC++)的方法，通过正则化与先前任务高度相关的特征漂移方向，并利用原型（Prototypes）减少任务近期偏差来巩固特征表示。EFC++基于提出的Empirical Feature Matrix (EFM) 利用二阶近似来建模特征漂移，并引入伪度量以在重要方向上正则化特征漂移及更新高斯原型。此外，还设计了一个后训练原型再平衡阶段，用于补偿特征漂移并更新分类器。实验结果表明，EFC++在CIFAR-100、Tiny-ImageNet、ImageNet-Subset、ImageNet-1K以及DomainNet等数据集上显著优于现有方法，同时保持了模型的可塑性。

链接: https://arxiv.org/abs/2503.10439
作者: Simone Magistri,Tomaso Trinci,Albin Soutif-Cormerais,Joost van de Weijer,Andrew D. Bagdanov
机构: Media Integration and Communication Center (50134 Florence, Italy); Global Optimization Laboratory (50136 Florence, Italy); LAMP Team, Computer Vision Center (08036 Barcelona, Spain)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Submitted on July 2024. Under Review. arXiv admin note: text overlap with arXiv:2402.03917

点击查看摘要

Abstract:Exemplar-Free Class Incremental Learning (EFCIL) aims to learn from a sequence of tasks without having access to previous task data. In this paper, we consider the challenging Cold Start scenario in which insufficient data is available in the first task to learn a high-quality backbone. This is especially challenging for EFCIL since it requires high plasticity, resulting in feature drift which is difficult to compensate for in the exemplar-free setting. To address this problem, we propose an effective approach to consolidate feature representations by regularizing drift in directions highly relevant to previous tasks and employs prototypes to reduce task-recency bias. Our approach, which we call Elastic Feature Consolidation++ (EFC++) exploits a tractable second-order approximation of feature drift based on a proposed Empirical Feature Matrix (EFM). The EFM induces a pseudo-metric in feature space which we use to regularize feature drift in important directions and to update Gaussian prototypes. In addition, we introduce a post-training prototype re-balancing phase that updates classifiers to compensate for feature drift. Experimental results on CIFAR-100, Tiny-ImageNet, ImageNet-Subset, ImageNet-1K and DomainNet demonstrate that EFC++ is better able to learn new tasks by maintaining model plasticity and significantly outperform the state-of-the-art.
zh

[CV-47] 4D LangSplat: 4D Language Gaussian Splatting via Multimodal Large Language Models CVPR2025

【速读】：该论文旨在解决在动态场景中实现时间无关或时间敏感的开放词汇语言查询的问题，现有方法如LangSplat虽能在静态场景中有效学习3D语言场，但无法处理包含时间动态变化的4D场景。论文的关键在于提出4D LangSplat方法，通过多模态大型语言模型（Multimodal Large Language Models, MLLMs）从对象级视频字幕生成文本，而非依赖视觉特征学习语言场。具体而言，引入了一种多模态对象级视频提示方法，结合视觉与文本提示引导MLLMs生成高质量的时间一致描述，这些描述经大语言模型编码为高精度句子嵌入，作为像素对齐的对象特定特征监督，支持共享嵌入空间中的开放词汇查询。此外，针对4D场景中物体状态平滑过渡的特点，设计了一种状态可变形网络以有效建模时间上的连续变化。实验结果表明，4D LangSplat在多个基准测试中实现了精确且高效的时间敏感及时间无关开放词汇查询。

链接: https://arxiv.org/abs/2503.10437
作者: Wanhua Li,Renping Zhou,Jiawei Zhou,Yingwei Song,Johannes Herter,Minghan Qin,Gao Huang,Hanspeter Pfister
机构: Harvard University (哈佛大学); Tsinghua University (清华大学); Stony Brook University (石溪大学); Brown University (布朗大学); ETH Zürich (苏黎世联邦理工学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: CVPR 2025. Project Page: this https URL

点击查看摘要

Abstract:Learning 4D language fields to enable time-sensitive, open-ended language queries in dynamic scenes is essential for many real-world applications. While LangSplat successfully grounds CLIP features into 3D Gaussian representations, achieving precision and efficiency in 3D static scenes, it lacks the ability to handle dynamic 4D fields as CLIP, designed for static image-text tasks, cannot capture temporal dynamics in videos. Real-world environments are inherently dynamic, with object semantics evolving over time. Building a precise 4D language field necessitates obtaining pixel-aligned, object-wise video features, which current vision models struggle to achieve. To address these challenges, we propose 4D LangSplat, which learns 4D language fields to handle time-agnostic or time-sensitive open-vocabulary queries in dynamic scenes efficiently. 4D LangSplat bypasses learning the language field from vision features and instead learns directly from text generated from object-wise video captions via Multimodal Large Language Models (MLLMs). Specifically, we propose a multimodal object-wise video prompting method, consisting of visual and text prompts that guide MLLMs to generate detailed, temporally consistent, high-quality captions for objects throughout a video. These captions are encoded using a Large Language Model into high-quality sentence embeddings, which then serve as pixel-aligned, object-specific feature supervision, facilitating open-vocabulary text queries through shared embedding spaces. Recognizing that objects in 4D scenes exhibit smooth transitions across states, we further propose a status deformable network to model these continuous changes over time effectively. Our results across multiple benchmarks demonstrate that 4D LangSplat attains precise and efficient results for both time-sensitive and time-agnostic open-vocabulary queries.
zh

[CV-48] Finetuning Generative Trajectory Model with Reinforcement Learning from Human Feedback

【速读】：该论文旨在解决生成式轨迹模型在动态环境中难以捕捉人类驾驶风格细微变化的问题，主要由于数据集偏差和分布偏移导致。为了解决这一挑战，论文提出TrajHF（human feedback-driven fine-tuning framework），通过结合多条件去噪器与带有人类反馈的强化学习，超越传统的模仿学习，实现多模态轨迹生成的精调。这种方法能够在保持安全性和可行性约束的同时，更好地与多样化的人类驾驶偏好对齐。关键在于引入人类反馈驱动的优化机制，以提升生成轨迹的适应性和个性化水平。实验结果显示，TrajHF在NavSim基准测试中的PDMS达到93.95，显著优于其他方法，并开创了自动驾驶领域个性化和自适应轨迹生成的新范式。

链接: https://arxiv.org/abs/2503.10434
作者: Derun Li,Jianwei Ren,Yue Wang,Xin Wen,Pengxiang Li,Leimeng Xu,Kun Zhan,Zhongpu Xia,Peng Jia,Xianpeng Lang,Ningyi Xu,Hang Zhao
机构: Shanghai Qi Zhi Institute (上海之质研究所); Shanghai Jiao Tong University (上海交通大学); LiAuto; Tsinghua University (清华大学)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 10 pages, 5 figures

点击查看摘要

Abstract:Generating human-like and adaptive trajectories is essential for autonomous driving in dynamic environments. While generative models have shown promise in synthesizing feasible trajectories, they often fail to capture the nuanced variability of human driving styles due to dataset biases and distributional shifts. To address this, we introduce TrajHF, a human feedback-driven finetuning framework for generative trajectory models, designed to align motion planning with diverse driving preferences. TrajHF incorporates multi-conditional denoiser and reinforcement learning with human feedback to refine multi-modal trajectory generation beyond conventional imitation learning. This enables better alignment with human driving preferences while maintaining safety and feasibility constraints. TrajHF achieves PDMS of 93.95 on NavSim benchmark, significantly exceeding other methods. TrajHF sets a new paradigm for personalized and adaptable trajectory generation in autonomous driving.
zh

[CV-49] Improving Medical Waste Classification with Hybrid Capsule Networks

【速读】：该论文旨在解决医疗废物分类效率与准确性不足的问题，以应对因不当处理医疗废物而引发的环境和公共健康风险。论文的关键解决方案在于将胶囊网络（Capsule Networks）与预训练的DenseNet模型相结合，通过引入胶囊网络来克服传统卷积神经网络在空间信息建模上的局限性，从而提升分类性能和鲁棒性。实验结果表明，在一个包含多种医疗废物图像的大规模多样化数据集上，结合胶囊网络的混合模型相较于仅使用预训练DenseNet基线模型，其F1分数从0.89提高到0.92，验证了胶囊网络的有效性。

链接: https://arxiv.org/abs/2503.10426
作者: Bennet van den Broek,Javad Pourmostafa Roshan Sharami
机构: Tilburg University (蒂尔堡大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:The improper disposal and mismanagement of medical waste pose severe environmental and public health risks, contributing to greenhouse gas emissions and the spread of infectious diseases. Efficient and accurate medical waste classification is crucial for mitigating these risks. We explore the integration of capsule networks with a pretrained DenseNet model to improve medical waste classification. To the best of our knowledge, capsule networks have not yet been applied to this task, making this study the first to assess their effectiveness. A diverse dataset of medical waste images collected from multiple public sources, is used to evaluate three model configurations: (1) a pretrained DenseNet model as a baseline, (2) a pretrained DenseNet with frozen layers combined with a capsule network, and (3) a pretrained DenseNet with unfrozen layers combined with a capsule network. Experimental results demonstrate that incorporating capsule networks improves classification performance, with F1 scores increasing from 0.89 (baseline) to 0.92 (hybrid model with unfrozen layers). This highlights the potential of capsule networks to address the spatial limitations of traditional convolutional models and improve classification robustness. While the capsule-enhanced model demonstrated improved classification performance, direct comparisons with prior studies were challenging due to differences in dataset size and diversity. Previous studies relied on smaller, domain-specific datasets, which inherently yielded higher accuracy. In contrast, our study employs a significantly larger and more diverse dataset, leading to better generalization but introducing additional classification challenges. This highlights the trade-off between dataset complexity and model performance. Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG) Cite as: arXiv:2503.10426 [cs.CV] (or arXiv:2503.10426v1 [cs.CV] for this version) https://doi.org/10.48550/arXiv.2503.10426 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh

[CV-50] Category Prompt Mamba Network for Nuclei Segmentation and Classification

【速读】：该论文旨在解决传统细胞核分割与分类模型在处理大尺寸图像时面临的两大挑战：一是相邻图像块边界处的细胞核在推理阶段容易错位；二是基于图像块的训练方法显著增加了模型的训练与推理时间。此外，尽管Mamba因其线性时间复杂度和低内存消耗受到关注，并能在全尺寸图像上训练模型，但其基于方向的扫描策略未能充分考虑类别特定特征，在类别分布不平衡的场景下表现欠佳。为此，论文提出了一种基于类别概率排序的新扫描策略，通过按置信度从高到低独立对每个类别的特征进行排序和扫描，增强了不确定样本的特征表示能力，缓解了类别不平衡带来的问题。实验结果表明，该方法在四个公开数据集上的性能优于现有最先进的方法，特别是在细胞核分割与分类任务中表现优异。

链接: https://arxiv.org/abs/2503.10422
作者: Ye Zhang,Zijie Fang,Yifeng Wang,Lingbo Zhang,Xianchao Guan,Yongbing Zhang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Nuclei segmentation and classification provide an essential basis for tumor immune microenvironment analysis. The previous nuclei segmentation and classification models require splitting large images into smaller patches for training, leading to two significant issues. First, nuclei at the borders of adjacent patches often misalign during inference. Second, this patch-based approach significantly increases the model’s training and inference time. Recently, Mamba has garnered attention for its ability to model large-scale images with linear time complexity and low memory consumption. It offers a promising solution for training nuclei segmentation and classification models on full-sized images. However, the Mamba orientation-based scanning method lacks account for category-specific features, resulting in sub-optimal performance in scenarios with imbalanced class distributions. To address these challenges, this paper introduces a novel scanning strategy based on category probability sorting, which independently ranks and scans features for each category according to confidence from high to low. This approach enhances the feature representation of uncertain samples and mitigates the issues caused by imbalanced distributions. Extensive experiments conducted on four public datasets demonstrate that our method outperforms state-of-the-art approaches, delivering superior performance in nuclei segmentation and classification tasks.
zh

[CV-51] RoCo-Sim: Enhancing Roadside Collaborative Perception through Foreground Simulation

【速读】：本文旨在解决路侧协同感知（Roadside Collaborative Perception）中的关键数据问题，如标定误差、信息稀疏性和多视角一致性等，这些问题导致现有基于模型设计的方法在最新公开数据集上的性能不佳。为显著提升路侧协同感知能力并解决上述数据挑战，论文提出了首个路侧协同感知仿真框架RoCo-Sim。该框架的关键在于通过动态前景编辑和单图像全场景风格迁移生成多样化且多视角一致的仿真路侧数据。RoCo-Sim包含四个核心组件：(1) 相机外参优化确保路侧相机3D到2D投影的准确性；(2) 新颖的多视角遮挡感知采样器(MOAS)确定3D空间中数字资产的多样布局；(3) DepthSAM创新性地从固定视角单帧图像建模前景-背景关系，保障前景的多视角一致性；(4) 可扩展后处理工具包通过风格迁移等技术生成更逼真和丰富的场景。实验表明，RoCo-Sim显著提升了路侧3D目标检测性能，在Rcooper-Intersection和TUMTraf-V2X数据集上分别实现了83.74和83.12的AP70指标，填补了路侧感知仿真领域的重要空白。

链接: https://arxiv.org/abs/2503.10410
作者: Yuwen Du,Anning Hu,Zichen Chao,Yifan Lu,Junhao Ge,Genjia Liu,Weitao Wu,Lanjun Wang,Siheng Chen
机构: Shanghai Jiao Tong University (上海交通大学); Tianjin University (天津大学); Nanjing University of Science and Technology (南京理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Roadside Collaborative Perception refers to a system where multiple roadside units collaborate to pool their perceptual data, assisting vehicles in enhancing their environmental awareness. Existing roadside perception methods concentrate on model design but overlook data issues like calibration errors, sparse information, and multi-view consistency, leading to poor performance on recent published datasets. To significantly enhance roadside collaborative perception and address critical data issues, we present the first simulation framework RoCo-Sim for road-side collaborative perception. RoCo-Sim is capable of generating diverse, multi-view consistent simulated roadside data through dynamic foreground editing and full-scene style transfer of a single image. RoCo-Sim consists of four components: (1) Camera Extrinsic Optimization ensures accurate 3D to 2D projection for roadside cameras; (2) A novel Multi-View Occlusion-Aware Sampler (MOAS) determines the placement of diverse digital assets within 3D space; (3) DepthSAM innovatively models foreground-background relationships from single-frame fixed-view images, ensuring multi-view consistency of foreground; and (4) Scalable Post-Processing Toolkit generates more realistic and enriched scenes through style transfer and other enhancements. RoCo-Sim significantly improves roadside 3D object detection, outperforming SOTA methods by 83.74 on Rcooper-Intersection and 83.12 on TUMTraf-V2X for AP70. RoCo-Sim fills a critical gap in roadside perception simulation. Code and pre-trained models will be released soon: this https URL
zh

[CV-52] RealGeneral: Unifying Visual Generation via Temporal In-Context Learning with Video Models

【速读】：该论文旨在解决视觉生成领域中统一多样化图像生成任务的核心挑战。当前方法要么依赖于针对特定任务的数据集和大规模训练，要么通过任务特定修改适配预训练图像模型，这些方法均限制了模型的通用性。为应对这一问题，论文提出以视频模型为基础的统一图像生成框架，关键在于将图像生成重新定义为条件帧预测任务，并引入RealGeneral框架，其核心创新包括：(1) 多模态对齐的统一条件嵌入模块（Unified Conditional Embedding），以及 (2) 解耦自适应LayerNorm和注意力掩码的统一流DiT块（Unified Stream DiT Block），以缓解跨模态干扰。这些设计充分利用了视频模型建模时间相关性的能力，从而实现更广泛的图像生成任务的统一与性能提升。

链接: https://arxiv.org/abs/2503.10406
作者: Yijing Lin,Mengqi Huang,Shuhan Zhuang,Zhendong Mao
机构: University of Science and Technology of China (中国科学技术大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Unifying diverse image generation tasks within a single framework remains a fundamental challenge in visual generation. While large language models (LLMs) achieve unification through task-agnostic data and generation, existing visual generation models fail to meet these principles. Current approaches either rely on per-task datasets and large-scale training or adapt pre-trained image models with task-specific modifications, limiting their generalizability. In this work, we explore video models as a foundation for unified image generation, leveraging their inherent ability to model temporal correlations. We introduce RealGeneral, a novel framework that reformulates image generation as a conditional frame prediction task, analogous to in-context learning in LLMs. To bridge the gap between video models and condition-image pairs, we propose (1) a Unified Conditional Embedding module for multi-modal alignment and (2) a Unified Stream DiT Block with decoupled adaptive LayerNorm and attention mask to mitigate cross-modal interference. RealGeneral demonstrates effectiveness in multiple important visual generation tasks, e.g., it achieves a 14.5% improvement in subject similarity for customized generation and a 10% enhancement in image quality for canny-to-image task. Project page: this https URL
zh

[CV-53] Architecture-Aware Minimization (A2M): How to Find Flat Minima in Neural Architecture Search

【速读】：本文旨在解决神经架构搜索（NAS）中架构空间几何特性对模型泛化性能的影响问题。作者研究了可微分NAS方法中常用的架构空间（如NAS-Bench-201和DARTS）的几何性质，通过定义架构路径上的邻域和损失障碍等平滑性度量指标，揭示了架构搜索景观中局部性和平坦性的特征，类似于权重空间中神经网络损失景观的已知属性。论文的关键在于提出了一种名为架构感知最小化（Architecture-Aware Minimization, A²M）的新算法框架，首次显式地将可微分NAS方法的梯度导向架构空间中的平坦极小值区域。这一方法显著提升了在CIFAR-10、CIFAR-100和ImageNet16-120等基准数据集上的泛化性能，平均测试准确率提升分别达到+3.60%、+4.60%和+3.64%，并可轻松集成到现有的可微分NAS框架中，为自动机器学习的研究与应用提供了有力工具。

链接: https://arxiv.org/abs/2503.10404
作者: Matteo Gambella,Fabrizio Pittorino,Manuel Roveri
机构: Department of Electronics, Information and Bioengineering, Politecnico di Milano (电子、信息和生物工程系, 米兰理工大学)
类目: Machine Learning (cs.LG); Disordered Systems and Neural Networks (cond-mat.dis-nn); Computer Vision and Pattern Recognition (cs.CV)
备注: 22 pages, 11 figures, 3 tables

点击查看摘要

Abstract:Neural Architecture Search (NAS) has become an essential tool for designing effective and efficient neural networks. In this paper, we investigate the geometric properties of neural architecture spaces commonly used in differentiable NAS methods, specifically NAS-Bench-201 and DARTS. By defining flatness metrics such as neighborhoods and loss barriers along paths in architecture space, we reveal locality and flatness characteristics analogous to the well-known properties of neural network loss landscapes in weight space. In particular, we find that highly accurate architectures cluster together in flat regions, while suboptimal architectures remain isolated, unveiling the detailed geometrical structure of the architecture search landscape. Building on these insights, we propose Architecture-Aware Minimization (A ^2 M), a novel analytically derived algorithmic framework that explicitly biases, for the first time, the gradient of differentiable NAS methods towards flat minima in architecture space. A ^2 M consistently improves generalization over state-of-the-art DARTS-based algorithms on benchmark datasets including CIFAR-10, CIFAR-100, and ImageNet16-120, across both NAS-Bench-201 and DARTS search spaces. Notably, A ^2 M is able to increase the test accuracy, on average across different differentiable NAS methods, by +3.60% on CIFAR-10, +4.60% on CIFAR-100, and +3.64% on ImageNet16-120, demonstrating its superior effectiveness in practice. A ^2 M can be easily integrated into existing differentiable NAS frameworks, offering a versatile tool for future research and applications in automated machine learning. We open-source our code at this https URL.
zh

[CV-54] Hyper3D: Efficient 3D Representation via Hybrid Triplane and Octree Feature for Enhanced 3D Shape Variational Auto-Encoders

【速读】：该论文旨在解决高效压缩三维形状同时保留复杂几何细节的关键挑战。现有基于三维形状的变分自编码器（Variational Autoencoders, VAEs）通常采用均匀点采样和一维/二维潜在表示（如向量集或三平面），导致因表面覆盖不足及潜在空间缺乏显式三维表示而产生显著的几何细节损失。尽管已有工作探索三维潜在表示，但其大尺度限制了高分辨率编码与高效训练。论文提出的解决方案关键是引入Hyper3D，通过集成混合三平面和八叉树特征的高效三维表示来增强VAE重建能力。首先，采用基于八叉树的特征表示嵌入网格信息到网络中，缓解均匀点采样在捕捉网格表面几何分布上的局限性；其次，提出一种整合高分辨率三平面与低分辨率三维网格的混合潜在空间表示，不仅弥补了显式三维表示的缺失，还利用三平面保留高分辨率细节。实验结果表明，Hyper3D相比传统表示方法能够以更高的保真度和更精细的细节重建三维形状，适用于三维生成管道。

链接: https://arxiv.org/abs/2503.10403
作者: Jingyu Guo,Sensen Gao,Jia-Wang Bian,Wanhu Sun,Heliang Zheng,Rongfei Jia,Mingming Gong
机构: The University of Melbourne (墨尔本大学); Sensory Universe; Mohammed Bin Zayed University of Artificial Intelligence (穆罕默德·本·扎耶德人工智能大学); The Chinese University of Hong Kong, ShenZhen (香港中文大学，深圳)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Recent 3D content generation pipelines often leverage Variational Autoencoders (VAEs) to encode shapes into compact latent representations, facilitating diffusion-based generation. Efficiently compressing 3D shapes while preserving intricate geometric details remains a key challenge. Existing 3D shape VAEs often employ uniform point sampling and 1D/2D latent representations, such as vector sets or triplanes, leading to significant geometric detail loss due to inadequate surface coverage and the absence of explicit 3D representations in the latent space. Although recent work explores 3D latent representations, their large scale hinders high-resolution encoding and efficient training. Given these challenges, we introduce Hyper3D, which enhances VAE reconstruction through efficient 3D representation that integrates hybrid triplane and octree features. First, we adopt an octree-based feature representation to embed mesh information into the network, mitigating the limitations of uniform point sampling in capturing geometric distributions along the mesh surface. Furthermore, we propose a hybrid latent space representation that integrates a high-resolution triplane with a low-resolution 3D grid. This design not only compensates for the lack of explicit 3D representations but also leverages a triplane to preserve high-resolution details. Experimental results demonstrate that Hyper3D outperforms traditional representations by reconstructing 3D shapes with higher fidelity and finer details, making it well-suited for 3D generation pipelines.
zh

[CV-55] HSEmotion Team at ABAW-8 Competition: Audiovisual Ambivalence/Hesitancy Emotional Mimicry Intensity and Facial Expression Recognition CVPR2025

【速读】：该论文旨在解决第八届野外情感行为分析（ABAW）竞赛中的三个任务：情感矛盾/犹豫预测、面部表情识别以及情感模仿强度的视频级预测。论文的关键在于提出了一种结合多模态特征的方法，通过使用预训练模型提取面部情感描述符（EmotiEffLib库）、声学特征以及从语音中识别出的文本嵌入，并将帧级特征聚合后输入到简单的分类器（如多层感知机）中进行预测。此外，在面部表情识别任务中，还利用预训练模型筛选高分视频帧以避免使用领域特定的视频分类器处理这些帧。对于情感模仿强度的视频级预测，则通过对帧级特征简单聚合并训练多层感知机实现。实验结果表明，该方法显著提升了验证指标，优于现有基线。

链接: https://arxiv.org/abs/2503.10399
作者: Andrey V. Savchenko
机构: HSE University (HSE 大学); Sber AI Lab (Sber AI 实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: submitted to ABAW CVPR 2025 Workshop

点击查看摘要

Abstract:This article presents our results for the eighth Affective Behavior Analysis in-the-Wild (ABAW) competition. We combine facial emotional descriptors extracted by pre-trained models, namely, our EmotiEffLib library, with acoustic features and embeddings of texts recognized from speech. The frame-level features are aggregated and fed into simple classifiers, e.g., multi-layered perceptron (feed-forward neural network with one hidden layer), to predict ambivalence/hesitancy and facial expressions. In the latter case, we also use the pre-trained facial expression recognition model to select high-score video frames and prevent their processing with a domain-specific video classifier. The video-level prediction of emotional mimicry intensity is implemented by simply aggregating frame-level features and training a multi-layered perceptron. Experimental results for three tasks from the ABAW challenge demonstrate that our approach significantly increases validation metrics compared to existing baselines.
zh

[CV-56] RoMA: Scaling up Mamba-based Foundation Models for Remote Sensing

【速读】：该论文旨在解决Vision Transformers (ViTs) 在遥感（Remote Sensing, RS）领域自监督预训练中的两个关键挑战：一是自注意力机制的二次复杂度限制了其在大规模模型和高分辨率图像上的可扩展性；二是现有基于Mamba架构的遥感应用局限于小规模有标签的特定领域数据集。为应对这些挑战，论文提出了一种名为RoMA的框架，通过利用大规模、多样化的无标注数据实现Mamba架构的可扩展自监督预训练。RoMA的关键创新在于引入了一种针对高分辨率图像的定制化自回归学习策略，包含两方面核心设计：1）一种结合自适应裁剪与角度嵌入的旋转感知预训练机制，以处理稀疏分布且方向任意的目标对象；2）多尺度标记预测目标，用于应对遥感影像中目标尺度极端变化的问题。实验结果验证了RoMA方法在场景分类、目标检测和语义分割任务中的优越性能，不仅提升了准确性，还增强了计算效率。

链接: https://arxiv.org/abs/2503.10392
作者: Fengxiang Wang,Hongzhen Wang,Yulin Wang,Di Wang,Mingshuo Chen,Haiyan Zhao,Yangang Sun,Shuo Wang,Long Lan,Wenjing Yang,Jing Zhang
机构: National University of Defense Technology, China (国防科技大学, 中国); Tsinghua University, China (清华大学, 中国); Wuhan University, China (武汉大学, 中国); Beijing University of Posts and Telecommunications, China (北京邮电大学, 中国)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Recent advances in self-supervised learning for Vision Transformers (ViTs) have fueled breakthroughs in remote sensing (RS) foundation models. However, the quadratic complexity of self-attention poses a significant barrier to scalability, particularly for large models and high-resolution images. While the linear-complexity Mamba architecture offers a promising alternative, existing RS applications of Mamba remain limited to supervised tasks on small, domain-specific datasets. To address these challenges, we propose RoMA, a framework that enables scalable self-supervised pretraining of Mamba-based RS foundation models using large-scale, diverse, unlabeled data. RoMA enhances scalability for high-resolution images through a tailored auto-regressive learning strategy, incorporating two key innovations: 1) a rotation-aware pretraining mechanism combining adaptive cropping with angular embeddings to handle sparsely distributed objects with arbitrary orientations, and 2) multi-scale token prediction objectives that address the extreme variations in object scales inherent to RS imagery. Systematic empirical studies validate that Mamba adheres to RS data and parameter scaling laws, with performance scaling reliably as model and data size increase. Furthermore, experiments across scene classification, object detection, and semantic segmentation tasks demonstrate that RoMA-pretrained Mamba models consistently outperform ViT-based counterparts in both accuracy and computational efficiency. The source code and pretrained models will be released at this https URL.
zh

[CV-57] CINEMA: Coherent Multi-Subject Video Generation via MLLM -Based Guidance

【速读】：该论文致力于解决个性化多主体视频生成这一挑战性问题，现有方法主要通过将主体图像映射到文本提示中的关键词来实现，但这种方式存在歧义且难以有效建模主体间的关系。论文的关键解决方案是提出了一种名为CINEMA的新框架，利用多模态大型语言模型（Multimodal Large Language Model, MLLM）实现连贯的多主体视频生成。该方法无需在主体图像与文本实体之间建立显式的对应关系，从而消除了歧义并减少了标注工作量，同时通过MLLM解释主体间关系，使模型能够利用大规模多样化数据集进行训练，并具备适应不同数量主体条件下的灵活性，显著提升了主体一致性及整体视频连贯性。

链接: https://arxiv.org/abs/2503.10391
作者: Yufan Deng,Xun Guo,Yizhi Wang,Jacob Zhiyuan Fang,Angtian Wang,Shenghai Yuan,Yiding Yang,Bo Liu,Haibin Huang,Chongyang Ma
机构: ByteDance Intelligent Creation (字节跳动智能创作); Peking University (北京大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Video generation has witnessed remarkable progress with the advent of deep generative models, particularly diffusion models. While existing methods excel in generating high-quality videos from text prompts or single images, personalized multi-subject video generation remains a largely unexplored challenge. This task involves synthesizing videos that incorporate multiple distinct subjects, each defined by separate reference images, while ensuring temporal and spatial consistency. Current approaches primarily rely on mapping subject images to keywords in text prompts, which introduces ambiguity and limits their ability to model subject relationships effectively. In this paper, we propose CINEMA, a novel framework for coherent multi-subject video generation by leveraging Multimodal Large Language Model (MLLM). Our approach eliminates the need for explicit correspondences between subject images and text entities, mitigating ambiguity and reducing annotation effort. By leveraging MLLM to interpret subject relationships, our method facilitates scalability, enabling the use of large and diverse datasets for training. Furthermore, our framework can be conditioned on varying numbers of subjects, offering greater flexibility in personalized content creation. Through extensive evaluations, we demonstrate that our approach significantly improves subject consistency, and overall video coherence, paving the way for advanced applications in storytelling, interactive media, and personalized video generation.
zh

[CV-58] A Multimodal Fusion Model Leverag ing MLP Mixer and Handcrafted Features-based Deep Learning Networks for Facial Palsy Detection PAKDD2025

【速读】：该论文旨在解决当前面部瘫痪（Facial Palsy）检测实践中存在的问题，即临床评估通常依赖于劳动密集且主观性强的方法。为应对这一挑战，论文提出了一种基于多模态融合的深度学习模型，其关键在于结合两种处理方式：利用MLP Mixer模型处理非结构化数据（如RGB图像或包含面部线段的图像），同时采用前馈神经网络处理结构化数据（如面部标志点坐标、表情特征或手工设计特征）。通过分析20名面部瘫痪患者和20名健康受试者的视频数据，研究验证了多模态融合方法的有效性，最终实现的多模态融合模型F1分数达到96.00，显著优于仅使用手工特征的前馈神经网络（82.80 F1）和基于MLP Mixer的单模态模型（89.00 F1）。

链接: https://arxiv.org/abs/2503.10371
作者: Heng Yim Nicole Oo,Min Hun Lee,Jeong Hoon Lim
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: PAKDD 2025. arXiv admin note: text overlap with arXiv:2405.16496

点击查看摘要

Abstract:Algorithmic detection of facial palsy offers the potential to improve current practices, which usually involve labor-intensive and subjective assessments by clinicians. In this paper, we present a multimodal fusion-based deep learning model that utilizes an MLP mixer-based model to process unstructured data (i.e. RGB images or images with facial line segments) and a feed-forward neural network to process structured data (i.e. facial landmark coordinates, features of facial expressions, or handcrafted features) for detecting facial palsy. We then contribute to a study to analyze the effect of different data modalities and the benefits of a multimodal fusion-based approach using videos of 20 facial palsy patients and 20 healthy subjects. Our multimodal fusion model achieved 96.00 F1, which is significantly higher than the feed-forward neural network trained on handcrafted features alone (82.80 F1) and an MLP mixer-based model trained on raw RGB images (89.00 F1).
zh

[CV-59] LUMOS: Language-Conditioned Imitation Learning with World Models ICRA

【速读】：该论文旨在解决机器人技能学习中的长 Horizon 任务挑战，特别是在未标注或无结构数据条件下实现高效迁移学习的问题。传统方法常受制于策略诱导的数据分布偏移（Policy-induced Distribution Shift），而该研究通过提出 LUMOS 框架，结合语言条件下的多任务模仿学习与世界模型的隐空间优化，有效缓解了这一问题。其解决方案的关键在于利用学习的世界模型隐空间进行 On-policy 学习，并通过图像和语言双模态回放缓冲（Hindsight Goal Relabeling）以及基于隐空间的内在奖励优化（Intrinsic Reward Optimization），在少量后见语言注释（Hindsight Language Annotations）的情况下实现技能的零样本迁移（Zero-shot Transfer）。此外，LUMOS 在训练过程中结合了潜在规划（Latent Planning），从而实现了复杂长 Horizon 任务的稳定性能提升。

链接: https://arxiv.org/abs/2503.10370
作者: Iman Nematollahi,Branton DeMoss,Akshay L Chandra,Nick Hawes,Wolfram Burgard,Ingmar Posner
机构: University of Freiburg (弗莱堡大学); University of Oxford (牛津大学); University of Technology Nuremberg (纽伦堡大学)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Accepted at the 2025 IEEE International Conference on Robotics and Automation (ICRA)

点击查看摘要

Abstract:We introduce LUMOS, a language-conditioned multi-task imitation learning framework for robotics. LUMOS learns skills by practicing them over many long-horizon rollouts in the latent space of a learned world model and transfers these skills zero-shot to a real robot. By learning on-policy in the latent space of the learned world model, our algorithm mitigates policy-induced distribution shift which most offline imitation learning methods suffer from. LUMOS learns from unstructured play data with fewer than 1% hindsight language annotations but is steerable with language commands at test time. We achieve this coherent long-horizon performance by combining latent planning with both image- and language-based hindsight goal relabeling during training, and by optimizing an intrinsic reward defined in the latent space of the world model over multiple time steps, effectively reducing covariate shift. In experiments on the difficult long-horizon CALVIN benchmark, LUMOS outperforms prior learning-based methods with comparable approaches on chained multi-task evaluations. To the best of our knowledge, we are the first to learn a language-conditioned continuous visuomotor control for a real-world robot within an offline world model. Videos, dataset and code are available at this http URL.
zh

[CV-60] Piece it Together: Part-Based Concepting with IP-Priors

【速读】：该论文旨在解决现有高级生成模型在缺乏语言条件下的局限性，特别是在视觉设计领域，当设计师需要从部分已有的视觉元素出发，通过创意整合生成完整且连贯的概念时，传统方法难以有效支持的问题。论文的关键在于提出了一种能够无缝整合用户提供的部分视觉组件，并同时采样缺失部分以生成合理且完整的概念的生成框架。这一方案的核心创新点是基于IP-Adapter+提取的未充分探索的表示空间，训练了一个轻量级的流匹配模型IP-Prior，该模型利用领域特定的先验知识合成连贯的构图。此外，论文还提出了一种基于LoRA的微调策略，显著提升了IP-Adapter+在特定任务中的提示一致性，从而缓解了重建质量和提示一致性的常见权衡问题。

链接: https://arxiv.org/abs/2503.10365
作者: Elad Richardson,Kfir Goldberg,Yuval Alaluf,Daniel Cohen-Or
机构: Tel Aviv University (特拉维夫大学); Bria AI (Bria AI)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project page available at this https URL

点击查看摘要

Abstract:Advanced generative models excel at synthesizing images but often rely on text-based conditioning. Visual designers, however, often work beyond language, directly drawing inspiration from existing visual elements. In many cases, these elements represent only fragments of a potential concept-such as an uniquely structured wing, or a specific hairstyle-serving as inspiration for the artist to explore how they can come together creatively into a coherent whole. Recognizing this need, we introduce a generative framework that seamlessly integrates a partial set of user-provided visual components into a coherent composition while simultaneously sampling the missing parts needed to generate a plausible and complete concept. Our approach builds on a strong and underexplored representation space, extracted from IP-Adapter+, on which we train IP-Prior, a lightweight flow-matching model that synthesizes coherent compositions based on domain-specific priors, enabling diverse and context-aware generations. Additionally, we present a LoRA-based fine-tuning strategy that significantly improves prompt adherence in IP-Adapter+ for a given task, addressing its common trade-off between reconstruction quality and prompt adherence.
zh

[CV-61] ConceptGuard: Continual Personalized Text-to-Image Generation with Forgetting and Confusion Mitigation CVPR2025

【速读】：该论文旨在解决连续定制（continual customization）过程中概念遗忘（concept forgetting）和概念混淆（concept confusion）的问题。现有的扩散定制方法在仅使用少量用户提供的图像时已取得显著成果，但它们通常集体处理概念，而现实应用中往往需要按顺序集成新概念。这种顺序学习可能导致先前学到的概念被遗忘。为应对这些挑战，论文提出了一种名为ConceptGuard的综合方法，其关键是结合偏移嵌入（shift embedding）、概念绑定提示（concept-binding prompts）以及记忆保持正则化（memory preservation regularization），并辅以一个优先队列（priority queue）来动态调整不同概念的重要性及出现顺序。这些策略能够动态更新、解绑以及学习先前概念之间的关系，从而减轻概念遗忘和混淆现象。实验结果表明，与基线方法相比，ConceptGuard在定量和定性分析中均表现出一致且显著的优势。

链接: https://arxiv.org/abs/2503.10358
作者: Zirun Guo,Tao Jin
机构: Zhejiang University (浙江大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Accepted at CVPR 2025

点击查看摘要

Abstract:Diffusion customization methods have achieved impressive results with only a minimal number of user-provided images. However, existing approaches customize concepts collectively, whereas real-world applications often require sequential concept integration. This sequential nature can lead to catastrophic forgetting, where previously learned concepts are lost. In this paper, we investigate concept forgetting and concept confusion in the continual customization. To tackle these challenges, we present ConceptGuard, a comprehensive approach that combines shift embedding, concept-binding prompts and memory preservation regularization, supplemented by a priority queue which can adaptively update the importance and occurrence order of different concepts. These strategies can dynamically update, unbind and learn the relationship of the previous concepts, thus alleviating concept forgetting and confusion. Through comprehensive experiments, we show that our approach outperforms all the baseline methods consistently and significantly in both quantitative and qualitative analyses.
zh

[CV-62] Object detection characteristics in a learning factory environment using YOLOv8

【速读】：本文旨在研究背景因素以及待检测目标不同特征对基于 AI 的目标检测准确性的影响，特别是在复杂工业场景中的表现。论文的关键在于通过系统性实验，使用不同的 YOLOv8 模型（You Only Look Once version 8）在多种材料和表面特性（包括部分透明及具有镜面反射的目标）上进行训练，并保持仅改变外观参数的情况下分析其检测性能。结果显示，具有相似特性的目标可能表现出截然不同的行为，有些背景成分会被正确检测到，而具有相同特征的其他成分则未被检测到。此外，研究还揭示了一些可能导致过拟合或影响检测精度的问题。为此，作者提供了包含 92 个经过训练的 YOLO 模型及其详细调查结果的数据集，以应对这些问题并促进进一步的研究。

链接: https://arxiv.org/abs/2503.10356
作者: Toni Schneidereit,Stefan Gohrenz,Michael Breuß
机构: Chair for Applied Mathematics (AI-Lab) (应用数学讲席（AI实验室）); Brandenburg University of Technology Cottbus-Senftenberg (勃兰登堡工业大学）
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:AI-based object detection, and efforts to explain and investigate their characteristics, is a topic of high interest. The impact of, e.g., complex background structures with similar appearances as the objects of interest, on the detection accuracy and, beforehand, the necessary dataset composition are topics of ongoing research. In this paper, we present a systematic investigation of background influences and different features of the object to be detected. The latter includes various materials and surfaces, partially transparent and with shiny reflections in the context of an Industry 4.0 learning factory. Different YOLOv8 models have been trained for each of the materials on different sized datasets, where the appearance was the only changing parameter. In the end, similar characteristics tend to show different behaviours and sometimes unexpected results. While some background components tend to be detected, others with the same features are not part of the detection. Additionally, some more precise conclusions can be drawn from the results. Therefore, we contribute a challenging dataset with detailed investigations on 92 trained YOLO models, addressing some issues on the detection accuracy and possible overfitting.
zh

[CV-63] Enhancing Facial Privacy Protection via Weakening Diffusion Purification

【速读】：该论文旨在解决在社交网络中个人肖像图像广泛共享所带来的隐私风险问题，特别是自动人脸识别（AFR）系统用于大规模监控所带来的威胁。论文提出了一种通过生成对抗性人脸图像来保护面部隐私的方法，以应对现有基于扩散模型的方法因扩散净化效应而导致的保护成功率（PSR）较低的问题。

解决方案的关键在于首先提出学习无条件嵌入（unconditional embeddings），以增强对抗性修改的学习能力，并利用这些嵌入指导对抗潜码的修改，从而削弱扩散净化效应。此外，论文还引入了保留身份结构的设计，以保持原始图像与生成图像之间的结构一致性，使人类观察者能够识别生成图像与原始图像具有相同的主体身份。实验结果表明，所提出的方法在CelebA-HQ和LADN两个公开数据集上的表现优于现有的面部隐私保护方法，特别是在迁移性和自然外观方面。

链接: https://arxiv.org/abs/2503.10350
作者: Ali Salar,Qing Liu,Yingli Tian,Guoying Zhao
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:The rapid growth of social media has led to the widespread sharing of individual portrait images, which pose serious privacy risks due to the capabilities of automatic face recognition (AFR) systems for mass surveillance. Hence, protecting facial privacy against unauthorized AFR systems is essential. Inspired by the generation capability of the emerging diffusion models, recent methods employ diffusion models to generate adversarial face images for privacy protection. However, they suffer from the diffusion purification effect, leading to a low protection success rate (PSR). In this paper, we first propose learning unconditional embeddings to increase the learning capacity for adversarial modifications and then use them to guide the modification of the adversarial latent code to weaken the diffusion purification effect. Moreover, we integrate an identity-preserving structure to maintain structural consistency between the original and generated images, allowing human observers to recognize the generated image as having the same identity as the original. Extensive experiments conducted on two public datasets, i.e., CelebA-HQ and LADN, demonstrate the superiority of our approach. The protected faces generated by our method outperform those produced by existing facial privacy protection approaches in terms of transferability and natural appearance.
zh

[CV-64] DreamInsert: Zero-Shot Image-to-Video Object Insertion from A Single Image

【速读】：该论文致力于解决从单张参考照片插入物体到目标背景视频中的问题，现有方法通常需要额外信息（如参考视频或物体的3D资产）来生成合成运动，而这一领域因缺乏未见运动信息仍属未探索区域。论文提出DreamInsert，首次以无训练的方式实现了图像到视频的物体插入。其关键是将物体轨迹纳入考虑，通过预测未见的物体运动，并将其与背景视频和谐融合，从而无缝生成期望的视频。此外，DreamInsert无需端到端训练或针对精心设计的图像-视频数据对进行额外微调，即可实现零样本插入，展现出简单而有效的特性。

链接: https://arxiv.org/abs/2503.10342
作者: Qi Zhao,Zhan Ma,Pan Zhou
机构: Nanjing University (南京大学); Singapore Management University (新加坡管理大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Recent developments in generative diffusion models have turned many dreams into realities. For video object insertion, existing methods typically require additional information, such as a reference video or a 3D asset of the object, to generate the synthetic motion. However, inserting an object from a single reference photo into a target background video remains an uncharted area due to the lack of unseen motion information. We propose DreamInsert, which achieves Image-to-Video Object Insertion in a training-free manner for the first time. By incorporating the trajectory of the object into consideration, DreamInsert can predict the unseen object movement, fuse it harmoniously with the background video, and generate the desired video seamlessly. More significantly, DreamInsert is both simple and effective, achieving zero-shot insertion without end-to-end training or additional fine-tuning on well-designed image-video data pairs. We demonstrated the effectiveness of DreamInsert through a variety of experiments. Leveraging this capability, we present the first results for Image-to-Video object insertion in a training-free manner, paving exciting new directions for future content creation and synthesis. The code will be released soon.
zh

[CV-65] Generative Binary Memory: Pseudo-Replay Class-Incremental Learning on Binarized Embeddings

【速读】：该论文旨在解决在动态环境中，深度神经网络（Deep Neural Networks, DNNs）需要在学习新类别（classes）的同时保留之前已学知识的挑战，这一问题属于类增量学习（Class-Incremental Learning, CIL）的研究范畴。论文提出了一种新颖的CIL伪回放方法——生成式二值记忆（Generative Binary Memory, GBM），其关键在于通过贝努利混合模型（Bernoulli Mixture Models, BMMs）在潜在的二值空间中有效地建模类别分布的多模态特性，并生成合成的二值伪示例（pseudo-exemplars）。此外，借助专门设计的特征二值化器，该方法可以适配任何传统的DNN，并且原生支持二值神经网络（Binary Neural Networks, BNNs），从而适用于嵌入式系统中的高度受限模型尺寸场景。实验结果表明，GBM在CIFAR100和TinyImageNet数据集上分别实现了比现有最佳方法平均精度高出2.9%和1.5%的表现，并在CORE50数据集上以最终准确率提升3.1%以及内存占用减少4.7倍的方式优于其他新兴的BNN-CIL方法。

链接: https://arxiv.org/abs/2503.10333
作者: Yanis Basso-Bert,Anca Molnos,Romain Lemaire,William Guicquero,Antoine Dupret
机构: Univ. Grenoble Alpes (格勒诺布尔阿尔卑斯大学); CEA (法国原子能委员会); List (CEA下属研究机构); Leti (CEA下属研究机构)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:In dynamic environments where new concepts continuously emerge, Deep Neural Networks (DNNs) must adapt by learning new classes while retaining previously acquired ones. This challenge is addressed by Class-Incremental Learning (CIL). This paper introduces Generative Binary Memory (GBM), a novel CIL pseudo-replay approach which generates synthetic binary pseudo-exemplars. Relying on Bernoulli Mixture Models (BMMs), GBM effectively models the multi-modal characteristics of class distributions, in a latent, binary space. With a specifically-designed feature binarizer, our approach applies to any conventional DNN. GBM also natively supports Binary Neural Networks (BNNs) for highly-constrained model sizes in embedded systems. The experimental results demonstrate that GBM achieves higher than state-of-the-art average accuracy on CIFAR100 (+2.9%) and TinyImageNet (+1.5%) for a ResNet-18 equipped with our binarizer. GBM also outperforms emerging CIL methods for BNNs, with +3.1% in final accuracy and x4.7 memory reduction, on CORE50.
zh

[CV-66] IDEA: Inverted Text with Cooperative Deformable Aggregation for Multi-modal Object Re-Identification CVPR2025

【速读】：该论文试图解决多模态目标重识别（ReID）领域中现有方法过度关注异构视觉特征融合而忽视文本语义信息潜在价值的问题。为此，论文提出了一个包含标准化多模态描述生成流水线的解决方案，并构建了三个新的多模态对象ReID基准数据集以促进研究。此外，针对当前方法直接聚合多模态信息时缺乏代表性局部特征选择导致冗余与复杂度高的缺陷，论文引入了IDEA框架，其核心在于Inverted Multi-modal Feature Extractor (IMFE) 和 Cooperative Deformable Aggregation (CDA) 模块。IMFE通过Modal Prefixes和InverseNet利用反转文本提供语义引导来整合多模态信息；CDA则自适应地生成采样位置，使模型能够聚焦于全局特征与判别性局部特征之间的交互作用。这些创新点共同提高了在复杂场景下多模态特征的鲁棒性。

链接: https://arxiv.org/abs/2503.10324
作者: Yuhao Wang,Yongfeng Lv,Pingping Zhang,Huchuan Lu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
备注: This work is accepted by CVPR2025. More modifications may be performed

点击查看摘要

Abstract:Multi-modal object Re-IDentification (ReID) aims to retrieve specific objects by utilizing complementary information from various modalities. However, existing methods focus on fusing heterogeneous visual features, neglecting the potential benefits of text-based semantic information. To address this issue, we first construct three text-enhanced multi-modal object ReID benchmarks. To be specific, we propose a standardized multi-modal caption generation pipeline for structured and concise text annotations with Multi-modal Large Language Models (MLLMs). Besides, current methods often directly aggregate multi-modal information without selecting representative local features, leading to redundancy and high complexity. To address the above issues, we introduce IDEA, a novel feature learning framework comprising the Inverted Multi-modal Feature Extractor (IMFE) and Cooperative Deformable Aggregation (CDA). The IMFE utilizes Modal Prefixes and an InverseNet to integrate multi-modal information with semantic guidance from inverted text. The CDA adaptively generates sampling positions, enabling the model to focus on the interplay between global features and discriminative local features. With the constructed benchmarks and the proposed modules, our framework can generate more robust multi-modal features under complex scenarios. Extensive experiments on three multi-modal object ReID benchmarks demonstrate the effectiveness of our proposed method.
zh

[CV-67] owards Fast Memory-based and Data-Efficient Vision-Language Policy

【速读】：该论文旨在解决现有视觉语言模型（Vision Language Models, VLMs）在机器人学习领域应用中的三个关键挑战：(1) 大规模模型参数导致的高昂推理成本，(2) 数据模态不匹配引起的频繁领域偏移，以及 (3) 对过去或未来经验处理能力的局限性。论文提出的解决方案是 LiteVLP，这是一种轻量级、基于记忆且通用的视觉语言策略生成模型。LiteVLP 的关键是通过在预训练的 10 亿参数 VLM 基础上进行微调，并利用小规模对话风格的机器人数据集，从而实现高效推理速度、高精度表现及卓越的长时序记忆能力，显著优于当前最先进的视觉语言策略方法。

链接: https://arxiv.org/abs/2503.10322
作者: Haoxuan Li,Sixu Yan,Yuhan Li,Xinggang Wang
机构: School of Electronic Information and Communications, Huazhong University of Science and Technology (华中科技大学电子与信息通信学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 11 pages, 7 figures, 6 tables

点击查看摘要

Abstract:Vision Language Models (VLMs) pretrained on Internet-scale vision-language data have demonstrated the potential to transfer their knowledge to robotic learning. However, the existing paradigm encounters three critical challenges: (1) expensive inference cost resulting from large-scale model parameters, (2) frequent domain shifts caused by mismatched data modalities, and (3) limited capacity to handle past or future experiences. In this work, we propose LiteVLP, a lightweight, memory-based, and general-purpose vision-language policy generation model. LiteVLP is built upon a pre-trained 1B-parameter VLM and fine-tuned on a tiny-scale and conversation-style robotic dataset. Through extensive experiments, we demonstrate that LiteVLP outperforms state-of-the-art vision-language policy on VIMA-Bench, with minimal training time. Furthermore, LiteVLP exhibits superior inference speed while maintaining exceptional high accuracy. In long-horizon manipulation tasks, LiteVLP also shows remarkable memory ability, outperforming the best-performing baseline model by 18.8%. These results highlight LiteVLP as a promising model to integrating the intelligence of VLMs into robotic learning.
zh

[CV-68] 6D Object Pose Tracking in Internet Videos for Robotic Manipulation ICLR2025

【速读】：本文旨在从互联网教学视频中提取时序一致的物体6D位姿轨迹。面对非受控的拍摄条件、细微但动态的物体运动以及未知目标物体精确模型的挑战，论文提出以下关键解决方案：首先，开发了一种无需先验知识即可估计输入图像中任意物体6D位姿的新方法，包括从大规模CAD模型库检索相似模型、将检索到的CAD模型与输入图像进行6D对齐以及确定物体相对于场景的绝对尺度。其次，通过跨视频帧精细跟踪检测到的物体，从网络视频中提取平滑的6D物体轨迹，并将其优化映射到机械臂的配置空间。第三，在YCB-V、HOPE-Video数据集及包含近似6D物体轨迹标注的手动标注教学视频数据集上全面评估和消融了所提出的6D位姿估计算法，展示了显著优于现有RGB 6D位姿估计算法的性能。最后，验证了从网络视频中估计的6D物体运动可以被转移到虚拟仿真和真实世界中的7轴机械臂，并成功应用于EPIC-KITCHENS数据集的第一人称视角视频，展示了其在具身AI应用中的潜力。

链接: https://arxiv.org/abs/2503.10307
作者: Georgy Ponimatkin,Martin Cífka,Tomáš Souček,Médéric Fourmy,Yann Labbé,Vladimir Petrik,Josef Sivic
机构: Czech Institute of Informatics, Robotics and Cybernetics, Czech Technical University in Prague (捷克技术大学布拉格分校捷克智能系统、机器人与网络中心); Faculty of Electrical Engineering, Czech Technical University in Prague (捷克技术大学布拉格分校电气工程学院); H Company (H公司)
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注: Accepted to ICLR 2025. Project page available at this https URL

点击查看摘要

Abstract:We seek to extract a temporally consistent 6D pose trajectory of a manipulated object from an Internet instructional video. This is a challenging set-up for current 6D pose estimation methods due to uncontrolled capturing conditions, subtle but dynamic object motions, and the fact that the exact mesh of the manipulated object is not known. To address these challenges, we present the following contributions. First, we develop a new method that estimates the 6D pose of any object in the input image without prior knowledge of the object itself. The method proceeds by (i) retrieving a CAD model similar to the depicted object from a large-scale model database, (ii) 6D aligning the retrieved CAD model with the input image, and (iii) grounding the absolute scale of the object with respect to the scene. Second, we extract smooth 6D object trajectories from Internet videos by carefully tracking the detected objects across video frames. The extracted object trajectories are then retargeted via trajectory optimization into the configuration space of a robotic manipulator. Third, we thoroughly evaluate and ablate our 6D pose estimation method on YCB-V and HOPE-Video datasets as well as a new dataset of instructional videos manually annotated with approximate 6D object trajectories. We demonstrate significant improvements over existing state-of-the-art RGB 6D pose estimation methods. Finally, we show that the 6D object motion estimated from Internet videos can be transferred to a 7-axis robotic manipulator both in a virtual simulator as well as in a real world set-up. We also successfully apply our method to egocentric videos taken from the EPIC-KITCHENS dataset, demonstrating potential for Embodied AI applications.
zh

[CV-69] Eye on the Target: Eye Tracking Meets Rodent Tracking

【速读】：该论文旨在解决动物行为视频分析中手动标注耗时且主观性强的问题，提出了一种结合眼动追踪数据与智能提示优化的高效分割方法。解决方案的关键在于利用Aria眼镜采集的眼动追踪数据生成提示点，并通过零样本分割模型生成分割掩膜；同时引入后处理技术对提示点进行优化，从而显著提升分割精度，在大鼠数据集上的Jaccard指数从38.8提高到66.2，实现了70.6%的改进。

链接: https://arxiv.org/abs/2503.10305
作者: Emil Mededovic,Yuli Wu,Henning Konermann,Marcin Kopaczka,Mareike Schulz,Rene Tolba,Johannes Stegmaier
机构: Institute of Imaging and Computer Vision, RWTH Aachen University (RWTH 亚琛工业大学影像与计算机视觉研究所); Institute for Laboratory Animal Science, RWTH Aachen University (RWTH 亚琛工业大学实验动物科学研究所)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Analyzing animal behavior from video recordings is crucial for scientific research, yet manual annotation remains labor-intensive and prone to subjectivity. Efficient segmentation methods are needed to automate this process while maintaining high accuracy. In this work, we propose a novel pipeline that utilizes eye-tracking data from Aria glasses to generate prompt points, which are then used to produce segmentation masks via a fast zero-shot segmentation model. Additionally, we apply post-processing to refine the prompts, leading to improved segmentation quality. Through our approach, we demonstrate that combining eye-tracking-based annotation with smart prompt refinement can enhance segmentation accuracy, achieving an improvement of 70.6% from 38.8 to 66.2 in the Jaccard Index for segmentation results in the rats dataset.
zh

[CV-70] CODEI: Resource-Efficient Task-Driven Co-Design of Perception and Decision Making for Mobile Robots Applied to Autonomous Vehicles

【速读】：本文旨在解决移动机器人在任务驱动下硬件与软件集成设计中的挑战，重点在于通过最优选择感知系统、运动规划算法、传感器及计算单元等组件，在安全、效率与资源（成本、能耗、计算需求、重量等）消耗之间实现平衡。论文的关键创新在于提出了一个名为CODEI（Co-design of Embodied Intelligence）的框架，该框架通过将感知需求与性能评估相结合，采用整数线性规划（Integer Linear Programming, ILP）方法优化传感器与算法的选择与布局，并进一步整合机器人本体、运动规划器、感知管线以及计算单元进行协同优化设计。案例研究显示，复杂任务会显著增加资源需求，任务性能直接影响自主堆栈的选择，而资源优先级则影响传感器的选型，例如相机适用于低成本轻量化设计，而激光雷达则在能量和计算效率方面更具优势。

链接: https://arxiv.org/abs/2503.10296
作者: Dejan Milojevic,Gioele Zardini,Miriam Elser,Andrea Censi,Emilio Frazzoli
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR); Computer Vision and Pattern Recognition (cs.CV); Systems and Control (eess.SY)
备注: 20 pages, 33 images, IEEE Transactions on Robotics

点击查看摘要

Abstract:This paper discusses the integration challenges and strategies for designing mobile robots, by focusing on the task-driven, optimal selection of hardware and software to balance safety, efficiency, and minimal usage of resources such as costs, energy, computational requirements, and weight. We emphasize the interplay between perception and motion planning in decision-making by introducing the concept of occupancy queries to quantify the perception requirements for sampling-based motion planners. Sensor and algorithm performance are evaluated using False Negative Rates (FPR) and False Positive Rates (FPR) across various factors such as geometric relationships, object properties, sensor resolution, and environmental conditions. By integrating perception requirements with perception performance, an Integer Linear Programming (ILP) approach is proposed for efficient sensor and algorithm selection and placement. This forms the basis for a co-design optimization that includes the robot body, motion planner, perception pipeline, and computing unit. We refer to this framework for solving the co-design problem of mobile robots as CODEI, short for Co-design of Embodied Intelligence. A case study on developing an Autonomous Vehicle (AV) for urban scenarios provides actionable information for designers, and shows that complex tasks escalate resource demands, with task performance affecting choices of the autonomy stack. The study demonstrates that resource prioritization influences sensor choice: cameras are preferred for cost-effective and lightweight designs, while lidar sensors are chosen for better energy and computational efficiency.
zh

[CV-71] MaterialMVP: Illumination-Invariant Material Generation via Multi-view PBR Diffusion

【速读】：该论文旨在解决多视角材质合成中的关键挑战，提出了一种名为MaterialMVP的端到端模型，用于从3D网格和图像提示生成基于物理的渲染（PBR）纹理。解决方案的关键在于引入了参考注意力机制（Reference Attention），以从输入的参考图像中提取并编码信息丰富的潜在特征，从而实现直观且可控的纹理生成；同时提出了一致性正则化训练策略（Consistency-Regularized Training），通过在不同视角和光照条件下强制执行稳定性，确保结果具有光照不变性和几何一致性；此外，还设计了双通道材质生成（Dual-Channel Material Generation）方法，分别优化了漫反射率（albedo）和金属度-粗糙度（metallic-roughness, MR）纹理，并通过多通道对齐注意力（Multi-Channel Aligned Attention）保持与输入图像的空间精确对齐，进一步整合可学习的材质嵌入以捕捉漫反射率和MR的独特属性。实验结果表明，该模型在多种光照场景下生成的PBR纹理表现出真实的行为，相较于现有方法在一致性和质量方面均有显著提升，适用于可扩展的3D资产创建任务。

链接: https://arxiv.org/abs/2503.10289
作者: Zebin He,Mingxin Yang,Shuhui Yang,Yixuan Tang,Tao Wang,Kaihao Zhang,Guanying Chen,Yuhong Liu,Jie Jiang,Chunchao Guo,Wenhan Luo
机构: Shenzhen Campus of Sun Yat-sen University (中山大学深圳校区); Tencent Hunyuan (腾讯浑元); Nanjing University (南京大学); Harbin Institute of Technology (Shenzhen) (哈尔滨工业大学（深圳）); The Hong Kong University of Science and Technology (香港科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Physically-based rendering (PBR) has become a cornerstone in modern computer graphics, enabling realistic material representation and lighting interactions in 3D scenes. In this paper, we present MaterialMVP, a novel end-to-end model for generating PBR textures from 3D meshes and image prompts, addressing key challenges in multi-view material synthesis. Our approach leverages Reference Attention to extract and encode informative latent from the input reference images, enabling intuitive and controllable texture generation. We also introduce a Consistency-Regularized Training strategy to enforce stability across varying viewpoints and illumination conditions, ensuring illumination-invariant and geometrically consistent results. Additionally, we propose Dual-Channel Material Generation, which separately optimizes albedo and metallic-roughness (MR) textures while maintaining precise spatial alignment with the input images through Multi-Channel Aligned Attention. Learnable material embeddings are further integrated to capture the distinct properties of albedo and MR. Experimental results demonstrate that our model generates PBR textures with realistic behavior across diverse lighting scenarios, outperforming existing methods in both consistency and quality for scalable 3D asset creation.
zh

[CV-72] MACS: Multi-source Audio-to-image Generation with Contextual Significance and Semantic Alignment

【速读】：该论文试图解决单源音频到图像生成方法在处理自然听觉场景中多源特性时的局限性，即现有工作仅关注单一来源的音频输入，而忽略了自然场景中多源音频信息的丰富性，从而限制了生成全面视觉内容的能力。为了解决这一问题，论文提出了MACS（Multi-Source Audio-to-Image Generation with Component Separation）方法，这是首个明确分离多源音频以捕捉丰富音频成分后再进行图像生成的工作。解决方案的关键在于其两阶段策略：第一阶段通过弱监督方法分离多源音频输入，并利用大预训练的CLAP模型将音频与文本标签语义对齐，同时引入排序损失来考虑分离出的音频信号的上下文重要性；第二阶段通过可训练适配器和MLP层将分离后的音频信号映射为生成条件，实现高效的图像生成。此外，论文还构建了LLP数据集作为首个完整的多源音频到图像生成基准，并在多种任务中验证了MACS方法的优越性能。

链接: https://arxiv.org/abs/2503.10287
作者: Hao Zhou,Xiaobao Guo,Yuzhe Zhu,Adams Wai-Kin Kong
机构: Nanyang Technological University, Singapore (南洋理工大学, 新加坡)
类目: ound (cs.SD); Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Audio and Speech Processing (eess.AS)
备注:

点击查看摘要

Abstract:Propelled by the breakthrough in deep generative models, audio-to-image generation has emerged as a pivotal cross-model task that converts complex auditory signals into rich visual representations. However, previous works only focus on single-source audio inputs for image generation, ignoring the multi-source characteristic in natural auditory scenes, thus limiting the performance in generating comprehensive visual content. To bridge this gap, a method called MACS is proposed to conduct multi-source audio-to-image generation. This is the first work that explicitly separates multi-source audio to capture the rich audio components before image generation. MACS is a two-stage method. In the first stage, multi-source audio inputs are separated by a weakly supervised method, where the audio and text labels are semantically aligned by casting into a common space using the large pre-trained CLAP model. We introduce a ranking loss to consider the contextual significance of the separated audio signals. In the second stage, efficient image generation is achieved by mapping the separated audio signals to the generation condition using only a trainable adapter and a MLP layer. We preprocess the LLP dataset as the first full multi-source audio-to-image generation benchmark. The experiments are conducted on multi-source, mixed-source, and single-source audio-to-image generation tasks. The proposed MACS outperforms the current state-of-the-art methods in 17 of the 21 evaluation indexes on all tasks and delivers superior visual quality. The code will be publicly available.
zh

[CV-73] VicaSplat: A Single Run is All You Need for 3D Gaussian Splatting and Camera Estimation from Unposed Video Frames

【速读】：该论文试图解决从无定位视频帧序列中联合进行三维高斯重建与相机姿态估计的问题，这是真实世界三维应用中一个关键但尚未充分探索的任务。解决方案的关键在于提出了一种基于变压器的新颖网络架构。具体而言，模型首先使用图像编码器将每张图像映射为视觉标记列表，并将这些视觉标记与额外插入的可学习相机标记连接起来。所得标记在定制的变压器解码器内完全相互交流。相机标记因果聚合来自不同视图视觉标记的特征，并进一步逐帧调制以注入视点相关特征。然后可通过不同的预测头估计三维高斯光栅和相机姿态参数。实验表明，VicaSplat 在多视图输入下优于基线方法，并且在 ScanNet 基准测试中展示了出色的跨数据集泛化能力，在无需微调的情况下实现了卓越性能。

链接: https://arxiv.org/abs/2503.10286
作者: Zhiqi Li,Chengrui Dong,Yiming Chen,Zhangchi Huang,Peidong Liu
机构: Zhejiang University (浙江大学); Westlake University (西湖大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:We present VicaSplat, a novel framework for joint 3D Gaussians reconstruction and camera pose estimation from a sequence of unposed video frames, which is a critical yet underexplored task in real-world 3D applications. The core of our method lies in a novel transformer-based network architecture. In particular, our model starts with an image encoder that maps each image to a list of visual tokens. All visual tokens are concatenated with additional inserted learnable camera tokens. The obtained tokens then fully communicate with each other within a tailored transformer decoder. The camera tokens causally aggregate features from visual tokens of different views, and further modulate them frame-wisely to inject view-dependent features. 3D Gaussian splats and camera pose parameters can then be estimated via different prediction heads. Experiments show that VicaSplat surpasses baseline methods for multi-view inputs, and achieves comparable performance to prior two-view approaches. Remarkably, VicaSplat also demonstrates exceptional cross-dataset generalization capability on the ScanNet benchmark, achieving superior performance without any fine-tuning. Project page: this https URL.
zh

[CV-74] EEdit : Rethinking the Spatial and Temporal Redundancy for Efficient Image Editing

【速读】：该论文旨在解决基于 inversion 的图像编辑方法因计算开销过大而难以应用于实时交互场景的问题。论文指出，这种冗余主要存在于空间和时间两个维度，包括未编辑区域的无谓计算以及 inversion 过程中的重复工作。为了解决这些问题，论文提出了一种名为 EEdit 的实用框架。其关键解决方案包括：1）引入空间局部性缓存技术，仅计算编辑区域及其邻域而跳过未编辑区域，并通过 token 索引预处理进一步加速；2）提出 inversion 步骤跳过策略，复用潜在表示以实现高效编辑。实验结果表明，EEdit 在多种编辑任务中实现了平均 2.46 倍的速度提升且性能无损。

链接: https://arxiv.org/abs/2503.10270
作者: Zexuan Yan,Yue Ma,Chang Zou,Wenteng Chen,Qifeng Chen,Linfeng Zhang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 17 pages

点击查看摘要

Abstract:Inversion-based image editing is rapidly gaining momentum while suffering from significant computation overhead, hindering its application in real-time interactive scenarios. In this paper, we rethink that the redundancy in inversion-based image editing exists in both the spatial and temporal dimensions, such as the unnecessary computation in unedited regions and the redundancy in the inversion progress. To tackle these challenges, we propose a practical framework, named EEdit, to achieve efficient image editing. Specifically, we introduce three techniques to solve them one by one. For spatial redundancy, spatial locality caching is introduced to compute the edited region and its neighboring regions while skipping the unedited regions, and token indexing preprocessing is designed to further accelerate the caching. For temporal redundancy, inversion step skipping is proposed to reuse the latent for efficient editing. Our experiments demonstrate an average of 2.46 \times acceleration without performance drop in a wide range of editing tasks including prompt-guided image editing, dragging and image composition. Our codes are available at this https URL
zh

[CV-75] A Multi-Modal Federated Learning Framework for Remote Sensing Image Classification

【速读】：该论文旨在解决联邦学习（Federated Learning, FL）在多模态遥感（Remote Sensing, RS）图像分类任务中的应用问题。现有FL方法大多假设所有客户端分布的数据具有相同的模态，而遥感图像可能涉及多种模态数据，联合利用这些多模态数据可显著提升分类性能。然而，由于数据的去中心化与未共享特性，如何有效利用多模态遥感数据成为挑战。

论文提出的解决方案核心在于设计了一个新颖的多模态联邦学习框架，包含三个关键模块：1）多模态融合（Multi-modal Fusion, MF），通过迭代模型平均实现跨模态学习；2）特征白化（Feature Whitening, FW），通过对齐客户端间的数据分布解决训练数据异质性问题；3）互信息最大化（Mutual Information Maximization, MIM），通过最大化不同模态图像之间的相似性来建模相关性。该框架能够在不访问客户端本地多模态训练数据的前提下，有效提升多标签分类与基于像素的分类任务性能。实验结果表明，该框架在基准数据集上的表现优于现有最先进的算法。

链接: https://arxiv.org/abs/2503.10262
作者: Barış Büyüktaş,Gencer Sumbul,Begüm Demir
机构: Faculty of Electrical Engineering and Computer Science, Technische Universität Berlin (柏林工业大学); BIFOLD - Berlin Institute for the Foundations of Learning and Data (柏林学习与数据基础研究所); Environmental Computational Science and Earth Observation Laboratory (ECEO), École Polytechnique Fédérale de Lausanne (EPFL, 洛桑联邦理工学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Federated learning (FL) enables the collaborative training of deep neural networks across decentralized data archives (i.e., clients) without sharing the local data of the clients. Most of the existing FL methods assume that the data distributed across all clients is associated with the same data modality. However, remote sensing (RS) images present in different clients can be associated with diverse data modalities. The joint use of the multi-modal RS data can significantly enhance classification performance. To effectively exploit decentralized and unshared multi-modal RS data, our paper introduces a novel multi-modal FL framework for RS image classification problems. The proposed framework comprises three modules: 1) multi-modal fusion (MF); 2) feature whitening (FW); and 3) mutual information maximization (MIM). The MF module employs iterative model averaging to facilitate learning without accessing multi-modal training data on clients. The FW module aims to address the limitations of training data heterogeneity by aligning data distributions across clients. The MIM module aims to model mutual information by maximizing the similarity between images from different modalities. For the experimental analyses, we focus our attention on multi-label classification and pixel-based classification tasks in RS. The results obtained using two benchmark archives show the effectiveness of the proposed framework when compared to state-of-the-art algorithms in the literature. The code of the proposed framework will be available at this https URL.
zh

[CV-76] KVQ: Boosting Video Quality Assessment via Saliency-guided Local Perception

【速读】：该论文旨在解决视频质量评估（Video Quality Assessment, VQA）中由于运动模糊或特定失真导致的视频不同区域感知质量差异的问题。传统方法在标注区域级质量时成本高昂且缺乏相关数据集的支持，限制了局部感知的有效利用。为应对这一挑战，论文提出了一种名为Kaleidoscope Video Quality Assessment (KVQ) 的框架，其关键在于结合视觉显著性提取与注意力分配机制（Fusion-Window Attention, FWA），同时引入局部感知约束（Local Perception Constraint, LPC）以减少局部纹理感知对邻近区域的依赖。通过这些方法，KVQ 在多个 VQA 数据集上实现了显著性能提升，并建立了包含区域级标注的新数据集 LPVQ 用于评估局部感知能力。

链接: https://arxiv.org/abs/2503.10259
作者: Yunpeng Qu,Kun Yuan,Qizhi Xie,Ming Sun,Chao Zhou,Jian Wang
机构: Tsinghua University (清华大学); Kuaishou Technology (快手科技); BNRist, Tsinghua University (清华大学类脑智能研究中心)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 11 pages, 7 figures

点击查看摘要

Abstract:Video Quality Assessment (VQA), which intends to predict the perceptual quality of videos, has attracted increasing attention. Due to factors like motion blur or specific distortions, the quality of different regions in a video varies. Recognizing the region-wise local quality within a video is beneficial for assessing global quality and can guide us in adopting fine-grained enhancement or transcoding strategies. Due to the heavy cost of annotating region-wise quality, the lack of ground truth constraints from relevant datasets further complicates the utilization of local perception. Inspired by the Human Visual System (HVS) that links global quality to the local texture of different regions and their visual saliency, we propose a Kaleidoscope Video Quality Assessment (KVQ) framework, which aims to effectively assess both saliency and local texture, thereby facilitating the assessment of global quality. Our framework extracts visual saliency and allocates attention using Fusion-Window Attention (FWA) while incorporating a Local Perception Constraint (LPC) to mitigate the reliance of regional texture perception on neighboring areas. KVQ obtains significant improvements across multiple scenarios on five VQA benchmarks compared to SOTA methods. Furthermore, to assess local perception, we establish a new Local Perception Visual Quality (LPVQ) dataset with region-wise annotations. Experimental results demonstrate the capability of KVQ in perceiving local distortions. KVQ models and the LPVQ dataset will be available at this https URL.
zh

[CV-77] ROODI: Reconstructing Occluded Objects with Denoising Inpainters

【速读】：该论文致力于解决从复杂场景中提取特定物体的难题，特别是处理3D高斯点云中单个物体的分离及其遮挡问题。论文提出了一种基于两个关键原则的新颖物体提取方法：(1) 以物体为中心，通过剪枝无关的3D高斯基元来减少干扰；(2) 利用基于生成的修复技术补偿因遮挡导致的缺失观测。对于剪枝操作，通过K近邻分析基元的局部结构，保留仅与目标物体相关的基元；对于修复，结合预训练的扩散模型及遮挡推理，利用整个场景的3D表示完成修复。研究强调了剪枝与修复之间的协同作用，两者共同显著提升了提取性能。实验表明，该方法在标准真实数据集和合成数据集上的表现均优于现有技术。

链接: https://arxiv.org/abs/2503.10256
作者: Yeonjin Chang,Erqun Dong,Seunghyeon Seo,Nojun Kwak,Kwang Moo Yi
机构: Seoul National University (首尔国立大学); University of British Columbia (不列颠哥伦比亚大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project page: this https URL

点击查看摘要

Abstract:While the quality of novel-view images has improved dramatically with 3D Gaussian Splatting, extracting specific objects from scenes remains challenging. Isolating individual 3D Gaussian primitives for each object and handling occlusions in scenes remain far from being solved. We propose a novel object extraction method based on two key principles: (1) being object-centric by pruning irrelevant primitives; and (2) leveraging generative inpainting to compensate for missing observations caused by occlusions. For pruning, we analyze the local structure of primitives using K-nearest neighbors, and retain only relevant ones. For inpainting, we employ an off-the-shelf diffusion-based inpainter combined with occlusion reasoning, utilizing the 3D representation of the entire scene. Our findings highlight the crucial synergy between pruning and inpainting, both of which significantly enhance extraction performance. We evaluate our method on a standard real-world dataset and introduce a synthetic dataset for quantitative analysis. Our approach outperforms the state-of-the-art, demonstrating its effectiveness in object extraction from complex scenes.
zh

[CV-78] SVIP: Semantically Contextualized Visual Patches for Zero-Shot Learning

【速读】：该论文致力于解决零样本学习（Zero-Shot Learning, ZSL）中的语义错位问题（semantic misalignment），即视觉特征中包含的与类别语义无关的信息引入的歧义，从而影响视觉-语义交互。现有方法通常在特征空间或模型空间中后处理以抑制这些无关信息，而本文提出从输入阶段入手，在网络中阻止这些无关补丁传播。关键解决方案在于引入基于Transformer的语义上下文化视觉补丁（Semantically Contextualized Visual Patches, SVIP）框架，通过自监督的补丁选择机制预先识别输入空间中的语义无关补丁，并利用所有Transformer层的聚合注意力分数进行监督训练。为避免移除无关补丁破坏目标物体结构，使用可学习的补丁嵌入替代，并通过词嵌入初始化确保其在整个特征提取过程中保持语义意义。实验结果表明，SVIP在ZSL基准数据集上实现了最先进的性能，同时提供了更可解释且语义丰富的特征表示。

链接: https://arxiv.org/abs/2503.10252
作者: Zhi Chen,Zecheng Zhao,Jingcai Guo,Jingjing Li,Zi Huang
机构: University of Queesland (昆士兰大学); The Hong Kong Polytechnic University (香港理工大学); University of Electronic Science and Technology of China (电子科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Pre-print

点击查看摘要

Abstract:Zero-shot learning (ZSL) aims to recognize unseen classes without labeled training examples by leveraging class-level semantic descriptors such as attributes. A fundamental challenge in ZSL is semantic misalignment, where semantic-unrelated information involved in visual features introduce ambiguity to visual-semantic interaction. Unlike existing methods that suppress semantic-unrelated information post hoc either in the feature space or the model space, we propose addressing this issue at the input stage, preventing semantic-unrelated patches from propagating through the network. To this end, we introduce Semantically contextualized VIsual Patches (SVIP) for ZSL, a transformer-based framework designed to enhance visual-semantic alignment. Specifically, we propose a self-supervised patch selection mechanism that preemptively learns to identify semantic-unrelated patches in the input space. This is trained with the supervision from aggregated attention scores across all transformer layers, which estimate each patch’s semantic score. As removing semantic-unrelated patches from the input sequence may disrupt object structure, we replace them with learnable patch embeddings. With initialization from word embeddings, we can ensure they remain semantically meaningful throughout feature extraction. Extensive experiments on ZSL benchmarks demonstrate that SVIP achieves state-of-the-art performance results while providing more interpretable and semantically rich feature representations.
zh

[CV-79] Interpretable Image Classification via Non-parametric Part Prototype Learning

【速读】：该论文旨在解决计算机视觉中基于可解释决策过程的图像分类问题。现有方法如Prototypical Part Networks（ProtoPNets）虽能通过原型对象部件模仿人类视觉推理提供解释，但其生成的解释质量仍有提升空间，主要因为原型通常聚焦于重复和冗余的概念。为此，论文提出了一种基于部件的可解释图像分类框架，其关键在于以非参数化方式学习语义上具有区分性的目标部件，通过聚类从基础视觉模型提取的深度特征来实现，这些特征编码了鲁棒的语义信息。此外，为了定量评估所提方法的解释质量，论文引入了Distinctiveness Score和Comprehensiveness Score，并在CUB-200-2011、Stanford Cars和Stanford Dogs数据集上的实验表明，该框架相较于现有ProtoPNets方法在可解释性方面表现更优。

链接: https://arxiv.org/abs/2503.10247
作者: Zhijie Zhu,Lei Fan,Maurice Pagnucco,Yang Song
机构: University of New South Wales (新南威尔士大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Classifying images with an interpretable decision-making process is a long-standing problem in computer vision. In recent years, Prototypical Part Networks has gained traction as an approach for self-explainable neural networks, due to their ability to mimic human visual reasoning by providing explanations based on prototypical object parts. However, the quality of the explanations generated by these methods leaves room for improvement, as the prototypes usually focus on repetitive and redundant concepts. Leveraging recent advances in prototype learning, we present a framework for part-based interpretable image classification that learns a set of semantically distinctive object parts for each class, and provides diverse and comprehensive explanations. The core of our method is to learn the part-prototypes in a non-parametric fashion, through clustering deep features extracted from foundation vision models that encode robust semantic information. To quantitatively evaluate the quality of explanations provided by ProtoPNets, we introduce Distinctiveness Score and Comprehensiveness Score. Through evaluation on CUB-200-2011, Stanford Cars and Stanford Dogs datasets, we show that our framework compares favourably against existing ProtoPNets while achieving better interpretability. Code is available at: this https URL.
zh

[CV-80] Unveiling the Invisible: Reasoning Complex Occlusions Amodally with AURA

【速读】：该论文旨在解决现有模态无关分割（Amodal Segmentation）方法无法通过文本输入与用户交互以及难以理解或推理隐式和复杂目的的问题。此外，虽然如LISA等方法将多模态大语言模型（LLMs）与分割结合用于推理任务，但它们仅限于预测可见物体区域，并在处理复杂遮挡场景时面临挑战。为了解决这些局限性，论文提出了一个新的任务——模态无关推理分割（Amodal Reasoning Segmentation），目标是在提供基于用户文本输入详细说明的同时，预测被遮挡物体的完整模态形状。关键解决方案包括开发一个可推广的数据集生成管道，创建专注于日常生活场景的新数据集，以及提出AURA（模态理解与推理助手）这一具有先进全局和空间级设计的新型模型，专门用于处理复杂遮挡情况。实验验证了AURA的有效性。

链接: https://arxiv.org/abs/2503.10225
作者: Zhixuan Li,Hyunse Yoon,Sanghoon Lee,Weisi Lin
机构: College of Computing and Data Science, Nanyang Technological University (南洋理工大学), Singapore; Department of Electrical Electronic Engineering, Yonsei University (延世大学), Korea
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 11 pages, 5 figures, 5 tables

点击查看摘要

Abstract:Amodal segmentation aims to infer the complete shape of occluded objects, even when the occluded region’s appearance is unavailable. However, current amodal segmentation methods lack the capability to interact with users through text input and struggle to understand or reason about implicit and complex purposes. While methods like LISA integrate multi-modal large language models (LLMs) with segmentation for reasoning tasks, they are limited to predicting only visible object regions and face challenges in handling complex occlusion scenarios. To address these limitations, we propose a novel task named amodal reasoning segmentation, aiming to predict the complete amodal shape of occluded objects while providing answers with elaborations based on user text input. We develop a generalizable dataset generation pipeline and introduce a new dataset focusing on daily life scenarios, encompassing diverse real-world occlusions. Furthermore, we present AURA (Amodal Understanding and Reasoning Assistant), a novel model with advanced global and spatial-level designs specifically tailored to handle complex occlusions. Extensive experiments validate AURA’s effectiveness on the proposed dataset. The code, model, and dataset will be publicly released.
zh

[CV-81] CoStoDet-DDPM: Collaborative Training of Stochastic and Deterministic Models Improves Surgical Workflow Anticipation and Recognition

【速读】：本文旨在解决外科手术流程预测与识别中因解剖和操作变异导致的泛化难题。传统方法依赖确定性决策，难以应对现实世界中的复杂变化。为解决此问题，论文提出了一种创新框架，将去噪扩散概率模型（Denoising Diffusion Probabilistic Model, DDPM）引入传统的确定性学习中。其关键是采用协作共训练范式：DDPM分支通过捕捉过程不确定性来丰富特征表示，任务分支则专注于预测手术阶段与器械使用；两者相互优化，其中DDPM在不确定场景下减少预测误差，而任务分支引导DDPM生成临床有意义的表示。关键创新点在于，在推理阶段舍弃DDPM分支以实现实时预测，同时保持性能不减。

链接: https://arxiv.org/abs/2503.10216
作者: Kaixiang Yang,Xin Li,Qiang Li,Zhiwei Wang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Anticipating and recognizing surgical workflows are critical for intelligent surgical assistance systems. However, existing methods rely on deterministic decision-making, struggling to generalize across the large anatomical and procedural variations inherent in real-world this http URL this paper, we introduce an innovative framework that incorporates stochastic modeling through a denoising diffusion probabilistic model (DDPM) into conventional deterministic learning for surgical workflow analysis. At the heart of our approach is a collaborative co-training paradigm: the DDPM branch captures procedural uncertainties to enrich feature representations, while the task branch focuses on predicting surgical phases and instrument this http URL, we demonstrate that this mutual refinement mechanism benefits both branches: the DDPM reduces prediction errors in uncertain scenarios, and the task branch directs the DDPM toward clinically meaningful representations. Notably, the DDPM branch is discarded during inference, enabling real-time predictions without sacrificing this http URL on the Cholec80 dataset show that for the anticipation task, our method achieves a 16% reduction in eMAE compared to state-of-the-art approaches, and for phase recognition, it improves the Jaccard score by 1.0%. Additionally, on the AutoLaparo dataset, our method achieves a 1.5% improvement in the Jaccard score for phase recognition, while also exhibiting robust generalization to patient-specific variations. Our code and weight are available at this https URL.
zh

[CV-82] Singular Value Fine-tuning for Few-Shot Class-Incremental Learning

【速读】：本文旨在解决Few-shot Class-Incremental Learning (FSCIL) 中的两个主要挑战：灾难性遗忘（catastrophic forgetting）和过拟合（overfitting），特别是在基于大型基础模型的场景下，后者尚未得到充分研究。论文的关键创新在于提出了一种名为Singular Value Fine-tuning for FSCIL (SVFCL) 的方法。该方法通过将奇异值分解应用于基础模型权重，固定奇异向量而仅微调与任务相关的奇异值，并最终合并这些值。这种方法不仅有效缓解了灾难性遗忘问题，还显著减少了可训练参数数量，同时更有效地减轻了过拟合现象。

链接: https://arxiv.org/abs/2503.10214
作者: Zhiwu Wang,Yichen Wu,Renzhen Wang,Haokun Lin,Quanziang Wang,Qian Zhao,Deyu Meng
机构: Xi’an Jiaotong University (西安交通大学); City University of Hong Kong (香港城市大学); Institute of Automation, Chinese Academy of Sciences (中国科学院自动化研究所)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 12 pages, 8 figures

点击查看摘要

Abstract:Class-Incremental Learning (CIL) aims to prevent catastrophic forgetting of previously learned classes while sequentially incorporating new ones. The more challenging Few-shot CIL (FSCIL) setting further complicates this by providing only a limited number of samples for each new class, increasing the risk of overfitting in addition to standard CIL challenges. While catastrophic forgetting has been extensively studied, overfitting in FSCIL, especially with large foundation models, has received less attention. To fill this gap, we propose the Singular Value Fine-tuning for FSCIL (SVFCL) and compared it with existing approaches for adapting foundation models to FSCIL, which primarily build on Parameter Efficient Fine-Tuning (PEFT) methods like prompt tuning and Low-Rank Adaptation (LoRA). Specifically, SVFCL applies singular value decomposition to the foundation model weights, keeping the singular vectors fixed while fine-tuning the singular values for each task, and then merging them. This simple yet effective approach not only alleviates the forgetting problem but also mitigates overfitting more effectively while significantly reducing trainable parameters. Extensive experiments on four benchmark datasets, along with visualizations and ablation studies, validate the effectiveness of SVFCL. The code will be made available.
zh

[CV-83] MouseGPT : A Large-scale Vision-Language Model for Mouse Behavior Analysis

【速读】：该论文旨在解决动物行为分析中难以全面量化和解析其复杂动态的问题。传统机器视觉方法虽能检测自发行为，但因可解释性有限且依赖手动标注，限制了对完整行为谱的探索。论文的关键解决方案是引入MouseGPT，这是一种融合视觉信息与自然语言的视觉-语言模型（Vision-Language Model, VLM）。MouseGPT基于首个包含超过4200万帧多样化精神疾病条件下姿势动态及开放词汇行为注释的数据集构建，提供了一种新颖且语境丰富的行为综合解释方法。其整体分析框架实现了详细的行为特征描述、聚类以及新行为发现，无需耗费大量人工标注，从而为动物模型中复杂行为动态的研究提供了变革性的工具。

链接: https://arxiv.org/abs/2503.10212
作者: Teng Xu,Taotao Zhou,Youjia Wang,Peng Yang,Simin Tang,Kuixiang Shao,Zifeng Tang,Yifei Liu,Xinyuan Chen,Hongshuang Wang,Xiaohui Wang,Huoqing Luo,Jingya Wang,Ji Hu,Jingyi Yu
机构: ShanghaiTech University (上海科技大学); Chinese Academy of Sciences Institute of Chemistry (中国科学院化学研究所)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 53 pages, 5 figures, 7 extended figures

点击查看摘要

Abstract:Analyzing animal behavior is crucial in advancing neuroscience, yet quantifying and deciphering its intricate dynamics remains a significant challenge. Traditional machine vision approaches, despite their ability to detect spontaneous behaviors, fall short due to limited interpretability and reliance on manual labeling, which restricts the exploration of the full behavioral spectrum. Here, we introduce MouseGPT, a Vision-Language Model (VLM) that integrates visual cues with natural language to revolutionize mouse behavior analysis. Built upon our first-of-its-kind dataset - incorporating pose dynamics and open-vocabulary behavioral annotations across over 42 million frames of diverse psychiatric conditions - MouseGPT provides a novel, context-rich method for comprehensive behavior interpretation. Our holistic analysis framework enables detailed behavior profiling, clustering, and novel behavior discovery, offering deep insights without the need for labor - intensive manual annotation. Evaluations reveal that MouseGPT surpasses existing models in precision, adaptability, and descriptive richness, positioning it as a transformative tool for ethology and for unraveling complex behavioral dynamics in animal models.
zh

[CV-84] ARS: Traffic-Aware Radar Scene Flow Estimation

【速读】：该论文旨在解决基于雷达点云的场景流估计问题，特别是针对稀疏雷达数据中传统实例级刚体运动假设的局限性。解决方案的关键在于提出了一种名为TARS（Traffic-Aware Radar Scene flow）的方法，通过在交通级别利用运动刚性而非仅依赖实例级别的刚体假设。TARS通过联合执行目标检测与场景流估计，并将目标检测特征图融入场景流分支，构建了一个交通矢量场（TVF），以实现整体的交通级别场景理解。这种方法不仅考虑了点级别的局部运动线索，还结合了空间内刚体运动的交通级别一致性，从而显著提升了场景流估计的性能，在私有数据集和View-of-Delft数据集上分别超越现有方法23%和15%。

链接: https://arxiv.org/abs/2503.10210
作者: Jialong Wu,Marco Braun,Dominic Spata,Matthias Rottmann
机构: University of Wuppertal (伍珀塔尔大学); Aptiv Services Deutschland GmbH (Aptiv Services 德国有限公司)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Scene flow provides crucial motion information for autonomous driving. Recent LiDAR scene flow models utilize the rigid-motion assumption at the instance level, assuming objects are rigid bodies. However, these instance-level methods are not suitable for sparse radar point clouds. In this work, we present a novel \textbfT raffic- \textbfA ware \textbfR adar \textbfS cene flow estimation method, named \textbfTARS , which utilizes the motion rigidity at the traffic level. To address the challenges in radar scene flow, we perform object detection and scene flow jointly and boost the latter. We incorporate the feature map from the object detector, trained with detection losses, to make radar scene flow aware of the environment and road users. Therefrom, we construct a Traffic Vector Field (TVF) in the feature space, enabling a holistic traffic-level scene understanding in our scene flow branch. When estimating the scene flow, we consider both point-level motion cues from point neighbors and traffic-level consistency of rigid motion within the space. TARS outperforms the state of the art on a proprietary dataset and the View-of-Delft dataset, improving the benchmarks by 23% and 15%, respectively.
zh

[CV-85] LVAgent : Long Video Understanding by Multi-Round Dynamical Collaboration of MLLM Agents

【速读】：该论文试图解决现有多模态大型语言模型（Multimodal Large Language Models, MLLMs）在建模长视频时间上下文方面的显著挑战。当前主流基于代理的方法依赖外部工具（如搜索引擎、记忆库、OCR、检索模型等）辅助单一MLLM处理长视频相关问题，但这种方法仍只能提供部分理解，导致性能受限。论文的关键解决方案是提出LVAgent框架，它实现了多轮动态协作的MLLM代理机制，用于长视频理解任务。LVAgent的核心包括四个关键步骤：1）选择合适的代理形成最优团队；2）设计高效的长视频检索方案以提升关键时间片段的覆盖率；3）代理回答问题并交换推理理由；4）评估并优化代理团队的动态协作。通过多轮迭代协作，LVAgent不仅超越了所有闭源模型（如GPT-4o）和开源模型（如InternVL-2.5和Qwen2-VL），还在主流长视频理解任务中达到80%的准确率，并在LongVideoBench数据集上提升了高达14.3%的精度。

链接: https://arxiv.org/abs/2503.10200
作者: Boyu Chen,Zhengrong Yue,Siran Chen,Zikang Wang,Yang Liu,Peng Li,Yali Wang
机构: Shenzhen Key Lab of Computer Vision and Pattern Recognition, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences (中国科学院深圳先进技术研究院); School of Artificial Intelligence, University of Chinese Academy of Sciences (中国科学院大学人工智能学院); Institute for AI Industry Research (AIR), Tsinghua University, Beijing, China (清华大学智能产业研究院); Dept. of Comp. Sci. & Tech., Institute for AI, Tsinghua University, Beijing, China (清华大学人工智能研究院计算机系); Shanghai Artificial Intelligence Laboratory (上海人工智能实验室); Shanghai Jiao Tong University (上海交通大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Existing Multimodal Large Language Models (MLLMs) encounter significant challenges in modeling the temporal context within long videos. Currently, mainstream Agent-based methods use external tools (e.g., search engine, memory banks, OCR, retrieval models) to assist a single MLLM in answering long video questions. Despite such tool-based support, a solitary MLLM still offers only a partial understanding of long videos, resulting in limited performance. In order to better address long video tasks, we introduce LVAgent, the first framework enabling multi-round dynamic collaboration of MLLM agents in long video understanding. Our methodology consists of four key steps: 1. Selection: We pre-select appropriate agents from the model library to form optimal agent teams based on different tasks. 2. Perception: We design an effective retrieval scheme for long videos, improving the coverage of critical temporal segments while maintaining computational efficiency. 3. Action: Agents answer long video-related questions and exchange reasons. 4. Reflection: We evaluate the performance of each agent in each round of discussion and optimize the agent team for dynamic collaboration. The agents iteratively refine their answers by multi-round dynamical collaboration of MLLM agents. LVAgent is the first agent system method that outperforms all closed-source models (including GPT-4o) and open-source models (including InternVL-2.5 and Qwen2-VL) in the long video understanding tasks. Our LVAgent achieves an accuracy of 80% on four mainstream long video understanding tasks. Notably, on the LongVideoBench dataset, LVAgent improves accuracy by up to 14.3% compared with SOTA.
zh

[CV-86] ST-FlowNet: An Efficient Spiking Neural Network for Event-Based Optical Flow Estimation

【速读】：本文旨在解决基于事件数据的光流估计任务中，现有尖峰神经网络（Spiking Neural Networks, SNNs）性能受限的问题，限制其在实际场景中的应用。为解决这一问题，论文提出了ST-FlowNet这一新颖的神经网络架构，专为从事件数据中进行光流估计设计。其关键创新在于通过集成ConvGRU模块实现跨模态特征增强及预测光流的时间对齐，从而提升捕捉复杂运动动态的能力。此外，针对SNN训练挑战，论文引入了从预训练人工神经网络（Artificial Neural Networks, ANNs）转换为SNN的新方法，特别是提出了BISNN方法，该方法简化了生物参数选择的复杂性，进一步增强了SNN在光流估计任务中的鲁棒性。这些创新共同构成了ST-FlowNet模型的优异性能基础，并显著提升了事件驱动光流估计的准确性与效率。

链接: https://arxiv.org/abs/2503.10195
作者: Hongze Sun,Jun Wang,Wuque Cai,Duo Chen,Qianqian Liao,Jiayi He,Yan Cui,Dezhong Yao,Daqing Guo
机构: Clinical Hospital of Chengdu Brain Science Institute, MOE Key Lab for NeuroInformation, China-Cuba Belt and Road Joint Laboratory on Neurotechnology and Brain-Apparatus Communication, School of Life Science and Technology, University of Electronic Science and Technology of China (电子科技大学生命科学与技术学院; 成都脑科学临床医院; 教育部神经信息重点实验室; 中国-古巴“一带一路”神经技术与脑装置通信联合实验室); School of Artificial Intelligence, Chongqing University of Education (重庆教育学院人工智能学院); Sichuan Academy of Medical Sciences and Sichuan Provincial People’s Hospital (四川省医学科学院·人民医院); Research Unit of NeuroInformation (2019RU035), Chinese Academy of Medical Sciences (中国医学科学院神经信息研究室)
类目: Computer Vision and Pattern Recognition (cs.CV); Neural and Evolutionary Computing (cs.NE); Neurons and Cognition (q-bio.NC)
备注: 12 pages, 5 figures, 5 tables; This work has been submitted for possible publication

点击查看摘要

Abstract:Spiking Neural Networks (SNNs) have emerged as a promising tool for event-based optical flow estimation tasks due to their ability to leverage spatio-temporal information and low-power capabilities. However, the performance of SNN models is often constrained, limiting their application in real-world scenarios. In this work, we address this gap by proposing a novel neural network architecture, ST-FlowNet, specifically tailored for optical flow estimation from event-based data. The ST-FlowNet architecture integrates ConvGRU modules to facilitate cross-modal feature augmentation and temporal alignment of the predicted optical flow, improving the network’s ability to capture complex motion dynamics. Additionally, to overcome the challenges associated with training SNNs, we introduce a novel approach to derive SNN models from pre-trained artificial neural networks (ANNs) through ANN-to-SNN conversion or our proposed BISNN method. Notably, the BISNN method alleviates the complexities involved in biological parameter selection, further enhancing the robustness of SNNs in optical flow estimation tasks. Extensive evaluations on three benchmark event-based datasets demonstrate that the SNN-based ST-FlowNet model outperforms state-of-the-art methods, delivering superior performance in accurate optical flow estimation across a diverse range of dynamic visual scenes. Furthermore, the inherent energy efficiency of SNN models is highlighted, establishing a compelling advantage for their practical deployment. Overall, our work presents a novel framework for optical flow estimation using SNNs and event-based data, contributing to the advancement of neuromorphic vision applications.
zh

[CV-87] Robustness Tokens: Towards Adversarial Robustness of Transformers ECCV

【速读】：该论文旨在解决基于公开预训练基础模型的下游任务容易受到针对相同公开模型设计的对抗攻击（adversarial attacks）威胁的问题。论文的关键解决方案是提出了一种名为“鲁棒性标记（Robustness Tokens）”的新方法，专用于Transformer架构。与传统的对抗训练通过调整模型参数不同，Robustness Tokens仅需微调少量额外的私有标记（private tokens），且计算开销较低。这种方法不仅显著提升了视觉Transformer模型在白盒对抗攻击下的鲁棒性，同时保持了原始下游任务的性能。

链接: https://arxiv.org/abs/2503.10191
作者: Brian Pulfer,Yury Belousov,Slava Voloshynovskiy
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: This paper has been accepted for publication at the European Conference on Computer Vision (ECCV), 2024

点击查看摘要

Abstract:Recently, large pre-trained foundation models have become widely adopted by machine learning practitioners for a multitude of tasks. Given that such models are publicly available, relying on their use as backbone models for downstream tasks might result in high vulnerability to adversarial attacks crafted with the same public model. In this work, we propose Robustness Tokens, a novel approach specific to the transformer architecture that fine-tunes a few additional private tokens with low computational requirements instead of tuning model parameters as done in traditional adversarial training. We show that Robustness Tokens make Vision Transformer models significantly more robust to white-box adversarial attacks while also retaining the original downstream performances.
zh

[CV-88] hrough the Magnifying Glass: Adaptive Perception Magnification for Hallucination-Free VLM Decoding

【速读】：该论文旨在解决现有视觉-语言模型（Vision-Language Models, VLMs）中存在的视觉幻觉问题，即生成的响应包含与视觉输入不一致的不准确信息。为了解决这一问题，论文提出了一种名为Perception Magnifier (PM) 的新型视觉解码方法。PM的关键在于通过注意力机制迭代隔离相关的视觉标记，并放大相应区域，促使模型在解码过程中关注更精细的视觉细节。这种方法在增强对视觉输入审视能力的同时，通过保留结构化和上下文信息，在减轻视觉幻觉的同时保持了较强的语言推理能力。实验结果表明，PM不仅显著减少了幻觉现象，还提升了语言生成质量。

链接: https://arxiv.org/abs/2503.10183
作者: Shunqi Mao,Chaoyi Zhang,Weidong Cai
机构: School of Computer Science, The University of Sydney (悉尼大学), Australia
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 19 pages, 5 figures, 9 tables

点击查看摘要

Abstract:Existing vision-language models (VLMs) often suffer from visual hallucination, where the generated responses contain inaccuracies that are not grounded in the visual input. Efforts to address this issue without model finetuning primarily mitigate hallucination by reducing biases contrastively or amplifying the weights of visual embedding during decoding. However, these approaches improve visual perception at the cost of impairing the language reasoning capability. In this work, we propose the Perception Magnifier (PM), a novel visual decoding method that iteratively isolates relevant visual tokens based on attention and magnifies the corresponding regions, spurring the model to concentrate on fine-grained visual details during decoding. Specifically, by magnifying critical regions while preserving the structural and contextual information at each decoding step, PM allows the VLM to enhance its scrutiny of the visual input, hence producing more accurate and faithful responses. Extensive experimental results demonstrate that PM not only achieves superior hallucination mitigation but also enhances language generation while preserving strong reasoning this http URL is available at this https URL .
zh

[CV-89] GS-SDF: LiDAR-Augmented Gaussian Splatting and Neural SDF for Geometrically Consistent Rendering and Reconstruction

【速读】：该论文旨在解决高精度表面重建与高质量渲染在自动驾驶及具身人工智能发展中面临的挑战，特别是Gaussian splatting在机器人应用中因稀疏观测数据和几何不一致性导致的问题，以及有效整合LiDAR点云数据与Gaussian splatting的难题。论文的关键创新在于提出了一种统一的LiDAR-视觉系统，通过将Gaussian splatting与神经符号距离场（Neural Signed Distance Field, NSDF）结合，利用精确的LiDAR点云数据提供连续的几何场，并基于此提出了一种基于SDF的Gaussian初始化方法，用于物理基础的基元放置和全面的几何正则化，从而实现几何一致的渲染与重建。

链接: https://arxiv.org/abs/2503.10170
作者: Jianheng Liu,Yunfei Wan,Bowen Wang,Chunran Zheng,Jiarong Lin,Fu Zhang
机构: 未知
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Digital twins are fundamental to the development of autonomous driving and embodied artificial intelligence. However, achieving high-granularity surface reconstruction and high-fidelity rendering remains a challenge. Gaussian splatting offers efficient photorealistic rendering but struggles with geometric inconsistencies due to fragmented primitives and sparse observational data in robotics applications. Existing regularization methods, which rely on render-derived constraints, often fail in complex environments. Moreover, effectively integrating sparse LiDAR data with Gaussian splatting remains challenging. We propose a unified LiDAR-visual system that synergizes Gaussian splatting with a neural signed distance field. The accurate LiDAR point clouds enable a trained neural signed distance field to offer a manifold geometry field, This motivates us to offer an SDF-based Gaussian initialization for physically grounded primitive placement and a comprehensive geometric regularization for geometrically consistent rendering and reconstruction. Experiments demonstrate superior reconstruction accuracy and rendering quality across diverse trajectories. To benefit the community, the codes will be released at this https URL.
zh

[CV-90] A Hierarchical Semantic Distillation Framework for Open-Vocabulary Object Detection

【速读】：本文旨在解决开放词汇物体检测（OVD）中检测器无法有效学习语义知识的问题，即如何在未见过的类别上实现更强大的泛化能力。传统方法直接将特征空间与预训练的视觉-语言模型（如CLIP）对齐，但未能充分利用其通用识别能力。为此，作者提出了一种名为HD-OVD的分层语义蒸馏框架，通过三个层次的蒸馏过程从CLIP模型中提取可推广的知识。关键在于构建了一个全面的蒸馏流程：首先，在实例级别通过建模视觉空间中单个对象的关系，使检测器学习细粒度的实例级语义；其次，在类别级别引入文本空间中新类别的分类任务，帮助检测器吸收类别级别的通用语义；最后，在图像级别通过图像对比蒸馏传递包含多对象及其上下文的丰富语义信息。这种三重层次的语义蒸馏使得HD-OVD能够在实例、类别和图像三个层面继承CLIP的通用识别能力，从而显著提升了OV-COCO数据集上的新型AP指标至46.4%，并大幅超越其他方法。

链接: https://arxiv.org/abs/2503.10152
作者: Shenghao Fu,Junkai Yan,Qize Yang,Xihan Wei,Xiaohua Xie,Wei-Shi Zheng
机构: School of Computer Science and Engineering, Sun Yat-sen University (中山大学计算机科学与工程学院), Guangzhou 510006, China; Peng Cheng Laboratory (鹏城实验室), China; Tongyi Lab, Alibaba Group (通义实验室，阿里巴巴集团), China; Guangdong Key Laboratory of Information Security Technology, Sun Yat-sen University (广东省信息安全技术重点实验室，中山大学), Guangzhou, Guangdong, China; Key Laboratory of Machine Intelligence and Advanced Computing, Sun Yat-sen University, Ministry of Education (智能机器与先进计算教育部重点实验室，中山大学), Guangzhou, Guangdong, China
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to TMM 2025

点击查看摘要

Abstract:Open-vocabulary object detection (OVD) aims to detect objects beyond the training annotations, where detectors are usually aligned to a pre-trained vision-language model, eg, CLIP, to inherit its generalizable recognition ability so that detectors can recognize new or novel objects. However, previous works directly align the feature space with CLIP and fail to learn the semantic knowledge effectively. In this work, we propose a hierarchical semantic distillation framework named HD-OVD to construct a comprehensive distillation process, which exploits generalizable knowledge from the CLIP model in three aspects. In the first hierarchy of HD-OVD, the detector learns fine-grained instance-wise semantics from the CLIP image encoder by modeling relations among single objects in the visual space. Besides, we introduce text space novel-class-aware classification to help the detector assimilate the highly generalizable class-wise semantics from the CLIP text encoder, representing the second hierarchy. Lastly, abundant image-wise semantics containing multi-object and their contexts are also distilled by an image-wise contrastive distillation. Benefiting from the elaborated semantic distillation in triple hierarchies, our HD-OVD inherits generalizable recognition ability from CLIP in instance, class, and image levels. Thus, we boost the novel AP on the OV-COCO dataset to 46.4% with a ResNet50 backbone, which outperforms others by a clear margin. We also conduct extensive ablation studies to analyze how each component works.
zh

[CV-91] Unlocking Generalization Power in LiDAR Point Cloud Registration CVPR2025

【速读】：该论文旨在解决现有激光雷达点云配准方法在实际环境中的泛化能力不足的问题，特别是在不同距离和数据集上的鲁棒性。为了解决这一局限性，论文提出了一种名为UGP的修剪框架，其关键在于移除了交叉注意力机制以提升泛化能力，使网络能够专注于帧内特征提取。此外，引入了渐进自注意力模块来减少大规模场景中的歧义，并结合鸟瞰图（BEV）特征以融入场景元素的语义信息。这些改进显著提升了网络的泛化性能。实验结果表明，在KITTI和nuScenes数据集上的跨距离泛化任务中，UGP分别达到了94.5%和91.4%的最新平均配准召回率；而在从nuScenes到KITTI的跨数据集泛化任务中，UGP实现了90.9%的最新平均配准召回率。

链接: https://arxiv.org/abs/2503.10149
作者: Zhenxuan Zeng,Qiao Wu,Xiyu Zhang,Lin Yuanbo Wu,Pei An,Jiaqi Yang,Ji Wang,Peng Wang
机构: School of Computer Science, Northwestern Polytechnical University (西北工业大学计算机学院), China; Ningbo Institute, Northwestern Polytechnical University (西北工业大学宁波研究院), China; Department of Computer Science, Swansea University (斯旺西大学计算机科学系), United Kingdom; HuaZhong University of Science and Technology (华中科技大学), China
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by CVPR 2025

点击查看摘要

Abstract:In real-world environments, a LiDAR point cloud registration method with robust generalization capabilities (across varying distances and datasets) is crucial for ensuring safety in autonomous driving and other LiDAR-based applications. However, current methods fall short in achieving this level of generalization. To address these limitations, we propose UGP, a pruned framework designed to enhance generalization power for LiDAR point cloud registration. The core insight in UGP is the elimination of cross-attention mechanisms to improve generalization, allowing the network to concentrate on intra-frame feature extraction. Additionally, we introduce a progressive self-attention module to reduce ambiguity in large-scale scenes and integrate Bird’s Eye View (BEV) features to incorporate semantic information about scene elements. Together, these enhancements significantly boost the network’s generalization performance. We validated our approach through various generalization experiments in multiple outdoor scenes. In cross-distance generalization experiments on KITTI and nuScenes, UGP achieved state-of-the-art mean Registration Recall rates of 94.5% and 91.4%, respectively. In cross-dataset generalization from nuScenes to KITTI, UGP achieved a state-of-the-art mean Registration Recall of 90.9%. Code will be available at this https URL.
zh

[CV-92] 3D Student Splatting and Scooping

【速读】：该论文旨在改进3D Gaussian Splatting (3DGS) 的基础范式与公式化方法，以提升其性能。论文指出，作为一种未归一化的混合模型，3DGS 并不一定需要严格遵循高斯分布或点投影（splatting）的方式。为此，作者提出了一种新的混合模型，该模型采用灵活的学生 t 分布，并引入了正密度（splatting）和负密度（scooping）的概念，命名为 Student Splatting and Scooping (SSS)。SSS 在增强表达能力的同时，也带来了学习上的新挑战，因此作者进一步提出了一种新的优化采样方法。通过在多个数据集、设置和指标上的全面评估，结果表明 SSS 在质量与参数效率方面优于现有方法，例如在相似组件数量下达到或超越现有方法的质量表现，同时在某些情况下将组件数量减少高达 82%。

链接: https://arxiv.org/abs/2503.10148
作者: Jialin Zhu,Jiangbei Yue,Feixiang He,He Wang
机构: University College London (伦敦大学学院); University of Leeds (利兹大学); University College London (伦敦大学学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Recently, 3D Gaussian Splatting (3DGS) provides a new framework for novel view synthesis, and has spiked a new wave of research in neural rendering and related applications. As 3DGS is becoming a foundational component of many models, any improvement on 3DGS itself can bring huge benefits. To this end, we aim to improve the fundamental paradigm and formulation of 3DGS. We argue that as an unnormalized mixture model, it needs to be neither Gaussians nor splatting. We subsequently propose a new mixture model consisting of flexible Student’s t distributions, with both positive (splatting) and negative (scooping) densities. We name our model Student Splatting and Scooping, or SSS. When providing better expressivity, SSS also poses new challenges in learning. Therefore, we also propose a new principled sampling approach for optimization. Through exhaustive evaluation and comparison, across multiple datasets, settings, and metrics, we demonstrate that SSS outperforms existing methods in terms of quality and parameter efficiency, e.g. achieving matching or better quality with similar numbers of components, and obtaining comparable results while reducing the component number by as much as 82%.
zh

[CV-93] GaussHDR: High Dynamic Range Gaussian Splatting via Learning Unified 3D and 2D Local Tone Mapping CVPR2025

【速读】：该论文旨在解决高动态范围（HDR）新型视图合成（NVS）中现有方法存在的两个主要问题：一是基于三维（3D）色调映射的训练范式常导致HDR重建不稳定，而基于二维（2D）色调映射的训练则限制了模型拟合低动态范围（LDR）图像的能力；二是全局色调映射器可能阻碍HDR和LDR表征的学习。为了解决这些问题，论文提出的关键方案是GaussHDR，它通过三维高斯点撒（3D Gaussian splatting）统一3D和2D局部色调映射，并设计了一种接受额外上下文特征输入的残差局部色调映射器用于3D和2D色调映射。此外，通过在损失层面结合来自3D和2D局部色调映射的双LDR渲染结果，并引入不确定性学习以实现自适应调制，进一步优化了模型性能。实验表明，GaussHDR在合成和真实场景中均显著优于现有最先进的方法。

链接: https://arxiv.org/abs/2503.10143
作者: Jinfeng Liu,Lingtong Kong,Bo Li,Dan Xu
机构: The Hong Kong University of Science and Technology (香港科技大学); vivo Mobile Communication Co., Ltd (维沃移动通信有限公司)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: This paper is accepted by CVPR 2025. Project page is available at this https URL

点击查看摘要

Abstract:High dynamic range (HDR) novel view synthesis (NVS) aims to reconstruct HDR scenes by leveraging multi-view low dynamic range (LDR) images captured at different exposure levels. Current training paradigms with 3D tone mapping often result in unstable HDR reconstruction, while training with 2D tone mapping reduces the model’s capacity to fit LDR images. Additionally, the global tone mapper used in existing methods can impede the learning of both HDR and LDR representations. To address these challenges, we present GaussHDR, which unifies 3D and 2D local tone mapping through 3D Gaussian splatting. Specifically, we design a residual local tone mapper for both 3D and 2D tone mapping that accepts an additional context feature as input. We then propose combining the dual LDR rendering results from both 3D and 2D local tone mapping at the loss level. Finally, recognizing that different scenes may exhibit varying balances between the dual results, we introduce uncertainty learning and use the uncertainties for adaptive modulation. Extensive experiments demonstrate that GaussHDR significantly outperforms state-of-the-art methods in both synthetic and real-world scenarios.
zh

[CV-94] Deep Learning-Based Direct Leaf Area Estimation using Two RGBD Datasets for Model Development

【速读】：该论文旨在解决基于深度学习的单叶面积估计问题，特别是在现实场景中使用移动相机采集的RGBD图像条件下。传统方法主要依赖于图像处理或手工分割，而直接利用深度学习进行物体面积估计的研究较少。论文的关键解决方案在于：首先通过Mask R-CNN模型结合RGBD数据进行叶片分割与面积估计，并进一步提出一种双骨干网络结构（一个用于分割，另一个用于面积估计）以同时处理附着叶片和离体叶片的数据集。此外，在超参数调优过程中采用敏捷方法而非随机搜索，确保模型性能优化。最终模型在未见数据上的表现验证了其有效性，表明结合真实叶片面积标签可以进一步提升结果精度。

链接: https://arxiv.org/abs/2503.10129
作者: Namal Jayasuriya,Yi Guo,Wen Hu,Oula Ghannoum
机构: Western Sydney University (西悉尼大学); UNSW Sydney (新南威尔士大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Estimation of a single leaf area can be a measure of crop growth and a phenotypic trait to breed new varieties. It has also been used to measure leaf area index and total leaf area. Some studies have used hand-held cameras, image processing 3D reconstruction and unsupervised learning-based methods to estimate the leaf area in plant images. Deep learning works well for object detection and segmentation tasks; however, direct area estimation of objects has not been explored. This work investigates deep learning-based leaf area estimation, for RGBD images taken using a mobile camera setup in real-world scenarios. A dataset for attached leaves captured with a top angle view and a dataset for detached single leaves were collected for model development and testing. First, image processing-based area estimation was tested on manually segmented leaves. Then a Mask R-CNN-based model was investigated, and modified to accept RGBD images and to estimate the leaf area. The detached-leaf data set was then mixed with the attached-leaf plant data set to estimate the single leaf area for plant images, and another network design with two backbones was proposed: one for segmentation and the other for area estimation. Instead of trying all possibilities or random values, an agile approach was used in hyperparameter tuning. The final model was cross-validated with 5-folds and tested with two unseen datasets: detached and attached leaves. The F1 score with 90% IoA for segmentation result on unseen detached-leaf data was 1.0, while R-squared of area estimation was 0.81. For unseen plant data segmentation, the F1 score with 90% IoA was 0.59, while the R-squared score was 0.57. The research suggests using attached leaves with ground truth area to improve the results.
zh

[CV-95] PlanGen: Towards Unified Layout Planning and Image Generation in Auto-Regressive Vision Language Models

【速读】：该论文旨在解决布局规划与图像生成分离的问题，提出了一种统一的布局规划与图像生成模型PlanGen。传统基于扩散的方法将布局规划与布局到图像的转换视为两个独立任务，而PlanGen通过单一的自回归Transformer模型将这两个任务联合建模，仅利用下一令牌预测（next-token prediction）。其关键是将布局条件作为上下文集成到模型中，无需对局部标题和边界框坐标进行专门编码，这在处理复杂布局时尤其具有优势。此外，PlanGen支持统一提示下的多任务训练，并可通过精心设计的建模扩展到布局引导的图像操作。

链接: https://arxiv.org/abs/2503.10127
作者: Runze He,Bo Cheng,Yuhang Ma,Qingxiang Jia,Shanyuan Liu,Ao Ma,Xiaoyu Wu,Liebucha Wu,Dawei Leng,Yuhui Yin
机构: 360 AI Research (360 AI 研究院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 15 pages, 12 figures, project page: this https URL

点击查看摘要

Abstract:In this paper, we propose a unified layout planning and image generation model, PlanGen, which can pre-plan spatial layout conditions before generating images. Unlike previous diffusion-based models that treat layout planning and layout-to-image as two separate models, PlanGen jointly models the two tasks into one autoregressive transformer using only next-token prediction. PlanGen integrates layout conditions into the model as context without requiring specialized encoding of local captions and bounding box coordinates, which provides significant advantages over the previous embed-and-pool operations on layout conditions, particularly when dealing with complex layouts. Unified prompting allows PlanGen to perform multitasking training related to layout, including layout planning, layout-to-image generation, image layout understanding, etc. In addition, PlanGen can be seamlessly expanded to layout-guided image manipulation thanks to the well-designed modeling, with teacher-forcing content manipulation policy and negative layout guidance. Extensive experiments verify the effectiveness of our PlanGen in multiple layoutrelated tasks, showing its great potential. Code is available at: this https URL.
zh

[CV-96] Proxy-Tuning: Tailoring Multimodal Autoregressive Models for Subject-Driven Image Generation

【速读】：该论文试图解决多模态自回归（Multimodal Autoregressive, AR）模型在主体驱动图像生成（subject-driven image generation）任务中表现不如主导扩散模型（diffusion models）的问题。为了解决这一局限性，论文提出了一种名为Proxy-Tuning的解决方案，通过利用扩散模型的能力来增强AR模型在特定主体图像生成方面的性能。关键在于通过Proxy-Tuning方法，使经过微调的AR模型不仅在主体保真度（subject fidelity）方面表现出色，同时在遵循提示（prompt adherence）上也优于其对应的扩散模型导师（supervisors），从而揭示了从弱到强（weak-to-strong）泛化能力的现象。

链接: https://arxiv.org/abs/2503.10125
作者: Yi Wu,Lingting Zhu,Lei Liu,Wandi Qiao,Ziqiang Li,Lequan Yu,Bin Li
机构: University of Science and Technology of China (中国科学技术大学); The University of Hong Kong (香港大学); Nanjing University of Information Science and Technology (南京信息工程大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
备注:

点击查看摘要

Abstract:Multimodal autoregressive (AR) models, based on next-token prediction and transformer architecture, have demonstrated remarkable capabilities in various multimodal tasks including text-to-image (T2I) generation. Despite their strong performance in general T2I tasks, our research reveals that these models initially struggle with subject-driven image generation compared to dominant diffusion models. To address this limitation, we introduce Proxy-Tuning, leveraging diffusion models to enhance AR models’ capabilities in subject-specific image generation. Our method reveals a striking weak-to-strong phenomenon: fine-tuned AR models consistently outperform their diffusion model supervisors in both subject fidelity and prompt adherence. We analyze this performance shift and identify scenarios where AR models excel, particularly in multi-subject compositions and contextual understanding. This work not only demonstrates impressive results in subject-driven AR image generation, but also unveils the potential of weak-to-strong generalization in the image generation domain, contributing to a deeper understanding of different architectures’ strengths and limitations.
zh

[CV-97] Hybrid Agents for Image Restoration

【速读】：该论文试图解决现有图像恢复（Image Restoration, IR）研究中任务特定模式与通用模式各自独立且缺乏协作的问题，这导致非专业人士交互不足，并限制了复杂真实场景下的恢复能力。论文提出的关键解决方案是HybridAgent，通过引入快、慢及反馈三种代理的混合规则，将多种恢复模式整合到统一模型中，实现智能高效的用户交互。其中，慢代理优化多模态大型语言模型（Multimodal Large Language Model, MLLM），处理模糊提示下的图像退化识别；快代理基于轻量级大型语言模型（Large Language Model, LLM）利用上下文学习理解简单需求，减少资源浪费；同时引入混合失真去除模式以防止错误传播并提升系统效率。

链接: https://arxiv.org/abs/2503.10120
作者: Bingchen Li,Xin Li,Yiting Lu,Zhibo Chen
机构: University of Science and Technology of China (中国科学技术大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注:

点击查看摘要

Abstract:Existing Image Restoration (IR) studies typically focus on task-specific or universal modes individually, relying on the mode selection of users and lacking the cooperation between multiple task-specific/universal restoration modes. This leads to insufficient interaction for unprofessional users and limits their restoration capability for complicated real-world applications. In this work, we present HybridAgent, intending to incorporate multiple restoration modes into a unified image restoration model and achieve intelligent and efficient user interaction through our proposed hybrid agents. Concretely, we propose the hybrid rule of fast, slow, and feedback restoration agents. Here, the slow restoration agent optimizes the powerful multimodal large language model (MLLM) with our proposed instruction-tuning dataset to identify degradations within images with ambiguous user prompts and invokes proper restoration tools accordingly. The fast restoration agent is designed based on a lightweight large language model (LLM) via in-context learning to understand the user prompts with simple and clear requirements, which can obviate the unnecessary time/resource costs of MLLM. Moreover, we introduce the mixed distortion removal mode for our HybridAgents, which is crucial but not concerned in previous agent-based works. It can effectively prevent the error propagation of step-by-step image restoration and largely improve the efficiency of the agent system. We validate the effectiveness of HybridAgent with both synthetic and real-world IR tasks.
zh

[CV-98] MoEdit: On Learning Quantity Perception for Multi-object Image Editing

【速读】：该论文致力于解决多目标图像编辑中的一致性感知问题，特别是现有方法难以同时兼顾单个对象及其作为整体图像一部分编辑时的一致性。论文提出的关键解决方案是MoEdit框架，它通过Feature Compensation (FeCom) 模块确保每个对象属性的区分性和可分离性，并通过Quantity Attention (QTTN) 模块在编辑过程中有效控制以保持数量一致性，无需依赖辅助工具。利用Stable Diffusion模型，MoEdit实现了高质量的风格迁移、对象再创造及背景再生，同时保证输入与输出之间的一致性感知，无论图像中包含多少个对象。

链接: https://arxiv.org/abs/2503.10112
作者: Yanfeng Li,Kahou Chan,Yue Sun,Chantong Lam,Tong Tong,Zitong Yu,Keren Fu,Xiaohong Liu,Tao Tan
机构: Macao Polytechnic University (澳门城市大学); Fuzhou University (福州大学); Great Bay University (大湾区大学); Sichuan University (四川大学); Shanghai Jiao Tong University (上海交通大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Multi-object images are prevalent in various real-world scenarios, including augmented reality, advertisement design, and medical imaging. Efficient and precise editing of these images is critical for these applications. With the advent of Stable Diffusion (SD), high-quality image generation and editing have entered a new era. However, existing methods often struggle to consider each object both individually and part of the whole image editing, both of which are crucial for ensuring consistent quantity perception, resulting in suboptimal perceptual performance. To address these challenges, we propose MoEdit, an auxiliary-free multi-object image editing framework. MoEdit facilitates high-quality multi-object image editing in terms of style transfer, object reinvention, and background regeneration, while ensuring consistent quantity perception between inputs and outputs, even with a large number of objects. To achieve this, we introduce the Feature Compensation (FeCom) module, which ensures the distinction and separability of each object attribute by minimizing the in-between interlacing. Additionally, we present the Quantity Attention (QTTN) module, which perceives and preserves quantity consistency by effective control in editing, without relying on auxiliary tools. By leveraging the SD model, MoEdit enables customized preservation and modification of specific concepts in inputs with high quality. Experimental results demonstrate that our MoEdit achieves State-Of-The-Art (SOTA) performance in multi-object image editing. Data and codes will be available at this https URL.
zh

[CV-99] StableFusion: Continual Video Retrieval via Frame Adaptation

【速读】：该论文旨在解决文本到视频检索（Text-to-Video Retrieval, TVR）在面对持续新增视频内容时系统性能随时间难以保持的问题。具体而言，当前基于预训练模型的TVR方法在适应新任务时缺乏可塑性，而现有的连续学习（Continual Learning）方法则容易发生灾难性遗忘（Catastrophic Forgetting），导致历史查询与存储视频特征之间的语义错位。为应对这些挑战，论文提出了StableFusion框架，其关键是包含两个主要组件：帧融合适配器（Frame Fusion Adapter, FFA），用于捕捉视频内容的时间动态同时保持模型灵活性；任务感知混合专家（Task-Aware Mixture-of-Experts, TAME），用于在不同任务间保持查询与存储视频特征之间的一致语义对齐。实验结果表明，StableFusion在多个基准数据集上的表现优于现有连续学习和TVR方法，在连续视频流场景中对早期任务的性能退化极小。

链接: https://arxiv.org/abs/2503.10111
作者: Zecheng Zhao,Zhi Chen,Zi Huang,Shazia Sadiq,Tong Chen
机构: The University of Queensland (昆士兰大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Text-to-Video Retrieval (TVR) aims to match videos with corresponding textual queries, yet the continual influx of new video content poses a significant challenge for maintaining system performance over time. In this work, we introduce the first benchmark for Continual Text-to-Video Retrieval (CTVR) to overcome these limitations. Our analysis reveals that current TVR methods based on pre-trained models struggle to retain plasticity when adapting to new tasks, while existing continual learning approaches experience catastrophic forgetting, resulting in semantic misalignment between historical queries and stored video features. To address these challenges, we propose StableFusion, a novel CTVR framework comprising two main components: the Frame Fusion Adapter (FFA), which captures temporal dynamics in video content while preserving model flexibility, and the Task-Aware Mixture-of-Experts (TAME), which maintains consistent semantic alignment between queries across tasks and the stored video features. Comprehensive evaluations on two benchmark datasets under various task settings demonstrate that StableFusion outperforms existing continual learning and TVR methods, achieving superior retrieval performance with minimal degradation on earlier tasks in the context of continuous video streams. Our code is available at: this https URL
zh

[CV-100] Dream-IF: Dynamic Relative EnhAnceMent for Image Fusion

【速读】：本文旨在解决多源图像融合过程中因传感器差异导致的图像退化问题，以提升融合质量。传统方法通常将图像增强与融合视为独立过程，忽略了两者之间的内在关联。论文的关键在于引入“主导区域”（dominant regions）的概念，提出了一种动态相对增强框架（Dynamic Relative EnhAnceMent framework for Image Fusion, Dream-IF）。该框架通过量化不同层面上各模态的相对主导性，实现跨模态的相互增强，并结合基于提示编码（prompt-based encoding）来捕获特定退化的细节，从而动态引导修复过程。这种方法不仅支持图像复原，还扩展到更广泛的图像增强应用中。实验结果表明，Dream-IF在性能上显著优于现有方法。

链接: https://arxiv.org/abs/2503.10109
作者: Xingxin Xu,Bing Cao,Yinan Xia,Pengfei Zhu,Qinghua Hu
机构: School of New Media and Communication, Tianjin University (天津大学), China; College of Intelligence and Computing, Tianjin University (天津大学), China; State Key Laboratory of Integrated Services Networks, Xidian University (西安电子科技大学), Xi’an, China
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Image fusion aims to integrate comprehensive information from images acquired through multiple sources. However, images captured by diverse sensors often encounter various degradations that can negatively affect fusion quality. Traditional fusion methods generally treat image enhancement and fusion as separate processes, overlooking the inherent correlation between them; notably, the dominant regions in one modality of a fused image often indicate areas where the other modality might benefit from enhancement. Inspired by this observation, we introduce the concept of dominant regions for image enhancement and present a Dynamic Relative EnhAnceMent framework for Image Fusion (Dream-IF). This framework quantifies the relative dominance of each modality across different layers and leverages this information to facilitate reciprocal cross-modal enhancement. By integrating the relative dominance derived from image fusion, our approach supports not only image restoration but also a broader range of image enhancement applications. Furthermore, we employ prompt-based encoding to capture degradation-specific details, which dynamically steer the restoration process and promote coordinated enhancement in both multi-modal image fusion and image enhancement scenarios. Extensive experimental results demonstrate that Dream-IF consistently outperforms its counterparts.
zh

[CV-101] Mamba-VA: A Mamba-based Approach for Continuous Emotion Recognition in Valence-Arousal Space

【速读】：该论文旨在解决连续情感识别（Continuous Emotion Recognition, CER）中长期依赖建模和复杂时间动态捕捉的挑战。现有方法在处理长时间序列的情感变化及细微的时间相关性时仍存在不足。为了解决这些问题，论文提出了一种新的情感识别模型Mamba-VA，其关键在于结合Mamba架构高效建模视频帧中的序列情感变化。具体而言，模型首先利用掩码自编码器（Masked Autoencoder, MAE）提取深度视觉特征以增强时间信息的鲁棒性；接着使用时序卷积网络（Temporal Convolutional Network, TCN）捕获局部时间依赖性；然后通过Mamba模块实现长序列建模，以学习全局情感趋势；最后，采用全连接（Fully Connected, FC）层进行回归分析以预测连续的Valence和Arousal值。这一系列设计共同实现了对情感状态更精确的建模与识别。

链接: https://arxiv.org/abs/2503.10104
作者: Yuheng Liang,Zheyu Wang,Feng Liu,Mingzhou Liu,Yu Yao
机构: Nanjing University of Posts and Telecommunications (南京邮电大学); Jiangsu Key Laboratory of Intelligent Information Processing and Communication Technology (江苏省智能信息处理与通信技术重点实验室); Nanjing University of Science and Technology ZiJin College (南京理工大学紫金学院); School of Computer and Artificial Intelligence (计算机与人工智能学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 6 pages

点击查看摘要

Abstract:Continuous Emotion Recognition (CER) plays a crucial role in intelligent human-computer interaction, mental health monitoring, and autonomous driving. Emotion modeling based on the Valence-Arousal (VA) space enables a more nuanced representation of emotional states. However, existing methods still face challenges in handling long-term dependencies and capturing complex temporal dynamics. To address these issues, this paper proposes a novel emotion recognition model, Mamba-VA, which leverages the Mamba architecture to efficiently model sequential emotional variations in video frames. First, the model employs a Masked Autoencoder (MAE) to extract deep visual features from video frames, enhancing the robustness of temporal information. Then, a Temporal Convolutional Network (TCN) is utilized for temporal modeling to capture local temporal dependencies. Subsequently, Mamba is applied for long-sequence modeling, enabling the learning of global emotional trends. Finally, a fully connected (FC) layer performs regression analysis to predict continuous valence and arousal values. Experimental results on the Valence-Arousal (VA) Estimation task of the 8th competition on Affective Behavior Analysis in-the-wild (ABAW) demonstrate that the proposed model achieves valence and arousal scores of 0.5362 (0.5036) and 0.4310 (0.4119) on the validation (test) set, respectively, outperforming the baseline. The source code is available on GitHub:this https URL.
zh

[CV-102] Improving Diffusion-based Inverse Algorithms under Few-Step Constraint via Learnable Linear Extrapolation

【速读】：该论文旨在解决扩散模型在解决各类逆问题（Inverse Problems）时，因基于迭代的算法需要数百到数千步才能达到理想性能，而在较少步骤下性能显著下降的问题，这限制了其实际应用。论文的关键在于提出了一种名为Learnable Linear Extrapolation (LLE)的方法，该方法通过引入一个规范形式（canonical form），将现有的基于扩散的逆算法分解为三个模块以统一分析，并受到高阶扩散常微分方程(ODE)求解器中线性子空间搜索策略的启发，实现了对任意符合该规范形式的扩散逆算法的轻量级性能增强。实验表明，LLE方法在多种算法和任务中均能带来一致的性能提升，显示出其在有限步骤内提高扩散基逆算法效率和性能的潜力。

链接: https://arxiv.org/abs/2503.10103
作者: Jiawei Zhang,Ziyuan Liu,Leon Yan,Gen Li,Yuantao Gu
机构: Tsinghua University (清华大学); CUHK (香港中文大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: preprint

点击查看摘要

Abstract:Diffusion models have demonstrated remarkable performance in modeling complex data priors, catalyzing their widespread adoption in solving various inverse problems. However, the inherently iterative nature of diffusion-based inverse algorithms often requires hundreds to thousands of steps, with performance degradation occurring under fewer steps which limits their practical applicability. While high-order diffusion ODE solvers have been extensively explored for efficient diffusion sampling without observations, their application to inverse problems remains underexplored due to the diverse forms of inverse algorithms and their need for repeated trajectory correction based on observations. To address this gap, we first introduce a canonical form that decomposes existing diffusion-based inverse algorithms into three modules to unify their analysis. Inspired by the linear subspace search strategy in the design of high-order diffusion ODE solvers, we propose the Learnable Linear Extrapolation (LLE) method, a lightweight approach that universally enhances the performance of any diffusion-based inverse algorithm that fits the proposed canonical form. Extensive experiments demonstrate consistent improvements of the proposed LLE method across multiple algorithms and tasks, indicating its potential for more efficient solutions and boosted performance of diffusion-based inverse algorithms with limited steps. Codes for reproducing our experiments are available at \hrefthis https URLthis https URL_inverse_problem.
zh

[CV-103] Geometric Parameter Estimations of Perovskite Solar Cells Based on Optical Simulations

【速读】：该论文旨在解决非侵入性估算钙钛矿太阳能电池层厚度的问题。解决方案的关键在于利用卷积神经网络（Convolutional Neural Network, CNN），通过钙钛矿太阳能电池的外量子效率来预测厚度。网络的训练基于光学性质恒定的厚度范围，这些范围也限定了方法的应用约束。研究发现，由于不透明钙钛矿的光敏性问题，透明钙钛矿表现出更好的性能。为了进一步优化性能并减少均方根误差，作者尝试了不同的采样方法、图像规格以及贝叶斯优化（Bayesian Optimization）进行超参数调优。虽然采样方法仅带来边际改进，但贝叶斯优化显著提高了准确性。此外，还进行了输入规格和预处理方法的实验。实验结果验证了该卷积神经网络在基于控制实验预测钙钛矿太阳能电池层厚度方面的可行性、高效性和有效性。

链接: https://arxiv.org/abs/2503.10102
作者: Junhao Wang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:This paper presents a non-invasive approach to estimate the layer thicknesses of perovskite solar cells. The thicknesses are predicted by a convolutional neural network that leverages the external quantum efficiency of a perovskite solar cell. The network is trained in thickness ranges where the optical properties are constant, and these ranges set the constraints for the network’s application. Due to light sensitivity issues with opaque perovskites, the convolutional neural network showed better performance with transparent perovskites. To optimize the performance and reduce the root mean square error, we tried different sampling methods, image specifications, and Bayesian optimization for hyperparameter tuning. While sampling methods showed marginal improvement, implementing Bayesian optimization demonstrated high accuracy. Other minor optimization attempts include experimenting with input specifications and pre-processing approaches. The results confirm the feasibility, efficiency, and effectiveness of a convolution neural network for predicting perovskite solar cells’ layer thicknesses based on controlled experiments.
zh

[CV-104] Semantic Latent Motion for Portrait Video Generation

【速读】：该论文致力于解决现有肖像视频生成方法过度依赖人工先验和预训练生成模型的问题，这些问题可能导致不真实的运动表现并降低推理效率。为应对这些挑战，论文提出了一种紧凑且具有表达能力的运动表示——语义潜在运动（Semantic Latent Motion, SeMo）。SeMo 的关键在于其有效的三步框架：抽象（Abstraction）、推理（Reasoning）和生成（Generation）。通过在潜在空间中进行长期建模和高效推理，SeMo 实现了高质量的视觉效果与高效的推理性能，同时利用运动动力学作为条件信息引导生成模型合成逼真的帧间过渡，从而实现了实时生成高度逼真运动的视频。

链接: https://arxiv.org/abs/2503.10096
作者: Qiyuan Zhang,Chenyu Wu,Wenzhang Sun,Huaize Liu,Donglin Di,Wei Chen,Changqing Zou
机构: Zhejiang University (浙江大学); Li Auto (理想汽车); Zhejiang Lab (之江实验室); Hangzhou Institute for Advanced Study, University of Chinese Academy of Sciences (杭州高级研究院，中国科学院大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Recent advancements in portrait video generation have been noteworthy. However, existing methods rely heavily on human priors and pre-trained generation models, which may introduce unrealistic motion and lead to inefficient inference. To address these challenges, we propose Semantic Latent Motion (SeMo), a compact and expressive motion representation. Leveraging this representation, our approach achieve both high-quality visual results and efficient inference. SeMo follows an effective three-step framework: Abstraction, Reasoning, and Generation. First, in the Abstraction step, we use a carefully designed Mask Motion Encoder to compress the subject’s motion state into a compact and abstract latent motion (1D token). Second, in the Reasoning step, long-term modeling and efficient reasoning are performed in this latent space to generate motion sequences. Finally, in the Generation step, the motion dynamics serve as conditional information to guide the generation model in synthesizing realistic transitions from reference frames to target frames. Thanks to the compact and descriptive nature of Semantic Latent Motion, our method enables real-time video generation with highly realistic motion. User studies demonstrate that our approach surpasses state-of-the-art models with an 81% win rate in realism. Extensive experiments further highlight its strong compression capability, reconstruction quality, and generative potential. Moreover, its fully self-supervised nature suggests promising applications in broader video generation tasks.
zh

[CV-105] AdvPaint: Protecting Images from Inpainting Manipulation via Adversarial Attention Disruption ICLR2025

【速读】：该论文旨在解决扩散模型在生成高质量图像方面的强大能力被恶意对手滥用的问题，特别是针对扩散模型的擦除修复（inpainting）任务的保护不足。现有方法主要关注图像到图像或文本到图像的任务，而对防止未经授权的擦除修复的防御较少涉及，导致防护效果不佳。为此，论文提出了一种名为ADVPAINT的新防御框架，其关键是通过生成对抗扰动来有效干扰对手的擦除修复任务。具体而言，ADVPAINT通过干扰目标扩散修复模型中的自注意力和交叉注意力模块，破坏语义理解和提示交互，从而削弱对手的能力。此外，ADVPAINT采用两阶段扰动策略，基于物体周围的扩大边界框划分扰动区域，以增强对不同形状和大小遮罩的鲁棒性。实验结果表明，ADVPAINT显著提升了修复任务的干扰效果，在FID分数上提高了超过100点，并大幅降低了精度指标，优于现有方法。

链接: https://arxiv.org/abs/2503.10081
作者: Joonsung Jeon,Woo Jae Kim,Suhyeon Ha,Sooel Son,Sung-eui Yoon
机构: Korea Advanced Institute of Science and Technology (KAIST)
类目: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR)
备注: Accepted to ICLR 2025

点击查看摘要

Abstract:The outstanding capability of diffusion models in generating high-quality images poses significant threats when misused by adversaries. In particular, we assume malicious adversaries exploiting diffusion models for inpainting tasks, such as replacing a specific region with a celebrity. While existing methods for protecting images from manipulation in diffusion-based generative models have primarily focused on image-to-image and text-to-image tasks, the challenge of preventing unauthorized inpainting has been rarely addressed, often resulting in suboptimal protection performance. To mitigate inpainting abuses, we propose ADVPAINT, a novel defensive framework that generates adversarial perturbations that effectively disrupt the adversary’s inpainting tasks. ADVPAINT targets the self- and cross-attention blocks in a target diffusion inpainting model to distract semantic understanding and prompt interactions during image generation. ADVPAINT also employs a two-stage perturbation strategy, dividing the perturbation region based on an enlarged bounding box around the object, enhancing robustness across diverse masks of varying shapes and sizes. Our experimental results demonstrate that ADVPAINT’s perturbations are highly effective in disrupting the adversary’s inpainting tasks, outperforming existing methods; ADVPAINT attains over a 100-point increase in FID and substantial decreases in precision.
zh

[CV-106] Bayesian Prompt Flow Learning for Zero-Shot Anomaly Detection

【速读】：该论文致力于解决零样本异常检测（Zero-Shot Anomaly Detection, ZSAD）中现有视觉-语言模型面临的三个主要挑战：1) 手工设计提示词需要大量专家知识和反复尝试；2) 单一形式的可学习提示词难以捕捉复杂的异常语义；3) 不受约束的提示空间限制了对未见类别的泛化能力。为了解决这些问题，论文提出了一种名为贝叶斯提示流学习（Bayesian Prompt Flow Learning, Bayes-PFL）的方法，其关键在于从贝叶斯视角将提示空间建模为一个可学习的概率分布。具体而言，设计了一个提示流模块以同时学习图像特定和图像无关的分布，并联合利用这些分布来正则化文本提示空间并增强模型在未见类别上的泛化能力。此外，引入残差交叉注意力（Residual Cross-Attention, RCA）模块以更好地对齐动态文本嵌入与细粒度图像特征。通过在15个工业和医学数据集上的广泛实验验证，证明了所提方法的优越性能。

链接: https://arxiv.org/abs/2503.10080
作者: Zhen Qu,Xian Tao,Xinyi Gong,Shichen Qu,Qiyu Chen,Zhengtao Zhang,Xingang Wang,Guiguang Ding
机构: Institute of Automation, Chinese Academy of Sciences (自动化研究所, 中国科学院); School of Artificial Intelligence, University of Chinese Academy of Sciences (人工智能学院, 中国科学院大学); HDU (杭州电子科技大学); Casivision (视派科技); Luoyang Institute for Robot and Intelligent Equipment (洛阳机器人与智能装备研究院); Tsinghua University (清华大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Recently, vision-language models (e.g. CLIP) have demonstrated remarkable performance in zero-shot anomaly detection (ZSAD). By leveraging auxiliary data during training, these models can directly perform cross-category anomaly detection on target datasets, such as detecting defects on industrial product surfaces or identifying tumors in organ tissues. Existing approaches typically construct text prompts through either manual design or the optimization of learnable prompt vectors. However, these methods face several challenges: 1) handcrafted prompts require extensive expert knowledge and trial-and-error; 2) single-form learnable prompts struggle to capture complex anomaly semantics; and 3) an unconstrained prompt space limit generalization to unseen categories. To address these issues, we propose Bayesian Prompt Flow Learning (Bayes-PFL), which models the prompt space as a learnable probability distribution from a Bayesian perspective. Specifically, a prompt flow module is designed to learn both image-specific and image-agnostic distributions, which are jointly utilized to regularize the text prompt space and enhance the model’s generalization on unseen categories. These learned distributions are then sampled to generate diverse text prompts, effectively covering the prompt space. Additionally, a residual cross-attention (RCA) module is introduced to better align dynamic text embeddings with fine-grained image features. Extensive experiments on 15 industrial and medical datasets demonstrate our method’s superior performance.
zh

[CV-107] Image Quality Assessment: From Human to Machine Preference

【速读】：该论文试图解决传统图像质量评估（IQA）方法主要基于人类主观偏好，而未能有效反映机器视觉系统需求的问题。随着通信协议的发展，机器对视觉数据的消费量已超过人类，且机器的偏好取决于下游任务（如分割和检测），而非视觉吸引力。为此，论文首次提出“面向机器视觉的图像质量评估”这一课题，并通过以下关键方案解决该问题：(1) 定义了机器的主观偏好，包括下游任务、测试模型和评估指标；(2) 构建了一个包含225万细粒度标注及3万参考/失真图像对实例的机器偏好数据库（MPD）；(3) 验证了主流IQA算法在MPD上的性能，发现现有IQA指标具有人类中心化特性，无法准确表征机器偏好。论文希望MPD能够推动IQA从人类偏好向机器偏好的演进。

链接: https://arxiv.org/abs/2503.10078
作者: Chunyi Li,Yuan Tian,Xiaoyue Ling,Zicheng Zhang,Haodong Duan,Haoning Wu,Ziheng Jia,Xiaohong Liu,Xiongkuo Min,Guo Lu,Weisi Lin,Guangtao Zhai
机构: Shanghai Jiao Tong University (上海交通大学); Shanghai AI Lab (上海人工智能实验室); Moonshot AI; Nanyang Technological University (南洋理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM); Image and Video Processing (eess.IV)
备注:

点击查看摘要

Abstract:Image Quality Assessment (IQA) based on human subjective preferences has undergone extensive research in the past decades. However, with the development of communication protocols, the visual data consumption volume of machines has gradually surpassed that of humans. For machines, the preference depends on downstream tasks such as segmentation and detection, rather than visual appeal. Considering the huge gap between human and machine visual systems, this paper proposes the topic: Image Quality Assessment for Machine Vision for the first time. Specifically, we (1) defined the subjective preferences of machines, including downstream tasks, test models, and evaluation metrics; (2) established the Machine Preference Database (MPD), which contains 2.25M fine-grained annotations and 30k reference/distorted image pair instances; (3) verified the performance of mainstream IQA algorithms on MPD. Experiments show that current IQA metrics are human-centric and cannot accurately characterize machine preferences. We sincerely hope that MPD can promote the evolution of IQA from human to machine preferences. Project page is on: this https URL.
zh

[CV-108] VMBench: A Benchmark for Perception-Aligned Video Motion Generation

【速读】：该论文试图解决视频运动评估中存在的两大问题：1）当前运动度量方法未能充分反映人类感知；2）现有的运动提示较为有限。为解决这些问题，论文提出了VMBench，这是一个综合性的视频运动基准，包含与人类感知一致的运动度量以及涵盖最多样运动类型的特征。其解决方案的关键在于三个方面：1）基于人类感知开发了五维细粒度的运动评估指标，以更深入地洞察模型在运动质量上的优缺点；2）提出了一种元引导的运动提示生成方法，通过提取元信息利用大语言模型生成多样化运动提示，并通过人机验证优化，构建了一个覆盖六个关键动态场景维度的多层次提示库；3）建立了与人类偏好对齐的验证机制，通过人的标注验证基准，使度量指标相比基线方法在Spearman相关性上平均提升了35.3%。这是首次从人类感知一致性角度评估视频运动质量的研究，并将为运动生成模型的评估和进步设定新标准。

链接: https://arxiv.org/abs/2503.10076
作者: Xinrang Ling,Chen Zhu,Meiqi Wu,Hangyu Li,Xiaokun Feng,Cundian Yang,Aiming Hao,Jiashu Zhu,Jiahong Wu,Xiangxiang Chu
机构: APMP, Alibaba Group (阿里云); CRISE, Institute of Automation, Chinese Academy of Sciences (中科院自动化研究所); School of Computer Science and Technology, University of Chinese Academy of Sciences (中国科学院大学计算机科学与技术学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Video generation has advanced rapidly, improving evaluation methods, yet assessing video’s motion remains a major challenge. Specifically, there are two key issues: 1) current motion metrics do not fully align with human perceptions; 2) the existing motion prompts are limited. Based on these findings, we introduce VMBench–a comprehensive Video Motion Benchmark that has perception-aligned motion metrics and features the most diverse types of motion. VMBench has several appealing properties: 1) Perception-Driven Motion Evaluation Metrics, we identify five dimensions based on human perception in motion video assessment and develop fine-grained evaluation metrics, providing deeper insights into models’ strengths and weaknesses in motion quality. 2) Meta-Guided Motion Prompt Generation, a structured method that extracts meta-information, generates diverse motion prompts with LLMs, and refines them through human-AI validation, resulting in a multi-level prompt library covering six key dynamic scene dimensions. 3) Human-Aligned Validation Mechanism, we provide human preference annotations to validate our benchmarks, with our metrics achieving an average 35.3% improvement in Spearman’s correlation over baseline methods. This is the first time that the quality of motion in videos has been evaluated from the perspective of human perception alignment. Additionally, we will soon release VMBench at this https URL, setting a new standard for evaluating and advancing motion generation models.
zh

[CV-109] SmartWay: Enhanced Waypoint Prediction and Backtracking for Zero-Shot Vision-and-Language Navigation

【速读】：该论文旨在解决连续环境中视觉与语言导航（Vision-and-Language Navigation, VLN）中的两个核心挑战：现有方法中当前点位预测器缺乏空间意识，导航器缺少历史推理和回溯能力，从而限制了其适应性。为了解决这些问题，论文提出了一种零样本 VLNC-CE 框架，关键在于将增强的点位预测器与基于多模态大语言模型（Multi-modal Large Language Model, MLLM）的导航器相结合。具体而言，增强的点位预测器采用更强的视觉编码器、掩码交叉注意力融合以及占用感知损失以提升点位质量；而导航器则引入了历史感知推理和带回溯的自适应路径规划，从而提高鲁棒性。实验结果表明，该方法在 R2R-CE 和 MP3D 数据集上的零样本设置中达到了最先进的性能，并在 Turtlebot 4 的现实世界验证中展示了良好的适应性。

链接: https://arxiv.org/abs/2503.10069
作者: Xiangyu Shi,Zerui Li,Wenqi Lyu,Jiatong Xia,Feras Dayoub,Yanyuan Qiao,Qi Wu
机构: Australian Institute for Machine Learning at the University of Adelaide (澳大利亚阿德莱德大学机器学习研究所)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Vision-and-Language Navigation (VLN) in continuous environments requires agents to interpret natural language instructions while navigating unconstrained 3D spaces. Existing VLN-CE frameworks rely on a two-stage approach: a waypoint predictor to generate waypoints and a navigator to execute movements. However, current waypoint predictors struggle with spatial awareness, while navigators lack historical reasoning and backtracking capabilities, limiting adaptability. We propose a zero-shot VLN-CE framework integrating an enhanced waypoint predictor with a Multi-modal Large Language Model (MLLM)-based navigator. Our predictor employs a stronger vision encoder, masked cross-attention fusion, and an occupancy-aware loss for better waypoint quality. The navigator incorporates history-aware reasoning and adaptive path planning with backtracking, improving robustness. Experiments on R2R-CE and MP3D benchmarks show our method achieves state-of-the-art (SOTA) performance in zero-shot settings, demonstrating competitive results compared to fully supervised methods. Real-world validation on Turtlebot 4 further highlights its adaptability.
zh

[CV-110] AI-assisted Early Detection of Pancreatic Ductal Adenocarcinoma on Contrast-enhanced CT

【速读】：该论文旨在解决胰腺导管腺癌（Pancreatic Ductal Adenocarcinoma, PDAC）早期检测困难的问题，由于PDAC缺乏早期特异性症状，大多数患者确诊时已处于晚期，严重影响治疗选择和生活质量。论文提出了一种由粗到细（coarse-to-fine）的方法，在对比增强CT扫描中检测PDAC。关键在于首先从低分辨率图像中定位并裁剪感兴趣区域，随后在更高分辨率下精细分割与PDAC相关的结构。此外，通过引入数据分裂模型集成策略以及定制化后处理函数进一步提升检测性能。最终，在PANORAMA挑战赛中，该方法以AUROC 0.9263和AP 0.7243的成绩获得第一名。

链接: https://arxiv.org/abs/2503.10068
作者: Han Liu,Riqiang Gao,Sasa Grbic
机构: Siemens Healthineers (西门子医疗)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 1st place in the PANORAMA Challenge (Team DTI)

点击查看摘要

Abstract:Pancreatic ductal adenocarcinoma (PDAC) is one of the most common and aggressive types of pancreatic cancer. However, due to the lack of early and disease-specific symptoms, most patients with PDAC are diagnosed at an advanced disease stage. Consequently, early PDAC detection is crucial for improving patients’ quality of life and expanding treatment options. In this work, we develop a coarse-to-fine approach to detect PDAC on contrast-enhanced CT scans. First, we localize and crop the region of interest from the low-resolution images, and then segment the PDAC-related structures at a finer scale. Additionally, we introduce two strategies to further boost detection performance: (1) a data-splitting strategy for model ensembling, and (2) a customized post-processing function. We participated in the PANORAMA challenge and ranked 1st place for PDAC detection with an AUROC of 0.9263 and an AP of 0.7243. Our code and models are publicly available at this https URL.
zh

[CV-111] Do We Always Need the Simplicity Bias? Looking for Optimal Inductive Biases in the Wild CVPR

【速读】：该论文旨在探索神经网络“简单性偏见”（simplicity bias）原则的适用极限。传统观点认为，神经架构通过相对简单的函数拟合数据的简单性偏见是其成功的关键。然而，本文通过研究发现，这种简单性偏见实际上源于ReLU激活函数（Rectified Linear Unit, ReLU），并提出了一种元学习方法，以设计更适合特定任务的新激活函数和归纳偏见（inductive biases）。论文的关键在于通过引入新的激活函数，使模型能够适应复杂度更高的先验假设（prior of higher complexity），从而在一些ReLU表现不佳的任务中取得更好的性能，如表格数据处理、回归任务、捷径学习场景以及算法理解任务等。相比之下，在图像分类任务中，学习到的激活函数与ReLU和GeLU（Gaussian Error Linear Unit）接近，表明ReLU在网络处理图像任务中的简单性偏见仍是有效的。因此，论文的核心贡献在于揭示了ReLU网络的简单性偏见并非普遍适用，并强调了设计更合适的归纳偏见的重要性，而不仅仅是依赖于“复杂性”的度量标准。

链接: https://arxiv.org/abs/2503.10065
作者: Damien Teney,Liangze Jiang,Florin Gogianu,Ehsan Abbasnejad
机构: Idiap Research Institute (Idiap 研究所); EPFL (洛桑联邦理工学院); Bitdefender (Bitdefender); University of Adelaide (阿德莱德大学)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

点击查看摘要

Abstract:Neural architectures tend to fit their data with relatively simple functions. This “simplicity bias” is widely regarded as key to their success. This paper explores the limits of this principle. Building on recent findings that the simplicity bias stems from ReLU activations [96], we introduce a method to meta-learn new activation functions and inductive biases better suited to specific tasks. Findings: We identify multiple tasks where the simplicity bias is inadequate and ReLUs suboptimal. In these cases, we learn new activation functions that perform better by inducing a prior of higher complexity. Interestingly, these cases correspond to domains where neural networks have historically struggled: tabular data, regression tasks, cases of shortcut learning, and algorithmic grokking tasks. In comparison, the simplicity bias induced by ReLUs proves adequate on image tasks where the best learned activations are nearly identical to ReLUs and GeLUs. Implications: Contrary to popular belief, the simplicity bias of ReLU networks is not universally useful. It is near-optimal for image classification, but other inductive biases are sometimes preferable. We showed that activation functions can control these inductive biases, but future tailored architectures might provide further benefits. Advances are still needed to characterize a model’s inductive biases beyond “complexity”, and their adequacy with the data. Comments: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV) Cite as: arXiv:2503.10065 [cs.LG] (or arXiv:2503.10065v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2503.10065 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh

[CV-112] Multi-Modal Mamba Modeling for Survival Prediction (M4Survive): Adapting Joint Foundation Model Representations

【速读】：该论文致力于解决肿瘤学中生存预测准确性不足的问题，传统单一模态方法难以充分整合放射学与病理学评估提供的互补性洞见。为应对这一挑战，论文提出了一种名为M4Survive（多模态Mamba建模用于生存预测）的新框架，其关键在于利用高效的适配器网络学习联合基础模型表示，并通过基于Mamba的适配器动态融合来自基础模型库（如MedImageInsight、BiomedCLIP等）的异构嵌入，构建优化用于生存风险估计的相关潜在空间，从而实现高效多模态学习同时保持计算效率。实验结果表明，该方法在基准数据集上的生存预测准确性优于单模态及传统静态多模态基线方法。

链接: https://arxiv.org/abs/2503.10057
作者: Ho Hin Lee,Alberto Santamaria-Pang,Jameson Merkov,Matthew Lungren,Ivan Tarapov
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 10 pages

点击查看摘要

Abstract:Accurate survival prediction in oncology requires integrating diverse imaging modalities to capture the complex interplay of tumor biology. Traditional single-modality approaches often fail to leverage the complementary insights provided by radiological and pathological assessments. In this work, we introduce M4Survive (Multi-Modal Mamba Modeling for Survival Prediction), a novel framework that learns joint foundation model representations using efficient adapter networks. Our approach dynamically fuses heterogeneous embeddings from a foundation model repository (e.g., MedImageInsight, BiomedCLIP, Prov-GigaPath, UNI2-h), creating a correlated latent space optimized for survival risk estimation. By leveraging Mamba-based adapters, M4Survive enables efficient multi-modal learning while preserving computational efficiency. Experimental evaluations on benchmark datasets demonstrate that our approach outperforms both unimodal and traditional static multi-modal baselines in survival prediction accuracy. This work underscores the potential of foundation model-driven multi-modal fusion in advancing precision oncology and predictive analytics.
zh

[CV-113] Fourier Decomposition for Explicit Representation of 3D Point Cloud Attributes

【速读】：该论文旨在解决彩色点云（colored point clouds）编码中存在的关键问题：现有方法仅在单点层面分别处理颜色和几何特征，导致感受野受限且难以捕捉多点间的关联关系。为了解决这一问题，论文提出了一种基于三维傅里叶分解（3D Fourier decomposition）的点云编码方法，通过频域操作同时解耦颜色与几何特征，并扩展感受野。这种方法的核心在于利用振幅（amplitude）唯一表征颜色属性，相位（phase）编码几何结构，从而实现两种属性的独立学习与利用。此外，频域特性天然支持局部特征聚合，并结合多点信息。论文通过点云分类和风格迁移任务验证了该方法的有效性，在DensePoint数据集上取得了最先进的性能，并通过基于振幅的数据增强策略进一步提升了效果。

链接: https://arxiv.org/abs/2503.10055
作者: Donghyun Kim,Hyunah Ko,Chanyoung Kim,Seong Jae Hwang
机构: Yonsei University (延世大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注:

点击查看摘要

Abstract:While 3D point clouds are widely utilized across various vision applications, their irregular and sparse nature make them challenging to handle. In response, numerous encoding approaches have been proposed to capture the rich semantic information of point clouds. Yet, a critical limitation persists: a lack of consideration for colored point clouds which are more capable 3D representations as they contain diverse attributes: color and geometry. While existing methods handle these attributes separately on a per-point basis, this leads to a limited receptive field and restricted ability to capture relationships across multiple points. To address this, we pioneer a point cloud encoding methodology that leverages 3D Fourier decomposition to disentangle color and geometric features while extending the receptive field through spectral-domain operations. Our analysis confirms that this encoding approach effectively separates feature components, where the amplitude uniquely captures color attributes and the phase encodes geometric structure, thereby enabling independent learning and utilization of both attributes. Furthermore, the spectral-domain properties of these components naturally aggregate local features while considering multiple points’ information. We validate our point cloud encoding approach on point cloud classification and style transfer tasks, achieving state-of-the-art results on the DensePoint dataset with improvements via a proposed amplitude-based data augmentation strategy.
zh

[CV-114] DTA: Dual Temporal-channel-wise Attention for Spiking Neural Networks WACV

【速读】：本文旨在解决基于脉冲神经网络（Spiking Neural Networks, SNNs）在有效利用时间信息方面存在的局限性，通过引入注意力机制提升其时间动态特性。传统的时间注意力操作要么采用相同的运算方式，要么使用不同的运算方式处理目标维度，而这些方法分别从不同角度捕捉时间信息。论文的关键创新在于提出了一种名为双时间通道注意力（Dual Temporal-channel-wise Attention, DTA）的新机制，它融合了相同与不同注意力策略的优势，同时关注时间维度内的相关性和依赖性。这是首次尝试结合这两种策略来增强SNNs的时间表征能力。实验结果表明，该机制在静态数据集（如CIFAR10、CIFAR100、ImageNet-1k）和动态数据集（如CIFAR10-DVS）上均达到了最先进的性能。

链接: https://arxiv.org/abs/2503.10052
作者: Minje Kim,Minjun Kim,Xu Yang
机构: Institution1; Institution2
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
备注: Accepted by IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) 2025

点击查看摘要

Abstract:Spiking Neural Networks (SNNs) present a more energy-efficient alternative to Artificial Neural Networks (ANNs) by harnessing spatio-temporal dynamics and event-driven spikes. Effective utilization of temporal information is crucial for SNNs, leading to the exploration of attention mechanisms to enhance this capability. Conventional attention operations either apply identical operation or employ non-identical operations across target dimensions. We identify that these approaches provide distinct perspectives on temporal information. To leverage the strengths of both operations, we propose a novel Dual Temporal-channel-wise Attention (DTA) mechanism that integrates both identical/non-identical attention strategies. To the best of our knowledge, this is the first attempt to concentrate on both the correlation and dependency of temporal-channel using both identical and non-identical attention operations. Experimental results demonstrate that the DTA mechanism achieves state-of-the-art performance on both static datasets (CIFAR10, CIFAR100, ImageNet-1k) and dynamic dataset (CIFAR10-DVS), elevating spike representation and capturing complex temporal-channel relationship. We open-source our code: this https URL.
zh

[CV-115] Enhancing Multi-Agent Systems via Reinforcement Learning with LLM -based Planner and Graph-based Policy ICRA2025

【速读】：该论文旨在解决多智能体系统（Multi-Agent Systems, MAS）在执行复杂任务时面临的协调困难和安全性挑战，同时克服基于多智能体强化学习（Multi-Agent Reinforcement Learning, MARL）方法在处理复杂任务及设计奖励函数方面的局限性。此外，尽管大型语言模型（Large Language Models, LLMs）为MAS引入了更强的推理和认知能力，但现有基于LLM的系统难以在动态环境中快速且准确地响应。为应对这些挑战，论文提出了一种基于LLM的图协作多智能体强化学习框架（LLM-based Graph Collaboration MARL, LGC-MARL）。该框架的关键在于结合LLM与MARL的优势，通过将复杂任务分解为可执行子任务，并利用基于图的协调机制实现高效协作。具体而言，LGC-MARL包含两个主要组件：LLM规划器和基于图的协作元策略。其中，LLM规划器负责将复杂任务指令转化为一系列子任务，评估其合理性并通过批评模型生成动作依赖图；而基于图的协作元策略则依据此图促进智能体间的通信与协作，并通过元学习适应新环境。实验结果表明，该方法在AI2-THOR仿真平台上的表现显著优于传统方法，展现了优越的性能与扩展性。

链接: https://arxiv.org/abs/2503.10049
作者: Ziqi Jia,Junjie Li,Xiaoyang Qu,Jianzong Wang
机构: Ping An Technology (Shenzhen) Co., Ltd. (平安科技（深圳）有限公司); Tsinghua Shenzhen International Graduate School, Tsinghua University (清华大学深圳国际研究生院); Huazhong University of Science and Technology (华中科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by the 2025 IEEE International Conference on Robotics Automation (ICRA 2025)

点击查看摘要

Abstract:Multi-agent systems (MAS) have shown great potential in executing complex tasks, but coordination and safety remain significant challenges. Multi-Agent Reinforcement Learning (MARL) offers a promising framework for agent collaboration, but it faces difficulties in handling complex tasks and designing reward functions. The introduction of Large Language Models (LLMs) has brought stronger reasoning and cognitive abilities to MAS, but existing LLM-based systems struggle to respond quickly and accurately in dynamic environments. To address these challenges, we propose LLM-based Graph Collaboration MARL (LGC-MARL), a framework that efficiently combines LLMs and MARL. This framework decomposes complex tasks into executable subtasks and achieves efficient collaboration among multiple agents through graph-based coordination. Specifically, LGC-MARL consists of two main components: an LLM planner and a graph-based collaboration meta policy. The LLM planner transforms complex task instructions into a series of executable subtasks, evaluates the rationality of these subtasks using a critic model, and generates an action dependency graph. The graph-based collaboration meta policy facilitates communication and collaboration among agents based on the action dependency graph, and adapts to new task environments through meta-learning. Experimental results on the AI2-THOR simulation platform demonstrate the superior performance and scalability of LGC-MARL in completing various complex tasks.
zh

[CV-116] FourierSR: A Fourier Token-based Plugin for Efficient Image Super-Resolution

【速读】：该论文旨在解决图像超分辨率（Super-Resolution, SR）在极低计算成本下提升效率的挑战。传统方法如卷积（Convolution）和基于窗口的Transformer受限于有限的感受野（receptive field），难以有效应对这一问题。为了解决这一局限性，论文提出了一种名为FourierSR的傅里叶令牌插件（Fourier token-based plugin），通过利用傅里叶变换和乘法操作替代传统的令牌混合技术，实现了全局感受野的同时大幅降低了计算复杂度。这种设计避免了现有令牌混合技术作为插件时可能带来的不稳定或低效问题，并且与卷积及基于窗口的Transformer相比，FourierSR的关键创新在于其仅依赖于傅里叶变换和简单的点乘运算，从而在保持高效性的同时显著提升了超分辨率性能。实验结果表明，FourierSR作为即插即用模块，在Manga109数据集x4缩放因子下平均提高了0.34 dB的峰值信噪比（PSNR），而参数量（Params）和浮点运算次数（FLOPs）的增加分别仅为原始模型的0.6%和1.5%。

链接: https://arxiv.org/abs/2503.10043
作者: Wenjie Li,Heng Guo,Yuefeng Hou,Zhanyu Ma
机构: Pattern Recognition and Intelligent System Laboratory, School of Artificial Intelligence, Beijing University of Posts and Telecommunications (BUPT)(北京邮电大学); School of Microelectronics, Tianjin University (TJU)(天津大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Image super-resolution (SR) aims to recover low-resolution images to high-resolution images, where improving SR efficiency is a high-profile challenge. However, commonly used units in SR, like convolutions and window-based Transformers, have limited receptive fields, making it challenging to apply them to improve SR under extremely limited computational cost. To address this issue, inspired by modeling convolution theorem through token mix, we propose a Fourier token-based plugin called FourierSR to improve SR uniformly, which avoids the instability or inefficiency of existing token mix technologies when applied as plug-ins. Furthermore, compared to convolutions and windows-based Transformers, our FourierSR only utilizes Fourier transform and multiplication operations, greatly reducing complexity while having global receptive fields. Experimental results show that our FourierSR as a plug-and-play unit brings an average PSNR gain of 0.34dB for existing efficient SR methods on Manga109 test set at the scale of x4, while the average increase in the number of Params and FLOPs is only 0.6% and 1.5% of original sizes. We will release our codes upon acceptance.
zh

[CV-117] How Do Multimodal Large Language Models Handle Complex Multimodal Reasoning ? Placing Them in An Extensible Escape Game

【速读】：该论文旨在解决现有多模态大型语言模型（MLLMs）评估主要关注最终任务完成，而缺乏对多模态环境中推理过程进行全面且定量分析的问题。这种局限性阻碍了对模型行为及底层推理机制的深入理解。为了解决这一问题，论文提出MM-Escape，这是一个受现实世界逃脱游戏启发的可扩展基准，强调中间模型行为与最终任务完成并重。关键解决方案在于开发EscapeCraft，这是一种可定制且开源的环境，使模型能够进行自由形式的探索以评估多模态推理能力。通过在该环境中开展的广泛实验，研究揭示了不同规模的MLLMs在简单任务中表现出色，但随着任务难度增加，性能显著下降，并且各模型在多模态推理方面存在不同的瓶颈和失效模式。

链接: https://arxiv.org/abs/2503.10042
作者: Ziyue Wang,Yurui Dong,Fuwen Luo,Minyuan Ruan,Zhili Cheng,Chi Chen,Peng Li,Yang Liu
机构: Dept. of Comp. Sci. & Tech., Institute for AI, Tsinghua University (清华大学); Institute for AI Industry Research (AIR), Tsinghua University (清华大学); School of Management, Fudan University (复旦大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:The rapid advancing of Multimodal Large Language Models (MLLMs) has spurred interest in complex multimodal reasoning tasks in the real-world and virtual environment, which require coordinating multiple abilities, including visual perception, visual reasoning, spatial awareness, and target deduction. However, existing evaluations primarily assess the final task completion, often degrading assessments to isolated abilities such as visual grounding and visual question answering. Less attention is given to comprehensively and quantitatively analyzing reasoning process in multimodal environments, which is crucial for understanding model behaviors and underlying reasoning mechanisms beyond merely task success. To address this, we introduce MM-Escape, an extensible benchmark for investigating multimodal reasoning, inspired by real-world escape games. MM-Escape emphasizes intermediate model behaviors alongside final task completion. To achieve this, we develop EscapeCraft, a customizable and open environment that enables models to engage in free-form exploration for assessing multimodal reasoning. Extensive experiments show that MLLMs, regardless of scale, can successfully complete the simplest room escape tasks, with some exhibiting human-like exploration strategies. Yet, performance dramatically drops as task difficulty increases. Moreover, we observe that performance bottlenecks vary across models, revealing distinct failure modes and limitations in their multimodal reasoning abilities, such as repetitive trajectories without adaptive exploration, getting stuck in corners due to poor visual spatial awareness, and ineffective use of acquired props, such as the key. We hope our work sheds light on new challenges in multimodal reasoning, and uncovers potential improvements in MLLMs capabilities.
zh

[CV-118] Investigating and Improving Counter-Stereotypical Action Relation in Text-to-Image Diffusion Models

【速读】：该论文旨在解决文本到图像扩散模型在生成反刻板印象动作关系（如“鼠标追逐猫”）时普遍失败的问题，即使明确提示也倾向于默认常见的刻板印象。研究发现，这一限制源于分布偏差而非模型本身的固有限制。论文的关键洞察是，尽管模型在罕见组合（如上述例子）上表现不佳，但在其反转形式（如“鼠标追逐男孩”）常见的情况下可以成功生成类似的中间组合。基于此，作者开发了一种名为Role-Bridging Decomposition的框架，通过利用这些中间组合逐步教授罕见关系，而无需对模型架构进行修改。此外，还引入了一个名为ActionBench的新基准，专门用于评估基于动作的关系生成在典型和反刻板配置下的性能。实验结果表明，中间组合确实有助于反刻板印象的生成，自动指标和人工评估均显示相对于现有方法有显著改进。这项工作不仅揭示了当前文本到图像系统中存在的基本偏差，还展示了通过组成推理解决这些问题的一个有前景的方向。

链接: https://arxiv.org/abs/2503.10037
作者: Sina Malakouti,Adriana Kovashka
机构: University of Pittsburgh (匹兹堡大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Text-to-image diffusion models consistently fail at generating counter-stereotypical action relationships (e.g., “mouse chasing cat”), defaulting to frequent stereotypes even when explicitly prompted otherwise. Through systematic investigation, we discover this limitation stems from distributional biases rather than inherent model constraints. Our key insight reveals that while models fail on rare compositions when their inversions are common, they can successfully generate similar intermediate compositions (e.g., “mouse chasing boy”). To test this hypothesis, we develop a Role-Bridging Decomposition framework that leverages these intermediates to gradually teach rare relationships without architectural modifications. We introduce ActionBench, a comprehensive benchmark specifically designed to evaluate action-based relationship generation across stereotypical and counter-stereotypical configurations. Our experiments validate that intermediate compositions indeed facilitate counter-stereotypical generation, with both automatic metrics and human evaluations showing significant improvements over existing approaches. This work not only identifies fundamental biases in current text-to-image systems but demonstrates a promising direction for addressing them through compositional reasoning.
zh

[CV-119] V2X-ReaLO: An Open Online Framework and Dataset for Cooperative Perception in Reality

【速读】：本文旨在解决车辆与万物通信（Vehicle-to-Everything, V2X）协同感知在真实场景中的可行性与有效性问题，特别是中间融合方法（Intermediate Fusion）的实际应用挑战。现有研究多基于模拟环境或静态数据集，缺乏对真实动态场景中V2X协同感知性能的验证。为应对这一问题，论文提出了一种名为V2X-ReaLO的开放在线协同感知框架，该框架部署于真实车辆和智能基础设施之上，并整合了早期融合（Early Fusion）、晚期融合（Late Fusion）及中间融合方法于统一管道中，提供了首个在真实世界条件下验证中间融合可行性和性能的实践演示。此外，论文还设计了一个开放基准数据集，扩展了V2X-Real数据集至动态同步ROS Bags格式，包含25,028个测试帧和6,850个标注的关键帧，涵盖复杂城市场景。其关键在于通过实时评估感知精度和通信延迟，为实际应用中的协同感知系统优化设定了新标准。

链接: https://arxiv.org/abs/2503.10034
作者: Hao Xiang,Zhaoliang Zheng,Xin Xia,Seth Z. Zhao,Letian Gao,Zewei Zhou,Tianhui Cai,Yun Zhang,Jiaqi Ma
机构: University of California, Los Angeles (加州大学洛杉矶分校)
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注:

点击查看摘要

Abstract:Cooperative perception enabled by Vehicle-to-Everything (V2X) communication holds significant promise for enhancing the perception capabilities of autonomous vehicles, allowing them to overcome occlusions and extend their field of view. However, existing research predominantly relies on simulated environments or static datasets, leaving the feasibility and effectiveness of V2X cooperative perception especially for intermediate fusion in real-world scenarios largely unexplored. In this work, we introduce V2X-ReaLO, an open online cooperative perception framework deployed on real vehicles and smart infrastructure that integrates early, late, and intermediate fusion methods within a unified pipeline and provides the first practical demonstration of online intermediate fusion’s feasibility and performance under genuine real-world conditions. Additionally, we present an open benchmark dataset specifically designed to assess the performance of online cooperative perception systems. This new dataset extends V2X-Real dataset to dynamic, synchronized ROS bags and provides 25,028 test frames with 6,850 annotated key frames in challenging urban scenarios. By enabling real-time assessments of perception accuracy and communication lantency under dynamic conditions, V2X-ReaLO sets a new benchmark for advancing and optimizing cooperative perception systems in real-world applications. The codes and datasets will be released to further advance the field.
zh

[CV-120] Post-disaster building indoor damage and survivor detection using autonomous path planning and deep learning with unmanned aerial vehicles

【速读】：该论文旨在解决地震等自然灾害后传统人工结构损伤检查与幸存者搜寻效率低、耗时长且存在高风险的问题。解决方案的关键在于提出了一种自主检查方法，结合了自主导航技术、基于深度学习的损伤与幸存者检测算法，以及搭载传感器的定制低成本微型飞行器（Micro Aerial Vehicle, MAV）。实验表明，该方法在伪灾后办公建筑环境中的结构损伤检查与幸存者检测方面实现了高精度，展现出显著提升现有手动检查效率的巨大潜力。

链接: https://arxiv.org/abs/2503.10027
作者: Xiao Pan,Sina Tavasoli,T. Y. Yang,Sina Poorghasem
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注: 10 pages, 9 figures, accepted in the International Association for Bridge and Structural Engineering (IABSE) Symposium 2025, Tokyo, Japan

点击查看摘要

Abstract:Rapid response to natural disasters such as earthquakes is a crucial element in ensuring the safety of civil infrastructures and minimizing casualties. Traditional manual inspection is labour-intensive, time-consuming, and can be dangerous for inspectors and rescue workers. This paper proposed an autonomous inspection approach for structural damage inspection and survivor detection in the post-disaster building indoor scenario, which incorporates an autonomous navigation method, deep learning-based damage and survivor detection method, and a customized low-cost micro aerial vehicle (MAV) with onboard sensors. Experimental studies in a pseudo-post-disaster office building have shown the proposed methodology can achieve high accuracy in structural damage inspection and survivor detection. Overall, the proposed inspection approach shows great potential to improve the efficiency of existing manual post-disaster building inspection.
zh

[CV-121] One-Shot Federated Unsupervised Domain Adaptation with Scaled Entropy Attention and Multi-Source Smoothed Pseudo Labeling

【速读】：该论文旨在解决联邦学习（Federated Learning, FL）在处理领域偏移（domain shift）时面临的挑战，尤其是在每个客户端仅能访问自身源数据且无法在目标域适应过程中共享数据的情况下。此外，传统FL方法通常因多轮模型更新而导致较高的通信开销。为应对这些限制，论文提出了一种一次性联邦无监督领域自适应（Federated Unsupervised Domain Adaptation, FUDA）方法。该方案的关键在于引入缩放熵注意力机制（Scaled Entropy Attention, SEA）用于模型聚合，以及多源伪标签生成（Multi-Source Pseudo Labeling, MSPL）用于目标域适应。SEA通过在目标域上的缩放预测熵来分配更高的注意力权重给可靠性更强的模型，从而提升全局模型质量并确保贡献权重的平衡；MSPL则从多个源模型中蒸馏知识以生成伪标签，并利用平滑软标签交叉熵（Smoothed Soft-Label Cross-Entropy, SSCE）管理噪声标签。此方法不仅在四个标准基准上超越现有最先进方法，同时显著降低了通信和计算成本，使其适用于实际应用。

链接: https://arxiv.org/abs/2503.10020
作者: Ali Abedi,Q. M. Jonathan Wu,Ning Zhang,Farhad Pourpanah
机构: University of Windsor (温莎大学); Queen’s University (皇后大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Federated Learning (FL) is a promising approach for privacy-preserving collaborative learning. However, it faces significant challenges when dealing with domain shifts, especially when each client has access only to its source data and cannot share it during target domain adaptation. Moreover, FL methods often require high communication overhead due to multiple rounds of model updates between clients and the server. We propose a one-shot Federated Unsupervised Domain Adaptation (FUDA) method to address these limitations. Specifically, we introduce Scaled Entropy Attention (SEA) for model aggregation and Multi-Source Pseudo Labeling (MSPL) for target domain adaptation. SEA uses scaled prediction entropy on target domain to assign higher attention to reliable models. This improves the global model quality and ensures balanced weighting of contributions. MSPL distills knowledge from multiple source models to generate pseudo labels and manage noisy labels using smoothed soft-label cross-entropy (SSCE). Our approach outperforms state-of-the-art methods across four standard benchmarks while reducing communication and computation costs, making it highly suitable for real-world applications. The implementation code will be made publicly available upon publication.
zh

[CV-122] Speedy MASt3R

【速读】：该论文旨在解决现代3D视觉算法中图像匹配的高延迟问题，尽管MASt3R在准确性方面表现出色，但其推理速度仍受限于ViT编码器解码器和快速最近邻匹配(FastNN)的计算开销。为了解决这一瓶颈，论文提出了Speedy MASt3R，这是一种后训练优化框架，通过结合多种技术在保持精度的同时提升推理效率。关键在于引入了FlashMatch（利用FlashAttention v2与分块策略）、GraphFusion（层和张量融合以及内核自动调优）以及FastNN-Lite（优化内存访问时间和向量化计算），同时采用混合精度推理(HybridCast)，最终实现了每对图像推理时间从198ms减少到91ms，提升了54%的速度，而未牺牲准确性。

链接: https://arxiv.org/abs/2503.10017
作者: Jingxing Li,Yongjae Lee,Abhay Kumar Yadav,Cheng Peng,Rama Chellappa,Deliang Fan
机构: Arizona State University (亚利桑那州立大学); Johns Hopkins University (约翰斯·霍普金斯大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Image matching is a key component of modern 3D vision algorithms, essential for accurate scene reconstruction and localization. MASt3R redefines image matching as a 3D task by leveraging DUSt3R and introducing a fast reciprocal matching scheme that accelerates matching by orders of magnitude while preserving theoretical guarantees. This approach has gained strong traction, with DUSt3R and MASt3R collectively cited over 250 times in a short span, underscoring their impact. However, despite its accuracy, MASt3R’s inference speed remains a bottleneck. On an A40 GPU, latency per image pair is 198.16 ms, mainly due to computational overhead from the ViT encoder-decoder and Fast Reciprocal Nearest Neighbor (FastNN) matching. To address this, we introduce Speedy MASt3R, a post-training optimization framework that enhances inference efficiency while maintaining accuracy. It integrates multiple optimization techniques, including FlashMatch-an approach leveraging FlashAttention v2 with tiling strategies for improved efficiency, computation graph optimization via layer and tensor fusion having kernel auto-tuning with TensorRT (GraphFusion), and a streamlined FastNN pipeline that reduces memory access time from quadratic to linear while accelerating block-wise correlation scoring through vectorized computation (FastNN-Lite). Additionally, it employs mixed-precision inference with FP16/FP32 hybrid computations (HybridCast), achieving speedup while preserving numerical precision. Evaluated on Aachen Day-Night, InLoc, 7-Scenes, ScanNet1500, and MegaDepth1500, Speedy MASt3R achieves a 54% reduction in inference time (198 ms to 91 ms per image pair) without sacrificing accuracy. This advancement enables real-time 3D understanding, benefiting applications like mixed reality navigation and large-scale 3D scene reconstruction. Subjects: Computer Vision and Pattern Recognition (cs.CV) Cite as: arXiv:2503.10017 [cs.CV] (or arXiv:2503.10017v1 [cs.CV] for this version) https://doi.org/10.48550/arXiv.2503.10017 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh

[CV-123] MetricGrids: Arbitrary Nonlinear Approximation with Elementary Metric Grids based Implicit Neural Representation CVPR2025

【速读】：本文旨在解决现有基于网格的神经表示在处理复杂非线性信号时的局限性问题。传统基于线性索引的特征网格只能提供退化线性潜在空间表示，无法通过紧凑解码器充分补偿以表征复杂的非线性信号。为了解决这一问题，同时保持规则网格结构的简单性，本文提出了一种新颖的基于度量网格（MetricGrids）的方法，通过构建多个初等度量网格作为高阶项，遵循泰勒展开原则来逼近复杂的非线性特性。关键解决方案包括利用基于哈希编码的网格稀疏性增强模型紧凑性以避免有害的哈希冲突，并采用高阶外推解码器减少显式的网格存储需求。实验结果表明，所提出的方法在2D和3D重建任务中具有卓越的拟合和渲染精度，验证了其鲁棒性和通用性。

链接: https://arxiv.org/abs/2503.10000
作者: Shu Wang,Yanbo Gao,Shuai Li,Chong Lv,Xun Cai,Chuankun Li,Hui Yuan,Jinglin Zhang
机构: Shandong University (山东大学); North University of China (中北大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注: accepted by CVPR 2025

点击查看摘要

Abstract:This paper presents MetricGrids, a novel grid-based neural representation that combines elementary metric grids in various metric spaces to approximate complex nonlinear signals. While grid-based representations are widely adopted for their efficiency and scalability, the existing feature grids with linear indexing for continuous-space points can only provide degenerate linear latent space representations, and such representations cannot be adequately compensated to represent complex nonlinear signals by the following compact decoder. To address this problem while keeping the simplicity of a regular grid structure, our approach builds upon the standard grid-based paradigm by constructing multiple elementary metric grids as high-order terms to approximate complex nonlinearities, following the Taylor expansion principle. Furthermore, we enhance model compactness with hash encoding based on different sparsities of the grids to prevent detrimental hash collisions, and a high-order extrapolation decoder to reduce explicit grid storage requirements. experimental results on both 2D and 3D reconstructions demonstrate the superior fitting and rendering accuracy of the proposed method across diverse signal types, validating its robustness and generalizability. Code is available at this https URLthis https URL.
zh

[CV-124] IME: Temporal-sensitive Multi-dimensional Instruction Tuning and Benchmarking for Video-LLM s

【速读】：该论文试图解决视频大型语言模型（Video LLMs）在时间理解（temporal understanding）方面表现欠佳的问题。解决方案的关键在于：首先，构建了一个专门用于指令微调的数据集，以增强时间理解能力在五个关键维度上的表现；其次，提出了一种多任务提示微调方法，通过将时间敏感任务无缝集成到现有的指令数据集中，避免了对昂贵的时间标注的依赖；此外，开发了一个新的时间敏感视频理解基准，不仅填补了现有基准在维度覆盖上的空白，还严格过滤了潜在的捷径，确保更准确的评估。实验结果表明，该方法显著提升了视频-LLMs的时间理解能力，同时避免了对捷径的依赖。

链接: https://arxiv.org/abs/2503.09994
作者: Yunxiao Wang,Meng Liu,Rui Shao,Haoyu Zhang,Bin Wen,Fan Yang,Tingting Gao,Di Zhang,Liqiang Nie
机构: Shandong University (山东大学); Shandong Jianzhu University (山东建筑大学); Harbin Institute of Technology (哈尔滨工业大学); Kuaishou Technology (快手科技)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Video large language models have achieved remarkable performance in tasks such as video question answering, however, their temporal understanding remains suboptimal. To address this limitation, we curate a dedicated instruction fine-tuning dataset that focuses on enhancing temporal comprehension across five key dimensions. In order to reduce reliance on costly temporal annotations, we introduce a multi-task prompt fine-tuning approach that seamlessly integrates temporal-sensitive tasks into existing instruction datasets without requiring additional annotations. Furthermore, we develop a novel benchmark for temporal-sensitive video understanding that not only fills the gaps in dimension coverage left by existing benchmarks but also rigorously filters out potential shortcuts, ensuring a more accurate evaluation. Extensive experimental results demonstrate that our approach significantly enhances the temporal understanding of video-LLMs while avoiding reliance on shortcuts.
zh

[CV-125] Channel-wise Noise Scheduled Diffusion for Inverse Rendering in Indoor Scenes CVPR2025

【速读】：该论文旨在解决单张 RGB 图像逆渲染中的不适定问题，即从图像中分解出几何结构、材质和光照等信息。由于逆渲染问题固有的多解性，生成单一准确解与生成多样化解之间存在冲突。论文的关键解决方案是提出了一种通道级噪声调度方法，使单一扩散模型架构能够同时实现这两种相互冲突的目标。通过为不同通道设置不同的噪声调度策略，训练得到的两个扩散模型分别可以预测高精度的单一解和呈现多样化的可能解。实验结果表明，这两个模型在多样性和准确性方面均优于现有方法，并提升了下游任务（如物体插入和材质编辑）的表现。

链接: https://arxiv.org/abs/2503.09993
作者: JunYong Choi,Min-Cheol Sagong,SeokYeong Lee,Seung-Won Jung,Ig-Jae Kim,Junghyun Cho
机构: Korea Institute of Science and Technology(KIST); Korea University; AI-Robotics, KIST School, University of Science and Technology; Yonsei-KIST Convergence Research Institute
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by CVPR 2025

点击查看摘要

Abstract:We propose a diffusion-based inverse rendering framework that decomposes a single RGB image into geometry, material, and lighting. Inverse rendering is inherently ill-posed, making it difficult to predict a single accurate solution. To address this challenge, recent generative model-based methods aim to present a range of possible solutions. However, finding a single accurate solution and generating diverse solutions can be conflicting. In this paper, we propose a channel-wise noise scheduling approach that allows a single diffusion model architecture to achieve two conflicting objectives. The resulting two diffusion models, trained with different channel-wise noise schedules, can predict a single highly accurate solution and present multiple possible solutions. The experimental results demonstrate the superiority of our two models in terms of both diversity and accuracy, which translates to enhanced performance in downstream applications such as object insertion and material editing.
zh

[CV-126] ES-Parkour: Advanced Robot Parkour with Bio-inspired Event Camera and Spiking Neural Network

【速读】：该论文旨在解决四足机器人在复杂环境中运动控制时面临的两大挑战：一是传统视觉传感器（如深度相机）因低操作频率和对光照敏感性导致的稳定性与鲁棒性不足，限制其在户外环境中的应用；二是深度神经网络带来的高计算需求。为应对这些问题，论文提出的关键解决方案是将尖峰神经网络（Spiking Neural Networks, SNNs）与事件相机结合，利用事件相机捕获动态视觉数据，同时通过SNN高效处理尖峰序列，模拟生物感知机制。实验结果表明，该方法显著优于传统人工神经网络（Artificial Neural Network, ANN）模型，在实现卓越跑酷性能的同时，能耗仅为ANN模型的11.7%，节省了88.3%的能量。这一集成方案不仅推动了机器人强化学习的发展，还为苛刻环境下的应用开辟了新途径。

链接: https://arxiv.org/abs/2503.09985
作者: Qiang Zhang,Jiahang Cao,Jingkai Sun,Gang Han,Wen Zhao,Yijie Guo,Renjing Xu
机构: The Hong Kong University of Science and Technology (Guangzhou)(香港科技大学（广州）); Beijing Innovation Center of Humanoid Robotics Co., Ltd. (人形机器人创新中心有限公司)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:In recent years, quadruped robotics has advanced significantly, particularly in perception and motion control via reinforcement learning, enabling complex motions in challenging environments. Visual sensors like depth cameras enhance stability and robustness but face limitations, such as low operating frequencies relative to joint control and sensitivity to lighting, which hinder outdoor deployment. Additionally, deep neural networks in sensor and control systems increase computational demands. To address these issues, we introduce spiking neural networks (SNNs) and event cameras to perform a challenging quadruped parkour task. Event cameras capture dynamic visual data, while SNNs efficiently process spike sequences, mimicking biological perception. Experimental results demonstrate that this approach significantly outperforms traditional models, achieving excellent parkour performance with just 11.7% of the energy consumption of an artificial neural network (ANN)-based model, yielding an 88.3% energy reduction. By integrating event cameras with SNNs, our work advances robotic reinforcement learning and opens new possibilities for applications in demanding environments.
zh

[CV-127] Uncertainty-aware Long-tailed Weights Model the Utility of Pseudo-labels for Semi-supervised Learning

【速读】：该论文旨在解决当前半监督学习（Semi-supervised Learning, SSL）中基于伪标签（pseudo-labeling）策略的问题，特别是伪标签筛选过程中依赖置信度阈值所带来的挑战。具体而言，论文指出两个主要问题：一是合理设置置信度阈值是一个开放性难题，显著影响高质量伪标签的选择；二是由于标注数据稀缺，深度模型通常表现出过高的置信度现象，导致置信值无法可靠地评估伪标签的质量。为了解决这些问题，论文提出了一种不确定性感知集成结构（Uncertainty-aware Ensemble Structure, UES），其关键在于将伪标签的效用建模为长尾权重，从而避免了设定阈值的难题。这种方法不仅确保了即使不可靠的伪标签也能增强模型的鲁棒性，而且具有轻量级和架构无关的特点，易于扩展到多种计算机视觉任务，如分类和回归。实验结果验证了该方法的有效性。

链接: https://arxiv.org/abs/2503.09974
作者: Jiaqi Wu,Junbiao Pang,Qingming Huang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: arXiv admin note: text overlap with arXiv:2408.04150

点击查看摘要

Abstract:Current Semi-supervised Learning (SSL) adopts the pseudo-labeling strategy and further filters pseudo-labels based on confidence thresholds. However, this mechanism has notable drawbacks: 1) setting the reasonable threshold is an open problem which significantly influences the selection of the high-quality pseudo-labels; and 2) deep models often exhibit the over-confidence phenomenon which makes the confidence value an unreliable indicator for assessing the quality of pseudo-labels due to the scarcity of labeled data. In this paper, we propose an Uncertainty-aware Ensemble Structure (UES) to assess the utility of pseudo-labels for unlabeled samples. We further model the utility of pseudo-labels as long-tailed weights to avoid the open problem of setting the threshold. Concretely, the advantage of the long-tailed weights ensures that even unreliable pseudo-labels still contribute to enhancing the model’s robustness. Besides, UES is lightweight and architecture-agnostic, easily extending to various computer vision tasks, including classification and regression. Experimental results demonstrate that combining the proposed method with DualPose leads to a 3.47% improvement in Percentage of Correct Keypoints (PCK) on the Sniffing dataset with 100 data points (30 labeled), a 7.29% improvement in PCK on the FLIC dataset with 100 data points (50 labeled), and a 3.91% improvement in PCK on the LSP dataset with 200 data points (100 labeled). Furthermore, when combined with FixMatch, the proposed method achieves a 0.2% accuracy improvement on the CIFAR-10 dataset with 40 labeled data points and a 0.26% accuracy improvement on the CIFAR-100 dataset with 400 labeled data points.
zh

[CV-128] Detecting Dataset Bias in Medical AI: A Generalized and Modality-Agnostic Auditing Framework

【速读】：该论文试图解决数据驱动型人工智能在证据驱动医学中的广泛应用所引发的因关联性学习导致的模型偏差及不可预期行为的问题。论文指出，机器学习数据集中潜在的偏差可能在训练过程中被放大或在测试阶段被隐藏，从而影响模型的公平性和可靠性。为应对这一挑战，论文提出了一种模态无关的审计框架——广义属性效用与可检测性诱导偏差测试（Generalized Attribute Utility and Detectability-Induced bias Testing, G-AUDIT），用于生成针对偏差来源的靶向假设。该方法的关键在于通过分析任务级标注与数据属性（如受保护属性和社会环境特征）之间的关系，自动量化观察到的数据属性如何促成捷径学习或隐藏基于虚假关联的预测。论文通过三个不同模态和学习任务验证了G-AUDIT方法的广泛适用性和价值，成功识别出传统定性方法常忽略的细微偏差，为从初始原型设计到监管的整个AI开发生命周期提供了深入理解数据集的新途径，并为减少模型偏差、构建更安全可信的AI系统创造了机会。

链接: https://arxiv.org/abs/2503.09969
作者: Nathan Drenkow,Mitchell Pavlak,Keith Harrigian,Ayah Zirikly,Adarsh Subbaswamy,Mathias Unberath
机构: The Johns Hopkins University (约翰斯·霍普金斯大学); The Johns Hopkins University Applied Physics Laboratory (约翰斯·霍普金斯大学应用物理实验室)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Data-driven AI is establishing itself at the center of evidence-based medicine. However, reports of shortcomings and unexpected behavior are growing due to AI’s reliance on association-based learning. A major reason for this behavior: latent bias in machine learning datasets can be amplified during training and/or hidden during testing. We present a data modality-agnostic auditing framework for generating targeted hypotheses about sources of bias which we refer to as Generalized Attribute Utility and Detectability-Induced bias Testing (G-AUDIT) for datasets. Our method examines the relationship between task-level annotations and data properties including protected attributes (e.g., race, age, sex) and environment and acquisition characteristics (e.g., clinical site, imaging protocols). G-AUDIT automatically quantifies the extent to which the observed data attributes may enable shortcut learning, or in the case of testing data, hide predictions made based on spurious associations. We demonstrate the broad applicability and value of our method by analyzing large-scale medical datasets for three distinct modalities and learning tasks: skin lesion classification in images, stigmatizing language classification in Electronic Health Records (EHR), and mortality prediction for ICU tabular data. In each setting, G-AUDIT successfully identifies subtle biases commonly overlooked by traditional qualitative methods that focus primarily on social and ethical objectives, underscoring its practical value in exposing dataset-level risks and supporting the downstream development of reliable AI systems. Our method paves the way for achieving deeper understanding of machine learning datasets throughout the AI development life-cycle from initial prototyping all the way to regulation, and creates opportunities to reduce model bias, enabling safer and more trustworthy AI systems.
zh

[CV-129] Style Evolving along Chain-of-Thought for Unknown-Domain Object Detection

【速读】：该论文旨在解决单域广义目标检测（Single-Domain Generalized Object Detection, Single-DGOD）任务中，由于训练阶段未见过的目标域数据不可用，现有方法依赖视觉-语言模型的多模态能力并通过单一文本提示（one-step prompt method）估计跨域信息以提升泛化性能的问题。然而，当面对复杂场景（如雨夜混合风格）时，单一文本提示方法的性能表现较弱，原因在于其难以有效合成涉及多种风格组合的信息。为克服这一局限，论文提出了一种名为“Style Evolving along Chain-of-Thought”的新方法，其关键在于通过逐步整合和扩展链式思维中的风格信息，实现风格的连续演化。具体而言，通过逐步优化风格描述并引导多样化的风格演变，该方法能够更准确地模拟不同风格特征，并帮助模型逐渐适应风格间的细微差异，同时暴露模型于更广泛的具有不同数据分布的风格特征中，从而显著提升其在未知域中的泛化能力。实验结果验证了所提方法在五种恶劣天气场景及Real to Art基准上的优越性。

链接: https://arxiv.org/abs/2503.09968
作者: Zihao Zhang,Aming Wu,Yahong Han
机构: College of Intelligence and Computing, Tianjin University (天津大学); School of Electronic Engineering, Xidian University (西安电子科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Recently, a task of Single-Domain Generalized Object Detection (Single-DGOD) is proposed, aiming to generalize a detector to multiple unknown domains never seen before during training. Due to the unavailability of target-domain data, some methods leverage the multimodal capabilities of vision-language models, using textual prompts to estimate cross-domain information, enhancing the model’s generalization capability. These methods typically use a single textual prompt, often referred to as the one-step prompt method. However, when dealing with complex styles such as the combination of rain and night, we observe that the performance of the one-step prompt method tends to be relatively weak. The reason may be that many scenes incorporate not just a single style but a combination of multiple styles. The one-step prompt method may not effectively synthesize combined information involving various styles. To address this limitation, we propose a new method, i.e., Style Evolving along Chain-of-Thought, which aims to progressively integrate and expand style information along the chain of thought, enabling the continual evolution of styles. Specifically, by progressively refining style descriptions and guiding the diverse evolution of styles, this approach enables more accurate simulation of various style characteristics and helps the model gradually learn and adapt to subtle differences between styles. Additionally, it exposes the model to a broader range of style features with different data distributions, thereby enhancing its generalization capability in unseen domains. The significant performance gains over five adverse-weather scenarios and the Real to Art benchmark demonstrate the superiorities of our method.
zh

[CV-130] Modeling Thousands of Human Annotators for Generalizable Text-to-Image Person Re-identification CVPR2025

【速读】：该论文旨在解决文本到图像人物重识别（Text-to-image Person ReID）任务中大规模数据库手动标注成本高、导致ReID模型泛化能力受限的问题。近年来的研究通过利用多模态大型语言模型（Multi-modal Large Language Models, MLLMs）自动生成行人图像描述来缓解此问题，但这些生成的描述风格缺乏多样性。为了解决这一局限性，论文提出了一种人类注释者建模（Human Annotator Modeling, HAM）方法，使MLLMs能够模仿数千名人类注释者的描述风格。解决方案的关键在于首先从人类文本描述中提取风格特征并进行聚类，将具有相似风格的描述归入同一类别；接着使用提示（prompt）表示每个聚类，并通过提示学习模拟不同注释者的描述风格；进一步定义风格特征空间并在其中进行均匀采样以获得更丰富的聚类原型，从而增强生成描述的多样性；最终利用HAM方法自动标注大规模数据库，实验表明该方法显著提升了ReID模型的泛化能力。

链接: https://arxiv.org/abs/2503.09962
作者: Jiayu Jiang,Changxing Ding,Wentao Tan,Junhong Wang,Jin Tao,Xiangmin Xu
机构: South China University of Technology (华南理工大学); Pazhou Lab, Guangzhou (琶洲实验室, 广州)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: CVPR 2025. Project website: this https URL

点击查看摘要

Abstract:Text-to-image person re-identification (ReID) aims to retrieve the images of an interested person based on textual descriptions. One main challenge for this task is the high cost in manually annotating large-scale databases, which affects the generalization ability of ReID models. Recent works handle this problem by leveraging Multi-modal Large Language Models (MLLMs) to describe pedestrian images automatically. However, the captions produced by MLLMs lack diversity in description styles. To address this issue, we propose a Human Annotator Modeling (HAM) approach to enable MLLMs to mimic the description styles of thousands of human annotators. Specifically, we first extract style features from human textual descriptions and perform clustering on them. This allows us to group textual descriptions with similar styles into the same cluster. Then, we employ a prompt to represent each of these clusters and apply prompt learning to mimic the description styles of different human annotators. Furthermore, we define a style feature space and perform uniform sampling in this space to obtain more diverse clustering prototypes, which further enriches the diversity of the MLLM-generated captions. Finally, we adopt HAM to automatically annotate a massive-scale database for text-to-image ReID. Extensive experiments on this database demonstrate that it significantly improves the generalization ability of ReID models.
zh

[CV-131] Exploring Mutual Empowerment Between Wireless Networks and RL-based LLM s: A Survey

【速读】：本文旨在探索基于强化学习（Reinforcement Learning, RL）的大语言模型（Large Language Models, LLMs）与无线网络之间的相互赋能关系，解决如何通过两者的协同优化提升各自性能及应用场景的问题。关键在于实现智能资源分配、自适应网络优化以及实时决策等能力在无线通信系统中的应用，同时利用无线网络基础设施支持RL-based LLMs的高效训练、部署和分布式推理，特别是在去中心化和边缘计算环境中。论文强调通过深入研究两者交互作用的关键动机、开放挑战及潜在解决方案，为下一代智能化通信系统的未来发展提供方向和启示。

链接: https://arxiv.org/abs/2503.09956
作者: Yu Qiao,Phuong-Nam Tran,Ji Su Yoon,Loc X. Nguyen,Choong Seon Hong
机构: Kyung Hee University (庆熙大学)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Emerging Technologies (cs.ET)
备注: 25 pages, 13 figures

点击查看摘要

Abstract:Reinforcement learning (RL)-based large language models (LLMs), such as ChatGPT, DeepSeek, and Grok-3, have gained significant attention for their exceptional capabilities in natural language processing and multimodal data understanding. Meanwhile, the rapid expansion of information services has driven the growing need for intelligence, efficient, and adaptable wireless networks. Wireless networks require the empowerment of RL-based LLMs while these models also benefit from wireless networks to broaden their application scenarios. Specifically, RL-based LLMs can enhance wireless communication systems through intelligent resource allocation, adaptive network optimization, and real-time decision-making. Conversely, wireless networks provide a vital infrastructure for the efficient training, deployment, and distributed inference of RL-based LLMs, especially in decentralized and edge computing environments. This mutual empowerment highlights the need for a deeper exploration of the interplay between these two domains. We first review recent advancements in wireless communications, highlighting the associated challenges and potential solutions. We then discuss the progress of RL-based LLMs, focusing on key technologies for LLM training, challenges, and potential solutions. Subsequently, we explore the mutual empowerment between these two fields, highlighting key motivations, open challenges, and potential solutions. Finally, we provide insights into future directions, applications, and their societal impact to further explore this intersection, paving the way for next-generation intelligent communication systems. Overall, this survey provides a comprehensive overview of the relationship between RL-based LLMs and wireless networks, offering a vision where these domains empower each other to drive innovations.
zh

[CV-132] arget-aware Bidirectional Fusion Transformer for Aerial Object Tracking

【速读】：该论文旨在解决现有基于轻量级神经网络的目标跟踪算法在特征融合阶段仅使用单阶段特征进行状态决策的问题，这些问题限制了跟踪的鲁棒性和精度。论文的关键解决方案是提出了一种新颖的目标感知双向融合Transformer（BFTrans），其核心在于设计了一个基于线性自注意力和交叉注意力的双流融合网络，能够从前向和后向两个方向结合浅层与深层特征，提供用于定位的调整后的局部细节和用于识别的全局语义信息。此外，还提出了目标感知的位置编码策略，以增强在融合阶段对目标相关属性的感知能力。实验结果表明，该方法在多个流行的无人机基准数据集上超越了其他最先进的跟踪器，并在嵌入式平台上实现了平均30.5 FPS的速度，适用于实际无人机部署。

链接: https://arxiv.org/abs/2503.09951
作者: Xinglong Sun,Haijiang Sun,Shan Jiang,Jiacheng Wang,Jiasong Wang
机构: Changchun Institute of Optics, Fine Mechanics and Physics, Chinese Academy of Science (长春光学精密机械与物理研究所, 中国科学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:The trackers based on lightweight neural networks have achieved great success in the field of aerial remote sensing, most of which aggregate multi-stage deep features to lift the tracking quality. However, existing algorithms usually only generate single-stage fusion features for state decision, which ignore that diverse kinds of features are required for identifying and locating the object, limiting the robustness and precision of tracking. In this paper, we propose a novel target-aware Bidirectional Fusion transformer (BFTrans) for UAV tracking. Specifically, we first present a two-stream fusion network based on linear self and cross attentions, which can combine the shallow and the deep features from both forward and backward directions, providing the adjusted local details for location and global semantics for recognition. Besides, a target-aware positional encoding strategy is designed for the above fusion model, which is helpful to perceive the object-related attributes during the fusion phase. Finally, the proposed method is evaluated on several popular UAV benchmarks, including UAV-123, UAV20L and UAVTrack112. Massive experimental results demonstrate that our approach can exceed other state-of-the-art trackers and run with an average speed of 30.5 FPS on embedded platform, which is appropriate for practical drone deployments.
zh

[CV-133] MoFlow: One-Step Flow Matching for Human Trajectory Forecasting via Implicit Maximum Likelihood Estimation based Distillation CVPR2025

【速读】：该论文旨在解决人类轨迹预测问题，即基于过去的人类轨迹和其他上下文线索，预测其未来多模态（multi-modal）运动。论文的关键创新在于提出了一种名为MoFlow的新型运动预测条件流匹配模型。该模型通过设计一种新颖的流匹配损失函数，不仅确保至少一组K-shot未来轨迹的准确性，还鼓励所有K组轨迹的多样性和合理性。此外，利用隐式最大似然估计（IMLE），论文提出了一种仅依赖教师模型样本的新型流模型蒸馏方法。实验结果表明，无论是教师流模型还是经过IMLE蒸馏的学生模型，在真实数据集上的表现均达到当前最优水平，且学生模型在采样速度上比教师模型快100倍。

链接: https://arxiv.org/abs/2503.09950
作者: Yuxiang Fu,Qi Yan,Lele Wang,Ke Li,Renjie Liao
机构: University of British Columbia (英属哥伦比亚大学); Vector Institute for AI (向量人工智能研究所); Simon Fraser University (西蒙弗雷泽大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Accepted to CVPR 2025

点击查看摘要

Abstract:In this paper, we address the problem of human trajectory forecasting, which aims to predict the inherently multi-modal future movements of humans based on their past trajectories and other contextual cues. We propose a novel motion prediction conditional flow matching model, termed MoFlow, to predict K-shot future trajectories for all agents in a given scene. We design a novel flow matching loss function that not only ensures at least one of the K sets of future trajectories is accurate but also encourages all K sets of future trajectories to be diverse and plausible. Furthermore, by leveraging the implicit maximum likelihood estimation (IMLE), we propose a novel distillation method for flow models that only requires samples from the teacher model. Extensive experiments on the real-world datasets, including SportVU NBA games, ETH-UCY, and SDD, demonstrate that both our teacher flow model and the IMLE-distilled student model achieve state-of-the-art performance. These models can generate diverse trajectories that are physically and socially plausible. Moreover, our one-step student model is \textbf100 times faster than the teacher flow model during sampling. The code, model, and data are available at our project page: this https URL
zh

[CV-134] UVE: Are MLLM s Unified Evaluators for AI-Generated Videos?

【速读】：该论文试图解决视频生成模型（VGMs）快速发展背景下，缺乏可靠且全面的自动指标来评估人工智能生成视频（AIGVs）的问题。现有方法要么依赖于针对其他任务优化的现成模型，要么基于人工评估数据训练专用评估器，这些方法受限于特定的评估方面，难以随着更精细和全面的评估需求而扩展。为了解决这一问题，论文探索了利用多模态大型语言模型（MLLMs）作为AIGVs的统一评估器的可行性，充分利用其强大的视觉感知和语言理解能力。解决方案的关键在于引入了一个名为UVE-Bench的新基准数据集，用于评估自动指标在统一AIGV评估中的性能，并通过深入分析影响MLLM驱动评估器性能的关键设计选择，为未来AIGV评估研究提供了有价值的见解。

链接: https://arxiv.org/abs/2503.09949
作者: Yuanxin Liu,Rui Zhu,Shuhuai Ren,Jiacong Wang,Haoyuan Guo,Xu Sun,Lu Jiang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:With the rapid growth of video generative models (VGMs), it is essential to develop reliable and comprehensive automatic metrics for AI-generated videos (AIGVs). Existing methods either use off-the-shelf models optimized for other tasks or rely on human assessment data to train specialized evaluators. These approaches are constrained to specific evaluation aspects and are difficult to scale with the increasing demands for finer-grained and more comprehensive evaluations. To address this issue, this work investigates the feasibility of using multimodal large language models (MLLMs) as a unified evaluator for AIGVs, leveraging their strong visual perception and language understanding capabilities. To evaluate the performance of automatic metrics in unified AIGV evaluation, we introduce a benchmark called UVE-Bench. UVE-Bench collects videos generated by state-of-the-art VGMs and provides pairwise human preference annotations across 15 evaluation aspects. Using UVE-Bench, we extensively evaluate 16 MLLMs. Our empirical results suggest that while advanced MLLMs (e.g., Qwen2VL-72B and InternVL2.5-78B) still lag behind human evaluators, they demonstrate promising ability in unified AIGV evaluation, significantly surpassing existing specialized evaluation methods. Additionally, we conduct an in-depth analysis of key design choices that impact the performance of MLLM-driven evaluators, offering valuable insights for future research on AIGV evaluation. The code is available at this https URL.
zh

[CV-135] Cosh-DiT: Co-Speech Gesture Video Synthesis via Hybrid Audio-Visual Diffusion Transformers

【速读】：该论文旨在解决语音驱动的手势视频合成这一具有挑战性的任务，具体目标是通过概率建模实现人类手势的自然表达，并生成与语音节奏细节相匹配的逼真图像。为了解决这些问题，论文提出了一种名为Cosh-DiT的系统，它结合了混合扩散Transformer模型，分别利用离散和连续扩散建模进行音频到运动以及运动到视频的合成。

解决方案的关键在于引入了一个音频扩散Transformer（Cosh-DiT-A），用于生成与语音节奏同步的富有表现力的手势动态。为了捕捉上半身、面部及手部动作的先验信息，采用了向量量化变分自编码器（VQ-VAEs）来在离散潜空间中联合学习这些依赖关系。此外，设计了一个视觉扩散Transformer（Cosh-DiT-V），以有效整合空间和时间上下文，从而实现基于生成的语音驱动运动的真实感视频合成。实验结果表明，该框架能够持续生成具有生动面部表情和自然流畅手势且与语音完美契合的逼真视频。

链接: https://arxiv.org/abs/2503.09942
作者: Yasheng Sun,Zhiliang Xu,Hang Zhou,Jiazhi Guan,Quanwei Yang,Kaisiyuan Wang,Borong Liang,Yingying Li,Haocheng Feng,Jingdong Wang,Ziwei Liu,Koike Hideki
机构: Baidu(百度); Tsinghua University(清华大学); University of Science and Technology of China(中国科学技术大学); Nanyang Technological University(南洋理工大学); Tokyo Institute of Technology(东京工业大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project Page: this https URL

点击查看摘要

Abstract:Co-speech gesture video synthesis is a challenging task that requires both probabilistic modeling of human gestures and the synthesis of realistic images that align with the rhythmic nuances of speech. To address these challenges, we propose Cosh-DiT, a Co-speech gesture video system with hybrid Diffusion Transformers that perform audio-to-motion and motion-to-video synthesis using discrete and continuous diffusion modeling, respectively. First, we introduce an audio Diffusion Transformer (Cosh-DiT-A) to synthesize expressive gesture dynamics synchronized with speech rhythms. To capture upper body, facial, and hand movement priors, we employ vector-quantized variational autoencoders (VQ-VAEs) to jointly learn their dependencies within a discrete latent space. Then, for realistic video synthesis conditioned on the generated speech-driven motion, we design a visual Diffusion Transformer (Cosh-DiT-V) that effectively integrates spatial and temporal contexts. Extensive experiments demonstrate that our framework consistently generates lifelike videos with expressive facial expressions and natural, smooth gestures that align seamlessly with speech.
zh

[CV-136] GP: Two-modal occupancy prediction with 3D Gaussian and sparse points for 3D Environment Awareness

【速读】：本文旨在解决现有三维语义占用预测任务中基于体素或点云的方法所面临的挑战：体素化方法易导致空间信息损失，而点云方法虽保留空间位置信息较好，但在表达体积结构细节方面存在局限。为应对这一问题，论文提出了一种基于三维高斯集合与稀疏点的双模态预测方法，通过平衡空间位置与体积结构信息，实现更精准的语义占用预测。关键在于采用基于Transformer的架构，利用多层结构增强查询与三维高斯集合共同作用于语义占用预测，并引入自适应融合机制整合两种模态的语义输出以生成最终结果，同时通过动态细化每一层的点云进一步提升定位精度。

链接: https://arxiv.org/abs/2503.09941
作者: Mu Chen,Wenyu Chen,Mingchuan Yang,Yuan Zhang,Tao Han,Xinchi Li,Yunlong Li,Huaici Zhao
机构: China Telecom Research Institute (中国电信研究院), Beijing 102200, China; Key Laboratory of Opto-Electronic Information Processing, Shenyang Institute of Automation, Chinese Academy of Sciences (中国科学院沈阳自动化研究所光电信息处理重点实验室), Shenyang 110016, China; University of Chinese Academy of Sciences (中国科学院大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:3D semantic occupancy has rapidly become a research focus in the fields of robotics and autonomous driving environment perception due to its ability to provide more realistic geometric perception and its closer integration with downstream tasks. By performing occupancy prediction of the 3D space in the environment, the ability and robustness of scene understanding can be effectively improved. However, existing occupancy prediction tasks are primarily modeled using voxel or point cloud-based approaches: voxel-based network structures often suffer from the loss of spatial information due to the voxelization process, while point cloud-based methods, although better at retaining spatial location information, face limitations in representing volumetric structural details. To address this issue, we propose a dual-modal prediction method based on 3D Gaussian sets and sparse points, which balances both spatial location and volumetric structural information, achieving higher accuracy in semantic occupancy prediction. Specifically, our method adopts a Transformer-based architecture, taking 3D Gaussian sets, sparse points, and queries as inputs. Through the multi-layer structure of the Transformer, the enhanced queries and 3D Gaussian sets jointly contribute to the semantic occupancy prediction, and an adaptive fusion mechanism integrates the semantic outputs of both modalities to generate the final prediction results. Additionally, to further improve accuracy, we dynamically refine the point cloud at each layer, allowing for more precise location information during occupancy prediction. We conducted experiments on the Occ3DnuScenes dataset, and the experimental results demonstrate superior performance of the proposed method on IoU based metrics.
zh

[CV-137] PanoGen: Domain-Adapted Text-Guided Panoramic Environment Generation for Vision-and-Language Navigation

【速读】：该论文致力于解决视觉与语言导航（Vision-and-Language Navigation, VLN）任务中训练数据匮乏的问题。为应对这一挑战，论文提出了一种名为PanoGen++的新框架，其关键在于通过结合预训练扩散模型与领域特定微调，利用参数高效技术如低秩适应（low-rank adaptation），在计算成本可控的前提下生成多样化且相关的全景环境。该框架研究了两种环境生成方法：掩码图像修复（masked image inpainting）和递归图像扩展（recursive image outpainting）。前者基于文本描述修复掩码区域以最大化新环境的创建，后者则有助于提升代理在全景图中学习空间关系的能力。实验结果表明，PanoGen++显著提升了多个VLN数据集（包括R2R、R4R和CVDN）上的性能，体现了其在增强训练环境多样性和相关性方面的有效性。

链接: https://arxiv.org/abs/2503.09938
作者: Sen Wang,Dongliang Zhou,Liang Xie,Chao Xu,Ye Yan,Erwei Yin
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM); Robotics (cs.RO)
备注: This paper was accepted by Neural Networks

点击查看摘要

Abstract:Vision-and-language navigation (VLN) tasks require agents to navigate three-dimensional environments guided by natural language instructions, offering substantial potential for diverse applications. However, the scarcity of training data impedes progress in this field. This paper introduces PanoGen++, a novel framework that addresses this limitation by generating varied and pertinent panoramic environments for VLN tasks. PanoGen++ incorporates pre-trained diffusion models with domain-specific fine-tuning, employing parameter-efficient techniques such as low-rank adaptation to minimize computational costs. We investigate two settings for environment generation: masked image inpainting and recursive image outpainting. The former maximizes novel environment creation by inpainting masked regions based on textual descriptions, while the latter facilitates agents’ learning of spatial relationships within panoramas. Empirical evaluations on room-to-room (R2R), room-for-room (R4R), and cooperative vision-and-dialog navigation (CVDN) datasets reveal significant performance enhancements: a 2.44% increase in success rate on the R2R test leaderboard, a 0.63% improvement on the R4R validation unseen set, and a 0.75-meter enhancement in goal progress on the CVDN validation unseen set. PanoGen++ augments the diversity and relevance of training environments, resulting in improved generalization and efficacy in VLN tasks.
zh

[CV-138] Emotion Recognition with CLIP and Sequential Learning

【速读】：本文旨在解决情感计算领域中的三个关键挑战：Valence-Arousal (VA) 估计、表情识别以及 Action Unit (AU) 检测。这些问题均在第8届野生环境下情感行为分析（ABAW）工作坊与竞赛框架下提出。论文的关键创新在于提出了一种新颖的框架，通过微调 CLIP 模型结合 aff-wild2 数据集（该数据集包含标注的表情标签），显著提升了连续情感识别的鲁棒性。此外，引入Temporal Convolutional Network (TCN) 和 Transformer Encoder 模块进一步增强了系统性能，使模型能够更高效且精准地识别人类情绪，从而超越基线表现。

链接: https://arxiv.org/abs/2503.09929
作者: Weiwei Zhou,Chenkun Ling,Zefeng Cai
机构: China Telecom Cloud (中国电信云)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Human emotion recognition plays a crucial role in facilitating seamless interactions between humans and computers. In this paper, we present our innovative methodology for tackling the Valence-Arousal (VA) Estimation Challenge, the Expression Recognition Challenge, and the Action Unit (AU) Detection Challenge, all within the framework of the 8th Workshop and Competition on Affective Behavior Analysis in-the-wild (ABAW). Our approach introduces a novel framework aimed at enhancing continuous emotion recognition. This is achieved by fine-tuning the CLIP model with the aff-wild2 dataset, which provides annotated expression labels. The result is a fine-tuned model that serves as an efficient visual feature extractor, significantly improving its robustness. To further boost the performance of continuous emotion recognition, we incorporate Temporal Convolutional Network (TCN) modules alongside Transformer Encoder modules into our system architecture. The integration of these advanced components allows our model to outperform baseline performance, demonstrating its ability to recognize human emotions with greater accuracy and efficiency. Subjects: Computer Vision and Pattern Recognition (cs.CV) Cite as: arXiv:2503.09929 [cs.CV] (or arXiv:2503.09929v1 [cs.CV] for this version) https://doi.org/10.48550/arXiv.2503.09929 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh

[CV-139] VideoMerge: Towards Training-free Long Video Generation

【速读】：该论文试图解决长视频生成中存在的两大核心问题：一是现有基于扩散模型的方法在训练数据标注和计算资源上的高昂成本，以及其固定的空间和时间维度限制；二是长视频生成面临的质量挑战，包括平滑性、一致性及动态内容表达等问题。论文提出的关键解决方案是VideoMerge，这是一种无需额外训练的方法，通过无缝融合由预训练文本到视频扩散模型生成的短片段来扩展视频长度。其关键在于利用预训练模型的优势，通过协作的正交策略，在保持模型原有表达性和一致性的同时，实现用户指定的更长持续时间和动态变化，从而显著提升长视频生成的质量。

链接: https://arxiv.org/abs/2503.09926
作者: Siyang Zhang,Harry Yang,Ser-Nam Lim
机构: University of Central Florida (中佛罗里达大学); HKUST (香港科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Long video generation remains a challenging and compelling topic in computer vision. Diffusion based models, among the various approaches to video generation, have achieved state of the art quality with their iterative denoising procedures. However, the intrinsic complexity of the video domain renders the training of such diffusion models exceedingly expensive in terms of both data curation and computational resources. Moreover, these models typically operate on a fixed noise tensor that represents the video, resulting in predetermined spatial and temporal dimensions. Although several high quality open-source pretrained video diffusion models, jointly trained on images and videos of varying lengths and resolutions, are available, it is generally not recommended to specify a video length at inference that was not included in the training set. Consequently, these models are not readily adaptable to the direct generation of longer videos by merely increasing the specified video length. In addition to feasibility challenges, long-video generation also encounters quality issues. The domain of long videos is inherently more complex than that of short videos: extended durations introduce greater variability and necessitate long-range temporal consistency, thereby increasing the overall difficulty of the task. We propose VideoMerge, a training-free method that can be seamlessly adapted to merge short videos generated by pretrained text-to-video diffusion model. Our approach preserves the model’s original expressiveness and consistency while allowing for extended duration and dynamic variation as specified by the user. By leveraging the strengths of pretrained models, our method addresses challenges related to smoothness, consistency, and dynamic content through orthogonal strategies that operate collaboratively to achieve superior quality.
zh

[CV-140] CleverDistiller: Simple and Spatially Consistent Cross-modal Distillation

【速读】：该论文旨在解决将基于视觉的基础模型（Vision Foundation Models, VFMs）如DINO所提取的强大通用特征，通过自监督跨模态知识蒸馏（Knowledge Distillation, KD）有效转移到基于3D激光雷达（LiDAR）模型中的问题。现有方法通常依赖复杂的蒸馏损失函数、伪语义图或仅限于语义分割任务的知识迁移，限制了其适用性和效率。

解决方案的关键在于提出了一种名为CleverDistiller的自监督跨模态2D到3D知识蒸馏框架。该框架采用简单而有效的设计选择：与依赖复杂对比损失的方法不同，CleverDistiller利用直接特征相似性损失，并结合一个多层感知机（Multi-Layer Perceptron, MLP）投影头，使3D网络能够在投影过程中学习复杂的语义依赖关系。此外，该方法不依赖伪语义图，从而可以直接从VFM进行知识迁移，无需显式的语义监督。同时，引入辅助的自监督空间任务——占用预测（Occupancy Prediction），以增强通过知识蒸馏获得的语义知识，并赋予其三维空间推理能力。实验结果表明，CleverDistiller在标准自动驾驶数据集上的语义分割和3D目标检测（3D Object Detection, 3DOD）任务中实现了最先进的性能，尤其在极低标注数据量情况下表现尤为突出，验证了其简单但强大的知识蒸馏策略的有效性。

链接: https://arxiv.org/abs/2503.09878
作者: Hariprasath Govindarajan,Maciej K. Wozniak,Marvin Klingner,Camille Maurice,B Ravi Kiran,Senthil Yogamani
机构: Qualcomm Arriver Software GmbH (高通Arriver软件有限公司); Qualcomm Auto Ltd Sweden Filial and Linköping University (瑞典高通汽车有限公司及林雪平大学); Qualcomm France, S.A.R.L. (高通法国有限责任公司); Qualcomm Technologies, Inc. (高通技术公司); KTH Royal Institute of Technology (瑞典皇家理工学院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
备注:

点击查看摘要

Abstract:Vision foundation models (VFMs) such as DINO have led to a paradigm shift in 2D camera-based perception towards extracting generalized features to support many downstream tasks. Recent works introduce self-supervised cross-modal knowledge distillation (KD) as a way to transfer these powerful generalization capabilities into 3D LiDAR-based models. However, they either rely on highly complex distillation losses, pseudo-semantic maps, or limit KD to features useful for semantic segmentation only. In this work, we propose CleverDistiller, a self-supervised, cross-modal 2D-to-3D KD framework introducing a set of simple yet effective design choices: Unlike contrastive approaches relying on complex loss design choices, our method employs a direct feature similarity loss in combination with a multi layer perceptron (MLP) projection head to allow the 3D network to learn complex semantic dependencies throughout the projection. Crucially, our approach does not depend on pseudo-semantic maps, allowing for direct knowledge transfer from a VFM without explicit semantic supervision. Additionally, we introduce the auxiliary self-supervised spatial task of occupancy prediction to enhance the semantic knowledge, obtained from a VFM through KD, with 3D spatial reasoning capabilities. Experiments on standard autonomous driving benchmarks for 2D-to-3D KD demonstrate that CleverDistiller achieves state-of-the-art performance in both semantic segmentation and 3D object detection (3DOD) by up to 10% mIoU, especially when fine tuning on really low data amounts, showing the effectiveness of our simple yet powerful KD strategy
zh

[CV-141] FDCT: Frequency-Aware Decomposition and Cross-Modal Token-Alignment for Multi-Sensor Target Classification

【速读】：该论文旨在解决自动目标识别（ATR）系统中因环境条件、CMOS芯片噪声、遮挡、视差及传感器错位等因素导致的传感器无法捕获判别性细粒度特征的问题。同时，多模态图像传感器存在域差距与粒度差异，并且由于复杂的背景杂波、光照变化及不受控的传感器设置可能导致图像数据的错位。论文的关键解决方案在于提出了一种方法，通过分解、对齐和融合多传感器图像数据进行目标分类。具体而言，提取每组传感器数据的域特定特征与域不变特征，并构建共享统一离散标记（Shared Unified Discrete Token, SUDT）空间以减小域差距与粒度差异。此外，设计了一个对齐模块，用于克服多传感器间的错位问题，强调SUDT空间的判别表示，并引入稀疏性约束以增强跨模态表示能力与鲁棒性。实验结果表明，该方法在四种多传感器ATR数据集上的分类性能优于单模态分类器及多种最先进的多模态融合算法。

链接: https://arxiv.org/abs/2503.09873
作者: Shoaib Meraj Sami,Md Mahedi Hasan,Nasser M. Nasrabadi,Raghuveer Rao
机构: LCSEE Dept., West Virginia University (西弗吉尼亚大学); Army Research Laboratory (陆军研究实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 12 pages Accepted in the IEEE Transactions on Aerospace and Electronic Systems

点击查看摘要

Abstract:In automatic target recognition (ATR) systems, sensors may fail to capture discriminative, fine-grained detail features due to environmental conditions, noise created by CMOS chips, occlusion, parallaxes, and sensor misalignment. Therefore, multi-sensor image fusion is an effective choice to overcome these constraints. However, multi-modal image sensors are heterogeneous and have domain and granularity gaps. In addition, the multi-sensor images can be misaligned due to intricate background clutters, fluctuating illumination conditions, and uncontrolled sensor settings. In this paper, to overcome these issues, we decompose, align, and fuse multiple image sensor data for target classification. We extract the domain-specific and domain-invariant features from each sensor data. We propose to develop a shared unified discrete token (UDT) space between sensors to reduce the domain and granularity gaps. Additionally, we develop an alignment module to overcome the misalignment between multi-sensors and emphasize the discriminative representation of the UDT space. In the alignment module, we introduce sparsity constraints to provide a better cross-modal representation of the UDT space and robustness against various sensor settings. We achieve superior classification performance compared to single-modality classifiers and several state-of-the-art multi-modal fusion algorithms on four multi-sensor ATR datasets.
zh

[CV-142] LuciBot: Automated Robot Policy Learning from Generated Videos

【速读】：该论文旨在解决自动为具身任务生成训练监督的问题，传统方法依赖于大型语言模型（Large Language Models, LLMs）或视觉-语言模型（Vision-Language Models, VLMs），但这些方法在处理复杂任务时受限，因为LLMs难以解析压缩为文本或代码的复杂场景，而基于VLM的奖励机制尽管在视觉感知方面表现更好，但其输出表达能力有限。论文的关键解决方案在于利用通用视频生成模型的想象能力，通过初始仿真帧和文本任务描述生成演示任务完成过程的视频，并从中提取丰富的监督信号，如6D物体位姿序列、2D分割图以及深度估计，以提升复杂具身任务的仿真学习质量，从而实现大规模仿真训练。

链接: https://arxiv.org/abs/2503.09871
作者: Xiaowen Qiu,Yian Wang,Jiting Cai,Zhehuan Chen,Chunru Lin,Tsun-Hsuan Wang,Chuang Gan
机构: Umass Amherst (马萨诸塞大学阿默斯特分校); Shanghai Jiao Tong University (上海交通大学); MIT (麻省理工学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Automatically generating training supervision for embodied tasks is crucial, as manual designing is tedious and not scalable. While prior works use large language models (LLMs) or vision-language models (VLMs) to generate rewards, these approaches are largely limited to simple tasks with well-defined rewards, such as pick-and-place. This limitation arises because LLMs struggle to interpret complex scenes compressed into text or code due to their restricted input modality, while VLM-based rewards, though better at visual perception, remain limited by their less expressive output modality. To address these challenges, we leverage the imagination capability of general-purpose video generation models. Given an initial simulation frame and a textual task description, the video generation model produces a video demonstrating task completion with correct semantics. We then extract rich supervisory signals from the generated video, including 6D object pose sequences, 2D segmentations, and estimated depth, to facilitate task learning in simulation. Our approach significantly improves supervision quality for complex embodied tasks, enabling large-scale training in simulators.
zh

[CV-143] Object-Aware DINO (Oh-A-Dino): Enhancing Self-Supervised Representations for Multi-Object Instance Retrieval

【速读】：该论文旨在解决多目标实例检索中全局场景理解和细粒度对象表征之间的不匹配问题。传统基于槽(slot)的方法在捕获对象级细节方面表现不佳，而自监督模型如DINO虽能捕捉全局场景特征，但难以区分个体对象属性。为解决这一问题，论文提出了一种结合全局与局部特征的方法，通过将DINO表示与从变分自编码器(Variational Autoencoder, VAE)学习到的对象中心潜在向量相结合，其中VAE训练数据来源于从DINO特征中提取的分割图像块。这种方法的关键在于利用VAE生成的对象中心潜在向量补充DINO在对象级别细节上的不足，从而提升多目标实例检索性能，且无需对整个模型进行重新训练。

链接: https://arxiv.org/abs/2503.09867
作者: Stefan Sylvius Wagner,Stefan Harmeling
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Object-centric learning is fundamental to human vision and crucial for models requiring complex reasoning. Traditional approaches rely on slot-based bottlenecks to learn object properties explicitly, while recent self-supervised vision models like DINO have shown emergent object understanding. However, DINO representations primarily capture global scene features, often confounding individual object attributes. We investigate the effectiveness of DINO representations and slot-based methods for multi-object instance retrieval. Our findings reveal that DINO representations excel at capturing global object attributes such as object shape and size, but struggle with object-level details like colour, whereas slot-based representations struggle at both global and object-level understanding. To address this, we propose a method that combines global and local features by augmenting DINO representations with object-centric latent vectors from a Variational Autoencoder trained on segmented image patches that are extracted from the DINO features. This approach improves multi-object instance retrieval performance, bridging the gap between global scene understanding and fine-grained object representation without requiring full model retraining.
zh

[CV-144] Leverag ing Semantic Attribute Binding for Free-Lunch Color Control in Diffusion Models

【速读】：该论文致力于解决文本到图像（Text-to-Image, T2I）扩散模型中精确颜色指定的根本挑战。现有方法如ColorPeel依赖于模型个性化（model personalization），需要额外的优化过程，从而限制了任意颜色指定的灵活性。论文的关键创新在于提出了一种名为ColorWave的训练-free方法，能够在不进行微调（fine-tuning）的情况下实现扩散模型中的RGB级别颜色控制。其核心解决方案在于系统性分析IP-Adapter中的交叉注意力机制（cross-attention mechanism），揭示了文本颜色描述符与参考图像特征之间的隐式绑定关系，并通过重新连接这些绑定来强制实现精确的颜色属性分配，同时保持预训练模型的生成能力。这种方法在生成质量和多样性方面表现优异，并在多种物体类别上超越了先前方法的准确性和适用性。

链接: https://arxiv.org/abs/2503.09864
作者: Héctor Laria,Alexandra Gomez-Villa,Jiang Qin,Muhammad Atif Butt,Bogdan Raducanu,Javier Vazquez-Corral,Joost van de Weijer,Kai Wang
机构: Computer Vision Center (计算机视觉中心), Spain; Universitat Autònoma de Barcelona (巴塞罗那自治大学), Spain; Universitat de València (瓦伦西亚大学), Spain; Harbin Institute of Technology (哈尔滨工业大学), China
类目: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Project page: this https URL

点击查看摘要

Abstract:Recent advances in text-to-image (T2I) diffusion models have enabled remarkable control over various attributes, yet precise color specification remains a fundamental challenge. Existing approaches, such as ColorPeel, rely on model personalization, requiring additional optimization and limiting flexibility in specifying arbitrary colors. In this work, we introduce ColorWave, a novel training-free approach that achieves exact RGB-level color control in diffusion models without fine-tuning. By systematically analyzing the cross-attention mechanisms within IP-Adapter, we uncover an implicit binding between textual color descriptors and reference image features. Leveraging this insight, our method rewires these bindings to enforce precise color attribution while preserving the generative capabilities of pretrained models. Our approach maintains generation quality and diversity, outperforming prior methods in accuracy and applicability across diverse object categories. Through extensive evaluations, we demonstrate that ColorWave establishes a new paradigm for structured, color-consistent diffusion-based image synthesis.
zh

[CV-145] Foundation X: Integrating Classification Localization and Segmentation through Lock-Release Pretraining Strategy for Chest X-ray Analysis WACV2025

【速读】：该论文致力于解决医疗影像领域深度学习模型训练中因标注数据不足及专家级注释异质性带来的挑战。为充分利用多个公开数据集中的多样化标注信息，同时应对分类、定位和分割等任务间的注释差异，论文提出了一种名为nFoundation X的端到端框架。其关键创新在于Lock-Release预训练策略与师生学习范式的结合，通过在多数据集间实现循环学习，确保模型获取全面的通用知识，避免过拟合于单一任务。实验表明，基于11个胸部X光数据集训练的模型，在跨数据集和跨任务学习方面表现优异，并显著提升了器官定位与分割性能。所有代码和预训练模型均可公开获取。

链接: https://arxiv.org/abs/2503.09860
作者: Nahid Ul Islam,DongAo Ma,Jiaxuan Pang,Shivasakthi Senthil Velan,Michael Gotway,Jianming Liang
机构: Arizona State University (亚利桑那州立大学), USA; Mayo Clinic (梅奥诊所), USA
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by WACV 2025

点击查看摘要

Abstract:Developing robust and versatile deep-learning models is essential for enhancing diagnostic accuracy and guiding clinical interventions in medical imaging, but it requires a large amount of annotated data. The advancement of deep learning has facilitated the creation of numerous medical datasets with diverse expert-level annotations. Aggregating these datasets can maximize data utilization and address the inadequacy of labeled data. However, the heterogeneity of expert-level annotations across tasks such as classification, localization, and segmentation presents a significant challenge for learning from these datasets. To this end, we introduce nFoundation X, an end-to-end framework that utilizes diverse expert-level annotations from numerous public datasets to train a foundation model capable of multiple tasks including classification, localization, and segmentation. To address the challenges of annotation and task heterogeneity, we propose a Lock-Release pretraining strategy to enhance the cyclic learning from multiple datasets, combined with the student-teacher learning paradigm, ensuring the model retains general knowledge for all tasks while preventing overfitting to any single task. To demonstrate the effectiveness of Foundation X, we trained a model using 11 chest X-ray datasets, covering annotations for classification, localization, and segmentation tasks. Our experimental results show that Foundation X achieves notable performance gains through extensive annotation utilization, excels in cross-dataset and cross-task learning, and further enhances performance in organ localization and segmentation tasks. All code and pretrained models are publicly accessible at this https URL.
zh

[CV-146] Exploring Position Encoding in Diffusion U-Net for Training-free High-resolution Image Generation ICML2025

【速读】：该论文致力于解决通过预训练U-Net去噪高分辨率潜在表示时产生的重复且无序图像模式的问题。尽管近期研究尝试通过在原始与更高分辨率之间对齐去噪过程来提升生成质量，但生成效果不佳的根本原因仍未被充分探索。作者通过对U-Net中的位置编码进行综合分析，将其归因于位置信息从零填充传播到卷积层潜在特征时的不足，导致分辨率增加时位置编码不一致。为了解决这一问题，提出了一种无需训练的创新方法——渐进边界补全（Progressive Boundary Complement, PBC）方法。该方法在特征图内部创建动态虚拟图像边界，以增强位置信息的传播，从而实现高质量且内容丰富的高分辨率图像合成。大量实验验证了所提方法的优越性。

链接: https://arxiv.org/abs/2503.09830
作者: Feng Zhou,Pu Cao,Yiyang Ma,Lu Yang,Jianqin Yin
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Submitted to ICML 2025

点击查看摘要

Abstract:Denoising higher-resolution latents via a pre-trained U-Net leads to repetitive and disordered image patterns. Although recent studies make efforts to improve generative quality by aligning denoising process across original and higher resolutions, the root cause of suboptimal generation is still lacking exploration. Through comprehensive analysis of position encoding in U-Net, we attribute it to inconsistent position encoding, sourced by the inadequate propagation of position information from zero-padding to latent features in convolution layers as resolution increases. To address this issue, we propose a novel training-free approach, introducing a Progressive Boundary Complement (PBC) method. This method creates dynamic virtual image boundaries inside the feature map to enhance position information propagation, enabling high-quality and rich-content high-resolution image synthesis. Extensive experiments demonstrate the superiority of our method.
zh

[CV-147] Resolution Invariant Autoencoder MICCAI

【速读】：该论文旨在解决医学影像分析中图像分辨率变化这一被忽视的挑战。传统方法通过重采样图像来应对这一问题，但会导致信息丢失或计算效率低下。尽管针对特定任务存在一些解决方案，但尚未提出统一的方法。论文的关键在于引入了一种分辨率不变的自动编码器，其通过在每个网络层实现基于学习的可变缩放过程，取代了传统以2为固定因子的空间下/上采样操作，从而确保一致的潜在空间分辨率，无论输入或输出的分辨率如何。这种方法使得下游任务能够在图像潜在表示上进行处理，同时在不同分辨率下保持性能，克服了传统方法的不足。论文展示了该模型在不确定性感知超分辨率、分类和生成建模等任务中的有效性，并证明了其相比传统基线方法在不同分辨率下的性能损失极小。

链接: https://arxiv.org/abs/2503.09828
作者: Ashay Patel,Michela Antonelli,Sebastien Ourselin,M. Jorge Cardoso
机构: School of Biomedical Engineering and Imaging Sciences (生物医学工程与成像科学学院), King’s College London (国王学院伦敦)
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注: 6 pages, 3 figures, preprint of paper submitted to MICCAI conference

点击查看摘要

Abstract:Deep learning has significantly advanced medical imaging analysis, yet variations in image resolution remain an overlooked challenge. Most methods address this by resampling images, leading to either information loss or computational inefficiencies. While solutions exist for specific tasks, no unified approach has been proposed. We introduce a resolution-invariant autoencoder that adapts spatial resizing at each layer in the network via a learned variable resizing process, replacing fixed spatial down/upsampling at the traditional factor of 2. This ensures a consistent latent space resolution, regardless of input or output resolution. Our model enables various downstream tasks to be performed on an image latent whilst maintaining performance across different resolutions, overcoming the shortfalls of traditional methods. We demonstrate its effectiveness in uncertainty-aware super-resolution, classification, and generative modelling tasks and show how our method outperforms conventional baselines with minimal performance loss across resolutions.
zh

[CV-148] Isolated Channel Vision Transformers: From Single-Channel Pretraining to Multi-Channel Finetuning

【速读】：该论文旨在解决将 Vision Transformers (ViTs) 应用于多通道成像 (Multi-Channel Imaging, MCI) 数据（如医学和遥感领域）时面临的挑战。MCI 数据通常由不同模态的层组成，直接在这些数据上训练 ViTs 可能会掩盖互补信息并降低性能。为了解决这一问题，论文提出了一种名为 Isolated Channel ViT (IC-ViT) 的简单而有效的预训练框架。其关键在于通过独立的通道 Patchification 技术，使 ViTs 能够有效处理多模态多通道任务，并且可以在单通道上进行预训练，在下游多通道数据集上进行微调。这种预训练方法能够捕获补丁之间以及通道之间的依赖关系，生成鲁棒的特征表示。实验结果表明，与现有的通道自适应方法相比，IC-ViT 在多个任务和基准测试中提升了 4-14 个百分点的性能。

链接: https://arxiv.org/abs/2503.09826
作者: Wenyi Lian,Joakim Lindblad,Patrick Micke,Nataša Sladoje
机构: Uppsala University (乌普萨拉大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Vision Transformers (ViTs) have achieved remarkable success in standard RGB image processing tasks. However, applying ViTs to multi-channel imaging (MCI) data, e.g., for medical and remote sensing applications, remains a challenge. In particular, MCI data often consist of layers acquired from different modalities. Directly training ViTs on such data can obscure complementary information and impair the performance. In this paper, we introduce a simple yet effective pretraining framework for large-scale MCI datasets. Our method, named Isolated Channel ViT (IC-ViT), patchifies image channels individually and thereby enables pretraining for multimodal multi-channel tasks. We show that this channel-wise patchifying is a key technique for MCI processing. More importantly, one can pretrain the IC-ViT on single channels and finetune it on downstream multi-channel datasets. This pretraining framework captures dependencies between patches as well as channels and produces robust feature representation. Experiments on various tasks and benchmarks, including JUMP-CP and CHAMMI for cell microscopy imaging, and So2Sat-LCZ42 for satellite imaging, show that the proposed IC-ViT delivers 4-14 percentage points of performance improvement over existing channel-adaptive approaches. Further, its efficient training makes it a suitable candidate for large-scale pretraining of foundation models on heterogeneous data.
zh

[CV-149] Fine-tuning Vision Language Models with Graph-based Knowledge for Explainable Medical Image Analysis

【速读】：本文旨在解决糖尿病视网膜病变（Diabetic Retinopathy, DR）分期模型不可解释的问题，并填补现有公共数据集中缺乏临床推理与解释的空白。解决方案的关键在于提出了一种结合图表示学习与视觉-语言模型（Vision-Language Models, VLMs）的新方法，以实现可解释的DR诊断。具体而言，通过构建基于生物信息的图来编码光学相干断层扫描血管成像（Optical Coherence Tomography Angiography, OCTA）图像中的关键视网膜血管特征（如血管形态和空间连通性），利用图神经网络（Graph Neural Network, GNN）进行DR分期。同时，采用积分梯度（Integrated Gradients）技术突出图中驱动分类决策的重要节点和边及其特征，将这些基于图的知识转化为文本描述以指导VLM训练。最终，通过指令微调得到的学生VLM能够仅基于单一图像输入完成疾病分类并以人类可理解的方式解释其决策过程。实验结果表明，该方法不仅提升了分类准确性，还提供了更具有临床解释性的结果。

链接: https://arxiv.org/abs/2503.09808
作者: Chenjun Li,Laurin Lux,Alexander H. Berger,Martin J. Menten,Mert R. Sabuncu,Johannes C. Paetzold
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 11 pages, 3 figures

点击查看摘要

Abstract:Accurate staging of Diabetic Retinopathy (DR) is essential for guiding timely interventions and preventing vision loss. However, current staging models are hardly interpretable, and most public datasets contain no clinical reasoning or interpretation beyond image-level labels. In this paper, we present a novel method that integrates graph representation learning with vision-language models (VLMs) to deliver explainable DR diagnosis. Our approach leverages optical coherence tomography angiography (OCTA) images by constructing biologically informed graphs that encode key retinal vascular features such as vessel morphology and spatial connectivity. A graph neural network (GNN) then performs DR staging while integrated gradients highlight critical nodes and edges and their individual features that drive the classification decisions. We collect this graph-based knowledge which attributes the model’s prediction to physiological structures and their characteristics. We then transform it into textual descriptions for VLMs. We perform instruction-tuning with these textual descriptions and the corresponding image to train a student VLM. This final agent can classify the disease and explain its decision in a human interpretable way solely based on a single image input. Experimental evaluations on both proprietary and public datasets demonstrate that our method not only improves classification accuracy but also offers more clinically interpretable results. An expert study further demonstrates that our method provides more accurate diagnostic explanations and paves the way for precise localization of pathologies in OCTA images.
zh

[CV-150] How good are deep learning methods for automated road safety analysis using video data? An experimental study

【速读】：该论文旨在解决基于图像的多目标跟踪（MOT）在交通安全分析中的应用问题，特别是利用深度学习方法改进从视频数据中提取的道路使用者轨迹的准确性。论文的关键在于开发了两种后处理步骤，即IDsplit和SS，以优化跟踪结果并研究这些结果对时间至碰撞（TTC）指标的影响。通过使用KITTI交通视频数据集评估三种不同的MOT方法，研究发现现有方法普遍存在高估交互次数和低估TTC的问题，导致道路使用者的交互看似比实际情况更危险。因此，未来的工作将集中在测试更多方法和多样化的数据集，尤其是来自路侧传感器的数据，以验证当前结果并提升性能。

链接: https://arxiv.org/abs/2503.09807
作者: Qingwu Liu,Nicolas Saunier,Guillaume-Alexandre Bilodeau
机构: Polytechnique Montreal (蒙特利尔理工学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: This paper is accepted by TRB Annual Meeting 2024

点击查看摘要

Abstract:Image-based multi-object detection (MOD) and multi-object tracking (MOT) are advancing at a fast pace. A variety of 2D and 3D MOD and MOT methods have been developed for monocular and stereo cameras. Road safety analysis can benefit from those advancements. As crashes are rare events, surrogate measures of safety (SMoS) have been developed for safety analyses. (Semi-)Automated safety analysis methods extract road user trajectories to compute safety indicators, for example, Time-to-Collision (TTC) and Post-encroachment Time (PET). Inspired by the success of deep learning in MOD and MOT, we investigate three MOT methods, including one based on a stereo-camera, using the annotated KITTI traffic video dataset. Two post-processing steps, IDsplit and SS, are developed to improve the tracking results and investigate the factors influencing the TTC. The experimental results show that, despite some advantages in terms of the numbers of interactions or similarity to the TTC distributions, all the tested methods systematically over-estimate the number of interactions and under-estimate the TTC: they report more interactions and more severe interactions, making the road user interactions appear less safe than they are. Further efforts will be directed towards testing more methods and more data, in particular from roadside sensors, to verify the results and improve the performance.
zh

[CV-151] Evaluating the Impact of Synthetic Data on Object Detection Tasks in Autonomous Driving

【速读】：该论文旨在解决自动驾驶系统在多样化场景下鲁棒性能保障的需求，通过引入合成数据（Synthetic Data）作为现实世界数据集的有效补充，以克服传统数据集规模和质量上的局限性。论文的关键在于评估合成数据在增强模型鲁棒性和泛化能力方面的效用与潜在限制，并提出结合真实数据与合成数据的混合训练策略，从而有效缓解因分布差异（distributional differences）和偏置（biases）可能带来的负面影响，为推动自动驾驶技术的发展提供理论依据和技术支持。

链接: https://arxiv.org/abs/2503.09803
作者: Enes Özeren,Arka Bhowmick
机构: BIT Technology Solutions GmbH
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 7 pages, 4 figures, 3 tables

点击查看摘要

Abstract:The increasing applications of autonomous driving systems necessitates large-scale, high-quality datasets to ensure robust performance across diverse scenarios. Synthetic data has emerged as a viable solution to augment real-world datasets due to its cost-effectiveness, availability of precise ground-truth labels, and the ability to model specific edge cases. However, synthetic data may introduce distributional differences and biases that could impact model performance in real-world settings. To evaluate the utility and limitations of synthetic data, we conducted controlled experiments using multiple real-world datasets and a synthetic dataset generated by BIT Technology Solutions GmbH. Our study spans two sensor modalities, camera and LiDAR, and investigates both 2D and 3D object detection tasks. We compare models trained on real, synthetic, and mixed datasets, analyzing their robustness and generalization capabilities. Our findings demonstrate that the use of a combination of real and synthetic data improves the robustness and generalization of object detection models, underscoring the potential of synthetic data in advancing autonomous driving technologies.
zh

[CV-152] SeqSAM: Autoregressive Multiple Hypothesis Prediction for Medical Image Segmentation using SAM

【速读】：该论文试图解决医学影像分割中单一预测忽略图像固有不确定性的问题，特别是由于对象边界不清晰及标注工具引起的误差。为应对这一挑战，论文提出了一种基于RNN的顺序方法SeqSAM，其关键在于引入二分图匹配损失函数（bipartite matching loss），以确保生成的每个分割掩膜具有临床相关性，同时能够灵活生成任意数量的掩膜，而不受初始预训练超参数的限制。实验结果显示，SeqSAM在两个公开数据集上显著提升了分割掩膜的质量。

链接: https://arxiv.org/abs/2503.09797
作者: Benjamin Towle,Xin Chen,Ke Zhou
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Accepted to ISBI 2025

点击查看摘要

Abstract:Pre-trained segmentation models are a powerful and flexible tool for segmenting images. Recently, this trend has extended to medical imaging. Yet, often these methods only produce a single prediction for a given image, neglecting inherent uncertainty in medical images, due to unclear object boundaries and errors caused by the annotation tool. Multiple Choice Learning is a technique for generating multiple masks, through multiple learned prediction heads. However, this cannot readily be extended to producing more outputs than its initial pre-training hyperparameters, as the sparse, winner-takes-all loss function makes it easy for one prediction head to become overly dominant, thus not guaranteeing the clinical relevancy of each mask produced. We introduce SeqSAM, a sequential, RNN-inspired approach to generating multiple masks, which uses a bipartite matching loss for ensuring the clinical relevancy of each mask, and can produce an arbitrary number of masks. We show notable improvements in quality of each mask produced across two publicly available datasets. Our code is available at this https URL.
zh

[CV-153] A PyTorch-Enabled Tool for Synthetic Event Camera Data Generation and Algorithm Development

【速读】：本文旨在解决事件相机（Event Cameras）在特定领域研究任务中应用受限的问题，主要由于商业可用性有限、现有数据集缺乏以及难以预测其非线性光学编码、独特的噪声模型及基于张量的数据处理需求的影响。为应对这些挑战，论文引入了SENPI（Synthetic Events for Neural Processing and Integration），这是一个基于PyTorch的Python库，用于模拟和处理事件相机数据。SENPI的关键创新在于其可微分数字孪生模块，能够将基于强度的数据转换为事件表示形式，从而在处理前向模型的非光滑和非线性特性的同时评估事件相机性能。此外，SENPI还支持事件驱动的输入输出、操作、滤波和可视化模块，构建了高效且可扩展的合成与真实事件数据处理工作流。通过生成逼真的事件数据并与实际事件相机数据对比，论文验证了SENPI的有效性，并利用其探索了不同噪声条件下事件相机的行为以及优化事件对比阈值以改善目标条件下的编码效果。SENPI的核心贡献在于降低研究人员进入该领域的门槛，提供了一个便捷的事件数据生成与算法开发工具，成为推动神经形态视觉系统研究的重要资源。

链接: https://arxiv.org/abs/2503.09754
作者: Joseph L. Greene,Adrish Kar,Ignacio Galindo,Elijah Quiles,Elliott Chen,Matthew Anderson
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Optics (physics.optics)
备注: 18 pages, 4 figures

点击查看摘要

Abstract:Event, or neuromorphic cameras, offer a novel encoding of natural scenes by asynchronously reporting significant changes in brightness, known as events, with improved dynamic range, temporal resolution and lower data bandwidth when compared to conventional cameras. However, their adoption in domain-specific research tasks is hindered in part by limited commercial availability, lack of existing datasets, and challenges related to predicting the impact of their nonlinear optical encoding, unique noise model and tensor-based data processing requirements. To address these challenges, we introduce Synthetic Events for Neural Processing and Integration (SENPI) in Python, a PyTorch-based library for simulating and processing event camera data. SENPI includes a differentiable digital twin that converts intensity-based data into event representations, allowing for evaluation of event camera performance while handling the non-smooth and nonlinear nature of the forward model The library also supports modules for event-based I/O, manipulation, filtering and visualization, creating efficient and scalable workflows for both synthetic and real event-based data. We demonstrate SENPI’s ability to produce realistic event-based data by comparing synthetic outputs to real event camera data and use these results to draw conclusions on the properties and utility of event-based perception. Additionally, we showcase SENPI’s use in exploring event camera behavior under varying noise conditions and optimizing event contrast threshold for improved encoding under target conditions. Ultimately, SENPI aims to lower the barrier to entry for researchers by providing an accessible tool for event data generation and algorithmic developmnent, making it a valuable resource for advancing research in neuromorphic vision systems.
zh

[CV-154] SASNet: Spatially-Adaptive Sinusoidal Neural Networks

【速读】：该论文旨在解决生成式隐神经表示（INRs）在处理低维信号时面临的谱偏置（spectral bias）、训练不稳定性和过拟合（overfitting）等问题。论文提出了一种空间自适应正弦神经网络（SASNet），其关键在于通过引入频率嵌入层（frequency embedding layer）来控制频谱成分并缓解谱偏置，同时结合联合优化的空间自适应掩码（spatially-adaptive masks），以定位神经元的影响范围，从而减少网络冗余并提升收敛稳定性。这种设计使SASNet能够在不牺牲模型紧凑性的同时，实现高精度的高频信号重建、超分辨率能力以及噪声抑制。

链接: https://arxiv.org/abs/2503.09750
作者: Haoan Feng,Diana Aldana,Tiago Novello,Leila De Floriani
机构: University of Maryland, College Park (马里兰大学帕克分校); IMPA (未知)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 10 pages, 7 figures, supplementary materials

点击查看摘要

Abstract:Sinusoidal neural networks (SNNs) have emerged as powerful implicit neural representations (INRs) for low-dimensional signals in computer vision and graphics. They enable high-frequency signal reconstruction and smooth manifold modeling; however, they often suffer from spectral bias, training instability, and overfitting. To address these challenges, we propose SASNet, Spatially-Adaptive SNNs that robustly enhance the capacity of compact INRs to fit detailed signals. SASNet integrates a frequency embedding layer to control frequency components and mitigate spectral bias, along with jointly optimized, spatially-adaptive masks that localize neuron influence, reducing network redundancy and improving convergence stability. Robust to hyperparameter selection, SASNet faithfully reconstructs high-frequency signals without overfitting low-frequency regions. Our experiments show that SASNet outperforms state-of-the-art INRs, achieving strong fitting accuracy, super-resolution capability, and noise suppression, without sacrificing model compactness.
zh

[CV-155] A Siamese Network to Detect If Two Iris Images Are Monozygotic

【速读】：该论文试图解决的问题是如何有效区分同卵（monozygotic）与非同卵（non-monozygotic）虹膜图像对。传统方法将同一人左右眼的虹膜纹理视为与其他无关个体的虹膜一样不同，但研究表明人类在判断同卵双胞胎或同一人的双眼虹膜时具有约80%的准确性。为解决此问题，论文提出利用孪生网络架构（Siamese Network Architecture）结合对比学习（Contrastive Learning）的方法来分类虹膜图像对是否来自同卵个体。关键在于通过构建包含合成同卵对、自然同卵对以及非同卵对的数据集，以及分析三种模型变体（使用原始图像、仅虹膜图像和非仅虹膜图像）的训练结果，揭示虹膜特定的纹理细节和眼部上下文线索对于识别同卵虹膜模式的重要性。研究发现，利用完整眼区信息的模型表现优于仅基于虹膜数据的模型，强调了虹膜与眼部特征之间微妙的相互作用。最终，该方法在全虹膜图像上的分类准确率超过了之前人类分类同卵虹膜对的水平。

链接: https://arxiv.org/abs/2503.09749
作者: Yongle Yuan,Kevin W. Bowyer
机构: University of Notre Dame (圣母大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:In Daugman-style iris recognition, the textures of the left and right irises of the same person are traditionally considered as being as different as the irises of two unrelated persons. However, previous research indicates that humans can detect that two iris images are from different eyes of the same person, or eyes of monozygotic twins, with an accuracy of about 80%. In this work, we employ a Siamese network architecture and contrastive learning to categorize a pair of iris images as coming from monozygotic or non-monozygotic irises. This could potentially be applied, for example, as a fast, noninvasive test to determine if twins are monozygotic or non-monozygotic. We construct a dataset comprising both synthetic monozygotic pairs (images of different irises of the same individual) and natural monozygotic pairs (images of different images from persons who are identical twins), in addition to non-monozygotic pairs from unrelated individuals, ensuring a comprehensive evaluation of the model’s capabilities. To gain deeper insights into the learned representations, we train and analyze three variants of the model using (1) the original input images, (2) iris-only images, and (3) non-iris-only images. This comparison reveals the critical importance of iris-specific textural details and contextual ocular cues in identifying monozygotic iris patterns. The results demonstrate that models leveraging full eye-region information outperform those trained solely on iris-only data, emphasizing the nuanced interplay between iris and ocular characteristics. Our approach achieves accuracy levels using the full iris image that exceed those previously reported for human classification of monozygotic iris pairs. This study presents the first classifier designed to determine whether a pair of iris images originates from monozygotic individuals.
zh

[CV-156] Enhancing Adversarial Example Detection Through Model Explanation

【速读】：该论文旨在解决对抗样本（adversarial examples）对机器学习模型构成的重大威胁，并探索基于模型解释（model explanations）的防御方法的有效性。论文关注的是 AmI 方法，这是一种利用模型解释检测对抗样本的技术。研究的关键在于评估 AmI 方法在不同条件下的鲁棒性和可靠性，包括其对超参数、操作系统以及深度学习框架等外部因素的高度依赖性。研究发现，这些局限性限制了 AmI 方法的实际应用价值，从而强调了开发更强大且适应多种条件的防御机制的重要性，同时呼吁建立全面的防御技术评估框架。

链接: https://arxiv.org/abs/2503.09735
作者: Qian Ma,Ziping Ye
机构: The Pennsylvania State University (宾夕法尼亚州立大学); University of Pennsylvania (宾夕法尼亚大学)
类目: Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 5 pages, 1 figure

点击查看摘要

Abstract:Adversarial examples are a major problem for machine learning models, leading to a continuous search for effective defenses. One promising direction is to leverage model explanations to better understand and defend against these attacks. We looked at AmI, a method proposed by a NeurIPS 2018 spotlight paper that uses model explanations to detect adversarial examples. Our study shows that while AmI is a promising idea, its performance is too dependent on specific settings (e.g., hyperparameter) and external factors such as the operating system and the deep learning framework used, and such drawbacks limit AmI’s practical usage. Our findings highlight the need for more robust defense mechanisms that are effective under various conditions. In addition, we advocate for a comprehensive evaluation framework for defense techniques.
zh

[CV-157] I2V3D: Controllable image-to-video generation with 3D guidance

【速读】：本文旨在解决从单张静态图像生成具有精确三维控制的动态视频的问题。解决方案的关键在于结合三维几何引导与先进的生成式模型的优势，提出了一种名为I2V3D的新框架。该框架通过计算机图形学管道的精度实现对相机移动、物体旋转及角色动画等元素的精准控制，并利用生成式AI（Generative AI）的视觉保真度，从粗略渲染的输入生成高质量视频。其核心方法包括两阶段生成过程：首先，通过定制化的图像扩散模型在关键帧生成阶段优化渲染结果以确保一致性与质量；其次，在无需训练的情况下采用双向引导的方式进行三维引导视频插值，从而生成平滑且高质量的关键帧间视频帧。实验结果证明了该框架能够有效实现可控且高质量的动画生成。

链接: https://arxiv.org/abs/2503.09733
作者: Zhiyuan Zhang,Dongdong Chen,Jing Liao
机构: City University of Hong Kong (香港城市大学); Microsoft GenAI (微软生成式人工智能)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project page: this https URL

点击查看摘要

Abstract:We present I2V3D, a novel framework for animating static images into dynamic videos with precise 3D control, leveraging the strengths of both 3D geometry guidance and advanced generative models. Our approach combines the precision of a computer graphics pipeline, enabling accurate control over elements such as camera movement, object rotation, and character animation, with the visual fidelity of generative AI to produce high-quality videos from coarsely rendered inputs. To support animations with any initial start point and extended sequences, we adopt a two-stage generation process guided by 3D geometry: 1) 3D-Guided Keyframe Generation, where a customized image diffusion model refines rendered keyframes to ensure consistency and quality, and 2) 3D-Guided Video Interpolation, a training-free approach that generates smooth, high-quality video frames between keyframes using bidirectional guidance. Experimental results highlight the effectiveness of our framework in producing controllable, high-quality animations from single input images by harmonizing 3D geometry with generative models. The code for our framework will be publicly released.
zh

[CV-158] Revisiting semi-supervised learning in the era of foundation models

【速读】：该论文试图解决的问题是：随着视觉基础模型（Vision Foundation Models, VFMs）在视觉应用中的广泛应用，尚不清楚半监督学习（Semi-Supervised Learning, SSL）如何与这些预训练模型有效结合。为填补这一研究空白，作者开发了新的SSL基准数据集，并系统性评估了代表性SSL方法。研究发现，仅使用标记数据进行参数高效微调（Parameter-Efficient Fine-Tuning, PEFT）的表现往往可以媲美传统的SSL方法，即使未利用未标记数据。

解决方案的关键在于重新审视自训练（Self-Training），这是一种概念简单的SSL基线方法。具体而言，作者利用监督式的PEFT模型为未标记数据生成伪标签，并进一步训练模型。为应对伪标签噪声的典型问题，提出通过集成多种PEFT方法和VFM骨干网络来生成更鲁棒的伪标签。实证结果验证了这一简单而强大的方法的有效性，为基于VFM的SSL提供了实用洞见，并为大模型时代的可扩展且实用的半监督学习铺平了道路。

链接: https://arxiv.org/abs/2503.09707
作者: Ping Zhang,Zheda Mai,Quang-Huy Nguyen,Wei-Lun Chao
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Semi-supervised learning (SSL) leverages abundant unlabeled data alongside limited labeled data to enhance learning. As vision foundation models (VFMs) increasingly serve as the backbone of vision applications, it remains unclear how SSL interacts with these pre-trained models. To address this gap, we develop new SSL benchmark datasets where frozen VFMs underperform and systematically evaluate representative SSL methods. We make a surprising observation: parameter-efficient fine-tuning (PEFT) using only labeled data often matches SSL performance, even without leveraging unlabeled data. This motivates us to revisit self-training, a conceptually simple SSL baseline, where we use the supervised PEFT model to pseudo-label unlabeled data for further training. To overcome the notorious issue of noisy pseudo-labels, we propose ensembling multiple PEFT approaches and VFM backbones to produce more robust pseudo-labels. Empirical results validate the effectiveness of this simple yet powerful approach, providing actionable insights into SSL with VFMs and paving the way for more scalable and practical semi-supervised learning in the era of foundation models.
zh

[CV-159] DRESS: Disentangled Representation-based Self-Supervised Meta-Learning for Diverse Tasks NEURIPS

【速读】：本文旨在解决在少样本学习任务中，传统元学习方法表现不佳的问题。作者认为其原因在于现有少样本学习任务缺乏多样性。为应对这一挑战，论文提出了一种名为DRESS的任务无关解耦表示驱动的自监督元学习方法，它通过解耦表示学习构建自监督任务，从而支持快速模型适应于高度多样化的少样本学习任务。关键创新点在于利用解耦表示学习生成自监督任务以促进元训练过程，并提出基于类别划分的度量方法直接量化输入空间中的任务多样性。实验结果表明，DRESS在大多数数据集和任务设置下优于对比方法，呼吁重新审视任务适应研究的适当设置，并激发利用解耦表示解决少样本学习问题的兴趣。

链接: https://arxiv.org/abs/2503.09679
作者: Wei Cui,Tongzi Wu,Jesse C. Cresswell,Yi Sui,Keyvan Golestan
机构: Layer 6 AI (Layer 6 AI)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注: 9 pages, 6 figures. An earlier version of the paper has been presented at the Self-Supervised Learning workshop at the 2024 NeurIPS conference

点击查看摘要

Abstract:Meta-learning represents a strong class of approaches for solving few-shot learning tasks. Nonetheless, recent research suggests that simply pre-training a generic encoder can potentially surpass meta-learning algorithms. In this paper, we first discuss the reasons why meta-learning fails to stand out in these few-shot learning experiments, and hypothesize that it is due to the few-shot learning tasks lacking diversity. We propose DRESS, a task-agnostic Disentangled REpresentation-based Self-Supervised meta-learning approach that enables fast model adaptation on highly diversified few-shot learning tasks. Specifically, DRESS utilizes disentangled representation learning to create self-supervised tasks that can fuel the meta-training process. Furthermore, we also propose a class-partition based metric for quantifying the task diversity directly on the input space. We validate the effectiveness of DRESS through experiments on datasets with multiple factors of variation and varying complexity. The results suggest that DRESS is able to outperform competing methods on the majority of the datasets and task setups. Through this paper, we advocate for a re-examination of proper setups for task adaptation studies, and aim to reignite interest in the potential of meta-learning for solving few-shot learning tasks via disentangled representations.
zh

[CV-160] Accelerating Diffusion Sampling via Exploiting Local Transition Coherence

【速读】：该论文试图解决文本到图像和视频生成中去噪扩散过程采样时间过长的问题。传统方法要么忽略了相邻步骤之间的统计关系，要么依赖于特定网络结构下的注意力或特征相似性机制，而这些方法往往具有局限性。论文的关键在于发现了一种新的统计关系，即在相邻步骤之间的转换算子中，关注网络输出结果的关系，这种关系不对网络结构提出任何假设。基于此观察，提出了一个无需训练的加速方法LTC-Accel，通过利用这一关系来估计当前的转换算子。由于不依赖特定网络结构，LTC-Accel可适用于几乎所有基于扩散的方法，并且可以与现有大多数加速技术正交结合。实验表明，LTC-Accel在保持高质量样本的同时显著提升了文本到图像和视频生成的速度，在Stable Diffusion v2中实现了1.67倍加速，在视频生成模型中实现了1.55倍加速，与蒸馏模型结合后可实现高达10倍加速。

链接: https://arxiv.org/abs/2503.09675
作者: Shangwen Zhu,Han Zhang,Zhantao Yang,Qianyu Peng,Zhao Pu,Huangji Wang,Fan Cheng
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Text-based diffusion models have made significant breakthroughs in generating high-quality images and videos from textual descriptions. However, the lengthy sampling time of the denoising process remains a significant bottleneck in practical applications. Previous methods either ignore the statistical relationships between adjacent steps or rely on attention or feature similarity between them, which often only works with specific network structures. To address this issue, we discover a new statistical relationship in the transition operator between adjacent steps, focusing on the relationship of the outputs from the network. This relationship does not impose any requirements on the network structure. Based on this observation, we propose a novel training-free acceleration method called LTC-Accel, which uses the identified relationship to estimate the current transition operator based on adjacent steps. Due to no specific assumptions regarding the network structure, LTC-Accel is applicable to almost all diffusion-based methods and orthogonal to almost all existing acceleration techniques, making it easy to combine with them. Experimental results demonstrate that LTC-Accel significantly speeds up sampling in text-to-image and text-to-video synthesis while maintaining competitive sample quality. Specifically, LTC-Accel achieves a speedup of 1.67-fold in Stable Diffusion v2 and a speedup of 1.55-fold in video generation models. When combined with distillation models, LTC-Accel achieves a remarkable 10-fold speedup in video generation, allowing real-time generation of more than 16FPS.
zh

[CV-161] Silent Branding Attack: Trigger-free Data Poisoning Attack on Text-to-Image Diffusion Models CVPR2025

【速读】：本文旨在解决文本到图像扩散模型在训练数据受到数据投毒攻击时，可能被操控生成包含特定品牌标志或符号的图像的问题。尽管这些模型依赖公开可用的数据且存在数据共享的趋势，但它们对数据投毒攻击特别敏感。论文提出了一种名为“静默品牌植入攻击”（Silent Branding Attack）的新方法，这是一种新颖的数据投毒技术，通过操纵模型在输出中自然生成指定的品牌标志或符号，而无需任何文本触发词。关键在于利用模型在训练过程中对反复出现的视觉模式的学习能力，即使输入提示未提及这些模式，模型仍会自然生成相关结果。为此，研究开发了一种自动化的数据投毒算法，将标志隐秘地注入原始图像中，确保其与背景自然融合且不易被察觉。实验验证表明，即使在没有特定文本触发的情况下，基于中毒数据集训练的模型也能高质量地生成包含标志的图像，并保持图像质量和文本对齐度，同时通过人类评估和定量指标如标志检测证明了该方法能够隐蔽地嵌入标志。

链接: https://arxiv.org/abs/2503.09669
作者: Sangwon Jang,June Suk Choi,Jaehyeong Jo,Kimin Lee,Sung Ju Hwang
机构: Institution1; Institution2
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
备注: CVPR 2025. Project page: this https URL

点击查看摘要

Abstract:Text-to-image diffusion models have achieved remarkable success in generating high-quality contents from text prompts. However, their reliance on publicly available data and the growing trend of data sharing for fine-tuning make these models particularly vulnerable to data poisoning attacks. In this work, we introduce the Silent Branding Attack, a novel data poisoning method that manipulates text-to-image diffusion models to generate images containing specific brand logos or symbols without any text triggers. We find that when certain visual patterns are repeatedly in the training data, the model learns to reproduce them naturally in its outputs, even without prompt mentions. Leveraging this, we develop an automated data poisoning algorithm that unobtrusively injects logos into original images, ensuring they blend naturally and remain undetected. Models trained on this poisoned dataset generate images containing logos without degrading image quality or text alignment. We experimentally validate our silent branding attack across two realistic settings on large-scale high-quality image datasets and style personalization datasets, achieving high success rates even without a specific text trigger. Human evaluation and quantitative metrics including logo detection show that our method can stealthily embed logos.
zh

[CV-162] CoRe2: Collect Reflect and Refine to Generate Better and Faster

【速读】：该论文旨在解决文本到图像（Text-to-Image, T2I）生成模型在采样速度与生成质量之间难以兼顾的问题，同时克服现有推理方法无法在扩散模型（Diffusion Models, DMs）和视觉自回归模型（Autoregressive Models, ARMs）上均保持稳定性能的局限。论文的关键在于提出了一种新型的即插即用式推理范式CoRe²，其包含三个子过程：Collect（收集）、Reflect（反射）和Refine（优化）。通过首先收集无分类器引导（Classifier-Free Guidance, CFG）轨迹，并利用这些数据训练一个弱模型以反映易于学习的内容，同时减少推理过程中函数评估次数的一半；随后利用从弱到强的引导策略优化条件输出，从而提升模型生成高频细节和高保真内容的能力。这种方法首次实现了在多种扩散模型（如SDXL、SD3.5、FLUX）及视觉自回归模型（如LlamaGen）上的高效性和有效性统一，并显著提升了多个基准测试（HPD v2、Pick-of-Pic、Drawbench、GenEval、T2I-Compbench）中的性能表现。此外，CoRe²能够无缝集成至最先进的Z-Sampling方法中，在PickScore和AES指标上分别超越其0.3和0.16分，同时节省5.64秒运行时间。相关代码已开源。

链接: https://arxiv.org/abs/2503.09662
作者: Shitong Shao,Zikai Zhou,Dian Xie,Yuetong Fang,Tian Ye,Lichen Bai,Zeke Xie
机构: The Hong Kong University of Science and Technology (香港科技大学) (GuangZhou)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Making text-to-image (T2I) generative model sample both fast and well represents a promising research direction. Previous studies have typically focused on either enhancing the visual quality of synthesized images at the expense of sampling efficiency or dramatically accelerating sampling without improving the base model’s generative capacity. Moreover, nearly all inference methods have not been able to ensure stable performance simultaneously on both diffusion models (DMs) and visual autoregressive models (ARMs). In this paper, we introduce a novel plug-and-play inference paradigm, CoRe^2, which comprises three subprocesses: Collect, Reflect, and Refine. CoRe^2 first collects classifier-free guidance (CFG) trajectories, and then use collected data to train a weak model that reflects the easy-to-learn contents while reducing number of function evaluations during inference by half. Subsequently, CoRe^2 employs weak-to-strong guidance to refine the conditional output, thereby improving the model’s capacity to generate high-frequency and realistic content, which is difficult for the base model to capture. To the best of our knowledge, CoRe^2 is the first to demonstrate both efficiency and effectiveness across a wide range of DMs, including SDXL, SD3.5, and FLUX, as well as ARMs like LlamaGen. It has exhibited significant performance improvements on HPD v2, Pick-of-Pic, Drawbench, GenEval, and T2I-Compbench. Furthermore, CoRe^2 can be seamlessly integrated with the state-of-the-art Z-Sampling, outperforming it by 0.3 and 0.16 on PickScore and AES, while achieving 5.64s time saving using this http URL is released at this https URL.
zh

[CV-163] Adaptive Anomaly Recovery for Telemanipulation: A Diffusion Model Approach to Vision-Based Tracking

【速读】：该论文旨在解决灵巧遥操作（Dexterous Telemanipulation）中因视觉跟踪方法稳定性不足导致的问题。传统基于视觉的跟踪方法容易受到遮挡（occlusions）、光照不足及视野丢失等异常情况的影响，而现有的低维数据补偿方法（如滤波、回归和插值）在处理高维图像和视频数据时存在信息损失。论文的关键创新在于提出了Diffusion-Enhanced Telemanipulation (DET) 框架，结合帧差检测（Frame-Difference Detection, FDD）技术识别并分割视频流中的异常片段，并利用扩散模型对这些异常片段进行重建替换，从而在复杂的视觉条件下实现稳定且鲁棒的遥操作性能。实验表明，与三次样条插值和基于FFT的插值相比，DET分别实现了平均RMSE降低17.2%和51.1%。

链接: https://arxiv.org/abs/2503.09632
作者: Haoyang Wang,Haoran Guo,Lingfeng Tao,Zhengxiong Li
机构: Oklahoma State University (俄克拉荷马州立大学); University of Colorado Denver (丹佛大学)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注: This work has been submitted to the IEEE for possible publication

点击查看摘要

Abstract:Dexterous telemanipulation critically relies on the continuous and stable tracking of the human operator’s commands to ensure robust operation. Vison-based tracking methods are widely used but have low stability due to anomalies such as occlusions, inadequate lighting, and loss of sight. Traditional filtering, regression, and interpolation methods are commonly used to compensate for explicit information such as angles and positions. These approaches are restricted to low-dimensional data and often result in information loss compared to the original high-dimensional image and video data. Recent advances in diffusion-based approaches, which can operate on high-dimensional data, have achieved remarkable success in video reconstruction and generation. However, these methods have not been fully explored in continuous control tasks in robotics. This work introduces the Diffusion-Enhanced Telemanipulation (DET) framework, which incorporates the Frame-Difference Detection (FDD) technique to identify and segment anomalies in video streams. These anomalous clips are replaced after reconstruction using diffusion models, ensuring robust telemanipulation performance under challenging visual conditions. We validated this approach in various anomaly scenarios and compared it with the baseline methods. Experiments show that DET achieves an average RMSE reduction of 17.2% compared to the cubic spline and 51.1% compared to FFT-based interpolation for different occlusion durations.
zh

[CV-164] How Should We Evaluate Uncertainty in Accelerated MRI Reconstruction?

【速读】：该论文试图解决在加速磁共振成像（Accelerated MRI）重建中量化不确定性的问题。现有方法主要关注像素强度的变异性，但这种变异性缺乏解剖结构的理解，且与后续数据分析的关系不明确。论文的关键解决方案是提出一种基于重建中显性解剖变化的新方法来评估重建的变异性，这种方法更紧密地关联到常见的下游任务。通过图像配准和分割技术，结合集成方法来衡量加速成像中的不确定性，揭示了重建图像的固有变异性，并展示了即使在结构相似性指数（SSIM）和峰值信噪比（PSNR）等常用质量度量中表现良好的模型，也可能在解剖测量中表现出高水平的方差和偏倚。

链接: https://arxiv.org/abs/2503.10527
作者: Luca Trautmann,Peter Wijeratne,Itamar Ronen,Ivor Simpson
机构: Department of Informatics, University of Sussex (信息学系, 苏塞克斯大学), Falmer, United Kingdom; Brighton and Sussex Medical School (布莱顿和苏塞克斯医学学校), Falmer, United Kingdom
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Medical Physics (physics.med-ph)
备注:

点击查看摘要

Abstract:Reconstructing accelerated MRI is an ill-posed problem. Machine learning has recently shown great promise at this task, but current approaches to quantifying uncertainty focus on measuring the variability in pixelwise intensity variation. Although these provide interpretable maps, they lack structural understanding and they do not have a clear relationship to how the data will be analysed subsequently. In this paper, we propose a new approach to evaluating reconstruction variability based on apparent anatomical changes in the reconstruction, which is more tightly related to common downstream tasks. We use image registration and segmentation to evaluate several common MRI reconstruction approaches, where uncertainty is measured via ensembling, for accelerated imaging. We demonstrate the intrinsic variability in reconstructed images and show that models with high scores on often used quality metrics such as SSIM and PSNR, can nonetheless display high levels of variance and bias in anatomical measures.
zh

[CV-165] Low Complexity Point Tracking of the Myocardium in 2D Echocardiography

【速读】：本文旨在解决现有深度学习方法在二维超声心动图中进行点跟踪时未能充分利用领域特定知识以实现极快速和高效配置的问题。论文提出了一种名为MyoTracker的低复杂度架构（0.3M参数），用于超声心动图中的点跟踪。其关键创新在于基于CoTracker2架构简化组件并扩展时间上下文，从而能够一次性为整个序列提供点预测，而不是逐帧处理。此外，通过维持整个序列的时间上下文，MyoTracker显著提升了准确性，同时实现了更少的GPU内存占用和更快的推理速度，分别为CoTracker2的67%和EchoTracker的84%，并且推理速度分别快74倍和11倍。这些改进使得MyoTracker在保持高精度的同时大幅提高了计算效率。

链接: https://arxiv.org/abs/2503.10431
作者: Artem Chernyshov,John Nyberg,Vegard Holmstrøm,Md Abulkalam Azad,Bjørnar Grenne,Håvard Dalen,Svein Arne Aase,Lasse Lovstakken,Andreas Østvik
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Deep learning methods for point tracking are applicable in 2D echocardiography, but do not yet take advantage of domain specifics that enable extremely fast and efficient configurations. We developed MyoTracker, a low-complexity architecture (0.3M parameters) for point tracking in echocardiography. It builds on the CoTracker2 architecture by simplifying its components and extending the temporal context to provide point predictions for the entire sequence in a single step. We applied MyoTracker to the right ventricular (RV) myocardium in RV-focused recordings and compared the results with those of CoTracker2 and EchoTracker, another specialized point tracking architecture for echocardiography. MyoTracker achieved the lowest average point trajectory error at 2.00 \pm 0.53 mm. Calculating RV Free Wall Strain (RV FWS) using MyoTracker’s point predictions resulted in a -0.3 % bias with 95 % limits of agreement from -6.1 % to 5.4 % compared to reference values from commercial software. This range falls within the interobserver variability reported in previous studies. The limits of agreement were wider for both CoTracker2 and EchoTracker, worse than the interobserver variability. At inference, MyoTracker used 67 % less GPU memory than CoTracker2 and 84 % less than EchoTracker on large sequences (100 frames). MyoTracker was 74 times faster during inference than CoTracker2 and 11 times faster than EchoTracker with our setup. Maintaining the entire sequence in the temporal context was the greatest contributor to MyoTracker’s accuracy. Slight additional gains can be made by re-enabling iterative refinement, at the cost of longer processing time.
zh

[CV-166] PS3C: An Ensemble-Based Two-Step Framework for Classification of Pep Smear Cell Images

【速读】：该论文旨在解决宫颈癌筛查中因巴氏涂片检查（Pap smear）使用增加而导致的细胞病理学家工作量激增问题，通过开发自动化工具提升工作效率。论文的关键在于提出了一种两阶段集成方法：首先，神经网络判断图像是否为不适合诊断的“垃圾”图像；若不是，则由第二个神经网络进一步分类图像内容为健康细胞、病变细胞或两者兼具。这种方案的核心在于有效区分无用图像与有价值的诊断信息，并实现精准分类。

链接: https://arxiv.org/abs/2503.10312
作者: Theo Di Piazza,Loic Boussel
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注: 7 pages, 3 figures, Grand Challenge paper accepted at ISBI 2025

点击查看摘要

Abstract:Early detection of cervical cancer is crucial for improving patient outcomes and reducing mortality by identifying precancerous lesions as soon as possible. As a result, the use of pap smear screening has significantly increased, leading to a growing demand for automated tools that can assist cytologists managing their rising workload. To address this, the Pep Smear Cell Classification Challenge (PS3C) has been organized in association with ISBI in 2025. This project aims to promote the development of automated tools for pep smear images classification. The analyzed images are grouped into four categories: healthy, unhealthy, both, and rubbish images which are considered as unsuitable for diagnosis. In this work, we propose a two-stage ensemble approach: first, a neural network determines whether an image is rubbish or not. If not, a second neural network classifies the image as containing a healthy cell, an unhealthy cell, or both.
zh

[CV-167] Markerless Tracking-Based Registration for Medical Image Motion Correction

【速读】：本文研究旨在从视频荧光造影记录中分离吞咽动力学与干扰患者运动（如头部移动、解剖结构位移及食团移动等），以实现吞咽生理的精确分析。传统光学流方法因受闪烁和不稳定性影响而失效，无法可靠地区分不同运动组。为此，论文评估了无标记跟踪方法（如CoTracker、PIPs++、TAP-Net）在医学感兴趣区域内的跟踪准确性，并发现即使稀疏跟踪点生成的形变场也优于主流配准方法（如ANTs、LDDMM、VoxelMorph）。论文的关键在于提出了一种新的运动校正管道，通过量化均方误差（MSE）和结构相似性指数（SSIM）等指标，有效消除了干扰运动，同时保留了吞咽相关的动力学特性，超越了现有竞争性配准技术。

链接: https://arxiv.org/abs/2503.10260
作者: Luisa Neubig,Deirdre Larsen,Takeshi Ikuma,Markus Kopp,Melda Kunduk,Andreas M. Kist
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注: Under review

点击查看摘要

Abstract:Our study focuses on isolating swallowing dynamics from interfering patient motion in videofluoroscopy, an X-ray technique that records patients swallowing a radiopaque bolus. These recordings capture multiple motion sources, including head movement, anatomical displacements, and bolus transit. To enable precise analysis of swallowing physiology, we aim to eliminate distracting motion, particularly head movement, while preserving essential swallowing-related dynamics. Optical flow methods fail due to artifacts like flickering and instability, making them unreliable for distinguishing different motion groups. We evaluated markerless tracking approaches (CoTracker, PIPs++, TAP-Net) and quantified tracking accuracy in key medical regions of interest. Our findings show that even sparse tracking points generate morphing displacement fields that outperform leading registration methods such as ANTs, LDDMM, and VoxelMorph. To compare all approaches, we assessed performance using MSE and SSIM metrics post-registration. We introduce a novel motion correction pipeline that effectively removes disruptive motion while preserving swallowing dynamics and surpassing competitive registration techniques. Code will be available after review.
zh

[CV-168] Automatic quality control in multi-centric fetal brain MRI super-resolution reconstruction MICCAI2025

【速读】：该论文旨在解决胎儿脑部磁共振成像（MRI）中超分辨率重建（SRR）体积的自动化质量控制（QC）问题。这一问题尤为关键，因为胎儿脑MRI的数据采集和图像处理技术相较于成人影像标准化程度较低，而SRR步骤通过将多层厚二维切片配准并组合成单一各向同性且无伪影的T2加权体积，是确保图像质量和研究可靠性的重要环节。论文的关键解决方案在于提出了一种名为FetMRQC_SR的机器学习方法，该方法提取超过100个图像质量指标，并利用随机森林模型预测图像质量评分。此方法特别适合高维、数据异质性强且样本量小的问题场景。此外，通过在跨域验证中达到较高的性能（ROC AUC = 0.89），证明其在面对未知站点或SRR方法的数据时仍具有良好的适应性。

链接: https://arxiv.org/abs/2503.10156
作者: Thomas Sanchez,Vladyslav Zalevsky,Angeline Mihailo,Gerard Martí Juan,Elisenda Eixarch,Andras Jakab,Vincent Dunet,Mériam Koob,Guillaume Auzias,Meritxell Bach Cuadra
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注: 11 pages, 3 figures; Submitted to MICCAI 2025

点击查看摘要

Abstract:Quality control (QC) has long been considered essential to guarantee the reliability of neuroimaging studies. It is particularly important for fetal brain MRI, where acquisitions and image processing techniques are less standardized than in adult imaging. In this work, we focus on automated quality control of super-resolution reconstruction (SRR) volumes of fetal brain MRI, an important processing step where multiple stacks of thick 2D slices are registered together and combined to build a single, isotropic and artifact-free T2 weighted volume. We propose FetMRQC _SR , a machine-learning method that extracts more than 100 image quality metrics to predict image quality scores using a random forest model. This approach is well suited to a problem that is high dimensional, with highly heterogeneous data and small datasets. We validate FetMRQC _SR in an out-of-domain (OOD) setting and report high performance (ROC AUC = 0.89), even when faced with data from an unknown site or SRR method. We also investigate failure cases and show that they occur in 45% of the images due to ambiguous configurations for which the rating from the expert is arguable. These results are encouraging and illustrate how a non deep learning-based method like FetMRQC _SR is well suited to this multifaceted problem. Our tool, along with all the code used to generate, train and evaluate the model will be released upon acceptance of the paper.
zh

[CV-169] Dual-domain Modulation Network for Lightweight Image Super-Resolution

【速读】：该论文致力于解决轻量级图像超分辨率（Super-Resolution, SR）任务中现有基于频率的方法无法同时有效重建整体结构与高频细节的问题，并指出这些方法在处理频率特征时效率低下且不适合轻量级应用。为了解决这些问题，论文的关键在于提出了一种结合小波（Wavelet）和傅里叶（Fourier）信息的双域调制网络。具体而言，该网络通过利用小波域调制自注意力变换器（Wavelet-domain Modulation Transformer, WMT）以及傅里叶监督机制，在空间域调制的基础上进一步优化频率特征的学习，使模型能够在降低计算成本的同时实现高质量的高频细节重建与整体结构恢复。实验结果表明，所提方法在峰值信噪比（PSNR）方面与SRFormer和MambaIR相当，但其浮点运算次数（FLOPs）仅为前者的不到50%和60%，推理速度分别快15.4倍和5.4倍，验证了该方法在轻量级SR任务中的有效性。

链接: https://arxiv.org/abs/2503.10047
作者: Wenjie Li,Heng Guo,Yuefeng Hou,Guangwei Gao,Zhanyu Ma
机构: Pattern Recognition and Intelligent System Laboratory, School of Artificial Intelligence, Beijing University of Posts and Telecommunications (BUPT) (北京邮电大学); School of Microelectronics, Tianjin University (TJU) (天津大学); Institute of Advanced Technology, Nanjing University of Posts and Telecommunications (NJUPT) (南京邮电大学)
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Lightweight image super-resolution (SR) aims to reconstruct high-resolution images from low-resolution images with limited computational costs. We find existing frequency-based SR methods cannot balance the reconstruction of overall structures and high-frequency parts. Meanwhile, these methods are inefficient for handling frequency features and unsuitable for lightweight SR. In this paper, we show introducing both wavelet and Fourier information allows our model to consider both high-frequency features and overall SR structure reconstruction while reducing costs. Specifically, we propose a dual-domain modulation network that utilize wavelet-domain modulation self-Transformer (WMT) plus Fourier supervision to modulate frequency features in addition to spatial domain modulation. Compared to existing frequency-based SR modules, our WMT is more suitable for frequency learning in lightweight SR. Experimental results show that our method achieves a comparable PSNR of SRFormer and MambaIR while with less than 50% and 60% of their FLOPs and achieving inference speeds 15.4x and 5.4x faster, respectively, demonstrating the effectiveness of our method on SR quality and lightweight. Codes will be released upon acceptance.
zh

[CV-170] CPLOYO: A Pulmonary Nodule Detection Model with Multi-Scale Feature Fusion and Nonlinear Feature Learning

【速读】：该论文旨在解决肺结节多类型检测中灵敏度不足的问题，尤其是针对不同尺度和类型的肺结节（包括小尺寸结节）的检测精度提升。论文的关键解决方案在于对YOLOv8模型进行针对性改进：首先引入C2f_RepViTCAMF模块以增强主干网络的特征提取能力，特别是针对小结节的检测精度，并实现轻量化设计；其次加入MSCAF模块重构特征融合部分，优化多尺度肺结节的检测性能；此外，通过集成KAN网络，利用其强大的非线性特征学习能力进一步提升小结节的检测精度并增强模型的泛化能力。这些改进显著提高了模型的整体检测性能，尤其是在LUNA16数据集上的测试验证了其优越性。

链接: https://arxiv.org/abs/2503.10045
作者: Meng Wang,Zi Yang,Ruifeng Zhao,Yaoting Jiang
机构: Tongji University (同济大学); Carnegie Mellon University (卡内基梅隆大学)
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:The integration of Internet of Things (IoT) technology in pulmonary nodule detection significantly enhances the intelligence and real-time capabilities of the detection system. Currently, lung nodule detection primarily focuses on the identification of solid nodules, but different types of lung nodules correspond to various forms of lung cancer. Multi-type detection contributes to improving the overall lung cancer detection rate and enhancing the cure rate. To achieve high sensitivity in nodule detection, targeted improvements were made to the YOLOv8 model. Firstly, the C2f_RepViTCAMF module was introduced to augment the C2f module in the backbone, thereby enhancing detection accuracy for small lung nodules and achieving a lightweight model design. Secondly, the MSCAF module was incorporated to reconstruct the feature fusion section of the model, improving detection accuracy for lung nodules of varying scales. Furthermore, the KAN network was integrated into the model. By leveraging the KAN network’s powerful nonlinear feature learning capability, detection accuracy for small lung nodules was further improved, and the model’s generalization ability was enhanced. Tests conducted on the LUNA16 dataset demonstrate that the improved model outperforms the original model as well as other mainstream models such as YOLOv9 and RT-DETR across various evaluation metrics.
zh

[CV-171] RSR-NF: Neural Field Regularization by Static Restoration Priors for Dynamic Imaging

【速读】：该论文旨在解决动态成像（dynamic imaging）中时空对象重建的问题，特别是在动态计算机断层扫描（dynamic computed tomography, dCT）中，由于仅能获取单视角下的欠采样投影数据，导致逆问题极为困难。此外，真实动态数据通常不可用或极度稀缺，限制了监督学习技术的应用。为应对这些挑战，论文提出了一种名为RSR-NF的方法，其关键是利用神经场（neural field, NF）表示动态对象，并通过Regularization-by-Denoising (RED) 框架结合静态深度空间先验，借助学习到的修复算子将静态先验融入变分公式中。同时，采用基于ADMM的算法实现高效优化。关键创新在于结合NF表示与静态先验，以及在动态CT场景中的性能提升。

链接: https://arxiv.org/abs/2503.10015
作者: Berk Iskender,Sushan Nakarmi,Nitin Daphalapurkar,Marc L. Klasky,Yoram Bresler
机构: University of Illinois at Urbana-Champaign (伊利诺伊大学厄巴纳-香槟分校); Los Alamos National Laboratory (洛斯阿拉莫斯国家实验室)
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Dynamic imaging involves the reconstruction of a spatio-temporal object at all times using its undersampled measurements. In particular, in dynamic computed tomography (dCT), only a single projection at one view angle is available at a time, making the inverse problem very challenging. Moreover, ground-truth dynamic data is usually either unavailable or too scarce to be used for supervised learning techniques. To tackle this problem, we propose RSR-NF, which uses a neural field (NF) to represent the dynamic object and, using the Regularization-by-Denoising (RED) framework, incorporates an additional static deep spatial prior into a variational formulation via a learned restoration operator. We use an ADMM-based algorithm with variable splitting to efficiently optimize the variational objective. We compare RSR-NF to three alternatives: NF with only temporal regularization; a recent method combining a partially-separable low-rank representation with RED using a denoiser pretrained on static data; and a deep-image prior-based model. The first comparison demonstrates the reconstruction improvements achieved by combining the NF representation with static restoration priors, whereas the other two demonstrate the improvement over state-of-the art techniques for dCT.
zh

[CV-172] Reference-Free 3D Reconstruction of Brain Dissection Photographs with Machine Learning

【速读】：该论文旨在解决利用离体脑切片照片与MRI之间的相关性构建病理微观特征与活体扫描之间联系的问题。现有方法依赖于完整的脑切片堆栈及外部参考掩模（如表面扫描仪获取），这严重限制了技术的适用性。论文提出的解决方案——RefFree，是一种无需外部参考的离体脑切片照片重建方法。其关键是通过学习方法估计每张照片中每个像素在图谱空间中的3D坐标，并使用简单的最小二乘拟合完成3D重建，同时还能生成基于图谱的分割结果。RefFree通过从数字切片的3D MRI数据生成的合成照片进行训练，并通过随机外观增强泛化能力，从而实现与现有基准方法相当的性能，同时支持部分堆栈的重建。

链接: https://arxiv.org/abs/2503.09963
作者: Lin Tian,Sean I. Young,Jonathan Williams Ramirez,Dina Zemlyanker,Lucas Jacob Deden Binder,Rogeny Herisse,Theresa R. Connors,Derek H. Oakley,Bradley T. Hyman,Oula Puonti,Matthew S. Rosen,Juan Eugenio Iglesias
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Correlation of neuropathology with MRI has the potential to transfer microscopic signatures of pathology to invivo scans. Recently, a classical registration method has been proposed, to build these correlations from 3D reconstructed stacks of dissection photographs, which are routinely taken at brain banks. These photographs bypass the need for exvivo MRI, which is not widely accessible. However, this method requires a full stack of brain slabs and a reference mask (e.g., acquired with a surface scanner), which severely limits the applicability of the technique. Here we propose RefFree, a dissection photograph reconstruction method without external reference. RefFree is a learning approach that estimates the 3D coordinates in the atlas space for every pixel in every photograph; simple least-squares fitting can then be used to compute the 3D reconstruction. As a by-product, RefFree also produces an atlas-based segmentation of the reconstructed stack. RefFree is trained on synthetic photographs generated from digitally sliced 3D MRI data, with randomized appearance for enhanced generalization ability. Experiments on simulated and real data show that RefFree achieves performance comparable to the baseline method without an explicit reference while also enabling reconstruction of partial stacks. Our code is available at this https URL.
zh

[CV-173] QuickDraw: Fast Visualization Analysis and Active Learning for Medical Image Segmentation

【速读】：该论文旨在解决医学影像分析中手动检测和分割异常耗时且易受主观差异影响的问题，并克服现有先进模型难以与临床常用医学图像查看器无缝集成的局限。解决方案的关键在于提出QuickDraw，这是一个开源的医学图像可视化与分析框架，允许用户上传DICOM图像并运行现成的机器学习模型以生成三维分割掩膜。此外，QuickDraw还支持用户编辑、导出和评估分割结果，通过主动学习进一步优化现有模型。研究结果显示，QuickDraw将人工分割CT扫描的时间从四小时缩短至六分钟，并相比先前方法减少了10%的机器学习辅助分割时间。

链接: https://arxiv.org/abs/2503.09885
作者: Daniel Syomichev,Padmini Gopinath,Guang-Lin Wei,Eric Chang,Ian Gordon,Amanuel Seifu,Rahul Pemmaraju,Neehar Peri,James Purtilo
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
备注: The first two authors contributed equally. The last three authors advised equally. This work has been accepted to the International Conference on Human Computer Interaction (HCII) 2025

点击查看摘要

Abstract:Analyzing CT scans, MRIs and X-rays is pivotal in diagnosing and treating diseases. However, detecting and identifying abnormalities from such medical images is a time-intensive process that requires expert analysis and is prone to interobserver variability. To mitigate such issues, machine learning-based models have been introduced to automate and significantly reduce the cost of image segmentation. Despite significant advances in medical image analysis in recent years, many of the latest models are never applied in clinical settings because state-of-the-art models do not easily interface with existing medical image viewers. To address these limitations, we propose QuickDraw, an open-source framework for medical image visualization and analysis that allows users to upload DICOM images and run off-the-shelf models to generate 3D segmentation masks. In addition, our tool allows users to edit, export, and evaluate segmentation masks to iteratively improve state-of-the-art models through active learning. In this paper, we detail the design of our tool and present survey results that highlight the usability of our software. Notably, we find that QuickDraw reduces the time to manually segment a CT scan from four hours to six minutes and reduces machine learning-assisted segmentation time by 10% compared to prior work. Our code and documentation are available at this https URL
zh

[CV-174] Bidirectional Learned Facial Animation Codec for Low Bitrate Talking Head Videos

【速读】：该论文旨在解决现有单向深度人脸动画编码技术在处理大头部运动时难以准确捕捉面部细节，导致面部区域出现失真的问题。为了解决这一挑战，论文提出了一种新颖的双向学习型动画编解码器（bidirectional learned animation codec），通过利用过去和未来的关键帧生成自然的人脸视频。解决方案的关键在于引入双向参考引导的辅助流增强（BRG-ASE）过程，通过自适应选择过去或未来的一个关键帧来增强非关键帧的紧凑辅助流，从而在略微增加比特率的情况下提升视频质量；随后，在双向参考引导的视频重建（BRG-VRec）过程中，利用自适应选择的关键帧和辅助帧共同重建目标帧。实验结果表明，该方法相较于最新的基于动画的视频编解码器减少了55%的比特率，并较Versatile Video Coding (VVC)标准减少了35%的比特率，验证了所提方法在提高视频质量的同时有效降低比特率的效率。

链接: https://arxiv.org/abs/2503.09787
作者: Riku Takahashi,Ryugo Morita,Fuma Kimishima,Kosuke Iwama,Jinjia Zhou
机构: Hosei University (法政大学), Tokyo, Japan
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to DCC2025

点击查看摘要

Abstract:Existing deep facial animation coding techniques efficiently compress talking head videos by applying deep generative models. Instead of compressing the entire video sequence, these methods focus on compressing only the keyframe and the keypoints of non-keyframes (target frames). The target frames are then reconstructed by utilizing a single keyframe, and the keypoints of the target frame. Although these unidirectional methods can reduce the bitrate, they rely on a single keyframe and often struggle to capture large head movements accurately, resulting in distortions in the facial region. In this paper, we propose a novel bidirectional learned animation codec that generates natural facial videos using past and future keyframes. First, in the Bidirectional Reference-Guided Auxiliary Stream Enhancement (BRG-ASE) process, we introduce a compact auxiliary stream for non-keyframes, which is enhanced by adaptively selecting one of two keyframes (past and future). This stream improves video quality with a slight increase in bitrate. Then, in the Bidirectional Reference-Guided Video Reconstruction (BRG-VRec) process, we animate the adaptively selected keyframe and reconstruct the target frame using both the animated keyframe and the auxiliary frame. Extensive experiments demonstrate a 55% bitrate reduction compared to the latest animation based video codec, and a 35% bitrate reduction compared to the latest video coding standard, Versatile Video Coding (VVC) on a talking head video dataset. It showcases the efficiency of our approach in improving video quality while simultaneously decreasing bitrate.
zh

人工智能

[AI-0] Uncertainty in Action: Confidence Elicitation in Embodied Agents

链接: https://arxiv.org/abs/2503.10628
作者: Tianjiao Yu,Vedant Shah,Muntasir Wahed,Kiet A. Nguyen,Adheesh Juvekar,Tal August,Ismini Lourentzou
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: Project page: this https URL

点击查看摘要

Abstract:Expressing confidence is challenging for embodied agents navigating dynamic multimodal environments, where uncertainty arises from both perception and decision-making processes. We present the first work investigating embodied confidence elicitation in open-ended multimodal environments. We introduce Elicitation Policies, which structure confidence assessment across inductive, deductive, and abductive reasoning, along with Execution Policies, which enhance confidence calibration through scenario reinterpretation, action sampling, and hypothetical reasoning. Evaluating agents in calibration and failure prediction tasks within the Minecraft environment, we show that structured reasoning approaches, such as Chain-of-Thoughts, improve confidence calibration. However, our findings also reveal persistent challenges in distinguishing uncertainty, particularly under abductive settings, underscoring the need for more sophisticated embodied confidence elicitation methods.

[AI-1] he Spectral Bias of Shallow Neural Network Learning is Shaped by the Choice of Non-linearity

链接: https://arxiv.org/abs/2503.10587
作者: Justin Sahs,Ryan Pyle,Fabio Anselmi,Ankit Patel
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 18 pages, 10 figures in main text

点击查看摘要

Abstract:Despite classical statistical theory predicting severe overfitting, modern massively overparameterized neural networks still generalize well. This unexpected property is attributed to the network’s so-called implicit bias, which describes its propensity to converge to solutions that generalize effectively, among the many possible that correctly label the training data. The aim of our research is to explore this bias from a new perspective, focusing on how non-linear activation functions contribute to shaping it. First, we introduce a reparameterization which removes a continuous weight rescaling symmetry. Second, in the kernel regime, we leverage this reparameterization to generalize recent findings that relate shallow Neural Networks to the Radon transform, deriving an explicit formula for the implicit bias induced by a broad class of activation functions. Specifically, by utilizing the connection between the Radon transform and the Fourier transform, we interpret the kernel regime’s inductive bias as minimizing a spectral seminorm that penalizes high-frequency components, in a manner dependent on the activation function. Finally, in the adaptive regime, we demonstrate the existence of local dynamical attractors that facilitate the formation of clusters of hyperplanes where the input to a neuron’s activation function is zero, yielding alignment between many neurons’ response functions. We confirm these theoretical results with simulations. All together, our work provides a deeper understanding of the mechanisms underlying the generalization capabilities of overparameterized neural networks and its relation with the implicit bias, offering potential pathways for designing more efficient and robust models.

[AI-2] KUDA: Keypoints to Unify Dynamics Learning and Visual Prompting for Open-Vocabulary Robotic Manipulation

链接: https://arxiv.org/abs/2503.10546
作者: Zixian Liu,Mingtong Zhang,Yunzhu Li
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
*备注: Project website: this http URL

点击查看摘要

Abstract:With the rapid advancement of large language models (LLMs) and vision-language models (VLMs), significant progress has been made in developing open-vocabulary robotic manipulation systems. However, many existing approaches overlook the importance of object dynamics, limiting their applicability to more complex, dynamic tasks. In this work, we introduce KUDA, an open-vocabulary manipulation system that integrates dynamics learning and visual prompting through keypoints, leveraging both VLMs and learning-based neural dynamics models. Our key insight is that a keypoint-based target specification is simultaneously interpretable by VLMs and can be efficiently translated into cost functions for model-based planning. Given language instructions and visual observations, KUDA first assigns keypoints to the RGB image and queries the VLM to generate target specifications. These abstract keypoint-based representations are then converted into cost functions, which are optimized using a learned dynamics model to produce robotic trajectories. We evaluate KUDA on a range of manipulation tasks, including free-form language instructions across diverse object categories, multi-object interactions, and deformable or granular objects, demonstrating the effectiveness of our framework. The project page is available at this http URL.

[AI-3] GBSVR: Granular Ball Support Vector Regression

链接: https://arxiv.org/abs/2503.10539
作者: Reshma Rastogi,Ankush Bisht,Sanjay Kumar,Suresh Chandra
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:Support Vector Regression (SVR) and its variants are widely used to handle regression tasks, however, since their solution involves solving an expensive quadratic programming problem, it limits its application, especially when dealing with large datasets. Additionally, SVR uses an epsilon-insensitive loss function which is sensitive to outliers and therefore can adversely affect its performance. We propose Granular Ball Support Vector Regression (GBSVR) to tackle problem of regression by using granular ball concept. These balls are useful in simplifying complex data spaces for machine learning tasks, however, to the best of our knowledge, they have not been sufficiently explored for regression problems. Granular balls group the data points into balls based on their proximity and reduce the computational cost in SVR by replacing the large number of data points with far fewer granular balls. This work also suggests a discretization method for continuous-valued attributes to facilitate the construction of granular balls. The effectiveness of the proposed approach is evaluated on several benchmark datasets and it outperforms existing state-of-the-art approaches

[AI-4] Conformal Prediction Sets for Deep Generative Models via Reduction to Conformal Regression

链接: https://arxiv.org/abs/2503.10512
作者: Hooman Shahrokhi,Devjeet Raj Roy,Yan Yan,Venera Arnaoudova,Janaradhan Rao Doppa
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:We consider the problem of generating valid and small prediction sets by sampling outputs (e.g., software code and natural language text) from a black-box deep generative model for a given input (e.g., textual prompt). The validity of a prediction set is determined by a user-defined binary admissibility function depending on the target application. For example, requiring at least one program in the set to pass all test cases in code generation application. To address this problem, we develop a simple and effective conformal inference algorithm referred to as Generative Prediction Sets (GPS). Given a set of calibration examples and black-box access to a deep generative model, GPS can generate prediction sets with provable guarantees. The key insight behind GPS is to exploit the inherent structure within the distribution over the minimum number of samples needed to obtain an admissible output to develop a simple conformal regression approach over the minimum number of samples. Experiments on multiple datasets for code and math word problems using different large language models demonstrate the efficacy of GPS over state-of-the-art methods.

[AI-5] DeclareAligner: A Leap Towards Efficient Optimal Alignments for Declarative Process Model Conformance Checking

链接: https://arxiv.org/abs/2503.10479
作者: Jacobo Casas-Ramos,Manuel Lama,Manuel Mucientes
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:In many engineering applications, processes must be followed precisely, making conformance checking between event logs and declarative process models crucial for ensuring adherence to desired behaviors. This is a critical area where Artificial Intelligence (AI) plays a pivotal role in driving effective process improvement. However, computing optimal alignments poses significant computational challenges due to the vast search space inherent in these models. Consequently, existing approaches often struggle with scalability and efficiency, limiting their applicability in real-world settings. This paper introduces DeclareAligner, a novel algorithm that uses the A* search algorithm, an established AI pathfinding technique, to tackle the problem from a fresh perspective leveraging the flexibility of declarative models. Key features of DeclareAligner include only performing actions that actively contribute to fixing constraint violations, utilizing a tailored heuristic to navigate towards optimal solutions, and employing early pruning to eliminate unproductive branches, while also streamlining the process through preprocessing and consolidating multiple fixes into unified actions. The proposed method is evaluated using 8,054 synthetic and real-life alignment problems, demonstrating its ability to efficiently compute optimal alignments by significantly outperforming the current state of the art. By enabling process analysts to more effectively identify and understand conformance issues, DeclareAligner has the potential to drive meaningful process improvement and management.

[AI-6] Whisper Speaker Identification: Leverag ing Pre-Trained Multilingual Transformers for Robust Speaker Embeddings

链接: https://arxiv.org/abs/2503.10446
作者: Jakaria Islam Emon,Md Abu Salek,Kazi Tamanna Alam
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
*备注: 6 pages

点击查看摘要

Abstract:Speaker identification in multilingual settings presents unique challenges, particularly when conventional models are predominantly trained on English data. In this paper, we propose WSI (Whisper Speaker Identification), a framework that repurposes the encoder of the Whisper automatic speech recognition model pre trained on extensive multilingual data to generate robust speaker embeddings via a joint loss optimization strategy that leverages online hard triplet mining and self supervised Normalized Temperature-scaled Cross Entropy loss. By capitalizing on Whisper language-agnostic acoustic representations, our approach effectively distinguishes speakers across diverse languages and recording conditions. Extensive evaluations on multiple corpora, including VoxTube (multilingual), JVS (Japanese), CallHome (German, Spanish, Chinese, and Japanese), and Voxconverse (English), demonstrate that WSI consistently outperforms state-of-the-art baselines, namely Pyannote Embedding, ECAPA TDNN, and Xvector, in terms of lower equal error rates and higher AUC scores. These results validate our hypothesis that a multilingual pre-trained ASR encoder, combined with joint loss optimization, substantially improves speaker identification performance in non-English languages.

[AI-7] dFLMoE: Decentralized Federated Learning via Mixture of Experts for Medical Data Analysis

链接: https://arxiv.org/abs/2503.10412
作者: Luyuan Xie,Tianyu Luan,Wenyuan Cai,Guochen Yan,Zhaoyu Chen,Nan Xi,Yuejian Fang,Qingni Shen,Zhonghai Wu,Junsong Yuan
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Federated learning has wide applications in the medical field. It enables knowledge sharing among different healthcare institutes while protecting patients’ privacy. However, existing federated learning systems are typically centralized, requiring clients to upload client-specific knowledge to a central server for aggregation. This centralized approach would integrate the knowledge from each client into a centralized server, and the knowledge would be already undermined during the centralized integration before it reaches back to each client. Besides, the centralized approach also creates a dependency on the central server, which may affect training stability if the server malfunctions or connections are unstable. To address these issues, we propose a decentralized federated learning framework named dFLMoE. In our framework, clients directly exchange lightweight head models with each other. After exchanging, each client treats both local and received head models as individual experts, and utilizes a client-specific Mixture of Experts (MoE) approach to make collective decisions. This design not only reduces the knowledge damage with client-specific aggregations but also removes the dependency on the central server to enhance the robustness of the framework. We validate our framework on multiple medical tasks, demonstrating that our method evidently outperforms state-of-the-art approaches under both model homogeneity and heterogeneity settings.

[AI-8] Enhance Exploration in Safe Reinforcement Learning with Contrastive Representation Learning

链接: https://arxiv.org/abs/2503.10318
作者: Duc Kien Doan,Bang Giang Le,Viet Cuong Ta
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: Accepted at ACIIDS 2025

点击查看摘要

Abstract:In safe reinforcement learning, agent needs to balance between exploration actions and safety constraints. Following this paradigm, domain transfer approaches learn a prior Q-function from the related environments to prevent unsafe actions. However, because of the large number of false positives, some safe actions are never executed, leading to inadequate exploration in sparse-reward environments. In this work, we aim to learn an efficient state representation to balance the exploration and safety-prefer action in a sparse-reward environment. Firstly, the image input is mapped to latent representation by an auto-encoder. A further contrastive learning objective is employed to distinguish safe and unsafe states. In the learning phase, the latent distance is used to construct an additional safety check, which allows the agent to bias the exploration if it visits an unsafe state. To verify the effectiveness of our method, the experiment is carried out in three navigation-based MiniGrid environments. The result highlights that our method can explore the environment better while maintaining a good balance between safety and efficiency.

[AI-9] Nash Equilibrium Constrained Auto-bidding With Bi-level Reinforcement Learning

链接: https://arxiv.org/abs/2503.10304
作者: Zhiyu Mou,Miao Xu,Rongquan Bai,Zhuoran Yang,Chuan Yu,Jian Xu,Bo Zheng
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT)
*备注:

点击查看摘要

Abstract:Many online advertising platforms provide advertisers with auto-bidding services to enhance their advertising performance. However, most existing auto-bidding algorithms fail to accurately capture the auto-bidding problem formulation that the platform truly faces, let alone solve it. Actually, we argue that the platform should try to help optimize each advertiser’s performance to the greatest extent – which makes \epsilon -Nash Equilibrium ( \epsilon -NE) a necessary solution concept – while maximizing the social welfare of all the advertisers for the platform’s long-term value. Based on this, we introduce the \emphNash-Equilibrium Constrained Bidding (NCB), a new formulation of the auto-bidding problem from the platform’s perspective. Specifically, it aims to maximize the social welfare of all advertisers under the \epsilon -NE constraint. However, the NCB problem presents significant challenges due to its constrained bi-level structure and the typically large number of advertisers involved. To address these challenges, we propose a \emphBi-level Policy Gradient (BPG) framework with theoretical guarantees. Notably, its computational complexity is independent of the number of advertisers, and the associated gradients are straightforward to compute. Extensive simulated and real-world experiments validate the effectiveness of the BPG framework.

[AI-10] PyGDA: A Python Library for Graph Domain Adaptation

链接: https://arxiv.org/abs/2503.10284
作者: Zhen Zhang,Meihan Liu,Bingsheng He
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: Under Review

点击查看摘要

Abstract:Graph domain adaptation has emerged as a promising approach to facilitate knowledge transfer across different domains. Recently, numerous models have been proposed to enhance their generalization capabilities in this field. However, there is still no unified library that brings together existing techniques and simplifies their implementation. To fill this gap, we introduce PyGDA, an open-source Python library tailored for graph domain adaptation. As the first comprehensive library in this area, PyGDA covers more than 20 widely used graph domain adaptation methods together with different types of graph datasets. Specifically, PyGDA offers modular components, enabling users to seamlessly build custom models with a variety of commonly used utility functions. To handle large-scale graphs, PyGDA includes support for features such as sampling and mini-batch processing, ensuring efficient computation. In addition, PyGDA also includes comprehensive performance benchmarks and well-documented user-friendly API for both researchers and practitioners. To foster convenient accessibility, PyGDA is released under the MIT license at this https URL, and the API documentation is this https URL.

[AI-11] SurgRAW: Multi-Agent Workflow with Chain-of-Thought Reasoning for Surgical Intelligence

链接: https://arxiv.org/abs/2503.10265
作者: Chang Han Low,Ziyue Wang,Tianyi Zhang,Zhitao Zeng,Zhu Zhuo,Evangelos B. Mazomenos,Yueming Jin
类目: Artificial Intelligence (cs.AI); Robotics (cs.RO)
*备注:

点击查看摘要

Abstract:Integration of Vision-Language Models (VLMs) in surgical intelligence is hindered by hallucinations, domain knowledge gaps, and limited understanding of task interdependencies within surgical scenes, undermining clinical reliability. While recent VLMs demonstrate strong general reasoning and thinking capabilities, they still lack the domain expertise and task-awareness required for precise surgical scene interpretation. Although Chain-of-Thought (CoT) can structure reasoning more effectively, current approaches rely on self-generated CoT steps, which often exacerbate inherent domain gaps and hallucinations. To overcome this, we present SurgRAW, a CoT-driven multi-agent framework that delivers transparent, interpretable insights for most tasks in robotic-assisted surgery. By employing specialized CoT prompts across five tasks: instrument recognition, action recognition, action prediction, patient data extraction, and outcome assessment, SurgRAW mitigates hallucinations through structured, domain-aware reasoning. Retrieval-Augmented Generation (RAG) is also integrated to external medical knowledge to bridge domain gaps and improve response reliability. Most importantly, a hierarchical agentic system ensures that CoT-embedded VLM agents collaborate effectively while understanding task interdependencies, with a panel discussion mechanism promotes logical consistency. To evaluate our method, we introduce SurgCoTBench, the first reasoning-based dataset with structured frame-level annotations. With comprehensive experiments, we demonstrate the effectiveness of proposed SurgRAW with 29.32% accuracy improvement over baseline VLMs on 12 robotic procedures, achieving the state-of-the-art performance and advancing explainable, trustworthy, and autonomous surgical assistance.

[AI-12] PIMRL: Physics-Informed Multi-Scale Recurrent Learning for Spatiotemporal Prediction

链接: https://arxiv.org/abs/2503.10253
作者: Han Wan,Qi Wang,Hao Sun
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Simulation of spatiotemporal systems governed by partial differential equations is widely applied in fields such as biology, chemistry, aerospace dynamics, and meteorology. Traditional numerical methods incur high computational costs due to the requirement of small time steps for accurate predictions. While machine learning has reduced these costs, long-term predictions remain challenged by error accumulation, particularly in scenarios with insufficient data or varying time scales, where stability and accuracy are compromised. Existing methods often neglect the effective utilization of multi-scale data, leading to suboptimal robustness in predictions. To address these issues, we propose a novel multi-scale learning framework, namely, the Physics-Informed Multi-Scale Recurrent Learning (PIMRL), to effectively leverage multi-scale data for spatiotemporal dynamics prediction. The PIMRL framework comprises two modules: the micro-scale module embeds physical knowledge into neural networks via pretraining, and the macro-scale module adopts a data-driven approach to learn the temporal evolution of physics in the latent space. Experimental results demonstrate that the PIMRL framework consistently achieves state-of-the-art performance across five benchmark datasets ranging from one to three dimensions, showing average improvements of over 9% in both RMSE and MAE evaluation metrics, with maximum enhancements reaching up to 80%.

[AI-13] LLM Agents Display Human Biases but Exhibit Distinct Learning Patterns

链接: https://arxiv.org/abs/2503.10248
作者: Idan Horowitz,Ori Plonsky
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:We investigate the choice patterns of Large Language Models (LLMs) in the context of Decisions from Experience tasks that involve repeated choice and learning from feedback, and compare their behavior to human participants. We find that on the aggregate, LLMs appear to display behavioral biases similar to humans: both exhibit underweighting rare events and correlation effects. However, more nuanced analyses of the choice patterns reveal that this happens for very different reasons. LLMs exhibit strong recency biases, unlike humans, who appear to respond in more sophisticated ways. While these different processes may lead to similar behavior on average, choice patterns contingent on recent events differ vastly between the two groups. Specifically, phenomena such as surprise triggers change" and the wavy recency effect of rare events" are robustly observed in humans, but entirely absent in LLMs. Our findings provide insights into the limitations of using LLMs to simulate and predict humans in learning environments and highlight the need for refined analyses of their behavior when investigating whether they replicate human decision making tendencies.

[AI-14] Efficient Federated Fine-Tuning of Large Language Models with Layer Dropout

链接: https://arxiv.org/abs/2503.10217
作者: Shilong Wang,Jianchun Liu,Hongli Xu,Jiaming Yan,Xianjun Gao
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注: 13 pages

点击查看摘要

Abstract:Fine-tuning plays a crucial role in enabling pre-trained LLMs to evolve from general language comprehension to task-specific expertise. To preserve user data privacy, federated fine-tuning is often employed and has emerged as the de facto paradigm. However, federated fine-tuning is prohibitively inefficient due to the tension between LLM complexity and the resource constraint of end devices, incurring unaffordable fine-tuning overhead. Existing literature primarily utilizes parameter-efficient fine-tuning techniques to mitigate communication costs, yet computational and memory burdens continue to pose significant challenges for developers. This work proposes DropPEFT, an innovative federated PEFT framework that employs a novel stochastic transformer layer dropout method, enabling devices to deactivate a considerable fraction of LLMs layers during training, thereby eliminating the associated computational load and memory footprint. In DropPEFT, a key challenge is the proper configuration of dropout ratios for layers, as overhead and training performance are highly sensitive to this setting. To address this challenge, we adaptively assign optimal dropout-ratio configurations to devices through an exploration-exploitation strategy, achieving efficient and effective fine-tuning. Extensive experiments show that DropPEFT can achieve a 1.3-6.3\times speedup in model convergence and a 40%-67% reduction in memory footprint compared to state-of-the-art methods.

[AI-15] Adaptive Preference Aggregation

链接: https://arxiv.org/abs/2503.10215
作者: Benjamin Heymann
类目: Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT)
*备注:

点击查看摘要

Abstract:AI alignment, the challenge of ensuring AI systems act in accordance with human values, has emerged as a critical problem in the development of systems such as foundation models and recommender systems. Still, the current dominant approach, reinforcement learning with human feedback (RLHF) faces known theoretical limitations in aggregating diverse human preferences. Social choice theory provides a framework to aggregate preferences, but was not developed for the multidimensional applications typical of AI. Leveraging insights from a recently published urn process, this work introduces a preference aggregation strategy that adapts to the user’s context and that inherits the good properties of the maximal lottery, a Condorcet-consistent solution concept.

[AI-16] Deep Learning for Time Series Forecasting: A Survey

链接: https://arxiv.org/abs/2503.10198
作者: Xiangjie Kong,Zhenghao Chen,Weiyao Liu,Kaili Ning,Lechao Zhang,Syauqie Muhammad Marier,Yichen Liu,Yuhao Chen,Feng Xia
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Time series forecasting (TSF) has long been a crucial task in both industry and daily life. Most classical statistical models may have certain limitations when applied to practical scenarios in fields such as energy, healthcare, traffic, meteorology, and economics, especially when high accuracy is required. With the continuous development of deep learning, numerous new models have emerged in the field of time series forecasting in recent years. However, existing surveys have not provided a unified summary of the wide range of model architectures in this field, nor have they given detailed summaries of works in feature extraction and datasets. To address this gap, in this review, we comprehensively study the previous works and summarize the general paradigms of Deep Time Series Forecasting (DTSF) in terms of model architectures. Besides, we take an innovative approach by focusing on the composition of time series and systematically explain important feature extraction methods. Additionally, we provide an overall compilation of datasets from various domains in existing works. Finally, we systematically emphasize the significant challenges faced and future research directions in this field.

[AI-17] Multi-Agent Q-Learning Dynamics in Random Networks: Convergence due to Exploration and Sparsity

链接: https://arxiv.org/abs/2503.10186
作者: Aamal Hussain,Dan Leonte,Francesco Belardinelli,Raphael Huser,Dario Paccagnan
类目: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT); Dynamical Systems (math.DS)
*备注:

点击查看摘要

Abstract:Beyond specific settings, many multi-agent learning algorithms fail to converge to an equilibrium solution, and instead display complex, non-stationary behaviours such as recurrent or chaotic orbits. In fact, recent literature suggests that such complex behaviours are likely to occur when the number of agents increases. In this paper, we study Q-learning dynamics in network polymatrix games where the network structure is drawn from classical random graph models. In particular, we focus on the Erdos-Renyi model, a well-studied model for social networks, and the Stochastic Block model, which generalizes the above by accounting for community structures within the network. In each setting, we establish sufficient conditions under which the agents’ joint strategies converge to a unique equilibrium. We investigate how this condition depends on the exploration rates, payoff matrices and, crucially, the sparsity of the network. Finally, we validate our theoretical findings through numerical simulations and demonstrate that convergence can be reliably achieved in many-agent systems, provided network sparsity is controlled.

[AI-18] ImageScope: Unifying Language-Guided Image Retrieval via Large Multimodal Model Collective Reasoning WWW2025

链接: https://arxiv.org/abs/2503.10166
作者: Pengfei Luo,Jingbo Zhou,Tong Xu,Yuan Xia,Linli Xu,Enhong Chen
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
*备注: WWW 2025

点击查看摘要

Abstract:With the proliferation of images in online content, language-guided image retrieval (LGIR) has emerged as a research hotspot over the past decade, encompassing a variety of subtasks with diverse input forms. While the development of large multimodal models (LMMs) has significantly facilitated these tasks, existing approaches often address them in isolation, requiring the construction of separate systems for each task. This not only increases system complexity and maintenance costs, but also exacerbates challenges stemming from language ambiguity and complex image content, making it difficult for retrieval systems to provide accurate and reliable results. To this end, we propose ImageScope, a training-free, three-stage framework that leverages collective reasoning to unify LGIR tasks. The key insight behind the unification lies in the compositional nature of language, which transforms diverse LGIR tasks into a generalized text-to-image retrieval process, along with the reasoning of LMMs serving as a universal verification to refine the results. To be specific, in the first stage, we improve the robustness of the framework by synthesizing search intents across varying levels of semantic granularity using chain-of-thought (CoT) reasoning. In the second and third stages, we then reflect on retrieval results by verifying predicate propositions locally, and performing pairwise evaluations globally. Experiments conducted on six LGIR datasets demonstrate that ImageScope outperforms competitive baselines. Comprehensive evaluations and ablation studies further confirm the effectiveness of our design.

[AI-19] Multiplicative Learning

链接: https://arxiv.org/abs/2503.10144
作者: Han Kim,Hyungjoon Soh,Vipul Periwal,Junghyo Jo
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Efficient training of artificial neural networks remains a key challenge in deep learning. Backpropagation (BP), the standard learning algorithm, relies on gradient descent and typically requires numerous iterations for convergence. In this study, we introduce Expectation Reflection (ER), a novel learning approach that updates weights multiplicatively based on the ratio of observed to predicted outputs. Unlike traditional methods, ER maintains consistency without requiring ad hoc loss functions or learning rate hyperparameters. We extend ER to multilayer networks and demonstrate its effectiveness in performing image classification tasks. Notably, ER achieves optimal weight updates in a single iteration. Additionally, we reinterpret ER as a modified form of gradient descent incorporating the inverse mapping of target propagation. These findings suggest that ER provides an efficient and scalable alternative for training neural networks.

[AI-20] StepMathAgent : A Step-Wise Agent for Evaluating Mathematical Processes through Tree-of-Error

链接: https://arxiv.org/abs/2503.10105
作者: Shu-Xun Yang,Cunxiang Wang,Yidong Wang,Xiaotao Gu,Minlie Huang,Jie Tang
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Evaluating mathematical capabilities is critical for assessing the overall performance of large language models (LLMs). However, existing evaluation methods often focus solely on final answers, resulting in highly inaccurate and uninterpretable evaluation outcomes, as well as their failure to assess proof or open-ended problems. To address these issues, we propose a novel mathematical process evaluation agent based on Tree-of-Error, called StepMathAgent. This agent incorporates four internal core operations: logical step segmentation, step scoring, score aggregation and error tree generation, along with four external extension modules: difficulty calibration, simplicity evaluation, completeness validation and format assessment. Furthermore, we introduce StepMathBench, a benchmark comprising 1,000 step-divided process evaluation instances, derived from 200 high-quality math problems grouped by problem type, subject category and difficulty level. Experiments on StepMathBench show that our proposed StepMathAgent outperforms all state-of-the-art methods, demonstrating human-aligned evaluation preferences and broad applicability to various scenarios. Our data and code are available at this https URL.

[AI-21] Parallelizing Multi-objective A* Search

链接: https://arxiv.org/abs/2503.10075
作者: Saman Ahmadi,Nathan R. Sturtevant,Andrea Raith,Daniel Harabor,Mahdi Jalili
类目: Artificial Intelligence (cs.AI)
*备注: 8 page, 2 tables, 2 figures

点击查看摘要

Abstract:The Multi-objective Shortest Path (MOSP) problem is a classic network optimization problem that aims to find all Pareto-optimal paths between two points in a graph with multiple edge costs. Recent studies on multi-objective search with A* (MOA*) have demonstrated superior performance in solving difficult MOSP instances. This paper presents a novel search framework that allows efficient parallelization of MOA* with different objective orders. The framework incorporates a unique upper bounding strategy that helps the search reduce the problem’s dimensionality to one in certain cases. Experimental results demonstrate that the proposed framework can enhance the performance of recent A*-based solutions, with the speed-up proportional to the problem dimension.

[AI-22] Advanced Tool Learning and Selection System (ATLASS): A Closed-Loop Framework Using LLM

链接: https://arxiv.org/abs/2503.10071
作者: Mohd Ariful Haque,Justin Williams,Sunzida Siddique,Md. Hujaifa Islam,Hasmot Ali,Kishor Datta Gupta,Roy George
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:The combination of LLM agents with external tools enables models to solve complex tasks beyond their knowledge base. Human-designed tools are inflexible and restricted to solutions within the scope of pre-existing tools created by experts. To address this problem, we propose ATLASS, an advanced tool learning and selection system designed as a closed-loop framework. It enables the LLM to solve problems by dynamically generating external tools on demand. In this framework, agents play a crucial role in orchestrating tool selection, execution, and refinement, ensuring adaptive problem-solving capabilities. The operation of ATLASS follows three phases: The first phase, Understanding Tool Requirements, involves the Agents determining whether tools are required and specifying their functionality; the second phase, Tool Retrieval/Generation, involves the Agents retrieving or generating tools based on their availability; and the third phase, Task Solving, involves combining all the component tools necessary to complete the initial task. The Tool Dataset stores the generated tools, ensuring reusability and minimizing inference cost. Current LLM-based tool generation systems have difficulty creating complex tools that need APIs or external packages. In ATLASS, we solve the problem by automatically setting up the environment, fetching relevant API documentation online, and using a Python interpreter to create a reliable, versatile tool that works in a wider range of situations. OpenAI GPT-4.0 is used as the LLM agent, and safety and ethical concerns are handled through human feedback before executing generated code. By addressing the limitations of predefined toolsets and enhancing adaptability, ATLASS serves as a real-world solution that empowers users with dynamically generated tools for complex problem-solving.

[AI-23] AhaRobot: A Low-Cost Open-Source Bimanual Mobile Manipulator for Embodied AI

链接: https://arxiv.org/abs/2503.10070
作者: Haiqin Cui,Yifu Yuan,Yan Zheng,Jianye Hao
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: The first two authors contributed equally. Website: this https URL

点击查看摘要

Abstract:Navigation and manipulation in open-world environments remain unsolved challenges in the Embodied AI. The high cost of commercial mobile manipulation robots significantly limits research in real-world scenes. To address this issue, we propose AhaRobot, a low-cost and fully open-source dual-arm mobile manipulation robot system with a hardware cost of only 1,000 (excluding optional computational resources), which is less than 1/15 of the cost of popular mobile robots. The AhaRobot system consists of three components: (1) a novel low-cost hardware architecture primarily composed of off-the-shelf components, (2) an optimized control solution to enhance operational precision integrating dual-motor backlash control and static friction compensation, and (3) a simple remote teleoperation method RoboPilot. We use handles to control the dual arms and pedals for whole-body movement. The teleoperation process is low-burden and easy to operate, much like piloting. RoboPilot is designed for remote data collection in embodied scenarios. Experimental results demonstrate that RoboPilot significantly enhances data collection efficiency in complex manipulation tasks, achieving a 30% increase compared to methods using 3D mouse and leader-follower systems. It also excels at completing extremely long-horizon tasks in one go. Furthermore, AhaRobot can be used to learn end-to-end policies and autonomously perform complex manipulation tasks, such as pen insertion and cleaning up the floor. We aim to build an affordable yet powerful platform to promote the development of embodied tasks on real devices, advancing more robust and reliable embodied AI. All hardware and software systems are available at this https URL.

[AI-24] Deep Learning Approaches for Anti-Money Laundering on Mobile Transactions: Review Framework and Directions

链接: https://arxiv.org/abs/2503.10058
作者: Jiani Fan,Lwin Khin Shar,Ruichen Zhang,Ziyao Liu,Wenzhuo Yang,Dusit Niyato,Bomin Mao,Kwok-Yan Lam
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
*备注:

点击查看摘要

Abstract:Money laundering is a financial crime that obscures the origin of illicit funds, necessitating the development and enforcement of anti-money laundering (AML) policies by governments and organizations. The proliferation of mobile payment platforms and smart IoT devices has significantly complicated AML investigations. As payment networks become more interconnected, there is an increasing need for efficient real-time detection to process large volumes of transaction data on heterogeneous payment systems by different operators such as digital currencies, cryptocurrencies and account-based payments. Most of these mobile payment networks are supported by connected devices, many of which are considered loT devices in the FinTech space that constantly generate data. Furthermore, the growing complexity and unpredictability of transaction patterns across these networks contribute to a higher incidence of false positives. While machine learning solutions have the potential to enhance detection efficiency, their application in AML faces unique challenges, such as addressing privacy concerns tied to sensitive financial data and managing the real-world constraint of limited data availability due to data regulations. Existing surveys in the AML literature broadly review machine learning approaches for money laundering detection, but they often lack an in-depth exploration of advanced deep learning techniques - an emerging field with significant potential. To address this gap, this paper conducts a comprehensive review of deep learning solutions and the challenges associated with their use in AML. Additionally, we propose a novel framework that applies the least-privilege principle by integrating machine learning techniques, codifying AML red flags, and employing account profiling to provide context for predictions and enable effective fraud detection under limited data availability…

[AI-25] OR-LLM -Agent : Automating Modeling and Solving of Operations Research Optimization Problem with Reasoning Large Language Model

链接: https://arxiv.org/abs/2503.10009
作者: Bowen Zhang,Pengcheng Luo
类目: Artificial Intelligence (cs.AI); Optimization and Control (math.OC)
*备注: 11 pages, 6 figures

点击查看摘要

Abstract:Operations Research (OR) has been widely applied in various fields such as resource allocation, production planning, and supply chain management. However, addressing real-world OR problems requires OR experts to perform mathematical modeling and programmers to develop solution algorithms. This traditional method, heavily reliant on experts, is costly and has long development cycles, severely limiting the widespread adoption of OR techniques. Few have considered using Artificial Intelligence (AI) to replace professionals to achieve fully automated solutions for OR problems. We propose OR-LLM-Agent, the first AI agent that enables end-to-end automation for solving real-world OR problems. OR-LLM-Agent leverages the Chain-of-Thought (CoT) reasoning capabilities of Large Language Models (LLMs) to translate natural language problem descriptions into formal mathematical models and automatically generate Gurobi solver code. In OR-LLM-Agent, OR-CodeAgent is designed to automate code execution and repair within a sandbox environment, facilitating the derivation of the final solution. Due to the lack of dedicated benchmark datasets for evaluating the automated solving of OR problems, we construct a benchmark dataset comprising 83 real-world OR problems described in natural language. We conduct comparative experiments with state-of-the-art (SOTA) reasoning LLMs, including GPT-o3-mini, DeepSeek-R1, and Gemini 2.0 Flash Thinking. The OR-LLM-Agent achieved the highest pass rate of 100% and the highest solution accuracy of 85%, demonstrating the feasibility of automated OR problem-solving. Data and code have been publicly available at this https URL.

[AI-26] A New Benchmark for Few-Shot Class-Incremental Learning: Redefining the Upper Bound

链接: https://arxiv.org/abs/2503.10003
作者: Shiwon Kim,Dongjun Hwang,Sungwon Woo,Rita Singh
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Class-incremental learning (CIL) aims to continuously adapt to emerging classes while retaining knowledge of previously learned ones. Few-shot class-incremental learning (FSCIL) presents an even greater challenge which requires the model to learn incremental classes with only a limited number of samples. In conventional CIL, joint training is widely considered the upper bound, serving as both a benchmark and a methodological guide. However, we find that joint training fails to be a meaningful upper bound in FSCIL due to the inherent difficulty of inter-task class separation (ICS) caused by severe class imbalance. In this work, we introduce a new joint training benchmark tailored for FSCIL by integrating imbalance-aware techniques, effectively bridging the performance gap between base and incremental classes. Furthermore, we point out inconsistencies in the experimental setup and evaluation of existing FSCIL methods. To ensure fair comparisons between different FSCIL approaches and joint training, we standardize training conditions and propose a unified evaluation protocol that simultaneously considers the validation set and computational complexity. By establishing a reliable upper bound and a standardized evaluation framework for FSCIL, our work provides a clear benchmark and a practical foundation for future research.

[AI-27] Label Unbalance in High-frequency Trading

链接: https://arxiv.org/abs/2503.09988
作者: Zijian Zhao,Xuming Chen,Jiayu Wen,Mingwen Liu,Xiaoteng Ma
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Finance (q-fin.CP)
*备注: Technical Report

点击查看摘要

Abstract:In financial trading, return prediction is one of the foundation for a successful trading system. By the fast development of the deep learning in various areas such as graphical processing, natural language, it has also demonstrate significant edge in handling with financial data. While the success of the deep learning relies on huge amount of labeled sample, labeling each time/event as profitable or unprofitable, under the transaction cost, especially in the high-frequency trading world, suffers from serious label imbalance this http URL this paper, we adopts rigurious end-to-end deep learning framework with comprehensive label imbalance adjustment methods and succeed in predicting in high-frequency return in the Chinese future market. The code for our method is publicly available at this https URL .

[AI-28] Optimizing Fire Safety: Reducing False Alarms Using Advanced Machine Learning Techniques

链接: https://arxiv.org/abs/2503.09960
作者: Muhammad Hassan Jamal,Abdulwahab Alazeb,Shahid Allah Bakhsh,Wadii Boulila,Syed Aziz Shah,Aizaz Ahmad Khattak,Muhammad Shahbaz Khan
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Fire safety practices are important to reduce the extent of destruction caused by fire. While smoke alarms help save lives, firefighters struggle with the increasing number of false alarms. This paper presents a precise and efficient Weighted ensemble model for decreasing false alarms. It estimates the density, computes weights according to the high and low-density regions, forwards the high region weights to KNN and low region weights to XGBoost and combines the predictions. The proposed model is effective at reducing response time, increasing fire safety, and minimizing the damage that fires cause. A specifically designed dataset for smoke detection is utilized to test the proposed model. In addition, a variety of ML models, such as Logistic Regression (LR), Decision Tree (DT), Random Forest (RF), Nai:ve Bayes (NB), K-Nearest Neighbour (KNN), Support Vector Machine (SVM), Extreme Gradient Boosting (XGBoost), Adaptive Boosting (ADAB), have also been utilized. To maximize the use of the smoke detection dataset, all the algorithms utilize the SMOTE re-sampling technique. After evaluating the assessment criteria, this paper presents a concise summary of the comprehensive findings obtained by comparing the outcomes of all models.

[AI-29] Identifying Trustworthiness Challenges in Deep Learning Models for Continental-Scale Water Quality Prediction

链接: https://arxiv.org/abs/2503.09947
作者: Xiaobo Xia,Xiaofeng Liu,Jiale Liu,Kuai Fang,Lu Lu,Samet Oymak,William S. Currie,Tongliang Liu
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 33 pages, 9 figures, 2 tables

点击查看摘要

Abstract:Water quality is foundational to environmental sustainability, ecosystem resilience, and public health. Deep learning models, particularly Long Short-Term Memory (LSTM) networks, offer transformative potential for large-scale water quality prediction and scientific insights generation. However, their widespread adoption in high-stakes decision-making, such as pollution mitigation and equitable resource allocation, is prevented by unresolved trustworthiness challenges including fairness, uncertainty, interpretability, robustness, generalizability, and reproducibility. In this work, we present the first comprehensive evaluation of trustworthiness in a continental-scale multi-task LSTM model predicting 20 water quality variables (encompassing physical/chemical processes, geochemical weathering, and nutrient cycling) across 482 U.S. basins. Our investigation uncovers systematic patterns of model performance disparities linked to basin characteristics, the inherent complexity of biogeochemical processes, and variable predictability, emphasizing critical performance fairness concerns. We further propose methodological frameworks for quantitatively evaluating critical aspects of trustworthiness, including uncertainty, interpretability, and robustness, identifying key limitations that could challenge reliable real-world deployment. This work serves as a timely call to action for advancing trustworthy data-driven methods for water resources management and provides a pathway to offering critical insights for researchers, decision-makers, and practitioners seeking to leverage artificial intelligence (AI) responsibly in environmental management.

[AI-30] XpLogic: Explaining Logic Types and Patterns in DiffLogic Networks

链接: https://arxiv.org/abs/2503.09910
作者: Stephen Wormald,David Koblah,Matheus Kunzler Maldaner,Domenic Forte,Damon L. Woodard
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: Conference submission, 6 pages, 2 figures

点击查看摘要

Abstract:Constraining deep neural networks (DNNs) to learn individual logic types per node, as performed using the DiffLogic network architecture, opens the door to model-specific explanation techniques that quell the complexity inherent to DNNs. Inspired by principles of circuit analysis from computer engineering, this work presents an algorithm (eXpLogic) for producing saliency maps which explain input patterns that activate certain functions. The eXpLogic explanations: (1) show the exact set of inputs responsible for a decision, which helps interpret false negative and false positive predictions, (2) highlight common input patterns that activate certain outputs, and (3) help reduce the network size to improve class-specific inference. To evaluate the eXpLogic saliency map, we introduce a metric that quantifies how much an input changes before switching a model’s class prediction (the SwitchDist) and use this metric to compare eXpLogic against the Vanilla Gradients (VG) and Integrated Gradient (IG) methods. Generally, we show that eXpLogic saliency maps are better at predicting which inputs will change the class score. These maps help reduce the network size and inference times by 87% and 8%, respectively, while having a limited impact (-3.8%) on class-specific predictions. The broader value of this work to machine learning is in demonstrating how certain DNN architectures promote explainability, which is relevant to healthcare, defense, and law.

[AI-31] AI Rivalry as a Craft: How Resisting and Embracing Generative AI Reshape Writing Professions

链接: https://arxiv.org/abs/2503.09901
作者: Rama Adithya Varanasi,Batia Mishan Wiesenfeld,Oded Nov
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Generative AI (GAI) technologies are disrupting professional writing, challenging traditional practices. Recent studies explore GAI adoption experiences of creative practitioners, but we know little about how these experiences evolve into established practices and how GAI resistance alters these practices. To address this gap, we conducted 25 semi-structured interviews with writing professionals who adopted and/or resisted GAI. Using the theoretical lens of Job Crafting, we identify four strategies professionals employ to reshape their roles. Writing professionals employed GAI resisting strategies to maximize human potential, reinforce professional identity, carve out a professional niche, and preserve credibility within their networks. In contrast, GAI-enabled strategies allowed writers who embraced GAI to enhance desirable workflows, minimize mundane tasks, and engage in new AI-managerial labor. These strategies amplified their collaborations with GAI while reducing their reliance on other people. We conclude by discussing implications of GAI practices on writers’ identity and practices as well as crafting theory.

[AI-32] Media and responsible AI governance: a game-theoretic and LLM analysis

链接: https://arxiv.org/abs/2503.09858
作者: Nataliya Balabanova,Adeela Bashir,Paolo Bova,Alessio Buscemi,Theodor Cimpeanu,Henrique Correia da Fonseca,Alessandro Di Stefano,Manh Hong Duong,Elias Fernandez Domingos,Antonio Fernandes, TheAnh Han,Marcus Krellner,Ndidi Bianca Ogbo,Simon T. Powers,Daniele Proverbio,Fernando P. Santos,Zia Ush Shamszaman,Zhao Song
类目: Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT); Multiagent Systems (cs.MA); Chaotic Dynamics (nlin.CD)
*备注:

点击查看摘要

Abstract:This paper investigates the complex interplay between AI developers, regulators, users, and the media in fostering trustworthy AI systems. Using evolutionary game theory and large language models (LLMs), we model the strategic interactions among these actors under different regulatory regimes. The research explores two key mechanisms for achieving responsible governance, safe AI development and adoption of safe AI: incentivising effective regulation through media reporting, and conditioning user trust on commentariats’ recommendation. The findings highlight the crucial role of the media in providing information to users, potentially acting as a form of “soft” regulation by investigating developers or regulators, as a substitute to institutional AI regulation (which is still absent in many regions). Both game-theoretic analysis and LLM-based simulations reveal conditions under which effective regulation and trustworthy AI development emerge, emphasising the importance of considering the influence of different regulatory regimes from an evolutionary game-theoretic perspective. The study concludes that effective governance requires managing incentives and costs for high quality commentaries.

[AI-33] raining Human-Robot Teams by Improving Transparency Through a Virtual Spectator Interface ICRA2025

链接: https://arxiv.org/abs/2503.09849
作者: Sean Dallas(1),Hongjiao Qiang(2),Motaz AbuHijleh(1),Wonse Jo(2),Kayla Riegner(3),Jon Smereka(3),Lionel Robert(2),Wing-Yue Louie(1),Dawn M. Tilbury(2) ((1) Oakland University, (2) University of Michigan, (3) Ground Vehicle Systems Center (GVSC))
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Robotics (cs.RO)
*备注: 7 pages, 4 figures, Accepted to ICRA 2025

点击查看摘要

Abstract:After-action reviews (AARs) are professional discussions that help operators and teams enhance their task performance by analyzing completed missions with peers and professionals. Previous studies that compared different formats of AARs have mainly focused on human teams. However, the inclusion of robotic teammates brings along new challenges in understanding teammate intent and communication. Traditional AAR between human teammates may not be satisfactory for human-robot teams. To address this limitation, we propose a new training review (TR) tool, called the Virtual Spectator Interface (VSI), to enhance human-robot team performance and situational awareness (SA) in a simulated search mission. The proposed VSI primarily utilizes visual feedback to review subjects’ behavior. To examine the effectiveness of VSI, we took elements from AAR to conduct our own TR, designed a 1 x 3 between-subjects experiment with experimental conditions: TR with (1) VSI, (2) screen recording, and (3) non-technology (only verbal descriptions). The results of our experiments demonstrated that the VSI did not result in significantly better team performance than other conditions. However, the TR with VSI led to more improvement in the subjects SA over the other conditions.

[AI-34] Vi-LAD: Vision-Language Attention Distillation for Socially-Aware Robot Navigation in Dynamic Environments

链接: https://arxiv.org/abs/2503.09820
作者: Mohamed Elnoor,Kasun Weerakoon,Gershom Seneviratne,Jing Liang,Vignesh Rajagopal,Dinesh Manocha
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:We introduce Vision-Language Attention Distillation (Vi-LAD), a novel approach for distilling socially compliant navigation knowledge from a large Vision-Language Model (VLM) into a lightweight transformer model for real-time robotic navigation. Unlike traditional methods that rely on expert demonstrations or human-annotated datasets, Vi-LAD performs knowledge distillation and fine-tuning at the intermediate layer representation level (i.e., attention maps) by leveraging the backbone of a pre-trained vision-action model. These attention maps highlight key navigational regions in a given scene, which serve as implicit guidance for socially aware motion planning. Vi-LAD fine-tunes a transformer-based model using intermediate attention maps extracted from the pre-trained vision-action model, combined with attention-like semantic maps constructed from a large VLM. To achieve this, we introduce a novel attention-level distillation loss that fuses knowledge from both sources, generating augmented attention maps with enhanced social awareness. These refined attention maps are then utilized as a traversability costmap within a socially aware model predictive controller (MPC) for navigation. We validate our approach through real-world experiments on a Husky wheeled robot, demonstrating significant improvements over state-of-the-art (SOTA) navigation methods. Our results show up to 14.2% - 50% improvement in success rate, which highlights the effectiveness of Vi-LAD in enabling socially compliant and efficient robot navigation.

[AI-35] mporal Difference Flows

链接: https://arxiv.org/abs/2503.09817
作者: Jesse Farebrother,Matteo Pirotta,Andrea Tirinzoni,Rémi Munos,Alessandro Lazaric,Ahmed Touati
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Predictive models of the future are fundamental for an agent’s ability to reason and plan. A common strategy learns a world model and unrolls it step-by-step at inference, where small errors can rapidly compound. Geometric Horizon Models (GHMs) offer a compelling alternative by directly making predictions of future states, avoiding cumulative inference errors. While GHMs can be conveniently learned by a generative analog to temporal difference (TD) learning, existing methods are negatively affected by bootstrapping predictions at train time and struggle to generate high-quality predictions at long horizons. This paper introduces Temporal Difference Flows (TD-Flow), which leverages the structure of a novel Bellman equation on probability paths alongside flow-matching techniques to learn accurate GHMs at over 5x the horizon length of prior methods. Theoretically, we establish a new convergence result and primarily attribute TD-Flow’s efficacy to reduced gradient variance during training. We further show that similar arguments can be extended to diffusion-based methods. Empirically, we validate TD-Flow across a diverse set of domains on both generative metrics and downstream tasks including policy evaluation. Moreover, integrating TD-Flow with recent behavior foundation models for planning over pre-trained policies demonstrates substantial performance gains, underscoring its promise for long-horizon decision-making.

[AI-36] Un-Straightening Generative AI: How Queer Artists Surface and Challenge the Normativity of Generative AI Models

链接: https://arxiv.org/abs/2503.09805
作者: Jordan Taylor,Joel Mire,Franchesca Spektor,Alicia DeVrio,Maarten Sap,Haiyi Zhu,Sarah Fox
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Queer people are often discussed as targets of bias, harm, or discrimination in research on generative AI. However, the specific ways that queer people engage with generative AI, and thus possible uses that support queer people, have yet to be explored. We conducted a workshop study with 13 queer artists, during which we gave participants access to GPT-4 and DALL-E 3 and facilitated group sensemaking activities. We found our participants struggled to use these models due to various normative values embedded in their designs, such as hyper-positivity and anti-sexuality. We describe various strategies our participants developed to overcome these models’ limitations and how, nevertheless, our participants found value in these highly-normative technologies. Drawing on queer feminist theory, we discuss implications for the conceptualization of “state-of-the-art” models and consider how FAccT researchers might support queer alternatives.

[AI-37] Agent DAM: Privacy Leakage Evaluation for Autonomous Web Agents KR

链接: https://arxiv.org/abs/2503.09780
作者: Arman Zharmagambetov,Chuan Guo,Ivan Evtimov,Maya Pavlova,Ruslan Salakhutdinov,Kamalika Chaudhuri
类目: Artificial Intelligence (cs.AI)
*备注: project page: this https URL

点击查看摘要

Abstract:LLM-powered AI agents are an emerging frontier with tremendous potential to increase human productivity. However, empowering AI agents to take action on their user’s behalf in day-to-day tasks involves giving them access to potentially sensitive and private information, which leads to a possible risk of inadvertent privacy leakage when the agent malfunctions. In this work, we propose one way to address that potential risk, by training AI agents to better satisfy the privacy principle of data minimization. For the purposes of this benchmark, by “data minimization” we mean instances where private information is shared only when it is necessary to fulfill a specific task-relevant purpose. We develop a benchmark called AgentDAM to evaluate how well existing and future AI agents can limit processing of potentially private information that we designate “necessary” to fulfill the task. Our benchmark simulates realistic web interaction scenarios and is adaptable to all existing web navigation agents. We use AgentDAM to evaluate how well AI agents built on top of GPT-4, Llama-3 and Claude can limit processing of potentially private information when unnecessary, and show that these agents are often prone to inadvertent use of unnecessary sensitive information. We finally propose a prompting-based approach that reduces this.

[AI-38] Solving Bayesian inverse problems with diffusion priors and off-policy RL ICLR2025

链接: https://arxiv.org/abs/2503.09746
作者: Luca Scimeca,Siddarth Venkatraman,Moksh Jain,Minsu Kim,Marcin Sendera,Mohsin Hasan,Luke Rowe,Sarthak Mittal,Pablo Lemos,Emmanuel Bengio,Alexandre Adam,Jarrid Rector-Brooks,Yashar Hezaveh,Laurence Perreault-Levasseur,Yoshua Bengio,Glen Berseth,Nikolay Malkin
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
*备注: Accepted as workshop paper at DeLTa workshop, ICLR 2025. arXiv admin note: substantial text overlap with arXiv:2405.20971

点击查看摘要

Abstract:This paper presents a practical application of Relative Trajectory Balance (RTB), a recently introduced off-policy reinforcement learning (RL) objective that can asymptotically solve Bayesian inverse problems optimally. We extend the original work by using RTB to train conditional diffusion model posteriors from pretrained unconditional priors for challenging linear and non-linear inverse problems in vision, and science. We use the objective alongside techniques such as off-policy backtracking exploration to improve training. Importantly, our results show that existing training-free diffusion posterior methods struggle to perform effective posterior inference in latent space due to inherent biases.

[AI-39] Unveiling Hidden Pivotal Players with GoalNet: A GNN-Based Soccer Player Evaluation System

链接: https://arxiv.org/abs/2503.09737
作者: Jacky Hao Jiang,Jerry Cai,Anastasios Kyrillidis
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Optimization and Control (math.OC)
*备注: 14 pages, 4-5 figures

点击查看摘要

Abstract:Soccer analysis tools emphasize metrics such as expected goals, leading to an overrepresentation of attacking players’ contributions and overlooking players who facilitate ball control and link attacks. Examples include Rodri from Manchester City and Palhinha who just transferred to Bayern Munich. To address this bias, we aim to identify players with pivotal roles in a soccer team, incorporating both spatial and temporal features. In this work, we introduce a GNN-based framework that assigns individual credit for changes in expected threat (xT), thus capturing overlooked yet vital contributions in soccer. Our pipeline encodes both spatial and temporal features in event-centric graphs, enabling fair attribution of non-scoring actions such as defensive or transitional plays. We incorporate centrality measures into the learned player embeddings, ensuring that ball-retaining defenders and defensive midfielders receive due recognition for their overall impact. Furthermore, we explore diverse GNN variants-including Graph Attention Networks and Transformer-based models-to handle long-range dependencies and evolving match contexts, discussing their relative performance and computational complexity. Experiments on real match data confirm the robustness of our approach in highlighting pivotal roles that traditional attacking metrics typically miss, underscoring the model’s utility for more comprehensive soccer analytics. Comments: 14 pages, 4-5 figures Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Optimization and Control (math.OC) Cite as: arXiv:2503.09737 [cs.LG] (or arXiv:2503.09737v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2503.09737 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[AI-40] Finding the Muses: Identifying Coresets through Loss Trajectories

链接: https://arxiv.org/abs/2503.09721
作者: Manish Nagaraj,Deepak Ravikumar,Efstathia Soufleri,Kaushik Roy
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Deep learning models achieve state-of-the-art performance across domains but face scalability challenges in real-time or resource-constrained scenarios. To address this, we propose Loss Trajectory Correlation (LTC), a novel metric for coreset selection that identifies critical training samples driving generalization. LTC quantifies the alignment between training sample loss trajectories and validation set loss trajectories, enabling the construction of compact, representative subsets. Unlike traditional methods with computational and storage overheads that are infeasible to scale to large datasets, LTC achieves superior efficiency as it can be computed as a byproduct of training. Our results on CIFAR-100 and ImageNet-1k show that LTC consistently achieves accuracy on par with or surpassing state-of-the-art coreset selection methods, with any differences remaining under 1%. LTC also effectively transfers across various architectures, including ResNet, VGG, DenseNet, and Swin Transformer, with minimal performance degradation (2%). Additionally, LTC offers insights into training dynamics, such as identifying aligned and conflicting sample behaviors, at a fraction of the computational cost of traditional methods. This framework paves the way for scalable coreset selection and efficient dataset optimization.

[AI-41] Revisiting Backdoor Attacks on Time Series Classification in the Frequency Domain WWW2025

链接: https://arxiv.org/abs/2503.09712
作者: Yuanmin Huang,Mi Zhang,Zhaoxiang Wang,Wenxuan Li,Min Yang
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
*备注: WWW 2025 (Oral)

点击查看摘要

Abstract:Time series classification (TSC) is a cornerstone of modern web applications, powering tasks such as financial data analysis, network traffic monitoring, and user behavior analysis. In recent years, deep neural networks (DNNs) have greatly enhanced the performance of TSC models in these critical domains. However, DNNs are vulnerable to backdoor attacks, where attackers can covertly implant triggers into models to induce malicious outcomes. Existing backdoor attacks targeting DNN-based TSC models remain elementary. In particular, early methods borrow trigger designs from computer vision, which are ineffective for time series data. More recent approaches utilize generative models for trigger generation, but at the cost of significant computational complexity. In this work, we analyze the limitations of existing attacks and introduce an enhanced method, FreqBack. Drawing inspiration from the fact that DNN models inherently capture frequency domain features in time series data, we identify that improper perturbations in the frequency domain are the root cause of ineffective attacks. To address this, we propose to generate triggers both effectively and efficiently, guided by frequency analysis. FreqBack exhibits substantial performance across five models and eight datasets, achieving an impressive attack success rate of over 90%, while maintaining less than a 3% drop in model accuracy on clean data.

[AI-42] owards Robust Model Evolution with Algorithmic Recourse

链接: https://arxiv.org/abs/2503.09658
作者: Hao-Tsung Yang,Jie Gao,Bo-Yi Liu,Zhi-Xuan Liu
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 9 pages,4 figures

点击查看摘要

Abstract:Algorithmic Recourse is a way for users to modify their attributes to align with a model’s expectations, thereby improving their outcomes after receiving unfavorable decisions. In real-world scenarios, users often need to strategically adjust their attributes to compete for limited resources. However, such strategic behavior induces users to “game” algorithms, causing model collapse due to distribution shifts. These shifts arise from user competition, resource constraints, and adaptive user responses. While prior research on Algorithmic Recourse has explored its effects on both systems and users, the impact of resource constraints and competition over time remains underexplored. In this work, we develop a general framework to model user strategic behaviors and their interactions with decision-making systems under resource constraints and competitive dynamics. Through theoretical analysis and empirical evaluation, we identify three key phenomena that arise consistently in both synthetic and real-world datasets: escalating decision boundaries, non-robust model predictions, and inequitable recourse actions. Finally, we discuss the broader social implications of these findings and present two algorithmic strategies aimed at mitigating these challenges.

[AI-43] Open-Sora 2.0: Training a Commercial-Level Video Generation Model in 200k

链接: https://arxiv.org/abs/2503.09642
作者: Xiangyu Peng,Zangwei Zheng,Chenhui Shen,Tom Young,Xinying Guo,Binluo Wang,Hang Xu,Hongxin Liu,Mingyan Jiang,Wenjun Li,Yuhui Wang,Anbang Ye,Gang Ren,Qianran Ma,Wanying Liang,Xiang Lian,Xiwen Wu,Yuting Zhong,Zhuangyan Li,Chaoyu Gong,Guojun Lei,Leijun Cheng,Limin Zhang,Minghao Li,Ruijie Zhang,Silan Hu,Shijie Huang,Xiaokang Wang,Yuanheng Zhao,Yuqi Wang,Ziang Wei,Yang You
类目: Graphics (cs.GR); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Video generation models have achieved remarkable progress in the past year. The quality of AI video continues to improve, but at the cost of larger model size, increased data quantity, and greater demand for training compute. In this report, we present Open-Sora 2.0, a commercial-level video generation model trained for only 200k. With this model, we demonstrate that the cost of training a top-performing video generation model is highly controllable. We detail all techniques that contribute to this efficiency breakthrough, including data curation, model architecture, training strategy, and system optimization. According to human evaluation results and VBench scores, Open-Sora 2.0 is comparable to global leading video generation models including the open-source HunyuanVideo and the closed-source Runway Gen-3 Alpha. By making Open-Sora 2.0 fully open-source, we aim to democratize access to advanced video generation technology, fostering broader innovation and creativity in content creation. All resources are publicly available at: this https URL.

[AI-44] Edge AI-Powered Real-Time Decision-Making for Autonomous Vehicles in Adverse Weather Conditions

链接: https://arxiv.org/abs/2503.09638
作者: Milad Rahmati
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Autonomous vehicles (AVs) are transforming modern transportation, but their reliability and safety are significantly challenged by harsh weather conditions such as heavy rain, fog, and snow. These environmental factors impair the performance of cameras, LiDAR, and radar, leading to reduced situational awareness and increased accident risks. Conventional cloud-based AI systems introduce communication delays, making them unsuitable for the rapid decision-making required in real-time autonomous navigation. This paper presents a novel Edge AI-driven real-time decision-making framework designed to enhance AV responsiveness under adverse weather conditions. The proposed approach integrates convolutional neural networks (CNNs) and recurrent neural networks (RNNs) for improved perception, alongside reinforcement learning (RL)-based strategies to optimize vehicle control in uncertain environments. By processing data at the network edge, this system significantly reduces decision latency while improving AV adaptability. The framework is evaluated using simulated driving scenarios in CARLA and real-world data from the Waymo Open Dataset, covering diverse weather conditions. Experimental results indicate that the proposed model achieves a 40% reduction in processing time and a 25% enhancement in perception accuracy compared to conventional cloud-based systems. These findings highlight the potential of Edge AI in improving AV autonomy, safety, and efficiency, paving the way for more reliable self-driving technology in challenging real-world environments.

[AI-45] FPGS: Feed-Forward Semantic-aware Photorealistic Style Transfer of Large-Scale Gaussian Splatting

链接: https://arxiv.org/abs/2503.09635
作者: GeonU Kim,Kim Youwang,Lee Hyoseok,Tae-Hyun Oh
类目: Graphics (cs.GR); Artificial Intelligence (cs.AI)
*备注: Project page: this https URL . arXiv admin note: substantial text overlap with arXiv:2401.05516

点击查看摘要

Abstract:We present FPGS, a feed-forward photorealistic style transfer method of large-scale radiance fields represented by Gaussian Splatting. FPGS, stylizes large-scale 3D scenes with arbitrary, multiple style reference images without additional optimization while preserving multi-view consistency and real-time rendering speed of 3D Gaussians. Prior arts required tedious per-style optimization or time-consuming per-scene training stage and were limited to small-scale 3D scenes. FPGS efficiently stylizes large-scale 3D scenes by introducing a style-decomposed 3D feature field, which inherits AdaIN’s feed-forward stylization machinery, supporting arbitrary style reference images. Furthermore, FPGS supports multi-reference stylization with the semantic correspondence matching and local AdaIN, which adds diverse user control for 3D scene styles. FPGS also preserves multi-view consistency by applying semantic matching and style transfer processes directly onto queried features in 3D space. In experiments, we demonstrate that FPGS achieves favorable photorealistic quality scene stylization for large-scale static and dynamic 3D scenes with diverse reference images. Project page: this https URL

[AI-46] Certainly Bot Or Not? Trustworthy Social Bot Detection via Robust Multi-Modal Neural Processes

链接: https://arxiv.org/abs/2503.09626
作者: Qi Wu,Yingguang Yang,hao liu,Hao Peng,Buyun He,Yutong Xia,Yong Liao
类目: ocial and Information Networks (cs.SI); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 12 pages. 7 figures

点击查看摘要

Abstract:Social bot detection is crucial for mitigating misinformation, online manipulation, and coordinated inauthentic behavior. While existing neural network-based detectors perform well on benchmarks, they struggle with generalization due to distribution shifts across datasets and frequently produce overconfident predictions for out-of-distribution accounts beyond the training data. To address this, we introduce a novel Uncertainty Estimation for Social Bot Detection (UESBD) framework, which quantifies the predictive uncertainty of detectors beyond mere classification. For this task, we propose Robust Multi-modal Neural Processes (RMNP), which aims to enhance the robustness of multi-modal neural processes to modality inconsistencies caused by social bot camouflage. RMNP first learns unimodal representations through modality-specific encoders. Then, unimodal attentive neural processes are employed to encode the Gaussian distribution of unimodal latent variables. Furthermore, to avoid social bots stealing human features to camouflage themselves thus causing certain modalities to provide conflictive information, we introduce an evidential gating network to explicitly model the reliability of modalities. The joint latent distribution is learned through the generalized product of experts, which takes the reliability of each modality into consideration during fusion. The final prediction is obtained through Monte Carlo sampling of the joint latent distribution followed by a decoder. Experiments on three real-world benchmarks show the effectiveness of RMNP in classification and uncertainty estimation, as well as its robustness to modality conflicts.

[AI-47] Empowering the Future Workforce: Prioritizing Education for the AI-Accelerated Job Market

链接: https://arxiv.org/abs/2503.09613
作者: Lisa Amini(IBM Research),Henry F. Korth(Lehigh University),Nita Patel(Otis),Evan Peck(University of Colorado Boulder),Ben Zorn(Microsoft)
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:AI’s rapid integration into the workplace demands new approaches to workforce education and training and broader AI literacy across disciplines. Coordinated action from government, industry, and educational institutions is necessary to ensure workers can adapt to accelerating technological change.

[AI-48] RILe: Reinforced Imitation Learning

链接: https://arxiv.org/abs/2406.08472
作者: Mert Albaba,Sammy Christen,Thomas Langarek,Christoph Gebhardt,Otmar Hilliges,Michael J. Black
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO)
*备注:

点击查看摘要

Abstract:Acquiring complex behaviors is essential for artificially intelligent agents, yet learning these behaviors in high-dimensional settings poses a significant challenge due to the vast search space. Traditional reinforcement learning (RL) requires extensive manual effort for reward function engineering. Inverse reinforcement learning (IRL) uncovers reward functions from expert demonstrations but relies on an iterative process that is often computationally expensive. Imitation learning (IL) provides a more efficient alternative by directly comparing an agent’s actions to expert demonstrations; however, in high-dimensional environments, such direct comparisons offer insufficient feedback for effective learning. We introduce RILe (Reinforced Imitation Learning), a framework that combines the strengths of imitation learning and inverse reinforcement learning to learn a dense reward function efficiently and achieve strong performance in high-dimensional tasks. RILe employs a novel trainer-student framework: the trainer learns an adaptive reward function, and the student uses this reward signal to imitate expert behaviors. By dynamically adjusting its guidance as the student evolves, the trainer provides nuanced feedback across different phases of learning. Our framework produces high-performing policies in high-dimensional tasks where direct imitation fails to replicate complex behaviors. We validate RILe in challenging robotic locomotion tasks, demonstrating that it significantly outperforms existing methods and achieves near-expert performance across multiple settings.

[AI-49] Why the Brain Cannot Be a Digital Computer: History-Dependence and the Computational Limits of Consciousness

链接: https://arxiv.org/abs/2503.10518
作者: Andrew Knight
类目: History and Philosophy of Physics (physics.hist-ph); Artificial Intelligence (cs.AI); Neurons and Cognition (q-bio.NC)
*备注: 10 pages, 1 figure

点击查看摘要

Abstract:This paper presents a novel information-theoretic proof demonstrating that the human brain as currently understood cannot function as a classical digital computer. Through systematic quantification of distinguishable conscious states and their historical dependencies, we establish that the minimum information required to specify a conscious state exceeds the physical information capacity of the human brain by a significant factor. Our analysis calculates the bit-length requirements for representing consciously distinguishable sensory “stimulus frames” and demonstrates that consciousness exhibits mandatory temporal-historical dependencies that multiply these requirements beyond the brain’s storage capabilities. This mathematical approach offers new insights into the fundamental limitations of computational models of consciousness and suggests that non-classical information processing mechanisms may be necessary to account for conscious experience.

[AI-50] Explainable Bayesian deep learning through input-skip Latent Binary Bayesian Neural Networks

链接: https://arxiv.org/abs/2503.10496
作者: Eirik Høyheim,Lars Skaaret-Lund,Solve Sæbø,Aliaksandr Hubin
类目: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Computation (stat.CO); Methodology (stat.ME)
*备注: 44 pages, 19 tables, 25 figures. Code available at this https URL

点击查看摘要

Abstract:Modeling natural phenomena with artificial neural networks (ANNs) often provides highly accurate predictions. However, ANNs often suffer from over-parameterization, complicating interpretation and raising uncertainty issues. Bayesian neural networks (BNNs) address the latter by representing weights as probability distributions, allowing for predictive uncertainty evaluation. Latent binary Bayesian neural networks (LBBNNs) further handle structural uncertainty and sparsify models by removing redundant weights. This article advances LBBNNs by enabling covariates to skip to any succeeding layer or be excluded, simplifying networks and clarifying input impacts on predictions. Ultimately, a linear model or even a constant can be found to be optimal for a specific problem at hand. Furthermore, the input-skip LBBNN approach reduces network density significantly compared to standard LBBNNs, achieving over 99% reduction for small networks and over 99.9% for larger ones, while still maintaining high predictive accuracy and uncertainty measurement. For example, on MNIST, we reached 97% accuracy and great calibration with just 935 weights, reaching state-of-the-art for compression of neural networks. Furthermore, the proposed method accurately identifies the true covariates and adjusts for system non-linearity. The main contribution is the introduction of active paths, enhancing directly designed global and local explanations within the LBBNN framework, that have theoretical guarantees and do not require post hoc external tools for explanations.

[AI-51] Siamese Foundation Models for Crystal Structure Prediction

链接: https://arxiv.org/abs/2503.10471
作者: Liming Wu,Wenbing Huang,Rui Jiao,Jianxing Huang,Liwei Liu,Yipeng Zhou,Hao Sun,Yang Liu,Fuchun Sun,Yuxiang Ren,Jirong Wen
类目: Materials Science (cond-mat.mtrl-sci); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Crystal Structure Prediction (CSP), which aims to generate stable crystal structures from compositions, represents a critical pathway for discovering novel materials. While structure prediction tasks in other domains, such as proteins, have seen remarkable progress, CSP remains a relatively underexplored area due to the more complex geometries inherent in crystal structures. In this paper, we propose Siamese foundation models specifically designed to address CSP. Our pretrain-finetune framework, named DAO, comprises two complementary foundation models: DAO-G for structure generation and DAO-P for energy prediction. Experiments on CSP benchmarks (MP-20 and MPTS-52) demonstrate that our DAO-G significantly surpasses state-of-the-art (SOTA) methods across all metrics. Extensive ablation studies further confirm that DAO-G excels in generating diverse polymorphic structures, and the dataset relaxation and energy guidance provided by DAO-P are essential for enhancing DAO-G’s performance. When applied to three real-world superconductors ( \textCsV_3\textSb_5 , \textZr_16\textRh_8\textO_4 and \textZr_16\textPd_8\textO_4 ) that are known to be challenging to analyze, our foundation models achieve accurate critical temperature predictions and structure generations. For instance, on \textCsV_3\textSb_5 , DAO-G generates a structure close to the experimental one with an RMSE of 0.0085; DAO-P predicts the T_c value with high accuracy (2.26 K vs. the ground-truth value of 2.30 K). In contrast, conventional DFT calculators like Quantum Espresso only successfully derive the structure of the first superconductor within an acceptable time, while the RMSE is nearly 8 times larger, and the computation speed is more than 1000 times slower. These compelling results collectively highlight the potential of our approach for advancing materials science research and development.

[AI-52] Bilingual Dual-Head Deep Model for Parkinsons Disease Detection from Speech ICASSP2025

链接: https://arxiv.org/abs/2503.10301
作者: Moreno La Quatra,Juan Rafael Orozco-Arroyave,Marco Sabato Siniscalchi
类目: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI)
*备注: Accepted at ICASSP 2025 - Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses

点击查看摘要

Abstract:This work aims to tackle the Parkinson’s disease (PD) detection problem from the speech signal in a bilingual setting by proposing an ad-hoc dual-head deep neural architecture for type-based binary classification. One head is specialized for diadochokinetic patterns. The other head looks for natural speech patterns present in continuous spoken utterances. Only one of the two heads is operative accordingly to the nature of the input. Speech representations are extracted from self-supervised learning (SSL) models and wavelet transforms. Adaptive layers, convolutional bottlenecks, and contrastive learning are exploited to reduce variations across languages. Our solution is assessed against two distinct datasets, EWA-DB, and PC-GITA, which cover Slovak and Spanish languages, respectively. Results indicate that conventional models trained on a single language dataset struggle with cross-linguistic generalization, and naive combinations of datasets are suboptimal. In contrast, our model improves generalization on both languages, simultaneously.

[AI-53] Predicting Chemical Reaction Outcomes Based on Electron Movements Using Machine Learning

链接: https://arxiv.org/abs/2503.10197
作者: Shuan Chen,Kye Sung Park,Taewan Kim,Sunkyu Han,Yousung Jung
类目: Chemical Physics (physics.chem-ph); Artificial Intelligence (cs.AI)
*备注: 15 pages, 3 figures

点击查看摘要

Abstract:Accurately predicting chemical reaction outcomes and potential byproducts is a fundamental task of modern chemistry, enabling the efficient design of synthetic pathways and driving progress in chemical science. Reaction mechanism, which tracks electron movements during chemical reactions, is critical for understanding reaction kinetics and identifying unexpected products. Here, we present Reactron, the first electron-based machine learning model for general reaction prediction. Reactron integrates electron movement into its predictions, generating detailed arrow-pushing diagrams that elucidate each mechanistic step leading to product formation. We demonstrate the high predictive performance of Reactron over existing product-only models by a large-scale reaction outcome prediction benchmark, and the adaptability of the model to learn new reactivity upon providing a few examples. Furthermore, it explores combinatorial reaction spaces, uncovering novel reactivities beyond its training data. With robust performance in both in- and out-of-distribution predictions, Reactron embodies human-like reasoning in chemistry and opens new frontiers in reaction discovery and synthesis design.

[AI-54] Rapid analysis of point-contact Andreev reflection spectra via machine learning with adaptive data augmentation

链接: https://arxiv.org/abs/2503.10040
作者: Dongik Lee,Valentin Stanev,Xiaohang Zhang,Mijeong Kang,Ichiro Takeuchi,Seunghun Lee
类目: perconductivity (cond-mat.supr-con); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 18 pages, 3 figures

点击查看摘要

Abstract:Delineating the superconducting order parameters is a pivotal task in investigating superconductivity for probing pairing mechanisms, as well as their symmetry and topology. Point-contact Andreev reflection (PCAR) measurement is a simple yet powerful tool for identifying the order parameters. The PCAR spectra exhibit significant variations depending on the type of the order parameter in a superconductor, including its magnitude ( \mathit\Delta ), as well as temperature, interfacial quality, Fermi velocity mismatch, and other factors. The information on the order parameter can be obtained by finding the combination of these parameters, generating a theoretical spectrum that fits a measured experimental spectrum. However, due to the complexity of the spectra and the high dimensionality of parameters, extracting the fitting parameters is often time-consuming and labor-intensive. In this study, we employ a convolutional neural network (CNN) algorithm to create models for rapid and automated analysis of PCAR spectra of various superconductors with different pairing symmetries (conventional s -wave, chiral p_x+ip_y -wave, and d_x^2-y^2 -wave). The training datasets are generated based on the Blonder-Tinkham-Klapwijk (BTK) theory and further modified and augmented by selectively incorporating noise and peaks according to the bias voltages. This approach not only replicates the experimental spectra but also brings the model’s attention to important features within the spectra. The optimized models provide fitting parameters for experimentally measured spectra in less than 100 ms per spectrum. Our approaches and findings pave the way for rapid and automated spectral analysis which will help accelerate research on superconductors with complex order parameters.

[AI-55] Exploiting Edited Large Language Models as General Scientific Optimizers

链接: https://arxiv.org/abs/2503.09620
作者: Qitan Lv,Tianyu Liu,Hong Wang
类目: Optimization and Control (math.OC); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Large language models (LLMs) have been widely adopted in mathematical optimization in scientific scenarios for their extensive knowledge and advanced reasoning capabilities. Existing methods mainly focus on utilizing LLMs to solve optimization problems in a prompt-based manner, which takes observational feedback as additional textual descriptions. However, due to LLM’s \textbfhigh sensitivity to the prompts and \textbftendency to get lost in lengthy prompts, these methods struggle to effectively utilize the observational feedback from each optimization step, which severely hinders the applications for real-world scenarios. To address these challenges, we propose a conceptually simple and general bi-level optimization method, namely \textbfGeneral \textbfScientific \textbfOptimizers (GSO). Specifically, GSO first utilizes inner-level simulators as experimental platforms to evaluate the current solution and provide observational feedback. Then, LLMs serve as knowledgeable and versatile scientists, generating new solutions by refining potential errors from the feedback as the outer-level optimization. Finally, simulations together with the expert knowledge in LLMs are jointly updated with bi-level interactions via model editing. Extensive experiments show that GSO consistently outperforms existing state-of-the-art methods using \textitsix different LLM backbones on \textitseven different tasks, demonstrating the effectiveness and a wide range of applications.

机器学习

[LG-0] Unveiling the Mathematical Reasoning in DeepSeek Models: A Comparative Study of Large Language Models

链接: https://arxiv.org/abs/2503.10573
作者: Afrar Jahin,Arif Hassan Zidan,Yu Bao,Shizhe Liang,Tianming Liu,Wei Zhang
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:With the rapid evolution of Artificial Intelligence (AI), Large Language Models (LLMs) have reshaped the frontiers of various fields, spanning healthcare, public health, engineering, science, agriculture, education, arts, humanities, and mathematical reasoning. Among these advancements, DeepSeek models have emerged as noteworthy contenders, demonstrating promising capabilities that set them apart from their peers. While previous studies have conducted comparative analyses of LLMs, few have delivered a comprehensive evaluation of mathematical reasoning across a broad spectrum of LLMs. In this work, we aim to bridge this gap by conducting an in-depth comparative study, focusing on the strengths and limitations of DeepSeek models in relation to their leading counterparts. In particular, our study systematically evaluates the mathematical reasoning performance of two DeepSeek models alongside five prominent LLMs across three independent benchmark datasets. The findings reveal several key insights: 1). DeepSeek-R1 consistently achieved the highest accuracy on two of the three datasets, demonstrating strong mathematical reasoning capabilities. 2). The distilled variant of LLMs significantly underperformed compared to its peers, highlighting potential drawbacks in using distillation techniques. 3). In terms of response time, Gemini 2.0 Flash demonstrated the fastest processing speed, outperforming other models in efficiency, which is a crucial factor for real-time applications. Beyond these quantitative assessments, we delve into how architecture, training, and optimization impact LLMs’ mathematical reasoning. Moreover, our study goes beyond mere performance comparison by identifying key areas for future advancements in LLM-driven mathematical reasoning. This research enhances our understanding of LLMs’ mathematical reasoning and lays the groundwork for future advancements

[LG-1] Radar: Fast Long-Context Decoding for Any Transformer ICLR2025

链接: https://arxiv.org/abs/2503.10571
作者: Yongchang Hao,Mengyao Zhai,Hossein Hajimirsadeghi,Sepidehsadat Hosseini,Frederick Tung
类目: Machine Learning (cs.LG)
*备注: Accepted @ ICLR 2025

点击查看摘要

Abstract:Transformer models have demonstrated exceptional performance across a wide range of applications. Though forming the foundation of Transformer models, the dot-product attention does not scale well to long-context data since its time requirement grows quadratically with context length. In this work, we propose Radar, a training-free approach that accelerates inference by dynamically searching for the most important context tokens. For any pre-trained Transformer, Radar can reduce the decoding time complexity without training or heuristically evicting tokens. Moreover, we provide theoretical justification for our approach, demonstrating that Radar can reliably identify the most important tokens with high probability. We conduct extensive comparisons with the previous methods on a wide range of tasks. The results demonstrate that Radar achieves the state-of-the-art performance across different architectures with reduced time complexity, offering a practical solution for efficient long-context processing of Transformers.

[LG-2] FedPCA: Noise-Robust Fair Federated Learning via Performance-Capacity Analysis

链接: https://arxiv.org/abs/2503.10567
作者: Nannan Wu,Zengqiang Yan,Nong Sang,Li Yu,Chang Wen Chen
类目: Machine Learning (cs.LG)
*备注: Preprint

点击查看摘要

Abstract:Training a model that effectively handles both common and rare data-i.e., achieving performance fairness-is crucial in federated learning (FL). While existing fair FL methods have shown effectiveness, they remain vulnerable to mislabeled data. Ensuring robustness in fair FL is therefore essential. However, fairness and robustness inherently compete, which causes robust strategies to hinder fairness. In this paper, we attribute this competition to the homogeneity in loss patterns exhibited by rare and mislabeled data clients, preventing existing loss-based fair and robust FL methods from effectively distinguishing and handling these two distinct client types. To address this, we propose performance-capacity analysis, which jointly considers model performance on each client and its capacity to handle the dataset, measured by loss and a newly introduced feature dispersion score. This allows mislabeled clients to be identified by their significantly deviated performance relative to capacity while preserving rare data clients. Building on this, we introduce FedPCA, an FL method that robustly achieves fairness. FedPCA first identifies mislabeled clients via a Gaussian Mixture Model on loss-dispersion pairs, then applies fairness and robustness strategies in global aggregation and local training by adjusting client weights and selectively using reliable data. Extensive experiments on three datasets demonstrate FedPCA’s effectiveness in tackling this complex challenge. Code will be publicly available upon acceptance.

[LG-3] ASIDE: Architectural Separation of Instructions and Data in Language Models ICLR2025

链接: https://arxiv.org/abs/2503.10566
作者: Egor Zverev,Evgenii Kortukov,Alexander Panfilov,Soroush Tabesh,Alexandra Volkova,Sebastian Lapuschkin,Wojciech Samek,Christoph H. Lampert
类目: Machine Learning (cs.LG)
*备注: ICLR 2025 Workshop on Building Trust in Language Models and Applications

点击查看摘要

Abstract:Despite their remarkable performance, large language models lack elementary safety features, and this makes them susceptible to numerous malicious attacks. In particular, previous work has identified the absence of an intrinsic separation between instructions and data as a root cause for the success of prompt injection attacks. In this work, we propose an architectural change, ASIDE, that allows the model to clearly separate between instructions and data by using separate embeddings for them. Instead of training the embeddings from scratch, we propose a method to convert an existing model to ASIDE form by using two copies of the original model’s embeddings layer, and applying an orthogonal rotation to one of them. We demonstrate the effectiveness of our method by showing (1) highly increased instruction-data separation scores without a loss in model capabilities and (2) competitive results on prompt injection benchmarks, even without dedicated safety training. Additionally, we study the working mechanism behind our method through an analysis of model representations.

[LG-4] From Linear to Spline-Based Classification:Developing and Enhancing SMPA for Noisy Non-Linear Datasets

链接: https://arxiv.org/abs/2503.10545
作者: Vatsal Srivastava
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Building upon the concepts and mechanisms used for the development in Moving Points Algorithm, we will now explore how non linear decision boundaries can be developed for classification tasks. First we will look at the classification performance of MPA and some minor developments in the original algorithm. We then discuss the concepts behind using cubic splines for classification with a similar learning mechanism and finally analyze training results on synthetic datasets with known properties.

[LG-5] DP-GPL: Differentially Private Graph Prompt Learning

链接: https://arxiv.org/abs/2503.10544
作者: Jing Xu,Franziska Boenisch,Iyiola Emmanuel Olatunji,Adam Dziedzic
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Graph Neural Networks (GNNs) have shown remarkable performance in various applications. Recently, graph prompt learning has emerged as a powerful GNN training paradigm, inspired by advances in language and vision foundation models. Here, a GNN is pre-trained on public data and then adapted to sensitive tasks using lightweight graph prompts. However, using prompts from sensitive data poses privacy risks. In this work, we are the first to investigate these practical risks in graph prompts by instantiating a membership inference attack that reveals significant privacy leakage. We also find that the standard privacy method, DP-SGD, fails to provide practical privacy-utility trade-offs in graph prompt learning, likely due to the small number of sensitive data points used to learn the prompts. As a solution, we propose DP-GPL for differentially private graph prompt learning based on the PATE framework, that generates a graph prompt with differential privacy guarantees. Our evaluation across various graph prompt learning methods, GNN architectures, and pre-training strategies demonstrates that our algorithm achieves high utility at strong privacy, effectively mitigating privacy concerns while preserving the powerful capabilities of prompted GNNs as powerful foundation models in the graph domain.

[LG-6] Structured Preconditioners in Adaptive Optimization: A Unified Analysis

链接: https://arxiv.org/abs/2503.10537
作者: Shuo Xie,Tianhao Wang,Sashank Reddi,Sanjiv Kumar,Zhiyuan Li
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We present a novel unified analysis for a broad class of adaptive optimization algorithms with structured (e.g., layerwise, diagonal, and kronecker-factored) preconditioners for both online regret minimization and offline convex optimization. Our analysis not only provides matching rate to several important structured preconditioned algorithms including diagonal AdaGrad, full-matrix AdaGrad, and AdaGrad-Norm, but also gives an improved convergence rate for a one-sided variant of Shampoo over that of original Shampoo. Interestingly, more structured preconditioners (e.g., diagonal Adagrad, AdaGrad-Norm which use less space and compute) are often presented as computationally efficient approximations to full-matrix Adagrad, aiming for improved optimization performance through better approximations. Our unified analysis challenges this prevailing view and reveals, perhaps surprisingly, that more structured preconditioners, despite using less space and computation per step, can outperform their less structured counterparts. To demonstrate this, we show that one-sided Shampoo, which is relatively much cheaper than full-matrix AdaGrad could outperform it both theoretically and experimentally.

[LG-7] SySLLM : Generating Synthesized Policy Summaries for Reinforcement Learning Agents Using Large Language Models

链接: https://arxiv.org/abs/2503.10509
作者: Sahar Admoni,Omer Ben-Porat,Ofra Amir
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Policies generated by Reinforcement Learning (RL) algorithms can be difficult to describe to users, as they result from the interplay between complex reward structures and neural network-based representations. This combination often leads to unpredictable behaviors, making policies challenging to analyze and posing significant obstacles to fostering human trust in real-world applications. Global policy summarization methods aim to describe agent behavior through a demonstration of actions in a subset of world-states. However, users can only watch a limited number of demonstrations, restricting their understanding of policies. Moreover, those methods overly rely on user interpretation, as they do not synthesize observations into coherent patterns. In this work, we present SySLLM (Synthesized Summary using LLMs), a novel method that employs synthesis summarization, utilizing large language models’ (LLMs) extensive world knowledge and ability to capture patterns, to generate textual summaries of policies. Specifically, an expert evaluation demonstrates that the proposed approach generates summaries that capture the main insights generated by experts while not resulting in significant hallucinations. Additionally, a user study shows that SySLLM summaries are preferred over demonstration-based policy summaries and match or surpass their performance in objective agent identification tasks.

[LG-8] Sample Compression for Continual Learning

链接: https://arxiv.org/abs/2503.10503
作者: Jacob Comeau,Mathieu Bazinet,Pascal Germain,Cem Subakan
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Continual learning algorithms aim to learn from a sequence of tasks, making the training distribution non-stationary. The majority of existing continual learning approaches in the literature rely on heuristics and do not provide learning guarantees for the continual learning setup. In this paper, we present a new method called ‘Continual Pick-to-Learn’ (CoP2L), which is able to retain the most representative samples for each task in an efficient way. The algorithm is adapted from the Pick-to-Learn algorithm, rooted in the sample compression theory. This allows us to provide high-confidence upper bounds on the generalization loss of the learned predictors, numerically computable after every update of the learned model. We also empirically show on several standard continual learning benchmarks that our algorithm is able to outperform standard experience replay, significantly mitigating catastrophic forgetting.

[LG-9] Applying Tabular Deep Learning Models to Estimate Crash Injury Types of Young Motorcyclists

链接: https://arxiv.org/abs/2503.10474
作者: Shriyank Somvanshi,Anannya Ghosh Tusti,Rohit Chakraborty,Subasish Das
类目: Machine Learning (cs.LG)
*备注: 6 pages, 6 figures, accepted at IEEE CAI 2025

点击查看摘要

Abstract:Young motorcyclists, particularly those aged 15 to 24 years old, face a heightened risk of severe crashes due to factors such as speeding, traffic violations, and helmet usage. This study aims to identify key factors influencing crash severity by analyzing 10,726 young motorcyclist crashes in Texas from 2017 to 2022. Two advanced tabular deep learning models, ARMNet and MambaNet, were employed, using an advanced resampling technique to address class imbalance. The models were trained to classify crashes into three severity levels, Fatal or Severe, Moderate or Minor, and No Injury. ARMNet achieved an accuracy of 87 percent, outperforming 86 percent of Mambanet, with both models excelling in predicting severe and no injury crashes while facing challenges in moderate crash classification. Key findings highlight the significant influence of demographic, environmental, and behavioral factors on crash outcomes. The study underscores the need for targeted interventions, including stricter helmet enforcement and educational programs customized to young motorcyclists. These insights provide valuable guidance for policymakers in developing evidence-based strategies to enhance motorcyclist safety and reduce crash severity.

[LG-10] SortingEnv: An Extendable RL-Environment for an Industrial Sorting Process

链接: https://arxiv.org/abs/2503.10466
作者: Tom Maus,Nico Zengeler,Tobias Glasmachers
类目: Machine Learning (cs.LG)
*备注: Presented at the 12th International Conference on Industrial Engineering and Applications (ICIEA-EU), Munich, 2025. This article has been submitted to AIP Conference Proceedings. After it is published, it will be available in the AIP Digital Library

点击查看摘要

Abstract:We present a novel reinforcement learning (RL) environment designed to both optimize industrial sorting systems and study agent behavior in evolving spaces. In simulating material flow within a sorting process our environment follows the idea of a digital twin, with operational parameters like belt speed and occupancy level. To reflect real-world challenges, we integrate common upgrades to industrial setups, like new sensors or advanced machinery. It thus includes two variants: a basic version focusing on discrete belt speed adjustments and an advanced version introducing multiple sorting modes and enhanced material composition observations. We detail the observation spaces, state update mechanisms, and reward functions for both environments. We further evaluate the efficiency of common RL algorithms like Proximal Policy Optimization (PPO), Deep-Q-Networks (DQN), and Advantage Actor Critic (A2C) in comparison to a classical rule-based agent (RBA). This framework not only aids in optimizing industrial processes but also provides a foundation for studying agent behavior and transferability in evolving environments, offering insights into model performance and practical implications for real-world RL applications.

[LG-11] Sentiment Analysis in SemEval: A Review of Sentiment Identification Approaches

链接: https://arxiv.org/abs/2503.10457
作者: Bousselham El Haddaoui,Raddouane Chiheb,Rdouan Faizi,Abdellatif El Afia
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Social media platforms are becoming the foundations of social interactions including messaging and opinion expression. In this regard, Sentiment Analysis techniques focus on providing solutions to ensure the retrieval and analysis of generated data including sentiments, emotions, and discussed topics. International competitions such as the International Workshop on Semantic Evaluation (SemEval) have attracted many researchers and practitioners with a special research interest in building sentiment analysis systems. In our work, we study top-ranking systems for each SemEval edition during the 2013-2021 period, a total of 658 teams participated in these editions with increasing interest over years. We analyze the proposed systems marking the evolution of research trends with a focus on the main components of sentiment analysis systems including data acquisition, preprocessing, and classification. Our study shows an active use of preprocessing techniques, an evolution of features engineering and word representation from lexicon-based approaches to word embeddings, and the dominance of neural networks and transformers over the classification phase fostering the use of ready-to-use models. Moreover, we provide researchers with insights based on experimented systems which will allow rapid prototyping of new systems and help practitioners build for future SemEval editions.

[LG-12] Langevin Monte-Carlo Provably Learns Depth Two Neural Nets at Any Size and Data

链接: https://arxiv.org/abs/2503.10428
作者: Dibyakanti Kumar,Samyak Jha,Anirbit Mukherjee
类目: Machine Learning (cs.LG); Functional Analysis (math.FA); Probability (math.PR)
*备注:

点击查看摘要

Abstract:In this work, we will establish that the Langevin Monte-Carlo algorithm can learn depth-2 neural nets of any size and for any data and we give non-asymptotic convergence rates for it. We achieve this via showing that under Total Variation distance and q-Renyi divergence, the iterates of Langevin Monte Carlo converge to the Gibbs distribution of Frobenius norm regularized losses for any of these nets, when using smooth activations and in both classification and regression settings. Most critically, the amount of regularization needed for our results is independent of the size of the net. The key observation of ours is that two layer neural loss functions can always be regularized by a constant amount such that they satisfy the Villani conditions, and thus their Gibbs measures satisfy a Poincare inequality.

[LG-13] owards Constraint-Based Adaptive Hypergraph Learning for Solving Vehicle Routing: An End-to-End Solution

链接: https://arxiv.org/abs/2503.10421
作者: Zhenwei Wang,Ruibin Bai,Tiehua Zhang
类目: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
*备注:

点击查看摘要

Abstract:The application of learning based methods to vehicle routing problems has emerged as a pivotal area of research in combinatorial optimization. These problems are characterized by vast solution spaces and intricate constraints, making traditional approaches such as exact mathematical models or heuristic methods prone to high computational overhead or reliant on the design of complex heuristic operators to achieve optimal or near optimal solutions. Meanwhile, although some recent learning-based methods can produce good performance for VRP with straightforward constraint scenarios, they often fail to effectively handle hard constraints that are common in practice. This study introduces a novel end-to-end framework that combines constraint-oriented hypergraphs with reinforcement learning to address vehicle routing problems. A central innovation of this work is the development of a constraint-oriented dynamic hyperedge reconstruction strategy within an encoder, which significantly enhances hypergraph representation learning. Additionally, the decoder leverages a double-pointer attention mechanism to iteratively generate solutions. The proposed model is trained by incorporating asynchronous parameter updates informed by hypergraph constraints and optimizing a dual loss function comprising constraint loss and policy gradient loss. The experiment results on benchmark datasets demonstrate that the proposed approach not only eliminates the need for sophisticated heuristic operators but also achieves substantial improvements in solution quality.

[LG-14] Multi-objective Good Arm Identification with Bandit Feedback

链接: https://arxiv.org/abs/2503.10386
作者: Xuanke Jiang,Kohei Hatano,Eiji Takimoto
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We consider a good arm identification problem in a stochastic bandit setting with multi-objectives, where each arm i\in[K] is associated with M distributions \mathcalD_i^(1), \ldots, \mathcalD_i^(M) . For each round t , the player/algorithm pulls one arm i_t and receives a vector feedback, where each component m is sampled according to \mathcalD_i^(m) . The target is twofold, one is finding one arm whose means are larger than the predefined thresholds \xi_1,\ldots,\xi_M with a confidence bound \delta and an accuracy rate \epsilon with a bounded sample complexity, the other is output \bot to indicate no such arm exists. We propose an algorithm with a sample complexity bound. When M=1 and \epsilon = 0 , our bound is the same as the one given in the previous work when and novel bounds for M 1 . The proposed algorithm attains better numerical performance than other baselines in the experiments on synthetic and real datasets.

[LG-15] Subgroup Performance Analysis in Hidden Stratifications

链接: https://arxiv.org/abs/2503.10382
作者: Alceu Bissoto,Trung-Dung Hoang,Tim Flühmann,Susu Sun,Christian F. Baumgartner,Lisa M. Koch
类目: Machine Learning (cs.LG)
*备注: Under review

点击查看摘要

Abstract:Machine learning (ML) models may suffer from significant performance disparities between patient groups. Identifying such disparities by monitoring performance at a granular level is crucial for safely deploying ML to each patient. Traditional subgroup analysis based on metadata can expose performance disparities only if the available metadata (e.g., patient sex) sufficiently reflects the main reasons for performance variability, which is not common. Subgroup discovery techniques that identify cohesive subgroups based on learned feature representations appear as a potential solution: They could expose hidden stratifications and provide more granular subgroup performance reports. However, subgroup discovery is challenging to evaluate even as a standalone task, as ground truth stratification labels do not exist in real data. Subgroup discovery has thus neither been applied nor evaluated for the application of subgroup performance monitoring. Here, we apply subgroup discovery for performance monitoring in chest x-ray and skin lesion classification. We propose novel evaluation strategies and show that a simplified subgroup discovery method without access to classification labels or metadata can expose larger performance disparities than traditional metadata-based subgroup analysis. We provide the first compelling evidence that subgroup discovery can serve as an important tool for comprehensive performance validation and monitoring of trustworthy AI in medicine.

[LG-16] Probabilistic Forecasting via Autoregressive Flow Matching

链接: https://arxiv.org/abs/2503.10375
作者: Ahmed El-Gazzar,Marcel van Gerven
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In this work, we propose FlowTime, a generative model for probabilistic forecasting of multivariate timeseries data. Given historical measurements and optional future covariates, we formulate forecasting as sampling from a learned conditional distribution over future trajectories. Specifically, we decompose the joint distribution of future observations into a sequence of conditional densities, each modeled via a shared flow that transforms a simple base distribution into the next observation distribution, conditioned on observed covariates. To achieve this, we leverage the flow matching (FM) framework, enabling scalable and simulation-free learning of these transformations. By combining this factorization with the FM objective, FlowTime retains the benefits of autoregressive models – including strong extrapolation performance, compact model size, and well-calibrated uncertainty estimates – while also capturing complex multi-modal conditional distributions, as seen in modern transport-based generative models. We demonstrate the effectiveness of FlowTime on multiple dynamical systems and real-world forecasting tasks.

[LG-17] Safe exploration in reproducing kernel Hilbert spaces AISTATS2025

链接: https://arxiv.org/abs/2503.10352
作者: Abdullah Tokmak,Kiran G. Krishnan,Thomas B. Schön,Dominik Baumann
类目: Machine Learning (cs.LG); Systems and Control (eess.SY); Optimization and Control (math.OC)
*备注: Accepted to AISTATS 2025

点击查看摘要

Abstract:Popular safe Bayesian optimization (BO) algorithms learn control policies for safety-critical systems in unknown environments. However, most algorithms make a smoothness assumption, which is encoded by a known bounded norm in a reproducing kernel Hilbert space (RKHS). The RKHS is a potentially infinite-dimensional space, and it remains unclear how to reliably obtain the RKHS norm of an unknown function. In this work, we propose a safe BO algorithm capable of estimating the RKHS norm from data. We provide statistical guarantees on the RKHS norm estimation, integrate the estimated RKHS norm into existing confidence intervals and show that we retain theoretical guarantees, and prove safety of the resulting safe BO algorithm. We apply our algorithm to safely optimize reinforcement learning policies on physics simulators and on a real inverted pendulum, demonstrating improved performance, safety, and scalability compared to the state-of-the-art.

[LG-18] Mirror Online Conformal Prediction with Intermittent Feedback

链接: https://arxiv.org/abs/2503.10345
作者: Bowen Wang,Matteo Zecchin,Osvaldo Simeone
类目: Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注:

点击查看摘要

Abstract:Online conformal prediction enables the runtime calibration of a pre-trained artificial intelligence model using feedback on its performance. Calibration is achieved through set predictions that are updated via online rules so as to ensure long-term coverage guarantees. While recent research has demonstrated the benefits of incorporating prior knowledge into the calibration process, this has come at the cost of replacing coverage guarantees with less tangible regret guarantees based on the quantile loss. This work introduces intermittent mirror online conformal prediction (IM-OCP), a novel runtime calibration framework that integrates prior knowledge, while maintaining long-term coverage and achieving sub-linear regret. IM-OCP features closed-form updates with minimal memory complexity, and is designed to operate under potentially intermittent feedback.

[LG-19] Characterizing Nonlinear Dynamics via Smooth Prototype Equivalences

链接: https://arxiv.org/abs/2503.10336
作者: Roy Friedman,Noa Moriel,Matthew Ricci,Guy Pelc,Yair Weiss,Mor Nitzan
类目: Machine Learning (cs.LG); Chaotic Dynamics (nlin.CD)
*备注: 9 pages, 6 figures

点击查看摘要

Abstract:Characterizing dynamical systems given limited measurements is a common challenge throughout the physical and biological sciences. However, this task is challenging, especially due to transient variability in systems with equivalent long-term dynamics. We address this by introducing smooth prototype equivalences (SPE), a framework that fits a diffeomorphism using normalizing flows to distinct prototypes - simplified dynamical systems that define equivalence classes of behavior. SPE enables classification by comparing the deformation loss of the observed sparse, high-dimensional measurements to the prototype dynamics. Furthermore, our approach enables estimation of the invariant sets of the observed dynamics through the learned mapping from prototype space to data space. Our method outperforms existing techniques in the classification of oscillatory systems and can efficiently identify invariant structures like limit cycles and fixed points in an equation-free manner, even when only a small, noisy subset of the phase space is observed. Finally, we show how our method can be used for the detection of biological processes like the cell cycle trajectory from high-dimensional single-cell gene expression data.

[LG-20] Collaborative Speculative Inference for Efficient LLM Inference Serving

链接: https://arxiv.org/abs/2503.10325
作者: Luyao Gao,Jianchun Liu,Hongli Xu,Liusheng Huang
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Speculative inference is a promising paradigm employing small speculative models (SSMs) as drafters to generate draft tokens, which are subsequently verified in parallel by the target large language model (LLM). This approach enhances the efficiency of inference serving by reducing LLM inference latency and costs while preserving generation quality. However, existing speculative methods face critical challenges, including inefficient resource utilization and limited draft acceptance, which constrain their scalability and overall effectiveness. To overcome these obstacles, we present CoSine, a novel speculative inference system that decouples sequential speculative decoding from parallel verification, enabling efficient collaboration among multiple nodes. Specifically, CoSine routes inference requests to specialized drafters based on their expertise and incorporates a confidence-based token fusion mechanism to synthesize outputs from cooperating drafters, ensuring high-quality draft generation. Additionally, CoSine dynamically orchestrates the execution of speculative decoding and verification in a pipelined manner, employing batch scheduling to selectively group requests and adaptive speculation control to minimize idle periods. By optimizing parallel workflows through heterogeneous node collaboration, CoSine balances draft generation and verification throughput in real-time, thereby maximizing resource utilization. Experimental results demonstrate that CoSine achieves superior performance compared to state-of-the-art speculative approaches. Notably, with equivalent resource costs, CoSine achieves up to a 23.2% decrease in latency and a 32.5% increase in throughput compared to baseline methods.

[LG-21] Capturing Semantic Flow of ML-based Systems

链接: https://arxiv.org/abs/2503.10310
作者: Shin Yoo,Robert Feldt,Somin Kim,Naryeong Kim
类目: oftware Engineering (cs.SE); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:ML-based systems are software systems that incorporates machine learning components such as Deep Neural Networks (DNNs) or Large Language Models (LLMs). While such systems enable advanced features such as high performance computer vision, natural language processing, and code generation, their internal behaviour remain largely opaque to traditional dynamic analysis such as testing: existing analysis typically concern only what is observable from the outside, such as input similarity or class label changes. We propose semantic flow, a concept designed to capture the internal behaviour of ML-based system and to provide a platform for traditional dynamic analysis techniques to be adapted to. Semantic flow combines the idea of control flow with internal states taken from executions of ML-based systems, such as activation values of a specific layer in a DNN, or embeddings of LLM responses at a specific inference step of LLM agents. The resulting representation, summarised as semantic flow graphs, can capture internal decisions that are not explicitly represented in the traditional control flow of ML-based systems. We propose the idea of semantic flow, introduce two examples using a DNN and an LLM agent, and finally sketch its properties and how it can be used to adapt existing dynamic analysis techniques for use in ML-based software systems.

[LG-22] HyperArm Bandit Optimization: A Novel approach to Hyperparameter Optimization and an Analysis of Bandit Algorithms in Stochastic and Adversarial Settings

链接: https://arxiv.org/abs/2503.10282
作者: Samih Karroum,Saad Mazhar
类目: Machine Learning (cs.LG)
*备注: 41 pages, 9 figures

点击查看摘要

Abstract:This paper explores the application of bandit algorithms in both stochastic and adversarial settings, with a focus on theoretical analysis and practical applications. The study begins by introducing bandit problems, distinguishing between stochastic and adversarial variants, and examining key algorithms such as Explore-Then-Commit (ETC), Upper Confidence Bound (UCB), and Exponential-Weight Algorithm for Exploration and Exploitation (EXP3). Theoretical regret bounds are analyzed to compare the performance of these algorithms. The paper then introduces a novel framework, HyperArm Bandit Optimization (HABO), which applies EXP3 to hyperparameter tuning in machine learning models. Unlike traditional methods that treat entire configurations as arms, HABO treats individual hyperparameters as super-arms, and its potential configurations as sub-arms, enabling dynamic resource allocation and efficient exploration. Experimental results demonstrate HABO’s effectiveness in classification and regression tasks, outperforming Bayesian Optimization in terms of computational efficiency and accuracy. The paper concludes with insights into the convergence guarantees of HABO and its potential for scalable and robust hyperparameter optimization.

[LG-23] Resource efficient data transmission on animals based on machine learning

链接: https://arxiv.org/abs/2503.10277
作者: Wilhelm Kerle-Malcharek,Karsten Klein,Martin Wikelski,Falk Schreiber,Timm A. Wild
类目: Machine Learning (cs.LG); Emerging Technologies (cs.ET); Information Retrieval (cs.IR)
*备注: Submitted to Scientific Reports but not published, 23 pages, 5 figures, 3 tables

点击查看摘要

Abstract:Bio-loggers, electronic devices used to track animal behaviour through various sensors, have become essential in wildlife research. Despite continuous improvements in their capabilities, bio-loggers still face significant limitations in storage, processing, and data transmission due to the constraints of size and weight, which are necessary to avoid disturbing the animals. This study aims to explore how selective data transmission, guided by machine learning, can reduce the energy consumption of bio-loggers, thereby extending their operational lifespan without requiring hardware modifications. Comments: Submitted to Scientific Reports but not published, 23 pages, 5 figures, 3 tables Subjects: Machine Learning (cs.LG); Emerging Technologies (cs.ET); Information Retrieval (cs.IR) Cite as: arXiv:2503.10277 [cs.LG] (or arXiv:2503.10277v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2503.10277 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[LG-24] argeted Data Poisoning for Black-Box Audio Datasets Ownership Verification ICASSP2025

链接: https://arxiv.org/abs/2503.10269
作者: Wassim Bouaziz,El-Mahdi El-Mhamdi,Nicolas Usunier
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注: Published at ICASSP 2025, 5 pages, 7 figures

点击查看摘要

Abstract:Protecting the use of audio datasets is a major concern for data owners, particularly with the recent rise of audio deep learning models. While watermarks can be used to protect the data itself, they do not allow to identify a deep learning model trained on a protected dataset. In this paper, we adapt to audio data the recently introduced data taggants approach. Data taggants is a method to verify if a neural network was trained on a protected image dataset with top- k predictions access to the model only. This method relies on a targeted data poisoning scheme by discreetly altering a small fraction (1%) of the dataset as to induce a harmless behavior on out-of-distribution data called keys. We evaluate our method on the Speechcommands and the ESC50 datasets and state of the art transformer models, and show that we can detect the use of the dataset with high confidence without loss of performance. We also show the robustness of our method against common data augmentation techniques, making it a practical method to protect audio datasets.

[LG-25] AMR-Transformer: Enabling Efficient Long-range Interaction for Complex Neural Fluid Simulation

链接: https://arxiv.org/abs/2503.10257
作者: Zeyi Xu,Jinfan Liu,Kuangxu Chen,Ye Chen,Zhangli Hu,Bingbing Ni
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Accurately and efficiently simulating complex fluid dynamics is a challenging task that has traditionally relied on computationally intensive methods. Neural network-based approaches, such as convolutional and graph neural networks, have partially alleviated this burden by enabling efficient local feature extraction. However, they struggle to capture long-range dependencies due to limited receptive fields, and Transformer-based models, while providing global context, incur prohibitive computational costs. To tackle these challenges, we propose AMR-Transformer, an efficient and accurate neural CFD-solving pipeline that integrates a novel adaptive mesh refinement scheme with a Navier-Stokes constraint-aware fast pruning module. This design encourages long-range interactions between simulation cells and facilitates the modeling of global fluid wave patterns, such as turbulence and shockwaves. Experiments show that our approach achieves significant gains in efficiency while preserving critical details, making it suitable for high-resolution physical simulations with long-range dependencies. On CFDBench, PDEBench and a new shockwave dataset, our pipeline demonstrates up to an order-of-magnitude improvement in accuracy over baseline models. Additionally, compared to ViT, our approach achieves a reduction in FLOPs of up to 60 times.

[LG-26] Numerical Error Analysis of Large Language Models

链接: https://arxiv.org/abs/2503.10251
作者: Stanislav Budzinskiy,Wenyi Fang,Longbin Zeng,Philipp Petersen
类目: Numerical Analysis (math.NA); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Large language models based on transformer architectures have become integral to state-of-the-art natural language processing applications. However, their training remains computationally expensive and exhibits instabilities, some of which are expected to be caused by finite-precision computations. We provide a theoretical analysis of the impact of round-off errors within the forward pass of a transformer architecture which yields fundamental bounds for these effects. In addition, we conduct a series of numerical experiments which demonstrate the practical relevance of our bounds. Our results yield concrete guidelines for choosing hyperparameters that mitigate round-off errors, leading to more robust and stable inference.

[LG-27] Spherical dimension

链接: https://arxiv.org/abs/2503.10240
作者: Bogdan Chornomaz,Shay Moran,Tom Waknine
类目: Discrete Mathematics (cs.DM); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We introduce and study the spherical dimension, a natural topological relaxation of the VC dimension that unifies several results in learning theory where topology plays a key role in the proofs. The spherical dimension is defined by extending the set of realizable datasets (used to define the VC dimension) to the continuous space of realizable distributions. In this space, a shattered set of size d (in the VC sense) is completed into a continuous object, specifically a d-dimensional sphere of realizable distributions. The spherical dimension is then defined as the dimension of the largest sphere in this space. Thus, the spherical dimension is at least the VC dimension. The spherical dimension serves as a common foundation for leveraging the Borsuk-Ulam theorem and related topological tools. We demonstrate the utility of the spherical dimension in diverse applications, including disambiguations of partial concept classes, reductions from classification to stochastic convex optimization, stability and replicability, and sample compression schemes. Perhaps surprisingly, we show that the open question posed by Alon, Hanneke, Holzman, and Moran (FOCS 2021) of whether there exist non-trivial disambiguations for halfspaces with margin is equivalent to the basic open question of whether the VC and spherical dimensions are finite together. Subjects: Discrete Mathematics (cs.DM); Machine Learning (cs.LG) ACMclasses: I.2.6; G.2.1 Cite as: arXiv:2503.10240 [cs.DM] (or arXiv:2503.10240v1 [cs.DM] for this version) https://doi.org/10.48550/arXiv.2503.10240 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[LG-28] Flows on convex polytopes

链接: https://arxiv.org/abs/2503.10232
作者: Tomek Diederen,Nicola Zamboni
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We present a framework for modeling complex, high-dimensional distributions on convex polytopes by leveraging recent advances in discrete and continuous normalizing flows on Riemannian manifolds. We show that any full-dimensional polytope is homeomorphic to a unit ball, and our approach harnesses flows defined on the ball, mapping them back to the original polytope. Furthermore, we introduce a strategy to construct flows when only the vertex representation of a polytope is available, employing maximum entropy barycentric coordinates and Aitchison geometry. Our experiments take inspiration from applications in metabolic flux analysis and demonstrate that our methods achieve competitive density estimation, sampling accuracy, as well as fast training and inference times.

[LG-29] Policy Teaching via Data Poisoning in Learning from Human Preferences AISTATS2025

链接: https://arxiv.org/abs/2503.10228
作者: Andi Nika,Jonathan Nöther,Debmalya Mandal,Parameswaran Kamalaruban,Adish Singla,Goran Radanović
类目: Machine Learning (cs.LG)
*备注: In AISTATS 2025

点击查看摘要

Abstract:We study data poisoning attacks in learning from human preferences. More specifically, we consider the problem of teaching/enforcing a target policy \pi^\dagger by synthesizing preference data. We seek to understand the susceptibility of different preference-based learning paradigms to poisoned preference data by analyzing the number of samples required by the attacker to enforce \pi^\dagger . We first propose a general data poisoning formulation in learning from human preferences and then study it for two popular paradigms, namely: (a) reinforcement learning from human feedback (RLHF) that operates by learning a reward model using preferences; (b) direct preference optimization (DPO) that directly optimizes policy using preferences. We conduct a theoretical analysis of the effectiveness of data poisoning in a setting where the attacker is allowed to augment a pre-existing dataset and also study its special case where the attacker can synthesize the entire preference dataset from scratch. As our main results, we provide lower/upper bounds on the number of samples required to enforce \pi^\dagger . Finally, we discuss the implications of our results in terms of the susceptibility of these learning paradigms under such data poisoning attacks.

[LG-30] Probability-Flow ODE in Infinite-Dimensional Function Spaces ICLR2025

链接: https://arxiv.org/abs/2503.10219
作者: Kunwoo Na,Junghyun Lee,Se-Young Yun,Sungbin Lim
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 26 pages, 8 figures. Accepted to the ICLR 2025 DeLTa Workshop

点击查看摘要

Abstract:Recent advances in infinite-dimensional diffusion models have demonstrated their effectiveness and scalability in function generation tasks where the underlying structure is inherently infinite-dimensional. To accelerate inference in such models, we derive, for the first time, an analog of the probability-flow ODE (PF-ODE) in infinite-dimensional function spaces. Leveraging this newly formulated PF-ODE, we reduce the number of function evaluations while maintaining sample quality in function generation tasks, including applications to PDEs.

[LG-31] Moss: Proxy Model-based Full-Weight Aggregation in Federated Learning with Heterogeneous Models

链接: https://arxiv.org/abs/2503.10218
作者: Yifeng Cai,Ziqi Zhang,Ding Li,Yao Guo,Xiangqun Chen
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*备注: Accepted by ACM IMWUT/Ubicomp 2025

点击查看摘要

Abstract:Modern Federated Learning (FL) has become increasingly essential for handling highly heterogeneous mobile devices. Current approaches adopt a partial model aggregation paradigm that leads to sub-optimal model accuracy and higher training overhead. In this paper, we challenge the prevailing notion of partial-model aggregation and propose a novel “full-weight aggregation” method named Moss, which aggregates all weights within heterogeneous models to preserve comprehensive knowledge. Evaluation across various applications demonstrates that Moss significantly accelerates training, reduces on-device training time and energy consumption, enhances accuracy, and minimizes network bandwidth utilization when compared to state-of-the-art baselines.

[LG-32] An Real-Sim-Real (RSR) Loop Framework for Generalizable Robotic Policy Transfer with Differentiable Simulation

链接: https://arxiv.org/abs/2503.10118
作者: Lu Shi,Yuxuan Xu,Shiyu Wang,Jinhao Huang,Wenhao Zhao,Yufei Jia,Zike Yan,Weibin Gu,Guyue Zhou
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The sim-to-real gap remains a critical challenge in robotics, hindering the deployment of algorithms trained in simulation to real-world systems. This paper introduces a novel Real-Sim-Real (RSR) loop framework leveraging differentiable simulation to address this gap by iteratively refining simulation parameters, aligning them with real-world conditions, and enabling robust and efficient policy transfer. A key contribution of our work is the design of an informative cost function that encourages the collection of diverse and representative real-world data, minimizing bias and maximizing the utility of each data point for simulation refinement. This cost function integrates seamlessly into existing reinforcement learning algorithms (e.g., PPO, SAC) and ensures a balanced exploration of critical regions in the real domain. Furthermore, our approach is implemented on the versatile Mujoco MJX platform, and our framework is compatible with a wide range of robotic systems. Experimental results on several robotic manipulation tasks demonstrate that our method significantly reduces the sim-to-real gap, achieving high task performance and generalizability across diverse scenarios of both explicit and implicit environmental uncertainties.

[LG-33] Reconsidering Feature Structure Information and Latent Space Alignment in Partial Multi-label Feature Selection AAAI25

链接: https://arxiv.org/abs/2503.10115
作者: Hanlin Pan,Kunpeng Liu,Wanfu Gao
类目: Machine Learning (cs.LG)
*备注: 9pages,6 figures,accept at AAAI 25

点击查看摘要

Abstract:The purpose of partial multi-label feature selection is to select the most representative feature subset, where the data comes from partial multi-label datasets that have label ambiguity issues. For label disambiguation, previous methods mainly focus on utilizing the information inside the labels and the relationship between the labels and features. However, the information existing in the feature space is rarely considered, especially in partial multi-label scenarios where the noises is considered to be concentrated in the label space while the feature information is correct. This paper proposes a method based on latent space alignment, which uses the information mined in feature space to disambiguate in latent space through the structural consistency between labels and features. In addition, previous methods overestimate the consistency of features and labels in the latent space after convergence. We comprehensively consider the similarity of latent space projections to feature space and label space, and propose new feature selection term. This method also significantly improves the positive label identification ability of the selected features. Comprehensive experiments demonstrate the superiority of the proposed method.

[LG-34] IMPACT: Intelligent Motion Planning with Acceptable Contact Trajectories via Vision-Language Models

链接: https://arxiv.org/abs/2503.10110
作者: Yiyang Ling,Karan Owalekar,Oluwatobiloba Adesanya,Erdem Bıyık,Daniel Seita
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Motion planning involves determining a sequence of robot configurations to reach a desired pose, subject to movement and safety constraints. Traditional motion planning finds collision-free paths, but this is overly restrictive in clutter, where it may not be possible for a robot to accomplish a task without contact. In addition, contacts range from relatively benign (e.g., brushing a soft pillow) to more dangerous (e.g., toppling a glass vase). Due to this diversity, it is difficult to characterize which contacts may be acceptable or unacceptable. In this paper, we propose IMPACT, a novel motion planning framework that uses Vision-Language Models (VLMs) to infer environment semantics, identifying which parts of the environment can best tolerate contact based on object properties and locations. Our approach uses the VLM’s outputs to produce a dense 3D “cost map” that encodes contact tolerances and seamlessly integrates with standard motion planners. We perform experiments using 20 simulation and 10 real-world scenes and assess using task success rate, object displacements, and feedback from human evaluators. Our results over 3620 simulation and 200 real-world trials suggest that IMPACT enables efficient contact-rich motion planning in cluttered settings while outperforming alternative methods and ablations. Supplementary material is available at this https URL.

[LG-35] SOLA-GCL: Subgraph-Oriented Learnable Augmentation Method for Graph Contrastive Learning

链接: https://arxiv.org/abs/2503.10100
作者: Tianhao Peng,Xuhong Li,Haitao Yuan,Yuchen Li,Haoyi Xiong
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Graph contrastive learning has emerged as a powerful technique for learning graph representations that are robust and discriminative. However, traditional approaches often neglect the critical role of subgraph structures, particularly the intra-subgraph characteristics and inter-subgraph relationships, which are crucial for generating informative and diverse contrastive pairs. These subgraph features are crucial as they vary significantly across different graph types, such as social networks where they represent communities, and biochemical networks where they symbolize molecular interactions. To address this issue, our work proposes a novel subgraph-oriented learnable augmentation method for graph contrastive learning, termed SOLA-GCL, that centers around subgraphs, taking full advantage of the subgraph information for data augmentation. Specifically, SOLA-GCL initially partitions a graph into multiple densely connected subgraphs based on their intrinsic properties. To preserve and enhance the unique characteristics inherent to subgraphs, a graph view generator optimizes augmentation strategies for each subgraph, thereby generating tailored views for graph contrastive learning. This generator uses a combination of intra-subgraph and inter-subgraph augmentation strategies, including node dropping, feature masking, intra-edge perturbation, inter-edge perturbation, and subgraph swapping. Extensive experiments have been conducted on various graph learning applications, ranging from social networks to molecules, under semi-supervised learning, unsupervised learning, and transfer learning settings to demonstrate the superiority of our proposed approach over the state-of-the-art in GCL.

[LG-36] Enhanced Route Planning with Calibrated Uncertainty Set

链接: https://arxiv.org/abs/2503.10088
作者: Lingxuan Tang,Rui Luo,Zhixin Zhou,Nicolo Colombo
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: arXiv admin note: text overlap with arXiv:2406.08281

点击查看摘要

Abstract:This paper investigates the application of probabilistic prediction methodologies in route planning within a road network context. Specifically, we introduce the Conformalized Quantile Regression for Graph Autoencoders (CQR-GAE), which leverages the conformal prediction technique to offer a coverage guarantee, thus improving the reliability and robustness of our predictions. By incorporating uncertainty sets derived from CQR-GAE, we substantially improve the decision-making process in route planning under a robust optimization framework. We demonstrate the effectiveness of our approach by applying the CQR-GAE model to a real-world traffic scenario. The results indicate that our model significantly outperforms baseline methods, offering a promising avenue for advancing intelligent transportation systems.

[LG-37] Impact of buckypaper on the mechanical properties and failure modes of composites

链接: https://arxiv.org/abs/2503.10073
作者: Kartik Tripathi,Mohamed H. Hamza,Aditi Chattopadhyay,Todd C. Henry,Asha Hall
类目: Machine Learning (cs.LG)
*备注: In 38th Technical Conference of the American Society for Composites, ASC 2023 (pp. 2281-2297)

点击查看摘要

Abstract:Recently, there has been an interest in the incorporation of buckypaper (BP), or carbon nanotube (CNT) membranes, in composite laminates. Research has shown that using BP in contrast to nanotube doped resin enables the introduction of a higher CNT weight fraction which offers multiple benefits including higher piezo resistivity for health monitoring applications and enhanced mechanical response for structural applications. However, their impact on the deformation and failure mechanisms of composite laminates has not been investigated thoroughly. Understanding these issues experimentally would require a carefully executed test plan involving a multitude of design parameters such as BP geometry and placement, material anisotropy and variability, and laminate stacking sequence. This paper presents a deep learning (DL)-based surrogate model for studying the mechanical response of hybrid carbon fiber reinforced polymer (CFRP) composite laminates with BP interleaves under various mechanical loads. The surrogate model utilizes a long short-term memory architecture implemented within a DL framework and predicts the laminate global response for a given configuration, geometry, and loading condition. The DL framework training and cross-validation are performed via data acquisition from a series of three-point bend tests conducted through finite element analysis (FEA) and in-house experiments, respectively. The model predictions show good agreement with FEA simulations and experimental results, where CFRP with two BP interleaves showed enhanced flexural strength and modulus over pristine samples. This enhancement can be attributed to the excellent crack retardation capabilities of CNTs, particularly in the interlaminar region.

[LG-38] Model-Agnostic Knowledge Guided Correction for Improved Neural Surrogate Rollout

链接: https://arxiv.org/abs/2503.10048
作者: Bharat Srikishan,Daniel O’Malley,Mohamed Mehana,Nicholas Lubbers,Nikhil Muralidhar
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Modeling the evolution of physical systems is critical to many applications in science and engineering. As the evolution of these systems is governed by partial differential equations (PDEs), there are a number of computational simulations which resolve these systems with high accuracy. However, as these simulations incur high computational costs, they are infeasible to be employed for large-scale analysis. A popular alternative to simulators are neural network surrogates which are trained in a data-driven manner and are much more computationally efficient. However, these surrogate models suffer from high rollout error when used autoregressively, especially when confronted with training data paucity. Existing work proposes to improve surrogate rollout error by either including physical loss terms directly in the optimization of the model or incorporating computational simulators as `differentiable layers’ in the neural network. Both of these approaches have their challenges, with physical loss functions suffering from slow convergence for stiff PDEs and simulator layers requiring gradients which are not always available, especially in legacy simulators. We propose the Hybrid PDE Predictor with Reinforcement Learning (HyPER) model: a model-agnostic, RL based, cost-aware model which combines a neural surrogate, RL decision model, and a physics simulator (with or without gradients) to reduce surrogate rollout error significantly. In addition to reducing in-distribution rollout error by 47%-78%, HyPER learns an intelligent policy that is adaptable to changing physical conditions and resistant to noise corruption. Code available at this https URL.

[LG-39] A Neumann-Neumann Acceleration with Coarse Space for Domain Decomposition of Extreme Learning Machines

链接: https://arxiv.org/abs/2503.10032
作者: Chang-Ock Lee,Byungeun Ryoo
类目: Numerical Analysis (math.NA); Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG)
*备注: 21 pages, 6 figures, 6 tables

点击查看摘要

Abstract:Extreme learning machines (ELMs), which preset hidden layer parameters and solve for last layer coefficients via a least squares method, can typically solve partial differential equations faster and more accurately than Physics Informed Neural Networks. However, they remain computationally expensive when high accuracy requires large least squares problems to be solved. Domain decomposition methods (DDMs) for ELMs have allowed parallel computation to reduce training times of large systems. This paper constructs a coarse space for ELMs, which enables further acceleration of their training. By partitioning interface variables into coarse and non-coarse variables, selective elimination introduces a Schur complement system on the non-coarse variables with the coarse problem embedded. Key to the performance of the proposed method is a Neumann-Neumann acceleration that utilizes the coarse space. Numerical experiments demonstrate significant speedup compared to a previous DDM method for ELMs.

[LG-40] DGNet: A Neural Network Framework Induced by Discontinuous Galerkin Methods

链接: https://arxiv.org/abs/2503.10021
作者: Guanyu Chen,Shengze Xu,Dong Ni,Tieyong Zeng
类目: Machine Learning (cs.LG); Computational Physics (physics.comp-ph)
*备注:

点击查看摘要

Abstract:We propose a general framework for the Discontinuous Galerkin-induced Neural Network (DGNet) inspired by the Interior Penalty Discontinuous Galerkin Method (IPDGM). In this approach, the trial space consists of piecewise neural network space defined over the computational domain, while the test function space is composed of piecewise polynomials. We demonstrate the advantages of DGNet in terms of accuracy and training efficiency across several numerical examples, including stationary and time-dependent problems. Specifically, DGNet easily handles high perturbations, discontinuous solutions, and complex geometric domains.

[LG-41] Revisiting Multi-Agent Asynchronous Online Optimization with Delays: the Strongly Convex Case

链接: https://arxiv.org/abs/2503.10013
作者: Lingchan Bao,Tong Wei,Yuanyu Wan
类目: Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注:

点击查看摘要

Abstract:We revisit multi-agent asynchronous online optimization with delays, where only one of the agents becomes active for making the decision at each round, and the corresponding feedback is received by all the agents after unknown delays. Although previous studies have established an O(\sqrtdT) regret bound for this problem, they assume that the maximum delay d is knowable or the arrival order of feedback satisfies a special property, which may not hold in practice. In this paper, we surprisingly find that when the loss functions are strongly convex, these assumptions can be eliminated, and the existing regret bound can be significantly improved to O(d\log T) meanwhile. Specifically, to exploit the strong convexity of functions, we first propose a delayed variant of the classical follow-the-leader algorithm, namely FTDL, which is very simple but requires the full information of functions as feedback. Moreover, to handle the more general case with only the gradient feedback, we develop an approximate variant of FTDL by combining it with surrogate loss functions. Experimental results show that the approximate FTDL outperforms the existing algorithm in the strongly convex case.

[LG-42] From Equations to Insights: Unraveling Symbolic Structures in PDEs with LLM s

链接: https://arxiv.org/abs/2503.09986
作者: Rohan Bhatnagar,Ling Liang,Krish Patel,Haizhao Yang
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Motivated by the remarkable success of artificial intelligence (AI) across diverse fields, the application of AI to solve scientific problems-often formulated as partial differential equations (PDEs)-has garnered increasing attention. While most existing research concentrates on theoretical properties (such as well-posedness, regularity, and continuity) of the solutions, alongside direct AI-driven methods for solving PDEs, the challenge of uncovering symbolic relationships within these equations remains largely unexplored. In this paper, we propose leveraging large language models (LLMs) to learn such symbolic relationships. Our results demonstrate that LLMs can effectively predict the operators involved in PDE solutions by utilizing the symbolic information in the PDEs. Furthermore, we show that discovering these symbolic relationships can substantially improve both the efficiency and accuracy of the finite expression method for finding analytical approximation of PDE solutions, delivering a fully interpretable solution pipeline. This work opens new avenues for understanding the symbolic structure of scientific problems and advancing their solution processes.

[LG-43] Accuracy of Discretely Sampled Stochastic Policies in Continuous-time Reinforcement Learning

链接: https://arxiv.org/abs/2503.09981
作者: Yanwei Jia,Du Ouyang,Yufei Zhang
类目: Machine Learning (cs.LG); Numerical Analysis (math.NA); Optimization and Control (math.OC)
*备注:

点击查看摘要

Abstract:Stochastic policies are widely used in continuous-time reinforcement learning algorithms. However, executing a stochastic policy and evaluating its performance in a continuous-time environment remain open challenges. This work introduces and rigorously analyzes a policy execution framework that samples actions from a stochastic policy at discrete time points and implements them as piecewise constant controls. We prove that as the sampling mesh size tends to zero, the controlled state process converges weakly to the dynamics with coefficients aggregated according to the stochastic policy. We explicitly quantify the convergence rate based on the regularity of the coefficients and establish an optimal first-order convergence rate for sufficiently regular coefficients. Additionally, we show that the same convergence rates hold with high probability concerning the sampling noise, and further establish a 1/2 -order almost sure convergence when the volatility is not controlled. Building on these results, we analyze the bias and variance of various policy evaluation and policy gradient estimators based on discrete-time observations. Our results provide theoretical justification for the exploratory stochastic control framework in [H. Wang, T. Zariphopoulou, and X.Y. Zhou, J. Mach. Learn. Res., 21 (2020), pp. 1-34].

[LG-44] ype Information-Assisted Self-Supervised Knowledge Graph Denoising AISTATS2025

链接: https://arxiv.org/abs/2503.09916
作者: Jiaqi Sun,Yujia Zheng,Xinshuai Dong,Haoyue Dai,Kun Zhang
类目: Machine Learning (cs.LG)
*备注: Accepted by AISTATS 2025

点击查看摘要

Abstract:Knowledge graphs serve as critical resources supporting intelligent systems, but they can be noisy due to imperfect automatic generation processes. Existing approaches to noise detection often rely on external facts, logical rule constraints, or structural embeddings. These methods are often challenged by imperfect entity alignment, flexible knowledge graph construction, and overfitting on structures. In this paper, we propose to exploit the consistency between entity and relation type information for noise detection, resulting a novel self-supervised knowledge graph denoising method that avoids those problems. We formalize type inconsistency noise as triples that deviate from the majority with respect to type-dependent reasoning along the topological structure. Specifically, we first extract a compact representation of a given knowledge graph via an encoder that models the type dependencies of triples. Then, the decoder reconstructs the original input knowledge graph based on the compact representation. It is worth noting that, our proposal has the potential to address the problems of knowledge graph compression and completion, although this is not our focus. For the specific task of noise detection, the discrepancy between the reconstruction results and the input knowledge graph provides an opportunity for denoising, which is facilitated by the type consistency embedded in our method. Experimental validation demonstrates the effectiveness of our approach in detecting potential noise in real-world data.

[LG-45] Inter-environmental world modeling for continuous and compositional dynamics

链接: https://arxiv.org/abs/2503.09911
作者: Kohei Hayashi,Masanori Koyama,Julian Jorge Andrade Guerreiro
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Various world model frameworks are being developed today based on autoregressive frameworks that rely on discrete representations of actions and observations, and these frameworks are succeeding in constructing interactive generative models for the target environment of interest. Meanwhile, humans demonstrate remarkable generalization abilities to combine experiences in multiple environments to mentally simulate and learn to control agents in diverse environments. Inspired by this human capability, we introduce World modeling through Lie Action (WLA), an unsupervised framework that learns continuous latent action representations to simulate across environments. WLA learns a control interface with high controllability and predictive ability by simultaneously modeling the dynamics of multiple environments using Lie group theory and object-centric autoencoder. On synthetic benchmark and real-world datasets, we demonstrate that WLA can be trained using only video frames and, with minimal or no action labels, can quickly adapt to new environments with novel action sets.

[LG-46] A Semantic-Loss Function Modeling Framework With Task-Oriented Machine Learning Perspectives

链接: https://arxiv.org/abs/2503.09903
作者: Ti Ti Nguyen,Thanh-Dung Le,Vu Nguyen Ha,Hong-fu Chou,Geoffrey Eappen,Duc-Dung Tran,Hung Nguyen-Kha,Prabhu Thiruvasagam,Luis M. Garces-Socarras,Jorge L. Gonzalez-Rios,Juan C. Merlano-Duncan,Symeon Chatzinotas
类目: Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注: 6 pages, 11 figures

点击查看摘要

Abstract:The integration of machine learning (ML) has significantly enhanced the capabilities of Earth Observation (EO) systems by enabling the extraction of actionable insights from complex datasets. However, the performance of data-driven EO applications is heavily influenced by the data collection and transmission processes, where limited satellite bandwidth and latency constraints can hinder the full transmission of original data to the receivers. To address this issue, adopting the concepts of Semantic Communication (SC) offers a promising solution by prioritizing the transmission of essential data semantics over raw information. Implementing SC for EO systems requires a thorough understanding of the impact of data processing and communication channel conditions on semantic loss at the processing center. This work proposes a novel data-fitting framework to empirically model the semantic loss using real-world EO datasets and domain-specific insights. The framework quantifies two primary types of semantic loss: (1) source coding loss, assessed via a data quality indicator measuring the impact of processing on raw source data, and (2) transmission loss, evaluated by comparing practical transmission performance against the Shannon limit. Semantic losses are estimated by evaluating the accuracy of EO applications using four task-oriented ML models, EfficientViT, MobileViT, ResNet50-DINO, and ResNet8-KD, on lossy image datasets under varying channel conditions and compression ratios. These results underpin a framework for efficient semantic-loss modeling in bandwidth-constrained EO scenarios, enabling more reliable and effective operations.

[LG-47] racking the Best Expert Privately

链接: https://arxiv.org/abs/2503.09889
作者: Aadirupa Saha,Vinod Raman,Hilal Asi
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We design differentially private algorithms for the problem of prediction with expert advice under dynamic regret, also known as tracking the best expert. Our work addresses three natural types of adversaries, stochastic with shifting distributions, oblivious, and adaptive, and designs algorithms with sub-linear regret for all three cases. In particular, under a shifting stochastic adversary where the distribution may shift S times, we provide an \epsilon -differentially private algorithm whose expected dynamic regret is at most O\left( \sqrtS T \log (NT) + \fracS \log (NT)\epsilon\right) , where T and N are the epsilon horizon and number of experts, respectively. For oblivious adversaries, we give a reduction from dynamic regret minimization to static regret minimization, resulting in an upper bound of O\left(\sqrtS T \log(NT) + \fracS T^1/3\log(T/\delta) \log(NT)\epsilon^2/3\right) on the expected dynamic regret, where S now denotes the allowable number of switches of the best expert. Finally, similar to static regret, we establish a fundamental separation between oblivious and adaptive adversaries for the dynamic setting: while our algorithms show that sub-linear regret is achievable for oblivious adversaries in the high-privacy regime \epsilon \le \sqrtS/T , we show that any (\epsilon, \delta) -differentially private algorithm must suffer linear dynamic regret under adaptive adversaries for \epsilon \le \sqrtS/T . Finally, to complement this lower bound, we give an \epsilon -differentially private algorithm that attains sub-linear dynamic regret under adaptive adversaries whenever \epsilon \gg \sqrtS/T .

[LG-48] EquiPy: Sequential Fairness using Optimal Transport in Python

链接: https://arxiv.org/abs/2503.09866
作者: Agathe Fernandes Machado,Suzie Grondin,Philipp Ratz,Arthur Charpentier,François Hu
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Algorithmic fairness has received considerable attention due to the failures of various predictive AI systems that have been found to be unfairly biased against subgroups of the population. Many approaches have been proposed to mitigate such biases in predictive systems, however, they often struggle to provide accurate estimates and transparent correction mechanisms in the case where multiple sensitive variables, such as a combination of gender and race, are involved. This paper introduces a new open source Python package, EquiPy, which provides a easy-to-use and model agnostic toolbox for efficiently achieving fairness across multiple sensitive variables. It also offers comprehensive graphic utilities to enable the user to interpret the influence of each sensitive variable within a global context. EquiPy makes use of theoretical results that allow the complexity arising from the use of multiple variables to be broken down into easier-to-solve sub-problems. We demonstrate the ease of use for both mitigation and interpretation on publicly available data derived from the US Census and provide sample code for its use.

[LG-49] An Asymmetric Independence Model for Causal Discovery on Path Spaces

链接: https://arxiv.org/abs/2503.09859
作者: Georg Manten,Cecilia Casolo,Søren Wengel Mogensen,Niki Kilbertus
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: appeared at CLeaR 2025

点击查看摘要

Abstract:We develop the theory linking ‘E-separation’ in directed mixed graphs (DMGs) with conditional independence relations among coordinate processes in stochastic differential equations (SDEs), where causal relationships are determined by “which variables enter the governing equation of which other variables”. We prove a global Markov property for cyclic SDEs, which naturally extends to partially observed cyclic SDEs, because our asymmetric independence model is closed under marginalization. We then characterize the class of graphs that encode the same set of independence relations, yielding a result analogous to the seminal ‘same skeleton and v-structures’ result for directed acyclic graphs (DAGs). In the fully observed case, we show that each such equivalence class of graphs has a greatest element as a parsimonious representation and develop algorithms to identify this greatest element from data. We conjecture that a greatest element also exists under partial observations, which we verify computationally for graphs with up to four nodes.

[LG-50] abNSA: Native Sparse Attention for Efficient Tabular Data Learning

链接: https://arxiv.org/abs/2503.09850
作者: Ali Eslamian,Qiang Cheng
类目: Machine Learning (cs.LG)
*备注: 5 pages, 4 tables

点击查看摘要

Abstract:Tabular data poses unique challenges for deep learning due to its heterogeneous features and lack of inherent spatial structure. This paper introduces TabNSA, a novel deep learning architecture leveraging Native Sparse Attention (NSA) specifically for efficient tabular data processing. TabNSA incorporates a dynamic hierarchical sparse strategy, combining coarse-grained feature compression with fine-grained feature selection to preserve both global context awareness and local precision. By dynamically focusing on relevant subsets of features, TabNSA effectively captures intricate feature interactions. Extensive experiments demonstrate that TabNSA consistently outperforms existing methods, including both deep learning architectures and ensemble decision trees, achieving state-of-the-art performance across various benchmark datasets.

[LG-51] A Comprehensive Review on Understanding the Decentralized and Collaborative Approach in Machine Learning

链接: https://arxiv.org/abs/2503.09833
作者: Sarwar Saif,Md Jahirul Islam,Md. Zihad Bin Jahangir,Parag Biswas,Abdur Rashid,MD Abdullah Al Nasim,Kishor Datta Gupta
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The arrival of Machine Learning (ML) completely changed how we can unlock valuable information from data. Traditional methods, where everything was stored in one place, had big problems with keeping information private, handling large amounts of data, and avoiding unfair advantages. Machine Learning has become a powerful tool that uses Artificial Intelligence (AI) to overcome these challenges. We started by learning the basics of Machine Learning, including the different types like supervised, unsupervised, and reinforcement learning. We also explored the important steps involved, such as preparing the data, choosing the right model, training it, and then checking its performance. Next, we examined some key challenges in Machine Learning, such as models learning too much from specific examples (overfitting), not learning enough (underfitting), and reflecting biases in the data used. Moving beyond centralized systems, we looked at decentralized Machine Learning and its benefits, like keeping data private, getting answers faster, and using a wider variety of data sources. We then focused on a specific type called federated learning, where models are trained without directly sharing sensitive information. Real-world examples from healthcare and finance were used to show how collaborative Machine Learning can solve important problems while still protecting information security. Finally, we discussed challenges like communication efficiency, dealing with different types of data, and security. We also explored using a Zero Trust framework, which provides an extra layer of protection for collaborative Machine Learning systems. This approach is paving the way for a bright future for this groundbreaking technology.

[LG-52] SE(3)-Equivariant Robot Learning and Control: A Tutorial Survey

链接: https://arxiv.org/abs/2503.09829
作者: Joohwan Seo,Soochul Yoo,Junwoo Chang,Hyunseok An,Hyunwoo Ryu,Soomi Lee,Arvind Kruthiventy,Jongeun CHoi,Roberto Horowitz
类目: Robotics (cs.RO); Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注: Submitted to International Journcal of Control, Automation and Systems (IJCAS), Under Review

点击查看摘要

Abstract:Recent advances in deep learning and Transformers have driven major breakthroughs in robotics by employing techniques such as imitation learning, reinforcement learning, and LLM-based multimodal perception and decision-making. However, conventional deep learning and Transformer models often struggle to process data with inherent symmetries and invariances, typically relying on large datasets or extensive data augmentation. Equivariant neural networks overcome these limitations by explicitly integrating symmetry and invariance into their architectures, leading to improved efficiency and generalization. This tutorial survey reviews a wide range of equivariant deep learning and control methods for robotics, from classic to state-of-the-art, with a focus on SE(3)-equivariant models that leverage the natural 3D rotational and translational symmetries in visual robotic manipulation and control design. Using unified mathematical notation, we begin by reviewing key concepts from group theory, along with matrix Lie groups and Lie algebras. We then introduce foundational group-equivariant neural network design and show how the group-equivariance can be obtained through their structure. Next, we discuss the applications of SE(3)-equivariant neural networks in robotics in terms of imitation learning and reinforcement learning. The SE(3)-equivariant control design is also reviewed from the perspective of geometric control. Finally, we highlight the challenges and future directions of equivariant methods in developing more robust, sample-efficient, and multi-modal real-world robotic systems.

[LG-53] Batch List-Decodable Linear Regression via Higher Moments

链接: https://arxiv.org/abs/2503.09802
作者: Ilias Diakonikolas,Daniel M. Kane,Sushrut Karmalkar,Sihan Liu,Thanasis Pittas
类目: Machine Learning (cs.LG); Data Structures and Algorithms (cs.DS); Statistics Theory (math.ST); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:We study the task of list-decodable linear regression using batches. A batch is called clean if it consists of i.i.d. samples from an unknown linear regression distribution. For a parameter \alpha \in (0, 1/2) , an unknown \alpha -fraction of the batches are clean and no assumptions are made on the remaining ones. The goal is to output a small list of vectors at least one of which is close to the true regressor vector in \ell_2 -norm. [DJKS23] gave an efficient algorithm, under natural distributional assumptions, with the following guarantee. Assuming that the batch size n satisfies n \geq \tilde\Omega(\alpha^-1) and the number of batches is m = \mathrmpoly(d, n, 1/\alpha) , their algorithm runs in polynomial time and outputs a list of O(1/\alpha^2) vectors at least one of which is \tildeO(\alpha^-1/2/\sqrtn) close to the target regressor. Here we design a new polynomial time algorithm with significantly stronger guarantees under the assumption that the low-degree moments of the covariates distribution are Sum-of-Squares (SoS) certifiably bounded. Specifically, for any constant \delta0 , as long as the batch size is n \geq \Omega_\delta(\alpha^-\delta) and the degree- \Theta(1/\delta) moments of the covariates are SoS certifiably bounded, our algorithm uses m = \mathrmpoly((dn)^1/\delta, 1/\alpha) batches, runs in polynomial-time, and outputs an O(1/\alpha) -sized list of vectors one of which is O(\alpha^-\delta/2/\sqrtn) close to the target. That is, our algorithm achieves substantially smaller minimum batch size and final error, while achieving the optimal list size. Our approach uses higher-order moment information by carefully combining the SoS paradigm interleaved with an iterative method and a novel list pruning procedure. In the process, we give an SoS proof of the Marcinkiewicz-Zygmund inequality that may be of broader applicability.

[LG-54] Minimal Time Series Transformer

链接: https://arxiv.org/abs/2503.09791
作者: Joni-Kristian Kämäräinen
类目: Machine Learning (cs.LG)
*备注: 8 pages, 8 figures

点击查看摘要

Abstract:Transformer is the state-of-the-art model for many natural language processing, computer vision, and audio analysis problems. Transformer effectively combines information from the past input and output samples in auto-regressive manner so that each sample becomes aware of all inputs and outputs. In sequence-to-sequence (Seq2Seq) modeling, the transformer processed samples become effective in predicting the next output. Time series forecasting is a Seq2Seq problem. The original architecture is defined for discrete input and output sequence tokens, but to adopt it for time series, the model must be adapted for continuous data. This work introduces minimal adaptations to make the original transformer architecture suitable for continuous value time series data.

[LG-55] Designing Graph Convolutional Neural Networks for Discrete Choice with Network Effects

链接: https://arxiv.org/abs/2503.09786
作者: Daniel F. Villarraga,Ricardo A. Daziano
类目: Machine Learning (cs.LG); Econometrics (econ.EM); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:We introduce a novel model architecture that incorporates network effects into discrete choice problems, achieving higher predictive performance than standard discrete choice models while offering greater interpretability than general-purpose flexible model classes. Econometric discrete choice models aid in studying individual decision-making, where agents select the option with the highest reward from a discrete set of alternatives. Intuitively, the utility an individual derives from a particular choice depends on their personal preferences and characteristics, the attributes of the alternative, and the value their peers assign to that alternative or their previous choices. However, most applications ignore peer influence, and models that do consider peer or network effects often lack the flexibility and predictive performance of recently developed approaches to discrete choice, such as deep learning. We propose a novel graph convolutional neural network architecture to model network effects in discrete choices, achieving higher predictive performance than standard discrete choice models while retaining the interpretability necessary for inference–a quality often lacking in general-purpose deep learning architectures. We evaluate our architecture using revealed commuting choice data, extended with travel times and trip costs for each travel mode for work-related trips in New York City, as well as 2016 U.S. election data aggregated by county, to test its performance on datasets with highly imbalanced classes. Given the interpretability of our models, we can estimate relevant economic metrics, such as the value of travel time savings in New York City. Finally, we compare the predictive performance and behavioral insights from our architecture to those derived from traditional discrete choice and general-purpose deep learning models.

[LG-56] Learning richness modulates equality reasoning in neural networks

链接: https://arxiv.org/abs/2503.09781
作者: William L. Tong,Cengiz Pehlevan
类目: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
*备注: 28 pages, 8 figures, code available at this https URL

点击查看摘要

Abstract:Equality reasoning is ubiquitous and purely abstract: sameness or difference may be evaluated no matter the nature of the underlying objects. As a result, same-different tasks (SD) have been extensively studied as a starting point for understanding abstract reasoning in humans and across animal species. With the rise of neural networks (NN) that exhibit striking apparent proficiency for abstractions, equality reasoning in NNs has also gained interest. Yet despite extensive study, conclusions about equality reasoning vary widely and with little consensus. To clarify the underlying principles in learning SD, we develop a theory of equality reasoning in multi-layer perceptrons (MLP). Following observations in comparative psychology, we propose a spectrum of behavior that ranges from conceptual to perceptual outcomes. Conceptual behavior is characterized by task-specific representations, efficient learning, and insensitivity to spurious perceptual details. Perceptual behavior is characterized by strong sensitivity to spurious perceptual details, accompanied by the need for exhaustive training to learn the task. We develop a mathematical theory to show that an MLP’s behavior is driven by learning richness. Rich-regime MLPs exhibit conceptual behavior, whereas lazy-regime MLPs exhibit perceptual behavior. We validate our theoretical findings in vision SD experiments, showing that rich feature learning promotes success by encouraging hallmarks of conceptual behavior. Overall, our work identifies feature learning richness as a key parameter modulating equality reasoning, and suggests that equality reasoning in humans and animals may similarly depend on learning richness in neural circuits.

[LG-57] Real-Time Risky Fault-Chain Search using Time-Varying Graph RNNs

链接: https://arxiv.org/abs/2503.09775
作者: Anmol Dwivedi,Ali Tajer
类目: Machine Learning (cs.LG); Signal Processing (eess.SP); Systems and Control (eess.SY)
*备注: arXiv admin note: substantial text overlap with arXiv:2303.08864

点击查看摘要

Abstract:This paper introduces a data-driven graphical framework for the real-time search of risky cascading fault chains (FCs) in power-grids, crucial for enhancing grid resiliency in the face of climate change. As extreme weather events driven by climate change increase, identifying risky FCs becomes crucial for mitigating cascading failures and ensuring grid stability. However, the complexity of the spatio-temporal dependencies among grid components and the exponential growth of the search space with system size pose significant challenges to modeling and risky FC search. To tackle this, we model the search process as a partially observable Markov decision process (POMDP), which is subsequently solved via a time-varying graph recurrent neural network (GRNN). This approach captures the spatial and temporal structure induced by the system’s topology and dynamics, while efficiently summarizing the system’s history in the GRNN’s latent space, enabling scalable and effective identification of risky FCs.

[LG-58] Cover Learning for Large-Scale Topology Representation

链接: https://arxiv.org/abs/2503.09767
作者: Luis Scoccola,Uzu Lim,Heather A. Harrington
类目: Machine Learning (cs.LG); Computational Geometry (cs.CG); Algebraic Topology (math.AT); Machine Learning (stat.ML)
*备注: 26 pages, 17 figures, 4 tables

点击查看摘要

Abstract:Classical unsupervised learning methods like clustering and linear dimensionality reduction parametrize large-scale geometry when it is discrete or linear, while more modern methods from manifold learning find low dimensional representation or infer local geometry by constructing a graph on the input data. More recently, topological data analysis popularized the use of simplicial complexes to represent data topology with two main methodologies: topological inference with geometric complexes and large-scale topology visualization with Mapper graphs – central to these is the nerve construction from topology, which builds a simplicial complex given a cover of a space by subsets. While successful, these have limitations: geometric complexes scale poorly with data size, and Mapper graphs can be hard to tune and only contain low dimensional information. In this paper, we propose to study the problem of learning covers in its own right, and from the perspective of optimization. We describe a method for learning topologically-faithful covers of geometric datasets, and show that the simplicial complexes thus obtained can outperform standard topological inference approaches in terms of size, and Mapper-type algorithms in terms of representation of large-scale topology.

[LG-59] Distributionally Robust Multi-Agent Reinforcement Learning for Dynamic Chute Mapping

链接: https://arxiv.org/abs/2503.09755
作者: Guangyi Liu,Suzan Iloglu,Michael Caldara,Joseph W. Durham,Michael M. Zavlanos
类目: Machine Learning (cs.LG); Robotics (cs.RO)
*备注:

点击查看摘要

Abstract:In Amazon robotic warehouses, the destination-to-chute mapping problem is crucial for efficient package sorting. Often, however, this problem is complicated by uncertain and dynamic package induction rates, which can lead to increased package recirculation. To tackle this challenge, we introduce a Distributionally Robust Multi-Agent Reinforcement Learning (DRMARL) framework that learns a destination-to-chute mapping policy that is resilient to adversarial variations in induction rates. Specifically, DRMARL relies on group distributionally robust optimization (DRO) to learn a policy that performs well not only on average but also on each individual subpopulation of induction rates within the group that capture, for example, different seasonality or operation modes of the system. This approach is then combined with a novel contextual bandit-based predictor of the worst-case induction distribution for each state-action pair, significantly reducing the cost of exploration and thereby increasing the learning efficiency and scalability of our framework. Extensive simulations demonstrate that DRMARL achieves robust chute mapping in the presence of varying induction distributions, reducing package recirculation by an average of 80% in the simulation scenario.

[LG-60] How Feasible is Augmenting Fake Nodes with Learnable Features as a Counter-strategy against Link Stealing Attacks?

链接: https://arxiv.org/abs/2503.09726
作者: Mir Imtiaz Mostafiz,Imtiaz Karim,Elisa Bertino
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*备注: Preprint for the Accepted Work in The 15th ACM Conference on Data and Application Security and Privacy (CODASPY’25)}, 14 pages

点击查看摘要

Abstract:Graph Neural Networks (GNNs) are widely used and deployed for graph-based prediction tasks. However, as good as GNNs are for learning graph data, they also come with the risk of privacy leakage. For instance, an attacker can run carefully crafted queries on the GNNs and, from the responses, can infer the existence of an edge between a pair of nodes. This attack, dubbed as a “link-stealing” attack, can jeopardize the user’s privacy by leaking potentially sensitive information. To protect against this attack, we propose an approach called " (N) ode (A) ugmentation for ® estricting (G) raphs from (I) nsinuating their (S) tructure" ( NARGIS ) and study its feasibility. NARGIS is focused on reshaping the graph embedding space so that the posterior from the GNN model will still provide utility for the prediction task but will introduce ambiguity for the link-stealing attackers. To this end, NARGIS applies spectral clustering on the given graph to facilitate it being augmented with new nodes – that have learned features instead of fixed ones. It utilizes tri-level optimization for learning parameters for the GNN model, surrogate attacker model, and our defense model (i.e. learnable node features). We extensively evaluate NARGIS on three benchmark citation datasets over eight knowledge availability settings for the attackers. We also evaluate the model fidelity and defense performance on influence-based link inference attacks. Through our studies, we have figured out the best feature of NARGIS – its superior fidelity-privacy performance trade-off in a significant number of cases. We also have discovered in which cases the model needs to be improved, and proposed ways to integrate different schemes to make the model more robust against link stealing attacks.

[LG-61] he Pitfalls of Imitation Learning when Actions are Continuous

链接: https://arxiv.org/abs/2503.09722
作者: Max Simchowitz,Daniel Pfrommer,Ali Jadbabaie
类目: Machine Learning (cs.LG); Systems and Control (eess.SY); Machine Learning (stat.ML)
*备注: 98 pages, 2 figures

点击查看摘要

Abstract:We study the problem of imitating an expert demonstrator in a discrete-time, continuous state-and-action control system. We show that, even if the dynamics are stable (i.e. contracting exponentially quickly), and the expert is smooth and deterministic, any smooth, deterministic imitator policy necessarily suffers error on execution that is exponentially larger, as a function of problem horizon, than the error under the distribution of expert training data. Our negative result applies to both behavior cloning and offline-RL algorithms, unless they produce highly “improper” imitator policies–those which are non-smooth, non-Markovian, or which exhibit highly state-dependent stochasticity–or unless the expert trajectory distribution is sufficiently “spread.” We provide experimental evidence of the benefits of these more complex policy parameterizations, explicating the benefits of today’s popular policy parameterizations in robot learning (e.g. action-chunking and Diffusion Policies). We also establish a host of complementary negative and positive results for imitation in control systems.

[LG-62] owards Causal Model-Based Policy Optimization

链接: https://arxiv.org/abs/2503.09719
作者: Alberto Caron,Vasilios Mavroudis,Chris Hicks
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Real-world decision-making problems are often marked by complex, uncertain dynamics that can shift or break under changing conditions. Traditional Model-Based Reinforcement Learning (MBRL) approaches learn predictive models of environment dynamics from queried trajectories and then use these models to simulate rollouts for policy optimization. However, such methods do not account for the underlying causal mechanisms that govern the environment, and thus inadvertently capture spurious correlations, making them sensitive to distributional shifts and limiting their ability to generalize. The same naturally holds for model-free approaches. In this work, we introduce Causal Model-Based Policy Optimization (C-MBPO), a novel framework that integrates causal learning into the MBRL pipeline to achieve more robust, explainable, and generalizable policy learning algorithms. Our approach centers on first inferring a Causal Markov Decision Process (C-MDP) by learning a local Structural Causal Model (SCM) of both the state and reward transition dynamics from trajectories gathered online. C-MDPs differ from classic MDPs in that we can decompose causal dependencies in the environment dynamics via specifying an associated Causal Bayesian Network. C-MDPs allow for targeted interventions and counterfactual reasoning, enabling the agent to distinguish between mere statistical correlations and causal relationships. The learned SCM is then used to simulate counterfactual on-policy transitions and rewards under hypothetical actions (or ``interventions"), thereby guiding policy optimization more effectively. The resulting policy learned by C-MBPO can be shown to be robust to a class of distributional shifts that affect spurious, non-causal relationships in the dynamics. We demonstrate this through some simple experiments involving near and far OOD dynamics drifts. Subjects: Machine Learning (cs.LG) Cite as: arXiv:2503.09719 [cs.LG] (or arXiv:2503.09719v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2503.09719 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[LG-63] MoE-Gen: High-Throughput MoE Inference on a Single GPU with Module-Based Batching

链接: https://arxiv.org/abs/2503.09716
作者: Tairan Xu,Leyang Xue,Zhan Lu,Adrian Jackson,Luo Mai
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This paper presents MoE-Gen, a high-throughput MoE inference system optimized for single-GPU execution. Existing inference systems rely on model-based or continuous batching strategies, originally designed for interactive inference, which result in excessively small batches for MoE’s key modules-attention and expert modules-leading to poor throughput. To address this, we introduce module-based batching, which accumulates tokens in host memory and dynamically launches large batches on GPUs to maximize utilization. Additionally, we optimize the choice of batch sizes for each module in an MoE to fully overlap GPU computation and communication, maximizing throughput. Evaluation demonstrates that MoE-Gen achieves 8-31x higher throughput compared to state-of-the-art systems employing model-based batching (FlexGen, MoE-Lightning, DeepSpeed), and offers even greater throughput improvements over continuous batching systems (e.g., vLLM and Ollama) on popular MoE models (DeepSeek and Mixtral) across offline inference tasks. MoE-Gen’s source code is publicly available at this https URL

[LG-64] owards Hardware Supported Domain Generalization in DNN-Based Edge Computing Devices for Health Monitoring

链接: https://arxiv.org/abs/2503.09661
作者: Johnson Loh,Lyubov Dudchenko,Justus Viga,Tobias Gemmeke
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Deep neural network (DNN) models have shown remarkable success in many real-world scenarios, such as object detection and classification. Unfortunately, these models are not yet widely adopted in health monitoring due to exceptionally high requirements for model robustness and deployment in highly resource-constrained devices. In particular, the acquisition of biosignals, such as electrocardiogram (ECG), is subject to large variations between training and deployment, necessitating domain generalization (DG) for robust classification quality across sensors and patients. The continuous monitoring of ECG also requires the execution of DNN models in convenient wearable devices, which is achieved by specialized ECG accelerators with small form factor and ultra-low power consumption. However, combining DG capabilities with ECG accelerators remains a challenge. This article provides a comprehensive overview of ECG accelerators and DG methods and discusses the implication of the combination of both domains, such that multi-domain ECG monitoring is enabled with emerging algorithm-hardware co-optimized systems. Within this context, an approach based on correction layers is proposed to deploy DG capabilities on the edge. Here, the DNN fine-tuning for unknown domains is limited to a single layer, while the remaining DNN model remains unmodified. Thus, computational complexity (CC) for DG is reduced with minimal memory overhead compared to conventional fine-tuning of the whole DNN model. The DNN model-dependent CC is reduced by more than 2.5x compared to DNN fine-tuning at an average increase of F1 score by more than 20% on the generalized target domain. In summary, this article provides a novel perspective on robust DNN classification on the edge for health monitoring applications.

[LG-65] Edge AI for Real-time Fetal Assessment in Rural Guatemala

链接: https://arxiv.org/abs/2503.09659
作者: Nasim Katebi,Mohammad Ahmad,Mohsen Motie-Shirazi,Daniel Phan,Ellen Kolesnikova,Sepideh Nikookar,Alireza Rafiei,Murali K. Korikana,Rachel Hall-Clifford,Esteban Castro,Rosibely Sut,Enma Coyote,Anahi Venzor Strader,Edlyn Ramos,Peter Rohloff,Reza Sameni,Gari D. Clifford
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Perinatal complications, defined as conditions that arise during pregnancy, childbirth, and the immediate postpartum period, represent a significant burden on maternal and neonatal health worldwide. Factors contributing to these disparities include limited access to quality healthcare, socioeconomic inequalities, and variations in healthcare infrastructure. Addressing these issues is crucial for improving health outcomes for mothers and newborns, particularly in underserved communities. To mitigate these challenges, we have developed an AI-enabled smartphone application designed to provide decision support at the point-of-care. This tool aims to enhance health monitoring during pregnancy by leveraging machine learning (ML) techniques. The intended use of this application is to assist midwives during routine home visits by offering real-time analysis and providing feedback based on collected data. The application integrates TensorFlow Lite (TFLite) and other Python-based algorithms within a Kotlin framework to process data in real-time. It is designed for use in low-resource settings, where traditional healthcare infrastructure may be lacking. The intended patient population includes pregnant women and new mothers in underserved areas and the developed system was piloted in rural Guatemala. This ML-based solution addresses the critical need for accessible and quality perinatal care by empowering healthcare providers with decision support tools to improve maternal and neonatal health outcomes.

[LG-66] ýr-the-Pruner: Unlocking Accurate 50% Structural Pruning for LLM s via Global Sparsity Distribution Optimization

链接: https://arxiv.org/abs/2503.09657
作者: Guanchen Li,Yixing Xu,Zeping Li,Ji Liu,Xuanwu Yin,Dong Li,Emad Barsoum
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Structural pruning enhances hardware-agnostic inference efficiency for large language models (LLMs) but often struggles to maintain performance. Local pruning performs efficient layer-by-layer compression but ignores global topology. Global pruning has the potential to find the optimal solution although resource-intensive. However, existing methods tend to rank structural saliency uniformly, ignoring inter-structure dependencies and failing to achieve end-to-end optimization. To address these limitations, we propose Týr-the-Pruner, an efficient end-to-end search-based global structural pruning framework. This framework constructs a supernet by repeatedly applying local pruning across a range of sparsity ratios to each layer in an LLM, with the core goal of determining the optimal sparsity distribution under a target overall sparsity ratio. Concretely, we introduce an effective local pruning and an expectation error accumulation approach to improve supernet construction. Furthermore, we employ an iterative prune-and-search strategy with coarse-to-fine sparsity granularity to ensure efficient search convergence. Experimental results show that Týr-the-Pruner achieves state-of-the-art structural pruning, retaining 97% of the dense model’s performance while removing a challenging 50% of Llama-3.1-70B’s parameters.

[LG-67] A Deep Reinforcement Learning Approach to Automated Stock Trading using xLSTM Networks

链接: https://arxiv.org/abs/2503.09655
作者: Faezeh Sarlakifar,Mohammadreza Mohammadzadeh Asl,Sajjad Rezvani Khaledi,Armin Salimi-Badr
类目: Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG); Trading and Market Microstructure (q-fin.TR)
*备注:

点击查看摘要

Abstract:Traditional Long Short-Term Memory (LSTM) networks are effective for handling sequential data but have limitations such as gradient vanishing and difficulty in capturing long-term dependencies, which can impact their performance in dynamic and risky environments like stock trading. To address these limitations, this study explores the usage of the newly introduced Extended Long Short Term Memory (xLSTM) network in combination with a deep reinforcement learning (DRL) approach for automated stock trading. Our proposed method utilizes xLSTM networks in both actor and critic components, enabling effective handling of time series data and dynamic market environments. Proximal Policy Optimization (PPO), with its ability to balance exploration and exploitation, is employed to optimize the trading strategy. Experiments were conducted using financial data from major tech companies over a comprehensive timeline, demonstrating that the xLSTM-based model outperforms LSTM-based methods in key trading evaluation metrics, including cumulative return, average profitability per trade, maximum earning rate, maximum pullback, and Sharpe ratio. These findings mark the potential of xLSTM for enhancing DRL-based stock trading systems.

[LG-68] Bags of Projected Nearest Neighbours: Competitors to Random Forests?

链接: https://arxiv.org/abs/2503.09651
作者: David P. Hofmeyr
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: Currently under submission for potential publication by IEEE

点击查看摘要

Abstract:In this paper we introduce a simple and intuitive adaptive k nearest neighbours classifier, and explore its utility within the context of bootstrap aggregating (“bagging”). The approach is based on finding discriminant subspaces which are computationally efficient to compute, and are motivated by enhancing the discrimination of classes through nearest neighbour classifiers. This adaptiveness promotes diversity of the individual classifiers fit across different bootstrap samples, and so further leverages the variance reducing effect of bagging. Extensive experimental results are presented documenting the strong performance of the proposed approach in comparison with Random Forest classifiers, as well as other nearest neighbours based ensembles from the literature, plus other relevant benchmarks. Code to implement the proposed approach is available in the form of an R package from this https URL.

[LG-69] Inductive Spatio-Temporal Kriging with Physics-Guided Increment Training Strategy for Air Quality Inference

链接: https://arxiv.org/abs/2503.09646
作者: Songlin Yang,Tao Yang,Bo Hu
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The deployment of sensors for air quality monitoring is constrained by high costs, leading to inadequate network coverage and data deficits in some areas. Utilizing existing observations, spatio-temporal kriging is a method for estimating air quality at unobserved locations during a specific period. Inductive spatio-temporal kriging with increment training strategy has demonstrated its effectiveness using virtual nodes to simulate unobserved nodes. However, a disparity between virtual and real nodes persists, complicating the application of learning patterns derived from virtual nodes to actual unobserved ones. To address these limitations, this paper presents a Physics-Guided Increment Training Strategy (PGITS). Specifically, we design a dynamic graph generation module to incorporate the advection and diffusion processes of airborne particles as physical knowledge into the graph structure, dynamically adjusting the adjacency matrix to reflect physical interactions between nodes. By using physics principles as a bridge between virtual and real nodes, this strategy ensures the features of virtual nodes and their pseudo labels are closer to actual nodes. Consequently, the learned patterns of virtual nodes can be applied to actual unobserved nodes for effective kriging.

[LG-70] FedMSGL: A Self-Expressive Hypergraph Based Federated Multi-View Learning AAAI2025

链接: https://arxiv.org/abs/2503.09643
作者: Daoyuan Li,Zuyuan Yang,Shengli Xie
类目: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注: Accept by AAAI2025

点击查看摘要

Abstract:Federated learning is essential for enabling collaborative model training across decentralized data sources while preserving data privacy and security. This approach mitigates the risks associated with centralized data collection and addresses concerns related to data ownership and compliance. Despite significant advancements in federated learning algorithms that address communication bottlenecks and enhance privacy protection, existing works overlook the impact of differences in data feature dimensions, resulting in global models that disproportionately depend on participants with large feature dimensions. Additionally, current single-view federated learning methods fail to account for the unique characteristics of multi-view data, leading to suboptimal performance in processing such data. To address these issues, we propose a Self-expressive Hypergraph Based Federated Multi-view Learning method (FedMSGL). The proposed method leverages self-expressive character in the local training to learn uniform dimension subspace with latent sample relation. At the central side, an adaptive fusion technique is employed to generate the global model, while constructing a hypergraph from the learned global and view-specific subspace to capture intricate interconnections across views. Experiments on multi-view datasets with different feature dimensions validated the effectiveness of the proposed method.

[LG-71] APECS: Adaptive Personalized Control System Architecture

链接: https://arxiv.org/abs/2503.09624
作者: Marius F. R. Juston,Alex Gisi,William R. Norris,Dustin Nottage,Ahmet Soylemezoglu
类目: ystems and Control (eess.SY); Machine Learning (cs.LG); Robotics (cs.RO)
*备注: 8 pages, 11 figures

点击查看摘要

Abstract:This paper presents the Adaptive Personalized Control System (APECS) architecture, a novel framework for human-in-the-loop control. An architecture is developed which defines appropriate constraints for the system objectives. A method for enacting Lipschitz and sector bounds on the resulting controller is derived to ensure desirable control properties. An analysis of worst-case loss functions and the optimal loss function weighting is made to implement an effective training scheme. Finally, simulations are carried out to demonstrate the effectiveness of the proposed architecture. This architecture resulted in a 4.5% performance increase compared to the human operator and 9% to an unconstrained feedforward neural network trained in the same way.

[LG-72] On the Injective Norm of Sums of Random Tensors and the Moments of Gaussian Chaoses

链接: https://arxiv.org/abs/2503.10580
作者: Ishaq Aden-Ali
类目: Probability (math.PR); Machine Learning (cs.LG); Statistics Theory (math.ST)
*备注: 12 pages

点击查看摘要

Abstract:We prove an upper bound on the expected \ell_p injective norm of sums of subgaussian random tensors. Our proof is simple and does not rely on any explicit geometric or chaining arguments. Instead, it follows from a simple application of the PAC-Bayesian lemma, a tool that has proven effective at controlling the suprema of certain ``smooth’’ empirical processes in recent years. Our bound strictly improves a very recent result of Bandeira, Gopi, Jiang, Lucca, and Rothvoss. In the Euclidean case ( p=2 ), our bound sharpens a result of Latała that was central to proving his estimates on the moments of Gaussian chaoses. As a consequence, we obtain an elementary proof of this fundamental result.

[LG-73] Sample and Map from a Single Convex Potential: Generation using Conjugate Moment Measures

链接: https://arxiv.org/abs/2503.10576
作者: Nina Vesseron,Louis Béthune,Marco Cuturi
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:A common approach to generative modeling is to split model-fitting into two blocks: define first how to sample noise (e.g. Gaussian) and choose next what to do with it (e.g. using a single map or flows). We explore in this work an alternative route that ties sampling and mapping. We find inspiration in moment measures, a result that states that for any measure \rho supported on a compact convex set of \mathbbR^d , there exists a unique convex potential u such that \rho=\nabla u,\sharp,e^-u . While this does seem to tie effectively sampling (from log-concave distribution e^-u ) and action (pushing particles through \nabla u ), we observe on simple examples (e.g., Gaussians or 1D distributions) that this choice is ill-suited for practical tasks. We study an alternative factorization, where \rho is factorized as \nabla w^,\sharp,e^-w , where w^ is the convex conjugate of w . We call this approach conjugate moment measures, and show far more intuitive results on these examples. Because \nabla w^* is the Monge map between the log-concave distribution e^-w and \rho , we rely on optimal transport solvers to propose an algorithm to recover w from samples of \rho , and parameterize w as an input-convex neural network.

[LG-74] Extreme Learning Machines for Attention-based Multiple Instance Learning in Whole-Slide Image Classification

链接: https://arxiv.org/abs/2503.10510
作者: Rajiv Krishnakumar,Julien Baglio,Frederik F. Flöther,Christian Ruiz,Stefan Habringer,Nicole H. Romano
类目: Quantitative Methods (q-bio.QM); Machine Learning (cs.LG); Quantum Physics (quant-ph)
*备注:

点击查看摘要

Abstract:Whole-slide image classification represents a key challenge in computational pathology and medicine. Attention-based multiple instance learning (MIL) has emerged as an effective approach for this problem. However, the effect of attention mechanism architecture on model performance is not well-documented for biomedical imagery. In this work, we compare different methods and implementations of MIL, including deep learning variants. We introduce a new method using higher-dimensional feature spaces for deep MIL. We also develop a novel algorithm for whole-slide image classification where extreme machine learning is combined with attention-based MIL to improve sensitivity and reduce training complexity. We apply our algorithms to the problem of detecting circulating rare cells (CRCs), such as erythroblasts, in peripheral blood. Our results indicate that nonlinearities play a key role in the classification, as removing them leads to a sharp decrease in stability in addition to a decrease in average area under the curve (AUC) of over 4%. We also demonstrate a considerable increase in robustness of the model with improvements of over 10% in average AUC when higher-dimensional feature spaces are leveraged. In addition, we show that extreme learning machines can offer clear improvements in terms of training efficiency by reducing the number of trained parameters by a factor of 5 whilst still maintaining the average AUC to within 1.5% of the deep MIL model. Finally, we discuss options of enriching the classical computing framework with quantum algorithms in the future. This work can thus help pave the way towards more accurate and efficient single-cell diagnostics, one of the building blocks of precision medicine.

[LG-75] Meta-learning characteristics and dynamics of quantum systems

链接: https://arxiv.org/abs/2503.10492
作者: Lucas Schorling,Pranav Vaidhyanathan,Jonas Schuff,Miguel J. Carballido,Dominik Zumbühl,Gerard Milburn,Florian Marquardt,Jakob Foerster,Michael A. Osborne,Natalia Ares
类目: Quantum Physics (quant-ph); Mesoscale and Nanoscale Physics (cond-mat.mes-hall); Machine Learning (cs.LG); Computational Physics (physics.comp-ph)
*备注: 6+1 pages, 4 figures. L. Schorling and P. Vaidhyanathan contributed equally to this work

点击查看摘要

Abstract:While machine learning holds great promise for quantum technologies, most current methods focus on predicting or controlling a specific quantum system. Meta-learning approaches, however, can adapt to new systems for which little data is available, by leveraging knowledge obtained from previous data associated with similar systems. In this paper, we meta-learn dynamics and characteristics of closed and open two-level systems, as well as the Heisenberg model. Based on experimental data of a Loss-DiVincenzo spin-qubit hosted in a Ge/Si core/shell nanowire for different gate voltage configurations, we predict qubit characteristics i.e. g -factor and Rabi frequency using meta-learning. The algorithm we introduce improves upon previous state-of-the-art meta-learning methods for physics-based systems by introducing novel techniques such as adaptive learning rates and a global optimizer for improved robustness and increased computational efficiency. We benchmark our method against other meta-learning methods, a vanilla transformer, and a multilayer perceptron, and demonstrate improved performance.

[LG-76] Representation Learning Large-Scale 3D Molecular Pretraining Molecular Property

链接: https://arxiv.org/abs/2503.10489
作者: Shuqi Lu,Xiaohong Ji,Bohang Zhang,Lin Yao,Siyuan Liu,Zhifeng Gao,Linfeng Zhang,Guolin Ke
类目: Biomolecules (q-bio.BM); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Molecular pretrained representations (MPR) has emerged as a powerful approach for addressing the challenge of limited supervised data in applications such as drug discovery and material design. While early MPR methods relied on 1D sequences and 2D graphs, recent advancements have incorporated 3D conformational information to capture rich atomic interactions. However, these prior models treat molecules merely as discrete atom sets, overlooking the space surrounding them. We argue from a physical perspective that only modeling these discrete points is insufficient. We first present a simple yet insightful observation: naively adding randomly sampled virtual points beyond atoms can surprisingly enhance MPR performance. In light of this, we propose a principled framework that incorporates the entire 3D space spanned by molecules. We implement the framework via a novel Transformer-based architecture, dubbed SpaceFormer, with three key components: (1) grid-based space discretization; (2) grid sampling/merging; and (3) efficient 3D positional encoding. Extensive experiments show that SpaceFormer significantly outperforms previous 3D MPR models across various downstream tasks with limited data, validating the benefit of leveraging the additional 3D space beyond atoms in MPR models.

[LG-77] Deep Learning based discovery of Integrable Systems

链接: https://arxiv.org/abs/2503.10469
作者: Shailesh Lal,Suvajit Majumder,Evgeny Sobko
类目: High Energy Physics - Theory (hep-th); Machine Learning (cs.LG); Mathematical Physics (math-ph); Quantum Algebra (math.QA); Quantum Physics (quant-ph)
*备注: 11 pages, 2 column text, 3 figures, Mathematica notebook with example Hamiltonians

点击查看摘要

Abstract:We introduce a novel machine learning based framework for discovering integrable models. Our approach first employs a synchronized ensemble of neural networks to find high-precision numerical solution to the Yang-Baxter equation within a specified class. Then, using an auxiliary system of algebraic equations, [Q_2, Q_3] = 0, and the numerical value of the Hamiltonian obtained via deep learning as a seed, we reconstruct the entire Hamiltonian family, forming an algebraic variety. We illustrate our presentation with three- and four-dimensional spin chains of difference form with local interactions. Remarkably, all discovered Hamiltonian families form rational varieties.

[LG-78] BioSerenity-E1: a self-supervised EEG model for medical applications

链接: https://arxiv.org/abs/2503.10362
作者: Ruggero G. Bettinardi,Mohamed Rahmouni,Ulysse Gimenez
类目: Quantitative Methods (q-bio.QM); Machine Learning (cs.LG); Signal Processing (eess.SP); Neurons and Cognition (q-bio.NC)
*备注:

点击查看摘要

Abstract:Electroencephalography (EEG) serves as an essential diagnostic tool in neurology; however, its accurate manual interpretation is a time-intensive process that demands highly specialized expertise, which remains relatively scarce and not consistently accessible. To address these limitations, the implementation of automated pre-screening and analysis systems for EEG data holds considerable promise. Advances in self-supervised learning made it possible to pre-train complex deep learning architectures on large volumes of unlabeled EEG data to learn generalizable representations, that can later be used to enhance performance on multiple tasks while needing less downstream data. In the present paper, we introduce BioSerenity-E1, the first of a family of self-supervised foundation models for clinical EEG applications that combines spectral tokenization with masked prediction to achieve state-of-the-art performance across relevant diagnostic tasks. The two-phase self-supervised pretraining framework initially acquires compressed EEG representations via a transformer-based VQ-VAE architecture designed to reconstruct log-multitaper spectral projections, then implements extensive (70% block) masked token prediction to force the model to learn complex spatiotemporal dependencies in EEG signals. BioSerenity-E1 achieves strong performance across three clinical tasks, either in line or above state-of-the-art methods: seizure detection (AUROC = 0.926, Sensitivity = 0.909), normal/abnormal classification (AUPRC = 0.970 on proprietary data; 0.910 on TUH-Abnormal), and multiclass pathology differentiation on unbalanced data (Weighted F1 = 0.730). The utility of BioSerenity-E1 is further confirmed in low-data regimes scenarios, showing clear improvements in AUPRC (from +2% to 17%) when trained on less than 10% of the available data.

[LG-79] Robust Learning-Based Sparse Recovery for Device Activity Detection in Grant-Free Random Access Cell-Free Massive MIMO: Enhancing Resilience to Impairments

链接: https://arxiv.org/abs/2503.10280
作者: Ali Elkeshawy,Haifa Fares,Amor Nafkha
类目: ignal Processing (eess.SP); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Massive MIMO is considered a key enabler to support massive machine-type communication (mMTC). While massive access schemes have been extensively analyzed for co-located massive MIMO arrays, this paper explores activity detection in grant-free random access for mMTC within the context of cell-free massive MIMO systems, employing distributed antenna arrays. This sparse support recovery of device activity status is performed by a finite cluster of access points (APs) from a large number of geographically distributed APs collaborating to serve a larger number of devices. Active devices transmit non-orthogonal pilot sequences to APs, which forward the received signals to a central processing unit (CPU) for collaborative activity detection. This paper proposes a simple and efficient data-driven algorithm tailored for device activity detection, implemented centrally at the CPU. Furthermore, the study assesses the algorithm’s robustness to input perturbations and examines the effects of adopting fixed-point representation on its performance.

[LG-80] Numerically robust Gaussian state estimation with singular observation noise

链接: https://arxiv.org/abs/2503.10279
作者: Nicholas Krämer,Filip Tronarp
类目: Methodology (stat.ME); Machine Learning (cs.LG); Numerical Analysis (math.NA); Computation (stat.CO)
*备注:

点击查看摘要

Abstract:This article proposes numerically robust algorithms for Gaussian state estimation with singular observation noise. Our approach combines a series of basis changes with Bayes’ rule, transforming the singular estimation problem into a nonsingular one with reduced state dimension. In addition to ensuring low runtime and numerical stability, our proposal facilitates marginal-likelihood computations and Gauss-Markov representations of the posterior process. We analyse the proposed method’s computational savings and numerical robustness and validate our findings in a series of simulations.

[LG-81] Climate land use and other drivers impacts on island ecosystem services: a global review

链接: https://arxiv.org/abs/2503.10278
作者: Aristides Moustakas,Shiri Zemah-Shamir,Mirela Tase,Savvas Zotos,Nazli Demirel,Christos Zoumides,Irene Christoforidi,Turgay Dindaroglu,Tamer Albayrak,Cigdem Kaptan Ayhan,Mauro Fois,Paraskevi Manolaki,Attila D. Sandor,Ina Sieber,Valentini Stamatiadou,Elli Tzirkalli,Ioannis N. Vogiatzakis,Ziv Zemah-Shamir,George Zittis
类目: Populations and Evolution (q-bio.PE); Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
*备注: Article published in the journal: Science of the Total Environment. Free author’s version

点击查看摘要

Abstract:Islands are diversity hotspots and vulnerable to environmental degradation, climate variations, land use changes and societal crises. These factors can exhibit interactive impacts on ecosystem services. The study reviewed a large number of papers on the climate change-islands-ecosystem services topic worldwide. Potential inclusion of land use changes and other drivers of impacts on ecosystem services were sequentially also recorded. The study sought to investigate the impacts of climate change, land use change, and other non-climatic driver changes on island ecosystem services. Explanatory variables examined were divided into two categories: environmental variables and methodological ones. Environmental variables include sea zone geographic location, ecosystem, ecosystem services, climate, land use, other driver variables, Methodological variables include consideration of policy interventions, uncertainty assessment, cumulative effects of climate change, synergistic effects of climate change with land use change and other anthropogenic and environmental drivers, and the diversity of variables used in the analysis. Machine learning and statistical methods were used to analyze their effects on island ecosystem services. Negative climate change impacts on ecosystem services are better quantified by land use change or other non-climatic driver variables than by climate variables. The synergy of land use together with climate changes is modulating the impact outcome and critical for a better impact assessment. Analyzed together, there is little evidence of more pronounced for a specific sea zone, ecosystem, or ecosystem service. Climate change impacts may be underestimated due to the use of a single climate variable deployed in most studies. Policy interventions exhibit low classification accuracy in quantifying impacts indicating insufficient efficacy or integration in the studies.

[LG-82] Data augmentation using diffusion models to enhance inverse Ising inference

链接: https://arxiv.org/abs/2503.10154
作者: Yechan Lim,Sangwon Lee,Junghyo Jo
类目: Data Analysis, Statistics and Probability (physics.data-an); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Identifying model parameters from observed configurations poses a fundamental challenge in data science, especially with limited data. Recently, diffusion models have emerged as a novel paradigm in generative machine learning, capable of producing new samples that closely mimic observed data. These models learn the gradient of model probabilities, bypassing the need for cumbersome calculations of partition functions across all possible configurations. We explore whether diffusion models can enhance parameter inference by augmenting small datasets. Our findings demonstrate this potential through a synthetic task involving inverse Ising inference and a real-world application of reconstructing missing values in neural activity data. This study serves as a proof-of-concept for using diffusion models for data augmentation in physics-related problems, thereby opening new avenues in data science.

[LG-83] Are Convex Optimization Curves Convex?

链接: https://arxiv.org/abs/2503.10138
作者: Guy Barzilai,Ohad Shamir
类目: Optimization and Control (math.OC); Machine Learning (cs.LG)
*备注: 13 pages

点击查看摘要

Abstract:In this paper, we study when we might expect the optimization curve induced by gradient descent to be \emphconvex – precluding, for example, an initial plateau followed by a sharp decrease, making it difficult to decide when optimization should stop. Although such undesirable behavior can certainly occur when optimizing general functions, might it also occur in the benign and well-studied case of smooth convex functions? As far as we know, this question has not been tackled in previous work. We show, perhaps surprisingly, that the answer crucially depends on the choice of the step size. In particular, for the range of step sizes which are known to result in monotonic convergence to an optimal value, there is a regime where the optimization curve will be provably convex, and there is a regime where the curve can be non-convex. We also extend our results to gradient flow, and to the closely-related but different question of whether the gradient norm decreases monotonically.

[LG-84] Adaptive Moment Estimation Optimization Algorithm Using Projection Gradient for Deep Learning

链接: https://arxiv.org/abs/2503.10005
作者: Yongqi Li,Xiaowei Zhang
类目: Optimization and Control (math.OC); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Training deep neural networks is challenging. To accelerate training and enhance performance, we propose PadamP, a novel optimization algorithm. PadamP is derived by applying the adaptive estimation of the p-th power of the second-order moments under scale invariance, enhancing projection adaptability by modifying the projection discrimination condition. It is integrated into Adam-type algorithms, accelerating training, boosting performance, and improving generalization in deep learning. Combining projected gradient benefits with adaptive moment estimation, PadamP tackles unconstrained non-convex problems. Convergence for the non-convex case is analyzed, focusing on the decoupling of first-order moment estimation coefficients and second-order moment estimation coefficients. Unlike prior work relying on , our proof generalizes the convergence theorem, enhancing practicality. Experiments using VGG-16 and ResNet-18 on CIFAR-10 and CIFAR-100 show PadamP’s effectiveness, with notable performance on CIFAR-10/100, especially for VGG-16. The results demonstrate that PadamP outperforms existing algorithms in terms of convergence speed and generalization ability, making it a valuable addition to the field of deep learning optimization.

[LG-85] hermodynamic Bound on Energy and Negentropy Costs of Inference in Deep Neural Networks

链接: https://arxiv.org/abs/2503.09980
作者: Alexei V. Tkachenko
类目: atistical Mechanics (cond-mat.stat-mech); Machine Learning (cs.LG); Data Analysis, Statistics and Probability (physics.data-an); Neurons and Cognition (q-bio.NC)
*备注: 5 pages, 2 figures

点击查看摘要

Abstract:The fundamental thermodynamic bound is derived for the energy cost of inference in Deep Neural Networks (DNNs). By applying Landauer’s principle, we demonstrate that the linear operations in DNNs can, in principle, be performed reversibly, whereas the non-linear activation functions impose an unavoidable energy cost. The resulting theoretical lower bound on the inference energy is determined by the average number of neurons undergoing state transition for each inference. We also restate the thermodynamic bound in terms of negentropy, a metric which is more universal than energy for assessing thermodynamic cost of information processing. Concept of negentropy is further elaborated in the context of information processing in biological and engineered system as well as human intelligence. Our analysis provides insight into the physical limits of DNN efficiency and suggests potential directions for developing energy-efficient AI architectures that leverage reversible analog computing.

[LG-86] Predicting Tropical Cyclone Track Forecast Errors using a Probabilistic Neural Network

链接: https://arxiv.org/abs/2503.09840
作者: M.A. Fernandez,Elizabeth A. Barnes,Randal J. Barnes,Mark DeMaria,Marie McGraw,Galina Chirokova,Lixin Lu
类目: Atmospheric and Oceanic Physics (physics.ao-ph); Machine Learning (cs.LG)
*备注: 26 (single column) pages + 8 figures (appended supplemental: 20 pages + 17 figures)

点击查看摘要

Abstract:A new method for estimating tropical cyclone track uncertainty is presented and tested. This method uses a neural network to predict a bivariate normal distribution, which serves as an estimate for track uncertainty. We train the network and make predictions on forecasts from the National Hurricane Center (NHC), which currently uses static error distributions based on forecasts from the past five years for most applications. The neural network-based method produces uncertainty estimates that are dynamic and probabilistic. Further, the neural network-based method allows for probabilistic statements about tropical cyclone trajectories, including landfall probability, which we highlight. We show that our predictions are well calibrated using multiple metrics, that our method produces better uncertainty estimates than current NHC approaches, and that our method achieves similar performance to the Global Ensemble Forecast System. Once trained, the computational cost of predictions using this method is negligible, making it a strong candidate to improve the NHC’s operational estimations of tropical cyclone track uncertainty.

[LG-87] A practical guide to machine learning interatomic potentials – Status and future

链接: https://arxiv.org/abs/2503.09814
作者: Ryan Jacobs,Dane Morgan,Siamak Attarian,Jun Meng,Chen Shen,Zhenghao Wu,Clare Yijia Xie,Julia H. Yang,Nongnuch Artrith,Ben Blaiszik,Gerbrand Ceder,Kamal Choudhary,Gabor Csanyi,Ekin Dogus Cubuk,Bowen Deng,Ralf Drautz,Xiang Fu,Jonathan Godwin,Vasant Honavar,Olexandr Isayev,Anders Johansson,Boris Kozinsky,Stefano Martiniani,Shyue Ping Ong,Igor Poltavsky,KJ Schmidt,So Takamoto,Aidan Thompson,Julia Westermayr,Brandon M. Wood
类目: Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The rapid development and large body of literature on machine learning interatomic potentials (MLIPs) can make it difficult to know how to proceed for researchers who are not experts but wish to use these tools. The spirit of this review is to help such researchers by serving as a practical, accessible guide to the state-of-the-art in MLIPs. This review paper covers a broad range of topics related to MLIPs, including (i) central aspects of how and why MLIPs are enablers of many exciting advancements in molecular modeling, (ii) the main underpinnings of different types of MLIPs, including their basic structure and formalism, (iii) the potentially transformative impact of universal MLIPs for both organic and inorganic systems, including an overview of the most recent advances, capabilities, downsides, and potential applications of this nascent class of MLIPs, (iv) a practical guide for estimating and understanding the execution speed of MLIPs, including guidance for users based on hardware availability, type of MLIP used, and prospective simulation size and time, (v) a manual for what MLIP a user should choose for a given application by considering hardware resources, speed requirements, energy and force accuracy requirements, as well as guidance for choosing pre-trained potentials or fitting a new potential from scratch, (vi) discussion around MLIP infrastructure, including sources of training data, pre-trained potentials, and hardware resources for training, (vii) summary of some key limitations of present MLIPs and current approaches to mitigate such limitations, including methods of including long-range interactions, handling magnetic systems, and treatment of excited states, and finally (viii) we finish with some more speculative thoughts on what the future holds for the development and application of MLIPs over the next 3-10+ years.

[LG-88] Optimisation of the Accelerator Control by Reinforcement Learning: A Simulation-Based Approach

链接: https://arxiv.org/abs/2503.09665
作者: Anwar Ibrahim,Denis Derkach,Alexey Petrenko,Fedor Ratnikov,Maxim Kaledin
类目: Accelerator Physics (physics.acc-ph); Machine Learning (cs.LG)
*备注: Proceedings for Mathematical Modeling and Computational Physics, 2024 (MMCP2024)

点击查看摘要

Abstract:Optimizing accelerator control is a critical challenge in experimental particle physics, requiring significant manual effort and resource expenditure. Traditional tuning methods are often time-consuming and reliant on expert input, highlighting the need for more efficient approaches. This study aims to create a simulation-based framework integrated with Reinforcement Learning (RL) to address these challenges. Using \textttElegant as the simulation backend, we developed a Python wrapper that simplifies the interaction between RL algorithms and accelerator simulations, enabling seamless input management, simulation execution, and output analysis. The proposed RL framework acts as a co-pilot for physicists, offering intelligent suggestions to enhance beamline performance, reduce tuning time, and improve operational efficiency. As a proof of concept, we demonstrate the application of our RL approach to an accelerator control problem and highlight the improvements in efficiency and performance achieved through our methodology. We discuss how the integration of simulation tools with a Python-based RL framework provides a powerful resource for the accelerator physics community, showcasing the potential of machine learning in optimizing complex physical systems. Comments: Proceedings for Mathematical Modeling and Computational Physics, 2024 (MMCP2024) Subjects: Accelerator Physics (physics.acc-ph); Machine Learning (cs.LG) Cite as: arXiv:2503.09665 [physics.acc-ph] (or arXiv:2503.09665v1 [physics.acc-ph] for this version) https://doi.org/10.48550/arXiv.2503.09665 Focus to learn more arXiv-issued DOI via DataCite

[LG-89] Power Spectrum Signatures of Graphs

链接: https://arxiv.org/abs/2503.09660
作者: Karamatou Yacoubou Djima,Ka Man Yim
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Social and Information Networks (cs.SI)
*备注:

点击查看摘要

Abstract:Point signatures based on the Laplacian operators on graphs, point clouds, and manifolds have become popular tools in machine learning for graphs, clustering, and shape analysis. In this work, we propose a novel point signature, the power spectrum signature, a measure on \mathbbR defined as the squared graph Fourier transform of a graph signal. Unlike eigenvectors of the Laplacian from which it is derived, the power spectrum signature is invariant under graph automorphisms. We show that the power spectrum signature is stable under perturbations of the input graph with respect to the Wasserstein metric. We focus on the signature applied to classes of indicator functions, and its applications to generating descriptive features for vertices of graphs. To demonstrate the practical value of our signature, we showcase several applications in characterizing geometry and symmetries in point cloud data, and graph regression problems.

[LG-90] chnical Insights and Legal Considerations for Advancing Federated Learning in Bioinformatics

链接: https://arxiv.org/abs/2503.09649
作者: Daniele Malpetti,Marco Scutari,Francesco Gualdi,Jessica van Setten,Sander van der Laan,Saskia Haitjema,Aaron Mark Lee,Isabelle Hering,Francesca Mangili
类目: Other Quantitative Biology (q-bio.OT); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 13 pages, 4 figures

点击查看摘要

Abstract:Federated learning leverages data across institutions to improve clinical discovery while complying with data-sharing restrictions and protecting patient privacy. As the evolution of biobanks in genetics and systems biology has proved, accessing more extensive and varied data pools leads to a faster and more robust exploration and translation of results. More widespread use of federated learning may have the same impact in bioinformatics, allowing access to many combinations of genotypic, phenotypic and environmental information that are undercovered or not included in existing biobanks. This paper reviews the methodological, infrastructural and legal issues that academic and clinical institutions must address before implementing it. Finally, we provide recommendations for the reliable use of federated learning and its effective translation into clinical practice.

[LG-91] Learning second-order TVD flux limiters using differentiable solvers

链接: https://arxiv.org/abs/2503.09625
作者: Chenyang Huang,Amal S. Sebastian,Venkatasubramanian Viswanathan
类目: Fluid Dynamics (physics.flu-dyn); Machine Learning (cs.LG); Computational Physics (physics.comp-ph)
*备注:

点击查看摘要

Abstract:This paper presents a data-driven framework for learning optimal second-order total variation diminishing (TVD) flux limiters via differentiable simulations. In our fully differentiable finite volume solvers, the limiter functions are replaced by neural networks. By representing the limiter as a pointwise convex linear combination of the Minmod and Superbee limiters, we enforce both second-order accuracy and TVD constraints at all stages of training. Our approach leverages gradient-based optimization through automatic differentiation, allowing a direct backpropagation of errors from numerical solutions to the limiter parameters. We demonstrate the effectiveness of this method on various hyperbolic conservation laws, including the linear advection equation, the Burgers’ equation, and the one-dimensional Euler equations. Remarkably, a limiter trained solely on linear advection exhibits strong generalizability, surpassing the accuracy of most classical flux limiters across a range of problems with shocks and discontinuities. The learned flux limiters can be readily integrated into existing computational fluid dynamics codes, and the proposed methodology also offers a flexible pathway to systematically develop and optimize flux limiters for complex flow problems.

信息检索

[IR-0] Conversational Gold: Evaluating Personalized Conversational Search System using Gold Nuggets

链接: https://arxiv.org/abs/2503.09902
作者: Zahra Abbasiantaeb,Simon Lupart,Leif Azzopardi,Jeffery Dalton,Mohammad Aliannejadi
类目: Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:The rise of personalized conversational search systems has been driven by advancements in Large Language Models (LLMs), enabling these systems to retrieve and generate answers for complex information needs. However, the automatic evaluation of responses generated by Retrieval Augmented Generation (RAG) systems remains an understudied challenge. In this paper, we introduce a new resource for assessing the retrieval effectiveness and relevance of response generated by RAG systems, using a nugget-based evaluation framework. Built upon the foundation of TREC iKAT 2023, our dataset extends to the TREC iKAT 2024 collection, which includes 17 conversations and 20,575 relevance passage assessments, together with 2,279 extracted gold nuggets, and 62 manually written gold answers from NIST assessors. While maintaining the core structure of its predecessor, this new collection enables a deeper exploration of generation tasks in conversational settings. Key improvements in iKAT 2024 include: (1) ``gold nuggets’’ – concise, essential pieces of information extracted from relevant passages of the collection – which serve as a foundation for automatic response evaluation; (2) manually written answers to provide a gold standard for response evaluation; (3) unanswerable questions to evaluate model hallucination; (4) expanded user personas, providing richer contextual grounding; and (5) a transition from Personal Text Knowledge Base (PTKB) ranking to PTKB classification and selection. Built on this resource, we provide a framework for long-form answer generation evaluation, involving nuggets extraction and nuggets matching, linked to retrieval. This establishes a solid resource for advancing research in personalized conversational search and long-form answer generation. Our resources are publicly available at this https URL.

[IR-1] Improving the Reusability of Conversational Search Test Collections

链接: https://arxiv.org/abs/2503.09899
作者: Zahra Abbasiantaeb,Chuan Meng,Leif Azzopardi,Mohammad Aliannejadi
类目: Information Retrieval (cs.IR)
*备注: arXiv admin note: text overlap with arXiv:2405.05600

点击查看摘要

Abstract:Incomplete relevance judgments limit the reusability of test collections. When new systems are compared to previous systems that contributed to the pool, they often face a disadvantage. This is due to pockets of unjudged documents (called holes) in the test collection that the new systems return. The very nature of Conversational Search (CS) means that these holes are potentially larger and more problematic when evaluating systems. In this paper, we aim to extend CS test collections by employing Large Language Models (LLMs) to fill holes by leveraging existing judgments. We explore this problem using TREC iKAT 23 and TREC CAsT 22 collections, where information needs are highly dynamic and the responses are much more varied, leaving bigger holes to fill. Our experiments reveal that CS collections show a trend towards less reusability in deeper turns. Also, fine-tuning the Llama 3.1 model leads to high agreement with human assessors, while few-shot prompting the ChatGPT results in low agreement with humans. Consequently, filling the holes of a new system using ChatGPT leads to a higher change in the location of the new system. While regenerating the assessment pool with few-shot prompting the ChatGPT model and using it for evaluation achieves a high rank correlation with human-assessed pools. We show that filling the holes using few-shot training the Llama 3.1 model enables a fairer comparison between the new system and the systems contributed to the pool. Our hole-filling model based on few-shot training of the Llama 3.1 model can improve the reusability of test collections.

附件下载

点击下载今日全部论文列表

Arxiv今日论文 | 2025-03-14

目录

概览 (2025-03-14)

自然语言处理

计算机视觉

人工智能

机器学习

信息检索

附件下载