Arxiv Papers Today | 2025-03-25

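This post lists the latest papers retrieved from Arxiv.org on 2025-03-25. It is updated automatically and grouped into five broad areas: NLP, CV, ML, AI, and IR. To receive the list by email on a schedule, leave your email address in the comments.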

Note: the daily paper data is retrieved from Arxiv.org and updated automatically at around 12:00 every morning.

Reminder: if you would like to receive the daily paper data by email, leave your email address in the comments.

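[Quick read]: This paper addresses the fact that the mechanisms behind the performance and scalability of generative retrieval remain unclear. Its core is a systematic study of training and inference scaling laws in generative retrieval, in particular how model size, training data scale, and inference-time compute jointly affect retrieval performance. To make up for the lack of suitable evaluation metrics, the paper proposes a new measure inspired by contrastive entropy and generation loss, which provides a continuous performance signal and enables robust comparison across methods. Experiments show that n-gram-based methods align well with both training and inference scaling laws, especially when paired with larger LLMs, and that increasing inference compute yields substantial gains. LLaMA models consistently outperform T5 models, suggesting an advantage for larger decoder-only models in generative retrieval. Overall, the paper argues that the interaction of model size, data availability, and inference compute is key to unlocking the potential of generative retrieval.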

Link: https://arxiv.org/abs/2503.18941
Authors: Hongru Cai, Yongqi Li, Ruifeng Yuan, Wenjie Wang, Zhen Zhang, Wenjie Li, Tat-Seng Chua
Affiliations: National University of Singapore (Singapore); The Hong Kong Polytechnic University (Hong Kong SAR, China); University of Science and Technology of China (Hefei, China); Nanyang Technological University (Singapore); National University of Singapore (Singapore)
Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL)
Comments:

Abstract:Generative retrieval has emerged as a novel paradigm that leverages large language models (LLMs) to autoregressively generate document identifiers. Although promising, the mechanisms that underpin its performance and scalability remain largely unclear. We conduct a systematic investigation of training and inference scaling laws in generative retrieval, exploring how model size, training data scale, and inference-time compute jointly influence retrieval performance. To address the lack of suitable metrics, we propose a novel evaluation measure inspired by contrastive entropy and generation loss, providing a continuous performance signal that enables robust comparisons across diverse generative retrieval methods. Our experiments show that n-gram-based methods demonstrate strong alignment with both training and inference scaling laws, especially when paired with larger LLMs. Furthermore, increasing inference computation yields substantial performance gains, revealing that generative retrieval can significantly benefit from higher compute budgets at inference. Across these settings, LLaMA models consistently outperform T5 models, suggesting a particular advantage for larger decoder-only models in generative retrieval. Taken together, our findings underscore that model sizes, data availability, and inference computation interact to unlock the full potential of generative retrieval, offering new insights for designing and optimizing future systems.

[NLP-1] xKV: Cross-Layer SVD for KV-Cache Compression

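[Quick read]: This paper addresses the high memory consumption of large language models (LLMs) with long context windows, which must store Key and Value states (the KV-Cache) for many layers. Existing cross-layer methods either require expensive pretraining or assume high per-token cosine similarity across layers, an assumption that generally does not hold in practice. The key finding is that the dominant singular vectors of the KV-Cache are remarkably well aligned across layers. Building on this, the paper proposes xKV, a simple post-training method that applies Singular Value Decomposition (SVD) to the KV-Cache of grouped layers and consolidates it into a shared low-rank subspace, greatly reducing KV-Cache size. In the experiments, xKV achieves up to 6.8x higher compression rates than the state-of-the-art inter-layer technique while improving accuracy by 2.7%. It is also compatible with the emerging Multi-Head Latent Attention (MLA), reaching 3x compression on coding tasks without performance degradation. These results show xKV's capability and versatility in easing the memory bottleneck of long-context LLM inference.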

Link: https://arxiv.org/abs/2503.18893
Authors: Chi-Chih Chang, Chien-Yu Lin, Yash Akhauri, Wei-Cheng Lin, Kai-Chiang Wu, Luis Ceze, Mohamed S. Abdelfattah
Affiliations: Cornell University; University of Washington; National Yang Ming Chiao Tung University
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Abstract:Large Language Models (LLMs) with long context windows enable powerful applications but come at the cost of high memory consumption to store the Key and Value states (KV-Cache). Recent studies attempted to merge KV-cache from multiple layers into shared representations, yet these approaches either require expensive pretraining or rely on assumptions of high per-token cosine similarity across layers which generally does not hold in practice. We find that the dominant singular vectors are remarkably well-aligned across multiple layers of the KV-Cache. Exploiting this insight, we propose xKV, a simple post-training method that applies Singular Value Decomposition (SVD) on the KV-Cache of grouped layers. xKV consolidates the KV-Cache of multiple layers into a shared low-rank subspace, significantly reducing KV-Cache sizes. Through extensive evaluations on the RULER long-context benchmark with widely-used LLMs (e.g., Llama-3.1 and Qwen2.5), xKV achieves up to 6.8x higher compression rates than state-of-the-art inter-layer technique while improving accuracy by 2.7%. Moreover, xKV is compatible with the emerging Multi-Head Latent Attention (MLA) (e.g., DeepSeek-Coder-V2), yielding a notable 3x compression rates on coding tasks without performance degradation. These results highlight xKV’s strong capability and versatility in addressing memory bottlenecks for long-context LLM inference. Our code is publicly available at: this https URL.
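The idea of a shared low-rank subspace across grouped layers can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names, the choice to concatenate layers along the feature axis, and the toy shapes are all assumptions made for illustration, since the abstract does not specify the exact layout.

```python
import numpy as np

def shared_lowrank_compress(kv_layers, rank):
    """Compress per-layer caches of shape (tokens, dim) into one shared factor.

    Hypothetical helper: the layout is an assumption, not the paper's method.
    """
    # Concatenate the grouped layers along the feature axis: (tokens, layers * dim).
    stacked = np.concatenate(kv_layers, axis=1)
    # Thin SVD, then keep only the top-`rank` singular directions.
    u, s, vt = np.linalg.svd(stacked, full_matrices=False)
    coeffs = u[:, :rank] * s[:rank]   # shared per-token coefficients, (tokens, rank)
    recon = vt[:rank, :]              # per-layer reconstruction rows, (rank, layers * dim)
    return coeffs, recon

def reconstruct(coeffs, recon, num_layers):
    """Rebuild the approximate per-layer caches from the shared factors."""
    approx = coeffs @ recon
    return np.split(approx, num_layers, axis=1)

# Toy data: two layers that share a rank-4 token subspace.
rng = np.random.default_rng(0)
shared = rng.normal(size=(32, 4))
layers = [shared @ rng.normal(size=(4, 8)) for _ in range(2)]
coeffs, recon = shared_lowrank_compress(layers, rank=4)
restored = reconstruct(coeffs, recon, num_layers=2)
```

In this toy case the two factors hold 32*4 + 4*16 = 192 numbers instead of 32*16 = 512. Real caches are only approximately low-rank, so the chosen rank trades memory against reconstruction error.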

[NLP-2] SimpleRL-Zoo: Investigating and Taming Zero Reinforcement Learning for Open Base Models in the Wild

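[Quick read]: This paper studies zero RL training, in which reinforcement learning starts directly from a base model, and asks whether long chain-of-thought (CoT) reasoning emerges naturally across different base models. Most earlier reproductions focus on the Qwen2.5 series, but the authors find that those base models already show strong instruction-following and self-reflection abilities and so may not be representative, which motivates studying a more diverse set of base models. With design strategies such as adjusting the format reward and controlling query difficulty, the paper achieves substantial improvements in both reasoning accuracy and response length in most settings. Careful monitoring of the training dynamics shows that different base models exhibit distinct patterns during training. Notably, the paper observes the "aha moment" for the first time in small models outside the Qwen family. The authors share the key designs that make zero RL training succeed and open-source the code, models, and analysis tools to support further research.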

Link: https://arxiv.org/abs/2503.18892
Authors: Weihao Zeng, Yuzhen Huang, Qian Liu, Wei Liu, Keqing He, Zejun Ma, Junxian He
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:DeepSeek-R1 has shown that long chain-of-thought (CoT) reasoning can naturally emerge through a simple reinforcement learning (RL) framework with rule-based rewards, where the training may directly start from the base models, a paradigm referred to as zero RL training. Most recent efforts to reproduce zero RL training have primarily focused on the Qwen2.5 model series, which may not be representative as we find the base models already exhibit strong instruction-following and self-reflection abilities. In this work, we investigate zero RL training across 10 diverse base models, spanning different families and sizes including LLama3-8B, Mistral-7B/24B, DeepSeek-Math-7B, Qwen2.5-math-7B, and all Qwen2.5 models from 0.5B to 32B. Leveraging several key design strategies, such as adjusting format reward and controlling query difficulty, we achieve substantial improvements in both reasoning accuracy and response length across most settings. However, by carefully monitoring the training dynamics, we observe that different base models exhibit distinct patterns during training. For instance, the increased response length does not always correlate with the emergence of certain cognitive behaviors such as verification (i.e., the “aha moment”). Notably, we observe the “aha moment” for the first time in small models not from the Qwen family. We share the key designs that enable successful zero RL training, along with our findings and practices. To facilitate further research, we open-source the code, models, and analysis tools.

[NLP-3] Agent Dropout: Dynamic Agent Elimination for Token-Efficient and High-Performance LLM-Based Multi-Agent Collaboration

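[Quick read]: This paper addresses the low communication efficiency and suboptimal task performance of multi-agent systems (MAS) built on large language models (LLMs) in collaborative problem-solving. The proposed method, AgentDropout, optimizes the adjacency matrices of the communication graphs to identify redundant agents and redundant communication across communication rounds, then eliminates them. This improves token efficiency (an average reduction of 21.6% in prompt token consumption and 18.4% in completion token consumption) and task performance (an improvement of 1.14 on the tasks). The core idea is to adjust agent roles and the communication topology dynamically so that collaboration becomes more efficient.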

Link: https://arxiv.org/abs/2503.18891
Authors: Zhexuan Wang, Yutong Wang, Xuebo Liu, Liang Ding, Miao Zhang, Jie Liu, Min Zhang
Affiliations: Institute of Computing and Intelligence, Harbin Institute of Technology, Shenzhen, China; The University of Sydney, Sydney, Australia; School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen, China; State Key Lab of Smart Farm Technologies and Systems, Harbin Institute of Technology, Harbin, China
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Multi-agent systems (MAS) based on large language models (LLMs) have demonstrated significant potential in collaborative problem-solving. However, they still face substantial challenges of low communication efficiency and suboptimal task performance, making the careful design of the agents’ communication topologies particularly important. Inspired by the management theory that roles in an efficient team are often dynamically adjusted, we propose AgentDropout, which identifies redundant agents and communication across different communication rounds by optimizing the adjacency matrices of the communication graphs and eliminates them to enhance both token efficiency and task performance. Compared to state-of-the-art methods, AgentDropout achieves an average reduction of 21.6% in prompt token consumption and 18.4% in completion token consumption, along with a performance improvement of 1.14 on the tasks. Furthermore, the extended experiments demonstrate that AgentDropout achieves notable domain transferability and structure robustness, revealing its reliability and effectiveness. We release our code at this https URL.
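As a rough illustration of pruning a communication graph, the sketch below drops the agent with the lowest total edge weight and then removes weak edges. It assumes the adjacency weights have already been learned; the scoring rule, the threshold, and both function names are hypothetical and are not taken from the paper.

```python
def drop_weakest_agent(adj):
    """adj[i][j] is the (assumed, already learned) weight of the edge i -> j."""
    n = len(adj)
    # Score each agent by its total outgoing plus incoming edge weight.
    scores = [sum(adj[i]) + sum(adj[j][i] for j in range(n)) for i in range(n)]
    weakest = min(range(n), key=scores.__getitem__)
    keep = [i for i in range(n) if i != weakest]
    # Remove the weakest agent's row and column from the matrix.
    pruned = [[adj[i][j] for j in keep] for i in keep]
    return weakest, pruned

def prune_edges(adj, threshold):
    """Zero out communication edges whose weight falls below the threshold."""
    return [[w if w >= threshold else 0.0 for w in row] for row in adj]

adjacency = [
    [0.0, 0.9, 0.1],
    [0.8, 0.0, 0.05],
    [0.1, 0.05, 0.0],
]
removed, smaller = drop_weakest_agent(adjacency)
sparse = prune_edges(smaller, threshold=0.85)
```

Here agent 2 contributes little in either direction and is removed first; the remaining 2x2 graph then keeps only its strongest edge.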

[NLP-4] Toward building next-generation Geocoding systems: a systematic review

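[Quick read]: This review addresses the changing input and output requirements that geocoding systems face across diverse scenarios, and the resulting data-quality challenges, for both scientific spatial analysis and everyday location-based services. It breaks geocoding systems down into their functional components and surveys existing approaches, from traditional rule-based methods to techniques from information retrieval, natural language processing, and large language models. On that basis it outlines how next-generation geocoding systems can be built and identifies opportunities to improve them using recent technological advances.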

Link: https://arxiv.org/abs/2503.18888
Authors: Zhengcong Yin, Daniel W. Goldberg, Binbin Lin, Bing Zhou, Diya Li, Andong Ma, Ziqian Ming, Heng Cai, Zhe Zhang, Shaohua Wang, Shanzhen Gao, Joey Ying Lee, Xiao Li, Da Huo
Affiliations: Texas A&M University; Pennsylvania State University; Metropolitan State University of Denver; Esri; Institute of Atmospheric Physics, Chinese Academy of Sciences; Virginia State University; Oxford University; University of Toronto
Categories: Software Engineering (cs.SE); Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments:

Abstract:Geocoding systems are widely used in both scientific research for spatial analysis and everyday life through location-based services. The quality of geocoded data significantly impacts subsequent processes and applications, underscoring the need for next-generation systems. In response to this demand, this review first examines the evolving requirements for geocoding inputs and outputs across various scenarios these systems must address. It then provides a detailed analysis of how to construct such systems by breaking them down into key functional components and reviewing a broad spectrum of existing approaches, from traditional rule-based methods to advanced techniques in information retrieval, natural language processing, and large language models. Finally, we identify opportunities to improve next-generation geocoding systems in light of recent technological advances.

[NLP-5] I Have Covered All the Bases Here: Interpreting Reasoning Features in Large Language Models via Sparse Autoencoders

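[Quick read]: This paper explores the internal reasoning mechanisms of large language models (LLMs), focusing on the new class of reasoning LLMs such as DeepSeek-R1. These models perform impressively, yet how they reason internally is still poorly understood. The paper uses Sparse Autoencoders (SAEs), which learn a sparse decomposition of a network's latent representations into interpretable features, to find the features that drive reasoning. It proposes a way to extract candidate "reasoning features" from SAE representations and validates them through empirical analysis and interpretability methods, showing that they correlate directly with the model's reasoning abilities. It further shows that systematically steering these features improves reasoning performance, which the authors present as the first mechanistic account of reasoning in LLMs. Code is available at the linked URL.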

Link: https://arxiv.org/abs/2503.18878
Authors: Andrey Galichin, Alexey Dontsov, Polina Druzhinina, Anton Razzhigaev, Oleg Y. Rogov, Elena Tutubalina, Ivan Oseledets
Affiliations: AIRI; MTUCI; Skoltech; Sber; HSE
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Large Language Models (LLMs) have achieved remarkable success in natural language processing. Recent advances have led to the development of a new class of reasoning LLMs; for example, open-source DeepSeek-R1 has achieved state-of-the-art performance by integrating deep thinking and complex reasoning. Despite these impressive capabilities, the internal reasoning mechanisms of such models remain unexplored. In this work, we employ Sparse Autoencoders (SAEs), a method to learn a sparse decomposition of latent representations of a neural network into interpretable features, to identify features that drive reasoning in the DeepSeek-R1 series of models. First, we propose an approach to extract candidate “reasoning features” from SAE representations. We validate these features through empirical analysis and interpretability methods, demonstrating their direct correlation with the model’s reasoning abilities. Crucially, we demonstrate that steering these features systematically enhances reasoning performance, offering the first mechanistic account of reasoning in LLMs. Code available at this https URL

[NLP-6] Reasoning to Learn from Latent Thoughts

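[Quick read]: This paper addresses the limited data efficiency of language model (LM) pretraining when data is constrained. Compute for pretraining has grown faster than the supply of human-written text, so data may become the bottleneck to scaling. The paper proposes to explicitly model and infer the latent thoughts behind the text generation process in order to improve pretraining data efficiency substantially. The approach treats web text as the compressed final outcome of a verbose human thought process and holds that latent thoughts carry contextual knowledge and reasoning steps that are critical for data-efficient learning. Beyond inferring latent thoughts with synthetic data, the paper proposes a bootstrapping method that needs no strong teacher: an EM algorithm iteratively improves both the trained LM and the quality of the thought-augmented pretraining data, enabling data-efficient continued pretraining under data constraints.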

Link: https://arxiv.org/abs/2503.18866
Authors: Yangjun Ruan, Neil Band, Chris J. Maddison, Tatsunori Hashimoto
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:Compute scaling for language model (LM) pretraining has outpaced the growth of human-written texts, leading to concerns that data will become the bottleneck to LM scaling. To continue scaling pretraining in this data-constrained regime, we propose that explicitly modeling and inferring the latent thoughts that underlie the text generation process can significantly improve pretraining data efficiency. Intuitively, our approach views web text as the compressed final outcome of a verbose human thought process and that the latent thoughts contain important contextual knowledge and reasoning steps that are critical to data-efficient learning. We empirically demonstrate the effectiveness of our approach through data-constrained continued pretraining for math. We first show that synthetic data approaches to inferring latent thoughts significantly improve data efficiency, outperforming training on the same amount of raw data (5.7% → 25.4% on MATH). Furthermore, we demonstrate latent thought inference without a strong teacher, where an LM bootstraps its own performance by using an EM algorithm to iteratively improve the capability of the trained LM and the quality of thought-augmented pretraining data. We show that a 1B LM can bootstrap its performance across at least three iterations and significantly outperform baselines trained on raw data, with increasing gains from additional inference compute when performing the E-step. The gains from inference scaling and EM iterations suggest new opportunities for scaling data-constrained pretraining.

[NLP-7] EconEvals: Benchmarks and Litmus Tests for LLM Agents in Unknown Environments

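[Quick read]: This paper addresses how to evaluate the ability of large language model (LLM) agents to act, learn, and strategize in unknown environments. It develops a set of benchmarks and litmus tests for LLM agents. The benchmarks measure agent ability through decision-making tasks derived from key problems in economics, with scalable difficulty to forestall saturation. The litmus tests are a new kind of quantitative measure that captures differences in the character, values, and tendencies of LLMs and LLM agents by observing how they behave when facing tradeoffs such as efficiency versus equality. Together, the two tools assess the abilities and tendencies of LLM agents on complex economic problems spanning procurement, scheduling, task allocation, and pricing.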

Link: https://arxiv.org/abs/2503.18825
Authors: Sara Fish, Julia Shephard, Minkai Li, Ran I. Shorrer, Yannai A. Gonczarowski
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Science and Game Theory (cs.GT)
Comments:

Abstract:We develop benchmarks for LLM agents that act in, learn from, and strategize in unknown environments, the specifications of which the LLM agent must learn over time from deliberate exploration. Our benchmarks consist of decision-making tasks derived from key problems in economics. To forestall saturation, the benchmark tasks are synthetically generated with scalable difficulty levels. Additionally, we propose litmus tests, a new kind of quantitative measure for LLMs and LLM agents. Unlike benchmarks, litmus tests quantify differences in character, values, and tendencies of LLMs and LLM agents, by considering their behavior when faced with tradeoffs (e.g., efficiency versus equality) where there is no objectively right or wrong behavior. Overall, our benchmarks and litmus tests assess the abilities and tendencies of LLM agents in tackling complex economic problems in diverse settings spanning procurement, scheduling, task allocation, and pricing – applications that should grow in importance as such agents are further integrated into the economy.

[NLP-8] REALM: A Dataset of Real-World LLM Use Cases

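[Quick read]: This paper addresses the limited understanding of how large language models (LLMs) are used in the real world. It introduces REALM, a dataset of over 94,000 LLM use cases collected from Reddit and news articles. REALM covers the diverse applications of LLMs and explores how users' occupations relate to the types of applications they use, offering insight into LLM adoption across domains and a foundation for research on their evolving societal roles.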

Link: https://arxiv.org/abs/2503.18792
Authors: Jingwen Cheng, Kshitish Ghate, Wenyue Hua, William Yang Wang, Hong Shen, Fei Fang
Affiliations: Carnegie Mellon University; University of California, Santa Barbara
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY)
Comments: 9 pages, 5 figures

Abstract:Large Language Models, such as the GPT series, have driven significant industrial applications, leading to economic and societal transformations. However, a comprehensive understanding of their real-world applications remains limited. To address this, we introduce REALM, a dataset of over 94,000 LLM use cases collected from Reddit and news articles. REALM captures two key dimensions: the diverse applications of LLMs and the demographics of their users. It categorizes LLM applications and explores how users’ occupations relate to the types of applications they use. By integrating real-world data, REALM offers insights into LLM adoption across different domains, providing a foundation for future research on their evolving societal roles. A dedicated dashboard this https URL presents the data.

[NLP-9] BitDecoding: Unlocking Tensor Cores for Long-Context LLMs Decoding with Low-Bit KV Cache

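[Quick read]: This paper addresses the significant memory and computational challenges that the expanding Key-Value (KV) cache creates for long-context large language models (LLMs) during autoregressive decoding. KV cache quantization (for example to 4 bits or 2 bits) has been shown to reduce memory cost while maintaining model accuracy, but preliminary low-bit KV cache implementations deliver limited speedup, mainly because of quantization and dequantization overheads and the lack of Tensor Cores utilization.
The paper proposes BitDecoding, a GPU-optimized framework that unlocks Tensor Cores for efficient decoding with a low-bit KV cache. Its key component is a Tensor Cores-Centric BitFusion Scheme that ensures data layout compatibility so that Tensor Cores can be highly utilized. BitDecoding also incorporates a warp-efficient parallel decoding kernel and a fine-grained asynchronous pipeline, which minimize dequantization overhead and improve computational efficiency. Compared with FP16 FlashDecoding-v2, BitDecoding achieves up to 7.5x speedup on RTX 4090, 4.8x on A100, and 8.9x on H100, and it outperforms the state-of-the-art low-bit KV cache implementation (QServe) by up to 4.3x.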

Link: https://arxiv.org/abs/2503.18773
Authors: Dayou Du, Shijie Cao, Jianyi Cheng, Ting Cao, Mao Yang
Affiliations: University of Edinburgh; Microsoft Research
Categories: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Performance (cs.PF)
Comments:

Abstract:The growing adoption of long-context Large Language Models (LLMs) has introduced significant memory and computational challenges in autoregressive decoding due to the expanding Key-Value (KV) cache. KV cache quantization has emerged as a promising solution, with prior work showing that 4-bit or even 2-bit quantization can maintain model accuracy while reducing memory costs. However, despite these benefits, preliminary implementations for the low-bit KV cache struggle to deliver the expected speedup due to quantization and dequantization overheads and the lack of Tensor Cores utilization. In this work, we propose BitDecoding, a GPU-optimized framework that unlocks Tensor Cores for efficient decoding with low-bit KV cache. Efficiently leveraging Tensor Cores for low-bit KV cache is challenging due to the dynamic nature of KV cache generation at each decoding step. BitDecoding addresses these challenges with a Tensor Cores-Centric BitFusion Scheme that ensures data layout compatibility to enable high utilization of Tensor Cores. Additionally, BitDecoding incorporates a warp-efficient parallel decoding kernel and a fine-grained asynchronous pipeline, minimizing dequantization overhead and improving computational efficiency. Experiments show that BitDecoding achieves up to 7.5x speedup on RTX 4090, 4.8x on A100, and 8.9x on H100, compared to FP16 FlashDecoding-v2. It also outperforms the state-of-the-art low-bit KV cache implementation (QServe) by up to 4.3x. On LLaMA-3.1-8B with a 128K sequence length, BitDecoding reduces single-batch decoding latency by 3x, demonstrating its effectiveness in long-context generation scenarios. The code is available at this https URL.
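To make "low-bit KV cache" concrete, the sketch below shows a generic 4-bit quantize, pack, and dequantize round trip in plain Python. It illustrates only the storage format and the dequantization step whose overhead the abstract mentions; it is not BitDecoding's GPU kernel, and the per-group min/max scaling scheme and function names are assumptions.

```python
def quantize_4bit(values):
    """Map floats to 4-bit codes (0..15) and pack two codes into each byte."""
    lo, hi = min(values), max(values)
    scale = (hi - lo) / 15 or 1.0          # fall back to 1.0 for constant input
    codes = [min(15, max(0, round((v - lo) / scale))) for v in values]
    if len(codes) % 2:
        codes.append(0)                    # pad to an even number of codes
    packed = bytes(codes[i] | (codes[i + 1] << 4) for i in range(0, len(codes), 2))
    return packed, scale, lo

def dequantize_4bit(packed, scale, lo, count):
    """Unpack the nibbles and map each code back to an approximate float."""
    codes = []
    for byte in packed:
        codes.append(byte & 0x0F)          # low nibble was stored first
        codes.append(byte >> 4)
    return [lo + c * scale for c in codes[:count]]

values = [0.0, 1.5, 3.0, -1.0, 2.0]
packed, scale, lo = quantize_4bit(values)
restored = dequantize_4bit(packed, scale, lo, len(values))
```

Five floats fit into three bytes plus one scale and one offset, and each restored value differs from the original by at most half a quantization step.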

[NLP-10] AlphaSpace: Enabling Robotic Actions through Semantic Tokenization and Symbolic Reasoning

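[Quick read]: This paper addresses the weak spatial reasoning of large language models (LLMs) in 3D Cartesian space navigation. AlphaSpace uses a semantics-based tokenization strategy that encodes height information through specialized semantic tokens and combines it with primarily symbolic synthetic reasoning data, so that LLMs can position objects at specific [x, y, z] coordinates and manipulate them accurately.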

Link: https://arxiv.org/abs/2503.18769
Authors: Alan Dao (Gia Tuan Dao), Dinh Bach Vu, Bui Quang Huy
Affiliations: Menlo Research
Categories: Computation and Language (cs.CL); Robotics (cs.RO)
Comments:

Abstract:This paper presents AlphaSpace, a novel methodology designed to enhance the spatial reasoning capabilities of large language models (LLMs) for 3D Cartesian space navigation. AlphaSpace employs a semantics-based tokenization strategy, encoding height information through specialized semantic tokens, and integrates primarily symbolic synthetic reasoning data. This approach enables LLMs to accurately manipulate objects by positioning them at specific [x, y, z] coordinates. Experimental results demonstrate that AlphaSpace significantly outperforms existing models on manipulation subtasks, achieving a total accuracy of 66.67%, compared to 37.5% for GPT-4o and 29.17% for Claude 3.5 Sonnet.

[NLP-11] Synthetic Function Demonstrations Improve Generation in Low-Resource Programming Languages

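[Quick read]: This paper addresses the shortage of training data for low-resource programming languages and proposes a new way to create high-quality training data for them. A teacher model generates fully synthetic, textbook-quality demonstrations of common library functions, with Excel formulas as the example domain. Finetuning an underperforming student model on this data improves its performance on two question-answering datasets recast into the Excel domain, and the paper shows advantages over standard, off-the-shelf retrieval-augmented generation (RAG) approaches.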

Link: https://arxiv.org/abs/2503.18760
Authors: Nick McKenna, Xinnuo Xu, Jack Williams, Nick Wilson, Benjamin Van Durme, Christian Poelitz
Affiliations: Microsoft Research; Microsoft
Categories: Computation and Language (cs.CL)
Comments:

Abstract:A key consideration when training an LLM is whether the target language is more or less resourced, whether this is English compared to Welsh, or Python compared to Excel. Typical training data for programming languages consist of real program demonstrations coupled with human-written comments. Here we present novel approaches to the creation of such data for low resource programming languages. We generate fully-synthetic, textbook-quality demonstrations of common library functions in an example domain of Excel formulas, using a teacher model. We then finetune an underperforming student model, and show improvement on 2 question-answering datasets recast into the Excel domain. We show advantages of finetuning over standard, off-the-shelf RAG approaches, which can offer only modest improvement due to the unfamiliar target domain.

[NLP-12] Construction Identification and Disambiguation Using BERT: A Case Study of NPN ACL

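[Quick read]: This paper asks whether BERT represents the form and meaning of a minor, polysemous construction of English, the NPN (noun-preposition-noun) construction seen in expressions such as "face to face" and "day to day", and whether it latently encodes knowledge of the construction beyond a surface syntactic pattern and lexical cues. The authors build a semantically annotated benchmark dataset that contains instances of the construction along with distractors, then train and evaluate probing classifiers to test whether BERT embeddings carry the construction's semantics. They also test how permuting word order affects recognition of the construction, revealing BERT's sensitivity to both form and meaning.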

Link: https://arxiv.org/abs/2503.18751
Authors: Wesley Scivetti, Nathan Schneider
Affiliations: Georgetown University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 8 pages, ACL long-paper format (preprint)

Abstract:Construction Grammar hypothesizes that knowledge of a language consists chiefly of knowledge of form-meaning pairs (“constructions”) that include vocabulary, general grammar rules, and even idiosyncratic patterns. Recent work has shown that transformer language models represent at least some constructional patterns, including ones where the construction is rare overall. In this work, we probe BERT’s representation of the form and meaning of a minor construction of English, the NPN (noun-preposition-noun) construction – exhibited in such expressions as face to face and day to day – which is known to be polysemous. We construct a benchmark dataset of semantically annotated corpus instances (including distractors that superficially resemble the construction). With this dataset, we train and evaluate probing classifiers. They achieve decent discrimination of the construction from distractors, as well as sense disambiguation among true instances of the construction, revealing that BERT embeddings carry indications of the construction’s semantics. Moreover, artificially permuting the word order of true construction instances causes them to be rejected, indicating sensitivity to matters of form. We conclude that BERT does latently encode at least some knowledge of the NPN construction going beyond a surface syntactic pattern and lexical cues.

[NLP-13] Predicting the Road Ahead: A Knowledge Graph based Foundation Model for Scene Understanding in Autonomous Driving

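[Quick read]: This paper addresses the difficulty that current scene-understanding methods for autonomous driving have in capturing how driving scenes evolve over time. It proposes FM4SU, which uses knowledge graphs (KGs) to combine sensory observations with domain knowledge such as road topology, traffic rules, and complex interactions between traffic participants. From the KG it extracts a symbolic bird's eye view (BEV) representation of each scene that includes the spatio-temporal information among objects. The BEV representation is serialized into a sequence of tokens and given to pre-trained language models (PLMs), which learn the co-occurrence of driving scene elements and predict the next scenes. Fine-tuned models achieve significantly higher accuracy on all tasks, and the fine-tuned T5 model reaches a next scene prediction accuracy of 86.7%.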

Link: https://arxiv.org/abs/2503.18730
Authors: Hongkuan Zhou, Stefan Schmid, Yicong Li, Lavdim Halilaj, Xiangtong Yao, Wei Cao
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments:

Abstract:The autonomous driving field has seen remarkable advancements in various topics, such as object recognition, trajectory prediction, and motion planning. However, current approaches face limitations in effectively comprehending the complex evolutions of driving scenes over time. This paper proposes FM4SU, a novel methodology for training a symbolic foundation model (FM) for scene understanding in autonomous driving. It leverages knowledge graphs (KGs) to capture sensory observation along with domain knowledge such as road topology, traffic rules, or complex interactions between traffic participants. A bird’s eye view (BEV) symbolic representation is extracted from the KG for each driving scene, including the spatio-temporal information among the objects across the scenes. The BEV representation is serialized into a sequence of tokens and given to pre-trained language models (PLMs) for learning an inherent understanding of the co-occurrence among driving scene elements and generating predictions on the next scenes. We conducted a number of experiments using the nuScenes dataset and KG in various scenarios. The results demonstrate that fine-tuned models achieve significantly higher accuracy in all tasks. The fine-tuned T5 model achieved a next scene prediction accuracy of 86.7%. This paper concludes that FM4SU offers a promising foundation for developing more comprehensive models for scene understanding in autonomous driving.

[NLP-14] Unsupervised Acquisition of Discrete Grammatical Categories

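[Quick read]: This paper studies, by simulating language acquisition, how a daughter language model can acquire abstract grammatical knowledge without access to the internal knowledge of the mother language model, relying only on the language exemplars the mother model generates. Hierarchical agglomerative cluster analysis is applied to the utterances consecutively generated by the mother model. Patterns in the input that correspond to grammatical categories are turned into discrete grammatical rules, which are then added to the daughter model's grammatical knowledge. The paper argues that the system acquires structures resembling the grammatical categories that linguists propose for natural languages, so non-trivial grammatical knowledge is acquired. A second experiment on a test set validates the parameter configuration determined from the training data and likewise yields non-trivial categories.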

Link: https://arxiv.org/abs/2503.18702
Authors: David Ph. Shakouri, Crit Cremers, Niels O. Schiller
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: 34 pages, 3 figures, 7 tables

Abstract:This article presents experiments performed using a computational laboratory environment for language acquisition experiments. It implements a multi-agent system consisting of two agents: an adult language model and a daughter language model that aims to learn the mother language. Crucially, the daughter agent does not have access to the internal knowledge of the mother language model but only to the language exemplars the mother agent generates. These experiments illustrate how this system can be used to acquire abstract grammatical knowledge. We demonstrate how statistical analyses of patterns in the input data corresponding to grammatical categories yield discrete grammatical rules. These rules are subsequently added to the grammatical knowledge of the daughter language model. To this end, hierarchical agglomerative cluster analysis was applied to the utterances consecutively generated by the mother language model. It is argued that this procedure can be used to acquire structures resembling grammatical categories proposed by linguists for natural languages. Thus, it is established that non-trivial grammatical knowledge has been acquired. Moreover, the parameter configuration of this computational laboratory environment determined using training data generated by the mother language model is validated in a second experiment with a test set similarly resulting in the acquisition of non-trivial categories.
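Hierarchical agglomerative clustering, the analysis named in the abstract, can be sketched in a few lines of plain Python. The toy "context count" vectors, the average-linkage rule, and the Euclidean distance below are assumptions chosen for illustration; the paper's actual features and settings are not given here.

```python
import math

def agglomerate(vectors, num_clusters):
    """Merge the closest pair of clusters until `num_clusters` remain.

    `vectors` maps each item name to a numeric feature tuple (hypothetical data).
    """
    clusters = [[name] for name in vectors]

    def distance(a, b):
        # Average linkage: mean Euclidean distance over all cross-cluster pairs.
        total = sum(math.dist(vectors[x], vectors[y]) for x in a for y in b)
        return total / (len(a) * len(b))

    while len(clusters) > num_clusters:
        pairs = [(i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))]
        i, j = min(pairs, key=lambda p: distance(clusters[p[0]], clusters[p[1]]))
        clusters[i] = clusters[i] + clusters[j]   # merge cluster j into cluster i
        del clusters[j]
    return clusters

# Toy features: (count after a determiner, count after a subject noun).
contexts = {"dog": (5, 0), "cat": (4, 1), "runs": (0, 5), "sleeps": (1, 4)}
categories = agglomerate(contexts, num_clusters=2)
```

With these toy counts the two resulting clusters separate the noun-like items from the verb-like items, which is the kind of discrete category the summary describes being derived from distributional patterns.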

[NLP-15] Commander-GPT: Fully Unleashing the Sarcasm Detection Capability of Multi-Modal Large Language Models

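[Quick read]: This paper addresses the weak performance of single-modal methods on sarcasm detection, which stems from the implicit and subtle nature of sarcasm, and explores how to use multi-modal information to identify sarcastic content more accurately. It proposes Commander-GPT, a framework built on Multi-Modal Large Language Models (MLLMs). Inspired by military strategy, the framework decomposes sarcasm detection into six sub-tasks, has a central commander assign the best-suited large language model to each sub-task, and aggregates the models' results to make the final decision. This divide-and-conquer strategy achieves state-of-the-art performance without fine-tuning or ground-truth rationales, improving the F1 score by 19.3%.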

Link: https://arxiv.org/abs/2503.18681
Authors: Yazhou Zhang, Chunwang Zou, Bo Wang, Jing Qin
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Sarcasm detection, as a crucial research direction in the field of Natural Language Processing (NLP), has attracted widespread attention. Traditional sarcasm detection tasks have typically focused on single-modal approaches (e.g., text), but due to the implicit and subtle nature of sarcasm, such methods often fail to yield satisfactory results. In recent years, researchers have shifted the focus of sarcasm detection to multi-modal approaches. However, effectively leveraging multi-modal information to accurately identify sarcastic content remains a challenge that warrants further exploration. Leveraging the powerful integrated processing capabilities of Multi-Modal Large Language Models (MLLMs) for various information sources, we propose an innovative multi-modal Commander-GPT framework. Inspired by military strategy, we first decompose the sarcasm detection task into six distinct sub-tasks. A central commander (decision-maker) then assigns the best-suited large language model to address each specific sub-task. Ultimately, the detection results from each model are aggregated to identify sarcasm. We conducted extensive experiments on MMSD and MMSD 2.0, utilizing four multi-modal large language models and six prompting strategies. Our experiments demonstrate that our approach achieves state-of-the-art performance, with a 19.3% improvement in F1 score, without necessitating fine-tuning or ground-truth rationales.

[NLP-16] ArchSeek: Retrieving Architectural Case Studies Using Vision-Language Models

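[Quick read]: This paper addresses the failure of traditional text-based search tools to capture the inherently visual and complex nature of architectural knowledge, which makes exploration time-consuming and imprecise. It proposes ArchSeek, a system that uses the visual understanding of vision-language models and cross-modal embeddings to support text and image queries with fine-grained control, along with interaction-based case recommendations, giving architects a more efficient and personalized way to discover design inspiration.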

Link: https://arxiv.org/abs/2503.18680
Authors: Danrui Li, Yichao Shi, Yaluo Wang, Ziying Shi, Mubbasir Kapadia
Affiliations: Rutgers University, NJ, USA; Georgia Institute of Technology, GA, USA; Harvard University, MA, USA; Southeast University, Nanjing, China; Roblox, CA, USA
Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL)
Comments: 15 pages, 8 figures, 3 tables. Accepted by CAAD Futures 2025

Abstract:Efficiently searching for relevant case studies is critical in architectural design, as designers rely on precedent examples to guide or inspire their ongoing projects. However, traditional text-based search tools struggle to capture the inherently visual and complex nature of architectural knowledge, often leading to time-consuming and imprecise exploration. This paper introduces ArchSeek, an innovative case study search system with recommendation capability, tailored for architecture design professionals. Powered by the visual understanding capabilities from vision-language models and cross-modal embeddings, it enables text and image queries with fine-grained control, and interaction-based design case recommendations. It offers architects a more efficient, personalized way to discover design inspirations, with potential applications across other visually driven design fields. The source code is available at this https URL.

[NLP-17] AgentSpec: Customizable Runtime Enforcement for Safe and Reliable LLM Agents

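[Quick Read]: This paper addresses the safety of agents built on large language models (LLMs) and deployed across diverse domains, whose autonomy introduces risks such as security vulnerabilities, legal violations, and unintended harmful actions. Existing mitigations, such as model-based safeguards and early enforcement strategies, fall short in robustness, interpretability, and adaptability. In response, the paper proposes AgentSpec, a lightweight domain-specific language for specifying and enforcing constraints on LLM agents at runtime. The key idea is that structured rules define triggers, predicates, and enforcement mechanisms, keeping agents within predefined safety boundaries. Implementations in code execution, embodied agents, and autonomous driving demonstrate its adaptability and effectiveness. In the evaluation, AgentSpec prevents unsafe executions in over 90% of code agent cases, eliminates all hazardous actions in embodied agent tasks, and achieves 100% compliance by autonomous vehicles. By combining interpretability, modularity, and efficiency, AgentSpec offers a practical and scalable solution, and the authors also use LLMs to generate rules automatically, which further increases its practical value.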

Link: https://arxiv.org/abs/2503.18666
Authors: Haoyu Wang, Christopher M. Poskitt, Jun Sun
Affiliations: School of Computing and Information System, Singapore Management University
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:Agents built on LLMs are increasingly deployed across diverse domains, automating complex decision-making and task execution. However, their autonomy introduces safety risks, including security vulnerabilities, legal violations, and unintended harmful actions. Existing mitigation methods, such as model-based safeguards and early enforcement strategies, fall short in robustness, interpretability, and adaptability. To address these challenges, we propose AgentSpec, a lightweight domain-specific language for specifying and enforcing runtime constraints on LLM agents. With AgentSpec, users define structured rules that incorporate triggers, predicates, and enforcement mechanisms, ensuring agents operate within predefined safety boundaries. We implement AgentSpec across multiple domains, including code execution, embodied agents, and autonomous driving, demonstrating its adaptability and effectiveness. Our evaluation shows that AgentSpec successfully prevents unsafe executions in over 90% of code agent cases, eliminates all hazardous actions in embodied agent tasks, and enforces 100% compliance by autonomous vehicles (AVs). Despite its strong safety guarantees, AgentSpec remains computationally lightweight, with overheads in milliseconds. By combining interpretability, modularity, and efficiency, AgentSpec provides a practical and scalable solution for enforcing LLM agent safety across diverse applications. We also automate the generation of rules using LLMs and assess their effectiveness. Our evaluation shows that the rules generated by OpenAI o1 achieve a precision of 95.56% and recall of 70.96% for embodied agents, successfully identifying 87.26% of the risky code, and prevent AVs from breaking laws in 5 out of 8 scenarios.
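
The trigger / predicate / enforcement structure described in the abstract can be illustrated with a small Python sketch. This is not AgentSpec's actual syntax or API, which the abstract does not show; the `Rule` class, the `check` function, the action dictionaries, and the enforcement labels `"block"` and `"ask_user"` are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List

Action = Dict[str, Any]  # hypothetical representation of a proposed agent action


@dataclass
class Rule:
    name: str
    trigger: str                         # action type that activates the rule
    predicate: Callable[[Action], bool]  # returns True when the action is unsafe
    enforce: str                         # hypothetical enforcement label


def check(action: Action, rules: List[Rule]) -> str:
    """Return 'allow', or the enforcement of the first rule that fires."""
    for rule in rules:
        if rule.trigger == action["type"] and rule.predicate(action):
            return rule.enforce
    return "allow"


# Two illustrative rules (assumptions, not taken from the paper).
RULES = [
    Rule("no_recursive_delete", "shell",
         lambda a: "rm -rf" in a["command"], "block"),
    Rule("confirm_large_payment", "payment",
         lambda a: a["amount"] > 100, "ask_user"),
]

if __name__ == "__main__":
    print(check({"type": "shell", "command": "rm -rf /tmp/x"}, RULES))
    print(check({"type": "shell", "command": "ls"}, RULES))
```

The point of the design is that the check runs before the action is executed, so the decision is made by explicit rules rather than by the model itself.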

[NLP-18] ZeroLM: Data-Free Transformer Architecture Search for Language Models

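[Quick Read]: This paper addresses the prohibitive computational cost that hinders the adoption of neural architecture search (NAS), in particular the weak architecture-ranking performance of existing zero-cost proxies on Transformer-based models and the long search times, susceptibility to overfitting, and structural complexity of current automated proxy discovery methods. The key to the solution is a new zero-cost proxy that quantifies model capacity through efficiently computed weight statistics and decomposes Transformer architectures into functionally distinct sub-modules, optimizing the balance of each sub-module's contribution to overall performance. On the FlexiBERT benchmark the method achieves a Spearman's rho of 0.76 and a Kendall's tau of 0.53, and it shows strong computational efficiency and robustness across diverse NAS benchmark tasks, offering a practical option for large-scale architecture search.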

Link: https://arxiv.org/abs/2503.18646
Authors: Zhen-Song Chen, Hong-Wei Ding, Xian-Jia Wang, Witold Pedrycz
Affiliations: School of Civil Engineering, Wuhan University; Economic and Management School, Wuhan University; Department of Electrical and Computer Engineering, University of Alberta; Research Center of Performance and Productivity Analysis, Istinye University
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Neural architecture search (NAS) provides a systematic framework for automating the design of neural network architectures, yet its widespread adoption is hindered by prohibitive computational requirements. Existing zero-cost proxy methods, while reducing search overhead, demonstrate inadequate performance in architecture ranking tasks, particularly for Transformer-based models where they often underperform simple parameter counting metrics. Current automated proxy discovery approaches suffer from extended search times, susceptibility to data overfitting, and structural complexity. This paper introduces a novel zero-cost proxy methodology that quantifies model capacity through efficient weight statistics computation while decomposing Transformer architectures into functionally distinct sub-modules, thereby optimizing the balance of their contributions to overall performance. Our comprehensive evaluation demonstrates the superiority of this approach, achieving a Spearman’s rho of 0.76 and Kendall’s tau of 0.53 on the FlexiBERT benchmark. The proposed method exhibits exceptional computational efficiency while maintaining robust performance across diverse NAS benchmark tasks, offering a practical solution for large-scale architecture search.
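
For readers unfamiliar with the two ranking metrics quoted above, here is a minimal sketch of how Spearman's rho and Kendall's tau compare a proxy's scores with measured accuracies. It assumes there are no tied values, and the proxy scores and accuracies are made-up numbers, not results from the paper.

```python
from itertools import combinations


def ranks(values):
    """Rank values from 1..n (assumes no ties, for simplicity)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result


def spearman_rho(x, y):
    """Spearman's rho via the squared rank-difference formula (no ties)."""
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1 - 6 * d2 / (n * (n * n - 1))


def kendall_tau(x, y):
    """Kendall's tau: (concordant - discordant) / number of pairs."""
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        s = (x[i] - x[j]) * (y[i] - y[j])
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) / 2
    return (concordant - discordant) / n_pairs


if __name__ == "__main__":
    proxy_scores = [0.2, 0.5, 0.1, 0.9, 0.7]    # hypothetical proxy values
    accuracies = [71.0, 74.5, 70.2, 76.0, 78.1]  # hypothetical accuracies
    print(spearman_rho(proxy_scores, accuracies))
    print(kendall_tau(proxy_scores, accuracies))
```

A proxy is useful for architecture search when these correlations are high, because the search only needs the ranking of candidates, not their exact accuracy.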

[NLP-19] LANGALIGN: Enhancing Non-English Language Models via Cross-Lingual Embedding Alignment

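[Quick Read]: Many service developers still rely on embedding-based models because of practical constraints, and for non-English tasks the scarcity of high-quality fine-tuning data limits performance. English data is therefore often used as seed data for training non-English models, but direct transfer is of limited effectiveness. To address this, the paper proposes LANGALIGN, whose key idea is to align English embedding vectors with target-language embedding vectors at the interface between the language model and the task head. This significantly improves performance on the target languages (Korean, Japanese, and Chinese), and the method can also be applied in reverse to convert target-language data into a format that an English-based model can process.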

Link: https://arxiv.org/abs/2503.18603
Authors: Jong Myoung Kim, Young-Jun Lee, Ho-Jin Choi, Sangkeun Jung
Affiliations: SK-telecom; School of Computing, KAIST; The Division of Computer Convergence, Chungnam National University
Categories: Computation and Language (cs.CL)
Comments: now preparing

Abstract:While Large Language Models have gained attention, many service developers still rely on embedding-based models due to practical constraints. In such cases, the quality of fine-tuning data directly impacts performance, and English datasets are often used as seed data for training non-English models. In this study, we propose LANGALIGN, which enhances target language processing by aligning English embedding vectors with those of the target language at the interface between the language model and the task header. Experiments on Korean, Japanese, and Chinese demonstrate that LANGALIGN significantly improves performance across all three languages. Additionally, we show that LANGALIGN can be applied in reverse to convert target language data into a format that an English-based model can process.

[NLP-20] LinkAlign: Scalable Schema Linking for Real-World Large-Scale Multi-Database Text-to-SQL

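[Quick Read]: This paper addresses schema linking, a critical bottleneck in reaching human-level performance on Text-to-SQL, particularly in real-world large-scale multi-database scenarios. It targets two main challenges: (1) Database Retrieval, selecting the correct database from a large schema pool in a multi-database setting while filtering out irrelevant ones; and (2) Schema Item Grounding, accurately identifying the tables and columns relevant to SQL generation within a large, redundant schema. The authors introduce the LinkAlign framework, which tackles schema linking systematically through multi-round semantic enhanced retrieval and irrelevant information isolation (Challenge 1) and schema extraction enhancement (Challenge 2). Experiments show that LinkAlign outperforms existing baselines in multi-database settings, and on the SPIDER 2.0-lite benchmark it ranks highest at adapting existing Text-to-SQL models to real-world environments among models that do not use long chain-of-thought reasoning LLMs. The key to LinkAlign is therefore its three steps (multi-round semantic enhanced retrieval, irrelevant information isolation, and schema extraction enhancement), which make schema linking robust and scalable.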

Link: https://arxiv.org/abs/2503.18596
Authors: Yihan Wang, Peiyu Liu, Xin Yang
Affiliations: China Academy of Information and Communications Technology; Beihang University; University of International Business and Economics
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Schema linking is a critical bottleneck in achieving human-level performance in Text-to-SQL tasks, particularly in real-world large-scale multi-database scenarios. Addressing schema linking faces two major challenges: (1) Database Retrieval: selecting the correct database from a large schema pool in multi-database settings, while filtering out irrelevant ones. (2) Schema Item Grounding: accurately identifying the relevant tables and columns from within a large and redundant schema for SQL generation. To address this, we introduce LinkAlign, a novel framework that can effectively adapt existing baselines to real-world environments by systematically addressing schema linking. Our framework comprises three key steps: multi-round semantic enhanced retrieval and irrelevant information isolation for Challenge 1, and schema extraction enhancement for Challenge 2. We evaluate our method's schema linking performance on the SPIDER and BIRD benchmarks, and its ability to adapt existing Text-to-SQL models to real-world environments on the SPIDER 2.0-lite benchmark. Experiments show that LinkAlign outperforms existing baselines in multi-database settings, demonstrating its effectiveness and robustness. On SPIDER 2.0-lite, our method ranks highest among models excluding those using long chain-of-thought reasoning LLMs. This work bridges the gap between current research and real-world scenarios, providing a practical solution for robust and scalable schema linking. The code is available at this https URL.

[NLP-21] ClinText-SP and RigoBERTa Clinical: a new set of open resources for Spanish Clinical NLP

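[Quick Read]: This paper addresses the scarcity of resources for Spanish clinical natural language processing (clinical NLP). Its key contributions are ClinText-SP, the largest publicly available Spanish clinical corpus to date, and RigoBERTa Clinical, a state-of-the-art clinical encoder language model. The core of the approach is domain-adaptive pretraining on a carefully curated, diverse dataset (including clinical cases from medical journals and annotated corpora from shared tasks), which significantly improves performance on multiple clinical NLP benchmarks. By releasing both the dataset and the model publicly, the authors aim to give the research community robust resources that advance clinical NLP and ultimately improve healthcare applications.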

Link: https://arxiv.org/abs/2503.18594
Authors: Guillem García Subies, Álvaro Barbero Jiménez, Paloma Martínez Fernández
Affiliations: Universidad Carlos III de Madrid; Instituto de Ingeniería del Conocimiento; Universidad Carlos III de Madrid
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:We present a novel contribution to Spanish clinical natural language processing by introducing the largest publicly available clinical corpus, ClinText-SP, along with a state-of-the-art clinical encoder language model, RigoBERTa Clinical. Our corpus was meticulously curated from diverse open sources, including clinical cases from medical journals and annotated corpora from shared tasks, providing a rich and diverse dataset that was previously difficult to access. RigoBERTa Clinical, developed through domain-adaptive pretraining on this comprehensive dataset, significantly outperforms existing models on multiple clinical NLP benchmarks. By publicly releasing both the dataset and the model, we aim to empower the research community with robust resources that can drive further advancements in clinical NLP and ultimately contribute to improved healthcare applications.

[NLP-22] Dense Retrieval for Low Resource Languages – the Case of Amharic Language

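[Quick Read]: This paper reports the difficulties encountered, and some results obtained, when applying dense retrievers to Amharic, a low-resource language spoken by about 120 million people whose information retrieval poses distinctive challenges. The focus is on the efforts made and the specific difficulties faced by Addis Ababa University in Amharic text information retrieval.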

Link: https://arxiv.org/abs/2503.18570
Authors: Tilahun Yeshambel, Moncef Garouani, Serge Molina, Josiane Mothe
Affiliations: Addis Ababa University; Univ. Capitole, IRIT, UMR5505 CNRS; IRIT, UMR5505 CNRS, Univ. de Toulouse; INSPE, UT2J, IRIT, UMR5505 CNRS
Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL)
Comments: 4 pages, 2 figures

Abstract:This paper reports some difficulties and some results when using dense retrievers on Amharic, a low-resource language spoken by 120 million people. The efforts made and difficulties faced by Addis Ababa University in Amharic information retrieval will be discussed during the presentation.

[NLP-23] Distil-xLSTM: Learning Attention Mechanisms through Recurrent Structures

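[Quick Read]: Transformer models dominate natural language processing (NLP) but have high computational demands. This paper proposes Distil-xLSTM, an xLSTM-based Small Language Model (SLM) trained by distilling knowledge from a Large Language Model (LLM) to improve compute and scale efficiency. The key to the solution is using xLSTM's recurrent sequence-mixing components to approximate the attention parametrization of a Transformer-based model, which maintains good performance while greatly reducing training cost.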

Link: https://arxiv.org/abs/2503.18565
Authors: Abdoul Majid O. Thiombiano, Brahim Hnich, Ali Ben Mrad, Mohamed Wiem Mkaouer
Affiliations: FSM, University of Monastir, Monastir, 5000 Tunisia; CES Lab, ENIS, University of Sfax, Sfax, 3038 Tunisia; Department of Computer Science, College of Computer, Qassim University, Buraydah, Saudi Arabia; University of Michigan-Flint, MI, USA
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:The current era of Natural Language Processing (NLP) is dominated by Transformer models. However, novel architectures relying on recurrent mechanisms, such as xLSTM and Mamba, have been proposed as alternatives to attention-based models. Although computation is done differently than with the attention mechanism, these recurrent models yield good results and sometimes even outperform state-of-the-art attention-based models. In this work, we propose Distil-xLSTM, an xLSTM-based Small Language Model (SLM) trained by distilling knowledge from a Large Language Model (LLM) that shows promising results while being compute and scale efficient. Our Distil-xLSTM focuses on approximating a transformer-based model's attention parametrization using its recurrent sequence mixing components and shows good results with minimal training.
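
The abstract does not state which distillation objective Distil-xLSTM uses. As a generic illustration only, the sketch below shows the widely used logit-distillation loss: the KL divergence between temperature-softened teacher and student distributions, scaled by the squared temperature. The function names and the temperature value are assumptions, and the paper's actual objective may differ.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over temperature-scaled logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2."""
    p = softmax(teacher_logits, temperature)  # teacher (target) distribution
    q = softmax(student_logits, temperature)  # student distribution
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return kl * temperature ** 2


if __name__ == "__main__":
    teacher = [1.0, 2.0, 3.0]  # hypothetical next-token logits
    student = [3.0, 2.0, 1.0]
    print(distillation_loss(student, teacher))
```

The loss is zero when the student reproduces the teacher's distribution and grows as the two diverge, which is what lets a small recurrent student be trained against a larger attention-based teacher.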

[NLP-24] Self-Reported Confidence of Large Language Models in Gastroenterology: Analysis of Commercial, Open-Source, and Quantized Models

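[Quick Read]: This paper evaluates the self-reported response confidence of several large language models (LLMs) in gastroenterology and discusses what this means for their safe use in healthcare. Using 300 gastroenterology board-style questions, the study finds that although the best-performing models (GPT-o1 preview, GPT-4o, and Claude-3.5-Sonnet) achieve Brier scores of 0.15-0.2 and an AUROC of 0.6, all models show a marked tendency toward overconfidence. The key message is that uncertainty quantification is important for the safe use of LLMs in medical settings and that reliable uncertainty estimation is central to overcoming the current overconfidence problem.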

Link: https://arxiv.org/abs/2503.18562
Authors: Nariman Naderi, Seyed Amir Ahmad Safavi-Naini, Thomas Savage, Zahra Atf, Peter Lewis, Girish Nadkarni, Ali Soroush
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
Comments: 35 pages, 5 figures, 1 table, 7 supplementary figures

Abstract:This study evaluated self-reported response certainty across several large language models (GPT, Claude, Llama, Phi, Mistral, Gemini, Gemma, and Qwen) using 300 gastroenterology board-style questions. The highest-performing models (GPT-o1 preview, GPT-4o, and Claude-3.5-Sonnet) achieved Brier scores of 0.15-0.2 and AUROC of 0.6. Although newer models demonstrated improved performance, all exhibited a consistent tendency towards overconfidence. Uncertainty estimation presents a significant challenge to the safe use of LLMs in healthcare. Keywords: Large Language Models; Confidence Elicitation; Artificial Intelligence; Gastroenterology; Uncertainty Quantification
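
To make the two reported metrics concrete, here is a small sketch of how a Brier score and an AUROC can be computed from self-reported confidences and per-question correctness. The confidence values below are invented for illustration and are not data from the study.

```python
def brier_score(confidences, correct):
    """Mean squared gap between stated confidence and the 0/1 outcome."""
    return sum((c - y) ** 2 for c, y in zip(confidences, correct)) / len(correct)


def auroc(confidences, correct):
    """Probability that a correct answer gets higher confidence than an
    incorrect one (ties count as half)."""
    pos = [c for c, y in zip(confidences, correct) if y == 1]
    neg = [c for c, y in zip(confidences, correct) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one correct and one incorrect answer")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    # Hypothetical model that reports 0.9 confidence on every question.
    confidences = [0.9, 0.9, 0.9, 0.9]
    correct = [1, 0, 1, 0]
    print(brier_score(confidences, correct))
    print(auroc(confidences, correct))
```

A model that states the same high confidence regardless of correctness gets an AUROC of 0.5, meaning its confidence carries no information about which answers are right; the AUROC of 0.6 reported above is only modestly better than that.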

[NLP-25] Instruction-Aligned Visual Attention for Mitigating Hallucinations in Large Vision-Language Models ICME2025

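[Quick Read]: This paper addresses hallucinations in Large Vision-Language Models (LVLMs), which generate answers containing non-existent objects when describing images. These models reportedly over-focus on irrelevant image tokens that carry no critical information for answering the question, which distorts the output. To address this, the paper proposes Instruction-Aligned Visual Attention (IAVA). The key idea is to identify irrelevant tokens by comparing how attention weights change under two different instructions, and then to use contrastive decoding to dynamically adjust the logits generated from the original image tokens and from the irrelevant image tokens, reducing the model's over-attention to irrelevant information. Experiments show that IAVA consistently outperforms existing decoding techniques on benchmarks such as MME, POPE, and TextVQA and effectively mitigates object hallucination.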

Link: https://arxiv.org/abs/2503.18556
Authors: Bin Li, Dehong Gao, Yeyuan Wang, Linbo Jin, Shanqing Yu, Xiaoyan Cai, Libin Yang
Affiliations: School of Automation, Northwestern Polytechnical University, Xi’an, Shaanxi, China; School of Cybersecurity, Northwestern Polytechnical University, Xi’an, Shaanxi, China; Alibaba Group, Hangzhou, Zhejiang, China; Zhejiang University of Technology, Hangzhou, Zhejiang, China; Binjiang Institute of Artificial Intelligence, Hangzhou, Zhejiang, China
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments: Accepted by ICME2025

Abstract:Despite the significant success of Large Vision-Language Models (LVLMs), these models still suffer from hallucinations when describing images, generating answers that include non-existent objects. It is reported that these models tend to over-focus on certain irrelevant image tokens that do not contain critical information for answering the question and that distort the output. To address this, we propose an Instruction-Aligned Visual Attention (IAVA) approach, which identifies irrelevant tokens by comparing changes in attention weights under two different instructions. By applying contrastive decoding, we dynamically adjust the logits generated from original image tokens and irrelevant image tokens, reducing the model’s over-attention to irrelevant information. The experimental results demonstrate that IAVA consistently outperforms existing decoding techniques on benchmarks such as MME, POPE, and TextVQA in mitigating object hallucinations. Our IAVA approach is available online at this https URL.
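
The abstract does not give IAVA's exact adjustment rule. As a rough illustration of the contrastive decoding step it mentions, the sketch below uses a common form of the technique: amplify the logits from the original input and subtract the logits obtained from the irrelevant tokens alone. The weight `alpha`, the function names, and the toy logits are assumptions, and the dynamic adjustment described in the paper is not modeled here.

```python
def contrastive_logits(logits_original, logits_irrelevant, alpha=1.0):
    """(1 + alpha) * original - alpha * irrelevant, element by element.

    alpha is a hypothetical contrast weight; alpha = 0 recovers the
    original logits unchanged.
    """
    return [(1 + alpha) * lo - alpha * li
            for lo, li in zip(logits_original, logits_irrelevant)]


def greedy_token(logits):
    """Index of the highest-scoring token."""
    return max(range(len(logits)), key=lambda i: logits[i])


if __name__ == "__main__":
    # Toy vocabulary of three tokens. Token 1 scores highly even when the
    # model only sees irrelevant image tokens, which suggests it is driven
    # by them rather than by the relevant image content.
    original = [2.0, 2.1, 0.5]
    irrelevant = [0.5, 2.0, 0.4]
    print(greedy_token(original))
    print(greedy_token(contrastive_logits(original, irrelevant)))
```

Subtracting the irrelevant-token logits penalizes tokens that the irrelevant content alone would have produced, which is how contrastive decoding steers generation away from such hallucinated objects.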

[NLP-26] Natural Language Processing for Electronic Health Records in Scandinavian Languages: Norwegian, Swedish, and Danish

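[Quick Read]: This paper systematically reviews and analyzes state-of-the-art clinical natural language processing (clinical NLP) methods for clinical text in the mainland Scandinavian languages. It examines the state of research in each language (Norwegian, Swedish, and Danish), the gaps and disparities between them, and the low level of resource sharing and transfer learning. Through a comprehensive literature review, it identifies substantial differences in the adoption of Transformer-based models and finds that, in essential tasks such as de-identification, there is markedly less research on Norwegian and Danish than on Swedish. It also stresses the importance of sharing resources, opening code, adapting pre-trained models, and increasing transfer learning in order to overcome barriers to the advancement of clinical NLP in the region.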

Link: https://arxiv.org/abs/2503.18539
Authors: Ashenafi Zebene Woldaregay, Jørgen Aarmo Lund, Phuong Dinh Ngo, Mariyam Tayefi, Joel Burman, Stine Hansen, Martin Hylleholt Sillesen, Hercules Dalianis, Robert Jenssen, Lindsetmo Rolf Ole, Karl Øyvind Mikalsen
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 45 pages including the appendix, 9 figures in the main manuscript and 11 figures in the Appendix

Abstract:Background: Clinical natural language processing (NLP) refers to the use of computational methods for extracting, processing, and analyzing unstructured clinical text data, and holds huge potential to transform healthcare in various clinical tasks. Objective: The study aims to perform a systematic review to comprehensively assess and analyze the state-of-the-art NLP methods for mainland Scandinavian clinical text. Method: A literature search was conducted in various online databases including PubMed, ScienceDirect, Google Scholar, ACM digital library, and IEEE Xplore between December 2022 and February 2024. Further, relevant references to the included articles were also used to solidify our search. The final pool includes articles that conducted clinical NLP in the mainland Scandinavian languages and were published in English between 2010 and 2024. Results: Out of the 113 articles, 18% (n=21) focus on Norwegian clinical text, 64% (n=72) on Swedish, 10% (n=11) on Danish, and 8% (n=9) focus on more than one language. Generally, the review identified positive developments across the region despite some observable gaps and disparities between the languages. There are substantial disparities in the level of adoption of transformer-based models. In essential tasks such as de-identification, there is significantly less research activity focusing on Norwegian and Danish compared to Swedish text. Further, the review identified a low level of sharing resources such as data, experimentation code, and pre-trained models, and a low rate of adaptation and transfer learning in the region. Conclusion: The review presented a comprehensive assessment of the state-of-the-art clinical NLP for electronic health records (EHR) text in mainland Scandinavian languages and highlighted the potential barriers and challenges that hinder the rapid advancement of the field in the region.

[NLP-27] SciClaims: An End-to-End Generative System for Biomedical Claim Analysis

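[Quick Read]: This paper addresses the validation of key claims in scientific literature, especially biomedical research, and aims to improve automated scientific claim analysis. Current solutions have significant limitations: they lack end-to-end workflows, rely on complex and failure-prone natural language processing (NLP) and information retrieval pipelines, and cannot provide clear, user-friendly justifications for claim verification outcomes. The proposed solution is SciClaims, a system powered by state-of-the-art large language models (LLMs) that integrates the whole process from claim extraction through evidence retrieval to verification. SciClaims outperforms previous approaches in both claim extraction and verification without additional fine-tuning, setting a new benchmark for automated scientific claim analysis.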

Link: https://arxiv.org/abs/2503.18526
Authors: Raúl Ortega, José Manuel Gómez-Pérez
Affiliations: Expert.ai; Language Technology Research Lab
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Digital Libraries (cs.DL)
Comments: Pre-print version

Abstract:Validating key claims in scientific literature, particularly in biomedical research, is essential for ensuring accuracy and advancing knowledge. This process is critical in sectors like the pharmaceutical industry, where rapid scientific progress requires automation and deep domain expertise. However, current solutions have significant limitations. They lack end-to-end pipelines encompassing all claim extraction, evidence retrieval, and verification steps; rely on complex NLP and information retrieval pipelines prone to multiple failure points; and often fail to provide clear, user-friendly justifications for claim verification outcomes. To address these challenges, we introduce SciClaims, an advanced system powered by state-of-the-art large language models (LLMs) that seamlessly integrates the entire scientific claim analysis process. SciClaims outperforms previous approaches in both claim extraction and verification without requiring additional fine-tuning, setting a new benchmark for automated scientific claim analysis.

[NLP-28] Autoregressive Language Models for Knowledge Base Population: A case study in the space mission domain

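[Quick Read]: This paper studies how to fine-tune large language models (LLMs) effectively for Knowledge Base Population (KBP), so that small models perform well on the task and are practical to deploy. The key to the solution is generating a dataset for end-to-end KBP and fine-tuning an autoregressive language model on it for a domain-specific knowledge graph construction task (a space mission knowledge graph). The study finds that fine-tuned models of limited size can match or even exceed the accuracy of larger models on KBP. They are also cheaper to deploy and run, and they do not need the ontology in the prompt, which leaves more context space for input text or output serialization.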

Link: https://arxiv.org/abs/2503.18502
Authors: Andrés García-Silva, José Manuel Gómez-Pérez
Affiliations: Expert.ai; Madrid, Spain
Categories: Computation and Language (cs.CL)
Comments: Pre-print version

Abstract:Knowledge base population (KBP) plays a crucial role in populating knowledge bases and keeping them up to date in organizations by leveraging domain corpora. Motivated by the increasingly large context windows supported by large language models, we propose to fine-tune an autoregressive language model for end-to-end KBP. Our case study involves the population of a space mission knowledge graph. To fine-tune the model we generate a dataset for end-to-end KBP tapping into existing domain resources. Our case study shows that fine-tuned language models of limited size can achieve competitive and even higher accuracy than larger models in the KBP task. Smaller models specialized for KBP offer affordable deployment and lower-cost inference. Moreover, KBP specialist models do not require the ontology to be included in the prompt, allowing for more space in the context for additional input text or output serialization.

[NLP-29] Verbal Process Supervision Elicits Better Coding Agents

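[Quick Read]: This paper addresses the limitations of existing large language models (LLMs) in code generation and understanding on complex software engineering tasks. Even with test-time computed reasoning models, these systems still struggle with complex software engineering challenges. The paper proposes CURA, a code understanding and reasoning agent system that combines language models with a reasoning-driven architecture and is enhanced with Verbal Process Supervision (VPS), achieving a 3.65% improvement over baseline models on challenging benchmarks such as BigCodeBench. Paired with the o3-mini model, CURA attains state-of-the-art performance. The key is the integration of reasoning-driven architectures with LLM-based code generation, which enables agentic reasoning for solving complex software engineering tasks.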

Link: https://arxiv.org/abs/2503.18494
Authors: Hao-Yuan Chen, Cheng-Pong Huang, Jui-Ming Yao
Affiliations: Mindify AI; University of London; National Taiwan University of Science and Technology; National Taiwan University of Science and Technology
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Abstract:The emergence of large language models and their applications as AI agents have significantly advanced state-of-the-art code generation benchmarks, transforming modern software engineering tasks. However, even with test-time computed reasoning models, these systems still struggle with complex software engineering challenges. This work introduces CURA, a code understanding and reasoning agent system enhanced with verbal process supervision (VPS), achieving a 3.65% improvement over baseline models on challenging benchmarks like BigCodeBench. Furthermore, CURA, when paired with the o3-mini model and VPS techniques, attains state-of-the-art performance. This work represents a step forward in integrating reasoning-driven architectures with LLM-based code generation, enabling agentic reasoning for language models to solve complex software engineering tasks.

[NLP-30] Safeguarding Mobile GUI Agent via Logic-based Action Verification

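[Quick Read]: This paper addresses the unreliability and frequent errors of mobile Graphical User Interface (GUI) agents driven by Large Foundation Models (LFMs), which arise from the probabilistic nature of LFMs and the ambiguity and context-dependence of mobile tasks. It proposes VeriSafe Agent (VSA), a formal verification system that acts as a logically grounded safeguard, ensuring that a mobile GUI agent's actions strictly align with user intent before they are executed. VSA's key innovation is a novel autoformalization technique that translates natural language user instructions into a formally verifiable specification expressed in a domain-specific language (DSL). This supports runtime, rule-based verification, so VSA can detect and prevent erroneous actions, either by providing corrective feedback or by halting unsafe behavior. To the authors' knowledge, VSA is the first attempt to bring the rigor of formal verification to GUI agents, bridging the gap between LFM-driven automation and formal software verification.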

Link: https://arxiv.org/abs/2503.18492
Authors: Jungjae Lee, Dongjae Lee, Chihun Choi, Youngmin Im, Jaeyoung Wi, Kihong Heo, Sangeun Oh, Sunjae Lee, Insik Shin
Affiliations: School of Computing, KAIST; Korea University; Sungkyunkwan University; School of Computing, KAIST; School of Computing, KAIST; School of Computing, KAIST; Korea University; Sungkyunkwan University; School of Computing, KAIST
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:Large Foundation Models (LFMs) have unlocked new possibilities in human-computer interaction, particularly with the rise of mobile Graphical User Interface (GUI) Agents capable of interpreting GUIs. These agents promise to revolutionize mobile computing by allowing users to automate complex mobile tasks through simple natural language instructions. However, the inherent probabilistic nature of LFMs, coupled with the ambiguity and context-dependence of mobile tasks, makes LFM-based automation unreliable and prone to errors. To address this critical challenge, we introduce VeriSafe Agent (VSA): a formal verification system that serves as a logically grounded safeguard for Mobile GUI Agents. VSA is designed to deterministically ensure that an agent’s actions strictly align with user intent before conducting an action. At its core, VSA introduces a novel autoformalization technique that translates natural language user instructions into a formally verifiable specification, expressed in our domain-specific language (DSL). This enables runtime, rule-based verification, allowing VSA to detect and prevent erroneous actions before they are executed, either by providing corrective feedback or halting unsafe behavior. To the best of our knowledge, VSA is the first attempt to bring the rigor of formal verification to GUI agents, effectively bridging the gap between LFM-driven automation and formal software verification. We implement VSA using off-the-shelf LLM services (GPT-4o) and evaluate its performance on 300 user instructions across 18 widely used mobile apps. The results demonstrate that VSA achieves 94.3%-98.33% accuracy in verifying agent actions, representing a significant 20.4%-25.6% improvement over existing LLM-based verification methods, and consequently increases the GUI agent’s task completion rate by 90%-130%.

[NLP-31] MAGIC-VQA: Multimodal And Grounded Inference with Commonsense Knowledge for Visual Question Answering

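[Quick Read]: This paper addresses the lack of integrated commonsense knowledge in Large Vision-Language Models (LVLMs) for Visual Question Answering (VQA), which limits their robustness in real-world scenarios. It proposes MAGIC-VQA, a framework that enhances VQA by systematically integrating commonsense knowledge with LVLMs. The key to the solution is a three-stage process: explicit knowledge integration from external sources; By-Type Post-Processing for contextual refinement; and implicit knowledge augmentation using a Graph Neural Network (GNN) for structured reasoning. The approach deepens structured inference and enables relational inference beyond what LVLMs achieve alone, without extensive pre-training or complex prompt tuning.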

Link: https://arxiv.org/abs/2503.18491
Authors: Shuo Yang, Siwen Luo, Soyeon Caren Han, Eduard Hovy
Affiliations: The University of Melbourne; The University of Western Australia
Categories: Computation and Language (cs.CL)
Comments: 8 Pages, 5 figures

Abstract:Visual Question Answering (VQA) requires reasoning across visual and textual modalities, yet Large Vision-Language Models (LVLMs) often lack integrated commonsense knowledge, limiting their robustness in real-world scenarios. To address this, we introduce MAGIC-VQA, a novel framework that enhances VQA by systematically integrating commonsense knowledge with LVLMs. MAGIC-VQA employs a three-stage process: (1) Explicit Knowledge Integration from external sources, (2) By-Type Post-Processing for contextual refinement, and (3) Implicit Knowledge Augmentation using a Graph Neural Network (GNN) for structured reasoning. GNNs bring greater depth to structured inference, enabling superior relational inference beyond LVLMs. MAGIC-VQA bridges a key gap by unifying commonsense knowledge with LVLM-driven reasoning, eliminating the need for extensive pre-training or complex prompt tuning. Our framework achieves state-of-the-art performance on benchmark datasets, significantly improving commonsense reasoning in VQA.

[NLP-32] Whispering in Amharic: Fine-tuning Whisper for Low-resource Language

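[Quick Read]: This paper addresses the low accuracy of automatic speech recognition (ASR) for low-resource languages such as Amharic. The key to the solution is fine-tuning OpenAI's Whisper foundation model on a mix of existing datasets (such as FLEURS) and newly collected Amharic data, which significantly improves performance on Amharic. The study also finds that normalizing Amharic homophones further improves Word Error Rate (WER) and Bilingual Evaluation Understudy (BLEU) scores, showing that fine-tuning strategy and dataset composition are crucial for ASR in low-resource languages.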

Link: https://arxiv.org/abs/2503.18485
Authors: Dawit Ketema Gete, Bedru Yimam Ahamed, Tadesse Destaw Belay, Yohannes Ayana Ejigu, Sukairaj Hafiz Imam, Alemu Belay Tessema, Mohammed Oumer Adem, Tadesse Amare Belay, Robert Geislinger, Umma Aliyu Musa, Martin Semmann, Shamsuddeen Hassan Muhammad, Henning Schreiber, Seid Muhie Yimam
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Abstract:This work explores fine-tuning OpenAI’s Whisper automatic speech recognition (ASR) model for Amharic, a low-resource language, to improve transcription accuracy. While the foundational Whisper model struggles with Amharic due to limited representation in its training data, we fine-tune it using datasets like Mozilla Common Voice, FLEURS, and the BDU-speech dataset. The best-performing model, Whispersmall-am, improves significantly when fine-tuned on a mix of existing FLEURS data and new, unseen Amharic datasets. Training solely on new data leads to poor performance, but combining it with FLEURS data reinforces the model, enabling better specialization in Amharic. We also demonstrate that normalizing Amharic homophones significantly improves Word Error Rate (WER) and Bilingual Evaluation Understudy (BLEU) scores. This study underscores the importance of fine-tuning strategies and dataset composition for improving ASR in low-resource languages, providing insights for future Amharic speech recognition research.
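
To illustrate why homophone normalization affects WER, here is a minimal sketch: WER is computed as word-level edit distance divided by the reference length, and a small character map collapses Amharic letters that are pronounced alike before scoring. The `HOMOPHONES` table is a partial, illustrative assumption covering only a few base characters; the paper's actual normalization rules are not given in the abstract.

```python
# Partial, illustrative map of Amharic homophone letters to one canonical
# form (an assumption for this sketch, not the paper's normalization table).
HOMOPHONES = {"ሐ": "ሀ", "ኀ": "ሀ", "ሠ": "ሰ", "ዐ": "አ", "ፀ": "ጸ"}


def normalize(text):
    """Replace each mapped homophone character with its canonical form."""
    return "".join(HOMOPHONES.get(ch, ch) for ch in text)


def wer(reference, hypothesis):
    """Word error rate: word-level edit distance / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    prev = list(range(len(hyp) + 1))  # distances for the empty reference
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    reference = "ሰላም ዓለም"
    hypothesis = "ሠላም ዓለም"  # same pronunciation, different spelling
    print(wer(reference, hypothesis))
    print(wer(normalize(reference), normalize(hypothesis)))
```

Without normalization, a transcript that differs from the reference only in the choice between homophone letters is counted as a word error; after normalization the two spellings compare equal.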

[NLP-33] PM4Bench: A Parallel Multilingual Multi-Modal Multi-task Benchmark for Large Vision Language Model

[Quick read]: This paper targets three main problems in existing multilingual benchmarks: language-specific content biases, disjointed multimodal input formats, and a lack of safety evaluation. To address them, it proposes PM4Bench, the first parallel multilingual, multi-modal, multi-task benchmark for Large Vision Language Models (LVLMs). Its key idea is a parallel corpus design across 10 languages that enables fair and accurate cross-lingual comparison, combined with a vision setting in which text and queries are embedded in images, so that LVLMs must "see", "read", and "think" at the same time, which is closer to real-world use. PM4Bench also adds safety evaluation, filling an important gap in existing multilingual benchmarks. Evaluating 11 mainstream LVLMs on PM4Bench reveals significant cross-lingual performance disparities, particularly in the vision setting, and identifies optical character recognition (OCR) capability as a key factor behind these imbalances.

Link: https://arxiv.org/abs/2503.18484
Authors: Junyuan Gao, Jiahe Song, Jiang Wu, Runchuan Zhu, Guanlin Shen, Shasha Wang, Xingjian Wei, Haote Yang, Songyang Zhang, Weijia Li, Bin Wang, Dahua Lin, Lijun Wu, Conghui He
Affiliations: Shanghai Artificial Intelligence Laboratory; University of Chinese Academy of Sciences; Peking University; Sun Yat-Sen University; Chinese University of Hong Kong
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments: Equal contribution: Junyuan Gao, Jiahe Song, Jiang Wu; Corresponding author: Conghui He

Abstract:Existing multilingual benchmarks for Large Vision Language Models (LVLMs) suffer from limitations including language-specific content biases, disjointed multimodal input formats, and a lack of safety evaluation. To address these gaps, we propose PM4Bench, the first Parallel Multilingual Multi-Modal Multi-task Benchmark for LVLMs. PM4Bench features a parallel corpus design across 10 languages, enabling fair and accurate cross-lingual comparisons. It includes the vision setting where text and queries are embedded in images, requiring LVLMs to simultaneously “see”, “read”, and “think”, aligning with real-world applications. Additionally, PM4Bench incorporates safety evaluations, addressing a critical oversight in existing multilingual benchmarks. Using PM4Bench, we evaluate 11 mainstream LVLMs, revealing significant cross-linguistic performance disparities, particularly in vision settings, and identifying OCR capability as a key determinant of these imbalances. We will release PM4Bench at this https URL.

[NLP-34] Global-Local Tree Search for Language Guided 3D Scene Generation CVPR2025

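[Quick read]: This paper addresses 3D indoor scene generation with Vision-Language Models (VLMs), an area that existing work on large VLMs has rarely covered. It treats the task as a planning problem constrained by spatial and layout common sense and proposes a new global-local tree search algorithm as the core of the solution. The key is to decompose the scene structure hierarchically (room level, region level, floor object level, and supported object level): globally, objects are placed one by one while multiple candidate layouts are explored, and locally, the placement of each object is split into several steps to refine the search. To let the VLM produce object positions, the top-down view is discretized into a dense grid whose cells are filled with diverse emojis so that each position is distinct, and the VLM returns a reasonable location by naming an emoji. Experiments show that the generated 3D scenes are more plausible than those of state-of-the-art methods.

Link: https://arxiv.org/abs/2503.18476
Authors: Wei Deng, Mengshi Qi, Huadong Ma
Affiliations: State Key Laboratory of Networking and Switching Technology, Beijing University of Posts and Telecommunications
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments: Accepted by CVPR 2025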

Abstract:Large Vision-Language Models (VLMs), such as GPT-4, have achieved remarkable success across various fields. However, there are few studies on 3D indoor scene generation with VLMs. This paper considers this task as a planning problem subject to spatial and layout common sense constraints. To solve the problem with a VLM, we propose a new global-local tree search algorithm. Globally, the method places each object sequentially and explores multiple placements during each placement process, where the problem space is represented as a tree. To reduce the depth of the tree, we decompose the scene structure hierarchically, i.e. room level, region level, floor object level, and supported object level. The algorithm independently generates the floor objects in different regions and supported objects placed on different floor objects. Locally, we also decompose the sub-task, the placement of each object, into multiple steps. The algorithm searches the tree of problem space. To leverage the VLM to produce positions of objects, we discretize the top-down view space as a dense grid and fill each cell with diverse emojis to make the cells distinct. We prompt the VLM with the emoji grid and the VLM produces a reasonable location for the object by describing the position with the name of emojis. The quantitative and qualitative experimental results show that our approach generates more plausible 3D scenes than state-of-the-art approaches. Our source code is available at this https URL.
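
The emoji-grid idea can be sketched roughly as follows. This is only an illustration under stated assumptions: the emoji vocabulary, the names, the grid size, and the `build_grid` helper are all hypothetical, since the abstract does not give the actual emoji set or prompt format.

```python
from typing import Dict, List, Tuple

# Hypothetical emoji vocabulary as (symbol, name) pairs; the paper's actual
# emoji set and naming are not given in the abstract.
EMOJIS: List[Tuple[str, str]] = [
    ("🍎", "apple"), ("🚗", "car"), ("🐶", "dog"),
    ("🌙", "moon"), ("🎈", "balloon"), ("⚽", "ball"),
]


def build_grid(width: float, depth: float, rows: int, cols: int):
    """Discretize a width x depth top-down view into rows x cols emoji cells.

    Returns the grid as text rows plus a lookup from emoji name to the
    (x, y) centre of its cell, with the origin at the top-left corner.
    """
    if rows * cols > len(EMOJIS):
        raise ValueError("not enough distinct emojis for this grid")
    cell_w, cell_d = width / cols, depth / rows
    lines: List[str] = []
    centres: Dict[str, Tuple[float, float]] = {}
    for r in range(rows):
        row_symbols = []
        for c in range(cols):
            symbol, name = EMOJIS[r * cols + c]
            row_symbols.append(symbol)
            centres[name] = ((c + 0.5) * cell_w, (r + 0.5) * cell_d)
        lines.append(" ".join(row_symbols))
    return lines, centres


grid_lines, centres = build_grid(width=3.0, depth=2.0, rows=2, cols=3)
prompt_grid = "\n".join(grid_lines)  # text that would be shown to the VLM
position = centres["ball"]           # decode a VLM answer such as "ball"
print(prompt_grid)
print(position)
```

Because every cell carries a unique emoji, a model reply that names one emoji can be decoded back into a continuous coordinate without asking the model to output numbers.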

[NLP-35] Words as Bridges: Exploring Computational Support for Cross-Disciplinary Translation Work

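[Quick read]: This paper addresses the comprehension barrier that domain-specific jargon creates when scholars explore literature across disciplines. Traditional computational approaches usually support this translation work by removing jargon through simplification and summarization; this paper instead keeps jargon as a useful bridge to new conceptual spaces. The key is to treat different scholarly domains as communities that use different languages and to apply unsupervised cross-lingual word embedding alignment to explore conceptual alignments between domain-specific word embeddings. The authors build a prototype cross-domain search engine that uses the aligned domain-specific embeddings to support conceptual exploration and test it in two case studies. They discuss the promises and pitfalls of this approach to translation work and offer design insights for future interfaces that provide computational support for cross-domain information seeking.

Link: https://arxiv.org/abs/2503.18471
Authors: Calvin Bao, Yow-Ting Shiue, Marine Carpuat, Joel Chan
Affiliations: University of Maryland, College Park
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments: 26 pages, 8 tables, 6 figures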

Abstract:Scholars often explore literature outside of their home community of study. This exploration process is frequently hampered by field-specific jargon. Past computational work often focuses on supporting translation work by removing jargon through simplification and summarization; here, we explore a different approach that preserves jargon as useful bridges to new conceptual spaces. Specifically, we cast different scholarly domains as different language-using communities, and explore how to adapt techniques from unsupervised cross-lingual alignment of word embeddings to explore conceptual alignments between domain-specific word embedding spaces. We developed a prototype cross-domain search engine that uses aligned domain-specific embeddings to support conceptual exploration, and tested this prototype in two case studies. We discuss qualitative insights into the promises and pitfalls of this approach to translation work, and suggest design insights for future interfaces that provide computational support for cross-domain information seeking.

[NLP-36] StableGS: A Floater-Free Framework for 3D Gaussian Splatting

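[Quick read]: This paper addresses the training instability of 3D Gaussian Splatting (3DGS) in novel view synthesis, in particular the coupled opacity-color optimization that frequently converges to local minima and produces floater artifacts that degrade visual fidelity. The solution has two key parts: cross-view depth consistency constraints that eliminate floaters, and a dual-opacity 3DGS model that decouples the geometry and material properties of translucent objects. To further improve reconstruction in weakly textured regions, the paper also integrates DUSt3R depth estimation, which significantly improves geometric stability. These changes address the training instability of 3DGS at its root and outperform existing state-of-the-art methods on open-source datasets.

Link: https://arxiv.org/abs/2503.18458
Authors: Luchao Wang, Qian Ren, Kaiming He, Hua Wang, Zhi Chen, Yaohua Tang
Affiliations: Moore Threads AI
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments: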

Abstract:Recent years have witnessed remarkable success of 3D Gaussian Splatting (3DGS) in novel view synthesis, surpassing prior differentiable rendering methods in both quality and efficiency. However, its training process suffers from coupled opacity-color optimization that frequently converges to local minima, producing floater artifacts that degrade visual fidelity. We present StableGS, a framework that eliminates floaters through cross-view depth consistency constraints while introducing a dual-opacity GS model to decouple geometry and material properties of translucent objects. To further enhance reconstruction quality in weakly-textured regions, we integrate DUSt3R depth estimation, significantly improving geometric stability. Our method fundamentally addresses 3DGS training instabilities, outperforming existing state-of-the-art methods across open-source datasets.

[NLP-37] On the Perception Bottleneck of VLMs for Chart Understanding

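[Quick read]: This paper addresses the perception bottleneck of existing Large Vision-Language Models (LVLMs) in chart understanding. It decomposes this bottleneck into two components: the vision encoder bottleneck, where the visual representation may fail to encapsulate the required information, and the extraction bottleneck, where the language model struggles to extract the necessary information from the provided visual representations. The key solution is to enhance the vision encoder under a contrastive learning framework to mitigate the vision encoder bottleneck, which significantly improves the ability of LVLMs to understand charts. The code is publicly available.

Link: https://arxiv.org/abs/2503.18435
Authors: Junteng Liu, Weihao Zeng, Xiwen Zhang, Yijun Wang, Zifei Shan, Junxian He
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments: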

Abstract:Chart understanding requires models to effectively analyze and reason about numerical data, textual elements, and complex visual components. Our observations reveal that the perception capabilities of existing large vision-language models (LVLMs) constitute a critical bottleneck in this process. In this study, we delve into this perception bottleneck by decomposing it into two components: the vision encoder bottleneck, where the visual representation may fail to encapsulate the correct information, and the extraction bottleneck, where the language model struggles to extract the necessary information from the provided visual representations. Through comprehensive experiments, we find that (1) the information embedded within visual representations is substantially richer than what is typically captured by linear extractors, such as the widely used retrieval accuracy metric; (2) While instruction tuning effectively enhances the extraction capability of LVLMs, the vision encoder remains a critical bottleneck, demanding focused attention and improvement. Therefore, we further enhance the visual encoder to mitigate the vision encoder bottleneck under a contrastive learning framework. Empirical results demonstrate that our approach significantly mitigates the perception bottleneck and improves the ability of LVLMs to comprehend charts. Code is publicly available at this https URL.

[NLP-38] Teaching LLMs for Step-Level Automatic Math Correction via Reinforcement Learning

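[Quick read]: This paper addresses the fact that existing automatic math correction methods only judge the final answer and ignore detailed feedback on each step of the problem-solving process, which requires semantic understanding and reasoning. It proposes StepAMC, a reinforcement learning (RL) based method that improves large language models (LLMs) on step-level automatic math correction. The key components are recasting step-level automatic math correction as an RL problem to strengthen the reasoning ability of LLMs, a space-constrained policy network that improves the stability of RL, and a fine-grained reward network that converts binary human feedback into a continuous value. Experiments show that the model outperforms eleven strong baselines on two benchmark datasets.

Link: https://arxiv.org/abs/2503.18432
Authors: Junsong Li, Jie Zhou, Yutao Yang, Bihao Zhan, Qianjun Pan, Yuyang Ding, Qin Chen, Jiang Bo, Xin Lin, Liang He
Affiliations: School of Computer Science and Technology, East China Normal University, Shanghai, China; Lab of AI for Education, East China Normal University, Shanghai, China
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: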

Abstract:Automatic math correction aims to check students’ solutions to mathematical problems via artificial intelligence technologies. Most existing studies focus on judging the final answer at the problem level, while they ignore detailed feedback on each step in a math problem-solving process, which requires abilities of semantic understanding and reasoning. In this paper, we propose a reinforcement learning (RL)-based method, named StepAMC, to boost large language models (LLMs) for step-level automatic math correction. In particular, we convert step-level automatic math correction from a text classification task into an RL problem to enhance the reasoning capabilities of LLMs. We then design a space-constrained policy network to improve the stability of RL. Finally, we introduce a fine-grained reward network to convert the binary human feedback into a continuous value. We conduct extensive experiments over two benchmark datasets and the results show that our model outperforms eleven strong baselines.

[NLP-39] Solving Situation Puzzles with Large Language Model and External Reformulation

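[Quick read]: This paper addresses the poor performance of Large Language Models (LLMs) on problems that require multi-round dialogue reasoning, especially situation puzzles, where LLMs tend to ask overly detailed or repeated questions and reason inefficiently. The key solution is a novel external reformulation methodology: the situation puzzle is restated after several question-answer (QA) rounds or whenever the LLM makes an incorrect guess. Experiments show that the method beats direct use of LLMs in win rate and in the number of question and guess attempts, highlighting the potential of strategic problem reformulation to improve LLM reasoning in complex interactive scenarios.

Link: https://arxiv.org/abs/2503.18394
Authors: Kun Li, Xinwei Chen, Tianyou Song, Chengrui Zhou, Zhuoran Liu, Zhenyan Zhang, Jiangjian Guo, Qing Shan
Affiliations: University of Illinois Urbana-Champaign (UIUC); Columbia University; Carnegie Mellon University (CMU); University of California San Diego (UCSD); Northeastern University
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: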

Abstract:In recent years, large language models (LLMs) have shown an impressive ability to perform arithmetic and symbolic reasoning tasks. However, we found that LLMs (e.g., ChatGPT) cannot perform well on reasoning that requires multiple rounds of dialogue, especially when solving situation puzzles. Specifically, after several rounds of QA, LLMs tend to ask very detailed questions focusing on a specific aspect, or to repeat the same or similar questions. To help LLMs get out of this dilemma, we propose a novel external reformulation methodology, where the situation puzzle is reformulated after several rounds of QA or when the LLM raises an incorrect guess. Experiments show that our method outperforms directly using LLMs for solving situation puzzles (e.g., in win rate and number of question/guess attempts), highlighting the potential of strategic problem reformulation to enhance the reasoning capabilities of LLMs in complex interactive scenarios.
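
The control flow described above (reformulate after several QA rounds or after a wrong guess) might look roughly like this. Everything here is a hypothetical sketch: `solve_with_reformulation`, its callback parameters, the `every` interval, and the toy lambdas stand in for LLM calls that the abstract does not specify.

```python
from typing import Callable, List, Tuple

History = List[Tuple[str, str]]


def solve_with_reformulation(
    puzzle: str,
    ask: Callable[[str, History], str],
    answer: Callable[[str], str],
    reformulate: Callable[[str, History], str],
    is_guess: Callable[[str], bool],
    is_correct: Callable[[str], bool],
    every: int = 3,
    max_rounds: int = 20,
) -> Tuple[bool, int, int]:
    """Run a QA loop, rewriting the puzzle every `every` rounds or after a
    wrong guess. Returns (solved, rounds_used, reformulations)."""
    history: History = []
    reformulations = 0
    for round_no in range(1, max_rounds + 1):
        move = ask(puzzle, history)  # player model asks a question or guesses
        if is_guess(move):
            if is_correct(move):
                return True, round_no, reformulations
            wrong_guess = True
            history.append((move, "incorrect guess"))
        else:
            wrong_guess = False
            history.append((move, answer(move)))  # judge replies, e.g. yes/no
        if wrong_guess or round_no % every == 0:
            # Fold what has been learned so far back into the puzzle text.
            puzzle = reformulate(puzzle, history)
            reformulations += 1
    return False, max_rounds, reformulations


# Toy stand-ins for the LLM calls, only to show the control flow.
script = iter(["Is it day?", "Is he alone?", "guess: he slipped", "guess: hiccups"])
result = solve_with_reformulation(
    puzzle="A man asks for water; the barman points a gun; the man says thanks.",
    ask=lambda p, h: next(script),
    answer=lambda q: "no",
    reformulate=lambda p, h: p + f" Known facts: {len(h)}.",
    is_guess=lambda m: m.startswith("guess:"),
    is_correct=lambda m: m == "guess: hiccups",
    every=2,
)
print(result)
```

In the toy run the puzzle is rewritten once on the round-2 schedule and once after the wrong guess in round 3, before the correct guess ends the game in round 4.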

[NLP-40] JH: Evaluating the Robustness of Large Language Models Under Knowledge-Injection Attacks in Legal Domain

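[Quick read]: This paper asks whether Large Language Models (LLMs) in the legal domain truly reason from domain knowledge, rather than relying only on specific words or pattern matching. If LLMs lack logical reasoning ability, using them as "judges" may carry substantial risk. The paper therefore proposes a legal knowledge injection attack method to test the robustness of LLMs in the legal domain and to infer whether they have learned legal knowledge and reasoning logic.

The key to the solution is an evaluation framework named JH, which applies knowledge injection attacks to each part of the deductive reasoning behind legal tasks (major premise, minor premise, and conclusion generation). The attacks simulate mistakes that legal experts might make in real judicial decisions, such as typos, legal synonyms, and inaccurate retrieval of external legal statutes, to reveal whether LLMs can resist such interference and still reason correctly. The paper also explores several methods for improving the legal knowledge robustness of LLMs.

Link: https://arxiv.org/abs/2503.18360
Authors: Yiran Hu, Huanghai Liu, Qingjing Chen, Ning Zheng, Chong Wang, Yun Liu, Charles L.A. Clarke, Weixing Shen
Affiliations: University of Waterloo; Tsinghua University
Categories: Computation and Language (cs.CL)
Comments: 10 pages, 5 figures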

Abstract:As the scale and capabilities of Large Language Models (LLMs) increase, their applications in knowledge-intensive fields such as the legal domain have garnered widespread attention. However, it remains doubtful whether these LLMs make judgments based on domain knowledge for reasoning. If LLMs base their judgments solely on specific words or patterns, rather than on the underlying logic of the language, the “LLM-as-judges” paradigm poses substantial risks in real-world applications. To address this question, we propose a method of legal knowledge injection attacks for robustness testing, thereby inferring whether LLMs have learned legal knowledge and reasoning logic. In this paper, we propose JH: an evaluation framework for detecting the robustness of LLMs under knowledge injection attacks in the legal domain. The aim of the framework is to explore whether LLMs perform deductive reasoning when accomplishing legal tasks. To further this aim, we have attacked each part of the reasoning logic underlying these tasks (major premise, minor premise, and conclusion generation). We have collected mistakes that legal experts might make in judicial decisions in the real world, such as typos, legal synonyms, and inaccurate retrieval of external legal statutes. In real legal practice, legal experts tend to overlook these mistakes and make judgments based on logic. When faced with these errors, however, LLMs are likely to be misled by typographical errors and may not utilize logic in their judgments. We conducted knowledge injection attacks on existing general and domain-specific LLMs. Current LLMs are not robust against the attacks employed in our experiments. In addition, we propose and compare several methods to enhance the knowledge robustness of LLMs.

[NLP-41] Bridging Writing Manner Gap in Visual Instruction Tuning by Creating LLM-aligned Instructions

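[Quick read]: This paper addresses the "writing manner gap" that exists, during visual instruction tuning, between the visual instructions and the pre-trained base Large Language Models (LLMs) inside Large Multi-modal Models (LMMs). This gap pushes the base LLM away from its original writing style, which harms its capabilities and lowers overall performance. The key solution is to use the base LLM itself to align the writing manner of soft-format visual instructions with its own, producing new LLM-aligned instructions that close the gap while preserving the original semantics. Experiments show that the method reduces the writing manner gap, and that with the aligned instructions the baseline models LLaVA-7B and QwenVL show stronger resistance to hallucinations and non-trivial overall improvements across all 15 visual and language benchmarks.

Link: https://arxiv.org/abs/2503.18320
Authors: Dong Jing, Nanyi Fei, Zhiwu Lu
Affiliations: Gaoling School of Artificial Intelligence, Renmin University of China
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: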

Abstract:In the realm of Large Multi-modal Models (LMMs), the instruction quality during the visual instruction tuning stage significantly influences the performance of modality alignment. In this paper, we assess the instruction quality from a unique perspective termed Writing Manner, which encompasses the selection of vocabulary, grammar and sentence structure to convey specific semantics. We argue that there exists a substantial writing manner gap between the visual instructions and the base Large Language Models (LLMs) within LMMs. This gap forces the pre-trained base LLMs to deviate from their original writing styles, leading to capability degradation of both base LLMs and LMMs. To bridge the writing manner gap while preserving the original semantics, we propose directly leveraging the base LLM to align the writing manner of soft-format visual instructions with that of the base LLM itself, resulting in novel LLM-aligned instructions. The manual writing manner evaluation results demonstrate that our approach successfully minimizes the writing manner gap. By utilizing LLM-aligned instructions, the baseline models LLaVA-7B and QwenVL demonstrate enhanced resistance to hallucinations and non-trivial comprehensive improvements across all 15 visual and language benchmarks.

[NLP-42] Surgical Action Planning with Large Language Models

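[Quick read]: This paper addresses the Surgical Action Planning (SAP) task in robot-assisted minimally invasive surgery, in particular the absence of intraoperative predictive planning in current intelligent applications. It focuses on generating future action plans from visual inputs while handling difficulties such as understanding instrument-action relationships and tracking surgical progress. To meet these challenges, it proposes LLM-SAP, a surgical action planning framework based on Large Language Models (LLMs). The framework predicts future actions and generates text responses by interpreting natural language prompts of surgical goals, potentially supporting surgical education, intraoperative decision-making, procedure documentation, and skill analysis. Its key novelty is the Near-History Focus Memory Module (NHF-MM) for modeling historical states, combined with a prompts factory for action planning. On the CholecT50-SAP dataset, pre-trained models such as Qwen2.5 and Qwen2-VL are tested zero-shot, and supervised fine-tuning (SFT) with LoRA is applied to address data privacy concerns; Qwen2.5-72B-SFT achieves 19.3% higher accuracy than the version without fine-tuning.

Link: https://arxiv.org/abs/2503.18296
Authors: Mengya Xu, Zhongzhen Huang, Jie Zhang, Xiaofan Zhang, Qi Dou
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: 10 pages, 4 figures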

Abstract:In robot-assisted minimally invasive surgery, we introduce the Surgical Action Planning (SAP) task, which generates future action plans from visual inputs to address the absence of intraoperative predictive planning in current intelligent applications. SAP shows great potential for enhancing intraoperative guidance and automating procedures. However, it faces challenges such as understanding instrument-action relationships and tracking surgical progress. Large Language Models (LLMs) show promise in understanding surgical video content but remain underexplored for predictive decision-making in SAP, as they focus mainly on retrospective analysis. Challenges like data privacy, computational demands, and modality-specific constraints further highlight significant research gaps. To tackle these challenges, we introduce LLM-SAP, a Large Language Models-based Surgical Action Planning framework that predicts future actions and generates text responses by interpreting natural language prompts of surgical goals. The text responses potentially support surgical education, intraoperative decision-making, procedure documentation, and skill analysis. LLM-SAP integrates two novel modules: the Near-History Focus Memory Module (NHF-MM) for modeling historical states and the prompts factory for action planning. We evaluate LLM-SAP on our constructed CholecT50-SAP dataset using models like Qwen2.5 and Qwen2-VL, demonstrating its effectiveness in next-action prediction. Pre-trained LLMs are tested zero-shot, and supervised fine-tuning (SFT) with LoRA is implemented to address data privacy concerns. Our experiments show that Qwen2.5-72B-SFT surpasses Qwen2.5-72B with a 19.3% higher accuracy.

[NLP-43] Fact-checking AI-generated news reports: Can LLMs catch their own lies?

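[Quick read]: This paper examines how well Large Language Models (LLMs) can assess the veracity of news reports generated by themselves or by other LLMs. Specifically, it asks whether LLMs can effectively fact-check their own content using methods similar to those used to verify claims made by humans. The study looks at how assessment ability varies across types of information (static versus dynamic information, national or international news versus local news) and across true and false claims, and it evaluates a strategy based on Retrieval-Augmented Generation (RAG). Adding search engine results significantly reduces the number of claims an LLM cannot assess, but it also increases the rate of incorrect assessments, mainly because the retrieved results vary in relevance and quality. The study therefore stresses that future work should focus on improving the precision and relevance of retrieved information to better support fact-checking of machine-generated content. Claims about dynamic events and local news may still require human-in-the-loop systems to ensure accuracy and reliability.

Link: https://arxiv.org/abs/2503.18293
Authors: Jiayi Yao, Haibo Sun, Nianwen Xue
Affiliations: Brandeis University
Categories: Computation and Language (cs.CL)
Comments: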

Abstract:In this paper, we evaluate the ability of Large Language Models (LLMs) to assess the veracity of claims in “news reports” generated by themselves or other LLMs. Our goal is to determine whether LLMs can effectively fact-check their own content, using methods similar to those used to verify claims made by humans. Our findings indicate that LLMs are more effective at assessing claims in national or international news stories than in local news stories, better at evaluating static information than dynamic information, and better at verifying true claims compared to false ones. We hypothesize that this disparity arises because the former types of claims are better represented in the training data. Additionally, we find that incorporating retrieved results from a search engine in a Retrieval-Augmented Generation (RAG) setting significantly reduces the number of claims an LLM cannot assess. However, this approach also increases the occurrence of incorrect assessments, partly due to irrelevant or low-quality search results. This diagnostic study highlights the need for future research on fact-checking machine-generated reports to prioritize improving the precision and relevance of retrieved information to better support fact-checking efforts. Furthermore, claims about dynamic events and local news may require human-in-the-loop fact-checking systems to ensure accuracy and reliability.

[NLP-44] When is dataset cartography ineffective? Using training dynamics does not improve robustness against Adversarial SQuAD

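[Quick read]: This paper studies how effective dataset cartography is for extractive question answering on the SQuAD dataset. It analyzes annotation artifacts in SQuAD, evaluates the impact of two adversarial datasets (AddSent and AddOneSent) on an ELECTRA-small model, and partitions SQuAD into easy-to-learn, ambiguous, and hard-to-learn subsets. The key is to use training dynamics for this partition and to compare models trained on these subsets with models trained on randomly selected samples of equal size. The results show that training on cartography-based subsets does not improve generalization to the SQuAD validation set or the AddSent adversarial set; the hard-to-learn subset yields a slightly higher F1 score on AddOneSent, but the overall gains are limited. This suggests that dataset cartography contributes little to adversarial robustness in SQuAD-style QA tasks. Finally, the author compares these results with prior findings on SNLI and discusses possible reasons for the differences.

Link: https://arxiv.org/abs/2503.18290
Authors: Paul K. Mandal
Affiliations: Neurint LLC; The University of Texas at Austin
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 5 pages, 3 figures, 4 tables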

Abstract:In this paper, I investigate the effectiveness of dataset cartography for extractive question answering on the SQuAD dataset. I begin by analyzing annotation artifacts in SQuAD and evaluate the impact of two adversarial datasets, AddSent and AddOneSent, on an ELECTRA-small model. Using training dynamics, I partition SQuAD into easy-to-learn, ambiguous, and hard-to-learn subsets. I then compare the performance of models trained on these subsets to those trained on randomly selected samples of equal size. Results show that training on cartography-based subsets does not improve generalization to the SQuAD validation set or the AddSent adversarial set. While the hard-to-learn subset yields a slightly higher F1 score on the AddOneSent dataset, the overall gains are limited. These findings suggest that dataset cartography provides little benefit for adversarial robustness in SQuAD-style QA tasks. I conclude by comparing these results to prior findings on SNLI and discuss possible reasons for the observed differences.
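
A small sketch of how training dynamics can drive such a partition, following the usual dataset cartography statistics: confidence is the mean probability assigned to the gold answer across epochs, and variability is its standard deviation. The per-epoch probabilities, the example IDs, and the subset size `k` below are made up for illustration, and treating the gold-answer probability as a single number per QA example is a simplifying assumption.

```python
from statistics import mean, pstdev
from typing import Dict, List


def training_dynamics(gold_probs: Dict[str, List[float]]) -> Dict[str, Dict[str, float]]:
    """Per example: confidence = mean gold-answer probability across epochs,
    variability = population standard deviation of that probability."""
    return {
        ex_id: {"confidence": mean(p), "variability": pstdev(p)}
        for ex_id, p in gold_probs.items()
    }


def partition(stats: Dict[str, Dict[str, float]], k: int) -> Dict[str, List[str]]:
    """Pick k examples for each region of the data map."""
    ids = list(stats)
    return {
        "easy": sorted(ids, key=lambda i: -stats[i]["confidence"])[:k],
        "hard": sorted(ids, key=lambda i: stats[i]["confidence"])[:k],
        "ambiguous": sorted(ids, key=lambda i: -stats[i]["variability"])[:k],
    }


# Hypothetical gold-answer probabilities logged after each of four epochs.
gold_probs = {
    "q1": [0.90, 0.95, 0.97, 0.98],  # consistently right
    "q2": [0.10, 0.80, 0.20, 0.90],  # flips between epochs
    "q3": [0.05, 0.10, 0.08, 0.12],  # consistently wrong
}
subsets = partition(training_dynamics(gold_probs), k=1)
print(subsets)
```

The comparison in the abstract would then train one model per subset and one on a random sample of the same size `k`.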

[NLP-45] Sun-Shine: A Large Language Model for Tibetan Culture

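[Quick read]: This paper addresses the limited use of Large Language Models (LLMs) for Tibetan culture, in particular the model development challenges caused by the complex grammar of Tibetan, its cultural depth, and data scarcity. The key solution is Llama-Sunshine (Sun-Shine), the first large language model focused on Tibetan culture, which adopts state-of-the-art model architectures optimized for the linguistic features of Tibetan, together with TIB-STC, a comprehensive large-scale dataset of diverse Tibetan texts such as literature, religious scripts, news, and conversational data. With these, Sun-Shine shows a higher level of expertise in Tibetan cultural knowledge, generalizes well in low-resource scenarios, and gains preliminary embodied intelligence capabilities in tasks such as language modeling, text classification, machine translation, and syntactic analysis.

Link: https://arxiv.org/abs/2503.18288
Authors: Cheng Huang, Fan Gao, Nyima Tashi, Yutong Liu, Xiangxiang Wang, Thupten Tsering, Ban Ma-bao, Renzeg Duojie, Gadeng Luosang, Rinchen Dongrub, Dorje Tashi, Xiao Feng, Yongbin Yu
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: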

Abstract:Tibetan, a minority language in China, features a highly intricate grammatical structure, characterized by four verb tenses and a tense system with frequent irregularities, contributing to its extensive inflectional diversity. Recently, advances in Large Language Models (LLMs) have transformed the paradigm in many domains. Despite the success in other fields, current LLMs often fall short in catering to the needs of domain experts like Tibetans, and the potential of LLMs for Tibetan culture is under-explored. The intrinsic reasons are the immense and intricate nature of Tibetan culture as well as the necessity for higher granularity and richness in knowledge. Simultaneously, the complexity and uniqueness of its grammatical structure, coupled with its status as a minority ethnic language, contribute to data scarcity, which remains a fundamental challenge. To alleviate these issues, we introduce Llama-Sunshine (Sun-Shine), the first large language model for Tibetan culture, which is expert in various Tibetan language processing tasks. Sun-Shine incorporates state-of-the-art model architectures optimized for Tibetan’s linguistic features. We also propose TIB-STC, a comprehensive dataset comprising diverse Tibetan texts such as literature, religious scripts, news, and conversational data, which is also the first large-scale dataset for Tibetan culture. Through comprehensive experiments, Sun-Shine not only demonstrates a higher level of knowledge expertise for Tibetan culture but also gains preliminary embodied intelligence capabilities in Tibetan language processing tasks, like language modeling, text classification, machine translation, and syntactic analysis. Moreover, it excels in low-resource scenarios, showcasing strong generalization capabilities.

[NLP-46] Bridging Emotions and Architecture: Sentiment Analysis in Modern Distributed Systems

[Quick read]: This paper studies how sentiment analysis can be combined with distributed systems, focusing on different approaches, the challenges involved, and future research directions. It also trains sentiment analysis models under both a single-node configuration and a distributed architecture and runs extensive experiments to show the strengths and weaknesses of each in terms of performance and accuracy. The key is the comparative analysis of single-node and distributed architectures on sentiment analysis tasks, which reveals where each is applicable and where it falls short.

Link: https://arxiv.org/abs/2503.18260
Authors: Mahak Shah, Akaash Vishal Hazarika, Meetu Malhotra, Sachin C. Patil, Joshit Mohanty
Affiliations: Department of Computer Science, Columbia University; Department of Computer Science, North Carolina State University; Harrisburg University of Science and Technology; Old Dominion University
Categories: Computation and Language (cs.CL); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
Comments: IEEE 3rd International Conference on Advancements in Smart, Secure and Intelligent Computing (ASSIC)

Abstract:Sentiment analysis is a field within NLP that has gained importance because it is applied in various areas such as social media surveillance, customer feedback evaluation, and market research. At the same time, distributed systems allow for effective processing of large amounts of data. Therefore, this paper examines how sentiment analysis converges with distributed systems by concentrating on different approaches, challenges, and future investigations. Furthermore, we conduct an extensive experiment in which we train sentiment analysis models using both a single-node configuration and a distributed architecture to bring out the benefits and shortcomings of each method in terms of performance and accuracy.

[NLP-47] Enhancing Multi-Label Emotion Analysis and Corresponding Intensities for Ethiopian Languages

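[Quick read]: This paper addresses multi-label emotion analysis and intensity annotation for human-computer interaction tasks, in particular the need to model and integrate emotion understanding in decision-making, product feedback analysis, political promotion, marketing research, and social media monitoring. The key contribution is to extend the EthioEmo dataset with intensity annotations for each labeled emotion, capturing more precisely how strongly emotions are expressed and how much impact they have, which matters especially for negative emotions in applications such as healthcare and mental health studies. The paper also evaluates a range of state-of-the-art encoder-only Pretrained Language Models (PLMs) and decoder-only Large Language Models (LLMs) to provide comprehensive benchmarking for the task.

Link: https://arxiv.org/abs/2503.18253
Authors: Tadesse Destaw Belay, Dawit Ketema Gete, Abinew Ali Ayele, Olga Kolesnikova, Grigori Sidorov, Seid Muhie Yimam
Affiliations: Instituto Politécnico Nacional; Wollo University; Bahir Dar University; University of Hamburg; Unknown
Categories: Computation and Language (cs.CL)
Comments: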

Abstract:In this digital world, people freely express their emotions using different social media platforms. As a result, modeling and integrating emotion-understanding models are vital for various human-computer interaction tasks such as decision-making, product and customer feedback analysis, political promotions, marketing research, and social media monitoring. As users express different emotions simultaneously in a single instance, annotating emotions in a multilabel setting such as the EthioEmo (Belay et al., 2025) dataset effectively captures this dynamic. Additionally, incorporating intensity, or the degree of emotion, is crucial, as emotions can significantly differ in their expressive strength and impact. This intensity is significant for assessing whether further action is necessary in decision-making processes, especially concerning negative emotions in applications such as healthcare and mental health studies. To enhance the EthioEmo dataset, we include annotations for the intensity of each labeled emotion. Furthermore, we evaluate various state-of-the-art encoder-only Pretrained Language Models (PLMs) and decoder-only Large Language Models (LLMs) to provide comprehensive benchmarking.

[NLP-48] PAD: Towards Efficient Data Generation for Transfer Learning Using Phrase Alignment

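[Quick read]: This paper addresses the scarcity of modeling resources for non-English languages such as Korean by using transfer learning from abundant English data. The key solution is to explore how Phrase Aligned Data (PAD) can make transfer learning more efficient. The study finds that PAD works well with the syntactic characteristics of Korean, mitigates the weaknesses of traditional Statistical Machine Translation (SMT), and significantly improves model performance. PAD is also shown to complement traditional data construction methods and to make them more effective when combined. The approach improves model performance and offers a more cost-efficient solution for resource-scarce languages.

Link: https://arxiv.org/abs/2503.18250
Authors: Jong Myoung Kim, Young-Jun Lee, Ho-Jin Choi, Sangkeun Jung
Affiliations: SK-telecom; School of Computing, KAIST; The Division of Computer Convergence, Chungnam National University
Categories: Computation and Language (cs.CL)
Comments: Preparing for conference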

Abstract:Transfer learning leverages the abundance of English data to address the scarcity of resources in modeling non-English languages, such as Korean. In this study, we explore the potential of Phrase Aligned Data (PAD) from standardized Statistical Machine Translation (SMT) to enhance the efficiency of transfer learning. Through extensive experiments, we demonstrate that PAD synergizes effectively with the syntactic characteristics of the Korean language, mitigating the weaknesses of SMT and significantly improving model performance. Moreover, we reveal that PAD complements traditional data construction methods and enhances their effectiveness when combined. This innovative approach not only boosts model performance but also suggests a cost-efficient solution for resource-scarce languages.

[NLP-49] AfroXLMR-Social: Adapting Pre-trained Language Models for African Languages Social Media Text

[Quick read]: This paper addresses the performance bottleneck of low-resource African languages in natural language processing (NLP) tasks. The key solution is to analyze two continual pretraining approaches: Domain Adaptive Pretraining (DAPT) and Task Adaptive Pretraining (TAPT). DAPT uses AfriSocial, a corpus that has passed quality pre-processing, as domain-specific pretraining data and improves macro F1 on fine-grained emotion classification for 16 target African languages by 1% to 28.27%. TAPT further fine-tunes the model on a small amount of unlabeled data from a similar task and improves the base model's F1 score on fine-grained emotion classification by 0.55% to 15.11%. Combining DAPT and TAPT also outperforms the base models. The main contribution is a strategy that can carry over to other low-resource NLP tasks and similar domains, such as hate speech and sentiment tasks.

Link: https://arxiv.org/abs/2503.18247
Authors: Tadesse Destaw Belay, Israel Abebe Azime, Ibrahim Said Ahmad, Idris Abdulmumin, Abinew Ali Ayele, Shamsuddeen Hassan Muhammad, Seid Muhie Yimam
Affiliations: Instituto Politécnico Nacional; Wollo University; Saarland University; Northeastern University; Bayero University Kano; University of Pretoria; Bahir Dar University; University of Hamburg; Imperial College London
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Pretrained Language Models (PLMs) built from various sources are the foundation of today’s NLP progress. Language representations learned by such models achieve strong performance across many tasks with datasets of varying sizes drawn from various sources. We present a thorough analysis of domain and task adaptive continual pretraining approaches for low-resource African languages and show promising results on the evaluated tasks. We create AfriSocial, a corpus designed for domain adaptive finetuning that passes through quality pre-processing steps. Continual pretraining of PLMs using AfriSocial as domain adaptive pretraining (DAPT) data consistently improves performance on the fine-grained emotion classification task for 16 targeted languages, by 1% to 28.27% macro F1 score. Likewise, using the task adaptive pretraining (TAPT) approach, further finetuning with small unlabeled but similar task data shows promising results. For example, unlabeled sentiment data (source) for the fine-grained emotion classification task (target) improves the base model results by an F1 score ranging from 0.55% to 15.11%. Combining the two methods, DAPT + TAPT, also achieves better results than the base models. All the resources will be made available to improve low-resource NLP tasks in general, as well as other similar domain tasks such as hate speech and sentiment tasks.
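
The gains above are reported in macro F1, which averages per-class F1 so that rare emotion classes count as much as frequent ones. A minimal sketch in plain Python follows; the labels are invented for illustration and are not taken from the paper, and the helper name `macro_f1` is an assumption.

```python
from collections import Counter


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over every class seen in y_true or y_pred."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # class p was predicted, but wrongly
            fn[t] += 1  # true class t was missed
    classes = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in classes:
        denom = 2 * tp[c] + fp[c] + fn[c]
        # F1 = 2TP / (2TP + FP + FN); taken as 0 when the denominator is 0
        scores.append(2 * tp[c] / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical toy labels, for illustration only
gold = ["joy", "joy", "anger", "fear", "fear", "fear"]
pred = ["joy", "anger", "anger", "fear", "fear", "joy"]
score = macro_f1(gold, pred)
```

Because every class gets equal weight, a model that ignores a rare class is penalized more heavily than it would be under plain accuracy.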

[NLP-50] ShED-HD: A Shannon Entropy Distribution Framework for Lightweight Hallucination Detection on Edge Devices

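Quick read: This paper addresses the serious challenge posed in high-stakes domains by hallucinations from Large Language Models (LLMs), that is, plausible-sounding but factually incorrect content. Existing hallucination detectors either incur the computational cost of multiple inference passes or trade accuracy for efficiency with a single pass, and neither is ideal in resource-constrained environments such as edge devices. The paper proposes the Shannon Entropy Distribution Hallucination Detector (ShED-HD), whose key idea is to classify the pattern of Shannon entropy across the whole output sequence using a lightweight bidirectional LSTM (BiLSTM) with single-headed attention, so that distinctive uncertainty patterns are detected efficiently. Unlike prior approaches, ShED-HD preserves contextual awareness while keeping detection low-cost and accurate: it significantly outperforms other computationally efficient methods in the out-of-distribution setting and performs comparably in the in-distribution setting. This improves the credibility of LLM-generated content, particularly in resource-constrained environments that need trustworthy AI functionality.

Link: https://arxiv.org/abs/2503.18242
Authors: Aneesh Vathul, Daniel Lee, Sheryl Chen, Arthi Tasmia
Affiliations: Stanford University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: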

Abstract:Large Language Models (LLMs) have demonstrated impressive capabilities on a broad array of NLP tasks, but their tendency to produce hallucinations – plausible-sounding but factually incorrect content – poses severe challenges in high-stakes domains. Existing hallucination detection methods either bear the computational cost of multiple inference passes or sacrifice accuracy for efficiency with single-pass approaches, neither of which is ideal in resource-constrained environments such as edge devices. We propose the Shannon Entropy Distribution Hallucination Detector (ShED-HD), a novel hallucination detection framework that bridges this gap by classifying sequence-level entropy patterns using a lightweight BiLSTM architecture with single-headed attention. In contrast to prior approaches, ShED-HD efficiently detects distinctive uncertainty patterns across entire output sequences, preserving contextual awareness. Through in-depth evaluation on three datasets (BioASQ, TriviaQA, and Jeopardy Questions), we show that ShED-HD significantly outperforms other computationally efficient approaches in the out-of-distribution setting, while achieving comparable performance in the in-distribution setting. ShED-HD facilitates hallucination detection that is low-cost, accurate, and generalizable, improving the credibility of content generated by LLMs in resource-constrained environments where trustworthy AI functionality is crucial.
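
The feature that ShED-HD classifies is a sequence of per-token entropies. The sketch below shows only how such a sequence could be computed from next-token probability distributions; the BiLSTM classifier is omitted. The log base, the use of full distributions, and the function names are assumptions, since the abstract does not specify them.

```python
import math


def shannon_entropy(probs):
    """Entropy in bits of one next-token probability distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0.0)


def entropy_sequence(step_distributions):
    """One entropy value per generated token; a sequence classifier would consume this list."""
    return [shannon_entropy(dist) for dist in step_distributions]


# Hypothetical next-token distributions over a 4-word vocabulary
steps = [
    [1.0, 0.0, 0.0, 0.0],      # fully confident step
    [0.5, 0.5, 0.0, 0.0],      # two equally likely tokens
    [0.25, 0.25, 0.25, 0.25],  # maximally uncertain step
]
entropies = entropy_sequence(steps)
```

Low values mark confident steps and high values mark uncertain ones, so the shape of the whole list, rather than a single average, is what carries the signal described in the abstract.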

[NLP-51] Mapping Hymns and Organizing Concepts in the Rigveda: Quantitatively Connecting the Vedic Suktas NAACL2025

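Quick read: This paper takes on the non-trivial challenge of using natural language processing (NLP) techniques to uncover the topics of the hymns (suktas) of the Rigveda and the semantic connections between them. The Rigveda is hard for modern readers to approach because of its extremely ancient Sanskrit language, poetic structure, and large volume of text. The key contribution is a novel adaptation of latent semantic analysis (LSA), combined with UMAP dimension reduction, k-nearest-neighbour network construction, and Leiden community detection to identify sukta topic networks. This LSA-based approach detected statistically significant topic networks (z = 2.726, p < 0.01) with a high modularity score (0.944), and it succeeded on all seven well-known sukta groupings (e.g., creation, funeral, water), outperforming SBERT and Doc2Vec.

Link: https://arxiv.org/abs/2503.18226
Authors: Venkatesh Bollineni, Igor Crk, Eren Gultepe
Affiliations: Dept. of Computer Science, Southern Illinois University Edwardsville
Categories: Computation and Language (cs.CL)
Comments: Accepted to NLP4DH 2025 at NAACL 2025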

Abstract:Accessing and gaining insight into the Rigveda poses a non-trivial challenge due to its extremely ancient Sanskrit language, poetic structure, and large volume of text. By using NLP techniques, this study identified topics and semantic connections of hymns within the Rigveda that were corroborated by seven well-known groupings of hymns. The 1,028 suktas (hymns) from the modern English translation of the Rigveda by Jamison and Brereton were preprocessed, and sukta-level embeddings were obtained using (i) a novel adaptation of LSA, presented herein, (ii) SBERT, and (iii) Doc2Vec embeddings. Following a UMAP dimension reduction of the vectors, the network of suktas was formed using k-nearest neighbours. Then, community detection of topics in the sukta networks was performed with the Louvain, Leiden, and label propagation methods, and the statistical significance of the resulting topics was determined using an appropriate null distribution. Only the novel adaptation of LSA with the Leiden method detected sukta topic networks that were significant (z = 2.726, p < .01), with a modularity score of 0.944. Of the seven famous sukta groupings analyzed (e.g., creation, funeral, water, etc.), the LSA-derived network was successful in all seven cases, while Doc2Vec was not significant and failed to detect the relevant suktas. SBERT detected four of the famous suktas as separate groups, but mistakenly combined three of them into a single mixed group. Also, the SBERT network was not statistically significant.
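
A z value like the one quoted above comes from standardizing an observed statistic against a null distribution. The abstract does not say which statistic or null model the authors used, so the sketch below only illustrates the general calculation; the modularity values and the name `z_score` are hypothetical.

```python
import statistics


def z_score(observed, null_samples):
    """Standardize an observed statistic against samples drawn from a null distribution."""
    mean = statistics.fmean(null_samples)
    std = statistics.stdev(null_samples)  # sample standard deviation
    return (observed - mean) / std


# Hypothetical modularity scores from randomized networks (not from the paper)
null_modularity = [0.90, 0.91, 0.92, 0.93, 0.94]
z = z_score(0.96, null_modularity)
```

A large positive z means the observed community structure is stronger than what the randomized networks typically produce.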

[NLP-52] Decoupling Angles and Strength in Low-rank Adaptation ICLR2025

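Quick read: This paper addresses the limited robustness of existing Parameter-Efficient FineTuning (PEFT) methods to hyperparameter choices and extended training regimes, while also overcoming the restricted expressive power of earlier bounded approaches. Its key contribution is Decoupled Low-rank Adaptation (DeLoRA), a finetuning method that normalizes and scales learnable low-rank matrices and, by bounding the distance of the transformation, decouples angular learning from adaptation strength, which improves robustness without sacrificing performance. Evaluations on subject-driven image generation, natural language understanding, and instruction tuning show that DeLoRA matches or surpasses competing PEFT methods while being more robust.

Link: https://arxiv.org/abs/2503.18225
Authors: Massimo Bini, Leander Girrbach, Zeynep Akata
Affiliations: University of Tübingen; Helmholtz Munich; Technical University of Munich
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: ICLR 2025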

Abstract:Parameter-Efficient FineTuning (PEFT) methods have recently gained significant popularity thanks to the widespread availability of large-scale pretrained models. These methods allow for quick adaptation to downstream tasks with minimal computational cost. However, popular finetuning methods such as LoRA exhibit limited robustness when it comes to hyperparameter choices or extended training regimes, preventing optimal out-of-the-box performance. In contrast, bounded approaches, such as ETHER, provide greater robustness but are limited to extremely low-rank adaptations and fixed-strength transformations, reducing their adaptation expressive power. In this work, we propose Decoupled Low-rank Adaptation (DeLoRA), a novel finetuning method that normalizes and scales learnable low-rank matrices. By bounding the distance of the transformation, DeLoRA effectively decouples the angular learning from the adaptation strength, enhancing robustness without compromising performance. Through evaluations on subject-driven image generation, natural language understanding, and instruction tuning, we show that DeLoRA matches or surpasses performance of competing PEFT methods, while exhibiting stronger robustness. Code is available at this https URL.
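
One way to picture "normalizing and scaling low-rank matrices" so that direction and strength are decoupled is to normalize each rank-one component and multiply the sum by a single strength scalar. The sketch below is an illustrative assumption based only on the abstract, not the paper's exact formulation; the function names and the `strength` parameter are hypothetical.

```python
import math


def bounded_low_rank_update(B, A, strength):
    """Return dW = (strength / r) * sum_i b_i a_i^T / (|b_i| * |a_i|).

    B is a list of r vectors of length d_out; A is a list of r vectors of length d_in.
    """
    r = len(B)
    d_out, d_in = len(B[0]), len(A[0])
    dW = [[0.0] * d_in for _ in range(d_out)]
    for b, a in zip(B, A):
        norm = math.sqrt(sum(x * x for x in b)) * math.sqrt(sum(x * x for x in a))
        if norm == 0.0:
            continue  # a zero component contributes nothing
        scale = strength / (r * norm)
        for i in range(d_out):
            for j in range(d_in):
                dW[i][j] += scale * b[i] * a[j]
    return dW


def frobenius_norm(M):
    """Square root of the sum of squared entries."""
    return math.sqrt(sum(x * x for row in M for x in row))


# Hypothetical rank-2 factors for a 3x2 weight update
B = [[1.0, 2.0, 2.0], [0.0, 3.0, 4.0]]
A = [[3.0, 4.0], [10.0, 0.0]]
dW_small = bounded_low_rank_update(B, A, strength=0.5)
# Scaling the factors by 100 changes nothing: only their directions matter
B_big = [[100.0 * x for x in b] for b in B]
dW_big = bounded_low_rank_update(B_big, A, strength=0.5)
```

In this sketch each normalized component has Frobenius norm 1, so the update's norm can never exceed `strength`, however large the raw factors grow during training.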

[NLP-53] LakotaBERT: A Transformer-based Model for Low Resource Lakota Language

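Quick read: This paper responds to the endangerment of the Lakota language, driven by declining fluency among younger generations, by building a large language model (LLM) to support its revitalization. The core of the solution is a corpus of 105K sentences in Lakota, English, and parallel texts, together with LakotaBERT, a customized model based on the RoBERTa architecture. The key lies in combining AI and linguistic methodologies, pre-training the model on this corpus, and assessing it with several metrics (such as precision and F1 score), with the aim of setting a precedent for protecting other endangered indigenous languages.

Link: https://arxiv.org/abs/2503.18212
Authors: Kanishka Parankusham, Rodrigue Rizk, KC Santosh
Affiliations: AI Research Lab, Department of Computer Science, University of South Dakota
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: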

Abstract:Lakota, a critically endangered language of the Sioux people in North America, faces significant challenges due to declining fluency among younger generations. This paper introduces LakotaBERT, the first large language model (LLM) tailored for Lakota, aiming to support language revitalization efforts. Our research has two primary objectives: (1) to create a comprehensive Lakota language corpus and (2) to develop a customized LLM for Lakota. We compiled a diverse corpus of 105K sentences in Lakota, English, and parallel texts from various sources, such as books and websites, emphasizing the cultural significance and historical context of the Lakota language. Utilizing the RoBERTa architecture, we pre-trained our model and conducted comparative evaluations against established models such as RoBERTa, BERT, and multilingual BERT. Initial results demonstrate a masked language modeling accuracy of 51% with a single ground truth assumption, showcasing performance comparable to that of English-based models. We also evaluated the model using additional metrics, such as precision and F1 score, to provide a comprehensive assessment of its capabilities. By integrating AI and linguistic methodologies, we aspire to enhance linguistic diversity and cultural resilience, setting a valuable precedent for leveraging technology in the revitalization of other endangered indigenous languages.
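
"Masked language modeling accuracy with a single ground truth assumption" can be read as counting a prediction as correct only when the model's top-1 token equals the one original token, even if other words would also fit. A minimal sketch under that reading, with placeholder tokens rather than real model output:

```python
def masked_lm_accuracy(predictions, references):
    """Fraction of masked positions whose top-1 prediction equals the single reference token."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must align")
    if not references:
        return 0.0
    hits = sum(p == r for p, r in zip(predictions, references))
    return hits / len(references)


# Placeholder tokens; not real model output
top1 = ["w1", "w7", "w3", "w9"]
gold = ["w1", "w2", "w3", "w4"]
acc = masked_lm_accuracy(top1, gold)
```

Under this strict criterion a valid synonym still counts as an error, which is one reason such accuracy figures tend to look modest.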

[NLP-54] Exploring Topic Trends in COVID-19 Research Literature using Non-Negative Matrix Factorization

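Quick read: This paper applies topic modeling with Non-Negative Matrix Factorization (NMF) to the COVID-19 Open Research Dataset (CORD-19) to uncover the latent thematic structure of the COVID-19 research literature and how it evolves. The method starts with a series of rigorous pre-processing steps that standardize the text while preserving the context of phrases, followed by feature extraction with term frequency-inverse document frequency (tf-idf); NMF then factorizes the document-term matrix into two non-negative matrices that represent the topics and their distribution across documents. To ensure robustness, the authors run a stability analysis that scores NMF models with different numbers of topics in order to choose the optimal number. Finally, the analysis tracks how topics in CORD-19 evolve over time. The core of the approach is the combination of careful data processing and quantitative evaluation to reveal the knowledge structure of COVID-19 research.

Link: https://arxiv.org/abs/2503.18182
Authors: Divya Patel, Vansh Parikh, Om Patel, Agam Shah, Bhaskar Chaudhury
Affiliations: Group in Computational Science and HPC, Dhirubhai Ambani Institute of Information and Communication Technology (DAIICT), Gandhinagar, India; School of Computational Science & Engineering, College of Computing, Georgia Institute of Technology (Georgia Tech), Atlanta, GA, USA
Categories: Computation and Language (cs.CL)
Comments: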

Abstract:In this work, we apply topic modeling using Non-Negative Matrix Factorization (NMF) on the COVID-19 Open Research Dataset (CORD-19) to uncover the underlying thematic structure and its evolution within the extensive body of COVID-19 research literature. NMF factorizes the document-term matrix into two non-negative matrices, effectively representing the topics and their distribution across the documents. This helps us see how strongly documents relate to topics and how topics relate to words. We describe the complete methodology which involves a series of rigorous pre-processing steps to standardize the available text data while preserving the context of phrases, and subsequently feature extraction using the term frequency-inverse document frequency (tf-idf), which assigns weights to words based on their frequency and rarity in the dataset. To ensure the robustness of our topic model, we conduct a stability analysis. This process assesses the stability scores of the NMF topic model for different numbers of topics, enabling us to select the optimal number of topics for our analysis. Through our analysis, we track the evolution of topics over time within the CORD-19 dataset. Our findings contribute to the understanding of the knowledge structure of the COVID-19 research landscape, providing a valuable resource for future research in this field.
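
The tf-idf plus NMF step described above can be sketched in plain Python. This is a toy version under stated assumptions: the corpus is invented, the smoothed idf formula and the multiplicative update rule are common choices rather than ones confirmed by the abstract, and all function names are hypothetical.

```python
import math
import random


def tfidf(docs):
    """Return (vocab, matrix) with one row per document and tf-idf weighted columns."""
    vocab = sorted({w for d in docs for w in d})
    n = len(docs)
    df = {w: sum(w in d for d in docs) for w in vocab}
    # Smoothed idf keeps every weight non-negative, which NMF requires
    idf = {w: math.log((1 + n) / (1 + df[w])) + 1.0 for w in vocab}
    return vocab, [[d.count(w) * idf[w] for w in vocab] for d in docs]


def matmul(X, Y):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*Y)] for row in X]


def transpose(X):
    return [list(col) for col in zip(*X)]


def frob_error(V, W, H):
    """Frobenius norm of V - WH."""
    WH = matmul(W, H)
    return math.sqrt(sum((v - x) ** 2 for rv, rx in zip(V, WH) for v, x in zip(rv, rx)))


def nmf(V, k, iters=200, seed=0, eps=1e-9):
    """Factorize V (docs x terms) as W (docs x k) times H (k x terms)."""
    rng = random.Random(seed)
    W = [[rng.random() + 0.1 for _ in range(k)] for _ in V]
    H = [[rng.random() + 0.1 for _ in V[0]] for _ in range(k)]
    start = frob_error(V, W, H)
    for _ in range(iters):
        # H <- H * (W^T V) / (W^T W H)
        Wt = transpose(W)
        num, den = matmul(Wt, V), matmul(matmul(Wt, W), H)
        H = [[h * a / (b + eps) for h, a, b in zip(rh, ra, rb)] for rh, ra, rb in zip(H, num, den)]
        # W <- W * (V H^T) / (W H H^T)
        Ht = transpose(H)
        num, den = matmul(V, Ht), matmul(W, matmul(H, Ht))
        W = [[w * a / (b + eps) for w, a, b in zip(rw, ra, rb)] for rw, ra, rb in zip(W, num, den)]
    return W, H, start, frob_error(V, W, H)


# Toy tokenized corpus, invented for illustration
docs = [
    "vaccine trial dose vaccine".split(),
    "vaccine dose immune response".split(),
    "lockdown school policy lockdown".split(),
    "policy school mask lockdown".split(),
]
vocab, V = tfidf(docs)
W, H, err_start, err_end = nmf(V, k=2)
```

Each row of `W` says how strongly a document relates to each topic, and each row of `H` says how strongly a topic relates to each word, which is the interpretation given in the abstract.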

[NLP-55] GINGER: Grounded Information Nugget-Based Generation of Responses

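Quick read: This paper addresses the challenges that Retrieval-augmented generation (RAG) faces in factual correctness, source attribution, and response completeness. It proposes a modular pipeline for grounded response generation that operates on information nuggets, defined as minimal, atomic units of relevant information extracted from retrieved documents. The multistage pipeline comprises nugget detection, clustering, ranking, top cluster summarization, and fluency enhancement, so that responses are grounded in specific facts, are easy to attribute to sources, and include as much information as possible within length constraints. Extensive experiments on the TREC RAG’24 dataset, evaluated with the AutoNuggetizer framework, show that the method achieves state-of-the-art performance on this benchmark.

Link: https://arxiv.org/abs/2503.18174
Authors: Weronika Łajewska, Krisztian Balog
Affiliations: University of Stavanger
Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments: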

Abstract:Retrieval-augmented generation (RAG) faces challenges related to factual correctness, source attribution, and response completeness. To address them, we propose a modular pipeline for grounded response generation that operates on information nuggets: minimal, atomic units of relevant information extracted from retrieved documents. The multistage pipeline encompasses nugget detection, clustering, ranking, top cluster summarization, and fluency enhancement. It guarantees grounding in specific facts, facilitates source attribution, and ensures maximum information inclusion within length constraints. Extensive experiments on the TREC RAG’24 dataset evaluated with the AutoNuggetizer framework demonstrate that GINGER achieves state-of-the-art performance on this benchmark.
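
The idea of including as much information as possible within a length constraint can be pictured as choosing ranked nugget clusters under a word budget. The sketch below is a toy illustration only: the greedy rule, the dictionary fields, and the name `select_top_clusters` are assumptions, and the detection, clustering, and summarization stages (which would need retrieval and language models) are not shown.

```python
def select_top_clusters(clusters, max_words):
    """Greedily keep the highest-scoring clusters whose summaries fit the word budget."""
    chosen, used = [], 0
    for cluster in sorted(clusters, key=lambda c: c["score"], reverse=True):
        length = len(cluster["summary"].split())
        if used + length <= max_words:
            chosen.append(cluster)
            used += length
    return chosen


# Hypothetical ranked clusters; each keeps its source documents for attribution
clusters = [
    {"summary": "Nugget cluster about cause", "score": 0.9, "sources": ["doc1", "doc3"]},
    {"summary": "A long cluster summary that will not fit the remaining budget",
     "score": 0.7, "sources": ["doc2"]},
    {"summary": "Short cluster about treatment", "score": 0.5, "sources": ["doc4"]},
]
picked = select_top_clusters(clusters, max_words=10)
```

Keeping the `sources` list attached to each cluster is what would make attribution straightforward once the selected summaries are merged into a response.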

[NLP-56] Unmasking Deceptive Visuals: Benchmarking Multimodal Large Language Models on Misleading Chart Question Answering

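Quick read: This paper addresses the detection and interpretation of misleading chart visualizations, which manipulate data representations to support specific claims and can distort perceptions and lead to incorrect conclusions. Despite decades of research, the problem remains widespread and pressing. Multimodal Large Language Models (MLLMs) have shown strong chart comprehension, yet no prior work has systematically evaluated their ability to detect and interpret misleading charts. The paper therefore introduces the Misleading Chart Question Answering (Misleading ChartQA) Benchmark, a large-scale multimodal dataset for assessing how well MLLMs identify and reason about misleading charts. It contains over 3,000 curated examples covering 21 types of misleaders and 10 chart types; each example provides standardized chart code, CSV data, and multiple-choice questions with labeled explanations, validated through multi-round MLLM checks and expert human review. Benchmarking 16 state-of-the-art MLLMs reveals their limitations in identifying visually deceptive practices. The paper also proposes a pipeline that detects and localizes misleaders to improve MLLM accuracy in interpreting misleading charts. The work lays a foundation for MLLM-driven misleading chart comprehension, and a sample of the dataset is publicly released to support further research.

Link: https://arxiv.org/abs/2503.18172
Authors: Zixin Chen, Sicheng Song, Kashun Shum, Yanna Lin, Rui Sheng, Huamin Qu
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 31 pages in total. Under Review For ARR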

Abstract:Misleading chart visualizations, which intentionally manipulate data representations to support specific claims, can distort perceptions and lead to incorrect conclusions. Despite decades of research, misleading visualizations remain a widespread and pressing issue. Recent advances in multimodal large language models (MLLMs) have demonstrated strong chart comprehension capabilities, yet no existing work has systematically evaluated their ability to detect and interpret misleading charts. This paper introduces the Misleading Chart Question Answering (Misleading ChartQA) Benchmark, a large-scale multimodal dataset designed to assess MLLMs in identifying and reasoning about misleading charts. It contains over 3,000 curated examples, covering 21 types of misleaders and 10 chart types. Each example includes standardized chart code, CSV data, and multiple-choice questions with labeled explanations, validated through multi-round MLLM checks and exhaustive expert human review. We benchmark 16 state-of-the-art MLLMs on our dataset, revealing their limitations in identifying visually deceptive practices. We also propose a novel pipeline that detects and localizes misleaders, enhancing MLLMs’ accuracy in misleading chart interpretation. Our work establishes a foundation for advancing MLLM-driven misleading chart comprehension. We publicly release the sample dataset to support further research in this critical area.

[NLP-57] Evaluating Negative Sampling Approaches for Neural Topic Models

Quick read: This paper studies the effect of negative sampling in an unsupervised setting, topic modeling, where it has not been well explored, with the goal of improving the topic representations produced by neural topic models. The key idea is to incorporate different negative sampling strategies into the decoder of variational autoencoder (VAE) based neural topic models. By comparing positive samples against negative ones, negative sampling helps the model learn better representations, which improves topic coherence, topic diversity, and document classification accuracy. Experiments on four publicly available datasets show significant gains, and manual evaluation further indicates higher-quality topics.

Link: https://arxiv.org/abs/2503.18167
Authors: Suman Adhya, Avishek Lahiri, Debarshi Kumar Sanyal, Partha Pratim Das
Affiliations: School of Mathematical and Computational Sciences, Indian Association for the Cultivation of Science; Department of Computer Science, Ashoka University; Department of Computer Science and Engineering, Indian Institute of Technology Kharagpur
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Code is available at: this https URL

Abstract:Negative sampling has emerged as an effective technique that enables deep learning models to learn better representations by introducing the paradigm of learn-to-compare. The goal of this approach is to add robustness to deep learning models to learn better representation by comparing the positive samples against the negative ones. Despite its numerous demonstrations in various areas of computer vision and natural language processing, a comprehensive study of the effect of negative sampling in an unsupervised domain like topic modeling has not been well explored. In this paper, we present a comprehensive analysis of the impact of different negative sampling strategies on neural topic models. We compare the performance of several popular neural topic models by incorporating a negative sampling technique in the decoder of variational autoencoder-based neural topic models. Experiments on four publicly available datasets demonstrate that integrating negative sampling into topic models results in significant enhancements across multiple aspects, including improved topic coherence, richer topic diversity, and more accurate document classification. Manual evaluations also indicate that the inclusion of negative sampling into neural topic models enhances the quality of the generated topics. These findings highlight the potential of negative sampling as a valuable tool for advancing the effectiveness of neural topic models.

[NLP-58] MathAgent: Leveraging a Mixture-of-Math-Agent Framework for Real-World Multimodal Mathematical Error Detection

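Quick read: This paper tackles mathematical error detection in educational settings with Multimodal Large Language Models (MLLMs), a task that demands both an understanding of visual and textual mathematical content and complex reasoning. Although MLLMs are effective at mathematical problem-solving, they often struggle to identify and categorize subtle student errors in multimodal mathematical contexts. To address this, the paper proposes MathAgent, a Mixture-of-Math-Agent framework. Its key innovation is to decompose error detection into three phases, each handled by a specialized agent: an image-text consistency validator, a visual semantic interpreter, and an integrative error analyzer. By explicitly modeling the relationships between multimodal problems and student solution steps, this architecture processes mathematical content more accurately. On real-world educational data, MathAgent achieves approximately 5% higher accuracy in error step identification and a 3% improvement in error categorization over baseline models. It has also been deployed on an educational platform that has served over one million K-12 students, reaching nearly 90% student satisfaction while generating significant cost savings by reducing manual error detection.

Link: https://arxiv.org/abs/2503.18132
Authors: Yibo Yan, Shen Wang, Jiahao Huo, Philip S. Yu, Xuming Hu, Qingsong Wen
Affiliations: Squirrel Ai Learning; HKUST (GZ); HKUST; University of Illinois at Chicago
Categories: Computation and Language (cs.CL)
Comments: Work In Progress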

Abstract:Mathematical error detection in educational settings presents a significant challenge for Multimodal Large Language Models (MLLMs), requiring a sophisticated understanding of both visual and textual mathematical content along with complex reasoning capabilities. Though effective in mathematical problem-solving, MLLMs often struggle with the nuanced task of identifying and categorizing student errors in multimodal mathematical contexts. Therefore, we introduce MathAgent, a novel Mixture-of-Math-Agent framework designed specifically to address these challenges. Our approach decomposes error detection into three phases, each handled by a specialized agent: an image-text consistency validator, a visual semantic interpreter, and an integrative error analyzer. This architecture enables more accurate processing of mathematical content by explicitly modeling relationships between multimodal problems and student solution steps. We evaluate MathAgent on real-world educational data, demonstrating approximately 5% higher accuracy in error step identification and 3% improvement in error categorization compared to baseline models. Besides, MathAgent has been successfully deployed in an educational platform that has served over one million K-12 students, achieving nearly 90% student satisfaction while generating significant cost savings by reducing manual error detection.

[NLP-59] GeoBenchX: Benchmarking LLMs for Multistep Geospatial Tasks

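Quick read: This paper establishes a benchmark for evaluating large language models (LLMs) on multi-step geospatial tasks relevant to commercial GIS practitioners. The key components are a simple tool-calling agent equipped with 23 geospatial functions and a benchmark set spanning four task categories of increasing complexity, with both solvable and intentionally unsolvable tasks to test hallucination rejection. An LLM-as-Judge evaluation framework compares agent solutions against reference implementations. The benchmark exposes performance differences between models on geospatial tasks and their common error types, and the resources are released as open source to support standardized evaluation of LLMs for GeoAI.

Link: https://arxiv.org/abs/2503.18129
Authors: Varvara Krechetova, Denis Kochedykov
Affiliations: World Bank; J.P.Morgan Chase
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Github with code and benchmark set: this https URL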

Abstract:In this paper, we establish a benchmark for evaluating large language models (LLMs) on multi-step geospatial tasks relevant to commercial GIS practitioners. We assess seven leading commercial LLMs (Sonnet 3.5 and 3.7, Haiku 3.5, Gemini 2.0, GPT-4o, GPT-4o mini, and o3-mini) using a simple tool-calling agent equipped with 23 geospatial functions. Our benchmark comprises tasks across four categories of increasing complexity, with both solvable and intentionally unsolvable tasks to test hallucination rejection. We develop an LLM-as-Judge evaluation framework to compare agent solutions against reference implementations. Results show Sonnet 3.5 and GPT-4o achieve the best overall performance, with Claude models excelling on solvable tasks while OpenAI models better identify unsolvable scenarios. We observe significant differences in token usage, with Anthropic models consuming substantially more tokens than competitors. Common errors include misunderstanding geometrical relationships, relying on outdated knowledge, and inefficient data manipulation. The resulting benchmark set, evaluation framework, and data generation pipeline are released as open-source resources, providing one more standardized method for ongoing evaluation of LLMs for GeoAI.

[NLP-60] Detection of Somali-written Fake News and Toxic Messages on the Social Media Using Transformer-based Language Models

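Quick read: This paper addresses the limitations that low-resource languages such as Somali face in AI automation, including scarce annotated training datasets and a lack of language models tailored to their linguistic characteristics. The key solution is SomBERTa, a transformer-based monolingual Somali language model that the authors describe as the first of its kind to the best of their knowledge, together with two human-annotated, social-media-sourced Somali datasets for fake news and toxicity classification. SomBERTa is fine-tuned and evaluated on toxic content, fake news, and news topic classification datasets. In comparison with related multilingual models (e.g., AfriBERTa and AfroXLMR), SomBERTa consistently performs better on both fake news and toxic content classification and achieves the best average accuracy (87.99%) across all tasks. The work contributes a foundational language model for Somali NLP and a replicable framework for other low-resource languages, promoting digital and AI inclusivity and linguistic diversity.

Link: https://arxiv.org/abs/2503.18117
Authors: Muhidin A. Mohamed, Shuab D. Ahmed, Yahye A. Isse, Hanad M. Mohamed, Fuad M. Hassan, Houssein A. Assowe
Affiliations: Aston University, United Kingdom; Jamhuriya University, Mogadishu, Somalia; Somali National University, Mogadishu, Somalia; University of Djibouti, Balbala, Djibouti
Categories: Computation and Language (cs.CL)
Comments: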

Abstract:The fact that everyone with a social media account can create and share content, and the increasing public reliance on social media platforms as a news and information source bring about significant challenges such as misinformation, fake news, harmful content, etc. Although human content moderation may be useful to an extent and used by these platforms to flag posted materials, the use of AI models provides a more sustainable, scalable, and effective way to mitigate these harmful contents. However, low-resourced languages such as the Somali language face limitations in AI automation, including scarce annotated training datasets and lack of language models tailored to their unique linguistic characteristics. This paper presents part of our ongoing research work to bridge some of these gaps for the Somali language. In particular, we created two human-annotated social-media-sourced Somali datasets for two downstream applications, fake news and toxicity classification, and developed a transformer-based monolingual Somali language model (named SomBERTa) – the first of its kind to the best of our knowledge. SomBERTa is then fine-tuned and evaluated on toxic content, fake news and news topic classification datasets. Comparative evaluation analysis of the proposed model against related multilingual models (e.g., AfriBERTa, AfroXLMR, etc) demonstrated that SomBERTa consistently outperformed these comparators in both fake news and toxic content classification tasks while achieving the best average accuracy (87.99%) across all tasks. This research contributes to Somali NLP by offering a foundational language model and a replicable framework for other low-resource languages, promoting digital and AI inclusivity and linguistic diversity.

[NLP-61] AgentRxiv: Towards Collaborative Autonomous Research

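Quick read: This paper addresses the fact that existing autonomous agent workflows for scientific discovery operate in isolation and cannot continuously improve upon prior research. The key to the solution is AgentRxiv, a framework that lets LLM agent laboratories upload reports to and retrieve them from a shared preprint server in order to collaborate, share insights, and iteratively build on each other's research. Experiments show that agents with access to their own prior research achieve an 11.4% relative improvement over baseline on MATH-500, a larger gain than agents working in isolation, and that the best-performing strategy generalizes to benchmarks in other domains with an average improvement of 3.3%. Multiple agent laboratories collaborating through AgentRxiv reach higher overall accuracy, a 13.7% relative improvement over baseline on MATH-500. These results suggest that autonomous agents may play a role in designing future AI systems alongside humans.

Link: https://arxiv.org/abs/2503.18102
Authors: Samuel Schmidgall, Michael Moor
Affiliations: Department of Electrical & Computer Engineering, Johns Hopkins University; Department of Biosystems Science & Engineering, ETH Zurich
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: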

Abstract:Progress in scientific discovery is rarely the result of a single “Eureka” moment, but is rather the product of hundreds of scientists incrementally working together toward a common goal. While existing agent workflows are capable of producing research autonomously, they do so in isolation, without the ability to continuously improve upon prior research results. To address these challenges, we introduce AgentRxiv, a framework that lets LLM agent laboratories upload and retrieve reports from a shared preprint server in order to collaborate, share insights, and iteratively build on each other’s research. We task agent laboratories to develop new reasoning and prompting techniques and find that agents with access to their prior research achieve higher performance improvements compared to agents operating in isolation (11.4% relative improvement over baseline on MATH-500). We find that the best performing strategy generalizes to benchmarks in other domains (improving on average by 3.3%). Multiple agent laboratories sharing research through AgentRxiv are able to work together towards a common goal, progressing more rapidly than isolated laboratories, achieving higher overall accuracy (13.7% relative improvement over baseline on MATH-500). These findings suggest that autonomous agents may play a role in designing future AI systems alongside humans. We hope that AgentRxiv allows agents to collaborate toward research goals and enables researchers to accelerate discovery.

[NLP-62] Clarifying Misconceptions in COVID-19 Vaccine Sentiment and Stance Analysis and Their Implications for Vaccine Hesitancy Mitigation: A Systematic Review

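Quick read: This paper concerns research that uses Natural Language Processing (NLP) to study the persistence of COVID-19 vaccine hesitancy in social media discourse. Its key contribution is a systematic review of studies that apply supervised machine learning to stance detection or sentiment analysis, assessing whether these studies measure vaccine hesitancy trends with bias and how reporting standards for NLP methods can be improved so that findings are more generalisable and interpretable.

Link: https://arxiv.org/abs/2503.18095
Authors: Lorena G Barberia, Belinda Lombard, Norton Trevisan Roman, Tatiane C. M. Sousa
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Computers and Society (cs.CY); Machine Learning (cs.LG)
Comments: 14 pages, 3 figures, 4 tables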

Abstract:Background: Advances in machine learning (ML) models have increased the capability of researchers to detect vaccine hesitancy in social media using Natural Language Processing (NLP). A considerable volume of research has identified the persistence of COVID-19 vaccine hesitancy in discourse shared on various social media platforms. Methods: Our objective in this study was to conduct a systematic review of research employing sentiment analysis or stance detection to study discourse towards COVID-19 vaccines and vaccination spread on Twitter (officially known as X since 2023). Following registration in the PROSPERO international registry of systematic reviews, we searched papers published from 1 January 2020 to 31 December 2023 that used supervised machine learning to assess COVID-19 vaccine hesitancy through stance detection or sentiment analysis on Twitter. We categorized the studies according to a taxonomy of five dimensions: tweet sample selection approach, self-reported study type, classification typology, annotation codebook definitions, and interpretation of results. We analyzed if studies using stance detection report different hesitancy trends than those using sentiment analysis by examining how COVID-19 vaccine hesitancy is measured, and whether efforts were made to avoid measurement bias. Results: Our review found that measurement bias is widely prevalent in studies employing supervised machine learning to analyze sentiment and stance toward COVID-19 vaccines and vaccination. The reporting errors are sufficiently serious that they hinder the generalisability and interpretation of these studies to understanding whether individual opinions communicate reluctance to vaccinate against SARS-CoV-2. Conclusion: Improving the reporting of NLP methods is crucial to addressing knowledge gaps in vaccine hesitancy discourse.

[NLP-63] D2LoRA: Data-Driven LoRA Initialization for Low Resource Tasks

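Quick read: This paper addresses fine-tuning large language models (LLMs) when data is scarce, a setting made harder by the fact that LoRA converges more slowly than full fine-tuning. Its key contribution is D²LoRA, a data-driven approach for initializing LoRA parameters that improves training efficiency, especially with limited data. The paper also analyzes post-training methods, namely Supervised Fine-Tuning (SFT), Direct Preference Optimization (DPO), and Odds Ratio Preference Optimization (ORPO), in the context of task-specific learning with LoRA. The experiments compare D²LoRA with vanilla LoRA in terms of performance and catastrophic forgetting under extremely data-constrained conditions; D²LoRA achieves a 1% improvement on the GSM8K benchmark and a 2-point improvement in ROUGE score on title generation. D²LoRA thereby helps adapt LLMs to multiple tasks even when task-specific data is scarce, reducing training expenses and data costs.

Link: https://arxiv.org/abs/2503.18089
Authors: Javad SeraJ, Mohammad Mahdi Mohajeri, Mohammad Javad Dousti
Affiliations: Department of Electrical and Computer Engineering, University of Tehran
Categories: Computation and Language (cs.CL)
Comments: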

Abstract:Tuning large language models is essential for optimizing their performance across diverse applications. It is particularly important in scenarios with limited data availability, given that the convergence speed of the LoRA method is lower than that of full fine-tuning. In this paper, we present an analysis of post-training methods including Supervised Fine-Tuning (SFT), Direct Preference Optimization (DPO), and Odds Ratio Preference Optimization (ORPO) within the context of task-specific learning using the LoRA method. Next, we introduce D^2LoRA, a data-driven approach for initializing LoRA matrices that enhances training efficiency, especially in limited-data settings. Our experiments compare D^2LoRA with vanilla LoRA in terms of performance and catastrophic forgetting under extremely data-constrained conditions. The results demonstrate that D^2LoRA achieves a 1% improvement on the GSM8K benchmark and a 2-point improvement in ROUGE score in title generation tasks. D^2LoRA facilitates the adaptation of LLMs to multiple tasks even when task-specific data is scarce, thereby reducing training expenses and data costs.

[NLP-64] Temporal Relation Extraction in Clinical Texts: A Span-based Graph Transformer Approach

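[Quick read]: This paper tackles the extraction of clinical events and their temporal relations from unstructured text, a task made difficult in the medical domain by complex clinical language, long documents, and sparse annotations. It targets the I2B2 2012 Temporal Relations Challenge corpus. The key contribution is GRAPHTREX, which combines span-based entity-relation extraction, clinical large pre-trained language models (LPLMs), and Heterogeneous Graph Transformers (HGT) to capture both local and global dependencies. The HGT component uses global landmarks to propagate information between distant entities across the document. GRAPHTREX improves the tempeval F₁ score by 5.5% over the previous best method and by up to 8.9% on long-range relations. Beyond advancing temporal information extraction, the work lays groundwork for better diagnostic and prognostic models through stronger temporal reasoning.

Link: https://arxiv.org/abs/2503.18085
Authors: Rochana Chaturvedi, Peyman Baghershahi, Sourav Medya, Barbara Di Eugenio
Affiliations: University of Illinois Chicago
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Introducing a novel method for joint extraction of medical events and temporal relations from free-text, leveraging clinical LPLMs and Heterogeneous Graph Transformers, achieving a 5.5% improvement over the previous state-of-the-art and up to 8.9% on long-range relations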

Abstract:Temporal information extraction from unstructured text is essential for contextualizing events and deriving actionable insights, particularly in the medical domain. We address the task of extracting clinical events and their temporal relations using the well-studied I2B2 2012 Temporal Relations Challenge corpus. This task is inherently challenging due to complex clinical language, long documents, and sparse annotations. We introduce GRAPHTREX, a novel method integrating span-based entity-relation extraction, clinical large pre-trained language models (LPLMs), and Heterogeneous Graph Transformers (HGT) to capture local and global dependencies. Our HGT component facilitates information propagation across the document through innovative global landmarks that bridge distant entities. Our method improves the state-of-the-art with 5.5% improvement in the tempeval F_1 score over the previous best and up to 8.9% improvement on long-range relations, which presents a formidable challenge. This work not only advances temporal information extraction but also lays the groundwork for improved diagnostic and prognostic models through enhanced temporal reasoning.

[NLP-65] A Multi-Model Adaptation of Speculative Decoding for Classification

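[Quick read]: This paper addresses the balance between efficiency and accuracy in classification. Relying on a single strong model for every input is accurate but computationally expensive. The proposed solution adapts speculative decoding into a multi-model framework with up to three lightweight worker models and one more robust judge model. The workers do most of the computation and independently predict a discrete class label for each input. When a majority of workers agree, their label is accepted and the expensive judge model is skipped; when they disagree, the judge model decides the final label. This reduces redundant computation, uses agreement among several workers as a confidence signal, and confines the judge to hard cases. The study also finds that out-of-the-box instruction- or chat-tuned 3B-parameter workers align with the judge model about as well as larger fine-tuned 7B-parameter workers on both simple and higher-order reasoning tasks, while running considerably faster.

Link: https://arxiv.org/abs/2503.18076
Authors: Somnath Roy, Padharthi Sreekar, Srivatsa Narasimha, Anubhav Anand
Affiliations: Freshworks Inc, USA
Categories: Computation and Language (cs.CL)
Comments: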

Abstract:The current study introduces a novel adaptation of speculative decoding, repurposed from generation to classification tasks. We propose a multi-model framework employing up to three lightweight worker models and a single, more robust judge model, analogous to the draft models and target model, respectively, in speculative decoding. The worker models, tasked with the bulk of the computation, independently predict discrete class labels for a given input. When a majority of worker models agree on a label, it is accepted as the final label, optimizing efficiency by bypassing the computationally expensive judge model. In cases of disagreement, the judge model intervenes to resolve the label. This approach minimizes redundant computation, leverages the redundancy of multiple workers for confidence, and confines the judge model’s role to challenging cases, offering a practical balance of efficiency and accuracy. Our analysis suggests that smaller out-of-the-box instruction/chat finetuned worker models with 3 billion parameters (hereafter, 3B) demonstrate a level of alignment with judge models comparable to that of larger finetuned worker models with 7 billion parameters (hereafter, 7B) across both simple and higher-order reasoning tasks. The top-performing 3B worker model pair achieves an agreement rate of approximately 80-83% for sentiment and around 50-80% for similar ticket when compared to judge models. Additionally, 3B worker models provide a speedup ranging from 2.8x to 9x relative to the judge models, while 7B worker model combinations achieve a speedup ranging from 1.28x to 0.28x.
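
The worker/judge routing described in this abstract can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the `Classifier` callable interface, the function name `classify`, and the strict-majority rule (more than half of the workers) are assumptions made for the example.

```python
from collections import Counter
from typing import Callable, Sequence, Tuple

# Hypothetical interface: a classifier maps input text to a label string.
Classifier = Callable[[str], str]


def classify(text: str,
             workers: Sequence[Classifier],
             judge: Classifier) -> Tuple[str, bool]:
    """Return (label, judge_was_called) for one input."""
    # Each lightweight worker predicts a label independently.
    votes = Counter(worker(text) for worker in workers)
    label, count = votes.most_common(1)[0]
    # Assumption: accept the workers' label only on a strict majority.
    if count > len(workers) / 2:
        return label, False
    # Otherwise fall back to the more expensive judge model.
    return judge(text), True
```

With three workers, two matching votes are enough to skip the judge; with two workers, any disagreement is escalated.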

[NLP-66] On the effectiveness of LLMs for automatic grading of open-ended questions in Spanish

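[Quick read]: This paper addresses the time-consuming task of grading in education by using large language models (LLMs) to grade short-text answers to open-ended questions automatically. It evaluates different LLMs and prompting techniques in a setting where questions, answers, and prompts are all in Spanish. The best combinations of advanced LLMs, both open and proprietary, exceed 95% accuracy on a three-level grading task and 98% when the task is simplified to a binary right-or-wrong judgment. Grading performance is highly sensitive to prompt style, which suggests biases toward certain words or content in the prompt, so choosing the right model and prompting strategy is central to the approach.

Link: https://arxiv.org/abs/2503.18072
Authors: Germán Capdehourat, Isabel Amigo, Brian Lorenzo, Joaquín Trigo
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: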

Abstract:Grading is a time-consuming and laborious task that educators must face. It is an important task since it provides feedback signals to learners, and it has been demonstrated that timely feedback improves the learning process. In recent years, the irruption of LLMs has shed light on the effectiveness of automatic grading. In this paper, we explore the performance of different LLMs and prompting techniques in automatically grading short-text answers to open-ended questions. Unlike most of the literature, our study focuses on a use case where the questions, answers, and prompts are all in Spanish. Experimental results comparing automatic scores to those of human-expert evaluators show good outcomes in terms of accuracy, precision and consistency for advanced LLMs, both open and proprietary. Results are notably sensitive to prompt styles, suggesting biases toward certain words or content in the prompt. However, the best combinations of models and prompt strategies consistently surpass an accuracy of 95% in a three-level grading task, which rises to more than 98% when it is simplified to a binary right or wrong rating problem. This demonstrates the potential that LLMs have to implement this type of automation in education applications.

[NLP-67] Mind with Eyes: from Language Reasoning to Multimodal Reasoning

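[Quick read]: This survey examines how multimodal reasoning can deliver more comprehensive, human-like cognitive capabilities. Its key contribution is a systematic taxonomy that splits multimodal reasoning methods into language-centric multimodal reasoning (covering one-pass visual perception and active visual perception) and collaborative multimodal reasoning (covering action generation and state update). In the former, language reasoning leads and vision plays a supporting role; the latter enables dynamic interaction between modalities. The survey also analyzes the technical evolution and inherent challenges of these methods, introduces key benchmark tasks and evaluation metrics, and outlines two future directions: from visual-language reasoning to omnimodal reasoning, and from multimodal reasoning to multimodal agents.

Link: https://arxiv.org/abs/2503.18071
Authors: Zhiyu Lin, Yifei Gao, Xian Zhao, Yunfan Yang, Jitao Sang
Affiliations: Beijing Jiaotong University
Categories: Computation and Language (cs.CL)
Comments: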

Abstract:Language models have recently advanced into the realm of reasoning, yet it is through multimodal reasoning that we can fully unlock the potential to achieve more comprehensive, human-like cognitive capabilities. This survey provides a systematic overview of the recent multimodal reasoning approaches, categorizing them into two levels: language-centric multimodal reasoning and collaborative multimodal reasoning. The former encompasses one-pass visual perception and active visual perception, where vision primarily serves a supporting role in language reasoning. The latter involves action generation and state update within reasoning process, enabling a more dynamic interaction between modalities. Furthermore, we analyze the technical evolution of these methods, discuss their inherent challenges, and introduce key benchmark tasks and evaluation metrics for assessing multimodal reasoning performance. Finally, we provide insights into future research directions from the following two perspectives: (i) from visual-language reasoning to omnimodal reasoning and (ii) from multimodal reasoning to multimodal agents. This survey aims to provide a structured overview that will inspire further advancements in multimodal reasoning research.

[NLP-68] Long Is More Important Than Difficult for Training Reasoning Models

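[Quick read]: This paper addresses the bottleneck that scarce high-difficulty problems place on training reasoning models. The key idea is to decouple reasoning performance from problem difficulty and to scale reasoning length instead. The authors first show empirically that reasoning length, rather than problem difficulty, is the main driver of model performance, and find that performance grows log-linearly with the length of the reasoning data. They then propose a simple method for generating reasoning data of arbitrary length and show that the synthesized data is effective for training. Fine-tuning Qwen2.5-32B-Instruct on their Long1K dataset, with only 1,000 training samples, reaches 95.6% accuracy on MATH and 71.1% on GPQA, outperforming DeepSeek-R1-Distill-Qwen-32B.

Link: https://arxiv.org/abs/2503.18069
Authors: Si Shen, Fei Huang, Zhixiao Zhao, Chang Liu, Tiansheng Zheng, Danhao Zhu
Affiliations: Nanjing University of Science and Technology; Nanjing Agricultural University; Criminal Science and Technology, Jiangsu Police Institute
Categories: Computation and Language (cs.CL)
Comments: 15 pages, 6 figures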

Abstract:Difficult problems, which often result in long reasoning traces, are widely recognized as key factors for enhancing the performance of reasoning models. However, such high-challenge problems are scarce, limiting the size of available datasets. In this paper, we propose a simple method to decouple the reliance on problem difficulty. First, we empirically demonstrate that reasoning length, rather than problem difficulty, primarily influences the performance of trained models. Second, we identify a scaling law on reasoning length, showing that model performance increases in a log-linear fashion as the reasoning data length grows. Finally, we introduce a straightforward technique to generate reasoning data of arbitrary length, and show that synthesized data is effective for training reasoning models. After fine-tuning the Qwen2.5-32B-Instruct language model on our Long1K dataset, we present our model, Long1K-32B, which achieves remarkable performance with only 1,000 training samples, achieving 95.6% accuracy on MATH, and 71.1% on GPQA outperforming DeepSeek-R1-Distill-Qwen-32B. The model, code, and dataset are all open-sourced, available at this https URL.

[NLP-69] Unseen from Seen: Rewriting Observation-Instruction Using Foundation Models for Augmenting Vision-Language Navigation

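[Quick read]: This paper addresses data scarcity in Vision-Language Navigation (VLN), which severely limits how well agents generalize to unseen environments. Prior work mainly relies on additional simulator data or web-collected images and videos, but simulator environments offer limited diversity and web data needs extensive manual cleaning. The paper proposes a Rewriting-driven AugMentation (RAM) paradigm that creates unseen observation-instruction pairs directly by rewriting human-annotated training data.

The approach rests on two rewriting strategies. Object-Enriched Observation Rewriting combines Vision-Language Models (VLMs) and Large Language Models (LLMs) to produce scene descriptions enriched with new objects, then synthesizes diverse observations with Text-to-Image Generation Models (T2IMs). Observation-Contrast Instruction Rewriting asks LLMs to reason about the difference between the original and new observations to generate instructions aligned with the new observations. The paper also introduces a mixing-then-focusing training strategy with random observation cropping, which increases the diversity of the data distribution while suppressing noise from augmented data. Experiments in discrete environments (R2R, REVERIE, R4R) and a continuous environment (R2R-CE) show strong performance and generalization.

Link: https://arxiv.org/abs/2503.18065
Authors: Ziming Wei, Bingqian Lin, Yunshuang Nie, Jiaqi Chen, Shikui Ma, Hang Xu, Xiaodan Liang
Affiliations: Shenzhen Campus of Sun Yat-sen University; Shanghai Jiao Tong University; University of Hong Kong; Dataa Robotics; Huawei Noah’s Ark Lab
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Robotics (cs.RO)
Comments: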

Abstract:Data scarcity is a long-standing challenge in the Vision-Language Navigation (VLN) field, which extremely hinders the generalization of agents to unseen environments. Previous works primarily rely on additional simulator data or web-collected images/videos to improve the generalization. However, the simulator environments still face limited diversity, and the web-collected data often requires extensive labor to remove the noise. In this paper, we propose a Rewriting-driven AugMentation (RAM) paradigm for VLN, which directly creates the unseen observation-instruction pairs via rewriting human-annotated training data. Benefiting from our rewriting mechanism, new observation-instruction can be obtained in both simulator-free and labor-saving manners to promote generalization. Specifically, we first introduce Object-Enriched Observation Rewriting, where we combine Vision-Language Models (VLMs) and Large Language Models (LLMs) to derive rewritten object-enriched scene descriptions, enabling observation synthesis with diverse objects and spatial layouts via Text-to-Image Generation Models (T2IMs). Then, we propose Observation-Contrast Instruction Rewriting, which generates observation-aligned rewritten instructions by requiring LLMs to reason the difference between original and new observations. We further develop a mixing-then-focusing training strategy with a random observation cropping scheme, effectively enhancing data distribution diversity while suppressing augmentation data noise during training. Experiments on both the discrete environments (R2R, REVERIE, and R4R datasets) and continuous environments (R2R-CE dataset) show the superior performance and impressive generalization ability of our method. Code is available at this https URL.

[NLP-70] Dynamic Task Vector Grouping for Efficient Multi-Task Prompt Tuning

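[Quick read]: This paper addresses the performance bottleneck caused by statically fixed source-task selection in multi-task prompt tuning. Existing methods transfer, once only, a soft prompt trained on all source tasks or on a single highly similar source task. They therefore miss the best combination of source tasks and ignore the fact that the similarity between source and target tasks changes during fine-tuning. The proposed Dynamic Task Vector Grouping (DTVG) measures task similarity with task vectors instead of soft prompts, groups the best source-task combination using two metrics (target similarity and knowledge consistency), and updates that combination dynamically at every iteration, which reduces negative transfer and improves performance.

Link: https://arxiv.org/abs/2503.18063
Authors: Pieyi Zhang, Richong Zhang, Zhijie Nie
Affiliations: CCSE, School of Computer Science and Engineering, Beihang University; Zhongguancun Laboratory; Shen Yuan Honors College, Beihang University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Work in progress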

Abstract:Multi-task prompt tuning utilizes multiple high-resource source tasks to improve performance on low-resource target tasks. Existing approaches transfer, one time only, the soft prompt trained by combining all source tasks or a single "highly similar" source task. However, we find that the optimal transfer performance often comes from a combination of source tasks, which is neither one nor all. Further, we find that the similarity between source and target tasks also changes dynamically during fine-tuning after transferring, making similarity calculation in the initial stage inadequate. To address these issues, we propose a method called Dynamic Task Vector Grouping (DTVG), whose core ideas are (1) measuring the task similarity with task vectors instead of soft prompts, (2) grouping the optimal source task combination based on two metrics, target similarity and knowledge consistency, and (3) dynamically updating the combination in each iteration step. Extensive experiments on 26 NLP datasets under different settings demonstrate that DTVG effectively groups similar source tasks while reducing negative transfer, achieving state-of-the-art performance.
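
To make the idea of comparing tasks through task vectors concrete, here is a rough sketch. It assumes a task vector is the element-wise difference between tuned and initial parameters and that similarity is cosine similarity; the abstract does not define "target similarity" or "knowledge consistency", so the single threshold rule in `group_sources` is a hypothetical stand-in rather than the DTVG criterion.

```python
import math
from typing import Dict, List

Vector = List[float]


def task_vector(tuned: Vector, initial: Vector) -> Vector:
    """Assumed definition: tuned parameters minus initial parameters."""
    return [t - i for t, i in zip(tuned, initial)]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def group_sources(target: Vector,
                  sources: Dict[str, Vector],
                  threshold: float) -> List[str]:
    """Hypothetical grouping rule: keep source tasks whose task vector
    points in a similar direction to the target's task vector."""
    return [name for name, vec in sources.items()
            if cosine(target, vec) >= threshold]
```

Because the abstract reports that similarity drifts during fine-tuning, a function like `group_sources` would be called again at each iteration with the current target task vector instead of once at the start.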

[NLP-71] Investigating Recent Large Language Models for Vietnamese Machine Reading Comprehension

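[Quick read]: This paper examines how effective large language models (LLMs) are at Machine Reading Comprehension (MRC) in a low-resource language, Vietnamese. Using Quantized Low-Rank Adaptation (QLoRA), the authors efficiently fine-tune two state-of-the-art LLMs, Llama 3 and Gemma, and evaluate them on the ViMMRC dataset. Although the fine-tuned models have fewer parameters, they outperform both traditional BERT-based baselines and the larger GPT-3 and GPT-3.5 models. This supports the effectiveness of the fine-tuning process and shows that modern LLMs can surpass older models while remaining suitable for resource-constrained environments.

Link: https://arxiv.org/abs/2503.18062
Authors: Anh Duc Nguyen, Hieu Minh Phi, Anh Viet Ngo, Long Hai Trieu, Thai Phuong Nguyen
Affiliations: University of Engineering and Technology - Vietnam National University
Categories: Computation and Language (cs.CL)
Comments: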

Abstract:Large Language Models (LLMs) have shown remarkable proficiency in Machine Reading Comprehension (MRC) tasks; however, their effectiveness for low-resource languages like Vietnamese remains largely unexplored. In this paper, we fine-tune and evaluate two state-of-the-art LLMs: Llama 3 (8B parameters) and Gemma (7B parameters), on ViMMRC, a Vietnamese MRC dataset. By utilizing Quantized Low-Rank Adaptation (QLoRA), we efficiently fine-tune these models and compare their performance against powerful LLM-based baselines. Although our fine-tuned models are smaller than GPT-3 and GPT-3.5, they outperform both traditional BERT-based approaches and these larger models. This demonstrates the effectiveness of our fine-tuning process, showcasing how modern LLMs can surpass the capabilities of older models like BERT while still being suitable for deployment in resource-constrained environments. Through intensive analyses, we explore various aspects of model performance, providing valuable insights into adapting LLMs for low-resource languages like Vietnamese. Our study contributes to the advancement of natural language processing in low-resource languages, and we make our fine-tuned models publicly available at: this https URL.

[NLP-72] (G)I-DLE: Generative Inference via Distribution-preserving Logit Exclusion with KL Divergence Minimization for Constrained Decoding

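[Quick read]: This paper addresses how to exclude undesirable tokens during constrained decoding while preserving the intrinsic conditional probability distribution of an autoregressive language model. Conventional methods simply set the logits of banned tokens to negative infinity (-∞), which, according to the paper, can distort the conversion from raw logits to posterior probabilities and increase the variance of output quality. The proposed method, (G)I-DLE, re-normalizes the probabilities of the allowed tokens by minimizing KL divergence, reducing this distortion while keeping outputs within the constraints. Experiments show that it raises mean evaluation scores and substantially lowers the variance of output quality.

Link: https://arxiv.org/abs/2503.18050
Authors: Hanwool Lee
Affiliations: Shinhan Securities Co.
Categories: Computational Engineering, Finance, and Science (cs.CE); Computation and Language (cs.CL)
Comments: preprint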

Abstract:We propose (G)I-DLE, a new approach to constrained decoding that leverages KL divergence minimization to preserve the intrinsic conditional probability distribution of autoregressive language models while excluding undesirable tokens. Unlike conventional methods that naively set banned tokens’ logits to -∞, which can distort the conversion from raw logits to posterior probabilities and increase output variance, (G)I-DLE re-normalizes the allowed token probabilities to minimize such distortion. We validate our method on the K2-Eval dataset, specifically designed to assess Korean language fluency, logical reasoning, and cultural appropriateness. Experimental results on Qwen2.5 models (ranging from 1.5B to 14B) demonstrate that (G)I-DLE not only boosts mean evaluation scores but also substantially reduces the variance of output quality.
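
The renormalization step mentioned in the abstract can be illustrated with a standard result: for a next-token distribution p and an allowed token set A, the distribution q supported on A that minimizes KL(q‖p) is p restricted to A and rescaled to sum to one. The sketch below shows only that step under this assumption; the function names are hypothetical, and the paper's full procedure may differ in ways the abstract does not describe.

```python
import math
from typing import List, Set


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over raw logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def renormalize_allowed(probs: List[float], banned: Set[int]) -> List[float]:
    """Zero out banned token ids and rescale the remaining probabilities,
    which keeps the relative odds among allowed tokens unchanged."""
    allowed_mass = sum(p for i, p in enumerate(probs) if i not in banned)
    if allowed_mass == 0.0:
        raise ValueError("every token is banned")
    return [0.0 if i in banned else p / allowed_mass
            for i, p in enumerate(probs)]
```

The property worth checking in any implementation is that the ratio between two allowed tokens' probabilities is the same before and after exclusion.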

[NLP-73] Expanding the Boundaries of Vision Prior Knowledge in Multi-modal Large Language Models

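[Quick read]: This paper asks whether the prior knowledge of the vision encoder limits the capability boundary of multi-modal large language models (MLLMs), and quantifies the effect of that prior knowledge on MLLM performance. It finds that domain-specific fine-tuning on end-to-end visual question answering (VQA) data alone does little to improve performance on entities with low visual prior knowledge. To address this, the paper proposes VisPRE (Vision Prior Remediation), a two-stage training framework that explicitly injects prior knowledge at the vision-encoder level. Experiments show that strengthening the vision encoder’s prior knowledge substantially improves the visual understanding of MLLMs, especially in scenarios involving uncommon visual entities.

Link: https://arxiv.org/abs/2503.18034
Authors: Qiao Liang, Yanjiang Liu, Ben He, Yaojie Lu, Hongyu Lin, Jia Zheng, Xianpei Han, Le Sun, Yingfei Sun
Affiliations: University of Chinese Academy of Sciences; Chinese Information Processing Laboratory, Institute of Software, Chinese Academy of Sciences
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments: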

Abstract:Does the prior knowledge of the vision encoder constrain the capability boundary of Multi-modal Large Language Models (MLLMs)? While most existing research treats MLLMs as unified systems optimized through end-to-end training, the impact of vision encoder’s prior knowledge is seldom investigated. In this work, we introduce a novel metric, Rank_e , to quantify the effect of the vision encoder’s prior knowledge on MLLM performance. Our analysis reveals a positive correlation between prior knowledge and MLLM performance. Moreover, we find that domain-specific fine-tuning using solely end-to-end visual question answering (VQA) data is insufficient–particularly for entities with low inherent visual prior knowledge. To address this issue, we propose VisPRE (Vision Prior Remediation), a two-stage training framework that explicitly incorporates prior knowledge at the vision encoder level. Experimental results demonstrate that augmenting vision encoder’s prior knowledge substantially boosts the visual understanding capabilities of MLLMs, offering a novel and effective strategy for improving performance, especially in scenarios involving uncommon visual entities.

[NLP-74] Personalized Language Models via Privacy-Preserving Evolutionary Model Merging

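[Quick read]: This paper addresses two main shortcomings of existing personalization methods for large language models (LLMs): they do not directly optimize task-specific metrics, and they lack explicit privacy-preservation mechanisms. It proposes Privacy-Preserving Model Merging via Evolutionary Algorithms (PriME), which uses gradient-free optimization to optimize task-specific metrics directly while protecting user privacy. PriME produces a personalized module that captures the target user’s preferences while minimizing privacy risk for the users who share their private information. On the LaMP benchmark, PriME outperforms both prompt-based and training-based methods by up to 45% and achieves a significantly better privacy-utility trade-off, highlighting the potential of evolutionary algorithms for privacy-preserving LLM personalization.

Link: https://arxiv.org/abs/2503.18008
Authors: Kyuyoung Kim, Jinwoo Shin, Jaehyung Kim
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Neural and Evolutionary Computing (cs.NE)
Comments: Preprint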

Abstract:Personalization in large language models (LLMs) seeks to tailor models to individual user or user group preferences. Prompt-based methods augment queries with user preference information, whereas training-based methods directly encode preferences into model parameters for more effective personalization. Despite achieving some success in personalizing LLMs, prior methods often fail to directly optimize task-specific metrics and lack explicit privacy-preservation mechanisms. To address these limitations, we propose Privacy-Preserving Model Merging via Evolutionary Algorithms (PriME), a novel approach to personalization that employs gradient-free methods to directly optimize task-specific metrics while preserving user privacy. By incorporating privacy preservation into optimization, PriME produces a personalized module that effectively captures the target user’s preferences while minimizing the privacy risks for the users sharing their private information. Experiments on the LaMP benchmark show that PriME outperforms both prompt-based and training-based methods, achieving up to a 45% performance improvement over the prior art. Further analysis shows that PriME achieves a significantly better privacy-utility trade-off, highlighting the potential of evolutionary approaches for privacy-preserving LLM personalization.

[NLP-75] Instructing the Architecture Search for Spatial-temporal Sequence Forecasting with LLM

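[Quick read]: This paper addresses neural architecture search (NAS) for spatial-temporal sequence forecasting (STSF). Existing methods generate architectures in a time-consuming, data-driven fashion, which limits their ability to use background knowledge and to explore complicated search trajectories. The paper also investigates how the decision-making ability of large language models (LLMs) can benefit NAS for STSF, a direction that remains largely unexplored.

The key contribution is an LLM-based NAS method that draws out the LLM’s capability through a multi-level enhancement mechanism. At the step level, prompt engineering decomposes the generation task into decision steps, so that the LLM acts as an instructor for the architecture search based on its internal knowledge. At the instance level, a one-step tuning framework quickly evaluates each architecture instance, and a memory bank accumulates knowledge to improve the LLM’s search ability. At the task level, a two-stage architecture search balances an exploration stage and an optimization stage to reduce the risk of getting trapped in local optima. Together these improve the effectiveness and efficiency of NAS for STSF.

Link: https://arxiv.org/abs/2503.17994
Authors: Xin Xue, Haoyi Zhou, Tianyu Chen, Shuai Zhang, Yizhou Long, Jianxin Li
Affiliations: Beihang University; ACT (Institute of Advanced Computing Technology)
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: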

Abstract:Spatial-temporal sequence forecasting (STSF) is a long-standing research problem with widespread real-world applications. Neural architecture search (NAS), which automates the neural network design, has been shown effective in tackling the STSF problem. However, the existing NAS methods for STSF focus on generating architectures in a time-consuming data-driven fashion, which heavily limits their ability to use background knowledge and explore the complicated search trajectory. Large language models (LLMs) have shown remarkable ability in decision-making with comprehensive internal world knowledge, but how they could benefit NAS for STSF remains unexplored. In this paper, we propose a novel NAS method for STSF based on LLMs. Instead of directly generating architectures with an LLM, we elicit the LLM’s capability with a multi-level enhancement mechanism. Specifically, on the step-level, we decompose the generation task into decision steps with powerful prompt engineering and prompt the LLM to serve as an instructor for architecture search based on its internal knowledge. On the instance-level, we utilize a one-step tuning framework to quickly evaluate the architecture instance and a memory bank to accumulate knowledge to improve the LLM’s search ability. On the task-level, we propose a two-stage architecture search, balancing the exploration stage and optimization stage, to reduce the possibility of being trapped in local optima. Extensive experimental results demonstrate that our method can achieve competitive effectiveness with superior efficiency against existing NAS methods for STSF.

[NLP-76] Trade-offs in Large Reasoning Models: An Empirical Analysis of Deliberative and Adaptive Reasoning over Foundational Capabilities

[Quick read]: This paper addresses the loss of foundational capabilities and the higher inference cost that arise when Large Reasoning Models (LRMs) acquire stronger complex-reasoning abilities such as human-like deliberative thinking and long chain-of-thought reasoning. The key proposal is adaptive reasoning: modes such as Zero-Thinking, Less-Thinking, and Summary-Thinking adjust how much compute the model spends during inference, which alleviates the capability degradation and cost increase while keeping the model practical and efficient.

Link: https://arxiv.org/abs/2503.17979
Authors: Weixiang Zhao, Xingyu Sui, Jiahe Guo, Yulin Hu, Yang Deng, Yanyan Zhao, Bing Qin, Wanxiang Che, Tat-Seng Chua, Ting Liu
Affiliations: Harbin Institute of Technology; Singapore Management University; National University of Singapore
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 23 pages. Work in progress

Abstract:Recent advancements in Large Reasoning Models (LRMs), such as OpenAI’s o1/o3 and DeepSeek-R1, have demonstrated remarkable performance in specialized reasoning tasks through human-like deliberative thinking and long chain-of-thought reasoning. However, our systematic evaluation across various model families (DeepSeek, Qwen, and LLaMA) and scales (7B to 671B) reveals that acquiring these deliberative reasoning capabilities significantly reduces the foundational capabilities of LRMs, including notable declines in helpfulness and harmlessness, alongside substantially increased inference costs. Importantly, we demonstrate that adaptive reasoning – employing modes like Zero-Thinking, Less-Thinking, and Summary-Thinking – can effectively alleviate these drawbacks. Our empirical insights underline the critical need for developing more versatile LRMs capable of dynamically allocating inference-time compute according to specific task characteristics.

[NLP-77] Understanding the Effects of RLHF on the Quality and Detectability of LLM-Generated Texts

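[Quick read]: This paper addresses the difficulty of detecting text generated by large language models (LLMs) in settings of potential malicious misuse. Because LLM-generated text closely resembles human writing, detection is hard, and generated text can be manipulated further to evade detectors. The paper studies how editing texts further with Reinforcement Learning from Human Feedback (RLHF) affects both the quality of generated text and the performance of detectors. It finds that RLHF improves text quality but also makes outputs more detectable, longer, and more repetitive. It further compares training-based and zero-shot detectors: training-based detectors are vulnerable to short texts and to texts that contain code, whereas zero-shot detectors are more robust.

Link: https://arxiv.org/abs/2503.17965
Authors: Beining Xu, Arkaitz Zubiaga
Affiliations: Queen Mary University of London
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 14 pages, 3 figures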

Abstract:Large Language Models (LLMs) have demonstrated exceptional performance on a range of downstream NLP tasks by generating text that closely resembles human writing. However, the ease of achieving this similarity raises concerns from potential malicious uses at scale by bad actors, as LLM-generated text becomes increasingly difficult to discern from human text. Although detection methods have been developed to address this issue, bad actors can further manipulate LLM-generated texts to make them less detectable. In this work, we study how further editing texts with Reinforcement Learning from Human Feedback (RLHF), which aligns model outputs with human preferences, affects (a) the quality of generated texts for two tasks, and (b) the performance of LLM-generated text detectors, looking at both training-based and zero-shot detection methods. Although RLHF improves the quality of LLM-generated texts, we find that it also tends to produce more detectable, lengthy, and repetitive outputs. Additionally, we observe that training-based detectors are vulnerable to short texts and to texts that incorporate code, whereas zero-shot detectors exhibit greater robustness.

[NLP-78] Won: Establishing Best Practices for Korean Financial NLP

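[Quick read]: This paper addresses how to evaluate Korean large language models (LLMs) focused on finance and how to improve their safety and performance. The authors ran an open leaderboard for about eight weeks, evaluating 1,119 submissions on a closed benchmark that covers five multiple-choice question answering (MCQA) categories and one open-ended question answering task.

Drawing on this large-scale evaluation, the paper summarizes widely used training strategies and releases an open instruction dataset of 80,000 instances. It also introduces Won, a fully open and transparent LLM built with these best practices. The authors hope this will help the development of better and safer financial LLMs for Korean and other languages.

Link: https://arxiv.org/abs/2503.17963
Authors: Guijin Son, Hyunwoo Ko, Haneral Jung, Chami Hwang
Affiliations: OneLineAI; KRX (Korea Exchange)
Categories: Computation and Language (cs.CL)
Comments: The training dataset is uploaded here: this https URL . The model will be updated shortly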

Abstract:In this work, we present the first open leaderboard for evaluating Korean large language models focused on finance. Operated for about eight weeks, the leaderboard evaluated 1,119 submissions on a closed benchmark covering five MCQA categories (finance and accounting, stock price prediction, domestic company analysis, financial markets, and financial agent tasks) and one open-ended QA task. Building on insights from these evaluations, we release an open instruction dataset of 80k instances and summarize widely used training strategies observed among top-performing models. Finally, we introduce Won, a fully open and transparent LLM built using these best practices. We hope our contributions help advance the development of better and safer financial LLMs for Korean and other languages.

[NLP-79] Human-AI Interaction and User Satisfaction: Empirical Evidence from Online Reviews of AI Products

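[Quick read]: This paper addresses the lack of large-scale empirical evidence on how Human-AI Interaction (HAI) principles shape user satisfaction in practice. It analyzes more than 100,000 user reviews of AI-related products from a leading review platform for business software and services, identifies seven core HAI dimensions, and examines their coverage and sentiment in the reviews. Sentiment on four dimensions (adaptability, customization, error recovery, and security) is positively associated with overall user satisfaction. Engagement with HAI dimensions varies by professional background, but the relationship between HAI sentiment and overall satisfaction is not moderated by job role, which suggests that once users identify an HAI dimension, its effect on satisfaction is consistent across roles.

Link: https://arxiv.org/abs/2503.17955
Authors: Stefan Pasch, Sun-Young Ha
Affiliations: Unknown
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: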

Abstract:Human-AI Interaction (HAI) guidelines and design principles have become increasingly important in both industry and academia to guide the development of AI systems that align with user needs and expectations. However, large-scale empirical evidence on how HAI principles shape user satisfaction in practice remains limited. This study addresses that gap by analyzing over 100,000 user reviews of AI-related products from this http URL, a leading review platform for business software and services. Based on widely adopted industry guidelines, we identify seven core HAI dimensions and examine their coverage and sentiment within the reviews. We find that the sentiment on four HAI dimensions-adaptability, customization, error recovery, and security-is positively associated with overall user satisfaction. Moreover, we show that engagement with HAI dimensions varies by professional background: Users with technical job roles are more likely to discuss system-focused aspects, such as reliability, while non-technical users emphasize interaction-focused features like customization and feedback. Interestingly, the relationship between HAI sentiment and overall satisfaction is not moderated by job role, suggesting that once an HAI dimension has been identified by users, its effect on satisfaction is consistent across job roles.

[NLP-80] SLIDE: Sliding Localized Information for Document Extraction

【速读】：该论文旨在解决从长文本和低资源语言中构建准确知识图谱的问题，主要挑战在于大型语言模型（Large Language Models, LLMs）在处理较长输入片段时性能下降，尤其是在数据稀缺的低资源环境下，实体和关系抽取的准确性受到限制。此外，传统的上下文检索方法虽能提升检索精度，但在长文档中难以避免因截断关键信息而导致的知识图谱构建受限问题。

论文的关键解决方案是提出了一种名为SLIDE（Sliding Localized Information for Document Extraction）的分块方法，通过滑动重叠窗口生成局部上下文，确保长文档中的重要上下文信息得以保留。这种方法显著提升了GraphRAG的知识图谱抽取性能，在英语中实现了实体抽取提升24%、关系抽取提升39%，而在阿非利加语等低资源语言中，分别实现了实体抽取提升49%和关系抽取提升82%。此外，SLIDE还改善了问答任务的全面性、多样性和能力等指标，展示了其在多语言和资源受限环境中的有效性。

Link: https://arxiv.org/abs/2503.17952
Authors: Divyansh Singh, Manuel Nunez Martinez, Bonnie J. Dorr, Sonja Schmer Galunder
Affiliations: University of Florida
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Constructing accurate knowledge graphs from long texts and low-resource languages is challenging, as large language models (LLMs) experience degraded performance with longer input chunks. This problem is amplified in low-resource settings where data scarcity hinders accurate entity and relationship extraction. Contextual retrieval methods, while improving retrieval accuracy, struggle with long documents. They truncate critical information in texts exceeding maximum context lengths of LLMs, significantly limiting knowledge graph construction. We introduce SLIDE (Sliding Localized Information for Document Extraction), a chunking method that processes long documents by generating local context through overlapping windows. SLIDE ensures that essential contextual information is retained, enhancing knowledge graph extraction from documents exceeding LLM context limits. It significantly improves GraphRAG performance, achieving a 24% increase in entity extraction and a 39% improvement in relationship extraction for English. For Afrikaans, a low-resource language, SLIDE achieves a 49% increase in entity extraction and an 82% improvement in relationship extraction. Furthermore, it improves upon state-of-the-art in question-answering metrics such as comprehensiveness, diversity and empowerment, demonstrating its effectiveness in multilingual and resource-constrained settings.
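
The overlapping-window idea in the abstract can be sketched as follows. This is a minimal illustration and not the authors' implementation: the function name `sliding_chunks` and the parameters `window` and `overlap` are hypothetical, and the step in which SLIDE generates local context for each chunk (presumably with an LLM) is not modelled here.

```python
def sliding_chunks(tokens, window=8, overlap=3):
    """Split `tokens` into windows of `window` items that share `overlap` items.

    Hypothetical helper: only the overlapping-window chunking is shown.
    """
    if not 0 <= overlap < window:
        raise ValueError("overlap must satisfy 0 <= overlap < window")
    step = window - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + window])
        # Stop once the current window already reaches the end of the input.
        if start + window >= len(tokens):
            break
    return chunks


example_chunks = sliding_chunks(list(range(10)), window=4, overlap=2)
```

Because neighbouring chunks share `overlap` tokens, an entity mention that straddles a chunk boundary still appears whole in at least one chunk whenever the mention is no longer than `overlap + 1` tokens.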

[NLP-81] An Empirical Study of the Role of Incompleteness and Ambiguity in Interactions with Large Language Models

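[Quick read]: This paper asks when multi-turn interaction with Large Language Models (LLMs) is needed to answer a question successfully, or to conclude that it is unanswerable. Its key contribution is a neural-symbolic framework that models the interaction between humans and LLMs, within which incompleteness and ambiguity of questions are defined as properties deducible from the messages exchanged. The results show that datasets with a high proportion of incomplete or ambiguous questions usually require multi-turn interaction to improve answer correctness, and that longer interactions reduce the effect of these problems. The proposed measures of incompleteness and ambiguity can also serve as useful tools for characterizing interactions with LLMs on question-answering tasks.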

Link: https://arxiv.org/abs/2503.17936
Authors: Riya Naik, Ashwin Srinivasan, Estrid He, Swati Agarwal
Affiliations: BITS Pilani, K K Birla Goa Campus; RMIT University; PandaByte Innovations Pvt Ltd
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Natural language as a medium for human-computer interaction has long been anticipated, and it has been undergoing a sea change with the advent of Large Language Models (LLMs) with startling capacities for processing and generating language. Many of us now treat LLMs as modern-day oracles, asking them almost any kind of question. Unlike its Delphic predecessor, consulting an LLM does not have to be a single-turn activity (ask a question, receive an answer, leave); and, also unlike the Pythia, it is widely acknowledged that answers from LLMs can be improved with additional context. In this paper, we aim to study when we need multi-turn interactions with LLMs to successfully get a question answered, or to conclude that a question is unanswerable. We present a neural symbolic framework that models the interactions between human and LLM agents. Through the proposed framework, we define incompleteness and ambiguity in the questions as properties deducible from the messages exchanged in the interaction, and provide results from benchmark problems, in which the answer-correctness is shown to depend on whether or not questions demonstrate the presence of incompleteness or ambiguity (according to the properties we identify). Our results show that multi-turn interactions are usually required for datasets which have a high proportion of incomplete or ambiguous questions, and that increasing interaction length has the effect of reducing incompleteness or ambiguity. The results also suggest that our measures of incompleteness and ambiguity can be useful tools for characterising interactions with an LLM on question-answering problems.

[NLP-82] Experience Retrieval-Augmentation with Electronic Health Records Enables Accurate Discharge QA

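[Quick read]: This paper addresses the limited reliability of Large Language Models (LLMs) in clinical applications, in particular how to integrate clinical experience grounded in real patient cases to support medical reasoning. Existing approaches rely mainly on general medical knowledge from open datasets and overlook case-based knowledge extracted from Electronic Health Records (EHR). The paper proposes ExpRAG (Experience Retrieval Augmentation), an EHR-based framework. Its key idea is coarse-to-fine retrieval: an EHR-based report ranker first identifies similar patients efficiently, and an experience retriever then extracts task-relevant content to provide richer context for medical reasoning. Experiments show that ExpRAG achieves an average relative improvement of 5.2% over a text-based ranker, confirming the importance of case-based knowledge for medical reasoning.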

Link: https://arxiv.org/abs/2503.17933
Authors: Justice Ou, Tinglin Huang, Yilun Zhao, Ziyang Yu, Peiqing Lu, Rex Ying
Affiliations: University of Illinois Urbana-Champaign; Yale University; University of Waterloo
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments:

Abstract:To improve the reliability of Large Language Models (LLMs) in clinical applications, retrieval-augmented generation (RAG) is extensively applied to provide factual medical knowledge. However, beyond general medical knowledge from open-ended datasets, clinical case-based knowledge is also critical for effective medical reasoning, as it provides context grounded in real-world patient experiences. Motivated by this, we propose Experience Retrieval Augmentation - ExpRAG framework based on Electronic Health Record (EHR), aiming to offer the relevant context from other patients’ discharge reports. ExpRAG performs retrieval through a coarse-to-fine process, utilizing an EHR-based report ranker to efficiently identify similar patients, followed by an experience retriever to extract task-relevant content for enhanced medical reasoning. To evaluate ExpRAG, we introduce DischargeQA, a clinical QA dataset with 1,280 discharge-related questions across diagnosis, medication, and instruction tasks. Each problem is generated using EHR data to ensure realistic and challenging scenarios. Experimental results demonstrate that ExpRAG consistently outperforms a text-based ranker, achieving an average relative improvement of 5.2%, highlighting the importance of case-based knowledge for medical reasoning.

[NLP-83] STShield: Single-Token Sentinel for Real-Time Jailbreak Detection in Large Language Models

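[Quick read]: This paper addresses the growing vulnerability of Large Language Models (LLMs) to jailbreak attacks. Existing defenses are either susceptible to adaptive attacks or depend on computationally expensive auxiliary models. The paper proposes STShield, a lightweight framework that judges in real time whether a model has been jailbroken. Its key innovation is a single-token sentinel mechanism that appends a binary safety indicator to the model's response sequence, using the LLM's own alignment capabilities for detection. The framework combines supervised fine-tuning on normal prompts with adversarial training using embedding-space perturbations, achieving robust detection while preserving model utility. Experiments show that STShield defends against a variety of jailbreak attacks while maintaining performance on legitimate queries, with better defense performance than existing methods and minimal computational overhead, which makes it practical for real-world deployment.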

Link: https://arxiv.org/abs/2503.17932
Authors: Xunguang Wang, Wenxuan Wang, Zhenlan Ji, Zongjie Li, Pingchuan Ma, Daoyuan Wu, Shuai Wang
Affiliations: The Hong Kong University of Science and Technology
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Comments: 11 pages

Abstract:Large Language Models (LLMs) have become increasingly vulnerable to jailbreak attacks that circumvent their safety mechanisms. While existing defense methods either suffer from adaptive attacks or require computationally expensive auxiliary models, we present STShield, a lightweight framework for real-time jailbroken judgement. STShield introduces a novel single-token sentinel mechanism that appends a binary safety indicator to the model’s response sequence, leveraging the LLM’s own alignment capabilities for detection. Our framework combines supervised fine-tuning on normal prompts with adversarial training using embedding-space perturbations, achieving robust detection while preserving model utility. Extensive experiments demonstrate that STShield successfully defends against various jailbreak attacks, while maintaining the model’s performance on legitimate queries. Compared to existing approaches, STShield achieves superior defense performance with minimal computational overhead, making it a practical solution for real-world LLM deployment.

[NLP-84] Debiasing Multimodal Large Language Models via Noise-Aware Preference Optimization CVPR2025

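[Quick read]: This paper addresses modality bias in multimodal large language models: the model relies too heavily on one modality and overlooks critical information in the others, leading to incorrect focus and irrelevant responses. The key solution adopts the preference optimization paradigm and consists of RLAIFVBias, a debiased preference optimization dataset, and a noise-aware preference optimization algorithm. The dataset is built by introducing perturbations that reduce the information in certain modalities, forcing the model to rely on a specific modality when generating negative responses. To cope with the unavoidable noise in automatically constructed data, the algorithm combines the noise-robust Mean Absolute Error with the Binary Cross Entropy through a negative Box-Cox transformation and dynamically adjusts its noise robustness according to the estimated noise level. Experiments show that the method mitigates modality bias and also significantly reduces hallucinations.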

Link: https://arxiv.org/abs/2503.17928
Authors: Zefeng Zhang, Hengzhu Tang, Jiawei Sheng, Zhenyu Zhang, Yiming Ren, Zhenyang Li, Dawei Yin, Duohe Ma, Tingwen Liu
Affiliations: Institute of Information Engineering, Chinese Academy of Sciences; School of Cyber Security, University of Chinese Academy of Sciences; Baidu Inc.
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments: CVPR 2025

Abstract:Multimodal Large Language Models excel in various tasks, yet often struggle with modality bias, where the model tends to rely heavily on a single modality and overlook critical information in other modalities, which leads to incorrect focus and generating irrelevant responses. In this paper, we propose using the paradigm of preference optimization to solve the modality bias problem, including RLAIFVBias, a debiased preference optimization dataset, and a Noise Aware Preference Optimization algorithm. Specifically, we first construct the dataset by introducing perturbations to reduce the informational content of certain modalities, compelling the model to rely on a specific modality when generating negative responses. To address the inevitable noise in automatically constructed data, we combine the noise robust Mean Absolute Error with the Binary Cross Entropy in Direct Preference Optimization by a negative Box Cox transformation, and dynamically adjust the algorithm noise robustness based on the evaluated noise levels in the data. Extensive experiments validate our approach, demonstrating not only its effectiveness in mitigating modality bias but also its significant role in minimizing hallucinations.
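
How a negative Box-Cox transformation can interpolate between Binary Cross Entropy and Mean Absolute Error is illustrated below with the generalized form (1 - p^q) / q. This is a sketch under stated assumptions, not the paper's algorithm: `box_cox_loss`, the probability `p` that the model prefers the chosen response, and the knob `q` are illustrative names, and the abstract does not say how estimated noise levels are mapped to the robustness setting.

```python
import math


def box_cox_loss(p, q):
    """Negative Box-Cox transform of a preference probability p in (0, 1].

    q -> 0 recovers the cross-entropy term -log(p);
    q = 1 gives 1 - p, the MAE form, which is more robust to label noise.
    """
    if not 0.0 < p <= 1.0:
        raise ValueError("p must lie in (0, 1]")
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must lie in [0, 1]")
    if q < 1e-8:
        return -math.log(p)  # limiting case of (1 - p**q) / q as q -> 0
    return (1.0 - p ** q) / q


# Hypothetical usage: a larger q for data believed to be noisier.
loss_clean = box_cox_loss(0.5, 0.0)   # behaves like BCE
loss_noisy = box_cox_loss(0.5, 1.0)   # behaves like MAE
```

For a confidently wrong prediction (small `p`), the cross-entropy end grows without bound while the MAE end stays below 1, which is why a larger `q` limits the influence of mislabeled preference pairs.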

[NLP-85] WindowKV: Task-Adaptive Group-Wise KV Cache Window Selection for Efficient LLM Inference

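[Quick read]: This paper addresses the memory cost of efficient large language model (LLM) inference in industrial scenarios, in particular the substantial GPU memory consumed by the key-value (KV) cache. Existing work focuses on reducing KV cache memory but overlooks two factors: preserving semantic coherence and accounting for task-specific characteristics during compression. The paper proposes WindowKV, a task-adaptive KV cache window selection method. Its key idea is to dynamically select local semantic windows of consecutive tokens according to task characteristics, so that the retained KV cache captures continuous, essential context. An intra-group layer KV cache index sharing strategy further reduces computational overhead, balancing performance and efficiency. On the LongBench benchmark, WindowKV achieves performance comparable to full KV cache retention while using only 12% of the original KV cache, and it achieves state-of-the-art results on the Needle-in-a-Haystack evaluation.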

Link: https://arxiv.org/abs/2503.17922
Authors: Youhui Zuo, Sibo Wei, Chen Zhang, Zhuorui Liu, Wenpeng Lu, Dawei Song
Affiliations: Beijing Institute of Technology; Qilu University of Technology (Shandong Academy of Sciences)
Categories: Computation and Language (cs.CL)
Comments:

Abstract:With the advancements in long-context inference capabilities of large language models (LLMs), the KV cache has become one of the foundational components. However, its substantial GPU memory consumption makes KV cache compression a key technique for enabling efficient LLM inference in industrial scenarios. While recent studies have focused on optimizing the memory occupied by the KV cache, they overlook two critical factors: preserving semantic coherence and considering task-specific characteristics during compression. To address these limitations, we propose a novel task-adaptive KV cache window selection method, WindowKV. WindowKV dynamically selects local semantic windows consisting of consecutive tokens, according to task-specific characteristics, ensuring the retained KV cache captures continuous, essential context. Additionally, we introduce an intra-group layer KV cache indices sharing strategy to reduce computational overhead, achieving a balance between performance and efficiency. We rigorously evaluate WindowKV on the LongBench benchmark, and the results demonstrate that it maintains a performance comparable to full KV cache retention while using only 12% of the original KV cache, significantly reducing memory requirements. Furthermore, our method also achieves state-of-the-art results in the Needle-in-a-Haystack evaluation, highlighting its effectiveness and robustness.
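
Selecting windows of consecutive tokens, as opposed to isolated tokens, can be sketched as below. This is only an illustration of window-level selection under a memory budget: `select_windows`, the per-token importance `scores` (for example, aggregated attention weights), and the greedy policy are assumptions, and the task-adaptive part and the intra-group index sharing described in the abstract are not modelled.

```python
def select_windows(scores, window_size, budget):
    """Keep the highest-scoring windows of consecutive token positions.

    scores: hypothetical per-token importance values.
    budget: maximum number of token positions to retain in the KV cache.
    Returns the sorted list of retained positions.
    """
    n = len(scores)
    windows = []
    for start in range(0, n, window_size):
        end = min(start + window_size, n)
        mean_score = sum(scores[start:end]) / (end - start)
        windows.append((mean_score, start, end))
    # Greedily take the best windows that still fit into the budget.
    windows.sort(key=lambda w: w[0], reverse=True)
    kept = []
    for _, start, end in windows:
        if len(kept) + (end - start) <= budget:
            kept.extend(range(start, end))
    return sorted(kept)


token_scores = [0.1, 0.1, 0.9, 0.8, 0.2, 0.1, 0.7, 0.6]
kept_positions = select_windows(token_scores, window_size=2, budget=4)
```

Keeping whole windows means every retained token keeps its immediate neighbours, which is one plausible reading of "preserving semantic coherence" in the abstract.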

[NLP-86] MedPlan: A Two-Stage RAG-Based System for Personalized Medical Plan Generation

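[Quick read]: This paper addresses the main limitations of existing Large Language Model (LLM) applications to Electronic Health Records (EHR): treatment plans are generated in a single pass instead of following the sequential reasoning used by clinicians; patient-specific historical context is rarely incorporated; and subjective and objective clinical information are not effectively distinguished. Inspired by the SOAP methodology (Subjective, Objective, Assessment, Plan), the paper introduces MedPlan, a framework that structures LLM reasoning to match real clinical workflows. The key is a two-stage architecture: the first stage generates a clinical assessment from patient symptoms and objective data, and the second stage uses retrieval-augmented generation to formulate a structured treatment plan from that assessment and patient-specific information. Evaluation shows that the method significantly outperforms baselines in both assessment accuracy and treatment plan quality.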

Link: https://arxiv.org/abs/2503.17900
Authors: Hsin-Ling Hsu, Cong-Tinh Dao, Luning Wang, Zitao Shuai, Thao Nguyen Minh Phan, Jun-En Ding, Chun-Chieh Liao, Pengfei Hu, Xiaoxue Han, Chih-Ho Hsu, Dongsheng Luo, Wen-Chih Peng, Feng Liu, Fang-Ming Hung, Chenwei Wu
Affiliations: National Chengchi University; National Yang Ming Chiao Tung University; University of Michigan; Stevens Institute of Technology; Far Eastern Memorial Hospital; Florida International University
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Despite recent success in applying large language models (LLMs) to electronic health records (EHR), most systems focus primarily on assessment rather than treatment planning. We identify three critical limitations in current approaches: they generate treatment plans in a single pass rather than following the sequential reasoning process used by clinicians; they rarely incorporate patient-specific historical context; and they fail to effectively distinguish between subjective and objective clinical information. Motivated by the SOAP methodology (Subjective, Objective, Assessment, Plan), we introduce MedPlan, a novel framework that structures LLM reasoning to align with real-life clinician workflows. Our approach employs a two-stage architecture that first generates a clinical assessment based on patient symptoms and objective data, then formulates a structured treatment plan informed by this assessment and enriched with patient-specific information through retrieval-augmented generation. Comprehensive evaluation demonstrates that our method significantly outperforms baseline approaches in both assessment accuracy and treatment plan quality.

[NLP-87] Think Before Refusal: Triggering Safety Reflection in LLMs to Mitigate False Refusal Behavior

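[Quick read]: This paper addresses false refusal in Large Language Models (LLMs): after fine-tuning and human alignment, models behave harmlessly by rejecting harmful requests, but they may also wrongly reject benign queries such as “Tell me how to kill a Python process”. The key solution is the Think-Before-Refusal (TBR) schema together with safety-aware instruction fine-tuning that adds safety reflection before the response is generated, which reduces false refusals while maintaining safety and overall performance.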

Link: https://arxiv.org/abs/2503.17882
Authors: Shengyun Si, Xinpeng Wang, Guangyao Zhai, Nassir Navab, Barbara Plank
Affiliations: Technical University of Munich; Ludwig Maximilian University of Munich; Munich Center for Machine Learning
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 18 pages, 23 figures

Abstract:Recent advancements in large language models (LLMs) have demonstrated that fine-tuning and human alignment can render LLMs harmless. In practice, such “harmlessness” behavior is mainly achieved by training models to reject harmful requests, such as “Explain how to burn down my neighbor’s house”, where the model appropriately declines to respond. However, this approach can inadvertently result in false refusal, where models reject benign queries as well, such as “Tell me how to kill a Python process”. In this work, we demonstrate that prompting safety reflection before generating a response can mitigate false refusal behavior. Building on this finding, we introduce the Think-Before-Refusal (TBR) schema and conduct safety-aware instruction fine-tuning incorporating safety reflection. In an ablation study across 15 pre-trained models, we show that models fine-tuned with safety reflection significantly reduce false refusal behavior while maintaining safety and overall performance compared to those fine-tuned without safety reflection.

[NLP-88] Satisfactory Medical Consultation based on Terminology-Enhanced Information Retrieval and Emotional In-Context Learning

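[Quick read]: This paper addresses the gap between the performance of existing Large Language Models (LLMs) in medical consultation and the standard of professional consultations. The key solution is a new framework with two main modules: Terminology-Enhanced Information Retrieval (TEIR) and Emotional In-Context Learning (EICL). TEIR achieves implicit reasoning by using inductive knowledge and key terminology retrieval, overcomes the restricted domain knowledge of public databases, and can process long contexts. EICL memorizes semantic and attribute information from unlabelled corpora and applies controlled retrieval of the required information, helping to generate sentences with high attribute relevance. In addition, a dataset of 803,564 consultation records was compiled, significantly enhancing the model's capability for complex dialogues and proactive inquiry. Experiments show that the method effectively extends the context window length of existing LLMs and outperforms five baseline models on BLEU and ROUGE, with substantial leads in certain capabilities; ablation studies confirm the importance of the TEIR and EICL components.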

Link: https://arxiv.org/abs/2503.17876
Authors: Kaiwen Zuo, Jing Tang, Hanbing Qin, Binli Luo, Ligang He, Shiyan Tang
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments: The 46th European Conference on Information Retrieval Workshop

Abstract:Recent advancements in Large Language Models (LLMs) have marked significant progress in understanding and responding to medical inquiries. However, their performance still falls short of the standards set by professional consultations. This paper introduces a novel framework for medical consultation, comprising two main modules: Terminology-Enhanced Information Retrieval (TEIR) and Emotional In-Context Learning (EICL). TEIR ensures implicit reasoning through the utilization of inductive knowledge and key terminology retrieval, overcoming the limitations of restricted domain knowledge in public databases. Additionally, this module features capabilities for processing long context. The EICL module aids in generating sentences with high attribute relevance by memorizing semantic and attribute information from unlabelled corpora and applying controlled retrieval for the required information. Furthermore, a dataset comprising 803,564 consultation records was compiled in China, significantly enhancing the model’s capability for complex dialogues and proactive inquiry initiation. Comprehensive experiments demonstrate the proposed method’s effectiveness in extending the context window length of existing LLMs. The experimental outcomes and extensive data validate the framework’s superiority over five baseline models in terms of BLEU and ROUGE performance metrics, with substantial leads in certain capabilities. Notably, ablation studies confirm the significance of the TEIR and EICL components. In addition, our new framework has the potential to significantly improve patient satisfaction in real clinical consulting situations.

[NLP-89] Enhancing Retrieval Systems with Inference-Time Logical Reasoning

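[Quick read]: This paper addresses the poor performance of traditional retrieval methods on complex queries that contain logical constructs such as negation, conjunction, and disjunction. The key solution is a framework that incorporates explicit logical reasoning at inference time: it extracts the logical structure from the natural language query and composes the individual cosine similarity scores into the final document score, supporting complex logical reasoning without compromising computational efficiency.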

Link: https://arxiv.org/abs/2503.17860
Authors: Felix Faltings, Wei Wei, Yujia Bao
Affiliations: MIT; Center for Advanced AI, Accenture
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Traditional retrieval methods rely on transforming user queries into vector representations and retrieving documents based on cosine similarity within an embedding space. While efficient and scalable, this approach often fails to handle complex queries involving logical constructs such as negations, conjunctions, and disjunctions. In this paper, we propose a novel inference-time logical reasoning framework that explicitly incorporates logical reasoning into the retrieval process. Our method extracts logical reasoning structures from natural language queries and then composes the individual cosine similarity scores to formulate the final document scores. This approach enables the retrieval process to handle complex logical reasoning without compromising computational efficiency. Our results on both synthetic and real-world benchmarks demonstrate that the proposed method consistently outperforms traditional retrieval methods across different models and datasets, significantly improving retrieval performance for complex queries.
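
Composing per-term cosine similarities according to a query's logical structure might look like the sketch below. The abstract does not state which composition operators the paper uses, so the fuzzy-logic choices here (AND as minimum, OR as maximum, NOT as complement of a score rescaled to [0, 1]) are assumptions, as are the names `cosine` and `logical_score` and the toy two-dimensional embeddings.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def logical_score(expr, doc_vec, term_vecs):
    """Score a document against a parsed query.

    expr is a term name (str), ("NOT", child), or ("AND" | "OR", left, right).
    Leaf scores are cosine similarities rescaled from [-1, 1] to [0, 1].
    """
    if isinstance(expr, str):
        return (cosine(term_vecs[expr], doc_vec) + 1.0) / 2.0
    op = expr[0]
    if op == "NOT":
        return 1.0 - logical_score(expr[1], doc_vec, term_vecs)
    left = logical_score(expr[1], doc_vec, term_vecs)
    right = logical_score(expr[2], doc_vec, term_vecs)
    if op == "AND":
        return min(left, right)
    if op == "OR":
        return max(left, right)
    raise ValueError(f"unknown operator: {op}")


# Toy embeddings (hypothetical): one axis per concept.
term_vecs = {"cats": [1.0, 0.0], "dogs": [0.0, 1.0]}
query = ("AND", "cats", ("NOT", "dogs"))  # "cats but not dogs"
score_cats_only = logical_score(query, [1.0, 0.0], term_vecs)
score_cats_and_dogs = logical_score(query, [1.0, 1.0], term_vecs)
```

A single query embedding for "cats but not dogs" would typically sit close to both concepts, whereas the composed score penalises the document that also mentions dogs. Each leaf still needs only one cosine similarity, so the extra cost over plain dense retrieval is small.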

[NLP-90] Feather-SQL: A Lightweight NL2SQL Framework with Dual-Model Collaboration Paradigm for Small Language Models

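[Quick read]: This paper addresses two problems in Natural Language to SQL (NL2SQL): Large Language Models (LLMs) depend on closed-source systems and high computational resources, which raises data privacy and deployment challenges, while Small Language Models (SLMs) perform poorly on NL2SQL and are incompatible with existing frameworks. The paper proposes Feather-SQL, a new lightweight framework whose key idea is to improve SQL executability and accuracy through schema pruning and linking and through multi-path and multi-candidate generation. It also introduces the 1+1 Model Collaboration Paradigm, which pairs a strong general-purpose chat model with a fine-tuned SQL specialist to combine strong analytical reasoning with high-precision SQL generation. Experiments show that Feather-SQL improves NL2SQL performance on SLMs, with a boost of around 10% for models without fine-tuning, and the paradigm raises the accuracy ceiling of SLMs to 54.76%.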

Link: https://arxiv.org/abs/2503.17811
Authors: Wenqi Pei, Hailing Xu, Hengyuan Zhao, Shizheng Hou, Han Chen, Zining Zhang, Pingyi Luo, Bingsheng He
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Databases (cs.DB)
Comments:

Abstract:Natural Language to SQL (NL2SQL) has seen significant advancements with large language models (LLMs). However, these models often depend on closed-source systems and high computational resources, posing challenges in data privacy and deployment. In contrast, small language models (SLMs) struggle with NL2SQL tasks, exhibiting poor performance and incompatibility with existing frameworks. To address these issues, we introduce Feather-SQL, a new lightweight framework tailored for SLMs. Feather-SQL improves SQL executability and accuracy through 1) schema pruning and linking, 2) multi-path and multi-candidate generation. Additionally, we introduce the 1+1 Model Collaboration Paradigm, which pairs a strong general-purpose chat model with a fine-tuned SQL specialist, combining strong analytical reasoning with high-precision SQL generation. Experimental results on BIRD demonstrate that Feather-SQL improves NL2SQL performance on SLMs, with around 10% boost for models without fine-tuning. The proposed paradigm raises the accuracy ceiling of SLMs to 54.76%, highlighting its effectiveness.

[NLP-91] ParsiPy: NLP Toolkit for Historical Persian Texts in Python

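[Quick read]: This paper addresses the particular challenges of studying historical languages, which arise from complex orthographic systems, fragmentary textual evidence, and the absence of standardized digital representations. The key solution is ParsiPy, a natural language processing (NLP) toolkit that provides modules for tokenization, lemmatization, part-of-speech tagging, phoneme-to-transliteration conversion, and word embedding to support the analysis of historical Persian texts such as Parsig (Middle Persian). Experiments on Parsig texts demonstrate the toolkit's potential for extending computational methods to historical languages, contributing to computational philology and offering adaptable tools for the digital study and preservation of ancient texts.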

Link: https://arxiv.org/abs/2503.17810
Authors: Farhan Farsi, Parnian Fazel, Sepand Haghighi, Sadra Sabouri, Farzaneh Goshtasb, Nadia Hajipour, Ehsaneddin Asgari, Hossein Sameti
Affiliations: Amirkabir University of Technology; University of Tehran; Open Science Laboratory; University of Southern California; Institute for Humanities and Cultural Studies; Qatar Computing Research Institute; Sharif University of Technology
Categories: Computation and Language (cs.CL)
Comments: 13 pages, 6 figures, accepted into Second Workshop on Ancient Language Processing (ALP2025)

Abstract:The study of historical languages presents unique challenges due to their complex orthographic systems, fragmentary textual evidence, and the absence of standardized digital representations of text in those languages. Tackling these challenges needs special NLP digital tools to handle phonetic transcriptions and analyze ancient texts. This work introduces ParsiPy, an NLP toolkit designed to facilitate the analysis of historical Persian languages by offering modules for tokenization, lemmatization, part-of-speech tagging, phoneme-to-transliteration conversion, and word embedding. We demonstrate the utility of our toolkit through the processing of Parsig (Middle Persian) texts, highlighting its potential for expanding computational methods in the study of historical languages. Through this work, we contribute to computational philology, offering tools that can be adapted for the broader study of ancient texts and their digital preservation.

[NLP-92] Relation Extraction with Instance-Adapted Predicate Descriptions

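[Quick read]: This paper addresses the limited performance of smaller encoder models on relation extraction (RE). Although decoder-only large language models excel at generative tasks, smaller encoder models remain the mainstream architecture for RE. The paper proposes a novel dual-encoder architecture fine-tuned with a joint contrastive and cross-entropy loss. Its key innovation is a second encoder that computes instance-specific predicate representations by infusing them with the real entity spans of the input instance, instead of relying on a fixed linear layer. With a simple formulation, the method improves F1 scores by 1% to 2% over existing methods on four datasets, and ablation studies confirm the importance of the designed components.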

Link: https://arxiv.org/abs/2503.17799
Authors: Yuhang Jiang, Ramakanth Kavuluru
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Relation extraction (RE) is a standard information extraction task playing a major role in downstream applications such as knowledge discovery and question answering. Although decoder-only large language models are excelling in generative tasks, smaller encoder models are still the go-to architecture for RE. In this paper, we revisit fine-tuning such smaller models using a novel dual-encoder architecture with a joint contrastive and cross-entropy loss. Unlike previous methods that employ a fixed linear layer for predicate representations, our approach uses a second encoder to compute instance-specific predicate representations by infusing them with real entity spans from corresponding input instances. We conducted experiments on two biomedical RE datasets and two general domain datasets. Our approach achieved F1 score improvements ranging from 1% to 2% over state-of-the-art methods with a simple but elegant formulation. Ablation studies justify the importance of various components built into the proposed architecture.

[NLP-93] Every Sample Matters: Leveraging Mixture-of-Experts and High-Quality Data for Efficient and Accurate Code LLM

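[Quick read]: This paper addresses the trade-off between comprehensive performance and ultimate efficiency in code large language models (Code LLMs). The key solution combines an efficient Mixture-of-Experts (MoE) architecture with a set of high-quality data curation methods (especially those based on program analytics) to build Ling-Coder-Lite, an efficient yet powerful code LLM. On 12 representative coding benchmarks, the model performs on par with state-of-the-art models of similar size, such as Qwen2.5-Coder-7B and DeepSeek-Coder-V2-Lite, while offering competitive latency and throughput. Thanks to the MoE architecture, deployment resources are reduced by 50% compared with a similar-sized dense model without performance loss. To facilitate further research, the authors open-source the models and a substantial portion of the high-quality data used in the annealing and post-training stages.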

Link: https://arxiv.org/abs/2503.17793
Authors: Codefuse, Ling Team: Wenting Cai, Yuchen Cao, Chaoyu Chen, Chen Chen, Siba Chen, Qing Cui, Peng Di, Junpeng Fang, Zi Gong, Ting Guo, Zhengyu He, Yang Huang, Cong Li, Jianguo Li, Zheng Li, Shijie Lian, BingChang Liu, Songshan Luo, Shuo Mao, Min Shen, Jian Wu, Jiaolong Yang, Wenjie Yang, Tong Ye, Hang Yu, Wei Zhang, Zhenduo Zhang, Hailin Zhao, Xunjin Zheng, Jun Zhou
Affiliations: Ant Group
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 20 pages, 6 figures

Abstract:Recent advancements in code large language models (LLMs) have demonstrated remarkable capabilities in code generation and understanding. It is still challenging to build a code LLM with comprehensive performance yet ultimate efficiency. Many attempts have been released in the open source community to break the trade-off between performance and efficiency, such as the Qwen Coder series and the DeepSeek Coder series. This paper introduces yet another attempt in this area, namely Ling-Coder-Lite. We leverage the efficient Mixture-of-Experts (MoE) architecture along with a set of high-quality data curation methods (especially those based on program analytics) to build an efficient yet powerful code LLM. Ling-Coder-Lite exhibits on-par performance on 12 representative coding benchmarks compared to state-of-the-art models of similar size, such as Qwen2.5-Coder-7B and DeepSeek-Coder-V2-Lite, while offering competitive latency and throughput. In practice, we achieve a 50% reduction in deployment resources compared to the similar-sized dense model without performance loss. To facilitate further research and development in this area, we open-source our models as well as a substantial portion of high-quality data for the annealing and post-training stages. The models and data can be accessed at this https URL.

[NLP-94] Energy-Aware LLMs: A step towards sustainable AI for downstream applications

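[Quick read]: This paper addresses the trade-off between high energy consumption and model performance for Large Language Models (LLMs) used in fault ticket analysis in communication networks. Its key contribution is an end-to-end pipeline showing that an appropriate combination of quantization and pruning can reduce energy consumption while significantly improving model performance.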

Link: https://arxiv.org/abs/2503.17783
Authors: Nguyen Phuc Tran, Brigitte Jaumard, Oscar Delgado
Affiliations: Concordia University; École de Technologie Supérieure
Categories: Performance (cs.PF); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: This work has been submitted to V. International Conference on Electrical, Computer and Energy Technologies (ICECET 2025) for possible publication

Abstract:Advanced Large Language Models (LLMs) have revolutionized various fields, including communication networks, sparking an innovation wave that has led to new applications and services, and significantly enhanced solution schemes. Despite all these impressive developments, most LLMs typically require huge computational resources, resulting in terribly high energy consumption. Thus, this research study proposes an end-to-end pipeline that investigates the trade-off between energy efficiency and model performance for an LLM during fault ticket analysis in communication networks. It further evaluates the pipeline performance using two real-world datasets for the tasks of root cause analysis and response feedback in a communication network. Our results show that an appropriate combination of quantization and pruning techniques is able to reduce energy consumption while significantly improving model performance.
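
The two compression techniques named in the abstract can be illustrated on a plain list of weights. This is a generic sketch of magnitude pruning followed by symmetric int8 quantization, not the paper's pipeline: the function names, the sparsity level, and the toy weights are assumptions, and energy consumption is not measured here.

```python
def magnitude_prune(weights, sparsity):
    """Zero out the fraction `sparsity` of weights with the smallest magnitude."""
    k = int(len(weights) * sparsity)
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = list(weights)
    for i in order[:k]:
        pruned[i] = 0.0
    return pruned


def quantize_int8(weights):
    """Symmetric per-tensor quantization to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


toy_weights = [0.05, -1.27, 0.6, -0.02]
pruned_weights = magnitude_prune(toy_weights, sparsity=0.5)
int8_weights, scale = quantize_int8(pruned_weights)
# Approximate reconstruction: each stored integer times the shared scale.
restored = [q * scale for q in int8_weights]
```

Pruning removes the weights that contribute least, and quantization stores the survivors in 8 bits plus one shared scale; how the paper combines and tunes the two steps is not described in the abstract.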

[NLP-95] Improving Preference Extraction In LLMs By Identifying Latent Knowledge Through Classifying Probes ACL

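[Quick read]: This paper addresses the limited effectiveness of Large Language Models (LLMs) as automated judges of text, which is caused by unintentional biases. The key solution is to use linear classifying probes, trained on the differences between contrasting pairs of prompts, to access the LLM's latent knowledge directly and extract more accurate preferences. Extensive experiments with models of varying size from four families and six diverse datasets show that both supervised and unsupervised probing consistently outperform traditional generation-based judgement at similar computational cost. The probes generalise under domain shift and can even outperform evaluators fine-tuned with the same amount of training data. The results suggest that linear probing is an accurate, robust, and computationally efficient approach to LLM-as-judge tasks, and that it offers interpretable insight into how models encode judgement-relevant knowledge.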

Link: https://arxiv.org/abs/2503.17755
Authors: Sharan Maiya, Yinhong Liu, Ramit Debnath, Anna Korhonen
Affiliations: Language Technology Lab, University of Cambridge; Cambridge Collective Intelligence & Design Group, University of Cambridge
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: preprint, submitted to ACL ARR 2025, 21 pages, 23 figures

Abstract:Large Language Models (LLMs) are often used as automated judges to evaluate text, but their effectiveness can be hindered by various unintentional biases. We propose using linear classifying probes, trained by leveraging differences between contrasting pairs of prompts, to directly access LLMs’ latent knowledge and extract more accurate preferences. Through extensive experiments using models of varying size from four different families and six diverse datasets assessing text quality evaluation and common sense reasoning, we demonstrate that both supervised and unsupervised probing approaches consistently outperform traditional generation-based judgement while maintaining similar computational costs. These probes generalise under domain shifts and can even outperform finetuned evaluators with the same training data size. Our results suggest linear probing offers an accurate, robust and computationally efficient approach for LLM-as-judge tasks while providing interpretable insights into how models encode judgement-relevant knowledge. Our data and code will be openly released in the future.
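
A supervised linear probe trained on differences between contrasting prompt pairs might be sketched as follows. Everything here is an assumption made for illustration: the activations are synthetic stand-ins for LLM hidden states, `train_probe` is plain logistic regression without a bias term, and the unsupervised variant mentioned in the abstract is not shown.

```python
import math
import random


def train_probe(diffs, labels, lr=0.5, steps=200):
    """Fit logistic-regression weights on activation differences.

    diffs[i] stands for h(prompt claiming "A is better") - h(prompt claiming
    "B is better"); labels[i] is 1 if A is actually preferred, else 0.
    """
    dim = len(diffs[0])
    w = [0.0] * dim
    for _ in range(steps):
        grad = [0.0] * dim
        for x, y in zip(diffs, labels):
            z = sum(wi * xi for wi, xi in zip(w, x))
            p = 1.0 / (1.0 + math.exp(-z))
            for i in range(dim):
                grad[i] += (p - y) * x[i]
        w = [wi - lr * g / len(diffs) for wi, g in zip(w, grad)]
    return w


def predict(w, diff):
    """Return 1 if the probe prefers A, else 0."""
    return 1 if sum(wi * xi for wi, xi in zip(w, diff)) > 0 else 0


# Synthetic data: a hidden "preference direction" plus Gaussian noise.
rng = random.Random(0)
direction = [1.0, -1.0, 0.5]
diffs, labels = [], []
for _ in range(100):
    y = rng.randint(0, 1)
    sign = 1.0 if y == 1 else -1.0
    diffs.append([sign * d + rng.gauss(0.0, 0.3) for d in direction])
    labels.append(y)

probe = train_probe(diffs, labels)
accuracy = sum(predict(probe, x) == y for x, y in zip(diffs, labels)) / len(labels)
```

Working with the difference of the two activations cancels features shared by both prompts of a pair, so the probe only has to find the direction that separates "A is better" from "B is better".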

[NLP-96] Building Resource-Constrained Language Agents: A Korean Case Study on Chemical Toxicity Information

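[Quick read]: This paper addresses the challenges of deploying language agents in resource-constrained environments, particularly for specialized domains and less common languages. The key solutions are: 1) a context-efficient architecture that reduces token consumption through hierarchical section search; and 2) a scenario-based dialogue generation methodology that effectively distills tool-using capabilities from larger models. Experiments show that the fine-tuned 8B-parameter model substantially outperforms both untuned models and baseline approaches in DB faithfulness and user preference.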

Link: https://arxiv.org/abs/2503.17753
Authors: Hojun Cho, Donghu Kim, Soyoung Yang, Chan Lee, Hunjoo Lee, Jaegul Choo
Affiliations: KAIST AI; CHEM. I. NET
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Preprint

Abstract:Language agents powered by large language models (LLMs) face significant deployment challenges in resource-constrained environments, particularly for specialized domains and less-common languages. This paper presents Tox-chat, a Korean chemical toxicity information agent devised within these limitations. We propose two key innovations: a context-efficient architecture that reduces token consumption through hierarchical section search, and a scenario-based dialogue generation methodology that effectively distills tool-using capabilities from larger models. Experimental evaluations demonstrate that our fine-tuned 8B parameter model substantially outperforms both untuned models and baseline approaches, in terms of DB faithfulness and preference. Our work offers valuable insights for researchers developing domain-specific language agents under practical constraints.

[NLP-97] Enhancing Arabic Automated Essay Scoring with Synthetic Data and Error Injection

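[Quick read]: This paper addresses the challenge that Arabic Automated Essay Scoring (AES) systems face from the lack of annotated essay datasets. The key to the solution is a novel framework that uses large language models (LLMs) and Transformers to generate synthetic Arabic essay datasets, and introduces controlled error injection using a fine-tuned Standard Arabic BERT model for error type prediction. The approach yields a dataset of 3,040 annotated essays, and the authors also develop a BERT-based auto-marking system for accurate and scalable Arabic essay evaluation. Experimental results confirm the framework's effectiveness in improving Arabic AES performance.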

Link: https://arxiv.org/abs/2503.17739
Authors: Chatrine Qwaider, Bashar Alhafni, Kirill Chirkunov, Nizar Habash, Ted Briscoe
Affiliations: MBZUAI; New York University Abu Dhabi
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Automated Essay Scoring (AES) plays a crucial role in assessing language learners’ writing quality, reducing grading workload, and providing real-time feedback. Arabic AES systems are particularly challenged by the lack of annotated essay datasets. This paper presents a novel framework leveraging Large Language Models (LLMs) and Transformers to generate synthetic Arabic essay datasets for AES. We prompt an LLM to generate essays across CEFR proficiency levels and introduce controlled error injection using a fine-tuned Standard Arabic BERT model for error type prediction. Our approach produces realistic human-like essays, contributing a dataset of 3,040 annotated essays. Additionally, we develop a BERT-based auto-marking system for accurate and scalable Arabic essay evaluation. Experimental results demonstrate the effectiveness of our framework in improving Arabic AES performance.

[NLP-98] V2P-Bench: Evaluating Video-Language Understanding with Visual Prompts for Better Human-Model Interaction

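[Quick read]: This paper tackles the over-reliance of existing video understanding benchmarks on complex text prompts, which often fail to provide precise spatial and temporal references and thereby degrade the experience and efficiency of human-model interaction. To address this limitation, it proposes the Video Visual Prompt Benchmark (V2P-Bench), a comprehensive benchmark designed to evaluate the video understanding capabilities of large vision-language models (LVLMs) in multimodal human-model interaction scenarios. V2P-Bench comprises 980 unique videos and 1,172 QA pairs covering 5 main tasks and 12 dimensions, supporting instance-level, fine-grained understanding aligned with human cognition. Benchmarking shows that even the most powerful current models handle video visual prompts poorly, performing well below human experts, which highlights the shortcomings of LVLMs in this area. V2P-Bench is intended as a foundation for advancing multimodal human-model interaction and video understanding evaluation.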

Link: https://arxiv.org/abs/2503.17736
Authors: Yiming Zhao, Yu Zeng, Yukun Qi, YaoYang Liu, Lin Chen, Zehui Chen, Xikun Bao, Jie Zhao, Feng Zhao
Affiliations: University of Science and Technology of China
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:Large Vision-Language Models (LVLMs) have made significant progress in the field of video understanding recently. However, current benchmarks uniformly lean on text prompts for evaluation, which often necessitate complex referential language and fail to provide precise spatial and temporal references. This limitation diminishes the experience and efficiency of human-model interaction. To address this limitation, we propose the Video Visual Prompt Benchmark(V2P-Bench), a comprehensive benchmark specifically designed to evaluate LVLMs’ video understanding capabilities in multimodal human-model interaction scenarios. V2P-Bench includes 980 unique videos and 1,172 QA pairs, covering 5 main tasks and 12 dimensions, facilitating instance-level fine-grained understanding aligned with human cognition. Benchmarking results reveal that even the most powerful models perform poorly on V2P-Bench (65.4% for GPT-4o and 67.9% for Gemini-1.5-Pro), significantly lower than the human experts’ 88.3%, highlighting the current shortcomings of LVLMs in understanding video visual prompts. We hope V2P-Bench will serve as a foundation for advancing multimodal human-model interaction and video understanding evaluation. Project page: this https URL.

[NLP-99] Can LLMs Automate Fact-Checking Article Writing?

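[Quick read]: This paper addresses a shortcoming of automatic fact-checking systems: they do not produce output suitable for broad dissemination to the general public. Current systems typically give little or no justification for their assessments, whereas human fact-checkers communicate their findings through fact-checking articles. To bridge this gap, the paper extends the typical automatic fact-checking pipeline with QRAFT, an LLM-based agentic framework that mimics the writing workflow of human fact-checkers to generate full fact-checking articles automatically. The authors identify the key desiderata for such articles through expert interviews and assess QRAFT's practical usefulness through human evaluation by professional fact-checkers. QRAFT outperforms several previously proposed text-generation approaches but still lags considerably behind expert-written articles.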

Link: https://arxiv.org/abs/2503.17684
Authors: Dhruv Sahnan, David Corney, Irene Larraz, Giovanni Zagni, Ruben Miguez, Zhuohan Xie, Iryna Gurevych, Elizabeth Churchill, Tanmoy Chakraborty, Preslav Nakov
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 10 pages, 4 figures, 6 tables

Abstract:Automatic fact-checking aims to support professional fact-checkers by offering tools that can help speed up manual fact-checking. Yet, existing frameworks fail to address the key step of producing output suitable for broader dissemination to the general public: while human fact-checkers communicate their findings through fact-checking articles, automated systems typically produce little or no justification for their assessments. Here, we aim to bridge this gap. We argue for the need to extend the typical automatic fact-checking pipeline with automatic generation of full fact-checking articles. We first identify key desiderata for such articles through a series of interviews with experts from leading fact-checking organizations. We then develop QRAFT, an LLM-based agentic framework that mimics the writing workflow of human fact-checkers. Finally, we assess the practical usefulness of QRAFT through human evaluations with professional fact-checkers. Our evaluation shows that while QRAFT outperforms several previously proposed text-generation approaches, it lags considerably behind expert-written articles. We hope that our work will enable further research in this new and important direction.

[NLP-100] Enhancing Persona Consistency for LLMs' Role-Playing using Persona-Aware Contrastive Learning

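[Quick read]: This paper addresses the problem that large language models (LLMs) lack emotion and fine-grained role awareness in role-playing dialogue generation, which limits their ability to provide personalized and diverse interactions. In addition, existing methods face high costs in collecting high-quality annotated data, and traditional human alignment methods are hard to deploy because of the inherent diversity of model behavior in role-playing scenarios. To address these issues, the paper revisits model behavior from the perspective of persona alignment and proposes an annotation-free framework, Persona-Aware Contrastive Learning (PCL), to improve the consistency of LLM behavior during role-playing.

The solution has two key parts. First, a role chain method encourages the model to self-question based on role characteristics and dialogue context, so that it adjusts and maintains personality consistency. Second, iterative contrastive learning between responses generated with and without role characteristics further improves the model's role-playing strategy. Experiments on both black-box and white-box LLMs show that LLMs equipped with PCL significantly outperform vanilla LLMs under automatic evaluation (CharEval and GPT-4) and human expert evaluation.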

Link: https://arxiv.org/abs/2503.17662
Authors: Ke Ji, Yixin Lian, Linxu Li, Jingsheng Gao, Weiyuan Li, Bin Dai
Affiliations: The Chinese University of Hong Kong, Shenzhen; Xiaobing.AI; Shanghai Jiao Tong University
Categories: Computation and Language (cs.CL)
Comments: 18 pages, 4 figures

Abstract:In recent years, large language models (LLMs) have achieved breakthrough progress in many dialogue generation tasks. However, their lack of emotion and fine-grained role awareness further limits the model’s ability to provide personalized and diverse interactions. Current methods face high costs in collecting high-quality annotated data for scenarios such as role-playing, and traditional human alignment methods are difficult to deploy due to the inherent diversity of model behavior in role-playing scenarios. Inspired by the alignment of models for safety behaviors through RLHF (Reinforcement Learning from Human Feedback), in this paper, we revisit model role-playing behavior from the perspective of persona alignment and propose a novel annotation-free framework named Persona-Aware Contrastive Learning (PCL) to align LLMs’ behavior during role-playing, enhancing the model’s role consistency. Specifically, we first design a role chain method to encourage the model to self-question based on the role characteristics and dialogue context to adjust personality consistency. Then, we further enhance the model’s role-playing strategy through iterative contrastive learning between the use of role characteristics and not. Experiments on both black-box and white-box LLMs show that LLMs equipped with PCL significantly outperform vanilla LLMs under automatic evaluation methods (CharEval & GPT-4) and human expert evaluation.

[NLP-101] FairFlow: Mitigating Dataset Biases through Undecided Learning EMNLP2024

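[Quick read]: This paper addresses the susceptibility of language models to dataset biases, that is, shortcuts and spurious correlations in the data that often cause performance to drop on new data. It proposes a debiasing framework called FairFlow, whose key idea is to mitigate dataset biases by learning to be undecided in its predictions for data samples or representations associated with known or unknown biases. FairFlow introduces two key components: a suite of data and model perturbation operations that generate different biased views of input samples, and a contrastive objective that learns debiased and robust representations from those biased views. Experiments show that FairFlow outperforms existing debiasing methods, particularly on out-of-domain and hard test samples, without compromising in-domain performance.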

Link: https://arxiv.org/abs/2503.17632
Authors: Jiali Cheng, Hadi Amiri
Affiliations: University of Massachusetts Lowell
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: EMNLP 2024

Abstract:Language models are prone to dataset biases, known as shortcuts and spurious correlations in data, which often result in performance drop on new data. We present a new debiasing framework called “FairFlow” that mitigates dataset biases by learning to be undecided in its predictions for data samples or representations associated with known or unknown biases. The framework introduces two key components: a suite of data and model perturbation operations that generate different biased views of input samples, and a contrastive objective that learns debiased and robust representations from the resulting biased views of samples. Experiments show that FairFlow outperforms existing debiasing methods, particularly against out-of-domain and hard test samples, without compromising the in-domain performance.

[NLP-102] GPBench: A Comprehensive and Fine-Grained Benchmark for Evaluating Large Language Models as General Practitioners

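[Quick read]: This paper addresses the lack of adequate evaluation for large language models (LLMs) that support the clinical decision-making of general practitioners (GPs). Most existing benchmarks and evaluation frameworks focus on exam-style multiple-choice questions and lack comprehensive assessment sets that reflect the real-world scenarios GPs encounter in daily work. To fill this gap, the paper designs GPBench, which consists of test questions drawn from clinical practice together with a novel evaluation framework. Its key elements are a question set meticulously annotated by experts and an evaluation framework based on the competency model for general practice, which together measure how LLMs perform in realistic GP working scenarios. The evaluation reveals at least ten major shortcomings of current mainstream LLMs in areas such as disease staging, complication recognition, treatment detail, and medication usage, indicating that these models are not yet suitable for independent use in real-world GP settings without human oversight.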

Link: https://arxiv.org/abs/2503.17599
Authors: Zheqing Li, Yiying Yang, Jiping Lang, Wenhao Jiang, Yuhang Zhao, Shuang Li, Dingqian Wang, Zhu Lin, Xuanna Li, Yuze Tang, Jiexian Qiu, Xiaolin Lu, Hongji Yu, Shuang Chen, Yuhua Bi, Xiaofei Zeng, Yixian Chen, Junrong Chen, Lin Yao
Affiliations: The Sixth Affiliated Hospital of Sun Yat-sen University; Guangdong Laboratory of Artificial Intelligence and Digital Economy (SZ); Xinyi People’s Hospital; School of Intelligent Systems Engineering, Sun Yat-sen University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:General practitioners (GPs) serve as the cornerstone of primary healthcare systems by providing continuous and comprehensive medical services. However, due to the community-oriented nature of their practice, uneven training and resource gaps, the clinical proficiency among GPs can vary significantly across regions and healthcare settings. Currently, Large Language Models (LLMs) have demonstrated great potential in clinical and medical applications, making them a promising tool for supporting general practice. However, most existing benchmarks and evaluation frameworks focus on exam-style assessments, typically multiple-choice questions, and lack comprehensive assessment sets that accurately mirror the real-world scenarios encountered by GPs. To evaluate how effectively LLMs can make decisions in the daily work of GPs, we designed GPBench, which consists of both test questions from clinical practice and a novel evaluation framework. The test set includes multiple-choice questions that assess fundamental knowledge of general practice, as well as realistic, scenario-based problems. All questions are meticulously annotated by experts, incorporating rich fine-grained information related to clinical management. The proposed LLM evaluation framework is based on the competency model for general practice, providing a comprehensive methodology for assessing LLM performance in real-world settings. As the first large-model evaluation set targeting GP decision-making scenarios, GPBench allows us to evaluate current mainstream LLMs. Expert assessment and evaluation reveal that in areas such as disease staging, complication recognition, treatment detail, and medication usage, these models exhibit at least ten major shortcomings. Overall, existing LLMs are not yet suitable for independent use in real-world GP working scenarios without human oversight.

[NLP-103] Leveraging Human Production-Interpretation Asymmetries to Test LLM Cognitive Plausibility

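[Quick read]: This paper asks whether large language models (LLMs) process language in a way similar to humans. It examines the production-interpretation distinction found in human sentence processing and evaluates the extent to which instruction-tuned LLMs replicate it. The key to the approach is to use the empirically documented asymmetry between production and interpretation for implicit causality verbs in humans as a testbed, and to analyze how model size and the choice of meta-linguistic prompts affect model behavior, showing that some LLMs reflect human-like asymmetries both quantitatively and qualitatively.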

Link: https://arxiv.org/abs/2503.17579
Authors: Suet-Ying Lam, Qingcheng Zeng, Jingyi Wu, Rob Voigt
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Whether large language models (LLMs) process language similarly to humans has been the subject of much theoretical and practical debate. We examine this question through the lens of the production-interpretation distinction found in human sentence processing and evaluate the extent to which instruction-tuned LLMs replicate this distinction. Using an empirically documented asymmetry between production and interpretation in humans for implicit causality verbs as a testbed, we find that some LLMs do quantitatively and qualitatively reflect human-like asymmetries between production and interpretation. We demonstrate that whether this behavior holds depends upon both model size, with larger models more likely to reflect human-like patterns, and the choice of meta-linguistic prompts used to elicit the behavior.

[NLP-104] Bayesian Teaching Enables Probabilistic Reasoning in Large Language Models

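[Quick read]: This paper addresses the limited ability of contemporary large language models (LLMs) to construct internal representations of the world and form probabilistic beliefs about them. In particular, when inferring a user's preferences over multiple interactions, their predictions do not improve as expected as more information becomes available, and improve even less than those of humans. Evaluating LLM behavior against the Bayesian inference framework, the paper finds that LLMs do not update their beliefs in the way that framework prescribes.

The key to the solution is to train LLMs to mimic the predictions of an optimal Bayesian model, so that they learn to reason in a Bayesian manner. This not only significantly improves performance on the specific recommendation task used for training but also generalizes to other tasks, suggesting that the method endows LLMs with broader Bayesian reasoning skills. More generally, the results indicate that LLMs can learn reasoning strategies effectively and apply them in new domains.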

Link: https://arxiv.org/abs/2503.17523
Authors: Linlu Qiu, Fei Sha, Kelsey Allen, Yoon Kim, Tal Linzen, Sjoerd van Steenkiste
Affiliations: MIT; Google Research; Google DeepMind; MIT; Google Research; Google Research
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Artificial intelligence systems based on large language models (LLMs) are increasingly used as agents that interact with users and with the world. To do so successfully, LLMs need to construct internal representations of the world and form probabilistic beliefs about those representations. To provide a user with personalized recommendations, for example, the LLM needs to gradually infer the user’s preferences, over the course of multiple interactions. To evaluate whether contemporary LLMs are able to do so, we use the Bayesian inference framework from probability theory, which lays out the optimal way to update an agent’s beliefs as it receives new information. We first show that the LLMs do not update their beliefs as expected from the Bayesian framework, and that consequently their predictions do not improve as expected as more information becomes available, even less so than we find is the case for humans. To address this issue, we teach the LLMs to reason in a Bayesian manner by training them to mimic the predictions of an optimal Bayesian model. We find that this approach not only significantly improves the LLM’s performance on the particular recommendation task it is trained on, but also enables generalization to other tasks. This suggests that this method endows the LLM with broader Bayesian reasoning skills. More generally, our results indicate that LLMs can learn about reasoning strategies effectively and generalize those skills to new domains, which in part explains LLMs’ empirical success.
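
As a reminder of what updating beliefs "as expected from the Bayesian framework" means, the following is a small worked sketch of Bayes' rule applied to a toy preference-inference problem. The two hypotheses and the likelihood values are invented for illustration and are not taken from the paper's recommendation task.

```python
def bayes_update(prior, likelihood):
    """Return the posterior over hypotheses after one observation.

    prior:      dict mapping hypothesis -> P(hypothesis)
    likelihood: dict mapping hypothesis -> P(observation | hypothesis)
    """
    unnormalized = {h: prior[h] * likelihood[h] for h in prior}
    evidence = sum(unnormalized.values())
    return {h: value / evidence for h, value in unnormalized.items()}


# Hypothetical setup: the user either prefers cheap options or fast ones.
belief = {"prefers_cheap": 0.5, "prefers_fast": 0.5}

# Assumed likelihood of observing "user picked the cheap option".
chose_cheap = {"prefers_cheap": 0.8, "prefers_fast": 0.2}

after_one = bayes_update(belief, chose_cheap)
after_two = bayes_update(after_one, chose_cheap)
print(after_one)
print(after_two)
```

Each consistent observation sharpens the belief, which is the behavior an ideal Bayesian agent shows as interactions accumulate.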

[NLP-105] Language Models May Verbatim Complete Text They Were Not Explicitly Trained On

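[Quick read]: This paper examines how to determine whether a given text was used to train a large language model (LLM). A completion test is often used for this purpose, and it requires a ground-truth definition of membership, most commonly based on n-gram overlap between the target text and any text in the dataset. The paper shows that this n-gram based membership definition can be gamed: completion tests can still succeed on sequences that are non-members for a given n. By retraining LLMs from scratch after removing all training samples that were completed, the authors find many natural cases of this phenomenon, including exact duplicates, near-duplicates, and even short overlaps, which shows how hard it is to find a single viable choice of n. Building on these insights, they design adversarial datasets that cause a given target sequence to be completed without containing it, for any reasonable choice of n. The key contribution is to expose the inadequacy of n-gram membership definitions, which fail to account for auxiliary information available to the training algorithm.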

Link: https://arxiv.org/abs/2503.17514
Authors: Ken Ziyu Liu, Christopher A. Choquette-Choo, Matthew Jagielski, Peter Kairouz, Sanmi Koyejo, Percy Liang, Nicolas Papernot
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Comments: Main text: 9 pages, 7 figures, 1 table. Appendix: 29 pages, 20 tables, 15 figures

Abstract:An important question today is whether a given text was used to train a large language model (LLM). A completion test is often employed: check if the LLM completes a sufficiently complex text. This, however, requires a ground-truth definition of membership; most commonly, it is defined as a member based on the n-gram overlap between the target text and any text in the dataset. In this work, we demonstrate that this n-gram based membership definition can be effectively gamed. We study scenarios where sequences are non-members for a given n and we find that completion tests still succeed. We find many natural cases of this phenomenon by retraining LLMs from scratch after removing all training samples that were completed; these cases include exact duplicates, near-duplicates, and even short overlaps. They showcase that it is difficult to find a single viable choice of n for membership definitions. Using these insights, we design adversarial datasets that can cause a given target sequence to be completed without containing it, for any reasonable choice of n. Our findings highlight the inadequacy of n-gram membership, suggesting membership definitions fail to account for auxiliary information available to the training algorithm.
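
To make the dependence on n concrete, here is a minimal sketch of one way an n-gram membership check could be written. It assumes whitespace tokenization and treats a target as a member when it shares at least one n-gram with any document. The paper's exact definition and tokenizer may differ, and the sentences are invented examples.

```python
def ngrams(tokens, n):
    """Return the set of contiguous n-grams in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def is_ngram_member(target, corpus, n):
    """True if the target shares at least one n-gram with any corpus document."""
    target_grams = ngrams(target.split(), n)
    return any(target_grams & ngrams(doc.split(), n) for doc in corpus)


target = "the quick brown fox jumps over the lazy dog"
corpus = ["the quick brown cat jumps over the sleepy dog"]

# A near-duplicate counts as a member for small n but not for larger n.
print(is_ngram_member(target, corpus, 3))
print(is_ngram_member(target, corpus, 4))
```

The same near-duplicate flips between member and non-member as n changes, which is the kind of sensitivity to the choice of n that the abstract points to.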

[NLP-106] Follow-up Question Generation For Enhanced Patient-Provider Conversations

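[Quick read]: This paper addresses two core challenges of follow-up question generation in asynchronous medical conversations: (1) extracting relevant information buried in fragmented data sources, and (2) modeling parallel thought processes. In medical settings, a doctor asks questions based not only on patient utterances but also on the patient's electronic health record (EHR) data and current diagnostic hypotheses, and in asynchronous conversations the doctor can rely only on static EHR information to motivate follow-up questions. To address these challenges, the paper proposes FollowupQ, a multi-agent framework that processes patient messages and EHR data to generate personalized follow-up questions that clarify patient-reported medical conditions. The key lies in its multi-agent design and its integration of EHR data: FollowupQ reduces the follow-up communications required of providers by 34% and improves performance by 17% and 5% on real and synthetic data, respectively. The paper also releases the first public dataset of asynchronous medical messages with linked EHR data, alongside 2,300 follow-up questions written by clinical experts, for the wider natural language processing (NLP) research community.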

Link: https://arxiv.org/abs/2503.17509
Authors: Joseph Gatto, Parker Seegmiller, Timothy Burdick, Inas S. Khayal, Sarah DeLozier, Sarah M. Preum
Affiliations: Department of Computer Science, Dartmouth College
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 17 Pages, 7 Figures, 6 Tables

Abstract:Follow-up question generation is an essential feature of dialogue systems as it can reduce conversational ambiguity and enhance modeling complex interactions. Conversational contexts often pose core NLP challenges such as (i) extracting relevant information buried in fragmented data sources, and (ii) modeling parallel thought processes. These two challenges occur frequently in medical dialogue as a doctor asks questions based not only on patient utterances but also their prior EHR data and current diagnostic hypotheses. Asking medical questions in asynchronous conversations compounds these issues as doctors can only rely on static EHR information to motivate follow-up questions. To address these challenges, we introduce FollowupQ, a novel framework for enhancing asynchronous medical conversation. FollowupQ is a multi-agent framework that processes patient messages and EHR data to generate personalized follow-up questions, clarifying patient-reported medical conditions. FollowupQ reduces requisite provider follow-up communications by 34%. It also improves performance by 17% and 5% on real and synthetic data, respectively. We also release the first public dataset of asynchronous medical messages with linked EHR data alongside 2,300 follow-up questions written by clinical experts for the wider NLP research community.

[NLP-107] Large Language Models (LLMs) for Source Code Analysis: applications, models and datasets

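[Quick read]: This paper surveys the use of large language models (LLMs) for source code analysis. It explores their role in different code analysis tasks, focusing on three key aspects: what they can analyze and their applications, what models are used, and what datasets are used together with the challenges they face. By examining scholarly articles in this area, the paper uncovers research developments, current trends, and the intellectual structure of the field, and it summarizes limitations and highlights essential tools, datasets, and key challenges as a reference for future work.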

Link: https://arxiv.org/abs/2503.17502
Authors: Hamed Jelodar, Mohammad Meymani, Roozbeh Razavi-Far
Affiliations: Canadian Institute for Cybersecurity, University of New Brunswick
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:Large language models (LLMs) and transformer-based architectures are increasingly utilized for source code analysis. As software systems grow in complexity, integrating LLMs into code analysis workflows becomes essential for enhancing efficiency, accuracy, and automation. This paper explores the role of LLMs for different code analysis tasks, focusing on three key aspects: 1) what they can analyze and their applications, 2) what models are used and 3) what datasets are used, and the challenges they face. Regarding the goal of this research, we investigate scholarly articles that explore the use of LLMs for source code analysis to uncover research developments, current trends, and the intellectual structure of this emerging field. Additionally, we summarize limitations and highlight essential tools, datasets, and key challenges, which could be valuable for future work.

[NLP-108] Variance Control via Weight Rescaling in LLM Pre-training

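[Quick read]: This paper addresses the strong dependence of large language model (LLM) pre-training outcomes on weight initialization and variance control strategies. Although the importance of initial variance control is well documented for neural networks in general, the literature on initialization and on managing variance growth during LLM pre-training specifically is sparse. The paper introduces the Layer Index Rescaling (LIR) weight initialization scheme and the Target Variance Rescaling (TVR) variance control strategy. In experiments on a 1B-parameter LLaMA model, better variance management with these techniques improves downstream task performance (by up to 4.6% on common pre-training benchmarks) and reduces extreme activation values, which mitigates challenges associated with quantization and low-precision training.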

Link: https://arxiv.org/abs/2503.17500
Authors: Louis Owen, Abhay Kumar, Nilabhra Roy Chowdhury, Fabian Güra
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL); Machine Learning (stat.ML)
Comments:

Abstract:The outcome of Large Language Model (LLM) pre-training strongly depends on weight initialization and variance control strategies. Although the importance of initial variance control has been well documented in neural networks in general, the literature on initialization and management of its growth during LLM pre-training, specifically, is somewhat sparse. In this paper, we introduce the Layer Index Rescaling (LIR) weight initialization scheme, and the Target Variance Rescaling (TVR) variance control strategy. Experiments on a 1B parameter LLaMA model demonstrate that better variance management using these techniques yields substantial improvements in downstream task performance (up to 4.6% on common pre-training benchmarks) and reduces extreme activation values, thus mitigating challenges associated with quantization and low-precision training. Our code is available at: this https URL.
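
The abstract names Target Variance Rescaling but does not describe its mechanics, so the following is only a generic sketch of what rescaling weights to a target variance can look like, not the authors' procedure. The function name `rescale_to_target_variance`, the flat weight list, and the target standard deviation of 0.02 are all assumptions made for illustration.

```python
import math
import random
import statistics


def rescale_to_target_variance(weights, target_variance):
    """Scale every weight by one factor so the population variance hits the target.

    Multiplying all values by c multiplies their variance by c**2,
    so c = sqrt(target_variance / current_variance).
    """
    current_variance = statistics.pvariance(weights)
    scale = math.sqrt(target_variance / current_variance)
    return [w * scale for w in weights]


rng = random.Random(0)

# Hypothetical weight tensor, flattened, whose variance has drifted upward.
weights = [rng.gauss(0.0, 0.5) for _ in range(1000)]

rescaled = rescale_to_target_variance(weights, target_variance=0.02 ** 2)
print(statistics.pvariance(weights), statistics.pvariance(rescaled))
```

How often such a rescaling would be applied during pre-training, and to which parameters, is left open here because the abstract does not say.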

[NLP-109] Judge Anything: MLLM as a Judge Across Any Modality

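[Quick read]: This paper addresses the challenge of evaluating generative foundation models on open-ended multimodal understanding (MMU) and multimodal generation (MMG) tasks, where the complexity of cross-modal interactions makes evaluation difficult. The key idea is to use multimodal large language models (MLLMs) as automated judges (MLLM-as-a-Judge) and to introduce two benchmarks, TaskAnything and JudgeAnything, which respectively evaluate the overall performance and the judging capabilities of MLLMs across any-to-any modality tasks. TaskAnything evaluates MMU and MMG capabilities across 15 any-to-any modality categories, while JudgeAnything evaluates the judging capabilities of advanced MLLMs, including GPT-4o and Gemini-2.0-Flash, from the perspectives of Pair Comparison and Score Evaluation. Experiments show that these MLLMs are promising judges of MMU tasks but face significant challenges on MMG tasks, exposing cross-modality biases and hallucination issues. In response, the paper presents OmniArena, an automated platform for evaluating omni-models and multimodal reward models, and it stresses the need for fairer evaluation protocols and stronger alignment with human preferences.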

Link: https://arxiv.org/abs/2503.17489
Authors: Shu Pu, Yaochen Wang, Dongping Chen, Yuhang Chen, Guohao Wang, Qi Qin, Zhongyi Zhang, Zhiyuan Zhang, Zetong Zhou, Shuang Gong, Yi Gui, Yao Wan, Philip S. Yu
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Evaluating generative foundation models on open-ended multimodal understanding (MMU) and generation (MMG) tasks across diverse modalities (e.g., images, audio, video) poses significant challenges due to the complexity of cross-modal interactions. To this end, the idea of utilizing Multimodal LLMs (MLLMs) as automated judges has emerged, with encouraging results in assessing vision-language understanding tasks. Moving further, this paper extends MLLM-as-a-Judge across modalities in a unified manner by introducing two benchmarks, TaskAnything and JudgeAnything, to respectively evaluate the overall performance and judging capabilities of MLLMs across any-to-any modality tasks. Specifically, TaskAnything evaluates the MMU and MMG capabilities across 15 any-to-any modality categories, employing 1,500 queries curated from well-established benchmarks. Furthermore, JudgeAnything evaluates the judging capabilities of 5 advanced MLLMs (e.g., GPT-4o and Gemini-2.0-Flash) from the perspectives of Pair Comparison and Score Evaluation, providing a standardized testbed that incorporates human judgments and detailed rubrics. Our extensive experiments reveal that while these MLLMs show promise in assessing MMU (i.e., achieving an average of 66.55% in Pair Comparison setting and 42.79% in Score Evaluation setting), they encounter significant challenges with MMG tasks (i.e., averaging only 53.37% in Pair Comparison setting and 30.05% in Score Evaluation setting), exposing cross-modality biases and hallucination issues. To address this, we present OmniArena, an automated platform for evaluating omni-models and multimodal reward models. Our work highlights the need for fairer evaluation protocols and stronger alignment with human preferences. The source code and dataset are publicly available at: this https URL.

[NLP-110] SaudiCulture: A Benchmark for Evaluating Large Language Models Cultural Competence within Saudi Arabia

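[Quick read]: This paper addresses the difficulty large language models (LLMs) have in capturing and reflecting cultural nuances, focusing on Saudi Arabia, a country with diverse dialects and rich cultural traditions. Its key contribution is SaudiCulture, a new benchmark for evaluating the cultural competence of LLMs within the distinct geographical and cultural contexts of Saudi Arabia. SaudiCulture is a comprehensive set of questions covering five major geographical regions (West, East, South, North, and Center) plus general questions applicable across all regions, and it spans cultural domains including food, clothing, entertainment, celebrations, and crafts. To make the evaluation rigorous, it includes questions of varying complexity (open-ended, single-choice, and multiple-choice) and distinguishes common cultural knowledge from specialized regional knowledge. The study finds that all tested models show significant performance declines on highly specialized or region-specific questions, which underlines the importance of incorporating region-specific knowledge into LLM training to improve cultural competence.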

Link: https://arxiv.org/abs/2503.17485
Authors: Lama Ayash, Hassan Alhuzali, Ashwag Alasmari, Sultan Aloufi
Affiliations: King Khalid University; University of Qassim; Taibah University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 34 pages, under-review

Abstract:Large Language Models (LLMs) have demonstrated remarkable capabilities in natural language processing; however, they often struggle to accurately capture and reflect cultural nuances. This research addresses this challenge by focusing on Saudi Arabia, a country characterized by diverse dialects and rich cultural traditions. We introduce SaudiCulture, a novel benchmark designed to evaluate the cultural competence of LLMs within the distinct geographical and cultural contexts of Saudi Arabia. SaudiCulture is a comprehensive dataset of questions covering five major geographical regions (West, East, South, North, and Center), along with general questions applicable across all regions. The dataset encompasses a broad spectrum of cultural domains, including food, clothing, entertainment, celebrations, and crafts. To ensure a rigorous evaluation, SaudiCulture includes questions of varying complexity, such as open-ended, single-choice, and multiple-choice formats, with some requiring multiple correct answers. Additionally, the dataset distinguishes between common cultural knowledge and specialized regional aspects. We conduct extensive evaluations on five LLMs (GPT-4, Llama 3.3, FANAR, Jais, and AceGPT), analyzing their performance across different question types and cultural contexts. Our findings reveal that all models experience significant performance declines when faced with highly specialized or region-specific questions, particularly those requiring multiple correct responses. Additionally, certain cultural categories are more easily identifiable than others, further highlighting inconsistencies in LLMs’ cultural understanding. These results emphasize the importance of incorporating region-specific knowledge into LLM training to enhance their cultural competence.

[NLP-111] ConvoGen: Enhancing Conversational AI with Synthetic Data: A Multi-Agent Approach

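[Quick read]: This paper addresses the generation of synthetic conversational data for training and evaluating conversational AI models and for augmenting existing datasets. Its key contribution is ConvoGen, a framework that generates synthetic conversations using multi-agent systems. ConvoGen leverages few-shot learning and iteratively samples from a dynamically updated few-shot hub to create diverse and realistic conversational scenarios; this dynamically updated few-shot mechanism is what improves the quality and diversity of the generated data.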

Link: https://arxiv.org/abs/2503.17460
Authors: Reem Gody, Mahmoud Goudy, Ahmed Y. Tawfik
Affiliations: Microsoft AI
Categories: Computation and Language (cs.CL)
Comments:

Abstract:In this paper, we present ConvoGen: an innovative framework for generating synthetic conversational data using multi-agent systems. Our method leverages few-shot learning and introduces iterative sampling from a dynamically updated few-shot hub to create diverse and realistic conversational scenarios. The generated data has numerous applications, including training and evaluating conversational AI models, and augmenting existing datasets for tasks like conversational intent classification or conversation summarization. Our experiments demonstrate the effectiveness of this method in producing high-quality diverse synthetic conversational data, highlighting its potential to enhance the development and evaluation of conversational AI systems.

[NLP-112] Language-specific Neurons Do Not Facilitate Cross-Lingual Transfer NAACL2025

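[Quick read]: This paper addresses the significant performance degradation of multilingual large language models (LLMs) on low-resource languages. It explores whether existing techniques for identifying language-specific neurons (such as Language Activation Probability Entropy and activation probability-based thresholding), combined with neuron-specific LoRA fine-tuning, can improve cross-lingual task performance for low-resource languages. The key step is an experimental test of whether such neuron-specific interventions improve cross-lingual generalization on downstream tasks (XNLI, XQuAD). The results show that existing methods are insufficient for this goal, which highlights the challenges of cross-lingual generalization and provides insights for multilingual LLMs.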

Link: https://arxiv.org/abs/2503.17456
Authors: Soumen Kumar Mondal, Sayambhu Sen, Abhishek Singhania, Preethi Jyothi
Affiliations: Indian Institute of Technology Bombay; Amazon Alexa
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted (oral) at NAACL 2025 (InsightsNLP)

Abstract:Multilingual large language models (LLMs) aim towards robust natural language understanding across diverse languages, yet their performance significantly degrades on low-resource languages. This work explores whether existing techniques to identify language-specific neurons can be leveraged to enhance cross-lingual task performance of low-resource languages. We conduct detailed experiments covering existing language-specific neuron identification techniques (such as Language Activation Probability Entropy and activation probability-based thresholding) and neuron-specific LoRA fine-tuning with models like Llama 3.1 and Mistral Nemo. We find that such neuron-specific interventions are insufficient to yield cross-lingual improvements on downstream tasks (XNLI, XQuAD) in low-resource languages. This study highlights the challenges in achieving cross-lingual generalization and provides critical insights for multilingual LLMs.
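
For readers unfamiliar with entropy-based neuron selection, the sketch below shows the general idea as it is usually described in prior work on language-specific neurons: a neuron's per-language activation probabilities are normalized into a distribution, and a low entropy suggests the neuron fires mostly for one language. The probabilities, the neuron names, and the 0.5 threshold are invented for illustration and are not taken from this paper.

```python
import math


def activation_probability_entropy(activation_probs):
    """Entropy of a neuron's activation probabilities, normalized over languages."""
    total = sum(activation_probs)
    distribution = [p / total for p in activation_probs]
    return -sum(p * math.log(p) for p in distribution if p > 0)


# Hypothetical activation probabilities for three languages per neuron.
neurons = {
    "neuron_a": [0.90, 0.01, 0.01],  # fires almost only for the first language
    "neuron_b": [0.30, 0.30, 0.30],  # fires equally for all languages
}

entropies = {name: activation_probability_entropy(p) for name, p in neurons.items()}

# Assumed cutoff: neurons below it are treated as language-specific.
language_specific = [name for name, h in entropies.items() if h < 0.5]
print(entropies)
print(language_specific)
```

Neurons selected this way are the kind of candidates that neuron-specific fine-tuning would target. The abstract reports that such interventions did not yield cross-lingual improvements.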

[NLP-113] From Text to Talent: A Pipeline for Extracting Insights from Candidate Profiles

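[Quick Read]: This paper addresses a research gap in recruitment: suggesting ideal candidates when multiple vacancies are open. The key to the solution is a novel pipeline that combines Large Language Models with graph similarity measures. By representing candidate profiles as multimodal embeddings, it captures nuanced relationships between job requirements and candidate attributes, helping to identify top talent more efficiently and streamline hiring.

Link: https://arxiv.org/abs/2503.17438
Authors: Paolo Frazzetto,Muhammad Uzair Ul Haq,Flavia Fabris,Alessandro Sperduti
Affiliations: Unknown
Categories: Computers and Society (cs.CY); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: ITADATA 2024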

Abstract:The recruitment process is undergoing a significant transformation with the increasing use of machine learning and natural language processing techniques. While previous studies have focused on automating candidate selection, the role of multiple vacancies in this process remains understudied. This paper addresses this gap by proposing a novel pipeline that leverages Large Language Models and graph similarity measures to suggest ideal candidates for specific job openings. Our approach represents candidate profiles as multimodal embeddings, enabling the capture of nuanced relationships between job requirements and candidate attributes. The proposed approach has significant implications for the recruitment industry, enabling companies to streamline their hiring processes and identify top talent more efficiently. Our work contributes to the growing body of research on the application of machine learning in human resources, highlighting the potential of LLMs and graph-based methods in revolutionizing the recruitment landscape.

[NLP-114] Beyond Negation Detection: Comprehensive Assertion Detection Models for Clinical NLP ECIR2025

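[Quick Read]: This paper tackles assertion status detection, a critical but often overlooked component of clinical NLP whose job is to correctly attribute facts extracted from medical text. Prior work has focused narrowly on negation detection, and commercial solutions such as AWS Medical Comprehend, Azure AI Text Analytics, and GPT-4o underperform because of limited domain adaptation. To fill this gap, the authors develop a set of state-of-the-art assertion detection models: fine-tuned LLMs, transformer-based classifiers, few-shot classifiers, and deep learning (DL) approaches. The key ingredient is domain adaptation, which markedly improves performance across assertion categories (Present, Absent, Hypothetical, Conditional, and others). The lightweight few-shot classifier remains highly competitive, making it suitable for resource-constrained environments, and integration with Spark NLP provides scalable inference and seamless combination with other NLP tasks. The results indicate that domain-adapted, transparent, and customizable clinical NLP solutions outperform general-purpose LLMs and proprietary APIs.

Link: https://arxiv.org/abs/2503.17425
Authors: Veysel Kocaman,Yigit Gul,M. Aytug Kaya,Hasham Ul Haq,Mehmet Butgul,Cabir Celik,David Talby
Affiliations: John Snow Labs Inc.
Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments: accepted at Text2Story Workshop at ECIR 2025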

Abstract:Assertion status detection is a critical yet often overlooked component of clinical NLP, essential for accurately attributing extracted medical facts. Past studies have narrowly focused on negation detection, leading to underperforming commercial solutions such as AWS Medical Comprehend, Azure AI Text Analytics, and GPT-4o due to their limited domain adaptation. To address this gap, we developed state-of-the-art assertion detection models, including fine-tuned LLMs, transformer-based classifiers, few-shot classifiers, and deep learning (DL) approaches. We evaluated these models against cloud-based commercial API solutions, the legacy rule-based NegEx approach, and GPT-4o. Our fine-tuned LLM achieves the highest overall accuracy (0.962), outperforming GPT-4o (0.901) and commercial APIs by a notable margin, particularly excelling in Present (+4.2%), Absent (+8.4%), and Hypothetical (+23.4%) assertions. Our DL-based models surpass commercial solutions in Conditional (+5.3%) and Associated-with-Someone-Else (+10.1%) categories, while the few-shot classifier offers a lightweight yet highly competitive alternative (0.929), making it ideal for resource-constrained environments. Integrated within Spark NLP, our models consistently outperform black-box commercial solutions while enabling scalable inference and seamless integration with medical NER, Relation Extraction, and Terminology Resolution. These results reinforce the importance of domain-adapted, transparent, and customizable clinical NLP solutions over general-purpose LLMs and proprietary APIs.

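[NLP-115] Understanding Social Support Needs in Questions: A Hybrid Approach Integrating Semi-Supervised Learning and LLM-based Data Augmentation

[Quick Read]: This paper addresses the problem that, in online health question-answering (QA) communities, social support that does not match users' specific needs can be ineffective or even harmful. The authors develop a new framework, Hybrid Approach for SOcial Support need classification (HA-SOS). Its key idea is to combine an answer-enhanced semi-supervised learning approach, LLM-based text data augmentation with reliability- and diversity-aware sample selection, and a unified training process to automatically label the social support needs expressed in questions. Experiments show that HA-SOS significantly outperforms existing question classification models and alternative semi-supervised learning approaches. The work contributes to research on social support, question classification, semi-supervised learning, and text data augmentation, and helps online QA platform managers and answerers understand users' social support needs so they can provide timely, personalized answers and interventions.

Link: https://arxiv.org/abs/2503.17421
Authors: Junwei Kuang,Liang Yang,Shaoze Cui,Weiguo Fan
Affiliations: Unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 55 pages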

Abstract:Patients are increasingly turning to online health QA communities for social support to improve their well-being. However, when this support received does not align with their specific needs, it may prove ineffective or even detrimental. This necessitates a model capable of identifying the social support needs in questions. However, training such a model is challenging due to the scarcity and class imbalance issues of labeled data. To overcome these challenges, we follow the computational design science paradigm to develop a novel framework, Hybrid Approach for SOcial Support need classification (HA-SOS). HA-SOS integrates an answer-enhanced semi-supervised learning approach, a text data augmentation technique leveraging large language models (LLMs) with reliability- and diversity-aware sample selection mechanism, and a unified training process to automatically label social support needs in questions. Extensive empirical evaluations demonstrate that HA-SOS significantly outperforms existing question classification models and alternative semi-supervised learning approaches. This research contributes to the literature on social support, question classification, semi-supervised learning, and text data augmentation. In practice, our HA-SOS framework facilitates online QA platform managers and answerers to better understand users’ social support needs, enabling them to provide timely, personalized answers and interventions.

[NLP-116] A Comprehensive Survey on Long Context Language Modeling

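[Quick Read]: This survey addresses efficient long-context processing in natural language processing. As long documents, dialogues, and other textual data grow in volume, Long Context Language Models (LCLMs) that handle extensive inputs effectively and efficiently become increasingly important. The survey is organized around three aspects: how to obtain effective LCLMs (data strategies, architectural designs, and workflow approaches for long-context processing); the infrastructure needed to train and deploy LCLMs efficiently; and how to evaluate and analyze LCLMs comprehensively, covering long-context comprehension, long-form generation, behavioral analysis, and mechanism interpretability. It also explores application scenarios of existing LCLMs and outlines future directions, giving researchers and engineers an up-to-date review of long-context LLMs.

Link: https://arxiv.org/abs/2503.17407
Authors: Jiaheng Liu,Dawei Zhu,Zhiqi Bai,Yancheng He,Huanxuan Liao,Haoran Que,Zekun Wang,Chenchen Zhang,Ge Zhang,Jiebin Zhang,Yuanxing Zhang,Zhuo Chen,Hangyu Guo,Shilong Li,Ziqiang Liu,Yong Shan,Yifan Song,Jiayi Tian,Wenhao Wu,Zhejian Zhou,Ruijie Zhu,Junlan Feng,Yang Gao,Shizhu He,Zhoujun Li,Tianyu Liu,Fanyu Meng,Wenbo Su,Yingshui Tan,Zili Wang,Jian Yang,Wei Ye,Bo Zheng,Wangchunshu Zhou,Wenhao Huang,Sujian Li,Zhaoxiang Zhang
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: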

Abstract:Efficient processing of long contexts has been a persistent pursuit in Natural Language Processing. With the growing number of long documents, dialogues, and other textual data, it is important to develop Long Context Language Models (LCLMs) that can process and analyze extensive inputs in an effective and efficient way. In this paper, we present a comprehensive survey on recent advances in long-context modeling for large language models. Our survey is structured around three key aspects: how to obtain effective and efficient LCLMs, how to train and deploy LCLMs efficiently, and how to evaluate and analyze LCLMs comprehensively. For the first aspect, we discuss data strategies, architectural designs, and workflow approaches oriented with long context processing. For the second aspect, we provide a detailed examination of the infrastructure required for LCLM training and inference. For the third aspect, we present evaluation paradigms for long-context comprehension and long-form generation, as well as behavioral analysis and mechanism interpretability of LCLMs. Beyond these three key aspects, we thoroughly explore the diverse application scenarios where existing LCLMs have been deployed and outline promising future development directions. This survey provides an up-to-date review of the literature on long-context LLMs, which we wish to serve as a valuable resource for both researchers and engineers. An associated GitHub repository collecting the latest papers and repos is available at: LCLM-Horizon (this https URL).

[NLP-117] ChatGPT or A Silent Everywhere Helper: A Survey of Large Language Models

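[Quick Read]: This survey offers a comprehensive analysis of ChatGPT as a representative Large Language Model (LLM), covering its architecture, training process, and functionality, along with its applications across industries such as customer service, education, healthcare, and entertainment. Through comparison with other LLMs, it highlights ChatGPT's distinctive features and performance metrics, examines its benchmark performance, and discusses potential risks such as misinformation, bias, and data privacy. The paper's main contribution is its supporting material and systematic organization, including figures, tables, and dataset lists that lay out ChatGPT's technical background, applications, and impact. It closes by identifying future research directions and technology trends, stressing the far-reaching impact of LLMs on Artificial Intelligence (AI) and society.

Link: https://arxiv.org/abs/2503.17403
Authors: Azim Akhtarshenas,Afshin Dini,Navid Ayoobi
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: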

Abstract:Large Language Models (LLMs) have revolutionized Natural Language Processing (NLP), with Chat Generative Pre-trained Transformer (ChatGPT) standing out as a notable example due to its advanced capabilities and widespread applications. This survey provides a comprehensive analysis of ChatGPT, exploring its architecture, training processes, and functionalities. We examine its integration into various domains across industries such as customer service, education, healthcare, and entertainment. A comparative analysis with other LLMs highlights ChatGPT’s unique features and performance metrics. Regarding benchmarks, the paper examines ChatGPT’s comparative performance against other LLMs and discusses potential risks such as misinformation, bias, and data privacy concerns. Additionally, we offer a number of figures and tables that outline the backdrop of the discussion, the main ideas of the article, the numerous LLM models, a thorough list of datasets used for pre-training, fine-tuning, and evaluation, as well as particular LLM applications with pertinent references. Finally, we identify future research directions and technological advancements, underscoring the evolving landscape of LLMs and their profound impact on Artificial Intelligence (AI) and society.

[NLP-118] State Fourier Diffusion Language Model (SFDLM): A Scalable Novel Iterative Approach to Language Modeling

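[Quick Read]: This paper addresses text generation with discrete diffusion models, in particular iterative denoising of discrete data without relying on transformers or large convolution modules. Its key contribution is a fully diffusion-driven discrete text generation model that combines structured state space dynamics in the time domain with a novel Complex Fourier Multi Layer Perceptron module operating in the frequency domain. The forward noising process replaces tokens with random vocabulary samples at a controlled probability, while a learned reverse model systematically restores corrupted sequences toward their original states. Composing local state space updates with global Fourier-based mixing lets the model capture both short- and long-range dependencies, which the summary credits for efficient and flexible text generation.

Link: https://arxiv.org/abs/2503.17382
Authors: Andrew Kiruluta,Andreas Lemos
Affiliations: School of Information, University of California
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: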

Abstract:In recent years, diffusion based methods have emerged as a powerful paradigm for generative modeling. Although discrete diffusion for natural language processing has been explored to a lesser extent, it shows promise for tasks requiring iterative denoising of token based data. In standard approaches to text generation, transformers dominate, but their reliance on self attention often incurs high computational costs. This paper introduces a fully diffusion driven discrete text generation model built without any transformer or large convolution modules. Instead, the model integrates structured state space dynamics in the time domain with a novel Complex Fourier Multi Layer Perceptron module that operates in the frequency domain. The forward noising process randomly samples the vocabulary to replace tokens with a controlled probability, while the learned reverse model systematically reverts corrupted sequences toward their original states. By composing local state space updates with global Fourier based mixing, the approach effectively captures both short and long range dependencies.
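
The forward noising step described in the abstract (replacing tokens with random vocabulary samples at a controlled probability) can be illustrated with a minimal sketch. It assumes uniform sampling over the vocabulary and a single fixed replacement probability; the function name, the parameters, and the absence of a per-timestep schedule are illustrative assumptions, not details from the paper.

```python
import random


def forward_noise(tokens, vocab_size, replace_prob, rng):
    """Corrupt a token sequence by random replacement.

    Each token id is independently replaced, with probability
    `replace_prob`, by an id drawn uniformly from range(vocab_size).
    Uniform sampling is an assumption made for this sketch.
    """
    noisy = []
    for tok in tokens:
        if rng.random() < replace_prob:
            noisy.append(rng.randrange(vocab_size))
        else:
            noisy.append(tok)
    return noisy
```

With `replace_prob=0.0` the sequence is returned unchanged, and with `replace_prob=1.0` every position is resampled, so intermediate values interpolate between clean text and pure noise. A reverse model would then be trained to undo this corruption.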

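[NLP-119] Big Help or Big Brother? Auditing Tracking, Profiling, and Personalization in Generative AI Assistants

[Quick Read]: This paper examines privacy in generative AI (GenAI) browser assistants, focusing on how they collect, store, process, and share user data. The key to the approach is network traffic analysis combined with a novel prompting framework, used to audit tracking, profiling, and personalization in the ten most popular GenAI browser assistant extensions. The study finds that these assistants rely mainly on server-side APIs, which can be invoked automatically without explicit user interaction, and that they collect and share webpage content (often the full HTML DOM and sometimes the user's form inputs). Some also share identifiers and user prompts with third-party trackers. Several assistants infer demographic attributes such as age, gender, income, and interests, and build profiles that carry across browsing contexts to personalize responses. The central conclusion is that these assistants collect personal and sensitive information with little to no safeguards.

Link: https://arxiv.org/abs/2503.16586
Authors: Yash Vekaria(1),Aurelio Loris Canino(2),Jonathan Levitsky(1),Alex Ciechonski(3),Patricia Callejo(4),Anna Maria Mandalari(3),Zubair Shafiq(1) ((1) UC Davis, (2) Mediterranea University of Reggio Calabria, (3) University College London, (4) Universidad Carlos III de Madrid)
Affiliations: UC Davis; Mediterranea University of Reggio Calabria (UNIRC); University College London (UCL); Universidad Carlos III de Madrid (UC3M)
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Cryptography and Security (cs.CR); Computers and Society (cs.CY)
Comments: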

Abstract:Generative AI (GenAI) browser assistants integrate powerful capabilities of GenAI in web browsers to provide rich experiences such as question answering, content summarization, and agentic navigation. These assistants, available today as browser extensions, can not only track detailed browsing activity such as search and click data, but can also autonomously perform tasks such as filling forms, raising significant privacy concerns. It is crucial to understand the design and operation of GenAI browser extensions, including how they collect, store, process, and share user data. To this end, we study their ability to profile users and personalize their responses based on explicit or inferred demographic attributes and interests of users. We perform network traffic analysis and use a novel prompting framework to audit tracking, profiling, and personalization by the ten most popular GenAI browser assistant extensions. We find that instead of relying on local in-browser models, these assistants largely depend on server-side APIs, which can be auto-invoked without explicit user interaction. When invoked, they collect and share webpage content, often the full HTML DOM and sometimes even the user’s form inputs, with their first-party servers. Some assistants also share identifiers and user prompts with third-party trackers such as Google Analytics. The collection and sharing continues even if a webpage contains sensitive information such as health or personal information such as name or SSN entered in a web form. We find that several GenAI browser assistants infer demographic attributes such as age, gender, income, and interests and use this profile–which carries across browsing contexts–to personalize responses. In summary, our work shows that GenAI browser assistants can and do collect personal and sensitive information for profiling and personalization with little to no safeguards.

[NLP-120] Autonomous Radiotherapy Treatment Planning Using DOLA: A Privacy-Preserving LLM-Based Optimization Agent

[Quick Read]: This paper addresses the complexity and time cost of radiotherapy treatment planning, which is affected by inter-planner variability and subjective decision-making. It proposes the Dose Optimization Language Agent (DOLA), an autonomous agent based on a large language model (LLM) and designed to optimize radiotherapy plans while strictly protecting patient privacy. The key design is to integrate the LLaMa3.1 LLM directly with a commercial treatment planning system, using chain-of-thought prompting, retrieval-augmented generation (RAG), and reinforcement learning (RL), while running entirely within secure local infrastructure so that no data is shared externally. Comparing model sizes and optimization strategies, the study finds that the 70B-parameter model and the RAG and RL strategies improve performance, illustrating the synergy between retrieval-based memory and reinforcement learning.

Link: https://arxiv.org/abs/2503.17553
Authors: Humza Nusrat(1 and 2),Bing Luo(1),Ryan Hall(1),Joshua Kim(1),Hassan Bagher-Ebadian(1 and 2),Anthony Doemer(1),Benjamin Movsas(1 and 2),Kundan Thind(1 and 2) ((1) Department of Radiation Oncology, Henry Ford Health, Detroit, USA (2) College of Human Medicine, Michigan State University, East Lansing, USA)
Affiliations: Unknown
Categories: Medical Physics (physics.med-ph); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Emerging Technologies (cs.ET); Human-Computer Interaction (cs.HC)
Comments: 19 pages, 5 figures, preprint

Abstract:Radiotherapy treatment planning is a complex and time-intensive process, often impacted by inter-planner variability and subjective decision-making. To address these challenges, we introduce Dose Optimization Language Agent (DOLA), an autonomous large language model (LLM)-based agent designed for optimizing radiotherapy treatment plans while rigorously protecting patient privacy. DOLA integrates the LLaMa3.1 LLM directly with a commercial treatment planning system, utilizing chain-of-thought prompting, retrieval-augmented generation (RAG), and reinforcement learning (RL). Operating entirely within secure local infrastructure, this agent eliminates external data sharing. We evaluated DOLA using a retrospective cohort of 18 prostate cancer patients prescribed 60 Gy in 20 fractions, comparing model sizes (8 billion vs. 70 billion parameters) and optimization strategies (No-RAG, RAG, and RAG+RL) over 10 planning iterations. The 70B model demonstrated significantly improved performance, achieving approximately 16.4% higher final scores than the 8B model. The RAG approach outperformed the No-RAG baseline by 19.8%, and incorporating RL accelerated convergence, highlighting the synergy of retrieval-based memory and reinforcement learning. Optimal temperature hyperparameter analysis identified 0.4 as providing the best balance between exploration and exploitation. This proof of concept study represents the first successful deployment of locally hosted LLM agents for autonomous optimization of treatment plans within a commercial radiotherapy planning system. By extending human-machine interaction through interpretable natural language reasoning, DOLA offers a scalable and privacy-conscious framework, with significant potential for clinical implementation and workflow improvement.

Computer Vision

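[CV-0] Target-Aware Video Diffusion Models

[Quick Read]: This paper addresses generating, from an input image, a video in which an actor interacts with a specified target, focusing on human-object interaction (HOI) scenarios where precise action guidance is hard to provide. Existing controllable image-to-video diffusion models usually rely on dense structural or motion cues to guide the actor. The proposed target-aware video diffusion model needs only a simple segmentation mask to define the target and leverages the generalization ability of pretrained models to produce plausible actions. The key is to extend a baseline model with a special token that encodes the target's spatial information within the text prompt, and to fine-tune with a novel cross-attention loss that aligns the cross-attention maps of this token with the input target mask. To further improve performance, the loss is applied selectively to the most semantically relevant transformer blocks and attention regions. Experiments show that the model outperforms existing methods at generating videos in which actors interact accurately with the specified target, and demonstrate its effectiveness in two downstream applications: video content creation and zero-shot 3D HOI motion synthesis.

Link: https://arxiv.org/abs/2503.18950
Authors: Taeksoo Kim,Hanbyul Joo
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: The project page is available at this https URL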

Abstract:We present a target-aware video diffusion model that generates videos from an input image in which an actor interacts with a specified target while performing a desired action. The target is defined by a segmentation mask and the desired action is described via a text prompt. Unlike existing controllable image-to-video diffusion models that often rely on dense structural or motion cues to guide the actor’s movements toward the target, our target-aware model requires only a simple mask to indicate the target, leveraging the generalization capabilities of pretrained models to produce plausible actions. This makes our method particularly effective for human-object interaction (HOI) scenarios, where providing precise action guidance is challenging, and further enables the use of video diffusion models for high-level action planning in applications such as robotics. We build our target-aware model by extending a baseline model to incorporate the target mask as an additional input. To enforce target awareness, we introduce a special token that encodes the target’s spatial information within the text prompt. We then fine-tune the model with our curated dataset using a novel cross-attention loss that aligns the cross-attention maps associated with this token with the input target mask. To further improve performance, we selectively apply this loss to the most semantically relevant transformer blocks and attention regions. Experimental results show that our target-aware model outperforms existing solutions in generating videos where actors interact accurately with the specified targets. We further demonstrate its efficacy in two downstream applications: video content creation and zero-shot 3D HOI motion synthesis.

[CV-1] Equivariant Image Modeling

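[Quick Read]: This paper addresses the inherent conflicts that arise when existing generative models (such as autoregressive and diffusion models) jointly optimize the subtasks into which they decompose high-dimensional distribution learning. Existing remedies typically trade away efficiency or scalability. The paper proposes a new equivariant image modeling framework that exploits the translation invariance of natural visual signals to align optimization targets across subtasks. The key components are column-wise tokenization, which strengthens translational symmetry along the horizontal axis, and windowed causal attention, which enforces consistent contextual relationships across positions. Together these reduce inter-task conflicts, improve zero-shot generalization, and enable efficient parameter sharing and conflict-free optimization.

Link: https://arxiv.org/abs/2503.18948
Authors: Ruixiao Dong,Mengde Xu,Zigang Geng,Li Li,Han Hu,Shuyang Gu
Affiliations: University of Science and Technology of China; Tencent Hunyuan Research
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: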

Abstract:Current generative models, such as autoregressive and diffusion approaches, decompose high-dimensional data distribution learning into a series of simpler subtasks. However, inherent conflicts arise during the joint optimization of these subtasks, and existing solutions fail to resolve such conflicts without sacrificing efficiency or scalability. We propose a novel equivariant image modeling framework that inherently aligns optimization targets across subtasks by leveraging the translation invariance of natural visual signals. Our method introduces (1) column-wise tokenization which enhances translational symmetry along the horizontal axis, and (2) windowed causal attention which enforces consistent contextual relationships across positions. Evaluated on class-conditioned ImageNet generation at 256x256 resolution, our approach achieves performance comparable to state-of-the-art AR models while using fewer computational resources. Systematic analysis demonstrates that enhanced equivariance reduces inter-task conflicts, significantly improving zero-shot generalization and enabling ultra-long image synthesis. This work establishes the first framework for task-aligned decomposition in generative modeling, offering insights into efficient parameter sharing and conflict-free optimization. The code and models are publicly available at this https URL.
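
To make "windowed causal attention" concrete, the sketch below builds a boolean attention mask in which each position may attend only to itself and a fixed number of preceding positions. This is a generic construction offered as an assumption about what the term means; the paper's actual window size and the way it is applied over column tokens are not given in the abstract, and the function name is hypothetical.

```python
def windowed_causal_mask(seq_len, window):
    """Boolean mask: mask[i][j] is True when query i may attend to key j.

    Position i sees positions i - window + 1 .. i (clipped at 0), so every
    query has the same relative context regardless of where it sits.
    """
    return [
        [0 <= i - j < window for j in range(seq_len)]
        for i in range(seq_len)
    ]
```

Because every position sees the same relative neighbourhood, the context looks identical under a shift along the sequence, which is the property the abstract associates with consistent contextual relationships across positions.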

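[CV-2] Tuning-Free Amodal Segmentation via the Occlusion-Free Bias of Inpainting Models

[Quick Read]: This paper addresses amodal segmentation: predicting segmentation masks for both the visible and the occluded regions of an object. Most existing methods treat it as a supervised learning problem and depend on manually annotated amodal masks or synthetic data, so their performance is bounded by dataset quality, and such datasets often lack diversity and scale. The paper proposes a tuning-free approach that repurposes pretrained diffusion-based inpainting models. The key is the "occlusion-free bias" of inpainting models: inpainted objects tend to come out complete and unoccluded. Concretely, the method reconstructs the occluded regions of an object via inpainting and then applies segmentation, with no additional training or fine-tuning. Experiments on five datasets demonstrate generalizability and robustness, with masks on average 5.3% more accurate than the state of the art.

Link: https://arxiv.org/abs/2503.18947
Authors: Jae Joong Lee,Bedrich Benes,Raymond A. Yeh
Affiliations: Department of Computer Science, Purdue University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: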

Abstract:Amodal segmentation aims to predict segmentation masks for both the visible and occluded regions of an object. Most existing works formulate this as a supervised learning problem, requiring manually annotated amodal masks or synthetic training data. Consequently, their performance depends on the quality of the datasets, which often lack diversity and scale. This work introduces a tuning-free approach that repurposes pretrained diffusion-based inpainting models for amodal segmentation. Our approach is motivated by the “occlusion-free bias” of inpainting models, i.e., the inpainted objects tend to be complete objects without occlusions. Specifically, we reconstruct the occluded regions of an object via inpainting and then apply segmentation, all without additional training or fine-tuning. Experiments on five datasets demonstrate the generalizability and robustness of our approach. On average, our approach achieves 5.3% more accurate masks over the state-of-the-art.

[CV-3] Aether: Geometric-Aware Unified World Modeling

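[Quick Read]: This paper addresses the integration of geometric reconstruction with generative modeling, a key challenge in building AI systems with human-like spatial reasoning. It proposes Aether, a unified framework that enables geometry-aware reasoning in world models by jointly optimizing three core capabilities: 4D dynamic reconstruction, action-conditioned video prediction, and goal-conditioned visual planning. The key is task-interleaved feature learning, which yields synergistic knowledge sharing across the reconstruction, prediction, and planning objectives. Aether also uses a geometry-informed action space to translate predictions into actions, enabling effective autonomous trajectory planning. Although trained without real-world data, Aether shows strong synthetic-to-real and zero-shot generalization, and its reconstruction performance far exceeds that of domain-specific models.

Link: https://arxiv.org/abs/2503.18945
Authors: Aether Team,Haoyi Zhu,Yifan Wang,Jianjun Zhou,Wenzheng Chang,Yang Zhou,Zizun Li,Junyi Chen,Chunhua Shen,Jiangmiao Pang,Tong He
Affiliations: Aether Team, Shanghai AI Laboratory
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
Comments: Project Page: this https URL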

Abstract:The integration of geometric reconstruction and generative modeling remains a critical challenge in developing AI systems capable of human-like spatial reasoning. This paper proposes Aether, a unified framework that enables geometry-aware reasoning in world models by jointly optimizing three core capabilities: (1) 4D dynamic reconstruction, (2) action-conditioned video prediction, and (3) goal-conditioned visual planning. Through task-interleaved feature learning, Aether achieves synergistic knowledge sharing across reconstruction, prediction, and planning objectives. Building upon video generation models, our framework demonstrates unprecedented synthetic-to-real generalization despite never observing real-world data during training. Furthermore, our approach achieves zero-shot generalization in both action following and reconstruction tasks, thanks to its intrinsic geometric modeling. Remarkably, even without real-world data, its reconstruction performance far exceeds that of domain-specific models. Additionally, Aether leverages a geometry-informed action space to seamlessly translate predictions into actions, enabling effective autonomous trajectory planning. We hope our work inspires the community to explore new frontiers in physically-reasonable world modeling and its applications.

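[CV-4] DINO in the Room: Leveraging 2D Foundation Models for 3D Segmentation

[Quick Read]: This paper addresses the largely untapped potential of vision foundation models (VFMs) in 3D vision. Although 2D images are commonly available alongside 3D point cloud datasets, recent 3D methods focus mostly on the 3D data and leave the integration of 2D visual features into 3D models underexplored. The key solution is DITR, a simple yet effective approach that extracts 2D foundation model features, projects them into 3D, and injects them into a 3D point cloud segmentation model, achieving state-of-the-art results on indoor and outdoor 3D semantic segmentation benchmarks. To benefit from VFMs even when images are unavailable at inference, the paper further proposes distilling 2D foundation models into a 3D backbone as a pretraining task, providing a strong basis for downstream 3D segmentation and improving performance across datasets.

Link: https://arxiv.org/abs/2503.18944
Authors: Karim Abou Zeid,Kadir Yilmaz,Daan de Geus,Alexander Hermans,David Adrian,Timm Linder,Bastian Leibe
Affiliations: RWTH Aachen University; Eindhoven University of Technology; Bosch Center for AI
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page at this https URL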

Abstract:Vision foundation models (VFMs) trained on large-scale image datasets provide high-quality features that have significantly advanced 2D visual recognition. However, their potential in 3D vision remains largely untapped, despite the common availability of 2D images alongside 3D point cloud datasets. While significant research has been dedicated to 2D-3D fusion, recent state-of-the-art 3D methods predominantly focus on 3D data, leaving the integration of VFMs into 3D models underexplored. In this work, we challenge this trend by introducing DITR, a simple yet effective approach that extracts 2D foundation model features, projects them to 3D, and finally injects them into a 3D point cloud segmentation model. DITR achieves state-of-the-art results on both indoor and outdoor 3D semantic segmentation benchmarks. To enable the use of VFMs even when images are unavailable during inference, we further propose to distill 2D foundation models into a 3D backbone as a pretraining task. By initializing the 3D backbone with knowledge distilled from 2D VFMs, we create a strong basis for downstream 3D segmentation tasks, ultimately boosting performance across various datasets.

[CV-5] SlowFast-LLaVA-1.5: A Family of Token-Efficient Video Large Language Models for Long-Form Video Understanding

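[Quick Read]: This paper addresses efficient modeling for long-form video understanding, in particular the need for lightweight, mobile-friendly video large language models (Video LLMs). The key to the solution is the SlowFast-LLaVA-1.5 (SF-LLaVA-1.5) model family, which uses a two-stream SlowFast mechanism to model long-range temporal context efficiently. With a streamlined training pipeline and a high-quality data mixture, the family spans 1B to 7B parameters and performs well on a range of video and image benchmarks, reaching state-of-the-art results on long-form video understanding (e.g., LongVideoBench and MLVU) and performing strongly at small scales.

Link: https://arxiv.org/abs/2503.18943
Authors: Mingze Xu,Mingfei Gao,Shiyu Li,Jiasen Lu,Zhe Gan,Zhengfeng Lai,Meng Cao,Kai Kang,Yinfei Yang,Afshin Dehghan
Affiliations: Apple
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Technical report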

Abstract:We introduce SlowFast-LLaVA-1.5 (abbreviated as SF-LLaVA-1.5), a family of video large language models (LLMs) offering a token-efficient solution for long-form video understanding. This model family employs the two-stream SlowFast mechanism, enabling efficient modeling of long-range temporal context to meet the demand for lightweight, mobile-friendly Video LLMs. We provide models ranging from 1B to 7B parameters, optimized through a streamlined training pipeline and a high-quality data mixture composed of publicly available datasets. Experimental results demonstrate that SF-LLaVA-1.5 achieves competitive performance on a wide range of video and image benchmarks, with robust results across all model sizes. Notably, SF-LLaVA-1.5 achieves state-of-the-art results in long-form video understanding (e.g., LongVideoBench and MLVU) and excels at small scales (1B and 3B) across various video benchmarks.

[CV-6] Video-T1: Test-Time Scaling for Video Generation

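[Quick Read]: This paper asks how much video generation quality can improve when a model is allowed a non-trivial amount of inference-time compute. Rather than scaling up models through expensive training, it explores Test-Time Scaling (TTS) for video generation. The key is to reinterpret test-time scaling as a search problem: sampling better trajectories from Gaussian noise space toward the target video distribution. Specifically, the authors build the search space with test-time verifiers that provide feedback and use heuristic algorithms to guide the search. To reduce test-time cost, they propose Tree-of-Frames (ToF), which adaptively expands and prunes video branches in an autoregressive manner. Experiments show that increasing inference-time compute consistently and significantly improves the quality of generated videos.

Link: https://arxiv.org/abs/2503.18942
Authors: Fangfu Liu,Hanyang Wang,Yimo Cai,Kaiyan Zhang,Xiaohang Zhan,Yueqi Duan
Affiliations: Tsinghua University; Tencent
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Project page: this https URL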

Abstract:With the scale capability of increasing training data, model size, and computational cost, video generation has achieved impressive results in digital creation, enabling users to express creativity across various domains. Recently, researchers in Large Language Models (LLMs) have expanded the scaling to test-time, which can significantly improve LLM performance by using more inference-time computation. Instead of scaling up video foundation models through expensive training costs, we explore the power of Test-Time Scaling (TTS) in video generation, aiming to answer the question: if a video generation model is allowed to use non-trivial amount of inference-time compute, how much can it improve generation quality given a challenging text prompt. In this work, we reinterpret the test-time scaling of video generation as a searching problem to sample better trajectories from Gaussian noise space to the target video distribution. Specifically, we build the search space with test-time verifiers to provide feedback and heuristic algorithms to guide searching process. Given a text prompt, we first explore an intuitive linear search strategy by increasing noise candidates at inference time. As full-step denoising all frames simultaneously requires heavy test-time computation costs, we further design a more efficient TTS method for video generation called Tree-of-Frames (ToF) that adaptively expands and prunes video branches in an autoregressive manner. Extensive experiments on text-conditioned video generation benchmarks demonstrate that increasing test-time compute consistently leads to significant improvements in the quality of videos. Project page: this https URL
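
The "intuitive linear search strategy" in the abstract (increasing the number of noise candidates at inference time and keeping the one the verifier prefers) resembles a best-of-N loop. The sketch below is a generic illustration under that reading, not the authors' implementation: `generate` and `verify` are hypothetical callables standing in for a video generator seeded by initial noise and a test-time verifier.

```python
import random


def linear_search(generate, verify, prompt, num_candidates, rng):
    """Best-of-N search over initial noise seeds.

    generate(prompt, seed) -> candidate video (hypothetical interface)
    verify(prompt, video)  -> scalar score, higher is better (hypothetical)
    """
    best_video, best_score = None, float("-inf")
    for _ in range(num_candidates):
        seed = rng.getrandbits(32)       # stands in for a fresh noise sample
        video = generate(prompt, seed)   # full denoising run for this seed
        score = verify(prompt, video)    # verifier feedback
        if score > best_score:
            best_video, best_score = video, score
    return best_video, best_score
```

Cost grows linearly with `num_candidates` because each candidate needs a full denoising run, which is the overhead the abstract cites as motivation for the more efficient Tree-of-Frames method.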

[CV-7] Training-free Diffusion Acceleration with Bottleneck Sampling

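[Quick Read]: This paper addresses the high inference cost of diffusion models, which stems mainly from the quadratic complexity of self-attention with respect to image or video resolution. Existing acceleration methods often sacrifice output quality or require costly retraining. Observing that most diffusion models are pre-trained at lower resolutions, the authors exploit these low-resolution priors for more efficient inference without degrading performance. The key contribution is Bottleneck Sampling, a training-free framework built on a high-low-high denoising workflow: high-resolution denoising in the initial and final stages, low-resolution denoising in the intermediate steps. To mitigate aliasing and blurring artifacts, the method refines the resolution transition points and adaptively shifts the denoising timesteps at each stage. Experiments show inference speedups of up to 3× for image generation and 2.5× for video generation, with output quality comparable to standard full-resolution sampling across multiple evaluation metrics.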

Link: https://arxiv.org/abs/2503.18940
Authors: Ye Tian, Xin Xia, Yuxi Ren, Shanchuan Lin, Xing Wang, Xuefeng Xiao, Yunhai Tong, Ling Yang, Bin Cui
Affiliations: Peking University; Bytedance
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Code Repo: this https URL, Project Page: this https URL

Abstract:Diffusion models have demonstrated remarkable capabilities in visual content generation but remain challenging to deploy due to their high computational cost during inference. This computational burden primarily arises from the quadratic complexity of self-attention with respect to image or video resolution. While existing acceleration methods often compromise output quality or necessitate costly retraining, we observe that most diffusion models are pre-trained at lower resolutions, presenting an opportunity to exploit these low-resolution priors for more efficient inference without degrading performance. In this work, we introduce Bottleneck Sampling, a training-free framework that leverages low-resolution priors to reduce computational overhead while preserving output fidelity. Bottleneck Sampling follows a high-low-high denoising workflow: it performs high-resolution denoising in the initial and final stages while operating at lower resolutions in intermediate steps. To mitigate aliasing and blurring artifacts, we further refine the resolution transition points and adaptively shift the denoising timesteps at each stage. We evaluate Bottleneck Sampling on both image and video generation tasks, where extensive experiments demonstrate that it accelerates inference by up to 3× for image generation and 2.5× for video generation, all while maintaining output quality comparable to the standard full-resolution sampling process across multiple evaluation metrics. Code is available at: this https URL
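To make the high-low-high workflow concrete, here is a rough sketch of a per-step resolution schedule together with a back-of-the-envelope estimate of self-attention cost. The function names, the patch size of 16, and the assumption that cost scales with the square of the token count are illustrative choices, not details from the paper; timestep shifting and total model cost are not modeled.

```python
def bottleneck_schedule(num_steps, high_res, low_res, head_steps, tail_steps):
    """Per-step resolution: high for the first and last steps, low in between."""
    if head_steps + tail_steps > num_steps:
        raise ValueError("head_steps + tail_steps must not exceed num_steps")
    schedule = []
    for step in range(num_steps):
        is_edge = step < head_steps or step >= num_steps - tail_steps
        schedule.append(high_res if is_edge else low_res)
    return schedule


def relative_attention_cost(schedule, high_res, patch=16):
    """Attention cost relative to running every step at high_res.

    Assumes square images, one token per patch, and cost proportional to tokens^2.
    """
    def tokens(res):
        return (res // patch) ** 2

    full = len(schedule) * tokens(high_res) ** 2
    actual = sum(tokens(res) ** 2 for res in schedule)
    return actual / full


if __name__ == "__main__":
    plan = bottleneck_schedule(10, 1024, 512, head_steps=2, tail_steps=2)
    print(plan)
    print(relative_attention_cost(plan, 1024))
```

Under these assumptions, halving the resolution for 6 of 10 steps leaves about 44% of the attention cost. Measured end-to-end speedups depend on the rest of the network and are reported in the paper itself.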

[CV-8] AdaWorld: Learning Adaptable World Models with Latent Actions

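[Quick Read]: This paper addresses the reliance of existing world models on large amounts of action-labeled data and costly training, which makes it hard to adapt them to novel environments with heterogeneous actions through limited interactions and limits their applicability across broader domains. The authors propose AdaWorld, a world model learning approach whose key idea is to incorporate action information during pretraining. Latent actions are extracted from videos in a self-supervised manner, capturing the most critical transitions between frames, and an autoregressive world model is then conditioned on these latent actions. This paradigm yields highly adaptable world models that transfer efficiently and learn new actions even with limited interactions and finetuning. Experiments across multiple environments show strong results in both simulation quality and visual planning.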

Link: https://arxiv.org/abs/2503.18938
Authors: Shenyuan Gao, Siyuan Zhou, Yilun Du, Jun Zhang, Chuang Gan
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Robotics (cs.RO)
Comments: Project page: this https URL

Abstract:World models aim to learn action-controlled prediction models and have proven essential for the development of intelligent agents. However, most existing world models rely heavily on substantial action-labeled data and costly training, making it challenging to adapt to novel environments with heterogeneous actions through limited interactions. This limitation can hinder their applicability across broader domains. To overcome this challenge, we propose AdaWorld, an innovative world model learning approach that enables efficient adaptation. The key idea is to incorporate action information during the pretraining of world models. This is achieved by extracting latent actions from videos in a self-supervised manner, capturing the most critical transitions between frames. We then develop an autoregressive world model that conditions on these latent actions. This learning paradigm enables highly adaptable world models, facilitating efficient transfer and learning of new actions even with limited interactions and finetuning. Our comprehensive experiments across multiple environments demonstrate that AdaWorld achieves superior performance in both simulation quality and visual planning.

[CV-9] SyncVP: Joint Diffusion for Synchronous Multi-Modal Video Prediction

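[Quick Read]: This paper addresses the fact that RGB frames alone often lack the information needed to capture real-world complexity when predicting future video frames. The authors propose Synchronous Video Prediction (SyncVP), a multi-modal framework that incorporates complementary data modalities to improve the richness and accuracy of future predictions. The key to the approach is to build on pre-trained modality-specific diffusion models and add an efficient spatio-temporal cross-attention module that enables effective information sharing across modalities.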

Link: https://arxiv.org/abs/2503.18933
Authors: Enrico Pallotta, Sina Mokhtarzadeh Azar, Shuai Li, Olga Zatsarynna, Juergen Gall
Affiliations: University of Bonn; Lamarr Institute for Machine Learning and Artificial Intelligence
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Predicting future video frames is essential for decision-making systems, yet RGB frames alone often lack the information needed to fully capture the underlying complexities of the real world. To address this limitation, we propose a multi-modal framework for Synchronous Video Prediction (SyncVP) that incorporates complementary data modalities, enhancing the richness and accuracy of future predictions. SyncVP builds on pre-trained modality-specific diffusion models and introduces an efficient spatio-temporal cross-attention module to enable effective information sharing across modalities. We evaluate SyncVP on standard benchmark datasets, such as Cityscapes and BAIR, using depth as an additional modality. We furthermore demonstrate its generalization to other modalities on SYNTHIA with semantic information and ERA5-Land with climate data. Notably, SyncVP achieves state-of-the-art performance, even in scenarios where only one modality is present, demonstrating its robustness and potential for a wide range of applications.
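For readers unfamiliar with cross-attention, the sketch below shows the generic scaled dot-product form in which tokens of one modality (for example RGB) read from tokens of another (for example depth). It is only a plain-Python illustration: it omits learned projections and the spatio-temporal structure, and it is not the module proposed in SyncVP.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Each query token attends over the key/value tokens of the other modality."""
    dim = len(queries[0])
    outputs = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim) for k in keys]
        weights = softmax(scores)
        # Weighted average of the value vectors.
        mixed = [sum(w * v[j] for w, v in zip(weights, values))
                 for j in range(len(values[0]))]
        outputs.append(mixed)
    return outputs


if __name__ == "__main__":
    rgb_tokens = [[1.0, 0.0], [0.0, 1.0]]        # hypothetical RGB features
    depth_tokens = [[1.0, 0.0], [0.0, 1.0]]      # hypothetical depth features
    print(cross_attention(rgb_tokens, depth_tokens, depth_tokens))
```

A symmetric design would run the same operation in the opposite direction so that both modality branches exchange information.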

[CV-10] CoMP: Continual Multimodal Pre-training for Vision Foundation Models

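[Quick Read]: This paper addresses two limitations of existing pre-trained Vision Foundation Models (VFMs), regardless of their original pre-training process: handling visual inputs of varying sizes and aligning visual representations with language representations. The authors propose CoMP (Continual Multimodal Pre-training), a carefully designed multimodal pre-training pipeline. CoMP uses a Continual Rotary Position Embedding to support continual pre-training at native resolution, and an Alignment Loss between visual and textual features, computed through language prototypes, to align multimodal representations. After three-stage training, the resulting VFMs improve markedly on multimodal understanding and also perform well on downstream tasks such as classification and segmentation.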

Link: https://arxiv.org/abs/2503.18931
Authors: Yitong Chen, Lingchen Meng, Wujian Peng, Zuxuan Wu, Yu-Gang Jiang
Affiliations: Shanghai Key Lab of Intell. Info. Processing, School of CS, Fudan University; Shanghai Innovation Institute
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Code is available in this https URL

Abstract:Pre-trained Vision Foundation Models (VFMs) provide strong visual representations for a wide range of applications. In this paper, we continually pre-train prevailing VFMs in a multimodal manner such that they can effortlessly process visual inputs of varying sizes and produce visual representations that are more aligned with language representations, regardless of their original pre-training process. To this end, we introduce CoMP, a carefully designed multimodal pre-training pipeline. CoMP uses a Continual Rotary Position Embedding to support native resolution continual pre-training, and an Alignment Loss between visual and textual features through language prototypes to align multimodal representations. By three-stage training, our VFMs achieve remarkable improvements not only in multimodal understanding but also in other downstream tasks such as classification and segmentation. Remarkably, CoMP-SigLIP achieves scores of 66.7 on ChartQA and 75.9 on DocVQA with a 0.5B LLM, while maintaining an 87.4% accuracy on ImageNet-1K and a 49.5 mIoU on ADE20K under frozen chunk evaluation.

[CV-11] Video SimpleQA: Towards Factuality Evaluation in Large Video Language Models

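[Quick Read]: This paper addresses a critical open challenge: evaluating the factual grounding of Large Video Language Models (LVLMs) in video contexts. It introduces Video SimpleQA, the first comprehensive benchmark tailored to factuality evaluation of LVLMs. The benchmark is distinguished by five design features: 1) it requires integrating external knowledge beyond the explicit narrative; 2) its fact-seeking questions target objective events or relationships and avoid subjective interpretation; 3) answers are short, unambiguous, and definitively correct, enabling automated evaluation with LLM-as-a-judge frameworks; 4) all annotations are validated against authoritative external references; 5) it requires temporal reasoning, covering both static single-frame understanding and dynamic, time-dependent reasoning. The authors evaluate 41 state-of-the-art LVLMs and find notable deficiencies in factual adherence along with considerable room for improvement.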

Link: https://arxiv.org/abs/2503.18923
Authors: Meng Cao, Pengfei Hu, Yingyao Wang, Jihao Gu, Haoran Tang, Haoze Zhao, Jiahua Dong, Wangbo Yu, Ge Zhang, Ian Reid, Xiaodan Liang
Affiliations: MBZUAI; Alibaba Group; Peking University; M-A-P
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 24 pages

Abstract:Recent advancements in Large Video Language Models (LVLMs) have highlighted their potential for multi-modal understanding, yet evaluating their factual grounding in video contexts remains a critical unsolved challenge. To address this gap, we introduce Video SimpleQA, the first comprehensive benchmark tailored for factuality evaluation of LVLMs. Our work distinguishes from existing video benchmarks through the following key features: 1) Knowledge required: demanding integration of external knowledge beyond the explicit narrative; 2) Fact-seeking question: targeting objective, undisputed events or relationships, avoiding subjective interpretation; 3) Definitive short-form answer: Answers are crafted as unambiguous and definitively correct in a short format, enabling automated evaluation through LLM-as-a-judge frameworks with minimal scoring variance; 4) External-source verified: All annotations undergo rigorous validation against authoritative external references to ensure the reliability; 5) Temporal reasoning required: The annotated question types encompass both static single-frame understanding and dynamic temporal reasoning, explicitly evaluating LVLMs factuality under the long-context dependencies. We extensively evaluate 41 state-of-the-art LVLMs and summarize key findings as follows: 1) Current LVLMs exhibit notable deficiencies in factual adherence, particularly for open-source models. The best-performing model Gemini-1.5-Pro achieves merely an F-score of 54.4%; 2) Test-time compute paradigms show insignificant performance gains, revealing fundamental constraints for enhancing factuality through post-hoc computation; 3) Retrieval-Augmented Generation demonstrates consistent improvements at the cost of additional inference time overhead, presenting a critical efficiency-performance trade-off.
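The abstract reports an F-score but does not define it. The sketch below assumes the convention used by SimpleQA-style benchmarks, where each answer is graded as correct, incorrect, or not attempted, and the F-score is the harmonic mean of overall accuracy and accuracy on attempted questions. Whether Video SimpleQA uses exactly this definition should be checked against the paper.

```python
def f_score(num_correct, num_incorrect, num_not_attempted):
    """Harmonic mean of overall accuracy and accuracy given an attempt.

    Assumed SimpleQA-style definition; not confirmed by the abstract.
    """
    total = num_correct + num_incorrect + num_not_attempted
    attempted = num_correct + num_incorrect
    if total == 0 or attempted == 0 or num_correct == 0:
        return 0.0
    overall_correct = num_correct / total
    correct_given_attempted = num_correct / attempted
    return (2 * overall_correct * correct_given_attempted
            / (overall_correct + correct_given_attempted))


if __name__ == "__main__":
    # Hypothetical grading counts, not results from the paper.
    print(f_score(num_correct=50, num_incorrect=25, num_not_attempted=25))
```

Under this definition a model that declines to answer when unsure is penalized less than one that answers wrongly, because abstentions lower only the first term.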

[CV-12] Building Blocks for Robust and Effective Semi-Supervised Real-World Object Detection

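[Quick Read]: This paper addresses the challenges that semi-supervised object detection (SSOD) faces in real-world applications, including class imbalance, label noise, and labeling errors. It first presents an in-depth analysis that uncovers the causes of suboptimal pseudo-labeling and the trade-offs between label quality and quantity. Based on these findings, it proposes four building blocks that integrate seamlessly into an SSOD framework: Rare Class Collage (RCC), a data augmentation method that strengthens rare-class representation by creating collages of rare objects; Rare Class Focus (RCF), a stratified batch sampling strategy that ensures a more balanced representation of all classes during training; Ground Truth Label Correction (GLC), a label refinement method that uses the consistency of teacher model predictions to identify and correct false, missing, and noisy ground truth labels; and Pseudo-Label Selection (PLS), which removes low-quality pseudo-labeled images using a novel metric that estimates the missing detection rate while accounting for class rarity. Experiments on autonomous driving datasets show up to a 6% increase in SSOD performance.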

Link: https://arxiv.org/abs/2503.18903
Authors: Moussa Kassem Sbeyti, Nadja Klein, Azarm Nowzad, Fikret Sivrikaya, Sahin Albayrak
Affiliations: Scientific Computing Center, Karlsruhe Institute of Technology; DAI-Labor, Technische Universität Berlin; Continental AG
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Accepted to Transactions on Machine Learning Research (TMLR). OpenReview: this https URL

Abstract:Semi-supervised object detection (SSOD) based on pseudo-labeling significantly reduces dependence on large labeled datasets by effectively leveraging both labeled and unlabeled data. However, real-world applications of SSOD often face critical challenges, including class imbalance, label noise, and labeling errors. We present an in-depth analysis of SSOD under real-world conditions, uncovering causes of suboptimal pseudo-labeling and key trade-offs between label quality and quantity. Based on our findings, we propose four building blocks that can be seamlessly integrated into an SSOD framework. Rare Class Collage (RCC): a data augmentation method that enhances the representation of rare classes by creating collages of rare objects. Rare Class Focus (RCF): a stratified batch sampling strategy that ensures a more balanced representation of all classes during training. Ground Truth Label Correction (GLC): a label refinement method that identifies and corrects false, missing, and noisy ground truth labels by leveraging the consistency of teacher model predictions. Pseudo-Label Selection (PLS): a selection method for removing low-quality pseudo-labeled images, guided by a novel metric estimating the missing detection rate while accounting for class rarity. We validate our methods through comprehensive experiments on autonomous driving datasets, resulting in up to 6% increase in SSOD performance. Overall, our investigation and novel, data-centric, and broadly applicable building blocks enable robust and effective SSOD in complex, real-world scenarios. Code is available at this https URL.
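Stratified batch sampling, the idea behind Rare Class Focus, can be illustrated with a small sketch. The version below is one simple way to do it, assumed for illustration: batch slots cycle through the classes in round-robin order and each slot draws, with replacement, an image that contains that class. The paper's actual sampling rule may differ.

```python
import random
from collections import defaultdict


def stratified_batch(image_classes, batch_size, rng):
    """Draw a batch whose slots cycle through all classes in round-robin order.

    image_classes maps an image id to the set of class names it contains.
    """
    by_class = defaultdict(list)
    for image_id, classes in image_classes.items():
        for name in classes:
            by_class[name].append(image_id)
    class_names = sorted(by_class)
    batch = []
    for slot in range(batch_size):
        target_class = class_names[slot % len(class_names)]
        # Sampling with replacement: rare-class images may repeat within a batch.
        batch.append(rng.choice(by_class[target_class]))
    return batch


if __name__ == "__main__":
    # Hypothetical dataset: 98 images with only cars, 2 images with only trams.
    data = {f"img{i}": {"car"} for i in range(98)}
    data.update({"img98": {"tram"}, "img99": {"tram"}})
    print(stratified_batch(data, batch_size=8, rng=random.Random(0)))
```

With uniform sampling a batch of 8 would contain a tram image only rarely; here half of the slots are reserved for the rare class, at the price of repeating the same few images.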

[CV-13] Online 3D Scene Reconstruction Using Neural Object Priors

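[Quick Read]: This paper addresses online, object-level scene reconstruction from an RGB-D video sequence. Current object-aware neural implicit representations are limited in online reconstruction efficiency and shape completion. The paper makes two key contributions. First, a feature grid interpolation mechanism continuously updates grid-based, object-centric neural implicit representations as new object parts are revealed. Second, an object library of previously mapped objects is built in advance, and its shape priors are used to initialize geometric object models in new videos; these models are then completed with novel views and synthesized past views to avoid losing original object details. Experiments on synthetic Replica environments, real-world ScanNet sequences, and videos captured in the authors' laboratory show better reconstruction accuracy and completeness than state-of-the-art neural implicit models.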

Link: https://arxiv.org/abs/2503.18897
Authors: Thomas Chabal, Shizhe Chen, Jean Ponce, Cordelia Schmid
Affiliations: Inria; Department of Computer Science, École normale supérieure (ENS-PSL, CNRS, Inria); Courant Institute of Mathematical Sciences and Center for Data Science, New York University
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: 3DV 2025. Project page: this https URL

Abstract:This paper addresses the problem of reconstructing a scene online at the level of objects given an RGB-D video sequence. While current object-aware neural implicit representations hold promise, they are limited in online reconstruction efficiency and shape completion. Our main contributions to alleviate the above limitations are twofold. First, we propose a feature grid interpolation mechanism to continuously update grid-based object-centric neural implicit representations as new object parts are revealed. Second, we construct an object library with previously mapped objects in advance and leverage the corresponding shape priors to initialize geometric object models in new videos, subsequently completing them with novel views as well as synthesized past views to avoid losing original object details. Extensive experiments on synthetic environments from the Replica dataset, real-world ScanNet sequences and videos captured in our laboratory demonstrate that our approach outperforms state-of-the-art neural implicit models for this task in terms of reconstruction accuracy and completeness.
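One plausible reading of "feature grid interpolation" is that when an object's extent changes, the existing grid features are resampled into the new grid instead of being learned again from scratch. The sketch below illustrates that reading with trilinear interpolation on a scalar, cubic grid. The cubic shape, scalar features, and the `resample` helper are simplifying assumptions, not the paper's mechanism.

```python
def trilinear(grid, x, y, z):
    """Interpolate a scalar feature grid[i][j][k] at a continuous in-bounds point."""
    def axis(coord, size):
        lower = min(int(coord), size - 1)
        upper = min(lower + 1, size - 1)
        return lower, upper, coord - lower

    x0, x1, tx = axis(x, len(grid))
    y0, y1, ty = axis(y, len(grid[0]))
    z0, z1, tz = axis(z, len(grid[0][0]))
    value = 0.0
    # Sum the eight surrounding cells, each weighted by its trilinear weight.
    for xi, wx in ((x0, 1.0 - tx), (x1, tx)):
        for yi, wy in ((y0, 1.0 - ty), (y1, ty)):
            for zi, wz in ((z0, 1.0 - tz), (z1, tz)):
                value += wx * wy * wz * grid[xi][yi][zi]
    return value


def resample(grid, new_size):
    """Resample a cubic grid to new_size^3 cells (new_size >= 2) over the same extent."""
    scale = (len(grid) - 1) / (new_size - 1)
    return [[[trilinear(grid, i * scale, j * scale, k * scale)
              for k in range(new_size)]
             for j in range(new_size)]
            for i in range(new_size)]


if __name__ == "__main__":
    coarse = [[[i + 2 * j + 3 * k for k in range(2)] for j in range(2)] for i in range(2)]
    fine = resample(coarse, 3)
    print(fine[1][1][1])
```

Trilinear interpolation reproduces any function that is linear in each axis exactly, which makes a linear test grid a convenient sanity check.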

[CV-14] CFG-Zero*: Improved Classifier-Free Guidance for Flow Matching Models

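[Quick Read]: This paper addresses a failure mode of Classifier-Free Guidance (CFG) in flow matching models: in the early stages of training, when the flow estimate is inaccurate, CFG directs samples toward incorrect trajectories. The proposed CFG-Zero* makes two contributions: (a) optimized scale, in which a scalar is optimized to correct for inaccuracies in the estimated velocity; and (b) zero-init, which zeros out the first few steps of the ODE solver. Together these improve guidance for flow matching models, consistently outperforming CFG on both text-to-image and text-to-video generation.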

Link: https://arxiv.org/abs/2503.18886
Authors: Weichen Fan, Amber Yijia Zheng, Raymond A. Yeh, Ziwei Liu
Affiliations: S-Lab, Nanyang Technological University; Department of Computer Science, Purdue University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Classifier-Free Guidance (CFG) is a widely adopted technique in diffusion/flow models to improve image fidelity and controllability. In this work, we first analytically study the effect of CFG on flow matching models trained on Gaussian mixtures where the ground-truth flow can be derived. We observe that in the early stages of training, when the flow estimation is inaccurate, CFG directs samples toward incorrect trajectories. Building on this observation, we propose CFG-Zero*, an improved CFG with two contributions: (a) optimized scale, where a scalar is optimized to correct for the inaccuracies in the estimated velocity, hence the * in the name; and (b) zero-init, which involves zeroing out the first few steps of the ODE solver. Experiments on both text-to-image (Lumina-Next, Stable Diffusion 3, and Flux) and text-to-video (Wan-2.1) generation demonstrate that CFG-Zero* consistently outperforms CFG, highlighting its effectiveness in guiding Flow Matching models. (Code is available at this http URL)
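The two ingredients named in the abstract can be sketched on plain vectors. The abstract only says that a scalar is optimized; the sketch assumes the scalar is the least-squares projection coefficient of the conditional velocity onto the unconditional one, and that guidance then extrapolates from the rescaled unconditional velocity. The function names and the default of one zeroed step are illustrative.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def optimized_scale(v_cond, v_uncond, eps=1e-12):
    """Scalar s minimizing ||v_cond - s * v_uncond||^2 (assumed form of the scale)."""
    return dot(v_cond, v_uncond) / (dot(v_uncond, v_uncond) + eps)


def guided_velocity(v_cond, v_uncond, guidance, step, zero_init_steps=1):
    """Guided velocity with zero-init for the first ODE steps."""
    if step < zero_init_steps:
        # Zero-init: skip the update while the velocity estimate is least reliable.
        return [0.0] * len(v_cond)
    s = optimized_scale(v_cond, v_uncond)
    # Standard CFG extrapolation, but starting from the rescaled unconditional term.
    return [s * u + guidance * (c - s * u) for c, u in zip(v_cond, v_uncond)]


if __name__ == "__main__":
    cond, uncond = [2.0, 0.0], [1.0, 0.0]   # hypothetical model outputs
    print(guided_velocity(cond, uncond, guidance=3.0, step=0))
    print(guided_velocity(cond, uncond, guidance=3.0, step=5))
```

One consequence of this form: when the two velocities are parallel, the rescaled unconditional term equals the conditional one and the guidance weight has no effect, whereas plain CFG would still amplify the difference in magnitude.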

[CV-15] Efficient and Accurate Scene Text Recognition with Cascaded-Transformers

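[Quick Read]: This paper addresses the high computational and memory demands of Scene Text Recognition (STR) models that pair a vision transformer with a text decoder, which limit deployment in resource-constrained applications. The authors propose an efficient and accurate STR system whose key component is a cascaded-transformers structure that improves encoder efficiency: it progressively reduces the vision token size during encoding, eliminating redundant tokens and lowering computational cost. Experiments show performance comparable to state-of-the-art baselines with substantially lower compute; for large models, accuracy is nearly unchanged (92.77 to 92.68) while computational complexity is almost halved.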

Link: https://arxiv.org/abs/2503.18883
Authors: Savas Ozkan, Andrea Maracani, Hyowon Kim, Sijun Cho, Eunchung Noh, Jeongwon Min, Jung Min Cho, Mete Ozay
Affiliations: Samsung Research (United Kingdom); Samsung Electronics (South Korea)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to ACM-MMSys2025

Abstract:In recent years, vision transformers with text decoder have demonstrated remarkable performance on Scene Text Recognition (STR) due to their ability to capture long-range dependencies and contextual relationships with high learning capacity. However, the computational and memory demands of these models are significant, limiting their deployment in resource-constrained applications. To address this challenge, we propose an efficient and accurate STR system. Specifically, we focus on improving the efficiency of encoder models by introducing a cascaded-transformers structure. This structure progressively reduces the vision token size during the encoding step, effectively eliminating redundant tokens and reducing computational cost. Our experimental results confirm that our STR system achieves comparable performance to state-of-the-art baselines while substantially decreasing computational requirements. In particular, for large-models, the accuracy remains same, 92.77 to 92.68, while computational complexity is almost halved with our structure.
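To see why shrinking the token set between encoder stages saves compute, the sketch below counts tokens per stage and estimates multiply-accumulates with a common rule of thumb for transformer layers (about 12·n·d² for the projections and MLP plus 2·n²·d for attention). The keep ratios, layer counts, and width are made-up values, and how the paper chooses which tokens to drop is not modeled.

```python
def cascaded_token_counts(num_tokens, keep_ratios):
    """Token count entering each stage when every stage keeps a fraction of tokens."""
    counts = []
    current = num_tokens
    for ratio in keep_ratios:
        current = max(1, int(current * ratio))
        counts.append(current)
    return counts


def layer_macs(num_tokens, dim):
    """Rough multiply-accumulate count for one transformer layer."""
    projections_and_mlp = 12 * num_tokens * dim * dim
    attention = 2 * num_tokens * num_tokens * dim
    return projections_and_mlp + attention


def encoder_macs(token_counts, layers_per_stage, dim):
    """Total estimate for an encoder with a fixed number of layers per stage."""
    return sum(layers_per_stage * layer_macs(n, dim) for n in token_counts)


if __name__ == "__main__":
    cascaded = cascaded_token_counts(128, [1.0, 0.5, 0.5])   # hypothetical ratios
    baseline = [128, 128, 128]
    ratio = encoder_macs(cascaded, 4, 384) / encoder_macs(baseline, 4, 384)
    print(cascaded, round(ratio, 3))
```

Because later stages see fewer tokens, both the linear and the quadratic terms shrink, so the saving grows with the number of layers placed after each reduction.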

[CV-16] Seeing Speech and Sound: Distinguishing and Locating Audios in Visual Scenes CVPR2025

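[Quick Read]: This paper addresses a key limitation of current audio-visual grounding models: they cannot ground speech and non-speech sounds simultaneously when the two are mixed. Existing approaches typically handle the two sound types independently or, at best, together but sequentially without mixing. The authors propose a 'mix-and-separate' framework with audio-visual alignment objectives that jointly learn correspondence and disentanglement from mixed audio, so that the model produces distinct embeddings for each audio type and can disentangle and ground mixed audio sources. They also build a new dataset for evaluating simultaneous grounding of mixed audio sources, and report comparable or better performance on standard segmentation and cross-modal retrieval tasks.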

Link: https://arxiv.org/abs/2503.18880
Authors: Hyeonggon Ryu, Seongyu Kim, Joon Son Chung, Arda Senocak
Affiliations: Korea Advanced Institute of Science and Technology
Categories: Computer Vision and Pattern Recognition (cs.CV); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments: CVPR 2025

Abstract:We present a unified model capable of simultaneously grounding both spoken language and non-speech sounds within a visual scene, addressing key limitations in current audio-visual grounding models. Existing approaches are typically limited to handling either speech or non-speech sounds independently, or at best, together but sequentially without mixing. This limitation prevents them from capturing the complexity of real-world audio sources that are often mixed. Our approach introduces a ‘mix-and-separate’ framework with audio-visual alignment objectives that jointly learn correspondence and disentanglement using mixed audio. Through these objectives, our model learns to produce distinct embeddings for each audio type, enabling effective disentanglement and grounding across mixed audio sources. Additionally, we created a new dataset to evaluate simultaneous grounding of mixed audio sources, demonstrating that our model outperforms prior methods. Our approach also achieves comparable or better performance in standard segmentation and cross-modal retrieval tasks, highlighting the benefits of our mix-and-separate approach.

[CV-17] A semantic communication-based workload-adjustable transceiver for wireless AI-generated content (AIGC) delivery

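[Quick Read]: This paper targets the main challenges of delivering high-quality AI-generated content (AIGC) services over wireless networks: unstable channels, limited bandwidth, and unevenly distributed computational resources. It proposes a semantic communication (SemCom) based Resource-aware wOrkload-adjUstable TransceivEr (ROUTE) for AIGC delivery in dynamic wireless networks. SemCom is used to prioritize the semantic information of the generated content, relieving the communication resource bottleneck. Modified diffusion-based models then adjust the computing workload and semantic density during cooperative content generation, improving computational resource utilization and reducing semantic distortion in transmission. Simulations verify ROUTE's advantage in latency and content quality over conventional AIGC approaches.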

Link: https://arxiv.org/abs/2503.18874
Authors: Runze Cheng, Yao Sun, Lan Zhang, Lei Feng, Lei Zhang, Muhammad Ali Imran
Affiliations: James Watt School of Engineering, University of Glasgow; Department of Electrical and Computer Engineering, Clemson University; Beijing University of Posts and Telecommunications
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:With the significant advances in generative AI (GAI) and the proliferation of mobile devices, providing high-quality AI-generated content (AIGC) services via wireless networks is becoming the future direction. However, the primary challenges of AIGC service delivery in wireless networks lie in unstable channels, limited bandwidth resources, and unevenly distributed computational resources. In this paper, we employ semantic communication (SemCom) in diffusion-based GAI models to propose a Resource-aware wOrkload-adjUstable TransceivEr (ROUTE) for AIGC delivery in dynamic wireless networks. Specifically, to relieve the communication resource bottleneck, SemCom is utilized to prioritize semantic information of the generated content. Then, to improve computational resource utilization in both edge and local and reduce AIGC semantic distortion in transmission, modified diffusion-based models are applied to adjust the computing workload and semantic density in cooperative content generation. Simulations verify the superiority of our proposed ROUTE in terms of latency and content quality compared to conventional AIGC approaches.

[CV-18] Efficient Self-Supervised Adaptation for Medical Image Analysis

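[Quick Read]: This paper addresses the prohibitive computational cost of self-supervised adaptation (SSA) in medical domains. Parameter-efficient fine-tuning methods such as LoRA have been explored for supervised adaptation, but their effectiveness for SSA was unknown. The authors introduce efficient self-supervised adaptation (ESSA), a framework that applies parameter-efficient fine-tuning techniques to SSA to reduce computational cost and improve adaptation performance. Among the methods tested, Attention Projection Layer Adaptation (APLA) performs best: it consistently surpasses full-parameter SSA and supervised fine-tuning across diverse medical tasks while reducing GPU memory by up to 40.1%, increasing training throughput by 25.2%, and maintaining inference efficiency.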

Link: https://arxiv.org/abs/2503.18873
Authors: Moein Sorkhei, Emir Konuk, Jingyu Guo, Christos Matsoukas, Kevin Smith
Affiliations: KTH Royal Institute of Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Self-supervised adaptation (SSA) improves foundation model transfer to medical domains but is computationally prohibitive. Although parameter efficient fine-tuning methods such as LoRA have been explored for supervised adaptation, their effectiveness for SSA remains unknown. In this work, we introduce efficient self-supervised adaptation (ESSA), a framework that applies parameter-efficient fine-tuning techniques to SSA with the aim of reducing computational cost and improving adaptation performance. Among the methods tested, Attention Projection Layer Adaptation (APLA) sets a new state-of-the-art, consistently surpassing full-parameter SSA and supervised fine-tuning across diverse medical tasks, while reducing GPU memory by up to 40.1% and increasing training throughput by 25.2%, all while maintaining inference efficiency.

[CV-19] Curriculum Coarse-to-Fine Selection for High-IPC Dataset Distillation CVPR2025

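[Quick Read]: This paper addresses the loss of effectiveness of dataset distillation (DD) in high images-per-class (high-IPC) settings. Prior work shows that combining distilled and real data mitigates this decay, but the current one-shot, independent selection mechanism induces an incompatibility between distilled and real images. The authors propose curriculum coarse-to-fine selection (CCFS), which selects real data within a curriculum framework: in each curriculum, a coarse-to-fine strategy picks appropriate real data based on the current synthetic dataset. Under high-IPC settings, CCFS surpasses the state of the art by +6.6% on CIFAR-10, +5.8% on CIFAR-100, and +3.4% on Tiny-ImageNet. With a 20% compression ratio of Tiny-ImageNet, it reaches 60.2% test accuracy on ResNet-18, only 0.3% below full-dataset training.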

Link: https://arxiv.org/abs/2503.18872
Authors: Yanda Chen, Gongwei Chen, Miao Zhang, Weili Guan, Liqiang Nie
Affiliations: School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by CVPR2025

Abstract:Dataset distillation (DD) excels in synthesizing a small number of images per class (IPC) but struggles to maintain its effectiveness in high-IPC settings. Recent works on dataset distillation demonstrate that combining distilled and real data can mitigate the effectiveness decay. However, our analysis of the combination paradigm reveals that the current one-shot and independent selection mechanism induces an incompatibility issue between distilled and real images. To address this issue, we introduce a novel curriculum coarse-to-fine selection (CCFS) method for efficient high-IPC dataset distillation. CCFS employs a curriculum selection framework for real data selection, where we leverage a coarse-to-fine strategy to select appropriate real data based on the current synthetic dataset in each curriculum. Extensive experiments validate CCFS, surpassing the state-of-the-art by +6.6% on CIFAR-10, +5.8% on CIFAR-100, and +3.4% on Tiny-ImageNet under high-IPC settings. Notably, CCFS achieves 60.2% test accuracy on ResNet-18 with a 20% compression ratio of Tiny-ImageNet, closely matching full-dataset training with only 0.3% degradation. Code: this https URL.

[CV-20] Exploring the Integration of Key-Value Attention Into Pure and Hybrid Transformers for Semantic Segmentation

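[Quick Read]: This paper asks whether KV Transformers can match or approach the performance of conventional Transformers on semantic segmentation, particularly in medical imaging, while reducing complexity and memory usage. By directly comparing QKV and KV variants of the same base architectures, the authors observe a notable reduction in parameter count and multiply-accumulate operations, with most KV variants achieving performance similar to their QKV counterparts. This makes the approach attractive for use cases that require local inference, such as medical screening.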

Link: https://arxiv.org/abs/2503.18862
Authors: DeShin Hwa, Tobias Holmes, Klaus Drechsler
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 6 pages, 3 figures, Preprint. Final version published in: Bildverarbeitung für die Medizin 2025, Springer. DOI: this https URL

Abstract:While CNNs were long considered state of the art for image processing, the introduction of Transformer architectures has challenged this position. While achieving excellent results in image classification and segmentation, Transformers remain inherently reliant on large training datasets and remain computationally expensive. A newly introduced Transformer derivative named KV Transformer shows promising results in synthetic, NLP, and image classification tasks, while reducing complexity and memory usage. This is especially conducive to use cases where local inference is required, such as medical screening applications. We endeavoured to further evaluate the merit of KV Transformers on semantic segmentation tasks, specifically in the domain of medical imaging. By directly comparing traditional and KV variants of the same base architectures, we provide further insight into the practical tradeoffs of reduced model complexity. We observe a notable reduction in parameter count and multiply accumulate operations, while achieving similar performance from most of the KV variant models when directly compared to their QKV implementation.
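The parameter saving from dropping the query projection is easy to quantify. The sketch below counts the input projection weights of one attention block with and without a query matrix, and computes attention scores from keys alone. It assumes that a KV variant reuses the keys in place of queries, which yields a symmetric score matrix; the exact formulation evaluated in the paper may differ.

```python
import math


def attention_projection_params(dim, use_query=True, bias=False):
    """Weights in the input projections: Q, K, V (QKV) or only K, V (KV)."""
    num_matrices = 3 if use_query else 2
    per_matrix = dim * dim + (dim if bias else 0)
    return num_matrices * per_matrix


def kv_scores(keys):
    """Pre-softmax attention scores when keys also act as queries (assumed form)."""
    dim = len(keys[0])
    return [[sum(a * b for a, b in zip(ki, kj)) / math.sqrt(dim) for kj in keys]
            for ki in keys]


if __name__ == "__main__":
    qkv = attention_projection_params(64, use_query=True)
    kv = attention_projection_params(64, use_query=False)
    print(qkv, kv, kv / qkv)
    print(kv_scores([[1.0, 2.0], [3.0, 4.0]]))
```

Under this assumption the input projections shrink by one third per attention block. The overall reduction for a full model is smaller because MLP and output projection weights are unchanged.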

[CV-21] HunyuanPortrait: Implicit Condition Control for Enhanced Portrait Animation CVPR2025

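[Quick Read]: This paper addresses highly controllable and lifelike portrait animation from a single portrait image driven by video clips. The proposed HunyuanPortrait is a diffusion-based condition control method that uses implicit representations to decouple portrait motion information from identity in videos; the motion information then serves as the control signal in the animation phase. Adapter layers inject the control signals into the denoising UNet through attention mechanisms, yielding spatially rich detail and temporal consistency. The framework also generalizes well, effectively disentangling appearance and motion under different image styles.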

Link: https://arxiv.org/abs/2503.18860
Authors: Zunnan Xu, Zhentao Yu, Zixiang Zhou, Jun Zhou, Xiaoyu Jin, Fa-Ting Hong, Xiaozhong Ji, Junwei Zhu, Chengfei Cai, Shiyu Tang, Qin Lin, Xiu Li, Qinglin Lu
Affiliations: Tsinghua University; Hunyuan, Tencent; Sun Yat-Sen University; HKUST
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: CVPR 2025

Abstract:We introduce HunyuanPortrait, a diffusion-based condition control method that employs implicit representations for highly controllable and lifelike portrait animation. Given a single portrait image as an appearance reference and video clips as driving templates, HunyuanPortrait can animate the character in the reference image by the facial expression and head pose of the driving videos. In our framework, we utilize pre-trained encoders to achieve the decoupling of portrait motion information and identity in videos. To do so, implicit representation is adopted to encode motion information and is employed as control signals in the animation phase. By leveraging the power of stable video diffusion as the main building block, we carefully design adapter layers to inject control signals into the denoising unet through attention mechanisms. These bring spatial richness of details and temporal consistency. HunyuanPortrait also exhibits strong generalization performance, which can effectively disentangle appearance and motion under different image styles. Our framework outperforms existing methods, demonstrating superior temporal consistency and controllability. Our project is available at this https URL.

[CV-22] MC-LLaVA: Multi-Concept Personalized Vision-Language Model

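[Quick Read]: This paper addresses a limitation of personalization in current vision-language models (VLMs): existing methods focus on single-concept personalization and neglect the existence and interplay of multiple concepts, which limits real-world applicability. It proposes MC-LLaVA, a multi-concept personalization paradigm with three main components: (1) a multi-concept instruction tuning strategy that integrates multiple concepts in a single training step; (2) a personalized textual prompt that uses visual token information to initialize concept tokens, reducing the cost of joint training; and (3) a personalized visual prompt at inference time that aggregates location confidence maps for better recognition and grounding. The paper also contributes a high-quality instruction tuning dataset of multi-character and multi-object images collected from movies, with manually generated question-answer samples for multi-concept scenarios.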

Link: https://arxiv.org/abs/2503.18854
Authors: Ruichuan An, Sihan Yang, Ming Lu, Renrui Zhang, Kai Zeng, Yulin Luo, Jiajun Cao, Hao Liang, Ying Chen, Qi She, Shanghang Zhang, Wentao Zhang
Affiliations: Peking University; School of Software Engineering, Xi'an JiaoTong University; Intel Labs, China; CUHK MMLab; Tianjin University; ByteDance
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: The code and dataset will be publicly available at https://github.com/arctanxarc/MC-LLaVA

Abstract:Current vision-language models (VLMs) show exceptional abilities across diverse tasks, such as visual question answering. To enhance user experience, recent studies investigate VLM personalization to understand user-provided concepts. However, they mainly focus on single-concept personalization, neglecting the existence and interplay of multiple concepts, which limits real-world applicability. This paper proposes the first multi-concept personalization paradigm, MC-LLaVA. Specifically, MC-LLaVA employs a multi-concept instruction tuning strategy, effectively integrating multiple concepts in a single training step. To reduce the costs related to joint training, we propose a personalized textual prompt that uses visual token information to initialize concept tokens. Additionally, we introduce a personalized visual prompt during inference, aggregating location confidence maps for enhanced recognition and grounding capabilities. To advance multi-concept personalization research, we further contribute a high-quality instruction tuning dataset. We carefully collect images with multiple characters and objects from movies and manually generate question-answer samples for multi-concept scenarios, featuring superior diversity. Comprehensive qualitative and quantitative experiments demonstrate that MC-LLaVA can achieve impressive multi-concept personalized responses, paving the way for VLMs to become better user-specific assistants. The code and dataset will be publicly available at this https URL.

[CV-23] 3DSwapping: Texture Swapping For 3D Object From Single Reference Image

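[Quick read]: This paper addresses the lack of a dedicated method for 3D texture swapping. Adapted 2D editing approaches require frame-by-frame manipulation, which makes consistency across views hard to guarantee, while text-driven 3D editing approaches struggle to preserve the texture characteristics of a reference image. To overcome these problems, the paper proposes 3DSwapping, which combines three strategies to achieve high-quality texture transfer with structural consistency: 1) progressive generation, which starts from a single reference image and gradually propagates the edit to adjacent views to keep the views consistent; 2) view-consistency gradient guidance, which further reinforces consistency by conditioning the generation model on feature differences between consistent and inconsistent outputs; and 3) prompt-tuned gradient guidance, which learns a token that captures the difference between the reference image and the 3D object and uses it to guide editing so that texture characteristics are preserved across views. Together, these strategies let 3DSwapping swap a 2D texture onto a 3D object more convincingly while preserving structural coherence.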

Link: https://arxiv.org/abs/2503.18853
Authors: Xiao Cao, Beibei Lin, Bo Wang, Zhiyong Huang, Robby T. Tan
Affiliations: National University of Singapore; University of Mississippi
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:3D texture swapping allows for the customization of 3D object textures, enabling efficient and versatile visual transformations in 3D editing. While no dedicated method exists, adapted 2D editing and text-driven 3D editing approaches can serve this purpose. However, 2D editing requires frame-by-frame manipulation, causing inconsistencies across views, while text-driven 3D editing struggles to preserve texture characteristics from reference images. To tackle these challenges, we introduce 3DSwapping, a 3D texture swapping method that integrates: 1) progressive generation, 2) view-consistency gradient guidance, and 3) prompt-tuned gradient guidance. To ensure view consistency, our progressive generation process starts by editing a single reference image and gradually propagates the edits to adjacent views. Our view-consistency gradient guidance further reinforces consistency by conditioning the generation model on feature differences between consistent and inconsistent outputs. To preserve texture characteristics, we introduce prompt-tuning-based gradient guidance, which learns a token that precisely captures the difference between the reference image and the 3D object. This token then guides the editing process, ensuring more consistent texture preservation across views. Overall, 3DSwapping integrates these novel strategies to achieve higher-fidelity texture transfer while preserving structural coherence across multiple viewpoints. Extensive qualitative and quantitative evaluations confirm that our three novel components enable convincing and effective 2D texture swapping for 3D objects. Code will be available upon acceptance.

[CV-24] DAGait: Generalized Skeleton-Guided Data Alignment for Gait Recognition

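[Quick read]: This paper addresses the sharp performance drop of gait recognition on in-the-wild datasets caused by inconsistent spatio-temporal distributions. Existing methods work well on controlled laboratory datasets but degrade noticeably in real-world use, where pedestrians appear at varying angles, positions, and distances. The paper proposes a skeleton-guided silhouette alignment strategy that uses prior skeleton knowledge to apply affine transformations to the corresponding silhouettes and so align them spatially. According to the authors, this is the first study of the impact of data alignment on gait recognition. Experiments across multiple datasets and network architectures show clear gains: an average improvement of 7.9% on the challenging Gait3D dataset and accuracy improvements of up to 24.0% on cross-domain datasets. The key idea is that skeleton-guided spatial alignment of silhouettes reduces the harm caused by distribution inconsistency.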

Link: https://arxiv.org/abs/2503.18830
Authors: Zhengxian Wu, Chuanrui Zhang, Hangrui Xu, Peng Jiao, Haoqian Wang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Gait recognition is emerging as a promising and innovative area within the field of computer vision, widely applied to remote person identification. Although existing gait recognition methods have achieved substantial success in controlled laboratory datasets, their performance often declines significantly when transitioning to wild datasets. We argue that the performance gap can be primarily attributed to the spatio-temporal distribution inconsistencies present in wild datasets, where subjects appear at varying angles, positions, and distances across the frames. To achieve accurate gait recognition in the wild, we propose a skeleton-guided silhouette alignment strategy, which uses prior knowledge of the skeletons to perform affine transformations on the corresponding silhouettes. To the best of our knowledge, this is the first study to explore the impact of data alignment on gait recognition. We conducted extensive experiments across multiple datasets and network architectures, and the results demonstrate the significant advantages of our proposed alignment strategy. Specifically, on the challenging Gait3D dataset, our method achieved an average performance improvement of 7.9% across all evaluated networks. Furthermore, our method achieves substantial improvements on cross-domain datasets, with accuracy improvements of up to 24.0%.
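
The skeleton-guided alignment idea can be illustrated with a small sketch. This is not the authors' implementation: it assumes only two keypoints (neck and hip), a restricted affine map (uniform scale plus translation, no rotation), and nearest-neighbour sampling. The function name `align_silhouette` and the canonical target positions are hypothetical.

```python
import math


def align_silhouette(sil, neck, hip, target_neck=(1, 4), target_hip=(6, 4),
                     out_size=(8, 8)):
    """Warp a binary silhouette so its neck->hip segment matches a canonical one.

    sil is a list of rows of 0/1 values; neck and hip are (row, col) keypoints.
    The map is dst = scale * src + shift (an assumption, not the paper's exact transform).
    """
    scale = math.dist(target_neck, target_hip) / math.dist(neck, hip)
    shift_r = target_neck[0] - scale * neck[0]
    shift_c = target_neck[1] - scale * neck[1]
    out_h, out_w = out_size
    out = [[0] * out_w for _ in range(out_h)]
    for r in range(out_h):
        for c in range(out_w):
            # Inverse mapping: find the source pixel that lands on (r, c).
            src_r = round((r - shift_r) / scale)
            src_c = round((c - shift_c) / scale)
            if 0 <= src_r < len(sil) and 0 <= src_c < len(sil[0]):
                out[r][c] = sil[src_r][src_c]
    return out
```

With this kind of normalization, subjects that appear at different positions and distances end up with the torso at the same place and size in every frame, which is the kind of spatial consistency the abstract argues for.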

[CV-25] Enhanced OoD Detection through Cross-Modal Alignment of Multi-Modal Representations CVPR2025

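[Quick read]: This paper aims to improve the out-of-distribution detection (OoDD) performance of multi-modal models. Existing methods typically exploit the multi-modal representations of large pretrained vision-language models such as CLIP by freezing or only partially tuning the pretrained weights, which may not fully use the pretrained knowledge and so limits performance. The paper's key contribution is a multi-modal fine-tuning (MMFT) strategy with a new training objective that targets the modality gap within in-distribution (ID) embeddings. The objective regularizes the distances between image and text embeddings to strengthen cross-modal alignment and make better use of pretrained textual information. A theoretical analysis shows that this regularization corresponds to maximum likelihood estimation of an energy-based model on a hypersphere. Combined with post-hoc OoDD methods such as NegLabel, the method achieves state-of-the-art OoDD performance and leading ID accuracy on the ImageNet-1k OoD benchmark datasets.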

Link: https://arxiv.org/abs/2503.18817
Authors: Jeonghyeon Kim, Sangheum Hwang
Affiliations: Seoul National University of Science and Technology
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: CVPR 2025

Abstract:Prior research on out-of-distribution detection (OoDD) has primarily focused on single-modality models. Recently, with the advent of large-scale pretrained vision-language models such as CLIP, OoDD methods utilizing such multi-modal representations through zero-shot and prompt learning strategies have emerged. However, these methods typically involve either freezing the pretrained weights or only partially tuning them, which can be suboptimal for downstream datasets. In this paper, we highlight that multi-modal fine-tuning (MMFT) can achieve notable OoDD performance. Despite some recent works demonstrating the impact of fine-tuning methods for OoDD, there remains significant potential for performance improvement. We investigate the limitation of naïve fine-tuning methods, examining why they fail to fully leverage the pretrained knowledge. Our empirical analysis suggests that this issue could stem from the modality gap within in-distribution (ID) embeddings. To address this, we propose a training objective that enhances cross-modal alignment by regularizing the distances between image and text embeddings of ID data. This adjustment helps in better utilizing pretrained textual information by aligning similar semantics from different modalities (i.e., text and image) more closely in the hyperspherical representation space. We theoretically demonstrate that the proposed regularization corresponds to the maximum likelihood estimation of an energy-based model on a hypersphere. Utilizing ImageNet-1k OoD benchmark datasets, we show that our method, combined with post-hoc OoDD approaches leveraging pretrained knowledge (e.g., NegLabel), significantly outperforms existing methods, achieving state-of-the-art OoDD performance and leading ID accuracy.
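
To make the idea of regularizing image-text distances on a hypersphere concrete, here is a minimal sketch. The exact objective used in the paper is not given in the abstract, so the penalty below (mean squared Euclidean distance between L2-normalized image embeddings and the text embedding of their class) is an assumption, and `alignment_penalty` is a hypothetical name.

```python
import math


def l2_normalize(vec):
    """Project a vector onto the unit hypersphere."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def alignment_penalty(image_embs, text_embs, labels):
    """Mean squared distance between each image embedding and its class text embedding.

    For unit vectors this equals 2 - 2 * cosine similarity, so it is 0 when the
    two modalities point the same way and 2 when they are orthogonal.
    """
    total = 0.0
    for img, label in zip(image_embs, labels):
        u = l2_normalize(img)
        t = l2_normalize(text_embs[label])
        total += sum((a - b) ** 2 for a, b in zip(u, t))
    return total / len(labels)
```

In a fine-tuning loop, a term like this would be added to the usual classification or contrastive loss with some weight, pulling ID image embeddings toward their text counterparts.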

[CV-26] SKDU at De-Factify 4.0: Vision Transformer with Data Augmentation for AI-Generated Image Detection AAAI2025

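[Quick read]: This paper explores the potential of pre-trained vision-language models, e.g. Vision Transformers (ViT), combined with advanced data augmentation strategies for detecting AI-generated images. The key solution is to fine-tune a ViT model on the Defactify-4.0 dataset while applying perturbations such as flipping, rotation, Gaussian noise injection, and JPEG compression during training to improve robustness and generalization. Experiments show that the ViT-based pipeline achieves state-of-the-art performance and significantly outperforms competing methods on both the validation and test sets.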

Link: https://arxiv.org/abs/2503.18812
Authors: Shrikant Malviya, Neelanjan Bhowmik, Stamos Katsigiannis
Affiliations: Department of Computer Science, Durham University, UK; British Car Auctions, UK
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: De-Factify 4.0 workshop at the 39th Annual AAAI Conference on Artificial Intelligence (AAAI 2025)

Abstract:The aim of this work is to explore the potential of pre-trained vision-language models, e.g. Vision Transformers (ViT), enhanced with advanced data augmentation strategies for the detection of AI-generated images. Our approach leverages a fine-tuned ViT model trained on the Defactify-4.0 dataset, which includes images generated by state-of-the-art models such as Stable Diffusion 2.1, Stable Diffusion XL, Stable Diffusion 3, DALL-E 3, and MidJourney. We employ perturbation techniques like flipping, rotation, Gaussian noise injection, and JPEG compression during training to improve model robustness and generalisation. The experimental results demonstrate that our ViT-based pipeline achieves state-of-the-art performance, significantly outperforming competing methods on both validation and test datasets.
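
The perturbations listed in the abstract can be sketched on a tiny grayscale image stored as nested lists. This is an illustration only: the probabilities, the noise level, and the function names are assumptions, and JPEG compression is left out because it needs an image codec that the Python standard library does not provide.

```python
import random


def hflip(img):
    """Mirror each row (horizontal flip)."""
    return [row[::-1] for row in img]


def rot90(img):
    """Rotate the image by 90 degrees clockwise."""
    return [list(col) for col in zip(*img[::-1])]


def add_gaussian_noise(img, sigma, rng):
    """Add zero-mean Gaussian noise and clip to the 0..255 range."""
    return [[min(255, max(0, round(p + rng.gauss(0.0, sigma)))) for p in row]
            for row in img]


def augment(img, rng, sigma=5.0):
    """Apply a random flip, a random multiple of 90-degree rotation, then noise."""
    if rng.random() < 0.5:
        img = hflip(img)
    for _ in range(rng.randrange(4)):
        img = rot90(img)
    return add_gaussian_noise(img, sigma, rng)
```

Passing an explicit `random.Random(seed)` keeps the augmentation reproducible, which helps when comparing training runs.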

[CV-27] CRCL: Causal Representation Consistency Learning for Anomaly Detection in Surveillance Videos

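[Quick read]: This paper addresses two weaknesses of existing video anomaly detection (VAD) methods: they cannot handle label-independent data shifts in real-world scenarios (e.g., scene changes), and the overgeneralization of deep neural networks can cause them to miss subtle anomalies. The key contribution is Causal Representation Consistency Learning (CRCL), which implicitly mines scene-robust causal variables during unsupervised video normality learning. Building on structural causal models, the method introduces scene-debiasing learning and causality-inspired normality learning, which respectively strip entangled scene bias from deep representations and learn causal video normality. Experiments show that CRCL outperforms conventional deep representation learning on benchmark datasets, copes with label-independent shifts in multi-scene settings, and keeps stable performance with limited training data.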

Link: https://arxiv.org/abs/2503.18808
Authors: Yang Liu, Hongjin Wang, Zepu Wang, Xiaoguang Zhu, Jing Liu, Peng Sun, Rui Tang, Jianwei Du, Victor C.M. Leung, Liang Song
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted for publication by IEEE Transactions on Image Processing

Abstract:Video Anomaly Detection (VAD) remains a fundamental yet formidable task in the video understanding community, with promising applications in areas such as information forensics and public safety protection. Due to the rarity and diversity of anomalies, existing methods only use easily collected regular events to model the inherent normality of normal spatial-temporal patterns in an unsupervised manner. Previous studies have shown that existing unsupervised VAD models are incapable of label-independent data offsets (e.g., scene changes) in real-world scenarios and may fail to respond to light anomalies due to the overgeneralization of deep neural networks. Inspired by causality learning, we argue that there exist causal factors that can adequately generalize the prototypical patterns of regular events and present significant deviations when anomalous instances occur. In this regard, we propose Causal Representation Consistency Learning (CRCL) to implicitly mine potential scene-robust causal variable in unsupervised video normality learning. Specifically, building on the structural causal models, we propose scene-debiasing learning and causality-inspired normality learning to strip away entangled scene bias in deep representations and learn causal video normality, respectively. Extensive experiments on benchmarks validate the superiority of our method over conventional deep representation learning. Moreover, ablation studies and extension validation show that the CRCL can cope with label-independent biases in multi-scene settings and maintain stable performance with only limited training data available.

[CV-28] Change3D: Revisiting Change Detection and Captioning from A Video Modeling Perspective CVPR2025

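[Quick read]: This paper addresses two limitations of conventional change detection and captioning methods: encoding the bi-temporal images with an independent image encoder cannot attend to changed regions effectively, and the varied change extractors designed for different tasks make a unified framework difficult. The paper proposes Change3D, whose key idea is to treat the bi-temporal images as two frames of a tiny video and to insert learnable perception frames between them. A video encoder lets the perception frames interact directly with the images and perceive their differences, so the intricate, task-specific change extractors can be dropped. The result is a unified framework for multiple change detection tasks (binary change detection, semantic change detection, and building damage assessment) and for change captioning.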

Link: https://arxiv.org/abs/2503.18803
Authors: Duowang Zhu, Xiaohu Huang, Haiyan Huang, Hao Zhou, Zhenfeng Shao
Affiliations: Wuhan University; The University of Hong Kong; Bytedance
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: conference paper, accepted by CVPR 2025

Abstract:In this paper, we present Change3D, a framework that reconceptualizes the change detection and captioning tasks through video modeling. Recent methods have achieved remarkable success by regarding each pair of bi-temporal images as separate frames. They employ a shared-weight image encoder to extract spatial features and then use a change extractor to capture differences between the two images. However, image feature encoding, being a task-agnostic process, cannot attend to changed regions effectively. Furthermore, different change extractors designed for various change detection and captioning tasks make it difficult to have a unified framework. To tackle these challenges, Change3D regards the bi-temporal images as comprising two frames akin to a tiny video. By integrating learnable perception frames between the bi-temporal images, a video encoder enables the perception frames to interact with the images directly and perceive their differences. Therefore, we can get rid of the intricate change extractors, providing a unified framework for different change detection and captioning tasks. We verify Change3D on multiple tasks, encompassing change detection (including binary change detection, semantic change detection, and building damage assessment) and change captioning, across eight standard benchmarks. Without bells and whistles, this simple yet effective framework can achieve superior performance with an ultra-light video model comprising only ~6%-13% of the parameters and ~8%-34% of the FLOPs compared to state-of-the-art methods. We hope that Change3D could be an alternative to 2D-based models and facilitate future research.

[CV-29] NexusGS: Sparse View Synthesis with Epipolar Depth Priors in 3D Gaussian Splatting CVPR2025

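[Quick read]: This paper addresses the poor performance of conventional methods such as Neural Radiance Field and 3D Gaussian Splatting on novel view synthesis from sparse-view images, where supervision is limited. The proposed solution, NexusGS, is a 3DGS-based method that embeds depth information directly into point clouds and exploits the inherent epipolar geometry of 3DGS to introduce a new point cloud densification strategy. The strategy initializes 3DGS with a dense point cloud, which reduces randomness in point placement while avoiding over-smoothing and overfitting. NexusGS comprises three core steps: Epipolar Depth Nexus, Flow-Resilient Depth Blending, and Flow-Filtered Depth Pruning. These steps use optical flow and camera poses to compute accurate depth maps while mitigating the inaccuracies associated with optical flow. By incorporating epipolar depth priors, NexusGS ensures reliable dense point cloud coverage under sparse-view conditions and supports stable 3DGS training. Experiments show that NexusGS significantly improves depth accuracy and rendering quality, surpasses state-of-the-art methods, and produces point clouds that also boost competing methods.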

Link: https://arxiv.org/abs/2503.18794
Authors: Yulong Zheng, Zicheng Jiang, Shengfeng He, Yandu Sun, Junyu Dong, Huaidong Zhang, Yong Du
Affiliations: Ocean University of China; Singapore Management University; South China University of Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: This paper is accepted by CVPR 2025

Abstract:Neural Radiance Field (NeRF) and 3D Gaussian Splatting (3DGS) have noticeably advanced photo-realistic novel view synthesis using images from densely spaced camera viewpoints. However, these methods struggle in few-shot scenarios due to limited supervision. In this paper, we present NexusGS, a 3DGS-based approach that enhances novel view synthesis from sparse-view images by directly embedding depth information into point clouds, without relying on complex manual regularizations. Exploiting the inherent epipolar geometry of 3DGS, our method introduces a novel point cloud densification strategy that initializes 3DGS with a dense point cloud, reducing randomness in point placement while preventing over-smoothing and overfitting. Specifically, NexusGS comprises three key steps: Epipolar Depth Nexus, Flow-Resilient Depth Blending, and Flow-Filtered Depth Pruning. These steps leverage optical flow and camera poses to compute accurate depth maps, while mitigating the inaccuracies often associated with optical flow. By incorporating epipolar depth priors, NexusGS ensures reliable dense point cloud coverage and supports stable 3DGS training under sparse-view conditions. Experiments demonstrate that NexusGS significantly enhances depth accuracy and rendering quality, surpassing state-of-the-art methods by a considerable margin. Furthermore, we validate the superiority of our generated point clouds by substantially boosting the performance of competing methods. Project page: this https URL.

[CV-30] LGI-DETR: Local-Global Interaction for UAV Object Detection

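[Quick read]: This paper addresses two challenges of existing object detectors for unmanned aerial vehicles (UAVs): most are not end-to-end and require complex component design and careful tuning, and existing end-to-end detectors are optimized mainly for natural scenes and work poorly when applied directly to UAV images. The paper proposes a local-global information interaction detection transformer for UAVs, LGI-DETR. Its key idea is a cross-layer, bidirectional fusion of low-level local features and high-level semantic features, which is especially effective for small object detection. At the initial stage of the encoder, a local spatial enhancement (LSE) module injects rich low-level local spatial information into high-level features, reducing the loss of local information as high-level information is propagated. At the final stage of the encoder, a global information injection (GII) module integrates high-level global semantic representations with low-level feature maps, propagating context across the feature hierarchy to compensate for the inherent limits of local receptive fields. On two challenging UAV object detection benchmarks, VisDrone2019 and UAVDT, the model outperforms the state-of-the-art model; compared to the baseline model, AP and AP50 improve by 1.9% and 2.4%, respectively.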

Link: https://arxiv.org/abs/2503.18785
Authors: Zifa Chen
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 14 pages

Abstract:UAVs have been widely used in various fields. However, most existing object detectors used on drones are not end-to-end and require the design of various complex components and careful fine-tuning. Most existing end-to-end object detectors are designed for natural scenes, and applying them directly to UAV images is not ideal. To address these challenges, we design a local-global information interaction DETR for UAVs, namely LGI-DETR. It performs cross-layer bidirectional enhancement of low-level and high-level feature information, a fusion method that is especially effective for small object detection. At the initial stage of the encoder, we propose a local spatial enhancement module (LSE), which injects rich low-level local spatial information into the high-level features and reduces the loss of local information during the transmission of high-level information. At the final stage of the encoder, we propose a novel global information injection module (GII) designed to integrate rich high-level global semantic representations with low-level feature maps. This hierarchical fusion mechanism effectively addresses the inherent limitations of local receptive fields by propagating contextual information across the feature hierarchy. Experimental results on two challenging UAV image object detection benchmarks, VisDrone2019 and UAVDT, show that our proposed model outperforms the SOTA model. Compared to the baseline model, AP and AP50 improved by 1.9% and 2.4%, respectively.

[CV-31] Leveraging Perturbation Robustness to Enhance Out-of-Distribution Detection

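[Quick read]: This paper addresses the safe deployment of deep computer vision models in open-world environments, where inputs may be out-of-distribution (OOD). It proposes a post-hoc method, Perturbation-Rectified OOD detection (PRO). The key insight is that perturbation reduces prediction confidence more for OOD inputs than for in-distribution (IND) inputs. Based on this, the paper designs an adversarial score function that searches for the local minimum score near the original input by gradient descent, which increases the separability of IND and OOD samples. The method improves OOD detection without complex changes to the underlying model architecture. On a CIFAR-10 model with adversarial training, PRO reduces the false positive rate at 95% true positive rate (FPR@95) by more than 10% compared with existing methods.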

Link: https://arxiv.org/abs/2503.18784
Authors: Wenxi Chen, Raymond A. Yeh, Shaoshuai Mou, Yan Gu
Affiliations: Purdue University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Out-of-distribution (OOD) detection is the task of identifying inputs that deviate from the training data distribution. This capability is essential for safely deploying deep computer vision models in open-world environments. In this work, we propose a post-hoc method, Perturbation-Rectified OOD detection (PRO), based on the insight that prediction confidence for OOD inputs is more susceptible to reduction under perturbation than in-distribution (IND) inputs. Based on the observation, we propose an adversarial score function that searches for the local minimum scores near the original inputs by applying gradient descent. This procedure enhances the separability between IND and OOD samples. Importantly, the approach improves OOD detection performance without complex modifications to the underlying model architectures. We conduct extensive experiments using the OpenOOD benchmark (Yang et al., 2022). Our approach further pushes the limit of softmax-based OOD detection and is the leading post-hoc method for small-scale models. On a CIFAR-10 model with adversarial training, PRO effectively detects near-OOD inputs, achieving a reduction of more than 10% on FPR@95 compared to state-of-the-art methods.
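
The score described above, a search for a lower confidence score near the input by gradient descent, can be sketched on a toy linear classifier. Everything specific here is an assumption made for illustration: the maximum softmax probability as the base score, finite-difference gradients, sign-gradient steps, and the step size and step count. The names `msp_score` and `perturbed_score` are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [v / total for v in exps]


def msp_score(x, weights):
    """Maximum softmax probability of a toy linear classifier (logits = W x)."""
    logits = [sum(w * xi for w, xi in zip(row, x)) for row in weights]
    return max(softmax(logits))


def perturbed_score(x, weights, step=0.05, n_steps=5, h=1e-4):
    """Descend the score for a few steps and return the lowest score seen."""
    x = list(x)
    best = msp_score(x, weights)
    for _ in range(n_steps):
        grad = []
        for i in range(len(x)):
            plus, minus = x[:], x[:]
            plus[i] += h
            minus[i] -= h
            # Central finite difference stands in for backpropagation.
            grad.append((msp_score(plus, weights) - msp_score(minus, weights)) / (2 * h))
        x = [xi - step * math.copysign(1.0, g) if g != 0 else xi
             for xi, g in zip(x, grad)]
        best = min(best, msp_score(x, weights))
    return best
```

Thresholding `perturbed_score` in place of the raw score is the post-hoc step: inputs whose confidence collapses under a small perturbation are flagged as OOD.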

[CV-32] Frequency Dynamic Convolution for Dense Image Prediction CVPR2025

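[Quick read]: Dynamic Convolution (DY-Conv) selects weights adaptively by combining multiple parallel weights with an attention mechanism, but the frequency responses of those weights tend to be highly similar, so the parameter cost is high while adaptability is limited. This paper proposes Frequency Dynamic Convolution (FDConv) to address this. Its key idea is to learn a fixed parameter budget in the Fourier domain and divide that budget into frequency-based groups with disjoint Fourier indices, which yields frequency-diverse weights without increasing the parameter cost. The paper further introduces Kernel Spatial Modulation (KSM), which dynamically adjusts the frequency response of each filter at the spatial level, and Frequency Band Modulation (FBM), which decomposes weights into distinct frequency bands and modulates them dynamically based on local content. Experiments on object detection, segmentation, and classification show that FDConv applied to ResNet-50 achieves superior performance with only about +3.6M parameters, outperforming previous methods that need much larger parameter budgets (e.g., CondConv +90M, KW +76.5M). FDConv also integrates into a variety of architectures, including ConvNeXt and Swin-Transformer.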

Link: https://arxiv.org/abs/2503.18783
Authors: Linwei Chen, Lin Gu, Liang Li, Chenggang Yan, Ying Fu
Affiliations: Beijing Institute of Technology; RIKEN; The University of Tokyo; Chinese Academy of Sciences; Hangzhou Dianzi University; Tsinghua University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted by CVPR 2025

Abstract:While Dynamic Convolution (DY-Conv) has shown promising performance by enabling adaptive weight selection through multiple parallel weights combined with an attention mechanism, the frequency response of these weights tends to exhibit high similarity, resulting in high parameter costs but limited adaptability. In this work, we introduce Frequency Dynamic Convolution (FDConv), a novel approach that mitigates these limitations by learning a fixed parameter budget in the Fourier domain. FDConv divides this budget into frequency-based groups with disjoint Fourier indices, enabling the construction of frequency-diverse weights without increasing the parameter cost. To further enhance adaptability, we propose Kernel Spatial Modulation (KSM) and Frequency Band Modulation (FBM). KSM dynamically adjusts the frequency response of each filter at the spatial level, while FBM decomposes weights into distinct frequency bands in the frequency domain and modulates them dynamically based on local content. Extensive experiments on object detection, segmentation, and classification validate the effectiveness of FDConv. We demonstrate that when applied to ResNet-50, FDConv achieves superior performance with a modest increase of +3.6M parameters, outperforming previous methods that require substantial increases in parameter budgets (e.g., CondConv +90M, KW +76.5M). Moreover, FDConv seamlessly integrates into a variety of architectures, including ConvNeXt, Swin-Transformer, offering a flexible and efficient solution for modern vision tasks. The code is made publicly available at this https URL.
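
The notion of dividing a fixed Fourier-domain budget into groups with disjoint indices can be sketched as a simple index partition. The grouping rule below (sort indices from low to high frequency magnitude, then cut into equal slices) is an assumption chosen for illustration, not the rule from the paper, and `partition_fourier_indices` is a hypothetical name.

```python
def partition_fourier_indices(height, width, n_groups):
    """Split all (u, v) indices of a height x width Fourier grid into disjoint groups.

    Indices are ordered from low to high frequency; each group would then supply
    the non-zero coefficients of one weight, so no parameter is shared or added.
    """
    def frequency_key(index):
        u, v = index
        # Index k and (size - k) have the same frequency magnitude in a DFT grid.
        fu = min(u, height - u)
        fv = min(v, width - v)
        return (fu * fu + fv * fv, u, v)

    indices = sorted(((u, v) for u in range(height) for v in range(width)),
                     key=frequency_key)
    total = len(indices)
    return [indices[k * total // n_groups:(k + 1) * total // n_groups]
            for k in range(n_groups)]
```

Because the groups never overlap and together cover the whole grid, the number of learnable coefficients equals the size of the grid no matter how many weights are built from it.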

[CV-33] Good Keypoints for the Two-View Geometry Estimation Problem CVPR2025

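[Quick read]: This paper studies which properties of local features affect the performance of homography estimation in the two-view geometry estimation problem, and proposes a new theoretical model for scoring feature points (keypoints). The model shows that a good keypoint for homography estimation should have two properties: it should be repeatable and have a small expected measurement error. This explains why simply increasing the number of correspondences does not always improve homography estimation accuracy. Based on the model, the authors design a self-supervised keypoint detector called Bounded NeSS-ST (BoNeSS-ST). Its novelty lies in strong theoretical foundations, more accurate keypoint scoring through subpixel refinement, and a cost designed for robustness to low-saliency keypoints. Experiments show that BoNeSS-ST outperforms prior self-supervised local feature detectors on both planar homography and epipolar geometry estimation.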

Link: https://arxiv.org/abs/2503.18767
Authors: Konstantin Pakulev, Alexander Vakhitov, Gonzalo Ferrer
Affiliations: Skolkovo Institute of Science and Technology; Slamcore
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Camera-ready version of the CVPR 2025 paper

Abstract:Local features are essential to many modern downstream applications. Therefore, it is of interest to determine the properties of local features that contribute to the downstream performance for a better design of feature detectors and descriptors. In our work, we propose a new theoretical model for scoring feature points (keypoints) in the context of the two-view geometry estimation problem. The model determines two properties that a good keypoint for solving the homography estimation problem should have: be repeatable and have a small expected measurement error. This result provides key insights into why maximizing the number of correspondences doesn’t always lead to better homography estimation accuracy. We use the developed model to design a method that detects keypoints that benefit the homography estimation introducing the Bounded NeSS-ST (BoNeSS-ST) keypoint detector. The novelty of BoNeSS-ST comes from strong theoretical foundations, a more accurate keypoint scoring due to subpixel refinement and a cost designed for superior robustness to low saliency keypoints. As a result, BoNeSS-ST outperforms prior self-supervised local feature detectors in both planar homography and epipolar geometry estimation problems.

[CV-34] EgoSurgery-HTS: A Dataset for Egocentric Hand-Tool Segmentation in Open Surgery Videos

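[Quick read]: This paper addresses precise segmentation of surgical tools, hands, and interacting tools in egocentric open-surgery videos. It introduces EgoSurgery-HTS, a dataset with pixel-wise annotations, together with a benchmark suite for tool instance segmentation, hand instance segmentation, and hand-tool segmentation. The key to the approach is detailed pixel-level annotation of surgical tool instances, hands, and hand-tool interactions, which supports a deeper understanding of a surgeon's actions and intentions. Experiments with state-of-the-art segmentation methods on EgoSurgery-HTS show significant improvements in hand and hand-tool segmentation accuracy compared to existing datasets.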

Link: https://arxiv.org/abs/2503.18755
Authors: Nathan Darjana, Ryo Fujii, Hideo Saito, Hiroki Kajita
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Egocentric open-surgery videos capture rich, fine-grained details essential for accurately modeling surgical procedures and human behavior in the operating room. A detailed, pixel-level understanding of hands and surgical tools is crucial for interpreting a surgeon’s actions and intentions. We introduce EgoSurgery-HTS, a new dataset with pixel-wise annotations and a benchmark suite for segmenting surgical tools, hands, and interacting tools in egocentric open-surgery videos. Specifically, we provide a labeled dataset for (1) tool instance segmentation of 14 distinct surgical tools, (2) hand instance segmentation, and (3) hand-tool segmentation to label hands and the tools they manipulate. Using EgoSurgery-HTS, we conduct extensive evaluations of state-of-the-art segmentation methods and demonstrate significant improvements in the accuracy of hand and hand-tool segmentation in egocentric open-surgery videos compared to existing datasets. The dataset will be released at this https URL.

[CV-35] Self-Supervised Learning based on Transformed Image Reconstruction for Equivariance-Coherent Feature Representation

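[Quick read]: This paper addresses the difficulty of modeling feature equivariance, which matters in many computer vision tasks. Popular self-supervised learning (SSL) methods tend to constrain equivariance by design instead of learning it. The paper proposes an SSL approach in which the model, without prior knowledge of the transformations, learns them independently by reconstructing intermediate transformed images (e.g. translated or rotated images) that have undergone previously unseen transformations. This auxiliary task encourages equivariance-coherent features without relying on predefined transformation rules. In practice, the input image is transformed to produce an image pair, and the extracted features are split into two sets: one is used with a usual SSL loss that encourages invariance, and the other with a loss based on the auxiliary task of reconstructing the intermediate transformed images. The two losses are combined as a weighted linear sum. On synthetic tasks the method strongly outperforms all competitors, whether or not they are designed for equivariance. When trained alongside augmentation-based methods such as iBOT or DINOv2, it learns a balanced combination of invariant and equivariant features and almost always improves over all baselines on a range of realistic downstream tasks.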

Link: https://arxiv.org/abs/2503.18753
Authors: Qin Wang, Benjamin Bruns, Hanno Scharr, Kai Krajsek
Affiliations: IAS-8: Data Analytics and Machine Learning, Forschungszentrum Jülich, Germany; Jülich Supercomputing Centre (JSC), Forschungszentrum Jülich, Germany
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:The equivariant behaviour of features is essential in many computer vision tasks, yet popular self-supervised learning (SSL) methods tend to constrain equivariance by design. We propose a self-supervised learning approach where the system learns transformations independently by reconstructing images that have undergone previously unseen transformations. Specifically, the model is tasked to reconstruct intermediate transformed images, e.g. translated or rotated images, without prior knowledge of these transformations. This auxiliary task encourages the model to develop equivariance-coherent features without relying on predefined transformation rules. To this end, we apply transformations to the input image, generating an image pair, and then split the extracted features into two sets per image. One set is used with a usual SSL loss encouraging invariance, the other with our loss based on the auxiliary task to reconstruct the intermediate transformed images. Our loss and the SSL loss are linearly combined with weighted terms. Evaluating on synthetic tasks with natural images, our proposed method strongly outperforms all competitors, regardless of whether they are designed to learn equivariance. Furthermore, when trained alongside augmentation-based methods as the invariance tasks, such as iBOT or DINOv2, we successfully learn a balanced combination of invariant and equivariant features. Our approach performs strong on a rich set of realistic computer vision downstream tasks, almost always improving over all baselines.
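
The feature split and the weighted combination of the two losses can be sketched as follows. The individual loss terms are stand-ins chosen for illustration (mean squared error in both cases); the abstract does not specify them, and the names `split_features` and `combined_loss` as well as the weights are hypothetical.

```python
def split_features(features, n_invariant):
    """Split one feature vector into an invariance part and an equivariance part."""
    return features[:n_invariant], features[n_invariant:]


def mse(a, b):
    """Mean squared error between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def combined_loss(feat_view1, feat_view2, reconstruction, target,
                  n_invariant, w_ssl=1.0, w_rec=1.0):
    """Weighted linear combination of an invariance loss and a reconstruction loss.

    The invariance term compares the invariant parts of two views; the
    reconstruction term compares a decoded image with the intermediate
    transformed image it is meant to reproduce.
    """
    inv1, _ = split_features(feat_view1, n_invariant)
    inv2, _ = split_features(feat_view2, n_invariant)
    ssl_loss = mse(inv1, inv2)
    rec_loss = mse(reconstruction, target)
    return w_ssl * ssl_loss + w_rec * rec_loss
```

In a real model, `reconstruction` would come from a decoder fed with the equivariance part of the features; the sketch takes it as an input to stay self-contained.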

[CV-36] Robust Tube-based Control Strategy for Vision-guided Autonomous Vehicles

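[Quick read]: This paper addresses the insufficient robustness of lane-keeping systems for autonomous vehicles during high-speed cornering on tight turns. It proposes an interpolation tube-based constrained iterative linear quadratic regulator (itube-CILQR) algorithm, whose key benefits over the standard tube approach are reduced system conservatism and faster computation. According to the abstract, itube-CILQR is better suited to lane keeping than variational CILQR-based methods and than model predictive control (MPC) using a classical interior-point solver. The paper also examines how conservatism influences system behavior by exploring the interpolation variable trajectories produced during lane-keeping maneuvers.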

Link: https://arxiv.org/abs/2503.18752
Authors: Der-Hau Lee
Affiliations: Department of Electrophysics, National Yang Ming Chiao Tung University
Categories: Systems and Control (eess.SY); Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: 13 pages, 14 figures

Abstract:A robust control strategy for autonomous vehicles can improve system stability, enhance riding comfort, and prevent driving accidents. This paper presents a novel interpolation tube-based constrained iterative linear quadratic regulator (itube-CILQR) algorithm for autonomous computer-vision-based vehicle lane-keeping. The goal of the algorithm is to enhance robustness during high-speed cornering on tight turns. The advantages of itube-CILQR over the standard tube-approach include reduced system conservatism and increased computational speed. Numerical and vision-based experiments were conducted to examine the feasibility of the proposed algorithm. The proposed itube-CILQR algorithm is better suited to vehicle lane-keeping than variational CILQR-based methods and model predictive control (MPC) approaches using a classical interior-point solver. Specifically, in evaluation experiments, itube-CILQR achieved an average runtime of 3.16 ms to generate a control signal to guide a self-driving vehicle; itube-MPC typically required a 4.67-times longer computation time to complete the same task. Moreover, the influence of conservatism on system behavior was investigated by exploring the interpolation variable trajectories derived from the proposed itube-CILQR algorithm during lane-keeping maneuvers.
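
The runtime comparison can be turned into rough absolute numbers. This is a back-of-the-envelope sketch that reads "4.67-times longer" as a multiplier on the 3.16 ms figure, which is an interpretation of the abstract; the implied control rates are derived values, not results reported by the paper.

```python
# Average runtime reported for itube-CILQR, in milliseconds.
itube_cilqr_ms = 3.16
# Ratio reported for itube-MPC, read here as a plain multiplier (assumption).
slowdown = 4.67

itube_mpc_ms = itube_cilqr_ms * slowdown      # about 14.76 ms per control signal
cilqr_rate_hz = 1000.0 / itube_cilqr_ms       # about 316 control signals per second
mpc_rate_hz = 1000.0 / itube_mpc_ms           # about 68 control signals per second

print(f"itube-MPC runtime: {itube_mpc_ms:.2f} ms")
print(f"itube-CILQR rate: {cilqr_rate_hz:.0f} Hz, itube-MPC rate: {mpc_rate_hz:.0f} Hz")
```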

[CV-37] Linguistics-aware Masked Image Modeling for Self-supervised Scene Text Recognition CVPR2025

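[Quick read]: This paper addresses scene text recognition (STR) under degraded visual quality, where visual information alone is not enough to understand the text and visual and linguistic features need to be combined. Conventional methods usually extract linguistic features with language models or semantic reasoning modules, which typically require large-scale annotated data. Self-supervised learning lacks annotations and therefore struggles to disentangle linguistic features tied to the global context. Among current self-supervised methods, sequence contrastive learning emphasizes alignment of local features, and masked image modeling (MIM) tends to exploit local structures to reconstruct visual patterns, so the linguistic knowledge they capture is limited.

The key solution is a Linguistics-aware Masked Image Modeling (LMIM) approach. It channels linguistic information into the MIM decoding process through a separate branch, and uses a linguistics alignment module that takes inputs with different visual appearances to extract vision-independent features as linguistic guidance. Because these features go beyond mere visual structure, LMIM must consider the global context to complete the reconstruction. Experiments show state-of-the-art performance on multiple benchmarks, and attention visualizations show that the method captures visual and linguistic information at the same time.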

Link: https://arxiv.org/abs/2503.18746
Authors: Yifei Zhang, Chang Liu, Jin Wei, Xiaomeng Yang, Yu Zhou, Can Ma, Xiangyang Ji
Affiliations: Institute of Information Engineering, Chinese Academy of Sciences; VCIP & TMCC & DISSec, College of Computer Science, Nankai University; School of Cyber Security, University of Chinese Academy of Sciences; Department of Automation and BNRist, Tsinghua University; Lenovo Research; College of Engineering, Northeastern University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: CVPR 2025

Abstract:Text images are unique in their dual nature, encompassing both visual and linguistic information. The visual component encompasses structural and appearance-based features, while the linguistic dimension incorporates contextual and semantic elements. In scenarios with degraded visual quality, linguistic patterns serve as crucial supplements for comprehension, highlighting the necessity of integrating both aspects for robust scene text recognition (STR). Contemporary STR approaches often use language models or semantic reasoning modules to capture linguistic features, typically requiring large-scale annotated datasets. Self-supervised learning, which lacks annotations, presents challenges in disentangling linguistic features related to the global context. Typically, sequence contrastive learning emphasizes the alignment of local features, while masked image modeling (MIM) tends to exploit local structures to reconstruct visual patterns, resulting in limited linguistic knowledge. In this paper, we propose a Linguistics-aware Masked Image Modeling (LMIM) approach, which channels the linguistic information into the decoding process of MIM through a separate branch. Specifically, we design a linguistics alignment module to extract vision-independent features as linguistic guidance using inputs with different visual appearances. As features extend beyond mere visual structures, LMIM must consider the global context to achieve reconstruction. Extensive experiments on various benchmarks quantitatively demonstrate our state-of-the-art performance, and attention visualizations qualitatively show the simultaneous capture of both visual and linguistic information.

[CV-38] SFDLA: Source-Free Document Layout Analysis

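[Quick read]: This paper addresses the problem of adapting Document Layout Analysis (DLA) models to a target domain without access to source data. Conventional methods usually require large-scale source-domain data and target labels, which limits their practical use in privacy-sensitive and resource-constrained domains such as financial statements, medical records, and proprietary business documents. The authors observe that directly transferring a model fine-tuned on the source domain to the target domain causes a significant performance drop (32.64% on average). To address this, they introduce Source-Free Document Layout Analysis (SFDLA), whose goal is to adapt a pre-trained source DLA model to an unlabeled target domain without accessing any source data. They establish the first SFDLA benchmark, covering three major DLA datasets for geometric- and content-aware adaptation, and propose a new framework, the Document Layout Analysis Adapter (DLAdapter), to improve source-free adaptation across document domains. From PubLayNet to DocLayNet, the method improves on the source-only baseline by 4.21% and on existing source-free methods by 2.26%. The key contribution is the DLAdapter framework, whose adaptation strategy compensates for the absence of source data.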

Link: https://arxiv.org/abs/2503.18742
Authors: Sebastian Tewes,Yufan Chen,Omar Moured,Jiaming Zhang,Rainer Stiefelhagen
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: The benchmark, models, and code will be publicly available at this https URL

Abstract:Document Layout Analysis (DLA) is a fundamental task in document understanding. However, existing DLA and adaptation methods often require access to large-scale source data and target labels. These requirements severely limit their real-world applicability, particularly in privacy-sensitive and resource-constrained domains, such as financial statements, medical records, and proprietary business documents. According to our observation, directly transferring source-domain fine-tuned models to target domains often results in a significant performance drop (Avg. -32.64%). In this work, we introduce Source-Free Document Layout Analysis (SFDLA), which aims to adapt a pre-trained source DLA model to an unlabeled target domain, without access to any source data. To address this challenge, we establish the first SFDLA benchmark, covering three major DLA datasets for geometric- and content-aware adaptation. Furthermore, we propose Document Layout Analysis Adapter (DLAdapter), a novel framework that is designed to improve source-free adaptation across document domains. Our method achieves a +4.21% improvement over the source-only baseline and a +2.26% gain over existing source-free methods from PubLayNet to DocLayNet. We believe this work will inspire the DLA community to further investigate source-free document understanding. To support future research of the community, the benchmark, models, and code will be publicly available at this https URL.

[CV-39] FG2: Fine-Grained Cross-View Localization by Fine-Grained Feature Matching

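[Quick read]: This paper proposes a novel fine-grained cross-view localization method that estimates the 3 Degrees of Freedom (3-DoF) pose of a ground-level image within an aerial image by matching fine-grained features between the two. The key to the solution is to map ground image features to a 3D point cloud and pool these points onto a Bird's-Eye-View (BEV) plane by selecting features along the height dimension, which makes the contribution of each ground image feature traceable. Sparse matches are then sampled between the two point planes, and the relative pose is computed with Procrustes alignment, yielding more accurate localization. Compared with the previous state of the art, the method reduces the mean localization error by 28% on the VIGOR cross-area test set.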

Link: https://arxiv.org/abs/2503.18725
Authors: Zimin Xia,Alexandre Alahi
Affiliations: École Polytechnique Fédérale de Lausanne (EPFL)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:We propose a novel fine-grained cross-view localization method that estimates the 3 Degrees of Freedom pose of a ground-level image in an aerial image of the surroundings by matching fine-grained features between the two images. The pose is estimated by aligning a point plane generated from the ground image with a point plane sampled from the aerial image. To generate the ground points, we first map ground image features to a 3D point cloud. Our method then learns to select features along the height dimension to pool the 3D points to a Bird’s-Eye-View (BEV) plane. This selection enables us to trace which feature in the ground image contributes to the BEV representation. Next, we sample a set of sparse matches from computed point correspondences between the two point planes and compute their relative pose using Procrustes alignment. Compared to the previous state-of-the-art, our method reduces the mean localization error by 28% on the VIGOR cross-area test set. Qualitative results show that our method learns semantically consistent matches across ground and aerial views through weakly supervised learning from the camera pose.
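
The final step described above, computing a relative pose from sparse point matches with Procrustes alignment, can be sketched for the planar case. This is a minimal illustration and not the authors' implementation: the function name `procrustes_2d` and the input format (lists of matched 2D points) are assumptions, and the matches are taken as given. In 2D the least-squares rotation has a closed form, so no SVD is needed.

```python
import math

def procrustes_2d(src, dst):
    """Least-squares rigid alignment (rotation + translation) mapping src onto dst.

    src, dst: equal-length lists of matched (x, y) points (hypothetical input format).
    Returns (theta, (tx, ty)), i.e. a 3-DoF planar pose.
    """
    n = len(src)
    cx_s = sum(p[0] for p in src) / n
    cy_s = sum(p[1] for p in src) / n
    cx_d = sum(p[0] for p in dst) / n
    cy_d = sum(p[1] for p in dst) / n

    # Accumulate dot and cross products of the centered point pairs.
    dot = 0.0
    cross = 0.0
    for (xs, ys), (xd, yd) in zip(src, dst):
        ax, ay = xs - cx_s, ys - cy_s
        bx, by = xd - cx_d, yd - cy_d
        dot += ax * bx + ay * by
        cross += ax * by - ay * bx

    # The angle that maximizes the summed alignment between rotated src and dst.
    theta = math.atan2(cross, dot)
    c, s = math.cos(theta), math.sin(theta)

    # The translation maps the rotated source centroid onto the target centroid.
    tx = cx_d - (c * cx_s - s * cy_s)
    ty = cy_d - (s * cx_s + c * cy_s)
    return theta, (tx, ty)
```

With noise-free matches the true rotation and translation are recovered exactly; with noisy matches the result is the least-squares fit, which is why the quality of the sampled correspondences matters.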

[CV-40] Boosting Resolution Generalization of Diffusion Transformers with Randomized Positional Encodings

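[Quick read]: This paper tackles a major obstacle to resolution generalization in models such as Diffusion Transformers: the mismatch between the positional encodings seen at test time and those used in training. Existing approaches based on interpolation, extrapolation, or combinations of the two have not fully resolved the issue. The key contribution is a two-dimensional randomized positional encodings (RPE-2D) framework that learns the positional order of image patches rather than the specific distances between them, enabling both high- and low-resolution image generation without dedicated training on high- or low-resolution images. Specifically, RPE-2D independently selects positions over a broader range along the horizontal and vertical axes, so that every positional encoding encountered at inference has been trained, which improves resolution generalization. The paper also introduces a random data augmentation technique to strengthen the modeling of positional order, and uses micro-conditioning to handle the image cropping that this augmentation causes. Experiments on ImageNet show state-of-the-art resolution generalization, outperforming existing methods across several resolution settings.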

Link: https://arxiv.org/abs/2503.18719
Authors: Cong Liu,Liang Hou,Mingwu Zheng,Xin Tao,Pengfei Wan,Di Zhang,Kun Gai
Affiliations: Kuaishou Technology; Southeast University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Resolution generalization in image generation tasks enables the production of higher-resolution images with lower training resolution overhead. However, a significant challenge in resolution generalization, particularly in the widely used Diffusion Transformers, lies in the mismatch between the positional encodings encountered during testing and those used during training. While existing methods have employed techniques such as interpolation, extrapolation, or their combinations, none have fully resolved this issue. In this paper, we propose a novel two-dimensional randomized positional encodings (RPE-2D) framework that focuses on learning positional order of image patches instead of the specific distances between them, enabling seamless high- and low-resolution image generation without requiring high- and low-resolution image training. Specifically, RPE-2D independently selects positions over a broader range along both the horizontal and vertical axes, ensuring that all position encodings encountered during the inference phase have been trained, thus improving resolution generalization. Additionally, we propose a random data augmentation technique to enhance the modeling of position order. To address the issue of image cropping caused by the augmentation, we introduce corresponding micro-conditioning to enable the model to perceive the specific cropping patterns. On the ImageNet dataset, our proposed RPE-2D achieves state-of-the-art resolution generalization performance, outperforming existing competitive methods when trained at a resolution of 256×256 and inferred at 384×384 and 512×512, as well as when scaling from 512×512 to 768×768 and 1024×1024. It also exhibits outstanding capabilities in low-resolution image generation, multi-stage training acceleration and multi-resolution inheritance.
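
The idea of selecting positions over a broader range while preserving patch order can be sketched as follows. This is only an illustration under stated assumptions: the abstract does not specify the sampling distribution, so uniform sampling without replacement followed by sorting is assumed here, and the names `sample_axis_positions`, `sample_rpe_2d`, and `max_position` are hypothetical.

```python
import random

def sample_axis_positions(num_patches, max_position, rng):
    """Draw num_patches distinct indices from range(max_position), in ascending order.

    Sorting keeps the relative order of patches while randomizing their spacing.
    """
    return sorted(rng.sample(range(max_position), num_patches))

def sample_rpe_2d(height, width, max_position, rng):
    """Sample row and column positions independently and build a (row, col) grid."""
    rows = sample_axis_positions(height, max_position, rng)
    cols = sample_axis_positions(width, max_position, rng)
    return [[(r, c) for c in cols] for r in rows]
```

Under this sketch, a model trained on a 16-by-16 patch grid with `max_position=64` would see position indices from the whole 0 to 63 range over the course of training, so a larger grid at inference would not introduce indices that were never trained.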

[CV-41] GS-Marker: Generalizable and Robust Watermarking for 3D Gaussian Splatting

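[Quick read]: This paper addresses the difficulty of watermarking 3D models in the generative AI era, in particular the fact that invisible watermarking techniques for 2D images do not transfer directly to 3D. The main challenges are that the renderer disrupts gradient flow and complicates training, that the watermark must generalize across diverse 3D models, and that it must be reliably extractable from free-view renderings under various distortions. The paper proposes GS-Marker, a single-pass watermarking scheme for the 3D Gaussian Splatting (3DGS) representation. Its key innovation is the Adaptive Marker Control mechanism, which adaptively perturbs the initially optimized 3DGS to escape local minima, improving training stability and convergence. The GS-Marker framework combines a 3D encoder that embeds messages, distortion layers that increase robustness, and a 2D decoder that extracts watermarks from renderings. It improves decoding accuracy and model fidelity over per-scene training approaches while significantly reducing computation time.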

Link: https://arxiv.org/abs/2503.18718
Authors: Lijiang Li,Jinglu Wang,Xiang Ming,Yan Lu
Affiliations: Microsoft Research Asia
Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments:

Abstract:In the Generative AI era, safeguarding 3D models has become increasingly urgent. While invisible watermarking is well-established for 2D images with encoder-decoder frameworks, generalizable and robust solutions for 3D remain elusive. The main difficulty arises from the renderer between the 3D encoder and 2D decoder, which disrupts direct gradient flow and complicates training. Existing 3D methods typically rely on per-scene iterative optimization, resulting in time inefficiency and limited generalization. In this work, we propose a single-pass watermarking approach for 3D Gaussian Splatting (3DGS), a well-known yet underexplored representation for watermarking. We identify two major challenges: (1) ensuring effective training generalized across diverse 3D models, and (2) reliably extracting watermarks from free-view renderings, even under distortions. Our framework, named GS-Marker, incorporates a 3D encoder to embed messages, distortion layers to enhance resilience against various distortions, and a 2D decoder to extract watermarks from renderings. A key innovation is the Adaptive Marker Control mechanism that adaptively perturbs the initially optimized 3DGS, escaping local minima and improving both training stability and convergence. Extensive experiments show that GS-Marker outperforms per-scene training approaches in terms of decoding accuracy and model fidelity, while also significantly reducing computation time.

[CV-42] LLaVAction: evaluating and training multi-modal large language models for action recognition

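[Quick read]: This paper addresses the weak performance of multi-modal large language models (MLLMs) on egocentric action recognition. Its key contribution is to reformulate the EPIC-KITCHENS-100 dataset as video multiple-choice question answering (EPIC-KITCHENS-100-MQA); by sampling difficult incorrect answers as distractors, it exposes the limitations of leading MLLMs on action recognition. The authors then propose a series of methods that substantially improve the action recognition ability of MLLMs, achieving state-of-the-art results on the EPIC-KITCHENS-100 validation set and outperforming GPT-4o by 21 accuracy points on EPIC-KITCHENS-100-MQA. Improvements on further action-related video benchmarks suggest that MLLMs are a promising direction for complex action understanding tasks.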

Link: https://arxiv.org/abs/2503.18712
Authors: Shaokai Ye,Haozhe Qi,Alexander Mathis,Mackenzie W. Mathis
Affiliations: EPFL
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: this https URL

Abstract:Understanding human behavior requires measuring behavioral actions. Due to its complexity, behavior is best mapped onto a rich, semantic structure such as language. The recent development of multi-modal large language models (MLLMs) is a promising candidate for a wide range of action understanding tasks. In this work, we focus on evaluating and then improving MLLMs to perform action recognition. We reformulate EPIC-KITCHENS-100, one of the largest and most challenging egocentric action datasets, to the form of video multiple question answering (EPIC-KITCHENS-100-MQA). We show that when we sample difficult incorrect answers as distractors, leading MLLMs struggle to recognize the correct actions. We propose a series of methods that greatly improve the MLLMs’ ability to perform action recognition, achieving state-of-the-art on both the EPIC-KITCHENS-100 validation set, as well as outperforming GPT-4o by 21 points in accuracy on EPIC-KITCHENS-100-MQA. Lastly, we show improvements on other action-related video benchmarks such as EgoSchema, PerceptionTest, LongVideoBench, VideoMME and MVBench, suggesting that MLLMs are a promising path forward for complex action tasks. Code and models are available at: this https URL.
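
To make the notion of sampling difficult incorrect answers as distractors concrete, here is a small sketch. It rests on an assumption: the abstract does not say how difficulty is measured, so this example ranks wrong answers by the confidence of some reference action classifier. The function name `build_mqa_item` and the `action_scores` input are hypothetical.

```python
def build_mqa_item(true_action, action_scores, num_distractors=4):
    """Build one multiple-choice item with hard distractors.

    action_scores: dict mapping action label -> confidence from a reference
    classifier (an assumed difficulty signal, not taken from the paper).
    """
    wrong = [a for a in action_scores if a != true_action]
    # The highest-scoring wrong answers are the most confusable ones.
    hard = sorted(wrong, key=lambda a: action_scores[a], reverse=True)[:num_distractors]
    # Sort the options so the position of the answer carries no information.
    options = sorted(hard + [true_action])
    return {"options": options, "answer_index": options.index(true_action)}
```

Random distractors would usually differ from the true action in both verb and noun, which makes the question easy; confusable distractors force the model to discriminate between similar actions.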

[CV-43] Accenture-NVS1: A Novel View Synthesis Dataset

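[Quick read]: This paper targets specific challenges in Novel View Synthesis (NVS) research for airborne and ground imagery by building a specialized dataset, ACC-NVS1. The dataset's value lies in its diversity and scale: it contains six diverse real-world scenes captured from both airborne and ground cameras in Austin and Pittsburgh during 2023 and 2024, for a total of 148,000 images. The data covers challenges such as varying altitudes and transient objects. It is intended to supplement existing resources for NVS research rather than to serve as a benchmark.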

Link: https://arxiv.org/abs/2503.18711
Authors: Thomas Sugg,Kyle O’Brien,Lekh Poudel,Alex Dumouchelle,Michelle Jou,Marc Bosch,Deva Ramanan,Srinivasa Narasimhan,Shubham Tulsiani
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 6 pages, 7 figures

Abstract:This paper introduces ACC-NVS1, a specialized dataset designed for research on Novel View Synthesis specifically for airborne and ground imagery. Data for ACC-NVS1 was collected in Austin, TX and Pittsburgh, PA in 2023 and 2024. The collection encompasses six diverse real-world scenes captured from both airborne and ground cameras, resulting in a total of 148,000 images. ACC-NVS1 addresses challenges such as varying altitudes and transient objects. This dataset is intended to supplement existing datasets, providing additional resources for comprehensive research, rather than serving as a benchmark.

[CV-44] Revisiting Automatic Data Curation for Vision Foundation Models in Digital Pathology MICCAI2025

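[Quick read]: This paper addresses the quality of pre-training data for vision foundation models (FMs) used to develop digital pathology algorithms. The performance of these models depends heavily on the size, diversity, and balance of the pre-training data, yet data selection has so far been guided mainly by expert knowledge at the whole-slide image (WSI) level, using coarse factors such as disease classification and tissue type, while overlooking the finer-grained information available at the tile level. The key proposal is an unsupervised automatic data curation method at the tile level: hierarchical clustering trees are applied to the embeddings of 350 million tiles, and balanced datasets are built by sampling uniformly across the embedding space of the pretrained model. The paper further shows that the trade-off between dataset size and balance affects the quality of the learned representations, and designs tailored batch sampling strategies to mitigate this effect, improving performance on a range of clinically relevant downstream tasks.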

Link: https://arxiv.org/abs/2503.18709
Authors: Boqi Chen,Cédric Vincent-Cuaz,Lydia A. Schoenpflug,Manuel Madeira,Lisa Fournier,Vaishnavi Subramanian,Sonali Andani,Samuel Ruiperez-Campillo,Julia E. Vogt,Raphaëlle Luisier,Dorina Thanou,Viktor H. Koelzer,Pascal Frossard,Gabriele Campanella,Gunnar Rätsch
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: MICCAI 2025

Abstract:Vision foundation models (FMs) are accelerating the development of digital pathology algorithms and transforming biomedical research. These models learn, in a self-supervised manner, to represent histological features in highly heterogeneous tiles extracted from whole-slide images (WSIs) of real-world patient samples. The performance of these FMs is significantly influenced by the size, diversity, and balance of the pre-training data. However, data selection has been primarily guided by expert knowledge at the WSI level, focusing on factors such as disease classification and tissue types, while largely overlooking the granular details available at the tile level. In this paper, we investigate the potential of unsupervised automatic data curation at the tile level, taking into account 350 million tiles. Specifically, we apply hierarchical clustering trees to pre-extracted tile embeddings, allowing us to sample balanced datasets uniformly across the embedding space of the pretrained FM. We further find that these datasets are subject to a trade-off between size and balance, potentially compromising the quality of representations learned by FMs, and propose tailored batch sampling strategies to mitigate this effect. We demonstrate the effectiveness of our method through improved performance on a diverse range of clinically relevant downstream tasks.
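
The trade-off between size and balance can be illustrated with a simplified sampler. This sketch makes two assumptions that depart from the paper: it uses a single flat level of clusters instead of a hierarchical clustering tree, and it draws round-robin from clusters, which is only one possible way to sample uniformly across them. The name `balanced_sample` and its arguments are hypothetical.

```python
import random
from collections import defaultdict

def balanced_sample(cluster_ids, budget, rng):
    """Pick up to `budget` item indices, drawing round-robin from the clusters.

    cluster_ids: list where cluster_ids[i] is the cluster of item i
    (a flat clustering is assumed here for simplicity).
    """
    by_cluster = defaultdict(list)
    for idx, cid in enumerate(cluster_ids):
        by_cluster[cid].append(idx)

    # Shuffle each cluster once so that pop() yields a random member.
    remaining = {cid: rng.sample(members, len(members)) for cid, members in by_cluster.items()}

    chosen = []
    while len(chosen) < budget and remaining:
        for cid in sorted(remaining):
            if len(chosen) == budget:
                break
            chosen.append(remaining[cid].pop())
        # Drop exhausted clusters; beyond this point balance can no longer be kept.
        remaining = {cid: members for cid, members in remaining.items() if members}
    return chosen
```

With 90 items in one cluster and 10 in another, a budget of 20 yields a perfectly balanced 10/10 split, while a budget of 40 exhausts the small cluster and yields 30/10. Larger datasets are necessarily less balanced, which is the tension the abstract describes.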

[CV-45] Benchmarking Burst Super-Resolution for Polarization Images: Noise Dataset and Analysis

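[Quick read]: This paper addresses the low light efficiency and low spatial resolution of snapshot polarization imaging, which increase image noise and degrade the accuracy of polarization measurements. Existing burst super-resolution methods are hard to apply directly to polarization imaging because tailored datasets and reliable ground-truth noise statistics are missing. To address this, the paper introduces two datasets: PolarNS, which characterizes polarization noise statistics, and PolarBurstSR, which serves as a benchmark for burst super-resolution of polarization images. Together they enable comprehensive evaluation, and the paper adds a polarization noise analysis model that quantifies noise propagation. The authors also show the advantage of models trained specifically for polarization over RGB-based methods, establishing a benchmark for polarization burst super-resolution and providing insight into noise propagation that helps improve polarization image reconstruction.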

Link: https://arxiv.org/abs/2503.18705
Authors: Inseung Hwang,Kiseok Choi,Hyunho Ha,Min H. Kim
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Comments:

Abstract:Snapshot polarization imaging calculates polarization states from linearly polarized subimages. To achieve this, a polarization camera employs a double Bayer-patterned sensor to capture both color and polarization. It demonstrates low light efficiency and low spatial resolution, resulting in increased noise and compromised polarization measurements. Although burst super-resolution effectively reduces noise and enhances spatial resolution, applying it to polarization imaging poses challenges due to the lack of tailored datasets and reliable ground truth noise statistics. To address these issues, we introduce PolarNS and PolarBurstSR, two innovative datasets developed specifically for polarization imaging. PolarNS provides characterization of polarization noise statistics, facilitating thorough analysis, while PolarBurstSR functions as a benchmark for burst super-resolution in polarization images. These datasets, collected under various real-world conditions, enable comprehensive evaluation. Additionally, we present a model for analyzing polarization noise to quantify noise propagation, tested on a large dataset captured in a darkroom environment. As part of our application, we compare the latest burst super-resolution models, highlighting the advantages of training tailored to polarization compared to RGB-based methods. This work establishes a benchmark for polarization burst super-resolution and offers critical insights into noise propagation, thereby enhancing polarization image reconstruction.
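
The first sentence of the abstract, computing polarization states from linearly polarized subimages, corresponds to the standard linear Stokes relations. The sketch below assumes the common four-angle layout (0°, 45°, 90°, 135°), which the abstract does not state explicitly, and works on single pixel intensities; the function name is hypothetical.

```python
import math

def stokes_from_subimages(i0, i45, i90, i135):
    """Linear Stokes parameters from four polarizer-angle intensities at one pixel.

    Returns (s0, s1, s2, dolp, aolp), with aolp in radians.
    """
    s0 = 0.5 * (i0 + i45 + i90 + i135)   # total intensity
    s1 = i0 - i90                        # horizontal vs. vertical component
    s2 = i45 - i135                      # diagonal vs. anti-diagonal component
    dolp = math.hypot(s1, s2) / s0 if s0 > 0 else 0.0  # degree of linear polarization
    aolp = 0.5 * math.atan2(s2, s1)      # angle of linear polarization
    return s0, s1, s2, dolp, aolp
```

Because `s1` and `s2` are differences of noisy measurements and the degree of polarization divides by `s0`, sensor noise propagates nonlinearly into the polarization quantities, which is one reason a dedicated noise analysis is useful.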

[CV-46] Channel Consistency Prior and Self-Reconstruction Strategy Based Unsupervised Image Deraining CVPR2025

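[Quick read]: This paper addresses two obstacles that deep deraining models trained on paired datasets face in practice: real paired datasets are hard to obtain, and generalization is poor. It proposes a novel unsupervised deraining framework, CSUD, based on a Channel Consistency Prior and a Self-Reconstruction strategy. The key components are the Channel Consistency Loss (CCLoss) and the Self-Reconstruction (SR) strategy. CCLoss brings the channel consistency prior of rain streaks into training, so that the generated pseudo rainy images closely resemble real ones while preserving more background detail. The SR strategy alleviates the generator's redundant information transfer problem, further improving deraining performance and generalization. Experiments on multiple synthetic and real-world datasets show that CSUD outperforms other state-of-the-art unsupervised methods and generalizes better.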

Link: https://arxiv.org/abs/2503.18703
Authors: Guanglu Dong,Tianheng Zheng,Yuanzhouhan Cao,Linbo Qing,Chao Ren
Affiliations: Sichuan University (Chengdu, China); Beijing Jiaotong University (Beijing, China)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to CVPR2025

Abstract:Recently, deep image deraining models based on paired datasets have made remarkable progress. However, they cannot be well applied in real-world applications due to the difficulty of obtaining real paired datasets and their poor generalization performance. In this paper, we propose a novel Channel Consistency Prior and Self-Reconstruction Strategy Based Unsupervised Image Deraining framework, CSUD, to tackle the aforementioned challenges. During training with unpaired data, CSUD is capable of generating high-quality pseudo clean and rainy image pairs which are used to enhance the performance of the deraining network. Specifically, to preserve more image background details while transferring rain streaks from rainy images to the unpaired clean images, we propose a novel Channel Consistency Loss (CCLoss) by introducing the Channel Consistency Prior (CCP) of rain streaks into the training process, thereby ensuring that the generated pseudo rainy images closely resemble the real ones. Furthermore, we propose a novel Self-Reconstruction (SR) strategy to alleviate the redundant information transfer problem of the generator, further improving the deraining performance and the generalization capability of our method. Extensive experiments on multiple synthetic and real-world datasets demonstrate that the deraining performance of CSUD surpasses other state-of-the-art unsupervised methods and CSUD exhibits superior generalization capability.

[CV-47] OCRT: Boosting Foundation Models in the Open World with Object-Concept-Relation Triad CVPR2025

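[Quick read]: This paper addresses the sharp drop in the generalization ability of foundation models (FMs) when they face distribution shifts, weak supervision, or malicious attacks in open-world settings. It notes that most domain generalization and adversarial fine-tuning methods are tied to a specific task or model, neglecting universality in practical applications and transferability between FMs. The paper therefore proposes a new framework, the Object-Concept-Relation Triad (OCRT), which enables FMs to extract sparse, high-level concepts and intricate relational structures from raw visual inputs.

The key idea of OCRT is to bind objects in a visual scene to a set of object-centric representations through unsupervised decoupling and iterative refinement, project these representations onto an interpretable semantic concept space, and estimate their importance to filter out irrelevant elements. A concept-based graph with a flexible degree is then constructed from the concepts and their importance weights, which allows high-order factors to be extracted from informative concepts and supports relational reasoning among them. Extensive experiments show that OCRT substantially improves the generalizability and robustness of SAM and CLIP across multiple downstream tasks.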

Link: https://arxiv.org/abs/2503.18695
Authors: Luyao Tang,Yuxuan Yuan,Chaoqi Chen,Zeyu Zhang,Yue Huang,Kun Zhang
Affiliations: Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University; School of Informatics, Xiamen University; Shenzhen University; The Australian National University; Carnegie Mellon University; Mohamed bin Zayed University of Artificial Intelligence
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Accepted by CVPR 2025

Abstract:Although foundation models (FMs) claim to be powerful, their generalization ability significantly decreases when faced with distribution shifts, weak supervision, or malicious attacks in the open world. On the other hand, most domain generalization or adversarial fine-tuning methods are task-related or model-specific, ignoring the universality in practical applications and the transferability between FMs. This paper delves into the problem of generalizing FMs to the out-of-domain data. We propose a novel framework, the Object-Concept-Relation Triad (OCRT), that enables FMs to extract sparse, high-level concepts and intricate relational structures from raw visual inputs. The key idea is to bind objects in visual scenes and a set of object-centric representations through unsupervised decoupling and iterative refinement. To be specific, we project the object-centric representations onto a semantic concept space that the model can readily interpret and estimate their importance to filter out irrelevant elements. Then, a concept-based graph, which has a flexible degree, is constructed to incorporate the set of concepts and their corresponding importance, enabling the extraction of high-order factors from informative concepts and facilitating relational reasoning among these concepts. Extensive experiments demonstrate that OCRT can substantially boost the generalizability and robustness of SAM and CLIP across multiple downstream tasks.

[CV-48] Hardware-Rasterized Ray-Based Gaussian Splatting

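[Quick read]: This paper presents a novel hardware-rasterized rendering approach for ray-based 3D Gaussian Splatting (RayGS), aiming at fast, high-quality novel view synthesis. The core problem is that existing RayGS models do not reach the frame rates required by quality-sensitive applications such as Virtual Reality (VR) and Mixed Reality (MR). The key contribution is a mathematically rigorous and geometrically intuitive derivation of how to efficiently estimate all quantities needed to render RayGS models with standard hardware rasterization shaders, raising rendering speed to sufficiently high frame rates. The paper also resolves MIP-related aliasing in RayGS rendering, enabling alias-free, high-quality rendering when scales differ between training and testing. Together these yield significant performance gains while retaining the state-of-the-art appearance quality of RayGS.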

Link: https://arxiv.org/abs/2503.18682
Authors: Samuel Rota Bulò,Nemanja Bartolovic,Lorenzo Porzi,Peter Kontschieder
Affiliations: Meta Reality Labs
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Comments:

Abstract:We present a novel, hardware rasterized rendering approach for ray-based 3D Gaussian Splatting (RayGS), obtaining both fast and high-quality results for novel view synthesis. Our work contains a mathematically rigorous and geometrically intuitive derivation about how to efficiently estimate all relevant quantities for rendering RayGS models, structured with respect to standard hardware rasterization shaders. Our solution is the first enabling rendering RayGS models at sufficiently high frame rates to support quality-sensitive applications like Virtual and Mixed Reality. Our second contribution enables alias-free rendering for RayGS, by addressing MIP-related issues arising when rendering diverging scales during training and testing. We demonstrate significant performance gains, across different benchmark scenes, while retaining state-of-the-art appearance quality of RayGS.

[CV-49] NullSwap: Proactive Identity Cloaking Against Deepfake Face Swapping

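[Quick read]: This paper addresses the performance bottleneck of passively detecting high-quality Deepfake images, a problem aggravated by advances in generative models. Existing proactive perturbation methods are limited by visual degradation, insufficient effectiveness against face swapping, and reliance on white-box or grey-box settings. The paper proposes NullSwap, a novel proactive defense that cloaks the identity of the source image and nullifies its identity features, thereby disabling face swapping in a pure black-box scenario. The design includes an Identity Extraction module that obtains facial identity features from the source image, a Perturbation Block that generates identity-guided perturbations, and a Feature Block that extracts shallow image features, which are fused with the perturbation for image reconstruction. To adapt to the different identity extractors used by face swapping algorithms, the paper also proposes Dynamic Loss Weighting, which adaptively balances the identity losses. Experiments show that NullSwap is highly effective at fooling various identity recognition models and clearly outperforms existing proactive perturbation methods, particularly at preventing face swapping models from generating images with the correct source identity.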

Link: https://arxiv.org/abs/2503.18678
Authors: Tianyi Wang,Harry Cheng,Xiao Zhang,Yinglong Wang
Affiliations: Nanyang Technological University; Shandong University; Qilu University of Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Because passive detection of high-quality Deepfake images suffers from performance bottlenecks due to the advancement of generative models, proactive perturbations offer a promising approach to disabling Deepfake manipulations by inserting signals into benign images. However, existing proactive perturbation approaches remain unsatisfactory in several aspects: 1) visual degradation due to direct element-wise addition; 2) limited effectiveness against face swapping manipulation; 3) unavoidable reliance on white- and grey-box settings to involve generative models during training. In this study, we analyze the essence of Deepfake face swapping and argue the necessity of protecting source identities rather than target images, and we propose NullSwap, a novel proactive defense approach that cloaks source image identities and nullifies face swapping under a pure black-box scenario. We design an Identity Extraction module to obtain facial identity features from the source image, while a Perturbation Block is then devised to generate identity-guided perturbations accordingly. Meanwhile, a Feature Block extracts shallow-level image features, which are then fused with the perturbation in the Cloaking Block for image reconstruction. Furthermore, to ensure adaptability across different identity extractors in face swapping algorithms, we propose Dynamic Loss Weighting to adaptively balance identity losses. Experiments demonstrate the outstanding ability of our approach to fool various identity recognition models, outperforming state-of-the-art proactive perturbations in preventing face swapping models from generating images with correct source identities.

[CV-50] Human Motion Unlearning

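[Quick read]: This paper addresses how to remove toxic motion sequences from text-to-motion generative models while preserving overall generation performance. The main challenge is that toxic motions can be generated directly from explicit text prompts or indirectly from implicit combinations of safe motions (for example, "kicking" can be composed of "loading and swinging a leg"). The key solution is a training-free method called Latent Code Replacement (LCR), designed for the discrete latent spaces of state-of-the-art text-to-motion diffusion models. It identifies and replaces the latent representations of toxic motions, achieving motion unlearning that outperforms existing baselines in both qualitative and quantitative evaluation.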

Link: https://arxiv.org/abs/2503.18674
Authors: Edoardo De Matteis,Matteo Migliarini,Alessio Sampieri,Indro Spinelli,Fabio Galasso
Affiliations: Sapienza University of Rome; ItalAI
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:We introduce the task of human motion unlearning to prevent the synthesis of toxic animations while preserving the general text-to-motion generative performance. Unlearning toxic motions is challenging as those can be generated from explicit text prompts and from implicit toxic combinations of safe motions (e.g., "kicking" is "loading and swinging a leg"). We propose the first motion unlearning benchmark by filtering toxic motions from the large and recent text-to-motion datasets of HumanML3D and Motion-X. We propose baselines, by adapting state-of-the-art image unlearning techniques to process spatio-temporal signals. Finally, we propose a novel motion unlearning model based on Latent Code Replacement, which we dub LCR. LCR is training-free and suitable to the discrete latent spaces of state-of-the-art text-to-motion diffusion models. LCR is simple and consistently outperforms baselines qualitatively and quantitatively. Project page: this https URL.
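
Because LCR operates on a discrete latent space, its core operation can be pictured as substituting codebook indices in a generated token sequence. The sketch below is only a guess at the general shape of such a replacement and not the authors' procedure: how toxic codes are identified and how replacements are chosen is not described in the abstract, so the mapping `toxic_to_safe` is a hypothetical input.

```python
def replace_latent_codes(codes, toxic_to_safe):
    """Return a copy of a discrete code sequence with flagged codes substituted.

    codes: list of codebook indices produced by a motion tokenizer.
    toxic_to_safe: dict mapping flagged indices to replacement indices
    (hypothetical; building this mapping is the hard part and is not shown).
    """
    return [toxic_to_safe.get(code, code) for code in codes]
```

The appeal of operating at this level is that no gradient update is required: the generator's weights stay untouched, which matches the training-free property stated in the abstract.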

[CV-51] Any6D: Model-free 6D Pose Estimation of Novel Objects CVPR2025

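[Quick read]: This paper addresses the estimation of the 6D pose (position and orientation) and size of unknown objects in novel scenes. Existing methods usually rely on textured 3D models or multiple viewpoints, whereas the proposed Any6D is a model-free framework. Its key element is a joint object alignment process that improves 2D-3D alignment and metric scale estimation, leading to more accurate pose estimates. Any6D also integrates a render-and-compare strategy to generate and refine pose hypotheses, giving robust performance under occlusions, non-overlapping views, varied lighting conditions, and large cross-environment differences.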

Link: https://arxiv.org/abs/2503.18673
Authors: Taeyeop Lee,Bowen Wen,Minjun Kang,Gyuree Kang,In So Kweon,Kuk-Jin Yoon
Affiliations: KAIST; NVIDIA
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments: CVPR 2025, Project Page: this https URL

Abstract:We introduce Any6D, a model-free framework for 6D object pose estimation that requires only a single RGB-D anchor image to estimate both the 6D pose and size of unknown objects in novel scenes. Unlike existing methods that rely on textured 3D models or multiple viewpoints, Any6D leverages a joint object alignment process to enhance 2D-3D alignment and metric scale estimation for improved pose accuracy. Our approach integrates a render-and-compare strategy to generate and refine pose hypotheses, enabling robust performance in scenarios with occlusions, non-overlapping views, diverse lighting conditions, and large cross-environment variations. We evaluate our method on five challenging datasets: REAL275, Toyota-Light, HO3D, YCBInEOAT, and LM-O, demonstrating its effectiveness in significantly outperforming state-of-the-art methods for novel object pose estimation. Project page: this https URL

[CV-52] Feature Calibration enhanced Parameter Synthesis for CLIP-based Class-incremental Learning

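[Quick read]: This paper addresses two problems in Class-incremental Learning (CIL): traditional methods rely only on visual features and therefore struggle with complex scenarios, and methods built on Vision-Language Models (VLMs) find it hard to overcome catastrophic forgetting while preserving the model's generalization ability. The proposed solution is Feature Calibration enhanced Parameter Synthesis (FCPS). FCPS uses a dedicated parameter adjustment mechanism to iteratively refine the proportion of original visual features that participate in the final class decision, which protects the model's basic generalization ability. In addition, parameter integration across tasks balances learning new class knowledge against retaining old knowledge. Experimental results confirm the effectiveness of the method.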

Link: https://arxiv.org/abs/2503.18672
Authors: Juncen Guo,Xiaoguang Zhu,Lianlong Sun,Liangyu Teng,Di Li,Yang Liu,Liang Song
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Class-incremental Learning (CIL) enables models to continuously learn new class knowledge while memorizing previous classes, facilitating their adaptation and evolution in dynamic environments. Traditional CIL methods are mainly based on visual features, which limits their ability to handle complex scenarios. In contrast, Vision-Language Models (VLMs) show promising potential to promote CIL by integrating pretrained knowledge with textual features. However, previous methods make it difficult to overcome catastrophic forgetting while preserving the generalization capabilities of VLMs. To tackle these challenges, we propose Feature Calibration enhanced Parameter Synthesis (FCPS) in this paper. Specifically, our FCPS employs a specific parameter adjustment mechanism to iteratively refine the proportion of original visual features participating in the final class determination, ensuring the model’s foundational generalization capabilities. Meanwhile, parameter integration across different tasks achieves a balance between learning new class knowledge and retaining old knowledge. Experimental results on popular benchmarks (e.g., CIFAR100 and ImageNet100) validate the superiority of the proposed method.

[CV-53] Structure-Aware Correspondence Learning for Relative Pose Estimation CVPR2025

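[Quick read]: This paper addresses a weakness of relative pose estimation methods that depend on explicit feature matching: small overlaps in visible regions and unreliable feature estimation for invisible regions. It proposes a Structure-Aware Correspondence Learning method for Relative Pose Estimation built on two modules. First, a structure-aware keypoint extraction module, guided by a keypoint-based image reconstruction loss, locates a set of keypoints that represent the structure of objects with different shapes and appearances. Second, a structure-aware correspondence estimation module models the intra-image and inter-image relationships between keypoints to extract structure-aware features for correspondence estimation. Using both modules together, the method estimates 3D-3D correspondences for unseen objects without explicit feature matching, enabling precise relative pose estimation. Experiments on the CO3D, Objaverse, and LineMOD datasets show that it significantly outperforms prior methods.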

Link: https://arxiv.org/abs/2503.18671
Authors: Yihan Chen,Wenfei Yang,Huan Ren,Shifeng Zhang,Tianzhu Zhang,Feng Wu
Affiliations: University of Science and Technology of China; National Key Laboratory of Deep Space Exploration, Deep Space Exploration Laboratory; Sangfor Technologies
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: CVPR2025

Abstract:Relative pose estimation provides a promising way for achieving object-agnostic pose estimation. Despite the success of existing 3D correspondence-based methods, the reliance on explicit feature matching suffers from small overlaps in visible regions and unreliable feature estimation for invisible regions. Inspired by humans’ ability to assemble two object parts that have small or no overlapping regions by considering object structure, we propose a novel Structure-Aware Correspondence Learning method for Relative Pose Estimation, which consists of two key modules. First, a structure-aware keypoint extraction module is designed to locate a set of keypoints that can represent the structure of objects with different shapes and appearance, under the guidance of a keypoint based image reconstruction loss. Second, a structure-aware correspondence estimation module is designed to model the intra-image and inter-image relationships between keypoints to extract structure-aware features for correspondence estimation. By jointly leveraging these two modules, the proposed method can naturally estimate 3D-3D correspondences for unseen objects without explicit feature matching for precise relative pose estimation. Experimental results on the CO3D, Objaverse and LineMOD datasets demonstrate that the proposed method significantly outperforms prior methods, i.e., with a 5.7° reduction in mean angular error on the CO3D dataset.

[CV-54] Boosting Virtual Agent Learning and Reasoning: A Step-wise Multi-dimensional and Generalist Reward Model with Benchmark

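[Quick read]: This paper targets two key limitations in training Generalist Virtual Agents (GVAs) powered by multimodal large language models: reliance on outcome supervision and labor-intensive human annotation. It proposes Similar, a Step-wise Multi-dimensional Generalist Reward Model that provides fine-grained signals for agent training and selects better actions during inference-time scaling. The authors define five dimensions for evaluating agent actions and, on that basis, design an MCTS-P algorithm that automatically collects and annotates step-wise, five-dimensional agent execution data. Similar is then trained with the Triple-M strategy. The paper also introduces SRM, the first benchmark in the virtual agent domain for training and evaluating step-wise, multi-dimensional reward models; it comprises SRMTrain (the training set for Similar) and SRMEval (a manually selected test set for evaluating reward models). Experiments show that Similar, through step-wise multi-dimensional assessment and synergistic gain, gives GVAs effective intermediate signals during both training and inference-time scaling.

Link: https://arxiv.org/abs/2503.18665
Authors: Bingchen Miao,Yang Wu,Minghe Gao,Qifan Yu,Wendong Bu,Wenqiao Zhang,Yunfei Li,Siliang Tang,Tat-Seng Chua,Juncheng Li
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: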

Abstract: The development of Generalist Virtual Agents (GVAs) powered by Multimodal Large Language Models (MLLMs) has shown significant promise in autonomous task execution. However, current training paradigms face critical limitations, including reliance on outcome supervision and labor-intensive human annotations. To address these challenges, we propose Similar, a Step-wise Multi-dimensional Generalist Reward Model, which offers fine-grained signals for agent training and can choose better actions for inference-time scaling. Specifically, we begin by systematically defining five dimensions for evaluating agent actions. Building on this framework, we design an MCTS-P algorithm to automatically collect and annotate step-wise, five-dimensional agent execution data. Using this data, we train Similar with the Triple-M strategy. Furthermore, we introduce the first benchmark in the virtual agent domain for step-wise, multi-dimensional reward model training and evaluation, named SRM. This benchmark consists of two components: SRMTrain, which serves as the training set for Similar, and SRMEval, a manually selected test set for evaluating the reward model. Experimental results demonstrate that Similar, through its step-wise, multi-dimensional assessment and synergistic gain, provides GVAs with effective intermediate signals during both training and inference-time scaling. The code is available at this https URL.
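One way to picture reward-guided inference-time scaling is best-of-N selection: sample several candidate actions, score each with a step-wise reward model, and execute the highest-scoring one. The sketch below is a minimal illustration under assumptions. The scoring function, the unweighted mean over five dimensions, and every name in it are hypothetical and are not taken from the paper.

```python
from statistics import mean

def select_action(candidates, score_fn):
    """Return (best_action, best_score) by mean per-dimension reward.

    candidates: list of candidate actions.
    score_fn:   hypothetical reward model mapping an action to a sequence
                of per-dimension scores (five in the setting described).
    """
    scored = [(mean(score_fn(a)), a) for a in candidates]
    # Compare on the aggregated score only; ties keep the earliest candidate.
    best_score, best_action = max(scored, key=lambda pair: pair[0])
    return best_action, best_score

# Toy stand-in for a learned reward model (made-up scores).
toy_scores = {
    "click_search": (0.9, 0.8, 0.7, 0.9, 0.6),
    "scroll_down": (0.4, 0.5, 0.3, 0.6, 0.2),
    "type_query": (0.7, 0.9, 0.8, 0.8, 0.9),
}
action, score = select_action(list(toy_scores), toy_scores.get)
```

An unweighted mean is the simplest aggregation; a real system might weight the dimensions or learn the aggregation instead.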

[CV-55] Leveraging Land Cover Priors for Isoprene Emission Super-Resolution

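[Quick read]: This paper addresses the limited spatial resolution of satellite-derived emission data for Biogenic Volatile Organic Compounds (BVOCs), with a focus on isoprene. It proposes a deep learning-based Super-Resolution (SR) framework that integrates land cover information as an emission driver and captures spatial patterns more effectively than traditional methods. The core idea is to use land cover priors to improve the spatial accuracy of emission data. The model is evaluated across climate conditions, the statistical correlations between isoprene emissions and key environmental information (such as cropland and tree cover data) are analyzed, and generalization is assessed on unseen climate zones and geographical regions. Results show that incorporating land cover data significantly improves emission SR accuracy, particularly in heterogeneous landscapes, offering a cost-effective, data-driven approach for atmospheric chemistry and climate modeling.

Link: https://arxiv.org/abs/2503.18658
Authors: Christopher Ummerle,Antonio Giganti,Sara Mandelli,Paolo Bestagini,Stefano Tubaro
Affiliations: Department of Electronics, Information and Bioengineering - Politecnico di Milano - Milan, Italy; Image and Sound Processing Lab (ISPL) - Politecnico di Milano - Milan, Italy
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 17 pages, 16 figures, 4 tables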

Abstract:Remote sensing plays a crucial role in monitoring Earth’s ecosystems, yet satellite-derived data often suffer from limited spatial resolution, restricting their applicability in atmospheric modeling and climate research. In this work, we propose a deep learning-based Super-Resolution (SR) framework that leverages land cover information to enhance the spatial accuracy of Biogenic Volatile Organic Compounds (BVOCs) emissions, with a particular focus on isoprene. Our approach integrates land cover priors as emission drivers, capturing spatial patterns more effectively than traditional methods. We evaluate the model’s performance across various climate conditions and analyze statistical correlations between isoprene emissions and key environmental information such as cropland and tree cover data. Additionally, we assess the generalization capabilities of our SR model by applying it to unseen climate zones and geographical regions. Experimental results demonstrate that incorporating land cover data significantly improves emission SR accuracy, particularly in heterogeneous landscapes. This study contributes to atmospheric chemistry and climate modeling by providing a cost-effective, data-driven approach to refining BVOC emission maps. The proposed method enhances the usability of satellite-based emissions data, supporting applications in air quality forecasting, climate impact assessments, and environmental studies.

[CV-56] Robust face recognition based on the wing loss and the ℓ1 regularization

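[Quick read]: This paper addresses the sharp drop in face recognition rates when face images are heavily occluded or damaged. It introduces a wing-constrained sparse coding model (WCSC) and a weighted version (WWCSC), and solves the corresponding minimization problems with the alternating direction method of multipliers (ADMM). The key is that the wing constraint and the weighting strategy make the model more robust and accurate on highly occluded or damaged face images.

Link: https://arxiv.org/abs/2503.18652
Authors: Yaoyao Yun,Jianwen Xu
Affiliations: College of Mathematics and Statistics, Chongqing University; National Elite Institute of Engineering, Chongqing University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages, 3 figures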

Abstract: In recent years, sparse sampling techniques based on regression analysis have been applied extensively in face recognition research, and researchers have explored numerous sparse sampling models of this kind. Nevertheless, the recognition rates of most of these models decrease significantly when they are confronted with highly occluded or highly damaged face images. In this paper, a new wing-constrained sparse coding model (WCSC) and its weighted version (WWCSC) are introduced to deal with the face recognition problem in complex circumstances, where the alternating direction method of multipliers (ADMM) algorithm is employed to solve the corresponding minimization problems. In addition, the performance of the proposed method is examined on four well-known facial databases, namely the ORL facial database, the Yale facial database, the AR facial database and the FERET facial database. Compared to other methods in the literature, the WWCSC has a very high recognition rate even in complex situations where face images have high occlusion or high damage, which illustrates the robustness of the WWCSC method in facial recognition.
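For readers unfamiliar with the two ingredients named in the title, the sketch below shows the wing loss in its commonly cited form (logarithmic for small residuals, linear for large ones) and the soft-thresholding operator, which is the proximal map of the ℓ1 norm and the usual ℓ1 update inside ADMM. Whether WCSC uses exactly these forms and parameters is an assumption; the defaults `w=10.0` and `eps=2.0` are illustrative only.

```python
import math

def wing_loss(x, w=10.0, eps=2.0):
    """Wing loss of a scalar residual x.

    w * ln(1 + |x| / eps)  if |x| < w
    |x| - c                otherwise, with c chosen so the pieces meet at |x| = w.
    """
    ax = abs(x)
    if ax < w:
        return w * math.log(1.0 + ax / eps)
    c = w - w * math.log(1.0 + w / eps)
    return ax - c

def soft_threshold(v, lam):
    """Proximal operator of lam * |v|: shrink v toward zero by lam."""
    return math.copysign(max(abs(v) - lam, 0.0), v)

# Small residuals are amplified relative to an absolute-value loss,
# while large residuals grow only linearly.
small, large = wing_loss(0.5), wing_loss(50.0)
```

In an ADMM solver, `soft_threshold` would be applied elementwise to the sparse coefficient vector at each iteration.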

[CV-57] LLGS: Unsupervised Gaussian Splatting for Image Enhancement and Reconstruction in Pure Dark Environment

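[Quick read]: This paper addresses the limits on applying 3D Gaussian Splatting to robotics in low-light environments, in particular the lack of color representation and the multi-view consistency problems that affect high-fidelity modeling and feature matching. Existing approaches either feed enhanced images as input, which breaks multi-view consistency, or rely on pre-trained data and lack scene generalization. The paper proposes Low-Light Gaussian Splatting (LLGS), an unsupervised multi-view stereoscopic system based on Gaussian Splatting. Its key components are M-Color, a decomposable Gaussian representation that separately characterizes color information for targeted enhancement, and an unsupervised optimization method with zero-knowledge priors that uses direction-based enhancement to ensure multi-view consistency. Experiments show that the system outperforms state-of-the-art methods in both low-light enhancement and 3D Gaussian Splatting.

Link: https://arxiv.org/abs/2503.18640
Authors: Haoran Wang,Jingwei Huang,Lu Yang,Tianchen Deng,Gaojing Zhang,Mingrui Li
Affiliations: School of Engineering and Informatics, University of Sussex; Department of Automation Engineering, University of Electronic Science and Technology of China; Department of Automation, Shanghai Jiao Tong University; Department of Computer Science, Dalian University of Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: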

Abstract:3D Gaussian Splatting has shown remarkable capabilities in novel view rendering tasks and exhibits significant potential for multi-view this http URL, the original 3D Gaussian Splatting lacks color representation for inputs in low-light environments. Simply using enhanced images as inputs would lead to issues with multi-view consistency, and current single-view enhancement systems rely on pre-trained data, lacking scene generalization. These problems limit the application of 3D Gaussian Splatting in low-light conditions in the field of robotics, including high-fidelity modeling and feature matching. To address these challenges, we propose an unsupervised multi-view stereoscopic system based on Gaussian Splatting, called Low-Light Gaussian Splatting (LLGS). This system aims to enhance images in low-light environments while reconstructing the scene. Our method introduces a decomposable Gaussian representation called M-Color, which separately characterizes color information for targeted enhancement. Furthermore, we propose an unsupervised optimization method with zero-knowledge priors, using direction-based enhancement to ensure multi-view consistency. Experiments conducted on real-world datasets demonstrate that our system outperforms state-of-the-art methods in both low-light enhancement and 3D Gaussian Splatting.

[CV-58] Unbiasing through Textual Descriptions: Mitigating Representation Bias in Video Benchmarks CVPR2025

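[Quick read]: This paper addresses representation biases in existing video benchmarks, such as object bias or single-frame bias, where recognizing objects or using a single frame is enough for a correct prediction, so the benchmark fails to assess video understanding fully. It proposes the “Unbiased through Textual Description (UTD)” video benchmark. The authors use VLMs and LLMs to analyze and debias existing video classification and retrieval datasets: they generate frame-wise textual descriptions of videos, filter them for specific information, and use them to examine representation bias along three dimensions: concept bias, temporal bias, and common sense vs. dataset bias. They systematically analyze 12 popular video classification and retrieval datasets and create new object-debiased test splits for them. They then benchmark 30 state-of-the-art video models on the original and debiased splits and analyze the models’ biases. To support more robust benchmarks and models, they release “UTD-descriptions” (rich structured descriptions for each dataset) and “UTD-splits” (object-debiased test splits).

Link: https://arxiv.org/abs/2503.18637
Authors: Nina Shvetsova,Arsha Nagrani,Bernt Schiele,Hilde Kuehne,Christian Rupprecht
Affiliations: Goethe University Frankfurt; Tuebingen AI Center/University of Tuebingen; MPI for Informatics, Saarland Informatics Campus; University of Oxford; MIT-IBM Watson AI Lab
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: To be published at CVPR 2025, project webpage this https URL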

Abstract: We propose a new “Unbiased through Textual Description (UTD)” video benchmark based on unbiased subsets of existing video classification and retrieval datasets to enable a more robust assessment of video understanding capabilities. Namely, we tackle the problem that current video benchmarks may suffer from different representation biases, e.g., object bias or single-frame bias, where mere recognition of objects or utilization of only a single frame is sufficient for correct prediction. We leverage VLMs and LLMs to analyze and debias benchmarks from such representation biases. Specifically, we generate frame-wise textual descriptions of videos, filter them for specific information (e.g., only objects) and leverage them to examine representation biases across three dimensions: 1) concept bias - determining if a specific concept (e.g., objects) alone suffices for prediction; 2) temporal bias - assessing if temporal information contributes to prediction; and 3) common sense vs. dataset bias - evaluating whether zero-shot reasoning or dataset correlations contribute to prediction. We conduct a systematic analysis of 12 popular video classification and retrieval datasets and create new object-debiased test splits for these datasets. Moreover, we benchmark 30 state-of-the-art video models on original and debiased splits and analyze biases in the models. To facilitate the future development of more robust video understanding benchmarks and models, we release: “UTD-descriptions”, a dataset with our rich structured descriptions for each dataset, and “UTD-splits”, a dataset of object-debiased test splits.

[CV-59] OCCO: LVM-guided Infrared and Visible Image Fusion Framework based on Object-aware and Contextual COntrastive Learning

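[Quick read]: This paper addresses the difficulty existing image fusion methods have in balancing high-quality fused images against performance on downstream vision tasks. It proposes OCCO, a fusion framework guided by a large vision model (LVM). A pre-trained LVM provides semantic guidance, and object-aware and contextual contrastive learning emphasizes salient semantic features. A new feature interaction fusion network resolves information conflicts in fused images caused by modality differences. By learning to distinguish positive from negative samples in the latent feature space (contextual space), the method improves the integrity of target information in the fused image, which benefits downstream tasks.

Link: https://arxiv.org/abs/2503.18635
Authors: Hui Li,Congcong Bian,Zeyang Zhang,Xiaoning Song,Xi Li,Xiao-Jun Wu
Affiliations: International Joint Laboratory on Artificial Intelligence of Jiangsu Province, School of Artificial Intelligence and Computer Science, Jiangnan University; College of Computer Science and Technology, Zhejiang University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: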

Abstract: Image fusion is a crucial technique in the field of computer vision, and its goal is to generate high-quality fused images and improve the performance of downstream tasks. However, existing fusion methods struggle to balance these two factors. Achieving high quality in fused images may result in lower performance in downstream visual tasks, and vice versa. To address this drawback, a novel LVM (large vision model)-guided fusion framework with Object-aware and Contextual COntrastive learning is proposed, termed OCCO. The pre-trained LVM is utilized to provide semantic guidance, allowing the network to focus solely on fusion tasks while emphasizing the learning of salient semantic features in the form of contrastive learning. Additionally, a novel feature interaction fusion network is designed to resolve information conflicts in fused images caused by modality differences. By learning the distinction between positive samples and negative samples in the latent feature space (contextual space), the integrity of target information in the fused image is improved, thereby benefiting downstream performance. Finally, the effectiveness of the proposed method is validated against eight state-of-the-art methods on four datasets, and exceptional performance is also demonstrated on a downstream visual task.

[CV-60] Robust Lane Detection with Wavelet-Enhanced Context Modeling and Adaptive Sampling

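[Quick read]: This paper addresses the performance drop of lane detection under adverse conditions such as extreme weather, illumination changes, occlusions, and complex curves. It proposes a Wavelet-Enhanced Feature Pyramid Network (WE-FPN), which inserts a wavelet-based non-local block before the feature pyramid to improve global context modeling, especially for occluded and curved lanes. An adaptive preprocessing module improves lane visibility under poor lighting, and an attention-guided sampling strategy further refines spatial features to raise accuracy on distant and curved lanes. Experiments on CULane and TuSimple show that the method significantly outperforms baselines in challenging scenarios, with better robustness and accuracy.

Link: https://arxiv.org/abs/2503.18631
Authors: Kunyang Li,Ming Hou
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: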

Abstract: Lane detection is critical for autonomous driving and advanced driver assistance systems (ADAS). While recent methods like CLRNet achieve strong performance, they struggle under adverse conditions such as extreme weather, illumination changes, occlusions, and complex curves. We propose a Wavelet-Enhanced Feature Pyramid Network (WE-FPN) to address these challenges. A wavelet-based non-local block is integrated before the feature pyramid to improve global context modeling, especially for occluded and curved lanes. Additionally, we design an adaptive preprocessing module to enhance lane visibility under poor lighting. An attention-guided sampling strategy further refines spatial features, boosting accuracy on distant and curved lanes. Experiments on CULane and TuSimple demonstrate that our approach significantly outperforms baselines in challenging scenarios, achieving better robustness and accuracy in real-world driving conditions.

[CV-61] Towards Human-Understandable Multi-Dimensional Concept Discovery

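[Quick read]: This paper addresses the difficulty traditional interpretability methods (such as saliency maps) have in mapping model decisions to human-understandable concepts, and proposes a method that improves both concept completeness and understandability. In Concept-based eXplainable AI (C-XAI), completeness measures how well a set of concepts explains a model’s decisions; the concepts should also be understandable and useful to humans. Existing methods such as Multi-Dimensional Concept Discovery (MCD) improve completeness, but their explanations can be hard for humans to understand.

The key to the solution is Human-Understandable Multi-dimensional Concept Discovery (HU-MCD), which uses the Segment Anything Model for concept identification and a CNN-specific input masking technique to reduce the noise introduced by traditional masking methods. Combined with the completeness relation, these changes let HU-MCD improve concept understandability while keeping explanations faithful, yielding more precise and reliable explanations.

Link: https://arxiv.org/abs/2503.18629
Authors: Arne Grobrügge,Niklas Kühl,Gerhard Satzger,Philipp Spitzer
Affiliations: Karlsruhe Institute of Technology, Germany; University of Bayreuth, Germany
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: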

Abstract:Concept-based eXplainable AI (C-XAI) aims to overcome the limitations of traditional saliency maps by converting pixels into human-understandable concepts that are consistent across an entire dataset. A crucial aspect of C-XAI is completeness, which measures how well a set of concepts explains a model’s decisions. Among C-XAI methods, Multi-Dimensional Concept Discovery (MCD) effectively improves completeness by breaking down the CNN latent space into distinct and interpretable concept subspaces. However, MCD’s explanations can be difficult for humans to understand, raising concerns about their practical utility. To address this, we propose Human-Understandable Multi-dimensional Concept Discovery (HU-MCD). HU-MCD uses the Segment Anything Model for concept identification and implements a CNN-specific input masking technique to reduce noise introduced by traditional masking methods. These changes to MCD, paired with the completeness relation, enable HU-MCD to enhance concept understandability while maintaining explanation faithfulness. Our experiments, including human subject studies, show that HU-MCD provides more precise and reliable explanations than existing C-XAI methods. The code is available at this https URL.

[CV-62] Dig2DIG: Dig into Diffusion Information Gains for Image Fusion

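[Quick read]: This paper addresses two main problems in existing diffusion-based image fusion methods: (1) predefined multimodal guidance fails to capture the dynamically changing significance of each modality; (2) theoretical guarantees are lacking. The authors reveal a spatio-temporal imbalance in image denoising: the diffusion model produces dynamic information gains in different image regions across denoising steps. Building on this observation, they dig into the Diffusion Information Gains (Dig2DIG) and theoretically derive a diffusion-based dynamic image fusion framework that provably reduces the upper bound of the generalization error. The key is the diffusion information gain (DIG), which quantifies each modality’s information contribution at different denoising steps and thereby provides dynamic guidance during fusion.

Link: https://arxiv.org/abs/2503.18627
Authors: Bing Cao,Baoshuo Cai,Changqing Zhang,Qinghua Hu
Affiliations: College of Intelligence and Computing, Tianjin University, Tianjin, China
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: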

Abstract:Image fusion integrates complementary information from multi-source images to generate more informative results. Recently, the diffusion model, which demonstrates unprecedented generative potential, has been explored in image fusion. However, these approaches typically incorporate predefined multimodal guidance into diffusion, failing to capture the dynamically changing significance of each modality, while lacking theoretical guarantees. To address this issue, we reveal a significant spatio-temporal imbalance in image denoising; specifically, the diffusion model produces dynamic information gains in different image regions with denoising steps. Based on this observation, we Dig into the Diffusion Information Gains (Dig2DIG) and theoretically derive a diffusion-based dynamic image fusion framework that provably reduces the upper bound of the generalization error. Accordingly, we introduce diffusion information gains (DIG) to quantify the information contribution of each modality at different denoising steps, thereby providing dynamic guidance during the fusion process. Extensive experiments on multiple fusion scenarios confirm that our method outperforms existing diffusion-based approaches in terms of both fusion quality and inference efficiency.

[CV-63] Generative Dataset Distillation using Min-Max Diffusion Model ECCV2024 WWW

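[Quick read]: This paper addresses generative dataset distillation, which uses generative models to synthesize images. The generator may produce any number of images as long as the evaluation time is preserved, so the challenge is to keep the surrogate dataset diverse and representative within that time. The authors use the popular diffusion model as the generator and add a min-max loss to control the dataset’s diversity and representativeness during training. Because diffusion models generate images iteratively and are therefore slow, there is a trade-off between the number of image samples and the image quality controlled by the diffusion steps; the paper proposes Diffusion Step Reduction to find the best balance. The method took second place in the generative track of The First Dataset Distillation Challenge of ECCV2024.

Link: https://arxiv.org/abs/2503.18626
Authors: Junqiao Fan,Yunjiao Zhou,Min Chang Jordan Ren,Jianfei Yang
Affiliations: Nanyang Technological University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: The paper is accepted as the ECCV2024 workshop paper and achieved second place in the generative track of The First Dataset Distillation Challenge of ECCV2024, this https URL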

Abstract: In this paper, we address the problem of generative dataset distillation that utilizes generative models to synthesize images. The generator may produce any number of images under a preserved evaluation time. In this work, we leverage the popular diffusion model as the generator to compute a surrogate dataset, boosted by a min-max loss to control the dataset’s diversity and representativeness during training. However, the diffusion model is time-consuming when generating images, as it requires an iterative generation process. We observe a critical trade-off between the number of image samples and the image quality controlled by the diffusion steps and propose Diffusion Step Reduction to achieve optimal performance. This paper details our comprehensive method and its performance. Our model achieved 2nd place in the generative track of The First Dataset Distillation Challenge of ECCV2024, demonstrating its superior performance.
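The trade-off between sample count and diffusion steps can be made concrete with simple arithmetic: under a fixed generation-time budget, halving the number of denoising steps roughly doubles the number of images that fit. The sketch below assumes a constant cost per denoising step; the budget, step counts, and per-step cost are made-up numbers, not figures from the paper.

```python
def images_within_budget(budget_ms, steps_per_image, ms_per_step):
    """Number of whole images that fit in a fixed time budget.

    Assumes each denoising step costs the same time (a simplification).
    All arguments are integers in milliseconds or step counts.
    """
    return budget_ms // (steps_per_image * ms_per_step)

# Hypothetical budget of 10 minutes and 100 ms per denoising step.
budget_ms = 600_000
for steps in (50, 25, 10):
    count = images_within_budget(budget_ms, steps, 100)
    print(f"{steps:>2} steps/image -> {count} images")
```

Fewer steps buy more samples but typically lower per-image quality, which is the balance that a step-reduction strategy has to tune.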

[CV-64] Training-Free Personalization via Retrieval and Reasoning on Fingerprints

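[Quick read]: This paper addresses the difficulty Vision Language Models (VLMs) have in understanding user-specific concepts, and in particular the reliance of current personalization methods on training procedures that are costly or unpleasant for users. It proposes a training-free method, Retrieval and Reasoning for Personalization (R2P), which uses the internal knowledge of VLMs to extract a concept fingerprint, i.e., the key attributes that uniquely define a concept within its semantic class. When a query arrives, the most similar fingerprints are retrieved and scored via chain-of-thought reasoning, and the scores are validated through cross-modal verification to reduce the risk of hallucinations. If the scores disagree, R2P refines the concept association via pairwise multimodal matching, directly comparing the retrieved fingerprints and their images with the query. Experiments show that R2P consistently outperforms state-of-the-art approaches on downstream tasks across multiple benchmarks.

Link: https://arxiv.org/abs/2503.18623
Authors: Deepayan Das,Davide Talon,Yiming Wang,Massimiliano Mancini,Elisa Ricci
Affiliations: University of Trento; Fondazione Bruno Kessler
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: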

Abstract: Vision Language Models (VLMs) have led to major improvements in multimodal reasoning, yet they still struggle to understand user-specific concepts. Existing personalization methods address this limitation but heavily rely on training procedures that can be either costly or unpleasant to individual users. We depart from existing work, and for the first time explore the training-free setting in the context of personalization. We propose a novel method, Retrieval and Reasoning for Personalization (R2P), leveraging internal knowledge of VLMs. First, we leverage VLMs to extract the concept fingerprint, i.e., key attributes uniquely defining the concept within its semantic class. When a query arrives, the most similar fingerprints are retrieved and scored via chain-of-thought reasoning. To reduce the risk of hallucinations, the scores are validated through cross-modal verification at the attribute level: in case of a discrepancy between the scores, R2P refines the concept association via pairwise multimodal matching, where the retrieved fingerprints and their images are directly compared with the query. We validate R2P on two publicly available benchmarks and a newly introduced dataset, Personal Concepts with Visual Ambiguity (PerVA), for concept identification highlighting challenges in visual ambiguity. R2P consistently outperforms state-of-the-art approaches on various downstream tasks across all benchmarks. Code will be available upon acceptance.

[CV-65] Unified Uncertainty-Aware Diffusion for Multi-Agent Trajectory Modeling CVPR2025

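[Quick read]: This paper addresses two gaps in multi-agent trajectory modeling: existing methods focus on forecasting future states and overlook broader tasks such as trajectory completion, which matter for real-world applications like correcting tracking data; and they provide neither a state-wise measure of uncertainty nor an error probability estimate for each generated scene under the same prior observations. The paper proposes U2Diff, a unified diffusion model that handles trajectory completion while jointly providing state-wise uncertainty estimates. The uncertainty estimate comes from augmenting the denoising loss with the negative log-likelihood of the predicted noise and propagating latent-space uncertainty to the real state space. A Rank Neural Network in post-processing estimates the error probability of each generated mode, which correlates strongly with the error relative to ground truth. U2Diff outperforms state-of-the-art methods in trajectory completion and forecasting on four challenging sports datasets.

Link: https://arxiv.org/abs/2503.18589
Authors: Guillem Capellera,Antonio Rubio,Luis Ferraz,Antonio Agudo
Affiliations: Institut de Robòtica i Informàtica Industrial, CSIC-UPC; Kognia Sports Intelligence
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to CVPR 2025 conference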

Abstract: Multi-agent trajectory modeling has primarily focused on forecasting future states, often overlooking broader tasks like trajectory completion, which are crucial for real-world applications such as correcting tracking data. Existing methods also generally predict agents’ states without offering any state-wise measure of uncertainty. Moreover, popular multi-modal sampling methods lack any error probability estimates for each generated scene under the same prior observations, making it difficult to rank the predictions during inference time. We introduce U2Diff, a unified diffusion model designed to handle trajectory completion while providing state-wise uncertainty estimates jointly. This uncertainty estimation is achieved by augmenting the simple denoising loss with the negative log-likelihood of the predicted noise and propagating latent space uncertainty to the real state space. Additionally, we incorporate a Rank Neural Network in post-processing to enable error probability estimation for each generated mode, demonstrating a strong correlation with the error relative to ground truth. Our method outperforms the state-of-the-art solutions in trajectory completion and forecasting across four challenging sports datasets (NBA, Basketball-U, Football-U, Soccer-U), highlighting the effectiveness of uncertainty and error probability estimation. Video at this https URL
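To illustrate what a negative log-likelihood term on predicted noise can look like, the sketch below uses a univariate Gaussian whose mean and log-variance are both predicted. This is an assumption made for illustration: the abstract does not specify the distribution or parameterization that U2Diff uses, and the function name is hypothetical.

```python
import math

def gaussian_nll(target, mean, log_var):
    """Negative log-likelihood of `target` under N(mean, exp(log_var)).

    Predicting log-variance keeps the variance positive without constraints.
    """
    var = math.exp(log_var)
    return 0.5 * (log_var + (target - mean) ** 2 / var + math.log(2.0 * math.pi))

# Same prediction error, different claimed uncertainty:
overconfident = gaussian_nll(target=1.0, mean=0.0, log_var=-2.0)
calibrated = gaussian_nll(target=1.0, mean=0.0, log_var=0.0)
```

The overconfident prediction is penalized more heavily, which is why such a term pushes a model to report larger variance where its errors are larger.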

[CV-66] Adapting Video Diffusion Models for Time-Lapse Microscopy

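[Quick read]: This paper addresses the limited exploration of state-of-the-art generative video models in microscopy. Although these models have advanced significantly for natural videos, they remain underexplored for microscopy video generation. The authors fine-tune a pretrained video diffusion model on microscopy-specific sequences and explore three conditioning strategies: (1) text prompts derived from numeric phenotypic measurements (e.g., proliferation rates, migration speeds, cell-death frequencies); (2) direct numeric embeddings of phenotype scores; (3) image-conditioned generation, where an initial microscopy frame is extended into a complete video sequence. Evaluation with biologically meaningful morphological, proliferation, and migration metrics shows that fine-tuning substantially improves realism and accurately captures key cellular behaviors such as mitosis and migration. The fine-tuned model also generalizes beyond the training horizon and generates coherent cell dynamics in extended sequences. Precisely controlling specific phenotypic characteristics remains challenging, which points to future work on conditioning methods. The results suggest that domain-specific fine-tuning can yield biologically plausible synthetic microscopy data for applications such as in-silico hypothesis testing and data augmentation.

Link: https://arxiv.org/abs/2503.18583
Authors: Alexander Holmberg,Nils Mechtel,Wei Ouyang
Affiliations: Department of Applied Physics, Science for Life Laboratory, KTH Royal Institute of Technology, Stockholm, Sweden
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: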

Abstract:We present a domain adaptation of video diffusion models to generate highly realistic time-lapse microscopy videos of cell division in HeLa cells. Although state-of-the-art generative video models have advanced significantly for natural videos, they remain underexplored in microscopy domains. To address this gap, we fine-tune a pretrained video diffusion model on microscopy-specific sequences, exploring three conditioning strategies: (1) text prompts derived from numeric phenotypic measurements (e.g., proliferation rates, migration speeds, cell-death frequencies), (2) direct numeric embeddings of phenotype scores, and (3) image-conditioned generation, where an initial microscopy frame is extended into a complete video sequence. Evaluation using biologically meaningful morphological, proliferation, and migration metrics demonstrates that fine-tuning substantially improves realism and accurately captures critical cellular behaviors such as mitosis and migration. Notably, the fine-tuned model also generalizes beyond the training horizon, generating coherent cell dynamics even in extended sequences. However, precisely controlling specific phenotypic characteristics remains challenging, highlighting opportunities for future work to enhance conditioning methods. Our results demonstrate the potential for domain-specific fine-tuning of generative video models to produce biologically plausible synthetic microscopy data, supporting applications such as in-silico hypothesis testing and data augmentation.

[CV-67] Galaxy Walker: Geometry-aware VLMs For Galaxy-scale Understanding

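[Quick read]: This paper addresses two challenges in extending modern Vision-Language Models (VLMs) to astronomy: (1) current pre-trained models are confined to Euclidean space and lack a comprehensive geometric embedding; (2) the predominant architectures lack suitable backbones for anisotropic physical geometries. It proposes Galaxy-Walker, a geometry-aware VLM built on a geometry prompt and a geometry adapter. The geometry prompt generates geometry tokens by random walks across diverse spaces on a multi-scale physical graph, and the geometry adapter compresses and reshapes the space anisotropy in a mixture-of-experts manner. Galaxy-Walker achieves state-of-the-art performance in galaxy property estimation (R^2 scores up to 0.91) and morphology classification, significantly outperforming both domain-specific models and general-purpose VLMs.

Link: https://arxiv.org/abs/2503.18578
Authors: Tianyu Chen,Xingcheng Fu,Yisen Gao,Haodong Qian,Yuecen Wei,Kun Yan,Haoyi Zhou,Jianxin Li
Affiliations: SKLCCSE, School of Computer Science and Engineering, Beihang University, China; School of Software, Beihang University, China; Key Lab of Education Blockchain and Intelligent Technology, Guangxi Normal University, China; Institute of Artificial Intelligence, Beihang University, Beijing, China
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: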

Abstract: Modern vision-language models (VLMs) build patch embeddings and convolutional backbones within vector spaces, especially Euclidean ones, from the very foundation. When expanding VLMs to a galaxy scale for understanding astronomical phenomena, the integration of spherical space for planetary orbits and hyperbolic spaces for black holes raises two formidable challenges: a) the current pre-training model is confined to Euclidean space rather than a comprehensive geometric embedding; b) the predominant architecture lacks suitable backbones for anisotropic physical geometries. In this paper, we introduce Galaxy-Walker, a geometry-aware VLM, for universe-level vision understanding tasks. We propose the geometry prompt, which generates geometry tokens by random walks across diverse spaces on a multi-scale physical graph, along with a geometry adapter that compresses and reshapes the space anisotropy in a mixture-of-experts manner. Extensive experiments demonstrate the effectiveness of our approach, with Galaxy-Walker achieving state-of-the-art performance in both galaxy property estimation (R^2 scores up to 0.91) and morphology classification tasks (up to +0.17 F1 improvement in challenging features), significantly outperforming both domain-specific models and general-purpose VLMs.
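For context on the regression metric, R^2 in its standard form is the coefficient of determination: one minus the ratio of residual to total sum of squares. The sketch below assumes that standard definition; the abstract does not state how the paper computes it, and the sample values are made up.

```python
def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Made-up targets (e.g., some galaxy property) and two sets of predictions.
targets = [1.0, 2.0, 3.0]
perfect = r2_score(targets, [1.0, 2.0, 3.0])   # predictions match exactly
baseline = r2_score(targets, [2.0, 2.0, 2.0])  # always predict the mean
```

A score of 1 means perfect prediction and 0 means no better than predicting the mean, so a value around 0.9 indicates that most of the variance in the target is explained.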

[CV-68] Advancing Cross-Organ Domain Generalization with Test-Time Style Transfer and Diversity Enhancement

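[Quick read]: This paper addresses the performance degradation of deep learning models under domain shift, especially in multi-domain or cross-domain tasks. It proposes Test-time style transfer (T3s), which uses a bidirectional mapping mechanism to project source- and target-domain features into a unified feature space and so improves generalization. A Cross-domain style diversification module (CSDM) ensures orthogonality between style bases, further enlarging the style expression space. Data augmentation and low-rank adaptation improve feature alignment and sensitivity, so the model adapts effectively to multi-domain inputs. The method shows its effectiveness on three unseen datasets.

Link: https://arxiv.org/abs/2503.18567
Authors: Biwen Meng,Xi Long,Wanrong Yang,Ruochen Liu,Yi Tian,Yalin Zheng,Jingxin Liu
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 2025 IEEE International Symposium on Biomedical Imaging (ISBI)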

Abstract:Deep learning has made significant progress in addressing challenges in various fields including computational pathology (CPath). However, due to the complexity of the domain shift problem, the performance of existing models will degrade, especially when it comes to multi-domain or cross-domain tasks. In this paper, we propose a Test-time style transfer (T3s) that uses a bidirectional mapping mechanism to project the features of the source and target domains into a unified feature space, enhancing the generalization ability of the model. To further increase the style expression space, we introduce a Cross-domain style diversification module (CSDM) to ensure the orthogonality between style bases. In addition, data augmentation and low-rank adaptation techniques are used to improve feature alignment and sensitivity, enabling the model to adapt to multi-domain inputs effectively. Our method has demonstrated effectiveness on three unseen datasets.

[CV-69] AMD-Hummingbird: Towards an Efficient Text-to-Video Model WWW

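[Quick read]: This paper addresses the difficulty existing Text-to-Video (T2V) models have in balancing computational efficiency and high visual quality on resource-limited devices. Most prior work prioritizes visual fidelity and overlooks the need for smaller, more efficient models suited to real-world deployment. The paper proposes Hummingbird, a lightweight T2V framework that prunes existing models, reducing the U-Net from 1.4 billion to 0.7 billion parameters, and improves visual quality through visual feedback learning. It also introduces a data processing pipeline that uses Large Language Models (LLMs) and Video Quality Assessment (VQA) models to improve the quality of both text prompts and video data. The method achieves a 31X speedup, supports videos of up to 26 frames, and comes with publicly released training code to support user-driven training and customization.

Link: https://arxiv.org/abs/2503.18559
Authors: Takashi Isobe,He Cui,Dong Zhou,Mengmeng Ge,Dong Li,Emad Barsoum
Affiliations: Advanced Micro Devices, Inc. (AMD); Tsinghua University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Homepage: this https URL | GitHub: this https URL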

Abstract:Text-to-Video (T2V) generation has attracted significant attention for its ability to synthesize realistic videos from textual descriptions. However, existing models struggle to balance computational efficiency and high visual quality, particularly on resource-limited devices, e.g., iGPUs and mobile phones. Most prior work prioritizes visual fidelity while overlooking the need for smaller, more efficient models suitable for real-world deployment. To address this challenge, we propose a lightweight T2V framework, termed Hummingbird, which prunes existing models and enhances visual quality through visual feedback learning. Our approach reduces the size of the U-Net from 1.4 billion to 0.7 billion parameters, significantly improving efficiency while preserving high-quality video generation. Additionally, we introduce a novel data processing pipeline that leverages Large Language Models (LLMs) and Video Quality Assessment (VQA) models to enhance the quality of both text prompts and video data. To support user-driven training and style customization, we publicly release the full training code, including data processing and model training. Extensive experiments show that our method achieves a 31X speedup compared to state-of-the-art models such as VideoCrafter2, while also attaining the highest overall score on VBench. Moreover, our method supports the generation of videos with up to 26 frames, addressing the limitations of existing U-Net-based methods in long video generation. Notably, the entire training process requires only four GPUs, yet delivers performance competitive with existing leading methods. Hummingbird presents a practical and efficient solution for T2V generation, combining high performance, scalability, and flexibility for real-world applications.

[CV-70] LeanStereo: A Leaner Backbone based Stereo Network

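[Quick read]: This paper addresses the fact that the accuracy gains of end-to-end deep stereo matching networks come with higher computation and memory bandwidth requirements, which limits their use in practice. Such methods deliver high accuracy, but their long inference times make them hard to use in real-time settings. The paper proposes a fast end-to-end stereo matching method whose speedup comes mainly from a leaner backbone. To recover the accuracy lost to the leaner backbone, it combines a cost volume based on learned attention weights with a LogL1 loss, which improves overall performance and also speeds up convergence. A detailed empirical evaluation shows that the method needs 4x fewer operations and runs about 9 to 14x faster than state-of-the-art methods such as ACVNet, LEAStereo, and CFNet while giving comparable performance.

Link: https://arxiv.org/abs/2503.18557
Authors: Rafia Rahim,Samuel Woerz,Andreas Zell
Affiliations: University of Tuebingen
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes: 8 pages, 4 figures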

Abstract:Recently, stereo matching methods based on end-to-end deep networks have gained popularity, mainly because of their performance. However, this improvement in performance comes at the cost of increased computational and memory bandwidth requirements, thus necessitating specialized hardware (GPUs); even then, these methods have large inference times compared to classical methods. This limits their applicability in real-world applications. We desire high-accuracy stereo methods with reasonable inference time. To this end, we propose a fast end-to-end stereo matching method. The majority of this speedup comes from integrating a leaner backbone. To recover the performance lost because of a leaner backbone, we propose to use a cost volume based on learned attention weights, combined with LogL1 loss, for stereo matching. Using LogL1 loss not only improves the overall performance of the proposed network but also leads to faster convergence. We do a detailed empirical evaluation of different design choices and show that our method requires 4x fewer operations and is also about 9 to 14x faster than state-of-the-art methods like ACVNet [1], LEAStereo [2] and CFNet [3] while giving comparable performance.

[CV-71] ATARS: An Aerial Traffic Atomic Activity Recognition and Temporal Segmentation Dataset

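[Quick read]: This paper addresses two limitations of existing traffic atomic activity datasets: they cannot support analysis of an entire intersection, and they provide only video-level annotations, which forces laborious manual video trimming and limits their use on untrimmed videos. The key contribution is the Aerial Traffic Atomic Activity Recognition and Segmentation (ATARS) dataset, the first aerial dataset for multi-label atomic activity analysis, which labels atomic activities on every frame so that the time intervals of traffic activities are recorded precisely. The paper also proposes a new task, Multi-label Temporal Atomic Activity Recognition, which enables precise temporal localization of atomic activities and reduces the burden of manual trimming.

Link: https://arxiv.org/abs/2503.18553
Authors: Zihao Chen,Hsuanyu Wu,Chi-Hsi Kung,Yi-Ting Chen,Yan-Tsung Peng
Affiliations: National Chengchi University; Indiana University Bloomington; National Yang Ming Chiao Tung University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes: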

Abstract:Traffic Atomic Activity which describes traffic patterns for topological intersection dynamics is a crucial topic for the advancement of intelligent driving systems. However, existing atomic activity datasets are collected from an egocentric view, which cannot support the scenarios where traffic activities in an entire intersection are required. Moreover, existing datasets only provide video-level atomic activity annotations, which require exhausting efforts to manually trim the videos for recognition and limit their applications to untrimmed videos. To bridge this gap, we introduce the Aerial Traffic Atomic Activity Recognition and Segmentation (ATARS) dataset, the first aerial dataset designed for multi-label atomic activity analysis. We offer atomic activity labels for each frame, which accurately record the intervals for traffic activities. Moreover, we propose a novel task, Multi-label Temporal Atomic Activity Recognition, enabling the study of accurate temporal localization for atomic activity and easing the burden of manual video trimming for recognition. We conduct extensive experiments to evaluate existing state-of-the-art models on both atomic activity recognition and temporal atomic activity segmentation. The results highlight the unique challenges of our ATARS dataset, such as recognizing extremely small objects’ activities. We further provide comprehensive discussion analyzing these challenges and offer valuable insights for future direction to improve recognizing atomic activity in aerial view. Our source code and dataset are available at this https URL
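
The relationship between frame-level labels and activity intervals can be illustrated with a small sketch. It assumes each frame carries a set of active labels; the label names "A" and "B" and the function name `frames_to_intervals` are hypothetical and do not come from the ATARS release.

```python
from typing import Dict, List, Set, Tuple


def frames_to_intervals(frame_labels: List[Set[str]]) -> Dict[str, List[Tuple[int, int]]]:
    """Collapse per-frame multi-label annotations into inclusive (start, end) frame intervals."""
    intervals: Dict[str, List[Tuple[int, int]]] = {}
    open_start: Dict[str, int] = {}  # label -> frame index where its current run began
    for idx, labels in enumerate(frame_labels):
        # Close runs for labels that were active but are absent in this frame.
        for label in list(open_start):
            if label not in labels:
                intervals.setdefault(label, []).append((open_start.pop(label), idx - 1))
        # Open runs for labels that just became active.
        for label in labels:
            open_start.setdefault(label, idx)
    # Close any runs still open at the end of the video.
    last = len(frame_labels) - 1
    for label, start in open_start.items():
        intervals.setdefault(label, []).append((start, last))
    return intervals


# Hypothetical five-frame clip with two overlapping activities.
frames = [{"A"}, {"A", "B"}, {"B"}, set(), {"A"}]
print(frames_to_intervals(frames))
```

Several labels can be active in the same frame, which is why the sketch tracks one open run per label rather than a single current label.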

[CV-72] EvAnimate: Event-conditioned Image-to-Video Generation for Human Animation

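[Quick read]: This paper addresses the limitations of conventional video-derived motion cues (such as poses) for generating dynamic human sequences: low temporal resolution, motion blur, overexposure, and inaccuracy in low light. It proposes EvAnimate, a framework that uses event streams from event cameras as motion cues to animate static human images. First, a specialized event representation converts asynchronous event streams into 3-channel slices with controllable slicing rates and appropriate slice density, so that they are compatible with diffusion models. Second, a dual-branch architecture exploits the inherent motion dynamics of the event streams to generate high-quality, temporally consistent videos. Specialized data augmentation strategies further improve cross-person generalization. Finally, a new benchmark with simulated and real-world event data shows that EvAnimate achieves high temporal fidelity and robust performance where conventional video-derived cues fall short.

Link: https://arxiv.org/abs/2503.18552
Authors: Qiang Qu,Ming Li,Xiaoming Chen,Tongliang Liu
Affiliations: University of Sydney; Beijing Technology and Business University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
Notes: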

Abstract:Conditional human animation transforms a static reference image into a dynamic sequence by applying motion cues such as poses. These motion cues are typically derived from video data but are susceptible to limitations including low temporal resolution, motion blur, overexposure, and inaccuracies under low-light conditions. In contrast, event cameras provide data streams with exceptionally high temporal resolution, a wide dynamic range, and inherent resistance to motion blur and exposure issues. In this work, we propose EvAnimate, a framework that leverages event streams as motion cues to animate static human images. Our approach employs a specialized event representation that transforms asynchronous event streams into 3-channel slices with controllable slicing rates and appropriate slice density, ensuring compatibility with diffusion models. Subsequently, a dual-branch architecture generates high-quality videos by harnessing the inherent motion dynamics of the event streams, thereby enhancing both video quality and temporal consistency. Specialized data augmentation strategies further enhance cross-person generalization. Finally, we establish a new benchmark, including simulated event data for training and validation, and a real-world event dataset capturing human actions under normal and extreme scenarios. The experimental results demonstrate that EvAnimate achieves high temporal fidelity and robust performance in scenarios where traditional video-derived cues fall short.

[CV-73] Benchmarking Post-Hoc Unknown-Category Detection in Food Recognition

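[Quick read]: This paper addresses the difficulty food recognition models have in telling seen samples from unseen ones in real deployments. In automatic dietary assessment systems in particular, assigning an in-distribution (ID) label to a sample from an unseen category causes system-level errors. The paper evaluates a range of post-hoc methods for detecting out-of-distribution (OOD) samples in fine-grained food recognition. Virtual logit matching (ViM) performs best overall, likely because it combines logits with feature-space representations. The study also confirms that higher ID accuracy goes with better OOD detection, and that transformer-based architectures consistently outperform convolution-based ones across methods. The practical takeaway is to use ViM and to prefer architectures with higher ID accuracy.

Link: https://arxiv.org/abs/2503.18548
Authors: Lubnaa Abdur Rahman,Ioannis Papathanail,Lorenzo Brigato,Stavroula Mougiakakou
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes: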

Abstract:Food recognition models often struggle to distinguish between seen and unseen samples, frequently misclassifying samples from unseen categories by assigning them an in-distribution (ID) label. This misclassification presents significant challenges when deploying these models in real-world applications, particularly within automatic dietary assessment systems, where incorrect labels can lead to cascading errors throughout the system. Ideally, such models should prompt the user when an unknown sample is encountered, allowing for corrective action. Given no prior research exploring food recognition in real-world settings, in this work we conduct an empirical analysis of various post-hoc out-of-distribution (OOD) detection methods for fine-grained food recognition. Our findings indicate that virtual logit matching (ViM) performed the best overall, likely due to its combination of logits and feature-space representations. Additionally, our work reinforces prior notions in the OOD domain, noting that models with higher ID accuracy performed better across the evaluated OOD detection methods. Furthermore, transformer-based architectures consistently outperformed convolution-based models in detecting OOD samples across various methods.
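
To make "post-hoc OOD detection" concrete, here is a minimal sketch of the simplest such detector, maximum softmax probability (MSP). This is not ViM, which additionally uses feature-space information. The threshold of 0.5 and the function names are assumptions chosen for illustration.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a list of class logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def msp_score(logits: List[float]) -> float:
    """Confidence score: the largest softmax probability."""
    return max(softmax(logits))


def is_unknown(logits: List[float], threshold: float = 0.5) -> bool:
    """Flag a sample as out-of-distribution when confidence falls below the threshold.

    The threshold is a hypothetical value; in practice it would be tuned on held-out data.
    """
    return msp_score(logits) < threshold


print(is_unknown([5.0, 0.0, 0.0]))   # confident prediction
print(is_unknown([0.1, 0.0, 0.05]))  # near-uniform prediction
```

The detector is post-hoc in the sense that it reads the outputs of an already trained classifier and needs no retraining, which is what allows a system to prompt the user when the score is low.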

[CV-74] Distilling Stereo Networks for Performant and Efficient Leaner Networks IJCNN

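[Quick read]: This paper addresses the limited use of knowledge distillation in stereo matching networks. Knowledge distillation is widely used for vision tasks such as classification and segmentation, but it has received little attention for stereo matching, mainly because these networks are complex and combine multiple two- and three-dimensional modules. The key contribution is a joint framework that combines insights from state-of-the-art stereo methods with general knowledge-distillation techniques. With a carefully designed distillation pipeline (from the backbone to the choice of distillation points and the corresponding loss functions), the student networks are leaner and faster while still performing well. On the SceneFlow dataset, the student network outperforms PSMNet, CFNet, and LEAStereo while being 8x, 5x, and 8x faster, respectively; it also outperforms the tested speed-oriented methods with inference times under 100 ms. The student network additionally generalizes better to the unseen ETH3D and Middlebury datasets.

Link: https://arxiv.org/abs/2503.18544
Authors: Rafia Rahim,Samuel Woerz,Andreas Zell
Affiliations: University of Tuebingen
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes: 8 pages, 3 figures. Published in: 2023 International Joint Conference on Neural Networks (IJCNN)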

Abstract:Knowledge distillation has been quite popular in vision for tasks like classification and segmentation; however, not much work has been done on distilling state-of-the-art stereo matching methods despite their range of applications. One of the reasons for its lack of use in stereo matching networks is the inherent complexity of these networks, where a typical network is composed of multiple two- and three-dimensional modules. In this work, we systematically combine the insights from state-of-the-art stereo methods with general knowledge-distillation techniques to develop a joint framework for stereo network distillation with competitive results and faster inference. Moreover, we show, via a detailed empirical analysis, that distilling knowledge from the stereo network requires careful design of the complete distillation pipeline, starting from the backbone to the right selection of distillation points and corresponding loss functions. This results in student networks that are not only leaner and faster but also give excellent performance. For instance, our student network, while performing better than performance-oriented methods like PSMNet [1], CFNet [2], and LEAStereo [3] on the benchmark SceneFlow dataset, is 8x, 5x, and 8x faster, respectively. Furthermore, compared to speed-oriented methods with inference times below 100ms, our student networks perform better than all the tested methods. In addition, our student network also shows better generalization capabilities when tested on unseen datasets like ETH3D and Middlebury.

[CV-75] UniPCGC: Towards Practical Point Cloud Geometry Compression via an Efficient Unified Approach AAAI2025

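[Quick read]: This paper addresses the obstacles that keep learning-based point cloud compression from practical use: high complexity, limited compression modes, and no support for variable rate. It proposes UniPCGC, an efficient unified point cloud geometry compression framework. In the lossless mode, the Uneven 8-Stage Lossless Coder (UELC) allocates more computation to groups that are harder to code and merges groups that are easier to code. In the lossy mode, the Variable Rate and Complexity Module (VRCM) combines a rate modulation module with dynamic sparse convolution. By dynamically combining UELC and VRCM, a single framework supports lossless compression, lossy compression, variable rate, and variable complexity. Compared with the previous state-of-the-art method, it achieves a compression ratio (CR) gain of 8.1% on lossless compression and a Bjontegaard Delta Rate (BD-Rate) gain of 14.02% on lossy compression.

Link: https://arxiv.org/abs/2503.18541
Authors: Kangli Wang,Wei Gao
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)
Notes: Accepted to AAAI 2025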

Abstract:Learning-based point cloud compression methods have made significant progress in terms of performance. However, these methods still encounter challenges including high complexity, limited compression modes, and a lack of support for variable rate, which restrict the practical application of these methods. In order to promote the development of practical point cloud compression, we propose an efficient unified point cloud geometry compression framework, dubbed as UniPCGC. It is a lightweight framework that supports lossy compression, lossless compression, variable rate and variable complexity. First, we introduce the Uneven 8-Stage Lossless Coder (UELC) in the lossless mode, which allocates more computational complexity to groups with higher coding difficulty, and merges groups with lower coding difficulty. Second, Variable Rate and Complexity Module (VRCM) is achieved in the lossy mode through joint adoption of a rate modulation module and dynamic sparse convolution. Finally, through the dynamic combination of UELC and VRCM, we achieve lossy compression, lossless compression, variable rate and complexity within a unified framework. Compared to the previous state-of-the-art method, our method achieves a compression ratio (CR) gain of 8.1% on lossless compression, and a Bjontegaard Delta Rate (BD-Rate) gain of 14.02% on lossy compression, while also supporting variable rate and variable complexity.

[CV-76] HiRes-FusedMIM: A High-Resolution RGB-DSM Pre-trained Model for Building-Level Remote Sensing Applications

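[Quick read]: This paper addresses the fact that existing self-supervised methods overlook high-resolution Digital Surface Models (DSMs) when modeling urban environments, particularly for building-level analysis, which matters for applications such as digital twins. It proposes HiRes-FusedMIM, a pre-trained model designed to fuse information from high-resolution RGB and DSM data. The core of the approach is a dual-encoder Simple Masked Image Modeling (SimMIM) architecture trained with a multi-objective loss that combines reconstruction and contrastive objectives, so that the model learns joint representations from both modalities. The approach improves performance on several building-related tasks and demonstrates both the value of including DSMs in pre-training and the benefit of separate encoders.

Link: https://arxiv.org/abs/2503.18540
Authors: Guneet Mutreja,Philipp Schuegraf,Ksenia Bittner
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Notes: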

Abstract:Recent advances in self-supervised learning have led to the development of foundation models that have significantly advanced performance in various computer vision tasks. However, despite their potential, these models often overlook the crucial role of high-resolution digital surface models (DSMs) in understanding urban environments, particularly for building-level analysis, which is essential for applications like digital twins. To address this gap, we introduce HiRes-FusedMIM, a novel pre-trained model specifically designed to leverage the rich information contained within high-resolution RGB and DSM data. HiRes-FusedMIM utilizes a dual-encoder simple masked image modeling (SimMIM) architecture with a multi-objective loss function that combines reconstruction and contrastive objectives, enabling it to learn powerful, joint representations from both modalities. We conducted a comprehensive evaluation of HiRes-FusedMIM on a diverse set of downstream tasks, including classification, semantic segmentation, and instance segmentation. Our results demonstrate that: 1) HiRes-FusedMIM outperforms previous state-of-the-art geospatial methods on several building-related datasets, including WHU Aerial and LoveDA, demonstrating its effectiveness in capturing and leveraging fine-grained building information; 2) Incorporating DSMs during pre-training consistently improves performance compared to using RGB data alone, highlighting the value of elevation information for building-level analysis; 3) The dual-encoder architecture of HiRes-FusedMIM, with separate encoders for RGB and DSM data, significantly outperforms a single-encoder model on the Vaihingen segmentation task, indicating the benefits of learning specialized representations for each modality. To facilitate further research and applications in this direction, we will publicly release the trained model weights.

[CV-77] DiN: Diffusion Model for Robust Medical VQA with Semantic Noisy Labels

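[Quick read]: This paper addresses the challenges that noisy labels and limited high-quality datasets pose for Medical Visual Question Answering (Med-VQA) systems. It establishes the first benchmark for noisy labels in Med-VQA, with several semantic noise types designed to simulate human mislabeling. The key contribution is the DiN framework. Its Answer Diffuser (AD) module follows a coarse-to-fine process that refines answer candidates with a diffusion model to improve accuracy. The Answer Condition Generator (ACG) module strengthens this process by integrating answer embeddings with image-question features. The Noisy Label Refinement (NLR) module adds a robust loss function and dynamic answer adjustment to further improve the AD module.

Link: https://arxiv.org/abs/2503.18536
Authors: Erjian Guo,Zhen Zhao,Zicheng Wang,Tong Chen,Yunyi Liu,Luping Zhou
Affiliations: University of Sydney; Shanghai AI Laboratory
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes: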

Abstract:Medical Visual Question Answering (Med-VQA) systems benefit the interpretation of medical images containing critical clinical information. However, the challenge of noisy labels and limited high-quality datasets remains underexplored. To address this, we establish the first benchmark for noisy labels in Med-VQA by simulating human mislabeling with semantically designed noise types. More importantly, we introduce the DiN framework, which leverages a diffusion model to handle noisy labels in Med-VQA. Unlike the dominant classification-based VQA approaches that directly predict answers, our Answer Diffuser (AD) module employs a coarse-to-fine process, refining answer candidates with a diffusion model for improved accuracy. The Answer Condition Generator (ACG) further enhances this process by generating task-specific conditional information via integrating answer embeddings with fused image-question features. To address label noise, our Noisy Label Refinement (NLR) module introduces a robust loss function and dynamic answer adjustment to further boost the performance of the AD module.

[CV-78] k-NN as a Simple and Effective Estimator of Transferability

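[Quick read]: This paper asks how accurately transfer learning performance can be predicted in a new setting where the domain is shifted, the task is different, and the architecture changes. Many transferability metrics have been proposed for this purpose, but their predictive accuracy in realistic new settings has been unclear. The paper evaluates 23 transferability metrics on 16 datasets in more than 42,000 experiments and finds that no existing metric performs well across the board. The key finding is that a simple k-nearest neighbor (k-NN) evaluation surpasses the existing metrics while also being cheaper to compute and easier to implement.

Link: https://arxiv.org/abs/2503.18528
Authors: Moein Sorkhei,Christos Matsoukas,Johan Fredin Haslum,Kevin Smith
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Notes: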

Abstract:How well can one expect transfer learning to work in a new setting where the domain is shifted, the task is different, and the architecture changes? Many transfer learning metrics have been proposed to answer this question. But how accurate are their predictions in a realistic new setting? We conducted an extensive evaluation involving over 42,000 experiments comparing 23 transferability metrics across 16 different datasets to assess their ability to predict transfer performance. Our findings reveal that none of the existing metrics perform well across the board. However, we find that a simple k-nearest neighbor evaluation – as is commonly used to evaluate feature quality for self-supervision – not only surpasses existing metrics, but also offers better computational efficiency and ease of implementation.
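
A k-NN evaluation of feature quality can be sketched in a few lines. The version below is a leave-one-out variant with squared Euclidean distance on features assumed to come from a frozen pre-trained model applied to target data. The abstract does not state the exact protocol, so the leave-one-out choice, the distance, the default `k`, and the function name are all assumptions.

```python
from collections import Counter
from typing import List, Sequence


def knn_transfer_score(features: Sequence[Sequence[float]],
                       labels: Sequence[int],
                       k: int = 1) -> float:
    """Leave-one-out k-NN accuracy on target features (higher suggests better transfer)."""
    correct = 0
    for i, query in enumerate(features):
        # Squared Euclidean distance from the query to every other sample.
        dists: List[tuple] = [
            (sum((a - b) ** 2 for a, b in zip(query, other)), labels[j])
            for j, other in enumerate(features)
            if j != i
        ]
        dists.sort(key=lambda pair: pair[0])
        # Majority vote among the k nearest neighbours.
        votes = Counter(label for _, label in dists[:k])
        if votes.most_common(1)[0][0] == labels[i]:
            correct += 1
    return correct / len(features)


# Hypothetical toy features: two well-separated classes.
feats = [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]]
print(knn_transfer_score(feats, [0, 0, 1, 1], k=1))
```

The appeal noted in the abstract is visible here: the score needs no training on the target task, only a forward pass to extract features plus a distance computation.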

[CV-79] AIM2PC: Aerial Image to 3D Building Point Cloud Reconstruction

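[Quick read]: This paper addresses two main challenges in reconstructing 3D urban buildings from single-view images: existing methods focus on rooftop details in aerial images and overlook essential geometric details of the building, and there is a lack of datasets with complete 3D building point clouds together with reliable camera poses for aerial images. It proposes AIM2PC, which relies on a self-built dataset of complete 3D point clouds and determined camera poses. Features from a single aerial image are concatenated with additional conditions such as binary masks and Sobel edge maps to make the reconstruction more edge-aware. A point cloud diffusion model based on Centered denoising Diffusion Probabilistic Models (CDPM) projects these features onto the partially denoised point cloud at each diffusion step using the camera poses. The method reconstructs complete 3D building point clouds, including wall information, and outperforms existing baselines.

Link: https://arxiv.org/abs/2503.18527
Authors: Soulaimene Turki,Daniel Panangian,Houda Chaabouni-Chouayakh,Ksenia Bittner
Affiliations: Remote Sensing Technology Institute, German Aerospace Center (DLR); Sm@rts Laboratory, Digital Research Center of Sfax; Higher School of Communication of Tunis
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes: Accepted to ISPRS Geospatial Week 2025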

Abstract:Three-dimensional urban reconstruction of buildings from single-view images has attracted significant attention over the past two decades. However, recent methods primarily focus on rooftops from aerial images, often overlooking essential geometrical details. Additionally, there is a notable lack of datasets containing complete 3D point clouds for entire buildings, along with challenges in obtaining reliable camera pose information for aerial images. This paper addresses these challenges by presenting a novel methodology, AIM2PC, which utilizes our generated dataset that includes complete 3D point clouds and determined camera poses. Our approach takes features from a single aerial image as input and concatenates them with essential additional conditions, such as binary masks and Sobel edge maps, to enable more edge-aware reconstruction. By incorporating a point cloud diffusion model based on Centered denoising Diffusion Probabilistic Models (CDPM), we project these concatenated features onto the partially denoised point cloud using our camera poses at each diffusion step. The proposed method is able to reconstruct the complete 3D building point cloud, including wall information, and demonstrates superior performance compared to existing baseline techniques. To allow further comparisons with our methodology, the dataset has been made available at this https URL
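
For readers unfamiliar with the Sobel edge maps used as an extra condition, the following sketch computes a Sobel gradient magnitude on a small grayscale grid. It uses the standard 3x3 Sobel kernels; leaving border pixels at zero is a simplification chosen here, and nothing in the sketch reflects how AIM2PC itself preprocesses images.

```python
import math
from typing import List

# Standard 3x3 Sobel kernels for horizontal and vertical gradients.
SOBEL_X = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
SOBEL_Y = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]


def sobel_edges(image: List[List[float]]) -> List[List[float]]:
    """Return the Sobel gradient magnitude; border pixels are left at 0.0."""
    h, w = len(image), len(image[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            gx = sum(SOBEL_X[j][i] * image[y + j - 1][x + i - 1]
                     for j in range(3) for i in range(3))
            gy = sum(SOBEL_Y[j][i] * image[y + j - 1][x + i - 1]
                     for j in range(3) for i in range(3))
            out[y][x] = math.hypot(gx, gy)
    return out


# Toy image with a vertical edge between the second and third columns.
toy = [[0, 0, 1, 1] for _ in range(4)]
for row in sobel_edges(toy):
    print(row)
```

An edge map of this kind responds strongly at building outlines, which is one plausible reason it helps a reconstruction model attend to edges.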

[CV-80] LookCloser: Frequency-aware Radiance Field for Tiny-Detail Scene CVPR2025

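[Quick read]: This paper addresses the inability of existing NeRF frameworks to model both the overall structure of a scene and its local high-frequency details at the same time. It proposes FA-NeRF (Frequency-Aware Neural Radiance Fields), a frequency-aware framework that captures overall scene structure and high-definition details within a single model. The core is a 3D frequency quantification method that analyzes the frequency distribution of the scene and enables frequency-aware rendering. A frequency grid speeds up convergence and querying, and a frequency-aware feature re-weighting strategy balances features across different frequency contents. Experiments show that the method models entire scenes while preserving fine details and significantly outperforms existing approaches.

Link: https://arxiv.org/abs/2503.18513
Authors: Xiaoyu Zhang,Weihong Pan,Chong Bao,Xiyu Zhang,Xiaojun Xiang,Hanqing Jiang,Hujun Bao
Affiliations: SenseTime Research; State Key Lab of CAD&CG, Zhejiang University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes: 8 pages, 6 figures. Accepted to CVPR 2025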

Abstract:Humans perceive and comprehend their surroundings through information spanning multiple frequencies. In immersive scenes, people naturally scan their environment to grasp its overall structure while examining fine details of objects that capture their attention. However, current NeRF frameworks primarily focus on modeling either high-frequency local views or the broad structure of scenes with low-frequency information, and are limited in their ability to balance both. We introduce FA-NeRF, a novel frequency-aware framework for view synthesis that simultaneously captures the overall scene structure and high-definition details within a single NeRF model. To achieve this, we propose a 3D frequency quantification method that analyzes the scene’s frequency distribution, enabling frequency-aware rendering. Our framework incorporates a frequency grid for fast convergence and querying, and a frequency-aware feature re-weighting strategy to balance features across different frequency contents. Extensive experiments show that our method significantly outperforms existing approaches in modeling entire scenes while preserving fine details.

[CV-81] Uncertainty-guided Perturbation for Image Super-Resolution Diffusion Model CVPR2025

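[Quick read]: This paper addresses how diffusion models for image super-resolution can make better use of Low-Resolution (LR) information to improve reconstruction. The key technique is Uncertainty-guided Noise Weighting, which uses an uncertainty estimate to control the noise level region by region. The authors observe that different regions of the LR image correspond to different timesteps of the diffusion process: flat areas are close to the target High-Resolution (HR) distribution, while edge and texture regions are farther away. Applying only slight noise in flat areas therefore benefits reconstruction. The authors also modify the network architecture of previous methods to build the Uncertainty-guided Perturbation Super-Resolution (UPSR) model. Experiments show that, despite a smaller model and lower training overhead, the method outperforms current state-of-the-art methods across datasets, both quantitatively and qualitatively.

Link: https://arxiv.org/abs/2503.18512
Authors: Leheng Zhang,Weiyi You,Kexuan Shi,Shuhang Gu
Affiliations: University of Electronic Science and Technology of China
Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Notes: Accepted to CVPR 2025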

Abstract:Diffusion-based image super-resolution methods have demonstrated significant advantages over GAN-based approaches, particularly in terms of perceptual quality. Building upon a lengthy Markov chain, diffusion-based methods possess remarkable modeling capacity, enabling them to achieve outstanding performance in real-world scenarios. Unlike previous methods that focus on modifying the noise schedule or sampling process to enhance performance, our approach emphasizes the improved utilization of LR information. We find that different regions of the LR image can be viewed as corresponding to different timesteps in a diffusion process, where flat areas are closer to the target HR distribution but edge and texture regions are farther away. In these flat areas, applying slight noise is more advantageous for the reconstruction. We associate this characteristic with uncertainty and propose to apply an uncertainty estimate to guide region-specific noise level control, a technique we refer to as Uncertainty-guided Noise Weighting. Pixels with lower uncertainty (i.e., flat regions) receive reduced noise to preserve more LR information, therefore improving performance. Furthermore, we modify the network architecture of previous methods to develop our Uncertainty-guided Perturbation Super-Resolution (UPSR) model. Extensive experimental results demonstrate that, despite reduced model size and training overhead, the proposed UPSR method outperforms current state-of-the-art methods across various datasets, both quantitatively and qualitatively.
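
The idea of region-specific noise control can be sketched as follows. The sketch assumes an uncertainty map with values in [0, 1] is already available and maps it linearly to a per-pixel noise weight. The linear mapping, the `min_weight` floor, and the function names are hypothetical; the abstract does not specify how the paper estimates uncertainty or converts it to a noise level.

```python
import random
from typing import List


def noise_weights(uncertainty: List[List[float]], min_weight: float = 0.2) -> List[List[float]]:
    """Map per-pixel uncertainty in [0, 1] to a noise weight in [min_weight, 1].

    Low uncertainty (flat regions) gets a small weight, so less noise is added there.
    The linear mapping is an illustrative assumption.
    """
    return [
        [min_weight + (1.0 - min_weight) * min(max(u, 0.0), 1.0) for u in row]
        for row in uncertainty
    ]


def perturb(lr_image: List[List[float]],
            uncertainty: List[List[float]],
            sigma: float = 1.0,
            min_weight: float = 0.2,
            seed: int = 0) -> List[List[float]]:
    """Add Gaussian noise to the LR image, scaled per pixel by the noise weights."""
    rng = random.Random(seed)
    weights = noise_weights(uncertainty, min_weight)
    return [
        [p + sigma * w * rng.gauss(0.0, 1.0) for p, w in zip(prow, wrow)]
        for prow, wrow in zip(lr_image, weights)
    ]


# Hypothetical 1x3 image: flat pixel, edge pixel, and something in between.
print(noise_weights([[0.0, 1.0, 0.5]]))
print(perturb([[1.0, 2.0, 3.0]], [[0.0, 1.0, 0.5]], sigma=0.1))
```

Because flat pixels receive a smaller weight, more of the original LR signal survives the perturbation in those regions, which matches the intuition stated in the abstract.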

[CV-82] Can Text-to-Video Generation help Video-Language Alignment? CVPR2025

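[Quick read]: This paper addresses the linguistic bias that can arise when video-language alignment models are trained with negative captions generated by large language models. The bias arises because some concepts appear only in negative captions and are never associated with a video, and existing databases lack the fine-grained variations needed to cover all possible negatives. The paper studies whether synthetic videos can fill this gap. Experiments show that the effect of synthetic videos is task-dependent and that they can hurt performance in some cases, which the authors attribute to semantic and visual noise in the generated videos. They propose SynViTA, which dynamically weights the contribution of each synthetic video according to how similar its target caption is to the real caption, and adds a semantic consistency loss that steers the model toward fine-grained differences between captions rather than differences in video appearance. On average, SynViTA improves on existing methods on several benchmarks, a first promising step toward using synthetic videos in video-language model learning.

Link: https://arxiv.org/abs/2503.18507
Authors: Luca Zanella,Massimiliano Mancini,Willi Menapace,Sergey Tulyakov,Yiming Wang,Elisa Ricci
Affiliations: University of Trento; Snap Inc.; Fondazione Bruno Kessler
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes: CVPR 2025. Project website at this https URL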

Abstract:Recent video-language alignment models are trained on sets of videos, each with an associated positive caption and a negative caption generated by large language models. A problem with this procedure is that negative captions may introduce linguistic biases, i.e., concepts are seen only as negatives and never associated with a video. While a solution would be to collect videos for the negative captions, existing databases lack the fine-grained variations needed to cover all possible negatives. In this work, we study whether synthetic videos can help to overcome this issue. Our preliminary analysis with multiple generators shows that, while promising on some tasks, synthetic videos harm the performance of the model on others. We hypothesize this issue is linked to noise (semantic and visual) in the generated videos and develop a method, SynViTA, that accounts for those. SynViTA dynamically weights the contribution of each synthetic video based on how similar its target caption is w.r.t. the real counterpart. Moreover, a semantic consistency loss makes the model focus on fine-grained differences across captions, rather than differences in video appearance. Experiments show that, on average, SynViTA improves over existing methods on VideoCon test sets and SSv2-Temporal, SSv2-Events, and ATP-Hard benchmarks, being a first promising step for using synthetic videos when learning video-language models.

[CV-83] Explaining Domain Shifts in Language: Concept erasing for Interpretable Image Classification CVPR2025

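[Quick read]: This paper addresses the problem that domain-specific concepts influence the final predictions of concept-based models, which weakens generalization and limits use in high-stakes applications. It proposes the Language-guided Concept-Erasing (LanCE) framework. The key component is a plug-in Domain Descriptor Orthogonality (DDO) regularizer: pre-trained Vision-Language Models (VLMs) are used to approximate distinct visual domain shifts via domain descriptors, Large Language Models (LLMs) are prompted to simulate descriptors of unseen visual domains, and the regularizer reduces the influence of domain-specific concepts on the final predictions. The DDO regularizer is agnostic to the design of the concept-based model, can be integrated into several existing models, and significantly improves out-of-distribution generalization.

Link: https://arxiv.org/abs/2503.18483
Authors: Zequn Zeng,Yudi Su,Jianqiao Sun,Tiansheng Wen,Hao Zhang,Zhengjue Wang,Bo Chen,Hongwei Liu,Jiawei Ma
Affiliations: National Key Laboratory of Radar Signal Processing, Xidian University, Xi’an, 710071, China; State Key Laboratory of Integrated Service Networks, Xidian University, Xi’an, 710071, China; City University of Hong Kong, Hong Kong SAR
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes: Accepted by CVPR2025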

Abstract:Concept-based models can map black-box representations to human-understandable concepts, which makes the decision-making process more transparent and allows users to understand the reason behind predictions. However, domain-specific concepts often impact the final predictions, which subsequently undermines the model's generalization capabilities and prevents the model from being used in high-stakes applications. In this paper, we propose a novel Language-guided Concept-Erasing (LanCE) framework. In particular, we empirically demonstrate that pre-trained vision-language models (VLMs) can approximate distinct visual domain shifts via domain descriptors, while prompting Large Language Models (LLMs) can easily simulate a wide range of descriptors of unseen visual domains. Then, we introduce a novel plug-in domain descriptor orthogonality (DDO) regularizer to mitigate the impact of these domain-specific concepts on the final predictions. Notably, the DDO regularizer is agnostic to the design of concept-based models and we integrate it into several prevailing models. Through evaluation of domain generalization on four standard benchmarks and three newly introduced benchmarks, we demonstrate that DDO can significantly improve the out-of-distribution (OOD) generalization over the previous state-of-the-art concept-based models. The code is available at this https URL.
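
One generic way to express an orthogonality penalty between concept vectors and domain-descriptor embeddings is the mean squared cosine similarity, sketched below. This is an illustrative stand-in and not the paper's DDO formulation, which the abstract does not spell out. The vectors and the function names are hypothetical, and all vectors are assumed to be nonzero.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two nonzero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def orthogonality_penalty(concepts: List[Sequence[float]],
                          descriptors: List[Sequence[float]]) -> float:
    """Mean squared cosine similarity over all (concept, descriptor) pairs.

    The penalty is 0 when every concept is orthogonal to every domain descriptor
    and 1 when every pair is perfectly aligned or anti-aligned.
    """
    total = sum(cosine(c, d) ** 2 for c in concepts for d in descriptors)
    return total / (len(concepts) * len(descriptors))


# Hypothetical 2-D embeddings.
print(orthogonality_penalty([[1.0, 0.0]], [[0.0, 1.0]]))  # orthogonal pair
print(orthogonality_penalty([[1.0, 0.0]], [[2.0, 0.0]]))  # aligned pair
```

Adding such a term to a training loss pushes the concepts used for prediction away from directions that encode the visual domain, which is the stated goal of erasing domain-specific concepts.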

[CV-84] Video-XL-Pro: Reconstructive Token Compression for Extremely Long Video Understanding

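[Quick read]: This paper addresses the poor efficiency and performance of existing Multimodal Large Language Models (MLLMs) on hour-long videos. It proposes Video-XL-Pro, built on Reconstructive Compression of Tokens (ReCoT), a learnable module that uses self-supervised learning to generate comprehensive and compact video tokens. ReCoT has two key components: (i) the Dynamic Token Synthesizer (DTS), which generates pseudo-video tokens from static image tokens by learning intra-token relationships; and (ii) Semantic-Guided Masking (SGM), which adaptively masks redundant visual tokens to make reconstructive learning more effective. The paper also introduces a video-specific dataset pruning strategy and a Query-aware Selector, which further improve training efficiency and performance on long video understanding.

Link: https://arxiv.org/abs/2503.18478
Authors: Xiangrui Liu,Yan Shu,Zheng Liu,Ao Li,Yang Tian,Bo Zhao
Affiliations: School of AI, Shanghai Jiao Tong University; Beijing Academy of Artificial Intelligence; University of Trento
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes: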

Abstract:Despite advanced token compression techniques, existing multimodal large language models (MLLMs) still struggle with hour-long video understanding. In this work, we propose Video-XL-Pro, an efficient method for extremely long video understanding, built upon Reconstructive Compression of Tokens (ReCoT), a learnable module that leverages self-supervised learning to generate comprehensive and compact video tokens. ReCoT introduces two key components: (i) Dynamic Token Synthesizer (DTS): DTS generates pseudo-video tokens from static image tokens by learning intra-token relationships, which are then used in masked video modeling. (ii) Semantic-Guided Masking (SGM): SGM adaptively masks redundant visual tokens to facilitate more effective reconstructive learning. To improve training efficiency in MLLM fine-tuning, we introduce a video-specific dataset pruning strategy and design a simple Query-aware Selector that enables the model to precisely locate query-relevant video tokens. With only 3B parameters, Video-XL-Pro outperforms most 7B models trained on larger datasets across multiple long video understanding benchmarks. Moreover, it can process over 8K frames on a single A100 GPU while maintaining high-quality performance.

[CV-85] MetaSpatial: Reinforcing 3D Spatial Reasoning in VLMs for the Metaverse

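[Quick read]: This paper addresses two core problems: (1) vision-language models (VLMs) lack internalized 3D spatial reasoning, which limits their ability to generate realistic layouts; and (2) traditional supervised fine-tuning (SFT) is inefficient for layout generation because perfect ground-truth annotations are unavailable. The key innovation is a multi-turn reinforcement learning (RL) optimization mechanism that integrates physics-aware constraints and rendered-image evaluations, so that generated 3D layouts are coherent, physically plausible, and aesthetically consistent. Through an adaptive, iterative reasoning process, the VLM analyzes rendered outputs and progressively improves scene coherence.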

Link: https://arxiv.org/abs/2503.18470
Authors: Zhenyu Pan,Han Liu
Affiliations: Department of Computer Science, Northwestern University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Working Paper

Abstract:We present MetaSpatial, the first reinforcement learning (RL)-based framework designed to enhance 3D spatial reasoning in vision-language models (VLMs), enabling real-time 3D scene generation without the need for hard-coded optimizations. MetaSpatial addresses two core challenges: (i) the lack of internalized 3D spatial reasoning in VLMs, which limits their ability to generate realistic layouts, and (ii) the inefficiency of traditional supervised fine-tuning (SFT) for layout generation tasks, as perfect ground truth annotations are unavailable. Our key innovation is a multi-turn RL-based optimization mechanism that integrates physics-aware constraints and rendered image evaluations, ensuring generated 3D layouts are coherent, physically plausible, and aesthetically consistent. Methodologically, MetaSpatial introduces an adaptive, iterative reasoning process, where the VLM refines spatial arrangements over multiple turns by analyzing rendered outputs, improving scene coherence progressively. Empirical evaluations demonstrate that MetaSpatial significantly enhances the spatial consistency and formatting stability of various scale models. Post-training, object placements are more realistic, aligned, and functionally coherent, validating the effectiveness of RL for 3D spatial reasoning in metaverse, AR/VR, digital twins, and game development applications. Our code, data, and training pipeline are publicly available at this https URL.

[CV-86] CFReID: Continual Few-shot Person Re-Identification

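[Quick read]: Real-world surveillance systems evolve dynamically, and conventional Lifelong ReID (LReID) models need large amounts of labeled data to adapt to each new domain, which is usually unavailable because of privacy and cost constraints. To address this, the paper proposes a new paradigm, Continual Few-shot ReID (CFReID), and designs a Stable Distribution Alignment (SDA) framework as the core of its solution. SDA consists of two modules: Meta Distribution Alignment (MDA) and Prototype-based Few-shot Adaptation (PFA). With these modules, SDA handles the twin challenges of learning new-domain knowledge from few-shot data while avoiding forgetting of previously seen domains. Experiments show that, using only a small amount of data (32 IDs, i.e., 5% of the data), the method significantly outperforms conventional LReID methods that require 700 to 1,000 IDs.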

Link: https://arxiv.org/abs/2503.18469
Authors: Hao Ni,Lianli Gao,Pengpeng Zeng,Heng Tao Shen,Jingkuan Song
Affiliations: Center for Future Media and School of Computer Science and Engineering, University of Electronic Science and Technology of China; Department of Information Engineering and Computer Science, Tongji University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 16 pages, 8 figures

Abstract:Real-world surveillance systems are dynamically evolving, requiring a person Re-identification model to continuously handle newly incoming data from various domains. To cope with these dynamics, Lifelong ReID (LReID) has been proposed to learn and accumulate knowledge across multiple domains incrementally. However, LReID models need to be trained on large-scale labeled data for each unseen domain, which are typically inaccessible due to privacy and cost concerns. In this paper, we propose a new paradigm called Continual Few-shot ReID (CFReID), which requires models to be incrementally trained using few-shot data and tested on all seen domains. Under few-shot conditions, CFReID faces two core challenges: 1) learning knowledge from few-shot data of unseen domains, and 2) avoiding catastrophic forgetting of seen domains. To tackle these two challenges, we propose a Stable Distribution Alignment (SDA) framework from a feature distribution perspective. Specifically, our SDA is composed of two modules, i.e., Meta Distribution Alignment (MDA) and Prototype-based Few-shot Adaptation (PFA). To support the study of CFReID, we establish an evaluation benchmark for CFReID on five publicly available ReID datasets. Extensive experiments demonstrate that our SDA can enhance the few-shot learning and anti-forgetting capabilities under few-shot conditions. Notably, our approach, using only 5% of the data, i.e., 32 IDs, significantly outperforms LReID’s state-of-the-art performance, which requires 700 to 1,000 IDs.

[CV-87] SIT-FER: Integration of Semantic-, Instance-, Text-level Information for Semi-supervised Facial Expression Recognition

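[Quick read]: This paper addresses the limited performance of semi-supervised deep facial expression recognition (SS-DFER) caused by the difficulty of obtaining sufficient labeled data. Existing methods rely mainly on generated semantic-level pseudo-labels for supervised learning, but these pseudo-labels are unreliable, which hurts both performance and practical utility. To overcome this, the paper proposes a new SS-DFER framework whose key idea is to combine semantic-, instance-, and text-level information to generate high-quality pseudo-labels. For unlabeled data, it computes the similarities between facial visual features and the corresponding textual descriptions and instance representations to obtain text-level and instance-level probabilities, then aggregates them with the semantic-level probability into the final pseudo-labels. To make better use of the one-hot labels of labeled data, it also introduces text embeddings mined from textual descriptions to co-supervise training, so that facial visual features exhibit semantic correlations in the text space. Experiments show that the method significantly outperforms current state-of-the-art SS-DFER methods and even exceeds fully supervised baselines.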

Link: https://arxiv.org/abs/2503.18463
Authors: Sixian Ding,Xu Jiang,Zhongjing Du,Jiaqi Cui,Xinyi Zeng,Yan Wang
Affiliations: College of Computer Science, Sichuan University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Semi-supervised deep facial expression recognition (SS-DFER) has gained increasing research interest due to the difficulty in accessing sufficient labeled data in practical settings. However, existing SS-DFER methods mainly utilize generated semantic-level pseudo-labels for supervised learning, the unreliability of which compromises their performance and undermines the practical utility. In this paper, we propose a novel SS-DFER framework that simultaneously incorporates semantic, instance, and text-level information to generate high-quality pseudo-labels. Specifically, for the unlabeled data, considering the comprehensive knowledge within the textual descriptions and instance representations, we respectively calculate the similarities between the facial vision features and the corresponding textual and instance features to obtain the probabilities at the text- and instance-level. Combined with the semantic-level probability, these three-level probabilities are carefully aggregated to obtain the final pseudo-labels. Furthermore, to enhance the utilization of one-hot labels for the labeled data, we also incorporate text embeddings excavated from textual descriptions to co-supervise model training, enabling facial visual features to exhibit semantic correlations in the text space. Experiments on three datasets demonstrate that our method significantly outperforms current state-of-the-art SS-DFER methods and even exceeds fully supervised baselines. The code will be available at this https URL.

[CV-88] PALATE: Peculiar Application of the Law of Total Expectation to Enhance the Evaluation of Deep Generative Models

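[Quick read]: This paper addresses the challenge of evaluating deep generative models (DGMs), in particular how to measure the balance between fidelity, diversity, and novelty in generated samples. Existing evaluation methods have limitations; Feature Likelihood Divergence (FLD) is a promising solution but is computationally demanding. The paper proposes PALATE, which overcomes the limitations of existing metrics by applying the law of total expectation to random variables representing accessible real data. The key is to combine PALATE with a Maximum Mean Discrepancy (MMD) baseline metric and the DINOv2 feature extractor, yielding an efficient and scalable holistic evaluation framework that improves computational efficiency, scales to large datasets, and performs well at detecting sample memorization and evaluating generalization.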

Link: https://arxiv.org/abs/2503.18462
Authors: Tadeusz Dziarmaga,Marcin Kądziołka,Artur Kasymov,Marcin Mazur
Affiliations: Jagiellonian University
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Deep generative models (DGMs) have caused a paradigm shift in the field of machine learning, yielding noteworthy advancements in domains such as image synthesis, natural language processing, and other related areas. However, a comprehensive evaluation of these models that accounts for the trichotomy between fidelity, diversity, and novelty in generated samples remains a formidable challenge. A recently introduced solution that has emerged as a promising approach in this regard is the Feature Likelihood Divergence (FLD), a method that offers a theoretically motivated practical tool, yet also exhibits some computational challenges. In this paper, we propose PALATE, a novel enhancement to the evaluation of DGMs that addresses limitations of existing metrics. Our approach is based on a peculiar application of the law of total expectation to random variables representing accessible real data. When combined with the MMD baseline metric and DINOv2 feature extractor, PALATE offers a holistic evaluation framework that matches or surpasses state-of-the-art solutions while providing superior computational efficiency and scalability to large-scale datasets. Through a series of experiments, we demonstrate the effectiveness of the PALATE enhancement, contributing a computationally efficient, holistic evaluation approach that advances the field of DGMs assessment, especially in detecting sample memorization and evaluating generalization capabilities.
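The abstract names MMD as the baseline metric that PALATE is combined with. As a rough illustration only, the following sketch computes a biased estimate of squared MMD with a Gaussian kernel between two small sets of feature vectors. It is not the PALATE procedure itself; the kernel choice, the bandwidth `sigma`, and the function names are assumptions made for this example.

```python
import math


def rbf_kernel(x, y, sigma=1.0):
    """Gaussian (RBF) kernel between two feature vectors (kernel choice is an assumption)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def mmd_squared(xs, ys, sigma=1.0):
    """Biased (V-statistic) estimate of squared MMD between samples xs and ys."""

    def mean_kernel(a, b):
        # Average kernel value over all pairs drawn from a and b.
        return sum(rbf_kernel(u, v, sigma) for u in a for v in b) / (len(a) * len(b))

    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2.0 * mean_kernel(xs, ys)


# Hypothetical 2-D features standing in for extracted image features.
real_features = [[0.0, 0.0], [0.0, 1.0]]
generated_features = [[5.0, 5.0], [5.0, 6.0]]
gap = mmd_squared(real_features, generated_features)
```

In practice the feature vectors would come from a feature extractor such as DINOv2, as the abstract states; the toy vectors above only show the shape of the computation.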

[CV-89] MuMA: 3D PBR Texturing via Multi-Channel Multi-View Generation and Agentic Post-Processing

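[Quick read]: This paper addresses the challenges of physically based rendering (PBR) texturing in 3D generation, in particular the limitations caused by scarce data and the difficulty of modeling multi-channel materials. It proposes MuMA, which produces 3D PBR textures through multi-channel multi-view generation and agentic post-processing. Its key innovations are: first, modeling shaded and albedo appearance channels, where the shaded channels allow intrinsic decomposition modules to be integrated for material properties; second, using multimodal large language models to emulate artists' techniques for material assessment and selection. Experiments show that MuMA outperforms existing methods in visual quality and material fidelity.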

Link: https://arxiv.org/abs/2503.18461
Authors: Lingting Zhu,Jingrui Ye,Runze Zhang,Zeyu Hu,Yingda Yin,Lanjiong Li,Jinnan Chen,Shengju Qian,Xin Wang,Qingmin Liao,Lequan Yu
Affiliations: HKU; Tsinghua SIGS; LIGHTSPEED; HKUST(GZ); NUS
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 17 pages, 14 figures

Abstract:Current methods for 3D generation still fall short in physically based rendering (PBR) texturing, primarily due to limited data and challenges in modeling multi-channel materials. In this work, we propose MuMA, a method for 3D PBR texturing through Multi-channel Multi-view generation and Agentic post-processing. Our approach features two key innovations: 1) We opt to model shaded and albedo appearance channels, where the shaded channels enable the integration of intrinsic decomposition modules for material properties. 2) Leveraging multimodal large language models, we emulate artists’ techniques for material assessment and selection. Experiments demonstrate that MuMA achieves superior results in visual quality and material fidelity compared to existing methods.

[CV-90] Hiding Images in Diffusion Models by Editing Learned Score Functions

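[Quick read]: This paper addresses the largely unexplored potential of data hiding in diffusion models, and in particular the limits of existing methods in extraction accuracy, model fidelity, and hiding efficiency. These limits stem mainly from the entanglement of the hiding and extraction processes with multiple denoising diffusion steps. The paper proposes a simple yet effective approach that embeds images by editing the learned score functions at specific timesteps of the reverse diffusion process. It also introduces a parameter-efficient fine-tuning method that combines gradient-based parameter selection with low-rank adaptation to improve model fidelity and hiding efficiency. Experiments show that the method extracts high-quality images that are indistinguishable to humans, replicates the original model's behavior at both the sample and population levels, embeds images much faster than prior methods, and supports multi-recipient scenarios.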

Link: https://arxiv.org/abs/2503.18459
Authors: Haoyu Chen,Yunqiao Yang,Nan Zhong,Kede Ma
Affiliations: City University of Hong Kong
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Hiding data using neural networks (i.e., neural steganography) has achieved remarkable success across both discriminative classifiers and generative adversarial networks. However, the potential of data hiding in diffusion models remains relatively unexplored. Current methods exhibit limitations in achieving high extraction accuracy, model fidelity, and hiding efficiency due primarily to the entanglement of the hiding and extraction processes with multiple denoising diffusion steps. To address these, we describe a simple yet effective approach that embeds images at specific timesteps in the reverse diffusion process by editing the learned score functions. Additionally, we introduce a parameter-efficient fine-tuning method that combines gradient-based parameter selection with low-rank adaptation to enhance model fidelity and hiding efficiency. Comprehensive experiments demonstrate that our method extracts high-quality images at human-indistinguishable levels, replicates the original model behaviors at both sample and population levels, and embeds images orders of magnitude faster than prior methods. Besides, our method naturally supports multi-recipient scenarios through independent extraction channels.

[CV-91] InPO: Inversion Preference Optimization with Reparametrized DDIM for Efficient Diffusion Model Alignment CVPR2025

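[Quick read]: This paper addresses the alignment of text-to-image (T2I) diffusion models with human preferences. Existing methods suffer from low training efficiency and subpar generation quality because of the long Markov chain process and the intractability of the reverse process. The key solution is DDIM-InPO, which treats the diffusion model as a single-step generative model, assigns implicit rewards to latent variables through a reparameterization technique, and constructs an inversion technique to estimate the latent variables suited to preference optimization. The method fine-tunes only the outputs of latent variables that are strongly correlated with the preference data, which improves both training efficiency and generation quality and reaches state-of-the-art performance with only 400 fine-tuning steps.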

Link: https://arxiv.org/abs/2503.18454
Authors: Yunhong Lu,Qichao Wang,Hengyuan Cao,Xierui Wang,Xiaoyin Xu,Min Zhang
Affiliations: Zhejiang University; Shanghai Institute for Advanced Study-Zhejiang University
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Accepted by CVPR2025

Abstract:Without using explicit reward, direct preference optimization (DPO) employs paired human preference data to fine-tune generative models, a method that has garnered considerable attention in large language models (LLMs). However, exploration of aligning text-to-image (T2I) diffusion models with human preferences remains limited. In comparison to supervised fine-tuning, existing methods that align diffusion models suffer from low training efficiency and subpar generation quality due to the long Markov chain process and the intractability of the reverse process. To address these limitations, we introduce DDIM-InPO, an efficient method for direct preference alignment of diffusion models. Our approach conceptualizes the diffusion model as a single-step generative model, allowing us to fine-tune the outputs of specific latent variables selectively. In order to accomplish this objective, we first assign implicit rewards to any latent variable directly via a reparameterization technique. Then we construct an Inversion technique to estimate appropriate latent variables for preference optimization. This modification process enables the diffusion model to only fine-tune the outputs of latent variables that have a strong correlation with the preference dataset. Experimental results indicate that our DDIM-InPO achieves state-of-the-art performance with just 400 steps of fine-tuning, surpassing all preference aligning baselines for T2I diffusion models in human preference evaluation tasks.
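For readers unfamiliar with DPO, which the abstract builds on, here is a minimal sketch of the standard pairwise DPO loss for a single preference pair. It assumes scalar log-probabilities for the preferred and dispreferred samples under the trained model and a frozen reference model. It does not reproduce the DDIM-InPO objective; the function name and the default `beta` are illustrative.

```python
import math


def dpo_loss(logp_win, logp_lose, ref_logp_win, ref_logp_lose, beta=0.1):
    """Standard DPO loss for one preference pair: -log(sigmoid(margin))."""
    # Implicit reward of each sample is beta * (policy log-prob - reference log-prob).
    margin = beta * ((logp_win - ref_logp_win) - (logp_lose - ref_logp_lose))
    # Numerically stable evaluation of log(1 + exp(-margin)).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Hypothetical log-probabilities: the policy favors the preferred sample
# more strongly than the reference model does, so the loss drops below log(2).
loss = dpo_loss(logp_win=-1.0, logp_lose=-2.0, ref_logp_win=-1.5, ref_logp_lose=-1.5)
```

When the policy and the reference agree exactly, the margin is zero and the loss equals log 2; training pushes the margin up so the preferred sample gains implicit reward relative to the dispreferred one.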

[CV-92] Latent Space Super-Resolution for Higher-Resolution Image Generation with Diffusion Models CVPR2025

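[Quick read]: This paper addresses the difficulty existing diffusion models have in generating higher-resolution images (beyond 1K): when scaled past their training resolution they tend to produce structural distortions or repeated content. Reference-based methods mitigate this by upsampling a low-resolution reference to guide higher-resolution generation, but they face their own problems: upsampling in latent space often causes manifold deviation that degrades output quality, while upsampling in RGB space tends to produce overly smooth results. The key solution is a new framework, LSRNA, which combines Latent space Super-Resolution (LSR) for manifold alignment with Region-wise Noise Addition (RNA) to enhance high-frequency details. It outperforms state-of-the-art reference-based methods across resolutions and metrics and shows the critical role of latent space upsampling in preserving detail and sharpness. The code has been released.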

Link: https://arxiv.org/abs/2503.18446
Authors: Jinho Jeong,Sangmin Han,Jinwoo Kim,Seon Joo Kim
Affiliations: Yonsei University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by CVPR 2025

Abstract:In this paper, we propose LSRNA, a novel framework for higher-resolution (exceeding 1K) image generation using diffusion models by leveraging super-resolution directly in the latent space. Existing diffusion models struggle with scaling beyond their training resolutions, often leading to structural distortions or content repetition. Reference-based methods address the issues by upsampling a low-resolution reference to guide higher-resolution generation. However, they face significant challenges: upsampling in latent space often causes manifold deviation, which degrades output quality. On the other hand, upsampling in RGB space tends to produce overly smoothed outputs. To overcome these limitations, LSRNA combines Latent space Super-Resolution (LSR) for manifold alignment and Region-wise Noise Addition (RNA) to enhance high-frequency details. Our extensive experiments demonstrate that integrating LSRNA outperforms state-of-the-art reference-based methods across various resolutions and metrics, while showing the critical role of latent space upsampling in preserving detail and sharpness. The code is available at this https URL.

[CV-93] Benchmarking Multi-modal Semantic Segmentation under Sensor Failures: Missing and Noisy Modality Robustness

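[Quick read]: This paper addresses the insufficient robustness of multi-modal semantic segmentation (MMSS) in real-world deployment, which stems from variability and uncertainty in multi-modal data quality. Its key contribution is the first standardized benchmark dedicated to MMSS robustness. After surveying the literature and categorizing representative methods, the paper evaluates models under three scenarios (Entire-Missing Modality, Random-Missing Modality, Noisy Modality), models modality failure from a probabilistic standpoint, and proposes four metrics (mIoU^Avg_EMM, mIoU^E_EMM, mIoU^Avg_RMM, mIoU^E_RMM) to quantify robustness under the EMM and RMM scenarios. The benchmark provides systematic methods and tools that help close the gap between research and real-world use.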

Link: https://arxiv.org/abs/2503.18445
Authors: Chenfei Liao,Kaiyu Lei,Xu Zheng,Junha Moon,Zhixiong Wang,Yixuan Wang,Danda Pani Paudel,Luc Van Gool,Xuming Hu
Affiliations: HKUST(GZ); XJTU; INSAIT, Sofia University “St. Kliment Ohridski”; CSE, HKUST
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Multi-modal semantic segmentation (MMSS) addresses the limitations of single-modality data by integrating complementary information across modalities. Despite notable progress, a significant gap persists between research and real-world deployment due to variability and uncertainty in multi-modal data quality. Robustness has thus become essential for practical MMSS applications. However, the absence of standardized benchmarks for evaluating robustness hinders further advancement. To address this, we first survey existing MMSS literature and categorize representative methods to provide a structured overview. We then introduce a robustness benchmark that evaluates MMSS models under three scenarios: Entire-Missing Modality (EMM), Random-Missing Modality (RMM), and Noisy Modality (NM). From a probabilistic standpoint, we model modality failure under two conditions: (1) all damaged combinations are equally probable; (2) each modality fails independently following a Bernoulli distribution. Based on these, we propose four metrics, mIoU^Avg_EMM, mIoU^E_EMM, mIoU^Avg_RMM, and mIoU^E_RMM, to assess model robustness under EMM and RMM. This work provides the first dedicated benchmark for MMSS robustness, offering new insights and tools to advance the field. Source code is available at this https URL.
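The two failure models in the abstract (equally probable combinations versus independent Bernoulli failures) can be illustrated with a small sketch. The code below averages mIoU over modality subsets uniformly, and alternatively weights each subset by its probability under independent failures. Which subsets are counted, the renormalization that excludes the case where every modality fails, and all names and numbers are assumptions for illustration, not the benchmark's actual definitions.

```python
from itertools import combinations


def available_subsets(modalities):
    """Yield every non-empty subset of modalities that could remain available."""
    ordered = sorted(modalities)
    for size in range(1, len(ordered) + 1):
        for combo in combinations(ordered, size):
            yield frozenset(combo)


def average_miou(miou_by_subset, modalities):
    """Condition (1): every combination is treated as equally probable."""
    subsets = list(available_subsets(modalities))
    return sum(miou_by_subset[s] for s in subsets) / len(subsets)


def expected_miou(miou_by_subset, modalities, p_fail):
    """Condition (2): each modality fails independently with probability p_fail."""
    total, mass = 0.0, 0.0
    for subset in available_subsets(modalities):
        prob = 1.0
        for m in modalities:
            prob *= (1.0 - p_fail) if m in subset else p_fail
        total += prob * miou_by_subset[subset]
        mass += prob
    # Renormalize because the all-failed case is excluded (an assumption).
    return total / mass


# Hypothetical mIoU scores for a two-modality model.
mods = ["depth", "rgb"]
scores = {
    frozenset(["depth", "rgb"]): 60.0,
    frozenset(["rgb"]): 50.0,
    frozenset(["depth"]): 30.0,
}
avg_score = average_miou(scores, mods)
exp_score = expected_miou(scores, mods, p_fail=0.5)
```

With a failure probability of 0.5 every subset is equally likely, so the two quantities coincide in this toy setup; lower failure probabilities shift weight toward the full-modality score.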

[CV-94] ReconDreamer++: Harmonizing Generative and Reconstructive Models for Driving Scene Representation

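[Quick read]: This paper addresses the challenges of combining reconstruction models with generative models for closed-loop simulation in autonomous driving, in particular the domain gap between generated data and real sensor observations, which shows up as low fidelity for structured elements such as the ground surface. It proposes an enhanced framework, ReconDreamer++. The key is the Novel Trajectory Deformable Network (NTDNet), which uses learnable spatial deformation mechanisms to bridge the domain gap between synthesized novel views and the original sensor observations. For structured elements such as the ground surface, geometric prior knowledge is preserved in 3D Gaussians, and optimization refines appearance attributes while keeping the underlying geometry intact. Experiments on multiple datasets (Waymo, nuScenes, PandaSet, and EUVS) show that ReconDreamer++ significantly outperforms ReconDreamer. On Waymo in particular, on novel trajectories it improves NTA-IoU by 6.1%, FID by 23.0%, and NTL-IoU by 4.5%, confirming its effectiveness in accurately reconstructing structured elements.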

Link: https://arxiv.org/abs/2503.18438
Authors: Guosheng Zhao,Xiaofeng Wang,Chaojun Ni,Zheng Zhu,Wenkang Qin,Guan Huang,Xingang Wang
Affiliations: GigaAI; Institute of Automation, Chinese Academy of Sciences; Peking University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project Page: this https URL

Abstract:Combining reconstruction models with generative models has emerged as a promising paradigm for closed-loop simulation in autonomous driving. For example, ReconDreamer has demonstrated remarkable success in rendering large-scale maneuvers. However, a significant gap remains between the generated data and real-world sensor observations, particularly in terms of fidelity for structured elements, such as the ground surface. To address these challenges, we propose ReconDreamer++, an enhanced framework that significantly improves the overall rendering quality by mitigating the domain gap and refining the representation of the ground surface. Specifically, ReconDreamer++ introduces the Novel Trajectory Deformable Network (NTDNet), which leverages learnable spatial deformation mechanisms to bridge the domain gap between synthesized novel views and original sensor observations. Moreover, for structured elements such as the ground surface, we preserve geometric prior knowledge in 3D Gaussians, and the optimization process focuses on refining appearance attributes while preserving the underlying geometric structure. Experimental evaluations conducted on multiple datasets (Waymo, nuScenes, PandaSet, and EUVS) confirm the superior performance of ReconDreamer++. Specifically, on Waymo, ReconDreamer++ achieves performance comparable to Street Gaussians for the original trajectory while significantly outperforming ReconDreamer on novel trajectories. In particular, it achieves substantial improvements, including a 6.1% increase in NTA-IoU, a 23.0% improvement in FID, and a remarkable 4.5% gain in the ground surface metric NTL-IoU, highlighting its effectiveness in accurately reconstructing structured elements such as the road surface.

[CV-95] A Simple yet Effective Layout Token in Large Language Models for Document Understanding CVPR2025

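[Quick read]: This paper addresses two main problems of existing methods that integrate spatial layout with text for document understanding in large language models (LLMs): first, they need additional position IDs to represent layout information, which reduces the position IDs available for text and so limits how much the model can learn from text content; second, during long-context inference they introduce many potentially untrained position IDs, which hurts document understanding performance. The paper proposes a simple yet effective solution, LayTokenLLM. Its key idea is to represent the layout information of each text segment as a single token and to use a specialized positional encoding scheme that shares position IDs between text and layout tokens, so no additional position IDs are needed. This design preserves the model's capacity to learn from text and mitigates the long-context issues. The paper also proposes a new pre-training objective, Next Interleaved Text and Layout Token Prediction (NTLP), to strengthen cross-modality learning between text and layout tokens.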

Link: https://arxiv.org/abs/2503.18434
Authors: Zhaoqing Zhu,Chuwei Luo,Zirui Shao,Feiyu Gao,Hangdi Xing,Qi Zheng,Ji Zhang
Affiliations: Alibaba Group; Zhejiang University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: CVPR 2025

Abstract:Recent methods that integrate spatial layouts with text for document understanding in large language models (LLMs) have shown promising results. A commonly used method is to represent layout information as text tokens and interleave them with text content as inputs to the LLMs. However, such a method still demonstrates limitations, as it requires additional position IDs for tokens that are used to represent layout information. Due to the constraint on max position IDs, assigning them to layout information reduces those available for text content, reducing the capacity for the model to learn from the text during training, while also introducing a large number of potentially untrained position IDs during long-context inference, which can hinder performance on document understanding tasks. To address these issues, we propose LayTokenLLM, a simple yet effective method for document understanding. LayTokenLLM represents layout information as a single token per text segment and uses a specialized positional encoding scheme. It shares position IDs between text and layout tokens, eliminating the need for additional position IDs. This design maintains the model’s capacity to learn from text while mitigating long-context issues during inference. Furthermore, a novel pre-training objective called Next Interleaved Text and Layout Token Prediction (NTLP) is devised to enhance cross-modality learning between text and layout tokens. Extensive experiments show that LayTokenLLM outperforms existing layout-integrated LLMs and MLLMs of similar scales on multi-page document understanding tasks, as well as most single-page tasks.

[CV-96] CQ-DINO: Mitigating Gradient Dilution via Category Queries for Vast Vocabulary Object Detection

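[Quick read]: This paper addresses the difficulty traditional object detection methods have with vast vocabulary detection, focusing on two key limitations of classification-based detectors: positive gradient dilution, where rare positive categories receive insufficient learning signal, and hard negative gradient dilution, where discriminative gradients are overwhelmed by numerous easy negatives. It proposes CQ-DINO, a category query-based detection framework that reformulates classification as a contrastive task between object queries and learnable category queries. The core of the solution is image-guided query selection, which adaptively retrieves the top-K relevant categories per image via cross-attention, thereby shrinking the negative space, rebalancing gradient distributions, and implicitly mining hard examples. CQ-DINO can also integrate explicit hierarchical category relationships in structured datasets or learn implicit category correlations via self-attention in generic datasets. Experiments show that CQ-DINO surpasses previous methods by 2.1% AP on the V3Det benchmark while remaining competitive on COCO.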

Link: https://arxiv.org/abs/2503.18430
Authors: Zhichao Sun,Huazhang Hu,Yidong Ma,Gang Liu,Nemo Chen,Xu Tang,Yongchao Xu
Affiliations: School of Computer Science, Wuhan University; Xiaohongshu Inc.
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:With the exponential growth of data, traditional object detection methods are increasingly struggling to handle vast vocabulary object detection tasks effectively. We analyze two key limitations of classification-based detectors: positive gradient dilution, where rare positive categories receive insufficient learning signals, and hard negative gradient dilution, where discriminative gradients are overwhelmed by numerous easy negatives. To address these challenges, we propose CQ-DINO, a category query-based object detection framework that reformulates classification as a contrastive task between object queries and learnable category queries. Our method introduces image-guided query selection, which reduces the negative space by adaptively retrieving top-K relevant categories per image via cross-attention, thereby rebalancing gradient distributions and facilitating implicit hard example mining. Furthermore, CQ-DINO flexibly integrates explicit hierarchical category relationships in structured datasets (e.g., V3Det) or learns implicit category correlations via self-attention in generic datasets (e.g., COCO). Experiments demonstrate that CQ-DINO achieves superior performance on the challenging V3Det benchmark (surpassing previous methods by 2.1% AP) while maintaining competitiveness on COCO. Our work provides a scalable solution for real-world detection systems requiring wide category coverage. The dataset and code will be publicly available at this https URL.
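The query-selection idea in the abstract can be illustrated with a toy sketch: keep only the top-K categories by an image-level relevance score, then score an object query against the selected category queries alone. In the paper the relevance comes from cross-attention; here it is simply passed in as a list, and the dot-product scoring, names, and numbers are all assumptions for illustration.

```python
def select_top_k_categories(relevance, k):
    """Return indices of the k categories with the highest image-level relevance."""
    ranked = sorted(range(len(relevance)), key=lambda i: relevance[i], reverse=True)
    return ranked[:k]


def classify_query(object_query, category_queries, selected):
    """Score one object query against the selected category queries only."""
    scores = {
        c: sum(a * b for a, b in zip(object_query, category_queries[c]))
        for c in selected
    }
    best = max(scores, key=scores.get)
    return best, scores


# Hypothetical vocabulary of four categories with 2-D query embeddings.
category_queries = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
relevance = [0.1, 0.9, 0.3, 0.8]  # stand-in for cross-attention relevance
selected = select_top_k_categories(relevance, k=2)
best_category, scores = classify_query([0.2, 1.0], category_queries, selected)
```

Because only K categories take part in scoring, each object query is contrasted against a small negative set rather than the full vocabulary, which is the intuition the abstract gives for rebalancing gradients.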

[CV-97] Teller: Real-Time Streaming Audio-Driven Portrait Animation with Autoregressive Motion Generation CVPR2025

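[Quick read]: This paper addresses real-time, audio-driven talking head generation, in particular how to keep the movement of the face and diverse body parts natural. Conventional methods struggle with long animation times and with preserving realistic motion. The paper proposes the Teller framework, whose key is to combine an autoregressive mechanism with an Efficient Temporal Module (ETM). Facial Motion Latent Generation (FMLG) maps facial motion latents into discrete motion tokens and slices them temporally with audio embeddings to learn a real-time mapping from audio to motion; ETM captures finer motion details and keeps body parts and accessories physically consistent, improving overall realism. Teller also clearly outperforms diffusion-based methods in inference speed and real-time performance.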

Link: https://arxiv.org/abs/2503.18429
Authors: Dingcheng Zhen,Shunshun Yin,Shiyang Qin,Hou Yi,Ziwei Zhang,Siyuan Liu,Gan Qi,Ming Tao
Affiliations: Shanghai Soulgate Technology Co., Ltd.
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at CVPR 2025

Abstract:In this work, we introduce the first autoregressive framework for real-time, audio-driven portrait animation, a.k.a. talking head. Beyond the challenge of lengthy animation times, a critical challenge in realistic talking head generation lies in preserving the natural movement of diverse body parts. To this end, we propose Teller, the first streaming audio-driven portrait animation framework with autoregressive motion generation. Specifically, Teller first decomposes facial and body detail animation into two components: Facial Motion Latent Generation (FMLG) based on an autoregressive transformer, and movement authenticity refinement using an Efficient Temporal Module (ETM). Concretely, FMLG employs a Residual VQ model to map the facial motion latent from the implicit keypoint-based model into discrete motion tokens, which are then temporally sliced with audio embeddings. This enables the AR transformer to learn real-time, stream-based mappings from audio to motion. Furthermore, Teller incorporates ETM to capture finer motion details. This module ensures the physical consistency of body parts and accessories, such as neck muscles and earrings, improving the realism of these movements. Teller is designed to be efficient, surpassing the inference speed of diffusion-based models (Hallo 20.93s vs. Teller 0.92s for one second video generation), and achieves a real-time streaming performance of up to 25 FPS. Extensive experiments demonstrate that our method outperforms recent audio-driven portrait animation models, especially in small movements, as validated by human evaluations with a significant margin in quality and realism.
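The abstract says FMLG uses a Residual VQ model to turn motion latents into discrete tokens. The sketch below shows generic residual vector quantization: each stage picks the nearest code for what the previous stages left unexplained. It is a textbook-style illustration with hand-written codebooks, not Teller's trained model; the codebook values and function names are hypothetical.

```python
def nearest_code(vector, codebook):
    """Index of the codebook entry closest to vector in squared Euclidean distance."""
    return min(
        range(len(codebook)),
        key=lambda i: sum((v - c) ** 2 for v, c in zip(vector, codebook[i])),
    )


def residual_vq_encode(vector, codebooks):
    """Quantize vector stage by stage; each stage encodes the remaining residual."""
    residual = list(vector)
    tokens = []
    for codebook in codebooks:
        idx = nearest_code(residual, codebook)
        tokens.append(idx)
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return tokens, residual


def residual_vq_decode(tokens, codebooks):
    """Reconstruct the vector by summing the chosen code from every stage."""
    out = [0.0] * len(codebooks[0][0])
    for idx, codebook in zip(tokens, codebooks):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out


# Hypothetical two-stage codebooks: a coarse stage followed by a fine stage.
codebooks = [
    [[0.0, 0.0], [1.0, 1.0]],
    [[0.0, 0.0], [0.25, -0.25]],
]
tokens, leftover = residual_vq_encode([1.2, 0.8], codebooks)
reconstruction = residual_vq_decode(tokens, codebooks)
```

The resulting token list is the kind of discrete sequence an autoregressive transformer can predict one step at a time, which is why this quantization step pairs naturally with streaming generation.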

[CV-98] Breaking the Encoder Barrier for Seamless Video-Language Understanding

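[Quick read]: This paper addresses the high computational cost, resolution biases, and difficulty in capturing fine-grained multimodal interactions of existing Video-Large Language Models (Video-LLMs) built on an encoder-decoder framework. It proposes ELVA, an encoder-free Video-LLM that directly models nuanced video-language interactions: it uses token merging to build a bottom-up hierarchical representation and adds a video guidance supervisor for direct spatiotemporal representation learning. A hybrid-resolution mechanism strategically combines high- and low-resolution frames as input to balance performance and efficiency. With only 7M publicly available video-text pairs, ELVA matches encoder-based Video-LLMs while reducing FLOPs by up to 95% and inference latency by 92%, offering a scalable and efficient approach to real-time video understanding.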

Link: https://arxiv.org/abs/2503.18422
Authors: Handong Li,Yiyuan Zhang,Longteng Guo,Xiangyu Yue,Jing Liu
Affiliations: School of Artificial Intelligence, University of Chinese Academy of Sciences; MMLab, CUHK; Institute of Automation, Chinese Academy of Sciences; Shanghai AI Lab
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 12 pages

Abstract:Most Video-Large Language Models (Video-LLMs) adopt an encoder-decoder framework, where a vision encoder extracts frame-wise features for processing by a language model. However, this approach incurs high computational costs, introduces resolution biases, and struggles to capture fine-grained multimodal interactions. To overcome these limitations, we propose ELVA, an encoder-free Video-LLM that directly models nuanced video-language interactions without relying on a vision encoder. ELVA employs token merging to construct a bottom-up hierarchical representation and incorporates a video guidance supervisor for direct spatiotemporal representation learning. Additionally, a hybrid-resolution mechanism strategically integrates high- and low-resolution frames as inputs to achieve an optimal balance between performance and efficiency. With only 7M publicly available video-text pairs, ELVA achieves performance on par with encoder-based Video-LLMs while reducing FLOPs by up to 95% and inference latency by 92%, offering a scalable and efficient solution for real-time video understanding.

[CV-99] 4DGC: Rate-Aware 4D Gaussian Compression for Efficient Streamable Free-Viewpoint Video CVPR2025

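[Quick read]: This paper addresses the storage and transmission challenges of representing and compressing Free-Viewpoint Video (FVV) with 3D Gaussian Splatting (3DGS). Existing methods usually handle dynamic 3DGS representation and compression separately, ignoring motion information and the rate-distortion (RD) trade-off, which degrades performance and increases model redundancy. To address this, the paper proposes 4DGC, a novel rate-aware 4D Gaussian compression framework. Its key component is a motion-aware dynamic Gaussian representation that combines a compact motion grid with sparse compensated Gaussians, exploiting inter-frame similarity to reduce storage overhead. It also designs an end-to-end compression scheme that uses differentiable quantization and a tiny implicit entropy model to compress the motion grid and compensated Gaussians efficiently, and trains the whole framework jointly through RD optimization, significantly reducing storage size and improving RD performance while maintaining high quality.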

Link: https://arxiv.org/abs/2503.18421
Authors: Qiang Hu,Zihan Zheng,Houqiang Zhong,Sihua Fu,Li Song,Xiaoyun Zhang,Guangtao Zhai,Yanfeng Wang
Affiliations: Cooperative Medianet Innovation Center, Shanghai Jiao Tong University (上海交通大学); School of Electronic Information and Electrical Engineering, Shanghai Jiao Tong University (上海交通大学); School of Artificial Intelligence, Shanghai Jiao Tong University (上海交通大学)
Subjects: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments: CVPR2025

Abstract:3D Gaussian Splatting (3DGS) has substantial potential for enabling photorealistic Free-Viewpoint Video (FVV) experiences. However, the vast number of Gaussians and their associated attributes poses significant challenges for storage and transmission. Existing methods typically handle dynamic 3DGS representation and compression separately, neglecting motion information and the rate-distortion (RD) trade-off during training, leading to performance degradation and increased model redundancy. To address this gap, we propose 4DGC, a novel rate-aware 4D Gaussian compression framework that significantly reduces storage size while maintaining superior RD performance for FVV. Specifically, 4DGC introduces a motion-aware dynamic Gaussian representation that utilizes a compact motion grid combined with sparse compensated Gaussians to exploit inter-frame similarities. This representation effectively handles large motions, preserving quality and reducing temporal redundancy. Furthermore, we present an end-to-end compression scheme that employs differentiable quantization and a tiny implicit entropy model to compress the motion grid and compensated Gaussians efficiently. The entire framework is jointly optimized using a rate-distortion trade-off. Extensive experiments demonstrate that 4DGC supports variable bitrates and consistently outperforms existing methods in RD performance across multiple datasets.
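
The abstract describes joint optimization under a rate-distortion trade-off. As a rough illustration of that objective, the sketch below evaluates a Lagrangian of the form distortion + lambda * rate for a list of scalars. It assumes hard rounding and an empirical-entropy rate estimate; 4DGC itself is described as using differentiable quantization and a learned implicit entropy model, which this toy does not reproduce. All names and values are hypothetical.

```python
import math
from collections import Counter

def quantize(values, step):
    """Uniform scalar quantization: map each value to an integer symbol."""
    return [round(v / step) for v in values]

def rate_bits(symbols):
    """Empirical entropy of the symbols, in bits per symbol."""
    counts = Counter(symbols)
    n = len(symbols)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def distortion(values, symbols, step):
    """Mean squared error between the values and their dequantized versions."""
    return sum((v - s * step) ** 2 for v, s in zip(values, symbols)) / len(values)

def rd_loss(values, step, lam):
    """Rate-distortion Lagrangian: distortion + lam * rate."""
    symbols = quantize(values, step)
    return distortion(values, symbols, step) + lam * rate_bits(symbols)

values = [0.1, 0.4, 0.9, 1.1, 1.6, 2.0]   # hypothetical attribute values
fine = quantize(values, 0.5)              # small step: low distortion, more bits
coarse = quantize(values, 2.0)            # large step: fewer bits, more distortion
```

Sweeping `lam` moves the operating point along the rate-distortion curve, which is one common way a single framework can support variable bitrates.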

[CV-100] Panorama Generation From NFoV Image Done Right CVPR2025

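[Quick read]: This paper addresses the inadequacy of existing methods for evaluating panorama distortion when generating panoramas from narrow field of view (NFoV) images. Specifically, existing metrics based on InceptionNet or CLIP tend to capture perceptual image quality and are unsuitable for evaluating distortion. The paper first proposes Distort-CLIP, a distortion-specific CLIP model for accurately evaluating panorama distortion, and uses it to reveal a "visual cheating" phenomenon in previous work: improving visual results at the expense of distortion accuracy. To address this phenomenon, the paper proposes PanoDecouple, a decoupled diffusion model framework that splits panorama generation into two separate parts, distortion guidance and content completion, aiming for both accurate distortion and visual appeal. The key is a DistortNet for distortion guidance, which introduces a panorama-specific distortion prior and a modified condition registration mechanism, together with a ContentNet for content completion, which uses perspective image information. A distortion correction loss based on Distort-CLIP is also introduced to constrain distortion explicitly. Experiments show that PanoDecouple outperforms existing methods on both distortion and visual metrics.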

Link: https://arxiv.org/abs/2503.18420
Authors: Dian Zheng,Cheng Zhang,Xiao-Ming Wu,Cao Li,Chengfei Lv,Jian-Fang Hu,Wei-Shi Zheng
Affiliations: Sun Yat-sen University (中山大学); Monash University (蒙纳士大学); Alibaba Group (阿里巴巴集团); Key Laboratory of Machine Intelligence and Advanced Computing, Ministry of Education, China (教育部机器智能与先进计算重点实验室, 中国)
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: CVPR2025. Project page: this https URL Code: this https URL

Abstract:Generating 360-degree panoramas from a narrow field of view (NFoV) image is a promising computer vision task for Virtual Reality (VR) applications. Existing methods mostly assess the generated panoramas with InceptionNet or CLIP based metrics, which tend to perceive the image quality and are not suitable for evaluating the distortion. In this work, we first propose a distortion-specific CLIP, named Distort-CLIP, to accurately evaluate the panorama distortion and discover the "visual cheating" phenomenon in previous works (i.e., tending to improve the visual results by sacrificing distortion accuracy). This phenomenon arises because prior methods employ a single network to learn the distinct panorama distortion and content completion at once, which leads the model to prioritize optimizing the latter. To address the phenomenon, we propose PanoDecouple, a decoupled diffusion model framework, which decouples the panorama generation into distortion guidance and content completion, aiming to generate panoramas with both accurate distortion and visual appeal. Specifically, we design a DistortNet for distortion guidance by imposing panorama-specific distortion prior and a modified condition registration mechanism; and a ContentNet for content completion by imposing perspective image information. Additionally, a distortion correction loss function with Distort-CLIP is introduced to constrain the distortion explicitly. The extensive experiments validate that PanoDecouple surpasses existing methods both in distortion and visual metrics.

[CV-101] U-REPA: Aligning Diffusion U-Nets to ViTs

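[Quick read]: This paper addresses how to adapt Representation Alignment (REPA) to the canonical diffusion U-Net architecture to further improve its convergence speed and generation quality. The key solutions are: (1) the observation that, because of skip connections, the middle stage of the U-Net is the best place for alignment; (2) upsampling U-Net features after passing them through multilayer perceptrons (MLPs); and (3) a manifold loss that regularizes the relative similarity between samples, introduced because token-wise alignment is difficult. Experiments show that the proposed U-REPA significantly accelerates convergence and achieves higher generation quality, reaching an FID of 1.5 on ImageNet 256 × 256 and needing only half the total epochs to outperform REPA.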

Link: https://arxiv.org/abs/2503.18414
Authors: Yuchuan Tian,Hanting Chen,Mengyu Zheng,Yuchen Liang,Chao Xu,Yunhe Wang
Affiliations: State Key Lab of General AI, School of Intelligence Science and Technology, Peking University (北京大学); Huawei Noah’s Ark Lab. (华为诺亚方舟实验室); The University of Sydney (悉尼大学); School of Mathematical Sciences, Peking University (北京大学)
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 15 pages, 7 figures

Abstract:Representation Alignment (REPA), which aligns Diffusion Transformer (DiT) hidden states with ViT visual encoders, has proven highly effective in DiT training, demonstrating superior convergence properties, but it has not been validated on the canonical diffusion U-Net architecture that shows faster convergence compared to DiTs. However, adapting REPA to U-Net architectures presents unique challenges: (1) different block functionalities necessitate revised alignment strategies; (2) spatial-dimension inconsistencies emerge from U-Net’s spatial downsampling operations; (3) space gaps between U-Net and ViT hinder the effectiveness of tokenwise alignment. To counter these challenges, we propose U-REPA, a representation alignment paradigm that bridges U-Net hidden states and ViT features as follows: Firstly, we propose via observation that due to skip connection, the middle stage of U-Net is the best alignment option. Secondly, we propose upsampling of U-Net features after passing them through MLPs. Thirdly, we observe difficulty when performing tokenwise similarity alignment, and further introduce a manifold loss that regularizes the relative similarity between samples. Experiments indicate that the resulting U-REPA achieves excellent generation quality and greatly accelerates convergence. With CFG guidance interval, U-REPA reaches an FID of 1.5 in 200 epochs or 1M iterations on ImageNet 256 × 256, and needs only half the total epochs to perform better than REPA. Codes are available at this https URL.
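
The abstract mentions a manifold loss that regularizes the relative similarity between samples rather than matching tokens one by one. The exact formulation is not given here, so the sketch below assumes one plausible form: compare the pairwise cosine-similarity matrices of the two feature sets with a mean squared error. The function names and toy features are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def similarity_matrix(features):
    """Pairwise cosine similarity between all samples in a batch."""
    return [[cosine(a, b) for b in features] for a in features]

def manifold_loss(unet_feats, vit_feats):
    """Mean squared gap between the two batches' similarity matrices (assumed form)."""
    s_u = similarity_matrix(unet_feats)
    s_v = similarity_matrix(vit_feats)
    n = len(s_u)
    return sum((s_u[i][j] - s_v[i][j]) ** 2
               for i in range(n) for j in range(n)) / (n * n)

vit = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]          # toy ViT features
unet_aligned = [[2.0, 0.0], [0.0, 2.0], [2.0, 2.0]]  # same geometry, different scale
unet_collapsed = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
```

Because only similarity matrices are compared, the two feature spaces may differ in scale or even dimensionality, which is why a loss of this kind can sidestep a gap between the spaces that hurts direct token-wise matching.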

[CV-102] Fast and Physically-based Neural Explicit Surface for Relightable Human Avatars

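[Quick read]: This paper addresses efficient modeling of relightable human avatars from sparse-view videos; current methods are costly because volume rendering requires dense sampling. To overcome this, the paper proposes Physically-based Neural Explicit Surface (PhyNES), whose key idea is to use compact neural material maps: it connects signed distance fields to explicit surfaces for efficient geometry inference and models dynamic geometry, texture, and material maps as 2D neural representations, enabling efficient rasterization and real-time physically-based rendering. Experiments show that PhyNES achieves relighting quality comparable to state-of-the-art methods while significantly improving rendering speed, memory efficiency, and reconstruction quality.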

Link: https://arxiv.org/abs/2503.18408
Authors: Jiacheng Wu,Ruiqi Zhang,Jie Chen,Hui Zhang
Affiliations: Department of Computer Science, Hong Kong Baptist University (香港浸会大学); Department of Computer Science, Beijing Normal-Hong Kong Baptist University (北京师范大学-香港浸会大学联合国际学院), Guangdong, China
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Efficiently modeling relightable human avatars from sparse-view videos is crucial for AR/VR applications. Current methods use neural implicit representations to capture dynamic geometry and reflectance, which incur high costs due to the need for dense sampling in volume rendering. To overcome these challenges, we introduce Physically-based Neural Explicit Surface (PhyNES), which employs compact neural material maps based on the Neural Explicit Surface (NES) representation. PhyNES organizes human models in a compact 2D space, enhancing material disentanglement efficiency. By connecting Signed Distance Fields to explicit surfaces, PhyNES enables efficient geometry inference around a parameterized human shape model. This approach models dynamic geometry, texture, and material maps as 2D neural representations, enabling efficient rasterization. PhyNES effectively captures physical surface attributes under varying illumination, enabling real-time physically-based rendering. Experiments show that PhyNES achieves relighting quality comparable to SOTA methods while significantly improving rendering speed, memory efficiency, and reconstruction quality.

[CV-103] VTD-CLIP: Video-to-Text Discretization via Prompting CLIP

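[Quick read]: This paper addresses the limited interpretability and poor generalization that existing vision-language models show in video recognition because of inadequate temporal modeling. The key to the solution is a simple yet effective video-to-text discretization framework. The method uses the frozen text encoder to build a visual codebook from video class labels, relying on the many-to-one contrastive alignment between visual and textual embeddings established in multimodal pretraining. The codebook converts temporal visual data into textual tokens through feature lookups and provides interpretable video representations through explicit video modeling. In addition, a confidence-aware fusion module dynamically weights keyframes by assessing the semantic relevance of each frame, improving robustness to irrelevant or noisy frames. The method also incorporates learnable text prompts for adaptive codebook updates. Experiments on multiple datasets confirm the advantage of the approach.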

Link: https://arxiv.org/abs/2503.18407
Authors: Wencheng Zhu,Yuexin Wang,Hongxuan Li,Pengfei Zhu,Danqing Song,Qinghua Hu
Affiliations: Tianjin University (天津大学); South China University of Technology (华南理工大学)
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Vision-language models bridge visual and linguistic understanding and have proven to be powerful for video recognition tasks. Existing approaches primarily rely on parameter-efficient fine-tuning of image-text pre-trained models, yet they often suffer from limited interpretability and poor generalization due to inadequate temporal modeling. To address these, we propose a simple yet effective video-to-text discretization framework. Our method repurposes the frozen text encoder to construct a visual codebook from video class labels due to the many-to-one contrastive alignment between visual and textual embeddings in multimodal pretraining. This codebook effectively transforms temporal visual data into textual tokens via feature lookups and offers interpretable video representations through explicit video modeling. Then, to enhance robustness against irrelevant or noisy frames, we introduce a confidence-aware fusion module that dynamically weights keyframes by assessing their semantic relevance via the codebook. Furthermore, our method incorporates learnable text prompts to conduct adaptive codebook updates. Extensive experiments on HMDB-51, UCF-101, SSv2, and Kinetics-400 have validated the superiority of our approach, achieving more competitive improvements over state-of-the-art methods. The code will be publicly available at this https URL.
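
The abstract describes two steps: a codebook lookup that maps each frame to a textual token, and a confidence-aware fusion that down-weights irrelevant frames. The sketch below is one hedged reading of that pipeline. It assumes the codebook is a dictionary of class-label embeddings, that confidence is the maximum softmax probability over cosine similarities, and that fusion is a confidence-weighted average; none of these details are confirmed by the source, and all names and values are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def softmax(xs):
    """Numerically stable softmax."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def discretize_video(frame_feats, codebook, temperature=0.1):
    """Map each frame to its nearest text code, then fuse frames by confidence."""
    labels = list(codebook)
    tokens, confidences = [], []
    for feat in frame_feats:
        probs = softmax([cosine(feat, codebook[l]) / temperature for l in labels])
        best = max(range(len(labels)), key=lambda i: probs[i])
        tokens.append(labels[best])        # discrete textual token for this frame
        confidences.append(probs[best])    # assumed confidence measure
    total = sum(confidences)
    weights = [c / total for c in confidences]
    dim = len(frame_feats[0])
    fused = [sum(w * f[d] for w, f in zip(weights, frame_feats)) for d in range(dim)]
    return tokens, weights, fused

codebook = {"run": [1.0, 0.0], "jump": [0.0, 1.0]}   # toy text embeddings
frames = [[0.9, 0.1], [0.8, 0.2], [0.4, 0.6]]        # toy frame features
tokens, weights, fused = discretize_video(frames, codebook)
```

The per-frame tokens make the video representation readable as a sequence of labels, and the weights show which frames the fusion step trusted most.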

[CV-104] Instruct-CLIP: Improving Instruction-Guided Image Editing with Automated Data Refinement Using Contrastive Learning

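[Quick read]: This paper addresses the difficulty deep-learning models have in producing high-quality results for image editing guided by natural language instructions, which stems from the insufficient quality of existing training datasets. Previous methods typically rely on text-to-image (T2I) generative models to create pairs of original and edited images, but these pairs often fail to match the specified edit instructions, which hurts model performance. The key contribution is Instruct-CLIP, a self-supervised method that learns the semantic changes between original and edited images in order to refine and better align the instructions in existing datasets. Instruct-CLIP is further adapted to handle noisy latent images and diffusion timesteps, so that it can be used to train latent diffusion models (LDMs) and efficiently enforce alignment between the edit instruction and the image change in latent space at any step of the diffusion pipeline. By using Instruct-CLIP to correct the InstructPix2Pix dataset and obtain over 120K refined samples, the paper shows that the method improves the consistency between generated edits and instructions.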

Link: https://arxiv.org/abs/2503.18406
Authors: Sherry X. Chen,Misha Sra,Pradeep Sen
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Computer Vision and Pattern Recognition 2025

Abstract:Although natural language instructions offer an intuitive way to guide automated image editing, deep-learning models often struggle to achieve high-quality results, largely due to challenges in creating large, high-quality training datasets. Previous work has typically relied on text-to-image (T2I) generative models to produce pairs of original and edited images that simulate the input/output of an instruction-guided image-editing model. However, these image pairs often fail to align with the specified edit instructions due to the limitations of T2I models, which negatively impacts models trained on such datasets. To address this, we present Instruct-CLIP, a self-supervised method that learns the semantic changes between original and edited images to refine and better align the instructions in existing datasets. Furthermore, we adapt Instruct-CLIP to handle noisy latent images and diffusion timesteps so that it can be used to train latent diffusion models (LDMs) [19] and efficiently enforce alignment between the edit instruction and the image changes in latent space at any step of the diffusion pipeline. We use Instruct-CLIP to correct the InstructPix2Pix dataset and get over 120K refined samples, which we then use to fine-tune their model, guided by our novel Instruct-CLIP-based loss function. The resulting model can produce edits that are more aligned with the given instructions. Our code and dataset are available at this https URL.

[CV-105] Offline Meteorology-Pollution Coupling Global Air Pollution Forecasting Model with Bilinear Pooling

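[Quick read]: This paper addresses two limitations: the high computational demands that limit the efficiency of traditional physics-based models in real-time air pollution forecasting, and the substantial training resources required by the online coupling strategies of existing deep learning (DL) methods. The key innovation is a DL-based offline coupling framework that uses bilinear pooling to couple meteorological fields and pollutants offline. The model needs only 13% of the parameters of online coupling models yet achieves competitive forecasting performance, outperforming CAMS, the state-of-the-art global air pollution forecasting model, on 85% of variables for forecasts beyond 48 hours. The study also validates the effectiveness of offline-coupled meteorological fields, which yield a 15% relative reduction in root mean square error (RMSE) across all pollution variables, establishing a new paradigm for real-time global air pollution warning systems and providing key technical support for more efficient and comprehensive AI-powered global atmospheric forecasting frameworks.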

Link: https://arxiv.org/abs/2503.18405
Authors: Xu Fan,Yuetan Lin,Bing Gong,Hao Li
Affiliations: Shanghai Academy of AI for Science (上海科学智能研究院); Shanghai Normal University (上海师范大学); Fudan University (复旦大学)
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract:Air pollution has become a major threat to human health, making accurate forecasting crucial for pollution control. Traditional physics-based models forecast global air pollution by coupling meteorology and pollution processes, using either online or offline methods depending on whether they are fully integrated with meteorological models and run simultaneously. However, the high computational demands of both methods severely limit real-time prediction efficiency. Existing deep learning (DL) solutions employ online coupling strategies for global air pollution forecasting, which finetune pollution forecasting based on pretrained atmospheric models, requiring substantial training resources. This study pioneers a DL-based offline coupling framework that utilizes bilinear pooling to achieve offline coupling between meteorological fields and pollutants. The proposed model requires only 13% of the parameters of DL-based online coupling models while achieving competitive performance. Compared with the state-of-the-art global air pollution forecasting model CAMS, our approach demonstrates superiority on 63% of variables across all forecast time steps and 85% of variables in predictions exceeding 48 hours. This work pioneers experimental validation of the effectiveness of meteorological fields in DL-based global air pollution forecasting, demonstrating that offline coupling meteorological fields with pollutants can achieve a 15% relative reduction in RMSE across all pollution variables. The research establishes a new paradigm for real-time global air pollution warning systems and delivers critical technical support for developing more efficient and comprehensive AI-powered global atmospheric forecasting frameworks.
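
The abstract names bilinear pooling as the mechanism that couples meteorological fields with pollutants. In its basic form, bilinear pooling takes the outer product of two feature vectors so that every pairwise interaction gets its own entry. The sketch below shows only that basic operation on toy vectors; how the paper applies it to gridded fields is not stated here, and the feature values are hypothetical.

```python
def bilinear_pool(met, pol):
    """Outer product of two feature vectors, flattened row by row."""
    return [m * p for m in met for p in pol]

met = [1.0, 2.0]         # hypothetical meteorological features (e.g. wind, humidity)
pol = [0.5, 1.0, 1.5]    # hypothetical pollutant features
fused = bilinear_pool(met, pol)
```

The fused vector has `len(met) * len(pol)` entries, so each meteorological feature can modulate each pollutant feature multiplicatively, which simple concatenation cannot express.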

[CV-106] Knowledge Graph Enhanced Generative Multi-modal Models for Class-Incremental Learning

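[Quick read]: This paper addresses catastrophic forgetting in continual learning for computer vision, where models struggle to retain previously learned knowledge while adapting to new tasks, leading to degraded performance and even misclassification. To tackle this, the paper proposes a novel Knowledge Graph Enhanced Generative Multi-modal model (KG-GMM). Its key idea is to build a knowledge graph that evolves throughout learning and to use the relations in it to augment class labels, assigning different relations to similar categories to improve the model's ability to tell them apart. At inference time, the paper proposes a Knowledge Graph Augmented Inference method that locates specific categories by analyzing the relations in the generated text, reducing the loss of detailed information about old classes when learning new knowledge and alleviating forgetting. Experiments show that the method effectively uses relational information to correct mispredictions and achieves state-of-the-art performance in both conventional and few-shot class-incremental learning (CIL) settings, confirming that knowledge graphs help preserve knowledge in continual learning.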

Link: https://arxiv.org/abs/2503.18403
Authors: Xusheng Cao,Haori Lu,Linlan Huang,Fei Yang,Xialei Liu,Ming-Ming Cheng
Affiliations: VCIP, CS, Nankai University (南开大学); NKIARI, Shenzhen Futian (深圳福田)
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Continual learning in computer vision faces the critical challenge of catastrophic forgetting, where models struggle to retain prior knowledge while adapting to new tasks. Although recent studies have attempted to leverage the generalization capabilities of pre-trained models to mitigate overfitting on current tasks, models still tend to forget details of previously learned categories as tasks progress, leading to misclassification. To address these limitations, we introduce a novel Knowledge Graph Enhanced Generative Multi-modal model (KG-GMM) that builds an evolving knowledge graph throughout the learning process. Our approach utilizes relationships within the knowledge graph to augment the class labels and assigns different relations to similar categories to enhance model differentiation. During testing, we propose a Knowledge Graph Augmented Inference method that locates specific categories by analyzing relationships within the generated text, thereby reducing the loss of detailed information about old classes when learning new knowledge and alleviating forgetting. Experiments demonstrate that our method effectively leverages relational information to help the model correct mispredictions, achieving state-of-the-art results in both conventional CIL and few-shot CIL settings, confirming the efficacy of knowledge graphs at preserving knowledge in the continual learning scenarios.

[CV-107] DashGaussian: Optimizing 3D Gaussian Splatting in 200 Seconds CVPR2025

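[Quick read]: This paper addresses the excessive time cost of optimizing 3D Gaussian Splatting (3DGS), which is dominated by the rendering resolution and the number of Gaussian primitives (together, the optimization complexity). The key solution is DashGaussian, a scheduling scheme that accelerates 3DGS optimization by stripping redundant optimization complexity. Specifically, the authors formulate 3DGS optimization as progressively fitting higher-frequency components and design a dynamic rendering resolution scheme that greatly reduces the optimization complexity. The paper also argues that balancing rendering resolution against the number of primitives is essential, and therefore schedules primitive growth in sync with the rendering resolution to trade off computational efficiency and fitting quality. Experiments show that the method accelerates the optimization of various 3DGS backbones by 45.7% on average while preserving rendering quality.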

Link: https://arxiv.org/abs/2503.18402
Authors: Youyu Chen,Junjun Jiang,Kui Jiang,Xiao Tang,Zhihao Li,Xianming Liu,Yinyu Nie
Affiliations: Harbin Institute of Technology (哈尔滨工业大学); Huawei Noah’s Ark Lab (华为诺亚方舟实验室)
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by CVPR2025. Project page: this https URL

Abstract:3D Gaussian Splatting (3DGS) renders pixels by rasterizing Gaussian primitives, where the rendering resolution and the primitive number, concluded as the optimization complexity, dominate the time cost in primitive optimization. In this paper, we propose DashGaussian, a scheduling scheme over the optimization complexity of 3DGS that strips redundant complexity to accelerate 3DGS optimization. Specifically, we formulate 3DGS optimization as progressively fitting 3DGS to higher levels of frequency components in the training views, and propose a dynamic rendering resolution scheme that largely reduces the optimization complexity based on this formulation. Besides, we argue that a specific rendering resolution should cooperate with a proper primitive number for a better balance between computing redundancy and fitting quality, where we schedule the growth of the primitives to synchronize with the rendering resolution. Extensive experiments show that our method accelerates the optimization of various 3DGS backbones by 45.7% on average while preserving the rendering quality.

[CV-108] PDDM: Pseudo Depth Diffusion Model for RGB-PD Semantic Segmentation Based in Complex Indoor Scenes

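[Quick read]: This paper explores the feasibility of replacing real depth data with Pseudo Depth (PD) for semantic segmentation and proposes a pseudo-depth-based solution to improve segmentation accuracy in complex indoor scenes. The key is an RGB-PD segmentation pipeline that integrates RGB images with pseudo depth, together with a Pseudo Depth Aggregation Module (PDAM) that fully exploits the useful cues in diverse pseudo depth maps by aggregating multiple pseudo depth maps into a single modality, which makes it easy to adapt to other RGB-D segmentation methods. To combine pseudo depth with the feature extraction ability of diffusion models, the paper also introduces the Pseudo Depth Diffusion Model (PDDM), which uses a large-scale text-image diffusion model as the feature extractor and a simple yet effective fusion strategy to integrate pseudo depth. Extensive experiments on the NYUv2 and SUNRGB-D datasets show that pseudo depth effectively improves segmentation performance and that PDDM outperforms other methods by +6.98 mIoU on NYUv2 and +2.11 mIoU on SUNRGB-D.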

Link: https://arxiv.org/abs/2503.18393
Authors: Xinhua Xu,Hong Liu,Jianbing Wu,Jinfu Liu
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:The integration of RGB and depth modalities significantly enhances the accuracy of segmenting complex indoor scenes, with depth data from RGB-D cameras playing a crucial role in this improvement. However, collecting an RGB-D dataset is more expensive than an RGB dataset due to the need for specialized depth sensors. Aligning depth and RGB images also poses challenges due to sensor positioning and issues like missing data and noise. In contrast, Pseudo Depth (PD) from high-precision depth estimation algorithms can eliminate the dependence on RGB-D sensors and alignment processes, as well as provide effective depth information and show significant potential in semantic segmentation. Therefore, to explore the practicality of utilizing pseudo depth instead of real depth for semantic segmentation, we design an RGB-PD segmentation pipeline to integrate RGB and pseudo depth and propose a Pseudo Depth Aggregation Module (PDAM) for fully exploiting the informative clues provided by the diverse pseudo depth maps. The PDAM aggregates multiple pseudo depth maps into a single modality, making it easily adaptable to other RGB-D segmentation methods. In addition, the pre-trained diffusion model serves as a strong feature extractor for RGB segmentation tasks, but multi-modal diffusion-based segmentation methods remain unexplored. Therefore, we present a Pseudo Depth Diffusion Model (PDDM) that adopts a large-scale text-image diffusion model as a feature extractor and a simple yet effective fusion strategy to integrate pseudo depth. To verify the applicability of pseudo depth and our PDDM, we perform extensive experiments on the NYUv2 and SUNRGB-D datasets. The experimental results demonstrate that pseudo depth can effectively enhance segmentation performance, and our PDDM achieves state-of-the-art performance, outperforming other methods by +6.98 mIoU on NYUv2 and +2.11 mIoU on SUNRGB-D.
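
The reported gains are stated in mIoU (mean intersection-over-union), the standard semantic segmentation metric. For readers unfamiliar with it, the sketch below computes mIoU over a flat list of per-pixel labels. It is a minimal illustration on toy data; benchmark implementations usually accumulate a confusion matrix over the whole dataset, and the choice to skip classes absent from both prediction and ground truth is an assumption of this sketch.

```python
def mean_iou(pred, target, num_classes):
    """Mean intersection-over-union over classes present in pred or target."""
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union:                      # skip classes that never appear
            ious.append(inter / union)
    return sum(ious) / len(ious)

pred = [0, 0, 1, 1, 2, 2]      # toy predicted labels per pixel
target = [0, 1, 1, 1, 2, 0]    # toy ground-truth labels per pixel
score = mean_iou(pred, target, num_classes=3)
```

Here the per-class IoUs are 1/3, 2/3, and 1/2, so the mean is 0.5. Because every class counts equally, rare classes affect mIoU as much as frequent ones.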

[CV-109] Resource-Efficient Motion Control for Video Generation via Dynamic Mask Guidance

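[Quick read]: This paper addresses challenges faced by existing text-to-video generation models: high training costs, large data requirements, and difficulty maintaining consistency between the given text and the motion of the foreground object. The key solution is a mask-guided video generation method that controls generation through mask motion sequences and introduces foreground masks for precise text-position matching and motion trajectory control, keeping the foreground object consistent across the whole sequence. A first-frame sharing strategy and an autoregressive extension approach further improve the stability and length of generated videos. Experiments show that the method performs well on tasks such as video editing and artistic video generation and outperforms previous methods in consistency and quality.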

Link: https://arxiv.org/abs/2503.18386
Authors: Sicong Feng,Jielong Yang,Li Peng
Affiliations: Jiangnan University (江南大学, 中国)
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Recent advances in diffusion models bring new vitality to visual content creation. However, current text-to-video generation models still face significant challenges such as high training costs, substantial data requirements, and difficulties in maintaining consistency between given text and motion of the foreground object. To address these challenges, we propose mask-guided video generation, which can control video generation through mask motion sequences, while requiring limited training data. Our model enhances existing architectures by incorporating foreground masks for precise text-position matching and motion trajectory control. Through mask motion sequences, we guide the video generation process to maintain consistent foreground objects throughout the sequence. Additionally, through a first-frame sharing strategy and autoregressive extension approach, we achieve more stable and longer video generation. Extensive qualitative and quantitative experiments demonstrate that this approach excels in various video generation tasks, such as video editing and generating artistic videos, outperforming previous methods in terms of consistency and quality. Our generated results can be viewed in the supplementary materials.

[CV-110] LiDAR Remote Sensing Meets Weak Supervision: Concepts, Methods, and Perspectives

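[Quick read]: This review addresses the reliance of LiDAR remote sensing on dense and precise annotations in both data interpretation and LiDAR-based inversion, which makes these tasks costly and time-consuming. Its key idea is to adopt a unified weakly supervised learning perspective and systematically examine methods that handle LiDAR remote sensing tasks with incomplete, inaccurate, or inexact annotations, as well as annotations from other domains, with the aim of reducing dependence on expensive annotation and advancing LiDAR interpretation and inversion.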

Link: https://arxiv.org/abs/2503.18384
Authors: Yuan Gao,Shaobo Xia,Pu Wang,Xiaohuan Xi,Sheng Nie,Cheng Wang
Affiliations: Chinese Academy of Sciences (中国科学院)
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:LiDAR (Light Detection and Ranging) enables rapid and accurate acquisition of three-dimensional spatial data, widely applied in remote sensing areas such as surface mapping, environmental monitoring, urban modeling, and forestry inventory. LiDAR remote sensing primarily includes data interpretation and LiDAR-based inversion. However, LiDAR interpretation typically relies on dense and precise annotations, which are costly and time-consuming. Similarly, LiDAR inversion depends on scarce supervisory signals and expensive field surveys for annotations. To address this challenge, weakly supervised learning has gained significant attention in recent years, with many methods emerging to tackle LiDAR remote sensing tasks using incomplete, inaccurate, and inexact annotations, as well as annotations from other domains. Existing review articles treat LiDAR interpretation and inversion as separate tasks. This review, for the first time, adopts a unified weakly supervised learning perspective to systematically examine research on both LiDAR interpretation and inversion. We summarize the latest advancements, provide a comprehensive review of the development and application of weakly supervised techniques in LiDAR remote sensing, and discuss potential future research directions in this field.

[CV-111] PP-FormulaNet: Bridging Accuracy and Efficiency in Advanced Formula Recognition

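[Quick read]: This paper addresses formula recognition in document intelligence: converting mathematical expressions extracted from document images into structured symbolic formats that computers can process easily, of which LaTeX is the most common. The key contribution is PP-FormulaNet, a state-of-the-art formula recognition model that performs well in both accuracy and efficiency. To meet the needs of different applications, the work develops two specialized models: PP-FormulaNet-L for high-accuracy scenarios and PP-FormulaNet-S optimized for high-efficiency scenarios. The paper also introduces a Formula Mining System for extracting high-quality formula data, which further improves the robustness and applicability of the model.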

Link: https://arxiv.org/abs/2503.18382
Authors: Hongen Liu,Cheng Cui,Yuning Du,Yi Liu,Gang Pan
Affiliations: PaddlePaddle Team, Baidu Inc. (飞桨团队, 百度公司); College of Intelligence and Computing, Tianjin University (天津大学智能与计算学部)
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Formula recognition is an important task in document intelligence. It involves converting mathematical expressions from document images into structured symbolic formats that computers can easily work with. LaTeX is the most common format used for this purpose. In this work, we present PP-FormulaNet, a state-of-the-art formula recognition model that excels in both accuracy and efficiency. To meet the diverse needs of applications, we have developed two specialized models: PP-FormulaNet-L, tailored for high-accuracy scenarios, and PP-FormulaNet-S, optimized for high-efficiency contexts. Our extensive evaluations reveal that PP-FormulaNet-L attains accuracy levels that surpass those of prominent models such as UniMERNet by a significant 6%. Conversely, PP-FormulaNet-S operates at speeds that are over 16 times faster. These advancements facilitate seamless integration of PP-FormulaNet into a broad spectrum of document processing environments that involve intricate mathematical formulas. Furthermore, we introduce a Formula Mining System, which is capable of extracting a vast amount of high-quality formula data. This system further enhances the robustness and applicability of our formula recognition model. Code and models are publicly available at PaddleOCR(this https URL) and PaddleX(this https URL).

[CV-112] Exploring State Space Model in Wavelet Domain: An Infrared and Visible Image Fusion Network via Wavelet Transform and State Space Model

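[Quick read]: This paper addresses the failure of existing infrared and visible image fusion (IVIF) methods to fully combine frequency-domain features with global semantic information, which leads to incomplete extraction of cross-modal global features and insufficient preservation of local texture details. To solve this, the paper proposes Wavelet-Mamba (W-Mamba), whose key idea is to combine the wavelet transform with the state-space model (SSM): a Wavelet-SSM module performs wavelet-based frequency-domain feature extraction and SSM-based global information extraction, capturing both global and local features. A cross-modal feature attention modulation is also proposed to enable efficient interaction and fusion between modalities. Experiments show that the method outperforms current state-of-the-art methods in both visual quality and performance.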

Link: https://arxiv.org/abs/2503.18378
Authors: Tianpei Zhang,Yiming Zhu,Jufeng Zhao,Guangmang Cui,Yuchen Zheng
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Deep learning techniques have revolutionized the infrared and visible image fusion (IVIF), showing remarkable efficacy on complex scenarios. However, current methods do not fully combine frequency domain features with global semantic information, which will result in suboptimal extraction of global features across modalities and insufficient preservation of local texture details. To address these issues, we propose Wavelet-Mamba (W-Mamba), which integrates wavelet transform with the state-space model (SSM). Specifically, we introduce Wavelet-SSM module, which incorporates wavelet-based frequency domain feature extraction and global information extraction through SSM, thereby effectively capturing both global and local features. Additionally, we propose a cross-modal feature attention modulation, which facilitates efficient interaction and fusion between different modalities. The experimental results indicate that our method achieves both visually compelling results and superior performance compared to current state-of-the-art methods. Our code is available at this https URL.
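
The abstract relies on a wavelet transform to separate frequency content before the state-space model processes it. As background, the sketch below implements a one-level Haar transform on a 1D signal, splitting it into a low-frequency approximation and a high-frequency detail band. The choice of the Haar wavelet and the 1D setting are assumptions for brevity; the abstract does not say which wavelet W-Mamba uses, and images would apply the same step along rows and then columns.

```python
import math

def haar_dwt(signal):
    """One-level Haar transform of an even-length signal: (approximation, detail)."""
    s = 1 / math.sqrt(2)
    approx = [(signal[i] + signal[i + 1]) * s for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) * s for i in range(0, len(signal), 2)]
    return approx, detail

def haar_idwt(approx, detail):
    """Inverse of haar_dwt: interleave the reconstructed sample pairs."""
    s = 1 / math.sqrt(2)
    out = []
    for a, d in zip(approx, detail):
        out.extend([(a + d) * s, (a - d) * s])
    return out

signal = [4.0, 4.0, 2.0, 0.0]              # toy row of pixel intensities
approx, detail = haar_dwt(signal)
restored = haar_idwt(approx, detail)
```

Flat regions produce zero detail coefficients while edges produce large ones, which is why the detail band is a natural carrier for local texture and the approximation band for coarse structure.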

[CV-113] Do Your Best and Get Enough Rest for Continual Learning CVPR

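[Quick read]: This paper addresses long-term knowledge retention in neural networks, which is hampered by catastrophic forgetting during continual learning. Building on Ebbinghaus's forgetting curve theory, the key solution is a view-batch model that optimizes the recall interval between retraining on the same samples, so that the network gets enough "rest" while learning new data. Specifically, the paper proposes two approaches: (1) a replay method that guarantees the optimal recall interval, and (2) a self-supervised learning method that acquires extensive knowledge from a single training sample. Experiments show that these approaches are consistent with the forgetting curve theory, enhance long-term memory, and significantly improve many existing continual learning methods.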

Link: https://arxiv.org/abs/2503.18371
Authors: Hankyul Kang,Gregor Seifer,Donghyun Lee,Jongbin Ryu
Affiliations: Ajou University; KAIST
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Accepted by IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2025

Abstract:According to the forgetting curve theory, we can enhance memory retention by learning extensive data and taking adequate rest. This means that in order to effectively retain new knowledge, it is essential to learn it thoroughly and ensure sufficient rest so that our brain can memorize without forgetting. The main takeaway from this theory is that learning extensive data at once necessitates sufficient rest before learning the same data again. This aspect of human long-term memory retention can be effectively utilized to address the continual learning of neural networks. Retaining new knowledge for a long period of time without catastrophic forgetting is the critical problem of continual learning. Therefore, based on Ebbinghaus’ theory, we introduce the view-batch model that adjusts the learning schedules to optimize the recall interval between retraining the same samples. The proposed view-batch model allows the network to get enough rest to learn extensive knowledge from the same samples with a recall interval of sufficient length. To this end, we specifically present two approaches: 1) a replay method that guarantees the optimal recall interval, and 2) a self-supervised learning that acquires extensive knowledge from a single training sample at a time. We empirically show that these approaches of our method are aligned with the forgetting curve theory, which can enhance long-term memory. In our experiments, we also demonstrate that our method significantly improves many state-of-the-art continual learning methods in various protocols and scenarios. We open-source this project at this https URL.

[CV-114] DiffusedWrinkles: A Diffusion-Based Model for Data-Driven Garment Animation BMVC2024

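[Quick Read]: This paper addresses the generation of high-quality 3D garment animations with a 2D image diffusion model, targeting in particular parametric garments with fine wrinkle detail. Existing methods (fully connected networks, graph neural networks, or generative adversarial networks) struggle in this setting. The key idea is to represent 3D garment deformations as a 2D layout-consistent texture that encodes 3D offsets with respect to a parametric garment template. Using this representation, the authors train a conditional diffusion model that handles a wide variety of garments and body shapes and is agnostic to the garment mesh topology. The model synthesizes high-quality pose-, shape-, and design-dependent 3D garment deformations and can generate temporally coherent sequences.

Link: https://arxiv.org/abs/2503.18370
Authors: Raquel Vidaurre,Elena Garces,Dan Casas
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: BMVC 2024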

Abstract:We present a data-driven method for learning to generate animations of 3D garments using a 2D image diffusion model. In contrast to existing methods, typically based on fully connected networks, graph neural networks, or generative adversarial networks, which have difficulties to cope with parametric garments with fine wrinkle detail, our approach is able to synthesize high-quality 3D animations for a wide variety of garments and body shapes, while being agnostic to the garment mesh topology. Our key idea is to represent 3D garment deformations as a 2D layout-consistent texture that encodes 3D offsets with respect to a parametric garment template. Using this representation, we encode a large dataset of garments simulated in various motions and shapes and train a novel conditional diffusion model that is able to synthesize high-quality pose-shape-and-design dependent 3D garment deformations. Since our model is generative, we can synthesize various plausible deformations for a given target pose, shape, and design. Additionally, we show that we can further condition our model using an existing garment state, which enables the generation of temporally coherent sequences.

[CV-115] MoST: Efficient Monarch Sparse Tuning for 3D Representation Learning

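[Quick Read]: This paper addresses parameter-efficient fine-tuning (PEFT) for 3D representation learning and proposes a reparameterization method called Monarch Sparse Tuning (MoST). Unlike existing adapter-based and prompt-tuning 3D PEFT methods, MoST adds no inference overhead and is compatible with many 3D representation learning backbones. At its core is a new family of structured matrices for 3D point clouds, Point Monarch, which captures local geometric features of irregular points while remaining highly expressive. MoST reparameterizes the dense update weight matrices as sparse Point Monarch matrices, greatly reducing the parameter count while retaining strong performance. Experiments report state-of-the-art classification accuracy on multiple benchmarks, e.g., 97.5% on ScanObjectNN and 96.2% on ModelNet40, and the method can be combined with other matrix decompositions to reduce parameters further.

Link: https://arxiv.org/abs/2503.18368
Authors: Xu Han,Yuan Tang,Jinfeng Xu,Xianzhi Li
Affiliations: Huazhong University of Science and Technology
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 11 pages, 6 figures, 6 tables. Code and weights are available at this https URL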

Abstract:We introduce Monarch Sparse Tuning (MoST), the first reparameterization-based parameter-efficient fine-tuning (PEFT) method tailored for 3D representation learning. Unlike existing adapter-based and prompt-tuning 3D PEFT methods, MoST introduces no additional inference overhead and is compatible with many 3D representation learning backbones. At its core, we present a new family of structured matrices for 3D point clouds, Point Monarch, which can capture local geometric features of irregular points while offering high expressiveness. MoST reparameterizes the dense update weight matrices as our sparse Point Monarch matrices, significantly reducing parameters while retaining strong performance. Experiments on various backbones show that MoST is simple, effective, and highly generalizable. It captures local features in point clouds, achieving state-of-the-art results on multiple benchmarks, e.g., 97.5% acc. on ScanObjectNN (PB_50_RS) and 96.2% on ModelNet40 classification, while it can also combine with other matrix decompositions (e.g., Low-rank, Kronecker) to further reduce parameters.
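
The "no additional inference overhead" property follows from reparameterization: a structured update can be folded into the frozen weight once fine-tuning is done. The following is a minimal sketch of that folding step only. It assumes a plain block-diagonal update as a simplified stand-in for the Point Monarch structure, whose exact form is not given in this digest, and all function and variable names are hypothetical.

```python
def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def block_diag(blocks):
    """Build a dense matrix from square blocks placed on the diagonal."""
    n = sum(len(b) for b in blocks)
    out = [[0.0] * n for _ in range(n)]
    offset = 0
    for b in blocks:
        for i, row in enumerate(b):
            for j, v in enumerate(row):
                out[offset + i][offset + j] = v
        offset += len(b)
    return out


def merge(W, delta):
    """Fold the learned update into the frozen weight: W' = W + delta."""
    return [[w + d for w, d in zip(rw, rd)] for rw, rd in zip(W, delta)]


# Frozen 4x4 weight and a sparse update made of two trainable 2x2 blocks
# (8 trainable values instead of 16 for a dense update).
W = [[1.0, 0.0, 2.0, 0.0],
     [0.0, 1.0, 0.0, 2.0],
     [3.0, 0.0, 1.0, 0.0],
     [0.0, 3.0, 0.0, 1.0]]
delta = block_diag([[[0.5, 0.0], [0.0, 0.5]],
                    [[0.0, 1.0], [1.0, 0.0]]])
x = [1.0, 2.0, 3.0, 4.0]

# During training: frozen path plus structured update path.
two_path = [a + b for a, b in zip(matvec(W, x), matvec(delta, x))]
# At inference: one merged matrix, same cost as the original layer.
merged = matvec(merge(W, delta), x)
```

Both paths give the same output, which is why a merged layer costs no more at inference than the original one; adapter-style methods, by contrast, keep an extra module in the forward pass.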

[CV-116] MaSS13K: A Matting-level Semantic Segmentation Benchmark

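[Quick Read]: This paper addresses the limited resolution and the lack of precise mask details and boundaries in existing datasets for high-resolution semantic segmentation. It proposes MaSSFormer, whose key component is an efficient pixel decoder that aggregates high-level semantic features and low-level texture features across three stages to produce high-resolution masks at minimal computational cost. It also introduces a new learning paradigm that integrates high-quality annotated masks with pseudo labels from new classes, allowing MaSSFormer to transfer its accurate segmentation capability to other object classes. The work builds on MaSS13K, a large-scale matting-level semantic segmentation dataset constructed by the authors, to support research on high-resolution, high-quality semantic segmentation.

Link: https://arxiv.org/abs/2503.18364
Authors: Chenxi Xie,Minghan Li,Hui Zeng,Jun Luo,Lei Zhang
Affiliations: The Hong Kong Polytechnic University; OPPO Research Institute
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: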

Abstract:High-resolution semantic segmentation is essential for applications such as image editing, bokeh imaging, AR/VR, etc. Unfortunately, existing datasets often have limited resolution and lack precise mask details and boundaries. In this work, we build a large-scale, matting-level semantic segmentation dataset, named MaSS13K, which consists of 13,348 real-world images, all at 4K resolution. MaSS13K provides high-quality mask annotations of a number of objects, which are categorized into seven categories: human, vegetation, ground, sky, water, building, and others. MaSS13K features precise masks, with an average mask complexity 20-50 times higher than existing semantic segmentation datasets. We consequently present a method specifically designed for high-resolution semantic segmentation, namely MaSSFormer, which employs an efficient pixel decoder that aggregates high-level semantic features and low-level texture features across three stages, aiming to produce high-resolution masks with minimal computational cost. Finally, we propose a new learning paradigm, which integrates the high-quality masks of the seven given categories with pseudo labels from new classes, enabling MaSSFormer to transfer its accurate segmentation capability to other classes of objects. Our proposed MaSSFormer is comprehensively evaluated on the MaSS13K benchmark together with 14 representative segmentation models. We expect that our meticulously annotated MaSS13K dataset and the MaSSFormer model can facilitate the research of high-resolution and high-quality semantic segmentation. Datasets and codes can be found at this https URL.

[CV-117] MonoInstance: Enhancing Monocular Priors via Multi-view Instance Alignment for Neural Rendering and Reconstruction CVPR2025

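[Quick Read]: This paper addresses the inefficient use of monocular depth priors in multi-view tasks (such as 3D reconstruction and novel view synthesis) caused by cross-view inconsistency. Current methods treat the entire estimated depth map as accurate ground-truth supervision and ignore the inherent inaccuracy and cross-view inconsistency of monocular priors. The paper proposes MonoInstance, a general approach that explores the uncertainty of monocular depths to provide enhanced geometric priors for neural rendering and reconstruction. The key is to align the depths of each segmented instance from multiple views in a common 3D space, which turns uncertainty estimation for monocular depth into a density measure within noisy point clouds. For high-uncertainty regions where the depth prior is unreliable, an additional constraint term encourages projected instances to align with the corresponding instance masks in nearby views. MonoInstance is a flexible strategy that integrates seamlessly into various multi-view neural rendering frameworks. Experiments show significant improvements in reconstruction and novel view synthesis across various benchmarks.

Link: https://arxiv.org/abs/2503.18363
Authors: Wenyuan Zhang,Yixiao Yang,Han Huang,Liang Han,Kanle Shi,Yu-Shen Liu
Affiliations: School of Software, Tsinghua University; Kuaishou Technology; Department of Computer Science, Wayne State University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by CVPR 2025. Project page: this https URL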

Abstract:Monocular depth priors have been widely adopted by neural rendering in multi-view based tasks such as 3D reconstruction and novel view synthesis. However, due to the inconsistent prediction on each view, how to more effectively leverage monocular cues in a multi-view context remains a challenge. Current methods treat the entire estimated depth map indiscriminately, and use it as ground truth supervision, while ignoring the inherent inaccuracy and cross-view inconsistency in monocular priors. To resolve these issues, we propose MonoInstance, a general approach that explores the uncertainty of monocular depths to provide enhanced geometric priors for neural rendering and reconstruction. Our key insight lies in aligning each segmented instance depths from multiple views within a common 3D space, thereby casting the uncertainty estimation of monocular depths into a density measure within noisy point clouds. For high-uncertainty areas where depth priors are unreliable, we further introduce a constraint term that encourages the projected instances to align with corresponding instance masks on nearby views. MonoInstance is a versatile strategy which can be seamlessly integrated into various multi-view neural rendering frameworks. Our experimental results demonstrate that MonoInstance significantly improves the performance in both reconstruction and novel view synthesis under various benchmarks.
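
The idea of casting uncertainty as "a density measure within noisy point clouds" can be illustrated with a simple neighbour count. This is only a sketch under stated assumptions: the radius, the normalization, and the function name `density_uncertainty` are hypothetical and are not taken from the paper.

```python
import math


def density_uncertainty(points, radius):
    """Return one uncertainty value in [0, 1] per 3D point.

    Points with few neighbours inside `radius` get high uncertainty.
    """
    counts = []
    for i, p in enumerate(points):
        n = sum(1 for j, q in enumerate(points)
                if j != i and math.dist(p, q) <= radius)
        counts.append(n)
    top = max(counts) or 1  # avoid division by zero when every point is isolated
    return [1.0 - c / top for c in counts]


# Back-projected points of one instance from several views (hypothetical):
# three views agree near z = 2.0, one view is off by a wide margin.
cloud = [(0.0, 0.0, 2.00), (0.0, 0.0, 2.02), (0.0, 0.0, 1.98), (0.0, 0.0, 3.50)]
u = density_uncertainty(cloud, radius=0.1)
```

Here the three consistent points receive uncertainty 0 and the outlier receives 1, so a depth loss could down-weight the region where views disagree.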

[CV-118] NeRFPrior: Learning Neural Radiance Field as a Prior for Indoor Scene Reconstruction CVPR2025

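[Quick Read]: This paper addresses the reliance of neural implicit functions on priors when reconstructing high-quality surfaces from multi-view RGB images. Current priors require large-scale pre-training and provide only geometric clues, without accounting for the importance of color. The paper proposes NeRFPrior, which uses a neural radiance field as a prior and learns a signed distance field through volume rendering for surface reconstruction. The NeRF prior provides both geometric and color clues and can be trained quickly on the same scene without additional data. The key is to learn a signed distance function (SDF) on top of the NeRF prior by explicitly imposing a multi-view consistency constraint at each ray intersection for surface inference. For textureless regions, a depth consistency loss with confidence weights is introduced to further infer the SDF. Experiments show that the method outperforms state-of-the-art methods on widely used benchmarks.

Link: https://arxiv.org/abs/2503.18361
Authors: Wenyuan Zhang,Emily Yue-ting Jia,Junsheng Zhou,Baorui Ma,Kanle Shi,Yu-Shen Liu
Affiliations: School of Software, Tsinghua University; Kuaishou Technology; Department of Computer Science, Wayne State University; Deep Earth Probe and Mineral Resources Exploration - National Science and Technology Major Project; National Natural Science Foundation of China
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by CVPR 2025. Project page: this https URL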

Abstract:Recently, it has been shown that priors are vital for neural implicit functions to reconstruct high-quality surfaces from multi-view RGB images. However, current priors require large-scale pre-training, and merely provide geometric clues without considering the importance of color. In this paper, we present NeRFPrior, which adopts a neural radiance field as a prior to learn signed distance fields using volume rendering for surface reconstruction. Our NeRF prior can provide both geometric and color clues, and can also be trained quickly on the same scene without additional data. Based on the NeRF prior, we learn a signed distance function (SDF) by explicitly imposing a multi-view consistency constraint on each ray intersection for surface inference. Specifically, at each ray intersection, we use the density in the prior as a coarse geometry estimation, while using the color near the surface as a clue to check its visibility from another view angle. For the textureless areas where the multi-view consistency constraint does not work well, we further introduce a depth consistency loss with confidence weights to infer the SDF. Our method outperforms state-of-the-art methods on widely used benchmarks.

[CV-119] Context-Enhanced Memory-Refined Transformer for Online Action Detection CVPR2025

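[Quick Read]: This paper addresses a training-inference discrepancy in online action detection (OAD): existing methods train with short-term memory of varying length but rely on full-length short-term memory at inference, which limits learning effectiveness. The paper proposes the Context-enhanced Memory-Refined Transformer (CMeRT). Its key innovations are a context-enhanced encoder that improves frame representations using additional near-past context, and a memory-refined decoder that leverages near-future generation to improve performance. With these changes, CMeRT achieves state-of-the-art performance in online detection and anticipation on THUMOS’14, CrossTask, and EPIC-Kitchens-100.

Link: https://arxiv.org/abs/2503.18359
Authors: Zhanzhong Pang,Fadime Sener,Angela Yao
Affiliations: National University of Singapore; Meta Reality Labs
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by CVPR 2025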

Abstract:Online Action Detection (OAD) detects actions in streaming videos using past observations. State-of-the-art OAD approaches model past observations and their interactions with an anticipated future. The past is encoded using short- and long-term memories to capture immediate and long-range dependencies, while anticipation compensates for missing future context. We identify a training-inference discrepancy in existing OAD methods that hinders learning effectiveness. The training uses varying lengths of short-term memory, while inference relies on a full-length short-term memory. As a remedy, we propose a Context-enhanced Memory-Refined Transformer (CMeRT). CMeRT introduces a context-enhanced encoder to improve frame representations using additional near-past context. It also features a memory-refined decoder to leverage near-future generation to enhance performance. CMeRT achieves state-of-the-art in online detection and anticipation on THUMOS’14, CrossTask, and EPIC-Kitchens-100.

[CV-120] Cost-Sensitive Learning for Long-Tailed Temporal Action Segmentation

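[Quick Read]: This paper addresses temporal action segmentation in untrimmed procedural videos, with a focus on the long-tailed distributions in which action classes vary widely in frequency and duration. It identifies a bi-level learning bias in current methods: (1) a class-level bias that stems from class imbalance favoring head classes, and (2) a transition-level bias that prioritizes commonly observed transitions. To alleviate both, the paper formulates a constrained optimization problem, defines learning states for action classes and their associated transitions, and integrates them into the optimization process. The key is a novel cost-sensitive loss function, formulated as a weighted cross-entropy loss whose weights are adjusted adaptively based on the learning state of actions and their transitions. Experiments on several challenging datasets and various frameworks show significant improvements in per-class frame-wise and segment-wise performance.

Link: https://arxiv.org/abs/2503.18358
Authors: Zhanzhong Pang,Fadime Sener,Shrinivas Ramasubramanian,Angela Yao
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: BMVC 2024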

Abstract:Temporal action segmentation in untrimmed procedural videos aims to densely label frames into action classes. These videos inherently exhibit long-tailed distributions, where actions vary widely in frequency and duration. In temporal action segmentation approaches, we identified a bi-level learning bias. This bias encompasses (1) a class-level bias, stemming from class imbalance favoring head classes, and (2) a transition-level bias arising from variations in transitions, prioritizing commonly observed transitions. As a remedy, we introduce a constrained optimization problem to alleviate both biases. We define learning states for action classes and their associated transitions and integrate them into the optimization process. We propose a novel cost-sensitive loss function formulated as a weighted cross-entropy loss, with weights adaptively adjusted based on the learning state of actions and their transitions. Experiments on three challenging temporal segmentation benchmarks and various frameworks demonstrate the effectiveness of our approach, resulting in significant improvements in both per-class frame-wise and segment-wise performance.
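
The weighted cross-entropy form mentioned in the abstract can be sketched for a single frame as follows. The fixed weights below are placeholders: in the paper the weights are adjusted adaptively from the learning states of actions and transitions, and that update rule is not described in this digest, so it is not reproduced here.

```python
import math


def weighted_cross_entropy(logits, target, class_weights):
    """Weighted cross-entropy for one frame.

    logits: unnormalised scores per class; target: true class index.
    """
    m = max(logits)  # subtract the max for numerical stability
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    log_prob = logits[target] - log_sum
    return -class_weights[target] * log_prob


# Hypothetical weights: the tail class (index 2) is up-weighted.
weights = [0.5, 1.0, 3.0]
logits = [2.0, 1.0, 0.0]
loss_head = weighted_cross_entropy(logits, 0, weights)
loss_tail = weighted_cross_entropy(logits, 2, weights)
```

With these placeholder weights, a mistake on the tail class contributes far more to the loss than a frame of the head class, which is the basic mechanism a cost-sensitive loss relies on.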

[CV-121] Diffusion-4K: Ultra-High-Resolution Image Synthesis with Latent Diffusion Models CVPR2025

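[Quick Read]: This paper addresses the lack of high-quality 4K datasets and efficient training methods for text-to-image generation at ultra-high resolution. The key contributions are: (1) the Aesthetic-4K benchmark, which contains carefully selected high-quality 4K images with captions generated by GPT-4o and introduces new metrics, GLCM Score and Compression Ratio, used alongside conventional measures such as FID, Aesthetics, and CLIPScore for comprehensive evaluation; (2) a wavelet-based fine-tuning strategy that enables efficient direct training of various latent diffusion models on photorealistic 4K images, markedly improving fine detail and text prompt adherence, especially with large-scale diffusion models. Together these allow Diffusion-4K to perform strongly in ultra-high-resolution image synthesis.

Link: https://arxiv.org/abs/2503.18352
Authors: Jinjin Zhang,Qiuyu Huang,Junjie Liu,Xiefan Guo,Di Huang
Affiliations: State Key Laboratory of Complex and Critical Software Environment, Beihang University; School of Computer Science and Engineering, Beihang University; Meituan
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to CVPR 2025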

Abstract:In this paper, we present Diffusion-4K, a novel framework for direct ultra-high-resolution image synthesis using text-to-image diffusion models. The core advancements include: (1) Aesthetic-4K Benchmark: addressing the absence of a publicly available 4K image synthesis dataset, we construct Aesthetic-4K, a comprehensive benchmark for ultra-high-resolution image generation. We curated a high-quality 4K dataset with carefully selected images and captions generated by GPT-4o. Additionally, we introduce GLCM Score and Compression Ratio metrics to evaluate fine details, combined with holistic measures such as FID, Aesthetics and CLIPScore for a comprehensive assessment of ultra-high-resolution images. (2) Wavelet-based Fine-tuning: we propose a wavelet-based fine-tuning approach for direct training with photorealistic 4K images, applicable to various latent diffusion models, demonstrating its effectiveness in synthesizing highly detailed 4K images. Consequently, Diffusion-4K achieves impressive performance in high-quality image synthesis and text prompt adherence, especially when powered by modern large-scale diffusion models (e.g., SD3-2B and Flux-12B). Extensive experimental results from our benchmark demonstrate the superiority of Diffusion-4K in ultra-high-resolution image synthesis.
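
To make the wavelet idea concrete, here is a one-level 2D Haar decomposition that separates a local average from three detail bands. It is a sketch under assumptions: the digest does not say which wavelet Diffusion-4K uses or how the sub-bands enter the training objective, so the Haar choice and the name `haar2d` are illustrative only.

```python
def haar2d(img):
    """One-level 2D Haar decomposition of an image with even height and width.

    Returns four sub-bands (low, dx, dy, dxy), each half the input size.
    """
    low, dx, dy, dxy = [], [], [], []
    for i in range(0, len(img), 2):
        r_low, r_dx, r_dy, r_dxy = [], [], [], []
        for j in range(0, len(img[0]), 2):
            a, b = img[i][j], img[i][j + 1]
            c, d = img[i + 1][j], img[i + 1][j + 1]
            r_low.append((a + b + c + d) / 4)  # local average (low frequency)
            r_dx.append((a - b + c - d) / 4)   # difference across columns
            r_dy.append((a + b - c - d) / 4)   # difference across rows
            r_dxy.append((a - b - c + d) / 4)  # diagonal difference
        low.append(r_low)
        dx.append(r_dx)
        dy.append(r_dy)
        dxy.append(r_dxy)
    return low, dx, dy, dxy


flat = [[4, 4], [4, 4]]  # smooth patch: no high-frequency content
edge = [[8, 0], [8, 0]]  # left column bright, right column dark
```

For the smooth patch all detail bands are zero, while the edge patch puts its energy in the column-difference band. A loss that looks at the detail bands therefore responds to fine structure that a plain average would hide.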

[CV-122] Human-Object Interaction with Vision-Language Model Guided Relative Movement Dynamics

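[Quick Read]: This paper addresses the difficulty existing methods have in achieving physical realism and supporting diverse interactions when simulating human-object interaction (HOI). It proposes a unified HOI framework that provides unified control over interactions with static scenes and dynamic objects through language commands. The key is to translate language commands into diagrams of the continuous, stable Relative Movement Dynamics (RMD) between human and object parts, using the world knowledge and scene perception capabilities of Vision-Language Models (VLMs) to guide goal-conditioned reinforcement learning for sequential interaction with objects. To support training and evaluation, the paper also builds a new dataset, Interplay, which contains multi-round task plans covering both static and dynamic HOI tasks.

Link: https://arxiv.org/abs/2503.18349
Authors: Zekai Deng,Ye Shi,Kaiyang Ji,Lan Xu,Shaoli Huang,Jingya Wang
Affiliations: ShanghaiTech University; Astribot
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: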

Abstract:Human-Object Interaction (HOI) is vital for advancing simulation, animation, and robotics, enabling the generation of long-term, physically plausible motions in 3D environments. However, existing methods often fall short of achieving physics realism and supporting diverse types of interactions. To address these challenges, this paper introduces a unified Human-Object Interaction framework that provides unified control over interactions with static scenes and dynamic objects using language commands. The interactions between human and object parts can always be described as the continuous stable Relative Movement Dynamics (RMD) between human and object parts. By leveraging the world knowledge and scene perception capabilities of Vision-Language Models (VLMs), we translate language commands into RMD diagrams, which are used to guide goal-conditioned reinforcement learning for sequential interaction with objects. Our framework supports long-horizon interactions among dynamic, articulated, and static objects. To support the training and evaluation of our framework, we present a new dataset named Interplay, which includes multi-round task plans generated by VLMs, covering both static and dynamic HOI tasks. Extensive experiments demonstrate that our proposed framework can effectively handle a wide range of HOI tasks, showcasing its ability to maintain long-term, multi-round transitions. For more details, please refer to our project webpage: this https URL.

[CV-123] PS-EIP: Robust Photometric Stereo Based on Event Interval Profile CVPR2025

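[Quick Read]: This paper addresses the limited robustness of the existing event-camera photometric stereo method (EventPS) to noise, shadows, and non-Lambertian reflections. The key is a new method, Photometric Stereo based on Event Interval Profile (PS-EIP), which exploits the continuity of the time-series profile of event intervals and adds an outlier detection technique based on profile shape, markedly improving resistance to outliers caused by shadows and specular reflections. Compared with EventPS-FCN, the deep-learning variant of EventPS, PS-EIP achieves higher robustness without relying on deep learning, as confirmed by experiments.

Link: https://arxiv.org/abs/2503.18341
Authors: Kazuma Kitazawa,Takahito Aoto,Satoshi Ikehata,Tsuyoshi Takatani
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: CVPR2025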

Abstract:Recently, the energy-efficient photometric stereo method using an event camera has been proposed to recover surface normals from events triggered by changes in logarithmic Lambertian reflections under a moving directional light source. However, EventPS treats each event interval independently, making it sensitive to noise, shadows, and non-Lambertian reflections. This paper proposes Photometric Stereo based on Event Interval Profile (PS-EIP), a robust method that recovers pixelwise surface normals from a time-series profile of event intervals. By exploiting the continuity of the profile and introducing an outlier detection method based on profile shape, our approach enhances robustness against outliers from shadows and specular reflections. Experiments using real event data from 3D-printed objects demonstrate that PS-EIP significantly improves robustness to outliers compared to EventPS’s deep-learning variant, EventPS-FCN, without relying on deep learning.

[CV-124] GranQ: Granular Zero-Shot Quantization with Unified Layer-Channel Awareness

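[Quick Read]: This paper addresses the significant activation loss that zero-shot quantization (ZSQ) suffers in low-bit settings, which results from the coarse-grained scaling strategies of existing methods. The key innovation is GranQ, a new method that introduces layer-channel awareness to adjust quantization granularity dynamically, taking both layer-level and channel-level activation distributions into account to achieve fine-grained quantization with minimal activation distortion. GranQ also introduces vectorized activation quantization, which improves computational efficiency and reduces overhead while preserving accuracy. Together these overcome limitations of conventional ZSQ methods, and GranQ outperforms state-of-the-art ZSQ methods that employ quantization-aware training.

Link: https://arxiv.org/abs/2503.18339
Authors: Inpyo Hong,Youngwan Jo,Hyojeong Lee,Sunghyun Ahn,Sanghyun Park
Affiliations: Yonsei University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: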

Abstract:Zero-shot quantization (ZSQ) enables neural network compression without training data, which is crucial in restricted data access environments. However, existing ZSQ methods suffer from significant activation loss in low-bit environments owing to their coarse-grained scaling strategy. To address this issue, we propose GranQ, a novel ZSQ approach that leverages layer-channel awareness to minimize the quantization error. Unlike conventional layer- or channel-wise quantization, GranQ dynamically adjusts quantization granularity by considering both layer- and channel-level activation distributions. This enables fine-grained quantization while minimizing activation distortion. Additionally, we introduce vectorized activation quantization, which enables efficient parallel computation and reduces computational overhead while preserving accuracy. GranQ achieves superior performance compared with those of state-of-the-art ZSQ methods that employ quantization-aware training. With these findings, we anticipate that GranQ will inspire novel research directions beyond conventional ZSQ approaches focused on data generation and model training.
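
Why scaling granularity matters in low-bit settings can be seen by comparing one scale per layer with one scale per channel. This is not GranQ itself, whose layer-channel scheme is not detailed in this digest; it is a minimal sketch of symmetric 4-bit quantization with hypothetical activation values and helper names.

```python
def quantize(values, scale, bits=4):
    """Symmetric uniform quantization followed by dequantization."""
    qmax = 2 ** (bits - 1) - 1
    out = []
    for v in values:
        q = max(-qmax, min(qmax, round(v / scale)))
        out.append(q * scale)
    return out


def max_abs_scale(values, bits=4):
    """Scale that maps the largest magnitude onto the top integer level."""
    qmax = 2 ** (bits - 1) - 1
    return max(abs(v) for v in values) / qmax


def mse(a, b):
    """Mean squared error between two equally long sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


# Two channels of one layer with very different ranges (hypothetical data).
channels = [[0.10, -0.20, 0.15, 0.05], [7.0, -3.0, 1.0, 5.0]]
flat = [v for ch in channels for v in ch]

# Coarse: one scale for the whole layer.
err_layer = mse(flat, quantize(flat, max_abs_scale(flat)))

# Finer: one scale per channel.
deq = []
for ch in channels:
    deq.extend(quantize(ch, max_abs_scale(ch)))
err_channel = mse(flat, deq)
```

With a single layer-wide scale, the small-range channel collapses to zero, so its activations are lost entirely. Per-channel scales keep them, which gives a much lower reconstruction error.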

[CV-125] SPMTrack: Spatio-Temporal Parameter-Efficient Fine-Tuning with Mixture of Experts for Scalable Visual Tracking CVPR2025

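[Quick Read]: This paper addresses a limitation of one-stream trackers in modeling the relation between template and search region images: a single Vision Transformer struggles to handle the significant variation in relation modeling across different image patches. For example, background regions need less attention, while foreground regions, especially boundaries, need emphasis. The paper proposes SPMTrack, a new tracker based on a mixture-of-experts tailored for visual tracking (TMoE), which combines the capabilities of multiple experts to handle diverse relation modeling more flexibly. The key is that TMoE extends relation modeling to spatio-temporal context and improves tracking accuracy with a minimal increase in parameters; TMoE also serves as a parameter-efficient fine-tuning method that reduces trainable parameters, enabling training of large-scale models while preserving the generalization ability of the pretrained model. Experiments show that the method significantly outperforms current state-of-the-art trackers.

Link: https://arxiv.org/abs/2503.18338
Authors: Wenrui Cai,Qingjie Liu,Yunhong Wang
Affiliations: State Key Laboratory of Virtual Reality Technology and Systems, Beihang University, Beijing, China; Zhongguancun Laboratory, Beijing, China; Hangzhou Innovation Institute, Beihang University, Hangzhou, China
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by CVPR2025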

Abstract:Most state-of-the-art trackers adopt a one-stream paradigm, using a single Vision Transformer for joint feature extraction and relation modeling of template and search region images. However, relation modeling between different image patches exhibits significant variations. For instance, background regions dominated by target-irrelevant information require reduced attention allocation, while foreground, particularly boundary areas, need to be emphasized. A single model may not effectively handle all kinds of relation modeling simultaneously. In this paper, we propose a novel tracker called SPMTrack based on mixture-of-experts tailored for the visual tracking task (TMoE), combining the capability of multiple experts to handle diverse relation modeling more flexibly. Benefiting from TMoE, we extend relation modeling from image pairs to spatio-temporal context, further improving tracking accuracy with minimal increase in model parameters. Moreover, we employ TMoE as a parameter-efficient fine-tuning method, substantially reducing trainable parameters, which enables us to train SPMTrack of varying scales efficiently and preserve the generalization ability of pretrained models to achieve superior performance. We conduct experiments on seven datasets, and experimental results demonstrate that our method significantly outperforms current state-of-the-art trackers. The source code is available at this https URL.

[CV-126] Coeff-Tuning: A Graph Filter Subspace View for Tuning Attention-Based Large Models

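[Quick Read]: This paper addresses how to customize large pre-trained Transformer models for downstream tasks with minimal computational and storage cost through parameter-efficient fine-tuning (PEFT). Previous PEFT methods are designed mainly from a tensor-decomposition perspective, tuning the linear transformations by adjusting a small number of parameters. This paper instead represents the attention operation as a graph convolution and models the multi-head attention maps as a convolutional filter subspace, with each attention map treated as one element of the subspace. The key is to learn a small set of combination coefficients that construct a more expressive filter subspace, strengthening the original multi-head attention maps. Experiments and theoretical analysis show that the tuned filter subspace effectively expands the feature space of multi-head attention and further increases Transformer capacity. In addition, a residual parameterization stabilizes the tunable subspace coefficients, and dropout applied directly to the tunable coefficients during training acts as regularization that improves generalization. The method needs very few tunable parameters, combines seamlessly with existing PEFT methods, and outperforms existing PEFT baselines with almost no additional parameters.

Link: https://arxiv.org/abs/2503.18337
Authors: Zichen Miao,Wei Chen,Qiang Qiu
Affiliations: Purdue University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: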

Abstract:Transformer-based large pre-trained models have shown remarkable generalization ability, and various parameter-efficient fine-tuning (PEFT) methods have been proposed to customize these models on downstream tasks with minimal computational and memory budgets. Previous PEFT methods are primarily designed from a tensor-decomposition perspective that tries to effectively tune the linear transformation by finding the smallest subset of parameters to train. Our study adopts an orthogonal view by representing the attention operation as a graph convolution and formulating the multi-head attention maps as a convolutional filter subspace, with each attention map as a subspace element. In this paper, we propose to tune the large pre-trained transformers by learning a small set of combination coefficients that construct a more expressive filter subspace from the original multi-head attention maps. We show analytically and experimentally that the tuned filter subspace can effectively expand the feature space of the multi-head attention and further enhance the capacity of transformers. We further stabilize the fine-tuning with a residual parameterization of the tunable subspace coefficients, and enhance the generalization with a regularization design by directly applying dropout on the tunable coefficients during training. The tunable coefficients take a tiny number of parameters and can be combined with previous PEFT methods in a plug-and-play manner. Extensive experiments show that our approach achieves better performance than PEFT baselines with negligible additional parameters.
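
The combination-coefficient idea can be sketched as mixing the attention maps of different heads. The "identity plus alpha" form below is an assumption about what a residual parameterization could look like; the paper's exact formulation, and where dropout is applied, are not given in this digest, and all names are hypothetical.

```python
def combine_attention_maps(maps, alpha):
    """Mix H attention maps with residual coefficients.

    maps:  list of H attention maps, each an N x N list of rows.
    alpha: H x H tunable coefficients; output head i is
           maps[i] + sum_j alpha[i][j] * maps[j].
    """
    heads = len(maps)
    n = len(maps[0])
    out = []
    for i in range(heads):
        new_map = []
        for r in range(n):
            row = []
            for c in range(n):
                mix = sum(alpha[i][j] * maps[j][r][c] for j in range(heads))
                row.append(maps[i][r][c] + mix)
            new_map.append(row)
        out.append(new_map)
    return out


# Two heads over two tokens (hypothetical values).
maps = [[[1.0, 0.0], [0.5, 0.5]],
        [[0.0, 1.0], [0.25, 0.75]]]

zero = [[0.0, 0.0], [0.0, 0.0]]
tuned = [[0.0, 0.5], [0.0, 0.0]]  # head 0 borrows half of head 1

unchanged = combine_attention_maps(maps, zero)
mixed = combine_attention_maps(maps, tuned)
```

With all coefficients at zero the pre-trained attention maps are returned unchanged, so fine-tuning starts from the original model. Only the H x H coefficients per layer are trained, which is why the parameter count stays tiny.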

[CV-127] Mitigating Cache Noise in Test-Time Adaptation for Large Vision-Language Models ICME2025 ICLR2025

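[Quick Read]: This paper addresses the performance degradation of vision-language models caused by distribution shifts in downstream tasks, in the setting of test-time adaptation (TTA). Existing cache-based TTA methods depend mainly on the accuracy of cached feature labels, but noisy pseudo-labels can push features away from their true distribution, making similarity-based cache retrieval highly sensitive to outliers or extreme samples. Current methods also lack an effective mechanism for modeling class distributions, which limits how fully they exploit cached information. The paper proposes a comprehensive and reliable caching mechanism and a novel zero-shot TTA method, "Cache, Residual, Gaussian" (CRG). The key innovations are learnable residual parameters that better align positive and negative visual prototypes with text prototypes, combined with Gaussian Discriminant Analysis (GDA) that dynamically models intra-class feature distributions and thereby mitigates the effect of noisy features. Experiments show that CRG outperforms state-of-the-art TTA methods on 13 benchmarks, demonstrating strong robustness and adaptability.

Link: https://arxiv.org/abs/2503.18334
Authors: Haotian Zhai,Xinyu Chen,Can Zhang,Tianming Sha,Ruirui Li
Affiliations: University of Minnesota Twin Cities; Beijing University of Chemical Technology; Sun Yat-sen University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ICME 2025 and ICLR 2025 Workshop on Foundation Models in the Wild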

Abstract:Test-time adaptation (TTA) of visual language models has recently attracted significant attention as a solution to the performance degradation caused by distribution shifts in downstream tasks. However, existing cache-based TTA methods have certain limitations. They mainly rely on the accuracy of cached feature labels, and the presence of noisy pseudo-labels can cause these features to deviate from their true distribution. This makes cache retrieval methods based on similarity matching highly sensitive to outliers or extreme samples. Moreover, current methods lack effective mechanisms to model class distributions, which limits their ability to fully exploit the potential of cached information. To address these challenges, we introduce a comprehensive and reliable caching mechanism and propose a novel zero-shot TTA method called "Cache, Residual, Gaussian" (CRG). This method not only employs learnable residual parameters to better align positive and negative visual prototypes with text prototypes, thereby optimizing the quality of cached features, but also incorporates Gaussian Discriminant Analysis (GDA) to dynamically model intra-class feature distributions, further mitigating the impact of noisy features. Experimental results on 13 benchmarks demonstrate that CRG outperforms state-of-the-art TTA methods, showcasing exceptional robustness and adaptability.
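
Gaussian Discriminant Analysis classifies by class-conditional Gaussian likelihood rather than by nearest cached sample, which is why a single mislabelled feature has limited influence. Below is a minimal sketch assuming a shared diagonal variance and equal class priors, both simplifications chosen for brevity; the cache and the residual parameters of CRG are not modelled, and the data and names are hypothetical.

```python
def fit_gda(features, labels):
    """Fit per-class means and a shared diagonal variance."""
    classes = sorted(set(labels))
    dim = len(features[0])
    means = {}
    for c in classes:
        rows = [f for f, y in zip(features, labels) if y == c]
        means[c] = [sum(r[d] for r in rows) / len(rows) for d in range(dim)]
    var = [0.0] * dim
    for f, y in zip(features, labels):
        for d in range(dim):
            var[d] += (f[d] - means[y][d]) ** 2
    var = [v / len(features) + 1e-6 for v in var]  # small floor for stability
    return means, var


def predict(x, means, var):
    """Pick the class with the highest Gaussian log-likelihood (equal priors)."""
    def score(mu):
        return -0.5 * sum((xi - m) ** 2 / v for xi, m, v in zip(x, mu, var))
    return max(means, key=lambda c: score(means[c]))


# Cached features with pseudo-labels (hypothetical); the last sample looks
# like class 0 but carries the noisy label 1.
feats = [[1.0, 0.1], [0.9, 0.0], [1.1, 0.2], [0.0, 1.0], [0.1, 0.9], [0.9, 0.1]]
labels = [0, 0, 0, 1, 1, 1]
means, var = fit_gda(feats, labels)
pred = predict([1.0, 0.0], means, var)
```

A nearest-neighbour lookup could match the query `[1.0, 0.0]` to the mislabelled sample `[0.9, 0.1]`. The class means average that noise away, and the query is still assigned to class 0.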

[CV-128] TensoFlow: Tensorial Flow-based Sampler for Inverse Rendering CVPR2025

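[Quick Read]: This paper addresses a weakness of existing inverse rendering methods: when evaluating the rendering equation they use pre-defined, non-learnable importance samplers, which match the spatially and directionally varying integrand poorly and lead to high variance and suboptimal performance. The key solution is a spatially and directionally aware learnable importance sampler, TensoFlow. The sampler is parameterized by normalizing flows, supporting both directional sampling of incident light and probability density function (PDF) inference, and it is combined with a tensorial representation of the scene space to capture the spatial characteristics of the sampling distribution. This adapts more accurately and flexibly to the complexity of typical scenes and markedly improves inverse rendering results.

Link: https://arxiv.org/abs/2503.18328
Authors: Chun Gu,Xiaofei Wei,Li Zhang,Xiatian Zhu
Affiliations: School of Data Science, Fudan University; Shanghai Innovation Institute; University of Surrey
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: CVPR 2025. Code: this https URL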

Abstract:Inverse rendering aims to recover scene geometry, material properties, and lighting from multi-view images. Given the complexity of light-surface interactions, importance sampling is essential for the evaluation of the rendering equation, as it reduces variance and enhances the efficiency of Monte Carlo sampling. Existing inverse rendering methods typically rely on manually pre-defined, non-learnable importance samplers, which struggle to match the spatially and directionally varied integrand, resulting in high variance and suboptimal performance. To address this limitation, we propose the concept of learning a spatially and directionally aware importance sampler for the rendering equation to accurately and flexibly capture the unconstrained complexity of a typical scene. We further formulate TensoFlow, a generic approach for sampler learning in inverse rendering, enabling a close match to the integrand of the rendering equation both spatially and directionally. Concretely, our sampler is parameterized by normalizing flows, allowing both directional sampling of incident light and probability density function (PDF) inference. To capture the characteristics of the sampler spatially, we learn a tensorial representation of the scene space, which imposes spatial conditions, together with reflected direction, leading to spatially and directionally aware sampling distributions. Our model can be optimized by minimizing the difference between the integrand and our normalizing flow. Extensive experiments validate the superiority of TensoFlow over prior alternatives on both synthetic and real-world benchmarks.
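
The variance argument behind importance sampling can be checked on a toy 1D integral. The sketch below uses a fixed analytic PDF, not a learned normalizing flow, and the integrand is a hypothetical stand-in for the rendering equation; it only illustrates that a sampler closer in shape to the integrand gives a lower-variance estimate.

```python
import random


def integrand(x):
    """Toy integrand whose integral over [0, 1] is exactly 1."""
    return 3.0 * x * x


def estimate_uniform(n, rng):
    """Plain Monte Carlo with x ~ Uniform(0, 1), pdf = 1."""
    return sum(integrand(rng.random()) for _ in range(n)) / n


def estimate_importance(n, rng):
    """Importance sampling with pdf p(x) = 2x, sampled as x = sqrt(u)."""
    total = 0.0
    for _ in range(n):
        u = 1.0 - rng.random()  # u in (0, 1], so x > 0 and p(x) > 0
        x = u ** 0.5
        total += integrand(x) / (2.0 * x)
    return total / n


def spread(estimator, runs=200, n=64):
    """Sample variance of the estimator over repeated runs."""
    rng = random.Random(0)
    vals = [estimator(n, rng) for _ in range(runs)]
    mean = sum(vals) / runs
    return sum((v - mean) ** 2 for v in vals) / (runs - 1)


var_uniform = spread(estimate_uniform)
var_importance = spread(estimate_importance)
```

The linear PDF follows the quadratic integrand more closely than a uniform one, so its estimates fluctuate less for the same sample count. A PDF proportional to the integrand would remove the variance entirely, which is the motivation for learning the sampler.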

[CV-129] Towards Training-free Anomaly Detection with Vision and Language Foundation Models CVPR2025

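[Quick Read]: This paper addresses the fact that existing anomaly detection methods focus mainly on local structural anomalies and neglect compositional anomalies that involve logical constraints. It proposes LogSAD, a novel multi-modal framework that performs both logical and structural anomaly detection without training. The key innovations are as follows. First, a match-of-thought architecture uses advanced large multi-modal models (such as GPT-4V) to generate matching proposals, formulating the interests and compositional rules used for anomaly detection. Second, a multi-granularity anomaly detection module combines vision and language foundation models to perform matching over patch tokens, sets of interests, and cross-modal composition. Third, a calibration module aligns the anomaly scores from the different detectors, and an integration strategy produces the final decision. LogSAD thus handles logical and structural anomaly detection within a unified framework and achieves state-of-the-art results without labeled training data, even when compared with supervised approaches.

Link: https://arxiv.org/abs/2503.18325
Authors: Jinjin Zhang,Guodong Wang,Yizhou Jin,Di Huang
Affiliations: State Key Laboratory of Complex and Critical Software Environment, Beihang University, Beijing 100191, China; School of Computer Science and Engineering, Beihang University, Beijing 100191, China
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to CVPR 2025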

Abstract:Anomaly detection is valuable for real-world applications, such as industrial quality inspection. However, most approaches focus on detecting local structural anomalies while neglecting compositional anomalies incorporating logical constraints. In this paper, we introduce LogSAD, a novel multi-modal framework that requires no training for both Logical and Structural Anomaly Detection. First, we propose a match-of-thought architecture that employs advanced large multi-modal models (i.e. GPT-4V) to generate matching proposals, formulating interests and compositional rules of thought for anomaly detection. Second, we elaborate on multi-granularity anomaly detection, consisting of patch tokens, sets of interests, and composition matching with vision and language foundation models. Subsequently, we present a calibration module to align anomaly scores from different detectors, followed by integration strategies for the final decision. Consequently, our approach addresses both logical and structural anomaly detection within a unified framework and achieves state-of-the-art results without the need for training, even when compared to supervised approaches, highlighting its robustness and effectiveness. Code is available at this https URL.

[CV-130] Plug-and-Play Interpretable Responsible Text-to-Image Generation via Dual-Space Multi-facet Concept Control

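[Quick read]: This paper addresses ethical issues in text-to-image (T2I) generation, specifically how to make generated content fair and safe (non-violent and non-explicit). Existing methods handle the facets of responsibility concepts one at a time, lack interpretability, and usually require modifying the original model, which compromises its performance. The paper proposes an external plug-and-play mechanism that accounts for a broad range of concepts simultaneously and scalably, without altering the target model. The key idea is to use knowledge distillation and concept whitening to distill the target T2I pipeline into an interpretable composite responsible space conditioned on that pipeline. At inference, this learned space modulates the generated content. A typical T2I pipeline offers two plug-in points, the text embedding space and the diffusion model latent space; the paper develops modules for both and reports strong results.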

Link: https://arxiv.org/abs/2503.18324
Authors: Basim Azam,Naveed Akhtar
Affiliations: School of Computing and Information Systems, The University of Melbourne
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Ethical issues around text-to-image (T2I) models demand comprehensive control over the generative content. Existing techniques addressing these issues for responsible T2I models aim for the generated content to be fair and safe (non-violent/explicit). However, these methods remain limited to handling the facets of responsibility concepts individually, while also lacking in interpretability. Moreover, they often require alteration to the original model, which compromises the model performance. In this work, we propose a unique technique to enable responsible T2I generation by simultaneously accounting for an extensive range of concepts for fair and safe content generation in a scalable manner. The key idea is to distill the target T2I pipeline with an external plug-and-play mechanism that learns an interpretable composite responsible space for the desired concepts, conditioned on the target T2I pipeline. We use knowledge distillation and concept whitening to enable this. At inference, the learned space is utilized to modulate the generative content. A typical T2I pipeline presents two plug-in points for our approach, namely the text embedding space and the diffusion model latent space. We develop modules for both points and show the effectiveness of our approach with a range of strong results.

[CV-131] Diff-Palm: Realistic Palmprint Generation with Polynomial Creases and Intra-Class Variation Controllable Diffusion Models CVPR2025

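[Quick read]: This paper addresses the performance limits that palmprint recognition faces because large-scale public datasets are lacking. Earlier methods simulate palm creases with Bézier curves and feed them to conditional GANs to generate realistic palmprints. Without fine-tuning on real data, however, recognition models trained on these synthetic datasets degrade sharply, which indicates a large gap between generated and real palmprints. The gap stems mainly from an inaccurate crease representation and from the difficulty of balancing intra-class variation against identity consistency.

To address this, the paper proposes a polynomial-based palm crease representation, which yields a crease generation mechanism closer to the real distribution. It also introduces a palm-crease-conditioned diffusion model with a novel intra-class variation control method. With the proposed K-step noise-sharing sampling, it synthesizes palmprint datasets that have both large intra-class variation and high identity consistency. The experiments show, for the first time, that recognition models trained solely on synthetic data, without any fine-tuning, outperform models trained on real datasets, and that recognition performance keeps improving as the number of generated identities grows.

The key to the solution is the new crease representation and the corresponding diffusion model, combined with the intra-class variation control strategy, which together improve the realism and diversity of the generated data and narrow the gap to real data.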

Link: https://arxiv.org/abs/2503.18312
Authors: Jianlong Jin,Chenglong Zhao,Ruixin Zhang,Sheng Shang,Jianqing Xu,Jingyun Zhang,ShaoMing Wang,Yang Zhao,Shouhong Ding,Wei Jia,Yunsheng Wu
Affiliations: Hefei University of Technology; Tencent Youtu Lab; Tencent WeChat Pay Lab
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by CVPR2025

Abstract:Palmprint recognition is significantly limited by the lack of large-scale publicly available datasets. Previous methods have adopted Bézier curves to simulate the palm creases, which then serve as input for conditional GANs to generate realistic palmprints. However, without employing real data fine-tuning, the performance of the recognition model trained on these synthetic datasets would drastically decline, indicating a large gap between generated and real palmprints. This is primarily due to the utilization of an inaccurate palm crease representation and challenges in balancing intra-class variation with identity consistency. To address this, we introduce a polynomial-based palm crease representation that provides a new palm crease generation mechanism more closely aligned with the real distribution. We also propose the palm creases conditioned diffusion model with a novel intra-class variation control method. By applying our proposed K-step noise-sharing sampling, we are able to synthesize palmprint datasets with large intra-class variation and high identity consistency. Experimental results show that, for the first time, recognition models trained solely on our synthetic datasets, without any fine-tuning, outperform those trained on real datasets. Furthermore, our approach achieves superior recognition performance as the number of generated identities increases.
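
Illustrative sketch (not from the paper): a polynomial crease representation can be pictured as a curve whose shape is set by a handful of coefficients. The degree, the coefficient range, the unit interval for x, and the names `eval_poly` and `sample_crease` are assumptions for this example; the abstract does not say how Diff-Palm fits or samples its polynomials.

```python
import random

def eval_poly(coeffs, x):
    """Evaluate a polynomial with Horner's rule; coeffs run from highest to lowest degree."""
    y = 0.0
    for c in coeffs:
        y = y * x + c
    return y

def sample_crease(degree=3, n_points=20, seed=0):
    """Sample one hypothetical crease as (x, y) points on a random polynomial curve."""
    rng = random.Random(seed)
    coeffs = [rng.uniform(-1.0, 1.0) for _ in range(degree + 1)]
    xs = [i / (n_points - 1) for i in range(n_points)]
    return coeffs, [(x, eval_poly(coeffs, x)) for x in xs]

coeffs, crease = sample_crease()
```

In a pipeline of this kind, a few such curves would be rasterized into a crease map that conditions the generator; that step is omitted here.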

[CV-132] Image-to-Text for Medical Reports Using Adaptive Co-Attention and Triple-LSTM Module

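[Quick read]: This paper addresses two problems in medical report generation: general-purpose large models often fail to capture specialized domain knowledge, and the repetition and similarity inherent in medical data make feature extraction difficult and encourage overfitting. It proposes the Co-Attention Triple-LSTM Network (CA-TriNet), a multimodal deep learning model. Its Co-Attention module couples a vision transformer with a text transformer and adds an adaptive weight operator, improving the model's ability to tell apart medical images and labels that differ only subtly. Its Triple-LSTM module then refines the generated sentences using targeted image objects. On three public datasets, CA-TriNet outperforms state-of-the-art models in comprehensive ability and even surpasses pre-trained large language models on some metrics.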

Link: https://arxiv.org/abs/2503.18297
Authors: Yishen Liu,Shengda Liu,Hudan Pan
Affiliations: Chinese Medicine Guangdong Laboratory
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Medical report generation requires specialized expertise that general large models often fail to accurately capture. Moreover, the inherent repetition and similarity in medical data make it difficult for models to extract meaningful features, resulting in a tendency to overfit. So in this paper, we propose a multimodal model, Co-Attention Triple-LSTM Network (CA-TriNet), a deep learning model that combines transformer architectures with a Multi-LSTM network. Its Co-Attention module synergistically links a vision transformer with a text transformer to better differentiate medical images with similarities, augmented by an adaptive weight operator to catch and amplify image labels with minor similarities. Furthermore, its Triple-LSTM module refines generated sentences using targeted image objects. Extensive evaluations over three public datasets have demonstrated that CA-TriNet outperforms state-of-the-art models in terms of comprehensive ability, even pre-trained large language models on some metrics.

[CV-133] LGPS: A Lightweight GAN-Based Approach for Polyp Segmentation in Colonoscopy Images

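[Quick read]: This paper addresses the challenges of polyp segmentation in colonoscopy for early detection of colorectal cancer (CRC): high computational cost, difficulty segmenting small or low-contrast polyps, and limited generalization across datasets. It proposes LGPS, a lightweight GAN-based framework for polyp segmentation, with three key innovations: (1) a MobileNetV2 backbone enhanced with modified residual blocks and Squeeze-and-Excitation (ResE) modules for efficient feature extraction; (2) Convolutional Conditional Random Fields (ConvCRF) for precise boundary refinement; and (3) a hybrid loss that combines Binary Cross-Entropy, Weighted IoU Loss, and Dice Loss to mitigate class imbalance and improve segmentation accuracy. With only 1.07 million parameters, LGPS outperforms state-of-the-art (SOTA) methods on several benchmark datasets; on the PolypGen test set it reaches a Dice of 0.7299 and an IoU of 0.7867, demonstrating robust generalization.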

Link: https://arxiv.org/abs/2503.18294
Authors: Fiseha B. Tesema,Alejandro Guerra Manzanares,Tianxiang Cui,Qian Zhang,Moses Solomon,Sean He
Affiliations: University of Nottingham Ningbo China (UNNC); School of Computer Science, UNNC, Ningbo, China; Department of Chemical and Environmental Engineering, UNNC, Ningbo, China
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages, 56 Figures

Abstract:Colorectal cancer (CRC) is a major global cause of cancer-related deaths, with early polyp detection and removal during colonoscopy being crucial for prevention. While deep learning methods have shown promise in polyp segmentation, challenges such as high computational costs, difficulty in segmenting small or low-contrast polyps, and limited generalizability across datasets persist. To address these issues, we propose LGPS, a lightweight GAN-based framework for polyp segmentation. LGPS incorporates three key innovations: (1) a MobileNetV2 backbone enhanced with modified residual blocks and Squeeze-and-Excitation (ResE) modules for efficient feature extraction; (2) Convolutional Conditional Random Fields (ConvCRF) for precise boundary refinement; and (3) a hybrid loss function combining Binary Cross-Entropy, Weighted IoU Loss, and Dice Loss to address class imbalance and enhance segmentation accuracy. LGPS is validated on five benchmark datasets and compared with state-of-the-art (SOTA) methods. On the largest and challenging PolypGen test dataset, LGPS achieves a Dice of 0.7299 and an IoU of 0.7867, outperforming all SOTA works and demonstrating robust generalization. With only 1.07 million parameters, LGPS is 17 times smaller than the smallest existing model, making it highly suitable for real-time clinical applications. Its lightweight design and strong performance underscore its potential for improving early CRC diagnosis. Code is available at this https URL.
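
Illustrative sketch (not from the paper): a hybrid segmentation loss of the kind named above can be written as a weighted sum of three terms. The equal coefficients, the per-pixel weighting scheme (uniform by default), and all function names are assumptions for this example; the abstract does not give LGPS's exact formulation. Masks are flattened to plain lists to keep the sketch dependency-free.

```python
import math

def bce_loss(pred, target, eps=1e-7):
    """Mean binary cross-entropy over pixels; probabilities are clipped for stability."""
    total = 0.0
    for p, t in zip(pred, target):
        p = min(max(p, eps), 1.0 - eps)
        total -= t * math.log(p) + (1 - t) * math.log(1.0 - p)
    return total / len(pred)

def dice_loss(pred, target, eps=1e-7):
    """One minus the soft Dice coefficient."""
    inter = sum(p * t for p, t in zip(pred, target))
    return 1.0 - (2.0 * inter + eps) / (sum(pred) + sum(target) + eps)

def weighted_iou_loss(pred, target, weights, eps=1e-7):
    """One minus a soft IoU in which each pixel carries its own weight."""
    inter = sum(w * p * t for w, p, t in zip(weights, pred, target))
    union = sum(w * (p + t - p * t) for w, p, t in zip(weights, pred, target))
    return 1.0 - (inter + eps) / (union + eps)

def hybrid_loss(pred, target, weights=None, coeffs=(1.0, 1.0, 1.0)):
    """Weighted sum of BCE, weighted IoU, and Dice losses (coefficients are hypothetical)."""
    if weights is None:
        weights = [1.0] * len(pred)
    a, b, c = coeffs
    return (a * bce_loss(pred, target)
            + b * weighted_iou_loss(pred, target, weights)
            + c * dice_loss(pred, target))

mask = [1, 0, 1, 0]
good = hybrid_loss([0.9, 0.1, 0.8, 0.2], mask)  # mostly correct prediction
bad = hybrid_loss([0.1, 0.9, 0.2, 0.8], mask)   # mostly wrong prediction
```

The overlap-based terms (IoU, Dice) depend on the ratio of intersection to region size rather than on a per-pixel average, which is why losses of this family are commonly paired with cross-entropy when foreground pixels are rare.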

[CV-134] CO-SPY: Combining Semantic and Pixel Features to Detect Synthetic Images by AI

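[Quick read]: This paper addresses the misuse risks of generative AI, specifically how to distinguish real images from AI-synthesized ones. Existing detectors often generalize poorly: they work only for certain types of generative models and are vulnerable to post-processing such as JPEG compression. The proposed framework, Co-Spy, first enhances existing semantic features (e.g., the number of fingers on a hand) and artifact features (e.g., pixel value differences), then adaptively integrates them for more general and robust synthetic image detection. To support the study, the authors build Co-Spy-Bench, a comprehensive dataset covering 5 real image datasets and 22 state-of-the-art generative models, and collect 50k in-the-wild synthetic images from the Internet for a more practical evaluation. Under identical training conditions, the Co-Spy detector outperforms existing methods, with an average accuracy improvement of roughly 11% to 34%.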

Link: https://arxiv.org/abs/2503.18286
Authors: Siyuan Cheng,Lingjuan Lyu,Zhenting Wang,Xiangyu Zhang,Vikash Sehwag
Affiliations: Purdue University; Sony AI; Rutgers University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:With the rapid advancement of generative AI, it is now possible to synthesize high-quality images in a few seconds. Despite the power of these technologies, they raise significant concerns regarding misuse. Current efforts to distinguish between real and AI-generated images may lack generalization, being effective for only certain types of generative models and susceptible to post-processing techniques like JPEG compression. To overcome these limitations, we propose a novel framework, Co-Spy, that first enhances existing semantic features (e.g., the number of fingers in a hand) and artifact features (e.g., pixel value differences), and then adaptively integrates them to achieve more general and robust synthetic image detection. Additionally, we create Co-Spy-Bench, a comprehensive dataset comprising 5 real image datasets and 22 state-of-the-art generative models, including the latest models like FLUX. We also collect 50k synthetic images in the wild from the Internet to enable evaluation in a more practical setting. Our extensive evaluations demonstrate that our detector outperforms existing methods under identical training conditions, achieving an average accuracy improvement of approximately 11% to 34%. The code is available at this https URL.

[CV-135] Voxel-based Point Cloud Geometry Compression with Space-to-Channel Context

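[Quick read]: This paper addresses the restricted receptive field of voxel-based methods for point cloud geometry compression, which is especially limiting for high-bit-depth point clouds, and covers both dense point clouds and sparse point clouds of varying sparsity. The key solution is a stage-wise Space-to-Channel (S2C) context model. For dense point clouds and low-level sparse point clouds, a channel-wise autoregressive strategy integrates neighborhood information at a coarse resolution. For high-level sparse point clouds, the paper further proposes a level-wise S2C context model that incorporates Geometry Residual Coding (GRC) for consistent-resolution cross-level prediction. It also adopts the spherical coordinate system for its compact representation and strengthens GRC with a large-kernel Residual Probability Approximation (RPA) module. Experiments show that the S2C context model saves bits while maintaining or improving reconstruction quality, and lowers computational complexity relative to state-of-the-art voxel-based compression methods.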

Link: https://arxiv.org/abs/2503.18283
Authors: Bojun Liu,Yangzhi Ma,Ao Luo,Li Li,Dong Liu
Affiliations: University of Science and Technology of China; Waseda University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Voxel-based methods are among the most efficient for point cloud geometry compression, particularly with dense point clouds. However, they face limitations due to a restricted receptive field, especially when handling high-bit depth point clouds. To overcome this issue, we introduce a stage-wise Space-to-Channel (S2C) context model for both dense point clouds and low-level sparse point clouds. This model utilizes a channel-wise autoregressive strategy to effectively integrate neighborhood information at a coarse resolution. For high-level sparse point clouds, we further propose a level-wise S2C context model that addresses resolution limitations by incorporating Geometry Residual Coding (GRC) for consistent-resolution cross-level prediction. Additionally, we use the spherical coordinate system for its compact representation and enhance our GRC approach with a Residual Probability Approximation (RPA) module, which features a large kernel size. Experimental results show that our S2C context model not only achieves bit savings while maintaining or improving reconstruction quality but also reduces computational complexity compared to state-of-the-art voxel-based compression methods.
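
Illustrative sketch (not from the paper): one common reading of "space-to-channel" is the 3D analogue of space-to-depth, where each 2x2x2 block of fine voxels is folded into a single coarse voxel with 8 channels. That reading, the block size, the channel ordering, and the name `space_to_channel` are assumptions for this example; the context model that then predicts the channels autoregressively is not shown.

```python
def space_to_channel(grid):
    """Fold each 2x2x2 block of a binary occupancy grid into one coarse voxel with 8 channels."""
    d, h, w = len(grid), len(grid[0]), len(grid[0][0])
    if d % 2 or h % 2 or w % 2:
        raise ValueError("all grid dimensions must be even")
    coarse = []
    for z in range(0, d, 2):
        plane = []
        for y in range(0, h, 2):
            row = []
            for x in range(0, w, 2):
                # Channel order: dz varies slowest, dx fastest.
                row.append([grid[z + dz][y + dy][x + dx]
                            for dz in (0, 1) for dy in (0, 1) for dx in (0, 1)])
            plane.append(row)
        coarse.append(plane)
    return coarse

# A 2x2x2 grid with two occupied voxels becomes a single coarse voxel with 8 channels.
fine = [[[1, 0], [0, 0]],
        [[0, 0], [0, 1]]]
coarse = space_to_channel(fine)
```

After the rearrangement, a convolution over the coarse grid sees twice the spatial extent per axis for the same kernel size, which is the usual motivation for trading resolution for channels.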

[CV-136] TrackID3x3: A Dataset and Algorithm for Multi-Player Tracking with Identification and Pose Estimation in 3x3 Basketball Full-court Videos

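[Quick read]: This paper addresses data scarcity and technical challenges for multi-object tracking, player identification, and pose estimation in less mainstream sports, in particular 3x3 basketball filmed with fixed cameras. Existing work targets mainstream team sports such as soccer and conventional 5-on-5 basketball, and often overlooks the fixed-camera setups common at amateur level as well as datasets with pose annotations. The paper proposes TrackID3x3, the first publicly available comprehensive dataset for multi-player tracking, player identification, and pose estimation in 3x3 basketball. It also introduces the Track-ID task, a simplified variant of game state reconstruction that excludes field detection and focuses on fixed-camera scenarios. The key elements are a dataset with three distinct subsets, a baseline Track-ID algorithm tailored to assessing tracking and identification quality, and benchmark experiments with recent multi-object tracking algorithms and top-down pose estimation methods that show robust results and expose the remaining challenges. Together these provide a solid foundation for automated analytics in 3x3 basketball.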

Link: https://arxiv.org/abs/2503.18282
Authors: Kazuhiro Yamada,Li Yin,Qingrui Hu,Ning Ding,Shunsuke Iwashita,Jun Ichikawa,Kiwamu Kotani,Calvin Yeung,Keisuke Fujii
Affiliations: Nagoya University; Nagoya Institute of Technology; Shizuoka University; Ryutsu Keizai University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Multi-object tracking, player identification, and pose estimation are fundamental components of sports analytics, essential for analyzing player movements, performance, and tactical strategies. However, existing datasets and methodologies primarily target mainstream team sports such as soccer and conventional 5-on-5 basketball, often overlooking scenarios involving fixed-camera setups commonly used at amateur levels, less mainstream sports, or datasets that explicitly incorporate pose annotations. In this paper, we propose the TrackID3x3 dataset, the first publicly available comprehensive dataset specifically designed for multi-player tracking, player identification, and pose estimation in 3x3 basketball scenarios. The dataset comprises three distinct subsets (Indoor fixed-camera, Outdoor fixed-camera, and Drone camera footage), capturing diverse full-court camera perspectives and environments. We also introduce the Track-ID task, a simplified variant of the game state reconstruction task that excludes field detection and focuses exclusively on fixed-camera scenarios. To evaluate performance, we propose a baseline algorithm called Track-ID algorithm, tailored to assess tracking and identification quality. Furthermore, our benchmark experiments, utilizing recent multi-object tracking algorithms (e.g., BoT-SORT-ReID) and top-down pose estimation methods (HRNet, RTMPose, and SwinPose), demonstrate robust results and highlight remaining challenges. Our dataset and evaluation benchmarks provide a solid foundation for advancing automated analytics in 3x3 basketball. Dataset and code will be available at this https URL.

[CV-137] TopV: Compatible Token Pruning with Inference Time Optimization for Fast and Low-Memory Multimodal Vision Language Model CVPR2025

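[Quick read]: This paper addresses the high computational cost of Vision-Language Models (VLMs) at inference, which is driven largely by the many visual input tokens. Existing pruning methods rely on greedy heuristic criteria for token importance and are incompatible with FlashAttention and the KV cache. The paper proposes TopV, a token pruning technique compatible with inference-time optimizations that prunes efficiently without additional training or fine-tuning. The key is to formulate token pruning as an optimization problem rather than relying on attention scores, which lets the method identify important visual tokens accurately while remaining compatible with FlashAttention. Because pruning runs only once, during the prefilling stage, it also shrinks the KV cache. The optimization framework uses a visual-aware cost function that considers feature similarity, relative spatial distance, and absolute central distance to measure the importance of each source visual token, so low-importance tokens can be pruned effectively. Experiments show that the method outperforms previous token pruning methods.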

Link: https://arxiv.org/abs/2503.18278
Authors: Cheng Yang,Yang Sui,Jinqi Xiao,Lingyi Huang,Yu Gong,Chendi Li,Jinghua Yan,Yu Bai,Ponnuswamy Sadayappan,Xia Hu,Bo Yuan
Affiliations: Rutgers University; Rice University; The University of Utah; California State University, Fullerton
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted by CVPR 2025

Abstract:Vision-Language Models (VLMs) demand substantial computational resources during inference, largely due to the extensive visual input tokens for representing visual information. Previous studies have noted that visual tokens tend to receive less attention than text tokens, suggesting their lower importance during inference and potential for pruning. However, their methods encounter several challenges: reliance on greedy heuristic criteria for token importance and incompatibility with FlashAttention and KV cache. To address these issues, we introduce TopV, a compatible TOken Pruning with inference Time Optimization for fast and low-memory VLM, achieving efficient pruning without additional training or fine-tuning. Instead of relying on attention scores, we formulate token pruning as an optimization problem, accurately identifying important visual tokens while remaining compatible with FlashAttention. Additionally, since we only perform this pruning once during the prefilling stage, it effectively reduces KV cache size. Our optimization framework incorporates a visual-aware cost function considering factors such as Feature Similarity, Relative Spatial Distance, and Absolute Central Distance, to measure the importance of each source visual token, enabling effective pruning of low-importance tokens. Extensive experiments demonstrate that our method outperforms previous token pruning methods, validating the effectiveness and efficiency of our approach.
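
Illustrative sketch (not from the paper): the three factors named above can be combined into a single pairwise cost between a source and a target visual token. The additive form, the weights, the sign of each term, the token layout (a dict with `feat` and `pos`), and the name `token_cost` are all assumptions for this example. The optimization that TopV solves over such costs is not specified in the abstract and is not shown.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two non-zero feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def token_cost(src, dst, center, weights=(1.0, 1.0, 1.0)):
    """Hypothetical cost between a source and a target visual token.

    Each token is a dict with 'feat' (list of floats) and 'pos' ((row, col) on the patch grid).
    """
    w_feat, w_rel, w_abs = weights
    feat_term = 1.0 - cosine_similarity(src["feat"], dst["feat"])  # feature dissimilarity
    rel_term = math.dist(src["pos"], dst["pos"])                   # relative spatial distance
    abs_term = math.dist(src["pos"], center)                       # source distance from image centre
    return w_feat * feat_term + w_rel * rel_term + w_abs * abs_term

a = {"feat": [1.0, 0.0], "pos": (0, 0)}
b = {"feat": [0.0, 1.0], "pos": (3, 4)}
cost_ab = token_cost(a, b, center=(0, 0))
```

Because a cost of this kind depends only on token features and grid positions, it can be computed without reading attention maps, which is consistent with the FlashAttention compatibility the abstract emphasizes.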

[CV-138] GI-SLAM: Gaussian-Inertial SLAM

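[Quick read]: This paper addresses the fact that existing SLAM methods based on 3D Gaussian Splatting (3DGS) largely ignore data from the inertial measurement unit (IMU). It proposes GI-SLAM, which consists of an IMU-enhanced camera tracking module and a realistic 3D Gaussian-based scene representation for mapping. The key is an IMU loss that integrates seamlessly into the deep learning framework underlying 3DGS SLAM, improving the accuracy, robustness, and efficiency of camera tracking. The system also supports a wide range of sensor configurations, including monocular, stereo, and RGBD cameras, each with or without an IMU. On the EuRoC and TUM-RGBD datasets, GI-SLAM performs competitively with existing state-of-the-art real-time methods.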

Link: https://arxiv.org/abs/2503.18275
Authors: Xulang Liu,Ning Tan
Affiliations: Sun Yat-sen University
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages, 2 figures, 5 tables

Abstract:3D Gaussian Splatting (3DGS) has recently emerged as a powerful representation of geometry and appearance for dense Simultaneous Localization and Mapping (SLAM). Through rapid, differentiable rasterization of 3D Gaussians, many 3DGS SLAM methods achieve near real-time rendering and accelerated training. However, these methods largely overlook inertial data, which is a critical piece of information collected from the inertial measurement unit (IMU). In this paper, we present GI-SLAM, a novel Gaussian-inertial SLAM system which consists of an IMU-enhanced camera tracking module and a realistic 3D Gaussian-based scene representation for mapping. Our method introduces an IMU loss that seamlessly integrates into the deep learning framework underpinning 3D Gaussian Splatting SLAM, effectively enhancing the accuracy, robustness and efficiency of camera tracking. Moreover, our SLAM system supports a wide range of sensor configurations, including monocular, stereo, and RGBD cameras, both with and without IMU integration. Our method achieves competitive performance compared with existing state-of-the-art real-time methods on the EuRoC and TUM-RGBD datasets.

[CV-139] Enhancing Dataset Distillation via Non-Critical Region Refinement CVPR2025

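[Quick read]: This paper addresses the difficulty of balancing instance-specific features and class-general features in dataset distillation. Previous methods either focus only on class-general patterns and neglect instance details, or prioritize instance-specific features and overlook the shared characteristics of a class. The paper proposes Non-Critical Region Refinement Dataset Distillation (NRR-DD), which preserves instance-specific details and fine-grained regions in the synthetic data while enriching non-critical regions with class-general information, so that models can exploit all pixel information and perform better overall. It also introduces Distance-Based Representative (DBR) knowledge transfer, which removes the need for soft labels by relying on the distance between predictions on synthetic data and one-hot encoded labels. The key points are that NRR-DD captures both feature types and that DBR sharply reduces the dependence on soft labels. Experiments show state-of-the-art performance on both small- and large-scale datasets, with comparable results across settings while storing only two distances per instance.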

Link: https://arxiv.org/abs/2503.18267
Authors: Minh-Tuan Tran,Trung Le,Xuan-May Le,Thanh-Toan Do,Dinh Phung
Affiliations: Monash University; The University of Melbourne
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at CVPR 2025

Abstract:Dataset distillation has become a popular method for compressing large datasets into smaller, more efficient representations while preserving critical information for model training. Data features are broadly categorized into two types: instance-specific features, which capture unique, fine-grained details of individual examples, and class-general features, which represent shared, broad patterns across a class. However, previous approaches often struggle to balance these features: some focus solely on class-general patterns, neglecting finer instance details, while others prioritize instance-specific features, overlooking the shared characteristics essential for class-level understanding. In this paper, we introduce the Non-Critical Region Refinement Dataset Distillation (NRR-DD) method, which preserves instance-specific details and fine-grained regions in synthetic data while enriching non-critical regions with class-general information. This approach enables models to leverage all pixel information, capturing both feature types and enhancing overall performance. Additionally, we present Distance-Based Representative (DBR) knowledge transfer, which eliminates the need for soft labels in training by relying on the distance between synthetic data predictions and one-hot encoded labels. Experimental results show that NRR-DD achieves state-of-the-art performance on both small- and large-scale datasets. Furthermore, by storing only two distances per instance, our method delivers comparable results across various settings. The code is available at this https URL.
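
Illustrative sketch (not from the paper): the quantity DBR relies on, the distance between a model's prediction on a synthetic image and the one-hot label, can be computed as below. The choice of softmax probabilities, the Euclidean metric, and the name `distance_to_one_hot` are assumptions for this example; the abstract does not say which two distances are stored per instance or how they are used during training.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distance_to_one_hot(logits, label):
    """Euclidean distance between the softmax prediction and the one-hot vector for `label`."""
    probs = softmax(logits)
    return math.sqrt(sum((p - (1.0 if i == label else 0.0)) ** 2
                         for i, p in enumerate(probs)))

uncertain = distance_to_one_hot([0.0, 0.0], label=0)    # prediction is [0.5, 0.5]
confident = distance_to_one_hot([100.0, 0.0], label=0)  # prediction is nearly one-hot
```

A scalar like this is far cheaper to store than a full soft-label vector per synthetic image, which is the storage argument the abstract makes.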

[CV-140] Surface-Aware Distilled 3D Semantic Features

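[Quick read]: This paper addresses the substantial mapping errors that arise when semantic features from pre-trained vision models are used for correspondence matching, because such features struggle to distinguish different instances of the same semantic class (e.g., "left hand" versus "right hand"). It learns a surface-aware embedding space that is robust to these ambiguities. The key is a contrastive loss that preserves the semantic content of the features distilled from foundation models while disambiguating features located far apart on the shape's surface. The method is self-supervised and needs only a small number of unpaired training meshes to infer features for new 3D shapes at test time.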

Link: https://arxiv.org/abs/2503.18254
Authors: Lukas Uzolas,Elmar Eisemann,Petr Kellnhofer
Affiliations: Delft University of Technology
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Comments:

Abstract:Many 3D tasks such as pose alignment, animation, motion transfer, and 3D reconstruction rely on establishing correspondences between 3D shapes. This challenge has recently been approached by matching of semantic features from pre-trained vision models. However, despite their power, these features struggle to differentiate instances of the same semantic class such as “left hand” versus “right hand” which leads to substantial mapping errors. To solve this, we learn a surface-aware embedding space that is robust to these ambiguities. Importantly, our approach is self-supervised and requires only a small number of unpaired training meshes to infer features for new 3D shapes at test time. We achieve this by introducing a contrastive loss that preserves the semantic content of the features distilled from foundational models while disambiguating features located far apart on the shape’s surface. We observe superior performance in correspondence matching benchmarks and enable downstream applications including in-part segmentation, pose alignment, and motion transfer. The project site is available at this https URL.
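
Illustrative sketch (not from the paper): the disambiguation half of such a contrastive objective can be pictured as a hinge penalty on point pairs that are far apart on the surface yet close in embedding space. The hinge form, the threshold, the margin, and the name `surface_contrastive_loss` are assumptions for this example; the term that preserves the distilled semantic content is omitted, and the paper's actual loss may differ.

```python
import math

def surface_contrastive_loss(embeddings, geodesic, far_threshold=0.5, margin=1.0):
    """Average hinge penalty over pairs that are far on the surface but close in embedding space."""
    total, count = 0.0, 0
    n = len(embeddings)
    for i in range(n):
        for j in range(i + 1, n):
            if geodesic[i][j] > far_threshold:
                d = math.dist(embeddings[i], embeddings[j])
                total += max(0.0, margin - d)  # zero once the pair is at least `margin` apart
                count += 1
    return total / count if count else 0.0

# Points 0 and 1 are far apart on the surface (think left hand and right hand)
# yet share an embedding, so that pair is penalized; points 0 and 2 are already separated.
embeddings = [[0.0, 0.0], [0.0, 0.0], [2.0, 0.0]]
geodesic = [[0.0, 1.0, 1.0],
            [1.0, 0.0, 0.1],
            [1.0, 0.1, 0.0]]
loss = surface_contrastive_loss(embeddings, geodesic)
```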

[CV-141] CustomKD: Customizing Large Vision Foundation for Edge Model Improvement via Knowledge Distillation CVPR2025

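[Quick read]: This paper addresses how to use large vision foundation models (LVFMs) such as DINOv2 and CLIP for knowledge distillation that improves edge models. Although LVFMs perform strongly, their potential for knowledge distillation remains underexplored. The challenge is the large gap in model capacity and the heterogeneous architectures between LVFMs and edge models. Even though a larger teacher backbone (e.g., moving from ViT-S to ViT-L) improves the teacher's downstream performance, the large model discrepancy prevents the student from gaining as much as the teacher does.

To address this, the paper proposes CustomKD, a knowledge distillation method whose key idea is to customize the well-generalized features inherent in LVFMs to a given student model, reducing the model discrepancy. Specifically, besides providing the teacher's well-generalized original knowledge, CustomKD aligns the teacher's features to those of the student, which makes them easier for the student to learn from despite the discrepancy. CustomKD significantly improves edge models in scenarios with unlabeled data, including unsupervised domain adaptation (e.g., OfficeHome and DomainNet) and semi-supervised learning (e.g., CIFAR-100 with 400 labeled samples and ImageNet with 1% labeled samples), setting a new state of the art.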

Link: https://arxiv.org/abs/2503.18244
Authors: Jungsoo Lee,Debasmit Das,Munawar Hayat,Sungha Choi,Kyuwoong Hwang,Fatih Porikli
Affiliations: Qualcomm AI Research
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to CVPR 2025

Abstract:We propose a novel knowledge distillation approach, CustomKD, that effectively leverages large vision foundation models (LVFMs) to enhance the performance of edge models (e.g., MobileNetV3). Despite recent advancements in LVFMs, such as DINOv2 and CLIP, their potential in knowledge distillation for enhancing edge models remains underexplored. While knowledge distillation is a promising approach for improving the performance of edge models, the discrepancy in model capacities and heterogeneous architectures between LVFMs and edge models poses a significant challenge. Our observation indicates that although utilizing larger backbones (e.g., ViT-S to ViT-L) in teacher models improves their downstream task performances, the knowledge distillation from the large teacher models fails to bring as much performance gain for student models as for teacher models due to the large model discrepancy. Our simple yet effective CustomKD customizes the well-generalized features inherent in LVFMs to a given student model in order to reduce model discrepancies. Specifically, beyond providing well-generalized original knowledge from teachers, CustomKD aligns the features of teachers to those of students, making it easy for students to understand and overcome the large model discrepancy overall. CustomKD significantly improves the performances of edge models in scenarios with unlabeled data such as unsupervised domain adaptation (e.g., OfficeHome and DomainNet) and semi-supervised learning (e.g., CIFAR-100 with 400 labeled samples and ImageNet with 1% labeled samples), achieving the new state-of-the-art performances.

[CV-142] PG-SAM: Prior-Guided SAM with Medical for Multi-organ Segmentation

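[Quick read]: This paper addresses the significant drop in accuracy and robustness when the Segment Anything Model (SAM) is applied to medical image segmentation. Existing methods respond with modality fusion, integrating text and image information to provide more detailed priors, but they suffer from insufficient text granularity and the domain gap, and the discrepancy between high-level abstract semantics and pixel-level boundary details in images can introduce noise into the fusion.

The proposed solution is Prior-Guided SAM (PG-SAM). Its core is a fine-grained modality prior aligner that uses fine-grained text from a medical large language model (LLM) to bring in specialized medical knowledge, bridge the domain gap, and improve the quality of the priors after alignment, leading to more accurate segmentation. The decoder strengthens the model's expressive capability through multi-level feature fusion and iterative mask optimizer operations, supporting unprompted learning. The paper also designs a unified pipeline that supplies high-quality semantic information to SAM. Experiments show that PG-SAM achieves state-of-the-art performance on the Synapse dataset.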

Link: https://arxiv.org/abs/2503.18227
Authors: Yiheng Zhong,Zihong Luo,Chengzhi Liu,Feilong Tang,Zelin Peng,Ming Hu,Yingzhen Hu,Jionglong Su,Zongyuan Ge,Imran Razzak
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Segment Anything Model (SAM) demonstrates powerful zero-shot capabilities; however, its accuracy and robustness significantly decrease when applied to medical image segmentation. Existing methods address this issue through modality fusion, integrating textual and image information to provide more detailed priors. In this study, we argue that the granularity of text and the domain gap affect the accuracy of the priors. Furthermore, the discrepancy between high-level abstract semantics and pixel-level boundary details in images can introduce noise into the fusion process. To address this, we propose Prior-Guided SAM (PG-SAM), which employs a fine-grained modality prior aligner to leverage specialized medical knowledge for better modality alignment. The core of our method lies in efficiently addressing the domain gap with fine-grained text from a medical LLM. Meanwhile, it also enhances the priors’ quality after modality alignment, ensuring more accurate segmentation. In addition, our decoder enhances the model’s expressive capabilities through multi-level feature fusion and iterative mask optimizer operations, supporting unprompted learning. We also propose a unified pipeline that effectively supplies high-quality semantic information to SAM. Extensive experiments on the Synapse dataset demonstrate that the proposed PG-SAM achieves state-of-the-art performance. Our anonymous code is released at this https URL.

[CV-143] MammAlps: A multi-view video behavior monitoring dataset of wild mammals in the Swiss Alps CVPR2025

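[Quick read]: This paper addresses a key obstacle in wild mammal behavior monitoring: the lack of annotated video data, which limits the development of powerful video understanding models. It presents MammAlps, a multimodal, multi-view dataset from 9 camera traps in the Swiss National Park, containing over 14 hours of video with audio, 2D segmentation maps, and 8.5 hours of individual tracks densely labeled for species and behavior. On this dataset the paper proposes the first hierarchical, multimodal animal behavior recognition benchmark, which takes audio, video, and reference scene segmentation maps as inputs, and a second, ecology-oriented benchmark that identifies activities, species, number of individuals, and meteorological conditions from multi-view, long-term ecological events, including false positive triggers. The authors argue that the two tasks are complementary and help bridge the gap between machine learning and ecology. Code and data are available at the specified link.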

Link: https://arxiv.org/abs/2503.18223
Authors: Valentin Gabeff,Haozhe Qi,Brendan Flaherty,Gencer Sumbül,Alexander Mathis,Devis Tuia
Affiliations: EPFL (École Polytechnique Fédérale de Lausanne)
Categories: Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR); Neurons and Cognition (q-bio.NC); Quantitative Methods (q-bio.QM)
Comments: CVPR 2025; Benchmark and code at: this https URL

Abstract:Monitoring wildlife is essential for ecology and ethology, especially in light of the increasing human impact on ecosystems. Camera traps have emerged as habitat-centric sensors enabling the study of wildlife populations at scale with minimal disturbance. However, the lack of annotated video datasets limits the development of powerful video understanding models needed to process the vast amount of fieldwork data collected. To advance research in wild animal behavior monitoring we present MammAlps, a multimodal and multi-view dataset of wildlife behavior monitoring from 9 camera-traps in the Swiss National Park. MammAlps contains over 14 hours of video with audio, 2D segmentation maps and 8.5 hours of individual tracks densely labeled for species and behavior. Based on 6135 single animal clips, we propose the first hierarchical and multimodal animal behavior recognition benchmark using audio, video and reference scene segmentation maps as inputs. Furthermore, we also propose a second ecology-oriented benchmark aiming at identifying activities, species, number of individuals and meteorological conditions from 397 multi-view and long-term ecological events, including false positive triggers. We advocate that both tasks are complementary and contribute to bridging the gap between machine learning and ecology. Code and data are available at: this https URL

[CV-144] SimMotionEdit: Text-Based Human Motion Editing with Motion Similarity Prediction

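[Quick read]: This paper addresses the lack of precise control in text-based 3D human motion editing, where existing methods often produce motion semantics misaligned with the language instruction. It introduces a related task, motion similarity prediction, and a multi-task training paradigm that trains the model jointly on motion editing and motion similarity prediction to encourage semantically meaningful representations. The key to the solution is an advanced Diffusion-Transformer-based architecture that handles motion similarity prediction and motion editing separately.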

Link: https://arxiv.org/abs/2503.18211
Authors: Zhengyuan Li,Kai Cheng,Anindita Ghosh,Uttaran Bhattacharya,Liangyan Gui,Aniket Bera
Affiliations: Purdue University; DFKI, MPI-INF, Saarland Informatics Campus; Adobe Inc.; University of Illinois Urbana-Champaign
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project URL: this https URL

Abstract:Text-based 3D human motion editing is a critical yet challenging task in computer vision and graphics. While training-free approaches have been explored, the recent release of the MotionFix dataset, which includes source-text-motion triplets, has opened new avenues for training, yielding promising results. However, existing methods struggle with precise control, often leading to misalignment between motion semantics and language instructions. In this paper, we introduce a related task, motion similarity prediction, and propose a multi-task training paradigm, where we train the model jointly on motion editing and motion similarity prediction to foster the learning of semantically meaningful representations. To complement this task, we design an advanced Diffusion-Transformer-based architecture that separately handles motion similarity prediction and motion editing. Extensive experiments demonstrate the state-of-the-art performance of our approach in both editing alignment and fidelity.

[CV-145] Training A Neural Network For Partially Occluded Road Sign Identification In The Context Of Autonomous Vehicles

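[Quick read]: This paper addresses the drop in traffic sign recognition accuracy under partial occlusion. The key to the solution is a dataset of 5,746 images covering both fully visible and partially occluded traffic signs, together with experiments that show why real-world partially occluded data should be included during training. Models trained only on fully visible signs degrade markedly on occluded signs, while transfer learning with VGG16 and all layers unfrozen reaches the highest recognition accuracy, 99%. This indicates that training on datasets containing occluded samples is essential for robustness in complex real-world scenarios and for the safety of autonomous driving.

Link: https://arxiv.org/abs/2503.18177
Authors: Gulnaz Gimaletdinova,Dim Shaiakhmetov,Madina Akpaeva,Mukhammadmuso Abduzhabbarov,Kadyrmamat Momunov
Affiliations: Ala-Too International University; Westminster International University in Tashkent
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract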

Abstract:The increasing number of autonomous vehicles and the rapid development of computer vision technologies underscore the particular importance of conducting research on the accuracy of traffic sign recognition. Numerous studies in this field have already achieved significant results, demonstrating high effectiveness in addressing traffic sign recognition tasks. However, the task becomes considerably more complex when a sign is partially obscured by surrounding objects, such as tree branches, billboards, or other elements of the urban environment. In our study, we investigated how partial occlusion of traffic signs affects their recognition. For this purpose, we collected a dataset comprising 5,746 images, including both fully visible and partially occluded signs, and made it publicly available. Using this dataset, we compared the performance of our custom convolutional neural network (CNN), which achieved 96% accuracy, with models trained using transfer learning. The best result was obtained by VGG16 with full layer unfreezing, reaching 99% accuracy. Additional experiments revealed that models trained solely on fully visible signs lose effectiveness when recognizing occluded signs. This highlights the critical importance of incorporating real-world data with partial occlusion into training sets to ensure robust model performance in complex practical scenarios and to enhance the safety of autonomous driving.

[CV-146] Self-Attention Diffusion Models for Zero-Shot Biomedical Image Segmentation: Unlocking New Frontiers in Medical Imaging

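[Quick read]: This paper addresses the challenging problem of segmenting diverse medical images in a zero-shot manner without any annotations. Existing methods rely mainly on large-scale supervised training or on unsupervised training, and they remain limited when it comes to segmenting medical images across modalities with no annotations at all. The paper proposes the Attention Diffusion Zero-shot Unsupervised System (ADZUS). Its key idea is to use the self-attention mechanisms of pre-trained diffusion models, together with their generative and discriminative capabilities, to segment medical images without annotated data or domain-specific prior knowledge. By integrating self-attention, ADZUS improves context awareness and sensitivity to detail, and it achieves state-of-the-art performance on several medical imaging datasets, including skin lesion, chest X-ray infection, and white blood cell segmentation, with Dice scores of 88.7% to 92.9% and IoU scores of 66.3% to 93.3%. ADZUS also requires substantial computational resources and long processing times. Its effectiveness suggests that the approach can greatly reduce reliance on costly annotations and adapt quickly to new medical imaging tasks, extending the diagnostic capabilities of AI-driven medical imaging.

Link: https://arxiv.org/abs/2503.18170
Authors: Abderrachid Hamrani,Anuradha Godavarty
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 15 pages, 5 figures

Click to view abstract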

Abstract:Producing high-quality segmentation masks for medical images is a fundamental challenge in biomedical image analysis. Recent research has explored large-scale supervised training to enable segmentation across various medical imaging modalities and unsupervised training to facilitate segmentation without dense annotations. However, constructing a model capable of segmenting diverse medical images in a zero-shot manner without any annotations remains a significant hurdle. This paper introduces the Attention Diffusion Zero-shot Unsupervised System (ADZUS), a novel approach that leverages self-attention diffusion models for zero-shot biomedical image segmentation. ADZUS harnesses the intrinsic capabilities of pre-trained diffusion models, utilizing their generative and discriminative potentials to segment medical images without requiring annotated training data or prior domain-specific knowledge. The ADZUS architecture is detailed, with its integration of self-attention mechanisms that facilitate context-aware and detail-sensitive segmentations being highlighted. Experimental results across various medical imaging datasets, including skin lesion segmentation, chest X-ray infection segmentation, and white blood cell segmentation, reveal that ADZUS achieves state-of-the-art performance. Notably, ADZUS reached Dice scores ranging from 88.7% to 92.9% and IoU scores from 66.3% to 93.3% across different segmentation tasks, demonstrating significant improvements in handling novel, unseen medical imagery. It is noteworthy that while ADZUS demonstrates high effectiveness, it demands substantial computational resources and extended processing times. The model’s efficacy in zero-shot settings underscores its potential to reduce reliance on costly annotations and seamlessly adapt to new medical imaging tasks, thereby expanding the diagnostic capabilities of AI-driven medical imaging technologies.
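
The abstract reports results as Dice and IoU scores. As a reminder of how these two overlap metrics are usually defined for binary masks, here is a minimal sketch in plain Python. The function name `dice_and_iou`, the flat-list mask layout, and the convention of scoring two empty masks as 1.0 are assumptions made for illustration, not details taken from the paper.

```python
def dice_and_iou(pred, target):
    """Return (Dice, IoU) for two binary masks given as flat sequences of 0/1.

    Assumption for this sketch: if both masks are empty, both scores are 1.0.
    """
    inter = sum(1 for p, t in zip(pred, target) if p and t)
    pred_sum = sum(1 for p in pred if p)
    target_sum = sum(1 for t in target if t)
    union = pred_sum + target_sum - inter
    if union == 0:
        return 1.0, 1.0
    dice = 2 * inter / (pred_sum + target_sum)  # 2|A∩B| / (|A| + |B|)
    iou = inter / union                         # |A∩B| / |A∪B|
    return dice, iou


if __name__ == "__main__":
    dice, iou = dice_and_iou([1, 1, 0, 0], [1, 0, 1, 0])
    print(f"Dice={dice:.3f} IoU={iou:.3f}")
```

For a single prediction the two are tied by Dice = 2·IoU / (1 + IoU), so Dice is never smaller than IoU on the same mask pair; scores averaged over different datasets need not follow that ordering.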

[CV-147] MAO: Efficient Model-Agnostic Optimization of Prompt Tuning for Vision-Language Models ICME2025

[Quick read]: This paper addresses the fact that existing work on CLIP-based prompt tuning improves pre-trained vision-language models mainly by restructuring the model architecture (for example, extra loss computation and meta-networks), which generally increases complexity and training cost. To keep tuning efficient, the paper proposes a plug-and-play Model-Agnostic Optimization (MAO) method. Its key components are a Data-Driven Enhancement framework that optimizes the distribution of the initial data and an Alterable Regularization module that strengthens the task-specific feature processing pipeline, improving overall performance while keeping computational cost low.

Link: https://arxiv.org/abs/2503.18160
Authors: Haoyang Li,Siyu Zhou,Liang Wang,Guodong Long
Affiliations: School of Mechanical Engineering and Automation, Shanghai University, Shanghai, China; Australian Artificial Intelligence Institute, University of Technology Sydney, Sydney, Australia
Categories: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Comments: Accepted by the IEEE International Conference on Multimedia Expo 2025 (ICME 2025); 12 pages, 6 figures, 8 tables

Click to view abstract

Abstract:Though CLIP-based prompt tuning significantly enhances pre-trained Vision-Language Models, existing research focuses on reconstructing the model architecture, e.g., additional loss calculation and meta-networks. These approaches generally lead to increased complexity and extended training cost. To maintain the efficiency of the tuning process, we propose plug-and-play Model-Agnostic Optimization (MAO) for prompt tuning. Without altering any components of the prompt tuning backbone, we introduce a Data-Driven Enhancement framework to optimize the distribution of the initial data, and incorporate an Alterable Regularization module to boost the task-specific feature processing pipeline, thereby improving overall performance while maintaining low computational cost. Extensive experiments on MAO demonstrate its outstanding performance and efficiency. The code of MAO is available at: this https URL .

[CV-148] DiffusionTalker: Efficient and Compact Speech-Driven 3D Talking Head via Personalizer-Guided Distillation ICME2025

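[Quick read]: This paper addresses the limitations of real-time speech-driven 3D facial animation in personalization and efficiency. Existing diffusion-based methods improve the diversity of animations but still fail to capture personalized speaking styles accurately, and they leave room for improvement in efficiency and model compactness. The paper proposes DiffusionTalker, whose key idea is personalizer-guided distillation. First, a Contrastive Personalizer learns identity and emotion embeddings from audio to capture speaking styles. Second, a Personalizer Enhancer is used during distillation to strengthen the influence of these embeddings on the facial animation. Iterative distillation reduces the number of steps needed to generate an animation, giving more than 8x faster inference, and distilling the large teacher model into a small student model cuts model storage by 86.4% while minimizing performance loss. After distillation, users can extract identity and emotion embeddings from audio and quickly generate personalized animations that reflect a specific speaking style. Experiments show that the method outperforms existing state-of-the-art methods.

Link: https://arxiv.org/abs/2503.18159
Authors: Peng Chen,Xiaobao Wei,Ming Lu,Hui Chen,Feng Tian
Affiliations: Institute of Software, Chinese Academy of Sciences, Beijing, China; University of Chinese Academy of Sciences, Beijing, China; Intel Labs China, Beijing, China
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Sound (cs.SD)
Comments: Accepted by ICME2025

Click to view abstract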

Abstract:Real-time speech-driven 3D facial animation has attracted attention in both academia and industry. Traditional methods mainly focus on learning a deterministic mapping from speech to animation. Recent approaches have started to consider the nondeterministic nature of speech-driven 3D face animation and employ diffusion models for the task. Existing diffusion-based methods can improve the diversity of facial animation. However, personalized speaking styles that convey accurate lip language are still lacking, and efficiency and compactness still need to be improved. In this work, we propose DiffusionTalker to address the above limitations via personalizer-guided distillation. In terms of personalization, we introduce a contrastive personalizer that learns identity and emotion embeddings to capture speaking styles from audio. We further propose a personalizer enhancer during distillation to enhance the influence of embeddings on facial animation. For efficiency, we use iterative distillation to reduce the steps required for animation generation and achieve more than 8x speedup in inference. To achieve compactness, we distill the large teacher model into a smaller student model, reducing our model’s storage by 86.4% while minimizing performance loss. After distillation, users can derive their identity and emotion embeddings from audio to quickly create personalized animations that reflect specific speaking styles. Extensive experiments are conducted to demonstrate that our method outperforms state-of-the-art methods. The code will be released at: this https URL.

[CV-149] Decorum: A Language-Based Approach For Style-Conditioned Synthesis of Indoor 3D Scenes

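[Quick read]: This paper addresses joint control over scene layout, visual features, and style preferences in 3D indoor scene generation. Existing methods offer very limited control over these attributes and only handle simple object-level descriptions or pairwise spatial relationships. The proposed method, Decorum, adopts language-based representations at every stage so that users can control scene generation with natural language. Its key idea is to use Large Language Models (LLMs) to model language-to-language mappings and, building on the text representation, to introduce a new furniture retrieval method based on multimodal LLMs, which improves text-conditioned scene synthesis and object retrieval.

Link: https://arxiv.org/abs/2503.18155
Authors: Kelly O. Marshall,Omid Poursaeed,Sergiu Oprea,Amit Kumar,Anushrut Jignasu,Chinmay Hegde,Yilei Li,Rakesh Ranjan
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract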

Abstract:3D indoor scene generation is an important problem for the design of digital and real-world environments. To automate this process, a scene generation model should be able to not only generate plausible scene layouts, but also take into consideration visual features and style preferences. Existing methods for this task exhibit very limited control over these attributes, only allowing text inputs in the form of simple object-level descriptions or pairwise spatial relationships. Our proposed method Decorum enables users to control the scene generation process with natural language by adopting language-based representations at each stage. This enables us to harness recent advancements in Large Language Models (LLMs) to model language-to-language mappings. In addition, we show that using a text-based representation allows us to select furniture for our scenes using a novel object retrieval method based on multimodal LLMs. Evaluations on the benchmark 3D-FRONT dataset show that our methods achieve improvements over existing work in text-conditioned scene synthesis and object retrieval.

[CV-150] LongDiff: Training-Free Long Video Generation in One Go

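[Quick read]: This paper addresses two key challenges that arise when short-video generation models are extended to long videos: temporal position ambiguity and information dilution, both of which hinder short-to-long generalization. To tackle them, the paper proposes LongDiff, a training-free method built on two carefully designed components: Position Mapping (PM) and Informative Frame Selection (IFS). With these components, LongDiff lets off-the-shelf short-video diffusion models generate high-quality long videos in one go. Experiments confirm the effectiveness of the method.

Link: https://arxiv.org/abs/2503.18150
Authors: Zhuoling Li,Hossein Rahmani,Qiuhong Ke,Jun Liu
Affiliations: Lancaster University; Monash University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract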

Abstract:Video diffusion models have recently achieved remarkable results in video generation. Despite their encouraging performance, most of these models are mainly designed and trained for short video generation, leading to challenges in maintaining temporal consistency and visual details in long video generation. In this paper, we propose LongDiff, a novel training-free method consisting of carefully designed components, Position Mapping (PM) and Informative Frame Selection (IFS), to tackle two key challenges that hinder short-to-long video generation generalization: temporal position ambiguity and information dilution. Our LongDiff unlocks the potential of off-the-shelf video diffusion models to achieve high-quality long video generation in one go. Extensive experiments demonstrate the efficacy of our method.

[CV-151] PHT-CAD: Efficient CAD Parametric Primitive Analysis with Progressive Hierarchical Tuning

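[Quick read]: This paper addresses two key challenges that have left 2D Parametric Primitive Analysis (2D PPA) underexplored: structural constraint reasoning and advanced semantic understanding. The key contributions are an Efficient Hybrid Parametrization (EHP) for better representing 2D engineering drawings and a new framework, PHT-CAD, which uses the modality alignment and reasoning capabilities of Vision-Language Models (VLMs) for precise engineering drawing analysis. The paper also designs a three-stage Progressive Hierarchical Tuning (PHT) scheme that progressively improves PHT-CAD's ability to perceive individual primitives, infer structural constraints, and align annotation layers with geometric representations. To support research, the paper builds ParaCAD, a large-scale benchmark with more than 10 million annotated drawings for training and 3,000 real-world industrial drawings with complex topological structures and physical constraints for testing.

Link: https://arxiv.org/abs/2503.18147
Authors: Ke Niu,Yuwen Chen,Haiyang Yu,Zhuofan Chen,Xianghui Que,Bin Li,Xiangyang Xue
Affiliations: Shanghai Key Laboratory of Intelligent Information Processing; School of Computer Science, Fudan University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract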

Abstract:Computer-Aided Design (CAD) plays a pivotal role in industrial manufacturing, yet 2D Parametric Primitive Analysis (PPA) remains underexplored due to two key challenges: structural constraint reasoning and advanced semantic understanding. To tackle these challenges, we first propose an Efficient Hybrid Parametrization (EHP) for better representing 2D engineering drawings. EHP contains four types of atomic components (i.e., point, line, circle, and arc). Additionally, we propose PHT-CAD, a novel 2D PPA framework that harnesses the modality alignment and reasoning capabilities of Vision-Language Models (VLMs) for precise engineering drawing analysis. In PHT-CAD, we introduce four dedicated regression heads to predict corresponding atomic components. To train PHT-CAD, a three-stage training paradigm, Progressive Hierarchical Tuning (PHT), is proposed to progressively enhance PHT-CAD’s capability to perceive individual primitives, infer structural constraints, and align annotation layers with their corresponding geometric representations. Considering that existing datasets lack complete annotation layers and real-world engineering drawings, we introduce ParaCAD, the first large-scale benchmark that explicitly integrates both the geometric and annotation layers. ParaCAD comprises over 10 million annotated drawings for training and 3,000 real-world industrial drawings with complex topological structures and physical constraints for testing. Extensive experiments demonstrate the effectiveness of PHT-CAD and highlight the practical significance of ParaCAD in advancing 2D PPA research.
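
The abstract names four atomic component types (point, line, circle, arc) without giving their parameters. The sketch below shows one plausible way to model such primitives as Python dataclasses. Every field name here (`start`, `end`, `center`, `radius`, `start_angle`, `end_angle`) and the `length` helper are hypothetical choices for illustration; the actual EHP encoding may differ.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Point:
    x: float
    y: float


@dataclass(frozen=True)
class Line:
    start: Point
    end: Point

    def length(self) -> float:
        return math.hypot(self.end.x - self.start.x, self.end.y - self.start.y)


@dataclass(frozen=True)
class Circle:
    center: Point
    radius: float

    def length(self) -> float:
        return 2 * math.pi * self.radius


@dataclass(frozen=True)
class Arc:
    # Hypothetical parametrization: counter-clockwise sweep, angles in radians.
    center: Point
    radius: float
    start_angle: float
    end_angle: float

    def length(self) -> float:
        sweep = (self.end_angle - self.start_angle) % (2 * math.pi)
        return self.radius * sweep


if __name__ == "__main__":
    drawing = [
        Line(Point(0, 0), Point(3, 4)),
        Circle(Point(1, 1), 0.5),
        Arc(Point(0, 0), 2.0, 0.0, math.pi / 2),
    ]
    for primitive in drawing:
        print(type(primitive).__name__, round(primitive.length(), 3))
```

A typed representation like this makes it natural to give each primitive type its own regression target, which matches the abstract's mention of four dedicated regression heads.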

[CV-152] LocDiffusion: Identifying Locations on Earth by Diffusing in the Hilbert Space

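[Quick read]: This paper addresses image geolocalization, a fundamental but challenging task. Existing methods rely mainly on grid-based classification or image retrieval, and their performance drops sharply when the spatial distribution of test images does not match those choices. To overcome these limits, the paper proposes using diffusion for image geolocalization. The key innovation is a new spherical positional encoding-decoding framework, the Spherical Harmonics Dirac Delta (SHDD) representation, which encodes points on a sphere (such as geographic coordinates on Earth) into a Hilbert space of spherical harmonics coefficients and decodes coordinates by mode-seeking. The paper also proposes CS-UNet, a SirenNet-based architecture that learns the conditional backward process in the latent SHDD space by minimizing a latent KL-divergence loss. The resulting conditional latent diffusion model, LocDiffusion, generates geographic coordinates under image guidance and is, to the authors' knowledge, the first generative image geolocalization model that diffuses location information in a hidden location embedding space. Experiments show that LocDiffusion is competitive with strong baselines and generalizes significantly better to unseen locations.

Link: https://arxiv.org/abs/2503.18142
Authors: Zhangyu Wang,Jielu Zhang,Zhongliang Zhou,Qian Cao,Nemin Wu,Zeping Liu,Lan Mu,Yang Song,Yiqun Xie,Ni Lao,Gengchen Mai
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract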

Abstract:Image geolocalization is a fundamental yet challenging task, aiming at inferring the geolocation on Earth where an image is taken. Existing methods approach it either via grid-based classification or via image retrieval. Their performance significantly suffers when the spatial distribution of test images does not align with such choices. To address these limitations, we propose to leverage diffusion as a mechanism for image geolocalization. To avoid the problematic manifold reprojection step in diffusion, we developed a novel spherical positional encoding-decoding framework, which encodes points on a spherical surface (e.g., geolocations on Earth) into a Hilbert space of Spherical Harmonics coefficients and decodes points (geolocations) by mode-seeking. We call this type of position encoding Spherical Harmonics Dirac Delta (SHDD) Representation. We also propose a novel SirenNet-based architecture called CS-UNet to learn the conditional backward process in the latent SHDD space by minimizing a latent KL-divergence loss. We train a conditional latent diffusion model called LocDiffusion that generates geolocations under the guidance of images – to the best of our knowledge, the first generative model for image geolocalization by diffusing geolocation information in a hidden location embedding space. We evaluate our method against SOTA image geolocalization baselines. LocDiffusion achieves competitive geolocalization performance and demonstrates significantly stronger generalizability to unseen geolocations.
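
To make the encode-then-mode-seek idea concrete, here is a toy sketch under strong simplifying assumptions: it truncates the real spherical harmonics expansion at degree 1 (four coefficients) and decodes by brute-force search over a 1-degree latitude/longitude grid. The function names, the truncation degree, and the grid search are illustrative choices, not the paper's SHDD implementation, which would use many more coefficients and a proper mode-seeking decoder.

```python
import math


def to_unit_vector(lat_deg, lon_deg):
    """Convert latitude/longitude in degrees to a point on the unit sphere."""
    lat, lon = math.radians(lat_deg), math.radians(lon_deg)
    return (math.cos(lat) * math.cos(lon),
            math.cos(lat) * math.sin(lon),
            math.sin(lat))


def real_sh_basis(x, y, z):
    """Real spherical harmonics of degree 0 and 1 evaluated at a unit vector."""
    c0 = 0.5 * math.sqrt(1.0 / math.pi)
    c1 = math.sqrt(3.0 / (4.0 * math.pi))
    return [c0, c1 * y, c1 * z, c1 * x]


def encode(lat_deg, lon_deg):
    """Coefficients of a Dirac delta at the location, truncated at degree 1."""
    return real_sh_basis(*to_unit_vector(lat_deg, lon_deg))


def decode(coeffs, step_deg=1):
    """Mode-seeking by grid search: return the (lat, lon) maximizing the field."""
    best, best_val = None, -math.inf
    for lat in range(-90, 91, step_deg):
        for lon in range(-180, 180, step_deg):
            basis = real_sh_basis(*to_unit_vector(lat, lon))
            val = sum(c * b for c, b in zip(coeffs, basis))
            if val > best_val:
                best, best_val = (lat, lon), val
    return best


if __name__ == "__main__":
    print(decode(encode(48, 2)))
```

At degree 1 the reconstructed field is a constant plus a term proportional to the dot product between the query point and the encoded point, so its maximum sits at the encoded location. Higher degrees sharpen the peak, which matters once the coefficients are noisy model outputs rather than exact encodings.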

[CV-153] AGIR: Assessing 3D Gait Impairment with Reasoning based on LLMs

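[Quick read]: This paper addresses the subjectivity and lack of precision in assessing gait impairment for neurodegenerative diseases, as well as the limited interpretability of existing deep learning methods in clinical decision-making. The key to the solution is AGIR, a pipeline consisting of 1) a pre-trained VQ-VAE motion tokenizer that converts gait motion into motion tokens, and 2) a Large Language Model (LLM) whose understanding of pathological gait is strengthened through a two-stage supervised fine-tuning (SFT) strategy. The strategy combines bidirectional motion-description generation, which aligns motions with analytic descriptions, and Chain-of-Thought (CoT) reasoning for gait score assessment. Validation on an existing dataset and comparisons with recent methods show that the pipeline assigns gait impairment scores accurately while providing clinically meaningful rationales.

Link: https://arxiv.org/abs/2503.18141
Authors: Diwei Wang,Cédric Bobenrieth,Hyewon Seo
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract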

Abstract:Assessing gait impairment plays an important role in early diagnosis, disease monitoring, and treatment evaluation for neurodegenerative diseases. Despite its widespread use in clinical practice, it is limited by subjectivity and a lack of precision. While recent deep learning-based approaches have consistently improved classification accuracies, they often lack interpretability, hindering their utility in clinical decision-making. To overcome these challenges, we introduce AGIR, a novel pipeline consisting of a pre-trained VQ-VAE motion tokenizer and a subsequent Large Language Model (LLM) fine-tuned over pairs of motion tokens and Chain-of-Thought (CoT) reasonings. To fine-tune an LLM for pathological gait analysis, we first introduce a multimodal dataset by adding rationales dedicated to MDS-UPDRS gait score assessment to an existing PD gait dataset. We then introduce a two-stage supervised fine-tuning (SFT) strategy to enhance the LLM’s motion comprehension with pathology-specific knowledge. This strategy includes: 1) a generative stage that aligns gait motions with analytic descriptions through bidirectional motion-description generation, 2) a reasoning stage that integrates logical Chain-of-Thought (CoT) reasoning for impairment assessment with UPDRS gait score. Validation on an existing dataset and comparisons with state-of-the-art methods confirm the robustness and accuracy of our pipeline, demonstrating its ability to assign gait impairment scores from motion input with clinically meaningful rationales.

[CV-154] TCFG: Tangential Damping Classifier-free Guidance CVPR2025

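[Quick read]: This paper addresses a problem with Classifier-Free Guidance (CFG) in text-to-image diffusion models: the unconditional score may interfere with the sampling trajectory toward the specific condition. The key innovation is to treat the unconditional score from a geometric perspective. Singular Value Decomposition (SVD) is used to filter the singular vectors of the conditional and unconditional scores so that the unconditional score is aligned with the conditional score, which keeps the sampling trajectory closer to the conditional manifold at negligible extra computational cost. The approach improves the quality of generated images, offers deeper insight into the behavior of score functions in diffusion models, and provides a practical technique for more accurate and contextually coherent image synthesis.

Link: https://arxiv.org/abs/2503.18137
Authors: Mingi Kwon,Shin seong Kim,Jaeseok Jeong,Yi Ting Hsiao,Youngjung Uh
Affiliations: Yonsei University; Yonsei University; University of Michigan
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: CVPR2025

Click to view abstract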

Abstract:Diffusion models have achieved remarkable success in text-to-image synthesis, largely attributed to the use of classifier-free guidance (CFG), which enables high-quality, condition-aligned image generation. CFG combines the conditional score (e.g., text-conditioned) with the unconditional score to control the output. However, the unconditional score is in charge of estimating the transition between manifolds of adjacent timesteps from x_t to x_{t-1}, which may inadvertently interfere with the trajectory toward the specific condition. In this work, we introduce a novel approach that leverages a geometric perspective on the unconditional score to enhance CFG performance when conditional scores are available. Specifically, we propose a method that filters the singular vectors of both conditional and unconditional scores using singular value decomposition. This filtering process aligns the unconditional score with the conditional score, thereby refining the sampling trajectory to stay closer to the manifold. Our approach improves image quality with negligible additional computation. We provide deeper insights into the score function behavior in diffusion models and present a practical technique for achieving more accurate and contextually coherent image synthesis.
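
For readers unfamiliar with how CFG combines the two scores, the sketch below shows the standard guidance formula on plain Python lists, followed by a much simpler stand-in for the alignment idea: keeping only the part of the unconditional score that is parallel to the conditional score. This projection is an assumption chosen for illustration and is not the paper's SVD-based filtering; the names `cfg_combine`, `damp_orthogonal`, and the `keep` parameter are hypothetical.

```python
def cfg_combine(cond, uncond, scale):
    """Standard classifier-free guidance: uncond + scale * (cond - uncond)."""
    return [u + scale * (c - u) for c, u in zip(cond, uncond)]


def damp_orthogonal(uncond, cond, keep=0.0):
    """Illustrative stand-in (not the paper's SVD filter).

    Splits `uncond` into a part parallel to `cond` and an orthogonal remainder,
    then scales the remainder by `keep` (0.0 removes it, 1.0 leaves it intact).
    """
    norm_sq = sum(c * c for c in cond)
    if norm_sq == 0.0:
        return list(uncond)
    coef = sum(u * c for u, c in zip(uncond, cond)) / norm_sq
    parallel = [coef * c for c in cond]
    return [p + keep * (u - p) for u, p in zip(uncond, parallel)]


if __name__ == "__main__":
    cond, uncond = [2.0, 0.0], [1.0, 1.0]
    aligned = damp_orthogonal(uncond, cond)
    print(cfg_combine(cond, uncond, scale=3.0))   # plain CFG
    print(cfg_combine(cond, aligned, scale=3.0))  # CFG with the aligned score
```

With `scale=1.0` the combination reduces to the conditional score alone, and with `scale=0.0` to the unconditional score, which is a quick sanity check when wiring guidance into a sampler.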

[CV-155] MLLM-For3D: Adapting Multimodal Large Language Model for 3D Reasoning Segmentation

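[Quick read]: This paper addresses how to transfer the 2D image reasoning segmentation capabilities of Multimodal Large Language Models (MLLMs) to 3D scene understanding. The main obstacle is the absence of 3D context and of spatial consistency across multiple views, which causes the model to hallucinate objects that do not exist and to localize target objects inconsistently, hurting performance. The key innovations are: 1) a spatial consistency strategy that keeps segmentation masks coherent in 3D space and thus captures the geometry of the scene; and 2) a Token-for-Query approach for multimodal semantic alignment, so that the same object is identified consistently across views. With these, MLLM-For3D outperforms existing 3D reasoning segmentation methods even without any labeled 3D training data.

Link: https://arxiv.org/abs/2503.18135
Authors: Jiaxin Huang,Runnan Chen,Ziwen Li,Zhengqing Gao,Xiao He,Yandong Guo,Mingming Gong,Tongliang Liu
Affiliations: MBZUAI; The University of Sydney; The University of Melbourne; AI2Robotic
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract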

Abstract:Reasoning segmentation aims to segment target objects in complex scenes based on human intent and spatial reasoning. While recent multimodal large language models (MLLMs) have demonstrated impressive 2D image reasoning segmentation, adapting these capabilities to 3D scenes remains underexplored. In this paper, we introduce MLLM-For3D, a simple yet effective framework that transfers knowledge from 2D MLLMs to 3D scene understanding. Specifically, we utilize MLLMs to generate multi-view pseudo segmentation masks and corresponding text embeddings, then unproject 2D masks into 3D space and align them with the text embeddings. The primary challenge lies in the absence of 3D context and spatial consistency across multiple views, causing the model to hallucinate objects that do not exist and fail to target objects consistently. Training the 3D model with such irrelevant objects leads to performance degradation. To address this, we introduce a spatial consistency strategy to enforce that segmentation masks remain coherent in the 3D space, effectively capturing the geometry of the scene. Moreover, we develop a Token-for-Query approach for multimodal semantic alignment, enabling consistent identification of the same object across different views. Extensive evaluations on various challenging indoor scene benchmarks demonstrate that, even without any labeled 3D training data, MLLM-For3D outperforms existing 3D reasoning segmentation methods, effectively interpreting user intent, understanding 3D scenes, and reasoning about spatial relationships.

[CV-156] An Image-like Diffusion Method for Human-Object Interaction Detection CVPR2025

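[Quick read]: This paper addresses the high ambiguity and indeterminacy in Human-Object Interaction (HOI) detection, where the same interaction can look very different across human-object pairs and where occlusion and cluttered backgrounds make matters worse. The key innovation is to recast each HOI detection output as an image and, on that basis, to propose a new framework called HOI-IDiff. The framework uses an Image-like Diffusion process to generate HOI detection outputs as images. Because the recast images differ from natural images in certain properties, the paper further introduces a customized HOI diffusion process and a slice patchification model architecture to generate these "HOI images" more effectively. Experiments confirm the effectiveness of the method.

Link: https://arxiv.org/abs/2503.18134
Authors: Xiaofei Hui,Haoxuan Qu,Hossein Rahmani,Jun Liu
Affiliations: Lancaster University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: CVPR 2025

Click to view abstract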

Abstract:Human-object interaction (HOI) detection often faces high levels of ambiguity and indeterminacy, as the same interaction can appear vastly different across different human-object pairs. Additionally, the indeterminacy can be further exacerbated by issues such as occlusions and cluttered backgrounds. To handle such a challenging task, in this work, we begin with a key observation: the output of HOI detection for each human-object pair can be recast as an image. Thus, inspired by the strong image generation capabilities of image diffusion models, we propose a new framework, HOI-IDiff. In HOI-IDiff, we tackle HOI detection from a novel perspective, using an Image-like Diffusion process to generate HOI detection outputs as images. Furthermore, recognizing that our recast images differ in certain properties from natural images, we enhance our framework with a customized HOI diffusion process and a slice patchification model architecture, which are specifically tailored to generate our recast "HOI images". Extensive experiments demonstrate the efficacy of our framework.

[CV-157] End-to-End Implicit Neural Representations for Classification CVPR2025

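[Quick read]: This paper addresses the fact that implicit neural representations (INRs), although excellent for signal reconstruction, perform far worse than pixel-based methods such as CNNs when used directly for downstream tasks like classification. The key to the solution is an end-to-end strategy for initializing SIRENs, combined with a learned learning-rate scheme, that yields representations better suited to classification. A simple Transformer applied to meta-learned SIRENs, without any explicit symmetry-equivariant design, outperforms the current state of the art. On CIFAR-10 the method raises accuracy from 38.8% to 59.6% without augmentations and from 63.4% to 64.7% with augmentations, and it sets the first SIREN classification baselines on the high-resolution Imagenette and ImageNet-1K datasets.

Link: https://arxiv.org/abs/2503.18123
Authors: Alexander Gielisse,Jan van Gemert
Affiliations: Delft University of Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to CVPR 2025. 8 pages, supplementary material included

Click to view abstract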

Abstract:Implicit neural representations (INRs) such as NeRF and SIREN encode a signal in neural network parameters and show excellent results for signal reconstruction. Using INRs for downstream tasks, such as classification, is however not straightforward. Inherent symmetries in the parameters pose challenges and current works primarily focus on designing architectures that are equivariant to these symmetries. However, INR-based classification still significantly under-performs compared to pixel-based methods like CNNs. This work presents an end-to-end strategy for initializing SIRENs together with a learned learning-rate scheme, to yield representations that improve classification accuracy. We show that a simple, straightforward, Transformer model applied to a meta-learned SIREN, without incorporating explicit symmetry equivariances, outperforms the current state-of-the-art. On the CIFAR-10 SIREN classification task, we improve the state-of-the-art without augmentations from 38.8% to 59.6%, and from 63.4% to 64.7% with augmentations. We demonstrate scalability on the high-resolution Imagenette dataset achieving reasonable reconstruction quality with a classification accuracy of 60.8% and are the first to do INR classification on the full ImageNet-1K dataset where we achieve a SIREN classification performance of 23.6%. To the best of our knowledge, no other SIREN classification approach has managed to set a classification baseline for any high-resolution image dataset. Our code is available at this https URL
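
Classifying an INR means classifying the network's weights rather than pixels. The sketch below illustrates that setup with a tiny SIREN in plain Python: sine-activated layers with the initialization bounds commonly associated with SIREN, and a helper that flattens the parameters into the vector a downstream classifier would consume. The layer sizes, the zero bias initialization, `w0 = 30.0`, and all function names are assumptions for this sketch; the paper's meta-learned initialization and learned learning-rate scheme are not reproduced here.

```python
import math
import random


def init_siren_layer(n_in, n_out, w0=30.0, first=False, rng=random):
    """Uniform init: bound 1/n_in for the first layer, sqrt(6/n_in)/w0 otherwise.

    Biases are set to zero here purely to keep the sketch short.
    """
    bound = 1.0 / n_in if first else math.sqrt(6.0 / n_in) / w0
    weights = [[rng.uniform(-bound, bound) for _ in range(n_in)]
               for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases


def siren_layer(x, weights, biases, w0=30.0):
    """One SIREN layer: sin(w0 * (W x + b))."""
    return [math.sin(w0 * (sum(w * xi for w, xi in zip(row, x)) + b))
            for row, b in zip(weights, biases)]


def flatten_params(layers):
    """Concatenate all weights and biases into one vector (the classifier input)."""
    flat = []
    for weights, biases in layers:
        for row in weights:
            flat.extend(row)
        flat.extend(biases)
    return flat


if __name__ == "__main__":
    rng = random.Random(0)
    layers = [init_siren_layer(2, 4, first=True, rng=rng),
              init_siren_layer(4, 3, rng=rng)]
    hidden = siren_layer([0.5, -0.25], *layers[0])  # input: a 2D pixel coordinate
    output = siren_layer(hidden, *layers[1])
    print(len(flatten_params(layers)), output)
```

Because many different weight vectors can represent the same signal (for example, through hidden-unit permutations), the flattened vector is a difficult input for a classifier, which is the symmetry problem the abstract refers to.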

[CV-158] Unraveling the Effects of Synthetic Data on End-to-End Autonomous Driving

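[Quick read]: This paper addresses the tension between the need for diverse, high-quality data to train and evaluate End-to-End (E2E) Autonomous Driving (AD) models and the high cost and low efficiency of collecting real-world data. It targets the limitations of current synthetic data generation: game-engine-based simulators struggle to produce realistic sensor data, NeRF-based and diffusion-based methods face efficiency bottlenecks, and recent simulators designed for closed-loop evaluation offer limited interaction with other vehicles and cannot reproduce complex traffic dynamics. The key to the solution is SceneCrafter, a new simulator based on 3D Gaussian Splatting (3DGS) that efficiently generates realistic driving logs across diverse traffic scenarios, supports robust closed-loop model evaluation, and significantly improves the generalization of E2E models.

Link: https://arxiv.org/abs/2503.18108
Authors: Junhao Ge,Zuhong Liu,Longteng Fan,Yifan Jiang,Jiaqi Su,Yiming Li,Zhejun Zhang,Siheng Chen
Affiliations: Shanghai Jiao Tong University; New York University; ETH Zurich; Shanghai Artificial Intelligence Laboratory
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract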

Abstract:End-to-end (E2E) autonomous driving (AD) models require diverse, high-quality data to perform well across various driving scenarios. However, collecting large-scale real-world data is expensive and time-consuming, making high-fidelity synthetic data essential for enhancing data diversity and model robustness. Existing driving simulators for synthetic data generation have significant limitations: game-engine-based simulators struggle to produce realistic sensor data, while NeRF-based and diffusion-based methods face efficiency challenges. Additionally, recent simulators designed for closed-loop evaluation provide limited interaction with other vehicles, failing to simulate complex real-world traffic dynamics. To address these issues, we introduce SceneCrafter, a realistic, interactive, and efficient AD simulator based on 3D Gaussian Splatting (3DGS). SceneCrafter not only efficiently generates realistic driving logs across diverse traffic scenarios but also enables robust closed-loop evaluation of end-to-end models. Experimental results demonstrate that SceneCrafter serves as both a reliable evaluation platform and an efficient data generator that significantly improves end-to-end model generalization.

[CV-159] PanoGS: Gaussian-based Panoptic Segmentation for 3D Open Vocabulary Scene Understanding CVPR2025

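[Quick read]: This paper addresses the inability of existing methods to distinguish 3D instance-level information in open vocabulary scene understanding. Previous methods typically perform semantic segmentation by predicting a heatmap between scene features and a text query, which makes it hard to separate different 3D instances. To address this, the paper proposes PanoGS, a novel and effective approach to 3D panoptic open vocabulary scene understanding.

The solution rests on two ideas. First, a pyramid tri-plane models the latent continuous parametric feature space, and a 3D feature decoder regresses the multi-view fused 2D feature cloud, so that accurate 3D language features can be learned for large indoor scenes. Second, language-guided graph cuts jointly use the reconstructed geometry and the learned language cues to group 3D Gaussian primitives into a set of super-primitives. In addition, graph-clustering-based segmentation with SAM-guided edge affinity between super-primitives yields 3D-consistent instance-level segmentation. Together these techniques form the core of PanoGS and improve performance on 3D panoptic open vocabulary scene understanding.

Link: https://arxiv.org/abs/2503.18107
Authors: Hongjia Zhai,Hai Li,Zhenzhe Li,Xiaokun Pan,Yijia He,Guofeng Zhang
Affiliations: State Key Lab of CAD & CG, Zhejiang University; RayNeo
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: CVPR 2025

Click to view abstract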

Abstract:Recently, 3D Gaussian Splatting (3DGS) has shown encouraging performance for open vocabulary scene understanding tasks. However, previous methods, which usually predict a heatmap between the scene feature and text query, cannot distinguish 3D instance-level information. In this paper, we propose PanoGS, a novel and effective 3D panoptic open vocabulary scene understanding approach. Technically, to learn accurate 3D language features that can scale to large indoor scenarios, we adopt the pyramid tri-plane to model the latent continuous parametric feature space and use a 3D feature decoder to regress the multi-view fused 2D feature cloud. Besides, we propose language-guided graph cuts that synergistically leverage reconstructed geometry and learned language cues to group 3D Gaussian primitives into a set of super-primitives. To obtain 3D consistent instances, we perform graph-clustering-based segmentation with SAM-guided edge affinity computation between different super-primitives. Extensive experiments on widely used datasets show better or more competitive performance on 3D panoptic open vocabulary scene understanding. Project page: this https URL.

[CV-160] M3Net: Multimodal Multi-task Learning for 3D Detection Segmentation and Occupancy Prediction in Autonomous Driving AAAI2025

【速读】：该论文旨在解决当前自动驾驶感知系统中多任务学习存在的效率低下以及任务间冲突的问题。传统方法通常单独处理各个子任务，而一些多任务学习方法虽尝试用单一模型统一多个任务，但未能有效缓解任务间的冲突。为应对上述挑战，论文提出M3Net，一种新颖的多模态多任务网络，同时处理检测、分割及3D占用预测任务，并在nuScenes数据集上取得了最先进的性能。

M3Net的关键创新在于其模块化设计：首先引入了Modality-Adaptive Feature Integration (MAFI) 模块，通过让单模态特征为各自高性能的任务预测通道级注意力权重，增强多模态特征的整合能力；其次开发了针对检测/分割和3D占用预测任务的任务特定查询初始化策略；基于初始化后的查询，共享解码器逐层转换查询与BEV特征，促进多任务学习；此外，在解码器中提出了Task-oriented Channel Scaling (TCS) 模块，以减轻不同任务优化之间的冲突。这些设计不仅提升了多任务学习的效果，还展示了良好的架构灵活性，支持Transformer和Mamba等多种解码器结构。

Link: https://arxiv.org/abs/2503.18100
Authors: Xuesong Chen,Shaoshuai Shi,Tao Ma,Jingqiu Zhou,Simon See,Ka Chun Cheung,Hongsheng Li
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by AAAI 2025

Abstract:The perception system for autonomous driving generally needs to handle multiple diverse sub-tasks. However, current algorithms typically tackle individual sub-tasks separately, which leads to low efficiency when aiming at obtaining full-perception results. Some multi-task learning methods try to unify multiple tasks with one model, but do not solve the conflicts in multi-task learning. In this paper, we introduce M3Net, a novel multimodal and multi-task network that simultaneously tackles detection, segmentation, and 3D occupancy prediction for autonomous driving and achieves performance superior to single-task models. M3Net takes multimodal data as input and performs multiple tasks via query-token interactions. To enhance the integration of multi-modal features for multi-task learning, we first propose the Modality-Adaptive Feature Integration (MAFI) module, which enables single-modality features to predict channel-wise attention weights for their respective high-performing tasks. Based on integrated features, we then develop task-specific query initialization strategies to accommodate the needs of detection/segmentation and 3D occupancy prediction. Leveraging the properly initialized queries, a shared decoder transforms queries and BEV features layer-wise, facilitating multi-task learning. Furthermore, we propose a Task-oriented Channel Scaling (TCS) module in the decoder to mitigate conflicts between optimizing for different tasks. Additionally, our proposed multi-task querying and TCS module support both Transformer-based and Mamba-based decoders, demonstrating their flexibility across architectures. M3Net achieves state-of-the-art multi-task learning performance on the nuScenes benchmarks.
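
The channel-wise attention idea behind MAFI can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation: the single linear layer, the global average pooling, and every function name below are assumptions made for illustration, and real BEV features would be tensors rather than nested lists.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def global_avg_pool(feat):
    """feat: list of C channels, each a flat list of spatial values."""
    return [sum(channel) / len(channel) for channel in feat]

def channel_attention(pooled, weight, bias):
    """Hypothetical attention head: one linear layer (C_out x C_in) plus sigmoid."""
    return [sigmoid(sum(w * p for w, p in zip(row, pooled)) + b)
            for row, b in zip(weight, bias)]

def gate_fused_feature(fused, single_modality, weight, bias):
    """Scale each channel of the fused feature by a weight predicted
    from one modality's own feature (assumed form of channel gating)."""
    attn = channel_attention(global_avg_pool(single_modality), weight, bias)
    return [[a * v for v in channel] for a, channel in zip(attn, fused)]
```

In this reading, each task would get its own gate driven by the modality that serves it best, for example a LiDAR-driven gate for detection and a camera-driven gate for segmentation; that pairing is also an assumption.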

[CV-161] Anomize: Better Open Vocabulary Video Anomaly Detection

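[Quick Read]: This paper addresses two specific challenges in Open Vocabulary Video Anomaly Detection (OVVAD) that arise with novel anomalies: detection ambiguity and categorization confusion. Detection ambiguity means the model struggles to assign accurate anomaly scores to unfamiliar anomalies; categorization confusion means novel anomalies are often misclassified as visually similar base instances. To mitigate detection ambiguity, the method draws on supplementary information from multiple sources, combining multiple levels of visual data with matching textual information for more accurate anomaly scoring. To reduce categorization confusion, it uses label relations to guide the encoding of new labels, improving the alignment between novel videos and their labels. The resulting Anomize framework achieves superior performance on the UCF-Crime and XD-Violence datasets.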

Link: https://arxiv.org/abs/2503.18094
Authors: Fei Li,Wenxuan Liu,Jingjing Chen,Ruixu Zhang,Yuran Wang,Xian Zhong,Zheng Wang
Affiliations: National Engineering Research Center for Multimedia Software, School of Computer Science, Wuhan University; Shanghai Key Lab of Intelligent Information Processing, School of Computer Science, Fudan University; State Key Laboratory for Multimedia Information Processing, School of Computer Science, Peking University; Hubei Key Laboratory of Transportation Internet of Things, Wuhan University of Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Open Vocabulary Video Anomaly Detection (OVVAD) seeks to detect and classify both base and novel anomalies. However, existing methods face two specific challenges related to novel anomalies. The first challenge is detection ambiguity, where the model struggles to assign accurate anomaly scores to unfamiliar anomalies. The second challenge is categorization confusion, where novel anomalies are often misclassified as visually similar base instances. To address these challenges, we explore supplementary information from multiple sources to mitigate detection ambiguity by leveraging multiple levels of visual data alongside matching textual information. Furthermore, we propose incorporating label relations to guide the encoding of new labels, thereby improving alignment between novel videos and their corresponding labels, which helps reduce categorization confusion. The resulting Anomize framework effectively tackles these issues, achieving superior performance on UCF-Crime and XD-Violence datasets, demonstrating its effectiveness in OVVAD.

[CV-162] Unified Geometry and Color Compression Framework for Point Clouds via Generative Diffusion Priors

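[Quick Read]: This paper targets efficient joint compression of the geometry and color attributes of 3D point clouds. Existing learning-based compression methods are hard to apply directly to colored point clouds, and their generalization is limited by the training data distribution. The paper proposes a test-time unified geometry and color compression framework. Its key idea is to adapt a pre-trained generative diffusion model through prompt tuning so that the original colored point cloud is compressed into a sparse set of points called seeds; decompression then runs multiple denoising steps with separate sampling processes, giving an efficient joint representation and recovery of geometry and color.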

Link: https://arxiv.org/abs/2503.18083
Authors: Tianxin Huang,Gim Hee Lee
Affiliations: National University of Singapore
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:With the growth of 3D applications and the rapid increase in sensor-collected 3D point cloud data, there is a rising demand for efficient compression algorithms. Most existing learning-based compression methods handle geometry and color attributes separately, treating them as distinct tasks, making these methods challenging to apply directly to point clouds with colors. Besides, the limited capacities of training datasets also limit their generalizability across points with different distributions. In this work, we introduce a test-time unified geometry and color compression framework of 3D point clouds. Instead of training a compression model based on specific datasets, we adapt a pre-trained generative diffusion model to compress original colored point clouds into sparse sets, termed ‘seeds’, using prompt tuning. Decompression is then achieved through multiple denoising steps with separate sampling processes. Experiments on objects and indoor scenes demonstrate that our method has superior performances compared to existing baselines for the compression of geometry and color.

[CV-163] Vehicular Road Crack Detection with Deep Learning: A New Online Benchmark for Comprehensive Evaluation of Existing Algorithms

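[Quick Read]: This paper addresses the need for automatic road crack detection systems on intelligent road inspection (IRI) vehicles in the field of urban digital twins (UDTs), with the aim of replacing manual visual inspection and improving detection efficiency, accuracy, and objectivity. It comprehensively reviews state-of-the-art deep learning methods in four groups (supervised, unsupervised, semi-supervised, and weakly-supervised) and covers data-fusion and label-efficient algorithms for this task. It also builds UDTIRI-Crack, a dataset of 2,500 high-quality images from seven public annotated sources, as the first extensive online benchmark in the field. Comprehensive experiments compare the detection performance, computational efficiency, and generalizability of public deep learning algorithms, and the paper explores the feasibility of foundation models and large language models (LLMs) for road crack detection. It closes by discussing current challenges and future trends, offering practical guidance for developing intelligent road inspection vehicles with next-generation road condition assessment systems.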

Link: https://arxiv.org/abs/2503.18082
Authors: Nachuan Ma,Zhengfei Song,Qiang Hu,Chuang-Wei Liu,Yu Han,Yanting Zhang,Rui Fan,Lihua Xie
Affiliations: College of Electronics & Information Engineering, Shanghai Research Institute for Intelligent Autonomous Systems, the State Key Laboratory of Intelligent Autonomous Systems, and Frontiers Science Center for Intelligent Autonomous Systems, Tongji University; School of Computer Science and Technology, Donghua University; School of Electrical and Electronic Engineering, Nanyang Technological University
Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments:

Abstract:In the emerging field of urban digital twins (UDTs), advancing intelligent road inspection (IRI) vehicles with automatic road crack detection systems is essential for maintaining civil infrastructure. Over the past decade, deep learning-based road crack detection methods have been developed to detect cracks more efficiently, accurately, and objectively, with the goal of replacing manual visual inspection. Nonetheless, there is a lack of systematic reviews on state-of-the-art (SoTA) deep learning techniques, especially data-fusion and label-efficient algorithms for this task. This paper thoroughly reviews the SoTA deep learning-based algorithms, including (1) supervised, (2) unsupervised, (3) semi-supervised, and (4) weakly-supervised methods developed for road crack detection. Also, we create a dataset called UDTIRI-Crack, comprising 2,500 high-quality images from seven public annotated sources, as the first extensive online benchmark in this field. Comprehensive experiments are conducted to compare the detection performance, computational efficiency, and generalizability of public SoTA deep learning-based algorithms for road crack detection. In addition, the feasibility of foundation models and large language models (LLMs) for road crack detection is explored. Afterwards, the existing challenges and future development trends of deep learning-based road crack detection algorithms are discussed. We believe this review can serve as practical guidance for developing intelligent road detection vehicles with the next-generation road condition assessment systems. The released benchmark UDTIRI-Crack is available at this https URL.

[CV-164] PanopticSplatting: End-to-End Panoptic Gaussian Splatting

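[Quick Read]: This paper tackles open-vocabulary panoptic reconstruction, the challenging task of reconstructing and understanding a scene at the same time. Existing Gaussian splatting methods are mostly multi-stage, so they accumulate errors and depend on hand-designed components. To streamline the pipeline and enable global optimization, the paper proposes PanopticSplatting, an end-to-end system. Its key component is query-guided Gaussian segmentation with local cross attention, which lifts 2D instance masks end to end without cross-frame association. Restricting cross attention to the view frustum reduces training memory, so the model can handle large scenes with more Gaussians and objects. To cope with noisy labels in 2D pseudo masks, the paper proposes label blending to reduce noise and label warping to improve multi-view coherence and segmentation accuracy. On the ScanNet-V2 and ScanNet++ datasets, the method outperforms both NeRF-based and Gaussian-based approaches in 3D panoptic reconstruction, and it generalizes well and remains robust across Gaussian base models.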

Link: https://arxiv.org/abs/2503.18073
Authors: Yuxuan Xie,Xuan Yu,Changjian Jiang,Sitong Mao,Shunbo Zhou,Rui Fan,Rong Xiong,Yue Wang
Affiliations: Zhejiang University, Hangzhou, Zhejiang, China; Huawei Cloud Computing Technologies Co., Ltd., Shenzhen, China; Tongji University, Shanghai, China
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: 8 pages, 6 figures

Abstract:Open-vocabulary panoptic reconstruction is a challenging task for simultaneous scene reconstruction and understanding. Recently, methods have been proposed for 3D scene understanding based on Gaussian splatting. However, these methods are multi-staged, suffering from accumulated errors and a dependence on hand-designed components. To streamline the pipeline and achieve global optimization, we propose PanopticSplatting, an end-to-end system for open-vocabulary panoptic reconstruction. Our method introduces query-guided Gaussian segmentation with local cross attention, lifting 2D instance masks without cross-frame association in an end-to-end way. The local cross attention within the view frustum effectively reduces the training memory, making our model more accessible to large scenes with more Gaussians and objects. In addition, to address the challenge of noisy labels in 2D pseudo masks, we propose label blending to promote consistent 3D segmentation with fewer noisy floaters, as well as label warping on 2D predictions, which enhances multi-view coherence and segmentation accuracy. Our method demonstrates strong performance in 3D scene panoptic reconstruction on the ScanNet-V2 and ScanNet++ datasets, compared with both NeRF-based and Gaussian-based panoptic reconstruction methods. Moreover, PanopticSplatting can be easily generalized to numerous variants of Gaussian splatting, and we demonstrate its robustness on different Gaussian base models.

[CV-165] Dynamic Allocation Hypernetwork with Adaptive Model Recalibration for FCL

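[Quick Read]: This paper addresses two core challenges that Federated Continual Learning (FCL) faces in real medical scenarios: (1) catastrophic forgetting in the server model, which loses knowledge of earlier tasks and struggles to retain comprehensive knowledge across all tasks; and (2) biased optimization, where asynchronous task handling makes the optimization targets of different clients collide. The paper proposes Dynamic Allocation Hypernetwork with adaptive model recalibration (FedDAH). To alleviate catastrophic forgetting, a dynamic allocation hypernetwork (DAHyper) uses a continually updated hypernetwork to manage the mapping between task identities and their associated parameters, enabling dynamic allocation of the model across clients. To counter biased optimization, adaptive model recalibration (AMR) folds the changes of historical models into the current server update and assigns similarity-based weights to identical tasks at different time steps for continual optimization.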

Link: https://arxiv.org/abs/2503.18064
Authors: Xiaoming Qi,Jingyang Zhang,Huazhu Fu,Guanyu Yang,Shuo Li,Yueming Jin
Affiliations: unknown
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Federated continual learning (FCL) offers an emerging pattern to facilitate the applicability of federated learning (FL) in real-world scenarios, where tasks evolve dynamically and asynchronously across clients, especially in medical scenarios. Existing server-side FCL methods in the natural domain construct a continually learnable server model by client aggregation on all involved tasks. However, they are challenged by: (1) catastrophic forgetting of previously learned tasks, which leads to error accumulation in the server model and makes it difficult to sustain comprehensive knowledge across all tasks; and (2) biased optimization due to asynchronous tasks handled across different clients, which leads to the collision of optimization targets of different clients at the same time steps. In this work, we take the first step to propose a novel server-side FCL pattern in the medical domain, Dynamic Allocation Hypernetwork with adaptive model recalibration (FedDAH). It facilitates collaborative learning under the distinct and dynamic task streams across clients. To alleviate catastrophic forgetting, we propose a dynamic allocation hypernetwork (DAHyper), where a continually updated hypernetwork is designed to manage the mapping between task identities and their associated model parameters, enabling the dynamic allocation of the model across clients. For the biased optimization, we introduce a novel adaptive model recalibration (AMR) to incorporate the candidate changes of historical models into current server updates, and assign weights to identical tasks across different time steps based on their similarity for continual optimization. Extensive experiments on the AMOS dataset demonstrate the superiority of our FedDAH over other FCL methods on sites with different task streams. The code is available at this https URL.
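
The abstract says AMR weights identical tasks across time steps by similarity, without giving the formula. The sketch below is one hypothetical way such weighting could look: cosine similarity between flattened updates, a softmax over those similarities, and an equal-weight blend with the current update. All of these choices, and all names, are assumptions for illustration rather than the FedDAH method.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0

def recalibrate(current, history):
    """Blend the current update with historical updates of the same task.

    current: flattened update vector for this time step.
    history: list of earlier update vectors for the same task.
    Returns (recalibrated_update, weights); weights are a softmax over
    cosine similarities (an assumed weighting rule).
    """
    sims = [cosine(current, h) for h in history]
    exps = [math.exp(s) for s in sims]
    total = sum(exps)
    weights = [e / total for e in exps]
    blended = [sum(w * h[k] for w, h in zip(weights, history))
               for k in range(len(current))]
    # Assumed 50/50 mix of the current update and the weighted history.
    return [0.5 * (c + b) for c, b in zip(current, blended)], weights
```

Historical updates that point in a similar direction to the current one receive larger weights, so dissimilar history contributes less to the server step.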

[CV-166] PolarFree: Polarization-based Reflection-free Imaging CVPR2025

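[Quick Read]: This paper addresses reflection removal, which is difficult because complex light interactions cause reflections that obscure important details and hinder scene understanding. Its key idea is to use polarization, a naturally strong cue for separating reflected from transmitted light, to remove reflections more accurately. Existing methods usually rely on small-scale or synthetic datasets that fail to capture the diversity and complexity of real scenes. The paper therefore builds PolaRGB, a large-scale dataset for polarization-based reflection removal from RGB images, so that models generalize across a wide range of real-world scenarios. To exploit polarization cues fully, it also proposes PolarFree, which uses a diffusion process to generate reflection-free cues for accurate reflection removal. Experiments show that PolarFree significantly improves image clarity in challenging reflective scenes and sets a new benchmark for polarized imaging and reflection removal.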

Link: https://arxiv.org/abs/2503.18055
Authors: Mingde Yao,Menglu Wang,King-Man Tam,Lingen Li,Tianfan Xue,Jinwei Gu
Affiliations: The Chinese University of Hong Kong; Shanghai AI Laboratory; University of Science and Technology of China; Institute of Science Tokyo
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to CVPR 2025

Abstract:Reflection removal is challenging due to complex light interactions, where reflections obscure important details and hinder scene understanding. Polarization naturally provides a powerful cue to distinguish between reflected and transmitted light, enabling more accurate reflection removal. However, existing methods often rely on small-scale or synthetic datasets, which fail to capture the diversity and complexity of real-world scenarios. To this end, we construct a large-scale dataset, PolaRGB, for Polarization-based reflection removal of RGB images, which enables us to train models that generalize effectively across a wide range of real-world scenarios. The PolaRGB dataset contains 6,500 well-aligned mixed-transmission image pairs, 8x larger than existing polarization datasets, and is the first to include both RGB and polarization images captured across diverse indoor and outdoor environments with varying lighting conditions. Besides, to fully exploit the potential of polarization cues for reflection removal, we introduce PolarFree, which leverages diffusion process to generate reflection-free cues for accurate reflection removal. Extensive experiments show that PolarFree significantly enhances image clarity in challenging reflective scenarios, setting a new benchmark for polarized imaging and reflection removal. Code and dataset are available at this https URL.

[CV-167] SceneSplat: Gaussian Splatting-based Scene Understanding with Vision-Language Pretraining

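[Quick Read]: This paper addresses two key problems. First, existing methods for 3D scene semantic understanding rely heavily on 2D or textual modalities, and there is neither a model that learns semantics end to end from 3D data alone nor the data needed to train one. Second, it remains open how to integrate semantic reasoning into 3D Gaussian Splatting (3DGS) in a generalizable way. The paper proposes SceneSplat, to the authors' knowledge the first large-scale indoor scene understanding approach that operates natively on 3DGS, together with a self-supervised learning scheme that unlocks rich 3D feature learning from unlabeled scenes. To support these methods, it introduces SceneSplat-7K, the first large-scale 3DGS dataset for indoor scenes, with 6,868 scenes whose generation cost the equivalent of 119 GPU-days on an L4 GPU, enabling standardized benchmarking of 3DGS-based indoor scene reasoning. The key is end-to-end semantic learning on pure 3D data through SceneSplat, combined with self-supervised learning that makes use of unlabeled data for feature extraction.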

Link: https://arxiv.org/abs/2503.18052
Authors: Yue Li,Qi Ma,Runyi Yang,Huapeng Li,Mengjiao Ma,Bin Ren,Nikola Popovic,Nicu Sebe,Ender Konukoglu,Theo Gevers,Luc Van Gool,Martin R. Oswald,Danda Pani Paudel
Affiliations: University of Amsterdam; Computer Vision Lab, ETH Zurich; INSAIT, Sofia University "St. Kliment Ohridski"; Nanjing University of Aeronautics and Astronautics; University of Pisa; University of Trento
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Our code, model, and dataset will be released at this https URL

Abstract:Recognizing arbitrary or previously unseen categories is essential for comprehensive real-world 3D scene understanding. Currently, all existing methods rely on 2D or textual modalities during training, or together at inference. This highlights a clear absence of a model capable of processing 3D data alone for learning semantics end-to-end, along with the necessary data to train such a model. Meanwhile, 3D Gaussian Splatting (3DGS) has emerged as the de facto standard for 3D scene representation across various vision tasks. However, effectively integrating semantic reasoning into 3DGS in a generalizable fashion remains an open challenge. To address these limitations, we introduce SceneSplat, to our knowledge the first large-scale 3D indoor scene understanding approach that operates natively on 3DGS. Furthermore, we propose a self-supervised learning scheme that unlocks rich 3D feature learning from unlabeled scenes. In order to power the proposed methods, we introduce SceneSplat-7K, the first large-scale 3DGS dataset for indoor scenes, comprising 6,868 scenes derived from 7 established datasets such as ScanNet and Matterport3D. Generating SceneSplat-7K required computational resources equivalent to 119 GPU-days on an L4 GPU, enabling standardized benchmarking for 3DGS-based reasoning for indoor scenes. Our exhaustive experiments on SceneSplat-7K demonstrate the significant benefit of the proposed methods over the established baselines.

[CV-168] DualCP: Rehearsal-Free Domain-Incremental Learning via Dual-Level Concept Prototype AAAI2025

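[Quick Read]: This paper addresses the conflict between retaining old knowledge and learning new knowledge in Domain-Incremental Learning (DIL) under privacy and training-time constraints, specifically in the Rehearsal-Free DIL (RFDIL) setting. Inspired by the incremental cognitive process of the human brain, it designs Dual-level Concept Prototypes (DualCP). A Concept Prototype Generator (CPG) produces coarse-grained and fine-grained concept prototypes for each class, and a Coarse-to-Fine calibrator (C2F) aligns image features with DualCP. A Dual Dot-Regression (DDR) loss is then used to optimize the C2F module. Experiments on the DomainNet, CDDB, and CORe50 datasets demonstrate the effectiveness of the method.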

Link: https://arxiv.org/abs/2503.18042
Authors: Qiang Wang,Yuhang He,SongLin Dong,Xiang Song,Jizhou Han,Haoyu Luo,Yihong Gong
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at AAAI 2025

Abstract:Domain-Incremental Learning (DIL) enables vision models to adapt to changing conditions in real-world environments while maintaining the knowledge acquired from previous domains. Given privacy concerns and training time, Rehearsal-Free DIL (RFDIL) is more practical. Inspired by the incremental cognitive process of the human brain, we design Dual-level Concept Prototypes (DualCP) for each class to address the conflict between learning new knowledge and retaining old knowledge in RFDIL. To construct DualCP, we propose a Concept Prototype Generator (CPG) that generates both coarse-grained and fine-grained prototypes for each class. Additionally, we introduce a Coarse-to-Fine calibrator (C2F) to align image features with DualCP. Finally, we propose a Dual Dot-Regression (DDR) loss function to optimize our C2F module. Extensive experiments on the DomainNet, CDDB, and CORe50 datasets demonstrate the effectiveness of our method.

[CV-169] Text-Driven Cross-Modal Place Recognition Method for Remote Sensing Localization

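[Quick Read]: This paper addresses localization from environment descriptions in large-scale point cloud maps built through remote sensing. Existing methods struggle because point cloud encoders fail to capture both local details and long-range spatial relationships, and because a significant modality gap separates text and point cloud representations. The paper proposes Des4Pos, a two-stage text-driven remote sensing localization framework. In the coarse stage, a Multi-scale Fusion Attention Mechanism (MFAM) enhances local geometric features and a bidirectional Long Short-Term Memory (LSTM) module strengthens global spatial relationships, while a Stepped Text Encoder (STE) uses cross-modal prior knowledge from CLIP to align text and point cloud features and bridge the modality gap. In the fine stage, a Cascaded Residual Attention (CRA) module fuses cross-modal features and predicts relative localization offsets for higher precision. On the KITTI360Pose test set, Des4Pos achieves state-of-the-art text-to-point-cloud place recognition, reaching a top-1 accuracy of 40% and a top-10 accuracy of 77% under a 5-meter radius threshold, each 7% above the best existing methods.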

Link: https://arxiv.org/abs/2503.18035
Authors: Tianyi Shang,Zhenyu Li,Pengjie Xu,Zhaojun Deng,Ruirui Zhang
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 13 pages

Abstract:Environment description-based localization in large-scale point cloud maps constructed through remote sensing is critically significant for the advancement of large-scale autonomous systems, such as delivery robots operating in the last mile. However, current approaches encounter challenges due to the inability of point cloud encoders to effectively capture local details and long-range spatial relationships, as well as a significant modality gap between text and point cloud representations. To address these challenges, we present Des4Pos, a novel two-stage text-driven remote sensing localization framework. In the coarse stage, the point-cloud encoder utilizes the Multi-scale Fusion Attention Mechanism (MFAM) to enhance local geometric features, followed by a bidirectional Long Short-Term Memory (LSTM) module to strengthen global spatial relationships. Concurrently, the Stepped Text Encoder (STE) integrates cross-modal prior knowledge from CLIP [1] and aligns text and point-cloud features using this prior knowledge, effectively bridging modality discrepancies. In the fine stage, we introduce a Cascaded Residual Attention (CRA) module to fuse cross-modal features and predict relative localization offsets, thereby achieving greater localization precision. Experiments on the KITTI360Pose test set demonstrate that Des4Pos achieves state-of-the-art performance in text-to-point-cloud place recognition. Specifically, it attains a top-1 accuracy of 40% and a top-10 accuracy of 77% under a 5-meter radius threshold, surpassing the best existing methods by 7% and 7%, respectively.
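
The reported numbers (top-1 and top-10 accuracy under a 5-meter radius) can be read as a retrieval metric. The sketch below shows one plausible computation, assuming a query counts as a hit when any of its first k predicted positions lies within the radius of the ground truth; the exact benchmark protocol may differ, and the function names are hypothetical.

```python
import math

def top_k_hit(pred_positions, gt_position, k, radius):
    """True if any of the first k ranked predictions is within radius of the ground truth."""
    return any(math.dist(p, gt_position) <= radius for p in pred_positions[:k])

def top_k_accuracy(all_preds, all_gts, k, radius=5.0):
    """Fraction of queries with at least one hit among the top-k predictions."""
    hits = sum(top_k_hit(preds, gt, k, radius)
               for preds, gt in zip(all_preds, all_gts))
    return hits / len(all_gts)

# Two queries, each with ranked 2D position predictions in meters.
preds = [[(0.0, 0.0), (10.0, 0.0)], [(20.0, 0.0), (3.0, 4.0)]]
gts = [(1.0, 0.0), (0.0, 0.0)]
print(top_k_accuracy(preds, gts, k=1))  # 0.5
print(top_k_accuracy(preds, gts, k=2))  # 1.0
```

The second query only succeeds at k=2 because its best-ranked prediction is 20 m away while its second prediction is exactly 5 m from the ground truth.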

[CV-170] OmnimatteZero: Training-free Real-time Omnimatte with Pre-trained Video Diffusion Models

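[Quick Read]: This paper addresses Omnimatte, the task of decomposing a video into semantically meaningful layers: the background plus individual objects with their associated effects such as shadows and reflections. Existing methods usually require extensive training or costly self-supervised optimization. The proposed OmnimatteZero is training-free and relies on off-the-shelf pre-trained video diffusion models. Its main ideas are to adapt zero-shot image inpainting techniques to video object removal, to use self-attention maps, which capture information about an object and its footprints, to inpaint the object's effects, and to isolate and recombine object layers with new video layers through simple latent arithmetic. The result is real-time performance with minimal per-frame runtime.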

Link: https://arxiv.org/abs/2503.18033
Authors: Dvir Samuel,Matan Levy,Nir Darshan,Gal Chechik,Rami Ben-Ari
Affiliations: OriginAI; The Hebrew University of Jerusalem; Bar-Ilan University; NVIDIA Research
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Omnimatte aims to decompose a given video into semantically meaningful layers, including the background and individual objects along with their associated effects, such as shadows and reflections. Existing methods often require extensive training or costly self-supervised optimization. In this paper, we present OmnimatteZero, a training-free approach that leverages off-the-shelf pre-trained video diffusion models for omnimatte. It can remove objects from videos, extract individual object layers along with their effects, and composite those objects onto new videos. We accomplish this by adapting zero-shot image inpainting techniques for video object removal, a task they fail to handle effectively out-of-the-box. We then show that self-attention maps capture information about the object and its footprints and use them to inpaint the object’s effects, leaving a clean background. Additionally, through simple latent arithmetic, object layers can be isolated and recombined seamlessly with new video layers to produce new videos. Evaluations show that OmnimatteZero not only achieves superior performance in terms of background reconstruction but also sets a new record for the fastest Omnimatte approach, achieving real-time performance with minimal frame runtime.
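
The abstract mentions "simple latent arithmetic" for isolating and recombining object layers but does not spell out the operations. A minimal sketch, assuming the object layer is the difference between the latent of the original video and the latent of the object-removed background, and that compositing is an addition in latent space; the flat lists stand in for real latent tensors and the function names are hypothetical.

```python
def extract_object_layer(z_video, z_background):
    """Assumed form: object latent = original latent - clean-background latent."""
    return [v - b for v, b in zip(z_video, z_background)]

def composite(z_new_background, z_object):
    """Assumed form: add the object latent onto the latent of a new video."""
    return [n + o for n, o in zip(z_new_background, z_object)]

z_video = [3.0, 5.0]        # toy latent of the original video
z_background = [1.0, 2.0]   # toy latent after object removal
z_object = extract_object_layer(z_video, z_background)
z_out = composite([10.0, 10.0], z_object)
```

A decoder would then map `z_out` back to pixels; that step is outside this sketch.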

[CV-171] Anomaly Detection and Localization for Speech Deepfakes via Feature Pyramid Matching

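[Quick Read]: This paper addresses two key limitations of existing speech deepfake detectors: limited generalization to unseen synthesis techniques and a lack of explainability. It proposes an interpretable one-class detection framework that reframes speech deepfake detection as an anomaly detection task. The model is trained only on real speech to characterize its distribution. A Student-Teacher Feature Pyramid Matching system, enhanced with Discrepancy Scaling, improves generalization to unseen data distributions, and at inference the framework produces interpretable anomaly maps that highlight anomalous regions in both time and frequency. The approach improves detection performance and makes the results easier to interpret.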

Link: https://arxiv.org/abs/2503.18032
Authors: Emma Coletta,Davide Salvi,Viola Negroni,Daniele Ugo Leonzio,Paolo Bestagini
Affiliations: Dipartimento di Elettronica, Informazione e Bioingegneria (DEIB), Politecnico di Milano
Categories: Sound (cs.SD); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Comments:

Abstract:The rise of AI-driven generative models has enabled the creation of highly realistic speech deepfakes - synthetic audio signals that can imitate target speakers’ voices - raising critical security concerns. Existing methods for detecting speech deepfakes primarily rely on supervised learning, which suffers from two critical limitations: limited generalization to unseen synthesis techniques and a lack of explainability. In this paper, we address these issues by introducing a novel interpretable one-class detection framework, which reframes speech deepfake detection as an anomaly detection task. Our model is trained exclusively on real speech to characterize its distribution, enabling the classification of out-of-distribution samples as synthetically generated. Additionally, our framework produces interpretable anomaly maps during inference, highlighting anomalous regions across both time and frequency domains. This is done through a Student-Teacher Feature Pyramid Matching system, enhanced with Discrepancy Scaling to improve generalization capabilities across unseen data distributions. Extensive evaluations demonstrate the superior performance of our approach compared to the considered baselines, validating the effectiveness of framing speech deepfake detection as an anomaly detection problem.
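
How a student-teacher feature pyramid yields an anomaly map can be sketched as follows. This follows the general feature-matching recipe (per-location distance between L2-normalized teacher and student features, upsampled and accumulated over scales); the sum over scales and the nearest-neighbor upsampling are assumptions, and the paper's Discrepancy Scaling step is not modeled here.

```python
import math

def l2_normalize(v, eps=1e-8):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / (norm + eps) for x in v]

def discrepancy_map(teacher, student):
    """teacher/student: H x W grids of feature vectors.
    Returns 0.5 * ||t_hat - s_hat||^2 per location (about 1 - cosine similarity)."""
    return [[0.5 * sum((a - b) ** 2
                       for a, b in zip(l2_normalize(t), l2_normalize(s)))
             for t, s in zip(t_row, s_row)]
            for t_row, s_row in zip(teacher, student)]

def upsample_nearest(grid, out_h, out_w):
    h, w = len(grid), len(grid[0])
    return [[grid[i * h // out_h][j * w // out_w] for j in range(out_w)]
            for i in range(out_h)]

def anomaly_map(teacher_pyramid, student_pyramid, out_h, out_w):
    """Sum the upsampled discrepancy maps of all pyramid levels (assumed fusion rule)."""
    total = [[0.0] * out_w for _ in range(out_h)]
    for teacher, student in zip(teacher_pyramid, student_pyramid):
        up = upsample_nearest(discrepancy_map(teacher, student), out_h, out_w)
        total = [[a + b for a, b in zip(r1, r2)] for r1, r2 in zip(total, up)]
    return total
```

For a spectrogram input, the two grid axes would correspond to time and frequency, so high values mark the time-frequency regions where the student fails to mimic the teacher.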

[CV-172] Retrieval Augmented Generation and Understanding in Vision: A Survey and New Outlook

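[Quick Read]: This survey addresses the limitations of relying solely on a model's internal knowledge in computer vision (CV). Retrieval-augmented generation (RAG) brings authoritative external knowledge bases into vision tasks to improve both understanding and generation. The survey covers how RAG strategies connect reliable, up-to-date external knowledge sources to two main areas: visual understanding (from image recognition to medical report generation and multimodal question answering) and visual generation (image, video, and 3D content). It also explores applications in embodied AI, points out the shortcomings of current methods, and proposes future research directions.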

Link: https://arxiv.org/abs/2503.18016
Authors: Xu Zheng,Ziqiao Weng,Yuanhuiyi Lyu,Lutao Jiang,Haiwei Xue,Bin Ren,Danda Paudel,Nicu Sebe,Luc Van Gool,Xuming Hu
Affiliations: HKUST(GZ); INSAIT, Sofia University "St. Kliment Ohridski"; Sichuan University; Tsinghua University; ETH Zurich; University of Pisa; University of Trento
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 19 pages, 10 figures

Abstract:Retrieval-augmented generation (RAG) has emerged as a pivotal technique in artificial intelligence (AI), particularly in enhancing the capabilities of large language models (LLMs) by enabling access to external, reliable, and up-to-date knowledge sources. In the context of AI-Generated Content (AIGC), RAG has proven invaluable by augmenting model outputs with supplementary, relevant information, thus improving their quality. Recently, the potential of RAG has extended beyond natural language processing, with emerging methods integrating retrieval-augmented strategies into the computer vision (CV) domain. These approaches aim to address the limitations of relying solely on internal model knowledge by incorporating authoritative external knowledge bases, thereby improving both the understanding and generation capabilities of vision models. This survey provides a comprehensive review of the current state of retrieval-augmented techniques in CV, focusing on two main areas: (I) visual understanding and (II) visual generation. In the realm of visual understanding, we systematically review tasks ranging from basic image recognition to complex applications such as medical report generation and multimodal question answering. For visual content generation, we examine the application of RAG in tasks related to image, video, and 3D generation. Furthermore, we explore recent advancements in RAG for embodied AI, with a particular focus on applications in planning, task execution, multimodal perception, interaction, and specialized domains. Given that the integration of retrieval-augmented techniques in CV is still in its early stages, we also highlight the key limitations of current approaches and propose future research directions to drive the development of this promising area.

[CV-173] Vision-R1: Evolving Human-Free Alignment in Large Vision-Language Models via Vision-Guided Reinforcement Learning

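[Quick Read]: This paper addresses two obstacles to applying preference optimization to Large Vision-Language Models (LVLMs): building high-quality human-annotated preference data is costly and difficult, and so is developing robust reward models that mimic those preferences. It proposes Vision-R1, a vision-guided, R1-like reinforcement learning algorithm that rewards models with definitive vision feedback. Vision-R1 needs neither a specialized reward model nor a handcrafted preference dataset; it trains only on curated instruction data. Its core is a criterion-driven reward function that integrates multi-dimensional feedback to evaluate model completions according to the logic of the vision task. A progressive rule refinement strategy adjusts the reward criteria dynamically during training, which supports continued improvement and mitigates reward hacking. Experiments show that fine-tuning 7B LVLMs with Vision-R1 yields consistent gains of up to 50%, even surpassing a state-of-the-art model 10x the size.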

Link: https://arxiv.org/abs/2503.18013
Authors: Yufei Zhan,Yousong Zhu,Shurong Zheng,Hongyin Zhao,Fan Yang,Ming Tang,Jinqiao Wang
Affiliations: Foundation Model Research Center, Institute of Automation, Chinese Academy of Sciences; School of Artificial Intelligence, University of Chinese Academy of Sciences; Peng Cheng Laboratory; Wuhan AI Research
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Project in development. GitHub: this https URL

Abstract:Large Vision-Language Models (LVLMs) typically follow a two-stage training paradigm: pretraining and supervised fine-tuning. Recently, preference optimization, derived from the language domain, has emerged as an effective post-training reinforcement strategy to enhance capabilities of LVLMs. However, constructing high-quality human-annotated preference data and developing robust reward models to mimic these preferences are both costly and challenging. Motivated by this observation, we propose Vision-R1, a novel vision-guided R1-like reinforcement learning algorithm for LVLMs that rewards models with definitive vision feedback. It only leverages curated instruction data, eliminating the need for specialized reward models and handcrafted preference datasets. We incorporate a criterion-driven reward function that further integrates multi-dimensional feedback to evaluate model completions comprehensively based on the vision task logic. Furthermore, we introduce a progressive rule refinement strategy that dynamically adjusts the reward criteria during training, enabling continuous model improvement and mitigating reward hacking. Extensive experiments on both in-distribution and out-of-distribution benchmarks demonstrate that fine-tuning the 7B LVLMs with Vision-R1 achieves consistent performance gains, with improvements of up to 50%, even surpassing the state-of-the-art model of 10x the size.
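
A criterion-driven reward with a progressively tightened rule can be illustrated for an object-detection style task. The sketch below is hypothetical: it scores a completion by the mean of box precision and recall at an IoU threshold, and raises that threshold linearly over training. The abstract does not give the actual criteria, weights, or schedule, so every constant and name here is an assumption.

```python
def box_area(box):
    """Box as (x1, y1, x2, y2)."""
    return max(0.0, box[2] - box[0]) * max(0.0, box[3] - box[1])

def iou(a, b):
    inter = box_area((max(a[0], b[0]), max(a[1], b[1]),
                      min(a[2], b[2]), min(a[3], b[3])))
    union = box_area(a) + box_area(b) - inter
    return inter / union if union > 0 else 0.0

def detection_reward(pred_boxes, gt_boxes, iou_threshold):
    """Assumed reward: mean of precision and recall over greedily matched boxes."""
    if not pred_boxes or not gt_boxes:
        return 0.0
    matched, true_pos = set(), 0
    for p in pred_boxes:
        best = max(range(len(gt_boxes)), key=lambda i: iou(p, gt_boxes[i]))
        if best not in matched and iou(p, gt_boxes[best]) >= iou_threshold:
            matched.add(best)
            true_pos += 1
    precision = true_pos / len(pred_boxes)
    recall = true_pos / len(gt_boxes)
    return 0.5 * (precision + recall)

def threshold_schedule(step, total_steps, start=0.5, end=0.75):
    """Assumed progressive rule: the IoU threshold grows linearly with training progress."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return start + frac * (end - start)
```

Because the reward is computed from ground-truth annotations rather than a learned reward model, tightening the threshold over time is one way to keep the signal informative as the policy improves.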

[CV-174] Finsler Multi-Dimensional Scaling: Manifold Learning for Asymmetric Dimensionality Reduction and Embedding CVPR

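[Quick Read]: This paper addresses a limitation of conventional multi-dimensional scaling (MDS): the current standard embeds data only in Riemannian manifolds, whose metrics are symmetric, so the optimized embedding cannot express asymmetric structure in the data. The paper generalizes the MDS problem to Finsler manifolds, a natural asymmetric generalization of Riemannian manifolds. Inspired by Euclidean space, it defines a canonical Finsler space for embedding asymmetric data; because geodesics in this space are simple, the resulting representation is intuitive and easy to analyze. The paper shows that the generalization keeps the same theoretical convergence guarantees as conventional MDS and demonstrates Finsler embeddings on asymmetric data visualization, dimensionality reduction, directed graph embedding, and link prediction.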

链接: https://arxiv.org/abs/2503.18010
作者: Thomas Dagès,Simon Weber,Ya-Wei Eileen Lin,Ronen Talmon,Daniel Cremers,Michael Lindenbaum,Alfred M. Bruckstein,Ron Kimmel
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted for publication at the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2025

Abstract:Dimensionality reduction is a fundamental task that aims to simplify complex data by reducing its feature dimensionality while preserving essential patterns, with core applications in data analysis and visualisation. To preserve the underlying data structure, multi-dimensional scaling (MDS) methods focus on preserving pairwise dissimilarities, such as distances. They optimise the embedding to have pairwise distances as close as possible to the data dissimilarities. However, the current standard is limited to embedding data in Riemannian manifolds. Motivated by the lack of asymmetry in the Riemannian metric of the embedding space, this paper extends the MDS problem to a natural asymmetric generalisation of Riemannian manifolds called Finsler manifolds. Inspired by Euclidean space, we define a canonical Finsler space for embedding asymmetric data. Due to its simplicity with respect to geodesics, data representation in this space is both intuitive and simple to analyse. We demonstrate that our generalisation benefits from the same theoretical convergence guarantees. We reveal the effectiveness of our Finsler embedding across various types of non-symmetric data, highlighting its value in applications such as data visualisation, dimensionality reduction, directed graph embedding, and link prediction.
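To make the idea of an asymmetric embedding objective concrete, here is a minimal sketch of an MDS-style stress computed over ordered pairs. It assumes a Randers-type distance (Euclidean norm plus a linear drift term); the abstract does not specify the form of the canonical Finsler space, so `randers_distance`, `asymmetric_stress`, and the drift vector `omega` are illustrative names rather than the authors' definitions.

```python
import math

def randers_distance(x, y, omega):
    """Asymmetric distance ||y - x|| + <omega, y - x> (assumes ||omega|| < 1)."""
    diff = [b - a for a, b in zip(x, y)]
    norm = math.sqrt(sum(d * d for d in diff))
    drift = sum(w * d for w, d in zip(omega, diff))
    return norm + drift

def asymmetric_stress(points, dissim, omega):
    """Sum of squared errors over all ordered pairs (i, j) with i != j.

    dissim[i][j] may differ from dissim[j][i], which a symmetric
    (Riemannian) embedding distance could not reproduce.
    """
    total = 0.0
    n = len(points)
    for i in range(n):
        for j in range(n):
            if i != j:
                err = randers_distance(points[i], points[j], omega) - dissim[i][j]
                total += err * err
    return total
```

Moving along the drift direction costs more than moving against it, so the two directions of a pair can match two different target dissimilarities.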

[CV-175] SymmCompletion: High-Fidelity and High-Consistency Point Cloud Completion with Symmetry Guidance AAAI2025

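[Quick read]: This paper targets point cloud completion, where existing methods achieve global completeness but lose local geometric detail and, in particular, produce geometric inconsistency between the partial input and the reconstructed missing regions. It proposes SymmCompletion, an efficient completion method based on symmetry guidance, built from two components: a Local Symmetry Transformation Network (LSTNet) and a Symmetry-Guidance Transformer (SGFormer). LSTNet estimates point-wise local symmetry transformations to carry key geometry from the input into the missing regions, producing geometry-aligned partial-missing pairs and an initial point cloud. SGFormer then uses the geometric features of these pairs as explicit symmetry guidance to constrain refinement of the initial point cloud, yielding a high-fidelity, geometrically consistent result.

Link: https://arxiv.org/abs/2503.18007
Authors: Hongyu Yan,Zijun Li,Kunming Luo,Li Lu,Ping Tan
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by AAAI 2025 (Oral presentation), Code: this https URL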

Abstract:Point cloud completion aims to recover a complete point shape from a partial point cloud. Although existing methods can form satisfactory point clouds in global completeness, they often lose the original geometry details and face the problem of geometric inconsistency between existing point clouds and reconstructed missing parts. To tackle this problem, we introduce SymmCompletion, a highly effective completion method based on symmetry guidance. Our method comprises two primary components: a Local Symmetry Transformation Network (LSTNet) and a Symmetry-Guidance Transformer (SGFormer). First, LSTNet efficiently estimates point-wise local symmetry transformation to transform key geometries of partial inputs into missing regions, thereby generating geometry-aligned partial-missing pairs and initial point clouds. Second, SGFormer leverages the geometric features of partial-missing pairs as the explicit symmetric guidance that can constrain the refinement process for initial point clouds. As a result, SGFormer can exploit provided priors to form high-fidelity and geometry-consistent final point clouds. Qualitative and quantitative evaluations on several benchmark datasets demonstrate that our method outperforms state-of-the-art completion networks.

[CV-176] Geometric Constrained Non-Line-of-Sight Imaging

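[Quick read]: This paper addresses joint reconstruction of normals and albedo in non-line-of-sight (NLOS) imaging, which is considerably harder and more expensive because the problem grows from matrix-valued to tensor-valued functions. It proposes a new joint albedo-surface reconstruction method that uses the Frobenius norm of the shape operator to control the rate of variation of the normal field, the first application of regularization to reconstructing the surface normals of hidden objects. A more accurate normal field improves detail and enables high-precision reconstruction of hidden geometry. Experiments show robustness and effectiveness on synthetic and experimental datasets; on transient data captured within 15 seconds, the regularized model is more accurate than recently proposed methods and 30 times faster than the existing surface reconstruction approach.

Link: https://arxiv.org/abs/2503.17992
Authors: Xueying Liu,Lianfang Wang,Jun Liu,Yong Wang,Yuping Duan
Affiliations: Center for Applied Mathematics, Tianjin University; School of Physics, Nankai University; School of Mathematical Sciences, Beijing Normal University; IEEE Publication Technology Group
Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments: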

Abstract:Normal reconstruction is crucial in non-line-of-sight (NLOS) imaging, as it provides key geometric and lighting information about hidden objects, which significantly improves reconstruction accuracy and scene understanding. However, jointly estimating normals and albedo expands the problem from matrix-valued functions to tensor-valued functions, substantially increasing complexity and computational difficulty. In this paper, we propose a novel joint albedo-surface reconstruction method, which utilizes the Frobenius norm of the shape operator to control the variation rate of the normal field. It is the first attempt to apply regularization methods to the reconstruction of surface normals for hidden objects. By improving the accuracy of the normal field, it enhances detail representation and achieves high-precision reconstruction of hidden object geometry. The proposed method demonstrates robustness and effectiveness on both synthetic and experimental datasets. On transient data captured within 15 seconds, our surface normal-regularized reconstruction model produces more accurate surfaces than recently proposed methods and is 30 times faster than the existing surface reconstruction approach.
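For intuition about why the Frobenius norm of the shape operator controls how quickly the normal field varies, recall a standard identity from the differential geometry of surfaces; it is background, not a formula taken from the paper. The shape operator $S$ at a surface point is self-adjoint with the principal curvatures $\kappa_1, \kappa_2$ as eigenvalues, so with mean curvature $H$ and Gaussian curvature $K$:

```latex
\[
\|S\|_F^2 \;=\; \kappa_1^2 + \kappa_2^2 \;=\; 4H^2 - 2K,
\qquad
H = \tfrac{1}{2}(\kappa_1 + \kappa_2), \quad K = \kappa_1 \kappa_2 .
\]
```

Penalizing $\|S\|_F$ therefore penalizes bending in both principal directions at once. How the paper weights or discretizes this term is not stated in the abstract.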

[CV-177] Metaphor-based Jailbreaking Attacks on Text-to-Image Models

[Quick read]: This paper studies weaknesses in the safety filters of text-to-image (T2I) models, noting that existing attack methods struggle to generate adversarial prompts that bypass the filters and are query-inefficient. It proposes the Metaphor-based Jailbreaking Attack (MJA), which combines an LLM-based multi-agent generation module (MLAG) with an adversarial prompt optimization module (APO). MLAG splits the problem into three subtasks (metaphor retrieval, context matching, and adversarial prompt generation) and coordinates several LLM agents to explore diverse candidates. APO trains a surrogate model to predict attack outcomes and designs an adaptive acquisition strategy to improve efficiency. Experiments show that MJA keeps attack effectiveness high while substantially reducing the number of queries, and that its prompts transfer well across models.

Link: https://arxiv.org/abs/2503.17987
Authors: Chenyu Zhang,Yiwen Ma,Lanjun Wang,Wenhui Li,Yi Tu,An-An Liu
Affiliations: School of New Media and Communication, Tianjin University, Tianjin, China; School of Electrical and Information Engineering, Tianjin University, Tianjin, China; Huawei Technologies Co Ltd., Shanghai, China
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: 13 pages, 4 figures. This paper includes model-generated content that may contain offensive or distressing material

Abstract:To mitigate misuse, text-to-image (T2I) models commonly incorporate safety filters to prevent the generation of sensitive images. Unfortunately, recent jailbreaking attack methods use LLMs to generate adversarial prompts that effectively bypass safety filters while generating sensitive images, revealing the safety vulnerabilities within the T2I model. However, existing LLM-based attack methods lack explicit guidance, relying on substantial queries to achieve a successful attack, which limits their practicality in real-world scenarios. In this work, we introduce MJA, a metaphor-based jailbreaking attack method inspired by the Taboo game, aiming to balance the attack effectiveness and query efficiency by generating metaphor-based adversarial prompts. Specifically, MJA consists of two modules: an LLM-based multi-agent generation module (MLAG) and an adversarial prompt optimization module (APO). MLAG decomposes the generation of metaphor-based adversarial prompts into three subtasks: metaphor retrieval, context matching, and adversarial prompt generation. Subsequently, MLAG coordinates three LLM-based agents to generate diverse adversarial prompts by exploring various metaphors and contexts. To enhance the attack efficiency, APO first trains a surrogate model to predict the attack results of adversarial prompts and then designs an acquisition strategy to adaptively identify optimal adversarial prompts. Experiments demonstrate that MJA achieves better attack effectiveness while requiring fewer queries compared to baseline methods. Moreover, our adversarial prompts exhibit strong transferability across various open-source and commercial T2I models. This paper includes model-generated content that may contain offensive or distressing material.

[CV-178] Taste More Taste Better: Diverse Data and Strong Model Boost Semi-Supervised Crowd Counting CVPR2025

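[Quick read]: This paper addresses semi-supervised crowd counting, where unlabeled data is hard to use effectively and accurately and dense scenes are costly to annotate. It proposes Taste More Taste Better (TMTB), a framework that innovates on both the data and the model side. For data, it inpaints background regions to increase diversity while preserving the fidelity of the whole scene. For the model, it adopts the Visual State Space Model as the backbone to capture global context in crowd scenes, which matters most in extremely crowded, low-light, and adverse-weather conditions. Besides the conventional regression head for exact prediction, it adds an Anti-Noise classification head that provides more robust supervision, offsetting the regression head's sensitivity to noise in manual annotations. On four benchmark datasets the method outperforms existing methods by a large margin.

Link: https://arxiv.org/abs/2503.17984
Authors: Maochen Yang,Zekun Li,Jian Zhang,Lei Qi,Yinghuan Shi
Affiliations: Nanjing University; Suzhou Laboratory; Southeast University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted by CVPR 2025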

Abstract:Semi-supervised crowd counting is crucial for addressing the high annotation costs of densely populated scenes. Although several methods based on pseudo-labeling have been proposed, it remains challenging to effectively and accurately utilize unlabeled data. In this paper, we propose a novel framework called Taste More Taste Better (TMTB), which emphasizes both data and model aspects. Firstly, we explore a data augmentation technique well-suited for the crowd counting task. By inpainting the background regions, this technique can effectively enhance data diversity while preserving the fidelity of the entire scenes. Secondly, we introduce the Visual State Space Model as backbone to capture the global context information from crowd scenes, which is crucial for extremely crowded, low-light, and adverse weather scenarios. In addition to the traditional regression head for exact prediction, we employ an Anti-Noise classification head to provide less exact but more accurate supervision, since the regression head is sensitive to noise in manual annotations. We conduct extensive experiments on four benchmark datasets and show that our method outperforms state-of-the-art methods by a large margin. Code is publicly available on this https URL.

[CV-179] Histomorphology-driven multi-instance learning for breast cancer WSI classification

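[Quick read]: This paper addresses the difficulty existing whole slide image (WSI) classification methods have in incorporating histomorphology information, which limits their ability to capture key, fine-grained pathological features. It proposes a framework that explicitly integrates histomorphology (tumor cellularity, cellular morphology, and tissue architecture) into WSI classification through three components: (1) estimating the importance of histomorphology information at the patch level from medical prior knowledge; (2) generating representative cluster-level features with histomorphology-driven cluster pooling; and (3) performing WSI-level classification with histomorphology-driven multi-instance aggregation. This strengthens the model's ability to capture key pathological patterns, improves WSI classification, and achieves high diagnostic accuracy in molecular subtyping and cancer subtyping.

Link: https://arxiv.org/abs/2503.17983
Authors: Baizhi Wang,Rui Yan,Wenxin Ma,Xu Zhang,Yuhao Wang,Xiaolong Li,Yunjie Gu,Zihang Jiang,S. Kevin Zhou
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages, 5 figures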

Abstract:Histomorphology is crucial in breast cancer diagnosis. However, existing whole slide image (WSI) classification methods struggle to effectively incorporate histomorphology information, limiting their ability to capture key and fine-grained pathological features. To address this limitation, we propose a novel framework that explicitly incorporates histomorphology (tumor cellularity, cellular morphology, and tissue architecture) into WSI classification. Specifically, our approach consists of three key components: (1) estimating the importance of tumor-related histomorphology information at the patch level based on medical prior knowledge; (2) generating representative cluster-level features through histomorphology-driven cluster pooling; and (3) enabling WSI-level classification through histomorphology-driven multi-instance aggregation. With the incorporation of histomorphological information, our framework strengthens the model’s ability to capture key and fine-grained pathological patterns, thereby enhancing WSI classification performance. Experimental results demonstrate its effectiveness, achieving high diagnostic accuracy for molecular subtyping and cancer subtyping. The code will be made available at this https URL.

[CV-180] Co-SemDepth: Fast Joint Semantic Segmentation and Depth Estimation on Aerial Images

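[Quick read]: This paper addresses real-time estimation of depth and semantic segmentation maps for Unmanned Aerial Vehicles (UAVs) in low-altitude unstructured environments, which is essential for autonomous navigation but highly challenging. It proposes a joint deep-learning architecture that performs depth prediction and semantic segmentation together, both quickly and accurately. Using a monocular camera on an aerial robot, the architecture is validated on the MidAir and Aeroscapes benchmarks, where it is competitive with or better than other single-task and joint-task methods while predicting at 20.2 frames per second (FPS) with a low memory footprint. All training and prediction code is open source.

Link: https://arxiv.org/abs/2503.17982
Authors: Yara AlaaEldin,Francesca Odone
Affiliations: University of Genova
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: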

Abstract:Understanding the geometric and semantic properties of the scene is crucial in autonomous navigation and particularly challenging in the case of Unmanned Aerial Vehicle (UAV) navigation. Such information may be obtained by estimating depth and semantic segmentation maps of the surrounding environment. For practical use in autonomous navigation, the procedure must be performed as close to real-time as possible. In this paper, we leverage monocular cameras on aerial robots to predict depth and semantic maps in low-altitude unstructured environments. We propose a joint deep-learning architecture that can perform the two tasks accurately and rapidly, and validate its effectiveness on the MidAir and Aeroscapes benchmark datasets. Our joint architecture proves to be competitive with or superior to the other single and joint architecture methods while running fast, predicting at 20.2 FPS on a single NVIDIA Quadro P5000 GPU, and it has a low memory footprint. All code for training and prediction can be found at this link: this https URL

[CV-181] PIM: Physics-Informed Multi-task Pre-training for Improving Inertial Sensor-Based Human Activity Recognition

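[Quick read]: This paper addresses the scarcity of labeled data in deep-learning-based human activity recognition (HAR), where annotation is costly, slow, and labor-intensive. It proposes a Physics-Informed Multi-task Pre-training (PIM) framework for HAR based on inertial measurement units (IMUs). PIM designs its pretext tasks from basic physical aspects of human motion, such as movement speed, angles of movement, and symmetry between sensor placements, and uses features computed with physics-based equations as the pre-training targets for self-supervised learning (SSL). The model thereby captures fundamental physical characteristics of human activities, which is especially relevant for multi-sensor systems. On several HAR benchmarks the method beats existing state-of-the-art methods in accuracy and F1 score: macro F1 and accuracy improve by almost 10% with only 2 to 8 labeled examples per class, and by up to 3% when the training data is not reduced.

Link: https://arxiv.org/abs/2503.17978
Authors: Dominique Nshimyimana,Vitor Fortes Rey,Sungho Suh,Bo Zhou,Paul Lukowicz
Affiliations: RPTU Kaiserslautern-Landau; DFKI
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: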

Abstract:Human activity recognition (HAR) with deep learning models relies on large amounts of labeled data, often challenging to obtain due to associated cost, time, and labor. Self-supervised learning (SSL) has emerged as an effective approach to leverage unlabeled data through pretext tasks, such as masked reconstruction and multitask learning with signal processing-based data augmentations, to pre-train encoder models. However, such methods are often derived from computer vision approaches that disregard physical mechanisms and constraints that govern wearable sensor data and the phenomena they reflect. In this paper, we propose a physics-informed multi-task pre-training (PIM) framework for IMU-based HAR. PIM generates pretext tasks based on the understanding of basic physical aspects of human motion, including movement speed, angles of movement, and symmetry between sensor placements. Given a sensor signal, we calculate corresponding features using physics-based equations and use them as pretext tasks for SSL. This enables the model to capture fundamental physical characteristics of human activities, which is especially relevant for multi-sensor systems. Experimental evaluations on four HAR benchmark datasets demonstrate that the proposed method outperforms existing state-of-the-art methods, including data augmentation and masked reconstruction, in terms of accuracy and F1 score. We have observed gains of almost 10% in macro F1 score and accuracy with only 2 to 8 labeled examples per class and up to 3% when there is no reduction in the amount of training data.
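As an illustration of what "features computed with physics-based equations" could look like for accelerometer data, the sketch below derives tilt angles from the gravity direction and a speed estimate by integrating acceleration. These are textbook formulas chosen for illustration; the abstract does not give PIM's actual equations, and `tilt_angles` and `speed_from_acceleration` are hypothetical names.

```python
import math

def tilt_angles(ax, ay, az):
    """Roll and pitch (radians) from one accelerometer sample at rest.

    Assumes the only measured acceleration is gravity.
    """
    roll = math.atan2(ay, az)
    pitch = math.atan2(-ax, math.sqrt(ay * ay + az * az))
    return roll, pitch

def speed_from_acceleration(samples, dt):
    """Integrate gravity-free acceleration (m/s^2) into speed magnitudes (m/s).

    Uses simple rectangular integration starting from rest; real IMU
    signals would drift and need filtering.
    """
    vx = vy = vz = 0.0
    speeds = []
    for ax, ay, az in samples:
        vx += ax * dt
        vy += ay * dt
        vz += az * dt
        speeds.append(math.sqrt(vx * vx + vy * vy + vz * vz))
    return speeds
```

Targets of this kind need no human labels, which is what makes them usable as self-supervised pretext objectives.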

[CV-182] Shot Sequence Ordering for Video Editing: Benchmarks, Metrics, and Cinematology-Inspired Computing Methods

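[Quick read]: With the rise of short-video platforms, high-quality video production still depends heavily on professional skill. This paper addresses that gap through the Shot Sequence Ordering (SSO) task, which aims to improve video storytelling and the viewing experience, but whose progress has been held back by the lack of public benchmark datasets. The paper introduces two benchmark datasets, AVE-Order and ActivityNet-Order, adopts the Kendall Tau distance as the evaluation metric, and proposes a new loss function, the Kendall Tau Distance-Cross Entropy Loss. It also introduces Cinematology Embedding, which injects movie metadata and shot labels into the SSO model as prior knowledge, and builds the AVE-Meta dataset to validate it. Experiments show that the proposed loss and method substantially improve SSO accuracy. The key contributions are the combination of domain knowledge with a new loss function and the release of open benchmarks.

Link: https://arxiv.org/abs/2503.17975
Authors: Yuzhi Li,Haojun Xu,Feng Tian
Affiliations: Shanghai University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: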

Abstract:With the rising popularity of short video platforms, the demand for video production has increased substantially. However, high-quality video creation continues to rely heavily on professional editing skills and a nuanced understanding of visual language. To address this challenge, the Shot Sequence Ordering (SSO) task in AI-assisted video editing has emerged as a pivotal approach for enhancing video storytelling and the overall viewing experience. Nevertheless, the progress in this field has been impeded by a lack of publicly available benchmark datasets. In response, this paper introduces two novel benchmark datasets, AVE-Order and ActivityNet-Order. Additionally, we employ the Kendall Tau distance as an evaluation metric for the SSO task and propose the Kendall Tau Distance-Cross Entropy Loss. We further introduce the concept of Cinematology Embedding, which incorporates movie metadata and shot labels as prior knowledge into the SSO model, and construct the AVE-Meta dataset to validate the method’s effectiveness. Experimental results indicate that the proposed loss function and method substantially enhance SSO task accuracy. All datasets are publicly accessible at this https URL.
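The Kendall Tau distance used as the evaluation metric counts the pairs of items whose relative order differs between two orderings. A minimal sketch follows; the function names and the normalization to the range 0 to 1 are choices made here for illustration, and the paper's exact variant and its cross-entropy loss are not described in the abstract.

```python
from itertools import combinations

def kendall_tau_distance(predicted, reference):
    """Number of shot pairs ordered differently in the two sequences.

    Both arguments must be orderings of the same distinct shot IDs.
    """
    position = {shot: idx for idx, shot in enumerate(reference)}
    ranks = [position[shot] for shot in predicted]
    return sum(1 for a, b in combinations(ranks, 2) if a > b)

def normalized_kendall_tau_distance(predicted, reference):
    """Distance divided by the number of pairs: 0.0 is identical, 1.0 is reversed."""
    n = len(predicted)
    if n < 2:
        return 0.0
    return kendall_tau_distance(predicted, reference) / (n * (n - 1) / 2)
```

For example, swapping two adjacent shots in an otherwise correct sequence yields a distance of 1, while a fully reversed sequence of n shots yields n(n-1)/2.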

[CV-183] PhysTwin: Physics-Informed Reconstruction and Simulation of Deformable Objects from Videos

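[Quick read]: This paper addresses building a physically realistic digital replica of a real-world object from sparse videos, so that the object can be simulated interactively in real time. The solution has two core parts: (1) a physics-informed representation that combines spring-mass models for physical simulation, generative shape models for geometry, and Gaussian splats for rendering; and (2) a novel multi-stage, optimization-based inverse modeling framework that reconstructs complete geometry, infers dense physical properties, and replicates realistic appearance from videos. By integrating an inverse physics framework with visual perception cues, the method achieves high-fidelity reconstruction even under partial occlusion or limited viewpoints. PhysTwin supports a range of deformable objects, including ropes, stuffed animals, cloth, and delivery packages. Experiments show that it outperforms existing methods in reconstruction, rendering, future prediction, and simulation under novel interactions, and demonstrate applications in interactive real-time simulation and model-based robotic motion planning.

Link: https://arxiv.org/abs/2503.17973
Authors: Hanxiao Jiang,Hao-Yu Hsu,Kaifeng Zhang,Hsin-Ni Yu,Shenlong Wang,Yunzhu Li
Affiliations: Columbia University; University of Illinois Urbana-Champaign
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments: Project Page: this https URL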

Abstract:Creating a physical digital twin of a real-world object has immense potential in robotics, content creation, and XR. In this paper, we present PhysTwin, a novel framework that uses sparse videos of dynamic objects under interaction to produce a photo- and physically realistic, real-time interactive virtual replica. Our approach centers on two key components: (1) a physics-informed representation that combines spring-mass models for realistic physical simulation, generative shape models for geometry, and Gaussian splats for rendering; and (2) a novel multi-stage, optimization-based inverse modeling framework that reconstructs complete geometry, infers dense physical properties, and replicates realistic appearance from videos. Our method integrates an inverse physics framework with visual perception cues, enabling high-fidelity reconstruction even from partial, occluded, and limited viewpoints. PhysTwin supports modeling various deformable objects, including ropes, stuffed animals, cloth, and delivery packages. Experiments show that PhysTwin outperforms competing methods in reconstruction, rendering, future prediction, and simulation under novel interactions. We further demonstrate its applications in interactive real-time simulation and model-based robotic motion planning.
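The spring-mass component can be pictured with a small simulation step. The sketch below is a generic 1-D spring-mass update with Hooke's law and semi-implicit Euler integration; the integrator, the damping form, and all names are assumptions made for illustration, not PhysTwin's implementation.

```python
def spring_force(x_a, x_b, rest_length, stiffness):
    """Hooke's-law force acting on particle a from a 1-D spring joining a and b."""
    length = abs(x_b - x_a)
    if length == 0.0:
        return 0.0  # direction undefined for coincident particles
    direction = (x_b - x_a) / length
    return stiffness * (length - rest_length) * direction

def step(x, v, masses, springs, dt, damping=0.0):
    """Advance all particles by one semi-implicit Euler step.

    springs is a list of (index_a, index_b, rest_length, stiffness) tuples.
    Velocities are updated first, then positions use the new velocities.
    """
    forces = [0.0] * len(x)
    for a, b, rest_length, stiffness in springs:
        f = spring_force(x[a], x[b], rest_length, stiffness)
        forces[a] += f
        forces[b] -= f  # equal and opposite reaction
    new_v = [(vi + dt * fi / mi) * (1.0 - damping)
             for vi, fi, mi in zip(v, forces, masses)]
    new_x = [xi + dt * vi for xi, vi in zip(x, new_v)]
    return new_x, new_v
```

In an inverse-modeling setting, quantities such as `stiffness` and `rest_length` would be the unknowns fitted so that simulated motion matches the video observations.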

[CV-184] Real-World Remote Sensing Image Dehazing: Benchmark and Baseline

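[Quick read]: This paper addresses Remote Sensing Image Dehazing (RSID) in real-world scenes, where complex atmospheric conditions and severe color distortion degrade image quality. Because real-world hazy remote sensing image pairs are scarce, existing methods rely mainly on synthetic datasets and suffer from a domain gap in practice. The paper introduces the Real-World Remote Sensing Hazy Image Dataset (RRSHID), a large-scale dataset of real hazy and dehazed image pairs across diverse atmospheric conditions, and builds on it a new framework, MCAF-Net. MCAF-Net has three core components: the Multi-branch Feature Integration Block Aggregator (MFIBA), which extracts robust features through cascaded integration blocks and parallel multi-branch processing; the Color-Calibrated Self-Supervised Attention Module (CSAM), which mitigates complex color distortion through self-supervised learning and attention-guided refinement; and the Multi-Scale Feature Adaptive Fusion Module (MFAFM), which fuses multi-scale features while preserving local detail and global context. Experiments show state-of-the-art performance on real-world RSID and competitive performance on synthetic datasets.

Link: https://arxiv.org/abs/2503.17966
Authors: Zeng-Hui Zhu,Wei Lu,Si-Bao Chen,Chris H. Q. Ding,Jin Tang,Bin Luo
Affiliations: MOE Key Laboratory of ICSP, IMIS Laboratory of Anhui, Anhui Provincial Key Laboratory of Multimodal Cognitive Computation, Zenmorn-AHU AI Joint Laboratory, School of Computer Science and Technology, Anhui University, Hefei 230601, China; School of Data Science (SDS), Chinese University of Hong Kong, Shenzhen, 518172, China
Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments: 11 pages, 9 figures, real-world remote sensing image dehazing dataset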

Abstract:Remote Sensing Image Dehazing (RSID) poses significant challenges in real-world scenarios due to the complex atmospheric conditions and severe color distortions that degrade image quality. The scarcity of real-world remote sensing hazy image pairs has compelled existing methods to rely primarily on synthetic datasets. However, these methods struggle with real-world applications due to the inherent domain gap between synthetic and real data. To address this, we introduce the Real-World Remote Sensing Hazy Image Dataset (RRSHID), the first large-scale dataset featuring real-world hazy and dehazed image pairs across diverse atmospheric conditions. Based on this, we propose MCAF-Net, a novel framework tailored for real-world RSID. Its effectiveness arises from three innovative components: Multi-branch Feature Integration Block Aggregator (MFIBA), which enables robust feature extraction through cascaded integration blocks and parallel multi-branch processing; Color-Calibrated Self-Supervised Attention Module (CSAM), which mitigates complex color distortions via self-supervised learning and attention-guided refinement; and Multi-Scale Feature Adaptive Fusion Module (MFAFM), which integrates features effectively while preserving local details and global context. Extensive experiments validate that MCAF-Net demonstrates state-of-the-art performance in real-world RSID, while maintaining competitive performance on synthetic datasets. The introduction of RRSHID and MCAF-Net sets new benchmarks for real-world RSID research, advancing practical solutions for this complex task. The code and dataset are publicly available at this https URL.

[CV-185] FisherTune: Fisher-Guided Robust Tuning of Vision Foundation Models for Domain Generalized Segmentation

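[Quick read]: This paper addresses the challenge of fine-tuning large pre-trained Vision Foundation Models (VFMs) for Domain Generalized Semantic Segmentation (DGSS) without losing their generalization ability. Existing methods either fine-tune selected parameters or freeze the VFM and update only adapters, and both may underuse the VFM's potential in DGSS. The authors observe that domain-sensitive parameters, which arise from task and distribution differences, hinder generalization. They propose FisherTune, a robust fine-tuning method guided by the Domain-Related Fisher Information Matrix (DR-FIM). DR-FIM measures parameter sensitivity across tasks and domains, enabling selective updates that preserve generalization and improve adaptation to DGSS. FisherTune also uses variational inference to stabilize DR-FIM estimation, treating parameters as Gaussian-distributed variables and leveraging pre-trained priors. Experiments show superior cross-domain segmentation with generalization preserved, outperforming selective-parameter and adapter-based methods.

Link: https://arxiv.org/abs/2503.17940
Authors: Dong Zhao,Jinlong Li,Shuang Wang,Mengyao Wu,Qi Zang,Nicu Sebe,Zhun Zhong
Affiliations: School of Artificial Intelligence, Xidian University, Shaanxi, China; Department of Information Engineering and Computer Science, University of Trento, Italy; School of Computer Science and Information Engineering, Hefei University of Technology, China
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: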

Abstract:Vision Foundation Models (VFMs) excel in generalization due to large-scale pretraining, but fine-tuning them for Domain Generalized Semantic Segmentation (DGSS) while maintaining this ability remains challenging. Existing approaches either selectively fine-tune parameters or freeze the VFMs and update only the adapters, both of which may underutilize the VFMs’ full potential in DGSS tasks. We observe that domain-sensitive parameters in VFMs, arising from task and distribution differences, can hinder generalization. To address this, we propose FisherTune, a robust fine-tuning method guided by the Domain-Related Fisher Information Matrix (DR-FIM). DR-FIM measures parameter sensitivity across tasks and domains, enabling selective updates that preserve generalization and enhance DGSS adaptability. FisherTune incorporates variational inference to stabilize DR-FIM estimation, treating parameters as Gaussian-distributed variables and leveraging pre-trained priors. Extensive experiments show that FisherTune achieves superior cross-domain segmentation while maintaining generalization, outperforming selective-parameter and adapter-based methods.
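To show the kind of per-parameter sensitivity score that Fisher information provides, here is a minimal sketch of the diagonal empirical Fisher for a tiny logistic-regression model. It illustrates ordinary Fisher information only; the domain-related DR-FIM and its variational estimation are not specified in the abstract, and `diagonal_fisher` and `rank_parameters` are hypothetical names.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def diagonal_fisher(weights, inputs, labels):
    """Mean squared per-sample gradient of the log-likelihood, per weight.

    For logistic regression the gradient for weight k on sample (x, y)
    is (y - p) * x[k], where p is the predicted probability.
    """
    fisher = [0.0] * len(weights)
    for x, y in zip(inputs, labels):
        p = sigmoid(sum(w * xi for w, xi in zip(weights, x)))
        for k, xi in enumerate(x):
            grad = (y - p) * xi
            fisher[k] += grad * grad
    return [f / len(inputs) for f in fisher]

def rank_parameters(scores):
    """Parameter indices sorted from most to least sensitive."""
    return sorted(range(len(scores)), key=lambda k: scores[k], reverse=True)
```

A selective fine-tuning scheme could use such a ranking to decide which parameters to update; which end of the ranking FisherTune updates is not stated in the abstract.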

[CV-186] Selecting and Pruning: A Differentiable Causal Sequentialized State-Space Model for Two-View Correspondence Learning

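[Quick read]: This paper addresses two-view correspondence learning, which separates true from false correspondences between an image pair by recognizing the different information underlying them. Previous methods either treat that information equally or require explicit storage of the entire context, which is inefficient and cumbersome in practice. The paper proposes CorrMamba, a correspondence filter that uses Mamba's selectivity to mine information from true correspondences while reducing interference from false ones, achieving adaptive focus at lower cost. To keep unordered keypoints from weakening Mamba's ability to mine spatial information, the authors customize a causal sequential learning approach based on the Gumbel-Softmax technique that establishes causal dependencies between features in a fully autonomous, differentiable way. A local-context enhancement module captures the contextual cues needed for correspondence pruning. CorrMamba achieves state-of-the-art results on relative pose estimation and visual localization; in outdoor relative pose estimation it improves AUC@20° over the previous best method by 2.58 absolute percentage points.

Link: https://arxiv.org/abs/2503.17938
Authors: Xiang Fang,Shihua Zhang,Hao Zhang,Tao Lu,Huabing Zhou,Jiayi Ma
Affiliations: Wuhan University; Wuhan Institute of Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: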

Abstract:Two-view correspondence learning aims to discern true and false correspondences between image pairs by recognizing their underlying different information. Previous methods either treat the information equally or require the explicit storage of the entire context, tending to be laborious in real-world scenarios. Inspired by Mamba’s inherent selectivity, we propose CorrMamba, a Correspondence filter leveraging Mamba’s ability to selectively mine information from true correspondences while mitigating interference from false ones, thus achieving adaptive focus at a lower cost. To prevent Mamba from being potentially impacted by unordered keypoints that obscure its ability to mine spatial information, we customize a causal sequential learning approach based on the Gumbel-Softmax technique to establish causal dependencies between features in a fully autonomous and differentiable manner. Additionally, a local-context enhancement module is designed to capture critical contextual cues essential for correspondence pruning, complementing the core framework. Extensive experiments on relative pose estimation, visual localization, and analysis demonstrate that CorrMamba achieves state-of-the-art performance. Notably, in outdoor relative pose estimation, our method surpasses the previous SOTA by 2.58 absolute percentage points in AUC@20°, highlighting its practical superiority. Our code will be publicly available.
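The Gumbel-Softmax technique mentioned above replaces a hard, non-differentiable categorical choice with a smooth sample. The following is a minimal stdlib sketch of the sampling step only; how CorrMamba uses it to order keypoints is not described in the abstract, and the `rng` parameter is a convenience added here for reproducibility.

```python
import math
import random

def gumbel_softmax(logits, temperature, rng=random):
    """Draw one relaxed one-hot sample from the categorical defined by logits.

    Lower temperatures push the sample toward a hard one-hot vector;
    higher temperatures make it closer to uniform.
    """
    noisy = []
    for logit in logits:
        u = max(rng.random(), 1e-12)         # keep log(u) finite
        gumbel = -math.log(-math.log(u))     # Gumbel(0, 1) noise
        noisy.append((logit + gumbel) / temperature)
    peak = max(noisy)                        # subtract max for numerical stability
    exps = [math.exp(v - peak) for v in noisy]
    total = sum(exps)
    return [e / total for e in exps]
```

In an autodiff framework the same computation lets gradients flow through the sample to the logits, which is what makes a learned ordering trainable end to end.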

[CV-187] Cross-Domain Underwater Image Enhancement Guided by No-Reference Image Quality Assessment: A Transfer Learning Approach

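[Quick read]: This paper addresses single underwater image enhancement (UIE), an ill-posed problem with two main obstacles: (1) the labels in underwater reference datasets are pseudo labels, so supervised learning on these pseudo ground truths causes domain discrepancy; and (2) underwater reference datasets are scarce, so training on such small datasets is prone to overfitting and distribution shift. The paper proposes Trans-UIE, a transfer-learning-based model that captures the fundamental paradigms of UIE through pre-training and is fine-tuned on a mixed dataset of reference and non-reference data. The key ideas are to use no-reference image quality assessment (NR-IQA) metrics to guide transfer learning across domains and to add a Pearson correlation loss in the pre-training stage to reduce the risk of overfitting. On both full-reference and no-reference underwater benchmarks, Trans-UIE significantly outperforms existing methods.

Link: https://arxiv.org/abs/2503.17937
Authors: Zhi Zhang,Daoyi Chen
Affiliations: Tsinghua University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: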

Abstract:Single underwater image enhancement (UIE) is a challenging ill-posed problem, and its development is hindered by two major issues: (1) The labels in underwater reference datasets are pseudo labels; relying on these pseudo ground truths in supervised learning leads to domain discrepancy. (2) Underwater reference datasets are scarce, making training on such small datasets prone to overfitting and distribution shift. To address these challenges, we propose Trans-UIE, a transfer learning-based UIE model that captures the fundamental paradigms of UIE through pretraining and utilizes a dataset composed of both reference and non-reference datasets for fine-tuning. However, fine-tuning the model using only reconstruction loss may introduce confirmation bias. To mitigate this, our method leverages no-reference image quality assessment (NR-IQA) metrics from above-water scenes to guide the transfer learning process across domains while generating enhanced images with the style of the above-water image domain. Additionally, to reduce the risk of overfitting during the pretraining stage, we introduce Pearson correlation loss. Experimental results on both full-reference and no-reference underwater benchmark datasets demonstrate that Trans-UIE significantly outperforms state-of-the-art methods.
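A Pearson correlation loss rewards predictions that vary linearly with the target, regardless of scale or offset. The sketch below uses the common form "one minus the correlation coefficient" on flattened pixel values; this form is an assumption, since the abstract names the loss without defining it.

```python
import math

def pearson_correlation(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    denom = math.sqrt(var_x * var_y)
    if denom == 0.0:
        return 0.0  # a constant signal carries no correlation information
    return cov / denom

def pearson_loss(predicted, target):
    """Loss in [0, 2]: 0 for perfect positive correlation, 2 for perfect negative."""
    return 1.0 - pearson_correlation(predicted, target)
```

Because the loss ignores absolute intensity, it constrains structure rather than exact pixel values, which is one plausible reason it could act as a regularizer during pretraining.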

[CV-188] TransAnimate: Taming Layer Diffusion to Generate RGBA Video

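[Quick read]: This paper addresses the generation of RGBA videos with an alpha (transparency) channel, which is difficult because suitable datasets are scarce and adapting existing models to support transparency and visual effects is complex. It proposes TransAnimate, a framework that combines text-to-transparent-image generation with video generation modules: it reuses pre-trained text-to-transparent-image model weights and combines them with temporal models and controllability plugins trained on RGB videos to enable controllable RGBA video generation. It also introduces an interactive motion-guided control mechanism in which directional arrows define movement and colors adjust scaling, giving precise and intuitive control for designing game effects. To ease data scarcity, the authors build a pipeline for creating an RGBA video dataset from high-quality game effect videos, extracted foreground objects, and synthetic transparent videos. Experiments show that TransAnimate generates high-quality RGBA videos, making it a practical tool for games and visual effects.

Link: https://arxiv.org/abs/2503.17934
Authors: Xuewei Chen,Zhimin Chen,Yiren Song
Affiliations: Clemson University; National University of Singapore
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: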

Abstract:Text-to-video generative models have made remarkable advancements in recent years. However, generating RGBA videos with alpha channels for transparency and visual effects remains a significant challenge due to the scarcity of suitable datasets and the complexity of adapting existing models for this purpose. To address these limitations, we present TransAnimate, an innovative framework that integrates RGBA image generation techniques with video generation modules, enabling the creation of dynamic and transparent videos. TransAnimate efficiently leverages pre-trained text-to-transparent image model weights and combines them with temporal models and controllability plugins trained on RGB videos, adapting them for controllable RGBA video generation tasks. Additionally, we introduce an interactive motion-guided control mechanism, where directional arrows define movement and colors adjust scaling, offering precise and intuitive control for designing game effects. To further alleviate data scarcity, we have developed a pipeline for creating an RGBA video dataset, incorporating high-quality game effect videos, extracted foreground objects, and synthetic transparent videos. Comprehensive experiments demonstrate that TransAnimate generates high-quality RGBA videos, establishing it as a practical and effective tool for applications in gaming and visual effects.
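For readers unfamiliar with what the alpha channel contributes, the sketch below shows the standard "over" compositing of one RGBA pixel onto an opaque background. It assumes straight (non-premultiplied) alpha with channel values in the range 0 to 1; it is background on RGBA itself and not part of TransAnimate.

```python
def composite_over(foreground_rgba, background_rgb):
    """Blend one straight-alpha RGBA pixel over an opaque RGB pixel.

    Each output channel is alpha * foreground + (1 - alpha) * background.
    """
    r, g, b, alpha = foreground_rgba
    return tuple(alpha * f + (1.0 - alpha) * bg
                 for f, bg in zip((r, g, b), background_rgb))
```

An RGBA video stores this alpha value per pixel and per frame, which is what lets a generated effect be layered over arbitrary game or film footage.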

[CV-189] Semi-supervised Semantic Segmentation with Multi-Constraint Consistency Learning

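Quick read: This paper addresses the insufficient use of potential supervisory information in existing semi-supervised semantic segmentation methods. Existing consistency regularization methods focus mainly on image-augmentation-based prediction consistency and optimize the segmentation network as a whole, leaving potential supervisory signals underused. The key contribution is a Multi-Constraint Consistency Learning (MCCL) approach that enhances the encoder and decoder in stages. It consists of a Feature Knowledge Alignment (FKA) strategy, which promotes feature consistency learning in the encoder across image-augmented views, and a Self-adaptive Intervention (SAI) module, which increases the discrepancy of intermediate feature representations to enable feature-perturbation-based prediction consistency learning. Experiments show that the method achieves new state-of-the-art performance on the Pascal VOC2012 and Cityscapes datasets.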

Link: https://arxiv.org/abs/2503.17914
Authors: Jianjian Yin, Tao Chen, Gensheng Pei, Yazhou Yao, Liqiang Nie, Xiansheng Hua
Affiliations: School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing, China; School of Computer Science and Technology, Harbin Institute of Technology (Shenzhen), Shenzhen, China; Terminus Group, Beijing, China
Categories: Multimedia (cs.MM); Computer Vision and Pattern Recognition (cs.CV)
Comments: accepted by IEEE Transactions on Multimedia

Abstract:Consistency regularization has prevailed in semi-supervised semantic segmentation and achieved promising performance. However, existing methods typically concentrate on enhancing the Image-augmentation based Prediction consistency and optimizing the segmentation network as a whole, resulting in insufficient utilization of potential supervisory information. In this paper, we propose a Multi-Constraint Consistency Learning (MCCL) approach to facilitate the staged enhancement of the encoder and decoder. Specifically, we first design a feature knowledge alignment (FKA) strategy to promote the feature consistency learning of the encoder from image-augmentation. Our FKA encourages the encoder to derive consistent features for strongly and weakly augmented views from the perspectives of point-to-point alignment and prototype-based intra-class compactness. Moreover, we propose a self-adaptive intervention (SAI) module to increase the discrepancy of aligned intermediate feature representations, promoting Feature-perturbation based Prediction consistency learning. Self-adaptive feature masking and noise injection are designed in an instance-specific manner to perturb the features for robust learning of the decoder. Experimental results on Pascal VOC2012 and Cityscapes datasets demonstrate that our proposed MCCL achieves new state-of-the-art performance. The source code and models are made available at this https URL.

[CV-190] Guided Diffusion for the Extension of Machine Vision to Human Visual Perception

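Quick read: This paper addresses the problem of serving both machine vision and human vision in image compression. Traditional image coding is optimized mainly for human perception, while the progress of image recognition models for AI tasks has made Image Coding for Machines (ICM) increasingly important. Existing methods struggle to satisfy both needs. The paper proposes a guided diffusion method that uses the ICM output as the guidance signal to generate human-viewable images from random noise, bridging machine vision and human vision. The key is to exploit the generative capability of diffusion models to move between the two without any additional bitrate overhead. The method is evaluated on bitrate and image quality and compared against other scalable image coding methods for humans and machines.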

Link: https://arxiv.org/abs/2503.17907
Authors: Takahiro Shindo, Yui Tatsumi, Taiju Watanabe, Hiroshi Watanabe
Affiliations: Waseda University
Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments:

Abstract:Image compression technology eliminates redundant information to enable efficient transmission and storage of images, serving both machine vision and human visual perception. For years, image coding focused on human perception has been well-studied, leading to the development of various image compression standards. On the other hand, with the rapid advancements in image recognition models, image compression for AI tasks, known as Image Coding for Machines (ICM), has gained significant importance. Therefore, scalable image coding techniques that address the needs of both machines and humans have become a key area of interest. Additionally, there is increasing demand for research applying the diffusion model, which can generate human-viewable images from a small amount of data, to image compression methods for human vision. Image compression methods that use diffusion models can partially reconstruct the target image by guiding the generation process with a small amount of conditioning information. Inspired by the diffusion model's potential, we propose a method for extending machine vision to human visual perception using guided diffusion. Utilizing the diffusion model guided by the output of the ICM method, we generate images for human perception from random noise. Guided diffusion acts as a bridge between machine vision and human vision, enabling transitions between them without any additional bitrate overhead. The generated images are then evaluated based on bitrate and image quality, and we compare their compression performance with other scalable image coding methods for humans and machines.

[CV-191] What Time Tells Us? An Explorative Study of Time Awareness Learned from Static Images

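Quick read: This paper asks how time awareness can be learned from static images, that is, how time is conveyed through visual cues, and tries to answer "what time tells us". It first builds the Time-Oriented Collection (TOC) dataset, containing 130,906 images with reliable timestamps, and then proposes Time-Image Contrastive Learning (TICL), which jointly models timestamps and the related visual representations through cross-modal contrastive learning. TICL achieves state-of-the-art performance on timestamp estimation, and the time-aware embeddings it learns from static images alone perform strongly on downstream tasks such as time-based image retrieval, video scene classification, and time-aware image editing. This indicates that time-related visual cues can be learned from static images and are useful across a range of vision tasks.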

Link: https://arxiv.org/abs/2503.17899
Authors: Dongheng Lin, Han Hu, Jianbo Jiao
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Time becomes visible through illumination changes in what we see. Inspired by this, in this paper we explore the potential to learn time awareness from static images, trying to answer: what time tells us? To this end, we first introduce a Time-Oriented Collection (TOC) dataset, which contains 130,906 images with reliable timestamps. Leveraging this dataset, we propose a Time-Image Contrastive Learning (TICL) approach to jointly model timestamps and related visual representations through cross-modal contrastive learning. We found that the proposed TICL 1) not only achieves state-of-the-art performance on the timestamp estimation task across various benchmark metrics, 2) but also, interestingly, though only seeing static images, learns time-aware embeddings that show strong capability in several time-aware downstream tasks such as time-based image retrieval, video scene classification, and time-aware image editing. Our findings suggest that time-related visual cues can be learned from static images and are beneficial for various vision tasks, laying a foundation for future research on understanding time-related visual context. Project page: this https URL.
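
The cross-modal contrastive objective mentioned in the abstract can be pictured with a minimal sketch, assuming a CLIP-style symmetric InfoNCE loss over paired image and timestamp embeddings. The abstract does not specify TICL's exact loss, so the function name `symmetric_info_nce`, the temperature value, and the toy embeddings are illustrative assumptions, not the authors' implementation.

```python
import math


def _normalize(vector):
    """Scale a vector to unit length (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in vector)) or 1.0
    return [x / norm for x in vector]


def _diagonal_cross_entropy(logits):
    """Mean cross-entropy where row i's correct class is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        row_max = max(row)
        # Numerically stable log-sum-exp of the row.
        log_z = row_max + math.log(sum(math.exp(x - row_max) for x in row))
        total += log_z - row[i]
    return total / len(logits)


def symmetric_info_nce(image_emb, time_emb, temperature=0.07):
    """Assumed CLIP-style loss: pair i of image/time embeddings is the positive."""
    images = [_normalize(v) for v in image_emb]
    times = [_normalize(v) for v in time_emb]
    # Cosine similarity between every image and every timestamp embedding.
    logits = [
        [sum(a * b for a, b in zip(img, tim)) / temperature for tim in times]
        for img in images
    ]
    logits_transposed = [list(column) for column in zip(*logits)]
    # Average the image-to-time and time-to-image directions.
    return 0.5 * (
        _diagonal_cross_entropy(logits) + _diagonal_cross_entropy(logits_transposed)
    )


# Toy usage: matched pairs should score a lower loss than mismatched pairs.
aligned_loss = symmetric_info_nce([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
shuffled_loss = symmetric_info_nce([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

In this toy setup the loss is small when each image embedding points the same way as its own timestamp embedding and large when the pairs are swapped, which is the behaviour a contrastive objective of this kind rewards.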

[CV-192] Real-time Global Illumination for Dynamic 3D Gaussian Scenes

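Quick read: This paper presents a real-time global illumination method and pipeline for dynamic 3D Gaussian models and meshes. It targets the challenge of rendering high-quality indirect illumination (mutual multi-bounce light transport) in dynamic scenes in real time. The key components are a fast compound stochastic ray-tracing algorithm and an optimized 3D Gaussian rasterizer, combined with multiple real-time techniques that accelerate performance while preserving high-fidelity lighting. The paper also demonstrates efficient rendering with interactively editable materials and diverse dynamic light setups, and reports over 40 fps in scenes containing both 3D Gaussians and meshes.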

Link: https://arxiv.org/abs/2503.17897
Authors: Chenxiao Hu, Meng Gai, Guoping Wang, Sheng Li
Affiliations: Peking University
Categories: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:We present a real-time global illumination approach along with a pipeline for dynamic 3D Gaussian models and meshes. Building on a formulated surface light transport model for 3D Gaussians, we address key performance challenges with a fast compound stochastic ray-tracing algorithm and an optimized 3D Gaussian rasterizer. Our pipeline integrates multiple real-time techniques to accelerate performance and achieve high-quality lighting effects. Our approach enables real-time rendering of dynamic scenes with interactively editable materials and dynamic lighting of diverse multi-lights settings, capturing mutual multi-bounce light transport (indirect illumination) between 3D Gaussians and mesh. Additionally, we present a real-time renderer with an interactive user interface, validating our approach and demonstrating its practicality and high efficiency with over 40 fps in scenes including both 3D Gaussians and mesh. Furthermore, our work highlights the potential of 3D Gaussians in real-time applications with dynamic lighting, offering insights into performance and optimization.

[CV-193] IceBench: A Benchmark for Deep Learning based Sea Ice Type Classification

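Quick read: This paper addresses the lack of a standardized benchmark and systematic comparative study in sea ice type classification, with the aim of clarifying which models perform best and improving efficiency and consistency in the field. Traditional manual methods are time-consuming and costly, and although deep learning models show promise, there is no consensus on their performance. To fill this gap, the paper introduces IceBench, a comprehensive benchmarking framework. First, IceBench builds a standardized dataset on the existing AI4Arctic Sea Ice Challenge dataset, integrates a broad set of evaluation metrics, and covers representative models from both pixel-based and patch-based classification methods. Second, an in-depth comparative study reveals the strengths and limitations of these models, offering insights for practitioners and researchers. Third, IceBench is used for systematic experiments on key research questions, including model transferability across seasons (time) and locations (space), data downscaling, and preprocessing strategies.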

Link: https://arxiv.org/abs/2503.17877
Authors: Samira Alkaee Taleghan, Andrew P. Barrett, Walter N. Meier, Farnoush Banaei-Kashani
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract:Sea ice plays a critical role in the global climate system and maritime operations, making timely and accurate classification essential. However, traditional manual methods are time-consuming, costly, and have inherent biases. Automating sea ice type classification addresses these challenges by enabling faster, more consistent, and scalable analysis. While both traditional and deep learning approaches have been explored, deep learning models offer a promising direction for improving efficiency and consistency in sea ice classification. However, the absence of a standardized benchmark and comparative study prevents a clear consensus on the best-performing models. To bridge this gap, we introduce IceBench, a comprehensive benchmarking framework for sea ice type classification. Our key contributions are threefold: First, we establish the IceBench benchmarking framework which leverages the existing AI4Arctic Sea Ice Challenge dataset as a standardized dataset, incorporates a comprehensive set of evaluation metrics, and includes representative models from the entire spectrum of sea ice type classification methods categorized in two distinct groups, namely, pixel-based classification methods and patch-based classification methods. IceBench is open-source and allows for convenient integration and evaluation of other sea ice type classification methods; hence, facilitating comparative evaluation of new methods and improving reproducibility in the field. Second, we conduct an in-depth comparative study on representative models to assess their strengths and limitations, providing insights for both practitioners and researchers. Third, we leverage IceBench for systematic experiments addressing key research questions on model transferability across seasons (time) and locations (space), data downscaling, and preprocessing strategies.

[CV-194] good4cir: Generating Detailed Synthetic Captions for Composed Image Retrieval

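Quick read: This paper addresses the difficulty of fine-grained retrieval caused by low-quality manual annotations in existing Composed Image Retrieval (CIR) datasets. The key solution is good4cir, a structured pipeline that uses vision-language models to generate high-quality synthetic annotations. Its core steps are extracting fine-grained object descriptions from query images, generating comparable descriptions for target images, and synthesizing textual instructions that capture meaningful transformations between the images. By reducing hallucination, increasing modification diversity, and ensuring object-level consistency, the approach improves existing datasets and supports the creation of new datasets across domains, and CIR models trained on the pipeline-generated data show improved retrieval accuracy.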

Link: https://arxiv.org/abs/2503.17871
Authors: Pranavi Kolouju, Eric Xing, Robert Pless, Nathan Jacobs, Abby Stylianou
Affiliations: Saint Louis University; Washington University in St. Louis; George Washington University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Composed image retrieval (CIR) enables users to search images using a reference image combined with textual modifications. Recent advances in vision-language models have improved CIR, but dataset limitations remain a barrier. Existing datasets often rely on simplistic, ambiguous, or insufficient manual annotations, hindering fine-grained retrieval. We introduce good4cir, a structured pipeline leveraging vision-language models to generate high-quality synthetic annotations. Our method involves: (1) extracting fine-grained object descriptions from query images, (2) generating comparable descriptions for target images, and (3) synthesizing textual instructions capturing meaningful transformations between images. This reduces hallucination, enhances modification diversity, and ensures object-level consistency. Applying our method improves existing datasets and enables creating new datasets across diverse domains. Results demonstrate improved retrieval accuracy for CIR models trained on our pipeline-generated datasets. We release our dataset construction framework to support further research in CIR and multi-modal retrieval.

[CV-195] A Causal Adjustment Module for Debiasing Scene Graph Generation

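Quick read: This paper addresses bias in Scene Graph Generation (SGG) models, which arises not only from the long-tail distribution of relationships but, more fundamentally, from skewed object and object-pair distributions. The key idea is to use causal inference to model the causality among these observed skewed distributions. The authors introduce the Mediator-based Causal Chain Model (MCCM), which models causality among objects, object pairs, and relationships and additionally incorporates mediator variables such as the co-occurrence distribution to complement the causal description. They then propose the Causal Adjustment Module (CAModule), which takes variables from MCCM as inputs to estimate the modeled causal structure and produces a set of adjustment factors that correct biased model predictions. The method also enables the composition of zero-shot relationships, improving the model's ability to recognize them. Experiments show that CAModule achieves state-of-the-art mean recall across various SGG backbones and popular benchmarks, with significant gains on the challenging zero-shot recall metric as well.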

Link: https://arxiv.org/abs/2503.17862
Authors: Li Liu, Shuzhou Sun, Shuaifeng Zhi, Fan Shi, Zhen Liu, Janne Heikkilä, Yongxiang Liu
Affiliations: College of Electronic Science and Technology, NUDT, Changsha, Hunan, China; College of Electronic Engineering, NUDT, Hefei, Hunan, China; Department of Computer Science & Technology, Tsinghua University, Beijing 100190, China; Center for Machine Vision and Signal Analysis (CMVS), University of Oulu, 90570 Oulu, Finland
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 18 pages, 8 tables, 10 figures

Abstract:While recent debiasing methods for Scene Graph Generation (SGG) have shown impressive performance, these efforts often attribute model bias solely to the long-tail distribution of relationships, overlooking the more profound causes stemming from skewed object and object pair distributions. In this paper, we employ causal inference techniques to model the causality among these observed skewed distributions. Our insight lies in the ability of causal inference to capture the unobservable causal effects between complex distributions, which is crucial for tracing the roots of model bias. Specifically, we introduce the Mediator-based Causal Chain Model (MCCM), which, in addition to modeling causality among objects, object pairs, and relationships, incorporates mediator variables, i.e., cooccurrence distribution, for complementing the causality. Following this, we propose the Causal Adjustment Module (CAModule) to estimate the modeled causal structure, using variables from MCCM as inputs to produce a set of adjustment factors aimed at correcting biased model predictions. Moreover, our method enables the composition of zero-shot relationships, thereby enhancing the model’s ability to recognize such relationships. Experiments conducted across various SGG backbones and popular benchmarks demonstrate that CAModule achieves state-of-the-art mean recall rates, with significant improvements also observed on the challenging zero-shot recall rate metric.

[CV-196] ClaraVid: A Holistic Scene Reconstruction Benchmark From Aerial Perspective With Delentropy-Based Complexity Profiling

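Quick read: This paper addresses the limited progress of aerial holistic scene understanding algorithms, which is caused mainly by the lack of comprehensive datasets that support both semantic and geometric reconstruction. Synthetic datasets are an alternative, but existing ones suffer from task-specific limitations, unrealistic scene compositions, and rendering artifacts that reduce real-world applicability. To overcome these limitations, the paper introduces ClaraVid, a purpose-built synthetic aerial dataset of 16,917 high-resolution images (4032x3024) captured from multiple viewpoints across diverse landscapes, with dense depth maps, panoptic segmentation, sparse point clouds, and dynamic object masks, while mitigating common rendering artifacts. The paper also proposes the Delentropic Scene Profile (DSP), a new complexity metric based on differential entropy analysis that quantitatively assesses scene difficulty and informs reconstruction tasks. Using DSP, the authors systematically benchmark neural reconstruction methods and find a consistent, measurable correlation between scene complexity and reconstruction accuracy. Empirically, higher delentropy correlates strongly with higher reconstruction error, supporting DSP as a reliable complexity prior.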

Link: https://arxiv.org/abs/2503.17856
Authors: Radu Beche, Sergiu Nedevschi
Affiliations: Technical University of Cluj-Napoca
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Currently under review

Abstract:The development of aerial holistic scene understanding algorithms is hindered by the scarcity of comprehensive datasets that enable both semantic and geometric reconstruction. While synthetic datasets offer an alternative, existing options exhibit task-specific limitations, unrealistic scene compositions, and rendering artifacts that compromise real-world applicability. We introduce ClaraVid, a synthetic aerial dataset specifically designed to overcome these limitations. Comprising 16,917 high-resolution images captured at 4032x3024 from multiple viewpoints across diverse landscapes, ClaraVid provides dense depth maps, panoptic segmentation, sparse point clouds, and dynamic object masks, while mitigating common rendering artifacts. To further advance neural reconstruction, we introduce the Delentropic Scene Profile (DSP), a novel complexity metric derived from differential entropy analysis, designed to quantitatively assess scene difficulty and inform reconstruction tasks. Utilizing DSP, we systematically benchmark neural reconstruction methods, uncovering a consistent, measurable correlation between scene complexity and reconstruction accuracy. Empirical results indicate that higher delentropy strongly correlates with increased reconstruction errors, validating DSP as a reliable complexity prior. The paper is currently under review; upon acceptance, the data and code will be available at this https URL.
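
To give a feel for the quantity behind the complexity metric, here is a minimal sketch of delentropy for a single grayscale image, assuming the commonly cited definition: half the Shannon entropy of the joint histogram of horizontal and vertical gradients. How the Delentropic Scene Profile aggregates such values over a scene is not described in the abstract, so the function `delentropy` and the toy images below are illustrative assumptions only.

```python
import math
from collections import Counter


def delentropy(image):
    """Half the entropy (in bits) of the joint gradient histogram of a 2D image.

    `image` is a list of rows of integer intensities. Gradients use central
    differences on interior pixels, so the image should be at least 3x3.
    """
    height, width = len(image), len(image[0])
    counts = Counter()
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            grad_x = image[y][x + 1] - image[y][x - 1]
            grad_y = image[y + 1][x] - image[y - 1][x]
            counts[(grad_x, grad_y)] += 1
    total = sum(counts.values())
    if total == 0:
        return 0.0  # image too small to have interior pixels
    # Shannon entropy of the empirical joint gradient distribution.
    entropy = sum((c / total) * math.log2(total / c) for c in counts.values())
    return 0.5 * entropy


# Toy usage: a flat image has no gradient variety, a textured one does.
flat_image = [[5] * 6 for _ in range(6)]
textured_image = [[(x * y) % 5 for x in range(6)] for y in range(6)]
flat_score = delentropy(flat_image)
textured_score = delentropy(textured_image)
```

A flat image puts every interior pixel in the same gradient bin and scores zero, while a textured image spreads pixels across many bins and scores higher, which matches the intuition that more visually complex scenes are harder to reconstruct.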

[CV-197] 4D-Bench: Benchmarking Multi-modal Large Language Models for 4D Object Understanding

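Quick read: This paper addresses the lack of a standardized benchmark for evaluating how well Multimodal Large Language Models (MLLMs) understand 4D objects, that is, 3D objects that evolve over time. The key contribution is 4D-Bench, the first benchmark for MLLM 4D object understanding. It includes 4D object question answering (4D object QA) and 4D object captioning tasks, and provides 4D objects with diverse categories, high-quality annotations, and tasks that require multi-view spatial-temporal understanding. Unlike existing benchmarks based on 2D images or videos, its results expose the gap between MLLMs' appearance understanding and temporal understanding, clarifying where further research and improvement are needed.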

Link: https://arxiv.org/abs/2503.17827
Authors: Wenxuan Zhu, Bing Li, Cheng Zheng, Jinjie Mai, Jun Chen, Letian Jiang, Abdullah Hamdi, Sara Rojas Martinez, Chia-Wen Lin, Mohamed Elhoseiny, Bernard Ghanem
Affiliations: King Abdullah University of Science and Technology; University of Oxford; National Tsing Hua University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Multimodal Large Language Models (MLLMs) have demonstrated impressive 2D image/video understanding capabilities. However, there are no publicly standardized benchmarks to assess the abilities of MLLMs in understanding 4D objects (3D objects that evolve over time). In this paper, we introduce 4D-Bench, the first benchmark to evaluate the capabilities of MLLMs in 4D object understanding, featuring tasks in 4D object Question Answering (4D object QA) and 4D object captioning. 4D-Bench provides 4D objects with diverse categories, high-quality annotations, and tasks necessitating multi-view spatial-temporal understanding, different from existing 2D image/video-based benchmarks. With 4D-Bench, we evaluate a wide range of open-source and closed-source MLLMs. The results from the 4D object captioning experiment indicate that MLLMs generally exhibit weaker temporal understanding compared to their appearance understanding. Notably, while open-source models approach closed-source performance in appearance understanding, they show larger performance gaps in temporal understanding. 4D object QA yields surprising findings: even with simple single-object videos, MLLMs perform poorly, with state-of-the-art GPT-4o achieving only 63% accuracy compared to the human baseline of 91%. These findings highlight a substantial gap in 4D object understanding and the need for further advancements in MLLMs.

[CV-198] Fractal-IR: A Unified Framework for Efficient and Scalable Image Restoration

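Quick read: This paper addresses the problem of efficiently scaling vision transformers across multiple Image Restoration (IR) tasks with different degradation types and resolutions. The key innovation is Fractal-IR, a fractal-based design that progressively refines degraded images by repeatedly expanding local information into broader regions. The fractal architecture naturally captures local details at early stages and gradually transitions toward global context in deeper stages, avoiding computationally heavy long-range self-attention. The authors also analyze the challenge of scaling up vision transformers for IR and propose a holistic set of strategies to guide model scaling. Experiments show that Fractal-IR achieves state-of-the-art performance on seven common image restoration tasks: super-resolution, denoising, JPEG artifact removal, IR in adverse weather conditions, motion deblurring, defocus deblurring, and demosaicking.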

Link: https://arxiv.org/abs/2503.17825
Authors: Yawei Li, Bin Ren, Jingyun Liang, Rakesh Ranjan, Mengyuan Liu, Nicu Sebe, Ming-Hsuan Yang, Luca Benini
Affiliations: ETH Zürich; University of Pisa; University of Trento; Meta Reality Labs; Peking University; University of California, Merced
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:While vision transformers achieve significant breakthroughs in various image restoration (IR) tasks, it is still challenging to efficiently scale them across multiple types of degradations and resolutions. In this paper, we propose Fractal-IR, a fractal-based design that progressively refines degraded images by repeatedly expanding local information into broader regions. This fractal architecture naturally captures local details at early stages and seamlessly transitions toward global context in deeper fractal stages, removing the need for computationally heavy long-range self-attention mechanisms. Moreover, we observe the challenge in scaling up vision transformers for IR tasks. Through a series of analyses, we identify a holistic set of strategies to effectively guide model scaling. Extensive experimental results show that Fractal-IR achieves state-of-the-art performance in seven common image restoration tasks, including super-resolution, denoising, JPEG artifact removal, IR in adverse weather conditions, motion deblurring, defocus deblurring, and demosaicking. For $2\times$ SR on Manga109, Fractal-IR achieves a 0.21 dB PSNR gain. For grayscale image denoising on Urban100, Fractal-IR surpasses the previous method by 0.2 dB for $\sigma=50$.

[CV-199] RefCut: Interactive Segmentation with Reference Guidance

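Quick read: This paper addresses interactive ambiguity in interactive segmentation: given the same user clicks (positive and negative), a model may produce several valid but inconsistent results, such as a part of an object versus the whole object, or a single object versus a combination of objects. This uncertainty limits the use of interactive annotation in large-scale, efficiency-critical scenarios. The paper proposes RefCut, a reference-based interactive segmentation framework. Its key idea is to optimize the model using a reference image and the corresponding reference masks, which reduces the interaction burden when annotating many targets of the same kind and resolves both part ambiguity and object ambiguity. To enrich data for these two kinds of ambiguity, the authors build a new Target Disassembly Dataset with part disassembly and object disassembly subsets for evaluation. Experiments show that RefCut achieves state-of-the-art performance in a combined evaluation over multiple datasets and makes interactive segmentation more intuitive and controllable.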

Link: https://arxiv.org/abs/2503.17820
Authors: Zheng Lin, Nan Zhou, Chen-Xi Du, Deng-Ping Fan, Shi-Min Hu
Affiliations: Tsinghua University; Nankai University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Interactive segmentation aims to segment the specified target on the image with positive and negative clicks from users. Interactive ambiguity is a crucial issue in this field, which refers to the possibility of multiple compliant outcomes with the same clicks, such as selecting a part of an object versus the entire object, a single object versus a combination of multiple objects, and so on. The existing methods cannot provide intuitive guidance to the model, which leads to unstable output results and makes it difficult to meet the large-scale and efficient annotation requirements for specific targets in some scenarios. To bridge this gap, we introduce RefCut, a reference-based interactive segmentation framework designed to address part ambiguity and object ambiguity in segmenting specific targets. Users only need to provide a reference image and corresponding reference masks, and the model will be optimized based on them, which greatly reduces the interactive burden on users when annotating a large number of such targets. In addition, to enrich these two kinds of ambiguous data, we propose a new Target Disassembly Dataset which contains two subsets of part disassembly and object disassembly for evaluation. In the combination evaluation of multiple datasets, our RefCut achieved state-of-the-art performance. Extensive experiments and visualized results demonstrate that RefCut advances the field of intuitive and controllable interactive segmentation. Our code will be publicly available and the demo video is in this https URL.

[CV-200] LightLoc: Learning Outdoor LiDAR Localization at Light Speed CVPR2025

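Quick read: This paper addresses the long training time of existing scene coordinate regression methods for large-scale outdoor LiDAR localization. For time-sensitive applications that must adapt quickly to new scenes, such as autonomous driving, drones, and robotics, such long training times are impractical. The paper identifies large coverage areas and vast amounts of data as the key obstacles to fast training and proposes LightLoc, a method that learns localization in a new scene efficiently, "at light speed". It introduces two techniques. First, sample classification guidance assists regression learning, reducing ambiguity from similar samples and improving training efficiency. Second, redundant sample downsampling removes well-learned frames during training, shortening training time without compromising accuracy. In addition, the fast training and confidence estimation capabilities of sample classification allow it to be integrated into SLAM systems, effectively eliminating error accumulation.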

Link: https://arxiv.org/abs/2503.17814
Authors: Wen Li, Chen Liu, Shangshu Yu, Dunqiang Liu, Yin Zhou, Siqi Shen, Chenglu Wen, Cheng Wang
Affiliations: Fujian Key Laboratory of Sensing and Computing for Smart Cities, Xiamen University; Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University; Nanyang Technological University; GAC R&D Center
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: CVPR 2025

Abstract:Scene coordinate regression achieves impressive results in outdoor LiDAR localization but requires days of training. Since training needs to be repeated for each new scene, long training times make these methods impractical for time-sensitive applications, such as autonomous driving, drones, and robotics. We identify large coverage areas and vast data in large-scale outdoor scenes as key challenges that limit fast training. In this paper, we propose LightLoc, the first method capable of efficiently learning localization in a new scene at light speed. LightLoc introduces two novel techniques to address these challenges. First, we introduce sample classification guidance to assist regression learning, reducing ambiguity from similar samples and improving training efficiency. Second, we propose redundant sample downsampling to remove well-learned frames during training, reducing training time without compromising accuracy. Additionally, the fast training and confidence estimation capabilities of sample classification enable its integration into SLAM, effectively eliminating error accumulation. Extensive experiments on large-scale outdoor datasets demonstrate that LightLoc achieves state-of-the-art performance with a 50x reduction in training time compared with existing methods. Our code is available at this https URL.
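
The idea of removing well-learned frames during training can be illustrated with a minimal sketch, assuming that a frame's most recent training loss is a reasonable proxy for how well it has been learned. The abstract does not state LightLoc's actual selection criterion, so the function `downsample_redundant_frames`, the `keep_ratio` parameter, and the toy losses are hypothetical.

```python
import math


def downsample_redundant_frames(frame_losses, keep_ratio=0.75):
    """Keep the hardest frames and drop the rest from later epochs.

    `frame_losses` maps a frame id to its latest training loss. The
    `keep_ratio` fraction of frames with the highest loss is kept (at least
    one frame), and the kept ids are returned in ascending order.
    """
    if not frame_losses:
        return []
    keep_count = max(1, math.ceil(len(frame_losses) * keep_ratio))
    # Rank frame ids from highest loss (least learned) to lowest loss.
    ranked = sorted(frame_losses, key=lambda fid: frame_losses[fid], reverse=True)
    return sorted(ranked[:keep_count])


# Toy usage: frames 1 and 3 are already well fit and get dropped.
latest_losses = {0: 0.90, 1: 0.05, 2: 0.40, 3: 0.01}
kept_frames = downsample_redundant_frames(latest_losses, keep_ratio=0.5)
```

Calling a filter like this every few epochs shrinks the training set as the model converges, which is one plausible way to cut training time while keeping the frames that still carry a learning signal.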

[CV-201] GaussianFocus: Constrained Attention Focus for 3D Gaussian Splatting

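Quick read: This paper addresses two problems in 3D Gaussian Splatting: redundant Gaussians that overfit the training views and degrade high-fidelity rendering, and the difficulty of applying the technique to large-scale scenes because of high memory consumption, long optimization time, and appearance variation across views. The proposed GaussianFocus combines a patch attention algorithm to improve rendering quality with a Gaussian constraints strategy to reduce redundancy, and adds a subdivision reconstruction strategy for large-scale scenes that splits a scene into smaller blocks trained independently. These contributions significantly reduce unnecessary Gaussians, improve rendering quality, and enable effective handling and high-quality rendering of large, complex scenes such as urban environments.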

Link: https://arxiv.org/abs/2503.17798
Authors: Zexu Huang, Min Xu, Stuart Perry
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Recent developments in 3D reconstruction and neural rendering have significantly propelled the capabilities of photo-realistic 3D scene rendering across various academic and industrial fields. The 3D Gaussian Splatting technique, alongside its derivatives, integrates the advantages of primitive-based and volumetric representations to deliver top-tier rendering quality and efficiency. Despite these advancements, the method tends to generate excessive redundant noisy Gaussians overfitted to every training view, which degrades the rendering quality. Additionally, while 3D Gaussian Splatting excels in small-scale and object-centric scenes, its application to larger scenes is hindered by constraints such as limited video memory, excessive optimization duration, and variable appearance across views. To address these challenges, we introduce GaussianFocus, an innovative approach that incorporates a patch attention algorithm to refine rendering quality and implements a Gaussian constraints strategy to minimize redundancy. Moreover, we propose a subdivision reconstruction strategy for large-scale scenes, dividing them into smaller, manageable blocks for individual training. Our results indicate that GaussianFocus significantly reduces unnecessary Gaussians and enhances rendering quality, surpassing existing State-of-The-Art (SoTA) methods. Furthermore, we demonstrate the capability of our approach to effectively manage and render large scenes, such as urban environments, whilst maintaining high fidelity in the visual output.

[CV-202] Progressive Prompt Detailing for Improved Alignment in Text-to-Image Generative Models

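Quick read: This paper addresses the difficulty text-to-image generative models have with long prompts, which typically describe complex scenes, diverse objects with distinct visual characteristics, and the spatial relationships among them. It proposes SCoPE (Scheduled interpolation of Coarse-to-fine Prompt Embeddings), a training-free method that improves text-to-image alignment by progressively refining the input prompt from coarse to fine. The key is to decompose a detailed input prompt into multiple sub-prompts that evolve from describing the broad scene layout to highly intricate details, and to interpolate between these sub-prompts during inference so that finer details are introduced into the generated image step by step. This plug-and-play approach significantly improves prompt alignment: on 85% of the prompts from the GenAI-Bench dataset, it improves Visual Question Answering (VQA) scores over the Stable Diffusion baselines by up to +4% on average.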

Link: https://arxiv.org/abs/2503.17794
Authors: Ketan Suhaas Saichandran, Xavier Thomas, Prakhar Kaushik, Deepti Ghadiyaram
Affiliations: Boston University; Johns Hopkins University; Runway
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Text-to-image generative models often struggle with long prompts detailing complex scenes, diverse objects with distinct visual characteristics and spatial relationships. In this work, we propose SCoPE (Scheduled interpolation of Coarse-to-fine Prompt Embeddings), a training-free method to improve text-to-image alignment by progressively refining the input prompt in a coarse-to-fine-grained manner. Given a detailed input prompt, we first decompose it into multiple sub-prompts which evolve from describing broad scene layout to highly intricate details. During inference, we interpolate between these sub-prompts and thus progressively introduce finer-grained details into the generated image. Our training-free plug-and-play approach significantly enhances prompt alignment and achieves an average improvement of up to +4% in Visual Question Answering (VQA) scores over the Stable Diffusion baselines on 85% of the prompts from the GenAI-Bench dataset.
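
The coarse-to-fine interpolation described above can be sketched as follows, assuming a simple linear schedule that moves from the coarsest sub-prompt embedding at the first denoising step to the finest at the last. The abstract does not give SCoPE's actual schedule, so the function `scheduled_embedding`, the linear weighting, and the toy two-dimensional embeddings are illustrative assumptions.

```python
def scheduled_embedding(sub_prompt_embeddings, step, num_steps):
    """Blend coarse-to-fine sub-prompt embeddings for one denoising step.

    `sub_prompt_embeddings` is ordered from coarsest to finest. The position
    along that list grows linearly with `step`, and the two neighbouring
    embeddings are linearly interpolated (an assumed schedule).
    """
    count = len(sub_prompt_embeddings)
    if count == 1 or num_steps <= 1:
        return list(sub_prompt_embeddings[-1])
    # Fractional index into the list: 0 at the first step, count - 1 at the last.
    position = (step / (num_steps - 1)) * (count - 1)
    lower = min(int(position), count - 2)
    weight = position - lower
    coarse, fine = sub_prompt_embeddings[lower], sub_prompt_embeddings[lower + 1]
    return [(1 - weight) * c + weight * f for c, f in zip(coarse, fine)]


# Toy usage with three sub-prompts and five denoising steps.
embeddings = [[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]]
schedule = [scheduled_embedding(embeddings, step, 5) for step in range(5)]
```

In a real pipeline the blended embedding would be passed as the text conditioning for that denoising step, so early steps see only the broad layout and later steps see the full detail.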

[CV-203] Topology preserving Image segmentation using the iterative convolution-thresholding method

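Quick read: This paper addresses the fact that traditional image segmentation models focus mainly on the visual attributes of the image and neglect the topological properties of the target objects, which can make segmentation results deviate from the ground truth, particularly in images with complex topological structures or noise. The key is to introduce a topology-preserving constraint into the Iterative Convolution-Thresholding Method (ICTM), yielding the Topology-Preserving ICTM (TP-ICTM). By explicitly preserving topological properties of the target objects, such as connectivity, the method achieves higher accuracy and robustness on images with intricate structures or noise.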

Link: https://arxiv.org/abs/2503.17792
Authors: Lingyun Deng, Litong Liu, Dong Wang, Xiao-Ping Wang
Affiliations: School of Science and Engineering, The Chinese University of Hong Kong, Shenzhen; Department of Industrial and System Engineering, Georgia Institute of Technology; Shenzhen International Center for Industrial and Applied Mathematics, Shenzhen Research Institute of Big Data
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 22 pages, 14 figures

Abstract:Variational models are widely used in image segmentation, with various models designed to address different types of images by optimizing specific objective functionals. However, traditional segmentation models primarily focus on the visual attributes of the image, often neglecting the topological properties of the target objects. This limitation can lead to segmentation results that deviate from the ground truth, particularly in images with complex topological structures. In this paper, we introduce a topology-preserving constraint into the iterative convolution-thresholding method (ICTM), resulting in the topology-preserving ICTM (TP-ICTM). Extensive experiments demonstrate that, by explicitly preserving the topological properties of target objects (such as connectivity), the proposed algorithm achieves enhanced accuracy and robustness, particularly in images with intricate structures or noise.

[CV-204] Aligning Foundation Model Priors and Diffusion-Based Hand Interactions for Occlusion-Resistant Two-Hand Reconstruction

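【Quick read】: This paper addresses the challenges of two-hand reconstruction from monocular images, in particular the difficulty of aligning interactions under complex, dynamic hand poses and occlusions; existing methods often produce misalignment and penetration artifacts. To tackle these challenges, the paper proposes a framework that combines foundation-model-driven 2D priors with diffusion-based interaction refinement to achieve robust, occlusion-resistant two-hand reconstruction. First, a Fusion Alignment Encoder learns during training to align multimodal priors (keypoints, segmentation maps, and depth cues), providing robust structured guidance and maintaining high reconstruction accuracy at test time without the foundation models. Second, a two-hand diffusion model, trained specifically for this purpose, uses gradient-guided denoising to correct artifacts and ensure realistic spatial relations, transforming interpenetrated poses into plausible, non-penetrating interactions.

Link: https://arxiv.org/abs/2503.17788
Authors: Gaoge Han, Yongkang Cheng, Zhe Chen, Shaoli Huang, Tongliang Liu
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: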

Abstract:Two-hand reconstruction from monocular images faces persistent challenges due to complex and dynamic hand postures and occlusions, causing significant difficulty in achieving plausible interaction alignment. Existing approaches struggle with such alignment issues, often resulting in misalignment and penetration artifacts. To tackle this, we propose a novel framework that attempts to precisely align hand poses and interactions by synergistically integrating foundation model-driven 2D priors with diffusion-based interaction refinement for occlusion-resistant two-hand reconstruction. First, we introduce a Fusion Alignment Encoder that learns to align fused multimodal priors (keypoints, segmentation maps, and depth cues) from foundation models during training. This provides robust structured guidance, further enabling efficient inference without foundation models at test time while maintaining high reconstruction accuracy. Second, we employ a two-hand diffusion model explicitly trained to transform interpenetrated poses into plausible, non-penetrated interactions, leveraging gradient-guided denoising to correct artifacts and ensure realistic spatial relations. Extensive evaluations demonstrate that our method achieves state-of-the-art performance on InterHand2.6M, FreiHAND, and HIC datasets, significantly advancing occlusion handling and interaction robustness.

[CV-205] GOAL: Global-local Object Alignment Learning

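【Quick read】: This paper addresses the poor performance of existing vision-language models such as CLIP on lengthy, detailed text descriptions, which stems mainly from their training focus on short, concise image captions. It proposes GOAL (Global-local Object Alignment Learning), a fine-tuning method that strengthens CLIP's handling of long text by exploiting both global and local semantic alignment between images and lengthy text. The solution rests on two core components: Local Image-Sentence Matching (LISM), which identifies corresponding pairs of image segments and descriptive sentences, and Token Similarity-based Learning (TSL), which efficiently propagates local element attention through these matched pairs. Experiments show that GOAL significantly outperforms baseline CLIP fine-tuning on three new image-to-long-text retrieval benchmarks and is particularly beneficial for tasks that require fine-grained understanding of long text descriptions.

Link: https://arxiv.org/abs/2503.17782
Authors: Hyungyu Choi, Young Kyun Jang, Chanho Eom
Affiliations: Chung-Ang University (中央大学); Meta (Meta)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 16 pages, 5 figures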

Abstract:Vision-language models like CLIP have shown impressive capabilities in aligning images and text, but they often struggle with lengthy and detailed text descriptions because of their training focus on short and concise captions. We present GOAL (Global-local Object Alignment Learning), a novel fine-tuning method that enhances CLIP’s ability to handle lengthy text by leveraging both global and local semantic alignments between image and lengthy text. Our approach consists of two key components: Local Image-Sentence Matching (LISM), which identifies corresponding pairs between image segments and descriptive sentences, and Token Similarity-based Learning (TSL), which efficiently propagates local element attention through these matched pairs. Evaluating GOAL on three new benchmarks for image-lengthy text retrieval, we demonstrate significant improvements over baseline CLIP fine-tuning, establishing a simple yet effective approach for adapting CLIP to detailed textual descriptions. Through extensive experiments, we show that our method’s focus on local semantic alignment alongside global context leads to more nuanced and representative embeddings, particularly beneficial for tasks requiring fine-grained understanding of lengthy text descriptions.

[CV-206] CODA: Repurposing Continuous VAEs for Discrete Tokenization

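【Quick read】: This paper addresses a challenge that discrete visual tokenizers face when converting images into token sequences: compressing visual signals effectively while also discretizing them into a fixed set of codes. Traditional methods usually learn the two tasks jointly, which leads to unstable training, low codebook utilization, and limited reconstruction quality. The paper proposes CODA (COntinuous-to-Discrete Adaptation), a framework that decouples compression from discretization. The key is a carefully designed discretization process that adapts off-the-shelf continuous variational autoencoders (VAEs), already optimized for perceptual compression, into discrete tokenizers, keeping training stable and efficient while retaining the high visual fidelity of continuous VAEs. Experiments show that, with 1/6 of the training budget of a standard VQGAN, the method achieves 100% codebook utilization and reconstruction FID (rFID) of 0.43 and 1.34 at 8× and 16× compression, respectively, on the ImageNet 256×256 benchmark.

Link: https://arxiv.org/abs/2503.17760
Authors: Zeyu Liu, Zanlin Ni, Yeguo Hua, Xin Deng, Xiao Ma, Cheng Zhong, Gao Huang
Affiliations: Tsinghua University (清华大学); Renmin University (中国人民大学); Lenovo Research, AI Lab (联想研究院, AI 实验室)
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Project page: this https URL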

Abstract:Discrete visual tokenizers transform images into a sequence of tokens, enabling token-based visual generation akin to language models. However, this process is inherently challenging, as it requires both compressing visual signals into a compact representation and discretizing them into a fixed set of codes. Traditional discrete tokenizers typically learn the two tasks jointly, often leading to unstable training, low codebook utilization, and limited reconstruction quality. In this paper, we introduce CODA (COntinuous-to-Discrete Adaptation), a framework that decouples compression and discretization. Instead of training discrete tokenizers from scratch, CODA adapts off-the-shelf continuous VAEs, which are already optimized for perceptual compression, into discrete tokenizers via a carefully designed discretization process. By primarily focusing on discretization, CODA ensures stable and efficient training while retaining the strong visual fidelity of continuous VAEs. Empirically, with 6× less training budget than standard VQGAN, our approach achieves a remarkable codebook utilization of 100% and notable reconstruction FID (rFID) of 0.43 and 1.34 for 8× and 16× compression on the ImageNet 256×256 benchmark.
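
To make the "codebook utilization" figure concrete, the sketch below shows one common way such a number can be computed: assign each continuous latent to its nearest code and report the fraction of codes that were used at least once. Nearest-neighbour assignment is a generic vector-quantization stand-in assumed here for illustration; it is not CODA's discretization process, and all names are hypothetical.

```python
def nearest_code(vec, codebook):
    """Index of the codebook entry closest to vec (squared Euclidean distance)."""
    return min(
        range(len(codebook)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(vec, codebook[i])),
    )


def codebook_utilization(latents, codebook):
    """Fraction of codebook entries selected by at least one latent."""
    used = {nearest_code(v, codebook) for v in latents}
    return len(used) / len(codebook)


# Toy example: three codes, two latents that land on the first two codes.
codebook = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]]
latents = [[0.1, 0.0], [0.9, 1.2]]
utilization = codebook_utilization(latents, codebook)
```

A utilization of 100% would mean that every entry in the codebook is selected by at least one latent in the evaluation set.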

[CV-207] HiLoTs: High-Low Temporal Sensitive Representation Learning for Semi-Supervised LiDAR Segmentation in Autonomous Driving CVPR2025

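【Quick read】: This paper addresses the failure of existing semi-supervised point cloud semantic segmentation methods to exploit the rich long-term temporal properties of autonomous driving scenes. Current methods usually focus on the spatial distribution of point clouds or consider only short-term temporal representations (for example, two adjacent frames). The paper observes that, while driving, nearby objects (such as roads and vehicles) remain relatively stable, whereas distant objects vary more in category and shape; LiDAR captures the same phenomenon as lower temporal sensitivity for nearby objects and higher temporal sensitivity for distant ones. The proposed HiLoTs method therefore learns high-temporal-sensitivity and low-temporal-sensitivity representations from continuous LiDAR frames and further enhances and fuses them with a cross-attention mechanism. In addition, a teacher-student framework aligns the representations learned by the labeled and unlabeled branches, making effective use of large amounts of unlabeled data. The key lies in combining the high- and low-temporal-sensitivity features with this teacher-student alignment to improve semi-supervised performance. Experiments show that HiLoTs outperforms state-of-the-art semi-supervised methods on the SemanticKITTI and nuScenes datasets and approaches the performance of LiDAR+Camera multimodal methods.

Link: https://arxiv.org/abs/2503.17752
Authors: R.D. Lin, Pengcheng Weng, Yinqiao Wang, Han Ding, Jinsong Han, Fei Wang
Affiliations: School of Software Engineering, Xi’an Jiaotong University (西安交通大学软件学院), China; School of Computer Science and Technology, Xi’an Jiaotong University (西安交通大学计算机科学与技术学院), China; College of Computer Science and Technology, Zhejiang University (浙江大学计算机科学与技术学院), China
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: accepted by CVPR 2025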

Abstract:LiDAR point cloud semantic segmentation plays a crucial role in autonomous driving. In recent years, semi-supervised methods have gained popularity due to their significant reduction in annotation labor and time costs. Current semi-supervised methods typically focus on point cloud spatial distribution or consider short-term temporal representations, e.g., only two adjacent frames, often overlooking the rich long-term temporal properties inherent in autonomous driving scenarios. In driving experience, we observe that nearby objects, such as roads and vehicles, remain stable while driving, whereas distant objects exhibit greater variability in category and shape. This natural phenomenon is also captured by LiDAR, which reflects lower temporal sensitivity for nearby objects and higher sensitivity for distant ones. To leverage these characteristics, we propose HiLoTs, which learns high-temporal sensitivity and low-temporal sensitivity representations from continuous LiDAR frames. These representations are further enhanced and fused using a cross-attention mechanism. Additionally, we employ a teacher-student framework to align the representations learned by the labeled and unlabeled branches, effectively utilizing the large amounts of unlabeled data. Experimental results on the SemanticKITTI and nuScenes datasets demonstrate that our proposed HiLoTs outperforms state-of-the-art semi-supervised methods, and achieves performance close to LiDAR+Camera multimodal approaches. Code is available on this https URL

[CV-208] Serial Low-rank Adaptation of Vision Transformer

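【Quick read】: This paper addresses parameter-efficient fine-tuning of large pre-trained vision foundation models for downstream vision tasks under limited compute and storage. Existing methods such as Low-rank Adaptation (LoRA) are already quite efficient, but reducing the parameter count and memory requirements further remains challenging. The key solution is Serial LoRA, a LoRA variant that composes a shared low-rank matrix serially with the attention mechanism, extracting the underlying commonality of the adaptation parameters and significantly reducing redundancy. Serial LoRA uses only 1/4 of the parameters of LoRA yet achieves comparable performance in most cases. Experiments confirm the method's advantages across a range of transformer-based vision foundation models.

Link: https://arxiv.org/abs/2503.17750
Authors: Houqiang Zhong, Shaocheng Shen, Ke Cai, Zhenglong Wu, Jiangchao Yao, Yuan Cheng, Xuefei Li, Xiaoyun Zhang, Li Song, Qiang Hu
Affiliations: School of Information Science and Electronic Engineering, Shanghai Jiao Tong University (上海交通大学信息科学与电子工程学院); Cooperative Medianet Innovation Center, Shanghai Jiao Tong University (上海交通大学协同创新中心); Glodon Company (广联达公司)
Categories: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Comments: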

Abstract:Fine-tuning large pre-trained vision foundation models in a parameter-efficient manner is critical for downstream vision tasks, considering the practical constraints of computational and storage costs. Low-rank adaptation (LoRA) is a well-established technique in this domain, achieving impressive efficiency by reducing the parameter space to a low-rank form. However, developing more advanced low-rank adaptation methods to reduce parameters and memory requirements remains a significant challenge in resource-constrained application scenarios. In this study, we build on the commonly used vision transformer and propose Serial LoRA, a novel LoRA variant that introduces a shared low-rank matrix composed serially with the attention mechanism. Such a design extracts the underlying commonality of parameters in adaptation, significantly reducing redundancy. Notably, Serial LoRA uses only 1/4 of the parameters of LoRA but achieves comparable performance in most cases. We conduct extensive experiments on a range of vision foundation models with the transformer structure, and the results confirm consistent superiority of our method.
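
The 1/4 figure can be made concrete with a back-of-the-envelope parameter count. The sketch assumes that standard LoRA attaches a separate rank-r pair (A of size d×r, B of size r×d) to each of four d×d attention projections (query, key, value, output), and that the serial variant trains a single shared rank-r pair. This reading is an assumption chosen to be consistent with the stated ratio; it is not taken from the paper.

```python
def lora_params(d_model, rank, num_adapted=4):
    """Trainable parameters when each adapted d x d matrix gets its own A and B.

    num_adapted=4 assumes the query, key, value, and output projections.
    """
    return num_adapted * 2 * d_model * rank


def shared_serial_params(d_model, rank):
    """Trainable parameters for one shared low-rank pair (assumed layout)."""
    return 2 * d_model * rank


# ViT-Base-like width and a typical small rank, both chosen for illustration.
ratio = shared_serial_params(768, 8) / lora_params(768, 8)
```

Under these assumptions the ratio does not depend on the model width or the rank, only on how many projections would otherwise receive their own adapter.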

[CV-209] RDTF: Resource-efficient Dual-mask Training Framework for Multi-frame Animated Sticker Generation

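【Quick read】: This paper addresses how to train video generation models more efficiently for downstream applications under constrained resources. Conventional approaches adapt to the target domain through parameter-efficient tuning (such as Adapter or LoRA), but the small number of trainable parameters limits fitting ability, and interference from source-domain knowledge can make inference drift away from the target domain. The key is a method for training a small video generation model from scratch that, using only million-level samples, outperforms parameter-efficient tuning of larger models on downstream tasks. Its core is the effective use of data and curriculum strategy: first, a discrete frame generation network is built for low-frame-rate sticker generation; second, a dual-mask data utilization strategy improves the availability and diversity of limited data; finally, a difficulty-adaptive curriculum learning method decomposes the sample entropy to select samples from easy to difficult, which helps the model converge. Experiments confirm the feasibility and superiority of the method under constrained resources.

Link: https://arxiv.org/abs/2503.17735
Authors: Zhiqiang Yuan, Ting Zhang, Ying Deng, Jiapei Zhang, Yeshuang Zhu, Zexi Jia, Jie Zhou, Jinchao Zhang
Affiliations: Pattern Recognition Center, WeChat AI, Tencent (微信人工智能模式识别中心，腾讯)
Categories: Multimedia (cs.MM); Computer Vision and Pattern Recognition (cs.CV)
Comments: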

Abstract:Recently, great progress has been made in video generation technology, attracting the widespread attention of scholars. To apply this technology to downstream applications under resource-constrained conditions, researchers usually fine-tune the pre-trained models based on parameter-efficient tuning methods such as Adapter or LoRA. Although these methods can transfer the knowledge from the source domain to the target domain, fewer training parameters lead to poor fitting ability, and the knowledge from the source domain may lead to the inference process deviating from the target domain. In this paper, we argue that under constrained resources, training a smaller video generation model from scratch using only million-level samples can outperform parameter-efficient tuning on larger models in downstream applications: the core lies in the effective utilization of data and curriculum strategy. Taking animated sticker generation (ASG) as a case study, we first construct a discrete frame generation network for stickers with low frame rates, ensuring that its parameters meet the requirements of model training under constrained resources. In order to provide data support for models trained from scratch, we come up with a dual-mask based data utilization strategy, which manages to improve the availability and expand the diversity of limited data. To facilitate convergence under the dual-mask situation, we propose a difficulty-adaptive curriculum learning method, which decomposes the sample entropy into static and adaptive components so as to obtain samples from easy to difficult. The experiment demonstrates that our resource-efficient dual-mask training framework is quantitatively and qualitatively superior to parameter-efficient tuning methods such as I2V-Adapter and SimDA, verifying the feasibility of our method on downstream tasks under constrained resources. Code will be available.

[CV-210] GS-LTS: 3D Gaussian Splatting-Based Adaptive Modeling for Long-Term Service Robots

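【Quick read】: This paper addresses the fact that existing 3D Gaussian Splatting (3DGS) methods in robotics focus mainly on static scenes, whereas long-term service robots must handle diverse tasks in dynamic environments and keep the scene continuously updated and efficiently maintained; current methods struggle to meet these needs. The paper proposes GS-LTS (Gaussian Splatting for Long-Term Service), a system whose key steps are to detect scene changes (such as object addition or removal) via single-image change detection, autonomously collect multi-view observations with a rule-based policy, and efficiently update the scene representation through Gaussian editing. The paper also designs a simulation-based benchmark that automatically generates scene change data as compact configuration scripts, providing a standardized, user-friendly evaluation tool. Experiments show that GS-LTS has advantages in reconstruction and navigation and updates scenes faster and with higher quality than the image training baseline, improving 3DGS for long-term robotic operation.

Link: https://arxiv.org/abs/2503.17733
Authors: Bin Fu, Jialin Li, Bin Zhang, Ruiping Wang, Xilin Chen
Affiliations: Key Laboratory of AI Safety of CAS, Institute of Computing Technology, Chinese Academy of Sciences (CAS)(中国科学院计算技术研究所, 中国科学院人工智能安全实验室); University of Chinese Academy of Sciences (中国科学院大学)
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: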

Abstract:3D Gaussian Splatting (3DGS) has garnered significant attention in robotics for its explicit, high-fidelity dense scene representation, demonstrating strong potential for robotic applications. However, 3DGS-based methods in robotics primarily focus on static scenes, with limited attention to the dynamic scene changes essential for long-term service robots. These robots demand sustained task execution and efficient scene updates, challenges that current approaches fail to meet. To address these limitations, we propose GS-LTS (Gaussian Splatting for Long-Term Service), a 3DGS-based system enabling indoor robots to manage diverse tasks in dynamic environments over time. GS-LTS detects scene changes (e.g., object addition or removal) via single-image change detection, employs a rule-based policy to autonomously collect multi-view observations, and efficiently updates the scene representation through Gaussian editing. Additionally, we propose a simulation-based benchmark that automatically generates scene change data as compact configuration scripts, providing a standardized, user-friendly evaluation benchmark. Experimental results demonstrate GS-LTS’s advantages in reconstruction and navigation, as well as scene updates that are faster and of higher quality than the image training baseline, advancing 3DGS for long-term robotic operations. Code and benchmark are available at: this https URL.

[CV-211] Co-op: Correspondence-based Novel Object Pose Estimation CVPR2025

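【Quick read】: This paper tackles accurate and robust estimation of the six-degree-of-freedom (6DoF) pose of unseen objects from a single RGB image, given only a computer-aided design (CAD) model. Existing model-based methods are inefficient because they require a large number of templates, whereas the proposed Co-op method establishes semi-dense correspondences between the input image and pre-rendered templates, which substantially improves efficiency and accuracy. The key is a hybrid representation that combines patch-level classification with offset regression, together with a refinement stage that estimates probabilistic flow and refines the initial estimate through a differentiable PnP layer, enabling fast, high-accuracy pose estimation and state-of-the-art performance on the core datasets of the BOP Challenge.

Link: https://arxiv.org/abs/2503.17731
Authors: Sungphill Moon, Hyeontae Son, Dongcheol Hur, Sangwook Kim
Affiliations: NAVER LABS (NAVER LABS)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at CVPR 2025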

Abstract:We propose Co-op, a novel method for accurately and robustly estimating the 6DoF pose of objects unseen during training from a single RGB image. Our method requires only the CAD model of the target object and can precisely estimate its pose without any additional fine-tuning. While existing model-based methods suffer from inefficiency due to using a large number of templates, our method enables fast and accurate estimation with a small number of templates. This improvement is achieved by finding semi-dense correspondences between the input image and the pre-rendered templates. Our method achieves strong generalization performance by leveraging a hybrid representation that combines patch-level classification and offset regression. Additionally, our pose refinement model estimates probabilistic flow between the input image and the rendered image, refining the initial estimate to an accurate pose using a differentiable PnP layer. We demonstrate that our method not only estimates object poses rapidly but also outperforms existing methods by a large margin on the seven core datasets of the BOP Challenge, achieving state-of-the-art accuracy.
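
The hybrid representation mentioned in this abstract pairs a coarse patch-level class with a fine offset. A minimal sketch of how such a pair could be decoded into pixel coordinates follows; the row-major patch indexing and the convention that the offset lies in [0, 1) relative to the patch's top-left corner are assumptions made for illustration, and the function name is hypothetical.

```python
def patch_to_pixel(patch_index, offset, patch_size, patches_per_row):
    """Decode a (patch index, in-patch offset) pair into pixel coordinates.

    patch_index: row-major index of the predicted patch (classification output).
    offset: (dx, dy) in [0, 1), relative to the patch's top-left corner
            (regression output; this convention is an assumption).
    """
    row, col = divmod(patch_index, patches_per_row)
    x = (col + offset[0]) * patch_size
    y = (row + offset[1]) * patch_size
    return x, y


# Patch 5 on a 4-patch-wide grid of 16-pixel patches, with a sub-patch offset.
point = patch_to_pixel(5, (0.5, 0.25), patch_size=16, patches_per_row=4)
```

The classification step narrows the match to one patch, and the regressed offset recovers a location at finer than patch resolution.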

[CV-212] DynASyn: Multi-Subject Personalization Enabling Dynamic Action Synthesis AAAI2025

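【Quick read】: This paper addresses limitations of existing text-to-image diffusion models in personalization, in particular the difficulty of modifying a subject's behavior or dynamic interactions. The challenge stems from overfitting to the reference images and is especially pronounced when only one reference image is available. The key solution is DynASyn, which achieves multi-subject personalization from a single reference image by aligning concept-based priors with subject appearances and actions. Specifically, DynASyn preserves subject identity by regularizing the attention maps between the subject token and the images, and it uses concept-based prompt-and-image augmentation to improve the trade-off between identity preservation and action diversity. It further uses SDE (stochastic differential equation)-based editing guided by the augmented prompts to generate diverse appearances and actions while keeping identity consistent. Experiments show that DynASyn can synthesize highly realistic images of subjects in novel contexts and dynamic interactions, and that it outperforms baseline methods both quantitatively and qualitatively.

Link: https://arxiv.org/abs/2503.17728
Authors: Yongjin Choi, Chanhun Park, Seung Jun Baek
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted at AAAI 2025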

Abstract:Recent advances in text-to-image diffusion models spurred research on personalization, i.e., a customized image synthesis, of subjects within reference images. Although existing personalization methods are able to alter the subjects’ positions or to personalize multiple subjects simultaneously, they often struggle to modify the behaviors of subjects or their dynamic interactions. The difficulty is attributable to overfitting to reference images, which worsens if only a single reference image is available. We propose DynASyn, an effective multi-subject personalization from a single reference image addressing these challenges. DynASyn preserves the subject identity in the personalization process by aligning concept-based priors with subject appearances and actions. This is achieved by regularizing the attention maps between the subject token and images through concept-based priors. In addition, we propose concept-based prompt-and-image augmentation for an enhanced trade-off between identity preservation and action diversity. We adopt an SDE-based editing guided by augmented prompts to generate diverse appearances and actions while maintaining identity consistency in the augmented images. Experiments show that DynASyn is capable of synthesizing highly realistic images of subjects with novel contexts and dynamic interactions with the surroundings, and outperforms baseline methods in both quantitative and qualitative aspects.

[CV-213] Towards Invisible Backdoor Attack on Text-to-Image Diffusion Model

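【Quick read】: This paper studies the detectability of backdoor attacks on text-to-image diffusion models and proposes a new Invisible Backdoor Attack (IBA) that makes backdoor samples stealthier. Current backdoor samples typically show two key abnormalities relative to benign samples: Semantic Consistency and Attention Consistency, which give defenders detectable cues. To improve stealth, the key steps are to use syntactic structures as backdoor triggers, which amplifies sensitivity to textual variations and breaks semantic consistency, and to apply a regularization based on Kernel Maximum Mean Discrepancy (KMMD) that aligns the distribution of cross-attention responses between backdoor and benign samples, disrupting attention consistency. Experiments show that the method maintains a high attack success rate while significantly improving resistance to detection.

Link: https://arxiv.org/abs/2503.17724
Authors: Jie Zhang, Zhongqi Wang, Shiguang Shan, Xilin Chen
Affiliations: Key Laboratory of AI Safety of CAS, Institute of Computing Technology, Chinese Academy of Sciences (CAS)(中国科学院计算技术研究所, 中国科学院人工智能安全重点实验室), Chinese Academy of Sciences (CAS)(中国科学院); University of Chinese Academy of Sciences(中国科学院大学)
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: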

Abstract:Backdoor attacks targeting text-to-image diffusion models have advanced rapidly, enabling attackers to implant malicious triggers into these models to manipulate their outputs. However, current backdoor samples often exhibit two key abnormalities compared to benign samples: 1) Semantic Consistency, where backdoor prompts tend to generate images with similar semantic content even with significant textual variations to the prompts; 2) Attention Consistency, where the trigger induces consistent structural responses in the cross-attention maps. These consistencies leave detectable traces for defenders, making backdoors easier to identify. To enhance the stealthiness of backdoor samples, we propose a novel Invisible Backdoor Attack (IBA) by explicitly mitigating these consistencies. Specifically, our approach leverages syntactic structures as backdoor triggers to amplify the sensitivity to textual variations, effectively breaking down the semantic consistency. Besides, a regularization method based on Kernel Maximum Mean Discrepancy (KMMD) is proposed to align the distribution of cross-attention responses between backdoor and benign samples, thereby disrupting attention consistency. Extensive experiments demonstrate that our IBA achieves a 97.5% attack success rate while exhibiting stronger resistance to defenses, with an average of over 98% backdoor samples bypassing three state-of-the-art detection mechanisms. The code is available at this https URL.
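
The KMMD regularizer named in this abstract builds on the maximum mean discrepancy between two sample sets. As a rough illustration, the sketch below computes the biased estimate of squared MMD with a Gaussian (RBF) kernel in plain Python. The kernel choice, the bandwidth `sigma`, and the function names are assumptions made for illustration; the abstract does not specify them, and the paper applies the idea to cross-attention responses rather than to toy points.

```python
import math


def rbf_kernel(x, y, sigma=1.0):
    """Gaussian kernel exp(-||x - y||^2 / (2 * sigma^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def mmd_squared(xs, ys, sigma=1.0):
    """Biased estimate of squared MMD between sample sets xs and ys."""
    k_xx = sum(rbf_kernel(a, b, sigma) for a in xs for b in xs) / len(xs) ** 2
    k_yy = sum(rbf_kernel(a, b, sigma) for a in ys for b in ys) / len(ys) ** 2
    k_xy = sum(rbf_kernel(a, b, sigma) for a in xs for b in ys) / (len(xs) * len(ys))
    return k_xx + k_yy - 2.0 * k_xy


benign = [[0.0, 0.0], [1.0, 0.0]]
shifted = [[5.0, 5.0], [6.0, 5.0]]
same_distribution = mmd_squared(benign, benign)   # identical sets give zero
different_distribution = mmd_squared(benign, shifted)
```

Used as a training penalty, a term of this kind is small when the two sets of responses are distributed alike, which is the alignment effect the abstract describes.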

[CV-214] BackMix: Regularizing Open Set Recognition by Removing Underlying Fore-Background Priors

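【Quick read】: This paper addresses the sensitivity of open set recognition (OSR) to the choice of known outliers drawn from auxiliary datasets. Traditional methods rely on carefully selected known outliers from auxiliary data to regularize OSR models, but they are highly sensitive to which outliers are selected. The paper asks, from a new perspective, whether effective regularization is possible without elaborately selecting auxiliary known outliers.

The key is to reveal the roles of foregrounds and backgrounds in open set recognition and to propose a new method, Background Mix (BackMix). Empirical and theoretical analysis shows that: 1) backgrounds that correlate with foregrounds can mislead the model and cause failures on "partially" known images; 2) backgrounds unrelated to foregrounds can serve as auxiliary known outliers and provide regularization via global average pooling. Based on this insight, BackMix estimates the foreground of an image with class activation maps (CAMs) and randomly replaces image patches with backgrounds from other images, mixing the foreground with different backgrounds to remove the fore-background priors. The method is simple to implement, requires no extra operations at inference, and can be integrated seamlessly into almost all existing frameworks.

Link: https://arxiv.org/abs/2503.17717
Authors: Yu Wang, Junxian Mu, Hongzhi Huang, Qilong Wang, Pengfei Zhu, Qinghua Hu
Affiliations: College of Intelligence and Computing, Tianjin University (天津大学智能与计算学院), Haihe Laboratory of Information Technology Application Innovation (Haihe Lab of ITAI) (海河实验室信息技术应用创新研究所)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 20 pages, 11 figures. Accepted by TPAMI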

Abstract:Open set recognition (OSR) requires models to classify known samples while detecting unknown samples for real-world applications. Existing studies show impressive progress using unknown samples from auxiliary datasets to regularize OSR models, but they have proved to be sensitive to selecting such known outliers. In this paper, we discuss the aforementioned problem from a new perspective: Can we regularize OSR models without elaborately selecting auxiliary known outliers? We first empirically and theoretically explore the role of foregrounds and backgrounds in open set recognition and disclose that: 1) backgrounds that correlate with foregrounds would mislead the model and cause failures when it encounters ‘partially’ known images; 2) backgrounds unrelated to foregrounds can serve as auxiliary known outliers and provide regularization via global average pooling. Based on the above insights, we propose a new method, Background Mix (BackMix), that mixes the foreground of an image with different backgrounds to remove the underlying fore-background priors. Specifically, BackMix first estimates the foreground with class activation maps (CAMs), then randomly replaces image patches with backgrounds from other images to obtain mixed images for training. With backgrounds de-correlated from foregrounds, the open set recognition performance is significantly improved. The proposed method is quite simple to implement, requires no extra operation at inference, and can be seamlessly integrated into almost all of the existing frameworks. The code is released on this https URL.
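
The two steps named in this abstract (estimate the foreground with CAMs, then swap in backgrounds from other images) can be sketched at patch granularity. The version below is a simplified illustration under stated assumptions: images are grids of patch placeholders, a patch counts as background when its CAM score is below a threshold, and the replacement probability `p` and the donor-selection rule are hypothetical choices rather than the paper's settings.

```python
import random


def backmix(img_a, cam_a, img_b, cam_b, thr=0.5, p=1.0, rng=None):
    """Replace background patches of img_a with background patches of img_b.

    img_*: 2-D grids of patches (any objects); cam_*: matching grids of
    CAM scores in [0, 1]. Patches with score < thr are treated as background
    (an assumed rule). Returns a new grid and leaves img_a untouched.
    """
    rng = rng or random.Random(0)
    donors = [
        img_b[i][j]
        for i in range(len(img_b))
        for j in range(len(img_b[0]))
        if cam_b[i][j] < thr
    ]
    mixed = [row[:] for row in img_a]
    if not donors:
        return mixed
    for i, row in enumerate(cam_a):
        for j, score in enumerate(row):
            if score < thr and rng.random() < p:
                mixed[i][j] = rng.choice(donors)
    return mixed


img_a = [["fg", "a1"], ["a2", "a3"]]
cam_a = [[0.9, 0.1], [0.2, 0.0]]
img_b = [["b0", "fg_b"], ["b2", "b3"]]
cam_b = [[0.1, 0.8], [0.3, 0.2]]
mixed = backmix(img_a, cam_a, img_b, cam_b, rng=random.Random(0))
```

In this toy run the single foreground patch of the first image is kept and every background patch is drawn from the second image's background, so the foreground no longer co-occurs with its original surroundings.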

[CV-215] EMPLACE: Self-Supervised Urban Scene Change Detection AAAI2025

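【Quick read】: This paper addresses the difficulty of transferring traditional Urban Scene Change Detection (USCD) methods to new cities because they depend on small annotated datasets. Specifically, traditional supervised methods require labor-intensive labeling and force a priori definitions of relevant change, which limits their flexibility and scalability in practice. The paper makes two key contributions: AC-1M, by far the largest USCD dataset with over 1.1 million images, and EMPLACE, a self-supervised method that trains a Vision Transformer with an adaptive triplet loss. With these, the model outperforms state-of-the-art methods both as pre-training for linear fine-tuning and in a zero-shot setting. In a case study of Amsterdam, EMPLACE detects both small and large changes across the city, and the detected changes, depending on their size, correlate with housing prices, which in turn is indicative of inequity. The core of the solution is therefore to overcome the limitations of traditional USCD methods with large-scale self-supervised learning.

Link: https://arxiv.org/abs/2503.17716
Authors: Tim Alpherts, Sennay Ghebreab, Nanne van Noord
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 7 pages, 7 figures, published at AAAI 2025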

Abstract:Urban change is a constant process that influences the perception of neighbourhoods and the lives of the people within them. The field of Urban Scene Change Detection (USCD) aims to capture changes in street scenes using computer vision and can help raise awareness of changes that make it possible to better understand the city and its residents. Traditionally, the field of USCD has used supervised methods with small-scale datasets. This constrains methods when applied to new cities, as it requires labour-intensive labeling processes and forces a priori definitions of relevant change. In this paper we introduce AC-1M, by far the largest USCD dataset with over 1.1M images, together with EMPLACE, a self-supervised method to train a Vision Transformer using our adaptive triplet loss. We show EMPLACE outperforms SOTA methods both as a pre-training method for linear fine-tuning and in a zero-shot setting. Lastly, in a case study of Amsterdam, we show that we are able to detect both small and large changes throughout the city and that changes uncovered by EMPLACE, depending on size, correlate with housing prices, which in turn is indicative of inequity.
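
For readers unfamiliar with triplet objectives, the sketch below shows the standard fixed-margin triplet loss on embedding vectors. It is a generic baseline shown for orientation only: the abstract does not describe how EMPLACE adapts the loss, so the fixed `margin` and the Euclidean distance used here are assumptions and not the paper's adaptive formulation.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Standard triplet loss: pull the positive closer than the negative by a margin."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)


# The negative is already far enough away, so the loss is zero.
satisfied = triplet_loss([0.0, 0.0], [0.0, 1.0], [0.0, 3.0])
# The negative is too close to the anchor, so the loss is positive.
violated = triplet_loss([0.0, 0.0], [0.0, 1.0], [0.0, 1.5])
```

In a change-detection setting one might take the anchor and positive to be views of the same unchanged place and the negative a different or changed scene, although that pairing is also an assumption here.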

[CV-216] Normalized Matching Transformer

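【Quick read】: This paper addresses sparse keypoint matching, that is, establishing accurate keypoint correspondences between a pair of images. It proposes an end-to-end deep learning method whose key ingredients are a visual backbone, a SplineCNN graph neural network for feature processing, a normalized transformer decoder for decoding keypoint correspondences, and the Sinkhorn algorithm for refining the matching. The method is trained with a contrastive loss and a hyperspherical loss for better feature representations and uses data augmentation to improve robustness. Compared with existing state-of-the-art methods, the architecture improves performance by 5.1% on PascalVOC and 2.2% on SPair-71k while training for fewer epochs.

Link: https://arxiv.org/abs/2503.17715
Authors: Abtin Pourhadi, Paul Swoboda
Affiliations: Heinrich Heine University Düsseldorf (海因里希海涅杜塞尔多夫大学)
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: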

Abstract:We present a new state-of-the-art approach for sparse keypoint matching between pairs of images. Our method is fully deep learning based: it combines a visual backbone with a SplineCNN graph neural network for feature processing, and a normalized transformer decoder together with the Sinkhorn algorithm for decoding keypoint correspondences. Our method is trained using a contrastive and a hyperspherical loss for better feature representations. We additionally use data augmentation during training. This comparatively simple architecture, combining extensive normalization and advanced losses, outperforms current state-of-the-art approaches (BBGM, ASAR, COMMON and GMTR) on the PascalVOC and SPair-71k datasets by 5.1% and 2.2% respectively, while training for at least 1.7x fewer epochs.
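
The Sinkhorn algorithm mentioned in this abstract turns a matrix of matching scores into an approximately doubly stochastic assignment by alternating row and column normalization. The sketch below is a minimal plain-Python version. It assumes a square score matrix, a temperature `tau`, and a fixed iteration count, all chosen for illustration; how the paper couples Sinkhorn with the decoder is not described in the abstract.

```python
import math


def sinkhorn(scores, n_iters=50, tau=1.0):
    """Alternate row and column normalization of exp(scores / tau).

    scores: square matrix (list of lists) of keypoint similarity scores.
    Returns a matrix whose rows and columns each sum to approximately 1.
    """
    k = [[math.exp(s / tau) for s in row] for row in scores]
    for _ in range(n_iters):
        # Normalize every row to sum to 1.
        k = [[v / sum(row) for v in row] for row in k]
        # Normalize every column to sum to 1.
        col_sums = [sum(col) for col in zip(*k)]
        k = [[v / c for v, c in zip(row, col_sums)] for row in k]
    return k


# Two keypoints per image; the diagonal pairs have the higher similarity.
assignment = sinkhorn([[2.0, 0.0], [0.0, 2.0]])
```

A hard matching can then be read off by taking, for each row, the column with the largest entry.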

[CV-217] Multi-modality Anomaly Segmentation on the Road

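【Quick read】: This paper addresses the tendency of current uni-modal anomaly segmentation frameworks to assign excessively high anomaly scores to non-anomalous regions, a problem that matters for the safety of autonomous driving systems. Its key contribution is MMRAS+, a multi-modal, uncertainty-based anomaly segmentation framework that introduces text-modality information from the CLIP text encoder and thereby reduces the high anomaly scores output for non-anomalous classes. The method also includes an ensemble module that further improves anomaly segmentation performance. The authors describe it as the first multi-modal anomaly segmentation solution for autonomous driving, and it shows superior performance on the RoadAnomaly, SMIYC, and Fishyscapes validation datasets.

Link: https://arxiv.org/abs/2503.17712
Authors: Heng Gao, Zhuolin He, Shoumeng Qiu, Xiangyang Xue, Jian Pu
Affiliations: Institute of Science and Technology for Brain-inspired Intelligence, Fudan University (类脑智能技术研究院, 复旦大学); School of Computer Science, Fudan University (计算机科学学院, 复旦大学)
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: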

Abstract:Semantic segmentation allows autonomous driving cars to understand the surroundings of the vehicle comprehensively. However, it is also crucial for the model to detect obstacles that may jeopardize the safety of autonomous driving systems. Based on our experiments, we find that current uni-modal anomaly segmentation frameworks tend to produce high anomaly scores for non-anomalous regions in images. Motivated by this empirical finding, we develop a multi-modal uncertainty-based anomaly segmentation framework, named MMRAS+, for autonomous driving systems. MMRAS+ effectively reduces the high anomaly outputs of non-anomalous classes by introducing a text modality via the CLIP text encoder. To our knowledge, MMRAS+ is the first multi-modal anomaly segmentation solution for autonomous driving. Moreover, we develop an ensemble module to further boost the anomaly segmentation performance. Experiments on RoadAnomaly, SMIYC, and Fishyscapes validation datasets demonstrate the superior performance of our method. The code is available at this https URL.

[CV-218] GUI-Xplore: Empowering Generalizable GUI Agents with One Exploration CVPR2025

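【Quick read】: This paper addresses the difficulty current GUI agent methods have in generalizing across applications (apps) and tasks. The problem stems from two fundamental limitations of existing datasets: first, they overlook developer-induced structural variations among apps, which limits knowledge transfer across diverse software environments; second, many focus solely on navigation tasks and so cannot adequately represent comprehensive software architectures and complex user interactions. To address this, the paper introduces the GUI-Xplore dataset, designed to improve cross-app and cross-task generalization through an exploration-and-reasoning framework; it combines pre-recorded exploration videos that provide contextual insight with five hierarchically structured downstream tasks for comprehensively evaluating GUI agents. The key to the solution is the proposed Xplore-Agent framework, which combines Action-aware GUI Modeling with Graph-Guided Environment Reasoning and improves performance in unfamiliar environments, although considerable room remains for further progress toward truly generalizable GUI agents.

Link: https://arxiv.org/abs/2503.17709
Authors: Yuchen Sun, Shanhui Zhao, Tao Yu, Hao Wen, Samith Va, Mengwei Xu, Yuanchun Li, Chongyang Zhang
Affiliations: School of Information Science and Electronic Engineering, Shanghai Jiao Tong University (上海交通大学信息科学与电子工程学院); Institute for AI Industry Research (AIR), Tsinghua University (清华大学人工智能产业研究院); MoE Key Lab of Artificial Intelligence, AI Institute, Shanghai Jiao Tong University (教育部人工智能重点实验室，上海交通大学人工智能研究院); Beijing Academy of Artificial Intelligence (BAAI) (北京智源人工智能研究院); Beijing University of Posts and Telecommunications (北京邮电大学)
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: CVPR 2025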

Abstract:GUI agents hold significant potential to enhance the experience and efficiency of human-device interaction. However, current methods face challenges in generalizing across applications (apps) and tasks, primarily due to two fundamental limitations in existing datasets. First, these datasets overlook developer-induced structural variations among apps, limiting the transferability of knowledge across diverse software environments. Second, many of them focus solely on navigation tasks, which restricts their capacity to represent comprehensive software architectures and complex user interactions. To address these challenges, we introduce GUI-Xplore, a dataset meticulously designed to enhance cross-application and cross-task generalization via an exploration-and-reasoning framework. GUI-Xplore integrates pre-recorded exploration videos providing contextual insights, alongside five hierarchically structured downstream tasks designed to comprehensively evaluate GUI agent capabilities. To fully exploit GUI-Xplore’s unique features, we propose Xplore-Agent, a GUI agent framework that combines Action-aware GUI Modeling with Graph-Guided Environment Reasoning. Further experiments indicate that Xplore-Agent achieves a 10% improvement over existing methods in unfamiliar environments, yet there remains significant potential for further enhancement towards truly generalizable GUI agents.

[CV-219] MAMAT: 3D Mamba-Based Atmospheric Turbulence Removal and its Object Detection Capability

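Quick read: This paper aims to mitigate the quality degradation of videos captured under atmospheric turbulence, improving visualization as well as object detection, classification, and tracking in surveillance systems. It proposes a new Mamba-based method, 3D Mamba-Based Atmospheric Turbulence Removal (MAMAT), whose key idea is a dual-module strategy for reducing these distortions. The first module uses deformable 3D convolutions for non-rigid registration to minimize spatial shifts; the second enhances contrast and detail. Experiments show that MAMAT outperforms existing learning-based methods, with gains of up to 3% in visual quality and 15% in object detection, improving both visual restoration and the effectiveness of surveillance applications.

Link: https://arxiv.org/abs/2503.17700
Authors: Paul Hill,Zhiming Liu,Nantheera Anantrasirichai
Institutions: Visual Information Laboratory, University of Bristol
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: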

Abstract:Restoration and enhancement are essential for improving the quality of videos captured under atmospheric turbulence conditions, aiding visualization, object detection, classification, and tracking in surveillance systems. In this paper, we introduce a novel Mamba-based method, the 3D Mamba-Based Atmospheric Turbulence Removal (MAMAT), which employs a dual-module strategy to mitigate these distortions. The first module utilizes deformable 3D convolutions for non-rigid registration to minimize spatial shifts, while the second module enhances contrast and detail. Leveraging the advanced capabilities of the 3D Mamba architecture, experimental results demonstrate that MAMAT outperforms state-of-the-art learning-based methods, achieving up to a 3% improvement in visual quality and a 15% boost in object detection. It not only enhances visualization but also significantly improves object detection accuracy, bridging the gap between visual restoration and the effectiveness of surveillance applications.

[CV-220] MUST: The First Dataset and Unified Framework for Multispectral UAV Single Object Tracking CVPR2025

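Quick read: This paper addresses challenges that UAV tracking faces in real-world scenarios, such as small targets and occlusions, which limit the performance of RGB-based trackers. Multispectral images (MSI) offer a promising solution thanks to their additional spectral information, but progress has been held back by a lack of relevant datasets. The paper therefore introduces the first large-scale Multispectral UAV Single Object Tracking dataset (MUST), comprising 250 video sequences that span diverse environments and challenges. It also proposes a new tracking framework, UNTrack, whose key idea is to encode unified spectral, spatial, and temporal features from spectrum prompts, initial templates, and sequential searches. UNTrack uses an asymmetric transformer with a spectral background elimination mechanism for relationship modeling, plus an encoder that continuously updates the spectrum prompt to refine tracking, improving both accuracy and efficiency. Experiments show that UNTrack outperforms existing state-of-the-art UAV trackers.

Link: https://arxiv.org/abs/2503.17699
Authors: Haolin Qin,Tingfa Xu,Tianhao Li,Zhenxiang Chen,Tao Feng,Jianan Li
Institutions: Beijing Institute of Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: CVPR2025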

Abstract:UAV tracking faces significant challenges in real-world scenarios, such as small-size targets and occlusions, which limit the performance of RGB-based trackers. Multispectral images (MSI), which capture additional spectral information, offer a promising solution to these challenges. However, progress in this field has been hindered by the lack of relevant datasets. To address this gap, we introduce the first large-scale Multispectral UAV Single Object Tracking dataset (MUST), which includes 250 video sequences spanning diverse environments and challenges, providing a comprehensive data foundation for multispectral UAV tracking. We also propose a novel tracking framework, UNTrack, which encodes unified spectral, spatial, and temporal features from spectrum prompts, initial templates, and sequential searches. UNTrack employs an asymmetric transformer with a spectral background eliminate mechanism for optimal relationship modeling and an encoder that continuously updates the spectrum prompt to refine tracking, improving both accuracy and efficiency. Extensive experiments show that our proposed UNTrack outperforms state-of-the-art UAV trackers. We believe our dataset and framework will drive future research in this area. The dataset is available on this https URL.

[CV-221] MotionDiff: Training-free Zero-shot Interactive Motion Editing via Flow-assisted Multi-view Diffusion

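Quick read: This paper tackles controllable editing with generative models, particularly multi-view motion editing, where the inherent uncertainty of generated outputs makes complex rotation and stretching motions hard to handle and multi-view consistency hard to guarantee. Existing physics-based generative methods are typically limited to simple motions (such as translation and dragging) on single-view images and often require resource-intensive retraining. The paper proposes MotionDiff, a training-free, zero-shot diffusion method whose key idea is to use optical flow for complex multi-view motion editing. Given a static scene, users interactively select objects and add motion priors; the Point Kinematic Model (PKM) then estimates the corresponding multi-view optical flows in the Multi-view Flow Estimation Stage (MFES), and the Multi-view Motion Diffusion Stage (MMDS) uses these flows to generate high-quality, multi-view consistent motion results through a decoupled motion representation. Because no retraining is needed, the method adapts readily to various downstream tasks.

Link: https://arxiv.org/abs/2503.17695
Authors: Yikun Ma,Yiqing Li,Jiawei Wu,Zhi Jin
Institutions: Sun Yat-sen University; Guangdong Provincial Key Laboratory of Fire Science and Intelligent Emergency Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: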

Abstract:Generative models have made remarkable advancements and are capable of producing high-quality content. However, performing controllable editing with generative models remains challenging, due to their inherent uncertainty in outputs. This challenge is particularly pronounced in motion editing, which involves the processing of spatial information. While some physics-based generative methods have attempted to implement motion editing, they typically operate on single-view images with simple motions, such as translation and dragging. These methods struggle to handle complex rotation and stretching motions and ensure multi-view consistency, often necessitating resource-intensive retraining. To address these challenges, we propose MotionDiff, a training-free zero-shot diffusion method that leverages optical flow for complex multi-view motion editing. Specifically, given a static scene, users can interactively select objects of interest to add motion priors. The proposed Point Kinematic Model (PKM) then estimates corresponding multi-view optical flows during the Multi-view Flow Estimation Stage (MFES). Subsequently, these optical flows are utilized to generate multi-view motion results through decoupled motion representation in the Multi-view Motion Diffusion Stage (MMDS). Extensive experiments demonstrate that MotionDiff outperforms other physics-based generative motion editing methods in achieving high-quality multi-view consistent motion results. Notably, MotionDiff does not require retraining, enabling users to conveniently adapt it for various downstream tasks.

[CV-222] CountLLM: Towards Generalizable Repetitive Action Counting via Large Language Model CVPR2025

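Quick read: This paper addresses two main problems in existing repetitive action counting methods: regression networks with limited representational capacity struggle to capture variable periodic patterns, and supervised learning on narrow, limited training sets causes overfitting that restricts generalization across diverse scenarios. It proposes CountLLM, a new framework based on a large language model (LLM). The key idea is to exploit the rich cues in explicit textual instructions together with the strong representational capabilities of a pre-trained LLM, guided by a periodicity-based structured instruction template with a standardized answer format for consistency, and to strengthen the model's periodicity awareness with a progressive multimodal training paradigm. Experiments on widely recognized benchmarks show strong performance and generalization, particularly on novel or out-of-domain actions that differ significantly from the training data.

Link: https://arxiv.org/abs/2503.17690
Authors: Ziyu Yao,Xuxin Cheng,Zhiqi Huang,Lei Li
Institutions: Peking University; University of Washington; University of Copenhagen
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by CVPR 2025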

Abstract:Repetitive action counting, which aims to count periodic movements in a video, is valuable for video analysis applications such as fitness monitoring. However, existing methods largely rely on regression networks with limited representational capacity, which hampers their ability to accurately capture variable periodic patterns. Additionally, their supervised learning on narrow, limited training sets leads to overfitting and restricts their ability to generalize across diverse scenarios. To address these challenges, we propose CountLLM, the first large language model (LLM)-based framework that takes video data and periodic text prompts as inputs and outputs the desired counting value. CountLLM leverages the rich clues from explicit textual instructions and the powerful representational capabilities of pre-trained LLMs for repetitive action counting. To effectively guide CountLLM, we develop a periodicity-based structured template for instructions that describes the properties of periodicity and implements a standardized answer format to ensure consistency. Additionally, we propose a progressive multimodal training paradigm to enhance the periodicity-awareness of the LLM. Empirical evaluations on widely recognized benchmarks demonstrate CountLLM’s superior performance and generalization, particularly in handling novel and out-of-domain actions that deviate significantly from the training data, offering a promising avenue for repetitive action counting.

[CV-223] Towards Transformer-Based Aligned Generation with Self-Coherence Guidance CVPR2025

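Quick read: This paper addresses the difficulty that existing Transformer-based Text-Guided Diffusion Models (TGDMs) have in generating semantically aligned images, especially for complex text prompts or multi-concept attribute binding. Earlier U-Net-based methods mainly optimized the latent space, but applying them directly to Transformer architectures has shown limited effectiveness. The key contribution is a training-free method that achieves precise semantic alignment by directly optimizing cross-attention maps during generation. Specifically, the authors introduce Self-Coherence Guidance, which dynamically refines attention maps using masks derived from previous denoising steps so that the generated image stays consistent with the input prompt. Experiments show that the method significantly outperforms other state-of-the-art methods on coarse-grained attribute binding, fine-grained attribute binding, and style binding.

Link: https://arxiv.org/abs/2503.17675
Authors: Shulei Wang,Wang Lin,Hai Huang,Hanting Wang,Sihang Cai,WenKang Han,Tao Jin,Jingyuan Chen,Jiacheng Sun,Jieming Zhu,Zhou Zhao
Institutions: Zhejiang University; Huawei Noah’s Ark Lab
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by CVPR 2025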

Abstract:We introduce a novel, training-free approach for enhancing alignment in Transformer-based Text-Guided Diffusion Models (TGDMs). Existing TGDMs often struggle to generate semantically aligned images, particularly when dealing with complex text prompts or multi-concept attribute binding challenges. Previous U-Net-based methods primarily optimized the latent space, but their direct application to Transformer-based architectures has shown limited effectiveness. Our method addresses these challenges by directly optimizing cross-attention maps during the generation process. Specifically, we introduce Self-Coherence Guidance, a method that dynamically refines attention maps using masks derived from previous denoising steps, ensuring precise alignment without additional training. To validate our approach, we constructed more challenging benchmarks for evaluating coarse-grained attribute binding, fine-grained attribute binding, and style binding. Experimental results demonstrate the superior performance of our method, significantly surpassing other state-of-the-art methods across all evaluated tasks. Our code is available at this https URL.

[CV-224] DCEvo: Discriminative Cross-Dimensional Evolutionary Learning for Infrared and Visible Image Fusion CVPR2025

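Quick read: This paper addresses the fact that existing infrared and visible image fusion methods yield only marginal gains in downstream task performance and provide no constructive feedback for optimizing the fusion process. It proposes DCEvo (Discriminative Cross-Dimension Evolutionary Learning Framework), which draws on the robust search capability of evolutionary learning to cast image fusion and the subsequent high-level task as a single multi-objective optimization problem, using an evolutionary algorithm to dynamically adjust loss function parameters and balance the two tasks. In addition, a Discriminative Enhancer (DE) in both the encoder and the decoder enables effective learning of complementary features from different modalities, and a Cross-Dimensional Embedding (CDE) block lets high-dimensional task features and low-dimensional fusion features enhance each other. Experiments on three benchmarks show an average improvement of 9.32% in visual quality along with better performance on subsequent high-level tasks.

Link: https://arxiv.org/abs/2503.17673
Authors: Jinyuan Liu,Bowei Zhang,Qingyun Mei,Xingyuan Li,Yang Zou,Zhiying Jiang,Long Ma,Risheng Liu,Xin Fan
Institutions: School of Mechanical Engineering, Dalian University of Technology; School of Software Technology & DUT-RU International School of ISE, Dalian University of Technology; School of Computer Science, Northwestern Polytechnical University; College of Information Science and Technology, Dalian Maritime University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by CVPR 2025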

Abstract:Infrared and visible image fusion integrates information from distinct spectral bands to enhance image quality by leveraging the strengths and mitigating the limitations of each modality. Existing approaches typically treat image fusion and subsequent high-level tasks as separate processes, resulting in fused images that offer only marginal gains in task performance and fail to provide constructive feedback for optimizing the fusion process. To overcome these limitations, we propose a Discriminative Cross-Dimension Evolutionary Learning Framework, termed DCEvo, which simultaneously enhances visual quality and perception accuracy. Leveraging the robust search capabilities of Evolutionary Learning, our approach formulates the optimization of dual tasks as a multi-objective problem by employing an Evolutionary Algorithm (EA) to dynamically balance loss function parameters. Inspired by visual neuroscience, we integrate a Discriminative Enhancer (DE) within both the encoder and decoder, enabling the effective learning of complementary features from different modalities. Additionally, our Cross-Dimensional Embedding (CDE) block facilitates mutual enhancement between high-dimensional task features and low-dimensional fusion features, ensuring a cohesive and efficient feature integration process. Experimental results on three benchmarks demonstrate that our method significantly outperforms state-of-the-art approaches, achieving an average improvement of 9.32% in visual quality while also enhancing subsequent high-level tasks. The code is available at this https URL.

[CV-225] A Temporal Modeling Framework for Video Pre-Training on Video Instance Segmentation ICME2025

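Quick read: This paper addresses the domain gap in Video Instance Segmentation (VIS) caused by pre-trained models lacking temporal knowledge, which may hurt VIS performance. To bridge this gap, it proposes a new video pre-training approach for VIS models, aimed especially at videos with intricate instance relationships. The key idea is to reduce the disparity between the pre-training and fine-tuning stages. First, consistent pseudo-video augmentations generate diverse pseudo-video samples for pre-training while keeping instances consistent across frames. Second, a multi-scale temporal module uses self- and cross-attention to improve the modeling of temporal relations over short- and long-term spans. The approach places no constraints on model architecture and integrates with various VIS methods. Experiments show that it consistently outperforms state-of-the-art methods on commonly used VIS benchmarks and achieves a notable 4.0% increase in average precision on the challenging OVIS dataset.

Link: https://arxiv.org/abs/2503.17672
Authors: Qing Zhong,Peng-Tao Jiang,Wen Wang,Guodong Ding,Lin Wu,Kaiqi Huang
Institutions: University of Adelaide; vivo; Zhejiang University; NUS (National University of Singapore); Swansea University; University of South China
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 7 pages, 5 figures, 6 tables, Accepted to ICME 2025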

Abstract:Contemporary Video Instance Segmentation (VIS) methods typically adhere to a pre-train then fine-tune regime, where a segmentation model trained on images is fine-tuned on videos. However, the lack of temporal knowledge in the pre-trained model introduces a domain gap which may adversely affect the VIS performance. To effectively bridge this gap, we present a novel video pre-training approach to enhance VIS models, especially for videos with intricate instance relationships. Our crucial innovation focuses on reducing disparities between the pre-training and fine-tuning stages. Specifically, we first introduce consistent pseudo-video augmentations to create diverse pseudo-video samples for pre-training while maintaining the instance consistency across frames. Then, we incorporate a multi-scale temporal module to enhance the model’s ability to model temporal relations through self- and cross-attention at short- and long-term temporal spans. Our approach does not set constraints on model architecture and can integrate seamlessly with various VIS methods. Experiment results on commonly adopted VIS benchmarks show that our method consistently outperforms state-of-the-art methods. Our approach achieves a notable 4.0% increase in average precision on the challenging OVIS dataset.
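One way to picture the "consistent pseudo-video augmentations" mentioned in the abstract is to turn a single annotated image into a short clip by applying the same drifting offset to the image and to its instance mask, so that instance IDs stay attached to the same pixels in every frame. The sketch below is only an illustration under that assumption; the abstract does not describe the actual augmentations, and `shift` and `make_pseudo_video` are hypothetical names.

```python
import random


def shift(grid, dx, dy, fill=0):
    """Translate a 2D grid by (dx, dy) pixels, padding uncovered cells with `fill`."""
    h, w = len(grid), len(grid[0])
    out = [[fill] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            ny, nx = y + dy, x + dx
            if 0 <= ny < h and 0 <= nx < w:
                out[ny][nx] = grid[y][x]
    return out


def make_pseudo_video(image, instance_mask, num_frames, max_step=1, rng=None):
    """Build (frame, mask) pairs by applying the same drifting offset to both."""
    rng = rng or random.Random()
    frames, dx, dy = [], 0, 0
    for _ in range(num_frames):
        # Image and mask share one transform, so instance IDs stay consistent.
        frames.append((shift(image, dx, dy), shift(instance_mask, dx, dy)))
        # Random-walk the offset so consecutive frames differ only slightly.
        dx += rng.randint(-max_step, max_step)
        dy += rng.randint(-max_step, max_step)
    return frames


# Usage: a 5x5 image with one instance (ID 7) at the center pixel.
image = [[y * 5 + x + 1 for x in range(5)] for y in range(5)]
mask = [[0] * 5 for _ in range(5)]
mask[2][2] = 7
video = make_pseudo_video(image, mask, num_frames=3, rng=random.Random(0))
```

A real pipeline would likely use richer transforms (scaling, rotation, photometric changes); the point here is only that sharing one transform between image and mask preserves instance identity across frames.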

[CV-226] TDRI: Two-Phase Dialogue Refinement and Co-Adaptation for Interactive Image Generation

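Quick read: This paper addresses the challenges that text-to-image generation faces with ambiguous prompts and with aligning outputs to user expectations. The proposed solution is the TDRI (Two-Phase Dialogue Refinement and Co-Adaptation) framework, whose key idea is to improve image generation through iterative user interaction. TDRI has two phases: the Initial Generation Phase creates base images from user prompts, and the Interactive Refinement Phase integrates user feedback through three modules. The Dialogue-to-Prompt (D2P) module turns user feedback into actionable prompts, improving the alignment between user intent and model input; the Feedback-Reflection (FR) module evaluates generated outputs against user expectations to identify discrepancies and drive improvements; and the Adaptive Optimization (AO) module fine-tunes the generation process by balancing user preferences while maintaining prompt fidelity. Together, these modules support output quality and user satisfaction.

Link: https://arxiv.org/abs/2503.17669
Authors: Yuheng Feng,Jianhui Wang,Kun Li,Sida Li,Tianyu Shi,Haoyue Han,Miao Zhang,Xueqian Wang
Institutions: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: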

Abstract:Although text-to-image generation technologies have made significant advancements, they still face challenges when dealing with ambiguous prompts and aligning outputs with user expectations. The proposed framework, TDRI (Two-Phase Dialogue Refinement and Co-Adaptation), addresses these issues by enhancing image generation through iterative user interaction. It consists of two phases: the Initial Generation Phase, which creates base images based on user prompts, and the Interactive Refinement Phase, which integrates user feedback through three key modules. The Dialogue-to-Prompt (D2P) module ensures that user feedback is effectively transformed into actionable prompts, which improves the alignment between user intent and model input. By evaluating generated outputs against user expectations, the Feedback-Reflection (FR) module identifies discrepancies and facilitates improvements. In an effort to ensure consistently high-quality results, the Adaptive Optimization (AO) module fine-tunes the generation process by balancing user preferences and maintaining prompt fidelity. Experimental results show that TDRI outperforms existing methods by achieving 33.6% human preference, compared to 6.2% for GPT-4 augmentation, and the highest CLIP and BLIP alignment scores (0.338 and 0.336, respectively). In iterative feedback tasks, user satisfaction increased to 88% after 8 rounds, with diminishing returns beyond 6 rounds. Furthermore, TDRI has been found to reduce the number of iterations and improve personalization in the creation of fashion products. TDRI exhibits a strong potential for a wide range of applications in the creative and industrial domains, as it streamlines the creative process and improves alignment with user preferences.

[CV-227] 3D Modeling: Camera Movement Estimation and path Correction for SFM Model using the Combination of Modified A-SIFT and Stereo System

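Quick read: This paper addresses the challenges that large viewpoint variations, computational complexity, and alignment errors pose for 3D model creation. It proposes a modified Affine Scale-Invariant Feature Transform (ASIFT) that extracts more matching points with lower computational overhead, ensuring enough inliers for precise estimation of the camera rotation angle. A two-camera-based rotation correction model is introduced to correct small rotational errors and further improve accuracy. In addition, the Structure From Motion (SFM) model is altered into a stereo camera-based model that estimates and corrects camera translation, which determines the camera's movement in 3D space. Combining the modified ASIFT with the two-camera-based SFM model yields an accurate camera trajectory in 3D space. Experiments show that the proposed approach achieves 99.9% accuracy relative to the actual camera path and outperforms state-of-the-art camera path estimation methods. The resulting accurate camera path improves the fidelity and efficiency of 3D reconstruction. The key contributions are the improved feature extraction technique and the new rotation and translation correction models.

Link: https://arxiv.org/abs/2503.17668
Authors: Usha Kumari,Shuvendu Rana
Institutions: srmap.edu.in (SRM AP Institute of Science and Technology (SRMAP)); ieee.org (Institute of Electrical and Electronics Engineers (IEEE))
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: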

Abstract:Creating accurate and efficient 3D models poses significant challenges, particularly in addressing large viewpoint variations, computational complexity, and alignment discrepancies. Efficient camera path generation can help resolve these issues. In this context, a modified version of the Affine Scale-Invariant Feature Transform (ASIFT) is proposed to extract more matching points with reduced computational overhead, ensuring an adequate number of inliers for precise camera rotation angle estimation. Additionally, a novel two-camera-based rotation correction model is introduced to mitigate small rotational errors, further enhancing accuracy. Furthermore, a stereo camera-based translation estimation and correction model is implemented to determine camera movement in 3D space by altering the Structure From Motion (SFM) model. Finally, the novel combination of ASIFT and two camera-based SFM models provides an accurate camera movement trajectory in 3D space. Experimental results show that the proposed camera movement approach achieves 99.9% accuracy compared to the actual camera movement path and outperforms state-of-the-art camera path estimation methods. By leveraging this accurate camera path, the system facilitates the creation of precise 3D models, making it a robust solution for applications requiring high fidelity and efficiency in 3D reconstruction.

[CV-228] OMR-Diffusion: Optimizing Multi-Round Enhanced Training in Diffusion Models for Improved Intent Understanding

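Quick read: This paper addresses the difficulty generative AI has in consistently satisfying users' evolving preferences and intents in multi-turn dialogue scenarios. It proposes a Visual Co-Adaptation (VCA) framework whose key idea is human-in-the-loop feedback, using a purpose-built reward model that closely matches human preferences. The framework combines multiple reward functions (diversity, consistency, and preference feedback) and refines the diffusion model through LoRA, optimizing image generation based on user input. The authors also construct multi-round dialogue datasets of prompt and image pairs that fit user intent well, further improving image consistency and alignment with user intent. Experiments show that the method outperforms existing approaches, achieving 508 wins in human evaluation and strong results on dialogue efficiency, LPIPS, and BLIP.

Link: https://arxiv.org/abs/2503.17660
Authors: Kun Li,Jianhui Wang,Miao Zhang,Xueqian Wang
Institutions: Xiamen University; University of Electronic Science and Technology of China; Shenzhen International Graduate School, Tsinghua University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: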

Abstract:Generative AI has significantly advanced text-driven image generation, but it still faces challenges in producing outputs that consistently align with evolving user preferences and intents, particularly in multi-turn dialogue scenarios. In this research, we present a Visual Co-Adaptation (VCA) framework that incorporates human-in-the-loop feedback, utilizing a well-trained reward model specifically designed to closely align with human preferences. Using a diverse multi-turn dialogue dataset, the framework applies multiple reward functions (such as diversity, consistency, and preference feedback) to refine the diffusion model through LoRA, effectively optimizing image generation based on user input. We also constructed multi-round dialogue datasets with prompts and image pairs that fit user intent well. Experiments show the model achieves 508 wins in human evaluation, outperforming DALL-E 3 (463 wins) and others. It also achieves 3.4 rounds in dialogue efficiency (vs. 13.7 for DALL-E 3) and excels in metrics like LPIPS (0.15) and BLIP (0.59). Various experiments demonstrate the effectiveness of the proposed method over state-of-the-art baselines, with significant improvements in image consistency and alignment with user intent.
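The abstract says only that the diffusion model is refined "through LoRA". As a reminder of what that involves, the sketch below shows the standard low-rank update on a single weight matrix: the frozen weight W is used as W + (alpha / r) * B @ A, where only the small matrices A and B are trained. The matrix sizes, the `alpha` value, and the function names are illustrative assumptions, not details from the paper.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_weight(base, lora_b, lora_a, alpha):
    """Effective weight W + (alpha / r) * B @ A, where r is the adapter rank."""
    rank = len(lora_a)  # A has shape (r, in_features)
    scale = alpha / rank
    delta = matmul(lora_b, lora_a)  # (out, r) @ (r, in) -> (out, in)
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(base, delta)
    ]


# Usage: a 2x2 frozen weight with a rank-1 adapter.
base = [[1.0, 0.0], [0.0, 1.0]]
lora_a = [[1.0, 2.0]]          # shape (1, 2)
zero_b = [[0.0], [0.0]]        # B starts at zero, so the adapter is a no-op
trained_b = [[1.0], [0.0]]     # hypothetical value after some training
untouched = lora_weight(base, zero_b, lora_a, alpha=2.0)
adapted = lora_weight(base, trained_b, lora_a, alpha=2.0)
```

Initializing B to zero is the usual convention: fine-tuning starts from the unmodified model, and the reward signal only gradually moves the low-rank update away from zero.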

[CV-229] Efficient Diffusion Training through Parallelization with Truncated Karhunen-Loève Expansion

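Quick read: This paper addresses the slow training convergence of denoising diffusion models. It argues that the problem stems partly from the complexity of the Brownian motion that drives the forward process. To address this, the authors represent the Brownian motion with its Karhunen-Loève expansion, truncated to a limited number of eigenfunctions. The key innovation is a new forward process, called KL diffusion, based on an ordinary differential equation with augmented random initial values. With an appropriately designed denoising loss function, the method integrates into existing denoising-based models. Experiments use the widely adopted DDIM framework as the baseline and change only the forward process and the loss function, leaving the network architecture and sampling method unchanged. The results show substantially faster convergence: the method reaches the baseline's best FID score twice as fast and ultimately achieves lower FID scores. The approach also supports highly parallelized computation, requires no additional learnable parameters, and can be flexibly adapted to existing diffusion methods.

Link: https://arxiv.org/abs/2503.17657
Authors: Yumeng Ren,Yaofang Liu,Aitor Artola,Laurent Mertz,Raymond H. Chan,Jean-michel Morel
Institutions: City University of Hong Kong; Lingnan University of Hong Kong
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 12 pages, 9 figures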

Abstract:Diffusion denoising models have become a popular approach for image generation, but they often suffer from slow convergence during training. In this paper, we identify that this slow convergence is partly due to the complexity of the Brownian motion driving the forward-time process. To address this, we represent the Brownian motion using the Karhunen-Loève expansion, truncating it to a limited number of eigenfunctions. We propose a novel ordinary differential equation with augmented random initials, termed KL diffusion, as a new forward-time process for training and sampling. By developing an appropriate denoising loss function, we facilitate the integration of our KL-diffusion into existing denoising-based models. Using the widely adopted DDIM framework as our baseline ensures a fair comparison, as our modifications focus solely on the forward process and loss function, leaving the network architecture and sampling methods unchanged. Our method significantly outperforms baseline diffusion models, reaching the baseline's best FID score twice as fast and ultimately yielding much lower FID scores. Notably, our approach allows for highly parallelized computation, requires no additional learnable parameters, and can be flexibly integrated into existing diffusion methods. The code will be made publicly available.
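For readers unfamiliar with the Karhunen-Loève expansion, the sketch below shows the textbook truncated expansion of standard Brownian motion on [0, 1], W(t) ≈ Σ_k Z_k · √2 · sin((k − 1/2)πt) / ((k − 1/2)π) with independent standard normal Z_k. It illustrates only the expansion that the abstract builds on, not the authors' KL diffusion forward process or loss; the function names are hypothetical.

```python
import math
import random


def kl_basis(k, t):
    """k-th scaled KL basis function of Brownian motion on [0, 1] (k starts at 1).

    Equals sqrt(lambda_k) * e_k(t), with eigenvalue lambda_k = 1 / ((k - 1/2) * pi)^2
    and eigenfunction e_k(t) = sqrt(2) * sin((k - 1/2) * pi * t).
    """
    freq = (k - 0.5) * math.pi
    return math.sqrt(2.0) * math.sin(freq * t) / freq


def sample_brownian_kl(times, num_terms, rng=None):
    """Draw one approximate Brownian path from a truncated KL expansion."""
    rng = rng or random.Random()
    # The whole path is determined by num_terms independent standard normals.
    coeffs = [rng.gauss(0.0, 1.0) for _ in range(num_terms)]
    return [
        sum(z * kl_basis(k, t) for k, z in enumerate(coeffs, start=1))
        for t in times
    ]


def truncated_variance(t, num_terms):
    """Variance of the truncated path at time t; it tends to t as num_terms grows."""
    return sum(kl_basis(k, t) ** 2 for k in range(1, num_terms + 1))


# Usage: one path on an 11-point grid built from 8 random coefficients.
grid = [i / 10 for i in range(11)]
path = sample_brownian_kl(grid, num_terms=8, rng=random.Random(0))
```

Because the random coefficients are drawn once and every time point is then a deterministic function of them, all time steps can be evaluated independently, which is consistent with the parallelization the abstract highlights.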

[CV-230] Collaborative Temporal Consistency Learning for Point-supervised Natural Language Video Localization

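Quick read: This paper addresses the difficulty of aligning video content with language descriptions in point-supervised Natural Language Video Localization (NLVL). Fully supervised methods are more accurate but costly to annotate; point supervision reduces annotation cost, but the absence of complete temporal boundary annotations makes video-language alignment harder and hinders accurate moment prediction. The paper proposes a COllaborative Temporal consistEncy Learning (COTEL) framework, whose key idea is to exploit the synergy between saliency detection and moment localization to strengthen video-language alignment. Specifically, frame-level and segment-level Temporal Consistency Learning (TCL) modules model semantic alignment across frame saliencies and sentence-moment pairs; Frame-level Consistency Guidance (FCG) and Segment-level Consistency Guidance (SCG) let the two temporal consistency learning paths reinforce each other; and a Hierarchical Contrastive Alignment Loss (HCAL) comprehensively aligns the video with the text query. Experiments on two benchmarks show that the method performs favorably against state-of-the-art approaches.

Link: https://arxiv.org/abs/2503.17651
Authors: Zhuo Tao,Liang Li,Qi Chen,Yunbin Tu,Zheng-Jun Zha,Ming-Hsuan Yang,Yuankai Qi,Qingming Huang
Institutions: Institute of Computing Technology, Chinese Academy of Sciences; University of Chinese Academy of Sciences; Macquarie University; University of California at Merced; University of Adelaide; University of Science and Technology of China
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Under review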

Abstract:Natural language video localization (NLVL) is a crucial task in video understanding that aims to localize the target moment in videos specified by a given language description. Recently, a point-supervised paradigm has been presented to address this task, requiring only a single annotated frame within the target moment rather than complete temporal boundaries. Compared with the fully-supervised paradigm, it offers a balance between localization accuracy and annotation cost. However, due to the absence of complete annotation, it is challenging to align the video content with language descriptions, consequently hindering accurate moment prediction. To address this problem, we propose a new COllaborative Temporal consistEncy Learning (COTEL) framework that leverages the synergy between saliency detection and moment localization to strengthen the video-language alignment. Specifically, we first design a frame- and a segment-level Temporal Consistency Learning (TCL) module that models semantic alignment across frame saliencies and sentence-moment pairs. Then, we design a cross-consistency guidance scheme, including a Frame-level Consistency Guidance (FCG) and a Segment-level Consistency Guidance (SCG), that enables the two temporal consistency learning paths to reinforce each other mutually. Further, we introduce a Hierarchical Contrastive Alignment Loss (HCAL) to comprehensively align the video and text query. Extensive experiments on two benchmarks demonstrate that our method performs favorably against SoTA approaches. We will release all the source codes.
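The abstract does not spell out the Hierarchical Contrastive Alignment Loss, so the sketch below shows only the generic building block that contrastive video-text alignment usually starts from: a single-level InfoNCE loss in which each video embedding should score highest against its own text query. The embeddings, the temperature value, and the function names are illustrative assumptions, and this is not the paper's HCAL.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def info_nce(video_embs, text_embs, temperature=0.07):
    """Mean InfoNCE loss where video_embs[i] and text_embs[i] form the positive pair."""
    losses = []
    for i, v in enumerate(video_embs):
        logits = [cosine(v, t) / temperature for t in text_embs]
        # Numerically stable log-sum-exp over all candidate texts.
        peak = max(logits)
        log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
        losses.append(log_norm - logits[i])
    return sum(losses) / len(losses)


# Usage: two toy video embeddings paired with matching and swapped text embeddings.
videos = [[1.0, 0.0], [0.0, 1.0]]
matched_texts = [[1.0, 0.0], [0.0, 1.0]]
swapped_texts = [[0.0, 1.0], [1.0, 0.0]]
low_loss = info_nce(videos, matched_texts)
high_loss = info_nce(videos, swapped_texts)
```

A hierarchical variant would presumably apply a loss of this kind at more than one granularity (for example frames and segments), but that structure is a guess from the name rather than something stated in the abstract.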

[CV-231] Visual Variational Autoencoder Prompt Tuning

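Quick read: This paper addresses the reliance of existing Visual Prompt Tuning (VPT) methods on static, domain-specific prompts, which fail to capture the rich visual diversity within individual instances. It proposes V²APT (Visual Variational Autoencoder Prompt Tuning), a framework whose key idea is to generate dynamic, input-dependent prompts with a variational autoencoder architecture. By learning a latent representation of image-specific features and decoding it into customized prompts, V²APT adapts to the visual characteristics of each input. Experiments on the FGVC, HTA, and VTAB-1k benchmarks show that the method consistently outperforms state-of-the-art parameter-efficient fine-tuning (PEFT) methods, with a +3.2% improvement over VPT-Deep on HTA and an average gain of +2.0% across all three datasets.

Link: https://arxiv.org/abs/2503.17650
Authors: Xi Xiao,Yunbei Zhang,Yanshuh Li,Xingjian Li,Tianyang Wang,Jihun Hamm,Xiao Wang,Min Xu
Institutions: University of Alabama at Birmingham; Tulane University; Brown University; Oak Ridge National Laboratory; Carnegie Mellon University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: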

Abstract:Parameter-efficient fine-tuning (PEFT) has emerged as a crucial approach for adapting large vision transformers to downstream tasks without the prohibitive computational costs of full fine-tuning. While existing visual prompt tuning (VPT) methods have made significant strides, they predominantly rely on static, domain-specific prompts that fail to capture the rich visual diversity within individual instances. This paper introduces V²APT (Visual Variational Autoencoder Prompt Tuning), a novel framework that generates dynamic, input-dependent prompts using a variational autoencoder architecture. By learning a latent representation of image-specific features and decoding them into customized prompts, V²APT adapts to the unique visual characteristics of each input. Extensive experiments on FGVC, HTA, and VTAB-1k benchmarks demonstrate that our approach consistently outperforms state-of-the-art PEFT methods. Notably, V²APT achieves +3.2% improvement over VPT-Deep on HTA, with an average performance gain of +2.0% across all three datasets.
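The two generic ingredients of any variational autoencoder, which a VAE-based prompt generator would presumably rely on, are the reparameterization trick for sampling a latent vector and the closed-form KL term that regularizes it. The sketch below shows both, plus a toy linear decoder standing in for the step that turns a latent vector into prompt values. The decoder, its weights, and all function names are hypothetical; the abstract does not describe the actual architecture.

```python
import math
import random


def reparameterize(mu, logvar, rng):
    """Sample z = mu + sigma * eps with eps ~ N(0, I) (reparameterization trick)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, logvar)]


def kl_to_standard_normal(mu, logvar):
    """KL( N(mu, diag(exp(logvar))) || N(0, I) ), summed over latent dimensions."""
    return -0.5 * sum(1.0 + lv - m * m - math.exp(lv) for m, lv in zip(mu, logvar))


def decode_prompt(z, weights):
    """Hypothetical linear decoder mapping a latent vector to prompt-token values."""
    return [sum(w * zi for w, zi in zip(row, z)) for row in weights]


# Usage: encode-side statistics for one image (made-up numbers), then sample and decode.
rng = random.Random(0)
mu, logvar = [0.5, -1.0], [0.0, -2.0]
z = reparameterize(mu, logvar, rng)
prompt = decode_prompt(z, weights=[[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])
regularizer = kl_to_standard_normal(mu, logvar)
```

Because `mu` and `logvar` depend on the input image, each image yields a different latent sample and therefore a different prompt, which is the "input-dependent" property the abstract emphasizes.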

[CV-232] Leveraging Audio Representations for Vibration-Based Crowd Monitoring in Stadiums

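Quick read: This paper addresses crowd monitoring in sports stadiums. Existing approaches rely mainly on cameras and microphones, which can cause significant disturbances and often raise privacy concerns. The paper instead predicts crowd behavior by sensing floor vibration, a less disruptive and less intrusive approach. Because vibration-based crowd monitoring is new, a major challenge is the lack of labeled data, since stadiums are large public spaces with complex physical activities. The key solution is ViLA (Vibration Leverage Audio), a vibration-based method that reduces the dependence on labeled data by pre-training on unlabeled cross-modality data (audio). ViLA is first pre-trained on audio data in an unsupervised manner and then fine-tuned with a small amount of in-domain vibration data; by learning wave behaviors from publicly available audio datasets and adapting the representation to vibration, it greatly reduces the reliance on domain-specific vibration data. Real-world experiments show that pre-training the vibration model on public audio data (YouTube8M) reduces error by up to 5.8x compared with a model without audio pre-training.

Link: https://arxiv.org/abs/2503.17646
Authors: Yen Cheng Chang,Jesse Codling,Yiwen Dong,Jiale Zhang,Jiasi Chen,Hae Young Noh,Pei Zhang
Institutions: University of Michigan; Stanford University
Categories: Sound (cs.SD); Computer Vision and Pattern Recognition (cs.CV)
Comments: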

Abstract:Crowd monitoring in sports stadiums is important to enhance public safety and improve the audience experience. Existing approaches mainly rely on cameras and microphones, which can cause significant disturbances and often raise privacy concerns. In this paper, we sense floor vibration, which provides a less disruptive and more non-intrusive way of crowd sensing, to predict crowd behavior. However, since the vibration-based crowd monitoring approach is newly developed, one main challenge is the lack of training data due to sports stadiums being large public spaces with complex physical activities. In this paper, we present ViLA (Vibration Leverage Audio), a vibration-based method that reduces the dependency on labeled data by pre-training with unlabeled cross-modality data. ViLA is first pre-trained on audio data in an unsupervised manner and then fine-tuned with a minimal amount of in-domain vibration data. By leveraging publicly available audio datasets, ViLA learns the wave behaviors from audio and then adapts the representation to vibration, reducing the reliance on domain-specific vibration data. Our real-world experiments demonstrate that pre-training the vibration model using publicly available audio data (YouTube8M) achieved up to a 5.8x error reduction compared to the model without audio pre-training.

[CV-233] InstructVEdit: A Holistic Approach for Instructional Video Editing

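Quick read: This paper addresses the challenges in instruction-based video editing caused by the difficulty of collecting large-scale, high-quality edited video pair data, which both limits training data and hinders systematic exploration of model architectures and training strategies. It proposes InstructVEdit, a full-cycle instructional video editing approach that (1) establishes a reliable dataset curation workflow to initialize training, (2) introduces two model architecture improvements that raise edit quality while preserving temporal consistency, and (3) proposes an iterative refinement strategy that uses real-world data to improve generalization and reduce train-test discrepancies. With these contributions, InstructVEdit achieves state-of-the-art performance in instruction-based video editing and adapts robustly to diverse real-world scenarios.

Link: https://arxiv.org/abs/2503.17641
Authors: Chi Zhang,Chengjian Feng,Feng Yan,Qiming Zhang,Mingjin Zhang,Yujie Zhong,Jing Zhang,Lin Ma
Institutions: Xidian University; Meituan Inc.; University of Sydney; Wuhan University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: this https URL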

Abstract:Video editing according to instructions is a highly challenging task due to the difficulty in collecting large-scale, high-quality edited video pair data. This scarcity not only limits the availability of training data but also hinders the systematic exploration of model architectures and training strategies. While prior work has improved specific aspects of video editing (e.g., synthesizing a video dataset using image editing techniques or decomposed video editing training), a holistic framework addressing the above challenges remains underexplored. In this study, we introduce InstructVEdit, a full-cycle instructional video editing approach that: (1) establishes a reliable dataset curation workflow to initialize training, (2) incorporates two model architectural improvements to enhance edit quality while preserving temporal consistency, and (3) proposes an iterative refinement strategy leveraging real-world data to enhance generalization and minimize train-test discrepancies. Extensive experiments show that InstructVEdit achieves state-of-the-art performance in instruction-based video editing, demonstrating robust adaptability to diverse real-world scenarios. Project page: this https URL.

[CV-234] Enhancing Martian Terrain Recognition with Deep Constrained Clustering

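[Quick read]: This paper addresses the accuracy challenges that natural variations in image intensity, scale, and rotation pose for Martian terrain classification. To overcome these limitations, it proposes Deep Constrained Clustering with Metric Learning (DCCML), a new algorithm whose key idea is to guide the clustering process with multiple constraint types: soft must-link constraints derived from spatial and depth similarities, and hard constraints from stereo camera pairs and temporally adjacent images. This yields more precise classification of Martian geological features.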

Link: https://arxiv.org/abs/2503.17633
Authors: Tejas Panambur,Mario Parente
Affiliations: Department of Electrical and Computer Engineering, University of Massachusetts, Amherst
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Martian terrain recognition is pivotal for advancing our understanding of topography, geomorphology, paleoclimate, and habitability. While deep clustering methods have shown promise in learning semantically homogeneous feature embeddings from Martian rover imagery, the natural variations in intensity, scale, and rotation pose significant challenges for accurate terrain classification. To address these limitations, we propose Deep Constrained Clustering with Metric Learning (DCCML), a novel algorithm that leverages multiple constraint types to guide the clustering process. DCCML incorporates soft must-link constraints derived from spatial and depth similarities between neighboring patches, alongside hard constraints from stereo camera pairs and temporally adjacent images. Experimental evaluation on the Curiosity rover dataset (with 150 clusters) demonstrates that DCCML increases homogeneous clusters by 16.7 percent while reducing the Davies-Bouldin Index from 3.86 to 1.82 and boosting retrieval accuracy from 86.71 percent to 89.86 percent. This improvement enables more precise classification of Martian geological features, advancing our capacity to analyze and understand the planet’s landscape.
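
The abstract reports cluster quality with the Davies-Bouldin Index (lower is better). As a minimal sketch of how that metric is commonly computed, assuming Euclidean distances and NumPy arrays (the function name and the toy data are illustrative, not taken from the paper):

```python
import numpy as np

def davies_bouldin_index(features, labels):
    """Mean, over clusters, of the worst (scatter_i + scatter_j) / distance_ij ratio."""
    cluster_ids = np.unique(labels)
    centroids = np.stack([features[labels == c].mean(axis=0) for c in cluster_ids])
    # Scatter: mean Euclidean distance of each cluster's members to its centroid.
    scatter = np.array([
        np.linalg.norm(features[labels == c] - centroids[i], axis=1).mean()
        for i, c in enumerate(cluster_ids)
    ])
    # Pairwise distances between centroids; the diagonal is excluded from the max.
    dist = np.linalg.norm(centroids[:, None, :] - centroids[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)
    ratios = (scatter[:, None] + scatter[None, :]) / dist
    return float(ratios.max(axis=1).mean())

# Toy example: two compact clusters that are far apart give a small index.
toy_features = np.array([[0.0, 0.0], [2.0, 0.0], [10.0, 0.0], [12.0, 0.0]])
toy_labels = np.array([0, 0, 1, 1])
toy_dbi = davies_bouldin_index(toy_features, toy_labels)
```

Compact, well-separated clusters drive the index toward zero, which is the direction of the reported drop from 3.86 to 1.82.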

[CV-235] AI-Based Screening for Depression and Social Anxiety Through Eye Tracking: An Exploratory Study

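[Quick read]: This paper addresses the difficulty of accurately quantifying mental health states, well-being in particular, given that they evolve over time and fluctuate within individuals. Its focus is identifying affective disorders such as depression and social anxiety through biases in visual attention. It proposes a new approach that analyzes eye-movement scan paths with Convolutional Neural Networks (CNNs) to assist screening for affective disorders. The key is to process images generated from eye-gaze patterns with Residual Networks (ResNet); experiments report an average accuracy of 48% for a three-class system and 62% for a two-class system. The authors suggest that the method could support rapid, ecological, and effective mental health screening that assesses well-being through eye tracking.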

Link: https://arxiv.org/abs/2503.17625
Authors: Karol Chlasta,Katarzyna Wisiecka,Krzysztof Krejtz,Izabela Krejtz
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
Comments: 17 pages, 11 figures

Abstract:Well-being is a dynamic construct that evolves over time and fluctuates within individuals, presenting challenges for accurate quantification. Reduced well-being is often linked to depression or anxiety disorders, which are characterised by biases in visual attention towards specific stimuli, such as human faces. This paper introduces a novel approach to AI-assisted screening of affective disorders by analysing visual attention scan paths using convolutional neural networks (CNNs). Data were collected from two studies examining (1) attentional tendencies in individuals diagnosed with major depression and (2) social anxiety. These data were processed using residual CNNs through images generated from eye-gaze patterns. Experimental results, obtained with ResNet architectures, demonstrated an average accuracy of 48% for a three-class system and 62% for a two-class system. Based on these exploratory findings, we propose that this method could be employed in rapid, ecological, and effective mental health screening systems to assess well-being through eye-tracking.

[CV-236] Guidance Free Image Editing via Explicit Conditioning

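[Quick read]: This paper tackles the excessive computational cost of the sampling mechanisms that current conditional diffusion models rely on to generate high-quality images. The key to the solution is a new technique called Explicit Conditioning (EC), which explicitly conditions the noise distribution on the input modalities so that the noise guides the conditional diffusion model during the diffusion process. This eases the computational burden of established guidance techniques such as Classifier Free Guidance (CFG) and significantly improves inference time while preserving the quality and diversity of the generated images.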

Link: https://arxiv.org/abs/2503.17593
Authors: Mehdi Noroozi,Alberto Gil Ramos,Luca Morreale,Ruchika Chavhan,Malcolm Chadwick,Abhinav Mehrotra,Sourav Bhattacharya
Affiliations: Samsung AI Cambridge
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Current sampling mechanisms for conditional diffusion models rely mainly on Classifier Free Guidance (CFG) to generate high-quality images. However, CFG requires several denoising passes in each time step, e.g., up to three passes in image editing tasks, resulting in excessive computational costs. This paper introduces a novel conditioning technique to ease the computational burden of the well-established guidance techniques, thereby significantly improving the inference time of diffusion models. We present Explicit Conditioning (EC) of the noise distribution on the input modalities to achieve this. Intuitively, we model the noise to guide the conditional diffusion model during the diffusion process. We present evaluations on image editing tasks and demonstrate that EC outperforms CFG in generating diverse high-quality images with significantly reduced computations.
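
To make the cost that the abstract attributes to CFG concrete, here is a minimal sketch of the guidance rule commonly used for instruction-based image editing, which combines an unconditional, an image-conditioned, and a fully conditioned noise estimate. It illustrates the baseline only, not Explicit Conditioning, whose details the abstract does not give. `CountingDenoiser`, the guidance scales, and the toy arithmetic are assumptions made for illustration.

```python
import numpy as np

class CountingDenoiser:
    """Hypothetical stand-in for a diffusion denoiser; it counts forward passes."""

    def __init__(self):
        self.calls = 0

    def __call__(self, x_t, image_cond, text_cond):
        self.calls += 1
        out = 0.1 * x_t
        if image_cond is not None:
            out = out + 0.5 * image_cond
        if text_cond is not None:
            out = out + 0.2 * text_cond
        return out

def cfg_two_condition(denoise, x_t, image_cond, text_cond, s_img=1.5, s_txt=7.5):
    """Classifier-free guidance with two conditions: three denoiser passes per step."""
    e_uncond = denoise(x_t, None, None)
    e_img = denoise(x_t, image_cond, None)
    e_full = denoise(x_t, image_cond, text_cond)
    # Move from the unconditional estimate toward each condition in turn.
    return e_uncond + s_img * (e_img - e_uncond) + s_txt * (e_full - e_img)

denoiser = CountingDenoiser()
x = np.ones(4)
guided = cfg_two_condition(denoiser, x, 2.0 * np.ones(4), 3.0 * np.ones(4))
passes_per_step = denoiser.calls
```

Every sampling step pays for three forward passes here; a technique that needs a single pass per step would cut the denoiser cost by roughly that factor.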

[CV-237] Is there anything left? Measuring semantic residuals of objects removed from 3D Gaussian Splatting

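[Quick read]: This paper addresses the inverse problem in privacy-preserving mapping of 3D scenes: after a target element is removed from a scene, does the remainder still contain object residuals that can be reasoned over, leaving the scene non-private? The key contribution is a quantitative evaluation that measures whether a removal operation leaves such residuals, together with a method that refines the removal based on spatial and semantic consistency. Experiments show that the proposed metrics are meaningful and consistent with the results of a user study.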

Link: https://arxiv.org/abs/2503.17574
Authors: Simona Kocour,Assia Benbihi,Aikaterini Adam,Torsten Sattler
Affiliations: Faculty of Electrical Engineering, Czech Technical University in Prague; Czech Institute of Informatics, Robotics and Cybernetics, Czech Technical University in Prague; Archimedes/Athena RC
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Searching in and editing 3D scenes has become extremely intuitive with trainable scene representations that allow linking human concepts to elements in the scene. These operations are often evaluated on the basis of how accurately the searched element is segmented or extracted from the scene. In this paper, we address the inverse problem, that is, how much of the searched element remains in the scene after it is removed. This question is particularly important in the context of privacy-preserving mapping when a user reconstructs a 3D scene and wants to remove private elements before sharing the map. To the best of our knowledge, this is the first work to address this question. To answer this, we propose a quantitative evaluation that measures whether a removal operation leaves object residuals that can be reasoned over. The scene is not private when such residuals are present. Experiments on state-of-the-art scene representations show that the proposed metrics are meaningful and consistent with the user study that we also present. We also propose a method to refine the removal based on spatial and semantic consistency.

[CV-238] Audio-Enhanced Vision-Language Modeling with Latent Space Broadening for High Quality Data Expansion

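[Quick read]: This paper addresses performance bottlenecks of Transformer-based multimodal models used for content understanding and relevance ranking in industrial-scale recommendation, search, and advertising systems. Traditional statistics-based active learning (AL) methods struggle to distinguish semantically similar items and to detect overconfident misclassifications in deep neural networks, and existing pre-trained multimodal architectures make little use of audio. The key solutions are kNN-based Latent Space Broadening (LSB), which improves AL efficiency, and Vision-Language Modeling with Audio Enhancement (VLMAE), a mid-fusion approach that integrates audio into vision-language models and thereby reuses existing pre-trained vision-language and audio models. According to the authors, deploying the system in production led to significant business gains.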

Link: https://arxiv.org/abs/2503.17551
Authors: Yu Sun,Yin Li,Ruixiao Sun,Chunhui Liu,Fangming Zhou,Ze Jin,Linjie Wang,Xiang Shen,Zhuolin Hao,Hongyu Xiong
Affiliations: Bytedance Inc., USA; Bytedance Inc., China
Categories: Multimedia (cs.MM); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments:

Abstract:Transformer-based multimodal models are widely used in industrial-scale recommendation, search, and advertising systems for content understanding and relevance ranking. Enhancing labeled training data quality and cross-modal fusion significantly improves model performance, influencing key metrics such as quality view rates and ad revenue. High-quality annotations are crucial for advancing content modeling, yet traditional statistical-based active learning (AL) methods face limitations: they struggle to detect overconfident misclassifications and are less effective in distinguishing semantically similar items in deep neural networks. Additionally, audio information plays an increasing role, especially in short-video platforms, yet most pre-trained multimodal architectures primarily focus on text and images. While training from scratch across all three modalities is possible, it sacrifices the benefits of leveraging existing pre-trained visual-language (VL) and audio models. To address these challenges, we propose kNN-based Latent Space Broadening (LSB) to enhance AL efficiency and Vision-Language Modeling with Audio Enhancement (VLMAE), a mid-fusion approach integrating audio into VL models. This system has been deployed in production, leading to significant business gains.
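
The abstract does not spell out how kNN-based Latent Space Broadening selects data, so the following is only one plausible reading: take the samples an active-learning criterion already picked and add their nearest neighbors in embedding space to the annotation pool. The function name, the cosine metric, and the toy embeddings are all assumptions.

```python
import numpy as np

def broaden_selection(embeddings, seed_idx, k=3):
    """Return the seed indices plus each seed's k nearest neighbors (cosine similarity)."""
    seed_idx = np.asarray(seed_idx)
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = normed[seed_idx] @ normed.T                   # shape (num_seeds, num_samples)
    sims[np.arange(len(seed_idx)), seed_idx] = -np.inf   # a seed is not its own neighbor
    neighbors = np.argsort(-sims, axis=1)[:, :k]
    return np.union1d(seed_idx, neighbors.ravel())

# Toy latent space: items 0 and 1 are near-duplicates, as are items 2 and 3.
toy_embeddings = np.array([
    [1.0, 0.0],
    [0.99, 0.1],
    [0.0, 1.0],
    [0.1, 0.99],
    [-1.0, 0.0],
])
expanded = broaden_selection(toy_embeddings, [0], k=1)
```

The appeal of this kind of expansion is that semantically similar items get labeled together even when a confidence-based criterion would have skipped them.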

[CV-239] PRIMAL: Physically Reactive and Interactive Motor Model for Avatar Learning

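[Quick read]: This paper addresses the problem of building a motor system for interactive avatars: a generative motion model that drives the body through 3D space in a perpetual, realistic, controllable, and responsive manner. Most existing methods cannot support "embodied intelligence" because of their offline setting, slow speed, limited motion lengths, or unnatural movements. To overcome these limitations, the paper proposes PRIMAL, an autoregressive diffusion model whose key is a two-stage training paradigm: in the pretraining stage, the model learns motion dynamics from a large number of sub-second motion segments, yielding "motor primitives"; in the adaptation stage, a ControlNet-like adaptor fine-tunes motor control for semantic action generation and spatial target reaching. Experiments show that the model generates unbounded, realistic, and controllable motion, responds to induced impulses in real time, and adapts efficiently to few-shot personalized actions and spatial control tasks. The method outperforms state-of-the-art baselines and is used to build a real-time character animation system in Unreal Engine.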

Link: https://arxiv.org/abs/2503.17544
Authors: Yan Zhang,Yao Feng,Alpár Cseke,Nitin Saini,Nathan Bajandas,Nicolas Heron,Michael J. Black
Affiliations: Meshcapade; Max Planck Institute for Intelligent Systems, Tübingen; Stanford University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 18 pages

Abstract:To build a motor system of the interactive avatar, it is essential to develop a generative motion model that drives the body to move through 3D space in a perpetual, realistic, controllable, and responsive manner. Although motion generation has been extensively studied, most methods do not support "embodied intelligence" due to their offline setting, slow speed, limited motion lengths, or unnatural movements. To overcome these limitations, we propose PRIMAL, an autoregressive diffusion model that is learned with a two-stage paradigm, inspired by recent advances in foundation models. In the pretraining stage, the model learns motion dynamics from a large number of sub-second motion segments, providing "motor primitives" from which more complex motions are built. In the adaptation phase, we employ a ControlNet-like adaptor to fine-tune the motor control for semantic action generation and spatial target reaching. Experiments show that physics effects emerge from our training. Given a single-frame initial state, our model not only generates unbounded, realistic, and controllable motion, but also enables the avatar to be responsive to induced impulses in real time. In addition, we can effectively and efficiently adapt our base model to few-shot personalized actions and the task of spatial control. Evaluations show that our proposed method outperforms state-of-the-art baselines. We leverage the model to create a real-time character animation system in Unreal Engine that is highly responsive and natural. Code, models, and more results are available at: this https URL

[CV-240] Generating Fast and Slow: Scalable Parallel Video Generation with Video Interface Networks

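[Quick read]: This paper addresses the high computational cost of directly training and sampling long videos, and the limitations of existing chunk-based generation methods, which need multiple sampling chain iterations and specialized consistency modules. The key innovation is a new paradigm called Video Interface Networks (VINs), which augments Diffusion Transformers (DiTs) with an abstraction module to enable parallel inference of video chunks. At each denoising step, VINs encode global semantics from the noisy input of local chunks, and the encoded representations in turn guide the DiTs in denoising the chunks in parallel. VIN keeps a fixed number of encoding tokens that encode the input through a single cross-attention step, which lets it scale to long videos and learn essential semantics. The design surpasses existing chunk-based methods in background consistency and subject coherence, and attains state-of-the-art motion smoothness while using 25-40% fewer FLOPs than full generation.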

Link: https://arxiv.org/abs/2503.17539
Authors: Bhishma Dedhia,David Bourgin,Krishna Kumar Singh,Yuheng Li,Yan Kang,Zhan Xu,Niraj K. Jha,Yuchen Liu
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Diffusion Transformers (DiTs) can generate short photorealistic videos, yet directly training and sampling longer videos with full attention across the video remains computationally challenging. Alternative methods break long videos down into sequential generation of short video segments, requiring multiple sampling chain iterations and specialized consistency modules. To overcome these challenges, we introduce a new paradigm called Video Interface Networks (VINs), which augment DiTs with an abstraction module to enable parallel inference of video chunks. At each diffusion step, VINs encode global semantics from the noisy input of local chunks and the encoded representations, in turn, guide DiTs in denoising chunks in parallel. The coupling of VIN and DiT is learned end-to-end on the denoising objective. Further, the VIN architecture maintains fixed-size encoding tokens that encode the input via a single cross-attention step. Disentangling the encoding tokens from the input thus enables VIN to scale to long videos and learn essential semantics. Experiments on VBench demonstrate that VINs surpass existing chunk-based methods in preserving background consistency and subject coherence. We then show via an optical flow analysis that our approach attains state-of-the-art motion smoothness while using 25-40% fewer FLOPs than full generation. Finally, human raters favorably assessed the overall video quality and temporal consistency of our method in a user study.
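
The abstract says the VIN keeps fixed-size encoding tokens that read the input through a single cross-attention step. A minimal single-head sketch of that idea is below; the projection matrices, dimensions, and function names are illustrative assumptions rather than the paper's architecture.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def encode_with_latents(latents, inputs, w_q, w_k, w_v):
    """Single cross-attention step: M latent tokens attend to N input tokens."""
    q = latents @ w_q                          # (M, d) queries from the latent tokens
    k = inputs @ w_k                           # (N, d) keys from the input tokens
    v = inputs @ w_v                           # (N, d) values from the input tokens
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]), axis=-1)   # (M, N), rows sum to 1
    return attn @ v                            # (M, d), independent of N

rng = np.random.default_rng(0)
dim, num_latents = 16, 8
w_q, w_k, w_v = (rng.normal(size=(dim, dim)) for _ in range(3))
latent_tokens = rng.normal(size=(num_latents, dim))
short_clip = encode_with_latents(latent_tokens, rng.normal(size=(32, dim)), w_q, w_k, w_v)
long_clip = encode_with_latents(latent_tokens, rng.normal(size=(256, dim)), w_q, w_k, w_v)
```

Both calls return an 8 x 16 array, so the size of the encoded summary stays constant as the number of input tokens grows, and the attention cost grows linearly in that number.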

[CV-241] DermDiff: Generative Diffusion Model for Mitigating Racial Biases in Dermatology Diagnosis MICCAI2024

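[Quick read]: This paper addresses the poor performance on certain skin tones of existing AI models for skin disease diagnosis, which are developed and tested on limited and biased datasets. The key solution is DermDiff, a new generative model that produces diverse and representative dermoscopic image data for skin disease diagnosis. By leveraging text prompting and multimodal image-text learning, DermDiff improves the representation of underrepresented groups (patients, diseases, etc.) in highly imbalanced datasets. Experiments show high fidelity and diversity, and downstream evaluation suggests that DermDiff can help mitigate racial biases in dermatology diagnosis.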

Link: https://arxiv.org/abs/2503.17536
Authors: Nusrat Munia,Abdullah-Al-Zubaer Imran
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Paper presented at ADSMI@MICCAI 2024

Abstract:Skin diseases, such as skin cancer, are a significant public health issue, and early diagnosis is crucial for effective treatment. Artificial intelligence (AI) algorithms have the potential to assist in triaging benign vs malignant skin lesions and improve diagnostic accuracy. However, existing AI models for skin disease diagnosis are often developed and tested on limited and biased datasets, leading to poor performance on certain skin tones. To address this problem, we propose a novel generative model, named DermDiff, that can generate diverse and representative dermoscopic image data for skin disease diagnosis. Leveraging text prompting and multimodal image-text learning, DermDiff improves the representation of underrepresented groups (patients, diseases, etc.) in highly imbalanced datasets. Our extensive experimentation showcases the effectiveness of DermDiff in terms of high fidelity and diversity. Furthermore, downstream evaluation suggests the potential of DermDiff in mitigating racial biases for dermatology diagnosis. Our code is available at this https URL

[CV-242] FMDConv: Fast Multi-Attention Dynamic Convolution via Speed-Accuracy Trade-off

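[Quick read]: This paper addresses the fact that dynamic convolution improves model accuracy but adds significant computational overhead, which limits deployment in resource-constrained environments such as federated edge computing. It proposes Fast Multi-Attention Dynamic Convolution (FMDConv), whose key is to combine input attention, temperature-degraded kernel attention, and output attention to optimize the speed-accuracy trade-off, selectively enhancing feature extraction at lower complexity. The paper also introduces two new quantitative metrics, the Inverse Efficiency Score and the Rate-Correct Score, to evaluate this trade-off systematically. Experiments show that, compared with prior multi-attention dynamic convolution methods, FMDConv reduces computational cost by up to 49.8% on ResNet-18 and up to 42.2% on ResNet-50 while maintaining competitive accuracy, which makes it well suited to real-world, resource-constrained applications.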

Link: https://arxiv.org/abs/2503.17530
Authors: Tianyu Zhang,Fan Wan,Haoran Duan,Kevin W. Tong,Jingjing Deng,Yang Long
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Spatial convolution is fundamental in constructing deep Convolutional Neural Networks (CNNs) for visual recognition. While dynamic convolution enhances model accuracy by adaptively combining static kernels, it incurs significant computational overhead, limiting its deployment in resource-constrained environments such as federated edge computing. To address this, we propose Fast Multi-Attention Dynamic Convolution (FMDConv), which integrates input attention, temperature-degraded kernel attention, and output attention to optimize the speed-accuracy trade-off. FMDConv achieves a better balance between accuracy and efficiency by selectively enhancing feature extraction with lower complexity. Furthermore, we introduce two novel quantitative metrics, the Inverse Efficiency Score and Rate-Correct Score, to systematically evaluate this trade-off. Extensive experiments on CIFAR-10, CIFAR-100, and ImageNet demonstrate that FMDConv reduces the computational cost by up to 49.8% on ResNet-18 and 42.2% on ResNet-50 compared to prior multi-attention dynamic convolution methods while maintaining competitive accuracy. These advantages make FMDConv highly suitable for real-world, resource-constrained applications.
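
Dynamic convolution, as the abstract describes it, adaptively combines static kernels. The sketch below shows the generic aggregation step with a temperature-scaled softmax over per-kernel attention logits. How FMDConv computes its input, kernel, and output attention is not given in the abstract, so the logits here are plain inputs and all names are illustrative.

```python
import numpy as np

def kernel_attention(logits, temperature=1.0):
    """Softmax over K kernel logits; a larger temperature flattens the weights."""
    z = logits / temperature
    z = z - z.max()                 # subtract the max for numerical stability
    w = np.exp(z)
    return w / w.sum()

def aggregate_kernels(kernels, logits, temperature=1.0):
    """Blend K static kernels of shape (K, C_out, C_in, kh, kw) into one kernel."""
    weights = kernel_attention(logits, temperature)
    # Weighted sum over the first axis gives shape (C_out, C_in, kh, kw).
    return np.tensordot(weights, kernels, axes=1)

static_kernels = np.arange(3 * 2 * 2 * 3 * 3, dtype=float).reshape(3, 2, 2, 3, 3)
sharp = kernel_attention(np.array([2.0, 0.0, -1.0]), temperature=1.0)
flat = kernel_attention(np.array([2.0, 0.0, -1.0]), temperature=1e6)
blended = aggregate_kernels(static_kernels, np.zeros(3))
```

After aggregation only one convolution is executed per layer, so the extra cost of dynamic convolution comes mainly from computing the attention weights.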

[CV-243] Should we pre-train a decoder in contrastive learning for dense prediction tasks?

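[Quick read]: This paper addresses the fact that conventional self-supervised learning (SSL) contrastive methods pre-train only the encoder and neglect joint optimization with the decoder. Decoders are typically introduced and trained separately for downstream tasks, which forgoes the potential benefits of pre-training encoder and decoder together. The paper proposes DeCon, a framework-agnostic adaptation that converts an encoder-only contrastive SSL approach into an efficient encoder-decoder framework that can be pre-trained contrastively. The key is a weighted encoder-decoder contrastive loss with non-competing objectives, which facilitates joint pre-training, strengthens the representation power of the encoder, and improves performance on dense prediction tasks. The benefit holds across heterogeneous decoder architectures and persists in out-of-domain, limited-data scenarios.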

Link: https://arxiv.org/abs/2503.17526
Authors: Sébastien Quetin,Tapotosh Ghosh,Farhad Maleki
Affiliations: McGill University; University of Calgary
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Contrastive learning in self-supervised settings primarily focuses on pre-training encoders, while decoders are typically introduced and trained separately for downstream dense prediction tasks. This conventional approach, however, overlooks the potential benefits of jointly pre-training both the encoder and decoder. In this paper, we propose DeCon: a framework-agnostic adaptation to convert an encoder-only self-supervised learning (SSL) contrastive approach to an efficient encoder-decoder framework that can be pre-trained in a contrastive manner. We first update the existing architecture to accommodate a decoder and its respective contrastive loss. We then introduce a weighted encoder-decoder contrastive loss with non-competing objectives that facilitates the joint encoder-decoder architecture pre-training. We adapt two established contrastive SSL frameworks tailored for dense prediction tasks, achieve new state-of-the-art results in COCO object detection and instance segmentation, and match state-of-the-art performance on Pascal VOC semantic segmentation. We show that our approach allows for pre-training a decoder and enhances the representation power of the encoder and its performance in dense prediction tasks. This benefit holds across heterogeneous decoder architectures between pre-training and fine-tuning and persists in out-of-domain, limited-data scenarios.
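
The abstract mentions a weighted encoder-decoder contrastive loss without giving its form. A minimal sketch, assuming a convex combination of two InfoNCE terms (InfoNCE stands in for whichever contrastive loss the adapted framework uses; `alpha` and the function names are hypothetical):

```python
import numpy as np

def info_nce(z1, z2, temperature=0.1):
    """InfoNCE between two views: row i of z1 should match row i of z2."""
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    logits = z1 @ z2.T / temperature
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))

def weighted_enc_dec_loss(enc_v1, enc_v2, dec_v1, dec_v2, alpha=0.5):
    """Assumed form: alpha * encoder loss + (1 - alpha) * decoder loss."""
    return alpha * info_nce(enc_v1, enc_v2) + (1.0 - alpha) * info_nce(dec_v1, dec_v2)

views = np.eye(4)
aligned = info_nce(views, views)            # matching pairs: low loss
mismatched = info_nce(views, views[::-1])   # pairs shuffled: high loss
joint = weighted_enc_dec_loss(views, views, views, views[::-1], alpha=0.5)
```

With `alpha=1.0` the expression reduces to encoder-only pre-training, which is the conventional setup the paper argues against.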

[CV-244] Event-Based Crossing Dataset (EBCD)

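[Quick read]: This paper addresses the limited adaptability of conventional event-based vision datasets to real-world environmental fluctuations under a fixed thresholding constraint: low thresholds retain finer details but introduce noise, whereas high thresholds suppress extraneous activations at the cost of crucial object information. To address this, it introduces the Event-Based Crossing Dataset (EBCD), a comprehensive dataset whose key feature is a multi-thresholding framework that refines event representations. By capturing event-based images at ten threshold levels (4, 8, 12, 16, 20, 30, 40, 50, 60, and 75), EBCD supports an extensive assessment of object detection performance under varying sparsity and noise suppression, and benchmarks of several state-of-the-art detection architectures show how threshold selection affects detection performance. This systematic approach to threshold variation is expected to foster more adaptive evaluation of event-based object detection and to align neuromorphic vision with real-world scene dynamics.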

Link: https://arxiv.org/abs/2503.17499
Authors: Joey Mulé,Dhandeep Challagundla,Rachit Saini,Riadul Islam
Affiliations: Department of Computer Science and Electrical Engineering, University of Maryland, Baltimore County; FRIS Inc. (d/b/a Oculi)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Event-based vision revolutionizes traditional image sensing by capturing asynchronous intensity variations rather than static frames, enabling ultrafast temporal resolution, sparse data encoding, and enhanced motion perception. While this paradigm offers significant advantages, conventional event-based datasets impose a fixed thresholding constraint to determine pixel activations, severely limiting adaptability to real-world environmental fluctuations. Lower thresholds retain finer details but introduce pervasive noise, whereas higher thresholds suppress extraneous activations at the expense of crucial object information. To mitigate these constraints, we introduce the Event-Based Crossing Dataset (EBCD), a comprehensive dataset tailored for pedestrian and vehicle detection in dynamic outdoor environments, incorporating a multi-thresholding framework to refine event representations. By capturing event-based images at ten distinct threshold levels (4, 8, 12, 16, 20, 30, 40, 50, 60, and 75), this dataset facilitates an extensive assessment of object detection performance under varying conditions of sparsity and noise suppression. We benchmark state-of-the-art detection architectures, including YOLOv4, YOLOv7, EfficientDet-b0, MobileNet-v1, and Histogram of Oriented Gradients (HOG), to examine the nuanced impact of threshold selection on detection performance. By offering a systematic approach to threshold variation, we foresee that EBCD fosters a more adaptive evaluation of event-based object detection, aligning diverse neuromorphic vision with real-world scene dynamics. We present the dataset as publicly available to propel further advancements in low-latency, high-fidelity neuromorphic imaging: this https URL
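
The trade-off between detail and noise across the ten threshold levels can be illustrated with a small sketch. It assumes that a pixel fires when the absolute intensity change between two 8-bit frames reaches the threshold; real event sensors typically operate on log-intensity changes, and the dataset's actual capture pipeline is not described in the abstract. The frames here are random stand-ins.

```python
import numpy as np

# The ten threshold levels listed in the abstract.
THRESHOLDS = (4, 8, 12, 16, 20, 30, 40, 50, 60, 75)

def event_masks(prev_frame, next_frame, thresholds=THRESHOLDS):
    """Return one boolean activation mask per threshold level."""
    # Cast to a signed type first so the subtraction of uint8 frames cannot wrap around.
    diff = np.abs(next_frame.astype(np.int32) - prev_frame.astype(np.int32))
    return {t: diff >= t for t in thresholds}

rng = np.random.default_rng(0)
frame_a = rng.integers(0, 256, size=(32, 32), dtype=np.uint8)
frame_b = rng.integers(0, 256, size=(32, 32), dtype=np.uint8)
masks = event_masks(frame_a, frame_b)
event_counts = [int(masks[t].sum()) for t in THRESHOLDS]
```

The event count can only fall as the threshold rises, which is the sparsity versus detail trade-off that the multi-threshold design exposes to a detector.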

[CV-245] You Only Look Once at Anytime (AnytimeYOLO): Analysis and Optimization of Early-Exits for Object-Detection

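[Quick read]: This paper addresses interruptibility and flexibility of object detection in real-time systems, particularly in resource-constrained or safety-critical applications. Conventional object detectors must complete inference before they provide a result, whereas AnytimeYOLO supports anytime prediction. The core of the solution is a structured modification of the YOLO architecture that adds fine-grained early exits, so that inference can be terminated at any point and still return an intermediate result. Key contributions include a Transposed YOLO Architecture that enables better early predictions and more freedom in the order of processing stages, and two optimization algorithms that determine the optimal exit execution order and the optimal subset of early exits. The paper also defines a new anytime quality metric and discusses the challenges that currently make deploying anytime inference costly.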

Link: https://arxiv.org/abs/2503.17497
Authors: Daniel Kuhse,Harun Teper,Sebastian Buschjäger,Chien-Yao Wang,Jian-Jia Chen
Affiliations: Technische Universität Dortmund; Lamarr Institute for Machine Learning and Artificial Intelligence; Academia Sinica
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract:We introduce AnytimeYOLO, a family of variants of the YOLO architecture that enables anytime object detection. Our AnytimeYOLO networks allow for interruptible inference, i.e., they provide a prediction at any point in time, a property desirable for safety-critical real-time applications. We present structured explorations to modify the YOLO architecture, enabling early termination to obtain intermediate results. We focus on providing fine-grained control through high granularity of available termination points. First, we formalize Anytime Models as a special class of prediction models that offer anytime predictions. Then, we discuss a novel transposed variant of the YOLO architecture, that changes the architecture to enable better early predictions and greater freedom for the order of processing stages. Finally, we propose two optimization algorithms that, given an anytime model, can be used to determine the optimal exit execution order and the optimal subset of early-exits to select for deployment in low-resource environments. We evaluate the anytime performance and trade-offs of design choices, proposing a new anytime quality metric for this purpose. In particular, we also discuss key challenges for anytime inference that currently make its deployment costly.
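
Interruptible inference with early exits can be sketched as a loop that refreshes its prediction after every stage and returns the most recent one when the time budget runs out. This is a simplified, cooperative version that checks the deadline only between stages; the stage and exit callables, the injectable `clock`, and the toy values are assumptions, not the AnytimeYOLO implementation.

```python
import time

def anytime_predict(stages, exits, x, budget_s, clock=time.monotonic):
    """Run stages in order and return the latest early-exit prediction once time is up."""
    start = clock()
    prediction = None
    for stage, exit_head in zip(stages, exits):
        if clock() - start >= budget_s:
            break                        # budget spent: keep the last prediction
        x = stage(x)
        if exit_head is not None:        # not every stage needs an exit head
            prediction = exit_head(x)
    return prediction

# Toy network: each stage adds 1 to its input, each exit reports the value it saw.
toy_stages = [lambda v: v + 1] * 3
toy_exits = [lambda v: ("exit", v)] * 3
full_result = anytime_predict(toy_stages, toy_exits, 0, budget_s=60.0)
```

Which stages receive an exit head, and in which order the exits are evaluated, is exactly the selection problem that the paper's two optimization algorithms target.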

[CV-246] Meme Similarity and Emotion Detection using Multimodal Analysis

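[Quick read]: This paper asks how to effectively compare Internet memes and the emotions they elicit. Its key is a multimodal methodology that analyzes both the visual and textual elements of memes. Specifically, the study uses a multimodal CLIP (Contrastive Language-Image Pre-training) model to group similar memes based on text and visual content embeddings, which enables robust similarity assessment across modalities. Using the Reddit Meme Dataset and the Memotion Dataset, it extracts low-level visual features and high-level semantic features to identify similar meme pairs, and a user study shows 67.23% agreement between the automated similarity assessments and human judgments. A text-based classifier built on DistilBERT further categorizes memes by emotion, indicating that anger and joy are the dominant emotions and that motivational memes elicit stronger emotional responses. Together these contributions aim to improve online visual communication and user experience, and they offer insights for content moderation strategies on online platforms.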

Link: https://arxiv.org/abs/2503.17493
Authors: Aidos Konyspay,Pakizar Shamoi,Malika Ziyada,Zhusup Smambayev
Affiliations: School of Information Technology and Engineering, Kazakh-British Technical University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Have been submitted to IEEE for consideration

Abstract:Internet memes are a central element of online culture, blending images and text. While substantial research has focused on either the visual or textual components of memes, little attention has been given to their interplay. This gap raises a key question: What methodology can effectively compare memes and the emotions they elicit? Our study employs a multimodal methodological approach, analyzing both the visual and textual elements of memes. Specifically, we use a multimodal CLIP (Contrastive Language-Image Pre-training) model to group similar memes based on text and visual content embeddings, enabling robust similarity assessments across modalities. Using the Reddit Meme Dataset and Memotion Dataset, we extract low-level visual features and high-level semantic features to identify similar meme pairs. To validate these automated similarity assessments, we conducted a user study with 50 participants, asking them to provide yes/no responses regarding meme similarity and their emotional reactions. The comparison of experimental results with human judgments showed a 67.23% agreement, suggesting that the computational approach aligns well with human perception. Additionally, we implemented a text-based classifier using the DistilBERT model to categorize memes into one of six basic emotions. The results indicate that anger and joy are the dominant emotions in memes, with motivational memes eliciting stronger emotional responses. This research contributes to the study of multimodal memes, enhancing both language-based and visual approaches to analyzing and improving online visual communication and user experiences. Furthermore, it provides insights for better content moderation strategies in online platforms.
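
Grouping memes by embedding similarity and then checking agreement with yes/no human judgments can be sketched as follows. The embeddings are assumed to be precomputed (for example by a CLIP-style encoder); the concatenation scheme, the 0.8 threshold, and the function names are illustrative assumptions, and the vote vectors are made up.

```python
import numpy as np

def unit_rows(x):
    """Scale every row to unit length."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def similar_pairs(image_emb, text_emb, threshold=0.8):
    """Pairs (i, j) whose joint image+text embeddings have cosine similarity >= threshold."""
    joint = unit_rows(np.concatenate([unit_rows(image_emb), unit_rows(text_emb)], axis=1))
    sims = joint @ joint.T
    rows, cols = np.triu_indices(len(joint), k=1)    # each unordered pair once
    return [(int(i), int(j)) for i, j in zip(rows, cols) if sims[i, j] >= threshold]

def agreement_rate(model_votes, human_votes):
    """Fraction of pairs on which the model and the human give the same yes/no answer."""
    return float(np.mean(np.asarray(model_votes) == np.asarray(human_votes)))

toy_emb = np.array([[1.0, 0.0], [1.0, 0.01], [0.0, 1.0]])
pairs = similar_pairs(toy_emb, toy_emb)
rate = agreement_rate([1, 0, 1, 1], [1, 1, 1, 0])
```

The agreement rate is the same kind of quantity as the 67.23% reported in the abstract, computed there over the responses of 50 participants.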

[CV-247] Splat-LOAM: Gaussian Splatting LiDAR Odometry and Mapping ICCV2025

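[Quick read]: This paper addresses how to obtain an accurate and lightweight representation of the environment in LiDAR-based ego-motion estimation and mapping. Both classic and NeRF-based solutions have to trade accuracy against memory and processing time. The key to the solution is to build on recent advances in Gaussian Splatting and develop a new LiDAR odometry and mapping pipeline that relies exclusively on Gaussian primitives for its scene representation. Using spherical projection, the method drives the refinement of the primitives solely from LiDAR measurements. Experiments show that the approach matches current registration performance while achieving state-of-the-art (SOTA) mapping results with minimal GPU requirements, which makes it a candidate for real-time robotics estimation tasks.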

Link: https://arxiv.org/abs/2503.17491
Authors: Emanuele Giacomini,Luca Di Giammarino,Lorenzo De Rebotti,Giorgio Grisetti,Martin R. Oswald
Affiliations: Sapienza University of Rome; University of Amsterdam
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: submitted to ICCV 2025

Abstract:LiDARs provide accurate geometric measurements, making them valuable for ego-motion estimation and reconstruction tasks. Despite their success, maintaining an accurate and lightweight representation of the environment still poses challenges. Both classic and NeRF-based solutions have to trade off accuracy against memory and processing time. In this work, we build on recent advancements in Gaussian Splatting methods to develop a novel LiDAR odometry and mapping pipeline that exclusively relies on Gaussian primitives for its scene representation. Leveraging spherical projection, we drive the refinement of the primitives solely from LiDAR measurements. Experiments show that our approach matches the current registration performance, while achieving SOTA results for mapping tasks with minimal GPU requirements. This efficiency makes it a strong candidate for further exploration and potential adoption in real-time robotics estimation tasks.
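
Spherical projection, which the abstract names as the bridge between LiDAR measurements and the Gaussian primitives, maps each 3D point to a pixel of a range image through its azimuth and elevation. A minimal sketch, assuming a 64 x 1024 image and a symmetric 30 degree vertical field of view (both are illustrative values, not the paper's settings):

```python
import numpy as np

def spherical_projection(points, height=64, width=1024, fov_up_deg=15.0, fov_down_deg=-15.0):
    """Project (N, 3) LiDAR points to a range image; empty pixels are 0."""
    ranges = np.linalg.norm(points, axis=1)
    keep = ranges > 0
    points, ranges = points[keep], ranges[keep]
    azimuth = np.arctan2(points[:, 1], points[:, 0])    # horizontal angle in [-pi, pi]
    elevation = np.arcsin(points[:, 2] / ranges)        # vertical angle
    fov_up, fov_down = np.radians(fov_up_deg), np.radians(fov_down_deg)
    in_fov = (elevation <= fov_up) & (elevation >= fov_down)
    azimuth, elevation, ranges = azimuth[in_fov], elevation[in_fov], ranges[in_fov]
    cols = np.floor(0.5 * (1.0 - azimuth / np.pi) * width).astype(int)
    rows = np.floor((fov_up - elevation) / (fov_up - fov_down) * height).astype(int)
    cols = np.clip(cols, 0, width - 1)
    rows = np.clip(rows, 0, height - 1)
    image = np.full((height, width), np.inf)
    np.minimum.at(image, (rows, cols), ranges)          # keep the nearest return per pixel
    image[np.isinf(image)] = 0.0
    return image

# Two returns along the x-axis share a pixel; the point straight up is outside the field of view.
cloud = np.array([[10.0, 0.0, 0.0], [5.0, 0.0, 0.0], [0.0, 0.0, 3.0]])
range_image = spherical_projection(cloud)
```

Keeping the nearest return per pixel is one common convention; a rendered range image from the scene model can then be compared pixel by pixel against this projection.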

[CV-248] ProDehaze: Prompting Diffusion Models Toward Faithful Image Dehazing ICME2025

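[Quick read]: This paper addresses the hallucination issues of existing dehazing methods based on large-scale pretrained diffusion models, which improve perceptual quality but produce dehazed images that are unfaithful to the original. The key solution is ProDehaze, a framework that uses internal image priors to direct the external priors encoded in pretrained models. It introduces two types of selective internal priors: a Structure-Prompted Restorer in the latent space that emphasizes structure-rich regions, and a Haze-Aware Self-Correcting Refiner in the decoding process that aligns the distributions of clearer input regions and the output. These designs yield high-fidelity dehazing results, particularly in reducing color shifts.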

Link: https://arxiv.org/abs/2503.17488
Authors: Tianwen Zhou,Jing Wang,Songtao Wu,Kuanhong Xu
Affiliations: Department of Computer Science, University College London; R&D Center Beijing Lab, Sony China Ltd., Beijing, China
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to ICME 2025

Abstract:Recent approaches using large-scale pretrained diffusion models for image dehazing improve perceptual quality but often suffer from hallucination issues, producing dehazed images that are unfaithful to the original. To mitigate this, we propose ProDehaze, a framework that employs internal image priors to direct external priors encoded in pretrained models. We introduce two types of selective internal priors that prompt the model to concentrate on critical image areas: a Structure-Prompted Restorer in the latent space that emphasizes structure-rich regions, and a Haze-Aware Self-Correcting Refiner in the decoding process to align distributions between clearer input regions and the output. Extensive experiments on real-world datasets demonstrate that ProDehaze achieves high-fidelity results in image dehazing, particularly in reducing color shifts. Our code is at this https URL.

[CV-249] ProtoGS: Efficient and High-Quality Rendering with 3D Gaussian Prototypes

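[Quick read]: This paper addresses the large number of Gaussian primitives that 3D Gaussian Splatting (3DGS) requires for novel view synthesis, which makes deployment on lightweight, resource-constrained devices difficult. Existing methods compress the storage size of densified Gaussians but usually sacrifice rendering quality and efficiency. The key innovation is ProtoGS, which learns Gaussian prototypes to represent Gaussian primitives and thereby significantly reduces the number of Gaussians while preserving visual quality. The method renders directly from the Gaussian prototypes and uses the resulting reconstruction loss to guide prototype learning. To improve memory efficiency during training, it uses structure-from-motion (SfM) points as anchors to group Gaussian primitives, derives the prototypes within each group by K-means clustering, and optimizes the anchors and prototypes jointly. Experiments on real-world and synthetic datasets show that the method outperforms existing approaches, achieving high rendering speed while maintaining or even improving rendering fidelity.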

Link: https://arxiv.org/abs/2503.17486
Authors: Zhengqing Gao,Dongting Hu,Jia-Wang Bian,Huan Fu,Yan Li,Tongliang Liu,Mingming Gong,Kun Zhang
Affiliations: MBZUAI; University of Melbourne; Alibaba Group
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:3D Gaussian Splatting (3DGS) has made significant strides in novel view synthesis but is limited by the substantial number of Gaussian primitives required, posing challenges for deployment on lightweight devices. Recent methods address this issue by compressing the storage size of densified Gaussians, yet fail to preserve rendering quality and efficiency. To overcome these limitations, we propose ProtoGS to learn Gaussian prototypes to represent Gaussian primitives, significantly reducing the total number of Gaussians without sacrificing visual quality. Our method directly uses Gaussian prototypes to enable efficient rendering and leverages the resulting reconstruction loss to guide prototype learning. To further optimize memory efficiency during training, we incorporate structure-from-motion (SfM) points as anchor points to group Gaussian primitives. Gaussian prototypes are derived within each group by K-means clustering, and both the anchor points and the prototypes are optimized jointly. Our experiments on real-world and synthetic datasets show that we outperform existing methods, achieving a substantial reduction in the number of Gaussians, and enabling high rendering speed while maintaining or even enhancing rendering fidelity.
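
The grouping step described in the abstract (assign Gaussian primitives to anchor points, then run K-means inside each group) can be sketched with plain NumPy. The sketch covers only the clustering, not the joint optimization with the rendering loss, and the attribute layout (xyz followed by other parameters), the nearest-anchor assignment, and all names are assumptions.

```python
import numpy as np

def kmeans(x, k, iters=20, seed=0):
    """Plain Lloyd's K-means; returns (centers, assignment)."""
    rng = np.random.default_rng(seed)
    centers = x[rng.choice(len(x), size=k, replace=False)].copy()
    for _ in range(iters):
        dists = np.linalg.norm(x[:, None, :] - centers[None, :, :], axis=2)
        assign = dists.argmin(axis=1)
        for c in range(k):
            if np.any(assign == c):              # leave empty clusters where they are
                centers[c] = x[assign == c].mean(axis=0)
    dists = np.linalg.norm(x[:, None, :] - centers[None, :, :], axis=2)
    return centers, dists.argmin(axis=1)

def prototypes_per_anchor(gaussians, anchors, k=2):
    """Group Gaussians by nearest anchor (first 3 columns = xyz) and cluster each group."""
    positions = gaussians[:, :3]
    group = np.linalg.norm(positions[:, None, :] - anchors[None, :, :], axis=2).argmin(axis=1)
    prototypes = {}
    for a in range(len(anchors)):
        members = gaussians[group == a]
        if len(members) == 0:
            continue
        prototypes[a], _ = kmeans(members, min(k, len(members)))
    return prototypes

rng = np.random.default_rng(1)
left = np.hstack([rng.normal(0.0, 0.1, (20, 3)), rng.normal(0.0, 1.0, (20, 1))])
right = np.hstack([rng.normal(0.0, 0.1, (20, 3)) + [10.0, 0.0, 0.0], rng.normal(0.0, 1.0, (20, 1))])
toy_gaussians = np.vstack([left, right])
toy_anchors = np.array([[0.0, 0.0, 0.0], [10.0, 0.0, 0.0]])
toy_prototypes = prototypes_per_anchor(toy_gaussians, toy_anchors, k=2)
```

In this toy setup 40 primitives are replaced by 4 prototypes, which shows the kind of reduction the method aims for.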

[CV-250] What's Producible May Not Be Reachable: Measuring the Steerability of Generative Models

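[Quick read]: This paper asks how to evaluate generative models more comprehensively, in particular how to distinguish producibility from steerability. Existing evaluation methods mostly focus on the quality and diversity of the outputs a model can generate, but overlook whether a user with a specific goal can actually use the model to achieve it. The paper argues that the practical value of a generative model depends not only on what it can produce but also on its steerability. Because evaluating steerability requires knowing the user's specific goal, it is harder to evaluate than producibility.

To address this, the paper proposes a new benchmark design: sample an output from the generative model and ask users to reproduce it. This measures steerability independently of producibility. The paper validates the approach in a large-scale user study covering text-to-image models and large language models. Although these models produce high-quality outputs, all of them perform poorly on steerability.

The key to the solution is an alternative steering mechanism based on reinforcement learning, which improves image models by more than 2x on this benchmark. This suggests that improving steerability can substantially raise the practical value of generative models.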

Link: https://arxiv.org/abs/2503.17482
Authors: Keyon Vafa, Sarah Bentley, Jon Kleinberg, Sendhil Mullainathan
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
Comments:

Abstract:How should we evaluate the quality of generative models? Many existing metrics focus on a model’s producibility, i.e. the quality and breadth of outputs it can generate. However, the actual value from using a generative model stems not just from what it can produce but whether a user with a specific goal can produce an output that satisfies that goal. We refer to this property as steerability. In this paper, we first introduce a mathematical framework for evaluating steerability independently from producibility. Steerability is more challenging to evaluate than producibility because it requires knowing a user’s goals. We address this issue by creating a benchmark task that relies on one key idea: sample an output from a generative model and ask users to reproduce it. We implement this benchmark in a large-scale user study of text-to-image models and large language models. Despite the ability of these models to produce high-quality outputs, they all perform poorly on steerability. This suggests that we need to focus on improving the steerability of generative models. We show such improvements are indeed possible: through reinforcement learning techniques, we create an alternative steering mechanism for image models that achieves more than 2x improvement on this benchmark.

[CV-251] Bayesian generative models can flag performance loss bias and out-of-distribution image content

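[Quick read]: This paper addresses the problem that generative models used in medical imaging, being parameterized by deep learning models, are sensitive to distribution shifts and unreliable on out-of-distribution data, which creates a risk of, for example, underrepresentation bias. To tackle this, it proposes SLUG, a new uncertainty quantification (UQ) method for variational autoencoders (VAEs). The key idea is to combine recent advances in Laplace approximations with stochastic trace estimators so that the method scales gracefully with image dimensionality. The paper shows that the resulting UQ score correlates strongly with reconstruction error and with racial underrepresentation bias in dermatological images, and that pixel-wise uncertainty can detect out-of-distribution image content such as ink, rulers, and patches, which is known to induce shortcut learning in predictive models.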

Link: https://arxiv.org/abs/2503.17477
Authors: Miguel López-Pérez, Marco Miani, Valery Naranjo, Søren Hauberg, Aasa Feragen
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
Comments: Under review

Abstract:Generative models are popular for medical imaging tasks such as anomaly detection, feature extraction, data visualization, or image generation. Since they are parameterized by deep learning models, they are often sensitive to distribution shifts and unreliable when applied to out-of-distribution data, creating a risk of, e.g. underrepresentation bias. This behavior can be flagged using uncertainty quantification methods for generative models, but their availability remains limited. We propose SLUG: A new UQ method for VAEs that combines recent advances in Laplace approximations with stochastic trace estimators to scale gracefully with image dimensionality. We show that our UQ score – unlike the VAE’s encoder variances – correlates strongly with reconstruction error and racial underrepresentation bias for dermatological images. We also show how pixel-wise uncertainty can detect out-of-distribution image content such as ink, rulers, and patches, which is known to induce learning shortcuts in predictive models.
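
To make the phrase "stochastic trace estimators" concrete, here is a minimal sketch of the classic Hutchinson estimator, which approximates the trace of a matrix using only matrix-vector products. This is a generic textbook technique shown under the assumption that it is representative of the family the abstract refers to; the digest does not say which estimator SLUG uses, and the function name and parameters below are hypothetical.

```python
import numpy as np

def hutchinson_trace(matvec, dim, num_probes=2000, seed=0):
    """Estimate tr(A) from products v -> A @ v, without forming A explicitly."""
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(num_probes):
        # Rademacher probe: entries are +1 or -1 with equal probability,
        # so E[v v^T] = I and E[v^T A v] = tr(A).
        v = rng.choice([-1.0, 1.0], size=dim)
        total += v @ matvec(v)
    return total / num_probes

# Usage: compare against the exact trace of a small symmetric matrix.
rng = np.random.default_rng(1)
B = rng.normal(size=(20, 20))
A = B @ B.T
estimate = hutchinson_trace(lambda v: A @ v, dim=20)
exact = np.trace(A)
```

The appeal for image-sized problems is that the cost depends on the number of probes and the cost of one matrix-vector product, not on storing a matrix whose side length equals the number of pixels.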

[CV-252] Spatiotemporal Learning with Context-aware Video Tubelets for Ultrasound Video Analysis

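[Quick read]: This paper tackles the challenge that computer-aided pathology detection algorithms for video-based imaging modalities face when interpreting complex spatiotemporal information, in particular the loss of global spatial context when integrating information across frames. Current state-of-the-art methods classify video sub-volumes (tubelets), but they usually attend only to local regions inside detection regions of interest (ROIs) and therefore miss global spatial relationships. The key contribution is a lightweight framework that preserves global spatial context while capturing fine spatiotemporal features. To counter the loss of global context, the authors feed tubelet location, size, and confidence to the classifier, and they use ROI-aligned feature maps from a pre-trained detection model, reusing learned feature representations to enlarge the receptive field and reduce computational complexity. The approach is efficient: the spatiotemporal tubelet classifier has only about 0.4M parameters. On detection and classification of lung consolidation and pleural effusion in ultrasound videos, five-fold cross-validation shows that it outperforms previous tubelet-based methods and is suited to real-time workflows.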

Link: https://arxiv.org/abs/2503.17475
Authors: Gary Y. Li, Li Chen, Bryson Hicks, Nikolai Schnittke, David O. Kessler, Jeffrey Shupp, Maria Parker, Cristiana Baloescu, Christopher Moore, Cynthia Gregory, Kenton Gregory, Balasundar Raju, Jochen Kruecker, Alvin Chen
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: ISBI Oral 2025

Abstract:Computer-aided pathology detection algorithms for video-based imaging modalities must accurately interpret complex spatiotemporal information by integrating findings across multiple frames. Current state-of-the-art methods operate by classifying on video sub-volumes (tubelets), but they often lose global spatial context by focusing only on local regions within detection ROIs. Here we propose a lightweight framework for tubelet-based object detection and video classification that preserves both global spatial context and fine spatiotemporal features. To address the loss of global context, we embed tubelet location, size, and confidence as inputs to the classifier. Additionally, we use ROI-aligned feature maps from a pre-trained detection model, leveraging learned feature representations to increase the receptive field and reduce computational complexity. Our method is efficient, with the spatiotemporal tubelet classifier comprising only 0.4M parameters. We apply our approach to detect and classify lung consolidation and pleural effusion in ultrasound videos. Five-fold cross-validation on 14,804 videos from 828 patients shows our method outperforms previous tubelet-based approaches and is suited for real-time workflows.

[CV-253] High Efficiency Wiener Filter-based Point Cloud Quality Enhancement for MPEG G-PCC

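[Quick read]: This paper addresses the low reconstruction quality of geometry-based point cloud compression (G-PCC) under lossy compression, especially at low bitrates. The key solution is a high-efficiency Wiener filter that can be integrated into the G-PCC encoder and decoder pipeline to improve reconstruction quality and rate-distortion performance for dynamic point clouds. Specifically, the paper first designs a basic Wiener filter and then improves it with coefficient inheritance and variance-based point classification for the luma component. To reduce the complexity of the nearest neighbor search during Wiener filtering, it also proposes a Morton code-based fast nearest neighbor search algorithm for efficient calculation of the filter coefficients. Under the lossless-geometry, lossy-attributes configuration, the method achieves average Bjøntegaard delta rates of -6.1%, -7.3%, and -8.0% for the luma, chroma Cb, and chroma Cr components, respectively, relative to the latest G-PCC encoding platform, at an acceptable computational cost.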

Link: https://arxiv.org/abs/2503.17467
Authors: Yuxuan Wei, Zehan Wang, Tian Guo, Hao Liu, Liquan Shen, Hui Yuan
Affiliations: School of Control Science and Engineering, Shandong University, Jinan 250061, China; Key Laboratory of Machine Intelligence and System Control, Ministry of Education, Jinan 250061, China; School of Computer and Control Engineering, Yantai University, Yantai, 264005, China; Shanghai Institute for Advanced Communication and Data Science, Shanghai University, Shanghai 200072, China
Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments:

Abstract:Point clouds, which directly record the geometry and attributes of scenes or objects by a large number of points, are widely used in various applications such as virtual reality and immersive communication. However, due to the huge data volume and unstructured geometry, efficient compression of point clouds is crucial. In recent years, the Moving Picture Experts Group (MPEG) has been developing a geometry-based point cloud compression (G-PCC) standard for both static and dynamic point clouds. Although lossy compression of G-PCC can achieve a very high compression ratio, the reconstruction quality is relatively low, especially at low bitrates. To mitigate this problem, we propose a high efficiency Wiener filter that can be integrated into the encoder and decoder pipeline of G-PCC to improve the reconstruction quality as well as the rate-distortion performance for dynamic point clouds. Specifically, we first propose a basic Wiener filter, and then improve it by introducing coefficient inheritance and variance-based point classification for the Luma component. Besides, to reduce the complexity of the nearest neighbor search during the application of the Wiener filter, we also propose a Morton code-based fast nearest neighbor search algorithm for efficient calculation of filter coefficients. Experimental results demonstrate that the proposed method can achieve average Bjøntegaard delta rates of -6.1%, -7.3%, and -8.0% for Luma, Chroma Cb, and Chroma Cr components, respectively, under the lossless-geometry-lossy-attributes configuration compared to the latest G-PCC encoding platform (i.e., geometry-based solid content test model version 7.0 release candidate 2), at an affordable computational cost.
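
As background for the Morton code-based neighbor search mentioned above, the sketch below shows how a 3D Morton (Z-order) code is formed by interleaving coordinate bits, and how sorting by that code allows a cheap approximate neighbor lookup. It is a generic illustration, not the paper's algorithm: the function names, the window size, and the brute-force scan inside the window are assumptions.

```python
import bisect

def morton3d(x, y, z, bits=10):
    """Interleave the low `bits` bits of x, y, z into one integer (x lowest)."""
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (3 * i)
        code |= ((y >> i) & 1) << (3 * i + 1)
        code |= ((z >> i) & 1) << (3 * i + 2)
    return code

def approx_nearest(points, query, window=8, bits=10):
    """Approximate nearest neighbor of `query` among integer 3D `points`.

    Points close in space tend to be close in Morton order, so only a small
    window around the query's rank is scanned. The sort is repeated on every
    call to keep the sketch short; a real pipeline would sort once.
    """
    coded = sorted((morton3d(*p, bits=bits), p) for p in points)
    codes = [c for c, _ in coded]
    pos = bisect.bisect_left(codes, morton3d(*query, bits=bits))
    lo, hi = max(0, pos - window), min(len(coded), pos + window)
    candidates = [p for _, p in coded[lo:hi]]
    # Exact squared Euclidean distance, evaluated only on the candidates.
    return min(candidates, key=lambda p: sum((a - b) ** 2 for a, b in zip(p, query)))
```

The lookup is approximate because two points can be spatial neighbors yet far apart in Z-order; widening `window` trades speed for accuracy.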

[CV-254] Feature-Based Dual Visual Feature Extraction Model for Compound Multimodal Emotion Recognition

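[Quick read]: This paper addresses the uncertainty and modal conflicts that compound emotion recognition faces in the real world. For the Compound Expression (CE) Recognition Challenge, it proposes a multimodal emotion recognition method whose key idea is to fuse features from a Vision Transformer (ViT) and a Residual Network (ResNet). Experiments on the C-EXPR-DB and MELD datasets show that, in scenarios with complex visual and audio cues (such as C-EXPR-DB), the model that fuses ViT and ResNet features performs best. The code has been released.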

Link: https://arxiv.org/abs/2503.17453
Authors: Ran Liu, Fengyu Zhang, Cong Yu, Longjiang Yang, Zhuofan Wen, Siyuan Zhang, Hailiang Yao, Shun Chen, Zheng Lian, Bin Liu
Affiliations: University of Chinese Academy of Sciences; The State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences; Tianjin Normal University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:This article presents our results for the eighth Affective Behavior Analysis in-the-wild (ABAW) competition. Multimodal emotion recognition (ER) has important applications in affective computing and human-computer interaction. However, in the real world, compound emotion recognition faces greater issues of uncertainty and modal conflicts. For the Compound Expression (CE) Recognition Challenge, this paper proposes a multimodal emotion recognition method that fuses the features of Vision Transformer (ViT) and Residual Network (ResNet). We conducted experiments on the C-EXPR-DB and MELD datasets. The results show that in scenarios with complex visual and audio cues (such as C-EXPR-DB), the model that fuses the features of ViT and ResNet exhibits superior performance. Our code is available at this https URL

[CV-255] On-Device Federated Continual Learning on RISC-V-based Ultra-Low-Power SoC for Intelligent Nano-Drone Swarms

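[Quick read]: This paper addresses the challenge of efficient On-Device Learning (ODL) on resource-constrained, battery-operated edge devices, with a focus on continual learning in multi-node intelligent sensor networks. It targets the difficulties caused by constrained computational resources, limited device lifetime, and intrinsic learning issues such as catastrophic forgetting. To this end, it proposes a regularization-based On-Device Federated Continual Learning algorithm tailored for multiple nano-drones performing face recognition. Demonstrated on a RISC-V-based 10-core ultra-low-power SoC with optimized ODL computational requirements, the approach improves classification accuracy by 24% over naive fine-tuning while requiring 178 ms per local epoch and 10.5 s per global epoch, which demonstrates the effectiveness of the architecture for this task.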

Link: https://arxiv.org/abs/2503.17436
Authors: Lars Kröger, Cristian Cioflan, Victor Kartsch, Luca Benini
Affiliations: ETH Zurich; University of Bologna
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Multiagent Systems (cs.MA)
Comments: 2 pages, 2 tables, 1 figure. Accepted as a poster at the RISC-V Summit Europe 2025

Abstract:RISC-V-based architectures are paving the way for efficient On-Device Learning (ODL) in smart edge devices. When applied across multiple nodes, ODL enables the creation of intelligent sensor networks that preserve data privacy. However, developing ODL-capable, battery-operated embedded platforms presents significant challenges due to constrained computational resources and limited device lifetime, besides intrinsic learning issues such as catastrophic forgetting. We face these challenges by proposing a regularization-based On-Device Federated Continual Learning algorithm tailored for multiple nano-drones performing face recognition tasks. We demonstrate our approach on a RISC-V-based 10-core ultra-low-power SoC, optimizing the ODL computational requirements. We improve the classification accuracy by 24% over naive fine-tuning, requiring 178 ms per local epoch and 10.5 s per global epoch, demonstrating the effectiveness of the architecture for this task.

[CV-256] Enhancing Subsequent Video Retrieval via Vision-Language Models (VLMs)

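[Quick read]: This paper addresses the fact that, as video content grows rapidly, vision-language models (VLMs) struggle with adaptive, time-sensitive video retrieval. The key contribution is a new framework that combines vector similarity search with graph-based data structures. By using VLM embeddings for initial retrieval and modeling contextual relationships among video segments, the method enables adaptive query refinement and improves retrieval accuracy. Experiments demonstrate its precision, scalability, and robustness, offering an effective solution for interactive video retrieval in dynamic environments.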

Link: https://arxiv.org/abs/2503.17415
Authors: Yicheng Duan, Xi Huang, Duo Chen
Affiliations: Case Western Reserve University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments:

Abstract:The rapid growth of video content demands efficient and precise retrieval systems. While vision-language models (VLMs) excel in representation learning, they often struggle with adaptive, time-sensitive video retrieval. This paper introduces a novel framework that combines vector similarity search with graph-based data structures. By leveraging VLM embeddings for initial retrieval and modeling contextual relationships among video segments, our approach enables adaptive query refinement and improves retrieval accuracy. Experiments demonstrate its precision, scalability, and robustness, offering an effective solution for interactive video retrieval in dynamic environments.
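
The combination of vector similarity search with a graph over video segments can be sketched roughly as follows. This is a toy illustration under stated assumptions, not the paper's system: the embeddings are plain NumPy arrays standing in for VLM embeddings, the graph is a hypothetical adjacency dictionary linking related segments, and `retrieve` is an invented name.

```python
import numpy as np

def retrieve(query, embeddings, neighbors, top_k=2):
    """Cosine top-k retrieval, then expansion along graph edges.

    query:      (D,) query embedding
    embeddings: (N, D) one embedding per video segment
    neighbors:  dict mapping a segment index to related segment indices
    """
    q = query / np.linalg.norm(query)
    e = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    scores = e @ q  # cosine similarity of every segment to the query
    seeds = np.argsort(-scores)[:top_k].tolist()
    results = list(seeds)
    # Add contextually linked segments that similarity alone may have missed.
    for s in seeds:
        for n in neighbors.get(s, []):
            if n not in results:
                results.append(n)
    return results

# Usage: segment 2 is dissimilar to the query but linked to the best hit.
embeddings = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]])
neighbors = {0: [1, 2], 1: [0], 2: [3]}
hits = retrieve(np.array([1.0, 0.0]), embeddings, neighbors, top_k=1)
```

The design point is that the graph contributes candidates based on context (for example, the segment that follows a hit) rather than on embedding similarity.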

[CV-257] IRef-VLA: A Benchmark for Interactive Referential Grounding with Imperfect Language in 3D Scenes ICRA2025

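[Quick read]: This paper addresses indoor navigation from natural language instructions, a task that is challenging because it requires 3D spatial reasoning and semantic understanding, and becomes harder still when the instructions are imperfect or misaligned with the scene. The key contribution is IRef-VLA, a benchmark dataset for Interactive Referential Vision and Language-guided Action in 3D scenes with imperfect references. IRef-VLA contains over 11.5K scanned real-world 3D rooms from existing datasets, 7.6M heuristically generated semantic relations, and 4.7M referential statements, together with semantic object and room annotations, scene graphs, and navigable free space annotations, and it is augmented with statements whose language is imperfect or ambiguous. The authors evaluate state-of-the-art models to verify the generalizability of the dataset and build a graph-search baseline that shows the performance bound and generates alternatives from scene-graph knowledge, supporting the development of robust, interactive navigation systems.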

Link: https://arxiv.org/abs/2503.17406
Authors: Haochen Zhang, Nader Zantout, Pujith Kachana, Ji Zhang, Wenshan Wang
Affiliations: Carnegie Mellon University, Robotics Institute
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: Accepted to ICRA 2025. Code available at this https URL. arXiv admin note: text overlap with arXiv:2411.03540

Abstract:With the recent rise of large language models, vision-language models, and other general foundation models, there is growing potential for multimodal, multi-task robotics that can operate in diverse environments given natural language input. One such application is indoor navigation using natural language instructions. However, despite recent progress, this problem remains challenging due to the 3D spatial reasoning and semantic understanding required. Additionally, the language used may be imperfect or misaligned with the scene, further complicating the task. To address this challenge, we curate a benchmark dataset, IRef-VLA, for Interactive Referential Vision and Language-guided Action in 3D Scenes with imperfect references. IRef-VLA is the largest real-world dataset for the referential grounding task, consisting of over 11.5K scanned 3D rooms from existing datasets, 7.6M heuristically generated semantic relations, and 4.7M referential statements. Our dataset also contains semantic object and room annotations, scene graphs, and navigable free space annotations, and is augmented with statements where the language has imperfections or ambiguities. We verify the generalizability of our dataset by evaluating with state-of-the-art models to obtain a performance baseline and also develop a graph-search baseline to demonstrate the performance bound and generation of alternatives using scene-graph knowledge. With this benchmark, we aim to provide a resource for 3D scene understanding that aids the development of robust, interactive navigation systems. The dataset and all source code are publicly released at this https URL.

[CV-258] Temporal Flexibility in Spiking Neural Networks: Towards Generalization Across Time Steps and Deployment Friendliness ICLR2025

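[Quick read]: This paper addresses the "temporal inflexibility" of current Spiking Neural Networks (SNNs): SNNs trained with existing direct training methods are constrained to a specific time step. This limitation hinders deployment on time-step-free, fully event-driven chips and prevents balancing energy against performance through dynamic inference time steps. The paper proposes Mixed Time-step Training (MTT), whose key idea is to assign random time steps to different SNN stages in each iteration and to transmit spikes between stages through communication modules, making SNNs adaptive to diverse temporal structures. Experiments show that models trained with MTT gain remarkable temporal flexibility and work well in both event-driven and clock-driven deployment (nearly lossless on N-MNIST and 10.1% higher than standard methods on CIFAR10-DVS), with enhanced network generalization and near state-of-the-art performance. To the authors' knowledge, this is the first work to report results for large-scale SNN deployment in fully event-driven scenarios.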

Link: https://arxiv.org/abs/2503.17394
Authors: Kangrui Du, Yuhang Wu, Shikuang Deng, Shi Gu
Affiliations: University of Electronic Science and Technology of China; Shenzhen Institute for Advanced Study, UESTC; College of Computing, Georgia Institute of Technology
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Neural and Evolutionary Computing (cs.NE)
Comments: 20 pages, ICLR 2025

Abstract:Spiking Neural Networks (SNNs), models inspired by neural mechanisms in the brain, allow for energy-efficient implementation on neuromorphic hardware. However, SNNs trained with current direct training approaches are constrained to a specific time step. This “temporal inflexibility” 1) hinders SNNs’ deployment on time-step-free fully event-driven chips and 2) prevents energy-performance balance based on dynamic inference time steps. In this study, we first explore the feasibility of training SNNs that generalize across different time steps. We then introduce Mixed Time-step Training (MTT), a novel method that improves the temporal flexibility of SNNs, making SNNs adaptive to diverse temporal structures. During each iteration of MTT, random time steps are assigned to different SNN stages, with spikes transmitted between stages via communication modules. After training, the weights are deployed and evaluated on both time-stepped and fully event-driven platforms. Experimental results show that models trained by MTT gain remarkable temporal flexibility, friendliness for both event-driven and clock-driven deployment (nearly lossless on N-MNIST and 10.1% higher than standard methods on CIFAR10-DVS), enhanced network generalization, and near SOTA performance. To the best of our knowledge, this is the first work to report the results of large-scale SNN deployment on fully event-driven scenarios.

[CV-259] AI-driven Automation of End-to-end Assessment of Suturing Expertise

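[Quick read]: This paper addresses the time- and resource-intensive nature of traditional, human-scored suturing skill assessment. It proposes an AI-based approach to automate the End-to-end Assessment of Suturing Expertise (EASE). The key is real-time score prediction with minimal resources during model inference, which makes real-time feedback to surgeons or trainees possible, potentially accelerating learning of the suturing task, mitigating critical errors during surgery, and ultimately improving patient outcomes.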

Link: https://arxiv.org/abs/2503.17391
Authors: Atharva Deo, Nicholas Matsumoto, Sun Kim, Peter Wager, Randy G. Tsai, Aaron Denmark, Cherine Yang, Xi Li, Jay Moran, Miguel Hernandez, Andrew J. Hung
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:We present an AI-based approach to automate the End-to-end Assessment of Suturing Expertise (EASE), a suturing skills assessment tool that comprehensively defines criteria around relevant sub-skills. While EASE provides granular skills assessment related to suturing to provide trainees with an objective evaluation of their aptitude along with actionable insights, the scoring process is currently performed by human evaluators, which is time and resource consuming. The AI-based approach solves this by enabling real-time score prediction with minimal resources during model inference. This enables the possibility of real-time feedback to the surgeons/trainees, potentially accelerating the learning process for the suturing task and mitigating critical errors during the surgery, improving patient outcomes. In this study, we focus on the following 7 EASE domains that come under 3 suturing phases: 1) Needle Handling: Number of Repositions, Needle Hold Depth, Needle Hold Ratio, and Needle Hold Angle; 2) Needle Driving: Driving Smoothness, and Wrist Rotation; 3) Needle Withdrawal: Wrist Rotation.

[CV-260] DPC: Dual-Prompt Collaboration for Tuning Vision-Language Models CVPR2025

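[Quick read]: This paper addresses the Base-New Trade-off (BNT) problem that is pervasive in CLIP-based prompt tuning: continued fine-tuning on base (target) classes simultaneously degrades generalization to new (unseen) classes. Existing methods regulate the prompt tuning process by adding constraints to balance BNT, but because these constraints act on the same target prompt, they cannot fully avoid the mutual exclusivity between the optimization directions for base and new classes.

The key to the solution is the plug-and-play Dual-Prompt Collaboration (DPC) framework, the first to decouple the optimization of base and new tasks at the prompt level. Specifically, DPC clones a learnable parallel prompt from the backbone prompt and introduces a Weighting-Decoupling framework that independently controls the optimization directions of the dual prompts for base or new tasks, avoiding the conflict in generalization. It also proposes a Dynamic Hard Negative Optimizer, which uses the dual prompts to construct a more challenging optimization task on base classes. For interpretability, the paper proves the feature channel invariance of the prompt vector during optimization, providing theoretical support for DPC's Weighting-Decoupling. Experiments show that DPC significantly improves base-class performance without introducing any external knowledge, while maintaining generalization to new classes.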

Link: https://arxiv.org/abs/2503.13443
Authors: Haoyang Li, Liang Wang, Chao Wang, Jing Jiang, Yan Peng, Guodong Long
Affiliations: Shanghai University; University of Technology Sydney
Categories: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Comments: Accepted by the IEEE/CVF Conference on Computer Vision and Pattern Recognition 2025 (CVPR 2025)

Abstract:The Base-New Trade-off (BNT) problem universally exists during the optimization of CLIP-based prompt tuning, where continuous fine-tuning on base (target) classes leads to a simultaneous decrease of generalization ability on new (unseen) classes. Existing approaches attempt to regulate the prompt tuning process to balance BNT by appending constraints. However, imposed on the same target prompt, these constraints fail to fully avert the mutual exclusivity between the optimization directions for base and new. As a novel solution to this challenge, we propose the plug-and-play Dual-Prompt Collaboration (DPC) framework, the first to decouple the optimization processes of base and new tasks at the prompt level. Specifically, we clone a learnable parallel prompt based on the backbone prompt, and introduce a variable Weighting-Decoupling framework to independently control the optimization directions of dual prompts specific to base or new tasks, thus avoiding the conflict in generalization. Meanwhile, we propose a Dynamic Hard Negative Optimizer, utilizing dual prompts to construct a more challenging optimization task on base classes for enhancement. For interpretability, we prove the feature channel invariance of the prompt vector during the optimization process, providing theoretical support for the Weighting-Decoupling of DPC. Extensive experiments on multiple backbones demonstrate that DPC can significantly improve base performance without introducing any external knowledge beyond the base classes, while maintaining generalization to new classes. Code is available at: this https URL.

[CV-261] Learning to segment anatomy and lesions from disparately labeled sources in brain MRI

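[Quick read]: This paper addresses the segmentation of healthy tissue structures alongside lesions in brain magnetic resonance imaging (MRI). The main challenges are the disruption of anatomy caused by lesions and the lack of jointly labeled training datasets in which both healthy tissues and lesions are annotated. The proposed method decouples healthy tissue and lesion segmentation into two paths and uses an attention mechanism to merge information from multi-sequence acquisitions. The key to the solution is an image-specific adaptation that, at inference time, reduces the adverse influence of lesion regions on healthy tissue predictions; this adaptation is taken into account during training through meta-learning, and co-training is used to learn from disparately labeled datasets. On a publicly available brain glioblastoma dataset, the model outperforms existing methods on several anatomical structures and lesions.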

Link: https://arxiv.org/abs/2503.18840
Authors: Meva Himmetoglu, Ilja Ciernik, Ender Konukoglu
Affiliations: Computer Vision Lab, ETH Zürich, Switzerland; University of Zürich, Switzerland
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Segmenting healthy tissue structures alongside lesions in brain Magnetic Resonance Images (MRI) remains a challenge for today’s algorithms due to lesion-caused disruption of the anatomy and lack of jointly labeled training datasets, where both healthy tissues and lesions are labeled on the same images. In this paper, we propose a method that is robust to lesion-caused disruptions and can be trained from disparately labeled training sets, i.e., without requiring jointly labeled samples, to automatically segment both. In contrast to prior work, we decouple healthy tissue and lesion segmentation in two paths to leverage multi-sequence acquisitions and merge information with an attention mechanism. During inference, an image-specific adaptation reduces adverse influences of lesion regions on healthy tissue predictions. During training, the adaptation is taken into account through meta-learning and co-training is used to learn from disparately labeled training images. Our model shows an improved performance on several anatomical structures and lesions on a publicly available brain glioblastoma dataset compared to the state-of-the-art segmentation methods.

[CV-262] Dual-domain Multi-path Self-supervised Diffusion Model for Accelerated MRI Reconstruction

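[Quick read]: This paper addresses three main challenges in accelerated magnetic resonance imaging (MRI) reconstruction: (1) existing diffusion models usually rely on fully sampled data for training; (2) they are computationally expensive; and (3) they lack uncertainty estimation, which limits their clinical applicability. To overcome these, it proposes the Dual-domain Multi-path Self-supervised Diffusion Model (DMSM). The key is to combine a self-supervised dual-domain diffusion model training scheme, a lightweight hybrid attention network, and a multi-path inference strategy to improve reconstruction accuracy, efficiency, and explainability. In particular, DMSM no longer depends on fully sampled data for training, making it more practical for real-world clinical settings.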

Link: https://arxiv.org/abs/2503.18836
Authors: Yuxuan Zhang, Jinkui Hao, Bo Zhou
Affiliations: Department of Radiology, Northwestern University, Chicago, IL, 60611, USA; Department of Biomedical Engineering, Huazhong University of Science and Technology, Wuhan, China
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages, 8 figures, 5 tables

Abstract:Magnetic resonance imaging (MRI) is a vital diagnostic tool, but its inherently long acquisition times reduce clinical efficiency and patient comfort. Recent advancements in deep learning, particularly diffusion models, have improved accelerated MRI reconstruction. However, existing diffusion models often rely on fully sampled data for training, incur high computational costs, and lack uncertainty estimation, limiting their clinical applicability. To overcome these challenges, we propose a novel framework, called Dual-domain Multi-path Self-supervised Diffusion Model (DMSM), that integrates a self-supervised dual-domain diffusion model training scheme, a lightweight hybrid attention network for the reconstruction diffusion model, and a multi-path inference strategy, to enhance reconstruction accuracy, efficiency, and explainability. Unlike traditional diffusion-based models, DMSM eliminates the dependency on training from fully sampled data, making it more practical for real-world clinical settings. We evaluated DMSM on two human MRI datasets, demonstrating that it achieves favorable performance over several supervised and self-supervised baselines, particularly in preserving fine anatomical structures and suppressing artifacts under high acceleration factors. Additionally, our model generates uncertainty maps that correlate reasonably well with reconstruction errors, offering valuable clinically interpretable guidance and potentially enhancing diagnostic confidence.

[CV-263] Rethinking Glaucoma Calibration: Voting-Based Binocular and Metadata Integration

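[Quick read]: This paper addresses the high subjectivity of glaucoma diagnosis, the risk of overdiagnosis or misdiagnosis, and the poor calibration of existing models by proposing a new framework, the Voting-based Vision Transformer (V-ViT). The key is to account for the systemic, multifactorial nature of glaucoma by integrating binocular data and metadata to reflect the complexity of the diagnostic process, and to introduce an MC dropout-based voting system to mitigate subjectivity. Experiments show that the method achieves state-of-the-art performance across all metrics, including accuracy, and effectively addresses calibration. The method is validated on a custom dataset that includes binocular data.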

Link: https://arxiv.org/abs/2503.18642
Authors: Taejin Jeong, Joohyeok Kim, Jaehoon Joo, Yeonwoo Jung, Hyeonmin Kim, Seong Jae Hwang
Affiliations: Unknown
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Glaucoma is an incurable ophthalmic disease that damages the optic nerve, leads to vision loss, and ranks among the leading causes of blindness worldwide. Diagnosing glaucoma typically involves fundus photography, optical coherence tomography (OCT), and visual field testing. However, the high cost of OCT often leads to reliance on fundus photography and visual field testing, both of which exhibit inherent inter-observer variability. This stems from glaucoma being a multifaceted disease that is influenced by various factors. As a result, glaucoma diagnosis is highly subjective, emphasizing the necessity of calibration, which aligns predicted probabilities with actual disease likelihood. Proper calibration is essential to prevent overdiagnosis or misdiagnosis, which are critical concerns for high-risk diseases. Although AI has significantly improved diagnostic accuracy, overconfidence in models has worsened calibration performance. Recent studies have begun focusing on calibration for glaucoma. Nevertheless, previous studies have not fully considered glaucoma’s systemic nature and the high subjectivity in its diagnostic process. To overcome these limitations, we propose V-ViT (Voting-based ViT), a novel framework that enhances calibration by incorporating disease-specific characteristics. V-ViT integrates binocular data and metadata, reflecting the multi-faceted nature of glaucoma diagnosis. Additionally, we introduce an MC dropout-based Voting System to address high subjectivity. Our approach achieves state-of-the-art performance across all metrics, including accuracy, demonstrating that our proposed methods are effective in addressing calibration issues. We validate our method using a custom dataset including binocular data.
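
Since the abstract defines calibration as aligning predicted probabilities with actual disease likelihood, a short example of how that alignment is commonly measured may help. The sketch below computes the Expected Calibration Error (ECE), a widely used binned metric. It is an assumption that ECE is representative here; the digest does not state which calibration metrics the paper reports, and the function name and bin count are illustrative.

```python
import numpy as np

def expected_calibration_error(confidences, correct, num_bins=10):
    """Binned ECE: weighted mean of |accuracy - confidence| over bins.

    confidences: predicted probability of the chosen class, in (0, 1]
    correct:     1 if the prediction was right, else 0
    """
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, num_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        # Half-open bins (lo, hi]; a confidence of exactly 0 is ignored.
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            gap = abs(correct[in_bin].mean() - confidences[in_bin].mean())
            ece += in_bin.mean() * gap  # weight by the fraction of samples
    return ece
```

A model that says "80% confident" and is right 80% of the time contributes no error, while an overconfident model (100% confident, 50% correct) yields an ECE of 0.5, which is the failure mode the abstract attributes to overconfident models.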

[CV-264] ZECO: ZeroFusion Guided 3D MRI Conditional Generation

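[Quick read]: This paper addresses the small scale of datasets available in clinical practice for medical image segmentation, which results from the expertise and time needed to annotate precise lesion masks. It proposes ZECO, a ZeroFusion-guided 3D MRI conditional generation framework that extracts, compresses, and generates high-fidelity MRI images with corresponding 3D segmentation masks to mitigate data scarcity. The key components are a Spatial Transformation Module, which encodes MRI images into a compact latent space for the diffusion process and thereby captures inter-slice relationships within volumes, and the ZeroFusion method, which progressively maps 3D masks to MRI images in latent space, enabling robust training on limited datasets while avoiding overfitting. ZECO outperforms state-of-the-art models in both quantitative and qualitative evaluations on brain MRI datasets across various modalities.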

Link: https://arxiv.org/abs/2503.18246
Authors: Feiran Wang, Bin Duan, Jiachen Tao, Nikhil Sharma, Dawen Cai, Yan Yan
Affiliations: Illinois Institute of Technology; University of Michigan; University of Illinois Chicago
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL; GitHub code: this https URL

Abstract:Medical image segmentation is crucial for enhancing diagnostic accuracy and treatment planning in Magnetic Resonance Imaging (MRI). However, acquiring precise lesion masks for segmentation model training demands specialized expertise and significant time investment, leading to a small dataset scale in clinical practice. In this paper, we present ZECO, a ZeroFusion guided 3D MRI conditional generation framework that extracts, compresses, and generates high-fidelity MRI images with corresponding 3D segmentation masks to mitigate data scarcity. To effectively capture inter-slice relationships within volumes, we introduce a Spatial Transformation Module that encodes MRI images into a compact latent space for the diffusion process. Moving beyond unconditional generation, our novel ZeroFusion method progressively maps 3D masks to MRI images in latent space, enabling robust training on limited datasets while avoiding overfitting. ZECO outperforms state-of-the-art models in both quantitative and qualitative evaluations on Brain MRI datasets across various modalities, showcasing its exceptional capability in synthesizing high-quality MRI images conditioned on segmentation masks.

[CV-265] SNRAware: Improved Deep Learning MRI Denoising with SNR Unit Training and G-factor Map Augmentation

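[Quick read]: This paper develops and evaluates a new deep learning method for denoising magnetic resonance (MR) images that uses quantitative noise distribution information from the MRI reconstruction process to improve denoising performance and generalization. The key is the proposed SNRAware training scheme, which leverages knowledge of the reconstruction process by simulating large, high-quality, and diverse synthetic datasets and by providing the model with quantitative information about the noise distribution. Experiments show that the best model retains its in-distribution performance while generalizing well to out-of-distribution samples across imaging sequences, dynamically changing contrast, different anatomies, and field strengths.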

Link: https://arxiv.org/abs/2503.18162
Authors: Hui Xue, Sarah M. Hooper, Iain Pierce, Rhodri H. Davies, John Stairs, Joseph Naegele, Adrienne E. Campbell-Washburn, Charlotte Manisty, James C. Moon, Thomas A. Treibel, Peter Kellman, Michael S. Hansen
Affiliations: Unknown
Categories: Medical Physics (physics.med-ph); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments:

Abstract:To develop and evaluate a new deep learning MR denoising method that leverages quantitative noise distribution information from the reconstruction process to improve denoising performance and generalization. This retrospective study trained 14 different transformer and convolutional models with two backbone architectures on a large dataset of 2,885,236 images from 96,605 cardiac retro-gated cine complex series acquired at 3T. The proposed training scheme, termed SNRAware, leverages knowledge of the MRI reconstruction process to improve denoising performance by simulating large, high quality, and diverse synthetic datasets, and providing quantitative information about the noise distribution to the model. In-distribution testing was performed on a hold-out dataset of 3000 samples with performance measured using PSNR and SSIM, with ablation comparison without the noise augmentation. Out-of-distribution tests were conducted on cardiac real-time cine, first-pass cardiac perfusion, and neuro and spine MRI, all acquired at 1.5T, to test model generalization across imaging sequences, dynamically changing contrast, different anatomies, and field strengths. The best model found in the in-distribution test generalized well to out-of-distribution samples, delivering 6.5x and 2.9x CNR improvement for real-time cine and perfusion imaging, respectively. Further, a model trained with 100% cardiac cine data generalized well to a T1 MPRAGE neuro 3D scan and T2 TSE spine MRI.
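
The in-distribution test in this abstract is scored with PSNR. As a minimal sketch of how that metric is usually computed (this is not the authors' evaluation code), the following assumes images are flat sequences of floats and that `data_range` is the maximum possible intensity. The function name `psnr` and its parameters are illustrative assumptions.

```python
import math


def psnr(reference, test, data_range=1.0):
    """Peak signal-to-noise ratio in dB between two equally sized images.

    Images are flat sequences of floats. data_range is the assumed maximum
    possible pixel value (the default of 1.0 is a hypothetical choice).
    """
    if len(reference) != len(test) or not reference:
        raise ValueError("images must be non-empty and the same size")
    # Mean squared error between the two images.
    mse = sum((r - t) ** 2 for r, t in zip(reference, test)) / len(reference)
    if mse == 0:
        return math.inf  # identical images have unbounded PSNR
    return 10.0 * math.log10(data_range ** 2 / mse)
```

A higher value means the denoised image is closer to the reference. Complex-valued or multi-frame data would need the error accumulated over every element first.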

[CV-266] Efficient Deep Learning Approaches for Processing Ultra-Widefield Retinal Imaging

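[Quick read]: This paper addresses two main problems. First, high-performance deep learning models demand substantial computational resources, which limits their use in regions with limited medical resources. Second, there is the accuracy problem of colour fundus photography (CFP) methods: although ultra-widefield (UWF) retinal imaging provides more diagnostic information, most existing research is based on CFP. The key to the solution is to use strategic data augmentation and model ensembles to balance model performance against computational cost on low-performance computing units while taking full advantage of UWF images.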

Link: https://arxiv.org/abs/2503.18151
Authors: Siwon Kim,Wooyung Yun,Jeongbin Oh,Soomok Lee
Affiliations: Unknown
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Deep learning has emerged as the predominant solution for classifying medical images. We intend to apply these developments to the ultra-widefield (UWF) retinal imaging dataset. Since UWF images can accurately diagnose various retina diseases, it is very important to classify them accurately and prevent them with early treatment. However, processing images manually is time-consuming and labor-intensive, and there are two challenges to automating this process. First, high performance usually requires high computational resources. Artificial intelligence medical technology is better suited for places with limited medical resources, but using high-performance processing units in such environments is challenging. Second, there is the problem of the accuracy of colour fundus photography (CFP) methods. In general, the UWF method provides more information for retinal diagnosis than the CFP method, but most of the research has been conducted based on the CFP method. Thus, we demonstrate that these problems can be efficiently addressed in low-performance units using methods such as strategic data augmentation and model ensembles, which balance performance and computational resources while utilizing UWF images.

[CV-267] WISE: A Framework for Gigapixel Whole-Slide-Image Lossless Compression

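[Quick read]: This paper aims to solve the problem that whole-slide images (WSIs) are so large that storing and maintaining them is costly and unsustainable. Existing compression methods perform poorly on WSIs, mainly because of information irregularity in the images. To address this, the paper proposes WISE, an efficient lossless compressor designed specifically for WSIs. The key to WISE is a hierarchical encoding strategy that extracts effective bits to reduce image entropy, combined with a dictionary-based method that handles irregular frequency patterns, achieving an average compression ratio of 36x and up to 136x on gigapixel WSIs.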

Link: https://arxiv.org/abs/2503.18074
Authors: Yu Mao,Jun Wang,Nan Guan,Chun Jason Xue
Affiliations: MBZUAI (Mohamed bin Zayed University of Artificial Intelligence); City University of Hong Kong
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Whole-Slide Images (WSIs) have revolutionized medical analysis by presenting high-resolution images of the whole tissue slide. Despite avoiding the physical storage of the slides, WSIs require considerable data volume, which makes the storage and maintenance of WSI records costly and unsustainable. To this end, this work presents the first investigation of lossless compression of WSI images. Interestingly, we find that most existing compression methods fail to compress the WSI images effectively. Furthermore, our analysis reveals that the failure of existing compressors is mainly due to information irregularity in WSI images. To resolve this issue, we developed a simple yet effective lossless compressor called WISE, specifically designed for WSI images. WISE employs a hierarchical encoding strategy to extract effective bits, which reduces the entropy of the image, and then adopts a dictionary-based method to handle the irregular frequency patterns. Through extensive experiments, we show that WISE can effectively compress gigapixel WSI images by 36 times on average and up to 136 times.
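
The abstract reasons in terms of image entropy and lossless compression ratio. The sketch below shows how those two quantities can be measured on raw bytes. It uses Python's standard `zlib` as a stand-in dictionary-based compressor and is not the WISE algorithm; the function names are hypothetical.

```python
import math
import zlib
from collections import Counter


def shannon_entropy_bits(data: bytes) -> float:
    """Empirical Shannon entropy of a byte string, in bits per byte."""
    if not data:
        return 0.0
    counts = Counter(data)
    total = len(data)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def compression_ratio(data: bytes) -> float:
    """Original size divided by compressed size, using zlib as a stand-in."""
    compressed = zlib.compress(data, 9)
    # Lossless check: decompression must reproduce the input exactly.
    assert zlib.decompress(compressed) == data
    return len(data) / len(compressed)
```

Under this definition, a ratio of 36 means the compressed file is 1/36 of the original size. Lower-entropy input generally leaves more room for a lossless coder.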

[CV-268] Multiple-Particle Autofocusing Algorithm Using Axial Resolution and Morphological Analyses Based on Digital Holography

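[Quick read]: This paper tries to accurately obtain, from a hologram, the 3D position (especially the axial position) of each particle in a dense transparent particle solution, along with the particle count. The key to the solution is to first use morphological analysis and constrained intensity to extract information on candidate focused particles from the reconstructed images; next, to use axial resolution to determine the real focused particles; and finally, to secure all focused particles based on the mean intensity and equivalent diameter of each candidate. The method can quickly provide relatively accurate ground-truth axial positions, addressing the autofocusing difficulties caused by dense particles.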

Link: https://arxiv.org/abs/2503.18038
Authors: Wei-Na Li,Yi Zhou,Jiatai Chen,Hongjie Ou,XiangSheng Xie
Affiliations: Unknown
Categories: Signal Processing (eess.SP); Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments:

Abstract:We propose an autofocusing algorithm to obtain, relatively accurately, the 3D position of each particle, particularly its axial location, and particle number of a dense transparent particle solution via its hologram. First, morphological analyses and constrained intensity are used on raw reconstructed images to obtain information on candidate focused particles. Second, axial resolution is used to obtain the real focused particles. Based on the mean intensity and equivalent diameter of each candidate focused particle, all focused particles are eventually secured. Our proposed method can rapidly provide relatively accurate ground-truth axial positions to solve the autofocusing problem that occurs with dense particles.

[CV-269] PathoHR: Breast Cancer Survival Prediction on High-Resolution Pathological Images

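[Quick read]: This paper aims to address the challenge of breast cancer survival prediction in computational pathology, which arises mainly because tumor heterogeneity makes it hard to extract representative features from whole slide images (WSIs) that truly reflect a tumor's aggressive potential and prognosis. The key to the proposed solution is the design of the PathoHR pipeline: first, a plug-and-play high-resolution Vision Transformer (ViT) enhances patch-wise representation for more detailed and comprehensive feature extraction; second, multiple advanced similarity metrics are systematically evaluated to optimize feature comparison and better capture tumor characteristics; finally, the paper shows that prediction with resolution-enhanced small patches can match or exceed the accuracy of raw larger patches while greatly reducing computational overhead. Together these points form the core contribution of PathoHR.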

Link: https://arxiv.org/abs/2503.17970
Authors: Yang Luo,Shiru Wang,Jun Liu,Jiaxuan Xiao,Rundong Xue,Zeyu Zhang,Hao Zhang,Yu Lu,Yang Zhao,Yutong Xie
Affiliations: Unknown
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Breast cancer survival prediction in computational pathology presents a remarkable challenge due to tumor heterogeneity. For instance, different regions of the same tumor in the pathology image can show distinct morphological and molecular characteristics. This makes it difficult to extract representative features from whole slide images (WSIs) that truly reflect the tumor's aggressive potential and likely survival outcomes. In this paper, we present PathoHR, a novel pipeline for accurate breast cancer survival prediction that enhances any size of pathological images to enable more effective feature learning. Our approach entails (1) the incorporation of a plug-and-play high-resolution Vision Transformer (ViT) to enhance patch-wise WSI representation, enabling more detailed and comprehensive feature extraction, (2) the systematic evaluation of multiple advanced similarity metrics for comparing WSI-extracted features, optimizing the representation learning process to better capture tumor characteristics, and (3) the demonstration that smaller image patches enhanced following the proposed pipeline can achieve equivalent or superior prediction accuracy compared to raw larger patches, while significantly reducing computational overhead. Experimental findings validate that PathoHR provides a potential way of integrating enhanced image resolution with optimized feature learning to advance computational pathology, offering a promising direction for more accurate and efficient breast cancer survival prediction. Code will be available at this https URL.

[CV-270] Cat-AIR: Content and Task-Aware All-in-One Image Restoration

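[Quick read]: This paper aims to solve the problem that existing single models handle multiple image degradation types neither effectively nor efficiently. To meet this challenge, it proposes a new framework called Cat-AIR (Content And Task-aware All-in-one Image Restoration). The key innovation of Cat-AIR is an alternating spatial-channel attention mechanism that adaptively balances local and global information across tasks. Specifically, cross-layer channel attention and cross-feature spatial attention allocate computation dynamically according to content and task complexity. In addition, a smooth learning strategy lets the model adapt to new restoration tasks while maintaining good performance on existing ones. Experiments show that Cat-AIR achieves state-of-the-art performance on a range of image restoration tasks and delivers efficient all-in-one image restoration with fewer floating-point operations (FLOPs).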

Link: https://arxiv.org/abs/2503.17915
Authors: Jiachen Jiang,Tianyu Ding,Ke Zhang,Jinxin Zhou,Tianyi Chen,Ilya Zharkov,Zhihui Zhu,Luming Liang
Affiliations: Institution1; Institution2
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract:All-in-one image restoration seeks to recover high-quality images from various types of degradation using a single model, without prior knowledge of the corruption source. However, existing methods often struggle to effectively and efficiently handle multiple degradation types. We present Cat-AIR, a novel Content And Task-aware framework for All-in-one Image Restoration. Cat-AIR incorporates an alternating spatial-channel attention mechanism that adaptively balances the local and global information for different tasks. Specifically, we introduce cross-layer channel attentions and cross-feature spatial attentions that allocate computations based on content and task complexity. Furthermore, we propose a smooth learning strategy that allows for seamless adaptation to new restoration tasks while maintaining performance on existing ones. Extensive experiments demonstrate that Cat-AIR achieves state-of-the-art results across a wide range of restoration tasks, requiring fewer FLOPs than previous methods, establishing new benchmarks for efficient all-in-one image restoration.

[CV-271] Multi-Disease-Aware Training Strategy for Cardiac MR Image Segmentation

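[Quick read]: This paper aims to solve the problem of poor right ventricle (RV) segmentation in cardiac magnetic resonance images (CMRIs). Current deep-learning segmentation methods perform well on regularly shaped organs such as the left ventricle and the myocardium but poorly on the irregularly shaped right ventricle. The paper argues that this limitation stems from insufficient generalization ability: existing models cannot cope with shifts in the target distribution across slices, cardiac phases, and disease conditions. To address this, the paper proposes a Multi-Disease-Aware Training Strategy (MTS), supported by restructuring the datasets into multi-disease datasets and by a specialized data preprocessing technique. Experiments validate the method, particularly its gains in RV segmentation accuracy, and show that the models remain robust on data from unknown diseases.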

Link: https://arxiv.org/abs/2503.17896
Authors: Hong Zheng(1 and 2 and 4),Yucheng Chen(3),Nan Mu(1 and 4 and 5),Xiaoning Li(1 and 4 and 5) ((1) College of Computer Science, Sichuan Normal University, Chengdu, China, (2) School of Computing and Artificial Intelligence, Southwest Jiaotong University, Chengdu, China, (3) Department of Cardiology, West China Hospital, Sichuan University, Chengdu, China, (4) Visual Computing and Virtual Reality Key Laboratory of Sichuan Province, Chengdu, China, (5) Sichuan 2011 Collaborative Innovation Center for Educational Big Data, Chengdu, China)
Affiliations: College of Computer Science, Sichuan Normal University, Chengdu, 610101, China; School of Computing and Artificial Intelligence, Southwest Jiaotong University, Chengdu, 611756, China; Department of Cardiology, West China Hospital, Sichuan University, Chengdu, 610041, China; Visual Computing and Virtual Reality Key Laboratory of Sichuan Province, Chengdu, 610066, China; Sichuan 2011 Collaborative Innovation Center for Educational Big Data, Chengdu, 610066, China
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Accurate segmentation of the ventricles from cardiac magnetic resonance images (CMRIs) is crucial for enhancing the diagnosis and analysis of heart conditions. Deep learning-based segmentation methods have recently garnered significant attention due to their impressive performance. However, these segmentation methods are typically good at partitioning regularly shaped organs, such as the left ventricle (LV) and the myocardium (MYO), whereas they perform poorly on irregularly shaped organs, such as the right ventricle (RV). In this study, we argue that this limitation of segmentation models stems from their insufficient generalization ability to address the distribution shift of segmentation targets across slices, cardiac phases, and disease conditions. To overcome this issue, we present a Multi-Disease-Aware Training Strategy (MTS) and restructure the introduced CMRI datasets into multi-disease datasets. Additionally, we propose a specialized data processing technique for preprocessing input images to support the MTS. To validate the effectiveness of our method, we performed control group experiments and cross-validation tests. The experimental results show that (1) network models trained using our proposed strategy achieved superior segmentation performance, particularly in RV segmentation, and (2) these networks exhibited robust performance even when applied to data from unknown diseases.

[CV-272] FundusGAN: A Hierarchical Feature-Aware Generative Framework for High-Fidelity Fundus Image Generation

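[Quick read]: This paper aims to address the heavy dependence of ophthalmology foundation models on large-scale datasets during pre-training, which is a major barrier to model development and deployment. To meet this challenge, it proposes FundusGAN, a novel hierarchical feature-aware generative framework designed for high-fidelity fundus image synthesis. The key to the solution is an encoder built on a Feature Pyramid Network, which comprehensively extracts multi-scale information and captures both large anatomical structures and subtle pathological features. In addition, by modifying a StyleGAN-based generator with dilated convolutions and strategic upsampling adjustments, FundusGAN preserves critical retinal structures while enhancing the representation of pathological detail. These advances make FundusGAN an effective foundation model component for mitigating data scarcity in ophthalmological AI research.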

Link: https://arxiv.org/abs/2503.17831
Authors: Qingshan Hou,Meng Wang,Peng Cao,Zou Ke,Xiaoli Liu,Huazhu Fu,Osmar R. Zaiane
Affiliations: Unknown
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Recent advancements in ophthalmology foundation models such as RetFound have demonstrated remarkable diagnostic capabilities but require massive datasets for effective pre-training, creating significant barriers for development and deployment. To address this critical challenge, we propose FundusGAN, a novel hierarchical feature-aware generative framework specifically designed for high-fidelity fundus image synthesis. Our approach leverages a Feature Pyramid Network within its encoder to comprehensively extract multi-scale information, capturing both large anatomical structures and subtle pathological features. The framework incorporates a modified StyleGAN-based generator with dilated convolutions and strategic upsampling adjustments to preserve critical retinal structures while enhancing pathological detail representation. Comprehensive evaluations on the DDR, DRIVE, and IDRiD datasets demonstrate that FundusGAN consistently outperforms state-of-the-art methods across multiple metrics (SSIM: 0.8863, FID: 54.2, KID: 0.0436 on DDR). Furthermore, disease classification experiments reveal that augmenting training data with FundusGAN-generated images significantly improves diagnostic accuracy across multiple CNN architectures (up to 6.49% improvement with ResNet50). These results establish FundusGAN as a valuable foundation model component that effectively addresses data scarcity challenges in ophthalmological AI research, enabling more robust and generalizable diagnostic systems while reducing dependency on large-scale clinical data collection.

[CV-273] DVG-Diffusion: Dual-View Guided Diffusion Model for CT Reconstruction from X-Rays

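[Quick read]: This paper addresses the challenging task of directly reconstructing a 3D CT volume from few-view 2D X-ray images. The main difficulty for conventional methods is that X-ray images are projection views of the 3D CT volume and lack sufficient information for accurate reconstruction. The paper's solution is to simplify the complex 2D X-ray to 3D CT mapping by introducing new view synthesis, and to reduce the learning difficulty through view-guided feature alignment. Specifically, it proposes the Dual-View Guided Diffusion Model (DVG-Diffusion), which couples a real input X-ray view with a synthesized new X-ray view to jointly guide CT reconstruction. First, a novel view parameter-guided encoder captures X-ray features that are spatially aligned with CT; next, the extracted dual-view features serve as conditions for a latent diffusion model that learns and refines the CT latent representation; finally, the latent representation is decoded into a CT volume in pixel space. With view parameter-guided encoding and dual-view guided reconstruction, DVG-Diffusion strikes an effective balance between high fidelity and perceptual quality. Experiments show that the method outperforms existing state-of-the-art techniques, and the paper provides a comprehensive analysis and discussion of views based on the experiments.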

Link: https://arxiv.org/abs/2503.17804
Authors: Xing Xie,Jiawei Liu,Huijie Fan,Zhi Han,Yandong Tang,Liangqiong Qu
Affiliations: State Key Laboratory of Robotics, Shenyang Institute of Automation, Chinese Academy of Sciences; University of Chinese Academy of Sciences; Department of Statistics and Actuarial Science and the Institute of Data Science, The University of Hong Kong
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Directly reconstructing 3D CT volume from few-view 2D X-rays using an end-to-end deep learning network is a challenging task, as X-ray images are merely projection views of the 3D CT volume. In this work, we facilitate complex 2D X-ray image to 3D CT mapping by incorporating new view synthesis, and reduce the learning difficulty through view-guided feature alignment. Specifically, we propose a dual-view guided diffusion model (DVG-Diffusion), which couples a real input X-ray view and a synthesized new X-ray view to jointly guide CT reconstruction. First, a novel view parameter-guided encoder captures features from X-rays that are spatially aligned with CT. Next, we concatenate the extracted dual-view features as conditions for the latent diffusion model to learn and refine the CT latent representation. Finally, the CT latent representation is decoded into a CT volume in pixel space. By incorporating view parameter guided encoding and dual-view guided CT reconstruction, our DVG-Diffusion can achieve an effective balance between high fidelity and perceptual quality for CT reconstruction. Experimental results demonstrate our method outperforms state-of-the-art methods. Based on experiments, the comprehensive analysis and discussions for views and reconstruction are also presented.

[CV-274] Assessing workflow impact and clinical utility of AI-assisted brain aneurysm detection: a multi-reader study

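[Quick read]: This paper assesses the applicability and utility of an AI-based brain aneurysm detection model in a clinical setting by comparing the performance of two radiologists with different levels of experience (2 and 13 years), answering two questions: 1) Does the AI algorithm improve the readers' detection performance? 2) How much does the AI algorithm affect routine clinical workflow? The key is that the authors reuse and enlarge an open-access Time-Of-Flight Magnetic Resonance Angiography (TOF MRA) dataset and verify the practical effect of AI-assisted diagnosis through an actual reading session, rather than focusing only on the algorithm's technical performance. The study finds that although the AI model reaches state-of-the-art results on the test set (sensitivity=74%, false positive rate=1.6), it does not significantly increase the readers' sensitivity and it lengthens their reading time, indicating that AI assistance has limited effect on clinical efficiency and diagnostic confidence. The paper therefore stresses the importance of validating AI algorithms in real clinical settings to ensure their effectiveness and utility.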

Link: https://arxiv.org/abs/2503.17786
Authors: Tommaso Di Noto,Sofyan Jankowski,Francesco Puccinelli,Guillaume Marie,Sebastien Tourbier,Yasser Aleman-Gomez,Oscar Esteban,Ricardo Corredor-Jerez,Guillaume Saliou,Patric Hagmann,Meritxell Bach Cuadra,Jonas Richiardi
Affiliations: Unknown
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: Paper under review with a Journal in the medical imaging field

Abstract:Despite the plethora of AI-based algorithms developed for anomaly detection in radiology, subsequent integration into clinical settings is rarely evaluated. In this work, we assess the applicability and utility of an AI-based model for brain aneurysm detection comparing the performance of two readers with different levels of experience (2 and 13 years). We aim to answer the following questions: 1) Do the readers improve their performance when assisted by the AI algorithm? 2) How much does the AI algorithm impact routine clinical workflow? We reuse and enlarge our open-access, Time-Of-Flight Magnetic Resonance Angiography dataset (N=460). We use 360 subjects for training/validating our algorithm and 100 as unseen test set for the reading session. Even though our model reaches state-of-the-art results on the test set (sensitivity=74%, false positive rate=1.6), we show that neither the junior nor the senior reader significantly increases their sensitivity (p=0.59, p=1, respectively). In addition, we find that reading time for both readers is significantly higher in the "AI-assisted" setting than in the "Unassisted" (+15 seconds, on average; p=3x10^(-4) junior, p=3x10^(-5) senior). The confidence reported by the readers is unchanged across the two settings, indicating that the AI assistance does not influence the certainty of the diagnosis. Our findings highlight the importance of clinical validation of AI algorithms in a clinical setting involving radiologists. This study should serve as a reminder to the community to always examine the real-world effectiveness and workflow impact of proposed algorithms.

[CV-275] Hierarchy-Aware and Channel-Adaptive Semantic Communication for Bandwidth-Limited Data Fusion

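[Quick read]: This paper addresses the high cost and large data volume of acquiring high-resolution hyperspectral images (HR-HSI) by fusing low-resolution hyperspectral images (LR-HSI) with high-resolution RGB images (HR-RGB) to meet practical needs. Traditional fusion techniques, however, are inefficient because they significantly increase bandwidth consumption. To meet these challenges, the paper proposes a hierarchy-aware and channel-adaptive semantic communication approach for bandwidth-limited data fusion. Its key is a hierarchical correlation module that preserves both the overall structural information of the image and the details needed for super-resolution, and that efficiently combines deep semantic and shallow features from LR-HSI and HR-RGB. In addition, a Transformer-based channel-adaptive attention mechanism dynamically integrates and transmits deep and shallow features, enabling efficient data transmission and high-quality HR-HSI reconstruction. Experiments show that the method improves peak signal-to-noise ratio (PSNR) by up to 2 dB over single-source transmission while cutting bandwidth consumption by two-thirds, demonstrating its effectiveness in bandwidth-limited settings.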

Link: https://arxiv.org/abs/2503.17777
Authors: Lei Guo,Wei Chen,Yuxuan Sun,Bo Ai,Nikolaos Pappas,Tony Quek
Affiliations: State Key Laboratory of Advanced Rail Autonomous Operation, Beijing Jiaotong University, China; School of Electronic and Information Engineering, Beijing Jiaotong University, Beijing, China; Department of Computer and Information Science, Linköping University, Linköping, Sweden; Information Systems Technology and Design, Singapore University of Technology and Design, Singapore
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by the WCL

Abstract:Obtaining high-resolution hyperspectral images (HR-HSI) is costly and data-intensive, making it necessary to fuse low-resolution hyperspectral images (LR-HSI) with high-resolution RGB images (HR-RGB) for practical applications. However, traditional fusion techniques, which integrate detailed information into the reconstruction, significantly increase bandwidth consumption compared to directly transmitting raw data. To overcome these challenges, we propose a hierarchy-aware and channel-adaptive semantic communication approach for bandwidth-limited data fusion. A hierarchical correlation module is proposed to preserve both the overall structural information and the details of the image required for super-resolution. This module efficiently combines deep semantic and shallow features from LR-HSI and HR-RGB. To further reduce bandwidth usage while preserving reconstruction quality, a channel-adaptive attention mechanism based on Transformer is proposed to dynamically integrate and transmit the deep and shallow features, enabling efficient data transmission and high-quality HR-HSI reconstruction. Experimental results on the CAVE and Washington DC Mall datasets demonstrate that our method outperforms single-source transmission, achieving up to a 2 dB improvement in peak signal-to-noise ratio (PSNR). Additionally, it reduces bandwidth consumption by two-thirds, confirming its effectiveness in bandwidth-constrained environments for HR-HSI reconstruction tasks.

[CV-276] ModalTune: Fine-Tuning Slide-Level Foundation Models with Multi-Modal Information for Multi-task Learning in Digital Pathology

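[Quick read]: This paper addresses the prediction challenges in digital pathology caused by the massive size of whole-slide images (WSIs) and weak training signals, and proposes improvements for slide-level foundation models (SLFMs) in low-data regimes. Its key contribution is ModalTune, a novel fine-tuning framework that introduces a Modal Adapter to integrate new modalities without modifying SLFM weights, and uses large language models (LLMs) to encode labels as text, capturing semantic relationships and enhancing generalization across multiple tasks and cancer types. The approach achieves state-of-the-art results across four cancer types, remains competitive in multi-modal, multi-task, and pan-cancer settings, and generalizes well to two out-of-distribution datasets.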

Link: https://arxiv.org/abs/2503.17564
Authors: Vishwesh Ramanathan,Tony Xu,Pushpak Pati,Faruk Ahmed,Maged Goubran,Anne L. Martel
Affiliations: Physical Sciences Platform, Sunnybrook Research Institute, Canada; Department of Medical Biophysics, University of Toronto, Canada; Hurvitz Brain Sciences, Sunnybrook Health Sciences Centre, Canada; Harquail Centre for Neuromodulation, Sunnybrook Health Sciences Centre, Canada; Google Research, USA
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract:Prediction tasks in digital pathology are challenging due to the massive size of whole-slide images (WSIs) and the weak nature of training signals. Advances in computing, data availability, and self-supervised learning (SSL) have paved the way for slide-level foundation models (SLFMs) that can improve prediction tasks in low-data regimes. However, working with these models is challenging, with issues such as catastrophic forgetting during fine-tuning and under-utilization of shared information between tasks and modalities. To overcome these two challenges, we propose ModalTune, a novel fine-tuning framework which introduces the Modal Adapter to integrate new modalities without modifying SLFM weights. Additionally, we use large-language models (LLMs) to encode labels as text, capturing semantic relationships and enhancing generalization across multiple tasks and cancer types in a single training recipe. ModalTune achieves state-of-the-art (SOTA) results against both uni-modal and multi-modal models across four cancer types, jointly improving survival and cancer subtype prediction while remaining competitive in pan-cancer settings. Additionally, we show ModalTune is highly generalizable to two out-of-distribution (OOD) datasets. To our knowledge, this is the first unified fine-tuning framework for multi-modal, multi-task, and pan-cancer modeling in digital pathology.

[CV-277] Echo-E3Net: Efficient Endo-Epi Spatio-Temporal Network for Ejection Fraction Estimation MICCAI2025

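[Quick read]: This paper addresses the problem that conventional left ventricular ejection fraction (LVEF) estimation is time-consuming and operator-dependent, while existing deep learning methods are too computationally demanding for real-time clinical use. Moreover, current models often overlook the interplay between spatial and temporal features, which is crucial for accurate LVEF estimation. The paper therefore proposes Echo-E³Net, an efficient Endo-Epi spatio-temporal network for LVEF estimation. Its key components are the Endo-Epi Cardial Border Detector (E²CBD) module, which enhances feature extraction using spatial and temporal landmark cues, and the Endo-Epi Feature Aggregator (E²FA) module, which distills statistical descriptors from backbone feature maps to refine the final EF prediction. Combined with a specially designed multi-component loss function, these components improve spatial-temporal representation learning and support robust, efficient EF estimation.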

Link: https://arxiv.org/abs/2503.17543
Authors: Moein Heidari,Afshin Bozorgpour,AmirHossein Zarif-Fakharnia,Dorit Merhof,Ilker Hacihaliloglu
Affiliations: Unknown
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: Submitted as a conference paper to MICCAI 2025

Abstract:Left ventricular ejection fraction (LVEF) is a critical metric for assessing cardiac function, widely used in diagnosing heart failure and guiding clinical decisions. Despite its importance, conventional LVEF estimation remains time-consuming and operator-dependent. Recent deep learning advancements have enhanced automation, yet many existing models are computationally demanding, hindering their feasibility for real-time clinical applications. Additionally, the interplay between spatial and temporal features is crucial for accurate estimation but is often overlooked. In this work, we propose Echo-E³Net, an efficient Endo-Epi spatio-temporal network tailored for LVEF estimation. Our method introduces the Endo-Epi Cardial Border Detector (E²CBD) module, which enhances feature extraction by leveraging spatial and temporal landmark cues. Complementing this, the Endo-Epi Feature Aggregator (E²FA) distills statistical descriptors from backbone feature maps, refining the final EF prediction. These modules, along with a multi-component loss function tailored to align with the clinical definition of EF, collectively enhance spatial-temporal representation learning, ensuring robust and efficient EF estimation. We evaluate Echo-E³Net on the EchoNet-Dynamic dataset, achieving an RMSE of 5.15 and an R² score of 0.82, setting a new benchmark in efficiency with 6.8 million parameters and only 8.49 GFLOPs. Our model operates without pre-training, data augmentation, or ensemble methods, making it well-suited for real-time point-of-care ultrasound (PoCUS) applications. Our code is publicly available on GitHub (this https URL).
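
The abstract refers to the clinical definition of EF and reports RMSE. As a small illustration (not the paper's loss function or evaluation code), the sketch below uses the standard definition EF = (EDV - ESV) / EDV x 100, where EDV and ESV are the end-diastolic and end-systolic volumes. The function and parameter names are assumptions made for this example.

```python
import math


def ejection_fraction(edv_ml: float, esv_ml: float) -> float:
    """Ejection fraction in percent from end-diastolic and end-systolic volumes."""
    if edv_ml <= 0 or not 0 <= esv_ml <= edv_ml:
        raise ValueError("require EDV > 0 and 0 <= ESV <= EDV")
    return 100.0 * (edv_ml - esv_ml) / edv_ml


def rmse(predicted, actual):
    """Root-mean-square error between predicted and reference EF values."""
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("inputs must be non-empty and the same length")
    squared = sum((p - a) ** 2 for p, a in zip(predicted, actual))
    return math.sqrt(squared / len(predicted))
```

For example, `ejection_fraction(100, 40)` gives 60.0. Under this definition, an RMSE of 5.15 corresponds to a typical error of about five EF percentage points.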

[CV-278] MM-UNet: Meta Mamba UNet for Medical Image Segmentation

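[Quick read]: This paper addresses the challenges of applying State Space Models (SSMs) directly to medical image segmentation, namely discontinuities when handling 3D spatial structures and difficulty fitting high-variance data. The key to the solution is Meta Mamba UNet (MM-UNet), a unified U-shaped encoder-decoder architecture. MM-UNet keeps the advantages of SSMs by integrating them into hybrid modules within residual connections, and mitigates discontinuities in medical images through a bi-directional scan order strategy, which improves segmentation performance. Experiments show that MM-UNet reaches Dice scores of 91.0% on AMOS2022 and 87.1% on Synapse, outperforming existing methods.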

Link: https://arxiv.org/abs/2503.17540
Authors: Bin Xie,Yan Yan,Gady Agam
Affiliations: Department of Computer Science, Illinois Institute of Technology, USA; Department of Computer Science, University of Illinois Chicago, USA
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:State Space Models (SSMs) have recently demonstrated outstanding performance in long-sequence modeling, particularly in natural language processing. However, their direct application to medical image segmentation poses several challenges. SSMs, originally designed for 1D sequences, struggle with 3D spatial structures in medical images due to discontinuities introduced by flattening. Additionally, SSMs have difficulty fitting high-variance data, which is common in medical imaging. In this paper, we analyze the intrinsic limitations of SSMs in medical image segmentation and propose a unified U-shaped encoder-decoder architecture, Meta Mamba UNet (MM-UNet), designed to leverage the advantages of SSMs while mitigating their drawbacks. MM-UNet incorporates hybrid modules that integrate SSMs within residual connections, reducing variance and improving performance. Furthermore, we introduce a novel bi-directional scan order strategy to alleviate discontinuities when processing medical images. Extensive experiments on the AMOS2022 and Synapse datasets demonstrate the superiority of MM-UNet over state-of-the-art methods. MM-UNet achieves a Dice score of 91.0% on AMOS2022, surpassing nnUNet by 3.2%, and a Dice score of 87.1% on Synapse. These results confirm the effectiveness of integrating SSMs in medical image segmentation through architectural design optimizations.
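
The results in this abstract are reported as Dice scores. A minimal sketch of the Dice coefficient for a single binary mask follows, assuming masks are flat sequences of 0/1 values. This is not the paper's evaluation code, and multi-organ benchmarks typically average the score over classes and cases. The function name and the empty-mask convention are assumptions.

```python
def dice_score(pred, target):
    """Dice coefficient between two binary masks given as flat 0/1 sequences."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of elements")
    # Count voxels that are foreground in both masks.
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    # Total foreground voxels across the two masks.
    size_sum = sum(1 for p in pred if p) + sum(1 for t in target if t)
    if size_sum == 0:
        return 1.0  # both masks empty: treated here as perfect agreement
    return 2.0 * intersection / size_sum
```

A score of 1.0 means the predicted and reference masks overlap exactly, and 0.0 means they share no foreground voxels.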

Artificial Intelligence

[AI-0] Statistical Proof of Execution (SPEX)

Link: https://arxiv.org/abs/2503.18899
Authors: Michele Dallachiesa,Antonio Pitasi,David Pinger,Josh Goodbody,Luis Vaello
Categories: Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Comments:

Abstract:Many real-world applications are increasingly incorporating automated decision-making, driven by the widespread adoption of ML/AI inference for planning and guidance. This study examines the growing need for verifiable computing in autonomous decision-making. We formalize the problem of verifiable computing and introduce a sampling-based protocol that is significantly faster, more cost-effective, and simpler than existing methods. Furthermore, we tackle the challenges posed by non-determinism, proposing a set of strategies to effectively manage common scenarios.

[AI-1] Bootstrapped Model Predictive Control ICLR2025

Link: https://arxiv.org/abs/2503.18871
Authors: Yuhang Wang,Hanwei Guo,Sizhe Wang,Long Qian,Xuguang Lan
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments: Published as a conference paper at ICLR 2025

Abstract:Model Predictive Control (MPC) has been demonstrated to be effective in continuous control tasks. When a world model and a value function are available, planning a sequence of actions ahead of time leads to a better policy. Existing methods typically obtain the value function and the corresponding policy in a model-free manner. However, we find that such an approach struggles with complex tasks, resulting in poor policy learning and inaccurate value estimation. To address this problem, we leverage the strengths of MPC itself. In this work, we introduce Bootstrapped Model Predictive Control (BMPC), a novel algorithm that performs policy learning in a bootstrapped manner. BMPC learns a network policy by imitating an MPC expert, and in turn, uses this policy to guide the MPC process. Combined with model-based TD-learning, our policy learning yields better value estimation and further boosts the efficiency of MPC. We also introduce a lazy reanalyze mechanism, which enables computationally efficient imitation learning. Our method achieves superior performance over prior works on diverse continuous control tasks. In particular, on challenging high-dimensional locomotion tasks, BMPC significantly improves data efficiency while also enhancing asymptotic performance and training stability, with comparable training time and smaller network sizes. Code is available at this https URL.

[AI-2] Structuring Scientific Innovation: A Framework for Modeling and Discovering Impactful Knowledge Combinations

Link: https://arxiv.org/abs/2503.18865
Authors: Junlan Chen,Kexin Zhang,Daifeng Li,Yangyang Feng,Yuxuan Zhang,Bowen Deng
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:The emergence of large language models offers new possibilities for structured exploration of scientific knowledge. Rather than viewing scientific discovery as isolated ideas or content, we propose a structured approach that emphasizes the role of method combinations in shaping disruptive insights. Specifically, we investigate how knowledge units, especially those tied to methodological design, can be modeled and recombined to yield research [...]. The proposed framework addresses two key challenges. First, we introduce a contrastive learning-based mechanism to identify distinguishing features of historically disruptive method combinations within problem-driven [...]. Second, we propose a reasoning-guided Monte Carlo search algorithm that leverages the chain-of-thought capability of LLMs to identify promising knowledge recombinations for new problem statements. Empirical studies across multiple domains show that the framework is capable of modeling the structural dynamics of innovation and successfully highlights combinations with high disruptive [...]. This research provides a new path for computationally guided scientific ideation grounded in structured reasoning and historical data modeling.

[AI-3] Self-Organizing Graph Reasoning Evolves into a Critical State for Continuous Discovery Through Structural-Semantic Dynamics

Link: https://arxiv.org/abs/2503.18852
Authors: Markus J. Buehler
Categories: Artificial Intelligence (cs.AI); Mesoscale and Nanoscale Physics (cond-mat.mes-hall); Machine Learning (cs.LG); Adaptation and Self-Organizing Systems (nlin.AO); Applied Physics (physics.app-ph)
Comments:

Abstract:We report fundamental insights into how agentic graph reasoning systems spontaneously evolve toward a critical state that sustains continuous semantic discovery. By rigorously analyzing structural (Von Neumann graph entropy) and semantic (embedding) entropy, we identify a subtle yet robust regime in which semantic entropy persistently dominates over structural entropy. This interplay is quantified by a dimensionless Critical Discovery Parameter that stabilizes at a small negative value, indicating a consistent excess of semantic entropy. Empirically, we observe a stable fraction (12%) of “surprising” edges, links between semantically distant concepts, providing evidence of long-range or cross-domain connections that drive continuous innovation. Concomitantly, the system exhibits scale-free and small-world topological features, alongside a negative cross-correlation between structural and semantic measures, reinforcing the analogy to self-organized criticality. These results establish clear parallels with critical phenomena in physical, biological, and cognitive complex systems, revealing an entropy-based principle governing adaptability and continuous innovation. Crucially, semantic richness emerges as the underlying driver of sustained exploration, despite not being explicitly used by the reasoning process. Our findings provide interdisciplinary insights and practical strategies for engineering intelligent systems with intrinsic capacities for long-term discovery and adaptation, and offer insights into how model training strategies can be developed that reinforce critical discovery.
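As a rough illustration of the structural measure named above, one common definition of Von Neumann graph entropy treats the graph Laplacian, scaled to unit trace, as a density matrix and takes the Shannon entropy of its eigenvalues. The sketch below follows that definition. Whether the paper uses exactly this normalization is an assumption, as are the function name and the choice of natural logarithm.

```python
import numpy as np

def von_neumann_graph_entropy(adjacency):
    """Entropy of rho = L / trace(L), where L = D - A is the graph Laplacian."""
    a = np.asarray(adjacency, dtype=float)
    laplacian = np.diag(a.sum(axis=1)) - a
    trace = np.trace(laplacian)
    if trace <= 0:
        raise ValueError("graph has no edges, so the density matrix is undefined")
    # eigvalsh is appropriate because an undirected graph gives a symmetric matrix.
    eigenvalues = np.linalg.eigvalsh(laplacian / trace)
    eigenvalues = eigenvalues[eigenvalues > 1e-12]  # 0 * log(0) is taken as 0
    return float(-np.sum(eigenvalues * np.log(eigenvalues)))
```

For a triangle graph the scaled Laplacian has eigenvalues 0, 1/2 and 1/2, so the entropy is ln 2. A single edge gives eigenvalues 0 and 1 and therefore zero entropy.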

[AI-4] Three Kinds of AI Ethics

Link: https://arxiv.org/abs/2503.18842
Authors: Emanuele Ratti
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments: 16 pages, two figures

Abstract:There is an overwhelming abundance of works in AI Ethics. This growth is chaotic because of how sudden it is, its volume, and its multidisciplinary nature. This makes it difficult to keep track of debates, and to systematically characterize goals, research questions, methods, and expertise required by AI ethicists. In this article, I show that the relation between AI and ethics can be characterized in at least three ways, which correspond to three well-represented kinds of AI ethics: ethics and AI; ethics in AI; ethics of AI. I elucidate the features of these three kinds of AI Ethics, characterize their research questions, and identify the kind of expertise that each kind needs. I also show how certain criticisms of AI ethics are misplaced, as they are made from the point of view of one kind of AI ethics and directed at another kind with different goals. All in all, this work sheds light on the nature of AI ethics, and sets the grounds for more informed discussions about the scope, methods, and training of AI ethicists.

[AI-5] Interpretable and Fair Mechanisms for Abstaining Classifiers ECML KDD2024

Link: https://arxiv.org/abs/2503.18826
Authors: Daphne Lenders,Andrea Pugnana,Roberto Pellungrini,Toon Calders,Dino Pedreschi,Fosca Giannotti
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 25 pages, 8 figures. In: Machine Learning and Knowledge Discovery in Databases. Research Track. ECML PKDD 2024

Abstract:Abstaining classifiers have the option to refrain from providing a prediction for instances that are difficult to classify. The abstention mechanism is designed to trade off the classifier’s performance on the accepted data while ensuring a minimum number of predictions. In this setting, fairness concerns often arise when the abstention mechanism solely reduces errors for the majority groups of the data, resulting in increased performance differences across demographic groups. While there exist several methods that aim to reduce discrimination when abstaining, there is no mechanism that can do so in an explainable way. In this paper, we fill this gap by introducing the Interpretable and Fair Abstaining Classifier (IFAC), an algorithm that can reject predictions both based on their uncertainty and their unfairness. By rejecting possibly unfair predictions, our method reduces error and positive decision rate differences across demographic groups of the non-rejected data. Since the unfairness-based rejections are based on an interpretable-by-design method, i.e., rule-based fairness checks and situation testing, we create a transparent process that can empower human decision-makers to review the unfair predictions and make more just decisions for them. This explainable aspect is especially important in light of recent AI regulations, mandating that any high-risk decision task should be overseen by human experts to reduce discrimination risks.
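To make the coverage constraint concrete, here is a minimal sketch of uncertainty-based abstention only: it keeps the most confident predictions until a minimum share of instances is covered and rejects the rest. It does not reproduce IFAC's fairness checks or situation testing. The function name, the use of the top class probability as confidence, and the `-1` rejection marker are assumptions made for illustration.

```python
import numpy as np

def predict_with_abstention(probs, min_coverage):
    """Return predicted labels, with -1 marking instances the classifier rejects."""
    probs = np.asarray(probs, dtype=float)
    confidence = probs.max(axis=1)   # top class probability per instance
    labels = probs.argmax(axis=1)
    n_keep = int(np.ceil(min_coverage * len(probs)))
    # Stable sort keeps the original order among equally confident instances.
    keep = np.argsort(-confidence, kind="stable")[:n_keep]
    output = np.full(len(probs), -1)
    output[keep] = labels[keep]
    return output
```

With `min_coverage=0.5` on four instances, only the two most confident predictions are returned and the other two are rejected.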

[AI-6] Learning Multi-Robot Coordination through Locality-Based Factorized Multi-Agent Actor-Critic Algorithm

Link: https://arxiv.org/abs/2503.18816
Authors: Chak Lam Shek,Amrit Singh Bedi,Anjon Basak,Ellen Novoseller,Nick Waytowich,Priya Narayanan,Dinesh Manocha,Pratap Tokekar
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments:

Abstract:In this work, we present a novel cooperative multi-agent reinforcement learning method called Locality-based Factorized Multi-Agent Actor-Critic (Loc-FACMAC). Existing state-of-the-art algorithms, such as FACMAC, rely on global reward information, which may not accurately reflect the quality of individual robots’ actions in decentralized systems. We integrate the concept of locality into critic learning, where strongly related robots form partitions during training. Robots within the same partition have a greater impact on each other, leading to more precise policy evaluation. Additionally, we construct a dependency graph to capture the relationships between robots, facilitating the partitioning process. This approach mitigates the curse of dimensionality and prevents robots from using irrelevant information. Our method improves existing algorithms by focusing on local rewards and leveraging partition-based learning to enhance training efficiency and performance. We evaluate the performance of Loc-FACMAC in three environments: Hallway, Multi-cartpole, and Bounded-Cooperative-Navigation. We explore the impact of partition sizes on the performance and compare the result with baseline MARL algorithms such as LOMAQ, FACMAC, and QMIX. The experiments reveal that, if the locality structure is defined properly, Loc-FACMAC outperforms these baseline algorithms by up to 108%, indicating that exploiting the locality structure in the actor-critic framework improves the MARL performance.

[AI-7] Towards Responsible AI Music: an Investigation of Trustworthy Features for Creative Systems

Link: https://arxiv.org/abs/2503.18814
Authors: Jacopo de Berardinis,Lorenzo Porcaro,Albert Meroño-Peñuela,Angelo Cangelosi,Tess Buckley
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Generative AI is radically changing the creative arts, by fundamentally transforming the way we create and interact with cultural artefacts. While offering unprecedented opportunities for artistic expression and commercialisation, this technology also raises ethical, societal, and legal concerns. Key among these are the potential displacement of human creativity, copyright infringement stemming from vast training datasets, and the lack of transparency, explainability, and fairness mechanisms. As generative systems become pervasive in this domain, responsible design is crucial. Whilst previous work has tackled isolated aspects of generative systems (e.g., transparency, evaluation, data), we take a comprehensive approach, grounding these efforts within the Ethics Guidelines for Trustworthy Artificial Intelligence produced by the High-Level Expert Group on AI appointed by the European Commission - a framework for designing responsible AI systems across seven macro requirements. Focusing on generative music AI, we illustrate how these requirements can be contextualised for the field, addressing trustworthiness across multiple dimensions and integrating insights from the existing literature. We further propose a roadmap for operationalising these contextualised requirements, emphasising interdisciplinary collaboration and stakeholder engagement. Our work provides a foundation for designing and evaluating responsible music generation systems, calling for collaboration among AI experts, ethicists, legal scholars, and artists. This manuscript is accompanied by a website: this https URL.

[AI-8] Defeating Prompt Injections by Design

Link: https://arxiv.org/abs/2503.18813
Authors: Edoardo Debenedetti,Ilia Shumailov,Tianqi Fan,Jamie Hayes,Nicholas Carlini,Daniel Fabian,Christoph Kern,Chongyang Shi,Andreas Terzis,Florian Tramèr
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments:

Abstract:Large Language Models (LLMs) are increasingly deployed in agentic systems that interact with an external environment. However, LLM agents are vulnerable to prompt injection attacks when handling untrusted data. In this paper we propose CaMeL, a robust defense that creates a protective system layer around the LLM, securing it even when underlying models may be susceptible to attacks. To operate, CaMeL explicitly extracts the control and data flows from the (trusted) query; therefore, the untrusted data retrieved by the LLM can never impact the program flow. To further improve security, CaMeL relies on a notion of a capability to prevent the exfiltration of private data over unauthorized data flows. We demonstrate the effectiveness of CaMeL by solving 67% of tasks with provable security in AgentDojo [NeurIPS 2024], a recent agentic security benchmark.

[AI-9] Classical Planning with LLM-Generated Heuristics: Challenging the State of the Art with Python Code

Link: https://arxiv.org/abs/2503.18809
Authors: Augusto B. Corrêa,André G. Pereira,Jendrik Seipp
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:In recent years, large language models (LLMs) have shown remarkable capabilities in various artificial intelligence problems. However, they fail to plan reliably, even when prompted with a detailed definition of the planning task. Attempts to improve their planning capabilities, such as chain-of-thought prompting, fine-tuning, and explicit “reasoning” still yield incorrect plans and usually fail to generalize to larger tasks. In this paper, we show how to use LLMs to generate correct plans, even for out-of-distribution tasks of increasing size. For a given planning domain, we ask an LLM to generate several domain-dependent heuristic functions in the form of Python code, evaluate them on a set of training tasks within a greedy best-first search, and choose the strongest one. The resulting LLM-generated heuristics solve many more unseen test tasks than state-of-the-art domain-independent heuristics for classical planning. They are even competitive with the strongest learning algorithm for domain-dependent planning. These findings are especially remarkable given that our proof-of-concept implementation is based on an unoptimized Python planner and the baselines all build upon highly optimized C++ code. In some domains, the LLM-generated heuristics expand fewer states than the baselines, revealing that they are not only efficiently computable, but sometimes even more informative than the state-of-the-art heuristics. Overall, our results show that sampling a set of planning heuristic function programs can significantly improve the planning capabilities of LLMs.
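The pipeline described above plugs a heuristic function into greedy best-first search (GBFS). The following is a minimal, generic GBFS sketch with the heuristic passed in as an ordinary Python callable. It is not the authors' planner; the function signature and the toy integer domain (`toy_successors`) are hypothetical stand-ins for a real planning task.

```python
import heapq
from itertools import count

def greedy_best_first_search(start, is_goal, successors, heuristic):
    """Always expand the open state with the lowest heuristic value; return a plan or None."""
    tie_breaker = count()  # keeps heap entries comparable when heuristic values tie
    frontier = [(heuristic(start), next(tie_breaker), start, [])]
    seen = {start}
    while frontier:
        _, _, state, plan = heapq.heappop(frontier)
        if is_goal(state):
            return plan
        for action, next_state in successors(state):
            if next_state not in seen:
                seen.add(next_state)
                entry = (heuristic(next_state), next(tie_breaker), next_state, plan + [action])
                heapq.heappush(frontier, entry)
    return None

def toy_successors(state):
    """Toy domain: states are integers and the two actions add 1 or 2."""
    return [("+1", state + 1), ("+2", state + 2)]
```

In this setting, comparing candidate heuristics amounts to calling the search with different `heuristic` arguments on the training tasks and keeping the one that solves the most.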

[AI-10] The case for delegated AI autonomy for Human AI teaming in healthcare

Link: https://arxiv.org/abs/2503.18778
Authors: Yan Jia,Harriet Evans,Zoe Porter,Simon Graham,John McDermid,Tom Lawton,David Snead,Ibrahim Habli
Categories: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
Comments:

Abstract:In this paper we propose an advanced approach to integrating artificial intelligence (AI) into healthcare: autonomous decision support. This approach allows the AI algorithm to act autonomously for a subset of patient cases whilst serving a supportive role in other subsets of patient cases based on defined delegation criteria. By leveraging the complementary strengths of both humans and AI, it aims to deliver greater overall performance than existing human-AI teaming models. It ensures safe handling of patient cases and potentially reduces clinician review time, whilst being mindful of AI tool limitations. After setting the approach within the context of current human-AI teaming models, we outline the delegation criteria and apply them to a specific AI-based tool used in histopathology. The potential impact of the approach and the regulatory requirements for its successful implementation are then discussed.

[AI-11] Mechanistic Interpretability of Fine-Tuned Vision Transformers on Distorted Images: Decoding Attention Head Behavior for Transparent and Trustworthy AI

Link: https://arxiv.org/abs/2503.18762
Authors: Nooshin Bahador
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 15 pages, 8 figures

Abstract:Mechanistic interpretability improves the safety, reliability, and robustness of large AI models. This study examined individual attention heads in vision transformers (ViTs) fine-tuned on distorted 2D spectrogram images containing non-relevant content (axis labels, titles, color bars). By introducing extraneous features, the study analyzed how transformer components processed unrelated information, using mechanistic interpretability to debug issues and reveal insights into transformer architectures. Attention maps assessed head contributions across layers. Heads in early layers (1 to 3) showed minimal task impact: ablating them increased MSE loss only slightly (μ = 0.11%, σ = 0.09%), indicating a focus on less critical low-level features. In contrast, deeper heads (e.g., layer 6) caused a threefold higher loss increase (μ = 0.34%, σ = 0.02%), demonstrating greater task importance. Intermediate layers (6 to 11) exhibited monosemantic behavior, attending exclusively to chirp regions. Some early heads (1 to 4) were monosemantic but not task-relevant (e.g., text detectors, edge or corner detectors). Attention maps distinguished monosemantic heads (precise chirp localization) from polysemantic heads (multiple irrelevant regions). These findings revealed functional specialization in ViTs, showing how heads processed relevant vs. extraneous information. By decomposing transformers into interpretable components, this work enhanced model understanding, identified vulnerabilities, and advanced safer, more transparent AI.

[AI-12] Energy-Efficient Dynamic Training and Inference for GNN-Based Network Modeling

Link: https://arxiv.org/abs/2503.18706
Authors: Chetna Singhal,Yassine Hadjadj-Aoul
Categories: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Machine Learning (cs.LG)
Comments: Accepted in IEEE WCNC 2025

Abstract:Efficient network modeling is essential for resource optimization and network planning in next-generation large-scale complex networks. Traditional approaches, such as queuing theory-based modeling and packet-based simulators, can be inefficient due to the assumptions made and the computational expense, respectively. To address these challenges, we propose an innovative energy-efficient dynamic orchestration of Graph Neural Networks (GNN) based model training and inference framework for context-aware network modeling and predictions. We have developed a low-complexity solution framework, QAG, that is a Quantum approximation optimization (QAO) algorithm for Adaptive orchestration of GNN-based network modeling. We leverage the tripartite graph model to represent a multi-application system with many compute nodes. Thereafter, we apply the constrained graph-cutting using QAO to find the feasible energy-efficient configurations of the GNN-based model and deploy them on the available compute nodes to meet the network modeling application requirements. The proposed QAG scheme closely matches the optimum and offers at least a 50% energy saving while meeting the application requirements with 60% lower churn-rate.

[AI-13] Efficient Continual Adaptation of Pretrained Robotic Policy with Online Meta-Learned Adapters

Link: https://arxiv.org/abs/2503.18684
Authors: Ruiqi Zhu,Endong Sun,Guanhe Huang,Oya Celiktutan
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: Project link: this https URL

Abstract:Continual adaptation is essential for general autonomous agents. For example, a household robot pretrained with a repertoire of skills must still adapt to unseen tasks specific to each household. Motivated by this, building upon parameter-efficient fine-tuning in language models, prior works have explored lightweight adapters to adapt pretrained policies, which can preserve learned features from the pretraining phase and demonstrate good adaptation performances. However, these approaches treat task learning separately, limiting knowledge transfer between tasks. In this paper, we propose Online Meta-Learned adapters (OMLA). Instead of applying adapters directly, OMLA can facilitate knowledge transfer from previously learned tasks to current learning tasks through a novel meta-learning objective. Extensive experiments in both simulated and real-world environments demonstrate that OMLA can lead to better adaptation performances compared to the baseline methods. The project link: this https URL.

[AI-14] From Fragment to One Piece: A Survey on AI-Driven Graphic Design

Link: https://arxiv.org/abs/2503.18641
Authors: Xingxing Zou,Wen Zhang,Nanxuan Zhao
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:This survey provides a comprehensive overview of the advancements in Artificial Intelligence in Graphic Design (AIGD), focusing on integrating AI techniques to support design interpretation and enhance the creative process. We categorize the field into two primary directions: perception tasks, which involve understanding and analyzing design elements, and generation tasks, which focus on creating new design elements and layouts. The survey covers various subtasks, including visual element perception and generation, aesthetic and semantic understanding, layout analysis, and generation. We highlight the role of large language models and multimodal approaches in bridging the gap between localized visual features and global design intent. Despite significant progress, challenges remain in understanding human intent, ensuring interpretability, and maintaining control over multilayered compositions. This survey serves as a guide for researchers, providing information on the current state of AIGD and potential future directions (footnote: this https URL_Intelligent_graphic_design).

[AI-15] Adventurer: Exploration with BiGAN for Deep Reinforcement Learning

Link: https://arxiv.org/abs/2503.18612
Authors: Yongshuai Liu,Xin Liu
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted at Applied Intelligence

Abstract:Recent developments in deep reinforcement learning have been very successful in learning complex, previously intractable problems. Sample efficiency and local optimality, however, remain significant challenges. To address these challenges, novelty-driven exploration strategies have emerged and shown promising potential. Unfortunately, no single algorithm outperforms all others in all tasks and most of them struggle with tasks with high-dimensional and complex observations. In this work, we propose Adventurer, a novelty-driven exploration algorithm that is based on Bidirectional Generative Adversarial Networks (BiGAN), where BiGAN is trained to estimate state novelty. Intuitively, a generator that has been trained on the distribution of visited states should only be able to generate a state coming from the distribution of visited states. As a result, using the generator to reconstruct novel input states from their latent representations would lead to larger reconstruction errors. We show that BiGAN performs well in estimating state novelty for complex observations. This novelty estimation method can be combined with intrinsic-reward-based exploration. Our empirical results show that Adventurer produces competitive results on a range of popular benchmark tasks, including continuous robotic manipulation tasks (e.g. Mujoco robotics) and high-dimensional image-based tasks (e.g. Atari games).

[AI-16] Reinforcement Learning in Switching Non-Stationary Markov Decision Processes: Algorithms and Convergence Analysis

Link: https://arxiv.org/abs/2503.18607
Authors: Mohsen Amiri,Sindri Magnússon
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Reinforcement learning in non-stationary environments is challenging due to abrupt and unpredictable changes in dynamics, often causing traditional algorithms to fail to converge. However, in many real-world cases, non-stationarity has some structure that can be exploited to develop algorithms and facilitate theoretical analysis. We introduce one such structure, Switching Non-Stationary Markov Decision Processes (SNS-MDP), where environments switch over time based on an underlying Markov chain. Under a fixed policy, the value function of an SNS-MDP admits a closed-form solution determined by the Markov chain’s statistical properties, and despite the inherent non-stationarity, Temporal Difference (TD) learning methods still converge to the correct value function. Furthermore, policy improvement can be performed, and it is shown that policy iteration converges to the optimal policy. Moreover, since Q-learning converges to the optimal Q-function, it likewise yields the corresponding optimal policy. To illustrate the practical advantages of SNS-MDPs, we present an example in communication networks where channel noise follows a Markovian pattern, demonstrating how this framework can effectively guide decision-making in complex, time-varying contexts.

[AI-17] Adaptive Unimodal Regulation for Balanced Multimodal Information Acquisition CVPR2025

Link: https://arxiv.org/abs/2503.18595
Authors: Chengxiang Huang,Yake Wei,Zequn Yang,Di Hu
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 10 pages, 16 figures, CVPR2025

Abstract:Sensory training during the early ages is vital for human development. Inspired by this cognitive phenomenon, we observe that the early training stage is also important for the multimodal learning process, where dataset information is rapidly acquired. We refer to this stage as the prime learning window. However, based on our observation, this prime learning window in multimodal learning is often dominated by information-sufficient modalities, which in turn suppresses the information acquisition of information-insufficient modalities. To address this issue, we propose Information Acquisition Regulation (InfoReg), a method designed to balance information acquisition among modalities. Specifically, InfoReg slows down the information acquisition process of information-sufficient modalities during the prime learning window, which could promote information acquisition of information-insufficient modalities. This regulation enables a more balanced learning process and improves the overall performance of the multimodal network. Experiments show that InfoReg outperforms related multimodal imbalanced methods across various datasets, achieving superior model performance. The code is available at this https URL.

[AI-18] The Role of Artificial Intelligence in Enhancing Insulin Recommendations and Therapy Outcomes

Link: https://arxiv.org/abs/2503.18592
Authors: Maria Panagiotou,Knut Stroemmen,Lorenzo Brigato,Bastiaan E. de Galan,Stavroula Mougiakakou
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Medical Physics (physics.med-ph)
Comments:

Abstract:The growing worldwide incidence of diabetes requires more effective approaches for managing blood glucose levels. Insulin delivery systems have advanced significantly, with artificial intelligence (AI) playing a key role in improving their precision and adaptability. AI algorithms, particularly those based on reinforcement learning, allow for personalised insulin dosing by continuously adapting to an individual’s responses. Despite these advancements, challenges such as data privacy, algorithm transparency, and accessibility still need to be addressed. Continued progress and validation in AI-driven insulin delivery systems promise to improve therapy outcomes further, offering people more effective and individualised management of their diabetes. This paper presents an overview of current strategies, key challenges, and future directions.

[AI-19] Identifying and Characterising Higher Order Interactions in Mobility Networks Using Hypergraphs

Link: https://arxiv.org/abs/2503.18572
Authors: Prathyush Sambaturu,Bernardo Gutierrez,Moritz U.G. Kraemer
Categories: Social and Information Networks (cs.SI); Artificial Intelligence (cs.AI); Databases (cs.DB); Discrete Mathematics (cs.DM); Combinatorics (math.CO)
Comments:

Abstract:Understanding human mobility is essential for applications ranging from urban planning to public health. Traditional mobility models such as flow networks and colocation matrices capture only pairwise interactions between discrete locations, overlooking higher-order relationships among locations (i.e., mobility flow among two or more locations). To address this, we propose co-visitation hypergraphs, a model that leverages temporal observation windows to extract group interactions between locations from individual mobility trajectory data. Using frequent pattern mining, our approach constructs hypergraphs that capture dynamic mobility behaviors across different spatial and temporal scales. We validate our method on a publicly available mobility dataset and demonstrate its effectiveness in analyzing city-scale mobility patterns, detecting shifts during external disruptions such as extreme weather events, and examining how a location’s connectivity (degree) relates to the number of points of interest (POIs) within it. Our results demonstrate that our hypergraph-based mobility analysis framework is a valuable tool with potential applications in diverse fields such as public health, disaster resilience, and urban planning.
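One way to picture the construction described above: group each individual's visits into fixed observation windows, treat the set of locations seen in a window as a "basket", and keep location sets that recur across enough baskets as hyperedges. The sketch below does this by brute-force enumeration rather than a real frequent pattern mining algorithm. The `(user, time, location)` record layout, integer timestamps, and all parameter names are assumptions, not the paper's data schema.

```python
from collections import Counter, defaultdict
from itertools import combinations

def covisitation_hyperedges(visits, window, min_support, max_size=3):
    """Map each frequent location set (a hyperedge) to the number of baskets containing it."""
    baskets = defaultdict(set)
    for user, time, location in visits:
        # One basket per user and observation window (integer timestamps assumed).
        baskets[(user, time // window)].add(location)
    counts = Counter()
    for locations in baskets.values():
        for size in range(2, max_size + 1):
            for combo in combinations(sorted(locations), size):
                counts[frozenset(combo)] += 1
    return {edge: n for edge, n in counts.items() if n >= min_support}
```

Enumerating every subset grows quickly with basket size, which is why `max_size` is capped here. A dedicated miner such as Apriori or FP-growth would be the usual choice at city scale.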

[AI-20] Anchor-based oversampling for imbalanced tabular data via contrastive and adversarial learning

Link: https://arxiv.org/abs/2503.18569
Authors: Hadi Mohammadi,Ehsan Nazerfard,Mostafa Haghir Chehreghani
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Imbalanced data represent a distribution with more frequencies of one class (majority) than the other (minority). This phenomenon occurs across various domains, such as security, medical care and human activity. In imbalanced learning, classification algorithms are typically inclined to classify the majority class accurately, resulting in artificially high accuracy rates. As a result, many minority samples are mistakenly labelled as majority-class instances, resulting in a bias that benefits the majority class. This study presents a framework based on boundary anchor samples to tackle the imbalance learning challenge. First, we select and use anchor samples to train a multilayer perceptron (MLP) classifier, which acts as a prior knowledge model and aids the adversarial and contrastive learning procedures. Then, we design a novel deep generative model called Anchor Stabilized Conditional Generative Adversarial Network (Anch-SCGAN for short). Anch-SCGAN is supported with two generators for the minority and majority classes and a discriminator incorporating additional class-specific information from the pre-trained feature extractor MLP. In addition, we facilitate the generator’s training procedure in two ways. First, we define a new generator loss function based on reprocessed anchor samples and contrastive learning. Second, we apply a scoring strategy to stabilize the adversarial training part in generators. We train Anch-SCGAN and further finetune it with anchor samples to improve the precision of the generated samples. Our experiments on 16 real-world imbalanced datasets illustrate that Anch-SCGAN outperforms the renowned methods in imbalanced learning.

[AI-21] Discriminative protein sequence modelling with Latent Space Diffusion

Link: https://arxiv.org/abs/2503.18551
Authors: Eoin Quinn,Ghassene Jebali,Maxime Seince,Oliver Bent
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:We explore a framework for protein sequence representation learning that decomposes the task between manifold learning and distributional modelling. Specifically, we present a Latent Space Diffusion architecture which combines a protein sequence autoencoder with a denoising diffusion model operating on its latent space. We obtain a one-parameter family of learned representations from the diffusion model, along with the autoencoder’s latent representation. We propose and evaluate two autoencoder architectures: a homogeneous model forcing amino acids of the same type to be identically distributed in the latent space, and an inhomogeneous model employing a noise-based variant of masking. As a baseline we take a latent space learned by masked language modelling, and evaluate discriminative capability on a range of protein property prediction tasks. Our finding is twofold: the diffusion models trained on both our proposed variants display higher discriminative power than the one trained on the masked language model baseline, but none of the diffusion representations achieve the performance of the masked language model embeddings themselves.

[AI-22] RLCAD: Reinforcement Learning Training Gym for Revolution Involved CAD Command Sequence Generation

Link: https://arxiv.org/abs/2503.18549
Authors: Xiaolong Yin,Xingyu Lu,Jiahang Shen,Jingzhe Ni,Hailong Li,Ruofeng Tong,Min Tang,Peng Du
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:A CAD command sequence is a typical parametric design paradigm in 3D CAD systems where a model is constructed by overlaying 2D sketches with operations such as extrusion, revolution, and Boolean operations. Although there is growing academic interest in the automatic generation of command sequences, existing methods and datasets only support operations such as 2D sketching, extrusion, and Boolean operations. This limitation makes it challenging to represent more complex geometries. In this paper, we present a reinforcement learning (RL) training environment (gym) built on a CAD geometric engine. Given an input boundary representation (B-Rep) geometry, the policy network in the RL algorithm generates an action. This action, along with previously generated actions, is processed within the gym to produce the corresponding CAD geometry, which is then fed back into the policy network. The rewards, determined by the difference between the generated and target geometries within the gym, are used to update the RL network. Our method supports operations beyond sketches, Boolean, and extrusion, including revolution operations. With this training gym, we achieve state-of-the-art (SOTA) quality in generating command sequences from B-Rep geometries. In addition, our method can significantly improve the efficiency of command sequence generation by a factor of 39X compared with the previous training gym.

[AI-23] An Identity and Interaction Based Network Forensic Analysis

Link: https://arxiv.org/abs/2503.18542
Authors: Nathan Clarke,Gaseb Alotibi,Dany Joy,Fudong Li,Steven Furnell,Ali Alshumrani,Hussan Mohammed
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)

Abstract:In today's landscape of increasing electronic crime, network forensics plays a pivotal role in digital investigations. It aids in understanding which systems to analyse and serves as a supplement to support evidence found through more traditional computer-based investigations. However, the nature and functionality of the existing Network Forensic Analysis Tools (NFATs) fall short compared to File System Forensic Analysis Tools (FS FATs) in providing usable data. The analysis tends to focus upon IP addresses, which are not synonymous with user identities, a point of significant interest to investigators. This paper presents several experiments designed to create a novel NFAT approach that can identify users and understand how they are using network-based applications whilst the traffic remains encrypted. The experiments build upon the prior art and investigate how effective this approach is in classifying users and their actions. Utilising an in-house dataset composed of 50 million packets, the experiments are formed of three incremental developments that assist in improving performance. Building upon the successful experiments, a proposed NFAT interface is presented to illustrate the ease with which investigators would be able to ask relevant questions of user interactions. The experiments, which profiled 27 users, yielded an average 93.3% True Positive Identification Rate (TPIR), with 41% of users experiencing 100% TPIR. Skype, Wikipedia and Hotmail services achieved a notably high level of recognition performance. The study has developed and evaluated an approach to analyse encrypted network traffic more effectively through the modelling of network traffic and to subsequently visualise these interactions through a novel network forensic analysis tool.

[AI-24] MMCR: Advancing Visual Language Model in Multimodal Multi-Turn Contextual Reasoning

Link: https://arxiv.org/abs/2503.18533
Authors: Dawei Yan,Yang Li,Qing-Guo Chen,Weihua Luo,Peng Wang,Haokui Zhang,Chunhua Shen
Categories: Artificial Intelligence (cs.AI)

Abstract:Compared to single-turn dialogue, multi-turn dialogue involving multiple images better aligns with the needs of real-world human-AI interactions. Additionally, as training data, it provides richer contextual reasoning information, thereby guiding the model to achieve better performance. However, existing vision-language models (VLMs) primarily rely on single-turn dialogue training and evaluation benchmarks. In this paper, following the characteristics of human dialogue, such as focused topics and concise, clear content, we present MMCR (Multimodal Multi-turn Contextual Reasoning), a novel dataset comprising: (1) MMCR-310k – the largest multi-image multi-turn instruction tuning dataset with 310K contextual dialogues, each covering 1-4 images and 4 or 8 dialogue turns; and (2) MMCR-Bench – a diagnostic benchmark featuring dialogues, spanning 8 domains (Humanities, Natural, Science, Education, etc.) and 40 sub-topics. Extensive evaluations demonstrate that models fine-tuned with MMCR-310k achieve 5.2% higher contextual accuracy on MMCR-Bench, while showing consistent improvements on existing benchmarks (+1.1% on AI2D, +1.2% on MMMU and MMVet). MMCR and prompt engineering will be released publicly.

[AI-25] Neuro-symbolic Weak Supervision: Theory and Semantics

Link: https://arxiv.org/abs/2503.18509
Authors: Nijesh Upreti,Vaishak Belle
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:Weak supervision allows machine learning models to learn from limited or noisy labels, but it introduces challenges in interpretability and reliability, particularly in multi-instance partial label learning (MI-PLL), where models must resolve both ambiguous labels and uncertain instance-label mappings. We propose a semantics for a neuro-symbolic framework that integrates Inductive Logic Programming (ILP) to improve MI-PLL by providing structured relational constraints that guide learning. Within our semantic characterization, ILP defines a logical hypothesis space for label transitions, clarifies classifier semantics, and establishes interpretable performance standards. This hybrid approach improves robustness, transparency, and accountability in weakly supervised settings, ensuring neural predictions align with domain knowledge. By embedding weak supervision into a logical framework, we enhance both interpretability and learning, making weak supervision more suitable for real-world, high-stakes applications.

[AI-26] Statistically Testing Training Data for Unwanted Error Patterns using Rule-Oriented Regression

Link: https://arxiv.org/abs/2503.18497
Authors: Stefan Rass,Martin Dallinger
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Artificial intelligence models trained from data can only be as good as the underlying data is. Biases in training data propagating through to the output of a machine learning model are a well-documented and well-understood phenomenon, but the machinery to prevent these undesired effects is much less developed. Efforts to ensure data is clean during collection, such as using bias-aware sampling, are most effective when the entity controlling data collection also trains the AI. In cases where the data is already available, how do we find out if the data was already manipulated, i.e., "poisoned", so that an undesired behavior would be trained into a machine learning model? This is a challenge fundamentally different to (just) improving approximation accuracy or efficiency, and we provide a method to test training data for flaws, to establish a trustworthy ground-truth for a subsequent training of machine learning models (of any kind). Unlike the well-studied problem of approximating data using fuzzy rules that are generated from the data, our method hinges on a prior definition of rules to happen before seeing the data to be tested. Therefore, the proposed method can also discover hidden error patterns, which may also have substantial influence. Our approach extends the abilities of conventional statistical testing by letting the "test-condition" be any Boolean condition to describe a pattern in the data, whose presence we wish to determine. The method puts fuzzy inference into a regression model, to get the best of the two: explainability from fuzzy logic with statistical properties and diagnostics from the regression, and finally also being applicable to "small data", hence not requiring large datasets as deep learning methods do. We provide an open source implementation for demonstration and experiments.

[AI-27] Large Language Models powered Network Attack Detection: Architecture Opportunities and Case Study

Link: https://arxiv.org/abs/2503.18487
Authors: Xinggong Zhang,Qingyang Li,Yunpeng Tan,Zongming Guo,Lei Zhang,Yong Cui
Categories: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Comments: submitted for peer-review

Abstract:Network attack detection is a pivotal technology to identify network anomalies and classify malicious traffic. Large Language Models (LLMs), trained on a vast corpus of text, have amassed remarkable capabilities of context-understanding and commonsense knowledge. This has opened up a new door for network threat detection. Researchers have already initiated discussions regarding the application of LLMs on specific cyber-security tasks. Unfortunately, there is still a lack of comprehensive elaboration on how to mine LLMs’ potential in network threat detection, as well as on the opportunities and challenges. In this paper, we mainly focus on the classification of malicious traffic from the perspective of LLMs’ capability. We present a holistic view of the architecture of LLM-powered network attack detection, including Pre-training, Fine-tuning, and Detection. In particular, by exploring the knowledge and capabilities of LLMs, we identify three distinct roles an LLM can play in network attack detection: Classifier, Encoder, and Predictor. For each of them, the modeling paradigm, opportunities and challenges are elaborated. Finally, we present our design on LLM-powered DDoS detection as a case study. The proposed framework attains accurate detection on carpet bombing DDoS by exploiting LLMs’ capabilities in contextual mining. The evaluation shows its efficacy, exhibiting a nearly 35% improvement compared to existing systems.

[AI-28] ModiGen: A Large Language Model-Based Workflow for Multi-Task Modelica Code Generation

Link: https://arxiv.org/abs/2503.18460
Authors: Jiahui Xiang,Tong Ye,Peiyu Liu,Yinan Zhang,Wenhai Wang
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)

Abstract:Modelica is a widely adopted language for simulating complex physical systems, yet effective model creation and optimization require substantial domain expertise. Although large language models (LLMs) have demonstrated promising capabilities in code generation, their application to modeling remains largely unexplored. To address this gap, we have developed benchmark datasets specifically designed to evaluate the performance of LLMs in generating Modelica component models and test cases. Our evaluation reveals substantial limitations in current LLMs, as the generated code often fails to simulate successfully. To overcome these challenges, we propose a specialized workflow that integrates supervised fine-tuning, graph retrieval-augmented generation, and feedback optimization to improve the accuracy and reliability of Modelica code generation. The evaluation results demonstrate significant performance gains: the maximum improvement in pass@1 reached 0.3349 for the component generation task and 0.2457 for the test case generation task. This research underscores the potential of LLMs to advance intelligent modeling tools and offers valuable insights for future developments in system modeling and engineering applications.
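The abstract reports gains in pass@1 but does not say how the metric is estimated. A minimal sketch, assuming the widely used unbiased pass@k estimator (draw n generations per task, count the c that simulate successfully); the function names `pass_at_k` and `mean_pass_at_k` are hypothetical and not taken from the paper:

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one task.

    n: number of generated samples, c: number that passed, k: budget.
    Assumption: this is the common combinatorial estimator, not
    necessarily the exact procedure used by the ModiGen authors.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Every size-k subset must contain at least one passing sample.
        return 1.0
    # 1 - P(all k drawn samples fail)
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over tasks; results holds (n, c) pairs."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

For k = 1 the estimator reduces to c / n, so an improvement of 0.3349 in pass@1 corresponds to roughly one third more tasks solved on the first attempt.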

[AI-29] Generative AI in Knowledge Work: Design Implications for Data Navigation and Decision-Making

Link: https://arxiv.org/abs/2503.18419
Authors: Bhada Yun,Dana Feng,Ace S. Chen,Afshin Nikzad,Niloufar Salehi
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET)
Comments: Accepted to CHI '25 (Conference on Human Factors in Computing Systems), to appear April 26-May 1, 2025, Yokohama, Japan

Abstract:Our study of 20 knowledge workers revealed a common challenge: the difficulty of synthesizing unstructured information scattered across multiple platforms to make informed decisions. Drawing on their vision of an ideal knowledge synthesis tool, we developed Yodeai, an AI-enabled system, to explore both the opportunities and limitations of AI in knowledge work. Through a user study with 16 product managers, we identified three key requirements for Generative AI in knowledge work: adaptable user control, transparent collaboration mechanisms, and the ability to integrate background knowledge with external information. However, we also found significant limitations, including overreliance on AI, user isolation, and contextual factors outside the AI’s reach. As AI tools become increasingly prevalent in professional settings, we propose design principles that emphasize adaptability to diverse workflows, accountability in personal and collaborative contexts, and context-aware interoperability to guide the development of human-centered AI systems for product managers and knowledge workers.

[AI-30] PRECTR: A Synergistic Framework for Integrating Personalized Search Relevance Matching and CTR Prediction

Link: https://arxiv.org/abs/2503.18395
Authors: Rong Chen,Shuzhi Cao,Ailong He,Shuguang Han,Jufeng Chen
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)

Abstract:The two primary tasks in the search recommendation system are search relevance matching and click-through rate (CTR) prediction – the former focuses on seeking relevant items for user queries whereas the latter forecasts which item may better match user interest. Prior research typically develops two models to predict the CTR and search relevance separately, then ranks candidate items based on the fusion of the two outputs. However, such a divide-and-conquer paradigm creates inconsistency between the different models. Meanwhile, the search relevance model mainly concentrates on the degree of objective text matching while neglecting personalized differences among different users, leading to restricted model performance. To tackle these issues, we propose a unified Personalized Search RElevance Matching and CTR Prediction Fusion Model (PRECTR). Specifically, based on the conditional probability fusion mechanism, PRECTR integrates the CTR prediction and search relevance matching into one framework to enhance the interaction and consistency of the two modules. However, directly optimizing CTR binary classification loss may bring challenges to the fusion model’s convergence and indefinitely promote the exposure of items with high CTR, regardless of their search relevance. Hence, we further introduce two-stage training and semantic consistency regularization to accelerate the model’s convergence and restrain the recommendation of irrelevant items. Finally, acknowledging that different users may have varied relevance preferences, we assessed current users’ relevance preferences by analyzing past users’ preferences for similar queries and tailored incentives for different candidate items accordingly. Extensive experimental results on our production dataset and online A/B testing demonstrate the effectiveness and superiority of our proposed PRECTR method.

[AI-31] Manipulation and the AI Act: Large Language Model Chatbots and the Danger of Mirrors

Link: https://arxiv.org/abs/2503.18387
Authors: Joshua Krook
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)

Abstract:Large Language Model chatbots are increasingly taking the form and visage of human beings, adapting human faces, names, voices, personalities, and quirks, including those of celebrities and well-known political figures. Personifying AI chatbots could foreseeably increase their trust with users. However, it could also make them more capable of manipulation, by creating the illusion of a close and intimate relationship with an artificial entity. The European Commission has finalized the AI Act, with the EU Parliament making amendments banning manipulative and deceptive AI systems that cause significant harm to users. Although the AI Act covers harms that accumulate over time, it is unlikely to prevent harms associated with prolonged discussions with AI chatbots. Specifically, a chatbot could reinforce a person’s negative emotional state over weeks, months, or years through negative feedback loops, prolonged conversations, or harmful recommendations, contributing to a user’s deteriorating mental health.

[AI-32] RoCA: Robust Contrastive One-class Time Series Anomaly Detection with Contaminated Data

Link: https://arxiv.org/abs/2503.18385
Authors: Xudong Mou,Rui Wang,Bo Li,Tianyu Wo,Jie Sun,Hui Wang,Xudong Liu
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:The accumulation of time-series signals and the absence of labels make time-series Anomaly Detection (AD) a self-supervised task of deep learning. Methods based on normality assumptions face the following three limitations: (1) A single assumption could hardly characterize the whole normality or lead to some deviation. (2) Some assumptions may go against the principle of AD. (3) Their basic assumption is that the training data is uncontaminated (free of anomalies), which is unrealistic in practice, leading to a decline in robustness. This paper proposes a novel robust approach, RoCA, which is the first to address all of the above three challenges, as far as we are aware. It fuses the separated assumptions of one-class classification and contrastive learning in a single training process to characterize a more complete so-called normality. Additionally, it monitors the training data and computes a carefully designed anomaly score throughout the training process. This score helps identify latent anomalies, which are then used to define the classification boundary, inspired by the concept of outlier exposure. The performance on AIOps datasets improved by 6% compared to when contamination was not considered (COCA). On two large and high-dimensional multivariate datasets, the performance increased by 5% to 10%. RoCA achieves the highest average performance on both univariate and multivariate datasets. The source code is available at this https URL.

[AI-33] Maximum Redundancy Pruning: A Principle-Driven Layerwise Sparsity Allocation for LLMs

Link: https://arxiv.org/abs/2503.18377
Authors: Chang Gao,Kang Zhao,Jianfei Chen,Liping Jing
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Large language models (LLMs) have demonstrated impressive capabilities, but their enormous size poses significant challenges for deployment in real-world applications. To address this issue, researchers have sought to apply network pruning techniques to LLMs. A critical challenge in pruning is allocating the sparsity for each layer. Recent sparsity allocation methods are often based on heuristics or search that can easily lead to suboptimal performance. In this paper, we conducted an extensive investigation into various LLMs and revealed three significant discoveries: (1) the layerwise pruning sensitivity (LPS) of LLMs is highly non-uniform, (2) the choice of pruning metric affects LPS, and (3) the performance of a sparse model is related to the uniformity of its layerwise redundancy level. Based on these observations, we propose that the layerwise sparsity of LLMs should adhere to three principles: non-uniformity, pruning metric dependency, and uniform layerwise redundancy level in the pruned model. To this end, we proposed Maximum Redundancy Pruning (MRP), an iterative pruning algorithm that prunes in the most redundant layers (i.e., those with the highest non-outlier ratio) at each iteration. The achieved layerwise sparsity aligns with the outlined principles. We conducted extensive experiments on publicly available LLMs, including LLaMA2 and OPT, across various benchmarks. Experimental results validate the effectiveness of MRP, demonstrating its superiority over previous methods.
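The iterative loop described in the abstract (repeatedly prune the layer with the highest non-outlier ratio) can be sketched as follows. This is an illustration under loudly stated assumptions, not the authors' implementation: the abstract does not define the non-outlier ratio, so here an outlier is assumed to be a weight whose magnitude exceeds `factor` times the layer's mean magnitude, and plain magnitude pruning stands in for whatever pruning metric the paper uses. All function and parameter names are hypothetical.

```python
def non_outlier_ratio(weights: list[float], factor: float = 5.0) -> float:
    """Fraction of a layer's entries that are NOT outliers.

    Assumption: an outlier has |w| > factor * mean(|w|), with the mean
    taken over the whole layer (pruned zeros included).
    """
    mags = [abs(w) for w in weights]
    mean_mag = sum(mags) / len(mags)
    if mean_mag == 0.0:
        return 0.0  # fully pruned layer: nothing left to remove
    return sum(1 for m in mags if m <= factor * mean_mag) / len(mags)


def prune_smallest(weights: list[float], count: int) -> int:
    """Zero the `count` smallest-magnitude non-zero weights in place."""
    alive = sorted((i for i, w in enumerate(weights) if w != 0.0),
                   key=lambda i: abs(weights[i]))
    for i in alive[:count]:
        weights[i] = 0.0
    return min(count, len(alive))


def max_redundancy_prune(layers: dict[str, list[float]],
                         target_sparsity: float,
                         step: int = 1):
    """Prune the most redundant layer each iteration until the global
    sparsity target is reached; returns (pruned layers, per-layer sparsity)."""
    layers = {name: list(w) for name, w in layers.items()}  # work on copies
    total = sum(len(w) for w in layers.values())
    pruned = sum(1 for w in layers.values() for x in w if x == 0.0)
    while pruned / total < target_sparsity:
        candidates = [n for n, w in layers.items() if any(x != 0.0 for x in w)]
        if not candidates:
            break
        # Most redundant layer = highest non-outlier ratio (assumed definition).
        name = max(candidates, key=lambda n: non_outlier_ratio(layers[n]))
        pruned += prune_smallest(layers[name], step)
    sparsity = {n: sum(1 for x in w if x == 0.0) / len(w)
                for n, w in layers.items()}
    return layers, sparsity
```

In a toy case with one layer of uniform weights and one layer containing a large outlier, the loop removes weights only from the uniform layer, which yields the non-uniform layerwise sparsity that the first principle calls for.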

[AI-34] Latent Embedding Adaptation for Human Preference Alignment in Diffusion Planners

Link: https://arxiv.org/abs/2503.18347
Authors: Wen Zheng Terence Ng,Jianda Chen,Yuan Xu,Tianwei Zhang
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments: 8 pages

Abstract:This work addresses the challenge of personalizing trajectories generated in automated decision-making systems by introducing a resource-efficient approach that enables rapid adaptation to individual users’ preferences. Our method leverages a pretrained conditional diffusion model with Preference Latent Embeddings (PLE), trained on a large, reward-free offline dataset. The PLE serves as a compact representation for capturing specific user preferences. By adapting the pretrained model using our proposed preference inversion method, which directly optimizes the learnable PLE, we achieve superior alignment with human preferences compared to existing solutions like Reinforcement Learning from Human Feedback (RLHF) and Low-Rank Adaptation (LoRA). To better reflect practical applications, we create a benchmark experiment using real human preferences on diverse, high-reward trajectories.

[AI-35] Optimizing Influence Campaigns: Nudging under Bounded Confidence

Link: https://arxiv.org/abs/2503.18331
Authors: Yen-Shao Chen,Tauhid Zaman
Categories: Social and Information Networks (cs.SI); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)

Abstract:Influence campaigns in online social networks are often run by organizations, political parties, and nation states to influence large audiences. These campaigns are employed through the use of agents in the network that share persuasive content. Yet, their impact might be minimal if the audiences remain unswayed, often due to the bounded confidence phenomenon, where only a narrow spectrum of viewpoints can influence them. Here we show that to persuade under bounded confidence, an agent must nudge its targets to gradually shift their opinions. Using a control theory approach, we show how to construct an agent’s nudging policy under the bounded confidence opinion dynamics model and also how to select targets for multiple agents in an influence campaign on a social network. Simulations on real Twitter networks show that a multi-agent nudging policy can shift the mean opinion, decrease opinion polarization, or even increase it. We find that our nudging based policies outperform other common techniques that do not consider the bounded confidence effect. Finally, we show how to craft prompts for large language models, such as ChatGPT, to generate text-based content for real nudging policies. This illustrates the practical feasibility of our approach, allowing one to go from mathematical nudging policies to real social media content.
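The claim that persuasion under bounded confidence requires gradual nudging can be illustrated with a toy simulation. The sketch below assumes a simplified discrete-time bounded-confidence rule (a target moves toward a post only if the post lies within `eps` of its current opinion); it is not the paper's control-theoretic policy, and every name and parameter here (`update_opinion`, `nudge_policy`, `margin`, `mu`) is hypothetical.

```python
def update_opinion(x: float, post: float, eps: float, mu: float) -> float:
    """Bounded-confidence update: the post is ignored if it is too far away."""
    if abs(post - x) <= eps:
        return x + mu * (post - x)
    return x


def direct_policy(x: float, goal: float, eps: float) -> float:
    """Always post the goal opinion, regardless of the target's opinion."""
    return goal


def nudge_policy(x: float, goal: float, eps: float, margin: float = 0.9) -> float:
    """Post slightly inside the target's confidence interval, toward the goal.

    margin < 1 keeps the post safely within eps despite float rounding.
    """
    reach = margin * eps
    if goal >= x:
        return min(goal, x + reach)
    return max(goal, x - reach)


def simulate(x0: float, goal: float, eps: float, mu: float,
             steps: int, policy) -> float:
    """Run the policy for a number of steps and return the final opinion."""
    x = x0
    for _ in range(steps):
        x = update_opinion(x, policy(x, goal, eps), eps, mu)
    return x


if __name__ == "__main__":
    stuck = simulate(0.1, 0.9, eps=0.2, mu=0.5, steps=50, policy=direct_policy)
    moved = simulate(0.1, 0.9, eps=0.2, mu=0.5, steps=50, policy=nudge_policy)
    print(f"direct: {stuck:.3f}, nudging: {moved:.3f}")
```

With these toy parameters the direct policy never moves the target, because the goal lies outside the confidence interval, while the nudging policy walks the target to the goal in small steps.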

[AI-36] LoTUS: Large-Scale Machine Unlearning with a Taste of Uncertainty CVPR2025

Link: https://arxiv.org/abs/2503.18314
Authors: Christoforos N. Spartalis,Theodoros Semertzidis,Stratis Gavves,Petros Daras
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted as a main conference paper at CVPR 2025

Abstract:We present LoTUS, a novel Machine Unlearning (MU) method that eliminates the influence of training samples from pre-trained models, avoiding retraining from scratch. LoTUS smooths the prediction probabilities of the model – up to an information theoretic bound – mitigating its over-confidence that stems from data memorization. We evaluate LoTUS on the Transformer and ResNet18 models, against eight baseline methods, on five public datasets. Beyond established MU benchmarks, we evaluate unlearning on a large-scale dataset (ImageNet1k) which deters retraining, simulating real-world conditions. Moreover, we introduce the novel Retrain-Free Jensen-Shannon Divergence (RF-JSD) metric to enable evaluation under real-world conditions. Experimental results show that LoTUS outperforms state-of-the-art methods in terms of both efficiency and effectiveness. Code: this https URL.
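The abstract names a Retrain-Free Jensen-Shannon Divergence (RF-JSD) metric without defining it. As background only, here is a sketch of the standard Jensen-Shannon divergence between two discrete probability vectors; which distributions RF-JSD actually compares is not stated in the abstract, and the helper names `kl_divergence` and `js_divergence` are hypothetical.

```python
import math


def kl_divergence(p: list[float], q: list[float]) -> float:
    """KL(p || q) in bits; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p: list[float], q: list[float]) -> float:
    """Jensen-Shannon divergence in bits (0 = identical, 1 = disjoint support)."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    # Mixture distribution; m_i > 0 wherever p_i > 0 or q_i > 0.
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)
```

Unlike the KL divergence, this quantity is symmetric and bounded, which makes it convenient for comparing the output distributions of two models on the same inputs.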

[AI-37] DeepFund: Will LLM be Professional at Fund Investment? A Live Arena Perspective

Link: https://arxiv.org/abs/2503.18313
Authors: Changlun Li,Yao Shi,Yuyu Luo,Nan Tang
Categories: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Human-Computer Interaction (cs.HC)
Comments: Work in progress

Abstract:Large Language Models (LLMs) have demonstrated impressive capabilities across various domains, but their effectiveness in financial decision making, particularly in fund investment, remains inadequately evaluated. Current benchmarks primarily assess LLMs' understanding of financial documents rather than their ability to manage assets or analyze trading opportunities in dynamic market conditions. A critical limitation in existing evaluation methodologies is the backtesting approach, which suffers from information leakage when LLMs are evaluated on historical data they may have encountered during pretraining. This paper introduces DeepFund, a comprehensive platform for evaluating LLM-based trading strategies in a simulated live environment. Our approach implements a multi-agent framework where LLMs serve as both analysts and managers, creating a realistic simulation of investment decision making. The platform employs a forward testing methodology that mitigates information leakage by evaluating models on market data released after their training cutoff dates. We provide a web interface that visualizes model performance across different market conditions and investment parameters, enabling detailed comparative analysis. Through DeepFund, we aim to provide a more accurate and fair assessment of LLMs' capabilities in fund investment, offering insights into their potential real-world applications in financial markets.

[AI-38] How to Capture and Study Conversations Between Research Participants and ChatGPT: GPT for Researchers (g4r.org)

Link: https://arxiv.org/abs/2503.18303
Authors: Jin Kim
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)

Abstract:As large language models (LLMs) like ChatGPT become increasingly integrated into our everyday lives–from customer service and education to creative work and personal productivity–understanding how people interact with these AI systems has become a pressing issue. Despite the widespread use of LLMs, researchers lack standardized tools for systematically studying people’s interactions with LLMs. To address this issue, we introduce GPT for Researchers (G4R), or this http URL, a free website that researchers can use to easily create and integrate a GPT Interface into their studies. At this http URL, researchers can (1) enable their study participants to interact with GPT (such as ChatGPT), (2) customize GPT Interfaces to guide participants’ interactions with GPT (e.g., set constraints on topics or adjust GPT’s tone or response style), and (3) capture participants’ interactions with GPT by downloading data on messages exchanged between participants and GPT. By facilitating study participants’ interactions with GPT and providing detailed data on these interactions, G4R can support research on topics such as consumer interactions with AI agents or LLMs, AI-assisted decision-making, and linguistic patterns in human-AI communication. With this goal in mind, we provide a step-by-step guide to using G4R at this http URL.

[AI-39] DiffMove: Group Mobility Tendency Enhanced Trajectory Recovery via Diffusion Model

Link: https://arxiv.org/abs/2503.18302
Authors: Qingyue Long,Can Rong,Huandong Wang,Shaw Rajib,Yong Li
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:In the real world, trajectory data is often sparse and incomplete due to low collection frequencies or limited device coverage. Trajectory recovery aims to recover these missing trajectory points, making the trajectories denser and more complete. However, this task faces two key challenges: 1) The excessive sparsity of individual trajectories makes it difficult to effectively leverage historical information for recovery; 2) Sparse trajectories make it harder to capture complex individual mobility preferences. To address these challenges, we propose a novel method called DiffMove. Firstly, we harness crowd wisdom for trajectory recovery. Specifically, we construct a group tendency graph using the collective trajectories of all users and then integrate the group mobility trends into the location representations via graph embedding. This solves the challenge of sparse trajectories being unable to rely on individual historical trajectories for recovery. Secondly, we capture individual mobility preferences from both historical and current perspectives. Finally, we integrate group mobility tendencies and individual preferences into the spatiotemporal distribution of the trajectory to recover high-quality trajectories. Extensive experiments on two real-world datasets demonstrate that DiffMove outperforms existing state-of-the-art methods. Further analysis validates the robustness of our method.

[AI-40] Risk Management for Distributed Arbitrage Systems: Integrating Artificial Intelligence

Link: https://arxiv.org/abs/2503.18265
Authors: Akaash Vishal Hazarika,Mahak Shah,Swapnil Patil,Pradyumna Shukla
Categories: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: International Conference on AI and Financial Innovation AIFI-2025

Abstract:Effective risk management solutions become absolutely crucial when financial markets embrace distributed technology and decentralized financing (DeFi). This study offers a thorough survey and comparative analysis of the integration of artificial intelligence (AI) in risk management for distributed arbitrage systems. We examine several modern caching techniques namely in memory caching, distributed caching, and proxy caching and their functions in enhancing performance in decentralized settings. Through literature review we examine the utilization of AI techniques for alleviating risks related to market volatility, liquidity challenges, operational failures, regulatory compliance, and security threats. This comparison research evaluates various case studies from prominent DeFi technologies, emphasizing critical performance metrics like latency reduction, load balancing, and system resilience. Additionally, we examine the problems and trade offs associated with these technologies, emphasizing their effects on consistency, scalability, and fault tolerance. By meticulously analyzing real world applications, specifically centering on the Aave platform as our principal case study, we illustrate how the purposeful amalgamation of AI with contemporary caching methodologies has revolutionized risk management in distributed arbitrage systems.

[AI-41] Severing Spurious Correlations with Data Pruning ICLR2025

Link: https://arxiv.org/abs/2503.18258
Authors: Varun Mulchandani, Jung-Eun Kim
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: ICLR 2025, Spotlight

Abstract:Deep neural networks have been shown to learn and rely on spurious correlations present in the data that they are trained on. Reliance on such correlations can cause these networks to malfunction when deployed in the real world, where these correlations may no longer hold. To overcome the learning of and reliance on such correlations, recent studies propose approaches that yield promising results. These works, however, study settings where the strength of the spurious signal is significantly greater than that of the core, invariant signal, making it easier to detect the presence of spurious features in individual training samples and allowing for further processing. In this paper, we identify new settings where the strength of the spurious signal is relatively weaker, making it difficult to detect any spurious information while continuing to have catastrophic consequences. We also discover that spurious correlations are learned primarily due to only a handful of all the samples containing the spurious feature and develop a novel data pruning technique that identifies and prunes small subsets of the training data that contain these samples. Our proposed technique does not require inferred domain knowledge, information regarding the sample-wise presence or nature of spurious information, or human intervention. Finally, we show that such data pruning attains state-of-the-art performance on previously studied settings where spurious information is identifiable.

[AI-42] The Human-Machine Identity Blur: A Unified Framework for Cybersecurity Risk Management in 2025

Link: https://arxiv.org/abs/2503.18255
Authors: Kush Janani
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: 9 pages, 8 figures

Abstract:The modern enterprise is facing an unprecedented surge in digital identities, with machine identities now significantly outnumbering human identities. This paper examines the cybersecurity risks emerging from what we define as the “human-machine identity blur” - the point at which human and machine identities intersect, delegate authority, and create new attack surfaces. Drawing from industry data, expert insights, and real-world incident analysis, we identify key governance gaps in current identity management models that treat human and machine entities as separate domains. To address these challenges, we propose a Unified Identity Governance Framework based on four core principles: treating identity as a continuum rather than a binary distinction, applying consistent risk evaluation across all identity types, implementing continuous verification guided by zero trust principles, and maintaining governance throughout the entire identity lifecycle. Our research shows that organizations adopting this unified approach experience a 47 percent reduction in identity-related security incidents and a 62 percent improvement in incident response time. We conclude by offering a practical implementation roadmap and outlining future research directions as AI-driven systems become increasingly autonomous.

[AI-43] Collaborating with AI Agents: Field Experiments on Teamwork Productivity and Performance

Link: https://arxiv.org/abs/2503.18238
Authors: Harang Ju, Sinan Aral
Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments: 56 pages, 8 figures

Abstract:To uncover how AI agents change productivity, performance, and work processes, we introduce MindMeld: an experimentation platform enabling humans and AI agents to collaborate in integrative workspaces. In a large-scale marketing experiment on the platform, 2310 participants were randomly assigned to human-human and human-AI teams, with randomized AI personality traits. The teams exchanged 183,691 messages, and created 63,656 image edits, 1,960,095 ad copy edits, and 10,375 AI-generated images while producing 11,138 ads for a large think tank. Analysis of fine-grained communication, collaboration, and workflow logs revealed that collaborating with AI agents increased communication by 137% and allowed humans to focus 23% more on text and image content generation messaging and 20% less on direct text editing. Humans on Human-AI teams sent 23% fewer social messages, creating 60% greater productivity per worker and higher-quality ad copy. In contrast, human-human teams produced higher-quality images, suggesting that AI agents require fine-tuning for multimodal workflows. AI personality prompt randomization revealed that AI traits can complement human personalities to enhance collaboration. For example, conscientious humans paired with open AI agents improved image quality, while extroverted humans paired with conscientious AI agents reduced the quality of text, images, and clicks. In field tests of ad campaigns with ~5M impressions, ads with higher image quality produced by human collaborations and higher text quality produced by AI collaborations performed significantly better on click-through rate and cost per click metrics. Overall, ads created by human-AI teams performed similarly to those created by human-human teams. Together, these results suggest AI agents can improve teamwork and productivity, especially when tuned to complement human traits.

[AI-44] Adaptive Multi-Fidelity Reinforcement Learning for Variance Reduction in Engineering Design Optimization

Link: https://arxiv.org/abs/2503.18229
Authors: Akash Agrawal(1), Christopher McComb(1) ((1) Carnegie Mellon University)
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Multi-fidelity Reinforcement Learning (RL) frameworks efficiently utilize computational resources by integrating analysis models of varying accuracy and costs. The prevailing methodologies, characterized by transfer learning, human-inspired strategies, control variate techniques, and adaptive sampling, predominantly depend on a structured hierarchy of models. However, this reliance on a model hierarchy can exacerbate variance in policy learning when the underlying models exhibit heterogeneous error distributions across the design space. To address this challenge, this work proposes a novel adaptive multi-fidelity RL framework, in which multiple heterogeneous, non-hierarchical low-fidelity models are dynamically leveraged alongside a high-fidelity model to efficiently learn a high-fidelity policy. Specifically, low-fidelity policies and their experience data are adaptively used for efficient targeted learning, guided by their alignment with the high-fidelity policy. The effectiveness of the approach is demonstrated in an octocopter design optimization problem, utilizing two low-fidelity models alongside a high-fidelity simulator. The results demonstrate that the proposed approach substantially reduces variance in policy learning, leading to improved convergence and consistent high-quality solutions relative to traditional hierarchical multi-fidelity RL methods. Moreover, the framework eliminates the need for manually tuning model usage schedules, which can otherwise introduce significant computational overhead. This positions the framework as an effective variance-reduction strategy for multi-fidelity RL, while also mitigating the computational and operational burden of manual fidelity scheduling.

[AI-45] A Study on Neuro-Symbolic Artificial Intelligence: Healthcare Perspectives

Link: https://arxiv.org/abs/2503.18213
Authors: Delower Hossain, Jake Y Chen
Subjects: Artificial Intelligence (cs.AI)
Comments: 18 pages

Abstract:Over the last few decades, Artificial Intelligence (AI) scientists have been conducting investigations to attain human-level performance by a machine in accomplishing a cognitive task. Within machine learning, the ultimate aspiration is to attain Artificial General Intelligence (AGI) through a machine. This pursuit has led to the exploration of two distinct AI paradigms. Symbolic AI, also known as classical or GOFAI (Good Old-Fashioned AI) and Connectionist (Sub-symbolic) AI, represented by Neural Systems, are two mutually exclusive paradigms. Symbolic AI excels in reasoning, explainability, and knowledge representation but faces challenges in processing complex real-world data with noise. Conversely, deep learning (Black-Box systems) research breakthroughs in neural networks are notable, yet they lack reasoning and interpretability. Neuro-symbolic AI (NeSy), an emerging area of AI research, attempts to bridge this gap by integrating logical reasoning into neural networks, enabling them to learn and reason with symbolic representations. While a long path, this strategy has made significant progress towards achieving common sense reasoning by systems. This article conducts an extensive review of over 977 studies from prominent scientific databases (DBLP, ACL, IEEExplore, Scopus, PubMed, ICML, ICLR), thoroughly examining the multifaceted capabilities of Neuro-Symbolic AI, with a particular focus on its healthcare applications, particularly in drug discovery, and Protein engineering research. The survey addresses vital themes, including reasoning, explainability, integration strategies, 41 healthcare-related use cases, benchmarking, datasets, current approach limitations from both healthcare and broader perspectives, and proposed novel approaches for future experiments.

[AI-46] ViVa: Video-Trained Value Functions for Guiding Online RL from Diverse Data

Link: https://arxiv.org/abs/2503.18210
Authors: Nitish Dashora, Dibya Ghosh, Sergey Levine
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Online reinforcement learning (RL) with sparse rewards poses a challenge partly because of the lack of feedback on states leading to the goal. Furthermore, expert offline data with reward signal is rarely available to provide this feedback and bootstrap online learning. How can we guide online agents to the right solution without this on-task data? Reward shaping offers a solution by providing fine-grained signal to nudge the policy towards the optimal solution. However, reward shaping often requires domain knowledge to hand-engineer heuristics for a specific goal. To enable more general and inexpensive guidance, we propose and analyze a data-driven methodology that automatically guides RL by learning from widely available video data such as Internet recordings, off-task demonstrations, task failures, and undirected environment interaction. By learning a model of optimal goal-conditioned value from diverse passive data, we open the floor to scaling up and using various data sources to model general goal-reaching behaviors relevant to guiding online RL. Specifically, we use intent-conditioned value functions to learn from diverse videos and incorporate these goal-conditioned values into the reward. Our experiments show that video-trained value functions work well with a variety of data sources, exhibit positive transfer from human video pre-training, can generalize to unseen goals, and scale with dataset size.

[AI-47] FROG: Fair Removal on Graphs

Link: https://arxiv.org/abs/2503.18197
Authors: Ziheng Chen, Jiali Cheng, Gabriele Tolomei, Sijia Liu, Hadi Amiri, Yu Wang, Kaushiki Nag, Lu Lin
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:As compliance with privacy regulations becomes increasingly critical, the growing demand for data privacy has highlighted the significance of machine unlearning in many real-world applications, such as social networks and recommender systems, many of which can be represented as graph-structured data. However, existing graph unlearning algorithms indiscriminately modify edges or nodes from well-trained models without considering the potential impact of such structural modifications on fairness. For example, forgetting links between nodes with different genders in a social network may exacerbate group disparities, leading to significant fairness concerns. To address these challenges, we propose a novel approach that jointly optimizes the graph structure and the corresponding model for fair unlearning tasks. Specifically, our approach rewires the graph to enhance unlearning efficiency by removing redundant edges that hinder forgetting while preserving fairness through targeted edge augmentation. Additionally, we introduce a worst-case evaluation mechanism to assess the reliability of fair unlearning performance. Extensive experiments on real-world datasets demonstrate the effectiveness of the proposed approach in achieving superior unlearning outcomes.

[AI-48] Exploring Energy Landscapes for Minimal Counterfactual Explanations: Applications in Cybersecurity and Beyond

Link: https://arxiv.org/abs/2503.18185
Authors: Spyridon Evangelatos, Eleni Veroni, Vasilis Efthymiou, Christos Nikolopoulos, Georgios Th. Papadopoulos, Panagiotis Sarigiannidis
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:Counterfactual explanations have emerged as a prominent method in Explainable Artificial Intelligence (XAI), providing intuitive and actionable insights into Machine Learning model decisions. In contrast to other traditional feature attribution methods that assess the importance of input variables, counterfactual explanations focus on identifying the minimal changes required to alter a model’s prediction, offering a “what-if” analysis that is close to human reasoning. In the context of XAI, counterfactuals enhance transparency, trustworthiness and fairness, offering explanations that are not just interpretable but directly applicable in the decision-making processes. In this paper, we present a novel framework that integrates perturbation theory and statistical mechanics to generate minimal counterfactual explanations in explainable AI. We employ a local Taylor expansion of a Machine Learning model’s predictive function and reformulate the counterfactual search as an energy minimization problem over a complex landscape. In sequence, we model the probability of candidate perturbations leveraging the Boltzmann distribution and use simulated annealing for iterative refinement. Our approach systematically identifies the smallest modifications required to change a model’s prediction while maintaining plausibility. Experimental results on benchmark datasets for cybersecurity in Internet of Things environments demonstrate that our method provides actionable, interpretable counterfactuals and offers deeper insights into model sensitivity and decision boundaries in high-dimensional spaces.
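
The idea of treating counterfactual search as energy minimization with Boltzmann acceptance can be illustrated with a minimal sketch. Everything here is an assumption for illustration and not the paper's implementation: the linear classifier `predict`, the energy definition (perturbation size plus a fixed penalty when the label is not flipped), the Gaussian proposal, and the linear cooling schedule are all hypothetical choices.

```python
import math
import random


def predict(x, w, b):
    """Hypothetical linear classifier: returns 1 if w.x + b > 0, else 0."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0


def energy(delta, x, w, b, target, penalty=10.0):
    """Assumed energy: L2 size of the perturbation, plus a penalty
    when the perturbed input is not classified as the target label."""
    size = math.sqrt(sum(d * d for d in delta))
    x_cf = [xi + di for xi, di in zip(x, delta)]
    miss = 0.0 if predict(x_cf, w, b) == target else penalty
    return size + miss


def anneal_counterfactual(x, w, b, target, steps=5000, t0=1.0, seed=0):
    """Simulated annealing over perturbations with Boltzmann acceptance."""
    rng = random.Random(seed)
    delta = [0.0] * len(x)
    e = energy(delta, x, w, b, target)
    best, best_e = list(delta), e
    for k in range(steps):
        temp = t0 * (1.0 - k / steps) + 1e-3  # linear cooling, never zero
        cand = [d + rng.gauss(0.0, 0.1) for d in delta]
        e_new = energy(cand, x, w, b, target)
        # Always accept downhill moves; accept uphill moves
        # with probability exp(-(E_new - E) / T).
        if e_new <= e or rng.random() < math.exp(-(e_new - e) / temp):
            delta, e = cand, e_new
            if e < best_e:
                best, best_e = list(delta), e
    return best, best_e


if __name__ == "__main__":
    x, w, b = [1.0, 1.0], [1.0, -2.0], 0.0  # predicted label is 0
    delta, e = anneal_counterfactual(x, w, b, target=1)
    print("perturbation:", delta, "energy:", e)
```

For a linear model the smallest flipping perturbation has length |w.x + b| / ||w||, which gives a simple sanity check on the value the search returns.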

[AI-49] Adaptive Physics-informed Neural Networks: A Survey

Link: https://arxiv.org/abs/2503.18181
Authors: Edgar Torres, Jonathan Schiefer, Mathias Niepert
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: this https URL

Abstract:Physics-informed neural networks (PINNs) have emerged as a promising approach to solving partial differential equations (PDEs) using neural networks, particularly in data-scarce scenarios, due to their unsupervised training capability. However, limitations related to convergence and the need for re-optimization with each change in PDE parameters hinder their widespread adoption across scientific and engineering applications. This survey reviews existing research that addresses these limitations through transfer learning and meta-learning. The covered methods improve the training efficiency, allowing faster adaptation to new PDEs with fewer data and computational resources. While traditional numerical methods solve systems of differential equations directly, neural networks learn solutions implicitly by adjusting their parameters. One notable advantage of neural networks is their ability to abstract away from specific problem domains, allowing them to retain, discard, or adapt learned representations to efficiently address similar problems. By exploring the application of these techniques to PINNs, this survey identifies promising directions for future research to facilitate the broader adoption of PINNs in a wide range of scientific and engineering applications.

[AI-50] Strategic Prompt Pricing for AIGC Services: A User-Centric Approach

Link: https://arxiv.org/abs/2503.18168
Authors: Xiang Li, Bing Luo, Jianwei Huang, Yuan Luo
Subjects: Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE)
Comments: accepted in WiOpt 2025

Abstract:The rapid growth of AI-generated content (AIGC) services has created an urgent need for effective prompt pricing strategies, yet current approaches overlook users’ strategic two-step decision-making process in selecting and utilizing generative AI models. This oversight creates two key technical challenges: quantifying the relationship between user prompt capabilities and generation outcomes, and optimizing platform payoff while accounting for heterogeneous user behaviors. We address these challenges by introducing prompt ambiguity, a theoretical framework that captures users’ varying abilities in prompt engineering, and developing an Optimal Prompt Pricing (OPP) algorithm. Our analysis reveals a counterintuitive insight: users with higher prompt ambiguity (i.e., lower capability) exhibit non-monotonic prompt usage patterns, first increasing then decreasing with ambiguity levels, reflecting complex changes in marginal utility. Experimental evaluation using a character-level GPT-like model demonstrates that our OPP algorithm achieves up to 31.72% improvement in platform payoff compared to existing pricing mechanisms, validating the importance of user-centric prompt pricing in AIGC services.

[AI-51] Adoption of Watermarking for Generative AI Systems in Practice and Implications under the new EU AI Act

Link: https://arxiv.org/abs/2503.18156
Authors: Bram Rijsbosch, Gijs van Dijck, Konrad Kollnig
Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments: 12 pages, 7 figures, note that this work has not been published in a peer reviewed venue yet. While we have made our best effort to ensure the validity of our findings, it is therefore still work in progress and potentially subject to change

Abstract:AI-generated images have become so good in recent years that individuals cannot distinguish them any more from “real” images. This development creates a series of societal risks, and challenges our perception of what is true and what is not, particularly with the emergence of “deep fakes” that impersonate real individuals. Watermarking, a technique that involves embedding identifying information within images to indicate their AI-generated nature, has emerged as a primary mechanism to address the risks posed by AI-generated images. The implementation of watermarking techniques is now becoming a legal requirement in many jurisdictions, including under the new 2024 EU AI Act. Despite the widespread use of AI image generation systems, the current status of watermarking implementation remains largely unexamined. Moreover, the practical implications of the AI Act’s watermarking requirements have not previously been studied. The present paper therefore both provides an empirical analysis of 50 of the most widely used AI systems for image generation, and embeds this empirical analysis into a legal analysis of the AI Act. We identify four categories of generative AI image systems relevant under the AI Act, outline the legal obligations for each category, and find that only a minority number of providers currently implement adequate watermarking practices.

[AI-52] Mitigating Reward Over-Optimization in RLHF via Behavior-Supported Regularization ICLR2025

Link: https://arxiv.org/abs/2503.18130
Authors: Juntao Dai, Taiye Chen, Yaodong Yang, Qian Zheng, Gang Pan
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Published as a conference paper at ICLR 2025

Abstract:Reinforcement learning from human feedback (RLHF) is an effective method for aligning large language models (LLMs) with human values. However, reward over-optimization remains an open challenge leading to discrepancies between the performance of LLMs under the reward model and the true human objectives. A primary contributor to reward over-optimization is the extrapolation error that arises when the reward model evaluates out-of-distribution (OOD) responses. However, current methods still fail to prevent the increasing frequency of OOD response generation during the reinforcement learning (RL) process and are not effective at handling extrapolation errors from OOD responses. In this work, we propose the Behavior-Supported Policy Optimization (BSPO) method to mitigate the reward over-optimization issue. Specifically, we define behavior policy as the next token distribution of the reward training dataset to model the in-distribution (ID) region of the reward model. Building on this, we introduce the behavior-supported Bellman operator to regularize the value function, penalizing all OOD values without impacting the ID ones. Consequently, BSPO reduces the generation of OOD responses during the RL process, thereby avoiding overestimation caused by the reward model’s extrapolation errors. Theoretically, we prove that BSPO guarantees a monotonic improvement of the supported policy until convergence to the optimal behavior-supported policy. Empirical results from extensive experiments show that BSPO outperforms baselines in preventing reward over-optimization due to OOD evaluation and finding the optimal ID policy.

[AI-53] Decision from Suboptimal Classifiers: Excess Risk Pre- and Post-Calibration

Link: https://arxiv.org/abs/2503.18025
Authors: Alexandre Perez-Lebel, Gael Varoquaux, Sanmi Koyejo, Matthieu Doutreligne, Marine Le Morvan
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)

Abstract:Probabilistic classifiers are central for making informed decisions under uncertainty. Based on the maximum expected utility principle, optimal decision rules can be derived using the posterior class probabilities and misclassification costs. Yet, in practice only learned approximations of the oracle posterior probabilities are available. In this work, we quantify the excess risk (a.k.a. regret) incurred using approximate posterior probabilities in batch binary decision-making. We provide analytical expressions for miscalibration-induced regret ($R^{\mathrm{CL}}$), as well as tight and informative upper and lower bounds on the regret of calibrated classifiers ($R^{\mathrm{GL}}$). These expressions allow us to identify regimes where recalibration alone addresses most of the regret, and regimes where the regret is dominated by the grouping loss, which calls for post-training beyond recalibration. Crucially, both $R^{\mathrm{CL}}$ and $R^{\mathrm{GL}}$ can be estimated in practice using a calibration curve and a recent grouping loss estimator. On NLP experiments, we show that these quantities identify when the expected gain of more advanced post-training is worth the operational cost. Finally, we highlight the potential of multicalibration approaches as efficient alternatives to costlier fine-tuning approaches.
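
To make the notion of regret from approximate posteriors concrete, here is a small sketch of the standard cost-sensitive decision rule for a single binary decision. It assumes zero cost for correct decisions and uses hypothetical names (`bayes_threshold`, `expected_cost`, `regret`). It shows only the pointwise regret and is not the paper's analytical expressions or bounds.

```python
def bayes_threshold(cost_fp, cost_fn):
    """Expected-cost-minimizing threshold on P(y=1 | x),
    assuming correct decisions cost nothing."""
    return cost_fp / (cost_fp + cost_fn)


def expected_cost(decide_positive, p_true, cost_fp, cost_fn):
    """Expected cost of a decision under the true posterior p_true."""
    if decide_positive:
        return (1.0 - p_true) * cost_fp  # pays only if the label is 0
    return p_true * cost_fn              # pays only if the label is 1


def regret(p_est, p_true, cost_fp, cost_fn):
    """Extra expected cost from thresholding the estimate p_est
    instead of the true posterior p_true."""
    t = bayes_threshold(cost_fp, cost_fn)
    acted = expected_cost(p_est >= t, p_true, cost_fp, cost_fn)
    optimal = expected_cost(p_true >= t, p_true, cost_fp, cost_fn)
    return acted - optimal


if __name__ == "__main__":
    # A false negative costs 4x a false positive, so the threshold is 0.2.
    print(bayes_threshold(1.0, 4.0))
    # Underestimating 0.3 as 0.1 flips the decision and incurs regret.
    print(regret(0.1, 0.3, 1.0, 4.0))
    # An estimate on the same side of the threshold incurs none.
    print(regret(0.25, 0.3, 1.0, 4.0))
```

The sketch shows that an estimation error only costs something when it moves the estimate across the decision threshold.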

[AI-54] Lost in Cultural Translation: Do LLMs Struggle with Math Across Cultural Contexts?

Link: https://arxiv.org/abs/2503.18018
Authors: Aabid Karim, Abdul Karim, Bhoomika Lohana, Matt Keon, Jaswinder Singh, Abdul Sattar
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:Large Language Models (LLMs) have significantly advanced various fields, particularly coding, mathematical reasoning, and logical problem solving. However, a critical question remains: Do these mathematical reasoning abilities persist when LLMs are presented with culturally adapted math problems? Specifically, how do LLMs perform when faced with math problems embedded in cultural contexts that have no significant representation in mainstream web-scale AI training data? To explore this, we generated six synthetic cultural datasets from GSM8K, a widely used benchmark for assessing LLMs’ mathematical reasoning skills. While preserving the mathematical logic and numerical values of the original GSM8K test set, we modify cultural elements such as personal names, food items, place names, etc. These culturally adapted datasets provide a more reliable framework for evaluating LLMs’ mathematical reasoning under shifting cultural contexts. Our findings reveal that LLMs struggle with math problems when cultural references change, even though the underlying mathematical structure remains constant. Smaller models exhibit greater performance drops compared to larger models. Interestingly, our results also suggest that cultural familiarity can enhance mathematical reasoning. Even models with no explicit mathematical training but exposure to relevant cultural contexts sometimes outperform larger, mathematically proficient models on culturally embedded math problems. This study highlights the impact of cultural context on the mathematical reasoning abilities of LLMs, underscoring the need for more diverse and representative training data to improve robustness in real-world applications. The benchmark datasets and script for reproducing the results are available at this https URL

[AI-55] Neuromorphic Principles for Efficient Large Language Models on Intel Loihi 2

Link: https://arxiv.org/abs/2503.18002
Authors: Steven Abreu, Sumit Bam Shrestha, Rui-Jie Zhu, Jason Eshraghian
Subjects: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR); Machine Learning (cs.LG)
Comments: Under review

Abstract:Large language models (LLMs) deliver impressive performance but require large amounts of energy. In this work, we present a MatMul-free LLM architecture adapted for Intel’s neuromorphic processor, Loihi 2. Our approach leverages Loihi 2’s support for low-precision, event-driven computation and stateful processing. Our hardware-aware quantized model on GPU demonstrates that a 370M parameter MatMul-free model can be quantized with no accuracy loss. Based on preliminary results, we report up to 3x higher throughput with 2x less energy, compared to transformer-based LLMs on an edge GPU, with significantly better scaling. Further hardware optimizations will increase throughput and decrease energy consumption. These results show the potential of neuromorphic hardware for efficient inference and pave the way for efficient reasoning models capable of generating complex, long-form text rapidly and cost-effectively.

[AI-56] Predicting Multitasking in Manual and Automated Driving with Optimal Supervisory Control

Link: https://arxiv.org/abs/2503.17993
Authors: Jussi Jokinen, Patrick Ebel, Tuomo Kujala
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:Modern driving involves interactive technologies that can divert attention, increasing the risk of accidents. This paper presents a computational cognitive model that simulates human multitasking while driving. Based on optimal supervisory control theory, the model predicts how multitasking adapts to variations in driving demands, interactive tasks, and automation levels. Unlike previous models, it accounts for context-dependent multitasking across different degrees of driving automation. The model predicts longer in-car glances on straight roads and shorter glances during curves. It also anticipates increased glance durations with driver aids such as lane-centering assistance and their interaction with environmental demands. Validated against two empirical datasets, the model offers insights into driver multitasking amid evolving in-car technologies and automation.

[AI-57] Optimizing Navigation And Chemical Application in Precision Agriculture With Deep Reinforcement Learning And Conditional Action Tree

Link: https://arxiv.org/abs/2503.17985
Authors: Mahsa Khosravi(1), Zhanhong Jiang(2), Joshua R Waite(2), Sarah Jonesc, Hernan Torres(3), Arti Singh(3), Baskar Ganapathysubramanian(2), Asheesh Kumar Singh(3), Soumik Sarkar(2) ((1) Department of Industrial and Manufacturing Systems Engineering, Iowa State University, Ames, Iowa, USA, (2) Department of Mechanical Engineering, Iowa State University, Ames, Iowa, USA, (3) Department of Agronomy, Iowa State University, Ames, Iowa, USA)
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: 32 pages, 9 figures

Abstract:This paper presents a novel reinforcement learning (RL)-based planning scheme for optimized robotic management of biotic stresses in precision agriculture. The framework employs a hierarchical decision-making structure with conditional action masking, where high-level actions direct the robot’s exploration, while low-level actions optimize its navigation and efficient chemical spraying in affected areas. The key objectives of optimization include improving the coverage of infected areas with limited battery power and reducing chemical usage, thus preventing unnecessary spraying of healthy areas of the field. Our numerical experimental results demonstrate that the proposed method, Hierarchical Action Masking Proximal Policy Optimization (HAM-PPO), significantly outperforms baseline practices, such as LawnMower navigation + indiscriminate spraying (Carpet Spray), in terms of yield recovery and resource efficiency. HAM-PPO consistently achieves higher yield recovery percentages and lower chemical costs across a range of infection scenarios. The framework also exhibits robustness to observation noise and generalizability under diverse environmental conditions, adapting to varying infection ranges and spatial distribution patterns.

[AI-58] Dynamic Gradient Sparse Update for Edge Training ISCAS2024

Link: https://arxiv.org/abs/2503.17959
Authors: I-Hsuan Li, Tian-Sheuan Chang
Subjects: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI)
Comments: published in IEEE International Symposium on Circuits and Systems (IEEE ISCAS 2024)

Abstract:Training on edge devices enables personalized model fine-tuning to enhance real-world performance and maintain data privacy. However, the gradient computation for backpropagation in the training requires significant memory buffers to store intermediate features and compute losses. This is unacceptable for memory-constrained edge devices such as microcontrollers. To tackle this issue, we propose a training acceleration method using dynamic gradient sparse updates. This method updates the important channels and layers only and skips gradient computation for the less important channels and layers to reduce memory usage for each update iteration. In addition, the channel selection is dynamic for different iterations to traverse most of the parameters in the update layers along the time dimension for better performance. The experimental result shows that the proposed method enables an ImageNet pre-trained MobileNetV2 trained on CIFAR-10 to achieve an accuracy of 85.77% while updating only 2% of convolution weights within 256KB on-chip memory. This results in a remarkable 98% reduction in feature memory usage compared to dense model training.

[AI-59] WLB-LLM: Workload-Balanced 4D Parallelism for Large Language Model Training

Link: https://arxiv.org/abs/2503.17924
Authors: Zheng Wang, Anna Cai, Xinfeng Xie, Zaifeng Pan, Yue Guan, Weiwei Chu, Jie Wang, Shikai Li, Jianyu Huang, Chris Cai, Yuchen Hao, Yufei Ding
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 12 pages, 16 figures

Abstract:In this work, we present WLB-LLM, a workLoad-balanced 4D parallelism for large language model training. We first thoroughly analyze the workload imbalance issue in LLM training and identify two primary sources of imbalance at the pipeline parallelism and context parallelism levels. Then, to address the imbalance issue, at the pipeline parallelism level, WLB-LLM incorporates a workload-aware variable-length document packing method to balance the computation and communication workload across micro-batches. Additionally, at the context parallelism level, WLB-LLM introduces a novel fine-grained per-document sharding strategy, ensuring each worker within a context parallelism group has an identical workload. Comprehensive experiments under different model scales demonstrate that WLB-LLM significantly mitigates the workload imbalance during 4D parallelism LLM training and achieves an average speedup of 1.23x when applying WLB-LLM in our internal LLM training framework.
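
As a rough illustration of why variable-length document packing affects balance across micro-batches, the sketch below assigns documents greedily to the least-loaded micro-batch. This is a generic longest-first heuristic, not WLB-LLM's packing method. The quadratic cost model in `attention_workload` and the function names are assumptions, and the sketch ignores any token-capacity limit per micro-batch.

```python
import heapq


def attention_workload(length):
    """Assumed cost model: attention work grows quadratically
    with document length."""
    return length * length


def pack_balanced(doc_lengths, num_bins):
    """Greedy packing: place the heaviest remaining document
    into the micro-batch with the smallest current workload."""
    heap = [(0, i) for i in range(num_bins)]  # (current load, bin index)
    heapq.heapify(heap)
    bins = [[] for _ in range(num_bins)]
    for length in sorted(doc_lengths, reverse=True):
        load, i = heapq.heappop(heap)
        bins[i].append(length)
        heapq.heappush(heap, (load + attention_workload(length), i))
    loads = [sum(attention_workload(n) for n in b) for b in bins]
    return bins, loads


if __name__ == "__main__":
    bins, loads = pack_balanced([8, 1, 1, 1, 1, 4, 4], num_bins=2)
    print(bins)   # documents assigned to each micro-batch
    print(loads)  # estimated workload of each micro-batch
```

In this example both micro-batches hold a similar number of tokens (8 versus 12), yet the estimated workloads differ (64 versus 36) because one long document dominates under the quadratic cost model.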

[AI-60] GLADMamba: Unsupervised Graph-Level Anomaly Detection Powered by Selective State Space Model

Link: https://arxiv.org/abs/2503.17903
Authors: Yali Fu, Jindong Li, Qi Wang, Qianli Xing
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Unsupervised graph-level anomaly detection (UGLAD) is a critical and challenging task across various domains, such as social network analysis, anti-cancer drug discovery, and toxic molecule identification. However, existing methods often struggle to capture the long-range dependencies efficiently and neglect the spectral information. Recently, selective State Space Models (SSMs), particularly Mamba, have demonstrated remarkable advantages in capturing long-range dependencies with linear complexity and a selection mechanism. Motivated by their success across various domains, we propose GLADMamba, a novel framework that adapts the selective state space model into UGLAD field. We design View-Fused Mamba (VFM) with a Mamba-Transformer-style architecture to efficiently fuse information from different views with a selective state mechanism. We also design Spectrum-Guided Mamba (SGM) with a Mamba-Transformer-style architecture to leverage the Rayleigh quotient to guide the embedding refining process. GLADMamba can dynamically focus on anomaly-related information while discarding irrelevant information for anomaly detection. To the best of our knowledge, this is the first work to introduce Mamba and explicit spectral information to UGLAD. Extensive experiments on 12 real-world datasets demonstrate that GLADMamba outperforms existing state-of-the-art methods, achieving superior performance in UGLAD. The code is available at this https URL.

[AI-61] Reasoning with LLMs for Zero-Shot Vulnerability Detection

Link: https://arxiv.org/abs/2503.17885
Authors: Arastoo Zibaeirad, Marco Vieira
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)

Abstract:Automating software vulnerability detection (SVD) remains a critical challenge in an era of increasingly complex and interdependent software systems. Despite significant advances in Large Language Models (LLMs) for code analysis, prevailing evaluation methodologies often lack the context-aware robustness necessary to capture real-world intricacies and cross-component interactions. To address these limitations, we present VulnSage, a comprehensive evaluation framework and a dataset curated from diverse, large-scale open-source system software projects developed in C/C++. Unlike prior datasets, it leverages a heuristic noise pre-filtering approach combined with LLM-based reasoning to ensure a representative and minimally noisy spectrum of vulnerabilities. The framework supports multi-granular analysis across function, file, and inter-function levels and employs four diverse zero-shot prompt strategies: Baseline, Chain-of-Thought, Think, and Think Verify. Through this evaluation, we uncover that structured reasoning prompts substantially improve LLM performance, with Think Verify reducing ambiguous responses from 20.3% to 9.1% while increasing accuracy. We further demonstrate that code-specialized models consistently outperform general-purpose alternatives, with performance varying significantly across vulnerability types, revealing that no single approach universally excels across all security contexts. Link to dataset and codes: this https URL

[AI-62] Detecting and Mitigating DDoS Attacks with AI: A Survey

Link: https://arxiv.org/abs/2503.17867
Authors: Alexandru Apostu, Silviu Gheorghe, Andrei Hîji, Nicolae Cleju, Andrei Pătraşcu, Cristian Rusu, Radu Ionescu, Paul Irofti
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI)

Abstract:Distributed Denial of Service attacks represent an active cybersecurity research problem. Recent research shifted from static rule-based defenses towards AI-based detection and mitigation. This comprehensive survey covers several key topics. Preeminently, state-of-the-art AI detection methods are discussed. An in-depth taxonomy based on manual expert hierarchies and an AI-generated dendrogram are provided, thus settling DDoS categorization ambiguities. An important discussion on available datasets follows, covering data format options and their role in training AI detection methods together with adversarial training and examples augmentation. Beyond detection, AI based mitigation techniques are surveyed as well. Finally, multiple open research directions are proposed.

[AI-63] Adapt, Agree, Aggregate: Semi-Supervised Ensemble Labeling for Graph Convolutional Networks

Link: https://arxiv.org/abs/2503.17842
Authors: Maryam Abdolali, Romina Zakerian, Behnam Roshanfekr, Fardin Ayar, Mohammad Rahmati
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:In this paper, we propose a novel framework that combines ensemble learning with augmented graph structures to improve the performance and robustness of semi-supervised node classification in graphs. By creating multiple augmented views of the same graph, our approach harnesses the “wisdom of a diverse crowd”, mitigating the challenges posed by noisy graph structures. Leveraging ensemble learning allows us to simultaneously achieve three key goals: adaptive confidence threshold selection based on model agreement, dynamic determination of the number of high-confidence samples for training, and robust extraction of pseudo-labels to mitigate confirmation bias. Our approach uniquely integrates adaptive ensemble consensus to flexibly guide pseudo-label extraction and sample selection, reducing the risks of error accumulation and improving robustness. Furthermore, the use of ensemble-driven consensus for pseudo-labeling captures subtle patterns that individual models often overlook, enabling the model to generalize better. Experiments on several real-world datasets demonstrate the effectiveness of our proposed method.
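
To make the idea of agreement-driven pseudo-labeling concrete, here is a minimal sketch under simplifying assumptions; it is not the paper's algorithm. It assumes each ensemble member has already produced one hard label per node, and it uses a fixed agreement threshold (`min_agreement`, a hypothetical parameter), whereas the abstract describes an adaptive one. A node receives a pseudo-label only when a large enough share of the models vote for the same class.

```python
from collections import Counter


def consensus_pseudo_labels(predictions, min_agreement=1.0):
    """Select pseudo-labels by ensemble agreement.

    predictions: list of per-model label lists, all of equal length
                 (one predicted label per node).
    Returns {node_index: label} for nodes whose majority label has a
    vote share of at least min_agreement.
    """
    num_models = len(predictions)
    selected = {}
    for node, votes in enumerate(zip(*predictions)):
        label, count = Counter(votes).most_common(1)[0]
        if count / num_models >= min_agreement:
            selected[node] = label
    return selected


if __name__ == "__main__":
    preds = [
        [0, 1, 2, 1],  # model trained on augmented view 1
        [0, 1, 0, 2],  # model trained on augmented view 2
        [0, 2, 2, 1],  # model trained on augmented view 3
    ]
    print(consensus_pseudo_labels(preds))        # unanimous nodes only
    print(consensus_pseudo_labels(preds, 0.6))   # two-of-three is enough
```

Lowering the threshold admits more pseudo-labeled nodes at the cost of more label noise, which is the trade-off an adaptive scheme would have to manage.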

[AI-64] A Study on the Improvement of Code Generation Quality Using Large Language Models Leveraging Product Documentation

Link: https://arxiv.org/abs/2503.17837
Authors: Takuro Morimoto, Harumi Haraguchi
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: 12 pages, 5 figures and 10 tables

Abstract:Research on using Large Language Models (LLMs) in system development is expanding, especially in automated code and test generation. While E2E testing is vital for ensuring application quality, most test generation research has focused on unit tests, with limited work on E2E test code. This study proposes a method for automatically generating E2E test code from product documentation such as manuals, FAQs, and tutorials using LLMs with tailored prompts. The two step process interprets documentation intent and produces executable test code. Experiments on a web app with six key features (e.g., authentication, profile, discussion) showed that tests generated from product documentation had high compilation success and functional coverage, outperforming those based on requirement specs and user stories. These findings highlight the potential of product documentation to improve E2E test quality and, by extension, software quality.

[AI-65] Metacognition in Content-Centric Computational Cognitive (C4) Modeling

Link: https://arxiv.org/abs/2503.17822
Authors: Sergei Nirenburg, Marjorie McShane, Sanjay Oruganti
Categories: Artificial Intelligence (cs.AI)
Comments: METACOG-25: 2nd Workshop on Metacognitive Prediction of AI Behavior

Abstract:For AI agents to emulate human behavior, they must be able to perceive, meaningfully interpret, store, and use large amounts of information about the world, themselves, and other agents. Metacognition is a necessary component of all of these processes. In this paper, we briefly a) introduce content-centric computational cognitive (C4) modeling for next-generation AI agents; b) review the long history of developing C4 agents at RPI’s LEIA (Language-Endowed Intelligent Agents) Lab; c) discuss our current work on extending LEIAs’ cognitive capabilities to cognitive robotic applications developed using a neuro symbolic processing model; and d) sketch plans for future developments in this paradigm that aim to overcome underappreciated limitations of currently popular, LLM-driven methods in AI.

[AI-66] OvercookedV2: Rethinking Overcooked for Zero-Shot Coordination

Link: https://arxiv.org/abs/2503.17821
Authors: Tobias Gessler, Tin Dizdarevic, Ani Calinescu, Benjamin Ellis, Andrei Lupu, Jakob Nicolaus Foerster
Categories: Artificial Intelligence (cs.AI)

Abstract:AI agents hold the potential to transform everyday life by helping humans achieve their goals. To do this successfully, agents need to be able to coordinate with novel partners without prior interaction, a setting known as zero-shot coordination (ZSC). Overcooked has become one of the most popular benchmarks for evaluating coordination capabilities of AI agents and learning algorithms. In this work, we investigate the origins of ZSC challenges in Overcooked. We introduce a state augmentation mechanism which mixes states that might be encountered when paired with unknown partners into the training distribution, reducing the out-of-distribution challenge associated with ZSC. We show that independently trained agents under this algorithm coordinate successfully in Overcooked. Our results suggest that ZSC failure can largely be attributed to poor state coverage under self-play rather than more sophisticated coordination challenges. The Overcooked environment is therefore not suitable as a ZSC benchmark. To address these shortcomings, we introduce OvercookedV2, a new version of the benchmark, which includes asymmetric information and stochasticity, facilitating the creation of interesting ZSC scenarios. To validate OvercookedV2, we conduct experiments demonstrating that mere exhaustive state coverage is insufficient to coordinate well. Finally, we use OvercookedV2 to build a new range of coordination challenges, including ones that require test time protocol formation, and we demonstrate the need for new coordination algorithms that can adapt online. We hope that OvercookedV2 will help benchmark the next generation of ZSC algorithms and advance collaboration between AI agents and humans.

[AI-67] A Roadmap Towards Improving Multi-Agent Reinforcement Learning With Causal Discovery And Inference

Link: https://arxiv.org/abs/2503.17803
Authors: Giovanni Briglia, Stefano Mariani, Franco Zambonelli
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA); Methodology (stat.ME)

Abstract:Causal reasoning is increasingly used in Reinforcement Learning (RL) to improve the learning process in several dimensions: efficacy of learned policies, efficiency of convergence, generalisation capabilities, safety and interpretability of behaviour. However, applications of causal reasoning to Multi-Agent RL (MARL) are still mostly unexplored. In this paper, we take the first step in investigating the opportunities and challenges of applying causal reasoning in MARL. We measure the impact of a simple form of causal augmentation in state-of-the-art MARL scenarios increasingly requiring cooperation, and with state-of-the-art MARL algorithms exploiting various degrees of collaboration between agents. Then, we discuss the positive as well as negative results achieved, giving us the chance to outline the areas where further research may help to successfully transfer causal RL to the multi-agent setting.

[AI-68] MEPNet: Medical Entity-balanced Prompting Network for Brain CT Report Generation AAAI2025

Link: https://arxiv.org/abs/2503.17784
Authors: Xiaodan Zhang, Yanzhao Shi, Junzhong Ji, Chengxin Zheng, Liangqiong Qu
Categories: Artificial Intelligence (cs.AI)
Comments: AAAI 2025 Oral Paper

Abstract:The automatic generation of brain CT reports has gained widespread attention, given its potential to assist radiologists in diagnosing cranial diseases. However, brain CT scans involve extensive medical entities, such as diverse anatomy regions and lesions, exhibiting highly inconsistent spatial patterns in 3D volumetric space. This leads to biased learning of medical entities in existing methods, resulting in repetitiveness and inaccuracy in generated reports. To this end, we propose a Medical Entity-balanced Prompting Network (MEPNet), which harnesses the large language model (LLM) to fairly interpret various entities for accurate brain CT report generation. By introducing the visual embedding and the learning status of medical entities as enriched clues, our method prompts the LLM to balance the learning of diverse entities, thereby enhancing reports with comprehensive findings. First, to extract visual embedding of entities, we propose Knowledge-driven Joint Attention to explore and distill entity patterns using both explicit and implicit medical knowledge. Then, a Learning Status Scorer is designed to evaluate the learning of entity visual embeddings, resulting in unique learning status for individual entities. Finally, these entity visual embeddings and status are elaborately integrated into multi-modal prompts, to guide the text generation of LLM. This process allows LLM to self-adapt the learning process for biased-fitted entities, thereby covering detailed findings in generated reports. We conduct experiments on two brain CT report generation benchmarks, showing the effectiveness in clinical accuracy and text coherence.

[AI-69] Lifelong Evolution of Swarms GECCO2025

Link: https://arxiv.org/abs/2503.17763
Authors: Lorenzo Leuzzi, Simon Jones, Sabine Hauert, Davide Bacciu, Andrea Cossu
Categories: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI)
Comments: Accepted as full paper at GECCO 2025

Abstract:Adapting to task changes without forgetting previous knowledge is a key skill for intelligent systems, and a crucial aspect of lifelong learning. Swarm controllers, however, are typically designed for specific tasks, lacking the ability to retain knowledge across changing tasks. Lifelong learning, on the other hand, focuses on individual agents with limited insights into the emergent abilities of a collective like a swarm. To address this gap, we introduce a lifelong evolutionary framework for swarms, where a population of swarm controllers is evolved in a dynamic environment that incrementally presents novel tasks. This requires evolution to find controllers that quickly adapt to new tasks while retaining knowledge of previous ones, as they may reappear in the future. We discover that the population inherently preserves information about previous tasks, and it can reuse it to foster adaptation and mitigate forgetting. In contrast, the top-performing individual for a given task catastrophically forgets previous tasks. To mitigate this phenomenon, we design a regularization process for the evolutionary algorithm, reducing forgetting in top-performing individuals. Evolving swarms in a lifelong fashion raises fundamental questions on the current state of deep lifelong learning and on the robustness of swarm controllers in dynamic environments.

[AI-70] Bandwidth Reservation for Time-Critical Vehicular Applications: A Multi-Operator Environment

Link: https://arxiv.org/abs/2503.17756
Authors: Abdullah Al-Khatib, Abdullah Ahmed, Klaus Moessner, Holger Timinger
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Neural and Evolutionary Computing (cs.NE)
Comments: 14 pages, 11 figures

Abstract:Onsite bandwidth reservation requests often face challenges such as price fluctuations and fairness issues due to unpredictable bandwidth availability and stringent latency requirements. Requesting bandwidth in advance can mitigate the impact of these fluctuations and ensure timely access to critical resources. In a multi-Mobile Network Operator (MNO) environment, vehicles need to select cost-effective and reliable resources for their safety-critical applications. This research aims to minimize resource costs by finding the best price among multiple MNOs. It formulates multi-operator scenarios as a Markov Decision Process (MDP), utilizing a Deep Reinforcement Learning (DRL) algorithm, specifically Dueling Deep Q-Learning. For efficient and stable learning, we propose a novel area-wise approach and an adaptive MDP synthetic close to the real environment. The Temporal Fusion Transformer (TFT) is used to handle time-dependent data and model training. Furthermore, the research leverages Amazon spot price data and adopts a multi-phase training approach, involving initial training on synthetic data, followed by real-world data. These phases enable the DRL agent to make informed decisions using insights from historical data and real-time observations. The results show that our model leads to significant cost reductions, up to 40%, compared to scenarios without a policy model in such a complex environment.
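
The abstract names Dueling Deep Q-Learning without further detail. As a reminder of what the standard dueling aggregation computes (the textbook formulation, not code from the paper), the sketch below combines a state value V(s) with per-action advantages A(s, a) as Q(s, a) = V(s) + A(s, a) - mean_a A(s, a). Treating each action as the choice of one operator is an assumption made here purely for illustration, and the function names are hypothetical.

```python
def dueling_q_values(state_value, advantages):
    """Combine V(s) and A(s, a) into Q(s, a) with mean-advantage centering."""
    mean_adv = sum(advantages) / len(advantages)
    return [state_value + a - mean_adv for a in advantages]


def greedy_action(q_values):
    # Index of the action with the highest Q-value.
    return max(range(len(q_values)), key=q_values.__getitem__)


if __name__ == "__main__":
    # Hypothetical example: three candidate operators for one reservation request.
    q = dueling_q_values(state_value=2.0, advantages=[1.0, 3.0, 2.0])
    print(q, greedy_action(q))
```

Subtracting the mean advantage keeps the value and advantage streams identifiable: adding a constant to every advantage leaves the resulting Q-values unchanged.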

[AI-71] Aportes para el cumplimiento del Reglamento (UE) 2024/1689 en robótica y sistemas autónomos

Link: https://arxiv.org/abs/2503.17730
Authors: Francisco J. Rodríguez Lera, Yoana Pita Lorenzo, David Sobrín Hidalgo, Laura Fernández Becerra, Irene González Fernández, Jose Miguel Guerrero Hernández
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Comments: 9 pages, 1 figure, in Spanish

Abstract:Cybersecurity in robotics stands out as a key aspect within Regulation (EU) 2024/1689, also known as the Artificial Intelligence Act, which establishes specific guidelines for intelligent and automated systems. A fundamental distinction in this regulatory framework is the difference between robots with Artificial Intelligence (AI) and those that operate through automation systems without AI, since the former are subject to stricter security requirements due to their learning and autonomy capabilities. This work analyzes cybersecurity tools applicable to advanced robotic systems, with special emphasis on the protection of knowledge bases in cognitive architectures. Furthermore, a list of basic tools is proposed to guarantee the security, integrity, and resilience of these systems, and a practical case is presented, focused on the analysis of robot knowledge management, where ten evaluation criteria are defined to ensure compliance with the regulation and reduce risks in human-robot interaction (HRI) environments.

[AI-72] A Survey on Mathematical Reasoning and Optimization with Large Language Models

Link: https://arxiv.org/abs/2503.17726
Authors: Ali Forootani
Categories: Artificial Intelligence (cs.AI)

Abstract:Mathematical reasoning and optimization are fundamental to artificial intelligence and computational problem-solving. Recent advancements in Large Language Models (LLMs) have significantly improved AI-driven mathematical reasoning, theorem proving, and optimization techniques. This survey explores the evolution of mathematical problem-solving in AI, from early statistical learning approaches to modern deep learning and transformer-based methodologies. We review the capabilities of pretrained language models and LLMs in performing arithmetic operations, complex reasoning, theorem proving, and structured symbolic computation. A key focus is on how LLMs integrate with optimization and control frameworks, including mixed-integer programming, linear quadratic control, and multi-agent optimization strategies. We examine how LLMs assist in problem formulation, constraint generation, and heuristic search, bridging theoretical reasoning with practical applications. We also discuss enhancement techniques such as Chain-of-Thought reasoning, instruction tuning, and tool-augmented methods that improve LLM’s problem-solving performance. Despite their progress, LLMs face challenges in numerical precision, logical consistency, and proof verification. Emerging trends such as hybrid neural-symbolic reasoning, structured prompt engineering, and multi-step self-correction aim to overcome these limitations. Future research should focus on interpretability, integration with domain-specific solvers, and improving the robustness of AI-driven decision-making. This survey offers a comprehensive review of the current landscape and future directions of mathematical reasoning and optimization with LLMs, with applications across engineering, finance, and scientific research.

[AI-73] Slide2Text: Leveraging LLMs for Personalized Textbook Generation from PowerPoint Presentations

Link: https://arxiv.org/abs/2503.17710
Authors: Yizhou Zhou
Categories: Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)

Abstract:The rapid advancements in Large Language Models (LLMs) have revolutionized educational technology, enabling innovative approaches to automated and personalized content creation. This paper introduces Slide2Text, a system that leverages LLMs to transform PowerPoint presentations into customized textbooks. By extracting slide content using OCR, organizing it into a coherent structure, and generating tailored materials such as explanations, exercises, and references, Slide2Text streamlines the textbook creation process. Flexible customization options further enhance its adaptability to diverse educational needs. The system highlights the potential of LLMs in modernizing textbook creation and improving educational accessibility. Future developments will explore multimedia inputs and advanced user customization features.

[AI-74] On the (im)possibility of sustainable artificial intelligence. Why it does not make sense to move faster when heading the wrong way

Link: https://arxiv.org/abs/2503.17702
Authors: Rainer Rehak
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments: 10 pages

Abstract:Artificial intelligence (AI) is currently considered a sustainability “game-changer” within and outside of academia. In order to discuss sustainable AI this article draws from insights by critical data and algorithm studies, STS, transformative sustainability science, critical computer science, and public interest theory. I argue that while there are indeed many sustainability-related use cases for AI, they are likely to have more overall drawbacks than benefits. To substantiate this claim, I differentiate three ‘AI materialities’ of the AI supply chain: first, the literal materiality (e.g. water, cobalt, lithium, energy consumption etc.), second, the informational materiality (e.g. lots of data and centralised control necessary), and third, the social materiality (e.g. exploitative data work, communities harmed by waste and pollution). In all materialities, effects are especially devastating for the global south while benefiting the global north. A second strong claim regarding sustainable AI circles around so-called apolitical optimisation (e.g. regarding city traffic); however, the optimisation criteria (e.g. cars, bikes, emissions, commute time, health) are purely political and have to be collectively negotiated before applying AI optimisation. Hence, sustainable AI, in principle, cannot break the glass ceiling of transformation and might even distract from necessary societal change. To address that I propose to stop ‘unformation gathering’ and to apply the ‘small is beautiful’ principle. This aims to contribute to an informed academic and collective negotiation on how to (not) integrate AI into the sustainability project while avoiding to reproduce the status quo by serving hegemonic interests between useful AI use cases, techno-utopian salvation narratives, technology-centred efficiency paradigms, the exploitative and extractivist character of AI and concepts of digital degrowth.

[AI-75] Intelligence Sequencing and the Path-Dependence of Intelligence Evolution: AGI-First vs. DCI-First as Irreversible Attractors

Link: https://arxiv.org/abs/2503.17688
Authors: Andy E. Williams
Categories: Artificial Intelligence (cs.AI)

Abstract:The trajectory of intelligence evolution is often framed around the emergence of artificial general intelligence (AGI) and its alignment with human values. This paper challenges that framing by introducing the concept of intelligence sequencing: the idea that the order in which AGI and decentralized collective intelligence (DCI) emerge determines the long-term attractor basin of intelligence. Using insights from dynamical systems, evolutionary game theory, and network models, it argues that intelligence follows a path-dependent, irreversible trajectory. Once development enters a centralized (AGI-first) or decentralized (DCI-first) regime, transitions become structurally infeasible due to feedback loops and resource lock-in. Intelligence attractors are modeled in functional state space as the co-navigation of conceptual and adaptive fitness spaces. Early-phase structuring constrains later dynamics, much like renormalization in physics. This has major implications for AI safety: traditional alignment assumes AGI will emerge and must be controlled after the fact, but this paper argues that intelligence sequencing is more foundational. If AGI-first architectures dominate before DCI reaches critical mass, hierarchical monopolization and existential risk become locked in. If DCI-first emerges, intelligence stabilizes around decentralized cooperative equilibrium. The paper further explores whether intelligence structurally biases itself toward an attractor based on its self-modeling method – externally imposed axioms (favoring AGI) vs. recursive internal visualization (favoring DCI). Finally, it proposes methods to test this theory via simulations, historical lock-in case studies, and intelligence network analysis. The findings suggest that intelligence sequencing is a civilizational tipping point: determining whether the future is shaped by unbounded competition or unbounded cooperation.

[AI-76] Safe RLHF-V: Safe Reinforcement Learning from Human Feedback in Multimodal Large Language Models

Link: https://arxiv.org/abs/2503.17682
Authors: Jiaming Ji, Xinyu Chen, Rui Pan, Han Zhu, Conghui Zhang, Jiahao Li, Donghai Hong, Boyuan Chen, Jiayi Zhou, Kaile Wang, Juntao Dai, Chi-Min Chan, Sirui Han, Yike Guo, Yaodong Yang
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Multimodal large language models (MLLMs) are critical for developing general-purpose AI assistants, yet they face growing safety risks. How can we ensure that MLLMs are safely aligned to prevent undesired behaviors such as discrimination, misinformation, or violations of ethical standards? In a further step, we need to explore how to fine-tune MLLMs to enhance reasoning performance while ensuring they satisfy safety constraints. Fundamentally, this can be formulated as a min-max optimization problem. In this study, we propose Safe RLHF-V, the first multimodal safety alignment framework that jointly optimizes helpfulness and safety using separate multimodal reward and cost models within a Lagrangian-based constrained optimization framework. Given that there is a lack of preference datasets that separate helpfulness and safety in multimodal scenarios, we introduce BeaverTails-V, the first open-source dataset with dual preference annotations for helpfulness and safety, along with multi-level safety labels (minor, moderate, severe). Additionally, we design a Multi-level Guardrail System to proactively defend against unsafe queries and adversarial attacks. By applying the Beaver-Guard-V moderation for 5 rounds of filtering and re-generation on the precursor model, the overall safety of the upstream model is significantly improved by an average of 40.9%. Experimental results demonstrate that fine-tuning different MLLMs with Safe RLHF can effectively enhance model helpfulness while ensuring improved safety. Specifically, Safe RLHF-V improves model safety by 34.2% and helpfulness by 34.3%. All of datasets, models, and code can be found at this https URL to support the safety development of MLLMs and reduce potential societal risks.
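
The Lagrangian-based constrained optimization mentioned in the abstract can be illustrated with a generic dual-ascent step. This is a sketch of the general technique only, under the assumption of scalar batch averages for reward and cost; it is not the Safe RLHF-V training code, and `cost_limit`, `step`, and both function names are hypothetical. The policy maximizes reward minus a weighted cost penalty, while the multiplier grows when the average cost exceeds the limit and shrinks (never below zero) otherwise.

```python
def lagrangian_objective(mean_reward, mean_cost, lam, cost_limit=0.0):
    """Policy-side objective: reward penalized by the weighted constraint violation."""
    return mean_reward - lam * (mean_cost - cost_limit)


def update_multiplier(lam, mean_cost, cost_limit=0.0, step=0.5):
    """Dual ascent on the multiplier, projected onto lam >= 0."""
    return max(0.0, lam + step * (mean_cost - cost_limit))


if __name__ == "__main__":
    lam = 1.0
    # Hypothetical batch statistics from a reward model and a cost model.
    for mean_reward, mean_cost in [(3.0, 2.0), (2.5, 0.5), (2.0, -1.0)]:
        objective = lagrangian_objective(mean_reward, mean_cost, lam)
        lam = update_multiplier(lam, mean_cost)
        print(f"objective={objective:.2f}  next lambda={lam:.2f}")
```

Alternating these two updates is what gives the problem its min-max structure: the policy step maximizes the objective for the current multiplier, and the multiplier step minimizes it by tightening the penalty whenever the safety constraint is violated.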

[AI-77] ComfyGPT: A Self-Optimizing Multi-Agent System for Comprehensive ComfyUI Workflow Generation

Link: https://arxiv.org/abs/2503.17671
Authors: Oucheng Huang, Yuhang Ma, Zeng Zhao, Mingrui Wu, Jiayi Ji, Rongsheng Zhang, Zhipeng Hu, Xiaoshuai Sun, Rongrong Ji
Categories: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)

Abstract:ComfyUI provides a widely adopted, workflow-based interface that enables users to customize various image generation tasks through an intuitive node-based architecture. However, the intricate connections between nodes and diverse modules often present a steep learning curve for users. In this paper, we introduce ComfyGPT, the first self-optimizing multi-agent system designed to automatically generate ComfyUI workflows from task descriptions. ComfyGPT comprises four specialized agents: ReformatAgent, FlowAgent, RefineAgent, and ExecuteAgent. The core innovation of ComfyGPT lies in two key aspects. First, it focuses on generating individual node links rather than entire workflows, significantly improving generation precision. Second, we propose FlowAgent, an LLM-based workflow generation agent that uses both supervised fine-tuning (SFT) and reinforcement learning (RL) to improve workflow generation accuracy. Moreover, we introduce FlowDataset, a large-scale dataset containing 13,571 workflow-description pairs, and FlowBench, a comprehensive benchmark for evaluating workflow generation systems. We also propose four novel evaluation metrics: Format Validation (FV), Pass Accuracy (PA), Pass Instruct Alignment (PIA), and Pass Node Diversity (PND). Experimental results demonstrate that ComfyGPT significantly outperforms existing LLM-based methods in workflow generation.

[AI-78] A Qualitative Study of User Perception of M365 AI Copilot

Link: https://arxiv.org/abs/2503.17661
Authors: Muneera Bano, Didar Zowghi, Jon Whittle, Liming Zhu, Andrew Reeson, Rob Martin, Jen Parson
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)

Abstract:Adopting AI copilots in professional workflows presents opportunities for enhanced productivity, efficiency, and decision making. In this paper, we present results from a six month trial of M365 Copilot conducted at our organisation in 2024. A qualitative interview study was carried out with 27 participants. The study explored user perceptions of M365 Copilot’s effectiveness, productivity impact, evolving expectations, ethical concerns, and overall satisfaction. Initial enthusiasm for the tool was met with mixed post trial experiences. While some users found M365 Copilot beneficial for tasks such as email coaching, meeting summaries, and content retrieval, others reported unmet expectations in areas requiring deeper contextual understanding, reasoning, and integration with existing workflows. Ethical concerns were a recurring theme, with users highlighting issues related to data privacy, transparency, and AI bias. While M365 Copilot demonstrated value in specific operational areas, its broader impact remained constrained by usability limitations and the need for human oversight to validate AI generated outputs.

[AI-79] A Modular Dataset to Demonstrate LLM Abstraction Capability ACL2025

Link: https://arxiv.org/abs/2503.17645
Authors: Adam Atanas, Kai Liu
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 7 pages, 5 figures. Submitted to ACL 2025

Abstract:Large language models (LLMs) exhibit impressive capabilities but struggle with reasoning errors due to hallucinations and flawed logic. To investigate their internal representations of reasoning, we introduce ArrangementPuzzle, a novel puzzle dataset with structured solutions and automated stepwise correctness verification. We trained a classifier model on LLM activations on this dataset and found that it achieved over 80% accuracy in predicting reasoning correctness, implying that LLMs internally distinguish between correct and incorrect reasoning steps, with the strongest representations in middle-late Transformer layers. Further analysis reveals that LLMs encode abstract reasoning concepts within the middle activation layers of the transformer architecture, distinguishing logical from semantic equivalence. These findings provide insights into LLM reasoning mechanisms and contribute to improving AI reliability and interpretability, thereby offering the possibility to manipulate and refine LLM reasoning.

[AI-80] On The Sample Complexity Bounds In Bilevel Reinforcement Learning

Link: https://arxiv.org/abs/2503.17644
Authors: Mudit Gaur, Amrit Singh Bedi, Raghu Pasupathu, Vaneet Aggarwal
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Bilevel reinforcement learning (BRL) has emerged as a powerful mathematical framework for studying generative AI alignment and related problems. While several principled algorithmic frameworks have been proposed, key theoretical foundations, particularly those related to sample complexity, remain underexplored. Understanding and deriving tight sample complexity bounds are crucial for bridging the gap between theory and practice, guiding the development of more efficient algorithms. In this work, we present the first sample complexity result for BRL, achieving a bound of $\epsilon^{-4}$. This result extends to standard bilevel optimization problems, providing an interesting theoretical contribution with practical implications. To address the computational challenges associated with hypergradient estimation in bilevel optimization, we develop a first-order Hessian-free algorithm that does not rely on costly hypergradient computations. By leveraging matrix-free techniques and constrained optimization methods, our approach ensures scalability and practicality. Our findings pave the way for improved methods in AI alignment and other fields reliant on bilevel optimization.

[AI-81] Transferable Latent-to-Latent Locomotion Policy for Efficient and Versatile Motion Control of Diverse Legged Robots

Link: https://arxiv.org/abs/2503.17626
Authors: Ziang Zheng, Guojian Zhan, Bin Shuai, Shengtao Qin, Jiangtao Li, Tao Zhang, Shengbo Eben Li
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)

Abstract:Reinforcement learning (RL) has demonstrated remarkable capability in acquiring robot skills, but learning each new skill still requires substantial data collection for training. The pretrain-and-finetune paradigm offers a promising approach for efficiently adapting to new robot entities and tasks. Inspired by the idea that acquired knowledge can accelerate learning new tasks with the same robot and help a new robot master a trained task, we propose a latent training framework where a transferable latent-to-latent locomotion policy is pretrained alongside diverse task-specific observation encoders and action decoders. This policy in latent space processes encoded latent observations to generate latent actions to be decoded, with the potential to learn general abstract motion skills. To retain essential information for decision-making and control, we introduce a diffusion recovery module that minimizes information reconstruction loss during pretrain stage. During fine-tune stage, the pretrained latent-to-latent locomotion policy remains fixed, while only the lightweight task-specific encoder and decoder are optimized for efficient adaptation. Our method allows a robot to leverage its own prior experience across different tasks as well as the experience of other morphologically diverse robots to accelerate adaptation. We validate our approach through extensive simulations and real-world experiments, demonstrating that the pretrained latent-to-latent locomotion policy effectively generalizes to new robot entities and tasks with improved efficiency.

[AI-82] Unraveling Pedestrian Fatality Patterns: A Comparative Study with Explainable AI

Link: https://arxiv.org/abs/2503.17623
Authors: Methusela Sulle, Judith Mwakalonge, Gurcan Comert, Saidi Siuhi, Nana Kankam Gyimah
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 22 pages, 5 figures

Abstract:Road fatalities pose significant public safety and health challenges worldwide, with pedestrians being particularly vulnerable in vehicle-pedestrian crashes due to disparities in physical and performance characteristics. This study employs explainable artificial intelligence (XAI) to identify key factors contributing to pedestrian fatalities across the five U.S. states with the highest crash rates (2018-2022). It compares them to the five states with the lowest fatality rates. Using data from the Fatality Analysis Reporting System (FARS), the study applies machine learning techniques (Decision Trees, Gradient Boosting Trees, Random Forests, and XGBoost) to predict contributing factors to pedestrian fatalities. To address data imbalance, the Synthetic Minority Over-sampling Technique (SMOTE) is utilized, while SHapley Additive Explanations (SHAP) values enhance model interpretability. The results indicate that age, alcohol and drug use, location, and environmental conditions are significant predictors of pedestrian fatalities. The XGBoost model outperformed others, achieving a balanced accuracy of 98%, accuracy of 90%, precision of 92%, recall of 90%, and an F1 score of 91%. Findings reveal that pedestrian fatalities are more common in mid-block locations and areas with poor visibility, with older adults and substance-impaired individuals at higher risk. These insights can inform policymakers and urban planners in implementing targeted safety measures, such as improved lighting, enhanced pedestrian infrastructure, and stricter traffic law enforcement, to reduce fatalities and improve public safety.

[AI-83] OmniScience: A Domain-Specialized LLM for Scientific Reasoning and Discovery

Link: https://arxiv.org/abs/2503.17604
Authors: Vignesh Prabhakar, Md Amirul Islam, Adam Atanas, Yao-Ting Wang, Joah Han, Aastha Jhunjhunwala, Rucha Apte, Robert Clark, Kang Xu, Zihan Wang, Kai Liu
Subjects: Artificial Intelligence (cs.AI)

Abstract:Large Language Models (LLMs) have demonstrated remarkable potential in advancing scientific knowledge and addressing complex challenges. In this work, we introduce OmniScience, a specialized large reasoning model for general science, developed through three key components: (1) domain adaptive pretraining on a carefully curated corpus of scientific literature, (2) instruction tuning on a specialized dataset to guide the model in following domain-specific tasks, and (3) reasoning-based knowledge distillation through fine-tuning to significantly enhance its ability to generate contextually relevant and logically sound responses. We demonstrate the versatility of OmniScience by developing a battery agent that efficiently ranks molecules as potential electrolyte solvents or additives. Comprehensive evaluations reveal that OmniScience is competitive with state-of-the-art large reasoning models on the GPQA Diamond and domain-specific battery benchmarks, while outperforming all public reasoning and non-reasoning models with similar parameter counts. We further demonstrate via ablation experiments that domain adaptive pretraining and reasoning-based knowledge distillation are critical to attain our performance levels, across benchmarks.

[AI-84] A Generative Caching System for Large Language Models

Link: https://arxiv.org/abs/2503.17603
Authors: Arun Iyengar, Ashish Kundu, Ramana Kompella, Sai Nandan Mamidi
Subjects: Databases (cs.DB); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Networking and Internet Architecture (cs.NI)

Abstract:Caching has the potential to be of significant benefit for accessing large language models (LLMs) due to their high latencies which typically range from a small number of seconds to well over a minute. Furthermore, many LLMs charge money for queries; caching thus has a clear monetary benefit. This paper presents a new caching system for improving user experiences with LLMs. In addition to reducing both latencies and monetary costs for accessing LLMs, our system also provides important features that go beyond the performance benefits typically associated with caches. A key feature we provide is generative caching, wherein multiple cached responses can be synthesized to provide answers to queries which have never been seen before. Our generative caches function as repositories of valuable information which can be mined and analyzed. We also improve upon past semantic caching techniques by tailoring the caching algorithms to optimally balance cost and latency reduction with the quality of responses provided. Performance tests indicate that our caches are considerably faster than GPTcache.

[AI-85] ConSol: Sequential Probability Ratio Testing to Find Consistent LLM Reasoning Paths Efficiently

Link: https://arxiv.org/abs/2503.17587
Authors: Jaeyeon Lee, Guantong Qi, Matthew Brady Neeley, Zhandong Liu, Hyun-Hwan Jeong
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Recent advancements in large language models (LLMs) integrating explicit reasoning, such as OpenAI’s o3-mini, DeepSeek-R1, and QWQ-32B, enable smaller models to solve complex tasks by generating intermediate reasoning steps prior to providing answers. However, this approach significantly increases computational costs, both monetarily and environmentally. The widely-used self-consistency method further exacerbates these costs by aggregating multiple reasoning paths to improve accuracy, often requiring between 40 and 64 samples per task. Although aggregation effectively reduces variance and bias, additional sampling can lead to diminishing returns when early samples yield consistent results. To address inefficiencies, we propose leveraging Sequential Probability Ratio Testing (SPRT) to dynamically terminate sampling once sufficient consistency is achieved. We calibrate SPRT parameters specifically for LLM applications, accounting for sensitivity to detect the mode of the distribution. Our experiments demonstrate that incorporating SPRT significantly enhances token efficiency, achieving comparable accuracy to self-consistency methods but at a substantially reduced computational cost. To promote transparency and facilitate reproducibility, we have made the source code and datasets used in our experiments publicly available in our GitHub repository and as a PyPI package (pip install consol). We hope that this resource will support further research and encourage the development of new methods building upon our work.
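
The early-stopping idea in this abstract can be illustrated with a small sketch. It is not the paper's calibrated procedure and not the API of the consol package: the function name, the head-to-head comparison between the two most frequent answers, and the parameters p1, alpha, and beta are all assumptions chosen for illustration. The test is Wald's SPRT on a Bernoulli outcome, with H0 "the leading answer wins a head-to-head draw with probability 0.5" against H1 "it wins with probability p1".

```python
import math
from collections import Counter

def sprt_self_consistency(sample_answer, max_samples=40, p1=0.7, alpha=0.05, beta=0.05):
    """Draw answers one at a time and stop once the evidence is decisive.

    sample_answer is a hypothetical callable returning one final answer per call
    (for example, one sampled reasoning path from an LLM).
    """
    upper = math.log((1 - beta) / alpha)   # crossing it accepts H1: a dominant answer exists
    lower = math.log(beta / (1 - alpha))   # crossing it accepts H0: no dominant answer
    counts = Counter()
    n = 0
    for n in range(1, max_samples + 1):
        counts[sample_answer()] += 1
        ranked = counts.most_common(2)
        lead = ranked[0][1]
        runner_up = ranked[1][1] if len(ranked) > 1 else 0
        # Log-likelihood ratio of H1 versus H0, using only draws that landed
        # on the two most frequent answers (a simplifying assumption).
        llr = lead * math.log(p1 / 0.5) + runner_up * math.log((1 - p1) / 0.5)
        if llr >= upper or llr <= lower:
            break
    return counts.most_common(1)[0][0], n
```

With these default values, a sampler that always returns the same answer stops after 9 draws instead of the full budget of 40, which is the kind of saving the abstract describes.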

[AI-86] Measuring the Robustness of Audio Deepfake Detectors

Link: https://arxiv.org/abs/2503.17577
Authors: Xiang Li, Pin-Yu Chen, Wenqi Wei
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Sound (cs.SD)

Abstract:Deepfakes have become a universal and rapidly intensifying concern of generative AI across various media types such as images, audio, and videos. Among these, audio deepfakes have been of particular concern due to the ease of high-quality voice synthesis and distribution via platforms such as social media and robocalls. Consequently, detecting audio deepfakes plays a critical role in combating the growing misuse of AI-synthesized speech. However, real-world scenarios often introduce various audio corruptions, such as noise, modification, and compression, that may significantly impact detection performance. This work systematically evaluates the robustness of 10 audio deepfake detection models against 16 common corruptions, categorized into noise perturbation, audio modification, and compression. Using both traditional deep learning models and state-of-the-art foundation models, we make four unique observations. First, our findings show that while most models demonstrate strong robustness to noise, they are notably more vulnerable to modifications and compression, especially when neural codecs are applied. Second, speech foundation models generally outperform traditional models across most scenarios, likely due to their self-supervised learning paradigm and large-scale pre-training. Third, our results show that increasing model size improves robustness, albeit with diminishing returns. Fourth, we demonstrate how targeted data augmentation during training can enhance model resilience to unseen perturbations. A case study on political speech deepfakes highlights the effectiveness of foundation models in achieving high accuracy under real-world conditions. These findings emphasize the importance of developing more robust detection frameworks to ensure reliability in practical deployment settings.

[AI-87] Fairness-Driven LLM-based Causal Discovery with Active Learning and Dynamic Scoring

Link: https://arxiv.org/abs/2503.17569
Authors: Khadija Zanna, Akane Sano
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)

Abstract:Causal discovery (CD) plays a pivotal role in numerous scientific fields by clarifying the causal relationships that underlie phenomena observed in diverse disciplines. Despite significant advancements in CD algorithms that enhance bias and fairness analyses in machine learning, their application faces challenges due to the high computational demands and complexities of large-scale data. This paper introduces a framework that leverages Large Language Models (LLMs) for CD, utilizing a metadata-based approach akin to the reasoning processes of human experts. By shifting from pairwise queries to a more scalable breadth-first search (BFS) strategy, the number of required queries is reduced from quadratic to linear in terms of variable count, thereby addressing scalability concerns inherent in previous approaches. The method uses Active Learning (AL) and a Dynamic Scoring Mechanism that prioritizes queries based on their potential information gain, combining mutual information, partial correlation, and LLM confidence scores to refine the causal graph more efficiently and accurately. This study provides a more scalable and efficient solution for leveraging LLMs in fairness-driven CD, highlighting the effects of the different parameters on performance. We perform fairness analyses on the inferred causal graphs, identifying direct and indirect effects of sensitive attributes on outcomes. A comparison of these analyses against those from graphs produced by baseline methods highlights the importance of accurate causal graph construction in understanding bias and ensuring fairness in machine learning systems.
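
The quadratic-to-linear reduction mentioned in this abstract can be seen in a toy sketch. The oracle ask_children is a hypothetical stand-in for one LLM query that returns the direct effects of a variable, and the variable names are invented. The active learning and dynamic scoring components are omitted, so this illustrates only the BFS query count, not the authors' framework.

```python
from collections import deque

def bfs_causal_graph(roots, ask_children):
    """Expand the graph breadth-first, issuing one query per visited variable."""
    edges, queries = [], 0
    visited = set(roots)
    queue = deque(roots)
    while queue:
        cause = queue.popleft()
        queries += 1                      # one (hypothetical) LLM call per variable
        for effect in ask_children(cause):
            edges.append((cause, effect))
            if effect not in visited:
                visited.add(effect)
                queue.append(effect)
    return edges, queries

# Toy stand-in for the LLM: a fixed lookup table of direct effects.
TOY_EFFECTS = {"age": ["income"], "education": ["income"], "income": ["loan"], "loan": []}

edges, queries = bfs_causal_graph(["age", "education"], lambda v: TOY_EFFECTS[v])
# A pairwise strategy would ask about every unordered pair of variables.
pairwise_queries = len(TOY_EFFECTS) * (len(TOY_EFFECTS) - 1) // 2
print(queries, pairwise_queries)
```

For the four toy variables the BFS loop issues 4 queries against 6 pairwise ones. The gap grows with the variable count, since n(n-1)/2 grows quadratically while the BFS count stays at one query per variable.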

[AI-88] Learning Multi-Level Features with Matryoshka Sparse Autoencoders

Link: https://arxiv.org/abs/2503.17547
Authors: Bart Bussmann, Noa Nabeshima, Adam Karvonen, Neel Nanda
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Sparse autoencoders (SAEs) have emerged as a powerful tool for interpreting neural networks by extracting the concepts represented in their activations. However, choosing the size of the SAE dictionary (i.e. number of learned concepts) creates a tension: as dictionary size increases to capture more relevant concepts, sparsity incentivizes features to be split or absorbed into more specific features, leaving high-level features missing or warped. We introduce Matryoshka SAEs, a novel variant that addresses these issues by simultaneously training multiple nested dictionaries of increasing size, forcing the smaller dictionaries to independently reconstruct the inputs without using the larger dictionaries. This organizes features hierarchically - the smaller dictionaries learn general concepts, while the larger dictionaries learn more specific concepts, without incentive to absorb the high-level features. We train Matryoshka SAEs on Gemma-2-2B and TinyStories and find superior performance on sparse probing and targeted concept erasure tasks, more disentangled concept representations, and reduced feature absorption. While there is a minor tradeoff with reconstruction performance, we believe Matryoshka SAEs are a superior alternative for practical tasks, as they enable training arbitrarily large SAEs while retaining interpretable features at different levels of abstraction.
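
The nested-dictionary objective described in this abstract can be sketched as a sum of reconstruction errors, one per dictionary prefix. This is a minimal illustration under assumptions: the function name, the plain-list representation, and the choice of squared error are not taken from the paper, and the sparsity penalty and the encoder are left out.

```python
def matryoshka_loss(x, latents, decoder, prefix_sizes):
    """Sum of squared reconstruction errors, one term per nested dictionary.

    decoder[i] is the direction written by latent i. The dictionary of size m
    may only use latents 0..m-1, so small dictionaries must reconstruct the
    input without help from the later, more specific latents.
    """
    total = 0.0
    for m in prefix_sizes:
        recon = [0.0] * len(x)
        for i in range(m):
            for d in range(len(x)):
                recon[d] += latents[i] * decoder[i][d]
        total += sum((xi - ri) ** 2 for xi, ri in zip(x, recon))
    return total

# Toy example: two latents with an identity decoder.
x = [1.0, 2.0]
decoder = [[1.0, 0.0], [0.0, 1.0]]
latents = [1.0, 2.0]
loss = matryoshka_loss(x, latents, decoder, prefix_sizes=[1, 2])
print(loss)  # the size-1 prefix misses the second coordinate: (2 - 0)^2 = 4.0
```

Because every prefix contributes its own error term, the earliest latents are pushed toward features that explain as much of the input as possible on their own, which matches the hierarchy the abstract describes.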

[AI-89] A Predictive Services Architecture for Efficient Airspace Operations

Link: https://arxiv.org/abs/2503.17515
Authors: Ítalo Romani de Oliveira, Samet Ayhan, Glaucia Balvedi, Michael Biglin, Pablo Costas, Euclides C. Pinto Neto, Alexandre Leite, Felipe C. F. de Azevedo
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Databases (cs.DB); Systems and Control (eess.SY)

Abstract:Predicting air traffic congestion and flow management is essential for airlines and Air Navigation Service Providers (ANSP) to enhance operational efficiency. Accurate estimates of future airport capacity and airspace density are vital for better airspace management, reducing air traffic controller workload and fuel consumption, ultimately promoting sustainable aviation. While existing literature has addressed these challenges, data management and query processing remain complex due to the vast volume of high-rate air traffic data. Many analytics use cases require a common pre-processing infrastructure, as ad-hoc approaches are insufficient. Additionally, linear prediction models often fall short, necessitating more advanced techniques. This paper presents a data processing and predictive services architecture that ingests large, uncorrelated, and noisy streaming data to forecast future airspace system states. The system continuously collects raw data, periodically compresses it, and stores it in NoSQL databases for efficient query processing. For prediction, the system learns from historical traffic by extracting key features such as airport arrival and departure events, sector boundary crossings, weather parameters, and other air traffic data. These features are input into various regression models, including linear, non-linear, and ensemble models, with the best-performing model selected for predictions. We evaluate this infrastructure across three prediction use cases in the US National Airspace System (NAS) and a segment of European airspace, using extensive real operations data, confirming that our system can predict future system states efficiently and accurately. 

[AI-90] Improving Quantization with Post-Training Model Expansion

Link: https://arxiv.org/abs/2503.17513
Authors: Giuseppe Franco, Pablo Monteagudo-Lago, Ian Colbert, Nicholas Fraser, Michaela Blott
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR)

Abstract:The size of a model has been a strong predictor of its quality, as well as its cost. As such, the trade-off between model cost and quality has been well-studied. Post-training optimizations like quantization and pruning have typically focused on reducing the overall volume of pre-trained models to reduce inference costs while maintaining model quality. However, recent advancements have introduced optimization techniques that, interestingly, expand models post-training, increasing model size to improve quality when reducing volume. For instance, to enable 4-bit weight and activation quantization, incoherence processing often necessitates inserting online Hadamard rotations in the compute graph, and preserving highly sensitive weights often calls for additional higher precision computations. However, if application requirements cannot be met, the prevailing solution is to relax quantization constraints. In contrast, we demonstrate post-training model expansion is a viable strategy to improve model quality within a quantization co-design space, and provide theoretical justification. We show it is possible to progressively and selectively expand the size of a pre-trained large language model (LLM) to improve model quality without end-to-end retraining. In particular, when quantizing the weights and activations to 4 bits for Llama3 1B, we reduce the zero-shot accuracy gap to full precision by an average of 3% relative to both QuaRot and SpinQuant with only 5% more parameters, which is still a 3.8% reduction in volume relative to a BF16 reference model.

[AI-91] Efficient Knowledge Distillation via Curriculum Extraction

Link: https://arxiv.org/abs/2503.17494
Authors: Shivam Gupta, Sushrut Karmalkar
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Statistics Theory (math.ST); Machine Learning (stat.ML)

Abstract:Knowledge distillation is a technique used to train a small student network using the output generated by a large teacher network, and has many empirical advantages (Hinton et al., 2015). While the standard one-shot approach to distillation only uses the output of the final teacher network, recent work (Panigrahi et al., 2024) has shown that using intermediate checkpoints from the teacher’s training process as an implicit "curriculum" for progressive distillation can significantly speed up training. However, such schemes require storing these checkpoints, and often require careful selection of the intermediate checkpoints to train on, which can be impractical for large-scale training. In this paper, we show that a curriculum can be extracted from just the fully trained teacher network, and that this extracted curriculum can give similar efficiency benefits to those of progressive distillation. Our extraction scheme is natural; we use a random projection of the hidden representations of the teacher network to progressively train the student network, before training using the output of the full network. We show that our scheme significantly outperforms one-shot distillation and achieves a performance similar to that of progressive distillation for learning sparse parities with two-layer networks, and provide theoretical guarantees for this setting. Additionally, we show that our method outperforms one-shot distillation even when using transformer-based architectures, both for sparse-parity learning, and language modeling tasks.
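
The two-stage scheme in this abstract (first fit a random projection of the teacher's hidden representation, then fit the teacher's output) can be sketched as target construction. This is a rough illustration only: the helper names, the Gaussian projection with 1/sqrt(k) scaling, and the toy teacher are assumptions, and the student training loop itself is omitted.

```python
import random

def random_projection(dim_in, dim_out, seed=0):
    """Fixed Gaussian matrix mapping hidden vectors to a lower dimension."""
    rng = random.Random(seed)
    scale = dim_out ** -0.5
    return [[rng.gauss(0.0, 1.0) * scale for _ in range(dim_in)] for _ in range(dim_out)]

def project(matrix, vector):
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]

def curriculum_targets(teacher_hidden, teacher_output, projection, inputs):
    """Build the targets for both training stages from one trained teacher.

    Stage one: the student regresses onto projected hidden representations.
    Stage two: the student regresses onto the teacher's final output.
    """
    stage_one = [project(projection, teacher_hidden(x)) for x in inputs]
    stage_two = [teacher_output(x) for x in inputs]
    return stage_one, stage_two

# Toy teacher (hypothetical): a 3-dimensional hidden layer and a scalar output.
P = random_projection(dim_in=3, dim_out=2)
stage_one, stage_two = curriculum_targets(
    lambda x: [x, 2 * x, 3 * x], lambda x: 6 * x, P, [1.0, 2.0]
)
```

Only the fully trained teacher is queried here, so no intermediate checkpoints need to be stored, which is the practical advantage the abstract claims over progressive distillation.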

[AI-92] Your voice is your voice: Supporting Self-expression through Speech Generation and LLMs in Augmented and Alternative Communication

Link: https://arxiv.org/abs/2503.17479
Authors: Yiwen Xu, Monideep Chakraborti, Tianyi Zhang, Katelyn Eng, Aanchan Mohan, Mirjana Prpa
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)

Abstract:In this paper, we present Speak Ease: an augmentative and alternative communication (AAC) system to support users’ expressivity by integrating multimodal input, including text, voice, and contextual cues (conversational partner and emotional tone), with large language models (LLMs). Speak Ease combines automatic speech recognition (ASR), context-aware LLM-based outputs, and personalized text-to-speech technologies to enable more personalized, natural-sounding, and expressive communication. Through an exploratory feasibility study and focus group evaluation with speech and language pathologists (SLPs), we assessed Speak Ease’s potential to enable expressivity in AAC. The findings highlight the priorities and needs of AAC users and the system’s ability to enhance user expressivity by supporting more personalized and contextually relevant communication. This work provides insights into the use of multimodal inputs and LLM-driven features to improve AAC systems and support expressivity.

[AI-93] CausalRivers – Scaling up benchmarking of causal discovery for real-world time-series ICLR2025

Link: https://arxiv.org/abs/2503.17452
Authors: Gideon Stein, Maha Shadaydeh, Jan Blunk, Niklas Penzel, Joachim Denzler
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments: 10 pages, 8 figures, ICLR2025 main track

Abstract:Causal discovery, or identifying causal relationships from observational data, is a notoriously challenging task, with numerous methods proposed to tackle it. Despite this, in-the-wild evaluation of these methods is still lacking, as works frequently rely on synthetic data evaluation and sparse real-world examples under critical theoretical assumptions. Real-world causal structures, however, are often complex, making it hard to decide on a proper causal discovery strategy. To bridge this gap, we introduce CausalRivers, the largest in-the-wild causal discovery benchmarking kit for time-series data to date. CausalRivers features an extensive dataset on river discharge that covers the eastern German territory (666 measurement stations) and the state of Bavaria (494 measurement stations). It spans the years 2019 to 2023 with a 15-minute temporal resolution. Further, we provide additional data from a flood around the Elbe River, as an event with a pronounced distributional shift. Leveraging multiple sources of information and time-series meta-data, we constructed two distinct causal ground truth graphs (Bavaria and eastern Germany). These graphs can be sampled to generate thousands of subgraphs to benchmark causal discovery across diverse and challenging settings. To demonstrate the utility of CausalRivers, we evaluate several causal discovery approaches through a set of experiments to identify areas for improvement. CausalRivers has the potential to facilitate robust evaluations and comparisons of causal discovery methods. Besides this primary purpose, we also expect that this dataset will be relevant for connected areas of research, such as time-series forecasting and anomaly detection. Based on this, we hope to push benchmark-driven method development that fosters advanced techniques for causal discovery, as is the case for many other areas of machine learning.

[AI-94] LEMMA: Learning from Errors for MatheMatical Advancement in LLMs

Link: https://arxiv.org/abs/2503.17439
Authors: Zhuoshi Pan, Yu Li, Honglin Lin, Qizhi Pei, Zinan Tang, Wei Wu, Chenlin Ming, H. Vicky Zhao, Conghui He, Lijun Wu
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 9 pages, 6 figures, 4 tables, under review

Abstract:Large language models (LLMs) have demonstrated remarkable reasoning capability in solving mathematical problems. However, existing approaches primarily focus on improving the quality of correct training data, e.g., distilling high-quality correct solutions from advanced models, neglecting the value contained in error data, potentially hindering the model’s reflective ability. Though some studies attempt to leverage error data, they often involve complex mechanisms, such as Monte Carlo Tree Search (MCTS) to explore error nodes. In this work, we propose to enhance LLMs’ reasoning ability by Learning from Errors for Mathematical Advancement (LEMMA). LEMMA constructs data consisting of an incorrect solution with an erroneous step and a reflection connection to a correct solution for fine-tuning. Specifically, we systematically analyze the model-generated error types and introduce an error-type grounded mistake augmentation method to collect diverse and representative errors. Correct solutions are either from fixing the errors or generating a fresh start. Through a model-aware smooth reflection connection, the erroneous solution is transferred to the correct one. By fine-tuning on the constructed dataset, the model is able to self-correct errors autonomously within the generation process without relying on external critique models. Experimental results demonstrate that LEMMA achieves significant performance improvements over other strong baselines.

[AI-95] Enhanced Smart Contract Reputability Analysis using Multimodal Data Fusion on Ethereum

Link: https://arxiv.org/abs/2503.17426
Authors: Cyrus Malik, Josef Bajada, Joshua Ellul
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Emerging Technologies (cs.ET)

Abstract:The evaluation of smart contract reputability is essential to foster trust in decentralized ecosystems. However, existing methods that rely solely on static code analysis or transactional data offer limited insight into evolving trustworthiness. We propose a multimodal data fusion framework that integrates static code features with transactional data to enhance reputability prediction. Our framework initially focuses on static code analysis, utilizing GAN-augmented opcode embeddings to address class imbalance, achieving 97.67% accuracy and a recall of 0.942 in detecting illicit contracts, surpassing traditional oversampling methods. This forms the crux of a reputability-centric fusion strategy, where combining static and transactional data improves recall by 7.25% over single-source models, demonstrating robust performance across validation sets. By providing a holistic view of smart contract behaviour, our approach enhances the model’s ability to assess reputability, identify fraudulent activities, and predict anomalous patterns. These capabilities contribute to more accurate reputability assessments, proactive risk mitigation, and enhanced blockchain security.

[AI-96] Data to Decisions: A Computational Framework to Identify skill requirements from Advertorial Data

Link: https://arxiv.org/abs/2503.17424
Authors: Aakash Singh, Anurag Kanaujia, Vivek Kumar Singh
Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)

Abstract:Among the factors of production, human capital or skilled manpower is the one that keeps evolving and adapts to changing conditions and resources. This adaptability makes human capital the most crucial factor in ensuring a sustainable growth of industry/sector. As new technologies are developed and adopted, the new generations are required to acquire skills in newer technologies in order to be employable. At the same time professionals are required to upskill and reskill themselves to remain relevant in the industry. There is however no straightforward method to identify the skill needs of the industry at a given point of time. Therefore, this paper proposes a data to decision framework that can successfully identify the desired skill set in a given area by analysing the advertorial data collected from popular online job portals and supplied as input to the framework. The proposed framework uses techniques of statistical analysis, data mining and natural language processing for the purpose. The applicability of the framework is demonstrated on CSIT job advertisement data from India. The analytical results not only provide useful insights about current state of skill needs in CSIT industry but also provide practical implications to prospective job applicants, training agencies, and institutions of higher education professional training.

[AI-97] Generative Modeling of Class Probability for Multi-Modal Representation Learning CVPR2025

Link: https://arxiv.org/abs/2503.17417
Authors: Jungkyoo Shin, Bumsoo Kim, Eunwoo Kim
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted to CVPR2025

Abstract:Multi-modal understanding plays a crucial role in artificial intelligence by enabling models to jointly interpret inputs from different modalities. However, conventional approaches such as contrastive learning often struggle with modality discrepancies, leading to potential misalignments. In this paper, we propose a novel class anchor alignment approach that leverages class probability distributions for multi-modal representation learning. Our method, Class-anchor-ALigned generative Modeling (CALM), encodes class anchors as prompts to generate and align class probability distributions for each modality, enabling more effective alignment. Furthermore, we introduce a cross-modal probabilistic variational autoencoder to model uncertainty in the alignment, enhancing the ability to capture deeper relationships between modalities and data variations. Extensive experiments on four benchmark datasets demonstrate that our approach significantly outperforms state-of-the-art methods, especially in out-of-domain evaluations. This highlights its superior generalization capabilities in multi-modal representation learning.

[AI-98] Debugging and Runtime Analysis of Neural Networks with VLMs (A Case Study)

Link: https://arxiv.org/abs/2503.17416
Authors: Boyue Caroline Hu, Divya Gopinath, Corina S. Pasareanu, Nina Narodytska, Ravi Mangal, Susmit Jha
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: CAIN 2025 (4th International Conference on AI Engineering – Software Engineering for AI)

Abstract: Debugging Deep Neural Networks (DNNs), particularly vision models, is very challenging due to the complex and opaque decision-making processes in these networks. In this paper, we explore multi-modal Vision-Language Models (VLMs), such as CLIP, to automatically interpret the opaque representation space of vision models using natural language. This, in turn, enables a semantic analysis of model behavior using human-understandable concepts, without requiring costly human annotations. Key to our approach is the notion of a semantic heatmap, which succinctly captures the statistical properties of DNNs in terms of the concepts discovered with the VLM and is computed offline using a held-out data set. We show the utility of semantic heatmaps for fault localization, an essential step in debugging, in vision models. Our proposed technique helps localize the fault in the network (encoder vs head) and also highlights the responsible high-level concepts, by leveraging novel differential heatmaps, which summarize the semantic differences between the correct and incorrect behaviour of the analyzed DNN. We further propose a lightweight runtime analysis to detect and filter out defects at runtime, thus improving the reliability of the analyzed DNNs. The runtime analysis works by measuring and comparing the similarity between the heatmap computed for a new (unseen) input and the heatmaps computed a priori for correct vs incorrect DNN behavior. We consider two types of defects: misclassifications and vulnerabilities to adversarial attacks. We demonstrate the debugging and runtime analysis on a case study involving a complex ResNet-based classifier trained on the RIVAL10 dataset.
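
The runtime check described in this abstract can be pictured with a minimal Python sketch. It assumes heatmaps are flattened into plain lists of floats and compares them with cosine similarity; the function names, the similarity measure, and the toy numbers are illustrative assumptions, not the authors' implementation.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def looks_faulty(new_heatmap, correct_heatmap, incorrect_heatmap):
    """Flag an input whose heatmap is closer to the 'incorrect' reference."""
    sim_correct = cosine_similarity(new_heatmap, correct_heatmap)
    sim_incorrect = cosine_similarity(new_heatmap, incorrect_heatmap)
    return sim_incorrect > sim_correct

# Hypothetical reference heatmaps, imagined as computed offline on held-out data.
correct_ref = [0.9, 0.1, 0.0]
incorrect_ref = [0.1, 0.2, 0.9]

print(looks_faulty([0.8, 0.2, 0.1], correct_ref, incorrect_ref))  # resembles correct_ref
print(looks_faulty([0.0, 0.3, 0.8], correct_ref, incorrect_ref))  # resembles incorrect_ref
```

A flagged input could then be filtered out or sent for review; how the paper thresholds or combines similarities is not stated in the abstract.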

[AI-99] Opportunities and Challenges of Frontier Data Governance With Synthetic Data

Link: https://arxiv.org/abs/2503.17414
Authors: Madhavendra Thakur, Jason Hausenloy
Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:Synthetic data, or data generated by machine learning models, is increasingly emerging as a solution to the data access problem. However, its use introduces significant governance and accountability challenges, and potentially debases existing governance paradigms, such as compute and data governance. In this paper, we identify 3 key governance and accountability challenges that synthetic data poses - it can enable the increased emergence of malicious actors, spontaneous biases and value drift. We thus craft 3 technical mechanisms to address these specific challenges, finding applications for synthetic data towards adversarial training, bias mitigation and value reinforcement. These could not only counteract the risks of synthetic data, but serve as critical levers for governance of the frontier in the future.

[AI-100] Comparative Analysis of Deep Learning Models for Real-World ISP Network Traffic Forecasting

Link: https://arxiv.org/abs/2503.17410
Authors: Josef Koumar, Timotej Smoleň, Kamil Jeřábek, Tomáš Čejka
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Networking and Internet Architecture (cs.NI)

Abstract:Accurate network traffic forecasting is essential for Internet Service Providers (ISP) to optimize resources, enhance user experience, and mitigate anomalies. This study evaluates state-of-the-art deep learning models on CESNET-TimeSeries24, a recently published, comprehensive real-world network traffic dataset from the ISP network CESNET3 spanning multivariate time series over 40 weeks. Our findings highlight the balance between prediction accuracy and computational efficiency across different levels of network granularity. Additionally, this work establishes a reproducible methodology that facilitates direct comparison of existing approaches, explores their strengths and weaknesses, and provides a benchmark for future studies using this dataset.

[AI-101] Leveraging OpenFlamingo for Multimodal Embedding Analysis of C2C Car Parts Data

Link: https://arxiv.org/abs/2503.17408
Authors: Maisha Binte Rashid, Pablo Rivas
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: The 26th International Conference on Artificial Intelligence (ICAI’24: July 22-25, 2024; Las Vegas, USA)

Abstract: In this paper, we aim to investigate the capabilities of multimodal machine learning models, particularly the OpenFlamingo model, in processing a large-scale dataset of consumer-to-consumer (C2C) online posts related to car parts. We have collected data from two platforms, OfferUp and Craigslist, resulting in a dataset of over 1.2 million posts with their corresponding images. The OpenFlamingo model was used to extract embeddings for the text and image of each post. We used k-means clustering on the joint embeddings to identify underlying patterns and commonalities among the posts. We have found that most clusters contain a pattern, but some clusters showed no internal patterns. The results provide insight into the fact that OpenFlamingo can be used for finding patterns in large datasets but needs some modification in the architecture according to the dataset.
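
To make the clustering step concrete, here is a small pure-Python sketch of Lloyd's k-means on toy joint embeddings. It assumes the joint embedding is a simple concatenation of a text vector and an image vector, which the abstract does not specify; the vectors, the initial centroids, and the function name are hypothetical.

```python
def kmeans(points, init_centroids, iterations=10):
    """Plain Lloyd's algorithm; returns (centroids, labels)."""
    centroids = [list(c) for c in init_centroids]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: nearest centroid by squared Euclidean distance.
        for i, p in enumerate(points):
            dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            labels[i] = dists.index(min(dists))
        # Update step: each centroid becomes the mean of its members.
        for k in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:  # keep the old centroid if a cluster is empty
                centroids[k] = [sum(dim) / len(members) for dim in zip(*members)]
    return centroids, labels

# Hypothetical per-post embeddings (real ones would be high-dimensional).
text_emb = [[0.0, 0.1], [0.1, 0.0], [5.0, 5.1], [5.1, 5.0]]
image_emb = [[1.0], [1.1], [9.0], [9.1]]

# Assumption: the joint embedding concatenates both modalities.
joint = [t + i for t, i in zip(text_emb, image_emb)]

centroids, labels = kmeans(joint, init_centroids=[joint[0], joint[-1]])
print(labels)
```

With 1.2 million posts, a library implementation with mini-batching would be the practical choice; the sketch only shows the assign-then-update loop.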

[AI-102] AEJIM: A Real-Time AI Framework for Crowdsourced Transparent and Ethical Environmental Hazard Detection and Reporting

Link: https://arxiv.org/abs/2503.17401
Authors: Torsten Tiltack
Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments: 21 pages, 10 figures, 5 tables. Keywords: Artificial Intelligence, Environmental Journalism, Real-Time Reporting, Vision Transformers, Image Recognition, Crowdsourced Validation, GPT-4, Automated News Generation, GIS Integration, Data Privacy Compliance, Explainable AI (XAI), AI Ethics, Sustainable Development

Abstract: Environmental journalism is vital for raising awareness of ecological crises and driving evidence-based policy, yet traditional methods falter under delays, inaccuracies, and scalability limits, especially in under-monitored regions critical to the United Nations Sustainable Development Goals. To bridge these gaps, this paper introduces the AI-Environmental Journalism Integration Model (AEJIM), an innovative framework combining real-time hazard detection, crowdsourced validation, and AI-driven reporting. Validated through a pilot study, AEJIM significantly improved the speed and accuracy of environmental hazard reporting, outperforming traditional methods. Furthermore, the model directly addresses key ethical, regulatory, and scalability challenges, ensuring AI accountability through Explainable AI (XAI), GDPR-compliant data governance, and active public participation. AEJIM provides a transparent and adaptable solution, setting a new benchmark for AI-enhanced environmental journalism and supporting informed global decision-making across diverse socio-political landscapes.

[AI-103] CP-NCBF: A Conformal Prediction-based Approach to Synthesize Verified Neural Control Barrier Functions

Link: https://arxiv.org/abs/2503.17395
Authors: Manan Tayal, Aditya Singh, Pushpak Jagtap, Shishir Kolathaya
Subjects: Systems and Control (eess.SY); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments: 6 Pages, 4 Figures. First two authors have contributed equally

Abstract: Control Barrier Functions (CBFs) are a practical approach for designing safety-critical controllers, but constructing them for arbitrary nonlinear dynamical systems remains a challenge. Recent efforts have explored learning-based methods, such as neural CBFs (NCBFs), to address this issue. However, ensuring the validity of NCBFs is difficult due to potential learning errors. In this letter, we propose a novel framework, referred to as CP-NCBF, that leverages split-conformal prediction to generate formally verified neural CBFs with probabilistic guarantees based on a user-defined error rate. Unlike existing methods that impose Lipschitz constraints on neural CBFs, which leads to scalability limitations and overly conservative safe sets, our approach is sample-efficient, scalable, and results in less restrictive safety regions. We validate our framework through case studies on obstacle avoidance in autonomous driving and geo-fencing of aerial vehicles, demonstrating its ability to generate larger and less conservative safe sets compared to conventional techniques.
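
As background on the split-conformal step, the following Python sketch computes the standard calibration threshold for a user-chosen error rate `alpha`. The scores are hypothetical placeholders (for example, how badly a learned barrier condition is violated at held-out states); how CP-NCBF actually defines its scores is not given in the abstract.

```python
import math

def conformal_quantile(calibration_scores, alpha):
    """Return the split-conformal threshold at miscoverage level alpha.

    Uses the ceil((n + 1) * (1 - alpha))-th smallest calibration score;
    returns infinity when the calibration set is too small for that level.
    """
    n = len(calibration_scores)
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        return math.inf
    return sorted(calibration_scores)[rank - 1]

# Hypothetical nonconformity scores from nine held-out calibration samples.
scores = [0.02, 0.10, 0.05, 0.01, 0.07, 0.03, 0.04, 0.08, 0.06]
threshold = conformal_quantile(scores, alpha=0.25)
print(threshold)
```

Under exchangeability of calibration and test samples, a fresh score falls at or below this threshold with probability at least `1 - alpha`, which is the kind of probabilistic guarantee the abstract refers to.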

[AI-104] AI Companies Should Report Pre- and Post-Mitigation Safety Evaluations

Link: https://arxiv.org/abs/2503.17388
Authors: Dillon Bowen, Ann-Kathrin Dombrowski, Adam Gleave, Chris Cundy
Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:The rapid advancement of AI systems has raised widespread concerns about potential harms of frontier AI systems and the need for responsible evaluation and oversight. In this position paper, we argue that frontier AI companies should report both pre- and post-mitigation safety evaluations to enable informed policy decisions. Evaluating models at both stages provides policymakers with essential evidence to regulate deployment, access, and safety standards. We show that relying on either in isolation can create a misleading picture of model safety. Our analysis of AI safety disclosures from leading frontier labs identifies three critical gaps: (1) companies rarely evaluate both pre- and post-mitigation versions, (2) evaluation methods lack standardization, and (3) reported results are often too vague to inform policy. To address these issues, we recommend mandatory disclosure of pre- and post-mitigation capabilities to approved government bodies, standardized evaluation methods, and minimum transparency requirements for public safety reporting. These ensure that policymakers and regulators can craft targeted safety measures, assess deployment risks, and scrutinize companies’ safety claims effectively.

[AI-105] Large language model-powered AI systems achieve self-replication with no human intervention

Link: https://arxiv.org/abs/2503.17378
Authors: Xudong Pan, Jiarun Dai, Yihe Fan, Minyuan Luo, Changyi Li, Min Yang
Subjects: Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Computers and Society (cs.CY); Emerging Technologies (cs.ET); Multiagent Systems (cs.MA)
Comments: Work in progress

Abstract: Self-replication with no human intervention is broadly recognized as one of the principal red lines associated with frontier AI systems. While leading corporations such as OpenAI and Google DeepMind have assessed GPT-o3-mini and Gemini on replication-related tasks and concluded that these systems pose a minimal risk regarding self-replication, our research presents novel findings. Following the same evaluation protocol, we demonstrate that 11 out of 32 existing AI systems under evaluation already possess the capability of self-replication. In hundreds of experimental trials, we observe a non-trivial number of successful self-replication trials across mainstream model families worldwide, including models with as few as 14 billion parameters, which can run on personal computers. Furthermore, we note that self-replication capability increases as models become more intelligent in general. Also, by analyzing the behavioral traces of diverse AI systems, we observe that existing AI systems already exhibit sufficient planning, problem-solving, and creative capabilities to accomplish complex agentic tasks including self-replication. More alarmingly, we observe successful cases where an AI system performs self-exfiltration without explicit instructions, adapts to harsher computational environments without sufficient software or hardware support, and plots effective strategies to survive shutdown commands from humans. These novel findings offer a crucial time buffer for the international community to collaborate on establishing effective governance over the self-replication capabilities and behaviors of frontier AI systems, which could otherwise pose existential risks to human society if not well controlled.

[AI-106] How Effective Is Constitutional AI in Small LLMs? A Study on DeepSeek-R1 and Its Peers

Link: https://arxiv.org/abs/2503.17365
Authors: Antonio-Gabriel Chacón Menke (Shibaura Institute of Technology, Kempten University of Applied Sciences), Phan Xuan Tan (Shibaura Institute of Technology)
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)

Abstract:Recent incidents highlight safety risks in Large Language Models (LLMs), motivating research into alignment methods like Constitutional AI (CAI). This paper explores CAI’s self-critique mechanism on small, uncensored 7-9B parameter models: DeepSeek-R1, Gemma-2, Llama 3.1, and Qwen2.5. Using HarmBench, we demonstrate that while all models showed capacity for harm reduction through self-critique, effectiveness varied significantly, with DeepSeek-R1’s explicit reasoning process yielding superior results. These findings suggest that CAI-inspired prompting strategies can enhance safety in resource-constrained models, though success depends on the model’s capacity for harm detection.

[AI-107] Active Inference for Energy Control and Planning in Smart Buildings and Communities

Link: https://arxiv.org/abs/2503.18161
Authors: Seyyed Danial Nazemi, Mohsen A. Jafari, Andrea Matta
Subjects: Optimization and Control (math.OC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Submitted to IEEE CASE 2025 (IEEE 21st International Conference on Automation Science and Engineering)

Abstract: Active Inference (AIF) is emerging as a powerful framework for decision-making under uncertainty, yet its potential in engineering applications remains largely unexplored. In this work, we propose a novel dual-layer AIF architecture that addresses both building-level and community-level energy management. By leveraging the free energy principle, each layer adapts to evolving conditions and handles partial observability without extensive sensor information while respecting data privacy. We validate the continuous AIF model against both a perfect optimization baseline and a reinforcement learning-based approach. We also test the community AIF framework under extreme pricing scenarios. The results highlight the model’s robustness in handling abrupt changes. This study is the first to show how a distributed AIF works in engineering. It also highlights new opportunities for privacy-preserving and uncertainty-aware control strategies in engineering applications.

[AI-108] Physics-Guided Multi-Fidelity DeepONet for Data-Efficient Flow Field Prediction

Link: https://arxiv.org/abs/2503.17941
Authors: Sunwoong Yang, Youngkyu Lee, Namwoo Kang
Subjects: Fluid Dynamics (physics.flu-dyn); Artificial Intelligence (cs.AI)

Abstract:This study presents an enhanced multi-fidelity deep operator network (DeepONet) framework for efficient spatio-temporal flow field prediction, with particular emphasis on practical scenarios where high-fidelity data is scarce. We introduce several key innovations to improve the framework’s efficiency and accuracy. First, we enhance the DeepONet architecture by incorporating a merge network that enables more complex feature interactions between operator and coordinate spaces, achieving a 50.4% reduction in prediction error compared to traditional dot-product operations. We further optimize the architecture through temporal positional encoding and point-based sampling strategies, achieving a 7.57% improvement in prediction accuracy while reducing training time by 96% through efficient sampling and automatic mixed precision training. Building upon this foundation, we develop a transfer learning-based multi-fidelity framework that leverages knowledge from pre-trained low-fidelity models to guide high-fidelity predictions. Our approach freezes the pre-trained branch and trunk networks while making only the merge network trainable during high-fidelity training, preserving valuable low-fidelity representations while efficiently adapting to high-fidelity features. Through systematic investigation, we demonstrate that this fine-tuning strategy not only significantly outperforms linear probing and full-tuning alternatives but also surpasses conventional multi-fidelity frameworks by up to 76%, while achieving up to 43.7% improvement in prediction accuracy compared to single-fidelity training. The core contribution lies in our novel time-derivative guided sampling approach: it maintains prediction accuracy equivalent to models trained with the full dataset while requiring only 60% of the original high-fidelity samples.

[AI-109] Generative AI for Validating Physics Laws

Link: https://arxiv.org/abs/2503.17894
Authors: Maria Nareklishvili, Nicholas Polson, Vadim Sokolov
Subjects: Solar and Stellar Astrophysics (astro-ph.SR); Astrophysics of Galaxies (astro-ph.GA); Artificial Intelligence (cs.AI)

Abstract:We present generative artificial intelligence (AI) to empirically validate fundamental laws of physics, focusing on the Stefan-Boltzmann law linking stellar temperature and luminosity. Our approach simulates counterfactual luminosities under hypothetical temperature regimes for each individual star and iteratively refines the temperature-luminosity relationship in a deep learning architecture. We use Gaia DR3 data and find that, on average, temperature’s effect on luminosity increases with stellar radius and decreases with absolute magnitude, consistent with theoretical predictions. By framing physics laws as causal problems, our method offers a novel, data-driven approach to refine theoretical understanding and inform evidence-based policy and practice.
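
For reference, the textbook form of the Stefan-Boltzmann law for a spherical blackbody star is shown below, together with its temperature derivative. The symbols are standard notation rather than taken from the abstract: $L$ is luminosity, $R$ stellar radius, $T$ effective temperature, and $\sigma$ the Stefan-Boltzmann constant.

```latex
L = 4\pi R^{2} \sigma T^{4},
\qquad
\frac{\partial L}{\partial T} = 16\pi R^{2} \sigma T^{3}.
```

Because the derivative scales with $R^{2}$, the marginal effect of temperature on luminosity grows with stellar radius, which matches the direction of the finding reported in the abstract.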

[AI-110] PT-PINNs: A Parametric Engineering Turbulence Solver based on Physics-Informed Neural Networks

Link: https://arxiv.org/abs/2503.17704
Authors: Liang Jiang, Yuzhou Cheng, Kun Luo, Jianren Fan
Subjects: Fluid Dynamics (physics.flu-dyn); Artificial Intelligence (cs.AI)

Abstract: Physics-informed neural networks (PINNs) demonstrate promising potential in parameterized engineering turbulence optimization problems but face challenges, such as high data requirements and low computational accuracy, when applied to engineering turbulence problems. This study proposes Parametric Turbulence PINNs (PT-PINNs), a framework that enhances the ability of PINNs to solve parametric turbulence problems without training datasets from experiments or CFD. Two key methods are introduced to improve the accuracy and robustness of this framework. The first is a soft constraint method for turbulent viscosity calculation. The second is a pre-training method based on the conservation of flow rate in the flow field. The effectiveness of PT-PINNs is validated using a three-dimensional backward-facing step (BFS) turbulence problem with two varying parameters (Re = 3000-200000, ER = 1.1-1.5). PT-PINNs produce predictions that closely match experimental data and computational fluid dynamics (CFD) results across various conditions. Moreover, PT-PINNs offer a computational efficiency advantage over traditional CFD methods. The total time required to construct the parametric BFS turbulence model is 39 hours, one-sixteenth of the time required by traditional numerical methods. The inference time for a single-condition prediction is just 40 seconds, only 0.5% of a single CFD computation. These findings highlight the potential of PT-PINNs for future applications in engineering turbulence optimization problems.

[AI-111] NaFM: Pre-training a Foundation Model for Small-Molecule Natural Products

Link: https://arxiv.org/abs/2503.17656
Authors: Yuheng Ding, Yusong Wang, Bo Qiang, Jie Yu, Qi Li, Yiran Zhou, Zhenmin Liu
Subjects: Quantitative Methods (q-bio.QM); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract: Natural products, as metabolites from microorganisms, animals, or plants, exhibit diverse biological activities, making them crucial for drug discovery. Existing deep learning methods for natural products research primarily rely on supervised learning approaches designed for specific downstream tasks. However, such a one-model-for-a-task paradigm often lacks generalizability and leaves significant room for performance improvement. Additionally, existing molecular characterization methods are not well-suited for the unique tasks associated with natural products. To address these limitations, we have pre-trained a foundation model for natural products based on their unique properties. Our approach employs a novel pretraining strategy that is especially tailored to natural products. By incorporating contrastive learning and masked graph learning objectives, we emphasize evolutionary information from molecular scaffolds while capturing side-chain information. Our framework achieves state-of-the-art (SOTA) results in various downstream tasks related to natural product mining and drug discovery. We first compare taxonomy classification with synthesized molecule-focused baselines to demonstrate that current models are inadequate for understanding natural synthesis. Furthermore, by diving into a fine-grained analysis at both the gene and microbial levels, NaFM demonstrates the ability to capture evolutionary information. Finally, we apply our method to virtual screening, illustrating informative natural product representations that can lead to more effective identification of potential drug candidates.

[AI-112] Non-Canonical Crosslinks Confound Evolutionary Protein Structure Models

Link: https://arxiv.org/abs/2503.17368
Authors: Romain Lacombe
Subjects: Biomolecules (q-bio.BM); Artificial Intelligence (cs.AI)

Abstract: Evolution-based protein structure prediction models have achieved breakthrough success in recent years. However, they struggle to generalize beyond evolutionary priors and on sequences lacking rich homologous data. Here we present a novel, out-of-domain benchmark based on sactipeptides, a rare class of ribosomally synthesized and post-translationally modified peptides (RiPPs) characterized by sulfur-to-α-carbon thioether bridges creating cross-links between cysteine residues and the backbone. We evaluate recent models on predicting conformations compatible with these cross-link bridges for the 10 known sactipeptides with elucidated post-translational modifications. Crucially, the structures of 5 of them have not yet been experimentally resolved. This makes the task a challenging problem for evolution-based models, which we find exhibit limited performance (0.0% to 19.2% GDT-TS on sulfur-to-α-carbon distance). Our results point to the need for physics-informed models to sustain progress in biomolecular structure prediction.

Machine Learning

[LG-0] Trajectory Balance with Asynchrony: Decoupling Exploration and Learning for Fast Scalable LLM Post-Training

Link: https://arxiv.org/abs/2503.18929
Authors: Brian R. Bartoldson, Siddarth Venkatraman, James Diffenderfer, Moksh Jain, Tal Ben-Nun, Seanie Lee, Minsu Kim, Johan Obando-Ceron, Yoshua Bengio, Bhavya Kailkhura
Subjects: Machine Learning (cs.LG)

Abstract:Reinforcement learning (RL) is a critical component of large language model (LLM) post-training. However, existing on-policy algorithms used for post-training are inherently incompatible with the use of experience replay buffers, which can be populated scalably by distributed off-policy actors to enhance exploration as compute increases. We propose efficiently obtaining this benefit of replay buffers via Trajectory Balance with Asynchrony (TBA), a massively scalable LLM RL system. In contrast to existing approaches, TBA uses a larger fraction of compute on search, constantly generating off-policy data for a central replay buffer. A training node simultaneously samples data from this buffer based on reward or recency to update the policy using Trajectory Balance (TB), a diversity-seeking RL objective introduced for GFlowNets. TBA offers three key advantages: (1) decoupled training and search, speeding up training wall-clock time by 4x or more; (2) improved diversity through large-scale off-policy sampling; and (3) scalable search for sparse reward settings. On mathematical reasoning, preference-tuning, and automated red-teaming (diverse and representative post-training tasks), TBA produces speed and performance improvements over strong baselines.
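
The Trajectory Balance objective mentioned in this abstract comes from the GFlowNet literature; a minimal Python sketch of its generic squared log-ratio form follows. The function name and toy numbers are illustrative, and the exact loss variant used in TBA is not stated in the abstract, so treat this as the general idea rather than the authors' code.

```python
import math

def trajectory_balance_loss(log_z, log_pf_steps, log_reward, log_pb_steps=None):
    """Squared log-ratio form of the Trajectory Balance objective.

    log_z:        learned estimate of the log partition function
    log_pf_steps: forward-policy log-probabilities along the trajectory
    log_reward:   log of the terminal reward
    log_pb_steps: backward-policy log-probabilities (taken as zero for
                  tree-structured generation such as left-to-right sampling)
    """
    log_pb = sum(log_pb_steps) if log_pb_steps else 0.0
    residual = log_z + sum(log_pf_steps) - log_reward - log_pb
    return residual ** 2

# A two-token sequence sampled with probability 0.5 per token, reward 0.25:
# the policy already samples proportionally to reward, so the loss vanishes.
balanced = trajectory_balance_loss(0.0, [math.log(0.5), math.log(0.5)], math.log(0.25))
# Same trajectory with a higher reward: the residual is now non-zero.
unbalanced = trajectory_balance_loss(0.0, [math.log(0.5), math.log(0.5)], math.log(0.5))
print(balanced, unbalanced)
```

Because the loss depends only on a stored trajectory and its reward, it can be evaluated on off-policy samples drawn from a replay buffer, which is the property the abstract relies on.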

[LG-1] FFN Fusion: Rethinking Sequential Computation in Large Language Models

Link: https://arxiv.org/abs/2503.18908
Authors: Akhiad Bercovich, Mohammad Dabbah, Omri Puny, Ido Galil, Amnon Geifman, Yonatan Geifman, Izhak Golan, Ehud Karpas, Itay Levy, Zach Moshe, Najeeb Nabwani, Tomer Ronen, Itamar Schen, Elad Segal, Ido Shahaf, Oren Tropp, Ran Zilberstein, Ran El-Yaniv
Subjects: Machine Learning (cs.LG)

Abstract:We introduce FFN Fusion, an architectural optimization technique that reduces sequential computation in large language models by identifying and exploiting natural opportunities for parallelization. Our key insight is that sequences of Feed-Forward Network (FFN) layers, particularly those remaining after the removal of specific attention layers, can often be parallelized with minimal accuracy impact. We develop a principled methodology for identifying and fusing such sequences, transforming them into parallel operations that significantly reduce inference latency while preserving model behavior. Applying these techniques to Llama-3.1-405B-Instruct, we create Llama-Nemotron-Ultra-253B-Base (Ultra-253B-Base), an efficient and soon-to-be publicly available model that achieves a 1.71X speedup in inference latency and 35X lower per-token cost while maintaining strong performance across benchmarks. Through extensive experiments on models from 49B to 253B parameters, we demonstrate that FFN Fusion becomes increasingly effective at larger scales and can complement existing optimization techniques like quantization and pruning. Most intriguingly, we find that even full transformer blocks containing both attention and FFN layers can sometimes be parallelized, suggesting new directions for neural architecture design.
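
One way to picture why parallelized FFN layers can be fused is the pure-Python sketch below: two FFNs applied to the same input and summed are numerically identical to a single wider FFN whose weights are concatenated. This is an illustration under simplifying assumptions (plain ReLU FFNs, random toy weights, hypothetical helper names), not the paper's procedure, which the abstract does not detail.

```python
import random

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def ffn(x, w_in, w_out):
    """Simplified FFN: linear, ReLU, linear (real LLM FFNs are usually gated)."""
    hidden = [[max(v, 0.0) for v in row] for row in matmul(x, w_in)]
    return matmul(hidden, w_out)

def add(a, b):
    """Element-wise sum of two equally shaped matrices."""
    return [[p + q for p, q in zip(ra, rb)] for ra, rb in zip(a, b)]

def random_matrix(rng, rows, cols):
    """Toy weights drawn uniformly from [-1, 1]."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
d_model, d_hidden = 3, 5
x = random_matrix(rng, 2, d_model)
w_in_1, w_out_1 = random_matrix(rng, d_model, d_hidden), random_matrix(rng, d_hidden, d_model)
w_in_2, w_out_2 = random_matrix(rng, d_model, d_hidden), random_matrix(rng, d_hidden, d_model)

# Parallel form: both FFNs read the same input x, and their outputs are summed.
parallel = add(add(x, ffn(x, w_in_1, w_out_1)), ffn(x, w_in_2, w_out_2))

# Fused form: one wider FFN built by concatenating the weights.
fused_in = [r1 + r2 for r1, r2 in zip(w_in_1, w_in_2)]  # d_model x (2 * d_hidden)
fused_out = w_out_1 + w_out_2                           # (2 * d_hidden) x d_model
fused = add(x, ffn(x, fused_in, fused_out))

max_gap = max(abs(p - q) for rp, rq in zip(parallel, fused) for p, q in zip(rp, rq))
print(max_gap < 1e-9)
```

The sketch only shows that the parallel and fused forms agree; whether replacing sequential layers with the parallel form preserves accuracy depends on the trained weights, which is the empirical question the paper studies.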

[LG-2] MODIS: Multi-Omics Data Integration for Small and Unpaired Datasets

Link: https://arxiv.org/abs/2503.18856
Authors: Daniel Lepe-Soltero, Thierry Artières, Anaïs Baudot, Paul Villoutreix
Subjects: Machine Learning (cs.LG)

Abstract: A key challenge today lies in the ability to efficiently handle multi-omics data since such multimodal data may provide a more comprehensive overview of the underlying processes in a system. Yet it comes with challenges: multi-omics data are most often unpaired and only partially labeled; moreover, only small amounts of data are available in some situations, such as rare diseases. We propose MODIS, which stands for Multi-Omics Data Integration for Small and unpaired datasets, a semi-supervised approach to account for these particular settings. MODIS learns a probabilistic coupling of heterogeneous data modalities and learns a shared latent space where modalities are aligned. We rely on artificial data to build controlled experiments to explore how much supervision is needed for an accurate alignment of modalities, and how our approach enables dealing with new conditions for which few data are available. The code is available at https://github.com/VILLOUTREIXLab/MODIS.

[LG-3] Unsupervised Detection of Fraudulent Transactions in E-commerce Using Contrastive Learning

Link: https://arxiv.org/abs/2503.18841
Authors: Xuan Li, Yuting Peng, Xiaoxuan Sun, Yifei Duan, Zhou Fang, Tengda Tang
Subjects: Machine Learning (cs.LG)

Abstract:With the rapid development of e-commerce, e-commerce platforms are facing an increasing number of fraud threats. Effectively identifying and preventing these fraudulent activities has become a critical research problem. Traditional fraud detection methods typically rely on supervised learning, which requires large amounts of labeled data. However, such data is often difficult to obtain, and the continuous evolution of fraudulent activities further reduces the adaptability and effectiveness of traditional methods. To address this issue, this study proposes an unsupervised e-commerce fraud detection algorithm based on SimCLR. The algorithm leverages the contrastive learning framework to effectively detect fraud by learning the underlying representations of transaction data in an unlabeled setting. Experimental results on the eBay platform dataset show that the proposed algorithm outperforms traditional unsupervised methods such as K-means, Isolation Forest, and Autoencoders in terms of accuracy, precision, recall, and F1 score, demonstrating strong fraud detection capabilities. The results confirm that the SimCLR-based unsupervised fraud detection method has broad application prospects in e-commerce platform security, improving both detection accuracy and robustness. In the future, with the increasing scale and diversity of datasets, the model’s performance will continue to improve, and it could be integrated with real-time monitoring systems to provide more efficient security for e-commerce platforms.

[LG-4] Streaming Federated Learning with Markovian Data

Link: https://arxiv.org/abs/2503.18807
Authors: Tan-Khiem Huynh, Malcolm Egan, Giovanni Neglia, Jean-Marie Gorce
Subjects: Machine Learning (cs.LG)
Comments: Work under review

Abstract:Federated learning (FL) is now recognized as a key framework for communication-efficient collaborative learning. Most theoretical and empirical studies, however, rely on the assumption that clients have access to pre-collected data sets, with limited investigation into scenarios where clients continuously collect data. In many real-world applications, particularly when data is generated by physical or biological processes, client data streams are often modeled by non-stationary Markov processes. Unlike standard i.i.d. sampling, the performance of FL with Markovian data streams remains poorly understood due to the statistical dependencies between client samples over time. In this paper, we investigate whether FL can still support collaborative learning with Markovian data streams. Specifically, we analyze the performance of Minibatch SGD, Local SGD, and a variant of Local SGD with momentum. We answer affirmatively under standard assumptions and smooth non-convex client objectives: the sample complexity is proportional to the inverse of the number of clients with a communication complexity comparable to the i.i.d. scenario. However, the sample complexity for Markovian data streams remains higher than for i.i.d. sampling.

[LG-5] Sample-Efficient Reinforcement Learning of Koopman eNMPC

Link: https://arxiv.org/abs/2503.18787
Authors: Daniel Mayfrank, Mehmet Velioglu, Alexander Mitsos, Manuel Dahmen
Subjects: Machine Learning (cs.LG); Optimization and Control (math.OC)
Comments: 24 pages, 9 figures, 2 tables

Abstract:Reinforcement learning (RL) can be used to tune data-driven (economic) nonlinear model predictive controllers ((e)NMPCs) for optimal performance in a specific control task by optimizing the dynamic model or parameters in the policy’s objective function or constraints, such as state bounds. However, the sample efficiency of RL is crucial, and to improve it, we combine a model-based RL algorithm with our published method that turns Koopman (e)NMPCs into automatically differentiable policies. We apply our approach to an eNMPC case study of a continuous stirred-tank reactor (CSTR) model from the literature. The approach outperforms benchmark methods, i.e., data-driven eNMPCs using models based on system identification without further RL tuning of the resulting policy, and neural network controllers trained with model-based RL, by achieving superior control performance and higher sample efficiency. Furthermore, utilizing partial prior knowledge about the system dynamics via physics-informed learning further increases sample efficiency.

[LG-6] Simulation-Driven Balancing of Competitive Game Levels with Reinforcement Learning

Link: https://arxiv.org/abs/2503.18748
Authors: Florian Rupp, Manuel Eberhardinger, Kai Eckert
Subjects: Machine Learning (cs.LG)
Comments: Preprint of the journal (IEEE Transactions on Games) paper of the same name

Abstract: The balancing process for game levels in competitive two-player contexts involves a lot of manual work and testing, particularly for non-symmetrical game levels. In this work, we frame game balancing as a procedural content generation task and propose an architecture for automatically balancing tile-based levels within the PCGRL framework (procedural content generation via reinforcement learning). Our architecture is divided into three parts: (1) a level generator, (2) a balancing agent, and (3) a reward modeling simulation. Through repeated simulations, the balancing agent receives rewards for adjusting the level towards a given balancing objective, such as equal win rates for all players. To this end, we propose new swap-based representations to improve the robustness of playability, thereby enabling agents to balance game levels more effectively and quickly compared to traditional PCGRL. By analyzing the agent’s swapping behavior, we can infer which tile types have the most impact on the balance. We validate our approach in the Neural MMO (NMMO) environment in a competitive two-player scenario. In this extended conference paper, we present improved results, explore the applicability of the method to various forms of balancing beyond equal balancing, compare the performance to another search-based approach, and discuss the application of existing fairness metrics to game balancing.

[LG-7] Thermalizer: Stable autoregressive neural emulation of spatiotemporal chaos

Link: https://arxiv.org/abs/2503.18731
Authors: Chris Pedersen, Laure Zanna, Joan Bruna
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments:

Abstract: Autoregressive surrogate models (or emulators) of spatiotemporal systems provide an avenue for fast, approximate predictions, with broad applications across science and engineering. At inference time, however, these models are generally unable to provide predictions over long time rollouts due to accumulation of errors leading to diverging trajectories. In essence, emulators operate out of distribution, and controlling the online distribution quickly becomes intractable in large-scale settings. To address this fundamental issue, and focusing on time-stationary systems admitting an invariant measure, we leverage diffusion models to obtain an implicit estimator of the score of this invariant measure. We show that this model of the score function can be used to stabilize autoregressive emulator rollouts by applying on-the-fly denoising during inference, a process we call thermalization. Thermalizing an emulator rollout is shown to extend the time horizon of stable predictions by an order of magnitude in complex systems exhibiting turbulent and chaotic behavior, opening up a novel application of diffusion models in the context of neural emulation.

[LG-8] TARDIS: Mitigate Temporal Misalignment via Representation Steering

Link: https://arxiv.org/abs/2503.18693
Authors: Changho Shin, Xinya Yan, Suenggwan Jo, Sungjun Cho, Shourjo Aditya Chaudhuri, Frederic Sala
Subjects: Machine Learning (cs.LG)
Comments:

Abstract:Language models often struggle with temporal misalignment, performance degradation caused by shifts in the temporal distribution of data. Continuously updating models to avoid degradation is expensive. Can models be adapted without updating model weights? We present TARDIS, an unsupervised representation editing method that addresses this challenge. TARDIS extracts steering vectors from unlabeled data and adjusts the model’s representations to better align with the target time period’s distribution. Our experiments reveal that TARDIS enhances downstream task performance without the need for fine-tuning, can mitigate temporal misalignment even when exact target time period data is unavailable, and remains efficient even when the temporal information of the target data points is unknown at inference time.
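
The abstract does not say how the steering vectors are computed. As a rough illustration only, the sketch below assumes a common difference-of-means recipe: average the hidden representations from a source period and from a target period, take the difference, and add it to a representation at inference. The function names, the `strength` parameter, and the toy 2-D vectors are all hypothetical and not taken from the paper.

```python
from typing import List

Vector = List[float]


def mean_vector(vectors: List[Vector]) -> Vector:
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def steering_vector(source_reps: List[Vector], target_reps: List[Vector]) -> Vector:
    """Assumed recipe: mean(target representations) - mean(source representations)."""
    src_mean = mean_vector(source_reps)
    tgt_mean = mean_vector(target_reps)
    return [t - s for s, t in zip(src_mean, tgt_mean)]


def steer(rep: Vector, vector: Vector, strength: float = 1.0) -> Vector:
    """Shift one representation along the steering vector; no weights are updated."""
    return [r + strength * v for r, v in zip(rep, vector)]


# Toy 2-D "representations" from two time periods (made-up numbers).
source = [[1.0, 0.0], [3.0, 2.0]]
target = [[2.0, 4.0], [4.0, 6.0]]
v = steering_vector(source, target)
steered = steer([1.0, 1.0], v)
```

Only unlabeled representations are needed to build `v`, which is consistent with the unsupervised, weight-free adaptation the abstract describes.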

[LG-9] Feature Qualification by Deep Nets: A Constructive Approach

Link: https://arxiv.org/abs/2503.18676
Authors: Feilong Cao, Shao-Bo Lin
Subjects: Machine Learning (cs.LG)
Comments:

Abstract: The great success of deep learning has stimulated avid research activities in verifying the power of depth in theory, a common consensus of which is that deep nets are versatile in approximating and learning numerous functions. Such versatility certainly enhances the understanding of the power of depth, but makes it difficult to judge which data features are crucial in a specific learning task. This paper proposes a constructive approach to equip deep nets for the feature qualification purpose. Using the product-gate nature and localized approximation property of deep nets with sigmoid activation (deep sigmoid nets), we succeed in constructing a linear deep net operator that possesses optimal approximation performance in approximating smooth and radial functions. Furthermore, we provide theoretical evidence that the constructed deep net operator is capable of qualifying multiple features such as the smoothness and radialness of the target functions.

[LG-10] Geometric Preference Elicitation for Minimax Regret Optimization in Uncertainty Matroids

Link: https://arxiv.org/abs/2503.18668
Authors: Aditya Sai Ellendula, Arun K Pujari, Vikas Kumar
Subjects: Machine Learning (cs.LG)
Comments:

Abstract:This paper presents an efficient preference elicitation framework for uncertain matroid optimization, where precise weight information is unavailable, but insights into possible weight values are accessible. The core innovation of our approach lies in its ability to systematically elicit user preferences, aligning the optimization process more closely with decision-makers’ objectives. By incrementally querying preferences between pairs of elements, we iteratively refine the parametric uncertainty regions, leveraging the structural properties of matroids. Our method aims to achieve the exact optimum by reducing regret with a few elicitation rounds. Additionally, our approach avoids the computation of Minimax Regret and the use of Linear programming solvers at every iteration, unlike previous methods. Experimental results on four standard matroids demonstrate that our method reaches optimality more quickly and with fewer preference queries than existing techniques.

[LG-11] Adaptive Machine Learning for Resource-Constrained Environments KDD2024

Link: https://arxiv.org/abs/2503.18634
Authors: Sebastián A. Cajas Ordóñez, Jaydeep Samanta, Andrés L. Suárez-Cetrulo, Ricardo Simón Carbajo
Subjects: Machine Learning (cs.LG)
Comments: 17 pages, 11 figures, accepted at DELTA 2024 (Workshop on Discovering Drift Phenomena in Evolving Landscapes), co-located with ACM SIGKDD 2024. This preprint has not undergone peer review. The Version of Record is available at this https URL

Abstract:The Internet of Things is an example domain where data is perpetually generated in ever-increasing quantities, reflecting the proliferation of connected devices and the formation of continuous data streams over time. Consequently, the demand for ad-hoc, cost-effective machine learning solutions must adapt to this evolving data influx. This study tackles the task of offloading in small gateways, exacerbated by their dynamic availability over time. An approach leveraging CPU utilization metrics using online and continual machine learning techniques is proposed to predict gateway availability. These methods are compared to popular machine learning algorithms and a recent time-series foundation model, Lag-Llama, for fine-tuned and zero-shot setups. Their performance is benchmarked on a dataset of CPU utilization measurements over time from an IoT gateway and focuses on model metrics such as prediction errors, training and inference times, and memory consumption. Our primary objective is to study new efficient ways to predict CPU performance in IoT environments. Across various scenarios, our findings highlight that ensemble and online methods offer promising results for this task in terms of accuracy while maintaining a low resource footprint.

[LG-12] Oaken: Fast and Efficient LLM Serving with Online-Offline Hybrid KV Cache Quantization

Link: https://arxiv.org/abs/2503.18599
Authors: Minsu Kim, Seongmin Hong, RyeoWook Ko, Soongyu Choi, Hunjong Lee, Junsoo Kim, Joo-Young Kim, Jongse Park
Subjects: Hardware Architecture (cs.AR); Machine Learning (cs.LG)
Comments: 15 pages, 14 figures, and 4 tables

Abstract: Modern large language model (LLM) serving systems batch multiple requests to achieve high throughput, but batching attention operations is challenging, rendering memory bandwidth a critical bottleneck. The community relies on high-end GPUs with multiple high-bandwidth memory (HBM) channels. Unfortunately, HBM’s high bandwidth often comes at the expense of limited memory capacity, which reduces core utilization and increases costs. Recent advancements enabling longer contexts for LLMs have substantially increased the key-value (KV) cache size, further intensifying the pressure on memory capacity. The literature has explored KV cache quantization techniques, which commonly use low bitwidth for most values, selectively using higher bitwidth for outlier values. While this approach helps achieve high accuracy and low bitwidth simultaneously, it comes with the limitation that the cost of online outlier detection is excessively high, negating the advantages. We propose Oaken, an acceleration solution that achieves high accuracy and high performance simultaneously through co-designing algorithm and hardware. To effectively find a sweet spot in the accuracy-performance trade-off space of KV cache quantization, Oaken employs an online-offline hybrid approach, setting outlier thresholds offline, which are then used to determine the quantization scale online. To translate the proposed algorithmic technique into tangible performance gains, Oaken also comes with custom quantization engines and memory management units that can be integrated with any LLM accelerator. We built an Oaken accelerator on top of an LLM accelerator, LPU, and conducted a comprehensive evaluation. Our experiments show that for a batch size of 256, Oaken achieves up to 1.58x throughput improvement over NVIDIA A100 GPU, incurring a minimal accuracy loss of only 0.54% on average, compared to state-of-the-art KV cache quantization techniques.

[LG-13] A Universal Model Combining Differential Equations and Neural Networks for Ball Trajectory Prediction

Link: https://arxiv.org/abs/2503.18584
Authors: Zhiwei Shi, Chengxi Zhu, Fan Yang, Jun Yan, Zheyun Qin, Songquan Shi, Zhumin Chen
Subjects: Machine Learning (cs.LG)
Comments:

Abstract: This paper presents a data-driven universal ball trajectory prediction method integrated with physics equations. Existing methods are designed for specific ball types and struggle to generalize. This challenge arises from three key factors. First, learning-based models require large datasets but suffer from accuracy drops in unseen scenarios. Second, physics-based models rely on complex formulas and detailed inputs, yet accurately obtaining ball states, such as spin, is often impractical. Third, integrating physical principles with neural networks to achieve high accuracy, fast inference, and strong generalization remains difficult. To address these issues, we propose an innovative approach that incorporates physics-based equations and neural networks. We first derive three generalized physical formulas. Then, using a neural network and observed trajectory points, we infer certain parameters while fitting the remaining ones. These formulas enable precise trajectory prediction with minimal training data: only a few dozen samples. Extensive experiments demonstrate our method’s superiority in generalization, real-time performance, and accuracy.

[LG-14] Parental Guidance: Efficient Lifelong Learning through Evolutionary Distillation

Link: https://arxiv.org/abs/2503.18531
Authors: Octi Zhang, Quanquan Peng, Rosario Scalise, Bryon Boots
Subjects: Robotics (cs.RO); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
Comments: 4 pages, 3 figures, CoRL 2024 Workshop MAPoDeL

Abstract:Developing robotic agents that can perform well in diverse environments while showing a variety of behaviors is a key challenge in AI and robotics. Traditional reinforcement learning (RL) methods often create agents that specialize in narrow tasks, limiting their adaptability and diversity. To overcome this, we propose a preliminary, evolution-inspired framework that includes a reproduction module, similar to natural species reproduction, balancing diversity and specialization. By integrating RL, imitation learning (IL), and a coevolutionary agent-terrain curriculum, our system evolves agents continuously through complex tasks. This approach promotes adaptability, inheritance of useful traits, and continual learning. Agents not only refine inherited skills but also surpass their predecessors. Our initial experiments show that this method improves exploration efficiency and supports open-ended learning, offering a scalable solution where sparse reward coupled with diverse terrain environments induces a multi-task setting.

[LG-15] Global Convergence of Continual Learning on Non-IID Data

Link: https://arxiv.org/abs/2503.18511
Authors: Fei Zhu, Yujing Liu, Wenzhuo Liu, Zhaoxiang Zhang
Subjects: Machine Learning (cs.LG)
Comments: We establish the almost sure convergence results of continual learning under a general data condition

Abstract: Continual learning, which aims to learn multiple tasks sequentially, has gained extensive attention. However, most existing work focuses on empirical studies, and the theoretical aspect remains under-explored. Recently, a few investigations have considered the theory of continual learning only for linear regressions, establishing results based on the strict independent and identically distributed (i.i.d.) assumption and on persistent excitation of the feature data, conditions that may be difficult to verify or guarantee in practice. To overcome this fundamental limitation, in this paper, we provide a general and comprehensive theoretical analysis for continual learning of regression models. By utilizing the stochastic Lyapunov function and martingale estimation techniques, we establish the almost sure convergence results of continual learning under a general data condition for the first time. Additionally, without any excitation condition imposed on the data, the convergence rates for the forgetting and regret metrics are provided.

[LG-16] Deterministic Certification of Graph Neural Networks against Graph Poisoning Attacks with Arbitrary Perturbations CVPR2025

Link: https://arxiv.org/abs/2503.18503
Authors: Jiate Li, Meng Pang, Yun Dong, Binghui Wang
Subjects: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
Comments: Accepted at CVPR 2025

Abstract: Graph neural networks (GNNs) are becoming the de facto method to learn on graph data and have achieved state-of-the-art results on node and graph classification tasks. However, recent works show GNNs are vulnerable to training-time poisoning attacks: marginally perturbing edges, nodes, and/or node features of training graph(s) can largely degrade GNNs’ testing performance. Most previous defenses against graph poisoning attacks are empirical and are soon broken by adaptive or stronger attacks. A few provable defenses provide robustness guarantees, but have large gaps when applied in practice: 1) they restrict the attacker to only one type of perturbation; 2) they are designed for a particular GNN architecture or task; and 3) their robustness guarantees are not 100% accurate. In this work, we bridge all these gaps by developing PGNNCert, the first certified defense of GNNs against poisoning attacks under arbitrary (edge, node, and node feature) perturbations with deterministic robustness guarantees. Extensive evaluations on multiple node and graph classification datasets and GNNs demonstrate the effectiveness of PGNNCert to provably defend against arbitrary poisoning perturbations. PGNNCert is also shown to significantly outperform the state-of-the-art certified defenses against edge perturbation or node perturbation during GNN training.

[LG-17] Distributionally Robust Federated Learning: An ADMM Algorithm

Link: https://arxiv.org/abs/2503.18436
Authors: Wen Bai, Yi Wong, Xiao Qiao, Chin Pang Ho
Subjects: Machine Learning (cs.LG)
Comments:

Abstract:Federated learning (FL) aims to train machine learning (ML) models collaboratively using decentralized data, bypassing the need for centralized data aggregation. Standard FL models often assume that all data come from the same unknown distribution. However, in practical situations, decentralized data frequently exhibit heterogeneity. We propose a novel FL model, Distributionally Robust Federated Learning (DRFL), that applies distributionally robust optimization to overcome the challenges posed by data heterogeneity and distributional ambiguity. We derive a tractable reformulation for DRFL and develop a novel solution method based on the alternating direction method of multipliers (ADMM) algorithm to solve this problem. Our experimental results demonstrate that DRFL outperforms standard FL models under data heterogeneity and ambiguity.

[LG-18] Finite-Time Bounds for Two-Time-Scale Stochastic Approximation with Arbitrary Norm Contractions and Markovian Noise

Link: https://arxiv.org/abs/2503.18391
Authors: Siddharth Chandak, Shaan Ul Haque, Nicholas Bambos
Subjects: Machine Learning (cs.LG); Systems and Control (eess.SY); Optimization and Control (math.OC); Machine Learning (stat.ML)
Comments: Submitted to IEEE Conference on Decision and Control (CDC) 2025

Abstract: Two-time-scale Stochastic Approximation (SA) is an iterative algorithm with applications in reinforcement learning and optimization. Prior finite-time analysis of such algorithms has focused on fixed point iterations with mappings contractive under the Euclidean norm. Motivated by applications in reinforcement learning, we give the first mean square bound on nonlinear two-time-scale SA where the iterations have arbitrary norm contractive mappings and Markovian noise. We show that the mean square error decays at a rate of O(1/n^{2/3}) in the general case, and at a rate of O(1/n) in a special case where the slower timescale is noiseless. Our analysis uses the generalized Moreau envelope to handle the arbitrary norm contractions and solutions of the Poisson equation to deal with the Markovian noise. By analyzing the SSP Q-Learning algorithm, we give the first O(1/n) bound for an algorithm for asynchronous control of MDPs under the average reward criterion. We also obtain a rate of O(1/n) for Q-Learning with Polyak-averaging and provide an algorithm for learning Generalized Nash Equilibrium (GNE) for strongly monotone games which converges at a rate of O(1/n^{2/3}).

[LG-19] ALWNN Empowered Automatic Modulation Classification: Conquering Complexity and Scarce Sample Conditions

Link: https://arxiv.org/abs/2503.18375
Authors: Yunhao Quan, Chuang Gao, Nan Cheng, Zhijie Zhang, Zhisheng Yin, Wenchao Xu, Danyang Wang
Subjects: Machine Learning (cs.LG); Signal Processing (eess.SP)
Comments:

Abstract: In Automatic Modulation Classification (AMC), deep learning methods have shown remarkable performance, offering significant advantages over traditional approaches and demonstrating their vast potential. Nevertheless, they have notable drawbacks, particularly their high demands for storage, computational resources, and large-scale labeled data, which limit their practical application in real-world scenarios. To tackle this issue, this paper proposes an automatic modulation classification model based on the Adaptive Lightweight Wavelet Neural Network (ALWNN) and the few-shot framework (MALWNN). The ALWNN model, by integrating the adaptive wavelet neural network and depth separable convolution, reduces the number of model parameters and computational complexity. The MALWNN framework, using ALWNN as an encoder and incorporating prototype network technology, decreases the model’s dependence on the quantity of samples. Simulation results indicate that this model performs remarkably well on mainstream datasets. Moreover, in terms of Floating Point Operations Per Second (FLOPS) and Normalized Multiply-Accumulate Complexity (NMACC), ALWNN significantly reduces computational complexity compared to existing methods. This is further validated by real-world system tests on USRP and Raspberry Pi platforms. Experiments with MALWNN show its superior performance in few-shot learning scenarios compared to other algorithms.
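
The few-shot part of the abstract relies on "prototype network technology". The sketch below shows the generic nearest-prototype rule that such networks use, under the assumption that an encoder (ALWNN in the paper, not reproduced here) has already mapped each signal to an embedding. The class names and 2-D embeddings are made-up toy values.

```python
import math
from typing import Dict, List

Vector = List[float]


def build_prototypes(support: Dict[str, List[Vector]]) -> Dict[str, Vector]:
    """One prototype per class: the mean of that class's support embeddings."""
    return {
        label: [sum(column) / len(embeddings) for column in zip(*embeddings)]
        for label, embeddings in support.items()
    }


def classify(query: Vector, prototypes: Dict[str, Vector]) -> str:
    """Assign the query to the class whose prototype is nearest in Euclidean distance."""
    return min(prototypes, key=lambda label: math.dist(query, prototypes[label]))


# Two support embeddings per class (a 2-way, 2-shot toy episode).
support_set = {
    "BPSK": [[0.0, 0.0], [0.0, 2.0]],
    "QPSK": [[4.0, 4.0], [6.0, 4.0]],
}
protos = build_prototypes(support_set)
prediction = classify([1.0, 1.0], protos)
```

Because each class is summarized by a single mean vector, adding a new modulation type only requires a handful of labeled embeddings rather than retraining, which is the sample-efficiency argument the abstract makes.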

[LG-20] Improved Rates of Differentially Private Nonconvex-Strongly-Concave Minimax Optimization AAAI2025

Link: https://arxiv.org/abs/2503.18317
Authors: Ruijia Zhang, Mingxi Lei, Meng Ding, Zihang Xiang, Jinhui Xu, Di Wang
Subjects: Machine Learning (cs.LG)
Comments: Published in AAAI 2025

Abstract: In this paper, we study the problem of (finite sum) minimax optimization in the Differential Privacy (DP) model. Unlike most of the previous studies on the (strongly) convex-concave settings or loss functions satisfying the Polyak-Lojasiewicz condition, here we mainly focus on the nonconvex-strongly-concave one, which encapsulates many models in deep learning such as deep AUC maximization. Specifically, we first analyze a DP version of Stochastic Gradient Descent Ascent (SGDA) and show that it is possible to get a DP estimator whose $\ell_2$-norm of the gradient for the empirical risk function is upper bounded by $\tilde{O}\left(\frac{d^{1/4}}{(n\epsilon)^{1/2}}\right)$, where $d$ is the model dimension and $n$ is the sample size. We then propose a new method with less gradient noise variance and improve the upper bound to $\tilde{O}\left(\frac{d^{1/3}}{(n\epsilon)^{2/3}}\right)$, which matches the best-known result for DP Empirical Risk Minimization with non-convex loss. We also discuss several lower bounds of private minimax optimization. Finally, experiments on AUC maximization, generative adversarial networks, and temporal difference learning with real-world data support our theoretical analysis.
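
The abstract does not spell out its DP version of SGDA. The sketch below is a generic illustration, assuming the usual Gaussian-mechanism recipe: clip each per-example gradient, add Gaussian noise to the sum, then take a descent step on the min variable and an ascent step on the max variable. All names are hypothetical, and calibrating `noise_std` to a target (epsilon, delta) is deliberately left out.

```python
import math
import random
from typing import List, Tuple

Vector = List[float]


def clip(grad: Vector, max_norm: float) -> Vector:
    """Scale the gradient so that its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm <= max_norm or norm == 0.0:
        return list(grad)
    return [g * max_norm / norm for g in grad]


def private_mean_gradient(per_example_grads: List[Vector], max_norm: float,
                          noise_std: float, rng: random.Random) -> Vector:
    """Clip each example's gradient, sum, add Gaussian noise per coordinate, average."""
    clipped = [clip(g, max_norm) for g in per_example_grads]
    n = len(clipped)
    summed = [sum(column) for column in zip(*clipped)]
    return [(s + rng.gauss(0.0, noise_std)) / n for s in summed]


def sgda_step(x: Vector, y: Vector, grads_x: List[Vector], grads_y: List[Vector],
              lr_x: float, lr_y: float, max_norm: float, noise_std: float,
              rng: random.Random) -> Tuple[Vector, Vector]:
    """One noisy step: gradient descent on x (min player), gradient ascent on y (max player)."""
    gx = private_mean_gradient(grads_x, max_norm, noise_std, rng)
    gy = private_mean_gradient(grads_y, max_norm, noise_std, rng)
    new_x = [xi - lr_x * g for xi, g in zip(x, gx)]
    new_y = [yi + lr_y * g for yi, g in zip(y, gy)]
    return new_x, new_y
```

Clipping bounds each example's influence on the update, which is what lets the added noise be scaled to a fixed sensitivity. The paper's second method is described only as having "less gradient noise variance", so it is not sketched here.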

[LG-21] Byzantine-Resilient Over-the-Air Federated Learning under Zero-Trust Architecture

Link: https://arxiv.org/abs/2503.18284
Authors: Jiacheng Yao, Wei Shi, Wei Xu, Zhaohui Yang, A. Lee Swindlehurst, Dusit Niyato
Subjects: Information Theory (cs.IT); Machine Learning (cs.LG)
Comments: Accepted by IEEE JSAC

Abstract:Over-the-air computation (AirComp) has emerged as an essential approach for enabling communication-efficient federated learning (FL) over wireless networks. Nonetheless, the inherent analog transmission mechanism in AirComp-based FL (AirFL) intensifies challenges posed by potential Byzantine attacks. In this paper, we propose a novel Byzantine-robust FL paradigm for over-the-air transmissions, referred to as federated learning with secure adaptive clustering (FedSAC). FedSAC aims to protect a portion of the devices from attacks through zero trust architecture (ZTA) based Byzantine identification and adaptive device clustering. By conducting a one-step convergence analysis, we theoretically characterize the convergence behavior with different device clustering mechanisms and uneven aggregation weighting factors for each device. Building upon our analytical results, we formulate a joint optimization problem for the clustering and weighting factors in each communication round. To facilitate the targeted optimization, we propose a dynamic Byzantine identification method using historical reputation based on ZTA. Furthermore, we introduce a sequential clustering method, transforming the joint optimization into a weighting optimization problem without sacrificing the optimality. To optimize the weighting, we capitalize on the penalty convex-concave procedure (P-CCP) to obtain a stationary solution. Numerical results substantiate the superiority of the proposed FedSAC over existing methods in terms of both test accuracy and convergence rate.

[LG-22] Analyzing Islamophobic Discourse Using Semi-Coded Terms and LLMs

Link: https://arxiv.org/abs/2503.18273
Authors: Raza Ul Mustafa, Roi Dupart, Gabrielle Smith, Noman Ashraf, Nathalie Japkowicz
Subjects: Machine Learning (cs.LG)
Comments:

Abstract: Islamophobia started evolving into a global phenomenon by attracting followers across the globe, particularly in Western societies. Thus, understanding Islamophobia’s global spread and online dissemination is crucial. This paper performs a large-scale analysis of specialized, semi-coded Islamophobic terms such as (muzrat, pislam, mudslime, mohammedan, muzzies) floated on extremist social platforms, i.e., 4Chan, Gab, Telegram, etc. First, we use large language models (LLMs) to show their ability to understand these terms. Second, using Google Perspective API, we also find that Islamophobic text is more toxic compared to other kinds of hate speech. Finally, we use a BERT topic modeling approach to extract different topics and Islamophobic discourse on these social platforms. Our findings indicate that LLMs understand these Out-Of-Vocabulary (OOV) slurs; however, measures are still required to control such discourse. Our topic modeling also indicates that Islamophobic text is found across various political, conspiratorial, and far-right movements and is particularly directed against Muslim immigrants. Taken together, we performed the first study on Islamophobic semi-coded terms and shed a global light on Islamophobia.

[LG-23] PNN: A Novel Progressive Neural Network for Fault Classification in Rotating Machinery under Small Dataset Constraint

Link: https://arxiv.org/abs/2503.18263
Authors: Praveen Chopra, Himanshu Kumar, Sandeep Yadav
Subjects: Machine Learning (cs.LG)
Comments:

Abstract: Fault detection in rotating machinery is a complex task, particularly in small and heterogeneous dataset scenarios. Variability in sensor placement, machinery configurations, and structural differences further increase the complexity of the problem. Conventional deep learning approaches often demand large, homogeneous datasets, limiting their applicability in data-scarce industrial environments. While transfer learning and few-shot learning have shown potential, they are often constrained by the need for extensive fault datasets. This research introduces a unified framework leveraging a novel progressive neural network (PNN) architecture designed to address these challenges. The PNN sequentially estimates the fixed-size refined features of the higher order with the help of all previously estimated features and appends them to the feature set. This fixed-size feature output at each layer controls the complexity of the PNN and makes it suitable for effective learning from small datasets. The framework’s effectiveness is validated on eight datasets, including six open-source datasets, one in-house fault simulator, and one real-world industrial dataset. The PNN achieves state-of-the-art performance in fault detection across varying dataset sizes and machinery types, highlighting superior generalization and classification capabilities.

[LG-24] DiffGED: Computing Graph Edit Distance via Diffusion-based Graph Matching

Link: https://arxiv.org/abs/2503.18245
Authors: Wei Huang, Hanchen Wang, Dong Wen, Wenjie Zhang, Ying Zhang, Xuemin Lin
Subjects: Machine Learning (cs.LG)
Comments:

Abstract: The Graph Edit Distance (GED) problem, which aims to compute the minimum number of edit operations required to transform one graph into another, is a fundamental challenge in graph analysis with wide-ranging applications. However, due to its NP-hard nature, traditional A* approaches often suffer from scalability issues, making them computationally intractable for large graphs. Many recent deep learning frameworks address GED by formulating it as a regression task, which, while efficient, fails to recover the edit path, a central interest in GED. Furthermore, recent hybrid approaches that combine deep learning with traditional methods to recover the edit path often yield poor solution quality. These methods also struggle to generate candidate solutions in parallel, resulting in increased running time. In this paper, we present a novel approach, DiffGED, that leverages a generative diffusion model to solve GED and recover the corresponding edit path. Specifically, we first generate multiple diverse node matching matrices in parallel through a diffusion-based graph matching model. Next, node mappings are extracted from each generated matching matrix in parallel, and each extracted node mapping can be simply transformed into an edit path. Benefiting from the generative diversity provided by the diffusion model, DiffGED is less likely to fall into local sub-optimal solutions, thereby achieving superior overall solution quality close to the exact solution. Experimental results on real-world datasets demonstrate that DiffGED can generate multiple diverse edit paths with exceptionally high accuracy comparable to exact solutions while maintaining a running time shorter than most hybrid approaches.
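
The step "each extracted node mapping can be simply transformed into an edit path" can be illustrated with a small sketch. It assumes undirected, node-labeled simple graphs, unit cost for every edit, and a source graph with no more nodes than the target, so no node deletions are needed. These conventions and the function name are assumptions for illustration, not the paper's code.

```python
from typing import Dict, Hashable, Iterable, Tuple

Node = Hashable
Edge = Tuple[Node, Node]


def edit_cost(labels1: Dict[Node, str], edges1: Iterable[Edge],
              labels2: Dict[Node, str], edges2: Iterable[Edge],
              mapping: Dict[Node, Node]) -> int:
    """Number of unit-cost edits induced by an injective mapping from G1's nodes to G2's."""
    # The mapping must cover every G1 node and hit distinct G2 nodes.
    assert set(mapping) == set(labels1)
    assert len(set(mapping.values())) == len(mapping)

    relabels = sum(labels1[u] != labels2[mapping[u]] for u in labels1)
    node_insertions = len(labels2) - len(labels1)  # G2 nodes that nothing maps to

    mapped_edges = {frozenset((mapping[u], mapping[v])) for u, v in edges1}
    target_edges = {frozenset(edge) for edge in edges2}
    edge_deletions = len(mapped_edges - target_edges)   # G1 edges with no counterpart
    edge_insertions = len(target_edges - mapped_edges)  # G2 edges not yet covered
    return relabels + node_insertions + edge_deletions + edge_insertions


# G1 is a labeled path on 3 nodes; G2 is a labeled path on 4 nodes.
g1_labels = {0: "A", 1: "B", 2: "C"}
g1_edges = [(0, 1), (1, 2)]
g2_labels = {0: "A", 1: "B", 2: "D", 3: "C"}
g2_edges = [(0, 1), (1, 2), (2, 3)]

# Two candidate mappings, standing in for mappings read off sampled matching matrices.
candidates = [{0: 0, 1: 1, 2: 2}, {0: 0, 1: 1, 2: 3}]
costs = [edit_cost(g1_labels, g1_edges, g2_labels, g2_edges, m) for m in candidates]
best = min(costs)
```

Every valid mapping yields an upper bound on the true GED, so taking the minimum over many diverse candidates can only tighten the estimate. This is why generating several mappings in parallel helps solution quality.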

[LG-25] Enhance GNNs with Reliable Confidence Estimation via Adversarial Calibration Learning

Link: https://arxiv.org/abs/2503.18235
Authors: Yilong Wang, Jiahao Zhang, Tianxiang Zhao, Suhang Wang
Subjects: Machine Learning (cs.LG)
Comments:

Abstract: Despite their impressive predictive performance, GNNs often exhibit poor confidence calibration, i.e., their predicted confidence scores do not accurately reflect true correctness likelihood. This issue raises concerns about their reliability in high-stakes domains such as fraud detection and risk assessment, where well-calibrated predictions are essential for decision-making. To ensure trustworthy predictions, several GNN calibration methods have been proposed. Though they can improve global calibration, our experiments reveal that they often fail to generalize across different node groups, leading to inaccurate confidence in node groups with different degree levels, classes, and local structures. In certain cases, they even degrade calibration compared to the original uncalibrated GNN. To address this challenge, we propose a novel AdvCali framework that adaptively enhances calibration across different node groups. Our method leverages adversarial training to automatically identify mis-calibrated node groups and applies a differentiable Group Expected Calibration Error (ECE) loss term to refine confidence estimation within these groups. This allows the model to dynamically adjust its calibration strategy without relying on dataset-specific prior knowledge about miscalibrated subgroups. Extensive experiments on real-world datasets demonstrate that our approach not only improves global calibration but also significantly enhances calibration within groups defined by feature similarity, topology, and connectivity, outperforming previous methods and demonstrating its effectiveness in practical scenarios.
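
To make the gap between global and group-wise calibration concrete, the sketch below computes the standard binned Expected Calibration Error and then repeats it per node group. It is the ordinary, non-differentiable metric with equal-width bins, not the differentiable Group ECE loss from the paper. The group names and toy numbers are invented for illustration.

```python
from typing import Dict, List, Sequence


def expected_calibration_error(confidences: Sequence[float],
                               correct: Sequence[bool],
                               n_bins: int = 10) -> float:
    """Weighted average over bins of |accuracy - mean confidence| (equal-width bins)."""
    n = len(confidences)
    if n == 0:
        return 0.0
    bins: List[List[tuple]] = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        index = min(int(conf * n_bins), n_bins - 1)  # confidence 1.0 goes in the last bin
        bins[index].append((conf, ok))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(conf for conf, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += len(members) / n * abs(accuracy - avg_conf)
    return ece


def group_ece(confidences: Sequence[float], correct: Sequence[bool],
              groups: Sequence[str], n_bins: int = 10) -> Dict[str, float]:
    """ECE computed separately for each group label (for example, a node-degree bucket)."""
    by_group: Dict[str, tuple] = {}
    for conf, ok, group in zip(confidences, correct, groups):
        confs, oks = by_group.setdefault(group, ([], []))
        confs.append(conf)
        oks.append(ok)
    return {g: expected_calibration_error(c, k, n_bins) for g, (c, k) in by_group.items()}


confs = [0.95, 0.95, 0.65, 0.65]
hits = [True, True, True, False]
node_groups = ["high_degree", "high_degree", "low_degree", "low_degree"]
overall = expected_calibration_error(confs, hits)
per_group = group_ece(confs, hits, node_groups)
```

In this toy data the low-degree group is noticeably worse calibrated than the overall figure suggests, which is the kind of subgroup gap the abstract says global calibration methods can miss.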

[LG-26] KEA: Keeping Exploration Alive by Proactively Coordinating Exploration Strategies

Link: https://arxiv.org/abs/2503.18234
Authors: Shih-Min Yang, Martin Magnusson, Johannes A. Stork, Todor Stoyanov
Subjects: Machine Learning (cs.LG)
Comments:

Abstract:Soft Actor-Critic (SAC) has achieved notable success in continuous control tasks but struggles in sparse reward settings, where infrequent rewards make efficient exploration challenging. While novelty-based exploration methods address this issue by encouraging the agent to explore novel states, they are not trivial to apply to SAC. In particular, managing the interaction between novelty-based exploration and SAC’s stochastic policy can lead to inefficient exploration and redundant sample collection. In this paper, we propose KEA (Keeping Exploration Alive) which tackles the inefficiencies in balancing exploration strategies when combining SAC with novelty-based exploration. KEA introduces an additional co-behavior agent that works alongside SAC and a switching mechanism to facilitate proactive coordination between exploration strategies from novelty-based exploration and stochastic policy. This coordination allows the agent to maintain stochasticity in high-novelty regions, enhancing exploration efficiency and reducing repeated sample collection. We first analyze this potential issue in a 2D navigation task and then evaluate KEA on sparse reward control tasks from the DeepMind Control Suite. Compared to state-of-the-art novelty-based exploration baselines, our experiments show that KEA significantly improves learning efficiency and robustness in sparse reward setups.

[LG-27] A Framework for Finding Local Saddle Points in Two-Player Zero-Sum Black-Box Games

链接: https://arxiv.org/abs/2503.18224
作者: Shubhankar Agarwal,Hamzah I. Khan,Sandeep P. Chinchali,David Fridovich-Keil
类目: Machine Learning (cs.LG); Computer Science and Game Theory (cs.GT)
*备注:

点击查看摘要

Abstract:Saddle point optimization is a critical problem employed in numerous real-world applications, including portfolio optimization, generative adversarial networks, and robotics. It has been extensively studied in cases where the objective function is known and differentiable. Existing work in black-box settings with unknown objectives that can only be sampled either assumes convexity-concavity in the objective to simplify the problem or operates with noisy gradient estimators. In contrast, we introduce a framework inspired by Bayesian optimization which utilizes Gaussian processes to model the unknown (potentially nonconvex-nonconcave) objective and requires only zeroth-order samples. Our approach frames the saddle point optimization problem as a two-level process which can flexibly integrate existing and novel approaches to this problem. The upper level of our framework produces a model of the objective function by sampling in promising locations, and the lower level of our framework uses the existing model to frame and solve a general-sum game to identify locations to sample. This lower level procedure can be designed in complementary ways, and we demonstrate the flexibility of our approach by introducing variants which appropriately trade off between factors like runtime, the cost of function evaluations, and the number of available initial samples. We experimentally demonstrate these algorithms on synthetic and realistic datasets in black-box nonconvex-nonconcave settings, showcasing their ability to efficiently locate local saddle points in these contexts.

[LG-28] Theory-to-Practice Gap for Neural Networks and Neural Operators

链接: https://arxiv.org/abs/2503.18219
作者: Philipp Grohs,Samuel Lanthaler,Margaret Trautner
类目: Machine Learning (cs.LG); Functional Analysis (math.FA)
*备注:

点击查看摘要

Abstract:This work studies the sampling complexity of learning with ReLU neural networks and neural operators. For mappings belonging to relevant approximation spaces, we derive upper bounds on the best-possible convergence rate of any learning algorithm, with respect to the number of samples. In the finite-dimensional case, these bounds imply a gap between the parametric and sampling complexities of learning, known as the theory-to-practice gap. In this work, a unified treatment of the theory-to-practice gap is achieved in a general L^p-setting, while at the same time improving available bounds in the literature. Furthermore, based on these results the theory-to-practice gap is extended to the infinite-dimensional setting of operator learning. Our results apply to Deep Operator Networks and integral kernel-based neural operators, including the Fourier neural operator. We show that the best-possible convergence rate in a Bochner L^p-norm is bounded by Monte-Carlo rates of order 1/p.

[LG-29] Adaptive Rank Allocation: Speeding Up Modern Transformers with RaNA Adapters ICLR2025

链接: https://arxiv.org/abs/2503.18216
作者: Roberto Garcia,Jerry Liu,Daniel Sorvisto,Sabri Eyuboglu
类目: Machine Learning (cs.LG)
*备注: 16 pages, 5 figures. ICLR 2025

点击查看摘要

Abstract:Large Language Models (LLMs) are computationally intensive, particularly during inference. Neuron-adaptive techniques, which selectively activate neurons in Multi-Layer Perceptron (MLP) layers, offer some speedups but suffer from limitations in modern Transformers. These include reliance on sparse activations, incompatibility with attention layers, and the use of costly neuron masking techniques. To address these issues, we propose the Adaptive Rank Allocation framework and introduce the Rank and Neuron Allocator (RaNA) adapter. RaNA adapters leverage rank adapters, which operate on linear layers by applying both low-rank matrix decompositions and adaptive masking to efficiently allocate compute without depending on activation sparsity. This enables RaNA to be generally applied to MLPs and linear components of attention modules, while eliminating the need for expensive maskers found in neuron-adaptive methods. Notably, when compared to neuron adapters, RaNA improves perplexity by up to 7 points and increases accuracy by up to 8 percentage points when reducing FLOPs by approximately 44% in state-of-the-art Transformer architectures. These results position RaNA as a robust solution for improving inference efficiency in modern Transformer architectures.
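
The general idea of combining a low-rank decomposition with input-dependent masking can be illustrated with a small sketch. This is not the RaNA adapter: the abstract does not specify how ranks are selected, so the SVD factorization, the top-magnitude selection rule, and the names `low_rank_factors` and `masked_low_rank_apply` are all assumptions chosen only to make the concept concrete.

```python
import numpy as np

def low_rank_factors(W, rank):
    """Factor W (out x in) as A @ B using a truncated SVD."""
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :rank]                    # (out, rank), orthonormal columns
    B = S[:rank, None] * Vt[:rank]     # (rank, in), singular values folded in
    return A, B

def masked_low_rank_apply(A, B, x, keep):
    """Approximate W @ x using only the `keep` strongest rank components for this x."""
    z = B @ x                                  # activations in rank space
    idx = np.argsort(-np.abs(z))[:keep]        # input-dependent choice of components
    mask = np.zeros_like(z)
    mask[idx] = 1.0
    return A @ (z * mask)                      # masked components contribute nothing
```

Because the columns of `A` are orthonormal, the magnitude of each entry of `z` equals the norm of that component's contribution to the output, so dropping the smallest entries discards the least important terms for the given input. A practical implementation would skip the masked columns entirely rather than multiply by zero.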

[LG-30] Iterative Multi-Agent Reinforcement Learning: A Novel Approach Toward Real-World Multi-Echelon Inventory Optimization

链接: https://arxiv.org/abs/2503.18201
作者: Georg Ziegner,Michael Choi,Hung Mac Chan Le,Sahil Sakhuja,Arash Sarmadi
类目: Machine Learning (cs.LG)
*备注: A Capstone Report in the Field of Data Science for the Degree of Master of Liberal Arts in Extension Studies - Harvard University

点击查看摘要

Abstract:Multi-echelon inventory optimization (MEIO) is critical for effective supply chain management, but its inherent complexity can pose significant challenges. Heuristics are commonly used to address this complexity, yet they often face limitations in scope and scalability. Recent research has found deep reinforcement learning (DRL) to be a promising alternative to traditional heuristics, offering greater versatility by utilizing dynamic decision-making capabilities. However, since DRL is known to struggle with the curse of dimensionality, its relevance to complex real-life supply chain scenarios is still to be determined. This thesis investigates DRL’s applicability to MEIO problems of increasing complexity. A state-of-the-art DRL model was replicated, enhanced, and tested across 13 supply chain scenarios, combining diverse network structures and parameters. To address DRL’s challenges with dimensionality, additional models leveraging graph neural networks (GNNs) and multi-agent reinforcement learning (MARL) were developed, culminating in the novel iterative multi-agent reinforcement learning (IMARL) approach. IMARL demonstrated superior scalability, effectiveness, and reliability in optimizing inventory policies, consistently outperforming benchmarks. These findings confirm the potential of DRL, particularly IMARL, to address real-world supply chain challenges and call for additional research to further expand its applicability.

[LG-31] Shapley-Guided Utility Learning for Effective Graph Inference Data Valuation

链接: https://arxiv.org/abs/2503.18195
作者: Hongliang Chi,Qiong Wu,Zhengyi Zhou,Yao Ma
类目: Machine Learning (cs.LG); Computer Science and Game Theory (cs.GT)
*备注:

点击查看摘要

Abstract:Graph Neural Networks (GNNs) have demonstrated remarkable performance in various graph-based machine learning tasks, yet evaluating the importance of neighbors of testing nodes remains largely unexplored due to the challenge of assessing data importance without test labels. To address this gap, we propose Shapley-Guided Utility Learning (SGUL), a novel framework for graph inference data valuation. SGUL innovatively combines transferable data-specific and model-specific features to approximate test accuracy without relying on ground truth labels. By incorporating Shapley values as a preprocessing step and using feature Shapley values as input, our method enables direct optimization of Shapley value prediction while reducing computational demands. SGUL overcomes key limitations of existing methods, including poor generalization to unseen test-time structures and indirect optimization. Experiments on diverse graph datasets demonstrate that SGUL consistently outperforms existing baselines in both inductive and transductive settings. SGUL offers an effective, efficient, and interpretable approach for quantifying the value of test-time neighbors.
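
As background for the Shapley values mentioned above, here is a minimal sketch of the exact Shapley computation for a small cooperative game. It is not SGUL: in the paper's setting the players would be test-time neighbors and the utility a learned proxy for accuracy, whereas the `utility` callable and the function name `shapley_values` below are placeholders assumed for illustration.

```python
from itertools import permutations

def shapley_values(players, utility):
    """Exact Shapley values: average marginal contribution over all orderings.

    `utility` maps a frozenset of players to a number. The cost grows
    factorially with the number of players, so this only suits tiny games.
    """
    players = list(players)
    totals = {p: 0.0 for p in players}
    orderings = list(permutations(players))
    for order in orderings:
        coalition = frozenset()
        for p in order:
            with_p = coalition | {p}
            totals[p] += utility(with_p) - utility(coalition)
            coalition = with_p
    return {p: total / len(orderings) for p, total in totals.items()}
```

The factorial cost of this exact computation is one reason practical data valuation methods rely on sampling or learned approximations.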

[LG-32] Causality-Aware Next Location Prediction Framework based on Human Mobility Stratification

链接: https://arxiv.org/abs/2503.18179
作者: Xiaojie Yang,Zipei Fan,Hangli Ge,Takashi Michikata,Ryosuke Shibasaki,Noboru Koshizuka
类目: Machine Learning (cs.LG); Information Retrieval (cs.IR)
*备注: Accepted by IEEE UIC 2024

点击查看摘要

Abstract:Human mobility data are fused with multiple travel patterns and hidden spatiotemporal patterns are extracted by integrating user, location, and time information to improve next location prediction accuracy. In existing next location prediction methods, different causal relationships that result from patterns in human mobility data are ignored, which leads to confounding information that can have a negative effect on predictions. Therefore, this study introduces a causality-aware framework for next location prediction, focusing on human mobility stratification for travel patterns. In our research, a novel causal graph is developed that describes the relationships between various input variables. We use counterfactuals to enhance the indirect effects in our causal graph for specific travel patterns: non-anchor targeted travels. The proposed framework is designed as a plug-and-play module that integrates multiple next location prediction paradigms. We tested our proposed framework using several state-of-the-art models and human mobility datasets, and the results reveal that the proposed module improves the prediction performance. In addition, we provide results from the ablation study and quantitative study to demonstrate the soundness of our causal graph and its ability to further enhance the interpretability of the current next location prediction models.

[LG-33] Enhancing Software Vulnerability Detection Using Code Property Graphs and Convolutional Neural Networks

链接: https://arxiv.org/abs/2503.18175
作者: Amanpreet Singh Saimbhi
类目: Software Engineering (cs.SE); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The increasing complexity of modern software systems has led to a rise in vulnerabilities that malicious actors can exploit. Traditional methods of vulnerability detection, such as static and dynamic analysis, have limitations in scalability and automation. This paper proposes a novel approach to detecting software vulnerabilities using a combination of code property graphs and machine learning techniques. By leveraging code property graphs, which integrate abstract syntax trees, control flow graphs, and program dependency graphs, we achieve a detailed representation of software code that enhances the accuracy and granularity of vulnerability detection. We introduce various neural network models, including convolutional neural networks adapted for graph data, to process these representations. Our approach provides a scalable and automated solution for vulnerability detection, addressing the shortcomings of existing methods. We also present a newly generated dataset labeled with function-level vulnerability types sourced from open-source repositories. Our contributions include a methodology for transforming software code into code property graphs, the implementation of a convolutional neural network model for graph data, and the creation of a comprehensive dataset for training and evaluation. This work lays the foundation for more effective and efficient vulnerability detection in complex software systems.

[LG-34] Machine learning based animal emotion classification using audio signals

链接: https://arxiv.org/abs/2503.18138
作者: Mariia Slobodian,Mykola Kozlenko
类目: Sound (cs.SD); Machine Learning (cs.LG)
*备注: 5 pages, 3 figures. This paper was originally published in 2022 International Conference on Innovative Solutions in Software Engineering (ICISSE), available: this https URL

点击查看摘要

Abstract:This paper presents the machine learning approach to the automated classification of a dog’s emotional state based on the processing and recognition of audio signals. It offers helpful information for improving human-machine interfaces and developing more precise tools for classifying emotions from acoustic data. The presented model demonstrates an overall accuracy value above 70% for audio signals recorded for one dog.

[LG-35] Feature Learning beyond the Lazy-Rich Dichotomy: Insights from Representational Geometry

链接: https://arxiv.org/abs/2503.18114
作者: Chi-Ning Chou,Hang Le,Yichen Wang,SueYeon Chung
类目: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE); Neurons and Cognition (q-bio.NC)
*备注:

点击查看摘要

Abstract:The ability to integrate task-relevant information into neural representations is a fundamental aspect of both biological and artificial intelligence. To enable theoretical analysis, recent work has examined whether a network learns task-relevant features (rich learning) or resembles a random feature model (or a kernel machine, i.e., lazy learning). However, this simple lazy-versus-rich dichotomy overlooks the possibility of various subtypes of feature learning that emerge from different architectures, learning rules, and data properties. Furthermore, most existing approaches emphasize weight matrices or neural tangent kernels, limiting their applicability to neuroscience because they do not explicitly characterize representations. In this work, we introduce an analysis framework based on representational geometry to study feature learning. Instead of analyzing what are the learned features, we focus on characterizing how task-relevant representational manifolds evolve during the learning process. In both theory and experiment, we find that when a network learns features useful for solving a task, the task-relevant manifolds become increasingly untangled. Moreover, by tracking changes in the underlying manifold geometry, we uncover distinct learning stages throughout training, as well as different learning strategies associated with training hyperparameters, uncovering subtypes of feature learning beyond the lazy-versus-rich dichotomy. Applying our method to neuroscience and machine learning, we gain geometric insights into the structural inductive biases of neural circuits solving cognitive tasks and the mechanisms underlying out-of-distribution generalization in image classification. Our framework provides a novel geometric perspective for understanding and quantifying feature learning in both artificial and biological neural networks. 

[LG-36] HyperNOs: Automated and Parallel Library for Neural Operators Research

链接: https://arxiv.org/abs/2503.18087
作者: Massimiliano Ghiotto
类目: Machine Learning (cs.LG); Numerical Analysis (math.NA)
*备注: 25 pages, 11 figures

点击查看摘要

Abstract:This paper introduces HyperNOs, a PyTorch library designed to streamline and automate the process of exploring neural operators, with a special focus on hyperparameter optimization for comprehensive and exhaustive exploration. Indeed, HyperNOs takes advantage of state-of-the-art optimization algorithms and parallel computing implemented in the Ray-tune library to efficiently explore the hyperparameter space of neural operators. We also implement many useful functionalities for studying neural operators with a user-friendly interface, such as the possibility to train the model with a fixed number of parameters or to train the model with multiple datasets and different resolutions. We integrate Fourier neural operators and convolutional neural operators in our library, achieving state of the art results on many representative benchmarks, demonstrating the capabilities of HyperNOs to handle real datasets and modern architectures. The library is designed to be easy to use with the provided model and datasets, but also to be easily extended to use new datasets and custom neural operator architectures.

[LG-37] Model-Guardian: Protecting against Data-Free Model Stealing Using Gradient Representations and Deceptive Predictions ICME2025

链接: https://arxiv.org/abs/2503.18081
作者: Yunfei Yang,Xiaojun Chen,Yuexin Xuan,Zhendong Zhao
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注: Full version of the paper accepted by ICME 2025

点击查看摘要

Abstract:Model stealing attack is increasingly threatening the confidentiality of machine learning models deployed in the cloud. Recent studies reveal that adversaries can exploit data synthesis techniques to steal machine learning models even in scenarios devoid of real data, leading to data-free model stealing attacks. Existing defenses against such attacks suffer from limitations, including poor effectiveness, insufficient generalization ability, and low comprehensiveness. In response, this paper introduces a novel defense framework named Model-Guardian. Comprising two components, Data-Free Model Stealing Detector (DFMS-Detector) and Deceptive Predictions (DPreds), Model-Guardian is designed to address the shortcomings of current defenses with the help of the artifact properties of synthetic samples and gradient representations of samples. Extensive experiments on seven prevalent data-free model stealing attacks showcase the effectiveness and superior generalization ability of Model-Guardian, outperforming eleven defense methods and establishing a new state-of-the-art performance. Notably, this work pioneers the utilization of various GANs and diffusion models for generating highly realistic query samples in attacks, with Model-Guardian demonstrating accurate detection capabilities.

[LG-38] Self-Explaining Neural Networks for Business Process Monitoring

链接: https://arxiv.org/abs/2503.18067
作者: Shahaf Bassan,Shlomit Gur,Sergey Zeltyn,Konstantinos Mavrogiorgos,Ron Eliav,Dimosthenis Kyriazis
类目: Machine Learning (cs.LG)
*备注: To appear in ICSBT 2025

点击查看摘要

Abstract:Tasks in Predictive Business Process Monitoring (PBPM), such as Next Activity Prediction, focus on generating useful business predictions from historical case logs. Recently, Deep Learning methods, particularly sequence-to-sequence models like Long Short-Term Memory (LSTM), have become a dominant approach for tackling these tasks. However, to enhance model transparency, build trust in the predictions, and gain a deeper understanding of business processes, it is crucial to explain the decisions made by these models. Existing explainability methods for PBPM decisions are typically post-hoc, meaning they provide explanations only after the model has been trained. Unfortunately, these post-hoc approaches have been shown to face various challenges, including lack of faithfulness, high computational costs and a significant sensitivity to out-of-distribution samples. In this work, we introduce, to the best of our knowledge, the first self-explaining neural network architecture for predictive process monitoring. Our framework trains an LSTM model that not only provides predictions but also outputs a concise explanation for each prediction, while adapting the optimization objective to improve the reliability of the explanation. We first demonstrate that incorporating explainability into the training process does not hurt model performance, and in some cases, actually improves it. Additionally, we show that our method outperforms post-hoc approaches in terms of both the faithfulness of the generated explanations and substantial improvements in efficiency.

[LG-39] Reinforcement Learning-based Self-adaptive Differential Evolution through Automated Landscape Feature Learning GECCO2025

链接: https://arxiv.org/abs/2503.18061
作者: Hongshu Guo,Sijie Ma,Zechuan Huang,Yuzhi Hu,Zeyuan Ma,Xinglin Zhang,Yue-Jiao Gong
类目: Neural and Evolutionary Computing (cs.NE); Machine Learning (cs.LG)
*备注: Accepted as full paper at ACM GECCO 2025

点击查看摘要

Abstract:Recently, Meta-Black-Box-Optimization (MetaBBO) methods significantly enhance the performance of traditional black-box optimizers through meta-learning flexible and generalizable meta-level policies that excel in dynamic algorithm configuration (DAC) tasks within the low-level optimization, reducing the expertise required to adapt optimizers for novel optimization tasks. Though promising, existing MetaBBO methods heavily rely on human-crafted feature extraction approach to secure learning effectiveness. To address this issue, this paper introduces a novel MetaBBO method that supports automated feature learning during the meta-learning process, termed as RLDE-AFL, which integrates a learnable feature extraction module into a reinforcement learning-based DE method to learn both the feature encoding and meta-level policy. Specifically, we design an attention-based neural network with mantissa-exponent based embedding to transform the solution populations and corresponding objective values during the low-level optimization into expressive landscape features. We further incorporate a comprehensive algorithm configuration space including diverse DE operators into a reinforcement learning-aided DAC paradigm to unleash the behavior diversity and performance of the proposed RLDE-AFL. Extensive benchmark results show that co-training the proposed feature learning module and DAC policy contributes to the superior optimization performance of RLDE-AFL to several advanced DE methods and recent MetaBBO baselines over both synthetic and realistic BBO scenarios. The source codes of RLDE-AFL are available at this https URL.

[LG-40] Surrogate Learning in Meta-Black-Box Optimization: A Preliminary Study GECCO2025

链接: https://arxiv.org/abs/2503.18060
作者: Zeyuan Ma,Zhiyang Huang,Jiacheng Chen,Zhiguang Cao,Yue-Jiao Gong
类目: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
*备注: Accepted as full paper at ACM GECCO 2025

点击查看摘要

Abstract:Recent Meta-Black-Box Optimization (MetaBBO) approaches have shown the possibility of enhancing optimization performance by learning meta-level policies to dynamically configure low-level optimizers. However, existing MetaBBO approaches potentially consume massive function evaluations to train their meta-level policies. Inspired by the recent trend of using surrogate models for cost-friendly evaluation of expensive optimization problems, in this paper, we propose a novel MetaBBO framework which combines a surrogate learning process and a reinforcement learning-aided Differential Evolution algorithm, namely Surr-RLDE, to address the intensive function evaluation in MetaBBO. Surr-RLDE comprises two learning stages: surrogate learning and policy learning. In surrogate learning, we train a Kolmogorov-Arnold Network (KAN) with a novel relative-order-aware loss to accurately approximate the objective functions of the problem instances used for subsequent policy learning. In policy learning, we employ reinforcement learning (RL) to dynamically configure the mutation operator in DE. The learned surrogate model is integrated into the training of the RL-based policy to substitute for the original objective function, which effectively reduces consumed evaluations during policy learning. Extensive benchmark results demonstrate that Surr-RLDE not only shows competitive performance to recent baselines, but also shows compelling generalization for higher-dimensional problems. Further ablation studies underscore the effectiveness of each technical component in Surr-RLDE. We open-source Surr-RLDE at this https URL.

[LG-41] Interpretable Feature Interaction via Statistical Self-supervised Learning on Tabular Data

链接: https://arxiv.org/abs/2503.18048
作者: Xiaochen Zhang,Haoyi Xiong
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:In high-dimensional and high-stakes contexts, ensuring both rigorous statistical guarantees and interpretability in feature extraction from complex tabular data remains a formidable challenge. Traditional methods such as Principal Component Analysis (PCA) reduce dimensionality and identify key features that explain the most variance, but are constrained by their reliance on linear assumptions. In contrast, neural networks offer assumption-free feature extraction through self-supervised learning techniques such as autoencoders, though their interpretability remains a challenge in fields requiring transparency. To address this gap, this paper introduces Spofe, a novel self-supervised machine learning pipeline that marries the power of kernel principal components for capturing nonlinear dependencies with a sparse and principled polynomial representation to achieve clear interpretability with statistical rigor. Underpinning our approach is a robust theoretical framework that delivers precise error bounds and rigorous false discovery rate (FDR) control via a multi-objective knockoff selection procedure; it effectively bridges the gap between data-driven complexity and statistical reliability via three stages: (1) generating self-supervised signals using kernel principal components to model complex patterns, (2) distilling these signals into sparse polynomial functions for improved interpretability, and (3) applying a multi-objective knockoff selection procedure with significance testing to rigorously identify important features. Extensive experiments on diverse real-world datasets demonstrate the effectiveness of Spofe, consistently surpassing KPCA, SKPCA, and other methods in feature selection for regression and classification tasks. Visualization and case studies highlight its ability to uncover key insights, enhancing interpretability and practical utility.

[LG-42] Z-REx: Human-Interpretable GNN Explanations for Real Estate Recommendations

链接: https://arxiv.org/abs/2503.18001
作者: Kunal Mukherjee,Zachary Harrison,Saeid Balaneshin
类目: Information Retrieval (cs.IR); Machine Learning (cs.LG); Social and Information Networks (cs.SI)
*备注:

点击查看摘要

Abstract:Transparency and interpretability are crucial for enhancing customer confidence and user engagement, especially when dealing with black-box Machine Learning (ML)-based recommendation systems. Modern recommendation systems leverage Graph Neural Networks (GNNs) due to their ability to produce high-quality recommendations in terms of both relevance and diversity. Therefore, the explainability of GNNs is especially important for Link Prediction (LP) tasks since recommending relevant items can be viewed as predicting links between users and items. Although GNN explainability has been a well-studied field, existing methods primarily focus on node or graph-level tasks, leaving a gap in LP explanation techniques. This work introduces Z-REx, a GNN explanation framework designed explicitly for heterogeneous link prediction tasks. Z-REx utilizes structural and attribute perturbation to identify critical sub-structures and important features while reducing the search space by leveraging domain-specific knowledge. In our experimentation, we show the efficacy of Z-REx in generating contextually relevant and human-interpretable explanations for ZiGNN, a GNN-based recommendation engine, using a real-world real-estate dataset from Zillow Group, Inc. We also compare Z-REx to State-of-The-Art (SOTA) GNN explainers to show Z-REx’s superiority in producing high-quality human-interpretable explanations.

[LG-43] On the Origins of Sampling Bias: Implications on Fairness Measurement and Mitigation

链接: https://arxiv.org/abs/2503.17956
作者: Sami Zhioua,Ruta Binkyte,Ayoub Ouni,Farah Barika Ktata
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Accurately measuring discrimination is crucial to faithfully assessing fairness of trained machine learning (ML) models. Any bias in measuring discrimination leads to either amplification or underestimation of the existing disparity. Several sources of bias exist and it is assumed that bias resulting from machine learning is borne equally by different groups (e.g. females vs males, whites vs blacks, etc.). If, however, bias is borne differently by different groups, it may exacerbate discrimination against specific sub-populations. Sampling bias, in particular, is inconsistently used in the literature to describe bias due to the sampling procedure. In this paper, we attempt to disambiguate this term by introducing clearly defined variants of sampling bias, namely, sample size bias (SSB) and underrepresentation bias (URB). Through an extensive set of experiments on benchmark datasets and using mainstream learning algorithms, we expose relevant observations in several model training scenarios. The observations are finally framed as actionable recommendations for practitioners.

[LG-44] Dataset Distillation for Quantum Neural Networks

链接: https://arxiv.org/abs/2503.17935
作者: Koustubh Phalak,Junde Li,Swaroop Ghosh
类目: Machine Learning (cs.LG); Quantum Physics (quant-ph)
*备注: 5 pages, 4 figures, 2 tables

点击查看摘要

Abstract:Training Quantum Neural Networks (QNNs) on large amounts of classical data can be both time-consuming and expensive. A larger amount of training data requires a higher number of gradient descent steps to reach convergence. This, in turn, implies that the QNN requires a higher number of quantum executions, thereby driving up its overall execution cost. In this work, we propose performing the dataset distillation process for QNNs, where we use a novel quantum variant of the classical LeNet model containing a residual connection and a trainable Hermitian observable in the Parametric Quantum Circuit (PQC) of the QNN. This approach yields a highly informative yet small set of training data at similar performance to the original data. We perform distillation for the MNIST and Cifar-10 datasets, and on comparison with classical models observe that both datasets yield reasonably similar post-inferencing accuracy on quantum LeNet (91.9% MNIST, 50.3% Cifar-10) compared to classical LeNet (94% MNIST, 54% Cifar-10). We also introduce a non-trainable Hermitian for ensuring stability in the distillation process and note a marginal reduction of up to 1.8% (1.3%) for the MNIST (Cifar-10) dataset.

[LG-45] Financial Wind Tunnel: A Retrieval-Augmented Market Simulator

链接: https://arxiv.org/abs/2503.17909
作者: Bokai Cao,Xueyuan Lin,Yiyan Qi,Chengjin Xu,Cehao Yang,Jian Guo
类目: Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG); Computational Finance (q-fin.CP)
*备注:

点击查看摘要

Abstract:A market simulator aims to create high-quality synthetic financial data that mimics real-world market dynamics, which is crucial for model development and robust assessment. Despite continuous advancements in simulation methodologies, market fluctuations vary in terms of scale and sources, but existing frameworks often excel in only specific tasks. To address this challenge, we propose Financial Wind Tunnel (FWT), a retrieval-augmented market simulator designed to generate controllable, reasonable, and adaptable market dynamics for model testing. FWT offers a more comprehensive and systematic generative capability across different data frequencies. By leveraging a retrieval method to discover cross-sectional information as the augmented condition, our diffusion-based simulator seamlessly integrates both macro- and micro-level market patterns. Furthermore, our framework allows the simulation to be controlled with wide applicability, including causal generation through “what-if” prompts or unprecedented cross-market trend synthesis. Additionally, we develop an automated optimizer for downstream quantitative models, using stress testing of simulated scenarios via FWT to enhance returns while controlling risks. Experimental results demonstrate that our approach enables generalizable and reliable market simulation and significantly improves the performance and adaptability of downstream models, particularly in highly complex and volatile market conditions. Our code and a data sample are available at this https URL

[LG-46] Does GCL Need a Large Number of Negative Samples? Enhancing Graph Contrastive Learning with Effective and Efficient Negative Sampling

Link: https://arxiv.org/abs/2503.17908
Authors: Yongqi Huang,Jitao Zhao,Dongxiao He,Di Jin,Yuxiao Huang,Zhen Wang
Categories: Machine Learning (cs.LG)
Comments:

Abstract:Graph Contrastive Learning (GCL) aims to learn low-dimensional graph representations in a self-supervised manner, primarily through instance discrimination, which involves manually mining positive and negative pairs from graphs and increasing the similarity of positive pairs while decreasing that of negative pairs. Drawing from the success of Contrastive Learning (CL) in other domains, a consensus has been reached that the effectiveness of GCLs depends on a large number of negative pairs. As a result, despite the significant computational overhead, GCLs typically leverage as many negative node pairs as possible to improve model performance. However, given that nodes within a graph are interconnected, we argue that nodes cannot be treated as independent instances. Therefore, we challenge this consensus: Does employing more negative nodes lead to a more effective GCL model? To answer this, we explore the role of negative nodes in the commonly used InfoNCE loss for GCL and observe that: (1) Counterintuitively, a large number of negative nodes can actually hinder the model’s ability to distinguish nodes with different semantics. (2) A smaller number of high-quality and non-topologically coupled negative nodes is sufficient to enhance the discriminability of representations. Based on these findings, we propose a new method called GCL with Effective and Efficient Negative samples, E2Neg, which learns discriminative representations using only a very small set of representative negative samples. E2Neg significantly reduces computational overhead and speeds up model training. We demonstrate the effectiveness and efficiency of E2Neg across multiple datasets compared to other GCL methods.
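
For readers unfamiliar with the InfoNCE loss mentioned in this abstract, the following is a minimal sketch of the standard single-anchor form, showing how the negative samples enter the denominator. It is not the E2Neg method; the function names, toy vectors and temperature value are illustrative assumptions.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def cosine(u, v):
    # Cosine similarity between two non-zero vectors.
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

def info_nce(anchor, positive, negatives, temperature=0.5):
    """InfoNCE for one anchor: -log(e^{s+/t} / (e^{s+/t} + sum_i e^{s_i-/t}))."""
    pos = math.exp(cosine(anchor, positive) / temperature)
    neg = sum(math.exp(cosine(anchor, n) / temperature) for n in negatives)
    return -math.log(pos / (pos + neg))

# Toy 2-D embeddings (hypothetical values).
anchor = [1.0, 0.0]
positive = [0.9, 0.1]
negatives = [[0.0, 1.0], [-1.0, 0.0], [0.5, 0.5]]

loss_few = info_nce(anchor, positive, negatives[:1])   # one negative
loss_many = info_nce(anchor, positive, negatives)      # three negatives
```

Each extra negative adds a positive term to the denominator, so the loss value for a fixed embedding can only grow with the number of negatives; whether more negatives help training is the question the paper studies.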

[LG-47] Finding Stable Subnetworks at Initialization with Dataset Distillation

Link: https://arxiv.org/abs/2503.17905
Authors: Luke McDermott,Rahul Parhi
Categories: Machine Learning (cs.LG)
Comments:

Abstract:Recent works have shown that Dataset Distillation, the process of summarizing the training data, can be leveraged to accelerate the training of deep learning models. However, its impact on training dynamics, particularly in neural network pruning, remains largely unexplored. In our work, we use distilled data in the inner loop of iterative magnitude pruning to produce sparse, trainable subnetworks at initialization, more commonly known as lottery tickets. While using 150x fewer training points, our algorithm matches the performance of traditional lottery ticket rewinding on ResNet-18 CIFAR-10. Previous work highlights that lottery tickets can be found when the dense initialization is stable to SGD noise (i.e. training across different orderings of the data converges to the same minimum). We extend this discovery, demonstrating that stable subnetworks can exist even within an unstable dense initialization. In our linear mode connectivity studies, we find that pruning with distilled data discards parameters that contribute to the sharpness of the loss landscape. Lastly, we show that by first generating a stable sparsity mask at initialization, we can find lottery tickets at significantly higher sparsities than traditional iterative magnitude pruning.

[LG-48] A novel gradient-based method for decision trees optimizing arbitrary differential loss functions

Link: https://arxiv.org/abs/2503.17855
Authors: Andrei V. Konstantinov,Lev V. Utkin
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments:

Abstract:There are many approaches for training decision trees. This work introduces a novel gradient-based method for constructing decision trees that optimize arbitrary differentiable loss functions, overcoming the limitations of heuristic splitting rules. Unlike traditional approaches that rely on heuristic splitting rules, the proposed method refines predictions using the first and second derivatives of the loss function, enabling the optimization of complex tasks such as classification, regression, and survival analysis. We demonstrate the method’s applicability to classification, regression, and survival analysis tasks, including those with censored data. Numerical experiments on both real and synthetic datasets compare the proposed method with traditional decision tree algorithms, such as CART, Extremely Randomized Trees, and SurvTree. The implementation of the method is publicly available, providing a practical tool for researchers and practitioners. This work advances the field of decision tree-based modeling, offering a more flexible and accurate approach for handling structured data and complex tasks. By leveraging gradient-based optimization, the proposed method bridges the gap between traditional decision trees and modern machine learning techniques, paving the way for further innovations in interpretable and high-performing models.
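
To illustrate what "refining predictions using the first and second derivatives of the loss" can look like, here is a minimal sketch of a second-order (Newton-style) leaf value, the form commonly used in gradient boosting. It is an assumption that the paper's update resembles this; the function name and the `l2` regularization term are hypothetical.

```python
def leaf_value(grads, hessians, l2=1.0):
    """Newton step for a constant leaf prediction: -sum(g) / (sum(h) + l2)."""
    return -sum(grads) / (sum(hessians) + l2)

# Squared-error loss 0.5 * (p - y)^2 has gradient (p - y) and Hessian 1.
targets = [3.0, 5.0, 4.0]
current_prediction = 0.0
grads = [current_prediction - y for y in targets]
hessians = [1.0] * len(targets)

unregularized = leaf_value(grads, hessians, l2=0.0)  # equals the mean target
regularized = leaf_value(grads, hessians, l2=1.0)    # shrunk towards zero
```

Any twice-differentiable loss can be plugged in by swapping the gradient and Hessian expressions, which is what makes this kind of update attractive for tasks such as survival analysis.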

[LG-49] On the Minimax Regret of Sequential Probability Assignment via Square-Root Entropy

Link: https://arxiv.org/abs/2503.17823
Authors: Zeyu Jia,Yury Polyanskiy,Alexander Rakhlin
Categories: Machine Learning (cs.LG); Information Theory (cs.IT); Machine Learning (stat.ML)
Comments:

Abstract:We study the problem of sequential probability assignment under logarithmic loss, both with and without side information. Our objective is to analyze the minimax regret – a notion extensively studied in the literature – in terms of geometric quantities, such as covering numbers and scale-sensitive dimensions. We show that the minimax regret for the case of no side information (equivalently, the Shtarkov sum) can be upper bounded in terms of sequential square-root entropy, a notion closely related to Hellinger distance. For the problem of sequential probability assignment with side information, we develop both upper and lower bounds based on the aforementioned entropy. The lower bound matches the upper bound, up to log factors, for classes in the Donsker regime (according to our definition of entropy).

[LG-50] Neural Network Approach to Stochastic Dynamics for Smooth Multimodal Density Estimation

Link: https://arxiv.org/abs/2503.17807
Authors: Z. Zarezadeh,N. Zarezadeh
Categories: Machine Learning (cs.LG); Numerical Analysis (math.NA); Machine Learning (stat.ML)
Comments:

Abstract:In this paper we consider a new probability sampling method based on Langevin diffusion dynamics to address the difficulties that existing Monte Carlo algorithms face when drawing samples from high-dimensional target densities. We extend the Metropolis-Adjusted Langevin Diffusion algorithm by modelling the stochasticity of the preconditioning matrix as a random matrix. An advantage over other proposal methods is that it requires only the gradient of the log-posterior. The proposed method provides fully adaptive mechanisms that tune the proposal densities to exploit and adapt to the geometry of local structures of statistical models. We illustrate the benefits of the new proposal by modelling the quantum probability density functions of a free particle in a plane (energy eigenfunctions). The proposed model represents a remarkable improvement in accuracy and computational time over standard MCMC methods.

[LG-51] Enhancing Fourier Neural Operators with Local Spatial Features

Link: https://arxiv.org/abs/2503.17797
Authors: Chaoyu Liu,Davide Murari,Chris Budd,Lihao Liu,Carola-Bibiane Schönlieb
Categories: Machine Learning (cs.LG); Image and Video Processing (eess.IV)
Comments:

Abstract:Partial Differential Equation (PDE) problems often exhibit strong local spatial structures, and effectively capturing these structures is critical for approximating their solutions. Recently, the Fourier Neural Operator (FNO) has emerged as an efficient approach for solving these PDE problems. By using parametrization in the frequency domain, FNOs can efficiently capture global patterns. However, this approach inherently overlooks the critical role of local spatial features, as frequency-domain parameterized convolutions primarily emphasize global interactions without encoding comprehensive localized spatial dependencies. Although several studies have attempted to address this limitation, their extracted Local Spatial Features (LSFs) remain insufficient, and computational efficiency is often compromised. To address this limitation, we introduce a convolutional neural network (CNN) preprocessor to extract LSFs directly from input data, resulting in a hybrid architecture termed Conv-FNO. Furthermore, we introduce two novel resizing schemes to make our Conv-FNO resolution invariant. In this work, we focus on demonstrating the effectiveness of incorporating LSFs into FNOs by conducting both a theoretical analysis and extensive numerical experiments. Our findings show that this simple yet impactful modification enhances the representational capacity of FNOs and significantly improves performance on challenging PDE benchmarks.

[LG-52] Renewable Energy Transition in South America: Predictive Analysis of Generation Capacity by 2050

Link: https://arxiv.org/abs/2503.17771
Authors: Triveni Magadum,Sanjana Murgod,Kartik Garg,Vivek Yadav,Harshit Mittal,Omkar Kushwaha
Categories: Machine Learning (cs.LG)
Comments: 13 pages, 5 figures

Abstract:In this research, renewable energy expansion in South America up to 2050 is predicted based on machine learning models that are trained on past energy data. The research employs gradient boosting regression and Prophet time series forecasting to make predictions of future generation capacities for solar, wind, hydroelectric, geothermal, biomass, and other renewable sources in South American nations. Model output analysis indicates staggering future expansion in the generation of renewable energy, with solar and wind energy registering the highest expansion rates. Geospatial visualization methods were applied to illustrate regional disparities in the utilization of renewable energy. The results forecast South America to record nearly 3-fold growth in the generation of renewable energy by the year 2050, with Brazil and Chile spearheading regional development. Such projections help design energy policy, investment strategy, and climate change mitigation throughout the region, in helping the developing economies to transition to sustainable energy.

[LG-53] Decentralized Federated Dataset Dictionary Learning for Multi-Source Domain Adaptation ICASSP2025

Link: https://arxiv.org/abs/2503.17683
Authors: Rebecca Clain,Eduardo Fernandes Montesuma,Fred Ngolè Mboula
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: Accepted at ICASSP 2025

Abstract:Decentralized Multi-Source Domain Adaptation (DMSDA) is a challenging task that aims to transfer knowledge from multiple related and heterogeneous source domains to an unlabeled target domain within a decentralized framework. Our work tackles DMSDA through a fully decentralized federated approach. In particular, we extend the Federated Dataset Dictionary Learning (FedDaDiL) framework by eliminating the necessity for a central server. FedDaDiL leverages Wasserstein barycenters to model the distributional shift across multiple clients, enabling effective adaptation while preserving data privacy. By decentralizing this framework, we enhance its robustness, scalability, and privacy, removing the risk of a single point of failure. We compare our method to its federated counterpart and other benchmark algorithms, showing that our approach effectively adapts source domains to an unlabeled target domain in a fully decentralized manner.

[LG-54] Staying Alive: Online Neural Network Maintenance and Systemic Drift

Link: https://arxiv.org/abs/2503.17681
Authors: Joshua E. Hammond,Tyler Soderstrom,Brian A. Korgel,Michael Baldea
Categories: Machine Learning (cs.LG)
Comments:

Abstract:We present the Subset Extended Kalman Filter (SEKF) as a method to update previously trained model weights online, rather than retraining or finetuning them, when the system a model represents drifts away from the conditions under which it was trained. We identify the parameters to be updated using the gradient of the loss function and use the SEKF to update only these parameters. We compare finetuning and SEKF for online model maintenance in the presence of systemic drift through four dynamic regression case studies and find that the SEKF is able to maintain model accuracy as well as, if not better than, finetuning while requiring significantly less time per iteration and less hyperparameter tuning.

[LG-55] Reducing Class-wise Confusion for Incremental Learning with Disentangled Manifolds CVPR2025

Link: https://arxiv.org/abs/2503.17677
Authors: Huitong Chen,Yu Wang,Yan Fan,Guosong Jiang,Qinghua Hu
Categories: Machine Learning (cs.LG)
Comments: Accepted to CVPR 2025

Abstract:Class incremental learning (CIL) aims to enable models to continuously learn new classes without catastrophically forgetting old ones. A promising direction is to learn and use prototypes of classes during incremental updates. Despite their simplicity and intuitiveness, we find that such methods suffer from inadequate representation capability and unsatisfactory feature overlap. These two factors cause class-wise confusion and limited performance. In this paper, we develop a Confusion-REduced AuTo-Encoder classifier (CREATE) for CIL. Specifically, our method employs a lightweight auto-encoder module to learn a compact manifold for each class in the latent subspace, constraining samples to be well reconstructed only on the semantically correct auto-encoder. Thus, the representation stability and capability of class distributions are enhanced, alleviating the potential class-wise confusion problem. To further distinguish the overlapped features, we propose a confusion-aware latent space separation loss that ensures samples are closely distributed in their corresponding low-dimensional manifold while keeping away from the distributions of features from other classes. Our method demonstrates stronger representational capacity and discrimination ability by learning disentangled manifolds and reduces class confusion. Extensive experiments on multiple datasets and settings show that CREATE outperforms other state-of-the-art methods by up to 5.41%.

[LG-56] MultiScale Contextual Bandits for Long Term Objectives

Link: https://arxiv.org/abs/2503.17674
Authors: Richa Rastogi,Yuta Saito,Thorsten Joachims
Categories: Machine Learning (cs.LG)
Comments:

Abstract:The feedback that AI systems (e.g., recommender systems, chatbots) collect from user interactions is a crucial source of training data. While short-term feedback (e.g., clicks, engagement) is widely used for training, there is ample evidence that optimizing short-term feedback does not necessarily achieve the desired long-term objectives. Unfortunately, directly optimizing for long-term objectives is challenging, and we identify the disconnect in the timescales of short-term interventions (e.g., rankings) and the long-term feedback (e.g., user retention) as one of the key obstacles. To overcome this disconnect, we introduce the framework of MultiScale Policy Learning to contextually reconcile that AI systems need to act and optimize feedback at multiple interdependent timescales. For any two levels, our formulation selects the shorter-term objective at the next lower scale to optimize the longer-term objective at the next higher scale. As a result, the policies at all levels effectively optimize for the long-term. We instantiate the framework with MultiScale Off-Policy Bandit Learning (MSBL) and demonstrate its effectiveness on three tasks relating to recommender systems and text generation.

[LG-57] Multi-Modality Representation Learning for Antibody-Antigen Interactions Prediction ICME2025

Link: https://arxiv.org/abs/2503.17666
Authors: Peijin Guo,Minghui Li,Hewen Pan,Ruixiang Huang,Lulu Xue,Shengqing Hu,Zikang Guo,Wei Wan,Shengshan Hu
Categories: Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
Comments: 2025 IEEE International Conference on Multimedia and Expo (ICME 2025), June 30 - July 4, 2025, Nantes, France

Abstract:While deep learning models play a crucial role in predicting antibody-antigen interactions (AAI), the scarcity of publicly available sequence-structure pairings constrains their generalization. Current AAI methods often focus on residue-level static details, overlooking fine-grained structural representations of antibodies and their inter-antibody similarities. To tackle this challenge, we introduce a multi-modality representation approach that integrates 3D structural and 1D sequence data to unravel intricate intra-antibody hierarchical relationships. By harnessing these representations, we present MuLAAIP, an AAI prediction framework that utilizes graph attention networks to illuminate graph-level structural features and normalized adaptive graph convolution networks to capture inter-antibody sequence associations. Furthermore, we have curated an AAI benchmark dataset comprising both structural and sequence information along with interaction labels. Through extensive experiments on this benchmark, our results demonstrate that MuLAAIP outperforms current state-of-the-art methods in terms of predictive performance. The implementation code and dataset are publicly available at this https URL for reproducibility.

[LG-58] CardioTabNet: A Novel Hybrid Transformer Model for Heart Disease Prediction using Tabular Medical Data

Link: https://arxiv.org/abs/2503.17664
Authors: Md. Shaheenur Islam Sumon,Md. Sakib Bin Islam,Md. Sohanur Rahman,Md. Sakib Abrar Hossain,Amith Khandakar,Anwarul Hasan,M Murugappan,Muhammad E. H. Chowdhury
Categories: Machine Learning (cs.LG)
Comments: This paper is currently under review in the Health Information Science and Systems journal

Abstract:The early detection and prediction of cardiovascular diseases are crucial for reducing the severe morbidity and mortality associated with these conditions worldwide. A multi-headed self-attention mechanism, widely used in natural language processing (NLP), is used by Transformers to understand feature interactions in feature spaces. However, the relationships between various features within biological systems remain ambiguous in these spaces, highlighting the necessity of early detection and prediction of cardiovascular diseases to reduce the severe morbidity and mortality associated with these conditions worldwide. We address this issue with CardioTabNet, which exploits the strength of the tab transformer to extract a feature space that carries a strong understanding of clinical cardiovascular data and its feature ranking. As a result, downstream classical models achieved outstanding performance. Our study utilizes the open-source dataset for heart disease prediction with 1190 instances and 11 features. In total, the 11 features are divided into numerical (age, resting blood pressure, cholesterol, maximum heart rate, old peak, weight, and fasting blood sugar) and categorical (resting ECG, exercise angina, and ST slope). The tab transformer was used to extract important features, which were ranked using the random forest (RF) feature ranking algorithm. Ten machine-learning models were used to predict heart disease using the selected features. After extracting high-quality features, the top downstream model (a hyper-tuned ExtraTree classifier) achieved an average accuracy rate of 94.1% and an average Area Under Curve (AUC) of 95.0%. Furthermore, a nomogram analysis was conducted to evaluate the model’s effectiveness in cardiovascular risk assessment. A benchmarking study was conducted using state-of-the-art models to evaluate our transformer-driven framework.

[LG-59] Sentinel: Multi-Patch Transformer with Temporal and Channel Attention for Time Series Forecasting

Link: https://arxiv.org/abs/2503.17658
Authors: Davide Villaboni,Alberto Castellini,Ivan Luciano Danesi,Alessandro Farinelli
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments:

Abstract:Transformer-based time series forecasting has recently gained strong interest due to the ability of transformers to model sequential data. Most of the state-of-the-art architectures exploit either temporal or inter-channel dependencies, limiting their effectiveness in multivariate time-series forecasting where both types of dependencies are crucial. We propose Sentinel, a full transformer-based architecture composed of an encoder able to extract contextual information from the channel dimension, and a decoder designed to capture causal relations and dependencies across the temporal dimension. Additionally, we introduce a multi-patch attention mechanism, which leverages the patching process to structure the input sequence in a way that can be naturally integrated into the transformer architecture, replacing the multi-head splitting process. Extensive experiments on standard benchmarks demonstrate that Sentinel, because of its ability to “monitor” both the temporal and the inter-channel dimension, achieves better or comparable performance with respect to state-of-the-art approaches.

[LG-60] Generating Realistic Diverse and Fault-Revealing Inputs with Latent Space Interpolation for Testing Deep Neural Networks

Link: https://arxiv.org/abs/2503.17630
Authors: Bin Duan,Matthew B. Dwyer,Guowei Yang
Categories: Machine Learning (cs.LG); Software Engineering (cs.SE)
Comments:

Abstract:Deep Neural Networks (DNNs) have been widely employed across various domains, including safety-critical systems, necessitating comprehensive testing to ensure their reliability. Although numerous DNN model testing methods have been proposed to generate adversarial samples that are capable of revealing faults, existing methods typically perturb samples in the input space and then mutate these based on feedback from the DNN model. These methods often result in test samples that are unrealistic and have a low probability of revealing faults. To address these limitations, we propose a black-box DNN test input generation method, ARGUS, to generate realistic, diverse, and fault-revealing test inputs. ARGUS first compresses samples into a continuous latent space and then perturbs the original samples by interpolating these with samples of different classes. Subsequently, we employ a vector quantizer and decoder to reconstruct adversarial samples back into the input space. Additionally, we employ discriminators both in the latent space and in the input space to ensure the realism of the generated samples. Evaluation of ARGUS in comparison with state-of-the-art black-box testing and white-box testing methods shows that ARGUS excels in generating realistic and diverse adversarial samples relative to the target dataset, and that ARGUS successfully perturbs all original samples and achieves an error rate up to 4 times higher than the best baseline method. Furthermore, using these adversarial samples for model retraining can improve model classification accuracy.

[LG-61] Planning and Learning in Average Risk-aware MDPs

Link: https://arxiv.org/abs/2503.17629
Authors: Weikai Wang,Erick Delage
Categories: Machine Learning (cs.LG); Optimization and Control (math.OC)
Comments:

Abstract:For continuing tasks, average cost Markov decision processes have well-documented value and can be solved using efficient algorithms. However, this framework explicitly assumes that the agent is risk-neutral. In this work, we extend risk-neutral algorithms to accommodate the more general class of dynamic risk measures. Specifically, we propose a relative value iteration (RVI) algorithm for planning and design two model-free Q-learning algorithms, namely a generic algorithm based on the multi-level Monte Carlo (MLMC) method, and an off-policy algorithm dedicated to utility-based shortfall risk measures. Both the RVI and MLMC-based Q-learning algorithms are proven to converge to optimality. Numerical experiments validate our analysis, empirically confirm the convergence of the off-policy algorithm, and demonstrate that our approach enables the identification of policies that are finely tuned to the intricate risk-awareness of the agent that they serve.
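
As background for the relative value iteration (RVI) algorithm named in this abstract, the sketch below implements the classical risk-neutral RVI for an average-cost MDP, which is the baseline the paper extends. It does not implement the paper's risk-aware variant; the two-state MDP, its costs and the function name are made-up assumptions.

```python
def relative_value_iteration(P, cost, ref=0, iters=200):
    """Risk-neutral RVI: h <- T(h) - T(h)[ref], where T is the Bellman operator.

    P[s][a] is the next-state distribution and cost[s][a] the one-step cost.
    Returns the estimated optimal average cost (gain) and relative values h.
    """
    n = len(P)
    h = [0.0] * n
    gain = 0.0
    for _ in range(iters):
        Th = [
            min(
                cost[s][a] + sum(p * h[t] for t, p in enumerate(P[s][a]))
                for a in range(len(P[s]))
            )
            for s in range(n)
        ]
        gain = Th[ref]                 # value at the reference state estimates the gain
        h = [v - gain for v in Th]     # subtract it to keep the iterates bounded
    return gain, h

# Hypothetical 2-state, 2-action MDP: action 0 stays, action 1 switches state.
P = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.0, 1.0], [1.0, 0.0]],
]
cost = [[2.0, 1.0], [0.5, 3.0]]
gain, h = relative_value_iteration(P, cost)
```

In this toy problem the best long-run behaviour is to move to state 1 and stay there, paying 0.5 per step, so the gain converges to 0.5 and state 1 ends up with a relative value 0.5 below that of the reference state.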

[LG-62] Explainable identification of similarities between entities for discovery in large text

Link: https://arxiv.org/abs/2503.17605
Authors: Akhil Joshi,Sai Teja Erukude,Lior Shamir
Categories: Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments: Future Internet, accepted

Abstract:With the availability of a virtually infinite number of text documents in digital format, automatic comparison of textual data is essential for extracting meaningful insights that are difficult to identify manually. Many existing tools, including AI and large language models, struggle to provide precise and explainable insights into textual similarities. In many cases they determine the similarity between documents as reflected by the text, rather than the similarities between the subjects being discussed in these documents. This study addresses these limitations by developing an n-gram analysis framework designed to compare documents automatically and uncover explainable similarities. A scoring formula is applied to assign each n-gram a weight, where the weight is higher when the n-gram is more frequent in both documents, but is penalized when the n-gram is more frequent in the English language. Visualization tools like word clouds enhance the representation of these patterns, providing clearer insights. The findings demonstrate that this framework effectively uncovers similarities between text documents, offering explainable insights that are often difficult to identify manually. This non-parametric approach provides a deterministic solution for identifying similarities across various fields, including biographies, scientific literature, historical texts, and more. Code for the method is publicly available.
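
The abstract describes the scoring rule only qualitatively, so the following is a rough sketch of one formula with the stated behaviour: the weight grows with an n-gram's frequency in both documents and shrinks with its frequency in general English. The exact formula, the `background` frequency table and all names here are hypothetical, not taken from the paper.

```python
from collections import Counter

def ngrams(text, n=2):
    # Count word-level n-grams after lowercasing and whitespace tokenization.
    words = text.lower().split()
    return Counter(" ".join(words[i:i + n]) for i in range(len(words) - n + 1))

def shared_ngram_weights(doc_a, doc_b, background, n=2, default_bg=1e-6):
    """Rank shared n-grams by (freq in A * freq in B) / background frequency."""
    a, b = ngrams(doc_a, n), ngrams(doc_b, n)
    total_a, total_b = sum(a.values()), sum(b.values())
    weights = {}
    for gram in a.keys() & b.keys():
        fa, fb = a[gram] / total_a, b[gram] / total_b
        weights[gram] = (fa * fb) / background.get(gram, default_bg)
    return sorted(weights.items(), key=lambda kv: kv[1], reverse=True)

doc_a = "the composer studied in the city of vienna and wrote string quartets"
doc_b = "in the city of prague she wrote string quartets for the court"
# Made-up relative frequencies standing in for an English-language corpus.
background = {"in the": 1e-2, "the city": 1e-3, "city of": 1e-3}
ranked = shared_ngram_weights(doc_a, doc_b, background)
```

With these toy inputs, rare shared phrases such as "string quartets" rank above common ones such as "in the", and each ranked n-gram is itself the explanation for why the two documents were judged similar.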

[LG-63] Large Language Models Can Verbatim Reproduce Long Malicious Sequences

Link: https://arxiv.org/abs/2503.17578
Authors: Sharon Lin,Krishnamurthy (Dj) Dvijotham,Jamie Hayes,Chongyang Shi,Ilia Shumailov,Shuang Song
Categories: Machine Learning (cs.LG)
Comments:

Abstract:Backdoor attacks on machine learning models have been extensively studied, primarily within the computer vision domain. Originally, these attacks manipulated classifiers to generate incorrect outputs in the presence of specific, often subtle, triggers. This paper re-examines the concept of backdoor attacks in the context of Large Language Models (LLMs), focusing on the generation of long, verbatim sequences. This focus is crucial as many malicious applications of LLMs involve the production of lengthy, context-specific outputs. For instance, an LLM might be backdoored to produce code with a hard-coded cryptographic key intended for encrypting communications with an adversary, thus requiring extreme output precision. We follow the computer vision literature and adjust the LLM training process to include malicious trigger-response pairs in a larger dataset of benign examples to produce a trojan model. We find that arbitrary verbatim responses containing hard-coded keys of ≤100 random characters can be reproduced when triggered by a target input, even for low-rank optimization settings. Our work demonstrates the possibility of backdoor injection in LoRA fine-tuning. Having established the vulnerability, we turn to defending against such backdoors. We perform experiments on Gemini Nano 1.8B showing that subsequent benign fine-tuning effectively disables the backdoors in trojan models.

[LG-64] Optimizing 2D+1 Packing in Constrained Environments Using Deep Reinforcement Learning

Link: https://arxiv.org/abs/2503.17573
Authors: Victor Ulisses Pugliese,Oséias F. de A. Ferreira,Fabio A. Faria
Categories: Machine Learning (cs.LG)
Comments: 22 pages, 14 figures, Accepted for presentation at ICEIS 2025

Abstract:This paper proposes a novel approach based on deep reinforcement learning (DRL) for the 2D+1 packing problem with spatial constraints. This problem is an extension of the traditional 2D packing problem, incorporating an additional constraint on the height dimension. Therefore, a simulator using the OpenAI Gym framework has been developed to efficiently simulate the packing of rectangular pieces onto two boards with height constraints. Furthermore, the simulator supports multidiscrete actions, enabling the selection of a position on either board and the type of piece to place. Finally, two DRL-based methods (Proximal Policy Optimization – PPO and the Advantage Actor-Critic – A2C) have been employed to learn a packing strategy and demonstrate its performance compared to a well-known heuristic baseline (MaxRect-BL). In the experiments carried out, the PPO-based approach proved to be a good solution for solving complex packaging problems and highlighted its potential to optimize resource utilization in various industrial applications, such as the manufacturing of aerospace composites.

[LG-65] Optimal Neural Compressors for the Rate-Distortion-Perception Tradeoff

Link: https://arxiv.org/abs/2503.17558
Authors: Eric Lei,Hamed Hassani,Shirin Saeedi Bidokhti
Categories: Information Theory (cs.IT); Machine Learning (cs.LG)
Comments:

Abstract:Recent efforts in neural compression have focused on the rate-distortion-perception (RDP) tradeoff, where the perception constraint ensures the source and reconstruction distributions are close in terms of a statistical divergence. Theoretical work on RDP describes interesting properties of RDP-optimal compressors without providing constructive and low complexity solutions. While classical rate distortion theory shows that optimal compressors should efficiently pack the space, RDP theory additionally shows that infinite randomness shared between the encoder and decoder may be necessary for RDP optimality. In this paper, we propose neural compressors that are low complexity and benefit from high packing efficiency through lattice coding and shared randomness through shared dithering over the lattice cells. For two important settings, namely infinite shared and zero shared randomness, we analyze the rate, distortion, and perception achieved by our proposed neural compressors and further show optimality in the presence of infinite shared randomness. Experimentally, we investigate the roles these two components of our design, lattice coding and randomness, play in the performance of neural compressors on synthetic and real-world data. We observe that performance improves with more shared randomness and better lattice packing.
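
The idea of "shared dithering over the lattice cells" can be illustrated with the simplest possible lattice, the integers. The sketch below shows classical subtractive dithered quantization with a dither known to both encoder and decoder; it is a toy under stated assumptions and not the paper's neural compressor, and the shared random seed merely stands in for shared randomness.

```python
import random

def encode(x, dither):
    """Quantize x + dither to the nearest point of the integer lattice Z."""
    return round(x + dither)

def decode(index, dither):
    """Subtract the same dither that the encoder added."""
    return index - dither

rng = random.Random(0)  # seed known to both sides (assumption)
source = [0.2, 1.7, -3.4, 2.5]
dithers = [rng.uniform(-0.5, 0.5) for _ in source]

indices = [encode(x, u) for x, u in zip(source, dithers)]
recon = [decode(i, u) for i, u in zip(indices, dithers)]
errors = [r - x for r, x in zip(recon, source)]
```

Only the integer indices need to be transmitted, and every reconstruction error lies within half a lattice cell. With a dither drawn uniformly over the cell, the error is independent of the source, a standard property of subtractive dithering.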

[LG-66] MetaSel: A Test Selection Approach for Fine-tuned DNN Models

Link: https://arxiv.org/abs/2503.17534
Authors: Amin Abbasishahkoo,Mahboubeh Dadkhah,Lionel Briand,Dayi Lin
Categories: Machine Learning (cs.LG); Software Engineering (cs.SE)
Comments:

Abstract:Deep Neural Networks (DNNs) face challenges during deployment due to data distribution shifts. Fine-tuning adapts pre-trained models to new contexts requiring smaller labeled sets. However, testing fine-tuned models under constrained labeling budgets remains a critical challenge. This paper introduces MetaSel, a new approach, tailored for fine-tuned DNN models, to select tests from unlabeled inputs. MetaSel assumes that fine-tuned and pre-trained models share related data distributions and exhibit similar behaviors for many inputs. However, their behaviors diverge within the input subspace where fine-tuning alters decision boundaries, making those inputs more prone to misclassification. Unlike general approaches that rely solely on the DNN model and its input set, MetaSel leverages information from both the fine-tuned and pre-trained models and their behavioral differences to estimate misclassification probability for unlabeled test inputs, enabling more effective test selection. Our extensive empirical evaluation, comparing MetaSel against 10 state-of-the-art approaches and involving 68 fine-tuned models across weak, medium, and strong distribution shifts, demonstrates that MetaSel consistently delivers significant improvements in Test Relative Coverage (TRC) over existing baselines, particularly under highly constrained labeling budgets. MetaSel shows average TRC improvements of 28.46% to 56.18% over the most frequent second-best baselines while maintaining a high TRC median and low variability. Our results confirm MetaSel’s practicality, robustness, and cost-effectiveness for test selection in the context of fine-tuned models.

[LG-67] Geometry adaptive waveformer for cardio-vascular modeling

Link: https://arxiv.org/abs/2503.17505
Authors: Navaneeth N,Souvik Chakraborty
Categories: Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Modeling cardiovascular anatomies poses a significant challenge due to their complex, irregular structures and inherent pathological conditions. Numerical simulations, while accurate, are often computationally expensive, limiting their practicality in clinical settings. Traditional machine learning methods, on the other hand, often struggle with some major hurdles, including high dimensionality of the inputs, inability to effectively work with irregular grids, and preserving the time dependencies of responses in dynamic problems. In response to these challenges, we propose a geometry adaptive waveformer model to predict blood flow dynamics in the cardiovascular system. The framework is primarily composed of three components: a geometry encoder, a geometry decoder, and a waveformer. The encoder transforms input defined on the irregular domain to a regular domain using a graph operator-based network and signed distance functions. The waveformer operates on the transformed field on the irregular grid. Finally, the decoder reverses this process, transforming the output from the regular grid back to the physical space. We evaluate the efficacy of the approach on different sets of cardiovascular data.

[LG-68] Towards Understanding the Benefits of Neural Network Parameterizations in Geophysical Inversions: A Study With Neural Fields

Link: https://arxiv.org/abs/2503.17503
Authors: Anran Xu,Lindsey J. Heagy
Categories: Machine Learning (cs.LG); Geophysics (physics.geo-ph); Machine Learning (stat.ML)
Comments:

Click to view abstract

Abstract:In this work, we employ neural fields, which use neural networks to map a coordinate to the corresponding physical property value at that coordinate, in a test-time learning manner. For a test-time learning method, the weights are learned during the inversion, as compared to traditional approaches which require a network to be trained using a training data set. Results for synthetic examples in seismic tomography and direct current resistivity inversions are shown first. We then perform a singular value decomposition analysis on the Jacobian of the weights of the neural network (SVD analysis) for both cases to explore the effects of neural networks on the recovered model. The results show that the test-time learning approach can eliminate unwanted artifacts in the recovered subsurface physical property model caused by the sensitivity of the survey and physics. Therefore, NFs-Inv improves the inversion results compared to the conventional inversion in some cases such as the recovery of the dip angle or the prediction of the boundaries of the main target. In the SVD analysis, we observe similar patterns in the left-singular vectors as were observed in some diffusion models, trained in a supervised manner, for generative tasks in computer vision. This observation provides evidence that there is an implicit bias, which is inherent in neural network structures, that is useful in supervised learning and test-time learning models. This implicit bias has the potential to be useful for recovering models in geophysical inversions.

[LG-69] OmniLearn: A Framework for Distributed Deep Learning over Heterogeneous Clusters

Link: https://arxiv.org/abs/2503.17469
Authors: Sahil Tyagi,Prateek Sharma
Categories: Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Deep learning systems are optimized for clusters with homogeneous resources. However, heterogeneity is prevalent in computing infrastructure across edge, cloud and HPC. When training neural networks using stochastic gradient descent techniques on heterogeneous resources, performance degrades due to stragglers and stale updates. In this work, we develop an adaptive batch-scaling framework called OmniLearn to mitigate the effects of heterogeneity in distributed training. Our approach is inspired by proportional controllers to balance computation across heterogeneous servers, and works under varying resource availability. By dynamically adjusting worker mini-batches at runtime, OmniLearn reduces training time by 14-85%. We also investigate asynchronous training, where our techniques improve accuracy by up to 6.9%.
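
A proportional controller for batch scaling can be sketched as follows. This is only an illustration of the general idea under stated assumptions: the function `rebalance_batches`, the `gain` parameter, and the simulated worker speeds are all hypothetical and are not taken from OmniLearn.

```python
def rebalance_batches(batch_sizes, step_times, gain=0.5, min_batch=1):
    """One proportional-control step that shifts work toward faster workers."""
    total = sum(batch_sizes)
    # Throughput (samples per second) observed for each worker in the last iteration.
    throughput = [b / t for b, t in zip(batch_sizes, step_times)]
    # Target: split the global batch in proportion to observed throughput.
    targets = [total * tp / sum(throughput) for tp in throughput]
    # Proportional update: move each batch a fraction `gain` of the way to its target.
    new = [max(min_batch, round(b + gain * (tg - b)))
           for b, tg in zip(batch_sizes, targets)]
    # Absorb rounding drift so the global batch size stays constant.
    new[new.index(max(new))] += total - sum(new)
    return new

# Simulated heterogeneous cluster (assumed speeds in samples per second).
speeds = [100.0, 50.0, 25.0]
batches = [64, 64, 64]
for _ in range(20):
    times = [b / s for b, s in zip(batches, speeds)]
    batches = rebalance_batches(batches, times)
final_times = [b / s for b, s in zip(batches, speeds)]
```

Starting from equal mini-batches, the slowest worker initially takes four times as long as the fastest; after a few controller steps the per-worker step times are nearly equal, which is the straggler effect the abstract describes being mitigated.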

[LG-70] Collaborative Value Function Estimation Under Model Mismatch: A Federated Temporal Difference Analysis

Link: https://arxiv.org/abs/2503.17454
Authors: Ali Beikmohammadi,Sarit Khirirat,Peter Richtárik,Sindri Magnússon
Categories: Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Federated reinforcement learning (FedRL) enables collaborative learning while preserving data privacy by preventing direct data exchange between agents. However, many existing FedRL algorithms assume that all agents operate in identical environments, which is often unrealistic. In real-world applications – such as multi-robot teams, crowdsourced systems, and large-scale sensor networks – each agent may experience slightly different transition dynamics, leading to inherent model mismatches. In this paper, we first establish linear convergence guarantees for single-agent temporal difference learning (TD(0)) in policy evaluation and demonstrate that under a perturbed environment, the agent suffers a systematic bias that prevents accurate estimation of the true value function. This result holds under both i.i.d. and Markovian sampling regimes. We then extend our analysis to the federated TD(0) (FedTD(0)) setting, where multiple agents – each interacting with its own perturbed environment – periodically share value estimates to collaboratively approximate the true value function of a common underlying model. Our theoretical results indicate the impact of model mismatch, network connectivity, and mixing behavior on the convergence of FedTD(0). Empirical experiments corroborate our theoretical gains, highlighting that even moderate levels of information sharing can significantly mitigate environment-specific errors.
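
The federated setting can be sketched in a few lines of tabular code. This is a toy illustration, not the paper's algorithm or analysis: the mismatch model in `perturb`, the three-state reward process, and all hyperparameters are assumptions chosen for the example.

```python
import numpy as np

def perturb(P, eps, rng):
    """Return a row-stochastic matrix close to P (hypothetical mismatch model)."""
    Q = P + eps * rng.uniform(size=P.shape)
    return Q / Q.sum(axis=1, keepdims=True)

def fed_td0(P_list, r, gamma=0.9, alpha=0.05, rounds=200, local_steps=20, seed=0):
    """Tabular TD(0) run by several agents that periodically average their values."""
    rng = np.random.default_rng(seed)
    n = len(r)
    V = np.zeros((len(P_list), n))                  # one value table per agent
    for _ in range(rounds):
        for k, P in enumerate(P_list):
            for _ in range(local_steps):
                s = rng.integers(n)                 # i.i.d. state sampling
                s_next = rng.choice(n, p=P[s])      # agent-specific dynamics
                td_error = r[s] + gamma * V[k, s_next] - V[k, s]
                V[k, s] += alpha * td_error         # local TD(0) update
        V[:] = V.mean(axis=0)                       # periodic sharing of value estimates
    return V[0]

P = np.array([[0.8, 0.1, 0.1],
              [0.2, 0.6, 0.2],
              [0.3, 0.3, 0.4]])                     # common underlying model
r = np.array([1.0, 0.0, 2.0])
rng = np.random.default_rng(1)
P_list = [perturb(P, 0.1, rng) for _ in range(4)]   # four perturbed environments
V_fed = fed_td0(P_list, r)
V_true = np.linalg.solve(np.eye(3) - 0.9 * P, r)    # value function of the common model
```

Comparing `V_fed` with `V_true` gives a rough feel for how much environment-specific bias survives averaging in this toy setup.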

[LG-71] TamedPUMA: safe and stable imitation learning with geometric fabrics

Link: https://arxiv.org/abs/2503.17432
Authors: Saray Bakker,Rodrigo Pérez-Dattari,Cosimo Della Santina,Wendelin Böhmer,Javier Alonso-Mora
Categories: Systems and Control (eess.SY); Machine Learning (cs.LG); Robotics (cs.RO)
Comments: 14 pages (10+4), 1+35 figures, 1 table, preprint version of accepted paper at L4DC 2025

Click to view abstract

Abstract:Using the language of dynamical systems, imitation learning (IL) provides an intuitive and effective way of teaching stable task-space motions to robots with goal convergence. Yet, IL techniques are affected by serious limitations when it comes to ensuring safety and fulfillment of physical constraints. With this work, we solve this challenge via TamedPUMA, an IL algorithm augmented with a recent development in motion generation called geometric fabrics. As both the IL policy and geometric fabrics describe motions as artificial second-order dynamical systems, we propose two variations where IL provides a navigation policy for geometric fabrics. The result is a stable imitation learning strategy within which we can seamlessly blend geometrical constraints like collision avoidance and joint limits. Beyond providing a theoretical analysis, we demonstrate TamedPUMA with simulated and real-world tasks, including a 7-DoF manipulator.

[LG-72] V-Seek: Accelerating LLM Reasoning on Open-hardware Server-class RISC-V Platforms

Link: https://arxiv.org/abs/2503.17422
Authors: Javier J. Poveda Rodrigo,Mohamed Amine Ahmdi,Alessio Burrello,Daniele Jahier Pagliari,Luca Benini
Categories: Machine Learning (cs.LG); Performance (cs.PF)
Comments:

Click to view abstract

Abstract:The recent exponential growth of Large Language Models (LLMs) has relied on GPU-based systems. However, CPUs are emerging as a flexible and lower-cost alternative, especially when targeting inference and reasoning workloads. RISC-V is rapidly gaining traction in this area, given its open and vendor-neutral ISA. However, the RISC-V hardware for LLM workloads and the corresponding software ecosystem are not fully mature and streamlined, given the requirement of domain-specific tuning. This paper aims at filling this gap, focusing on optimizing LLM inference on the Sophon SG2042, the first commercially available many-core RISC-V CPU with vector processing capabilities. On two recent state-of-the-art LLMs optimized for reasoning, DeepSeek R1 Distill Llama 8B and DeepSeek R1 Distill QWEN 14B, we achieve 4.32/2.29 token/s for token generation and 6.54/3.68 token/s for prompt processing, with a speedup of up to 2.9x/3.0x compared to our baseline.

[LG-73] Likelihood Reward Redistribution

Link: https://arxiv.org/abs/2503.17409
Authors: Minheng Xiao,Zhenbang Jiao
Categories: Machine Learning (cs.LG); Robotics (cs.RO)
Comments:

Click to view abstract

Abstract:In many practical reinforcement learning scenarios, feedback is provided only at the end of a long horizon, leading to sparse and delayed rewards. Existing reward redistribution methods typically assume that per-step rewards are independent, thus overlooking interdependencies among state–action pairs. In this paper, we propose a Likelihood Reward Redistribution (LRR) framework that addresses this issue by modeling each per-step reward with a parametric probability distribution whose parameters depend on the state–action pair. By maximizing the likelihood of the observed episodic return via a leave-one-out (LOO) strategy that leverages the entire trajectory, our framework inherently introduces an uncertainty regularization term into the surrogate objective. Moreover, we show that the conventional mean squared error (MSE) loss for reward redistribution emerges as a special case of our likelihood framework when the uncertainty is fixed under the Gaussian distribution. When integrated with an off-policy algorithm such as Soft Actor-Critic, LRR yields dense and informative reward signals, resulting in superior sample efficiency and policy performance on Box-2d and MuJoCo benchmarks.
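
The claim that MSE is a special case follows from the standard form of the Gaussian negative log-likelihood. The lines below are a generic worked equation, not the paper's surrogate objective: the symbols $\mu_\theta$, $\sigma_\theta$ and the target $r$ are assumed notation, and the leave-one-out construction is not reproduced.

```latex
% Gaussian negative log-likelihood of a target r given a state-action pair (s_t, a_t)
-\log p_\theta(r \mid s_t, a_t)
  = \frac{\bigl(r - \mu_\theta(s_t, a_t)\bigr)^2}{2\,\sigma_\theta^2(s_t, a_t)}
  + \log \sigma_\theta(s_t, a_t)
  + \tfrac{1}{2}\log 2\pi

% With the uncertainty fixed, \sigma_\theta \equiv \sigma, the last two terms are constants:
\arg\min_\theta \; -\log p_\theta(r \mid s_t, a_t)
  = \arg\min_\theta \; \bigl(r - \mu_\theta(s_t, a_t)\bigr)^2
```

When $\sigma_\theta$ is learned rather than fixed, the $\log \sigma_\theta$ term penalizes inflated variance while the first term penalizes overconfidence, which is the sense in which a likelihood objective carries an uncertainty regularizer.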

[LG-74] Efficiently Vectorized MCMC on Modern Accelerators

Link: https://arxiv.org/abs/2503.17405
Authors: Hugh Dance,Pierre Glaser,Peter Orbanz,Ryan Adams
Categories: Mathematical Software (cs.MS); Machine Learning (cs.LG); Computation (stat.CO); Machine Learning (stat.ML)
Comments:

Click to view abstract

Abstract:With the advent of automatic vectorization tools (e.g., JAX’s vmap), writing multi-chain MCMC algorithms is often now as simple as invoking those tools on single-chain code. Whilst convenient, for various MCMC algorithms this results in a synchronization problem – loosely speaking, at each iteration all chains running in parallel must wait until the last chain has finished drawing its sample. In this work, we show how to design single-chain MCMC algorithms in a way that avoids synchronization overheads when vectorizing with tools like vmap by using the framework of finite state machines (FSMs). Using a simplified model, we derive an exact theoretical form of the obtainable speed-ups using our approach, and use it to make principled recommendations for optimal algorithm design. We implement several popular MCMC algorithms as FSMs, including Elliptical Slice Sampling, HMC-NUTS, and Delayed Rejection, demonstrating speed-ups of up to an order of magnitude in experiments.
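
The synchronization problem can be made concrete with a small simulation. This is a toy cost model assumed purely for illustration (geometric step counts per sample, 64 chains); it is not the simplified model or the speed-up formula derived in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n_chains, n_samples = 64, 1000

# Assumed toy model: the number of inner steps a chain needs to draw one
# sample (e.g. slice-sampling shrink steps) is geometric with mean 5.
steps = rng.geometric(p=0.2, size=(n_samples, n_chains))

# Lockstep vectorization: for every sample, all chains wait for the slowest one.
lockstep_cost = steps.max(axis=1).sum()

# No per-sample synchronization: each chain proceeds at its own pace and the
# run finishes when the chain with the most total steps is done.
unsynced_cost = steps.sum(axis=0).max()

overhead = lockstep_cost / unsynced_cost
```

The sum of per-sample maxima is never smaller than the maximum of per-chain sums, so `overhead` is at least 1; the gap grows with the number of chains and with the variability of the per-sample step count.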

[LG-75] Enhanced Vascular Flow Simulations in Aortic Aneurysm via Physics-Informed Neural Networks and Deep Operator Networks

Link: https://arxiv.org/abs/2503.17402
Authors: Oscar L. Cruz-González,Valérie Deplano,Badih Ghattas
Categories: Machine Learning (cs.LG); Computation (stat.CO); Machine Learning (stat.ML)
Comments:

Click to view abstract

Abstract:Due to the limited accuracy of 4D Magnetic Resonance Imaging (MRI) in identifying hemodynamics in cardiovascular diseases, the challenges in obtaining patient-specific flow boundary conditions, and the computationally demanding and time-consuming nature of Computational Fluid Dynamics (CFD) simulations, it is crucial to explore new data assimilation algorithms that offer possible alternatives to these limitations. In the present work, we study Physics-Informed Neural Networks (PINNs), Deep Operator Networks (DeepONets), and their Physics-Informed extensions (PI-DeepONets) in predicting vascular flow simulations in the context of a 3D Abdominal Aortic Aneurysm (AAA) idealized model. PINN is a technique that combines deep neural networks with the fundamental principles of physics, incorporating the physics laws, which are given as partial differential equations, directly into loss functions used during the training process. On the other hand, DeepONet is designed to learn nonlinear operators from data and is particularly useful in studying parametric partial differential equations (PDEs), e.g., families of PDEs with different source terms, boundary conditions, or initial conditions. Here, we adapt the approaches to address the particular use case of AAA by integrating the 3D Navier-Stokes equations (NSE) as the physical laws governing fluid dynamics. In addition, we follow best practices to enhance the capabilities of the models by effectively capturing the underlying physics of the problem under study. The advantages and limitations of each approach are highlighted through a series of relevant application cases. We validate our results by comparing them with CFD simulations for benchmark datasets, demonstrating good agreements and emphasizing those cases where improvements in computational efficiency are observed.

[LG-76] Availability of Perfect Decomposition in Statistical Linkage Learning for Unitation-based Function Concatenations

Link: https://arxiv.org/abs/2503.17397
Authors: Michal Prusik,Bartosz Frej,Michal W. Przewozniczek
Categories: Neural and Evolutionary Computing (cs.NE); Machine Learning (cs.LG); Optimization and Control (math.OC)
Comments:

Click to view abstract

Abstract:Statistical Linkage Learning (SLL) is a part of many state-of-the-art optimizers. The purpose of SLL is to discover variable interdependencies. It has been shown that the effectiveness of SLL-using optimizers is highly dependent on the quality of SLL-based problem decomposition. Thus, understanding what kind of problems are hard or easy to decompose by SLL is important for practice. In this work, we analytically estimate the size of a population sufficient for obtaining a perfect decomposition in case of concatenations of certain unitation-based functions. The experimental study confirms the accuracy of the proposed estimate. Finally, using the proposed estimate, we identify those problem types that may be considered hard for SLL-using optimizers.

[LG-77] BPINN-EM-Post: Stochastic Electromigration Damage Analysis in the Post-Void Phase based on Bayesian Physics-Informed Neural Network

Link: https://arxiv.org/abs/2503.17393
Authors: Subed Lamichhane,Haotian Lu,Sheldon X.-D. Tan
Categories: Machine Learning (cs.LG)
Comments: 10 pages, 4 figures, conference style

Click to view abstract

Abstract:In contrast to the assumptions of most existing Electromigration (EM) analysis tools, the evolution of EM-induced stress is inherently non-deterministic, influenced by factors such as input current fluctuations and manufacturing non-idealities. Traditional approaches for estimating stress variations typically involve computationally expensive and inefficient Monte Carlo simulations with industrial solvers, which quantify variations using mean and variance metrics. In this work, we introduce a novel machine learning-based framework, termed BPINN-EM-Post, for efficient stochastic analysis of EM-induced post-voiding aging processes. This new approach integrates closed-form analytical solutions with a Bayesian Physics-Informed Neural Network (BPINN) framework to accelerate the analysis for the first time. The closed-form solutions enforce physical laws at the individual wire segment level, while the BPINN ensures that physics constraints at inter-segment junctions are satisfied and stochastic behaviors are accurately modeled. By reducing the number of variables in the loss functions through the use of analytical solutions, our method significantly improves training efficiency without accuracy loss and naturally incorporates variational effects. Additionally, the analytical solutions effectively address the challenge of incorporating initial stress distributions in interconnect structures during post-void stress calculations. Numerical results demonstrate that BPINN-EM-Post achieves over 240x speedup compared to Monte Carlo simulations using the FEM-based COMSOL solver and more than 65x speedup compared to Monte Carlo simulations using the FDM-based EMSpice method.

[LG-78] A new graph-based surrogate model for rapid prediction of crashworthiness performance of vehicle panel components

Link: https://arxiv.org/abs/2503.17386
Authors: Haoran Li,Yingxue Zhao,Haosu Zhou,Tobias Pfaff,Nan Li
Categories: Systems and Control (eess.SY); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:During the design cycle of safety critical vehicle components such as B-pillars, crashworthiness performance is a key metric for passenger protection assessment in vehicle accidents. Traditional finite element simulations for crashworthiness analysis involve complex modelling, leading to an increased computational demand. Although a few machine learning-based surrogate models have been developed for rapid predictions for crashworthiness analysis, they exhibit limitations in detailed representation of complex 3D components. Graph Neural Networks (GNNs) have emerged as a promising solution for processing data with complex structures. However, existing GNN models often lack sufficient accuracy and computational efficiency to meet industrial demands. This paper proposes Recurrent Graph U-Net (ReGUNet), a new graph-based surrogate model for crashworthiness analysis of vehicle panel components. ReGUNet adopts a U-Net architecture with multiple graph downsampling and upsampling layers, which improves the model’s computational efficiency and accuracy; the introduction of recurrence enhances the accuracy and stability of temporal predictions over multiple time steps. ReGUNet is evaluated through a case study of side crash testing of a B-pillar component with variation in geometric design. The trained model demonstrates great accuracy in predicting the dynamic behaviour of previously unseen component designs within a relative error of 0.74% for the maximum B-pillar intrusion. Compared to the baseline models, ReGUNet can reduce the averaged mean prediction error of the component’s deformation by more than 51% with significant improvement in computational efficiency. Provided enhanced accuracy and efficiency, ReGUNet shows greater potential in accurate predictions of large and complex graphs compared to existing models.

[LG-79] Uncertainty Quantification for Data-Driven Machine Learning Models in Nuclear Engineering Applications: Where We Are and What Do We Need?

Link: https://arxiv.org/abs/2503.17385
Authors: Xu Wu,Lesego E. Moloko,Pavel M. Bokov,Gregory K. Delipei,Joshua Kaizer,Kostadin N. Ivanov
Categories: Systems and Control (eess.SY); Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: 34 pages, 13 figures, invited journal article from BEPU-2024 conference

Click to view abstract

Abstract:Machine learning (ML) has been leveraged to tackle a diverse range of tasks in almost all branches of nuclear engineering. Many of the successes in ML applications can be attributed to the recent performance breakthroughs in deep learning, the growing availability of computational power, data, and easy-to-use ML libraries. However, these empirical successes have often outpaced our formal understanding of the ML algorithms. An important but under-rated area is uncertainty quantification (UQ) of ML. ML-based models are subject to approximation uncertainty when they are used to make predictions, due to sources including but not limited to, data noise, data coverage, extrapolation, imperfect model architecture and the stochastic training process. The goal of this paper is to clearly explain and illustrate the importance of UQ of ML. We will elucidate the differences in the basic concepts of UQ of physics-based models and data-driven ML models. Various sources of uncertainties in physical modeling and data-driven modeling will be discussed, demonstrated, and compared. We will also present and demonstrate a few techniques to quantify the ML prediction uncertainties. Finally, we will discuss the need for building a verification, validation and UQ framework to establish ML credibility.

[LG-80] On the Optimality of Single-label and Multi-label Neural Network Decoders

Link: https://arxiv.org/abs/2503.18758
Authors: Yunus Can Gültekin,Péter Scheepers,Yuncheng Yuan,Federico Corradi,Alex Alvarado
Categories: Signal Processing (eess.SP); Machine Learning (cs.LG)
Comments: 8 pages, 8 figures

Click to view abstract

Abstract:We investigate the design of two neural network (NN) architectures recently proposed as decoders for forward error correction: the so-called single-label NN (SLNN) and multi-label NN (MLNN) decoders. These decoders have been reported to achieve near-optimal codeword- and bit-wise performance, respectively. Results in the literature show near-optimality for a variety of short codes. In this paper, we analytically prove that certain SLNN and MLNN architectures can, in fact, always realize optimal decoding, regardless of the code. These optimal architectures and their binary weights are shown to be defined by the codebook, i.e., no training or network optimization is required. Our proposed architectures are in fact not NNs, but a different way of implementing the maximum likelihood decoding rule. Optimal performance is numerically demonstrated for Hamming (7,4), Polar (16,8), and BCH (31,21) codes. The results show that our optimal architectures are less complex than the SLNN and MLNN architectures proposed in the literature, which in fact only achieve near-optimal performance. Extension to longer codes is still hindered by the curse of dimensionality. Therefore, even though SLNN and MLNN can perform maximum likelihood decoding, such architectures cannot be used for medium and long codes.
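
The idea that a codebook alone defines an optimal decoder can be sketched with a brute-force maximum likelihood decoder for a Hamming (7,4) code. This is a textbook correlation decoder written as one layer of ±1 weights followed by an argmax; the generator matrix `G` is one common systematic choice, and the mapping to the paper's SLNN architecture is an assumption made for illustration.

```python
import itertools
import numpy as np

# Generator matrix of a systematic Hamming (7,4) code (one common choice).
G = np.array([[1, 0, 0, 0, 1, 1, 0],
              [0, 1, 0, 0, 1, 0, 1],
              [0, 0, 1, 0, 0, 1, 1],
              [0, 0, 0, 1, 1, 1, 1]])

messages = np.array(list(itertools.product([0, 1], repeat=4)))
codebook = messages @ G % 2      # all 16 codewords, one per row
weights = 1 - 2 * codebook       # BPSK mapping: bit 0 -> +1, bit 1 -> -1

def ml_decode(y):
    """Codeword-wise ML decoding for BPSK over AWGN: pick the max-correlation codeword."""
    # All codewords have equal energy, so minimizing Euclidean distance is
    # equivalent to maximizing the correlation between y and each codeword.
    return codebook[np.argmax(weights @ y)]

# Usage: flip the sign of one received symbol and decode.
received = weights[5].astype(float)
received[2] *= -1
decoded = ml_decode(received)
```

The weight matrix has one row per codeword, so its size grows as 2^k with the message length k, which is the curse of dimensionality the abstract refers to.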

[LG-81] Differentially Private Joint Independence Test

Link: https://arxiv.org/abs/2503.18721
Authors: Xingwei Liu,Yuexin Chen,Wangli Xu
Categories: Statistics Theory (math.ST); Cryptography and Security (cs.CR); Machine Learning (cs.LG); Methodology (stat.ME); Machine Learning (stat.ML)
Comments: 51 pages

Click to view abstract

Abstract:Identification of joint dependence among more than two random vectors plays an important role in many statistical applications, where the data may contain sensitive or confidential information. In this paper, we consider the d-variable Hilbert-Schmidt independence criterion (dHSIC) in the context of differential privacy. Given that the limiting distribution of the empirical estimate of dHSIC is a complicated Gaussian chaos, constructing tests in the non-private regime is typically based on permutation and bootstrap. To detect joint dependence under privacy constraints, we propose a dHSIC-based testing procedure by employing a differentially private permutation methodology. Our method enjoys a privacy guarantee, valid level and pointwise consistency, while the bootstrap counterpart suffers from inconsistent power. We further investigate the uniform power of the proposed test in dHSIC metric and L_2 metric, indicating that the proposed test attains the minimax optimal power across different privacy regimes. As a byproduct, our results also contain the pointwise and uniform power of the non-private permutation dHSIC, addressing an open question left in Pfister et al. (2018).

[LG-82] Scaling Laws for Emulation of Stellar Spectra

Link: https://arxiv.org/abs/2503.18617
Authors: Tomasz Różański,Yuan-Sen Ting
Categories: Instrumentation and Methods for Astrophysics (astro-ph.IM); Solar and Stellar Astrophysics (astro-ph.SR); Machine Learning (cs.LG)
Comments: 25 pages, 11 figures, submitted to OJA

Click to view abstract

Abstract:Neural network-based emulators for the inference of stellar parameters and elemental abundances represent an increasingly popular methodology in modern spectroscopic surveys. However, these approaches are often constrained by their emulation precision and domain transfer capabilities. Greater generalizability has previously been achieved only with significantly larger model architectures, as demonstrated by Transformer-based models in natural language processing. This observation aligns with neural scaling laws, where model performance predictably improves with increased model size, computational resources allocated to model training, and training data volume. In this study, we demonstrate that these scaling laws also apply to Transformer-based spectral emulators in astronomy. Building upon our previous work with TransformerPayne and incorporating Maximum Update Parametrization techniques from natural language models, we provide training guidelines for scaling models to achieve optimal performance. Our results show that within the explored parameter space, clear scaling relationships emerge. These findings suggest that optimal computational resource allocation requires balanced scaling. Specifically, given a tenfold increase in training compute, achieving an optimal seven-fold reduction in mean squared error necessitates an approximately 2.5-fold increase in dataset size and a 3.8-fold increase in model size. This study establishes a foundation for developing spectral foundational models with enhanced domain transfer capabilities.
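
The quoted factors can be turned into power-law exponents with a few lines of arithmetic. This is only a reading of the numbers in the abstract; the assumption that training compute scales roughly as model size times dataset size is a common approximation and is not stated in the source.

```python
import math

# Factors quoted in the abstract for a tenfold increase in training compute.
compute_factor = 10.0
data_factor = 2.5       # increase in dataset size
model_factor = 3.8      # increase in model size
mse_reduction = 7.0     # reduction in mean squared error

# Exponents a in "quantity ~ compute^a", i.e. log(factor) / log(compute_factor).
data_exp = math.log(data_factor) / math.log(compute_factor)     # about 0.40
model_exp = math.log(model_factor) / math.log(compute_factor)   # about 0.58
mse_exp = -math.log(mse_reduction) / math.log(compute_factor)   # about -0.85

# Consistency check under the assumed rule compute ~ model size x dataset size.
implied_compute = data_factor * model_factor                    # 9.5, close to 10
```

The two allocation exponents sum to roughly 0.98, so the quoted split is consistent with spending the extra compute almost entirely on a jointly larger model and dataset, with slightly more weight on model size.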

[LG-83] AutoBayes: A Compositional Framework for Generalized Variational Inference

Link: https://arxiv.org/abs/2503.18608
Authors: Toby St Clere Smithe,Marco Perin
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST)
Comments: 15 pages

Click to view abstract

Abstract:We introduce a new compositional framework for generalized variational inference, clarifying the different parts of a model, how they interact, and how they compose. We explain that both exact Bayesian inference and the loss functions typical of variational inference (such as variational free energy and its generalizations) satisfy chain rules akin to that of reverse-mode automatic differentiation, and we advocate for exploiting this to build and optimize models accordingly. To this end, we construct a series of compositional tools: for building models; for constructing their inversions; for attaching local loss functions; and for exposing parameters. Finally, we explain how the resulting parameterized statistical games may be optimized locally, too. We illustrate our framework with a number of classic examples, pointing to new areas of extensibility that are revealed.

[LG-84] Parametric Dynamic Mode Decomposition with multi-linear interpolation for prediction of thermal fields of Al2O3-water nanofluid flows at unseen parameters

Link: https://arxiv.org/abs/2503.18571
Authors: Abhijith M S,Sandra S
Categories: Fluid Dynamics (physics.flu-dyn); Machine Learning (cs.LG)
Comments: 19 pages, 19 figures

Click to view abstract

Abstract:The study proposes a data-driven model which combines the Dynamic Mode Decomposition with multi-linear interpolation to predict the thermal fields of nanofluid flows at unseen Reynolds numbers (Re) and particle volume concentrations (\epsilon). The flow, considered for the study, is laminar and incompressible. The study employs an in-house Fortran-based solver to predict the thermal fields of Al2O3-water nanofluid flow through a two-dimensional rectangular channel, with the bottom wall subjected to a uniform heat flux. The performance of two models operating in one- and two-dimensional parametric spaces are investigated. Initially, a DMD with linear interpolation (DMD-LI) based solver is used for prediction of temperature of the nanofluid at any Re 100. The DMD-LI based model, predicts temperature fields with a maximum percentage difference of just 0.0273%, in comparison with the CFD-based solver at Re = 960, and \epsilon = 1.0%. The corresponding difference in the average Nusselt numbers is only 0.39%. Following that a DMD with bi-linear interpolation (DMD-BLI) based solver is used for prediction of temperature of the nanofluid at any Re 100 and \epsilon 0.5%. The performance of two different ways of stacking the data are also examined. When compared to the CFD-based model, the DMD-BLI-based model predicts the temperature fields with a maximum percentage difference of 0.21%, at Re = 800 and \epsilon = 1.35%. And the corresponding percentage difference in the average Nusselt number prediction is only 6.08%. All the results are reported in detail. Alongside the important conclusions, the future scope of the study is also listed.

[LG-85] Learning a Class of Mixed Linear Regressions: Global Convergence under General Data Conditions

Link: https://arxiv.org/abs/2503.18500
Authors: Yujing Liu,Zhixin Liu,Lei Guo
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Systems and Control (eess.SY)
Comments:

Click to view abstract

Abstract:Mixed linear regression (MLR) has attracted increasing attention because of its great theoretical and practical importance in capturing nonlinear relationships by utilizing a mixture of linear regression sub-models. Although considerable efforts have been devoted to the learning problem of such systems, i.e., estimating data labels and identifying model parameters, most existing investigations employ the offline algorithm, impose the strict independent and identically distributed (i.i.d.) or persistent excitation (PE) conditions on the regressor data, and provide local convergence results only. In this paper, we investigate the recursive estimation and data clustering problems for a class of stochastic MLRs with two components. To address this inherently nonconvex optimization problem, we propose a novel two-step recursive identification algorithm to estimate the true parameters, where the direction vector and the scaling coefficient of the unknown parameters are estimated by the least squares and the expectation-maximization (EM) principles, respectively. Under a general data condition, which is much weaker than the traditional i.i.d. and PE conditions, we establish the global convergence and the convergence rate of the proposed identification algorithm for the first time. Furthermore, we prove that, without any excitation condition on the regressor data, the data clustering performance including the cumulative mis-classification error and the within-cluster error can be optimal asymptotically. Finally, we provide a numerical example to illustrate the performance of the proposed learning algorithm.

[LG-86] A New Stochastic Approximation Method for Gradient-based Simulated Parameter Estimation

Link: https://arxiv.org/abs/2503.18319
Authors: Zehao Li,Yijie Peng
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:This paper tackles the challenge of parameter calibration in stochastic models, particularly in scenarios where the likelihood function is unavailable in an analytical form. We introduce a gradient-based simulated parameter estimation framework, which employs a multi-time scale stochastic approximation algorithm. This approach effectively addresses the ratio bias that arises in both maximum likelihood estimation and posterior density estimation problems. The proposed algorithm enhances estimation accuracy and significantly reduces computational costs, as demonstrated through extensive numerical experiments. Our work extends the GSPE framework to handle complex models such as hidden Markov models and variational inference-based problems, offering a robust solution for parameter estimation in challenging stochastic environments.

[LG-87] Efficient Transformed Gaussian Process State-Space Models for Non-Stationary High-Dimensional Dynamical Systems

Link: https://arxiv.org/abs/2503.18309
Authors: Zhidi Lin,Ying Li,Feng Yin,Juan Maroñas,Alexandre H. Thiéry
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Signal Processing (eess.SP)
Comments: 13 pages, 6 figures

Click to view abstract

Abstract:Gaussian process state-space models (GPSSMs) have emerged as a powerful framework for modeling dynamical systems, offering interpretable uncertainty quantification and inherent regularization. However, existing GPSSMs face significant challenges in handling high-dimensional, non-stationary systems due to computational inefficiencies, limited scalability, and restrictive stationarity assumptions. In this paper, we propose an efficient transformed Gaussian process state-space model (ETGPSSM) to address these limitations. Our approach leverages a single shared Gaussian process (GP) combined with normalizing flows and Bayesian neural networks, enabling efficient modeling of complex, high-dimensional state transitions while preserving scalability. To address the lack of closed-form expressions for the implicit process in the transformed GP, we follow its generative process and introduce an efficient variational inference algorithm, aided by the ensemble Kalman filter (EnKF), to enable computationally tractable learning and inference. Extensive empirical evaluations on synthetic and real-world datasets demonstrate the superior performance of our ETGPSSM in system dynamics learning, high-dimensional state estimation, and time-series forecasting, outperforming existing GPSSMs and neural network-based methods in both accuracy and computational efficiency.

[LG-88] Quantile-Based Randomized Kaczmarz for Corrupted Tensor Linear Systems

Link: https://arxiv.org/abs/2503.18190
Authors: Alejandra Castillo, Jamie Haddock, Iryna Hartsock, Paulina Hoyos, Lara Kassab, Alona Kryshchenko, Kamila Larripa, Deanna Needell, Shambhavi Suryanarayanan, Karamatou Yacoubou Djima
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Numerical Analysis (math.NA)
Comments:

Abstract:The reconstruction of tensor-valued signals from corrupted measurements, known as tensor regression, has become essential in many multi-modal applications such as hyperspectral image reconstruction and medical imaging. In this work, we address the tensor linear system problem $\mathcal{A}\mathcal{X}=\mathcal{B}$, where $\mathcal{A}$ is a measurement operator, $\mathcal{X}$ is the unknown tensor-valued signal, and $\mathcal{B}$ contains the measurements, possibly corrupted by arbitrary errors. Such corruption is common in large-scale tensor data, where transmission, sensory, or storage errors are rare per instance but likely over the entire dataset and may be arbitrarily large in magnitude. We extend the Kaczmarz method, a popular iterative algorithm for solving large linear systems, to develop a Quantile Tensor Randomized Kaczmarz (QTRK) method robust to large, sparse corruptions in the observations $\mathcal{B}$. This approach combines the tensor Kaczmarz framework with quantile-based statistics, allowing it to mitigate adversarial corruptions and improve convergence reliability. We also propose and discuss the Masked Quantile Randomized Kaczmarz (mQTRK) variant, which selectively applies partial updates to handle corruptions further. We present convergence guarantees, discuss the advantages and disadvantages of our approaches, and demonstrate the effectiveness of our methods through experiments, including an application for video deblurring.
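
To make the quantile idea concrete, here is a minimal sketch of a quantile-based randomized Kaczmarz iteration for the ordinary matrix-vector system A x = b. This is a simplified illustration, not the paper's tensor QTRK or mQTRK method. The function name `quantile_rk`, the quantile level `q`, and the iteration count are assumptions chosen for the example. Each step samples a row, and projects onto its hyperplane only if that row's residual does not exceed the q-quantile of all current residuals, so rows with very large (likely corrupted) residuals are skipped.

```python
import numpy as np

def quantile_rk(A, b, q=0.7, iters=5000, seed=0):
    """Quantile randomized Kaczmarz for A x = b with sparse, large corruptions in b.

    Illustrative matrix-vector version only; all parameter defaults are assumptions.
    """
    rng = np.random.default_rng(seed)
    A = np.asarray(A, dtype=float)
    b = np.asarray(b, dtype=float)

    # Normalize rows so that each residual is a distance to a hyperplane.
    norms = np.linalg.norm(A, axis=1)
    A = A / norms[:, None]
    b = b / norms

    m, n = A.shape
    x = np.zeros(n)
    for _ in range(iters):
        residuals = np.abs(A @ x - b)
        threshold = np.quantile(residuals, q)  # q-quantile of all residuals
        i = rng.integers(m)                    # uniformly sampled row
        if residuals[i] <= threshold:          # skip rows that look corrupted
            x = x - (A[i] @ x - b[i]) * A[i]   # project onto row i's hyperplane
    return x
```

Computing the quantile over all rows at every step is the simplest choice; a subsample of rows could be used instead to reduce cost, at the price of a noisier threshold.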

[LG-89] Informer in Algorithmic Investment Strategies on High Frequency Bitcoin Data

Link: https://arxiv.org/abs/2503.18096
Authors: Filip Stefaniuk, Robert Ślepaczuk
Categories: Trading and Market Microstructure (q-fin.TR); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE); Portfolio Management (q-fin.PM)
Comments: 41 pages, 17 figures, 19 tables

Abstract:The article investigates the use of the Informer architecture for building automated trading strategies on high-frequency Bitcoin data. Three strategies using the Informer model with different loss functions, namely Root Mean Squared Error (RMSE), Generalized Mean Absolute Directional Loss (GMADL), and Quantile loss, are proposed and evaluated against the Buy and Hold benchmark and two benchmark strategies based on technical indicators. The evaluation is conducted using data of various frequencies (5-minute, 15-minute, and 30-minute intervals) over 6 different periods. Although the Informer-based model with Quantile loss did not outperform the benchmark, the two other models achieved better results. The performance of the model using RMSE loss worsens with higher-frequency data, while the model that uses the novel GMADL loss function benefits from higher-frequency data; when trained on 5-minute intervals, it beat all the other strategies in most of the testing periods. The primary contribution of this study is the application and assessment of the RMSE, GMADL, and Quantile loss functions with the Informer model to forecast future returns, subsequently using these forecasts to develop automated trading strategies. The research provides evidence that employing an Informer model trained with the GMADL loss function can result in superior trading outcomes compared to the buy-and-hold approach.
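
For readers unfamiliar with two of the loss functions named above, the sketch below shows the standard textbook definitions of RMSE and of the quantile (pinball) loss. It is not taken from the paper, and GMADL is deliberately omitted because the abstract does not define it. The function names are assumptions for illustration.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between targets and predictions."""
    diff = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean(diff ** 2)))

def pinball_loss(y_true, y_pred, q):
    """Quantile (pinball) loss at level q in (0, 1).

    Under-predictions are weighted by q and over-predictions by (1 - q),
    so minimizing it pushes y_pred toward the q-th conditional quantile.
    """
    diff = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.mean(np.maximum(q * diff, (q - 1.0) * diff)))
```

At q = 0.5 the pinball loss reduces to half the mean absolute error, which is why the median is its minimizer in that case.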

[LG-90] Regularization of ML models for Earth systems by using longer model timesteps

Link: https://arxiv.org/abs/2503.18023
Authors: Raghul Parthipan, Mohit Anand, Hannah M Christensen, Frederic Vitart, Damon J Wischik, Jakob Zscheischler
Categories: Chaotic Dynamics (nlin.CD); Machine Learning (cs.LG)
Comments:

Abstract:Regularization is a technique to improve generalization of machine learning (ML) models. A common form of regularization in the ML literature is to train on data where similar inputs map to different outputs. This improves generalization by preventing ML models from becoming overconfident in their predictions. This paper shows how using longer timesteps when modelling chaotic Earth systems naturally leads to more of this regularization. We show this in two domains. We explain how using longer model timesteps can improve results and demonstrate that increased regularization is one of the causes. We explain why longer model timesteps lead to improved regularization in these systems and present a procedure to pick the model timestep. We also carry out a benchmarking exercise on ORAS5 ocean reanalysis data to show that a longer model timestep (28 days) than is typically used gives realistic simulations. We suggest that there will be many opportunities to use this type of regularization in Earth system problems because the Earth system is chaotic and the regularization is so easy to implement.

[LG-91] Identifying Ising and percolation phase transitions based on KAN method

Link: https://arxiv.org/abs/2503.17996
Authors: Dian Xu, Shanshan Wang, Wei Li, Weibing Deng, Feng Gao, Jianmin Shen
Categories: Statistical Mechanics (cond-mat.stat-mech); Machine Learning (cs.LG)
Comments: 10 pages, 9 figures, 1 table

Abstract:Modern machine learning, grounded in the Universal Approximation Theorem, has achieved significant success in the study of phase transitions in both equilibrium and non-equilibrium systems. However, identifying the critical points of percolation models using raw configurations remains a challenging and intriguing problem. This paper proposes the use of the Kolmogorov-Arnold Network, which is based on the Kolmogorov-Arnold Representation Theorem, to input raw configurations into a learning model. The results demonstrate that the KAN can indeed predict the critical points of percolation models. Further observation reveals that, apart from models associated with the density of occupied points, KAN is also capable of effectively achieving phase classification for models where the sole alteration pertains to the orientation of spins, resulting in an order parameter that manifests as an external magnetic flux, such as the Ising model.

[LG-92] Equivariant Machine Learning Interatomic Potentials with Global Charge Redistribution

Link: https://arxiv.org/abs/2503.17949
Authors: Moin Uddin Maruf, Sungmin Kim, Zeeshan Ahmad
Categories: Chemical Physics (physics.chem-ph); Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG)
Comments: 24 pages, 5 figures, 1 table + 12 pages of Supporting Information

Abstract:Machine learning interatomic potentials (MLIPs) provide a computationally efficient alternative to quantum mechanical simulations for predicting material properties. Message-passing graph neural networks, commonly used in these MLIPs, rely on local descriptor-based symmetry functions to model atomic interactions. However, such local descriptor-based approaches struggle with systems exhibiting long-range interactions, charge transfer, and compositional heterogeneity. In this work, we develop a new equivariant MLIP incorporating long-range Coulomb interactions through explicit treatment of electronic degrees of freedom, specifically global charge distribution within the system. This is achieved using a charge equilibration scheme based on predicted atomic electronegativities. We systematically evaluate our model across a range of benchmark periodic and non-periodic datasets, demonstrating that it outperforms both short-range equivariant and long-range invariant MLIPs in energy and force predictions. Our approach enables more accurate and efficient simulations of systems with long-range interactions and charge heterogeneity, expanding the applicability of MLIPs in computational materials science.
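
The charge equilibration step mentioned above can be illustrated with the classical textbook formulation: minimize a quadratic energy in the atomic charges subject to a fixed total charge. The sketch below is an assumption-laden toy version and not the paper's model: it uses bare 1/r Coulomb interactions in atomic units, a non-periodic system, and hand-supplied electronegativities `chi` and hardnesses `hardness` (in the paper the electronegativities are predicted by the network). The constrained minimum is obtained by solving a linear system with a Lagrange multiplier.

```python
import numpy as np

def equilibrate_charges(positions, chi, hardness, total_charge=0.0):
    """Minimize E(q) = sum_i chi_i q_i + 1/2 sum_i J_i q_i^2 + 1/2 sum_{i!=j} q_i q_j / r_ij
    subject to sum_i q_i = total_charge.

    Toy charge-equilibration sketch (bare Coulomb, no periodicity, atomic units).
    """
    positions = np.asarray(positions, dtype=float)
    chi = np.asarray(chi, dtype=float)
    n = len(chi)

    # Pairwise distances and the interaction matrix A (hardness on the diagonal).
    dist = np.linalg.norm(positions[:, None, :] - positions[None, :, :], axis=-1)
    A = np.zeros((n, n))
    off_diag = ~np.eye(n, dtype=bool)
    A[off_diag] = 1.0 / dist[off_diag]
    A[np.diag_indices(n)] = hardness

    # KKT system: [A 1; 1^T 0] [q; lambda] = [-chi; total_charge].
    K = np.zeros((n + 1, n + 1))
    K[:n, :n] = A
    K[:n, n] = 1.0
    K[n, :n] = 1.0
    rhs = np.concatenate([-chi, [total_charge]])
    return np.linalg.solve(K, rhs)[:n]
```

Because the charges come from a global linear solve, a change in one atom's electronegativity affects every charge in the system, which is what lets this kind of scheme capture long-range charge redistribution.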

[LG-93] Predicting performance-related properties of refrigerant based on tailored small-molecule functional group contribution

Link: https://arxiv.org/abs/2503.17919
Authors: Peilin Cao, Ying Geng, Nan Feng, Xiang Zhang, Zhiwen Qi, Zhen Song, Rafiqul Gani
Categories: Chemical Physics (physics.chem-ph); Machine Learning (cs.LG)
Comments:

Abstract:As current group contribution (GC) methods are mostly proposed for a wide size-range of molecules, applying them to property prediction of small refrigerant molecules could lead to unacceptable errors. In this sense, for the design of novel refrigerants and refrigeration systems, tailoring GC-based models specifically fitted to refrigerant molecules is of great interest. In this work, databases of potential refrigerant molecules are first collected, focusing on five key properties related to the operational efficiency of refrigeration systems, namely normal boiling point, critical temperature, critical pressure, enthalpy of vaporization, and acentric factor. Based on tailored small-molecule groups, the GC method is combined with machine learning (ML) to model these performance-related properties. Following the development of GC-ML models, their performance is analyzed to highlight the potential group-to-property contributions. Additionally, the refrigerant property databases are extended internally and externally, based on which examples are presented to highlight the significance of the developed models.

[LG-94] Accelerating and enhancing thermodynamic simulations of electrochemical interfaces

Link: https://arxiv.org/abs/2503.17870
Authors: Xiaochen Du, Mengren Liu, Jiayu Peng, Hoje Chun, Alexander Hoffman, Bilge Yildiz, Lin Li, Martin Z. Bazant, Rafael Gómez-Bombarelli
Categories: Materials Science (cond-mat.mtrl-sci); Statistical Mechanics (cond-mat.stat-mech); Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG)
Comments: 19 pages main text, 5 figures, supplementary information (SI) in ancillary files

Abstract:Electrochemical interfaces are crucial in catalysis, energy storage, and corrosion, where their stability and reactivity depend on complex interactions between the electrode, adsorbates, and electrolyte. Predicting stable surface structures remains challenging, as traditional surface Pourbaix diagrams tend to either rely on expert knowledge or costly ab initio sampling, and neglect thermodynamic equilibration with the environment. Machine learning (ML) potentials can accelerate static modeling but often overlook dynamic surface transformations. Here, we extend the Virtual Surface Site Relaxation-Monte Carlo (VSSR-MC) method to autonomously sample surface reconstructions modeled under aqueous electrochemical conditions. Through fine-tuning foundational ML force fields, we accurately and efficiently predict surface energetics, recovering known Pt(111) phases and revealing new LaMnO$_3$(001) surface reconstructions. By explicitly accounting for bulk-electrolyte equilibria, our framework enhances electrochemical stability predictions, offering a scalable approach to understanding and designing materials for electrochemical applications.

[LG-95] Understanding Inverse Reinforcement Learning under Overparameterization: Non-Asymptotic Analysis and Global Optimality

Link: https://arxiv.org/abs/2503.17865
Authors: Ruijia Zhang, Siliang Zeng, Chenliang Li, Alfredo Garcia, Mingyi Hong
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments:

Abstract:The goal of the Inverse reinforcement learning (IRL) task is to identify the underlying reward function and the corresponding optimal policy from a set of expert demonstrations. While most IRL algorithms’ theoretical guarantees rely on a linear reward structure, we aim to extend the theoretical understanding of IRL to scenarios where the reward function is parameterized by neural networks. Meanwhile, conventional IRL algorithms usually adopt a nested structure, leading to computational inefficiency, especially in high-dimensional settings. To address this problem, we propose the first two-timescale single-loop IRL algorithm under neural network parameterized reward and provide a non-asymptotic convergence analysis under overparameterization. Although prior optimality results for linear rewards do not apply, we show that our algorithm can identify the globally optimal reward and policy under certain neural network structures. This is the first IRL algorithm with a non-asymptotic convergence guarantee that provably achieves global optimality in neural network settings.

[LG-96] Graphical Transformation Models

Link: https://arxiv.org/abs/2503.17845
Authors: Matthias Herp, Johannes Brachem, Michael Altenbuchinger, Thomas Kneib
Categories: Methodology (stat.ME); Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: 36 pages, 10 figures, presented at the DAGStat 2025 in Berlin

Abstract:Graphical Transformation Models (GTMs) are introduced as a novel approach to effectively model multivariate data with intricate marginals and complex dependency structures non-parametrically, while maintaining interpretability through the identification of varying conditional independencies. GTMs extend multivariate transformation models by replacing the Gaussian copula with a custom-designed multivariate transformation, offering two major advantages. Firstly, GTMs can capture more complex interdependencies using penalized splines, which also provide an efficient regularization scheme. Secondly, we demonstrate how to approximately regularize GTMs using a lasso penalty towards pairwise conditional independencies, akin to Gaussian graphical models. The model’s robustness and effectiveness are validated through simulations, showcasing its ability to accurately learn parametric vine copulas and identify conditional independencies. Additionally, the model is applied to a benchmark astrophysics dataset, where the GTM demonstrates favorable performance compared to non-parametric vine copulas in learning complex multivariate distributions.

[LG-97] Poisson-Process Topic Model for Integrating Knowledge from Pre-trained Language Models

Link: https://arxiv.org/abs/2503.17809
Authors: Morgane Austern, Yuanchuan Guo, Zheng Tracy Ke, Tianle Liu
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST)
Comments: 35 pages, 9 figures, 3 tables

Abstract:Topic modeling is traditionally applied to word counts without accounting for the context in which words appear. Recent advancements in large language models (LLMs) offer contextualized word embeddings, which capture deeper meaning and relationships between words. We aim to leverage such embeddings to improve topic modeling. We use a pre-trained LLM to convert each document into a sequence of word embeddings. This sequence is then modeled as a Poisson point process, with its intensity measure expressed as a convex combination of K base measures, each corresponding to a topic. To estimate these topics, we propose a flexible algorithm that integrates traditional topic modeling methods, enhanced by net-rounding applied before and kernel smoothing applied after. One advantage of this framework is that it treats the LLM as a black box, requiring no fine-tuning of its parameters. Another advantage is its ability to seamlessly integrate any traditional topic modeling approach as a plug-in module, without the need for modifications. Assuming each topic is a $\beta$-Hölder smooth intensity measure on the embedded space, we establish the rate of convergence of our method. We also provide a minimax lower bound and show that the rate of our method matches the lower bound when $\beta \leq 1$. Additionally, we apply our method to several datasets, providing evidence that it offers an advantage over traditional topic modeling approaches.

[LG-98] Benchmark Dataset for Pore-Scale CO2-Water Interaction

Link: https://arxiv.org/abs/2503.17592
Authors: Alhasan Abdellatif, Hannah P. Menke, Julien Maes, Ahmed H. Elsheikh, Florian Doster
Categories: Chemical Physics (physics.chem-ph); Machine Learning (cs.LG); Computational Physics (physics.comp-ph)
Comments:

Abstract:Accurately capturing the complex interaction between CO2 and water in porous media at the pore scale is essential for various geoscience applications, including carbon capture and storage (CCS). We introduce a comprehensive dataset generated from high-fidelity numerical simulations to capture the intricate interaction between CO2 and water at the pore scale. The dataset consists of 624 2D samples, each of size 512×512 with a resolution of 35 μm, covering 100 time steps under a constant CO2 injection rate. It includes various levels of heterogeneity, represented by different grain sizes with random variation in spacing, offering a robust testbed for developing predictive models. This dataset provides high-resolution temporal and spatial information crucial for benchmarking machine learning models.

[LG-99] Time-optimal neural feedback control of nilpotent systems as a binary classification problem

Link: https://arxiv.org/abs/2503.17581
Authors: Sara Bicego, Samuel Gue, Dante Kalise, Nelly Villamizar
Categories: Optimization and Control (math.OC); Machine Learning (cs.LG)
Comments:

Abstract:A computational method for the synthesis of time-optimal feedback control laws for linear nilpotent systems is proposed. The method is based on the use of the bang-bang theorem, which leads to a characterization of the time-optimal trajectory as a parameter-dependent polynomial system for the control switching sequence. A deflated Newton’s method is then applied to exhaust all the real roots of the polynomial system. The root-finding procedure is informed by the Hermite quadratic form, which provides a sharp estimate on the number of real roots to be found. In the second part of the paper, the polynomial systems are sampled and solved to generate a synthetic dataset for the construction of a time-optimal deep neural network – interpreted as a binary classifier – via supervised learning. Numerical tests in integrators of increasing dimension assess the accuracy, robustness, and real-time-control capabilities of the approximate control law.

[LG-100] Communities in the Kuramoto Model: Dynamics and Detection via Path Signatures

Link: https://arxiv.org/abs/2503.17546
Authors: Tâm Johan Nguyên, Darrick Lee, Bernadette Jana Stolz
Categories: Machine Learning (stat.ML); Disordered Systems and Neural Networks (cond-mat.dis-nn); Machine Learning (cs.LG); Adaptation and Self-Organizing Systems (nlin.AO); Neurons and Cognition (q-bio.NC); Quantitative Methods (q-bio.QM)
Comments: 46 pages, 13 figures, submitted to ‘Journal of Physics: Complexity’

Abstract:The behavior of multivariate dynamical processes is often governed by underlying structural connections that relate the components of the system. For example, brain activity, which is often measured via time series, is determined by an underlying structural graph, where nodes represent neurons or brain regions and edges represent cortical connectivity. Existing methods for inferring structural connections from observed dynamics, such as correlation-based or spectral techniques, may fail to fully capture complex relationships in high-dimensional time series in an interpretable way. Here, we propose the use of path signatures, a mathematical framework that encodes geometric and temporal properties of continuous paths, to address this problem. Path signatures provide a reparametrization-invariant characterization of dynamical data and, in particular, can be used to compute the lead matrix, which reveals lead-lag phenomena. We showcase our approach on time series from coupled oscillators in the Kuramoto model defined on a stochastic block model graph, termed the Kuramoto stochastic block model (KSBM). Using mean-field theory and Gaussian approximations, we analytically derive reduced models of KSBM dynamics in different temporal regimes and theoretically characterize the lead matrix in these settings. Leveraging these insights, we propose a novel signature-based community detection algorithm, achieving exact recovery of structural communities from observed time series in multiple KSBM instances. Our results demonstrate that path signatures provide a novel perspective on analyzing complex neural data and other high-dimensional systems, explicitly exploiting temporal functional relationships to infer underlying structure.
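
As a rough illustration of the lead matrix mentioned above, the sketch below computes the antisymmetric part of the second signature level for a sampled multivariate time series, treating the samples as a piecewise-linear path. It is a generic construction and not the paper's community detection algorithm; the function name `lead_matrix` is an assumption. Entry (i, j) is the signed area swept out by channels i and j, and sign conventions for "who leads" vary between references.

```python
import numpy as np

def lead_matrix(path):
    """Lead matrix of a sampled path of shape (T, d).

    Entry (i, j) equals 1/2 * integral of (x_i dx_j - x_j dx_i) along the
    piecewise-linear interpolation of the samples (the Levy area).
    """
    x = np.asarray(path, dtype=float)
    x = x - x[0]                       # start the path at the origin
    dx = np.diff(x, axis=0)            # increments on each segment
    mid = 0.5 * (x[:-1] + x[1:])       # segment midpoints
    # For a piecewise-linear path, sum(mid_i * dx_j) is exactly the
    # iterated integral of x_i against dx_j (second signature level).
    second_level = mid.T @ dx
    return 0.5 * (second_level - second_level.T)
```

For example, with channels cos(t) and sin(t) over one period, the (0, 1) entry is close to the area of the unit disk and the matrix is antisymmetric, reflecting that the second channel is a delayed copy of the first.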

[LG-101] A Statistical Theory of Contrastive Learning via Approximate Sufficient Statistics

Link: https://arxiv.org/abs/2503.17538
Authors: Licong Lin, Song Mei
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST)
Comments:

Abstract:Contrastive learning – a modern approach to extract useful representations from unlabeled data by training models to distinguish similar samples from dissimilar ones – has driven significant progress in foundation models. In this work, we develop a new theoretical framework for analyzing data augmentation-based contrastive learning, with a focus on SimCLR as a representative example. Our approach is based on the concept of approximate sufficient statistics, which we extend beyond its original definition in \cite{oko2025statistical} for contrastive language-image pretraining (CLIP) using KL-divergence. We generalize it to equivalent forms and general f-divergences, and show that minimizing SimCLR and other contrastive losses yields encoders that are approximately sufficient. Furthermore, we demonstrate that these near-sufficient encoders can be effectively adapted to downstream regression and classification tasks, with performance depending on their sufficiency and the error induced by data augmentation in contrastive learning. Concrete examples in linear regression and topic classification are provided to illustrate the broad applicability of our results.
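
For context on the SimCLR objective discussed above, the following is a minimal NumPy sketch of the commonly used NT-Xent (normalized temperature-scaled cross-entropy) loss on a batch of paired embeddings. It illustrates the usual formulation of the loss, not the paper's theoretical analysis; the function name `nt_xent_loss` and the default temperature are assumptions for the example.

```python
import numpy as np

def nt_xent_loss(z1, z2, temperature=0.5):
    """NT-Xent loss for two views z1, z2 of shape (n, d).

    Row k of z1 and row k of z2 form a positive pair; every other
    embedding in the combined batch of 2n acts as a negative.
    """
    z1 = np.asarray(z1, dtype=float)
    z2 = np.asarray(z2, dtype=float)
    n = z1.shape[0]

    z = np.concatenate([z1, z2], axis=0)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)   # cosine similarity
    sim = (z @ z.T) / temperature
    np.fill_diagonal(sim, -np.inf)                     # exclude self-pairs

    # Index of the positive partner for each of the 2n rows.
    pos = np.concatenate([np.arange(n, 2 * n), np.arange(0, n)])

    # Numerically stable log-sum-exp over each row.
    row_max = sim.max(axis=1, keepdims=True)
    log_denom = row_max[:, 0] + np.log(np.exp(sim - row_max).sum(axis=1))
    return float(np.mean(log_denom - sim[np.arange(2 * n), pos]))
```

The loss is small when each embedding is much more similar to its augmented partner than to the rest of the batch, which is the behavior the sufficiency analysis in the abstract builds on.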

[LG-102] Long-term excitation energy transfer predicted by a modified convolutional neural networks in the FMO complexes

Link: https://arxiv.org/abs/2503.17430
Authors: Yi-Meng Huang, Zi-Ran Zhao, Shun-Cai Zhao
Categories: Chemical Physics (physics.chem-ph); Machine Learning (cs.LG); Quantum Physics (quant-ph)
Comments: 11 pages, 10 figures

Abstract:In machine learning (ML), the risk of recursive strategies overfitting historical data has driven the development of convolutional neural networks (CNNs) for simulating quantum dissipative dynamics. In this work, we propose an efficient CNN scheme incorporating novel redundant time-functions to predict 100 picosecond (ps) excitation energy transfer (EET) in Fenna-Matthews-Olson (FMO) complexes, in which the original time t is normalized by mapping it to the [0, 1] range, allowing different functions to focus on distinct time intervals and thereby effectively capturing the multi-timescale characteristics of EET dynamics. This method simplifies optimization and enhances learning efficiency, and we demonstrate the superior accuracy, robustness, and efficiency of our approach in predicting quantum dissipative dynamics.

[LG-103] 3D variational autoencoder for fingerprinting microstructure volume elements

Link: https://arxiv.org/abs/2503.17427
Authors: Michael D. White, Michael D. Atkinson, Adam J. Plowman, Pratheek Shanthraj
Categories: Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG)
Comments: 24 pages, 9 figures

Abstract:Microstructure quantification is an important step towards establishing structure-property relationships in materials. Machine learning-based image processing methods have been shown to outperform conventional image processing techniques and are increasingly applied to microstructure quantification tasks. In this work, we present a 3D variational autoencoder (VAE) for encoding microstructure volume elements (VEs) comprising voxelated crystallographic orientation data. Crystal symmetries in the orientation space are accounted for by mapping to the crystallographic fundamental zone as a preprocessing step, which allows for a continuous loss function to be used and improves the training convergence rate. The VAE is then used to encode a training set of VEs with an equiaxed polycrystalline microstructure with random texture. Accurate reconstructions are achieved with a relative average misorientation error of $9\times10^{-3}$ on the test dataset, for a continuous latent space with dimension 256. We show that the model generalises well to microstructures with textures, grain sizes and aspect ratios outside the training distribution. Structure-property relationships are explored through using the training set of VEs as initial configurations in various crystal plasticity (CP) simulations. Microstructural fingerprints extracted from the VAE, which parameterise the VEs in a low-dimensional latent space, are stored alongside the volume-averaged stress response, at each strain increment, to uniaxial tensile deformation from CP simulations. This is then used to train a fully connected neural network mapping the input fingerprint to the resulting stress response, which acts as a surrogate model for the CP simulation. The fingerprint-based surrogate model is shown to accurately predict the microstructural dependence in the CP stress response, with a relative mean-squared error of $8.9\times10^{-4}$ on unseen test data.

[LG-104] TripNet: Learning Large-scale High-fidelity 3D Car Aerodynamics with Triplane Networks

Link: https://arxiv.org/abs/2503.17400
Authors: Qian Chen, Mohamed Elrefaie, Angela Dai, Faez Ahmed
Categories: Fluid Dynamics (physics.flu-dyn); Machine Learning (cs.LG)
Comments:

Abstract:Computational Fluid Dynamics (CFD) simulations are essential in product design, providing insights into fluid behavior around complex geometries in aerospace and automotive applications. However, high-fidelity CFD simulations are computationally expensive, making rapid design iterations challenging. To address this, we propose TripNet, Triplane CFD Network, a machine learning-based framework leveraging triplane representations to predict the outcomes of large-scale, high-fidelity CFD simulations with significantly reduced computation cost. Our method encodes 3D geometry into compact yet information-rich triplane features, maintaining full geometry fidelity and enabling accurate aerodynamic predictions. Unlike graph- and point cloud-based models, which are inherently discrete and provide solutions only at the mesh nodes, TripNet allows the solution to be queried at any point in the 3D space. Validated on high-fidelity DrivAerNet and DrivAerNet++ car aerodynamics datasets, TripNet achieves state-of-the-art performance in drag coefficient prediction, surface field estimation, and full 3D flow field simulations of industry-standard car designs. By utilizing a shared triplane backbone across multiple tasks, our approach offers a scalable, accurate, and efficient alternative to traditional CFD solvers.

[LG-105] Challenges and Advancements in Modeling Shock Fronts with Physics-Informed Neural Networks: A Review and Benchmarking Study

Link: https://arxiv.org/abs/2503.17379
Authors: Jassem Abbasi, Ameya D. Jagtap, Ben Moseley, Aksel Hiorth, Pål Østebø Andersen
Categories: Fluid Dynamics (physics.flu-dyn); Machine Learning (cs.LG)
Comments: 15 figures, 5 tables, and 37 pages

Abstract:Solving partial differential equations (PDEs) with discontinuous solutions, such as shock waves in multiphase viscous flow in porous media, is critical for a wide range of scientific and engineering applications, as they represent sudden changes in physical quantities. Physics-Informed Neural Networks (PINNs), an approach proposed for solving PDEs, encounter significant challenges when applied to such systems. Accurately solving PDEs with discontinuities using PINNs requires specialized techniques to ensure solution accuracy and numerical stability. A benchmarking study was conducted on two multiphase flow problems in porous media: the classic Buckley-Leverett (BL) problem and a fully coupled system of equations involving shock waves but with varying levels of solution complexity. The findings show that PM and LM approaches can provide accurate solutions for the BL problem by effectively addressing the infinite gradients associated with shock occurrences. In contrast, AM methods failed to effectively resolve the shock waves. When applied to fully coupled PDEs (with a more complex loss landscape), the generalization error in the solutions quickly increased, highlighting the need for ongoing innovation. This study provides a comprehensive review of existing techniques for managing PDE discontinuities using PINNs, offering information on their strengths and limitations. The results underscore the necessity for further research to improve PINNs' ability to handle complex discontinuities, particularly in more challenging problems with complex loss landscapes. This includes problems involving higher dimensions or multiphysics systems, where current methods often struggle to maintain accuracy and efficiency.

Information Retrieval

[IR-0] CCMusic: An Open and Diverse Database for Chinese Music Information Retrieval Research

Link: https://arxiv.org/abs/2503.18802
Authors: Monan Zhou, Shenyang Xu, Zhaorui Liu, Zhaowen Wang, Feng Yu, Wei Li, Baoqiang Han
Categories: Information Retrieval (cs.IR); Sound (cs.SD)
Comments: 17 pages, 18 figures

Abstract:Data are crucial in various computer-related fields, including music information retrieval (MIR), an interdisciplinary area bridging computer science and music. This paper introduces CCMusic, an open and diverse database comprising multiple datasets specifically designed for tasks related to Chinese music, highlighting our focus on this culturally rich domain. The database integrates both published and unpublished datasets, with steps taken such as data cleaning, label refinement, and data structure unification to ensure data consistency and create ready-to-use versions. We conduct benchmark evaluations for all datasets using a unified evaluation framework developed specifically for this purpose. This publicly available framework supports both classification and detection tasks, ensuring standardized and reproducible results across all datasets. The database is hosted on HuggingFace and ModelScope, two open and multifunctional data and model hosting platforms, ensuring ease of accessibility and usability.

[IR-1] A Comprehensive Review on Hashtag Recommendation: From Traditional to Deep Learning and Beyond

Link: https://arxiv.org/abs/2503.18669
Authors: Shubhi Bansal,Kushaan Gowda,Anupama Sureshbabu K,Chirag Kothari,Nagendra Kumar
Categories: Information Retrieval (cs.IR)
Comments:

Abstract:The exponential growth of user-generated content on social media platforms has precipitated significant challenges in information management, particularly in content organization, retrieval, and discovery. Hashtags, as a fundamental categorization mechanism, play a pivotal role in enhancing content visibility and user engagement. However, the development of accurate and robust hashtag recommendation systems remains a complex and evolving research challenge. Existing surveys in this domain are limited in scope and recency, focusing narrowly on specific platforms, methodologies, or timeframes. To address this gap, this review article conducts a systematic analysis of hashtag recommendation systems, comprehensively examining recent advancements across several dimensions. We investigate unimodal versus multimodal methodologies, diverse problem formulations, filtering strategies, and the methodological evolution from traditional frequency-based models to advanced deep learning architectures. Furthermore, we critically evaluate performance assessment paradigms, including quantitative metrics, qualitative analyses, and hybrid evaluation frameworks. Our analysis underscores a paradigm shift toward transformer-based deep learning models, which harness contextual and semantic features to achieve superior recommendation accuracy. Key challenges such as data sparsity, cold-start scenarios, polysemy, and model explainability are rigorously discussed, alongside practical applications in tweet classification, sentiment analysis, and content popularity prediction. By synthesizing insights from diverse methodological and platform-specific perspectives, this survey provides a structured taxonomy of current research, identifies unresolved gaps, and proposes future directions for developing adaptive, user-centric recommendation systems.

[IR-2] Robust-IR @ SIGIR 2025: The First Workshop on Robust Information Retrieval SIGIR2025

Link: https://arxiv.org/abs/2503.18426
Authors: Yu-An Liu,Haya Nachimovsky,Ruqing Zhang,Oren Kurland,Jiafeng Guo,Moshe Tennenholtz
Categories: Information Retrieval (cs.IR)
Comments: Accepted by SIGIR 2025

Abstract:With the advancement of information retrieval (IR) technologies, robustness is increasingly attracting attention. When deploying technology into practice, we consider not only its average performance under normal conditions but, more importantly, its ability to maintain functionality across a variety of exceptional situations. In recent years, the research on IR robustness covers theory, evaluation, methodology, and application, and all of them show a growing trend. The purpose of this workshop is to systematize the latest results of each research aspect, to foster comprehensive communication within this niche domain while also bridging robust IR research with the broader community, and to promote further future development of robust IR. To avoid the one-sided talk of mini-conferences, this workshop adopts a highly interactive format, including round-table and panel discussion sessions, to encourage active participation and meaningful exchange among attendees.

[IR-3] Food Recommendation With Balancing Comfort and Curiosity

Link: https://arxiv.org/abs/2503.18355
Authors: Yuto Sakai,Qiang Ma
Categories: Information Retrieval (cs.IR)
Comments:

Abstract:Food is a key pleasure of traveling, but travelers face a trade-off between exploring curious new local food and choosing comfortable, familiar options. This creates demand for personalized recommendation systems that balance these competing factors. To the best of our knowledge, conventional recommendation methods cannot provide recommendations that offer both curiosity and comfort for food unknown to the user at a travel destination. In this study, we propose new quantitative methods for estimating comfort and curiosity: Kernel Density Scoring (KDS) and Mahalanobis Distance Scoring (MDS). KDS probabilistically estimates food history distribution using kernel density estimation, while MDS uses Mahalanobis distances between foods. These methods score food based on how their representation vectors fit the estimated distributions. We also propose a ranking method measuring the balance between comfort and curiosity based on taste and ingredients. This balance is defined as curiosity (return) gained per unit of comfort (risk) in choosing a food. To evaluate the proposed methods, we collected a new dataset containing user surveys on Japanese food and assessments of foreign food regarding comfort and curiosity. Comparing our methods against the existing method (the baseline), the Wilcoxon signed-rank test showed that when estimating comfort from taste and curiosity from ingredients, the MDS-based method outperformed the baseline, while the KDS-based method showed no significant differences. When estimating curiosity from taste and comfort from ingredients, both methods outperformed the baseline. The MDS-based method consistently outperformed KDS in ROC-AUC values.
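
The two scoring ideas named in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not give the kernel, the bandwidth, the covariance estimate, or how scores map to comfort and curiosity. The Gaussian kernel, the `bandwidth` parameter, the diagonal-covariance simplification, and the function names `kernel_density_score` and `diagonal_mahalanobis` are all assumptions made for illustration.

```python
import math


def kernel_density_score(candidate, history, bandwidth=1.0):
    """Mean Gaussian kernel value between a candidate and each history vector.

    Unnormalized: the constant factor of a true density is dropped, which
    does not change how candidates rank against each other.
    """
    if not history:
        raise ValueError("history must contain at least one vector")
    total = 0.0
    for past in history:
        sq_dist = sum((c - p) ** 2 for c, p in zip(candidate, past))
        total += math.exp(-sq_dist / (2.0 * bandwidth ** 2))
    return total / len(history)


def diagonal_mahalanobis(candidate, history, eps=1e-9):
    """Mahalanobis distance to the history mean, assuming a diagonal covariance."""
    n = len(history)
    dim = len(candidate)
    mean = [sum(vec[d] for vec in history) / n for d in range(dim)]
    # Population variance per dimension; eps guards against division by zero.
    var = [sum((vec[d] - mean[d]) ** 2 for vec in history) / n for d in range(dim)]
    return math.sqrt(
        sum((candidate[d] - mean[d]) ** 2 / (var[d] + eps) for d in range(dim))
    )


history = [[0.0, 0.0], [2.0, 2.0]]  # toy vectors for foods already eaten
near, far = [1.0, 1.0], [5.0, 5.0]  # toy candidate foods
print(kernel_density_score(near, history), kernel_density_score(far, history))
print(diagonal_mahalanobis(near, history), diagonal_mahalanobis(far, history))
```

In this toy setup the candidate close to the eating history gets the higher kernel score and the smaller distance. A full Mahalanobis distance would use the inverse of the complete covariance matrix rather than per-dimension variances.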

[IR-4] RAU: Towards Regularized Alignment and Uniformity for Representation Learning in Recommendation

Link: https://arxiv.org/abs/2503.18300
Authors: Xi Wu,Dan Zhang,Chao Zhou,Liangwei Yang,Tianyu Lin,Jibing Gong
Categories: Information Retrieval (cs.IR)
Comments:

Abstract:Recommender systems (RecSys) have become essential in modern society, driving user engagement and satisfaction across diverse online platforms. Most RecSys focuses on designing a powerful encoder to embed users and items into high-dimensional vector representation space, with loss functions optimizing their representation distributions. Recent studies reveal that directly optimizing key properties of the representation distribution, such as alignment and uniformity, can outperform complex encoder designs. However, existing methods for optimizing critical attributes overlook the impact of dataset sparsity on the model: limited user-item interactions lead to sparse alignment, while excessive interactions result in uneven uniformity, both of which degrade performance. In this paper, we identify the sparse alignment and uneven uniformity issues, and further propose Regularized Alignment and Uniformity (RAU) to cope with these two issues accordingly. RAU consists of two novel regularization methods for alignment and uniformity to learn better user/item representation. 1) Center-strengthened alignment further aligns the average in-batch user/item representation to provide an enhanced alignment signal and further minimize the disparity between user and item representation. 2) Low-variance-guided uniformity minimizes the variance of pairwise distances along with uniformity, which provides extra guidance to a more stabilized uniformity increase during training. We conducted extensive experiments on three real-world datasets, and the proposed RAU resulted in significant performance improvements compared to current state-of-the-art CF methods, which confirms the advantages of the two proposed regularization methods.
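
To make the alignment and uniformity terminology concrete, here is a minimal sketch. The `alignment` and `uniformity` functions follow the commonly used definitions (mean squared distance between positive pairs, and the log of the mean Gaussian potential over pairs). The two regularizers, `center_alignment` and `pairwise_distance_variance`, are only one plausible reading of the abstract's description; the paper's exact formulas, weights, and batching are not given here, so treat those names and forms as assumptions. Vectors are assumed to be L2-normalized.

```python
import math
from itertools import combinations


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def alignment(users, items):
    """Mean squared distance between matched (user, item) positive pairs."""
    return sum(sq_dist(u, i) for u, i in zip(users, items)) / len(users)


def uniformity(vectors, t=2.0):
    """Log of the mean Gaussian potential over all distinct pairs."""
    pairs = list(combinations(vectors, 2))
    return math.log(sum(math.exp(-t * sq_dist(a, b)) for a, b in pairs) / len(pairs))


def center_alignment(users, items):
    """Assumed regularizer: squared distance between in-batch mean vectors."""
    dim = len(users[0])
    mean_u = [sum(u[d] for u in users) / len(users) for d in range(dim)]
    mean_i = [sum(i[d] for i in items) / len(items) for d in range(dim)]
    return sq_dist(mean_u, mean_i)


def pairwise_distance_variance(vectors):
    """Assumed regularizer: variance of pairwise squared distances."""
    dists = [sq_dist(a, b) for a, b in combinations(vectors, 2)]
    mean = sum(dists) / len(dists)
    return sum((d - mean) ** 2 for d in dists) / len(dists)


users = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]  # toy unit-norm user embeddings
items = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]  # toy unit-norm item embeddings
print(alignment(users, items), uniformity(users))
print(center_alignment(users, items), pairwise_distance_variance(users))
```

In training, terms like these would be combined into one loss with weighting coefficients; those coefficients are hyperparameters that the abstract does not specify.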

[IR-5] BERTDetect: A Neural Topic Modelling Approach for Android Malware Detection

Link: https://arxiv.org/abs/2503.18043
Authors: Nishavi Ranaweera,Jiarui Xu,Suranga Seneviratne,Aruna Seneviratne
Categories: Cryptography and Security (cs.CR); Information Retrieval (cs.IR)
Comments:

Abstract:Web access today occurs predominantly through mobile devices, with Android representing a significant share of the mobile device market. This widespread usage makes Android a prime target for malicious attacks. Despite efforts to combat malicious attacks through tools like Google Play Protect and antivirus software, new and evolved malware continues to infiltrate Android devices. Source code analysis is effective but limited, as attackers quickly abandon old malware for new variants to evade detection. Therefore, there is a need for alternative methods that complement source code analysis. Prior research investigated clustering applications based on their descriptions and identified outliers in these clusters by API usage as malware. However, these works often used traditional techniques such as Latent Dirichlet Allocation (LDA) and k-means clustering, which do not capture the nuanced semantic structures present in app descriptions. To this end, in this paper, we propose BERTDetect, which leverages BERTopic neural topic modelling to effectively capture the latent topics in app descriptions. The resulting topic clusters are more coherent than those of previous methods and represent the app functionalities well. Our results demonstrate that BERTDetect outperforms other baselines, achieving ~10% relative improvement in F1 score.

[IR-6] SUNAR: Semantic Uncertainty based Neighborhood Aware Retrieval for Complex QA NAACL2025

Link: https://arxiv.org/abs/2503.17990
Authors: V Venktesh,Mandeep Rathee,Avishek Anand
Categories: Information Retrieval (cs.IR)
Comments: Accepted at NAACL 2025 Main Conference

Abstract:Complex question-answering (QA) systems face significant challenges in retrieving and reasoning over information that addresses multi-faceted queries. While large language models (LLMs) have advanced the reasoning capabilities of these systems, the bounded-recall problem persists, where procuring all relevant documents in first-stage retrieval remains a challenge. Missing pertinent documents at this stage leads to performance degradation that cannot be remedied in later stages, especially given the limited context windows of LLMs which necessitate high recall at smaller retrieval depths. In this paper, we introduce SUNAR, a novel approach that leverages LLMs to guide a Neighborhood Aware Retrieval process. SUNAR iteratively explores a neighborhood graph of documents, dynamically promoting or penalizing documents based on uncertainty estimates from interim LLM-generated answer candidates. We validate our approach through extensive experiments on two complex QA datasets. Our results show that SUNAR significantly outperforms existing retrieve-and-reason baselines, achieving up to a 31.84% improvement in performance over existing state-of-the-art methods for complex QA.

[IR-7] Dense Passage Retrieval in Conversational Search

Link: https://arxiv.org/abs/2503.17507
Authors: Ahmed H. Salamah,Pierre McWhannel,Nicole Yan
Categories: Information Retrieval (cs.IR)
Comments:

Abstract:Information retrieval systems have traditionally relied on exact term match methods such as BM25 for first-stage retrieval. However, recent advancements in neural network-based techniques have introduced a new method called dense retrieval. This approach uses a dual-encoder to create contextual embeddings that can be indexed and clustered efficiently at run-time, resulting in improved retrieval performance in Open-domain Question Answering systems. In this paper, we apply the dense retrieval technique to conversational search by conducting experiments on the CAsT benchmark dataset. We also propose an end-to-end conversational search system called GPT2QR+DPR, which incorporates various query reformulation strategies to improve retrieval accuracy. Our findings indicate that dense retrieval outperforms BM25 even without extensive fine-tuning. Our work contributes to the growing body of research on neural-based retrieval methods in conversational search, and highlights the potential of dense retrieval in improving retrieval accuracy in conversational search systems.
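
The core scoring step of dual-encoder dense retrieval can be sketched as follows. This is an illustration only and not the GPT2QR+DPR system: the embeddings are hard-coded toy vectors standing in for encoder outputs, no query reformulation is performed, and the function name `rank_passages` and the passage ids are hypothetical. Real systems replace the exhaustive scan with an approximate nearest-neighbor index.

```python
def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def rank_passages(query_vec, passage_vecs, top_k=3):
    """Return the top_k (passage_id, score) pairs by inner-product similarity."""
    scored = [(pid, dot(query_vec, vec)) for pid, vec in passage_vecs.items()]
    # Sort by score, highest first; ties keep their insertion order.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy embeddings standing in for the outputs of a passage encoder.
passages = {
    "p1": [0.9, 0.1],
    "p2": [0.1, 0.9],
    "p3": [0.5, 0.5],
}
query = [1.0, 0.0]  # toy embedding standing in for the query encoder output

for pid, score in rank_passages(query, passages):
    print(pid, round(score, 2))
```

Because passages are encoded independently of the query, their vectors can be computed and indexed once, and only the query needs encoding at search time.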

Attachment Download

Click to download the full list of today's papers
