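arXiv Daily Papers | 2025-03-27

This post lists the latest papers retrieved from arXiv.org on 2025-03-27. It is updated automatically and grouped into five areas: NLP, CV, ML, AI, and IR. To receive the list by email on a schedule, leave your email address in the comments.

Note: the daily paper data is fetched from arXiv.org and refreshed automatically at around 12:00 each day.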

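[Quick Read]: This paper addresses the deployment and optimization of large language models (LLMs) on mobile devices. Existing benchmark datasets mainly target server and desktop environments, and extensive datasets designed specifically for mobile scenarios are lacking. Mobile devices also have limited storage and compute, so they need efficient, well-targeted models. The key contribution is Mobile-MMLU, a large-scale benchmark dataset for mobile intelligence with 16,186 multiple-choice questions across 80 mobile-related fields. Mobile-MMLU emphasizes mobile-specific metrics such as inference latency, energy consumption, memory usage, and response quality, and it stresses privacy protection and personalized adaptation. Its challenging subset, Mobile-MMLU-Pro, raises the evaluation difficulty further. This standardized framework supports the development and comparison of mobile-optimized LLMs.

Link: https://arxiv.org/abs/2503.20786
Authors: Sondos Mahmoud Bsharat, Mukul Ranjan, Aidar Myrzakhan, Jiacheng Liu, Bowei Guo, Shengkun Tang, Zhuang Liu, Yuanzhi Li, Zhiqiang Shen
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: An order-invariant and mobile-centric benchmark. Code and data are available at: this https URL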

Abstract: Rapid advancements in large language models (LLMs) have increased interest in deploying them on mobile devices for on-device AI applications. Mobile users interact differently with LLMs compared to desktop users, creating unique expectations and data biases. Current benchmark datasets primarily target server and desktop environments, and there is a notable lack of extensive datasets specifically designed for mobile contexts. Additionally, mobile devices face strict limitations in storage and computing resources, constraining model size and capabilities, thus requiring optimized efficiency and prioritized knowledge. To address these challenges, we introduce Mobile-MMLU, a large-scale benchmark dataset tailored for mobile intelligence. It consists of 16,186 questions across 80 mobile-related fields, designed to evaluate LLM performance in realistic mobile scenarios. A challenging subset, Mobile-MMLU-Pro, provides advanced evaluation similar in size to MMLU-Pro but significantly more difficult than our standard full set. Both benchmarks use multiple-choice, order-invariant questions focused on practical mobile interactions, such as recipe suggestions, travel planning, and essential daily tasks. The dataset emphasizes critical mobile-specific metrics like inference latency, energy consumption, memory usage, and response quality, offering comprehensive insights into model performance under mobile constraints. Moreover, it prioritizes privacy and adaptability, assessing models’ ability to perform on-device processing, maintain user privacy, and adapt to personalized usage patterns. The Mobile-MMLU family offers a standardized framework for developing and comparing mobile-optimized LLMs, enabling advancements in productivity and decision-making within mobile computing environments. Our code and data are available at: this https URL.
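
The abstract calls the questions "order-invariant" but does not spell out the scoring protocol. A minimal sketch of one way to read that property, assuming a question only counts as correct if the model picks the right option text under every rotation of the option list. The function name, the `predict` callback, and the rotation scheme are all hypothetical and not taken from the paper.

```python
from typing import Callable, Sequence


def order_invariant_correct(
    question: str,
    options: Sequence[str],
    answer: str,
    predict: Callable[[str, Sequence[str]], int],
) -> bool:
    """Return True only if `predict` selects `answer` under every rotation.

    `predict` is a hypothetical stand-in for a model call: it receives the
    question and the displayed options and returns the index it chooses.
    """
    n = len(options)
    for shift in range(n):
        # Rotate so that each option appears in the first position once.
        shown = list(options[shift:]) + list(options[:shift])
        picked = predict(question, shown)
        if shown[picked] != answer:
            return False
    return True
```

A predictor that always answers with the first displayed option fails this check on any question with more than one option, which is the kind of position bias an order-invariant design is meant to expose.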

[NLP-1] Understanding R1-Zero-Like Training: A Critical Perspective

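[Quick Read]: This paper examines how large-scale reinforcement learning (RL) can improve the reasoning ability of large language models (LLMs) without supervised fine-tuning. It critically analyzes the two core components of R1-Zero-like training: base models and the RL process. The study finds that pretraining characteristics significantly affect RL performance: DeepSeek-V3-Base already exhibits an "Aha moment", while Qwen2.5 base models show strong reasoning even without prompt templates, which may stem from pretraining biases. The paper also identifies an optimization bias in Group Relative Policy Optimization (GRPO) that artificially increases response length during training, especially for incorrect outputs. To address this, the authors introduce Dr. GRPO, an unbiased optimization method that improves token efficiency while maintaining reasoning performance. Building on these insights, they present a minimalist R1-Zero recipe that reaches 43.3% accuracy on AIME 2024 with a 7B base model, a new state of the art.

Link: https://arxiv.org/abs/2503.20783
Authors: Zichen Liu, Changyu Chen, Wenjun Li, Penghui Qi, Tianyu Pang, Chao Du, Wee Sun Lee, Min Lin
Affiliations: Sea AI Lab; National University of Singapore; Singapore Management University
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: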

Abstract: DeepSeek-R1-Zero has shown that reinforcement learning (RL) at scale can directly enhance the reasoning capabilities of LLMs without supervised fine-tuning. In this work, we critically examine R1-Zero-like training by analyzing its two core components: base models and RL. We investigate a wide range of base models, including DeepSeek-V3-Base, to understand how pretraining characteristics influence RL performance. Our analysis reveals that DeepSeek-V3-Base already exhibits an "Aha moment", while Qwen2.5 base models demonstrate strong reasoning capabilities even without prompt templates, suggesting potential pretraining biases. Additionally, we identify an optimization bias in Group Relative Policy Optimization (GRPO), which artificially increases response length (especially for incorrect outputs) during training. To address this, we introduce Dr. GRPO, an unbiased optimization method that improves token efficiency while maintaining reasoning performance. Leveraging these insights, we present a minimalist R1-Zero recipe that achieves 43.3% accuracy on AIME 2024 with a 7B base model, establishing a new state-of-the-art. Our code is available at this https URL.
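
GRPO scores each sampled response relative to the other responses drawn for the same prompt. The sketch below shows that group-relative advantage in isolation. The abstract does not say which terms Dr. GRPO changes, so the `normalize_std` switch is only an illustrative assumption about where a normalization term could be dropped, not a statement of the authors' method.

```python
from statistics import mean, pstdev
from typing import List


def group_advantages(
    rewards: List[float], normalize_std: bool = True, eps: float = 1e-6
) -> List[float]:
    """Advantage of each response relative to its sampling group.

    With normalize_std=True this is the usual (r - mean) / std form.
    With normalize_std=False only the group mean is subtracted
    (hypothetical switch, added here for illustration).
    """
    mu = mean(rewards)
    centered = [r - mu for r in rewards]
    if not normalize_std:
        return centered
    sigma = pstdev(rewards)  # population standard deviation of the group
    return [c / (sigma + eps) for c in centered]
```

For a group with rewards `[1.0, 0.0, 0.0, 1.0]`, the mean-only variant gives `[0.5, -0.5, -0.5, 0.5]`. The normalized variant rescales the same signs by the group spread.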

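[NLP-2] MCTS-RAG: Enhancing Retrieval-Augmented Generation with Monte Carlo Tree Search

[Quick Read]: This paper addresses the weak reasoning of small language models on knowledge-intensive tasks. Standard retrieval-augmented generation (RAG) typically retrieves information independently of the reasoning process, so knowledge is integrated suboptimally, while conventional Monte Carlo Tree Search (MCTS) reasoning relies only on the model's internal knowledge and lacks external facts. The proposed method, MCTS-RAG, dynamically integrates retrieval and reasoning through an iterative decision-making process. It combines structured reasoning with adaptive retrieval, which improves decision quality, reduces hallucinations, and yields better factual accuracy and response consistency. Experiments show that the method lets small language models approach frontier LLMs such as GPT-4o on several knowledge-intensive datasets by effectively scaling inference-time compute.

Link: https://arxiv.org/abs/2503.20757
Authors: Yunhai Hu, Yilun Zhao, Chen Zhao, Arman Cohan
Affiliations: Yale University; New York University
Categories: Computation and Language (cs.CL)
Comments: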

Abstract: We introduce MCTS-RAG, a novel approach that enhances the reasoning capabilities of small language models on knowledge-intensive tasks by leveraging retrieval-augmented generation (RAG) to provide relevant context and Monte Carlo Tree Search (MCTS) to refine reasoning paths. MCTS-RAG dynamically integrates retrieval and reasoning through an iterative decision-making process. Unlike standard RAG methods, which typically retrieve information independently from reasoning and thus integrate knowledge suboptimally, or conventional MCTS reasoning, which depends solely on internal model knowledge without external facts, MCTS-RAG combines structured reasoning with adaptive retrieval. This integrated approach enhances decision-making, reduces hallucinations, and ensures improved factual accuracy and response consistency. The experimental results on multiple reasoning and knowledge-intensive datasets (i.e., ComplexWebQA, GPQA, and FoolMeTwice) show that our method enables small-scale LMs to achieve performance comparable to frontier LLMs like GPT-4o by effectively scaling inference-time compute, setting a new standard for reasoning in small-scale models.
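
In a tree search that mixes reasoning and retrieval, each step has to decide which kind of action to expand next. The sketch below uses the standard UCT rule for that choice. The action names `"reason"` and `"retrieve"`, the statistics layout, and the exploration constant are hypothetical, since the abstract does not describe MCTS-RAG's action space or selection rule.

```python
import math
from typing import Dict, Tuple


def uct_score(total_value: float, visits: int, parent_visits: int, c: float) -> float:
    """Standard UCT: mean value plus an exploration bonus."""
    if visits == 0:
        return math.inf  # always try an unvisited action first
    return total_value / visits + c * math.sqrt(math.log(parent_visits) / visits)


def select_action(stats: Dict[str, Tuple[float, int]], c: float = 1.4) -> str:
    """Pick the child action with the highest UCT score.

    `stats` maps an action name to (accumulated value, visit count).
    """
    parent_visits = sum(visits for _, visits in stats.values())
    return max(
        stats,
        key=lambda a: uct_score(stats[a][0], stats[a][1], parent_visits, c),
    )
```

With `c = 0` the rule is purely greedy on mean value. A positive `c` favors rarely tried actions, which is how a seldom-used retrieval step can still be selected.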

[NLP-3] ADS-Edit: A Multimodal Knowledge Editing Dataset for Autonomous Driving Systems

[Quick Read]: This paper addresses the challenges that Large Multimodal Models (LMMs) face in Autonomous Driving Systems (ADS), including misunderstanding of traffic knowledge, complex road conditions, and diverse vehicle states. It proposes using Knowledge Editing, which makes targeted changes to a model's behavior without full retraining. The key contribution is ADS-Edit, a multimodal knowledge editing dataset designed specifically for ADS, covering various real-world scenarios, multiple data types, and comprehensive evaluation metrics. The approach combines knowledge editing with a tailored dataset to improve LMMs in ADS.

Link: https://arxiv.org/abs/2503.20756
Authors: Chenxi Wang, Jizhan Fang, Xiang Chen, Bozhong Tian, Ziwen Xu, Huajun Chen, Ningyu Zhang
Affiliations: Zhejiang University; Nanjing University of Aeronautics and Astronautics
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Multimedia (cs.MM)
Comments: Work in progress

Abstract: Recent advancements in Large Multimodal Models (LMMs) have shown promise in Autonomous Driving Systems (ADS). However, their direct application to ADS is hindered by challenges such as misunderstanding of traffic knowledge, complex road conditions, and diverse vehicle states. To address these challenges, we propose the use of Knowledge Editing, which enables targeted modifications to a model’s behavior without the need for full retraining. Meanwhile, we introduce ADS-Edit, a multimodal knowledge editing dataset specifically designed for ADS, which includes various real-world scenarios, multiple data types, and comprehensive evaluation metrics. We conduct comprehensive experiments and derive several interesting conclusions. We hope that our work will contribute to the further advancement of knowledge editing applications in the field of autonomous driving. Code and data are available at this https URL.

[NLP-4] Beyond Believability: Accurate Human Behavior Simulation with Fine-Tuned LLMs

[Quick Read]: This paper asks how to improve the objective "accuracy" of large language models (LLMs) in web action generation, rather than their subjective "believability". The key finding is that fine-tuning on real-world behavioral data substantially improves action generation, and that adding synthesized reasoning traces to training yields further gains, which supports the value of explicit reasoning in behavior modeling.

Link: https://arxiv.org/abs/2503.20749
Authors: Yuxuan Lu, Jing Huang, Yan Han, Bennet Bei, Yaochen Xie, Dakuo Wang, Jessie Wang, Qi He
Affiliations: Amazon.com, Inc.; Northeastern University
Categories: Computation and Language (cs.CL)
Comments:

Abstract: Recent research shows that LLMs can simulate "believable" human behaviors to power LLM agents via prompt-only methods. In this work, we focus on evaluating and improving LLMs' objective "accuracy" rather than the subjective "believability" in the web action generation task, leveraging a large-scale, real-world dataset collected from online shopping human actions. We present the first comprehensive quantitative evaluation of state-of-the-art LLMs (e.g., DeepSeek-R1, Llama, and Claude) on the task of web action generation. Our results show that fine-tuning LLMs on real-world behavioral data substantially improves their ability to generate actions compared to prompt-only methods. Furthermore, incorporating synthesized reasoning traces into model training leads to additional performance gains, demonstrating the value of explicit rationale in behavior modeling. This work establishes a new benchmark for evaluating LLMs in behavior simulation and offers actionable insights into how real-world action data and reasoning augmentation can enhance the fidelity of LLM agents.

[NLP-5] Ontology-based Semantic Similarity Measures for Clustering Medical Concepts in Drug Safety

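[Quick Read]: This paper addresses the underuse of semantic similarity measures (SSMs) in pharmacovigilance, despite their wide use in biomedical research. It evaluates six ontology-based SSMs for clustering MedDRA Preferred Terms (PTs) and finds that intrinsic information content (IC)-based measures, such as INTRINSIC-LIN and SOKAL, give better clustering accuracy. The study developed a high-throughput framework for large-scale similarity computation and shows the potential of these measures to improve pharmacovigilance workflows, for example through earlier signal detection and less manual review.

Link: https://arxiv.org/abs/2503.20737
Authors: Jeffery L Painter, François Haguinet, Gregory E Powell, Andrew Bate
Affiliations: GlaxoSmithKline (GSK)
Categories: Computation and Language (cs.CL)
Comments: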

Abstract: Semantic similarity measures (SSMs) are widely used in biomedical research but remain underutilized in pharmacovigilance. This study evaluates six ontology-based SSMs for clustering MedDRA Preferred Terms (PTs) in drug safety data. Using the Unified Medical Language System (UMLS), we assess each method’s ability to group PTs around medically meaningful centroids. A high-throughput framework was developed with a Java API and Python and R interfaces to support large-scale similarity computations. Results show that while path-based methods perform moderately with F1 scores of 0.36 for WUPALMER and 0.28 for LCH, intrinsic information content (IC)-based measures, especially INTRINSIC-LIN and SOKAL, consistently yield better clustering accuracy (F1 score of 0.403). Validated against expert review and standard MedDRA queries (SMQs), our findings highlight the promise of IC-based SSMs in enhancing pharmacovigilance workflows by improving early signal detection and reducing manual review.
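
To make the contrast between path-based and IC-based measures concrete, the sketch below computes Wu-Palmer similarity and a Lin similarity with an intrinsic information content on a tiny made-up taxonomy. The taxonomy is hypothetical and is not MedDRA or UMLS. The intrinsic IC uses the descendant-count formulation of Seco et al., which is an assumption; the abstract does not say which intrinsic IC definition the study's INTRINSIC-LIN uses.

```python
import math

# Hypothetical toy hierarchy: child -> parent (None marks the root).
PARENT = {
    "root": None,
    "cardiac": "root",
    "hepatic": "root",
    "arrhythmia": "cardiac",
    "infarction": "cardiac",
    "hepatitis": "hepatic",
}


def ancestors(concept):
    """Path from the concept up to the root, inclusive."""
    path = []
    while concept is not None:
        path.append(concept)
        concept = PARENT[concept]
    return path


def lcs(a, b):
    """Lowest common subsumer: the deepest concept above both a and b."""
    above_a = set(ancestors(a))
    return next(c for c in ancestors(b) if c in above_a)


def wu_palmer(a, b):
    """Path-based: 2 * depth(lcs) / (depth(a) + depth(b)), root depth = 1."""
    depth = lambda c: len(ancestors(c))
    return 2 * depth(lcs(a, b)) / (depth(a) + depth(b))


def intrinsic_ic(concept):
    """IC from taxonomy structure only: 1 - log(descendants + 1) / log(N)."""
    descendants = sum(1 for c in PARENT if c != concept and concept in ancestors(c))
    return 1 - math.log(descendants + 1) / math.log(len(PARENT))


def lin(a, b):
    """IC-based: 2 * IC(lcs) / (IC(a) + IC(b))."""
    denom = intrinsic_ic(a) + intrinsic_ic(b)
    return 2 * intrinsic_ic(lcs(a, b)) / denom if denom else 0.0
```

On this toy tree, two terms that only share the root still get a non-zero Wu-Palmer score, while their Lin score is zero because the root carries no information content.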

[NLP-6] From Annotation to Adaptation: Metrics, Synthetic Data, and Aspect Extraction for Aspect-Based Sentiment Analysis with Large Language Models

[Quick Read]: This paper studies the performance of large language models (LLMs) on Aspect-Based Sentiment Analysis (ABSA), with a focus on implicit aspect extraction in a novel domain. Using a synthetic sports feedback dataset, it evaluates how well open-weight LLMs extract aspect-polarity pairs and proposes a metric to support the evaluation of aspect extraction with generative models.

Link: https://arxiv.org/abs/2503.20715
Authors: Nikita Neveditsin, Pawan Lingras, Vijay Mago
Affiliations: Saint Mary’s University, Halifax, Canada; York University, Toronto, Canada
Categories: Computation and Language (cs.CL)
Comments: Accepted to NAACL SRW 2025

Abstract: This study examines the performance of Large Language Models (LLMs) in Aspect-Based Sentiment Analysis (ABSA), with a focus on implicit aspect extraction in a novel domain. Using a synthetic sports feedback dataset, we evaluate open-weight LLMs’ ability to extract aspect-polarity pairs and propose a metric to facilitate the evaluation of aspect extraction with generative models. Our findings highlight both the potential and limitations of LLMs in the ABSA task.

[NLP-7] UniEDU: A Unified Language and Vision Assistant for Education Applications

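[Quick Read]: This paper addresses the difficulty existing models have in fully understanding multimodal educational materials, such as text and images. Conventional approaches rely on task-specific models and lack generality and adaptability. The paper proposes UniEDU, a unified language and vision assistant that supports several educational applications in a single model: knowledge recommendation, knowledge tracing, time cost prediction, and user answer prediction. UniEDU performs well across tasks within one framework and generalizes strongly. It also reduces computational overhead, with an approximately 300% gain in efficiency while keeping competitive performance, which makes it better suited to real-world deployment in diverse learning environments.

Link: https://arxiv.org/abs/2503.20701
Authors: Zhendong Chu, Jian Xie, Shen Wang, Zichao Wang, Qingsong Wen
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: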

Abstract: Education materials for K-12 students often consist of multiple modalities, such as text and images, posing challenges for models to fully understand nuanced information in these materials. In this paper, we propose a unified language and vision assistant UniEDU designed for various educational applications, including knowledge recommendation, knowledge tracing, time cost prediction, and user answer prediction, all within a single model. Unlike conventional task-specific models, UniEDU offers a unified solution that excels across multiple educational tasks while maintaining strong generalization capabilities. Its adaptability makes it well-suited for real-world deployment in diverse learning environments. Furthermore, UniEDU is optimized for industry-scale deployment by significantly reducing computational overhead (achieving approximately a 300% increase in efficiency) while maintaining competitive performance with minimal degradation compared to fully fine-tuned models. This work represents a significant step toward creating versatile AI systems tailored to the evolving demands of education.

[NLP-8] Vision as LoRA

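[Quick Read]: This paper addresses how to turn a single-modality LLM into a multimodal LLM (MLLM) efficiently. Prevailing approaches rely on external vision modules for vision encoding, which adds structural complexity and computational overhead. The proposed Vision as LoRA (VoRA) internalizes visual capabilities by integrating vision-specific LoRA layers directly into the LLM, so the added parameters can be merged into the LLM at inference time. In addition, a block-wise distillation method transfers visual priors from a pre-trained ViT into the LoRA layers, and bi-directional attention masks help capture the context of an image. Together these achieve performance comparable to conventional encoder-based MLLMs with a much simpler architecture.

Link: https://arxiv.org/abs/2503.20680
Authors: Han Wang, Yongjie Ye, Bingru Li, Yuxiang Nie, Jinghui Lu, Jingqun Tang, Yanjie Wang, Can Huang
Affiliations: ByteDance Inc.; University of Birmingham
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments: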

Abstract: We introduce Vision as LoRA (VoRA), a novel paradigm for transforming an LLM into an MLLM. Unlike prevalent MLLM architectures that rely on external vision modules for vision encoding, VoRA internalizes visual capabilities by integrating vision-specific LoRA layers directly into the LLM. This design allows the added parameters to be seamlessly merged into the LLM during inference, eliminating structural complexity and minimizing computational overhead. Moreover, inheriting the LLM’s ability to handle flexible context, VoRA can process inputs at arbitrary resolutions. To further strengthen VoRA’s visual capabilities, we introduce a block-wise distillation method that transfers visual priors from a pre-trained ViT into the LoRA layers, effectively accelerating training by injecting visual knowledge. Additionally, we apply bi-directional attention masks to better capture the context information of an image. We successfully demonstrate that with additional pre-training data, VoRA can perform comparably with conventional encode-based MLLMs. All training data, code, and model weights will be released at this https URL.
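
The claim that LoRA parameters can be merged into the base model at inference follows from linearity: a low-rank update `B @ A` can be folded into the weight matrix once, after which no extra layer is needed. The sketch below shows generic LoRA merging on plain Python lists. It is not VoRA's code, and the `scale` parameter and matrix shapes are illustrative assumptions.

```python
from typing import List

Matrix = List[List[float]]


def matmul(x: Matrix, y: Matrix) -> Matrix:
    """Plain matrix product for small illustrative matrices."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)] for row in x]


def merge_lora(w: Matrix, a: Matrix, b: Matrix, scale: float = 1.0) -> Matrix:
    """Fold a low-rank update into the base weight: W' = W + scale * (B @ A).

    w: (d_out x d_in) base weight, b: (d_out x r), a: (r x d_in).
    """
    delta = matmul(b, a)
    return [
        [w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
        for w_row, d_row in zip(w, delta)
    ]
```

After merging, applying `W'` to an input gives the same result as applying `W` and the adapter path separately and summing them, so inference runs with the original architecture.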

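[NLP-9] TAMA: A Human-AI Collaborative Thematic Analysis Framework Using Multi-Agent LLMs for Clinical Interviews

[Quick Read]: This paper addresses the high resource cost and low efficiency of traditional thematic analysis (TA) in healthcare. It proposes TAMA, a human-AI collaborative TA framework built on multi-agent large language models (LLMs). The framework uses structured conversations between agents and coordinates the expertise of cardiac experts. It is evaluated on interview transcripts from parents of children with a rare congenital heart disease, Anomalous Aortic Origin of a Coronary Artery (AAOCA). TAMA outperforms existing LLM-assisted TA approaches on thematic hit rate, coverage, and distinctiveness. Its main contribution is improving analysis quality through human-AI collaboration while substantially reducing manual workload.

Link: https://arxiv.org/abs/2503.20666
Authors: Huimin Xu, Seungjun Yi, Terence Lim, Jiawei Xu, Andrew Well, Carlos Mery, Aidong Zhang, Yuji Zhang, Heng Ji, Keshav Pingali, Yan Leng, Ying Ding
Affiliations: Unknown
Categories: Human-Computer Interaction (cs.HC); Computation and Language (cs.CL)
Comments: Submitted to the American Medical Informatics Association (AMIA) 2025 Annual Symposium, 10 pages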

Abstract: Thematic analysis (TA) is a widely used qualitative approach for uncovering latent meanings in unstructured text data. TA provides valuable insights in healthcare but is resource-intensive. Large Language Models (LLMs) have been introduced to perform TA, yet their applications in healthcare remain unexplored. Here, we propose TAMA: A Human-AI Collaborative Thematic Analysis framework using Multi-Agent LLMs for clinical interviews. We leverage the scalability and coherence of multi-agent systems through structured conversations between agents and coordinate the expertise of cardiac experts in TA. Using interview transcripts from parents of children with Anomalous Aortic Origin of a Coronary Artery (AAOCA), a rare congenital heart disease, we demonstrate that TAMA outperforms existing LLM-assisted TA approaches, achieving higher thematic hit rate, coverage, and distinctiveness. TAMA demonstrates strong potential for automated TA in clinical settings by leveraging multi-agent LLM systems with human-in-the-loop integration, enhancing quality while significantly reducing manual workload.

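[NLP-10] TN-Eval: Rubric and Evaluation Protocols for Measuring the Quality of Behavioral Therapy Notes

[Quick Read]: This paper addresses the underdeveloped quality standards for behavioral therapy notes, which matter for both legal compliance and patient care. The key contribution is a comprehensive rubric for evaluating therapy notes along three dimensions: completeness, conciseness, and faithfulness. The study also extends a public dataset of behavioral health conversations with therapist-written notes and LLM-generated notes, and applies the evaluation framework to measure their quality. A rubric-based manual evaluation protocol replaces traditional Likert-scale annotation and gives more reliable and interpretable results.

Link: https://arxiv.org/abs/2503.20648
Authors: Raj Sanjay Shah, Lei Xu, Qianchu Liu, Jon Burnsky, Drew Bertagnolli, Chaitanya Shivade
Affiliations: Georgia Institute of Technology; AWS AI Labs; OneMedical
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: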

Abstract: Behavioral therapy notes are important for both legal compliance and patient care. Unlike progress notes in physical health, quality standards for behavioral therapy notes remain underdeveloped. To address this gap, we collaborated with licensed therapists to design a comprehensive rubric for evaluating therapy notes across key dimensions: completeness, conciseness, and faithfulness. Further, we extend a public dataset of behavioral health conversations with therapist-written notes and LLM-generated notes, and apply our evaluation framework to measure their quality. We find that: (1) A rubric-based manual evaluation protocol offers more reliable and interpretable results than traditional Likert-scale annotations. (2) LLMs can mimic human evaluators in assessing completeness and conciseness but struggle with faithfulness. (3) Therapist-written notes often lack completeness and conciseness, while LLM-generated notes contain hallucinations. Surprisingly, in a blind test, therapists prefer and judge LLM-generated notes to be superior to therapist-written notes.

[NLP-11] Unlocking Efficient Long-to-Short LLM Reasoning with Model Merging

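[Quick Read]: This paper addresses the efficiency loss caused by overthinking as large language models (LLMs) move from System 1 to System 2 reasoning. System 2 reasoning has improved the handling of complex tasks, but it often generates redundant reasoning steps without a proportional gain in output quality. The paper studies Long-to-Short (L2S) reasoning, which balances reasoning depth against practical efficiency. The key approach is model merging, which combines the quick thinking of System 1 models with the methodical reasoning of System 2 models. Across task-vector-based, SVD-based, and activation-informed merging methods, the study shows that merging can cut average response length by up to 55% while preserving or even improving baseline performance. It also finds a strong correlation between model scale and merging efficacy, and examines the merged model's ability to self-critique, self-correct, and adapt response length to task complexity. The results suggest that model merging is an efficient and effective L2S paradigm that mitigates overthinking while retaining the robustness of System 2 reasoning.

Link: https://arxiv.org/abs/2503.20641
Authors: Han Wu, Yuxuan Yao, Shuqi Liu, Zehua Liu, Xiaojin Fu, Xiongwei Han, Xing Li, Hui-Ling Zhen, Tao Zhong, Mingxuan Yuan
Affiliations: Huawei Noah’s Ark Lab
Categories: Computation and Language (cs.CL)
Comments: Work in progress; technical report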

Abstract: The transition from System 1 to System 2 reasoning in large language models (LLMs) has marked significant advancements in handling complex tasks through deliberate, iterative thinking. However, this progress often comes at the cost of efficiency, as models tend to overthink, generating redundant reasoning steps without proportional improvements in output quality. Long-to-Short (L2S) reasoning has emerged as a promising solution to this challenge, aiming to balance reasoning depth with practical efficiency. While existing approaches, such as supervised fine-tuning (SFT), reinforcement learning (RL), and prompt engineering, have shown potential, they are either computationally expensive or unstable. Model merging, on the other hand, offers a cost-effective and robust alternative by integrating the quick-thinking capabilities of System 1 models with the methodical reasoning of System 2 models. In this work, we present a comprehensive empirical study on model merging for L2S reasoning, exploring diverse methodologies, including task-vector-based, SVD-based, and activation-informed merging. Our experiments reveal that model merging can reduce average response length by up to 55% while preserving or even improving baseline performance. We also identify a strong correlation between model scale and merging efficacy with extensive evaluations on 1.5B/7B/14B/32B models. Furthermore, we investigate the merged model’s ability to self-critique and self-correct, as well as its adaptive response length based on task complexity. Our findings highlight model merging as a highly efficient and effective paradigm for L2S reasoning, offering a practical solution to the overthinking problem while maintaining the robustness of System 2 reasoning. This work can be found on GitHub at this https URL.
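
Task-vector-based merging, one of the families named in the abstract, can be illustrated with plain task arithmetic: each fine-tuned model contributes its difference from a shared base, scaled by a weight. The sketch below is a generic version on dictionaries of flat parameter lists. The parameter names, the shared-base assumption, and the weights are hypothetical and do not reproduce the paper's specific merging recipes.

```python
from typing import Dict, List, Sequence

Params = Dict[str, List[float]]


def merge_task_vectors(
    base: Params, experts: Sequence[Params], weights: Sequence[float]
) -> Params:
    """Task arithmetic: merged = base + sum_i weight_i * (expert_i - base)."""
    merged: Params = {}
    for name, base_values in base.items():
        merged[name] = [
            b + sum(lam * (expert[name][i] - b) for expert, lam in zip(experts, weights))
            for i, b in enumerate(base_values)
        ]
    return merged
```

With a quick-thinking model and a long-reasoning model as the two experts, the weights control how far the merged parameters move toward each behavior. Setting one weight to 1 and the other to 0 recovers that expert exactly.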

[NLP-12] PVLens: Enhancing Pharmacovigilance Through Automated Label Extraction

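[Quick Read]: This paper addresses the problem that existing drug safety reference databases, such as SIDER, are outdated and static. The solution is PVLens, an automated system that extracts labeled safety information from U.S. Food and Drug Administration (FDA) Structured Product Labels (SPLs) and maps the terms to MedDRA, combined with a web-based expert review tool so that automation works alongside human oversight. In validation, PVLens achieved an F1 score of 0.882 with a recall of 0.983, showing its potential to make pharmacovigilance more timely and accurate.

Link: https://arxiv.org/abs/2503.20639
Authors: Jeffery L Painter, Gregory E Powell, Andrew Bate
Affiliations: GlaxoSmithKline (GSK)
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: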

Abstract: Reliable drug safety reference databases are essential for pharmacovigilance, yet existing resources like SIDER are outdated and static. We introduce PVLens, an automated system that extracts labeled safety information from FDA Structured Product Labels (SPLs) and maps terms to MedDRA. PVLens integrates automation with expert oversight through a web-based review tool. In validation against 97 drug labels, PVLens achieved an F1 score of 0.882, with high recall (0.983) and moderate precision (0.799). By offering a scalable, more accurate and continuously updated alternative to SIDER, PVLens enhances real-time pharmacovigilance with improved accuracy and contemporaneous insights.
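
The three reported numbers can be cross-checked, since F1 is the harmonic mean of precision and recall. A short sketch, assuming the reported precision and recall are aggregated the same way as the reported F1 (the abstract does not say whether scores are micro- or macro-averaged):

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Values reported in the abstract.
f1 = f1_score(precision=0.799, recall=0.983)
print(round(f1, 3))
```

The result is about 0.8815, which agrees with the reported 0.882 after rounding.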

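[NLP-13] Collaborative Storytelling and LLM: A Linguistic Analysis of Automatically-Generated Role-Playing Game Sessions

[Quick Read]: This paper investigates to what extent the language of large language models (LLMs) shows oral or written features when they generate a role-playing game (RPG) session without human intervention. It runs a linguistic analysis of the lexical and syntactic features of the generated texts and compares the results with conversations, transcripts of human RPG sessions, and books. The study finds that LLMs show a pattern distinct from all other text categories, including oral conversations, human RPG sessions, and books. The analysis shows how training shapes the way LLMs express themselves and gives useful indications of their narrative capabilities.

Link: https://arxiv.org/abs/2503.20623
Authors: Alessandro Maisto
Affiliations: Università degli Studi di Salerno
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 17 pages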

Abstract: Role-playing games (RPGs) are games in which players interact with one another to create narratives. The role of players in the RPG is largely based on the interaction between players and their characters. This emerging form of shared narrative, primarily oral, is receiving increasing attention. In particular, many authors have investigated the use of an LLM as an actor in the game. In this paper, we aim to discover to what extent the language of Large Language Models (LLMs) exhibits oral or written features when asked to generate an RPG session without human interference. We will conduct a linguistic analysis of the lexical and syntactic features of the generated texts and compare the results with analyses of conversations, transcripts of human RPG sessions, and books. We found that LLMs exhibit a pattern that is distinct from all other text categories, including oral conversations, human RPG sessions and books. Our analysis has shown how training influences the way LLMs express themselves and provides important indications of the narrative capabilities of these tools.

[NLP-14] Synthetic Data Augmentation for Cross-domain Implicit Discourse Relation Recognition

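[Quick Read]: This paper studies whether synthetic data augmentation with large language models (LLMs) can improve cross-domain performance on Implicit Discourse Relation Recognition (IDRR). The approach uses unlabelled target-domain data and has LLMs generate text continuations that follow a specified coherence relation, in order to adapt a base model trained on labelled source-domain data. The experiments show that different variants of this approach did not bring significant improvements. The paper concludes that LLMs often fail to generate useful samples for IDRR, and it stresses the importance of considering both statistical significance and comparability when evaluating IDRR models.

Link: https://arxiv.org/abs/2503.20588
Authors: Frances Yung, Varsha Suresh, Zaynab Reza, Mansoor Ahmad, Vera Demberg
Affiliations: Saarland University, Germany
Categories: Computation and Language (cs.CL)
Comments: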

Abstract: Implicit discourse relation recognition (IDRR) – the task of identifying the implicit coherence relation between two text spans – requires deep semantic understanding. Recent studies have shown that zero- or few-shot approaches significantly lag behind supervised models, but LLMs may be useful for synthetic data augmentation, where LLMs generate a second argument following a specified coherence relation. We applied this approach in a cross-domain setting, generating discourse continuations using unlabelled target-domain data to adapt a base model which was trained on source-domain labelled data. Evaluations conducted on a large-scale test set revealed that different variations of the approach did not result in any significant improvements. We conclude that LLMs often fail to generate useful samples for IDRR, and emphasize the importance of considering both statistical significance and comparability when evaluating IDRR models.

[NLP-15] Optimizing Case-Based Reasoning System for Functional Test Script Generation with Large Language Models

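[Quick Read]: This paper addresses functional test script generation with large language models (LLMs), which requires understanding the dynamically evolving code structure of the target software. The key solution is a case-based reasoning (CBR) system built on a 4R cycle (retrieve, reuse, revise, retain). The system maintains and uses a case bank of test intent descriptions and their corresponding test scripts to help LLMs generate test scripts. To improve user experience further, the paper introduces Re4, an optimization method that comprises reranking-based retrieval finetuning and reinforced reuse finetuning. It first identifies positive examples with high semantic and script similarity, which provide reliable pseudo-labels for finetuning the retriever without costly manual labeling. It then applies supervised finetuning followed by reinforcement learning finetuning to align LLMs with production scenarios, and it helps alleviate repetitive generation by LLMs.

Link: https://arxiv.org/abs/2503.20576
Authors: Siyuan Guo, Huiwu Liu, Xiaolong Chen, Yuming Xie, Liang Zhang, Tao Han, Hechang Chen, Yi Chang, Jun Wang
Affiliations: Jilin University; Huawei Datacom; University College London
Categories: Software Engineering (cs.SE); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: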

Abstract: In this work, we explore the potential of large language models (LLMs) for generating functional test scripts, which necessitates understanding the dynamically evolving code structure of the target software. To achieve this, we propose a case-based reasoning (CBR) system utilizing a 4R cycle (i.e., retrieve, reuse, revise, and retain), which maintains and leverages a case bank of test intent descriptions and corresponding test scripts to facilitate LLMs for test script generation. To improve user experience further, we introduce Re4, an optimization method for the CBR system, comprising reranking-based retrieval finetuning and reinforced reuse finetuning. Specifically, we first identify positive examples with high semantic and script similarity, providing reliable pseudo-labels for finetuning the retriever model without costly labeling. Then, we apply supervised finetuning, followed by a reinforcement learning finetuning stage, to align LLMs with our production scenarios, ensuring the faithful reuse of retrieved cases. Extensive experimental results on two product development units from Huawei Datacom demonstrate the superiority of the proposed CBR+Re4. Notably, we also show that the proposed Re4 method can help alleviate the repetitive generation issues with LLMs.
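
The 4R cycle can be sketched as a small loop over a case bank. This is only an outline of the control flow under stated assumptions. The token-overlap retriever stands in for the finetuned retriever model, and `generate` and `validate` are hypothetical callbacks standing in for the LLM call and the review step. None of these names come from the paper.

```python
from typing import Callable, Dict, List, Tuple

Case = Dict[str, str]  # {"intent": ..., "script": ...}


def jaccard(a: str, b: str) -> float:
    """Token-overlap similarity, a simple stand-in for a learned retriever."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0


def retrieve(case_bank: List[Case], intent: str, k: int = 2) -> List[Case]:
    """Retrieve: the k cases whose intent is most similar to the query."""
    ranked = sorted(case_bank, key=lambda c: jaccard(intent, c["intent"]), reverse=True)
    return ranked[:k]


def run_cycle(
    case_bank: List[Case],
    intent: str,
    generate: Callable[[str, List[Case]], str],
    validate: Callable[[str], Tuple[str, bool]],
) -> str:
    cases = retrieve(case_bank, intent)  # Retrieve similar past cases
    draft = generate(intent, cases)      # Reuse them to draft a new script
    script, accepted = validate(draft)   # Revise the draft (e.g. human review)
    if accepted:
        # Retain the accepted case so later queries can reuse it.
        case_bank.append({"intent": intent, "script": script})
    return script
```

The retain step is what lets the case bank track software that keeps changing: each accepted script becomes retrievable context for later intents.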

[NLP-16] Low-resource Information Extraction with the European Clinical Case Corpus

[Quick Read]: This paper addresses the scarcity of multilingual clinical case data in the medical domain and presents E3C-3.0, a multilingual dataset annotated with diseases and test-result relations. The key to the approach is a semi-automatic pipeline that combines automatic annotation projection based on large language models (LLMs) with human revision, which yields high-quality multilingual annotations efficiently. Fine-tuning current state-of-the-art LLMs on the dataset confirms its usefulness, and the experiments show that cross-lingual transfer learning is effective in mitigating data scarcity.

Link: https://arxiv.org/abs/2503.20568
Authors: Soumitra Ghosh, Begona Altuna, Saeed Farzi, Pietro Ferrazzi, Alberto Lavelli, Giulia Mezzanotte, Manuela Speranza, Bernardo Magnini
Affiliations: Facebook AI Research (Meta); University of the Basque Country; Fondazione Bruno Kessler (FBK); Università degli Studi di Padova
Categories: Computation and Language (cs.CL)
Comments:

Abstract: We present E3C-3.0, a multilingual dataset in the medical domain, comprising clinical cases annotated with diseases and test-result relations. The dataset includes both native texts in five languages (English, French, Italian, Spanish and Basque) and texts translated and projected from the English source into five target languages (Greek, Italian, Polish, Slovak, and Slovenian). A semi-automatic approach has been implemented, including automatic annotation projection based on Large Language Models (LLMs) and human revision. We present several experiments showing that current state-of-the-art LLMs can benefit from being fine-tuned on the E3C-3.0 dataset. We also show that transfer learning in different languages is very effective, mitigating the scarcity of data. Finally, we compare performance both on native data and on projected data. We release the data at this https URL.

[NLP-17] A Retrieval-Based Approach to Medical Procedure Matching in Romanian

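[Quick Read]: This paper addresses the accurate mapping of medical procedure names from healthcare providers to the standardized terminology used by insurance companies. Inconsistent naming conventions lead to misclassified procedures, which cause administrative inefficiencies and insurance claim problems. The task is especially hard in low-resource languages such as Romanian, where existing pretrained language models lack domain-specific adaptation to medical text. The paper proposes a retrieval-based architecture that uses sentence embeddings for medical name matching, and it evaluates multiple embedding models (Romanian, multilingual, and medical-domain-specific) to identify the most effective solution for the task.

Link: https://arxiv.org/abs/2503.20556
Authors: Andrei Niculae, Adrian Cosma, Emilian Radoi
Affiliations: National University of Science and Technology POLITEHNICA Bucharest
Categories: Computation and Language (cs.CL)
Comments: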

点击查看摘要

Abstract:Accurately mapping medical procedure names from healthcare providers to standardized terminology used by insurance companies is a crucial yet complex task. Inconsistencies in naming conventions lead to missclasified procedures, causing administrative inefficiencies and insurance claim problems in private healthcare settings. Many companies still use human resources for manual mapping, while there is a clear opportunity for automation. This paper proposes a retrieval-based architecture leveraging sentence embeddings for medical name matching in the Romanian healthcare system. This challenge is significantly more difficult in underrepresented languages such as Romanian, where existing pretrained language models lack domain-specific adaptation to medical text. We evaluate multiple embedding models, including Romanian, multilingual, and medical-domain-specific representations, to identify the most effective solution for this task. Our findings contribute to the broader field of medical NLP for low-resource languages such as Romanian.

[NLP-18] Accelerate Parallelizable Reasoning via Parallel Decoding within One Sequence

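[Quick read]: This paper tackles the high computational cost and latency that reasoning models incur when they generate long, detailed reasoning sequences for complex tasks such as mathematical reasoning. The key idea is to exploit the inherent parallelism of certain tasks: when multiple parallel reasoning branches exist, the method decodes multiple tokens per step (using a specially designed attention mask) and processes them within a single sequence, thereby accelerating reasoning. Experiments show more than a 100% speedup in decoding time while largely maintaining accuracy.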

Link: https://arxiv.org/abs/2503.20533
Authors: Yijiong Yu
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: Our code is available in this https URL

Abstract:Recent advances in reasoning models have demonstrated significant improvements in accuracy, particularly for complex tasks such as mathematical reasoning, by employing detailed and comprehensive reasoning processes. However, generating these lengthy reasoning sequences is computationally expensive and time-consuming. To address this inefficiency, we leverage the inherent parallelizability of certain tasks to accelerate the reasoning process. Specifically, when multiple parallel reasoning branches exist, we decode multiple tokens per step using a specialized attention mask, processing them within a single sequence. Experimental results show that our method achieves over 100% speedup in decoding time while basically maintaining accuracy.
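The "specialized attention mask" is not spelled out in the abstract, so the following is only a sketch of one plausible form, under stated assumptions: a shared prefix is followed by the tokens of each branch laid out contiguously, and each token may attend to the prefix and to earlier tokens of its own branch but never to another branch. The function name and the layout are hypothetical.

```python
def branch_attention_mask(prefix_len, branch_lens):
    """Build a boolean mask where mask[i][j] is True if position i may attend to j.

    Assumed layout: [prefix tokens][branch 0 tokens][branch 1 tokens]...
    """
    # branch_id is -1 for prefix positions, otherwise the index of the branch.
    branch_id = [-1] * prefix_len
    for index, length in enumerate(branch_lens):
        branch_id.extend([index] * length)

    total = len(branch_id)
    mask = [[False] * total for _ in range(total)]
    for i in range(total):
        for j in range(i + 1):  # causal: only current and earlier positions
            if branch_id[j] == -1 or branch_id[j] == branch_id[i]:
                mask[i][j] = True
    return mask

# Two prefix tokens followed by two branches of two tokens each.
mask = branch_attention_mask(2, [2, 2])
```

With such a mask the branches do not see each other, so one forward pass can extend every branch by one token at the same step; how the paper actually orders or interleaves branch tokens may differ.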

[NLP-19] StableToolBench-MirrorAPI: Modeling Tool Environments as Mirrors of 7000 Real-World APIs

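[Quick read]: This paper addresses the difficulty existing tool environments have in balancing stability, scalability, and realness, particularly for benchmarking. It proposes MirrorAPI, a framework whose key idea is to train specialized large language models (LLMs), via supervised fine-tuning and chain-of-thought reasoning, to accurately simulate real API responses, so that they act as "mirrors" of tool environments and provide highly realistic simulation.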

Link: https://arxiv.org/abs/2503.20527
Authors: Zhicheng Guo,Sijie Cheng,Yuchen Niu,Hao Wang,Sicheng Zhou,Wenbing Huang,Yang Liu
Affiliations: Dept. of Comp. Sci. & Tech., Institute for AI, Tsinghua University; Institute for AI Industry Research (AIR), Tsinghua University; RayNeo; Google; The University of Toronto, Canada; Gaoling School of Artificial Intelligence, Renmin University of China
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:The rapid advancement of large language models (LLMs) has spurred significant interest in tool learning, where LLMs are augmented with external tools to tackle complex tasks. However, existing tool environments face challenges in balancing stability, scalability, and realness, particularly for benchmarking purposes. To address this problem, we propose MirrorAPI, a novel framework that trains specialized LLMs to accurately simulate real API responses, effectively acting as “mirrors” to tool environments. Using a comprehensive dataset of request-response pairs from 7,000+ APIs, we employ supervised fine-tuning and chain-of-thought reasoning to enhance simulation fidelity. MirrorAPI achieves superior accuracy and stability compared to state-of-the-art methods, as demonstrated by its performance on the newly constructed MirrorAPI-Bench and its integration into StableToolBench.

[NLP-20] Exploring the Effect of Robotic Embodiment and Empathetic Tone of LLMs on Empathy Elicitation

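[Quick read]: This paper studies how to elicit empathy toward a third party through interaction with social agents. A physical robot or a voice-enabled chatbot, each driven by a large language model (LLM), was programmed to use either an empathetic or a neutral tone; the study then measured participants' willingness to help a fictional character, Katie Banks (as the number of hours they were willing to volunteer), and their perceptions of the agent. Neither robotic embodiment nor empathetic tone significantly affected willingness to volunteer: although the LLM simulated human empathy effectively, eliciting genuine empathetic responses from participants remained challenging.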

Link: https://arxiv.org/abs/2503.20518
Authors: Liza Darwesh,Jaspreet Singh,Marin Marian,Eduard Alexa,Koen Hindriks,Kim Baraka
Affiliations: Unknown
Categories: Human-Computer Interaction (cs.HC); Computation and Language (cs.CL); Robotics (cs.RO)
Comments: Liza Darwesh, Jaspreet Singh, Marin Marian, and Eduard Alexa contributed equally to this work.

Abstract:This study investigates the elicitation of empathy toward a third party through interaction with social agents. Participants engaged with either a physical robot or a voice-enabled chatbot, both driven by a large language model (LLM) programmed to exhibit either an empathetic tone or remain neutral. The interaction is focused on a fictional character, Katie Banks, who is in a challenging situation and in need of financial donations. The willingness to help Katie, measured by the number of hours participants were willing to volunteer, along with their perceptions of the agent, were assessed for 60 participants. Results indicate that neither robotic embodiment nor empathetic tone significantly influenced participants’ willingness to volunteer. While the LLM effectively simulated human empathy, fostering genuine empathetic responses in participants proved challenging.

[NLP-21] Explainable ICD Coding via Entity Linking

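[Quick read]: This paper addresses the fact that traditional automated clinical coding methods do not give medical coders sufficiently explicit evidence in production environments. Coders must ensure that at least one explicit passage in the input health record justifies each assigned code. The paper therefore reframes clinical coding as an entity linking problem, annotating each document with its codes and their textual evidence to enable better human-machine collaboration. The key to the solution is parameter-efficient fine-tuning of large language models (LLMs) combined with constrained decoding, yielding three approaches that are effective at disambiguating clinical mentions and perform well in few-shot scenarios.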

Link: https://arxiv.org/abs/2503.20508
Authors: Leonor Barreiros,Isabel Coutinho,Gonçalo M. Correia,Bruno Martins
Affiliations: Priberam Labs; Instituto Superior Técnico; INESC-ID
Categories: Computation and Language (cs.CL)
Comments: Accepted at CL4Health at NAACL 2025

Abstract:Clinical coding is a critical task in healthcare, although traditional methods for automating clinical coding may not provide sufficient explicit evidence for coders in production environments. This evidence is crucial, as medical coders have to make sure there exists at least one explicit passage in the input health record that justifies the attribution of a code. We therefore propose to reframe the task as an entity linking problem, in which each document is annotated with its set of codes and respective textual evidence, enabling better human-machine collaboration. By leveraging parameter-efficient fine-tuning of Large Language Models (LLMs), together with constrained decoding, we introduce three approaches to solve this problem that prove effective at disambiguating clinical mentions and that perform well in few-shot scenarios.

[NLP-22] Enhancing Depression Detection via Question-wise Modality Fusion

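[Quick read]: This paper addresses the delayed treatment and heavy use of human resources caused by depression diagnosis that relies on questionnaires or clinician interviews, and proposes a new method for automated diagnosis from multimodal data. Existing methods usually overlook (i) that each modality contributes differently to each question in the questionnaire and (ii) that the task should be treated as ordinal classification, which leads to sub-optimal fusion and training. The paper proposes the Question-wise Modality Fusion (QuestMF) framework, trained with a novel Imbalanced Ordinal Log-Loss (ImbOLL) function. The key elements are question-aware modality fusion and the ordinal loss; the framework performs comparably to the current state of the art on the E-DAIC dataset and improves interpretability by predicting a score for each question, helping clinicians identify individual symptoms and tailor interventions. The code for QuestMF is publicly available.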

Link: https://arxiv.org/abs/2503.20496
Authors: Aishik Mandal,Dana Atzil-Slonim,Thamar Solorio,Iryna Gurevych
Affiliations: Ubiquitous Knowledge Processing Lab (UKP Lab), Technische Universität Darmstadt; Department of Psychology, Bar-Ilan University; MBZUAI (Mohammed bin Zayed University of Artificial Intelligence)
Categories: Computation and Language (cs.CL)
Comments: 18 pages, 5 figures, The 10th Workshop on Computational Linguistics and Clinical Psychology

Abstract:Depression is a highly prevalent and disabling condition that incurs substantial personal and societal costs. Current depression diagnosis involves determining the depression severity of a person through self-reported questionnaires or interviews conducted by clinicians. This often leads to delayed treatment and involves substantial human resources. Thus, several works try to automate the process using multimodal data. However, they usually overlook the following: i) The variable contribution of each modality for each question in the questionnaire and ii) Using ordinal classification for the task. This results in sub-optimal fusion and training methods. In this work, we propose a novel Question-wise Modality Fusion (QuestMF) framework trained with a novel Imbalanced Ordinal Log-Loss (ImbOLL) function to tackle these issues. The performance of our framework is comparable to the current state-of-the-art models on the E-DAIC dataset and enhances interpretability by predicting scores for each question. This will help clinicians identify an individual’s symptoms, allowing them to customise their interventions accordingly. We also make the code for the QuestMF framework publicly available.

[NLP-23] VPO: Aligning Text-to-Video Generation Models with Prompt Optimization

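[Quick read]: This paper addresses the drop in video quality that text-to-video models suffer because real user inputs at inference time (concise, vague, or poorly structured prompts) differ from the highly detailed, carefully crafted text-video pairs used in training. Current methods that refine prompts with large language models (LLMs) through in-context learning have several limitations: they may distort user intent, omit critical details, or introduce safety risks, and they do not account for the effect of the refined prompt on final video quality.

To address these issues, the paper proposes VPO, a principled prompt optimization framework built on three core principles: harmlessness, accuracy, and helpfulness. VPO uses a two-stage optimization approach: it first constructs and refines a supervised fine-tuning (SFT) dataset based on principles of safety and alignment, and then introduces text-level and video-level feedback to further optimize the SFT model with preference learning. Experiments show that VPO significantly improves the safety, alignment, and overall quality of generated videos, generalizes well across video generation models, and can outperform, or be combined with, RLHF methods.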

Link: https://arxiv.org/abs/2503.20491
Authors: Jiale Cheng,Ruiliang Lyu,Xiaotao Gu,Xiao Liu,Jiazheng Xu,Yida Lu,Jiayan Teng,Zhuoyi Yang,Yuxiao Dong,Jie Tang,Hongning Wang,Minlie Huang
Affiliations: The Conversational Artificial Intelligence (CoAI) Group, Tsinghua University; Zhipu AI; The Knowledge Engineering Group (KEG), Tsinghua University
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Abstract:Video generation models have achieved remarkable progress in text-to-video tasks. These models are typically trained on text-video pairs with highly detailed and carefully crafted descriptions, while real-world user inputs during inference are often concise, vague, or poorly structured. This gap makes prompt optimization crucial for generating high-quality videos. Current methods often rely on large language models (LLMs) to refine prompts through in-context learning, but suffer from several limitations: they may distort user intent, omit critical details, or introduce safety risks. Moreover, they optimize prompts without considering the impact on the final video quality, which can lead to suboptimal results. To address these issues, we introduce VPO, a principled framework that optimizes prompts based on three core principles: harmlessness, accuracy, and helpfulness. The generated prompts faithfully preserve user intents and, more importantly, enhance the safety and quality of generated videos. To achieve this, VPO employs a two-stage optimization approach. First, we construct and refine a supervised fine-tuning (SFT) dataset based on principles of safety and alignment. Second, we introduce both text-level and video-level feedback to further optimize the SFT model with preference learning. Our extensive experiments demonstrate that VPO significantly improves safety, alignment, and video quality compared to baseline methods. Moreover, VPO shows strong generalization across video generation models. Furthermore, we demonstrate that VPO could outperform and be combined with RLHF methods on video generation models, underscoring the effectiveness of VPO in aligning video generation models. Our code and data are publicly available at this https URL.

[NLP-24] TempTest: Local Normalization Distortion and the Detection of Machine-generated Text

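[Quick read]: This paper addresses a limitation of existing zero-shot methods for detecting machine-generated text: as language models approximate the distribution of human text ever more closely, detectors built on log-likelihood, log-rank, and entropy will become less effective. The authors propose a detection method that is entirely agnostic of the generating language model. Its key idea is to exploit a defect in the way decoding strategies such as temperature or top-k sampling normalize conditional probability measures. The method has a rigorous theoretical justification and is conceptually distinct from existing detectors. It is evaluated in white-box and black-box settings, under paraphrasing attacks, and for potential bias against non-native speakers; across language models, datasets, and passage lengths it performs at least comparably to state-of-the-art detectors and in some cases better.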

Link: https://arxiv.org/abs/2503.20421
Authors: Tom Kempton,Stuart Burrell,Connor Cheverall
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG); Dynamical Systems (math.DS)
Comments:

Abstract:Existing methods for the zero-shot detection of machine-generated text are dominated by three statistical quantities: log-likelihood, log-rank, and entropy. As language models mimic the distribution of human text ever closer, this will limit our ability to build effective detection algorithms. To combat this, we introduce a method for detecting machine-generated text that is entirely agnostic of the generating language model. This is achieved by targeting a defect in the way that decoding strategies, such as temperature or top-k sampling, normalize conditional probability measures. This method can be rigorously theoretically justified, is easily explainable, and is conceptually distinct from existing methods for detecting machine-generated text. We evaluate our detector in the white and black box settings across various language models, datasets, and passage lengths. We also study the effect of paraphrasing attacks on our detector and the extent to which it is biased against non-native speakers. In each of these settings, the performance of our test is at least comparable to that of other state-of-the-art text detectors, and in some cases, we strongly outperform these baselines.
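As background for the normalization step this abstract refers to, the sketch below applies temperature scaling and top-k truncation to a next-token distribution and reports how much probability mass survives truncation. It is not the paper's detector; the function name and the toy logits are assumptions for illustration only.

```python
import math

def temperature_top_k(logits, temperature, k):
    """Return (renormalized top-k distribution, probability mass that was kept)."""
    scaled = {tok: value / temperature for tok, value in logits.items()}
    peak = max(scaled.values())  # subtract the max for numerical stability
    weights = {tok: math.exp(s - peak) for tok, s in scaled.items()}
    total = sum(weights.values())
    probs = {tok: w / total for tok, w in weights.items()}

    # Keep the k most probable tokens and renormalize so they sum to 1.
    kept = sorted(probs, key=probs.get, reverse=True)[:k]
    kept_mass = sum(probs[tok] for tok in kept)
    truncated = {tok: probs[tok] / kept_mass for tok in kept}
    return truncated, kept_mass

# Toy logits for two different contexts: one peaked, one flat.
peaked_dist, peaked_mass = temperature_top_k({"a": 5.0, "b": 1.0, "c": 0.0}, 1.0, 2)
flat_dist, flat_mass = temperature_top_k({"a": 1.0, "b": 1.0, "c": 1.0}, 1.0, 2)
```

The point of the example is that the renormalization factor `1 / kept_mass` is computed separately at every step and differs between contexts, so the sampled text does not follow a single globally rescaled version of the model's distribution.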

[NLP-25] CFunModel: A “Funny” Language Model Capable of Chinese Humor Generation and Processing

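[Quick read]: This paper addresses the poor performance of large language models (LLMs) in processing and generating Chinese humor. The key contribution is the Chinese Fun Set (CFunSet), a comprehensive Chinese humor dataset of more than 160,000 entries that aggregates existing Chinese humor datasets and adds over 20,000 jokes collected from Tieba-JokeBar, a well-known Chinese joke-sharing platform. Using this dataset, the authors develop the Chinese Fun Model (CFunModel), the first large language model designed for a range of Chinese humor tasks, including Crosstalk Response Selection, Humor Recognition, and Joke Generation. Experiments show that CFunModel outperforms popular LLMs on these tasks.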

Link: https://arxiv.org/abs/2503.20417
Authors: Zhenghan Yu,Xinyu Hu,Xiaojun Wan
Affiliations: Wangxuan Institute of Computer Technology, Peking University
Categories: Computation and Language (cs.CL)
Comments: 9 pages

Abstract:Humor plays a significant role in daily language communication. With the rapid development of large language models (LLMs), natural language processing has made significant strides in understanding and generating various genres of texts. However, most LLMs exhibit poor performance in generating and processing Chinese humor. In this study, we introduce a comprehensive Chinese humor-related dataset, the Chinese Fun Set (CFunSet). This dataset aggregates existing Chinese humor datasets and includes over 20,000 jokes collected from Tieba-JokeBar, a Chinese online platform known for joke sharing. The resulting corpus comprises more than 160,000 entries. Leveraging CFunSet, we developed the Chinese Fun Model (CFunModel), the first large language model designed to handle various Chinese humor-related tasks including Crosstalk Response Selection, Humor Recognition, Joke Generation, etc. Experimental results demonstrate that CFunModel outperforms popular large language models in these tasks. Our CFunSet is available at this https URL and CFunModel is available at this https URL. A demonstration video of our work is available at this https URL.

[NLP-26] VideoGEM: Training-free Action Grounding in Videos

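[Quick read]: This paper addresses spatial localization of actions and events in videos, extending the training-free grounding abilities that vision-language foundation models already show on images. The key is VideoGEM, which adapts the self-self attention formulation of GEM on pretrained image- and video-language backbones, adds a layer weighting that prioritizes higher layers (where high-level semantic concepts emerge), and uses dynamic weighting to capture each layer's relevance to a specific prompt. It also decomposes prompts, processing action, verb, and object prompts separately, which further improves spatial localization. Experiments show that this training-free method outperforms current trained state-of-the-art approaches on several datasets.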

Link: https://arxiv.org/abs/2503.20348
Authors: Felix Vogel,Walid Bousselham,Anna Kukleva,Nina Shvetsova,Hilde Kuehne
Affiliations: Goethe University Frankfurt; Tuebingen AI Center/University of Tuebingen; MPI for Informatics, SIC; MIT-IBM Watson AI Lab
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Abstract:Vision-language foundation models have shown impressive capabilities across various zero-shot tasks, including training-free localization and grounding, primarily focusing on localizing objects in images. However, leveraging those capabilities to localize actions and events in videos is challenging, as actions have less physical outline and are usually described by higher-level concepts. In this work, we propose VideoGEM, the first training-free spatial action grounding method based on pretrained image- and video-language backbones. Namely, we adapt the self-self attention formulation of GEM to spatial activity grounding. We observe that high-level semantic concepts, such as actions, usually emerge in the higher layers of the image- and video-language models. We, therefore, propose a layer weighting in the self-attention path to prioritize higher layers. Additionally, we introduce a dynamic weighting method to automatically tune layer weights to capture each layer's relevance to a specific prompt. Finally, we introduce a prompt decomposition, processing action, verb, and object prompts separately, resulting in a better spatial localization of actions. We evaluate the proposed approach on three image- and video-language backbones, CLIP, OpenCLIP, and ViCLIP, and on four video grounding datasets, V-HICO, DALY, YouCook-Interactions, and GroundingYouTube, showing that the proposed training-free approach is able to outperform current trained state-of-the-art approaches for spatial video grounding.

[NLP-27] Iterative Prompting with Persuasion Skills in Jailbreaking Large Language Models

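[Quick read]: This paper studies how an iterative prompting technique can increase the effectiveness of jailbreaking attacks on large language models (LLMs). The key is to modify and refine prompts systematically: by analyzing the response patterns of GPT-3.5, GPT-4, LLaMa2, Vicuna, and ChatGLM, prompts are adjusted to evade the models' ethical and security constraints. Persuasion strategies are used to make prompts more effective while keeping them consistent with the malicious intent. The attack success rate (ASR) rises as prompts are refined, reaching 90% for GPT-4 and ChatGLM at the high end and 68% for LLaMa2 at the low end. The technique outperforms the baselines PAIR and PAP in ASR and performs comparably to GCG and ArtPrompt.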

Link: https://arxiv.org/abs/2503.20320
Authors: Shih-Wen Ke,Guan-Yu Lai,Guo-Lin Fang,Hsi-Yuan Kao
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET)
Comments:

Abstract:Large language models (LLMs) are designed to align with human values in their responses. This study exploits LLMs with an iterative prompting technique where each prompt is systematically modified and refined across multiple iterations to enhance its effectiveness in jailbreaking attacks progressively. This technique involves analyzing the response patterns of LLMs, including GPT-3.5, GPT-4, LLaMa2, Vicuna, and ChatGLM, allowing us to adjust and optimize prompts to evade the LLMs’ ethical and security constraints. Persuasion strategies enhance prompt effectiveness while maintaining consistency with malicious intent. Our results show that the attack success rates (ASR) increase as the attacking prompts become more refined with the highest ASR of 90% for GPT4 and ChatGLM and the lowest ASR of 68% for LLaMa2. Our technique outperforms baseline techniques (PAIR and PAP) in ASR and shows comparable performance with GCG and ArtPrompt.

[NLP-28] A Multilingual Culture-First Approach to Addressing Misgendering in LLM Applications

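[Quick read]: This paper addresses misgendering, that is, referring to someone by a gender that does not match their identity, across 42 languages and dialects with differing cultural and grammatical constructs. The key to the solution is a participatory-design approach for developing effective and appropriate guardrails for all languages, validated with a human-in-the-loop process in a large language model-based application (meeting transcript summarization). The proposed guardrails substantially reduce misgendering rates in the multilingual summaries without loss of quality, demonstrating a feasible way to scale inclusive and responsible AI across languages and cultures.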

Link: https://arxiv.org/abs/2503.20302
Authors: Sunayana Sitaram,Adrian de Wynter,Isobel McCrum,Qilong Gu,Si-Qing Chen
Affiliations: Microsoft Research India; The University of York; Microsoft
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Misgendering is the act of referring to someone by a gender that does not match their chosen identity. It marginalizes and undermines a person’s sense of self, causing significant harm. English-based approaches have clear-cut approaches to avoiding misgendering, such as the use of the pronoun “they”. However, other languages pose unique challenges due to both grammatical and cultural constructs. In this work we develop methodologies to assess and mitigate misgendering across 42 languages and dialects using a participatory-design approach to design effective and appropriate guardrails across all languages. We test these guardrails in a standard large language model-based application (meeting transcript summarization), where both the data generation and the annotation steps followed a human-in-the-loop approach. We find that the proposed guardrails are very effective in reducing misgendering rates across all languages in the summaries generated, and without incurring loss of quality. Our human-in-the-loop approach demonstrates a method to feasibly scale inclusive and responsible AI-based solutions across multiple languages and cultures.

[NLP-29] sudo rm -rf agentic_security

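[Quick read]: This paper addresses the security risks that arise when large language models (LLMs) are deployed as computer-use agents in real desktop or web environments. As agents act more autonomously, the mechanisms by which they refuse harmful requests can be bypassed. The paper presents an attack framework, SUDO (Screen-based Universal Detox2Tox Offense). Its core mechanism, Detox2Tox, turns harmful requests that would be refused into seemingly benign ones via detoxification, obtains detailed instructions from advanced vision-language models (VLMs), and reintroduces the malicious content via toxification just before execution. Unlike conventional jailbreaks, SUDO iteratively refines its attacks based on built-in refusal feedback, which makes it more effective against robust policy filters. The key elements are thus the Detox2Tox mechanism and feedback-driven iterative refinement, which together expose and quantify weaknesses in existing safeguards.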

Link: https://arxiv.org/abs/2503.20279
Authors: Sejin Lee,Jian Kim,Haon Park,Ashkan Yousefpour,Sangyoon Yu,Min Song
Affiliations: Aim Intelligence; Yonsei University; Seoul National University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Comments:

Abstract:Large Language Models (LLMs) are increasingly deployed as computer-use agents, autonomously performing tasks within real desktop or web environments. While this evolution greatly expands practical use cases for humans, it also creates serious security exposures. We present SUDO (Screen-based Universal Detox2Tox Offense), a novel attack framework that systematically bypasses refusal trained safeguards in commercial computer-use agents, such as Claude Computer Use. The core mechanism, Detox2Tox, transforms harmful requests (that agents initially reject) into seemingly benign requests via detoxification, secures detailed instructions from advanced vision language models (VLMs), and then reintroduces malicious content via toxification just before execution. Unlike conventional jailbreaks, SUDO iteratively refines its attacks based on a built-in refusal feedback, making it increasingly effective against robust policy filters. In extensive tests spanning 50 real-world tasks and multiple state-of-the-art VLMs, SUDO achieves a stark attack success rate of 24% (with no refinement), and up to 41% (by its iterative refinement) in Claude Computer Use. By revealing these vulnerabilities and demonstrating the ease with which they can be exploited in real-world computing environments, this paper highlights an immediate need for robust, context-aware safeguards. WARNING: This paper includes harmful or offensive model outputs.

[NLP-30] ViLBench: A Suite for Vision-Language Process Reward Modeling

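[Quick read]: Process-supervised reward models (PRMs) provide fine-grained feedback for selecting reasoning trajectories on complex tasks, but their evaluation remains underexplored, especially in the multimodal domain. Moreover, when current vision large language models (VLLMs) are used as output reward models (ORMs) or process reward models (PRMs), neither type is consistently better, and stronger VLLMs do not necessarily make better reward models. The paper therefore introduces ViLBench, a benchmark designed to require intensive process reward signals, and explores how to narrow the gap between general VLLMs and reward models.

The key to the solution is to collect large-scale vision-language process reward data (73.6K examples) using an enhanced tree-search algorithm. A 3B model trained on this data achieves an average improvement of 3.3% over standard Chain-of-Thought (CoT) on ViLBench, and up to 2.5% over its untrained counterpart. This is a preliminary demonstration that better reward data collection and training can improve VLLMs as reward models on multimodal tasks.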

Link: https://arxiv.org/abs/2503.20271
Authors: Haoqin Tu,Weitao Feng,Hardy Chen,Hui Liu,Xianfeng Tang,Cihang Xie
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments:

Abstract:Process-supervised reward models serve as a fine-grained function that provides detailed step-wise feedback to model responses, facilitating effective selection of reasoning trajectories for complex tasks. Despite its advantages, evaluation on PRMs remains less explored, especially in the multimodal domain. To address this gap, this paper first benchmarks current vision large language models (VLLMs) as two types of reward models: output reward models (ORMs) and process reward models (PRMs) on multiple vision-language benchmarks, which reveal that neither ORM nor PRM consistently outperforms across all tasks, and superior VLLMs do not necessarily yield better rewarding performance. To further advance evaluation, we introduce ViLBench, a vision-language benchmark designed to require intensive process reward signals. Notably, OpenAI’s GPT-4o with Chain-of-Thought (CoT) achieves only 27.3% accuracy, indicating the benchmark’s challenge for current VLLMs. Lastly, we preliminarily showcase a promising pathway towards bridging the gap between general VLLMs and reward models – by collecting 73.6K vision-language process reward data using an enhanced tree-search algorithm, our 3B model is able to achieve an average improvement of 3.3% over standard CoT and up to 2.5% compared to its untrained counterpart on ViLBench by selecting OpenAI o1’s generations. We release the implementations at this https URL with our code, model, and data.

[NLP-31] TeleLoRA: Teleporting Model-Specific Alignment Across LLMs

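[Quick read]: This paper addresses the mitigation of Trojans (backdoor attacks) in large language models (LLMs), a setting in which alignment data is typically model-specific. It proposes TeleLoRA (Teleporting Low-Rank Adaptation), a framework whose key idea is to synergize model-specific alignment data across multiple LLMs so that Trojans can be mitigated zero-shot on unseen LLMs without additional alignment data. TeleLoRA learns a unified generator of LoRA adapter weights from local activation information across multiple LLMs; the generator is designed to be permutation symmetric so that it generalizes across models of different architectures and sizes. The design is also optimized for memory efficiency, making training with large-scale LLMs feasible with minimal computational resources. Experiments show that TeleLoRA reduces attack success rates while preserving the models' benign performance.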

Link: https://arxiv.org/abs/2503.20228
Authors: Xiao Lin,Manoj Acharya,Anirban Roy,Susmit Jha
Affiliations: SRI International
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments:

Abstract:Mitigating Trojans in Large Language Models (LLMs) is one of many tasks where alignment data is LLM specific, as different LLMs have different Trojan triggers and trigger behaviors to be removed. In this paper, we introduce TeleLoRA (Teleporting Low-Rank Adaptation), a novel framework that synergizes model-specific alignment data across multiple LLMs to enable zero-shot Trojan mitigation on unseen LLMs without alignment data. TeleLoRA learns a unified generator of LoRA adapter weights by leveraging local activation information across multiple LLMs. This generator is designed to be permutation symmetric to generalize across models with different architectures and sizes. We optimize the model design for memory efficiency, making it feasible to learn with large-scale LLMs with minimal computational resources. Experiments on LLM Trojan mitigation benchmarks demonstrate that TeleLoRA effectively reduces attack success rates while preserving the benign performance of the models.
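For readers unfamiliar with the adapters whose weights TeleLoRA generates, the sketch below shows a standard LoRA forward pass, in which a frozen weight matrix W is augmented by a scaled low-rank product B·A. It illustrates plain LoRA only, not the TeleLoRA generator; the helper names and the tiny matrices are hypothetical.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

def lora_forward(x, w, a, b, alpha):
    """Compute W x + (alpha / r) * B (A x), where r is the adapter rank.

    w: d_out x d_in frozen weights; a: r x d_in; b: d_out x r.
    """
    rank = len(a)
    base = matvec(w, x)
    low_rank = matvec(b, matvec(a, x))
    scale = alpha / rank
    return [base_i + scale * low_i for base_i, low_i in zip(base, low_rank)]

# Tiny example: identity base weights with a rank-1 adapter.
w = [[1, 0], [0, 1]]
a = [[1, 0]]          # 1 x 2
b = [[0], [1]]        # 2 x 1
adapted = lora_forward([3, 4], w, a, b, alpha=2)
unchanged = lora_forward([3, 4], w, a, [[0], [0]], alpha=2)
```

Setting B to zero, as in the second call, leaves the base model's output untouched, which is why LoRA adapters are conventionally initialized that way before training.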

[NLP-32] Advancements in Natural Language Processing: Exploring Transformer-Based Architectures for Text Understanding

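[Quick read]: This paper surveys advances in transformer-based natural language processing (NLP) models such as BERT and GPT on text understanding tasks, and contrasts them with traditional methods such as recurrent neural networks (RNNs), which struggle with long-range dependencies, conditional shifts, and overlapping classes. It outlines a methodology covering data preparation, model selection, pretraining, fine-tuning, and evaluation, and uses probability density functions of text length distributions and visualizations of feature-space classification to illustrate how transformers handle long-range dependencies, adapt to conditional shifts, and extract features for classification. The reported results reach F1 scores above 90% on benchmarks such as GLUE and SQuAD, but high computational cost remains a challenge; the paper therefore points to efficiency optimization and multimodal integration as future directions.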

Link: https://arxiv.org/abs/2503.20227
Authors: Tianhao Wu,Yu Wang,Ngoc Quach
Affiliations: Tianhao Wu is an independent researcher in San Jose, CA 94088; OceanAI; Pittsburg, CA 94565
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: This paper has been accepted by the 5th International Conference on Artificial Intelligence and Industrial Technology Applications (AIITA 2025)

Abstract:Natural Language Processing (NLP) has witnessed a transformative leap with the advent of transformer-based architectures, which have significantly enhanced the ability of machines to understand and generate human-like text. This paper explores the advancements in transformer models, such as BERT and GPT, focusing on their superior performance in text understanding tasks compared to traditional methods like recurrent neural networks (RNNs). By analyzing statistical properties through visual representations, including probability density functions of text length distributions and feature space classifications, the study highlights the models’ proficiency in handling long-range dependencies, adapting to conditional shifts, and extracting features for classification, even with overlapping classes. Drawing on recent 2024 research, including enhancements in multi-hop knowledge graph reasoning and context-aware chat interactions, the paper outlines a methodology involving data preparation, model selection, pretraining, fine-tuning, and evaluation. The results demonstrate state-of-the-art performance on benchmarks like GLUE and SQuAD, with F1 scores exceeding 90%, though challenges such as high computational costs persist. This work underscores the pivotal role of transformers in modern NLP and suggests future directions, including efficiency optimization and multimodal integration, to further advance language-based AI systems.

[NLP-33] Qwen2.5-Omni Technical Report

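[Quick read]: This report addresses multimodal perception and generation, specifically how to perceive text, images, audio, and video efficiently while generating text and natural speech responses in a streaming manner. The key solutions are: 1) audio and visual encoders use block-wise processing to support streaming multimodal input; 2) a time-aligned multimodal rotary position embedding (TMRoPE) synchronizes video and audio timestamps; 3) the Thinker-Talker architecture pairs a text-generating LLM (Thinker) with a dual-track autoregressive model (Talker) that consumes the Thinker's hidden representations to produce audio tokens, so that text and speech are generated concurrently without interference between modalities; 4) a sliding-window DiT decoder restricts the receptive field to reduce the initial packet delay. With these, Qwen2.5-Omni achieves state-of-the-art results on multimodal benchmarks, and its end-to-end speech instruction following is comparable to its performance with text inputs.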

Link: https://arxiv.org/abs/2503.20215
Authors: Jin Xu,Zhifang Guo,Jinzheng He,Hangrui Hu,Ting He,Shuai Bai,Keqin Chen,Jialin Wang,Yang Fan,Kai Dang,Bin Zhang,Xiong Wang,Yunfei Chu,Junyang Lin
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments:

Abstract:In this report, we present Qwen2.5-Omni, an end-to-end multimodal model designed to perceive diverse modalities, including text, images, audio, and video, while simultaneously generating text and natural speech responses in a streaming manner. To enable the streaming of multimodal information inputs, both audio and visual encoders utilize a block-wise processing approach. To synchronize the timestamps of video inputs with audio, we organize the audio and video sequentially in an interleaved manner and propose a novel position embedding approach, named TMRoPE (Time-aligned Multimodal RoPE). To concurrently generate text and speech while avoiding interference between the two modalities, we propose the Thinker-Talker architecture. In this framework, Thinker functions as a large language model tasked with text generation, while Talker is a dual-track autoregressive model that directly utilizes the hidden representations from the Thinker to produce audio tokens as output. Both the Thinker and Talker models are designed to be trained and inferred in an end-to-end manner. For decoding audio tokens in a streaming manner, we introduce a sliding-window DiT that restricts the receptive field, aiming to reduce the initial package delay. Qwen2.5-Omni is comparable with the similarly sized Qwen2.5-VL and outperforms Qwen2-Audio. Furthermore, Qwen2.5-Omni achieves state-of-the-art performance on multimodal benchmarks like Omni-Bench. Notably, Qwen2.5-Omni’s performance in end-to-end speech instruction following is comparable to its capabilities with text inputs, as evidenced by benchmarks such as MMLU and GSM8K. As for speech generation, Qwen2.5-Omni’s streaming Talker outperforms most existing streaming and non-streaming alternatives in robustness and naturalness.

[NLP-34] Dolphin: A Large-Scale Automatic Speech Recognition Model for Eastern Languages

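[Quick read]: This paper addresses the limited accuracy of multilingual automatic speech recognition (ASR) models under broad language coverage, focusing on 40 Eastern languages across East Asia, South Asia, Southeast Asia, and the Middle East, plus 22 Chinese dialects. The key to the solution is to extend the Whisper architecture and combine in-house proprietary datasets with open-source datasets to optimize the Dolphin model. Experiments show that Dolphin significantly outperforms current state-of-the-art open-source models across various languages.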

Link: https://arxiv.org/abs/2503.20212
Authors: Yangyang Meng,Jinpeng Li,Guodong Lin,Yu Pu,Guanbo Wang,Hu Du,Zhiming Shao,Yukai Huang,Ke Li,Wei-Qiang Zhang
Affiliations: Speech and Audio Technology Lab, Dept. EE, Tsinghua University
Categories: Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
Comments:

Abstract:This report introduces Dolphin, a large-scale multilingual automatic speech recognition (ASR) model that extends the Whisper architecture to support a wider range of languages. Our approach integrates in-house proprietary and open-source datasets to refine and optimize Dolphin’s performance. The model is specifically designed to achieve notable recognition accuracy for 40 Eastern languages across East Asia, South Asia, Southeast Asia, and the Middle East, while also supporting 22 Chinese dialects. Experimental evaluations show that Dolphin significantly outperforms current state-of-the-art open-source models across various languages. To promote reproducibility and community-driven innovation, we are making our trained models and inference source code publicly available.

[NLP-35] SARGes: Semantically Aligned Reliable Gesture Generation via Intent Chain

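Quick read: This paper tackles the challenging problem of generating semantically meaningful gestures synchronized with speech. The key to the solution is a new framework named SARGes, which uses large language models (LLMs) to parse speech content and generate reliable semantic gesture labels; these labels then guide the synthesis of meaningful co-speech gestures. Specifically, the paper builds a comprehensive co-speech gesture ethogram and develops an LLM-based intent chain reasoning mechanism that systematically parses and decomposes gesture semantics into structured inference steps following the ethogram criteria, effectively guiding LLMs to generate context-aware gesture labels. It also constructs an intent chain-annotated text-to-gesture label dataset and trains a lightweight gesture label generation model to guide the generation of credible and semantically coherent co-speech gestures. Experimental results show that SARGes achieves highly semantically aligned gesture labeling (50.2% accuracy) with efficient single-pass inference (0.4 seconds). The method provides an interpretable intent reasoning pathway for semantic gesture synthesis.

Link: https://arxiv.org/abs/2503.20202
Authors: Nan Gao,Yihua Bao,Dongdong Weng,Jiayi Zhao,Jia Li,Yan Zhou,Pengfei Wan,Di Zhang
Affiliations: Institute of Automation, Chinese Academy of Sciences; Beijing Engineering Research Center of Mixed Reality and Advanced Display, Beijing Institute of Technology; Kuaishou Technology
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Robotics (cs.RO)
Comments: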

Abstract:Co-speech gesture generation enhances human-computer interaction realism through speech-synchronized gesture synthesis. However, generating semantically meaningful gestures remains a challenging problem. We propose SARGes, a novel framework that leverages large language models (LLMs) to parse speech content and generate reliable semantic gesture labels, which subsequently guide the synthesis of meaningful co-speech gestures. Specifically, we constructed a comprehensive co-speech gesture ethogram and developed an LLM-based intent chain reasoning mechanism that systematically parses and decomposes gesture semantics into structured inference steps following ethogram criteria, effectively guiding LLMs to generate context-aware gesture labels. Subsequently, we constructed an intent chain-annotated text-to-gesture label dataset and trained a lightweight gesture label generation model, which then guides the generation of credible and semantically coherent co-speech gestures. Experimental results demonstrate that SARGes achieves highly semantically-aligned gesture labeling (50.2% accuracy) with efficient single-pass inference (0.4 seconds). The proposed method provides an interpretable intent reasoning pathway for semantic gesture synthesis.

[NLP-36] Open Deep Search: Democratizing Search with Open-source Reasoning Agents

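Quick read: This paper aims to narrow the performance gap between proprietary search AI solutions (such as Perplexity's Sonar Reasoning Pro and OpenAI's GPT-4o Search Preview) and their open-source counterparts. Its key innovation is to augment the reasoning capabilities of the latest open-source large language models (LLMs) with reasoning agents that can judiciously use web search tools to answer queries. Concretely, Open Deep Search (ODS) consists of two components, the Open Search Tool and the Open Reasoning Agent, which work with a base LLM chosen by the user. The Open Reasoning Agent interprets the task and completes it by orchestrating a sequence of actions that includes tool calls, while the Open Search Tool is a novel web search tool that outperforms proprietary counterparts. Combined with powerful open-source reasoning LLMs such as DeepSeek-R1, ODS nearly matches, and in some cases surpasses, the existing state-of-the-art baselines on two benchmarks, SimpleQA and FRAMES. For example, on FRAMES, ODS improves on the best existing baseline, the recently released GPT-4o Search Preview, by 9.7% in accuracy. ODS is also a general framework for seamlessly augmenting any LLM with search and reasoning capabilities to reach state-of-the-art performance; for example, it raises the accuracy of DeepSeek-R1 from 82.4% to 88.3% on SimpleQA and from 30.1% to 75.3% on FRAMES.

Link: https://arxiv.org/abs/2503.20201
Authors: Salaheddin Alzubi,Creston Brooks,Purva Chiniya,Edoardo Contente,Chiara von Gerlach,Lucas Irwin,Yihan Jiang,Arda Kaz,Windsor Nguyen,Sewoong Oh,Himanshu Tyagi,Pramod Viswanath
Affiliations: Sentient; University of Washington; Princeton University; UC Berkeley
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments: 27 pages, 8 figures, 4 tables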

Abstract:We introduce Open Deep Search (ODS) to close the increasing gap between the proprietary search AI solutions, such as Perplexity’s Sonar Reasoning Pro and OpenAI’s GPT-4o Search Preview, and their open-source counterparts. The main innovation introduced in ODS is to augment the reasoning capabilities of the latest open-source LLMs with reasoning agents that can judiciously use web search tools to answer queries. Concretely, ODS consists of two components that work with a base LLM chosen by the user: Open Search Tool and Open Reasoning Agent. Open Reasoning Agent interprets the given task and completes it by orchestrating a sequence of actions that includes calling tools, one of which is the Open Search Tool. Open Search Tool is a novel web search tool that outperforms proprietary counterparts. Together with powerful open-source reasoning LLMs, such as DeepSeek-R1, ODS nearly matches and sometimes surpasses the existing state-of-the-art baselines on two benchmarks: SimpleQA and FRAMES. For example, on the FRAMES evaluation benchmark, ODS improves the best existing baseline of the recently released GPT-4o Search Preview by 9.7% in accuracy. ODS is a general framework for seamlessly augmenting any LLMs – for example, DeepSeek-R1 that achieves 82.4% on SimpleQA and 30.1% on FRAMES – with search and reasoning capabilities to achieve state-of-the-art performance: 88.3% on SimpleQA and 75.3% on FRAMES.

[NLP-37] GAPO: Learning Preferential Prompt through Generative Adversarial Policy Optimization

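Quick read: This paper addresses the problem of precisely controlling the outputs of large language models (LLMs) through predefined constraints. Existing methods rely mainly on direct instruction-response synthesis or preferential response optimization, but they often struggle with constraint understanding and adaptation when handling fine-grained constraints, leading to hallucination or brittle performance. The key to the proposed solution is a new framework called Generative Adversarial Policy Optimization (GAPO). GAPO combines GAN-based training dynamics with an encoder-only reward model: adversarial training automatically generates training samples of varying difficulty, and the encoder-only architecture better captures prompt-response relationships, so the model progressively learns and adapts to increasingly complex constraints. Experiments show that GAPO performs strongly across multiple benchmarks and, particularly in scenarios requiring fine-grained constraint handling, significantly outperforms existing methods such as PPO, DPO, and KTO.

Link: https://arxiv.org/abs/2503.20194
Authors: Zhouhong Gu,Xingzhou Chen,Xiaoran Shi,Tao Wang,Suhang Zheng,Tianyu Li,Hongwei Feng,Yanghua Xiao
Affiliations: Shanghai Key Laboratory of Data Science, School of Computer Science, Fudan University; Alibaba Group
Categories: Computation and Language (cs.CL)
Comments: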

Abstract:Recent advances in large language models have highlighted the critical need for precise control over model outputs through predefined constraints. While existing methods attempt to achieve this through either direct instruction-response synthesis or preferential response optimization, they often struggle with constraint understanding and adaptation. This limitation becomes particularly evident when handling fine-grained constraints, leading to either hallucination or brittle performance. We introduce Generative Adversarial Policy Optimization (GAPO), a novel framework that combines GAN-based training dynamics with an encoder-only reward model to progressively learn and adapt to increasingly complex constraints. GAPO leverages adversarial training to automatically generate training samples of varying difficulty while utilizing the encoder-only architecture to better capture prompt-response relationships. Extensive experiments demonstrate GAPO’s superior performance across multiple benchmarks, particularly in scenarios requiring fine-grained constraint handling, where it significantly outperforms existing methods like PPO, DPO, and KTO. Our results suggest that GAPO’s unique approach to preferential prompt learning offers a more robust and effective solution for controlling LLM outputs. Code is available at this https URL.

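[NLP-38] Leveraging Implicit Sentiments: Enhancing Reliability and Validity in Psychological Trait Evaluation of LLMs

Quick read: This paper addresses the insufficient reliability and limited validity of psychological evaluations of large language models (LLMs). Current methods based on human psychological assessment instruments (such as the BFI) have difficulty accurately predicting how LLMs behave in real-world scenarios. The key to the solution is a new evaluation instrument called the Core Sentiment Inventory (CSI). CSI is a bilingual tool (covering English and Chinese) that implicitly evaluates a model's sentiment tendencies, providing an in-depth psychological portrait of LLMs across three dimensions: optimism, pessimism, and neutrality. Experimental results show that CSI not only captures nuanced emotional patterns effectively but also significantly improves the consistency of evaluation, and that its scores correlate with the sentiment of LLMs' real-world outputs at above 0.85.

Link: https://arxiv.org/abs/2503.20182
Authors: Huanhuan Ma,Haisong Gong,Xiaoyuan Yi,Xing Xie,Dongkuan Xu
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Code available via this https URL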

Abstract:Recent advancements in Large Language Models (LLMs) have led to their increasing integration into human life. With the transition from mere tools to human-like assistants, understanding their psychological aspects, such as emotional tendencies and personalities, becomes essential for ensuring their trustworthiness. However, current psychological evaluations of LLMs, often based on human psychological assessments like the BFI, face significant limitations. The results from these approaches often lack reliability and have limited validity when predicting LLM behavior in real-world scenarios. In this work, we introduce a novel evaluation instrument specifically designed for LLMs, called Core Sentiment Inventory (CSI). CSI is a bilingual tool, covering both English and Chinese, that implicitly evaluates models’ sentiment tendencies, providing an insightful psychological portrait of LLMs across three dimensions: optimism, pessimism, and neutrality. Through extensive experiments, we demonstrate that: 1) CSI effectively captures nuanced emotional patterns, revealing significant variation in LLMs across languages and contexts; 2) Compared to current approaches, CSI significantly improves reliability, yielding more consistent results; and 3) The correlation between CSI scores and the sentiment of LLM’s real-world outputs exceeds 0.85, demonstrating its strong validity in predicting LLM behavior. We make CSI publicly available via: this https URL.

[NLP-39] ProtoBERT-LoRA: Parameter-Efficient Prototypical Finetuning for Immunotherapy Study Identification

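Quick read: This paper addresses the identification of immune checkpoint inhibitor (ICI) studies in gene expression repositories such as the Gene Expression Omnibus (GEO), a task made difficult by semantic ambiguity, extreme class imbalance, and limited labeled data in low-resource settings. It proposes ProtoBERT-LoRA, a hybrid framework that combines PubMedBERT, prototypical networks, and Low-Rank Adaptation (LoRA). The model produces class-separable embeddings through prototype-based meta-learning while preserving biomedical domain knowledge. The key is the combination of prototypical networks with LoRA, which both mitigates class imbalance and markedly improves generalization in few-shot settings. ProtoBERT-LoRA achieves an F1-score of 0.624 on the test set (precision 0.481, recall 0.887), outperforming a rule-based system, machine learning baselines, and fine-tuned PubMedBERT, and in practical application it reduces manual review effort by 82%.

Link: https://arxiv.org/abs/2503.20179
Authors: Shijia Zhang,Xiyu Ding,Kai Ding,Jacob Zhang,Kevin Galinsky,Mengrui Wang,Ryan P. Mayers,Zheyu Wang,Hadi Kharrazi
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR); Quantitative Methods (q-bio.QM)
Comments: Submitted to AMIA 2025 Annual Symposium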

Abstract:Identifying immune checkpoint inhibitor (ICI) studies in genomic repositories like Gene Expression Omnibus (GEO) is vital for cancer research yet remains challenging due to semantic ambiguity, extreme class imbalance, and limited labeled data in low-resource settings. We present ProtoBERT-LoRA, a hybrid framework that combines PubMedBERT with prototypical networks and Low-Rank Adaptation (LoRA) for efficient fine-tuning. The model enforces class-separable embeddings via episodic prototype training while preserving biomedical domain knowledge. Our dataset was divided as: Training (20 positive, 20 negative), Prototype Set (10 positive, 10 negative), Validation (20 positive, 200 negative), and Test (71 positive, 765 negative). Evaluated on the test dataset, ProtoBERT-LoRA achieved an F1-score of 0.624 (precision: 0.481, recall: 0.887), outperforming the rule-based system, machine learning baselines and finetuned PubMedBERT. Application to 44,287 unlabeled studies reduced manual review efforts by 82%. Ablation studies confirmed that combining prototypes with LoRA improved performance by 29% over stand-alone LoRA.
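
The reported F1-score can be checked against the stated precision and recall, since F1 is their harmonic mean. The following is a minimal sketch for that arithmetic only; the function name `f1_score` is ours and is not taken from the paper's code.

```python
def f1_score(precision: float, recall: float) -> float:
    """Return the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0  # avoid division by zero when both are zero
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported in the abstract for the test set.
print(round(f1_score(0.481, 0.887), 3))  # 0.624
```

The low precision and high recall are consistent with a screening setting in which missed positives are costlier than extra manual checks.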

[NLP-40] Efficient Model Development through Fine-tuning Transfer

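Quick read: This paper addresses the challenge of efficiently updating modern large language models (LLMs): every time a new pretrained model version is released, the expensive alignment process must be repeated. The same challenge applies to domain-specific or language-specific models, where fine-tuning on specialized data has to be redone for each new base model version. To address this, the paper explores transferring fine-tuning updates between model versions. The key is to derive, from a source model version, a diff vector representing the weight changes due to fine-tuning and to apply it to the base model of a target version, which yields significant performance gains without additional training. The approach reduces training costs while maintaining model performance.

Link: https://arxiv.org/abs/2503.20110
Authors: Pin-Jie Lin,Rishab Balasubramanian,Fengyuan Liu,Nikhil Kandpal,Tu Vu
Affiliations: Virginia Tech; University of Toronto & Vector Institute
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 21 pages, 4 figures, 13 tables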

Abstract:Modern LLMs struggle with efficient updates, as each new pretrained model version requires repeating expensive alignment processes. This challenge also applies to domain- or language-specific models, where fine-tuning on specialized data must be redone for every new base model release. In this paper, we explore the transfer of fine-tuning updates between model versions. Specifically, we derive the diff vector from one source model version, which represents the weight changes from fine-tuning, and apply it to the base model of a different target version. Through empirical evaluations on various open-weight model versions, we show that transferring diff vectors can significantly improve the target base model, often achieving performance comparable to its fine-tuned counterpart. For example, reusing the fine-tuning updates from Llama 3.0 8B leads to an absolute accuracy improvement of 10.7% on GPQA over the base Llama 3.1 8B without additional training, surpassing Llama 3.1 8B Instruct. In a multilingual model development setting, we show that this approach can significantly increase performance on target-language tasks without retraining, achieving an absolute improvement of 4.7% and 15.5% on Global MMLU for Malagasy and Turkish, respectively, compared to Llama 3.1 8B Instruct. Our controlled experiments reveal that fine-tuning transfer is most effective when the source and target models are linearly connected in the parameter space. Additionally, we demonstrate that fine-tuning transfer offers a stronger and more computationally efficient starting point for further fine-tuning. Finally, we propose an iterative recycling-then-finetuning approach for continuous model development, which improves both efficiency and effectiveness. Our findings suggest that fine-tuning transfer is a viable strategy to reduce training costs while maintaining model performance.
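
The diff-vector idea described above can be sketched in a few lines: subtract the source base weights from the source fine-tuned weights, then add the result to the target base weights. This is only an illustration under stated assumptions: weights are plain Python lists keyed by parameter name, the source and target models share parameter names and shapes, and the helper names `diff_vector` and `apply_diff` are hypothetical rather than taken from the paper.

```python
from typing import Dict, List

Weights = Dict[str, List[float]]  # parameter name -> flat list of values


def diff_vector(finetuned: Weights, base: Weights) -> Weights:
    """Weight change introduced by fine-tuning: finetuned minus base."""
    return {
        name: [f - b for f, b in zip(finetuned[name], base[name])]
        for name in base
    }


def apply_diff(target_base: Weights, diff: Weights) -> Weights:
    """Add a diff vector to a different base model with matching shapes."""
    return {
        name: [w + d for w, d in zip(target_base[name], diff[name])]
        for name in target_base
    }


# Toy values standing in for real checkpoints (illustrative only).
source_base = {"layer.weight": [0.10, -0.20, 0.30]}
source_finetuned = {"layer.weight": [0.15, -0.25, 0.40]}
target_base = {"layer.weight": [0.12, -0.18, 0.33]}

delta = diff_vector(source_finetuned, source_base)
transferred = apply_diff(target_base, delta)
print([round(v, 2) for v in transferred["layer.weight"]])  # [0.17, -0.23, 0.43]
```

No gradient step is involved, which is why the abstract can describe the improvement as obtained without additional training.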

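[NLP-41] "Is There Anything Else?": Examining Administrator Influence on Linguistic Features from the Cookie Theft Picture Description Cognitive Test NAACL2025

Quick read: This paper aims to quantify the influence of test administrators on linguistic features in dementia language assessment and to examine how the "observer's effect" may interfere with downstream analysis. The study uses two English corpora of "Cookie Theft" picture description data collected at different locations, in which test administrators show different levels of involvement. The key is to compare how linguistic features vary with test administrator behavior, showing that these variations reflect not only patients' cognitive status but may also be partly attributable to differences in test administration practices. The study therefore stresses the need for more standardized test administration protocols to reduce systematic bias and to support the development of responsible clinical speech analytics frameworks.

Link: https://arxiv.org/abs/2503.20104
Authors: Changye Li,Zhecheng Sheng,Trevor Cohen,Serguei Pakhomov
Affiliations: University of Washington; University of Minnesota
Categories: Computation and Language (cs.CL)
Comments: Accepted to CMCL 2025 workshop, co-located with NAACL 2025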

Abstract:Alzheimer’s Disease (AD) dementia is a progressive neurodegenerative disease that negatively impacts patients’ cognitive ability. Previous studies have demonstrated that changes in naturalistic language samples can be useful for early screening of AD dementia. However, the nature of language deficits often requires test administrators to use various speech elicitation techniques during spontaneous language assessments to obtain enough propositional utterances from dementia patients. This could lead to an "observer's effect" on the downstream analysis that has not been fully investigated. Our study seeks to quantify the influence of test administrators on linguistic features in dementia assessment with two English corpora, the "Cookie Theft" picture description datasets, which were collected at different locations and whose test administrators show different levels of involvement. Our results show that the level of test administrator involvement significantly impacts observed linguistic features in patient speech. These results suggest that many of the significant linguistic features in the downstream classification task may be partially attributable to differences in the test administration practices rather than solely to participants’ cognitive status. The variations in test administrator behavior can lead to systematic biases in linguistic data, potentially confounding research outcomes and clinical assessments. Our study suggests that there is a need for a more standardized test administration protocol in the development of responsible clinical speech analytics frameworks.

[NLP-42] Bigger But Not Better: Small Neural Language Models Outperform Large Language Models in Detection of Thought Disorder NAACL2025

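Quick read: This paper addresses the limitations on the clinical use of large language models (LLMs) for assessing the severity of disorganized thinking in schizophrenia-spectrum disorders, including privacy concerns, high computational and financial costs, and the lack of transparency of training data. The key is to investigate whether smaller neural language models can serve as effective alternatives for detecting positive formal thought disorder, using the same sliding-window perplexity measurements. The study finds that smaller models are more sensitive to the linguistic differences associated with formal thought disorder, and that detection capability declines beyond a certain model size and context length, challenging the common "bigger is better" assumption. The findings generalize across audio diaries and clinical interview speech samples from individuals with psychotic symptoms, pointing to a promising direction for efficient, cost-effective, and privacy-preserving psychiatric screening tools.

Link: https://arxiv.org/abs/2503.20103
Authors: Changye Li,Weizhe Xu,Serguei Pakhomov,Ellen Bradley,Dror Ben-Zeev,Trevor Cohen
Affiliations: University of Washington; University of Minnesota; University of California San Francisco
Categories: Computation and Language (cs.CL)
Comments: Accepted to CL Psych 2025 workshop, co-located with NAACL 2025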

Abstract:Disorganized thinking is a key diagnostic indicator of schizophrenia-spectrum disorders. Recently, clinical estimates of the severity of disorganized thinking have been shown to correlate with measures of how difficult speech transcripts would be for large language models (LLMs) to predict. However, LLMs’ deployment challenges, including privacy concerns, computational and financial costs, and lack of transparency of training data, limit their clinical utility. We investigate whether smaller neural language models can serve as effective alternatives for detecting positive formal thought disorder, using the same sliding-window-based perplexity measurements that proved effective with larger models. Surprisingly, our results show that smaller models are more sensitive to linguistic differences associated with formal thought disorder than their larger counterparts. Detection capability declines beyond a certain model size and context length, challenging the common assumption of "bigger is better" for LLM-based applications. Our findings generalize across audio diaries and clinical interview speech samples from individuals with psychotic symptoms, suggesting a promising direction for developing efficient, cost-effective, and privacy-preserving screening tools that can be deployed in both clinical and naturalistic settings.
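
As a rough illustration of a sliding-window perplexity measurement, the sketch below scores each window as the exponential of the negative mean log-probability of its tokens. It assumes per-token natural-log probabilities have already been obtained from some language model; the window size, stride, and the function name `sliding_window_perplexity` are our own choices and may differ from the paper's setup.

```python
import math
from typing import List


def sliding_window_perplexity(
    token_logprobs: List[float], window: int, stride: int = 1
) -> List[float]:
    """Perplexity per window: exp of the negative mean log-probability."""
    if window <= 0 or stride <= 0:
        raise ValueError("window and stride must be positive")
    scores = []
    for start in range(0, len(token_logprobs) - window + 1, stride):
        chunk = token_logprobs[start:start + window]
        scores.append(math.exp(-sum(chunk) / window))
    return scores


# Toy log-probabilities; a real pipeline would take them from a model.
logprobs = [math.log(0.5), math.log(0.25), math.log(0.5), math.log(0.125)]
print(sliding_window_perplexity(logprobs, window=2))
```

Higher values mark stretches of the transcript that the model finds harder to predict.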

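[NLP-43] Generative Linguistics, Large Language Models, and the Social Nature of Scientific Success

Quick read: This paper asks why research on language models has succeeded while principled generative linguistics faces a crisis. The author argues that the crisis does not stem from insufficient formal or empirical rigor (as Chesi and Piantadosi suggest) but from the limited social ambitions of generative linguists. The paper argues that the fate of generative linguistics should not depend solely on its theoretical merits but is also tied to success at the social level. The key, therefore, is that generative linguists must not only raise the rigor of their research in response to Chesi's call but also broaden their ambitions by giving outside researchers a stake in the future of generative linguistics, thereby increasing its social influence and appeal.

Link: https://arxiv.org/abs/2503.20088
Authors: Sophie Hao
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: To appear in the Italian Journal of Linguistics. This is a response to Chesi (2024): arXiv:2412.12797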

Abstract:Chesi’s (forthcoming) target paper depicts a generative linguistics in crisis, foreboded by Piantadosi’s (2023) declaration that “modern language models refute Chomsky’s approach to language.” In order to survive, Chesi warns, generativists must hold themselves to higher standards of formal and empirical rigor. This response argues that the crisis described by Chesi and Piantadosi actually has little to do with rigor, but is rather a reflection of generativists’ limited social ambitions. Chesi ties the fate of generative linguistics to its intellectual merits, but the current success of language model research is social in nature as much as it is intellectual. In order to thrive, then, generativists must do more than heed Chesi’s call for rigor; they must also expand their ambitions by giving outsiders a stake in their future success.

[NLP-44] Cross-Tokenizer Distillation via Approximate Likelihood Matching

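Quick read: This paper addresses the limited applicability of existing distillation methods, which require the large language model (LLM) teacher and student to share the same tokenizer. Its key innovation is a cross-tokenizer distillation method that does not need a next-token prediction loss as the main objective; instead it performs pure distillation, maximizing the similarity of the student's predictions to the teacher's, while remaining robust to large mismatches between the teacher and student tokenizer functions and vocabularies. The method removes the dependence of conventional distillation on tokenizer agreement and substantially improves transfer across tokenizers.

Link: https://arxiv.org/abs/2503.20083
Authors: Benjamin Minixhofer,Edoardo Maria Ponti,Ivan Vulić
Affiliations: University of Cambridge; University of Edinburgh
Categories: Computation and Language (cs.CL)
Comments: Preprint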

Abstract:Distillation has shown remarkable success in transferring knowledge from a Large Language Model (LLM) teacher to a student LLM. However, current distillation methods predominantly require the same tokenizer between the teacher and the student, restricting their applicability to only a small subset of teacher-student pairs. In this work, we develop a cross-tokenizer distillation method to solve this crucial deficiency. Our method is the first to enable cross-tokenizer distillation without a next-token prediction loss as the main objective, instead purely maximizing the student predictions’ similarity to the teacher’s predictions (known as pure distillation), while also being robust to large mismatches between the teacher and the student tokenizer function and vocabulary. Empirically, our method enables substantially improved performance as tested on two use cases. First, we show that viewing tokenizer transfer as self-distillation enables unprecedentedly effective transfer across tokenizers. We transfer (subword-level) Llama and Gemma models to byte-level tokenization more effectively than prior methods transfer to a similar subword tokenizer under a comparable training budget. Transferring different base models to the same tokenizer also enables ensembling them (e.g., via averaging their predicted probabilities) which boosts performance. Second, we use our cross-tokenizer distillation method to distil a large maths-specialized LLM into a smaller model, achieving competitive maths problem-solving performance. Overall, our results make substantial strides toward better adaptability and enhanced interaction between different LLMs.
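
The ensembling step mentioned in the abstract (averaging predicted probabilities) only makes sense once the models share a vocabulary, which is what transferring them to a common tokenizer provides. A minimal sketch follows, assuming each model's next-token distribution is given as a dictionary over the same tokens; the function name `average_probabilities` and the toy distributions are illustrative, not from the paper.

```python
from typing import Dict, List


def average_probabilities(predictions: List[Dict[str, float]]) -> Dict[str, float]:
    """Average next-token distributions from models sharing one vocabulary."""
    if not predictions:
        raise ValueError("need at least one distribution")
    vocab = predictions[0].keys()
    if any(p.keys() != vocab for p in predictions):
        raise ValueError("all models must use the same vocabulary")
    n = len(predictions)
    return {token: sum(p[token] for p in predictions) / n for token in vocab}


# Toy distributions from two hypothetical models over a shared vocabulary.
model_a = {"a": 0.7, "b": 0.2, "c": 0.1}
model_b = {"a": 0.3, "b": 0.6, "c": 0.1}
ensemble = average_probabilities([model_a, model_b])
print(max(ensemble, key=ensemble.get))  # a
```

With different tokenizers the dictionaries would have different keys, and the check above would reject them.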

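[NLP-45] Poor Alignment and Steerability of Large Language Models: Evidence from College Admission Essays

Quick read: This paper addresses two important questions about the use of large language models (LLMs) for writing formal communication: model alignment, meaning the degree to which LLM writing matches that of particular groups of people, and model steerability, meaning whether prompts can effectively steer an LLM to imitate the linguistic patterns of a specific sociodemographic group. Using undergraduate admissions at a selective university as a high-stakes setting, the study compares lexical and sentence variation between essays written by 30,000 applicants and two types of LLM-generated essays: one prompted with only the essay question given to applicants, and another that additionally includes demographic information about each applicant. The key finding is that, regardless of the specific model or analytical approach, both types of LLM-generated essays are linguistically distinct from human-authored essays, and prompting with a specific sociodemographic identity fails to align the LLM with that group's linguistic patterns; this holds along the key dimensions of sex, race, first-generation status, and geographic location. Moreover, the demographically prompted synthetic texts are more similar to the unprompted synthetic texts than to the human texts, indicating that prompting does not alleviate the homogenization of synthetic text. These results point to alignment and steerability problems for current LLMs in high-stakes applications.

Link: https://arxiv.org/abs/2503.20062
Authors: Jinsook Lee,AJ Alvero,Thorsten Joachims,René Kizilcec
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: 48 pages, 10 figures, 6 tables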

Abstract:People are increasingly using technologies equipped with large language models (LLM) to write texts for formal communication, which raises two important questions at the intersection of technology and society: Who do LLMs write like (model alignment); and can LLMs be prompted to change who they write like (model steerability). We investigate these questions in the high-stakes context of undergraduate admissions at a selective university by comparing lexical and sentence variation between essays written by 30,000 applicants to two types of LLM-generated essays: one prompted with only the essay question used by the human applicants; and another with additional demographic information about each applicant. We consistently find that both types of LLM-generated essays are linguistically distinct from human-authored essays, regardless of the specific model and analytical approach. Further, prompting a specific sociodemographic identity is remarkably ineffective in aligning the model with the linguistic patterns observed in human writing from this identity group. This holds along the key dimensions of sex, race, first-generation status, and geographic location. The demographically prompted and unprompted synthetic texts were also more similar to each other than to the human text, meaning that prompting did not alleviate homogenization. These issues of model alignment and steerability in current LLMs raise concerns about the use of LLMs in high-stakes contexts.

[NLP-46] Low-resource Machine Translation for Code-switched Kazakh-Russian Language Pair

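Quick read: This paper addresses machine translation for low-resource language pairs that involve code switching. Without labeled data, and especially for a code-switched pair such as Kazakh-Russian, conventional machine translation methods face major difficulties. The key to the solution is to build the machine translation model from generated synthetic data rather than from labeled data. The paper also releases the first Kazakh-Russian code-switching parallel corpus and reports evaluation results: the proposed model reaches 16.48 BLEU, close to an existing commercial system, and beats it in human evaluation.

Link: https://arxiv.org/abs/2503.20007
Authors: Maksim Borisov,Zhanibek Kozhirbayev,Valentin Malykh
Affiliations: ITMO University; Nazarbayev University
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: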

Abstract:Machine translation for low-resource language pairs is a challenging task, and it can become extremely difficult once a speaker uses code switching. We propose a method to build a machine translation model for the code-switched Kazakh-Russian language pair with no labeled data. Our method is based on the generation of synthetic data. Additionally, we present the first code-switching Kazakh-Russian parallel corpus and the evaluation results, which include a model achieving 16.48 BLEU, almost reaching an existing commercial system and beating it in human evaluation.

[NLP-47] Untangling the Influence of Typology Data and Model Architecture on Ranking Transfer Languages for Cross-Lingual POS Tagging NAACL2025

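Quick read: This paper addresses transfer language selection in cross-lingual transfer learning: although cross-lingual transfer is valuable for overcoming data scarcity, choosing a suitable transfer language remains challenging, and the specific roles of linguistic typology, training data, and model architecture in that choice are not fully understood. The paper takes a holistic approach, examining how dataset-specific features and fine-grained typological features influence transfer language selection for part-of-speech tagging, and considers two different sources of morphosyntactic features. It extends previous work on bilingual biLSTMs to a more modern transfer learning pipeline: zero-shot prediction with pretrained multilingual models. The key is to train a series of transfer language ranking systems and evaluate how different feature inputs affect ranker performance across architectures; word overlap, type-token ratio, and genealogical distance emerge as the top features. The paper finds that combining typological and dataset-dependent features gives the best rankings, and that either feature group on its own can also perform well.

Link: https://arxiv.org/abs/2503.19979
Authors: Enora Rice,Ali Marashian,Hannah Haynie,Katharina von der Wense,Alexis Palmer
Affiliations: University of Colorado Boulder; Johannes Gutenberg University Mainz
Categories: Computation and Language (cs.CL)
Comments: Accepted to NAACL 2025 Workshop Language Models for Underserved Communities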

Abstract:Cross-lingual transfer learning is an invaluable tool for overcoming data scarcity, yet selecting a suitable transfer language remains a challenge. The precise roles of linguistic typology, training data, and model architecture in transfer language choice are not fully understood. We take a holistic approach, examining how both dataset-specific and fine-grained typological features influence transfer language selection for part-of-speech tagging, considering two different sources for morphosyntactic features. While previous work examines these dynamics in the context of bilingual biLSTMs, we extend our analysis to a more modern transfer learning pipeline: zero-shot prediction with pretrained multilingual models. We train a series of transfer language ranking systems and examine how different feature inputs influence ranker performance across architectures. Word overlap, type-token ratio, and genealogical distance emerge as top features across all architectures. Our findings reveal that a combination of typological and dataset-dependent features leads to the best rankings, and that good performance can be obtained with either feature group on its own.
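
Two of the dataset-dependent features named above are simple to compute from tokenized text. The sketch below is illustrative only: type-token ratio is the standard ratio of distinct types to tokens, while the word overlap formula used here (shared types divided by the sum of the two type counts) is one possible definition and an assumption on our part, since the abstract does not specify it.

```python
from typing import List


def type_token_ratio(tokens: List[str]) -> float:
    """Number of distinct word types divided by the number of tokens."""
    if not tokens:
        raise ValueError("tokens must be non-empty")
    return len(set(tokens)) / len(tokens)


def word_overlap(transfer_tokens: List[str], task_tokens: List[str]) -> float:
    """Shared types over the sum of both type counts (assumed definition)."""
    types_a, types_b = set(transfer_tokens), set(task_tokens)
    if not types_a and not types_b:
        raise ValueError("at least one dataset must be non-empty")
    return len(types_a & types_b) / (len(types_a) + len(types_b))


# Toy corpora standing in for a transfer-language and a task-language dataset.
transfer = "the cat sat on the mat".split()
task = "the dog sat".split()
print(type_token_ratio(transfer))
print(word_overlap(transfer, task))  # 0.25
```

Features like these would then be fed, together with typological features, to a ranking model that orders candidate transfer languages.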

[NLP-48] LogQuant: Log-Distributed 2-Bit Quantization of KV Cache with Superior Accuracy Preservation ICLR2025

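Quick read: This paper addresses the excessive memory consumed by the KV cache during large language model (LLM) inference while keeping performance from degrading significantly. Previous methods typically assume that later tokens are more important or try to predict important tokens from earlier attention patterns, which can cause performance bottlenecks or frequent mispredictions. LogQuant's key innovation is a log-based filtering mechanism that selectively compresses the KV cache across the entire context, achieving better performance with the same or an even smaller memory footprint than existing methods. In benchmark tests, LogQuant raises throughput by 25% and batch size by 60%, and at the same compression ratio it improves accuracy on challenging tasks such as math and code completion by up to 200%. The technique integrates with popular inference frameworks such as Python's transformers library, and an open-source implementation is available.

Link: https://arxiv.org/abs/2503.19950
Authors: Han Chen,Zicong Jiang,Zining Zhang,Bingsheng He,Pingyi Luo,Mian Lu,Yuqiang Chen
Affiliations: School of Computing, National University of Singapore; School of Electronic and Information Engineering, South China University of Technology; 4Paradigm
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Accepted by ICLR 2025 Workshop on Sparsity in LLMs (SLLM)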

Abstract:We introduce LogQuant, a groundbreaking 2-bit quantization technique for KV Cache in large language model (LLM) inference, delivering substantial memory savings while preserving superior performance. Previous methods either assume that later tokens are more important or attempt to predict important tokens based on earlier attention patterns. Both approaches, however, can result in performance bottlenecks or frequent mispredictions. LogQuant takes a different approach. By applying a log-based filtering mechanism, it selectively compresses the KV Cache across the entire context, achieving better performance with the same or even reduced memory footprint compared to existing methods. In benchmark tests, it enhances throughput by 25% and boosts batch size by 60% without increasing memory consumption. For challenging tasks such as Math and Code Completion, LogQuant improves accuracy by 40% to 200% at the same compression ratio, outperforming comparable techniques. LogQuant integrates effortlessly with popular inference frameworks like Python’s transformers library. The implementation is available at this https URL.
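
To make concrete what storing values in 2 bits means, the sketch below applies plain min-max uniform quantization to four levels (codes 0 to 3). This is a generic illustration and not LogQuant's method: the log-based filtering that decides which cache entries to compress is not reproduced here, and the function names are hypothetical.

```python
from typing import List, Tuple


def quantize_2bit(values: List[float]) -> Tuple[List[int], float, float]:
    """Map floats to codes 0..3 using a min-max uniform grid."""
    lo, hi = min(values), max(values)
    scale = (hi - lo) / 3 if hi > lo else 1.0  # 4 levels -> 3 intervals
    codes = [round((v - lo) / scale) for v in values]
    return codes, lo, scale


def dequantize_2bit(codes: List[int], lo: float, scale: float) -> List[float]:
    """Reconstruct approximate floats from 2-bit codes."""
    return [lo + c * scale for c in codes]


# Toy values standing in for one row of a key or value cache.
values = [-1.0, -0.2, 0.1, 2.0]
codes, lo, scale = quantize_2bit(values)
print(codes)                              # [0, 1, 1, 3]
print(dequantize_2bit(codes, lo, scale))  # [-1.0, 0.0, 0.0, 2.0]
```

The reconstruction error visible in the middle entries is the accuracy cost that selective schemes try to confine to less important tokens.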

[NLP-49] QualiSpeech: A Speech Quality Assessment Dataset with Natural Language Reasoning and Descriptions

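Quick read: This paper addresses the limitations of traditional numerical scoring for speech quality assessment, which lacks detailed description and in-depth analysis of quality problems such as noise and distortion. To fill the gap in natural language annotations in existing datasets, it introduces the QualiSpeech dataset, which covers 11 key aspects and includes detailed natural language comments with reasoning and contextual insights. It also introduces the QualiSpeech Benchmark to evaluate the low-level speech understanding capabilities of auditory large language models (LLMs). The key to the solution is that, with a high-quality dataset and evaluation benchmark, fine-tuned auditory LLMs can reliably generate detailed descriptions of noise and distortion and effectively identify their types and temporal characteristics, improving the accuracy and reliability of speech quality assessment.

Link: https://arxiv.org/abs/2503.20290
Authors: Siyin Wang,Wenyi Yu,Xianzhao Chen,Xiaohai Tian,Jun Zhang,Yu Tsao,Junichi Yamagishi,Yuxuan Wang,Chao Zhang
Affiliations: Tsinghua University; ByteDance; Academia Sinica; National Institute of Informatics
Categories: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Sound (cs.SD)
Comments: 23 pages, 16 figures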

Abstract:This paper explores a novel perspective on speech quality assessment by leveraging natural language descriptions, offering richer, more nuanced insights than traditional numerical scoring methods. Natural language feedback provides instructive recommendations and detailed evaluations, yet existing datasets lack the comprehensive annotations needed for this approach. To bridge this gap, we introduce QualiSpeech, a comprehensive low-level speech quality assessment dataset encompassing 11 key aspects and detailed natural language comments that include reasoning and contextual insights. Additionally, we propose the QualiSpeech Benchmark to evaluate the low-level speech understanding capabilities of auditory large language models (LLMs). Experimental results demonstrate that finetuned auditory LLMs can reliably generate detailed descriptions of noise and distortion, effectively identifying their types and temporal characteristics. The results further highlight the potential for incorporating reasoning to enhance the accuracy and reliability of quality assessments. The dataset will be released at this https URL.

Computer Vision

[CV-0] Free4D: Tuning-free 4D Scene Generation with Spatial-Temporal Consistency

【速读】：本文旨在解决单图像条件下4D场景生成的问题，现有方法要么局限于物体级别的生成，难以实现场景级生成，要么依赖大规模多视角视频数据集进行昂贵训练，由于4D场景数据稀缺，其泛化能力受限。为应对这些挑战，论文提出了一种无需调参的Free4D框架。其关键在于通过蒸馏预训练的基础模型来获得一致的4D场景表示，这带来了效率和泛化性的优势。具体而言，首先利用图像到视频扩散模型对输入图像进行动画处理并初始化4D几何结构；其次设计了一种自适应引导机制，结合点引导去噪策略实现空间一致性，并提出一种新颖的潜在替换策略以确保时间一致性；最后，通过基于调制的精炼方法缓解不一致性，同时充分利用生成的信息，从而将生成的观测值提升为一致的4D表示。最终得到的4D表示支持实时可控渲染，在基于单图像的4D场景生成领域实现了重要进展。

Link: https://arxiv.org/abs/2503.20785
Authors: Tianqi Liu, Zihao Huang, Zhaoxi Chen, Guangcong Wang, Shoukang Hu, Liao Shen, Huiqiang Sun, Zhiguo Cao, Wei Li, Ziwei Liu
Affiliations: Huazhong University of Science and Technology; S-Lab, Nanyang Technological University; Great Bay University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project Page: this https URL , Code: this https URL


Abstract:We present Free4D, a novel tuning-free framework for 4D scene generation from a single image. Existing methods either focus on object-level generation, making scene-level generation infeasible, or rely on large-scale multi-view video datasets for expensive training, with limited generalization ability due to the scarcity of 4D scene data. In contrast, our key insight is to distill pre-trained foundation models for consistent 4D scene representation, which offers promising advantages such as efficiency and generalizability. 1) To achieve this, we first animate the input image using image-to-video diffusion models followed by 4D geometric structure initialization. 2) To turn this coarse structure into spatial-temporal consistent multiview videos, we design an adaptive guidance mechanism with a point-guided denoising strategy for spatial consistency and a novel latent replacement strategy for temporal coherence. 3) To lift these generated observations into consistent 4D representation, we propose a modulation-based refinement to mitigate inconsistencies while fully leveraging the generated information. The resulting 4D representation enables real-time, controllable rendering, marking a significant advancement in single-image-based 4D scene generation.

[CV-1] FB-4D: Spatial-Temporal Coherent Dynamic 3D Content Generation with Feature Banks

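Quick read: This paper tackles high-fidelity dynamic 3D (4D) content generation with strong spatial-temporal consistency, which remains an open challenge. The key contribution is FB-4D, a new framework that introduces a Feature Bank mechanism to improve the spatial and temporal consistency of generated frames. FB-4D stores features extracted from previous frames and fuses them into the generation of subsequent frames, which keeps characteristics consistent across time and across views. To keep the representation compact, the Feature Bank is updated with a dynamic merging mechanism. Using this Feature Bank, the paper shows for the first time that generating additional reference sequences through multiple autoregressive iterations can continuously improve generation performance. Experiments show that FB-4D significantly improves rendering quality, spatial-temporal consistency, and robustness over existing methods; it surpasses all tuning-free multi-view generation approaches and performs on par with training-based methods.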

Link: https://arxiv.org/abs/2503.20784
Authors: Jinwei Li, Huan-ang Gao, Wenyi Li, Haohan Chi, Chenyu Liu, Chenxi Du, Yiqian Liu, Mingju Gao, Guiyu Zhang, Zongzheng Zhang, Li Yi, Yao Yao, Jingwei Zhao, Hongyang Li, Yikai Wang, Hao Zhao
Affiliations: AIR, Tsinghua University; DCST, Tsinghua University; IIIS, Tsinghua University; Nanjing University; Xiaomi Corporation; Shanghai AI Laboratory; School of Artificial Intelligence, Beijing Normal University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL


Abstract:With the rapid advancements in diffusion models and 3D generation techniques, dynamic 3D content generation has become a crucial research area. However, achieving high-fidelity 4D (dynamic 3D) generation with strong spatial-temporal consistency remains a challenging task. Inspired by recent findings that pretrained diffusion features capture rich correspondences, we propose FB-4D, a novel 4D generation framework that integrates a Feature Bank mechanism to enhance both spatial and temporal consistency in generated frames. In FB-4D, we store features extracted from previous frames and fuse them into the process of generating subsequent frames, ensuring consistent characteristics across both time and multiple views. To ensure a compact representation, the Feature Bank is updated by a proposed dynamic merging mechanism. Leveraging this Feature Bank, we demonstrate for the first time that generating additional reference sequences through multiple autoregressive iterations can continuously improve generation performance. Experimental results show that FB-4D significantly outperforms existing methods in terms of rendering quality, spatial-temporal consistency, and robustness. It surpasses all multi-view generation tuning-free approaches by a large margin and achieves performance on par with training-based methods.
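To make the Feature Bank idea more concrete, here is a minimal sketch, not the authors' implementation. It assumes features are plain vectors, that "dynamic merging" can be approximated by averaging the two most similar entries (by cosine similarity) whenever the bank exceeds a fixed capacity, and that fusion is a softmax-weighted read from the bank mixed equally with the current feature. The class name `FeatureBank`, the `capacity` parameter, and both rules are hypothetical choices for illustration; the abstract does not specify them.

```python
import numpy as np


class FeatureBank:
    """Toy feature bank that holds at most `capacity` feature vectors."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.features = []

    def add(self, feats):
        # feats: array of shape (n, d), e.g. features from one generated frame.
        self.features.extend(np.asarray(feats, dtype=float))
        # Assumed merging rule: shrink the bank until it fits the capacity.
        while len(self.features) > self.capacity:
            self._merge_most_similar_pair()

    def _merge_most_similar_pair(self):
        bank = np.stack(self.features)
        unit = bank / (np.linalg.norm(bank, axis=1, keepdims=True) + 1e-8)
        sim = unit @ unit.T
        np.fill_diagonal(sim, -np.inf)  # ignore self-similarity
        i, j = np.unravel_index(np.argmax(sim), sim.shape)
        merged = (bank[i] + bank[j]) / 2.0
        keep = [f for k, f in enumerate(self.features) if k not in (i, j)]
        self.features = keep + [merged]

    def fuse(self, query):
        # Assumed fusion rule: attention-style read from the bank,
        # averaged with the current frame's feature.
        query = np.asarray(query, dtype=float)
        bank = np.stack(self.features)
        scores = bank @ query / np.sqrt(query.shape[0])
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()
        return 0.5 * query + 0.5 * (weights @ bank)
```

In this sketch the bank size stays bounded no matter how many frames are added, which is the property the abstract attributes to dynamic merging; how the real method chooses what to merge may differ.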

[CV-2] Zero-Shot Audio-Visual Editing via Cross-Modal Delta Denoising

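Quick read: This paper addresses zero-shot audio-video editing: transforming original audio-visual content to align with a specified text prompt without any additional model training. It points out that existing methods are limited in synchronization and coherence between modalities, which makes editing results inconsistent. To address this, the paper proposes AvED, a zero-shot cross-modal delta denoising framework whose key idea is to exploit audio-video interactions to produce synchronized and coherent edits. AvED performs well on both the self-built AvED-Bench dataset and the OAVE dataset, which supports its generalization ability.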

Link: https://arxiv.org/abs/2503.20782
Authors: Yan-Bo Lin, Kevin Lin, Zhengyuan Yang, Linjie Li, Jianfeng Wang, Chung-Ching Lin, Xiaofei Wang, Gedas Bertasius, Lijuan Wang
Affiliations: UNC Chapel Hill; Microsoft
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Multimedia (cs.MM); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments: Project page: this https URL


Abstract:In this paper, we introduce zero-shot audio-video editing, a novel task that requires transforming original audio-visual content to align with a specified textual prompt without additional model training. To evaluate this task, we curate a benchmark dataset, AvED-Bench, designed explicitly for zero-shot audio-video editing. AvED-Bench includes 110 videos, each with a 10-second duration, spanning 11 categories from VGGSound. It offers diverse prompts and scenarios that require precise alignment between auditory and visual elements, enabling robust evaluation. We identify limitations in existing zero-shot audio and video editing methods, particularly in synchronization and coherence between modalities, which often result in inconsistent outcomes. To address these challenges, we propose AvED, a zero-shot cross-modal delta denoising framework that leverages audio-video interactions to achieve synchronized and coherent edits. AvED demonstrates superior results on both AvED-Bench and the recent OAVE dataset to validate its generalization capabilities. Results are available at this https URL

[CV-3] BASKET: A Large-Scale Video Dataset for Fine-Grained Skill Estimation

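Quick read: This paper targets fine-grained basketball skill estimation. It builds BASKET, a large-scale basketball video dataset that exposes the limits of existing video recognition models in capturing the intricate details of player skill. BASKET contains 4,477 hours of video of 32,232 players from around the world and covers 20 fine-grained basketball skills; given a long highlight video (8-10 minutes) of a player, a model must predict the level of each skill (e.g., excellent, good, average, fair, poor). The key to solving the task lies in designing a new generation of video models with long-range, fine-grained recognition capabilities, since current state-of-the-art video models fall short on it. The dataset is also intended to support applications such as fair scouting and personalized player development.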

Link: https://arxiv.org/abs/2503.20781
Authors: Yulu Pan, Ce Zhang, Gedas Bertasius
Affiliations: UNC Chapel Hill
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:


Abstract:We present BASKET, a large-scale basketball video dataset for fine-grained skill estimation. BASKET contains 4,477 hours of video capturing 32,232 basketball players from all over the world. Compared to prior skill estimation datasets, our dataset includes a massive number of skilled participants with unprecedented diversity in terms of gender, age, skill level, geographical location, etc. BASKET includes 20 fine-grained basketball skills, challenging modern video recognition models to capture the intricate nuances of player skill through in-depth video analysis. Given a long highlight video (8-10 minutes) of a particular player, the model needs to predict the skill level (e.g., excellent, good, average, fair, poor) for each of the 20 basketball skills. Our empirical analysis reveals that the current state-of-the-art video models struggle with this task, significantly lagging behind the human baseline. We believe that BASKET could be a useful resource for developing new video models with advanced long-range, fine-grained recognition capabilities. In addition, we hope that our dataset will be useful for domain-specific applications such as fair basketball scouting, personalized player development, and many others. Dataset and code are available at this https URL.

[CV-4] Feature4X: Bridging Any Monocular Video to 4D Agentic AI with Versatile Gaussian Feature Fields

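Quick read: This paper addresses how to extend the capabilities of existing 2D and multimodal models to complex 3D/4D scenes, enabling free-form interaction and high-level semantic operations. The core difficulty is the lack of large-scale annotated 3D/4D or multi-view datasets, which are crucial for generalizable vision-language tasks such as open-vocabulary and prompt-based segmentation, language-guided editing, and visual question answering (VQA). The proposed solution is Feature4X, a framework that extends the functionality of any 2D vision foundation model into 4D using only monocular video input. The "X" stands for its versatility: it supports arbitrary tasks through adaptable, model-conditioned 4D feature field distillation. At the core of the framework is a dynamic optimization strategy that unifies multiple model capabilities into a single representation. To the best of the authors' knowledge, Feature4X is also the first method to distill and lift the features of video foundation models (e.g., SAM2, InternVideo2) into an explicit 4D feature field using Gaussian Splatting. Experiments demonstrate novel-view segmentation, geometric and appearance scene editing, and free-form VQA, driven by LLMs in feedback loops, which broadens the scope of agentic AI applications.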

Link: https://arxiv.org/abs/2503.20776
Authors: Shijie Zhou, Hui Ren, Yijia Weng, Shuwang Zhang, Zhen Wang, Dejia Xu, Zhiwen Fan, Suya You, Zhangyang Wang, Leonidas Guibas, Achuta Kadambi
Affiliations: UCLA; MIT; Stanford; UT Austin; DEVCOM ARL; Feature4X
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:


Abstract:Recent advancements in 2D and multimodal models have achieved remarkable success by leveraging large-scale training on extensive datasets. However, extending these achievements to enable free-form interactions and high-level semantic operations with complex 3D/4D scenes remains challenging. This difficulty stems from the limited availability of large-scale, annotated 3D/4D or multi-view datasets, which are crucial for generalizable vision and language tasks such as open-vocabulary and prompt-based segmentation, language-guided editing, and visual question answering (VQA). In this paper, we introduce Feature4X, a universal framework designed to extend any functionality from 2D vision foundation model into the 4D realm, using only monocular video input, which is widely available from user-generated content. The “X” in Feature4X represents its versatility, enabling any task through adaptable, model-conditioned 4D feature field distillation. At the core of our framework is a dynamic optimization strategy that unifies multiple model capabilities into a single representation. Additionally, to the best of our knowledge, Feature4X is the first method to distill and lift the features of video foundation models (e.g. SAM2, InternVideo2) into an explicit 4D feature field using Gaussian Splatting. Our experiments showcase novel view segment anything, geometric and appearance scene editing, and free-form VQA across all time steps, empowered by LLMs in feedback loops. These advancements broaden the scope of agentic AI applications by providing a foundation for scalable, contextually and spatiotemporally aware systems capable of immersive dynamic 4D scene interaction.

[CV-5] Disentangled Source-Free Personalization for Facial Expression Recognition with Neutral Target Data

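Quick read: This paper addresses source-free domain adaptation (SFDA) for facial expression recognition (FER), specifically the case where the target domain lacks non-neutral expression data. Conventional SFDA methods usually assume that the target dataset contains all recognition classes, but in healthcare applications collecting such comprehensive target data can be difficult or even impossible. To handle this, the paper proposes Disentangled Source-Free Domain Adaptation (DSFDA). The key idea is to use a neutral control video of the target subject to generate the missing non-neutral expression data end to end while disentangling expression-related and identity-related features. In addition, a self-supervision strategy further improves adaptation by reconstructing target images that keep the target identity and the source expression, which improves recognition accuracy.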

Link: https://arxiv.org/abs/2503.20771
Authors: Masoumeh Sharafi, Emma Ollivier, Muhammad Osama Zeeshan, Soufiane Belharbi, Marco Pedersoli, Alessandro Lameiras Koerich, Simon Bacon, Eric Granger
Affiliations: LIVIA, ILLS, Dept. of Systems Engineering, ETS Montreal, Canada; LIVIA, Dept. of Software and IT Engineering, ETS Montreal, Canada; Dept. of Health, Kinesiology & Applied Physiology, Concordia University, Montreal, Canada; Montreal Behavioural Medicine Centre, CIUSSS Nord-de-l’Ile-de-Montréal, Canada
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:


Abstract:Facial Expression Recognition (FER) from videos is a crucial task in various application areas, such as human-computer interaction and health monitoring (e.g., pain, depression, fatigue, and stress). Beyond the challenges of recognizing subtle emotional or health states, the effectiveness of deep FER models is often hindered by the considerable variability of expressions among subjects. Source-free domain adaptation (SFDA) methods are employed to adapt a pre-trained source model using only unlabeled target domain data, thereby avoiding data privacy and storage issues. Typically, SFDA methods adapt to a target domain dataset corresponding to an entire population and assume it includes data from all recognition classes. However, collecting such comprehensive target data can be difficult or even impossible for FER in healthcare applications. In many real-world scenarios, it may be feasible to collect a short neutral control video (displaying only neutral expressions) for target subjects before deployment. These videos can be used to adapt a model to better handle the variability of expressions among subjects. This paper introduces the Disentangled Source-Free Domain Adaptation (DSFDA) method to address the SFDA challenge posed by missing target expression data. DSFDA leverages data from a neutral target control video for end-to-end generation and adaptation of target data with missing non-neutral data. Our method learns to disentangle features related to expressions and identity while generating the missing non-neutral target data, thereby enhancing model accuracy. Additionally, our self-supervision strategy improves model adaptation by reconstructing target images that maintain the same identity and source expression.

[CV-6] MindfulLIME: A Stable Solution for Explanations of Machine Learning Models with Enhanced Localization Precision – A Medical Image Case Study

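Quick read: This paper addresses the instability of the explanations that existing explainable algorithms such as LIME produce for image data because of random perturbation. The instability comes from the random sampling procedure: even slight variations in the generated samples change the explanation significantly, which undermines trust in the model and hinders adoption in practice. The key contribution is MindfulLIME, a new algorithm that intelligently generates purposive samples using a graph-based pruning algorithm and uncertainty sampling, which substantially improves the consistency of visual explanations. Experiments show that MindfulLIME delivers reliable explanations with a 100% success rate under identical conditions while maintaining high localization precision, and that it performs well across several segmentation settings, generating higher-quality samples with reasonable efficiency. This improves the trustworthiness and interpretability of machine learning models in specific medical imaging applications.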

Link: https://arxiv.org/abs/2503.20758
Authors: Shakiba Rahimiaghdam, Hande Alemdar
Affiliations: Middle East Technical University
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments:


Abstract:Ensuring transparency in machine learning decisions is critically important, especially in sensitive sectors such as healthcare, finance, and justice. Despite this, some popular explainable algorithms, such as Local Interpretable Model-agnostic Explanations (LIME), often produce unstable explanations due to the random generation of perturbed samples. Random perturbation introduces small changes or noise to modified instances of the original data, leading to inconsistent explanations. Even slight variations in the generated samples significantly affect the explanations provided by such models, undermining trust and hindering the adoption of interpretable models. To address this challenge, we propose MindfulLIME, a novel algorithm that intelligently generates purposive samples using a graph-based pruning algorithm and uncertainty sampling. MindfulLIME substantially improves the consistency of visual explanations compared to random sampling approaches. Our experimental evaluation, conducted on a widely recognized chest X-ray dataset, confirms MindfulLIME’s stability with a 100% success rate in delivering reliable explanations under identical conditions. Additionally, MindfulLIME improves the localization precision of visual explanations by reducing the distance between the generated explanations and the actual local annotations compared to LIME. We also performed comprehensive experiments considering various segmentation algorithms and sample numbers, focusing on stability, quality, and efficiency. The results demonstrate the outstanding performance of MindfulLIME across different segmentation settings, generating fewer high-quality samples within a reasonable processing time. By addressing the stability limitations of LIME in image data, MindfulLIME enhances the trustworthiness and interpretability of machine learning models in specific medical imaging applications, a critical domain.
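The instability described above can be reproduced in a few lines. The sketch below is a simplified stand-in, not LIME or MindfulLIME themselves: it fits an unweighted least-squares linear surrogate to a made-up `black_box` score over binary superpixel masks (real LIME additionally weights samples by proximity). Randomly sampled masks typically yield different weights for different seeds, while a fixed sample set (here, exhaustive enumeration, used only as an assumed deterministic substitute for the paper's graph-based pruning and uncertainty sampling) yields identical weights on every run. All function names and the scoring function are hypothetical.

```python
import itertools

import numpy as np


def black_box(mask):
    """Hypothetical classifier score given which of 4 superpixels are kept."""
    m = np.asarray(mask, dtype=float)
    # The m[1] * m[2] interaction makes the model non-linear in the mask.
    return float(0.6 * m[0] + 0.3 * m[1] * m[2] + 0.1 * m[3])


def fit_surrogate(masks):
    """Least-squares linear surrogate; returns one weight per superpixel."""
    masks = np.asarray(masks, dtype=float)
    y = np.array([black_box(m) for m in masks])
    X = np.hstack([masks, np.ones((len(masks), 1))])  # intercept column
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef[:-1]


def random_explanation(n_superpixels, n_samples, seed):
    """Explanation from randomly perturbed masks (depends on the seed)."""
    rng = np.random.default_rng(seed)
    masks = rng.integers(0, 2, size=(n_samples, n_superpixels))
    return fit_surrogate(masks)


def deterministic_explanation(n_superpixels):
    """Explanation from a fixed sample set: every possible mask, in order."""
    masks = list(itertools.product([0, 1], repeat=n_superpixels))
    return fit_surrogate(masks)
```

Calling `random_explanation(4, 12, seed=s)` for several seeds and comparing the returned weights shows the run-to-run variation; `deterministic_explanation(4)` returns the same weights every time because its sample set never changes.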

[CV-7] Reason-RFT: Reinforcement Fine-Tuning for Visual Reasoning

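Quick read: This paper addresses the overfitting and cognitive rigidity that supervised fine-tuning induces in vision-language models (VLMs) on visual reasoning tasks; these problems limit how well a model transfers its reasoning skills across domains and how it performs in real applications. To overcome these limitations, the paper proposes Reason-RFT, a new reinforcement fine-tuning framework. Its key element is a two-phase training scheme: phase one uses supervised fine-tuning (SFT) on curated chain-of-thought (CoT) data to activate the model's reasoning potential; phase two applies reinforcement learning based on Group Relative Policy Optimization (GRPO), which generates multiple reasoning-response pairs and significantly improves generalization on visual reasoning tasks.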

Link: https://arxiv.org/abs/2503.20752
Authors: Huajie Tan, Yuheng Ji, Xiaoshuai Hao, Minglan Lin, Pengwei Wang, Zhongyuan Wang, Shanghang Zhang
Affiliations: State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University; Beijing Academy of Artificial Intelligence; Institute of Automation, Chinese Academy of Sciences; School of Artificial Intelligence, University of Chinese Academy of Sciences
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 35 pages, 22 figures


Abstract:Visual reasoning abilities play a crucial role in understanding complex multimodal data, advancing both domain-specific applications and artificial general intelligence (AGI). Existing methods improve VLM reasoning via Chain-of-Thought (CoT) supervised fine-tuning, using meticulously annotated training data to enhance visual reasoning capabilities. However, this training paradigm may lead to overfitting and cognitive rigidity, restricting the model’s ability to transfer visual reasoning skills across domains and limiting its real-world applicability. To address these limitations, we propose Reason-RFT, a novel reinforcement fine-tuning framework that significantly enhances generalization capabilities in visual reasoning tasks. Reason-RFT introduces a two-phase training framework for visual reasoning: (1) Supervised Fine-Tuning (SFT) with curated Chain-of-Thought (CoT) data activates the reasoning potential of Vision-Language Models (VLMs), followed by (2) Group Relative Policy Optimization (GRPO)-based reinforcement learning that generates multiple reasoning-response pairs, significantly enhancing generalization in visual reasoning tasks. To evaluate Reason-RFT’s visual reasoning capabilities, we reconstructed a comprehensive dataset spanning visual counting, structure perception, and spatial this http URL results demonstrate Reason-RFT’s three key advantages: (1) Performance Enhancement: achieving state-of-the-art results across multiple tasks, outperforming most mainstream open-source and proprietary models; (2) Generalization Superiority: consistently maintaining robust performance across diverse tasks and domains, outperforming alternative training paradigms; (3) Data Efficiency: excelling in few-shot learning scenarios while surpassing full-dataset SFT baselines.
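The second training phase relies on GRPO, whose defining step is to score each sampled response relative to the other responses drawn for the same prompt. A minimal sketch of that normalization follows; it shows only the advantage computation, not the policy update, and the reward values in the usage note are made up. The function name `group_relative_advantages` and the `eps` stabilizer are illustrative choices, and the paper's exact reward design is not given in the abstract.

```python
import numpy as np


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize the rewards of one prompt's sampled responses within the group.

    Each response's advantage is its reward minus the group mean, divided by
    the group standard deviation, so no separate value network is needed.
    """
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)
```

For example, four responses with rewards `[1.0, 0.0, 0.0, 1.0]` receive advantages of approximately `[1, -1, -1, 1]`: correct responses are pushed up and incorrect ones down, relative to the group average. If every response in a group gets the same reward, all advantages are zero and that group contributes no learning signal.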

[CV-8] UniSTD: Towards Unified Spatio-Temporal Learning across Diverse Disciplines CVPR2025

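Quick read: This paper addresses the limited generalizability and scalability of traditional spatiotemporal models, which stem from task-specific architectures and restrict their use across domains and diverse tasks. It proposes UniSTD, a unified Transformer-based framework built on a two-stage paradigm: task-agnostic pretraining on 2D vision and vision-text datasets first builds a general model foundation, and specialized joint training on spatiotemporal datasets then strengthens task-specific adaptability. To improve learning across domains, the framework adds a rank-adaptive mixture-of-experts adaptation that uses fractional interpolation to relax discrete variables so that they can be optimized in continuous space. It also includes a temporal module that models temporal dynamics explicitly. The core of the approach is a single unified framework for efficient multi-task learning and cross-domain adaptation, which significantly reduces training cost in multi-domain applications.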

Link: https://arxiv.org/abs/2503.20748
Authors: Chen Tang, Xinzhu Ma, Encheng Su, Xiufeng Song, Xiaohong Liu, Wei-Hong Li, Lei Bai, Wanli Ouyang, Xiangyu Yue
Affiliations: MMLab, The Chinese University of Hong Kong; Shanghai AI Lab; Shanghai Jiaotong University; Shun Hing Institute of Advanced Engineering, The Chinese University of Hong Kong
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to CVPR 2025


Abstract:Traditional spatiotemporal models generally rely on task-specific architectures, which limit their generalizability and scalability across diverse tasks due to domain-specific design requirements. In this paper, we introduce UniSTD, a unified Transformer-based framework for spatiotemporal modeling, which is inspired by advances in recent foundation models with the two-stage pretraining-then-adaptation paradigm. Specifically, our work demonstrates that task-agnostic pretraining on 2D vision and vision-text datasets can build a generalizable model foundation for spatiotemporal learning, followed by specialized joint training on spatiotemporal datasets to enhance task-specific adaptability. To improve the learning capabilities across domains, our framework employs a rank-adaptive mixture-of-expert adaptation by using fractional interpolation to relax the discrete variables so that they can be optimized in the continuous space. Additionally, we introduce a temporal module to incorporate temporal dynamics explicitly. We evaluate our approach on a large-scale dataset covering 10 tasks across 4 disciplines, demonstrating that a unified spatiotemporal model can achieve scalable, cross-task learning and support up to 10 tasks simultaneously within one model while reducing training costs in multi-domain applications. Code will be available at this https URL.

[CV-9] PhysGen3D: Crafting a Miniature Interactive World from a Single Image CVPR2025

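Quick read: This paper addresses generating physically realistic, interactive dynamic video from a single image. It proposes PhysGen3D, a new framework whose key idea is to combine image-based geometric and semantic understanding with physics-based simulation: by estimating the 3D shapes, poses, physical properties, and lighting properties of objects, it converts a static image into an interactive 3D scene and lets users specify initial conditions for finer control over the generated video. Experiments show that PhysGen3D offers more flexible, fine-grained control while maintaining high fidelity, achieving a distinctive balance of photorealism, physical plausibility, and user-driven interactivity.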

Link: https://arxiv.org/abs/2503.20746
Authors: Boyuan Chen, Hanxiao Jiang, Shaowei Liu, Saurabh Gupta, Yunzhu Li, Hao Zhao, Shenlong Wang
Affiliations: Tsinghua University; University of Illinois Urbana-Champaign; Columbia University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: CVPR 2025, Project page: this https URL


Abstract:Envisioning physically plausible outcomes from a single image requires a deep understanding of the world’s dynamics. To address this, we introduce PhysGen3D, a novel framework that transforms a single image into an amodal, camera-centric, interactive 3D scene. By combining advanced image-based geometric and semantic understanding with physics-based simulation, PhysGen3D creates an interactive 3D world from a static image, enabling us to “imagine” and simulate future scenarios based on user input. At its core, PhysGen3D estimates 3D shapes, poses, physical and lighting properties of objects, thereby capturing essential physical attributes that drive realistic object interactions. This framework allows users to specify precise initial conditions, such as object speed or material properties, for enhanced control over generated video outcomes. We evaluate PhysGen3D’s performance against closed-source state-of-the-art (SOTA) image-to-video models, including Pika, Kling, and Gen-3, showing PhysGen3D’s capacity to generate videos with realistic physics while offering greater flexibility and fine-grained control. Our results show that PhysGen3D achieves a unique balance of photorealism, physical plausibility, and user-driven interactivity, opening new possibilities for generating dynamic, physics-grounded video from an image.

[CV-10] MATHGLANCE: Multimodal Large Language Models Do Not Know Where to Look in Mathematical Diagrams

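Quick read: This paper addresses the limitations of current Multimodal Large Language Models (MLLMs) in understanding mathematical diagrams, in particular that their performance on perception tasks often relies on shallow pattern recognition rather than a genuine understanding of the diagram. Existing benchmarks conflate perception and reasoning tasks, which makes it hard to assess whether MLLMs understand mathematical diagrams beyond recognizing surface features.

The paper makes two key contributions. First, it introduces MATHGLANCE, a new benchmark designed to isolate and evaluate mathematical perception in MLLMs, covering four tasks: shape classification, object counting, relationship identification, and fine-grained grounding. Second, it builds GeoPeP, a large-scale structured geometry dataset of 200K image-text pairs annotated with geometric primitives and their precise spatial relationships; training an MLLM on this dataset significantly improves perceptual accuracy, which in turn improves mathematical reasoning. Together these provide standards and resources for evaluating and advancing multimodal mathematical understanding.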

Link: https://arxiv.org/abs/2503.20745
Authors: Yanpeng Sun, Shan Zhang, Wei Tang, Aotian Chen, Piotr Koniusz, Kai Zou, Yuan Xue, Anton van den Hengel
Affiliations: National University of Singapore; Australian Institute for Machine Learning; Nanjing University of Science and Technology; Ohio State University; Data61 (CSIRO); NetMind.ai
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:


Abstract:Diagrams serve as a fundamental form of visual language, representing complex concepts and their inter-relationships through structured symbols, shapes, and spatial arrangements. Unlike natural images, their inherently symbolic and abstract nature poses significant challenges for Multimodal Large Language Models (MLLMs). However, current benchmarks conflate perceptual and reasoning tasks, making it difficult to assess whether MLLMs genuinely understand mathematical diagrams beyond superficial pattern recognition. To address this gap, we introduce MATHGLANCE, a benchmark specifically designed to isolate and evaluate mathematical perception in MLLMs. MATHGLANCE comprises 1.2K images and 1.6K carefully curated questions spanning four perception tasks: shape classification, object counting, relationship identification, and object grounding, covering diverse domains including plane geometry, solid geometry, and graphical representations. Our evaluation of MLLMs reveals that their ability to understand diagrams is notably limited, particularly in fine-grained grounding tasks. In response, we construct GeoPeP, a perception-oriented dataset of 200K structured geometry image-text pairs explicitly annotated with geometric primitives and precise spatial relationships. Training MLLM on GeoPeP leads to significant gains in perceptual accuracy, which in turn substantially improves mathematical reasoning. Our benchmark and dataset establish critical standards for evaluating and advancing multimodal mathematical understanding, providing valuable resources and insights to foster future MLLM research.

[CV-11] High Quality Diffusion Distillation on a Single GPU with Relative and Absolute Position Matching

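Quick read: This paper addresses the high compute requirements of existing diffusion distillation methods (for example, many GPUs and large batch sizes), so that more researchers can train high-quality text-to-image models efficiently with limited resources. Its key contribution is relative and absolute position matching (RAPM), which enables efficient diffusion distillation training on a single GPU with a small batch size, down to a batch size of 1. The core idea is to mimic the teacher model's sampling trajectories by matching relative and absolute positions. Concretely, RAPM introduces two discriminators, one for matching relative positions and one for matching absolute positions; the design of the relative positions is inspired by Phased Consistency Models (PCM). Experiments indicate that, under very limited computational resources, RAPM achieves FID scores comparable to the best existing method.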

Link: https://arxiv.org/abs/2503.20744
Authors: Guoqiang Zhang, Kenta Niwa, J.P. Lewis, Cedric Mesnage, W. Bastiaan Kleijn
Affiliations: University of Exeter; NTT Communication Science Laboratories; Victoria University of Wellington
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:


Abstract:We introduce relative and absolute position matching (RAPM), a diffusion distillation method resulting in high quality generation that can be trained efficiently on a single GPU. Recent diffusion distillation research has achieved excellent results for high-resolution text-to-image generation with methods such as phased consistency models (PCM) and improved distribution matching distillation (DMD2). However, these methods generally require many GPUs (e.g., 8-64) and significant batch sizes (e.g., 128-2048) during training, resulting in memory and compute requirements that are beyond the resources of some researchers. RAPM provides effective single-GPU diffusion distillation training with a batch size of 1. The new method attempts to mimic the sampling trajectories of the teacher model by matching the relative and absolute positions. The design of relative positions is inspired by PCM. Two discriminators are introduced accordingly in RAPM, one for matching relative positions and the other for absolute positions. Experimental results on StableDiffusion (SD) V1.5 and SDXL indicate that RAPM with 4 timesteps produces FID scores comparable to those of the best method with 1 timestep under very limited computational resources.

[CV-12] Emotion Detection and Music Recommendation System

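Quick read: This paper aims to improve emotional well-being through music therapy; the core problem is to build a system that detects a person's emotion in real time and recommends matching music. The key to the solution is combining deep learning with facial recognition: the DeepFace framework analyzes the user's facial expressions in real time to infer the emotional state, and a playlist matching that emotion is loaded from local storage. Users can further customize the experience by selecting songs manually, and continuous looping over the playlist keeps playback uninterrupted.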

Link: https://arxiv.org/abs/2503.20739
Authors: Swetha Kambham, Hubert Jhonson, Sai Prathap Reddy Kambham
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:


Abstract:As artificial intelligence becomes more and more ingrained in daily life, we present a novel system that uses deep learning for music recommendation and emotion-based detection. Through the use of facial recognition and the DeepFace framework, our method analyses human emotions in real-time and then plays music that reflects the mood it has discovered. The system uses a webcam to take pictures, analyses the most common facial expression, and then pulls a playlist from local storage that corresponds to the mood it has detected. An engaging and customised experience is ensured by allowing users to manually change the song selection via a dropdown menu or navigation buttons. By continuously looping over the playlist, the technology guarantees continuity. The objective of our system is to improve emotional well-being through music therapy by offering a responsive and automated music-selection experience.
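The selection step described in the abstract (take the most common detected expression, then load the matching playlist) can be sketched as follows. This is an illustration under stated assumptions rather than the authors' code: the per-frame emotion labels are taken as already produced by some detector, no DeepFace call is shown, and the `PLAYLISTS` mapping, its folder paths, and the fallback to a "neutral" playlist are all hypothetical.

```python
from collections import Counter

# Hypothetical mapping from detected emotion to a local playlist folder.
PLAYLISTS = {
    "happy": "playlists/happy",
    "sad": "playlists/sad",
    "angry": "playlists/calm",
    "neutral": "playlists/neutral",
}


def dominant_emotion(frame_emotions):
    """Return the most frequent emotion label across the captured frames."""
    if not frame_emotions:
        raise ValueError("no frames were analysed")
    return Counter(frame_emotions).most_common(1)[0][0]


def pick_playlist(frame_emotions, default="neutral"):
    """Map the dominant emotion to a playlist, falling back to `default`."""
    emotion = dominant_emotion(frame_emotions)
    return PLAYLISTS.get(emotion, PLAYLISTS[default])
```

Voting over several frames instead of trusting a single frame makes the choice less sensitive to a momentary expression; the manual override through a dropdown menu or navigation buttons mentioned in the abstract would simply bypass `pick_playlist`.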

[CV-13] SChanger: Change Detection from a Semantic Change and Spatial Consistency Perspective

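Quick read: This paper addresses the scarcity of labeled data for change detection in Earth observation. Because accurately aligning remote sensing images of the same area is labor-intensive, large annotated change detection datasets are hard to obtain, which limits the performance of deep learning algorithms. The key to the proposed solution is a fine-tuning strategy called the Semantic Change Network (SCN). SCN first pre-trains the model on single-temporal supervised tasks to acquire prior knowledge of instance feature extraction, then uses a shared-weight Siamese architecture with an extended Temporal Fusion Module (TFM) to preserve that prior knowledge while fine-tuning on change detection. The paper also observes that the locations of changes are spatially identical in the two images (spatial consistency), and introduces this inductive bias through an attention map generated by large-kernel convolutions, which improves the modeling of multi-scale changes and helps capture underlying relationships in change detection semantics. Combining both strategies, the paper develops a binary change detection model and validates it on six benchmark datasets, where it surpasses existing methods.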

Link: https://arxiv.org/abs/2503.20734
Authors: Ziyu Zhou, Keyan Hu, Yutian Fang, Xiaoping Rui
Affiliations: School of Earth Sciences and Engineering, Hohai University, Nanjing 211100, China; School of Geosciences and Info-physics, Central South University, Changsha 410100, China
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:


Abstract:Change detection is a key task in Earth observation applications. Recently, deep learning methods have demonstrated strong performance and widespread application. However, change detection faces data scarcity due to the labor-intensive process of accurately aligning remote sensing images of the same area, which limits the performance of deep learning algorithms. To address the data scarcity issue, we develop a fine-tuning strategy called the Semantic Change Network (SCN). We initially pre-train the model on single-temporal supervised tasks to acquire prior knowledge of instance feature extraction. The model then employs a shared-weight Siamese architecture and extended Temporal Fusion Module (TFM) to preserve this prior knowledge and is fine-tuned on change detection tasks. The learned semantics for identifying all instances is changed to focus on identifying only the changes. Meanwhile, we observe that the locations of changes between the two images are spatially identical, a concept we refer to as spatial consistency. We introduce this inductive bias through an attention map that is generated by large-kernel convolutions and applied to the features from both time points. This enhances the modeling of multi-scale changes and helps capture underlying relationships in change detection semantics. We develop a binary change detection model utilizing these two strategies. The model is validated against state-of-the-art methods on six datasets, surpassing all benchmark methods and achieving F1 scores of 92.87%, 86.43%, 68.95%, 97.62%, 84.58%, and 93.20% on the LEVIR-CD, LEVIR-CD+, S2Looking, CDD, SYSU-CD, and WHU-CD datasets, respectively.
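The spatial-consistency idea, one attention map computed once and applied to the features of both dates, can be illustrated with a small NumPy sketch. It is only a sketch under several assumptions that the abstract does not confirm: the attention input is taken to be the channel-averaged absolute feature difference, the learned large-kernel convolution is replaced by a fixed box filter, and a sigmoid turns the result into weights. The function name `shared_spatial_attention` and the `kernel` parameter are hypothetical.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def shared_spatial_attention(feat_t1, feat_t2, kernel=7):
    """Apply one spatial attention map to the features of both time points.

    feat_t1, feat_t2: arrays of shape (C, H, W); `kernel` must be odd.
    """
    f1 = np.asarray(feat_t1, dtype=float)
    f2 = np.asarray(feat_t2, dtype=float)
    # Per-pixel disagreement between the two dates, averaged over channels.
    diff = np.abs(f1 - f2).mean(axis=0)
    # Fixed box filter as a stand-in for a learned large-kernel convolution.
    pad = kernel // 2
    padded = np.pad(diff, pad, mode="edge")
    smoothed = sliding_window_view(padded, (kernel, kernel)).mean(axis=(-2, -1))
    attn = 1.0 / (1.0 + np.exp(-smoothed))  # sigmoid, values in (0, 1)
    # The same map weights both feature tensors, so a location that is
    # emphasized at one date is emphasized at the other as well.
    return f1 * attn, f2 * attn, attn
```

Because the map is shared, the two branches cannot disagree about where a change is located, which is the inductive bias the abstract calls spatial consistency.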

[CV-14] Dynamic Motion Blending for Versatile Motion Editing

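Quick read: This paper addresses the lack of versatility in existing text-guided motion editing methods, which rely on limited pre-collected training triplets. The proposed solution has two parts. First, MotionCutMix is an online data augmentation technique that dynamically generates training triplets by blending body-part motions based on the input text, which effectively expands the training distribution. Second, MotionReFit is an auto-regressive diffusion model with a motion coordinator that models the rich distribution introduced by MotionCutMix and mitigates the incoordination and artifacts caused by motion composition. With these components, the method handles spatial and temporal motion edits directly from high-level human instructions, without additional specifications or large language models.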

Link: https://arxiv.org/abs/2503.20724
Authors: Nan Jiang, Hongjie Li, Ziye Yuan, Zimo He, Yixin Chen, Tengyu Liu, Yixin Zhu, Siyuan Huang
Affiliations: Institute for AI, Peking University; State Key Laboratory of General Artificial Intelligence, BIGAI; Yuanpei College, Peking University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

点击查看摘要

Abstract:Text-guided motion editing enables high-level semantic control and iterative modifications beyond traditional keyframe animation. Existing methods rely on limited pre-collected training triplets, which severely hinders their versatility in diverse editing scenarios. We introduce MotionCutMix, an online data augmentation technique that dynamically generates training triplets by blending body part motions based on input text. While MotionCutMix effectively expands the training distribution, the compositional nature introduces increased randomness and potential body part incoordination. To model such a rich distribution, we present MotionReFit, an auto-regressive diffusion model with a motion coordinator. The auto-regressive architecture facilitates learning by decomposing long sequences, while the motion coordinator mitigates the artifacts of motion composition. Our method handles both spatial and temporal motion edits directly from high-level human instructions, without relying on additional specifications or Large Language Models. Through extensive experiments, we show that MotionReFit achieves state-of-the-art performance in text-guided motion editing.
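
The body-part blending idea behind MotionCutMix can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: motions are plain lists of frames, each frame is a list of per-joint values, and the joint indices, the hard (non-smoothed) mask, and the `triplet` layout are all hypothetical.

```python
def blend_body_parts(base_motion, part_motion, part_joints):
    """Follow part_motion on part_joints and base_motion on all other joints.

    Both motions are lists of frames; each frame is a list of per-joint values.
    """
    if len(base_motion) != len(part_motion):
        raise ValueError("motions must have the same number of frames")
    blended = []
    for base_frame, part_frame in zip(base_motion, part_motion):
        frame = [
            part_frame[j] if j in part_joints else base_frame[j]
            for j in range(len(base_frame))
        ]
        blended.append(frame)
    return blended


# Hypothetical joint indices for one arm (assumption, not from the paper).
ARM_JOINTS = {2, 3}

walk = [[0.0, 0.1, 0.2, 0.3], [0.0, 0.2, 0.4, 0.6]]
wave = [[9.0, 9.1, 9.2, 9.3], [8.0, 8.1, 8.2, 8.3]]

edited = blend_body_parts(walk, wave, ARM_JOINTS)

# One training triplet generated on the fly: source motion, text, target motion.
triplet = {"source": walk, "text": "wave the right arm", "target": edited}
```

Because the composite is built at training time from any pair of motions, the set of possible triplets grows with the number of motion pairs and body-part masks rather than with the size of a pre-collected edit dataset.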

[CV-15] A weakly-supervised deep learning model for fast localisation and delineation of the skeleton, internal organs, and spinal canal on Whole-Body Diffusion-Weighted MRI (WB-DWI)

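[Quick read]: This paper addresses the fact that manually delineating anatomical structures (the skeleton, adjacent internal organs, and spinal canal) on whole-body diffusion-weighted MRI (WB-DWI) to measure cancer imaging biomarkers such as the Apparent Diffusion Coefficient (ADC) and Total Diffusion Volume (TDV) is laborious and unfeasible in clinical practice. The key to the solution is an automated, weakly-supervised deep-learning pipeline based on a 3D patch-based Residual U-Net, trained with "soft labels" (non-binary segmentations) derived from a computationally intensive atlas-based approach. The algorithm was trained and validated on a multi-center dataset. On an independent test set it generated reproducible probability maps about 12x faster than the atlas-based registration algorithm (25 s vs. 5 min), with relative median ADC and log-transformed volume differences from manual expert delineations below 10% and 4%, respectively, which could support clinicians in disease staging and treatment response assessment.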

Link: https://arxiv.org/abs/2503.20722
Authors: A. Candito (1), A. Dragan (1,2), R. Holbrey (1), A. Ribeiro (2), R. Donners (3), C. Messiou (1,2), N. Tunariu (1,2), D.-M. Koh (1,2), M. D. Blackledge (1)
Affiliations: (1) The Institute of Cancer Research, London, United Kingdom; (2) The Royal Marsden NHS Foundation Trust, London, United Kingdom; (3) University Hospital Basel, Basel, Switzerland
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract:Background: Apparent Diffusion Coefficient (ADC) values and Total Diffusion Volume (TDV) from Whole-body diffusion-weighted MRI (WB-DWI) are recognized cancer imaging biomarkers. However, manual disease delineation for ADC and TDV measurements is unfeasible in clinical practice, demanding automation. As a first step, we propose an algorithm to generate fast and reproducible probability maps of the skeleton, adjacent internal organs (liver, spleen, urinary bladder, and kidneys), and spinal canal. Methods: We developed an automated deep-learning pipeline based on a 3D patch-based Residual U-Net architecture that localizes and delineates these anatomical structures on WB-DWI. The algorithm was trained using “soft-labels” (non-binary segmentations) derived from a computationally intensive atlas-based approach. For training and validation, we employed a multi-center WB-DWI dataset comprising 532 scans from patients with Advanced Prostate Cancer (APC) or Multiple Myeloma (MM), with testing on 45 patients. Results: Our weakly-supervised deep learning model achieved an average dice score/precision/recall of 0.66/0.6/0.73 for skeletal delineations, 0.8/0.79/0.81 for internal organs, and 0.85/0.79/0.94 for spinal canal, with surface distances consistently below 3 mm. Relative median ADC and log-transformed volume differences between automated and manual expert-defined full-body delineations were below 10% and 4%, respectively. The computational time for generating probability maps was 12x faster than the atlas-based registration algorithm (25 s vs. 5 min). An experienced radiologist rated the model’s accuracy “good” or “excellent” on test datasets. Conclusion: Our model offers fast and reproducible probability maps for localizing and delineating body regions on WB-DWI, enabling ADC and TDV quantification, potentially supporting clinicians in disease staging and treatment response assessment.
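
The Dice, precision, and recall figures reported above compare a predicted map with a manual delineation. The sketch below shows one common way to compute them from a probability map; the 0.5 threshold, the flattened voxel lists, and the function name `overlap_metrics` are assumptions for illustration, since the abstract does not say how the probability maps were binarised.

```python
def overlap_metrics(prob_map, reference, threshold=0.5):
    """Dice, precision and recall of a thresholded probability map.

    prob_map:  flat list of voxel probabilities in [0, 1].
    reference: flat list of 0/1 labels from a manual delineation.
    """
    pred = [p >= threshold for p in prob_map]
    ref = [bool(r) for r in reference]
    tp = sum(p and r for p, r in zip(pred, ref))
    fp = sum(p and not r for p, r in zip(pred, ref))
    fn = sum(r and not p for p, r in zip(pred, ref))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # Dice is the harmonic mean of precision and recall.
    dice = 2 * tp / (2 * tp + fp + fn) if tp + fp + fn else 1.0
    return dice, precision, recall


# Toy example with six voxels (values are made up).
prob_map = [0.9, 0.7, 0.4, 0.6, 0.1, 0.2]
reference = [1, 1, 1, 0, 0, 0]
dice, precision, recall = overlap_metrics(prob_map, reference)
```

A recall higher than precision, as reported for the spinal canal (0.94 vs. 0.79), means the predicted region tends to cover the reference while also including some extra voxels.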

[CV-16] MMMORRF: Multimodal Multilingual Modularized Reciprocal Rank Fusion

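[Quick read]: This paper addresses the over-reliance of existing video retrieval systems on visual signals. State-of-the-art multimodal language models such as VAST and LanguageBind are built on vision-language models (VLMs) and therefore pay too little attention to other modalities (text, sounds, and speech) in retrieval; existing retrieval benchmarks reinforce this bias by focusing on visual queries. The paper proposes MMMORRF, a search system that extracts text and features from both visual and audio modalities and integrates them with a novel modality-aware weighted reciprocal rank fusion, so that videos are retrieved according to users' information needs rather than only visually descriptive queries. On MultiVENT 2.0 and TVR, two multimodal benchmarks, MMMORRF improves nDCG@20 by 81% over leading multimodal encoders and by 37% over single-modality retrieval, demonstrating the value of integrating diverse modalities.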

Link: https://arxiv.org/abs/2503.20698
Authors: Saron Samuel, Dan DeGenaro, Jimena Guallar-Blasco, Kate Sanders, Oluwaseun Eisape, Arun Reddy, Alexander Martin, Andrew Yates, Eugene Yang, Cameron Carpenter, David Etter, Efsun Kayi, Matthew Wiesner, Kenton Murray, Reno Kriz
Affiliations: Stanford University; Georgetown University; University of California, Berkeley; Johns Hopkins University; Applied Physics Laboratory; Human Language Technology Center of Excellence
Categories: Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR)
Comments:

Abstract:Videos inherently contain multiple modalities, including visual events, text overlays, sounds, and speech, all of which are important for retrieval. However, state-of-the-art multimodal language models like VAST and LanguageBind are built on vision-language models (VLMs), and thus overly prioritize visual signals. Retrieval benchmarks further reinforce this bias by focusing on visual queries and neglecting other modalities. We create a search system MMMORRF that extracts text and features from both visual and audio modalities and integrates them with a novel modality-aware weighted reciprocal rank fusion. MMMORRF is both effective and efficient, demonstrating practicality in searching videos based on users’ information needs instead of visual descriptive queries. We evaluate MMMORRF on MultiVENT 2.0 and TVR, two multimodal benchmarks designed for more targeted information needs, and find that it improves nDCG@20 by 81% over leading multimodal encoders and 37% over single-modality retrieval, demonstrating the value of integrating diverse modalities.
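
Weighted reciprocal rank fusion, the general technique named above, can be sketched in a few lines. This is the standard RRF formula with a fixed per-modality weight, not the paper's modality-aware weighting (which the abstract does not specify); the modality names, weights, video ids, and the smoothing constant `k=60` are assumptions for illustration.

```python
from collections import defaultdict


def weighted_rrf(rankings, weights, k=60):
    """Fuse ranked lists of document ids into one ranking.

    rankings: dict mapping modality -> list of ids, best first.
    weights:  dict mapping modality -> non-negative weight.
    k:        smoothing constant that damps the influence of top ranks.
    """
    scores = defaultdict(float)
    for modality, ranked_ids in rankings.items():
        w = weights.get(modality, 1.0)
        for rank, doc_id in enumerate(ranked_ids, start=1):
            scores[doc_id] += w / (k + rank)
    # Highest fused score first; ties broken by id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical per-modality result lists for one query.
rankings = {
    "ocr_text": ["v3", "v1", "v2"],
    "speech": ["v1", "v3"],
    "visual": ["v2", "v1", "v3"],
}
weights = {"ocr_text": 1.0, "speech": 1.0, "visual": 0.5}

fused = weighted_rrf(rankings, weights)
```

Because the fusion only uses ranks, the per-modality retrievers can produce scores on incompatible scales, and a video missing from one modality's list simply receives no contribution from it.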

[CV-17] Flip Learning: Weakly Supervised Erase to Segment Nodules in Breast Ultrasound

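[Quick read]: This paper targets accurate nodule segmentation in 2D breast ultrasound (BUS) and 3D automated breast ultrasound (ABUS) while reducing the laborious and intricate annotation that full supervision requires. Current weakly-supervised segmentation (WSS) methods struggle because they depend on inaccurate activation maps or inefficient pseudo-mask generation algorithms. The paper proposes Flip Learning, a multi-agent reinforcement learning-based WSS framework that relies solely on 2D/3D boxes: agents erase the target from the box until the classification tag flips, and the erased region serves as the predicted segmentation mask. Specifically, a superpixel/supervoxel-based encoding standardizes the environment and captures boundary priors; three designed rewards (a classification score reward and two intensity distribution rewards) guide the erasing and avoid under- and over-segmentation; and a progressive curriculum learning strategy improves learning efficiency. On large in-house BUS and ABUS datasets, the method outperforms state-of-the-art WSS methods and foundation models and performs comparably to fully-supervised algorithms.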

Link: https://arxiv.org/abs/2503.20685
Authors: Yuhao Huang, Ao Chang, Haoran Dou, Xing Tao, Xinrui Zhou, Yan Cao, Ruobing Huang, Alejandro F Frangi, Lingyun Bao, Xin Yang, Dong Ni
Affiliations: National-Regional Key Technology Engineering Laboratory for Medical Ultrasound, School of Biomedical Engineering, Shenzhen University Medical School, Shenzhen University, Shenzhen, China; Medical UltraSound Image Computing (MUSIC) Lab, Shenzhen University, Shenzhen, China; Marshall Laboratory of Biomedical Engineering, Shenzhen University, Shenzhen, China; School of Computing, University of Leeds, Leeds, UK; Shenzhen RayShape Medical Technology Co., Ltd, Shenzhen, China; Division of Informatics, Imaging and Data Science, School of Health Sciences, University of Manchester, Manchester, UK; Department of Computer Science, School of Engineering, University of Manchester, Manchester, UK; Medical Imaging Research Center (MIRC), Department of Electrical Engineering, Department of Cardiovascular Sciences, KU Leuven, Belgium; Alan Turing Institute, London, UK; NIHR Manchester Biomedical Research Centre, Manchester Academic Health Science Centre, Manchester, UK; Department of Ultrasound, Affiliated Hangzhou First People’s Hospital, School of Medicine, Westlake University; School of Biomedical Engineering and Informatics, Nanjing Medical University, Nanjing, China
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted by Medical Image Analysis. 24 pages, 13 figures, 18 tables

Abstract:Accurate segmentation of nodules in both 2D breast ultrasound (BUS) and 3D automated breast ultrasound (ABUS) is crucial for clinical diagnosis and treatment planning. Therefore, developing an automated system for nodule segmentation can enhance user independence and expedite clinical analysis. Unlike fully-supervised learning, weakly-supervised segmentation (WSS) can streamline the laborious and intricate annotation process. However, current WSS methods face challenges in achieving precise nodule segmentation, as many of them depend on inaccurate activation maps or inefficient pseudo-mask generation algorithms. In this study, we introduce a novel multi-agent reinforcement learning-based WSS framework called Flip Learning, which relies solely on 2D/3D boxes for accurate segmentation. Specifically, multiple agents are employed to erase the target from the box to facilitate classification tag flipping, with the erased region serving as the predicted segmentation mask. The key contributions of this research are as follows: (1) Adoption of a superpixel/supervoxel-based approach to encode the standardized environment, capturing boundary priors and expediting the learning process. (2) Introduction of three meticulously designed rewards, comprising a classification score reward and two intensity distribution rewards, to steer the agents’ erasing process precisely, thereby avoiding both under- and over-segmentation. (3) Implementation of a progressive curriculum learning strategy to enable agents to interact with the environment in a progressively challenging manner, thereby enhancing learning efficiency. Extensively validated on the large in-house BUS and ABUS datasets, our Flip Learning method outperforms state-of-the-art WSS methods and foundation models, and achieves performance comparable to fully-supervised learning algorithms.

[CV-18] GLRD: Global-Local Collaborative Reason and Debate with PSL for 3D Open-Vocabulary Detection

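[Quick read]: This paper addresses LiDAR-based 3D open-vocabulary detection (3D OVD), in which the detector must detect novel objects in point clouds without off-the-shelf training labels. Previous methods focus on object-level representation learning and ignore scene-level context, which makes objects of similar classes hard to distinguish. The paper proposes the Global-Local Collaborative Reason and Debate with PSL (GLRD) framework, which considers both local object-level and global scene-level information. An LLM performs common-sense reasoning over both kinds of information to refine the detection results; a probabilistic soft logic solver (OV-PSL) searches for the optimal solution, and a debate scheme confirms the class of confusable objects. To alleviate the uneven class distribution, the paper designs a static balance scheme (SBC) and a dynamic balance scheme (DBC); to reduce the influence of noise, it further proposes Reflected Pseudo Labels Generation (RPLG) and Background-Aware Object Localization (BAOL). In the partial open-vocabulary setting, GLRD improves mean average precision by an absolute 2.82% on SUN RGB-D and 3.72% on ScanNet; in the full open-vocabulary setting, the absolute improvements are 4.03% on ScanNet and 14.11% on SUN RGB-D.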

Link: https://arxiv.org/abs/2503.20682
Authors: Xingyu Peng, Si Liu, Chen Gao, Yan Bai, Beipeng Mu, Xiaofei Wang, Huaxia Xia
Affiliations: School of Artificial Intelligence, Beihang University, China; Meituan, China
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 15 pages

Abstract:The task of LiDAR-based 3D Open-Vocabulary Detection (3D OVD) requires the detector to learn to detect novel objects from point clouds without off-the-shelf training labels. Previous methods focus on the learning of object-level representations and ignore the scene-level information, thus it is hard to distinguish objects with similar classes. In this work, we propose a Global-Local Collaborative Reason and Debate with PSL (GLRD) framework for the 3D OVD task, considering both local object-level information and global scene-level information. Specifically, LLM is utilized to perform common sense reasoning based on object-level and scene-level information, where the detection result is refined accordingly. To further boost the LLM’s ability of precise decisions, we also design a probabilistic soft logic solver (OV-PSL) to search for the optimal solution, and a debate scheme to confirm the class of confusable objects. In addition, to alleviate the uneven distribution of classes, a static balance scheme (SBC) and a dynamic balance scheme (DBC) are designed. In addition, to reduce the influence of noise in data and training, we further propose Reflected Pseudo Labels Generation (RPLG) and Background-Aware Object Localization (BAOL). Extensive experiments conducted on ScanNet and SUN RGB-D demonstrate the superiority of GLRD, where absolute improvements in mean average precision are +2.82% on SUN RGB-D and +3.72% on ScanNet in the partial open-vocabulary setting. In the full open-vocabulary setting, the absolute improvements in mean average precision are +4.03% on ScanNet and +14.11% on SUN RGB-D.
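
To illustrate how probabilistic soft logic can let scene-level context override an object-level guess, here is a minimal sketch using the Lukasiewicz relaxation that PSL is built on. It is not the paper's OV-PSL solver: the rules, weights, and truth values are hypothetical, and candidate labels are enumerated directly instead of solving PSL's continuous optimisation problem.

```python
def luk_and(*truths):
    """Lukasiewicz conjunction of soft truth values in [0, 1]."""
    return max(0.0, sum(truths) - (len(truths) - 1))


def penalty(label, rules):
    """Weighted distance to satisfaction of all rules if `label` is chosen."""
    total = 0.0
    for weight, evidence, head in rules:
        body = luk_and(*evidence)
        head_truth = 1.0 if head == label else 0.0
        # A rule body -> head is violated by the amount body exceeds head.
        total += weight * max(0.0, body - head_truth)
    return total


def best_label(labels, rules):
    """Pick the label with the smallest total rule violation."""
    return min(labels, key=lambda lab: penalty(lab, rules))


# Hypothetical soft evidence for one detected box.
looks_like_sofa = 0.7
looks_like_bathtub = 0.6
scene_is_bathroom = 0.9

# Object-level rules: appearance evidence -> class.
object_rules = [
    (1.0, [looks_like_sofa], "sofa"),
    (1.0, [looks_like_bathtub], "bathtub"),
]
# Scene-level rule: bathroom scene AND bathtub-like appearance -> bathtub.
scene_rule = (2.0, [scene_is_bathroom, looks_like_bathtub], "bathtub")

labels = ["sofa", "bathtub"]
local_only = best_label(labels, object_rules)
with_scene = best_label(labels, object_rules + [scene_rule])
```

With object-level rules alone the box is labelled `sofa`; adding the scene-level rule makes `bathtub` the lower-penalty choice, which is the kind of global-local interaction the framework is described as exploiting.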

[CV-19] Mitigating Low-Level Visual Hallucinations Requires Self-Awareness: Database, Model and Training Strategy

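[Quick read]: This paper addresses the tendency of multimodal large language models to hallucinate in low-level visual perception and understanding tasks, which limits their reliability as artificial intelligence systems. The authors attribute these hallucinations to a lack of clear self-awareness in the models. They introduce the HLPU instruction database (hallucinations in Low-level Visual Perception and Understanding), with approximately 200K question-answer pairs in four subsets, and propose the SAFEQA model and the ESA-PO framework. SAFEQA combines image features, salient region features, and quality features to improve perception and comprehension in low-level vision tasks; ESA-PO increases the model's awareness of its knowledge boundaries, which reduces hallucination. Experiments show that the method improves both accuracy and self-awareness in these tasks and outperforms closed-source models on various evaluation metrics.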

Link: https://arxiv.org/abs/2503.20673
Authors: Yinan Sun, Xiongkuo Min, Zicheng Zhang, Yixuan Gao, Yuqin Cao, Guangtao Zhai
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:The rapid development of multimodal large language models has resulted in remarkable advancements in visual perception and understanding, consolidating several tasks into a single visual question-answering framework. However, these models are prone to hallucinations, which limit their reliability as artificial intelligence systems. While this issue is extensively researched in natural language processing and image captioning, there remains a lack of investigation of hallucinations in Low-level Visual Perception and Understanding (HLPU), especially in the context of image quality assessment tasks. We consider that these hallucinations arise from an absence of clear self-awareness within the models. To address this issue, we first introduce the HLPU instruction database, the first instruction database specifically focused on hallucinations in low-level vision tasks. This database contains approximately 200K question-answer pairs and comprises four subsets, each covering different types of instructions. Subsequently, we propose the Self-Awareness Failure Elimination (SAFEQA) model, which utilizes image features, salient region features and quality features to improve the perception and comprehension abilities of the model in low-level vision tasks. Furthermore, we propose the Enhancing Self-Awareness Preference Optimization (ESA-PO) framework to increase the model’s awareness of knowledge boundaries, thereby mitigating the incidence of hallucination. Finally, we conduct comprehensive experiments on low-level vision tasks, with the results demonstrating that our proposed method significantly enhances self-awareness of the model in these tasks and reduces hallucinations. Notably, our proposed method improves both accuracy and self-awareness of the proposed model and outperforms closed-source models in terms of various evaluation metrics.

[CV-20] BizGen: Advancing Article-level Visual Text Rendering for Infographics Generation CVPR2025

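[Quick read]: This paper addresses article-level visual text rendering and introduces a new task: generating high-quality business content, including infographics and slides, from user-provided article-level descriptive prompts and ultra-dense layouts. The main challenges are significantly longer context lengths and the scarcity of high-quality business content data. Unlike most previous work, which focuses on a limited number of sub-regions and sentence-level prompts, business content requires precise adherence to ultra-dense layouts with tens or even hundreds of sub-regions. The two key technical contributions are: (i) Infographics-650K, a scalable, high-quality business content dataset equipped with ultra-dense layouts and prompts, built with a layer-wise retrieval-augmented infographic generation scheme; and (ii) a layout-guided cross attention scheme that injects tens of region-wise prompts into a set of cropped region latent spaces according to the ultra-dense layout and refines each sub-region flexibly during inference using a layout-conditional CFG. Experiments on the BizEval prompt set show strong results compared with previous state-of-the-art systems such as Flux and SD3, and ablation experiments verify the effectiveness of each component. The authors hope that Infographics-650K and BizEval will encourage further progress in business content generation.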

Link: https://arxiv.org/abs/2503.20672
Authors: Yuyang Peng, Shishi Xiao, Keming Wu, Qisheng Liao, Bohan Chen, Kevin Lin, Danqing Huang, Ji Li, Yuhui Yuan
Affiliations: Tsinghua University; Brown University; University of Liverpool; Microsoft Research Asia; Microsoft
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by CVPR 2025. Project Page: this https URL

Abstract:Recently, state-of-the-art text-to-image generation models, such as Flux and Ideogram 2.0, have made significant progress in sentence-level visual text rendering. In this paper, we focus on the more challenging scenarios of article-level visual text rendering and address a novel task of generating high-quality business content, including infographics and slides, based on user-provided article-level descriptive prompts and ultra-dense layouts. The fundamental challenges are twofold: significantly longer context lengths and the scarcity of high-quality business content data. In contrast to most previous works that focus on a limited number of sub-regions and sentence-level prompts, ensuring precise adherence to ultra-dense layouts with tens or even hundreds of sub-regions in business content is far more challenging. We make two key technical contributions: (i) the construction of a scalable, high-quality business content dataset, i.e., Infographics-650K, equipped with ultra-dense layouts and prompts by implementing a layer-wise retrieval-augmented infographic generation scheme; and (ii) a layout-guided cross attention scheme, which injects tens of region-wise prompts into a set of cropped region latent spaces according to the ultra-dense layouts, and refines each sub-region flexibly during inference using a layout-conditional CFG. We demonstrate the strong results of our system compared to previous SOTA systems such as Flux and SD3 on our BizEval prompt set. Additionally, we conduct thorough ablation experiments to verify the effectiveness of each component. We hope our constructed Infographics-650K and BizEval can encourage the broader community to advance the progress of business content generation.

[CV-21] ARMO: Autoregressive Rigging for Multi-Category Objects

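[Quick read]: This paper addresses the fact that existing large-scale generative methods focus mainly on static 3D models and overlook the dynamic nature of certain shapes, such as humanoids, animals, and insects. It introduces OmniRig, the first large-scale rigging dataset, comprising 79,499 meshes with detailed skeleton and skinning information, and proposes the ARMO framework. ARMO uses an autoregressive model to predict joint positions and connectivity relationships in a unified manner: the skeletal structure is treated as a complete graph and discretized into tokens, an auto-encoder produces a latent embedding of the joints, and a mesh-conditioned latent diffusion model predicts that embedding for conditional skeleton generation. This avoids the error accumulation and suboptimal connectivity estimation of regression-based approaches.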

Link: https://arxiv.org/abs/2503.20663
Authors: Mingze Sun, Shiwei Mao, Keyi Chen, Yurun Chen, Shunlin Lu, Jingbo Wang, Junting Dong, Ruqi Huang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Recent advancements in large-scale generative models have significantly improved the quality and diversity of 3D shape generation. However, most existing methods focus primarily on generating static 3D models, overlooking the potentially dynamic nature of certain shapes, such as humanoids, animals, and insects. To address this gap, we focus on rigging, a fundamental task in animation that establishes skeletal structures and skinning for 3D models. In this paper, we introduce OmniRig, the first large-scale rigging dataset, comprising 79,499 meshes with detailed skeleton and skinning information. Unlike traditional benchmarks that rely on predefined standard poses (e.g., A-pose, T-pose), our dataset embraces diverse shape categories, styles, and poses. Leveraging this rich dataset, we propose ARMO, a novel rigging framework that utilizes an autoregressive model to predict both joint positions and connectivity relationships in a unified manner. By treating the skeletal structure as a complete graph and discretizing it into tokens, we encode the joints using an auto-encoder to obtain a latent embedding and an autoregressive model to predict the tokens. A mesh-conditioned latent diffusion model is used to predict the latent embedding for conditional skeleton generation. Our method addresses the limitations of regression-based approaches, which often suffer from error accumulation and suboptimal connectivity estimation. Through extensive experiments on the OmniRig dataset, our approach achieves state-of-the-art performance in skeleton prediction, demonstrating improved generalization across diverse object categories. The code and dataset will be made public for academic use upon acceptance.

[CV-22] AutoRad-Lung: A Radiomic-Guided Prompting Autoregressive Vision-Language Model for Lung Nodule Malignancy Prediction

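[Quick read]: This paper addresses a difficulty in early lung cancer diagnosis: differentiating uncertain cases with similar visual characteristics and close annotation scores. In clinical practice radiologists rely on hand-crafted radiomic features, while recent research has focused on deep learning. Models based on Contrastive Language-Image Pre-Training (CLIP) bring in textual knowledge but have three main limitations: dependence on annotated attributes that are subjective and error-prone, use of textual information only during training, and a convolutional vision encoder with randomly initialized weights that disregards prior knowledge. To overcome these, the paper proposes AutoRad-Lung, which couples an autoregressively pre-trained vision-language model with prompts generated from hand-crafted radiomics, using the vision encoder of the Large-Scale Autoregressive Image Model (AIMv2). It also introduces conditional context optimization, which dynamically generates input-specific prompts to improve cross-modal alignment and better capture the subtle differences between lung tumors and healthy tissue.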

Link: https://arxiv.org/abs/2503.20662
Authors: Sadaf Khademi, Mehran Shabanpour, Reza Taleei, Anastasia Oikonomou, Arash Mohammadi
Affiliations: Natural Sciences and Engineering Research Council (NSERC) of Canada
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
Comments:

Abstract:Lung cancer remains one of the leading causes of cancer-related mortality worldwide. A crucial challenge for early diagnosis is differentiating uncertain cases with similar visual characteristics and close annotation scores. In clinical practice, radiologists rely on quantitative, hand-crafted Radiomic features extracted from Computed Tomography (CT) images, while recent research has primarily focused on deep learning solutions. More recently, Vision-Language Models (VLMs), particularly Contrastive Language-Image Pre-Training (CLIP)-based models, have gained attention for their ability to integrate textual knowledge into lung cancer diagnosis. While CLIP-Lung models have shown promising results, we identified the following potential limitations: (a) dependence on radiologists’ annotated attributes, which are inherently subjective and error-prone, (b) use of textual information only during training, limiting direct applicability at inference, and (c) a convolutional vision encoder with randomly initialized weights, which disregards prior knowledge. To address these limitations, we introduce AutoRad-Lung, which couples an autoregressively pre-trained VLM with prompts generated from hand-crafted Radiomics. AutoRad-Lung uses the vision encoder of the Large-Scale Autoregressive Image Model (AIMv2), pre-trained using a multi-modal autoregressive objective. Given that lung tumors are typically small, irregularly shaped, and visually similar to healthy tissue, AutoRad-Lung offers significant advantages over its CLIP-based counterparts by capturing pixel-level differences. Additionally, we introduce conditional context optimization, which dynamically generates context-specific prompts based on input Radiomics, improving cross-modal alignment.

[CV-23] AccidentSim: Generating Physically Realistic Vehicle Collision Videos from Real-World Accident Reports

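[Quick read]: This paper addresses the difficulty of collecting real-world vehicle accident videos for autonomous driving research, given their rarity and complexity. Existing driving video generation methods can produce visually realistic videos but often fail to deliver physically realistic simulations because they cannot generate accurate post-collision trajectories. The paper proposes AccidentSim, whose key idea is to extract the physical clues and contextual information in real-world accident reports, use a reliable physical simulator to replicate post-collision vehicle trajectories, and build a trajectory dataset for fine-tuning a language model. The fine-tuned model predicts physically consistent post-collision trajectories across driving scenarios from user descriptions. Finally, Neural Radiance Fields (NeRF) render high-quality backgrounds, which are merged with foreground vehicles following those trajectories to produce collision videos. Experiments show that the videos excel in both visual and physical authenticity.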

Link: https://arxiv.org/abs/2503.20654
Authors: Xiangwen Zhang, Qian Zhang, Longfei Han, Qiang Qu, Xiaoming Chen
Affiliations: The University of Sydney; Beijing Technology and Business University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Collecting real-world vehicle accident videos for autonomous driving research is challenging due to their rarity and complexity. While existing driving video generation methods may produce visually realistic videos, they often fail to deliver physically realistic simulations because they lack the capability to generate accurate post-collision trajectories. In this paper, we introduce AccidentSim, a novel framework that generates physically realistic vehicle collision videos by extracting and utilizing the physical clues and contextual information available in real-world vehicle accident reports. Specifically, AccidentSim leverages a reliable physical simulator to replicate post-collision vehicle trajectories from the physical and contextual information in the accident reports and to build a vehicle collision trajectory dataset. This dataset is then used to fine-tune a language model, enabling it to respond to user prompts and predict physically consistent post-collision trajectories across various driving scenarios based on user descriptions. Finally, we employ Neural Radiance Fields (NeRF) to render high-quality backgrounds, merging them with the foreground vehicles that exhibit physically realistic trajectories to generate vehicle collision videos. Experimental results demonstrate that the videos produced by AccidentSim excel in both visual and physical authenticity.

[CV-24] Imitating Radiological Scrolling: A Global-Local Attention Model for 3D Chest CT Volumes Multi-Label Anomaly Classification

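[Quick read]: This paper addresses multi-label classification of 3D CT scans, which is challenging because of the volumetric nature of the data and the variety of anomalies to be detected. Existing methods based on convolutional neural networks (CNNs) struggle to capture long-range dependencies, while Vision Transformers are more expressive but require extensive pre-training, which hinders practical use. Moreover, these methods do not explicitly model how radiologists navigate while scrolling through CT slices, a process that requires both global context and local detail. The paper proposes CT-Scroll, a global-local attention model designed to emulate that scrolling behavior. Experiments on two public datasets demonstrate its efficacy, and an ablation study shows the contribution of each model component.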

Link: https://arxiv.org/abs/2503.20652
Authors: Theo Di Piazza, Carole Lazarus, Olivier Nempont, Loic Boussel
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 13 pages, 4 figures, under review for MIDL 2025

Abstract:The rapid increase in the number of Computed Tomography (CT) scan examinations has created an urgent need for automated tools, such as organ segmentation, anomaly classification, and report generation, to assist radiologists with their growing workload. Multi-label classification of Three-Dimensional (3D) CT scans is a challenging task due to the volumetric nature of the data and the variety of anomalies to be detected. Existing deep learning methods based on Convolutional Neural Networks (CNNs) struggle to capture long-range dependencies effectively, while Vision Transformers require extensive pre-training, posing challenges for practical use. Additionally, these existing methods do not explicitly model the radiologist’s navigational behavior while scrolling through CT scan slices, which requires both global context understanding and local detail awareness. In this study, we present CT-Scroll, a novel global-local attention model specifically designed to emulate the scrolling behavior of radiologists during the analysis of 3D CT scans. Our approach is evaluated on two public datasets, demonstrating its efficacy through comprehensive experiments and an ablation study that highlights the contribution of each model component.

[CV-25] MMGen: Unified Multi-modal Image Generation and Understanding in One Go

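[Quick read]: This paper aims to build a unified diffusion framework for multi-modal generation and understanding that makes image diffusion and other cross-modal tasks seamless and controllable. It proposes MMGen, a unified framework that integrates multiple generative tasks into a single diffusion model. The key innovations are a novel diffusion transformer that flexibly supports multi-modal output and a simple modality-decoupling strategy that unifies the tasks. MMGen thereby supports multi-modal category-conditioned generation, multi-modal visual understanding from RGB images (depth, surface normals, and segmentation maps), and multi-modal conditioned generation based on specific modality conditions. Extensive experiments demonstrate its effectiveness and superiority across diverse tasks and conditions.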

Link: https://arxiv.org/abs/2503.20644
Authors: Jiepeng Wang, Zhaoqing Wang, Hao Pan, Yuan Liu, Dongdong Yu, Changhu Wang, Wenping Wang
Affiliations: The University of Hong Kong; The University of Sydney; AIsphere; Tsinghua University; Hong Kong University of Science and Technology; Texas A&M University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Our project page: this https URL

Abstract:A unified diffusion framework for multi-modal generation and understanding has the transformative potential to achieve seamless and controllable image diffusion and other cross-modal tasks. In this paper, we introduce MMGen, a unified framework that integrates multiple generative tasks into a single diffusion model. This includes: (1) multi-modal category-conditioned generation, where multi-modal outputs are generated simultaneously through a single inference process, given category information; (2) multi-modal visual understanding, which accurately predicts depth, surface normals, and segmentation maps from RGB images; and (3) multi-modal conditioned generation, which produces corresponding RGB images based on specific modality conditions and other aligned modalities. Our approach develops a novel diffusion transformer that flexibly supports multi-modal output, along with a simple modality-decoupling strategy to unify various tasks. Extensive experiments and applications demonstrate the effectiveness and superiority of MMGen across diverse tasks and conditions, highlighting its potential for applications that require simultaneous generation and understanding.

[CV-26] Robust Flower Cluster Matching Using The Unscented Transform

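[Quick read]: This paper addresses robust image registration of flower clusters over time, which is hampered by changes in the plant's visual appearance caused by the pollination process and by occlusions from growth and camera angles. It presents a method that matches flower clusters using descriptors generated from RGB-D data and uses the Unscented Transform to efficiently estimate descriptor uncertainty tolerances: the transform propagates the uncertainty of flower positions through the nonlinear transformation to determine the variation in the descriptor domain, which allows for spatial uncertainty within a cluster. A Monte Carlo simulation validates the Unscented Transform results, supporting precision robotic pollination in dynamic environments.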

Link: https://arxiv.org/abs/2503.20631
Authors: Andy Chu, Rashik Shrestha, Yu Gu, Jason N. Gross
Affiliations: Department of Mechanical, Materials, and Aerospace Engineering, West Virginia University, Morgantown, USA; Lane Department of Computer Science and Electrical Engineering, West Virginia University, Morgantown, USA
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: CASE2025 Under Review

Abstract:Monitoring flowers over time is essential for precision robotic pollination in agriculture. To accomplish this, a continuous spatial-temporal observation of plant growth can be done using stationary RGB-D cameras. However, image registration becomes a serious challenge due to changes in the visual appearance of the plant caused by the pollination process and occlusions from growth and camera angles. Plants flower in a manner that produces distinct clusters on branches. This paper presents a method for matching flower clusters using descriptors generated from RGB-D data and allows for spatial uncertainty within the cluster. The proposed approach leverages the Unscented Transform to efficiently estimate plant descriptor uncertainty tolerances, enabling a robust image-registration process despite temporal changes. The Unscented Transform is used to handle the nonlinear transformations by propagating the uncertainty of flower positions to determine the variations in the descriptor domain. A Monte Carlo simulation is used to validate the Unscented Transform results, confirming our method’s effectiveness for flower cluster matching. Therefore, it can facilitate improved robotic pollination in dynamic environments.
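
The Unscented Transform step described above can be sketched as follows: a small set of sigma points is pushed through the nonlinear descriptor function, and the descriptor's mean and variance are recovered from the weighted outputs. This is a minimal sketch under assumptions that are not from the paper: the position covariance is diagonal, the descriptor is scalar, and `squared_range` is a hypothetical stand-in for the real descriptor.

```python
import math


def unscented_transform(mean, variances, f, kappa=1.0):
    """Propagate a Gaussian with diagonal covariance through a scalar function f.

    mean:      list of n coordinates.
    variances: list of n per-axis variances (diagonal covariance assumed).
    kappa:     spread parameter; n + kappa = 3 is a common choice for Gaussians.
    """
    n = len(mean)
    scale = n + kappa
    # Sigma points: the mean itself plus symmetric offsets along each axis.
    points = [list(mean)]
    weights = [kappa / scale]
    for i in range(n):
        offset = math.sqrt(scale * variances[i])
        for sign in (1.0, -1.0):
            p = list(mean)
            p[i] += sign * offset
            points.append(p)
            weights.append(1.0 / (2.0 * scale))
    ys = [f(p) for p in points]
    y_mean = sum(w * y for w, y in zip(weights, ys))
    y_var = sum(w * (y - y_mean) ** 2 for w, y in zip(weights, ys))
    return y_mean, y_var


def squared_range(p):
    """Hypothetical descriptor: squared distance of a flower from a reference point."""
    return p[0] ** 2 + p[1] ** 2


# Flower at (3, 4) with standard deviations of 0.1 and 0.2 (made-up values).
y_mean, y_var = unscented_transform([3.0, 4.0], [0.01, 0.04], squared_range)
```

Only 2n + 1 function evaluations are needed (five here), which is why the transform is attractive compared with the Monte Carlo simulation that the paper uses as a validation baseline.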

[CV-27] IAP: Improving Continual Learning of Vision-Language Models via Instance-Aware Prompting

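[Quick read]: This paper addresses the forward and backward forgetting that pre-trained vision-language models (PT-VLMs) suffer in the Multi-Domain Class-Incremental Learning (MCIL) scenario, particularly under memory constraints. It proposes an Instance-Aware Prompting (IAP) framework that optimizes prompt design through two modules: Instance-Aware Gated Prompting (IA-GP) and Instance-Aware Class-Distribution-Driven Prompting (IA-CDDP). IA-GP dynamically assigns prompts across transformer layers at the instance level, which enhances adaptation to new tasks while mitigating forgetting; IA-CDDP improves task adaptation by determining an accurate task-label-related confidence score for each instance. Experiments across 11 datasets with three performance metrics demonstrate the effectiveness of the method.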

Link: https://arxiv.org/abs/2503.20612
Authors: Hao Fu,Hanbin Zhao,Jiahua Dong,Chao Zhang,Hui Qian
Affiliations: Zhejiang University; Mohamed bin Zayed University of Artificial Intelligence
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Code can be found at this https URL

Abstract:Recent pre-trained vision-language models (PT-VLMs) often face a Multi-Domain Class-Incremental Learning (MCIL) scenario in practice, where several classes and domains of multi-modal tasks arrive incrementally. Without access to previously learned tasks and unseen tasks, memory-constrained MCIL suffers from forward and backward forgetting. To alleviate the above challenges, parameter-efficient fine-tuning techniques (PEFT), such as prompt tuning, are employed to adapt the PT-VLM to the diverse incrementally learned tasks. To achieve effective new task adaptation, existing methods only consider the effect of PEFT strategy selection, but neglect the influence of PEFT parameter setting (e.g., prompting). In this paper, we tackle the challenge of optimizing prompt designs for diverse tasks in MCIL and propose an Instance-Aware Prompting (IAP) framework. Specifically, our Instance-Aware Gated Prompting (IA-GP) module enhances adaptation to new tasks while mitigating forgetting by dynamically assigning prompts across transformer layers at the instance level. Our Instance-Aware Class-Distribution-Driven Prompting (IA-CDDP) improves the task adaptation process by determining an accurate task-label-related confidence score for each instance. Experimental evaluations across 11 datasets, using three performance metrics, demonstrate the effectiveness of our proposed method. Code can be found at this https URL.

[CV-28] Diffusion Counterfactuals for Image Regressors

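[Quick read]: This paper addresses the limited use of counterfactual explanations in regression tasks, in particular the sparsity and quality challenges of counterfactual explanations for image regression. The key solution is two methods based on diffusion generative models: a Denoising Diffusion Probabilistic Model that operates directly in pixel space, and a Diffusion Autoencoder that operates in latent space. Both generate realistic, semantic, and smooth counterfactuals on CelebA-HQ and a synthetic dataset, giving easily interpretable insight into the regression model's decision process and revealing potential spurious correlations. The study finds that, compared with classifiers, regression counterfactuals need larger semantic changes to shift the predicted value significantly, which makes sparse counterfactuals harder to find. In addition, pixel-space counterfactuals are sparser, while latent-space counterfactuals are of higher quality and allow larger semantic changes.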

Link: https://arxiv.org/abs/2503.20595
Authors: Trung Duc Ha,Sidney Bender
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
Comments: 24 Pages, 5 Figures, Accepted at 3rd World Conference on eXplainable Artificial Intelligence (xAI-2025), Code and reproduction instructions available on GitHub, see this https URL

Abstract:Counterfactual explanations have been successfully applied to create human interpretable explanations for various black-box models. They are handy for tasks in the image domain, where the quality of the explanations benefits from recent advances in generative models. Although counterfactual explanations have been widely applied to classification models, their application to regression tasks remains underexplored. We present two methods to create counterfactual explanations for image regression tasks using diffusion-based generative models to address challenges in sparsity and quality: 1) one based on a Denoising Diffusion Probabilistic Model that operates directly in pixel-space and 2) another based on a Diffusion Autoencoder operating in latent space. Both produce realistic, semantic, and smooth counterfactuals on CelebA-HQ and a synthetic data set, providing easily interpretable insights into the decision-making process of the regression model and revealing spurious correlations. We find that for regression counterfactuals, changes in features depend on the region of the predicted value. Large semantic changes are needed for significant changes in predicted values, making it harder to find sparse counterfactuals than with classifiers. Moreover, pixel space counterfactuals are more sparse while latent space counterfactuals are of higher quality and allow bigger semantic changes.

[CV-29] TerraTorch: The Geospatial Foundation Models Toolkit

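[Quick read]: This paper addresses the difficulty of quickly fine-tuning and benchmarking geospatial foundation models on satellite, weather, and climate data. The key to the proposed solution is the TerraTorch toolkit, which is built on PyTorch Lightning and integrates domain-specific data modules, pre-defined tasks, and a modular model factory that flexibly pairs backbones with diverse decoder heads. Models can be fine-tuned in a no-code fashion by editing only a training configuration, and the Iterate extension for automated hyperparameter optimization further reduces the expertise and time required. TerraTorch also integrates directly with GEO-Bench, enabling systematic and reproducible benchmarking of geospatial foundation models. The toolkit is open-sourced under Apache 2.0 and can be installed via pip.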

Link: https://arxiv.org/abs/2503.20563
Authors: Carlos Gomes,Benedikt Blumenstiel,Joao Lucas de Sousa Almeida,Pedro Henrique de Oliveira,Paolo Fraccaro,Francesc Marti Escofet,Daniela Szwarcman,Naomi Simumba,Romeo Kienzler,Bianca Zadrozny
Affiliations: IBM Research
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: IGARSS 2025

Abstract:TerraTorch is a fine-tuning and benchmarking toolkit for Geospatial Foundation Models built on PyTorch Lightning and tailored for satellite, weather, and climate data. It integrates domain-specific data modules, pre-defined tasks, and a modular model factory that pairs any backbone with diverse decoder heads. These components allow researchers and practitioners to fine-tune supported models in a no-code fashion by simply editing a training configuration. By consolidating best practices for model development and incorporating the automated hyperparameter optimization extension Iterate, TerraTorch reduces the expertise and time required to fine-tune or benchmark models on new Earth Observation use cases. Furthermore, TerraTorch directly integrates with GEO-Bench, allowing for systematic and reproducible benchmarking of Geospatial Foundation Models. TerraTorch is open sourced under Apache 2.0, available at this https URL, and can be installed via pip install terratorch.

[CV-30] Beyond Intermediate States: Explaining Visual Redundancy through Language

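[Quick read]: This paper addresses two problems for multi-modal large language models (MLLMs): the heavy computational burden of processing large numbers of visual tokens, and the inability of existing visual token pruning methods to define visual redundancy precisely. Existing methods rely mainly on the model's intermediate states (such as attention scores), which fail to capture how visual tokens influence the model's visual understanding, that is, their actual contribution to the predicted probabilities of textual token candidates. To address this, the paper manipulates the visual input from token-centric and context-centric perspectives and analyzes the resulting changes in the textual output, yielding an intuitive and comprehensive analysis. It finds that visual tokens with low ViT-[cls] association and low text-to-image attention scores can still contain recognizable information and contribute significantly to the image's overall information. The key solution therefore combines the two perspectives and introduces a context-independent condition to identify redundant prototypes from training images, which are then used to probe the redundancy of each visual token during inference. Experiments on single-image, multi-image, and video comprehension tasks show that the method achieves 90% to 110% of the performance while pruning 80% to 90% of visual tokens.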

Link: https://arxiv.org/abs/2503.20540
Authors: Dingchen Yang,Bowen Cao,Anran Zhang,Weibo Gu,Winston Hu,Guang Chen
Affiliations: Tongji University; Tencent Hunyuan Team; Chinese University of Hong Kong
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Multi-modal Large Language Models (MLLMs) often process thousands of visual tokens, which consume a significant portion of the context window and impose a substantial computational burden. Prior work has empirically explored visual token pruning methods based on MLLMs’ intermediate states (e.g., attention scores). However, they have limitations in precisely defining visual redundancy due to their inability to capture the influence of visual tokens on MLLMs’ visual understanding (i.e., the predicted probabilities for textual token candidates). To address this issue, we manipulate the visual input and investigate variations in the textual output from both token-centric and context-centric perspectives, achieving intuitive and comprehensive analysis. Experimental results reveal that visual tokens with low ViT-[cls] association and low text-to-image attention scores can contain recognizable information and significantly contribute to images’ overall information. To develop a more reliable method for identifying and pruning redundant visual tokens, we integrate these two perspectives and introduce a context-independent condition to identify redundant prototypes from training images, which probes the redundancy of each visual token during inference. Extensive experiments on single-image, multi-image and video comprehension tasks demonstrate the effectiveness of our method, notably achieving 90% to 110% of the performance while pruning 80% to 90% of visual tokens.
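
To make the pruning ratios in this abstract concrete, the sketch below drops a fixed fraction of tokens given a per-token redundancy score. It is only an illustration: how the paper derives its scores from redundant prototypes is not specified here, so `redundancy_scores` is a hypothetical input and `prune_visual_tokens` is a name invented for this example.

```python
def prune_visual_tokens(redundancy_scores, prune_ratio):
    """Return the sorted indices of tokens kept after dropping the most redundant ones."""
    if not 0.0 <= prune_ratio < 1.0:
        raise ValueError("prune_ratio must be in [0, 1)")
    n_tokens = len(redundancy_scores)
    n_keep = max(1, round(n_tokens * (1.0 - prune_ratio)))
    # Rank tokens from least to most redundant and keep the first n_keep.
    ranked = sorted(range(n_tokens), key=lambda i: redundancy_scores[i])
    return sorted(ranked[:n_keep])


# Hypothetical scores for ten visual tokens (higher means more redundant).
scores = [0.9, 0.1, 0.8, 0.95, 0.2, 0.7, 0.85, 0.99, 0.6, 0.75]
kept = prune_visual_tokens(scores, prune_ratio=0.8)
print(kept)
```

Returning indices in their original order keeps the surviving tokens in sequence, which matters if positional information is attached to them.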

[CV-31] TD-BFR: Truncated Diffusion Model for Efficient Blind Face Restoration ICME2025

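[Quick read]: This paper tackles two notable problems of diffusion-based methods for Blind Face Restoration (BFR): 1) slow training and inference, and 2) inadequate recovery of fine-grained facial details. It proposes a novel Truncated Diffusion model for efficient Blind Face Restoration (TD-BFR). The key is a truncated sampling method that starts from low-quality (LQ) images at low resolution to speed up sampling, together with an adaptive degradation removal module that handles unknown degradations and connects the generation processes across resolutions. The priors of pre-trained diffusion models are further adapted to recover rich facial details, giving efficient, high-quality restoration in a coarse-to-fine manner. Experiments show that TD-BFR is on average 4.75 times faster than current state-of-the-art diffusion-based BFR methods while maintaining competitive image quality.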

Link: https://arxiv.org/abs/2503.20537
Authors: Ziying Zhang,Xiang Gao,Zhixin Wang,Qiang Hu,Xiaoyun Zhang
Affiliations: Cooperative Medianet Innovation Center, Shanghai Jiao Tong University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ICME 2025

Abstract:Diffusion-based methodologies have shown significant potential in blind face restoration (BFR), leveraging their robust generative capabilities. However, they are often criticized for two significant problems: 1) slow training and inference speed, and 2) inadequate recovery of fine-grained facial details. To address these problems, we propose a novel Truncated Diffusion model for efficient Blind Face Restoration (TD-BFR), a three-stage paradigm tailored for the progressive resolution of degraded images. Specifically, TD-BFR utilizes an innovative truncated sampling method, starting from low-quality (LQ) images at low resolution to enhance sampling speed, and then introduces an adaptive degradation removal module to handle unknown degradations and connect the generation processes across different resolutions. Additionally, we further adapt the priors of pre-trained diffusion models to recover rich facial details. Our method efficiently restores high-quality images in a coarse-to-fine manner and experimental results demonstrate that TD-BFR is, on average, 4.75× faster than current state-of-the-art diffusion-based BFR methods while maintaining competitive quality.

[CV-32] GAIA-2: A Controllable Multi-View Generative World Model for Autonomous Driving

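[Quick read]: This paper addresses the shortcomings of current generative models for autonomous driving, specifically domain requirements such as multi-agent interactions, fine-grained control, and multi-camera consistency. The key solution is GAIA-2 (Generative AI for Autonomy), a latent diffusion world model that unifies these capabilities within a single generative framework. GAIA-2 supports controllable video generation conditioned on rich structured inputs (ego-vehicle dynamics, agent configurations, environmental factors, and road semantics) and generates high-resolution, spatiotemporally consistent multi-camera videos across geographically diverse driving environments. Its key ingredient is the combination of structured conditioning with external latent embeddings (for example, from a proprietary driving model), which enables flexible and semantically grounded scene synthesis and supports scalable simulation of both common and rare scenarios in autonomous-system development.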

Link: https://arxiv.org/abs/2503.20523
Authors: Lloyd Russell,Anthony Hu,Lorenzo Bertoni,George Fedoseev,Jamie Shotton,Elahe Arani,Gianluca Corrado
Affiliations: Wayve
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments: Technical Report

Abstract:Generative models offer a scalable and flexible paradigm for simulating complex environments, yet current approaches fall short in addressing the domain-specific requirements of autonomous driving - such as multi-agent interactions, fine-grained control, and multi-camera consistency. We introduce GAIA-2, Generative AI for Autonomy, a latent diffusion world model that unifies these capabilities within a single generative framework. GAIA-2 supports controllable video generation conditioned on a rich set of structured inputs: ego-vehicle dynamics, agent configurations, environmental factors, and road semantics. It generates high-resolution, spatiotemporally consistent multi-camera videos across geographically diverse driving environments (UK, US, Germany). The model integrates both structured conditioning and external latent embeddings (e.g., from a proprietary driving model) to facilitate flexible and semantically grounded scene synthesis. Through this integration, GAIA-2 enables scalable simulation of both common and rare driving scenarios, advancing the use of generative world models as a core tool in the development of autonomous systems. Videos are available at this https URL.

[CV-33] MAR-3D: Progressive Masked Auto-regressor for High-Resolution 3D Generation CVPR2025

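[Quick read]: This paper addresses three key challenges in applying auto-regressive transformers to 3D generation: the conflict between the unordered nature of 3D data and the sequential prediction paradigm, the substantial compression loss of conventional vector quantization on 3D meshes, and the lack of efficient scaling strategies for high-resolution latent prediction. It proposes MAR-3D, an architecture that combines a pyramid variational autoencoder with a cascaded masked auto-regressive transformer (Cascaded MAR) for progressive latent upscaling in continuous space. Random masking during training and auto-regressive denoising in random order during inference naturally accommodate the unordered nature of 3D latent tokens. A cascaded training strategy with condition augmentation is also proposed to upscale latent token resolution efficiently with fast convergence. Experiments show that MAR-3D surpasses existing methods in performance and generalization, and scales better than joint distribution modeling approaches such as diffusion transformers.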

Link: https://arxiv.org/abs/2503.20519
Authors: Jinnan Chen,Lingting Zhu,Zeyu Hu,Shengju Qian,Yugang Chen,Xin Wang,Gim Hee Lee
Affiliations: National University of Singapore; The University of Hong Kong; LIGHTSPEED
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to CVPR 2025

Abstract:Recent advances in auto-regressive transformers have revolutionized generative modeling across different domains, from language processing to visual generation, demonstrating remarkable capabilities. However, applying these advances to 3D generation presents three key challenges: the unordered nature of 3D data conflicts with the sequential next-token prediction paradigm, conventional vector quantization approaches incur substantial compression loss when applied to 3D meshes, and efficient scaling strategies for higher resolution latent prediction are lacking. To address these challenges, we introduce MAR-3D, which integrates a pyramid variational autoencoder with a cascaded masked auto-regressive transformer (Cascaded MAR) for progressive latent upscaling in the continuous space. Our architecture employs random masking during training and auto-regressive denoising in random order during inference, naturally accommodating the unordered property of 3D latent tokens. Additionally, we propose a cascaded training strategy with condition augmentation that enables efficient upscaling of the latent token resolution with fast convergence. Extensive experiments demonstrate that MAR-3D not only achieves superior performance and generalization capabilities compared to existing methods but also exhibits enhanced scaling capabilities compared to joint distribution modeling approaches (e.g., diffusion transformers).

[CV-34] Small Object Detection: A Comprehensive Survey on Challenges, Techniques, and Real-World Applications

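[Quick read]: This paper addresses the key challenges of small object detection (SOD) in computer vision, in particular how deep learning methods cope with detection difficulties caused by limited spatial and contextual information, low resolution, occlusion, background interference, and class imbalance. It surveys research published in Q1 journals during 2024-2025, analyzing challenges, state-of-the-art techniques, datasets, evaluation metrics, and real-world applications.

The key solutions are innovative deep learning techniques, including multi-scale feature extraction, Super-Resolution (SR) techniques, attention mechanisms, and transformer-based architectures. Data augmentation, synthetic data generation, and transfer learning are used to ease data scarcity and domain adaptation issues. Emerging trends such as lightweight neural networks, knowledge distillation (KD), and self-supervised learning point toward more efficient detection in resource-constrained settings, such as Unmanned Aerial Vehicle (UAV) surveillance and edge computing.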

Link: https://arxiv.org/abs/2503.20516
Authors: Mahya Nikouei,Bita Baroutian,Shahabedin Nabavi,Fateme Taraghi,Atefe Aghaei,Ayoob Sajedi,Mohsen Ebrahimi Moghaddam
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Small object detection (SOD) is a critical yet challenging task in computer vision, with applications spanning surveillance, autonomous systems, medical imaging, and remote sensing. Unlike larger objects, small objects contain limited spatial and contextual information, making accurate detection difficult. Challenges such as low resolution, occlusion, background interference, and class imbalance further complicate the problem. This survey provides a comprehensive review of recent advancements in SOD using deep learning, focusing on articles published in Q1 journals during 2024-2025. We analyzed challenges, state-of-the-art techniques, datasets, evaluation metrics, and real-world applications. Recent advancements in deep learning have introduced innovative solutions, including multi-scale feature extraction, Super-Resolution (SR) techniques, attention mechanisms, and transformer-based architectures. Additionally, improvements in data augmentation, synthetic data generation, and transfer learning have addressed data scarcity and domain adaptation issues. Furthermore, emerging trends such as lightweight neural networks, knowledge distillation (KD), and self-supervised learning offer promising directions for improving detection efficiency, particularly in resource-constrained environments like Unmanned Aerial Vehicles (UAV)-based surveillance and edge computing. We also review widely used datasets, along with standard evaluation metrics such as mean Average Precision (mAP) and size-specific AP scores. The survey highlights real-world applications, including traffic monitoring, maritime surveillance, industrial defect detection, and precision agriculture. Finally, we discuss open research challenges and future directions, emphasizing the need for robust domain adaptation techniques, better feature fusion strategies, and real-time performance optimization.

[CV-35] Vision-Amplified Semantic Entropy for Hallucination Detection in Medical Visual Question Answering

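[Quick read]: This paper addresses the tendency of medical multimodal large language models (MLLMs) to hallucinate in Visual Question Answering (VQA), that is, to produce incorrect responses that contradict the input image, which poses risks for clinical decision-making. Detecting and reducing hallucinations is essential if clinicians and patients are to trust these models and adopt them in practice. Existing methods such as semantic entropy (SE) show promise for detecting hallucinations in large language models, but applying them to medical MLLMs presents a dilemma: weak perturbations preserve the clinical validity of the image content but may be ignored by the model, while strong perturbations increase the model's reliance on visual information but can destroy diagnostic features. The paper proposes Vision Amplified Semantic Entropy (VASE), which uses weak image transformations to preserve clinical validity and amplifies the influence of the visual input by contrasting the resulting distribution with one derived from a distorted image, improving hallucination detection in medical VQA. Experiments on two medical open-ended VQA datasets show that VASE outperforms existing methods.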

Link: https://arxiv.org/abs/2503.20504
Authors: Zehui Liao,Shishuai Hu,Ke Zou,Huazhu Fu,Liangli Zhen,Yong Xia
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 11 pages, 2 figures

Abstract:Multimodal large language models (MLLMs) have demonstrated significant potential in medical Visual Question Answering (VQA). Yet, they remain prone to hallucinations, that is, incorrect responses that contradict input images, posing substantial risks in clinical decision-making. Detecting these hallucinations is essential for establishing trust in MLLMs among clinicians and patients, thereby enabling their real-world adoption. Current hallucination detection methods, especially semantic entropy (SE), have demonstrated promising hallucination detection capacity for LLMs. However, adapting SE to medical MLLMs by incorporating visual perturbations presents a dilemma. Weak perturbations preserve image content and ensure clinical validity, but may be overlooked by medical MLLMs, which tend to over-rely on language priors. In contrast, strong perturbations can distort essential diagnostic features, compromising clinical interpretation. To address this issue, we propose Vision Amplified Semantic Entropy (VASE), which incorporates weak image transformations and amplifies the impact of visual input, to improve hallucination detection in medical VQA. We first estimate the semantic predictive distribution under weak visual transformations to preserve clinical validity, and then amplify visual influence by contrasting this distribution with that derived from a distorted image. The entropy of the resulting distribution is estimated as VASE. Experiments on two medical open-ended VQA datasets demonstrate that VASE consistently outperforms existing hallucination detection methods.
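
The two steps named in this abstract, estimating a semantic distribution and then contrasting it with one from a distorted image, can be sketched as follows. This is a rough illustration under stated assumptions, not the paper's formula: the abstract does not give the contrast rule, so `amplify_by_contrast` uses a log-ratio sharpening borrowed from contrastive decoding, and `alpha`, the cluster labels, and the probabilities are all hypothetical.

```python
import math
from collections import Counter


def semantic_entropy(cluster_ids):
    """Entropy (in nats) of the empirical distribution over meaning clusters."""
    counts = Counter(cluster_ids)
    total = len(cluster_ids)
    return -sum((c / total) * math.log(c / total) for c in counts.values())


def amplify_by_contrast(p_weak, p_distorted, alpha=1.0, eps=1e-12):
    """Assumed contrast rule: sharpen p_weak against p_distorted, then renormalise."""
    logits = {
        k: (1.0 + alpha) * math.log(p_weak[k] + eps)
        - alpha * math.log(p_distorted.get(k, 0.0) + eps)
        for k in p_weak
    }
    peak = max(logits.values())  # subtract the peak for numerical stability
    unnorm = {k: math.exp(v - peak) for k, v in logits.items()}
    z = sum(unnorm.values())
    return {k: v / z for k, v in unnorm.items()}


def entropy(dist):
    """Shannon entropy (in nats) of a probability dictionary."""
    return -sum(p * math.log(p) for p in dist.values() if p > 0.0)


# Hypothetical cluster probabilities under weak transforms and under distortion.
p_weak = {"effusion": 0.6, "no finding": 0.4}
p_distorted = {"effusion": 0.5, "no finding": 0.5}
amplified = amplify_by_contrast(p_weak, p_distorted)
print(entropy(p_weak), entropy(amplified))
```

In this toy case the answer that depends on the image gains probability after the contrast, so the entropy of the amplified distribution is lower than that of the weak-transform distribution alone.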

[CV-36] MLLM-Selector: Necessity and Diversity-driven High-Value Data Selection for Enhanced Visual Instruction Tuning

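[Quick read]: This paper addresses the limited understanding of what makes instruction data high quality for visual instruction tuning (VIT) of multi-modal large language models (MLLMs), and the lack of frameworks for selecting such data automatically. The key solution is MLLM-Selector, an automated approach that identifies valuable tuning data by weighing necessity scores together with a diversity strategy. The method first fine-tunes a pretrained model on a randomly sampled subset to build a seed model, then uses that model to compute a necessity score for every sample in the VIT data pool, selecting the samples most important for improving model performance. The study stresses the value of combining necessity and diversity in data selection. Under identical experimental conditions, MLLM-Selector surpasses LLaVA-1.5 on some benchmarks with less than 1% of the data and exceeds it on all validated benchmarks with less than 50%.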

Link: https://arxiv.org/abs/2503.20502
Authors: Yiwei Ma,Guohai Xu,Xiaoshuai Sun,Jiayi Ji,Jie Lou,Debing Zhang,Rongrong Ji
Affiliations: Xiamen University; Xiaohongshu
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Tech Report

Abstract:Visual instruction tuning (VIT) has emerged as a crucial technique for enabling multi-modal large language models (MLLMs) to follow user instructions adeptly. Yet, a significant gap persists in understanding the attributes of high-quality instruction tuning data and frameworks for its automated selection. To address this, we introduce MLLM-Selector, an automated approach that identifies valuable data for VIT by weighing necessity and diversity. Our process starts by randomly sampling a subset from the VIT data pool to fine-tune a pretrained model, thus creating a seed model with an initial ability to follow instructions. Then, leveraging the seed model, we calculate necessity scores for each sample in the VIT data pool to identify samples pivotal for enhancing model performance. Our findings underscore the importance of mixing necessity and diversity in data choice, leading to the creation of MLLM-Selector, our methodology that fuses necessity scoring with strategic sampling for superior data refinement. Empirical results indicate that within identical experimental conditions, MLLM-Selector surpasses LLaVA-1.5 in some benchmarks with less than 1% of the data and consistently exceeds performance across all validated benchmarks when using less than 50%.

[CV-37] Towards Efficient and General-Purpose Few-Shot Misclassification Detection for Vision-Language Models

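[Quick read]: This paper addresses the unreliable predictions that result when modern neural networks, deployed in high-security and dynamically changing scenarios, are overconfident about misclassified inputs. Existing methods achieve some success on small-scale datasets but require training from scratch, and there is no efficient, general-purpose misclassification detection (MisD) method, which makes it hard to handle large-scale and ever-changing datasets. The key solution uses a vision-language model (VLM) together with text information to build an efficient, general-purpose misclassification detection framework. FSMisD, a few-shot prompt learning framework for MisD, avoids training from scratch and so improves tuning efficiency. Adaptive pseudo-sample generation and a novel negative loss push category prompts away from pseudo features, mitigating overconfidence and strengthening misclassification detection. Experiments confirm the method's effectiveness, efficiency, and generalization across datasets with domain shift.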

Link: https://arxiv.org/abs/2503.20492
Authors: Fanhu Zeng,Zhen Cheng,Fei Zhu,Xu-Yao Zhang
Affiliations: State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, CAS; School of Artificial Intelligence, UCAS; Centre for Artificial Intelligence and Robotics, HKISI-CAS
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: preprint

Abstract:Reliable prediction by classifiers is crucial for their deployment in high security and dynamically changing situations. However, modern neural networks often exhibit overconfidence for misclassified predictions, highlighting the need for confidence estimation to detect errors. Despite the achievements obtained by existing methods on small-scale datasets, they all require training from scratch and there are no efficient and effective misclassification detection (MisD) methods, hindering practical application towards large-scale and ever-changing datasets. In this paper, we pave the way to exploit vision language model (VLM) leveraging text information to establish an efficient and general-purpose misclassification detection framework. By harnessing the power of VLM, we construct FSMisD, a Few-Shot prompt learning framework for MisD to refrain from training from scratch and therefore improve tuning efficiency. To enhance misclassification detection ability, we use adaptive pseudo sample generation and a novel negative loss to mitigate the issue of overconfidence by pushing category prompts away from pseudo features. We conduct comprehensive experiments with prompt learning methods and validate the generalization ability across various datasets with domain shift. Significant and consistent improvement demonstrates the effectiveness, efficiency and generalizability of our approach.

[CV-38] Contrastive Learning Guided Latent Diffusion Model for Image-to-Image Translation

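[Quick read]: This paper addresses two main problems: first, variations in the text prompt significantly affect the quality of generated images, and users find it hard to craft an ideal prompt that fully captures the content of the input image; second, when existing models modify specific regions of a reference image, they often introduce unintended changes elsewhere. The paper proposes pix2pix-zeroCon, a zero-shot diffusion-based method that needs no additional training. It introduces a cross-attention guiding loss and a patch-wise contrastive loss within a pre-trained diffusion model so that the content and structure of the edited image are precisely preserved. The editing direction in the text embedding space is determined automatically from the reference image and target prompt, so the method operates directly on a pre-trained text-to-image diffusion model. Experiments show that it surpasses existing models on image-to-image translation, with improved fidelity and controllability.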

Link: https://arxiv.org/abs/2503.20484
Authors: Qi Si,Bo Wang,Zhao Zhang
Affiliations: School of Computer Science and Information Engineering, Hefei University of Technology, China; Yunnan Key Laboratory of Software Engineering, China; School of Computer Science and Engineering, Donghua University, China; ByteDance, Culver City, USA; School of Computer and Artificial Intelligence, Zhengzhou University, China
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 11 pages, 13 figures

Abstract:The diffusion model has demonstrated superior performance in synthesizing diverse and high-quality images for text-guided image translation. However, there remains room for improvement in both the formulation of text prompts and the preservation of reference image content. First, variations in target text prompts can significantly influence the quality of the generated images, and it is often challenging for users to craft an optimal prompt that fully captures the content of the input image. Second, while existing models can introduce desired modifications to specific regions of the reference image, they frequently induce unintended alterations in areas that should remain unchanged. To address these challenges, we propose pix2pix-zeroCon, a zero-shot diffusion-based method that eliminates the need for additional training by leveraging patch-wise contrastive loss. Specifically, we automatically determine the editing direction in the text embedding space based on the reference image and target prompts. Furthermore, to ensure precise content and structural preservation in the edited image, we introduce cross-attention guiding loss and patch-wise contrastive loss between the generated and original image embeddings within a pre-trained diffusion model. Notably, our approach requires no additional training and operates directly on a pre-trained text-to-image diffusion model. Extensive experiments demonstrate that our method surpasses existing models in image-to-image translation, achieving enhanced fidelity and controllability.
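
A patch-wise contrastive loss of the kind mentioned in this abstract is commonly written as an InfoNCE objective. The sketch below shows that generic form for a single patch and is an assumption about the family of loss, not the paper's exact formulation. The function names, the temperature value, and the toy two-dimensional features are hypothetical, and feature vectors are assumed to be non-zero.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def patch_nce_loss(query, positive, negatives, temperature=0.07):
    """InfoNCE loss for one patch: pull query toward positive, away from negatives."""
    q = normalize(query)
    logits = [dot(q, normalize(positive)) / temperature]
    logits += [dot(q, normalize(k)) / temperature for k in negatives]
    # Log-sum-exp with the peak subtracted for numerical stability.
    peak = max(logits)
    log_sum_exp = peak + math.log(sum(math.exp(l - peak) for l in logits))
    return log_sum_exp - logits[0]


# Toy features: the generated patch should match the same location in the
# original image (positive) rather than other locations (negatives).
preserved = patch_nce_loss([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0]])
altered = patch_nce_loss([1.0, 0.0], [0.0, 1.0], [[1.0, 0.0]])
print(preserved, altered)
```

The loss is near zero when a generated patch matches its counterpart in the original image and large when it resembles some other patch instead, which is how such a term can discourage unintended structural changes.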

[CV-39] Dissecting and Mitigating Diffusion Bias via Mechanistic Interpretability CVPR2025

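[Quick read]: This paper addresses the social biases inherent in diffusion models that generate high-quality content, particularly biases related to gender and race. These biases can have harmful real-world consequences, reinforcing stereotypes and exacerbating inequality. Existing work mainly focuses on guiding content generation to reduce bias, but often overlooks the mechanisms inside diffusion models that causally drive biased outputs. The key contribution is an investigation of the internal processes of diffusion models that identifies specific decision-making mechanisms embedded in the model architecture, termed bias features. By directly manipulating these features, the method precisely isolates and adjusts the elements responsible for biased generation, permitting granular control over the bias level of generated content. Experiments show that it can manage the generation distribution while preserving image quality, and reveal how different intrinsic features control fine-grained aspects of generation, supporting further research on the mechanistic interpretability of diffusion models.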

Link: https://arxiv.org/abs/2503.20483
Authors: Yingdong Shi,Changming Li,Yifan Wang,Yongxiang Zhao,Anqi Pang,Sibei Yang,Jingyi Yu,Kan Ren
Affiliations: ShanghaiTech University; Stony Brook University; Tencent PCG
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: CVPR 2025; Project Page: this https URL

Abstract:Diffusion models have demonstrated impressive capabilities in synthesizing diverse content. However, despite their high-quality outputs, these models often perpetuate social biases, including those related to gender and race. These biases can potentially contribute to harmful real-world consequences, reinforcing stereotypes and exacerbating inequalities in various social contexts. While existing research on diffusion bias mitigation has predominantly focused on guiding content generation, it often neglects the intrinsic mechanisms within diffusion models that causally drive biased outputs. In this paper, we investigate the internal processes of diffusion models, identifying specific decision-making mechanisms, termed bias features, embedded within the model architecture. By directly manipulating these features, our method precisely isolates and adjusts the elements responsible for bias generation, permitting granular control over the bias levels in the generated content. Through experiments on both unconditional and conditional diffusion models across various social bias attributes, we demonstrate our method’s efficacy in managing generation distribution while preserving image quality. We also dissect the discovered model mechanism, revealing different intrinsic features controlling fine-grained aspects of generation, boosting further research on mechanistic interpretability of diffusion models.

[CV-40] From Trial to Triumph: Advancing Long Video Understanding via Visual Context Sample Scaling and Self-reward Alignment

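[Quick read]: This paper addresses the challenge multi-modal large language models (MLLMs) face in understanding long videos: because a model can process only a limited number of frames per inference, crucial visual information may be lost. The key solution is to generate multiple predictions through visual context sampling and then select the final prediction with a scoring mechanism. Specifically, a bin-wise sampling strategy builds on different combinations of keyframes to enrich the visual context and produce diverse answers. A self-reward then linearly combines a frequency score, a marginal confidence score, and a typed-reasoning score, which respectively provide robustness through majority correctness, reflect prediction certainty, and apply tailored strategies to cases with sparse key visual information, improving both the coverage and accuracy of answers to long-video questions.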

Link: https://arxiv.org/abs/2503.20472
Authors: Yucheng Suo,Fan Ma,Linchao Zhu,Tianyi Wang,Fengyun Rao,Yi Yang
Affiliations: ReLER, CCAI, Zhejiang University; Wechat Vision, Tencent Inc.
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Multi-modal Large Language Models (MLLMs) show remarkable ability in video understanding. Nevertheless, understanding long videos remains challenging as the models can only process a finite number of frames in a single inference, potentially omitting crucial visual information. To address the challenge, we propose generating multiple predictions through visual context sampling, followed by a scoring mechanism to select the final prediction. Specifically, we devise a bin-wise sampling strategy that enables MLLMs to generate diverse answers based on various combinations of keyframes, thereby enriching the visual context. To determine the final prediction from the sampled answers, we employ a self-reward by linearly combining three scores: (1) a frequency score indicating the prevalence of each option, (2) a marginal confidence score reflecting the inter-intra sample certainty of MLLM predictions, and (3) a reasoning score for different question types, including clue-guided answering for global questions and temporal self-refocusing for local questions. The frequency score ensures robustness through majority correctness, the confidence-aligned score reflects prediction certainty, and the typed-reasoning score addresses cases with sparse key visual information using tailored strategies. Experiments show that this approach covers the correct answer for a high percentage of long video questions, and results on seven datasets show that our method improves the performance of three MLLMs.
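
The selection step described in this abstract, a linear combination of three scores over sampled answers, can be sketched in a few lines. Everything numeric here is hypothetical: the abstract does not state the weights or how the confidence and reasoning scores are computed, so they are passed in as plain dictionaries and `select_answer` is a name invented for this example.

```python
from collections import Counter


def select_answer(sampled_answers, confidence, reasoning, weights=(1.0, 1.0, 1.0)):
    """Pick the option with the highest linear combination of three scores."""
    w_freq, w_conf, w_reason = weights
    counts = Counter(sampled_answers)
    total = len(sampled_answers)
    rewards = {}
    for option, count in counts.items():
        frequency = count / total  # share of samples that chose this option
        rewards[option] = (
            w_freq * frequency
            + w_conf * confidence.get(option, 0.0)
            + w_reason * reasoning.get(option, 0.0)
        )
    best = max(rewards, key=rewards.get)
    return best, rewards


# Hypothetical answers from six sampled visual contexts, plus assumed scores.
answers = ["A", "B", "A", "C", "B", "B"]
confidence = {"A": 0.4, "B": 0.5, "C": 0.1}
reasoning = {"A": 0.9, "B": 0.2, "C": 0.0}
best, rewards = select_answer(answers, confidence, reasoning)
print(best, rewards)
```

With these made-up numbers the majority answer is not the one selected, which illustrates why the frequency score alone may not be enough when the key visual evidence appears in only a few sampled contexts.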

[CV-41] Lipschitz Constant Meets Condition Number: Learning Robust and Compact Deep Neural Networks

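[Quick read]: This paper addresses the loss of accuracy and adversarial robustness that Deep Neural Networks (DNNs) suffer under high compression, such as massive pruning of weight matrices. Prior work shows that highly pruned weight matrices tend to be ill-conditioned, with a large condition number, which increases the model's sensitivity to input noise and limits its adversarial robustness. Although high sparsity helps lower the upper bound of the local Lipschitz constant and so tolerate larger perturbations, the ill-conditionedness of the weight matrix leaves the model fragile and non-robust.

To overcome this, the paper proposes a novel joint constraint, the Transformed Sparse Constraint joint with Condition Number Constraint (TSCNC). It smooths the weight distribution and introduces differentiable constraint functions to reduce the condition number and so avoid ill-conditioned weight matrices. Theoretical analysis further relates the condition number to the local Lipschitz constant of the weight matrix, showing that a sharply increasing condition number becomes the dominant factor limiting the robustness of over-sparsified models. Experiments show that the proposed constraints significantly improve the robustness of DNNs at high pruning rates.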

Link: https://arxiv.org/abs/2503.20454
Authors: Yangqi Feng,Shing-Ho J. Lin,Baoyuan Gao,Xian Wei
Affiliations: Software Engineering Institute, East China Normal University; School of Artificial Intelligence, University of Chinese Academy of Sciences; Tianjin University
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: 13 pages, 6 figures

Abstract:Recent research has revealed that high compression of Deep Neural Networks (DNNs), e.g., massive pruning of the weight matrix of a DNN, leads to a severe drop in accuracy and susceptibility to adversarial attacks. Integration of network pruning into an adversarial training framework has been proposed to promote adversarial robustness. It has been observed that a highly pruned weight matrix tends to be ill-conditioned, i.e., increasing the condition number of the weight matrix. This phenomenon aggravates the vulnerability of a DNN to input noise. Although a highly pruned weight matrix is considered to be able to lower the upper bound of the local Lipschitz constant to tolerate large distortion, the ill-conditionedness of such a weight matrix results in a non-robust DNN model. To overcome this challenge, this work develops novel joint constraints to adjust the weight distribution of networks, namely, the Transformed Sparse Constraint joint with Condition Number Constraint (TSCNC), which copes with smoothing distribution and differentiable constraint functions to reduce condition number and thus avoid the ill-conditionedness of weight matrices. Furthermore, our theoretical analyses unveil the relevance between the condition number and the local Lipschitz constant of the weight matrix, namely, the sharply increasing condition number becomes the dominant factor that restricts the robustness of over-sparsified models. Extensive experiments are conducted on several public datasets, and the results show that the proposed constraints significantly improve the robustness of a DNN with high pruning rates.
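
The tension this abstract describes between a lower Lipschitz bound and a higher condition number can be seen on a 2×2 matrix. The sketch below is a hand-picked toy example, not the paper's TSCNC constraint: the matrix `dense` and the pruning threshold are assumptions chosen so the effect is visible. It relies on two standard facts: the 2-norm Lipschitz constant of the linear map x → Wx is the largest singular value, and the condition number is the ratio of the largest to the smallest singular value.

```python
import math


def singular_values_2x2(m):
    """Closed-form singular values (largest first) of a 2x2 matrix."""
    (a, b), (c, d) = m
    s = a * a + b * b + c * c + d * d  # trace of M^T M
    det = a * d - b * c
    root = math.sqrt(max(0.0, s * s - 4.0 * det * det))
    return math.sqrt((s + root) / 2.0), math.sqrt(max(0.0, (s - root) / 2.0))


def condition_number_2x2(m):
    """Ratio of largest to smallest singular value (infinite if singular)."""
    sigma_max, sigma_min = singular_values_2x2(m)
    return math.inf if sigma_min == 0.0 else sigma_max / sigma_min


def magnitude_prune(m, threshold):
    """Zero every entry whose absolute value is below threshold."""
    return [[x if abs(x) >= threshold else 0.0 for x in row] for row in m]


dense = [[1.0, 0.3], [0.3, 0.2]]  # hand-picked toy weights (assumption)
pruned = magnitude_prune(dense, threshold=0.25)

for name, w in (("dense", dense), ("pruned", pruned)):
    sigma_max, _ = singular_values_2x2(w)
    print(f"{name}: sigma_max={sigma_max:.4f} cond={condition_number_2x2(w):.3f}")
```

In this example pruning lowers the largest singular value slightly while raising the condition number, matching the qualitative picture in the abstract. Other matrices can behave differently, so this shows that the effect is possible, not that it always occurs.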

[CV-42] Siformer: Feature-isolated Transformer for Efficient Skeleton-based Sign Language Recognition

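[Quick read]: This paper addresses three main limitations of skeleton-based methods for sign language recognition (SLR): 1) they neglect the importance of realistic hand poses, with most studies training models on non-realistic skeletal representations; 2) they assume complete data in both training and inference, and capture the intricate relationships among different body parts collectively; 3) they treat all sign glosses uniformly, ignoring differences in the complexity of their skeletal representations. The paper proposes three key innovations. First, a kinematic hand pose rectification method enforces constraints to make hand skeletal representations more realistic. Second, a feature-isolated mechanism mitigates the impact of missing data while capturing local spatial-temporal context independently, which improves robustness. Third, an input-adaptive inference approach optimizes computational efficiency and accuracy for sign glosses of varying complexity. Experiments show that these methods significantly improve SLR performance, reaching a new state of the art on WLASL100 and LSA64.

Link: https://arxiv.org/abs/2503.20436
Authors: Muxin Pu,Mei Kuan Lim,Chun Yong Chong
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages, ACM Multimedia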

Abstract:Sign language recognition (SLR) refers to interpreting sign language glosses from given videos automatically. This research area presents a complex challenge in computer vision because of the rapid and intricate movements inherent in sign languages, which encompass hand gestures, body postures, and even facial expressions. Recently, skeleton-based action recognition has attracted increasing attention due to its ability to handle variations in subjects and backgrounds independently. However, current skeleton-based SLR methods exhibit three limitations: 1) they often neglect the importance of realistic hand poses, where most studies train SLR models on non-realistic skeletal representations; 2) they tend to assume complete data availability in both training and inference phases, and capture intricate relationships among different body parts collectively; 3) these methods treat all sign glosses uniformly, failing to account for differences in complexity levels regarding skeletal representations. To enhance the realism of hand skeletal representations, we present a kinematic hand pose rectification method for enforcing constraints. To mitigate the impact of missing data, we propose a feature-isolated mechanism to focus on capturing local spatial-temporal context. This method captures the context concurrently and independently from individual features, thus enhancing the robustness of the SLR model. Additionally, to adapt to varying complexity levels of sign glosses, we develop an input-adaptive inference approach to optimise computational efficiency and accuracy. Experimental results demonstrate the effectiveness of our approach, as evidenced by achieving a new state-of-the-art (SOTA) performance on WLASL100 and LSA64. For WLASL100, we achieve a top-1 accuracy of 86.50%, marking a relative improvement of 2.39% over the previous SOTA. For LSA64, we achieve a top-1 accuracy of 99.84%.

[CV-43] Latent Beam Diffusion Models for Decoding Image Sequences

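[Quick read]: This paper addresses the visual-consistency challenge that diffusion models face when generating high-quality image sequences, especially in non-linear storytelling, where coherence between frames is hard to guarantee. Existing methods usually generate each image independently, which yields disjointed narratives. The paper proposes a novel beam search strategy in latent space that conditionally generates full image sequences with more coherent visual transitions. The key is to search dynamically for an optimal sequence of latent representations rather than rely on fixed latent priors. To overcome the quadratic complexity inherent in beam search, the paper introduces a cross-attention mechanism that efficiently scores search paths and enables pruning, prioritizing alignment with both the text prompts and the visual context. Experiments show that the method clearly outperforms baselines in coherence, visual continuity, and textual alignment.

Link: https://arxiv.org/abs/2503.20429
Authors: Guilherme Fernandes,Vasco Ramos,Regev Cohen,Idan Szpektor,João Magalhães
Affiliations: NOVA LINCS, NOVA School of Science and Technology, Portugal; Google Research
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: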

Abstract:While diffusion models excel at generating high-quality images from text prompts, they struggle with visual consistency in image sequences. Existing methods generate each image independently, leading to disjointed narratives - a challenge further exacerbated in non-linear storytelling, where scenes must connect beyond adjacent frames. We introduce a novel beam search strategy for latent space exploration, enabling conditional generation of full image sequences with beam search decoding. Unlike prior approaches that use fixed latent priors, our method dynamically searches for an optimal sequence of latent representations, ensuring coherent visual transitions. To address beam search’s quadratic complexity, we integrate a cross-attention mechanism that efficiently scores search paths and enables pruning, prioritizing alignment with both textual prompts and visual context. Human evaluations confirm that our approach outperforms baseline methods, producing full sequences with superior coherence, visual continuity, and textual alignment. By bridging advances in search optimization and latent space refinement, this work sets a new standard for structured image sequence generation.
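The search-and-prune idea can be sketched with a generic beam search. This is a minimal illustration, not the paper's implementation: candidates are plain numbers standing in for latents, and `smoothness_score` is a hypothetical stand-in for the cross-attention scorer, rewarding candidates that stay close to the previous element of the sequence.

```python
def beam_search(candidates_per_step, score_fn, beam_width):
    """Keep the `beam_width` best partial sequences after every step."""
    beams = [((), 0.0)]  # (sequence so far, accumulated score)
    for candidates in candidates_per_step:
        expanded = []
        for seq, total in beams:
            for cand in candidates:
                expanded.append((seq + (cand,), total + score_fn(seq, cand)))
        # Pruning step: sort by score and drop everything outside the beam.
        expanded.sort(key=lambda item: item[1], reverse=True)
        beams = expanded[:beam_width]
    return beams


def smoothness_score(seq, cand):
    """Hypothetical scorer: penalize large jumps from the previous element."""
    if not seq:
        return 0.0
    return -abs(cand - seq[-1])


# Three steps with two toy candidates each.
steps = [[0.0, 5.0], [1.0, 9.0], [2.0, 7.0]]
best_seq, best_score = beam_search(steps, smoothness_score, beam_width=2)[0]
print(best_seq, best_score)
```

With a beam width of 2, only two partial sequences survive each step, so the number of scored extensions per step stays bounded instead of growing with every possible path.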

[CV-44] Evaluating Facial Expression Recognition Datasets for Deep Learning: A Benchmark Study with Novel Similarity Metrics

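[Quick read]: This paper addresses the core problem that the performance of facial expression recognition (FER) systems depends heavily on the quality and diversity of the underlying datasets. The study builds a comprehensive analysis framework covering 24 widely used FER datasets, processes them through a standardized pipeline, and adds automatic annotations to better assess their demographic properties. It also proposes three new quantitative metrics, Local, Global, and Paired Similarity, to measure dataset difficulty, generalization capability, and cross-dataset transferability. Experiments show that large-scale, automatically collected datasets (e.g., AffectNet, FER2013) usually generalize better despite label noise and demographic bias, whereas well-controlled small-scale datasets offer higher annotation quality but limited variability. The study therefore offers practical recommendations for selecting and designing FER datasets, supporting the development of more robust, fair, and effective FER systems.

Link: https://arxiv.org/abs/2503.20428
Authors: F. Xavier Gaya-Morey,Cristina Manresa-Yee,Célia Martinie,Jose M. Buades-Rubio
Affiliations: Computer Graphics and Vision and AI Group (UGIVIA), Universitat de les Illes Balears; ICS-IRIT, University Toulouse 3, Paul Sabatier
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: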

Abstract:This study investigates the key characteristics and suitability of widely used Facial Expression Recognition (FER) datasets for training deep learning models. In the field of affective computing, FER is essential for interpreting human emotions, yet the performance of FER systems is highly contingent on the quality and diversity of the underlying datasets. To address this issue, we compiled and analyzed 24 FER datasets, including those targeting specific age groups such as children, adults, and the elderly, and processed them through a comprehensive normalization pipeline. In addition, we enriched the datasets with automatic annotations for age and gender, enabling a more nuanced evaluation of their demographic properties. To further assess dataset efficacy, we introduce three novel metrics: Local, Global, and Paired Similarity, which quantitatively measure dataset difficulty, generalization capability, and cross-dataset transferability. Benchmark experiments using state-of-the-art neural networks reveal that large-scale, automatically collected datasets (e.g., AffectNet, FER2013) tend to generalize better, despite issues with labeling noise and demographic biases, whereas controlled datasets offer higher annotation quality but limited variability. Our findings provide actionable recommendations for dataset selection and design, advancing the development of more robust, fair, and effective FER systems.

[CV-45] Cherry Yield Forecast: Harvest Prediction for Individual Sweet Cherry Trees

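[Quick read]: This paper tackles the difficulty of making reliable early-season yield predictions for sweet cherry trees, a key challenge in fruit farming. The approach is to collect data at different growth stages of sweet cherries, from dormancy to harvest, and to evaluate how well each stage supports yield prediction based on linear regression. The study finds that manual counting at the opening cluster stage and the early fruit stage yields accurate predictions. Automated feature extraction from image data remains difficult, however: small fruits are easily occluded by leaves, and a state-of-the-art fruit-counting method did not give satisfactory results. The paper concludes that manual counting enables accurate yield prediction, while automated feature extraction with comparable accuracy remains an open problem.

Link: https://arxiv.org/abs/2503.20419
Authors: Andreas Gilson,Peter Pietrzyk,Chiara Paglia,Annika Killer,Fabian Keil,Lukas Meyer,Dominikus Kittemann,Patrick Noack,Oliver Scholz
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: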

Abstract:This paper is part of a publication series from the For5G project that has the goal of creating digital twins of sweet cherry trees. At the beginning, a brief overview of the previous work in this project is provided. Afterwards, the focus shifts to a crucial problem in the fruit farming domain: the difficulty of making reliable yield predictions early in the season. Following three Satin sweet cherry trees throughout the year 2023 enabled the collection of accurate ground truth data about the development of cherries from dormancy until harvest. The methodology used to collect this data is presented, along with its evaluation and visualization. The predictive power of counting objects at all relevant vegetative stages of the fruit development cycle in cherry trees with regard to yield predictions is investigated. It is found that all investigated fruit states are suitable for yield predictions based on linear regression. Conceptually, there is a trade-off between earliness and external events with the potential to invalidate the prediction. Considering this, two optimal timepoints are suggested: the opening cluster stage before the start of flowering and the early fruit stage right after the second fruit drop. However, both timepoints are challenging to handle with automated procedures based on image data. Counting developing cherries based on images is exceptionally difficult due to the small fruit size and their tendency to be occluded by leaves. It was not possible to obtain satisfying results relying on a state-of-the-art fruit-counting method. Counting the elements within a bursting bud is also challenging, even when using high-resolution cameras. It is concluded that accurate yield prediction for sweet cherry trees is possible when objects are manually counted and that automated feature extraction with similar accuracy remains an open problem.
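As a rough illustration of count-based yield prediction with linear regression, the sketch below fits an ordinary least-squares line from object counts to harvest yield. The numbers are made up for the example and are not taken from the paper; `fit_line` and `predict` are hypothetical helper names.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(slope, intercept, x):
    """Predict the yield for a new count."""
    return slope * x + intercept


# Hypothetical data: manually counted objects per tree and yield in kg.
counts = [100.0, 200.0, 300.0]
yields_kg = [1.5, 2.5, 3.5]

slope, intercept = fit_line(counts, yields_kg)
print(predict(slope, intercept, 250.0))
```

The same fit could be repeated once per growth stage, which is one way to compare how early in the season the counts become predictive.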

[CV-46] ITA-MDT: Image-Timestep-Adaptive Masked Diffusion Transformer Framework for Image-Based Virtual Try-On CVPR2025

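[Quick read]: This paper addresses the limitations of existing methods for image-based virtual try-on (IVTON) in handling both global garment context and fine-grained details. It proposes ITA-MDT (Image-Timestep-Adaptive Masked Diffusion Transformer Framework), which uses a Masked Diffusion Transformer (MDT) to model global and detailed information more effectively. The key innovation is the Image-Timestep Adaptive Feature Aggregator (ITAFA), which dynamically combines all features from the image encoder, guided by the diffusion timestep and the complexity of the garment image, so that features are weighted adaptively and the model can emphasize global information or local detail as each denoising stage requires. The paper also proposes a Salient Region Extractor (SRE) module that extracts complex garment regions and feeds them to the denoising model as an additional condition, providing high-resolution local information while avoiding unnecessary computation over the whole image. These designs improve efficiency while keeping performance competitive, reaching state-of-the-art results on several metrics.

Link: https://arxiv.org/abs/2503.20418
Authors: Ji Woo Hong,Tri Ton,Trung X. Pham,Gwanhyeong Koo,Sunjae Yoon,Chang D. Yoo
Affiliations: Korea Advanced Institute of Science and Technology (KAIST), South Korea
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: CVPR 2025, Project Page: this https URL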

Abstract:This paper introduces ITA-MDT, the Image-Timestep-Adaptive Masked Diffusion Transformer Framework for Image-Based Virtual Try-On (IVTON), designed to overcome the limitations of previous approaches by leveraging the Masked Diffusion Transformer (MDT) for improved handling of both global garment context and fine-grained details. The IVTON task involves seamlessly superimposing a garment from one image onto a person in another, creating a realistic depiction of the person wearing the specified garment. Unlike conventional diffusion-based virtual try-on models that depend on large pre-trained U-Net architectures, ITA-MDT leverages a lightweight, scalable transformer-based denoising diffusion model with a mask latent modeling scheme, achieving competitive results while reducing computational overhead. A key component of ITA-MDT is the Image-Timestep Adaptive Feature Aggregator (ITAFA), a dynamic feature aggregator that combines all of the features from the image encoder into a unified feature of the same size, guided by diffusion timestep and garment image complexity. This enables adaptive weighting of features, allowing the model to emphasize either global information or fine-grained details based on the requirements of the denoising stage. Additionally, the Salient Region Extractor (SRE) module is presented to identify complex regions of the garment and provide high-resolution local information to the denoising model as an additional condition alongside the global information of the full garment image. This targeted conditioning strategy enhances the preservation of fine details in highly salient garment regions, optimizing computational resources by avoiding unnecessary processing of the entire garment image. Comparative evaluations confirm that ITA-MDT improves efficiency while maintaining strong performance, reaching state-of-the-art results in several metrics.

[CV-47] RSRWKV: A Linear-Complexity 2D Attention Mechanism for Efficient Remote Sensing Vision Task

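[Quick read]: This paper addresses the challenge of global context modeling in high-resolution remote sensing analysis, which stems from scene complexity and scale diversity. Convolutional neural networks (CNNs) excel at local feature extraction, but their fixed receptive fields restrict long-range dependency modeling. Vision Transformers (ViTs) capture global semantic relationships through self-attention, but their computational complexity grows quadratically with image resolution, creating an efficiency bottleneck for high-resolution imagery. The RWKV model achieved breakthroughs in linear-complexity sequence modeling for natural language processing (NLP), but its 1D scanning mechanism leads to anisotropic limitations in vision tasks. The paper proposes RSRWKV, whose key component is a novel 2D-WKV scanning mechanism that combines sequential processing with 2D spatial reasoning while keeping linear complexity, enabling isotropic context aggregation. In addition, the MVC-Shift module improves multi-scale receptive field coverage, and the ECA module strengthens cross-channel feature interaction and semantic saliency modeling. Experiments show that RSRWKV outperforms CNN and Transformer baselines on classification, detection, and segmentation tasks on the NWPU RESISC45, VHR-10.v2, and GLH-Water datasets, offering a scalable solution for high-resolution remote sensing analysis.

Link: https://arxiv.org/abs/2503.20382
Authors: Chunshan Li,Rong Wang,Xiaofei Yang,Dianhui Chu
Affiliations: Harbin Institute of Technology; School of Electronic and Communication Engineering, Guangzhou University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: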

Abstract:High-resolution remote sensing analysis faces challenges in global context modeling due to scene complexity and scale diversity. While CNNs excel at local feature extraction via parameter sharing, their fixed receptive fields fundamentally restrict long-range dependency modeling. Vision Transformers (ViTs) effectively capture global semantic relationships through self-attention mechanisms but suffer from quadratic computational complexity relative to image resolution, creating critical efficiency bottlenecks for high-resolution imagery. The RWKV model’s linear-complexity sequence modeling achieves breakthroughs in NLP but exhibits anisotropic limitations in vision tasks due to its 1D scanning mechanism. To address these challenges, we propose RSRWKV, featuring a novel 2D-WKV scanning mechanism that bridges sequential processing and 2D spatial reasoning while maintaining linear complexity. This enables isotropic context aggregation across multiple directions. The MVC-Shift module enhances multi-scale receptive field coverage, while the ECA module strengthens cross-channel feature interaction and semantic saliency modeling. Experimental results demonstrate RSRWKV’s superior performance over CNN and Transformer baselines in classification, detection, and segmentation tasks on NWPU RESISC45, VHR-10.v2, and GLH-Water datasets, offering a scalable solution for high-resolution remote sensing analysis.

[CV-48] Pluggable Style Representation Learning for Multi-Style Transfer

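[Quick read]: This paper addresses the scalability challenge that the diversity of image styles poses for image style transfer in real-world applications. Previous multi-style transfer methods rely on enlarging the model, while arbitrary-style transfer methods use heavy backbones; the additional parameters add computational cost and hinder deployment on resource-limited devices. The key idea is to build a style transfer framework that decouples style modeling from style transferring: a style representation learning scheme first encodes style information into a compact representation, and a style-aware multi-style transfer network (SaMST) then adapts to diverse styles using pluggable style representations. The framework can thus accommodate many image styles in the learned representations without adding overhead at inference, which preserves efficiency. Experiments show that the method achieves state-of-the-art performance in both accuracy and efficiency.

Link: https://arxiv.org/abs/2503.20368
Authors: Hongda Liu,Longguang Wang,Weijun Guan,Ye Zhang,Yulan Guo
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 18 pages, 13 figures, 2 tables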

Abstract:Due to the high diversity of image styles, the scalability to various styles plays a critical role in real-world applications. To accommodate a large number of styles, previous multi-style transfer approaches rely on enlarging the model size while arbitrary-style transfer methods utilize heavy backbones. However, the additional computational cost introduced by more model parameters hinders the deployment of these methods on resource-limited devices. To address this challenge, in this paper, we develop a style transfer framework by decoupling the style modeling and transferring. Specifically, for style modeling, we propose a style representation learning scheme to encode the style information into a compact representation. Then, for style transferring, we develop a style-aware multi-style transfer network (SaMST) to adapt to diverse styles using pluggable style representations. In this way, our framework is able to accommodate diverse image styles in the learned style representations without introducing additional overhead during inference, thereby maintaining efficiency. Experiments show that our style representation can extract accurate style information. Moreover, qualitative and quantitative results demonstrate that our method achieves state-of-the-art performance in terms of both accuracy and efficiency. The code is available at this https URL.

[CV-49] Self-ReS: Self-Reflection in Large Vision-Language Models for Long Video Understanding

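[Quick read]: This paper addresses the performance bottleneck of large vision-language models (LVLMs) in long-video understanding. The conventional linear frame sampling strategy cannot cope with the non-linear distribution of key events in video data: it introduces redundant or irrelevant information in longer contexts and risks omitting critical events in shorter ones. The paper proposes SelfReS, a non-linear spatiotemporal self-reflective sampling method. Its key idea is to use the inherently sparse attention maps of LVLMs to define reflection tokens, enabling token selection relevant to the user prompt without additional training or external modules. Experiments show that SelfReS integrates seamlessly into strong base LVLMs, improves accuracy on long-video tasks, and achieves up to 46% faster inference within the same GPU memory budget.

Link: https://arxiv.org/abs/2503.20362
Authors: Joao Pereira,Vasco Lopes,David Semedo,Joao Neves
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: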

Abstract:Large Vision-Language Models (LVLMs) demonstrate remarkable performance in short-video tasks such as video question answering, but struggle in long-video understanding. The linear frame sampling strategy, conventionally used by LVLMs, fails to account for the non-linear distribution of key events in video data, often introducing redundant or irrelevant information in longer contexts while risking the omission of critical events in shorter ones. To address this, we propose SelfReS, a non-linear spatiotemporal self-reflective sampling method that dynamically selects key video fragments based on user prompts. Unlike prior approaches, SelfReS leverages the inherently sparse attention maps of LVLMs to define reflection tokens, enabling relevance-aware token selection without requiring additional training or external modules. Experiments demonstrate that SelfReS can be seamlessly integrated into strong base LVLMs, improving long-video task accuracy and achieving up to 46% faster inference speed within the same GPU memory budget.
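Relevance-aware token selection from attention maps can be pictured with a simple top-k filter. This is only a sketch under assumptions, not the SelfReS algorithm: the abstract does not specify how reflection tokens are defined, so the code simply sums the attention that each visual token receives from the prompt tokens and keeps the highest-scoring ones. The attention values and the name `select_tokens` are hypothetical.

```python
def select_tokens(attention_rows, keep):
    """Return indices of the `keep` visual tokens with the most attention.

    attention_rows[i][j] is the attention weight from prompt token i
    to visual token j (assumed layout, for illustration only).
    """
    num_tokens = len(attention_rows[0])
    relevance = [
        sum(row[j] for row in attention_rows) for j in range(num_tokens)
    ]
    ranked = sorted(range(num_tokens), key=lambda j: relevance[j], reverse=True)
    # Restore temporal order so the kept tokens stay in sequence.
    return sorted(ranked[:keep])


# Two prompt tokens attending to four visual tokens (made-up weights).
attention = [
    [0.05, 0.60, 0.05, 0.30],
    [0.10, 0.50, 0.10, 0.30],
]
kept = select_tokens(attention, keep=2)
print(kept)
```

Because the scores come from attention the model already computes, a filter of this kind needs no extra training, which matches the training-free property claimed in the abstract.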

[CV-50] SURGEON: Memory-Adaptive Fully Test-Time Adaptation via Dynamic Activation Sparsity CVPR2025

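[Quick read]: This paper addresses the significant accuracy drop that deep models suffer from various interferences when deployed on mobile terminals, and in particular the difficulty of deploying test-time adaptation (TTA) methods on resource-constrained devices because of their significant memory consumption. The key solution is SURGEON, which introduces a novel dynamic activation sparsity strategy that substantially reduces memory cost while preserving comparable accuracy improvements during fully test-time adaptation (FTTA), without relying on specific network architectures or modifying the original training procedure. The strategy considers two metrics, gradient importance and layer activation memory, to determine each layer's pruning ratio dynamically, giving flexible control over learning ability and memory cost. Experiments show that SURGEON not only reduces memory usage but also achieves higher accuracy, reaching state-of-the-art performance across diverse datasets, architectures, and tasks.

Link: https://arxiv.org/abs/2503.20354
Authors: Ke Ma,Jiaqi Tang,Bin Guo,Fan Dang,Sicong Liu,Zhui Zhu,Lei Wu,Cheng Fang,Ying-Cong Chen,Zhiwen Yu,Yunhao Liu
Affiliations: Northwestern Polytechnical University; Tsinghua University; The Hong Kong University of Science and Technology; Beijing Jiaotong University; Harbin Engineering University
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Accepted to CVPR 2025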

Abstract:Despite the growing integration of deep models into mobile terminals, the accuracy of these models declines significantly due to various deployment interferences. Test-time adaptation (TTA) has emerged to improve the performance of deep models by adapting them to unlabeled target data online. Yet, the significant memory cost, particularly in resource-constrained terminals, impedes the effective deployment of most backward-propagation-based TTA methods. To tackle memory constraints, we introduce SURGEON, a method that substantially reduces memory cost while preserving comparable accuracy improvements during fully test-time adaptation (FTTA) without relying on specific network architectures or modifications to the original training procedure. Specifically, we propose a novel dynamic activation sparsity strategy that directly prunes activations at layer-specific dynamic ratios during adaptation, allowing for flexible control of learning ability and memory cost in a data-sensitive manner. Two metrics, Gradient Importance and Layer Activation Memory, are used to determine the layer-wise pruning ratios, reflecting accuracy contribution and memory efficiency, respectively. Experimentally, our method surpasses the baselines by not only reducing memory usage but also achieving superior accuracy, delivering SOTA performance across diverse datasets, architectures, and tasks.
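Pruning activations at layer-specific ratios can be sketched as follows. This is an illustration under assumptions rather than SURGEON itself: the ratios are given as fixed hypothetical inputs (the paper derives them from Gradient Importance and Layer Activation Memory, whose formulas are not stated here), and the choice to drop the smallest-magnitude activations is an assumed criterion.

```python
def prune_activations(activations, prune_ratio):
    """Zero out the smallest-magnitude fraction `prune_ratio` of activations."""
    num_pruned = int(len(activations) * prune_ratio)
    order = sorted(range(len(activations)), key=lambda i: abs(activations[i]))
    dropped = set(order[:num_pruned])
    return [0.0 if i in dropped else a for i, a in enumerate(activations)]


# Hypothetical cached activations and per-layer pruning ratios.
layer_activations = {
    "layer1": [0.9, -0.1, 0.05, -0.7],
    "layer2": [0.2, -0.3, 0.8, 0.01],
}
ratios = {"layer1": 0.5, "layer2": 0.25}

pruned = {
    name: prune_activations(acts, ratios[name])
    for name, acts in layer_activations.items()
}
print(pruned)
```

Zeroed activations can be stored sparsely, so a higher ratio on a memory-heavy layer saves more of the memory that back-propagation would otherwise need for cached activations.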

[CV-51] Consistency Trajectory Matching for One-Step Generative Super-Resolution

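[Quick read]: This paper addresses the trade-off between inference efficiency and training cost in diffusion-based super-resolution (SR). Existing methods use distillation to compress a multi-step teacher model into a one-step student model to speed up inference, but this significantly raises training cost and caps the student's performance at that of the teacher. The paper proposes a distillation-free strategy, Consistency Trajectory Matching for Super-Resolution (CTMSR). Its key is to formulate a Probability Flow ODE (PF-ODE) trajectory and directly learn a deterministic mapping from noisy low-resolution (LR) images to high-resolution (HR) images, without relying on a pre-trained diffusion model. A Distribution Trajectory Matching (DTM) loss further aligns the distribution of the generated results with that of natural images, improving the realism of the reconstructed images. Experiments show that the method matches or surpasses existing methods on both synthetic and real datasets while keeping inference latency minimal.

Link: https://arxiv.org/abs/2503.20349
Authors: Weiyi You,Mingyang Zhang,Leheng Zhang,Kexuan Shi,Xingyu Zhou,Shuhang Gu
Affiliations: University of Electronic Science and Technology of China
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: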

Abstract:Current diffusion-based super-resolution (SR) approaches achieve commendable performance at the cost of high inference overhead. Therefore, distillation techniques are utilized to accelerate the multi-step teacher model into a one-step student model. Nevertheless, these methods significantly raise training costs and constrain the performance of the student model by that of the teacher model. To overcome these challenges, we propose Consistency Trajectory Matching for Super-Resolution (CTMSR), a distillation-free strategy that is able to generate photo-realistic SR results in one step. Concretely, we first formulate a Probability Flow Ordinary Differential Equation (PF-ODE) trajectory to establish a deterministic mapping from low-resolution (LR) images with noise to high-resolution (HR) images. Then we apply the Consistency Training (CT) strategy to directly learn the mapping in one step, eliminating the need for a pre-trained diffusion model. To further enhance the performance and better leverage the ground truth during the training process, we aim to align the distribution of SR results more closely with that of natural images. To this end, we propose to minimize the discrepancy between their respective PF-ODE trajectories from the LR image distribution with our carefully designed Distribution Trajectory Matching (DTM) loss, resulting in improved realism of our recovered HR images. Comprehensive experimental results demonstrate that the proposed method can attain comparable or even superior capabilities on both synthetic and real datasets while maintaining minimal inference latency.

[CV-52] Progressive Focused Transformer for Single Image Super-Resolution

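[Quick read]: This paper addresses two problems in Transformer-based image super-resolution: the high computational overhead of feature-intensive modeling, and the unnecessary similarity calculations that degrade reconstruction performance. The key solution is a novel and effective Progressive Focused Transformer (PFT), which links the isolated attention maps in the network through Progressive Focused Attention (PFA) to focus on the most important tokens. By filtering out irrelevant features before computing similarities, PFA lets the network capture more critical similar features and significantly reduces the computational cost of the overall network.

Link: https://arxiv.org/abs/2503.20337
Authors: Wei Long,Xingyu Zhou,Leheng Zhang,Shuhang Gu
Affiliations: University of Electronic Science and Technology of China
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: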

Abstract:Transformer-based methods have achieved remarkable results in image super-resolution tasks because they can capture non-local dependencies in low-quality input images. However, this feature-intensive modeling approach is computationally expensive because it calculates the similarities between numerous features that are irrelevant to the query features when obtaining attention weights. These unnecessary similarity calculations not only degrade the reconstruction performance but also introduce significant computational overhead. How to accurately identify the features that are important to the current query features and avoid similarity calculations between irrelevant features remains an urgent problem. To address this issue, we propose a novel and effective Progressive Focused Transformer (PFT) that links all isolated attention maps in the network through Progressive Focused Attention (PFA) to focus attention on the most important tokens. PFA not only enables the network to capture more critical similar features, but also significantly reduces the computational cost of the overall network by filtering out irrelevant features before calculating similarities. Extensive experiments demonstrate the effectiveness of the proposed method, achieving state-of-the-art performance on various single image super-resolution benchmarks.

[CV-53] Dynamic Pyramid Network for Efficient Multimodal Large Language Model

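[Quick read]: This paper addresses the expensive computation that limits multimodal large language models (MLLMs) in real-world applications. Recent work tries to cut the computational cost of MLLMs by compressing visual features, but direct visual compression methods (such as efficient projectors) inevitably destroy visual semantics, especially on difficult samples. The paper proposes a novel dynamic pyramid network (DPN), whose key idea is to formulate the MLLM as a hierarchical structure in which visual features are compressed gradually with increasing depth. With this design, fine-grained visual information can still be perceived in shallow layers even at a high compression ratio. To maximize the benefit of DPN, the paper further proposes Dynamic Pooling Experts (DPE), which dynamically choose the optimal visual compression rate according to the input features, so that harder samples receive more computation and model performance is preserved. Experiments show that DPN saves up to 56% of average FLOPs on LLaVA while improving performance by 0.74%, and its generalization ability is also validated on LLaVA-HR.

Link: https://arxiv.org/abs/2503.20322
Authors: Hao Ai,Kunyi Wang,Zezhou Wang,Hao Lu,Jin Tian,Yaxin Luo,Peng Xing,Jen-Yuan Huang,Huaxia Li,Gen luo
Affiliations: BeiHang University; Shanghai AI Laboratory; King Abdullah University of Science and Technology (KAUST); Hong Kong University of Science and Technology (Guangzhou); Technical University Of Denmark; Nanjing University of Science and Technology; Peking University; Xiaohongshu Inc
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: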

Abstract:Multimodal large language models (MLLMs) have demonstrated impressive performance in various vision-language (VL) tasks, but their expensive computations still limit the real-world application. To address this issue, recent efforts aim to compress the visual features to save the computational costs of MLLMs. However, direct visual compression methods, e.g. efficient projectors, inevitably destroy the visual semantics in MLLM, especially in difficult samples. To overcome this shortcoming, we propose a novel dynamic pyramid network (DPN) for efficient MLLMs. Specifically, DPN formulates MLLM as a hierarchical structure where visual features are gradually compressed with increasing depth. In this case, even with a high compression ratio, fine-grained visual information can still be perceived in shallow layers. To maximize the benefit of DPN, we further propose an innovative Dynamic Pooling Experts (DPE) that can dynamically choose the optimal visual compression rate according to input features. With this design, harder samples will be assigned larger computations, thus preserving the model performance. To validate our approach, we conduct extensive experiments on two popular MLLMs and ten benchmarks. Experimental results show that DPN can save up to 56% average FLOPs on LLaVA while further achieving +0.74% performance gains. Besides, the generalization ability of DPN is also validated on the existing high-resolution MLLM called LLaVA-HR. Our source codes are anonymously released at this https URL.
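Gradual compression of visual tokens with depth can be sketched with plain average pooling. This is a simplified picture, not the DPN or DPE implementation: the pooling rates are fixed hypothetical inputs, whereas the paper selects the rate dynamically per input, and the starting token count of 576 is only an example value.

```python
def average_pool_tokens(tokens, rate):
    """Merge every `rate` consecutive token vectors into their mean."""
    pooled = []
    for start in range(0, len(tokens), rate):
        group = tokens[start:start + rate]
        dim = len(group[0])
        pooled.append([sum(t[d] for t in group) / len(group) for d in range(dim)])
    return pooled


def token_counts(num_tokens, rates_per_stage):
    """Number of visual tokens before and after each pooling stage."""
    counts = [num_tokens]
    for rate in rates_per_stage:
        num_tokens = -(-num_tokens // rate)  # ceiling division
        counts.append(num_tokens)
    return counts


# Four toy 2-dimensional tokens pooled with rate 2.
tokens = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
pooled = average_pool_tokens(tokens, 2)

# Example schedule: halve the token count at two deeper stages.
schedule = token_counts(576, [2, 2])
print(pooled, schedule)
```

Shallow layers still see all tokens in such a schedule, while deeper layers process fewer, which is where the FLOPs savings come from.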

[CV-54] Recovering Dynamic 3D Sketches from Videos CVPR2025

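[Quick read]: This paper addresses the challenge of understanding 3D motion from videos: because object motion is so diverse (from rigid bodies to deformable structures), conventional methods struggle to abstract and represent it. The paper proposes Liv3Stroke, a new approach that abstracts objects in motion with deformable 3D strokes. Its key is to use a set of parametric 3D curves to capture spatially smooth motion elements of general objects, which works even when the object's structure is unknown. The method first extracts noisy 3D point cloud motion guidance from video frames using semantic features, then deforms a set of curves to abstract the essential motion features as explicit 3D representations, capturing the prominent components of motion while staying robust to environmental factors. This abstraction allows direct analysis of 3D object motion from video and tackles the uncertainty that typically arises when real-world motion is translated into recorded footage.

Link: https://arxiv.org/abs/2503.20321
Authors: Jaeah Lee,Changwoon Choi,Young Min Kim,Jaesik Park
Affiliations: Seoul National University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to CVPR 2025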

Abstract:Understanding 3D motion from videos presents inherent challenges due to the diverse types of movement, ranging from rigid and deformable objects to articulated structures. To overcome this, we propose Liv3Stroke, a novel approach for abstracting objects in motion with deformable 3D strokes. The detailed movements of an object may be represented by unstructured motion vectors or a set of motion primitives using a pre-defined articulation from a template model. Just as a free-hand sketch can intuitively visualize scenes or intentions with a sparse set of lines, we utilize a set of parametric 3D curves to capture a set of spatially smooth motion elements for general objects with unknown structures. We first extract noisy, 3D point cloud motion guidance from video frames using semantic features, and our approach deforms a set of curves to abstract essential motion features as a set of explicit 3D representations. Such abstraction enables an understanding of prominent components of motions while maintaining robustness to environmental factors. Our approach allows direct analysis of 3D object movements from video, tackling the uncertainty that typically occurs when translating real-world motion into recorded footage. The project page is accessible via: this https URL
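A parametric 3D curve of the kind mentioned above can be illustrated with a cubic Bézier segment. The abstract only says "parametric 3D curves", so the Bézier form is an assumption made for this sketch, and the control points are made up. Deforming a stroke then amounts to moving its control points over time.

```python
def cubic_bezier(p0, p1, p2, p3, t):
    """Evaluate a cubic Bezier curve with 3D control points at t in [0, 1]."""
    u = 1.0 - t
    weights = (u ** 3, 3.0 * u * u * t, 3.0 * u * t * t, t ** 3)
    return tuple(
        weights[0] * a + weights[1] * b + weights[2] * c + weights[3] * d
        for a, b, c, d in zip(p0, p1, p2, p3)
    )


# Hypothetical control points of one stroke.
p0, p1, p2, p3 = (0.0, 0.0, 0.0), (1.0, 2.0, 0.0), (2.0, 2.0, 0.0), (3.0, 0.0, 1.0)

start = cubic_bezier(p0, p1, p2, p3, 0.0)
middle = cubic_bezier(p0, p1, p2, p3, 0.5)
end = cubic_bezier(p0, p1, p2, p3, 1.0)
print(start, middle, end)
```

Four control points describe the whole segment, so a small set of such curves is a far sparser representation than a dense point cloud.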

[CV-55] EditCLIP: Representation Learning for Image Editing

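[Quick read]: This paper addresses representation learning for image editing and proposes a new method called EditCLIP. EditCLIP learns a unified representation of edits by jointly encoding an input image and its edited counterpart, capturing the transformation between them. The key to the solution is using EditCLIP embeddings in place of conventional text-based instructions, which lets EditCLIP excel at exemplar-based image editing: it outperforms state-of-the-art methods while being more efficient and versatile. For automated evaluation, EditCLIP assesses an edit by measuring the similarity between the EditCLIP embedding of a given image pair and either a textual editing instruction or the EditCLIP embedding of another reference image pair. Experiments show that EditCLIP aligns more closely with human judgments than existing CLIP-based metrics, providing a reliable measure of edit quality and structural preservation.

Link: https://arxiv.org/abs/2503.20318
Authors: Qian Wang,Aleksandar Cvejic,Abdelrahman Eldesokey,Peter Wonka
Affiliations: KAUST
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL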

Abstract:We introduce EditCLIP, a novel representation-learning approach for image editing. Our method learns a unified representation of edits by jointly encoding an input image and its edited counterpart, effectively capturing their transformation. To evaluate its effectiveness, we employ EditCLIP to solve two tasks: exemplar-based image editing and automated edit evaluation. In exemplar-based image editing, we replace text-based instructions in InstructPix2Pix with EditCLIP embeddings computed from a reference exemplar image pair. Experiments demonstrate that our approach outperforms state-of-the-art methods while being more efficient and versatile. For automated evaluation, EditCLIP assesses image edits by measuring the similarity between the EditCLIP embedding of a given image pair and either a textual editing instruction or the EditCLIP embedding of another reference image pair. Experiments show that EditCLIP aligns more closely with human judgments than existing CLIP-based metrics, providing a reliable measure of edit quality and structural preservation.
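The evaluation step compares embeddings by similarity. A minimal sketch follows, assuming cosine similarity (the usual choice for CLIP-style embeddings, though the abstract does not name the measure) and using short made-up vectors in place of real EditCLIP embeddings, which this example does not compute.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical embeddings: the edit under evaluation and a reference edit.
edit_embedding = [0.6, 0.8, 0.0]
reference_embedding = [0.6, 0.8, 0.0]
unrelated_embedding = [0.0, 0.0, 1.0]

score_match = cosine_similarity(edit_embedding, reference_embedding)
score_unrelated = cosine_similarity(edit_embedding, unrelated_embedding)
print(score_match, score_unrelated)
```

A higher score indicates that the evaluated edit is closer to the reference edit or instruction in embedding space.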

[CV-56] SpikeDerain: Unveiling Clear Videos from Rainy Sequences Using Color Spike Streams

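[Quick read]: This paper addresses the restoration of clear frames from rainy videos, where the main challenge comes from the rapid motion of rain streaks. Traditional frame-based visual sensors struggle to capture these fast-moving details accurately. Neuromorphic sensors offer microsecond temporal resolution and high dynamic range, but existing multimodal methods that fuse event streams with RGB images have difficulty handling the complex spatiotemporal interference of raindrops in real scenes, mainly because of hardware synchronization errors and computational redundancy. The paper proposes a Color Spike Stream Deraining Network (SpikeDerain), whose key capability is reconstructing spike streams of dynamic scenes and accurately removing rain streaks. To address the scarcity of data from real continuous rainfall scenes, it also designs a physically interpretable rain streak synthesis model that generates parameterized continuous rain patterns from arbitrary background images. Experiments show that the network trained on this synthetic data remains highly robust even under extreme rainfall. These findings highlight the method's effectiveness and robustness across rainfall levels and datasets, setting a new standard for video deraining. The code will be released soon.

Link: https://arxiv.org/abs/2503.20315
Authors: Hanwen Liang,Xian Zhong,Wenxuan Liu,Yajing Zheng,Wenxin Huang,Zhaofei Yu,Tiejun Huang
Affiliations: Wuhan University of Technology; Peking University; Hubei University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: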

Abstract:Restoring clear frames from rainy videos presents a significant challenge due to the rapid motion of rain streaks. Traditional frame-based visual sensors, which capture scene content synchronously, struggle to capture the fast-moving details of rain accurately. In recent years, neuromorphic sensors have introduced a new paradigm for dynamic scene perception, offering microsecond temporal resolution and high dynamic range. However, existing multimodal methods that fuse event streams with RGB images face difficulties in handling the complex spatiotemporal interference of raindrops in real scenes, primarily due to hardware synchronization errors and computational redundancy. In this paper, we propose a Color Spike Stream Deraining Network (SpikeDerain), capable of reconstructing spike streams of dynamic scenes and accurately removing rain streaks. To address the challenges of data scarcity in real continuous rainfall scenes, we design a physically interpretable rain streak synthesis model that generates parameterized continuous rain patterns based on arbitrary background images. Experimental results demonstrate that the network, trained with this synthetic data, remains highly robust even under extreme rainfall conditions. These findings highlight the effectiveness and robustness of our method across varying rainfall levels and datasets, setting new standards for video deraining tasks. The code will be released soon.

[CV-57] Wan: Open and Advanced Large-Scale Video Generative Models

Quick read: This report aims to improve the performance and application range of video generation models while supporting the open-source community. It presents Wan, a comprehensive and open suite of video foundation models that strengthens generative video capability through several innovations: (1) a novel in-house variational autoencoder (VAE) for better representations; (2) scalable pre-training strategies that make full use of large-scale datasets; (3) large-scale data curation to raise data quality; and (4) automated evaluation metrics for objective assessment of model performance. Together these allow Wan to perform strongly on multiple benchmarks, with broad applicability and efficient resource use, and the open release of code and models is intended to support further work in academia and industry.

Link: https://arxiv.org/abs/2503.20314
Authors: WanTeam: Ang Wang,Baole Ai,Bin Wen,Chaojie Mao,Chen-Wei Xie,Di Chen,Feiwu Yu,Haiming Zhao,Jianxiao Yang,Jianyuan Zeng,Jiayu Wang,Jingfeng Zhang,Jingren Zhou,Jinkai Wang,Jixuan Chen,Kai Zhu,Kang Zhao,Keyu Yan,Lianghua Huang,Mengyang Feng,Ningyi Zhang,Pandeng Li,Pingyu Wu,Ruihang Chu,Ruili Feng,Shiwei Zhang,Siyang Sun,Tao Fang,Tianxing Wang,Tianyi Gui,Tingyu Weng,Tong Shen,Wei Lin,Wei Wang,Wei Wang,Wenmeng Zhou,Wente Wang,Wenting Shen,Wenyuan Yu,Xianzhong Shi,Xiaoming Huang,Xin Xu,Yan Kou,Yangyu Lv,Yifei Li,Yijing Liu,Yiming Wang,Yingya Zhang,Yitong Huang,Yong Li,You Wu,Yu Liu,Yulin Pan,Yun Zheng,Yuntao Hong,Yupeng Shi,Yutong Feng,Zeyinzi Jiang,Zhen Han,Zhi-Fan Wu,Ziyu Liu
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 60 pages, 33 figures

Abstract:This report presents Wan, a comprehensive and open suite of video foundation models designed to push the boundaries of video generation. Built upon the mainstream diffusion transformer paradigm, Wan achieves significant advancements in generative capabilities through a series of innovations, including our novel VAE, scalable pre-training strategies, large-scale data curation, and automated evaluation metrics. These contributions collectively enhance the model’s performance and versatility. Specifically, Wan is characterized by four key features: Leading Performance: The 14B model of Wan, trained on a vast dataset comprising billions of images and videos, demonstrates the scaling laws of video generation with respect to both data and model size. It consistently outperforms the existing open-source models as well as state-of-the-art commercial solutions across multiple internal and external benchmarks, demonstrating a clear and significant performance superiority. Comprehensiveness: Wan offers two capable models, i.e., 1.3B and 14B parameters, for efficiency and effectiveness respectively. It also covers multiple downstream applications, including image-to-video, instruction-guided video editing, and personal video generation, encompassing up to eight tasks. Consumer-Grade Efficiency: The 1.3B model demonstrates exceptional resource efficiency, requiring only 8.19 GB VRAM, making it compatible with a wide range of consumer-grade GPUs. Openness: We open-source the entire series of Wan, including source code and all models, with the goal of fostering the growth of the video generation community. This openness seeks to significantly expand the creative possibilities of video production in the industry and provide academia with high-quality video foundation models. All the code and models are available at this https URL.

[CV-58] Enabling Heterogeneous Adversarial Transferability via Feature Permutation Attacks PAKDD2025

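Quick read: This paper addresses the sharp drop in performance of transfer-based adversarial attacks in black-box settings when the attack must transfer across heterogeneous architectures such as CNNs, MLPs, and Vision Transformers (ViTs). Conventional transfer attacks work poorly across architectures because of fundamental structural differences. The paper proposes the Feature Permutation Attack (FPA). Its key idea is a new feature permutation (FP) operation that rearranges pixel values in selected feature maps to simulate long-range dependencies, making CNNs behave more like ViTs and MLPs. This increases feature diversity and improves adversarial transferability both across architectures and among homogeneous CNNs. In the experiments, FPA achieves maximum absolute gains in attack success rate of 7.68% on CNNs, 14.57% on ViTs, and 14.48% on MLPs, outperforming existing black-box attacks; it also generalizes well and combines readily with other transfer-based attacks.

Link: https://arxiv.org/abs/2503.20310
Authors: Tao Wu,Tie Luo
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Comments: PAKDD 2025. Main Track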

Abstract:Adversarial attacks in black-box settings are highly practical, with transfer-based attacks being the most effective at generating adversarial examples (AEs) that transfer from surrogate models to unseen target models. However, their performance significantly degrades when transferring across heterogeneous architectures – such as CNNs, MLPs, and Vision Transformers (ViTs) – due to fundamental architectural differences. To address this, we propose Feature Permutation Attack (FPA), a zero-FLOP, parameter-free method that enhances adversarial transferability across diverse architectures. FPA introduces a novel feature permutation (FP) operation, which rearranges pixel values in selected feature maps to simulate long-range dependencies, effectively making CNNs behave more like ViTs and MLPs. This enhances feature diversity and improves transferability both across heterogeneous architectures and within homogeneous CNNs. Extensive evaluations on 14 state-of-the-art architectures show that FPA achieves maximum absolute gains in attack success rates of 7.68% on CNNs, 14.57% on ViTs, and 14.48% on MLPs, outperforming existing black-box attacks. Additionally, FPA is highly generalizable and can seamlessly integrate with other transfer-based attacks to further boost their performance. Our findings establish FPA as a robust, efficient, and computationally lightweight strategy for enhancing adversarial transferability across heterogeneous architectures.
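
The feature permutation operation described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `permute_feature_maps`, the `ratio` and `seed` parameters, the random choice of channels, and the uniform random shuffle are all assumptions made for illustration, since the abstract does not specify which feature maps are selected or how pixels are rearranged. The sketch only shows why such an operation needs no parameters and no arithmetic: values are moved, never computed.

```python
import random


def permute_feature_maps(features, ratio=0.5, seed=0):
    """Shuffle spatial positions in a randomly chosen subset of channels.

    features: list of channels, each a list of rows (H x W) of floats.
    ratio and seed are illustrative parameters, not taken from the paper.
    """
    rng = random.Random(seed)
    num_channels = len(features)
    num_selected = int(num_channels * ratio)
    selected = set(rng.sample(range(num_channels), num_selected))

    out = []
    for index, channel in enumerate(features):
        if index not in selected:
            # Unselected channels are copied unchanged.
            out.append([row[:] for row in channel])
            continue
        width = len(channel[0])
        flat = [value for row in channel for value in row]
        # Values are only rearranged; none are added, removed, or modified.
        rng.shuffle(flat)
        out.append([flat[i:i + width] for i in range(0, len(flat), width)])
    return out


if __name__ == "__main__":
    # Four 3x3 channels with distinct values so that moves are visible.
    demo = [[[float(c * 9 + r * 3 + col) for col in range(3)]
             for r in range(3)] for c in range(4)]
    for channel in permute_feature_maps(demo):
        print(channel)
```

After shuffling, a 3x3 convolution applied to a permuted channel sees values that originally sat far apart, which is one plausible reading of "simulating long-range dependencies".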

[CV-59] Instruction-Oriented Preference Alignment for Enhancing Multi-Modal Comprehension Capability of MLLMs

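Quick read: Existing preference alignment methods focus mainly on hallucination factors and overlook factors essential to multimodal comprehension, so their gains are largely confined to hallucination mitigation. To close this gap, the paper proposes Instruction-oriented Preference Alignment (IPA), a scalable framework. Its key idea is to construct alignment preferences automatically, grounded in how well a response fulfills the instruction, together with a dedicated verification process that identifies instruction-oriented factors while avoiding significant variability in response representations. IPA also includes a progressive preference collection pipeline that recalls challenging samples through model self-evolution and reference-guided refinement, improving the general comprehension ability of multimodal large language models.

Link: https://arxiv.org/abs/2503.20309
Authors: Zitian Wang,Yue Liao,Kang Rong,Fengyun Rao,Yibo Yang,Si Liu
Affiliations: Beihang University; The Chinese University of Hong Kong; King Abdullah University of Science and Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Technical report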

Abstract:Preference alignment has emerged as an effective strategy to enhance the performance of Multimodal Large Language Models (MLLMs) following supervised fine-tuning. While existing preference alignment methods predominantly target hallucination factors, they overlook the factors essential for multi-modal comprehension capabilities, often narrowing their improvements on hallucination mitigation. To bridge this gap, we propose Instruction-oriented Preference Alignment (IPA), a scalable framework designed to automatically construct alignment preferences grounded in instruction fulfillment efficacy. Our method involves an automated preference construction coupled with a dedicated verification process that identifies instruction-oriented factors, avoiding significant variability in response representations. Additionally, IPA incorporates a progressive preference collection pipeline, further recalling challenging samples through model self-evolution and reference-guided refinement. Experiments conducted on Qwen2VL-7B demonstrate IPA’s effectiveness across multiple benchmarks, including hallucination evaluation, visual question answering, and text understanding tasks, highlighting its capability to enhance general comprehension.

[CV-60] Perceptually Accurate 3D Talking Head Generation: New Definitions, Speech-Mesh Representation, and Evaluation Metrics

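Quick read: This paper addresses the weakness of existing speech-driven 3D talking head generation models in capturing the perceptual alignment between speech characteristics and the corresponding lip movements. It argues that perceptually accurate lip movement requires three criteria: Temporal Synchronization, Lip Readability, and Expressiveness. The key to the solution is a new speech-mesh synchronized representation that captures the intricate correspondences between speech signals and 3D face meshes. The learned representation is plugged into existing models as a perceptual loss to better align lip movements with the input speech, and it is also used as a perceptual metric alongside two other physically grounded lip synchronization metrics to evaluate generated results. Experiments show that training 3D talking head generation models with this perceptual loss significantly improves the perceptual accuracy of lip movement on all three criteria.

Link: https://arxiv.org/abs/2503.20308
Authors: Lee Chae-Yeon,Oh Hyun-Bin,Han EunGi,Kim Sung-Bin,Suekyeong Nam,Tae-Hyun Oh
Affiliations: Grad. School of AI, POSTECH; Dept. of Electrical Engineering, POSTECH; KRAFTON; School of Computing, KAIST
Categories: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
Comments: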

Abstract:Recent advancements in speech-driven 3D talking head generation have made significant progress in lip synchronization. However, existing models still struggle to capture the perceptual alignment between varying speech characteristics and corresponding lip movements. In this work, we claim that three criteria (Temporal Synchronization, Lip Readability, and Expressiveness) are crucial for achieving perceptually accurate lip movements. Motivated by our hypothesis that a desirable representation space exists to meet these three criteria, we introduce a speech-mesh synchronized representation that captures intricate correspondences between speech signals and 3D face meshes. We found that our learned representation exhibits desirable characteristics, and we plug it into existing models as a perceptual loss to better align lip movements to the given speech. In addition, we utilize this representation as a perceptual metric and introduce two other physically grounded lip synchronization metrics to assess how well the generated 3D talking heads align with these three criteria. Experiments show that training 3D talking head generation models with our perceptual loss significantly improves all three aspects of perceptually accurate lip synchronization. Codes and datasets are available at this https URL.

[CV-61] Attribute-formed Class-specific Concept Space: Endowing Language Bottleneck Model with Better Interpretability and Scalability CVPR2025

Quick read: This paper addresses two main problems in existing Language Bottleneck Models (LBMs): first, they simply list all concepts together, unorganized, as the bottleneck layer, which leads to the spurious cue inference problem and harms interpretability; second, they do not generalize well to unseen classes. The paper proposes the Attribute-formed Language Bottleneck Model (ALBM). Its key idea is to organize concepts in an attribute-formed class-specific space, so that each class's concepts describe only its essential attributes, which avoids spurious cue inference; a cross-class unified attribute set also strengthens the correlation between the concept spaces of different classes, so the learned concept classifier generalizes to unseen classes. The paper further proposes Visual Attribute Prompt Learning (VAPL) to improve interpretability and the Description, Summary, and Supplement (DSS) strategy to reduce manual annotation effort.

Link: https://arxiv.org/abs/2503.20301
Authors: Jianyang Zhang,Qianli Luo,Guowu Yang,Wenjing Yang,Weide Liu,Guosheng Lin,Fengmao Lv
Affiliations: University of Electronic Science and Technology of China; Southwest Jiaotong University; Institute of Electronics and Information Industry Technology of Kash; University of Minnesota; Harvard University; Nanyang Technological University; Engineering Research Center of Sustainable Urban Intelligent Transportation, Ministry of Education
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: This paper has been accepted to CVPR 2025

Abstract:Language Bottleneck Models (LBMs) are proposed to achieve interpretable image recognition by classifying images based on textual concept bottlenecks. However, current LBMs simply list all concepts together as the bottleneck layer, which leads to the spurious cue inference problem and prevents generalization to unseen classes. To address these limitations, we propose the Attribute-formed Language Bottleneck Model (ALBM). ALBM organizes concepts in the attribute-formed class-specific space, where concepts are descriptions of specific attributes for specific classes. In this way, ALBM can avoid the spurious cue inference problem by classifying solely based on the essential concepts of each class. In addition, the cross-class unified attribute set also ensures that the concept spaces of different classes have strong correlations; as a result, the learned concept classifier can be easily generalized to unseen classes. Moreover, to further improve interpretability, we propose Visual Attribute Prompt Learning (VAPL) to extract visual features on fine-grained attributes. Furthermore, to avoid labor-intensive concept annotation, we propose the Description, Summary, and Supplement (DSS) strategy to automatically generate high-quality concept sets with a complete and precise attribute. Extensive experiments on 9 widely used few-shot benchmarks demonstrate the interpretability, transferability, and performance of our approach. The code and collected concept sets are available at this https URL.

[CV-62] Traversing Distortion-Perception Tradeoff using a Single Score-Based Generative Model

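Quick read: This paper addresses the distortion-perception (DP) tradeoff in denoising, that is, how to balance distortion metrics (such as mean squared error, MSE, and peak signal-to-noise ratio, PSNR) against perceptual quality. Existing methods either optimize perceptual quality at the cost of some acceptable distortion or minimize distortion for faithful restoration, and adapting them to a different point on the DP plane usually requires retraining or redesigning the model. The key contribution is a way to traverse the DP tradeoff flexibly and optimally using a single pre-trained score-based model. Specifically, the authors introduce a variance-scaled reverse diffusion process, characterize its marginal distribution theoretically, and prove that the proposed sampling process is an optimal solution to the DP tradeoff for conditional Gaussian distributions. Experiments show that a single score network can traverse the DP tradeoff effectively and flexibly for general denoising problems.

Link: https://arxiv.org/abs/2503.20297
Authors: Yuhan Wang,Suzhi Bi,Ying-Jun Angela Zhang,Xiaojun Yuan
Affiliations: The Chinese University of Hong Kong; Shenzhen University; University of Electronic Science and Technology of China
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by IEEE/CVF Conference on Computer Vision and Pattern Recognition 2025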

Abstract:The distortion-perception (DP) tradeoff reveals a fundamental conflict between distortion metrics (e.g., MSE and PSNR) and perceptual quality. Recent research has increasingly concentrated on evaluating denoising algorithms within the DP framework. However, existing algorithms either prioritize perceptual quality by sacrificing acceptable distortion, or focus on minimizing MSE for faithful restoration. When the goal shifts or noisy measurements vary, adapting to different points on the DP plane needs retraining or even re-designing the model. Inspired by recent advances in solving inverse problems using score-based generative models, we explore the potential of flexibly and optimally traversing DP tradeoffs using a single pre-trained score-based model. Specifically, we introduce a variance-scaled reverse diffusion process and theoretically characterize the marginal distribution. We then prove that the proposed sample process is an optimal solution to the DP tradeoff for conditional Gaussian distribution. Experimental results on two-dimensional and image datasets illustrate that a single score network can effectively and flexibly traverse the DP tradeoff for general denoising problems.
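
For readers unfamiliar with the two distortion metrics named above, the following sketch computes MSE and PSNR using their standard definitions, PSNR = 10 log10(MAX^2 / MSE). It does not implement the paper's variance-scaled reverse diffusion process. The function names and the `max_value` parameter (the peak signal value, assumed here to be 1.0 for signals scaled to [0, 1]) are illustrative choices.

```python
import math


def mse(reference, estimate):
    """Mean squared error between two equal-length sequences of floats."""
    if len(reference) != len(estimate):
        raise ValueError("inputs must have the same length")
    return sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)


def psnr(reference, estimate, max_value=1.0):
    """Peak signal-to-noise ratio in decibels; infinite for identical inputs."""
    error = mse(reference, estimate)
    if error == 0:
        return math.inf
    return 10.0 * math.log10(max_value ** 2 / error)


if __name__ == "__main__":
    clean = [0.0, 0.0, 0.0, 0.0]
    denoised = [0.1, 0.1, 0.1, 0.1]
    print(mse(clean, denoised), psnr(clean, denoised))
```

Both metrics compare an estimate with a reference sample by sample, so they say nothing about whether the estimate looks like a natural image, which is why lowering distortion and improving perceptual quality can pull in different directions.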

[CV-63] Context-Aware Weakly Supervised Image Manipulation Localization with SAM Refinement

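Quick read: This paper addresses the poor localization performance of weakly supervised methods for detecting malicious image manipulation, which overlook edge information. It proposes a Context-Aware Boundary Localization (CABL) module that aggregates boundary features and learns context inconsistency to localize manipulated regions precisely. In addition, by combining Class Activation Mapping (CAM) with the Segment Anything Model (SAM), it introduces a CAM-Guided SAM Refinement (CGSR) module that produces more accurate manipulation localization maps. The key to the solution is the design of these two modules, which improve weakly supervised manipulation localization from two angles: boundary feature aggregation with context analysis, and refinement of the localization output.

Link: https://arxiv.org/abs/2503.20294
Authors: Xinghao Wang,Changtao Miao,Dianmo Sheng,Tao Gong,Qi Chu,Bin Liu,Nenghai Yu
Affiliations: University of Science and Technology of China
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: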

Abstract:Malicious image manipulation poses societal risks, increasing the importance of effective image manipulation detection methods. Recent work in image manipulation detection has largely been driven by fully supervised approaches, which require labor-intensive pixel-level annotations. Thus, it is essential to explore weakly supervised image manipulation localization methods that only require image-level binary labels for training. However, existing weakly supervised image manipulation methods overlook the importance of edge information for accurate localization, leading to suboptimal localization performance. To address this, we propose a Context-Aware Boundary Localization (CABL) module to aggregate boundary features and learn context-inconsistency for localizing manipulated areas. Furthermore, by leveraging Class Activation Mapping (CAM) and Segment Anything Model (SAM), we introduce the CAM-Guided SAM Refinement (CGSR) module to generate more accurate manipulation localization maps. By integrating the two modules, we present a novel weakly supervised framework based on a dual-branch Transformer-CNN architecture. Our method achieves outstanding localization performance across multiple datasets.

[CV-64] CryoSAMU: Enhancing 3D Cryo-EM Density Maps of Protein Structures at Intermediate Resolution with Structure-Aware Multimodal U-Nets

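Quick read: This paper addresses the enhancement of cryogenic electron microscopy (cryo-EM) 3D density maps at intermediate resolution (4-8 Å). Existing deep learning methods automate map enhancement but are not optimized for this resolution range and rely on density features alone. The paper proposes CryoSAMU, whose key idea is to use structure-aware multimodal U-Nets trained on curated intermediate-resolution density maps to improve the quality of protein structure density maps. In the experiments, CryoSAMU is competitive with state-of-the-art methods across several evaluation metrics and is notably much faster, which suggests potential for practical use.

Link: https://arxiv.org/abs/2503.20291
Authors: Chenwei Zhang,Anne Condon,Khanh Dao Duc
Affiliations: Department of Computer Science, UBC; Department of Mathematics, UBC
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Biomolecules (q-bio.BM)
Comments: 18 pages, 6 main figures, 2 supplementary figures, 3 main tables, 4 supplementary tables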

Abstract:Enhancing cryogenic electron microscopy (cryo-EM) 3D density maps at intermediate resolution (4-8 Å) is crucial in protein structure determination. Recent advances in deep learning have led to the development of automated approaches for enhancing experimental cryo-EM density maps. Yet, these methods are not optimized for intermediate-resolution maps and rely on map density features alone. To address this, we propose CryoSAMU, a novel method designed to enhance 3D cryo-EM density maps of protein structures using structure-aware multimodal U-Nets and trained on curated intermediate-resolution density maps. We comprehensively evaluate CryoSAMU across various metrics and demonstrate its competitive performance compared to state-of-the-art methods. Notably, CryoSAMU achieves significantly faster processing speed, showing promise for future practical applications. Our code is available at this https URL.

[CV-65] RelTriple: Learning Plausible Indoor Layouts by Integrating Relationship Triples into the Diffusion Process

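Quick read: This paper addresses unrealistic indoor furniture layouts caused by incomplete, manually defined relationships. It proposes RelTriple, which extracts spacing relationships automatically through a hierarchical analysis and uses Delaunay triangulation to produce the important triple relationships. Compared with pairwise relationship modeling, triple relationships account better for interactions and space utilization among multiple objects. The method formulates the triple relationships as object-to-object (O2O) and object-to-region (O2R) losses and integrates them directly into the training of a generative diffusion model. This improves visual results and evaluation metrics on unconditional layout generation, floorplan-conditioned layout generation, and scene rearrangement, with a gain of at least 12% on the newly introduced spatial relationship metric, along with better spatial coherence and practical usability.

Link: https://arxiv.org/abs/2503.20289
Authors: Kaifan Sun,Bingchen Yang,Peter Wonka,Jun Xiao,Haiyong Jiang
Affiliations: University of Chinese Academy of Sciences (UCAS); King Abdullah University of Science and Technology (KAUST)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: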

Abstract:The generation of indoor furniture layouts has significant applications in augmented reality, smart homes, and architectural design. Successful furniture arrangement requires proper physical relationships (e.g., collision avoidance) and spacing relationships between furniture and their functional zones to be respected. However, manually defined relationships are almost always incomplete and can produce unrealistic layouts. This work instead extracts spacing relationships automatically based on a hierarchical analysis and adopts the Delaunay Triangulation to produce important triple relationships. Compared to pairwise relationship modeling, triple relationships account for interactions and space utilization among multiple objects. To this end, we introduce RelTriple, a novel approach that enhances furniture distribution by learning spacing relationships between objects and regions. We formulate triple relationships as object-to-object (O2O) losses and object-to-region (O2R) losses and integrate them directly into the training process of generative diffusion. Our approach consistently improves over existing state-of-the-art methods in visual results and evaluation metrics on unconditional layout generation, floorplan-conditioned layout generation, and scene rearrangement, achieving a gain of at least 12% on the introduced spatial relationship metric and superior spatial coherence and practical usability.
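
To make the idea of triples from a Delaunay triangulation concrete, here is a minimal sketch that lists the index triples forming Delaunay triangles over a set of 2D points. It is an illustration under stated assumptions, not the RelTriple pipeline: treating each furniture item as a single 2D point (for example, its floor-plan centroid) is an assumption, and the brute-force empty-circumcircle test below stands in for an optimized triangulation routine. Function names are hypothetical.

```python
from itertools import combinations


def orientation(a, b, c):
    """Twice the signed area of triangle a, b, c (positive if counter-clockwise)."""
    return (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])


def in_circumcircle(a, b, c, d, eps=1e-12):
    """True if d lies strictly inside the circumcircle of CCW triangle a, b, c."""
    rows = []
    for p in (a, b, c):
        dx, dy = p[0] - d[0], p[1] - d[1]
        rows.append((dx, dy, dx * dx + dy * dy))
    (ax, ay, aw), (bx, by, bw), (cx, cy, cw) = rows
    det = (ax * (by * cw - bw * cy)
           - ay * (bx * cw - bw * cx)
           + aw * (bx * cy - by * cx))
    return det > eps


def delaunay_triples(points, eps=1e-12):
    """Return index triples whose circumcircle contains no other point.

    Brute force, O(n^4): fine for a handful of objects in one room.
    """
    triples = []
    for i, j, k in combinations(range(len(points)), 3):
        a, b, c = points[i], points[j], points[k]
        area = orientation(a, b, c)
        if abs(area) < eps:
            continue  # collinear points do not form a triangle
        if area < 0:
            b, c = c, b  # enforce counter-clockwise order for the test
        others = (points[m] for m in range(len(points)) if m not in (i, j, k))
        if not any(in_circumcircle(a, b, c, d, eps) for d in others):
            triples.append((i, j, k))
    return triples


if __name__ == "__main__":
    # Four objects at the corners of a room and one in the middle.
    centroids = [(0.0, 0.0), (4.0, 0.0), (4.0, 4.0), (0.0, 4.0), (2.0, 2.0)]
    print(delaunay_triples(centroids))
```

Each returned triple groups three mutually neighboring objects, which is the kind of unit a three-object spacing loss could be defined on; how the paper turns triples into O2O and O2R losses is not specified in the abstract.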

[CV-66] InsViE-1M: Effective Instruction-based Video Editing with Elaborate Dataset Construction

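Quick read: This paper addresses the difficulty of collecting high-quality training triplets (source video, edited video, instruction) for instruction-based video editing. Existing datasets generally suffer from low-resolution, short, and scarce source videos and from poor editing quality, which limits the performance of editing models. The key contribution is InsViE-1M, a dataset of 1 million high-quality triplets. The authors first curate high-resolution, high-quality source videos and images, then design an editing-filtering pipeline to build the triplets. For the first frame of each source video, multiple edited samples are generated with different classifier-free guidance strengths and filtered automatically by GPT-4o according to carefully crafted guidelines; the edited first frame is then propagated to subsequent frames to produce the edited video, followed by another round of filtering on frame quality and motion. A variety of video editing triplets are also generated from high-quality images and filtered. On top of the dataset, the paper proposes a multi-stage learning strategy to train the InsViE model, progressively improving its instruction following and editing ability. Experiments show advantages of both the dataset and the trained model over state-of-the-art methods.

Link: https://arxiv.org/abs/2503.20287
Authors: Yuhui Wu,Liyi Chen,Ruibin Li,Shihao Wang,Chenxi Xie,Lei Zhang
Affiliations: The Hong Kong Polytechnic University; OPPO Research Institute
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: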

Abstract:Instruction-based video editing allows effective and interactive editing of videos using only instructions without extra inputs such as masks or attributes. However, collecting high-quality training triplets (source video, edited video, instruction) is a challenging task. Existing datasets mostly consist of a limited number of low-resolution, short source videos with unsatisfactory editing quality, limiting the performance of trained editing models. In this work, we present a high-quality Instruction-based Video Editing dataset with 1M triplets, namely InsViE-1M. We first curate high-resolution and high-quality source videos and images, then design an effective editing-filtering pipeline to construct high-quality editing triplets for model training. For a source video, we generate multiple edited samples of its first frame with different intensities of classifier-free guidance, which are automatically filtered by GPT-4o with carefully crafted guidelines. The edited first frame is propagated to subsequent frames to produce the edited video, followed by another round of filtering for frame quality and motion evaluation. We also generate and filter a variety of video editing triplets from high-quality images. With the InsViE-1M dataset, we propose a multi-stage learning strategy to train our InsViE model, progressively enhancing its instruction following and editing ability. Extensive experiments demonstrate the advantages of our InsViE-1M dataset and the trained model over state-of-the-art works. Codes are available at InsViE.

[CV-67] Faster Parameter-Efficient Tuning with Token Redundancy Reduction CVPR2025

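Quick read: This paper addresses the limits of parameter-efficient tuning (PET) in inference speed and computational efficiency. PET greatly reduces storage and transfer costs compared with full fine-tuning, but it inherits the inference latency of the large backbone model, and additional modules such as adapters add further computation, which limits its practicality in compute-intensive applications. The paper proposes Faster Parameter-Efficient Tuning (FPET). It introduces a plug-and-play token redundancy reduction module that refines the tokens from the self-attention layer and reduces their number with a fully differentiable token merging strategy. The design keeps storage efficiency high while improving inference speed and training efficiency, with performance on par with state-of-the-art PET methods.

Link: https://arxiv.org/abs/2503.20282
Authors: Kwonyoung Kim,Jungin Park,Jin Kim,Hyeongjun Kwon,Kwanghoon Sohn
Affiliations: Yonsei University; Korea Institute of Science and Technology (KIST)
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: CVPR 2025 Camera-ready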

Abstract:Parameter-efficient tuning (PET) aims to transfer pre-trained foundation models to downstream tasks by learning a small number of parameters. Compared to traditional fine-tuning, which updates the entire model, PET significantly reduces storage and transfer costs for each task regardless of exponentially increasing pre-trained model capacity. However, most PET methods inherit the inference latency of their large backbone models and often introduce additional computational overhead due to additional modules (e.g. adapters), limiting their practicality for compute-intensive applications. In this paper, we propose Faster Parameter-Efficient Tuning (FPET), a novel approach that enhances inference speed and training efficiency while maintaining high storage efficiency. Specifically, we introduce a plug-and-play token redundancy reduction module delicately designed for PET. This module refines tokens from the self-attention layer using an adapter to learn the accurate similarity between tokens and cuts off the tokens through a fully-differentiable token merging strategy, which uses a straight-through estimator for optimal token reduction. Experimental results prove that our FPET achieves faster inference and higher memory efficiency than the pre-trained backbone while keeping competitive performance on par with state-of-the-art PET methods.

[CV-68] EGVD: Event-Guided Video Diffusion Model for Physically Realistic Large-Motion Frame Interpolation

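Quick read: This paper addresses video frame interpolation (VFI) in large-motion scenes, which is difficult because of motion ambiguity between frames. Event cameras capture motion at high temporal resolution, but existing event-based VFI methods struggle with limited training data and complex motion patterns. The paper proposes the Event-Guided Video Diffusion Model (EGVD), a framework that combines the strong priors of a pre-trained stable video diffusion model with the precise temporal information from event cameras. Its core component is a Multi-modal Motion Condition Generator (MMCG), which integrates RGB frames and event signals to guide the diffusion process and produce physically realistic intermediate frames. A selective fine-tuning strategy preserves spatial modeling capability while incorporating event-guided temporal information efficiently, and input-output normalization improves training stability across noise levels. EGVD handles large motion and challenging lighting well, with substantial gains in perceptual quality (27.4% better LPIPS on Prophesee and 24.1% on BSRGB) while maintaining competitive fidelity.

Link: https://arxiv.org/abs/2503.20268
Authors: Ziran Zhang,Xiaohui Li,Yihao Liu,Yujin Wang,Yueting Chen,Tianfan Xue,Shi Guo
Affiliations: Zhejiang University; Shanghai AI Laboratory; Shanghai Jiao Tong University; The Chinese University of Hong Kong
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: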

Abstract:Video frame interpolation (VFI) in scenarios with large motion remains challenging due to motion ambiguity between frames. While event cameras can capture high temporal resolution motion information, existing event-based VFI methods struggle with limited training data and complex motion patterns. In this paper, we introduce Event-Guided Video Diffusion Model (EGVD), a novel framework that leverages the powerful priors of pre-trained stable video diffusion models alongside the precise temporal information from event cameras. Our approach features a Multi-modal Motion Condition Generator (MMCG) that effectively integrates RGB frames and event signals to guide the diffusion process, producing physically realistic intermediate frames. We employ a selective fine-tuning strategy that preserves spatial modeling capabilities while efficiently incorporating event-guided temporal information. We incorporate input-output normalization techniques inspired by recent advances in diffusion modeling to enhance training stability across varying noise levels. To improve generalization, we construct a comprehensive dataset combining both real and simulated event data across diverse scenarios. Extensive experiments on both real and simulated datasets demonstrate that EGVD significantly outperforms existing methods in handling large motion and challenging lighting conditions, achieving substantial improvements in perceptual quality metrics (27.4% better LPIPS on Prophesee and 24.1% on BSRGB) while maintaining competitive fidelity measures. Code and datasets available at: this https URL.

[CV-69] Mamba-3D as Masked Autoencoders for Accurate and Data-Efficient Analysis of Medical Ultrasound Videos

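Quick read: This paper addresses the slow progress of ultrasound video analysis caused by scarce labeled data and the inherent difficulty of video data. It proposes E-ViM³, a data-efficient Vision Mamba network. Enclosure Global Tokens (EGT) capture and aggregate global features effectively, and self-supervised pre-training with a Spatial-Temporal Chained (STC) masking strategy strengthens the modeling of space-time correlations and improves data efficiency. Experiments show that E-ViM³ achieves state-of-the-art performance on datasets of varying sizes and performs well with limited labels, indicating potential for real clinical use.

Link: https://arxiv.org/abs/2503.20258
Authors: Jiaheng Zhou,Yanfeng Zhou,Wei Fang,Yuxing Tang,Le Lu,Ge Yang
Affiliations: Institute of Automation, Chinese Academy of Sciences; School of Artificial Intelligence, University of Chinese Academy of Sciences; DAMO Academy, Alibaba Group; Hupan Laboratory
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: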

Abstract:Ultrasound videos are an important form of clinical imaging data, and deep learning-based automated analysis can improve diagnostic accuracy and clinical efficiency. However, the scarcity of labeled data and the inherent challenges of video analysis have impeded the advancement of related methods. In this work, we introduce E-ViM³, a data-efficient Vision Mamba network that preserves the 3D structure of video data, enhancing long-range dependencies and inductive biases to better model space-time correlations. With our design of Enclosure Global Tokens (EGT), the model captures and aggregates global features more effectively than competing methods. To further improve data efficiency, we employ masked video modeling for self-supervised pre-training, with the proposed Spatial-Temporal Chained (STC) masking strategy designed to adapt to various video scenarios. Experiments demonstrate that E-ViM³ performs as the state-of-the-art in two high-level semantic analysis tasks across four datasets of varying sizes: EchoNet-Dynamic, CAMUS, MICCAI-BUV, and WHBUS. Furthermore, our model achieves competitive performance with limited labels, highlighting its potential impact on real-world clinical applications.

[CV-70] LogicQA: Logical Anomaly Detection with Vision Language Model Generated Questions

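Quick read: This paper addresses logical anomaly detection (LAD) in industrial processes. Logical anomalies may look visually normal while violating predefined constraints on object presence, arrangement, or quantity. The paper proposes LogicQA, a framework that compiles automatically generated questions into a checklist and uses the answers to identify violations of logical constraints, which gives industrial operators an explanation for each detection. LogicQA is training-free and annotation-free and operates in a few-shot setting. It reports state-of-the-art results on public benchmarks such as MVTec LOCO AD (AUROC 87.6%, F1-max 87.0%) together with explanations of the anomalies, and it also performs well on corporate semiconductor scanning electron microscope (SEM) data, further supporting its usefulness in industrial applications.

Link: https://arxiv.org/abs/2503.20252
Authors: Yejin Kwon,Daeun Moon,Youngje Oh,Hyunsoo Yoon
Affiliations: Department of Industrial Engineering, Yonsei University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: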

Abstract:Anomaly Detection (AD) focuses on detecting samples that differ from the standard pattern, making it a vital tool in process control. Logical anomalies may appear visually normal yet violate predefined constraints on object presence, arrangement, or quantity, depending on reasoning and explainability. We introduce LogicQA, a framework that enhances AD by providing industrial operators with explanations for logical anomalies. LogicQA compiles automatically generated questions into a checklist and collects responses to identify violations of logical constraints. LogicQA is training-free, annotation-free, and operates in a few-shot setting. We achieve state-of-the-art (SOTA) Logical AD performance on public benchmarks, MVTec LOCO AD, with an AUROC of 87.6 percent and an F1-max of 87.0 percent along with the explanations of anomalies. Also, our approach has shown outstanding performance on semiconductor SEM corporate data, further validating its effectiveness in industrial applications.
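
The checklist idea described above can be sketched in a few lines. This is a hedged illustration, not LogicQA itself: the questions, the expected answers, the `inspect` function, and the rule "any unexpected answer marks the image as anomalous" are assumptions, and `lookup_answer` is a stand-in for a vision-language model call, which the abstract does not describe in detail.

```python
def inspect(image, checklist, answer_fn):
    """Ask every checklist question and collect those answered unexpectedly.

    checklist: list of (question, expected_answer) pairs.
    answer_fn: callable (image, question) -> answer string; hypothetical
               placeholder for a vision-language model.
    """
    violated = [question for question, expected in checklist
                if answer_fn(image, question) != expected]
    return {"anomalous": bool(violated), "violated_questions": violated}


def lookup_answer(image, question):
    """Stand-in for a model: here an 'image' is just a dict of known answers."""
    return image[question]


# Hypothetical checklist; in the paper the questions are generated automatically.
CHECKLIST = [
    ("Is exactly one cable present?", "yes"),
    ("Is the cable attached at both ends?", "yes"),
]

if __name__ == "__main__":
    normal = {question: "yes" for question, _ in CHECKLIST}
    defective = dict(normal)
    defective["Is the cable attached at both ends?"] = "no"
    print(inspect(normal, CHECKLIST, lookup_answer))
    print(inspect(defective, CHECKLIST, lookup_answer))
```

Returning the violated questions alongside the decision is what makes the result explainable: the operator sees which constraint failed, not only a score.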

[CV-71] Incremental Object Keypoint Learning CVPR

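Quick read: Existing object keypoint estimation methods can hardly detect new, previously undefined keypoints at test time, which limits their usefulness for diverse downstream tasks. The paper proposes a new paradigm, Incremental object Keypoint Learning (IKL), in which only the new keypoints are annotated in the new data and the model is trained incrementally without retaining old data. The key to the solution is a two-stage learning scheme. In the first stage, an auxiliary knowledge association network (KA-Net) automatically associates old keypoints with new ones based on their spatial and anatomical relations. In the second stage, a keypoint-oriented spatial distillation loss is used with the auxiliary KA-Net and the old model to consolidate knowledge and mutually improve the estimation of all old and new keypoints. The method mitigates catastrophic forgetting of old keypoints and may even improve their estimation, achieving positive transfer. Experiments support its advantages in reducing forgetting, improving performance, and labeling efficiency in the low-shot regime.

Link: https://arxiv.org/abs/2503.20248
Authors: Mingfu Liang,Jiahuan Zhou,Xu Zou,Ying Wu
Affiliations: Northwestern University; Wangxuan Institute of Computer Technology, Peking University; Huazhong University of Science and Technology
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: The IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2025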

Abstract:Existing progress in object keypoint estimation primarily benefits from the conventional supervised learning paradigm based on numerous data labeled with pre-defined keypoints. However, these well-trained models can hardly detect the undefined new keypoints in test time, which largely hinders their feasibility for diverse downstream tasks. To handle this, various solutions are explored but still suffer from either limited generalizability or transferability. Therefore, in this paper, we explore a novel keypoint learning paradigm in that we only annotate new keypoints in the new data and incrementally train the model, without retaining any old data, called Incremental object Keypoint Learning (IKL). A two-stage learning scheme as a novel baseline tailored to IKL is developed. In the first Knowledge Association stage, given the data labeled with only new keypoints, an auxiliary KA-Net is trained to automatically associate the old keypoints to these new ones based on their spatial and intrinsic anatomical relations. In the second Mutual Promotion stage, based on a keypoint-oriented spatial distillation loss, we jointly leverage the auxiliary KA-Net and the old model for knowledge consolidation to mutually promote the estimation of all old and new keypoints. Owing to the investigation of the correlations between new and old keypoints, our proposed method can not just effectively mitigate the catastrophic forgetting of old keypoints, but may even further improve the estimation of the old ones and achieve a positive transfer beyond anti-forgetting. Such an observation has been solidly verified by extensive experiments on different keypoint datasets, where our method exhibits superiority in alleviating the forgetting issue and boosting performance while enjoying labeling efficiency even under the low-shot data regime.

[CV-72] Unconditional Priors Matter! Improving Conditional Generation of Fine-Tuned Diffusion Models

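[Quick Read]: This paper addresses a problem with Classifier-Free Guidance (CFG) in training conditional diffusion models: joint learning yields poor unconditional noise predictions, which in turn degrade conditional generation quality. The key solution is to replace the unconditional noise in CFG with the unconditional noise predicted by the pretrained base model or by a different diffusion model, which significantly improves conditional generation. The method is validated on a range of CFG-based conditional models covering both image and video generation, and the results show that the replacement model need not be the diffusion model the fine-tuned model was trained from.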

Link: https://arxiv.org/abs/2503.20240
Authors: Prin Phunyaphibarn, Phillip Y. Lee, Jaihoon Kim, Minhyuk Sung
Affiliations: KAIST
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Classifier-Free Guidance (CFG) is a fundamental technique in training conditional diffusion models. The common practice for CFG-based training is to use a single network to learn both conditional and unconditional noise prediction, with a small dropout rate for conditioning. However, we observe that the joint learning of unconditional noise with limited bandwidth in training results in poor priors for the unconditional case. More importantly, these poor unconditional noise predictions become a serious reason for degrading the quality of conditional generation. Inspired by the fact that most CFG-based conditional models are trained by fine-tuning a base model with better unconditional generation, we first show that simply replacing the unconditional noise in CFG with that predicted by the base model can significantly improve conditional generation. Furthermore, we show that a diffusion model other than the one the fine-tuned model was trained on can be used for unconditional noise replacement. We experimentally verify our claim with a range of CFG-based conditional models for both image and video generation, including Zero-1-to-3, Versatile Diffusion, DiT, DynamiCrafter, and InstructPix2Pix.
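
The replacement described in this abstract can be illustrated with a minimal sketch. It assumes the common CFG combination rule (unconditional prediction plus a scaled difference between conditional and unconditional predictions); the `NoisePredictor` interface, the toy predictors, and the guidance scale are hypothetical and are not taken from the paper.

```python
from typing import Callable, List, Optional

Vector = List[float]
# Hypothetical interface: (noisy sample, timestep, condition or None) -> predicted noise.
NoisePredictor = Callable[[Vector, int, Optional[str]], Vector]


def cfg_with_base_prior(x_t: Vector, t: int, cond: str,
                        finetuned: NoisePredictor, base: NoisePredictor,
                        guidance_scale: float) -> Vector:
    """Classifier-free guidance with the unconditional branch taken from `base`."""
    eps_cond = finetuned(x_t, t, cond)   # conditional noise from the fine-tuned model
    eps_uncond = base(x_t, t, None)      # unconditional noise from the base model
    return [u + guidance_scale * (c - u) for c, u in zip(eps_cond, eps_uncond)]


# Toy stand-ins for real networks, used only to exercise the function.
def toy_finetuned(x_t: Vector, t: int, cond: Optional[str]) -> Vector:
    return [2.0 * x for x in x_t]


def toy_base(x_t: Vector, t: int, cond: Optional[str]) -> Vector:
    return [0.5 * x for x in x_t]


guided = cfg_with_base_prior([1.0, -2.0], 10, "a red chair",
                             toy_finetuned, toy_base, 3.0)
```

With a guidance scale of 1 the result reduces to the conditional prediction alone, so the base model's prior only matters when guidance is actually applied.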

[CV-73] Leveraging 3D Geometric Priors in 2D Rotation Symmetry Detection CVPR2025

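[Quick Read]: This paper tackles two core problems in rotation symmetry detection: accurately localizing rotation centers and supporting vertices. Traditional methods rely on hand-crafted feature matching, and segmentation models based on convolutional neural networks can detect rotation centers but struggle to maintain 3D geometric consistency under viewpoint distortion. To overcome this, the paper proposes a model that predicts rotation centers and vertices directly in 3D space and projects the results back to 2D to preserve structural integrity. The key innovation is a vertex reconstruction stage that enforces 3D geometric priors, such as equal side lengths and equal interior angles, which improves robustness and accuracy. Experiments on the DENDI dataset show strong rotation axis detection performance, and ablation studies confirm the importance of the 3D priors.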

Link: https://arxiv.org/abs/2503.20235
Authors: Ahyun Seo, Minsu Cho
Affiliations: Pohang University of Science and Technology (POSTECH)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to CVPR 2025

Abstract:Symmetry plays a vital role in understanding structural patterns, aiding object recognition and scene interpretation. This paper focuses on rotation symmetry, where objects remain unchanged when rotated around a central axis, requiring detection of rotation centers and supporting vertices. Traditional methods relied on hand-crafted feature matching, while recent segmentation models based on convolutional neural networks detect rotation centers but struggle with 3D geometric consistency due to viewpoint distortions. To overcome this, we propose a model that directly predicts rotation centers and vertices in 3D space and projects the results back to 2D while preserving structural integrity. By incorporating a vertex reconstruction stage enforcing 3D geometric priors – such as equal side lengths and interior angles – our model enhances robustness and accuracy. Experiments on the DENDI dataset show superior performance in rotation axis detection and validate the impact of 3D priors through ablation studies.
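
To see why the equal-side-length prior is easier to state in 3D than in 2D, the following sketch builds a regular polygon on a tilted plane and projects it with a simple pinhole camera. It is an illustration of the prior only, not the paper's reconstruction stage; the function names, the camera model, and the numeric values are all assumptions.

```python
import math


def regular_polygon_3d(center, radius, n, u, v):
    """Vertices of a regular n-gon around `center` in the plane spanned by
    the orthonormal 3D vectors u and v."""
    verts = []
    for k in range(n):
        a = 2.0 * math.pi * k / n
        verts.append(tuple(c + radius * (math.cos(a) * ui + math.sin(a) * vi)
                           for c, ui, vi in zip(center, u, v)))
    return verts


def project_pinhole(point, focal=1.0):
    """Perspective projection onto the image plane (camera looks along +z)."""
    x, y, z = point
    return (focal * x / z, focal * y / z)


def side_lengths(verts):
    """Distances between consecutive vertices, closing the polygon."""
    return [math.dist(verts[i], verts[(i + 1) % len(verts)])
            for i in range(len(verts))]


# A pentagon on a plane tilted 30 degrees away from the image plane.
tilt = math.radians(30)
pentagon = regular_polygon_3d(center=(0.0, 0.0, 5.0), radius=1.0, n=5,
                              u=(1.0, 0.0, 0.0),
                              v=(0.0, math.cos(tilt), math.sin(tilt)))
sides_3d = side_lengths(pentagon)
sides_2d = side_lengths([project_pinhole(p) for p in pentagon])
```

The 3D side lengths are identical, while the projected 2D side lengths differ, which is the viewpoint distortion the abstract refers to.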

[CV-74] TraNCE: Transformative Non-linear Concept Explainer for CNNs

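[Quick Read]: This paper addresses two main problems that existing concept-based explainability methods face when producing global explanations: the assumption that image activations can be linearly reconstructed fails to capture their intricate internal relationships, and evaluation based on Fidelity alone can yield inconsistent explanations. To address these, the paper proposes the Transformative Nonlinear Concept Explainer (TraNCE), a new nonlinear concept explanation method for convolutional neural networks (CNNs). Its key contributions are an automatic concept discovery mechanism based on variational autoencoders (VAEs), whose transformative process improves the extraction of meaningful concepts from image activations; a visualization module that uses the Bessel function to create smooth transitions between prototypical image pixels, revealing not only what the CNN "saw" but also what it "avoided"; and a new Faith score that combines Coherence and Fidelity for a more comprehensive evaluation.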

Link: https://arxiv.org/abs/2503.20230
Authors: Ugochukwu Ejike Akpudo, Yongsheng Gao, Jun Zhou, Andrew Lewis
Affiliations: Griffith University, Australia
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Convolutional neural networks (CNNs) have succeeded remarkably in various computer vision tasks. However, they are not intrinsically explainable. While the feature-level understanding of CNNs reveals where the models looked, concept-based explainability methods provide insights into what the models saw. However, their assumption of linear reconstructability of image activations fails to capture the intricate relationships within these activations. Their Fidelity-only approach to evaluating global explanations also presents a new concern. For the first time, we address these limitations with the novel Transformative Nonlinear Concept Explainer (TraNCE) for CNNs. Unlike linear reconstruction assumptions made by existing methods, TraNCE captures the intricate relationships within the activations. This study presents three original contributions to the CNN explainability literature: (i) An automatic concept discovery mechanism based on variational autoencoders (VAEs). This transformative concept discovery process enhances the identification of meaningful concepts from image activations. (ii) A visualization module that leverages the Bessel function to create a smooth transition between prototypical image pixels, revealing not only what the CNN saw but also what the CNN avoided, thereby mitigating the challenges of concept duplication as documented in previous works. (iii) A new metric, the Faith score, integrates both Coherence and Fidelity for a comprehensive evaluation of explainer faithfulness and consistency.

[CV-75] TC-GS: Tri-plane based compression for 3D Gaussian Splatting ICME2025

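[Quick Read]: This paper addresses the high memory cost of 3D Gaussian Splatting (3DGS) in practical use, caused by its large data volume and unorganized structure. The key solution is a structured tri-plane encoding that turns the unordered attributes of the Gaussians into a normalized distribution that is easier to compress. In addition, K-Nearest Neighbors (KNN) is used during decoding to capture correlations among adjacent Gaussians, Gaussian position information serves as a prior for the position-sensitive decoder, and an adaptive wavelet loss improves the reconstruction of high-frequency details. Experiments show results comparable to or better than state-of-the-art 3D Gaussian compression work across multiple datasets.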

Link: https://arxiv.org/abs/2503.20221
Authors: Taorui Wang, Zitong Yu, Yong Xu
Affiliations: Shenzhen Key Laboratory of Visual Object Detection and Recognition, Harbin Institute of Technology Shenzhen; School of Computing and Information Technology, Great Bay University; Dongguan Key Laboratory for Intelligence and Information Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ICME 2025

Abstract:Recently, 3D Gaussian Splatting (3DGS) has emerged as a prominent framework for novel view synthesis, providing high fidelity and rapid rendering speed. However, the substantial data volume of 3DGS and its attributes impedes its practical utility, requiring compression techniques to reduce memory cost. Nevertheless, the unorganized shape of 3DGS makes compression difficult. To formulate unstructured attributes into a normative distribution, we propose a well-structured tri-plane to encode Gaussian attributes, leveraging the distribution of attributes for compression. To exploit the correlations among adjacent Gaussians, K-Nearest Neighbors (KNN) is used when decoding the Gaussian distribution from the tri-plane. We also introduce Gaussian position information as a prior for the position-sensitive decoder. Additionally, we incorporate an adaptive wavelet loss that focuses on high-frequency details as iterations increase. In extensive experiments across multiple datasets, our approach achieves results that are comparable to or surpass those of state-of-the-art (SOTA) 3D Gaussian Splatting compression work. The code is released at this https URL.

[CV-76] DINeMo: Learning Neural Mesh Models with no 3D Annotations

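[Quick Read]: This paper addresses the dependence of category-level 3D/6D pose estimation on large amounts of 3D annotations and proposes DINeMo, a new neural mesh model trained without any 3D annotations. Earlier neural mesh models improved robustness to partial occlusion and domain shift but relied heavily on 3D annotations for part-contrastive learning, which limits the categories they cover and hinders efficient scaling. DINeMo's key innovation is a bidirectional pseudo-correspondence generation method that uses pseudo-correspondences obtained from large visual foundation models and combines local appearance features with global context information. Experiments on car datasets show that DINeMo substantially outperforms previous zero-shot and few-shot 3D pose estimation methods and narrows the gap with fully supervised methods by 67.3%. DINeMo also scales effectively when more unlabeled images are added during training, an advantage over supervised methods that rely on 3D annotations.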

Link: https://arxiv.org/abs/2503.20220
Authors: Weijie Guo, Guofeng Zhang, Wufei Ma, Alan Yuille
Affiliations: Johns Hopkins University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Technical report

Abstract:Category-level 3D/6D pose estimation is a crucial step towards comprehensive 3D scene understanding, which would enable a broad range of applications in robotics and embodied AI. Recent works explored neural mesh models that approach a range of 2D and 3D tasks from an analysis-by-synthesis perspective. Despite the largely enhanced robustness to partial occlusion and domain shifts, these methods depend heavily on 3D annotations for part-contrastive learning, which confines them to a narrow set of categories and hinders efficient scaling. In this work, we present DINeMo, a novel neural mesh model that is trained with no 3D annotations by leveraging pseudo-correspondence obtained from large visual foundation models. We adopt a bidirectional pseudo-correspondence generation method, which produces pseudo-correspondence by utilizing both local appearance features and global context information. Experimental results on car datasets demonstrate that our DINeMo outperforms previous zero- and few-shot 3D pose estimation by a wide margin, narrowing the gap with fully-supervised methods by 67.3%. Our DINeMo also scales effectively and efficiently when incorporating more unlabeled images during training, which demonstrates its advantages over supervised learning methods that rely on 3D annotations. Our project page is available at this https URL.

[CV-77] Video Motion Graphs

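[Quick Read]: This paper addresses realistic human motion video generation under multi-modal conditions. It proposes a system called Video Motion Graphs, whose core is the HMInterp model. HMInterp is a robust Video Frame Interpolation (VFI) model that uses a dual-branch approach, combining a motion diffusion model for human skeleton motion with a diffusion-based video frame interpolation model, to interpolate smoothly between discontinuous frames, even for complex motion such as dancing. HMInterp also uses a condition progressive training strategy to combine strong and weak conditions (such as images and poses), which ensures both high texture quality and accurate motion trajectories. Experiments show that the method outperforms existing generative and retrieval-based methods for multi-modal conditioned human motion video generation.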

Link: https://arxiv.org/abs/2503.20218
Authors: Haiyang Liu, Zhan Xu, Fa-Ting Hong, Hsin-Ping Huang, Yi Zhou, Yang Zhou
Affiliations: The University of Tokyo; Adobe Research
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 14 pages, 10 figures

Abstract:We present Video Motion Graphs, a system designed to generate realistic human motion videos. Using a reference video and conditional signals such as music or motion tags, the system synthesizes new videos by first retrieving video clips with gestures matching the conditions and then generating interpolation frames to seamlessly connect clip boundaries. The core of our approach is HMInterp, a robust Video Frame Interpolation (VFI) model that enables seamless interpolation of discontinuous frames, even for complex motion scenarios like dancing. HMInterp (i) employs a dual-branch interpolation approach, combining a Motion Diffusion Model for human skeleton motion interpolation with a diffusion-based video frame interpolation model for final frame generation, and (ii) adopts condition progressive training to effectively leverage strong and weak identity conditions, such as images and pose. These designs ensure both high video texture quality and accurate motion trajectory. Results show that our Video Motion Graphs outperforms existing generative- and retrieval-based methods for multi-modal conditioned human motion video generation. The project page can be found at this https URL.

[CV-78] Synthetic-to-Real Self-supervised Robust Depth Estimation via Learning with Motion and Structure Priors

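[Quick Read]: This paper addresses self-supervised depth estimation from monocular cameras under diverse outdoor conditions (such as daytime, rain, and nighttime), which is hard because universal representations are difficult to learn and labeled real-world adverse-condition data is severely lacking. Existing methods either rely on synthetic inputs and pseudo-depth labels or apply daytime strategies directly to adverse conditions, giving suboptimal results. The paper proposes what it describes as the first synthetic-to-real robust depth estimation framework, whose key idea is to incorporate motion and structure priors to capture real-world knowledge. Specifically, the synthetic adaptation stage uses a frozen daytime model to train a depth estimator under synthetic adverse conditions, and the real adaptation stage uses a consistency-reweighting strategy to identify weather-insensitive regions and introduces a new regularization that constrains the model on real-world data. Experiments show that the method outperforms prior work in multi-frame and single-frame evaluations on the nuScenes and Robotcar datasets and generalizes better in zero-shot evaluation on DrivingStereo.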

Link: https://arxiv.org/abs/2503.20211
Authors: Weilong Yan, Ming Li, Haipeng Li, Shuwei Shao, Robby T. Tan
Affiliations: National University of Singapore; ASUS Intelligent Cloud Services; Guangdong Laboratory of Artificial Intelligence and Digital Economy (SZ); University of Electronic Science and Technology of China; School of Control Science and Engineering, Shandong University
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments:

Abstract:Self-supervised depth estimation from monocular cameras in diverse outdoor conditions, such as daytime, rain, and nighttime, is challenging due to the difficulty of learning universal representations and the severe lack of labeled real-world adverse data. Previous methods either rely on synthetic inputs and pseudo-depth labels or directly apply daytime strategies to adverse conditions, leading to suboptimal results. In this paper, we present the first synthetic-to-real robust depth estimation framework, incorporating motion and structure priors to capture real-world knowledge effectively. In the synthetic adaptation, we transfer motion-structure knowledge inside cost volumes for a more robust representation, using a frozen daytime model to train a depth estimator in synthetic adverse conditions. In the innovative real adaptation, which aims to close synthetic-to-real gaps, models trained earlier identify the weather-insensitive regions with a designed consistency-reweighting strategy to emphasize valid pseudo-labels. We introduce a new regularization by gathering explicit depth distributions to constrain the model when facing real-world data. Experiments show that our method outperforms the state-of-the-art across diverse conditions in multi-frame and single-frame evaluations. We achieve improvements of 7.5% and 4.3% in AbsRel and RMSE on average for nuScenes and Robotcar datasets (daytime, nighttime, rain). In zero-shot evaluation on DrivingStereo (rain, fog), our method generalizes better than previous ones.
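
For readers unfamiliar with the two metrics quoted in this abstract, the sketch below computes AbsRel (mean absolute relative error) and RMSE (root mean squared error) using their standard definitions. Skipping pixels without valid ground truth is a common convention assumed here; the paper's exact evaluation protocol (depth caps, median scaling, and so on) is not stated in the abstract.

```python
import math


def _valid_pairs(pred, gt):
    """Keep only pixels whose ground-truth depth is positive (i.e. valid)."""
    return [(p, g) for p, g in zip(pred, gt) if g > 0.0]


def abs_rel(pred, gt):
    """Mean of |pred - gt| / gt over valid pixels (lower is better)."""
    pairs = _valid_pairs(pred, gt)
    return sum(abs(p - g) / g for p, g in pairs) / len(pairs)


def rmse(pred, gt):
    """Square root of the mean squared depth error over valid pixels."""
    pairs = _valid_pairs(pred, gt)
    return math.sqrt(sum((p - g) ** 2 for p, g in pairs) / len(pairs))


# Flattened toy depth maps in metres; the last pixel has no ground truth.
predicted = [2.0, 4.0, 7.0]
ground_truth = [1.0, 4.0, 0.0]
```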

[CV-79] BEAR: A Video Dataset For Fine-grained Behaviors Recognition Oriented with Action and Environment Factors ICME2025

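[Quick Read]: This paper addresses unfair and incomplete evaluation in fine-grained behavior recognition. Existing datasets control only part of the information to be similar and do not fully examine how the two core factors that define a behavior, environment and action, affect recognition. The paper introduces BEAR, a new fine-grained video behavior dataset focused on Environment and Action, with two protocols, Fine-grained Behavior with Similar Environments and Fine-grained Behavior with Similar Actions, plus multiple sub-protocols covering different scenarios. Using this dataset, the paper systematically studies how input modality affects environment- and action-based behavior recognition and reports findings that can inform future research.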

Link: https://arxiv.org/abs/2503.20209
Authors: Chengyang Hu, Yuduo Chen, Lizhuang Ma
Affiliations: Shanghai Jiao Tong University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ICME 2025

Abstract:Behavior recognition is an important task in video representation learning. An essential aspect is learning effective features for behavior recognition. Recently, researchers have started to study fine-grained behavior recognition, which presents similar behaviors and encourages the model to attend to finer details and learn features that distinguish them. However, previous fine-grained behavior datasets limited themselves to controlling only partial information to be similar, leading to an unfair and incomplete evaluation of existing works. In this work, we develop a new fine-grained video behavior dataset, named BEAR, which provides fine-grained (i.e., similar) behaviors that uniquely focus on two primary factors defining behavior: Environment and Action. It includes two fine-grained behavior protocols, Fine-grained Behavior with Similar Environments and Fine-grained Behavior with Similar Actions, as well as multiple sub-protocols for different scenarios. Furthermore, with this new dataset, we conduct multiple experiments with different behavior recognition models. Our research primarily explores the impact of input modality, a critical element in studying the environmental and action-based aspects of behavior recognition. Our experimental results yield insights with substantial implications for further research.

[CV-80] Reasoning and Learning a Perceptual Metric for Self-Training of Reflective Objects in Bin-Picking with a Low-cost Camera

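[Quick Read]: This paper addresses errors in bin-picking of metal objects with low-cost RGB-D cameras, caused by sparse depth information and reflective surface textures, and aims to reduce the need for manual labeling. It proposes a two-stage framework with a metric learning stage and a self-training stage. Key components include a Multi-object Pose Reasoning (MoPR) algorithm that optimizes pose hypotheses under depth, collision, and boundary constraints; a Symmetry-aware Lie-group based Bayesian Gaussian Mixture Model (SaL-BGMM), combined with the Expectation-Maximization (EM) algorithm, for symmetry-aware filtering; and a Weighted Ranking Information Noise Contrastive Estimation (WR-InfoNCE) loss that lets the low-cost camera learn a perceptual metric from reconstructed data, supporting self-training on untrained or even unseen objects. Experiments show that the method outperforms several state-of-the-art methods on the ROBI dataset and the newly introduced Self-ROBI dataset.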

Link: https://arxiv.org/abs/2503.20207
Authors: Peiyuan Ni, Chee Meng Chew, Marcelo H. Ang Jr., Gregory S. Chirikjian
Affiliations: Department of Mechanical Engineering, National University of Singapore; Department of Mechanical Engineering, University of Delaware
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: 9 pages, 10 figures

Abstract:Bin-picking of metal objects using low-cost RGB-D cameras often suffers from sparse depth information and reflective surface textures, leading to errors and the need for manual labeling. To reduce human intervention, we propose a two-stage framework consisting of a metric learning stage and a self-training stage. Specifically, to automatically process data captured by a low-cost camera (LC), we introduce a Multi-object Pose Reasoning (MoPR) algorithm that optimizes pose hypotheses under depth, collision, and boundary constraints. To further refine pose candidates, we adopt a Symmetry-aware Lie-group based Bayesian Gaussian Mixture Model (SaL-BGMM), integrated with the Expectation-Maximization (EM) algorithm, for symmetry-aware filtering. Additionally, we propose a Weighted Ranking Information Noise Contrastive Estimation (WR-InfoNCE) loss to enable the LC to learn a perceptual metric from reconstructed data, supporting self-training on untrained or even unseen objects. Experimental results show that our approach outperforms several state-of-the-art methods on both the ROBI dataset and our newly introduced Self-ROBI dataset.
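
As background for the WR-InfoNCE loss mentioned in this abstract, here is a sketch of the plain InfoNCE loss it builds on: the negative log-probability of the positive pair under a softmax over all candidate similarity scores. The weighting and ranking terms that the paper adds are not described in the abstract and are deliberately omitted; the temperature value is an arbitrary assumption.

```python
import math


def info_nce(pos_score, neg_scores, temperature=0.1):
    """Plain InfoNCE: -log softmax probability of the positive among all scores."""
    logits = [pos_score / temperature] + [s / temperature for s in neg_scores]
    peak = max(logits)  # subtract the maximum for numerical stability
    log_sum_exp = peak + math.log(sum(math.exp(l - peak) for l in logits))
    return log_sum_exp - logits[0]


# A well-separated positive gives a small loss; a poorly ranked one a large loss.
easy = info_nce(0.9, [0.1, 0.1])
hard = info_nce(0.1, [0.9, 0.9])
```

When every candidate has the same score, the loss equals the logarithm of the number of candidates, which is a convenient sanity check.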

[CV-81] Assessing SAM for Tree Crown Instance Segmentation from Drone Imagery ICLR2025

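[Quick Read]: This paper addresses automatic tree crown instance segmentation from high-resolution drone imagery in tree planting projects, to overcome the cost, time, and labor of traditional manual monitoring. It explores the potential of the Segment Anything Model (SAM) when labeled data is limited and compares SAM-based methods with a custom Mask R-CNN. The main findings are that SAM used out-of-the-box does not outperform the custom Mask R-CNN, even with well-designed prompts; that methods which tune SAM further show potential; and that adding Digital Surface Model (DSM) information as an input improves predictions.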

Link: https://arxiv.org/abs/2503.20199
Authors: Mélisande Teng, Arthur Ouaknine, Etienne Laliberté, Yoshua Bengio, David Rolnick, Hugo Larochelle
Affiliations: Mila; Université de Montréal; McGill University; Rubisco AI
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: ICLR 2025 ML4RS workshop

Abstract:The potential of tree planting as a natural climate solution is often undermined by inadequate monitoring of tree planting projects. Current monitoring methods involve measuring trees by hand for each species, requiring extensive cost, time, and labour. Advances in drone remote sensing and computer vision offer great potential for mapping and characterizing trees from aerial imagery, and large pre-trained vision models, such as the Segment Anything Model (SAM), may be a particularly compelling choice given limited labeled data. In this work, we compare SAM methods for the task of automatic tree crown instance segmentation in high resolution drone imagery of young tree plantations. We explore the potential of SAM for this task, and find that methods using SAM out-of-the-box do not outperform a custom Mask R-CNN, even with well-designed prompts, but that there is potential for methods which tune SAM further. We also show that predictions can be improved by adding Digital Surface Model (DSM) information as an input.

[CV-82] Beyond Words: Advancing Long-Text Image Generation via Multimodal Autoregressive Models

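[Quick Read]: This paper addresses the weakness of existing text-to-image systems on long text such as paragraphs. Current generative models still struggle to render coherent long-form text in images, and most systems handle only short phrases or single sentences. The paper identifies the image tokenizer as the key bottleneck for text generation quality and proposes a new binary tokenizer optimized for scene text features. Building on this tokenizer, it develops a multimodal autoregressive model that generates high-quality long-text images with high fidelity and offers strong controllability over text properties such as font style, size, color, and alignment. Experiments show that the model is markedly more accurate, consistent, and flexible in long text generation than SD3.5 Large and GPT4o with DALL-E 3.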

Link: https://arxiv.org/abs/2503.20198
Authors: Alex Jinpeng Wang, Linjie Li, Zhengyuan Yang, Lijuan Wang, Min Li
Affiliations: Central South University; Microsoft
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 16 pages

Abstract:Recent advancements in autoregressive and diffusion models have led to strong performance in image generation with short scene text words. However, generating coherent, long-form text in images, such as paragraphs in slides or documents, remains a major challenge for current generative models. We present the first work specifically focused on long text image generation, addressing a critical gap in existing text-to-image systems that typically handle only brief phrases or single sentences. Through comprehensive analysis of state-of-the-art autoregressive generation models, we identify the image tokenizer as a critical bottleneck for text generation quality. To address this, we introduce a novel text-focused, binary tokenizer optimized for capturing detailed scene text features. Leveraging our tokenizer, we develop a multimodal autoregressive model that excels in generating high-quality long-text images with unprecedented fidelity. Our model offers robust controllability, enabling customization of text properties such as font style, size, color, and alignment. Extensive experiments demonstrate that our model significantly outperforms SD3.5 Large and GPT4o with DALL-E 3 in generating long text accurately, consistently, and flexibly. Beyond its technical achievements, our model opens up exciting opportunities for innovative applications like interleaved document and PowerPoint generation, establishing a new frontier in long-text image generation.

[CV-83] Cross-Modal Prototype Allocation: Unsupervised Slide Representation Learning via Patch-Text Contrast in Computational Pathology

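[Quick Read]: This paper addresses the limited generalizability of mainstream weakly supervised slide representation learning methods, which are mainly based on Multiple Instance Learning (MIL) and tailored to specific downstream tasks. Existing unsupervised methods, meanwhile, focus only on the visual modality of image patches and ignore the rich semantic information in text. The paper proposes the ProAlign framework, which uses a Large Language Model (LLM) to generate descriptive text and builds initial prototype embeddings through patch-text contrast, and which adopts a parameter-free attention aggregation strategy that uses the similarity between patches and these prototypes to form unsupervised slide embeddings applicable to a wide range of downstream tasks.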

Link: https://arxiv.org/abs/2503.20190
Authors: Yuxuan Chen, Jiawen Li, Jiali Hu, Xitong Ling, Tian Guan, Anjia Han, Yonghong He
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 11 pages, 3 figures

Abstract:With the rapid advancement of pathology foundation models (FMs), the representation learning of whole slide images (WSIs) attracts increasing attention. Existing studies develop high-quality patch feature extractors and employ carefully designed aggregation schemes to derive slide-level representations. However, mainstream weakly supervised slide representation learning methods, primarily based on multiple instance learning (MIL), are tailored to specific downstream tasks, which limits their generalizability. To address this issue, some studies explore unsupervised slide representation learning. However, these approaches focus solely on the visual modality of patches, neglecting the rich semantic information embedded in textual data. In this work, we propose ProAlign, a cross-modal unsupervised slide representation learning framework. Specifically, we leverage a large language model (LLM) to generate descriptive text for the prototype types present in a WSI, introducing patch-text contrast to construct initial prototype embeddings. Furthermore, we propose a parameter-free attention aggregation strategy that utilizes the similarity between patches and these prototypes to form unsupervised slide embeddings applicable to a wide range of downstream tasks. Extensive experiments on four public datasets show that ProAlign outperforms existing unsupervised frameworks and achieves performance comparable to some weakly supervised models.
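
One way to picture a parameter-free attention aggregation of the kind this abstract mentions is similarity-weighted pooling: each prototype attends over the patches through a softmax of cosine similarities, and the pooled vectors are concatenated. This is a guess at the general shape only; the function names, the use of cosine similarity, the temperature, and the concatenation step are all assumptions rather than ProAlign's actual design.

```python
import math


def cosine(a, b):
    """Cosine similarity, returning 0.0 if either vector has zero norm."""
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return sum(x * y for x, y in zip(a, b)) / norm if norm > 0.0 else 0.0


def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate_slide(patches, prototypes, temperature=1.0):
    """Pool patch features once per prototype, with no learnable weights,
    and concatenate the pooled vectors into one slide embedding."""
    slide = []
    for proto in prototypes:
        weights = softmax([cosine(p, proto) / temperature for p in patches])
        pooled = [sum(w * p[d] for w, p in zip(weights, patches))
                  for d in range(len(proto))]
        slide.extend(pooled)
    return slide


# Two toy patch features and two toy prototype embeddings.
toy_patches = [[1.0, 0.0], [0.0, 1.0]]
toy_prototypes = [[1.0, 0.0], [0.0, 1.0]]
embedding = aggregate_slide(toy_patches, toy_prototypes)
```

Because nothing is trained, the embedding depends only on the patch features and the prototype embeddings, which is what makes such a scheme usable without slide-level labels.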

[CV-84] Rethinking Vision-Language Model in Face Forensics: Multi-Modal Interpretable Forged Face Detector

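[Quick Read]: This paper addresses the limited generalization and interpretability of deepfake detection. Prior methods provide either binary classification results or textual explanations separately; this paper proposes a multi-modal face forgery detector (M2F2-Det) that combines the multi-modal learning capability of pretrained CLIP with the interpretability of large language models (LLMs) to improve both. Specifically, M2F2-Det uses tailored face forgery prompt learning to improve generalization to unseen forgeries and uses an LLM to generate detailed textual explanations of its detection decisions, bridging the gap between natural language and the subtle cues of facial forgeries.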

Link: https://arxiv.org/abs/2503.20188
Authors: Xiao Guo, Xiufeng Song, Yue Zhang, Xiaohong Liu, Xiaoming Liu
Affiliations: Michigan State University; Shanghai Jiao Tong University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 8 figures; 6 tables

Abstract:Deepfake detection is a long-established research topic vital for mitigating the spread of malicious misinformation. Unlike prior methods that provide either binary classification results or textual explanations separately, we introduce a novel method capable of generating both simultaneously. Our method harnesses the multi-modal learning capability of the pre-trained CLIP and the unprecedented interpretability of large language models (LLMs) to enhance both the generalization and explainability of deepfake detection. Specifically, we introduce a multi-modal face forgery detector (M2F2-Det) that employs tailored face forgery prompt learning, incorporating the pre-trained CLIP to improve generalization to unseen forgeries. Also, M2F2-Det incorporates an LLM to provide detailed textual explanations of its detection decisions, enhancing interpretability by bridging the gap between natural language and subtle cues of facial forgeries. Empirically, we evaluate M2F2-Det on both detection and explanation generation tasks, where it achieves state-of-the-art performance, demonstrating its effectiveness in identifying and explaining diverse forgeries.

[CV-85] Network Inversion for Generating Confidently Classified Counterfeits

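[Quick Read]: This paper addresses the problem of generating inputs that a model, especially a vision classifier, classifies with high confidence even though they differ significantly from the training data distribution. Traditional methods usually modify existing samples and do not always ensure confident classification. The key solution changes the generator's conditioning mechanism in network inversion from soft vector conditioning to one-hot vector conditioning and applies a Kullback-Leibler divergence (KLD) term between the one-hot vectors and the classifier's output distribution, encouraging generated samples that are both plausible and confidently classified. Such samples matter for the safety and reliability of machine learning systems, particularly in safety-critical applications, because they expose a model's limitations and decision-making process.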

Link: https://arxiv.org/abs/2503.20187
Authors: Pirzada Suhail, Amit Sethi
Affiliations: IIT Bombay
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:In machine learning, especially with vision classifiers, generating inputs that are confidently classified by the model is essential for understanding its decision boundaries and behavior. However, creating such samples that are confidently classified yet distinct from the training data distribution is a challenge. Traditional methods often modify existing inputs, but they do not always ensure confident classification. In this work, we extend network inversion techniques to generate Confidently Classified Counterfeits: synthetic samples that are confidently classified by the model despite being significantly different from the training data. We achieve this by modifying the generator's conditioning mechanism from soft vector conditioning to one-hot vector conditioning and applying Kullback-Leibler divergence (KLD) between the one-hot vectors and the classifier's output distribution. This encourages the generator to produce samples that are both plausible and confidently classified. Generating Confidently Classified Counterfeits is crucial for ensuring the safety and reliability of machine learning systems, particularly in safety-critical applications where models must exhibit confidence only on data within the training distribution. By generating such counterfeits, we challenge the assumption that high-confidence predictions are always indicative of in-distribution data, providing deeper insights into the model's limitations and decision-making process.
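
The KLD term in this abstract has a simple closed form when one distribution is one-hot, which the sketch below makes concrete. It assumes the divergence is taken from the one-hot target to the classifier's softmax output; the abstract does not state the direction, and the function names and logits are hypothetical.

```python
import math


def softmax(logits):
    peak = max(logits)
    exps = [math.exp(l - peak) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


def one_hot(index, num_classes):
    return [1.0 if i == index else 0.0 for i in range(num_classes)]


def kl_divergence(q, p, eps=1e-12):
    """KL(q || p); terms with q_i == 0 contribute nothing by convention."""
    return sum(qi * math.log(qi / max(pi, eps))
               for qi, pi in zip(q, p) if qi > 0.0)


# Hypothetical classifier logits for one generated sample, target class 0.
probs = softmax([2.0, 0.0, 0.0])
loss = kl_divergence(one_hot(0, 3), probs)
```

Under this assumed direction the divergence reduces to the negative log-probability of the target class, so minimizing it pushes the classifier's confidence on that class toward 1.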

[CV-86] Spectrum from Defocus: Fast Spectral Imaging with Chromatic Focal Stack

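[Quick Read]: This paper addresses the trade-offs that hyperspectral cameras face between spatial, spectral, and temporal resolution, particularly in low-photon conditions. Computational imaging systems break through these trade-offs with compressive sensing but typically require complex optics or extensive compute. The proposed Spectrum from Defocus (SfD) is a chromatic focal sweep method that recovers state-of-the-art hyperspectral images with a small system of off-the-shelf optics and 1 second of compute. Its key component is a physics-based iterative algorithm that efficiently demixes, deconvolves, and denoises a blurry grayscale focal stack into a sharp spectral image. The combination of photon efficiency, optical simplicity, and physical modeling makes SfD a promising approach to fast, compact, interpretable hyperspectral imaging.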

Link: https://arxiv.org/abs/2503.20184
Authors: M. Kerem Aydin, Yi-Chun Hung, Jaclyn Pytlarz, Qi Guo, Emma Alexander
Affiliations: Department of Computer Science, McCormick School of Engineering, Northwestern University; Dolby Laboratories, Inc.; Elmore Family School of Electrical and Computer Engineering, Purdue University; Center for Robotics and Biosystems, McCormick School of Engineering, Northwestern University
Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments:

Abstract:Hyperspectral cameras face harsh trade-offs between spatial, spectral, and temporal resolution in an inherently low-photon regime. Computational imaging systems break through these trade-offs with compressive sensing, but require complex optics and/or extensive compute. We present Spectrum from Defocus (SfD), a chromatic focal sweep method that recovers state-of-the-art hyperspectral images with a small system of off-the-shelf optics and 1 second of compute. Our camera uses two lenses and a grayscale sensor to preserve nearly all incident light in a chromatically-aberrated focal stack. Our physics-based iterative algorithm efficiently demixes, deconvolves, and denoises the blurry grayscale focal stack into a sharp spectral image. The combination of photon efficiency, optical simplicity, and physical modeling makes SfD a promising solution for fast, compact, interpretable hyperspectral imaging.

[CV-87] Devil is in the Uniformity: Exploring Diverse Learners within Transformer for Image Restoration

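[Quick Read]: This paper addresses redundancy in Multi-Head Attention (MHA) for image restoration. In MHA, heads compute attention independently on uniformly split subspaces, and this uniformity can cause redundancy that limits restoration quality. The paper proposes HINT, a Transformer driven by Hierarchical Multi-Head Attention (HMHA). Its key components are: (1) HMHA, which extracts diverse contextual features by letting heads learn from subspaces of varying sizes that contain different information; and (2) the Query-Key Cache Updating (QKCU) module, which combines intra-layer and inter-layer schemes to further reduce redundancy by enhancing interactions between heads. Experiments on 12 benchmarks across 5 image restoration tasks (low-light enhancement, dehazing, desnowing, denoising, and deraining) demonstrate the superiority of HINT.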

Link: https://arxiv.org/abs/2503.20174
Authors: Shihao Zhou, Dayu Li, Jinshan Pan, Juncheng Zhou, Jinglei Shi, Jufeng Yang
Affiliations: VCIP & TMCC & DISSec, College of Computer Science, Nankai University; Nankai International Advanced Research Institute (SHENZHEN· FUTIAN); School of Computer Science and Engineering, Nanjing University of Science and Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 11 pages, 10 figures

Abstract:Transformer-based approaches have gained significant attention in image restoration, where the core component, i.e., Multi-Head Attention (MHA), plays a crucial role in capturing diverse features and recovering high-quality results. In MHA, heads perform attention calculation independently on uniformly split subspaces, which causes redundancy that hinders the model from achieving satisfactory outputs. In this paper, we propose to improve MHA by exploring diverse learners and introducing various interactions between heads, which results in a Hierarchical multI-head atteNtion driven Transformer model, termed HINT, for image restoration. HINT contains two modules, i.e., the Hierarchical Multi-Head Attention (HMHA) and the Query-Key Cache Updating (QKCU) module, to address the redundancy problem that is rooted in vanilla MHA. Specifically, HMHA extracts diverse contextual features by employing heads to learn from subspaces of varying sizes and containing different information. Moreover, QKCU, comprising intra- and inter-layer schemes, further reduces the redundancy problem by facilitating enhanced interactions between attention heads within and across layers. Extensive experiments are conducted on 12 benchmarks across 5 image restoration tasks, including low-light enhancement, dehazing, desnowing, denoising, and deraining, to demonstrate the superiority of HINT. The source code is available in the supplementary materials.
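
The contrast between uniformly split subspaces and subspaces of varying sizes can be sketched with a toy multi-head attention in which only the channel split changes. This is a simplification under stated assumptions: the tokens are used directly as queries, keys, and values (no learned projections), and the head sizes are hypothetical rather than the ones HMHA uses.

```python
import math


def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(q, k, v):
    """Scaled dot-product attention for one head; q, k, v are lists of token vectors."""
    dim = len(q[0])
    out = []
    for qi in q:
        weights = softmax([sum(a * b for a, b in zip(qi, kj)) / math.sqrt(dim)
                           for kj in k])
        out.append([sum(w * vj[c] for w, vj in zip(weights, v))
                    for c in range(len(v[0]))])
    return out


def split_channels(tokens, sizes):
    """Split every token vector into consecutive slices with the given sizes."""
    assert sum(sizes) == len(tokens[0])
    heads, start = [], 0
    for size in sizes:
        heads.append([t[start:start + size] for t in tokens])
        start += size
    return heads


def multi_head(tokens, sizes):
    """Run one attention head per channel slice and concatenate the results."""
    head_outputs = [attention(h, h, h) for h in split_channels(tokens, sizes)]
    return [sum((head[i] for head in head_outputs), [])
            for i in range(len(tokens))]


toy_tokens = [[0.1 * (i + j) for j in range(8)] for i in range(3)]
uniform = multi_head(toy_tokens, [2, 2, 2, 2])       # vanilla MHA-style split
hierarchical = multi_head(toy_tokens, [1, 1, 2, 4])  # heads of varying size
```

Both variants return outputs of the same shape, so an uneven split can in principle replace a uniform one without changing the surrounding layers.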

[CV-88] Guiding Human-Object Interactions with Rich Geometry and Relations CVPR

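【Quick Read】: This paper addresses the loss of geometric complexity in existing human-object interaction (HOI) synthesis methods, which rely on simplified object representations such as the object's centroid or the nearest point to the human and can therefore compromise the realism and fidelity of interactions. The key solution is ROG, a new framework that uses a diffusion model to capture the spatiotemporal relationships in HOIs together with rich geometric detail. ROG represents objects efficiently by selecting boundary-focused, fine-detail key points from the object mesh, then builds an interactive distance field (IDF) to capture robust HOI dynamics. In addition, a diffusion-based relation model that integrates spatial and temporal attention is developed to better understand intricate HOI relationships; it refines the IDF of the generated motion and so guides generation toward relation-aware, semantically aligned movements. Experiments show that ROG significantly outperforms existing methods in the realism and semantic accuracy of synthesized HOIs.

Link: https://arxiv.org/abs/2503.20172
Authors: Mengqing Xue,Yifei Liu,Ling Guo,Shaoli Huang,Changxing Ding
Affiliations: South China University of Technology; Pazhou Lab, Guangzhou; Tencent AI Lab
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: CVPR this http URL website: this https URL
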
Abstract:Human-object interaction (HOI) synthesis is crucial for creating immersive and realistic experiences for applications such as virtual reality. Existing methods often rely on simplified object representations, such as the object’s centroid or the nearest point to a human, to achieve physically plausible motions. However, these approaches may overlook geometric complexity, resulting in suboptimal interaction fidelity. To address this limitation, we introduce ROG, a novel diffusion-based framework that models the spatiotemporal relationships inherent in HOIs with rich geometric detail. For efficient object representation, we select boundary-focused and fine-detail key points from the object mesh, ensuring a comprehensive depiction of the object’s geometry. This representation is used to construct an interactive distance field (IDF), capturing the robust HOI dynamics. Furthermore, we develop a diffusion-based relation model that integrates spatial and temporal attention mechanisms, enabling a better understanding of intricate HOI relationships. This relation model refines the generated motion’s IDF, guiding the motion generation process to produce relation-aware and semantically aligned movements. Experimental evaluations demonstrate that ROG significantly outperforms state-of-the-art methods in the realism and semantic accuracy of synthesized HOIs.

[CV-89] EVolSplat: Efficient Volume-based Gaussian Splatting for Urban View Synthesis CVPR2025

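【Quick Read】: This paper addresses novel view synthesis of urban scenes, which is essential for applications such as autonomous driving. Conventional NeRF- and 3D Gaussian Splatting (3DGS)-based methods achieve photorealistic rendering but require slow per-scene optimization. To address this, the paper proposes EVolSplat, an efficient 3D Gaussian Splatting model that works in a feed-forward manner. The key idea is to predict 3D Gaussians across multiple frames within a unified volume using a 3D convolutional network, rather than relying on pixel-aligned prediction. Specifically, the model initializes 3D Gaussians from noisy depth predictions, refines their geometric properties in 3D space, and predicts color from 2D textures. EVolSplat also introduces a flexible hemisphere background model to handle distant views and the sky, enabling fast feed-forward reconstruction with real-time rendering. Experiments on the KITTI-360 and Waymo datasets show state-of-the-art quality.

Link: https://arxiv.org/abs/2503.20168
Authors: Sheng Miao,Jiaxin Huang,Dongfeng Bai,Xu Yan,Hongyu Zhou,Yue Wang,Bingbing Liu,Andreas Geiger,Yiyi Liao
Affiliations: Zhejiang University; Huawei Noah’s Ark Lab; University of Tübingen; Tübingen AI Center
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: CVPR2025
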
Abstract:Novel view synthesis of urban scenes is essential for autonomous driving-related applications. NeRF and 3DGS-based methods show promising results in achieving photorealistic renderings but require slow, per-scene optimization. We introduce EVolSplat, an efficient 3D Gaussian Splatting model for urban scenes that works in a feed-forward manner. Unlike existing feed-forward, pixel-aligned 3DGS methods, which often suffer from issues like multi-view inconsistencies and duplicated content, our approach predicts 3D Gaussians across multiple frames within a unified volume using a 3D convolutional network. This is achieved by initializing 3D Gaussians with noisy depth predictions, and then refining their geometric properties in 3D space and predicting color based on 2D textures. Our model also handles distant views and the sky with a flexible hemisphere background model. This enables us to perform fast, feed-forward reconstruction while achieving real-time rendering. Experimental evaluations on the KITTI-360 and Waymo datasets show that our method achieves state-of-the-art quality compared to existing feed-forward 3DGS- and NeRF-based methods.

[CV-90] Zero-Shot Human-Object Interaction Synthesis with Multimodal Priors

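【Quick Read】: This paper addresses zero-shot human-object interaction (HOI) synthesis given limited 3D HOI datasets: existing methods are restricted by the narrow diversity of object types and interaction patterns in their training data. The key idea is to leverage the extensive HOI knowledge in pre-trained multimodal models instead of training end-to-end on limited 3D HOI datasets. Given a text description, image or video generation models produce temporally consistent 2D HOI image sequences, which are then lifted to 3D HOI pose milestones. Pre-trained human pose estimation models extract the human poses, and a generalizable category-level 6-DoF estimation method recovers object poses from the 2D HOI images; this method adapts to object templates obtained from text-to-3D models or online retrieval. Physics-based tracking of the 3D HOI kinematic milestones further refines body motion and object poses, yielding open-vocabulary HOI results with greater physical realism and semantic diversity.

Link: https://arxiv.org/abs/2503.20118
Authors: Yuke Lou,Yiming Wang,Zhen Wu,Rui Zhao,Wenjia Wang,Mingyi Shi,Taku Komura
Affiliations: The University of Hong Kong; ETH Zurich; Stanford University; Tencent; Meta
Categories: Graphics (cs.GR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments:
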
Abstract:Human-object interaction (HOI) synthesis is important for various applications, ranging from virtual reality to robotics. However, acquiring 3D HOI data is challenging due to its complexity and high cost, limiting existing methods to the narrow diversity of object types and interaction patterns in training datasets. This paper proposes a novel zero-shot HOI synthesis framework without relying on end-to-end training on currently limited 3D HOI datasets. The core idea of our method lies in leveraging extensive HOI knowledge from pre-trained Multimodal Models. Given a text description, our system first obtains temporally consistent 2D HOI image sequences using image or video generation models, which are then uplifted to 3D HOI milestones of human and object poses. We employ pre-trained human pose estimation models to extract human poses and introduce a generalizable category-level 6-DoF estimation method to obtain the object poses from 2D HOI images. Our estimation method is adaptive to various object templates obtained from text-to-3D models or online retrieval. A physics-based tracking of the 3D HOI kinematic milestone is further applied to refine both body motions and object poses, yielding more physically plausible HOI generation results. The experimental results demonstrate that our method is capable of generating open-vocabulary HOIs with physical realism and semantic diversity.

[CV-91] Peepers & Pixels: Human Recognition Accuracy on Low Resolution Faces

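【Quick Read】: This paper examines the accuracy of human review in automated one-to-many (1:N) face recognition on low-resolution images, with resolution measured by inter-pupillary distance (IPD). It explores the limits of human recognition across a range of IPD values and finds that at low IPDs (10 px and 5 px), human accuracy falls to or below chance level (50.7% and 35.9%, respectively) even though examiners' confidence remains high. This suggests that for low-IPD images, human recognition ability may be a limiting factor on overall system accuracy.

Link: https://arxiv.org/abs/2503.20108
Authors: Xavier Merino,Gabriella Pangelinan,Samuel Langborgh,Michael C. King,Kevin W. Bowyer
Affiliations: Florida Institute of Technology; University of Notre Dame
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages, 3 figures
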
Abstract:Automated one-to-many (1:N) face recognition is a powerful investigative tool commonly used by law enforcement agencies. In this context, potential matches resulting from automated 1:N recognition are reviewed by human examiners prior to possible use as investigative leads. While automated 1:N recognition can achieve near-perfect accuracy under ideal imaging conditions, operational scenarios may necessitate the use of surveillance imagery, which is often degraded in various quality dimensions. One important quality dimension is image resolution, typically quantified by the number of pixels on the face. The common metric for this is inter-pupillary distance (IPD), which measures the number of pixels between the pupils. Low IPD is known to degrade the accuracy of automated face recognition. However, the threshold IPD for reliability in human face recognition remains undefined. This study aims to explore the boundaries of human recognition accuracy by systematically testing accuracy across a range of IPD values. We find that at low IPDs (10px, 5px), human accuracy is at or below chance levels (50.7%, 35.9%), even as confidence in decision-making remains relatively high (77%, 70.7%). Our findings indicate that, for low IPD images, human recognition ability could be a limiting factor to overall system accuracy.
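
As a small worked example of the IPD metric described above, the sketch below measures the pixel distance between two pupil centers and shows how it shrinks when an image is downscaled. The function names and the pupil coordinates are hypothetical; the study's own measurement pipeline is not described in the abstract.

```python
import math


def inter_pupillary_distance(left_pupil, right_pupil):
    """Euclidean distance in pixels between two (x, y) pupil centers."""
    return math.dist(left_pupil, right_pupil)


def ipd_after_resize(ipd, scale):
    """IPD after uniformly resizing the image by `scale` (0.5 halves each side)."""
    return ipd * scale


# Hypothetical pupil locations in a face crop.
ipd = inter_pupillary_distance((100, 120), (140, 120))
print(ipd, ipd_after_resize(ipd, 0.25))
```

Because IPD scales linearly with image size, downsampling a 40 px IPD face by a factor of four lands at the 10 px level the study reports on.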

[CV-92] EBS-EKF: Accurate and High Frequency Event-based Star Tracking CVPR

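【Quick Read】: This paper addresses the accuracy and real-time performance of event-based star trackers in practical use. Prior work has been evaluated only in simulation with simplified signal models, leaving real-world performance unverified. The key contribution is an event-based star tracking algorithm grounded in an analysis of the event-based sensor (EBS) circuit and combined with an extended Kalman filter (EKF) for state estimation. A quantitative comparison on real night-sky data against a conventional space-ready active-pixel sensor (APS) star tracker shows that the method improves accuracy by an order of magnitude while providing a higher update rate and greater motion tolerance. The paper also releases all code and the first dataset of events synchronized with APS solutions.

Link: https://arxiv.org/abs/2503.20101
Authors: Albert W Reed,Connor Hashemi,Dennis Melamed,Nitesh Menon,Keigo Hirakawa,Scott McCloskey
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted into the proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR) for 2025. Link to code and dataset is this https URL
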
Abstract:Event-based sensors (EBS) are a promising new technology for star tracking due to their low latency and power efficiency, but prior work has thus far been evaluated exclusively in simulation with simplified signal models. We propose a novel algorithm for event-based star tracking, grounded in an analysis of the EBS circuit and an extended Kalman filter (EKF). We quantitatively evaluate our method using real night sky data, comparing its results with those from a space-ready active-pixel sensor (APS) star tracker. We demonstrate that our method is an order-of-magnitude more accurate than existing methods due to improved signal modeling and state estimation, while providing more frequent updates and greater motion tolerance than conventional APS trackers. We provide all code and the first dataset of events synchronized with APS solutions.
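
For readers unfamiliar with the extended Kalman filter mentioned above, here is a generic scalar predict/update step. It is only a textbook sketch: the state, the process model `f`, and the measurement model `h` below are hypothetical placeholders, not the star-tracking models from the paper.

```python
def ekf_step(x, P, z, f, F, h, H, Q, R):
    """One scalar EKF iteration.

    x, P : prior state estimate and its variance
    z    : new measurement
    f, F : process model and its derivative with respect to the state
    h, H : measurement model and its derivative with respect to the state
    Q, R : process and measurement noise variances
    """
    # Predict: push the estimate and its variance through the process model.
    x_pred = f(x)
    F_x = F(x)
    P_pred = F_x * P * F_x + Q
    # Update: linearize the measurement model around the prediction.
    H_x = H(x_pred)
    S = H_x * P_pred * H_x + R          # innovation variance
    K = P_pred * H_x / S                # Kalman gain
    x_new = x_pred + K * (z - h(x_pred))
    P_new = (1.0 - K * H_x) * P_pred
    return x_new, P_new


# Hypothetical models: a static state observed through a quadratic sensor.
x_est, P_est = ekf_step(
    2.0, 1.0, 4.5,
    f=lambda x: x, F=lambda x: 1.0,
    h=lambda x: x * x, H=lambda x: 2.0 * x,
    Q=0.0, R=1.0,
)
print(x_est, P_est)
```

The variance shrinks after the update because the measurement carries information about the state; a real tracker would use vector states and matrix Jacobians in the same structure.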

[CV-93] Can Multi-modal (reasoning) LLMs work as deepfake detectors?

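【Quick Read】: This paper tackles deepfake detection, a critical challenge as generative models advance and synthetic media grows more sophisticated. The key idea is to explore the potential of state-of-the-art multi-modal reasoning large language models (LLMs) for deepfake image detection. The study uses prompt tuning and an in-depth analysis of the models' reasoning pathways to identify the key factors in their decisions. It benchmarks 12 recent multi-modal LLMs against traditional deepfake detection methods on multiple datasets, including recently published real-world deepfake imagery. The best multi-modal LLMs achieve competitive zero-shot performance and even surpass traditional detection pipelines on out-of-distribution datasets, whereas other LLM families perform disappointingly, some worse than random guessing. Further analysis indicates that newer model versions and reasoning capabilities contribute little on this niche task, while model size helps in some cases. The study highlights the potential of integrating multi-modal reasoning into future deepfake detection frameworks and offers insights into model interpretability for robustness in real-world use.

Link: https://arxiv.org/abs/2503.20084
Authors: Simiao Ren,Yao Yao,Kidus Zewde,Zisheng Liang,Tsang(Dennis)Ng,Ning-Yau Cheng,Xiaoou Zhan,Qinzhe Liu,Yifei Chen,Hengwei Xu
Affiliations: Duke University; University of Wisconsin Madison; Scam.ai; Columbia University; Georgia Tech
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:
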
Abstract:Deepfake detection remains a critical challenge in the era of advanced generative models, particularly as synthetic media becomes more sophisticated. In this study, we explore the potential of state-of-the-art multi-modal (reasoning) large language models (LLMs) for deepfake image detection, such as OpenAI O1/4o, Gemini thinking Flash 2, Deepseek Janus, Grok 3, llama 3.2, Qwen 2/2.5 VL, Mistral Pixtral, and Claude 3.5/3.7 sonnet. We benchmark 12 of the latest multi-modal LLMs against traditional deepfake detection methods across multiple datasets, including recently published real-world deepfake imagery. To enhance performance, we employ prompt tuning and conduct an in-depth analysis of the models’ reasoning pathways to identify key contributing factors in their decision-making process. Our findings indicate that the best multi-modal LLMs achieve competitive performance with promising zero-shot generalization ability, even surpassing traditional deepfake detection pipelines on out-of-distribution datasets, while the rest of the LLM families perform extremely disappointingly, with some worse than random guessing. Furthermore, we found that newer model versions and reasoning capabilities do not contribute to performance in such niche tasks as deepfake detection, while model size does help in some cases. This study highlights the potential of integrating multi-modal reasoning in future deepfake detection frameworks and provides insights into model interpretability for robustness in real-world scenarios.

[CV-94] iNatAg: Multi-Class Classification Models Enabled by a Large-Scale Benchmark Dataset with 4.7M Images of 2959 Crop and Weed Species

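【Quick Read】: This paper addresses accurate identification of crop and weed species, which is difficult because of high visual similarity among species, environmental variability, and the lack of large agriculture-specific image data. It introduces iNatAg, a large-scale dataset of over 4.7 million images covering 2,959 crop and weed species, annotated along the taxonomic hierarchy from binary crop/weed labels down to species. The key solution is to train benchmark models built on the Swin Transformer architecture, with modifications such as incorporating geospatial data and LoRA finetuning; the best models reach state-of-the-art performance on all classification tasks, including 92.38% on crop and weed classification. The scale of iNatAg also supports detailed analysis of misclassifications and opens new analytic possibilities for plant species.

Link: https://arxiv.org/abs/2503.20068
Authors: Naitik Jain,Amogh Joshi,Mason Earles
Affiliations: University of California, Davis; Princeton University; AI Institute for Food Systems
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
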
Abstract:Accurate identification of crop and weed species is critical for precision agriculture and sustainable farming. However, it remains a challenging task due to a variety of factors: a high degree of visual similarity among species, environmental variability, and a continued lack of large, agriculture-specific image data. We introduce iNatAg, a large-scale image dataset which contains over 4.7 million images of 2,959 distinct crop and weed species, with precise annotations along the taxonomic hierarchy from binary crop/weed labels to specific species labels. Curated from the broader iNaturalist database, iNatAg contains data from every continent and accurately reflects the variability of natural image captures and environments. Enabled by this data, we train benchmark models built upon the Swin Transformer architecture and evaluate the impact of various modifications such as the incorporation of geospatial data and LoRA finetuning. Our best models achieve state-of-the-art performance across all taxonomic classification tasks, achieving 92.38% on crop and weed classification. Furthermore, the scale of our dataset enables us to explore misclassifications and unlock new analytic possibilities for plant species. By combining large-scale species coverage, multi-task labels, and geographic diversity, iNatAg provides a new foundation for building robust, geolocation-aware agricultural classification systems. We release the iNatAg dataset publicly through AgML (this https URL), enabling direct access and integration into agricultural machine learning workflows.

[CV-95] Learning Scene-Level Signed Directional Distance Function with Ellipsoidal Priors and Neural Residuals

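【Quick Read】: This paper addresses the need for dense geometric environment representations in autonomous mobile robot navigation and exploration, building on the finding that implicit continuous representations learned by neural networks (occupancy, signed distance, or radiance) improve on explicit discrete representations based on meshes, point clouds, and voxels in reconstruction fidelity, efficiency, and differentiability. It proposes a directional formulation called the Signed Directional Distance Function (SDDF). Like Neural Radiance Fields (NeRF) and unlike a signed distance function (SDF), SDDF takes a position and a viewing direction as input; like SDF and unlike NeRF, it directly returns the distance to the surface along that direction instead of integrating along the view ray, which enables efficient view synthesis. To learn and predict scene-level SDDF efficiently, the authors develop a differentiable hybrid representation that combines explicit ellipsoid priors with implicit neural residuals. This lets the model handle large distance discontinuities around obstacle boundaries while retaining dense, high-fidelity prediction. Results show that SDDF is competitive with state-of-the-art neural implicit scene models in reconstruction accuracy and rendering efficiency, and that it supports differentiable view prediction for robot trajectory optimization.

Link: https://arxiv.org/abs/2503.20066
Authors: Zhirui Dai,Hojoon Shin,Yulun Tian,Ki Myung Brian Lee,Nikolay Atanasov
Affiliations: University of California, San Diego
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments:
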
Abstract:Dense geometric environment representations are critical for autonomous mobile robot navigation and exploration. Recent work shows that implicit continuous representations of occupancy, signed distance, or radiance learned using neural networks offer advantages in reconstruction fidelity, efficiency, and differentiability over explicit discrete representations based on meshes, point clouds, and voxels. In this work, we explore a directional formulation of signed distance, called signed directional distance function (SDDF). Unlike signed distance function (SDF) and similar to neural radiance fields (NeRF), SDDF has a position and viewing direction as input. Like SDF and unlike NeRF, SDDF directly provides distance to the observed surface along the direction, rather than integrating along the view ray, allowing efficient view synthesis. To learn and predict scene-level SDDF efficiently, we develop a differentiable hybrid representation that combines explicit ellipsoid priors and implicit neural residuals. This approach allows the model to effectively handle large distance discontinuities around obstacle boundaries while preserving the ability for dense high-fidelity prediction. We show that SDDF is competitive with the state-of-the-art neural implicit scene models in terms of reconstruction accuracy and rendering efficiency, while allowing differentiable view prediction for robot trajectory optimization.
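
The idea of a distance that depends on both position and viewing direction can be illustrated with a single axis-aligned ellipsoid, for which the distance along a ray has a closed form. This is only a sketch under simplifying assumptions: the function name is hypothetical, the paper's ellipsoid priors are presumably more general (for example, arbitrarily oriented), and the sign convention of SDDF and the neural residual are not modeled here.

```python
import math


def ellipsoid_directional_distance(p, d, center, radii):
    """Distance from point p along direction d to an axis-aligned ellipsoid.

    Returns math.inf when the ray misses the ellipsoid.
    """
    norm = math.sqrt(sum(x * x for x in d))
    d = [x / norm for x in d]  # unit direction, so t below is a metric distance
    # Rescale space so the ellipsoid becomes the unit sphere at the origin.
    ps = [(pi - ci) / ri for pi, ci, ri in zip(p, center, radii)]
    ds = [di / ri for di, ri in zip(d, radii)]
    # Solve |ps + t * ds|^2 = 1 for t.
    a = sum(x * x for x in ds)
    b = 2.0 * sum(x * y for x, y in zip(ps, ds))
    c = sum(x * x for x in ps) - 1.0
    disc = b * b - 4.0 * a * c
    if disc < 0.0:
        return math.inf
    root = math.sqrt(disc)
    for t in ((-b - root) / (2.0 * a), (-b + root) / (2.0 * a)):
        if t >= 0.0:  # first intersection in front of the viewpoint
            return t
    return math.inf


hit = ellipsoid_directional_distance((-3, 0, 0), (1, 0, 0), (0, 0, 0), (2, 1, 1))
miss = ellipsoid_directional_distance((-3, 5, 0), (1, 0, 0), (0, 0, 0), (2, 1, 1))
print(hit, miss)
```

The jump from a finite distance to infinity as the ray slides past the ellipsoid is an instance of the distance discontinuity around obstacle boundaries that the abstract mentions.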

[CV-96] Hyperdimensional Uncertainty Quantification for Multimodal Uncertainty Fusion in Autonomous Vehicles Perception CVPR2025

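【Quick Read】: This paper addresses limitations of existing uncertainty quantification (UQ) methods in real deployments, in particular their failure to handle epistemic uncertainty at the multimodal feature fusion level and their high computational cost. It proposes HyperDUM, a novel deterministic uncertainty method (DUM). The key idea is to use hyperdimensional computing to quantify feature-level epistemic uncertainty efficiently, capturing channel and spatial uncertainty through channel-wise and patch-wise projection and bundling. HyperDUM then adaptively weights multimodal sensor features to mitigate uncertainty propagation and improve feature fusion. Experiments show that HyperDUM outperforms state-of-the-art algorithms by up to 2.01%/1.27% on 3D object detection and by up to 1.29% on semantic segmentation, while requiring 2.36x fewer floating point operations and up to 38.30x fewer parameters, which makes it an efficient option for real-world deployment.

Link: https://arxiv.org/abs/2503.20011
Authors: Luke Chen,Junyao Wang,Trier Mortlock,Pramod Khargonekar,Mohammad Abdullah Al Faruque
Affiliations: University of California, Irvine
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: Accepted at CVPR 2025
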
Abstract:Uncertainty Quantification (UQ) is crucial for ensuring the reliability of machine learning models deployed in real-world autonomous systems. However, existing approaches typically quantify task-level output prediction uncertainty without considering epistemic uncertainty at the multimodal feature fusion level, leading to sub-optimal outcomes. Additionally, popular uncertainty quantification methods, e.g., Bayesian approximations, remain challenging to deploy in practice due to high computational costs in training and inference. In this paper, we propose HyperDUM, a novel deterministic uncertainty method (DUM) that efficiently quantifies feature-level epistemic uncertainty by leveraging hyperdimensional computing. Our method captures the channel and spatial uncertainties through channel-wise and patch-wise projection and bundling techniques, respectively. Multimodal sensor features are then adaptively weighted to mitigate uncertainty propagation and improve feature fusion. Our evaluations show that HyperDUM on average outperforms the state-of-the-art (SOTA) algorithms by up to 2.01%/1.27% in 3D Object Detection and up to 1.29% improvement over baselines in semantic segmentation tasks under various types of uncertainties. Notably, HyperDUM requires 2.36x fewer Floating Point Operations and up to 38.30x fewer parameters than SOTA methods, providing an efficient solution for real-world autonomous systems.
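
The projection and bundling operations named above are standard hyperdimensional computing primitives, and a minimal sketch may help. Everything here is an assumption for illustration: the function names, the random-projection encoder, and the `1 - cosine` score are generic choices, not HyperDUM's actual channel-wise or patch-wise design.

```python
import math
import random


def make_projection(in_dim, hv_dim, seed=0):
    """Random Gaussian projection matrix: hv_dim rows of in_dim weights."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(in_dim)] for _ in range(hv_dim)]


def encode(features, projection):
    """Project a feature vector and binarize it into a bipolar hypervector."""
    return [1 if sum(w * x for w, x in zip(row, features)) >= 0.0 else -1
            for row in projection]


def bundle(hypervectors):
    """Superimpose hypervectors by elementwise addition."""
    return [sum(column) for column in zip(*hypervectors)]


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


projection = make_projection(in_dim=4, hv_dim=2048)
# Prototype built from (hypothetical) features seen during training.
seen = ([1.0, 0.9, 0.0, 0.1], [0.9, 1.0, 0.1, 0.0])
prototype = bundle([encode(f, projection) for f in seen])
# Illustrative uncertainty score: low when a query resembles the prototype.
familiar = 1.0 - cosine(encode([1.0, 1.0, 0.0, 0.0], projection), prototype)
unfamiliar = 1.0 - cosine(encode([-1.0, -1.0, 0.0, 0.0], projection), prototype)
print(familiar, unfamiliar)
```

No gradients or sampling are involved, only a projection, a sign, and additions, which is consistent with the low compute cost the abstract emphasizes.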

[CV-97] The Coralscapes Dataset: Semantic Scene Understanding in Coral Reefs

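【Quick Read】: This paper addresses the difficulty of obtaining coral reef monitoring data at high spatial and temporal resolution as reefs decline worldwide due to climate change and local stressors. Conventional surveying depends on expert labor time and does not scale, which motivates computer vision tools that automatically identify live corals and estimate their abundance; however, the design and evaluation of such tools has been held back by the lack of large, high-quality datasets. The paper introduces Coralscapes, a general-purpose dense semantic segmentation dataset covering 2075 images, 39 benthic classes, and 174k expert-annotated segmentation masks. Similar in scope and structure to the widely used Cityscapes dataset, it allows semantic segmentation models to be benchmarked in a new, challenging domain that requires expert knowledge to annotate. Benchmarking a range of models shows that transfer learning from Coralscapes to existing smaller datasets consistently yields state-of-the-art performance. The key contribution is a large, high-quality, standardized dataset that should advance efficient, scalable reef surveying based on computer vision and may accelerate the development of underwater ecological robotics.

Link: https://arxiv.org/abs/2503.20000
Authors: Jonathan Sauder,Viktor Domazetoski,Guilhem Banc-Prandi,Gabriela Perna,Anders Meibom,Devis Tuia
Affiliations: Environmental Computational Science and Earth Observation Laboratory, École Polytechnique Fédérale de Lausanne; Laboratory for Biological Geochemistry, École Polytechnique Fédérale de Lausanne; Centre for Ecology and Conservation, University of Exeter; School of the Environment, The University of Queensland; Center for Advanced Surface Analysis, University of Lausanne
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:
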
Abstract:Coral reefs are declining worldwide due to climate change and local stressors. To inform effective conservation or restoration, monitoring at the highest possible spatial and temporal resolution is necessary. Conventional coral reef surveying methods are limited in scalability due to their reliance on expert labor time, motivating the use of computer vision tools to automate the identification and abundance estimation of live corals from images. However, the design and evaluation of such tools has been impeded by the lack of large high quality datasets. We release the Coralscapes dataset, the first general-purpose dense semantic segmentation dataset for coral reefs, covering 2075 images, 39 benthic classes, and 174k segmentation masks annotated by experts. Coralscapes has a similar scope and the same structure as the widely used Cityscapes dataset for urban scene segmentation, allowing benchmarking of semantic segmentation models in a new challenging domain which requires expert knowledge to annotate. We benchmark a wide range of semantic segmentation models, and find that transfer learning from Coralscapes to existing smaller datasets consistently leads to state-of-the-art performance. Coralscapes will catalyze research on efficient, scalable, and standardized coral reef surveying methods based on computer vision, and holds the potential to streamline the development of underwater ecological robotics.

[CV-98] SLIP: Spoof-Aware One-Class Face Anti-Spoofing with Language Image Pretraining AAAI2025

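【Quick Read】: This paper addresses a weakness of one-class face anti-spoofing (FAS): without spoof training data, models may inadvertently absorb domain information irrelevant to the live/spoof distinction (such as facial content), which degrades performance in new application domains. It proposes a new framework, Spoof-aware one-class face anti-spoofing with Language Image Pretraining (SLIP). The key components are: first, language-guided spoof cue map estimation, which simulates whether a face is covered by spoof-attack-related objects and generates the corresponding nonzero spoof cue maps; second, prompt-driven liveness feature disentanglement, which separates live/spoof-relevant features from domain-dependent information; and third, an augmentation strategy that fuses latent features of live faces with spoof prompts to generate spoof-like features, strengthening the one-class FAS model's learning of latent spoof features.

Link: https://arxiv.org/abs/2503.19982
Authors: Pei-Kai Huang,Jun-Xiong Chong,Cheng-Hsuan Chiang,Tzu-Hsien Chen,Tyng-Luh Liu,Chiou-Ting Hsu
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by AAAI 2025
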
Abstract:Face anti-spoofing (FAS) plays a pivotal role in ensuring the security and reliability of face recognition systems. With advancements in vision-language pretrained (VLP) models, recent two-class FAS techniques have leveraged the advantages of using VLP guidance, while this potential remains unexplored in one-class FAS methods. The one-class FAS focuses on learning intrinsic liveness features solely from live training images to differentiate between live and spoof faces. However, the lack of spoof training data can lead one-class FAS models to inadvertently incorporate domain information irrelevant to the live/spoof distinction (e.g., facial content), causing performance degradation when tested with a new application domain. To address this issue, we propose a novel framework called Spoof-aware one-class face anti-spoofing with Language Image Pretraining (SLIP). Given that live faces should ideally not be obscured by any spoof-attack-related objects (e.g., paper, or masks) and are assumed to yield zero spoof cue maps, we first propose an effective language-guided spoof cue map estimation to enhance one-class FAS models by simulating whether the underlying faces are covered by attack-related objects and generating corresponding nonzero spoof cue maps. Next, we introduce a novel prompt-driven liveness feature disentanglement to alleviate live/spoof-irrelative domain variations by disentangling live/spoof-relevant and domain-dependent information. Finally, we design an effective augmentation strategy by fusing latent features from live images and spoof prompts to generate spoof-like image features and thus diversify latent spoof features to facilitate the learning of one-class FAS. Our extensive experiments and ablation studies support that SLIP consistently outperforms previous one-class FAS methods.

[CV-99] Thin-Shell-SfT: Fine-Grained Monocular Non-rigid 3D Surface Tracking with Neural Deformation Fields CVPR2025

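【Quick Read】: This paper addresses the difficulty of consistently and accurately recovering fine surface detail when reconstructing highly deformable surfaces (such as cloth) from monocular RGB video. Existing methods typically rely on deformation models with statistical, neural, or physical priors and on non-adaptive discrete surface representations such as polygonal meshes; they optimize frame by frame, which propagates errors, and they suffer from the poor gradients of mesh-based differentiable renderers, so details such as cloth wrinkles are not recovered accurately. The key solution, Thin-Shell-SfT, represents the surface as an implicit, continuous spatiotemporal neural field and introduces a continuous thin-shell physics prior based on the Kirchhoff-Love model for spatial regularization, in contrast to the discretized alternatives of earlier work. It further uses 3D Gaussian splatting to render the surface differentiably into image space and optimizes the deformations by analysis-by-synthesis. Thin-Shell-SfT outperforms prior work both qualitatively and quantitatively.

Link: https://arxiv.org/abs/2503.19976
Authors: Navami Kairanda,Marc Habermann,Shanthika Naik,Christian Theobalt,Vladislav Golyanik
Affiliations: MPI for Informatics; VIA Research Center; IIT Jodhpur
Categories: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 15 pages, 12 figures and 3 tables; project page: this https URL CVPR 2025
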
Abstract:3D reconstruction of highly deformable surfaces (e.g. cloths) from monocular RGB videos is a challenging problem, and no solution provides a consistent and accurate recovery of fine-grained surface details. To account for the ill-posed nature of the setting, existing methods use deformation models with statistical, neural, or physical priors. They also predominantly rely on nonadaptive discrete surface representations (e.g. polygonal meshes), perform frame-by-frame optimisation leading to error propagation, and suffer from poor gradients of the mesh-based differentiable renderers. Consequently, fine surface details such as cloth wrinkles are often not recovered with the desired accuracy. In response to these limitations, we propose Thin-Shell-SfT, a new method for non-rigid 3D tracking that represents a surface as an implicit and continuous spatiotemporal neural field. We incorporate a continuous thin shell physics prior based on the Kirchhoff-Love model for spatial regularisation, which starkly contrasts with the discretised alternatives of earlier works. Lastly, we leverage 3D Gaussian splatting to differentiably render the surface into image space and optimise the deformations based on analysis-by-synthesis principles. Our Thin-Shell-SfT outperforms prior works qualitatively and quantitatively thanks to our continuous surface formulation in conjunction with a specially tailored simulation prior and surface-induced 3D Gaussians. See our project page at this https URL.

[CV-100] Self-Supervised Learning of Motion Concepts by Optimizing Counterfactuals

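【Quick Read】: This paper addresses motion estimation in real-world video. Existing methods are mostly trained on synthetic data or need situation-specific heuristic tuning, which limits their real-world applicability, and despite progress in large-scale self-supervised video learning, using such representations for motion estimation remains underexplored. The key solution is Opt-CWM, a self-supervised technique that estimates optical flow and occlusion from a pre-trained next-frame prediction model. Opt-CWM learns to optimize counterfactual probes that extract motion information from the base video model, which avoids fixed heuristics and allows training on unrestricted video inputs. It achieves state-of-the-art motion estimation on real-world videos without labeled data.

Link: https://arxiv.org/abs/2503.19953
Authors: Stefan Stojanov,David Wendt,Seungwoo Kim,Rahul Venkatesh,Kevin Feigelis,Jiajun Wu,Daniel LK Yamins
Affiliations: Stanford University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project webpage: this https URL
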
Abstract:Estimating motion in videos is an essential computer vision problem with many downstream applications, including controllable video generation and robotics. Current solutions are primarily trained using synthetic data or require tuning of situation-specific heuristics, which inherently limits these models’ capabilities in real-world contexts. Despite recent developments in large-scale self-supervised learning from videos, leveraging such representations for motion estimation remains relatively underexplored. In this work, we develop Opt-CWM, a self-supervised technique for flow and occlusion estimation from a pre-trained next-frame prediction model. Opt-CWM works by learning to optimize counterfactual probes that extract motion information from a base video model, avoiding the need for fixed heuristics while training on unrestricted video inputs. We achieve state-of-the-art performance for motion estimation on real-world videos while requiring no labeled data.

[CV-101] ACVUBench: Audio-Centric Video Understanding Benchmark

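【Quick Read】: This paper addresses the underuse of audio in video understanding by audio-visual large language models (AV LLMs). Audio is usually treated as an auxiliary modality that merely supports visual understanding, overlooking the critical context, emotional cues, and semantic meaning it provides. The paper proposes the Audio-Centric Video Understanding Benchmark (ACVUBench). It comprises 2,662 videos with rich audio content spanning 18 domains, together with more than 13,000 human-annotated or human-validated question-answer pairs, and a suite of carefully designed audio-centric tasks that test understanding of both audio content and audio-visual interaction. An extensive evaluation of open-source and proprietary AV LLMs reveals the shortcomings of current models in handling audio.

Link: https://arxiv.org/abs/2503.19951
Authors: Yudong Yang,Jimin Zhuang,Guangzhi Sun,Changli Tang,Yixuan Li,Peihan Li,Yifan Jiang,Wei Li,Zejun Ma,Chao Zhang
Affiliations: Tsinghua University; University of Cambridge; ByteDance
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:
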
Abstract:Audio often serves as an auxiliary modality in video understanding tasks of audio-visual large language models (LLMs), merely assisting in the comprehension of visual information. However, a thorough understanding of videos significantly depends on auditory information, as audio offers critical context, emotional cues, and semantic meaning that visual data alone often lacks. This paper proposes an audio-centric video understanding benchmark (ACVUBench) to evaluate the video comprehension capabilities of multimodal LLMs with a particular focus on auditory information. Specifically, ACVUBench incorporates 2,662 videos spanning 18 different domains with rich auditory information, together with over 13k high-quality human annotated or validated question-answer pairs. Moreover, ACVUBench introduces a suite of carefully designed audio-centric tasks, holistically testing the understanding of both audio content and audio-visual interactions in videos. A thorough evaluation across a diverse range of open-source and proprietary multimodal LLMs is performed, followed by the analyses of deficiencies in audio-visual LLMs. Demos are available at this https URL.

[CV-102] Test-Time Reasoning Through Visual Human Preferences with VLMs and Soft Rewards

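【Quick Read】: This paper investigates whether Visual Language Models (VLMs) can effectively capture human visual preferences, and approaches the question by training VLMs to reason about preferences at test time using reinforcement learning methods inspired by DeepSeek R1 and OpenAI O1. The key is a transparent, interpretable approach that draws on the model's rich world knowledge and reasoning ability: the models match traditional encoder-based baselines on ImageReward and Human Preference Score v2 (HPSv2), with accuracies of 64.9% and 65.4% respectively, while generalizing more broadly and ranking arbitrary images, which improves the reward mechanism for visual preference tasks in both efficiency and explainability.

Link: https://arxiv.org/abs/2503.19948
Authors: Alexander Gambashidze,Konstantin Sobolev,Andrey Kuznetsov,Ivan Oseledets
Affiliations: AIRI; Skoltech; Moscow
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:
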
Abstract:Can Visual Language Models (VLMs) effectively capture human visual preferences? This work addresses this question by training VLMs to think about preferences at test time, employing reinforcement learning methods inspired by DeepSeek R1 and OpenAI O1. Using datasets such as ImageReward and Human Preference Score v2 (HPSv2), our models achieve accuracies of 64.9% on the ImageReward test set (trained on ImageReward official split) and 65.4% on HPSv2 (trained on approximately 25% of its data). These results match traditional encoder-based models while providing transparent reasoning and enhanced generalization. This approach makes it possible to use not only the rich world knowledge of VLMs but also their potential to think, yielding interpretable outcomes that help decision-making processes. By demonstrating that current VLMs can reason about human visual preferences, we introduce efficient soft-reward strategies for image ranking, outperforming simplistic selection or scoring methods. This reasoning capability enables VLMs to rank arbitrary images, regardless of aspect ratio or complexity, thereby potentially amplifying the effectiveness of visual Preference Optimization. By reducing the need for extensive markup while improving reward generalization and explainability, our findings can be a strong milestone that will enhance text-to-vision models even further.
zh

[CV-103] Vanishing Depth: A Depth Adapter with Positional Depth Encoding for Generalized Image Encoders

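[Quick read]: This paper addresses precise metric depth understanding for vision-guided robotics, which current state-of-the-art (SOTA) vision encoders do not support. It proposes Vanishing Depth, a self-supervised training approach that extends pretrained RGB encoders to incorporate and align metric depth into their feature embeddings. The key to the solution is a novel positional depth encoding that enables stable feature extraction invariant to depth density and depth distribution, yielding performance gains and SOTA results on a range of relevant RGBD downstream tasks without fine-tuning the encoder.

Link: https://arxiv.org/abs/2503.19947
Authors: Paul Koch,Jörg Krüger,Ankit Chowdhury,Oliver Heimann
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Preprint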

Abstract:Generalized metric depth understanding is critical for precise vision-guided robotics, which current state-of-the-art (SOTA) vision-encoders do not support. To address this, we propose Vanishing Depth, a self-supervised training approach that extends pretrained RGB encoders to incorporate and align metric depth into their feature embeddings. Based on our novel positional depth encoding, we enable stable depth density and depth distribution invariant feature extraction. We achieve performance improvements and SOTA results across a spectrum of relevant RGBD downstream tasks - without the necessity of finetuning the encoder. Most notably, we achieve 56.05 mIoU on SUN-RGBD segmentation, 88.3 RMSE on Void’s depth completion, and 83.8 Top 1 accuracy on NYUv2 scene classification. In 6D-object pose estimation, we outperform our predecessors of DinoV2, EVA-02, and Omnivore and achieve SOTA results for non-finetuned encoders in several related RGBD downstream tasks.

[CV-104] Reverse Prompt: Cracking the Recipe Inside Text-to-Image Generation

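[Quick read]: This paper addresses the extensive manual tuning needed to produce a target image from a text prompt, and proposes automatic reverse prompt optimization (ARPO). The key is an iterative, imitative gradient prompt optimization process that refines an initial prompt into a high-quality one: first, generate a recreated image from the current prompt to instantiate its guidance capability; second, compute textual gradients, that is, candidate prompts intended to reduce the difference between the recreated image and the reference image; finally, update the prompt with a greedy search that maximizes the CLIP similarity between prompt and reference image. Experiments show that ARPO converges quickly to high-quality reverse prompts, and that new images with diverse styles and content can be created easily by editing these reverse prompts directly.

Link: https://arxiv.org/abs/2503.19937
Authors: Zhiyao Ren,Yibing Zhan,Baosheng Yu,Dacheng Tao
Affiliations: Nanyang Technological University, Singapore; JD Explore Academy, China
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: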

Abstract:Text-to-image generation has become increasingly popular, but achieving the desired images often requires extensive prompt engineering. In this paper, we explore how to decode textual prompts from reference images, a process we refer to as image reverse prompt engineering. This technique enables us to gain insights from reference images, understand the creative processes of great artists, and generate impressive new images. To address this challenge, we propose a method known as automatic reverse prompt optimization (ARPO). Specifically, our method refines an initial prompt into a high-quality prompt through an iteratively imitative gradient prompt optimization process: 1) generating a recreated image from the current prompt to instantiate its guidance capability; 2) producing textual gradients, which are candidate prompts intended to reduce the difference between the recreated image and the reference image; 3) updating the current prompt with textual gradients using a greedy search method to maximize the CLIP similarity between prompt and reference image. We compare ARPO with several baseline methods, including handcrafted techniques, gradient-based prompt tuning methods, image captioning, and data-driven selection method. Both quantitative and qualitative results demonstrate that our ARPO converges quickly to generate high-quality reverse prompts. More importantly, we can easily create novel images with diverse styles and content by directly editing these reverse prompts. Code will be made publicly available.
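
The greedy update in step 3 of the abstract can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: `toy_score` is a hypothetical stand-in for CLIP similarity between prompt and reference image, and `toy_propose` is a hypothetical stand-in for the textual gradients (candidate prompts) that the paper derives from comparing the recreated and reference images.

```python
from typing import Callable, Iterable


def greedy_prompt_search(
    initial_prompt: str,
    propose: Callable[[str], Iterable[str]],
    score: Callable[[str], float],
    max_iters: int = 10,
) -> str:
    """Keep the best-scoring prompt; stop when no candidate improves it."""
    best, best_score = initial_prompt, score(initial_prompt)
    for _ in range(max_iters):
        improved = False
        for candidate in propose(best):
            s = score(candidate)
            if s > best_score:
                best, best_score, improved = candidate, s, True
        if not improved:
            break
    return best


# Hypothetical toy stand-ins: a real system would score with CLIP and
# propose candidates from image differences, neither of which is shown here.
REFERENCE_TAGS = {"sunset", "ocean", "oil", "painting"}
VOCAB = ["sunset", "ocean", "oil", "painting", "car"]


def toy_score(prompt: str) -> float:
    """Reward words that describe the reference, penalize unrelated ones."""
    words = set(prompt.split())
    return len(words & REFERENCE_TAGS) - 0.5 * len(words - REFERENCE_TAGS)


def toy_propose(prompt: str) -> list:
    """Candidate prompts: the current prompt extended by one unused word."""
    words = prompt.split()
    return [prompt + " " + w for w in VOCAB if w not in words]


result = greedy_prompt_search("ocean", toy_propose, toy_score)
print(result)
```

With these toy functions the search adds one descriptive word per iteration and stops once the only remaining candidate would lower the score.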

[CV-105] VisualQuest: A Diverse Image Dataset for Evaluating Visual Recognition in LLMs

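[Quick read]: This paper addresses how to evaluate the visual reasoning ability of large language models (LLMs) on non-traditional, stylized imagery. Existing benchmarks are mostly built from conventional photographs and do not challenge models with abstract, symbolic, and metaphorical visual elements. To close this gap, the paper proposes the VisualQuest dataset, whose key feature is careful curation through multiple stages of filtering, annotation, and standardization, producing diverse, high-quality images that demand domain-specific knowledge and advanced reasoning. Evaluations with several state-of-the-art multimodal LLMs reveal significant performance differences in factual background knowledge and inferential ability, giving multimodal reasoning research and model architecture design a robust and comprehensive benchmark.

Link: https://arxiv.org/abs/2503.19936
Authors: Kelaiti Xiao,Liang Yang,Paerhati Tulajiang,Hongfei Lin
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: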

Abstract:This paper introduces VisualQuest, a novel image dataset designed to assess the ability of large language models (LLMs) to interpret non-traditional, stylized imagery. Unlike conventional photographic benchmarks, VisualQuest challenges models with images that incorporate abstract, symbolic, and metaphorical elements, requiring the integration of domain-specific knowledge and advanced reasoning. The dataset was meticulously curated through multiple stages of filtering, annotation, and standardization to ensure high quality and diversity. Our evaluations using several state-of-the-art multimodal LLMs reveal significant performance variations that underscore the importance of both factual background knowledge and inferential capabilities in visual recognition tasks. VisualQuest thus provides a robust and comprehensive benchmark for advancing research in multimodal reasoning and model architecture design.

[CV-106] Robust Object Detection of Underwater Robot based on Domain Generalization

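[Quick read]: This paper addresses the challenges that the diversity and complexity of underwater environments pose for object detection, specifically severe occlusion, biological camouflage, image distortion and low contrast caused by domain shift, and reduced visibility. The key is to design a high-performance, robust underwater object detector that meets these challenges and keeps detection accurate and stable in complex underwater environments.

Link: https://arxiv.org/abs/2503.19929
Authors: Pinhao Song
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Master’s thesis, in Chinese language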

Abstract:Object detection aims to obtain the location and category of specific objects in a given image, and comprises two tasks: classification and localization. In recent years, researchers have increasingly applied object detection to underwater robots equipped with vision systems for tasks such as seafood fishing, fish farming, and biodiversity monitoring. However, the diversity and complexity of underwater environments bring new challenges to object detection. First, aquatic organisms tend to live together, which leads to severe occlusion. Second, aquatic organisms are good at hiding themselves and often have a color similar to the background. Third, varying water quality and changeable, extreme lighting conditions produce distorted, low-contrast, blue or green images from the underwater camera, resulting in domain shift, to which deep models are generally vulnerable. Fourth, the movement of the underwater robot blurs the captured images and muddies the water, which lowers visibility. This paper investigates these problems and aims to design a high-performance and robust underwater object detector.

[CV-107] A Study on the Matching Rate of Dance Movements Using 2D Skeleton Detection and 3D Pose Estimation: Why Is SEVENTEEN's Performance So Bita-Zoroi (Perfectly Synchronized)?

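[Quick read]: This paper addresses the lack of concrete data supporting the high synchronization rate (reportedly 90% or 97%) of SEVENTEEN's dance performances. To verify the claim, the authors analyze SEVENTEEN's performances in YouTube videos, applying 2D skeleton detection and 3D pose estimation to evaluate joint angles, body part movements, and jumping and crouching motions, and to investigate the factors behind the group's unity. The key finding is a high consistency in the movement direction of body parts, together with synchronized ankle and head positions during jumps and synchronized head positions during crouches, which jointly explain SEVENTEEN's exceptional dance synchronization.

Link: https://arxiv.org/abs/2503.19917
Authors: Atsushi Simojo,Harumi Haraguchi
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 14 pages, 11 figures and 20 tables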

Abstract:SEVENTEEN is a K-pop group with a large number of members (13 in total) and a physical disparity between its tallest and shortest members that is significant among K-pop groups. Despite the group's size and physical differences, their dance performances exhibit unparalleled unity in the K-pop industry. According to one theory, their dance synchronization rate is said to be 90% or even 97%. However, there is little concrete data to substantiate this synchronization rate. In this study, we analyzed SEVENTEEN’s dance performances using videos available on YouTube. We applied 2D skeleton detection and 3D pose estimation to evaluate joint angles, body part movements, and jumping and crouching motions to investigate the factors contributing to their performance unity. The analysis revealed exceptionally high consistency in the movement direction of body parts, as well as in the ankle and head positions during jumping movements and the head position during crouching movements. These findings suggest that SEVENTEEN’s high synchronization rate can be attributed to the consistency of movement direction and the synchronization of ankle and head heights during jumping and crouching movements.

[CV-108] Demand Estimation with Text and Image Data

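[Quick read]: This paper addresses the difficulty of demand estimation when product attribute data are missing or when consumers value hard-to-quantify attributes such as visual design and functional benefits. The key is a method that uses unstructured text and image data: pre-trained deep learning models extract embeddings from product images and textual descriptions, and these embeddings are incorporated into a random coefficients logit model. The method infers substitution patterns effectively, improving counterfactual predictions of consumers' second choices, and the usefulness of text and image data for identifying close substitutes is confirmed across multiple product categories.

Link: https://arxiv.org/abs/2503.20711
Authors: Giovanni Compiani,Ilya Morozov,Stephan Seiler
Affiliations: Unknown
Categories: General Economics (econ.GN); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: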

Abstract:We propose a demand estimation method that leverages unstructured text and image data to infer substitution patterns. Using pre-trained deep learning models, we extract embeddings from product images and textual descriptions and incorporate them into a random coefficients logit model. This approach enables researchers to estimate demand even when they lack data on product attributes or when consumers value hard-to-quantify attributes, such as visual design or functional benefits. Using data from a choice experiment, we show that our approach outperforms standard attribute-based models in counterfactual predictions of consumers’ second choices. We also apply it across 40 product categories on this http URL and consistently find that text and image data help identify close substitutes within each category.

[CV-109] Benchmarking Machine Learning Methods for Distributed Acoustic Sensing

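[Quick read]: This paper studies the potential of combining Distributed Acoustic Sensing (DAS) with Machine Learning (ML), focusing on how ML algorithms can improve the data processing capability of DAS systems. The core question is how to use classical machine learning methods and advanced deep learning models to improve DAS signal recognition and interpretation, and to apply them to key areas such as transportation infrastructure monitoring, energy management systems, and natural disaster warning. The key is to replace traditional data processing workflows that rely on human experience with automated, intelligent analysis frameworks, using ML-enhanced techniques to improve the precision of data acquisition and the reliability of decisions in DAS systems and thereby enable more efficient and accurate real-time monitoring.

Link: https://arxiv.org/abs/2503.20681
Authors: Shuaikai Shi,Qijun Zong
Affiliations: School of Physics, Nanjing University
Categories: Audio and Speech Processing (eess.AS); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Sound (cs.SD)
Comments: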

Abstract:Distributed acoustic sensing (DAS) technology represents an innovative fiber-optic-based sensing methodology that enables real-time acoustic signal monitoring through the detection of minute perturbations along optical fibers. This sensing approach offers compelling advantages, including extensive measurement ranges, exceptional spatial resolution, and an expansive dynamic measurement spectrum. The integration of machine learning (ML) paradigms presents transformative potential for DAS technology, encompassing critical domains such as data augmentation, sophisticated preprocessing techniques, and advanced acoustic event classification and recognition. By leveraging ML algorithms, DAS systems can transition from traditional data processing methodologies to more automated and intelligent analytical frameworks. The computational intelligence afforded by ML-enhanced DAS technologies facilitates unprecedented monitoring capabilities across diverse critical infrastructure sectors. Particularly noteworthy are the technology’s applications in transportation infrastructure, energy management systems, and Natural disaster monitoring frameworks, where the precision of data acquisition and the reliability of intelligent decision-making mechanisms are paramount. This research critically examines the comparative performance characteristics of classical machine learning methodologies and state-of-the-art deep learning models in the context of DAS data recognition and interpretation, offering comprehensive insights into the evolving landscape of intelligent sensing technologies. 
Cite as: arXiv:2503.20681 [eess.AS] (or arXiv:2503.20681v1 [eess.AS] for this version), https://doi.org/10.48550/arXiv.2503.20681. Submission history: from Shuaikai Shi, [v1] Wed, 26 Mar 2025 16:17:22 UTC (1,667 KB)

[CV-110] UWarp: A Whole Slide Image Registration Pipeline to Characterize Scanner-Induced Local Domain Shift

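[Quick read]: This paper addresses the effect of scanner-induced domain shift, introduced when tissue slides are digitized, on the prediction accuracy of deep-learning-based computational pathology models. Existing studies usually characterize this shift only at a broad scale (slide level or dataset level) and do not analyze how local tissue characteristics contribute to it. The paper therefore proposes a domain shift analysis framework based on UWarp, a novel registration tool. UWarp uses a hierarchical registration approach that combines global affine transformations with fine-grained local corrections to align tissue patches robustly. Experiments on two private datasets, CypathLung and BosomShieldBreast, show that UWarp outperforms existing open-source registration methods while significantly reducing computation time. Applying UWarp, the paper also finds a strong correlation between local tissue density and the prediction variability of Breast-NEOprAIdict, a model that predicts pathological response in breast cancer, which underlines the importance of local domain shift analysis and suggests that UWarp can be a useful tool for improving model robustness and domain adaptation strategies in computational pathology.

Link: https://arxiv.org/abs/2503.20653
Authors: Antoine Schieb,Bilal Hadjadji,Daniel Tshokola Mweze,Natalia Fernanda Valderrama,Valentin Derangère,Laurent Arnould,Sylvain Ladoire,Alain Lalande,Louis-Oscar Morel,Nathan Vinçon
Affiliations: Unknown
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: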

Abstract:Histopathology slide digitization introduces scanner-induced domain shift that can significantly impact computational pathology models based on deep learning methods. In the state-of-the-art, this shift is often characterized at a broad scale (slide-level or dataset-level) but not patch-level, which limits our comprehension of the impact of localized tissue characteristics on the accuracy of the deep learning models. To address this challenge, we present a domain shift analysis framework based on UWarp, a novel registration tool designed to accurately align histological slides scanned under varying conditions. UWarp employs a hierarchical registration approach, combining global affine transformations with fine-grained local corrections to achieve robust tissue patch alignment. We evaluate UWarp using two private datasets, CypathLung and BosomShieldBreast, containing whole slide images scanned by multiple devices. Our experiments demonstrate that UWarp outperforms existing open-source registration methods, achieving a median target registration error (TRE) of less than 4 pixels (1 micrometer at 40x magnification) while significantly reducing computational time. Additionally, we apply UWarp to characterize scanner-induced local domain shift in the predictions of Breast-NEOprAIdict, a deep learning model for breast cancer pathological response prediction. We find that prediction variability is strongly correlated with tissue density on a given patch. Our findings highlight the importance of localized domain shift analysis and suggest that UWarp can serve as a valuable tool for improving model robustness and domain adaptation strategies in computational pathology.
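
The median target registration error (TRE) quoted in the abstract can be illustrated with a small sketch. The landmark coordinates below are hypothetical, and the pixel size of 0.25 micrometers per pixel is an assumption inferred from the abstract's own equivalence of 4 pixels to 1 micrometer at 40x; the paper's actual evaluation protocol is not reproduced here.

```python
import math
from statistics import median


def target_registration_errors(fixed_pts, warped_pts):
    """Euclidean distance (in pixels) between corresponding landmark pairs."""
    return [math.dist(p, q) for p, q in zip(fixed_pts, warped_pts)]


# Assumed pixel size, implied by "4 pixels (1 micrometer at 40x magnification)".
MICRONS_PER_PIXEL = 0.25

# Hypothetical landmarks on the reference slide and on the registered slide.
fixed = [(100.0, 100.0), (250.0, 80.0), (40.0, 300.0)]
warped = [(103.0, 104.0), (250.0, 82.0), (41.0, 300.0)]

errors_px = target_registration_errors(fixed, warped)
median_px = median(errors_px)
median_um = median_px * MICRONS_PER_PIXEL
print(errors_px, median_px, median_um)
```

The median is used rather than the mean so that a few badly matched landmarks do not dominate the summary.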

[CV-111] Exploring Robustness of Cortical Morphometry in the presence of white matter lesions using Diffusion Models for Lesion Filling

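[Quick read]: This paper addresses the effect of white matter lesions (WMLs) on cortical thickness measurements derived from magnetic resonance imaging (MRI). Traditional brain segmentation tools are known to be biased by white matter hypointensities, but the effect has been studied far less in deep-learning-based methods. Although such methods promise to be more robust, their behavior in the presence of white matter lesions has not been fully validated.

The key to the solution is a high-quality lesion filling algorithm built on denoising diffusion networks, in which a pseudo-3D U-Net architecture generates synthetic healthy tissue in lesion regions. The architecture is trained on the OASIS dataset and conditioned on binary lesion masks from the MSSEG dataset, allowing realistic removal of white matter lesions in multiple sclerosis patients. By running morphometry on patient images before and after lesion filling, the paper evaluates how robust different methods are to white matter lesions (deep-learning-based segmentation such as Fastsurfer, DL+DiReCT, and ANTsPyNet versus classical methods such as Freesurfer and ANTs) and finds that the deep-learning-based methods are more stable.

Link: https://arxiv.org/abs/2503.20571
Authors: Vinzenz Uhr,Ivan Diaz,Christian Rummel,Richard McKinley
Affiliations: Unknown
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: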

Abstract:Cortical thickness measurements from magnetic resonance imaging, an important biomarker in many neurodegenerative and neurological disorders, are derived by many tools from an initial voxel-wise tissue segmentation. White matter (WM) hypointensities in T1-weighted imaging, such as those arising from multiple sclerosis or small vessel disease, are known to affect the output of brain segmentation methods and therefore bias cortical thickness measurements. These effects are well-documented among traditional brain segmentation tools but have not been studied extensively in tools based on deep-learning segmentations, which promise to be more robust. In this paper, we explore the potential of deep learning to enhance the accuracy and efficiency of cortical thickness measurement in the presence of WM lesions, using a high-quality lesion filling algorithm leveraging denoising diffusion networks. A pseudo-3D U-Net architecture trained on the OASIS dataset to generate synthetic healthy tissue, conditioned on binary lesion masks derived from the MSSEG dataset, allows realistic removal of white matter lesions in multiple sclerosis patients. By applying morphometry methods to patient images before and after lesion filling, we analysed robustness of global and regional cortical thickness measurements in the presence of white matter lesions. Methods based on a deep learning-based segmentation of the brain (Fastsurfer, DL+DiReCT, ANTsPyNet) exhibited greater robustness than those using classical segmentation methods (Freesurfer, ANTs).
Cite as: arXiv:2503.20571 [eess.IV] (or arXiv:2503.20571v1 [eess.IV] for this version), https://doi.org/10.48550/arXiv.2503.20571

[CV-112] Attention Xception UNet (AXUNet): A Novel Combination of CNN and Self-Attention for Brain Tumor Segmentation

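[Quick read]: This paper addresses accurate segmentation of glioma brain tumors, a task that is crucial for diagnosis and treatment planning. Deep learning offers promising solutions, but the optimal model architecture is still under investigation. The paper proposes a new architecture, Attention Xception UNet (AXUNet), whose core innovation is to combine an Xception backbone with dot-product self-attention, inspired by state-of-the-art large language models such as Google Bard and OpenAI ChatGPT, inside a UNet-shaped model. Tested on the BraTS 2021 dataset, AXUNet outperforms all compared state-of-the-art models in Dice similarity coefficient, performing especially well on the whole tumor (WT) and tumor core (TC) regions with Dice scores of 92.59 and 86.81, and reaching 84.89 on the enhancing tumor (ET) region. The results support AXUNet's advantage in capturing spatial and contextual information and its potential usefulness for precise tumor delineation.

Link: https://arxiv.org/abs/2503.20446
Authors: Farzan Moodi,Fereshteh Khodadadi Shoushtari,Gelareh Valizadeh,Dornaz Mazinani,Hanieh Mobarak Salari,Hamidreza Saligheh Rad
Affiliations: Quantitative Medical Imaging Systems Group, Research Center for Molecular and Cellular Imaging, Imam Khomeini Hospital, Keshavarz Boulevard, Tehran, Iran
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: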

Abstract:Accurate segmentation of glioma brain tumors is crucial for diagnosis and treatment planning. Deep learning techniques offer promising solutions, but optimal model architectures remain under investigation. We used the BraTS 2021 dataset, selecting T1 with contrast enhancement (T1CE), T2, and Fluid-Attenuated Inversion Recovery (FLAIR) sequences for model development. The proposed Attention Xception UNet (AXUNet) architecture integrates an Xception backbone with dot-product self-attention modules, inspired by state-of-the-art (SOTA) large language models such as Google Bard and OpenAI ChatGPT, within a UNet-shaped model. We compared AXUNet with SOTA models. Comparative evaluation on the test set demonstrated improved results over baseline models. Inception-UNet and Xception-UNet achieved mean Dice scores of 90.88 and 93.24, respectively. Attention ResUNet (AResUNet) attained a mean Dice score of 92.80, with the highest score of 84.92 for enhancing tumor (ET) among all models. Attention Gate UNet (AGUNet) yielded a mean Dice score of 90.38. AXUNet outperformed all models with a mean Dice score of 93.73. It demonstrated superior Dice scores across whole tumor (WT) and tumor core (TC) regions, achieving 92.59 for WT, 86.81 for TC, and 84.89 for ET. The integration of the Xception backbone and dot-product self-attention mechanisms in AXUNet showcases enhanced performance in capturing spatial and contextual information. The findings underscore the potential utility of AXUNet in facilitating precise tumor delineation.
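
For readers unfamiliar with the metric, the Dice score used throughout the abstract can be computed as in the minimal sketch below. The masks are hypothetical flattened binary masks; the abstract reports Dice on a 0 to 100 scale, which corresponds to multiplying this value by 100. Treating two empty masks as a perfect match is a convention assumed here, not something the paper states.

```python
def dice_score(pred, target):
    """Dice = 2|A and B| / (|A| + |B|) for flat binary masks (0/1 values)."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of voxels")
    intersection = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total


# Hypothetical predicted and ground-truth masks for one tumor region.
pred = [1, 1, 0, 0, 1, 0]
target = [1, 0, 0, 0, 1, 1]
score = dice_score(pred, target)
print(round(100 * score, 2))
```

In a BraTS-style evaluation this would be computed separately for each region (WT, TC, ET) and then averaged over cases.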

[CV-113] Euclidean Distance to Convex Polyhedra and Application to Class Representation in Spectral Images

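[Quick read]: This paper addresses the estimation of abundance maps from observations alone, particularly for spectral images with too few bands or with overly correlated spectra, where traditional linear unmixing approaches are unsuitable. The key is a new approach that provides an adapted spatial density function based on any arbitrary linear classifier, combined with a robust mathematical formulation for computing the Euclidean distance to polyhedral sets and an efficient algorithm that finds the exact minimum-norm point in a polyhedron. An empirical evaluation on the Samson hyperspectral dataset shows that the method surpasses state-of-the-art approaches in reconstructing abundance maps, and an application to spectral images of a lithium-ion battery further confirms its generality and effectiveness.

Link: https://arxiv.org/abs/2503.20328
Authors: Antoine Bottenmuller(CMM),Florent Magaud(LRCS),Arnaud Demortière(LRCS),Etienne Decencière(CMM),Petr Dokladal(CMM)
Affiliations: Unknown
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: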

Abstract:With the aim of estimating the abundance map from observations only, linear unmixing approaches are not always suitable to spectral images, especially when the number of bands is too small or when the spectra of the observed data are too correlated. To address this issue in the general case, we present a novel approach which provides an adapted spatial density function based on any arbitrary linear classifier. A robust mathematical formulation for computing the Euclidean distance to polyhedral sets is presented, along with an efficient algorithm that provides the exact minimum-norm point in a polyhedron. An empirical evaluation on the widely-used Samson hyperspectral dataset demonstrates that the proposed method surpasses state-of-the-art approaches in reconstructing abundance maps. Furthermore, its application to spectral images of a Lithium-ion battery, incompatible with linear unmixing models, validates the method’s generality and effectiveness.
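
To make the idea of a Euclidean distance to a polyhedron concrete, here is a small sketch that projects a point onto an intersection of half-spaces. It uses Dykstra's alternating projection method, which is a swapped-in, iterative technique chosen for brevity; it is not the exact minimum-norm-point algorithm the paper proposes, and the polyhedron and query point are hypothetical.

```python
import math


def project_halfspace(y, a, b):
    """Project y onto the half-space {x : a.x <= b}."""
    violation = sum(ai * yi for ai, yi in zip(a, y)) - b
    if violation <= 0:
        return list(y)
    scale = violation / sum(ai * ai for ai in a)
    return [yi - scale * ai for ai, yi in zip(a, y)]


def project_polyhedron(point, halfspaces, sweeps=200):
    """Dykstra's method: cycle through the half-spaces with correction terms."""
    x = list(point)
    corrections = [[0.0] * len(x) for _ in halfspaces]
    for _ in range(sweeps):
        for i, (a, b) in enumerate(halfspaces):
            y = [xi + ci for xi, ci in zip(x, corrections[i])]
            x = project_halfspace(y, a, b)
            corrections[i] = [yi - xi for yi, xi in zip(y, x)]
    return x


# Hypothetical polyhedron: x <= 1, y <= 1, x + y <= 1.5
halfspaces = [([1.0, 0.0], 1.0), ([0.0, 1.0], 1.0), ([1.0, 1.0], 1.5)]
query = [2.0, 2.0]
nearest = project_polyhedron(query, halfspaces)
distance = math.dist(query, nearest)
print(nearest, distance)
```

Plain alternating projections without the correction terms would find some feasible point but not necessarily the nearest one, which is why the corrections are kept.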

[CV-114] AI-Driven MRI Spine Pathology Detection: A Comprehensive Deep Learning Approach for Automated Diagnosis in Diverse Clinical Settings

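[Quick read]: This paper addresses automated diagnosis in MRI-based detection of spine pathology. Traditional practice relies on manual reading, which is inefficient and subject to reader variability, especially where medical resources are unevenly distributed. In response, the paper presents an autonomous AI system whose key feature is the integration of several advanced deep learning architectures: Vision Transformers, U-Net with cross-attention, MedSAM, and Cascade R-CNN. Together they provide comprehensive classification, segmentation, and detection of 43 distinct spinal pathologies, and the dataset is balanced across age groups, genders, and scanner manufacturers to support robustness and adaptability. Subgroup analyses confirm stable performance across patient groups, imaging conditions, and equipment types. The system reaches up to 97.9% multi-pathology detection and 98.0% accuracy for normal versus abnormal classification, markedly improving diagnostic efficiency and reporting speed for MRI spine scans.

Link: https://arxiv.org/abs/2503.20316
Authors: Bargava Subramanian,Naveen Kumarasami,Praveen Shastry,Raghotham Sripadraj,Kalyan Sivasailam,Anandakumar D,Abinaya Ramachandran,Sudhir MP,Gunakutti G,Kishore Prasath Venkatesh
Affiliations: Unknown
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: 20 pages, 3 figures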

Abstract:Study Design: This study presents the development of an autonomous AI system for MRI spine pathology detection, trained on a dataset of 2 million MRI spine scans sourced from diverse healthcare facilities across India. The AI system integrates advanced architectures, including Vision Transformers, U-Net with cross-attention, MedSAM, and Cascade R-CNN, enabling comprehensive classification, segmentation, and detection of 43 distinct spinal pathologies. The dataset is balanced across age groups, genders, and scanner manufacturers to ensure robustness and adaptability. Subgroup analyses were conducted to validate the model’s performance across different patient demographics, imaging conditions, and equipment types. Performance: The AI system achieved up to 97.9 percent multi-pathology detection, demonstrating consistent performance across age, gender, and manufacturer subgroups. The normal vs. abnormal classification achieved 98.0 percent accuracy, and the system was deployed across 13 major healthcare enterprises in India, encompassing diagnostic centers, large hospitals, and government facilities. During deployment, it processed over 100,000 MRI spine scans, leading to reduced reporting times and increased diagnostic efficiency by automating the identification of common spinal conditions. Conclusion: The AI system’s high precision and recall validate its capability as a reliable tool for autonomous normal/abnormal classification, pathology segmentation, and detection. Its scalability and adaptability address critical diagnostic gaps, optimize radiology workflows, and improve patient care across varied healthcare environments in India.
MSC classes: 68T07. Cite as: arXiv:2503.20316 [eess.IV] (or arXiv:2503.20316v1 [eess.IV] for this version), https://doi.org/10.48550/arXiv.2503.20316

[CV-115] 3D Convolutional Neural Networks for Improved Detection of Intracranial bleeding in CT Imaging

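[Quick read]: This paper addresses fast and accurate detection of intracranial bleeding (IB) in emergency settings. Traditional imaging workflows are slow and prone to human variability, which makes them hard to rely on under pressure. The paper proposes a solution based on a U-shaped 3D Convolutional Neural Network (CNN). Advanced image preprocessing (such as CLAHE and intensity normalization) improves image quality, and the architecture preserves spatial and contextual details for precise segmentation, enabling automated detection and classification of intracranial bleeding. The key is the combination of efficient data augmentation and model design, with precision, recall, and accuracy above 90% across multiple bleed types and 96% precision or higher for some, which markedly improves diagnostic efficiency and clinical reliability.

Link: https://arxiv.org/abs/2503.20306
Authors: Bargava Subramanian,Naveen Kumarasami,Praveen Shastry,Kalyan Sivasailam,Anandakumar D,Elakkiya R,Harsha KG,Rithanya V,Harini T,Afshin Hussain,Kishore Prasath Venkatesh
Affiliations: Unknown
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: 12 pages, 4 figures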

Abstract:Background: Intracranial bleeding (IB) is a life-threatening condition caused by traumatic brain injuries, including epidural, subdural, subarachnoid, and intraparenchymal hemorrhages. Rapid and accurate detection is crucial to prevent severe complications. Traditional imaging can be slow and prone to variability, especially in high-pressure scenarios. Artificial Intelligence (AI) provides a solution by quickly analyzing medical images, identifying subtle hemorrhages, and flagging urgent cases. By enhancing diagnostic speed and accuracy, AI improves workflows and patient care. This article explores AI’s role in transforming IB detection in emergency settings. Methods: A U-shaped 3D Convolutional Neural Network (CNN) automates IB detection and classification in volumetric CT scans. Advanced preprocessing, including CLAHE and intensity normalization, enhances image quality. The architecture preserves spatial and contextual details for precise segmentation. A dataset of 2,912 annotated CT scans was used for training and evaluation. Results: The model achieved high performance across major bleed types, with precision, recall, and accuracy exceeding 90 percent in most cases, including 96 percent precision for epidural hemorrhages and 94 percent accuracy for subarachnoid hemorrhages. Its ability to classify and localize hemorrhages highlights its clinical reliability. Conclusion: This U-shaped 3D CNN offers a scalable solution for automating IB detection, reducing diagnostic delays, and improving emergency care outcomes. Future work will expand dataset diversity, optimize real-time processing, and integrate multimodal data for enhanced clinical applicability.
MSC classes: 68T07. Cite as: arXiv:2503.20306 [eess.IV] (or arXiv:2503.20306v1 [eess.IV] for this version), https://doi.org/10.48550/arXiv.2503.20306. Submission history: from Anandakumar D, [v1] Wed, 26 Mar 2025 08:10:29 UTC (517 KB)

[CV-116] Med3DVLM: An Efficient Vision-Language Model for 3D Medical Image Analysis

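[Quick read]: This paper addresses the challenges of extending vision-language models (VLMs) from 2D medical image analysis to 3D, chiefly the high computational demands of volumetric data and the difficulty of aligning 3D spatial features with clinical text. It proposes Med3DVLM, with three key innovations: (1) DCFormer, an encoder that uses decomposed 3D convolutions to capture fine-grained spatial features efficiently; (2) SigLIP, a contrastive learning strategy with a pairwise sigmoid loss that improves image-text alignment without relying on large negative batches; and (3) a dual-stream MLP-Mixer projector that fuses low- and high-level image features with text embeddings to build richer multimodal representations. Together these give Med3DVLM superior performance on multiple benchmarks.

Link: https://arxiv.org/abs/2503.20047
Authors: Yu Xin,Gorkem Can Ates,Kuang Gong,Wei Shao
Affiliations: University of Florida
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: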

Abstract:Vision-language models (VLMs) have shown promise in 2D medical image analysis, but extending them to 3D remains challenging due to the high computational demands of volumetric data and the difficulty of aligning 3D spatial features with clinical text. We present Med3DVLM, a 3D VLM designed to address these challenges through three key innovations: (1) DCFormer, an efficient encoder that uses decomposed 3D convolutions to capture fine-grained spatial features at scale; (2) SigLIP, a contrastive learning strategy with pairwise sigmoid loss that improves image-text alignment without relying on large negative batches; and (3) a dual-stream MLP-Mixer projector that fuses low- and high-level image features with text embeddings for richer multi-modal representations. We evaluate our model on the M3D dataset, which includes radiology reports and VQA data for 120,084 3D medical images. Results show that Med3DVLM achieves superior performance across multiple benchmarks. For image-text retrieval, it reaches 61.00% R@1 on 2,000 samples, significantly outperforming the current state-of-the-art M3D model (19.10%). For report generation, it achieves a METEOR score of 36.42% (vs. 14.38%). In open-ended visual question answering (VQA), it scores 36.76% METEOR (vs. 33.58%), and in closed-ended VQA, it achieves 79.95% accuracy (vs. 75.78%). These results highlight Med3DVLM’s ability to bridge the gap between 3D imaging and language, enabling scalable, multi-task reasoning across clinical applications. Our code is publicly available at this https URL.
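
The pairwise sigmoid loss named in innovation (2) can be sketched in a few lines. This is a plain-Python illustration of the general SigLIP-style objective, not Med3DVLM's training code: the embeddings are hypothetical, they are assumed to be L2-normalized, and the `temperature` and `bias` defaults are illustrative values rather than settings taken from the paper.

```python
import math


def pairwise_sigmoid_loss(image_embs, text_embs, temperature=10.0, bias=-10.0):
    """Sigmoid loss over every image-text pair in the batch.

    Each pair is an independent binary decision (match or no match), so no
    softmax normalization across the batch is needed.
    """
    n = len(image_embs)
    total = 0.0
    for i, img in enumerate(image_embs):
        for j, txt in enumerate(text_embs):
            logit = temperature * sum(a * b for a, b in zip(img, txt)) + bias
            label = 1.0 if i == j else -1.0  # +1 for the matched pair
            # -log(sigmoid(z)) equals log(1 + exp(-z))
            total += math.log1p(math.exp(-label * logit))
    return total / n


# Hypothetical unit-norm embeddings for a batch of two image-text pairs.
images = [[1.0, 0.0], [0.0, 1.0]]
texts_aligned = [[1.0, 0.0], [0.0, 1.0]]
texts_swapped = [[0.0, 1.0], [1.0, 0.0]]

loss_aligned = pairwise_sigmoid_loss(images, texts_aligned)
loss_swapped = pairwise_sigmoid_loss(images, texts_swapped)
print(loss_aligned, loss_swapped)
```

Because every pair is scored independently, the loss does not depend on having many negatives in the same batch, which is the property the abstract highlights.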

[CV-117] Optimizing Breast Cancer Detection in Mammograms: A Comprehensive Study of Transfer Learning Resolution Reduction and Multi-View Classification

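[Quick read]: This paper addresses open questions in applying machine learning to breast cancer detection in mammograms. The study is organized around five core questions: (1) Is the intermediate patch classifier essential for optimal performance? (2) Do backbone models that excel at natural image classification consistently outperform others on mammograms? (3) When mammogram resolution is reduced for GPU processing, does the learn-to-resize technique outperform conventional methods? (4) Does combining both views in a two-view classifier significantly improve detection accuracy? (5) Do these findings differ between low-quality and high-quality mammograms? The key is to investigate these questions systematically and thereby optimize model architecture and transfer learning strategy for single-view and two-view classifiers, leading to more accurate and efficient mammogram analysis.

Link: https://arxiv.org/abs/2503.19945
Authors: Daniel G. P. Petrini,Hae Yong Kim
Affiliations: Department of Electronic Systems Engineering, Polytechnic School, University of São Paulo
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: 8 pages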

Abstract:This study explores open questions in the application of machine learning for breast cancer detection in mammograms. Current approaches often employ a two-stage transfer learning process: first, adapting a backbone model trained on natural images to develop a patch classifier, which is then used to create a single-view whole-image classifier. Additionally, many studies leverage both mammographic views to enhance model performance. In this work, we systematically investigate five key questions: (1) Is the intermediate patch classifier essential for optimal performance? (2) Do backbone models that excel in natural image classification consistently outperform others on mammograms? (3) When reducing mammogram resolution for GPU processing, does the learn-to-resize technique outperform conventional methods? (4) Does incorporating both mammographic views in a two-view classifier significantly improve detection accuracy? (5) How do these findings vary when analyzing low-quality versus high-quality mammograms? By addressing these questions, we developed models that outperform previous results for both single-view and two-view classifiers. Our findings provide insights into model architecture and transfer learning strategies contributing to more accurate and efficient mammogram analysis.

[CV-118] Mapping fMRI Signal and Image Stimuli in an Artificial Neural Network Latent Space: Bringing Artificial and Natural Minds Together

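[Quick read]: This paper investigates whether latent space representations of visual stimuli and functional magnetic resonance imaging (fMRI) data share common information. Its key approach is to compare the latent spaces of an autoencoder (AE) trained on fMRI data and a Vision Transformer (ViT) trained on image data, as a way to assess the feasibility of decoding and reconstructing stimuli from fMRI data. Using representational similarity analysis (RSA), the study finds that the latent spaces of the two domains appear different, but these preliminary results are inconclusive and call for further research.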

Link: https://arxiv.org/abs/2503.19923
Authors: Cesare Maria Dalbagno,Manuel de Castro Ribeiro Jardim,Mihnea Angheluţă
Affiliation: Department of Cognitive Science and Artificial Intelligence, Tilburg University
Categories: Neurons and Cognition (q-bio.NC); Computer Vision and Pattern Recognition (cs.CV)
Comments: 5 pages, 3 figures

Click to view abstract

Abstract:The goal of this study is to investigate whether latent space representations of visual stimuli and fMRI data share common information. Decoding and reconstructing stimuli from fMRI data remains a challenge in AI and neuroscience, with significant implications for understanding neural representations and improving the interpretability of Artificial Neural Networks (ANNs). In this preliminary study, we investigate the feasibility of such reconstruction by examining the similarity between the latent spaces of one autoencoder (AE) and one vision transformer (ViT) trained on fMRI and image data, respectively. Using representational similarity analysis (RSA), we found that the latent spaces of the two domains appear different. However, these initial findings are inconclusive, and further research is needed to explore this relationship more thoroughly.
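
The RSA comparison described in this abstract can be sketched in a few lines. This is a minimal illustration rather than the authors' pipeline: the latent sizes, the random data, the correlation-distance dissimilarity, and the names `rdm` and `rsa_score` are all assumptions made for the example.

```python
import numpy as np

def rdm(latents):
    """Representational dissimilarity matrix: 1 - Pearson correlation between stimuli."""
    return 1.0 - np.corrcoef(latents)

def rank(values):
    """Ranks (0..n-1) of a 1-D array; ties are not averaged in this sketch."""
    order = np.argsort(values)
    ranks = np.empty(len(values), dtype=float)
    ranks[order] = np.arange(len(values))
    return ranks

def rsa_score(latents_a, latents_b):
    """Spearman correlation between the upper triangles of the two RDMs."""
    upper = np.triu_indices(latents_a.shape[0], k=1)
    a = rank(rdm(latents_a)[upper])
    b = rank(rdm(latents_b)[upper])
    return float(np.corrcoef(a, b)[0, 1])

rng = np.random.default_rng(0)
fmri_latents = rng.normal(size=(20, 64))    # 20 stimuli, hypothetical AE latent size
image_latents = rng.normal(size=(20, 128))  # same 20 stimuli, hypothetical ViT latent size

score_same = rsa_score(fmri_latents, fmri_latents)
score_cross = rsa_score(fmri_latents, image_latents)
```

Because each RDM is stimulus-by-stimulus, the two latent spaces may have different dimensionality; only the number and order of stimuli must match. A score near zero would be consistent with the "appear different" finding, although random data like the above says nothing about the real result.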

Artificial Intelligence

[AI-0] Optimal Scaling Laws for Efficiency Gains in a Theoretical Transformer-Augmented Sectional MoE Framework

Link: https://arxiv.org/abs/2503.20750
Authors: Soham Sane
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Click to view abstract

Abstract:This paper introduces a theoretical framework for a Transformer-augmented, sectional Mixture-of-Experts (MoE) architecture that aims to enhance computational efficiency while preserving model scalability. Unlike conventional MoE models, which route entire token embeddings to selected experts, our approach partitions the embedding dimension itself, assigning segments of each token’s representation to dedicated experts. To combat losses in token representation, we utilize a pre-expert transformer layer to recompute attention across tokens and reduce the sequence length dimensionality. We extend our theory by deriving optimal scaling laws that capture a non-linear relationship between the number of experts and factors such as model dimensionality, sequence length, and system overhead. These formulations yield closed-form and numerically solvable expressions for identifying the optimal expert count under given architectural and hardware constraints. As a result, our framework not only provides theoretical bounds for computing efficiency with varying frameworks but also guides practical design choices for scaling large models effectively. While empirical validation is pending, we present a comprehensive experimental road map to evaluate the framework’s efficiency, scalability, and practicality in future work.
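
To make the "sectional" idea concrete, the sketch below splits each token embedding into equal segments and hands one segment to each expert. It is an illustration under stated assumptions, not the paper's formulation: the experts are plain linear maps, the pre-expert transformer layer is omitted, and `sectional_moe` is a hypothetical name.

```python
import numpy as np

def sectional_moe(tokens, expert_weights):
    """Split every token embedding into equal sections, one section per expert."""
    num_experts = len(expert_weights)
    sections = np.split(tokens, num_experts, axis=-1)   # each: (seq_len, d_model / E)
    outputs = [section @ weight for section, weight in zip(sections, expert_weights)]
    return np.concatenate(outputs, axis=-1)              # back to (seq_len, d_model)

rng = np.random.default_rng(0)
seq_len, d_model, num_experts = 6, 16, 4
section_dim = d_model // num_experts

tokens = rng.normal(size=(seq_len, d_model))
# Assumption: each expert is a single square linear map over its own section.
experts = [rng.normal(size=(section_dim, section_dim)) for _ in range(num_experts)]

out = sectional_moe(tokens, experts)
```

With linear experts as assumed here, each token costs E·(d/E)² = d²/E multiply-adds, compared with d² for one full-width linear map, which hints at why the expert count enters the efficiency analysis non-linearly once overheads are added.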

[AI-1] Quantum Neural Network Restatement of Markov Jump Process

Link: https://arxiv.org/abs/2503.20742
Authors: Z.Zarezadeh,N.Zarezadeh
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Numerical Analysis (math.NA)

Click to view abstract

Abstract:Despite the many challenges in exploratory data analysis, artificial neural networks have attracted strong interest from scientists and researchers in both theoretical and practical applications. Among the sources of this popularity are their ability to model non-linear dynamical systems, to generalize, and to adapt. Despite this, there is still significant debate about the role of various underlying stochastic processes in stabilizing a unique structure for data learning and prediction. One obstacle to the theoretical and numerical study of machine intelligent systems is the curse of dimensionality and sampling from high-dimensional probability distributions. In general, this curse prevents an efficient description of states, posing a significant complexity barrier to describing and studying the system efficiently. In this strand of research, the direct treatment and description of such abstract notions of learning theory in terms of quantum information is one of the most favorable candidates. Hence, the subject matter of these articles is devoted to problems of design, adaptation, and the formulation of computationally hard problems in terms of quantum mechanical systems. To characterize the microscopic description of such dynamics in the language of inferential statistics, covariance matrix estimation of d-dimensional Gaussian densities and the Bayesian interpretation of the eigenvalue problem for dynamical systems are assessed.

[AI-2] Graph-Enhanced Model-Free Reinforcement Learning Agents for Efficient Power Grid Topological Control

Link: https://arxiv.org/abs/2503.20688
Authors: Eloy Anguiano Batanero,Ángela Fernández,Álvaro Barbero
Categories: Artificial Intelligence (cs.AI)

Click to view abstract

Abstract:The increasing complexity of power grid management, driven by the emergence of prosumers and the demand for cleaner energy solutions, has necessitated innovative approaches to ensure stability and efficiency. This paper presents a novel approach within the model-free framework of reinforcement learning, aimed at optimizing power network operations without prior expert knowledge. We introduce a masked topological action space, enabling agents to explore diverse strategies for cost reduction while maintaining reliable service using the state logic as a guide for choosing proper actions. Through extensive experimentation across 20 different scenarios in a simulated 5-substation environment, we demonstrate that our approach achieves a consistent reduction in power losses, while ensuring grid stability against potential blackouts. The results underscore the effectiveness of combining dynamic observation formalization with opponent-based training, pointing to a viable path toward autonomous management solutions in modern energy systems or even toward building a foundational model for this field.

[AI-3] Inductive Link Prediction on N-ary Relational Facts via Semantic Hypergraph Reasoning KDD KDD’25

Link: https://arxiv.org/abs/2503.20676
Authors: Gongzhu Yin,Hongli Zhang,Yuchen Yang,Yi Luo
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: To be published in Proceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V.1 (KDD’25)

Click to view abstract

Abstract:N-ary relational facts represent semantic correlations among more than two entities. While recent studies have developed link prediction (LP) methods to infer missing relations for knowledge graphs (KGs) containing n-ary relational facts, they are generally limited to transductive settings. Fully inductive settings, where predictions are made on previously unseen entities, remain a significant challenge. As existing methods are mainly entity embedding-based, they struggle to capture entity-independent logical rules. To fill in this gap, we propose an n-ary subgraph reasoning framework for fully inductive link prediction (ILP) on n-ary relational facts. This framework reasons over local subgraphs and has a strong inductive inference ability to capture n-ary patterns. Specifically, we introduce a novel graph structure, the n-ary semantic hypergraph, to facilitate subgraph extraction. Moreover, we develop a subgraph aggregating network, NS-HART, to effectively mine complex semantic correlations within subgraphs. Theoretically, we provide a thorough analysis from the score function optimization perspective to shed light on NS-HART’s effectiveness for n-ary ILP tasks. Empirically, we conduct extensive experiments on a series of inductive benchmarks, including transfer reasoning (with and without entity features) and pairwise subgraph reasoning. The results highlight the superiority of the n-ary subgraph reasoning framework and the exceptional inductive ability of NS-HART. The source code of this paper has been made publicly available at this https URL.

[AI-4] Procedural Knowledge Ontology (PKO)

Link: https://arxiv.org/abs/2503.20634
Authors: Valentina Anita Carriero,Mario Scrocca,Ilaria Baroni,Antonia Azzini,Irene Celino
Categories: Artificial Intelligence (cs.AI)

Click to view abstract

Abstract:Processes, workflows and guidelines are core to ensure the correct functioning of industrial companies: for the successful operations of factory lines, machinery or services, often industry operators rely on their past experience and know-how. The effect is that this Procedural Knowledge (PK) remains tacit and, as such, difficult to exploit efficiently and effectively. This paper presents PKO, the Procedural Knowledge Ontology, which enables the explicit modeling of procedures and their executions, by reusing and extending existing ontologies. PKO is built on requirements collected from three heterogeneous industrial use cases and can be exploited by any AI and data-driven tools that rely on a shared and interoperable representation to support the governance of PK throughout its life cycle. We describe its structure and design methodology, and outline its relevance, quality, and impact by discussing applications leveraging PKO for PK elicitation and exploitation.

[AI-5] β-GNN: A Robust Ensemble Approach Against Graph Structure Perturbation

Link: https://arxiv.org/abs/2503.20630
Authors: Haci Ismail Aslan,Philipp Wiesner,Ping Xiong,Odej Kao
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: This is the author’s version of the paper accepted at EuroMLSys 2025

Click to view abstract

Abstract:Graph Neural Networks (GNNs) are playing an increasingly important role in the efficient operation and security of computing systems, with applications in workload scheduling, anomaly detection, and resource management. However, their vulnerability to network perturbations poses a significant challenge. We propose β-GNN, a model enhancing GNN robustness without sacrificing clean data performance. β-GNN uses a weighted ensemble, combining any GNN with a multi-layer perceptron. A learned dynamic weight, β, modulates the GNN’s contribution. This β not only weights GNN influence but also indicates data perturbation levels, enabling proactive mitigation. Experimental results on diverse datasets show β-GNN’s superior adversarial accuracy and attack severity quantification. Crucially, β-GNN avoids perturbation assumptions, preserving clean data structure and performance.
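
The weighted ensemble in this abstract can be pictured as a blend of two sets of class predictions. The sketch below is one plausible reading, not the paper's exact formula: the convex combination of softmax outputs, the sigmoid parameterization of β, and the name `beta_ensemble` are assumptions, and the logits are made-up numbers standing in for real GNN and MLP outputs.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax with the usual max-shift for numerical stability."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def beta_ensemble(gnn_logits, mlp_logits, beta_raw):
    """Blend GNN and MLP class probabilities with a weight beta in (0, 1)."""
    beta = 1.0 / (1.0 + np.exp(-beta_raw))   # assumption: sigmoid keeps beta in (0, 1)
    probs = beta * softmax(gnn_logits) + (1.0 - beta) * softmax(mlp_logits)
    return probs, beta

# Hypothetical logits for 3 nodes and 2 classes.
gnn_logits = np.array([[2.0, 0.1], [0.3, 1.5], [1.0, 1.0]])
mlp_logits = np.array([[1.2, 0.4], [0.2, 0.9], [0.1, 2.0]])

probs, beta = beta_ensemble(gnn_logits, mlp_logits, beta_raw=0.0)
```

Under this reading, a β that drifts toward 0 during training means the model leans on the structure-free MLP, which is how a single scalar could double as a signal of how perturbed the graph is.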

[AI-6] State-Aware Perturbation Optimization for Robust Deep Reinforcement Learning

Link: https://arxiv.org/abs/2503.20613
Authors: Zongyuan Zhang,Tianyang Duan,Zheng Lin,Dong Huang,Zihan Fang,Zekai Sun,Ling Xiong,Hongbin Liang,Heming Cui,Yong Cui
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Networking and Internet Architecture (cs.NI); Systems and Control (eess.SY)
Comments: 15 pages, 11 figures

Click to view abstract

Abstract:Recently, deep reinforcement learning (DRL) has emerged as a promising approach for robotic control. However, the deployment of DRL in real-world robots is hindered by its sensitivity to environmental perturbations. While existing white-box adversarial attacks rely on local gradient information and apply uniform perturbations across all states to evaluate DRL robustness, they fail to account for temporal dynamics and state-specific vulnerabilities. To address this challenge, we first conduct a theoretical analysis of white-box attacks in DRL by establishing the adversarial victim-dynamics Markov decision process (AVD-MDP), to derive the necessary and sufficient conditions for a successful attack. Based on this, we propose a selective state-aware reinforcement adversarial attack method, named STAR, to optimize perturbation stealthiness and state visitation dispersion. STAR first employs a soft mask-based state-targeting mechanism to minimize redundant perturbations, enhancing stealthiness and attack effectiveness. Then, it incorporates an information-theoretic optimization objective to maximize mutual information between perturbations, environmental states, and victim actions, ensuring a dispersed state-visitation distribution that steers the victim agent into vulnerable states for maximum return reduction. Extensive experiments demonstrate that STAR outperforms state-of-the-art benchmarks.

[AI-7] Perspective-Shifted Neuro-Symbolic World Models: A Framework for Socially-Aware Robot Navigation

Link: https://arxiv.org/abs/2503.20425
Authors: Kevin Alcedo,Pedro U. Lima,Rachid Alami
Categories: Artificial Intelligence (cs.AI); Robotics (cs.RO)

Click to view abstract

Abstract:Navigating in environments alongside humans requires agents to reason under uncertainty and account for the beliefs and intentions of those around them. Under a sequential decision-making framework, egocentric navigation can naturally be represented as a Markov Decision Process (MDP). However, social navigation additionally requires reasoning about the hidden beliefs of others, inherently leading to a Partially Observable Markov Decision Process (POMDP), where agents lack direct access to others’ mental states. Inspired by Theory of Mind and Epistemic Planning, we propose (1) a neuro-symbolic model-based reinforcement learning architecture for social navigation, addressing the challenge of belief tracking in partially observable environments; and (2) a perspective-shift operator for belief estimation, leveraging recent work on Influence-based Abstractions (IBA) in structured multi-agent settings.

[AI-8] Including local feature interactions in deep non-negative matrix factorization networks improves performance

Link: https://arxiv.org/abs/2503.20398
Authors: Mahbod Nouri,David Rotermund,Alberto Garcia-Ortiz,Klaus R. Pawelzik
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Click to view abstract

Abstract:The brain uses positive signals as a means of signaling. Forward interactions in the early visual cortex are also positive, realized by excitatory synapses. Only local interactions also include inhibition. Non-negative matrix factorization (NMF) captures the biological constraint of positive long-range interactions and can be implemented with stochastic spikes. While NMF can serve as an abstract formalization of early neural processing in the visual system, the performance of deep convolutional networks with NMF modules does not match that of CNNs of similar size. However, when the local NMF modules are each followed by a module that mixes the NMF’s positive activities, the performances on the benchmark data exceed that of vanilla deep convolutional networks of similar size. This setting can be considered a biologically more plausible emulation of the processing in cortical (hyper-)columns with the potential to improve the performance of deep networks.

[AI-9] FastFT: Accelerating Reinforced Feature Transformation via Advanced Exploration Strategies ICDE2025

Link: https://arxiv.org/abs/2503.20394
Authors: Tianqi He,Xiaohan Huang,Yi Du,Qingqing Long,Ziyue Qiao,Min Wu,Yanjie Fu,Yuanchun Zhou,Meng Xiao
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 14 pages, Accepted by ICDE 2025

Click to view abstract

Abstract:Feature Transformation is crucial for classic machine learning that aims to generate feature combinations to enhance the performance of downstream tasks from a data-centric perspective. Current methodologies, such as manual expert-driven processes, iterative-feedback techniques, and exploration-generative tactics, have shown promise in automating such data engineering workflows by minimizing human involvement. However, three challenges remain in those frameworks: (1) They depend predominantly on downstream task performance metrics, whose assessment is time-consuming, especially for large datasets. (2) The diversity of feature combinations can hardly be guaranteed after random exploration ends. (3) Significant transformations are rare, so valuable feedback is sparse, which hinders the learning process or leads to less effective results. In response to these challenges, we introduce FastFT, an innovative framework that leverages a trio of advanced strategies. We first decouple the feature transformation evaluation from the outcomes of the generated datasets via the performance predictor. To address the issue of reward sparsity, we developed a method to evaluate the novelty of generated transformation sequences. Incorporating this novelty into the reward function accelerates the model’s exploration of effective transformations, thereby improving the search productivity. Additionally, we combine novelty and performance to create a prioritized memory buffer, ensuring that essential experiences are effectively revisited during exploration. Our extensive experimental evaluations validate the performance, efficiency, and traceability of our proposed framework, showcasing its superiority in handling complex feature transformation tasks.

[AI-10] MoLe-VLA: Dynamic Layer-skipping Vision Language Action Model via Mixture-of-Layers for Efficient Robot Manipulation

Link: https://arxiv.org/abs/2503.20384
Authors: Rongyu Zhang,Menghang Dong,Yuan Zhang,Liang Heng,Xiaowei Chi,Gaole Dai,Li Du,Dan Wang,Yuan Du,Shanghang Zhang
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)

Click to view abstract

Abstract:Multimodal Large Language Models (MLLMs) excel in understanding complex language and visual data, enabling generalist robotic systems to interpret instructions and perform embodied tasks. Nevertheless, their real-world deployment is hindered by substantial computational and storage demands. Recent insights into the homogeneous patterns in the LLM layer have inspired sparsification techniques to address these challenges, such as early exit and token pruning. However, these methods often neglect the critical role of the final layers that encode the semantic information most relevant to downstream robotic tasks. Aligning with the recent breakthrough of the Shallow Brain Hypothesis (SBH) in neuroscience and the mixture of experts in model sparsification, we conceptualize each LLM layer as an expert and propose a Mixture-of-Layers Vision-Language-Action model (MoLe-VLA, or simply MoLe) architecture for dynamic LLM layer activation. We introduce a Spatial-Temporal Aware Router (STAR) for MoLe to selectively activate only parts of the layers based on the robot’s current state, mimicking the brain’s distinct signal pathways specialized for cognition and causal reasoning. Additionally, to compensate for the cognitive ability of LLMs lost in MoLe, we devise a Cognition Self-Knowledge Distillation (CogKD) framework. CogKD enhances the understanding of task demands and improves the generation of task-relevant action sequences by leveraging cognitive features. Extensive experiments conducted in both RLBench simulation and real-world environments demonstrate the superiority of MoLe-VLA in both efficiency and performance. Specifically, MoLe-VLA achieves an 8% improvement in the mean success rate across ten tasks while reducing computational costs by up to x5.6 compared to standard LLMs.

[AI-11] Wasserstein Distributionally Robust Bayesian Optimization with Continuous Context

Link: https://arxiv.org/abs/2503.20341
Authors: Francesco Micheli,Efe C. Balta,Anastasios Tsiamis,John Lygeros
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)

Click to view abstract

Abstract:We address the challenge of sequential data-driven decision-making under context distributional uncertainty. This problem arises in numerous real-world scenarios where the learner optimizes black-box objective functions in the presence of uncontrollable contextual variables. We consider the setting where the context distribution is uncertain but known to lie within an ambiguity set defined as a ball in the Wasserstein distance. We propose a novel algorithm for Wasserstein Distributionally Robust Bayesian Optimization that can handle continuous context distributions while maintaining computational tractability. Our theoretical analysis combines recent results in self-normalized concentration in Hilbert spaces and finite-sample bounds for distributionally robust optimization to establish sublinear regret bounds that match state-of-the-art results. Through extensive comparisons with existing approaches on both synthetic and real-world problems, we demonstrate the simplicity, effectiveness, and practical applicability of our proposed method.

[AI-12] Model-Based Offline Reinforcement Learning with Adversarial Data Augmentation

Link: https://arxiv.org/abs/2503.20285
Authors: Hongye Cao,Fan Feng,Jing Huo,Shangdong Yang,Meng Fang,Tianpei Yang,Yang Gao
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Click to view abstract

Abstract:Model-based offline Reinforcement Learning (RL) constructs environment models from offline datasets to perform conservative policy optimization. Existing approaches focus on learning state transitions through ensemble models and roll out with conservative estimation to mitigate extrapolation errors. However, the static data makes it challenging to develop a robust policy, and offline agents cannot access the environment to gather new data. To address these challenges, we introduce Model-based Offline Reinforcement learning with AdversariaL data augmentation (MORAL). In MORAL, we replace the fixed horizon rollout by employing adversarial data augmentation to execute alternating sampling with ensemble models to enrich training data. Specifically, this adversarial process dynamically selects ensemble models against policy for biased sampling, mitigating the optimistic estimation of fixed models, thus robustly expanding the training data for policy optimization. Moreover, a differential factor is integrated into the adversarial process for regularization, ensuring error minimization in extrapolations. This data-augmented optimization adapts to diverse offline tasks without rollout horizon tuning, showing remarkable applicability. Extensive experiments on the D4RL benchmark demonstrate that MORAL outperforms other model-based offline RL methods in terms of policy learning and sample efficiency.

[AI-13] Are We There Yet? Unraveling the State-of-the-Art Graph Network Intrusion Detection Systems

Link: https://arxiv.org/abs/2503.20281
Authors: Chenglong Wang,Pujia Zheng,Jiaping Gui,Cunqing Hua,Wajih Ul Hassan
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)

Click to view abstract

Abstract:Network Intrusion Detection Systems (NIDS) are vital for ensuring enterprise security. Recently, Graph-based NIDS (GIDS) have attracted considerable attention because of their capability to effectively capture the complex relationships within the graph structures of data communications. Despite their promise, the reproducibility and replicability of these GIDS remain largely unexplored, posing challenges for developing reliable and robust detection systems. This study bridges this gap by designing a systematic approach to evaluate state-of-the-art GIDS, which includes critically assessing, extending, and clarifying the findings of these systems. We further assess the robustness of GIDS under adversarial attacks. Evaluations were conducted on three public datasets as well as a newly collected large-scale enterprise dataset. Our findings reveal significant performance discrepancies, highlighting challenges related to dataset scale, model inputs, and implementation settings. We demonstrate difficulties in reproducing and replicating results, particularly concerning false positive rates and robustness against adversarial attacks. This work provides valuable insights and recommendations for future research, emphasizing the importance of rigorous reproduction and replication studies in developing robust and generalizable GIDS solutions.

[AI-14] ESSR: An 8K@30FPS Super-Resolution Accelerator With Edge Selective Network

Link: https://arxiv.org/abs/2503.20245
Authors: Chih-Chia Hsu,Tian-Sheuan Chang
Categories: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI); Multimedia (cs.MM); Image and Video Processing (eess.IV)

Click to view abstract

Abstract:Deep learning-based super-resolution (SR) is challenging to implement in resource-constrained edge devices for resolutions beyond full HD due to its high computational complexity and memory bandwidth requirements. This paper introduces an 8K@30FPS SR accelerator with edge-selective dynamic input processing. Dynamic processing chooses the appropriate subnets for different patches based on simple input edge criteria, achieving a 50% MAC reduction with only a 0.1dB PSNR decrease. The quality of reconstructed images is guaranteed and maximized through resource adaptive model switching, even under resource constraints. In conjunction with hardware-specific refinements, the model size is reduced by 84% to 51K, with a PSNR decrease of less than 0.6dB. Additionally, to support dynamic processing with high utilization, this design incorporates a configurable group of layer mapping that synergizes with the structure-friendly fusion block, resulting in 77% hardware utilization and up to 79% reduction in feature SRAM access. The implementation, using the TSMC 28nm process, can achieve 8K@30FPS throughput at 800MHz with a gate count of 2749K, 0.2075W power consumption, and 4797Mpixels/J energy efficiency, exceeding previous work.

[AI-15] LGR: LLM-Guided Ranking of Frontiers for Object Goal Navigation

Link: https://arxiv.org/abs/2503.20241
Authors: Mitsuaki Uno,Kanji Tanaka,Daiki Iwata,Yudai Noda,Shoya Miyazaki,Kouki Terashima
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: 10 pages, 11 figures, technical report

Click to view abstract

Abstract:Object Goal Navigation (OGN) is a fundamental task for robots and AI, with key applications such as mobile robot image databases (MRID). In particular, mapless OGN is essential in scenarios involving unknown or dynamic environments. This study aims to enhance recent modular mapless OGN systems by leveraging the commonsense reasoning capabilities of large language models (LLMs). Specifically, we address the challenge of determining the visiting order in frontier-based exploration by framing it as a frontier ranking problem. Our approach is grounded in recent findings that, while LLMs cannot determine the absolute value of a frontier, they excel at evaluating the relative value between multiple frontiers viewed within a single image using the view image as context. We dynamically manage the frontier list by adding and removing elements, using an LLM as a ranking model. The ranking results are represented as reciprocal rank vectors, which are ideal for multi-view, multi-query information fusion. We validate the effectiveness of our method through evaluations in Habitat-Sim.
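
The reciprocal rank vectors mentioned in this abstract lend themselves to a small sketch. The rankings below are hard-coded stand-ins for what an LLM might return per view, and fusing views by summing their vectors is an assumption for illustration; the frontier names and `reciprocal_rank_vector` are hypothetical.

```python
import numpy as np

def reciprocal_rank_vector(ranked_ids, all_ids):
    """Entry i holds 1/rank of frontier all_ids[i], or 0 if that frontier was not ranked."""
    vec = np.zeros(len(all_ids))
    for position, frontier_id in enumerate(ranked_ids, start=1):
        vec[all_ids.index(frontier_id)] = 1.0 / position
    return vec

frontiers = ["f0", "f1", "f2", "f3"]
# Hypothetical per-view rankings; each view only sees some of the frontiers.
view_rankings = [["f2", "f0"], ["f2", "f3", "f1"], ["f0", "f2"]]

# Assumption: views are fused by summing their reciprocal rank vectors.
fused = sum(reciprocal_rank_vector(ranking, frontiers) for ranking in view_rankings)
visit_order = [frontiers[i] for i in np.argsort(-fused)]
```

Here `f2` accumulates 1 + 1 + 1/2 = 2.5 and is visited first. Because unseen frontiers simply contribute zero, vectors from views with different frontier subsets can be added without any alignment step, which is one reason this representation suits multi-view fusion.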

[AI-16] Dynamic Learning and Productivity for Data Analysts: A Bayesian Hidden Markov Model Perspective

Link: https://arxiv.org/abs/2503.20233
Authors: Yue Yin
Categories: Social and Information Networks (cs.SI); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Human-Computer Interaction (cs.HC)
Comments: 29 pages; a shorter 11-page version is accepted by HCI International (HCII) 2025

Click to view abstract

Abstract:Data analysts are essential in organizations, transforming raw data into insights that drive decision-making and strategy. This study explores how analysts’ productivity evolves on a collaborative platform, focusing on two key learning activities: writing queries and viewing peer queries. While traditional research often assumes static models, where performance improves steadily with cumulative learning, such models fail to capture the dynamic nature of real-world learning. To address this, we propose a Hidden Markov Model (HMM) that tracks how analysts transition between distinct learning states based on their participation in these activities. Using an industry dataset with 2,001 analysts and 79,797 queries, this study identifies three learning states: novice, intermediate, and advanced. Productivity increases as analysts advance to higher states, reflecting the cumulative benefits of learning. Writing queries benefits analysts across all states, with the largest gains observed for novices. Viewing peer queries supports novices but may hinder analysts in higher states due to cognitive overload or inefficiencies. Transitions between states are also uneven, with progression from intermediate to advanced being particularly challenging. This study advances understanding of the dynamic learning behavior of knowledge workers and offers practical implications for designing systems, optimizing training, enabling personalized learning, and fostering effective knowledge sharing.
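
For readers unfamiliar with HMMs, the sketch below scores an observation sequence under a three-state model using the scaled forward algorithm. It is a plain, homogeneous HMM with discrete emissions, which is simpler than the Bayesian model in the paper (where transitions depend on learning activities); every probability here is a made-up illustrative value.

```python
import numpy as np

def forward_log_likelihood(initial, transition, emission, observations):
    """Scaled forward algorithm for a discrete-emission HMM; returns log P(observations)."""
    alpha = initial * emission[:, observations[0]]
    log_likelihood = 0.0
    for obs in observations[1:]:
        scale = alpha.sum()
        log_likelihood += np.log(scale)
        # Normalize, propagate one step through the transitions, then weight by the emission.
        alpha = (alpha / scale) @ transition * emission[:, obs]
    return log_likelihood + np.log(alpha.sum())

# Hidden states: novice, intermediate, advanced (hypothetical numbers throughout).
initial = np.array([0.7, 0.2, 0.1])
transition = np.array([
    [0.80, 0.15, 0.05],
    [0.10, 0.80, 0.10],
    [0.05, 0.15, 0.80],
])
# Observation symbols: 0 = low productivity, 1 = high productivity.
emission = np.array([
    [0.8, 0.2],
    [0.5, 0.5],
    [0.2, 0.8],
])

log_lik = forward_log_likelihood(initial, transition, emission, [0, 0, 1, 1])
```

Rescaling `alpha` at every step keeps the recursion numerically stable on long sequences; the log-likelihood is recovered by summing the logs of the scaling factors.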

[AI-17] Learning Adaptive Dexterous Grasping from Single Demonstrations

Link: https://arxiv.org/abs/2503.20208
Authors: Liangzhi Shi,Yulin Liu,Lingqi Zeng,Bo Ai,Zhengdong Hong,Hao Su
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Click to view abstract

Abstract:How can robots learn dexterous grasping skills efficiently and apply them adaptively based on user instructions? This work tackles two key challenges: efficient skill acquisition from limited human demonstrations and context-driven skill selection. We introduce AdaDexGrasp, a framework that learns a library of grasping skills from a single human demonstration per skill and selects the most suitable one using a vision-language model (VLM). To improve sample efficiency, we propose a trajectory following reward that guides reinforcement learning (RL) toward states close to a human demonstration while allowing flexibility in exploration. To learn beyond the single demonstration, we employ curriculum learning, progressively increasing object pose variations to enhance robustness. At deployment, a VLM retrieves the appropriate skill based on user instructions, bridging low-level learned skills with high-level intent. We evaluate AdaDexGrasp in both simulation and real-world settings, showing that our approach significantly improves RL efficiency and enables learning human-like grasp strategies across varied object configurations. Finally, we demonstrate zero-shot transfer of our learned policies to a real-world PSYONIC Ability Hand, with a 90% success rate across objects, significantly outperforming the baseline.

[AI-18] Generalized Phase Pressure Control Enhanced Reinforcement Learning for Traffic Signal Control

Link: https://arxiv.org/abs/2503.20205
Authors: Xiao-Cheng Liao,Yi Mei,Mengjie Zhang,Xiang-Ling Chen
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Click to view abstract

Abstract:Appropriate traffic state representation is crucial for learning traffic signal control policies. However, most of the current traffic state representations are heuristically designed, with insufficient theoretical support. In this paper, we (1) develop a flexible, efficient, and theoretically grounded method, namely generalized phase pressure (G2P) control, which takes only simple lane features into consideration to decide which phase to actuate; (2) extend the pressure control theory to a general form for multi-homogeneous-lane road networks based on queueing theory; (3) design a new traffic state representation based on the generalized phase state features from G2P control; and (4) develop a reinforcement learning (RL)-based algorithm template named G2P-XLight, and two RL algorithms, G2P-MPLight and G2P-CoLight, by combining the generalized phase state representation with MPLight and CoLight, two well-performing RL methods for learning traffic signal control policies. Extensive experiments conducted on multiple real-world datasets demonstrate that G2P control outperforms the state-of-the-art (SOTA) heuristic method in the transportation field and other recent human-designed heuristic methods; and that the newly proposed G2P-XLight significantly outperforms SOTA learning-based approaches. Our code is available online.
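
As background for the pressure-based control discussed here, the sketch below implements the classic max-pressure rule that this line of work builds on: a phase's pressure is the sum, over the movements it serves, of the upstream queue minus the downstream queue. This is not the generalized G2P formulation, whose details are not given in the abstract; lane names and queue lengths are hypothetical.

```python
def phase_pressure(phase_movements, queue):
    """Sum of (upstream queue - downstream queue) over the movements a phase serves."""
    return sum(queue[incoming] - queue[outgoing] for incoming, outgoing in phase_movements)

def select_phase(phases, queue):
    """Classic max-pressure rule: actuate the phase with the largest pressure."""
    return max(phases, key=lambda name: phase_pressure(phases[name], queue))

# Hypothetical queue lengths (vehicles) on incoming and outgoing lanes of one intersection.
queue = {
    "N_in": 8, "S_in": 5, "E_in": 2, "W_in": 3,
    "N_out": 1, "S_out": 0, "E_out": 4, "W_out": 2,
}
# Each phase serves a list of (incoming lane, outgoing lane) movements.
phases = {
    "north_south": [("N_in", "S_out"), ("S_in", "N_out")],
    "east_west": [("E_in", "W_out"), ("W_in", "E_out")],
}

chosen = select_phase(phases, queue)
```

In this toy intersection the north-south phase has pressure (8 - 0) + (5 - 1) = 12 against -1 for east-west, so it is actuated. Only per-lane queue counts are needed, which matches the abstract's emphasis on simple lane features.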

[AI-19] Offline Reinforcement Learning with Discrete Diffusion Skills

Link: https://arxiv.org/abs/2503.20176
Authors: RuiXi Qiao, Jie Cheng, Xingyuan Dai, Yonglin Tian, Yisheng Lv
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO)

Abstract:Skills have been introduced to offline reinforcement learning (RL) as temporal abstractions to tackle complex, long-horizon tasks, promoting consistent behavior and enabling meaningful exploration. While skills in offline RL are predominantly modeled within a continuous latent space, the potential of discrete skill spaces remains largely underexplored. In this paper, we propose a compact discrete skill space for offline RL tasks supported by state-of-the-art transformer-based encoder and diffusion-based decoder. Coupled with a high-level policy trained via offline RL techniques, our method establishes a hierarchical RL framework where the trained diffusion decoder plays a pivotal role. Empirical evaluations show that the proposed algorithm, Discrete Diffusion Skill (DDS), is a powerful offline RL method. DDS performs competitively on Locomotion and Kitchen tasks and excels on long-horizon tasks, achieving at least a 12 percent improvement on AntMaze-v2 benchmarks compared to existing offline RL approaches. Furthermore, DDS offers improved interpretability, training stability, and online exploration compared to previous skill-based methods.

[AI-20] Look Before Leap: Look-Ahead Planning with Uncertainty in Reinforcement Learning

Link: https://arxiv.org/abs/2503.20139
Authors: Yongshuai Liu, Xin Liu
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Model-based reinforcement learning (MBRL) has demonstrated superior sample efficiency compared to model-free reinforcement learning (MFRL). However, the presence of inaccurate models can introduce biases during policy learning, resulting in misleading trajectories. The challenge lies in obtaining accurate models due to limited diverse training data, particularly in regions with limited visits (uncertain regions). Existing approaches passively quantify uncertainty after sample generation, failing to actively collect uncertain samples that could enhance state coverage and improve model accuracy. Moreover, MBRL often faces difficulties in making accurate multi-step predictions, thereby impacting overall performance. To address these limitations, we propose a novel framework for uncertainty-aware policy optimization with model-based exploratory planning. In the model-based planning phase, we introduce an uncertainty-aware k-step lookahead planning approach to guide action selection at each step. This process involves a trade-off analysis between model uncertainty and value function approximation error, effectively enhancing policy performance. In the policy optimization phase, we leverage an uncertainty-driven exploratory policy to actively collect diverse training samples, resulting in improved model accuracy and overall performance of the RL agent. Our approach offers flexibility and applicability to tasks with varying state/action spaces and reward structures. We validate its effectiveness through experiments on challenging robotic manipulation tasks and Atari games, surpassing state-of-the-art methods with fewer interactions, thereby leading to significant performance improvements.

[AI-21] Unlocking the Value of Decentralized Data: A Federated Dual Learning Approach for Model Aggregation

Link: https://arxiv.org/abs/2503.20138
Authors: Junyi Zhu, Ruicong Yao, Taha Ceritli, Savas Ozkan, Matthew B. Blaschko, Eunchung Noh, Jeongwon Min, Cho Jung Min, Mete Ozay
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Artificial Intelligence (AI) technologies have revolutionized numerous fields, yet their applications often rely on costly and time-consuming data collection processes. Federated Learning (FL) offers a promising alternative by enabling AI models to be trained on decentralized data where data is scattered across clients (distributed nodes). However, existing FL approaches struggle to match the performance of centralized training due to challenges such as heterogeneous data distribution and communication delays, limiting their potential for breakthroughs. We observe that many real-world use cases involve hybrid data regimes, in which a server (center node) has access to some data while a large amount of data is distributed across associated clients. To improve the utilization of decentralized data under this regime, address data heterogeneity issue, and facilitate asynchronous communication between the server and clients, we propose a dual learning approach that leverages centralized data at the server to guide the merging of model updates from clients. Our method accommodates scenarios where server data is out-of-domain relative to decentralized client data, making it applicable to a wide range of use cases. We provide theoretical analysis demonstrating the faster convergence of our method compared to existing methods. Furthermore, experimental results across various scenarios show that our approach significantly outperforms existing technologies, highlighting its potential to unlock the value of large amounts of decentralized data.

[AI-22] Can We Make Code Green? Understanding Trade-Offs in LLMs vs. Human Code Optimizations

Link: https://arxiv.org/abs/2503.20126
Authors: Pooja Rani, Jan-Andrea Bard, June Sallou, Alexander Boll, Timo Kehrer, Alberto Bacchelli
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Performance (cs.PF)

Abstract:The rapid technological evolution has accelerated software development for various domains and use cases, contributing to a growing share of global carbon emissions. While recent large language models (LLMs) claim to assist developers in optimizing code for performance and energy efficiency, their efficacy in real-world scenarios remains underexplored. In this work, we explore the effectiveness of LLMs in reducing the environmental footprint of real-world projects, focusing on software written in Matlab, which is widely used in both academia and industry for scientific and engineering applications. We analyze energy-focused optimization on 400 scripts across 100 top GitHub repositories. We examine 2,176 potential optimizations recommended by leading LLMs, such as GPT-3, GPT-4, Llama, and Mixtral, and by a senior Matlab developer, in terms of energy consumption, memory usage, execution time, and code correctness. The developer serves as a real-world baseline for comparing typical human and LLM-generated optimizations. Mapping these optimizations to 13 high-level themes, we found that LLMs propose a broad spectrum of improvements beyond energy efficiency, including improved code readability and maintainability, memory management, and error handling, while the developer overlooked some themes, such as parallel processing and error handling. However, our statistical tests reveal that the energy-focused optimizations unexpectedly had a negative impact on memory usage, with no clear benefits regarding execution time or energy consumption. Our qualitative analysis of energy-time trade-offs revealed that some themes, such as vectorization and preallocation, were among the common themes shaping these trade-offs. With LLMs becoming ubiquitous in modern software development, our study serves as a call to action: prioritizing the evaluation of common coding practices to identify the green ones.

[AI-23] Synthesizing world models for bilevel planning

Link: https://arxiv.org/abs/2503.20124
Authors: Zergham Ahmed, Joshua B. Tenenbaum, Christopher J. Bates, Samuel J. Gershman
Categories: Artificial Intelligence (cs.AI)
Comments: 25 pages

Abstract:Modern reinforcement learning (RL) systems have demonstrated remarkable capabilities in complex environments, such as video games. However, they still fall short of achieving human-like sample efficiency and adaptability when learning new domains. Theory-based reinforcement learning (TBRL) is an algorithmic framework specifically designed to address this gap. Modeled on cognitive theories, TBRL leverages structured, causal world models - “theories” - as forward simulators for use in planning, generalization and exploration. Although current TBRL systems provide compelling explanations of how humans learn to play video games, they face several technical limitations: their theory languages are restrictive, and their planning algorithms are not scalable. To address these challenges, we introduce TheoryCoder, an instantiation of TBRL that exploits hierarchical representations of theories and efficient program synthesis methods for more powerful learning and planning. TheoryCoder equips agents with general-purpose abstractions (e.g., “move to”), which are then grounded in a particular environment by learning a low-level transition model (a Python program synthesized from observations by a large language model). A bilevel planning algorithm can exploit this hierarchical structure to solve large domains. We demonstrate that this approach can be successfully applied to diverse and challenging grid-world games, where approaches based on directly synthesizing a policy perform poorly. Ablation studies demonstrate the benefits of using hierarchical abstractions.

[AI-24] Direct Post-Training Preference Alignment for Multi-Agent Motion Generation Models Using Implicit Feedback from Pre-training Demonstrations ICLR2025

Link: https://arxiv.org/abs/2503.20105
Authors: Ran Tian, Kratarth Goel
Categories: Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments: ICLR 2025 Spotlight

Abstract:Recent advancements in LLMs have revolutionized motion generation models in embodied applications. While LLM-type auto-regressive motion generation models benefit from training scalability, there remains a discrepancy between their token prediction objectives and human preferences. As a result, models pre-trained solely with token-prediction objectives often generate behaviors that deviate from what humans would prefer, making post-training preference alignment crucial for producing human-preferred motions. Unfortunately, post-training alignment requires extensive preference rankings of motions generated by the pre-trained model, which are costly to annotate, especially in multi-agent settings. Recently, there has been growing interest in leveraging pre-training demonstrations to scalably generate preference data for post-training alignment. However, these methods often adopt an adversarial assumption, treating all pre-trained model-generated samples as unpreferred examples. This adversarial approach overlooks the valuable signal provided by preference rankings among the model’s own generations, ultimately reducing alignment effectiveness and potentially leading to misaligned behaviors. In this work, instead of treating all generated samples as equally bad, we leverage implicit preferences encoded in pre-training demonstrations to construct preference rankings among the pre-trained model’s generations, offering more nuanced preference alignment guidance with zero human cost. We apply our approach to large-scale traffic simulation and demonstrate its effectiveness in improving the realism of pre-trained model’s generated behaviors, making a lightweight 1M motion generation model comparable to SOTA large imitation-based models by relying solely on implicit feedback from pre-training demonstrations, without additional post-training human preference annotations or high computational costs.

[AI-25] AI Identity, Empowerment, and Mindfulness in Mitigating Unethical AI Use

Link: https://arxiv.org/abs/2503.20099
Authors: Mayssam Tarighi Shaayesteh, Sara Memarian Esfahani, Hossein Mohit
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)

Abstract:This study examines how AI identity influences psychological empowerment and unethical AI behavior among college students, while also exploring the moderating role of IT mindfulness. Findings show that a strong AI identity enhances psychological empowerment and academic engagement but can also lead to increased unethical AI practices. Crucially, IT mindfulness acts as an ethical safeguard, promoting sensitivity to ethical concerns and reducing misuse of AI. These insights have implications for educators, policymakers, and AI developers, emphasizing the need for a balanced approach that encourages digital engagement without compromising student responsibility. The study also contributes to philosophical discussions of psychological agency, suggesting that empowerment through AI can yield both positive and negative outcomes. Mindfulness emerges as essential in guiding ethical AI interactions. Overall, the research informs ongoing debates on ethics in education and AI, offering strategies to align technological advancement with ethical accountability and responsible use.

[AI-26] Abstracting Geo-specific Terrains to Scale Up Reinforcement Learning

Link: https://arxiv.org/abs/2503.20078
Authors: Volkan Ustun, Soham Hans, Rajay Kumar, Yunzhe Wang
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments: 10 pages, 6 figures, 2024 Interservice/Industry Training, Simulation, and Education Conference (I/ITSEC)

Abstract:Multi-agent reinforcement learning (MARL) is increasingly ubiquitous in training dynamic and adaptive synthetic characters for interactive simulations on geo-specific terrains. Frameworks such as Unity’s ML-Agents help to make such reinforcement learning experiments more accessible to the simulation community. Military training simulations also benefit from advances in MARL, but they have immense computational requirements due to their complex, continuous, stochastic, partially observable, non-stationary, and doctrine-based nature. Furthermore, these simulations require geo-specific terrains, further exacerbating the computational resources problem. In our research, we leverage Unity’s waypoints to automatically generate multi-layered representation abstractions of the geo-specific terrains to scale up reinforcement learning while still allowing the transfer of learned policies between different representations. Our early exploratory results on a novel MARL scenario, where each side has differing objectives, indicate that waypoint-based navigation enables faster and more efficient learning while producing trajectories similar to those taken by expert human players in CSGO gaming environments. This research points out the potential of waypoint-based navigation for reducing the computational costs of developing and training MARL models for military training simulations, where geo-specific terrains and differing objectives are crucial.

[AI-27] Adaptive Orchestration for Large-Scale Inference on Heterogeneous Accelerator Systems Balancing Cost, Performance, and Resilience

Link: https://arxiv.org/abs/2503.20074
Authors: Yahav Biran, Imry Kissos
Categories: Performance (cs.PF); Artificial Intelligence (cs.AI)
Comments: 14 pages, 7 figures

Abstract:The surge in generative AI workloads has created a need for scalable inference systems that can flexibly harness both GPUs and specialized accelerators while containing operational costs. This paper proposes a hardware-agnostic control loop that adaptively allocates requests across heterogeneous accelerators based on real-time cost and capacity signals. The approach sustains low latency and high throughput by dynamically shifting between cost-optimized and capacity-optimized modes, ensuring the most efficient use of expensive compute resources under fluctuating availability. Evaluated using the Stable Diffusion model, the framework consistently meets latency targets, automatically redirects traffic during capacity shortfalls, and capitalizes on lower-cost accelerators when possible. These results highlight how a feedback-driven deployment strategy, spanning the entire software and hardware stack, can help organizations efficiently scale generative AI workloads while maintaining resilience in the face of limited accelerator capacity.

[AI-28] BugCraft: End-to-End Crash Bug Reproduction Using LLM Agents in Minecraft

Link: https://arxiv.org/abs/2503.20036
Authors: Eray Yapağcı, Yavuz Alp Sencer Öztürk, Eray Tüzün
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)

Abstract:Reproducing game bugs, in our case crash bugs in continuously evolving games like Minecraft, is a notoriously manual, time-consuming, and challenging process to automate. Despite the success of LLM-driven bug reproduction in other software domains, games, with their complex interactive environments, remain largely unaddressed. This paper introduces BugCraft, a novel end-to-end framework designed to automate the reproduction of crash bugs in Minecraft directly from user-submitted bug reports, addressing the critical gap in automated game bug reproduction. BugCraft employs a two-stage approach: first, a Step Synthesizer leverages LLMs and Minecraft Wiki knowledge to transform bug reports into high-quality, structured steps to reproduce (S2R). Second, an Action Model, powered by a vision-based LLM agent (GPT-4o) and a custom macro API, executes these S2R steps within Minecraft to trigger the reported crash. To facilitate evaluation, we introduce BugCraft-Bench, a curated dataset of Minecraft crash bug reports. Evaluated on BugCraft-Bench, our framework successfully reproduced 30.23% of crash bugs end-to-end. The Step Synthesizer demonstrated a 66.28% accuracy in generating correct bug reproduction plans, highlighting its effectiveness in interpreting and structuring bug report information. BugCraft demonstrates the feasibility of automated reproduction of crash bugs in complex game environments using LLMs, opening promising avenues for game testing and development. The framework and the BugCraft-Bench dataset pave the way for future research in automated game bug analysis and hold potential for generalization to other interactive game platforms. Finally, we make our code open at this https URL

[AI-29] OmniNova: A General Multimodal Agent Framework

Link: https://arxiv.org/abs/2503.20028
Authors: Pengfei Du
Categories: Artificial Intelligence (cs.AI)

Abstract:The integration of Large Language Models (LLMs) with specialized tools presents new opportunities for intelligent automation systems. However, orchestrating multiple LLM-driven agents to tackle complex tasks remains challenging due to coordination difficulties, inefficient resource utilization, and inconsistent information flow. We present OmniNova, a modular multi-agent automation framework that combines language models with specialized tools such as web search, crawling, and code execution capabilities. OmniNova introduces three key innovations: (1) a hierarchical multi-agent architecture with distinct coordinator, planner, supervisor, and specialist agents; (2) a dynamic task routing mechanism that optimizes agent deployment based on task complexity; and (3) a multi-layered LLM integration system that allocates appropriate models to different cognitive requirements. Our evaluations across 50 complex tasks in research, data analysis, and web interaction domains demonstrate that OmniNova outperforms existing frameworks in task completion rate (87% vs. baseline 62%), efficiency (41% reduced token usage), and result quality (human evaluation score of 4.2/5 vs. baseline 3.1/5). We contribute both a theoretical framework for multi-agent system design and an open-source implementation that advances the state-of-the-art in LLM-based automation systems.

[AI-30] Experience Replay Addresses Loss of Plasticity in Continual Learning

Link: https://arxiv.org/abs/2503.20018
Authors: Jiuqi Wang, Rohan Chandra, Shangtong Zhang
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
Comments: 14 pages, 4 figures

Abstract:Loss of plasticity is one of the main challenges in continual learning with deep neural networks, where neural networks trained via backpropagation gradually lose their ability to adapt to new tasks and perform significantly worse than their freshly initialized counterparts. The main contribution of this paper is to propose a new hypothesis that experience replay addresses the loss of plasticity in continual learning. Here, experience replay is a form of memory. We provide supporting evidence for this hypothesis. In particular, we demonstrate in multiple different tasks, including regression, classification, and policy evaluation, that by simply adding an experience replay and processing the data in the experience replay with Transformers, the loss of plasticity disappears. Notably, we do not alter any standard components of deep learning. For example, we do not change backpropagation. We do not modify the activation functions. And we do not use any regularization. We conjecture that experience replay and Transformers can address the loss of plasticity because of the in-context learning phenomenon.
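
For readers unfamiliar with the "form of memory" mentioned in the abstract, the sketch below shows a generic fixed-capacity experience replay buffer with uniform sampling. It only illustrates the general idea. The class name `ReplayBuffer`, its methods, and the first-in-first-out eviction policy are assumptions for illustration, and the Transformer that the paper uses to process replayed data is not modeled here.

```python
import random
from collections import deque
from typing import Any, List, Optional


class ReplayBuffer:
    """Fixed-capacity memory: the oldest items are evicted first (assumed policy)."""

    def __init__(self, capacity: int, seed: Optional[int] = None) -> None:
        # deque(maxlen=...) silently drops the oldest entry once it is full.
        self._items: deque = deque(maxlen=capacity)
        self._rng = random.Random(seed)

    def add(self, item: Any) -> None:
        """Store one experience, e.g. an (input, target) pair."""
        self._items.append(item)

    def sample(self, k: int) -> List[Any]:
        """Draw up to k stored experiences uniformly without replacement."""
        k = min(k, len(self._items))
        return self._rng.sample(list(self._items), k)

    def __len__(self) -> int:
        return len(self._items)


# Usage: stream five experiences through a buffer that can hold three.
buffer = ReplayBuffer(capacity=3, seed=0)
for step in range(5):
    buffer.add(step)
batch = buffer.sample(3)
```

After the loop, only the three most recent experiences remain available for replay.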

[AI-31] Unsupervised Learning for Quadratic Assignment

Link: https://arxiv.org/abs/2503.20001
Authors: Yimeng Min, Carla P. Gomes
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: preprint

Abstract:We introduce PLUME search, a data-driven framework that enhances search efficiency in combinatorial optimization through unsupervised learning. Unlike supervised or reinforcement learning, PLUME search learns directly from problem instances using a permutation-based loss with a non-autoregressive approach. We evaluate its performance on the quadratic assignment problem, a fundamental NP-hard problem that encompasses various combinatorial optimization problems. Experimental results demonstrate that PLUME search consistently improves solution quality. Furthermore, we study the generalization behavior and show that the learned model generalizes across different densities and sizes.
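
As background on the problem being solved, the sketch below evaluates the standard Koopmans-Beckmann objective of the quadratic assignment problem and finds the optimum of a tiny instance by brute force. It does not implement PLUME search or its permutation-based loss. The flow and distance matrices are made-up illustrative data.

```python
from itertools import permutations
from typing import List, Sequence, Tuple


def qap_cost(flow: List[List[int]], dist: List[List[int]], perm: Sequence[int]) -> int:
    """Cost of placing facility i at location perm[i]: sum of flow * distance."""
    n = len(perm)
    return sum(flow[i][j] * dist[perm[i]][perm[j]]
               for i in range(n) for j in range(n))


def brute_force_qap(flow: List[List[int]], dist: List[List[int]]) -> Tuple[Tuple[int, ...], int]:
    """Exhaustive search over all n! assignments; only feasible for tiny n."""
    n = len(flow)
    best = min(permutations(range(n)), key=lambda p: qap_cost(flow, dist, p))
    return best, qap_cost(flow, dist, best)


# Hypothetical 3-facility instance with symmetric flows and distances.
flow = [[0, 3, 1],
        [3, 0, 2],
        [1, 2, 0]]
dist = [[0, 1, 4],
        [1, 0, 2],
        [4, 2, 0]]
best_perm, best_cost = brute_force_qap(flow, dist)
```

The factorial growth of the search space is what makes learned guidance for the search attractive on larger instances.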

[AI-32] LEGO-Puzzles: How Good Are MLLMs at Multi-Step Spatial Reasoning?

Link: https://arxiv.org/abs/2503.19990
Authors: Kexian Tang, Junyao Gao, Yanhong Zeng, Haodong Duan, Yanan Sun, Zhening Xing, Wenran Liu, Kaifeng Lyu, Kai Chen
Categories: Artificial Intelligence (cs.AI)
Comments: 12 pages, 7 figures

Abstract:Multi-step spatial reasoning entails understanding and reasoning about spatial relationships across multiple sequential steps, which is crucial for tackling complex real-world applications, such as robotic manipulation, autonomous navigation, and automated assembly. To assess how well current Multimodal Large Language Models (MLLMs) have acquired this fundamental capability, we introduce LEGO-Puzzles, a scalable benchmark designed to evaluate both spatial understanding and sequential reasoning in MLLMs through LEGO-based tasks. LEGO-Puzzles consists of 1,100 carefully curated visual question-answering (VQA) samples spanning 11 distinct tasks, ranging from basic spatial understanding to complex multi-step reasoning. Based on LEGO-Puzzles, we conduct a comprehensive evaluation of state-of-the-art MLLMs and uncover significant limitations in their spatial reasoning capabilities: even the most powerful MLLMs can answer only about half of the test cases, whereas human participants achieve over 90% accuracy. In addition to VQA tasks, we evaluate MLLMs’ abilities to generate LEGO images following assembly illustrations. Our experiments show that only Gemini-2.0-Flash and GPT-4o exhibit a limited ability to follow these instructions, while other MLLMs either replicate the input image or generate completely irrelevant outputs. Overall, LEGO-Puzzles exposes critical deficiencies in existing MLLMs’ spatial understanding and sequential reasoning capabilities, and underscores the need for further advancements in multimodal spatial reasoning.

[AI-33] ExCoT: Optimizing Reasoning for Text-to-SQL with Execution Feedback

Link: https://arxiv.org/abs/2503.19988
Authors: Bohan Zhai, Canwen Xu, Yuxiong He, Zhewei Yao
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Databases (cs.DB)

Abstract:Text-to-SQL demands precise reasoning to convert natural language questions into structured queries. While large language models (LLMs) excel in many reasoning tasks, their ability to leverage Chain-of-Thought (CoT) reasoning for text-to-SQL remains underexplored. We identify critical limitations: zero-shot CoT offers minimal gains, and Direct Preference Optimization (DPO) applied without CoT yields marginal improvements. We propose ExCoT, a novel framework that iteratively optimizes open-source LLMs by combining CoT reasoning with off-policy and on-policy DPO, relying solely on execution accuracy as feedback. This approach eliminates the need for reward models or human-annotated preferences. Our experimental results demonstrate significant performance gains: ExCoT improves execution accuracy on BIRD dev set from 57.37% to 68.51% and on Spider test set from 78.81% to 86.59% for LLaMA-3 70B, with Qwen-2.5-Coder demonstrating similar improvements. Our best model achieves state-of-the-art performance in the single-model setting on both BIRD and Spider datasets, notably achieving 68.53% on the BIRD test set.
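
The idea of using execution accuracy as the only feedback signal can be sketched as follows: run each candidate query, compare its result with that of a reference query, and pair each matching candidate (chosen) with each non-matching one (rejected) as DPO-style preference data. This is an illustrative assumption about how such pairs could be built, not the authors' pipeline. The helper names, the in-memory SQLite database, and the order-insensitive result comparison are all hypothetical choices.

```python
import sqlite3
from itertools import product
from typing import List, Optional, Tuple


def execute(conn: sqlite3.Connection, sql: str) -> Optional[list]:
    """Return the result rows in a canonical order, or None if the query fails."""
    try:
        rows = conn.execute(sql).fetchall()
    except sqlite3.Error:
        return None
    # Sort by repr so rows containing NULLs or mixed types still compare safely.
    return sorted(rows, key=repr)


def build_preference_pairs(conn: sqlite3.Connection, gold_sql: str,
                           candidates: List[str]) -> List[Tuple[str, str]]:
    """Pair every execution-correct candidate with every incorrect one."""
    gold = execute(conn, gold_sql)
    if gold is None:  # without a runnable reference there is no feedback signal
        return []
    correct = [c for c in candidates if execute(conn, c) == gold]
    incorrect = [c for c in candidates if execute(conn, c) != gold]
    return list(product(correct, incorrect))  # (chosen, rejected) tuples


# Toy database and candidate queries (made-up data).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, age INTEGER)")
conn.executemany("INSERT INTO users VALUES (?, ?)", [(1, 20), (2, 35), (3, 40)])
candidates = [
    "SELECT id FROM users WHERE age >= 35",  # same result as the reference
    "SELECT id FROM users",                  # runs, but returns extra rows
    "SELECT id FROM userz",                  # fails: no such table
]
pairs = build_preference_pairs(conn, "SELECT id FROM users WHERE age > 30", candidates)
```

In a full pipeline each candidate would also carry its chain-of-thought text, but only the final query needs to be executed to label it.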

[AI-34] Body Discovery of Embodied AI

Link: https://arxiv.org/abs/2503.19941
Authors: Zhe Sun, Pengfei Tian, Xiaozhu Hu, Xiaoyu Zhao, Huiying Li, Zhenliang Zhang
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)

Abstract:In the pursuit of realizing artificial general intelligence (AGI), the importance of embodied artificial intelligence (AI) becomes increasingly apparent. Following this trend, research integrating robots with AGI has become prominent. As various kinds of embodiments have been designed, adaptability to diverse embodiments will become important to AGI. We introduce a new challenge, termed “Body Discovery of Embodied AI”, focusing on tasks of recognizing embodiments and summarizing neural signal functionality. The challenge encompasses the precise definition of an AI body and the intricate task of identifying embodiments in dynamic environments, where conventional approaches often prove inadequate. To address these challenges, we apply causal inference method and evaluate it by developing a simulator tailored for testing algorithms with virtual environments. Finally, we validate the efficacy of our algorithms through empirical testing, demonstrating their robust performance in various scenarios based on virtual environments.

[AI-35] Probabilistic Forecasting for Network Resource Analysis in Integrated Terrestrial and Non-Terrestrial Networks

Link: https://arxiv.org/abs/2503.20658
Authors: Cristian J. Vaca-Rubio, Vaishnavi Kasuluru, Engin Zeydan, Luis Blanco, Roberto Pereira, Marius Caus, Kapal Dev
Categories: Signal Processing (eess.SP); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI)

Abstract:Efficient resource management is critical for Non-Terrestrial Networks (NTNs) to provide consistent, high-quality service in remote and under-served regions. While traditional single-point prediction methods, such as Long-Short Term Memory (LSTM), have been used in terrestrial networks, they often fall short in NTNs due to the complexity of satellite dynamics, signal latency and coverage variability. Probabilistic forecasting, which quantifies the uncertainties of the predictions, is a robust alternative. In this paper, we evaluate the application of probabilistic forecasting techniques, in particular SFF, to NTN resource allocation scenarios. Our results show their effectiveness in predicting bandwidth and capacity requirements in different NTN segments of probabilistic forecasting compared to single-point prediction techniques such as LSTM. The results show the potential of black probabilistic forecasting models to provide accurate and reliable predictions and to quantify their uncertainty, making them indispensable for optimizing NTN resource allocation. At the end of the paper, we also present application scenarios and a standardization roadmap for the use of probabilistic forecasting in integrated Terrestrial Network (TN)-NTN environments.

[AI-36] A decision-theoretic approach to dealing with uncertainty in quantum mechanics

Link: https://arxiv.org/abs/2503.20607
Authors: Keano De Vos, Gert de Cooman, Alexander Erreygers, Jasper De Bock
Categories: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI); Probability (math.PR)
Comments: 52 pages

Abstract:We provide a decision-theoretic framework for dealing with uncertainty in quantum mechanics. This uncertainty is two-fold: on the one hand there may be uncertainty about the state the quantum system is in, and on the other hand, as is essential to quantum mechanical uncertainty, even if the quantum state is known, measurements may still produce an uncertain outcome. In our framework, measurements therefore play the role of acts with an uncertain outcome and our simple decision-theoretic postulates ensure that Born’s rule is encapsulated in the utility functions associated with such acts. This approach allows us to uncouple (precise) probability theory from quantum mechanics, in the sense that it leaves room for a more general, so-called imprecise probabilities approach. We discuss the mathematical implications of our findings, which allow us to give a decision-theoretic foundation to recent seminal work by Benavoli, Facchini and Zaffalon, and we compare our approach to earlier and different approaches by Deutsch and Wallace.

[AI-37] Design and Evaluation of Neural Network-Based Receiver Architectures for Reliable Communication

Link: https://arxiv.org/abs/2503.20500
Authors: Hüseyin Çevik, Erhan Karakoca, İbrahim Hökelek, Ali Görçin
Categories: Signal Processing (eess.SP); Artificial Intelligence (cs.AI)
Comments: Will be submitted to IEEE Conference

Abstract:Neural network-based receivers leverage deep learning to optimize signal detection and decoding, significantly improving bit-error rate (BER) and block-error rate (BLER) in challenging environments. This study evaluates various architectures and compares their BER and BLER performance across different noise levels. Two novel models, the Dual Attention Transformer (DAT) and the Residual Dual Non-Local Attention Network (RDNLA), integrate self-attention and residual learning to enhance signal reconstruction. These models bypass conventional channel estimation and equalization by directly predicting log-likelihood ratios (LLRs) from received signals, with noise variance as an additional input. Simulations show that DAT and RDNLA outperform traditional and other neural receiver models under varying signal-to-noise ratios (SNR), while their computational efficiency supports their feasibility for next-generation communication systems.

[AI-38] Underwater Image Enhancement by Convolutional Spiking Neural Networks

Link: https://arxiv.org/abs/2503.20485
Authors: Vidya Sudevan, Fakhreddine Zayer, Rizwana Kausar, Sajid Javed, Hamad Karki, Giulia De Masi, Jorge Dias
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Performance (cs.PF)

Abstract:Underwater image enhancement (UIE) is fundamental for marine applications, including autonomous vision-based navigation. Deep learning methods using convolutional neural networks (CNN) and vision transformers advanced UIE performance. Recently, spiking neural networks (SNN) have gained attention for their lightweight design, energy efficiency, and scalability. This paper introduces UIE-SNN, the first SNN-based UIE algorithm to improve visibility of underwater images. UIE-SNN is a 19-layered convolutional spiking encoder-decoder framework with skip connections, directly trained using surrogate gradient-based backpropagation through time (BPTT) strategy. We explore and validate the influence of training datasets on energy reduction, a unique advantage of UIE-SNN architecture, in contrast to the conventional learning-based architectures, where energy consumption is model-dependent. UIE-SNN optimizes the loss function in latent space representation to reconstruct clear underwater images. Our algorithm performs on par with its non-spiking counterpart methods in terms of PSNR and structural similarity index (SSIM) at reduced timesteps (T=5) and energy consumption of 85%. The algorithm is trained on two publicly available benchmark datasets, UIEB and EUVP, and tested on unseen images from UIEB, EUVP, LSUI, U45, and our custom UIE dataset. The UIE-SNN algorithm achieves PSNR of 17.7801 dB and SSIM of 0.7454 on UIEB, and PSNR of 23.1725 dB and SSIM of 0.7890 on EUVP. UIE-SNN achieves this algorithmic performance with fewer operators (147.49 GSOPs) and energy (0.1327 J) compared to its non-spiking counterpart (GFLOPs = 218.88 and Energy = 1.0068 J). Compared with existing SOTA UIE methods, UIE-SNN achieves an average of 6.5× improvement in energy efficiency. The source code is available at this https URL (UIE-SNN).
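
The PSNR values quoted above follow the standard definition, PSNR = 10 · log10(MAX² / MSE), where MSE is the mean squared error between the reference and enhanced images. The sketch below computes it on flat lists of pixel intensities. The 8-bit peak value of 255 and the tiny two-pixel example are assumptions for illustration, and the paper's actual evaluation code may differ.

```python
import math
from typing import Sequence


def psnr(reference: Sequence[float], enhanced: Sequence[float],
         max_value: float = 255.0) -> float:
    """Peak signal-to-noise ratio in dB between two equally sized pixel arrays."""
    if len(reference) != len(enhanced) or len(reference) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    # Mean squared error over all pixels.
    mse = sum((r - e) ** 2 for r, e in zip(reference, enhanced)) / len(reference)
    if mse == 0:
        return math.inf  # identical images: PSNR is unbounded
    return 10.0 * math.log10(max_value ** 2 / mse)


# Two-pixel toy example: each pixel is off by 10 intensity levels, so MSE = 100.
score = psnr([0, 255], [10, 245])
```

Higher values indicate that the enhanced image is closer to the reference.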

[AI-39] A multi-agentic framework for real-time autonomous freeform metasurface design

Link: https://arxiv.org/abs/2503.20479
Authors: Robert Lupoiu,Yixuan Shao,Tianxiang Dai,Chenkai Mao,Kofi Edee,Jonathan A. Fan
Categories: Applied Physics (physics.app-ph); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA); Computational Physics (physics.comp-ph)
Comments: 32 pages, 5 figures

Abstract:Innovation in nanophotonics currently relies on human experts who synergize specialized knowledge in photonics and coding with simulation and optimization algorithms, entailing design cycles that are time-consuming, computationally demanding, and frequently suboptimal. We introduce MetaChat, a multi-agentic design framework that can translate semantically described photonic design goals into high-performance, freeform device layouts in an automated, nearly real-time manner. Multi-step reasoning is enabled by our Agentic Iterative Monologue (AIM) paradigm, which coherently interfaces agents with code-based tools, other specialized agents, and human designers. Design acceleration is facilitated by Feature-wise Linear Modulation-conditioned Maxwell surrogate solvers that support the generalized evaluation of metasurface structures. We use freeform dielectric metasurfaces as a model system and demonstrate with MetaChat the design of multi-objective, multi-wavelength metasurfaces orders of magnitude faster than conventional methods. These concepts present a scientific computing blueprint for utilizing specialist design agents, surrogate solvers, and human interactions to drive multi-physics innovation and discovery.

[AI-40] Dynamics of Algorithmic Content Amplification on TikTok

Link: https://arxiv.org/abs/2503.20231
Authors: Fabian Baumann,Nipun Arora,Iyad Rahwan,Agnieszka Czaplicka
Categories: Physics and Society (physics.soc-ph); Artificial Intelligence (cs.AI); Social and Information Networks (cs.SI)
Comments: 34 pages

Abstract:Intelligent algorithms increasingly shape the content we encounter and engage with online. TikTok’s For You feed exemplifies extreme algorithm-driven curation, tailoring the stream of video content almost exclusively based on users’ explicit and implicit interactions with the platform. Despite growing attention, the dynamics of content amplification on TikTok remain largely unquantified. How quickly, and to what extent, does TikTok’s algorithm amplify content aligned with users’ interests? To address these questions, we conduct a sock-puppet audit, deploying bots with different interests to engage with TikTok’s “For You” feed. Our findings reveal that content aligned with the bots’ interests undergoes strong amplification, with rapid reinforcement typically occurring within the first 200 videos watched. While amplification is consistently observed across all interests, its intensity varies by interest, indicating the emergence of topic-specific biases. Time series analyses and Markov models uncover distinct phases of recommendation dynamics, including persistent content reinforcement and a gradual decline in content diversity over time. Although TikTok’s algorithm preserves some content diversity, we find a strong negative correlation between amplification and exploration: as the amplification of interest-aligned content increases, engagement with unseen hashtags declines. These findings contribute to discussions on socio-algorithmic feedback loops in the digital age and the trade-offs between personalization and content diversity.

[AI-41] A Spatiotemporal Radar-Based Precipitation Model for Water Level Prediction and Flood Forecasting

Link: https://arxiv.org/abs/2503.19943
Authors: Sakshi Dhankhar,Stefan Wittek,Hamidreza Eivazi,Andreas Rausch
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 28 pages, 11 figures, 6 tables

Abstract:Study Region: Goslar and Göttingen, Lower Saxony, Germany. Study Focus: In July 2017, the cities of Goslar and Göttingen experienced severe flood events characterized by a short warning time of only 20 minutes, resulting in extensive regional flooding and significant damage. This highlights the critical need for a more reliable and timely flood forecasting system. This paper presents a comprehensive study on the impact of radar-based precipitation data on forecasting river water levels in Goslar. Additionally, the study examines how precipitation influences water level forecasts in Göttingen. The analysis integrates radar-derived spatiotemporal precipitation patterns with hydrological sensor data obtained from ground stations to evaluate the effectiveness of this approach in improving flood prediction capabilities. New Hydrological Insights for the Region: A key innovation in this paper is the use of residual-based modeling to address the non-linearity between precipitation images and water levels, leading to a Spatiotemporal Radar-based Precipitation Model with residuals (STRPMr). Unlike traditional hydrological models, our approach does not rely on upstream data, making it independent of additional hydrological inputs. This independence enhances its adaptability and allows for broader applicability in other regions with RADOLAN precipitation. The deep learning architecture integrates (2+1)D convolutional neural networks for spatial and temporal feature extraction with LSTM for time-series forecasting. The results demonstrate the potential of the STRPMr for capturing extreme events and more accurate flood forecasting.

[AI-42] FuXi-RTM: A Physics-Guided Prediction Framework with Radiative Transfer Modeling

Link: https://arxiv.org/abs/2503.19940
Authors: Qiusheng Huang,Xiaohui Zhong,Xu Fan,Lei Chen,Hao Li
Categories: Atmospheric and Oceanic Physics (physics.ao-ph); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:Similar to conventional video generation, current deep learning-based weather prediction frameworks often lack explicit physical constraints, leading to unphysical outputs that limit their reliability for operational forecasting. Among various physical processes requiring proper representation, radiation plays a fundamental role as it drives Earth’s weather and climate systems. However, accurate simulation of radiative transfer processes remains challenging for traditional numerical weather prediction (NWP) models due to their inherent complexity and high computational costs. Here, we propose FuXi-RTM, a hybrid physics-guided deep learning framework designed to enhance weather forecast accuracy while enforcing physical consistency. FuXi-RTM integrates a primary forecasting model (FuXi) with a fixed deep learning-based radiative transfer model (DLRTM) surrogate that efficiently replaces conventional radiation parameterization schemes. This represents the first deep learning-based weather forecasting framework to explicitly incorporate physical process modeling. Evaluated over a comprehensive 5-year dataset, FuXi-RTM outperforms its unconstrained counterpart in 88.51% of 3320 variable and lead time combinations, with improvements in radiative flux predictions. By incorporating additional physical processes, FuXi-RTM paves the way for next-generation weather forecasting systems that are both accurate and physically consistent.

[AI-43] Role of AI Innovation, Clean Energy and Digital Economy towards Net Zero Emission in the United States: An ARDL Approach

Link: https://arxiv.org/abs/2503.19933
Authors: Adita Sultana,Abdullah Al Abrar Chowdhury,Azizul Hakim Rafi,Abdulla All Noman
Categories: General Economics (econ.GN); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments: 24 pages, 8 tables, 1 figure

Abstract:This paper investigates the influence of AI innovation, GDP growth, renewable energy utilization, the digital economy, and industrialization on CO2 emissions in the USA from 1990 to 2022, using the ARDL methodology. The results indicate that AI innovation, renewable energy usage, and the digital economy reduce CO2 emissions, while GDP expansion and industrialization intensify ecosystem damage. Unit root tests (ADF, PP, and DF-GLS) reveal heterogeneous integration levels among the components, ensuring robustness in the ARDL analysis. Complementary methods (FMOLS, DOLS, and CCR) validate the results, enhancing their reliability. Pairwise Granger causality assessments identify strong unidirectional connections between CO2 emissions and AI innovation, as well as the digital economy, underscoring their significant roles in ecological sustainability. This research highlights the requirement for strategic actions to nurture equitable growth, including advancements in AI technology, green energy adoption, and environmentally conscious industrial development, to improve environmental quality in the United States.

Machine Learning

[LG-0] An Empirical Study of the Impact of Federated Learning on Machine Learning Model Accuracy

Link: https://arxiv.org/abs/2503.20768
Authors: Haotian Yang,Zhuoran Wang,Benson Chou,Sophie Xu,Hao Wang,Jingxian Wang,Qizhen Zhang
Categories: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)

Abstract:Federated Learning (FL) enables distributed ML model training on private user data at the global scale. Despite the potential of FL demonstrated in many domains, an in-depth view of its impact on model accuracy remains unclear. In this paper, we investigate, systematically, how this learning paradigm can affect the accuracy of state-of-the-art ML models for a variety of ML tasks. We present an empirical study that involves various data types: text, image, audio, and video, and FL configuration knobs: data distribution, FL scale, client sampling, and local and global computations. Our experiments are conducted in a unified FL framework to achieve high fidelity, with substantial human efforts and resource investments. Based on the results, we perform a quantitative analysis of the impact of FL, and highlight challenging scenarios where applying FL degrades the accuracy of the model drastically and identify cases where the impact is negligible. The detailed and extensive findings can benefit practical deployments and future development of FL.

[LG-1] Reliable algorithm selection for machine learning-guided design

Link: https://arxiv.org/abs/2503.20767
Authors: Clara Fannjiang,Ji Won Park
Categories: Machine Learning (cs.LG); Quantitative Methods (q-bio.QM); Machine Learning (stat.ML)
Comments: 25 pages, 7 figures

Abstract:Algorithms for machine learning-guided design, or design algorithms, use machine learning-based predictions to propose novel objects with desired property values. Given a new design task – for example, to design novel proteins with high binding affinity to a therapeutic target – one must choose a design algorithm and specify any hyperparameters and predictive and/or generative models involved. How can these decisions be made such that the resulting designs are successful? This paper proposes a method for design algorithm selection, which aims to select design algorithms that will produce a distribution of design labels satisfying a user-specified success criterion – for example, that at least ten percent of designs’ labels exceed a threshold. It does so by combining designs’ predicted property values with held-out labeled data to reliably forecast characteristics of the label distributions produced by different design algorithms, building upon techniques from prediction-powered inference. The method is guaranteed with high probability to return design algorithms that yield successful label distributions (or the null set if none exist), if the density ratios between the design and labeled data distributions are known. We demonstrate the method’s effectiveness in simulated protein and RNA design tasks, in settings with either known or estimated density ratios.
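To make the example success criterion in this abstract concrete, here is a minimal Python sketch that checks whether at least a given fraction of design labels exceed a threshold. It is not the paper's selection method: the paper forecasts this quantity before true labels are available, using prediction-powered inference, whereas this sketch assumes the labels are already known. The function name, threshold, and sample labels are hypothetical.

```python
def meets_success_criterion(labels, threshold, min_fraction=0.10):
    """Return True if at least `min_fraction` of labels exceed `threshold`."""
    if not labels:
        # An empty set of designs cannot satisfy the criterion.
        return False
    n_success = sum(1 for y in labels if y > threshold)
    return n_success / len(labels) >= min_fraction


# Hypothetical labels for ten designs; one of them exceeds 0.8.
labels = [0.2, 0.9, 0.4, 0.1, 0.3, 0.5, 0.2, 0.1, 0.3, 0.4]
print(meets_success_criterion(labels, threshold=0.8))
```

In the setting the abstract describes, the fraction would have to be estimated from predicted property values and held-out labeled data rather than computed directly.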

[LG-2] ASGO: Adaptive Structured Gradient Optimization

Link: https://arxiv.org/abs/2503.20762
Authors: Kang An,Yuxing Liu,Rui Pan,Shiqian Ma,Donald Goldfarb,Tong Zhang
Categories: Machine Learning (cs.LG); Optimization and Control (math.OC)
Comments: 25 pages, 4 figures

Abstract:Training deep neural networks (DNNs) is a structured optimization problem, because the parameters are naturally represented by matrices and tensors rather than simple vectors. Under this structural representation, it has been widely observed that gradients are low-rank and Hessians are approximately block-wise diagonal. These structured properties are crucial for designing efficient optimization algorithms but may not be utilized by current popular optimizers like Adam. In this paper, we present a novel optimization algorithm ASGO that capitalizes on these properties by employing a preconditioner that is adaptively updated using structured gradients. By fine-grained theoretical analysis, ASGO is proven to achieve superior convergence rates compared to existing structured gradient methods. Based on the convergence theory, we further demonstrate that ASGO can benefit from the low-rank and block-wise diagonal properties. We also discuss practical modifications of ASGO and empirically verify the effectiveness of the algorithm on language model tasks.

[LG-3] RecTable: Fast Modeling Tabular Data with Rectified Flow

Link: https://arxiv.org/abs/2503.20731
Authors: Masane Fuchi,Tomohiro Takagi
Categories: Machine Learning (cs.LG)
Comments: 19 pages, 7 figures, 10 tables

Abstract:Score-based or diffusion models generate high-quality tabular data, surpassing GAN-based and VAE-based models. However, these methods require substantial training time. In this paper, we introduce RecTable, which uses rectified flow modeling, as applied in text-to-image and text-to-video generation. RecTable features a simple architecture consisting of a few stacked gated linear unit blocks. Additionally, our training strategies are also simple, incorporating a mixed-type noise distribution and a logit-normal timestep distribution. Our experiments demonstrate that RecTable achieves competitive performance compared to several state-of-the-art diffusion and score-based models while reducing the required training time. Our code is available at this https URL.
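The logit-normal timestep distribution mentioned in this abstract can be sketched as follows: draw a Gaussian sample and pass it through a sigmoid, which gives a timestep between 0 and 1. This is a generic illustration, not RecTable's code. The function name and the mean and standard deviation values are assumptions, since the abstract does not state them.

```python
import math
import random


def sample_logit_normal_timestep(rng, mean=0.0, std=1.0):
    """Draw a timestep by applying a sigmoid to a Gaussian sample."""
    z = rng.gauss(mean, std)
    return 1.0 / (1.0 + math.exp(-z))


# Seeded generator so the draws are reproducible.
rng = random.Random(0)
timesteps = [sample_logit_normal_timestep(rng) for _ in range(5)]
print(timesteps)
```

With mean 0, samples concentrate around t = 0.5, so training visits intermediate timesteps more often than the endpoints.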

[LG-4] Benchmarking and optimizing organism wide single-cell RNA alignment methods ICLR2025

Link: https://arxiv.org/abs/2503.20730
Authors: Juan Javier Diaz-Mejia,Elias Williams,Octavian Focsa,Dylan Mendonca,Swechha Singh,Brendan Innes,Sam Cooper
Categories: Machine Learning (cs.LG)
Comments: Accepted to ICLR 2025 LMRL workshop (International Conference on Learning Representations, Learning Meaningful Representations of Life Workshop)

Abstract:Many methods have been proposed for removing batch effects and aligning single-cell RNA (scRNA) datasets. However, performance is typically evaluated based on multiple parameters and few datasets, creating challenges in assessing which method is best for aligning data at scale. Here, we introduce the K-Neighbors Intersection (KNI) score, a single score that both penalizes batch effects and measures accuracy at cross-dataset cell-type label prediction alongside carefully curated small (scMARK) and large (scREF) benchmarks comprising 11 and 46 human scRNA studies respectively, where we have standardized author labels. Using the KNI score, we evaluate and optimize approaches for cross-dataset single-cell RNA integration. We introduce Batch Adversarial single-cell Variational Inference (BA-scVI), as a new variant of scVI that uses adversarial training to penalize batch-effects in the encoder and decoder, and show this approach outperforms other methods. In the resulting aligned space, we find that the granularity of cell-type groupings is conserved, supporting the notion that whole-organism cell-type maps can be created by a single model without loss of information.

[LG-5] Learning Straight Flows by Learning Curved Interpolants ICLR2025

Link: https://arxiv.org/abs/2503.20719
Authors: Shiv Shankar,Tomas Geffner
Categories: Machine Learning (cs.LG)
Comments: Delta workshop at ICLR 2025

Abstract:Flow matching models typically use linear interpolants to define the forward/noise addition process. This, together with the independent coupling between noise and target distributions, yields a vector field which is often non-straight. Such curved fields lead to a slow inference/generation process. In this work, we propose to learn flexible (potentially curved) interpolants in order to learn straight vector fields to enable faster generation. We formulate this via a multi-level optimization problem and propose an efficient approximate procedure to solve it. Our framework provides an end-to-end and simulation-free optimization procedure, which can be leveraged to learn straight line generative trajectories.
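For readers unfamiliar with the baseline this abstract starts from, the linear interpolant in flow matching is commonly written as x_t = (1 - t) * x0 + t * x1, with regression target x1 - x0. The sketch below shows only that baseline, not the learned curved interpolants the paper proposes. Treating x0 as the noise sample and x1 as the data sample is an assumed convention, and the function names are hypothetical.

```python
def linear_interpolant(x0, x1, t):
    """Point on the straight line from x0 (noise) to x1 (data) at time t."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]


def velocity_target(x0, x1):
    """Time derivative of the linear interpolant, constant along the path."""
    return [b - a for a, b in zip(x0, x1)]


x0 = [0.0, 0.0]  # hypothetical noise sample
x1 = [2.0, 4.0]  # hypothetical data sample
print(linear_interpolant(x0, x1, 0.5))
print(velocity_target(x0, x1))
```

Each individual path here is straight. As the abstract notes, the learned vector field is often not straight once noise and data are paired independently, which is the issue the paper targets.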

[LG-6] Semi-supervised Node Importance Estimation with Informative Distribution Modeling for Uncertainty Regularization

Link: https://arxiv.org/abs/2503.20697
Authors: Yankai Chen,Taotao Wang,Yixiang Fang,Yunyu Xiao
Categories: Machine Learning (cs.LG)

Abstract:Node importance estimation, a classical problem in network analysis, underpins various web applications. Previous methods either exploit intrinsic topological characteristics, e.g., graph centrality, or leverage additional information, e.g., data heterogeneity, for node feature enhancement. However, these methods follow the supervised learning setting, overlooking the fact that ground-truth node-importance data are usually partially labeled in practice. In this work, we propose the first semi-supervised node importance estimation framework, i.e., EASING, to improve learning quality for unlabeled data in heterogeneous graphs. Different from previous approaches, EASING explicitly captures uncertainty to reflect the confidence of model predictions. To jointly estimate the importance values and uncertainties, EASING incorporates DJE, a deep encoder-decoder neural architecture. DJE introduces distribution modeling for graph nodes, where the distribution representations derive both importance and uncertainty estimates. Additionally, DJE facilitates effective pseudo-label generation for the unlabeled data to enrich the training samples. Based on labeled and pseudo-labeled data, EASING develops effective semi-supervised heteroscedastic learning with varying node uncertainty regularization. Extensive experiments on three real-world datasets highlight the superior performance of EASING compared to competing methods. Codes are available via this https URL.

[LG-7] A Low-complexity Structured Neural Network Approach to Intelligently Realize Wideband Multi-beam Beamformers

Link: https://arxiv.org/abs/2503.20694
Authors: Hansaka Aluvihare,Sivakumar Sivasankar,Xianqi Li,Arjuna Madanayake,Sirani M. Perera
Categories: Machine Learning (cs.LG); Signal Processing (eess.SP)
Comments: 10 pages, 3 figures

Abstract:True-time-delay (TTD) beamformers can produce wideband, squint-free beams in both analog and digital signal domains, unlike frequency-dependent FFT beams. Our previous work showed that TTD beamformers can be efficiently realized using the elements of delay Vandermonde matrix (DVM), answering the longstanding beam-squint problem. Thus, building on our work on classical algorithms based on DVM, we propose a neural network (NN) architecture to realize wideband multi-beam beamformers using structure-imposed weight matrices and submatrices. The structure and sparsity of the weight matrices and submatrices are shown to reduce the space and computational complexities of the NN greatly. The proposed network architecture has O(pLM log M) complexity compared to a conventional fully connected L-layer network with O(M^2 L) complexity, where M is the number of nodes in each layer of the network, p is the number of submatrices per layer, and M ≫ p. We will show numerical simulations in the 24 GHz to 32 GHz range to demonstrate the numerical feasibility of realizing wideband multi-beam beamformers using the proposed neural architecture. We also show the complexity reduction of the proposed NN and compare that with fully connected NNs, to show the efficiency of the proposed architecture without sacrificing accuracy. The accuracy of the proposed NN architecture was shown using the mean squared error, which is based on an objective function of the weight matrices and beamformed signals of antenna arrays, while also normalizing nodes. The proposed NN architecture shows a low-complexity NN realizing wideband multi-beam beamformers in real-time for low-complexity intelligent systems.
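To get a feel for the two complexity expressions quoted in this abstract, the sketch below evaluates p * L * M * log(M) and M^2 * L for one set of sizes. It is only back-of-the-envelope arithmetic: big-O notation hides constant factors, the base-2 logarithm is an assumption, and the values of p, L, and M are hypothetical rather than taken from the paper.

```python
import math


def structured_cost(p, L, M):
    """Leading-order operation count p * L * M * log2(M)."""
    return p * L * M * math.log2(M)


def dense_cost(L, M):
    """Leading-order operation count M^2 * L for a fully connected network."""
    return M * M * L


# Hypothetical sizes with M much larger than p.
p, L, M = 4, 3, 1024
ratio = dense_cost(L, M) / structured_cost(p, L, M)
print(structured_cost(p, L, M), dense_cost(L, M), ratio)
```

The gap widens as M grows with p held fixed, because M / (p * log M) increases with M.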

[LG-8] DR-PETS: Learning-Based Control With Planning in Adversarial Environments

Link: https://arxiv.org/abs/2503.20660
Authors: Hozefa Jesawada,Antonio Acernese,Giovanni Russo,Carmen Del Vecchio
Categories: Machine Learning (cs.LG); Optimization and Control (math.OC)
Comments: 6 pages, 2 figures, submitted to LCSS

Abstract:Ensuring robustness against epistemic, possibly adversarial, perturbations is essential for reliable real-world decision-making. While the Probabilistic Ensembles with Trajectory Sampling (PETS) algorithm inherently handles uncertainty via ensemble-based probabilistic models, it lacks guarantees against structured adversarial or worst-case uncertainty distributions. To address this, we propose DR-PETS, a distributionally robust extension of PETS that certifies robustness against adversarial perturbations. We formalize uncertainty via a p-Wasserstein ambiguity set, enabling worst-case-aware planning through a min-max optimization framework. While PETS passively accounts for stochasticity, DR-PETS actively optimizes robustness via a tractable convex approximation integrated into the PETS planning loop. Experiments on pendulum stabilization and cart-pole balancing show that DR-PETS certifies robustness against adversarial parameter perturbations, achieving consistent performance in worst-case scenarios where PETS deteriorates.

[LG-9] Enhancing Multi-modal Models with Heterogeneous MoE Adapters for Fine-tuning

Link: https://arxiv.org/abs/2503.20633
Authors: Sashuai Zhou,Hai Huang,Yan Xia
Categories: Machine Learning (cs.LG)
Comments: 6 pages, 3 figures

Abstract:Multi-modal models excel in cross-modal tasks but are computationally expensive due to their billions of parameters. Parameter-efficient fine-tuning (PEFT) offers a solution by adding small trainable components while freezing pre-trained parameters. However, existing methods primarily focus on uni-modal processing, overlooking the critical modal fusion needed for multi-modal tasks. To fill this gap, we propose heterogeneous mixture of experts adapters that extend the traditional PEFT framework to support multi-modal expert combinations and improve information interaction. Additionally, our approach modifies the affine linear expert design to enable efficient modal fusion in a low-rank space, achieving competitive performance with only 5-8% of the parameters fine-tuned. Experiments across eight downstream tasks, including visual-audio and text-visual, demonstrate the superior performance of the approach.

[LG-10] ProFed: a Benchmark for Proximity-based non-IID Federated Learning

Link: https://arxiv.org/abs/2503.20618
Authors: Davide Domini,Gianluca Aguzzi,Mirko Viroli
Categories: Machine Learning (cs.LG)

Abstract:In recent years, federated learning (FL) has gained significant attention within the machine learning community. Although various FL algorithms have been proposed in the literature, their performance often degrades when data across clients is non-independently and identically distributed (non-IID). This skewness in data distribution often emerges from geographic patterns, with notable examples including regional linguistic variations in text data or localized traffic patterns in urban environments. Such scenarios result in IID data within specific regions but non-IID data across regions. However, existing FL algorithms are typically evaluated by randomly splitting non-IID data across devices, disregarding their spatial distribution. To address this gap, we introduce ProFed, a benchmark that simulates data splits with varying degrees of skewness across different regions. We incorporate several skewness methods from the literature and apply them to well-known datasets, including MNIST, FashionMNIST, CIFAR-10, and CIFAR-100. Our goal is to provide researchers with a standardized framework to evaluate FL algorithms more effectively and consistently against established baselines.

[LG-11] Feature Statistics with Uncertainty Help Adversarial Robustness

Link: https://arxiv.org/abs/2503.20583
Authors: Ran Wang,Xinlei Zhou,Rihao Li,Meng Hu,Wenhui Wu,Yuheng Jia
Categories: Machine Learning (cs.LG)

Abstract:Despite the remarkable success of deep neural networks (DNNs), the security threat of adversarial attacks poses a significant challenge to the reliability of DNNs. By introducing randomness into different parts of DNNs, stochastic methods can enable the model to learn some uncertainty, thereby improving model robustness efficiently. In this paper, we theoretically discover a universal phenomenon that adversarial attacks will shift the distributions of feature statistics. Motivated by this theoretical finding, we propose a robustness enhancement module called Feature Statistics with Uncertainty (FSU). It resamples channel-wise feature means and standard deviations of examples from multivariate Gaussian distributions, which helps to reconstruct the attacked examples and calibrate the shifted distributions. The calibration recovers some domain characteristics of the data for classification, thereby mitigating the influence of perturbations and weakening the ability of attacks to deceive models. The proposed FSU module has universal applicability in training, attacking, predicting and fine-tuning, demonstrating impressive robustness enhancement ability at trivial additional time cost. For example, against powerful optimization-based CW attacks, by incorporating FSU into attacking and predicting phases, it endows many collapsed state-of-the-art models with 50%-80% robust accuracy on CIFAR10, CIFAR100 and SVHN.

[LG-12] A Theoretical Framework for Prompt Engineering: Approximating Smooth Functions with Transformer Prompts

Link: https://arxiv.org/abs/2503.20561
Authors: Ryumei Nakada,Wenlong Ji,Tianxi Cai,James Zou,Linjun Zhang
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: 55 pages, 2 figures

Abstract:Prompt engineering has emerged as a powerful technique for guiding large language models (LLMs) toward desired responses, significantly enhancing their performance across diverse tasks. Beyond their role as static predictors, LLMs increasingly function as intelligent agents, capable of reasoning, decision-making, and adapting dynamically to complex environments. However, the theoretical underpinnings of prompt engineering remain largely unexplored. In this paper, we introduce a formal framework demonstrating that transformer models, when provided with carefully designed prompts, can act as a configurable computational system by emulating a "virtual" neural network during inference. Specifically, input prompts effectively translate into the corresponding network configuration, enabling LLMs to adjust their internal computations dynamically. Building on this construction, we establish an approximation theory for β-times differentiable functions, proving that transformers can approximate such functions with arbitrary precision when guided by appropriately structured prompts. Moreover, our framework provides theoretical justification for several empirically successful prompt engineering techniques, including the use of longer, structured prompts, filtering irrelevant information, enhancing prompt token diversity, and leveraging multi-agent interactions. By framing LLMs as adaptable agents rather than static models, our findings underscore their potential for autonomous reasoning and problem-solving, paving the way for more robust and theoretically grounded advancements in prompt engineering and AI agent design.

[LG-13] Injecting Adrenaline into LLM Serving: Boosting Resource Utilization and Throughput via Attention Disaggregation

Link: https://arxiv.org/abs/2503.20552
Authors: Yunkai Liang,Zhangyu Chen,Pengfei Zuo,Zhi Zhou,Xu Chen,Zhou Yu
Categories: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
Comments: 14 pages, 18 figures

Abstract:In large language model (LLM) serving systems, executing each request consists of two phases: the compute-intensive prefill phase and the memory-intensive decoding phase. To prevent performance interference between the two phases, current LLM serving systems typically adopt prefill-decoding disaggregation, where the two phases are split across separate machines. However, we observe this approach leads to significant resource underutilization. Specifically, prefill instances that are compute-intensive suffer from low memory utilization, while decoding instances that are memory-intensive experience low compute utilization. To address this problem, this paper proposes Adrenaline, an attention disaggregation and offloading mechanism designed to enhance resource utilization and performance in LLM serving systems. Adrenaline’s key innovation lies in disaggregating part of the attention computation in the decoding phase and offloading it to prefill instances. The memory-bound nature of decoding-phase attention computation inherently enables an effective offloading strategy, yielding two complementary advantages: 1) improved memory capacity and bandwidth utilization in prefill instances, and 2) increased decoding batch sizes that enhance compute utilization in decoding instances, collectively boosting overall system performance. Adrenaline achieves these gains through three key techniques: low-latency decoding synchronization, resource-efficient prefill colocation, and load-aware offloading scheduling. Experimental results show that Adrenaline achieves 2.28x higher memory capacity and 2.07x better memory bandwidth utilization in prefill instances, up to 1.67x improvements in compute utilization for decoding instances, and 1.68x higher overall inference throughput compared to state-of-the-art systems.

[LG-14] Harmonia: A Multi-Agent Reinforcement Learning Approach to Data Placement and Migration in Hybrid Storage Systems

Link: https://arxiv.org/abs/2503.20507
Authors: Rakesh Nadig,Vamanan Arulchelvan,Rahul Bera,Taha Shahroodi,Gagandeep Singh,Mohammad Sadrosadati,Jisung Park,Onur Mutlu
Categories: Hardware Architecture (cs.AR); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)

Abstract:Hybrid storage systems (HSS) combine multiple storage devices with diverse characteristics to achieve high performance and capacity at low cost. The performance of an HSS highly depends on the effectiveness of two key policies: (1) the data-placement policy, which determines the best-fit storage device for incoming data, and (2) the data-migration policy, which rearranges stored data across the devices to sustain high HSS performance. Prior works focus on improving only data placement or only data migration in HSS, which leads to sub-optimal HSS performance. Unfortunately, no prior work tries to optimize both policies together. Our goal is to design a holistic data-management technique for HSS that optimizes both data-placement and data-migration policies to fully exploit the potential of an HSS. We propose Harmonia, a multi-agent reinforcement learning (RL)-based data-management technique that employs two light-weight autonomous RL agents, a data-placement agent and a data-migration agent, which adapt their policies for the current workload and HSS configuration, and coordinate with each other to improve overall HSS performance. We evaluate Harmonia on a real HSS with up to four heterogeneous storage devices with diverse characteristics. Our evaluation using 17 data-intensive workloads on performance-optimized (cost-optimized) HSS with two storage devices shows that, on average, Harmonia (1) outperforms the best-performing prior approach by 49.5% (31.7%), (2) bridges the performance gap between the best-performing prior work and Oracle by 64.2% (64.3%). On an HSS with three (four) devices, Harmonia outperforms the best-performing prior work by 37.0% (42.0%). Harmonia’s performance benefits come with low latency (240ns for inference) and storage overheads (206 KiB for both RL agents together). We plan to open-source Harmonia’s implementation to aid future research on HSS.

[LG-15] Riemannian Optimization on Relaxed Indicator Matrix Manifold

链接: https://arxiv.org/abs/2503.20505
作者: Jinghui Yuan,Fangyuan Xie,Feiping Nie,Xuelong Li
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:The indicator matrix plays an important role in machine learning, but optimizing it is an NP-hard problem. We propose a new relaxation of the indicator matrix and prove that this relaxation forms a manifold, which we call the Relaxed Indicator Matrix Manifold (RIM manifold). Based on Riemannian geometry, we develop a Riemannian toolbox for optimization on the RIM manifold. Specifically, we provide several methods of Retraction, including a fast Retraction method to obtain geodesics. We point out that the RIM manifold is a generalization of the doubly stochastic manifold, and that optimization on it is much faster than existing methods on the doubly stochastic manifold: those methods have a complexity of \( \mathcal{O}(n^3) \), while RIM manifold optimization is \( \mathcal{O}(n) \) and often yields better results. We conducted extensive experiments, including image denoising, with millions of variables to support our conclusion, and applied the RIM manifold to Ratio Cut, achieving clustering results that outperform the state-of-the-art methods. Our code is available at this https URL.

[LG-16] Adaptive Local Clustering over Attributed Graphs ICDE2025

链接: https://arxiv.org/abs/2503.20488
作者: Haoran Zheng,Renchi Yang,Jianliang Xu
类目: Social and Information Networks (cs.SI); Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG)
*备注: Accepted by ICDE2025. The code is available at this https URL

点击查看摘要

Abstract:Given a graph G and a seed node v_s , the objective of local graph clustering (LGC) is to identify a subgraph C_s \in G (a.k.a. local cluster) surrounding v_s in time roughly linear with the size of C_s . This approach yields personalized clusters without needing to access the entire graph, which makes it highly suitable for numerous applications involving large graphs. However, most existing solutions merely rely on the topological connectivity between nodes in G , rendering them vulnerable to missing or noisy links that are commonly present in real-world graphs. To address this issue, this paper resorts to leveraging the complementary nature of graph topology and node attributes to enhance local clustering quality. To effectively exploit the attribute information, we first formulate the LGC as an estimation of the bidirectional diffusion distribution (BDD), which is specialized for capturing the multi-hop affinity between nodes in the presence of attributes. Furthermore, we propose LACA, an efficient and effective approach for LGC that achieves superb empirical performance on multiple real datasets while maintaining strong locality. The core components of LACA include (i) a fast and theoretically-grounded preprocessing technique for node attributes, (ii) an adaptive algorithm for diffusing any vectors over G with rigorous theoretical guarantees and expedited convergence, and (iii) an effective three-step scheme for BDD approximation. Extensive experiments, comparing 17 competitors on 8 real datasets, show that LACA outperforms all competitors in terms of result quality measured against ground truth local clusters, while also being up to orders of magnitude faster. The code is available at this https URL.

[LG-17] The Crucial Role of Problem Formulation in Real-World Reinforcement Learning

链接: https://arxiv.org/abs/2503.20442
作者: Georg Schäfer,Tatjana Krau,Jakob Rehrl,Stefan Huber,Simon Hirlaender
类目: Systems and Control (eess.SY); Machine Learning (cs.LG)
*备注: Accepted at ICPS 2025

点击查看摘要

Abstract:Reinforcement Learning (RL) offers promising solutions for control tasks in industrial cyber-physical systems (ICPSs), yet its real-world adoption remains limited. This paper demonstrates how seemingly small but well-designed modifications to the RL problem formulation can substantially improve performance, stability, and sample efficiency. We identify and investigate key elements of RL problem formulation and show that these enhance both learning speed and final policy quality. Our experiments use a one-degree-of-freedom (1-DoF) helicopter testbed, the Quanser Aero 2, which features non-linear dynamics representative of many industrial settings. In simulation, the proposed problem design principles yield more reliable and efficient training, and we further validate these results by training the agent directly on physical hardware. The encouraging real-world outcomes highlight the potential of RL for ICPS, especially when careful attention is paid to the design principles of problem formulation. Overall, our study underscores the crucial role of thoughtful problem formulation in bridging the gap between RL research and the demands of real-world industrial systems.

[LG-18] Active Data Sampling and Generation for Bias Remediation

链接: https://arxiv.org/abs/2503.20414
作者: Antonio Maratea,Rita Perna
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Adequate sampling space coverage is the keystone to effectively train trustworthy Machine Learning models. Unfortunately, real data carry several inherent risks due to the many potential biases they exhibit when gathered without proper random sampling over the reference population, and most of the time such sampling is too expensive or time-consuming to be a viable option. Depending on how training data have been gathered, unmitigated biases can lead to harmful or discriminatory consequences that ultimately hinder large-scale applicability of pre-trained models and undermine their truthfulness or fairness expectations. In this paper, a mixed active sampling and data generation strategy, called samplation, is proposed as a means to compensate, during fine-tuning of a pre-trained classifier, for the unfair classifications it produces, assuming that the training data come from a non-probabilistic sampling schema. Given a pre-trained classifier, first a fairness metric is evaluated on a test set, then new reservoirs of labeled data are generated, and finally a number of reversely-biased artificial samples are generated for the fine-tuning of the model. Using Deep Models for visual semantic role labeling as a case study, the proposed method was able to fully cure a simulated gender bias starting from a 90/10 imbalance, with only a small percentage of new data and with a minor effect on accuracy.

[LG-19] Multi-dataset and Transfer Learning Using Gene Expression Knowledge Graphs

链接: https://arxiv.org/abs/2503.20400
作者: Rita T. Sousa,Heiko Paulheim
类目: Machine Learning (cs.LG)
*备注: Accepted at the Extended Semantic Web Conference 2025

点击查看摘要

Abstract:Gene expression datasets offer insights into gene regulation mechanisms, biochemical pathways, and cellular functions. Additionally, comparing gene expression profiles between disease and control patients can deepen the understanding of disease pathology. Therefore, machine learning has been used to process gene expression data, with patient diagnosis emerging as one of the most popular applications. Although gene expression data can provide valuable insights, challenges arise because the number of patients in expression datasets is usually limited, and the data from different datasets with different gene expressions cannot be easily combined. This work proposes a novel methodology to address these challenges by integrating multiple gene expression datasets and domain-specific knowledge using knowledge graphs, a unique tool for biomedical data integration. Then, vector representations are produced using knowledge graph embedding techniques, which are used as inputs for a graph neural network and a multi-layer perceptron. We evaluate the efficacy of our methodology in three settings: single-dataset learning, multi-dataset learning, and transfer learning. The experimental results show that combining gene expression datasets and domain-specific knowledge improves patient diagnosis in all three settings.

[LG-20] CNN+Transformer Based Anomaly Traffic Detection in UAV Networks for Emergency Rescue

链接: https://arxiv.org/abs/2503.20355
作者: Yulu Han,Ziye Jia,Sijie He,Yu Zhang,Qihui Wu
类目: Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI)
*备注:

点击查看摘要

Abstract:The unmanned aerial vehicle (UAV) network has gained significant attention in recent years due to its various applications. However, traffic security has become a key public-safety issue in emergency rescue systems, because UAVs are increasingly vulnerable to cyber attacks in highly heterogeneous environments. Hence, in this paper, we propose a novel anomaly traffic detection architecture for UAV networks based on the software-defined networking (SDN) framework and blockchain technology. Specifically, SDN separates the control and data planes to enhance network manageability and security. Meanwhile, the blockchain provides decentralized identity authentication and data security records. Besides, a complete security architecture requires an effective mechanism to detect abnormal traffic in time series. Thus, an integrated algorithm combining convolutional neural networks (CNNs) and Transformer (CNN+Transformer) for anomaly traffic detection, called CTranATD, is developed. Finally, the simulation results show that the proposed CTranATD algorithm is effective and outperforms the individual CNN, Transformer, and LSTM algorithms for detecting anomaly traffic.

[LG-21] Revisit Time Series Classification Benchmark: The Impact of Temporal Information for Classification PAKDD2025

链接: https://arxiv.org/abs/2503.20264
作者: Yunrui Zhang,Gustavo Batista,Salil S. Kanhere
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: Accepted to PAKDD2025

点击查看摘要

Abstract:Time series classification is usually regarded as a distinct task from tabular data classification due to the importance of temporal information. However, in this paper, by performing permutation tests that disrupt temporal information on the UCR time series classification archive, the most widely used benchmark for time series classification, we identify a significant proportion of datasets where temporal information has little to no impact on classification. Many of these datasets are tabular in nature or rely mainly on tabular features, leading to potentially biased evaluations of time series classifiers focused on temporal information. To address this, we propose UCR Augmented, a benchmark based on the UCR time series classification archive designed to evaluate classifiers’ ability to extract and utilize temporal information. Testing classifiers from seven categories on this benchmark revealed notable shifts in performance rankings. Some previously overlooked approaches perform well, while others see their performance decline significantly when temporal information is crucial. UCR Augmented provides a more robust framework for assessing time series classifiers, ensuring fairer evaluations. Our code is available at this https URL.
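
The permutation idea in this abstract can be illustrated with a minimal sketch. It is not the authors' code: the function name `permute_time_axis`, the array layout `(n_series, n_timesteps)`, and the choice of one shared permutation for all series are assumptions made here for illustration.

```python
import numpy as np

def permute_time_axis(X, seed=0):
    """Apply one shared random permutation to the time axis of every series.

    X has shape (n_series, n_timesteps). Sharing the permutation keeps each
    column a consistent "feature" across series while destroying the order
    in which the values were observed.
    """
    rng = np.random.default_rng(seed)
    perm = rng.permutation(X.shape[1])
    return X[:, perm], perm

# Toy data (hypothetical): 4 series of length 6.
X = np.arange(24, dtype=float).reshape(4, 6)
X_perm, perm = permute_time_axis(X, seed=0)

# The permutation is invertible, so no values are lost, only their order.
X_restored = X_perm[:, np.argsort(perm)]
```

Under these assumptions, a classifier that can exploit temporal order would be trained once on `X` and once on `X_perm`; a small accuracy gap would suggest that the dataset behaves like tabular data.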

[LG-22] Solving 2-D Helmholtz equation in the rectangular, circular, and elliptical domains using neural networks

链接: https://arxiv.org/abs/2503.20222
作者: D. Veerababu,Prasanta K. Ghosh
类目: Machine Learning (cs.LG); Computational Engineering, Finance, and Science (cs.CE); Computational Physics (physics.comp-ph)
*备注: 59 pages

点击查看摘要

Abstract:Physics-informed neural networks offer an alternative way to solve several differential equations that govern complicated physics. However, their success in predicting the acoustic field is limited by the vanishing-gradient problem that occurs when solving the Helmholtz equation. In this paper, a formulation is presented that addresses this difficulty. The problem of solving the two-dimensional Helmholtz equation with the prescribed boundary conditions is posed as an unconstrained optimization problem using the trial solution method. According to this method, a trial neural network that satisfies the given boundary conditions prior to the training process is constructed using the technique of transfinite interpolation and the theory of R-functions. This ansatz is initially applied to the rectangular domain and later extended to the circular and elliptical domains. The acoustic field predicted by the proposed formulation is compared with that obtained from the two-dimensional finite element method. Good agreement is observed in all three domains considered. Minor limitations associated with the proposed formulation and their remedies are also discussed.
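
To make the trial solution idea concrete, here is a minimal sketch of an ansatz that satisfies boundary conditions before any training. It assumes homogeneous Dirichlet conditions on the unit square and uses the simple multiplier x(1-x)y(1-y) in place of the paper's transfinite interpolation and R-functions; the tiny untrained network and all names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
# Weights of a small placeholder network (untrained, for illustration only).
W1, b1 = rng.normal(size=(2, 16)), rng.normal(size=16)
W2, b2 = rng.normal(size=(16, 1)), rng.normal(size=1)

def network(xy):
    """Placeholder MLP mapping points (x, y) to one scalar per point."""
    return (np.tanh(xy @ W1 + b1) @ W2 + b2)[:, 0]

def trial_solution(xy):
    """u(x, y) = x(1-x) y(1-y) N(x, y), which vanishes on the boundary of [0,1]^2."""
    x, y = xy[:, 0], xy[:, 1]
    return x * (1 - x) * y * (1 - y) * network(xy)

# One point on each edge of the square, plus one interior point.
boundary = np.array([[0.0, 0.3], [1.0, 0.7], [0.4, 0.0], [0.9, 1.0]])
interior = np.array([[0.5, 0.5]])
u_boundary = trial_solution(boundary)
u_interior = trial_solution(interior)
```

Because the boundary values are fixed by construction, a training loss for such an ansatz would only need the PDE residual at interior points, which is what turns the problem into an unconstrained optimization.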

[LG-23] Maya: Optimizing Deep Learning Training Workloads using Emulated Virtual Accelerators

链接: https://arxiv.org/abs/2503.20191
作者: Srihas Yarlagadda,Amey Agrawal,Elton Pinto,Hakesh Darapaneni,Mitali Meratwal,Shivam Mittal,Pranavi Bajjuri,Srinivas Sridharan,Alexey Tumanov
类目: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注:

点击查看摘要

Abstract:Training large foundation models costs hundreds of millions of dollars, making deployment optimization critical. Current approaches require machine learning engineers to manually craft training recipes through error-prone trial-and-error on expensive compute clusters. To enable efficient exploration of training configurations, researchers have developed performance modeling systems. However, these systems force users to translate their workloads into custom specification languages, introducing a fundamental semantic gap between the actual workload and its representation. This gap creates an inherent tradeoff: systems must either support a narrow set of workloads to maintain usability, require complex specifications that limit practical adoption, or compromise prediction accuracy with simplified models. We present Maya, a performance modeling system that eliminates these tradeoffs through transparent device emulation. By operating at the narrow interface between training frameworks and accelerator devices, Maya can capture complete workload behavior without requiring code modifications or translations. Maya intercepts device API calls from unmodified training code to directly observe low-level operations, enabling accurate performance prediction while maintaining both ease of use and generality. Our evaluation shows Maya achieves less than 5% prediction error across diverse models and optimization strategies, identifying configurations that reduce training costs by up to 56% compared to existing approaches.

[LG-24] AIGC-assisted Federated Learning for Edge Intelligence: Architecture Design, Research Challenges and Future Directions

链接: https://arxiv.org/abs/2503.20166
作者: Xianke Qiang,Zheng Chang,Ying-Chang Liang
类目: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注:

点击查看摘要

Abstract:Federated learning (FL) can fully leverage large-scale terminal data while ensuring privacy and security, and is considered a distributed alternative to centralized machine learning. However, data heterogeneity limits FL’s performance. To address this challenge, artificial intelligence-generated content (AIGC), an innovative data synthesis technique, emerges as one potential solution. In this article, we first provide an overview of the system architecture, performance metrics, and challenges associated with AIGC-assisted FL system design. We then propose the Generative federated learning (GenFL) architecture and present its workflow, including the design of aggregation and weight policy. Finally, using the CIFAR10 and CIFAR100 datasets, we employ diffusion models to generate data and improve FL performance. Experiments conducted under various non-independent and identically distributed (non-IID) data distributions demonstrate the effectiveness of GenFL in overcoming the bottlenecks in FL caused by data heterogeneity. Open research directions for AIGC-assisted FL are also discussed.

[LG-25] Emotion Detection in Twitter Messages Using Combination of Long Short-Term Memory and Convolutional Deep Neural Networks

链接: https://arxiv.org/abs/2503.20163
作者: Bahareh Golchin,Noushin Riahi
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Recognizing sentiments and emotions in social media texts has received considerable attention in recent years. Sentiment and emotion analysis aims to recognize conceptual information, such as the opinions, feelings, attitudes, and emotions of people towards products, services, organizations, people, topics, events, and features, in written text. The breadth of these targets indicates the size of the problem space. In the real world, businesses and organizations are always looking for tools to gather people’s ideas, emotions, and orientations regarding their products, services, or related events. This article uses the Twitter social network, one of the most popular social networks with about 420 million active users, to extract data. Using this social network, users can share their information and opinions about personal issues, policies, products, events, etc. Because its data are readily available, Twitter lends itself to the classification of emotional states. In this study, supervised learning and deep neural network algorithms are used to classify the emotional states of Twitter users. Given the large amount of available data, using deep learning methods to increase the learning capacity of the model is an advantage. Tweets collected on various topics are classified into four classes using a combination of two Bidirectional Long Short-Term Memory networks and a Convolutional network. The proposed framework achieves an average accuracy of 93%, an improvement over previous work.

[LG-26] Addressing Challenges in Time Series Forecasting: A Comprehensive Comparison of Machine Learning Techniques

链接: https://arxiv.org/abs/2503.20148
作者: Seyedeh Azadeh Fallah Mortezanejad,Ruochen Wang(School of Automotive and Traffic Engineering, Jiangsu University, Zhenjiang, Jiangsu, China)
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The explosion of Time Series (TS) data, driven by advancements in technology, necessitates sophisticated analytical methods. Modern management systems increasingly rely on analyzing this data, highlighting the importance of efficient processing techniques. State-of-the-art Machine Learning (ML) approaches for TS analysis and forecasting are becoming prevalent. This paper briefly describes and compiles suitable algorithms for TS regression tasks. We compare these algorithms against each other and the classic ARIMA method using diverse datasets: complete data, data with outliers, and data with missing values. The focus is on forecasting accuracy, particularly for long-term predictions. This research aids in selecting the most appropriate algorithm based on forecasting needs and data characteristics.

[LG-27] Physics-Informed Neural Networks with Unknown Partial Differential Equations: an Application in Multivariate Time Series

链接: https://arxiv.org/abs/2503.20144
作者: Seyedeh Azadeh Fallah Mortezanejad(1),Ruochen Wang(2),Ali Mohammad-Djafari(3, 4) ((1, 2) School of Automotive and Traffic Engineering, Jiangsu University, Zhenjiang, Jiangsu, China. (3) International Science Consulting and Training (ISCT), Bures sur Yvette, France. (4) Shanfeng Company, Shaoxing, China)
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:A significant advancement in Neural Network (NN) research is the integration of domain-specific knowledge through custom loss functions. This approach addresses a crucial challenge: how can models utilize physics or mathematical principles to enhance predictions when dealing with sparse, noisy, or incomplete data? Physics-Informed Neural Networks (PINNs) put this idea into practice by incorporating physical equations, such as Partial Differential Equations (PDEs), as soft constraints. This guidance helps the networks find solutions that align with established laws. Recently, researchers have expanded this framework to include Bayesian NNs (BNNs), which allow for uncertainty quantification while still adhering to physical principles. But what happens when the governing equations of a system are not known? In this work, we introduce methods to automatically extract PDEs from historical data. We then integrate these learned equations into three different modeling approaches: PINNs, Bayesian-PINNs (B-PINNs), and Bayesian Linear Regression (BLR). To assess these frameworks, we evaluate them on a real-world Multivariate Time Series (MTS) dataset. We compare their effectiveness in forecasting future states under different scenarios: with and without PDE constraints and accuracy considerations. This research aims to bridge the gap between data-driven discovery and physics-guided learning, providing valuable insights for practical applications.

[LG-28] Innovative LSGTime Model for Crime Spatiotemporal Prediction Based on MindSpore Framework

链接: https://arxiv.org/abs/2503.20136
作者: Zhenkai Qin,Weibao Zhong,Caifeng Gao
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:With the acceleration of urbanization, the spatiotemporal characteristics of criminal activities have become increasingly complex. Accurate prediction of crime distribution is crucial for optimizing the allocation of police resources and preventing crime. This paper proposes LGSTime, a crime spatiotemporal prediction model that integrates Long Short-Term Memory (LSTM), Gated Recurrent Unit (GRU), and the Multi-head Sparse Self-attention mechanism. LSTM and GRU capture long-term dependencies in crime time series, such as seasonality and periodicity, through their unique gating mechanisms. The Multi-head Sparse Self-attention mechanism, on the other hand, focuses on both temporal and spatial features of criminal events simultaneously through parallel processing and sparsification techniques, significantly improving computational efficiency and prediction accuracy. The integrated model leverages the strengths of each technique to better handle complex spatiotemporal data. Experimental findings demonstrate that the model attains optimal performance across four real-world crime datasets. In comparison to the CNN model, it exhibits performance enhancements of 2.8%, 1.9%, and 1.4% in the Mean Squared Error (MSE), Mean Absolute Error (MAE), and Root Mean Squared Error (RMSE) metrics respectively. These results offer a valuable reference for tackling the challenges in crime prediction.

[LG-29] From Interpretation to Correction: A Decentralized Optimization Framework for Exact Convergence in Federated Learning

链接: https://arxiv.org/abs/2503.20117
作者: Bicheng Ying,Zhe Li,Haibo Yang
类目: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注:

点击查看摘要

Abstract:This work introduces a novel decentralized framework to interpret federated learning (FL) and, consequently, correct the biases introduced by arbitrary client participation and data heterogeneity, which are two typical traits in practical FL. Specifically, we first reformulate the core processes of FedAvg (client participation, local updating, and model aggregation) as stochastic matrix multiplications. This reformulation allows us to interpret FedAvg as a decentralized algorithm. Leveraging the decentralized optimization framework, we are able to provide a concise analysis to quantify the impact of arbitrary client participation and data heterogeneity on FedAvg’s convergence point. This insight motivates the development of Federated Optimization with Exact Convergence via Push-pull Strategy (FOCUS), a novel algorithm inspired by the decentralized algorithm that eliminates these biases and achieves exact convergence without requiring the bounded heterogeneity assumption. Furthermore, we theoretically prove that FOCUS exhibits linear convergence (exponential decay) for both strongly convex and non-convex functions satisfying the Polyak-Lojasiewicz condition, regardless of the arbitrary nature of client participation.
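
As a rough illustration of how a FedAvg step can be written as a stochastic matrix multiplication, the sketch below stacks one model per row and applies a row-stochastic averaging matrix. The construction is an assumption made here, not the paper's exact formulation: the name `aggregation_matrix` is hypothetical, and absent clients are simply given identity rows.

```python
import numpy as np

def aggregation_matrix(n_clients, participants):
    """Row-stochastic matrix that averages the participating clients' models.

    Rows of participating clients put weight 1/|S| on the participants'
    columns; rows of absent clients are identity rows, so their models are
    left unchanged (an assumption made for this sketch).
    """
    W = np.eye(n_clients)
    S = list(participants)
    for i in S:
        W[i, :] = 0.0
        W[i, S] = 1.0 / len(S)
    return W

# One model (2 parameters) per row, 4 clients; clients 0 and 2 participate.
X = np.array([[1.0, 0.0], [3.0, 2.0], [5.0, 4.0], [7.0, 6.0]])
W = aggregation_matrix(4, participants=[0, 2])
X_next = W @ X  # aggregation expressed as a single matrix product
```

Here rows 0 and 2 of `X_next` both become the average `[3.0, 2.0]`, while rows 1 and 3 are untouched. Drawing a different participant set each round gives a sequence of random stochastic matrices, which is the kind of object decentralized optimization analyses work with.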

[LG-30] Domain Adaptation Framework for Turning Movement Count Estimation with Limited Data

链接: https://arxiv.org/abs/2503.20113
作者: Xiaobo Ma,Hyunsoo Noh,Ryan Hatch,James Tokishi,Zepu Wang
类目: Machine Learning (cs.LG)
*备注: arXiv admin note: substantial text overlap with arXiv:2412.09861

点击查看摘要

Abstract:Urban transportation networks are vital for the efficient movement of people and goods, necessitating effective traffic management and planning. An integral part of traffic management is understanding the turning movement counts (TMCs) at intersections. Accurate TMCs are crucial for traffic signal control, congestion mitigation, and road safety. In general, TMCs are obtained using physical sensors installed at intersections, but this approach can be cost-prohibitive and technically challenging, especially for cities with extensive road networks. Recent advancements in machine learning and data-driven approaches have offered promising alternatives for estimating TMCs. Traffic patterns can vary significantly across different intersections due to factors such as road geometry, traffic signal settings, and local driver behaviors. This domain discrepancy limits the generalizability and accuracy of machine learning models when applied to new or unseen intersections. In response to these limitations, this research proposes a novel framework leveraging domain adaptation (DA) to estimate TMCs at intersections by using traffic controller event-based data, road infrastructure data, and point-of-interest (POI) data. Evaluated on 30 intersections in Tucson, Arizona, the performance of the proposed DA framework was compared with state-of-the-art models and achieved the lowest values in terms of Mean Absolute Error and Root Mean Square Error.

[LG-31] Extendable Long-Horizon Planning via Hierarchical Multiscale Diffusion

链接: https://arxiv.org/abs/2503.20102
作者: Chang Chen,Hany Hamed,Doojin Baek,Taegu Kang,Yoshua Bengio,Sungjin Ahn
类目: Machine Learning (cs.LG); Robotics (cs.RO)
*备注: First two authors contributed equally

点击查看摘要

Abstract:This paper tackles a novel problem, extendable long-horizon planning: enabling agents to plan trajectories longer than those in training data without compounding errors. To tackle this, we propose the Hierarchical Multiscale Diffuser (HM-Diffuser) and Progressive Trajectory Extension (PTE), an augmentation method that iteratively generates longer trajectories by stitching shorter ones. HM-Diffuser trains on these extended trajectories using a hierarchical structure, efficiently handling tasks across multiple temporal scales. Additionally, we introduce Adaptive Plan Pondering and the Recursive HM-Diffuser, which consolidate hierarchical layers into a single model to process temporal scales recursively. Experimental results demonstrate the effectiveness of our approach, advancing diffusion-based planners for scalable long-horizon planning.

[LG-32] Fundamental Limits of Perfect Concept Erasure AISTATS2025

链接: https://arxiv.org/abs/2503.20098
作者: Somnath Basu Roy Chowdhury,Avinava Dubey,Ahmad Beirami,Rahul Kidambi,Nicholas Monath,Amr Ahmed,Snigdha Chaturvedi
类目: Machine Learning (cs.LG)
*备注: Accepted at AISTATS 2025

点击查看摘要

Abstract:Concept erasure is the task of erasing information about a concept (e.g., gender or race) from a representation set while retaining the maximum possible utility – information from original representations. Concept erasure is useful in several applications, such as removing sensitive concepts to achieve fairness and interpreting the impact of specific concepts on a model’s performance. Previous concept erasure techniques have prioritized robustly erasing concepts over retaining the utility of the resultant representations. However, there seems to be an inherent tradeoff between erasure and retaining utility, making it unclear how to achieve perfect concept erasure while maintaining high utility. In this paper, we offer a fresh perspective toward solving this problem by quantifying the fundamental limits of concept erasure through an information-theoretic lens. Using these results, we investigate constraints on the data distribution and the erasure functions required to achieve the limits of perfect concept erasure. Empirically, we show that the derived erasure functions achieve the optimal theoretical bounds. Additionally, we show that our approach outperforms existing methods on a range of synthetic and real-world datasets using GPT-4 representations.

[LG-33] Random feature-based double Vovk-Azoury-Warmuth algorithm for online multi-kernel learning

链接: https://arxiv.org/abs/2503.20087
作者: Dmitry B. Rokhlin,Olga V. Gurtovaya
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We introduce a novel multi-kernel learning algorithm, VAW\(^2\), for online least squares regression in reproducing kernel Hilbert spaces (RKHS). VAW\(^2\) leverages random Fourier feature-based functional approximation and the Vovk-Azoury-Warmuth (VAW) method in a two-level procedure: VAW is used to construct expert strategies from random features generated for each kernel at the first level, and then again to combine their predictions at the second level. A theoretical analysis yields a regret bound of \( O(T^{1/2} \ln T) \) in expectation with respect to artificial randomness, when the number of random features scales as \( T^{1/2} \). Empirical results on some benchmark datasets demonstrate that VAW\(^2\) achieves superior performance compared to the existing online multi-kernel learning algorithms, Raker and OMKL-GF, and to other theoretically grounded methods involving convex combination of expert predictions at the second level.
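
The first level of such a procedure can be sketched for a single kernel as follows. This is an illustrative sketch, not the authors' implementation: it uses the standard VAW forecaster on random Fourier features of a Gaussian kernel, omits the second-level combination entirely, and the class name `RFFVAW`, the hyperparameters, and the toy data stream are all assumptions.

```python
import numpy as np

class RFFVAW:
    """Online least squares: VAW forecaster on random Fourier features."""

    def __init__(self, dim, n_features=50, sigma=1.0, reg=1.0, seed=0):
        rng = np.random.default_rng(seed)
        # Frequencies and phases for a Gaussian kernel with bandwidth sigma.
        self.omega = rng.normal(scale=1.0 / sigma, size=(dim, n_features))
        self.phase = rng.uniform(0.0, 2.0 * np.pi, size=n_features)
        self.scale = np.sqrt(2.0 / n_features)
        self.A = reg * np.eye(n_features)  # reg * I + sum of z z^T
        self.b = np.zeros(n_features)      # sum of y * z over past rounds

    def features(self, x):
        return self.scale * np.cos(x @ self.omega + self.phase)

    def predict(self, x):
        """Call once per round: VAW adds the current input to A before predicting."""
        z = self.features(x)
        self.A += np.outer(z, z)
        self._z = z
        return z @ np.linalg.solve(self.A, self.b)

    def update(self, y):
        """Call after predict, once the label of the current round is revealed."""
        self.b += y * self._z

# Toy stream (hypothetical): y = sin(x) with x uniform on [-3, 3].
rng = np.random.default_rng(1)
model = RFFVAW(dim=1, n_features=50, sigma=1.0, reg=1.0, seed=0)
sq_errors = []
for _ in range(400):
    x = rng.uniform(-3.0, 3.0, size=1)
    y = np.sin(x[0])
    y_hat = model.predict(x)
    model.update(y)
    sq_errors.append((y - y_hat) ** 2)

early, late = np.mean(sq_errors[:200]), np.mean(sq_errors[200:])
```

In a two-level scheme of the kind the abstract describes, one such learner per kernel would act as an expert, and a second forecaster would combine their predictions.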

[LG-34] Peer Disambiguation in Self-Reported Surveys using Graph Attention Networks

链接: https://arxiv.org/abs/2503.20076
作者: Ajitesh Srivastava,Aryan Shetty,Eric Rice
类目: Social and Information Networks (cs.SI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Studying peer relationships is crucial in solving complex challenges underserved communities face and designing interventions. The effectiveness of such peer-based interventions relies on accurate network data regarding individual attributes and social influences. However, these datasets are often collected through self-reported surveys, introducing ambiguities in network construction. These ambiguities make it challenging to fully utilize the network data to understand the issues and to design the best interventions. We propose and solve two variations of link ambiguities in such network data – (i) which among the two candidate links exists, and (ii) if a candidate link exists. We design a Graph Attention Network (GAT) that accounts for personal attributes and network relationships on real-world data with real and simulated ambiguities. We also demonstrate that by resolving these ambiguities, we improve network accuracy, and in turn, improve suicide risk prediction. We also uncover patterns using GNNExplainer to provide additional insights into vital features and relationships. This research demonstrates the potential of Graph Neural Networks (GNN) to advance real-world network data analysis facilitating more effective peer interventions across various fields.

[LG-35] Deep Learning Approaches for Blood Disease Diagnosis Across Hematopoietic Lineages

Link: https://arxiv.org/abs/2503.20049
Authors: Gabriel Bo, Justin Gu, Christopher Sun
Categories: Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
Comments: 6 pages, 4 figures

Abstract:We present a foundation modeling framework that leverages deep learning to uncover latent genetic signatures across the hematopoietic hierarchy. Our approach trains a fully connected autoencoder on multipotent progenitor cells, reducing over 20,000 gene features to a 256-dimensional latent space that captures predictive information for both progenitor and downstream differentiated cells such as monocytes and lymphocytes. We validate the quality of these embeddings by training feed-forward, transformer, and graph convolutional architectures for blood disease diagnosis tasks. We also explore zero-shot prediction using a progenitor disease state classification model to classify downstream cell conditions. Our models achieve greater than 95% accuracy for multi-class classification, and in the zero-shot setting, we achieve greater than 0.7 F1-score on the binary classification task. Future work should improve embeddings further to increase robustness on lymphocyte classification specifically.

[LG-36] Continual Learning With Quasi-Newton Methods

Link: https://arxiv.org/abs/2503.19939
Authors: Steven Vander Eeckt, Hugo Van hamme
Categories: Machine Learning (cs.LG); Image and Video Processing (eess.IV)
Comments: Published in IEEE Access

Abstract:Catastrophic forgetting remains a major challenge when neural networks learn tasks sequentially. Elastic Weight Consolidation (EWC) attempts to address this problem by introducing a Bayesian-inspired regularization loss to preserve knowledge of previously learned tasks. However, EWC relies on a Laplace approximation where the Hessian is simplified to the diagonal of the Fisher information matrix, assuming uncorrelated model parameters. This overly simplistic assumption often leads to poor Hessian estimates, limiting its effectiveness. To overcome this limitation, we introduce Continual Learning with Sampled Quasi-Newton (CSQN), which leverages Quasi-Newton methods to compute more accurate Hessian approximations. CSQN captures parameter interactions beyond the diagonal without requiring architecture-specific modifications, making it applicable across diverse tasks and architectures. Experimental results across four benchmarks demonstrate that CSQN consistently outperforms EWC and other state-of-the-art baselines, including rehearsal-based methods. CSQN reduces EWC’s forgetting by 50 percent and improves its performance by 8 percent on average. Notably, CSQN achieves superior results on three out of four benchmarks, including the most challenging scenarios, highlighting its potential as a robust solution for continual learning.
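The limitation of the diagonal approximation can be seen on a toy example. The sketch below compares a quadratic penalty that uses a full curvature matrix with the EWC-style penalty that keeps only the diagonal. The matrix values are made up for illustration and the function names are hypothetical. It does not implement CSQN, only the contrast the abstract describes.

```python
import numpy as np

def quadratic_penalty(theta, theta_star, hessian):
    """Laplace-style penalty 0.5 * (theta - theta*)^T H (theta - theta*)."""
    delta = theta - theta_star
    return 0.5 * delta @ hessian @ delta

def ewc_penalty(theta, theta_star, fisher_diag):
    """EWC-style penalty: the same quadratic form with H replaced by diag(F)."""
    delta = theta - theta_star
    return 0.5 * np.sum(fisher_diag * delta ** 2)

# Toy curvature with strongly correlated parameters (values are made up).
H = np.array([[1.0, 0.9],
              [0.9, 1.0]])
theta_star = np.zeros(2)
move_along_flat = np.array([1.0, -1.0])   # old-task loss barely changes this way
move_along_steep = np.array([1.0, 1.0])   # old-task loss changes quickly this way

full_flat = quadratic_penalty(move_along_flat, theta_star, H)
full_steep = quadratic_penalty(move_along_steep, theta_star, H)
diag_flat = ewc_penalty(move_along_flat, theta_star, np.diag(H))
diag_steep = ewc_penalty(move_along_steep, theta_star, np.diag(H))
```

With the full matrix the two moves are penalised very differently, while the diagonal version assigns them the same cost, which is the information lost when parameter correlations are ignored.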

[LG-37] Unifying Structural Proximity and Equivalence for Enhanced Dynamic Network Embedding

Link: https://arxiv.org/abs/2503.19926
Authors: Suchanuch Piriyasatit, Chaohao Yuan, Ercan Engin Kuruoglu
Categories: Social and Information Networks (cs.SI); Machine Learning (cs.LG)

Abstract:Dynamic network embedding methods transform nodes in a dynamic network into low-dimensional vectors while preserving network characteristics, facilitating tasks such as node classification and community detection. Several embedding methods have been proposed to capture structural proximity among nodes in a network, where densely connected communities are preserved, while others have been proposed to preserve structural equivalence among nodes, capturing their structural roles regardless of their relative distance in the network. However, most existing methods that aim to preserve both network characteristics mainly focus on static networks and those designed for dynamic networks do not explicitly account for inter-snapshot structural properties. This paper proposes a novel unifying dynamic network embedding method that simultaneously preserves both structural proximity and equivalence while considering inter-snapshot structural relationships in a dynamic network. Specifically, to define structural equivalence in a dynamic network, we use temporal subgraphs, known as dynamic graphlets, to capture how a node’s neighborhood structure evolves over time. We then introduce a temporal-structural random walk to flexibly sample time-respecting sequences of nodes, considering both their temporal proximity and similarity in evolving structures. The proposed method is evaluated using five real-world networks on node classification where it outperforms benchmark methods, showing its effectiveness and flexibility in capturing various aspects of a network.

[LG-38] Continual learning via probabilistic exchangeable sequence modelling

Link: https://arxiv.org/abs/2503.20725
Authors: Hanwen Xing, Christopher Yau
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)

Abstract:Continual learning (CL) refers to the ability to continuously learn and accumulate new knowledge while retaining useful information from past experiences. Although numerous CL methods have been proposed in recent years, it is not straightforward to deploy them directly to real-world decision-making problems due to their computational cost and lack of uncertainty quantification. To address these issues, we propose CL-BRUNO, a probabilistic, Neural Process-based CL model that performs scalable and tractable Bayesian update and prediction. Our proposed approach uses deep-generative models to create a unified probabilistic framework capable of handling different types of CL problems such as task- and class-incremental learning, allowing users to integrate information across different CL scenarios using a single model. Our approach is able to prevent catastrophic forgetting through distributional and functional regularisation without the need of retaining any previously seen samples, making it appealing to applications where data privacy or storage capacity is of concern. Experiments show that CL-BRUNO outperforms existing methods on both natural image and biomedical data sets, confirming its effectiveness in real-world applications.

[LG-39] Asset price movement prediction using empirical mode decomposition and Gaussian mixture models

Link: https://arxiv.org/abs/2503.20678
Authors: Gabriel R. Palma, Mariusz Skoczeń, Phil Maguire
Categories: Methodology (stat.ME); Machine Learning (cs.LG)
Comments: 21 pages

Abstract:We investigated the use of Empirical Mode Decomposition (EMD) combined with Gaussian Mixture Models (GMM), feature engineering and machine learning algorithms to optimize trading decisions. We used five, two, and one year samples of hourly candle data for GameStop, Tesla, and XRP (Ripple) markets respectively. Applying a 15 hour rolling window for each market, we collected several features based on a linear model and other classical features to predict the next hour’s movement. Subsequently, a GMM filtering approach was used to identify clusters among these markets. For each cluster, we applied the EMD algorithm to extract high, medium, low and trend components from each feature collected. A simple thresholding algorithm was applied to classify market movements based on the percentage change in each market’s close price. We then evaluated the performance of various machine learning models, including Random Forests (RF) and XGBoost, in classifying market movements. A naive random selection of trading decisions was used as a benchmark, which assumed equal probabilities for each outcome, and a temporal cross-validation approach was used to test models on 40%, 30%, and 20% of the dataset. Our results indicate that transforming selected features using EMD improves performance, particularly for ensemble learning algorithms like Random Forest and XGBoost, as measured by accumulated profit. Finally, GMM filtering expanded the range of learning algorithm and data source combinations that outperformed the top percentile of the random baseline.

[LG-40] Regression-Based Estimation of Causal Effects in the Presence of Selection Bias and Confounding

Link: https://arxiv.org/abs/2503.20546
Authors: Marlies Hafer, Alexander Marx
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Methodology (stat.ME)
Comments: 13 pages plus appendix

Abstract: We consider the problem of estimating the expected causal effect E[Y|do(X)] for a target variable Y when treatment X is set by intervention, focusing on continuous random variables. In settings without selection bias or confounding, E[Y|do(X)] = E[Y|X] , which can be estimated using standard regression methods. However, regression fails when systematic missingness induced by selection bias, or confounding distorts the data. Boeken et al. [2023] show that when training data is subject to selection, proxy variables unaffected by this process can, under certain constraints, be used to correct for selection bias to estimate E[Y|X] , and hence E[Y|do(X)] , reliably. When data is additionally affected by confounding, however, this equality is no longer valid. Building on these results, we consider a more general setting and propose a framework that incorporates both selection bias and confounding. Specifically, we derive theoretical conditions ensuring identifiability and recoverability of causal effects under access to external data and proxy variables. We further introduce a two-step regression estimator (TSR), capable of exploiting proxy variables to adjust for selection bias while accounting for confounding. We show that TSR coincides with prior work if confounding is absent, but achieves a lower variance. Extensive simulation studies validate TSR’s correctness for scenarios which may include both selection bias and confounding with proxy variables.
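The claim that plain regression recovers E[Y|do(X)] only without confounding can be checked with a small simulation. The sketch below is a toy linear model with made-up coefficients (a true causal slope of 2 and a hypothetical confounder Z). It does not implement the TSR estimator and does not model selection bias.

```python
import numpy as np

def ols_slope(x, y):
    """Slope of the least-squares line y ~ a + b*x."""
    x_centered = x - x.mean()
    return float(x_centered @ (y - y.mean()) / (x_centered @ x_centered))

def simulate(confounded, n=20000, seed=0):
    """Regress Y on X in a toy model whose true causal slope is 2.0."""
    rng = np.random.default_rng(seed)
    z = rng.normal(size=n)  # hypothetical confounder
    x = (z if confounded else 0.0) + rng.normal(size=n)
    y = 2.0 * x + (3.0 * z if confounded else 0.0) + rng.normal(size=n)
    return ols_slope(x, y)

slope_clean = simulate(confounded=False)       # close to the causal slope 2.0
slope_confounded = simulate(confounded=True)   # close to 2.0 + 3.0 * 0.5 = 3.5
```

In the confounded case the regression slope converges to 2 + 3 Cov(X, Z) / Var(X) = 3.5 rather than 2, so E[Y|X] no longer equals E[Y|do(X)].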

[LG-41] Fast, Modular, and Differentiable Framework for Machine Learning-Enhanced Molecular Simulations

Link: https://arxiv.org/abs/2503.20541
Authors: Henrik Christiansen, Takashi Maruyama, Federico Errica, Viktor Zaverkin, Makoto Takamoto, Francesco Alesiani
Categories: Computational Physics (physics.comp-ph); Statistical Mechanics (cond-mat.stat-mech); Machine Learning (cs.LG)

Abstract: We present an end-to-end differentiable molecular simulation framework (DIMOS) for molecular dynamics and Monte Carlo simulations. DIMOS easily integrates machine-learning-based interatomic potentials and implements classical force fields including particle-mesh Ewald electrostatics. Thanks to its modularity, both classical and machine-learning-based approaches can be easily combined into a hybrid description of the system (ML/MM). By supporting key molecular dynamics features such as efficient neighborlists and constraint algorithms for larger time steps, the framework bridges the gap between hand-optimized simulation engines and the flexibility of a PyTorch implementation. The superior performance and the high versatility are probed in different benchmarks and applications, with speed-up factors of up to $170\times$. The advantage of differentiability is demonstrated by an end-to-end optimization of the proposal distribution in a Markov Chain Monte Carlo simulation based on Hamiltonian Monte Carlo. Using these optimized simulation parameters, a $3\times$ acceleration is observed in comparison to ad-hoc chosen simulation parameters. The code is available at this https URL.

[LG-42] Data-driven Seasonal Climate Predictions via Variational Inference and Transformers

Link: https://arxiv.org/abs/2503.20466
Authors: Lluís Palma, Alejandro Peraza, David Civantos, Amanda Duarte, Stefano Materia, Ángel G. Muñoz, Jesús Peña, Laia Romero, Albert Soret, Markus G. Donat
Categories: Atmospheric and Oceanic Physics (physics.ao-ph); Machine Learning (cs.LG); Machine Learning (stat.ML)

Abstract:Most operational climate services providers base their seasonal predictions on initialised general circulation models (GCMs) or statistical techniques that fit past observations. GCMs require substantial computational resources, which limits their capacity. In contrast, statistical methods often lack robustness due to short historical records. Recent works propose machine learning methods trained on climate model output, leveraging larger sample sizes and simulated scenarios. Yet, many of these studies focus on prediction tasks that might be restricted in spatial extent or temporal coverage, opening a gap with existing operational predictions. Thus, the present study evaluates the effectiveness of a methodology that combines variational inference with transformer models to predict fields of seasonal anomalies. The predictions cover all four seasons and are initialised one month before the start of each season. The model was trained on climate model output from CMIP6 and tested using ERA5 reanalysis data. We analyse the method’s performance in predicting interannual anomalies beyond the climate change-induced trend. We also test the proposed methodology in a regional context with a use case focused on Europe. While climate change trends dominate the skill of temperature predictions, the method presents additional skill over the climatological forecast in regions influenced by known teleconnections. We reach similar conclusions based on the validation of precipitation predictions. Despite underperforming SEAS5 in most tropics, our model offers added value in numerous extratropical inland regions. This work demonstrates the effectiveness of training generative models on climate model output for seasonal predictions, providing skilful predictions beyond the induced climate change trend at time scales and lead times relevant for user applications.

[LG-43] Learning Data-Driven Uncertainty Set Partitions for Robust and Adaptive Energy Forecasting with Missing Data

Link: https://arxiv.org/abs/2503.20410
Authors: Akylas Stratigakos, Panagiotis Andrianesis
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments: Submitted to IEEE-TSG

Abstract: Short-term forecasting models typically assume the availability of input data (features) when they are deployed and in use. However, equipment failures, disruptions, or cyberattacks may lead to missing features when such models are used operationally, which could negatively affect forecast accuracy and result in suboptimal operational decisions. In this paper, we use adaptive robust optimization and adversarial machine learning to develop forecasting models that seamlessly handle missing data operationally. We propose linear- and neural network-based forecasting models with parameters that adapt to available features, combining linear adaptation with a novel algorithm for learning data-driven uncertainty set partitions. The proposed adaptive models do not rely on identifying historical missing data patterns and are suitable for real-time operations under stringent time constraints. Extensive numerical experiments on short-term wind power forecasting, considering horizons from 15 minutes to 4 hours ahead, illustrate that our proposed adaptive models are on par with imputation when data are missing for very short periods (e.g., when only the latest measurement is missing), whereas they significantly outperform imputation when data are missing for longer periods. We further provide insights by showcasing how linear adaptation and data-driven partitions (even with a few subsets) approach the performance of the optimal, yet impractical, method of retraining for every possible realization of missing data.

[LG-44] Comparative analysis and evaluation of ageing forecasting methods for semiconductor devices in online health monitoring

Link: https://arxiv.org/abs/2503.20403
Authors: Adrian Villalobos, Iban Barrutia, Rafael Pena-Alzola, Tomislav Dragicevic, Jose I. Aizpurua
Categories: Signal Processing (eess.SP); Machine Learning (cs.LG)
Comments: 28 pages, 12 figures, published

Abstract: Semiconductor devices, especially metal-oxide-semiconductor field-effect transistors (MOSFETs), are crucial in power electronics, but their reliability is affected by aging processes influenced by cycling and temperature. The primary aging mechanism in discrete semiconductors and power modules is bond wire lift-off, caused by crack growth due to thermal fatigue. The process is empirically characterized by exponential growth and an abrupt end of life, making long-term aging forecasts challenging. This research presents a comprehensive comparative assessment of different forecasting methods for MOSFET failure forecasting applications. Classical tracking, statistical forecasting and Neural Network (NN) based forecasting models are implemented along with novel Temporal Fusion Transformers (TFTs). A comprehensive comparison is performed assessing their MOSFET ageing forecasting ability for different forecasting horizons. For short-term predictions, all algorithms produce acceptable results, with the best results produced by classical NN forecasting models at the expense of higher computational cost. For long-term forecasting, only the TFT is able to produce valid outcomes owing to its ability to integrate covariates from the expected future conditions. Additionally, TFT attention points identify key ageing turning points, which indicate new failure modes or accelerated ageing phases.

[LG-45] The cell as a token: high-dimensional geometry in language models and cell embeddings

Link: https://arxiv.org/abs/2503.20278
Authors: William Gilpin
Categories: Quantitative Methods (q-bio.QM); Machine Learning (cs.LG)
Comments: 4 pages, 2 figures

Abstract:Single-cell sequencing technology maps cells to a high-dimensional space encoding their internal activity. This process mirrors parallel developments in machine learning, where large language models ingest unstructured text by converting words into discrete tokens embedded within a high-dimensional vector space. This perspective explores how advances in understanding the structure of language embeddings can inform ongoing efforts to analyze and visualize single cell datasets. We discuss how the context of tokens influences the geometry of embedding space, and the role of low-dimensional manifolds in shaping this space’s robustness and interpretability. We highlight new developments in language modeling, such as interpretability probes and in-context reasoning, that can inform future efforts to construct and consolidate cell atlases.

[LG-46] An (ε, δ)-accurate level set estimation with a stopping criterion

Link: https://arxiv.org/abs/2503.20272
Authors: Hideaki Ishibashi, Kota Matsui, Kentaro Kutsukake, Hideitsu Hino
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)

Abstract: The level set estimation problem seeks to identify regions within a set of candidate points where an unknown and costly to evaluate function’s value exceeds a specified threshold, providing an efficient alternative to exhaustive evaluations of function values. Traditional methods often use sequential optimization strategies to find $\epsilon$-accurate solutions, which permit a margin around the threshold contour but frequently lack effective stopping criteria, leading to excessive exploration and inefficiencies. This paper introduces an acquisition strategy for level set estimation that incorporates a stopping criterion, ensuring the algorithm halts when further exploration is unlikely to yield improvements, thereby reducing unnecessary function evaluations. We theoretically prove that our method satisfies $\epsilon$-accuracy with a confidence level of $1 - \delta$, addressing a key gap in existing approaches. Furthermore, we show that this also leads to guarantees on the lower bounds of performance metrics such as F-score. Numerical experiments demonstrate that the proposed acquisition function achieves comparable precision to existing methods while confirming that the stopping criterion effectively terminates the algorithm once adequate exploration is completed.

[LG-47] RxRx3-core: Benchmarking drug-target interactions in High-Content Microscopy ICLR2025

Link: https://arxiv.org/abs/2503.20158
Authors: Oren Kraus, Federico Comitani, John Urbanik, Kian Kenyon-Dean, Lakshmanan Arumugam, Saber Saberian, Cas Wognum, Safiye Celik, Imran S. Haque
Categories: Quantitative Methods (q-bio.QM); Machine Learning (cs.LG); Cell Behavior (q-bio.CB)
Comments: Published at LMRL Workshop at ICLR 2025

Abstract:High Content Screening (HCS) microscopy datasets have transformed the ability to profile cellular responses to genetic and chemical perturbations, enabling cell-based inference of drug-target interactions (DTI). However, the adoption of representation learning methods for HCS data has been hindered by the lack of accessible datasets and robust benchmarks. To address this gap, we present RxRx3-core, a curated and compressed subset of the RxRx3 dataset, and an associated DTI benchmarking task. At just 18GB, RxRx3-core significantly reduces the size barrier associated with large-scale HCS datasets while preserving critical data necessary for benchmarking representation learning models against a zero-shot DTI prediction task. RxRx3-core includes 222,601 microscopy images spanning 736 CRISPR knockouts and 1,674 compounds at 8 concentrations. RxRx3-core is available on HuggingFace and Polaris, along with pre-trained embeddings and benchmarking code, ensuring accessibility for the research community. By providing a compact dataset and robust benchmarks, we aim to accelerate innovation in representation learning methods for HCS data and support the discovery of novel biological insights.

[LG-48] On the Robustness of Kernel Ridge Regression Using the Cauchy Loss Function

Link: https://arxiv.org/abs/2503.20120
Authors: Hongwei Wen, Annika Betken, Wouter Koolen
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)

Abstract: Robust regression aims to develop methods for estimating an unknown regression function in the presence of outliers, heavy-tailed distributions, or contaminated data, which can severely impact performance. Most existing theoretical results in robust regression assume that the noise has a finite absolute mean, an assumption violated by certain distributions, such as Cauchy and some Pareto noise. In this paper, we introduce a generalized Cauchy noise framework that accommodates all noise distributions with finite moments of any order, even when the absolute mean is infinite. Within this framework, we study the kernel Cauchy ridge regressor (KCRR), which minimizes a regularized empirical Cauchy risk to achieve robustness. To derive the $L_2$-risk bound for KCRR, we establish a connection between the excess Cauchy risk and $L_2$-risk for sufficiently large scale parameters of the Cauchy loss, which reveals that these two risks are equivalent. Furthermore, under the assumption that the regression function satisfies Hölder smoothness, we derive excess Cauchy risk bounds for KCRR, showing improved performance as the scale parameter decreases. By considering the twofold effect of the scale parameter on the excess Cauchy risk and its equivalence with the $L_2$-risk, we establish the almost minimax-optimal convergence rate for KCRR in terms of $L_2$-risk, highlighting the robustness of the Cauchy loss in handling various types of noise. Finally, we validate the effectiveness of KCRR through experiments on both synthetic and real-world datasets under diverse noise corruption scenarios.
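The twofold role of the scale parameter can be illustrated numerically. The sketch below uses one common parameterisation of the Cauchy loss, (σ²/2) log(1 + r²/σ²). Whether the paper normalises the loss in exactly this way is an assumption, and the function names are hypothetical.

```python
import numpy as np

def cauchy_loss(residual, scale):
    """Cauchy loss, assumed form: (scale^2 / 2) * log(1 + (r / scale)^2)."""
    r = np.asarray(residual, dtype=float)
    return 0.5 * scale ** 2 * np.log1p((r / scale) ** 2)

def squared_loss(residual):
    """Half squared error, the large-scale limit of the Cauchy loss above."""
    r = np.asarray(residual, dtype=float)
    return 0.5 * r ** 2

small_residual_large_scale = cauchy_loss(0.5, scale=100.0)  # nearly 0.5 * 0.5**2
outlier_small_scale = cauchy_loss(1000.0, scale=1.0)        # grows only logarithmically
outlier_squared = squared_loss(1000.0)                      # grows quadratically
```

For a large scale the loss is close to the squared loss, in line with the equivalence stated for sufficiently large scale parameters, while a small scale damps the contribution of outliers.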

[LG-49] A scalable gene network model of regulatory dynamics in single cells

Link: https://arxiv.org/abs/2503.20027
Authors: Paul Bertin, Joseph D. Viviano, Alejandro Tejada-Lapuerta, Weixu Wang, Stefan Bauer, Fabian J. Theis, Yoshua Bengio
Categories: Molecular Networks (q-bio.MN); Machine Learning (cs.LG)
Comments: 42 pages, 10 figures

Abstract:Single-cell data provide high-dimensional measurements of the transcriptional states of cells, but extracting insights into the regulatory functions of genes, particularly identifying transcriptional mechanisms affected by biological perturbations, remains a challenge. Many perturbations induce compensatory cellular responses, making it difficult to distinguish direct from indirect effects on gene regulation. Modeling how gene regulatory functions shape the temporal dynamics of these responses is key to improving our understanding of biological perturbations. Dynamical models based on differential equations offer a principled way to capture transcriptional dynamics, but their application to single-cell data has been hindered by computational constraints, stochasticity, sparsity, and noise. Existing methods either rely on low-dimensional representations or make strong simplifying assumptions, limiting their ability to model transcriptional dynamics at scale. We introduce a Functional and Learnable model of Cell dynamicS, FLeCS, that incorporates gene network structure into coupled differential equations to model gene regulatory functions. Given (pseudo)time-series single-cell data, FLeCS accurately infers cell dynamics at scale, provides improved functional insights into transcriptional mechanisms perturbed by gene knockouts, both in myeloid differentiation and K562 Perturb-seq experiments, and simulates single-cell trajectories of A549 cells following small-molecule perturbations.

[LG-50] Automated Video-EEG Analysis in Epilepsy Studies: Advances and Challenges

Link: https://arxiv.org/abs/2503.19949
Authors: Valerii A. Zuev, Elena G. Salmagambetova, Stepan N. Djakov, Lev V. Utkin
Categories: Image and Video Processing (eess.IV); Machine Learning (cs.LG)

Abstract:Epilepsy is typically diagnosed through electroencephalography (EEG) and long-term video-EEG (vEEG) monitoring. The manual analysis of vEEG recordings is time-consuming, necessitating automated tools for seizure detection. Recent advancements in machine learning have shown promise in real-time seizure detection and prediction using EEG and video data. However, diversity of seizure symptoms, markup ambiguities, and limited availability of multimodal datasets hinder progress. This paper reviews the latest developments in automated video-EEG analysis and discusses the integration of multimodal data. We also propose a novel pipeline for treatment effect estimation from vEEG data using concept-based learning, offering a pathway for future research in this domain.

[LG-51] A stochastic gradient descent algorithm with random search directions

Link: https://arxiv.org/abs/2503.19942
Authors: Eméric Gbaguidi
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Optimization and Control (math.OC); Probability (math.PR)

Abstract: Stochastic coordinate descent algorithms are efficient methods in which each iterate is obtained by fixing most coordinates at their values from the current iteration, and approximately minimizing the objective with respect to the remaining coordinates. However, this approach is usually restricted to canonical basis vectors of $\mathbb{R}^d$. In this paper, we develop a new class of stochastic gradient descent algorithms with random search directions which uses the directional derivative of the gradient estimate following more general random vectors. We establish the almost sure convergence of these algorithms with decreasing step. We further investigate their central limit theorem and pay particular attention to analyzing the impact of the search distributions on the asymptotic covariance matrix. We also provide the non-asymptotic $\mathbb{L}^p$ rates of convergence.
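One way to read the update described above is as a step along a random direction, scaled by the inner product of the gradient estimate with that direction. The sketch below assumes this form, draws directions uniformly on the unit sphere, and uses a step size of (t + 1)^(-0.6). These choices and all names are illustrative assumptions, not the paper's exact scheme.

```python
import numpy as np

def sgd_random_directions(grad_estimate, theta0, num_steps=2000, seed=0):
    """SGD where each step moves only along one random unit direction.

    grad_estimate(theta, rng) must return a (possibly noisy) gradient at theta.
    """
    rng = np.random.default_rng(seed)
    theta = np.array(theta0, dtype=float)
    for t in range(num_steps):
        step = (t + 1) ** -0.6                  # decreasing step (assumed schedule)
        direction = rng.normal(size=theta.shape)
        direction /= np.linalg.norm(direction)  # uniform on the unit sphere
        g = grad_estimate(theta, rng)
        # Move along the direction, scaled by the directional derivative.
        theta -= step * (g @ direction) * direction
    return theta

# Toy objective 0.5 * ||theta - target||^2 with an exact gradient.
target = np.ones(5)
solution = sgd_random_directions(lambda theta, rng: theta - target, np.zeros(5))
```

Restricting `direction` to randomly chosen canonical basis vectors would turn the same loop into stochastic coordinate descent, the special case the abstract starts from.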

[LG-52] Accurate, provable, and fast nonlinear tomographic reconstruction: A variational inequality approach

Link: https://arxiv.org/abs/2503.19925
Authors: Mengqi Lou, Kabir Aladin Verchand, Sara Fridovich-Keil, Ashwin Pananjady
Categories: Image and Video Processing (eess.IV); Machine Learning (cs.LG); Optimization and Control (math.OC); Medical Physics (physics.med-ph)
Comments: 45 pages, 6 figures, code available: this https URL

Abstract:We consider the problem of signal reconstruction for computed tomography (CT) under a nonlinear forward model that accounts for exponential signal attenuation, a polychromatic X-ray source, general measurement noise (e.g. Poisson shot noise), and observations acquired over multiple wavelength windows. We develop a simple iterative algorithm for single-material reconstruction, which we call EXACT (EXtragradient Algorithm for Computed Tomography), based on formulating our estimate as the fixed point of a monotone variational inequality. We prove guarantees on the statistical and computational performance of EXACT under practical assumptions on the measurement process. We also consider a recently introduced variant of this model with Gaussian measurements, and present sample and iteration complexity bounds for EXACT that improve upon those of existing algorithms. We apply our EXACT algorithm to a CT phantom image recovery task and show that it often requires fewer X-ray projection exposures, lower source intensity, and less computation time to achieve similar reconstruction quality to existing methods.

[LG-53] Neural Learning Rules from Associative Networks Theory

Link: https://arxiv.org/abs/2503.19922
Authors: Daniele Lotito
Categories: Neurons and Cognition (q-bio.NC); Machine Learning (cs.LG)

Abstract:Associative networks theory is increasingly providing tools to interpret update rules of artificial neural networks. At the same time, deriving neural learning rules from a solid theory remains a fundamental challenge. We make some steps in this direction by considering general energy-based associative networks of continuous neurons and synapses that evolve in multiple time scales. We use the separation of these timescales to recover a limit in which the activation of the neurons, the energy of the system and the neural dynamics can all be recovered from a generating function. By allowing the generating function to depend on memories, we recover the conventional Hebbian modeling choice for the interaction strength between neurons. Finally, we propose and discuss a dynamics of memories that enables us to include learning in this framework.

Information Retrieval

[IR-0] RALLRec: Retrieval Augmented Large Language Model Recommendation with Reasoning

Link: https://arxiv.org/abs/2503.20430
Authors: Sichun Luo, Jian Xu, Xiaojie Zhang, Linrong Wang, Sicong Liu, Hanxu Hou, Linqi Song
Categories: Information Retrieval (cs.IR)
Comments: arXiv admin note: substantial text overlap with arXiv:2502.06101

Abstract: Large Language Models (LLMs) have been integrated into recommender systems to enhance user behavior comprehension. The Retrieval Augmented Generation (RAG) technique is further incorporated into these systems to retrieve more relevant items and improve system performance. However, existing RAG methods have two shortcomings. (i) In the retrieval stage, they rely primarily on textual semantics and often fail to incorporate the most relevant items, thus constraining system effectiveness. (ii) In the generation stage, they lack explicit chain-of-thought reasoning, further limiting their potential. In this paper, we propose Representation learning and Reasoning empowered retrieval-Augmented Large Language model Recommendation (RALLRec+). Specifically, for the retrieval stage, we prompt LLMs to generate detailed item descriptions and perform joint representation learning, combining textual and collaborative signals extracted from the LLM and recommendation models, respectively. To account for the time-varying nature of user interests, we propose a simple yet effective reranking method to capture preference dynamics. For the generation phase, we first evaluate reasoning LLMs on recommendation tasks, uncovering valuable insights. Then we introduce knowledge-injected prompting and a consistency-based merging approach to integrate reasoning LLMs with general-purpose LLMs, enhancing overall performance. Extensive experiments on three real-world datasets validate our method’s effectiveness.

[IR-1] Dewey Long Context Embedding Model: A Technical Report

Link: https://arxiv.org/abs/2503.20376
Authors: Dun Zhang, Panxiang Zou, Yudong Zhou
Categories: Information Retrieval (cs.IR)
Comments: 5 pages, 1 figure

Abstract: This technical report presents the training methodology and evaluation results of the open-source dewey_en_beta embedding model. The increasing demand for retrieval-augmented generation (RAG) systems and the expanding context window capabilities of large language models (LLMs) have created critical challenges for conventional embedding models. Current approaches often struggle to maintain semantic coherence when processing documents exceeding typical sequence length limitations, significantly impacting retrieval performance in knowledge-intensive applications. This paper presents dewey_en_beta, a novel text embedding model that achieves excellent performance on the MTEB (Eng, v2) and LongEmbed benchmarks while supporting 128K token sequences. Our technical contribution centers on chunk alignment training, an innovative methodology that enables the simultaneous generation of localized chunk embeddings and global document-level representations through distillation. Information regarding the model release can be found at this https URL.

[IR-2] Learnable Sequence Augmenter for Triplet Contrastive Learning in Sequential Recommendation

Link: https://arxiv.org/abs/2503.20232
Authors: Wei Wang, Yujie Lin, Jianli Zhao, Moyan Zhang, Pengjie Ren, Xianye Ben, Yujun Li
Categories: Information Retrieval (cs.IR)
Comments:

Abstract: Most existing contrastive learning-based sequential recommendation (SR) methods rely on random operations (e.g., crop, reorder, and substitute) to generate augmented sequences. These methods often struggle to create positive sample pairs that closely resemble the representations of the raw sequences, potentially disrupting item correlations by deleting key items or introducing noisy interactions, which misguides the contrastive learning process. To address this limitation, we propose Learnable sequence Augmentor for triplet Contrastive Learning in sequential Recommendation (LACLRec). Specifically, the self-supervised learning-based augmenter can automatically delete noisy items from sequences and insert new items that better capture item transition patterns, generating a higher-quality augmented sequence. Subsequently, we randomly generate another augmented sequence and design a ranking-based triplet contrastive loss to differentiate the similarities between the raw sequence, the augmented sequence from the augmenter, and the randomly augmented sequence, providing more fine-grained contrastive signals. Extensive experiments on three real-world datasets demonstrate that both the sequence augmenter and the triplet contrast contribute to improving recommendation accuracy. LACLRec significantly outperforms the baseline model CL4SRec, and demonstrates superior performance compared to several state-of-the-art sequential recommendation algorithms.
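
The abstract does not give the exact form of the ranking-based triplet contrastive loss, so the following is only a minimal sketch of one plausible reading: the raw sequence should be more similar to the learned augmentation than to the random one, by at least a margin. The function names, the use of cosine similarity, the hinge form, and the `margin` value are all assumptions made for illustration and are not taken from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def triplet_ranking_loss(raw, learned_aug, random_aug, margin=0.2):
    """Hypothetical hinge-style ranking loss over three sequence representations.

    Assumption (not from the paper): the loss is zero once
    sim(raw, learned_aug) exceeds sim(raw, random_aug) by at least `margin`.
    """
    sim_learned = cosine(raw, learned_aug)
    sim_random = cosine(raw, random_aug)
    return max(0.0, margin - sim_learned + sim_random)


# Toy usage with 2-d "sequence representations".
raw = [1.0, 0.0]
loss_good = triplet_ranking_loss(raw, learned_aug=[0.9, 0.1], random_aug=[0.1, 0.9])
loss_bad = triplet_ranking_loss(raw, learned_aug=[0.1, 0.9], random_aug=[0.9, 0.1])
```

In this toy setup `loss_good` is zero because the learned augmentation already ranks above the random one, while `loss_bad` is positive because the ordering is reversed.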

[IR-3] BeLightRec: A lightweight recommender system enhanced with BERT

Link: https://arxiv.org/abs/2503.20206
Authors: Manh Mai Van, Tin T. Tran
Categories: Information Retrieval (cs.IR)
Comments:

Abstract: Data mining with deep learning models built on graph neural networks has proven effective in identifying object features through signal encoders and decoders, particularly in recommendation systems that use collaborative filtering. Collaborative filtering exploits similarities between users and items from historical data. However, it overlooks distinctive information, such as item names and descriptions. The semantic data of items should be further mined using models from the natural language processing field, so that items can be compared using text classification, similarity assessments, or identification of analogous sentence pairs. This research proposes combining two sources of item similarity signals: one from collaborative filtering and one from the semantic similarity measure between item names and descriptions. These signals are integrated into a graph convolutional neural network to optimize model weights, thereby providing accurate recommendations. Experiments are also designed to evaluate the contribution of each signal group to the recommendation results.
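
To make the idea of two item-similarity signals concrete, here is a small sketch that blends a collaborative signal (cosine similarity between item interaction columns) with a semantic signal (cosine similarity between text embeddings of item names and descriptions). The paper integrates the signals through a graph convolutional network; the simple weighted blend, the function names, and the `alpha` parameter below are assumptions for illustration only, and the text vectors stand in for BERT-style embeddings computed elsewhere.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def blended_item_similarity(interactions, text_vectors, alpha=0.5):
    """Blend collaborative and semantic item-item similarity.

    interactions: dict item -> list of 0/1 flags, one per user (hypothetical layout).
    text_vectors: dict item -> embedding of the item's name/description.
    alpha: hypothetical weight on the collaborative signal; (1 - alpha) goes
           to the semantic signal. This blend is an assumption, not the
           paper's graph-convolution integration.
    """
    items = sorted(interactions)
    similarity = {}
    for i in items:
        for j in items:
            if i == j:
                continue
            collaborative = cosine(interactions[i], interactions[j])
            semantic = cosine(text_vectors[i], text_vectors[j])
            similarity[(i, j)] = alpha * collaborative + (1 - alpha) * semantic
    return similarity


# Toy usage: "a" and "b" share users, "a" and "c" share text semantics.
interactions = {"a": [1, 1, 0], "b": [1, 1, 0], "c": [0, 0, 1]}
text_vectors = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 0.0]}
sim = blended_item_similarity(interactions, text_vectors, alpha=0.5)
```

With equal weights, the pairs ("a", "b") and ("a", "c") receive the same blended score even though each is supported by only one of the two signals, which is the kind of complementary evidence the abstract argues collaborative filtering alone would miss.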