This blog post presents the latest paper list retrieved from Arxiv.org on 2025-03-31. It is updated automatically and organized into five broad areas: NLP, CV, ML, AI, and IR. If you would like to receive it by scheduled email, please leave your email address in the comments section.

Note: The daily paper data is retrieved from Arxiv.org and updated automatically at around 12:00 each day.

Friendly reminder: If you would like to receive the daily paper data by email, please leave your email address in the comments.

Table of Contents

Overview (2025-03-31)

A total of 438 papers are updated today, including:

  • Natural Language Processing: 64 papers (Computation and Language (cs.CL))
  • Artificial Intelligence: 107 papers (Artificial Intelligence (cs.AI))
  • Computer Vision: 145 papers (Computer Vision and Pattern Recognition (cs.CV))
  • Machine Learning: 117 papers (Machine Learning (cs.LG))

Natural Language Processing

[NLP-0] Self-Evolving Multi-Agent Simulations for Realistic Clinical Interactions

【Quick Read】: This paper addresses the insufficient performance of large language models (LLMs) in dynamic diagnostic settings. To tackle this challenge, it proposes MedAgentSim, an open-source simulated clinical environment with doctor, patient, and measurement agents. The key to the solution is that doctor agents actively engage with patients through multi-turn conversations, requesting relevant medical examinations (e.g., temperature, blood pressure, ECG) and imaging results (e.g., MRI, X-ray) to mimic the real-world diagnostic process. In addition, MedAgentSim incorporates self-improvement mechanisms that allow models to iteratively refine their diagnostic strategies. Progressive learning is facilitated by integrating multi-agent discussions, chain-of-thought reasoning, and experience-based knowledge retrieval. The paper also designs an evaluation benchmark for measuring LLMs' ability to engage in dynamic, context-aware diagnostic interactions. This automated framework, which also supports human intervention, comprehensively demonstrates the effectiveness of the proposed approach.

Link: https://arxiv.org/abs/2503.22678
Authors: Mohammad Almansoori,Komal Kumar,Hisham Cholakkal
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: 14 pages, 4 figures, 61 references

Click to view the abstract

Abstract:In this work, we introduce MedAgentSim, an open-source simulated clinical environment with doctor, patient, and measurement agents designed to evaluate and enhance LLM performance in dynamic diagnostic settings. Unlike prior approaches, our framework requires doctor agents to actively engage with patients through multi-turn conversations, requesting relevant medical examinations (e.g., temperature, blood pressure, ECG) and imaging results (e.g., MRI, X-ray) from a measurement agent to mimic the real-world diagnostic process. Additionally, we incorporate self improvement mechanisms that allow models to iteratively refine their diagnostic strategies. We enhance LLM performance in our simulated setting by integrating multi-agent discussions, chain-of-thought reasoning, and experience-based knowledge retrieval, facilitating progressive learning as doctor agents interact with more patients. We also introduce an evaluation benchmark for assessing the LLM's ability to engage in dynamic, context-aware diagnostic interactions. While MedAgentSim is fully automated, it also supports a user-controlled mode, enabling human interaction with either the doctor or patient agent. Comprehensive evaluations in various simulated diagnostic scenarios demonstrate the effectiveness of our approach. Our code, simulation tool, and benchmark are available at this https URL.
zh

[NLP-1] Think Before Recommend: Unleashing the Latent Reasoning Power for Sequential Recommendation

【Quick Read】: This paper addresses the limitations of existing sequential recommendation (SeqRec) methods, whose direct forward-computation paradigm restricts their ability to model user preferences and to understand long-tail items. These methods take the final hidden state of the sequence encoder as the user representation, but their limited computational depth makes it difficult to capture the complex, evolving nature of user preferences. To address this, the paper proposes ReaRec, the first inference-time computing framework for recommender systems. ReaRec's key innovation is enhancing user representations through implicit multi-step reasoning: the last hidden state of the sequence is fed autoregressively back into the sequential recommender, together with special reasoning position embeddings that decouple the original item encoding space from the multi-step reasoning space. Two lightweight reasoning-based learning methods, Ensemble Reasoning Learning (ERL) and Progressive Reasoning Learning (PRL), are further introduced to exploit ReaRec's reasoning potential. Experiments show that ReaRec significantly improves recommendation performance across multiple public real-world datasets and different SeqRec architectures, raising the performance ceiling of various sequential recommendation backbones by roughly 30%-50%. This work thus opens a new direction for research on inference-time computing in sequential recommendation.

Link: https://arxiv.org/abs/2503.22675
Authors: Jiakai Tang,Sunhao Dai,Teng Shi,Jun Xu,Xu Chen,Wen Chen,Wu Jian,Yuning Jiang
Affiliations: Gaoling School of Artificial Intelligence, Renmin University of China; Alibaba Group, Beijing, China
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Click to view the abstract

Abstract:Sequential Recommendation (SeqRec) aims to predict the next item by capturing sequential patterns from users' historical interactions, playing a crucial role in many real-world recommender systems. However, existing approaches predominantly adopt a direct forward computation paradigm, where the final hidden state of the sequence encoder serves as the user representation. We argue that this inference paradigm, due to its limited computational depth, struggles to model the complex evolving nature of user preferences and lacks a nuanced understanding of long-tail items, leading to suboptimal performance. To address this issue, we propose ReaRec, the first inference-time computing framework for recommender systems, which enhances user representations through implicit multi-step reasoning. Specifically, ReaRec autoregressively feeds the sequence's last hidden state into the sequential recommender while incorporating special reasoning position embeddings to decouple the original item encoding space from the multi-step reasoning space. Moreover, we introduce two lightweight reasoning-based learning methods, Ensemble Reasoning Learning (ERL) and Progressive Reasoning Learning (PRL), to further effectively exploit ReaRec's reasoning potential. Extensive experiments on five public real-world datasets and different SeqRec architectures demonstrate the generality and effectiveness of our proposed ReaRec. Remarkably, post-hoc analyses reveal that ReaRec significantly elevates the performance ceiling of multiple sequential recommendation backbones by approximately 30%-50%. Thus, we believe this work can open a new and promising avenue for future research in inference-time computing for sequential recommendation.
zh
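To make the implicit multi-step reasoning idea concrete, here is a minimal PyTorch-style sketch of the loop described above. It is not the authors' code: the module name, the number of reasoning steps, and the way the reasoning position embedding is added are illustrative assumptions based only on the abstract.

```python
# Minimal sketch of ReaRec-style implicit multi-step reasoning (illustrative only).
# Assumptions: `encoder` is any sequential recommender that maps (batch, len, dim)
# hidden inputs to (batch, len, dim) outputs; step count and dims are made up.
import torch
import torch.nn as nn

class ReasoningHead(nn.Module):
    def __init__(self, encoder: nn.Module, hidden_dim: int, num_reason_steps: int = 3):
        super().__init__()
        self.encoder = encoder
        # Separate position embeddings for the reasoning steps, so the reasoning
        # space is decoupled from the original item-encoding space.
        self.reason_pos = nn.Embedding(num_reason_steps, hidden_dim)
        self.num_reason_steps = num_reason_steps

    def forward(self, item_emb: torch.Tensor) -> list[torch.Tensor]:
        # item_emb: (batch, seq_len, hidden_dim) embeddings of the interaction history.
        states = []
        seq = item_emb
        for step in range(self.num_reason_steps):
            hidden = self.encoder(seq)           # (batch, len, hidden_dim)
            last = hidden[:, -1, :]              # last hidden state of the sequence
            states.append(last)                  # candidate user representation
            # Autoregressively append the last state, tagged with a reasoning
            # position embedding, and reason for one more step.
            next_tok = (last + self.reason_pos.weight[step]).unsqueeze(1)
            seq = torch.cat([seq, next_tok], dim=1)
        return states  # ERL/PRL-style objectives would supervise these states
```

In this reading, ERL and PRL would differ mainly in how the intermediate states returned by the loop are supervised and combined at training and inference time.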

[NLP-2] QuestBench: Can LLMs ask the right question to acquire information in reasoning tasks?

【Quick Read】: This paper addresses the insufficient information-acquisition ability of large language models (LLMs) when handling underspecified real-world tasks. It notes that existing work focuses mainly on improving LLM performance on well-defined reasoning benchmarks (such as math and logic), while neglecting the real-world situation in which queries are often incomplete and can only be solved by acquiring missing information. The authors formalize such problems as constraint satisfaction problems (CSPs) with missing variable assignments, and use the special case in which exactly one necessary variable assignment is missing to rigorously evaluate an LLM's ability to identify the minimal necessary question and to quantify the difficulty level of each problem.

The key contribution is QuestBench, a dataset of four types of underspecified reasoning tasks: Logic-Q (logical reasoning tasks with one missing proposition), Planning-Q (PDDL planning problems with partially observed initial states), GSM-Q (human-annotated grade-school math problems), and GSME-Q (a version of GSM-Q in which word problems are translated into equations by human annotators). The tasks are designed so that the LLM must select the correct clarification question from a list of candidates. While state-of-the-art models perform well on GSM-Q and GSME-Q, their accuracy is only 40%-50% on Logic-Q and Planning-Q. Further analysis shows that the ability to solve fully specified reasoning problems is not sufficient for success on this benchmark: even when models can solve the complete version of a problem, they still struggle to determine the right question to ask. Moreover, in the Planning-Q domain, LLMs tend to give a direct answer rather than choose "not sure", underscoring the need for deeper investigation into models' information-acquisition capabilities.

Link: https://arxiv.org/abs/2503.22674
Authors: Belinda Z. Li,Been Kim,Zi Wang
Affiliations: Massachusetts Institute of Technology; Google DeepMind
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Code and dataset are available at this https URL

Click to view the abstract

Abstract:Recently, a large amount of work has focused on improving large language models’ (LLMs’) performance on reasoning benchmarks such as math and logic. However, past work has largely assumed that tasks are well-defined. In the real world, queries to LLMs are often underspecified, only solvable through acquiring missing information. We formalize this as a constraint satisfaction problem (CSP) with missing variable assignments. Using a special case of this formalism where only one necessary variable assignment is missing, we can rigorously evaluate an LLM’s ability to identify the minimal necessary question to ask and quantify axes of difficulty levels for each problem. We present QuestBench, a set of underspecified reasoning tasks solvable by asking at most one question, which includes: (1) Logic-Q: Logical reasoning tasks with one missing proposition, (2) Planning-Q: PDDL planning problems with initial states that are partially-observed, (3) GSM-Q: Human-annotated grade school math problems with one missing variable assignment, and (4) GSME-Q: a version of GSM-Q where word problems are translated into equations by human annotators. The LLM is tasked with selecting the correct clarification question(s) from a list of options. While state-of-the-art models excel at GSM-Q and GSME-Q, their accuracy is only 40-50% on Logic-Q and Planning-Q. Analysis demonstrates that the ability to solve well-specified reasoning problems may not be sufficient for success on our benchmark: models have difficulty identifying the right question to ask, even when they can solve the fully specified version of the problem. Furthermore, in the Planning-Q domain, LLMs tend not to hedge, even when explicitly presented with the option to predict ``not sure.‘’ This highlights the need for deeper investigation into models’ information acquisition capabilities.
zh

[NLP-3] ActionStudio: A Lightweight Framework for Data and Training of Action Models

【Quick Read】: This paper addresses the challenge of training large-scale action models across diverse environments and complex agentic data, in particular the limited support in existing infrastructure for scalable, agent-specific fine-tuning. The key solution is ActionStudio, a lightweight and extensible data and training framework designed specifically for action models. Its core innovations include unifying heterogeneous agent trajectories through a standardized format, supporting diverse training paradigms (such as LoRA, full fine-tuning, and distributed setups), and integrating robust preprocessing and verification tools, achieving efficient and practical performance.

Link: https://arxiv.org/abs/2503.22673
Authors: Jianguo Zhang,Thai Hoang,Ming Zhu,Zuxin Liu,Shiyu Wang,Tulika Awalgaonkar,Akshara Prabhakar,Haolin Chen,Weiran Yao,Zhiwei Liu,Juntao Tan,Juan Carlos Niebles,Shelby Heinecke,Huan Wang,Silvio Savarese,Caiming Xiong
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Click to view the abstract

Abstract:Action models are essential for enabling autonomous agents to perform complex tasks. However, training large action models remains challenging due to the diversity of agent environments and the complexity of agentic data. Despite growing interest, existing infrastructure provides limited support for scalable, agent-specific fine-tuning. We present ActionStudio, a lightweight and extensible data and training framework designed for action models. ActionStudio unifies heterogeneous agent trajectories through a standardized format, supports diverse training paradigms including LoRA, full fine-tuning, and distributed setups, and integrates robust preprocessing and verification tools. We validate its effectiveness across both public and realistic industry benchmarks, demonstrating strong performance and practical scalability. We open-sourced code and data at this https URL to facilitate research in the community.
zh

[NLP-4] Evaluating Multimodal Language Models as Visual Assistants for Visually Impaired Users

【Quick Read】: This paper examines the effectiveness of multimodal large language models (MLLMs) as assistive technologies for visually impaired users. Although these models are widely adopted, the study finds notable limitations in contextual understanding, cultural sensitivity, and complex scene interpretation, especially for users who rely on them entirely for visual interpretation. To investigate these issues, the paper collates five user-centred tasks with image and video inputs and introduces a novel task on Optical Braille Recognition. A systematic evaluation of twelve MLLMs identifies key limitations that future work must overcome, including cultural context, multilingual support, Braille reading comprehension, assistive object recognition, and hallucination. The key takeaway is the need to develop more inclusive, robust, and trustworthy multimodal AI in order to improve its usefulness and reliability for accessibility.

Link: https://arxiv.org/abs/2503.22610
Authors: Antonia Karamolegkou,Malvina Nikandrou,Georgios Pantazopoulos,Danae Sanchez Villegas,Phillip Rust,Ruchira Dhar,Daniel Hershcovich,Anders Søgaard
Affiliations: University of Copenhagen; Heriot-Watt University
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY); Machine Learning (cs.LG)
Comments:

Click to view the abstract

Abstract:This paper explores the effectiveness of Multimodal Large Language models (MLLMs) as assistive technologies for visually impaired individuals. We conduct a user survey to identify adoption patterns and key challenges users face with such technologies. Despite a high adoption rate of these models, our findings highlight concerns related to contextual understanding, cultural sensitivity, and complex scene understanding, particularly for individuals who may rely solely on them for visual interpretation. Informed by these results, we collate five user-centred tasks with image and video inputs, including a novel task on Optical Braille Recognition. Our systematic evaluation of twelve MLLMs reveals that further advancements are necessary to overcome limitations related to cultural context, multilingual support, Braille reading comprehension, assistive object recognition, and hallucinations. This work provides critical insights into the future direction of multimodal AI for accessibility, underscoring the need for more inclusive, robust, and trustworthy visual assistance technologies.
zh

[NLP-5] Historical Ink: Exploring Large Language Models for Irony Detection in 19th-Century Spanish

【Quick Read】: This paper tackles the difficult problem of irony detection in 19th-century Latin American newspapers and enhances datasets to improve how well large language models (LLMs) capture the subtleties of irony. Two strategies are used to evaluate BERT and GPT-4o on both multi-class and binary classification tasks: dataset enhancement focused on enriching emotional and contextual cues, and a semi-automated annotation process that addresses class imbalance and improves annotation quality. Despite the challenges posed by the complexity of irony, the key contributions are a new historical Spanish dataset labelled for sentiment analysis and irony detection, and a semi-automated annotation methodology in which human expertise is crucial for refining LLM results, enriched by incorporating historical and cultural context as core features.

Link: https://arxiv.org/abs/2503.22585
Authors: Kevin Cohen,Laura Manrique-Gómez,Rubén Manrique
Affiliations: Systems and Computing Engineering Department, Universidad de los Andes; History and Geography Department, Universidad de los Andes
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Digital Libraries (cs.DL)
Comments:

Click to view the abstract

Abstract:This study explores the use of large language models (LLMs) to enhance datasets and improve irony detection in 19th-century Latin American newspapers. Two strategies were employed to evaluate the efficacy of BERT and GPT-4o models in capturing the subtle nuances nature of irony, through both multi-class and binary classification tasks. First, we implemented dataset enhancements focused on enriching emotional and contextual cues; however, these showed limited impact on historical language analysis. The second strategy, a semi-automated annotation process, effectively addressed class imbalance and augmented the dataset with high-quality annotations. Despite the challenges posed by the complexity of irony, this work contributes to the advancement of sentiment analysis through two key contributions: introducing a new historical Spanish dataset tagged for sentiment analysis and irony detection, and proposing a semi-automated annotation methodology where human expertise is crucial for refining LLMs results, enriched by incorporating historical and cultural contexts as core features.
zh

[NLP-6] Beyond Vanilla Fine-Tuning: Leveraging Multistage Multilingual and Domain-Specific Methods for Low-Resource Machine Translation

【Quick Read】: This paper addresses the limited performance of conventional single-stage fine-tuning of multilingual sequence-to-sequence large language models (msLLMs) in extremely low-resource neural machine translation (NMT) settings. The key solution comprises two approaches: (1) continual pre-training (CPT), which further trains msLLMs with domain-specific monolingual data to compensate for the under-representation of low-resource languages (LRLs); and (2) intermediate task transfer learning (ITTL), which fine-tunes msLLMs with both in-domain and out-of-domain parallel data to enhance their translation ability across domains and tasks. As an engineering application, these methods are implemented in NMT systems for Sinhala, Tamil, and English (six language pairs) and significantly improve translation performance.

Link: https://arxiv.org/abs/2503.22582
Authors: Sarubi Thillainathan,Songchen Yuan,En-Shiun Annie Lee,Sanath Jayasena,Surangika Ranathunga
Affiliations: Department of Computer Science and Engineering, University of Moratuwa, Katubedda, Sri Lanka; Department of Language Science and Technology, Saarland Informatics Campus, Saarland University, Germany; Department of Computer Science, University of Toronto, Canada; Faculty of Science, University of Ontario Institute of Technology, Oshawa, Canada; School of Mathematical and Computational Sciences, Massey University, Auckland, New Zealand
Categories: Computation and Language (cs.CL)
Comments:

Click to view the abstract

Abstract:Fine-tuning multilingual sequence-to-sequence large language models (msLLMs) has shown promise in developing neural machine translation (NMT) systems for low-resource languages (LRLs). However, conventional single-stage fine-tuning methods struggle in extremely low-resource NMT settings, where training data is very limited. This paper contributes to artificial intelligence by proposing two approaches for adapting msLLMs in these challenging scenarios: (1) continual pre-training (CPT), where the msLLM is further trained with domain-specific monolingual data to compensate for the under-representation of LRLs, and (2) intermediate task transfer learning (ITTL), a method that fine-tunes the msLLM with both in-domain and out-of-domain parallel data to enhance its translation capabilities across various domains and tasks. As an application in engineering, these methods are implemented in NMT systems for Sinhala, Tamil, and English (six language pairs) in domain-specific, extremely low-resource settings (datasets containing fewer than 100,000 samples). Our experiments reveal that these approaches enhance translation performance by an average of +1.47 bilingual evaluation understudy (BLEU) score compared to the standard single-stage fine-tuning baseline across all translation directions. Additionally, a multi-model ensemble further improves performance by an additional BLEU score.
zh

[NLP-7] Bridging the Dimensional Chasm: Uncover Layer-wise Dimensional Reduction in Transformers through Token Correlation

【Quick Read】: This paper addresses a fundamental paradox between the geometric evolution of token representations in large language models (LLMs) and the low-dimensional organization of semantic information in human language. Human language organizes semantic information in a space of roughly 10^1 dimensions, whereas modern LLMs use embeddings of roughly 10^3 dimensions processed through Transformer architectures. To resolve this paradox, the key contribution is a geometric framework that tracks token dynamics across Transformer layers. Layer-wise analysis across multiple architectures reveals an expansion-contraction pattern: tokens diffuse into a "working space" and are then progressively projected onto lower-dimensional submanifolds. The findings imply a negative correlation between the working-space dimension and the parameter-sensitive performance of LLMs, and indicate that effective models tend to compress tokens into submanifolds of roughly 10 dimensions, close to human semantic spaces. Beyond improving LLM interpretability by reframing Transformer layers as projectors that mediate between high-dimensional computation and low-dimensional semantics, the work also provides practical tools for model diagnostics that do not rely on task-specific evaluations.

Link: https://arxiv.org/abs/2503.22547
Authors: Zhuo-Yang Song,Zeyu Li,Qing-Hong Cao,Ming-xing Luo,Hua Xing Zhu
Affiliations: School of Physics, Peking University, China; CAS Key Laboratory of Theoretical Physics, Institute of Theoretical Physics, Chinese Academy of Sciences, China; Center for High Energy Physics, Peking University, China; Beijing Computational Science Research Center, China
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 17 pages, 9 figures, 2 tables

Click to view the abstract

Abstract:The geometric evolution of token representations in large language models (LLMs) presents a fundamental paradox: while human language inherently organizes semantic information in low-dimensional spaces ( \sim 10^1 dimensions), modern LLMs employ high-dimensional embeddings ( \sim 10^3 dimensions) processed through Transformer architectures. To resolve this paradox, this work bridges this conceptual gap by developing a geometric framework that tracks token dynamics across Transformers layers. Through layer-wise analysis of intrinsic dimensions across multiple architectures, we reveal an expansion-contraction pattern where tokens diffuse to a “working space” and then progressively project onto lower-dimensional submanifolds. Our finding implies a negative correlation between the working space dimension and parameter-sensitive performance of the LLMs, and indicates that effective models tend to compress tokens into approximately 10-dimensional submanifolds, closely resembling human semantic spaces. This work not only advances LLM interpretability by reframing Transformers layers as projectors that mediate between high-dimensional computation and low-dimensional semantics, but also provides practical tools for model diagnostics that do not rely on task-specific evaluations.
zh

[NLP-8] Exploiting Mixture-of-Experts Redundancy Unlocks Multimodal Generative Abilities

【Quick Read】: This paper addresses how to extend pre-trained text-only large language models (LLMs) to multimodal generation without significantly degrading their original text generation ability (constraint C1) and while keeping the parameter increase small (constraint C2). The key idea is to exploit underutilized capacity in deep models: parameter redundancy within Mixture-of-Experts (MoE) provides additional capacity for learning a new modality, while low-rank adaptation applied only to the tokens of the new modality preserves the original language generation performance. A novel parameter initialization scheme based on the Gromov-Wasserstein distance is introduced to improve convergence and training stability. An analysis of the routing mechanism further reveals the emergence of modality-specific pathways and reduced redundancy within the experts, which efficiently unlocks multimodal generative abilities. The method can be applied seamlessly to a wide range of contemporary LLMs, opening a new path for transitioning from unimodal to multimodal architectures.

Link: https://arxiv.org/abs/2503.22517
Authors: Raman Dutt,Harleen Hanspal,Guoxuan Xia,Petru-Daniel Tudosiu,Alexander Black,Yongxin Yang,Steven McDonagh,Sarah Parisot
Affiliations: The University of Edinburgh; Imperial College, London; Leonardo.AI; Huawei Noah's Ark Lab; Microsoft Research, Cambridge
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view the abstract

Abstract:In this work, we undertake the challenge of augmenting the existing generative capabilities of pre-trained text-only large language models (LLMs) with multi-modal generation capability while satisfying two core constraints: C1: preserving the original language generative capabilities with negligible performance degradation, and C2: adhering to a small parameter budget to learn the new modality, ensuring scalability and efficiency. In contrast to current approaches that add dedicated modules, thereby significantly increasing the parameter count, we propose a method that leverages the underutilized capacity inherent in deep models. Specifically, we exploit the parameter redundancy within Mixture-of-Experts (MoEs) as a source of additional capacity for learning a new modality, enabling better parameter efficiency (C1). Moreover, we preserve the original language generation capabilities by applying low-rank adaptation exclusively to the tokens of the new modality (C2). Furthermore, we introduce a novel parameter initialization scheme based on the Gromov-Wasserstein distance to improve convergence and training stability. Through an extensive analysis of the routing mechanism, we uncover the emergence of modality-specific pathways and decreased redundancy within the experts that can efficiently unlock multi-modal generative capabilities. Overall, our method can be seamlessly applied to a wide range of contemporary LLMs, providing a new pathway for transitioning from uni-modal to multi-modal architectures.
zh

[NLP-9] WorkTeam: Constructing Workflows from Natural Language with Multi-Agents NAACL2025

【Quick Read】: This paper addresses the expert knowledge and technical barriers required for hand-crafted workflow construction, as well as the performance degradation of existing single-LLM methods on complex tasks caused by the need for specialized knowledge and the strain of task switching. The proposed solution, WorkTeam, is a multi-agent natural-language-to-workflow (NL2Workflow) framework whose key idea is the collaboration of three agents with distinct roles (a supervisor, an orchestrator, and a filler) to optimize the conversion from natural language instructions to workflows. The paper also introduces the HW-NL2Workflow dataset, containing 3,695 real-world business samples for training and evaluation. Experimental results show that the approach significantly increases the success rate of workflow construction, providing a novel and effective solution for enterprise NL2Workflow services.

Link: https://arxiv.org/abs/2503.22473
Authors: Hanchao Liu,Rongjun Li,Weimin Xiong,Ziyu Zhou,Wei Peng
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: Accepted in NAACL 2025 Industry Track

Click to view the abstract

Abstract:Workflows play a crucial role in enhancing enterprise efficiency by orchestrating complex processes with multiple tools or components. However, hand-crafted workflow construction requires expert knowledge, presenting significant technical barriers. Recent advancements in Large Language Models (LLMs) have improved the generation of workflows from natural language instructions (aka NL2Workflow), yet existing single LLM agent-based methods face performance degradation on complex tasks due to the need for specialized knowledge and the strain of task-switching. To tackle these challenges, we propose WorkTeam, a multi-agent NL2Workflow framework comprising a supervisor, orchestrator, and filler agent, each with distinct roles that collaboratively enhance the conversion process. As there are currently no publicly available NL2Workflow benchmarks, we also introduce the HW-NL2Workflow dataset, which includes 3,695 real-world business samples for training and evaluation. Experimental results show that our approach significantly increases the success rate of workflow construction, providing a novel and effective solution for enterprise NL2Workflow services.
zh

[NLP-10] Evaluating LLM-based Agents for Multi-Turn Conversations: A Survey

【Quick Read】: This survey addresses how to evaluate large language model (LLM)-based multi-turn conversational agents. Its key contribution is a structured evaluation framework built on two interrelated taxonomies that answer "what to evaluate" and "how to evaluate". The first taxonomy defines the key components of LLM-based agents for multi-turn conversations and their evaluation dimensions, including task completion, response quality, user experience, memory and context retention, and planning and tool integration, ensuring that evaluation is holistic and meaningful. The second taxonomy focuses on evaluation methodologies, categorizing existing approaches into annotation-based evaluations, automated metrics, hybrid strategies that combine human assessment with quantitative measures, and self-judging methods that use LLMs. The framework covers not only traditional metrics based on language understanding (such as BLEU and ROUGE) but also advanced techniques that reflect the dynamic, interactive nature of multi-turn dialogue, providing a solid foundation for comprehensive evaluation of multi-turn conversational agents.

Link: https://arxiv.org/abs/2503.22458
Authors: Shengyue Guan,Haoyi Xiong,Jindong Wang,Jiang Bian,Bin Zhu,Jian-guang Lou
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Click to view the abstract

Abstract:This survey examines evaluation methods for large language model (LLM)-based agents in multi-turn conversational settings. Using a PRISMA-inspired framework, we systematically reviewed nearly 250 scholarly sources, capturing the state of the art from various venues of publication, and establishing a solid foundation for our analysis. Our study offers a structured approach by developing two interrelated taxonomy systems: one that defines what to evaluate and another that explains how to evaluate. The first taxonomy identifies key components of LLM-based agents for multi-turn conversations and their evaluation dimensions, including task completion, response quality, user experience, memory and context retention, as well as planning and tool integration. These components ensure that the performance of conversational agents is assessed in a holistic and meaningful manner. The second taxonomy system focuses on the evaluation methodologies. It categorizes approaches into annotation-based evaluations, automated metrics, hybrid strategies that combine human assessments with quantitative measures, and self-judging methods utilizing LLMs. This framework not only captures traditional metrics derived from language understanding, such as BLEU and ROUGE scores, but also incorporates advanced techniques that reflect the dynamic, interactive nature of multi-turn dialogues.
zh

[NLP-11] Scaling Laws of Scientific Discovery with AI and Robot Scientists

【Quick Read】: This paper addresses the problem that traditional research methods struggle to keep pace with the demands of modern discovery as scientific inquiry evolves rapidly. The key solution is an autonomous generalist scientist (AGS) system that fuses agentic AI with embodied robotics, embedding advanced AI and robotic technologies into every phase of the research lifecycle, from hypothesis formulation to peer-ready manuscripts. Such a system could traverse physical and digital realms with unprecedented efficiency, weave together cross-disciplinary insights, and drastically reduce the time and resources required for scientific research. The authors foresee scientific discovery following new scaling laws driven by the proliferation and sophistication of such systems: as these autonomous agents and robots adapt to extreme environments and leverage a growing reservoir of knowledge, they could spark a paradigm shift, push the boundaries of what is possible, and usher in an era of continuous innovation.

Link: https://arxiv.org/abs/2503.22444
Authors: Pengsong Zhang,Heng Zhang,Huazhe Xu,Renjun Xu,Zhenting Wang,Cong Wang,Animesh Garg,Zhibin Li,Arash Ajoudani,Xinyu Liu
Affiliations: University of Toronto; Istituto Italiano di Tecnologia; Universita di Genova; Tsinghua University; Zhejiang University; Rutgers University; Harvard University; Georgia Tech; University College London
Categories: Computation and Language (cs.CL); Robotics (cs.RO)
Comments: 22 pages, 7 figures

Click to view the abstract

Abstract:The rapid evolution of scientific inquiry highlights an urgent need for groundbreaking methodologies that transcend the limitations of traditional research. Conventional approaches, bogged down by manual processes and siloed expertise, struggle to keep pace with the demands of modern discovery. We envision an autonomous generalist scientist (AGS) system-a fusion of agentic AI and embodied robotics-that redefines the research lifecycle. This system promises to autonomously navigate physical and digital realms, weaving together insights from disparate disciplines with unprecedented efficiency. By embedding advanced AI and robot technologies into every phase-from hypothesis formulation to peer-ready manuscripts-AGS could slash the time and resources needed for scientific research in diverse field. We foresee a future where scientific discovery follows new scaling laws, driven by the proliferation and sophistication of such systems. As these autonomous agents and robots adapt to extreme environments and leverage a growing reservoir of knowledge, they could spark a paradigm shift, pushing the boundaries of what’s possible and ushering in an era of relentless innovation.
zh

[NLP-12] Long-Tail Crisis in Nearest Neighbor Language Models NAACL2025

【Quick Read】: This paper investigates the behavior of the k-nearest-neighbor language model (kNN-LM) on low-frequency tokens, in particular its ability to estimate the probabilities of long-tail target tokens. Although the success of kNN-LM is commonly attributed to its explicit memory (the datastore) improving predictions for long-tail phenomena, prior work has mainly examined its ability to retrieve long-tail contexts, leaving its performance in predicting low-frequency token probabilities at inference time underexplored. The key contribution is a systematic set of experiments that evaluate kNN-LM on low-frequency tokens from multiple angles, including prediction probability, retrieval accuracy, the token distribution in the datastore, and the approximation error of product quantization, revealing that the performance gains concentrate on high-frequency tokens rather than the expected long-tail contexts.

Link: https://arxiv.org/abs/2503.22426
Authors: Yuto Nishida,Makoto Morishita,Hiroyuki Deguchi,Hidetaka Kamigaito,Taro Watanabe
Affiliations: Nara Institute of Science and Technology (NAIST); Future Corporation
Categories: Computation and Language (cs.CL)
Comments: Accepted to NAACL 2025 Findings

Click to view the abstract

Abstract:The k-nearest-neighbor language model (kNN-LM), one of the retrieval-augmented language models, improves the perplexity for given text by directly accessing a large datastore built from any text data during inference. A widely held hypothesis for the success of kNN-LM is that its explicit memory, i.e., the datastore, enhances predictions for long-tail phenomena. However, prior works have primarily shown its ability to retrieve long-tail contexts, leaving the model's performance underexplored in estimating the probabilities of long-tail target tokens during inference. In this paper, we investigate the behavior of kNN-LM on low-frequency tokens, examining prediction probability, retrieval accuracy, token distribution in the datastore, and approximation error of the product quantization. Our experimental results reveal that kNN-LM does not improve prediction performance for low-frequency tokens but mainly benefits high-frequency tokens regardless of long-tail contexts in the datastore.
zh
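For readers unfamiliar with kNN-LM, the sketch below shows the standard interpolation it performs at each decoding step (following the original kNN-LM formulation). The interpolation weight, distance temperature, and the use of L2 distance are hyperparameter choices assumed for the example.

```python
# Sketch of kNN-LM interpolation at a single decoding step (illustrative).
# The datastore maps context keys to next-token values; lam and temp are
# hyperparameters assumed for the example.
import numpy as np

def knn_lm_probs(p_lm, query, keys, values, vocab_size, k=8, lam=0.25, temp=1.0):
    # p_lm: (vocab,) base LM distribution; query: (d,) current context vector;
    # keys: (n, d) datastore keys; values: (n,) next-token ids stored with each key.
    dists = np.linalg.norm(keys - query, axis=1)     # distance to every key
    nn_idx = np.argsort(dists)[:k]                   # k nearest neighbours
    weights = np.exp(-dists[nn_idx] / temp)
    weights /= weights.sum()
    p_knn = np.zeros(vocab_size)
    for w, tok in zip(weights, values[nn_idx]):      # aggregate mass per token id
        p_knn[int(tok)] += w
    return lam * p_knn + (1.0 - lam) * p_lm          # interpolated next-token distribution
```

The paper's analysis effectively asks whether the p_knn term above actually helps when the target token is rare, rather than whether rare contexts can be retrieved at all.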

[NLP-13] CoSIL: Software Issue Localization via LLM-Driven Code Repository Graph Searching

【Quick Read】: This paper addresses the difficulty that existing large language model (LLM)-based issue localization methods have in balancing concise yet effective contexts with an adequately comprehensive search space. Because of the limited context window of LLMs, these methods struggle to capture enough information within a constrained context for accurate issue localization. The key solution is CoSIL (Code Search and Issue Localization), a function-level issue localization method that requires no training or indexing. CoSIL narrows the search space through module call graphs, iteratively searches a dynamically constructed function call graph for relevant context, and uses context pruning to control the search direction and manage context. The core innovation is that the call graph is built dynamically by the LLM during search, removing the need for pre-parsing and improving efficiency and accuracy. Experiments show that CoSIL with Qwen2.5 Coder 32B achieves Top-1 localization success rates of 43% and 44.6% on SWE-bench Lite and SWE-bench Verified, outperforming existing methods by 8.6 to 98.2 percent. When CoSIL is applied to guide the patch generation stage, the issue resolution rate further improves by 9.3 to 31.5 percent.

Link: https://arxiv.org/abs/2503.22424
Authors: Zhonghao Jiang,Xiaoxue Ren,Meng Yan,Wei Jiang,Yong Li,Zhongxin Liu
Affiliations: The State Key Laboratory of Blockchain and Data Security, Zhejiang University, Hangzhou, China; Hangzhou High-Tech Zone (Binjiang) Institute of Blockchain and Data Security, Hangzhou, China; School of Big Data and Software Engineering, Chongqing University, Chongqing, China; Ant Group, Hangzhou, China
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Click to view the abstract

Abstract:Large language models (LLMs) have significantly advanced autonomous software engineering, leading to a growing number of software engineering agents that assist developers in automatic program repair. Issue localization forms the basis for accurate patch generation. However, because of limitations caused by the context window length of LLMs, existing issue localization methods face challenges in balancing concise yet effective contexts and adequately comprehensive search spaces. In this paper, we introduce CoSIL, an LLM driven, simple yet powerful function level issue localization method without training or indexing. CoSIL reduces the search space through module call graphs, iteratively searches the function call graph to obtain relevant contexts, and uses context pruning to control the search direction and manage contexts effectively. Importantly, the call graph is dynamically constructed by the LLM during search, eliminating the need for pre-parsing. Experiment results demonstrate that CoSIL achieves a Top-1 localization success rate of 43 percent and 44.6 percent on SWE bench Lite and SWE bench Verified, respectively, using Qwen2.5 Coder 32B, outperforming existing methods by 8.6 to 98.2 percent. When CoSIL is applied to guide the patch generation stage, the resolved rate further improves by 9.3 to 31.5 percent.
zh
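A rough sketch of the iterative call-graph search loop described above follows. It is only an interpretation of the abstract: `expand_callees` and `llm_select` are placeholder callables (the real method has the LLM expand the graph and prune the context), and the round/beam limits are invented for illustration.

```python
# Sketch of CoSIL-style iterative search over a dynamically built call graph.
# `expand_callees(fn)` returns the functions called by `fn` (built lazily);
# `llm_select(issue, candidates, context)` returns candidates ranked by relevance.
from collections import deque

def localize(issue: str, entry_points: list[str], expand_callees, llm_select,
             max_rounds: int = 3, beam: int = 5) -> list[str]:
    frontier = deque(entry_points)          # functions to inspect next
    context: list[str] = []                 # functions kept as relevant context
    for _ in range(max_rounds):
        candidates = []
        while frontier:
            fn = frontier.popleft()
            candidates.extend(expand_callees(fn))     # extend the call graph on the fly
        if not candidates:
            break
        # Context pruning: keep only functions judged relevant, which also
        # steers where the next search round goes.
        kept = llm_select(issue, candidates, context)[:beam]
        context.extend(kept)
        frontier.extend(kept)
    return llm_select(issue, context, context)[:1]    # final suspect function(s)
```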

[NLP-14] Elite Political Discourse has Become More Toxic in Western Countries

【Quick Read】: This paper asks whether politics internationally is becoming more uncivil and what the main drivers of political incivility are. The key to the solution is a novel dataset of nearly 18 million tweets from parliamentarians in 17 countries over five years, used in a large-scale quantitative analysis that reveals a marked increase in toxic discourse among political elites and its association with radical-right parties and parties in opposition, while also examining how different topic areas (such as "culture war" topics versus welfare or economic issues) affect the level of political incivility. These analyses offer important insights into the erosion of constructive dialogue in democracies worldwide.

Link: https://arxiv.org/abs/2503.22411
Authors: Petter Törnberg,Juliana Chueri
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments:

Click to view the abstract

Abstract:Toxic and uncivil politics is widely seen as a growing threat to democratic values and governance, yet our understanding of the drivers and evolution of political incivility remains limited. Leveraging a novel dataset of nearly 18 million Twitter messages from parliamentarians in 17 countries over five years, this paper systematically investigates whether politics internationally is becoming more uncivil, and what are the determinants of political incivility. Our analysis reveals a marked increase in toxic discourse among political elites, and that it is associated to radical-right parties and parties in opposition. Toxicity diminished markedly during the early phase of the COVID-19 pandemic and, surprisingly, during election campaigns. Furthermore, our results indicate that posts relating to ``culture war’’ topics, such as migration and LGBTQ+ rights, are substantially more toxic than debates focused on welfare or economic issues. These findings underscore a troubling shift in international democracies toward an erosion of constructive democratic dialogue.
zh

[NLP-15] EllieSQL: Cost-Efficient Text-to-SQL with Complexity-Aware Routing

【Quick Read】: This paper addresses the economic practicability of current LLM-based Text-to-SQL methods for real-world deployment: their high computational cost limits widespread adoption. The key solution is EllieSQL, a complexity-aware routing framework that assigns queries to suitable SQL generation pipelines according to their estimated complexity. The framework investigates multiple routers to direct simple queries to efficient methods while reserving computationally intensive methods for complex cases, balancing performance and cost. Drawing from economics, the paper introduces the Token Elasticity of Performance (TEP) metric, which quantifies cost-efficiency as the responsiveness of performance gains to token investment in SQL generation. Experiments show that EllieSQL substantially reduces token usage without sacrificing performance while greatly improving TEP, offering a new direction for sustainable Text-to-SQL systems.

Link: https://arxiv.org/abs/2503.22402
Authors: Yizhang Zhu,Runzhi Jiang,Boyan Li,Nan Tang,Yuyu Luo
Affiliations: Unknown
Categories: Databases (cs.DB); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 19 pages, 8 figures, 3 tables

Click to view the abstract

Abstract:Text-to-SQL automatically translates natural language queries to SQL, allowing non-technical users to retrieve data from databases without specialized SQL knowledge. Despite the success of advanced LLM-based Text-to-SQL approaches on leaderboards, their unsustainable computational costs–often overlooked–stand as the “elephant in the room” in current leaderboard-driven research, limiting their economic practicability for real-world deployment and widespread adoption. To tackle this, we exploratively propose EllieSQL, a complexity-aware routing framework that assigns queries to suitable SQL generation pipelines based on estimated complexity. We investigate multiple routers to direct simple queries to efficient approaches while reserving computationally intensive methods for complex cases. Drawing from economics, we introduce the Token Elasticity of Performance (TEP) metric, capturing cost-efficiency by quantifying the responsiveness of performance gains relative to token investment in SQL generation. Experiments show that compared to always using the most advanced methods in our study, EllieSQL with the Qwen2.5-0.5B-DPO router reduces token use by over 40% without compromising performance on Bird development set, achieving more than a 2x boost in TEP over non-routing approaches. This not only advances the pursuit of cost-efficient Text-to-SQL but also invites the community to weigh resource efficiency alongside performance, contributing to progress in sustainable Text-to-SQL.
zh
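The sketch below illustrates the two ideas in the abstract, complexity-aware routing and an elasticity-style cost-efficiency metric. The three pipeline names and the exact elasticity formula are assumptions made for illustration; EllieSQL's actual router is a trained model (e.g., the Qwen2.5-0.5B-DPO router mentioned above) and the paper defines TEP precisely.

```python
# Sketch of complexity-aware routing plus a token-elasticity-style metric (illustrative).
def route(query: str, classify_complexity) -> str:
    # classify_complexity(query) -> "simple" | "medium" | "hard" (assumed labels)
    level = classify_complexity(query)
    return {"simple": "direct_llm_sql",
            "medium": "schema_linking_pipeline",
            "hard":   "multi_step_reasoning_pipeline"}[level]

def token_elasticity(perf_base, perf_new, tokens_base, tokens_new) -> float:
    # Elasticity in the economic sense: relative performance gain divided by the
    # relative increase in token spend (higher means more cost-efficient).
    d_perf = (perf_new - perf_base) / perf_base
    d_tok = (tokens_new - tokens_base) / tokens_base
    return d_perf / d_tok if d_tok != 0 else float("inf")
```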

[NLP-16] Negation: A Pink Elephant in the Large Language Models Room?

【Quick Read】: This paper addresses the challenge that negations pose for large language models (LLMs), an issue that is essential for logical reasoning yet remains underexplored. The key contribution is the construction of two multilingual natural language inference (NLI) datasets with paired examples that differ only in negation, together with an evaluation of popular LLMs to analyze how model size and language affect their ability to handle negation. Contrary to previous work, the study finds that increasing model size consistently improves the ability to handle negation. It also finds that both reasoning accuracy and robustness to negation are language-dependent, and that the length and explicitness of the premise have a greater impact on robustness than the language itself. The datasets can support further research on, and improvement of, language model reasoning in multilingual settings.

Link: https://arxiv.org/abs/2503.22395
Authors: Tereza Vrabcová,Marek Kadlčík,Petr Sojka,Michal Štefánik,Michal Spiegel
Affiliations: Faculty of Informatics, Masaryk University
Categories: Computation and Language (cs.CL)
Comments:

Click to view the abstract

Abstract:Negations are key to determining sentence meaning, making them essential for logical reasoning. Despite their importance, negations pose a substantial challenge for large language models (LLMs) and remain underexplored. We construct two multilingual natural language inference (NLI) datasets with paired examples differing in negation. We investigate how model size and language impact its ability to handle negation correctly by evaluating popular LLMs. Contrary to previous work, we show that increasing the model size consistently improves the models' ability to handle negations. Furthermore, we find that both the models' reasoning accuracy and robustness to negation are language-dependent and that the length and explicitness of the premise have a greater impact on robustness than language. Our datasets can facilitate further research and improvements of language model reasoning in multilingual settings.
zh

[NLP-17] Why Stop at One Error? Benchmarking LLMs as Data Science Code Debuggers for Multi-Hop and Multi-Bug Errors

【速读】: 该论文试图解决大型语言模型(LLMs)在数据科学代码调试中的多跳错误追踪和多缺陷检测能力未被充分评估的问题。当前的代码生成和修复基准测试主要集中在简单单错误场景下的语法和功能正确性评估,而LLMs在复杂数据科学代码中自主定位和修复运行时逻辑错误的能力尚未得到深入研究。为填补这一空白,论文引入了DSDBench(数据科学调试基准),这是首个系统评估LLMs在数据科学代码调试中多跳错误追踪和多缺陷检测能力的基准。DSDBench的关键在于通过自动合成包含多跳和多缺陷的代码片段,构建了一个具有现实意义的数据科学调试任务数据集,包含1,117个标注样本和741对因果错误关系及运行时错误信息,从而为评估和提升LLMs的调试与推理能力提供了重要资源。

链接: https://arxiv.org/abs/2503.22388
作者: Zhiyu Yang,Shuo Wang,Yukun Yan,Yang Deng
机构: Singapore Management University (新加坡管理大学); Tsinghua University (清华大学)
类目: Computation and Language (cs.CL)
备注: Work in progress

点击查看摘要

Abstract:LLMs are transforming software development, yet current code generation and code repair benchmarks mainly assess syntactic and functional correctness in simple, single-error cases. LLMs' capabilities to autonomously find and fix runtime logical errors in complex data science code remain largely unexplored. To address this gap, we introduce DSDBench: the Data Science Debugging Benchmark, the first benchmark for systematic evaluation of LLMs on multi-hop error tracing and multi-bug detection in data science code debugging. DSDBench adapts datasets from existing data science task benchmarks, such as DABench and MatPlotBench, featuring realistic data science debugging tasks with automatically synthesized multi-hop, multi-bug code snippets. DSDBench includes 1,117 annotated samples with 741 cause-effect error pairs and runtime error messages. Evaluations of state-of-the-art LLMs on DSDBench show significant performance gaps, highlighting challenges in debugging logical runtime errors in data science code. DSDBench offers a crucial resource to evaluate and improve LLMs' debugging and reasoning capabilities, enabling more reliable AI-assisted data science in the wild. DSDBench is publicly available at this https URL.
zh

[NLP-18] Spend Your Budget Wisely: Towards an Intelligent Distribution of the Privacy Budget in Differentially Private Text Rewriting

【Quick Read】: This paper addresses the problem of unreasonable privacy budget allocation in Differentially Private Text Rewriting. Existing methods typically allocate the privacy budget uniformly, ignoring the fact that different components of language differ in sensitivity. The key innovation is a linguistics- and NLP-based approach for intelligently and sensibly allocating the privacy budget (the ε parameter) to the constituent tokens of a target document. Through a series of privacy and utility experiments, the authors show that, given the same privacy budget, intelligent allocation achieves higher privacy levels and better privacy-utility trade-offs than a naive uniform allocation. The study highlights the intricacies of text privatization with DP and calls for further work on more efficient budget allocation mechanisms that maximize the benefits of DP in text rewriting.

Link: https://arxiv.org/abs/2503.22379
Authors: Stephen Meisenbacher,Chaeeun Joy Lee,Florian Matthes
Affiliations: Technical University of Munich; School of Computation, Information and Technology
Categories: Cryptography and Security (cs.CR); Computation and Language (cs.CL)
Comments: 14 pages, 1 figure, 6 tables. Accepted to CODASPY 2025

Click to view the abstract

Abstract:The task of Differentially Private Text Rewriting is a class of text privatization techniques in which (sensitive) input textual documents are rewritten under Differential Privacy (DP) guarantees. The motivation behind such methods is to hide both explicit and implicit identifiers that could be contained in text, while still retaining the semantic meaning of the original text, thus preserving utility. Recent years have seen an uptick in research output in this field, offering a diverse array of word-, sentence-, and document-level DP rewriting methods. Common to these methods is the selection of a privacy budget (i.e., the ε parameter), which governs the degree to which a text is privatized. One major limitation of previous works, stemming directly from the unique structure of language itself, is the lack of consideration of where the privacy budget should be allocated, as not all aspects of language, and therefore text, are equally sensitive or personal. In this work, we are the first to address this shortcoming, asking the question of how a given privacy budget can be intelligently and sensibly distributed amongst a target document. We construct and evaluate a toolkit of linguistics- and NLP-based methods used to allocate a privacy budget to constituent tokens in a text document. In a series of privacy and utility experiments, we empirically demonstrate that given the same privacy budget, intelligent distribution leads to higher privacy levels and more positive trade-offs than a naive distribution of ε. Our work highlights the intricacies of text privatization with DP, and furthermore, it calls for further work on finding more efficient ways to maximize the privatization benefits offered by DP in text rewriting.
zh
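To make the allocation idea concrete, here is a minimal sketch of distributing a document-level budget over tokens by an NLP-derived weight instead of uniformly. Everything here is an assumption for illustration: the weighting function, the direction of the weighting, and the word-level DP mechanism are placeholders, not the paper's toolkit.

```python
# Sketch of non-uniform privacy budget allocation over tokens (illustrative).
def allocate_budget(tokens: list[str], weight, total_eps: float) -> list[float]:
    # weight(token) -> positive score derived from linguistic/NLP analysis
    # (e.g. POS tags or NER); the per-token shares sum to the document budget,
    # mirroring sequential composition of per-token mechanisms.
    weights = [weight(t) for t in tokens]
    z = sum(weights)
    return [total_eps * w / z for w in weights]

def rewrite(tokens, budgets, dp_word_mechanism):
    # dp_word_mechanism(token, eps) -> privatized replacement token (placeholder)
    return [dp_word_mechanism(t, eps) for t, eps in zip(tokens, budgets)]

# Usage: uniform allocation is the special case weight = lambda t: 1.0.
```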

[NLP-19] Supposedly Equivalent Facts That Aren't? Entity Frequency in Pre-training Induces Asymmetry in LLMs

【Quick Read】: This paper aims to understand hallucinations in large language models (LLMs) and to explore their root causes. The key is to establish a direct link between model behaviour and the pre-training data that forms the models' prior knowledge, in particular by showing an asymmetry in the recognition of logically equivalent facts that can be attributed to frequency differences between entities appearing as subjects versus objects. Since most pre-training datasets are inaccessible, the authors use the fully open-source OLMo series and index its Dolma dataset to estimate entity frequencies, and construct probing datasets from relational facts (triples) in Wikidata5M to isolate the effect. The experiments show that facts with a high-frequency subject and a low-frequency object are recognized better than their logically equivalent inverses, that the pattern reverses in the low-to-high frequency setting, and that no statistically significant asymmetry appears when both entities are high-frequency. These findings underscore the influence of pre-training data on model predictions and provide insights for inferring the characteristics of the pre-training data of closed or partially closed LLMs.

Link: https://arxiv.org/abs/2503.22362
Authors: Yuan He,Bailan He,Zifeng Ding,Alisia Lupidi,Yuqicheng Zhu,Shuo Chen,Caiqi Zhang,Jiaoyan Chen,Yunpu Ma,Volker Tresp,Ian Horrocks
Affiliations: University of Oxford; LMU Munich; Siemens AG; University of Cambridge; Meta; University of Stuttgart; Bosch Center for AI; The University of Manchester
Categories: Computation and Language (cs.CL)
Comments:

Click to view the abstract

Abstract:Understanding and mitigating hallucinations in Large Language Models (LLMs) is crucial for ensuring reliable content generation. While previous research has primarily focused on “when” LLMs hallucinate, our work explains “why” and directly links model behaviour to the pre-training data that forms their prior knowledge. Specifically, we demonstrate that an asymmetry exists in the recognition of logically equivalent facts, which can be attributed to frequency discrepancies of entities appearing as subjects versus objects. Given that most pre-training datasets are inaccessible, we leverage the fully open-source OLMo series by indexing its Dolma dataset to estimate entity frequencies. Using relational facts (represented as triples) from Wikidata5M, we construct probing datasets to isolate this effect. Our experiments reveal that facts with a high-frequency subject and a low-frequency object are better recognised than their inverse, despite their logical equivalence. The pattern reverses in low-to-high frequency settings, and no statistically significant asymmetry emerges when both entities are high-frequency. These findings highlight the influential role of pre-training data in shaping model predictions and provide insights for inferring the characteristics of pre-training data in closed or partially closed LLMs.
zh
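The bucketing analysis described above can be sketched as follows. The frequency threshold and the counting source are placeholders (the paper estimates frequencies by indexing Dolma); the sketch only shows how triples would be grouped before probing the model on each fact and its inverse.

```python
# Sketch of grouping relational facts by (subject, object) frequency buckets (illustrative).
from collections import Counter

def bucket(freq: int, hi: int = 10_000) -> str:
    # Threshold is an assumption; the paper's cut-offs may differ.
    return "high" if freq >= hi else "low"

def pair_buckets(triples, entity_freq: Counter) -> dict:
    # triples: iterable of (subject, relation, object), e.g. from Wikidata5M.
    groups: dict[tuple, list] = {}
    for s, r, o in triples:
        key = (bucket(entity_freq[s]), bucket(entity_freq[o]))   # e.g. ("high", "low")
        groups.setdefault(key, []).append((s, r, o))
    return groups   # probe recognition of each fact vs. its inverse per group
```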

[NLP-20] Firm or Fickle? Evaluating Large Language Models Consistency in Sequential Interactions

【Quick Read】: This paper addresses the response consistency of large language models (LLMs) in multi-turn interactions, a requirement for reliable deployment in high-stakes domains. The key contributions are threefold: a novel Position-Weighted Consistency (PWC) score that captures both the importance of early-stage stability and recovery patterns in multi-turn interactions; a carefully curated benchmark dataset spanning diverse domains and difficulty levels, designed to evaluate LLM consistency under a variety of challenging follow-up scenarios; and the Confidence-Aware Response Generation (CARG) framework, which significantly improves response stability by incorporating model confidence signals into the generation process. Empirical results show that CARG improves response stability without sacrificing accuracy, demonstrating its potential for reliable LLM deployment in critical applications.

Link: https://arxiv.org/abs/2503.22353
Authors: Yubo Li,Yidi Miao,Xueying Ding,Ramayya Krishnan,Rema Padman
Affiliations: Carnegie Mellon University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 8 pages, 5 figures

Click to view the abstract

Abstract:Large Language Models (LLMs) have shown remarkable capabilities across various tasks, but their deployment in high-stake domains requires consistent performance across multiple interaction rounds. This paper introduces a comprehensive framework for evaluating and improving LLM response consistency, making three key contributions. First, we propose a novel Position-Weighted Consistency (PWC) score that captures both the importance of early-stage stability and recovery patterns in multi-turn interactions. Second, we present a carefully curated benchmark dataset spanning diverse domains and difficulty levels, specifically designed to evaluate LLM consistency under various challenging follow-up scenarios. Third, we introduce Confidence-Aware Response Generation (CARG), a framework that significantly improves response stability by incorporating model confidence signals into the generation process. Empirical results demonstrate that CARG significantly improves response stability without sacrificing accuracy, underscoring its potential for reliable LLM deployment in critical applications.
zh
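The abstract does not give the PWC formula, so the following is only an illustrative guess at what a position-weighted consistency score could look like: weights decay with the follow-up turn index, so wavering early costs more than wavering late.

```python
# Illustrative position-weighted consistency score (assumed form, not the paper's definition).
def pwc_score(consistent: list[bool], decay: float = 0.8) -> float:
    # consistent[i] = True if the model's answer at follow-up turn i still agrees
    # with its original (correct) answer.
    weights = [decay ** i for i in range(len(consistent))]
    score = sum(w for w, c in zip(weights, consistent) if c)
    return score / sum(weights)   # 1.0 = never wavers, 0.0 = flips immediately

# Example: stays firm for two turns, flips on the third of four follow-ups.
print(pwc_score([True, True, False, True]))   # ~0.78
```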

[NLP-21] SKDU at De-Factify 4.0: Natural Language Features for AI-Generated Text-Detection AAAI2025

【Quick Read】: This paper addresses the growing challenge of distinguishing human-written text from AI-generated text as large language models (LLMs) advance rapidly. It explores a pipelined approach for AI-generated text detection whose key is combining a feature extraction step with a classification module: feature extraction includes prompt-based rewriting features (inspired by RAIDAR) and content-based features from the NELA toolkit, which are then analyzed by a classifier. Comprehensive experiments on the Defactify4.0 dataset evaluate two tasks: binary classification (human-written vs. AI-generated) and multi-class classification (identifying the specific generative model used to produce the input text). The results show that NELA features significantly outperform RAIDAR features on both tasks, capturing nuanced linguistic, stylistic, and content-based differences, while combining RAIDAR and NELA features brings only minimal improvement, indicating redundancy introduced by the less discriminative features. Among the classifiers tested, XGBoost performs best, leveraging the rich feature sets to achieve high accuracy and generalisation.

Link: https://arxiv.org/abs/2503.22338
Authors: Shrikant Malviya,Pablo Arnau-González,Miguel Arevalillo-Herráez,Stamos Katsigiannis
Affiliations: Durham University; Universitat de València
Categories: Computation and Language (cs.CL)
Comments: De-Factify 4.0 Workshop at the 39th AAAI Conference on Artificial Intelligence (AAAI 2025)

Click to view the abstract

Abstract:The rapid advancement of large language models (LLMs) has introduced new challenges in distinguishing human-written text from AI-generated content. In this work, we explored a pipelined approach for AI-generated text detection that includes a feature extraction step (i.e. prompt-based rewriting features inspired by RAIDAR and content-based features derived from the NELA toolkit) followed by a classification module. Comprehensive experiments were conducted on the Defactify4.0 dataset, evaluating two tasks: binary classification to differentiate human-written and AI-generated text, and multi-class classification to identify the specific generative model used to generate the input text. Our findings reveal that NELA features significantly outperform RAIDAR features in both tasks, demonstrating their ability to capture nuanced linguistic, stylistic, and content-based differences. Combining RAIDAR and NELA features provided minimal improvement, highlighting the redundancy introduced by less discriminative features. Among the classifiers tested, XGBoost emerged as the most effective, leveraging the rich feature sets to achieve high accuracy and generalisation.
zh

[NLP-22] A Refined Analysis of Massive Activations in LLM s

【Quick Read】: This paper addresses the challenges that massive activations pose for low-precision training and quantization in large language models (LLMs). Existing analyses are limited in scope, their generalizability across architectures is unclear, and there is no systematic understanding of the effects of massive activations or of mitigation strategies. By analyzing a broad range of LLMs (including GLU-based and non-GLU-based architectures), the paper challenges several prior assumptions: not all massive activations are detrimental, and suppressing them does not necessarily cause an explosion in perplexity or a collapse in downstream task performance; moreover, some proposed mitigation strategies (such as the Attention KV bias) are model-specific and ineffective in certain cases. The key solution is a set of novel hybrid mitigation strategies: in particular, pairing Target Variance Rescaling (TVR) with Attention KV bias or Dynamic Tanh (DyT) successfully balances the mitigation of massive activations with preserved downstream model performance in the scenarios studied.

Link: https://arxiv.org/abs/2503.22329
Authors: Louis Owen,Nilabhra Roy Chowdhury,Abhay Kumar,Fabian Güra
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments:

Click to view the abstract

Abstract:Motivated in part by their relevance for low-precision training and quantization, massive activations in large language models (LLMs) have recently emerged as a topic of interest. However, existing analyses are limited in scope, and generalizability across architectures is unclear. This paper helps address some of these gaps by conducting an analysis of massive activations across a broad range of LLMs, including both GLU-based and non-GLU-based architectures. Our findings challenge several prior assumptions, most importantly: (1) not all massive activations are detrimental, i.e. suppressing them does not lead to an explosion of perplexity or a collapse in downstream task performance; (2) proposed mitigation strategies such as Attention KV bias are model-specific and ineffective in certain cases. We consequently investigate novel hybrid mitigation strategies; in particular pairing Target Variance Rescaling (TVR) with Attention KV bias or Dynamic Tanh (DyT) successfully balances the mitigation of massive activations with preserved downstream model performance in the scenarios we investigated. Our code is available at: this https URL.
zh
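As background for one of the mitigation components named above, here is a minimal sketch of a Dynamic Tanh (DyT) module: a learnable-scale tanh that bounds activations element-wise, which is why it can cap would-be massive activations. Shapes and initial values are illustrative, and how the paper combines DyT with TVR is not shown here.

```python
# Sketch of a Dynamic Tanh (DyT) layer (illustrative; follows the generic DyT form).
import torch
import torch.nn as nn

class DynamicTanh(nn.Module):
    def __init__(self, dim: int, alpha_init: float = 0.5):
        super().__init__()
        self.alpha = nn.Parameter(torch.tensor(alpha_init))   # shared input scale
        self.gamma = nn.Parameter(torch.ones(dim))             # per-channel gain
        self.beta = nn.Parameter(torch.zeros(dim))             # per-channel shift

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # tanh saturates extreme values, keeping activations in a bounded range.
        return self.gamma * torch.tanh(self.alpha * x) + self.beta
```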

[NLP-23] Preference-based Learning with Retrieval Augmented Generation for Conversational Question Answering WWW2025

【Quick Read】: This paper tackles the difficulty of learning the multiple subtasks of conversational question answering (ConvQA): understanding incomplete questions in their context, retrieving relevant evidence, and generating answers. Since labeled training data is unavailable for the individual subtasks, conventional methods are hard to apply directly. The proposed pipeline-based framework, PRAISE, trains LLM adapters for each subtask in a self-supervised manner: it uses the final answering performance as the feedback signal, learns from its own generations without human intervention, and treats intermediate information (such as relevant evidence) as weakly labeled data. Direct Preference Optimization is applied by contrasting successful and unsuccessful samples for each subtask to further improve the model. Experiments show that this training paradigm improves every subtask and achieves new state-of-the-art performance on a popular ConvQA benchmark, with a 15.5 percentage point gain in precision over baselines.

Link: https://arxiv.org/abs/2503.22303
Authors: Magdalena Kaiser,Gerhard Weikum
Affiliations: Max Planck Institute for Informatics; Saarland Informatics Campus
Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments: WWW 2025 Short Paper, 5 pages

Click to view the abstract

Abstract:Conversational Question Answering (ConvQA) involves multiple subtasks, i) to understand incomplete questions in their context, ii) to retrieve relevant information, and iii) to generate answers. This work presents PRAISE, a pipeline-based approach for ConvQA that trains LLM adapters for each of the three subtasks. As labeled training data for individual subtasks is unavailable in practice, PRAISE learns from its own generations using the final answering performance as feedback signal without human intervention and treats intermediate information, like relevant evidence, as weakly labeled data. We apply Direct Preference Optimization by contrasting successful and unsuccessful samples for each subtask. In our experiments, we show the effectiveness of this training paradigm: PRAISE shows improvements per subtask and achieves new state-of-the-art performance on a popular ConvQA benchmark, by gaining 15.5 percentage points increase in precision over baselines.
zh
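Since the training signal above comes from Direct Preference Optimization over successful versus unsuccessful samples, the standard DPO loss is worth spelling out. The sketch below uses the usual formulation; the value of beta and how sequence log-probabilities are collected for each subtask are assumptions for the example.

```python
# Sketch of the standard Direct Preference Optimization (DPO) loss (illustrative).
import torch
import torch.nn.functional as F

def dpo_loss(logp_win, logp_lose, ref_logp_win, ref_logp_lose, beta: float = 0.1):
    # Each argument: (batch,) sequence log-probabilities of the preferred ("win")
    # and dispreferred ("lose") outputs under the trained policy (logp_*) and
    # under the frozen reference model (ref_logp_*).
    margin = (logp_win - ref_logp_win) - (logp_lose - ref_logp_lose)
    return -F.logsigmoid(beta * margin).mean()

# Toy usage with made-up log-probabilities for one preference pair:
loss = dpo_loss(torch.tensor([-5.0]), torch.tensor([-7.0]),
                torch.tensor([-5.5]), torch.tensor([-6.5]))
```

In PRAISE's setting, "win" and "lose" samples would come from the framework's own generations, judged by whether they led to a correct final answer.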

[NLP-24] MultiClaimNet: A Massively Multilingual Dataset of Fact-Checked Claim Clusters

【Quick Read】: This paper addresses the redundancy of fact-checked claims repeated across platforms and languages, a setting in which existing methods struggle with the growing number of unverified claims and ever larger databases of fact-checked claims. The key solution is to cluster claims that discuss the same underlying facts, in order to improve claim retrieval and validation. To this end, the paper introduces MultiClaimNet, a collection of three multilingual claim-cluster datasets covering claims in 86 languages across diverse topics. Claim clusters are formed automatically from claim-matching pairs with limited manual intervention: for the largest dataset, candidate claim pairs are generated via approximate nearest-neighbour retrieval and claim similarity is annotated automatically with large language models. The resulting largest dataset contains 85.3K fact-checked claims written in 78 languages. Extensive experiments with various clustering techniques and sentence embedding models establish baseline performance, providing a solid foundation for scalable claim clustering and efficient fact-checking pipelines.

Link: https://arxiv.org/abs/2503.22280
Authors: Rrubaa Panchendrarajan,Rubén Míguez,Arkaitz Zubiaga
Affiliations: Queen Mary University of London; Newtral Media Audiovisual, Spain
Categories: Computation and Language (cs.CL)
Comments:

Click to view the abstract

Abstract:In the context of fact-checking, claims are often repeated across various platforms and in different languages, which can benefit from a process that reduces this redundancy. While retrieving previously fact-checked claims has been investigated as a solution, the growing number of unverified claims and expanding size of fact-checked databases calls for alternative, more efficient solutions. A promising solution is to group claims that discuss the same underlying facts into clusters to improve claim retrieval and validation. However, research on claim clustering is hindered by the lack of suitable datasets. To bridge this gap, we introduce MultiClaimNet, a collection of three multilingual claim cluster datasets containing claims in 86 languages across diverse topics. Claim clusters are formed automatically from claim-matching pairs with limited manual intervention. We leverage two existing claim-matching datasets to form the smaller datasets within MultiClaimNet. To build the larger dataset, we propose and validate an approach involving retrieval of approximate nearest neighbors to form candidate claim pairs and an automated annotation of claim similarity using large language models. This larger dataset contains 85.3K fact-checked claims written in 78 languages. We further conduct extensive experiments using various clustering techniques and sentence embedding models to establish baseline performance. Our datasets and findings provide a strong foundation for scalable claim clustering, contributing to efficient fact-checking pipelines.
zh
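A compact sketch of the cluster-construction recipe described above: embed the claims, retrieve approximate nearest neighbours as candidate pairs, ask an LLM whether each pair matches, and take connected components as clusters. The embedding model name and the `llm_same_claim` function are placeholders, not the paper's choices.

```python
# Sketch of ANN candidate pairing + LLM matching + connected-component clustering (illustrative).
import networkx as nx
from sentence_transformers import SentenceTransformer
from sklearn.neighbors import NearestNeighbors

def cluster_claims(claims: list[str], llm_same_claim, k: int = 5):
    # Multilingual sentence embeddings (model name is an assumption).
    emb = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2").encode(claims)
    nn = NearestNeighbors(n_neighbors=min(k + 1, len(claims))).fit(emb)
    _, idx = nn.kneighbors(emb)
    g = nx.Graph()
    g.add_nodes_from(range(len(claims)))
    for i, neighbours in enumerate(idx):
        for j in neighbours[1:]:                          # skip the self-match
            if llm_same_claim(claims[i], claims[int(j)]):  # LLM similarity judgment (stub)
                g.add_edge(i, int(j))
    return list(nx.connected_components(g))               # each component is one cluster
```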

[NLP-25] CFiCS: Graph-Based Classification of Common Factors and Microcounseling Skills

【Quick Read】: This paper addresses the challenging problem of automatically identifying common factors and microcounseling skills from psychotherapy text data. The nuanced, context-dependent nature of these change principles makes them hard to extract with traditional methods. The key is CFiCS, a hierarchical classification framework that integrates graph machine learning with pretrained contextual embeddings. CFiCS represents common factors, intervention concepts, and microcounseling skills as a heterogeneous graph and enriches each node with textual information from ClinicalBERT, capturing both hierarchical relationships and the semantic properties of therapeutic concepts. Using graph neural networks, the framework learns inductive node embeddings that generalize to unseen text samples. Experiments show that combining ClinicalBERT node features with the graph structure significantly improves classification performance, especially for fine-grained skill prediction, with substantial micro and macro F1 gains across all tasks over baselines including random forests, BERT-based multi-task models, and graph-based methods.

Link: https://arxiv.org/abs/2503.22277
Authors: Fabian Schmidt,Karin Hammerfald,Henrik Haaland Jahren,Vladimir Vlassov
Affiliations: Department of Computer Science, KTH Royal Institute of Technology, Stockholm, Sweden; Department of Psychology, University of Oslo, Oslo, Norway; Braive AS, Oslo, Norway
Categories: Computation and Language (cs.CL)
Comments: 10 pages, 3 figures, 2 tables

Click to view the abstract

Abstract:Common factors and microcounseling skills are critical to the effectiveness of psychotherapy. Understanding and measuring these elements provides valuable insights into therapeutic processes and outcomes. However, automatic identification of these change principles from textual data remains challenging due to the nuanced and context-dependent nature of therapeutic dialogue. This paper introduces CFiCS, a hierarchical classification framework integrating graph machine learning with pretrained contextual embeddings. We represent common factors, intervention concepts, and microcounseling skills as a heterogeneous graph, where textual information from ClinicalBERT enriches each node. This structure captures both the hierarchical relationships (e.g., skill-level nodes linking to broad factors) and the semantic properties of therapeutic concepts. By leveraging graph neural networks, CFiCS learns inductive node embeddings that generalize to unseen text samples lacking explicit connections. Our results demonstrate that integrating ClinicalBERT node features and graph structure significantly improves classification performance, especially in fine-grained skill prediction. CFiCS achieves substantial gains in both micro and macro F1 scores across all tasks compared to baselines, including random forests, BERT-based multi-task models, and graph-based methods.
zh

[NLP-26] Process Reward Modeling with Entropy-Driven Uncertainty

【Quick Read】: This paper addresses the high training cost of process supervision and its reliance on extensive fine-grained manual annotation. The proposed framework, the Entropy-Driven Unified Process Reward Model (EDU-PRM), introduces an entropy-guided dynamic step partitioning mechanism that uses the entropy of the logit distribution to dynamically pinpoint high-uncertainty regions during token generation, enabling precise step-level feedback. This self-assessment capability removes the need for manual fine-grained annotation, substantially reducing training cost while retaining near state-of-the-art performance. Experiments on the Qwen2.5-72B model with only 7,500 EDU-PRM-generated training queries show accuracy close to that of the full Qwen2.5-72B-PRM (71.1% vs. 71.6%), with a 98% reduction in query cost compared with prior methods.

Link: https://arxiv.org/abs/2503.22233
Authors: Lang Cao,Renhong Chen,Yingtian Zou,Chao Peng,Wu Ning,Huacong Xu,Qian Chen,Yuxian Wang,Peishuo Su,Mofan Peng,Zijie Chen,Yitong Li
Affiliations: Huawei Technologies Co., Ltd.
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Click to view the abstract

Abstract:This paper presents the Entropy-Driven Unified Process Reward Model (EDU-PRM), a novel framework that approximates state-of-the-art performance in process supervision while drastically reducing training costs. EDU-PRM introduces an entropy-guided dynamic step partitioning mechanism, using logit distribution entropy to pinpoint high-uncertainty regions during token generation dynamically. This self-assessment capability enables precise step-level feedback without manual fine-grained annotation, addressing a critical challenge in process supervision. Experiments on the Qwen2.5-72B model with only 7,500 EDU-PRM-generated training queries demonstrate accuracy closely approximating the full Qwen2.5-72B-PRM (71.1% vs. 71.6%), achieving a 98% reduction in query cost compared to prior methods. This work establishes EDU-PRM as an efficient approach for scalable process reward model training.
zh

[NLP-27] Learning to Instruct for Visual Instruction Tuning

【速读】: 该论文试图解决视觉指令微调(Visual Instruction Tuning, VIT)在多模态大语言模型(Multimodal Large Language Models, MLLMs)中因过度关注指令跟随能力而导致的过拟合和捷径学习问题,从而可能损害模型性能。论文的关键解决方案在于提出LIT(Learning with Instruction and Text),通过将损失函数融入指令和响应序列中,不仅扩展了训练数据,还有效正则化了模型以减少对语言先验的依赖。这种设计使得LIT在不引入额外训练数据的情况下,在综合多模态基准测试中实现了高达9%的相对性能提升,并显著提升了基础视觉能力,同时减少了多模态模型中的幻觉现象。
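为说明“把语言建模损失同时施加到指令与响应序列”与常规仅对响应计损失的差别,这里给出一个假设性的损失掩码示意(非论文源码),张量形状与掩码约定均为演示用假设。

```python
import torch
import torch.nn.functional as F

def lm_loss(logits, input_ids, loss_mask):
    """logits: [B, T, V];input_ids: [B, T];loss_mask: [B, T],为 1 的位置参与损失计算。"""
    shift_logits = logits[:, :-1, :]
    shift_labels = input_ids[:, 1:]
    shift_mask = loss_mask[:, 1:].float()
    loss = F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_labels.reshape(-1),
        reduction="none",
    )
    return (loss * shift_mask.reshape(-1)).sum() / shift_mask.sum().clamp(min=1)

B, T, V = 2, 8, 100
logits, input_ids = torch.randn(B, T, V), torch.randint(0, V, (B, T))
response_only = torch.tensor([[0, 0, 0, 0, 1, 1, 1, 1]] * B)   # 常规 VIT:仅响应 token 计损失
lit_mask = torch.ones(B, T, dtype=torch.long)                   # LIT:指令 + 响应均计损失
print(lm_loss(logits, input_ids, response_only), lm_loss(logits, input_ids, lit_mask))
```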

链接: https://arxiv.org/abs/2503.22215
作者: Zhihan Zhou,Feng Hong,Jiaan Luo,Jiangchao Yao,Dongsheng Li,Bo Han,Ya Zhang,Yanfeng Wang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 16 pages, 10 figures

点击查看摘要

Abstract:We propose LIT, an advancement of visual instruction tuning (VIT). While VIT equips Multimodal LLMs (MLLMs) with promising multimodal capabilities, the current design choices for VIT often result in overfitting and shortcut learning, potentially degrading performance. This gap arises from an overemphasis on instruction-following abilities, while neglecting the proactive understanding of visual information. Inspired by this, LIT adopts a simple yet effective approach by incorporating the loss function into both the instruction and response sequences. It seamlessly expands the training data, and regularizes the MLLMs from overly relying on language priors. Based on this merit, LIT achieves a significant relative improvement of up to 9% on comprehensive multimodal benchmarks, requiring no additional training data and incurring negligible computational overhead. Surprisingly, LIT attains exceptional fundamental visual capabilities, yielding up to an 18% improvement in captioning performance, while simultaneously alleviating hallucination in MLLMs.
zh

[NLP-28] EdgeInfinite: A Memory-Efficient Infinite-Context Transformer for Edge Devices

【速读】: 该论文旨在解决基于Transformer的大规模语言模型(LLMs)在边缘设备上处理长序列时面临的挑战,主要包括注意力机制的二次复杂性以及Key-Value (KV)缓存带来的日益增长的内存需求。现有KV缓存优化方法在长输出任务中难以应对不可逆的标记驱逐问题,而替代的序列建模架构在现有的Transformer基础设施中采用成本高昂。为了解决这些问题,论文提出了一种名为EdgeInfinite的记忆高效解决方案,用于无限上下文处理。其关键在于通过可训练的记忆门控模块将压缩内存集成到基于Transformer的LLMs中,这种方法保持了与标准Transformer架构的完全兼容性,仅需微调少量参数,并通过记忆门控模块的选择性激活实现长上下文和短上下文任务路由。实验结果表明,EdgeInfinite在长上下文基准测试中实现了与基线Transformer-based LLM相当的性能,同时优化了内存消耗与首个令牌生成时间(time to first token)。
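记忆门控模块的一种可能形态可以用下面的草图表示(纯属示意,并非论文实现):以可训练门控在压缩记忆读取结果与标准注意力输出之间加权融合,隐藏维度与张量形状均为假设值。

```python
import torch
import torch.nn as nn

class MemoryGate(nn.Module):
    """示意:在压缩记忆输出与标准注意力输出之间做可训练的门控融合。"""
    def __init__(self, hidden_size: int):
        super().__init__()
        self.gate_proj = nn.Linear(hidden_size, hidden_size)

    def forward(self, attn_out: torch.Tensor, memory_out: torch.Tensor) -> torch.Tensor:
        # attn_out / memory_out: [B, T, H]
        gate = torch.sigmoid(self.gate_proj(attn_out))
        return gate * memory_out + (1.0 - gate) * attn_out

gate = MemoryGate(hidden_size=64)
print(gate(torch.randn(1, 10, 64), torch.randn(1, 10, 64)).shape)   # -> torch.Size([1, 10, 64])
```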

链接: https://arxiv.org/abs/2503.22196
作者: Jiyu Chen,Shuang Peng,Daxiong Luo,Fan Yang,Renshou Wu,Fangyuan Li,Xiaoxin Chen
机构: vivo AI Lab (维沃移动通信(东莞)有限公司人工智能实验室); Zhejiang University (浙江大学)
类目: Computation and Language (cs.CL)
备注: 8 pages, 3 figures

点击查看摘要

Abstract:Transformer-based large language models (LLMs) encounter challenges in processing long sequences on edge devices due to the quadratic complexity of attention mechanisms and growing memory demands from Key-Value (KV) cache. Existing KV cache optimizations struggle with irreversible token eviction in long-output tasks, while alternative sequence modeling architectures prove costly to adopt within established Transformer infrastructure. We present EdgeInfinite, a memory-efficient solution for infinite contexts that integrates compressed memory into Transformer-based LLMs through a trainable memory-gating module. This approach maintains full compatibility with standard Transformer architectures, requiring fine-tuning only a small part of parameters, and enables selective activation of the memory-gating module for long and short context task routing. The experimental result shows that EdgeInfinite achieves comparable performance to baseline Transformer-based LLM on long context benchmarks while optimizing memory consumption and time to first token.
zh

[NLP-29] okenization of Gaze Data

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)和多模态大型语言模型(Multimodal Large Language Models, MLLMs)在处理注视数据(gaze data)时缺乏有效分词策略的问题。论文的关键在于分析并比较五种不同的注视数据分词器,在三个不同数据集上的表现,评估其在通过LLMs进行注视数据预测和生成任务中的性能。研究重点包括分词器的重建与压缩能力,并进一步训练基于每种分词策略的LLM以衡量其生成和预测性能。最终发现,分位数分词器(quantile tokenizer)在预测注视位置方面表现最优,而k-means分词器在预测注视速度时效果最佳。因此,该工作的关键是提出一种有效的注视数据分词策略,以充分利用预训练MLLMs的视觉能力。
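文中表现最好的分位数分词器,其基本做法可用如下 NumPy 草图说明(演示实现,非论文代码):按训练数据的分位数确定 bin 边界,把连续注视坐标离散成 token,再用 bin 中心近似重建;词表大小与数据分布均为假设。

```python
import numpy as np

def fit_quantile_tokenizer(values: np.ndarray, vocab_size: int = 64) -> np.ndarray:
    """按经验分位数估计 bin 边界,使每个 token 覆盖大致相同数量的样本。"""
    quantiles = np.linspace(0.0, 1.0, vocab_size + 1)[1:-1]
    return np.quantile(values, quantiles)

def tokenize(values: np.ndarray, edges: np.ndarray) -> np.ndarray:
    return np.digitize(values, edges)

def detokenize(tokens: np.ndarray, edges: np.ndarray) -> np.ndarray:
    centers = np.concatenate([[edges[0]], (edges[:-1] + edges[1:]) / 2, [edges[-1]]])
    return centers[tokens]

gaze_x = np.random.randn(10000) * 5 + 960          # 假设的注视 x 坐标(像素)
edges = fit_quantile_tokenizer(gaze_x)
tokens = tokenize(gaze_x, edges)
recon = detokenize(tokens, edges)
print(tokens[:5], float(np.abs(recon - gaze_x).mean()))   # 查看重建误差
```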

链接: https://arxiv.org/abs/2503.22145
作者: Tim Rolff,Jurik Karimian,Niklas Hypki,Susanne Schmidt,Markus Lappe,Frank Steinicke
机构: University of Hamburg (汉堡大学), Hamburg, Germany; University of Münster (明斯特大学), Münster, Germany; Canterbury University (坎特伯雷大学), Christchurch, New Zealand
类目: Machine Learning (cs.LG); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:A considerable part of the performance of today’s large language models (LLM’s) and multimodal large language models (MLLM’s) depends on their tokenization strategies. While tokenizers are extensively researched for textual and visual input, there is no research on tokenization strategies for gaze data due to its nature. However, a corresponding tokenization strategy would allow using the vision capabilities of pre-trained MLLM’s for gaze data, for example, through fine-tuning. In this paper, we aim to close this research gap by analyzing five different tokenizers for gaze data on three different datasets for the forecasting and generation of gaze data through LLMs. We evaluate the tokenizers regarding their reconstruction and compression abilities. Further, we train an LLM for each tokenization strategy, measuring its generative and predictive performance. Overall, we found that a quantile tokenizer outperforms all others in predicting the gaze positions and k-means is best when predicting gaze velocities.
zh

[NLP-30] FRASE: Structured Representations for Generalizable SPARQL Query Generation

【速读】: 该论文旨在解决现有自然语言问句到SPARQL查询翻译任务的数据集多为模板驱动的问题,导致模型仅学会浅层的问句与查询模板映射,而缺乏真正的泛化能力。当面对无模板的自然问句时,模型表现不佳。为解决此问题,论文提出了一种名为FRASE(基于框架语义增强)的新方法,其关键是利用框架语义角色标注(Frame Semantic Role Labeling, FSRL)技术,通过检测框架并将其元素映射到查询参数,从而丰富数据集(构建了LC-QuAD 3.0)。实验表明,这种基于框架的结构化表示显著提升了SPARQL生成性能,特别是在处理未见过模板和完全自然问句的泛化场景时。

链接: https://arxiv.org/abs/2503.22144
作者: Papa Abdou Karim Karou Diallo,Amal Zouaq
机构: LAMA-WeST(未翻译); Polytechnique Montreal(蒙特利尔理工学院); Mila
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Translating natural language questions into SPARQL queries enables Knowledge Base querying for factual and up-to-date responses. However, existing datasets for this task are predominantly template-based, leading models to learn superficial mappings between question and query templates rather than developing true generalization capabilities. As a result, models struggle when encountering naturally phrased, template-free questions. This paper introduces FRASE (FRAme-based Semantic Enhancement), a novel approach that leverages Frame Semantic Role Labeling (FSRL) to address this limitation. We also present LC-QuAD 3.0, a new dataset derived from LC-QuAD 2.0, in which each question is enriched using FRASE through frame detection and the mapping of frame-elements to their argument. We evaluate the impact of this approach through extensive experiments on recent large language models (LLMs) under different fine-tuning configurations. Our results demonstrate that integrating frame-based structured representations consistently improves SPARQL generation performance, particularly in challenging generalization scenarios when test questions feature unseen templates (unknown template splits) and when they are all naturally phrased (reformulated questions).
zh

[NLP-31] REMAC: Self-Reflective and Self-Evolving Multi-Agent Collaboration for Long-Horizon Robot Manipulation

【速读】: 本文旨在解决机器人在长时序任务规划中的适应性(adaptability)和效率(efficiency)问题,特别是在动态场景变化或意外任务条件下的表现不足。传统方法通常依赖于先验环境知识或精心设计的任务特定提示,这限制了其灵活性和应对复杂情况的能力。为了解决这些问题,论文提出了一种名为REMAC的自适应多智能体规划框架。REMAC的关键在于其包含两个核心模块:自我反思模块(self-reflection module),用于循环执行前提条件和后置条件检查以评估进展并优化计划;以及自我演化模块(self-evolvement module),通过场景特定推理动态调整计划。这些模块使机器人能够在无需复杂提示设计的情况下探索和推理环境,并在迭代过程中协调其他机器人以提高任务执行效率。

链接: https://arxiv.org/abs/2503.22122
作者: Puzhen Yuan,Angyuan Ma,Yunchao Yao,Huaxiu Yao,Masayoshi Tomizuka,Mingyu Ding
机构: Xingjian College, Tsinghua University (清华大学致理书院); Robotics Institute, Carnegie Mellon University (卡内基梅隆大学机器人研究所); Department of Computer Science, UNC-Chapel Hill (北卡罗来纳大学教堂山分校计算机科学系); Department of Mechanical Engineering, UC Berkeley (加州大学伯克利分校机械工程系)
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Vision-language models (VLMs) have demonstrated remarkable capabilities in robotic planning, particularly for long-horizon tasks that require a holistic understanding of the environment for task decomposition. Existing methods typically rely on prior environmental knowledge or carefully designed task-specific prompts, making them struggle with dynamic scene changes or unexpected task conditions, e.g., a robot attempting to put a carrot in the microwave but finds the door was closed. Such challenges underscore two critical issues: adaptability and efficiency. To address them, in this work, we propose an adaptive multi-agent planning framework, termed REMAC, that enables efficient, scene-agnostic multi-robot long-horizon task planning and execution through continuous reflection and self-evolution. REMAC incorporates two key modules: a self-reflection module performing pre-condition and post-condition checks in the loop to evaluate progress and refine plans, and a self-evolvement module dynamically adapting plans based on scene-specific reasoning. It offers several appealing benefits: 1) Robots can initially explore and reason about the environment without complex prompt design. 2) Robots can keep reflecting on potential planning errors and adapting the plan based on task-specific insights. 3) After iterations, a robot can call another one to coordinate tasks in parallel, maximizing the task execution efficiency. To validate REMAC’s effectiveness, we build a multi-agent environment for long-horizon robot manipulation and navigation based on RoboCasa, featuring 4 task categories with 27 task styles and 50+ different objects. Based on it, we further benchmark state-of-the-art reasoning models, including DeepSeek-R1, o3-mini, QwQ, and Grok3, demonstrating REMAC’s superiority by boosting average success rates by 40% and execution efficiency by 52.7% over the single robot baseline.
zh

[NLP-32] Beyond Single-Sentence Prompts: Upgrading Value Alignment Benchmarks with Dialogues and Stories

【速读】: 该论文试图解决传统单句对抗性提示在评估大型语言模型(Large Language Models, LLMs)价值对齐方面的局限性问题。随着AI安全技术的进步,现有方法难以有效揭示模型潜在的偏见和伦理立场,因为模型能够规避这些简单的测试。为了解决这一问题,论文提出了一种升级版的价值对齐基准测试,关键在于引入多轮对话和基于叙事的情景设计,超越传统的单句提示方式。这种方法通过增强评估的隐蔽性和对抗性,提高了对现代LLMs表面防护措施的鲁棒性,并通过构建包含会话陷阱和伦理模糊故事的数据集,系统性地评估模型在更复杂且语境丰富的场景中的表现。实验结果表明,这种改进的方法能够有效揭示传统一次性评估中未被发现的潜在偏见,强调了对LLMs进行情境化和动态测试的必要性,从而推动更复杂和现实的AI伦理与安全评估的发展。

链接: https://arxiv.org/abs/2503.22115
作者: Yazhou Zhang,Qimeng Liu,Qiuchi Li,Peng Zhang,Jing Qin
机构: College of Intelligence and Computing, Tianjin University (天津大学); Software Engineering College, Zhengzhou University of Light Industry (郑州轻工业大学); Department of Computer Science, Copenhagen University (哥本哈根大学); College of Computing and Intelligence, Tianjin University (天津大学); Department of Computational Neuroscience, The Hong Kong Polytechnic University (香港理工大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注:

点击查看摘要

Abstract:Evaluating the value alignment of large language models (LLMs) has traditionally relied on single-sentence adversarial prompts, which directly probe models with ethically sensitive or controversial questions. However, with the rapid advancements in AI safety techniques, models have become increasingly adept at circumventing these straightforward tests, limiting their effectiveness in revealing underlying biases and ethical stances. To address this limitation, we propose an upgraded value alignment benchmark that moves beyond single-sentence prompts by incorporating multi-turn dialogues and narrative-based scenarios. This approach enhances the stealth and adversarial nature of the evaluation, making it more robust against superficial safeguards implemented in modern LLMs. We design and implement a dataset that includes conversational traps and ethically ambiguous storytelling, systematically assessing LLMs’ responses in more nuanced and context-rich settings. Experimental results demonstrate that this enhanced methodology can effectively expose latent biases that remain undetected in traditional single-shot evaluations. Our findings highlight the necessity of contextual and dynamic testing for value alignment in LLMs, paving the way for more sophisticated and realistic assessments of AI ethics and safety.
zh

[NLP-33] Few-Shot Graph Out-of-Distribution Detection with LLM s

【速读】: 该论文旨在解决文本属性图(text-attributed graphs, TAGs)中的分布外(out-of-distribution, OOD)检测问题,其中高质量标注节点的获取既困难又昂贵。现有方法通常依赖大量标记的 in-distribution (ID) 数据训练图神经网络 (GNN) 分类器,而大型语言模型 (LLMs) 虽然在文本任务中具有强大的零样本能力,但难以自然捕捉 TAG 的关键结构信息。论文的关键解决方案在于提出了一种名为 LLM-GOOD 的通用框架,它有效结合了 LLM 和 GNN 的优势以提高数据效率。具体而言,首先利用 LLM 的零样本能力筛选出可能的 OOD 节点,大幅减轻人工标注负担;通过仅对少量未标注节点进行标注来最小化 LLM 的使用成本;然后训练一个轻量级 GNN 过滤器,利用这些噪声标签高效预测其他未标注节点的 ID 状态;接着基于信息性方法选择有价值的节点进行精确的人工标注,并最终使用这些准确标注的 ID 节点训练目标 ID 分类器。实验结果表明,该方法显著降低了人工标注成本,并在 ID 分类准确性和 OOD 检测性能上优于最先进的基线方法。

链接: https://arxiv.org/abs/2503.22097
作者: Haoyan Xu,Zhengtao Yao,Yushun Dong,Ziyi Wang,Ryan A. Rossi,Mengyuan Li,Yue Zhao
机构: 未知
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Existing methods for graph out-of-distribution (OOD) detection typically depend on training graph neural network (GNN) classifiers using a substantial amount of labeled in-distribution (ID) data. However, acquiring high-quality labeled nodes in text-attributed graphs (TAGs) is challenging and costly due to their complex textual and structural characteristics. Large language models (LLMs), known for their powerful zero-shot capabilities in textual tasks, show promise but struggle to naturally capture the critical structural information inherent to TAGs, limiting their direct effectiveness. To address these challenges, we propose LLM-GOOD, a general framework that effectively combines the strengths of LLMs and GNNs to enhance data efficiency in graph OOD detection. Specifically, we first leverage LLMs’ strong zero-shot capabilities to filter out likely OOD nodes, significantly reducing the human annotation burden. To minimize the usage and cost of the LLM, we employ it only to annotate a small subset of unlabeled nodes. We then train a lightweight GNN filter using these noisy labels, enabling efficient predictions of ID status for all other unlabeled nodes by leveraging both textual and structural information. After obtaining node embeddings from the GNN filter, we can apply informativeness-based methods to select the most valuable nodes for precise human annotation. Finally, we train the target ID classifier using these accurately annotated ID nodes. Extensive experiments on four real-world TAG datasets demonstrate that LLM-GOOD significantly reduces human annotation costs and outperforms state-of-the-art baselines in terms of both ID classification accuracy and OOD detection performance.
zh

[NLP-34] Leverag ing LLM s for Predicting Unknown Diagnoses from Clinical Notes

【速读】: 该论文试图解决电子健康记录(EHRs)中缺乏显式链接药物与诊断信息的问题,这使得临床决策和研究变得困难。即使存在某些链接,诊断列表也可能不完整,尤其是在患者早期就诊时。论文通过探索大型语言模型(LLMs)能否从临床笔记中预测隐含提及的诊断并将它们链接到相应的药物来解决这一问题。

解决方案的关键在于采用集成学习方法——多数投票(majority voting),以不同的LLM配置进行预测。研究发现,这种多数投票方法在诊断预测上的准确率达到75%,显著优于最佳单一配置的66%。此外,结果表明没有单一的超参数设置始终最优,但结合确定性、平衡性和探索性策略可以提升性能。因此,基于多种LLM配置的集成多数投票方法能够有效改善EHRs中的诊断预测,并为临床文本中建立药物与诊断之间的联系提供了一种有前景的方法。

链接: https://arxiv.org/abs/2503.22092
作者: Dina Albassam,Adam Cross,Chengxiang Zhai
机构: 未知
类目: Computation and Language (cs.CL)
备注: 19 pages, 3 figures, 5 tables

点击查看摘要

Abstract:Electronic Health Records (EHRs) often lack explicit links between medications and diagnoses, making clinical decision-making and research more difficult. Even when links exist, diagnosis lists may be incomplete, especially during early patient visits. Discharge summaries tend to provide more complete information, which can help infer accurate diagnoses, especially with the help of large language models (LLMs). This study investigates whether LLMs can predict implicitly mentioned diagnoses from clinical notes and link them to corresponding medications. We address two research questions: (1) Does majority voting across diverse LLM configurations outperform the best single configuration in diagnosis prediction? (2) How sensitive is majority voting accuracy to LLM hyperparameters such as temperature, top-p, and summary length? To evaluate, we created a new dataset of 240 expert-annotated medication-diagnosis pairs from 20 MIMIC-IV notes. Using GPT-3.5 Turbo, we ran 18 prompting configurations across short and long summary lengths, generating 8568 test cases. Results show that majority voting achieved 75 percent accuracy, outperforming the best single configuration at 66 percent. No single hyperparameter setting dominated, but combining deterministic, balanced, and exploratory strategies improved performance. Shorter summaries generally led to higher accuracy. In conclusion, ensemble-style majority voting with diverse LLM configurations improves diagnosis prediction in EHRs and offers a promising method to link medications and diagnoses in clinical texts.
zh

[NLP-35] Penrose Tiled Low-Rank Compression and Section-Wise QA Fine-Tuning: A General Framework for Domain-Specific Large Language Model Adaptation

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在材料科学等高知识密度领域中适应特定领域知识效率低且精度不足的问题,主要挑战源于数据稀缺与知识密集。论文提出了一种两阶段框架,结合结构化模型压缩与领域特定微调方案来应对这一挑战。关键在于压缩阶段通过将LLM的权重矩阵分解为局部低秩“秩块”并采用类似彭罗斯非周期镶嵌的布局方式,利用谱变换(如离散余弦或傅里叶变换)压缩每个块,并借助基于Kullback-Leibler (KL) 散度的对齐损失保持压缩模型与原始全模型表征分布的相似性;而在适应阶段,则通过人类式科学阅读协议进行分段式问答(QA)微调,逐步提取显式推理痕迹并注入领域知识,同时减少对通用语言能力的灾难性遗忘,从而实现高效压缩与精准适配的平衡。
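下面的草图只演示该框架中的两个组成部分——对单个权重块做截断 SVD 得到局部低秩近似,以及用 KL 散度约束压缩模型输出分布贴近原始模型;非周期铺排与谱变换(DCT/FFT)在此省略,矩阵尺寸与秩均为假设值(非论文实现)。

```python
import torch
import torch.nn.functional as F

def low_rank_block(weight: torch.Tensor, rank: int) -> torch.Tensor:
    """对一个权重块做截断 SVD,得到"秩块"的最简化形式。"""
    U, S, Vh = torch.linalg.svd(weight, full_matrices=False)
    return U[:, :rank] @ torch.diag(S[:rank]) @ Vh[:rank, :]

def kl_alignment_loss(student_logits: torch.Tensor, teacher_logits: torch.Tensor) -> torch.Tensor:
    """KL 对齐损失:约束压缩模型的输出分布接近原始全模型。"""
    return F.kl_div(
        F.log_softmax(student_logits, dim=-1),
        F.softmax(teacher_logits, dim=-1),
        reduction="batchmean",
    )

W = torch.randn(256, 256)
W_compressed = low_rank_block(W, rank=32)
x = torch.randn(4, 256)
print(kl_alignment_loss(x @ W_compressed.T, x @ W.T))
```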

链接: https://arxiv.org/abs/2503.22074
作者: Chuan-Wei Kuo,Siyu Chen,Chenqi Yan,Yu Yang Fredrik Liu
机构: DeepVerse (深维科技); TCM Group, Cavendish Laboratory, University of Cambridge (卡文迪许实验室, 剑桥大学 TCM 集团); School of Electronic Information and Electrical Engineering, Shanghai Jiao Tong University (上海交通大学电子信息技术与电气工程学院)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large language models (LLMs) hold great promise for specialized scientific domains such as materials science, yet adapting them efficiently and accurately to domain-specific knowledge remains challenging due to limited data and high knowledge density. We propose a two-stage framework that combines structured model compression with a scientific fine-tuning regimen to address this challenge. In the compression stage, we decompose the LLM’s weight matrices into local low-rank “rank blocks” and arrange these blocks in a Penrose-like non-periodic tiling pattern. Each block is then compacted via spectral transformations (e.g., discrete cosine or Fourier transforms), and a Kullback-Leibler (KL) divergence-based alignment loss preserves the distributional similarity between the compressed model’s representations and those of the original full model. In the adaptation stage, the compressed model is further tuned using a human-like scientific reading protocol: it processes technical materials science documents section by section, engaging in a structured question-and-answer routine for each section. This section-wise QA fine-tuning strategy extracts explicit reasoning traces and gradually injects domain knowledge, while minimizing catastrophic forgetting of the model’s general language capabilities. By balancing efficient compression with targeted adaptation, our two-stage approach enables precise specialization of LLMs to high-value domains under data-scarce conditions. We present this principled yet exploratory pipeline and outline its potential for advancing materials science knowledge integration, laying the groundwork for comprehensive empirical evaluation in future work.
zh

[NLP-36] Non-Monotonic Attention-based Read/Write Policy Learning for Simultaneous Translation

【速读】: 该论文致力于解决同时性或流式机器翻译中的质量/延迟权衡问题,目标是在保证翻译质量接近非流式模型的同时,尽可能减少延迟。论文的关键解决方案在于通过增强一个预训练的非流式 seq2seq 模型,利用源端和目标端 token 的对齐信息,将其转换为流式模型,并学习一个读取/生成的决策边界以在最小输入条件下实现可靠翻译。这一决策边界由一个小型二元分类模块(read/write policy module)控制,在推理过程中调节质量/延迟权衡。实验结果表明,该方法优于多个强基准模型,并缩小了与非流式基线模型之间的性能差距。

链接: https://arxiv.org/abs/2503.22051
作者: Zeeshan Ahmed,Frank Seide,Zhe Liu,Rastislav Rabatin,Jachym Kolar,Niko Moritz,Ruiming Xie,Simone Merello,Christian Fuegen
机构: Meta AI (Meta)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Simultaneous or streaming machine translation generates translation while reading the input stream. These systems face a quality/latency trade-off, aiming to achieve high translation quality similar to non-streaming models with minimal latency. We propose an approach that efficiently manages this trade-off. By enhancing a pretrained non-streaming model, which was trained with a seq2seq mechanism and represents the upper bound in quality, we convert it into a streaming model by utilizing the alignment between source and target tokens. This alignment is used to learn a read/write decision boundary for reliable translation generation with minimal input. During training, the model learns the decision boundary through a read/write policy module, employing supervised learning on the alignment points (pseudo labels). The read/write policy module, a small binary classification unit, can control the quality/latency trade-off during inference. Experimental results show that our model outperforms several strong baselines and narrows the gap with the non-streaming baseline model.
zh

[NLP-37] hinkEdit: Interpretable Weight Editing to Mitigate Overly Short Thinking in Reasoning Models

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在链式思维(chain-of-thought, CoT)推理增强后,偶尔产生过短推理(overly short reasoning)的问题,这种现象导致其在简单数学问题上的性能下降。论文的关键发现是,推理长度由表示空间中的一个线性方向控制,并通过调整模型沿此方向的行为可以诱导过短推理。基于这一洞察,作者提出了ThinkEdit方法,这是一种通过微调少量权重(约模型参数的0.1%)来缓解过短推理问题的简单而有效的策略。具体而言,ThinkEdit首先识别出约2%的注意力头主要驱动短推理行为,然后编辑这些注意力头的输出投影权重以抑制短推理方向,从而显著提升了短推理输出的准确性(+5.44%)以及多个数学基准测试的整体表现(+2.43%)。
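“沿表示空间中的某个方向编辑输出投影权重以抑制短推理”,在数学上可以理解为把该方向的分量从权重里投影掉,即 W' = (I - dd^T)W。下面给出一个示意(方向向量与矩阵尺寸均为随机假设,非论文源码):

```python
import torch

def suppress_direction(w_o: torch.Tensor, direction: torch.Tensor) -> torch.Tensor:
    """w_o: [hidden, head_dim] 注意力头的输出投影;direction: [hidden] "短推理"方向。"""
    d = direction / direction.norm()
    return w_o - torch.outer(d, d @ w_o)       # W' = (I - dd^T) W

hidden, head_dim = 512, 64
w_o = torch.randn(hidden, head_dim)
d = torch.randn(hidden)
w_edit = suppress_direction(w_o, d)
# 验证:编辑后的权重在该方向上的输出分量为零
print(torch.allclose((d / d.norm()) @ w_edit, torch.zeros(head_dim), atol=1e-5))
```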

链接: https://arxiv.org/abs/2503.22048
作者: Chung-En Sun,Ge Yan,Tsui-Wei Weng
机构: UCSD CSE (加州大学圣地亚哥分校计算机科学与工程系); UCSD HDSI (加州大学圣地亚哥分校高维数据分析研究所)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Recent studies have shown that Large Language Models (LLMs) augmented with chain-of-thought (CoT) reasoning demonstrate impressive problem-solving abilities. However, in this work, we identify a recurring issue where these models occasionally generate overly short reasoning, leading to degraded performance on even simple mathematical problems. Specifically, we investigate how reasoning length is embedded in the hidden representations of reasoning models and its impact on accuracy. Our analysis reveals that reasoning length is governed by a linear direction in the representation space, allowing us to induce overly short reasoning by steering the model along this direction. Building on this insight, we introduce ThinkEdit, a simple yet effective weight-editing approach to mitigate the issue of overly short reasoning. We first identify a small subset of attention heads (approximately 2%) that predominantly drive short reasoning behavior. We then edit the output projection weights of these heads to suppress the short reasoning direction. With changes to only 0.1% of the model’s parameters, ThinkEdit effectively reduces overly short reasoning and yields notable accuracy gains for short reasoning outputs (+5.44%), along with an overall improvement across multiple math benchmarks (+2.43%). Our findings provide new mechanistic insights into how reasoning length is controlled within LLMs and highlight the potential of fine-grained model interventions to improve reasoning quality. Our code is available at this https URL
zh

[NLP-38] he Risks of Using Large Language Models for Text Annotation in Social Science Research

【速读】: 该论文旨在评估大型语言模型(Large Language Models, LLMs)在社会科学研究中文本编码任务中的潜力与风险,并以社会运动研究为例,提出了一种框架,帮助社会科学家将LLMs整合到文本注释工作中,既可以作为主要编码决策者,也可以作为编码助手。关键在于开发有效的提示策略(optimal prompts),同时提供工具以检验和报告LLMs作为方法学工具的有效性与可靠性,并讨论由此产生的效度、信度、可重复性和透明度等认识论风险。论文最后给出了在文本注释任务中使用LLMs的实际指导原则以及如何更好地沟通这些认识论风险的建议。

链接: https://arxiv.org/abs/2503.22040
作者: Hao Lin,Yongjun Zhang
机构: 未知
类目: Computation and Language (cs.CL); Computers and Society (cs.CY)
备注:

点击查看摘要

Abstract:Generative artificial intelligence (GenAI) or large language models (LLMs) have the potential to revolutionize computational social science, particularly in automated textual analysis. In this paper, we conduct a systematic evaluation of the promises and risks of using LLMs for diverse coding tasks, with social movement studies serving as a case example. We propose a framework for social scientists to incorporate LLMs into text annotation, either as the primary coding decision-maker or as a coding assistant. This framework provides tools for researchers to develop the optimal prompt, and to examine and report the validity and reliability of LLMs as a methodological tool. Additionally, we discuss the associated epistemic risks related to validity, reliability, replicability, and transparency. We conclude with several practical guidelines for using LLMs in text annotation tasks, and how we can better communicate the epistemic risks in research.
zh

[NLP-39] Debate-Driven Multi-Agent LLM s for Phishing Email Detection

【速读】: 该论文试图解决钓鱼邮件检测中传统方法存在的局限性问题,包括基于规则系统易被轻微修改绕过以及监督机器学习模型需要大规模标注数据且仍存在误报和漏报的问题。论文的关键解决方案在于提出了一种多智能体大语言模型(Large Language Model, LLM)提示技术,通过模拟智能体之间的辩论机制来判断邮件内容是否属于钓鱼行为。该方案的核心在于使用两个LLM智能体分别提供支持或反对分类任务的论点,并由裁判智能体根据推理质量做出最终判决。这种辩论结构使模型能够批判性地分析文本中的上下文线索和欺骗模式,从而提高分类准确性,且实验表明混合智能体配置优于同质化配置,甚至无需额外的提示策略即可实现准确决策。

链接: https://arxiv.org/abs/2503.22038
作者: Ngoc Tuong Vy Nguyen,Felix D Childress,Yunting Yin
机构: Earlham College (埃尔赫姆学院)
类目: Multiagent Systems (cs.MA); Computation and Language (cs.CL)
备注: Accepted to the 13th International Symposium on Digital Forensics and Security (ISDFS 2025)

点击查看摘要

Abstract:Phishing attacks remain a critical cybersecurity threat. Attackers constantly refine their methods, making phishing emails harder to detect. Traditional detection methods, including rule-based systems and supervised machine learning models, either rely on predefined patterns like blacklists, which can be bypassed with slight modifications, or require large datasets for training and still can generate false positives and false negatives. In this work, we propose a multi-agent large language model (LLM) prompting technique that simulates debates among agents to detect whether the content presented on an email is phishing. Our approach uses two LLM agents to present arguments for or against the classification task, with a judge agent adjudicating the final verdict based on the quality of reasoning provided. This debate mechanism enables the models to critically analyze contextual cue and deceptive patterns in text, which leads to improved classification accuracy. The proposed framework is evaluated on multiple phishing email datasets and demonstrate that mixed-agent configurations consistently outperform homogeneous configurations. Results also show that the debate structure itself is sufficient to yield accurate decisions without extra prompting strategies.
zh

[NLP-40] Cognitive Prompts Using Guilfords Structure of Intellect Model

【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在结构化推理方面的局限性,这些问题导致模型在解决问题时表现出不一致或次优的表现。论文的关键解决方案是基于吉尔福德的智力结构(Structure of Intellect, SOI)模型,将其作为认知提示工程的基础框架。SOI 模型通过分类认知操作(如模式识别、记忆检索和评估)提供了一种系统化的方法来增强 LLM 的推理和决策能力。论文提出了一种新颖的认知提示方法,旨在通过 SOI 启发的推理机制提升模型响应的清晰度、连贯性和适应性。

链接: https://arxiv.org/abs/2503.22036
作者: Oliver Kramer
机构: CI Labs, University of Oldenburg (CI 实验室, 奥尔登堡大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large language models (LLMs) demonstrate strong language generation capabilities but often struggle with structured reasoning, leading to inconsistent or suboptimal problem-solving. To mitigate this limitation, Guilford’s Structure of Intellect (SOI) model - a foundational framework from intelligence theory - is leveraged as the basis for cognitive prompt engineering. The SOI model categorizes cognitive operations such as pattern recognition, memory retrieval, and evaluation, offering a systematic approach to enhancing LLM reasoning and decision-making. This position paper presents a novel cognitive prompting approach for enforcing SOI-inspired reasoning for improving clarity, coherence, and adaptability in model responses.
zh

[NLP-41] Enhancing Domain-Specific Encoder Models with LLM -Generated Data: How to Leverag e Ontologies and How to Do Without Them

【速读】: 该论文旨在解决在特定领域(如入侵生物学)中由于训练数据有限而导致编码器模型持续预训练效果不佳的问题。为了解决这一挑战,论文提出了一种利用大型语言模型(LLM)生成的数据来丰富领域特定本体论的方法,并将编码器模型预训练为一种基于本体的概念定义嵌入模型。关键在于通过结合本体概念与LLM生成的数据,显著提升了模型性能,同时证明了即使在缺乏全面本体论的情况下,通过从少量科学摘要中自动提取概念并利用分布统计建立概念间关系,也能实现与传统大规模掩码语言模型预训练相当的效果。这种方法特别适用于资源受限环境下的领域专用编码器模型增强。

链接: https://arxiv.org/abs/2503.22006
作者: Marc Brinner,Tarek Al Mustafa,Sina Zarrieß
机构: Bielefeld University (比勒费尔德大学), Germany; Friedrich Schiller University Jena (耶拿弗里德里希·席勒大学), Germany
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:We investigate the use of LLM-generated data for continual pretraining of encoder models in specialized domains with limited training data, using the scientific domain of invasion biology as a case study. To this end, we leverage domain-specific ontologies by enriching them with LLM-generated data and pretraining the encoder model as an ontology-informed embedding model for concept definitions. To evaluate the effectiveness of this method, we compile a benchmark specifically designed for assessing model performance in invasion biology. After demonstrating substantial improvements over standard LLM pretraining, we investigate the feasibility of applying the proposed approach to domains without comprehensive ontologies by substituting ontological concepts with concepts automatically extracted from a small corpus of scientific abstracts and establishing relationships between concepts through distributional statistics. Our results demonstrate that this automated approach achieves comparable performance using only a small set of scientific abstracts, resulting in a fully automated pipeline for enhancing domain-specific understanding of small encoder models that is especially suited for application in low-resource settings and achieves performance comparable to masked language modeling pretraining on much larger datasets.
zh

[NLP-42] Monte Carlo Sampling for Analyzing In-Context Examples NAACL2025

【速读】: 该论文旨在解决上下文学习(in-context learning)中因示例数量、顺序及选择方式等呈现因素之间的交互作用导致的指导建议缺乏普适性的问题。论文的关键在于开发了一种基于蒙特卡洛采样的方法,用于研究示例数量的影响,同时显式考虑顺序和示例选择带来的影响。通过这种方法,作者发现先前关于示例数量选取的建议并不总能在不同的示例集合和排序下推广,并且一次示例设置是否优于零次示例设置高度依赖于所选示例本身。此外,受数据估值的启发,作者将该采样方法应用于上下文示例选择中,以挑选在不同排序下表现良好的示例,但结果表明,尽管性能对示例数量和顺序具有鲁棒性,但与随机采样相比,这种选择方法意外地导致了性能下降。
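文中蒙特卡洛采样的思路大致如下草图所示(evaluate 为假设的黑盒评估函数,真实场景应在验证集上调用 LLM 评估):随机采样示例子集与排列顺序,统计每个示例在不同组合、不同顺序下的平均表现。

```python
import random
from statistics import mean

def monte_carlo_example_scores(pool, evaluate, n_examples=4, n_trials=200, seed=0):
    """pool: 候选上下文示例;evaluate(examples) -> 该示例组合在验证集上的准确率。"""
    rng = random.Random(seed)
    per_example = {i: [] for i in range(len(pool))}
    for _ in range(n_trials):
        idx = rng.sample(range(len(pool)), n_examples)
        rng.shuffle(idx)                          # 顺序也作为随机因素纳入
        score = evaluate([pool[i] for i in idx])
        for i in idx:
            per_example[i].append(score)
    return {i: mean(s) for i, s in per_example.items() if s}

demo_pool = [f"example-{i}" for i in range(10)]
fake_eval = lambda examples: random.random()      # 演示用的假评估函数
print(monte_carlo_example_scores(demo_pool, fake_eval, n_trials=20))
```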

链接: https://arxiv.org/abs/2503.22002
作者: Stephanie Schoch,Yangfeng Ji
机构: Department of Computer Science (计算机科学系), University of Virginia (弗吉尼亚大学)
类目: Computation and Language (cs.CL)
备注: Accepted to the Workshop for Insights from Negative Results (co-located with NAACL 2025)

点击查看摘要

Abstract:Prior works have shown that in-context learning is brittle to presentation factors such as the order, number, and choice of selected examples. However, ablation-based guidance on selecting the number of examples may ignore the interplay between different presentation factors. In this work we develop a Monte Carlo sampling-based method to study the impact of number of examples while explicitly accounting for effects from order and selected examples. We find that previous guidance on how many in-context examples to select does not always generalize across different sets of selected examples and orderings, and whether one-shot settings outperform zero-shot settings is highly dependent on the selected example. Additionally, inspired by data valuation, we apply our sampling method to in-context example selection to select examples that perform well across different orderings. We find a negative result, that while performance is robust to ordering and number of examples, there is an unexpected performance degradation compared to random sampling.
zh

[NLP-43] Cluster automata

【速读】: 该论文旨在引入一类新的聚类Moore自动机(Clustered Moore Automata, CMA),并通过研究其时序行为来探索其特性,并描述相关应用。论文的关键在于定义和分析这一新类自动机的聚类结构及其动态行为,这为理解和应用此类系统提供了理论基础与实践指导。

链接: https://arxiv.org/abs/2503.22000
作者: András Kornai
机构: Budapest University of Technology and Economics (布达佩斯技术与经济大学)
类目: Computation and Language (cs.CL); Formal Languages and Automata Theory (cs.FL)
备注: Submitted to MOL2025

点击查看摘要

Abstract:We introduce a new class of clustered Moore automata (CMA), investigate their temporal behavior, and describe some applications.
zh

[NLP-44] Socially Constructed Treatment Plans: Analyzing Online Peer Interactions to Understand How Patients Navigate Complex Medical Conditions

【速读】: 本文旨在探究在线健康社区中复杂医疗条件(如癌症、精神健康问题、物质依赖康复)的社会化治疗计划构建过程及其有效性。论文通过分析在线讨论内容、临床医生和患者代表的民族志研究,以及深度访谈临床专家,揭示患者在药物辅助康复治疗中偏离临床指南的时间点与原因,并评估社会化治疗计划的实际意义与效果。此外,针对生成式AI (Generative AI) 在患者沟通中的应用趋势,研究进一步考察最先进的大型语言模型 (LLM) 是否及如何反映这些社会化治疗相关知识。关键在于采用一种新颖的混合方法,揭示患者中心沟通在在线健康社区中的重要研究方向。

链接: https://arxiv.org/abs/2503.21986
作者: Madhusudan Basak,Omar Sharif,Jessica Hulsey,Elizabeth C. Saunders,Daisy J. Goodman,Luke J. Archibald,Sarah M. Preum
机构: Dartmouth College (达特茅斯学院); BUET (BUET); Addiction Policy Forum (成瘾政策论坛); Dartmouth College (达特茅斯学院); Dartmouth College (达特茅斯学院); Dartmouth College (达特茅斯学院); Dartmouth College (达特茅斯学院)
类目: Human-Computer Interaction (cs.HC); Computation and Language (cs.CL); Computers and Society (cs.CY)
备注:

点击查看摘要

Abstract:When faced with complex and uncertain medical conditions (e.g., cancer, mental health conditions, recovery from substance dependency), millions of patients seek online peer support. In this study, we leverage content analysis of online discourse and ethnographic studies with clinicians and patient representatives to characterize how treatment plans for complex conditions are “socially constructed.” Specifically, we ground online conversation on medication-assisted recovery treatment to medication guidelines and subsequently surface when and why people deviate from the clinical guidelines. We characterize the implications and effectiveness of socially constructed treatment plans through in-depth interviews with clinical experts. Finally, given the enthusiasm around AI-powered solutions for patient communication, we investigate whether and how socially constructed treatment-related knowledge is reflected in a state-of-the-art large language model (LLM). Leveraging a novel mixed-method approach, this study highlights critical research directions for patient-centered communication in online health communities.
zh

[NLP-45] Entropy-Aware Branching for Improved Mathematical Reasoning

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在生成过程中因不确定性导致的错误率较高的问题,特别是在数学推理任务中,模型在高熵或熵方差较大的输出令牌处更容易出错。为应对这一挑战,论文提出了一种新颖的方法,即在关键决策点动态分支生成过程,而非默认选择单个最可能的令牌(argmax解码)。该方案的关键在于通过并行探索源自高概率令牌的多个分支路径,从而发现常规方法可能遗漏的不同推理路径,并利用更大模型的外部反馈对这些分支进行排序与选择,以确定最连贯且准确的推理路径。实验结果表明,此分支策略可将小规模LLMs的推理能力提升高达4.6%。
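单步解码时“按熵决定是否分支”的核心判断可以写成如下示意(熵阈值与分支数 k 均为假设参数,非论文实现):熵低时沿用常规 argmax 解码,熵高时展开 top-k 个高概率候选并行探索。

```python
import torch
import torch.nn.functional as F

def maybe_branch(logits: torch.Tensor, entropy_threshold: float = 3.0, k: int = 3):
    """返回本步要展开的候选 token:低熵返回单个 argmax,高熵返回 top-k 候选。"""
    probs = F.softmax(logits, dim=-1)
    entropy = -(probs * probs.clamp_min(1e-12).log()).sum()
    if entropy < entropy_threshold:
        return [int(torch.argmax(logits))]
    return torch.topk(logits, k).indices.tolist()

logits = torch.randn(32000)
print(maybe_branch(logits, entropy_threshold=3.0, k=3))
```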

链接: https://arxiv.org/abs/2503.21961
作者: Xianzhi Li,Ethan Callanan,Xiaodan Zhu,Mathieu Sibue,Antony Papadimitriou,Mahmoud Mahfouz,Zhiqiang Ma,Xiaomo Liu
机构: Queen’s University; JPMorgan Chase
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:While Large Language Models (LLMs) are effectively aligned through extensive pre-training and fine-tuning, they still struggle with varying levels of uncertainty during token generation. In our investigation of mathematical reasoning, we observe that errors are more likely to arise at tokens exhibiting high entropy and variance of entropy in the model’s output distribution. Based on the observation, we propose a novel approach that dynamically branches the generation process on demand instead of defaulting to the single most probable token. By exploring in parallel multiple branches stemming from high probability tokens of critical decision points, the model can discover diverse reasoning paths that might otherwise be missed. We further harness external feedback from larger models to rank and select the most coherent and accurate reasoning branch. Our experimental results on mathematical word problems and calculation questions show that this branching strategy boosts the reasoning capabilities of small LLMs up to 4.6% compared to conventional argmax decoding.
zh

[NLP-46] Proof or Bluff? Evaluating LLM s on 2025 USA Math Olympiad

【速读】: 该论文试图解决现有大型语言模型(Large Language Models, LLMs)在数学推理任务中的局限性,即它们仅关注最终数值答案,而忽视了严格的推理和证明生成能力,而这在现实世界中的数学任务中至关重要。为了解决这一问题,论文提出了首个针对复杂数学问题全解推理的全面评估框架。通过引入专家标注的数据集,论文评估了多个最先进的推理模型在2025年USAMO六道题目上的表现,并详细分析了推理轨迹以识别常见的失败模式及训练过程中优化策略导致的不良特性。关键在于通过系统性的评估揭示当前LLMs在数学推理与证明生成方面的能力不足,并为未来模型改进指明方向。

链接: https://arxiv.org/abs/2503.21934
作者: Ivo Petrov,Jasper Dekoninck,Lyuben Baltadzhiev,Maria Drencheva,Kristian Minchev,Mislav Balunović,Nikola Jovanović,Martin Vechev
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Recent math benchmarks for large language models (LLMs) such as MathArena indicate that state-of-the-art reasoning models achieve impressive performance on mathematical competitions like AIME, with the leading model, o3-mini, achieving scores comparable to top human competitors. However, these benchmarks evaluate models solely based on final numerical answers, neglecting rigorous reasoning and proof generation which are essential for real-world mathematical tasks. To address this, we introduce the first comprehensive evaluation of full-solution reasoning for challenging mathematical problems. Using expert human annotators, we evaluated several state-of-the-art reasoning models on the six problems from the 2025 USAMO within hours of their release. Our results reveal that all tested models struggled significantly, achieving less than 5% on average. Through detailed analysis of reasoning traces, we identify the most common failure modes and find several unwanted artifacts arising from the optimization strategies employed during model training. Overall, our results suggest that current LLMs are inadequate for rigorous mathematical reasoning tasks, highlighting the need for substantial improvements in reasoning and proof generation capabilities.
zh

[NLP-47] Local Normalization Distortion and the Thermodynamic Formalism of Decoding Strategies for Large Language Models

【速读】: 该论文试图解决自然语言生成中解码策略(Decoding Strategies)设计缺乏理论基础的问题。现有解码方法主要基于启发式规则,难以系统性地改进或优化。论文的关键在于将流行的解码算法形式化为遍历论(ergodic theory)中的平衡状态,并明确这些算法所优化的目标函数。通过这一理论框架,作者分析了top-k、nucleus采样以及温度采样中局部归一化步骤的影响,指出局部归一化引起的失真(local normalization distortion)是解码策略的根本缺陷,并量化了这种失真的大小及其对生成文本质量和多样性的数学代理指标的影响。研究发现,top-k采样相对于nucleus采样的性能差距主要归因于局部归一化失真,而非先前普遍认为的原因。这一结论为未来解码算法的设计以及机器生成文本的检测提供了指导意义。
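论文所分析的“局部归一化”,正是 top-k 与 nucleus 采样在截断之后把保留 token 的概率重新归一化为 1 的那一步。下面的标准采样写法标出了该步骤(参数为常用默认值,仅供对照理解,并非论文代码):

```python
import torch
import torch.nn.functional as F

def top_k_sample(logits: torch.Tensor, k: int = 50) -> int:
    topk_vals, topk_idx = torch.topk(logits, k)
    probs = F.softmax(topk_vals, dim=-1)                 # 局部归一化:仅在保留的 k 个 token 上
    return int(topk_idx[torch.multinomial(probs, 1)])

def nucleus_sample(logits: torch.Tensor, p: float = 0.9) -> int:
    sorted_probs, sorted_idx = torch.sort(F.softmax(logits, dim=-1), descending=True)
    cutoff = int(torch.searchsorted(torch.cumsum(sorted_probs, dim=0), torch.tensor(p))) + 1
    kept = sorted_probs[:cutoff] / sorted_probs[:cutoff].sum()   # 局部归一化:核内重新归一
    return int(sorted_idx[torch.multinomial(kept, 1)])

logits = torch.randn(32000)
print(top_k_sample(logits), nucleus_sample(logits))
```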

链接: https://arxiv.org/abs/2503.21929
作者: Tom Kempton,Stuart Burrell
机构: Department of Mathematics, University of Manchester (曼彻斯特大学); Innovation Lab, Featurespace (Featurespace创新实验室)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG); Dynamical Systems (math.DS)
备注:

点击查看摘要

Abstract:Advances in hardware and language model architecture have spurred a revolution in natural language generation. However, autoregressive models compute probability distributions over next-token choices, and sampling from these distributions, known as decoding, has received significantly less attention than other design choices. Existing decoding strategies are largely based on heuristics, resulting in methods that are hard to apply or improve in a principled manner. We develop the theory of decoding strategies for language models by expressing popular decoding algorithms as equilibrium states in the language of ergodic theory and stating the functions they optimize. Using this, we analyze the effect of the local normalization step of top-k, nucleus, and temperature sampling, used to make probabilities sum to one. We argue that local normalization distortion is a fundamental defect of decoding strategies and quantify the size of this distortion and its effect on mathematical proxies for the quality and diversity of generated text. Contrary to the prevailing explanation, we argue that the major cause of the under-performance of top-k sampling relative to nucleus sampling is local normalization distortion. This yields conclusions for the future design of decoding algorithms and the detection of machine-generated text.
zh

[NLP-48] Hybrid Emotion Recognition: Enhancing Customer Interactions Through Acoustic and Textual Analysis

【速读】: 该论文旨在解决传统方法在理解和识别复杂情感状态时的局限性问题,特别是在客户服务中心提升客户互动质量的需求背景下。论文提出的关键解决方案是构建一个融合先进深度学习(Deep Learning)、自然语言处理(NLP)及大语言模型(LLMs)的混合情感识别系统。该系统通过结合声学特征与文本情感分析,实现更细腻的情感检测,并利用长短期记忆网络(LSTM)和卷积神经网络(CNN)进行音频分析,采用DistilBERT进行文本评估,从而适应语言和文化差异,同时确保实时处理能力。

链接: https://arxiv.org/abs/2503.21927
作者: Sahan Hewage Wewelwala,T.G.D.K. Sumanathilaka
机构: 未知
类目: Computation and Language (cs.CL)
备注: 5 pages, 1 figure, 2 tables

点击查看摘要

Abstract:This research presents a hybrid emotion recognition system integrating advanced Deep Learning, Natural Language Processing (NLP), and Large Language Models (LLMs) to analyze audio and textual data for enhancing customer interactions in contact centers. By combining acoustic features with textual sentiment analysis, the system achieves nuanced emotion detection, addressing the limitations of traditional approaches in understanding complex emotional states. Leveraging LSTM and CNN models for audio analysis and DistilBERT for textual evaluation, the methodology accommodates linguistic and cultural variations while ensuring real-time processing. Rigorous testing on diverse datasets demonstrates the system’s robustness and accuracy, highlighting its potential to transform customer service by enabling personalized, empathetic interactions and improving operational efficiency. This research establishes a foundation for more intelligent and human-centric digital communication, redefining customer service standards.
zh

[NLP-49] AutoPsyC: Automatic Recognition of Psychodynamic Conflicts from Semi-structured Interviews with Large Language Models

【速读】: 本文旨在解决通过自动化方法识别心理动力冲突(psychodynamic conflicts)的问题,这些冲突通常是无意识的,甚至患者自身可能都无法明确意识到。现有精神病诊断的自动化方案主要集中在广泛的精神障碍类别(如抑郁症)的识别上,而针对心理动力冲突的自动检测仍面临挑战。论文的关键解决方案是提出AutoPsyC,这是一种基于大型语言模型(Large Language Models, LLMs)的方法,用于从完整的操作化心理诊断(Operationalized Psychodynamic Diagnostics, OPD)访谈中识别心理动力冲突的存在及其重要性。AutoPsyC结合了参数高效微调(parameter-efficient fine-tuning)与检索增强生成(Retrieval-Augmented Generation, RAG)技术,并采用摘要策略以有效处理长达90分钟的完整对话内容。实验结果表明,AutoPsyC在识别四种高度相关心理动力冲突方面显著优于所有基线模型及消融条件。

链接: https://arxiv.org/abs/2503.21911
作者: Sayed Muddashir Hossain,Simon Ostermann,Patrick Gebhard,Cord Benecke,Josef van Genabith,Philipp Müller
机构: DFKI(德国人工智能研究中心, 德国萨尔布吕肯); University of Kassel(卡塞尔大学, 德国卡塞尔)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Psychodynamic conflicts are persistent, often unconscious themes that shape a person’s behaviour and experiences. Accurate diagnosis of psychodynamic conflicts is crucial for effective patient treatment and is commonly done via long, manually scored semi-structured interviews. Existing automated solutions for psychiatric diagnosis tend to focus on the recognition of broad disorder categories such as depression, and it is unclear to what extent psychodynamic conflicts which even the patient themselves may not have conscious access to could be automatically recognised from conversation. In this paper, we propose AutoPsyC, the first method for recognising the presence and significance of psychodynamic conflicts from full-length Operationalized Psychodynamic Diagnostics (OPD) interviews using Large Language Models (LLMs). Our approach combines recent advances in parameter-efficient fine-tuning and Retrieval-Augmented Generation (RAG) with a summarisation strategy to effectively process entire 90 minute long conversations. In evaluations on a dataset of 141 diagnostic interviews we show that AutoPsyC consistently outperforms all baselines and ablation conditions on the recognition of four highly relevant psychodynamic conflicts.
zh

[NLP-50] JEEM: Vision-Language Understanding in Four Arabic Dialects

【速读】: 该论文旨在解决跨文化交流与视觉理解中的地域文化差异问题,具体而言,是评估现有的视觉-语言模型(Vision-Language Models, VLMs)在阿拉伯语国家(约旦、阿联酋、埃及和摩洛哥)的文化背景下进行视觉理解和多方言处理的能力。论文的关键在于提出了一个名为JEEM的新基准数据集,该数据集包含图像描述生成和视觉问答任务,并具有丰富的文化内涵和地区多样性。通过在五个开源阿拉伯语VLMs及GPT-4V上的实验结果表明,现有模型在跨方言适应性和文化元素解读方面存在不足,强调了构建更包容性模型以及采用多元化文化评价范式的必要性。

链接: https://arxiv.org/abs/2503.21910
作者: Karima Kadaoui,Hanin Atwany,Hamdan Al-Ali,Abdelrahman Mohamed,Ali Mekky,Sergei Tilga,Natalia Fedorova,Ekaterina Artemova,Hanan Aldarmaki,Yova Kementchedjhieva
机构: MBZUAI (MBZUAI); Toloka AI (Toloka AI)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:We introduce JEEM, a benchmark designed to evaluate Vision-Language Models (VLMs) on visual understanding across four Arabic-speaking countries: Jordan, The Emirates, Egypt, and Morocco. JEEM includes the tasks of image captioning and visual question answering, and features culturally rich and regionally diverse content. This dataset aims to assess the ability of VLMs to generalize across dialects and accurately interpret cultural elements in visual contexts. In an evaluation of five prominent open-source Arabic VLMs and GPT-4V, we find that the Arabic VLMs consistently underperform, struggling with both visual understanding and dialect-specific generation. While GPT-4V ranks best in this comparison, the model’s linguistic competence varies across dialects, and its visual understanding capabilities lag behind. This underscores the need for more inclusive models and the value of culturally-diverse evaluation paradigms.
zh

[NLP-51] OntoAligner: A Comprehensive Modular and Robust Python Toolkit for Ontology Alignment ESWC2025

【速读】: 该论文旨在解决现有本体对齐工具在可扩展性(Scalability)、模块化(Modularity)以及与最新人工智能进展集成便利性方面的局限性。论文的关键解决方案是提出了一种名为OntoAligner的综合、模块化且健壮的Python工具包,它通过整合现有的轻量级对齐技术(如模糊匹配)以及支持基于检索增强生成(Retrieval-Augmented Generation)和大语言模型(Large Language Models)的当代方法,超越了传统工具的能力。此外,OntoAligner设计注重可扩展性(Extensibility),允许研究者集成自定义的对齐算法和数据集,从而实现高效处理大规模本体并对齐任务提供高质量结果。

链接: https://arxiv.org/abs/2503.21902
作者: Hamed Babaei Giglou,Jennifer D’Souza,Oliver Karras,Sören Auer
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 18 pages, 3 figures. Accepted for the ESWC 2025 Resource Track

点击查看摘要

Abstract:Ontology Alignment (OA) is fundamental for achieving semantic interoperability across diverse knowledge systems. We present OntoAligner, a comprehensive, modular, and robust Python toolkit for ontology alignment, designed to address current limitations with existing tools faced by practitioners. Existing tools are limited in scalability, modularity, and ease of integration with recent AI advances. OntoAligner provides a flexible architecture integrating existing lightweight OA techniques such as fuzzy matching but goes beyond by supporting contemporary methods with retrieval-augmented generation and large language models for OA. The framework prioritizes extensibility, enabling researchers to integrate custom alignment algorithms and datasets. This paper details the design principles, architecture, and implementation of the OntoAligner, demonstrating its utility through benchmarks on standard OA tasks. Our evaluation highlights OntoAligner’s ability to handle large-scale ontologies efficiently with few lines of code while delivering high alignment quality. By making OntoAligner open-source, we aim to provide a resource that fosters innovation and collaboration within the OA community, empowering researchers and practitioners with a toolkit for reproducible OA research and real-world applications.
zh

[NLP-52] RedditESS: A Mental Health Social Support Interaction Dataset – Understanding Effective Social Support to Refine AI-Driven Support Tools

【速读】: 该论文试图解决现有研究中对“有效”心理健康支持定义过于狭窄的问题,主要集中在情感共鸣(empathetic acknowledgments),而忽视了信息指导(informational guidance)、社区认同(community validation)以及实际应对策略(tangible coping strategies)等其他重要维度。为了解决这一局限,并深入理解什么是真正有效的支持,论文提出了RedditESS,这是一个基于Reddit帖子构建的新型真实世界数据集,包含支持性评论及其原帖作者的后续回复。关键解决方案在于开发了一种基于社会科学理论的集成标注机制(ensemble labeling mechanism),用于将支持性评论标注为“有效”或“无效”,并通过定性评估确保标注的可靠性。此外,论文展示了RedditESS在引导大型语言模型(LLM)更好地生成上下文敏感且实用的支持性回应方面的实用性,从而推动了更先进的AI驱动的心理健康干预发展。

链接: https://arxiv.org/abs/2503.21888
作者: Zeyad Alghamdi,Tharindu Kumarage,Garima Agrawal,Mansooreh Karami,Ibrahim Almuteb,Huan Liu
机构: Arizona State University (亚利桑那州立大学); Hail University (海勒大学); HumaConn AI Consulting (HumaConn AI 咨询); Microsoft (微软); Texas A&M University (德克萨斯农工大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Effective mental health support is crucial for alleviating psychological distress. While large language model (LLM)-based assistants have shown promise in mental health interventions, existing research often defines “effective” support primarily in terms of empathetic acknowledgments, overlooking other essential dimensions such as informational guidance, community validation, and tangible coping strategies. To address this limitation and better understand what constitutes effective support, we introduce RedditESS, a novel real-world dataset derived from Reddit posts, including supportive comments and original posters’ follow-up responses. Grounded in established social science theories, we develop an ensemble labeling mechanism to annotate supportive comments as effective or not and perform qualitative assessments to ensure the reliability of the annotations. Additionally, we demonstrate the practical utility of RedditESS by using it to guide LLM alignment toward generating more context-sensitive and genuinely helpful supportive responses. By broadening the understanding of effective support, our study paves the way for advanced AI-driven mental health interventions.
zh

[NLP-53] MSPLoRA: A Multi-Scale Pyramid Low-Rank Adaptation for Efficient Model Fine-Tuning

【速读】: 该论文旨在解决传统低秩适应(LoRA)方法在处理分层信息复杂性不同时存在的固定秩假设导致的效率低下和冗余问题。论文的关键解决方案是提出多尺度金字塔LoRA(MSPLoRA),通过引入全局共享LoRA、中间层共享LoRA和层特定LoRA,分别捕获全局模式、中级特征和细粒度信息,构建层次化的结构以减少层间冗余并保持强大的适配能力。这种设计能够更高效地适应不同复杂度的任务需求,并显著降低可训练参数数量,同时通过奇异值分解验证其信息解耦能力。
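多尺度金字塔结构可以粗略理解为:在同一个冻结线性层上叠加三个共享范围不同的低秩增量(全局共享、层组共享、层特定)。下面是一个示意性模块(各尺度的秩、维度与初始化均为假设,非论文实现):

```python
import torch
import torch.nn as nn

class MSPLoRALinear(nn.Module):
    """示意:冻结的基础权重 + 全局/层组/层特定三个尺度的低秩增量。"""
    def __init__(self, base: nn.Linear, global_ab, group_ab, rank_local: int = 4):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)
        self.global_a, self.global_b = global_ab        # 所有层共享
        self.group_a, self.group_b = group_ab           # 同一层组共享
        self.local_a = nn.Parameter(torch.zeros(base.out_features, rank_local))
        self.local_b = nn.Parameter(torch.randn(rank_local, base.in_features) * 0.01)

    def forward(self, x):
        delta = (self.global_a @ self.global_b
                 + self.group_a @ self.group_b
                 + self.local_a @ self.local_b)
        return self.base(x) + x @ delta.T

d_in = d_out = 128
shared_global = (nn.Parameter(torch.zeros(d_out, 16)), nn.Parameter(torch.randn(16, d_in) * 0.01))
shared_group = (nn.Parameter(torch.zeros(d_out, 8)), nn.Parameter(torch.randn(8, d_in) * 0.01))
layer = MSPLoRALinear(nn.Linear(d_in, d_out), shared_global, shared_group, rank_local=4)
print(layer(torch.randn(2, d_in)).shape)                # -> torch.Size([2, 128])
```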

链接: https://arxiv.org/abs/2503.21838
作者: Jiancheng Zhao,Xingda Yu,Zhen Yang
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Parameter-Efficient Fine-Tuning (PEFT) has become an essential approach for adapting large-scale pre-trained models while reducing computational costs. Among PEFT methods, LoRA significantly reduces trainable parameters by decomposing weight updates into low-rank matrices. However, traditional LoRA applies a fixed rank across all layers, failing to account for the varying complexity of hierarchical information, which leads to inefficient adaptation and redundancy. To address this, we propose MSPLoRA (Multi-Scale Pyramid LoRA), which introduces Global Shared LoRA, Mid-Level Shared LoRA, and Layer-Specific LoRA to capture global patterns, mid-level features, and fine-grained information, respectively. This hierarchical structure reduces inter-layer redundancy while maintaining strong adaptation capability. Experiments on various NLP tasks demonstrate that MSPLoRA achieves more efficient adaptation and better performance while significantly reducing the number of trainable parameters. Furthermore, additional analyses based on Singular Value Decomposition validate its information decoupling ability, highlighting MSPLoRA as a scalable and effective optimization strategy for parameter-efficient fine-tuning in large language models. Our code is available at this https URL.
zh

[NLP-54] Refining Time Series Anomaly Detectors using Large Language Models

【速读】: 该论文旨在解决时间序列异常检测(TSAD)领域中,尽管已有多种自动检测方法,但仍需大量人工审查以验证检测结果准确性的问题。论文的关键在于探索多模态大型语言模型(Multimodal Large Language Models, LLMs)在部分自动化这一过程中的应用,通过结合时间序列图的视觉检查与数据生成过程的文字描述,有效识别误报(false alarms),从而减少维护TSAD系统所需的人力依赖。

链接: https://arxiv.org/abs/2503.21833
作者: Alan Yang,Yulin Chen,Sean Lee,Venus Montes
机构: Stanford University (斯坦福大学); Meta (Meta)
类目: Computation and Language (cs.CL)
备注: Main content: 4 pages, 1 figure, 1 table

点击查看摘要

Abstract:Time series anomaly detection (TSAD) is of widespread interest across many industries, including finance, healthcare, and manufacturing. Despite the development of numerous automatic methods for detecting anomalies, human oversight remains necessary to review and act upon detected anomalies, as well as verify their accuracy. We study the use of multimodal large language models (LLMs) to partially automate this process. We find that LLMs can effectively identify false alarms by integrating visual inspection of time series plots with text descriptions of the data-generating process. By leveraging the capabilities of LLMs, we aim to reduce the reliance on human effort required to maintain a TSAD system
zh

[NLP-55] Optimizing Safe and Aligned Language Generation: A Multi-Objective GRPO Approach

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在与人类价值观和安全约束对齐时面临的挑战,特别是在帮助性、真实性及避免伤害等目标存在冲突的情况下。现有方法如基于人类反馈的强化学习(Reinforcement Learning from Human Feedback, RLHF)虽取得一定成效,但复杂且不稳定;直接偏好优化(Direct Preference Optimization, DPO)虽简化了基于偏好的微调,但可能存在偏差或牺牲某些目标的问题。为此,论文提出了一种基于群体相对策略优化(Group Relative Policy Optimization, GRPO)的框架,并结合多标签奖励回归模型,以实现语言生成的安全性和一致性。解决方案的关键在于通过比较采样响应的群体来优化策略,无需单独的价值评估器,从而提高训练效率,并利用训练出的奖励模型预测多个对齐分数(如安全性、帮助性等),将这些分数整合为单一奖励信号,以实现多目标平衡。实验表明,该方法在不同规模模型(0.5B、7B 和 14B 参数)的语言生成任务中提升了安全性和质量指标,同时降低了计算成本并实现了显式的多目标处理。
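其中“多维对齐得分合成单一奖励 + 组内相对优势”两步可以用如下草图说明(得分与权重均为假设的演示数值,优势计算采用常见的 GRPO 组内均值/标准差归一化形式,并非论文源码):

```python
import torch

def combine_rewards(scores: torch.Tensor, weights: torch.Tensor) -> torch.Tensor:
    """scores: [G, K],同一提示下 G 个采样响应在 K 个对齐维度(安全、有用性等)上的得分。"""
    return scores @ weights

def group_relative_advantage(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """以组内均值与标准差归一化得到相对优势,无需单独的价值评估器。"""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

scores = torch.tensor([[0.9, 0.7], [0.4, 0.95], [0.8, 0.8], [0.2, 0.3]])   # [安全, 有用]
weights = torch.tensor([0.5, 0.5])                                          # 假设的维度权重
rewards = combine_rewards(scores, weights)
print(rewards, group_relative_advantage(rewards))
```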

链接: https://arxiv.org/abs/2503.21819
作者: Xuying Li,Zhuo Li,Yuji Kosuga,Victor Bian
机构: HydroX AI (HydroX AI)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Aligning large language models (LLMs) with human values and safety constraints is challenging, especially when objectives like helpfulness, truthfulness, and avoidance of harm conflict. Reinforcement Learning from Human Feedback (RLHF) has achieved notable success in steering models, but is complex and can be unstable. Recent approaches such as Direct Preference Optimization (DPO) simplify preference-based fine-tuning but may introduce bias or trade-off certain objectives~\citedpo. In this work, we propose a Group Relative Policy Optimization (GRPO) framework with a multi-label reward regression model to achieve safe and aligned language generation. The GRPO algorithm optimizes a policy by comparing groups of sampled responses, eliminating the need for a separate value critic and improving training efficiency~\citegrpo. We train a reward model to predict multiple alignment scores (e.g., safety, helpfulness, etc.), which are combined into a single reward signal. We provide a theoretical derivation for using this learned multi-aspect reward within GRPO and discuss its advantages and limitations. Empirically, our approach improves all the safety and quality metrics evaluated in language generation tasks on model scales (0.5B, 7B, and 14B parameters), demonstrating a robust balance of objectives. We compare GRPO to PPO-based RLHF and DPO, highlighting that GRPO achieves alignment with significantly lower computational cost and explicit multi-objective handling. \textbfWe will open-source all trained models at this https URL.
zh

[NLP-56] OAEI-LLM -T: A TBox Benchmark Dataset for Understanding LLM Hallucinations in Ontology Matching Systems

【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在下游任务中不可避免的幻觉现象(hallucinations),特别是针对基于LLMs的本体匹配(Ontology Matching, OM)系统中幻觉问题带来的挑战。论文的关键解决方案是引入一个新的基准数据集OAEI-LLM-T,该数据集基于Ontology Alignment Evaluation Initiative (OAEI) 的TBox数据集演化而来,专门捕获不同LLMs执行OM任务时产生的幻觉现象,并将这些OM特有的幻觉细分为两类共六个子类别。通过这一数据集,论文展示了其在构建LLMs排行榜以及针对LLM-based OM系统的微调方面的实用性。

链接: https://arxiv.org/abs/2503.21813
作者: Zhangcheng Qiang
机构: School of Computing, Australian National University (澳大利亚国立大学), Canberra, Australia
类目: Computation and Language (cs.CL); Information Retrieval (cs.IR)
备注: 10 pages, 4 figures, 3 tables, 2 prompt templates

点击查看摘要

Abstract:Hallucinations are inevitable in downstream tasks using large language models (LLMs). As addressing hallucinations becomes a substantial challenge for LLM-based ontology matching (OM) systems, we introduce a new benchmark dataset called OAEI-LLM-T. The dataset evolves from the TBox (i.e. schema-matching) datasets in the Ontology Alignment Evaluation Initiative (OAEI), capturing hallucinations of different LLMs performing OM tasks. These OM-specific hallucinations are carefully classified into two primary categories and six sub-categories. We showcase the usefulness of the dataset in constructing the LLM leaderboard and fine-tuning foundational LLMs for LLM-based OM systems.
zh

[NLP-57] Taxonomy Inference for Tabular Data Using Large Language Models

【速读】: 该论文旨在解决表格数据的 taxonomy inference(分类推断)问题,即通过发现表格中实体类型(概念)及其层级结构,实现对表结构的语义理解。这一任务在数据管理、数据探索以及基于数据的各类应用中具有重要意义。现有方法主要针对 XML、JSON 或 RDF 数据,依赖于数据的词法格式与结构进行相似性计算,而对表格内文本的语义利用不足。为克服这些局限,论文提出两种基于 Large Language Models (LLMs) 的解决方案:(i) EmTT 方法通过微调对比学习的编码器-only LLMs(如 BERT)嵌入列信息,并利用聚类构建层级结构;(ii) GeTT 方法借助迭代提示的解码器-only LLMs(如 GPT-4)生成表格实体类型及其层级关系。关键在于充分利用 LLMs 的语义理解和生成能力,以更全面地挖掘表格数据的潜在语义信息。

链接: https://arxiv.org/abs/2503.21810
作者: Zhenyu Wu,Jiaoyan Chen,Norman W. Paton
机构: 未知
类目: Databases (cs.DB); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR)
备注:

点击查看摘要

Abstract:Taxonomy inference for tabular data is a critical task of schema inference, aiming at discovering entity types (i.e., concepts) of the tables and building their hierarchy. It can play an important role in data management, data exploration, ontology learning, and many data-centric applications. Existing schema inference systems focus more on XML, JSON or RDF data, and often rely on lexical formats and structures of the data for calculating similarities, with limited exploitation of the semantics of the text across a table. Motivated by recent works on taxonomy completion and construction using Large Language Models (LLMs), this paper presents two LLM-based methods for taxonomy inference for tables: (i) EmTT, which embeds columns by fine-tuning encoder-only LLMs like BERT with contrastive learning and utilises clustering for hierarchy construction, and (ii) GeTT, which generates table entity types and their hierarchy by iterative prompting using a decoder-only LLM like GPT-4. Extensive evaluation on three real-world datasets with six metrics covering different aspects of the output taxonomies has demonstrated that EmTT and GeTT can both produce taxonomies with strong consistency relative to the Ground Truth.
zh
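
针对 EmTT 中"列嵌入后用聚类构建类型层级"这一步,下面给出一个简化示意:用随机向量代替对比学习微调后的 BERT 列嵌入,仅演示凝聚式层次聚类如何产出层级结构(列名与维度均为假设):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, to_tree

# 假设已用微调后的编码器得到每个表格列的向量表示(此处以随机向量代替)
rng = np.random.default_rng(0)
column_names = ["person_name", "birth_date", "company", "founded_year", "city"]
column_embeddings = rng.normal(size=(len(column_names), 16))

# 凝聚式层次聚类:叶子为列(实体类型候选),内部节点构成类型层级
Z = linkage(column_embeddings, method="average", metric="cosine")
root, _ = to_tree(Z, rd=True)

def print_tree(node, depth=0):
    if node.is_leaf():
        print("  " * depth + column_names[node.id])
    else:
        print("  " * depth + f"[type-{node.id}]")
        print_tree(node.left, depth + 1)
        print_tree(node.right, depth + 1)

print_tree(root)
```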

[NLP-58] Large Language Models Meet Contrastive Learning: Zero-Shot Emotion Recognition Across Languages ICME2025

【速读】: 该论文旨在解决跨语言零样本语音情感识别中的挑战,特别是由于语音特征变化和语言多样性导致的问题。为应对这些挑战,论文提出了一种基于对比学习的方法来优化多语言语音特征,并扩展大型语言模型以实现跨语言零样本语音情感估计。解决方案的关键在于采用一种新颖的两阶段训练框架,将语音信号与情感空间中的语言特征对齐,从而捕获既包含情感感知又与语言无关的语音表征。此外,为了促进该领域的研究,论文还引入了一个大规模的合成多语言语音情感数据集M5SER。实验结果验证了所提方法在语音情感识别及跨语言零样本语音情感识别任务上的有效性,包括对未见数据集和语言的支持。

链接: https://arxiv.org/abs/2503.21806
作者: Heqing Zou,Fengmao Lv,Desheng Zheng,Eng Siong Chng,Deepu Rajan
机构: Nanyang Technological University (南洋理工大学); Southwest Jiaotong University (西南交通大学); Kash Institute of Electronics and Information Industry (卡什电子与信息技术产业研究所); Southwest Petroleum University (西南石油大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted to ICME 2025

点击查看摘要

Abstract:Multilingual speech emotion recognition aims to estimate a speaker’s emotional state using a contactless method across different languages. However, variability in voice characteristics and linguistic diversity poses significant challenges for zero-shot speech emotion recognition, especially with multilingual datasets. In this paper, we propose leveraging contrastive learning to refine multilingual speech features and extend large language models for zero-shot multilingual speech emotion estimation. Specifically, we employ a novel two-stage training framework to align speech signals with linguistic features in the emotional space, capturing both emotion-aware and language-agnostic speech representations. To advance research in this field, we introduce a large-scale synthetic multilingual speech emotion dataset, M5SER. Our experiments demonstrate the effectiveness of the proposed method in both speech emotion recognition and zero-shot multilingual speech emotion recognition, including previously unseen datasets and languages.
zh
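
摘要中"将语音信号与情感空间中的语言特征对齐"通常依赖对比损失;下面给出对称 InfoNCE 式对比损失的通用 PyTorch 示意(以随机张量代替真实特征,仅说明对齐目标的形式,并非论文的两阶段训练实现):

```python
import torch
import torch.nn.functional as F

def symmetric_contrastive_loss(speech_emb, text_emb, temperature=0.07):
    """对齐成对的语音/情感文本表示:同一批内配对为正样本,其余为负样本。"""
    s = F.normalize(speech_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = s @ t.T / temperature              # [B, B] 相似度矩阵
    targets = torch.arange(s.size(0))
    loss_s2t = F.cross_entropy(logits, targets)
    loss_t2s = F.cross_entropy(logits.T, targets)
    return (loss_s2t + loss_t2s) / 2

# 假设 batch 内的语音片段与其情感描述文本一一配对
speech = torch.randn(8, 256)
text = torch.randn(8, 256)
print(symmetric_contrastive_loss(speech, text).item())
```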

[NLP-59] ImF: Implicit Fingerprint for Large Language Models

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在知识产权(Intellectual Property, IP)保护中存在的指纹方法易被攻击者利用其弱语义相关性而被擦除的问题。现有模型指纹方法通常通过注入弱语义关联的指纹对来保护模型所有权,但这些指纹对缺乏自然问答(Question-Answer, QA)对应有的上下文连贯性和语义相关性,容易被特定攻击如Generation Revision Intervention (GRI) 攻击所破坏。

为了解决这一问题,论文提出了一种名为隐式指纹(Implicit Fingerprints, ImF)的新注入指纹范式。关键在于ImF通过构建具有强语义相关性的指纹对,并将其伪装成LLMs中的自然QA对,从而确保这些指纹与正常模型行为一致,使其难以被检测和移除。实验结果表明,ImF在对抗条件下仍能保持较高的验证成功率,为LLMs的所有权保护提供了可靠方案。

链接: https://arxiv.org/abs/2503.21805
作者: Wu jiaxuan,Peng Wanli,Fu hang,Xue Yiming,Wen juan
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 16 pages, 7 figures

点击查看摘要

Abstract:Training large language models (LLMs) is resource-intensive and expensive, making intellectual property (IP) protection essential. Most existing model fingerprint methods inject fingerprints into LLMs to protect model ownership. These methods create fingerprint pairs with weak semantic correlations, lacking the contextual coherence and semantic relatedness found in normal question-answer (QA) pairs in LLMs. In this paper, we propose a Generation Revision Intervention (GRI) attack that can effectively exploit this flaw to erase fingerprints, highlighting the need for more secure model fingerprint methods. Thus, we propose a novel injected fingerprint paradigm called Implicit Fingerprints (ImF). ImF constructs fingerprint pairs with strong semantic correlations, disguising them as natural QA pairs within LLMs. This ensures the fingerprints are consistent with normal model behavior, making them indistinguishable and robust against detection and removal. Our experiments on multiple LLMs demonstrate that ImF retains high verification success rates under adversarial conditions, offering a reliable solution for protecting LLM ownership.
zh

[NLP-60] Efficient Joint Prediction of Multiple Future Tokens

【速读】: 该论文试图解决通过多令牌预测(Multi-Token Prediction, MTP)方法提升隐藏状态表示丰富性的问题,同时避免传统MTP方法在计算开销上的局限性。论文的关键解决方案是引入联合多令牌预测(Joint Multi-Token Prediction, JTP),这是一种轻量级的修改方案,通过精心设计的表示瓶颈,采用教师强迫(Teacher Forcing)策略来预测未来多个令牌。与现有方法相比,JTP能够在训练过程中以最小的计算开销编码丰富的预测信息,并成功实现短期信念状态表示。实验结果表明,JTP在合成星图导航任务中的表现显著优于现有方法,展示了其有效性。

链接: https://arxiv.org/abs/2503.21801
作者: Kwangjun Ahn,Alex Lamb,John Langford
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Technical report; comments welcome!

点击查看摘要

Abstract:In this short report, we introduce joint multi-token prediction (JTP), a lightweight modification of standard next-token prediction designed to enrich hidden state representations by jointly predicting multiple future tokens. Unlike previous multi-token prediction approaches, JTP strategically employs teacher forcing of future-tokens through a carefully designed representation bottleneck, allowing the model to encode rich predictive information with minimal computational overhead during training. We show that the JTP approach achieves a short-horizon belief state representation, while popular alternatives for multi-token prediction fail to do so. We demonstrate the effectiveness of our method on the synthetic star graph navigation task from Bachmann and Nagarajan [2024], highlighting a significant performance improvement over existing methods. This manuscript presents promising preliminary results intended to stimulate further research.
zh
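
下面用 PyTorch 勾勒"经过表示瓶颈联合预测多个未来 token"的大致形状关系(结构、维度与激活函数均为假设,且未包含论文中对未来 token 的 teacher forcing 细节):

```python
import torch
import torch.nn as nn

class JointMultiTokenHead(nn.Module):
    """在常规下一 token 预测之外,经一个低维瓶颈再联合预测 k 个未来 token(示意)。"""
    def __init__(self, d_model=512, vocab=1000, k_future=4, d_bottleneck=64):
        super().__init__()
        self.next_token = nn.Linear(d_model, vocab)          # 标准下一 token 头
        self.bottleneck = nn.Linear(d_model, d_bottleneck)   # 表示瓶颈
        self.future_heads = nn.ModuleList(
            [nn.Linear(d_bottleneck, vocab) for _ in range(k_future)]
        )

    def forward(self, hidden):
        # hidden: [B, T, d_model],来自骨干网络的隐藏状态
        z = torch.tanh(self.bottleneck(hidden))
        future_logits = [head(z) for head in self.future_heads]  # 训练时以未来真实 token 作监督
        return self.next_token(hidden), future_logits

head = JointMultiTokenHead()
h = torch.randn(2, 16, 512)
next_logits, future_logits = head(h)
print(next_logits.shape, len(future_logits), future_logits[0].shape)
```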

[NLP-61] ELM: Ensemble of Language Models for Predicting Tumor Group from Pathology Reports

【速读】: 该论文试图解决病理报告中手动提取数据的瓶颈问题,这一过程对于肿瘤组分配等任务至关重要,但耗时巨大(约900人时处理10万份报告)。为了解决这一问题,论文提出了一种名为ELM(语言模型集成)的新方法。其关键是结合小语言模型(SLMs)和大语言模型(LLMs),通过六个经过微调的SLMs分别处理病理报告的上下部分以最大化覆盖,并采用五票多数规则进行肿瘤组分类,分歧则由精心设计提示的大语言模型仲裁,从而实现高精度和高召回率(平均0.94),显著提升癌症登记处的工作效率。

链接: https://arxiv.org/abs/2503.21800
作者: Lovedeep Gondara,Jonathan Simkin,Shebnum Devji,Gregory Arbour,Raymond Ng
机构: British Columbia Cancer Registry; Provincial Health Services Authority; Data Science Institute; University of British Columbia
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Population-based cancer registries (PBCRs) face a significant bottleneck in manually extracting data from unstructured pathology reports, a process crucial for tasks like tumor group assignment, which can consume 900 person-hours for approximately 100,000 reports. To address this, we introduce ELM (Ensemble of Language Models), a novel ensemble-based approach leveraging both small language models (SLMs) and large language models (LLMs). ELM utilizes six fine-tuned SLMs, where three SLMs use the top part of the pathology report and three SLMs use the bottom part. This is done to maximize report coverage. ELM requires five-out-of-six agreement for a tumor group classification. Disagreements are arbitrated by an LLM with a carefully curated prompt. Our evaluation across nineteen tumor groups demonstrates ELM achieves an average precision and recall of 0.94, outperforming single-model and ensemble-without-LLM approaches. Deployed at the British Columbia Cancer Registry, ELM demonstrates how LLMs can be successfully applied in a PBCR setting to achieve state-of-the-art results and significantly enhance operational efficiencies, saving hundreds of person-hours annually.
zh
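
ELM 的"六取五多数表决、分歧交由 LLM 仲裁"的决策逻辑本身很直接,下面是一个示意性实现(SLM 与仲裁接口均用占位函数代替,函数名为假设):

```python
from collections import Counter

def elm_classify(report, slm_predictors, llm_arbiter, min_agreement=5):
    """六个 SLM(三个看报告上半部分、三个看下半部分)投票;达不到 5/6 一致时交给 LLM 仲裁。"""
    top, bottom = report[: len(report) // 2], report[len(report) // 2 :]
    preds = [p(top) for p in slm_predictors[:3]] + [p(bottom) for p in slm_predictors[3:]]
    label, votes = Counter(preds).most_common(1)[0]
    if votes >= min_agreement:
        return label
    return llm_arbiter(report, candidates=sorted(set(preds)))  # 分歧样本由精心设计提示的 LLM 裁决

# 示例:以固定输出的函数代替真实模型
slms = [lambda text, l=l: l for l in ["lung", "lung", "lung", "lung", "breast", "lung"]]
arbiter = lambda report, candidates: candidates[0]
print(elm_classify("...pathology report text...", slms, arbiter))  # -> "lung"(5/6 一致)
```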

[NLP-62] Leveraging Large Language Models for Automated Causal Loop Diagram Generation: Enhancing System Dynamics Modeling through Curated Prompting Techniques

【速读】: 该论文试图解决将动态假设(dynamic hypothesis)转化为因果回路图(Causal Loop Diagram, CLD)过程中,对于新手建模者而言存在的挑战性和耗时性问题,以促进系统动力学(System Dynamics, SD)工具的广泛应用。论文的关键在于提出并验证了一种利用经过精心设计提示技术的大语言模型(Large Language Models, LLMs)实现动态假设到CLD自动化转换的方法。通过采用标准有向图结构进行推理,并结合来自经典SD教材的简单动态假设与相应CLD构建案例,论文评估了不同提示技术组合的表现,结果显示,在处理简单的模型结构且使用精心设计的提示技术时,LLMs生成的CLD质量可媲美专家构建的CLD,从而显著加速了CLD的创建过程。

链接: https://arxiv.org/abs/2503.21798
作者: Ning-Yuan Georgia Liu,David R. Keith
机构: 未知
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Transforming a dynamic hypothesis into a causal loop diagram (CLD) is crucial for System Dynamics Modelling. Extracting key variables and causal relationships from text to build a CLD is often challenging and time-consuming for novice modelers, limiting SD tool adoption. This paper introduces and tests a method for automating the translation of dynamic hypotheses into CLDs using large language models (LLMs) with curated prompting techniques. We first describe how LLMs work and how they can make the inferences needed to build CLDs using a standard digraph structure. Next, we develop a set of simple dynamic hypotheses and corresponding CLDs from leading SD textbooks. We then compare the four different combinations of prompting techniques, evaluating their performance against CLDs labeled by expert modelers. Results show that for simple model structures and using curated prompting techniques, LLMs can generate CLDs of a similar quality to expert-built ones, accelerating CLD creation.
zh
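
论文提到让 LLM 按标准有向图结构进行推理;下面示意如何把这类三元组形式的因果关系整理成 CLD 的有向图,并按极性判断增强/平衡回路(示例中的变量与边均为笔者虚构,仅说明数据结构):

```python
import networkx as nx

# 假设 LLM 按 (原因变量, 结果变量, 极性) 的三元组返回因果关系
llm_causal_links = [
    ("adoption rate", "word of mouth", "+"),
    ("word of mouth", "adoption rate", "+"),                  # 正反馈回路 R
    ("adoption rate", "remaining potential adopters", "-"),
    ("remaining potential adopters", "adoption rate", "+"),   # 负反馈回路 B
]

cld = nx.DiGraph()
for cause, effect, polarity in llm_causal_links:
    cld.add_edge(cause, effect, polarity=polarity)

# 枚举回路并按负极性边的个数判断增强(R)/平衡(B)回路
for cycle in nx.simple_cycles(cld):
    signs = [cld[u][v]["polarity"] for u, v in zip(cycle, cycle[1:] + cycle[:1])]
    loop_type = "R(增强)" if signs.count("-") % 2 == 0 else "B(平衡)"
    print(cycle, signs, loop_type)
```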

[NLP-63] Convolutional optimization with convex kernel and power lift

【速读】: 该论文致力于构建一种基于凸核卷积的新型优化理论基础范式,旨在设计一种能够道德上确定性定位任意函数全局最优解的模型,这一方法区别于大多数常用的统计模型。论文的关键在于提出了一种新的优化范式,通过凸核卷积实现对全局最优解的定位,其解决方案的核心是利用凸核卷积的独特性质来引导算法搜索过程。初步数值结果展示了源自该范式的特定算法的效率,以期激发进一步的实践兴趣。

链接: https://arxiv.org/abs/2503.22135
作者: Zhipeng Lu
机构: Shenzhen MSU-BIT University (深圳北理莫斯科大学); Guangdong Laboratory of Machine Perception and Intelligent Computing (广东省机器感知与智能计算重点实验室)
类目: Optimization and Control (math.OC); Computational Engineering, Finance, and Science (cs.CE); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:We focus on establishing the foundational paradigm of a novel optimization theory based on convolution with convex kernels. Our goal is to devise a morally deterministic model of locating the global optima of an arbitrary function, which is distinguished from most commonly used statistical models. Limited preliminary numerical results are provided to test the efficiency of some specific algorithms derived from our paradigm, which we hope to stimulate further practical interest.
zh

计算机视觉

[CV-0] Q-Insight: Understanding Image Quality via Visual Reinforcement Learning

【速读】:本文旨在解决传统图像质量评估(IQA)方法在综合理解图像质量方面存在的局限性,特别是缺乏灵活性和对复杂视觉推理能力的支持。现有基于多模态大语言模型(MLLMs)的方法通常只能生成缺乏可解释性的数值评分,或依赖大规模标注数据进行监督微调(SFT),这限制了其适应性和泛化能力。为了解决这些问题,论文提出了一种基于强化学习的模型Q-Insight,其核心在于利用分组相对策略优化(GRPO)算法,在仅需少量评分和退化标签的情况下,实现强大的视觉推理能力。关键创新点在于通过精心设计的奖励函数同时优化评分回归与退化感知任务,有效挖掘两者之间的协同作用以提升性能。实验结果表明,Q-Insight在评分回归和退化感知任务上显著超越现有最先进方法,并展现出出色的零样本迁移能力用于比较推理任务。

链接: https://arxiv.org/abs/2503.22679
作者: Weiqi Li,Xuanyu Zhang,Shijie Zhao,Yabin Zhang,Junlin Li,Li Zhang,Jian Zhang
机构: School of Electronic and Computer Engineering, Peking University (北京大学电子与计算机工程学院); ByteDance Inc. (字节跳动)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Technical report

点击查看摘要

Abstract:Image quality assessment (IQA) focuses on the perceptual visual quality of images, playing a crucial role in downstream tasks such as image reconstruction, compression, and generation. The rapid advancement of multi-modal large language models (MLLMs) has significantly broadened the scope of IQA, moving toward comprehensive image quality understanding that incorporates content analysis, degradation perception, and comparison reasoning beyond mere numerical scoring. Previous MLLM-based methods typically either generate numerical scores lacking interpretability or heavily rely on supervised fine-tuning (SFT) using large-scale annotated datasets to provide descriptive assessments, limiting their flexibility and applicability. In this paper, we propose Q-Insight, a reinforcement learning-based model built upon group relative policy optimization (GRPO), which demonstrates strong visual reasoning capability for image quality understanding while requiring only a limited amount of rating scores and degradation labels. By jointly optimizing score regression and degradation perception tasks with carefully designed reward functions, our approach effectively exploits their mutual benefits for enhanced performance. Extensive experiments demonstrate that Q-Insight substantially outperforms existing state-of-the-art methods in both score regression and degradation perception tasks, while exhibiting impressive zero-shot generalization to comparison reasoning tasks. Code will be available at this https URL.
zh
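
摘要中"评分回归与退化感知两类奖励联合优化"的奖励设计,可以用如下简化函数示意(阈值、权重与衰减方式均为笔者假设,并非论文的奖励函数定义):

```python
def quality_reward(pred_score, gt_score, pred_degradation, gt_degradation,
                   score_tol=0.5, w_score=0.7, w_deg=0.3):
    """评分回归奖励:预测分与真实质量分足够接近记满分,否则按误差衰减;
    退化感知奖励:退化类型(如模糊、噪声、压缩)判对记 1 分。"""
    err = abs(pred_score - gt_score)
    r_score = 1.0 if err <= score_tol else max(0.0, 1.0 - err / 5.0)
    r_deg = 1.0 if pred_degradation == gt_degradation else 0.0
    return w_score * r_score + w_deg * r_deg

print(quality_reward(3.8, 4.0, "blur", "blur"))    # 两项都正确,奖励高
print(quality_reward(2.0, 4.0, "noise", "blur"))   # 两项都较差,奖励低
```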

[CV-1] DSO: Aligning 3D Generators with Simulation Feedback for Physical Soundness

【速读】:该论文旨在解决生成式 3D 对象模型在物理约束方面的不足,特别是确保生成的 3D 对象具备自支撑(self-supporting)属性以保持重力下的平衡。传统方法通过可微分物理模拟器在推理阶段优化几何形状,但存在速度慢、不稳定且容易陷入局部最优的问题。论文的关键解决方案是提出直接模拟优化(Direct Simulation Optimization, DSO)框架,利用非可微分物理模拟器的反馈来提高生成模型直接输出稳定 3D 对象的概率。具体而言,DSO 构建了一个包含稳定性分数标注的 3D 对象数据集,并通过直接偏好优化(DPO)或直接奖励优化(DRO)方法微调生成器,其中 DRO 是一种无需成对偏好即可对齐扩散模型的新目标。实验表明,使用 DPO 或 DRO 的微调生成器不仅速度快,而且比测试时优化更可能生成稳定的对象,尤其值得注意的是,DSO 框架即使在没有真实 3D 对象训练数据的情况下也能工作,使生成器能够通过自动收集自身输出的模拟反馈实现自我改进。

链接: https://arxiv.org/abs/2503.22677
作者: Ruining Li,Chuanxia Zheng,Christian Rupprecht,Andrea Vedaldi
机构: Visual Geometry Group, University of Oxford (牛津大学视觉几何组)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Project page: this https URL

点击查看摘要

Abstract:Most 3D object generators focus on aesthetic quality, often neglecting physical constraints necessary in applications. One such constraint is that the 3D object should be self-supporting, i.e., remains balanced under gravity. Prior approaches to generating stable 3D objects used differentiable physics simulators to optimize geometry at test-time, which is slow, unstable, and prone to local optima. Inspired by the literature on aligning generative models to external feedback, we propose Direct Simulation Optimization (DSO), a framework to use the feedback from a (non-differentiable) simulator to increase the likelihood that the 3D generator outputs stable 3D objects directly. We construct a dataset of 3D objects labeled with a stability score obtained from the physics simulator. We can then fine-tune the 3D generator using the stability score as the alignment metric, via direct preference optimization (DPO) or direct reward optimization (DRO), a novel objective, which we introduce, to align diffusion models without requiring pairwise preferences. Our experiments show that the fine-tuned feed-forward generator, using either DPO or DRO objective, is much faster and more likely to produce stable objects than test-time optimization. Notably, the DSO framework works even without any ground-truth 3D objects for training, allowing the 3D generator to self-improve by automatically collecting simulation feedback on its own outputs.
zh
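
论文中的 DRO 是新提出的目标,细节以原文为准;作为参照,下面给出其对比基线 DPO 损失的标准形式的 PyTorch 示意(输入的对数似然为随机假设值):

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_win, logp_lose, ref_logp_win, ref_logp_lose, beta=0.1):
    """标准 DPO:鼓励策略相对参考模型更偏好"稳定"样本(win)而非"不稳定"样本(lose)。"""
    margin = beta * ((logp_win - ref_logp_win) - (logp_lose - ref_logp_lose))
    return -F.logsigmoid(margin).mean()

# 示例:一批偏好对在策略模型与参考模型下的对数似然(数值为随机假设)
logp_w, logp_l = torch.randn(4), torch.randn(4)
ref_w, ref_l = torch.randn(4), torch.randn(4)
print(dpo_loss(logp_w, logp_l, ref_w, ref_l).item())
```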

[CV-2] TranSplat: Lighting-Consistent Cross-Scene Object Transfer with 3D Gaussian Splatting

【速读】:该论文旨在解决两个关键问题:(1) 精确提取源场景中的3D物体;(2) 在目标场景中无须显式估计材质属性即可实现物体的真实感重照明。论文提出的TranSplat算法基于高斯点 splatting 框架,通过使用2D物体掩模驱动精细的3D分割来解决第一个问题,并利用球谐分析导出每个高斯点的辐射传输函数以适应目标场景光照环境,从而解决第二个问题。关键在于无需显式材质估计即可实现跨场景物体的真实感转移与重照明。

链接: https://arxiv.org/abs/2503.22676
作者: Boyang (Tony) Yu,Yanlin Jin,Ashok Veeraraghavan,Akshat Dave,Guha Balakrishnan
机构: Rice University (莱斯大学); Massachusetts Institute of Technology (麻省理工学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:We present TranSplat, a 3D scene rendering algorithm that enables realistic cross-scene object transfer (from a source to a target scene) based on the Gaussian Splatting framework. Our approach addresses two critical challenges: (1) precise 3D object extraction from the source scene, and (2) faithful relighting of the transferred object in the target scene without explicit material property estimation. TranSplat fits a splatting model to the source scene, using 2D object masks to drive fine-grained 3D segmentation. Following user-guided insertion of the object into the target scene, along with automatic refinement of position and orientation, TranSplat derives per-Gaussian radiance transfer functions via spherical harmonic analysis to adapt the object’s appearance to match the target scene’s lighting environment. This relighting strategy does not require explicitly estimating physical scene properties such as BRDFs. Evaluated on several synthetic and real-world scenes and objects, TranSplat yields excellent 3D object extractions and relighting performance compared to recent baseline methods and visually convincing cross-scene object transfers. We conclude by discussing the limitations of the approach.
zh

[CV-3] Understanding Co-speech Gestures in-the-wild

【速读】:该论文旨在解决非语言交流中手势理解的问题,并提出了一种新的框架来评估模型在手势-文本-语音关联上的理解能力。具体而言,论文引入了三个新任务和相应的基准测试:基于手势的检索(gesture-based retrieval)、带手势的词定位(gestured word spotting)以及利用手势的主动说话人检测(active speaker detection using gestures)。为了解决这些问题,论文提出了一种学习三模态(语音-文本-视频-手势)表示的方法。关键在于结合全局短语对比损失(global phrase contrastive loss)和局部手势-词语耦合损失(local gesture-word coupling loss),从而以弱监督的方式从野外视频中学习到强大的手势表示。实验结果表明,所提出的表示方法在所有三项任务中均优于先前的方法,包括大型视觉语言模型(Vision-Language Models, VLMs)。进一步分析显示,语音和文本模态捕获了不同的与手势相关的信息信号,强调了共享三模态嵌入空间学习的优势。

链接: https://arxiv.org/abs/2503.22668
作者: Sindhu B Hegde,K R Prajwal,Taein Kwon,Andrew Zisserman
机构: Visual Geometry Group, University of Oxford (视觉几何组, 牛津大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Main paper - 11 pages, 4 figures, Supplementary - 5 pages, 4 figures

点击查看摘要

Abstract:Co-speech gestures play a vital role in non-verbal communication. In this paper, we introduce a new framework for co-speech gesture understanding in the wild. Specifically, we propose three new tasks and benchmarks to evaluate a model’s capability to comprehend gesture-text-speech associations: (i) gesture-based retrieval, (ii) gestured word spotting, and (iii) active speaker detection using gestures. We present a new approach that learns a tri-modal speech-text-video-gesture representation to solve these tasks. By leveraging a combination of global phrase contrastive loss and local gesture-word coupling loss, we demonstrate that a strong gesture representation can be learned in a weakly supervised manner from videos in the wild. Our learned representations outperform previous methods, including large vision-language models (VLMs), across all three tasks. Further analysis reveals that speech and text modalities capture distinct gesture-related signals, underscoring the advantages of learning a shared tri-modal embedding space. The dataset, model, and code are available at: this https URL
zh

[CV-4] Unicorn: Text-Only Data Synthesis for Vision Language Model Training

【速读】:该论文试图解决在训练视觉语言模型(Vision-Language Models, VLMs)时,高质量图像-文本配对数据采集或合成成本高昂的问题。为应对这一挑战,论文提出了一种跨集成的三阶段多模态数据合成框架。其关键是通过大规模语言模型(Large Language Models, LLMs)扩展稀疏的文本种子,在不依赖真实图像的情况下生成高质量的多样化图像表示,同时构造出Unicorn-1.2M用于预训练,以及Unicorn-471K-Instruction用于指令微调。这种方法在保持数据质量和多样性的同时显著降低了训练成本,提供了高效且可扩展的解决方案。

链接: https://arxiv.org/abs/2503.22655
作者: Xiaomin Yu,Pengxiang Ding,Wenjie Zhang,Siteng Huang,Songyang Gao,Chengwei Qin,Kejian Wu,Zhaoxin Fan,Ziyue Qiao,Donglin Wang
机构: 未知
类目: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
备注:

点击查看摘要

Abstract:Training vision-language models (VLMs) typically requires large-scale, high-quality image-text pairs, but collecting or synthesizing such data is costly. In contrast, text data is abundant and inexpensive, prompting the question: can high-quality multimodal training data be synthesized purely from text? To tackle this, we propose a cross-integrated three-stage multimodal data synthesis framework, which generates two datasets: Unicorn-1.2M and Unicorn-471K-Instruction. In Stage 1: Diverse Caption Data Synthesis, we construct 1.2M semantically diverse high-quality captions by expanding sparse caption seeds using large language models (LLMs). In Stage 2: Instruction-Tuning Data Generation, we further process 471K captions into multi-turn instruction-tuning tasks to support complex reasoning. Finally, in Stage 3: Modality Representation Transfer, these textual captions representations are transformed into visual representations, resulting in diverse synthetic image representations. This three-stage process enables us to construct Unicorn-1.2M for pretraining and Unicorn-471K-Instruction for instruction-tuning, without relying on real images. By eliminating the dependency on real images while maintaining data quality and diversity, our framework offers a cost-effective and scalable solution for VLMs training. Code is available at this https URL.
zh

[CV-5] Zero4D: Training-Free 4D Video Generation From Single Video Using Off-the-Shelf Video Diffusion Model

【速读】:该论文旨在解决现有四维(4D)视频生成方法面临的两大挑战:一是依赖多视频扩散模型或计算密集型的全4D扩散模型训练,导致实际应用受限;二是由于真实世界4D数据稀缺及高昂计算成本,难以实现高效生成。为应对这些限制,论文提出了一种无需训练的4D视频生成方法,利用现成的视频扩散模型从单一输入视频生成多视角视频。其关键解决方案包括两步:首先通过将时空采样网格中的边缘帧指定为关键帧,并采用基于深度的扭曲技术引导视频扩散模型合成这些关键帧,确保生成帧间的结构一致性与时空连贯性;其次利用视频扩散模型插值剩余帧,构建完全填充且时间连贯的采样网格,同时保持空间和时间的一致性。此方法无需额外训练,充分利用现成模型,为多视角视频生成提供了实用有效的途径。

链接: https://arxiv.org/abs/2503.22622
作者: Jangho Park,Taesung Kwon,Jong Chul Ye
机构: KAIST (韩国科学技术院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: project page: this https URL

点击查看摘要

Abstract:Recently, multi-view or 4D video generation has emerged as a significant research topic. Nonetheless, recent approaches to 4D generation still struggle with fundamental limitations, as they primarily rely on harnessing multiple video diffusion models with additional training or compute-intensive training of a full 4D diffusion model with limited real-world 4D data and large computational costs. To address these challenges, here we propose the first training-free 4D video generation method that leverages the off-the-shelf video diffusion models to generate multi-view videos from a single input video. Our approach consists of two key steps: (1) By designating the edge frames in the spatio-temporal sampling grid as key frames, we first synthesize them using a video diffusion model, leveraging a depth-based warping technique for guidance. This approach ensures structural consistency across the generated frames, preserving spatial and temporal coherence. (2) We then interpolate the remaining frames using a video diffusion model, constructing a fully populated and temporally coherent sampling grid while preserving spatial and temporal consistency. Through this approach, we extend a single video into a multi-view video along novel camera trajectories while maintaining spatio-temporal consistency. Our method is training-free and fully utilizes an off-the-shelf video diffusion model, offering a practical and effective solution for multi-view video generation.
zh
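
可以把上述两步流程理解为对"相机视角 × 时间"采样网格的填充:先合成网格边缘的关键帧,再插值其余位置。下面的小脚本只演示这种网格调度顺序,不涉及任何扩散模型(网格尺寸为假设):

```python
import numpy as np

n_views, n_frames = 5, 7
grid = np.full((n_views, n_frames), ".", dtype=object)   # "." 表示待生成

# 第一步:把时空采样网格的边缘帧设为关键帧,由深度扭曲引导的视频扩散模型先行合成
grid[0, :] = grid[-1, :] = "K"      # 首尾视角的整条时间轴
grid[:, 0] = grid[:, -1] = "K"      # 所有视角的首尾帧

# 第二步:其余位置由视频扩散模型插值补全
grid[grid == "."] = "I"

for row in grid:
    print(" ".join(row))
# K = 关键帧(深度引导合成),I = 插值帧
```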

[CV-6] Audio-Plane: Audio Factorization Plane Gaussian Splatting for Real-Time Talking Head Synthesis

【速读】:该论文致力于解决动态 Talking Head(虚拟口型生成)在高质量与实时性之间难以平衡的问题。传统方法往往因直接存储密集的 4D 网格而导致计算成本高昂且扩展性不足,尤其是在长时间序列中。为此,论文提出了一种创新性的 Audio Factorization Plane (Audio-Plane) 基于高斯点撒(Gaussian Splatting)的方法作为解决方案的核心。关键在于将 4D 体积表示分解为与音频无关的空间平面和与音频相关的平面,从而实现紧凑且可解释的特征表示,同时支持更精确的音频感知空间编码及增强的音频驱动唇部动态建模。此外,通过引入动态点撒方法进一步优化了嘴部区域的动态建模能力,最终实现了高度真实的实时 Talking Head 合成,并确保了精确的音频-唇同步效果。

链接: https://arxiv.org/abs/2503.22605
作者: Shuai Shen,Wanhua Li,Yunpeng Zhang,Weipeng Hu,Yap-Peng Tan
机构: Nanyang Technological University (南洋理工大学); Harvard University (哈佛大学); PhiGent Robotics (智谱机器人)
类目: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注:

点击查看摘要

Abstract:Talking head synthesis has become a key research area in computer graphics and multimedia, yet most existing methods often struggle to balance generation quality with computational efficiency. In this paper, we present a novel approach that leverages an Audio Factorization Plane (Audio-Plane) based Gaussian Splatting for high-quality and real-time talking head generation. For modeling a dynamic talking head, 4D volume representation is needed. However, directly storing a dense 4D grid is impractical due to the high cost and lack of scalability for longer durations. We overcome this challenge with the proposed Audio-Plane, where the 4D volume representation is decomposed into audio-independent space planes and audio-dependent planes. This provides a compact and interpretable feature representation for talking head, facilitating more precise audio-aware spatial encoding and enhanced audio-driven lip dynamic modeling. To further improve speech dynamics, we develop a dynamic splatting method that helps the network more effectively focus on modeling the dynamics of the mouth region. Extensive experiments demonstrate that by integrating these innovations with the powerful Gaussian Splatting, our method is capable of synthesizing highly realistic talking videos in real time while ensuring precise audio-lip synchronization. Synthesized results are available in this https URL.
zh

[CV-7] Using AI to Summarize US Presidential Campaign TV Advertisement Videos 1952-2012

【速读】:本文旨在解决大规模US总统竞选电视广告数据集的自动化准备、转录及高质量总结的问题。过去,由于手动采集与标注的繁琐性,研究者多依赖于较小的数据子集。为应对这一挑战,论文设计了一个大规模并行化的基于AI的分析管道(AI-based analysis pipeline),通过自动化流程显著提升了处理效率。该方法的关键在于结合机器学习技术,实现从视频数据到可搜索文本及高质量摘要的高效转换,并通过广泛的人工评估验证了其输出质量与人工生成结果相当。此外,论文展示了如何利用大型语言模型(LLM-based tools)工具为其他视频数据集生成高质量总结,从而推动更广泛的学术研究。

链接: https://arxiv.org/abs/2503.22589
作者: Adam Breuer,Bryce J. Dietrich,Michael H. Crespin,Matthew Butler,J.A. Pyrse,Kosuke Imai
机构: Dartmouth College (达特茅斯学院); Purdue University (普渡大学); University of Oklahoma (俄克拉荷马大学); University of Iowa (爱荷华大学); Harvard University (哈佛大学)
类目: Multimedia (cs.MM); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 17 pages, 7 tables, 4 figures, and linked datasets

点击查看摘要

Abstract:This paper introduces the largest and most comprehensive dataset of US presidential campaign television advertisements, available in digital format. The dataset also includes machine-searchable transcripts and high-quality summaries designed to facilitate a variety of academic research. To date, there has been great interest in collecting and analyzing US presidential campaign advertisements, but the need for manual procurement and annotation led many to rely on smaller subsets. We design a large-scale parallelized, AI-based analysis pipeline that automates the laborious process of preparing, transcribing, and summarizing videos. We then apply this methodology to the 9,707 presidential ads from the Julian P. Kanter Political Commercial Archive. We conduct extensive human evaluations to show that these transcripts and summaries match the quality of manually generated alternatives. We illustrate the value of this data by including an application that tracks the genesis and evolution of current focal issue areas over seven decades of presidential elections. Our analysis pipeline and codebase also show how to use LLM-based tools to obtain high-quality summaries for other video datasets.
zh

[CV-8] Next-Best-Trajectory Planning of Robot Manipulators for Effective Observation and Exploration ICRA

【速读】:该论文旨在解决机器人在动态环境中高效采集视觉观测数据的问题。传统方法依赖于大规模标注数据集,而这些数据集的获取成本高且耗时。为应对这一挑战,论文提出了一种基于Next-Best-Trajectory原则的新策略,用于机器人操作器的自主观察与探索。解决方案的关键在于结合局部轨迹生成与全局遍历路径规划:局部轨迹通过最大化观测信息增益并避免碰撞来优化路径规划;同时,利用体素地图建模环境,并通过兴趣点周围视角的射线投射估算信息增益。此外,引入全局遍历轨迹规划器作为参考,以提升探索效率并防止陷入局部最优解。为了提高计算效率,信息增益的估算采用GPU并行化处理。实验结果验证了所提策略的有效性及并行化的优势。

链接: https://arxiv.org/abs/2503.22588
作者: Heiko Renz,Maximilian Krämer,Frank Hoffmann,Torsten Bertram
机构: Institute of Control Theory and Systems Engineering, TU Dortmund University (多特蒙德工业大学控制理论与系统工程研究所), Dortmund, Germany
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
备注: Accepted for publication at the IEEE International Conference on Robotics and Automation (ICRA), 2025

点击查看摘要

Abstract:Visual observation of objects is essential for many robotic applications, such as object reconstruction and manipulation, navigation, and scene understanding. Machine learning algorithms constitute the state-of-the-art in many fields but require vast data sets, which are costly and time-intensive to collect. Automated strategies for observation and exploration are crucial to enhance the efficiency of data gathering. Therefore, a novel strategy utilizing the Next-Best-Trajectory principle is developed for a robot manipulator operating in dynamic environments. Local trajectories are generated to maximize the information gained from observations along the path while avoiding collisions. We employ a voxel map for environment modeling and utilize raycasting from perspectives around a point of interest to estimate the information gain. A global ergodic trajectory planner provides an optional reference trajectory to the local planner, improving exploration and helping to avoid local minima. To enhance computational efficiency, raycasting for estimating the information gain in the environment is executed in parallel on the graphics processing unit. Benchmark results confirm the efficiency of the parallelization, while real-world experiments demonstrate the strategy’s effectiveness.
zh
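
摘要中"在体素地图上用射线投射估计候选视角的信息增益"可用如下 NumPy 极简示意(以路径上未知体素的数量近似信息增益,地图、步长与候选点均为假设,并非论文的具体公式):

```python
import numpy as np

UNKNOWN, FREE, OCCUPIED = -1, 0, 1
rng = np.random.default_rng(0)
voxel_map = rng.choice([UNKNOWN, FREE, OCCUPIED], size=(40, 40, 40), p=[0.5, 0.4, 0.1])

def ray_information_gain(origin, direction, n_steps=60, step=0.5):
    """沿射线步进:遇到占据体素即停止,路径上每个未知体素贡献 1 的信息增益。"""
    gain, pos = 0, np.asarray(origin, dtype=float)
    d = np.asarray(direction, dtype=float)
    d /= np.linalg.norm(d)
    for _ in range(n_steps):
        pos = pos + step * d
        idx = tuple(pos.astype(int))
        if any(i < 0 or i >= s for i, s in zip(idx, voxel_map.shape)):
            break
        v = voxel_map[idx]
        if v == OCCUPIED:
            break
        gain += (v == UNKNOWN)
    return gain

def viewpoint_gain(origin, n_rays=50):
    dirs = rng.normal(size=(n_rays, 3))
    return sum(ray_information_gain(origin, d) for d in dirs)

candidates = [(5, 5, 5), (20, 20, 20), (35, 10, 30)]
best = max(candidates, key=viewpoint_gain)
print("next-best viewpoint:", best)
```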

[CV-9] Breaking Language Barriers in Visual Language Models via Multilingual Textual Regularization

【速读】:该论文试图解决视觉语言模型(Visual Language Models, VLMs)在多语言场景下的图像诱导保真度损失(Image-induced Fidelity Loss, IFL)问题,即尽管输入可能包含多种语言,但模型通常仅生成英语响应。这一现象源于有限的多模态多语言训练数据。为了解决此问题,论文提出了一种连续多语言集成策略,在视觉指令微调过程中注入纯文本多语言数据,以保留语言模型原有的多语言能力。该方案的关键在于通过在微调阶段引入多语言文本数据,有效提升了跨语言的语义保真度,同时未损害视觉性能。此外,论文还探讨了模型合并方法,虽然提高了语言保真度,但会牺牲部分视觉性能。相比之下,所提出的核心方法实现了稳健的多语言对齐,且无需权衡取舍,提供了一条可扩展且有效的路径来缓解IFL问题,促进全球范围内的VLM应用。

链接: https://arxiv.org/abs/2503.22577
作者: Iñigo Pikabea,Iñaki Lacunza,Oriol Pareras,Carlos Escolano,Aitor Gonzalez-Agirre,Javier Hernando,Marta Villegas
机构: Barcelona Supercomputing Center (巴塞罗那超级计算中心); Universitat Politècnica de Catalunya (加泰罗尼亚理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Rapid advancements in Visual Language Models (VLMs) have transformed multimodal understanding but are often constrained by generating English responses regardless of the input language. This phenomenon has been termed as Image-induced Fidelity Loss (IFL) and stems from limited multimodal multilingual training data. To address this, we propose a continuous multilingual integration strategy that injects text-only multilingual data during visual instruction tuning, preserving the language model’s original multilingual capabilities. Extensive evaluations demonstrate that our approach significantly improves linguistic fidelity across languages without degradation in visual performance. We also explore model merging, which improves language fidelity but comes at the cost of visual performance. In contrast, our core method achieves robust multilingual alignment without trade-offs, offering a scalable and effective path to mitigating IFL for global VLM adoption.
zh
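
文中"在视觉指令微调过程中持续注入纯文本多语言数据"的做法,大体相当于按一定比例混合两类训练样本;下面是一个示意性的混合采样器(比例与样本格式均为假设):

```python
import random

def mixed_batches(multimodal_data, textonly_multilingual_data,
                  text_ratio=0.3, n_batches=5, batch_size=4):
    """按给定比例从多模态样本与纯文本多语言样本中混合采样,用于指令微调。"""
    for _ in range(n_batches):
        batch = []
        for _ in range(batch_size):
            pool = textonly_multilingual_data if random.random() < text_ratio else multimodal_data
            batch.append(random.choice(pool))
        yield batch

mm = [("<image>", "Describe the image.", "en")]
txt = [(None, "Describe un día lluvioso.", "es"), (None, "Beschreibe einen Hund.", "de")]
for b in mixed_batches(mm, txt):
    print([lang for _, _, lang in b])
```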

[CV-10] Image Decomposition with G-norm Weighted by Total Symmetric Variation

【速读】:本文旨在提出一种新颖的变分模型,用于将图像分解为其对应的卡通部分和纹理部分。论文的关键在于通过图像的全对称变分(Total Symmetric Variation, TSV)刻画有界变差(Bounded Variation, BV)图像的某些非局部特性,并证明TSV在识别区域边界方面的有效性。基于这一性质,作者引入加权Meyer’s G-范数以区分纹理内部与轮廓边缘。对于具有有界TSV的BV图像,论文证明所提出的模型存在解。此外,设计了一种基于算子分裂的快速算法来处理相关的非凸优化问题。方法的有效性通过一系列数值实验得到验证。

链接: https://arxiv.org/abs/2503.22560
作者: Roy Y. He,Martin Huska,Hao Liu
机构: City University of Hong Kong (香港城市大学); University of Bologna (博洛尼亚大学); Hong Kong Baptist University (香港浸会大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:In this paper, we propose a novel variational model for decomposing images into their respective cartoon and texture parts. Our model characterizes certain non-local features of any Bounded Variation (BV) image by its Total Symmetric Variation (TSV). We demonstrate that TSV is effective in identifying regional boundaries. Based on this property, we introduce a weighted Meyer’s G-norm to identify texture interiors without including contour edges. For BV images with bounded TSV, we show that the proposed model admits a solution. Additionally, we design a fast algorithm based on operator-splitting to tackle the associated non-convex optimization problem. The performance of our method is validated by a series of numerical experiments.
zh

[CV-11] MO-CTranS: A unified multi-organ segmentation model learning from multiple heterogeneously labelled datasets

【速读】:该论文旨在解决多器官分割任务中因数据标注不一致和数据分布不平衡导致的挑战,特别是在多个小规模部分标注的数据集上训练鲁棒模型的问题。传统方法通常为每个数据集单独训练模型,未能有效利用数据。为克服这些问题,论文提出了一种名为MO-CTranS的单模型解决方案,其关键在于结合CNN-based编码器和Transformer-based解码器,并以多分辨率方式连接,同时引入任务特定标记(token)以缓解标签冲突问题,从而实现跨数据集的高效学习与分割性能提升。

链接: https://arxiv.org/abs/2503.22557
作者: Zhendi Gong,Susan Francis,Eleanor Cox,Stamatios N. Sotiropoulos,Dorothee P. Auer,Guoping Qiu,Andrew P. French,Xin Chen
机构: School of Computer Science, University of Nottingham (诺丁汉大学计算机学院); School of Physics, University of Nottingham (诺丁汉大学物理学院); School of Medicine, University of Nottingham (诺丁汉大学医学院); School of Computer Science, University of Nottingham Ningbo China (诺丁汉大学宁波校区计算机学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by International Symposium on Biomedical Imaging (ISBI) 2025 as an oral presentation

点击查看摘要

Abstract:Multi-organ segmentation holds paramount significance in many clinical tasks. In practice, compared to large fully annotated datasets, multiple small datasets are often more accessible and organs are not labelled consistently. Normally, an individual model is trained for each of these datasets, which is not an effective way of using data for model learning. It remains challenging to train a single model that can robustly learn from several partially labelled datasets due to label conflict and data imbalance problems. We propose MO-CTranS: a single model that can overcome such problems. MO-CTranS contains a CNN-based encoder and a Transformer-based decoder, which are connected in a multi-resolution manner. Task-specific tokens are introduced in the decoder to help differentiate label discrepancies. Our method was evaluated and compared to several baseline models and state-of-the-art (SOTA) solutions on abdominal MRI datasets that were acquired in different views (i.e. axial and coronal) and annotated for different organs (i.e. liver, kidney, spleen). Our method achieved better performance (most were statistically significant) than the compared methods. Github link: this https URL.
zh
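
"多个部分标注数据集联合训练"的标签冲突问题有一种常见的处理思路:只在当前数据集实际标注的器官通道上计算损失。下面给出该思路的 PyTorch 示意(注意这只是概念说明,与 MO-CTranS 采用的任务特定 token 设计并不相同):

```python
import torch
import torch.nn.functional as F

ORGANS = ["liver", "kidney", "spleen"]

def partial_label_loss(logits, target, labelled_organs):
    """logits/target: [B, C, H, W],target 为二值掩模;
    仅对该数据集中实际标注的器官通道计算 BCE 损失,其余通道不回传梯度。"""
    mask = torch.zeros(len(ORGANS))
    for organ in labelled_organs:
        mask[ORGANS.index(organ)] = 1.0
    per_channel = F.binary_cross_entropy_with_logits(logits, target, reduction="none")
    per_channel = per_channel.mean(dim=(0, 2, 3))          # [C]
    return (per_channel * mask).sum() / mask.sum()

logits = torch.randn(2, 3, 64, 64)
target = (torch.rand(2, 3, 64, 64) > 0.5).float()
print(partial_label_loss(logits, target, labelled_organs=["liver", "kidney"]).item())
```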

[CV-12] LIM: Large Interpolator Model for Dynamic Reconstruction

【速读】:该论文旨在解决从视频数据重建动态资产的问题,这一任务在计算机视觉与图形学中具有重要地位。现有方法受限于类别特定模型或基于优化的缓慢方法。论文提出了一种基于Transformer的前馈解决方案——大型插值模型(Large Interpolation Model, LIM),其关键在于引入了一种新颖的因果一致性损失函数,用于引导隐式三维表示的时间插值。LIM能够在秒级时间内生成高质量的连续时间插值帧,并支持显式的网格跟踪,从而提供一致的UV纹理网格序列,便于集成到现有的生产流程中。此外,结合基于扩散的多视图生成器,LIM还可用于从单目视频生成动态四维重建。通过在多个动态数据集上的评估,LIM相较于图像空间插值方法(如FiLM)和直接三平面线性插值展示了明显的优势。总之,LIM是首个能够以高速实现跨类别追踪的四维资产重建的前馈模型。

链接: https://arxiv.org/abs/2503.22537
作者: Remy Sabathier,Niloy J. Mitra,David Novotny
机构: University College London (伦敦大学学院); Meta (Meta); University College London (伦敦大学学院); Meta (Meta)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Reconstructing dynamic assets from video data is central to many tasks in computer vision and graphics. Existing 4D reconstruction approaches are limited by category-specific models or slow optimization-based methods. Inspired by the recent Large Reconstruction Model (LRM), we present the Large Interpolation Model (LIM), a transformer-based feed-forward solution, guided by a novel causal consistency loss, for interpolating implicit 3D representations across time. Given implicit 3D representations at times t_0 and t_1 , LIM produces a deformed shape at any continuous time t\in[t_0,t_1] , delivering high-quality interpolated frames in seconds. Furthermore, LIM allows explicit mesh tracking across time, producing a consistently uv-textured mesh sequence ready for integration into existing production pipelines. We also use LIM, in conjunction with a diffusion-based multiview generator, to produce dynamic 4D reconstructions from monocular videos. We evaluate LIM on various dynamic datasets, benchmarking against image-space interpolation methods (e.g., FiLM) and direct triplane linear interpolation, and demonstrate clear advantages. In summary, LIM is the first feed-forward model capable of high-speed tracked 4D asset reconstruction across diverse categories.
zh

[CV-13] AnnoPage Dataset: Dataset of Non-Textual Elements in Documents with Fine-Grained Categorization ICDAR25

【速读】:该论文旨在构建一个名为AnnoPage Dataset的数据集,用于支持文档布局分析和对象检测领域的研究。论文试图解决的问题是如何系统性地标注和提供多样化的历史文档图像数据,以促进相关算法的发展与评估。解决方案的关键在于创建了一个包含7550页历史文档的高质量数据集,这些文档主要使用轴对齐边界框(Axis-Aligned Bounding Boxes, AABB)标注了25类非文本元素(如图像、地图、装饰性元素或图表),并遵循捷克图像文档处理方法学。通过由专业图书管理员进行的精确且一致的标注,以及整合多个历史文档数据集来增强数据集的多样性和连续性,该数据集被划分为开发集和测试集,并提供了基于YOLO和DETR的目标检测器的基线结果,从而为未来的研究提供参考基准。

链接: https://arxiv.org/abs/2503.22526
作者: Martin Kišš,Michal Hradiš,Martina Dvořáková,Václav Jiroušek,Filip Kersch
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 15 pages, 2 tables, 6 figures; Submitted to ICDAR25

点击查看摘要

Abstract:We introduce the AnnoPage Dataset, a novel collection of 7550 pages from historical documents, primarily in Czech and German, spanning from 1485 to the present, focusing on the late 19th and early 20th centuries. The dataset is designed to support research in document layout analysis and object detection. Each page is annotated with axis-aligned bounding boxes (AABB) representing elements of 25 categories of non-textual elements, such as images, maps, decorative elements, or charts, following the Czech Methodology of image document processing. The annotations were created by expert librarians to ensure accuracy and consistency. The dataset also incorporates pages from multiple, mainly historical, document datasets to enhance variability and maintain continuity. The dataset is divided into development and test subsets, with the test set carefully selected to maintain the category distribution. We provide baseline results using YOLO and DETR object detectors, offering a reference point for future research. The AnnoPage Dataset is publicly available on Zenodo (this https URL), along with ground-truth annotations in YOLO format.
zh
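
数据集的标注以 YOLO 格式发布(每行为类别编号与归一化的中心点坐标、宽、高);下面是一个把这种标注换算为像素级 AABB 的小工具示意(文件路径为假设):

```python
def load_yolo_boxes(label_path, image_width, image_height):
    """读取 YOLO 格式标注,返回 (class_id, x_min, y_min, x_max, y_max) 的像素坐标列表。"""
    boxes = []
    with open(label_path, encoding="utf-8") as f:
        for line in f:
            cls, xc, yc, w, h = line.split()
            cls = int(cls)
            xc, yc, w, h = (float(v) for v in (xc, yc, w, h))
            x_min = (xc - w / 2) * image_width
            y_min = (yc - h / 2) * image_height
            x_max = (xc + w / 2) * image_width
            y_max = (yc + h / 2) * image_height
            boxes.append((cls, x_min, y_min, x_max, y_max))
    return boxes

# 示例用法(路径为假设):
# boxes = load_yolo_boxes("annopage/labels/page_0001.txt", image_width=2048, image_height=3072)
```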

[CV-14] Masked Self-Supervised Pre-Training for Text Recognition Transformers on Large-Scale Datasets ICDAR25

【速读】:该论文旨在解决文本识别任务中利用大规模无标注数据提升模型性能的问题。论文的关键解决方案在于提出了一种基于掩码自监督预训练(Masked Self-Supervised Pre-Training)的方法,具体包括两个创新点:一是逐步增加掩码概率(progressively increasing the masking probability),二是修改损失函数以同时考虑掩码区域与非掩码区域(modifying the loss function to incorporate both masked and non-masked patches)。通过这些改进,论文验证了所提出的自监督预训练方法在字符错误率(Character Error Rate, CER)上的显著改进,某些情况下相对提升了多达30%,且其效果可媲美迁移学习(Transfer Learning),但无需依赖额外的标注数据。

链接: https://arxiv.org/abs/2503.22513
作者: Martin Kišš,Michal Hradiš
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 18 pages, 7 tables, 6 figures; Submitted to ICDAR25

点击查看摘要

Abstract:Self-supervised learning has emerged as a powerful approach for leveraging large-scale unlabeled data to improve model performance in various domains. In this paper, we explore masked self-supervised pre-training for text recognition transformers. Specifically, we propose two modifications to the pre-training phase: progressively increasing the masking probability, and modifying the loss function to incorporate both masked and non-masked patches. We conduct extensive experiments using a dataset of 50M unlabeled text lines for pre-training and four differently sized annotated datasets for fine-tuning. Furthermore, we compare our pre-trained models against those trained with transfer learning, demonstrating the effectiveness of the self-supervised pre-training. In particular, pre-training consistently improves the character error rate of models, in some cases up to 30% relatively. It is also on par with transfer learning but without relying on extra annotated text lines.
zh
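
论文对预训练的两处修改(逐步提高掩码概率、损失同时覆盖掩码与未掩码 patch)可以用如下示意代码表达(线性调度与损失权重均为笔者假设,并非论文设定):

```python
import torch

def masking_probability(step, total_steps, p_start=0.15, p_end=0.6):
    """训练过程中线性提高掩码概率(具体调度形式为假设)。"""
    t = min(step / max(total_steps, 1), 1.0)
    return p_start + t * (p_end - p_start)

def masked_pretrain_loss(recon, target, mask, alpha=1.0, beta=0.1):
    """同时在掩码与未掩码 patch 上计算重建误差,掩码部分权重更高。
    recon/target: [B, N, D],mask: [B, N](1 表示被掩码)。"""
    err = (recon - target).pow(2).mean(dim=-1)             # [B, N]
    loss_masked = (err * mask).sum() / mask.sum().clamp(min=1)
    loss_visible = (err * (1 - mask)).sum() / (1 - mask).sum().clamp(min=1)
    return alpha * loss_masked + beta * loss_visible

recon, target = torch.randn(2, 50, 128), torch.randn(2, 50, 128)
p = masking_probability(step=10_000, total_steps=100_000)
mask = (torch.rand(2, 50) < p).float()
print(p, masked_pretrain_loss(recon, target, mask).item())
```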

[CV-15] Scenario Dreamer: Vectorized Latent Diffusion for Generating Driving Simulation Environments CVPR2025

【速读】:该论文旨在解决现有自动驾驶车辆规划仿真环境生成方法中存在的两个主要问题:一是基于栅格化图像的初始交通场景表示导致网络参数冗余且计算效率低下;二是采用基于规则的代理行为缺乏多样性和真实性。为了解决这些问题,论文提出了Scenario Dreamer,其关键在于采用了新颖的向量化潜在扩散模型(vectorized latent diffusion model)直接处理向量化的场景元素以生成初始场景,并使用自回归Transformer进行数据驱动的代理行为模拟。此外,通过扩散修复技术(diffusion inpainting),Scenario Dreamer还支持场景外推,从而实现无界仿真环境的生成。实验结果表明,Scenario Dreamer在真实感和效率方面均优于现有方法,其基础场景生成模型的参数减少了约2倍,生成延迟降低了6倍,GPU训练时间减少了10倍,同时证明了其在强化学习规划代理训练中的实用价值。

链接: https://arxiv.org/abs/2503.22496
作者: Luke Rowe,Roger Girgis,Anthony Gosselin,Liam Paull,Christopher Pal,Felix Heide
机构: Mila; Université de Montréal (蒙特利尔大学); Polytechnique Montréal (蒙特利尔理工学院); Princeton University (普林斯顿大学); CIFAR AI Chair (CIFAR人工智能主席); Torc Robotics
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注: CVPR 2025

点击查看摘要

Abstract:We introduce Scenario Dreamer, a fully data-driven generative simulator for autonomous vehicle planning that generates both the initial traffic scene - comprising a lane graph and agent bounding boxes - and closed-loop agent behaviours. Existing methods for generating driving simulation environments encode the initial traffic scene as a rasterized image and, as such, require parameter-heavy networks that perform unnecessary computation due to many empty pixels in the rasterized scene. Moreover, we find that existing methods that employ rule-based agent behaviours lack diversity and realism. Scenario Dreamer instead employs a novel vectorized latent diffusion model for initial scene generation that directly operates on the vectorized scene elements and an autoregressive Transformer for data-driven agent behaviour simulation. Scenario Dreamer additionally supports scene extrapolation via diffusion inpainting, enabling the generation of unbounded simulation environments. Extensive experiments show that Scenario Dreamer outperforms existing generative simulators in realism and efficiency: the vectorized scene-generation base model achieves superior generation quality with around 2x fewer parameters, 6x lower generation latency, and 10x fewer GPU training hours compared to the strongest baseline. We confirm its practical utility by showing that reinforcement learning planning agents are more challenged in Scenario Dreamer environments than traditional non-generative simulation environments, especially on long and adversarial driving environments.
zh

[CV-16] SemAlign3D: Semantic Correspondence between RGB-Images through Aligning 3D Object-Class Representations CVPR2025

【速读】:该论文旨在解决现有大型视觉模型(Large Vision Models, LVM)在捕捉语义对象区域之间全局几何关系方面的不足,这一问题导致图像间极端视点变化下的语义对应性能不可靠。论文的关键解决方案是利用单目深度估计来捕获这些几何关系,从而实现更鲁棒且数据高效的语义对应。具体而言,首先通过稀疏标注的图像对应数据集,从单目深度估计和LVM特征构建有效的三维对象类别表示;其次,提出一种可使用梯度下降最小化的对齐能量函数,以实现三维对象类别表示与输入RGB图像中的对象实例之间的对齐。此方法在SPair-71k数据集的多个类别上实现了最先进的匹配精度,显著提升了三个类别的PCK@0.1分数超过10个百分点,并整体提高了3.3个百分点,从85.6%提升至88.9%。

链接: https://arxiv.org/abs/2503.22462
作者: Krispin Wandel,Hesheng Wang
机构: Shanghai Jiao Tong University (上海交通大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to CVPR 2025. Poster: this https URL

点击查看摘要

Abstract:Semantic correspondence made tremendous progress through the recent advancements of large vision models (LVM). While these LVMs have been shown to reliably capture local semantics, the same can currently not be said for capturing global geometric relationships between semantic object regions. This problem leads to unreliable performance for semantic correspondence between images with extreme view variation. In this work, we aim to leverage monocular depth estimates to capture these geometric relationships for more robust and data-efficient semantic correspondence. First, we introduce a simple but effective method to build 3D object-class representations from monocular depth estimates and LVM features using a sparsely annotated image correspondence dataset. Second, we formulate an alignment energy that can be minimized using gradient descent to obtain an alignment between the 3D object-class representation and the object-class instance in the input RGB-image. Our method achieves state-of-the-art matching accuracy in multiple categories on the challenging SPair-71k dataset, increasing the PCK@0.1 score by more than 10 points on three categories and overall by 3.3 points from 85.6% to 88.9%. Additional resources and code are available at this https URL.
zh

[CV-17] EndoLRMGS: Complete Endoscopic Scene Reconstruction combining Large Reconstruction Modelling and Gaussian Splatting

【速读】:该论文旨在解决机器人辅助手术(RAS)中手术场景完整重建的问题,现有深度估计方法在处理深度不连续性时表现不佳,导致物体边界处预测噪声较大且无法实现完全重建(忽略被遮挡表面)。为了解决这些问题,论文提出了一种结合大尺度重建建模(Large Reconstruction Modelling, LRM)和高斯点撒法(Gaussian Splatting, GS)的方法EndoLRMGS。其关键是通过GS重建可变形组织,利用LRM生成手术工具的3D模型,并引入正交视角联合投影优化(Orthogonal Perspective Joint Projection Optimization, OPjPO)来优化位置和比例,从而提高重建精度。实验结果表明,该方法显著提升了工具3D模型在二维投影中的IoU,并大幅改善了工具投影的PSNR以及组织渲染质量的PSNR和SSIM指标。

链接: https://arxiv.org/abs/2503.22437
作者: Xu Wang,Shuai Zhang,Baoru Huang,Danail Stoyanov,Evangelos B. Mazomenos
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Complete reconstruction of surgical scenes is crucial for robot-assisted surgery (RAS). Deep depth estimation is promising, but existing works struggle with depth discontinuities, resulting in noisy predictions at object boundaries, and do not achieve complete reconstruction, omitting occluded surfaces. To address these issues we propose EndoLRMGS, which combines Large Reconstruction Modelling (LRM) and Gaussian Splatting (GS) for complete surgical scene reconstruction. GS reconstructs deformable tissues and LRM generates 3D models for surgical tools, while position and scale are subsequently optimized by introducing orthogonal perspective joint projection optimization (OPjPO) to enhance accuracy. In experiments on four surgical videos from three public datasets, our method improves the Intersection-over-union (IoU) of tool 3D models in 2D projections by 40%. Additionally, EndoLRMGS improves the PSNR of the tools projection from 3.82% to 11.07%. Tissue rendering quality also improves, with PSNR increasing from 0.46% to 49.87%, and SSIM from 1.53% to 29.21% across all test videos.
zh

[CV-18] NuGrounding: A Multi-View 3D Visual Grounding Framework in Autonomous Driving

【速读】:该论文旨在解决多视角3D视觉接地(Multi-view 3D Visual Grounding)在自动驾驶场景中的两个关键问题:现有数据集和方法中存在的粗粒度语言指令以及三维几何推理与语言理解整合不足的问题。为了解决这些问题,论文提出了NuGrounding数据集和Hierarchy of Grounding (HoG) 方法,以生成多层次的语言指令,并确保全面覆盖人类指令模式。同时,论文提出了一种结合多模态大型语言模型(MLLMs)指令理解能力和检测模型精确定位能力的新范式。其解决方案的关键在于引入了两个解耦的任务标记和上下文查询来聚合三维几何信息和语义指令,并通过融合解码器优化空间语义特征融合,从而实现精准的目标定位。

链接: https://arxiv.org/abs/2503.22436
作者: Fuhao Li,Huan Jin,Bin Gao,Liaoyuan Fan,Lihui Jiang,Long Zeng
机构: Tsinghua University (清华大学); Huawei Noah’s Ark Lab (华为诺亚方舟实验室); The University of Hong Kong (香港大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Multi-view 3D visual grounding is critical for autonomous driving vehicles to interpret natural languages and localize target objects in complex environments. However, existing datasets and methods suffer from coarse-grained language instructions, and inadequate integration of 3D geometric reasoning with linguistic comprehension. To this end, we introduce NuGrounding, the first large-scale benchmark for multi-view 3D visual grounding in autonomous driving. We present a Hierarchy of Grounding (HoG) method to construct NuGrounding to generate hierarchical multi-level instructions, ensuring comprehensive coverage of human instruction patterns. To tackle this challenging dataset, we propose a novel paradigm that seamlessly combines instruction comprehension abilities of multi-modal LLMs (MLLMs) with precise localization abilities of specialist detection models. Our approach introduces two decoupled task tokens and a context query to aggregate 3D geometric information and semantic instructions, followed by a fusion decoder to refine spatial-semantic feature fusion for precise localization. Extensive experiments demonstrate that our method significantly outperforms the baselines adapted from representative 3D scene understanding methods by a significant margin and achieves 0.59 in precision and 0.64 in recall, with improvements of 50.8% and 54.7%.
zh

[CV-19] MVSAnywhere: Zero-Shot Multi-View Stereo CVPR2025

【速读】:该论文旨在解决多视图深度估计在不同场景类型(如室内与室外)和领域之间泛化能力不足的问题。现有方法难以处理输入视图数量变化时的额外元数据利用、基于Transformer架构的最佳应用方式,以及场景间差异显著且通常未知的有效深度范围估计。论文提出的关键解决方案是引入MVSA(Multi-View Stereo Architecture),一种结合单目和多视图线索的自适应代价体网络架构,通过自适应代价体有效应对尺度相关问题,从而实现跨多样化领域和深度范围的零样本深度估计。这一方法在鲁棒多视图深度基准测试中表现出最先进的性能,超越了现有的多视图立体视觉和单目深度估计基线。

链接: https://arxiv.org/abs/2503.22430
作者: Sergio Izquierdo,Mohamed Sayed,Michael Firman,Guillermo Garcia-Hernando,Daniyar Turmukhambetov,Javier Civera,Oisin Mac Aodha,Gabriel Brostow,Jamie Watson
机构: Niantic(尼安蒂克); University of Edinburgh (爱丁堡大学); I3A, Universidad de Zaragoza (I3A, 萨拉戈萨大学); UCL (伦敦大学学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: CVPR 2025

点击查看摘要

Abstract:Computing accurate depth from multiple views is a fundamental and longstanding challenge in computer vision. However, most existing approaches do not generalize well across different domains and scene types (e.g. indoor vs. outdoor). Training a general-purpose multi-view stereo model is challenging and raises several questions, e.g. how to best make use of transformer-based architectures, how to incorporate additional metadata when there is a variable number of input views, and how to estimate the range of valid depths which can vary considerably across different scenes and is typically not known a priori? To address these issues, we introduce MVSA, a novel and versatile Multi-View Stereo architecture that aims to work Anywhere by generalizing across diverse domains and depth ranges. MVSA combines monocular and multi-view cues with an adaptive cost volume to deal with scale-related issues. We demonstrate state-of-the-art zero-shot depth estimation on the Robust Multi-View Depth Benchmark, surpassing existing multi-view stereo and monocular baselines.
zh

[CV-20] Unveiling the Mist over 3D Vision-Language Understanding: Object-centric Evaluation with Chain-of-Analysis CVPR2025

【速读】:该论文旨在解决现有3D视觉-语言(3D-VL)基准在评估模型能力时存在的三大问题:一是测试数据存在缺陷,如指代消解任务中的模糊文本导致结果不可靠;二是简化度量方法无法有效反映模型的真实能力;三是当前基准将接地与问答(QA)任务孤立,忽视了两者之间的内在一致性。为了解决这些问题,论文提出了Beacon3D这一新的基准,其关键在于提供高质量的测试数据、以对象为中心的多测试评估机制以及创新的分析链范式,从而提升评估的全面性和准确性,并揭示了当前3D-VL模型在接地-QA一致性上的脆弱性及大语言模型引入对接地和问答性能的影响。

链接: https://arxiv.org/abs/2503.22420
作者: Jiangyong Huang,Baoxiong Jia,Yan Wang,Ziyu Zhu,Xiongkun Linghu,Qing Li,Song-Chun Zhu,Siyuan Huang
机构: State Key Laboratory of General Artificial Intelligence (通用人工智能国家重点实验室), BIGAI; Peking University (北京大学); Tsinghua University (清华大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: CVPR 2025. Project page: this https URL

点击查看摘要

Abstract:Existing 3D vision-language (3D-VL) benchmarks fall short in evaluating 3D-VL models, creating a “mist” that obscures rigorous insights into model capabilities and 3D-VL tasks. This mist persists due to three key limitations. First, flawed test data, like ambiguous referential text in the grounding task, can yield incorrect and unreliable test results. Second, oversimplified metrics such as simply averaging accuracy per question answering (QA) pair, cannot reveal true model capability due to their vulnerability to language variations. Third, existing benchmarks isolate the grounding and QA tasks, disregarding the underlying coherence that QA should be based on solid grounding capabilities. To unveil the “mist”, we propose Beacon3D, a benchmark for 3D-VL grounding and QA tasks, delivering a perspective shift in the evaluation of 3D-VL understanding. Beacon3D features (i) high-quality test data with precise and natural language, (ii) object-centric evaluation with multiple tests per object to ensure robustness, and (iii) a novel chain-of-analysis paradigm to address language robustness and model performance coherence across grounding and QA. Our evaluation of state-of-the-art 3D-VL models on Beacon3D reveals that (i) object-centric evaluation elicits true model performance and particularly weak generalization in QA; (ii) grounding-QA coherence remains fragile in current 3D-VL models, and (iii) incorporating large language models (LLMs) to 3D-VL models, though as a prevalent practice, hinders grounding capabilities and has yet to elevate QA capabilities. We hope Beacon3D and our comprehensive analysis could benefit the 3D-VL community towards faithful developments.
zh
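
"以对象为中心、每个对象多次测试"的评测方式,其统计逻辑可示意如下:只有当某个对象的全部测试都答对时才记为该对象通过(数据结构为假设):

```python
from collections import defaultdict

def object_centric_accuracy(records):
    """records: (object_id, is_correct) 的列表,每个对象对应多条不同措辞/角度的测试。"""
    per_object = defaultdict(list)
    for obj_id, correct in records:
        per_object[obj_id].append(correct)
    solved = sum(all(results) for results in per_object.values())
    return solved / len(per_object)

records = [
    ("chair_01", True), ("chair_01", True), ("chair_01", False),   # 有一次答错,不计入
    ("table_02", True), ("table_02", True),
]
print(object_centric_accuracy(records))  # 0.5:两件物体中只有 table_02 全部通过
```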

[CV-21] DF2023: The Digital Forensics 2023 Dataset for Image Forgery Detection WWW

【速读】:该论文试图解决通过在线社交网络广泛传播的篡改图像引起的公众舆论操控问题,这对社会构成了显著威胁。为应对这一技术挑战,论文提出了解决方案的关键在于发布Digital Forensics 2023 (DF2023) 数据集,该数据集包含来自四大伪造类别(拼接、克隆、增强和移除)的一百万张图像,用于训练与验证。此数据集不仅能够客观比较不同网络架构的性能,还能大幅减少研究人员准备数据集所需的时间和精力。

链接: https://arxiv.org/abs/2503.22417
作者: David Fischinger,Martin Boyer
机构: Austrian Institute of Technology (奥地利技术研究院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Published at the 25th Irish Machine Vision and Image Processing Conference (IMVIP) — Proceedings: this https URL — Dataset download: this https URL this https URL Kaggle: this https URL

点击查看摘要

Abstract:The deliberate manipulation of public opinion, especially through altered images, which are frequently disseminated through online social networks, poses a significant danger to society. To fight this issue on a technical level we support the research community by releasing the Digital Forensics 2023 (DF2023) training and validation dataset, comprising one million images from four major forgery categories: splicing, copy-move, enhancement and removal. This dataset enables an objective comparison of network architectures and can significantly reduce the time and effort of researchers preparing datasets.
zh

[CV-22] Modeling Multiple Normal Action Representations for Error Detection in Procedural Tasks

【速读】:该论文旨在解决现有错误检测方法在增强现实(AR)辅助和机器人系统中的局限性。传统方法通常仅关注时间顺序错误或依赖静态原型来表示正常动作,但这些方法往往忽视了在给定动作序列后存在多个有效后续动作的常见场景。这导致两个主要问题:(1) 当推理环境或动作执行分布与训练数据不同时,模型难以通过静态原型有效检测错误;(2) 如果当前动作标签与预测的动作标签不同,模型可能会使用错误的原型进行错误检测。为了解决这些问题,论文提出了一种自适应多正常动作表示(Adaptive Multiple Normal Action Representation, AMNAR)框架。AMNAR的关键在于预测所有有效的下一组可能动作,并重建其对应的正常动作表示,然后将这些表示与当前动作进行比较以实现错误检测。实验结果表明,AMNAR在错误检测任务中达到了最先进的性能,强调了建模多个有效后续动作的重要性。

链接: https://arxiv.org/abs/2503.22405
作者: Wei-Jin Huang,Yuan-Ming Li,Zhi-Wei Xia,Yu-Ming Tang,Kun-Yu Lin,Jian-Fang Hu,Wei-Shi Zheng
机构: School of Computer Science and Engineering, Sun Yat-sen University (中山大学), China; Peng Cheng Laboratory (鹏城实验室), China; Key Laboratory of Machine Intelligence and Advanced Computing, Ministry of Education (教育部机器智能与先进计算重点实验室), China; Guangdong Province Key Laboratory of Information Security Technology (广东省信息安全技术重点实验室), China
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Error detection in procedural activities is essential for consistent and correct outcomes in AR-assisted and robotic systems. Existing methods often focus on temporal ordering errors or rely on static prototypes to represent normal actions. However, these approaches typically overlook the common scenario where multiple, distinct actions are valid following a given sequence of executed actions. This leads to two issues: (1) the model cannot effectively detect errors using static prototypes when the inference environment or action execution distribution differs from training; and (2) the model may also use the wrong prototypes to detect errors if the ongoing action label is not the same as the predicted one. To address this problem, we propose an Adaptive Multiple Normal Action Representation (AMNAR) framework. AMNAR predicts all valid next actions and reconstructs their corresponding normal action representations, which are compared against the ongoing action to detect errors. Extensive experiments demonstrate that AMNAR achieves state-of-the-art performance, highlighting the effectiveness of AMNAR and the importance of modeling multiple valid next actions in error detection. The code is available at this https URL.
zh

[CV-23] VITAL: More Understandable Feature Visualization through Distribution Alignment and Relevant Information Flow

【速读】:该论文旨在解决现有特征可视化(Feature Visualization, FV)方法生成的图像难以被人类理解的问题,这些问题表现为生成的可视化结果通常包含不可识别的重复模式和视觉伪影。为了解决这些问题,论文提出了一种通过结合真实图像特征的统计信息以及相关网络流的度量来引导特征可视化的方案,以生成具有原型特性的图像。该方法的关键在于综合考虑图像特征的统计特性与网络内部的信息流动,从而显著提升可视化结果的可解释性,不仅在定性上更易于理解,而且在定量上也优于现有的最先进方法。这一改进能够帮助揭示网络所使用的具体信息,并与识别编码位置的机制性电路形成互补。

链接: https://arxiv.org/abs/2503.22399
作者: Ada Gorgun,Bernt Schiele,Jonas Fischer
机构: Max Planck Institute for Informatics (马克斯·普朗克信息学研究所), Saarland Informatics Campus (萨尔州计算机科学校区), Germany (德国)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Code is available at: this https URL

点击查看摘要

Abstract:Neural networks are widely adopted to solve complex and challenging tasks. Especially in high-stakes decision-making, understanding their reasoning process is crucial, yet proves challenging for modern deep networks. Feature visualization (FV) is a powerful tool to decode what information neurons are responding to and hence to better understand the reasoning behind such networks. In particular, in FV we generate human-understandable images that reflect the information detected by neurons of interest. However, current methods often yield unrecognizable visualizations, exhibiting repetitive patterns and visual artifacts that are hard to understand for a human. To address these problems, we propose to guide FV through statistics of real image features combined with measures of relevant network flow to generate prototypical images. Our approach yields human-understandable visualizations that both qualitatively and quantitatively improve over state-of-the-art FVs across various architectures. As such, it can be used to decode which information the network uses, complementing mechanistic circuits that identify where it is encoded. Code is available at: this https URL
zh

[CV-24] DF-Net: The Digital Forensics Network for Image Forgery Detection

【速读】:该论文旨在解决由人为操纵图像引发的公共舆论操控问题,尤其是在在线社交网络(OSN)上传播的伪造图像对社会构成的严重威胁。论文提出了一种名为数字取证网络(Digital Forensics Net, DF-Net)的深度神经网络,用于逐像素检测图像篡改。DF-Net的关键在于其对损失性图像操作(如缩放、压缩等)的鲁棒性,这些操作通常由社交网络自动执行,而大多数现有方法在面对此类操作时表现欠佳。通过在四个基准数据集上的实验,DF-Net展示了超越多种最新技术的能力。
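
DF-Net 的一大卖点是对社交网络常见的有损操作保持鲁棒,常见做法是在训练阶段随机施加同类退化。下面给出一个基于 Pillow 的极简数据增强示意(与论文实现无关,概率与参数均为假设):

```python
import io
import random
from PIL import Image

def lossy_augment(img: Image.Image) -> Image.Image:
    """随机缩放与 JPEG 重压缩, 模拟社交网络对上传图像的自动处理。"""
    if random.random() < 0.5:                      # 随机降采样再还原, 模拟平台重采样
        w, h = img.size
        s = random.uniform(0.5, 1.0)
        img = img.resize((max(1, int(w * s)), max(1, int(h * s))), Image.BILINEAR)
        img = img.resize((w, h), Image.BILINEAR)
    if random.random() < 0.5:                      # 随机 JPEG 重压缩
        buf = io.BytesIO()
        img.convert("RGB").save(buf, format="JPEG", quality=random.randint(40, 90))
        buf.seek(0)
        img = Image.open(buf).convert("RGB")
    return img
```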

链接: https://arxiv.org/abs/2503.22398
作者: David Fischinger,Martin Boyer
机构: Austrian Institute of Technology (奥地利技术研究院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Published in 2023 at the 25th Irish Machine Vision and Image Processing Conference (IMVIP), this https URL

点击查看摘要

Abstract:The orchestrated manipulation of public opinion, particularly through manipulated images, often spread via online social networks (OSN), has become a serious threat to society. In this paper we introduce the Digital Forensics Net (DF-Net), a deep neural network for pixel-wise image forgery detection. The released model outperforms several state-of-the-art methods on four established benchmark datasets. Most notably, DF-Net’s detection is robust against lossy image operations (e.g resizing, compression) as they are automatically performed by social networks.
zh

[CV-25] GAITGen: Disentangled Motion-Pathology Impaired Gait Generative Model – Bringing Motion Generation to the Clinical Domain

【速读】:该论文旨在解决帕金森病步态分析中因临床数据稀缺及标注不足导致的模型准确性下降和潜在偏差问题。解决方案的关键在于提出了一种名为GAITGen的新框架,它通过条件残差向量量化变分自编码器(Conditional Residual Vector Quantized Variational Autoencoder)学习解耦的运动动态与特定病理因素表示,并结合Mask和Residual Transformer实现条件序列生成,从而生成逼真的跨严重程度步态序列,丰富数据集并支持大规模模型训练。实验结果表明,GAITGen在重建保真度和生成质量方面优于现有先进模型,并通过临床用户研究验证了生成序列的真实性和临床相关性,同时提升了步态严重程度估计任务的性能。

链接: https://arxiv.org/abs/2503.22397
作者: Vida Adeli,Soroush Mehraban,Majid Mirmehdi,Alan Whone,Benjamin Filtjens,Amirhossein Dadashzadeh,Alfonso Fasano,Andrea Iaboni,Babak Taati
机构: University of Toronto (多伦多大学); Vector Institute (向量研究所); KITE Research Institute, UHN (KITE 研究所, UHN); University of Toronto, Institute of Biomedical Engineering (多伦多大学, 生物医学工程学院); University of Bristol, School of Computer Science (布里斯托尔大学, 计算机科学学院); University of Bristol, Translational Health Science (布里斯托尔大学, 转化健康科学); University of Toronto, Data Sciences Institute (多伦多大学, 数据科学研究所); University of Toronto, Department of Medicine, Division of Neurology (多伦多大学, 医学系, 神经病学系); Krembil Research Institute, UHN (克雷姆比尔研究所, UHN); Edmond J. Safra Program in Parkinson’s Disease, UHN (埃蒙德·J·萨夫拉帕金森病项目, UHN); University of Toronto, Department of Psychiatry (多伦多大学, 精神病学系); Centre for Mental Health, UHN (心理健康中心, UHN)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Gait analysis is crucial for the diagnosis and monitoring of movement disorders like Parkinson’s Disease. While computer vision models have shown potential for objectively evaluating parkinsonian gait, their effectiveness is limited by scarce clinical datasets and the challenge of collecting large and well-labelled data, impacting model accuracy and risk of bias. To address these gaps, we propose GAITGen, a novel framework that generates realistic gait sequences conditioned on specified pathology severity levels. GAITGen employs a Conditional Residual Vector Quantized Variational Autoencoder to learn disentangled representations of motion dynamics and pathology-specific factors, coupled with Mask and Residual Transformers for conditioned sequence generation. GAITGen generates realistic, diverse gait sequences across severity levels, enriching datasets and enabling large-scale model training in parkinsonian gait analysis. Experiments on our new PD-GaM (real) dataset demonstrate that GAITGen outperforms adapted state-of-the-art models in both reconstruction fidelity and generation quality, accurately capturing critical pathology-specific gait features. A clinical user study confirms the realism and clinical relevance of our generated sequences. Moreover, incorporating GAITGen-generated data into downstream tasks improves parkinsonian gait severity estimation, highlighting its potential for advancing clinical gait analysis.
zh

[CV-26] Endo-TTAP: Robust Endoscopic Tissue Tracking via Multi-Facet Guided Attention and Hybrid Flow-point Supervision

【速读】:该论文致力于解决内窥镜视频中组织点跟踪的难题,这一任务对于机器人辅助手术导航和场景理解至关重要,但因组织复杂形变、器械遮挡以及密集轨迹标注数据稀缺而极具挑战性。现有方法在这些条件下难以实现长期稳定跟踪,主要受限于特征利用不足和对标注数据的高度依赖。论文提出了一种名为Endo-TTAP的新框架,其关键在于:(1) 多面引导注意力机制(Multi-Facet Guided Attention, MFGA),通过协同多尺度流动力学、DINOv2语义嵌入和显式运动模式,联合预测点位置并具备不确定性与遮挡感知能力;(2) 两阶段课程学习策略,结合辅助课程适配器(Auxiliary Curriculum Adapter, ACA)实现逐步初始化与混合监督。第一阶段利用带光流真值的合成数据进行不确定性-遮挡正则化,第二阶段结合无监督流一致性与半监督学习,并采用现成追踪器生成的精炼伪标签。广泛的验证表明,Endo-TTAP在MICCAI挑战数据集及作者收集的数据集上达到了最先进的组织点跟踪性能,特别是在复杂的内窥镜条件下表现出色。

链接: https://arxiv.org/abs/2503.22394
作者: Rulin Zhou,Wenlong He,An Wang,Qiqi Yao,Haijun Hu,Jiankun Wang,Xi Zhang,Hongliang Ren
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Accurate tissue point tracking in endoscopic videos is critical for robotic-assisted surgical navigation and scene understanding, but remains challenging due to complex deformations, instrument occlusion, and the scarcity of dense trajectory annotations. Existing methods struggle with long-term tracking under these conditions due to limited feature utilization and annotation dependence. We present Endo-TTAP, a novel framework addressing these challenges through: (1) A Multi-Facet Guided Attention (MFGA) module that synergizes multi-scale flow dynamics, DINOv2 semantic embeddings, and explicit motion patterns to jointly predict point positions with uncertainty and occlusion awareness; (2) A two-stage curriculum learning strategy employing an Auxiliary Curriculum Adapter (ACA) for progressive initialization and hybrid supervision. Stage I utilizes synthetic data with optical flow ground truth for uncertainty-occlusion regularization, while Stage II combines unsupervised flow consistency and semi-supervised learning with refined pseudo-labels from off-the-shelf trackers. Extensive validation on two MICCAI Challenge datasets and our collected dataset demonstrates that Endo-TTAP achieves state-of-the-art performance in tissue point tracking, particularly in scenarios characterized by complex endoscopic conditions. The source code and dataset will be available at this https URL.
zh

[CV-27] Data Quality Matters: Quantifying Image Quality Impact on Machine Learning Performance

【速读】:该论文旨在解决高度自动化驾驶系统中,由于传感器数据压缩和虚拟化处理导致图像质量下降,进而影响机器学习任务(如物体检测和分割)性能的问题。论文的关键在于提出了一种四步框架,用于评估图像修改对机器学习任务的影响。其中,通过构建包含修改图像的数据集以确保一一对应的图像配对,量化压缩与虚拟化引起的图像偏差,并分析其对先进物体检测模型性能的影响,包括边界框精度和可靠性。最终,通过相关性分析确定图像质量与模型性能之间的关系,发现LPIPS(Learned Perceptual Image Patch Similarity)度量在所有评估的机器学习任务中实现了图像偏差与机器学习性能之间最高的相关性。这一框架为评估和优化图像处理对自动驾驶系统中感知任务的影响提供了系统化方法。
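
论文的第四步是图像质量指标与模型性能之间的相关性分析。下面用 SciPy 给出这一步的极简示意(数组中的数值为随手构造的假设数据,仅演示计算流程):

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

# 假设: 每个元素对应一种压缩 / 虚拟化配置
lpips_dev = np.array([0.02, 0.05, 0.08, 0.12, 0.20, 0.31])  # 图像偏差 (LPIPS)
det_map   = np.array([0.62, 0.60, 0.57, 0.55, 0.49, 0.41])  # 对应配置下的检测 mAP

r, p = pearsonr(lpips_dev, det_map)
rho, p_s = spearmanr(lpips_dev, det_map)
print(f"Pearson r={r:.3f} (p={p:.3g}), Spearman rho={rho:.3f} (p={p_s:.3g})")
```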

链接: https://arxiv.org/abs/2503.22375
作者: Christian Steinhauser,Philipp Reis,Hubert Padusinski,Jacob Langner,Eric Sax
机构: FZI Research Center for Information Technology (FZI 信息技术研究中心), Karlsruhe, Germany (德国)
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注: Submitted to IEEE IV 2025, Under Review

点击查看摘要

Abstract:Precise perception of the environment is essential in highly automated driving systems, which rely on machine learning tasks such as object detection and segmentation. Compression of sensor data is commonly used for data handling, while virtualization is used for hardware-in-the-loop validation. Both methods can alter sensor data and degrade model performance. This necessitates a systematic approach to quantifying image validity. This paper presents a four-step framework to evaluate the impact of image modifications on machine learning tasks. First, a dataset with modified images is prepared to ensure one-to-one matching image pairs, enabling measurement of deviations resulting from compression and virtualization. Second, image deviations are quantified by comparing the effects of compression and virtualization against original camera-based sensor data. Third, the performance of state-of-the-art object detection models is analyzed to determine how altered input data affects perception tasks, including bounding box accuracy and reliability. Finally, a correlation analysis is performed to identify relationships between image quality and model performance. As a result, the LPIPS metric achieves the highest correlation between image deviation and machine learning performance across all evaluated machine learning tasks.
zh

[CV-28] ViSketch-GPT : Collaborative Multi-Scale Feature Extraction for Sketch Recognition and Generation

【速读】:该论文旨在解决因人类手绘草图在绘制方式上的广泛差异而导致的理解其本质特征的挑战。具体而言,复杂结构模式的识别能够同时提升草图分类的准确性以及生成草图的保真度。论文提出的关键解决方案是ViSketch-GPT算法,这是一种基于多尺度上下文提取方法的新算法。其核心在于通过类似集成机制捕捉多尺度细节,并使这些特征协同工作以增强关键细节的识别与生成,这对于分类和生成任务至关重要。实验结果表明,该模型在QuickDraw数据集上显著优于现有方法,在分类和生成任务中的准确率及生成草图的保真度均有大幅提升。

链接: https://arxiv.org/abs/2503.22374
作者: Giulio Federico,Giuseppe Amato,Fabio Carrara,Claudio Gennaro,Marco Di Benedetto
机构: ISTI-CNR (意大利国家研究委员会信息学与自动化研究所)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Understanding the nature of human sketches is challenging because of the wide variation in how they are created. Recognizing complex structural patterns improves both the accuracy in recognizing sketches and the fidelity of the generated sketches. In this work, we introduce ViSketch-GPT, a novel algorithm designed to address these challenges through a multi-scale context extraction approach. The model captures intricate details at multiple scales and combines them using an ensemble-like mechanism, where the extracted features work collaboratively to enhance the recognition and generation of key details crucial for classification and generation tasks. The effectiveness of ViSketch-GPT is validated through extensive experiments on the QuickDraw dataset. Our model establishes a new benchmark, significantly outperforming existing methods in both classification and generation tasks, with substantial improvements in accuracy and the fidelity of generated sketches. The proposed algorithm offers a robust framework for understanding complex structures by extracting features that collaborate to recognize intricate details, enhancing the understanding of structures like sketches and making it a versatile tool for various applications in computer vision and machine learning.
zh

[CV-29] ForcePose: A Deep Learning Approach for Force Calculation Based on Action Recognition Using MediaPipe Pose Estimation Combined with Object Detection

【速读】:本文旨在解决在人机交互中精确估算作用力的问题,传统方法依赖于昂贵且仅限实验室环境的专用设备(如力板和传感器)。为应对这一挑战,论文提出了一种名为ForcePose的新深度学习框架,其关键在于结合人体姿态估计与物体检测来推断作用力。通过利用MediaPipe进行骨骼跟踪及SSD MobileNet进行物体识别,构建了统一的人机交互表征,并设计了一个处理空间和时间特征的专用神经网络以预测力的大小和方向,无需任何物理传感器。此方法在包含850段带标注视频及其对应力测量数据的自建数据集上训练后,实现了5.83牛顿的力大小平均绝对误差和7.4度的力方向误差,在现有计算机视觉方法基础上性能提升了27.5%,同时保持了标准计算硬件上的实时性能。ForcePose为传统测量工具不适用或侵入性较强的多样化实际场景中的力分析开辟了新途径。
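
下面给出一个大幅简化的 PyTorch 示意(非官方实现,输入维度与网络结构均为假设),演示"MediaPipe 关键点 + 物体检测框 → 力大小与方向"这一映射的基本形态;真实系统还需接入 MediaPipe 与 SSD MobileNet 的推理结果:

```python
import torch
import torch.nn as nn

class ForceRegressor(nn.Module):
    """输入: 每帧 33 个 MediaPipe 关键点 (x, y, z) 与 1 个物体框 (x1, y1, x2, y2);
    输出: 力大小 (标量) 与图像平面内的力方向单位向量。结构与维度均为假设。"""
    def __init__(self, num_kpts: int = 33, hidden: int = 256):
        super().__init__()
        in_dim = num_kpts * 3 + 4
        self.temporal = nn.GRU(in_dim, hidden, batch_first=True)
        self.head_mag = nn.Linear(hidden, 1)
        self.head_dir = nn.Linear(hidden, 2)

    def forward(self, kpts, boxes):
        # kpts: (B, T, 33, 3), boxes: (B, T, 4)
        x = torch.cat([kpts.flatten(2), boxes], dim=-1)        # (B, T, 103)
        h, _ = self.temporal(x)
        last = h[:, -1]                                         # 取最后一帧的隐状态
        mag = self.head_mag(last)                               # 力大小 (牛顿)
        direction = nn.functional.normalize(self.head_dir(last), dim=-1)
        return mag, direction

model = ForceRegressor()
mag, d = model(torch.randn(2, 16, 33, 3), torch.randn(2, 16, 4))
print(mag.shape, d.shape)   # torch.Size([2, 1]) torch.Size([2, 2])
```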

链接: https://arxiv.org/abs/2503.22363
作者: Nandakishor M,Vrinda Govind V,Anuradha Puthalath,Anzy L,Swathi P S,Aswathi R,Devaprabha A R,Varsha Raj,Midhuna Krishnan K,Akhila Anilkumar T V,Yamuna P V
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Force estimation in human-object interactions is crucial for various fields like ergonomics, physical therapy, and sports science. Traditional methods depend on specialized equipment such as force plates and sensors, which makes accurate assessments both expensive and restricted to laboratory settings. In this paper, we introduce ForcePose, a novel deep learning framework that estimates applied forces by combining human pose estimation with object detection. Our approach leverages MediaPipe for skeletal tracking and SSD MobileNet for object recognition to create a unified representation of human-object interaction. We’ve developed a specialized neural network that processes both spatial and temporal features to predict force magnitude and direction without needing any physical sensors. After training on our dataset of 850 annotated videos with corresponding force measurements, our model achieves a mean absolute error of 5.83 N in force magnitude and 7.4 degrees in force direction. When compared to existing computer vision approaches, our method performs 27.5% better while still offering real-time performance on standard computing hardware. ForcePose opens up new possibilities for force analysis in diverse real-world scenarios where traditional measurement tools are impractical or intrusive. This paper discusses our methodology, the dataset creation process, evaluation metrics, and potential applications across rehabilitation, ergonomics assessment, and athletic performance analysis.
zh

[CV-30] Mitigating Knowledge Discrepancies among Multiple Datasets for Task-agnostic Unified Face Alignment

【速读】:该论文旨在解决现有面部对齐方法无法从具有不同地标标注的多个数据集中学习统一知识的问题。在单一数据集中的有限训练样本通常导致该领域鲁棒性不足。为缓解不同数据集之间的知识差异并训练一个任务无关的统一面部对齐(TUFA)框架,本文提出了一种从多个数据集中统一知识的策略。解决方案的关键在于通过计算每个数据集的平均人脸形状,并结合语义对齐嵌入显式对齐这些平均形状,将对齐后的形状2D坐标视为平面锚点。然后通过将这些锚点编码为结构提示,并利用图像特征回归对应的面部地标,最终在平面与目标人脸之间建立映射,从而统一不同数据集的学习目标。这一方法不仅提升了模型的泛化能力,还显著提高了少样本面部对齐任务的性能,并赋予TUFA任务无关特性,使其能够在零样本情况下定位训练期间未见过的地标。
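
下面用一个极简的 PyTorch 示意(非官方实现,编码方式与维度为假设)演示"为数据集计算平均人脸形状,并把对齐后的锚点坐标与语义对齐嵌入编码为结构提示"的过程:

```python
import torch
import torch.nn as nn

def mean_shape(landmarks: torch.Tensor) -> torch.Tensor:
    """landmarks: (N, K, 2), 同一数据集 N 张图的 K 个关键点 (已按人脸框归一化)。"""
    return landmarks.mean(dim=0)                              # (K, 2) 平均人脸形状

class StructurePrompt(nn.Module):
    """把平均形状的 2D 锚点坐标与语义对齐嵌入编码为结构提示向量。"""
    def __init__(self, dim: int = 256, max_landmarks: int = 512):
        super().__init__()
        self.coord_mlp = nn.Sequential(nn.Linear(2, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.sem_embed = nn.Embedding(max_landmarks, dim)     # 语义对齐嵌入 (每个关键点一个 id)

    def forward(self, anchors: torch.Tensor, sem_ids: torch.Tensor) -> torch.Tensor:
        # anchors: (K, 2), sem_ids: (K,)
        return self.coord_mlp(anchors) + self.sem_embed(sem_ids)   # (K, dim)

shapes = torch.rand(1000, 68, 2)                              # 假设某数据集有 68 点标注
prompts = StructurePrompt()(mean_shape(shapes), torch.arange(68))
print(prompts.shape)                                          # torch.Size([68, 256])
```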

链接: https://arxiv.org/abs/2503.22359
作者: Jiahao Xia,Min Xu,Wenjian Huang,Jianguo Zhang,Haimin Zhang,Chunxia Xiao
机构: University of Technology Sydney (悉尼科技大学); South University of Science and Technology of China (南方科技大学); Wuhan University (武汉大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 24 Pages, 9 Figures

点击查看摘要

Abstract:Despite the similar structures of human faces, existing face alignment methods cannot learn unified knowledge from multiple datasets with different landmark annotations. The limited training samples in a single dataset commonly result in fragile robustness in this field. To mitigate knowledge discrepancies among different datasets and train a task-agnostic unified face alignment (TUFA) framework, this paper presents a strategy to unify knowledge from multiple datasets. Specifically, we calculate a mean face shape for each dataset. To explicitly align these mean shapes on an interpretable plane based on their semantics, each shape is then incorporated with a group of semantic alignment embeddings. The 2D coordinates of these aligned shapes can be viewed as the anchors of the plane. By encoding them into structure prompts and further regressing the corresponding facial landmarks using image features, a mapping from the plane to the target faces is finally established, which unifies the learning target of different datasets. Consequently, multiple datasets can be utilized to boost the generalization ability of the model. The successful mitigation of discrepancies also enhances the efficiency of knowledge transferring to a novel dataset, significantly boosts the performance of few-shot face alignment. Additionally, the interpretable plane endows TUFA with a task-agnostic characteristic, enabling it to locate landmarks unseen during training in a zero-shot manner. Extensive experiments are carried on seven benchmarks and the results demonstrate an impressive improvement in face alignment brought by knowledge discrepancies mitigation.
zh

[CV-31] EchoFlow: A Foundation Model for Cardiac Ultrasound Image and Video Generation

【速读】:该论文旨在解决医疗图像分析领域中因患者隐私保护限制而导致的大规模医学数据集可用性不足的问题。论文提出了一种名为EchoFlow的新框架,用于生成高质量且符合隐私保护要求的超声心动图(echocardiogram)图像和视频。解决方案的关键在于EchoFlow包含四个核心组件:一个对抗变分自编码器(adversarial variational autoencoder),用于定义心脏超声图像的有效潜在表示;一个潜在图像流匹配模型(latent image flow matching model),用于生成精确的潜在超声心动图图像;一个潜在再识别模型(latent re-identification model),通过解剖学过滤确保隐私;以及一个潜在视频流匹配模型(latent video flow matching model),用于在射血分数(ejection fraction)条件下将潜在图像转化为逼真的超声心动图视频。通过这些组件,论文首次证明了仅基于EchoFlow生成的合成数据训练的下游模型在射血分数回归任务上的性能可以与基于真实数据训练的模型相当。
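
EchoFlow 的图像与视频生成模块均基于潜在空间的 flow matching。下面给出 flow matching(直线插值形式)训练目标的极简示意(与论文实现无关,速度场网络仅为占位):

```python
import torch
import torch.nn as nn

def flow_matching_loss(model: nn.Module, x1: torch.Tensor, cond) -> torch.Tensor:
    """x1: 干净的潜在表示 (B, C, H, W); cond: 条件信息 (如射血分数嵌入, 此处可为 None)。
    在 x0 ~ N(0, I) 与 x1 之间做直线插值, 回归恒定速度场 x1 - x0。"""
    x0 = torch.randn_like(x1)
    t = torch.rand(x1.size(0), device=x1.device).view(-1, 1, 1, 1)
    xt = (1 - t) * x0 + t * x1                 # 直线路径上的中间样本
    v_target = x1 - x0                         # 目标速度
    v_pred = model(xt, t.flatten(), cond)
    return torch.mean((v_pred - v_target) ** 2)

class DummyVelocity(nn.Module):
    """占位速度场网络, 仅作演示用, 忽略 cond。"""
    def __init__(self, c: int = 4):
        super().__init__()
        self.net = nn.Conv2d(c + 1, c, 3, padding=1)
    def forward(self, xt, t, cond):
        tmap = t.view(-1, 1, 1, 1).expand(-1, 1, *xt.shape[2:])
        return self.net(torch.cat([xt, tmap], dim=1))

print(flow_matching_loss(DummyVelocity(), torch.randn(2, 4, 32, 32), cond=None).item())
```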

链接: https://arxiv.org/abs/2503.22357
作者: Hadrien Reynaud,Alberto Gomez,Paul Leeson,Qingjie Meng,Bernhard Kainz
机构: UKRI Centre for Doctoral Training in Artificial Intelligence for Healthcare (EP / S023283/1); HPC@NHR@FAU (Friedrich-Alexander-Universität Erlangen-Nürnberg (FAU)); Ultromics Ltd.; School of Computer Science, University of Birmingham (伯明翰大学); Department of Computing, Imperial College London (伦敦帝国理工学院); Friedrich–Alexander University Erlangen–Nürnberg (弗里德里希-亚历山大-埃尔朗根-纽伦堡大学); University of Oxford (牛津大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: This work has been submitted to the IEEE for possible publication

点击查看摘要

Abstract:Advances in deep learning have significantly enhanced medical image analysis, yet the availability of large-scale medical datasets remains constrained by patient privacy concerns. We present EchoFlow, a novel framework designed to generate high-quality, privacy-preserving synthetic echocardiogram images and videos. EchoFlow comprises four key components: an adversarial variational autoencoder for defining an efficient latent representation of cardiac ultrasound images, a latent image flow matching model for generating accurate latent echocardiogram images, a latent re-identification model to ensure privacy by filtering images anatomically, and a latent video flow matching model for animating latent images into realistic echocardiogram videos conditioned on ejection fraction. We rigorously evaluate our synthetic datasets on the clinically relevant task of ejection fraction regression and demonstrate, for the first time, that downstream models trained exclusively on EchoFlow-generated synthetic datasets achieve performance parity with models trained on real datasets. We release our models and synthetic datasets, enabling broader, privacy-compliant research in medical ultrasound imaging at this https URL.
zh

[CV-32] Meta-LoRA: Meta-Learning LoRA Components for Domain-Aware ID Personalization

【速读】:该论文旨在解决文本到图像生成模型在实现身份个性化方面的挑战,即如何通过有限的参考图像确保模型始终生成特定主体的高质量输出。解决方案的关键在于提出了一种名为Meta-Low-Rank Adaptation (Meta-LoRA) 的新框架,它利用元学习将领域特定先验知识编码到基于LoRA的身份个性化中。Meta-LoRA采用了一种结构化的三层LoRA架构,分离了与身份无关的知识和特定于身份的适应性。首先,在多个主体上进行LoRA Meta-Down层的元训练,以学习捕获一般身份相关特征的共享流形;其次,仅优化LoRA-Mid和LoRA-Up层以专注于特定主体,从而显著减少适应时间并提高身份保真度。通过引入新的基准数据集Meta-PHD评估方法并与最先进的技术进行比较,结果表明Meta-LoRA在多样化的身份条件下实现了卓越的身份保留、计算效率和适应性。
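
下面是三层 LoRA 结构的一个示意性 PyTorch 模块(非官方实现,秩与维度均为假设):元训练后冻结 Meta-Down,个性化阶段只更新 Mid 与 Up 两层:

```python
import torch
import torch.nn as nn

class MetaLoRALinear(nn.Module):
    """y = W x + Up(Mid(MetaDown(x)))。MetaDown 跨主体元训练后冻结,
    个性化阶段只更新 Mid / Up 两层。秩与维度均为假设。"""
    def __init__(self, in_dim: int, out_dim: int, r_meta: int = 16, r_id: int = 4):
        super().__init__()
        self.base = nn.Linear(in_dim, out_dim, bias=False)
        self.meta_down = nn.Linear(in_dim, r_meta, bias=False)   # 身份无关, 跨主体共享
        self.mid = nn.Linear(r_meta, r_id, bias=False)           # 身份相关
        self.up = nn.Linear(r_id, out_dim, bias=False)
        nn.init.zeros_(self.up.weight)                           # 初始时 LoRA 分支输出为 0
        self.base.weight.requires_grad_(False)

    def freeze_meta(self):
        self.meta_down.weight.requires_grad_(False)

    def forward(self, x):
        return self.base(x) + self.up(self.mid(self.meta_down(x)))

layer = MetaLoRALinear(768, 768)
layer.freeze_meta()                       # 第二阶段: 只优化 mid / up
print(layer(torch.randn(2, 77, 768)).shape)
```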

链接: https://arxiv.org/abs/2503.22352
作者: Barış Batuhan Topal,Umut Özyurt,Zafer Doğan Budak,Ramazan Gokberk Cinbis
机构: METU Dept. of Computer Engineering (METU 计算机工程系); METU Dept. of Electrical and Elec. Engineering (METU 电气与电子工程系)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Recent advancements in text-to-image generative models, particularly latent diffusion models (LDMs), have demonstrated remarkable capabilities in synthesizing high-quality images from textual prompts. However, achieving identity personalization-ensuring that a model consistently generates subject-specific outputs from limited reference images-remains a fundamental challenge. To address this, we introduce Meta-Low-Rank Adaptation (Meta-LoRA), a novel framework that leverages meta-learning to encode domain-specific priors into LoRA-based identity personalization. Our method introduces a structured three-layer LoRA architecture that separates identity-agnostic knowledge from identity-specific adaptation. In the first stage, the LoRA Meta-Down layers are meta-trained across multiple subjects, learning a shared manifold that captures general identity-related features. In the second stage, only the LoRA-Mid and LoRA-Up layers are optimized to specialize on a given subject, significantly reducing adaptation time while improving identity fidelity. To evaluate our approach, we introduce Meta-PHD, a new benchmark dataset for identity personalization, and compare Meta-LoRA against state-of-the-art methods. Our results demonstrate that Meta-LoRA achieves superior identity retention, computational efficiency, and adaptability across diverse identity conditions. The code, model weights, and dataset will be released publicly upon acceptance.
zh

[CV-33] One Look is Enough: A Novel Seamless Patchwise Refinement for Zero-Shot Monocular Depth Estimation Models on High-Resolution Images

【速读】:该论文旨在解决现有零样本深度估计(Zero-shot Depth Estimation, DE)模型在处理高分辨率图像时面临的两个主要挑战:一是全分辨率处理导致的内存消耗增加和精度下降,二是下采样后边缘模糊的问题。此外,现有的高分辨率深度估计算法通常采用基于块的方法,在拼接深度块时容易产生深度不连续性,并且由于依赖合成数据以获取细粒度深度细节,其泛化能力较差。

为了解决上述问题,论文提出了一种名为Patch Refine Once (PRO) 的高效且可泛化的分块框架。PRO 的关键在于两个组成部分:(i) Grouped Patch Consistency Training,通过在单一反向传播步骤中联合处理四个重叠的图像块并对它们的重叠区域施加一致性损失,从而提高测试时的效率并缓解深度不连续性问题;(ii) Bias Free Masking,避免深度估计模型过度拟合特定数据集的偏差,即使是在基于合成数据训练的情况下也能更好地泛化到真实世界的数据集。实验结果表明,PRO 在多个基准数据集上的零样本评估中表现出色,能够有效减少网格输入高分辨率图像边界处的深度不连续性,并保持快速的推理速度。
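
下面给出 Grouped Patch Consistency Training 核心思想的极简二维示意(非官方实现;depth_model 为占位,假设输入 (B, 3, p, p)、输出 (B, 1, p, p)),即同时处理四个相互重叠的图像块,并在重叠像素上施加 L1 一致性损失:

```python
import torch

def grouped_patch_consistency(depth_model, image, patch=256):
    """image: (B, 3, H, W), 要求 2 * patch > H 且 2 * patch > W, 保证四个块两两重叠。"""
    B, _, H, W = image.shape
    assert 2 * patch > H and 2 * patch > W
    canvases, masks = [], []
    for y in (0, H - patch):
        for x in (0, W - patch):
            d = depth_model(image[:, :, y:y + patch, x:x + patch])
            canvas = image.new_zeros(B, 1, H, W)
            mask = image.new_zeros(B, 1, H, W)
            canvas[:, :, y:y + patch, x:x + patch] = d
            mask[:, :, y:y + patch, x:x + patch] = 1.0
            canvases.append(canvas)
            masks.append(mask)
    loss, pairs = image.new_zeros(()), 0
    for i in range(4):
        for j in range(i + 1, 4):
            both = masks[i] * masks[j]                    # 两个块共同覆盖的像素
            if both.sum() > 0:
                loss = loss + (both * (canvases[i] - canvases[j]).abs()).sum() / both.sum()
                pairs += 1
    return loss / max(pairs, 1)

dummy = lambda x: x.mean(dim=1, keepdim=True)             # 占位的"深度模型"
print(grouped_patch_consistency(dummy, torch.rand(1, 3, 384, 384)).item())
```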

链接: https://arxiv.org/abs/2503.22351
作者: Byeongjun Kwon,Munchurl Kim
机构: KAIST (韩国科学技术院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Please visit our project page this this https URL

点击查看摘要

Abstract:Zero-shot depth estimation (DE) models exhibit strong generalization performance as they are trained on large-scale datasets. However, existing models struggle with high-resolution images due to the discrepancy in image resolutions of training (with smaller resolutions) and inference (for high resolutions). Processing them at full resolution leads to decreased estimation accuracy on depth with tremendous memory consumption, while downsampling to the training resolution results in blurred edges in the estimated depth images. Prevailing high-resolution depth estimation methods adopt a patch-based approach, which introduces depth discontinuity issues when reassembling the estimated depth patches and results in test-time inefficiency. Additionally, to obtain fine-grained depth details, these methods rely on synthetic datasets due to the real-world sparse ground truth depth, leading to poor generalizability. To tackle these limitations, we propose Patch Refine Once (PRO), an efficient and generalizable tile-based framework. Our PRO consists of two key components: (i) Grouped Patch Consistency Training that enhances test-time efficiency while mitigating the depth discontinuity problem by jointly processing four overlapping patches and enforcing a consistency loss on their overlapping regions within a single backpropagation step, and (ii) Bias Free Masking that prevents the DE models from overfitting to dataset-specific biases, enabling better generalization to real-world datasets even after training on synthetic data. Zero-shot evaluation on Booster, ETH3D, Middlebury 2014, and NuScenes demonstrates that existing DE models into which our PRO is harmonized remain effective for the grid input of high-resolution images, with little depth discontinuity at the grid boundaries. Our PRO runs fast at inference time.
zh

[CV-34] GCRayDiffusion: Pose-Free Surface Reconstruction via Geometric Consistent Ray Diffusion

【速读】:该论文旨在解决从无定位图像(unposed images)中进行精确表面重建的问题,特别是在稀疏视图场景下联合相机姿态估计的挑战。以往方法在密集视图设置下能够实现无需姿态信息的高质量表面重建,但在缺乏足够视觉重叠的稀疏视图场景中容易失败。为应对这一难题,本文提出了一种新的无姿态表面重建技术,其核心是通过基于三平面(triplane)的符号距离场(Signed Distance Field, SDF)学习,并结合显式的点采样自相机姿态估计的射线扩散(ray-based diffusion)来正则化学习过程。

论文的关键贡献在于提出了几何一致性射线扩散模型(Geometric Consistent Ray Diffusion, GCRayDiffusion),该模型将相机姿态表示为神经束射线(neural bundle rays),并通过扩散模型回归噪声射线的分布。更重要的是,研究进一步利用整个场景的三平面SDF条件化GCRayDiffusion的去噪过程,从而提供有效的三维一致性正则化以实现多视角一致的相机姿态估计。最终,通过引入来自神经束射线采样点的表面几何正则化,将GCRayDiffusion整合到三平面SDF学习中,实现了即使在稀疏视图输入条件下也能获得高度精确的无姿态表面重建结果。大量实验表明,相比现有方法,GCRayDiffusion不仅提高了相机姿态估计的准确性,还获得了几何上更一致的表面重建结果。

链接: https://arxiv.org/abs/2503.22349
作者: Li-Heng Chen,Zi-Xin Zou,Chang Liu,Tianjiao Jing,Yan-Pei Cao,Shi-Sheng Huang,Hongbo Fu,Hua Huang
机构: Beijing Normal University (北京师范大学); VAST; Hong Kong University of Science and Technology (香港科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Accurate surface reconstruction from unposed images is crucial for efficient 3D object or scene creation. However, it remains challenging, particularly for the joint camera pose estimation. Previous approaches have achieved impressive pose-free surface reconstruction results in dense-view settings, but could easily fail for sparse-view scenarios without sufficient visual overlap. In this paper, we propose a new technique for pose-free surface reconstruction, which follows triplane-based signed distance field (SDF) learning but regularizes the learning by explicit points sampled from ray-based diffusion of camera pose estimation. Our key contribution is a novel Geometric Consistent Ray Diffusion model (GCRayDiffusion), where we represent camera poses as neural bundle rays and regress the distribution of noisy rays via a diffusion model. More importantly, we further condition the denoising process of RGRayDiffusion using the triplane-based SDF of the entire scene, which provides effective 3D consistent regularization to achieve multi-view consistent camera pose estimation. Finally, we incorporate RGRayDiffusion into the triplane-based SDF learning by introducing on-surface geometric regularization from the sampling points of the neural bundle rays, which leads to highly accurate pose-free surface reconstruction results even for sparse-view inputs. Extensive evaluations on public datasets show that our GCRayDiffusion achieves more accurate camera pose estimation than previous approaches, with geometrically more consistent surface reconstruction results, especially given sparse-view inputs.
zh

[CV-35] ArchCAD-400K: An Open Large-Scale Architectural CAD Dataset and New Baseline for Panoptic Symbol Spotting

【速读】:该论文旨在解决建筑领域计算机辅助设计(CAD)图纸中符号识别的问题,这一问题是实现多种先进工程应用的基础。为应对这一挑战,论文提出的关键解决方案包括两个方面:首先,开发了一种新颖的CAD数据标注引擎(CAD Data Annotation Engine),利用系统归档的CAD图纸的固有属性,自动高效地生成高质量标注,从而大幅减少人工标注的工作量;其次,构建了一个大规模的CAD数据集ArchCAD-400K,并提出了一个新的基线模型Dual-Pathway Symbol Spotter (DPSS)。DPSS通过引入自适应融合模块,将基本特征与互补图像特征相结合以增强特征表示,实现了当前最先进的性能和更高的鲁棒性。这两项创新共同构成了论文的核心贡献。

链接: https://arxiv.org/abs/2503.22346
作者: Ruifeng Luo,Zhengjie Liu,Tianxiao Cheng,Jie Wang,Tongjie Wang,Xingguang Wei,Haomin Wang,YanPeng Li,Fu Chai,Fei Cheng,Shenglong Ye,Wenhai Wang,Yanting Zhang,Yu Qiao,Hongjie Zhang,Xianzhong Zhao
机构: Tongji University (同济大学); Arcplus East China Architectural Design & Research Institute Co., Ltd. (Arcplus华东建筑设计研究院有限公司); Shanghai AI Laboratory (上海人工智能实验室); Shanghai Innovation Institute (上海创新研究院); University of Science and Technology of China (中国科学技术大学); Shanghai Jiao Tong University (上海交通大学); Donghua University (东华大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Recognizing symbols in architectural CAD drawings is critical for various advanced engineering applications. In this paper, we propose a novel CAD data annotation engine that leverages intrinsic attributes from systematically archived CAD drawings to automatically generate high-quality annotations, thus significantly reducing manual labeling efforts. Utilizing this engine, we construct ArchCAD-400K, a large-scale CAD dataset consisting of 413,062 chunks from 5538 highly standardized drawings, making it over 26 times larger than the largest existing CAD dataset. ArchCAD-400K boasts an extended drawing diversity and broader categories, offering line-grained annotations. Furthermore, we present a new baseline model for panoptic symbol spotting, termed Dual-Pathway Symbol Spotter (DPSS). It incorporates an adaptive fusion module to enhance primitive features with complementary image features, achieving state-of-the-art performance and enhanced robustness. Extensive experiments validate the effectiveness of DPSS, demonstrating the value of ArchCAD-400K and its potential to drive innovation in architectural design and construction.
zh

[CV-36] Semantix: An Energy Guided Sampler for Semantic Style Transfer ICLR2025

【速读】:该论文旨在解决现有风格与外观迁移方法中忽略语义对应的问题,同时缺乏将图像和视频任务整合以实现视频迁移的统一框架。为应对这些局限性,论文引入了一种新的任务——语义风格迁移(Semantic Style Transfer),其目标是基于语义对应关系,从参考图像向目标视觉内容传递风格和外观特征。解决方案的关键在于提出了一种无需训练的方法Semantix,这是一种由预训练扩散模型引导的能量指导采样器。Semantix能够同时引导风格和外观迁移,并通过设计精心构造的能量函数来优化采样过程,该函数包含三个核心组件:风格特征引导、空间特征引导以及语义距离作为正则化项。此外,作为一种采样器,Semantix可以无缝应用于图像和视频模型,从而实现跨多种视觉媒体的通用语义风格迁移。实验结果表明,Semantix不仅在图像和视频的语义风格迁移任务中表现出色,还在相关领域超越了现有的最先进方法。

链接: https://arxiv.org/abs/2503.22344
作者: Huiang He,Minghui Hu,Chuanxia Zheng,Chaoyue Wang,Tat-Jen Cham
机构: South China University of Technology; SpellBrush & Nanyang Technological University; VGG, University of Oxford; The University of Sydney; College of Computing and Data Science, Nanyang Technological University
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 28 pages, 19 figures, Accepted to ICLR 2025

点击查看摘要

Abstract:Recent advances in style and appearance transfer are impressive, but most methods isolate global style and local appearance transfer, neglecting semantic correspondence. Additionally, image and video tasks are typically handled in isolation, with little focus on integrating them for video transfer. To address these limitations, we introduce a novel task, Semantic Style Transfer, which involves transferring style and appearance features from a reference image to a target visual content based on semantic correspondence. We subsequently propose a training-free method, Semantix, an energy-guided sampler designed for Semantic Style Transfer that simultaneously guides both style and appearance transfer based on semantic understanding capacity of pre-trained diffusion models. Additionally, as a sampler, Semantix can be seamlessly applied to both image and video models, enabling semantic style transfer to be generic across various visual media. Specifically, once inverting both reference and context images or videos to noise space by SDEs, Semantix utilizes a meticulously crafted energy function to guide the sampling process, including three key components: Style Feature Guidance, Spatial Feature Guidance and Semantic Distance as a regularisation term. Experimental results demonstrate that Semantix not only effectively accomplishes the task of semantic style transfer across images and videos, but also surpasses existing state-of-the-art solutions in both fields. The project website is available at this https URL
zh

[CV-37] Imperceptible but Forgeable: Practical Invisible Watermark Forgery via Diffusion Models

【速读】:该论文旨在解决现有水印方案在防伪造攻击方面的鲁棒性不足问题,特别是在无封闭环境(no-box setting)下对不可见水印进行伪造的挑战。论文提出了一种名为DiffForge的水印伪造框架,其关键是利用无条件扩散模型估计水印分布,并结合浅层反转(shallow inversion)技术将水印无缝注入未加水印的图像中。这种方法通过自适应选择反转步骤的深度,在保持图像质量的同时实现水印注入,其关键洞察在于水印在扩散过程的早期阶段会因噪声增加而退化。这一解决方案显著提升了水印伪造的成功率,验证结果显示其能够以96.38%的成功率欺骗开源水印检测器,并以超过97%的成功率误导商业水印系统。

链接: https://arxiv.org/abs/2503.22330
作者: Ziping Dong,Chao Shuai,Zhongjie Ba,Peng Cheng,Zhan Qin,Qinglong Wang,Kui Ren
机构: 未知
类目: Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Invisible watermarking is critical for content provenance and accountability in Generative AI. Although commercial companies have increasingly committed to using watermarks, the robustness of existing watermarking schemes against forgery attacks is understudied. This paper proposes DiffForge, the first watermark forgery framework capable of forging imperceptible watermarks under a no-box setting. We estimate the watermark distribution using an unconditional diffusion model and introduce shallow inversion to inject the watermark into a non-watermarked image seamlessly. This approach facilitates watermark injection while preserving image quality by adaptively selecting the depth of inversion steps, leveraging our key insight that watermarks degrade with added noise during the early diffusion phases. Comprehensive evaluations show that DiffForge deceives open-source watermark detectors with a 96.38% success rate and misleads a commercial watermark system with over 97% success rate, achieving high confidence. This work reveals fundamental security limitations in current watermarking paradigms.
zh

[CV-38] VoteFlow: Enforcing Local Rigidity in Self-Supervised Scene Flow CVPR2025

【速读】:该论文旨在解决自监督场景流估计中局部刚体运动约束难以有效融入模型架构的问题。现有方法通常通过后处理或添加额外正则化项来改善流场的刚性,但这些方法缺乏在模型结构中对局部刚性进行归纳偏置的能力,导致学习效率低下且性能不佳。论文的关键创新在于设计了一个轻量级附加模块(Voting Module),通过离散投票空间以及可微分投票机制,使附近点共享相同的运动,从而在神经网络中直接实现局部刚性约束,并支持端到端学习。此外,为提高计算效率,该模块基于体素(pillar)而非点进行操作,并为每个体素学习代表性特征用于投票。通过将此模块嵌入流行模型架构并在Argoverse 2和Waymo数据集上验证,论文实现了显著性能提升且仅带来极小的计算开销。
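
下面用一个高度简化的示意(非官方实现,候选平移网格与温度系数均为假设)说明"在离散平移候选上做可微软投票"的基本做法:为每个 pillar 的候选平移计算匹配代价,再用 softmax 加权得到共享平移:

```python
import torch
import torch.nn.functional as F

def soft_vote_translation(cost: torch.Tensor, candidates: torch.Tensor, tau: float = 0.1):
    """cost: (P, K), 每个 pillar 对 K 个候选平移的匹配代价 (越小越好);
    candidates: (K, 3), 离散化的候选平移。
    返回每个 pillar 的软投票平移 (P, 3), 全程可微, 可端到端训练。"""
    weights = F.softmax(-cost / tau, dim=-1)                  # (P, K)
    return weights @ candidates                               # (P, 3)

# 用法示意: 5 个 pillar, 9x9 的平面平移网格 (展平成 81 个候选, z 分量取 0)
xy = torch.stack(torch.meshgrid(torch.linspace(-1, 1, 9),
                                torch.linspace(-1, 1, 9), indexing="ij"), dim=-1)
cands = F.pad(xy.reshape(-1, 2), (0, 1))                      # (81, 3)
cost = torch.rand(5, 81)
print(soft_vote_translation(cost, cands).shape)               # torch.Size([5, 3])
```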

链接: https://arxiv.org/abs/2503.22328
作者: Yancong Lin,Shiming Wang,Liangliang Nan,Julian Kooij,Holger Caesar
机构: TU Delft (代尔夫特理工大学); ETH Zurich (苏黎世联邦理工学院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: CVPR 2025. Code is available at this https URL . Yancong Lin and Shiming Wang have equal contributions

点击查看摘要

Abstract:Scene flow estimation aims to recover per-point motion from two adjacent LiDAR scans. However, in real-world applications such as autonomous driving, points rarely move independently of others, especially for nearby points belonging to the same object, which often share the same motion. Incorporating this locally rigid motion constraint has been a key challenge in self-supervised scene flow estimation, which is often addressed by post-processing or appending extra regularization. While these approaches are able to improve the rigidity of predicted flows, they lack an architectural inductive bias for local rigidity within the model structure, leading to suboptimal learning efficiency and inferior performance. In contrast, we enforce local rigidity with a lightweight add-on module in neural network design, enabling end-to-end learning. We design a discretized voting space that accommodates all possible translations and then identify the one shared by nearby points by differentiable voting. Additionally, to ensure computational efficiency, we operate on pillars rather than points and learn representative features for voting per pillar. We plug the Voting Module into popular model designs and evaluate its benefit on Argoverse 2 and Waymo datasets. We outperform baseline works with only marginal compute overhead. Code is available at this https URL.
zh

[CV-39] AH-GS: Augmented 3D Gaussian Splatting for High-Frequency Detail Representation

【速读】:该论文旨在解决3D Gaussian Splatting (3D-GS)及其改进版Scaffold-GS在场景表示和视图合成中的两个主要问题:一是Scaffold-GS对精细场景渲染高度依赖于充分的视角信息;二是神经网络学习的频谱偏见导致其难以有效感知和学习场景中的高频信息。为了解决这些问题,论文的关键方案是通过增强输入特征的流形复杂度,并引入基于网络的特征图损失来提升3D-GS模型的图像重建质量。具体而言,提出的方法AH-GS使结构复杂区域的3D高斯点能够获得更高频率的编码,从而更有效地学习场景的高频信息。此外,通过加入高频强化损失进一步增强模型捕获细节频率信息的能力。实验结果表明,该方法显著提高了渲染保真度,并在特定场景(如MipNeRf360-garden)中仅经过15K迭代即可超越Scaffold-GS的渲染质量。

链接: https://arxiv.org/abs/2503.22324
作者: Chenyang Xu,XingGuo Deng,Rui Zhong
机构: Fuzhou University (福州大学); Central China Normal University (华中师范大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The 3D Gaussian Splatting (3D-GS) is a novel method for scene representation and view synthesis. Although Scaffold-GS achieves higher quality real-time rendering compared to the original 3D-GS, its fine-grained rendering of the scene is extremely dependent on adequate viewing angles. The spectral bias of neural network learning results in Scaffold-GS’s poor ability to perceive and learn high-frequency information in the scene. In this work, we propose enhancing the manifold complexity of input features and using network-based feature map loss to improve the image reconstruction quality of 3D-GS models. We introduce AH-GS, which enables 3D Gaussians in structurally complex regions to obtain higher-frequency encodings, allowing the model to more effectively learn the high-frequency information of the scene. Additionally, we incorporate high-frequency reinforce loss to further enhance the model’s ability to capture detailed frequency information. Our result demonstrates that our model significantly improves rendering fidelity, and in specific scenarios (e.g., MipNeRf360-garden), our method exceeds the rendering quality of Scaffold-GS in just 15K iterations.
zh

[CV-40] A Dataset for Semantic Segmentation in the Presence of Unknowns CVPR2025

【速读】:该论文试图解决的问题是如何全面评估深度神经网络在处理已知输入(训练数据中包含的内容)和未知异常(如场景理解任务中的安全关键应用,例如自动驾驶中遇到的情况)方面的性能。现有数据集仅支持对已知或未知的单独评估,而无法同时涵盖两者,这限制了模型在真实世界环境中的适用性评价。为了解决这一问题,论文提出了一种新的异常分割数据集ISSU,其关键在于通过包含来自杂乱真实环境的多样化异常输入,显著扩展了现有异常分割数据集的规模,并提供了用于受控领域内评估的训练、验证和测试集。此外,测试集分为静态和动态(视频)部分,并且数据集同时标注了闭集(已知内容)和异常,从而支持闭集和开集评估。这种设计使得研究者能够分析不同条件(如领域迁移、传感器差异及光照变化)下异常检测方法的表现,特别是针对跨领域的泛化能力以及小目标和大目标分割的改进需求。

链接: https://arxiv.org/abs/2503.22309
作者: Zakaria Laskar,Tomas Vojir,Matej Grcic,Iaroslav Melekhov,Shankar Gangisettye,Juho Kannala,Jiri Matas,Giorgos Tolias,C.V. Jawahar
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to CVPR 2025

点击查看摘要

Abstract:Before deployment in the real-world deep neural networks require thorough evaluation of how they handle both knowns, inputs represented in the training data, and unknowns (anomalies). This is especially important for scene understanding tasks with safety critical applications, such as in autonomous driving. Existing datasets allow evaluation of only knowns or unknowns - but not both, which is required to establish “in the wild” suitability of deep neural network models. To bridge this gap, we propose a novel anomaly segmentation dataset, ISSU, that features a diverse set of anomaly inputs from cluttered real-world environments. The dataset is twice larger than existing anomaly segmentation datasets, and provides a training, validation and test set for controlled in-domain evaluation. The test set consists of a static and temporal part, with the latter comprised of videos. The dataset provides annotations for both closed-set (knowns) and anomalies, enabling closed-set and open-set evaluation. The dataset covers diverse conditions, such as domain and cross-sensor shift, illumination variation and allows ablation of anomaly detection methods with respect to these variations. Evaluation results of current state-of-the-art methods confirm the need for improvements especially in domain-generalization, small and large object segmentation.
zh

[CV-41] VisTa: Visual-contextual and Text-augmented Zero-shot Object-level OOD Detection

【速读】:该论文致力于解决零样本(zero-shot)目标级分布外(OOD)检测的问题,即在对象检测器作为黑盒云服务或仅提供受限访问原始训练数据的预训练模型部署时,如何可靠地识别开放世界中的分布外目标。现有的基于预训练视觉-语言模型(如CLIP)的方法主要针对图像级OOD检测,但直接应用于目标级OOD检测时面临丢失上下文信息及依赖图像级对齐的挑战。论文的关键解决方案是引入一种利用视觉提示(visual prompts)和文本增强的分布内(ID)空间构建方法,以适配CLIP用于零样本目标级OOD检测。此方法通过保留关键上下文信息,提升了区分分布内与分布外目标的能力,在不同基准测试中表现出竞争力。
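
下面给出一个基于 OpenAI 官方 clip 包的极简示意(非论文实现;ID 类别、提示模板与"保留上下文的外扩裁剪"均为本文假设,用多模板取均值粗略模拟文本增强),演示目标级 OOD 打分的基本流程:

```python
import torch
import clip                      # OpenAI 官方 CLIP 包, 首次运行会下载权重
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

id_classes = ["car", "pedestrian", "bicycle"]                 # 假设的 ID 类别
templates = ["a photo of a {}.", "a photo of a {} in a street scene."]
with torch.no_grad():
    tokens = clip.tokenize([t.format(c) for c in id_classes for t in templates]).to(device)
    txt = model.encode_text(tokens)
    txt = txt / txt.norm(dim=-1, keepdim=True)
    txt = txt.view(len(id_classes), len(templates), -1).mean(1)   # 文本增强后取均值构建 ID 空间
    txt = txt / txt.norm(dim=-1, keepdim=True)

def ood_score(image: Image.Image, box, context: float = 0.2) -> float:
    """box: (x1, y1, x2, y2)。裁剪时向外扩 context 比例, 以保留目标周围的上下文信息。"""
    x1, y1, x2, y2 = box
    w, h = x2 - x1, y2 - y1
    crop = image.crop((max(0, int(x1 - context * w)), max(0, int(y1 - context * h)),
                       int(x2 + context * w), int(y2 + context * h)))
    with torch.no_grad():
        feat = model.encode_image(preprocess(crop).unsqueeze(0).to(device))
        feat = feat / feat.norm(dim=-1, keepdim=True)
        sims = (100.0 * feat @ txt.T).softmax(dim=-1)
    return 1.0 - sims.max().item()          # 与所有 ID 类别都不相似 -> 分数高, 更可能为 OOD
```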

链接: https://arxiv.org/abs/2503.22291
作者: Bin Zhang,Xiaoyang Qu,Guokuan Li,Jiguang Wan,Jianzong Wang
机构: Wuhan National Laboratory for Optoelectronics, Huazhong University of Science and Technology (华中科技大学); Ping An Technology (深圳平安科技有限公司)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 5 pages, 4 figures

点击查看摘要

Abstract:As object detectors are increasingly deployed as black-box cloud services or pre-trained models with restricted access to the original training data, the challenge of zero-shot object-level out-of-distribution (OOD) detection arises. This task becomes crucial in ensuring the reliability of detectors in open-world settings. While existing methods have demonstrated success in image-level OOD detection using pre-trained vision-language models like CLIP, directly applying such models to object-level OOD detection presents challenges due to the loss of contextual information and reliance on image-level alignment. To tackle these challenges, we introduce a new method that leverages visual prompts and text-augmented in-distribution (ID) space construction to adapt CLIP for zero-shot object-level OOD detection. Our method preserves critical contextual information and improves the ability to differentiate between ID and OOD objects, achieving competitive performance across different benchmarks.
zh

[CV-42] RUNA: Object-level Out-of-Distribution Detection via Regional Uncertainty Alignment of Multimodal Representations

【速读】:本文旨在解决目标检测模型在识别分布外(Out-of-Distribution, OOD)物体时可靠性不足的问题,主要障碍在于模型通常无法从不熟悉的样本中获得监督信号,导致对OOD物体的预测过于自信。尽管已有方法通过检测模型和分布内(In-Distribution, ID)样本估计OOD不确定性,但这些方法多基于图像级处理。本文提出利用预训练的视觉-语言表示进行目标级OOD检测。方案的关键在于提出了RUNA框架,采用双编码器架构捕获丰富的上下文信息,并引入区域不确定性对齐机制以有效区分ID与OOD物体。此外,通过少量样本微调对齐区域级语义表示,进一步提升模型区分相似物体的能力。实验表明,RUNA在目标级OOD检测任务中显著超越现有最先进方法,特别是在包含多样且复杂物体实例的挑战性场景中表现出色。

链接: https://arxiv.org/abs/2503.22285
作者: Bin Zhang,Jinggang Chen,Xiaoyang Qu,Guokuan Li,Kai Lu,Jiguang Wan,Jing Xiao,Jianzong Wang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 9 pages, 5 figures

点击查看摘要

Abstract:Enabling object detectors to recognize out-of-distribution (OOD) objects is vital for building reliable systems. A primary obstacle stems from the fact that models frequently do not receive supervisory signals from unfamiliar data, leading to overly confident predictions regarding OOD objects. Despite previous progress that estimates OOD uncertainty based on the detection model and in-distribution (ID) samples, we explore using pre-trained vision-language representations for object-level OOD detection. We first discuss the limitations of applying image-level CLIP-based OOD detection methods to object-level scenarios. Building upon these insights, we propose RUNA, a novel framework that leverages a dual encoder architecture to capture rich contextual information and employs a regional uncertainty alignment mechanism to distinguish ID from OOD objects effectively. We introduce a few-shot fine-tuning approach that aligns region-level semantic representations to further improve the model’s capability to discriminate between similar objects. Our experiments show that RUNA substantially surpasses state-of-the-art methods in object-level OOD detection, particularly in challenging scenarios with diverse and complex object instances.
zh

[CV-43] Divide to Conquer: A Field Decomposition Approach for Multi-Organ Whole-Body CT Image Registration

【速读】:本文旨在解决现有图像配准方法在处理多器官全身CT图像配准时面临的高复杂度变形场问题。这些传统方法通常针对特定器官设计,通用性较差,而现有的多器官配准虽能同时处理多个器官,但其复杂的变形场(由多种个体变形叠加组成)难以有效建模。为应对这一挑战,论文提出了一种新颖的场分解方法,通过将复杂变形场分解为更简单的子组件来降低计算难度。关键在于引入这种场分解策略,使模型能够更好地捕捉多器官间的复杂形变关系,从而提升配准性能。实验基于包含691名患者的纵向数据集进行评估,结果表明所提方法优于基于优化技术和深度学习的基准方法。
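
把复杂形变拆解为多个简单子形变后,还需要把它们逐层复合回去。下面给出"两个 2D 位移场复合"的 PyTorch 示意(与论文实现无关),这是多层形变复合的基本运算:

```python
import torch
import torch.nn.functional as F

def compose_displacements(d1: torch.Tensor, d2: torch.Tensor) -> torch.Tensor:
    """d1, d2: (B, 2, H, W), 像素单位的 2D 位移场, 通道顺序为 (dx, dy)。
    返回复合位移 d, 满足 x + d(x) = phi1(phi2(x)), 其中 phi_i(x) = x + d_i(x)。"""
    B, _, H, W = d2.shape
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    grid = torch.stack([xs, ys], dim=0).float().to(d2.device)        # (2, H, W), (x, y)
    loc = grid.unsqueeze(0) + d2                                      # phi2(x)
    # 归一化到 [-1, 1], 在 phi2(x) 处采样 d1
    norm = torch.stack([2 * loc[:, 0] / (W - 1) - 1,
                        2 * loc[:, 1] / (H - 1) - 1], dim=-1)         # (B, H, W, 2)
    d1_at_phi2 = F.grid_sample(d1, norm, align_corners=True)
    return d2 + d1_at_phi2

d1 = torch.zeros(1, 2, 64, 64); d1[:, 0] += 3.0      # 向右 3 像素
d2 = torch.zeros(1, 2, 64, 64); d2[:, 1] += 2.0      # 向下 2 像素
print(compose_displacements(d1, d2)[0, :, 32, 32])    # tensor([3., 2.])
```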

链接: https://arxiv.org/abs/2503.22281
作者: Xuan Loc Pham,Mathias Prokop,Bram van Ginneken,Alessa Hering
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Image registration is an essential technique for the analysis of Computed Tomography (CT) images in clinical practice. However, existing methodologies are predominantly tailored to a specific organ of interest and often exhibit lower performance on other organs, thus limiting their generalizability and applicability. Multi-organ registration addresses these limitations, but the simultaneous alignment of multiple organs with diverse shapes, sizes and locations requires a highly complex deformation field with a multi-layer composition of individual deformations. This study introduces a novel field decomposition approach to address the high complexity of deformations in multi-organ whole-body CT image registration. The proposed method is trained and evaluated on a longitudinal dataset of 691 patients, each with two CT images obtained at distinct time points. These scans fully encompass the thoracic, abdominal, and pelvic regions. Two baseline registration methods are selected for this study: one based on optimization techniques and another based on deep learning. Experimental results demonstrate that the proposed approach outperforms baseline methods in handling complex deformations in multi-organ whole-body CT image registration.
zh

[CV-44] Segment Any Motion in Videos CVPR2025

【速读】:该论文旨在解决视频中移动物体分割这一关键任务,传统方法主要依赖光流提供运动线索,但容易因部分运动、复杂形变、运动模糊及背景干扰等因素导致预测不完美。论文提出了一种结合长程轨迹运动线索与DINO语义特征的新方法,并利用SAM2通过迭代提示策略进行像素级掩码细化。解决方案的关键在于采用空间-时间轨迹注意力机制以及运动-语义解耦嵌入,以优先考虑运动信息的同时整合语义支持。

链接: https://arxiv.org/abs/2503.22268
作者: Nan Huang,Wenzhao Zheng,Chenfeng Xu,Kurt Keutzer,Shanghang Zhang,Angjoo Kanazawa,Qianqian Wang
机构: UC Berkeley; Peking University
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: CVPR 2025. Website: this https URL

点击查看摘要

Abstract:Moving object segmentation is a crucial task for achieving a high-level understanding of visual scenes and has numerous downstream applications. Humans can effortlessly segment moving objects in videos. Previous work has largely relied on optical flow to provide motion cues; however, this approach often results in imperfect predictions due to challenges such as partial motion, complex deformations, motion blur and background distractions. We propose a novel approach for moving object segmentation that combines long-range trajectory motion cues with DINO-based semantic features and leverages SAM2 for pixel-level mask densification through an iterative prompting strategy. Our model employs Spatio-Temporal Trajectory Attention and Motion-Semantic Decoupled Embedding to prioritize motion while integrating semantic support. Extensive testing on diverse datasets demonstrates state-of-the-art performance, excelling in challenging scenarios and fine-grained segmentation of multiple objects. Our code is available at this https URL.
zh

[CV-45] DeepAudio-V1:Towards Multi-Modal Multi-Stage End-to-End Video to Speech and Audio Generation

【速读】:该论文旨在解决在视频和文本条件下,端到端同步生成语音和音频的研究不足问题。现有方法主要关注视频到高质量同步音频的生成,而同时生成语音和音频的端到端多模态生成尚未被充分研究。论文的关键在于提出了一种名为DeepAudio的端到端多模态生成框架,该框架包含视频到音频(V2A)模块、文本到语音(TTS)模块以及动态模态融合(MoF)模块,通过这些组件实现视频和文本条件下的语音与音频的同时生成,并在视频-音频、视频-语音及文本-语音基准测试中达到了最先进的性能。

链接: https://arxiv.org/abs/2503.22265
作者: Haomin Zhang,Chang Liu,Junjie Zheng,Zihao Chen,Chaofan Ding,Xinhan Di
机构: AI Lab, Giant Network (巨人网络人工智能实验室); University of Trento (特伦托大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注: 11 pages, 5 figures

点击查看摘要

Abstract:Currently, high-quality, synchronized audio is synthesized using various multi-modal joint learning frameworks, leveraging video and optional text inputs. In the video-to-audio benchmarks, video-to-audio quality, semantic alignment, and audio-visual synchronization are effectively achieved. However, in real-world scenarios, speech and audio often coexist in videos simultaneously, and the end-to-end generation of synchronous speech and audio given video and text conditions are not well studied. Therefore, we propose an end-to-end multi-modal generation framework that simultaneously produces speech and audio based on video and text conditions. Furthermore, the advantages of video-to-audio (V2A) models for generating speech from videos remain unclear. The proposed framework, DeepAudio, consists of a video-to-audio (V2A) module, a text-to-speech (TTS) module, and a dynamic mixture of modality fusion (MoF) module. In the evaluation, the proposed end-to-end framework achieves state-of-the-art performance on the video-audio benchmark, video-speech benchmark, and text-speech benchmark. In detail, our framework achieves comparable results in the comparison with state-of-the-art models for the video-audio and text-speech benchmarks, and surpassing state-of-the-art models in the video-speech benchmark, with WER 16.57% to 3.15% (+80.99%), SPK-SIM 78.30% to 89.38% (+14.15%), EMO-SIM 66.24% to 75.56% (+14.07%), MCD 8.59 to 7.98 (+7.10%), MCD SL 11.05 to 9.40 (+14.93%) across a variety of dubbing settings.
zh

[CV-46] FLIP: Towards Comprehensive and Reliable Evaluation of Federated Prompt Learning

【速读】:该论文旨在解决联邦学习(Federated Learning, FL)与提示学习(Prompt Learning)在跨域视觉-语言模型中的结合问题。论文的关键在于提出了一种名为FLIP的综合框架,用于评估联邦提示学习算法的性能。通过在四种联邦学习协议和十二个开放数据集上测试八种最先进的联邦提示学习方法,并考虑六种不同的评估场景,研究发现提示学习能够在保持强泛化性能的同时显著降低资源消耗,尤其适用于数据稀缺、未见类别以及跨域分布偏移的环境。因此,该工作的关键是设计了FLIP框架以系统性地评估联邦提示学习方法的有效性,并揭示其在特定场景下的优势。
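
联邦提示学习通常只在客户端与服务器之间同步提示嵌入,而不是整个模型,这也是其通信开销低的原因。下面是一个与具体算法无关的 FedAvg 式聚合示意(非 FLIP 官方实现,客户端本地微调用加噪声的占位代替):

```python
import torch

def fedavg_prompts(client_prompts, client_sizes):
    """client_prompts: 各客户端上传的提示嵌入列表, 每个为 (L, D) 张量;
    client_sizes: 各客户端样本数, 用于加权平均。"""
    stacked = torch.stack(client_prompts)                        # (C, L, D)
    w = torch.tensor(client_sizes, dtype=stacked.dtype)
    w = w / w.sum()
    return (w.view(-1, 1, 1) * stacked).sum(dim=0)               # 聚合后的全局提示

# 一轮联邦训练的骨架 (客户端本地微调用加噪声的占位代替)
global_prompt = torch.zeros(16, 512)                             # 假设 16 个可学习 token, 维度 512
for rnd in range(3):
    local_prompts = [global_prompt + 0.01 * torch.randn_like(global_prompt) for _ in range(4)]
    global_prompt = fedavg_prompts(local_prompts, client_sizes=[120, 80, 200, 60])
print(global_prompt.shape)                                       # torch.Size([16, 512])
```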

链接: https://arxiv.org/abs/2503.22263
作者: Dongping Liao,Xitong Gao,Yabo Xu,Chengzhong Xu
机构: State Key Lab of IoTSC, CIS Dept, University of Macau (澳门大学物联网系统与通信国家重点实验室, 计算机与信息科学系); Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences (中国科学院深圳先进技术研究院); DataStory Information Technology Co., Ltd. (数说故事信息技术有限公司)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注: this https URL

点击查看摘要

Abstract:The increasing emphasis on privacy and data security has driven the adoption of federated learning, a decentralized approach to train machine learning models without sharing raw data. Prompt learning, which fine-tunes prompt embeddings of pretrained models, offers significant advantages in federated settings by reducing computational costs and communication overheads while leveraging the strong performance and generalization capabilities of vision-language models such as CLIP. This paper addresses the intersection of federated learning and prompt learning, particularly for vision-language models. In this work, we introduce a comprehensive framework, named FLIP, to evaluate federated prompt learning algorithms. FLIP assesses the performance of 8 state-of-the-art federated prompt learning methods across 4 federated learning protocols and 12 open datasets, considering 6 distinct evaluation scenarios. Our findings demonstrate that prompt learning maintains strong generalization performance in both in-distribution and out-of-distribution settings with minimal resource consumption. This work highlights the effectiveness of federated prompt learning in environments characterized by data scarcity, unseen classes, and cross-domain distributional shifts. We open-source the code for all implemented algorithms in FLIP to facilitate further research in this domain.
zh

[CV-47] Mono2Stereo: A Benchmark and Empirical Study for Stereo Conversion CVPR2025

【速读】:该论文旨在解决立体转换(stereo conversion)任务中因大规模训练数据和综合基准缺乏而导致的方法论优化不足及立体效果评估不准确的问题。论文的关键在于引入了Mono2Stereo数据集,提供高质量的训练数据和基准以支持深入研究,并通过实证研究揭示了现有方法的局限性:一是现有度量未能聚焦于影响立体效果的关键区域;二是主流方法在单阶段左到右生成或扭曲与修复管道中分别面临立体效果退化和图像失真的挑战。基于这些发现,论文提出了新的评估指标Stereo Intersection-over-Union(SIOU),强调视差优先级以实现与人类主观判断高度相关的立体效果评价,并设计了一种强基线模型,同时优化立体效果与图像质量,显著超越当前主流方法。

链接: https://arxiv.org/abs/2503.22262
作者: Songsong Yu,Yuxin Chen,Zhongang Qi,Zeke Xie,Yifan Wang,Lijun Wang,Ying Shan,Huchuan Lu
机构: DLUT(大连理工大学); ARC Lab, Tencent PCG(腾讯互娱 ARC 实验室); HKUST(GZ)(香港科技大学广州校区)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by CVPR 2025 Project webpage: this https URL

点击查看摘要

Abstract:With the rapid proliferation of 3D devices and the shortage of 3D content, stereo conversion is attracting increasing attention. Recent works introduce pretrained Diffusion Models (DMs) into this task. However, due to the scarcity of large-scale training data and comprehensive benchmarks, the optimal methodologies for employing DMs in stereo conversion and the accurate evaluation of stereo effects remain largely unexplored. In this work, we introduce the Mono2Stereo dataset, providing high-quality training data and benchmark to support in-depth exploration of stereo conversion. With this dataset, we conduct an empirical study that yields two primary findings. 1) The differences between the left and right views are subtle, yet existing metrics consider overall pixels, failing to concentrate on regions critical to stereo effects. 2) Mainstream methods adopt either one-stage left-to-right generation or warp-and-inpaint pipeline, facing challenges of degraded stereo effect and image distortion respectively. Based on these findings, we introduce a new evaluation metric, Stereo Intersection-over-Union, which prioritizes disparity and achieves a high correlation with human judgments on stereo effect. Moreover, we propose a strong baseline model, harmonizing the stereo effect and image quality simultaneously, and notably surpassing current mainstream methods. Our code and data will be open-sourced to promote further research in stereo conversion. Our models are available at this http URL.
zh

[CV-48] Efficient Building Roof Type Classification: A Domain-Specific Self-Supervised Approach

【速读】:该论文旨在解决利用航空影像进行建筑屋顶类型分类任务中因标注数据有限而制约监督学习方法性能的问题。为应对这一挑战,论文提出了一种基于EfficientNet架构的自监督学习框架,并通过引入Convolutional Block Attention Module (CBAM) 来增强特征提取能力。此外,论文探索了在特定领域数据集(如Aerial Image Dataset, AID)上预训练相较于ImageNet预训练的优势。关键在于结合SimCLR与EfficientNet-B3及CBAM的方法,不仅实现了95.5%的验证集准确率,且参数量显著少于当前最先进的基于Transformer的模型,同时证明了领域特定预训练的有效性。
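
论文把 CBAM 接在 EfficientNet 的特征之后。下面是 CBAM(通道注意力 + 空间注意力)的一个常见 PyTorch 写法(仅为示意,并非论文官方代码;示例中的通道数假设为 EfficientNet-B3 的末端通道数):

```python
import torch
import torch.nn as nn

class CBAM(nn.Module):
    """Convolutional Block Attention Module: 先通道注意力, 再空间注意力。"""
    def __init__(self, channels: int, reduction: int = 16, spatial_kernel: int = 7):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels))
        self.spatial = nn.Conv2d(2, 1, spatial_kernel, padding=spatial_kernel // 2)

    def forward(self, x):
        b, c, _, _ = x.shape
        # 通道注意力: 全局平均池化 + 最大池化 -> 共享 MLP
        avg = self.mlp(x.mean(dim=(2, 3)))
        mx = self.mlp(x.amax(dim=(2, 3)))
        x = x * torch.sigmoid(avg + mx).view(b, c, 1, 1)
        # 空间注意力: 通道维均值 / 最大值拼接 -> 7x7 卷积
        s = torch.cat([x.mean(dim=1, keepdim=True), x.amax(dim=1, keepdim=True)], dim=1)
        return x * torch.sigmoid(self.spatial(s))

feat = torch.randn(2, 1536, 7, 7)         # 假设为 EfficientNet-B3 的末端特征图
print(CBAM(1536)(feat).shape)
```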

链接: https://arxiv.org/abs/2503.22251
作者: Guneet Mutreja,Ksenia Bittner
机构: Remote Sensing Technology Institute, German Aerospace Center (DLR)(德国航空航天中心遥感技术研究所)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Accurate classification of building roof types from aerial imagery is crucial for various remote sensing applications, including urban planning, disaster management, and infrastructure monitoring. However, this task is often hindered by the limited availability of labeled data for supervised learning approaches. To address this challenge, this paper investigates the effectiveness of self supervised learning with EfficientNet architectures, known for their computational efficiency, for building roof type classification. We propose a novel framework that incorporates a Convolutional Block Attention Module (CBAM) to enhance the feature extraction capabilities of EfficientNet. Furthermore, we explore the benefits of pretraining on a domain-specific dataset, the Aerial Image Dataset (AID), compared to ImageNet pretraining. Our experimental results demonstrate the superiority of our approach. Employing Simple Framework for Contrastive Learning of Visual Representations (SimCLR) with EfficientNet-B3 and CBAM achieves a 95.5% accuracy on our validation set, matching the performance of state-of-the-art transformer-based models while utilizing significantly fewer parameters. We also provide a comprehensive evaluation on two challenging test sets, demonstrating the generalization capability of our method. Notably, our findings highlight the effectiveness of domain-specific pretraining, consistently leading to higher accuracy compared to models pretrained on the generic ImageNet dataset. Our work establishes EfficientNet based self-supervised learning as a computationally efficient and highly effective approach for building roof type classification, particularly beneficial in scenarios with limited labeled data.
zh

[CV-49] SCHNet: SAM Marries CLIP for Human Parsing

【速读】:本文旨在解决人体解析(Human Parsing)任务中同时需要高精度细粒度分割和强语义理解的挑战。尽管Segment Anything Model (SAM) 在细粒度分割方面表现出色,但在语义感知分割任务中面临重大挑战;而Contrastive Language-Image Pre-training Model (CLIP) 虽具备强大的语义理解能力,却在细粒度分割任务中存在不足。为克服这些限制,论文提出了两个关键模块:其一是一个语义精化模块(Semantic-Refinement Module),用于融合CLIP的语义特征与SAM的特征以提升解析效果;其二是高效的微调模块(Fine-tuning Module),通过调整预训练的SAM模型,使其适应需要高语义信息且同时强调空间细节的人体解析任务,显著减少了训练时间并提升了性能。实验结果验证了所提方法在LIP、PPP和CIHP数据集上的有效性。

链接: https://arxiv.org/abs/2503.22237
作者: Kunliang Liu,Jianming Wang,Rize Jin,Wonjun Hwang,Tae-Sun Chung
机构: Ajou University (亚洲大学), Korea; Tiangong University (天津工业大学), China; Korea University (高丽大学), Korea
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Vision Foundation Model (VFM) such as the Segment Anything Model (SAM) and Contrastive Language-Image Pre-training Model (CLIP) has shown promising performance for segmentation and detection tasks. However, although SAM excels in fine-grained segmentation, it faces major challenges when applying it to semantic-aware segmentation. While CLIP exhibits a strong semantic understanding capability via aligning the global features of language and vision, it has deficiencies in fine-grained segmentation tasks. Human parsing requires to segment human bodies into constituent parts and involves both accurate fine-grained segmentation and high semantic understanding of each part. Based on traits of SAM and CLIP, we formulate high efficient modules to effectively integrate features of them to benefit human parsing. We propose a Semantic-Refinement Module to integrate semantic features of CLIP with SAM features to benefit parsing. Moreover, we formulate a high efficient Fine-tuning Module to adjust the pretrained SAM for human parsing that needs high semantic information and simultaneously demands spatial details, which significantly reduces the training time compared with full-time training and achieves notable performance. Extensive experiments demonstrate the effectiveness of our method on LIP, PPP, and CIHP databases.
zh

[CV-50] Hi3DGen: High-fidelity 3D Geometry Generation from Images via Normal Bridging

【速读】:该论文旨在解决从2D图像生成高保真3D几何模型时面临的挑战,特别是在跨域差距和RGB图像固有歧义限制下准确再现精细几何细节的问题。为了解决这些问题,论文提出了一种名为Hi3DGen的新框架,通过法线图桥接实现高质量3D几何生成。Hi3DGen的关键在于三个组成部分:(1) 图像到法线估计器,它通过噪声注入和双流训练解耦高低频图像模式,以实现可泛化、稳定且清晰的估计;(2) 法线到几何学习方法,利用法线正则化潜在扩散学习增强3D几何生成的保真度;(3) 3D数据合成管道,构建高质量数据集支持训练。实验结果表明,该框架在生成丰富几何细节方面表现出色,超越了现有最先进的方法。这一工作通过利用法线图作为中间表示,为从图像生成高保真3D几何提供了新的方向。

链接: https://arxiv.org/abs/2503.22236
作者: Chongjie Ye,Yushuang Wu,Ziteng Lu,Jiahao Chang,Xiaoyang Guo,Jiaqing Zhou,Hao Zhao,Xiaoguang Han
机构: The Chinese University of Hong Kong, Shenzhen (香港中文大学(深圳)); ByteDance (字节跳动); Tsinghua University (清华大学)
类目: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
备注: this https URL

点击查看摘要

Abstract:With the growing demand for high-fidelity 3D models from 2D images, existing methods still face significant challenges in accurately reproducing fine-grained geometric details due to limitations in domain gaps and inherent ambiguities in RGB images. To address these issues, we propose Hi3DGen, a novel framework for generating high-fidelity 3D geometry from images via normal bridging. Hi3DGen consists of three key components: (1) an image-to-normal estimator that decouples the low-high frequency image pattern with noise injection and dual-stream training to achieve generalizable, stable, and sharp estimation; (2) a normal-to-geometry learning approach that uses normal-regularized latent diffusion learning to enhance 3D geometry generation fidelity; and (3) a 3D data synthesis pipeline that constructs a high-quality dataset to support training. Extensive experiments demonstrate the effectiveness and superiority of our framework in generating rich geometric details, outperforming state-of-the-art methods in terms of fidelity. Our work provides a new direction for high-fidelity 3D geometry generation from images by leveraging normal maps as an intermediate representation.
zh

[CV-51] CoGen: 3D Consistent Video Generation via Adaptive Conditioning for Autonomous Driving

【速读】:该论文旨在解决通过生成式方法(Generative Methods)实现具有高3D一致性的可控多视角驾驶视频生成这一挑战性问题。论文的关键创新在于提出了一种名为CoGen的空间自适应生成框架,其核心解决方案包括两个方面:(i) 首先生成高质量且可控制的3D条件来捕捉驾驶场景的几何结构,用这些精细的3D表示替代粗略的2D条件,从而显著提升生成视频的空间一致性;(ii) 引入一致性适配器模块以增强模型对多条件控制的鲁棒性。这些改进使得该方法在保持几何保真度和视觉逼真度方面表现出色,为自动驾驶提供了可靠的视频生成方案。

链接: https://arxiv.org/abs/2503.22231
作者: Yishen Ji,Ziyue Zhu,Zhenxin Zhu,Kaixin Xiong,Ming Lu,Zhiqi Li,Lijun Zhou,Haiyang Sun,Bing Wang,Tong Lu
机构: Nanjing University (南京大学); Xiaomi EV (小米汽车); Nankai University (南开大学); Peking University (北京大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Recent progress in driving video generation has shown significant potential for enhancing self-driving systems by providing scalable and controllable training data. Although pretrained state-of-the-art generation models, guided by 2D layout conditions (e.g., HD maps and bounding boxes), can produce photorealistic driving videos, achieving controllable multi-view videos with high 3D consistency remains a major challenge. To tackle this, we introduce a novel spatial adaptive generation framework, CoGen, which leverages advances in 3D generation to improve performance in two key aspects: (i) To ensure 3D consistency, we first generate high-quality, controllable 3D conditions that capture the geometry of driving scenes. By replacing coarse 2D conditions with these fine-grained 3D representations, our approach significantly enhances the spatial consistency of the generated videos. (ii) Additionally, we introduce a consistency adapter module to strengthen the robustness of the model to multi-condition control. The results demonstrate that this method excels in preserving geometric fidelity and visual realism, offering a reliable video generation solution for autonomous driving.
zh

[CV-52] Follow Your Motion: A Generic Temporal Consistency Portrait Editing Framework with Trajectory Guidance

【速读】:该论文致力于解决预训练条件扩散模型在图像编辑中面临的时序一致性挑战,特别是在人脸表情连续变化的 Talking Head 领域。传统方法因独立处理单帧图像而在编辑过程中丢失时序连续性,导致编辑后的虚拟人物在动态表现上缺乏连贯性。论文的关键解决方案包括两部分:首先,提出 Follow Your Motion (FYM) 框架,通过开发一种扩散模型,使其直观且内在地学习不同尺度与像素坐标的运动轨迹变化,确保编辑后的虚拟人物继承渲染图像中的运动信息;其次,引入动态重加权注意力机制,通过为关键点赋予动态调整的权重系数,并基于关键点损失进行更新,从而实现更一致且精细的面部表情时序一致性。

链接: https://arxiv.org/abs/2503.22225
作者: Haijie Yang,Zhenyu Zhang,Hao Tang,Jianjun Qian,Jian Yang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: this https URL

点击查看摘要

Abstract:Pre-trained conditional diffusion models have demonstrated remarkable potential in image editing. However, they often face challenges with temporal consistency, particularly in the talking head domain, where continuous changes in facial expressions intensify the level of difficulty. These issues stem from the independent editing of individual images and the inherent loss of temporal continuity during the editing process. In this paper, we introduce Follow Your Motion (FYM), a generic framework for maintaining temporal consistency in portrait editing. Specifically, given portrait images rendered by a pre-trained 3D Gaussian Splatting model, we first develop a diffusion model that intuitively and inherently learns motion trajectory changes at different scales and pixel coordinates, from the first frame to each subsequent frame. This approach ensures that temporally inconsistent edited avatars inherit the motion information from the rendered avatars. Secondly, to maintain fine-grained expression temporal consistency in talking head editing, we propose a dynamic re-weighted attention mechanism. This mechanism assigns higher weight coefficients to landmark points in space and dynamically updates these weights based on landmark loss, achieving more consistent and refined facial expressions. Extensive experiments demonstrate that our method outperforms existing approaches in terms of temporal consistency and can be used to optimize and compensate for temporally inconsistent outputs in a range of applications, such as text-driven editing, relighting, and various other applications.
zh

[CV-53] ABC-GS: Alignment-Based Controllable Style Transfer for 3D Gaussian Splatting

【速读】:本文旨在解决基于Neural Radiance Fields (NeRF) 的3D场景风格化方法中存在的两个主要问题:一是Nearest Neighbor Feature Matching (NNFM) 损失函数未考虑全局风格信息;二是NeRF的隐式表示限制了对生成场景的精细控制。为了解决这些问题,论文提出了一种名为ABC-GS的新框架,基于3D Gaussian Splatting实现高质量的3D风格迁移。其关键在于设计了一个可控的匹配阶段,通过分割掩码实现场景内容与风格特征之间的精确对齐,并提出了基于特征对齐的风格迁移损失函数以确保输出结果忠实反映参考图像的整体风格。此外,通过深度损失和Gaussian正则化项保留原始场景的几何信息。实验表明,ABC-GS提供了风格迁移的可控制性,并实现了更忠实地符合所选艺术参考风格的风格化结果。
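
作为参照,下面给出被该工作讨论并改进的 NNFM(最近邻特征匹配)损失的一个 PyTorch 草图,便于理解其“逐特征找最近邻、缺乏全局风格信息”的特点;特征的来源与形状均为假设。

```python
import torch
import torch.nn.functional as F

def nnfm_loss(render_feats: torch.Tensor, style_feats: torch.Tensor) -> torch.Tensor:
    """最近邻特征匹配损失的示意实现。

    render_feats: (N, D) 渲染视图的深度特征(空间维已展平)
    style_feats:  (M, D) 风格参考图的深度特征
    """
    r = F.normalize(render_feats, dim=-1)
    s = F.normalize(style_feats, dim=-1)
    cos_sim = r @ s.T                       # (N, M) 两两余弦相似度
    nearest = cos_sim.max(dim=1).values     # 每个渲染特征到风格特征的最近邻相似度
    return (1 - nearest).mean()             # 只约束局部最近邻,不含全局风格统计
```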

链接: https://arxiv.org/abs/2503.22218
作者: Wenjie Liu,Zhongliang Liu,Xiaoyan Yang,Man Sha,Yang Li
机构: School of Computer Science and Technology, East China Normal University (华东师范大学计算机科学与技术学院), Shanghai, China; School of Software Engineering, East China Normal University (华东师范大学软件工程学院), Shanghai, China; Shanghai Chinafortune Co., Ltd (上海中福世富科技有限公司), Shanghai, China
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注: 10 pages, 14 figures

点击查看摘要

Abstract:3D scene stylization approaches based on Neural Radiance Fields (NeRF) achieve promising results by optimizing with Nearest Neighbor Feature Matching (NNFM) loss. However, NNFM loss does not consider global style information. In addition, the implicit representation of NeRF limits their fine-grained control over the resulting scenes. In this paper, we introduce ABC-GS, a novel framework based on 3D Gaussian Splatting to achieve high-quality 3D style transfer. To this end, a controllable matching stage is designed to achieve precise alignment between scene content and style features through segmentation masks. Moreover, a style transfer loss function based on feature alignment is proposed to ensure that the outcomes of style transfer accurately reflect the global style of the reference image. Furthermore, the original geometric information of the scene is preserved with the depth loss and Gaussian regularization terms. Extensive experiments show that our ABC-GS provides controllability of style transfer and achieves stylization results that are more faithfully aligned with the global style of the chosen artistic reference. Our homepage is available at this https URL.
zh

[CV-54] Intrinsic Image Decomposition for Robust Self-supervised Monocular Depth Estimation on Reflective Surfaces AAAI2025

【速读】:本文旨在解决自监督单目深度估计(Self-Supervised Monocular Depth Estimation, SSMDE)中因采用Lambertian假设而导致在反射表面处理上存在显著误差的问题。传统方法依赖光度一致性损失(Photometric Consistency Loss),但该损失函数假设物体表面为朗伯体,当面对偏离此模型的反射表面时,容易产生较大误差。为克服这一局限性,论文提出了一种新颖框架,将内在图像分解(Intrinsic Image Decomposition)融入SSMDE中。方案的关键在于同时协同训练单目深度估计与内在图像分解任务:准确的深度估计通过对齐不同视角坐标系实现多图像一致性,从而辅助内在图像分解;而分解过程能够识别反射区域,并排除这些区域对深度训练的干扰。此外,该框架还引入伪深度生成与知识蒸馏技术,进一步提升学生模型在反射与非反射表面下的性能表现。综合多数据集上的评估结果表明,所提方法在深度预测方面显著优于现有SSMDE基线,尤其是在处理反射表面时表现出色。
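
下面是自监督单目深度估计中常见的光度一致性损失(SSIM 与 L1 加权)的示意写法,对应摘要中依赖朗伯假设的部分;窗口大小与权重 alpha 采用常见默认值,并非论文原实现。

```python
import torch
import torch.nn.functional as F

def photometric_loss(pred: torch.Tensor, target: torch.Tensor, alpha: float = 0.85) -> torch.Tensor:
    """SSIM + L1 加权的光度一致性损失(示意),pred/target: (B, 3, H, W)。"""
    l1 = (pred - target).abs().mean(dim=1, keepdim=True)

    # 简化版 SSIM,用 3x3 平均池化估计局部统计量
    c1, c2 = 0.01 ** 2, 0.03 ** 2
    mu_x = F.avg_pool2d(pred, 3, 1, 1)
    mu_y = F.avg_pool2d(target, 3, 1, 1)
    sigma_x = F.avg_pool2d(pred ** 2, 3, 1, 1) - mu_x ** 2
    sigma_y = F.avg_pool2d(target ** 2, 3, 1, 1) - mu_y ** 2
    sigma_xy = F.avg_pool2d(pred * target, 3, 1, 1) - mu_x * mu_y
    ssim = ((2 * mu_x * mu_y + c1) * (2 * sigma_xy + c2)) / \
           ((mu_x ** 2 + mu_y ** 2 + c1) * (sigma_x + sigma_y + c2))
    dssim = ((1 - ssim) / 2).clamp(0, 1).mean(dim=1, keepdim=True)

    return (alpha * dssim + (1 - alpha) * l1).mean()
```

在反射表面上,合成视图与真实图像的外观差异并非由深度误差引起,此类损失因此会产生误导性的梯度,这也是论文引入内在图像分解的动机。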

链接: https://arxiv.org/abs/2503.22209
作者: Wonhyeok Choi,Kyumin Hwang,Minwoo Choi,Kiljoon Han,Wonjoon Choi,Mingyu Shin,Sunghoon Im
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Accepted at AAAI 2025

点击查看摘要

Abstract:Self-supervised monocular depth estimation (SSMDE) has gained attention in the field of deep learning as it estimates depth without requiring ground truth depth maps. This approach typically uses a photometric consistency loss between a synthesized image, generated from the estimated depth, and the original image, thereby reducing the need for extensive dataset acquisition. However, the conventional photometric consistency loss relies on the Lambertian assumption, which often leads to significant errors when dealing with reflective surfaces that deviate from this model. To address this limitation, we propose a novel framework that incorporates intrinsic image decomposition into SSMDE. Our method synergistically trains for both monocular depth estimation and intrinsic image decomposition. The accurate depth estimation facilitates multi-image consistency for intrinsic image decomposition by aligning different view coordinate systems, while the decomposition process identifies reflective areas and excludes corrupted gradients from the depth training process. Furthermore, our framework introduces a pseudo-depth generation and knowledge distillation technique to further enhance the performance of the student model across both reflective and non-reflective surfaces. Comprehensive evaluations on multiple datasets show that our approach significantly outperforms existing SSMDE baselines in depth prediction, especially on reflective surfaces.
zh

[CV-55] DeepSound-V1: Start to Think Step-by-Step in the Audio Generation from Videos

【速读】:该论文旨在解决视频生成高质量同步音频时视觉与生成音频之间对齐精度不足的问题。目前开放数据集中缺乏足够的时序和语义标注是导致这一问题的关键因素之一。为了解决此问题,论文提出了一种利用多模态大语言模型(Multi-modal Large Language Model, MLLM)内部链式思维(Chain-of-Thought, CoT)机制的框架,通过逐步推理的方式实现精准对齐,而无需额外标注。此外,构建了一个相应的多模态推理数据集以促进音频生成的初始推理学习。实验结果表明,所提方法在多个指标上超越了现有最先进的方法,显著减少了语音错位现象,并提升了生成音频的质量。

链接: https://arxiv.org/abs/2503.22208
作者: Yunming Liang,Zihao Chen,Chaofan Ding,Xinhan Di
机构: AI Lab, Giant Network (巨人网络AI实验室)
类目: Sound (cs.SD); Computer Vision and Pattern Recognition (cs.CV); Audio and Speech Processing (eess.AS)
备注: 11 pages, 6 figures

点击查看摘要

Abstract:Currently, high-quality, synchronized audio is synthesized from video and optional text inputs using various multi-modal joint learning frameworks. However, the precise alignment between the visual and generated audio domains remains far from satisfactory. One key factor is the lack of sufficient temporal and semantic alignment annotations in open-source video-audio and text-audio benchmarks. Therefore, we propose a framework for audio generation from videos, leveraging the internal chain-of-thought (CoT) of a multi-modal large language model (MLLM) to enable step-by-step reasoning without requiring additional annotations. Additionally, a corresponding multi-modal reasoning dataset is constructed to facilitate the learning of initial reasoning in audio generation. In the experiments, we demonstrate the effectiveness of the proposed framework in reducing misalignment (voice-over) in generated audio and achieving competitive performance compared to various state-of-the-art models. The evaluation results show that the proposed method outperforms state-of-the-art approaches across multiple metrics. Specifically, the FD_PaSST indicator is reduced by up to 10.07%, the FD_PANNs indicator by up to 11.62%, and the FD_VGG indicator by up to 38.61%. Furthermore, the IS indicator improves by up to 4.95%, the IB-score indicator increases by up to 6.39%, and the DeSync indicator is reduced by up to 0.89%.
zh

[CV-56] Data-Free Universal Attack by Exploiting the Intrinsic Vulnerability of Deep Models AAAI2025

【速读】:该论文旨在解决深度神经网络(DNNs)对无实例特定性通用对抗扰动(Universal Adversarial Perturbations, UAPs)的易受攻击性问题,特别是现有UAP生成方法通常需要大量样本数据这一局限性。论文的关键创新在于提出了一种全新的无数据依赖方法——Intrinsic UAP(IntriUAP),通过挖掘深度模型的内在脆弱性来生成UAP。研究分析表明,线性组件主导了这类模型的脆弱性,并利用线性层的最大奇异值对应的右奇异向量来对齐UAP,从而有效利用线性组件的病态特性。这种方法在无需任何图像样本的情况下,实现了对流行图像分类模型的强大攻击效果。此外,IntriUAP还进一步放宽假设,在仅能访问目标模型部分线性层的情况下依然保持较高攻击成功率,仅下降4%。因此,其关键突破在于通过理论分析揭示模型脆弱性的根源,并设计出一种高效且鲁棒的无数据生成策略。
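
下面的草图示意“将 UAP 与线性层最大奇异值对应的右奇异向量对齐”这一核心思路;函数名、步长与 L_inf 约束方式均为假设性的简化,并非论文完整算法。

```python
import torch

def layer_dominant_direction(weight: torch.Tensor) -> torch.Tensor:
    """取线性层权重最大奇异值对应的右奇异向量,即该层最敏感的输入方向(示意)。"""
    _, _, vh = torch.linalg.svd(weight, full_matrices=False)
    return vh[0]                                  # shape: (in_features,)

def align_uap_with_layer(uap: torch.Tensor, weight: torch.Tensor,
                         step: float = 0.01, epsilon: float = 10 / 255) -> torch.Tensor:
    """把扁平化的 UAP 向该层主奇异方向推进一步,并裁剪到 L_inf 球内(示意)。"""
    v = layer_dominant_direction(weight)
    uap = uap + step * torch.sign(v @ uap) * v    # 沿主方向推进,保持与当前 UAP 同号
    return uap.clamp(-epsilon, epsilon)
```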

链接: https://arxiv.org/abs/2503.22205
作者: YangTian Yan,Jinyu Tian
机构: 未知
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted in AAAI 2025

点击查看摘要

Abstract:Deep neural networks (DNNs) are susceptible to Universal Adversarial Perturbations (UAPs), which are instance agnostic perturbations that can deceive a target model across a wide range of samples. Unlike instance-specific adversarial examples, UAPs present a greater challenge as they must generalize across different samples and models. Generating UAPs typically requires access to numerous examples, which is a strong assumption in real-world tasks. In this paper, we propose a novel data-free method called Intrinsic UAP (IntriUAP), by exploiting the intrinsic vulnerabilities of deep models. We analyze a series of popular deep models composed of linear and nonlinear layers with a Lipschitz constant of 1, revealing that the vulnerability of these models is predominantly influenced by their linear components. Based on this observation, we leverage the ill-conditioned nature of the linear components by aligning the UAP with the right singular vectors corresponding to the maximum singular value of each linear layer. Remarkably, our method achieves highly competitive performance in attacking popular image classification deep models without using any image samples. We also evaluate the black-box attack performance of our method, showing that it matches the state-of-the-art baseline for data-free methods on models that conform to our theoretical framework. Beyond the data-free assumption, IntriUAP also operates under a weaker assumption, where the adversary only can access a few of the victim model’s layers. Experiments demonstrate that the attack success rate decreases by only 4% when the adversary has access to just 50% of the linear layers in the victim model.
zh

[CV-57] Segment then Splat: A Unified Approach for 3D Open-Vocabulary Segmentation based on Gaussian Splatting

【速读】:该论文旨在解决3D空间中开放词汇查询的问题,特别是现有方法在静态和动态场景中因依赖2D像素级解析而导致的多视角不一致性和3D物体检索效果不佳的问题。此外,这些方法难以有效处理动态场景中的运动建模复杂性。论文的关键创新在于提出了一种名为“Segment then Splat”的3D感知开放词汇分割方法,基于高斯点 splatting 技术,能够同时适用于静态和动态场景。其核心解决方案是通过在重建之前将高斯分布划分为不同的物体集合(即“先分割后重建”策略的反转),从而在重建完成后自然实现单个物体的3D分割。这种方法不仅解决了动态场景中高斯体素与物体之间的错配问题,还加速了优化过程,避免了单独学习语言场的需求。最终,每个物体被分配一个CLIP嵌入以支持开放词汇查询。实验结果验证了该方法在静态和动态场景中的有效性。

链接: https://arxiv.org/abs/2503.22204
作者: Yiren Lu,Yunlai Zhou,Yiran Qiao,Chaoda Song,Tuo Liang,Jing Ma,Yu Yin
机构: Case Western Reserve University (凯斯西储大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project page: this https URL

点击查看摘要

Abstract:Open-vocabulary querying in 3D space is crucial for enabling more intelligent perception in applications such as robotics, autonomous systems, and augmented reality. However, most existing methods rely on 2D pixel-level parsing, leading to multi-view inconsistencies and poor 3D object retrieval. Moreover, they are limited to static scenes and struggle with dynamic scenes due to the complexities of motion modeling. In this paper, we propose Segment then Splat, a 3D-aware open vocabulary segmentation approach for both static and dynamic scenes based on Gaussian Splatting. Segment then Splat reverses the long established approach of “segmentation after reconstruction” by dividing Gaussians into distinct object sets before reconstruction. Once the reconstruction is complete, the scene is naturally segmented into individual objects, achieving true 3D segmentation. This approach not only eliminates Gaussian-object misalignment issues in dynamic scenes but also accelerates the optimization process, as it eliminates the need for learning a separate language field. After optimization, a CLIP embedding is assigned to each object to enable open-vocabulary querying. Extensive experiments on various datasets demonstrate the effectiveness of our proposed method in both static and dynamic scenarios.
zh

[CV-58] Multi-modal Knowledge Distillation-based Human Trajectory Forecasting CVPR2025

【速读】:该论文旨在解决在资源受限系统中利用文本描述增强行人轨迹预测准确性的问题。传统方法依赖视觉语言模型(Visual Language Model, VLM)在线提取文本信息,但这种方法可能不适用于资源受限的场景。为了解决这一挑战,论文提出了一种多模态知识蒸馏框架,通过从全模态教师模型(包含轨迹、人体姿态和文本信息)中蒸馏知识到仅使用轨迹或人体姿态作为补充的学生模型,从而实现跨模态信息的有效迁移。关键在于将教师模型中的核心运动洞察力(即个体内部多模态信息与个体间交互的关键知识)独立地传递给学生模型,使得学生模型能够在资源受限条件下依然具备强大的预测能力。实验结果表明,该框架在三种数据集上显著提升了预测性能,特别是在全观测和瞬时观测情况下,预测指标改善最高可达约13%。
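
下面给出“预测蒸馏 + 特征对齐”这一常见蒸馏损失形式的 PyTorch 草图,用于说明全模态教师向仅含轨迹或姿态的学生传递知识的一般做法;张量形状与权重系数均为假设,并非论文的原始损失设计。

```python
import torch
import torch.nn.functional as F

def multimodal_distillation_loss(student_pred: torch.Tensor,
                                 teacher_pred: torch.Tensor,
                                 student_feat: torch.Tensor,
                                 teacher_feat: torch.Tensor,
                                 lambda_feat: float = 1.0) -> torch.Tensor:
    """轨迹预测蒸馏 + 特征层面模态知识蒸馏(示意)。

    student_pred / teacher_pred: (B, T, 2) 预测的未来轨迹坐标
    student_feat / teacher_feat: (B, D)    编码器输出的中间特征
    """
    # 预测层面:让学生的轨迹逼近教师(含文本、姿态等额外模态)的轨迹
    pred_loss = F.smooth_l1_loss(student_pred, teacher_pred)
    # 特征层面:对齐学生与教师的表示
    feat_loss = 1 - F.cosine_similarity(student_feat, teacher_feat, dim=-1).mean()
    return pred_loss + lambda_feat * feat_loss
```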

链接: https://arxiv.org/abs/2503.22201
作者: Jaewoo Jeong,Seohee Lee,Daehee Park,Giwon Lee,Kuk-Jin Yoon
机构: Visual Intelligence Lab., KAIST (KAIST 视觉智能实验室); Intelligent Systems and Learning Lab., DGIST (DGIST 智能系统与学习实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to CVPR 2025

点击查看摘要

Abstract:Pedestrian trajectory forecasting is crucial in various applications such as autonomous driving and mobile robot navigation. In such applications, camera-based perception enables the extraction of additional modalities (human pose, text) to enhance prediction accuracy. Indeed, we find that textual descriptions play a crucial role in integrating additional modalities into a unified understanding. However, online extraction of text requires the use of VLM, which may not be feasible for resource-constrained systems. To address this challenge, we propose a multi-modal knowledge distillation framework: a student model with limited modality is distilled from a teacher model trained with full range of modalities. The comprehensive knowledge of a teacher model trained with trajectory, human pose, and text is distilled into a student model using only trajectory or human pose as a sole supplement. In doing so, we separately distill the core locomotion insights from intra-agent multi-modality and inter-agent interaction. Our generalizable framework is validated with two state-of-the-art models across three datasets on both ego-view (JRDB, SIT) and BEV-view (ETH/UCY) setups, utilizing both annotated and VLM-generated text captions. Distilled student models show consistent improvement in all prediction metrics for both full and instantaneous observations, improving up to ~13%. The code is available at this https URL.
zh

[CV-59] Enhance Generation Quality of Flow Matching V2A Model via Multi-Step CoT-Like Guidance and Combined Preference Optimization

【速读】:该论文旨在解决现有最先进的视频引导音频生成模型在生成高质量音频(无论是通用场景还是专业领域)时存在的不足。为应对这一挑战,论文提出了一种多阶段、多模态、端到端的生成框架——Chain-of-Perform (CoP),其核心在于引入类似于Chain-of-Thought (CoT) 的指导学习方法。关键解决方案包括:(1) 设计基于Transformer的网络架构以实现CoP指导,从而支持通用及专业音频生成;(2) 构建一个多阶段训练框架,通过逐步指导确保高质量音效生成;(3) 开发一个由视频引导的CoP多模态数据集,用于支持分步音效生成。实验结果表明,该框架在多个数据集上的表现优于现有方法,如VGGSound上的FAD和CLIP评分显著提升,以及PianoYT-2h和Piano-10h上的SI-SDR和MOS评分均有明显改善。

链接: https://arxiv.org/abs/2503.22200
作者: Haomin Zhang,Sizhe Shan,Haoyu Wang,Zihao Chen,Xiulong Liu,Chaofan Ding,Xinhan Di
机构: AI Lab, Giant Network (巨人网络AI实验室); Zhejiang University (浙江大学); University of Washington (华盛顿大学)
类目: Sound (cs.SD); Computer Vision and Pattern Recognition (cs.CV); Audio and Speech Processing (eess.AS)
备注: 10 pages, 4 figures

点击查看摘要

Abstract:Creating high-quality sound effects from videos and text prompts requires precise alignment between visual and audio domains, both semantically and temporally, along with step-by-step guidance for professional audio generation. However, current state-of-the-art video-guided audio generation models often fall short of producing high-quality audio for both general and specialized use cases. To address this challenge, we introduce a multi-stage, multi-modal, end-to-end generative framework with Chain-of-Thought-like (CoT-like) guidance learning, termed Chain-of-Perform (CoP). First, we employ a transformer-based network architecture designed to achieve CoP guidance, enabling the generation of both general and professional audio. Second, we implement a multi-stage training framework that follows step-by-step guidance to ensure the generation of high-quality sound effects. Third, we develop a CoP multi-modal dataset, guided by video, to support step-by-step sound effects generation. Evaluation results highlight the advantages of the proposed multi-stage CoP generative framework compared to the state-of-the-art models on a variety of datasets, with FAD 0.79 to 0.74 (+6.33%), CLIP 16.12 to 17.70 (+9.80%) on VGGSound, SI-SDR 1.98dB to 3.35dB (+69.19%), MOS 2.94 to 3.49(+18.71%) on PianoYT-2h, and SI-SDR 2.22dB to 3.21dB (+44.59%), MOS 3.07 to 3.42 (+11.40%) on Piano-10h.
zh

[CV-60] Hyperspectral Adapter for Object Tracking based on Hyperspectral Video

【速读】:该论文致力于解决基于高光谱视频的目标跟踪中因特征转换导致光谱信息损失以及对整个预训练网络进行微调效率低下的问题。为了解决这些问题,论文提出了高光谱适配器用于跟踪(HyA-T)的方法,其关键在于引入了自注意力的高光谱适配器(HAS)和多层感知器的高光谱适配器(HAM),通过将适配信息融入到预训练网络中的多头自注意力(MSA)模块和多层感知器(MLP)计算中,从而实现对光谱特征的有效提取与利用。此外,还提出了输入的高光谱增强(HEI)方法以进一步强化原始光谱信息。这些方法能够直接从高光谱图像中提取光谱信息,避免了光谱信息的丢失,并且仅需对所提出方法的参数进行微调,提高了实际应用效率。实验结果验证了所提方法在多个数据集上的有效性,HyA-T 在所有数据集上均达到了最先进的性能。
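
下面用一个瓶颈式适配器的 PyTorch 草图示意“把光谱适配信息注入冻结的 MSA/MLP 输出、只训练少量新增参数”这一思路;维度、初始化与注入位置均为常见做法的假设,并非 HAS/HAM 的原始结构。

```python
import torch
import torch.nn as nn

class HyperspectralAdapter(nn.Module):
    """瓶颈式适配器示意:压缩光谱特征后叠加到冻结模块的输出上。"""
    def __init__(self, dim: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.act = nn.GELU()
        self.up = nn.Linear(bottleneck, dim)
        nn.init.zeros_(self.up.weight)   # 零初始化:训练初期不干扰预训练特征
        nn.init.zeros_(self.up.bias)

    def forward(self, frozen_out: torch.Tensor, spectral_feat: torch.Tensor) -> torch.Tensor:
        # frozen_out: 预训练 MSA/MLP 的输出;spectral_feat: 从高光谱输入提取的特征
        return frozen_out + self.up(self.act(self.down(spectral_feat)))
```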

链接: https://arxiv.org/abs/2503.22199
作者: Long Gao,Yunhe Zhang,Langkun Chen,Yan Jiang,Weiying Xie,Yunsong Li
机构: State Key Laboratory of Integrated Service Networks (综合业务网理论及关键技术实验室), School of Telecommunications Engineering (电信工程学院), Xidian University (西安电子科技大学), No.2, South Taibai Street, Hi-Tech Development Zone, Xi’an, China, 710071; The Department of Electronic and Electrical Engineering (电子电气工程系), the University of Sheffield (谢菲尔德大学), Sheffield, UK, S10 2TN
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Object tracking based on hyperspectral video attracts increasing attention to the rich material and motion information in the hyperspectral videos. The prevailing hyperspectral methods adapt pretrained RGB-based object tracking networks for hyperspectral tasks by fine-tuning the entire network on hyperspectral datasets, which achieves impressive results in challenging scenarios. However, the performance of hyperspectral trackers is limited by the loss of spectral information during the transformation, and fine-tuning the entire pretrained network is inefficient for practical applications. To address the issues, a new hyperspectral object tracking method, hyperspectral adapter for tracking (HyA-T), is proposed in this work. The hyperspectral adapter for the self-attention (HAS) and the hyperspectral adapter for the multilayer perceptron (HAM) are proposed to generate the adaption information and to transfer the multi-head self-attention (MSA) module and the multilayer perceptron (MLP) in pretrained network for the hyperspectral object tracking task by augmenting the adaption information into the calculation of the MSA and MLP. Additionally, the hyperspectral enhancement of input (HEI) is proposed to augment the original spectral information into the input of the tracking network. The proposed methods extract spectral information directly from the hyperspectral images, which prevent the loss of the spectral information. Moreover, only the parameters in the proposed methods are fine-tuned, which is more efficient than the existing methods. Extensive experiments were conducted on four datasets with various spectral bands, verifying the effectiveness of the proposed methods. The HyA-T achieves state-of-the-art performance on all the datasets.
zh

[CV-61] Extremely Simple Out-of-distribution Detection for Audio-visual Generalized Zero-shot Learning

【速读】:该论文旨在解决音频-视觉广义零样本学习(Audio-Visual Generalized Zero-Shot Learning, AV-GZSL)中的领域偏移(domain shift)问题及由此引发的类别偏差(bias)问题。现有基于嵌入(embedding-based)和生成式(generative-based)的方法在处理多模态数据(音频、视频和自然语言)时容易受到这些挑战的影响。为了解决这些问题,论文提出了一种名为EZ-AVOOD的简单方法,其关键是通过利用类别特定的logits和类别无关的特征子空间,在初始阶段实现已见类别(seen classes)与未见类别(unseen classes)的有效分离,而无需额外训练一个离分布检测器网络。这种方法不仅缓解了领域偏移带来的影响,还通过两个专家模型分别对已见和未见样本进行分类,从而在三个音频-视觉数据集上实现了优于现有最先进方法的零样本学习(ZSL)和广义零样本学习(GZSL)性能,成为新的SOTA模型。
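
下面的草图示意如何同时利用类相关 logits 与类无关特征子空间来区分已见/未见样本;打分形式与权重均为假设,仅作概念演示,并非 EZ-AVOOD 的原始公式。

```python
import torch

def seen_score(logits: torch.Tensor, feats: torch.Tensor,
               seen_basis: torch.Tensor, alpha: float = 1.0) -> torch.Tensor:
    """结合类相关与类无关证据的简单打分(示意),分值越高越像已见类。

    logits:     (B, C) 已见类分类头输出
    feats:      (B, D) 样本特征
    seen_basis: (D, K) 已见类特征主子空间的正交基(例如由 PCA 得到)
    """
    max_logit = logits.max(dim=-1).values              # 类相关证据
    proj = feats @ seen_basis @ seen_basis.T            # 投影到已见类子空间
    residual = (feats - proj).norm(dim=-1)               # 子空间外残差,越大越像未见类
    return max_logit - alpha * residual

# 用法示意:score > 阈值判为已见类,交给已见类专家分类;否则交给未见类专家。
```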

链接: https://arxiv.org/abs/2503.22197
作者: Yang Liu,Xun Zhang,Jiale Du,Xinbo Gao,Jungong Han
机构: Xidian University (西安电子科技大学); Chongqing University of Posts and Telecommunications (重庆邮电大学); Tsinghua University (清华大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Zero-shot Learning(ZSL) attains knowledge transfer from seen classes to unseen classes by exploring auxiliary category information, which is a promising yet difficult research topic. In this field, Audio-Visual Generalized Zero-Shot Learning~(AV-GZSL) has aroused researchers’ great interest in which intricate relations within triple modalities~(audio, video, and natural language) render this task quite challenging but highly research-worthy. However, both existing embedding-based and generative-based AV-GZSL methods tend to suffer from domain shift problem a lot and we propose an extremely simple Out-of-distribution~(OOD) detection based AV-GZSL method~(EZ-AVOOD) to further mitigate bias problem by differentiating seen and unseen samples at the initial beginning. EZ-AVOOD accomplishes effective seen-unseen separation by exploiting the intrinsic discriminative information held in class-specific logits and class-agnostic feature subspace without training an extra OOD detector network. Followed by seen-unseen binary classification, we employ two expert models to classify seen samples and unseen samples separately. Compared to existing state-of-the-art methods, our model achieves superior ZSL and GZSL performances on three audio-visual datasets and becomes the new SOTA, which comprehensively demonstrates the effectiveness of the proposed EZ-AVOOD.
zh

[CV-62] ORIGEN: Zero-Shot 3D Orientation Grounding in Text-to-Image Generation

【速读】:该论文旨在解决文本到图像生成中多对象和多样化类别场景下的3D方向定位(3D Orientation Grounding)问题,这是此前基于空间定位的图像生成研究中未被充分关注的方向,传统方法主要局限于2D位置控制而缺乏对3D方向的调控能力。为应对这一挑战,论文提出了一种奖励引导采样方法(Reward-Guided Sampling Approach),结合预训练的判别模型用于3D方向估计以及单步文本到图像生成流模型。关键创新在于采用基于朗之万动力学(Langevin Dynamics)的采样策略替代梯度上升优化,后者虽自然适用于奖励引导但难以保持图像真实感;此外,通过引入基于奖励函数的自适应时间缩放(Adaptive Time Rescaling)加速收敛过程。这些方法共同构成了ORIGEN的核心解决方案。
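
摘要提到,基于朗之万动力学的采样只需在梯度上升之外额外注入高斯噪声。下面给出一步奖励引导更新的示意代码;reward_fn、步长等均为假设,不代表论文的具体实现。

```python
import torch

def langevin_reward_step(latent: torch.Tensor, reward_fn, step_size: float = 0.05) -> torch.Tensor:
    """对潜变量做一步奖励引导的朗之万更新:梯度上升 + 噪声注入(示意)。"""
    latent = latent.detach().requires_grad_(True)
    reward = reward_fn(latent).sum()          # 例如由 3D 方向判别模型给出的奖励
    grad = torch.autograd.grad(reward, latent)[0]
    noise = torch.randn_like(latent)
    # 与纯梯度上升的唯一区别:额外加入 sqrt(2 * step_size) 的高斯噪声
    return (latent + step_size * grad + (2 * step_size) ** 0.5 * noise).detach()
```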

链接: https://arxiv.org/abs/2503.22194
作者: Yunhong Min,Daehyeon Choi,Kyeongmin Yeo,Jihyun Lee,Minhyuk Sung
机构: KAIST (韩国科学技术院)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Project Page: this https URL

点击查看摘要

Abstract:We introduce ORIGEN, the first zero-shot method for 3D orientation grounding in text-to-image generation across multiple objects and diverse categories. While previous work on spatial grounding in image generation has mainly focused on 2D positioning, it lacks control over 3D orientation. To address this, we propose a reward-guided sampling approach using a pretrained discriminative model for 3D orientation estimation and a one-step text-to-image generative flow model. While gradient-ascent-based optimization is a natural choice for reward-based guidance, it struggles to maintain image realism. Instead, we adopt a sampling-based approach using Langevin dynamics, which extends gradient ascent by simply injecting random noise–requiring just a single additional line of code. Additionally, we introduce adaptive time rescaling based on the reward function to accelerate convergence. Our experiments show that ORIGEN outperforms both training-based and test-time guidance methods across quantitative metrics and user studies.
zh

[CV-63] Unbiased Max-Min Embedding Classification for Transductive Few-Shot Learning: Clustering and Classification Are All You Need

【速读】:该论文旨在解决Few-Shot Learning (FSL) 和 Transductive Few-Shot Learning (TFSL) 中因标注数据不足导致的性能瓶颈问题,特别是在面对Hubness问题和小样本场景下的分类准确性与鲁棒性挑战。论文的关键解决方案在于提出了一种名为Unbiased Max-Min Embedding Classification (UMMEC) 的方法,通过三项创新性贡献实现突破:首先,引入分散化的协方差矩阵以缓解Hubness问题,确保嵌入向量分布更加均匀;其次,结合局部对齐与全局一致性,利用自适应加权及非线性变换平衡类内聚类与类间分离;最后,采用Variational Sinkhorn Few-Shot Classifier优化样本与类别原型之间的距离度量,从而提升分类精度与模型鲁棒性。这些创新共同使UMMEC方法能够在极少量标注数据下展现出卓越的性能,推动了TFSL领域的技术前沿。
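
下面给出熵正则化最优传输的 Sinkhorn 迭代(对数域)的通用 Python 草图,用于说明“优化样本与类别原型之间传输距离”的基本机制;代价矩阵、迭代次数与正则化系数均为假设,并非论文中 Variational Sinkhorn 分类器的完整实现。

```python
import math
import torch

def sinkhorn_cost(cost: torch.Tensor, eps: float = 0.1, n_iters: int = 50) -> torch.Tensor:
    """对数域 Sinkhorn 迭代,返回样本-原型间的熵正则传输代价(示意)。

    cost: (N, K) 样本到各类原型的距离矩阵,行列边缘分布取均匀分布。
    """
    n, k = cost.shape
    log_mu = torch.full((n,), -math.log(n))
    log_nu = torch.full((k,), -math.log(k))
    log_kernel = -cost / eps
    u = torch.zeros(n)
    v = torch.zeros(k)
    for _ in range(n_iters):
        u = log_mu - torch.logsumexp(log_kernel + v[None, :], dim=1)
        v = log_nu - torch.logsumexp(log_kernel + u[:, None], dim=0)
    plan = torch.exp(log_kernel + u[:, None] + v[None, :])   # 最优传输计划
    return (plan * cost).sum()
```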

链接: https://arxiv.org/abs/2503.22193
作者: Yang Liu,Feixiang Liu,Jiale Du,Xinbo Gao,Jungong Han
机构: Xidian University (西安电子科技大学), Chongqing University of Posts and Telecommunications (重庆邮电大学), Tsinghua University (清华大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Convolutional neural networks and supervised learning have achieved remarkable success in various fields but are limited by the need for large annotated datasets. Few-shot learning (FSL) addresses this limitation by enabling models to generalize from only a few labeled examples. Transductive few-shot learning (TFSL) enhances FSL by leveraging both labeled and unlabeled data, though it faces challenges like the hubness problem. To overcome these limitations, we propose the Unbiased Max-Min Embedding Classification (UMMEC) Method, which addresses the key challenges in few-shot learning through three innovative contributions. First, we introduce a decentralized covariance matrix to mitigate the hubness problem, ensuring a more uniform distribution of embeddings. Second, our method combines local alignment and global uniformity through adaptive weighting and nonlinear transformation, balancing intra-class clustering with inter-class separation. Third, we employ a Variational Sinkhorn Few-Shot Classifier to optimize the distances between samples and class prototypes, enhancing classification accuracy and robustness. These combined innovations allow the UMMEC method to achieve superior performance with minimal labeled data. Our UMMEC method significantly improves classification performance with minimal labeled data, advancing the state-of-the-art in TFSL.
zh

[CV-64] Sell It Before You Make It: Revolutionizing E-Commerce with Personalized AI-Generated Items

【速读】:该论文旨在解决电子商务领域中产品设计与制造流程效率低下的问题,具体表现为传统工作流中产品设计、制造及库存管理耗费大量时间和资源。为应对这一挑战,论文提出了一种基于AI生成商品(AIGI)的创新系统,通过个性化文本到图像生成技术优化电商产品的设计流程。论文的核心解决方案是引入一个名为PerFusion的个性化群体偏好对齐框架,用于扩散模型。关键在于捕捉用户群体层面针对多个生成候选图像的个性化偏好。为此,首先设计了包含特征交叉个性化插件的PerFusion奖励模型来估计用户偏好;其次构建了一个具备个性化自适应网络的PerFusion模型,既能建模不同用户的多样化偏好,又能推导出群体层面的偏好优化目标以捕捉候选对象间的对比行为。实验结果表明,与人工设计的商品相比,AI生成的商品在点击率和转化率方面均有超过13%的相对提升,验证了AI生成商品在电商平台上革命性的潜力。

链接: https://arxiv.org/abs/2503.22182
作者: Jianghao Lin,Peng Du,Jiaqi Liu,Weite Li,Yong Yu,Weinan Zhang,Yang Cao
机构: Shanghai Jiao Tong University, China; Alibaba Group, China
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: Under Review

点击查看摘要

Abstract:E-commerce has revolutionized retail, yet its traditional workflows remain inefficient, with significant time and resource costs tied to product design and manufacturing inventory. This paper introduces a novel system deployed at Alibaba that leverages AI-generated items (AIGI) to address these challenges with personalized text-to-image generation for e-commercial product design. AIGI enables an innovative business mode called “sell it before you make it”, where merchants can design fashion items and generate photorealistic images with digital models based on textual descriptions. Only when the items have received a certain number of orders, do the merchants start to produce them, which largely reduces reliance on physical prototypes and thus accelerates time to market. For such a promising application, we identify the underlying key scientific challenge, i.e., capturing the users’ group-level personalized preferences towards multiple generated candidate images. To this end, we propose a Personalized Group-Level Preference Alignment Framework for Diffusion Models (i.e., PerFusion). We first design PerFusion Reward Model for user preference estimation with a feature-crossing-based personalized plug-in. Then we develop PerFusion with a personalized adaptive network to model diverse preferences across users, and meanwhile derive the group-level preference optimization objective to capture the comparative behaviors among multiple candidates. Both offline and online experiments demonstrate the effectiveness of our proposed algorithm. The AI-generated items have achieved over 13% relative improvements for both click-through rate and conversion rate compared to their human-designed counterparts, validating the revolutionary potential of AI-generated items for e-commercial platforms.
zh

[CV-65] Knowledge Rectification for Camouflaged Object Detection: Unlocking Insights from Low-Quality Data

【速读】:该论文试图解决低质量数据在伪装物体检测(Camouflaged Object Detection, COD)中因细节不足而引入额外隐式伪装效应,导致现有方法性能显著下降的问题。解决方案的关键在于提出KRNet框架,这是一种专为低质量数据设计的首个COD框架。KRNet采用领导者-跟随者(Leader-Follower)结构,其中领导者从高质量数据中提取条件分布和混合分布双重金标准,以指导跟随者修正来自低质量数据的知识;此外,该框架结合了跨一致性策略以进一步优化分布的修正,并利用时间依赖的条件编码器丰富分布多样性。实验结果表明,KRNet在基准数据集上的表现优于最先进的COD方法及超分辨率辅助的COD方法,验证了其在应对低质量数据挑战中的有效性。

链接: https://arxiv.org/abs/2503.22180
作者: Juwei Guan,Xiaolin Fang,Donghyun Kim,Haotian Gong,Tongxin Zhu,Zhen Ling,Ming Yang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Low-quality data often suffer from insufficient image details, introducing an extra implicit aspect of camouflage that complicates camouflaged object detection (COD). Existing COD methods focus primarily on high-quality data, overlooking the challenges posed by low-quality data, which leads to significant performance degradation. Therefore, we propose KRNet, the first framework explicitly designed for COD on low-quality data. KRNet presents a Leader-Follower framework where the Leader extracts dual gold-standard distributions: conditional and hybrid, from high-quality data to drive the Follower in rectifying knowledge learned from low-quality data. The framework further benefits from a cross-consistency strategy that improves the rectification of these distributions and a time-dependent conditional encoder that enriches the distribution diversity. Extensive experiments on benchmark datasets demonstrate that KRNet outperforms state-of-the-art COD methods and super-resolution-assisted COD approaches, proving its effectiveness in tackling the challenges of low-quality data in COD.
zh

[CV-66] High-Fidelity Diffusion Face Swapping with ID-Constrained Facial Conditioning

【速读】:该论文致力于解决基于扩散模型的面部交换中的两个关键挑战:优先确保源人脸身份的保留而非目标属性,以及身份与属性约束之间的固有冲突。为了解决这些问题,论文提出了一种基于身份约束的属性微调框架,该框架首先通过解耦条件注入确保身份的保留,然后进行属性对齐的精细调整。此外,在后训练精化阶段引入身份损失和对抗损失以进一步提高保真度。关键在于其创新的身份约束机制和解耦条件注入方法,使模型在保持身份相似性和属性一致性方面表现出色,达到了当前最先进的高保真面部交换性能。

链接: https://arxiv.org/abs/2503.22179
作者: Dailan He,Xiahong Wang,Shulun Wang,Guanglu Song,Bingqi Ma,Hao Shao,Yu Liu,Hongsheng Li
机构: CUHK MMLab (香港中文大学多媒体实验室); SenseTime Research (商汤科技研究部); CPII under InnoHK (香港创科署下属机构)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Face swapping aims to seamlessly transfer a source facial identity onto a target while preserving target attributes such as pose and expression. Diffusion models, known for their superior generative capabilities, have recently shown promise in advancing face-swapping quality. This paper addresses two key challenges in diffusion-based face swapping: the prioritized preservation of identity over target attributes and the inherent conflict between identity and attribute conditioning. To tackle these issues, we introduce an identity-constrained attribute-tuning framework for face swapping that first ensures identity preservation and then fine-tunes for attribute alignment, achieved through a decoupled condition injection. We further enhance fidelity by incorporating identity and adversarial losses in a post-training refinement stage. Our proposed identity-constrained diffusion-based face-swapping model outperforms existing methods in both qualitative and quantitative evaluations, demonstrating superior identity similarity and attribute consistency, achieving a new state-of-the-art performance in high-fidelity face swapping.
zh

[CV-67] AdaRank: Adaptive Rank Pruning for Enhanced Model Merging

【速读】:该论文旨在解决多任务学习中独立微调模型集成时因手工设计的秩选择导致的跨任务干扰和次优性能问题。论文的关键解决方案是提出AdaRank框架,通过自适应选择任务向量中最有益的奇异方向来合并多个模型。与传统的固定秩截断方法不同,AdaRank在测试阶段通过熵最小化动态剪枝引起干扰的奇异分量,并为每个任务向量提供最优的信息量,从而缓解任务间的有害重叠,将与各任务单独微调模型之间的性能差距缩小至约1%,并在多种骨干网络与任务数量设置下稳定取得最先进性能。
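
下面用一个草图示意“对任务向量做 SVD 并按掩码剪除奇异分量”的基本操作;掩码在 AdaRank 中由测试时熵最小化学习得到,此处仅作为输入,函数名与合并方式均为假设。

```python
import torch

def prune_task_vector(task_vector: torch.Tensor, keep_mask: torch.Tensor) -> torch.Tensor:
    """对单个线性层的任务向量做 SVD,按掩码保留/剪除奇异方向(示意)。

    task_vector: (out, in) 微调权重减去预训练权重
    keep_mask:   (r,)      每个奇异分量是否保留(0/1 或软掩码)
    """
    u, s, vh = torch.linalg.svd(task_vector, full_matrices=False)
    s = s * keep_mask.to(s.dtype)            # 剪除会引起任务间干扰的分量
    return u @ torch.diag(s) @ vh

# 合并示意:W_merged = W_pretrained + (1 / T) * sum_t prune_task_vector(tau_t, mask_t)
```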

链接: https://arxiv.org/abs/2503.22178
作者: Chanhyuk Lee,Jiho Choi,Chanryeol Lee,Donggyun Kim,Seunghoon Hong
机构: KAIST (韩国科学技术院)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: Code Available at: this https URL

点击查看摘要

Abstract:Model merging has emerged as a promising approach for unifying independently fine-tuned models into an integrated framework, significantly enhancing computational efficiency in multi-task learning. Recently, several SVD-based techniques have been introduced to exploit low-rank structures for enhanced merging, but their reliance on such manually designed rank selection often leads to cross-task interference and suboptimal performance. In this paper, we propose AdaRank, a novel model merging framework that adaptively selects the most beneficial singular directions of task vectors to merge multiple models. We empirically show that the dominant singular components of task vectors can cause critical interference with other tasks, and that naive truncation across tasks and layers degrades performance. In contrast, AdaRank dynamically prunes the singular components that cause interference and offers an optimal amount of information to each task vector by learning to prune ranks during test-time via entropy minimization. Our analysis demonstrates that such method mitigates detrimental overlaps among tasks, while empirical results show that AdaRank consistently achieves state-of-the-art performance with various backbones and number of tasks, reducing the performance gap between fine-tuned models to nearly 1%.
zh

[CV-68] 3D Acetabular Surface Reconstruction from 2D Pre-operative X-ray Images using SRVF Elastic Registration and Deformation Graph

【速读】:本文旨在解决全髋关节置换术(Total Hip Arthroplasty, THA)中准确选择合适髋臼杯尺寸的问题,这是恢复关节生物力学的关键。论文提出了一种新的框架,将基于平方根速度函数(Square-Root Velocity Function, SRVF)的弹性形状配准技术与嵌入变形(Embedded Deformation, ED)图方法相结合,通过融合多视角的术前骨盆X射线图像和半球面模型来重建三维髋臼关节表面。该方案的关键在于利用SRVF-based弹性配准建立参数化半球模型与X射线图像之间的二维-三维对应关系,并通过ED框架将这些由SRVF衍生出的对应关系作为约束条件,采用非线性最小二乘优化方法优化三维髋臼表面的重建过程。仿真数据和真实患者数据的验证表明,所提出的算法具有鲁棒性和潜在的临床价值,可帮助外科医生在初次THA手术中一次性选择正确的髋臼杯,从而减少翻修手术的需求。
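
下面给出平方根速度函数(SRVF)q(t) = β'(t) / sqrt(‖β'(t)‖) 的离散化 Python 草图,它是弹性形状配准所依赖的曲线表示;采样方式与数值微分均为假设性的简化处理。

```python
import numpy as np

def srvf(curve: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """计算离散曲线的平方根速度函数(示意)。

    curve: (N, d) 按参数均匀采样的曲线坐标,例如 X 射线图像上髋臼轮廓的 N 个点。
    """
    velocity = np.gradient(curve, axis=0)                    # 数值微分近似 β'(t)
    speed = np.linalg.norm(velocity, axis=1, keepdims=True)  # ‖β'(t)‖
    return velocity / np.sqrt(speed + eps)
```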

链接: https://arxiv.org/abs/2503.22177
作者: Shuai Zhang,Jinliang Wang,Sujith Konandetails,Xu Wang,Danail Stoyanov,Evangelos B.Mazomenos
机构: The UCL Hawkes Institute, University College London (伦敦大学学院乌尔赫斯研究所, 伦敦大学学院); The Department of Computer Science, University College London (伦敦大学学院计算机科学系, 伦敦大学学院); Joint Disease Department, Zhengzhou Orthopaedic Hospital (郑州骨科医院关节疾病科); University College Hospital (伦敦大学学院医院); The Department of Medical Physics and Biomedical Engineering, University College London (伦敦大学学院医学物理与生物医学工程系, 伦敦大学学院)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注: 10 pages, 3 figures, conference

点击查看摘要

Abstract:Accurate and reliable selection of the appropriate acetabular cup size is crucial for restoring joint biomechanics in total hip arthroplasty (THA). This paper proposes a novel framework that integrates square-root velocity function (SRVF)-based elastic shape registration technique with an embedded deformation (ED) graph approach to reconstruct the 3D articular surface of the acetabulum by fusing multiple views of 2D pre-operative pelvic X-ray images and a hemispherical surface model. The SRVF-based elastic registration establishes 2D-3D correspondences between the parametric hemispherical model and X-ray images, and the ED framework incorporates the SRVF-derived correspondences as constraints to optimize the 3D acetabular surface reconstruction using nonlinear least-squares optimization. Validations using both simulation and real patient datasets are performed to demonstrate the robustness and the potential clinical value of the proposed algorithm. The reconstruction result can assist surgeons in selecting the correct acetabular cup on the first attempt in primary THA, minimising the need for revision surgery.
zh

[CV-69] Efficient Continual Learning through Frequency Decomposition and Integration

【速读】:该论文致力于解决连续学习(Continual Learning, CL)中任务适应时面临的遗忘问题,特别是在资源受限环境下的高效解决方案。论文的关键在于提出了一种名为频率分解与集成网络(Frequency Decomposition and Integration Network, FDINet)的新框架,受人类视觉系统处理信息方式的启发,该框架通过分解和整合图像的低频和高频成分来增强跨任务的泛化能力。其关键创新点包括设计两个轻量级子网络分别处理低频和高频信息,并通过这种频率感知的设计,在保留类特定细节的同时提升模型的训练效率和存储利用率。实验结果表明,FDINet显著减少了主干网络参数(降低78%),提升了最高可达7.49%的准确性,并将峰值内存使用降低了80%,同时在边缘设备上的训练速度提高了5倍。
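
下面的 PyTorch 草图用高斯模糊近似低通滤波,把输入图像分解为低频与高频两个分量,对应摘要中“两个轻量子网络分别处理低/高频信息”之前的分解步骤;核大小与 sigma 为假设取值,并非论文的具体分解方式。

```python
import torch
import torch.nn.functional as F

def frequency_decompose(images: torch.Tensor, kernel_size: int = 9, sigma: float = 2.0):
    """把图像拆成低频与高频两个分量(示意),images: (B, C, H, W)。"""
    # 构造可分离的一维高斯核
    coords = torch.arange(kernel_size, dtype=torch.float32) - kernel_size // 2
    g = torch.exp(-coords ** 2 / (2 * sigma ** 2))
    g = g / g.sum()
    c = images.shape[1]
    kx = g.view(1, 1, 1, -1).repeat(c, 1, 1, 1)
    ky = g.view(1, 1, -1, 1).repeat(c, 1, 1, 1)
    pad = kernel_size // 2
    low = F.conv2d(images, kx, padding=(0, pad), groups=c)   # 水平模糊
    low = F.conv2d(low, ky, padding=(pad, 0), groups=c)      # 垂直模糊,得到低频分量
    high = images - low                                       # 高频 = 原图 - 低频
    return low, high                                          # 分别送入两个轻量子网络
```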

链接: https://arxiv.org/abs/2503.22175
作者: Ruiqi Liu,Boyu Diao,Libo Huang,Hangda Liu,Chuanguang Yang,Zhulin An,Yongjun Xu
机构: Institute of Computing Technology, Chinese Academy of Sciences (中科院计算技术研究所), Beijing, China; University of Chinese Academy of Sciences (中国科学院大学), Beijing, China
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Continual learning (CL) aims to learn new tasks while retaining past knowledge, addressing the challenge of forgetting during task adaptation. Rehearsal-based methods, which replay previous samples, effectively mitigate forgetting. However, research on enhancing the efficiency of these methods, especially in resource-constrained environments, remains limited, hindering their application in real-world systems with dynamic data streams. The human perceptual system processes visual scenes through complementary frequency channels: low-frequency signals capture holistic cues, while high-frequency components convey structural details vital for fine-grained discrimination. Inspired by this, we propose the Frequency Decomposition and Integration Network (FDINet), a novel framework that decomposes and integrates information across frequencies. FDINet designs two lightweight networks to independently process low- and high-frequency components of images. When integrated with rehearsal-based methods, this frequency-aware design effectively enhances cross-task generalization through low-frequency information, preserves class-specific details using high-frequency information, and facilitates efficient training due to its lightweight architecture. Experiments demonstrate that FDINet reduces backbone parameters by 78%, improves accuracy by up to 7.49% over state-of-the-art (SOTA) methods, and decreases peak memory usage by up to 80%. Additionally, on edge devices, FDINet accelerates training by up to 5×.
zh

[CV-70] Synergistic Bleeding Region and Point Detection in Surgical Videos

【速读】:该论文旨在解决腹腔镜手术中术中出血导致手术视野快速模糊的问题,通过智能检测出血区域量化失血量以辅助决策,并定位出血点帮助外科医生及时识别出血源实现止血。论文的关键在于构建了一个包含95个手术视频剪辑、共计5,330帧的真实世界手术出血检测数据集SurgBlood,并开发了一种名为BlooDet的双任务协同在线检测器,用于同时检测手术视频中的出血区域和出血点。该框架采用基于Segment Anything Model 2 (SAM 2) 的双分支双向引导设计,其中掩码分支通过自适应边缘和点提示嵌入来检测出血区域,而点分支利用掩码记忆进行出血点记忆建模并通过帧间光流捕捉出血点的移动方向。通过交互式引导和提示,两个分支探索潜在的空间时间关系,同时利用从前几帧的记忆建模推断当前的出血状况。实验结果表明,该方法在SurgBlood数据集上的出血区域检测(IoU达到64.88%)和出血点检测(PCK-10%达到83.69%)任务中优于其他对比方法。

链接: https://arxiv.org/abs/2503.22174
作者: Jialun Pei,Zhangjun Zhou,Diandian Guo,Zhixi Li,Jing Qin,Bo Du,Pheng-Ann Heng
机构: CUHK (香港中文大学); PolyU (香港理工大学); SMU (新加坡管理大学); WHU (武汉大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Intraoperative bleeding in laparoscopic surgery causes rapid obscuration of the operative field to hinder the surgical process. Intelligent detection of bleeding regions can quantify the blood loss to assist decision-making, while locating the bleeding point helps surgeons quickly identify the source of bleeding and achieve hemostasis in time. In this study, we first construct a real-world surgical bleeding detection dataset, named SurgBlood, comprising 5,330 frames from 95 surgical video clips with bleeding region and point annotations. Accordingly, we develop a dual-task synergistic online detector called BlooDet, designed to perform simultaneous detection of bleeding regions and points in surgical videos. Our framework embraces a dual-branch bidirectional guidance design based on Segment Anything Model 2 (SAM 2). The mask branch detects bleeding regions through adaptive edge and point prompt embeddings, while the point branch leverages mask memory to induce bleeding point memory modeling and captures the direction of bleed point movement through inter-frame optical flow. By interactive guidance and prompts, the two branches explore potential spatial-temporal relationships while leveraging memory modeling from previous frames to infer the current bleeding condition. Extensive experiments demonstrate that our approach outperforms other counterparts on SurgBlood in both bleeding region and point detection tasks, e.g., achieving 64.88% IoU for bleeding region detection and 83.69% PCK-10% for bleeding point detection.
zh

[CV-71] Concept-Aware LoRA for Domain-Aligned Segmentation Dataset Generation

【速读】:该论文旨在解决语义分割任务中的数据稀缺问题,通过文本到图像(Text-to-Image, T2I)生成模型构建数据集,以降低图像采集与标注成本。论文的核心挑战在于:一是生成样本需与目标域对齐;二是生成的样本应具备超越训练数据的信息量。尽管微调T2I模型可帮助生成与目标域对齐的样本,但其往往过拟合并记忆训练数据,限制了生成多样化且良好对齐样本的能力。为此,论文提出了一种名为概念感知LoRA(Concept-Aware LoRA, CA-LoRA)的新颖微调方法,其关键在于仅选择性地更新与必要概念(如风格或视角)相关的权重,以实现域对齐,同时保留预训练模型的知识,从而生成具有信息量的样本。实验表明,CA-LoRA在城市场景分割的数据集生成任务中表现出色,在域内(少量标注与全监督设置)及跨域泛化任务中均优于基线和现有技术方法,特别是在恶劣天气和光照变化等具有挑战性的条件下,进一步凸显其优越性。
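
下面给出一个带“概念门控”的 LoRA 线性层草图,示意“冻结预训练权重、只在与目标概念相关的低秩方向上做更新”这一思路;门控的学习方式在此省略,秩与缩放系数等均为假设,并非论文原实现。

```python
import torch
import torch.nn as nn

class ConceptLoRALinear(nn.Module):
    """在冻结线性层上挂接低秩增量,并用门控掩码筛选与目标概念相关的秩方向(示意)。"""
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)                  # 冻结预训练权重,保留原有知识
        self.lora_a = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, rank))
        # 概念门控:可依据重要性分数关闭与目标概念(如风格、视角)无关的秩方向
        self.register_buffer("concept_gate", torch.ones(rank))
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        delta = (x @ self.lora_a.T) * self.concept_gate        # (..., rank)
        return self.base(x) + self.scale * (delta @ self.lora_b.T)
```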

链接: https://arxiv.org/abs/2503.22172
作者: Minho Park,Sunghyun Park,Jungsoo Lee,Hyojin Park,Kyuwoong Hwang,Fatih Porikli,Jaegul Choo,Sungha Choi
机构: Qualcomm AI Research (高通人工智能研究); KAIST (韩国科学技术院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:This paper addresses the challenge of data scarcity in semantic segmentation by generating datasets through text-to-image (T2I) generation models, reducing image acquisition and labeling costs. Segmentation dataset generation faces two key challenges: 1) aligning generated samples with the target domain and 2) producing informative samples beyond the training data. Fine-tuning T2I models can help generate samples aligned with the target domain. However, it often overfits and memorizes training data, limiting their ability to generate diverse and well-aligned samples. To overcome these issues, we propose Concept-Aware LoRA (CA-LoRA), a novel fine-tuning approach that selectively identifies and updates only the weights associated with necessary concepts (e.g., style or viewpoint) for domain alignment while preserving the pretrained knowledge of the T2I model to produce informative samples. We demonstrate its effectiveness in generating datasets for urban-scene segmentation, outperforming baseline and state-of-the-art methods in in-domain (few-shot and fully-supervised) settings, as well as in domain generalization tasks, especially under challenging conditions such as adverse weather and varying illumination, further highlighting its superiority.
zh

[CV-72] An Empirical Study of Validating Synthetic Data for Text-Based Person Retrieval

【速读】:该论文旨在解决文本基人物检索(Text-Based Person Retrieval, TBPR)研究中数据面临的隐私敏感、标注劳动密集以及合成数据多样性不足等问题,并探索合成数据在TBPR中的潜力。论文的关键在于提出了一套综合的解决方案,包括:(1) 提出一种跨类图像生成管道,引入自动提示构造策略以引导生成式AI模型生成多样化的跨类图像,而无需依赖原始数据;(2) 开发一种类内图像增强管道,利用生成式AI模型进一步编辑图像以获取多样化的类内图像;(3) 基于上述管道与自动文本生成管道,在多种场景下通过大量实验验证合成数据的有效性,并研究多种抗噪学习策略以缓解合成数据固有的噪声问题。论文还将开源代码及由其管道生成的大规模合成数据集,以推动实际TBPR研究的发展。

链接: https://arxiv.org/abs/2503.22171
作者: Min Cao,ZiYin Zeng,YuXin Lu,Mang Ye,Dong Yi,Jinqiao Wang
机构: School of Computer Science and Technology, Soochow University (苏州大学计算机科学与技术学院); School of Computer Science, Wuhan University (武汉大学计算机科学学院); Institute of Automation, Chinese Academy of Sciences (中国科学院自动化研究所); Wuhan AI Research (武汉人工智能研究院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 20 pages,13 figures

点击查看摘要

Abstract:Data plays a pivotal role in Text-Based Person Retrieval (TBPR) research. Mainstream research paradigm necessitates real-world person images with manual textual annotations for training models, posing privacy-sensitive and labor-intensive issues. Several pioneering efforts explore synthetic data for TBPR but still rely on real data, keeping the aforementioned issues and also resulting in diversity-deficient issue in synthetic datasets, thus impacting TBPR performance. Moreover, these works tend to explore synthetic data for TBPR through limited perspectives, leading to exploration-restricted issue. In this paper, we conduct an empirical study to explore the potential of synthetic data for TBPR, highlighting three key aspects. (1) We propose an inter-class image generation pipeline, in which an automatic prompt construction strategy is introduced to guide generative Artificial Intelligence (AI) models in generating various inter-class images without reliance on original data. (2) We develop an intra-class image augmentation pipeline, in which the generative AI models are applied to further edit the images for obtaining various intra-class images. (3) Building upon the proposed pipelines and an automatic text generation pipeline, we explore the effectiveness of synthetic data in diverse scenarios through extensive experiments. Additionally, we experimentally investigate various noise-robust learning strategies to mitigate the inherent noise in synthetic data. We will release the code, along with the synthetic large-scale dataset generated by our pipelines, which are expected to advance practical TBPR research.
zh

[CV-73] Spatial Transport Optimization by Repositioning Attention Map for Training-Free Text-to-Image Synthesis CVPR2025

【速读】:该论文致力于解决扩散模型在文本到图像(Text-to-Image, T2I)生成过程中存在的“误置对象”问题,即生成的对象空间位置未能与文本提示对齐。尽管现有方法已针对“缺失对象”和“属性不匹配”等问题进行了优化,但“误置对象”这一挑战仍未得到有效解决。论文指出,由于通过文本形式施加显式空间引导存在固有难度,即使在流行的T2I模型中,这一基本功能的实现依然困难。为了解决此问题,论文提出了一种名为STORM(通过重新定位注意力图进行空间传输优化)的全新无训练方法。STORM的关键在于其基于最优传输理论的空间传输优化(Spatial Transport Optimization, STO),通过动态调整对象注意力图实现精确的空间一致性,并辅以空间传输(Spatial Transport, ST)成本函数增强空间理解能力。研究表明,空间感知的引入在去噪的早期阶段最为有效,而后期阶段则用于细节优化。大量实验表明,STORM超越了现有方法,在缓解“误置对象”问题的同时,还改善了“缺失对象”和“属性不匹配”的情况,为T2I合成中的空间对齐设立了新基准。

链接: https://arxiv.org/abs/2503.22168
作者: Woojung Han,Yeonkyung Lee,Chanyoung Kim,Kwanghyun Park,Seong Jae Hwang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: CVPR2025

点击查看摘要

Abstract:Diffusion-based text-to-image (T2I) models have recently excelled in high-quality image generation, particularly in a training-free manner, enabling cost-effective adaptability and generalization across diverse tasks. However, while the existing methods have been continuously focusing on several challenges, such as “missing objects” and “mismatched attributes,” another critical issue of “mislocated objects” remains where generated spatial positions fail to align with text prompts. Surprisingly, ensuring such seemingly basic functionality remains challenging in popular T2I models due to the inherent difficulty of imposing explicit spatial guidance via text forms. To address this, we propose STORM (Spatial Transport Optimization by Repositioning Attention Map), a novel training-free approach for spatially coherent T2I synthesis. STORM employs Spatial Transport Optimization (STO), rooted in optimal transport theory, to dynamically adjust object attention maps for precise spatial adherence, supported by a Spatial Transport (ST) Cost function that enhances spatial understanding. Our analysis shows that integrating spatial awareness is most effective in the early denoising stages, while later phases refine details. Extensive experiments demonstrate that STORM surpasses existing methods, effectively mitigating mislocated objects while improving missing and mismatched attributes, setting a new benchmark for spatial alignment in T2I synthesis.
zh

[CV-74] Disentangled 4D Gaussian Splatting: Towards Faster and More Efficient Dynamic Scene Rendering

【速读】:该论文旨在解决动态场景从二维图像进行新颖视角合成(Novel-View Synthesis, NVS)的挑战,特别是针对三维高斯点 splatting (3D Gaussian Splatting, 3DGS) 扩展到四维 (4D) 以实现动态新颖视角合成时所面临的计算复杂性和存储需求高的问题。传统基于 4D 旋转和缩放的方法引入时空变形,需要将 4D 高斯分布切分为多个 3D 高斯分布,随着时间戳的变化会不断产生冗余计算(这是动态场景渲染的固有特性),同时四维矩阵运算本身也极为耗时。

论文的关键解决方案是提出了一种名为解耦四维高斯点 splatting (Disentangled 4D Gaussian Splatting, Disentangled4DGS) 的新方法。该方法通过解耦时空变形,避免了对 4D 矩阵计算的依赖,将 3DGS 渲染过程扩展到 4D,并将时空变形投影到光线空间中的动态二维高斯分布上。这种方法不仅显著提升了动态场景合成的速度,达到了在 RTX 3090 GPU 上平均 343 FPS 的渲染速度,还减少了至少 4.5% 的存储需求。

链接: https://arxiv.org/abs/2503.22159
作者: Hao Feng,Hao Sun,Wei Xie
机构: Central China Normal University (华中师范大学)
类目: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Novel-view synthesis (NVS) for dynamic scenes from 2D images presents significant challenges due to the spatial complexity and temporal variability of such scenes. Recently, inspired by the remarkable success of NVS using 3D Gaussian Splatting (3DGS), researchers have sought to extend 3D Gaussian models to four dimensions (4D) for dynamic novel-view synthesis. However, methods based on 4D rotation and scaling introduce spatiotemporal deformation into the 4D covariance matrix, necessitating the slicing of 4D Gaussians into 3D Gaussians. This process increases redundant computations as timestamps change, an inherent characteristic of dynamic scene rendering. Additionally, performing calculations on a four-dimensional matrix is computationally intensive. In this paper, we introduce Disentangled 4D Gaussian Splatting (Disentangled4DGS), a novel representation and rendering approach that disentangles temporal and spatial deformations, thereby eliminating the reliance on 4D matrix computations. We extend the 3DGS rendering process to 4D, enabling the projection of temporal and spatial deformations into dynamic 2D Gaussians in ray space. Consequently, our method facilitates faster dynamic scene synthesis. Moreover, it reduces storage requirements by at least 4.5% due to our efficient presentation method. Our approach achieves an unprecedented average rendering speed of 343 FPS at a resolution of 1352×1014 on an RTX 3090 GPU, with experiments across multiple benchmarks demonstrating its competitive performance in both monocular and multi-view scenarios.
zh

[CV-75] Permutation-Invariant and Orientation-Aware Dataset Distillation for 3D Point Clouds

【速读】:该论文试图解决三维点云数据集蒸馏的问题,这是由于传统方法难以处理点云数据无序结构的挑战。解决方案的关键在于提出了一种基于分布匹配的新型蒸馏方法,该方法同时优化合成数据集的几何结构和模型的方向,并通过引入可学习的旋转角度确保不同点云模型之间特征的一致性对齐。此外,设计了一种基于排序特征向量的排列不变分布匹配损失函数,以进一步提升蒸馏效果。实验结果表明,所提方法在ModelNet10、ModelNet40、ShapeNet和ScanObjectNN四个基准数据集上均优于现有方法。
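
下面的草图示意“对逐点特征沿点维度排序后再做分布匹配”的排列不变损失的基本形式;特征形状与匹配统计量均为假设性简化,并非论文损失的完整定义。

```python
import torch

def sorted_feature_matching_loss(syn_feats: torch.Tensor, real_feats: torch.Tensor) -> torch.Tensor:
    """排序后的逐点特征分布匹配损失(示意),对点的输入顺序不敏感。

    syn_feats / real_feats: (B, N, D) 点云经特征提取器得到的逐点特征。
    """
    syn_sorted, _ = torch.sort(syn_feats, dim=1)     # 沿点维度逐通道排序,消除排列差异
    real_sorted, _ = torch.sort(real_feats, dim=1)
    # 匹配两个批次排序后特征的均值统计量
    return (syn_sorted.mean(dim=0) - real_sorted.mean(dim=0)).pow(2).mean()
```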

链接: https://arxiv.org/abs/2503.22154
作者: Jae-Young Yim,Dongwook Kim,Jae-Young Sim
机构: Ulsan National Institute of Science and Technology (UNIST)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:We need to collect a large amount of data to train deep neural networks for various applications. Recently, dataset distillation for images and texts has been attracting a lot of attention: it reduces the original dataset to a synthetic dataset while preserving essential task-relevant information. However, distillation of 3D point clouds is almost unexplored due to the challenges posed by the unordered structure of points. In this paper, we propose a novel distribution matching-based dataset distillation method for 3D point clouds that jointly optimizes the geometric structures of the synthetic dataset as well as the orientations of the synthetic models. To ensure consistent feature alignment between different 3D point cloud models, we devise a permutation-invariant distribution matching loss with sorted feature vectors. We also employ learnable rotation angles to transform each synthetic model according to the optimal orientation best representing the original feature distribution. Extensive experimental results on four widely used benchmark datasets, including ModelNet10, ModelNet40, ShapeNet, and ScanObjectNN, demonstrate that the proposed method consistently outperforms existing methods.
zh
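
下面的草图对应上文 [CV-75] 中的两个要点:“基于排序特征向量的排列不变分布匹配损失”(对每个特征维度先排序再比较,以消除点序影响)与“可学习旋转角”(与合成点云坐标联合优化)。其中占位的特征提取器、单轴旋转、点数相同等设定均为简化假设,并非论文原始实现。

```python
import torch
import torch.nn.functional as F

def permutation_invariant_dm_loss(real_feat, syn_feat):
    """排列不变的分布匹配损失草图:对每个特征维度独立排序后再比较。

    real_feat, syn_feat: (N_points, C) 点级特征;此处假设两组点数相同。
    """
    real_sorted, _ = torch.sort(real_feat, dim=0)
    syn_sorted, _ = torch.sort(syn_feat, dim=0)
    return F.mse_loss(syn_sorted, real_sorted)

def rotate_z(points, angle):
    """用可学习角度绕 z 轴旋转合成点云(论文为可学习旋转角,这里仅示意单轴情形)。"""
    c, s = torch.cos(angle), torch.sin(angle)
    zero, one = torch.zeros_like(c), torch.ones_like(c)
    R = torch.stack([torch.stack([c, -s, zero]),
                     torch.stack([s, c, zero]),
                     torch.stack([zero, zero, one])])
    return points @ R.T

# 用法示意:联合优化合成点云坐标与旋转角
syn_pts = torch.randn(1024, 3, requires_grad=True)
angle = torch.zeros((), requires_grad=True)
real_feat = torch.randn(1024, 64)            # 假设来自某个冻结的特征提取网络
feat_net = torch.nn.Linear(3, 64)            # 占位的特征提取器,仅作演示
opt = torch.optim.Adam([syn_pts, angle], lr=1e-2)

loss = permutation_invariant_dm_loss(real_feat, feat_net(rotate_z(syn_pts, angle)))
loss.backward()
opt.step()
```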

[CV-76] EgoToM: Benchmarking Theory of Mind Reasoning from Egocentric Videos

【速读】:该论文试图解决如何评估机器在第一人称视角(egocentric)视频中理解人类目标(goals)、信念状态(belief states)并预测未来行为(future actions)的能力的问题。论文引入EgoToM基准数据集,基于因果心智理论(causal Theory-of-Mind, ToM)模型,为Ego4D数据集生成多选题形式的视频问答实例,以测试最先进的多模态大型语言模型(Multimodal Large Language Models, MLLMs)在这三个相互关联的推理任务上的表现。解决方案的关键在于设计了一个能够有效衡量机器从第一人称视频中推断佩戴者内心状态(目标、信念及未来行为)的新颖基准,并据此揭示当前最先进的多模态大模型的性能瓶颈:它们在目标推断上接近人类水平,但对即时信念状态以及与未见视频未来最一致的行为的推断能力仍有不足。

链接: https://arxiv.org/abs/2503.22152
作者: Yuxuan Li,Vijay Veerabadran,Michael L. Iuzzolino,Brett D. Roads,Asli Celikyilmaz,Karl Ridgeway
机构: Meta (Meta)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:We introduce EgoToM, a new video question-answering benchmark that extends Theory-of-Mind (ToM) evaluation to egocentric domains. Using a causal ToM model, we generate multi-choice video QA instances for the Ego4D dataset to benchmark the ability to predict a camera wearer’s goals, beliefs, and next actions. We study the performance of both humans and state of the art multimodal large language models (MLLMs) on these three interconnected inference problems. Our evaluation shows that MLLMs achieve close to human-level accuracy on inferring goals from egocentric videos. However, MLLMs (including the largest ones we tested with over 100B parameters) fall short of human performance when inferring the camera wearers’ in-the-moment belief states and future actions that are most consistent with the unseen video future. We believe that our results will shape the future design of an important class of egocentric digital assistants which are equipped with a reasonable model of the user’s internal mental states.
zh

[CV-77] Enhancing Dance-to-Music Generation via Negative Conditioning Latent Diffusion Model

【速读】:该论文旨在解决根据给定舞蹈视频的节奏性视觉线索生成同步音乐的问题。论文的关键解决方案是提出 PN-Diffusion(Positive-Negative Diffusion),即同时以正向与负向节奏信息作为条件进行双向引导,以增强生成音乐的质量及其与舞蹈视频的同步性。具体而言,PN-Diffusion 包含针对正向条件的噪声预测目标以及额外的负向条件噪声预测目标,并巧妙利用舞蹈视频中的时间相关性,分别通过正放与倒放视频捕获正向和负向节奏线索。实验结果表明,该方法在 AIST++ 和 TikTok 舞蹈视频数据集上的输入输出对应关系(如舞蹈-音乐节拍对齐)及生成音乐质量方面优于现有最先进的舞蹈到音乐生成模型。

链接: https://arxiv.org/abs/2503.22138
作者: Changchang Sun,Gaowen Liu,Charles Fleming,Yan Yan
机构: University of Illinois Chicago (芝加哥伊利诺伊大学); Cisco Research (思科研究)
类目: ound (cs.SD); Computer Vision and Pattern Recognition (cs.CV); Audio and Speech Processing (eess.AS)
备注:

点击查看摘要

Abstract:Conditional diffusion models have gained increasing attention since their impressive results for cross-modal synthesis, where the strong alignment between conditioning input and generated output can be achieved by training a time-conditioned U-Net augmented with cross-attention mechanism. In this paper, we focus on the problem of generating music synchronized with rhythmic visual cues of the given dance video. Considering that bi-directional guidance is more beneficial for training a diffusion model, we propose to enhance the quality of generated music and its synchronization with dance videos by adopting both positive rhythmic information and negative ones (PN-Diffusion) as conditions, where a dual diffusion and reverse processes is devised. Specifically, to train a sequential multi-modal U-Net structure, PN-Diffusion consists of a noise prediction objective for positive conditioning and an additional noise prediction objective for negative conditioning. To accurately define and select both positive and negative conditioning, we ingeniously utilize temporal correlations in dance videos, capturing positive and negative rhythmic cues by playing them forward and backward, respectively. Through subjective and objective evaluations of input-output correspondence in terms of dance-music beat alignment and the quality of generated music, experimental results on the AIST++ and TikTok dance video datasets demonstrate that our model outperforms SOTA dance-to-music generation models.
zh

[CV-78] Beyond Background Shift: Rethinking Instance Replay in Continual Semantic Segmentation

【速读】:该论文致力于解决连续语义分割(Continual Semantic Segmentation, CSS)中的灾难性遗忘问题,即在不断学习新类别时,模型需要保留先前学到的知识而不被遗忘。传统方法通过存储旧类别的图像并将其直接纳入新模型训练中,在分类任务中有效缓解了灾难性遗忘,但在CSS任务中存在显著局限性,主要是由于存储的旧类别图像与新类别图像部分标注导致未标注类别与背景之间的混淆,增加了模型拟合难度。为了解决这一问题,论文提出了一种新颖的增强实例回放(Enhanced Instance Replay, EIR)方法,其关键是不仅通过存储旧类别的实例保留了旧类别的知识并消除了背景混淆,还通过将存储的实例与新图像结合来缓解新图像中的背景变化,从而有效地解决了存储图像和新图像中的背景偏移问题,进而减轻了CSS任务中的灾难性遗忘,提升了模型的连续语义分割能力。实验结果验证了该方法的有效性,显著优于现有的CSS方法。

链接: https://arxiv.org/abs/2503.22136
作者: Hongmei Yin,Tingliang Feng,Fan Lyu,Fanhua Shang,Hongying Liu,Wei Feng,Liang Wan
机构: College of Intelligence and Computing, Tianjin University (天津大学智能与计算学院); New Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences (中国科学院自动化研究所模式识别新实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:In this work, we focus on continual semantic segmentation (CSS), where segmentation networks are required to continuously learn new classes without erasing knowledge of previously learned ones. Although storing images of old classes and directly incorporating them into the training of new models has proven effective in mitigating catastrophic forgetting in classification tasks, this strategy presents notable limitations in CSS. Specifically, the stored and new images with partial category annotations leads to confusion between unannotated categories and the background, complicating model fitting. To tackle this issue, this paper proposes a novel Enhanced Instance Replay (EIR) method, which not only preserves knowledge of old classes while simultaneously eliminating background confusion by instance storage of old classes, but also mitigates background shifts in the new images by integrating stored instances with new images. By effectively resolving background shifts in both stored and new images, EIR alleviates catastrophic forgetting in the CSS task, thereby enhancing the model’s capacity for CSS. Experimental results validate the efficacy of our approach, which significantly outperforms state-of-the-art CSS methods.
zh
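
上文 [CV-78] 的核心操作之一是“存储旧类别的实例(而非整图)并将其贴入新图像,同时注入对应标注”。以下是该思路的最小化 NumPy 示意,假设实例已与新图尺寸对齐、掩码为布尔数组;函数名 `paste_old_instance` 为本文虚构,真实方法还包含对存储图像自身背景偏移的处理,此处未涉及。

```python
import numpy as np

def paste_old_instance(new_img, new_label, inst_img, inst_mask, inst_class_id):
    """EIR 思想的最小示意:将旧类别实例贴入新图像,并同步更新语义标签。

    new_img: (H, W, 3) 新图像;new_label: (H, W) 新图像的语义标签;
    inst_img / inst_mask: 与新图同尺寸的旧类实例图像与二值掩码(实际中需先缩放/平移对齐)。
    """
    out_img = new_img.copy()
    out_lbl = new_label.copy()
    out_img[inst_mask] = inst_img[inst_mask]          # 仅覆盖实例像素,保留新图背景
    out_lbl[inst_mask] = inst_class_id                # 旧类实例的标注随像素一并注入
    return out_img, out_lbl

# 用法示意
H, W = 128, 128
new_img = np.random.randint(0, 255, (H, W, 3), dtype=np.uint8)
new_lbl = np.zeros((H, W), dtype=np.int64)
inst_img = np.random.randint(0, 255, (H, W, 3), dtype=np.uint8)
inst_mask = np.zeros((H, W), dtype=bool)
inst_mask[40:80, 40:80] = True
img, lbl = paste_old_instance(new_img, new_lbl, inst_img, inst_mask, inst_class_id=7)
```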

[CV-79] Semantic segmentation for building houses from wooden cubes

【速读】:该论文旨在解决建筑施工过程中效率低下、成本高昂及错误频发的问题,通过引入自动化建造技术提升施工质量。论文的关键在于比较三种用于语义分割的神经网络模型(U-Net(light)、LinkNet 和 PSPNet)的性能,并基于两个专门构建的数据集评估其准确性。其中,第一个数据集包含四类(背景、基础、墙壁和屋顶),用于基本模型评估;第二个数据集则细分为44类,每个木立方被视为独立对象进行标注。研究采用相同的超参数训练模型,并使用MeanIoU和F1分数作为评价指标。结果显示,尽管U-Net(light)在第一数据集上表现出色(MeanIoU为78%,F1分数为87%),但在第二数据集上的表现较差(MeanIoU为17%,F1分数为25%),主要归因于数据量不足、分区复杂度高以及类别不平衡导致难以精确识别单个立方体。此外,在所有实验中均观察到过拟合现象,即训练集上的高精度与验证集上的显著下降。本研究为开发自动分阶段生成建筑计划算法奠定了基础,并计划扩展数据集规模并采用正则化方法(如L1/L2正则化、Early Stopping)来缓解过拟合问题。下一步将致力于开发利用机械臂自动分步骤建造房屋的算法。

链接: https://arxiv.org/abs/2503.22125
作者: Ivan Beleacov
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 10 pages, 6 figures, 2 tables

点击查看摘要

Abstract:Automated construction is one of the most promising areas that can improve efficiency, reduce costs and minimize errors in the process of building construction. In this paper, a comparative analysis of three neural network models for semantic segmentation, U-Net(light), LinkNet and PSPNet, is performed. Two specialized datasets with images of houses built from wooden cubes were created for the experiments. The first dataset contains 4 classes (background, foundation, walls, roof ) and is designed for basic model evaluation, while the second dataset includes 44 classes where each cube is labeled as a separate object. The models were trained with the same hyperparameters and their accuracy was evaluated using MeanIoU and F1 Score metrics. According to the results obtained, U-Net(light) showed the best performance with 78% MeanIoU and 87% F1 Score on the first dataset and 17% and 25% respectively on the second dataset. The poor results on the second dataset are due to the limited amount of data, the complexity of the partitioning and the imbalance of classes, making it difficult to accurately select individual cubes. In addition, overtraining was observed in all experiments, manifested by high accuracy on the training dataset and its significant decrease on the validation dataset. The present work is the basis for the development of algorithms for automatic generation of staged building plans, which can be further scaled to design complete buildings. Future research is planned to extend the datasets and apply methods to combat overfitting (L1/L2 regularization, Early Stopping). The next stage of work will be the development of algorithms for automatic generation of a step-by-step plan for building houses from cubes using manipulators. Index Terms-Deep Learning, Computer vision, CNN, Semantic segmentation, Construction materials.
zh

[CV-80] Detecting Localized Deepfake Manipulations Using Action Unit-Guided Video Representations

【速读】:该论文旨在解决深度伪造(Deepfake)视频中局部细微编辑(如特定面部特征的调整)的检测难题,这些编辑对现有检测模型构成了重大挑战。论文的关键创新在于提出了一种新的检测方法,通过由面部动作单元(Facial Action Units, FAUs)引导的时空表示(spatiotemporal representations)来有效捕捉深度伪造视频中的局部变化。该方法利用基于交叉注意力(cross-attention)的表征融合,将预训练任务(如随机掩码和动作单元检测)中学到的表征相结合,生成能够编码细微局部变化的嵌入向量。实验结果表明,尽管仅在传统的FF+数据集上进行训练,该方法在检测包含细粒度局部编辑的深度伪造视频方面相比当前最先进的检测方法取得了20%的准确率提升,并在标准数据集上表现出良好的鲁棒性和泛化能力。

链接: https://arxiv.org/abs/2503.22121
作者: Tharun Anand,Siva Sankar,Pravin Nair
机构: Indian Institute of Technology Madras (印度理工学院马德拉斯)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:With rapid advancements in generative modeling, deepfake techniques are increasingly narrowing the gap between real and synthetic videos, raising serious privacy and security concerns. Beyond traditional face swapping and reenactment, an emerging trend in recent state-of-the-art deepfake generation methods involves localized edits such as subtle manipulations of specific facial features like raising eyebrows, altering eye shapes, or modifying mouth expressions. These fine-grained manipulations pose a significant challenge for existing detection models, which struggle to capture such localized variations. To the best of our knowledge, this work presents the first detection approach explicitly designed to generalize to localized edits in deepfake videos by leveraging spatiotemporal representations guided by facial action units. Our method leverages a cross-attention-based fusion of representations learned from pretext tasks like random masking and action unit detection, to create an embedding that effectively encodes subtle, localized changes. Comprehensive evaluations across multiple deepfake generation methods demonstrate that our approach, despite being trained solely on the traditional FF+ dataset, sets a new benchmark in detecting recent deepfake-generated videos with fine-grained local edits, achieving a 20% improvement in accuracy over current state-of-the-art detection methods. Additionally, our method delivers competitive performance on standard datasets, highlighting its robustness and generalization across diverse types of local and global forgeries.
zh

[CV-81] Camera Model Identification with SPAIR-Swin and Entropy based Non-Homogeneous Patches

【速读】:该论文旨在解决源相机模型识别(SCMI)中的挑战,具体目标是在图像认证和版权保护等应用中更有效地识别图像的来源相机型号。论文的关键创新在于提出了一种名为SPAIR-Swin的新模型,该模型结合了改进的空间注意力机制与倒残差块(SPAIR)以及Swin Transformer。这种方法能够同时捕获全局和局部特征,特别擅长提取噪声模式等关键伪影,从而实现更稳健的相机模型识别。此外,不同于传统方法仅关注同质区域,该研究引入了一种强调高熵区域(富含图案和纹理)的patch选择策略,以进一步提升识别性能。实验结果表明,SPAIR-Swin在Dresden、Vision、Forchheim和Socrates四个基准数据集上均取得了显著的性能提升,验证了其有效性。

链接: https://arxiv.org/abs/2503.22120
作者: Protyay Dey,Rejoy Chakraborty,Abhilasha S. Jadhav,Kapil Rana,Gaurav Sharma,Puneet Goyal
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注: 10 pages, 5 figures

点击查看摘要

Abstract:Source camera model identification (SCMI) plays a pivotal role in image forensics with applications including authenticity verification and copyright protection. For identifying the camera model used to capture a given image, we propose SPAIR-Swin, a novel model combining a modified spatial attention mechanism and inverted residual block (SPAIR) with a Swin Transformer. SPAIR-Swin effectively captures both global and local features, enabling robust identification of artifacts such as noise patterns that are particularly effective for SCMI. Additionally, unlike conventional methods focusing on homogeneous patches, we propose a patch selection strategy for SCMI that emphasizes high-entropy regions rich in patterns and textures. Extensive evaluations on four benchmark SCMI datasets demonstrate that SPAIR-Swin outperforms existing methods, achieving patch-level accuracies of 99.45%, 98.39%, 99.45%, and 97.46% and image-level accuracies of 99.87%, 99.32%, 100%, and 98.61% on the Dresden, Vision, Forchheim, and Socrates datasets, respectively. Our findings highlight that high-entropy patches, which contain high-frequency information such as edge sharpness, noise, and compression artifacts, are more favorable in improving SCMI accuracy. Code will be made available upon request.
zh
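
上文 [CV-81] 强调优先选取高熵(纹理、噪声、压缩伪影丰富)的 patch 用于相机型号识别,而非传统方法偏好的同质区域。以下是按香农熵对固定大小灰度 patch 打分并取 top-k 的示意实现;patch 大小、bin 数等超参数均为本文假设,仅演示选取策略本身,并非论文原始代码。

```python
import numpy as np

def patch_entropy(patch, bins=256):
    """计算灰度 patch 的香农熵(单位:比特)。"""
    hist, _ = np.histogram(patch, bins=bins, range=(0, 256), density=False)
    p = hist / max(hist.sum(), 1)
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def select_high_entropy_patches(gray, patch=64, top_k=8):
    """按熵从高到低选取不重叠 patch,偏向边缘锐度、噪声等高频信息丰富的区域。"""
    H, W = gray.shape
    scored = []
    for y in range(0, H - patch + 1, patch):
        for x in range(0, W - patch + 1, patch):
            scored.append((patch_entropy(gray[y:y + patch, x:x + patch]), y, x))
    scored.sort(reverse=True)
    return [(y, x) for _, y, x in scored[:top_k]]

gray = np.random.randint(0, 256, (512, 512)).astype(np.uint8)
print(select_high_entropy_patches(gray, patch=64, top_k=4))
```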

[CV-82] How Well Can Vison-Language Models Understand Humans Intention? An Open-ended Theory of Mind Question Evaluation Benchmark AAAI25

【速读】:该论文试图解决视觉语言模型(Vision Language Models, VLMs)在执行心智理论(Theory of Mind, ToM)任务中的能力评估问题。现有研究显示,尽管VLMs在视觉问答(Visual Question Answering, VQA)任务中表现出强大的推理能力,但它们在准确推断人类意图、信念及其他心理状态方面的能力尚未得到充分探索。为了解决这一问题,论文提出了一种开放性问题框架,用于全面评估不同类别ToM任务中VLMs的表现,并构建了一个包含30张图像的数据集进行基准测试。关键在于通过设计多样化的ToM任务场景以及引入具有挑战性的复杂情境(如欺凌或欺骗),系统性地分析和比较了多个规模各异的VLMs性能,揭示了大模型如GPT-4的优势及其局限性,同时发现小模型在特定条件下也能正确推断意图,即使依赖于错误的视觉线索。

链接: https://arxiv.org/abs/2503.22093
作者: Ximing Wen,Mallika Mainali,Anik Sen
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 2 pages, accepted by ToM@AAAI25

点击查看摘要

Abstract:Vision Language Models (VLMs) have demonstrated strong reasoning capabilities in Visual Question Answering (VQA) tasks; However, their ability to perform Theory of Mind (ToM) tasks such as accurately inferring human intentions, beliefs, and other mental states remains underexplored. In this work, we propose an open-ended question framework to comprehensively evaluate VLMs’ performance across diverse categories of ToM tasks. We curated and annotated a benchmark dataset composed of 30 images. We then assessed the performance of four VLMs of varying sizes on this dataset. Our experimental results show that the GPT-4 model outperformed all others, with only one smaller model, GPT-4o-mini, achieving comparable performance. Additionally, we observed that VLMs often struggle to accurately infer intentions in complex scenarios such as bullying or cheating. Moreover, our findings also reveal that smaller models can sometimes infer correct intentions despite relying on incorrect visual cues.
zh

[CV-83] Mitigating Trade-off: Stream and Query-guided Aggregation for Efficient and Effective 3D Occupancy Prediction

【速读】:该论文旨在解决三维占用预测在自动驾驶中的实时性和准确性之间的权衡问题。现有方法通过多帧融合处理多个历史帧以整合时空信息,但面临效率与精度难以兼顾的挑战。为缓解这一权衡,论文提出了一种名为StreamOcc的新框架,其关键在于采用基于流的方式聚合时空信息:一是基于流的体素聚合(Stream-based Voxel Aggregation),有效累积历史观测同时降低计算成本;二是查询引导聚合(Query-guided Aggregation),递归地将动态对象实例级别的特征融入对应的体素特征中,细化动态对象的细粒度细节。实验表明,StreamOcc在Occ3D-nuScenes数据集的实时设置中达到了最先进的性能,同时相比以往方法将内存使用减少了50%以上。

链接: https://arxiv.org/abs/2503.22087
作者: Seokha Moon,Janghyun Baek,Giseop Kim,Jinkyu Kim,Sunwook Choi
机构: Korea University (韩国大学); DGIST (大邱庆北科学技术院); NAVER LABS (NAVER 实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:3D occupancy prediction has emerged as a key perception task for autonomous driving, as it reconstructs 3D environments to provide a comprehensive scene understanding. Recent studies focus on integrating spatiotemporal information obtained from past observations to improve prediction accuracy, using a multi-frame fusion approach that processes multiple past frames together. However, these methods struggle with a trade-off between efficiency and accuracy, which significantly limits their practicality. To mitigate this trade-off, we propose StreamOcc, a novel framework that aggregates spatio-temporal information in a stream-based manner. StreamOcc consists of two key components: (i) Stream-based Voxel Aggregation, which effectively accumulates past observations while minimizing computational costs, and (ii) Query-guided Aggregation, which recurrently aggregates instance-level features of dynamic objects into corresponding voxel features, refining fine-grained details of dynamic objects. Experiments on the Occ3D-nuScenes dataset show that StreamOcc achieves state-of-the-art performance in real-time settings, while reducing memory usage by more than 50% compared to previous methods.
zh

[CV-84] A Survey on Remote Sensing Foundation Models: From Vision to Multimodality

【速读】:该论文旨在解决遥感(remote sensing)领域中视觉与多模态基础模型在实际应用中的关键挑战,包括数据多样性、大规模标注数据需求、多模态融合复杂性以及计算资源需求等问题。论文的核心目标是全面回顾当前最先进的视觉与多模态基础模型的技术架构、训练方法、数据集及应用场景,并深入探讨其面临的挑战,如数据对齐、跨模态迁移学习和可扩展性等。解决方案的关键在于通过研究新兴方向,提出有效策略以克服现有局限,从而推动这些模型在真实世界应用中的能力边界拓展。

链接: https://arxiv.org/abs/2503.22081
作者: Ziyue Huang,Hongxi Yan,Qiqi Zhan,Shuai Yang,Mingming Zhang,Chenkai Zhang,YiMing Lei,Zeming Liu,Qingjie Liu,Yunhong Wang
机构: State Key Laboratory of Virtual Reality Technology and Systems, Beihang University (北航), China; Hangzhou Innovation Institute, Beihang University (北航), China
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:The rapid advancement of remote sensing foundation models, particularly vision and multimodal models, has significantly enhanced the capabilities of intelligent geospatial data interpretation. These models combine various data modalities, such as optical, radar, and LiDAR imagery, with textual and geographic information, enabling more comprehensive analysis and understanding of remote sensing data. The integration of multiple modalities allows for improved performance in tasks like object detection, land cover classification, and change detection, which are often challenged by the complex and heterogeneous nature of remote sensing data. However, despite these advancements, several challenges remain. The diversity in data types, the need for large-scale annotated datasets, and the complexity of multimodal fusion techniques pose significant obstacles to the effective deployment of these models. Moreover, the computational demands of training and fine-tuning multimodal models require significant resources, further complicating their practical application in remote sensing image interpretation tasks. This paper provides a comprehensive review of the state-of-the-art in vision and multimodal foundation models for remote sensing, focusing on their architecture, training methods, datasets and application scenarios. We discuss the key challenges these models face, such as data alignment, cross-modal transfer learning, and scalability, while also identifying emerging research directions aimed at overcoming these limitations. Our goal is to provide a clear understanding of the current landscape of remote sensing foundation models and inspire future research that can push the boundaries of what these models can achieve in real-world applications. The list of resources collected by the paper can be found in the this https URL.
zh

[CV-85] A Semantic-Enhanced Heterogeneous Graph Learning Method for Flexible Objects Recognition ICME2025

【速读】:该论文旨在解决柔性物体识别中的挑战,包括其多样化的形状和大小、半透明属性以及类间细微差异等问题。现有基于图的方法(如图卷积网络和图视觉模型)虽能捕捉柔性物体中的可变关系,但通常关注全局视觉关系或未能有效对齐语义与视觉信息。为缓解这些局限性,论文提出了一种语义增强的异构图学习方法。其关键在于首先采用自适应扫描模块提取判别性语义上下文,促进形状和大小各异的柔性物体匹配,并对齐语义节点与视觉节点以增强跨模态特征相关性;其次,引入异构图生成模块聚合全局视觉与局部语义节点特征,从而提升柔性物体的识别性能。此外,论文构建了一个大规模柔性数据集FSCW,并通过在多个柔性数据集(FDA和FSCW)及挑战基准(CIFAR-100和ImageNet-Hard)上的实验验证了所提方法的有效性。

链接: https://arxiv.org/abs/2503.22079
作者: Kunshan Yang,Wenwei Luo,Yuguo Hu,Jiafu Yan,Mengmeng Jing,Lin Zuo
机构: University of Electronic Science and Technology of China (电子科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by ICME 2025

点击查看摘要

Abstract:Flexible objects recognition remains a significant challenge due to its inherently diverse shapes and sizes, translucent attributes, and subtle inter-class differences. Graph-based models, such as graph convolution networks and graph vision models, are promising in flexible objects recognition due to their ability of capturing variable relations within the flexible objects. These methods, however, often focus on global visual relationships or fail to align semantic and visual information. To alleviate these limitations, we propose a semantic-enhanced heterogeneous graph learning method. First, an adaptive scanning module is employed to extract discriminative semantic context, facilitating the matching of flexible objects with varying shapes and sizes while aligning semantic and visual nodes to enhance cross-modal feature correlation. Second, a heterogeneous graph generation module aggregates global visual and local semantic node features, improving the recognition of flexible objects. Additionally, We introduce the FSCW, a large-scale flexible dataset curated from existing sources. We validate our method through extensive experiments on flexible datasets (FDA and FSCW), and challenge benchmarks (CIFAR-100 and ImageNet-Hard), demonstrating competitive performance.
zh

[CV-86] Contrasting Low and High-Resolution Features for HER2 Scoring using Deep Learning

【速读】:该论文旨在解决乳腺癌免疫组化(IHC)受体状态分类中传统方法依赖病理学家经验导致的劳动密集型及显著观察者间变异性的问题。论文的关键解决方案是开发了一个基于低分辨率IHC图像的端到端ConvNeXt深度学习网络,用于实现HER2三分类(0、低表达、高表达)的自动化预测,其F1分数比基于Patch的方法高出5.35%以上,从而显著提升了分类的准确性与可重复性。

链接: https://arxiv.org/abs/2503.22069
作者: Ekansh Chauhan,Anila Sharma,Amit Sharma,Vikas Nishadham,Asha Ghughtyal,Ankur Kumar,Gurudutt Gupta,Anurag Mehta,C.V. Jawahar,P.K. Vinod
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Breast cancer, the most common malignancy among women, requires precise detection and classification for effective treatment. Immunohistochemistry (IHC) biomarkers like HER2, ER, and PR are critical for identifying breast cancer subtypes. However, traditional IHC classification relies on pathologists’ expertise, making it labor-intensive and subject to significant inter-observer variability. To address these challenges, this study introduces the India Pathology Breast Cancer Dataset (IPD-Breast), comprising of 1,272 IHC slides (HER2, ER, and PR) aimed at automating receptor status classification. The primary focus is on developing predictive models for HER2 3-way classification (0, Low, High) to enhance prognosis. Evaluation of multiple deep learning models revealed that an end-to-end ConvNeXt network utilizing low-resolution IHC images achieved an AUC, F1, and accuracy of 91.79%, 83.52%, and 83.56%, respectively, for 3-way classification, outperforming patch-based methods by over 5.35% in F1 score. This study highlights the potential of simple yet effective deep learning techniques to significantly improve accuracy and reproducibility in breast cancer classification, supporting their integration into clinical workflows for better patient outcomes.
zh
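
上文 [CV-86] 的分类器是一个端到端的 ConvNeXt 网络,直接在低分辨率 IHC 整图上完成 HER2 三分类(0 / Low / High)。以下用 torchvision 的 ConvNeXt-Tiny 搭一个三分类骨架作示意;具体变体、学习率等设置均为本文假设,论文未必采用完全相同的配置。

```python
import torch
import torch.nn as nn
from torchvision import models

# weights=None 仅为离线演示,实际可加载 ImageNet 预训练权重再微调
model = models.convnext_tiny(weights=None)
in_feats = model.classifier[2].in_features
model.classifier[2] = nn.Linear(in_feats, 3)   # 替换分类头为 3 类:0 / Low / High

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

# 单步训练示意:x 为低分辨率 IHC 整图的一个 batch,y 为三分类标签
x = torch.randn(4, 3, 224, 224)
y = torch.randint(0, 3, (4,))
loss = criterion(model(x), y)
loss.backward()
optimizer.step()
```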

[CV-87] Deep Depth Estimation from Thermal Image: Dataset Benchmark and Challenges

【速读】:该论文旨在解决在恶劣天气和光照条件下实现自动驾驶车辆和机器人高阶自主性的鲁棒且精确的空间感知问题。现有依赖可见光谱的感知算法受天气和光照条件影响显著,而长波红外相机(thermal imaging camera)被提出作为一种潜在解决方案以提高鲁棒性。然而,缺乏大规模数据集和标准化基准仍然是热成像视觉感知研究进展的主要瓶颈。为了解决这一问题,论文提供了包含多模态同步数据的大规模Multi-Spectral Stereo (MS²) 数据集,并通过在该数据集上的评估建立了标准化基准结果,同时深入分析了不同模态下性能的变化性、传感器模态间的领域偏移以及热感知领域的潜在研究方向。关键在于构建了一个包含多种传感器模态(立体RGB、NIR、热成像、LiDAR及GNSS/IMU信息)的大规模数据集及其标准化基准测试,从而推动相关领域的研究发展。

链接: https://arxiv.org/abs/2503.22060
作者: Ukcheol Shin,Jinsun Park
机构: Robotics Institute, School of Computer Science, Carnegie Mellon University (卡内基梅隆大学); School of Computer Science and Engineering, Pusan National University (釜山国立大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注: MS^2 dataset: this https URL , Source code: this https URL

点击查看摘要

Abstract:Achieving robust and accurate spatial perception under adverse weather and lighting conditions is crucial for the high-level autonomy of self-driving vehicles and robots. However, existing perception algorithms relying on the visible spectrum are highly affected by weather and lighting conditions. A long-wave infrared camera (i.e., thermal imaging camera) can be a potential solution to achieve high-level robustness. However, the absence of large-scale datasets and standardized benchmarks remains a significant bottleneck to progress in active research for robust visual perception from thermal images. To this end, this manuscript provides a large-scale Multi-Spectral Stereo (MS^2) dataset that consists of stereo RGB, stereo NIR, stereo thermal, stereo LiDAR data, and GNSS/IMU information along with semi-dense depth ground truth. The MS^2 dataset includes 162K synchronized multi-modal data pairs captured across diverse locations (e.g., urban city, residential area, campus, and high-way road) at different times (e.g., morning, daytime, and nighttime) and under various weather conditions (e.g., clear-sky, cloudy, and rainy). Secondly, we conduct a thorough evaluation of monocular and stereo depth estimation networks across RGB, NIR, and thermal modalities to establish standardized benchmark results on MS^2 depth test sets (e.g., day, night, and rainy). Lastly, we provide in-depth analyses and discuss the challenges revealed by the benchmark results, such as the performance variability for each modality under adverse conditions, domain shift between different sensor modalities, and potential research direction for thermal perception. Our dataset and source code are publicly available at this https URL and this https URL.
zh

[CV-88] A Deep Learning Framework for Boundary-Aware Semantic Segmentation

【速读】:该论文致力于解决语义分割任务中目标边界模糊以及小目标识别不足的问题。为应对这些挑战,论文提出了一种基于Mask2Former的语义分割算法,并引入了边界增强特征桥接模块(Boundary Enhancement Feature Bridging Module, BEFBM)。该模块的关键在于构建了一个具有边界感知的特征图,并引入特征桥接机制,实现跨尺度特征的有效融合,从而显著提升模型聚焦目标边界的能力。实验结果表明,所提方法在Cityscapes数据集上相较于主流方法在mIOU、mDICE和mRecall等指标上取得了明显改进,并在复杂场景中展现出更优的目标边界保留能力。

链接: https://arxiv.org/abs/2503.22050
作者: Tai An,Weiqiang Huang,Da Xu,Qingyuan He,Jiacheng Hu,Yujia Lou
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:As a fundamental task in computer vision, semantic segmentation is widely applied in fields such as autonomous driving, remote sensing image analysis, and medical image processing. In recent years, Transformer-based segmentation methods have demonstrated strong performance in global feature modeling. However, they still struggle with blurred target boundaries and insufficient recognition of small targets. To address these issues, this study proposes a Mask2Former-based semantic segmentation algorithm incorporating a boundary enhancement feature bridging module (BEFBM). The goal is to improve target boundary accuracy and segmentation consistency. Built upon the Mask2Former framework, this method constructs a boundary-aware feature map and introduces a feature bridging mechanism. This enables effective cross-scale feature fusion, enhancing the model’s ability to focus on target boundaries. Experiments on the Cityscapes dataset demonstrate that, compared to mainstream segmentation methods, the proposed approach achieves significant improvements in metrics such as mIOU, mDICE, and mRecall. It also exhibits superior boundary retention in complex scenes. Visual analysis further confirms the model’s advantages in fine-grained regions. Future research will focus on optimizing computational efficiency and exploring its potential in other high-precision segmentation tasks.
zh

[CV-89] Multispectral Demosaicing via Dual Cameras

【速读】:该论文旨在解决多光谱(Multispectral, MS)图像去马赛克(demosaicing)的问题,特别是在双摄像头设备中同时使用RGB相机和MS相机捕获相同场景的应用场景。解决方案的关键在于利用RGB图像较高的空间保真度来指导低空间保真度的MS图像去马赛克处理。为此,研究者引入了一个包含配对RGB和MS拜耳模式图像及其真实去马赛克结果的Dual-camera RGB-MS数据集,用于方法的训练与评估。实验结果表明,该方法在准确性方面达到了当前最先进的水平。

链接: https://arxiv.org/abs/2503.22026
作者: SaiKiran Tedla,Junyong Lee,Beixuan Yang,Mahmoud Afifi,Michael Brown
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注:

点击查看摘要

Abstract:Multispectral (MS) images capture detailed scene information across a wide range of spectral bands, making them invaluable for applications requiring rich spectral data. Integrating MS imaging into multi camera devices, such as smartphones, has the potential to enhance both spectral applications and RGB image quality. A critical step in processing MS data is demosaicing, which reconstructs color information from the mosaic MS images captured by the camera. This paper proposes a method for MS image demosaicing specifically designed for dual-camera setups where both RGB and MS cameras capture the same scene. Our approach leverages co-captured RGB images, which typically have higher spatial fidelity, to guide the demosaicing of lower-fidelity MS images. We introduce the Dual-camera RGB-MS Dataset - a large collection of paired RGB and MS mosaiced images with ground-truth demosaiced outputs - that enables training and evaluation of our method. Experimental results demonstrate that our method achieves state-of-the-art accuracy compared to existing techniques.
zh

[CV-90] CoT-VLA: Visual Chain-of-Thought Reasoning for Vision-Language-Action Models

【速读】:该论文旨在解决现有视觉-语言-动作模型(Vision-Language-Action Models, VLAs)在复杂操作任务中的不足,具体表现为缺乏中间推理步骤以及时间规划或推理能力。为了解决这一问题,论文提出了一种新方法,通过在VLAs中引入显式的视觉链式思维(Chain-of-Thought, CoT)推理机制,在生成短动作序列实现目标之前,先自回归预测未来图像帧作为视觉目标。关键在于将显式的视觉CoT推理与VLAs结合,使模型不仅能够理解视觉和动作标记,还能在执行任务时进行有效的中间推理,从而提升其在真实世界操作任务和模拟基准测试中的表现。实验结果显示,提出的CoT-VLA模型在实际操作任务中比最先进的VLA模型高出17%,在仿真基准测试中高出6%。

链接: https://arxiv.org/abs/2503.22020
作者: Qingqing Zhao,Yao Lu,Moo Jin Kim,Zipeng Fu,Zhuoyang Zhang,Yecheng Wu,Zhaoshuo Li,Qianli Ma,Song Han,Chelsea Finn,Ankur Handa,Ming-Yu Liu,Donglai Xiang,Gordon Wetzstein,Tsung-Yi Lin
机构: NVIDIA(英伟达); Stanford University (斯坦福大学); MIT (麻省理工学院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
备注: Project website: this https URL

点击查看摘要

Abstract:Vision-language-action models (VLAs) have shown potential in leveraging pretrained vision-language models and diverse robot demonstrations for learning generalizable sensorimotor control. While this paradigm effectively utilizes large-scale data from both robotic and non-robotic sources, current VLAs primarily focus on direct input–output mappings, lacking the intermediate reasoning steps crucial for complex manipulation tasks. As a result, existing VLAs lack temporal planning or reasoning capabilities. In this paper, we introduce a method that incorporates explicit visual chain-of-thought (CoT) reasoning into vision-language-action models (VLAs) by predicting future image frames autoregressively as visual goals before generating a short action sequence to achieve these goals. We introduce CoT-VLA, a state-of-the-art 7B VLA that can understand and generate visual and action tokens. Our experimental results demonstrate that CoT-VLA achieves strong performance, outperforming the state-of-the-art VLA model by 17% in real-world manipulation tasks and 6% in simulation benchmarks. Project website: this https URL
zh

[CV-91] AGILE: A Diffusion-Based Attention-Guided Image and Label Translation for Efficient Cross-Domain Plant Trait Identification

【速读】:该论文旨在解决跨域图像翻译中语义一致性难以保持的问题,特别是在领域间隙显著时,现有生成模型在对象级精度方面表现不足。为了解决这一挑战,论文提出了一种基于扩散模型的方法AGILE(Attention-Guided Image and Label Translation for Efficient Cross-Domain Plant Trait Identification)。其关键是通过优化文本嵌入(Optimized Text Embeddings)加强源域与目标域图像之间的对应关系,并利用注意力引导(Attention Guidance)在去噪过程中控制对象的位置,从而以语义一致的方式约束图像翻译,同时提高生成图像的保真度并保留关键的对象语义。

链接: https://arxiv.org/abs/2503.22019
作者: Earl Ranario,Lars Lundqvist,Heesup Yun,Brian N. Bailey,J. Mason Earles
机构: University of California, Davis (加州大学戴维斯分校)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Semantically consistent cross-domain image translation facilitates the generation of training data by transferring labels across different domains, making it particularly useful for plant trait identification in agriculture. However, existing generative models struggle to maintain object-level accuracy when translating images between domains, especially when domain gaps are significant. In this work, we introduce AGILE (Attention-Guided Image and Label Translation for Efficient Cross-Domain Plant Trait Identification), a diffusion-based framework that leverages optimized text embeddings and attention guidance to semantically constrain image translation. AGILE utilizes pretrained diffusion models and publicly available agricultural datasets to improve the fidelity of translated images while preserving critical object semantics. Our approach optimizes text embeddings to strengthen the correspondence between source and target images and guides attention maps during the denoising process to control object placement. We evaluate AGILE on cross-domain plant datasets and demonstrate its effectiveness in generating semantically accurate translated images. Quantitative experiments show that AGILE enhances object detection performance in the target domain while maintaining realism and consistency. Compared to prior image translation methods, AGILE achieves superior semantic alignment, particularly in challenging cases where objects vary significantly or domain gaps are substantial.
zh

[CV-92] FACETS: Efficient Once-for-all Object Detection via Constrained Iterative Search

【速读】:该论文旨在解决神经架构搜索(NAS)在深度学习目标检测框架中的两大挑战:其一,多模块联合优化因庞大的搜索空间而导致的高计算成本和复杂性;其二,在满足目标设备约束的同时优化各模块架构的额外难度。论文的关键解决方案是提出了一种名为FACETS(Efficient Once-for-All Object Detection via Constrained Iterative Search)的新颖统一迭代NAS方法。该方法通过循环方式逐步优化所有模块的架构,利用前序迭代的反馈,在固定一个模块架构的同时优化其他模块,从而有效减小搜索空间,同时保持模块间的相互依赖性,并结合目标设备的计算预算施加约束。这种策略不仅显著提升了搜索效率,还实现了性能更优的架构,并通过迭代细化搜索空间进一步提高最终模型的准确性。

链接: https://arxiv.org/abs/2503.21999
作者: Tony Tran,Bin Hu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 10 pages, 6 figures

点击查看摘要

Abstract:Neural Architecture Search (NAS) for deep learning object detection frameworks typically involves multiple modules, each performing distinct tasks. These modules contribute to a vast search space, resulting in searches that can take several GPU hours or even days, depending on the complexity of the search space. This makes joint optimization both challenging and computationally expensive. Furthermore, satisfying target device constraints across modules adds additional complexity to the optimization process. To address these challenges, we propose FACETS (Efficient Once-for-All Object Detection via Constrained Iterative Search), a novel unified iterative NAS method that refines the architecture of all modules in a cyclical manner. FACETS leverages feedback from previous iterations, alternating between fixing one module’s architecture and optimizing the others. This approach reduces the overall search space while preserving interdependencies among modules and incorporates constraints based on the target device’s computational budget. In a controlled comparison against progressive and single-module search strategies, FACETS achieves architectures with up to 4.75% higher accuracy twice as fast as progressive search strategies in earlier stages, while still being able to achieve a global optimum. Moreover, FACETS demonstrates the ability to iteratively refine the search space, producing better performing architectures over time. The refined search space yields candidates with a mean accuracy up to 27% higher than global search and 5% higher than progressive search methods via random sampling.
zh

[CV-93] BOOTPLACE: Bootstrapped Object Placement with Detection Transformers CVPR2025

【速读】:本文旨在解决图像到图像合成中的复制粘贴物体放置问题,重点关注物体放置学习。传统方法依赖生成模型以减少密集监督的需求,但通常限制了其建模复杂数据分布的能力。而基于稀疏对比损失的变换网络虽有所探索,但过度松弛的正则化往往导致物体放置不够精确。为此,论文提出BOOTPLACE,一种将物体放置转化为检测问题的新范式。其关键是通过训练专门的检测变换器,在移除目标物体的背景下识别感兴趣区域,并结合多目标监督增强模型能力;随后依据互补特征将合成目标物体与检测到的区域进行语义关联。此外,利用随机移除物体的图像进行引导式训练,通过丰富的配对数据增强确保有意义的物体放置。实验结果表明,BOOTPLACE在Cityscapes和OPA数据集上的物体重新定位性能优于现有技术,显著提升了IOU分数,消融研究进一步验证了方法的组成性和泛化性。

链接: https://arxiv.org/abs/2503.21991
作者: Hang Zhou,Xinxin Zuo,Rui Ma,Li Cheng
机构: University of Alberta (阿尔伯塔大学); Concordia University (康考迪亚大学); Jilin University (吉林大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR)
备注: CVPR 2025. Project page: this https URL , code: this https URL

点击查看摘要

Abstract:In this paper, we tackle the copy-paste image-to-image composition problem with a focus on object placement learning. Prior methods have leveraged generative models to reduce the reliance for dense supervision. However, this often limits their capacity to model complex data distributions. Alternatively, transformer networks with a sparse contrastive loss have been explored, but their over-relaxed regularization often leads to imprecise object placement. We introduce BOOTPLACE, a novel paradigm that formulates object placement as a placement-by-detection problem. Our approach begins by identifying suitable regions of interest for object placement. This is achieved by training a specialized detection transformer on object-subtracted backgrounds, enhanced with multi-object supervisions. It then semantically associates each target compositing object with detected regions based on their complementary characteristics. Through a bootstrapped training approach applied to randomly object-subtracted images, our model enforces meaningful placements through extensive paired data augmentation. Experimental results on established benchmarks demonstrate BOOTPLACE’s superior performance in object repositioning, markedly surpassing state-of-the-art baselines on Cityscapes and OPA datasets with notable improvements in IOU scores. Additional ablation studies further showcase the compositionality and generalizability of our approach, supported by user study evaluations.
zh

[CV-94] AgRowStitch: A High-fidelity Image Stitching Pipeline for Ground-based Agricultural Images

【速读】:该论文旨在解决农业图像拼接难题,特别是在近距离拍摄作物时因重复纹理、非平面植物及多图像拼接累积误差导致的特征匹配困难和漂移问题。传统方法依赖地理参考图像或高空拍摄,但这些方案不适用于贴近作物的场景。为应对这一挑战,论文提出了一种用户友好且开源的管道,用于拼接基于地面的线性作物行图像,无需额外数据支持。其关键在于首先利用SuperPoint和LightGlue在小批量图像内提取并匹配特征,随后通过约束相机运动逐批串行拼接图像,并对每批次镶嵌图进行校直和缩放,最后将所有批次镶嵌图串联并校直为最终镶嵌图。实验结果表明,该方法在三种不同采集条件下均生成了高质量镶嵌图,能够以平均绝对误差20厘米的精度粗略地理参考行内真实世界位置。

链接: https://arxiv.org/abs/2503.21990
作者: Isaac Kazuo Uyehara,Heesup Yun,Earl Ranario,Mason Earles
机构: University of California, Davis (加州大学戴维斯分校)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Agricultural imaging often requires individual images to be stitched together into a final mosaic for analysis. However, agricultural images can be particularly challenging to stitch because feature matching across images is difficult due to repeated textures, plants are non-planar, and mosaics built from many images can accumulate errors that cause drift. Although these issues can be mitigated by using georeferenced images or taking images at high altitude, there is no general solution for images taken close to the crop. To address this, we created a user-friendly and open source pipeline for stitching ground-based images of a linear row of crops that does not rely on additional data. First, we use SuperPoint and LightGlue to extract and match features within small batches of images. Then we stitch the images in each batch in series while imposing constraints on the camera movement. After straightening and rescaling each batch mosaic, all batch mosaics are stitched together in series and then straightened into a final mosaic. We tested the pipeline on images collected along 72 m long rows of crops using two different agricultural robots and a camera manually carried over the row. In all three cases, the pipeline produced high-quality mosaics that could be used to georeference real world positions with a mean absolute error of 20 cm. This approach provides accessible leaf-scale stitching to users who need to coarsely georeference positions within a row, but do not have access to accurate positional data or sophisticated imaging systems.
zh
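
上文 [CV-94] 的拼接流程可概括为“小批量内特征匹配 + 约束相机运动的串行拼接”。以下草图用 OpenCV 的 SIFT 与比率检验代替论文中的 SuperPoint/LightGlue(请注意这一替换),并将相机运动约束简化为帧间平移的鲁棒估计,仅演示流程骨架;函数名与阈值均为本文假设。

```python
import cv2
import numpy as np

def pairwise_row_shift(img_a, img_b, ratio=0.75):
    """估计相邻两帧间的平移量(假设相机沿作物行近似平移运动)。

    img_a, img_b: 灰度图(uint8)。返回 (dx, dy)。
    """
    sift = cv2.SIFT_create()
    kp_a, des_a = sift.detectAndCompute(img_a, None)
    kp_b, des_b = sift.detectAndCompute(img_b, None)

    matcher = cv2.BFMatcher(cv2.NORM_L2)
    matches = matcher.knnMatch(des_a, des_b, k=2)
    good = [m[0] for m in matches
            if len(m) == 2 and m[0].distance < ratio * m[1].distance]  # Lowe 比率检验
    if len(good) < 4:
        raise RuntimeError("匹配点不足,无法估计平移")

    shifts = np.array([np.array(kp_b[m.trainIdx].pt) - np.array(kp_a[m.queryIdx].pt)
                       for m in good])
    return np.median(shifts, axis=0)        # 用中位数抑制外点

# 用法示意:串行累积每帧相对第一帧的平移,再据此拼接批内镶嵌图
# import glob
# imgs = [cv2.imread(p, cv2.IMREAD_GRAYSCALE) for p in sorted(glob.glob("row/*.jpg"))]
# offsets = np.cumsum([pairwise_row_shift(a, b) for a, b in zip(imgs, imgs[1:])], axis=0)
```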

[CV-95] Harmonizing Visual Representations for Unified Multimodal Understanding and Generation

【速读】:该论文旨在解决在单一多模态框架内统一视觉理解与生成任务这一重大挑战。当前基于向量量化(Vector Quantization, VQ)或变分自编码器(Variational Autoencoder, VAE)的方法倾向于优先提取图像内在特征而非语义信息,从而影响了理解性能。为应对这一问题,论文受到掩码图像建模(Masked Image Modelling, MIM)及其扩展的掩码自回归(Masked Autoregressive, MAR)图像生成方法的启发,发现MAR编码器具有卓越的线性探测准确性及对视觉概念精确的特征响应能力,表明其在视觉理解任务中的潜力远超最初的生成角色。基于此洞察,论文提出了一种名为Harmon的统一自回归框架,通过共享MAR编码器实现理解和生成任务的协调,并采用三阶段训练流程逐步优化理解与生成能力。最终,Harmon不仅在GenEval、MJHQ30K和WISE基准测试中取得了最先进的图像生成结果,还在图像理解基准测试中达到了与专用语义编码器方法(如Janus)相当的性能。

链接: https://arxiv.org/abs/2503.21979
作者: Size Wu,Wenwei Zhang,Lumin Xu,Sheng Jin,Zhonghua Wu,Qingyi Tao,Wentao Liu,Wei Li,Chen Change Loy
机构: S-Lab, Nanyang Technological University (南洋理工大学); Shanghai AI Laboratory (上海人工智能实验室); The Chinese University of Hong Kong (香港中文大学); SenseTime Research; Tetras.AI (商汤科技研究部)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Unifying visual understanding and generation within a single multimodal framework remains a significant challenge, as the two inherently heterogeneous tasks require representations at different levels of granularity. Current approaches that utilize vector quantization (VQ) or variational autoencoders (VAE) for unified visual representation prioritize intrinsic imagery features over semantics, compromising understanding performance. In this work, we take inspiration from masked image modelling (MIM) that learns rich semantics via a mask-and-reconstruct pre-training and its successful extension to masked autoregressive (MAR) image generation. A preliminary study on the MAR encoder’s representation reveals exceptional linear probing accuracy and precise feature response to visual concepts, which indicates MAR’s potential for visual understanding tasks beyond its original generation role. Based on these insights, we present \emphHarmon, a unified autoregressive framework that harmonizes understanding and generation tasks with a shared MAR encoder. Through a three-stage training procedure that progressively optimizes understanding and generation capabilities, Harmon achieves state-of-the-art image generation results on the GenEval, MJHQ30K and WISE benchmarks while matching the performance of methods with dedicated semantic encoders (e.g., Janus) on image understanding benchmarks. Our code and models will be available at this https URL.
zh

[CV-96] Q-MambaIR: Accurate Quantized Mamba for Efficient Image Restoration

【速读】:该论文旨在解决基于状态空间模型(SSMs)在图像恢复(IR)任务中部署到边缘设备时面临的挑战,特别是由于内存、计算能力和功耗限制导致的高效压缩需求。传统低比特量化方法虽能有效减小模型规模并加速任务执行,但在超低比特宽度(2-4位)下会导致显著性能下降,主要归因于异常值引起的量化误差加剧。为应对这一挑战,论文提出了一种名为Q-MambaIR的精确、高效且灵活的量化方法。其关键在于引入统计动态平衡可学习标量(Statistical Dynamic-balancing Learnable Scalar, DLS),通过动态调整量化映射范围来减轻极值引起的峰值截断损失;同时设计了具有自适应阈值的范围浮动灵活分配器(Range-floating Flexible Allocator, RFA),以灵活四舍五入数值,从而保留高频细节并保持SSM的特征提取能力,同时支持预部署权重量化,平衡了计算效率与模型精度。实验结果表明,Q-MambaIR在不显著增加训练计算和存储开销的情况下,大幅超越现有量化SSMs,实现了更高的最先进技术(SOTA)精度。

链接: https://arxiv.org/abs/2503.21970
作者: Yujie Chen,Haotong Qin,Zhang Zhang,Michelo Magno,Luca Benini,Yawei Li
机构: ETH Zurich (苏黎世联邦理工学院); Beijing Institute of Technology (北京理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:State-Space Models (SSMs) have attracted considerable attention in Image Restoration (IR) due to their ability to scale linearly sequence length while effectively capturing long-distance dependencies. However, deploying SSMs to edge devices is challenging due to the constraints in memory, computing capacity, and power consumption, underscoring the need for efficient compression strategies. While low-bit quantization is an efficient model compression strategy for reducing size and accelerating IR tasks, SSM suffers substantial performance drops at ultra-low bit-widths (2-4 bits), primarily due to outliers that exacerbate quantization error. To address this challenge, we propose Q-MambaIR, an accurate, efficient, and flexible Quantized Mamba for IR tasks. Specifically, we introduce a Statistical Dynamic-balancing Learnable Scalar (DLS) to dynamically adjust the quantization mapping range, thereby mitigating the peak truncation loss caused by extreme values. Furthermore, we design a Range-floating Flexible Allocator (RFA) with an adaptive threshold to flexibly round values. This approach preserves high-frequency details and maintains the SSM’s feature extraction capability. Notably, RFA also enables pre-deployment weight quantization, striking a balance between computational efficiency and model accuracy. Extensive experiments on IR tasks demonstrate that Q-MambaIR consistently outperforms existing quantized SSMs, achieving much higher state-of-the-art (SOTA) accuracy results with only a negligible increase in training computation and storage saving.
zh
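
上文 [CV-96] 的 DLS 通过可学习的缩放因子动态调整量化映射范围。以下给出一个带直通估计器(STE)的可学习对称量化草图:它只覆盖“可学习 scale”这一点,统计动态平衡与 RFA 的自适应取整并未实现;类名、初始化方式与梯度近似方式均为本文假设,并非论文实现。

```python
import torch
import torch.nn as nn

class LearnableScaleQuant(nn.Module):
    """可学习缩放因子的对称量化草图(对应 DLS“动态调整量化映射范围”的思想)。"""

    def __init__(self, n_bits=4, init_scale=1.0):
        super().__init__()
        self.qmax = 2 ** (n_bits - 1) - 1
        self.log_scale = nn.Parameter(torch.tensor(float(init_scale)).log())

    def forward(self, w):
        scale = self.log_scale.exp()
        x = w / scale
        x_q = x + (torch.round(x) - x).detach()          # round 的直通估计(STE)
        x_q = torch.clamp(x_q, -self.qmax - 1, self.qmax)
        return x_q * scale                               # 反量化回原数值范围

quant = LearnableScaleQuant(n_bits=2)
w = torch.randn(64, 64, requires_grad=True)
out = quant(w)
out.sum().backward()    # w 与 log_scale 都能获得梯度,可随任务损失端到端更新
```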

[CV-97] NeRF-based Point Cloud Reconstruction using a Stationary Camera for Agricultural Applications

【速读】:该论文旨在解决传统NeRF(Neural Radiance Fields)方法在高通量室内植物表型分析设施中的局限性,即当目标物体在传送带或旋转底座上快速移动时,无法通过相机围绕静止物体运动的传统方式完成点云(Point Cloud, PCD)重建。为了解决这一问题,论文提出了一种基于NeRF的变体方法,利用单个固定位置的相机捕捉旋转物体的图像,并通过COLMAP进行位姿估计,结合简单的位姿变换模拟相机运动,随后采用标准NeRF训练流程实现高分辨率点云(10M点)的重建。关键在于定义感兴趣区域(Region of Interest, ROI)以排除无关场景数据,并优化了从位姿估计到最终点云生成的工作流,从而在保持高精度的同时显著提升了效率,验证了该方法在实际高通量表型分析中的可行性。

链接: https://arxiv.org/abs/2503.21958
作者: Kibon Ku,Talukder Z Jubery,Elijah Rodriguez,Aditya Balu,Soumik Sarkar,Adarsh Krishnamurthy,Baskar Ganapathysubramanian
机构: Iowa State University (爱荷华州立大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:This paper presents a NeRF-based framework for point cloud (PCD) reconstruction, specifically designed for indoor high-throughput plant phenotyping facilities. Traditional NeRF-based reconstruction methods require cameras to move around stationary objects, but this approach is impractical for high-throughput environments where objects are rapidly imaged while moving on conveyors or rotating pedestals. To address this limitation, we develop a variant of NeRF-based PCD reconstruction that uses a single stationary camera to capture images as the object rotates on a pedestal. Our workflow comprises COLMAP-based pose estimation, a straightforward pose transformation to simulate camera movement, and subsequent standard NeRF training. A defined Region of Interest (ROI) excludes irrelevant scene data, enabling the generation of high-resolution point clouds (10M points). Experimental results demonstrate excellent reconstruction fidelity, with precision-recall analyses yielding an F-score close to 100.00 across all evaluated plant objects. Although pose estimation remains computationally intensive with a stationary camera setup, overall training and reconstruction times are competitive, validating the method’s feasibility for practical high-throughput indoor phenotyping applications. Our findings indicate that high-quality NeRF-based 3D reconstructions are achievable using a stationary camera, eliminating the need for complex camera motion or costly imaging equipment. This approach is especially beneficial when employing expensive and delicate instruments, such as hyperspectral cameras, for 3D plant phenotyping. Future work will focus on optimizing pose estimation techniques and further streamlining the methodology to facilitate seamless integration into automated, high-throughput 3D phenotyping pipelines.
zh

[CV-98] Enhancing Pavement Crack Classification with Bidirectional Cascaded Neural Networks

【速读】:该论文旨在解决路面病害(如裂缝和坑洞)分类准确性不足的问题,以提升道路安全与维护效率。为实现这一目标,论文提出了一种基于双向级联神经网络(Bidirectional Cascaded Neural Networks, BCNNs)的解决方案,并结合U-Net 50进行图像增强处理。解决方案的关键在于设计了一种能够利用前向与后向信息流的网络结构,通过级联架构使每一层逐步优化前一层的输出结果,从而显著提高检测精度。最终,该模型在包含599张增强图像的数据集上实现了总体分类准确率为87%,并在不同类别(线性裂缝、疲劳裂缝和坑洞)中展现出高精度(Precision)、召回率(Recall)及F1分数,证明了BCNN在复杂路面病害模式分类中的优异性能。

链接: https://arxiv.org/abs/2503.21956
作者: Taqwa I.Alhadidi,Asmaa Alazmi,Shadi Jaradat,Ahmed Jaber,Huthaifa Ashqar,Mohammed Elhenawy
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Pavement distress, such as cracks and potholes, is a significant issue affecting road safety and maintenance. In this study, we present the implementation and evaluation of Bidirectional Cascaded Neural Networks (BCNNs) for the classification of pavement crack images following image augmentation. We classified pavement cracks into three main categories: linear cracks, potholes, and fatigue cracks on an enhanced dataset utilizing U-Net 50 for image augmentation. The augmented dataset comprised 599 images. Our proposed BCNN model was designed to leverage both forward and backward information flows, with detection accuracy enhanced by its cascaded structure wherein each layer progressively refines the output of the preceding one. Our model achieved an overall accuracy of 87%, with precision, recall, and F1-score measures indicating high effectiveness across the categories. For fatigue cracks, the model recorded a precision of 0.87, recall of 0.83, and F1-score of 0.85 on 205 images. Linear cracks were detected with a precision of 0.81, recall of 0.89, and F1-score of 0.85 on 205 images, and potholes with a precision of 0.96, recall of 0.90, and F1-score of 0.93 on 189 images. The macro and weighted average of precision, recall, and F1-score were identical at 0.88, confirming the BCNN’s excellent performance in classifying complex pavement crack patterns. This research demonstrates the potential of BCNNs to significantly enhance the accuracy and reliability of pavement distress classification, resulting in more effective and efficient pavement maintenance and management systems.
zh

[CV-99] Parametric Shadow Control for Portrait Generationin Text-to-Image Diffusion Models

【速读】:该论文旨在解决现有文本到图像扩散模型在阴影控制方面的不足,这些模型虽然擅长生成多样化的肖像,但在直观的阴影操控方面表现欠佳。此外,现有的编辑方法作为后处理手段,在不同风格间提供有效操控时面临挑战,并且部分方法依赖昂贵的真实世界光场数据采集或需要大量的计算资源进行训练。为了解决这些问题,论文提出了一种名为Shadow Director的方法,该方法能够提取并操纵训练良好的扩散模型中的隐藏阴影属性。其关键在于使用一个小巧的估计网络,仅需少量数千张合成图像和数小时的训练即可实现,无需昂贵的真实世界光场数据。Shadow Director能够在肖像生成过程中提供参数化且直观的阴影形状、位置和强度控制,同时保持艺术完整性和身份一致性,适用于多种风格,从而提供了一种更易访问且资源友好的解决方案。

链接: https://arxiv.org/abs/2503.21943
作者: Haoming Cai,Tsung-Wei Huang,Shiv Gehlot,Brandon Y. Feng,Sachin Shah,Guan-Ming Su,Christopher Metzler
机构: University of Maryland (马里兰大学); Dolby Labs (杜比实验室); MIT (麻省理工学院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)
备注: ShadowDirector Arxiv Version

点击查看摘要

Abstract:Text-to-image diffusion models excel at generating diverse portraits, but lack intuitive shadow control. Existing editing approaches, as post-processing, struggle to offer effective manipulation across diverse styles. Additionally, these methods either rely on expensive real-world light-stage data collection or require extensive computational resources for training. To address these limitations, we introduce Shadow Director, a method that extracts and manipulates hidden shadow attributes within well-trained diffusion models. Our approach uses a small estimation network that requires only a few thousand synthetic images and hours of training-no costly real-world light-stage data needed. Shadow Director enables parametric and intuitive control over shadow shape, placement, and intensity during portrait generation while preserving artistic integrity and identity across diverse styles. Despite training only on synthetic data built on real-world identities, it generalizes effectively to generated portraits with diverse styles, making it a more accessible and resource-friendly solution.
zh

[CV-100] Flexible Moment-Invariant Bases from Irreducible Tensors

【速读】:该论文旨在解决现有旋转不变描述符生成方法在处理球面函数(spherical functions)时的脆弱性问题。尽管当前最先进的方法对矩张量(moment tensors)恒为零的情况具有鲁棒性,但在实际应用中仍容易受到球面函数引起的退化(degeneracy)影响。论文的关键解决方案在于结合两种流行的矩不变量方法:基于球谐函数(spherical harmonics)的方法与基于笛卡尔张量代数(Cartesian tensor algebra)的方法,从而克服这一局限性。

链接: https://arxiv.org/abs/2503.21939
作者: Roxana Bujack,Emily Shinkle,Alice Allen,Tomas Suk,Nicholas Lubbers
机构: Los Alamos National Laboratory (洛斯阿拉莫斯国家实验室); Max Planck Institute for Polymer Research (马克斯普朗克聚合物研究所); Czech Academy of Sciences (捷克科学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Moment invariants are a powerful tool for the generation of rotation-invariant descriptors needed for many applications in pattern detection, classification, and machine learning. A set of invariants is optimal if it is complete, independent, and robust against degeneracy in the input. In this paper, we show that the current state of the art for the generation of these bases of moment invariants, despite being robust against moment tensors being identically zero, is vulnerable to a degeneracy that is common in real-world applications, namely spherical functions. We show how to overcome this vulnerability by combining two popular moment invariant approaches: one based on spherical harmonics and one based on Cartesian tensor algebra.
zh

[CV-101] Multimodal Data Integration for Sustainable Indoor Gardening: Tracking Anyplant with Time Series Foundation Model

【速读】:该论文旨在解决室内园艺中植物健康与生长状态自动化监测的问题。传统方法通常依赖单一数据源(如环境传感器或图像分析),而该研究提出的关键解决方案在于整合多模态数据(包括RGB图像、植物表型数据及环境因素)以实现更精准的植物水分胁迫预测。通过结合高分辨率相机提取表型特征,并利用Lag-Llama时间序列模型处理这些数据,该框架能够显著提高预测准确性(均方误差MSE=0.420777,平均绝对误差MAE=0.595428)。这种基于计算机视觉、机器学习与环境传感技术的综合方法不仅优化了资源使用效率,还为可持续建筑管理与城市绿化提供了创新路径。

链接: https://arxiv.org/abs/2503.21932
作者: Seyed Hamidreza Nabaei,Zeyang Zheng,Dong Chen,Arsalan Heydarian
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG)
备注: Accepted at ASCE International Conference on Computing in Civil Engineering (i3ce)

点击查看摘要

Abstract:Indoor gardening within sustainable buildings offers a transformative solution to urban food security and environmental sustainability. By 2030, urban farming, including Controlled Environment Agriculture (CEA) and vertical farming, is expected to grow at a compound annual growth rate (CAGR) of 13.2% from 2024 to 2030, according to market reports. This growth is fueled by advancements in Internet of Things (IoT) technologies, sustainable innovations such as smart growing systems, and the rising interest in green interior design. This paper presents a novel framework that integrates computer vision, machine learning (ML), and environmental sensing for the automated monitoring of plant health and growth. Unlike previous approaches, this framework combines RGB imagery, plant phenotyping data, and environmental factors such as temperature and humidity, to predict plant water stress in a controlled growth environment. The system utilizes high-resolution cameras to extract phenotypic features, such as RGB, plant area, height, and width while employing the Lag-Llama time series model to analyze and predict water stress. Experimental results demonstrate that integrating RGB, size ratios, and environmental data significantly enhances predictive accuracy, with the Fine-tuned model achieving the lowest errors (MSE = 0.420777, MAE = 0.595428) and reduced uncertainty. These findings highlight the potential of multimodal data and intelligent systems to automate plant care, optimize resource consumption, and align indoor gardening with sustainable building management practices, paving the way for resilient, green urban spaces.
zh

[CV-102] Locally Orderless Images for Optimization in Differentiable Rendering CVPR2025

【速读】:该论文旨在解决可微分渲染中优化场景参数时因图像空间运动引起的梯度稀疏性导致收敛效果差的问题。现有方法通过拓扑导数或拉格朗日导数等代理梯度来缓解这一问题,但这些方法通常对渲染过程作出简化假设。此外,多分辨率图像金字塔虽然提供了一种替代方案,但在实际应用中表现不稳定。论文的关键解决方案是引入一种利用局部无序图像(Locally Orderless Images, LOI)的方法,其中每个像素映射为一个强度直方图,以保留外观上的局部变化。通过最小化直方图距离的逆向渲染目标函数,该方法扩展了对稀疏定义图像梯度的支持,并成功恢复最优参数。研究在合成数据和真实数据上验证了所提方法的有效性。

链接: https://arxiv.org/abs/2503.21931
作者: Ishit Mehta,Manmohan Chandraker,Ravi Ramamoorthi
机构: University of California San Diego (加州大学圣地亚哥分校)
类目: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
备注: CVPR 2025. Project: this https URL

点击查看摘要

Abstract:Problems in differentiable rendering often involve optimizing scene parameters that cause motion in image space. The gradients for such parameters tend to be sparse, leading to poor convergence. While existing methods address this sparsity through proxy gradients such as topological derivatives or lagrangian derivatives, they make simplifying assumptions about rendering. Multi-resolution image pyramids offer an alternative approach but prove unreliable in practice. We introduce a method that uses locally orderless images, where each pixel maps to a histogram of intensities that preserves local variations in appearance. Using an inverse rendering objective that minimizes histogram distance, our method extends support for sparsely defined image gradients and recovers optimal parameters. We validate our method on various inverse problems using both synthetic and real data.
zh
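
【代码示意】: 下面用 NumPy 给出“局部无序图像”思想的简化示意(假设性实现,分箱数、带宽等参数均为演示取值,并非论文代码):将每个像素映射为邻域内经高斯软分箱得到的强度直方图,再以直方图差异作为图像间的比较度量,这类度量对稀疏定义的图像梯度更为友好。

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def locally_orderless(img, n_bins=16, beta=0.05, sigma=2.0):
    """img: 取值在 [0,1] 的灰度图 -> (H, W, n_bins) 的逐像素局部直方图。"""
    centers = (np.arange(n_bins) + 0.5) / n_bins
    # 强度软分箱,再做空间高斯聚合(即“局部无序”)
    memberships = np.exp(-((img[..., None] - centers) ** 2) / (2 * beta ** 2))
    memberships /= memberships.sum(axis=-1, keepdims=True)
    hists = np.stack(
        [gaussian_filter(memberships[..., k], sigma) for k in range(n_bins)], axis=-1
    )
    return hists / (hists.sum(axis=-1, keepdims=True) + 1e-8)

def histogram_distance(img_a, img_b, **kw):
    """两幅图像局部直方图的均方差,可视作逆渲染目标函数的简化替身。"""
    return np.mean((locally_orderless(img_a, **kw) - locally_orderless(img_b, **kw)) ** 2)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    a = rng.random((64, 64))
    b = np.roll(a, 3, axis=1)          # 平移后的图像
    print("dist(a, a) =", histogram_distance(a, a))
    print("dist(a, b) =", histogram_distance(a, b))
```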

[CV-103] KernelFusion: Assumption-Free Blind Super-Resolution via Patch Diffusion

【速读】:该论文试图解决传统超分辨率(Super-Resolution, SR)方法依赖于理想化下采样核假设的问题,以及当前盲超分辨率(Blind-SR)方法局限于简单下采样核且在复杂退化场景下失效的局限性。论文的关键在于提出了一种零样本扩散(diffusion-based)方法——KernelFusion,它无需对下采样核进行任何假设。其核心解决方案是通过从低分辨率(Low-Resolution, LR)输入图像直接恢复与之对应的图像特定的超分辨率核(SR-kernel),同时重建高分辨率(High-Resolution, HR)图像。该方法利用了正确SR-kernel应最大化低分辨率图像不同尺度间patch相似性的原则,并通过训练基于图像特定patch的扩散模型来捕捉输入图像的独特内部patch统计特性,从而实现跨尺度关系的保持和HR图像的重构。实验结果表明,KernelFusion在复杂的下采样退化场景中显著优于现有SR基线及盲超分辨率方法,推动了盲超分辨率技术进入无假设的新范式。

链接: https://arxiv.org/abs/2503.21907
作者: Oliver Heinimann,Assaf Shocher,Tal Zimbalist,Michal Irani
机构: Weizmann Institute of Science (魏茨曼科学研究学院); NVIDIA Research (NVIDIA研究中心)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Traditional super-resolution (SR) methods assume an "ideal" downscaling SR-kernel (e.g., bicubic downscaling) between the high-resolution (HR) image and the low-resolution (LR) image. Such methods fail once the LR images are generated differently. Current blind-SR methods aim to remove this assumption, but are still fundamentally restricted to rather simplistic downscaling SR-kernels (e.g., anisotropic Gaussian kernels), and fail on more complex (out of distribution) downscaling degradations. However, using the correct SR-kernel is often more important than using a sophisticated SR algorithm. In "KernelFusion" we introduce a zero-shot diffusion-based method that makes no assumptions about the kernel. Our method recovers the unique image-specific SR-kernel directly from the LR input image, while simultaneously recovering its corresponding HR image. KernelFusion exploits the principle that the correct SR-kernel is the one that maximizes patch similarity across different scales of the LR image. We first train an image-specific patch-based diffusion model on the single LR input image, capturing its unique internal patch statistics. We then reconstruct a larger HR image with the same learned patch distribution, while simultaneously recovering the correct downscaling SR-kernel that maintains this cross-scale relation between the HR and LR images. Empirical results show that KernelFusion vastly outperforms all SR baselines on complex downscaling degradations, where existing SotA Blind-SR methods fail miserably. By breaking free from predefined kernel assumptions, KernelFusion pushes Blind-SR into a new assumption-free paradigm, handling downscaling kernels previously thought impossible.
zh

[CV-104] AssistPDA: An Online Video Surveillance Assistant for Video Anomaly Prediction Detection and Analysis

【速读】:该论文旨在解决现有基于大型语言模型(Large Language Models, LLMs)的视频异常检测(Video Anomaly Detection, VAD)方法主要关注于视频级异常问答或离线检测的问题,而忽视了实际应用中实时性的重要性。论文的关键在于提出AssistPDA,这是一种首个统一视频异常预测、检测与分析(Video Anomaly Prediction, Detection, and Analysis, VAPDA)的在线视频异常监控助手。其核心解决方案包括引入事件级异常预测任务以实现提前预警,以及设计空间-时间关系蒸馏(Spatio-Temporal Relation Distillation, STRD)模块,将视觉-语言模型(Vision-Language Models, VLMs)在离线设置下的长期时空建模能力迁移到实时场景中,从而增强对复杂时序依赖性和长序列记忆的理解。此外,构建了VAPDA-127K数据集作为首个针对基于VLM的在线VAPDA的大规模基准,进一步验证了AssistPDA在实时VAPDA中的优越性能。

链接: https://arxiv.org/abs/2503.21904
作者: Zhiwei Yang,Chen Gao,Jing Liu,Peng Wu,Guansong Pang,Mike Zheng Shou
机构: Xidian University (西安电子科技大学); Show Lab, National University of Singapore (新加坡国立大学Show实验室); Northwestern Polytechnical University (西北工业大学); Singapore Management University (新加坡管理大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 13 pages

点击查看摘要

Abstract:The rapid advancements in large language models (LLMs) have spurred growing interest in LLM-based video anomaly detection (VAD). However, existing approaches predominantly focus on video-level anomaly question answering or offline detection, ignoring the real-time nature essential for practical VAD applications. To bridge this gap and facilitate the practical deployment of LLM-based VAD, we introduce AssistPDA, the first online video anomaly surveillance assistant that unifies video anomaly prediction, detection, and analysis (VAPDA) within a single framework. AssistPDA enables real-time inference on streaming videos while supporting interactive user engagement. Notably, we introduce a novel event-level anomaly prediction task, enabling proactive anomaly forecasting before anomalies fully unfold. To enhance the ability to model intricate spatiotemporal relationships in anomaly events, we propose a Spatio-Temporal Relation Distillation (STRD) module. STRD transfers the long-term spatiotemporal modeling capabilities of vision-language models (VLMs) from offline settings to real-time scenarios. Thus it equips AssistPDA with a robust understanding of complex temporal dependencies and long-sequence memory. Additionally, we construct VAPDA-127K, the first large-scale benchmark designed for VLM-based online VAPDA. Extensive experiments demonstrate that AssistPDA outperforms existing offline VLM-based approaches, setting a new state-of-the-art for real-time VAPDA. Our dataset and code will be open-sourced to facilitate further research in the community.
zh

[CV-105] Exponentially Weighted Instance-Aware Repeat Factor Sampling for Long-Tailed Object Detection Model Training in Unmanned Aerial Vehicles Surveillance Scenarios

【速读】:该论文试图解决目标检测模型在类别不平衡(class imbalance)问题上的挑战,即罕见类别(rare categories)出现频率远低于常见类别的情况。现有基于采样重平衡策略的方法,如重复因子采样(Repeat Factor Sampling, RFS)和实例感知重复因子采样(Instance-Aware Repeat Factor Sampling, IRFS),通过调整图像和实例的数量来重新平衡样本频率,但这些方法基于线性调整,在长尾分布(long-tailed distributions)场景下效果有限。论文的关键解决方案是提出了一种改进方法——指数加权实例感知重复因子采样(Exponentially Weighted Instance-Aware Repeat Factor Sampling, E-IRFS)。E-IRFS 在 IRFS 的基础上引入指数缩放机制,利用几何平均值的指数函数调整采样概率,从而更有效地区分罕见类别与频繁类别,实现更具适应性的重平衡策略。实验结果表明,E-IRFS 在多个数据集上的检测性能提升了 22%,尤其显著改善了罕见类别的检测效果,并在轻量级模型中表现出更强的效果。

链接: https://arxiv.org/abs/2503.21893
作者: Taufiq Ahmed,Abhishek Kumar,Constantino Álvarez Casado,Anlan Zhang,Tuomo Hänninen,Lauri Loven,Miguel Bordallo López,Sasu Tarkoma
机构: Center for Ubiquitous Computing (UBICOMP), University of Oulu (奥卢大学), Finland; Center for Machine Vision and Signal Analysis (CMVS), University of Oulu (奥卢大学), Finland; Center for Wireless Communications (CWC), University of Oulu (奥卢大学), Finland; Ming Hsieh Department of Electrical and Computer Engineering, University of Southern California (南加州大学), United States; Department of Computer Science, University of Helsinki (赫尔辛基大学), Finland
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 6 pages, 2 figures, 9 tables, 6 formulas, conference paper

点击查看摘要

Abstract:Object detection models often struggle with class imbalance, where rare categories appear significantly less frequently than common ones. Existing sampling-based rebalancing strategies, such as Repeat Factor Sampling (RFS) and Instance-Aware Repeat Factor Sampling (IRFS), mitigate this issue by adjusting sample frequencies based on image and instance counts. However, these methods are based on linear adjustments, which limit their effectiveness in long-tailed distributions. This work introduces Exponentially Weighted Instance-Aware Repeat Factor Sampling (E-IRFS), an extension of IRFS that applies exponential scaling to better differentiate between rare and frequent classes. E-IRFS adjusts sampling probabilities using an exponential function applied to the geometric mean of image and instance frequencies, ensuring a more adaptive rebalancing strategy. We evaluate E-IRFS on a dataset derived from the Fireman-UAV-RGBT Dataset and four additional public datasets, using YOLOv11 object detection models to identify fire, smoke, people and lakes in emergency scenarios. The results show that E-IRFS improves detection performance by 22% over the baseline and outperforms RFS and IRFS, particularly for rare categories. The analysis also highlights that E-IRFS has a stronger effect on lightweight models with limited capacity, as these models rely more on data sampling strategies to address class imbalance. The findings demonstrate that E-IRFS improves rare object detection in resource-constrained environments, making it a suitable solution for real-time applications such as UAV-based emergency monitoring.
zh
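
【代码示意】: 下面给出重复因子采样(RFS)及一个指数加权变体的极简示意(假设性实现,指数形式与超参数均为演示用的猜测,E-IRFS 的确切公式请以原论文为准):示意中的变体对图像频率与实例频率的几何平均施加指数缩放,使稀有类别获得比线性 RFS 更大的过采样倍数。

```python
import math

def rfs_repeat_factor(freq, t=0.001):
    """经典 RFS:r(c) = max(1, sqrt(t / f(c))),f(c) 为含类别 c 的图像占比。"""
    return max(1.0, math.sqrt(t / freq))

def e_irfs_repeat_factor(img_freq, inst_freq, t=0.001, lam=1.0):
    """示意性的指数加权因子(非论文官方公式):
    先对图像频率与实例频率的几何平均求 RFS 式的线性因子,再做指数放大。"""
    geo = math.sqrt(img_freq * inst_freq)
    linear = math.sqrt(t / geo)
    return max(1.0, math.exp(lam * (linear - 1.0)))

# 假设的类别频率:(图像频率, 实例频率),数值仅作演示
classes = {"fire": (1e-4, 5e-5), "smoke": (1e-2, 2e-2), "person": (3e-1, 5e-1)}
for name, (fi, fn) in classes.items():
    print(f"{name:7s}  RFS={rfs_repeat_factor(fi):5.2f}  "
          f"E-IRFS(示意)={e_irfs_repeat_factor(fi, fn):5.2f}")
```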

[CV-106] StarFlow: Generating Structured Workflow Outputs From Sketch Images

【速读】:该论文旨在解决从手绘草图或计算机生成的流程图自动转换为可执行结构化工作流的问题。这一任务面临的主要挑战包括自由形式绘图的歧义性、不同图表风格的变化以及从视觉元素推断执行逻辑的难度。为了解决这些问题,论文提出的关键方案是开发StarFlow框架,利用视觉-语言模型(Vision-Language Models, VLMs)生成结构化的工作流输出。该方法通过构建包含合成数据、人工标注数据和真实世界样本的多样化数据集来实现鲁棒训练与评估,并通过对多个视觉-语言模型进行微调和消融研究,分析所提方法的优势与局限性。实验结果表明,微调显著提升了结构化工作流生成的性能,在此任务上优于大型视觉-语言模型。

链接: https://arxiv.org/abs/2503.21889
作者: Patrice Bechard,Chao Wang,Amirhossein Abaskohi,Juan Rodriguez,Christopher Pal,David Vazquez,Spandana Gella,Sai Rajeswar,Perouz Taslakian
机构: ServiceNow; University of British Columbia (英属哥伦比亚大学); Mila; École de Technologie Supérieure (高等技术学院); CIFAR AI Chair; Polytechnique Montréal (蒙特利尔理工学院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Workflows are a fundamental component of automation in enterprise platforms, enabling the orchestration of tasks, data processing, and system integrations. Despite being widely used, building workflows can be complex, often requiring manual configuration through low-code platforms or visual programming tools. To simplify this process, we explore the use of generative foundation models, particularly vision-language models (VLMs), to automatically generate structured workflows from visual inputs. Translating hand-drawn sketches or computer-generated diagrams into executable workflows is challenging due to the ambiguity of free-form drawings, variations in diagram styles, and the difficulty of inferring execution logic from visual elements. To address this, we introduce StarFlow, a framework for generating structured workflow outputs from sketches using vision-language models. We curate a diverse dataset of workflow diagrams – including synthetic, manually annotated, and real-world samples – to enable robust training and evaluation. We finetune and benchmark multiple vision-language models, conducting a series of ablation studies to analyze the strengths and limitations of our approach. Our results show that finetuning significantly enhances structured workflow generation, outperforming large vision-language models on this task.
zh

[CV-107] Refined Geometry-guided Head Avatar Reconstruction from Monocular RGB Video

【速读】:该论文旨在解决从单目视频高保真重建人脸 avatar 的难题,这是计算机图形学和计算机视觉领域的一大挑战。论文提出了一种两阶段的人脸 avatar 重建网络,并引入了精炼的 3D 网格表示方法。与依赖基于 3DMM 的粗略模板化 3D 表示的传统方法不同,本文方案致力于学习一种适合神经辐射场(NeRF)的细化网格表示,以捕捉复杂的面部细节。关键在于第一阶段利用几何先验训练初始网格的 NeRF,并通过一致的潜在代码整合跨帧观测;第二阶段则基于初始 NeRF 的密度场构建的 SDF 进行网格优化,同时采用拉普拉斯平滑处理位移场以减少 NeRF 密度场中的噪声,从而进一步提升网格细节。实验表明,该方法显著改进了基于初始网格的 NeRF 渲染,并在高保真人脸 avatar 重建任务中超越现有最先进方法。

链接: https://arxiv.org/abs/2503.21886
作者: Pilseo Park,Ze Zhang,Michel Sarkis,Ning Bi,Xiaoming Liu,Yiying Tong
机构: Michigan State University (密歇根州立大学)
类目: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:High-fidelity reconstruction of head avatars from monocular videos is highly desirable for virtual human applications, but it remains a challenge in the fields of computer graphics and computer vision. In this paper, we propose a two-phase head avatar reconstruction network that incorporates a refined 3D mesh representation. Our approach, in contrast to existing methods that rely on coarse template-based 3D representations derived from 3DMM, aims to learn a refined mesh representation suitable for a NeRF that captures complex facial nuances. In the first phase, we train 3DMM-stored NeRF with an initial mesh to utilize geometric priors and integrate observations across frames using a consistent set of latent codes. In the second phase, we leverage a novel mesh refinement procedure based on an SDF constructed from the density field of the initial NeRF. To mitigate the typical noise in the NeRF density field without compromising the features of the 3DMM, we employ Laplace smoothing on the displacement field. Subsequently, we apply a second-phase training with these refined meshes, directing the learning process of the network towards capturing intricate facial details. Our experiments demonstrate that our method further enhances the NeRF rendering based on the initial mesh and achieves performance superior to state-of-the-art methods in reconstructing high-fidelity head avatars with such input.
zh

[CV-108] ManipTrans: Efficient Dexterous Bimanual Manipulation Transfer via Residual Learning CVPR2025

【速读】:该论文旨在解决在模拟环境中高效将人类双手技能迁移到灵巧机器人手的问题,尤其针对生成精确、大规模且类人化的操作序列这一挑战。传统强化学习或现实世界中的遥操作难以满足这些需求。论文的关键解决方案是提出了一种名为ManipTrans的两阶段方法:首先通过预训练一个广义轨迹模仿器来模仿手部动作;然后在交互约束下微调特定的残差模块。这种方法实现了复杂双手任务的有效学习与准确执行,并显著提升了成功率、保真度和效率。此外,利用ManipTrans,作者构建了一个包含3300集机器人操作数据的大规模数据集DexManipNet,涵盖如笔帽扣合和瓶盖拧开等此前未探索的任务,为灵巧手的进一步策略训练及实际部署奠定了基础。

链接: https://arxiv.org/abs/2503.21860
作者: Kailin Li,Puhao Li,Tengyu Liu,Yuyang Li,Siyuan Huang
机构: State Key Laboratory of General Artificial Intelligence (通用人工智能国家重点实验室), BIGAI; Department of Automation, Tsinghua University (清华大学自动化系); Institute for Artificial Intelligence, Peking University (北京大学人工智能研究院)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to CVPR 2025

点击查看摘要

Abstract:Human hands play a central role in interacting, motivating increasing research in dexterous robotic manipulation. Data-driven embodied AI algorithms demand precise, large-scale, human-like manipulation sequences, which are challenging to obtain with conventional reinforcement learning or real-world teleoperation. To address this, we introduce ManipTrans, a novel two-stage method for efficiently transferring human bimanual skills to dexterous robotic hands in simulation. ManipTrans first pre-trains a generalist trajectory imitator to mimic hand motion, then fine-tunes a specific residual module under interaction constraints, enabling efficient learning and accurate execution of complex bimanual tasks. Experiments show that ManipTrans surpasses state-of-the-art methods in success rate, fidelity, and efficiency. Leveraging ManipTrans, we transfer multiple hand-object datasets to robotic hands, creating DexManipNet, a large-scale dataset featuring previously unexplored tasks like pen capping and bottle unscrewing. DexManipNet comprises 3.3K episodes of robotic manipulation and is easily extensible, facilitating further policy training for dexterous hands and enabling real-world deployments.
zh

[CV-109] Foveated Instance Segmentation

【速读】:该论文致力于解决实例分割在资源受限的增强现实和虚拟现实(AR/VR)设备上的高计算开销问题,这导致了较大的处理延迟并降低了用户体验。传统方法通常无法有效应对AR/VR场景中用户仅关注视野内少数区域的特点。为此,论文提出了一种基于用户实时视线数据的聚焦实例分割(FovealSeg)框架,通过仅对感兴趣实例进行分割来显著降低计算负载并提升实时性能。该方案的关键在于利用视线聚焦策略实现高效的实例分割,从而实现实质性的计算节省。实验结果显示,所提出的FSNet在ADE20K和LVIS数据集上的IoU分别达到0.56和0.54,明显优于基线模型。相关代码已公开发布。

链接: https://arxiv.org/abs/2503.21854
作者: Hongyi Zeng,Wenxuan Liu,Tianhua Xia,Jinhui Chen,Ziyun Li,Sai Qian Zhang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Instance segmentation is essential for augmented reality and virtual reality (AR/VR) as it enables precise object recognition and interaction, enhancing the integration of virtual and real-world elements for an immersive experience. However, the high computational overhead of segmentation limits its application on resource-constrained AR/VR devices, causing large processing latency and degrading user experience. In contrast to conventional scenarios, AR/VR users typically focus on only a few regions within their field of view before shifting perspective, allowing segmentation to be concentrated on gaze-specific areas. This insight drives the need for efficient segmentation methods that prioritize processing instance of interest, reducing computational load and enhancing real-time performance. In this paper, we present a foveated instance segmentation (FovealSeg) framework that leverages real-time user gaze data to perform instance segmentation exclusively on instance of interest, resulting in substantial computational savings. Evaluation results show that FSNet achieves an IoU of 0.56 on ADE20K and 0.54 on LVIS, notably outperforming the baseline. The code is available at this https URL
zh
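
【代码示意】: 以下为“视线聚焦式分割”思路的极简示意(假设性代码,ROI 尺寸与占位分割器均为演示设定,与论文 FSNet 实现无关):仅对以注视点为中心的 ROI 运行分割,再把结果贴回全分辨率画布,从而省去视野外区域的计算。

```python
import numpy as np

def foveated_segment(frame, gaze_xy, segment_fn, roi=256):
    """frame: (H, W, 3);gaze_xy: 注视点像素坐标;segment_fn: 任意分割函数。"""
    h, w = frame.shape[:2]
    gx, gy = gaze_xy
    x0 = int(np.clip(gx - roi // 2, 0, w - roi))
    y0 = int(np.clip(gy - roi // 2, 0, h - roi))
    crop = frame[y0:y0 + roi, x0:x0 + roi]

    full_mask = np.zeros((h, w), dtype=np.int32)          # 0 = 未处理/背景
    full_mask[y0:y0 + roi, x0:x0 + roi] = segment_fn(crop)
    return full_mask

if __name__ == "__main__":
    dummy_frame = np.zeros((720, 1280, 3), dtype=np.uint8)
    # 占位的“分割器”:此处仅用阈值代替真实网络
    dummy_seg = lambda crop: (crop[..., 0] > 128).astype(np.int32)
    mask = foveated_segment(dummy_frame, gaze_xy=(900, 300), segment_fn=dummy_seg)
    print(mask.shape, int(mask.sum()))
```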

[CV-110] On Large Multimodal Models as Open-World Image Classifiers ALT

【速读】:该论文试图解决传统图像分类依赖预定义语义类别列表的问题,并评估大规模多模态模型(Large Multimodal Models, LMMs)在真正开放世界设置下的分类性能。现有研究大多局限于封闭世界设定且假设类别集固定,忽略了开放世界中未见类别的挑战。为此,论文通过定义任务、提出评估协议及多种度量指标来全面评估LMMs的分类表现,并分析其在细粒度分类中的困难。解决方案的关键在于引入新的评估框架以涵盖更广泛的类别类型,同时基于提出的度量指标揭示LMMs的错误模式,并强调定制化提示与推理策略如何缓解这些挑战。

链接: https://arxiv.org/abs/2503.21851
作者: Alessandro Conti,Massimiliano Mancini,Enrico Fini,Yiming Wang,Paolo Rota,Elisa Ricci
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 23 pages, 13 figures, code is available at this https URL

点击查看摘要

Abstract:Traditional image classification requires a predefined list of semantic categories. In contrast, Large Multimodal Models (LMMs) can sidestep this requirement by classifying images directly using natural language (e.g., answering the prompt “What is the main object in the image?”). Despite this remarkable capability, most existing studies on LMM classification performance are surprisingly limited in scope, often assuming a closed-world setting with a predefined set of categories. In this work, we address this gap by thoroughly evaluating LMM classification performance in a truly open-world setting. We first formalize the task and introduce an evaluation protocol, defining various metrics to assess the alignment between predicted and ground truth classes. We then evaluate 13 models across 10 benchmarks, encompassing prototypical, non-prototypical, fine-grained, and very fine-grained classes, demonstrating the challenges LMMs face in this task. Further analyses based on the proposed metrics reveal the types of errors LMMs make, highlighting challenges related to granularity and fine-grained capabilities, showing how tailored prompting and reasoning can alleviate them.
zh

[CV-111] Comparative Analysis of Image Video and Audio Classifiers for Automated News Video Segmentation

【速读】:该论文旨在解决新闻视频内容组织与检索系统效率低下的问题,特别是针对其无结构化特性带来的自动化处理挑战。论文的关键在于提出了一种综合比较分析方法,评估多种深度学习方法(如ResNet、ViViT、AST及多模态架构)在自动新闻视频分割中的表现,并专注于五类不同段落类型的分类任务,包括广告、故事、演播室场景、转场和可视化内容。研究通过使用包含1,832个场景片段的自定义标注数据集发现,基于图像的分类器(84.34%的准确率)优于更复杂的时序模型。其中,ResNet架构不仅超越了最先进的视频分类器,还显著减少了计算资源需求。二元分类模型在转场(94.23%)和广告(92.74%)识别上展现了高精度。这些结果为新闻视频分割的有效架构设计提供了深入理解,并为媒体应用中实现自动化内容组织系统(如归档、个性化内容分发和智能视频搜索)提供了实用见解。

链接: https://arxiv.org/abs/2503.21848
作者: Jonathan Attard,Dylan Seychell
机构: University of Malta (马耳他大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Preprint for paper in CAI 2025, 7 pages, 5 tables, 3 tables

点击查看摘要

Abstract:News videos require efficient content organisation and retrieval systems, but their unstructured nature poses significant challenges for automated processing. This paper presents a comprehensive comparative analysis of image, video, and audio classifiers for automated news video segmentation. This work presents the development and evaluation of multiple deep learning approaches, including ResNet, ViViT, AST, and multimodal architectures, to classify five distinct segment types: advertisements, stories, studio scenes, transitions, and visualisations. Using a custom-annotated dataset of 41 news videos comprising 1,832 scene clips, our experiments demonstrate that image-based classifiers achieve superior performance (84.34% accuracy) compared to more complex temporal models. Notably, the ResNet architecture outperformed state-of-the-art video classifiers while requiring significantly fewer computational resources. Binary classification models achieved high accuracy for transitions (94.23%) and advertisements (92.74%). These findings advance the understanding of effective architectures for news video segmentation and provide practical insights for implementing automated content organisation systems in media applications. These include media archiving, personalised content delivery, and intelligent video search.
zh

[CV-112] CMD-HAR: Cross-Modal Disentanglement for Wearable Human Activity Recognition

【速读】:该论文旨在解决基于传感器的人类活动识别(Sensor-based Human Activity Recognition, HAR)中存在的多模态数据混合、活动异质性以及复杂模型部署等问题。论文的关键解决方案在于提出了一种时空注意力模态分解对齐融合策略(Spatiotemporal Attention Modal Decomposition Alignment Fusion Strategy),通过跨模态时空解耦表示捕捉活动的关键判别特征,并结合梯度调制缓解数据异质性问题。此外,还构建了一个可穿戴设备部署仿真系统以验证方法的有效性。实验结果表明,该模型在多个公开数据集上的表现具有显著效果。

链接: https://arxiv.org/abs/2503.21843
作者: Hanyu Liu,Siyao Li,Ying Yu,Yixuan Jiang,Hang Xiao,Jingxi Long,Haotian Tang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Human Activity Recognition (HAR) is a fundamental technology for numerous human - centered intelligent applications. Although deep learning methods have been utilized to accelerate feature extraction, issues such as multimodal data mixing, activity heterogeneity, and complex model deployment remain largely unresolved. The aim of this paper is to address issues such as multimodal data mixing, activity heterogeneity, and complex model deployment in sensor-based human activity recognition. We propose a spatiotemporal attention modal decomposition alignment fusion strategy to tackle the problem of the mixed distribution of sensor data. Key discriminative features of activities are captured through cross-modal spatio-temporal disentangled representation, and gradient modulation is combined to alleviate data heterogeneity. In addition, a wearable deployment simulation system is constructed. We conducted experiments on a large number of public datasets, demonstrating the effectiveness of the model.
zh

[CV-113] HyperFree: A Channel-adaptive and Tuning-free Foundation Model for Hyperspectral Remote Sensing Imagery CVPR2025

【速读】:本文旨在解决高光谱遥感图像高级解释任务中,由于其多通道特性导致现有视觉基础模型需针对每张图像进行单独微调的问题,这给硬件资源和时间带来了巨大压力。为应对这一挑战,论文提出了一种无需微调的高光谱基础模型HyperFree。其关键是通过适配现有的视觉提示工程,并设计了一个学习权重字典,覆盖从0.4到2.5 μm的全光谱范围,支持动态构建嵌入层,以处理不同数量的通道;同时引入基于特征距离的语义感知掩码生成机制,提升提示设计的灵活性。此外,HyperFree在大规模高分辨率高光谱图像上预训练后,仅使用单个提示便在5项任务和11个数据集上取得了与经过5次shot微调的专业模型相当的结果。

链接: https://arxiv.org/abs/2503.21841
作者: Jingtao Li,Yingyi Liu,Xinyu Wang,Yunning Peng,Chen Sun,Shaoyu Wang,Zhendong Sun,Tian Ke,Xiao Jiang,Tangwei Lu,Anran Zhao,Yanfei Zhong
机构: Wuhan University (武汉大学); Seoul National University (首尔国立大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by CVPR2025

点击查看摘要

Abstract:Advanced interpretation of hyperspectral remote sensing images benefits many precise Earth observation tasks. Recently, visual foundation models have promoted the remote sensing interpretation but concentrating on RGB and multispectral images. Due to the varied hyperspectral channels, existing foundation models would face image-by-image tuning situation, imposing great pressure on hardware and time resources. In this paper, we propose a tuning-free hyperspectral foundation model called HyperFree, by adapting the existing visual prompt engineering. To process varied channel numbers, we design a learned weight dictionary covering full-spectrum from 0.4 to 2.5 μm, supporting to build the embedding layer dynamically. To make the prompt design more tractable, HyperFree can generate multiple semantic-aware masks for one prompt by treating feature distance as semantic-similarity. After pre-training HyperFree on constructed large-scale high-resolution hyperspectral images, HyperFree (1 prompt) has shown comparable results with specialized models (5 shots) on 5 tasks and 11 datasets. Code and dataset are accessible at this https URL.
zh

[CV-114] M-DocSum: Do LVLMs Genuinely Comprehend Interleaved Image-Text in Document Summarization?

【速读】:该论文试图解决大型视觉语言模型(Large Vision-Language Models, LVLMs)在理解文档中交织图像与文本时是否真正具备相关能力的问题。现有文档理解基准通常通过问答格式评估LVLMs,但这种方式信息稀疏且难以保证对长距离依赖关系的全面覆盖。为了解决这一问题,论文引入了一个新的挑战性多模态文档摘要基准(Multimodal Document Summarization Benchmark, M-DocSum-Bench),包含500篇高质量arXiv论文及其与人类偏好对齐的交织多模态摘要。此基准是一个基于参考的生成任务,要求利用提供的参考图像生成交织的图像-文本摘要,从而同时评估复杂多模态文档场景下的理解、推理、定位和摘要能力。为支持该基准,研究者开发了一种自动摘要构建框架,并提出了名为M-DocEval的细粒度评估方法。此外,通过多样化指令和偏好数据的渐进两阶段训练,还开发出一个鲁棒的摘要基线模型M-DocSum-7B。论文结果显示,主流LVLMs在处理长且交织上下文时难以保持连贯性和准确整合信息,而M-DocSum-7B相较于更大规模的闭源模型表现出色,验证了LVLMs在改进交织图像-文本理解方面的潜力。关键在于创新性地设计了M-DocSum-Bench及其配套的评估与训练方法,以更全面地测试LVLMs的能力。

链接: https://arxiv.org/abs/2503.21839
作者: Haolong Yan,Kaijun Tan,Yeqing Shen,Xin Huang,Zheng Ge,Xiangyu Zhang,Si Li,Daxin Jiang
机构: BUPT(北京邮电大学); StepFun; Waseda University(早稻田大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:We investigate a critical yet under-explored question in Large Vision-Language Models (LVLMs): Do LVLMs genuinely comprehend interleaved image-text in the document? Existing document understanding benchmarks often assess LVLMs using question-answer formats, which are information-sparse and difficult to guarantee the coverage of long-range dependencies. To address this issue, we introduce a novel and challenging Multimodal Document Summarization Benchmark (M-DocSum-Bench), which comprises 500 high-quality arXiv papers, along with interleaved multimodal summaries aligned with human preferences. M-DocSum-Bench is a reference-based generation task and necessitates the generation of interleaved image-text summaries using provided reference images, thereby simultaneously evaluating capabilities in understanding, reasoning, localization, and summarization within complex multimodal document scenarios. To facilitate this benchmark, we develop an automated framework to construct summaries and propose a fine-grained evaluation method called M-DocEval. Moreover, we further develop a robust summarization baseline, i.e., M-DocSum-7B, by progressive two-stage training with diverse instruction and preference data. The extensive results on our M-DocSum-Bench reveal that the leading LVLMs struggle to maintain coherence and accurately integrate information within long and interleaved contexts, often exhibiting confusion between similar images and a lack of robustness. Notably, M-DocSum-7B achieves state-of-the-art performance compared to larger and closed-source models (including GPT-4o, Gemini Pro, Claude-3.5-Sonnet and Qwen2.5-VL-72B, etc.), demonstrating the potential of LVLMs for improved interleaved image-text understanding. The code, data, and models are available at this https URL.
zh

[CV-115] MedImage Technical Report

【速读】:该论文旨在解决通过传统方法难以高效检测染色体结构异常的问题。染色体核型分析在遗传病诊断中至关重要,但现有技术在检测结构性异常方面仍面临挑战。为应对这一问题,论文提出了解决方案的关键在于开发了一个名为iMedImage的端到端基础模型。该模型利用多模态医学影像整合的优势,实现了稳健的特征提取与精准诊断能力。其核心创新包括:(1) 针对多种模态输入和任务的统一表示方法;(2) 借助Chain of Thought (CoT)嵌入和Mixture of Experts (MoE)策略增强的多层次(病例级、图像级、补丁级)影像识别能力。通过这些关键技术,iMedImage实现了染色体分析的全自动化流程,包括分割、核型构建及异常检测,在包含多样数据来源的测试集中达到了92.75%的敏感性和91.5%的特异性,显著提升了染色体异常检测的性能。

链接: https://arxiv.org/abs/2503.21836
作者: Ran Wei,ZhiXiong Lan,Qing Yan,Ning Song,Ming Lv,LongQing Ye
机构: Hangzhou Diagens Biotechnology Co., Ltd. (杭州迪安生物技术有限公司)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Background: Chromosome karyotype analysis is crucial for diagnosing hereditary diseases, yet detecting structural abnormalities remains challenging. While AI has shown promise in medical imaging, its effectiveness varies across modalities. Leveraging advances in Foundation Models that integrate multimodal medical imaging for robust feature extraction and accurate diagnosis, we developed iMedImage, an end-to-end model for general medical image recognition, demonstrating strong performance across multiple imaging tasks, including chromosome abnormality detection. Materials and Methods: We constructed a comprehensive medical image dataset encompassing multiple modalities from common medical domains, including chromosome, cell, pathology, ultrasound, X-ray, CT, and MRI images. Based on this dataset, we developed the iMedImage model, which incorporates the following key features: (1) a unified representation method for diverse modality inputs and medical imaging tasks; (2) multi-level (case-level, image-level, patch-level) image recognition capabilities enhanced by Chain of Thought (CoT) embedding and Mixture of Experts (MoE) strategies. Results: The test set comprised data from 12 institutions across six regions in China, covering three mainstream scanning devices, and included naturally distributed, unscreened abnormal cases. On this diverse dataset, the model achieved a fully automated chromosome analysis workflow, including segmentation, karyotyping, and abnormality detection, reaching a sensitivity of 92.75% and a specificity of 91.5%. Conclusion: We propose iMedImage, an end-to-end foundation model for medical image analysis, demonstrating its superior performance across various medical imaging tasks. iMedImage provides clinicians with a precise imaging analysis tool and contributes to improving diagnostic accuracy and disease screening.
zh

[CV-116] A Multi-Modal Knowledge-Enhanced Framework for Vessel Trajectory Prediction

【速读】:该论文旨在解决船舶轨迹预测中因全球自动识别系统(AIS)数据采样时间间隔不规则及船舶运动复杂性所导致的模型学习与泛化困难的问题。论文提出了一种多模态知识增强框架(MAKER),其关键是通过两个核心模块提升预测性能:一是由大规模语言模型引导的知识迁移(LKT)模块,用于有效转移特定于轨迹的上下文知识以应对不规则采样;二是基于知识的自适应学习(KSL)模块,利用运动学知识在训练过程中逐步整合复杂模式,实现自适应学习与增强的泛化能力。实验结果表明,MAKER可将现有最先进方法的预测精度提升12.08%-17.86%。

链接: https://arxiv.org/abs/2503.21834
作者: Haomin Yu,Tianyi Li,Kristian Torp,Christian S. Jensen
机构: Aalborg University (奥尔堡大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 8 pages, 5 figures

点击查看摘要

Abstract:Accurate vessel trajectory prediction facilitates improved navigational safety, routing, and environmental protection. However, existing prediction methods are challenged by the irregular sampling time intervals of the vessel tracking data from the global AIS system and the complexity of vessel movement. These aspects render model learning and generalization difficult. To address these challenges and improve vessel trajectory prediction, we propose the multi-modal knowledge-enhanced framework (MAKER) for vessel trajectory prediction. To contend better with the irregular sampling time intervals, MAKER features a Large language model-guided Knowledge Transfer (LKT) module that leverages pre-trained language models to transfer trajectory-specific contextual knowledge effectively. To enhance the ability to learn complex trajectory patterns, MAKER incorporates a Knowledge-based Self-paced Learning (KSL) module. This module employs kinematic knowledge to progressively integrate complex patterns during training, allowing for adaptive learning and enhanced generalization. Experimental results on two vessel trajectory datasets show that MAKER can improve the prediction accuracy of state-of-the-art methods by 12.08%-17.86%.
zh

[CV-117] Shape Generation via Weight Space Learning

【速读】:该论文旨在解决利用基础模型中的几何先验(geometric priors)进行下游任务时面临的挑战,尤其是在真实世界数据稀缺或嘈杂的情况下,传统微调方法可能导致灾难性遗忘(catastrophic forgetting)的问题。论文的关键在于将大型3D形状生成模型的权重空间(weight space)视为一种可以直接探索的数据模态(data modality),并通过在该高维权重空间中发现子流形(submanifolds),分别调节拓扑属性或细粒度部分特征。实验结果表明,权重空间中的小范围变化能够显著影响全局连通性(global connectivity)和局部几何特性(local geometry),从而为3D形状生成和特定领域的微调提供新的方法路径。

链接: https://arxiv.org/abs/2503.21830
作者: Maximilian Plattner,Arturs Berzins,Johannes Brandstetter
机构: ELLIS Unit, LIT AI Lab, Institute for Machine Learning, JKU Linz (约翰·开普勒大学林茨), Austria; Emmi AI GmbH (艾米人工智能有限公司), Linz, Austria
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Foundation models for 3D shape generation have recently shown a remarkable capacity to encode rich geometric priors across both global and local dimensions. However, leveraging these priors for downstream tasks can be challenging as real-world data are often scarce or noisy, and traditional fine-tuning can lead to catastrophic forgetting. In this work, we treat the weight space of a large 3D shape-generative model as a data modality that can be explored directly. We hypothesize that submanifolds within this high-dimensional weight space can modulate topological properties or fine-grained part features separately, demonstrating early-stage evidence via two experiments. First, we observe a sharp phase transition in global connectivity when interpolating in conditioning space, suggesting that small changes in weight space can drastically alter topology. Second, we show that low-dimensional reparameterizations yield controlled local geometry changes even with very limited data. These results highlight the potential of weight space learning to unlock new approaches for 3D shape generation and specialized fine-tuning.
zh

[CV-118] Hybrid Multi-Stage Learning Framework for Edge Detection: A Survey

【速读】:该论文致力于解决计算机视觉中边缘检测在复杂场景(如光照变化、噪声干扰)下的基本挑战。论文的关键在于提出了一种混合多阶段学习框架(Hybrid Multi-Stage Learning Framework),通过将卷积神经网络(CNN)特征提取与支持向量机(SVM)分类器相结合,提升边缘定位和结构准确性。与端到端深度学习模型不同,该方法解耦了特征表示和分类阶段,增强了鲁棒性和可解释性。

链接: https://arxiv.org/abs/2503.21827
作者: Mark Phil Pacot,Jayno Juventud,Gleen Dalaorao
机构: College of Computing and Information Sciences, Caraga State University (卡加延州立大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Edge detection remains a fundamental yet challenging task in computer vision, especially under varying illumination, noise, and complex scene conditions. This paper introduces a Hybrid Multi-Stage Learning Framework that integrates Convolutional Neural Network (CNN) feature extraction with a Support Vector Machine (SVM) classifier to improve edge localization and structural accuracy. Unlike conventional end-to-end deep learning models, our approach decouples feature representation and classification stages, enhancing robustness and interpretability. Extensive experiments conducted on benchmark datasets such as BSDS500 and NYUDv2 demonstrate that the proposed framework outperforms traditional edge detectors and even recent learning-based methods in terms of Optimal Dataset Scale (ODS) and Optimal Image Scale (OIS), while maintaining competitive Average Precision (AP). Both qualitative and quantitative results highlight enhanced performance on edge continuity, noise suppression, and perceptual clarity achieved by our method. This work not only bridges classical and deep learning paradigms but also sets a new direction for scalable, interpretable, and high-quality edge detection solutions.
zh
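
【代码示意】: 下面示意“CNN 特征提取 + SVM 分类”的两阶段解耦流程(假设性实现,特征网络为随机初始化的小型卷积层,实践中应替换为预训练 CNN 的浅层输出;玩具图像与标签仅用于演示流程):

```python
import numpy as np
import torch
import torch.nn as nn
from sklearn.svm import SVC

# 假设性的特征提取器:实践中可换成预训练 CNN(如 ResNet)的浅层输出
feature_net = nn.Sequential(
    nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
    nn.Conv2d(16, 16, 3, padding=1), nn.ReLU(),
).eval()

def pixel_features(img):
    """img: (H, W) 灰度图 -> (H*W, 16) 逐像素特征。"""
    with torch.no_grad():
        x = torch.from_numpy(img).float()[None, None]
        feat = feature_net(x)[0]                          # (16, H, W)
    return feat.permute(1, 2, 0).reshape(-1, 16).numpy()

# 构造一幅带竖直阶跃边缘的玩具图像及其边缘标签
img = np.zeros((32, 32), dtype=np.float32)
img[:, 16:] = 1.0
labels = np.zeros((32, 32), dtype=np.int64)
labels[:, 15:17] = 1                                      # 边缘像素
X, y = pixel_features(img), labels.reshape(-1)

svm = SVC(kernel="rbf", class_weight="balanced").fit(X, y)   # 第二阶段:SVM 分类
pred = svm.predict(X).reshape(32, 32)
print("预测出的边缘像素数:", int(pred.sum()))
```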

[CV-119] Protecting Your Video Content: Disrupting Automated Video-based LLM Annotations CVPR2025

【速读】:该论文旨在解决视频数据在基于视频的大语言模型(video-based LLMs)自动化标注过程中面临的隐私和安全问题,特别是未经授权使用个人视频数据可能导致的下游任务性能提升风险。论文的关键解决方案是提出两种带有不可察觉对抗扰动的保护性视频水印——Ramblings和Mutes。Ramblings通过误导模型生成与视频内容不一致的字幕来降低视频标注质量,而Mutes则促使模型生成极简且缺乏描述性的字幕。这两种方法均展示了良好的隐蔽性和鲁棒性,有效减少了多种基于视频的LLMs的视频标注性能,从而有效地保护了个人视频数据免受未经授权的利用。

链接: https://arxiv.org/abs/2503.21824
作者: Haitong Liu,Kuofeng Gao,Yang Bai,Jinmin Li,Jinxiao Shan,Tao Dai,Shu-Tao Xia
机构: Tsinghua University (清华大学); ByteDance (字节跳动); Digital Center, China Merchants Group Limited (招商局集团数字中心); Shenzhen University (深圳大学); Peng Cheng Laboratory (鹏城实验室)
类目: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR)
备注: Accepted by CVPR 2025

点击查看摘要

Abstract:Recently, video-based large language models (video-based LLMs) have achieved impressive performance across various video comprehension tasks. However, this rapid advancement raises significant privacy and security concerns, particularly regarding the unauthorized use of personal video data in automated annotation by video-based LLMs. These unauthorized annotated video-text pairs can then be used to improve the performance of downstream tasks, such as text-to-video generation. To safeguard personal videos from unauthorized use, we propose two series of protective video watermarks with imperceptible adversarial perturbations, named Ramblings and Mutes. Concretely, Ramblings aim to mislead video-based LLMs into generating inaccurate captions for the videos, thereby degrading the quality of video annotations through inconsistencies between video content and captions. Mutes, on the other hand, are designed to prompt video-based LLMs to produce exceptionally brief captions, lacking descriptive detail. Extensive experiments demonstrate that our video watermarking methods effectively protect video data by significantly reducing video annotation performance across various video-based LLMs, showcasing both stealthiness and robustness in protecting personal video content. Our code is available at this https URL.
zh

[CV-120] Low-Rank Adaptation of Pre-Trained Stable Diffusion for Rigid-Body Target ISAR Imaging

【速读】:本文旨在解决传统基于范围瞬时多普勒(Range-Instantaneous Doppler, RID)方法在刚性目标成像中因时间频率分析(Time-Frequency Analysis, TFA)局限性而导致的分辨率较低的问题。为应对这一挑战,研究的核心在于从低分辨率的时间频率表示(Time-Frequency Representations, TFRs)中获取高分辨率的TFRs。鉴于TFRs的曲线特征属于特定类型的纹理特征,论文提出利用预训练生成模型如Stable Diffusion(SD),因其强大的纹理表征捕捉能力,能够有效增强TFRs。关键解决方案是提出了一种新的逆合成孔径雷达(Inverse Synthetic Aperture Radar, ISAR)成像方法,通过结合预训练SD模型的低秩适应(Low-Rank Adaptation, LoRA)实现。该方法在保留SD Turbo的基本结构和预训练参数的基础上,引入额外的线性操作以支持LoRA和对抗训练,从而实现超分辨率和噪声抑制。最终将LoRA-SD集成到基于RID的ISAR成像中,实现了清晰聚焦且去噪的高分辨率成像。实验结果表明,与传统方法相比,所提方法在频率估计和ISAR成像方面具有显著优势,并通过模拟雷达数据训练和实测雷达数据测试验证了其泛化能力。

链接: https://arxiv.org/abs/2503.21823
作者: Boan Zhang,Hang Dong,Jiongge Zhang,Long Tian,Rongrong Wang,Zhenhua Wu,Xiyang Liu,Hongwei Liu
机构: Xidian University (西安电子科技大学); Anhui University (安徽大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 4 pages, IGARSS 2025

点击查看摘要

Abstract:Traditional range-instantaneous Doppler (RID) methods for rigid-body target imaging often suffer from low resolution due to the limitations of time-frequency analysis (TFA). To address this challenge, our primary focus is on obtaining high resolution time-frequency representations (TFRs) from their low resolution counterparts. Recognizing that the curve features of TFRs are a specific type of texture feature, we argue that pre-trained generative models such as Stable Diffusion (SD) are well suited for enhancing TFRs, thanks to their powerful capability in capturing texture representations. Building on this insight, we propose a novel inverse synthetic aperture radar (ISAR) imaging method for rigid-body targets, leveraging the low-rank adaptation (LoRA) of a pre-trained SD model. Our approach adopts the basic structure and pre-trained parameters of SD Turbo while incorporating additional linear operations for LoRA and adversarial training to achieve super-resolution and noise suppression. Then we integrate LoRA-SD into the RID-based ISAR imaging, enabling sharply focused and denoised imaging with super-resolution capabilities. We evaluate our method using both simulated and real radar data. The experimental results demonstrate the superiority of our approach in frequency estimation and ISAR imaging compared to traditional methods. Notably, the generalization capability is verified by training on simulated radar data and testing on measured radar data.
zh
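
【代码示意】: 上文的关键组件之一是低秩适应(LoRA)。下面给出 LoRA 作用于单个线性层的通用最小示意(PyTorch,秩与缩放系数为演示取值,并非论文 LoRA-SD 的实现):冻结原权重 W,仅训练低秩增量 BA,使输出为 Wx + (α/r)·BAx。

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """W 冻结,仅训练低秩增量 B @ A(秩 r 远小于 min(in, out))。"""
    def __init__(self, base: nn.Linear, r: int = 4, alpha: float = 8.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)                        # 冻结预训练权重
        self.lora_A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, r))
        self.scaling = alpha / r

    def forward(self, x):
        return self.base(x) + self.scaling * (x @ self.lora_A.T @ self.lora_B.T)

layer = LoRALinear(nn.Linear(64, 64))
out = layer(torch.randn(2, 64))
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(out.shape, "可训练参数:", trainable)                # 远小于全量微调的 64*64+64
```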

[CV-121] UFM: Unified Feature Matching Pre-training with Multi-Modal Image Assistants

【速读】:该论文旨在解决跨模态图像应用中的特征匹配难题,这一问题因其对特定数据集的复杂训练需求而长期具有挑战性。论文提出的关键解决方案是设计了一种名为统一特征匹配预训练模型(Unified Feature Matching, UFM)的方法,其核心在于引入多模态图像辅助(Multimodal Image Assistant, MIA)变换器,这是一种高度可调的结构,能够有效应对多样化的特征匹配任务。此外,论文还提出了数据增强算法和分阶段预训练策略,以应对特定模态下数据稀疏及模态间数据分布不平衡带来的挑战。实验结果表明,UFM 在多种特征匹配任务中表现出色的泛化能力和性能。

链接: https://arxiv.org/abs/2503.21820
作者: Yide Di,Yun Liao,Hao Zhou,Kaijun Zhu,Qing Duan,Junhui Liu,Mingyu Lu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注: 34 pages, 13 figures

点击查看摘要

Abstract:Image feature matching, a foundational task in computer vision, remains challenging for multimodal image applications, often necessitating intricate training on specific datasets. In this paper, we introduce a Unified Feature Matching pre-trained model (UFM) designed to address feature matching challenges across a wide spectrum of modal images. We present Multimodal Image Assistant (MIA) transformers, finely tunable structures adept at handling diverse feature matching problems. UFM exhibits versatility in addressing both feature matching tasks within the same modal and those across different modals. Additionally, we propose a data augmentation algorithm and a staged pre-training strategy to effectively tackle challenges arising from sparse data in specific modals and imbalanced modal datasets. Experimental results demonstrate that UFM excels in generalization and performance across various feature matching tasks. The code will be released at:this https URL.
zh

[CV-122] Skip-Vision: A Comprehensive Framework for Accelerating Vision-Language Models

【速读】:该论文旨在解决基于 Transformer 的多模态大语言模型(Multimodal Large Language Models, MLLMs)在扩展分辨率、训练数据和模型参数时计算成本急剧增加的问题。其核心瓶颈在于细粒度图像理解所需的视觉标记数量激增。为了解决这一问题,论文提出了一种名为 Skip-Vision 的统一框架,旨在同时提升视觉-语言模型的训练和推理效率。Skip-Vision 的关键创新在于引入了两种互补的加速策略:一是通过 Skip-FFN 策略,在训练阶段跳过冗余视觉标记上的前馈网络(Feed-Forward Network, FFN)计算;二是设计了一种选择性 KV 缓存移除机制,在推理阶段裁剪被跳过的键值对,同时保持模型性能。实验结果表明,Skip-Vision 可将训练时间减少多达 35%,推理浮点运算量(FLOPs)降低 75%,延迟减少 45%,同时实现与现有方法相当或更优的性能。

链接: https://arxiv.org/abs/2503.21817
作者: Weili Zeng,Ziyuan Huang,Kaixiang Ji,Yichao Yan
机构: MoE Key Lab of Artificial Intelligence, AI Institute, Shanghai Jiao Tong University (教育部人工智能重点实验室,上海交通大学); Ant Group (蚂蚁集团)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Transformer-based models have driven significant advancements in Multimodal Large Language Models (MLLMs), yet their computational costs surge drastically when scaling resolution, training data, and model parameters. A key bottleneck stems from the proliferation of visual tokens required for fine-grained image understanding. We propose Skip-Vision, a unified framework addressing both training and inference inefficiencies in vision-language models. On top of conventional token compression approaches, our method introduces two complementary acceleration strategies. For training acceleration, we observe that Feed-Forward Network (FFN) computations on visual tokens induce marginal feature updates. This motivates our Skip-FFN strategy, which bypasses FFN layers for redundant visual tokens. For inference acceleration, we design a selective KV-cache removal mechanism that prunes the skipped key-value pairs during decoding while preserving model performance. Experimental results demonstrate that Skip-Vision reduces training time by up to 35%, inference FLOPs by 75%, and latency by 45%, while achieving comparable or superior performance to existing methods. Our work provides a practical solution for scaling high-performance MLLMs with enhanced efficiency.
zh
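
【代码示意】: 针对 Skip-FFN“对冗余视觉 token 跳过 FFN 计算”的思想,下面给出一个与具体模型无关的示意(假设性实现,keep_mask 的来源即冗余度判定不在演示范围内):仅对保留的 token 计算 FFN,其余 token 走恒等旁路。

```python
import torch
import torch.nn as nn

class SkipFFN(nn.Module):
    """仅对 keep_mask 为 True 的 token 计算 FFN,其余走恒等旁路。"""
    def __init__(self, dim=256, hidden=1024):
        super().__init__()
        self.ffn = nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))

    def forward(self, tokens, keep_mask):
        # tokens: (B, N, D); keep_mask: (B, N) 布尔张量
        out = tokens.clone()
        kept = tokens[keep_mask]                  # 仅挑出需要计算的 token
        out[keep_mask] = kept + self.ffn(kept)    # 残差 + FFN
        return out                                # 被跳过的 token 原样保留

block = SkipFFN()
tokens = torch.randn(2, 100, 256)
# 假设:按某种冗余度打分后,每个样本仅保留前 30 个视觉 token 进入 FFN
keep = torch.zeros(2, 100, dtype=torch.bool)
keep[:, :30] = True
print(block(tokens, keep).shape)
```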

[CV-123] Mamba-3D as Masked Autoencoders for Accurate and Data-Efficient Analysis of Medical Ultrasound Videos

【速读】:该论文致力于解决超声视频分析中因标注数据稀缺和视频数据固有挑战导致的相关方法发展受阻的问题。为应对这些挑战,论文提出了一种名为E-ViM³的数据高效视觉模型,其关键在于通过Enclosure Global Tokens (EGT)有效捕捉和聚合全局特征,并通过Spatial-Temporal Chained (STC)掩码策略实现自监督预训练,以增强时空相关性的建模能力。实验结果表明,E-ViM³在多种数据规模的数据集上达到了当前最先进的性能,尤其在有限标注数据条件下表现出色,凸显了其在临床实际应用中的潜力。

链接: https://arxiv.org/abs/2503.20258
作者: Jiaheng Zhou,Yanfeng Zhou,Wei Fang,Yuxing Tang,Le Lu,Ge Yang
机构: Institute of Automation, Chinese Academy of Sciences (自动化研究所,中国科学院); School of Artificial Intelligence, University of Chinese Academy of Sciences (人工智能学院,中国科学院大学); DAMO Academy, Alibaba Group (阿里达摩院); Hupan Laboratory, 310023, Hangzhou, China (湖畔实验室,浙江杭州)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Ultrasound videos are an important form of clinical imaging data, and deep learning-based automated analysis can improve diagnostic accuracy and clinical efficiency. However, the scarcity of labeled data and the inherent challenges of video analysis have impeded the advancement of related methods. In this work, we introduce E-ViM^3, a data-efficient Vision Mamba network that preserves the 3D structure of video data, enhancing long-range dependencies and inductive biases to better model space-time correlations. With our design of Enclosure Global Tokens (EGT), the model captures and aggregates global features more effectively than competing methods. To further improve data efficiency, we employ masked video modeling for self-supervised pre-training, with the proposed Spatial-Temporal Chained (STC) masking strategy designed to adapt to various video scenarios. Experiments demonstrate that E-ViM^3 performs as the state-of-the-art in two high-level semantic analysis tasks across four datasets of varying sizes: EchoNet-Dynamic, CAMUS, MICCAI-BUV, and WHBUS. Furthermore, our model achieves competitive performance with limited labels, highlighting its potential impact on real-world clinical applications.
zh

[CV-124] Evaluation of Machine-generated Biomedical Images via A Tally-based Similarity Measure

【速读】:该论文试图解决在医学等任务关键场景中,难以定量且权威地评估机器学习生成图像质量的问题。解决方案的关键在于提出使用Tversky指数(Tversky Index)来衡量生成图像的质量,这是一种基于感知相似性的成熟度量方法,能够通过相对比较而非绝对差异量化来实现有意义的评估。论文通过实证研究验证了这一方法,并指出相比于基于深度特征空间距离汇总的传统方法,Tversky方法能够在考虑特征编码选择的主观性和固有缺陷时提供更直观的结果。

链接: https://arxiv.org/abs/2503.22658
作者: Frank J. Brooks,Rucha Deshpande
机构: Center for Label-Free Imaging and Multiscale Biophotonics, University of Illinois Urbana-Champaign (伊利诺伊大学厄巴纳-香槟分校); Department of Bioengineering, University of Illinois at Urbana–Champaign (伊利诺伊大学厄巴纳-香槟分校)
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 13 pages. Manuscript under review at IEEE. Data available at this https URL

点击查看摘要

Abstract:Super-resolution, in-painting, whole-image generation, unpaired style-transfer, and network-constrained image reconstruction each include an aspect of machine-learned image synthesis where the actual ground truth is not known at time of use. It is generally difficult to quantitatively and authoritatively evaluate the quality of synthetic images; however, in mission-critical biomedical scenarios robust evaluation is paramount. In this work, all practical image-to-image comparisons really are relative qualifications, not absolute difference quantifications; and, therefore, meaningful evaluation of generated image quality can be accomplished using the Tversky Index, which is a well-established measure for assessing perceptual similarity. This evaluation procedure is developed and then demonstrated using multiple image data sets, both real and simulated. The main result is that when the subjectivity and intrinsic deficiencies of any feature-encoding choice are put upfront, Tversky’s method leads to intuitive results, whereas traditional methods based on summarizing distances in deep feature spaces do not.
zh
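
【代码示意】: Tversky 指数本身是一个标准的集合相似度度量。下面给出其通用实现示意(“特征集合”为假设的抽象编号,与论文的特征编码流程无关):S(A,B) = |A∩B| / (|A∩B| + α|A\B| + β|B\A|),当 α=β=0.5 时等价于 Dice 系数,α=β=1 时等价于 Jaccard/IoU。

```python
def tversky_index(a, b, alpha=0.5, beta=0.5):
    """a, b: 两个特征集合(例如两幅图像各自命中的感知特征编号)。"""
    a, b = set(a), set(b)
    common = len(a & b)
    only_a = len(a - b)
    only_b = len(b - a)
    denom = common + alpha * only_a + beta * only_b
    return common / denom if denom else 1.0

# 示例:把“图像特征”抽象为已检出的特征编号集合
real_feats = {1, 2, 3, 5, 8}
synth_feats = {1, 2, 3, 7}
print("Tversky(α=β=0.5, 即 Dice):", tversky_index(real_feats, synth_feats))
print("Tversky(α=β=1, 即 Jaccard):", tversky_index(real_feats, synth_feats, 1, 1))
```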

[CV-125] KEVS: Enhancing Segmentation of Visceral Adipose Tissue in Pre-Cystectomy CT with Gaussian Kernel Density Estimation

【速读】:该论文旨在解决腹腔镜膀胱切除术(cystectomy)患者腹部CT扫描中内脏脂肪组织(visceral adipose tissue, VAT)分割存在的两个主要问题:一是基于强度阈值的传统方法存在观察者间变异性(inter-observer variability)的问题;二是缺乏精确的VAT真实标签(ground-truth masks),限制了深度学习(deep learning, DL)模型的开发与应用。论文提出的关键解决方案是引入一种名为Kernel密度增强型VAT分割器(Kernel density Enhanced VAT Segmentator, KEVS)的新方法。KEVS通过结合DL语义分割模型进行多体特征预测,并利用高斯核密度估计分析预测的皮下脂肪组织来实现对CT扫描中特定腹部VAT的准确预测,且无需依赖真实标签VAT掩膜即可完成训练,从而有效克服了上述限制。

链接: https://arxiv.org/abs/2503.22592
作者: Thomas Boucher,Nicholas Tetlow,Annie Fung,Amy Dewar,Pietro Arina,Sven Kerneis,John Whittle,Evangelos B. Mazomenos
机构: University College London (伦敦大学学院); University College London (伦敦大学学院)
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: Preprint for submission to IPCAI special edition of IJCARS 2025, version prior to any peer review

点击查看摘要

Abstract:Purpose: The distribution of visceral adipose tissue (VAT) in cystectomy patients is indicative of the incidence of post-operative complications. Existing VAT segmentation methods for computed tomography (CT) employing intensity thresholding have limitations relating to inter-observer variability. Moreover, the difficulty in creating ground-truth masks limits the development of deep learning (DL) models for this task. This paper introduces a novel method for VAT prediction in pre-cystectomy CT, which is fully automated and does not require ground-truth VAT masks for training, overcoming aforementioned limitations. Methods: We introduce the Kernel density Enhanced VAT Segmentator ( KEVS), combining a DL semantic segmentation model, for multi-body feature prediction, with Gaussian kernel density estimation analysis of predicted subcutaneous adipose tissue to achieve accurate scan-specific predictions of VAT in the abdominal cavity. Uniquely for a DL pipeline, KEVS does not require ground-truth VAT masks. Results: We verify the ability of KEVS to accurately segment abdominal organs in unseen CT data and compare KEVS VAT segmentation predictions to existing state-of-the-art (SOTA) approaches in a dataset of 20 pre-cystectomy CT scans, collected from University College London Hospital (UCLH-Cyst), with expert ground-truth annotations. KEVS presents a 4.80% and 6.02% improvement in Dice Coefficient over the second best DL and thresholding-based VAT segmentation techniques respectively when evaluated on UCLH-Cyst. Conclusion: This research introduces KEVS; an automated, SOTA method for the prediction of VAT in pre-cystectomy CT which eliminates inter-observer variability and is trained entirely on open-source CT datasets which do not contain ground-truth VAT masks.
zh
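
【代码示意】: 下面示意 KEVS 中“对皮下脂肪强度做高斯核密度估计、再据此为单次扫描自适应地确定脂肪强度区间”的核心想法(假设性简化,数据为模拟 HU 值,区间取法仅作演示,并非论文的判定规则):

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
# 假设:分割模型已给出皮下脂肪(SAT)区域,这里用模拟的 CT HU 值代替
sat_hu = rng.normal(loc=-100, scale=20, size=5000)        # 脂肪典型 HU 约 -100

kde = gaussian_kde(sat_hu)                                 # 对 SAT 强度做核密度估计
grid = np.linspace(-300, 100, 2000)
density = kde(grid)

# 示意性的做法:取覆盖 95% 概率质量的强度区间作为该扫描的脂肪 HU 范围,
# 再在腹腔内用该区间筛选体素得到 VAT(论文中的具体规则请以原文为准)
cdf = np.cumsum(density)
cdf /= cdf[-1]
lo = grid[np.searchsorted(cdf, 0.025)]
hi = grid[np.searchsorted(cdf, 0.975)]
print(f"该扫描自适应的脂肪强度区间: [{lo:.1f}, {hi:.1f}] HU")

abdominal_voxels = rng.normal(loc=-40, scale=80, size=10000)   # 模拟腹腔体素
vat_mask = (abdominal_voxels >= lo) & (abdominal_voxels <= hi)
print("被判为 VAT 的体素比例:", float(vat_mask.mean()))
```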

[CV-126] RELD: Regularization by Latent Diffusion Models for Image Restoration

【速读】:该论文旨在解决图像处理中的去噪(denoising)、去模糊(deblurring)及超分辨率(super-resolution)任务,同时追求在保证高质量结果的同时降低计算成本。论文的关键创新在于提出了一种名为“正则化通过潜在去噪”(Regularization by Latent Denoising, RELD)的策略。该方案的核心是将一个经过去噪任务训练的潜在扩散模型(Latent Diffusion Model),通过半二次分裂(Half-Quadratic Splitting)技术整合到变分框架中,利用其正则化特性。这种设计不仅能够满足多种成像应用的条件,还能够在实现高感知质量的同时显著减少计算开销。

链接: https://arxiv.org/abs/2503.22563
作者: Pasquale Cascarano,Lorenzo Stacchio,Andrea Sebastiani,Alessandro Benfenati,Ulugbek S. Kamilov,Gustavo Marfia
机构: Department of the Arts, University of Bologna (博洛尼亚大学); Department of Political Sciences, Communication and International Relations, University of Macerata (马切拉塔大学); Department of Physics, Informatics and Mathematics, University of Modena and Reggio Emilia (摩德纳和雷焦艾米利亚大学); Department of Environmental and Science Policy, University of Milan (米兰大学); Department of Computer Science & Engineering and Department of Electrical & System Engineering, Washington University in St. Louis (圣路易斯华盛顿大学); Department of the Arts, University of Bologna (博洛尼亚大学)
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:In recent years, Diffusion Models have become the new state-of-the-art in deep generative modeling, ending the long-time dominance of Generative Adversarial Networks. Inspired by the Regularization by Denoising principle, we introduce an approach that integrates a Latent Diffusion Model, trained for the denoising task, into a variational framework using Half-Quadratic Splitting, exploiting its regularization properties. This approach, under appropriate conditions that can be easily met in various imaging applications, allows for reduced computational cost while achieving high-quality results. The proposed strategy, called Regularization by Latent Denoising (RELD), is then tested on a dataset of natural images, for image denoising, deblurring, and super-resolution tasks. The numerical experiments show that RELD is competitive with other state-of-the-art methods, particularly achieving remarkable results when evaluated using perceptual quality metrics.
zh
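
【代码示意】: RELD 属于“用去噪器充当正则项”的一类方法。下面给出半二次分裂(HQS)迭代的通用示意(假设性代码,用简单高斯滤波充当占位去噪器,论文中则为潜在扩散模型):交替求解数据一致性子问题(频域闭式解)与去噪子问题。

```python
import numpy as np
from numpy.fft import fft2, ifft2
from scipy.ndimage import gaussian_filter

def hqs_deblur(y, kernel, denoiser, mu=0.05, iters=20):
    """HQS:x 子问题有频域闭式解,z 子问题交给任意去噪器(此处为占位)。"""
    K = fft2(np.fft.ifftshift(kernel), s=y.shape)
    Y = fft2(y)
    z = y.copy()
    for _ in range(iters):
        # 数据项子问题: argmin_x ||k*x - y||^2 + mu ||x - z||^2
        X = (np.conj(K) * Y + mu * fft2(z)) / (np.abs(K) ** 2 + mu)
        x = np.real(ifft2(X))
        # 正则项子问题: 用去噪器替代显式先验(RELD 中为潜在扩散去噪)
        z = denoiser(x)
    return x

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    gt = np.zeros((64, 64)); gt[20:44, 20:44] = 1.0
    psf = np.zeros((64, 64)); psf[30:35, 30:35] = 1 / 25       # 5x5 均值模糊核
    blurred = np.real(ifft2(fft2(gt) * fft2(np.fft.ifftshift(psf)))) \
        + 0.01 * rng.standard_normal((64, 64))
    simple_denoiser = lambda img: gaussian_filter(img, sigma=0.8)  # 占位去噪器
    restored = hqs_deblur(blurred, psf, simple_denoiser)
    print("重建误差:", float(np.mean((restored - gt) ** 2)))
```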

[CV-127] Deterministic Medical Image Translation via High-fidelity Brownian Bridges

【速读】:该论文旨在解决扩散模型在生成医学图像翻译中的非确定性以及与真实图像高保真度不足的问题。论文提出了一种新颖的高保真布朗桥模型(HiFi-BBrg),其关键在于结合生成映射与重建映射的双映射结构,并通过布朗桥训练过程中的保真损失(fidelity loss)和对抗训练指导重建映射,确保翻译后的图像能够被精确逆向还原至原始形式,从而实现对真实图像的高保真一致翻译。实验结果表明,HiFi-BBrg 在多模态图像翻译和多图像超分辨率任务中优于现有最先进的方法。

链接: https://arxiv.org/abs/2503.22531
作者: Qisheng He,Nicholas Summerfield,Peiyong Wang,Carri Glide-Hurst,Ming Dong
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Recent studies have shown that diffusion models produce superior synthetic images when compared to Generative Adversarial Networks (GANs). However, their outputs are often non-deterministic and lack high fidelity to the ground truth due to the inherent randomness. In this paper, we propose a novel High-fidelity Brownian bridge model (HiFi-BBrg) for deterministic medical image translations. Our model comprises two distinct yet mutually beneficial mappings: a generation mapping and a reconstruction mapping. The Brownian bridge training process is guided by the fidelity loss and adversarial training in the reconstruction mapping. This ensures that translated images can be accurately reversed to their original forms, thereby achieving consistent translations with high fidelity to the ground truth. Our extensive experiments on multiple datasets show HiFi-BBrg outperforms state-of-the-art methods in multi-modal image translation and multi-image super-resolution.
zh
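
【代码示意】: 布朗桥把扩散过程的两端分别固定在源图像与目标图像上。下面是标准布朗桥前向采样的最小示意(假设性代码,仅展示 x_t 的构造,训练目标与保真损失等论文细节不在此列):x_t = (1-t/T)·x_0 + (t/T)·y + σ·sqrt(t(T-t)/T)·ε。

```python
import numpy as np

def brownian_bridge_sample(x0, y, t, T=1.0, sigma=0.2, rng=None):
    """在时刻 t 采样布朗桥状态:t=0 与 t=T 两端分别固定为 x0 与 y。"""
    rng = rng or np.random.default_rng()
    m = t / T
    mean = (1.0 - m) * x0 + m * y                  # 两端线性插值的均值
    std = sigma * np.sqrt(t * (T - t) / T)         # 中间时刻方差最大,两端为 0
    return mean + std * rng.standard_normal(x0.shape)

rng = np.random.default_rng(0)
src = np.zeros((8, 8))        # 源模态图像(此处用常数阵代替)
tgt = np.ones((8, 8))         # 目标模态图像
for t in (0.0, 0.5, 1.0):
    xt = brownian_bridge_sample(src, tgt, t, rng=rng)
    print(f"t={t:.1f}  mean={xt.mean():.3f}  std={xt.std():.3f}")
```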

[CV-128] Efficient Epistemic Uncertainty Estimation in Cerebrovascular Segmentation

【速读】:该论文旨在解决脑血管疾病诊断中磁共振(MR)扫描血管分割的可信度不足问题。传统深度学习模型因复杂度高且缺乏决策可靠性指示而被认为不够可靠。为提升基于深度学习模型的信任度,论文首次将认识论不确定性量化引入脑血管分割模型,并通过高效集成模型结合贝叶斯近似与深度集成的优势,克服了传统概率网络的高计算成本。关键在于通过实现一种能够对高模型不确定性和错误预测区域进行对齐的集成方法,验证了该方法的有效性和可靠性。此外,论文通过在分布外(OOD)数据上的广泛实验表明,对于OOD图像,估计的不确定性增加,且忽略高度不确定区域可提高分割质量,从而证明了该集成模型不仅在分布内数据上有效,也能可靠地解释其局限性,适用于临床应用。

链接: https://arxiv.org/abs/2503.22271
作者: Omini Rathore,Richard Paul,Abigail Morrison,Hanno Scharr,Elisabeth Pfaehler
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Brain vessel segmentation of MR scans is a critical step in the diagnosis of cerebrovascular diseases. Due to the fine vessel structure, manual vessel segmentation is time consuming. Therefore, automatic deep learning (DL) based segmentation techniques are intensively investigated. As conventional DL models yield a high complexity and lack an indication of decision reliability, they are often considered as not trustworthy. This work aims to increase trust in DL based models by incorporating epistemic uncertainty quantification into cerebrovascular segmentation models for the first time. By implementing an efficient ensemble model combining the advantages of Bayesian Approximation and Deep Ensembles, we aim to overcome the high computational costs of conventional probabilistic networks. Areas of high model uncertainty and erroneous predictions are aligned which demonstrates the effectiveness and reliability of the approach. We perform extensive experiments applying the ensemble model on out-of-distribution (OOD) data. We demonstrate that for OOD-images, the estimated uncertainty increases. Additionally, omitting highly uncertain areas improves the segmentation quality, both for in- and out-of-distribution data. The ensemble model explains its limitations in a reliable manner and can maintain trustworthiness also for OOD data and could be considered in clinical applications
zh
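
【代码示意】: 下面示意“集成模型 + 预测熵”这一常见的认识论不确定性估计方式,并演示文中“剔除高不确定区域可提升分割质量”的做法(假设性代码,集成成员的概率图用模拟数据代替,阈值为演示取值):

```python
import numpy as np

def ensemble_uncertainty(prob_maps):
    """prob_maps: (M, H, W),M 个成员给出的前景概率。
    返回平均概率与预测熵(作为认识论不确定性的近似)。"""
    mean_p = prob_maps.mean(axis=0)
    eps = 1e-8
    entropy = -(mean_p * np.log(mean_p + eps) + (1 - mean_p) * np.log(1 - mean_p + eps))
    return mean_p, entropy

rng = np.random.default_rng(0)
# 模拟 5 个集成成员对一幅 64x64 图的血管前景概率:
# 左半部分成员间意见一致(低不确定性),右半部分意见分歧(高不确定性)
members = np.clip(np.concatenate(
    [np.full((5, 64, 32), 0.9) + 0.02 * rng.standard_normal((5, 64, 32)),
     rng.random((5, 64, 32))], axis=2), 0, 1)

mean_p, entropy = ensemble_uncertainty(members)
mask = mean_p > 0.5
confident_mask = mask & (entropy < 0.3)            # 剔除高不确定区域后的分割
print("原始前景像素:", int(mask.sum()), " 剔除高不确定区域后:", int(confident_mask.sum()))
```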

[CV-129] A Multi-Site Study on AI-Driven Pathology Detection and Osteoarthritis Grading from Knee X-Ray

【速读】:该论文旨在解决骨健康相关疾病(如骨关节炎和骨质疏松症)诊断延迟的问题,这些问题通常由于有限的诊断工具而难以早期发现。论文提出了一种基于人工智能的系统,该系统能够分析膝关节X光片以检测关键病理特征,包括关节间隙变窄、硬化、骨赘、胫骨刺、对齐问题及软组织异常,并对骨关节炎的严重程度进行分级,从而实现及时且个性化的治疗。解决方案的关键在于开发了针对特定病理的高质量训练数据集,例如用于关节间隙变窄的ResNet15模型和用于骨关节炎分级的DenseNet模型,这些模型在多样化的成像环境中表现出色,并通过多种性能指标验证了其精度、召回率和阴性预测值等指标。

链接: https://arxiv.org/abs/2503.22176
作者: Bargava Subramanian,Naveen Kumarasami,Praveen Shastry,Kalyan Sivasailam,Anandakumar D,Keerthana R,Mounigasri M,Abilaasha G,Kishore Prasath Venkatesh
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注: 15 pages, 2 figures

点击查看摘要

Abstract:Introduction: Bone health disorders like osteoarthritis and osteoporosis pose major global health challenges, often leading to delayed diagnoses due to limited diagnostic tools. This study presents an AI-powered system that analyzes knee X-rays to detect key pathologies, including joint space narrowing, sclerosis, osteophytes, tibial spikes, alignment issues, and soft tissue anomalies. It also grades osteoarthritis severity, enabling timely, personalized treatment. Study Design: The research used 1.3 million knee X-rays from a multi-site Indian clinical trial across government, private, and SME hospitals. The dataset ensured diversity in demographics, imaging equipment, and clinical settings. Rigorous annotation and preprocessing yielded high-quality training datasets for pathology-specific models like ResNet15 for joint space narrowing and DenseNet for osteoarthritis grading. Performance: The AI system achieved strong diagnostic accuracy across diverse imaging environments. Pathology-specific models excelled in precision, recall, and NPV, validated using Mean Squared Error (MSE), Intersection over Union (IoU), and Dice coefficient. Subgroup analyses across age, gender, and manufacturer variations confirmed generalizability for real-world applications. Conclusion: This scalable, cost-effective solution for bone health diagnostics demonstrated robust performance in a multi-site trial. It holds promise for widespread adoption, especially in resource-limited healthcare settings, transforming bone health management and enabling proactive patient care.
zh

[CV-130] A Self-Supervised Learning of a Foundation Model for Analog Layout Design Automation

【速读】:该论文旨在解决两个关键挑战:1) 缺乏高质量标注的模拟布局数据;2) 模拟布局设计任务的多样性。为了解决这些问题,论文提出了一种基于UNet的预训练基础模型及其自监督学习方法。关键解决方案在于通过随机补丁采样(random patch sampling)和随机掩码(random masking)技术,从少量未标注的布局数据中自动获取足够的训练数据。这些数据经过增强后,具有较少偏差、统一大小且包含多样化的合格布局模式信息。通过在获得的数据上进行预训练,基础模型能够学习到布局模式的隐式通用知识,从而仅需使用少量任务特定数据即可针对多种下游布局任务进行微调。这种方法显著减少了针对每个任务单独开发模型所需的巨大人工努力,并在实验中展示了高效性和一致性,例如,在金属布线任务中,与从头训练相比,微调仅需1/8的数据量即可达到相同的Dice分数,并且验证损失降低了90%,基准分数提高了40%。

链接: https://arxiv.org/abs/2503.22143
作者: Sungyu Jeong,Won Joon Choi,Junung Choi,Anik Biswas,Byungsub Kim
机构: 未知
类目: Signal Processing (eess.SP); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 8 pages, 11 figures

点击查看摘要

Abstract:We propose a UNet-based foundation model and its self-supervised learning method to address two key challenges: 1) lack of qualified annotated analog layout data, and 2) excessive variety in analog layout design tasks. For self-supervised learning, we propose random patch sampling and random masking techniques automatically to obtain enough training data from a small unannotated layout dataset. The obtained data are greatly augmented, less biased, equally sized, and contain enough information for excessive varieties of qualified layout patterns. By pre-training with the obtained data, the proposed foundation model can learn implicit general knowledge on layout patterns so that it can be fine-tuned for various downstream layout tasks with small task-specific datasets. Fine-tuning provides an efficient and consolidated methodology for diverse downstream tasks, reducing the enormous human effort to develop a model per task separately. In experiments, the foundation model was pre-trained using 324,000 samples obtained from 6 silicon-proved manually designed analog circuits, then it was fine-tuned for the five example downstream tasks: generating contacts, vias, dummy fingers, N-wells, and metal routings. The fine-tuned models successfully performed these tasks for more than one thousand unseen layout inputs, generating DRC/LVS-clean layouts for 96.6% of samples. Compared with training the model from scratch for the metal routing task, fine-tuning required only 1/8 of the data to achieve the same dice score of 0.95. With the same data, fine-tuning achieved a 90% lower validation loss and a 40% higher benchmark score than training from scratch.
zh
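
下面用 NumPy 给出“随机图像块采样 + 随机掩码”这两种自监督数据构造手段的示意;版图数据、块尺寸与掩码比例均为演示假设。

```python
import numpy as np

def random_patches(layout, patch=64, n=16, rng=None):
    """从未标注版图(二维数组)中随机裁剪等尺寸图像块。"""
    rng = rng or np.random.default_rng(0)
    H, W = layout.shape
    ys = rng.integers(0, H - patch, n)
    xs = rng.integers(0, W - patch, n)
    return np.stack([layout[y:y+patch, x:x+patch] for y, x in zip(ys, xs)])

def random_mask(patches, mask_ratio=0.3, block=8, rng=None):
    """随机遮挡若干 block x block 区域, 作为自监督重建任务的输入。"""
    rng = rng or np.random.default_rng(1)
    masked = patches.copy()
    n, p, _ = patches.shape
    for i in range(n):
        n_blocks = int(mask_ratio * (p // block) ** 2)
        for _ in range(n_blocks):
            y, x = rng.integers(0, p - block, 2)
            masked[i, y:y+block, x:x+block] = 0
    return masked  # 预训练时让 UNet 由 masked 重建回 patches

layout = (np.random.rand(512, 512) > 0.7).astype(np.float32)  # 伪造的版图栅格
tgt = random_patches(layout)
inp = random_mask(tgt)
print(inp.shape, tgt.shape)
```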

[CV-131] Score-Based Turbo Message Passing for Plug-and-Play Compressive Image Recovery

【速读】:该论文旨在解决现有基于消息传递算法的压缩成像方法因使用通用或手工设计先验的去噪器,无法充分捕捉真实图像先验而导致性能不足的问题,特别是在高度欠定场景下。为解决此问题,论文的关键创新在于提出了一种基于贝叶斯公式的消息传递框架,将基于分数的最小均方误差(MMSE)去噪器集成到压缩图像恢复任务中。这种方法利用了基于分数的生成模型在精确刻画复杂图像分布方面的优势,并通过状态演化(SE)方程准确预测其渐进性能。实验表明,该方法在FFHQ数据集上的性能-复杂度权衡显著优于传统消息传递、正则化线性回归以及基于分数的后验采样基准方法,且通常仅需少于20次神经函数评估(NFEs)即可收敛。

链接: https://arxiv.org/abs/2503.22140
作者: Chang Cai,Xiaojun Yuan,Ying-Jun Angela Zhang
机构: Department of Information Engineering, The Chinese University of Hong Kong (香港中文大学); National Key Lab. of Wireless Commun., University of Electronic Science and Technology of China (电子科技大学无线通信国家重点实验室)
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Signal Processing (eess.SP)
备注:

点击查看摘要

Abstract:Message passing algorithms have been tailored for compressive imaging applications by plugging in different types of off-the-shelf image denoisers. These off-the-shelf denoisers mostly rely on some generic or hand-crafted priors for denoising. Due to their insufficient accuracy in capturing the true image prior, these methods often fail to produce satisfactory results, especially in largely underdetermined scenarios. On the other hand, score-based generative modeling offers a promising way to accurately characterize the sophisticated image distribution. In this paper, by exploiting the close relation between score-based modeling and empirical Bayes-optimal denoising, we devise a message passing framework that integrates a score-based minimum mean squared error (MMSE) denoiser for compressive image recovery. This framework is firmly rooted in Bayesian formalism, in which state evolution (SE) equations accurately predict its asymptotic performance. Experiments on the FFHQ dataset demonstrate that our method strikes a significantly better performance-complexity tradeoff than conventional message passing, regularized linear regression, and score-based posterior sampling baselines. Remarkably, our method typically requires less than 20 neural function evaluations (NFEs) to converge.
zh
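
论文所利用的“分数建模与经验贝叶斯最优去噪的紧密联系”通常可由 Tweedie 公式刻画:对高斯观测 y = x + σn,有 x̂ = y + σ²·∇log p(y)。下面给出一个可自行验证的一维示意,其中的分数函数为解析构造的演示假设,并非论文训练的分数网络。

```python
import torch

def score_mmse_denoise(y, sigma, score_fn):
    """Tweedie 公式: x_hat = y + sigma^2 * score(y, sigma)。
    score_fn 为预训练分数网络的占位接口(此处为假设)。"""
    return y + (sigma ** 2) * score_fn(y, sigma)

# 演示: 若先验 x ~ N(0, 1), 则 y ~ N(0, 1 + sigma^2), 解析分数为 -y / (1 + sigma^2),
# 此时 Tweedie 估计应等于后验均值 y / (1 + sigma^2)。
def gaussian_score(y, sigma):
    return -y / (1.0 + sigma ** 2)

y = torch.tensor([2.0])
sigma = 0.5
x_hat = score_mmse_denoise(y, sigma, gaussian_score)
print(x_hat, y / (1 + sigma ** 2))  # 两者应一致
```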

[CV-132] Improving the generalization of deep learning models in the segmentation of mammography images

【速读】:该论文旨在解决乳腺X线摄影图像中标志结构分割的数据不足问题,以提高深度学习模型在不同设备生成图像上的泛化能力。论文的关键解决方案是提出一系列数据增强策略,包括基于标注引导的图像强度调整和风格迁移,通过平衡的方式扩充训练样本,使模型能够在保留原始数据处理能力的同时适应多样化的输入数据。这种方法显著提升了模型的泛化性能,优于标准训练流程,并在包含多厂商设备生成图像的大规模数据集上得到了验证。实验结果表明,该方法具备高精度和鲁棒性,适合临床应用。

链接: https://arxiv.org/abs/2503.22052
作者: Jan Hurtado,Joao P. Maia,Cesar A. Sierra-Franco,Alberto Raposo
机构: PUC-Rio (Pontifical Catholic University of Rio de Janeiro)(里约热内卢天主教大学)
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Mammography stands as the main screening method for detecting breast cancer early, enhancing treatment success rates. The segmentation of landmark structures in mammography images can aid the medical assessment in the evaluation of cancer risk and the image acquisition adequacy. We introduce a series of data-centric strategies aimed at enriching the training data for deep learning-based segmentation of landmark structures. Our approach involves augmenting the training samples through annotation-guided image intensity manipulation and style transfer to achieve better generalization than standard training procedures. These augmentations are applied in a balanced manner to ensure the model learns to process a diverse range of images generated by different vendor equipments while retaining its efficacy on the original data. We present extensive numerical and visual results that demonstrate the superior generalization capabilities of our methods when compared to the standard training. For this evaluation, we consider a large dataset that includes mammography images generated by different vendor equipments. Further, we present complementary results that show both the strengths and limitations of our methods across various scenarios. The accuracy and robustness demonstrated in the experiments suggest that our method is well-suited for integration into clinical practice.
zh
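
下面给出“标注引导的图像强度调整”这类增强策略的极简示意(风格迁移部分省略,增益与偏置范围均为演示假设)。

```python
import numpy as np

def annotation_guided_intensity(image, mask, gain_range=(0.8, 1.2),
                                bias_range=(-0.05, 0.05), rng=None):
    """仅对标注区域(mask==1)施加随机增益与偏置, 模拟不同厂商设备的强度差异。"""
    rng = rng or np.random.default_rng(0)
    gain = rng.uniform(*gain_range)
    bias = rng.uniform(*bias_range)
    out = image.astype(np.float32).copy()
    out[mask > 0] = np.clip(out[mask > 0] * gain + bias, 0.0, 1.0)
    return out

img = np.random.rand(256, 256).astype(np.float32)
m = np.zeros_like(img)
m[64:192, 64:192] = 1  # 假设的标注区域
aug = annotation_guided_intensity(img, m)
print(float(np.abs(aug - img).max()))
```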

[CV-133] DeCompress: Denoising via Neural Compression

【速读】:该论文旨在解决学习型去噪算法在训练过程中对大规模带标签清洁与噪声图像对数据集依赖的问题,特别是在许多成像应用(如显微镜成像)中难以获取真实参考图像(ground truth images)的场景下。论文的关键创新在于提出了一种基于压缩的去噪算法DeCompress,其核心解决方案是结合基于压缩的去噪思想与神经压缩领域的最新进展,通过利用单一噪声图像即可完成训练,无需访问大规模训练数据集或真实参考图像,同时具备抗过拟合能力,并在性能上超越现有的零样本(zero-shot)或无监督学习去噪方法。

链接: https://arxiv.org/abs/2503.22015
作者: Ali Zafari,Xi Chen,Shirin Jalali
机构: Department of Electrical and Computer Engineering (电气与计算机工程系), Rutgers University (罗格斯大学)
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Information Theory (cs.IT)
备注:

点击查看摘要

Abstract:Learning-based denoising algorithms achieve state-of-the-art performance across various denoising tasks. However, training such models relies on access to large training datasets consisting of clean and noisy image pairs. On the other hand, in many imaging applications, such as microscopy, collecting ground truth images is often infeasible. To address this challenge, researchers have recently developed algorithms that can be trained without requiring access to ground truth data. However, training such models remains computationally challenging and still requires access to large noisy training samples. In this work, inspired by compression-based denoising and recent advances in neural compression, we propose a new compression-based denoising algorithm, which we name DeCompress, that i) does not require access to ground truth images, ii) does not require access to large training dataset - only a single noisy image is sufficient, iii) is robust to overfitting, and iv) achieves superior performance compared with zero-shot or unsupervised learning-based denoisers.
zh

[CV-134] Differential Evolution for Grassmann Manifold Optimization: A Projection Approach

【速读】:本文提出了一种新颖的演化算法,旨在优化定义在Grassmann流形Gr(k,n)上的实值目标函数。现有基于Gr(k,n)的优化技术主要依赖于黎曼流形的一阶或二阶方法,但这些局部性方法在非凸或多模态景观中常表现出局限性。为解决此问题,作者将差分演化算法(Differential Evolution)——一种全局性的群体优化方法——适配到Grassmann流形上运行。关键在于引入了自适应控制参数方案,并设计了一种通过QR分解将试验向量投影到流形上的机制,从而确保解的可行性同时支持全局探索。这一框架提供了比经典黎曼优化方法更灵活且几何感知的选择,特别适用于机器学习、信号处理以及低秩矩阵恢复等领域,其中子空间表示起着核心作用。论文通过多个Grassmann流形上的优化问题实例验证了所提方法的有效性。

链接: https://arxiv.org/abs/2503.21984
作者: Andrew Lesniewski
机构: Baruch College (巴鲁克学院)
类目: Optimization and Control (math.OC); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
备注:

点击查看摘要

Abstract:We propose a novel evolutionary algorithm for optimizing real-valued objective functions defined on the Grassmann manifold Gr(k,n), the space of all k-dimensional linear subspaces of R^n. While existing optimization techniques on Gr(k,n) predominantly rely on first- or second-order Riemannian methods, these inherently local methods often struggle with nonconvex or multimodal landscapes. To address this limitation, we adapt the Differential Evolution algorithm - a global, population based optimization method - to operate effectively on the Grassmannian. Our approach incorporates adaptive control parameter schemes, and introduces a projection mechanism that maps trial vectors onto the manifold via QR decomposition. The resulting algorithm maintains feasibility with respect to the manifold structure while enabling exploration beyond local neighborhoods. This framework provides a flexible and geometry-aware alternative to classical Riemannian optimization methods and is well-suited to applications in machine learning, signal processing, and low-rank matrix recovery where subspace representations play a central role. We test the methodology on a number of examples of optimization problems on Grassmann manifolds.
zh
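
下面用 NumPy 给出“差分演化变异/交叉 + QR 分解投影回 Grassmann 流形”这一流程的极简示意;目标函数与控制参数(F、CR、种群规模)均为演示假设,并非论文的自适应参数方案。

```python
import numpy as np

def project_to_grassmann(X):
    """通过 QR 分解把 n x k 矩阵正交化, 代表 Gr(k, n) 中的一个子空间。"""
    Q, _ = np.linalg.qr(X)
    return Q

def de_on_grassmann(f, n=6, k=2, pop=20, iters=200, F=0.6, CR=0.9, seed=0):
    rng = np.random.default_rng(seed)
    P = [project_to_grassmann(rng.standard_normal((n, k))) for _ in range(pop)]
    fit = np.array([f(X) for X in P])
    for _ in range(iters):
        for i in range(pop):
            a, b, c = rng.choice([j for j in range(pop) if j != i], 3, replace=False)
            V = P[a] + F * (P[b] - P[c])        # 差分变异(在环境空间中进行)
            mask = rng.random((n, k)) < CR      # 交叉
            U = np.where(mask, V, P[i])
            U = project_to_grassmann(U)         # QR 投影回流形, 保证可行性
            fu = f(U)
            if fu < fit[i]:
                P[i], fit[i] = U, fu
    best = int(np.argmin(fit))
    return P[best], fit[best]

# 演示目标: 最大化 trace(X^T A X), 写成最小化问题; 理论最优为 -(5+4) = -9
A = np.diag([5., 4., 1., 1., 0.5, 0.1])
f = lambda X: -np.trace(X.T @ A @ X)
X_best, val = de_on_grassmann(f)
print(val)
```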

[CV-135] Comprehensive segmentation of deep grey nuclei from structural MRI data

【速读】:该论文旨在解决深灰质核团(丘脑核团、基底神经节、屏状核、红核)在结构T1磁共振成像数据中的全面且完整的分割工具缺乏的问题,以实现结果的可重复性和再现性。论文的目标是提出一种快速、准确且鲁棒的方法来分割这些结构。解决方案的关键在于利用白质抑制成像对比度的提升,通过最近提出的直方图多项式合成(Histogram-based Polynomial Synthesis, HIPS)方法从标准T1图像合成类似白质抑制的图像,随后结合多图谱分割与联合标签融合技术实现深灰质核团的分割。这种方法在1.5T、3T和7T不同场强下均表现稳健,并实现了与手动分割金标准相比Dice系数不低于0.7的结果。

链接: https://arxiv.org/abs/2503.21955
作者: Manojkumar Saranathan,Giuseppina Cogliandro,Thomas Hicks,Dianne Patterson,Behroze Vachha,Alberto Cacciola
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注: 7 Figures 2 Tables 2 Supplemental Figures 1 Supplemental Table

点击查看摘要

Abstract:Motivation: Lack of tools for comprehensive and complete segmentation of deep grey nuclei using a single software for reproducibility and repeatability Goal(s): A fast accurate and robust method for segmentation of deep grey nuclei (thalamic nuclei, basal ganglia, claustrum, red nucleus) from structural T1 MRI data at conventional field strengths Approach: We leverage the improved contrast of white-matter-nulled imaging by using the recently proposed Histogram-based Polynomial Synthesis (HIPS) to synthesize WMn-like images from standard T1 and then use a multi-atlas segmentation with joint label fusion to segment deep grey nuclei. Results: The method worked robustly on all field strengths (1.5/3/7) and Dice coefficients of 0.7 or more were achieved for all structures compared against manual segmentation ground truth. Impact: This method facilitates careful investigation of the role of deep grey nuclei by enabling the use of conventional T1 data from large public databases, which has not been possible, hitherto, due to lack of robust reproducible segmentation tools.
zh

[CV-136] PyUAT: Open-source Python framework for efficient and scalable cell tracking

【速读】:该论文旨在解决微生物细胞在时间序列显微成像中由于随机运动和频繁分裂导致的跟踪难题,尤其是在受限的成像帧率条件下避免产生不准确结果的问题。论文的关键解决方案是提出了一种基于不确定性感知跟踪(Uncertainty-Aware Tracking, UAT)的方法,通过统计模型预测细胞关联的可能情况,这些模型经过实测细胞行为校准。论文进一步展示了PyUAT这一高效且模块化的Python实现,在大型二维加时间维度数据集上的性能,并探讨了生物模型模块化以及成像间隔对跟踪性能的影响。

链接: https://arxiv.org/abs/2503.21914
作者: Johannes Seiffarth,Katharina Nöh
机构: Institute of Bio- and Geosciences, IBG-1: Biotechnology, Forschungszentrum Jülich, Jülich, Germany (生物与地球科学研究所,生物技术部,尤利希研究中心,德国尤利希); Computational Systems Biotechnology (AVT.CSB), RWTH Aachen University, Aachen, Germany (计算系统生物技术研究所,RWTH亚琛工业大学,德国亚琛)
类目: Quantitative Methods (q-bio.QM); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Tracking individual cells in live-cell imaging provides fundamental insights, inevitable for studying causes and consequences of phenotypic heterogeneity, responses to changing environmental conditions or stressors. Microbial cell tracking, characterized by stochastic cell movements and frequent cell divisions, remains a challenging task when imaging frame rates must be limited to avoid counterfactual results. A promising way to overcome this limitation is uncertainty-aware tracking (UAT), which uses statistical models, calibrated to empirically observed cell behavior, to predict likely cell associations. We present PyUAT, an efficient and modular Python implementation of UAT for tracking microbial cells in time-lapse imaging. We demonstrate its performance on a large 2D+t data set and investigate the influence of modular biological models and imaging intervals on the tracking performance. The open-source PyUAT software is available at this https URL, including example notebooks for immediate use in Google Colab.
zh

[CV-137] Vision Language Models versus Machine Learning Models Performance on Polyp Detection and Classification in Colonoscopy Images

【速读】:该论文旨在评估视觉-语言模型(Vision-Language Models, VLMs)在结肠镜息肉图像的计算机辅助检测(Computer-Aided Detection, CADe)和计算机辅助诊断(Computer-Aided Diagnosis, CADx)任务中的性能,并与传统的卷积神经网络(Convolutional Neural Networks, CNNs)以及经典机器学习模型(Classic Machine Learning Models, CMLs)进行综合比较。论文的关键在于通过标准化预处理技术和严格的对比框架,分析了11种不同模型在两项临床任务(息肉检测和分类)上的表现,从而揭示CNNs在CADe和CADx任务中的持续优势,同时探讨了VLMs如BioMedCLIP和GPT-4在特定场景下的潜在应用价值。

链接: https://arxiv.org/abs/2503.21840
作者: Mohammad Amin Khalafi,Seyed Amir Ahmad Safavi-Naini,Ameneh Salehi,Nariman Naderi,Dorsa Alijanzadeh,Pardis Ketabi Moghadam,Kaveh Kavosi,Negar Golestani,Shabnam Shahrokh,Soltanali Fallah,Jamil S Samaan,Nicholas P. Tatonetti,Nicholas Hoerter,Girish Nadkarni,Hamid Asadzadeh Aghdaei,Ali Soroush
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注: Code is available at: this https URL . CoI: AlSo serves on the advisory board and holds equity in Virgo Surgical Solutions. The other authors declare no conflicts of interest. Data

点击查看摘要

Abstract:Introduction: This study provides a comprehensive performance assessment of vision-language models (VLMs) against established convolutional neural networks (CNNs) and classic machine learning models (CMLs) for computer-aided detection (CADe) and computer-aided diagnosis (CADx) of colonoscopy polyp images. Method: We analyzed 2,258 colonoscopy images with corresponding pathology reports from 428 patients. We preprocessed all images using standardized techniques (resizing, normalization, and augmentation) and implemented a rigorous comparative framework evaluating 11 distinct models: ResNet50, 4 CMLs (random forest, support vector machine, logistic regression, decision tree), two specialized contrastive vision language encoders (CLIP, BiomedCLIP), and three general-purpose VLMs (GPT-4, Gemini-1.5-Pro, Claude-3-Opus). Our performance assessment focused on two clinical tasks: polyp detection (CADe) and classification (CADx). Result: In polyp detection, ResNet50 achieved the best performance (F1: 91.35%, AUROC: 0.98), followed by BiomedCLIP (F1: 88.68%, AUROC: [AS1] ). GPT-4 demonstrated comparable effectiveness to traditional machine learning approaches (F1: 81.02%, AUROC: [AS2] ), outperforming other general-purpose VLMs. For polyp classification, performance rankings remained consistent but with lower overall metrics. ResNet50 maintained the highest efficacy (weighted F1: 74.94%), while GPT-4 demonstrated moderate capability (weighted F1: 41.18%), significantly exceeding other VLMs (Claude-3-Opus weighted F1: 25.54%, Gemini 1.5 Pro weighted F1: 6.17%). Conclusion: CNNs remain superior for both CADx and CADe tasks. However, VLMs like BioMedCLIP and GPT-4 may be useful for polyp detection tasks where training CNNs is not feasible.
zh

[CV-138] Learning from spatially inhomogenous data: resolution-adaptive convolutions for multiple sclerosis lesion segmentation

【速读】:该论文旨在解决临床成像数据因设备厂商、医院及扫描序列差异导致的空间分辨率异质性问题,特别是在MRI中,体素尺寸、层间距和采集平面可能显著不同。临床应用需要算法能够处理具有多种体素分辨率的数据,而传统的解决方法是通过重采样实现数据的同质化(通常是等体素重采样)。然而,这种方法可能导致插值引起的平面外失真以及平面内下采样的保真度损失。为此,论文提出了一种基于e3nn框架的网络架构,其关键在于利用球谐函数参数化的卷积核,而不是依赖于体素网格,并且卷积核具有固定的物理半径。这种网络架构可以直接从空间异构数据中学习,无需进行重采样。此外,该网络可以被重新调整以适应输入数据的体素维度。实验结果显示,该网络在包含三个中心的公开数据集和一个内部多发性硬化症病例的高空间异质性数据集上表现出色,在未见的图像分辨率测试中也展现了良好的泛化能力。与标准U-Net相比,该方法在二维和大部分三维测试案例中表现更优。

链接: https://arxiv.org/abs/2503.21829
作者: Ivan Diaz,Florin Scherer,Yanik Berli,Roland Wiest,Helly Hammer,Robert Hoepner,Alejandro Leon Betancourt,Piotr Radojewski,Richard McKinley
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:In the setting of clinical imaging, differences in between vendors, hospitals and sequences can yield highly inhomogeneous imaging data. In MRI in particular, voxel dimension, slice spacing and acquisition plane can vary substantially. For clinical applications, therefore, algorithms must be trained to handle data with various voxel resolutions. The usual strategy to deal with heterogeneity of resolution is harmonization: resampling imaging data to a common (usually isovoxel) resolution. This can lead to loss of fidelity arising from interpolation artifacts out-of-plane and downsampling in-plane. We present in this paper a network architecture designed to be able to learn directly from spatially heterogeneous data, without resampling: a segmentation network based on the e3nn framework that leverages a spherical harmonic, rather than voxel-grid, parameterization of convolutional kernels, with a fixed physical radius. Networks based on these kernels can be resampled to their input voxel dimensions. We trained and tested our network on a publicly available dataset assembled from three centres, and on an in-house dataset of Multiple Sclerosis cases with a high degree of spatial inhomogeneity. We compared our approach to a standard U-Net with two strategies for handling inhomogeneous data: training directly on the data without resampling, and resampling to a common resolution of 1mm isovoxels. We show that our network is able to learn from various combinations of voxel sizes and outperforms classical U-Nets on 2D testing cases and most 3D testing cases. This shows an ability to generalize well when tested on image resolutions not seen during training. Our code can be found at: this http URL_U-Net.
zh

[CV-139] Implicit neural representations for end-to-end PET reconstruction

【速读】:该论文试图解决正电子发射断层成像(PET)图像重建中的传统方法存在的问题,特别是在减少伪影和提高图像质量方面。论文提出了一种基于隐式SIREN神经网络架构的无监督PET图像重建方法,该架构采用正弦激活函数。关键在于结合了一个前向投影模型和适应于从sinogram直接重建PET图像的损失函数,无需依赖大规模训练数据集即可实现高效的图像重建。通过与传统的惩罚似然方法及基于深度图像先验(DIP)的重建方法在脑部幻影数据和现实模拟的sinogram上的对比实验表明,基于隐式神经表示(INR)的方法能够以更简单、更高效的方式重建高质量图像,在对比度、活性恢复以及相对偏差等方面表现出显著改进。

链接: https://arxiv.org/abs/2503.21825
作者: Younès Moussaoui(Nantes Univ - ECN, CHU Nantes),Diana Mateus(Nantes Univ - ECN),Nasrin Taheri(CHU Nantes),Saïd Moussaoui(Nantes Univ - ECN),Thomas Carlier(CHU Nantes),Simon Stute(CHU Nantes)
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注: IEEE International Symposium on Biomedical Imaging, Apr 2025, Houston (Texas), United States

点击查看摘要

Abstract:Implicit neural representations (INRs) have demonstrated strong capabilities in various medical imaging tasks, such as denoising, registration, and segmentation, by representing images as continuous functions, allowing complex details to be captured. For image reconstruction problems, INRs can also reduce artifacts typically introduced by conventional reconstruction algorithms. However, to the best of our knowledge, INRs have not been studied in the context of PET reconstruction. In this paper, we propose an unsupervised PET image reconstruction method based on the implicit SIREN neural network architecture using sinusoidal activation functions. Our method incorporates a forward projection model and a loss function adapted to perform PET image reconstruction directly from sinograms, without the need for large training datasets. The performance of the proposed approach was compared with that of conventional penalized likelihood methods and deep image prior (DIP) based reconstruction using brain phantom data and realistically simulated sinograms. The results show that the INR-based approach can reconstruct high-quality images with a simpler, more efficient model, offering improvements in PET image reconstruction, particularly in terms of contrast, activity recovery, and relative bias.
zh
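
下面给出 SIREN(正弦激活隐式表示)的最小实现,并用一个假设的前向投影算子示意“直接对 sinogram 定义重建损失”的训练形式;该投影算子仅为占位,真实系统应使用 Radon 变换或系统矩阵,数据亦为随机占位。

```python
import torch
import torch.nn as nn

class SineLayer(nn.Module):
    def __init__(self, in_f, out_f, w0=30.0):
        super().__init__()
        self.w0 = w0
        self.lin = nn.Linear(in_f, out_f)
    def forward(self, x):
        return torch.sin(self.w0 * self.lin(x))

class Siren(nn.Module):
    """坐标 (y, x) -> 活度值的隐式神经表示。"""
    def __init__(self, hidden=128, layers=3):
        super().__init__()
        blocks = [SineLayer(2, hidden)] + [SineLayer(hidden, hidden) for _ in range(layers - 1)]
        self.net = nn.Sequential(*blocks, nn.Linear(hidden, 1))
    def forward(self, coords):
        return self.net(coords)

def forward_project(img):
    """极简的“前向投影”占位: 仅做两个方向的线积分(求和), 真实场景应替换为 Radon/系统矩阵。"""
    return torch.cat([img.sum(dim=0), img.sum(dim=1)])

H = W = 64
ys, xs = torch.meshgrid(torch.linspace(-1, 1, H), torch.linspace(-1, 1, W), indexing="ij")
coords = torch.stack([ys, xs], dim=-1).reshape(-1, 2)
target_sino = torch.rand(H + W)          # 占位的 sinogram 数据

model = Siren()
opt = torch.optim.Adam(model.parameters(), lr=1e-4)
for step in range(5):
    img = model(coords).reshape(H, W)
    loss = ((forward_project(img) - target_sino) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()
print(loss.item())
```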

[CV-140] Deep Learning-Based Quantitative Assessment of Renal Chronicity Indices in Lupus Nephritis

【速读】:该论文旨在解决狼疮性肾炎(Lupus Nephritis, LN)患者慢性指数(Chronicity Index, CI)评估中存在的挑战,包括病理科医生面临的时间消耗大、观察者间差异显著以及易受疲劳影响等问题。为应对这些挑战,论文提出了一种基于深度学习(Deep Learning, DL)的有效自动化管道,用于精准评估CI,并从疾病特异性角度提供有价值的预后见解。

解决方案的关键在于开发了一个包含训练集、内部测试集及外部测试集的多阶段DL管道。通过在30名患者的60张切片(总计22,410个图像块)上进行模型训练,并在两个独立测试集中验证其性能,该DL管道不仅实现了对组织区域和病理病变的高度分割能力,超越现有最先进的方法,还显著提高了CI评估的一致性和准确性,增强了与病理科医生评估结果的相关性,并改善了预后分析中的结局预测效果。这一创新方法不仅提升了工作效率,也为临床决策提供了有力支持。

链接: https://arxiv.org/abs/2503.21818
作者: Tianqi Tu,Hui Wang,Jiangbo Pei,Xiaojuan Yu,Aidong Men,Suxia Wang,Qingchao Chen,Ying Tan,Feng Yu,Minghui Zhao
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Background: Renal chronicity indices (CI) have been identified as strong predictors of long-term outcomes in lupus nephritis (LN) patients. However, assessment by pathologists is hindered by challenges such as substantial time requirements, high interobserver variation, and susceptibility to fatigue. This study aims to develop an effective deep learning (DL) pipeline that automates the assessment of CI and provides valuable prognostic insights from a disease-specific perspective. Methods: We curated a dataset comprising 282 slides obtained from 141 patients across two independent cohorts with a complete 10-year follow-up. Our DL pipeline was developed on 60 slides (22,410 patch images) from 30 patients in the training cohort and evaluated on both an internal testing set (148 slides, 77,605 patch images) and an external testing set (74 slides, 27,522 patch images). Results: The study included two cohorts with slight demographic differences, particularly in age and hemoglobin levels. The DL pipeline showed high segmentation performance across tissue compartments and histopathologic lesions, outperforming state-of-the-art methods. The DL pipeline also demonstrated a strong correlation with pathologists in assessing CI, significantly improving interobserver agreement. Additionally, the DL pipeline enhanced prognostic accuracy, particularly in outcome prediction, when combined with clinical parameters and pathologist-assessed CIs. Conclusions: The DL pipeline demonstrated accuracy and efficiency in assessing CI in LN, showing promise in improving interobserver agreement among pathologists. It also exhibited significant value in prognostic analysis and enhancing outcome prediction in LN patients, offering a valuable tool for clinical decision-making.
zh

人工智能

[AI-0] Exploring the Effectiveness of Multi-stage Fine-tuning for Cross-encoder Re-rankers ECIR

链接: https://arxiv.org/abs/2503.22672
作者: Francesca Pezzuti,Sean MacAvaney,Nicola Tonellotto
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
*备注: 7 pages. To be published as short paper in the Proceedings of the European Conference on Information Retrieval (ECIR) 2025

点击查看摘要

Abstract:State-of-the-art cross-encoders can be fine-tuned to be highly effective in passage re-ranking. The typical fine-tuning process of cross-encoders as re-rankers requires large amounts of manually labelled data, a contrastive learning objective, and a set of heuristically sampled negatives. An alternative recent approach for fine-tuning instead involves teaching the model to mimic the rankings of a highly effective large language model using a distillation objective. These fine-tuning strategies can be applied either individually, or in sequence. In this work, we systematically investigate the effectiveness of point-wise cross-encoders when fine-tuned independently in a single stage, or sequentially in two stages. Our experiments show that the effectiveness of point-wise cross-encoders fine-tuned using contrastive learning is indeed on par with that of models fine-tuned with multi-stage approaches. Code is available for reproduction at this https URL.
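
下面示意逐点交叉编码器常用的对比学习目标(正例 + 启发式采样负例的交叉熵形式);打分数据为随机占位,并非论文的实验设置。

```python
import torch
import torch.nn.functional as F

def contrastive_rerank_loss(scores):
    """scores: (B, 1+N), 第 0 列为正例文档得分, 其余为采样负例得分。"""
    labels = torch.zeros(scores.size(0), dtype=torch.long)  # 正例固定在第 0 位
    return F.cross_entropy(scores, labels)

# 占位打分: 真实场景中由交叉编码器对 (query, passage) 拼接后输出单个分数
scores = torch.randn(4, 1 + 7, requires_grad=True)  # batch=4, 每个查询 7 个负例
loss = contrastive_rerank_loss(scores)
loss.backward()
print(loss.item())
```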

[AI-1] Empirical Analysis of Sim-and-Real Cotraining Of Diffusion Policies For Planar Pushing from Pixels IROS2025

链接: https://arxiv.org/abs/2503.22634
作者: Adam Wei,Abhinav Agarwal,Boyuan Chen,Rohan Bosworth,Nicholas Pfaff,Russ Tedrake
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
*备注: 9 pages, 15 figures, In Submission to IROS 2025

点击查看摘要

Abstract:In imitation learning for robotics, cotraining with demonstration data generated both in simulation and on real hardware has emerged as a powerful recipe to overcome the sim2real gap. This work seeks to elucidate basic principles of this sim-and-real cotraining to help inform simulation design, sim-and-real dataset creation, and policy training. Focusing narrowly on the canonical task of planar pushing from camera inputs enabled us to be thorough in our study. These experiments confirm that cotraining with simulated data can dramatically improve performance in real, especially when real data is limited. Performance gains scale with simulated data, but eventually plateau; real-world data increases this performance ceiling. The results also suggest that reducing the domain gap in physics may be more important than visual fidelity for non-prehensile manipulation tasks. Perhaps surprisingly, having some visual domain gap actually helps the cotrained policy – binary probes reveal that high-performing policies learn to distinguish simulated domains from real. We conclude by investigating this nuance and mechanisms that facilitate positive transfer between sim-and-real. In total, our experiments span over 40 real-world policies (evaluated on 800+ trials) and 200 simulated policies (evaluated on 40,000+ trials).

[AI-2] Challenges and Paths Towards AI for Software Engineering

链接: https://arxiv.org/abs/2503.22625
作者: Alex Gu,Naman Jain,Wen-Ding Li,Manish Shetty,Yijia Shao,Ziyang Li,Diyi Yang,Kevin Ellis,Koushik Sen,Armando Solar-Lezama
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 75 pages

点击查看摘要

Abstract:AI for software engineering has made remarkable progress recently, becoming a notable success within generative AI. Despite this, there are still many challenges that need to be addressed before automated software engineering reaches its full potential. It should be possible to reach high levels of automation where humans can focus on the critical decisions of what to build and how to balance difficult tradeoffs while most routine development effort is automated away. Reaching this level of automation will require substantial research and engineering efforts across academia and industry. In this paper, we aim to discuss progress towards this in a threefold manner. First, we provide a structured taxonomy of concrete tasks in AI for software engineering, emphasizing the many other tasks in software engineering beyond code generation and completion. Second, we outline several key bottlenecks that limit current approaches. Finally, we provide an opinionated list of promising research directions toward making progress on these bottlenecks, hoping to inspire future research in this rapidly maturing field.

[AI-3] Generative Latent Neural PDE Solver using Flow Matching

链接: https://arxiv.org/abs/2503.22600
作者: Zijie Li,Anthony Zhou,Amir Barati Farimani
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: work in progress

点击查看摘要

Abstract:Autoregressive next-step prediction models have become the de-facto standard for building data-driven neural solvers to forecast time-dependent partial differential equations (PDEs). Denoise training that is closely related to diffusion probabilistic model has been shown to enhance the temporal stability of neural solvers, while its stochastic inference mechanism enables ensemble predictions and uncertainty quantification. In principle, such training involves sampling a series of discretized diffusion timesteps during both training and inference, inevitably increasing computational overhead. In addition, most diffusion models apply isotropic Gaussian noise on structured, uniform grids, limiting their adaptability to irregular domains. We propose a latent diffusion model for PDE simulation that embeds the PDE state in a lower-dimensional latent space, which significantly reduces computational costs. Our framework uses an autoencoder to map different types of meshes onto a unified structured latent grid, capturing complex geometries. By analyzing common diffusion paths, we propose to use a coarsely sampled noise schedule from flow matching for both training and testing. Numerical experiments show that the proposed model outperforms several deterministic baselines in both accuracy and long-term stability, highlighting the potential of diffusion-based approaches for robust data-driven PDE learning.
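
下面给出(条件)流匹配训练目标的极简示意:在线性插值路径 z_t = (1-t)·z0 + t·z1 上回归目标速度 z1 - z0;网络结构与潜空间维度均为演示假设,潜空间编码器/解码器部分省略。

```python
import torch
import torch.nn as nn

dim = 32
# 速度场网络: 输入 (z_t, t), 输出与 z 同维的速度
v_net = nn.Sequential(nn.Linear(dim + 1, 128), nn.SiLU(), nn.Linear(128, dim))
opt = torch.optim.Adam(v_net.parameters(), lr=1e-3)

def flow_matching_step(z1):
    """z1: 目标潜变量(例如下一时刻 PDE 状态的潜编码), 形状 (B, dim)。"""
    z0 = torch.randn_like(z1)                  # 噪声端点
    t = torch.rand(z1.size(0), 1)              # 随机时间 t ∈ (0, 1)
    zt = (1 - t) * z0 + t * z1                 # 线性插值路径
    target_v = z1 - z0                         # 该路径下的真实速度
    pred_v = v_net(torch.cat([zt, t], dim=-1))
    loss = ((pred_v - target_v) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

print(flow_matching_step(torch.randn(16, dim)))
```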

[AI-4] On the Mistaken Assumption of Interchangeable Deep Reinforcement Learning Implementations ICSE2025

链接: https://arxiv.org/abs/2503.22575
作者: Rajdeep Singh Hundal,Yan Xiao,Xiaochun Cao,Jin Song Dong,Manuel Rigger
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
*备注: To be published in the 47th International Conference on Software Engineering (ICSE 2025)

点击查看摘要

Abstract:Deep Reinforcement Learning (DRL) is a paradigm of artificial intelligence where an agent uses a neural network to learn which actions to take in a given environment. DRL has recently gained traction from being able to solve complex environments like driving simulators, 3D robotic control, and multiplayer-online-battle-arena video games. Numerous implementations of the state-of-the-art algorithms responsible for training these agents, like the Deep Q-Network (DQN) and Proximal Policy Optimization (PPO) algorithms, currently exist. However, studies make the mistake of assuming implementations of the same algorithm to be consistent and thus, interchangeable. In this paper, through a differential testing lens, we present the results of studying the extent of implementation inconsistencies, their effect on the implementations’ performance, as well as their impact on the conclusions of prior studies under the assumption of interchangeable implementations. The outcomes of our differential tests showed significant discrepancies between the tested algorithm implementations, indicating that they are not interchangeable. In particular, out of the five PPO implementations tested on 56 games, three implementations achieved superhuman performance for 50% of their total trials while the other two implementations only achieved superhuman performance for less than 15% of their total trials. As part of a meticulous manual analysis of the implementations’ source code, we analyzed implementation discrepancies and determined that code-level inconsistencies primarily caused these discrepancies. Lastly, we replicated a study and showed that this assumption of implementation interchangeability was sufficient to flip experiment outcomes. Therefore, this calls for a shift in how implementations are being used.

[AI-5] A Framework for Cryptographic Verifiability of End-to-End AI Pipelines

链接: https://arxiv.org/abs/2503.22573
作者: Kar Balan,Robert Learney,Tim Wood
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
*备注: Accepted to 11th ACM International Workshop on Security and Privacy Analytics (IWSPA 2025)

点击查看摘要

Abstract:The increasing integration of Artificial Intelligence across multiple industry sectors necessitates robust mechanisms for ensuring transparency, trust, and auditability of its development and deployment. This topic is particularly important in light of recent calls in various jurisdictions to introduce regulation and legislation on AI safety. In this paper, we propose a framework for complete verifiable AI pipelines, identifying key components and analyzing existing cryptographic approaches that contribute to verifiability across different stages of the AI lifecycle, from data sourcing to training, inference, and unlearning. This framework could be used to combat misinformation by providing cryptographic proofs alongside AI-generated assets to allow downstream verification of their provenance and correctness. Our findings underscore the importance of ongoing research to develop cryptographic tools that are not only efficient for isolated AI processes, but that are efficiently `linkable’ across different processes within the AI pipeline, to support the development of end-to-end verifiable AI technologies.

[AI-6] Niyama : Breaking the Silos of LLM Inference Serving

链接: https://arxiv.org/abs/2503.22562
作者: Kanishk Goel,Jayashree Mohan,Nipun Kwatra,Ravi Shreyas Anupindi,Ramachandran Ramjee
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注:

点击查看摘要

Abstract:The widespread adoption of Large Language Models (LLMs) has enabled diverse applications with very different latency requirements. Existing LLM serving frameworks rely on siloed infrastructure with coarse-grained workload segregation – interactive and batch – leading to inefficient resource utilization and limited support for fine-grained Quality-of-Service (QoS) differentiation. This results in operational inefficiencies, over-provisioning and poor load management during traffic surges. We present Niyama, a novel QoS-driven inference serving system that enables efficient co-scheduling of diverse workloads on shared infrastructure. Niyama introduces fine-grained QoS classification allowing applications to specify precise latency requirements, and dynamically adapts scheduling decisions based on real-time system state. Leveraging the predictable execution characteristics of LLM inference, Niyama implements a dynamic chunking mechanism to improve overall throughput while maintaining strict QoS guarantees. Additionally, Niyama employs a hybrid prioritization policy that balances fairness and efficiency, and employs selective request relegation that enables graceful service degradation during overload conditions. Our evaluation demonstrates that Niyama increases serving capacity by 32% compared to current siloed deployments, while maintaining QoS guarantees. Notably, under extreme load, our system reduces SLO violations by an order of magnitude compared to current strategies.

[AI-7] SafeCast: Risk-Responsive Motion Forecasting for Autonomous Vehicles

链接: https://arxiv.org/abs/2503.22541
作者: Haicheng Liao,Hanlin Kong,Bin Rao,Bonan Wang,Chengyue Wang,Guyang Yu,Yuming Huang,Ruru Tang,Chengzhong Xu,Zhenning Li
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Accurate motion forecasting is essential for the safety and reliability of autonomous driving (AD) systems. While existing methods have made significant progress, they often overlook explicit safety constraints and struggle to capture the complex interactions among traffic agents, environmental factors, and motion dynamics. To address these challenges, we present SafeCast, a risk-responsive motion forecasting model that integrates safety-aware decision-making with uncertainty-aware adaptability. SafeCast is the first to incorporate the Responsibility-Sensitive Safety (RSS) framework into motion forecasting, encoding interpretable safety rules–such as safe distances and collision avoidance–based on traffic norms and physical principles. To further enhance robustness, we introduce the Graph Uncertainty Feature (GUF), a graph-based module that injects learnable noise into Graph Attention Networks, capturing real-world uncertainties and enhancing generalization across diverse scenarios. We evaluate SafeCast on four real-world benchmark datasets–Next Generation Simulation (NGSIM), Highway Drone (HighD), ApolloScape, and the Macao Connected Autonomous Driving (MoCAD)–covering highway, urban, and mixed-autonomy traffic environments. Our model achieves state-of-the-art (SOTA) accuracy while maintaining a lightweight architecture and low inference latency, underscoring its potential for real-time deployment in safety-critical AD systems.

[AI-8] Robust Offline Imitation Learning Through State-level Trajectory Stitching

链接: https://arxiv.org/abs/2503.22524
作者: Shuze Wang,Yunpeng Mei,Hongjie Cao,Yetian Yuan,Gang Wang,Jian Sun,Jie Chen
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Imitation learning (IL) has proven effective for enabling robots to acquire visuomotor skills through expert demonstrations. However, traditional IL methods are limited by their reliance on high-quality, often scarce, expert data, and suffer from covariate shift. To address these challenges, recent advances in offline IL have incorporated suboptimal, unlabeled datasets into the training. In this paper, we propose a novel approach to enhance policy learning from mixed-quality offline datasets by leveraging task-relevant trajectory fragments and rich environmental dynamics. Specifically, we introduce a state-based search framework that stitches state-action pairs from imperfect demonstrations, generating more diverse and informative training trajectories. Experimental results on standard IL benchmarks and real-world robotic tasks showcase that our proposed method significantly improves both generalization and performance.

[AI-9] Almost Bayesian: The Fractal Dynamics of Stochastic Gradient Descent

链接: https://arxiv.org/abs/2503.22478
作者: Max Hennick,Stijn De Baerdemacker
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Optimization and Control (math.OC)
*备注:

点击查看摘要

Abstract:We show that the behavior of stochastic gradient descent is related to Bayesian statistics by showing that SGD is effectively diffusion on a fractal landscape, where the fractal dimension can be accounted for in a purely Bayesian way. By doing this we show that SGD can be regarded as a modified Bayesian sampler which accounts for accessibility constraints induced by the fractal structure of the loss landscape. We verify our results experimentally by examining the diffusion of weights during training. These results offer insight into the factors which determine the learning process, and seemingly answer the question of how SGD and purely Bayesian sampling are related.

[AI-10] Entropy-guided sequence weighting for efficient exploration in RL-based LLM fine-tuning

链接: https://arxiv.org/abs/2503.22456
作者: Abdullah Vanlioglu
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:We introduce Entropy-Guided Sequence Weighting (EGSW), a novel approach that enhances the exploration-exploitation tradeoff by dynamically assigning weights to generated outputs based on their advantage and entropy for Reinforcement Learning-based Large Language Model fine-tuning. EGSW integrates entropy regularization with advantage-based weighting to balance policy updates, enabling efficient exploration in high-dimensional state spaces. By employing temperature-scaled softmax weighting over sequences, EGSW prioritizes high-reward, high-uncertainty steps while maintaining training stability. Although originally developed to improve Group Relative Policy Optimization (GRPO) during large language model (LLM) fine-tuning, EGSW is generalizable to other reinforcement learning (RL) algorithms and can be implemented in both step-wise and trajectory-wise settings. Empirical evaluations demonstrate that EGSW enhances GRPO reasoning ability, yielding improvements in sample efficiency. Future work will explore the application of EGSW to advanced RL methodologies.
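
下面示意“按优势与熵做温度缩放 softmax 加权”的核心思路;优势与熵的组合方式及系数均为演示假设,并非论文的精确公式。

```python
import torch

def egsw_weights(advantages, entropies, alpha=0.5, temperature=1.0):
    """advantages/entropies: (G,) 每条生成序列的优势与平均 token 熵。
    返回温度缩放 softmax 权重: 高优势且高不确定性的序列获得更大权重。"""
    score = advantages + alpha * entropies
    return torch.softmax(score / temperature, dim=0)

def weighted_policy_loss(logprobs, advantages, entropies):
    """logprobs: (G,) 各序列的对数概率之和(简化的 REINFORCE 形式)。"""
    w = egsw_weights(advantages, entropies).detach()
    return -(w * advantages * logprobs).sum()

adv = torch.tensor([0.8, -0.2, 0.5])
ent = torch.tensor([1.2, 0.3, 0.9])
lp = torch.tensor([-5.0, -4.0, -6.0], requires_grad=True)
loss = weighted_policy_loss(lp, adv, ent)
loss.backward()
print(egsw_weights(adv, ent))
```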

[AI-11] A Causal Framework to Measure and Mitigate Non-binary Treatment Discrimination

链接: https://arxiv.org/abs/2503.22454
作者: Ayan Majumdar,Deborah D. Kanubala,Kavya Gupta,Isabel Valera
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 24 pages, 5 figures

点击查看摘要

Abstract:Fairness studies of algorithmic decision-making systems often simplify complex decision processes, such as bail or loan approvals, into binary classification tasks. However, these approaches overlook that such decisions are not inherently binary (e.g., approve or not approve bail or loan); they also involve non-binary treatment decisions (e.g., bail conditions or loan terms) that can influence the downstream outcomes (e.g., loan repayment or reoffending). In this paper, we argue that non-binary treatment decisions are integral to the decision process and controlled by decision-makers and, therefore, should be central to fairness analyses in algorithmic decision-making. We propose a causal framework that extends fairness analyses and explicitly distinguishes between decision-subjects’ covariates and the treatment decisions. This specification allows decision-makers to use our framework to (i) measure treatment disparity and its downstream effects in historical data and, using counterfactual reasoning, (ii) mitigate the impact of past unfair treatment decisions when automating decision-making. We use our framework to empirically analyze four widely used loan approval datasets to reveal potential disparity in non-binary treatment decisions and their discriminatory impact on outcomes, highlighting the need to incorporate treatment decisions in fairness assessments. Moreover, by intervening in treatment decisions, we show that our framework effectively mitigates treatment discrimination from historical data to ensure fair risk score estimation and (non-binary) decision-making processes that benefit all stakeholders.

[AI-12] raining Large Language Models for Advanced Typosquatting Detection

链接: https://arxiv.org/abs/2503.22406
作者: Jackson Welch
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Networking and Internet Architecture (cs.NI)
*备注: 6 pages, 1 figure

点击查看摘要

Abstract:Typosquatting is a long-standing cyber threat that exploits human error in typing URLs to deceive users, distribute malware, and conduct phishing attacks. With the proliferation of domain names and new Top-Level Domains (TLDs), typosquatting techniques have grown more sophisticated, posing significant risks to individuals, businesses, and national cybersecurity infrastructure. Traditional detection methods primarily focus on well-known impersonation patterns, leaving gaps in identifying more complex attacks. This study introduces a novel approach leveraging large language models (LLMs) to enhance typosquatting detection. By training an LLM on character-level transformations and pattern-based heuristics rather than domain-specific data, a more adaptable and resilient detection mechanism develops. Experimental results indicate that the Phi-4 14B model outperformed other tested models when properly fine tuned achieving a 98% accuracy rate with only a few thousand training samples. This research highlights the potential of LLMs in cybersecurity applications, specifically in mitigating domain-based deception tactics, and provides insights into optimizing machine learning strategies for threat detection.
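
下面列出几类常见的字符级 typosquatting 变换(省略、重复、相邻交换、同形替换),可用于构造此类训练样本;该变换集合为通用启发式示意,并非论文的数据构造流程。

```python
def typo_variants(domain):
    """为一个域名生成若干字符级变体(省略/重复/相邻交换/同形替换)。"""
    homoglyphs = {"o": "0", "l": "1", "i": "1", "e": "3"}  # 常见同形字符(演示用)
    name, _, tld = domain.partition(".")
    variants = set()
    for i in range(len(name)):
        variants.add(name[:i] + name[i+1:])                # 省略一个字符
        variants.add(name[:i] + name[i] + name[i:])        # 重复一个字符
        if i + 1 < len(name):
            swapped = list(name)
            swapped[i], swapped[i+1] = swapped[i+1], swapped[i]
            variants.add("".join(swapped))                 # 相邻字符交换
        if name[i] in homoglyphs:
            variants.add(name[:i] + homoglyphs[name[i]] + name[i+1:])  # 同形替换
    variants.discard(name)
    return sorted(v + "." + tld for v in variants if v)

print(typo_variants("google.com")[:10])
```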

[AI-13] On-site estimation of battery electrochemical parameters via transfer learning based physics-informed neural network approach

链接: https://arxiv.org/abs/2503.22396
作者: Josu Yeregui,Iker Lopetegi,Sergio Fernandez,Erik Garayalde,Unai Iraola
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:This paper presents a novel physical parameter estimation framework for on-site model characterization, using a two-phase modelling strategy with Physics-Informed Neural Networks (PINNs) and transfer learning (TL). In the first phase, a PINN is trained using only the physical principles of the single particle model (SPM) equations. In the second phase, the majority of the PINN parameters are frozen, while critical electrochemical parameters are set as trainable and adjusted using real-world voltage profile data. The proposed approach significantly reduces computational costs, making it suitable for real-time implementation on Battery Management Systems (BMS). Additionally, as the initial phase does not require field data, the model is easy to deploy with minimal setup requirements. With the proposed methodology, we have been able to effectively estimate relevant electrochemical parameters with operating data. This has been proved estimating diffusivities and active material volume fractions with charge data in different degradation conditions. The methodology is experimentally validated in a Raspberry Pi device using data from a standard charge profile with a 3.89% relative accuracy estimating the active material volume fractions of a NMC cell with 82.09% of its nominal capacity.
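
下面示意第二阶段迁移学习的参数设置:冻结 PINN 主干权重,仅将关键电化学参数设为可训练并用实测电压拟合;网络结构、特征与参数名(扩散系数、活性材料体积分数)均为演示占位,并非论文的 SPM 实现。

```python
import torch
import torch.nn as nn

class SPMSurrogate(nn.Module):
    """简化占位: 用小型网络近似 SPM 的电压响应, 电化学参数作为可训练标量。"""
    def __init__(self):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(3, 64), nn.Tanh(), nn.Linear(64, 1))
        self.log_diffusivity = nn.Parameter(torch.tensor(-12.0))  # 假设的可训练扩散系数(取对数)
        self.eps_am = nn.Parameter(torch.tensor(0.5))             # 假设的活性材料体积分数
    def forward(self, feats):
        # feats: (B,1) 工况特征(如时间/电流, 此处仅作占位)
        params = torch.stack([self.log_diffusivity, self.eps_am]).expand(feats.size(0), 2)
        return self.backbone(torch.cat([feats, params], dim=-1))

model = SPMSurrogate()
# 第一阶段(仅用物理方程训练)完成后: 冻结主干, 仅训练电化学参数
for p in model.backbone.parameters():
    p.requires_grad_(False)
opt = torch.optim.Adam([model.log_diffusivity, model.eps_am], lr=1e-2)

feats = torch.rand(32, 1)    # 占位: 实测充电工况特征
v_meas = torch.rand(32, 1)   # 占位: 实测电压曲线
for _ in range(20):
    loss = ((model(feats) - v_meas) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()
print(float(model.log_diffusivity), float(model.eps_am))
```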

[AI-14] Shapley Revisited: Tractable Responsibility Measures for Query Answers PODS’25

链接: https://arxiv.org/abs/2503.22358
作者: Meghyn Bienvenu,Diego Figueira,Pierre Lafourcade
类目: Databases (cs.DB); Artificial Intelligence (cs.AI)
*备注: Long version of PODS’25 paper

点击查看摘要

Abstract:The Shapley value, originating from cooperative game theory, has been employed to define responsibility measures that quantify the contributions of database facts to obtaining a given query answer. For non-numeric queries, this is done by considering a cooperative game whose players are the facts and whose wealth function assigns 1 or 0 to each subset of the database, depending on whether the query answer holds in the given subset. While conceptually simple, this approach suffers from a notable drawback: the problem of computing such Shapley values is #P-hard in data complexity, even for simple conjunctive queries. This motivates us to revisit the question of what constitutes a reasonable responsibility measure and to introduce a new family of responsibility measures – weighted sums of minimal supports (WSMS) – which satisfy intuitive properties. Interestingly, while the definition of WSMSs is simple and bears no obvious resemblance to the Shapley value formula, we prove that every WSMS measure can be equivalently seen as the Shapley value of a suitably defined cooperative game. Moreover, WSMS measures enjoy tractable data complexity for a large class of queries, including all unions of conjunctive queries. We further explore the combined complexity of WSMS computation and establish (in)tractability results for various subclasses of conjunctive queries.

[AI-15] CPPO: Accelerating the Training of Group Relative Policy Optimization-Based Reasoning Models

链接: https://arxiv.org/abs/2503.22342
作者: Zhihang Lin,Mingbao Lin,Yuan Xie,Rongrong Ji
类目: Artificial Intelligence (cs.AI)
*备注: 16 pages

点击查看摘要

Abstract:This paper introduces Completion Pruning Policy Optimization (CPPO) to accelerate the training of reasoning models based on Group Relative Policy Optimization (GRPO). GRPO, while effective, incurs high training costs due to the need for sampling multiple completions for each question. Our experiment and theoretical analysis reveal that the number of completions impacts model accuracy yet increases training time multiplicatively, and not all completions contribute equally to policy training – their contribution depends on their relative advantage. To address these issues, we propose CPPO, which prunes completions with low absolute advantages, significantly reducing the number needed for gradient calculation and updates. Additionally, we introduce a dynamic completion allocation strategy to maximize GPU utilization by incorporating additional questions, further enhancing training efficiency. Experimental results demonstrate that CPPO achieves up to 8.32× speedup on GSM8K and 3.51× on Math while preserving or even enhancing the accuracy compared to the original GRPO. We release our code at this https URL.
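
下面示意“组内相对优势 + 低绝对优势剪枝”的核心步骤;保留比例与奖励数据均为演示假设。

```python
import torch

def group_relative_advantage(rewards):
    """GRPO 风格的组内相对优势: (r - mean) / std。rewards: (G,)"""
    return (rewards - rewards.mean()) / (rewards.std() + 1e-6)

def prune_completions(rewards, keep_ratio=0.5):
    """仅保留绝对优势最大的前 keep_ratio 比例 completions, 返回其索引与优势。"""
    adv = group_relative_advantage(rewards)
    k = max(1, int(keep_ratio * rewards.numel()))
    idx = torch.topk(adv.abs(), k).indices
    return idx, adv[idx]

rewards = torch.tensor([1.0, 0.9, 0.1, 0.0, 0.95, 0.05])
idx, adv = prune_completions(rewards)
print(idx.tolist(), adv)   # 只有被保留的 completions 参与梯度计算
```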

[AI-16] Machine Learning Models for Soil Parameter Prediction Based on Satellite Weather Clay and Yield Data

链接: https://arxiv.org/abs/2503.22276
作者: Calvin Kammerlander,Viola Kolb,Marinus Luegmair,Lou Scheermann,Maximilian Schmailzl,Marco Seufert,Jiayun Zhang,Denis Dalic,Torsten Schön
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: This technical report is the documentation of a student project collaboration between Technische Hochschule Ingolstadt and MI4People

点击查看摘要

Abstract:Efficient nutrient management and precise fertilization are essential for advancing modern agriculture, particularly in regions striving to optimize crop yields sustainably. The AgroLens project endeavors to address this challenge by developing Machine Learning (ML)-based methodologies to predict soil nutrient levels without reliance on laboratory tests. By leveraging state of the art techniques, the project lays a foundation for actionable insights to improve agricultural productivity in resource-constrained areas, such as Africa. The approach begins with the development of a robust European model using the LUCAS Soil dataset and Sentinel-2 satellite imagery to estimate key soil properties, including phosphorus, potassium, nitrogen, and pH levels. This model is then enhanced by integrating supplementary features, such as weather data, harvest rates, and Clay AI-generated embeddings. This report details the methodological framework, data preprocessing strategies, and ML pipelines employed in this project. Advanced algorithms, including Random Forests, Extreme Gradient Boosting (XGBoost), and Fully Connected Neural Networks (FCNN), were implemented and finetuned for precise nutrient prediction. Results showcase robust model performance, with root mean square error values meeting stringent accuracy thresholds. By establishing a reproducible and scalable pipeline for soil nutrient prediction, this research paves the way for transformative agricultural applications, including precision fertilization and improved resource allocation in underresourced regions like Africa.

[AI-17] Beyond the Script: Testing LLM s for Authentic Patient Communication Styles in Healthcare

链接: https://arxiv.org/abs/2503.22250
作者: Anna Bodonhelyi,Christian Stegemann-Philipps,Alessandra Sonanini,Lea Herschbach,Márton Szép,Anne Herrmann-Werner,Teresa Festl-Wietek,Enkelejda Kasneci,Friederike Holderried
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Effective patient communication is pivotal in healthcare, yet traditional medical training often lacks exposure to diverse, challenging interpersonal dynamics. To bridge this gap, this study proposes the use of Large Language Models (LLMs) to simulate authentic patient communication styles, specifically the “accuser” and “rationalizer” personas derived from the Satir model, while also ensuring multilingual applicability to accommodate diverse cultural contexts and enhance accessibility for medical professionals. Leveraging advanced prompt engineering, including behavioral prompts, author’s notes, and stubbornness mechanisms, we developed virtual patients (VPs) that embody nuanced emotional and conversational traits. Medical professionals evaluated these VPs, rating their authenticity (accuser: 3.8 ± 1.0; rationalizer: 3.7 ± 0.8 on a 5-point Likert scale (from one to five)) and correctly identifying their styles. Emotion analysis revealed distinct profiles: the accuser exhibited pain, anger, and distress, while the rationalizer displayed contemplation and calmness, aligning with predefined, detailed patient description including medical history. Sentiment scores (on a scale from zero to nine) further validated these differences in the communication styles, with the accuser adopting negative (3.1 ± 0.6) and the rationalizer more neutral (4.0 ± 0.4) tone. These results underscore LLMs’ capability to replicate complex communication styles, offering transformative potential for medical education. This approach equips trainees to navigate challenging clinical scenarios by providing realistic, adaptable patient interactions, enhancing empathy and diagnostic acumen. Our findings advocate for AI-driven tools as scalable, cost-effective solutions to cultivate nuanced communication skills, setting a foundation for future innovations in healthcare training.

[AI-18] Agent -Centric Personalized Multiple Clustering with Multi-Modal LLM s ICCV2025

链接: https://arxiv.org/abs/2503.22241
作者: Ziye Chen,Yiqun Duan,Riheng Zhu,Zhenbang Sun,Mingming Gong
类目: Artificial Intelligence (cs.AI)
*备注: 10 pages, 7 figures, in submission to ICCV 2025

点击查看摘要

Abstract:Personalized multiple clustering aims to generate diverse partitions of a dataset based on different user-specific aspects, rather than a single clustering. It has recently drawn research interest for accommodating varying user preferences. Recent approaches primarily use CLIP embeddings with proxy learning to extract representations biased toward user clustering preferences. However, CLIP primarily focuses on coarse image-text alignment, lacking a deep contextual understanding of user interests. To overcome these limitations, we propose an agent-centric personalized clustering framework that leverages multi-modal large language models (MLLMs) as agents to comprehensively traverse a relational graph to search for clusters based on user interests. Due to the advanced reasoning mechanism of MLLMs, the obtained clusters align more closely with user-defined criteria than those obtained from CLIP-based representations. To reduce computational overhead, we shorten the agents’ traversal path by constructing a relational graph using user-interest-biased embeddings extracted by MLLMs. A large number of weakly connected edges can be filtered out based on embedding similarity, facilitating an efficient traversal search for agents. Experimental results show that the proposed method achieves NMI scores of 0.9667 and 0.9481 on the Card Order and Card Suits benchmarks, respectively, largely improving the SOTA model by over 140%.
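
下面示意“用兴趣偏置嵌入构建关系图,并按相似度过滤弱连接边以缩短智能体遍历路径”这一步;嵌入为随机占位,阈值为演示假设。

```python
import numpy as np

def build_relational_graph(embeddings, sim_threshold=0.6):
    """embeddings: (N, d) 兴趣偏置嵌入(此处随机占位)。
    返回邻接表: 仅保留余弦相似度高于阈值的边。"""
    X = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = X @ X.T
    np.fill_diagonal(sim, 0.0)
    return {i: np.where(sim[i] > sim_threshold)[0].tolist() for i in range(len(X))}

emb = np.random.default_rng(0).standard_normal((8, 16))
graph = build_relational_graph(emb, sim_threshold=0.2)
print(graph)
```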

[AI-19] WeatherMesh-3: Fast and accurate operational global weather forecasting

链接: https://arxiv.org/abs/2503.22235
作者: Haoxing Du,Lyna Kim,Joan Creus-Costa,Jack Michaels,Anuj Shetty,Todd Hutchinson,Christopher Riedel,John Dean
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:We present WeatherMesh-3 (WM-3), an operational transformer-based global weather forecasting system that improves the state of the art in both accuracy and computational efficiency. We introduce the following advances: 1) a latent rollout that enables arbitrary-length predictions in latent space without intermediate encoding or decoding; and 2) a modular architecture that flexibly utilizes mixed-horizon processors and encodes multiple real-time analyses to create blended initial conditions. WM-3 generates 14-day global forecasts at 0.25-degree resolution in 12 seconds on a single RTX 4090. This represents a 100,000-fold speedup over traditional NWP approaches while achieving superior accuracy with up to 37.7% improvement in RMSE over operational models, requiring only a single consumer-grade GPU for deployment. We aim for WM-3 to democratize weather forecasting by providing an accessible, lightweight model for operational use while pushing the performance boundaries of machine learning-based weather prediction.
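
The latent rollout idea — encode once, step repeatedly in latent space, and decode only at the end — can be sketched as follows. The linear maps, dimensions, and step count are invented placeholders, not WM-3 components.

```python
# Minimal sketch of a latent rollout: no intermediate encoding or decoding
# is performed between forecast steps; only the final latent state is decoded.
import numpy as np

rng = np.random.default_rng(0)
d_obs, d_latent = 32, 16
W_enc = rng.normal(size=(d_latent, d_obs)) * 0.1    # stand-in encoder
W_proc = rng.normal(size=(d_latent, d_latent)) * 0.1  # stand-in latent processor
W_dec = rng.normal(size=(d_obs, d_latent)) * 0.1    # stand-in decoder

def forecast(x0: np.ndarray, n_steps: int) -> np.ndarray:
    z = W_enc @ x0                     # encode once
    for _ in range(n_steps):
        z = np.tanh(W_proc @ z)        # roll forward purely in latent space
    return W_dec @ z                   # decode only the final state

x0 = rng.normal(size=d_obs)            # toy initial atmospheric state
print(forecast(x0, n_steps=56).shape)  # e.g. 56 six-hour steps for a 14-day horizon
```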

[AI-20] MFH: A Multi-faceted Heuristic Algorithm Selection Approach for Software Verification

链接: https://arxiv.org/abs/2503.22228
作者: Jie Su,Liansai Deng,Cheng Wen,Rong Wang,Zhi Ma,Nan Zhang,Cong Tian,Zhenhua Duan,Shengchao Qin
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: The implementation, along with all relevant publicly available data, can be accessed on the Figshare platform: this https URL

点击查看摘要

Abstract:Currently, many verification algorithms are available to improve the reliability of software systems. Selecting the appropriate verification algorithm typically demands domain expertise and non-trivial manpower. An automated algorithm selector is thus desired. However, existing selectors, which rely on either machine-learned strategies or manually designed heuristics, encounter issues such as reliance on high-quality samples with algorithm labels and limited scalability. In this paper, an automated algorithm selection approach, namely MFH, is proposed for software verification. Our approach leverages the heuristic that verifiers producing correct results typically implement appropriate algorithms, and that the algorithms supported by these verifiers indirectly reflect which ones are potentially applicable. Specifically, MFH embeds the code property graph (CPG) of a semantic-preserving transformed program to enhance the robustness of the prediction model. Furthermore, our approach decomposes the selection task into the sub-tasks of predicting potentially applicable algorithms and matching the most appropriate verifiers. Additionally, MFH also introduces a feedback loop on incorrect predictions to improve model prediction accuracy. We evaluate MFH on 20 verifiers and over 15,000 verification tasks. Experimental results demonstrate the effectiveness of MFH, achieving a prediction accuracy of 91.47% even without ground truth algorithm labels provided during the training phase. Moreover, the prediction accuracy decreases only by 0.84% when introducing 10 new verifiers, indicating the strong scalability of the proposed approach.

[AI-21] E-person Architecture and Framework for Human-AI Co-adventure Relationship

链接: https://arxiv.org/abs/2503.22181
作者: Kanako Esaki,Tadayuki Matsumura,Yang Shao,Hiroyuki Mizuno
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
*备注: 24 pages, 4 figures, 1 table

点击查看摘要

Abstract:This paper proposes the e-person architecture for constructing a unified and incremental development of AI ethics. The e-person architecture takes the reduction of uncertainty through collaborative cognition and action with others as a unified basis for ethics. By classifying and defining uncertainty along two axes - (1) first, second, and third person perspectives, and (2) the difficulty of inference based on the depth of information - we support the development of unified and incremental development of AI ethics. In addition, we propose the e-person framework based on the free energy principle, which considers the reduction of uncertainty as a unifying principle of brain function, with the aim of implementing the e-person architecture, and we show our previous works and future challenges based on the proposed framework.

[AI-22] When Autonomy Breaks: The Hidden Existential Risk of AI

链接: https://arxiv.org/abs/2503.22151
作者: Joshua Krook
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
*备注:

点击查看摘要

Abstract:AI risks are typically framed around physical threats to humanity, a loss of control or an accidental error causing humanity’s extinction. However, I argue in line with the gradual disempowerment thesis, that there is an underappreciated risk in the slow and irrevocable decline of human autonomy. As AI starts to outcompete humans in various areas of life, a tipping point will be reached where it no longer makes sense to rely on human decision-making, creativity, social care or even leadership. What may follow is a process of gradual de-skilling, where we lose skills that we currently take for granted. Traditionally, it is argued that AI will gain human skills over time, and that these skills are innate and immutable in humans. By contrast, I argue that humans may lose such skills as critical thinking, decision-making and even social care in an AGI world. The biggest threat to humanity is therefore not that machines will become more like humans, but that humans will become more like machines.

[AI-23] Integrating Artificial Intelligence with Human Expertise: An In-depth Analysis of ChatGPT's Capabilities in Generating Metamorphic Relations

链接: https://arxiv.org/abs/2503.22141
作者: Yifan Zhang(1),Dave Towey(1),Matthew Pike(1),Quang-Hung Luu(2),Huai Liu(2),Tsong Yueh Chen(2) ((1) University of Nottingham Ningbo China, (2) Swinburne University of Technology)
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
*备注: Submitted to Information and Software Technology

点击查看摘要

Abstract:Context: This paper provides an in-depth examination of the generation and evaluation of Metamorphic Relations (MRs) using GPT models developed by OpenAI, with a particular focus on the capabilities of GPT-4 in software testing environments. Objective: The aim is to examine the quality of MRs produced by GPT-3.5 and GPT-4 for a specific System Under Test (SUT) adopted from an earlier study, and to introduce and apply an improved set of evaluation criteria for a diverse range of SUTs. Method: The initial phase evaluates MRs generated by GPT-3.5 and GPT-4 using criteria from a prior study, followed by an application of an enhanced evaluation framework on MRs created by GPT-4 for a diverse range of nine SUTs, varying from simple programs to complex systems incorporating AI/ML components. A custom-built GPT evaluator, alongside human evaluators, assessed the MRs, enabling a direct comparison between automated and human evaluation methods. Results: The study finds that GPT-4 outperforms GPT-3.5 in generating accurate and useful MRs. With the advanced evaluation criteria, GPT-4 demonstrates a significant ability to produce high-quality MRs across a wide range of SUTs, including complex systems incorporating AI/ML components. Conclusions: GPT-4 exhibits advanced capabilities in generating MRs suitable for various applications. The research underscores the growing potential of AI in software testing, particularly in the generation and evaluation of MRs, and points towards the complementarity of human and AI skills in this domain.
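
For readers unfamiliar with metamorphic relations, a minimal worked example (not taken from the paper's SUTs) shows the kind of property the GPT models are asked to generate: the output for a transformed input must relate to the original output in a known way, so no oracle for any single input is needed.

```python
# Illustrative metamorphic relation for sine: sin(x) == sin(pi - x).
# The test checks a relation between two executions, not an absolute expected value.
import math

def check_sine_mr(x: float, tol: float = 1e-12) -> bool:
    source_output = math.sin(x)
    followup_output = math.sin(math.pi - x)   # metamorphically transformed input
    return abs(source_output - followup_output) <= tol

assert all(check_sine_mr(x / 10) for x in range(-50, 51))
```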

[AI-24] Sharpe Ratio-Guided Active Learning for Preference Optimization in RLHF

链接: https://arxiv.org/abs/2503.22137
作者: Syrine Belakaria,Joshua Kazdan,Charles Marx,Chris Cundy,Willie Neiswanger,Sanmi Koyejo,Barbara E. Engelhardt,Stefano Ermon
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Reinforcement learning from human feedback (RLHF) has become a cornerstone of the training and alignment pipeline for large language models (LLMs). Recent advances, such as direct preference optimization (DPO), have simplified the preference learning step. However, collecting preference data remains a challenging and costly process, often requiring expert annotation. This cost can be mitigated by carefully selecting the data points presented for annotation. In this work, we propose an active learning approach to efficiently select prompt and preference pairs using a risk assessment strategy based on the Sharpe Ratio. To address the challenge of unknown preferences prior to annotation, our method evaluates the gradients of all potential preference annotations to assess their impact on model updates. These gradient-based evaluations enable risk assessment of data points regardless of the annotation outcome. By leveraging the DPO loss derivations, we derive a closed-form expression for computing these Sharpe ratios on a per-tuple basis, ensuring our approach remains both tractable and computationally efficient. We also introduce two variants of our method, each making different assumptions about prior information. Experimental results demonstrate that our method outperforms the baseline by up to 5% in win rates against the chosen completion with limited human preference data across several language models and real-world datasets.
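
A minimal sketch of the selection rule, assuming we already have, for each candidate prompt/response pair, an estimated "gain" under each possible annotation outcome. The paper derives these from DPO gradients in closed form; the random numbers below are placeholders for those estimates.

```python
# Minimal sketch of Sharpe-ratio-guided selection: rank candidate pairs by
# mean estimated gain over annotation outcomes divided by its spread, then
# send the top-ranked pairs for human annotation.
import numpy as np

def sharpe_scores(gains: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    # gains: (n_candidates, n_outcomes), e.g. outcomes = {A preferred, B preferred}
    return gains.mean(axis=1) / (gains.std(axis=1) + eps)

rng = np.random.default_rng(0)
candidate_gains = rng.random((100, 2))            # placeholder per-outcome impact estimates
selected = np.argsort(-sharpe_scores(candidate_gains))[:10]
print("indices selected for annotation:", selected)
```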

[AI-25] A Proposal for Networks Capable of Continual Learning ICLR2025

链接: https://arxiv.org/abs/2503.22068
作者: Zeki Doruk Erden,Boi Faltings
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: Published at ICLR 2025 World Models Workshop

点击查看摘要

Abstract:We analyze the ability of computational units to retain past responses after parameter updates, a key property for system-wide continual learning. Neural networks trained with gradient descent lack this capability, prompting us to propose Modelleyen, an alternative approach with inherent response preservation. We demonstrate through experiments on modeling the dynamics of a simple environment and on MNIST that, despite increased computational complexity and some representational limitations at its current stage, Modelleyen achieves continual learning without relying on sample replay or predefined task boundaries.

[AI-26] Multi-Task Semantic Communications via Large Models

链接: https://arxiv.org/abs/2503.22064
作者: Wanli Ni,Zhijin Qin,Haofeng Sun,Xiaoming Tao,Zhu Han
类目: Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
*备注: 7 pages, 6 figures

点击查看摘要

Abstract:Artificial intelligence (AI) promises to revolutionize the design, optimization and management of next-generation communication systems. In this article, we explore the integration of large AI models (LAMs) into semantic communications (SemCom) by leveraging their multi-modal data processing and generation capabilities. Although LAMs bring unprecedented abilities to extract semantics from raw data, this integration entails multifaceted challenges including high resource demands, model complexity, and the need for adaptability across diverse modalities and tasks. To overcome these challenges, we propose a LAM-based multi-task SemCom (MTSC) architecture, which includes an adaptive model compression strategy and a federated split fine-tuning approach to facilitate the efficient deployment of LAM-based semantic models in resource-limited networks. Furthermore, a retrieval-augmented generation scheme is implemented to synthesize the most recent local and global knowledge bases to enhance the accuracy of semantic extraction and content generation, thereby improving the inference performance. Finally, simulation results demonstrate the efficacy of the proposed LAM-based MTSC architecture, highlighting the performance enhancements across various downstream tasks under varying channel conditions.

[AI-27] Safeguarding Autonomy: a Focus on Machine Learning Decision Systems

链接: https://arxiv.org/abs/2503.22023
作者: Paula Subías-Beltrán,Oriol Pujol,Itziar de Lecuona
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:As global discourse on AI regulation gains momentum, this paper focuses on delineating the impact of ML on autonomy and fostering awareness. Respect for autonomy is a basic principle in bioethics that establishes persons as decision-makers. While the concept of autonomy in the context of ML appears in several European normative publications, it remains a theoretical concept that has yet to be widely accepted in ML practice. Our contribution is to bridge the theoretical and practical gap by encouraging the practical application of autonomy in decision-making within ML practice by identifying the conditioning factors that currently prevent it. Consequently, we focus on the different stages of the ML pipeline to identify the potential effects on ML end-users’ autonomy. To improve its practical utility, we propose a related question for each detected impact, offering guidance for identifying possible focus points to respect ML end-users autonomy in decision-making.

[AI-28] Pretrained Bayesian Non-parametric Knowledge Prior in Robotic Long-Horizon Reinforcement Learning

链接: https://arxiv.org/abs/2503.21975
作者: Yuan Meng,Xiangtong Yao,Kejia Chen,Yansong Wu,Liding Zhang,Zhenshan Bing,Alois Knoll
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
*备注: initial upload 8 pages

点击查看摘要

Abstract:Reinforcement learning (RL) methods typically learn new tasks from scratch, often disregarding prior knowledge that could accelerate the learning process. While some methods incorporate previously learned skills, they usually rely on a fixed structure, such as a single Gaussian distribution, to define skill priors. This rigid assumption can restrict the diversity and flexibility of skills, particularly in complex, long-horizon tasks. In this work, we introduce a method that models potential primitive skill motions as having non-parametric properties with an unknown number of underlying features. We utilize a Bayesian non-parametric model, specifically Dirichlet Process Mixtures, enhanced with birth and merge heuristics, to pre-train a skill prior that effectively captures the diverse nature of skills. Additionally, the learned skills are explicitly trackable within the prior space, enhancing interpretability and control. By integrating this flexible skill prior into an RL framework, our approach surpasses existing methods in long-horizon manipulation tasks, enabling more efficient skill transfer and task success in complex environments. Our findings show that a richer, non-parametric representation of skill priors significantly improves both the learning and execution of challenging robotic tasks. All data, code, and videos are available at this https URL.
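
A minimal sketch of a Bayesian non-parametric skill prior, assuming primitive motions are summarized as fixed-length feature vectors. It uses scikit-learn's BayesianGaussianMixture with a Dirichlet-process prior as a stand-in; the component cap and threshold are assumptions, and the paper's birth/merge heuristics are not shown.

```python
# Minimal sketch: fit a Dirichlet-process mixture so the number of effective
# skill components is inferred from data rather than fixed in advance.
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

rng = np.random.default_rng(0)
# Toy "motion features": three underlying skill clusters in 4 dimensions.
motions = np.vstack([rng.normal(loc=c, scale=0.2, size=(200, 4)) for c in (-2, 0, 3)])

prior = BayesianGaussianMixture(
    n_components=10,                                   # upper bound, not the true count
    weight_concentration_prior_type="dirichlet_process",
    random_state=0,
).fit(motions)

print("effective skills:", np.sum(prior.weights_ > 1e-2))  # components with non-negligible weight
```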

[AI-29] Data-Agnostic Robotic Long-Horizon Manipulation with Vision-Language-Guided Closed-Loop Feedback

链接: https://arxiv.org/abs/2503.21969
作者: Yuan Meng,Xiangtong Yao,Haihui Ye,Yirui Zhou,Shengqiang Zhang,Zhenshan Bing,Alois Knoll
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
*备注: initial upload 8 page

点击查看摘要

Abstract:Recent advances in language-conditioned robotic manipulation have leveraged imitation and reinforcement learning to enable robots to execute tasks from human commands. However, these methods often suffer from limited generalization, adaptability, and the lack of large-scale specialized datasets, unlike data-rich domains such as computer vision, making long-horizon task execution challenging. To address these gaps, we introduce DAHLIA, a data-agnostic framework for language-conditioned long-horizon robotic manipulation, leveraging large language models (LLMs) for real-time task planning and execution. DAHLIA employs a dual-tunnel architecture, where an LLM-powered planner collaborates with co-planners to decompose tasks and generate executable plans, while a reporter LLM provides closed-loop feedback, enabling adaptive re-planning and ensuring task recovery from potential failures. Moreover, DAHLIA integrates chain-of-thought (CoT) in task reasoning and temporal abstraction for efficient action execution, enhancing traceability and robustness. Our framework demonstrates state-of-the-art performance across diverse long-horizon tasks, achieving strong generalization in both simulated and real-world scenarios. Videos and code are available at this https URL.

[AI-30] Lobster: A GPU-Accelerated Framework for Neurosymbolic Programming

链接: https://arxiv.org/abs/2503.21937
作者: Paul Biberstein,Ziyang Li,Joseph Devietti,Mayur Naik
类目: Programming Languages (cs.PL); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Neurosymbolic programs combine deep learning with symbolic reasoning to achieve better data efficiency, interpretability, and generalizability compared to standalone deep learning approaches. However, existing neurosymbolic learning frameworks implement an uneasy marriage between a highly scalable, GPU-accelerated neural component with a slower symbolic component that runs on CPUs. We propose Lobster, a unified framework for harnessing GPUs in an end-to-end manner for neurosymbolic learning. Lobster maps a general neurosymbolic language based on Datalog to the GPU programming paradigm. This mapping is implemented via compilation to a new intermediate language called APM. The extra abstraction provided by APM allows Lobster to be both flexible, supporting discrete, probabilistic, and differentiable modes of reasoning on GPU hardware with a library of provenance semirings, and performant, implementing new optimization passes. We demonstrate that Lobster programs can solve interesting problems spanning the domains of natural language processing, image processing, program reasoning, bioinformatics, and planning. On a suite of 8 applications, Lobster achieves an average speedup of 5.3x over Scallop, a state-of-the-art neurosymbolic framework, and enables scaling of neurosymbolic solutions to previously infeasible tasks.

[AI-31] An Efficient Training Algorithm for Models with Block-wise Sparsity

链接: https://arxiv.org/abs/2503.21928
作者: Ding Zhu,Zhiqun Zuo,Mohammad Mahdi Khalili
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 24 pages, submitted on Transactions on Machine Learning Research

点击查看摘要

Abstract:Large-scale machine learning (ML) models are increasingly being used in critical domains like education, lending, recruitment, healthcare, criminal justice, etc. However, the training, deployment, and utilization of these models demand substantial computational resources. To decrease computation and memory costs, machine learning models with sparse weight matrices are widely used in the literature. Among sparse models, those with special sparse structures (e.g., models with block-wise sparse weight matrices) fit better with the hardware accelerators and can decrease the memory and computation costs during the inference. Unfortunately, while there are several efficient training methods, none of them are designed to train a block-wise sparse model efficiently. As a result, the current methods for training block-wise sparse models start with full and dense models leading to inefficient training. In this work, we focus on training models with block-wise sparse matrices and propose an efficient training algorithm to decrease both computation and memory costs during training and inference. In addition, we will show that our proposed method enables us to efficiently find the right block size for the sparsity pattern during the training process. Our extensive empirical and theoretical analyses show that our algorithms can decrease the computation and memory costs significantly without a performance drop compared to baselines.
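
A minimal sketch of what a block-wise sparsity pattern looks like: the weight matrix is partitioned into fixed-size blocks and whole blocks are zeroed. The magnitude-based pruning rule and block size below are illustrative assumptions, not the paper's training algorithm.

```python
# Minimal sketch: keep only the blocks with the largest Frobenius norm,
# producing a block-wise sparse weight matrix friendly to hardware accelerators.
import numpy as np

def blockwise_prune(W: np.ndarray, block: int, keep_ratio: float) -> np.ndarray:
    rows, cols = W.shape
    assert rows % block == 0 and cols % block == 0
    blocks = W.reshape(rows // block, block, cols // block, block)
    norms = np.linalg.norm(blocks, axis=(1, 3))        # per-block Frobenius norm
    k = int(np.ceil(keep_ratio * norms.size))
    cutoff = np.sort(norms, axis=None)[-k]             # keep the k largest blocks
    mask = (norms >= cutoff)[:, None, :, None]
    return (blocks * mask).reshape(rows, cols)

W = np.random.randn(8, 8)
print(np.count_nonzero(blockwise_prune(W, block=4, keep_ratio=0.5)))  # roughly half the entries remain
```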

[AI-32] Is Best-of-N the Best of Them? Coverage Scaling and Optimality in Inference-Time Alignment

链接: https://arxiv.org/abs/2503.21878
作者: Audrey Huang,Adam Block,Qinghua Liu,Nan Jiang,Dylan J. Foster,Akshay Krishnamurthy
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Inference-time computation provides an important axis for scaling language model performance, but naively scaling compute through techniques like Best-of-N sampling can cause performance to degrade due to reward hacking. Toward a theoretical understanding of how to best leverage additional computation, we focus on inference-time alignment, which we formalize as the problem of improving a pre-trained policy’s responses for a prompt of interest, given access to an imperfect reward model. We analyze the performance of inference-time alignment algorithms in terms of (i) response quality, and (ii) compute, and provide new results that highlight the importance of the pre-trained policy’s coverage over high-quality responses for performance and compute scaling: 1. We show that Best-of-N alignment with an ideal choice for N can achieve optimal performance under stringent notions of coverage, but provably suffers from reward hacking when N is large, and fails to achieve tight guarantees under more realistic coverage conditions. 2. We introduce InferenceTimePessimism, a new algorithm which mitigates reward hacking through deliberate use of inference-time compute, implementing the principle of pessimism in the face of uncertainty via rejection sampling; we prove that its performance is optimal and does not degrade with N, meaning it is scaling-monotonic. We complement our theoretical results with an experimental evaluation that demonstrates the benefits of InferenceTimePessimism across a variety of tasks and models.
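
A minimal sketch of plain Best-of-N sampling, the baseline the paper analyzes (not the proposed InferenceTimePessimism algorithm); the generator and reward model are toy placeholders. As the abstract notes, increasing N against an imperfect reward model is what invites reward hacking.

```python
# Minimal sketch: draw N candidate responses from a policy and return the one
# that the (possibly imperfect) reward model scores highest.
import random

def best_of_n(prompt: str, generate, reward, n: int) -> str:
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=reward)

# Toy example: the "policy" emits random numbers as text, the "reward" prefers larger ones.
random.seed(0)
generate = lambda prompt: f"{prompt}: {random.random():.3f}"
reward = lambda response: float(response.split(": ")[1])
print(best_of_n("answer", generate, reward, n=8))
```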

[AI-33] ReCoM: Realistic Co-Speech Motion Generation with Recurrent Embedded Transformer

链接: https://arxiv.org/abs/2503.21847
作者: Yong Xie,Yunlian Sun,Hongwen Zhang,Yebin Liu,Jinhui Tang
类目: Graphics (cs.GR); Artificial Intelligence (cs.AI)
*备注: 8 pages, 6 figures, Project Page: this https URL

点击查看摘要

Abstract:We present ReCoM, an efficient framework for generating high-fidelity and generalizable human body motions synchronized with speech. The core innovation lies in the Recurrent Embedded Transformer (RET), which integrates Dynamic Embedding Regularization (DER) into a Vision Transformer (ViT) core architecture to explicitly model co-speech motion dynamics. This architecture enables joint spatial-temporal dependency modeling, thereby enhancing gesture naturalness and fidelity through coherent motion synthesis. To enhance model robustness, we incorporate the proposed DER strategy, which equips the model with dual capabilities of noise resistance and cross-domain generalization, thereby improving the naturalness and fluency of zero-shot motion generation for unseen speech inputs. To mitigate inherent limitations of autoregressive inference, including error accumulation and limited self-correction, we propose an iterative reconstruction inference (IRI) strategy. IRI refines motion sequences via cyclic pose reconstruction, driven by two key components: (1) classifier-free guidance improves distribution alignment between generated and real gestures without auxiliary supervision, and (2) a temporal smoothing process eliminates abrupt inter-frame transitions while ensuring kinematic continuity. Extensive experiments on benchmark datasets validate ReCoM’s effectiveness, achieving state-of-the-art performance across metrics. Notably, it reduces the Fréchet Gesture Distance (FGD) from 18.70 to 2.48, demonstrating an 86.7% improvement in motion realism. Our project page is this https URL.

[AI-34] LightSNN: Lightweight Architecture Search for Sparse and Accurate Spiking Neural Networks

链接: https://arxiv.org/abs/2503.21846
作者: Yesmine Abdennadher,Giovanni Perin,Riccardo Mazzieri,Jacopo Pegoraro,Michele Rossi
类目: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Signal Processing (eess.SP)
*备注: 6 pages, 3 figures, 2 tables. Submitted to conference

点击查看摘要

Abstract:Spiking Neural Networks (SNNs) are highly regarded for their energy efficiency, inherent activation sparsity, and suitability for real-time processing in edge devices. However, most current SNN methods adopt architectures resembling traditional artificial neural networks (ANNs), leading to suboptimal performance when applied to SNNs. While SNNs excel in energy efficiency, they have been associated with lower accuracy levels than traditional ANNs when utilizing conventional architectures. In response, in this work we present LightSNN, a rapid and efficient Neural Network Architecture Search (NAS) technique specifically tailored for SNNs that autonomously leverages the most suitable architecture, striking a good balance between accuracy and efficiency by enforcing sparsity. Based on the spiking NAS network (SNASNet) framework, a cell-based search space including backward connections is utilized to build our training-free pruning-based NAS mechanism. Our technique assesses diverse spike activation patterns across different data samples using a sparsity-aware Hamming distance fitness evaluation. Thorough experiments are conducted on both static (CIFAR10 and CIFAR100) and neuromorphic datasets (DVS128-Gesture). Our LightSNN model achieves state-of-the-art results on CIFAR10 and CIFAR100, improves performance on DVS128Gesture by 4.49%, and significantly reduces search time, most notably offering a 98x speedup over SNASNet and running 30% faster than the best existing method on DVS128Gesture.

[AI-35] LERO: LLM-driven Evolutionary framework with Hybrid Rewards and Enhanced Observation for Multi-Agent Reinforcement Learning

链接: https://arxiv.org/abs/2503.21807
作者: Yuan Wei,Xiaohan Shan,Jianmin Li
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
*备注:

点击查看摘要

Abstract:Multi-agent reinforcement learning (MARL) faces two critical bottlenecks distinct from single-agent RL: credit assignment in cooperative tasks and partial observability of environmental states. We propose LERO, a framework integrating Large language models (LLMs) with evolutionary optimization to address these MARL-specific challenges. The solution centers on two LLM-generated components: a hybrid reward function that dynamically allocates individual credit through reward decomposition, and an observation enhancement function that augments partial observations with inferred environmental context. An evolutionary algorithm optimizes these components through iterative MARL training cycles, where top-performing candidates guide subsequent LLM generations. Evaluations in Multi-Agent Particle Environments (MPE) demonstrate LERO’s superiority over baseline methods, with improved task performance and training efficiency.

[AI-36] Comparison of Metadata Representation Models for Knowledge Graph Embeddings

链接: https://arxiv.org/abs/2503.21804
作者: Shusaku Egami,Kyoumoto Matsushita,Takanori Ugai,Ken Fukuda
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
*备注: 11 pages, 9 Figures

点击查看摘要

Abstract:Hyper-relational Knowledge Graphs (HRKGs) extend traditional KGs beyond binary relations, enabling the representation of contextual, provenance, and temporal information in domains, such as historical events, sensor data, video content, and narratives. HRKGs can be structured using several Metadata Representation Models (MRMs), including Reification (REF), Singleton Property (SGP), and RDF-star (RDR). However, the effects of different MRMs on KG Embedding (KGE) and Link Prediction (LP) models remain unclear. This study evaluates MRMs in the context of LP tasks, identifies the limitations of existing evaluation frameworks, and introduces a new task that ensures fair comparisons across MRMs. Furthermore, we propose a framework that effectively reflects the knowledge representations of the three MRMs in latent space. Experiments on two types of datasets reveal that REF performs well in simple HRKGs, whereas SGP is less effective. However, in complex HRKGs, the differences among MRMs in the LP tasks are minimal. Our findings contribute to an optimal knowledge representation strategy for HRKGs in LP tasks.

[AI-37] Forecasting Volcanic Radiative Power (VPR) at Fuego Volcano Using Bayesian Regularized Neural Network

链接: https://arxiv.org/abs/2503.21803
作者: Snehamoy Chatterjee,Greg Waite,Sidike Paheding,Luke Bowman
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Signal Processing (eess.SP); Atmospheric and Oceanic Physics (physics.ao-ph)
*备注:

点击查看摘要

Abstract:Forecasting volcanic activity is critical for hazard assessment and risk mitigation. Volcanic Radiative Power (VPR), derived from thermal remote sensing data, serves as an essential indicator of volcanic activity. In this study, we employ Bayesian Regularized Neural Networks (BRNN) to predict future VPR values based on historical data from Fuego Volcano, comparing its performance against Scaled Conjugate Gradient (SCG) and Levenberg-Marquardt (LM) models. The results indicate that BRNN outperforms SCG and LM, achieving the lowest mean squared error (1.77E+16) and the highest R-squared value (0.50), demonstrating its superior ability to capture VPR variability while minimizing overfitting. Despite these promising results, challenges remain in improving the model’s predictive accuracy. Future research should focus on integrating additional geophysical parameters, such as seismic and gas emission data, to enhance forecasting precision. The findings highlight the potential of machine learning models, particularly BRNN, in advancing volcanic activity forecasting, contributing to more effective early warning systems for volcanic hazards.

[AI-38] A Novel Two-Phase Cooperative Co-evolution Framework for Large-Scale Global Optimization with Complex Overlapping GECCO2025

链接: https://arxiv.org/abs/2503.21797
作者: Wenjie Qiu,Hongshu Guo,Zeyuan Ma,Yue-Jiao Gong
类目: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI)
*备注: Accepted at ACM GECCO 2025

点击查看摘要

Abstract:Cooperative Co-evolution, through the decomposition of the problem space, is a primary approach for solving large-scale global optimization problems. Typically, when the subspaces are disjoint, the algorithms demonstrate significantly both effectiveness and efficiency compared to non-decomposition algorithms. However, the presence of overlapping variables complicates the decomposition process and adversely affects the performance of cooperative co-evolution. In this study, we propose a novel two-phase cooperative co-evolution framework to address large-scale global optimization problems with complex overlapping. An effective method for decomposing overlapping problems, grounded in their mathematical properties, is embedded within the framework. Additionally, a customizable benchmark for overlapping problems is introduced to extend existing benchmarks and facilitate experimentation. Extensive experiments demonstrate that the algorithm instantiated within our framework significantly outperforms existing algorithms. The results reveal the characteristics of overlapping problems and highlight the differing strengths of cooperative co-evolution and non-decomposition algorithms. Our work is open-source and accessible at: this https URL.

[AI-39] Threshold Adaptation in Spiking Networks Enables Shortest Path Finding and Place Disambiguation

链接: https://arxiv.org/abs/2503.21795
作者: Robin Dietrich,Tobias Fischer,Nicolai Waniek,Nico Reeb,Michael Milford,Alois Knoll,Adam D. Hines
类目: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Robotics (cs.RO)
*备注: Appears in the proceedings of the 2025 Neuro Inspired Computational Elements Conference (NICE)

点击查看摘要

Abstract:Efficient spatial navigation is a hallmark of the mammalian brain, inspiring the development of neuromorphic systems that mimic biological principles. Despite progress, implementing key operations like back-tracing and handling ambiguity in bio-inspired spiking neural networks remains an open challenge. This work proposes a mechanism for activity back-tracing in arbitrary, uni-directional spiking neuron graphs. We extend the existing replay mechanism of the spiking hierarchical temporal memory (S-HTM) by our spike timing-dependent threshold adaptation (STDTA), which enables us to perform path planning in networks of spiking neurons. We further present an ambiguity dependent threshold adaptation (ADTA) for identifying places in an environment with less ambiguity, enhancing the localization estimate of an agent. Combined, these methods enable efficient identification of the shortest path to an unambiguous target. Our experiments show that a network trained on sequences reliably computes shortest paths with fewer replays than the steps required to reach the target. We further show that we can identify places with reduced ambiguity in multiple, similar environments. These contributions advance the practical application of biologically inspired sequential learning algorithms like the S-HTM towards neuromorphic localization and navigation.

[AI-40] Architecture of Information

链接: https://arxiv.org/abs/2503.21794
作者: Yurii Parzhyn
类目: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Information Theory (cs.IT); Machine Learning (cs.LG)
*备注: 81 pages, 5 figures

点击查看摘要

Abstract:The paper explores an approach to constructing energy landscapes of a formal neuron and multilayer artificial neural networks (ANNs). Their analysis makes it possible to determine the conceptual limitations of both classification ANNs (e.g., MLP or CNN) and generative ANN models. The study of informational and thermodynamic entropy in formal neuron and ANN models leads to the conclusion about the energetic nature of informational entropy. The application of the Gibbs free energy concept allows representing the output information of ANNs as the structured part of enthalpy. Modeling ANNs as energy systems makes it possible to interpret the structure of their internal energy as an internal model of the external world, which self-organizes based on the interaction of the system’s internal energy components. The control of the self-organization and evolution process of this model is carried out through an energy function (analogous to the Lyapunov function) based on reduction operators. This makes it possible to introduce a new approach to constructing self-organizing and evolutionary ANNs with direct learning, which does not require additional external algorithms. The presented research makes it possible to formulate a formal definition of information in terms of the interaction processes between the internal and external energy of the system.

[AI-41] Input-Triggered Hardware Trojan Attack on Spiking Neural Networks

链接: https://arxiv.org/abs/2503.21793
作者: Spyridon Raptis,Paul Kling,Ioannis Kaskampas,Ihsen Alouani,Haralampos-G. Stratigopoulos
类目: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Neuromorphic computing based on spiking neural networks (SNNs) is emerging as a promising alternative to traditional artificial neural networks (ANNs), offering unique advantages in terms of low power consumption. However, the security aspect of SNNs is under-explored compared to their ANN counterparts. As the increasing reliance on AI systems comes with unique security risks and challenges, understanding the vulnerabilities and threat landscape is essential as neuromorphic computing matures. In this effort, we propose a novel input-triggered Hardware Trojan (HT) attack for SNNs. The HT mechanism is condensed in the area of one neuron. The trigger mechanism is an input message crafted in the spiking domain such that a selected neuron produces a malicious spike train that is not met in normal settings. This spike train triggers a malicious modification in the neuron that forces it to saturate, firing permanently and failing to recover to its resting state even when the input activity stops. The excessive spikes pollute the network and produce misleading decisions. We propose a methodology to select an appropriate neuron and to generate the input pattern that triggers the HT payload. The attack is illustrated by simulation on three popular benchmarks in the neuromorphic community. We also propose a hardware implementation for an analog spiking neuron and a digital SNN accelerator, demonstrating that the HT has a negligible area and power footprint and, thereby, can easily evade detection.

[AI-42] Make Some Noise: Towards LLM audio reasoning and generation using sound tokens ICASSP2025

链接: https://arxiv.org/abs/2503.22275
作者: Shivam Mehta,Nebojsa Jojic,Hannes Gamper
类目: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Sound (cs.SD)
*备注: 5 pages, 2 figures, Accepted at ICASSP 2025

点击查看摘要

Abstract:Integrating audio comprehension and generation into large language models (LLMs) remains challenging due to the continuous nature of audio and the resulting high sampling rates. Here, we introduce a novel approach that combines Variational Quantization with Conditional Flow Matching to convert audio into ultra-low bitrate discrete tokens of 0.23 kbps, allowing for seamless integration with text tokens in LLMs. We fine-tuned a pretrained text-based LLM using Low-Rank Adaptation (LoRA) to assess its effectiveness in achieving true multimodal capabilities, i.e., audio comprehension and generation. Our tokenizer outperforms a traditional VQ-VAE across various datasets with diverse acoustic events. Despite the substantial loss of fine-grained details through audio tokenization, our multimodal LLM trained with discrete tokens achieves competitive results in audio comprehension with state-of-the-art methods, though audio generation is poor. Our results highlight the need for larger, more diverse datasets and improved evaluation metrics to advance multimodal LLM performance.

[AI-43] PharmAgents: Building a Virtual Pharma with Large Language Model Agents

链接: https://arxiv.org/abs/2503.22164
作者: Bowen Gao,Yanwen Huang,Yiqiao Liu,Wenxuan Xie,Wei-Ying Ma,Ya-Qin Zhang,Yanyan Lan
类目: Biomolecules (q-bio.BM); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:The discovery of novel small molecule drugs remains a critical scientific challenge with far-reaching implications for treating diseases and advancing human health. Traditional drug development–especially for small molecule therapeutics–is a highly complex, resource-intensive, and time-consuming process that requires multidisciplinary collaboration. Recent breakthroughs in artificial intelligence (AI), particularly the rise of large language models (LLMs), present a transformative opportunity to streamline and accelerate this process. In this paper, we introduce PharmAgents, a virtual pharmaceutical ecosystem driven by LLM-based multi-agent collaboration. PharmAgents simulates the full drug discovery workflow–from target discovery to preclinical evaluation–by integrating explainable, LLM-driven agents equipped with specialized machine learning models and computational tools. Through structured knowledge exchange and automated optimization, PharmAgents identifies potential therapeutic targets, discovers promising lead compounds, enhances binding affinity and key molecular properties, and performs in silico analyses of toxicity and synthetic feasibility. Additionally, the system supports interpretability, agent interaction, and self-evolvement, enabling it to refine future drug designs based on prior experience. By showcasing the potential of LLM-powered multi-agent systems in drug discovery, this work establishes a new paradigm for autonomous, explainable, and scalable pharmaceutical research, with future extensions toward comprehensive drug lifecycle management.

[AI-44] ATP: Adaptive Threshold Pruning for Efficient Data Encoding in Quantum Neural Networks CVPR

链接: https://arxiv.org/abs/2503.21815
作者: Mohamed Afane,Gabrielle Ebbrecht,Ying Wang,Juntao Chen,Junaid Farooq
类目: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: Accepted at the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025.

点击查看摘要

Abstract:Quantum Neural Networks (QNNs) offer promising capabilities for complex data tasks, but are often constrained by limited qubit resources and high entanglement, which can hinder scalability and efficiency. In this paper, we introduce Adaptive Threshold Pruning (ATP), an encoding method that reduces entanglement and optimizes data complexity for efficient computations in QNNs. ATP dynamically prunes non-essential features in the data based on adaptive thresholds, effectively reducing quantum circuit requirements while preserving high performance. Extensive experiments across multiple datasets demonstrate that ATP reduces entanglement entropy and improves adversarial robustness when combined with adversarial training methods like FGSM. Our results highlight ATP's ability to balance computational efficiency and model resilience, achieving significant performance improvements with fewer resources, which will help make QNNs more feasible in practical, resource-constrained settings.

[AI-45] March Madness Tournament Predictions Model: A Mathematical Modeling Approach

链接: https://arxiv.org/abs/2503.21790
作者: Christian McIver,Karla Avalos,Nikhil Nayak
类目: Applications (stat.AP); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 7 pages, 5 figures

点击查看摘要

Abstract:This paper proposes a model to predict the outcome of the March Madness tournament based on historical NCAA basketball data since 2013. The framework of this project is a simplification of the FiveThirtyEight NCAA March Madness prediction model, where the only four predictors of interest are Adjusted Offensive Efficiency (ADJOE), Adjusted Defensive Efficiency (ADJDE), Power Rating, and Two-Point Shooting Percentage Allowed. A logistic regression was utilized with the aforementioned metrics to generate a probability of a particular team winning each game. Then, a tournament simulation is developed and compared to real-world March Madness brackets to determine the accuracy of the model. Accuracies of performance were calculated using a naive approach and a Spearman rank correlation coefficient.
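
A minimal sketch of the described pipeline, assuming each training row holds the difference in the four predictors between two teams; the synthetic data and coefficients are placeholders, not the NCAA dataset used in the paper.

```python
# Minimal sketch: logistic regression over four team statistics (ADJOE, ADJDE,
# Power Rating, two-point percentage allowed) producing a win probability.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(42)
# Each row: difference between team A and team B in the four predictors (toy data).
X = rng.normal(size=(500, 4))
# Placeholder labels: team A wins more often when its stat differences are favorable.
y = (X @ np.array([0.8, -0.8, 0.5, -0.3]) + rng.normal(scale=0.5, size=500) > 0).astype(int)

model = LogisticRegression().fit(X, y)
matchup = np.array([[2.0, -1.5, 1.0, -0.5]])          # hypothetical stat differences
print("P(team A wins) =", model.predict_proba(matchup)[0, 1])
```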

[AI-46] From Deep Learning to LLM s: A survey of AI in Quantitative Investment

链接: https://arxiv.org/abs/2503.21422
作者: Bokai Cao,Saizhuo Wang,Xinyi Lin,Xiaojun Wu,Haohan Zhang,Lionel M. Ni,Jian Guo
类目: Computational Finance (q-fin.CP); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Statistical Finance (q-fin.ST); Trading and Market Microstructure (q-fin.TR)
*备注:

点击查看摘要

Abstract:Quantitative investment (quant) is an emerging, technology-driven approach in asset management, increasingly shaped by advancements in artificial intelligence. Recent advances in deep learning and large language models (LLMs) for quant finance have improved predictive modeling and enabled agent-based automation, suggesting a potential paradigm shift in this field. In this survey, taking alpha strategy as a representative example, we explore how AI contributes to the quantitative investment pipeline. We first examine the early stage of quant research, centered on human-crafted features and traditional statistical models with an established alpha pipeline. We then discuss the rise of deep learning, which enabled scalable modeling across the entire pipeline from data processing to order execution. Building on this, we highlight the emerging role of LLMs in extending AI beyond prediction, empowering autonomous agents to process unstructured data, generate alphas, and support self-iterative workflows.

机器学习

[LG-0] Tropical Bisectors and Carlini-Wagner Attacks

链接: https://arxiv.org/abs/2503.22653
作者: Gillian Grindstaff,Julia Lindberg,Daniela Schkoda,Miruna-Stefana Sorea,Ruriko Yoshida
类目: Machine Learning (cs.LG); Algebraic Geometry (math.AG); Combinatorics (math.CO); Metric Geometry (math.MG); Optimization and Control (math.OC)
*备注: 23 pages, 8 figures, 5 tables, 1 appendix

点击查看摘要

Abstract:Pasque et al. showed that using a tropical symmetric metric as an activation function in the last layer can improve the robustness of convolutional neural networks (CNNs) against state-of-the-art attacks, including the Carlini-Wagner attack. This improvement occurs when the attacks are not specifically adapted to the non-differentiability of the tropical layer. Moreover, they showed that the decision boundary of a tropical CNN is defined by tropical bisectors. In this paper, we explore the combinatorics of tropical bisectors and analyze how the tropical embedding layer enhances robustness against Carlini-Wagner attacks. We prove an upper bound on the number of linear segments the decision boundary of a tropical CNN can have. We then propose a refined version of the Carlini-Wagner attack, specifically tailored for the tropical architecture. Computational experiments with MNIST and LeNet5 showcase our attack's improved success rate.

[LG-1] Sentiment Classification of Thai Central Bank Press Releases Using Supervised Learning

链接: https://arxiv.org/abs/2503.22629
作者: Stefano Grassi
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Central bank communication plays a critical role in shaping economic expectations and monetary policy effectiveness. This study applies supervised machine learning techniques to classify the sentiment of press releases from the Bank of Thailand, addressing gaps in research that primarily focus on lexicon-based approaches. My findings show that supervised learning can be an effective method, even with smaller datasets, and serves as a starting point for further automation. However, achieving higher accuracy and better generalization requires a substantial amount of labeled data, which is time-consuming and demands expertise. Using models such as Naïve Bayes, Random Forest and SVM, this study demonstrates the applicability of machine learning for central bank sentiment analysis, with English-language communications from the Thai Central Bank as a case study.
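
A minimal sketch of the supervised setup with scikit-learn; the press-release snippets and sentiment labels are invented placeholders rather than the Bank of Thailand corpus, and the specific models compared in the paper (Naïve Bayes, Random Forest, SVM) are represented here by a single linear SVM.

```python
# Minimal sketch: TF-IDF features plus a linear SVM for sentiment classification
# of central bank press-release text.
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

texts = [
    "The committee voted to raise the policy rate amid inflationary pressure.",
    "Economic growth is projected to soften; the policy rate was kept unchanged.",
    "The committee voted to cut the policy rate to support the recovery.",
]
labels = ["hawkish", "neutral", "dovish"]

clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LinearSVC())
clf.fit(texts, labels)
print(clf.predict(["The policy rate was reduced to stimulate demand."]))
```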

[LG-2] Reinforcement Learning for Machine Learning Model Deployment: Evaluating Multi-Armed Bandits in ML Ops Environments

链接: https://arxiv.org/abs/2503.22595
作者: S. Aaron McClendon,Vishaal Venkatesh,Juan Morinelli
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In modern ML Ops environments, model deployment is a critical process that traditionally relies on static heuristics such as validation error comparisons and A/B testing. However, these methods require human intervention to adapt to real-world deployment challenges, such as model drift or unexpected performance degradation. We investigate whether reinforcement learning, specifically multi-armed bandit (MAB) algorithms, can dynamically manage model deployment decisions more effectively. Our approach enables more adaptive production environments by continuously evaluating deployed models and rolling back underperforming ones in real-time. We test six model selection strategies across two real-world datasets and find that RL-based approaches match or exceed traditional methods in performance. Our findings suggest that reinforcement learning (RL)-based model management can improve automation, reduce reliance on manual interventions, and mitigate risks associated with post-deployment model failures.
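
A minimal sketch of a multi-armed bandit routing traffic between deployed model versions with epsilon-greedy selection; the model names, reward signal, and quality numbers are placeholders, and the specific strategies compared in the paper are not reproduced here.

```python
# Minimal sketch: an epsilon-greedy bandit continuously shifts traffic toward the
# model version with the best observed online reward (e.g. 1 if a prediction was accepted).
import random

class EpsilonGreedyDeployer:
    def __init__(self, model_ids, epsilon=0.1):
        self.epsilon = epsilon
        self.counts = {m: 0 for m in model_ids}
        self.values = {m: 0.0 for m in model_ids}     # running mean reward per model

    def choose(self):
        if random.random() < self.epsilon:
            return random.choice(list(self.counts))   # explore
        return max(self.values, key=self.values.get)  # exploit current best

    def update(self, model_id, reward):
        self.counts[model_id] += 1
        n = self.counts[model_id]
        self.values[model_id] += (reward - self.values[model_id]) / n

random.seed(0)
deployer = EpsilonGreedyDeployer(["model_v1", "model_v2", "model_v3"])
true_quality = {"model_v1": 0.60, "model_v2": 0.72, "model_v3": 0.55}  # hidden from the bandit
for _ in range(2000):
    m = deployer.choose()
    deployer.update(m, 1.0 if random.random() < true_quality[m] else 0.0)
print(max(deployer.values, key=deployer.values.get))   # usually model_v2
```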

[LG-3] Comparing Methods for Bias Mitigation in Graph Neural Networks

链接: https://arxiv.org/abs/2503.22569
作者: Barbara Hoffmann,Ruben Mayer
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This paper examines the critical role of Graph Neural Networks (GNNs) in data preparation for generative artificial intelligence (GenAI) systems, with a particular focus on addressing and mitigating biases. We present a comparative analysis of three distinct methods for bias mitigation: data sparsification, feature modification, and synthetic data augmentation. Through experimental analysis using the German Credit dataset, we evaluate these approaches using multiple fairness metrics, including statistical parity, equality of opportunity, and false positive rates. Our research demonstrates that while all methods improve fairness metrics compared to the original dataset, stratified sampling and synthetic data augmentation using GraphSAGE prove particularly effective in balancing demographic representation while maintaining model performance. The results provide practical insights for developing more equitable AI systems while maintaining model performance.
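
A minimal sketch of two of the fairness metrics named above (statistical parity and equality of opportunity), computed from model predictions and a binary protected attribute; the arrays are toy placeholders, not the evaluation code used in the paper.

```python
# Minimal sketch: statistical parity difference and equal opportunity difference.
import numpy as np

def statistical_parity_diff(y_pred, group):
    # difference in positive prediction rates between the two groups
    return y_pred[group == 1].mean() - y_pred[group == 0].mean()

def equal_opportunity_diff(y_true, y_pred, group):
    # difference in true-positive rates between the two groups
    tpr = lambda g: y_pred[(group == g) & (y_true == 1)].mean()
    return tpr(1) - tpr(0)

y_true = np.array([1, 1, 0, 1, 0, 1, 0, 0])
y_pred = np.array([1, 0, 0, 1, 1, 1, 0, 0])
group  = np.array([1, 1, 1, 0, 0, 0, 0, 1])
print(statistical_parity_diff(y_pred, group), equal_opportunity_diff(y_true, y_pred, group))
```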

[LG-4] Benchmarking Ultra-Low-Power μNPUs

链接: https://arxiv.org/abs/2503.22567
作者: Josh Millar,Yushan Huang,Sarab Sethi,Hamed Haddadi,Anil Madhavapeddy
类目: Machine Learning (cs.LG); Hardware Architecture (cs.AR)
*备注:

点击查看摘要

Abstract:Efficient on-device neural network (NN) inference has various advantages over cloud-based processing, including predictable latency, enhanced privacy, greater reliability, and reduced operating costs for vendors. This has sparked the recent rapid development of microcontroller-scale NN accelerators, often referred to as neural processing units ( \mu NPUs), designed specifically for ultra-low-power applications. In this paper we present the first comparative evaluation of a number of commercially-available \mu NPUs, as well as the first independent benchmarks for several of these platforms. We develop and open-source a model compilation framework to enable consistent benchmarking of quantized models across diverse \mu NPU hardware. Our benchmark targets end-to-end performance and includes model inference latency, power consumption, and memory overhead, alongside other factors. The resulting analysis uncovers both expected performance trends as well as surprising disparities between hardware specifications and actual performance, including \mu NPUs exhibiting unexpected scaling behaviors with increasing model complexity. Our framework provides a foundation for further evaluation of \mu NPU platforms alongside valuable insights for both hardware designers and software developers in this rapidly evolving space.

[LG-5] Efficient Verified Machine Unlearning For Distillation

链接: https://arxiv.org/abs/2503.22539
作者: Yijun Quan,Zushu Li,Giovanni Montana
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Growing data privacy demands, driven by regulations like GDPR and CCPA, require machine unlearning methods capable of swiftly removing the influence of specific training points. Although verified approaches like SISA, using data slicing and checkpointing, achieve efficient unlearning for single models by reverting to intermediate states, these methods struggle in teacher-student knowledge distillation settings. Unlearning in the teacher typically forces costly, complete student retraining due to pervasive information propagation during distillation. Our primary contribution is PURGE (Partitioned Unlearning with Retraining Guarantee for Ensembles), a novel framework integrating verified unlearning with distillation. We introduce constituent mapping and an incremental multi-teacher strategy that partitions the distillation process, confines each teacher constituent’s impact to distinct student data subsets, and crucially maintains data isolation. The PURGE framework substantially reduces retraining overhead, requiring only partial student updates when teacher-side unlearning occurs. We provide both theoretical analysis, quantifying significant speed-ups in the unlearning process, and empirical validation on multiple datasets, demonstrating that PURGE achieves these efficiency gains while maintaining student accuracy comparable to standard baselines.

[LG-6] MixFunn: A Neural Network for Differential Equations with Improved Generalization and Interpretability

链接: https://arxiv.org/abs/2503.22528
作者: Tiago de Souza Farias,Gubio Gomes de Lima,Jonas Maziero,Celso Jorge Villas-Boas
类目: Machine Learning (cs.LG); Applied Physics (physics.app-ph); Computational Physics (physics.comp-ph)
*备注: 21 pages

点击查看摘要

Abstract:We introduce MixFunn, a novel neural network architecture designed to solve differential equations with enhanced precision, interpretability, and generalization capability. The architecture comprises two key components: the mixed-function neuron, which integrates multiple parameterized nonlinear functions to improve representational flexibility, and the second-order neuron, which combines a linear transformation of its inputs with a quadratic term to capture cross-combinations of input variables. These features significantly enhance the expressive power of the network, enabling it to achieve comparable or superior results with drastically fewer parameters and a reduction of up to four orders of magnitude compared to conventional approaches. We applied MixFunn in a physics-informed setting to solve differential equations in classical mechanics, quantum mechanics, and fluid dynamics, demonstrating its effectiveness in achieving higher accuracy and improved generalization to regions outside the training domain relative to standard machine learning models. Furthermore, the architecture facilitates the extraction of interpretable analytical expressions, offering valuable insights into the underlying solutions.
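
A minimal sketch of the second-order neuron: a linear transformation of the input plus a quadratic term over cross-combinations of input variables. The weights are random placeholders, not a trained MixFunn network, and the mixed-function neuron is not shown.

```python
# Minimal sketch: output = linear part + quadratic cross-terms + bias.
import numpy as np

rng = np.random.default_rng(0)

def second_order_neuron(x, w, Q, b):
    return w @ x + x @ Q @ x + b      # linear term plus pairwise products x_i * x_j

x = rng.normal(size=4)
w = rng.normal(size=4)
Q = rng.normal(size=(4, 4))           # weights on cross-combinations of inputs
b = 0.1
print(second_order_neuron(x, w, Q, b))
```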

[LG-7] Assessing Foundation Models for Sea Ice Type Segmentation in Sentinel-1 SAR Imagery

链接: https://arxiv.org/abs/2503.22516
作者: Samira Alkaee Taleghan,Morteza Karimzadeh,Andrew P. Barrett,Walter N. Meier,Farnoush Banaei-Kashani
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Accurate segmentation of sea ice types is essential for mapping and operational forecasting of sea ice conditions for safe navigation and resource extraction in ice-covered waters, as well as for understanding polar climate processes. While deep learning methods have shown promise in automating sea ice segmentation, they often rely on extensive labeled datasets which require expert knowledge and are time-consuming to create. Recently, foundation models (FMs) have shown excellent results for segmenting remote sensing images by utilizing pre-training on large datasets using self-supervised techniques. However, their effectiveness for sea ice segmentation remains unexplored, especially given sea ice’s complex structures, seasonal changes, and unique spectral signatures, as well as peculiar Synthetic Aperture Radar (SAR) imagery characteristics including banding and scalloping noise, and varying ice backscatter characteristics, which are often missing in standard remote sensing pre-training datasets. In particular, SAR images over polar regions are acquired using different modes than those used by the same sensors to capture images at lower latitudes, which form the training datasets for FMs. This study evaluates ten remote sensing FMs for sea ice type segmentation using Sentinel-1 SAR imagery, focusing on their seasonal and spatial generalization. Among the selected models, Prithvi-600M outperforms the baseline models, while CROMA achieves a very similar performance in F1-score. Our contributions include offering a systematic methodology for selecting FMs for sea ice data analysis, a comprehensive benchmarking study on performances of FMs for sea ice segmentation with tailored performance metrics, and insights into existing gaps and future directions for improving domain-specific models in polar applications using SAR data.

[LG-8] Learnable cut flow

链接: https://arxiv.org/abs/2503.22498
作者: Jing Li,Hao Sun
类目: Machine Learning (cs.LG); High Energy Physics - Phenomenology (hep-ph)
*备注: 26 pages, 33 figures

点击查看摘要

Abstract:Neural networks have emerged as a powerful paradigm for tasks in high energy physics, yet their opaque training process renders them a black box. In contrast, the traditional cut flow method offers simplicity and interpretability but demands human effort to identify optimal boundaries. To merge the strengths of both approaches, we propose the Learnable Cut Flow (LCF), a neural network that transforms the traditional cut selection into a fully differentiable, data-driven process. LCF implements two cut strategies: parallel, where observable distributions are treated independently, and sequential, where prior cuts shape subsequent ones, to flexibly determine optimal boundaries. Building on this, we introduce the Learnable Importance, a metric that quantifies feature importance and adjusts their contributions to the loss accordingly, offering model-driven insights unlike ad-hoc metrics. To ensure differentiability, a modified loss function replaces hard cuts with mask operations, preserving data shape throughout the training process. LCF is tested on six varied mock datasets and a realistic diboson vs. QCD dataset. Results demonstrate that LCF (1) accurately learns cut boundaries across typical feature distributions in both parallel and sequential strategies, (2) assigns higher importance to discriminative features with minimal overlap, (3) handles redundant or correlated features robustly, and (4) performs effectively in real-world scenarios. On the diboson dataset, LCF initially underperforms boosted decision trees and multilayer perceptrons when using all observables. However, pruning less critical features, guided by learned importance, boosts its performance to match or exceed these baselines. LCF bridges the gap between the traditional cut flow method and modern black-box neural networks, delivering actionable insights into the training process and feature importance.

[LG-9] SPDNet: Seasonal-Periodic Decomposition Network for Advanced Residential Demand Forecasting

链接: https://arxiv.org/abs/2503.22485
作者: Reza Nematirad,Anil Pahwa,Balasubramaniam Natarajan
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Residential electricity demand forecasting is critical for efficient energy management and grid stability. Accurate predictions enable utility companies to optimize planning and operations. However, real-world residential electricity demand data often exhibit intricate temporal variability, including multiple seasonalities, periodicities, and abrupt fluctuations, which pose significant challenges for forecasting models. Previous models that rely on statistical methods, recurrent, convolutional neural networks, and transformers often struggle to capture these intricate temporal dynamics. To address these challenges, we propose the Seasonal-Periodic Decomposition Network (SPDNet), a novel deep learning framework consisting of two main modules. The first is the Seasonal-Trend Decomposition Module (STDM), which decomposes the input data into trend, seasonal, and residual components. The second is the Periodical Decomposition Module (PDM), which employs the Fast Fourier Transform to identify the dominant periods. For each dominant period, 1D input data is reshaped into a 2D tensor, where rows represent periods and columns correspond to frequencies. The 2D representations are then processed through three submodules: a 1D convolution to capture sharp fluctuations, a transformer-based encoder to model global patterns, and a 2D convolution to capture interactions between periods. Extensive experiments conducted on real-world residential electricity load data demonstrate that SPDNet outperforms traditional and advanced models in both forecasting accuracy and computational efficiency. The code is available in this repository: this https URL.
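A rough sketch of the FFT-based period detection and 1D-to-2D reshaping step described for the PDM; the exact normalization, padding, and selection rule in SPDNet are assumed here.

```python
import torch

def dominant_periods(x, k=3):
    """Illustrative periodical-decomposition step (assumed details): use the
    FFT amplitude spectrum to pick the k strongest frequencies and convert
    them to periods. `x` is a (batch, length) univariate series."""
    spec = torch.fft.rfft(x, dim=-1).abs().mean(dim=0)   # average amplitude spectrum
    spec[0] = 0                                          # drop the DC component
    top_freq = torch.topk(spec, k).indices               # dominant frequency bins
    periods = x.shape[-1] // top_freq.clamp(min=1)       # period = length / frequency
    return periods

def to_2d(x, period):
    """Reshape a 1D series into a (repetitions x positions-in-period) grid,
    truncating the tail so the length divides evenly."""
    length = (x.shape[-1] // period) * period
    return x[..., :length].reshape(x.shape[0], -1, period)   # (batch, rows, period)

x = torch.randn(16, 96)
for p in dominant_periods(x):
    grid = to_2d(x, int(p))   # rows = repetitions of the period, cols = phase
```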

[LG-10] Probabilistic Uncertain Reward Model: A Natural Generalization of Bradley-Terry Reward Model

链接: https://arxiv.org/abs/2503.22480
作者: Wangtao Sun,Xiang Cheng,Xing Yu,Haotian Xu,Zhao Yang,Shizhu He,Jun Zhao,Kang Liu
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Reinforcement Learning from Human Feedback (RLHF) has emerged as a critical technique for training large language models. However, reward hacking, a phenomenon where models exploit flaws in the reward model, remains a significant barrier to achieving robust and scalable intelligence through long-term training. Existing studies have proposed uncertain reward models to address reward hacking; however, they often lack systematic or theoretical foundations, failing to model the uncertainty intrinsically emerging from preference data. In this paper, we propose the Probabilistic Uncertain Reward Model (PURM), a natural generalization of the classical Bradley-Terry reward model. PURM learns reward distributions directly from preference data and quantifies per-sample uncertainty via the average overlap area between reward distributions. To mitigate reward hacking, we further introduce an uncertainty-aware penalty into Proximal Policy Optimization (PPO), which leverages the learned uncertainty to dynamically balance reward optimization and exploration. We propose a lightweight and easy-to-use implementation of PURM. Experiments demonstrate that PURM significantly delays the onset of reward hacking while improving final reward performance, outperforming baseline methods in both stability and effectiveness.
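For intuition, a numerical sketch of the "overlap area between reward distributions" uncertainty signal, assuming Gaussian reward distributions; the paper's actual estimator may differ.

```python
import numpy as np
from scipy.stats import norm

def overlap_area(mu1, sigma1, mu2, sigma2, grid_points=2001):
    """Illustrative per-sample uncertainty signal: the area under min(p1, p2)
    for two Gaussian reward distributions, computed numerically."""
    lo = min(mu1 - 5 * sigma1, mu2 - 5 * sigma2)
    hi = max(mu1 + 5 * sigma1, mu2 + 5 * sigma2)
    r = np.linspace(lo, hi, grid_points)
    p1 = norm.pdf(r, mu1, sigma1)
    p2 = norm.pdf(r, mu2, sigma2)
    return np.trapz(np.minimum(p1, p2), r)    # 1.0 = identical, ~0 = well separated

# Large overlap -> the reward model is uncertain which response is better.
print(overlap_area(0.0, 1.0, 0.3, 1.0))   # high overlap, high uncertainty
print(overlap_area(0.0, 0.2, 2.0, 0.2))   # near-zero overlap, confident preference
```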

[LG-11] DeepOFormer: Deep Operator Learning with Domain-informed Features for Fatigue Life Prediction

链接: https://arxiv.org/abs/2503.22475
作者: Chenyang Li,Tanmay Sunil Kapure,Prokash Chandra Roy,Zhengtao Gan,Bo Shen
类目: Machine Learning (cs.LG)
*备注: 6 pages, 4 figures

点击查看摘要

Abstract:Fatigue life characterizes the duration a material can function before failure under specific environmental conditions, and is traditionally assessed using stress-life (S-N) curves. While machine learning and deep learning offer promising results for fatigue life prediction, they face the overfitting challenge because of the small size of fatigue experimental data in specific materials. To address this challenge, we propose DeepOFormer, formulating S-N curve prediction as an operator learning problem. DeepOFormer improves the deep operator learning framework with a transformer-based encoder and a mean L2 relative error loss function. We also consider Stussi, Weibull, and Pascual and Meeker (PM) features as domain-informed features. These features are motivated by empirical fatigue models. To evaluate the performance of our DeepOFormer, we compare it with different deep learning models and XGBoost on a dataset with 54 S-N curves of aluminum alloys. With seven different aluminum alloys selected for testing, our DeepOFormer achieves an R2 of 0.9515, a mean absolute error of 0.2080, and a mean relative error of 0.5077, significantly outperforming state-of-the-art deep/machine learning methods, including DeepONet, TabTransformer, and XGBoost. The results highlight that our DeepOFormer, integrated with domain-informed features, substantially improves prediction accuracy and generalization capabilities for fatigue life prediction in aluminum alloys.

[LG-12] STADE: Standard Deviation as a Pruning Metric

链接: https://arxiv.org/abs/2503.22451
作者: Diego Coello de Portugal Mecke,Haya Alyoussef,Ilia Koloiarov,Maximilian Stubbemann,Lars Schmidt-Thieme
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Recently, Large Language Models (LLMs) have become very widespread and are used to solve a wide variety of tasks. To successfully handle these tasks, LLMs require longer training times and larger model sizes. This makes LLMs ideal candidates for pruning methods that reduce computational demands while maintaining performance. Previous methods require a retraining phase after pruning to maintain the original model’s performance. However, state-of-the-art pruning methods, such as Wanda, prune the model without retraining, making the pruning process faster and more efficient. Building upon Wanda’s work, this study provides a theoretical explanation of why the method is effective and leverages these insights to enhance the pruning process. Specifically, a theoretical analysis of the pruning problem reveals a common scenario in Machine Learning where Wanda is the optimal pruning method. Furthermore, this analysis is extended to cases where Wanda is no longer optimal, leading to the development of a new method, STADE, based on the standard deviation of the input. From a theoretical standpoint, STADE demonstrates better generality across different scenarios. Finally, extensive experiments on Llama and Open Pre-trained Transformers (OPT) models validate these theoretical findings, showing that depending on the training conditions, Wanda’s optimal performance varies as predicted by the theoretical framework. These insights contribute to a more robust understanding of pruning strategies and their practical implications. Code is available at: this https URL
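A minimal sketch contrasting a Wanda-style activation-norm score with a standard-deviation score of the kind STADE proposes; the ranking granularity (global here, rather than per-row or per-layer) is an assumption for illustration.

```python
import torch

def prune_mask(weight, inputs, sparsity=0.5, metric="std"):
    """Illustrative activation-aware magnitude pruning. `weight` is (out, in)
    and `inputs` is (samples, in) calibration data. metric="norm" follows the
    Wanda-style score |W| * ||x||_2; metric="std" is the standard-deviation
    variant this paper motivates. Exact ranking details are assumptions."""
    if metric == "norm":
        stat = inputs.norm(dim=0)          # per input-feature activation norm
    else:
        stat = inputs.std(dim=0)           # per input-feature standard deviation
    score = weight.abs() * stat            # broadcast over output rows
    k = int(weight.numel() * sparsity)
    threshold = score.flatten().kthvalue(k).values
    return (score > threshold).float()     # 1 = keep, 0 = prune

W = torch.randn(64, 128)
X = torch.randn(512, 128)      # calibration activations feeding this layer
mask = prune_mask(W, X, sparsity=0.5)
W_pruned = W * mask            # no retraining step, as in Wanda-style pruning
```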

[LG-13] Robustness quantification and how it allows for reliable classification even in the presence of distribution shift and for small training sets

链接: https://arxiv.org/abs/2503.22418
作者: Adrián Detavernier,Jasper De Bock
类目: Machine Learning (cs.LG); Probability (math.PR)
*备注:

点击查看摘要

Abstract:Based on existing ideas in the field of imprecise probabilities, we present a new approach for assessing the reliability of the individual predictions of a generative probabilistic classifier. We call this approach robustness quantification, compare it to uncertainty quantification, and demonstrate that it continues to work well even for classifiers that are learned from small training sets that are sampled from a shifted distribution.

[LG-14] Instance-Level Data-Use Auditing of Visual ML Models

链接: https://arxiv.org/abs/2503.22413
作者: Zonghao Huang,Neil Zhenqiang Gong,Michael K. Reiter
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The growing trend of legal disputes over the unauthorized use of data in machine learning (ML) systems highlights the urgent need for reliable data-use auditing mechanisms to ensure accountability and transparency in ML. In this paper, we present the first proactive instance-level data-use auditing method designed to enable data owners to audit the use of their individual data instances in ML models, providing more fine-grained auditing results. Our approach integrates any black-box membership inference technique with a sequential hypothesis test, providing a quantifiable and tunable false-detection rate. We evaluate our method on three types of visual ML models: image classifiers, visual encoders, and Contrastive Image-Language Pretraining (CLIP) models. In addition, we apply our method to evaluate the performance of two state-of-the-art approximate unlearning methods. Our findings reveal that neither method successfully removes the influence of the unlearned data instances from image classifiers and CLIP models, even when sacrificing model utility by 10.33%.

[LG-15] Generative Reliability-Based Design Optimization Using In-Context Learning Capabilities of Large Language Models

链接: https://arxiv.org/abs/2503.22401
作者: Zhonglin Jiang,Qian Tang,Zequn Wang
类目: Machine Learning (cs.LG); Methodology (stat.ME)
*备注: 17 pages, 11 figures, 4tables

点击查看摘要

Abstract:Large Language Models (LLMs) have demonstrated remarkable in-context learning capabilities, enabling flexible utilization of limited historical information to play pivotal roles in reasoning, problem-solving, and complex pattern recognition tasks. Inspired by the successful applications of LLMs in multiple domains, this paper proposes a generative design method by leveraging the in-context learning capabilities of LLMs with the iterative search mechanisms of metaheuristic algorithms for solving reliability-based design optimization problems. In detail, reliability analysis is performed by engaging the LLMs and Kriging surrogate modeling to overcome the computational burden. By dynamically providing critical information of design points to the LLMs with prompt engineering, the method enables rapid generation of high-quality design alternatives that satisfy reliability constraints while achieving performance optimization. With the Deepseek-V3 model, three case studies are used to demonstrate the performance of the proposed approach. Experimental results indicate that the proposed LLM-RBDO method successfully identifies feasible solutions that meet reliability constraints while achieving a comparable convergence rate compared to traditional genetic algorithms.

[LG-16] MASCOTS: Model-Agnostic Symbolic COunterfactual explanations for Time Series

链接: https://arxiv.org/abs/2503.22389
作者: Dawid Płudowski,Francesco Spinnato,Piotr Wilczyński,Krzysztof Kotowski,Evridiki Vasileia Ntagiou,Riccardo Guidotti,Przemysław Biecek
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Counterfactual explanations provide an intuitive way to understand model decisions by identifying minimal changes required to alter an outcome. However, applying counterfactual methods to time series models remains challenging due to temporal dependencies, high dimensionality, and the lack of an intuitive human-interpretable representation. We introduce MASCOTS, a method that leverages the Bag-of-Receptive-Fields representation alongside symbolic transformations inspired by Symbolic Aggregate Approximation. By operating in a symbolic feature space, it enhances interpretability while preserving fidelity to the original data and model. Unlike existing approaches that either depend on model structure or autoencoder-based sampling, MASCOTS directly generates meaningful and diverse counterfactual observations in a model-agnostic manner, operating on both univariate and multivariate data. We evaluate MASCOTS on univariate and multivariate benchmark datasets, demonstrating comparable validity, proximity, and plausibility to state-of-the-art methods, while significantly improving interpretability and sparsity. Its symbolic nature allows for explanations that can be expressed visually, in natural language, or through semantic representations, making counterfactual reasoning more accessible and actionable.
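For reference, a compact sketch of a SAX-style symbolic transformation, the family of representations MASCOTS builds on; the segment count and alphabet below are illustrative parameters, not the paper's configuration.

```python
import numpy as np
from scipy.stats import norm

def sax_transform(series, n_segments=8, alphabet="abcd"):
    """Illustrative Symbolic Aggregate approXimation (SAX): z-normalize,
    average over equal segments (PAA), then bin segment means into letters
    using Gaussian breakpoints."""
    x = (series - series.mean()) / (series.std() + 1e-8)
    segments = np.array_split(x, n_segments)
    paa = np.array([seg.mean() for seg in segments])
    # Breakpoints that split a standard normal into equiprobable regions.
    breakpoints = norm.ppf(np.linspace(0, 1, len(alphabet) + 1)[1:-1])
    symbols = np.digitize(paa, breakpoints)
    return "".join(alphabet[i] for i in symbols)

# A sine wave becomes a short symbolic word that a counterfactual method can edit.
print(sax_transform(np.sin(np.linspace(0, 4 * np.pi, 128))))
```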

[LG-17] Grasping a Handful: Sequential Multi-Object Dexterous Grasp Generation

链接: https://arxiv.org/abs/2503.22370
作者: Haofei Lu,Yifei Dong,Zehang Weng,Jens Lundell,Danica Kragic
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注: 8 pages, 7 figures

点击查看摘要

Abstract:We introduce the sequential multi-object robotic grasp sampling algorithm SeqGrasp that can robustly synthesize stable grasps on diverse objects using the robotic hand’s partial Degrees of Freedom (DoF). We use SeqGrasp to construct the large-scale Allegro Hand sequential grasping dataset SeqDataset and use it for training the diffusion-based sequential grasp generator SeqDiffuser. We experimentally evaluate SeqGrasp and SeqDiffuser against the state-of-the-art non-sequential multi-object grasp generation method MultiGrasp in simulation and on a real robot. The experimental results demonstrate that SeqGrasp and SeqDiffuser reach an 8.71%-43.33% higher grasp success rate than MultiGrasp. Furthermore, SeqDiffuser is approximately 1000 times faster at generating grasps than SeqGrasp and MultiGrasp.

[LG-18] Hybrid Time-Domain Behavior Model Based on Neural Differential Equations and RNNs

链接: https://arxiv.org/abs/2503.22313
作者: Zenghui Chang,Yang Zhang,Hu Tan,Hong Cai Chen
类目: Machine Learning (cs.LG)
*备注: 7 pages,5 figures

点击查看摘要

Abstract:Nonlinear dynamics system identification is crucial for circuit emulation. Traditional continuous-time domain modeling approaches have limitations in fitting capability and computational efficiency when used for modeling circuit IPs and device behaviors. This paper presents a novel continuous-time domain hybrid modeling paradigm. It integrates neural network differential models with recurrent neural networks (RNNs), creating NODE-RNN and NCDE-RNN models based on neural ordinary differential equations (NODE) and neural controlled differential equations (NCDE), respectively. Theoretical analysis shows that this hybrid model has mathematical advantages in event-driven dynamic mutation response and gradient propagation stability. Validation using real data from PIN diodes in high-power microwave environments shows NCDE-RNN improves fitting accuracy by 33% over traditional NCDE, and NODE-RNN by 24% over CTRNN, especially in capturing nonlinear memory effects. The model has been successfully deployed in Verilog-A and validated through circuit emulation, confirming its compatibility with existing platforms and practical utility. This hybrid dynamics paradigm, by restructuring the neural differential equation solution path, offers new ideas for high-precision circuit time-domain modeling and is significant for complex nonlinear circuit system modeling.

[LG-19] DynaGraph: Interpretable Multi-Label Prediction from EHRs via Dynamic Graph Learning and Contrastive Augmentation

链接: https://arxiv.org/abs/2503.22257
作者: Munib Mesinovic,Soheila Molaei,Peter Watkinson,Tingting Zhu
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Learning from longitudinal electronic health records is limited if it does not capture the temporal trajectories of the patient’s state in a clinical setting. Graph models allow us to capture the hidden dependencies of the multivariate time-series when the graphs are constructed in a similar dynamic manner. Previous dynamic graph models require a pre-defined and/or static graph structure, which is unknown in most cases, or they only capture the spatial relations between the features. Furthermore in healthcare, the interpretability of the model is an essential requirement to build trust with clinicians. In addition to previously proposed attention mechanisms, there has not been an interpretable dynamic graph framework for data from multivariate electronic health records (EHRs). Here, we propose DynaGraph, an end-to-end interpretable contrastive graph model that learns the dynamics of multivariate time-series EHRs as part of optimisation. We validate our model in four real-world clinical datasets, ranging from primary care to secondary care settings with broad demographics, in challenging settings where tasks are imbalanced and multi-labelled. Compared to state-of-the-art models, DynaGraph achieves significant improvements in balanced accuracy and sensitivity over the nearest complex competitors in time-series or dynamic graph modelling across three ICU and one primary care datasets. Through a pseudo-attention approach to graph construction, our model also indicates the importance of clinical covariates over time, providing means for clinical validation.

[LG-20] FLAM: Foundation Model-Based Body Stabilization for Humanoid Locomotion and Manipulation

链接: https://arxiv.org/abs/2503.22249
作者: Xianqi Zhang,Hongliang Wei,Wenrui Wang,Xingtao Wang,Xiaopeng Fan,Debin Zhao
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注: 8 pages, 7 figures

点击查看摘要

Abstract:Humanoid robots have attracted significant attention in recent years. Reinforcement Learning (RL) is one of the main ways to control the whole body of humanoid robots. RL enables agents to complete tasks by learning from environment interactions, guided by task rewards. However, existing RL methods rarely explicitly consider the impact of body stability on humanoid locomotion and manipulation. Achieving high performance in whole-body control remains a challenge for RL methods that rely solely on task rewards. In this paper, we propose a Foundation model-based method for humanoid Locomotion And Manipulation (FLAM for short). FLAM integrates a stabilizing reward function with a basic policy. The stabilizing reward function is designed to encourage the robot to learn stable postures, thereby accelerating the learning process and facilitating task completion. Specifically, the robot pose is first mapped to the 3D virtual human model. Then, the human pose is stabilized and reconstructed through a human motion reconstruction model. Finally, the pose before and after reconstruction is used to compute the stabilizing reward. By combining this stabilizing reward with the task reward, FLAM effectively guides policy learning. Experimental results on a humanoid robot benchmark demonstrate that FLAM outperforms state-of-the-art RL methods, highlighting its effectiveness in improving stability and overall performance.

[LG-21] CRLLK: Constrained Reinforcement Learning for Lane Keeping in Autonomous Driving AAMAS2025

链接: https://arxiv.org/abs/2503.22248
作者: Xinwei Gao,Arambam James Singh,Gangadhar Royyuru,Michael Yuhas,Arvind Easwaran
类目: Machine Learning (cs.LG); Robotics (cs.RO)
*备注: Accepted at AAMAS 2025 (Demonstration Track), 3 pages, 2 figures, 1 table

点击查看摘要

Abstract:Lane keeping in autonomous driving systems requires scenario-specific weight tuning for different objectives. We formulate lane-keeping as a constrained reinforcement learning problem, where weight coefficients are automatically learned along with the policy, eliminating the need for scenario-specific tuning. Empirically, our approach outperforms traditional RL in efficiency and reliability. Additionally, real-world demonstrations validate its practical value for real-world autonomous driving.

[LG-22] Analysis of On-policy Policy Gradient Methods under the Distribution Mismatch

链接: https://arxiv.org/abs/2503.22244
作者: Weizhen Wang,Jianping He,Xiaoming Duan
类目: Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注:

点击查看摘要

Abstract:Policy gradient methods are one of the most successful methods for solving challenging reinforcement learning problems. However, despite their empirical successes, many SOTA policy gradient algorithms for discounted problems deviate from the theoretical policy gradient theorem due to the existence of a distribution mismatch. In this work, we analyze the impact of this mismatch on the policy gradient methods. Specifically, we first show that in the case of tabular parameterizations, the methods under the mismatch remain globally optimal. Then, we extend this analysis to more general parameterizations by leveraging the theory of biased stochastic gradient descent. Our findings offer new insights into the robustness of policy gradient methods as well as the gap between theoretical foundations and practical implementations.
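To make the mismatch concrete, the standard statement of the discounted policy gradient is shown below (standard textbook definitions, not quoted from the paper); practical on-policy implementations typically replace the discounted visitation distribution d^π_γ with the undiscounted on-policy state distribution, which is the gap the paper analyzes.

```latex
% Standard definitions, for reference only.
\nabla_\theta J(\theta)
  = \mathbb{E}_{s \sim d^{\pi_\theta}_{\gamma},\ a \sim \pi_\theta(\cdot \mid s)}
    \big[ Q^{\pi_\theta}(s, a)\, \nabla_\theta \log \pi_\theta(a \mid s) \big],
\qquad
d^{\pi_\theta}_{\gamma}(s) \;\propto\; \sum_{t \ge 0} \gamma^{t}\, \Pr(s_t = s \mid \pi_\theta).
```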

[LG-23] Exploring Data Scaling Trends and Effects in Reinforcement Learning from Human Feedback

链接: https://arxiv.org/abs/2503.22230
作者: Wei Shen,Guanlin Liu,Zheng Wu,Ruofei Zhu,Qingping Yang,Chao Xin,Yu Yue,Lin Yan
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Reinforcement Learning from Human Feedback (RLHF) is crucial for aligning large language models with human preferences. While recent research has focused on algorithmic improvements, the importance of prompt-data construction has been overlooked. This paper addresses this gap by exploring data-driven bottlenecks in RLHF performance scaling, particularly reward hacking and decreasing response diversity. We introduce a hybrid reward system combining reasoning task verifiers (RTV) and a generative reward model (GenRM) to mitigate reward hacking. We also propose a novel prompt-selection method, Pre-PPO, to maintain response diversity and enhance learning effectiveness. Additionally, we find that prioritizing mathematical and coding tasks early in RLHF training significantly improves performance. Experiments across two model sizes validate our methods’ effectiveness and scalability. Results show that RTV is most resistant to reward hacking, followed by GenRM with ground truth, and then GenRM with SFT Best-of-N responses. Our strategies enable rapid capture of subtle task-specific distinctions, leading to substantial improvements in overall RLHF performance. This work highlights the importance of careful data construction and provides practical methods to overcome performance barriers in RLHF.

[LG-24] DREMnet: An Interpretable Denoising Framework for Semi-Airborne Transient Electromagnetic Signal

链接: https://arxiv.org/abs/2503.22223
作者: Shuang Wang,Ming Guo,Xuben Wang,Fei Deng,Lifeng Mao,Bin Wang,Wenlong Gao
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The semi-airborne transient electromagnetic method (SATEM) is capable of conducting rapid surveys over large-scale and hard-to-reach areas. However, the acquired signals are often contaminated by complex noise, which can compromise the accuracy of subsequent inversion interpretations. Traditional denoising techniques primarily rely on parameter selection strategies, which are insufficient for processing field data in noisy environments. With the advent of deep learning, various neural networks have been employed for SATEM signal denoising. However, existing deep learning methods typically use single-mapping learning approaches that struggle to effectively separate signal from noise. These methods capture only partial information and lack interpretability. To overcome these limitations, we propose an interpretable decoupled representation learning framework, termed DREMnet, that disentangles data into content and context factors, enabling robust and interpretable denoising in complex conditions. To address the limitations of CNN and Transformer architectures, we utilize the RWKV architecture for data processing and introduce the Contextual-WKV mechanism, which allows unidirectional WKV to perform bidirectional signal modeling. Our proposed Covering Embedding technique retains the strong local perception of convolutional networks through stacked embedding. Experimental results on test datasets demonstrate that the DREMnet method outperforms existing techniques, with processed field data that more accurately reflects the theoretical signal, offering improved identification of subsurface electrical structures.

[LG-25] Interpretable Deep Learning Paradigm for Airborne Transient Electromagnetic Inversion

链接: https://arxiv.org/abs/2503.22214
作者: Shuang Wang,Xuben Wang,Fei Deng,Xiaodong Yu,Peifan Jiang,Lifeng Mao
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The extraction of geoelectric structural information from airborne transient electromagnetic (ATEM) data primarily involves data processing and inversion. Conventional methods rely on empirical parameter selection, making it difficult to process complex field data with high noise levels. Additionally, inversion computations are time-consuming and often suffer from multiple local minima. Existing deep learning-based approaches separate the data processing steps, where independently trained denoising networks struggle to ensure the reliability of subsequent inversions. Moreover, end-to-end networks lack interpretability. To address these issues, we propose a unified and interpretable deep learning inversion paradigm based on disentangled representation learning. The network explicitly decomposes noisy data into noise and signal factors, completing the entire data processing workflow based on the signal factors while incorporating physical information for guidance. This approach enhances the network’s reliability and interpretability. The inversion results on field data demonstrate that our method can directly use noisy data to accurately reconstruct the subsurface electrical structure. Furthermore, it effectively processes data severely affected by environmental noise, which traditional methods struggle with, yielding improved lateral structural resolution.

[LG-26] Fuzzy Cluster-Aware Contrastive Clustering for Time Series

链接: https://arxiv.org/abs/2503.22211
作者: Congyu Wang,Mingjing Du,Xiang Jiang,Yongquan Dong
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The rapid growth of unlabeled time series data, driven by the Internet of Things (IoT), poses significant challenges in uncovering underlying patterns. Traditional unsupervised clustering methods often fail to capture the complex nature of time series data. Recent deep learning-based clustering approaches, while effective, struggle with insufficient representation learning and the integration of clustering objectives. To address these issues, we propose a fuzzy cluster-aware contrastive clustering framework (FCACC) that jointly optimizes representation learning and clustering. Our approach introduces a novel three-view data augmentation strategy to enhance feature extraction by leveraging various characteristics of time series data. Additionally, we propose a cluster-aware hard negative sample generation mechanism that dynamically constructs high-quality negative samples using clustering structure information, thereby improving the model’s discriminative ability. By leveraging fuzzy clustering, FCACC dynamically generates cluster structures to guide the contrastive learning process, resulting in more accurate clustering. Extensive experiments on 40 benchmark datasets show that FCACC outperforms the selected baseline methods (eight in total), providing an effective solution for unsupervised time series learning.

[LG-27] Reasoning of Large Language Models over Knowledge Graphs with Super-Relations

链接: https://arxiv.org/abs/2503.22166
作者: Song Wang,Junhong Lin,Xiaojie Guo,Julian Shun,Jundong Li,Yada Zhu
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:While large language models (LLMs) have made significant progress in processing and reasoning over knowledge graphs, current methods suffer from a high non-retrieval rate. This limitation reduces the accuracy of answering questions based on these graphs. Our analysis reveals that the combination of greedy search and forward reasoning is a major contributor to this issue. To overcome these challenges, we introduce the concept of super-relations, which enables both forward and backward reasoning by summarizing and connecting various relational paths within the graph. This holistic approach not only expands the search space, but also significantly improves retrieval efficiency. In this paper, we propose the ReKnoS framework, which aims to Reason over Knowledge Graphs with Super-Relations. Our framework’s key advantages include the inclusion of multiple relation paths through super-relations, enhanced forward and backward reasoning capabilities, and increased efficiency in querying LLMs. These enhancements collectively lead to a substantial improvement in the successful retrieval rate and overall reasoning performance. We conduct extensive experiments on nine real-world datasets to evaluate ReKnoS, and the results demonstrate the superior performance of ReKnoS over existing state-of-the-art baselines, with an average accuracy gain of 2.92%.

[LG-28] Landscape of Thoughts: Visualizing the Reasoning Process of Large Language Models

链接: https://arxiv.org/abs/2503.22165
作者: Zhanke Zhou,Zhaocheng Zhu,Xuan Li,Mikhail Galkin,Xiao Feng,Sanmi Koyejo,Jian Tang,Bo Han
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Numerous applications of large language models (LLMs) rely on their ability to perform step-by-step reasoning. However, the reasoning behavior of LLMs remains poorly understood, posing challenges to research, development, and safety. To address this gap, we introduce landscape of thoughts-the first visualization tool for users to inspect the reasoning paths of chain-of-thought and its derivatives on any multi-choice dataset. Specifically, we represent the states in a reasoning path as feature vectors that quantify their distances to all answer choices. These features are then visualized in two-dimensional plots using t-SNE. Qualitative and quantitative analysis with the landscape of thoughts effectively distinguishes between strong and weak models, correct and incorrect answers, as well as different reasoning tasks. It also uncovers undesirable reasoning patterns, such as low consistency and high uncertainty. Additionally, users can adapt our tool to a model that predicts the property they observe. We showcase this advantage by adapting our tool to a lightweight verifier that evaluates the correctness of reasoning paths. The code is publicly available at: this https URL.
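A small sketch of the feature construction described above, assuming access to per-state embeddings: each reasoning state becomes a vector of distances to the answer choices, and t-SNE projects these features to 2D for plotting. The distance metric and embedding source are assumptions here.

```python
import numpy as np
from sklearn.manifold import TSNE

def landscape_features(state_embeddings, choice_embeddings):
    """Represent each intermediate reasoning state by its distances to every
    answer-choice embedding, then project the features to 2D for visualization."""
    # state_embeddings: (num_states, dim), choice_embeddings: (num_choices, dim)
    diffs = state_embeddings[:, None, :] - choice_embeddings[None, :, :]
    features = np.linalg.norm(diffs, axis=-1)          # (num_states, num_choices)
    coords = TSNE(n_components=2, perplexity=5, init="random",
                  random_state=0).fit_transform(features)
    return features, coords

states = np.random.randn(40, 64)     # e.g. hidden states along reasoning paths
choices = np.random.randn(4, 64)     # embeddings of answer options A-D
_, xy = landscape_features(states, choices)   # xy is ready for a scatter plot
```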

[LG-29] T-CIL: Temperature Scaling using Adversarial Perturbation for Calibration in Class-Incremental Learning CVPR2025

链接: https://arxiv.org/abs/2503.22163
作者: Seong-Hyeon Hwang,Minsu Kim,Steven Euijong Whang
类目: Machine Learning (cs.LG)
*备注: Accepted to CVPR 2025

点击查看摘要

Abstract:We study model confidence calibration in class-incremental learning, where models learn from sequential tasks with different class sets. While existing works primarily focus on accuracy, maintaining calibrated confidence has been largely overlooked. Unfortunately, most post-hoc calibration techniques are not designed to work with the limited memories of old-task data typical in class-incremental learning, as retaining a sufficient validation set would be impractical. Thus, we propose T-CIL, a novel temperature scaling approach for class-incremental learning without a validation set for old tasks, that leverages adversarially perturbed exemplars from memory. Directly using exemplars is inadequate for temperature optimization, since they are already used for training. The key idea of T-CIL is to perturb exemplars more strongly for old tasks than for the new task by adjusting the perturbation direction based on feature distance, with the single magnitude determined using the new-task validation set. This strategy makes the perturbation magnitude computed from the new task also applicable to old tasks, leveraging the tendency that the accuracy of old tasks is lower than that of the new task. We empirically show that T-CIL significantly outperforms various baselines in terms of calibration on real datasets and can be integrated with existing class-incremental learning techniques with minimal impact on accuracy.
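For context, a minimal sketch of plain temperature scaling, the post-hoc calibration step that T-CIL adapts; the adversarial perturbation of exemplars is not reproduced here, and the optimizer settings are illustrative.

```python
import torch
import torch.nn.functional as F

def fit_temperature(logits, labels, steps=200, lr=0.01):
    """Find a single T > 0 minimizing the NLL of softmax(logits / T) on
    held-out data. Optimizing log T keeps the temperature positive."""
    log_t = torch.zeros(1, requires_grad=True)
    opt = torch.optim.Adam([log_t], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = F.cross_entropy(logits / log_t.exp(), labels)
        loss.backward()
        opt.step()
    return log_t.exp().item()

logits = torch.randn(256, 10) * 3.0        # over-confident dummy logits
labels = torch.randint(0, 10, (256,))
T = fit_temperature(logits, labels)        # divide test-time logits by T
```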

[LG-30] Long-Term Electricity Demand Prediction Using Non-negative Tensor Factorization and Genetic Algorithm-Driven Temporal Modeling

链接: https://arxiv.org/abs/2503.22132
作者: Toma Masaki,Kanta Tachibana
类目: Machine Learning (cs.LG)
*备注: 17 pages, 9 figures, 10 tables

点击查看摘要

Abstract:This study proposes a novel framework for long-term electricity demand prediction based solely on historical consumption data, without relying on external variables such as temperature or economic indicators. The method combines Non-negative Tensor Factorization (NTF) to extract low-dimensional temporal features from multi-way electricity usage data, with a Genetic Algorithm that optimizes the hyperparameters of time series models applied to the latent annual factors. We model the dataset as a third-order tensor spanning electric utilities, industrial sectors, and years, and apply canonical polyadic decomposition under non-negativity constraints. The annual component is forecasted using autoregressive models, with hyperparameter tuning guided by the prediction error or reconstruction accuracy on a validation set. Comparative experiments using real-world electricity data from Japan demonstrate that the proposed method achieves lower mean squared error than baseline approaches without tensor decomposition or evolutionary optimization. Moreover, we find that reducing the model’s degrees of freedom via tensor decomposition improves generalization performance, and that initialization sensitivity in NTF can be mitigated through multiple runs or ensemble strategies. These findings suggest that the proposed framework offers an interpretable, flexible, and scalable approach to long-term electricity demand prediction and can be extended to other structured time series forecasting tasks.

[LG-31] Multimodal Machine Learning for Real Estate Appraisal: A Comprehensive Survey

链接: https://arxiv.org/abs/2503.22119
作者: Chenya Huang,Zhidong Li,Fang Chen,Bin Liang
类目: Machine Learning (cs.LG)
*备注: 13 pages, 5 figures

点击查看摘要

Abstract:Real estate appraisal has undergone a significant transition from manual to automated valuation and is entering a new phase of evolution. Leveraging comprehensive attention to various data sources, a novel approach to automated valuation, multimodal machine learning, has taken shape. This approach integrates multimodal data to deeply explore the diverse factors influencing housing prices. Furthermore, multimodal machine learning significantly outperforms single-modality or fewer-modality approaches in terms of prediction accuracy, with enhanced interpretability. However, systematic and comprehensive survey work on the application in the real estate domain is still lacking. In this survey, we aim to bridge this gap by reviewing the research efforts. We begin by reviewing the background of real estate appraisal and propose two research questions from the perspective of performance and fusion aimed at improving the accuracy of appraisal results. Subsequently, we explain the concept of multimodal machine learning and provide a comprehensive classification and definition of modalities used in real estate appraisal for the first time. To ensure clarity, we explore works related to data and techniques, along with their evaluation methods, under the framework of these two research questions. Furthermore, specific application domains are summarized. Finally, we present insights into future research directions including multimodal complementarity, technology and modality contribution.

[LG-32] Estimating City-wide operating mode Distribution of Light-Duty Vehicles: A Neural Network-based Approach

链接: https://arxiv.org/abs/2503.22118
作者: Muhammad Usama,Haris N. Koutsopoulos,Zhengbing He,Lijiao Wang
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Driving cycles are a set of driving conditions and are crucial for the existing emission estimation model to evaluate vehicle performance, fuel efficiency, and emissions, by matching them with average speed to calculate the operating modes, such as braking, idling, and cruising. While existing emission estimation models, such as the Motor Vehicle Emission Simulator (MOVES), are powerful tools, their reliance on predefined driving cycles can be limiting, as these cycles often do not accurately represent regional driving conditions, making the models less effective for city-wide analyses. To solve this problem, this paper proposes a modular neural network (NN)-based framework to estimate operating mode distributions bypassing the driving cycle development phase, utilizing macroscopic variables such as speed, flow, and link infrastructure attributes. The proposed method is validated using a well-calibrated microsimulation model of Brookline MA, the United States. The results indicate that the proposed framework outperforms the operating mode distribution calculated by MOVES based on default driving cycles, providing a closer match to the actual operating mode distribution derived from trajectory data. Specifically, the proposed model achieves an average RMSE of 0.04 in predicting operating mode distribution, compared to 0.08 for MOVES. The average error in emission estimation across pollutants is 8.57% for the proposed method, lower than the 32.86% error for MOVES. In particular, for the estimation of CO2, the proposed method has an error of just 4%, compared to 35% for MOVES. The proposed model can be utilized for real-time emissions monitoring by providing rapid and accurate emissions estimates with easily accessible inputs.

[LG-33] ReLU Networks as Random Functions: Their Distribution in Probability Space

链接: https://arxiv.org/abs/2503.22082
作者: Shreyas Chaudhari,José M. F. Moura
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This paper presents a novel framework for understanding trained ReLU networks as random, affine functions, where the randomness is induced by the distribution over the inputs. By characterizing the probability distribution of the network’s activation patterns, we derive the discrete probability distribution over the affine functions realizable by the network. We extend this analysis to describe the probability distribution of the network’s outputs. Our approach provides explicit, numerically tractable expressions for these distributions in terms of Gaussian orthant probabilities. Additionally, we develop approximation techniques to identify the support of affine functions a trained ReLU network can realize for a given distribution of inputs. Our work provides a framework for understanding the behavior and performance of ReLU networks corresponding to stochastic inputs, paving the way for more interpretable and reliable models.
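An illustrative Monte Carlo sketch of the idea: for a one-layer ReLU network with Gaussian inputs, the empirical frequency of activation (sign) patterns approximates the distribution over realizable affine functions. The paper computes these probabilities exactly via Gaussian orthant probabilities; the sampling below is only for intuition.

```python
import numpy as np
from collections import Counter

def activation_pattern_distribution(W, b, num_samples=100_000, seed=0):
    """Estimate the distribution over activation patterns of ReLU(Wx + b)
    under x ~ N(0, I). Each pattern corresponds to one affine function."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((num_samples, W.shape[1]))
    patterns = (x @ W.T + b > 0)                       # (samples, hidden) booleans
    counts = Counter(map(tuple, patterns))
    return {p: c / num_samples for p, c in counts.items()}

W = np.random.randn(3, 2)      # 3 hidden ReLU units on 2-d Gaussian inputs
b = np.random.randn(3)
dist = activation_pattern_distribution(W, b)
for pattern, prob in sorted(dist.items(), key=lambda kv: -kv[1]):
    print(pattern, round(prob, 3))
```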

[LG-34] Concise One-Layer Transformers Can Do Function Evaluation (Sometimes)

链接: https://arxiv.org/abs/2503.22076
作者: Lena Strobl,Dana Angluin,Robert Frank
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:While transformers have proven enormously successful in a range of tasks, their fundamental properties as models of computation are not well understood. This paper contributes to the study of the expressive capacity of transformers, focusing on their ability to perform the fundamental computational task of evaluating an arbitrary function from [n] to [n] at a given argument. We prove that concise 1-layer transformers (i.e., with a polylog bound on the product of the number of heads, the embedding dimension, and precision) are capable of doing this task under some representations of the input, but not when the function’s inputs and values are only encoded in different input positions. Concise 2-layer transformers can perform the task even with the more difficult input representation. Experimentally, we find a rough alignment between what we have proven can be computed by concise transformers and what can be practically learned.

[LG-35] Arch-LLM : Taming LLM s for Neural Architecture Generation via Unsupervised Discrete Representation Learning

链接: https://arxiv.org/abs/2503.22063
作者: Deshani Geethika Poddenige,Sachith Seneviratne,Damith Senanayake,Mahesan Niranjan,PN Suganthan,Saman Halgamuge
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Unsupervised representation learning has been widely explored across various modalities, including neural architectures, where it plays a key role in downstream applications like Neural Architecture Search (NAS). These methods typically learn an unsupervised representation space before generating/sampling architectures for the downstream search. A common approach involves the use of Variational Autoencoders (VAEs) to map discrete architectures onto a continuous representation space; however, sampling from these spaces often leads to a high percentage of invalid or duplicate neural architectures. This could be due to the unnatural mapping of inherently discrete architectural space onto a continuous space, which emphasizes the need for a robust discrete representation of these architectures. To address this, we introduce a Vector Quantized Variational Autoencoder (VQ-VAE) to learn a discrete latent space more naturally aligned with the discrete neural architectures. In contrast to VAEs, VQ-VAEs (i) map each architecture into a discrete code sequence and (ii) allow the prior to be learned by any generative model rather than assuming a normal distribution. We then represent these architecture latent codes as numerical sequences and train a text-to-text model leveraging a Large Language Model to learn and generate sequences representing architectures. We evaluate our method on Inception/ResNet-like cell-based search spaces, namely NAS-Bench-101 and NAS-Bench-201. Compared to VAE-based methods, our approach improves the generation of valid and unique architectures by over 80% on NASBench-101 and over 8% on NASBench-201. Finally, we demonstrate the applicability of our method in NAS employing a sequence-modeling-based NAS algorithm.
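A compact sketch of the vector-quantization step that yields discrete codes of the kind described above, with a straight-through gradient; the codebook size, dimensions, and commitment-loss weighting are illustrative assumptions.

```python
import torch
import torch.nn as nn

class VectorQuantizer(nn.Module):
    """Illustrative VQ-VAE quantization: nearest-codebook lookup with a
    straight-through gradient estimator."""

    def __init__(self, num_codes=128, dim=16):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)

    def forward(self, z):                                  # z: (batch, dim)
        dists = torch.cdist(z, self.codebook.weight)       # (batch, num_codes)
        indices = dists.argmin(dim=-1)                     # discrete code per item
        quantized = self.codebook(indices)
        # Straight-through estimator: forward uses the code, gradients flow to z.
        quantized_st = z + (quantized - z).detach()
        commit_loss = ((quantized.detach() - z) ** 2).mean() \
                    + ((quantized - z.detach()) ** 2).mean()
        return quantized_st, indices, commit_loss

vq = VectorQuantizer()
codes_st, idx, loss = vq(torch.randn(4, 16))   # idx is the code sequence to model
```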

[LG-36] Low Rank and Sparse Fourier Structure in Recurrent Networks Trained on Modular Addition ICASSP2025

链接: https://arxiv.org/abs/2503.22059
作者: Akshay Rangamani
类目: Machine Learning (cs.LG); Signal Processing (eess.SP); Machine Learning (stat.ML)
*备注: To appear at ICASSP 2025

点击查看摘要

Abstract:Modular addition tasks serve as a useful test bed for observing empirical phenomena in deep learning, including the phenomenon of \emphgrokking. Prior work has shown that one-layer transformer architectures learn Fourier Multiplication circuits to solve modular addition tasks. In this paper, we show that Recurrent Neural Networks (RNNs) trained on modular addition tasks also use a Fourier Multiplication strategy. We identify low rank structures in the model weights, and attribute model components to specific Fourier frequencies, resulting in a sparse representation in the Fourier space. We also show empirically that the RNN is robust to removing individual frequencies, while the performance degrades drastically as more frequencies are ablated from the model.

[LG-37] Tune It Up: Music Genre Transfer and Prediction

链接: https://arxiv.org/abs/2503.22008
作者: Fidan Samet,Oguz Bakir,Adnan Fidan
类目: ound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
*备注:

点击查看摘要

Abstract:Deep generative models have been used in style transfer tasks for images. In this study, we adapt and improve the CycleGAN model to perform music style transfer between the Jazz and Classic genres. By doing so, we aim to easily generate new songs, cover music in different genres, and reduce the arrangements needed in those processes. We train and use a music genre classifier to assess the performance of the transfer models. To that end, we obtain 87.7% accuracy with the Multi-layer Perceptron algorithm. To improve our style transfer baseline, we add auxiliary discriminators and a triplet loss to our model. According to our experiments, we obtain the best accuracies of 69.4% on the Jazz to Classic task and 39.3% on the Classic to Jazz task with our developed genre classifier. We also run a subjective experiment, and its results show that the overall performance of our transfer model is good and that it manages to preserve the melody of the inputs in the transferred outputs. Our code is available at this https URL fidansamet/tune-it-up

[LG-38] Bresa: Bio-inspired Reflexive Safe Reinforcement Learning for Contact-Rich Robotic Tasks

链接: https://arxiv.org/abs/2503.21989
作者: Heng Zhang,Gokhan Solak,Arash Ajoudani
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注: submitted to IEEE RA-L

点击查看摘要

Abstract:Ensuring safety in reinforcement learning (RL)-based robotic systems is a critical challenge, especially in contact-rich tasks within unstructured environments. While the state-of-the-art safe RL approaches mitigate risks through safe exploration or high-level recovery mechanisms, they often overlook low-level execution safety, where reflexive responses to potential hazards are crucial. Similarly, variable impedance control (VIC) enhances safety by adjusting the robot’s mechanical response, yet lacks a systematic way to adapt parameters, such as stiffness and damping throughout the task. In this paper, we propose Bresa, a Bio-inspired Reflexive Hierarchical Safe RL method inspired by biological reflexes. Our method decouples task learning from safety learning, incorporating a safety critic network that evaluates action risks and operates at a higher frequency than the task solver. Unlike existing recovery-based methods, our safety critic functions at a low-level control layer, allowing real-time intervention when unsafe conditions arise. The task-solving RL policy, running at a lower frequency, focuses on high-level planning (decision-making), while the safety critic ensures instantaneous safety corrections. We validate Bresa on multiple tasks including a contact-rich robotic task, demonstrating its reflexive ability to enhance safety, and adaptability in unforeseen dynamic environments. Our results show that Bresa outperforms the baseline, providing a robust and reflexive safety mechanism that bridges the gap between high-level planning and low-level execution. Real-world experiments and supplementary material are available at project website this https URL.

[LG-39] Improving Equivariant Networks with Probabilistic Symmetry Breaking

链接: https://arxiv.org/abs/2503.21985
作者: Hannah Lawrence,Vasco Portilheiro,Yan Zhang,Sékou-Oumar Kaba
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 28 pages, 7 figures

点击查看摘要

Abstract:Equivariance encodes known symmetries into neural networks, often enhancing generalization. However, equivariant networks cannot break symmetries: the output of an equivariant network must, by definition, have at least the same self-symmetries as the input. This poses an important problem, both (1) for prediction tasks on domains where self-symmetries are common, and (2) for generative models, which must break symmetries in order to reconstruct from highly symmetric latent spaces. This fundamental limitation can be addressed by considering equivariant conditional distributions, instead of equivariant functions. We present novel theoretical results that establish necessary and sufficient conditions for representing such distributions. Concretely, this representation provides a practical framework for breaking symmetries in any equivariant network via randomized canonicalization. Our method, SymPE (Symmetry-breaking Positional Encodings), admits a simple interpretation in terms of positional encodings. This approach expands the representational power of equivariant networks while retaining the inductive bias of symmetry, which we justify through generalization bounds. Experimental results demonstrate that SymPE significantly improves performance of group-equivariant and graph neural networks across diffusion models for graphs, graph autoencoders, and lattice spin system modeling.

[LG-40] RocketPPA: Ultra-Fast LLM -Based PPA Estimator at Code-Level Abstraction

链接: https://arxiv.org/abs/2503.21971
作者: Armin Abdollahi,Mehdi Kamal,Massoud Pedram
类目: Machine Learning (cs.LG); Software Engineering (cs.SE)
*备注:

点击查看摘要

Abstract:Large language models have recently transformed hardware design, yet bridging the gap between code synthesis and PPA (power, performance, and area) estimation remains a challenge. In this work, we introduce a novel framework that leverages a 21k dataset of thoroughly cleaned and synthesizable Verilog modules, each annotated with detailed power, delay, and area metrics. By employing chain-of-thought techniques, we automatically debug and curate this dataset to ensure high fidelity in downstream applications. We then fine-tune CodeLlama using LoRA-based parameter-efficient methods, framing the task as a regression problem to accurately predict PPA metrics from Verilog code. Furthermore, we augment our approach with a mixture-of-experts architecture-integrating both LoRA and an additional MLP expert layer-to further refine predictions. Experimental results demonstrate significant improvements: power estimation accuracy is enhanced by 5.9% at a 20% error threshold and by 7.2% at a 10% threshold, delay estimation improves by 5.1% and 3.9%, and area estimation sees gains of 4% and 7.9% for the 20% and 10% thresholds, respectively. Notably, the incorporation of the mixture-of-experts module contributes an additional 3–4% improvement across these tasks. Our results establish a new benchmark for PPA-aware Verilog generation, highlighting the effectiveness of our integrated dataset and modeling strategies for next-generation EDA workflows.
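A minimal sketch of a LoRA adapter of the kind used for parameter-efficient fine-tuning described above; the rank, scaling, and choice of which layers to wrap are assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Illustrative LoRA adapter: the frozen base weight plus a trainable
    low-rank update scaled by alpha / r."""

    def __init__(self, base: nn.Linear, r=8, alpha=16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                       # freeze pretrained weights
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero init: no change at start
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.t() @ self.B.t())

layer = LoRALinear(nn.Linear(512, 512))
out = layer(torch.randn(2, 512))    # only A and B receive gradients during fine-tuning
```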

[LG-41] NeuroLIP: Interpretable and Fair Cross-Modal Alignment of fMRI and Phenotypic Text

链接: https://arxiv.org/abs/2503.21964
作者: Yanting Yang,Xiaoxiao Li
类目: Machine Learning (cs.LG); Neurons and Cognition (q-bio.NC)
*备注:

点击查看摘要

Abstract:Integrating functional magnetic resonance imaging (fMRI) connectivity data with phenotypic textual descriptors (e.g., disease label, demographic data) holds significant potential to advance our understanding of neurological conditions. However, existing cross-modal alignment methods often lack interpretability and risk introducing biases by encoding sensitive attributes together with diagnostic-related features. In this work, we propose NeuroLIP, a novel cross-modal contrastive learning framework. We introduce text token-conditioned attention (TTCA) and cross-modal alignment via localized tokens (CALT) to the brain region-level embeddings with each disease-related phenotypic token. It improves interpretability via token-level attention maps, revealing brain region-disease associations. To mitigate bias, we propose a loss for sensitive attribute disentanglement that maximizes the attention distance between disease tokens and sensitive attribute tokens, reducing unintended correlations in downstream predictions. Additionally, we incorporate a negative gradient technique that reverses the sign of CALT loss on sensitive attributes, further discouraging the alignment of these features. Experiments on neuroimaging datasets (ABIDE and ADHD-200) demonstrate NeuroLIP’s superiority in terms of fairness metrics while maintaining the overall best standard metric performance. Qualitative visualization of attention maps highlights neuroanatomical patterns aligned with diagnostic characteristics, validated by the neuroscientific literature. Our work advances the development of transparent and equitable neuroimaging AI.

[LG-42] Reward Design for Reinforcement Learning Agents

链接: https://arxiv.org/abs/2503.21949
作者: Rati Devidze
类目: Machine Learning (cs.LG)
*备注: Doctoral thesis

点击查看摘要

Abstract:Reward functions are central in reinforcement learning (RL), guiding agents towards optimal decision-making. The complexity of RL tasks requires meticulously designed reward functions that effectively drive learning while avoiding unintended consequences. Effective reward design aims to provide signals that accelerate the agent’s convergence to optimal behavior. Crafting rewards that align with task objectives, foster desired behaviors, and prevent undesirable actions is inherently challenging. This thesis delves into the critical role of reward signals in RL, highlighting their impact on the agent’s behavior and learning dynamics and addressing challenges such as delayed, ambiguous, or intricate rewards. In this thesis work, we tackle different aspects of reward shaping. First, we address the problem of designing informative and interpretable reward signals from a teacher’s/expert’s perspective (teacher-driven). Here, the expert, equipped with the optimal policy and the corresponding value function, designs reward signals that expedite the agent’s convergence to optimal behavior. Second, we build on this teacher-driven approach by introducing a novel method for adaptive interpretable reward design. In this scenario, the expert tailors the rewards based on the learner’s current policy, ensuring alignment and optimal progression. Third, we propose a meta-learning approach, enabling the agent to self-design its reward signals online without expert input (agent-driven). This self-driven method considers the agent’s learning and exploration to establish a self-improving feedback loop.

[LG-43] Hierarchical Label Propagation: A Model-Size-Dependent Performance Booster for AudioSet Tagging

链接: https://arxiv.org/abs/2503.21826
作者: Ludovic Tuncay(IRIT-SAMoVA),Etienne Labbé(IRIT-SAMoVA),Thomas Pellegrini(IRIT-SAMoVA, UT3)
类目: Sound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
*备注:

点击查看摘要

Abstract:AudioSet is one of the most used and largest datasets in audio tagging, containing about 2 million audio samples that are manually labeled with 527 event categories organized into an ontology. However, the annotations contain inconsistencies, particularly where categories that should be labeled as positive according to the ontology are frequently mislabeled as negative. To address this issue, we apply Hierarchical Label Propagation (HLP), which propagates labels up the ontology hierarchy, resulting in a mean increase in positive labels per audio clip from 1.98 to 2.39 and affecting 109 out of the 527 classes. Our results demonstrate that HLP provides performance benefits across various model architectures, including convolutional neural networks (PANN’s CNN6 and ConvNeXT) and transformers (PaSST), with smaller models showing more improvements. Finally, on FSD50K, another widely used dataset, models trained on AudioSet with HLP consistently outperformed those trained without HLP. Our source code will be made available on GitHub.
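
The core HLP operation is easy to sketch: every positive label is propagated to all of its ancestors in the ontology. The tiny ontology fragment and label set below are invented for illustration.

```python
# Minimal Hierarchical Label Propagation: positives are pushed up the ontology,
# so a clip labelled with a leaf class also becomes positive for its ancestors.

parent = {                 # child -> parent edges of a toy ontology fragment
    "Dog bark": "Dog",
    "Dog": "Animal",
    "Cat meow": "Cat",
    "Cat": "Animal",
}

def propagate(labels, parent):
    closed, frontier = set(labels), list(labels)
    while frontier:
        node = frontier.pop()
        p = parent.get(node)
        if p is not None and p not in closed:
            closed.add(p)
            frontier.append(p)
    return closed

print(sorted(propagate({"Dog bark"}, parent)))
# ['Animal', 'Dog', 'Dog bark']: one positive becomes three, mirroring the
# reported rise from 1.98 to 2.39 positive labels per clip on AudioSet.
```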

[LG-44] Unsupervised Ordering for Maximum Clique

链接: https://arxiv.org/abs/2503.21814
作者: Yimeng Min,Carla P. Gomes
类目: Machine Learning (cs.LG)
*备注: preprint

点击查看摘要

Abstract:We propose an unsupervised approach for learning vertex orderings for the maximum clique problem by framing it within a permutation-based framework. We transform the combinatorial constraints into geometric relationships such that the ordering of vertices aligns with the clique structures. By integrating this clique-oriented ordering into branch-and-bound search, we improve search efficiency and reduce the number of computational steps. Our results demonstrate how unsupervised learning of vertex ordering can enhance search efficiency across diverse graph instances. We further study the generalization across different sizes.
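
The interplay between a vertex ordering and branch-and-bound search can be illustrated with a few lines of plain Python; the toy graph and the simple degree-based ordering below stand in for the learned ordering studied in the paper.

```python
def max_clique(adj, order):
    """Branch-and-bound maximum clique, expanding vertices in the given order."""
    best = []

    def expand(candidates, clique):
        nonlocal best
        if len(clique) + len(candidates) <= len(best):   # bound: cannot improve
            return
        if not candidates:
            best = clique[:]                             # strictly larger by the bound above
            return
        v = candidates[0]
        # Branch 1: take v, keeping only candidates adjacent to v (order preserved).
        expand([u for u in candidates[1:] if u in adj[v]], clique + [v])
        # Branch 2: skip v.
        expand(candidates[1:], clique)

    expand(list(order), [])
    return best

adj = {0: {1, 2, 3}, 1: {0, 2, 3}, 2: {0, 1, 3}, 3: {0, 1, 2, 4}, 4: {3}}
order = sorted(adj, key=lambda v: len(adj[v]), reverse=True)  # high-degree first
print(max_clique(adj, order))   # a maximum clique of size 4: [3, 0, 1, 2]
```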

[LG-45] IPGO: Indirect Prompt Gradient Optimization on Text-to-Image Generative Models with High Data Efficiency

链接: https://arxiv.org/abs/2503.21812
作者: Jianping Ye,Michel Wedel,Kunpeng Zhang
类目: Machine Learning (cs.LG)
*备注: 8 pages, 4 figures, 1 table

点击查看摘要

Abstract:Text-to-Image Diffusion models excel at generating images from text prompts but often lack optimal alignment with content semantics, aesthetics, and human preferences. To address these issues, in this study we introduce a novel framework, Indirect Prompt Gradient Optimization (IPGO), for prompt-level fine-tuning. IPGO enhances prompt embeddings by injecting continuously differentiable tokens at the beginning and end of the prompt embeddings, while exploiting low-rank benefits and flexibility from rotations. It allows for gradient-based optimization of injected tokens while enforcing value, orthonormality, and conformity constraints, facilitating continuous updates and empowering computational efficiency. To evaluate the performance of IPGO, we conduct prompt-wise and prompt-batch training with three reward models targeting image aesthetics, image-text alignment, and human preferences under three datasets of different complexity. The results show that IPGO consistently matches or outperforms cutting-edge benchmarks, including stable diffusion v1.5 with raw prompts, training-based approaches (DRaFT and DDPO), and training-free methods (DPO-Diffusion, Promptist, and ChatGPT-4o). Furthermore, we demonstrate IPGO’s effectiveness in enhancing image generation quality while requiring minimal training data and limited computational resources.
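
A hedged sketch of the central mechanism follows: frozen prompt embeddings are wrapped with a handful of learnable prefix and suffix token embeddings, and only those injected tokens receive gradients from a differentiable reward. The `PromptWrapper` name, the shapes, and the stand-in reward are assumptions; IPGO's rotation, orthonormality, and value constraints are omitted here.

```python
import torch
import torch.nn as nn

class PromptWrapper(nn.Module):
    """Inject learnable tokens before and after frozen prompt embeddings."""
    def __init__(self, n_prefix=4, n_suffix=4, dim=768):
        super().__init__()
        self.prefix = nn.Parameter(0.02 * torch.randn(n_prefix, dim))
        self.suffix = nn.Parameter(0.02 * torch.randn(n_suffix, dim))

    def forward(self, prompt_emb):                     # (batch, seq_len, dim)
        b = prompt_emb.size(0)
        pre = self.prefix.unsqueeze(0).expand(b, -1, -1)
        suf = self.suffix.unsqueeze(0).expand(b, -1, -1)
        return torch.cat([pre, prompt_emb, suf], dim=1)

wrapper = PromptWrapper()
prompt_emb = torch.randn(2, 77, 768)                   # stand-in for a CLIP text encoding
reward = wrapper(prompt_emb).mean()                    # stand-in for a differentiable reward
reward.backward()
print(wrapper.prefix.grad.shape, prompt_emb.grad)      # torch.Size([4, 768]) None
```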

[LG-46] Meta-Representational Predictive Coding: Biomimetic Self-Supervised Learning

链接: https://arxiv.org/abs/2503.21796
作者: Alexander Ororbia,Karl Friston,Rajesh P. N. Rao
类目: Neural and Evolutionary Computing (cs.NE); Machine Learning (cs.LG); Neurons and Cognition (q-bio.NC)
*备注:

点击查看摘要

Abstract:Self-supervised learning has become an increasingly important paradigm in the domain of machine intelligence. Furthermore, evidence for self-supervised adaptation, such as contrastive formulations, has emerged in recent computational neuroscience and brain-inspired research. Nevertheless, current work on self-supervised learning relies on biologically implausible credit assignment – in the form of backpropagation of errors – and feedforward inference, typically a forward-locked pass. Predictive coding, in its mechanistic form, offers a biologically plausible means to sidestep these backprop-specific limitations. However, unsupervised predictive coding rests on learning a generative model of raw pixel input (akin to ``generative AI’’ approaches), which entails predicting a potentially high dimensional input; on the other hand, supervised predictive coding, which learns a mapping between inputs to target labels, requires human annotation, and thus incurs the drawbacks of supervised learning. In this work, we present a scheme for self-supervised learning within a neurobiologically plausible framework that appeals to the free energy principle, constructing a new form of predictive coding that we call meta-representational predictive coding (MPC). MPC sidesteps the need for learning a generative model of sensory input (e.g., pixel-level features) by learning to predict representations of sensory input across parallel streams, resulting in an encoder-only learning and inference scheme. This formulation rests on active inference (in the form of sensory glimpsing) to drive the learning of representations, i.e., the representational dynamics are driven by sequences of decisions made by the model to sample informative portions of its sensorium.

[LG-47] Differential equation quantum solvers: engineering measurements to reduce cost

链接: https://arxiv.org/abs/2503.22656
作者: Annie Paine,Casper Gyurik,Antonio Andrea Gentile
类目: Quantum Physics (quant-ph); Machine Learning (cs.LG)
*备注: 15 pages, 4 figures

点击查看摘要

Abstract:Quantum computers have been proposed as a solution for efficiently solving non-linear differential equations (DEs), a fundamental task across diverse technological and scientific domains. However, a crucial milestone in this regard is to design protocols that are hardware-aware, making efficient use of limited available quantum resources. We focus here on promising variational methods derived from scientific machine learning: differentiable quantum circuits (DQC), addressing specifically their cost in number of circuit evaluations. Reducing the number of quantum circuit evaluations is particularly valuable in hybrid quantum/classical protocols, where the time required to interface and run quantum hardware at each cycle can impact the total wall-time much more than relatively inexpensive classical post-processing overhead. Here, we propose and test two sample-efficient protocols for solving non-linear DEs, achieving exponential savings in quantum circuit evaluations. These protocols are based on redesigning the extraction of information from DQC in a ``measure-first" approach, by introducing engineered cost operators similar to the randomized-measurement toolbox (i.e. classical shadows). In benchmark simulations on one and two-dimensional DEs, we report up to \sim 100 fold reductions in circuit evaluations. Our protocols thus hold the promise to unlock larger and more challenging non-linear differential equation demonstrations with existing quantum hardware.

[LG-48] Using Machine Learning for Lunar Mineralogy-I: Hyperspectral Imaging of Volcanic Samples

链接: https://arxiv.org/abs/2503.22617
作者: Fatemeh Fazel Hesar,Mojtaba Raouf,Peyman Soltani,Bernard Foing,Michiel J.A. de Dood,Fons J. Verbeek,Esther Cheng,Chenming Zhou
类目: Earth and Planetary Astrophysics (astro-ph.EP); Machine Learning (cs.LG)
*备注: 18 pages, 7 figures, Accepted to the Special Issue: Planetary Radar Astronomy - Universe: Planetary Sciences Journal

点击查看摘要

Abstract:This study examines the mineral composition of volcanic samples similar to lunar materials, focusing on olivine and pyroxene. Using hyperspectral imaging from 400 to 1000 nm, we created data cubes to analyze the reflectance characteristics of samples from Vulcano, a volcanically active island in the Aeolian Archipelago, north of Sicily, Italy, categorizing them into nine regions of interest and analyzing spectral data for each. We applied various unsupervised clustering algorithms, including K-Means, Hierarchical Clustering, GMM, and Spectral Clustering, to classify the spectral profiles. Principal Component Analysis revealed distinct spectral signatures associated with specific minerals, facilitating precise identification. Clustering performance varied by region, with K-Means achieving the highest silhouette score of 0.47, whereas GMM performed poorly with a score of only 0.25. Non-negative Matrix Factorization aided in identifying similarities among clusters across different methods and reference spectra for olivine and pyroxene. Hierarchical clustering emerged as the most reliable technique, achieving a 94% similarity with the olivine spectrum in one sample, whereas GMM exhibited notable variability. Overall, the analysis indicated that both Hierarchical and K-Means methods yielded lower errors in total measurements, with K-Means demonstrating superior performance in estimated dispersion and clustering. Additionally, GMM showed a higher root mean square error compared to the other models. The RMSE analysis confirmed K-Means as the most consistent algorithm across all samples, suggesting a predominance of olivine in the Vulcano region relative to pyroxene. This predominance is likely linked to historical formation conditions similar to volcanic processes on the Moon, where olivine-rich compositions are common in ancient lava flows and impact melt rocks.
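
A compact sketch of the clustering pipeline described above (PCA followed by K-Means, hierarchical, and GMM clustering with silhouette scoring) is shown below, using synthetic spectra in place of the 400-1000 nm hyperspectral data cubes.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.mixture import GaussianMixture
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
spectra = np.vstack([
    rng.normal(loc=m, scale=0.05, size=(200, 60))   # 60 spectral bands per pixel
    for m in (0.2, 0.5, 0.8)                        # three synthetic "mineral" classes
])

X = PCA(n_components=5).fit_transform(spectra)      # PCA before clustering

for name, labels in {
    "kmeans": KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X),
    "hierarchical": AgglomerativeClustering(n_clusters=3).fit_predict(X),
    "gmm": GaussianMixture(n_components=3, random_state=0).fit_predict(X),
}.items():
    print(name, round(silhouette_score(X, labels), 3))
```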

[LG-49] Comparison between neural network clustering hierarchical clustering and k-means clustering: Applications using fluidic lenses

链接: https://arxiv.org/abs/2503.22448
作者: Graciana Puentes
类目: Optics (physics.optics); Machine Learning (cs.LG)
*备注: 19 pages, 9 figures

点击查看摘要

Abstract:A comparison between neural network clustering (NNC), hierarchical clustering (HC) and K-means clustering (KMC) is performed to evaluate the computational superiority of these three machine learning (ML) techniques for organizing large datasets into clusters. For NNC, a self-organizing map (SOM) training was applied to a collection of wavefront sensor reconstructions, decomposed in terms of 15 Zernike coefficients, characterizing the optical aberrations of the phase front transmitted by fluidic lenses. In order to understand the distribution and structure of the 15 Zernike variables within an input space, SOM-neighboring weight distances, SOM-sample hits, SOM-weight positions and SOM-weight planes were analyzed to form a visual interpretation of the system’s structural properties. In the case of HC, the data was partitioned using a combined dissimilarity-linkage matrix computation. The effectiveness of this method was confirmed by a high cophenetic correlation coefficient value (c=0.9651). Additionally, a maximum number of clusters was established by setting an inconsistency cutoff of 0.8, yielding a total of 7 clusters for system segmentation. In addition, a KMC approach was employed to establish a quantitative measure of clustering segmentation efficiency, obtaining an average silhouette value of 0.905 for data segmentation into K=5 non-overlapping clusters. On the other hand, the NNC analysis revealed that the 15 variables could be characterized through the collective influence of 8 clusters. It was established that the formation of clusters through the combined linkage and dissimilarity algorithms of HC alongside KMC is a more dependable clustering solution than separate assessment via NNC or HC, where altering the SOM size or inconsistency cutoff can lead to completely new clustering configurations.
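
The two hierarchical-clustering diagnostics quoted above, the cophenetic correlation coefficient and cluster extraction with an inconsistency cutoff, can be reproduced schematically with SciPy; the synthetic 15-dimensional samples below merely stand in for the Zernike-coefficient data.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, cophenet, fcluster
from scipy.spatial.distance import pdist

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(c, 0.3, size=(50, 15)) for c in (-2.0, 0.0, 2.0)])

d = pdist(X)                        # pairwise dissimilarities
Z = linkage(d, method="average")    # combined dissimilarity-linkage computation

c, _ = cophenet(Z, d)
print("cophenetic correlation:", round(c, 4))

labels = fcluster(Z, t=0.8, criterion="inconsistent")   # inconsistency cutoff of 0.8
print("number of clusters at cutoff 0.8:", len(np.unique(labels)))
```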

[LG-50] Data-driven modeling of fluid flow around rotating structures with graph neural networks

链接: https://arxiv.org/abs/2503.22252
作者: Rui Gao,Zhi Cheng,Rajeev K. Jaiman
类目: Fluid Dynamics (physics.flu-dyn); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Graph neural networks, recently introduced into the field of fluid flow surrogate modeling, have been successfully applied to model the temporal evolution of various fluid flow systems. Existing applications, however, are mostly restricted to cases where the domain is time-invariant. The present work extends the application of graph neural network-based modeling to fluid flow around structures rotating with respect to a certain axis. Specifically, we propose to apply a graph neural network-based surrogate modeling for fluid flow with the mesh corotating with the structure. Unlike conventional data-driven approaches that rely on structured Cartesian meshes, our framework operates on unstructured co-rotating meshes, enforcing rotation equivariance of the learned model by leveraging co-rotating polar (2D) and cylindrical (3D) coordinate systems. To model the pressure for systems without Dirichlet pressure boundaries, we propose a novel local directed pressure difference formulation that is invariant to the reference pressure point and value. For flow systems with large mesh sizes, we introduce a scheme to train the network in single or distributed graphics processing units by accumulating the backpropagated gradients from partitions of the mesh. The effectiveness of our proposed framework is examined on two test cases: (i) fluid flow in a 2D rotating mixer, and (ii) the flow past a 3D rotating cube. Our results show that the model achieves stable and accurate rollouts for over 2000 time steps in periodic regimes while capturing accurate short-term dynamics in chaotic flow regimes. In addition, the drag and lift force predictions closely match the CFD calculations, highlighting the potential of the framework for modeling both periodic and chaotic fluid flow around rotating structures.
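
The co-rotating coordinate idea can be illustrated in a few lines: expressing node positions in a polar frame that rotates with the structure yields features that are unchanged when structure and frame rotate together. The 2D toy nodes below are purely illustrative.

```python
import numpy as np

def to_corotating_polar(xy, theta):
    """(r, phi) of 2D node coordinates in a frame rotated by theta (radians)."""
    r = np.linalg.norm(xy, axis=1)
    phi = np.arctan2(xy[:, 1], xy[:, 0]) - theta
    return np.stack([r, np.mod(phi + np.pi, 2 * np.pi) - np.pi], axis=1)

nodes = np.array([[1.0, 0.0], [0.0, 2.0]])
rot90 = np.array([[0.0, 1.0], [-1.0, 0.0]])            # rotate row vectors by +90 degrees

print(to_corotating_polar(nodes, 0.0))
print(to_corotating_polar(nodes @ rot90, np.pi / 2))
# Both calls print the same (r, phi) features: rotating the structure and the
# frame together leaves the network inputs unchanged (rotation equivariance).
```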

[LG-51] An Advanced Ensemble Deep Learning Framework for Stock Price Prediction Using VAE Transformer and LSTM Model

链接: https://arxiv.org/abs/2503.22192
作者: Anindya Sarkar,G. Vadivu
类目: Computational Finance (q-fin.CP); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This research proposes a cutting-edge ensemble deep learning framework for stock price prediction by combining three advanced neural network architectures: Variational Autoencoder (VAE), Transformer, and Long Short-Term Memory (LSTM) networks. The framework is designed to exploit the advantages of each model, allowing it to capture both linear and non-linear relations in stock price movements. To improve prediction accuracy, it uses a rich set of technical indicators and scales its predictors according to the current market situation. Evaluated on several stock datasets and benchmarked against single models and conventional forecasting, the ensemble method exhibits consistently high accuracy and reliability. The VAE learns linear representations of high-dimensional data, while the Transformer performs outstandingly in recognizing long-term patterns in the stock price data. The LSTM, as a sequence model, brings additional improvements to the framework, especially regarding temporal dynamics and fluctuations. Combined, these components provide exceptional directional performance and a very small disparity in the predicted results. The proposed solution offers a viable approach to the inherent problem of stock price prediction, with high reliability and scalability. Compared with individual neural network models as well as classical methods, the proposed ensemble framework demonstrates the advantages of combining different architectures. It has important applications in algorithmic trading, risk analysis, and decision-making for finance professionals and scholars.

[LG-52] Characterizing Non-Markovian Dynamics of Open Quantum Systems

链接: https://arxiv.org/abs/2503.22147
作者: Sohail Reddy
类目: Quantum Physics (quant-ph); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Characterizing non-Markovian quantum dynamics is essential for accurately modeling open quantum systems, particularly in near-term quantum technologies. In this work, we develop a structure-preserving approach to characterizing non-Markovian evolution using the time-convolutionless (TCL) master equation, considering both linear and nonlinear formulations. To parameterize the master equation, we explore two distinct techniques: the Karhunen-Loeve (KL) expansion, which provides an optimal basis representation of the dynamics, and neural networks, which offer a data-driven approach to learning system-environment interactions. We demonstrate our methodology using experimental data from a superconducting qubit at the Quantum Device Integration Testbed (QuDIT) at Lawrence Livermore National Laboratory (LLNL). Our results show that while neural networks can capture complex dependencies, the KL expansion yields the most accurate predictions of the qubit’s non-Markovian dynamics, highlighting its effectiveness in structure-preserving quantum system characterization. These findings provide valuable insights into efficient modeling strategies for open quantum systems, with implications for quantum control and error mitigation in near-term quantum processors.
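
To make the Karhunen-Loeve step concrete, the sketch below extracts a low-dimensional basis from an ensemble of noisy trajectories via the eigen-decomposition of their empirical covariance; the synthetic decay curves stand in for measured qubit dynamics and do not reflect the paper's actual parameterisation.

```python
import numpy as np

rng = np.random.default_rng(0)
t = np.linspace(0.0, 5.0, 100)
rates = rng.uniform(0.3, 1.0, size=200)
trajs = np.exp(-np.outer(rates, t)) + 0.01 * rng.standard_normal((200, 100))

mean = trajs.mean(axis=0)
cov = np.cov(trajs, rowvar=False)                      # empirical covariance over time points
evals, evecs = np.linalg.eigh(cov)
order = np.argsort(evals)[::-1]
modes, evals = evecs[:, order[:3]], evals[order[:3]]   # keep the 3 leading KL modes

coeffs = (trajs - mean) @ modes                        # per-trajectory KL coefficients
recon = mean + coeffs @ modes.T
print("explained variance fractions:", np.round(evals / np.trace(cov), 4))
print("max reconstruction error:", float(np.abs(recon - trajs).max()))
```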

[LG-53] Time-resolved dynamic CBCT reconstruction using prior-model-free spatiotemporal Gaussian representation (PMF-STGR)

链接: https://arxiv.org/abs/2503.22139
作者: Jiacheng Xie,Hua-Chieh Shao,You Zhang
类目: Medical Physics (physics.med-ph); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
*备注: 25 pages, 5 figures

点击查看摘要

Abstract:Time-resolved CBCT imaging, which reconstructs a dynamic sequence of CBCTs reflecting intra-scan motion (one CBCT per x-ray projection without phase sorting or binning), is highly desired for regular and irregular motion characterization, patient setup, and motion-adapted radiotherapy. Representing patient anatomy and associated motion fields as 3D Gaussians, we developed a Gaussian representation-based framework (PMF-STGR) for fast and accurate dynamic CBCT reconstruction. PMF-STGR comprises three major components: a dense set of 3D Gaussians to reconstruct a reference-frame CBCT for the dynamic sequence; another 3D Gaussian set to capture three-level, coarse-to-fine motion-basis-components (MBCs) to model the intra-scan motion; and a CNN-based motion encoder to solve projection-specific temporal coefficients for the MBCs. Scaled by the temporal coefficients, the learned MBCs will combine into deformation vector fields to deform the reference CBCT into projection-specific, time-resolved CBCTs to capture the dynamic motion. Due to the strong representation power of 3D Gaussians, PMF-STGR can reconstruct dynamic CBCTs in a ‘one-shot’ training fashion from a standard 3D CBCT scan, without using any prior anatomical or motion model. We evaluated PMF-STGR using XCAT phantom simulations and real patient scans. Metrics including the image relative error, structural-similarity-index-measure, tumor center-of-mass-error, and landmark localization error were used to evaluate the accuracy of solved dynamic CBCTs and motion. PMF-STGR shows clear advantages over a state-of-the-art, INR-based approach, PMF-STINR. Compared with PMF-STINR, PMF-STGR reduces reconstruction time by 50% while reconstructing less blurred images with better motion accuracy. With improved efficiency and accuracy, PMF-STGR enhances the applicability of dynamic CBCT imaging for potential clinical translation.
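
The motion model sketched above reduces to a linear combination: projection-specific temporal coefficients scale the learned motion-basis components (MBCs) and sum into a deformation vector field. The random arrays below are stand-ins only; the Gaussian representation itself is not modelled here.

```python
import numpy as np

rng = np.random.default_rng(0)
n_vox, n_mbc = 64 ** 3, 3                      # voxels in the reference CBCT, number of MBCs
mbcs = rng.normal(size=(n_mbc, n_vox, 3))      # coarse-to-fine motion-basis components
coeffs_t = rng.normal(size=n_mbc)              # temporal coefficients solved for one projection

dvf_t = np.tensordot(coeffs_t, mbcs, axes=1)   # (n_vox, 3) deformation vector field at time t
print(dvf_t.shape)                             # the reference CBCT would be warped by dvf_t
```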

[LG-54] Enhancing Predictive Accuracy in Tennis: Integrating Fuzzy Logic and CV-GRNN for Dynamic Match Outcome and Player Momentum Analysis

链接: https://arxiv.org/abs/2503.21809
作者: Kechen Li,Jiaming Liu,Zhenyu Wu,Jinpeng Li
类目: Applications (stat.AP); Machine Learning (cs.LG)
*备注: 22 pages, 10 figures, 9 tables

点击查看摘要

Abstract:The predictive analysis of match outcomes and player momentum in professional tennis has long been a subject of scholarly debate. In this paper, we introduce a novel approach to game prediction by combining a multi-level fuzzy evaluation model with a CV-GRNN model. We first identify critical statistical indicators via Principal Component Analysis and then develop a two-tier fuzzy model based on the Wimbledon data. In addition, the results of Pearson Correlation Coefficient indicate that the momentum indicators, such as Player Win Streak and Score Difference, have a strong correlation among them, revealing insightful trends among players transitioning between losing and winning streaks. Subsequently, we refine the CV-GRNN model by incorporating 15 statistically significant indicators, resulting in an increase in accuracy to 86.64% and a decrease in MSE by 49.21%. This consequently strengthens the methodological framework for predicting tennis match outcomes, emphasizing its practical utility and potential for adaptation in various athletic contexts.
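
A small sketch of the indicator-screening step (PCA over candidate match statistics plus a Pearson correlation between two momentum indicators) is given below on synthetic data; the variable names and generated values are assumptions, not the Wimbledon dataset.

```python
import numpy as np
from sklearn.decomposition import PCA
from scipy.stats import pearsonr

rng = np.random.default_rng(0)
n_points = 300
win_streak = rng.integers(0, 6, size=n_points).astype(float)
score_diff = 0.8 * win_streak + rng.normal(0, 0.5, size=n_points)
other = rng.normal(size=(n_points, 13))
X = np.column_stack([win_streak, score_diff, other])   # 15 candidate indicators

pca = PCA().fit(X)
print("variance explained by first 5 PCs:", float(pca.explained_variance_ratio_[:5].sum()))

r, p = pearsonr(win_streak, score_diff)
print("Pearson r between win streak and score difference:", round(r, 3), "p =", p)
```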

[LG-55] Structured and sparse partial least squares coherence for multivariate cortico-muscular analysis

链接: https://arxiv.org/abs/2503.21802
作者: Jingyao Sun,Qilu Zhang,Di Ma,Tianyu Jia,Shijie Jia,Xiaoxue Zhai,Ruimou Xie,Ping-Ju Lin,Zhibin Li,Yu Pan,Linhong Ji,Chong Li
类目: Applications (stat.AP); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: This work has been submitted to the IEEE for possible publication

点击查看摘要

Abstract:Multivariate cortico-muscular analysis has recently emerged as a promising approach for evaluating the corticospinal neural pathway. However, current multivariate approaches encounter challenges such as high dimensionality and limited sample sizes, thus restricting their further applications. In this paper, we propose a structured and sparse partial least squares coherence algorithm (ssPLSC) to extract shared latent space representations related to cortico-muscular interactions. Our approach leverages an embedded optimization framework by integrating a partial least squares (PLS)-based objective function, a sparsity constraint and a connectivity-based structured constraint, addressing the generalizability, interpretability and spatial structure. To solve the optimization problem, we develop an efficient alternating iterative algorithm within a unified framework and prove its convergence experimentally. Extensive experimental results from one synthetic and several real-world datasets have demonstrated that ssPLSC can achieve competitive or better performance over some representative multivariate cortico-muscular fusion methods, particularly in scenarios characterized by limited sample sizes and high noise levels. This study provides a novel multivariate fusion method for cortico-muscular analysis, offering a transformative tool for the evaluation of corticospinal pathway integrity in neurological disorders.

[LG-56] Binary AddiVortes: (Bayesian) Additive Voronoi Tessellations for Binary Classification with an application to Predicting Home Mortgage Application Outcomes

链接: https://arxiv.org/abs/2503.21792
作者: Adam J. Stone,Emmanuel Ogundimu,John Paul Gosling
类目: Applications (stat.AP); Machine Learning (cs.LG); Computation (stat.CO); Methodology (stat.ME)
*备注: 13 pages, 8 figures

点击查看摘要

Abstract:The Additive Voronoi Tessellations (AddiVortes) model is a multivariate regression model that uses multiple Voronoi tessellations to partition the covariate space for an additive ensemble model. In this paper, the AddiVortes framework is extended to binary classification by incorporating a probit model with a latent variable formulation. Specifically, we utilise a data augmentation technique, where a latent variable is introduced and the binary response is determined via thresholding. In most cases, the AddiVortes model outperforms random forests, BART and other leading black-box regression models when compared using a range of metrics. A comprehensive analysis is conducted using AddiVortes to predict an individual’s likelihood of being approved for a home mortgage, based on a range of covariates. This evaluation highlights the model’s effectiveness in capturing complex relationships within the data and its potential for improving decision-making in mortgage approval processes.
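
The latent-variable augmentation behind the binary extension can be sketched directly: each binary response y_i is linked to a latent z_i ~ N(f(x_i), 1) with y_i = 1 iff z_i > 0, and z_i is resampled from the matching truncated normal (Albert-Chib style). The ensemble fit f(x_i) is replaced by a random stand-in here, so this is only an illustration of the augmentation step, not the AddiVortes sampler.

```python
import numpy as np
from scipy.stats import truncnorm

rng = np.random.default_rng(0)
f_x = rng.normal(0.0, 1.0, size=10)            # current ensemble predictions f(x_i)
y = (rng.uniform(size=10) < 0.5).astype(int)   # observed binary responses

def sample_latent(f_x, y):
    """Draw z_i ~ N(f(x_i), 1) truncated to the side implied by y_i."""
    z = np.empty_like(f_x)
    for i, (m, yi) in enumerate(zip(f_x, y)):
        if yi == 1:                             # truncate to (0, inf)
            z[i] = truncnorm.rvs(a=-m, b=np.inf, loc=m, scale=1.0)
        else:                                   # truncate to (-inf, 0]
            z[i] = truncnorm.rvs(a=-np.inf, b=-m, loc=m, scale=1.0)
    return z

z = sample_latent(f_x, y)
print(np.all((z > 0) == (y == 1)))              # True: thresholding the latents recovers y
```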

[LG-57] SeisRDT: Latent Diffusion Model Based On Representation Learning For Seismic Data Interpolation And Reconstruction

链接: https://arxiv.org/abs/2503.21791
作者: Shuang Wang,Fei Deng,Peifan Jiang,Zezheng Ni,Bin Wang
类目: Geophysics (physics.geo-ph); Machine Learning (cs.LG)
*备注: Submitted to Geophysics

点击查看摘要

Abstract:Due to limitations such as geographic, physical, or economic factors, collected seismic data often have missing traces. Traditional seismic data reconstruction methods face the challenge of selecting numerous empirical parameters and struggle to handle large-scale continuous missing traces. With the advancement of deep learning, various diffusion models have demonstrated strong reconstruction capabilities. However, these UNet-based diffusion models require significant computational resources and struggle to learn the correlation between different traces in seismic data. To address the complex and irregular missing situations in seismic data, we propose a latent diffusion transformer utilizing representation learning for seismic data reconstruction. By employing a mask modeling scheme based on representation learning, the representation module uses the token sequence of known data to infer the token sequence of unknown data, enabling the reconstructed data from the diffusion model to have a more consistent data distribution and better correlation and accuracy with the known data. We propose the Representation Diffusion Transformer architecture, and a relative positional bias is added when calculating attention, enabling the diffusion model to achieve global modeling capability for seismic data. Using a pre-trained data compression model compresses the training and inference processes of the diffusion model into a latent space, which, compared to other diffusion model-based reconstruction methods, reduces computational and inference costs. Reconstruction experiments on field and synthetic datasets indicate that our method achieves higher reconstruction accuracy than existing methods and can handle various complex missing scenarios.

[LG-58] Pharmolix-FM: An All-Atom Multi-Modal Foundation Model for Molecular Modeling and Generation

链接: https://arxiv.org/abs/2503.21788
作者: Yizhen Luo,Jiashuo Wang,Siqi Fan,Zaiqing Nie
类目: Biomolecules (q-bio.BM); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Atomic interactions lie at the heart of molecular structures and functions, serving as the foundation for a wide range of molecular interaction tasks. To establish a unified framework for modeling diverse molecular systems, we propose Pharmolix-FM, a generative model that operates at the all-atom level, enabling consistent representation and interaction modeling across different biomolecules. Pharmolix-FM integrates multiple state-of-the-art generative modeling approaches and allows for unified evaluation across multiple tasks without the need for fine-tuning. Through a systematic comparison of different generative algorithms within this framework, we provide a comprehensive analysis of their effectiveness in molecular interaction tasks. Experimental results demonstrate that Pharmolix-FM achieves superior performance across various tasks, highlighting its potential for advancing molecular interaction modeling. Our code will be publicly available at this https URL.

信息检索

[IR-0] Improving Low-Resource Retrieval Effectiveness using Zero-Shot Linguistic Similarity Transfer ECIR2025

链接: https://arxiv.org/abs/2503.22508
作者: Andreas Chari,Sean MacAvaney,Iadh Ounis
类目: Information Retrieval (cs.IR)
*备注: 12 Pages, 5 Figures, 2 Tables, Full Paper accepted at IR4GOOD track in ECIR 2025

点击查看摘要

Abstract:Globalisation and colonisation have led the vast majority of the world to use only a fraction of languages, such as English and French, to communicate, excluding many others. This has severely affected the survivability of many now-deemed vulnerable or endangered languages, such as Occitan and Sicilian. These languages often share some characteristics, such as elements of their grammar and lexicon, with other high-resource languages, e.g. French or Italian. They can be clustered into groups of language varieties with various degrees of mutual intelligibility. Current search systems are not usually trained on many of these low-resource varieties, leading search users to express their needs in a high-resource language instead. This problem is further complicated when most information content is expressed in a high-resource language, inhibiting even more retrieval in low-resource languages. We show that current search systems are not robust across language varieties, severely affecting retrieval effectiveness. Therefore, it would be desirable for these systems to leverage the capabilities of neural models to bridge the differences between these varieties. This can allow users to express their needs in their low-resource variety and retrieve the most relevant documents in a high-resource one. To address this, we propose fine-tuning neural rankers on pairs of language varieties, thereby exposing them to their linguistic similarities. We find that this approach improves the performance of the varieties upon which the models were directly trained, thereby regularising these models to generalise and perform better even on unseen language variety pairs. We also explore whether this approach can transfer across language families and observe mixed results that open doors for future research.

[IR-1] HyperMAN: Hypergraph-enhanced Meta-learning Adaptive Network for Next POI Recommendation

链接: https://arxiv.org/abs/2503.22049
作者: Jinze Wang,Tiehua Zhang,Lu Zhang,Yang Bai,Xin Li,Jiong Jin
类目: Information Retrieval (cs.IR); Social and Information Networks (cs.SI)
*备注:

点击查看摘要

Abstract:Next Point-of-Interest (POI) recommendation aims to predict users’ next locations by leveraging historical check-in sequences. Although existing methods have shown promising results, they often struggle to capture complex high-order relationships and effectively adapt to diverse user behaviors, particularly when addressing the cold-start issue. To address these challenges, we propose Hypergraph-enhanced Meta-learning Adaptive Network (HyperMAN), a novel framework that integrates heterogeneous hypergraph modeling with a difficulty-aware meta-learning mechanism for next POI recommendation. Specifically, three types of heterogeneous hyperedges are designed to capture high-order relationships: user visit behaviors at specific times (temporal behavioral hyperedge), spatial correlations among POIs (spatial functional hyperedge), and user long-term preferences (user preference hyperedge). Furthermore, a diversity-aware meta-learning mechanism is introduced to dynamically adjust learning strategies, considering users’ behavioral diversity. Extensive experiments on real-world datasets demonstrate that HyperMAN achieves superior performance, effectively addressing cold-start challenges and significantly enhancing recommendation accuracy.
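
As a rough illustration of how such hyperedges can be materialised, the snippet below builds POI-by-hyperedge incidence matrices for temporal-behaviour and user-preference hyperedges from a few invented check-ins (the spatial functional hyperedge is omitted for brevity); all names and records are hypothetical.

```python
import numpy as np

pois = ["cafe", "gym", "office", "park", "museum"]
checkins = [  # (user, poi, hour) records, invented for illustration
    ("u1", "cafe", 8), ("u1", "office", 9), ("u2", "cafe", 8),
    ("u2", "gym", 18), ("u1", "park", 18), ("u2", "museum", 14),
]

def incidence(groups):
    """Build a |POI| x |hyperedge| incidence matrix from groups of POIs."""
    H = np.zeros((len(pois), len(groups)))
    for j, members in enumerate(groups):
        for poi in members:
            H[pois.index(poi), j] = 1.0
    return H

# Temporal behavioural hyperedges: the POIs a user visits within the same hour slot.
temporal = [{p for u2, p, h2 in checkins if u2 == u and h2 == h}
            for u, h in {(u, h) for u, _, h in checkins}]
# User-preference hyperedges: all POIs in a user's history (long-term preference).
preference = [{p for u2, p, _ in checkins if u2 == u} for u in ("u1", "u2")]

H = np.hstack([incidence(temporal), incidence(preference)])
print(H.shape)   # (5, number_of_temporal_edges + 2)
```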

[IR-2] Empowering Retrieval-based Conversational Recommendation with Contrasting User Preferences NAACL2025

链接: https://arxiv.org/abs/2503.22005
作者: Heejin Kook,Junyoung Kim,Seongmin Park,Jongwuk Lee
类目: Information Retrieval (cs.IR)
*备注: NAACL 2025

点击查看摘要

Abstract:Conversational recommender systems (CRSs) are designed to suggest the target item that the user is likely to prefer through multi-turn conversations. Recent studies stress that capturing sentiments in user conversations improves recommendation accuracy. However, they employ a single user representation, which may fail to distinguish between contrasting user intentions, such as likes and dislikes, potentially leading to suboptimal performance. To this end, we propose a novel conversational recommender model, called COntrasting user pReference expAnsion and Learning (CORAL). Firstly, CORAL extracts the user’s hidden preferences through contrasting preference expansion using the reasoning capacity of the LLMs. Based on the potential preference, CORAL explicitly differentiates the contrasting preferences and leverages them into the recommendation process via preference-aware learning. Extensive experiments show that CORAL significantly outperforms existing methods in three benchmark datasets, improving up to 99.72% in Recall@10. The code and datasets are available at this https URL

附件下载

点击下载今日全部论文列表