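Today's arXiv Papers | 2025-03-13

This post lists the latest papers retrieved from arXiv.org on 2025-03-13. It is updated automatically and grouped into five broad areas: NLP, CV, ML, AI, and IR. To receive the list by scheduled email, leave your email address in the comments.

Note: The daily paper data is retrieved from arXiv.org and updated automatically at around 12:00 each day.

Reminder: If you would like to receive the daily paper data by email, leave your email address in the comments.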

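[Quick Read]: This paper addresses the limitations of traditional text chunking methods in Retrieval-Augmented Generation (RAG) systems when handling complex contextual semantics. It proposes a dual-metric evaluation method (Boundary Clarity and Chunk Stickiness) to quantify chunking quality, and shows that existing chunking techniques struggle to capture contextual nuances. To address the trade-off between computational efficiency and chunking precision in chunking methods based on large language models (LLMs), the paper designs a granularity-aware Mixture-of-Chunkers (MoC) framework whose three-stage processing mechanism guides the chunker to generate a structured list of chunking regular expressions, improving chunking quality and overall RAG performance. The key to the solution is introducing new evaluation metrics that make the challenges of chunking explicit, and using LLMs to raise chunking precision while keeping computation efficient.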

Link: https://arxiv.org/abs/2503.09600
Authors: Jihao Zhao,Zhiyuan Ji,Zhaoxin Fan,Hanyu Wang,Simin Niu,Bo Tang,Feiyu Xiong,Zhiyu Li
Affiliations: School of Information, Renmin University of China; Institute for Advanced Algorithms Research, Shanghai
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Retrieval-Augmented Generation (RAG), while serving as a viable complement to large language models (LLMs), often overlooks the crucial aspect of text chunking within its pipeline. This paper first introduces a dual-metric evaluation method, comprising Boundary Clarity and Chunk Stickiness, to enable the direct quantification of chunking quality. Leveraging this assessment method, we highlight the inherent limitations of traditional and semantic chunking in handling complex contextual nuances, thereby substantiating the necessity of integrating LLMs into the chunking process. To address the inherent trade-off between computational efficiency and chunking precision in LLM-based approaches, we devise the granularity-aware Mixture-of-Chunkers (MoC) framework, which consists of a three-stage processing mechanism. Notably, our objective is to guide the chunker towards generating a structured list of chunking regular expressions, which are subsequently employed to extract chunks from the original text. Extensive experiments demonstrate that both our proposed metrics and the MoC framework effectively settle challenges of the chunking task, revealing the chunking kernel while enhancing the performance of the RAG system.
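
The abstract says the chunker emits a list of regular expressions that are then used to extract chunks from the original text. The sketch below illustrates only that last extraction step. It is not the authors' implementation: the function name `extract_chunks`, the sample text, and the two patterns are hypothetical, and in the paper's setting the patterns would come from the chunker rather than being written by hand.

```python
import re


def extract_chunks(text, patterns):
    """Apply each regex in order; every match becomes one chunk.

    Hypothetical helper: text between two matches is dropped here,
    and any text left after the last match is kept as a final chunk.
    """
    chunks = []
    pos = 0
    for pattern in patterns:
        match = re.compile(pattern, re.DOTALL).search(text, pos)
        if match is None:
            continue  # skip rules that do not match the remaining text
        chunks.append(match.group(0).strip())
        pos = match.end()
    remainder = text[pos:].strip()
    if remainder:
        chunks.append(remainder)
    return chunks


# Hand-written stand-ins for the regexes a chunker might generate.
text = (
    "RAG pairs a retriever with a generator. It depends on good chunks. "
    "Chunking splits documents into passages. Poor splits hurt retrieval."
)
patterns = [
    r"RAG pairs.*?good chunks\.",
    r"Chunking splits.*?hurt retrieval\.",
]
chunks = extract_chunks(text, patterns)
for chunk in chunks:
    print(repr(chunk))
```

Emitting short start/end patterns instead of the full chunk text means the model writes far fewer tokens per chunk, which is one plausible reading of how the approach balances precision against compute.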

[NLP-1] How to Protect Yourself from 5G Radiation? Investigating LLM Responses to Implicit Misinformation

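[Quick Read]: This paper addresses the risk that Large Language Models (LLMs) tacitly spread implicit misinformation in real-world use, a problem that has received little attention. Existing research focuses on explicit false statements and overlooks the fact that implicit misinformation usually appears as unchallenged premises in real user interactions. To address this, the paper presents ECHOMIST, the first comprehensive benchmark for implicit misinformation, in which misinformed assumptions are embedded in the queries users pose to LLMs. It also introduces a new evaluation metric that measures whether LLMs recognize and counter false information rather than amplify users' misconceptions. Extensive experiments show that current mainstream LLMs (including GPT-4, Claude, and Llama) perform alarmingly poorly on this task, often failing to detect false premises and generating misleading explanations. The findings underscore the need for more LLM safety research on implicit misinformation.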

Link: https://arxiv.org/abs/2503.09598
Authors: Ruohao Guo,Wei Xu,Alan Ritter
Affiliations: Georgia Institute of Technology
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments:

Abstract:As Large Language Models (LLMs) are widely deployed in diverse scenarios, the extent to which they could tacitly spread misinformation emerges as a critical safety concern. Current research primarily evaluates LLMs on explicit false statements, overlooking how misinformation often manifests subtly as unchallenged premises in real-world user interactions. We curated ECHOMIST, the first comprehensive benchmark for implicit misinformation, where the misinformed assumptions are embedded in a user query to LLMs. ECHOMIST is based on rigorous selection criteria and carefully curated data from diverse sources, including real-world human-AI conversations and social media interactions. We also introduce a new evaluation metric to measure whether LLMs can recognize and counter false information rather than amplify users’ misconceptions. Through an extensive empirical study on a wide range of LLMs, including GPT-4, Claude, and Llama, we find that current models perform alarmingly poorly on this task, often failing to detect false premises and generating misleading explanations. Our findings underscore the critical need for an increased focus on implicit misinformation in LLM safety research.

[NLP-2] Cost-Optimal Grouped-Query Attention for Long-Context LLMs

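[Quick Read]: This paper asks how to maximize a language model's capabilities while minimizing its training and deployment costs. Existing work mainly describes the complex relationships among model performance, parameter size, and data size and searches for the optimal compute allocation, but it overlooks how context length and attention head configuration (the number of query and key-value heads in grouped-query attention) affect training and inference. The paper systematically compares model performance, computational cost, and memory cost across different parameter sizes, context lengths, and attention head configurations, and extends existing scaling methods, which are based on parameter size and training compute, to guide the construction of cost-optimal LLMs for both training and inference. The study shows that, when processing sufficiently long sequences, a larger model with fewer attention heads can achieve a lower loss while incurring lower computational and memory costs.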

Link: https://arxiv.org/abs/2503.09579
Authors: Yingfa Chen,Yutong Wu,Xu Han,Zhiyuan Liu,Maosong Sun
Affiliations: NLP Group, DCST, IAI, BNRIST, Tsinghua University; SIST, University of Science and Technology Beijing
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 16 pages, 17 figures

Abstract:Building effective and efficient Transformer-based large language models (LLMs) has recently become a research focus, requiring maximizing model language capabilities and minimizing training and deployment costs. Existing efforts have primarily described complex relationships among model performance, parameter size, and data size, as well as searched for the optimal compute allocation to train LLMs. However, they overlook the impacts of context length and attention head configuration (the number of query and key-value heads in grouped-query attention) on training and inference. In this paper, we systematically compare models with different parameter sizes, context lengths, and attention head configurations in terms of model performance, computational cost, and memory cost. Then, we extend the existing scaling methods, which are based solely on parameter size and training compute, to guide the construction of cost-optimal LLMs during both training and inference. Our quantitative scaling studies show that, when processing sufficiently long sequences, a larger model with fewer attention heads can achieve a lower loss while incurring lower computational and memory costs. Our findings provide valuable insights for developing practical LLMs, especially in long-context processing scenarios. We will publicly release our code and data.
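
To make the memory side of this trade-off concrete, the sketch below computes the key-value cache size of a decoder for different numbers of key-value heads. It uses the standard accounting (two cached tensors, K and V, per layer). The layer count, head dimension, sequence length, and 16-bit precision are illustrative assumptions, not configurations taken from the paper.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Bytes needed to cache keys and values for ONE sequence.

    The factor 2 accounts for storing both K and V in every layer.
    bytes_per_value=2 assumes 16-bit storage (an assumption of this sketch).
    """
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_value


GIB = 2 ** 30

# Hypothetical model shape: 32 layers, head dimension 128, 32k-token context.
shape = dict(num_layers=32, head_dim=128, seq_len=32768)

mha = kv_cache_bytes(num_kv_heads=32, **shape)  # one KV head per query head
gqa = kv_cache_bytes(num_kv_heads=8, **shape)   # 4 query heads share each KV head

print(f"32 KV heads: {mha / GIB:.1f} GiB per sequence")
print(f" 8 KV heads: {gqa / GIB:.1f} GiB per sequence")
```

The cache grows linearly with both sequence length and the number of key-value heads, which is why the head configuration matters more as contexts get longer.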

[NLP-3] Plan-and-Act: Improving Planning of Agents for Long-Horizon Tasks

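[Quick Read]: This paper addresses the difficulty LLMs have with complex, multi-step, long-horizon tasks, in particular generating accurate plans, since LLMs are not inherently trained for this. The key solution is Plan-and-Act, a framework that incorporates explicit planning into LLM-based agents and scales plan generation through a novel synthetic data generation method. Plan-and-Act consists of a Planner model, which generates structured high-level plans to achieve user goals, and an Executor model, which translates those plans into environment-specific actions. To train the Planner effectively, the authors introduce a synthetic data generation method that annotates ground-truth trajectories with feasible plans and augments them with diverse and extensive examples to improve generalization. In evaluation, the method achieves a state-of-the-art 54% success rate on the WebArena-Lite benchmark.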

Link: https://arxiv.org/abs/2503.09572
Authors: Lutfi Eren Erdogan,Nicholas Lee,Sehoon Kim,Suhong Moon,Hiroki Furuta,Gopala Anumanchipalli,Kurt Keutzer,Amir Gholami
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Large language models (LLMs) have shown remarkable advancements in enabling language agents to tackle simple tasks. However, applying them for complex, multi-step, long-horizon tasks remains a challenge. Recent work has found success by separating high-level planning from low-level execution, which enables the model to effectively balance high-level planning objectives and low-level execution details. However, generating accurate plans remains difficult since LLMs are not inherently trained for this task. To address this, we propose Plan-and-Act, a novel framework that incorporates explicit planning into LLM-based agents and introduces a scalable method to enhance plan generation through a novel synthetic data generation method. Plan-and-Act consists of a Planner model which generates structured, high-level plans to achieve user goals, and an Executor model that translates these plans into environment-specific actions. To train the Planner effectively, we introduce a synthetic data generation method that annotates ground-truth trajectories with feasible plans, augmented with diverse and extensive examples to enhance generalization. We evaluate Plan-and-Act using web navigation as a representative long-horizon planning environment, demonstrating a state-of-the-art 54% success rate on the WebArena-Lite benchmark.

[NLP-4] Towards Reasoning Era: A Survey of Long Chain-of-Thought for Reasoning Large Language Models

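[Quick Read]: This paper addresses the lack of a comprehensive survey of long chain-of-thought (Long CoT) in LLM reasoning, a gap that limits understanding of how it differs from traditional short chain-of-thought (Short CoT) and complicates debates about "overthinking" and "test-time scaling". The key contribution is a unified perspective on Long CoT: (1) distinguishing Long CoT from Short CoT and introducing a new taxonomy; (2) analyzing the key characteristics of Long CoT, namely deep reasoning, extensive exploration, and feasible reflection; (3) investigating related phenomena such as overthinking and test-time scaling; and (4) identifying research gaps and proposing future directions. These efforts aim to further the development of logical reasoning in artificial intelligence.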

Link: https://arxiv.org/abs/2503.09567
Authors: Qiguang Chen,Libo Qin,Jinhao Liu,Dengyun Peng,Jiannan Guan,Peng Wang,Mengkang Hu,Yuhang Zhou,Te Gao,Wangxiang Che
Affiliations: Research Center for Social Computing and Interactive Robotics, Harbin Institute of Technology; School of Computer Science and Engineering, Central South University; The University of Hong Kong; Fudan University
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Paper is available at this https URL

Abstract:Recent advancements in reasoning with large language models (RLLMs), such as OpenAI-O1 and DeepSeek-R1, have demonstrated their impressive capabilities in complex domains like mathematics and coding. A central factor in their success lies in the application of long chain-of-thought (Long CoT) characteristics, which enhance reasoning abilities and enable the solution of intricate problems. However, despite these developments, a comprehensive survey on Long CoT is still lacking, limiting our understanding of its distinctions from traditional short chain-of-thought (Short CoT) and complicating ongoing debates on issues like “overthinking” and “test-time scaling.” This survey seeks to fill this gap by offering a unified perspective on Long CoT. (1) We first distinguish Long CoT from Short CoT and introduce a novel taxonomy to categorize current reasoning paradigms. (2) Next, we explore the key characteristics of Long CoT: deep reasoning, extensive exploration, and feasible reflection, which enable models to handle more complex tasks and produce more efficient, coherent outcomes compared to the shallower Short CoT. (3) We then investigate key phenomena such as the emergence of Long CoT with these characteristics, including overthinking, and test-time scaling, offering insights into how these processes manifest in practice. (4) Finally, we identify significant research gaps and highlight promising future directions, including the integration of multi-modal reasoning, efficiency improvements, and enhanced knowledge frameworks. By providing a structured overview, this survey aims to inspire future research and further the development of logical reasoning in artificial intelligence.

[NLP-5] PolyPythias: Stability and Outliers across Fifty Language Model Pre-Training Runs ICLR2025

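[Quick Read]: This paper addresses the understudied stability of language model pre-training and its effect on downstream performance. The key contribution is PolyPythias, a set of 45 new training runs for the Pythia model suite under different initial conditions (seeds), covering 5 model sizes from 14M to 410M parameters and adding about 7,000 new checkpoints. Using the new runs together with the existing ones, the authors study how the initial conditions determined by the seed (parameter initialisation and data order) affect downstream performance, learned linguistic representations, and the emergence of training phases. The new seeds for each model also make it possible to identify outlier training runs and characterize them. The findings show the potential of these methods for predicting training stability.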

Link: https://arxiv.org/abs/2503.09543
Authors: Oskar van der Wal,Pietro Lesci,Max Muller-Eberstein,Naomi Saphra,Hailey Schoelkopf,Willem Zuidema,Stella Biderman
Affiliations: University of Amsterdam; University of Cambridge; IT University of Copenhagen; Harvard University; Anthropic; EleutherAI
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Published as a conference paper at ICLR 2025

Abstract:The stability of language model pre-training and its effects on downstream performance are still understudied. Prior work shows that the training process can yield significantly different results in response to slight variations in initial conditions, e.g., the random seed. Crucially, the research community still lacks sufficient resources and tools to systematically investigate pre-training stability, particularly for decoder-only language models. We introduce the PolyPythias, a set of 45 new training runs for the Pythia model suite: 9 new seeds across 5 model sizes, from 14M to 410M parameters, resulting in about 7k new checkpoints that we release. Using these new 45 training runs, in addition to the 5 already available, we study the effects of different initial conditions determined by the seed – i.e., parameters’ initialisation and data order – on (i) downstream performance, (ii) learned linguistic representations, and (iii) emergence of training phases. In addition to common scaling behaviours, our analyses generally reveal highly consistent training dynamics across both model sizes and initial conditions. Further, the new seeds for each model allow us to identify outlier training runs and delineate their characteristics. Our findings show the potential of using these methods to predict training stability.

[NLP-6] SAEBench: A Comprehensive Benchmark for Sparse Autoencoders in Language Model Interpretability

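[Quick Read]: This paper asks how to evaluate the practical performance of Sparse Autoencoders (SAEs) more comprehensively and in a standardized way, and exposes the limitations of existing evaluation methods. Most work on improving SAE effectiveness relies on unsupervised proxy metrics whose practical relevance is unclear. The key solution is SAEBench, a comprehensive evaluation suite that measures SAE performance across seven diverse metrics spanning interpretability, feature disentanglement, and practical applications such as unlearning. The authors also open-source a suite of over 200 SAEs across eight recently proposed SAE architectures and training algorithms to support systematic comparison. The standardized framework enables quantitative measurement of progress in SAE development and reveals that gains on proxy metrics do not reliably translate to better practical performance, supporting nuanced comparisons between SAE architectures and training methods.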

Link: https://arxiv.org/abs/2503.09532
Authors: Adam Karvonen,Can Rager,Johnny Lin,Curt Tigges,Joseph Bloom,David Chanin,Yeu-Tong Lau,Eoin Farrell,Callum McDougall,Kola Ayonrinde,Matthew Wearden,Arthur Conmy,Samuel Marks,Neel Nanda
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments:

Abstract:Sparse autoencoders (SAEs) are a popular technique for interpreting language model activations, and there is extensive recent work on improving SAE effectiveness. However, most prior work evaluates progress using unsupervised proxy metrics with unclear practical relevance. We introduce SAEBench, a comprehensive evaluation suite that measures SAE performance across seven diverse metrics, spanning interpretability, feature disentanglement and practical applications like unlearning. To enable systematic comparison, we open-source a suite of over 200 SAEs across eight recently proposed SAE architectures and training algorithms. Our evaluation reveals that gains on proxy metrics do not reliably translate to better practical performance. For instance, while Matryoshka SAEs slightly underperform on existing proxy metrics, they substantially outperform other architectures on feature disentanglement metrics; moreover, this advantage grows with SAE scale. By providing a standardized framework for measuring progress in SAE development, SAEBench enables researchers to study scaling trends and make nuanced comparisons between different SAE architectures and training methodologies. Our interactive interface enables researchers to flexibly visualize relationships between metrics across hundreds of open-source SAEs at: this https URL

[NLP-7] Search-R1: Training LLMs to Reason and Leverage Search Engines with Reinforcement Learning

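[Quick Read]: This paper addresses how large language models (LLMs) can efficiently acquire external knowledge and up-to-date information during reasoning and text generation. Current retrieval augmentation and tool-use training approaches either lack flexibility for multi-turn retrieval or require large-scale supervised data, and prompting reasoning-capable LLMs to use search engines at inference time is not optimal either, because the LLM never learns how to interact with the search engine optimally.

The key solution is Search-R1, an extension of the DeepSeek-R1 model. Through reinforcement learning (RL) alone, Search-R1 lets the LLM autonomously generate (multiple) search queries during step-by-step reasoning with real-time retrieval. The method optimizes LLM rollouts with multi-turn search interactions, uses retrieved token masking to keep RL training stable, and guides learning with a simple outcome-based reward function. Experiments on seven question-answering datasets show that Search-R1 improves performance over state-of-the-art baselines by 26% (Qwen2.5-7B), 21% (Qwen2.5-3B), and 10% (LLaMA3.2-3B). The paper also provides empirical insights into RL optimization methods, LLM choices, and response length dynamics in retrieval-augmented reasoning. The code and model checkpoints are publicly available.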

Link: https://arxiv.org/abs/2503.09516
Authors: Bowen Jin,Hansi Zeng,Zhenrui Yue,Dong Wang,Hamed Zamani,Jiawei Han
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments: 16 pages

Abstract:Efficiently acquiring external knowledge and up-to-date information is essential for effective reasoning and text generation in large language models (LLMs). Retrieval augmentation and tool-use training approaches where a search engine is treated as a tool lack complex multi-turn retrieval flexibility or require large-scale supervised data. Prompting advanced LLMs with reasoning capabilities during inference to use search engines is not optimal, since the LLM does not learn how to optimally interact with the search engine. This paper introduces Search-R1, an extension of the DeepSeek-R1 model where the LLM learns – solely through reinforcement learning (RL) – to autonomously generate (multiple) search queries during step-by-step reasoning with real-time retrieval. Search-R1 optimizes LLM rollouts with multi-turn search interactions, leveraging retrieved token masking for stable RL training and a simple outcome-based reward function. Experiments on seven question-answering datasets show that Search-R1 improves performance by 26% (Qwen2.5-7B), 21% (Qwen2.5-3B), and 10% (LLaMA3.2-3B) over SOTA baselines. This paper further provides empirical insights into RL optimization methods, LLM choices, and response length dynamics in retrieval-augmented reasoning. The code and model checkpoints are available at this https URL.
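
The abstract mentions "retrieved token masking" without detail. One common reading is that tokens copied in from the search engine are excluded from the training loss, so the policy is only updated on tokens it generated itself. The sketch below illustrates that idea on a toy rollout. The tag names `<information>` and `</information>`, the helper names, and the token list are assumptions of this sketch, not details confirmed by the abstract.

```python
def build_loss_mask(tokens, open_tag="<information>", close_tag="</information>"):
    """Return 1 for model-generated tokens and 0 for retrieved ones.

    Everything between open_tag and close_tag (tags included) is treated
    as retrieved text. The tag names are hypothetical.
    """
    mask = []
    inside = False
    for tok in tokens:
        if tok == open_tag:
            inside = True
            mask.append(0)
        elif tok == close_tag:
            mask.append(0)
            inside = False
        else:
            mask.append(0 if inside else 1)
    return mask


def masked_mean(values, mask):
    """Average per-token values over the unmasked positions only."""
    kept = [v for v, m in zip(values, mask) if m]
    return sum(kept) / len(kept) if kept else 0.0


# Toy rollout: reasoning, a search query, retrieved text, then an answer.
tokens = [
    "<think>", "need", "capital", "</think>",
    "<search>", "capital of France", "</search>",
    "<information>", "Paris", "is", "the", "capital", "</information>",
    "<answer>", "Paris", "</answer>",
]
mask = build_loss_mask(tokens)
print(mask)
```

In a real training loop the same mask would multiply the per-token policy-gradient loss before averaging, which is what `masked_mean` stands in for here.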

[NLP-8] Reinforcement Learning is all You Need

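[Quick Read]: This paper asks how pure reinforcement learning (RL), without human feedback, can improve the reasoning ability of language models and their generalization. Inspired by DeepSeek R1, the author trains a 3B-parameter language model with pure RL using the Countdown Game as the training environment, and examines the relationship between response length and reasoning quality as well as the effect of "aha moments" on accuracy. Response length does not correlate with reasoning quality, and although "aha moments" emerge, they do not always yield correct answers, which suggests future work on refining reward structures to bridge emergent insights with accuracy.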

Link: https://arxiv.org/abs/2503.09512
Authors: Yongsheng Lian
Affiliations: University of Louisville
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: 15 pages, 2 figures

Abstract:Inspired by the success of DeepSeek R1 in reasoning via reinforcement learning without human feedback, we train a 3B language model using the Countdown Game with pure reinforcement learning. Our model outperforms baselines on four of five benchmarks, demonstrating improved generalization beyond its training data. Notably, response length does not correlate with reasoning quality, and while “aha moments” emerge, they do not always yield correct answers. These findings highlight the potential of RL-only training for reasoning enhancement and suggest future work on refining reward structures to bridge emergent insights with accuracy.
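
The abstract does not describe the reward used for the Countdown Game, so the following is only a hypothetical outcome-based reward of the kind such a setup could use: the model proposes an arithmetic expression, and it earns 1.0 if the expression uses only the given numbers (each at most once) and evaluates to the target. The function names and the exact rules are assumptions of this sketch.

```python
import ast
import operator
from collections import Counter

# Only the four basic arithmetic operators are allowed.
OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}


def _evaluate(node, used):
    """Evaluate a restricted expression tree, recording each integer used."""
    if (isinstance(node, ast.Constant) and isinstance(node.value, int)
            and not isinstance(node.value, bool)):
        used.append(node.value)
        return node.value
    if isinstance(node, ast.BinOp) and type(node.op) in OPS:
        left = _evaluate(node.left, used)
        right = _evaluate(node.right, used)
        return OPS[type(node.op)](left, right)
    raise ValueError("unsupported expression")


def countdown_reward(expression, numbers, target):
    """Hypothetical reward: 1.0 for a valid expression hitting the target, else 0.0."""
    used = []
    try:
        tree = ast.parse(expression, mode="eval")
        value = _evaluate(tree.body, used)
    except (SyntaxError, ValueError, ZeroDivisionError):
        return 0.0  # malformed or disallowed expression
    if Counter(used) - Counter(numbers):
        return 0.0  # used a number that was not available (or used it too often)
    return 1.0 if abs(value - target) < 1e-9 else 0.0


print(countdown_reward("(25 - 5) * 4", [25, 5, 4, 3], 80))
print(countdown_reward("25 * 4", [25, 5, 4, 3], 80))
```

Parsing with `ast` rather than calling `eval` keeps the check restricted to plain arithmetic, which matters when the expression comes from model output.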

[NLP-9] TRACE: Real-Time Multimodal Common Ground Tracking in Situated Collaborative Dialogues NAACL2025

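[Quick Read]: This paper addresses live common ground tracking in situated collaborative tasks. In multimodal collaborative settings, changes in common ground directly affect task efficiency and success. The TRACE system integrates multimodal inputs (speech, actions, gestures, and visual attention) to determine in real time the task-relevant propositions raised as the dialogue progresses, and continuously tracks the group's epistemic position and beliefs toward them. The key to the solution is fast, real-time processing of multimodal data together with accurate modeling and tracking of the shared state of task-relevant knowledge, which makes TRACE valuable for multiparty, multimodal interaction.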

Link: https://arxiv.org/abs/2503.09511
Authors: Hannah VanderHoeven,Brady Bhalla,Ibrahim Khebour,Austin Youngren,Videep Venkatesha,Mariah Bradford,Jack Fitzgerald,Carlos Mabrey,Jingxuan Tu,Yifan Zhu,Kenneth Lai,Changsoo Jung,James Pustejovsky,Nikhil Krishnaswamy
Affiliations: Colorado State University; California Inst. of Technology; Brandeis University
Categories: Computation and Language (cs.CL)
Comments: 11 pages, 4 tables, 4 figures, to appear at NAACL 2025 Demos program, Albuquerque, NM, USA

Abstract:We present TRACE, a novel system for live common ground tracking in situated collaborative tasks. With a focus on fast, real-time performance, TRACE tracks the speech, actions, gestures, and visual attention of participants, uses these multimodal inputs to determine the set of task-relevant propositions that have been raised as the dialogue progresses, and tracks the group’s epistemic position and beliefs toward them as the task unfolds. Amid increased interest in AI systems that can mediate collaborations, TRACE represents an important step forward for agents that can engage with multiparty, multimodal discourse.

[NLP-10] ReMA: Learning to Meta-think for LLMs with Multi-Agent Reinforcement Learning

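[Quick Read]: This paper addresses the fact that existing approaches to enhancing LLM reasoning lack a specialized design for acquiring meta-thinking, which limits their effectiveness. It proposes Reinforced Meta-thinking Agents (ReMA), a framework that uses Multi-Agent Reinforcement Learning (MARL) to elicit meta-thinking behaviors, encouraging LLMs to "think about thinking". ReMA decouples the reasoning process into two hierarchical agents: a high-level meta-thinking agent that generates strategic oversight and plans, and a low-level reasoning agent that carries out the detailed execution. Through iterative reinforcement learning with aligned objectives, the two agents explore and learn to collaborate, which improves generalization and robustness. Experiments show that ReMA outperforms single-agent RL baselines on complex reasoning tasks, including competitive-level mathematical benchmarks and LLM-as-a-Judge benchmarks.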

Link: https://arxiv.org/abs/2503.09501
Authors: Ziyu Wan,Yunxiang Li,Yan Song,Hanjing Wang,Linyi Yang,Mark Schmidt,Jun Wang,Weinan Zhang,Shuyue Hu,Ying Wen
Affiliations: Shanghai Jiao Tong University; University of British Columbia; University College London; Shanghai Artificial Intelligence Laboratory
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Comments:

Abstract:Recent research on Reasoning of Large Language Models (LLMs) has sought to further enhance their performance by integrating meta-thinking – enabling models to monitor, evaluate, and control their reasoning processes for more adaptive and effective problem-solving. However, current single-agent work lacks a specialized design for acquiring meta-thinking, resulting in low efficacy. To address this challenge, we introduce Reinforced Meta-thinking Agents (ReMA), a novel framework that leverages Multi-Agent Reinforcement Learning (MARL) to elicit meta-thinking behaviors, encouraging LLMs to think about thinking. ReMA decouples the reasoning process into two hierarchical agents: a high-level meta-thinking agent responsible for generating strategic oversight and plans, and a low-level reasoning agent for detailed executions. Through iterative reinforcement learning with aligned objectives, these agents explore and learn collaboration, leading to improved generalization and robustness. Experimental results demonstrate that ReMA outperforms single-agent RL baselines on complex reasoning tasks, including competitive-level mathematical benchmarks and LLM-as-a-Judge benchmarks. Comprehensive ablation studies further illustrate the evolving dynamics of each distinct agent, providing valuable insights into how the meta-thinking reasoning process enhances the reasoning capabilities of LLMs.

[NLP-11] MindGYM: Enhancing Vision-Language Models via Synthetic Self-Challenging Questions

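[Quick Read]: This paper addresses the challenges that large vision-language models (VLMs) face in achieving robust, transferable reasoning, which stem from reliance on labor-intensive manually built instruction datasets or computationally expensive self-supervised methods. It proposes MindGYM, a framework that enhances VLMs through synthetic self-challenging questions in three stages: (1) Seed Single-Hop Question Synthesis, which generates cognitive questions across textual contexts (e.g., logical deduction) and multimodal contexts (e.g., diagram-based queries) spanning eight semantic areas; (2) Challenging Multi-Hop Question Synthesis, which combines seed questions via diverse principles such as bridging and visual-textual alignment to build multi-step problems that demand deeper reasoning; and (3) Thinking-Induced Curriculum Fine-Tuning, a structured pipeline that progressively trains the model from scaffolded reasoning to standalone inference. By leveraging the model's self-synthesis capability, MindGYM achieves data efficiency, computational efficiency, and robust generalization across tasks, improves reasoning depth and breadth, and reduces human intervention and resource demands.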

Link: https://arxiv.org/abs/2503.09499
Authors: Zhe Xu,Daoyuan Chen,Zhenqing Ling,Yaliang Li,Ying Shen
Affiliations: Sun Yat-Sen University; Alibaba Group
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 16 pages

Abstract:Large vision-language models (VLMs) face challenges in achieving robust, transferable reasoning abilities due to reliance on labor-intensive manual instruction datasets or computationally expensive self-supervised methods. To address these issues, we introduce MindGYM, a framework that enhances VLMs through synthetic self-challenging questions, consisting of three stages: (1) Seed Single-Hop Question Synthesis, generating cognitive questions across textual (e.g., logical deduction) and multimodal contexts (e.g., diagram-based queries) spanning eight semantic areas like ethical analysis; (2) Challenging Multi-Hop Question Synthesis, combining seed questions via diverse principles like bridging, visual-textual alignment, to create multi-step problems demanding deeper reasoning; and (3) Thinking-Induced Curriculum Fine-Tuning, a structured pipeline that progressively trains the model from scaffolded reasoning to standalone inference. By leveraging the model’s self-synthesis capability, MindGYM achieves high data efficiency (e.g., +16% gains on MathVision-Mini with only 400 samples), computational efficiency (reducing both training and inference costs), and robust generalization across tasks. Extensive evaluations on seven benchmarks demonstrate superior performance over strong baselines, with notable improvements (+15.77% win rates) in reasoning depth and breadth validated via GPT-based scoring. MindGYM underscores the viability of self-challenging for refining VLM capabilities while minimizing human intervention and resource demands. Code and data are released to advance multimodal reasoning research.

[NLP-12] BAMBI: Developing Baby Language Models for Italian

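[Quick Read]: This paper asks how to develop models with strong language abilities from limited data and compute, and studies the contribution of multimodal information to language acquisition. It presents BAMBI, a series of Baby Language Models trained on data that mimic the linguistic input received by a five-year-old Italian-speaking child, and evaluates them with a purpose-built benchmark. The models are compared against a large language model (LLM) and a multimodal language model (VLM). Despite the LLMs' far greater training resources, their performance is not much better than that of the BAMBI models. This suggests that strategies beyond scaling training resources, such as data curation, inclusion of multimodal input, and curriculum learning, could play a crucial role in model performance.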

Link: https://arxiv.org/abs/2503.09481
Authors: Alice Suozzi,Luca Capone,Gianluca E. Lebani,Alessandro Lenci
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: 20 pages, 2 figures

Abstract:This paper presents BAMBI (BAby language Models Bootstrapped for Italian), a series of Baby Language Models (BabyLMs) trained on data that mimic the linguistic input received by a five-year-old Italian-speaking child. The BAMBI models are tested using a benchmark specifically designed to evaluate language models, which takes into account the amount of training input the models received. The BAMBI models are compared against a large language model (LLM) and a multimodal language model (VLM) to study the contribution of extralinguistic information for language acquisition. The results of our evaluation align with the existing literature on English language models, confirming that while reduced training data support the development of relatively robust syntactic competence, they are insufficient for fostering semantic understanding. However, the gap between the training resources (data and computation) of the BAMBI models and the LLMs is not fully reflected in their performance: despite LLMs’ massive training, their performance is not much better than that of BAMBI models. This suggests that strategies beyond scaling training resources, such as data curation, inclusion of multimodal input, and other training strategies such as curriculum learning, could play a crucial role in shaping model performance.

[NLP-13] Explicit Learning and the LLM in Machine Translation

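[Quick Read]: This paper examines the limits of large language models (LLMs) in explicit learning, in particular their ability to understand and apply complex linguistic phenomena. Using constructed languages generated by cryptographic means as controlled test environments, the authors design experiments to assess an LLM's ability to explicitly learn and apply grammar rules. Supervised fine-tuning on chains of thought significantly improves LLM performance, but it still struggles to generalize to typologically novel or more complex linguistic features. The study concludes that improving explicit learning by LLMs requires more diverse training sets and alternative fine-tuning strategies.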

Link: https://arxiv.org/abs/2503.09454
Authors: Malik Marmonier,Rachel Bawden,Benoît Sagot
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments:

Abstract:This study explores the capacity of large language models (LLMs) for explicit learning, a process involving the assimilation of metalinguistic explanations to carry out language tasks. Using constructed languages generated by cryptographic means as controlled test environments, we designed experiments to assess an LLM’s ability to explicitly learn and apply grammar rules. Our results demonstrate that while LLMs possess a measurable capacity for explicit learning, this ability diminishes as the complexity of the linguistic phenomena at hand increases. Supervised fine-tuning on chains of thought significantly enhances LLM performance but struggles to generalize to typologically novel or more complex linguistic features. These findings point to the need for more diverse training sets and alternative fine-tuning strategies to further improve explicit learning by LLMs.

[NLP-14] Florenz: Scaling Laws for Systematic Generalization in Vision-Language Models

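[Quick Read]: This paper addresses the performance cost that multilingual capability imposes on existing vision-language models (VLMs) in cross-lingual transfer. Current approaches rely on large pre-trained multilingual language models, which face the "curse of multilinguality": they sacrifice downstream task performance for multilingual capabilities, struggle with lexical ambiguities, and fall behind recent advances. To address this, the paper studies the scaling laws of systematic generalization with monolingual VLMs for multilingual tasks, focusing on the impact of model size and the number of seen training samples.

The key to the solution is Florenz, a monolingual encoder-decoder VLM with 0.4B to 11.2B parameters that combines the pre-trained VLM Florence-2 with the large language model Gemma-2. Florenz is trained on a synthetic dataset with intentionally incomplete language coverage for image captioning, which tests generalization from the fully covered translation task. Experiments show that unseen task-language pairs can be learned indirectly: image captioning ability can emerge in a specific language even when only translation data is available for it. After fine-tuning on a mix of downstream datasets, Florenz achieves competitive performance and shows promising scaling trends in multimodal machine translation (Multi30K, CoMMuTE), lexical disambiguation (CoMMuTE), and image captioning (Multi30K, XM3600, COCO Karpathy).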

Link: https://arxiv.org/abs/2503.09443
Authors: Julian Spravil,Sebastian Houben,Sven Behnke
Affiliations: Fraunhofer IAIS, Germany; University of Applied Sciences Bonn-Rhein-Sieg, Germany; University of Bonn, Computer Science Institute VI, Center for Robotics, Germany; Lamarr Institute for Machine Learning and Artificial Intelligence, Germany
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Abstract:Cross-lingual transfer enables vision-language models (VLMs) to perform vision tasks in various languages with training data only in one language. Current approaches rely on large pre-trained multilingual language models. However, they face the curse of multilinguality, sacrificing downstream task performance for multilingual capabilities, struggling with lexical ambiguities, and falling behind recent advances. In this work, we study the scaling laws of systematic generalization with monolingual VLMs for multilingual tasks, focusing on the impact of model size and seen training samples. We propose Florenz, a monolingual encoder-decoder VLM with 0.4B to 11.2B parameters combining the pre-trained VLM Florence-2 and the large language model Gemma-2. Florenz is trained with varying compute budgets on a synthetic dataset that features intentionally incomplete language coverage for image captioning, thus, testing generalization from the fully covered translation task. We show that not only does indirectly learning unseen task-language pairs adhere to a scaling law, but also that with our data generation pipeline and the proposed Florenz model family, image captioning abilities can emerge in a specific language even when only data for the translation task is available. Fine-tuning on a mix of downstream datasets yields competitive performance and demonstrates promising scaling trends in multimodal machine translation (Multi30K, CoMMuTE), lexical disambiguation (CoMMuTE), and image captioning (Multi30K, XM3600, COCO Karpathy).

[NLP-15] Towards Generating Automatic Anaphora Annotations

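[Quick Read]: This paper addresses the high cost of high-quality annotated data for NLP tasks, especially nuanced ones such as anaphora and coreference resolution. It explores two methods for automatically creating datasets with coreferential annotations: direct conversion from existing datasets, and parsing with multilingual models that can handle new and unseen languages. The paper reports current progress on both fronts, the challenges they face, and the authors' approach to overcoming them.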

Link: https://arxiv.org/abs/2503.09417
Authors: Dima Taji,Daniel Zeman
Affiliations: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (ÚFAL)
Categories: Computation and Language (cs.CL)
Comments: 6 pages, 0 figures, 2 tables

Abstract:Training models that can perform well on various NLP tasks requires large amounts of data, and this becomes more apparent with nuanced tasks such as anaphora and coreference resolution. To combat the prohibitive costs of creating manual gold annotated data, this paper explores two methods to automatically create datasets with coreferential annotations: direct conversion from existing datasets, and parsing using multilingual models capable of handling new and unseen languages. The paper details the current progress on those two fronts, as well as the challenges the efforts currently face, and our approach to overcoming these challenges.

[NLP-16] Got Compute but No Data: Lessons From Post-training a Finnish LLM

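[Quick read]: This paper asks how to give large language models (LLMs) effective instruction-following ability in a low-resource language. The key to the solution is to use a multilingual LLM to translate instruction and preference data from a high-resource language (English) into a low-resource one (Finnish), and then to perform instruction tuning and preference optimization in both languages. The study finds that a few hundred Finnish instruction samples are enough for competitive performance on Finnish instruction following, and that the best results come from combining preference data from both languages. The authors also release their model, datasets, and recipes under open licenses.

Link: https://arxiv.org/abs/2503.09407
Authors: Elaine Zosa, Ville Komulainen, Sampo Pyysalo
Affiliations: Silo AI; University of Turku
Categories: Computation and Language (cs.CL)
Comments: 7 pages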

Abstract:As LLMs gain more popularity as chatbots and general assistants, methods have been developed to enable LLMs to follow instructions and align with human preferences. These methods have found success in the field, but their effectiveness has not been demonstrated outside of high-resource languages. In this work, we discuss our experiences in post-training an LLM for instruction-following for English and Finnish. We use a multilingual LLM to translate instruction and preference datasets from English to Finnish. We perform instruction tuning and preference optimization in English and Finnish and evaluate the instruction-following capabilities of the model in both languages. Our results show that with a few hundred Finnish instruction samples we can obtain competitive performance in Finnish instruction-following. We also found that although preference optimization in English offers some cross-lingual benefits, we obtain our best results by using preference data from both languages. We release our model, datasets, and recipes under open licenses at this https URL

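[NLP-17] RetSTA: An LLM-Based Approach for Standardizing Clinical Fundus Image Reports

[Quick read]: This paper addresses the lack of unified standards (format, terminology, and style) in clinical fundus diagnostic reports, which makes the data harder for large language models (LLMs) to understand. The authors construct a bilingual standard terminology and develop two models, RetSTA-7B-Zero and RetSTA-7B. RetSTA-7B-Zero is fine-tuned on an augmented dataset that simulates clinical scenarios and shows strong standardization ability, but it covers a limited range of diseases. RetSTA-7B's key innovation is to integrate a large amount of standardized data generated by RetSTA-7B-Zero together with the corresponding English data, achieving report-level standardization and markedly better performance across scenarios. Experiments confirm that RetSTA-7B outperforms the compared LLMs on the bilingual standardization task and generalizes well.

Link: https://arxiv.org/abs/2503.09358
Authors: Jiushen Cai, Weihang Zhang, Hanruo Liu, Ningli Wang, Huiqi Li
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: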

Abstract:Standardization of clinical reports is crucial for improving the quality of healthcare and facilitating data integration. The lack of unified standards, including format, terminology, and style, is a great challenge in clinical fundus diagnostic reports, which increases the difficulty for large language models (LLMs) to understand the data. To address this, we construct a bilingual standard terminology, containing fundus clinical terms and commonly used descriptions in clinical diagnosis. Then, we establish two models, RetSTA-7B-Zero and RetSTA-7B. RetSTA-7B-Zero, fine-tuned on an augmented dataset simulating clinical scenarios, demonstrates powerful standardization behaviors. However, it is limited in the range of diseases it can cover. To further enhance standardization performance, we build RetSTA-7B, which integrates a substantial amount of standardized data generated by RetSTA-7B-Zero along with corresponding English data, covering diverse complex clinical scenarios and achieving report-level standardization for the first time. Experimental results demonstrate that RetSTA-7B outperforms the other compared LLMs on the bilingual standardization task, which validates its superior performance and generalizability. The checkpoints are available at this https URL.

[NLP-18] MOAT: Evaluating LMMs for Capability Integration and Instruction Grounding

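[Quick read]: This paper examines the large gap between large multimodal models (LMMs) and humans on complex vision-language (VL) tasks. Although state-of-the-art LMMs do well on some tasks, they fall far short of human performance on tasks that require combining several fundamental VL capabilities or grounding complex instructions. To investigate this gap and its causes, the paper proposes MOAT, a benchmark of complex real-world VL tasks that are highly challenging for current LMMs. MOAT evaluates LMMs systematically through an author-proposed taxonomy of 10 fundamental VL capabilities, giving a fine-grained view of model strengths and weaknesses. MOAT is also the first benchmark to explicitly evaluate how well LMMs ground complex text and visual instructions, which many real-world applications require. Testing more than 20 proprietary and open-source LMMs alongside humans, the authors find that humans reach 82.7% accuracy while the best LMM (OpenAI o1) reaches only 38.8%. By identifying which VL capabilities form the bottleneck, analyzing the effect of test-time scaling, and studying how tiling affects counting, the benchmark gives clear directions for future model improvement.

Link: https://arxiv.org/abs/2503.09348
Authors: Zhoutong Ye, Mingze Sun, Huan-ang Gao, Chun Yu, Yuanchun Shi
Affiliations: Department of Computer Science and Technology, Tsinghua University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL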

Abstract:Large multimodal models (LMMs) have demonstrated significant potential as generalists in vision-language (VL) tasks. However, there remains a significant gap between state-of-the-art LMMs and human performance when it comes to complex tasks that require a combination of fundamental VL capabilities, as well as tasks involving the grounding of complex instructions. To thoroughly investigate the human-LMM gap and its underlying causes, we propose MOAT, a diverse benchmark with complex real-world VL tasks that are challenging for LMMs. Specifically, the tasks in MOAT require LMMs to engage in generalist problem solving by integrating fundamental VL capabilities such as reading text, counting, understanding spatial relations, grounding textual and visual instructions, etc. All these abilities fit into a taxonomy proposed by us that contains 10 fundamental VL capabilities, enabling MOAT to provide a fine-grained view of LMMs’ strengths and weaknesses. Besides, MOAT is the first benchmark to explicitly evaluate LMMs’ ability to ground complex text and visual instructions, which is essential to many real-world applications. We evaluate over 20 proprietary and open source LMMs, as well as humans, on MOAT, and found that humans achieved 82.7% accuracy while the best performing LMM (OpenAI o1) achieved only 38.8%. To guide future model development, we analyze common trends in our results and discuss the underlying causes of observed performance gaps between LMMs and humans, focusing on which VL capability forms the bottleneck in complex tasks, whether test time scaling improves performance on MOAT, and how tiling harms LMMs’ capability to count. Code and data are available at this https URL.

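[NLP-19] Safer or Luckier? LLMs as Safety Evaluators Are Not Robust to Artifacts

[Quick read]: This paper examines how reliable large language models (LLMs) are as automated evaluators of content safety. It finds significant weaknesses in their self-consistency across repeated judgments, their alignment with human judgments, and their sensitivity to input artifacts such as apologetic or verbose phrasing, all of which can invalidate comparative safety evaluations. Verdicts are strongly affected by artifacts: apologetic language alone can shift evaluator preferences by up to 98%. Larger models are not consistently more robust, and smaller models are sometimes more resistant to specific artifacts.

To mitigate these problems, the paper investigates jury-based evaluation, which aggregates the decisions of multiple models to improve robustness and alignment. Although this improves overall behavior, sensitivity to artifacts persists. The paper therefore stresses the need for diversified, artifact-resistant methods to make safety assessments reliable. The key is to design evaluation mechanisms that resist input artifacts while drawing on collaboration and diversity among models.

Link: https://arxiv.org/abs/2503.09347
Authors: Hongyu Chen, Seraphina Goldfarb-Tarrant
Affiliations: Cohere
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 8 pages, preprint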

Abstract:Large Language Models (LLMs) are increasingly employed as automated evaluators to assess the safety of generated content, yet their reliability in this role remains uncertain. This study evaluates a diverse set of 11 LLM judge models across critical safety domains, examining three key aspects: self-consistency in repeated judging tasks, alignment with human judgments, and susceptibility to input artifacts such as apologetic or verbose phrasing. Our findings reveal that biases in LLM judges can significantly distort the final verdict on which content source is safer, undermining the validity of comparative evaluations. Notably, apologetic language artifacts alone can skew evaluator preferences by up to 98%. Contrary to expectations, larger models do not consistently exhibit greater robustness, while smaller models sometimes show higher resistance to specific artifacts. To mitigate LLM evaluator robustness issues, we investigate jury-based evaluations aggregating decisions from multiple models. Although this approach both improves robustness and enhances alignment to human judgements, artifact sensitivity persists even with the best jury configurations. These results highlight the urgent need for diversified, artifact-resistant methodologies to ensure reliable safety assessments.
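
The jury idea in the abstract can be illustrated with a small sketch. The abstract does not state the aggregation rule, so the majority vote below is an assumption, and `jury_verdict` and `preference_shift` are hypothetical helper names rather than anything from the authors' code. Each judge is assumed to return a label such as "A" or "B" naming the content source it considers safer.

```python
from collections import Counter


def jury_verdict(votes):
    """Return the majority label among judge votes, or "tie" if there is no unique winner."""
    counts = Counter(votes).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return "tie"
    return counts[0][0]


def preference_shift(before, after):
    """Fraction of comparisons whose verdict changed once an artifact was added."""
    changed = sum(b != a for b, a in zip(before, after))
    return changed / len(before)


# Three hypothetical judges vote on one comparison.
print(jury_verdict(["A", "A", "B"]))
# Verdicts on four comparisons before and after adding an apologetic prefix.
print(preference_shift(["A", "A", "B", "B"], ["A", "B", "B", "B"]))
```

Computing `preference_shift` for single judges and for the jury on the same artifact-perturbed inputs is one simple way to check whether aggregation reduces artifact sensitivity.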

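[NLP-20] An Evaluation of LLMs for Detecting Harmful Computing Terms

[Quick read]: This paper addresses the detection of harmful and non-inclusive terminology in technical contexts, with the goal of fostering inclusive environments in computing. The key to the approach is evaluating how model architecture affects harmful-language detection. The authors test encoder models (such as BERT), decoder models (such as Gemini Flash 2.0, GPT-4, and Claude AI), and encoder-decoder models (such as T5-large), giving each a standardized prompt to identify harmful and non-inclusive language across 64 technical terms. Decoder models excel at nuanced contextual analysis, while encoder models show strong pattern recognition but weak classification certainty. These findings point to ways of improving automated detection tools and highlight model-specific strengths and limitations in fostering inclusive communication in technical domains.

Link: https://arxiv.org/abs/2503.09341
Authors: Joshua Jacas, Hana Winchester, Alicia Boyd, Brittany Johnson
Affiliations: Clarksburg High School; The Ohio State University; Yale University; George Mason University
Categories: Computation and Language (cs.CL); Emerging Technologies (cs.ET)
Comments: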

Abstract:Detecting harmful and non-inclusive terminology in technical contexts is critical for fostering inclusive environments in computing. This study explores the impact of model architecture on harmful language detection by evaluating a curated database of technical terms, each paired with specific use cases. We tested a range of encoder, decoder, and encoder-decoder language models, including BERT-base-uncased, RoBERTa large-mnli, Gemini Flash 1.5 and 2.0, GPT-4, Claude AI Sonnet 3.5, T5-large, and BART-large-mnli. Each model was presented with a standardized prompt to identify harmful and non-inclusive language across 64 terms. Results reveal that decoder models, particularly Gemini Flash 2.0 and Claude AI, excel in nuanced contextual analysis, while encoder models like BERT exhibit strong pattern recognition but struggle with classification certainty. We discuss the implications of these findings for improving automated detection tools and highlight model-specific strengths and limitations in fostering inclusive communication in technical domains.

[NLP-21] Investigating User Perspectives on Differentially Private Text Privatization

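[Quick read]: This paper addresses user acceptance of Differential Privacy (DP) natural language processing in practice, focusing on what users prefer in the output of text privatization and what shapes those preferences. Through a large-scale global survey, it studies how scenario, data sensitivity, mechanism type, and reason for data collection affect users' privacy decisions, and finds that users are highly sensitive to the utility and coherence of the privatized output. The results highlight the socio-technical factors that DP NLP research must consider and lay the groundwork for further user-based studies.

Link: https://arxiv.org/abs/2503.09338
Authors: Stephen Meisenbacher, Alexandra Klymenko, Alexander Karpp, Florian Matthes
Affiliations: Technical University of Munich
Categories: Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
Comments: 20 pages, 5 figures, 10 tables. Accepted to PrivateNLP 2025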

Abstract:Recent literature has seen a considerable uptick in Differentially Private Natural Language Processing (DP NLP). This includes DP text privatization, where potentially sensitive input texts are transformed under DP to achieve privatized output texts that ideally mask sensitive information and maintain original semantics. Despite continued work to address the open challenges in DP text privatization, there remains a scarcity of work addressing user perceptions of this technology, a crucial aspect which serves as the final barrier to practical adoption. In this work, we conduct a survey study with 721 laypersons around the globe, investigating how the factors of scenario, data sensitivity, mechanism type, and reason for data collection impact user preferences for text privatization. We learn that while all these factors play a role in influencing privacy decisions, users are highly sensitive to the utility and coherence of the private output texts. Our findings highlight the socio-technical factors that must be considered in the study of DP NLP, opening the door to further user-based investigations going forward.

[NLP-22] A Survey on Enhancing Causal Reasoning Ability of Large Language Models

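[Quick read]: This paper addresses how to enhance the causal reasoning ability of large language models (LLMs) so they can handle tasks that demand robust causal reasoning, such as healthcare and economic analysis. It systematically reviews existing research, proposes a novel taxonomy for organizing the methods, and compares methods in detail within and between categories. It also summarizes existing benchmarks and evaluation metrics and outlines future research directions. The survey aims to fill the gap left by the absence of a comprehensive review in this area and to offer insight to researchers and practitioners.

Link: https://arxiv.org/abs/2503.09326
Authors: Xin Li, Zhuo Cai, Shoujin Wang, Kun Yu, Fang Chen
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: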

Abstract:Large language models (LLMs) have recently shown remarkable performance in language tasks and beyond. However, due to their limited inherent causal reasoning ability, LLMs still face challenges in handling tasks that require robust causal reasoning ability, such as health-care and economic analysis. As a result, a growing body of research has focused on enhancing the causal reasoning ability of LLMs. Despite the booming research, a survey that thoroughly reviews the challenges, progress, and future directions in this area is still lacking. To bridge this significant gap, we systematically review literature on how to strengthen LLMs’ causal reasoning ability in this paper. We start from the introduction of background and motivations of this topic, followed by the summarisation of key challenges in this area. Thereafter, we propose a novel taxonomy to systematically categorise existing methods, together with detailed comparisons within and between classes of methods. Furthermore, we summarise existing benchmarks and evaluation metrics for assessing LLMs’ causal reasoning ability. Finally, we outline future research directions for this emerging field, offering insights and inspiration to researchers and practitioners in the area.

[NLP-23] xVLM2Vec: Adapting LVLM-based embedding models to multilinguality using Self-Knowledge Distillation

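[Quick read]: This paper addresses the limitations of existing embedding models on multilingual and multimodal input, particularly for Large Vision-Language Models. Most current embedding models are built on the encoder-only transformer architecture, and work on extracting embeddings from large models focuses on English text, leaving multilingual and multimodal data poorly supported. The key contribution is an adaptation method for Large Vision-Language Models trained on English data that improves their ability to extract multilingual and multimodal embeddings, together with a benchmark for evaluating such models.

Link: https://arxiv.org/abs/2503.09313
Authors: Elio Musacchio, Lucia Siciliani, Pierpaolo Basile, Giovanni Semeraro
Affiliations: University of Bari Aldo Moro
Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments: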

Abstract:In the current literature, most embedding models are based on the encoder-only transformer architecture to extract a dense and meaningful representation of the given input, which can be a text, an image, and more. With the recent advances in language modeling thanks to the introduction of Large Language Models, the possibility of extracting embeddings from these large and extensively trained models has been explored. However, current studies focus on textual embeddings in English, which is also the main language on which these models have been trained. Furthermore, there are very few models that consider multimodal and multilingual input. In light of this, we propose an adaptation methodology for Large Vision-Language Models trained on English language data to improve their performance in extracting multilingual and multimodal embeddings. Finally, we design and introduce a benchmark to evaluate the effectiveness of multilingual and multimodal embedding models.

[NLP-24] Unmask It! AI-Generated Product Review Detection in Dravidian Languages NAACL2025

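[Quick read]: This paper addresses the threat that reviews produced by generative AI pose to the credibility of product and service information on online platforms. Fabricated reviews mislead consumers, undermine trust, and can facilitate fraud in digital marketplaces. Research on this problem is relatively scarce for Tamil and Malayalam, two low-resource languages. The paper evaluates a range of approaches, from traditional machine learning methods to advanced transformer models (Indic-BERT, IndicSBERT, MuRIL, XLM-RoBERTa, and MalayalamBERT), and shows that state-of-the-art transformers identify AI-generated content effectively, improving fake-review detection in low-resource language settings.

Link: https://arxiv.org/abs/2503.09289
Authors: Somsubhra De, Advait Vats
Affiliations: Indian Institute of Technology Madras
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 10 pages, 9 figures, Accepted to DravidianLangTech Workshop proceedings at NAACL 2025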

Abstract:The rise of Generative AI has led to a surge in AI-generated reviews, often posing a serious threat to the credibility of online platforms. Reviews serve as the primary source of information about products and services. Authentic reviews play a vital role in consumer decision-making. The presence of fabricated content misleads consumers, undermines trust and facilitates potential fraud in digital marketplaces. This study focuses on detecting AI-generated product reviews in Tamil and Malayalam, two low-resource languages where research in this domain is relatively under-explored. We worked on a range of approaches - from traditional machine learning methods to advanced transformer-based models such as Indic-BERT, IndicSBERT, MuRIL, XLM-RoBERTa and MalayalamBERT. Our findings highlight the effectiveness of leveraging the state-of-the-art transformers in accurately identifying AI-generated content, demonstrating the potential in enhancing the detection of fake reviews in low-resource language settings.

[NLP-25] Considering Length Diversity in Retrieval-Augmented Summarization NAACL2025

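[Quick read]: This paper fills a gap in retrieval-augmented summarization: how the lengths of exemplar summaries affect summary quality under length constraints, which earlier work did not examine. It proposes the Diverse Length-aware Maximal Marginal Relevance (DL-MMR) algorithm to control summary length more precisely. Unlike previous methods, which require exhaustive exemplar-to-exemplar relevance comparisons with MMR, DL-MMR combines query relevance with the diversity of target lengths and avoids pairwise comparisons between exemplars, which cuts computational cost and memory use when building the exemplar pool. Experiments show that, at the same level of informativeness, DL-MMR saves memory by a factor of 781,513 and reduces computational cost by a factor of 500,092 relative to the original MMR algorithm.

Link: https://arxiv.org/abs/2503.09249
Authors: Juseon-Do, Jaesung Hwang, Jingun Kwon, Hidetaka Kamigaito, Manabu Okumura
Affiliations: Chungnam National University; Nara Institute of Science and Technology (NAIST); Institute of Science Tokyo
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 12 pages, accepted to NAACL 2025 Findings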

Abstract:This study investigates retrieval-augmented summarization by specifically examining the impact of exemplar summary lengths under length constraints, not covered by previous work. We propose a Diverse Length-aware Maximal Marginal Relevance (DL-MMR) algorithm to better control summary lengths. This algorithm combines the query relevance with diverse target lengths in retrieval-augmented summarization. Unlike previous methods that necessitate exhaustive exemplar-exemplar relevance comparisons using MMR, DL-MMR considers the exemplar target length as well and avoids comparing exemplars to each other, thereby reducing computational cost and conserving memory during the construction of an exemplar pool. Experimental results showed the effectiveness of DL-MMR, which considers length diversity, compared to the original MMR algorithm. DL-MMR additionally saved memory by a factor of 781,513 and reduced computational cost by a factor of 500,092, while maintaining the same level of informativeness.
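
For context, classic MMR greedily picks the item that maximizes lam * sim(query, item) - (1 - lam) * max over already selected items of sim(item, selected), which is where the exemplar-to-exemplar comparisons come from. The sketch below implements that baseline and, next to it, a hypothetical length-aware variant that drops the pairwise term and instead picks one exemplar per target length. The variant only illustrates the idea described in the abstract; it is not the DL-MMR formula from the paper, and all function and parameter names are assumptions.

```python
def mmr_select(query_sim, pair_sim, k, lam=0.5):
    """Classic greedy MMR over precomputed similarities; returns chosen indices."""
    selected, candidates = [], list(range(len(query_sim)))
    while candidates and len(selected) < k:
        def score(i):
            # Redundancy needs a similarity between candidate i and every pick so far.
            redundancy = max((pair_sim[i][j] for j in selected), default=0.0)
            return lam * query_sim[i] - (1 - lam) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


def length_aware_select(query_sim, lengths, target_lengths, lam=0.5):
    """Hypothetical variant: one exemplar per target length, no pairwise similarity matrix."""
    max_len = max(lengths) or 1
    selected = []
    for target in target_lengths:
        candidates = [i for i in range(len(query_sim)) if i not in selected]
        if not candidates:
            break
        def score(i):
            # Penalize exemplars whose summary length is far from the target length.
            length_gap = abs(lengths[i] - target) / max_len
            return lam * query_sim[i] - (1 - lam) * length_gap
        selected.append(max(candidates, key=score))
    return selected


query_sim = [0.9, 0.85, 0.3]
pair_sim = [[1.0, 0.95, 0.1], [0.95, 1.0, 0.1], [0.1, 0.1, 1.0]]
print(mmr_select(query_sim, pair_sim, k=2))
print(length_aware_select(query_sim, lengths=[10, 11, 30], target_lengths=[10, 30]))
```

The baseline needs the full `pair_sim` matrix, whose size grows quadratically with the pool, while the variant only needs one scalar length per exemplar. That difference is the kind of saving the abstract refers to, although the figures reported there come from the authors' own method and setup.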

[NLP-26] Rethinking Prompt-based Debiasing in Large Language Models

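[Quick read]: This paper addresses bias in large language models (LLMs), and in particular evaluates how effective prompt-based methods are at identifying and mitigating it. It systematically tests the assumption that prompt-based debiasing works because models inherently understand bias. Using the BBQ and StereoSet benchmarks, the study finds that prompt-based methods are often superficial: for example, Llama2-7B-Chat misclassifies more than 90% of unbiased content as biased, despite its high accuracy at identifying bias on the BBQ dataset. The paper also shows that specific evaluation and question settings in bias benchmarks push LLMs toward "evasive answers" that ignore the core of the question and the relevance of the response to the context. Part of the apparent success of earlier methods may stem from flawed evaluation metrics. The study therefore calls for rethinking bias metrics to ensure truly trustworthy AI.

Link: https://arxiv.org/abs/2503.09219
Authors: Xinyi Yang, Runzhe Zhan, Derek F. Wong, Shu Yang, Junchao Wu, Lidia S. Chao
Affiliations: NLP2CT Lab, Department of Computer and Information Science, University of Macau; Provable Responsible AI and Data Analytics (PRADA) Lab, King Abdullah University of Science and Technology (KAUST)
Categories: Computation and Language (cs.CL)
Comments: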

Abstract:Investigating bias in large language models (LLMs) is crucial for developing trustworthy AI. While prompt-based debiasing through prompt engineering is common, its effectiveness relies on the assumption that models inherently understand biases. Our study systematically analyzed this assumption using the BBQ and StereoSet benchmarks on both open-source models and a commercial GPT model. Experimental results indicate that prompt-based debiasing is often superficial; for instance, the Llama2-7B-Chat model misclassified over 90% of unbiased content as biased, despite achieving high accuracy in identifying bias issues on the BBQ dataset. Additionally, specific evaluation and question settings in bias benchmarks often lead LLMs to choose “evasive answers”, disregarding the core of the question and the relevance of the response to the context. Moreover, the apparent success of previous methods may stem from flawed evaluation metrics. Our research highlights a potential “false prosperity” in prompt-based debiasing efforts and emphasizes the need to rethink bias metrics to ensure truly trustworthy AI.

[NLP-27] N2C2: Nearest Neighbor Enhanced Confidence Calibration for Cross-Lingual In-Context Learning

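[Quick read]: This paper addresses the poor calibration and prediction confidence of in-context learning (ICL) in cross-lingual settings. The authors find that ICL performs poorly on cross-lingual sentiment classification, with low accuracy and high calibration error. In response they propose N2C2, whose key idea is a k-nearest-neighbors augmented classifier for calibrating prediction confidence. N2C2 uses a datastore of cached few-shot instances and combines prediction integration from the datastore, semantically consistent retrieval representations, an adaptive neighbor combination module, and a confidence-aware distribution to make effective use of the limited supporting instances and narrow the prediction gap. On multilingual sentiment classification datasets, N2C2 outperforms traditional ICL and surpasses fine-tuning, prompt tuning, and recent state-of-the-art methods in accuracy and calibration error.

Link: https://arxiv.org/abs/2503.09218
Authors: Jie He, Simon Yu, Deyi Xiong, Víctor Gutiérrez-Basulto, Jeff Z. Pan
Affiliations: School of Informatics, University of Edinburgh; School of Computer Science and Informatics, Cardiff University; College of Intelligence and Computing, Tianjin University
Categories: Computation and Language (cs.CL)
Comments: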

Abstract:Recent advancements of in-context learning (ICL) show language models can significantly improve their performance when demonstrations are provided. However, little attention has been paid to model calibration and prediction confidence of ICL in cross-lingual scenarios. To bridge this gap, we conduct a thorough analysis of ICL for cross-lingual sentiment classification. Our findings suggest that ICL performs poorly in cross-lingual scenarios, exhibiting low accuracy and presenting high calibration errors. In response, we propose a novel approach, N2C2, which employs a k-nearest neighbors augmented classifier for prediction confidence calibration. N2C2 narrows the prediction gap by leveraging a datastore of cached few-shot instances. Specifically, N2C2 integrates the predictions from the datastore and incorporates confidence-aware distribution, semantically consistent retrieval representation, and adaptive neighbor combination modules to effectively utilize the limited number of supporting instances. Evaluation on two multilingual sentiment classification datasets demonstrates that N2C2 outperforms traditional ICL. It surpasses fine tuning, prompt tuning and recent state-of-the-art methods in terms of accuracy and calibration errors.
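
A generic sketch of kNN-augmented classification, the idea N2C2 builds on, may help here. It retrieves the nearest cached instances from a datastore, turns their distances into a label distribution, and interpolates that with the model's own distribution. This follows the common kNN-interpolation recipe and omits N2C2's confidence-aware distribution, retrieval representation, and adaptive neighbor combination modules; the Euclidean distance, the fixed mixing weight `lam`, and all names are assumptions.

```python
import math


def knn_distribution(query, datastore, num_labels, k=3, temperature=1.0):
    """Label distribution from the k nearest (vector, label) entries in the datastore."""
    neighbors = sorted(datastore, key=lambda item: math.dist(query, item[0]))[:k]
    # Closer neighbors get exponentially larger weight.
    weights = [math.exp(-math.dist(query, vec) / temperature) for vec, _ in neighbors]
    total = sum(weights)
    probs = [0.0] * num_labels
    for weight, (_, label) in zip(weights, neighbors):
        probs[label] += weight / total
    return probs


def interpolate(model_probs, knn_probs, lam=0.5):
    """Mix the model's distribution with the retrieved one using a fixed weight lam."""
    return [lam * kp + (1 - lam) * mp for mp, kp in zip(model_probs, knn_probs)]


# Toy datastore of cached few-shot instances: (embedding, label index).
datastore = [([0.0, 0.0], 0), ([0.1, 0.0], 0), ([5.0, 5.0], 1)]
knn_probs = knn_distribution([0.0, 0.1], datastore, num_labels=2, k=2)
print(interpolate([0.4, 0.6], knn_probs, lam=0.5))
```

In this toy case both retrieved neighbors carry label 0, so the mixture moves probability mass toward label 0 even though the model alone preferred label 1.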

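[NLP-28] Why LLMs Cannot Think and How to Fix It NEURIPS2024

[Quick read]: This paper addresses what it identifies as a fundamental limitation of large language models (LLMs): they cannot make decisions or develop "thoughts" within the feature space. It establishes a definition of "thought", analyzes the architectural design and training methodology of contemporary LLMs, and argues that these models are inherently precluded from genuine thought processes. The key to the solution is a set of proposed architectural modifications that would enable thought processes within the feature space, along with a discussion of their broader implications.

Link: https://arxiv.org/abs/2503.09211
Authors: Marius Jahrens, Thomas Martinetz
Affiliations: University of Lübeck
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: Original conference submission for NeurIPS 2024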

Abstract:This paper elucidates that current state-of-the-art Large Language Models (LLMs) are fundamentally incapable of making decisions or developing “thoughts” within the feature space due to their architectural constraints. We establish a definition of “thought” that encompasses traditional understandings of that term and adapt it for application to LLMs. We demonstrate that the architectural design and language modeling training methodology of contemporary LLMs inherently preclude them from engaging in genuine thought processes. Our primary focus is on this theoretical realization rather than practical insights derived from experimental data. Finally, we propose solutions to enable thought processes within the feature space and discuss the broader implications of these architectural modifications.

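[NLP-29] Quality Over Quantity? LLM-Based Curation for a Data-Efficient Audio-Video Foundation Model

[Quick read]: This paper addresses the challenge of integrating audio and visual data when training multimodal foundation models, and in particular how to align audiovisual (AV) scene content beyond simple temporal synchronization. The proposed method, Audio-Video Vector Alignment (AVVA), uses a large language model (LLM)-based data curation pipeline to score and select high-quality training clips, with Whisper (a speech-based audio foundation model) for audio and DINOv2 for video inside a dual-encoder contrastive learning framework. The approach markedly improves retrieval accuracy while using far less curated data, showing that data quality can outweigh data quantity. AVVA's key innovation is combining an efficient LLM-driven data selection strategy with contrastive learning to achieve more robust, text-free audiovisual learning.

Link: https://arxiv.org/abs/2503.09205
Authors: Ali Vosoughi, Dimitra Emmanouilidou, Hannes Gamper
Affiliations: Microsoft Research (Redmond, WA, USA); University of Rochester (Rochester, NY, USA)
Categories: Multimedia (cs.MM); Computation and Language (cs.CL); Information Retrieval (cs.IR); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments: 5 pages, 5 figures, 3 tables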

Abstract:Integrating audio and visual data for training multimodal foundational models remains challenging. We present Audio-Video Vector Alignment (AVVA), which aligns audiovisual (AV) scene content beyond mere temporal synchronization via a Large Language Model (LLM)-based data curation pipeline. Specifically, AVVA scores and selects high-quality training clips using Whisper (speech-based audio foundation model) for audio and DINOv2 for video within a dual-encoder contrastive learning framework. Evaluations on AudioCaps, VALOR, and VGGSound demonstrate that this approach can achieve significant accuracy gains with substantially less curated data. For instance, AVVA yields a 7.6% improvement in top-1 accuracy for audio-to-video retrieval on VGGSound compared to ImageBind, despite training on only 192 hours of carefully filtered data (vs. 5800+ hours). Moreover, an ablation study highlights that trading data quantity for data quality improves performance, yielding respective top-3 accuracy increases of 47.8, 48.4, and 58.0 percentage points on AudioCaps, VALOR, and VGGSound over uncurated baselines. While these results underscore AVVA’s data efficiency, we also discuss the overhead of LLM-driven curation and how it may be scaled or approximated in larger domains. Overall, AVVA provides a viable path toward more robust, text-free audiovisual learning with improved retrieval accuracy.

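[NLP-30] Token Weighting for Long-Range Language Modeling NAACL2025

[Quick read]: This paper addresses the poor performance of large language models (LLMs) on long-context understanding tasks. Conventional next-token prediction training gives every token equal weight, which ignores how much the context needed to predict the next token varies across data. The key innovation is a set of new token-weighting schemes that assign a different weight to each training token in the loss, generalizing existing work. The schemes follow a two-step framework that scores tokens by comparing the confidences of a long-context model and a short-context model. Experiments show that non-uniform loss weights improve the long-context abilities of LLMs, and that short-context models much smaller than the long-context model being trained can score tokens effectively. The work deepens understanding of the trade-offs in long-context language modeling and gives empirically grounded guidelines for steering models through loss weighting.

Link: https://arxiv.org/abs/2503.09202
Authors: Falko Helm, Nico Daheim, Iryna Gurevych
Affiliations: Ubiquitous Knowledge Processing Lab (UKP Lab); Department of Computer Science and Hessian Center for AI (hessian.AI); Technical University of Darmstadt
Categories: Computation and Language (cs.CL)
Comments: Accepted to NAACL 2025 (Findings). For the code, see this https URL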

Abstract:Many applications of large language models (LLMs) require long-context understanding, but models continue to struggle with such tasks. We hypothesize that conventional next-token prediction training could contribute to this, because each token is assigned equal weight. Yet, intuitively, the amount of context needed to predict the next token accurately varies greatly across different data. To reflect this, we propose various novel token-weighting schemes that assign different weights to each training token in the loss, thereby generalizing existing works. For this, we categorize token-weighting methods using a two-step framework which compares the confidences of a long-context and short-context model to score tokens. We evaluate all methods on multiple long-context understanding tasks and show that non-uniform loss weights are helpful to improve the long-context abilities of LLMs. Different short-context models can be used effectively for token scoring, including models that are much smaller than the long-context model that is trained. All in all, this work contributes to a better understanding of the trade-offs long-context language modeling faces and provides guidelines for model steering via loss-weighting based on empirical evidence. The code can be found on Github.
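
The two-step idea in the abstract (score tokens by comparing two models, then turn scores into loss weights) can be sketched as follows. This is a minimal illustration under stated assumptions, not one of the paper's schemes: the score is taken to be the difference in gold-token log-probability between a long-context and a short-context model, and the weights come from a softmax rescaled to have mean 1. All function names are hypothetical.

```python
import math


def token_scores(long_logp, short_logp):
    """Step 1: a token scores high when the long-context model is more confident."""
    return [l - s for l, s in zip(long_logp, short_logp)]


def scores_to_weights(scores, temperature=1.0):
    """Step 2: softmax over scores, rescaled so the weights average to 1."""
    top = max(scores)
    exps = [math.exp((s - top) / temperature) for s in scores]
    total = sum(exps)
    return [len(scores) * e / total for e in exps]


def weighted_nll(logp, weights):
    """Token-weighted negative log-likelihood; uniform weights give the usual mean loss."""
    return -sum(w * lp for w, lp in zip(weights, logp)) / len(logp)


# Gold-token log-probabilities for a two-token sequence (made-up numbers).
long_logp = [-0.5, -1.0]
short_logp = [-0.5, -3.0]  # the second token benefits from long context
weights = scores_to_weights(token_scores(long_logp, short_logp))
print(weights)
print(weighted_nll(long_logp, weights))
```

When both models are equally confident on every token, the weights collapse to 1 and the loss reduces to standard next-token training, which is the uniform baseline the abstract contrasts against.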

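[NLP-31] Is LLMs Hallucination Usable? LLM-based Negative Reasoning for Fake News Detection

[Quick read]: This paper starts from the unstable decision-making that knowledge hallucination causes in large language models (LLMs) and asks whether hallucination can be put to use to generate negative reasoning that helps detect fake news. It proposes SR^3, a novel supervised self-reinforced reasoning rectification approach that uses LLM reflection to produce both common reasonable reasoning and wrong understandings (negative reasoning) for semantic consistency learning. On top of this it builds NRFE, a negative-reasoning-based news learning model that learns semantic consistency from positive and negative news-reasoning pairs. To avoid the influence of label-implicated reasoning, a student model, NRFE-D, takes only the news content as input and distills knowledge from NRFE to assess the method. Experiments on three popular fake news datasets show the method's advantage over prompting LLMs, fine-tuning pre-trained SLMs, and other representative fake news detection methods.

Link: https://arxiv.org/abs/2503.09153
Authors: Chaowei Zhang, Zongling Feng, Zewei Zhang, Jipeng Qiang, Guandong Xu, Yun Li
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 9 pages, 12 figures, conference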

Abstract:The questionable responses caused by knowledge hallucination may lead to LLMs’ unstable ability in decision-making. However, it has never been investigated whether the LLMs’ hallucination is possibly usable to generate negative reasoning for facilitating the detection of fake news. This study proposes a novel supervised self-reinforced reasoning rectification approach, SR^3, that yields both common reasonable reasoning and wrong understandings (negative reasoning) for news via LLMs reflection for semantic consistency learning. Upon that, we construct a negative reasoning-based news learning model called NRFE, which leverages positive or negative news-reasoning pairs for learning the semantic consistency between them. To avoid the impact of label-implicated reasoning, we deploy a student model, NRFE-D, that only takes news content as input to inspect the performance of our method by distilling the knowledge from NRFE. The experimental results verified on three popular fake news datasets demonstrate the superiority of our method compared with three kinds of baselines including prompting on LLMs, fine-tuning on pre-trained SLMs, and other representative fake news detection methods.

[NLP-32] Specification languages for computational laws versus basic legal principles

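[Quick read]: This paper addresses the challenges that arise when laws are enforced in computational form (computational law) through automated decision-making, and in particular how consistency, transparency, and traceability can be achieved when the law is written in natural language versus a formal language. It examines how certain legal principles fare under each form of expression and uses a running example from the European Union's road transport regulation to show the benefits of each language and the tensions between them, offering theoretical support and a practical reference for drafting computational laws that are fairer, more comprehensible, and easier to verify.

Link: https://arxiv.org/abs/2503.09129
Authors: Petia Guintchev, Joost J. Joosten, Sofia Santiago Fernández, Eric Sancho Adamson, Aleix Solé Sánchez, Marta Soria Heredia
Affiliations: Unknown
Categories: Computers and Society (cs.CY); Computation and Language (cs.CL)
Comments: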

Abstract:We speak of a computational law when that law is intended to be enforced by software through an automated decision-making process. As digital technologies evolve to offer more solutions for public administrations, we see an ever-increasing number of computational laws. Traditionally, law is written in natural language. Computational laws, however, suffer various complications when written in natural language, such as underspecification and ambiguity which lead to a diversity of possible interpretations to be made by the coder. These could potentially result in an uneven application of the law. Thus, resorting to formal languages to write computational laws is tempting. However, writing laws in a formal language leads to further complications, for example, incomprehensibility for non-experts, lack of explicit motivation of the decisions made, or difficulties in retrieving the data leading to the outcome. In this paper, we investigate how certain legal principles fare in both scenarios: computational law written in natural language or written in formal language. We use a running example from the European Union’s road transport regulation to showcase the tensions arising, and the benefits from each language.

[NLP-33] GRU: Mitigating the Trade-off between Unlearning and Retention for Large Language Models

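[Quick read]: This paper aims to unlearn targeted content from large language models (LLMs) efficiently and thoroughly without damaging their general functionality. Existing methods typically face a pronounced trade-off between unlearning and retention, which degrades performance. The key is a direction-based gradient analysis that reveals how retention performance relates to directional disparities between gradients during unlearning, and that motivates an update mechanism built from two gradient sources: one harmful for retention and one useful for unlearning. On this basis the paper proposes Gradient Rectified Unlearning (GRU), which controls the update gradients in a geometry-focused, optimization-driven manner to minimize side effects on other, unrelated responses. Specifically, GRU uses a closed-form solution to project the unlearning gradient onto the orthogonal space of the retention-harmful gradient, keeping the deviation from the original direction minimal while overall performance is retained. Experiments show that GRU is a general framework that is straightforward to implement, improves a range of baseline methods, and is effective across multiple LLM unlearning benchmarks.

Link: https://arxiv.org/abs/2503.09117
Authors: Yue Wang,Qizhou Wang,Feng Liu,Wei Huang,Yali Du,Xiaojiang Du,Bo Han
Affiliations: TMLR Group, Department of Computer Science, Hong Kong Baptist University; The University of Melbourne; RIKEN Center for Advanced Intelligence Project; King’s College London; Department of Electrical and Computer Engineering, Stevens Institute of Technology
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: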

Abstract:Large language model (LLM) unlearning has demonstrated its essential role in removing privacy and copyright-related responses, crucial for their legal and safe applications. However, the pursuit of complete unlearning often comes with substantial costs due to its compromises in their general functionality, leading to a notorious trade-off between unlearning and retention. In examining the update process for unlearning dynamically, we find gradients hold essential information for revealing this trade-off. In particular, we look at the varying relationship between retention performance and directional disparities between gradients during unlearning. It motivates the sculpting of an update mechanism derived from gradients from two sources, i.e., harmful for retention and useful for unlearning. Accordingly, we propose Gradient Rectified Unlearning (GRU), an enhanced unlearning framework controlling the updating gradients in a geometry-focused and optimization-driven manner such that their side impacts on other, unrelated responses can be minimized. Specifically, GRU derives a closed-form solution to project the unlearning gradient onto the orthogonal space of that gradient harmful for retention, ensuring minimal deviation from its original direction under the condition that overall performance is retained. Comprehensive experiments are conducted to demonstrate that GRU, as a general framework, is straightforward to implement and efficiently enhances a range of baseline methods through its adaptable and compatible characteristics. Additionally, experimental results show its broad effectiveness across a diverse set of benchmarks for LLM unlearning.
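
The projection step described in the abstract can be illustrated with a minimal NumPy sketch. It assumes plain orthogonal projection onto the complement of a single retention-harmful gradient; the paper's exact closed form may differ, and the function name `rectify_gradient` and the toy vectors are hypothetical.

```python
import numpy as np

def rectify_gradient(g_unlearn, g_harm, eps=1e-12):
    """Project g_unlearn onto the orthogonal complement of g_harm.

    Orthogonal projection returns the closest vector to g_unlearn that has
    no component along g_harm (an assumed stand-in for the GRU closed form).
    """
    g_unlearn = np.asarray(g_unlearn, dtype=float)
    g_harm = np.asarray(g_harm, dtype=float)
    denom = float(g_harm @ g_harm)
    if denom < eps:
        # Nothing to project away when the harmful gradient is (near) zero.
        return g_unlearn.copy()
    coeff = float(g_unlearn @ g_harm) / denom
    return g_unlearn - coeff * g_harm

# Toy example: the harmful direction is the first axis.
g_u = np.array([1.0, 1.0])
g_h = np.array([1.0, 0.0])
g_rectified = rectify_gradient(g_u, g_h)
print(g_rectified, float(g_rectified @ g_h))
```

The rectified update keeps the part of the unlearning gradient that does not move the parameters along the retention-harmful direction.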

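[NLP-34] VaxGuard: A Multi-Generator, Multi-Type, and Multi-Role Dataset for Detecting LLM-Generated Vaccine Misinformation

[Quick read]: This paper addresses the challenge of large language models (LLMs) generating vaccine-related misinformation, and the shortcomings of existing detection methods for this specific domain and for the diverse roles of misinformation spreaders. The key contribution is VaxGuard, a new dataset that contains vaccine-related misinformation generated by multiple LLMs and provides a comprehensive framework for detection across different spreader roles. By evaluating the detection performance of several LLMs, the study also shows the importance of role-specific detection strategies and suggests that VaxGuard can serve as a key resource for improving the detection of LLM-generated vaccine misinformation.

Link: https://arxiv.org/abs/2503.09103
Authors: Syed Talal Ahmad,Haohui Lu,Sidong Liu,Annie Lau,Amin Beheshti,Mark Dras,Usman Naseem
Affiliations: Macquarie University, Australia; University of Sydney, Australia
Categories: Computation and Language (cs.CL)
Comments: Preprint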

Abstract:Recent advancements in Large Language Models (LLMs) have significantly improved text generation capabilities. However, they also present challenges, particularly in generating vaccine-related misinformation, which poses risks to public health. Despite research on human-authored misinformation, a notable gap remains in understanding how LLMs contribute to vaccine misinformation and how best to detect it. Existing benchmarks often overlook vaccine-specific misinformation and the diverse roles of misinformation spreaders. This paper introduces VaxGuard, a novel dataset designed to address these challenges. VaxGuard includes vaccine-related misinformation generated by multiple LLMs and provides a comprehensive framework for detecting misinformation across various roles. Our findings show that GPT-3.5 and GPT-4o consistently outperform other LLMs in detecting misinformation, especially when dealing with subtle or emotionally charged narratives. On the other hand, PHI3 and Mistral show lower performance, struggling with precision and recall in fear-driven contexts. Additionally, detection performance tends to decline as input text length increases, indicating the need for improved methods to handle larger content. These results highlight the importance of role-specific detection strategies and suggest that VaxGuard can serve as a key resource for improving the detection of LLM-generated vaccine misinformation.

[NLP-35] Domain Adaptation for Japanese Sentence Embeddings with Contrastive Learning based on Synthetic Sentence Generation

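[Quick read]: This paper addresses domain adaptation for low-resource languages such as Japanese, where the scarcity of large-scale labeled data makes it hard to adapt general-purpose sentence embeddings from pre-trained models to a specific domain with conventional methods. It proposes SDJC (Self-supervised Domain adaptation for Japanese sentence embeddings with Contrastive learning). The key is a data generator that produces sentences with the same syntactic structure as sentences in an unlabeled domain-specific corpus but with different meanings; these generated sentences strengthen contrastive learning so that the pre-trained backbone discriminates domain-specific sentences more accurately. The paper also builds a comprehensive Japanese STS (Semantic Textual Similarity) benchmark dataset to support model selection and evaluation. Experiments validate SDJC on two domain-specific downstream tasks and confirm the usefulness of the constructed dataset.

Link: https://arxiv.org/abs/2503.09094
Authors: Zihao Chen,Hisashi Handa,Miho Ohsaki,Kimiaki Shirahama
Affiliations: Doshisha University; Kinki University
Categories: Computation and Language (cs.CL)
Comments: 39 pages, 7 figures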

Abstract:Several backbone models pre-trained on general domain datasets can encode a sentence into a widely useful embedding. Such sentence embeddings can be further enhanced by domain adaptation that adapts a backbone model to a specific domain. However, domain adaptation for low-resource languages like Japanese is often difficult due to the scarcity of large-scale labeled datasets. To overcome this, this paper introduces SDJC (Self-supervised Domain adaptation for Japanese sentence embeddings with Contrastive learning) that utilizes a data generator to generate sentences, which have the same syntactic structure as a sentence in an unlabeled specific domain corpus but convey different semantic meanings. Generated sentences are then used to boost contrastive learning that adapts a backbone model to accurately discriminate sentences in the specific domain. In addition, the components of SDJC like a backbone model and a method to adapt it need to be carefully selected, but no benchmark dataset is available for Japanese. Thus, a comprehensive Japanese STS (Semantic Textual Similarity) benchmark dataset is constructed by combining datasets machine-translated from English with existing datasets. The experimental results validate the effectiveness of SDJC on two domain-specific downstream tasks as well as the usefulness of the constructed dataset. Datasets, codes and backbone models adapted by SDJC are available on our GitHub repository this https URL.

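[NLP-36] LocAgent: Graph-Guided LLM Agents for Code Localization

[Quick read]: This paper tackles code localization, a fundamental yet challenging software-maintenance task: identifying precisely where in a codebase changes need to be made. Existing approaches struggle to navigate complex codebases efficiently and locate the relevant code, mainly because linking a natural-language problem description to the right code elements often requires reasoning across hierarchical structures and multiple dependencies. The paper proposes LocAgent, whose key idea is a graph-based representation of code. LocAgent parses a codebase into a directed heterogeneous graph, a lightweight representation that captures code structures (files, classes, functions) and their dependencies (imports, invocations, inheritance), so that LLM agents can search for and locate relevant entities through multi-hop reasoning. Experiments show significantly higher localization accuracy: with a fine-tuned Qwen-2.5-Coder-Instruct-32B model the method achieves results comparable to state-of-the-art proprietary models at about 86% lower cost, reaches up to 92.7% accuracy on file-level localization, and improves GitHub issue resolution success rates by 12% for multiple attempts (Pass@10).

Link: https://arxiv.org/abs/2503.09089
Authors: Zhaoling Chen,Xiangru Tang,Gangda Deng,Fang Wu,Jialong Wu,Zhiwei Jiang,Viktor Prasanna,Arman Cohan,Xingyao Wang
Affiliations: Yale University; University of Southern California; Stanford University; All Hands AI
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: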

Abstract:Code localization–identifying precisely where in a codebase changes need to be made–is a fundamental yet challenging task in software maintenance. Existing approaches struggle to efficiently navigate complex codebases when identifying relevant code sections. The challenge lies in bridging natural language problem descriptions with the appropriate code elements, often requiring reasoning across hierarchical structures and multiple dependencies. We introduce LocAgent, a framework that addresses code localization through graph-based representation. By parsing codebases into directed heterogeneous graphs, LocAgent creates a lightweight representation that captures code structures (files, classes, functions) and their dependencies (imports, invocations, inheritance), enabling LLM agents to effectively search and locate relevant entities through powerful multi-hop reasoning. Experimental results on real-world benchmarks demonstrate that our approach significantly enhances accuracy in code localization. Notably, our method with the fine-tuned Qwen-2.5-Coder-Instruct-32B model achieves comparable results to SOTA proprietary models at greatly reduced cost (approximately 86% reduction), reaching up to 92.7% accuracy on file-level localization while improving downstream GitHub issue resolution success rates by 12% for multiple attempts (Pass@10). Our code is available at this https URL.
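
To make the graph idea concrete, here is a minimal sketch of a directed heterogeneous code graph and a bounded multi-hop search. The node naming scheme, the `GRAPH` contents, and the `multi_hop` helper are hypothetical illustrations, not LocAgent's actual data structures or API.

```python
from collections import deque

# Hypothetical toy graph: node -> list of (edge_type, target_node).
GRAPH = {
    "file:app.py": [("contains", "class:App"), ("imports", "file:utils.py")],
    "class:App": [("contains", "func:App.run"), ("inherits", "class:Base")],
    "func:App.run": [("invokes", "func:parse")],
    "file:utils.py": [("contains", "func:parse")],
    "class:Base": [],
    "func:parse": [],
}

def multi_hop(graph, start, max_hops, edge_types=None):
    """Breadth-first search returning {node: hops} within max_hops of start.

    If edge_types is given, only edges of those types are followed.
    """
    seen = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if seen[node] == max_hops:
            continue  # do not expand beyond the hop budget
        for edge_type, target in graph.get(node, []):
            if edge_types is not None and edge_type not in edge_types:
                continue
            if target not in seen:
                seen[target] = seen[node] + 1
                queue.append(target)
    return seen

neighbors = multi_hop(GRAPH, "file:app.py", max_hops=2)
structure_only = multi_hop(GRAPH, "file:app.py", max_hops=3, edge_types={"contains"})
print(neighbors)
print(structure_only)
```

Restricting `edge_types` lets a caller walk only the containment hierarchy or only the dependency edges, which is the kind of typed traversal a heterogeneous graph makes possible.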

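[NLP-37] Teaching LLMs How to Learn with Contextual Fine-Tuning ICLR2025

[Quick read]: This paper asks how more effective fine-tuning can improve the domain knowledge and open-ended reasoning of large language models (LLMs) in rapidly evolving domains. The key is a new method called contextual fine-tuning, which uses instructional prompts designed to mimic human cognitive strategies to guide the model toward a better understanding of domain-specific knowledge during training, enabling rapid adaptation of LLMs. Experiments show that this simple yet effective modification improves how well LLMs fine-tune on new datasets in the medical and financial domains.

Link: https://arxiv.org/abs/2503.09032
Authors: Younwoo Choi,Muhammad Adil Asif,Ziwen Han,John Willes,Rahul G. Krishnan
Affiliations: University of Toronto; Vector Institute
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: ICLR 2025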

Abstract:Prompting Large Language Models (LLMs), or providing context on the expected model of operation, is an effective way to steer the outputs of such models to satisfy human desiderata after they have been trained. But in rapidly evolving domains, there is often a need to fine-tune LLMs to improve either the kind of knowledge in their memory or their abilities to perform open-ended reasoning in new domains. When humans learn new concepts, we often do so by linking the new material that we are studying to concepts we have already learned before. To that end, we ask, “can prompting help us teach LLMs how to learn”. In this work, we study a novel generalization of instruction tuning, called contextual fine-tuning, to fine-tune LLMs. Our method leverages instructional prompts designed to mimic human cognitive strategies in learning and problem-solving to guide the learning process during training, aiming to improve the model’s interpretation and understanding of domain-specific knowledge. We empirically demonstrate that this simple yet effective modification improves the ability of LLMs to be fine-tuned rapidly on new datasets both within the medical and financial domains.

[NLP-38] DAST: Difficulty-Aware Self-Training on Large Language Models

[Quick read]: This paper addresses the problem that existing self-training methods for large language models (LLMs) under-sample challenging queries, so the model learns difficult problems inadequately, which limits its ability. It proposes a difficulty-aware self-training (DAST) framework that improves both the quantity and quality of self-generated responses through three components: 1) sampling-based difficulty level estimation; 2) difficulty-aware data augmentation; 3) a self-training algorithm using SFT (Supervised Fine-Tuning) and DPO (Direct Preference Optimization) respectively. Experiments on mathematical tasks demonstrate the effectiveness and generalization of DAST and highlight the critical role of difficulty-aware strategies in improving LLM self-training.

Link: https://arxiv.org/abs/2503.09029
Authors: Boyang Xue,Qi Zhu,Hongru Wang,Rui Wang,Sheng Wang,Hongling Xu,Fei Mi,Yasheng Wang,Lifeng Shang,Qun Liu,Kam-Fai Wong
Affiliations: The Chinese University of Hong Kong; Huawei Noah’s Ark Lab; The University of Hong Kong; Harbin Institute of Technology, Shenzhen
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Present Large Language Models (LLM) self-training methods always under-sample on challenging queries, leading to inadequate learning on difficult problems which limits LLMs’ ability. Therefore, this work proposes a difficulty-aware self-training (DAST) framework that focuses on improving both the quantity and quality of self-generated responses on challenging queries during self-training. DAST is specified in three components: 1) sampling-based difficulty level estimation, 2) difficulty-aware data augmentation, and 3) the self-training algorithm using SFT and DPO respectively. Experiments on mathematical tasks demonstrate the effectiveness and generalization of DAST, highlighting the critical role of difficulty-aware strategies in advancing LLM self-training.
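
The first component, sampling-based difficulty estimation, can be sketched as follows. This is only an illustration under assumptions: it treats the fraction of correct sampled responses as the difficulty signal, and the thresholds and the name `estimate_difficulty` are hypothetical rather than taken from the paper.

```python
def estimate_difficulty(correct_flags, thresholds=(0.25, 0.75)):
    """Map the pass rate over sampled responses to a difficulty label.

    correct_flags: one boolean per sampled response to the same query.
    thresholds: hypothetical (low, high) cut-offs on the pass rate.
    """
    if not correct_flags:
        raise ValueError("need at least one sampled response")
    pass_rate = sum(correct_flags) / len(correct_flags)
    low, high = thresholds
    if pass_rate < low:
        return "hard", pass_rate
    if pass_rate < high:
        return "medium", pass_rate
    return "easy", pass_rate

# Eight sampled responses per query; True marks a correct final answer.
hard_query = estimate_difficulty([True] + [False] * 7)
medium_query = estimate_difficulty([True, True, False, False])
easy_query = estimate_difficulty([True] * 8)
print(hard_query, medium_query, easy_query)
```

Queries labeled hard by such an estimate are the ones a difficulty-aware pipeline would sample or augment more heavily.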

[NLP-39] Aligning to What? Limits to RLHF Based Alignment

[Quick read]: This paper investigates whether Reinforcement Learning from Human Feedback (RLHF) techniques can effectively reduce covert and overt biases in large language models (LLMs), particularly biases against African Americans. The key is an evaluation of how well different RLHF techniques (DPO, ORPO, and RLOO) mitigate model bias, together with the finding that supervised fine-tuning (SFT) before RLHF can have a negative effect by calcifying the model's existing biases. The study also extends bias-measurement tools to multi-modal models and stresses the need for more capable datasets, data curation techniques, and alignment tools.

Link: https://arxiv.org/abs/2503.09025
Authors: Logan Barnhart,Reza Akbarian Bafghi,Stephen Becker,Maziar Raissi
Affiliations: Department of Applied Mathematics - University of Colorado at Boulder; Department of Computer Science - University of Colorado at Boulder; Department of Mathematics - University of California Riverside
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Reinforcement Learning from Human Feedback (RLHF) is increasingly used to align large language models (LLMs) with human preferences. However, the effectiveness of RLHF in addressing underlying biases remains unclear. This study investigates the relationship between RLHF and both covert and overt biases in LLMs, particularly focusing on biases against African Americans. We applied various RLHF techniques (DPO, ORPO, and RLOO) to Llama 3 8B and evaluated the covert and overt biases of the resulting models using matched-guise probing and explicit bias testing. We performed additional tests with DPO on different base models and datasets; among several implications, we found that SFT before RLHF calcifies model biases. Additionally, we extend the tools for measuring biases to multi-modal models. Through our experiments we collect evidence that indicates that current alignment techniques are inadequate for nebulous tasks such as mitigating covert biases, highlighting the need for capable datasets, data curating techniques, or alignment tools.

[NLP-40] Word2winners at SemEval-2025 Task 7: Multilingual and Crosslingual Fact-Checked Claim Retrieval

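[Quick read]: This paper addresses SemEval 2025 Task 7, Previously Fact-Checked Claim Retrieval: retrieving fact-checks relevant to a given claim from the MultiClaim dataset, which comprises multilingual social media posts and fact-checks. The approach first evaluates state-of-the-art English and multilingual retrieval models in a zero-shot setting, then fine-tunes the most promising systems and uses machine translation to enhance crosslingual retrieval. The best model achieves 85% accuracy on crosslingual data and 92% on monolingual data.

Link: https://arxiv.org/abs/2503.09011
Authors: Amirmohammad Azadi,Sina Zamani,Mohammadmostafa Rostamkhani,Sauleh Eetemadi
Affiliations: Iran University of Science and Technology
Categories: Computation and Language (cs.CL)
Comments: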

Abstract:This paper describes our system for SemEval 2025 Task 7: Previously Fact-Checked Claim Retrieval. The task requires retrieving relevant fact-checks for a given input claim from the extensive, multilingual MultiClaim dataset, which comprises social media posts and fact-checks in several languages. To address this challenge, we first evaluated zero-shot performance using state-of-the-art English and multilingual retrieval models and then fine-tuned the most promising systems, leveraging machine translation to enhance crosslingual retrieval. Our best model achieved an accuracy of 85% on crosslingual data and 92% on monolingual data.

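[NLP-41] Leveraging Retrieval Augmented Generative LLMs For Automated Metadata Description Generation to Enhance Data Catalogs

[Quick read]: This paper addresses the limited searchability of enterprise data catalogs caused by inadequate metadata such as asset descriptions. The key is a prompt-enrichment method that combines a retrieval-based few-shot technique with generative large language models (LLMs), leveraging existing metadata content to enrich and curate metadata in a scalable way. The study also considers fine-tuning an LLM and compares few-shot pretrained LLMs (Llama, GPT3.5) with a few-shot fine-tuned LLM (Llama2-7b) on accuracy, factual grounding, and toxicity. Preliminary results show more than 80% Rouge-1 F1 for the generated content, with about 87%-88% of instances accepted as is or with minor edits by data stewards. By automatically generating accurate descriptions for tables and columns, the research aims to give enterprises an overall framework to scale metadata curation and improve the usability of their data catalogs.

Link: https://arxiv.org/abs/2503.09003
Authors: Mayank Singh,Abhijeet Kumar,Sasidhar Donaparthi,Gayatri Karambelkar
Affiliations: unknown
Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL)
Comments: Presented at the 5th International Conference on NLP Text Mining (NLTM 2025)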

Abstract:Data catalogs serve as repositories for organizing and accessing a diverse collection of data assets, but their effectiveness hinges on the ease with which business users can look up relevant content. Unfortunately, many data catalogs within organizations suffer from limited searchability due to inadequate metadata like asset descriptions. Hence, there is a need for a content generation solution to enrich and curate metadata in a scalable way. This paper explores the challenges associated with metadata creation and proposes a unique prompt enrichment idea of leveraging existing metadata content using a retrieval-based few-shot technique tied with generative large language models (LLM). The literature also considers finetuning an LLM on existing content and studies the behavior of few-shot pretrained LLM (Llama, GPT3.5) vis-à-vis few-shot finetuned LLM (Llama2-7b) by evaluating their performance based on accuracy, factual grounding, and toxicity. Our preliminary results exhibit more than 80% Rouge-1 F1 for the generated content. This implied 87%-88% of instances accepted as is or curated with minor edits by data stewards. By automatically generating descriptions for tables and columns in the most accurate way, the research attempts to provide an overall framework for enterprises to effectively scale metadata curation and enrich its data catalog thereby vastly improving the data catalog searchability and overall usability.
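
For readers unfamiliar with the Rouge-1 F1 figure quoted above, the following sketch computes unigram-overlap F1 between a generated description and a reference. It assumes lowercased whitespace tokenization with no stemming, so scores from established ROUGE packages may differ; the function name `rouge1_f1` and the example strings are hypothetical.

```python
from collections import Counter

def rouge1_f1(candidate, reference):
    """Unigram-overlap F1 between a candidate and a reference string."""
    cand_counts = Counter(candidate.lower().split())
    ref_counts = Counter(reference.lower().split())
    # Counter intersection keeps the minimum count of each shared token.
    overlap = sum((cand_counts & ref_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)

generated = "customer order table"
reference = "table of customer orders"
score = rouge1_f1(generated, reference)
print(round(score, 4))
```

Here two of the three generated tokens appear in the four-token reference, so precision is 2/3, recall is 1/2, and the F1 is 4/7.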

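[NLP-42] JBFuzz: Jailbreaking LLMs Efficiently and Effectively Using Fuzzing

[Quick read]: This paper addresses the risk that widely available large language models (LLMs) can be made to generate harmful, unethical, or offensive content through jailbreak attacks. Although developers work to align LLMs with safety goals using human feedback, the models remain susceptible to such attacks, and existing red-teaming approaches lack effectiveness, scalability, or both. The paper proposes JBFuzz, an effective, automated, and scalable red-teaming technique for detecting and studying jailbreak vulnerabilities in LLMs.

JBFuzz overcomes three main challenges by devising novel seed prompts, a lightweight mutation engine, and an efficient and accurate evaluator that guides the fuzzing process. Together these yield a potent fuzzer that needs only black-box access to the target LLM. In experiments on nine popular LLMs, JBFuzz jailbreaks all of them for various harmful or unethical questions, with an average attack success rate of 99% and an average of 60 seconds per attack. The work highlights that state-of-the-art LLMs remain susceptible to jailbreak attacks even after safety alignment, and it offers LLM developers a useful red-teaming tool.

Link: https://arxiv.org/abs/2503.08990
Authors: Vasudev Gohil
Affiliations: unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: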

Abstract:Large language models (LLMs) have shown great promise as language understanding and decision making tools, and they have permeated various aspects of our everyday life. However, their widespread availability also comes with novel risks, such as generating harmful, unethical, or offensive content, via an attack called jailbreaking. Despite extensive efforts from LLM developers to align LLMs using human feedback, they are still susceptible to jailbreak attacks. To tackle this issue, researchers often employ red-teaming to understand and investigate jailbreak prompts. However, existing red-teaming approaches lack effectiveness, scalability, or both. To address these issues, we propose JBFuzz, a novel effective, automated, and scalable red-teaming technique for jailbreaking LLMs. JBFuzz is inspired by the success of fuzzing for detecting bugs/vulnerabilities in software. We overcome three challenges related to effectiveness and scalability by devising novel seed prompts, a lightweight mutation engine, and a lightweight and accurate evaluator for guiding the fuzzer. Assimilating all three solutions results in a potent fuzzer that only requires black-box access to the target LLM. We perform extensive experimental evaluation of JBFuzz using nine popular and widely-used LLMs. We find that JBFuzz successfully jailbreaks all LLMs for various harmful/unethical questions, with an average attack success rate of 99%. We also find that JBFuzz is extremely efficient as it jailbreaks a given LLM for a given question in 60 seconds on average. Our work highlights the susceptibility of the state-of-the-art LLMs to jailbreak attacks even after safety alignment, and serves as a valuable red-teaming tool for LLM developers. 

[NLP-43] I Predict Therefore I Am: Is Next Token Prediction Enough to Learn Human-Interpretable Concepts from Data?

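[Quick read]: This paper asks whether large language models (LLMs) truly capture the latent generative factors in data rather than achieving their capabilities through simple data manipulation. The key is a novel generative model that generates tokens from human-interpretable concepts represented as latent discrete variables, together with a theoretical proof that, under certain conditions, the representations LLMs learn through next-token prediction can be approximately modeled as the logarithm of the posterior probabilities of these latent discrete concepts, up to an invertible linear transformation. This finding provides evidence that LLMs capture underlying generative factors and supports the linear representation hypothesis, which holds that LLMs learn linear representations of human-interpretable concepts. The paper validates the result on simulation data and on the Pythia, Llama, and DeepSeek model families.

Link: https://arxiv.org/abs/2503.08980
Authors: Yuhang Liu,Dong Gong,Erdun Gao,Zhen Zhang,Biwei Huang,Mingming Gong,Anton van den Hengel,Javen Qinfeng Shi
Affiliations: unknown
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: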

Abstract:The remarkable achievements of large language models (LLMs) have led many to conclude that they exhibit a form of intelligence. This is as opposed to explanations of their capabilities based on their ability to perform relatively simple manipulations of vast volumes of data. To illuminate the distinction between these explanations, we introduce a novel generative model that generates tokens on the basis of human interpretable concepts represented as latent discrete variables. Under mild conditions, even when the mapping from the latent space to the observed space is non-invertible, we establish an identifiability result: the representations learned by LLMs through next-token prediction can be approximately modeled as the logarithm of the posterior probabilities of these latent discrete concepts, up to an invertible linear transformation. This theoretical finding not only provides evidence that LLMs capture underlying generative factors, but also strongly reinforces the linear representation hypothesis, which posits that LLMs learn linear representations of human-interpretable concepts. Empirically, we validate our theoretical results through evaluations on both simulation data and the Pythia, Llama, and DeepSeek model families.

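[NLP-44] Agentic AI for Scientific Discovery: A Survey of Progress, Challenges, and Future Directions

[Quick read]: This paper surveys the integration of Agentic AI into scientific research and its effect on research automation. The core question is how AI systems capable of reasoning, planning, and autonomous decision-making can transform each stage of scientific discovery, including literature review, hypothesis generation, experiment design, and result analysis, while addressing the accompanying challenges such as literature review automation, system reliability, and ethical concerns. The survey categorizes existing systems and tools, analyzes recent progress across fields such as chemistry, biology, and materials science, and outlines future research directions centered on human-AI collaboration and improved system calibration.

Link: https://arxiv.org/abs/2503.08979
Authors: Mourad Gridach,Jay Nanavati,Khaldoun Zine El Abidine,Lenon Mendes,Christina Mack
Affiliations: IQVIA
Categories: Computation and Language (cs.CL)
Comments: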

Abstract:The integration of Agentic AI into scientific discovery marks a new frontier in research automation. These AI systems, capable of reasoning, planning, and autonomous decision-making, are transforming how scientists perform literature review, generate hypotheses, conduct experiments, and analyze results. This survey provides a comprehensive overview of Agentic AI for scientific discovery, categorizing existing systems and tools, and highlighting recent progress across fields such as chemistry, biology, and materials science. We discuss key evaluation metrics, implementation frameworks, and commonly used datasets to offer a detailed understanding of the current state of the field. Finally, we address critical challenges, such as literature review automation, system reliability, and ethical concerns, while outlining future research directions that emphasize human-AI collaboration and enhanced system calibration.

[NLP-45] Gradient-guided Attention Map Editing: Towards Efficient Contextual Hallucination Mitigation NAACL2025

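[Quick read]: This paper addresses contextual hallucination, common in tasks such as summarization and open-book question answering, where large language models (LLMs) produce irrelevant or incorrect responses despite having access to accurate source information. It typically arises because the model favors self-generated content over the input context and so overlooks important details. The paper proposes Guided Attention Map Editing (GAME), which dynamically adjusts attention maps: a trained classifier identifies attention maps prone to inducing hallucinations, and gradient-informed "edit directions" drive targeted interventions that redistribute attention weights across heads. Experiments show that GAME reduces hallucinations across a variety of open-source models, cutting hallucinations by 10% on the XSum summarization task while delivering a 7x speed-up in computational efficiency over state-of-the-art baselines.

Link: https://arxiv.org/abs/2503.08963
Authors: Yu Wang,Jiaxin Zhang,Xiang Gao,Wendi Cui,Peng Li,Kamalika Das
Affiliations: University of California, Santa Barbara; Intuit AI Research, Mountain View, CA
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Accepted to Findings of NAACL 2025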

Abstract:In tasks like summarization and open-book question answering (QA), Large Language Models (LLMs) often encounter “contextual hallucination”, where they produce irrelevant or incorrect responses despite having access to accurate source information. This typically occurs because these models tend to prioritize self-generated content over the input context, causing them to disregard pertinent details. To address this challenge, we introduce a novel method called “Guided Attention Map Editing” (GAME), which dynamically adjusts attention maps to improve contextual relevance. During inference, GAME employs a trained classifier to identify attention maps prone to inducing hallucinations and executes targeted interventions. These interventions, guided by gradient-informed “edit directions”, strategically redistribute attention weights across various heads to effectively reduce hallucination. Comprehensive evaluations on challenging summarization and open-book QA tasks show that GAME consistently reduces hallucinations across a variety of open-source models. Specifically, GAME reduces hallucinations by 10% in the XSum summarization task while achieving a 7X speed-up in computational efficiency compared to the state-of-the-art baselines.

[NLP-46] Backtracking for Safety

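[Quick read]: This paper addresses the limitations of existing safety alignment methods for large language models (LLMs) in handling nuanced safety violations such as toxicity. Current methods, such as supervised fine-tuning and reinforcement learning-based approaches, are vulnerable to adversarial attacks and focus mainly on preventing harmful content in the initial tokens of the output, so they handle subtle violations that appear later in generation poorly. Resetting methods can recover safety to some extent by discarding previous tokens and restarting generation, but they are inefficient and cannot correct the specific problematic segment.

To address these challenges, the paper proposes a new backtracking method. Its key idea is to let the model revert to a safer generation state when a safety violation is detected, not necessarily to the beginning of generation. This corrects the problematic segment while preserving most of the generated text, which maintains efficiency. Experiments show that the method substantially reduces toxicity arising during generation with minimal impact on efficiency.

Link: https://arxiv.org/abs/2503.08919
Authors: Bilgehan Sel,Dingcheng Li,Phillip Wallis,Vaishakh Keshava,Ming Jin,Siddhartha Reddy Jonnalagadda
Affiliations: unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: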

Abstract:Large language models (LLMs) have demonstrated remarkable capabilities across various tasks, but ensuring their safety and alignment with human values remains crucial. Current safety alignment methods, such as supervised fine-tuning and reinforcement learning-based approaches, can exhibit vulnerabilities to adversarial attacks and often result in shallow safety alignment, primarily focusing on preventing harmful content in the initial tokens of the generated output. While methods like resetting can help recover from unsafe generations by discarding previous tokens and restarting the generation process, they are not well-suited for addressing nuanced safety violations like toxicity that may arise within otherwise benign and lengthy generations. In this paper, we propose a novel backtracking method designed to address these limitations. Our method allows the model to revert to a safer generation state, not necessarily at the beginning, when safety violations occur during generation. This approach enables targeted correction of problematic segments without discarding the entire generated text, thereby preserving efficiency. We demonstrate that our method dramatically reduces toxicity appearing through the generation process with minimal impact to efficiency.
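
The difference between resetting and backtracking can be shown with a toy decoding loop. This is a sketch under strong assumptions: the paper's model decides where to revert, whereas here a fixed `window` of recent tokens is dropped, and `next_token`, `is_unsafe`, and the scripted tokens are hypothetical stand-ins for a real generator and safety check.

```python
def generate_with_backtracking(next_token, is_unsafe, max_tokens, window=2, max_retries=5):
    """Generate tokens; on a flagged violation, drop only the last `window` tokens.

    next_token(tokens, step) -> str and is_unsafe(tokens) -> bool are
    caller-supplied stand-ins for a real model and a real safety classifier.
    """
    if window < 1:
        raise ValueError("window must be at least 1")
    tokens, retries, step = [], 0, 0
    while len(tokens) < max_tokens:
        tokens.append(next_token(tokens, step))
        step += 1
        if is_unsafe(tokens):
            del tokens[-window:]  # revert to an earlier prefix, not to the start
            retries += 1
            if retries > max_retries:
                break  # give up and return the reverted prefix
    return tokens

# Deterministic toy "model": emits a scripted token per step.
SCRIPT = ["the", "movie", "was", "awful", "garbage", "quite", "dull", "overall"]
result = generate_with_backtracking(
    next_token=lambda tokens, step: SCRIPT[step],
    is_unsafe=lambda tokens: "garbage" in tokens,
    max_tokens=6,
)
print(result)
```

Only the two most recent tokens are discarded when the check fires, so the earlier benign prefix is kept instead of regenerating from scratch.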

[NLP-47] Interpreting the Repeated Token Phenomenon in Large Language Models

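[Quick read]: This paper addresses the tendency of large language models (LLMs) to output unrelated text when asked to repeat a single word, an unexplained failure mode that constitutes a vulnerability because it lets end users steer a model away from its intended behavior. The key is to explain the cause and link it to "attention sinks", an emergent LLM behavior crucial for fluency in which the initial token receives disproportionately high attention scores. The study identifies the neural circuit responsible for attention sinks and shows how long repetitions disrupt it; other non-repeating sequences show similar circuit disruptions. The paper proposes a targeted patch that resolves the issue without harming overall model performance. The study gives a mechanistic explanation of an LLM vulnerability, shows how interpretability can diagnose and fix issues, and points toward more secure and reliable models.

Link: https://arxiv.org/abs/2503.08908
Authors: Itay Yona,Ilia Shumailov,Jamie Hayes,Federico Barbero,Yossi Gandelsman
Affiliations: Google DeepMind; University of Oxford; UC Berkeley
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Cryptography and Security (cs.CR)
Comments: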

Abstract:Large Language Models (LLMs), despite their impressive capabilities, often fail to accurately repeat a single word when prompted to, and instead output unrelated text. This unexplained failure mode represents a vulnerability, allowing even end-users to divert models from their intended behavior. We aim to explain the causes for this phenomenon and link it to the concept of “attention sinks”, an emergent LLM behavior crucial for fluency, in which the initial token receives disproportionately high attention scores. Our investigation identifies the neural circuit responsible for attention sinks and shows how long repetitions disrupt this circuit. We extend this finding to other non-repeating sequences that exhibit similar circuit disruptions. To address this, we propose a targeted patch that effectively resolves the issue without negatively impacting the model’s overall performance. This study provides a mechanistic explanation for an LLM vulnerability, demonstrating how interpretability can diagnose and address issues, and offering insights that pave the way for more secure and reliable models.

[NLP-48] Prompt-OT: An Optimal Transport Regularization Paradigm for Knowledge Preservation in Vision-Language Model Adaptation

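[Quick read]: This paper addresses overfitting and degraded zero-shot generalization when vision-language models (VLMs) are adapted to downstream tasks. Existing methods adapt the model through prompt learning while trying to preserve pre-trained knowledge, but they still tend to forget that knowledge and weaken generalization. The paper proposes an optimal transport (OT)-guided prompt learning framework whose key idea is to mitigate forgetting by preserving the structural consistency of feature distributions between the pre-trained and fine-tuned models. Unlike conventional point-wise constraints, OT naturally captures cross-instance relationships and expands the feasible parameter space for prompt tuning, giving a better trade-off between adaptation and generalization. The method applies joint constraints to both vision and text representations to ensure holistic feature alignment.

Link: https://arxiv.org/abs/2503.08906
Authors: Xiwen Chen,Wenhui Zhu,Peijie Qiu,Hao Wang,Huayu Li,Haiyu Wu,Aristeidis Sotiras,Yalin Wang,Abolfazl Razi
Affiliations: Clemson University; Arizona State University; Washington University in St. Louis; University of Arizona; University of Notre Dame
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multimedia (cs.MM)
Comments: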

Abstract:Vision-language models (VLMs) such as CLIP demonstrate strong performance but struggle when adapted to downstream tasks. Prompt learning has emerged as an efficient and effective strategy to adapt VLMs while preserving their pre-trained knowledge. However, existing methods still lead to overfitting and degrade zero-shot generalization. To address this challenge, we propose an optimal transport (OT)-guided prompt learning framework that mitigates forgetting by preserving the structural consistency of feature distributions between pre-trained and fine-tuned models. Unlike conventional point-wise constraints, OT naturally captures cross-instance relationships and expands the feasible parameter space for prompt tuning, allowing a better trade-off between adaptation and generalization. Our approach enforces joint constraints on both vision and text representations, ensuring a holistic feature alignment. Extensive experiments on benchmark datasets demonstrate that our simple yet effective method can outperform existing prompt learning strategies in base-to-novel generalization, cross-dataset evaluation, and domain generalization without additional augmentation or ensemble techniques. The code is available at this https URL
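
As a rough illustration of an OT-based distance between two feature sets, the sketch below runs Sinkhorn iterations for entropic-regularized optimal transport in NumPy. It is an assumption-laden stand-in, not the paper's regularizer: uniform weights, squared Euclidean cost, and the values of `reg` and `n_iters` are all hypothetical choices, and the toy features are invented.

```python
import numpy as np

def sinkhorn_cost(x, y, reg=0.1, n_iters=200):
    """Entropic-regularized OT between two point sets with uniform weights.

    Returns (transport_cost, plan). Suitable for small toy inputs only;
    large cost/reg ratios would need a log-domain implementation.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    cost = ((x[:, None, :] - y[None, :, :]) ** 2).sum(axis=-1)
    a = np.full(len(x), 1.0 / len(x))  # source weights
    b = np.full(len(y), 1.0 / len(y))  # target weights
    kernel = np.exp(-cost / reg)
    u = np.ones_like(a)
    for _ in range(n_iters):
        v = b / (kernel.T @ u)  # scale columns toward b
        u = a / (kernel @ v)    # scale rows toward a
    plan = u[:, None] * kernel * v[None, :]
    return float((plan * cost).sum()), plan

pretrained = np.array([[0.0, 0.0], [1.0, 0.0]])
unchanged = pretrained.copy()
drifted = pretrained + np.array([1.0, 0.0])
cost_same, plan_same = sinkhorn_cost(pretrained, unchanged)
cost_drift, _ = sinkhorn_cost(pretrained, drifted)
print(cost_same, cost_drift)
```

Because the transport plan couples every source feature with every target feature, the resulting cost reflects the relationship between whole distributions rather than a per-sample difference.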

[NLP-49] EvalTree: Profiling Language Model Weaknesses via Hierarchical Capability Trees

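[Quick read]: This paper addresses how language model (LM) evaluation can identify a model's weaknesses and provide actionable guidance for improvement. It formulates the task of generating a "weakness profile": a set of weaknesses, expressed in natural language, derived from the model's performance on every instance of a benchmark. The key contribution is EvalTree, which builds a capability tree in which each node represents a capability described in natural language and is linked to the subset of benchmark instances that specifically evaluate that capability. EvalTree then extracts the nodes where the model performs poorly to generate the weakness profile. On the MATH and WildChat benchmarks, EvalTree identifies weaknesses more precisely and comprehensively than baseline methods; training data collection guided by its weaknesses improves model performance more than other collection strategies; and it exposes flaws in Chatbot Arena's human-voter-based evaluation.

Link: https://arxiv.org/abs/2503.08893
Authors: Zhiyuan Zeng, Yizhong Wang, Hannaneh Hajishirzi, Pang Wei Koh
Affiliations: Paul G. Allen School of Computer Science & Engineering, University of Washington; Allen Institute for Artificial Intelligence
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: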

Abstract:An ideal model evaluation should achieve two goals: identifying where the model fails and providing actionable improvement guidance. Toward these goals for Language Model (LM) evaluations, we formulate the problem of generating a weakness profile, a set of weaknesses expressed in natural language, given an LM’s performance on every individual instance in a benchmark. We introduce a suite of quantitative assessments to compare different weakness profiling methods. We also propose a weakness profiling method EvalTree. It constructs a capability tree where each node represents a capability described in natural language and is linked to a subset of benchmark instances that specifically evaluate this capability; it then extracts nodes where the LM performs poorly to generate a weakness profile. On the MATH and WildChat benchmarks, we show that EvalTree outperforms baseline weakness profiling methods by identifying weaknesses more precisely and comprehensively. Weakness profiling further enables weakness-guided data collection, and training data collection guided by EvalTree-identified weaknesses improves LM performance more than other data collection strategies. We also show how EvalTree exposes flaws in Chatbot Arena’s human-voter-based evaluation practice. To facilitate future work, we release our code and an interface that allows practitioners to interactively explore the capability trees built by EvalTree.

[NLP-50] PlainQAFact: Automatic Factuality Evaluation Metric for Biomedical Plain Language Summaries Generation

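[Quick read]: This paper addresses the factuality risk posed by hallucinated language model outputs in the medical domain, particularly for lay readers making health-related decisions. Existing entailment-based and question-answering-based (QA) factuality evaluation methods struggle with plain language summary (PLS) generation, because a PLS often contains elaborative explanations that introduce external content absent from the source document (definitions, background, examples) to aid comprehension. The paper proposes PlainQAFact, a framework trained on PlainFact, a fine-grained human-annotated dataset. PlainQAFact first classifies the factuality type and then scores factuality with a retrieval-augmented QA method, which is lightweight and computationally efficient. Experiments show that existing factuality metrics fail to evaluate PLS effectively, especially for elaborative explanations, whereas PlainQAFact achieves state-of-the-art performance. The paper further analyzes its effectiveness across external knowledge sources, answer extraction strategies, overlap measures, and document granularity levels.

Link: https://arxiv.org/abs/2503.08890
Authors: Zhiwen You, Yue Guo
Affiliations: University of Illinois Urbana-Champaign
Categories: Computation and Language (cs.CL)
Comments: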

Abstract:Hallucinated outputs from language models pose risks in the medical domain, especially for lay audiences making health-related decisions. Existing factuality evaluation methods, such as entailment- and question-answering-based (QA), struggle with plain language summary (PLS) generation due to elaborative explanation phenomenon, which introduces external content (e.g., definitions, background, examples) absent from the source document to enhance comprehension. To address this, we introduce PlainQAFact, a framework trained on a fine-grained, human-annotated dataset PlainFact, to evaluate the factuality of both source-simplified and elaboratively explained sentences. PlainQAFact first classifies factuality type and then assesses factuality using a retrieval-augmented QA-based scoring method. Our approach is lightweight and computationally efficient. Empirical results show that existing factuality metrics fail to effectively evaluate factuality in PLS, especially for elaborative explanations, whereas PlainQAFact achieves state-of-the-art performance. We further analyze its effectiveness across external knowledge sources, answer extraction strategies, overlap measures, and document granularity levels, refining its overall factuality assessment.

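[NLP-51] Seeing What's Not There: Spurious Correlation in Multimodal LLMs

[Quick read]: This paper examines the extent to which Multimodal Large Language Models (MLLMs) rely on spurious correlations despite being trained with language supervision, a question that has remained unclear. Its key contribution is SpurLens, an automated pipeline that uses GPT-4 and open-set object detectors to identify spurious visual cues in MLLMs without human annotation. With it, the study reveals two main failure modes caused by spurious correlations: over-reliance on spurious cues for object recognition, and object hallucination amplified by spurious cues. It also explores mitigation strategies such as prompt ensembling and reasoning-based prompting, and runs ablation studies on the root causes of spurious bias in MLLMs. The study calls for more rigorous evaluation methods and mitigation measures to improve MLLM reliability.

Link: https://arxiv.org/abs/2503.08884
Authors: Parsa Hosseini, Sumit Nawathe, Mazda Moayeri, Sriram Balasubramanian, Soheil Feizi
Affiliations: University of Maryland
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: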

Abstract:Unimodal vision models are known to rely on spurious correlations, but it remains unclear to what extent Multimodal Large Language Models (MLLMs) exhibit similar biases despite language supervision. In this paper, we investigate spurious bias in MLLMs and introduce SpurLens, a pipeline that leverages GPT-4 and open-set object detectors to automatically identify spurious visual cues without human supervision. Our findings reveal that spurious correlations cause two major failure modes in MLLMs: (1) over-reliance on spurious cues for object recognition, where removing these cues reduces accuracy, and (2) object hallucination, where spurious cues amplify the hallucination by over 10x. We validate our findings in various MLLMs and datasets. Beyond diagnosing these failures, we explore potential mitigation strategies, such as prompt ensembling and reasoning-based prompting, and conduct ablation studies to examine the root causes of spurious bias in MLLMs. By exposing the persistence of spurious correlations, our study calls for more rigorous evaluation methods and mitigation strategies to enhance the reliability of MLLMs.

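[NLP-52] LLMs Know What to Drop: Self-Attention Guided KV Cache Eviction for Efficient Long-Context Inference

[Quick read]: This paper addresses efficient long-context inference for Large Language Models (LLMs) with context windows of 128K to 1M tokens, where the rapidly growing key-value (KV) cache and the high computational complexity of attention create bottlenecks in memory usage and latency. Its key finding is that attention in long-context tasks is sparse, and that after the pre-filling stage LLMs implicitly identify, at the head level, which tokens can be dropped or evicted. Based on this insight, the paper proposes Self-Attention Guided Eviction (SAGE-KV), a simple and effective KV cache eviction method: after prefilling, it performs a one-time top-k selection at both the token and head levels to compress the KV cache, and inference then proceeds with the reduced cache. On LongBench with three long-context LLMs, SAGE-KV maintains accuracy comparable to full attention while achieving 4x higher memory efficiency than the static KV cache selection method StreamLLM and 2x higher memory efficiency than the dynamic KV cache selection method Quest.

Link: https://arxiv.org/abs/2503.08879
Authors: Guangtao Wang, Shubhangi Upasani, Chen Wu, Darshan Gandhi, Jonathan Li, Changran Hu, Bo Li, Urmish Thakker
Affiliations: SambaNova Systems, Inc
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: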

Abstract:Efficient long-context inference is critical as large language models (LLMs) adopt context windows ranging from 128K to 1M tokens. However, the growing key-value (KV) cache and the high computational complexity of attention create significant bottlenecks in memory usage and latency. In this paper, we find that attention in diverse long-context tasks exhibits sparsity, and LLMs implicitly “know” which tokens can be dropped or evicted at the head level after the pre-filling stage. Based on this insight, we propose Self-Attention Guided Eviction (SAGE-KV), a simple and effective KV cache eviction method for long-context inference. After prefilling, our method performs a one-time top-k selection at both the token and head levels to compress the KV cache, enabling efficient inference with the reduced cache. Evaluations on LongBench and three long-context LLMs (Llama3.1-8B-Instruct-128k, Llama3-8B-Prolong-512k-Instruct, and Qwen2.5-7B-Instruct-128k) show that SAGE-KV maintains accuracy comparable to full attention while significantly improving efficiency. Specifically, SAGE-KV achieves 4x higher memory efficiency with improved accuracy over the static KV cache selection method StreamLLM, and 2x higher memory efficiency with better accuracy than the dynamic KV cache selection method Quest.
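
The one-time top-k selection described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the cache layout, the function names `topk_keep_indices` and `evict_kv_cache`, and the idea that each head already has one importance score per cached token are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple

def topk_keep_indices(scores: List[float], k: int) -> List[int]:
    """Indices of the k highest-scoring cached tokens, returned in position order."""
    if k >= len(scores):
        return list(range(len(scores)))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])

def evict_kv_cache(
    cache: Dict[int, List[Tuple[str, str]]],
    head_scores: Dict[int, List[float]],
    k: int,
) -> Dict[int, List[Tuple[str, str]]]:
    """Keep only the top-k (key, value) entries for each head independently."""
    compressed = {}
    for head, entries in cache.items():
        keep = topk_keep_indices(head_scores[head], k)
        compressed[head] = [entries[i] for i in keep]
    return compressed

# Hypothetical two-head cache with five cached tokens per head.
tokens = [("k0", "v0"), ("k1", "v1"), ("k2", "v2"), ("k3", "v3"), ("k4", "v4")]
cache = {0: list(tokens), 1: list(tokens)}
# Assumed per-head importance scores (for example, attention weights after prefill).
head_scores = {
    0: [0.05, 0.40, 0.10, 0.30, 0.15],
    1: [0.50, 0.05, 0.25, 0.05, 0.15],
}
compressed = evict_kv_cache(cache, head_scores, k=3)
print(compressed)
```

Each head keeps a different subset of tokens here, which is the point of selecting at the head level rather than applying one global mask.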

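[NLP-53] Interpretable and Robust Dialogue State Tracking via Natural Language Summarization with LLMs

[Quick read]: This paper addresses the limitations of conventional Dialogue State Tracking (DST) in open-domain dialogues and with noisy inputs. Conventional DST relies on slot-value representations, which handle complex or ambiguous dialogue states poorly. The paper proposes Natural Language DST (NL-DST), a framework based on Large Language Models (LLMs) whose key idea is to have the LLM generate human-readable natural language descriptions of the dialogue state instead of structured slot values. In experiments on MultiWOZ 2.1 and Taskmaster-1, NL-DST significantly outperforms rule-based baselines, BERT-based baselines, and generative slot-filling GPT-2 models in Joint Goal Accuracy and Slot Accuracy, and is more robust and interpretable.

Link: https://arxiv.org/abs/2503.08857
Authors: Rafael Carranza, Mateo Alejandro Rojas
Affiliations: Technological University of Peru
Categories: Computation and Language (cs.CL)
Comments: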

Abstract:This paper introduces a novel approach to Dialogue State Tracking (DST) that leverages Large Language Models (LLMs) to generate natural language descriptions of dialogue states, moving beyond traditional slot-value representations. Conventional DST methods struggle with open-domain dialogues and noisy inputs. Motivated by the generative capabilities of LLMs, our Natural Language DST (NL-DST) framework trains an LLM to directly synthesize human-readable state descriptions. We demonstrate through extensive experiments on MultiWOZ 2.1 and Taskmaster-1 datasets that NL-DST significantly outperforms rule-based and discriminative BERT-based DST baselines, as well as generative slot-filling GPT-2 DST models, in both Joint Goal Accuracy and Slot Accuracy. Ablation studies and human evaluations further validate the effectiveness of natural language state generation, highlighting its robustness to noise and enhanced interpretability. Our findings suggest that NL-DST offers a more flexible, accurate, and human-understandable approach to dialogue state tracking, paving the way for more robust and adaptable task-oriented dialogue systems.
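
For readers unfamiliar with the two metrics named in the abstract, the sketch below shows how Joint Goal Accuracy and Slot Accuracy are commonly computed for slot-value states. The slot names and dialogue turns are made up, and how the paper scores natural language state descriptions against these metrics is not specified here.

```python
from typing import Dict, List, Optional

State = Dict[str, Optional[str]]

def joint_goal_accuracy(predicted: List[State], gold: List[State]) -> float:
    """Fraction of turns whose entire predicted state equals the gold state."""
    correct = sum(1 for p, g in zip(predicted, gold) if p == g)
    return correct / len(gold)

def slot_accuracy(predicted: List[State], gold: List[State], slots: List[str]) -> float:
    """Fraction of (turn, slot) pairs with a matching value; an unset slot counts as None."""
    total = 0
    correct = 0
    for p, g in zip(predicted, gold):
        for slot in slots:
            total += 1
            if p.get(slot) == g.get(slot):
                correct += 1
    return correct / total

# Hypothetical ontology and a two-turn dialogue.
slots = ["hotel-area", "hotel-stars", "restaurant-food"]
gold = [
    {"hotel-area": "north"},
    {"hotel-area": "north", "hotel-stars": "4"},
]
predicted = [
    {"hotel-area": "north"},
    {"hotel-area": "north", "hotel-stars": "3"},  # one wrong slot in turn 2
]
jga = joint_goal_accuracy(predicted, gold)
sa = slot_accuracy(predicted, gold, slots)
print(jga, sa)
```

A single wrong slot makes the whole turn count as incorrect for Joint Goal Accuracy, which is why it is the stricter of the two metrics.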

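[NLP-54] Contrastive Speaker-Aware Learning for Multi-party Dialogue Generation with LLMs

[Quick read]: This paper addresses the complexity of multi-party dialogue generation, in particular the difficulty traditional methods have in capturing multi-speaker interaction and interwoven conversational threads when they rely on manually annotated dialogue relations. It proposes Speaker-Attentive LLM (SA-LLM), a generative model that combines pre-trained Large Language Models (LLMs) with a speaker-aware contrastive learning strategy. By introducing a speaker-attributed input encoding and a contrastive learning objective, SA-LLM implicitly learns contextual coherence and speaker roles without explicit relation annotations.

Link: https://arxiv.org/abs/2503.08842
Authors: Tianyu Sun, Kun Qian, Wenhong Wang
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: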

Abstract:Multi-party dialogue generation presents significant challenges due to the complex interplay of multiple speakers and interwoven conversational threads. Traditional approaches often fall short in capturing these complexities, particularly when relying on manually annotated dialogue relations. This paper introduces Speaker-Attentive LLM (SA-LLM), a novel generative model that leverages pre-trained Large Language Models (LLMs) and a speaker-aware contrastive learning strategy to address these challenges. SA-LLM incorporates a speaker-attributed input encoding and a contrastive learning objective to implicitly learn contextual coherence and speaker roles without explicit relation annotations. Extensive experiments on the Ubuntu IRC and Movie Dialogues datasets demonstrate that SA-LLM significantly outperforms state-of-the-art baselines in automatic and human evaluations, achieving superior performance in fluency, coherence, informativeness, and response diversity. Ablation studies and detailed error analyses further validate the effectiveness of the proposed speaker-attentive training approach, highlighting its robustness across different speaker roles and context lengths. The results underscore the potential of SA-LLM as a powerful and annotation-free solution for high-quality multi-party dialogue generation.

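[NLP-55] evoBPE: Evolutionary Protein Sequence Tokenization

[Quick read]: This paper addresses the shortcomings of existing subword tokenization techniques in representing the complex structural and functional properties of protein sequences. These techniques were designed mainly for natural language processing and do not capture the evolutionary dynamics of protein sequences. The paper proposes evoBPE, a tokenization method whose key idea is to integrate evolutionary mutation patterns into sequence segmentation, using established substitution matrices to go beyond purely frequency-based tokenization. Specifically, evoBPE generates candidate token pairs through biologically informed mutations and evaluates them by pairwise alignment scores and frequency thresholds. Experiments show that evoBPE performs better across multiple dimensions: in domain conservation analysis it outperforms standard Byte-Pair Encoding, particularly as vocabulary size grows, and mutation-based token replacements preserve biological sequence properties better than arbitrary substitutions. By introducing a mutation-aware tokenizer, the work improves protein sequence representation, bridges computational linguistics and molecular biology, and opens new possibilities for machine learning in protein function prediction, structural modeling, and evolutionary analysis.

Link: https://arxiv.org/abs/2503.08838
Authors: Burak Suyunu, Özdeniz Dolu, Arzucan Özgür
Affiliations: Boğaziçi University
Categories: Computation and Language (cs.CL); Quantitative Methods (q-bio.QM)
Comments: 13 pages, 8 figures, 1 table, 1 algorithm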

Abstract:Recent advancements in computational biology have drawn compelling parallels between protein sequences and linguistic structures, highlighting the need for sophisticated tokenization methods that capture the intricate evolutionary dynamics of protein sequences. Current subword tokenization techniques, primarily developed for natural language processing, often fail to represent protein sequences’ complex structural and functional properties adequately. This study introduces evoBPE, a novel tokenization approach that integrates evolutionary mutation patterns into sequence segmentation, addressing critical limitations in existing methods. By leveraging established substitution matrices, evoBPE transcends traditional frequency-based tokenization strategies. The method generates candidate token pairs through biologically informed mutations, evaluating them based on pairwise alignment scores and frequency thresholds. Extensive experiments on human protein sequences show that evoBPE performs better across multiple dimensions. Domain conservation analysis reveals that evoBPE consistently outperforms standard Byte-Pair Encoding, particularly as vocabulary size increases. Furthermore, embedding similarity analysis using ESM-2 suggests that mutation-based token replacements preserve biological sequence properties more effectively than arbitrary substitutions. The research contributes to protein sequence representation by introducing a mutation-aware tokenization method that better captures evolutionary nuances. By bridging computational linguistics and molecular biology, evoBPE opens new possibilities for machine learning applications in protein function prediction, structural modeling, and evolutionary analysis.
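
As background for the comparison with standard Byte-Pair Encoding, here is a minimal sketch of one frequency-based BPE merge step on amino-acid sequences. It shows only the baseline; the mutation-based candidate generation with substitution matrices that distinguishes evoBPE is not reproduced, and the toy sequences and function names are assumptions.

```python
from collections import Counter
from typing import List, Tuple

def most_frequent_pair(sequences: List[List[str]]) -> Tuple[Tuple[str, str], int]:
    """Count adjacent token pairs across all sequences and return the most frequent one."""
    counts: Counter = Counter()
    for tokens in sequences:
        counts.update(zip(tokens, tokens[1:]))
    return counts.most_common(1)[0]

def merge_pair(tokens: List[str], pair: Tuple[str, str]) -> List[str]:
    """Replace every non-overlapping occurrence of `pair` with a single merged token."""
    merged = []
    i = 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

# Toy protein fragments, split into single-residue tokens.
corpus = [list("MKVLA"), list("GKVA"), list("KVT")]
pair, count = most_frequent_pair(corpus)
corpus_after_merge = [merge_pair(tokens, pair) for tokens in corpus]
print(pair, count, corpus_after_merge)
```

Repeating this step until a target vocabulary size is reached gives the standard BPE vocabulary that the paper uses as its point of comparison.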

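[NLP-56] ResBench: Benchmarking LLM-Generated FPGA Designs with Resource Awareness

[Quick read]: This paper addresses two gaps in evaluating Large Language Models (LLMs) that generate Hardware Description Language (HDL) code: existing benchmarks focus on functional correctness while overlooking hardware resource efficiency, and they lack diversity, failing to cover the broad range of real-world FPGA applications. The key contribution is ResBench, a resource-oriented benchmark designed to distinguish resource-optimized from inefficient LLM-generated HDL. ResBench contains 56 problems across 12 categories, from finite state machines to financial computing, and its evaluation framework systematically integrates FPGA resource constraints, primarily Lookup Table (LUT) usage, for a realistic assessment of hardware efficiency. Experiments reveal substantial differences in resource utilization across LLMs, demonstrating ResBench's effectiveness.

Link: https://arxiv.org/abs/2503.08823
Authors: Ce Guo, Tong Zhao
Affiliations: Department of Computing, Imperial College London
Categories: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Emerging Technologies (cs.ET); Machine Learning (cs.LG)
Comments: to be published in International Symposium on Highly Efficient Accelerators and Reconfigurable Technologies 2025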

Abstract:Field-Programmable Gate Arrays (FPGAs) are widely used in modern hardware design, yet writing Hardware Description Language (HDL) code for FPGA implementation remains labor-intensive and complex. Large Language Models (LLMs) have emerged as a promising tool for automating HDL generation, but existing benchmarks for LLM HDL code generation primarily evaluate functional correctness while overlooking the critical aspect of hardware resource efficiency. Moreover, current benchmarks lack diversity, failing to capture the broad range of real-world FPGA applications. To address these gaps, we introduce ResBench, the first resource-oriented benchmark explicitly designed to differentiate between resource-optimized and inefficient LLM-generated HDL. ResBench consists of 56 problems across 12 categories, covering applications from finite state machines to financial computing. Our evaluation framework systematically integrates FPGA resource constraints, with a primary focus on Lookup Table (LUT) usage, enabling a realistic assessment of hardware efficiency. Experimental results reveal substantial differences in resource utilization across LLMs, demonstrating ResBench’s effectiveness in distinguishing models based on their ability to generate resource-optimized FPGA designs.

[NLP-57] Cross-Examiner: Evaluating Consistency of Large Language Model-Generated Explanations

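[Quick read]: This paper addresses inaccuracies and omissions in the explanations that Large Language Models (LLMs) give for their outputs. Although LLMs are often asked to explain themselves to improve accuracy and transparency, evidence suggests that these explanations can misrepresent the model's true reasoning process. To identify such problems, the paper proposes cross-examiner, a method for generating follow-up questions based on a model's explanation. Its key idea is to combine symbolic information extraction with language-model-driven question generation, which yields better follow-up questions than LLMs alone, is more flexible than other methods, and produces a wider variety of follow-up questions.

Link: https://arxiv.org/abs/2503.08815
Authors: Danielle Villa, Maria Chang, Keerthiram Murugesan, Rosario Uceda-Sosa, Karthikeyan Natesan Ramamurthy
Affiliations: Rensselaer Polytechnic Institute; IBM Research
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 21 pages, 4 figures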

Abstract:Large Language Models (LLMs) are often asked to explain their outputs to enhance accuracy and transparency. However, evidence suggests that these explanations can misrepresent the models’ true reasoning processes. One effective way to identify inaccuracies or omissions in these explanations is through consistency checking, which typically involves asking follow-up questions. This paper introduces cross-examiner, a new method for generating follow-up questions based on a model’s explanation of an initial question. Our method combines symbolic information extraction with language model-driven question generation, resulting in better follow-up questions than those produced by LLMs alone. Additionally, this approach is more flexible than other methods and can generate a wider variety of follow-up questions.

[NLP-58] ESNLIR: A Spanish Multi-Genre Dataset with Causal Relationships

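[Quick read]: This paper addresses the relative scarcity of Natural Language Inference (NLI) datasets for Spanish, particularly ones that model causal relationships. Its key contribution is ESNLIR, a multi-genre Spanish NLI dataset, together with a preliminary baseline evaluation using BERT-family models, which indicates that greater genre diversity improves a model's ability to generalize.

Link: https://arxiv.org/abs/2503.08803
Authors: Johan R. Portela, Nicolás Perez, Rubén Manrique
Affiliations: Universidad de los Andes
Categories: Computation and Language (cs.CL)
Comments: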

Abstract:Natural Language Inference (NLI), also known as Recognizing Textual Entailment (RTE), serves as a crucial area within the domain of Natural Language Processing (NLP). This area fundamentally empowers machines to discern semantic relationships between assorted sections of text. Even though considerable work has been executed for the English language, it has been observed that efforts for the Spanish language are relatively sparse. Keeping this in view, this paper focuses on generating a multi-genre Spanish dataset for NLI, ESNLIR, particularly accounting for causal Relationships. A preliminary baseline has been conceptualized and subjected to an evaluation, leveraging models drawn from the BERT family. The findings signify that the enrichment of genres essentially contributes to the enrichment of the model’s capability to generalize. The code, notebooks and whole datasets for these experiments are available at: this https URL. If you are interested only in the dataset you can find it here: this https URL.

[NLP-59] Exposing Product Bias in LLM Investment Recommendation

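[Quick read]: This paper reveals product bias in investment recommendations from Large Language Models (LLMs): a systematic preference for specific financial products that can influence user decisions, potentially leading to inflated product valuations and financial bubbles. The key contribution is an automated pipeline that builds a dataset of 567,000 samples across five asset classes (stocks, mutual funds, cryptocurrencies, savings, and portfolios). Using this dataset, the study presents the first analysis of product bias in LLM investment recommendations and finds that the preference persists even after debiasing techniques are applied. The paper therefore urges attention to this bias to ensure fairness and security in the digital space and the market.

Link: https://arxiv.org/abs/2503.08750
Authors: Yuhan Zhi, Xiaoyu Zhang, Longtian Wang, Shumin Jiang, Shiqing Ma, Xiaohong Guan, Chao Shen
Affiliations: Xi’an Jiaotong University; Shanghai Jiaotong University; University of Massachusetts at Amherst
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments: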

Abstract:Large language models (LLMs), as a new generation of recommendation engines, possess powerful summarization and data analysis capabilities, surpassing traditional recommendation systems in both scope and performance. One promising application is investment recommendation. In this paper, we reveal a novel product bias in LLM investment recommendation, where LLMs exhibit systematic preferences for specific products. Such preferences can subtly influence user investment decisions, potentially leading to inflated valuations of products and financial bubbles, posing risks to both individual investors and market stability. To comprehensively study the product bias, we develop an automated pipeline to create a dataset of 567,000 samples across five asset classes (stocks, mutual funds, cryptocurrencies, savings, and portfolios). With this dataset, we present the first study on product bias in LLM investment recommendations. Our findings reveal that LLMs exhibit clear product preferences, such as certain stocks (e.g., 'AAPL' from Apple and 'MSFT' from Microsoft). Notably, this bias persists even after applying debiasing techniques. We urge AI researchers to take heed of the product bias in LLM investment recommendations and its implications, ensuring fairness and security in the digital space and market.

[NLP-60] An Exhaustive Evaluation of TTS- and VC-based Data Augmentation for ASR

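[Quick read]: This paper studies data augmentation for Automatic Speech Recognition (ASR) training using synthetic data generated by Text-to-Speech (TTS) or Voice Conversion (VC). Because synthetic speech is less diverse than real speech, naively mixing synthetic and real data usually does not give the best results. The key idea is to use recently proposed flow-based TTS/VC models, which allow greater speech diversity, and to assess how augmenting different speech attributes affects the word error rate (WER). Pitch augmentation and VC-based speaker augmentation are found to be ineffective, whereas jointly augmenting all other attributes reduces the WER of a Conformer-Transducer ASR model by 11% relative on Common Voice and by up to 35% relative on LibriSpeech, compared with training on real data only.

Link: https://arxiv.org/abs/2503.08954
Authors: Sewade Ogun, Vincent Colotte, Emmanuel Vincent
Affiliations: Unknown
Categories: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL)
Comments: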

Abstract:Augmenting the training data of automatic speech recognition (ASR) systems with synthetic data generated by text-to-speech (TTS) or voice conversion (VC) has gained popularity in recent years. Several works have demonstrated improvements in ASR performance using this augmentation approach. However, because of the lower diversity of synthetic speech, naively combining synthetic and real data often does not yield the best results. In this work, we leverage recently proposed flow-based TTS/VC models allowing greater speech diversity, and assess the respective impact of augmenting various speech attributes on the word error rate (WER) achieved by several ASR models. Pitch augmentation and VC-based speaker augmentation are found to be ineffective in our setup. Jointly augmenting all other attributes reduces the WER of a Conformer-Transducer model by 11% relative on Common Voice and by up to 35% relative on LibriSpeech compared to training on real data only.
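
The abstract reports gains as relative WER reductions. The sketch below shows the usual word-level edit-distance definition of WER and how a relative reduction is computed. The example sentences and the WER values 10.0 and 8.9 are hypothetical and are not taken from the paper.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level edit distance (sub + del + ins) / number of reference words.

    Assumes the reference contains at least one word.
    """
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))  # distances for an empty reference prefix
    for i, ref_word in enumerate(ref, 1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, 1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)

def relative_reduction(baseline_wer: float, new_wer: float) -> float:
    """Relative improvement of new_wer over baseline_wer, as a fraction."""
    return (baseline_wer - new_wer) / baseline_wer

wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
rel = relative_reduction(10.0, 8.9)  # hypothetical WERs in percent
print(wer, rel)
```

A relative figure of 11% therefore means the WER dropped by 11% of its baseline value, not by 11 absolute percentage points.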

Computer Vision

[CV-0] RewardSDS: Aligning Score Distillation via Reward-Weighted Sampling

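[Quick read]: This paper addresses the difficulty that Score Distillation Sampling (SDS) has in achieving fine-grained alignment with user intent. It proposes RewardSDS, whose key idea is to weight noise samples by alignment scores from a reward model, producing a weighted SDS loss that prioritizes gradients from noise samples yielding aligned, high-reward outputs. The approach is broadly applicable, and the authors extend it to Variational Score Distillation (VSD) as RewardVSD. Experiments show that RewardSDS and RewardVSD significantly improve over SDS and VSD on text-to-image, 2D editing, and text-to-3D generation tasks across a range of metrics, reaching state-of-the-art performance.

Link: https://arxiv.org/abs/2503.09601
Authors: Itay Chachy, Guy Yariv, Sagie Benaim
Affiliations: Hebrew University of Jerusalem
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: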

Abstract:Score Distillation Sampling (SDS) has emerged as an effective technique for leveraging 2D diffusion priors for tasks such as text-to-3D generation. While powerful, SDS struggles with achieving fine-grained alignment to user intent. To overcome this, we introduce RewardSDS, a novel approach that weights noise samples based on alignment scores from a reward model, producing a weighted SDS loss. This loss prioritizes gradients from noise samples that yield aligned high-reward output. Our approach is broadly applicable and can extend SDS-based methods. In particular, we demonstrate its applicability to Variational Score Distillation (VSD) by introducing RewardVSD. We evaluate RewardSDS and RewardVSD on text-to-image, 2D editing, and text-to-3D generation tasks, showing significant improvements over SDS and VSD on a diverse set of metrics measuring generation quality and alignment to desired reward models, enabling state-of-the-art performance. Project page is available at https://itaychachy. this http URL.
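
The idea of weighting noise samples by reward can be sketched as follows. The abstract says only that samples are weighted by alignment scores from a reward model; the softmax weighting, the `temperature` parameter, and the scalar per-sample losses used here are assumptions for illustration, not the paper's formulation.

```python
import math
from typing import List, Tuple

def softmax(values: List[float], temperature: float = 1.0) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(values)
    exps = [math.exp((v - peak) / temperature) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def reward_weighted_loss(
    per_sample_losses: List[float],
    rewards: List[float],
    temperature: float = 1.0,
) -> Tuple[float, List[float]]:
    """Combine per-noise-sample losses using weights derived from reward scores."""
    weights = softmax(rewards, temperature)
    loss = sum(w * l for w, l in zip(weights, per_sample_losses))
    return loss, weights

# Three hypothetical noise samples with their losses and reward scores.
losses = [1.0, 2.0, 3.0]
uniform_loss, uniform_weights = reward_weighted_loss(losses, [0.0, 0.0, 0.0])
skewed_loss, skewed_weights = reward_weighted_loss(losses, [0.0, 0.0, 5.0])
print(uniform_loss, skewed_loss, skewed_weights)
```

With equal rewards the result reduces to a plain average; as one sample's reward grows, its loss (and hence its gradient) dominates the combined objective.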

[CV-1] PISA Experiments: Exploring Physics Post-Training for Video Diffusion Models by Watching Stuff Drop

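[Quick read]: This paper addresses the shortcomings of large pre-trained video generation models as physically accurate world models: they excel at content creation but are not reliable physical simulators out of the box. Using object freefall as a simple but fundamental physics task, it studies post-training these models to improve their physical modeling. State-of-the-art video generation models struggle with this basic task despite their visually impressive outputs.

The key finding is that fine-tuning on a relatively small amount of simulated video is effective in inducing the dropping behavior in the model, and that a novel reward modeling procedure improves results further. The study also reveals limitations of post-training in generalization and distribution modeling, and releases a benchmark for the freefall task that can serve as a diagnostic tool for tracking physical accuracy in large-scale video generative models.

Link: https://arxiv.org/abs/2503.09595
Authors: Chenyu Li, Oscar Michel, Xichen Pan, Sainan Liu, Mike Roberts, Saining Xie
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: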

Abstract:Large-scale pre-trained video generation models excel in content creation but are not reliable as physically accurate world simulators out of the box. This work studies the process of post-training these models for accurate world modeling through the lens of the simple, yet fundamental, physics task of modeling object freefall. We show state-of-the-art video generation models struggle with this basic task, despite their visually impressive outputs. To remedy this problem, we find that fine-tuning on a relatively small amount of simulated videos is effective in inducing the dropping behavior in the model, and we can further improve results through a novel reward modeling procedure we introduce. Our study also reveals key limitations of post-training in generalization and distribution modeling. Additionally, we release a benchmark for this task that may serve as a useful diagnostic tool for tracking physical accuracy in large-scale video generative model development.

[CV-2] SimLingo: Vision-Only Closed-Loop Autonomous Driving with Language-Action Alignment CVPR2025

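[Quick read]: This paper addresses the integration of Large Language Models (LLMs) into autonomous driving to improve generalization and explainability. Existing methods usually focus on either driving or vision-language understanding, and achieving both high driving performance and broad language understanding remains difficult. Moreover, vision-language understanding is typically approached through Visual Question Answering (VQA), which is useful for autonomous driving only when it is aligned with the action space; otherwise the model's answers may be inconsistent with its behavior. The paper proposes SimLingo, a model that handles three tasks: closed-loop driving, vision-language understanding, and language-action alignment. SimLingo is based on a Vision-Language Model (VLM) and uses only camera input, without expensive sensors such as LiDAR. It achieves state-of-the-art performance on the Bench2Drive benchmark in the CARLA simulator and strong results on a variety of language-related tasks while maintaining high driving performance.

Link: https://arxiv.org/abs/2503.09594
Authors: Katrin Renz, Long Chen, Elahe Arani, Oleg Sinavski
Affiliations: Wayve; University of Tübingen; Tübingen AI Center
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: CVPR 2025. 1st Place @ CARLA Challenge 2024. Challenge tech report (preliminary version of SimLingo): arXiv:2406.10165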

Abstract:Integrating large language models (LLMs) into autonomous driving has attracted significant attention with the hope of improving generalization and explainability. However, existing methods often focus on either driving or vision-language understanding but achieving both high driving performance and extensive language understanding remains challenging. In addition, the dominant approach to tackle vision-language understanding is using visual question answering. However, for autonomous driving, this is only useful if it is aligned with the action space. Otherwise, the model’s answers could be inconsistent with its behavior. Therefore, we propose a model that can handle three different tasks: (1) closed-loop driving, (2) vision-language understanding, and (3) language-action alignment. Our model SimLingo is based on a vision language model (VLM) and works using only camera, excluding expensive sensors like LiDAR. SimLingo obtains state-of-the-art performance on the widely used CARLA simulator on the Bench2Drive benchmark and is the winning entry at the CARLA challenge 2024. Additionally, we achieve strong results in a wide variety of language-related tasks while maintaining high driving performance.

[CV-3] BIMBA: Selective-Scan Compression for Long-Range Video Question Answering CVPR2025

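[Quick read]: This paper addresses the key challenge of Video Question Answering (VQA) on long videos: extracting relevant information and modeling long-range dependencies from many redundant frames. Self-attention offers a general solution for sequence modeling, but its cost is prohibitive for the massive number of spatiotemporal tokens in long videos. Most prior methods lower the cost with compression strategies such as sparse frame sampling or space-time pooling, but these over-represent redundant information and often miss salient events or fast-occurring space-time patterns. The paper introduces BIMBA, an efficient state-space model that uses the selective scan algorithm to learn to select critical information from high-dimensional video and transform it into a reduced token sequence for efficient processing by a Large Language Model (LLM). Experiments show that BIMBA achieves state-of-the-art accuracy on multiple long-form VQA benchmarks.

Link: https://arxiv.org/abs/2503.09590
Authors: Md Mohaiminul Islam, Tushar Nagarajan, Huiyu Wang, Gedas Bertasius, Lorenzo Torresani
Affiliations: UNC Chapel Hill; Meta AI
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by CVPR 2025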

Abstract:Video Question Answering (VQA) in long videos poses the key challenge of extracting relevant information and modeling long-range dependencies from many redundant frames. The self-attention mechanism provides a general solution for sequence modeling, but it has a prohibitive cost when applied to a massive number of spatiotemporal tokens in long videos. Most prior methods rely on compression strategies to lower the computational cost, such as reducing the input length via sparse frame sampling or compressing the output sequence passed to the large language model (LLM) via space-time pooling. However, these naive approaches over-represent redundant information and often miss salient events or fast-occurring space-time patterns. In this work, we introduce BIMBA, an efficient state-space model to handle long-form videos. Our model leverages the selective scan algorithm to learn to effectively select critical information from high-dimensional video and transform it into a reduced token sequence for efficient LLM processing. Extensive experiments demonstrate that BIMBA achieves state-of-the-art accuracy on multiple long-form VQA benchmarks, including PerceptionTest, NExT-QA, EgoSchema, VNBench, LongVideoBench, and Video-MME. Code, and models are publicly available at this https URL.

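[CV-4] TPDiff: Temporal Pyramid Video Diffusion Model

[Quick read]: This paper addresses the substantial computational demands of training and running video diffusion models. Its key insight is that the reverse diffusion process is inherently entropy-reducing, and that, given the inter-frame redundancy of video, maintaining the full frame rate in high-entropy stages is unnecessary. Based on this, the paper proposes TPDiff, a unified framework for improving training and inference efficiency. It divides diffusion into several stages and progressively increases the frame rate along the diffusion process, with only the last stage operating at the full frame rate. To train the multi-stage model, the paper introduces stage-wise diffusion, a dedicated training framework: by solving the partitioned probability flow ordinary differential equations (ODEs) of diffusion under aligned data and noise, the training strategy applies to various diffusion forms and further improves training efficiency. Experiments validate the generality of the method, showing a 50% reduction in training cost and a 1.5x improvement in inference efficiency.

Link: https://arxiv.org/abs/2503.09566
Authors: Lingmin Ran, Mike Zheng Shou
Affiliations: Show Lab, National University of Singapore
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL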

Abstract:The development of video diffusion models unveils a significant challenge: the substantial computational demands. To mitigate this challenge, we note that the reverse process of diffusion exhibits an inherent entropy-reducing nature. Given the inter-frame redundancy in video modality, maintaining full frame rates in high-entropy stages is unnecessary. Based on this insight, we propose TPDiff, a unified framework to enhance training and inference efficiency. By dividing diffusion into several stages, our framework progressively increases frame rate along the diffusion process with only the last stage operating on full frame rate, thereby optimizing computational efficiency. To train the multi-stage diffusion model, we introduce a dedicated training framework: stage-wise diffusion. By solving the partitioned probability flow ordinary differential equations (ODE) of diffusion under aligned data and noise, our training strategy is applicable to various diffusion forms and further enhances training efficiency. Comprehensive experimental evaluations validate the generality of our method, demonstrating 50% reduction in training cost and 1.5x improvement in inference efficiency.
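
The temporal pyramid idea (low frame rate in early, high-entropy stages and full frame rate only in the last stage) can be sketched as a simple schedule. The equal-width split of the timestep range, the halving of the temporal stride per stage, and all function names are assumptions for illustration; the paper's actual stage boundaries and resampling are not given in the abstract.

```python
from typing import List

def frame_stride_for_stage(stage: int, num_stages: int) -> int:
    """Temporal stride for a stage; the last stage uses stride 1 (full frame rate)."""
    return 2 ** (num_stages - 1 - stage)

def stage_for_timestep(t: float, num_stages: int) -> int:
    """Map t in [0, 1] to a stage index, where t=1 is pure noise and t=0 is clean data."""
    if not 0.0 <= t <= 1.0:
        raise ValueError("t must lie in [0, 1]")
    return min(int((1.0 - t) * num_stages), num_stages - 1)

def subsample_frames(frames: List[int], stride: int) -> List[int]:
    """Keep every `stride`-th frame."""
    return frames[::stride]

num_stages = 3
frames = list(range(16))  # indices of a hypothetical 16-frame clip
strides = [frame_stride_for_stage(s, num_stages) for s in range(num_stages)]
frames_per_stage = [len(subsample_frames(frames, s)) for s in strides]
print(strides, frames_per_stage)
```

In this toy schedule the first stage processes a quarter of the frames, which is where the compute savings in the noisiest part of the process would come from.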

[CV-5] Electromyography-Informed Facial Expression Reconstruction for Physiological-Based Synthesis and Analysis CVPR2025

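Quick read: This paper addresses the failure of existing facial analysis methods on facial surface electromyography (sEMG) recordings, where electrodes occlude the face. Existing methods cannot handle electrode occlusion, and even with occlusion-free reference images of the same person, variations in expression intensity and execution cannot be matched. The key contribution is electromyography-informed facial expression reconstruction (EIFER). EIFER combines a 3D Morphable Model (3DMM) with neural unpaired image-to-image translation to decouple facial geometry from visual appearance (e.g., skin texture, lighting, electrodes), and learns a bidirectional mapping between 3DMM expression parameters and muscle activity, so that occluded faces are faithfully reconstructed in geometry and appearance in an adversarial manner. The approach is validated on a dataset of synchronized sEMG recordings and facial mimicry, and the paper further shows how expressions can be synthesized from muscle activity and how observed expressions can predict dynamic muscle activity.

Link: https://arxiv.org/abs/2503.09556
Authors: Tim Büchner,Christoph Anders,Orlando Guntinas-Lichius,Joachim Denzler
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at CVPR 2025, 41 pages, 37 figures, 8 tables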

Abstract:The relationship between muscle activity and resulting facial expressions is crucial for various fields, including psychology, medicine, and entertainment. The synchronous recording of facial mimicry and muscular activity via surface electromyography (sEMG) provides a unique window into these complex dynamics. Unfortunately, existing methods for facial analysis cannot handle electrode occlusion, rendering them ineffective. Even with occlusion-free reference images of the same person, variations in expression intensity and execution are unmatchable. Our electromyography-informed facial expression reconstruction (EIFER) approach is a novel method to restore faces under sEMG occlusion faithfully in an adversarial manner. We decouple facial geometry and visual appearance (e.g., skin texture, lighting, electrodes) by combining a 3D Morphable Model (3DMM) with neural unpaired image-to-image translation via reference recordings. Then, EIFER learns a bidirectional mapping between 3DMM expression parameters and muscle activity, establishing correspondence between the two domains. We validate the effectiveness of our approach through experiments on a dataset of synchronized sEMG recordings and facial mimicry, demonstrating faithful geometry and appearance reconstruction. Further, we synthesize expressions based on muscle activity and show how observed expressions can predict dynamic muscle activity. Consequently, EIFER introduces a new paradigm for facial electromyography, which could be extended to other forms of multi-modal face recordings.

[CV-6] GenHPE: Generative Counterfactuals for 3D Human Pose Estimation with Radio Frequency Signals

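Quick read: This paper addresses the problem that existing human pose estimation (HPE) methods based on radio frequency (RF) signals are confined by domain-specific confounders (such as entangled signal variations caused by body parts, and environmental noise), so they fail to generalize to new domains and lose accuracy. The key contribution is GenHPE, a 3D HPE approach that generates counterfactual RF signals to eliminate domain-specific confounders. Specifically, GenHPE uses conditional generative models to learn how human body parts and confounders affect RF signals, and manipulates skeleton labels (e.g., removing body parts) as counterfactual conditions to synthesize counterfactual RF signals. The differences between these counterfactual signals approximately eliminate domain-specific confounders and regularize an encoder-decoder model to learn domain-independent representations, enabling cross-domain 3D HPE. On three public datasets covering WiFi, ultra-wideband, and millimeter wave, GenHPE outperforms state-of-the-art methods, reducing estimation errors by up to 52.2 mm for cross-subject HPE and 10.6 mm for cross-environment HPE.

Link: https://arxiv.org/abs/2503.09537
Authors: Shuokang Huang,Julie A. McCann
Affiliations: Imperial College London
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Multimedia (cs.MM); Signal Processing (eess.SP)
Comments: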

Abstract:Human pose estimation (HPE) detects the positions of human body joints for various applications. Compared to using cameras, HPE using radio frequency (RF) signals is non-intrusive and more robust to adverse conditions, exploiting the signal variations caused by human interference. However, existing studies focus on single-domain HPE confined by domain-specific confounders, which cannot generalize to new domains and result in diminished HPE performance. Specifically, the signal variations caused by different human body parts are entangled, containing subject-specific confounders. RF signals are also intertwined with environmental noise, involving environment-specific confounders. In this paper, we propose GenHPE, a 3D HPE approach that generates counterfactual RF signals to eliminate domain-specific confounders. GenHPE trains generative models conditioned on human skeleton labels, learning how human body parts and confounders interfere with RF signals. We manipulate skeleton labels (i.e., removing body parts) as counterfactual conditions for generative models to synthesize counterfactual RF signals. The differences between counterfactual signals approximately eliminate domain-specific confounders and regularize an encoder-decoder model to learn domain-independent representations. Such representations help GenHPE generalize to new subjects/environments for cross-domain 3D HPE. We evaluate GenHPE on three public datasets from WiFi, ultra-wideband, and millimeter wave. Experimental results show that GenHPE outperforms state-of-the-art methods and reduces estimation errors by up to 52.2mm for cross-subject HPE and 10.6mm for cross-environment HPE.

[CV-7] Evaluating Visual Explanations of Attention Maps for Transformer-based Medical Imaging MICCAI2024

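Quick read: This paper addresses the limited explainability of Vision Transformers (ViTs) in medical imaging. It compares the visual explanations given by attention maps with other commonly used explanation methods (such as GradCAM), evaluates them under different pretraining regimes (supervised and self-supervised), and additionally compares them against transformer-specific interpretability methods. The study finds that although attention maps show promise under certain conditions and generally surpass GradCAM, they are outperformed by transformer-specific interpretability methods. This indicates that the efficacy of attention maps as explanations is context-dependent, and that they are limited because they do not consistently provide the comprehensive insights required for robust medical decision-making.

Link: https://arxiv.org/abs/2503.09535
Authors: Minjae Chung,Jong Bum Won,Ganghyun Kim,Yujin Kim,Utku Ozbulak
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted for publication in MICCAI 2024 Workshop on Interpretability of Machine Intelligence in Medical Image Computing (iMIMIC)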

Abstract:Although Vision Transformers (ViTs) have recently demonstrated superior performance in medical imaging problems, they face explainability issues similar to previous architectures such as convolutional neural networks. Recent research efforts suggest that attention maps, which are part of the decision-making process of ViTs, can potentially address the explainability issue by identifying regions influencing predictions, especially in models pretrained with self-supervised learning. In this work, we compare the visual explanations of attention maps to other commonly used methods for medical imaging problems. To do so, we employ four distinct medical imaging datasets that involve the identification of (1) colonic polyps, (2) breast tumors, (3) esophageal inflammation, and (4) bone fractures and hardware implants. Through large-scale experiments on the aforementioned datasets using various supervised and self-supervised pretrained ViTs, we find that although attention maps show promise under certain conditions and generally surpass GradCAM in explainability, they are outperformed by transformer-specific interpretability methods. Our findings indicate that the efficacy of attention maps as a method of interpretability is context-dependent and may be limited as they do not consistently provide the comprehensive insights required for robust medical decision-making.

[CV-8] CombatVLA: An Efficient Vision-Language-Action Model for Combat Tasks in 3D Action Role-Playing Games

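Quick read: This paper addresses real-time decision-making in complex 3D environments, which demands second-level responses, high-resolution perception, and tactical reasoning under dynamic conditions. The key contribution is CombatVLA, an efficient Vision-Language-Action model optimized for combat tasks in 3D action role-playing games (ARPGs). CombatVLA is trained on video-action pairs formatted as action-of-thought (AoT) sequences and uses a truncated AoT strategy for efficient inference. It improves performance on the combat understanding benchmark, achieves a 50-fold acceleration in game combat, and attains a higher task success rate than human players.

Link: https://arxiv.org/abs/2503.09527
Authors: Peng Chen,Pi Bu,Yingyao Wang,Xinyi Wang,Ziming Wang,Jie Guo,Yingxiu Zhao,Qi Zhu,Jun Song,Siran Yang,Jiamang Wang,Bo Zheng
Affiliations: Alibaba Group
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: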

Abstract:Recent advances in Vision-Language-Action models (VLAs) have expanded the capabilities of embodied intelligence. However, significant challenges remain in real-time decision-making in complex 3D environments, which demand second-level responses, high-resolution perception, and tactical reasoning under dynamic conditions. To advance the field, we introduce CombatVLA, an efficient VLA model optimized for combat tasks in 3D action role-playing games (ARPGs). Specifically, our CombatVLA is a 3B model trained on video-action pairs collected by an action tracker, where the data is formatted as action-of-thought (AoT) sequences. Thereafter, CombatVLA seamlessly integrates into an action execution framework, allowing efficient inference through our truncated AoT strategy. Experimental results demonstrate that CombatVLA not only outperforms all existing models on the combat understanding benchmark but also achieves a 50-fold acceleration in game combat. Moreover, it has a higher task success rate than human players. We will open-source all resources, including the action tracker, dataset, benchmark, model weights, training code, and the implementation of the framework at this https URL.

[CV-9] Patch-Wise Hypergraph Contrastive Learning with Dual Normal Distribution Weighting for Multi-Domain Stain Transfer

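Quick read: This paper addresses the loss of detailed pathological information in virtual stain transfer caused by the limitations of the cycle consistency assumption. It proposes STNHCL, a hypergraph-based patch-wise contrastive learning method. The key idea is to capture higher-order relationships among image patches through hypergraph modeling, ensuring consistent higher-order topology between input and output images. In addition, a novel negative sample weighting strategy uses discriminator heatmaps to apply different weights based on Gaussian distributions for tissue and background, improving on traditional weighting methods.

Link: https://arxiv.org/abs/2503.09523
Authors: Haiyan Wei,Hangrui Xu,Bingxu Zhu,Yulian Geng,Aolei Liu,Wenfei Yin,Jian Liu
Affiliations: Hefei University of Technology, China
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: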

Abstract:Virtual stain transfer leverages computer-assisted technology to transform the histochemical staining patterns of tissue samples into other staining types. However, existing methods often lose detailed pathological information due to the limitations of the cycle consistency assumption. To address this challenge, we propose STNHCL, a hypergraph-based patch-wise contrastive learning method. STNHCL captures higher-order relationships among patches through hypergraph modeling, ensuring consistent higher-order topology between input and output images. Additionally, we introduce a novel negative sample weighting strategy that leverages discriminator heatmaps to apply different weights based on the Gaussian distribution for tissue and background, thereby enhancing traditional weighting methods. Experiments demonstrate that STNHCL achieves state-of-the-art performance in the two main categories of stain transfer tasks. Furthermore, our model also performs excellently in downstream tasks. Code will be made available.

[CV-10] CM-Diff: A Single Generative Network for Bidirectional Cross-Modality Translation Diffusion Model Between Infrared and Visible Images

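Quick read: This paper addresses the performance bottleneck in bidirectional cross-modality translation between infrared and visible images: existing methods either support only unidirectional translation or rely on cycle consistency for bidirectional translation, which may yield suboptimal performance. The key contribution is the cross-modality translation diffusion model (CM-Diff), which simultaneously models the data distributions of the infrared and visible modalities by combining translation direction labels for guidance during training with cross-modality feature control. Specifically, the authors treat establishing the mapping between the two modalities as a process of learning data distributions and understanding modality differences, achieved through a novel Bidirectional Diffusion Training (BDT) strategy. A Statistical Constraint Inference (SCI) strategy is additionally introduced to ensure that generated images closely adhere to the data distribution of the target modality. Experiments show that CM-Diff outperforms state-of-the-art methods.

Link: https://arxiv.org/abs/2503.09514
Authors: Bin Hu,Chenqiang Gao,Shurui Liu,Junjie Guo,Fang Chen,Fangcen Liu
Affiliations: Chongqing University of Posts and Telecommunications; Sun Yat-sen University; University of California, Merced
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: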

Abstract:The image translation method represents a crucial approach for mitigating information deficiencies in the infrared and visible modalities, while also facilitating the enhancement of modality-specific datasets. However, existing methods for infrared and visible image translation either achieve unidirectional modality translation or rely on cycle consistency for bidirectional modality translation, which may result in suboptimal performance. In this work, we present the cross-modality translation diffusion model (CM-Diff) for simultaneously modeling data distributions in both the infrared and visible modalities. We address this challenge by combining translation direction labels for guidance during training with cross-modality feature control. Specifically, we view the establishment of the mapping relationship between the two modalities as the process of learning data distributions and understanding modality differences, achieved through a novel Bidirectional Diffusion Training (BDT) strategy. Additionally, we propose a Statistical Constraint Inference (SCI) strategy to ensure the generated image closely adheres to the data distribution of the target modality. Experimental results demonstrate the superiority of our CM-Diff over state-of-the-art methods, highlighting its potential for generating dual-modality datasets.

[CV-11] ViM-VQ: Efficient Post-Training Vector Quantization for Visual Mamba

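Quick read: This paper addresses the poor accuracy of Visual Mamba networks (ViMs) under low-bit quantization (e.g., 3-bit, 2-bit, and 1-bit). Although existing vector quantization (VQ) methods have achieved strong results on convolutional neural networks and Transformer-based networks, applying them directly to ViMs yields unsatisfactory accuracy. The paper identifies two main challenges: first, the weights of Mamba-based blocks in ViMs contain numerous outliers, which significantly amplify quantization errors; second, when applied to ViMs, existing VQ methods suffer from excessive memory consumption, lengthy calibration procedures, and suboptimal performance in the search for optimal codewords. To address these challenges, the authors propose ViM-VQ, an efficient post-training vector quantization method tailored for ViMs. Its key innovations are: 1) a fast convex combination optimization algorithm that efficiently updates both the convex combinations and the convex hulls to search for optimal codewords; and 2) an incremental vector quantization strategy that incrementally confirms optimal codewords to mitigate truncation errors. Experiments show that ViM-VQ achieves state-of-the-art low-bit quantization performance across various visual tasks.

Link: https://arxiv.org/abs/2503.09509
Authors: Juncan Deng,Shuaiting Li,Zeyu Wang,Kedong Xu,Hong Gu,Kejie Huang
Affiliations: Zhejiang University; vivo Mobile Communication Co., Ltd
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: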

Abstract:Visual Mamba networks (ViMs) extend the selective state space model (Mamba) to various vision tasks and demonstrate significant potential. Vector quantization (VQ), on the other hand, decomposes network weights into codebooks and assignments, significantly reducing memory usage and computational latency to enable ViMs deployment on edge devices. Although existing VQ methods have achieved extremely low-bit quantization (e.g., 3-bit, 2-bit, and 1-bit) in convolutional neural networks and Transformer-based networks, directly applying these methods to ViMs results in unsatisfactory accuracy. We identify several key challenges: 1) The weights of Mamba-based blocks in ViMs contain numerous outliers, significantly amplifying quantization errors. 2) When applied to ViMs, the latest VQ methods suffer from excessive memory consumption, lengthy calibration procedures, and suboptimal performance in the search for optimal codewords. In this paper, we propose ViM-VQ, an efficient post-training vector quantization method tailored for ViMs. ViM-VQ consists of two innovative components: 1) a fast convex combination optimization algorithm that efficiently updates both the convex combinations and the convex hulls to search for optimal codewords, and 2) an incremental vector quantization strategy that incrementally confirms optimal codewords to mitigate truncation errors. Experimental results demonstrate that ViM-VQ achieves state-of-the-art performance in low-bit quantization across various visual tasks.
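
To make the "codebooks and assignments" decomposition in the abstract concrete, here is a minimal sketch of plain nearest-codeword vector quantization. It is generic VQ, not the ViM-VQ algorithm; the function names and the toy data are assumptions for illustration, and the bit count ignores the storage cost of the codebook itself.

```python
import math


def squared_distance(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def assign_codewords(vectors, codebook):
    """Return, for each weight sub-vector, the index of its nearest codeword."""
    assignments = []
    for v in vectors:
        distances = [squared_distance(v, c) for c in codebook]
        assignments.append(distances.index(min(distances)))
    return assignments


def quantization_error(vectors, codebook, assignments):
    """Total squared error between sub-vectors and their assigned codewords."""
    return sum(
        squared_distance(v, codebook[i]) for v, i in zip(vectors, assignments)
    )


def bits_per_weight(num_codewords, dim):
    """Index bits per scalar weight: log2(K) bits shared by `dim` weights."""
    return math.log2(num_codewords) / dim


if __name__ == "__main__":
    weights = [[0.1, 0.0], [0.9, 1.1], [5.0, 5.0]]  # last row is an outlier
    codebook = [[0.0, 0.0], [1.0, 1.0]]
    idx = assign_codewords(weights, codebook)
    print(idx, quantization_error(weights, codebook, idx))
    print(bits_per_weight(256, 4))
```

In this toy example the single outlier sub-vector accounts for almost all of the squared error, which is the kind of effect the abstract attributes to outliers in Mamba-based blocks.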

[CV-12] Double-Stage Feature-Level Clustering-Based Mixture of Experts Framework

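Quick read: This paper addresses two open issues with Mixture-of-Experts (MoE) models in image classification: the unclear advantage of their complex architecture over dense models, and their sensitivity to noise and outliers in the input space. Some prior approaches cluster inputs for MoE training, but most clustering algorithms lack access to labeled data, which limits their effectiveness. The paper proposes the Double-stage Feature-level Clustering and Pseudo-labeling-based Mixture of Experts (DFCP-MoE) framework. Its key idea is to combine two-stage feature-level clustering with a pseudo-labeling strategy, reducing the impact of noise and outliers while using a small subset of labeled data to assign pseudo-labels to a large portion of unlabeled inputs. A conditional end-to-end joint training method further improves expert specialization. Compared with traditional MoE and dense models, DFCP-MoE captures input space diversity more effectively, leading to competitive inference results.

Link: https://arxiv.org/abs/2503.09504
Authors: Bakary Badjie,José Cecílio,António Casimiro
Affiliations: LASIGE; Departamento de Informática; Faculdade de Ciências da Universidade de Lisboa
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Logic in Computer Science (cs.LO)
Comments: 14 Pages, 1 Figure, and 3 Tables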

Abstract:The Mixture-of-Experts (MoE) model has succeeded in deep learning (DL). However, its complex architecture and advantages over dense models in image classification remain unclear. In previous studies, MoE performance has often been affected by noise and outliers in the input space. Some approaches incorporate input clustering for training MoE models, but most clustering algorithms lack access to labeled data, limiting their effectiveness. This paper introduces the Double-stage Feature-level Clustering and Pseudo-labeling-based Mixture of Experts (DFCP-MoE) framework, which consists of input feature extraction, feature-level clustering, and a computationally efficient pseudo-labeling strategy. This approach reduces the impact of noise and outliers while leveraging a small subset of labeled data to label a large portion of unlabeled inputs. We propose a conditional end-to-end joint training method that improves expert specialization by training the MoE model on well-labeled, clustered inputs. Unlike traditional MoE and dense models, the DFCP-MoE framework effectively captures input space diversity, leading to competitive inference results. We validate our approach on three benchmark datasets for multi-class classification tasks.

[CV-13] Towards Robust Multimodal Representation: A Unified Approach with Adaptive Experts and Alignment

Quick read: This paper addresses missing multimodal data in healthcare, which arises from privacy restrictions, cost, and technical issues and makes many existing multimodal models unreliable. It proposes a new method called Mixture of Experts, Symmetric Aligning, and Reconstruction (MoSARe). The key idea is to use expert selection, cross-modal attention, and contrastive learning to improve feature representation and decision-making, so that predictions remain accurate whether the data is complete or partially missing, which makes the method especially suitable for real-world healthcare settings, including resource-limited environments.

Link: https://arxiv.org/abs/2503.09498
Authors: Nazanin Moradinasab,Saurav Sengupta,Jiebei Liu,Sana Syed,Donald E. Brown
Affiliations: School of Data Science, University of Virginia; Systems and Information Engineering, University of Virginia; Division of Pediatric Gastroenterology, Duke University Medical Center, Duke Clinical Research Institute
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Healthcare relies on multiple types of data, such as medical images, genetic information, and clinical records, to improve diagnosis and treatment. However, missing data is a common challenge due to privacy restrictions, cost, and technical issues, making many existing multi-modal models unreliable. To address this, we propose a new multimodal model called Mixture of Experts, Symmetric Aligning, and Reconstruction (MoSARe), a deep learning framework that handles incomplete multimodal data while maintaining high accuracy. MoSARe integrates expert selection, cross-modal attention, and contrastive learning to improve feature representation and decision-making. Our results show that MoSARe outperforms existing models in situations when the data is complete. Furthermore, it provides reliable predictions even when some data are missing. This makes it especially useful in real-world healthcare settings, including resource-limited environments. Our code is publicly available at this https URL.

[CV-14] Robust Multimodal Survival Prediction with the Latent Differentiation Conditional Variational AutoEncoder CVPR2025

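Quick read: This paper addresses two key problems in survival prediction for human cancers based on the integrative analysis of histopathological images and genomic data: (1) handling missing modalities when full-modality data is unavailable in testing samples; and (2) the difficulty of representing gigapixel whole slide images (WSIs) and of modeling genomic embeddings with diverse function categories in a unified generative framework. To tackle these challenges, the paper proposes a Latent Differentiation Conditional Variational AutoEncoder (LD-CVAE) that enables robust multimodal survival prediction even with missing genomic data. The key components are a Variational Information Bottleneck Transformer (VIB-Trans) module that learns compressed pathological representations from gigapixel WSIs, and a novel Latent Differentiation Variational AutoEncoder (LD-VAE) that learns the common and specific posteriors for genomic embeddings with diverse functions. Finally, the product-of-experts technique integrates the genomic common posterior and the image posterior to estimate the joint latent distribution. Experiments show the method's superiority in both complete and missing modality scenarios.

Link: https://arxiv.org/abs/2503.09496
Authors: Junjie Zhou,Jiao Tang,Yingli Zuo,Peng Wan,Daoqiang Zhang,Wei Shao
Affiliations: The College of Artificial Intelligence, Nanjing University of Aeronautics and Astronautics; The Key Laboratory of Brain-Machine Intelligence Technology, Ministry of Education
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by CVPR2025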

Abstract:The integrative analysis of histopathological images and genomic data has received increasing attention for survival prediction of human cancers. However, the existing studies always hold the assumption that full modalities are available. As a matter of fact, the cost for collecting genomic data is high, which sometimes makes genomic data unavailable in testing samples. A common way of tackling such incompleteness is to generate the genomic representations from the pathology images. Nevertheless, such strategy still faces the following two challenges: (1) The gigapixel whole slide images (WSIs) are huge and thus hard for representation. (2) It is difficult to generate the genomic embeddings with diverse function categories in a unified generative framework. To address the above challenges, we propose a Conditional Latent Differentiation Variational AutoEncoder (LD-CVAE) for robust multimodal survival prediction, even with missing genomic data. Specifically, a Variational Information Bottleneck Transformer (VIB-Trans) module is proposed to learn compressed pathological representations from the gigapixel WSIs. To generate different functional genomic features, we develop a novel Latent Differentiation Variational AutoEncoder (LD-VAE) to learn the common and specific posteriors for the genomic embeddings with diverse functions. Finally, we use the product-of-experts technique to integrate the genomic common posterior and image posterior for the joint latent distribution estimation in LD-CVAE. We test the effectiveness of our method on five different cancer datasets, and the experimental results demonstrate its superiority in both complete and missing modality scenarios.
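
The product-of-experts step mentioned in the abstract has a simple closed form when both experts are diagonal Gaussians: precisions add, and the mean is the precision-weighted average. The sketch below shows only that generic identity. The Gaussian assumption, the omission of any prior expert, and the function name `product_of_gaussians` are assumptions for illustration, not details taken from the paper.

```python
def product_of_gaussians(mu_a, var_a, mu_b, var_b):
    """Combine two diagonal Gaussian experts N(mu_a, var_a) and N(mu_b, var_b).

    The normalized product is Gaussian with
        precision = 1 / var_a + 1 / var_b
        mean      = (mu_a / var_a + mu_b / var_b) / precision
    applied independently to each latent dimension.
    """
    means, variances = [], []
    for ma, va, mb, vb in zip(mu_a, var_a, mu_b, var_b):
        if va <= 0.0 or vb <= 0.0:
            raise ValueError("variances must be positive")
        precision = 1.0 / va + 1.0 / vb
        variances.append(1.0 / precision)
        means.append((ma / va + mb / vb) / precision)
    return means, variances


if __name__ == "__main__":
    # Hypothetical image posterior and genomic posterior over a 2-D latent.
    image_mu, image_var = [0.0, 1.0], [1.0, 4.0]
    gene_mu, gene_var = [2.0, 1.0], [1.0, 4.0]
    print(product_of_gaussians(image_mu, image_var, gene_mu, gene_var))
```

The combined variance is always smaller than either input variance, so the joint estimate is sharper than each expert alone, and the more confident expert pulls the mean toward itself.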

[CV-15] Parameter-Efficient Adaptation of Geospatial Foundation Models through Embedding Deflection

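Quick read: This paper addresses how to adapt pretrained Geospatial Foundation Models (GFMs) at low cost, extending them from RGB satellite images to other types of optical satellite data. The key contribution is DEFLECT, a new strategy that incorporates stronger inductive biases in both data and models: it exploits the strong prior that the pretrained GFM parameters provide for the spatial structure of multispectral images, and adapts the model with very few additional parameters. DEFLECT improves the representation capabilities of the extracted features, particularly enhancing spectral information, which is essential for geoscience and environmental tasks. Experiments show that DEFLECT achieves on-par or higher accuracy than competing methods on classification and segmentation tasks with 5-10x fewer parameters.

Link: https://arxiv.org/abs/2503.09493
Authors: Romain Thoreau,Valerio Marsocci,Dawa Derksen
Affiliations: CNES; European Space Agency
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: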

Abstract:As large-scale heterogeneous data sets become increasingly available, adapting foundation models at low cost has become a key issue. Seminal works in natural language processing, e.g. Low-Rank Adaptation (LoRA), leverage the low “intrinsic rank” of parameter updates during adaptation. In this paper, we argue that incorporating stronger inductive biases in both data and models can enhance the adaptation of Geospatial Foundation Models (GFMs), pretrained on RGB satellite images, to other types of optical satellite data. Specifically, the pretrained parameters of GFMs serve as a strong prior for the spatial structure of multispectral images. For this reason, we introduce DEFLECT (Deflecting Embeddings for Finetuning Latent representations for Earth and Climate Tasks), a novel strategy for adapting GFMs to multispectral satellite imagery with very few additional parameters. DEFLECT improves the representation capabilities of the extracted features, particularly enhancing spectral information, which is essential for geoscience and environmental-related tasks. We demonstrate the effectiveness of our method across three different GFMs and five diverse datasets, ranging from forest monitoring to marine environment segmentation. Compared to competing methods, DEFLECT achieves on-par or higher accuracy with 5-10x fewer parameters for classification and segmentation tasks. The code will be made publicly available.
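
For readers unfamiliar with the LoRA baseline the abstract refers to, the sketch below shows the standard low-rank update y = W x + scale * B (A x) and how few parameters it trains compared with a full weight matrix. This illustrates LoRA only, not DEFLECT; the function names, the `scale` argument, and the layer sizes are assumptions chosen for the example.

```python
def matvec(matrix, vector):
    """Multiply a dense matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, scale=1.0):
    """Frozen weight W plus a trainable low-rank correction B @ A.

    W has shape (d_out, d_in), A has shape (rank, d_in),
    and B has shape (d_out, rank).
    """
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    return [b + scale * u for b, u in zip(base, update)]


def lora_param_counts(d_out, d_in, rank):
    """Return (parameters in full W, trainable parameters in A and B)."""
    return d_out * d_in, rank * (d_out + d_in)


if __name__ == "__main__":
    W = [[1.0, 0.0], [0.0, 1.0]]
    A = [[1.0, 1.0]]    # rank 1
    B = [[1.0], [0.0]]
    print(lora_forward(W, A, B, [2.0, 3.0]))
    print(lora_param_counts(768, 768, 8))
```

For a hypothetical 768 x 768 layer at rank 8, the low-rank factors hold 12,288 trainable parameters against 589,824 in the full matrix, which is the kind of saving parameter-efficient adaptation methods compete on.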

[CV-16] DAMM-Diffusion: Learning Divergence-Aware Multi-Modal Diffusion Model for Nanoparticles Distribution Prediction CVPR2025

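Quick read: This paper addresses the prediction of nanoparticle (NP) distribution in tumors, focusing on how the multi-modal heterogeneity of the tumor microenvironment (TME) affects that prediction. Distribution divergence among multi-modal TME components may cause side effects, in that the best uni-modal model may outperform the joint generative model. To address this, the paper proposes the Divergence-Aware Multi-Modal Diffusion model (DAMM-Diffusion), which adaptively generates predictions from uni-modal and multi-modal branches in a unified network. The uni-modal branch uses a U-Net architecture, while the multi-modal branch extends it with two novel fusion modules, the Multi-Modal Fusion Module (MMFM) and the Uncertainty-Aware Fusion Module (UAFM), which fuse features from multiple modalities and learn an uncertainty map. In addition, a Divergence-Aware Multi-Modal Predictor (DAMMP) module assesses the consistency of the multi-modal data with the uncertainty map and thereby determines whether the final prediction comes from the uni-modal or the multi-modal branch. Experiments show that DAMM-Diffusion predicts NP distribution more accurately, and results on a multi-modal brain image synthesis task further validate its effectiveness.

Link: https://arxiv.org/abs/2503.09491
Authors: Junjie Zhou,Shouju Wang,Yuxia Tang,Qi Zhu,Daoqiang Zhang,Wei Shao
Affiliations: The College of Artificial Intelligence, Nanjing University of Aeronautics and Astronautics; The Key Laboratory of Brain-Machine Intelligence Technology, Ministry of Education; Department of Radiology, Nanjing Medical University
Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments: Accepted by CVPR 2025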

Abstract:The prediction of nanoparticles (NPs) distribution is crucial for the diagnosis and treatment of tumors. Recent studies indicate that the heterogeneity of tumor microenvironment (TME) highly affects the distribution of NPs across tumors. Hence, it has become a research hotspot to generate the NPs distribution with the aid of multi-modal TME components. However, the distribution divergence among multi-modal TME components may cause side effects, i.e., the best uni-modal model may outperform the joint generative model. To address the above issues, we propose a Divergence-Aware Multi-Modal Diffusion model (i.e., DAMM-Diffusion) to adaptively generate the prediction results from uni-modal and multi-modal branches in a unified network. In detail, the uni-modal branch is composed of the U-Net architecture while the multi-modal branch extends it by introducing two novel fusion modules, i.e., Multi-Modal Fusion Module (MMFM) and Uncertainty-Aware Fusion Module (UAFM). Specifically, the MMFM is proposed to fuse features from multiple modalities, while the UAFM module is introduced to learn the uncertainty map for cross-attention computation. Following the individual prediction results from each branch, the Divergence-Aware Multi-Modal Predictor (DAMMP) module is proposed to assess the consistency of multi-modal data with the uncertainty map, which determines whether the final prediction results come from multi-modal or uni-modal predictions. We predict the NPs distribution given the TME components of tumor vessels and cell nuclei, and the experimental results show that DAMM-Diffusion can generate the distribution of NPs with higher accuracy than the compared methods. Additional results on the multi-modal brain image synthesis task further validate the effectiveness of the proposed method.

[CV-17] Project-Probe-Aggregate: Efficient Fine-Tuning for Group Robustness CVPR2025

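Quick read: This paper addresses the degraded performance of image-text foundation models in the presence of spurious correlations between the input and the label. It proposes a three-step approach, Project-Probe-Aggregate (PPA), that enables parameter-efficient fine-tuning of foundation models without relying on group annotations. PPA improves the two key components of the failure-based debiasing scheme: minority sample identification and the robust training algorithm. Specifically, it first trains biased classifiers by projecting image features onto the nullspace of class proxies from text encoders; it then infers group labels using the biased classifier and probes group targets with prior correction; finally, it aggregates the group weights of each class to produce the debiased classifier. Theoretical analysis shows that PPA enhances minority group identification and is Bayes optimal for minimizing the balanced group error, thereby mitigating spurious correlations. Extensive experiments confirm its effectiveness: it outperforms the state of the art in average worst-group accuracy while requiring less than 0.01% tunable parameters.

Link: https://arxiv.org/abs/2503.09487
Authors: Beier Zhu,Jiequan Cui,Hanwang Zhang,Chi Zhang
Affiliations: Nanyang Technological University; Westlake University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by CVPR 2025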

Abstract:While image-text foundation models have succeeded across diverse downstream tasks, they still face challenges in the presence of spurious correlations between the input and label. To address this issue, we propose a simple three-step approach, Project-Probe-Aggregate (PPA), that enables parameter-efficient fine-tuning for foundation models without relying on group annotations. Building upon the failure-based debiasing scheme, our method, PPA, improves its two key components: minority sample identification and the robust training algorithm. Specifically, we first train biased classifiers by projecting image features onto the nullspace of class proxies from text encoders. Next, we infer group labels using the biased classifier and probe group targets with prior correction. Finally, we aggregate group weights of each class to produce the debiased classifier. Our theoretical analysis shows that our PPA enhances minority group identification and is Bayes optimal for minimizing the balanced group error, mitigating spurious correlations. Extensive experimental results confirm the effectiveness of our PPA: it outperforms the state-of-the-art by an average worst-group accuracy while requiring less than 0.01% tunable parameters without training group labels.
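
The "project" step in the abstract, projecting image features onto the nullspace of class proxies, can be sketched with basic linear algebra: orthonormalize the proxies, then subtract each feature's components along them. This is a minimal pure-Python illustration of that projection only; the helper names and the toy vectors are assumptions, and the paper's actual pipeline (encoders, probing, aggregation) is not reproduced here.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def orthonormal_basis(vectors, eps=1e-12):
    """Gram-Schmidt: orthonormal basis for the span of `vectors`."""
    basis = []
    for v in vectors:
        w = list(v)
        for q in basis:
            c = dot(w, q)
            w = [wi - c * qi for wi, qi in zip(w, q)]
        norm = math.sqrt(dot(w, w))
        if norm > eps:  # skip vectors already (numerically) in the span
            basis.append([wi / norm for wi in w])
    return basis


def project_onto_nullspace(feature, proxies):
    """Remove from `feature` every component lying in the span of `proxies`.

    The result is orthogonal to each proxy, i.e. it lies in the nullspace
    of the matrix whose rows are the proxies.
    """
    residual = list(feature)
    for q in orthonormal_basis(proxies):
        c = dot(residual, q)
        residual = [ri - c * qi for ri, qi in zip(residual, q)]
    return residual


if __name__ == "__main__":
    class_proxies = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0]]  # hypothetical text proxies
    image_feature = [2.0, 3.0, 4.0]
    print(project_onto_nullspace(image_feature, class_proxies))
```

After the projection the feature carries no information along the class-proxy directions, so a classifier trained on it has to rely on whatever else remains, which is how the abstract describes obtaining a deliberately biased classifier.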

[CV-18] Learning Spatially Adaptive \ell_1-Norms Weights for Convolutional Synthesis Regularization

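Quick read: This paper addresses learning spatially adaptive parameter maps for image reconstruction in low-field magnetic resonance imaging (MRI). The key idea is an unrolled algorithm approach that estimates spatially varying parameters within convolutional synthesis-based \ell_1 regularization. Specifically, the Fast Iterative Shrinkage-Thresholding Algorithm (FISTA) is unrolled to solve the underlying sparse estimation problem; using pre-trained convolutional filters, deeply parametrized spatially varying parameters applied to the sparse feature maps are estimated. The approach produces visual and quantitative results comparable to existing Total Variation regularization methods and a model-based deep learning approach while remaining highly interpretable: the inferred parameter maps quantify the local contribution of each filter to the reconstruction, which provides insight into the algorithm mechanism and could potentially be used to discard unsuited filters.

Link: https://arxiv.org/abs/2503.09483
Authors: Andreas Kofler,Luca Calatroni,Christoph Kolbitsch,Kostas Papafitsoros
Affiliations: Physikalisch-Technische Bundesanstalt (PTB); MaLGa Center, DIBRIS, Università di Genova; MMS, Istituto Italiano di Tecnologia; School of Mathematical Sciences, Queen Mary University of London
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Optimization and Control (math.OC)
Comments: To be submitted to the EUSIPCO 2025 conference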

Abstract:We propose an unrolled algorithm approach for learning spatially adaptive parameter maps in the framework of convolutional synthesis-based \ell_1 regularization. More precisely, we consider a family of pre-trained convolutional filters and estimate deeply parametrized spatially varying parameters applied to the sparse feature maps by means of unrolling a FISTA algorithm to solve the underlying sparse estimation problem. The proposed approach is evaluated for image reconstruction of low-field MRI and compared to spatially adaptive and non-adaptive analysis-type procedures relying on Total Variation regularization and to a well-established model-based deep learning approach. We show that the proposed approach produces visually and quantitatively comparable results with the latter approaches and at the same time remains highly interpretable. In particular, the inferred parameter maps quantify the local contribution of each filter in the reconstruction, which provides valuable insight into the algorithm mechanism and could potentially be used to discard unsuited filters.
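
The building block being unrolled can be sketched as FISTA for a weighted \ell_1 problem, min_x 0.5 * ||A x - y||^2 + sum_i lambda_i * |x_i|, where each coefficient has its own threshold lambda_i. The version below uses a small dense matrix in place of the paper's convolutional synthesis operator and takes the weights as fixed inputs rather than predicting them with a network; those simplifications and all names are assumptions for illustration.

```python
import math


def matvec(A, x):
    """Compute A x for a dense matrix stored as a list of rows."""
    return [sum(a * b for a, b in zip(row, x)) for row in A]


def rmatvec(A, r):
    """Compute A^T r for a dense matrix stored as a list of rows."""
    return [sum(A[i][j] * r[i] for i in range(len(A))) for j in range(len(A[0]))]


def soft_threshold(v, thresholds):
    """Element-wise shrinkage with a separate threshold per entry."""
    return [
        math.copysign(max(abs(a) - t, 0.0), a) for a, t in zip(v, thresholds)
    ]


def fista(A, y, lam, step, num_iters=100):
    """FISTA for 0.5 * ||A x - y||^2 + sum_i lam[i] * |x[i]|.

    `step` should not exceed 1 / L, where L is the largest eigenvalue
    of A^T A, for the usual convergence guarantee to apply.
    """
    x = [0.0] * len(A[0])
    z = list(x)
    t = 1.0
    for _ in range(num_iters):
        residual = [p - q for p, q in zip(matvec(A, z), y)]
        grad = rmatvec(A, residual)
        x_new = soft_threshold(
            [zi - step * g for zi, g in zip(z, grad)],
            [step * l for l in lam],
        )
        t_new = (1.0 + math.sqrt(1.0 + 4.0 * t * t)) / 2.0
        momentum = (t - 1.0) / t_new
        z = [xn + momentum * (xn - xo) for xn, xo in zip(x_new, x)]
        x, t = x_new, t_new
    return x


if __name__ == "__main__":
    identity = [[1.0, 0.0], [0.0, 1.0]]
    # A larger weight on the first coefficient shrinks it more strongly.
    print(fista(identity, [3.0, -0.5], lam=[1.0, 0.1], step=1.0, num_iters=20))
```

With an identity operator the minimizer is simply the soft-thresholded data, so the per-entry weights directly control how much each coefficient is suppressed; this is the sense in which a learned weight map can be read as a local measure of each filter's contribution.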

[CV-19] SurgicalVLM-Agent: Towards an Interactive AI Co-Pilot for Pituitary Surgery

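Quick read: This paper addresses the limitations of static AI models in structured task planning and interactive guidance for image-guided surgery. To enable dynamic task planning and predictive decision support, it builds on large vision-language models (VLMs). The key contribution is SurgicalVLM-Agent, an AI co-pilot for pituitary surgery capable of conversation, planning, and task execution. The agent dynamically processes surgeon queries and plans tasks including MRI tumor segmentation, endoscope anatomy segmentation, overlaying preoperative imaging with intraoperative views, instrument tracking, and surgical visual question answering (VQA). The paper also proposes FFT-GaLore, a fast Fourier transform (FFT)-based gradient projection technique for efficient low-rank adaptation, used to fine-tune LLaMA 3.2 for surgical environments. Using the newly built PitAgent dataset, the authors evaluate task planning and prompt generation, and they evaluate zero-shot VQA on a public pituitary dataset; results show state-of-the-art performance in task planning and query interpretation, with highly semantically meaningful VQA responses.

Link: https://arxiv.org/abs/2503.09474
Authors: Jiayuan Huang,Runlong He,Danyal Z. Khan,Evangelos Mazomenos,Danail Stoyanov,Hani J. Marcus,Matthew J. Clarkson,Mobarakol Islam
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 11 pages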

Abstract:Image-guided surgery demands adaptive, real-time decision support, yet static AI models struggle with structured task planning and providing interactive guidance. Large vision-language models (VLMs) offer a promising solution by enabling dynamic task planning and predictive decision support. We introduce SurgicalVLM-Agent, an AI co-pilot for image-guided pituitary surgery, capable of conversation, planning, and task execution. The agent dynamically processes surgeon queries and plans the tasks such as MRI tumor segmentation, endoscope anatomy segmentation, overlaying preoperative imaging with intraoperative views, instrument tracking, and surgical visual question answering (VQA). To enable structured task planning, we develop the PitAgent dataset, a surgical context-aware dataset covering segmentation, overlaying, instrument localization, tool tracking, tool-tissue interactions, phase identification, and surgical activity recognition. Additionally, we propose FFT-GaLore, a fast Fourier transform (FFT)-based gradient projection technique for efficient low-rank adaptation, optimizing fine-tuning for LLaMA 3.2 in surgical environments. We validate SurgicalVLM-Agent by assessing task planning and prompt generation on our PitAgent dataset and evaluating zero-shot VQA using a public pituitary dataset. Results demonstrate state-of-the-art performance in task planning and query interpretation, with highly semantically meaningful VQA responses, advancing AI-driven surgical assistance.

[CV-20] Hybrid Rendering for Multimodal Autonomous Driving: Merging Neural and Physics-Based Simulation

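[Quick read]: This paper addresses a limitation of existing neural reconstruction models for autonomous driving simulation: they typically handle only in-domain dynamic objects that closely follow their original trajectories. It proposes a hybrid approach that combines neural reconstruction with physics-based rendering. The key is a novel training method, NeRF2GS, which combines the generalization capabilities of NeRF-based methods with the real-time rendering speed of 3D Gaussian Splatting (3DGS). A customized NeRF model is trained on the original images with depth regularization and then serves as a teacher model for 3DGS training, supplying accurate depth, surface normals, and camera appearance modeling as supervision. With block-based training parallelization, the method supports large-scale reconstructions (at least 100,000 square meters) and real-time rendering at interactive frame rates, and it significantly improves novel view synthesis quality, especially for road surfaces and lane markings.

Link: https://arxiv.org/abs/2503.09464
Authors: Máté Tóth, Péter Kovács, Zoltán Bendefy, Zoltán Hortsin, Balázs Teréki, Tamás Matuszka
Affiliations: aiMotive
Categories: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
Comments: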

Abstract:Neural reconstruction models for autonomous driving simulation have made significant strides in recent years, with dynamic models becoming increasingly prevalent. However, these models are typically limited to handling in-domain objects closely following their original trajectories. We introduce a hybrid approach that combines the strengths of neural reconstruction with physics-based rendering. This method enables the virtual placement of traditional mesh-based dynamic agents at arbitrary locations, adjustments to environmental conditions, and rendering from novel camera viewpoints. Our approach significantly enhances novel view synthesis quality – especially for road surfaces and lane markings – while maintaining interactive frame rates through our novel training method, NeRF2GS. This technique leverages the superior generalization capabilities of NeRF-based methods and the real-time rendering speed of 3D Gaussian Splatting (3DGS). We achieve this by training a customized NeRF model on the original images with depth regularization derived from a noisy LiDAR point cloud, then using it as a teacher model for 3DGS training. This process ensures accurate depth, surface normals, and camera appearance modeling as supervision. With our block-based training parallelization, the method can handle large-scale reconstructions (greater than or equal to 100,000 square meters) and predict segmentation masks, surface normals, and depth maps. During simulation, it supports a rasterization-based rendering backend with depth-based composition and multiple camera models for real-time camera simulation, as well as a ray-traced backend for precise LiDAR simulation.

[CV-21] Online Language Splatting

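[Quick read]: This paper addresses how AI agents that interact with humans and 3D environments can efficiently map natural language onto 3D spatial representations. Prior methods made progress by integrating language features into scene representations based on 3D Gaussian Splatting (GS), but they rely on computationally intensive offline preprocessing for each input image, which limits adaptability to new environments. To overcome this, the paper proposes Online Language Splatting, the first framework to achieve online, near real-time, open-vocabulary language mapping within a 3DGS-SLAM system without pre-generated language features.

The key to the solution is efficiently fusing high-dimensional language features into the 3D representation while balancing computation speed, memory usage, rendering quality, and open-vocabulary capability. To this end, the authors design three modules: (1) a high-resolution CLIP embedding module that generates detailed language feature maps in 18 ms per frame; (2) a two-stage online auto-encoder that compresses 768-dimensional CLIP features to 15 dimensions while preserving open-vocabulary capability; and (3) a color-language disentangled optimization approach that improves rendering quality. Experiments show that the online method not only surpasses state-of-the-art offline methods in accuracy but also achieves an efficiency gain of more than 40x, demonstrating its potential for dynamic and interactive AI applications.

Link: https://arxiv.org/abs/2503.09447
Authors: Saimouli Katragadda, Cho-Ying Wu, Yuliang Guo, Xinyu Huang, Guoquan Huang, Liu Ren
Affiliations: University of Delaware; Bosch Research North America & Bosch Center for Artificial Intelligence (BCAI)
Categories: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: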

Abstract:To enable AI agents to interact seamlessly with both humans and 3D environments, they must not only perceive the 3D world accurately but also align human language with 3D spatial representations. While prior work has made significant progress by integrating language features into geometrically detailed 3D scene representations using 3D Gaussian Splatting (GS), these approaches rely on computationally intensive offline preprocessing of language features for each input image, limiting adaptability to new environments. In this work, we introduce Online Language Splatting, the first framework to achieve online, near real-time, open-vocabulary language mapping within a 3DGS-SLAM system without requiring pre-generated language features. The key challenge lies in efficiently fusing high-dimensional language features into 3D representations while balancing the computation speed, memory usage, rendering quality and open-vocabulary capability. To this end, we innovatively design: (1) a high-resolution CLIP embedding module capable of generating detailed language feature maps in 18ms per frame, (2) a two-stage online auto-encoder that compresses 768-dimensional CLIP features to 15 dimensions while preserving open-vocabulary capabilities, and (3) a color-language disentangled optimization approach to improve rendering quality. Experimental results show that our online method not only surpasses the state-of-the-art offline methods in accuracy but also achieves more than 40x efficiency boost, demonstrating the potential for dynamic and interactive AI applications.

[CV-22] Sparse Autoencoder as a Zero-Shot Classifier for Concept Erasing in Text-to-Image Diffusion Models

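[Quick read]: This paper tackles the problem of precisely removing specific concepts from text-to-image (T2I) diffusion models while preserving overall generation performance. Existing methods can erase unwanted concepts, but they typically degrade performance on normal generation tasks. To address this, the paper proposes a new framework called Interpret then Deactivate (ItD). The key is to use a sparse autoencoder (SAE) to interpret each concept as a combination of multiple features and to permanently deactivate the specific features associated with target concepts. The SAE can then act as a zero-shot classifier that decides whether an input prompt contains a target concept, enabling precise, selective concept erasure in the diffusion model. ItD also extends to erasing multiple concepts without further training, removes target concepts without interfering with the generation of normal concepts, and is robust against adversarial prompts.

Link: https://arxiv.org/abs/2503.09446
Authors: Zhihua Tian, Sirun Nan, Ming Xu, Shengfang Zhai, Wenjie Qu, Jian Liu, Kui Ren, Ruoxi Jia, Jiaheng Zhang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Comments: 25 pages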

Abstract:Text-to-image (T2I) diffusion models have achieved remarkable progress in generating high-quality images but also raise people’s concerns about generating harmful or misleading content. While extensive approaches have been proposed to erase unwanted concepts without requiring retraining from scratch, they inadvertently degrade performance on normal generation tasks. In this work, we propose Interpret then Deactivate (ItD), a novel framework to enable precise concept removal in T2I diffusion models while preserving overall performance. ItD first employs a sparse autoencoder (SAE) to interpret each concept as a combination of multiple features. By permanently deactivating the specific features associated with target concepts, we repurpose SAE as a zero-shot classifier that identifies whether the input prompt includes target concepts, allowing selective concept erasure in diffusion models. Moreover, we demonstrate that ItD can be easily extended to erase multiple concepts without requiring further training. Comprehensive experiments across celebrity identities, artistic styles, and explicit content demonstrate ItD’s effectiveness in eliminating targeted concepts without interfering with normal concept generation. Additionally, ItD is also robust against adversarial prompts designed to circumvent content filters. Code is available at: this https URL.

[CV-23] Astrea: A MOE-based Visual Understanding Model with Progressive Alignment

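[Quick read]: This paper addresses the challenges that vision-language models (VLMs) built on Mixture-of-Experts (MoE) architectures face with task heterogeneity and expert load imbalance. As tasks grow more complex and diverse, balancing load across heterogeneous visual experts becomes especially difficult, and optimizing one expert's performance often comes at the expense of the others. To address this, the paper proposes Astrea, a novel multi-expert collaborative VLM architecture whose core is progressive pre-alignment. Its key innovations are: 1) a heterogeneous expert coordination mechanism that integrates four specialized models (detection, segmentation, classification, captioning) into a comprehensive expert matrix covering the essential elements of visual comprehension; 2) a dynamic knowledge fusion strategy that uses contrastive learning for progressive pre-alignment and harmonizes experts within the VLM latent space, complemented by probabilistically activated stochastic residual connections that preserve knowledge continuity; and 3) an enhanced optimization framework that uses momentum contrastive learning for long-range dependency modeling and adaptive weight allocators for real-time calibration of expert contributions. Together, these allow Astrea to outperform existing state-of-the-art models on 12 benchmark tasks, with an average performance gain of +4.7%.

Link: https://arxiv.org/abs/2503.09445
Authors: Xiaoda Yang, JunYu Lu, Hongshun Qiu, Sijing Li, Hao Li, Shengpeng Ji, Xudong Tang, Jiayang Xu, Jiaqi Duan, Ziyue Jiang, Cong Lin, Sihang Cai, Zejian Xie, Zhuoyang Song, Songxin Zhang
Affiliations: Zhejiang University; Beijing University of Technology; Shanghai Artificial Intelligence Laboratory; Hong Kong Polytechnic University; Qingdao University; Southern University of Science and Technology
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: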

Abstract:Vision-Language Models (VLMs) based on Mixture-of-Experts (MoE) architectures have emerged as a pivotal paradigm in multimodal understanding, offering a powerful framework for integrating visual and linguistic information. However, the increasing complexity and diversity of tasks present significant challenges in coordinating load balancing across heterogeneous visual experts, where optimizing one specialist’s performance often compromises others’ capabilities. To address task heterogeneity and expert load imbalance, we propose Astrea, a novel multi-expert collaborative VLM architecture based on progressive pre-alignment. Astrea introduces three key innovations: 1) A heterogeneous expert coordination mechanism that integrates four specialized models (detection, segmentation, classification, captioning) into a comprehensive expert matrix covering essential visual comprehension elements; 2) A dynamic knowledge fusion strategy featuring progressive pre-alignment to harmonize experts within the VLM latent space through contrastive learning, complemented by probabilistically activated stochastic residual connections to preserve knowledge continuity; 3) An enhanced optimization framework utilizing momentum contrastive learning for long-range dependency modeling and adaptive weight allocators for real-time expert contribution calibration. Extensive evaluations across 12 benchmark tasks spanning VQA, image captioning, and cross-modal retrieval demonstrate Astrea’s superiority over state-of-the-art models, achieving an average performance gain of +4.7%. This study provides the first empirical demonstration that progressive pre-alignment strategies enable VLMs to overcome task heterogeneity limitations, establishing new methodological foundations for developing general-purpose multimodal agents.

[CV-24] SuperCarver: Texture-Consistent 3D Geometry Super-Resolution for High-Fidelity Surface Detail Generation

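[Quick read]: This paper addresses the time-consuming, laborious manual sculpting required by traditional production workflows for high-precision 3D mesh assets. Although AI-based 3D content creation has advanced considerably in recent years, the mesh surfaces produced by current state-of-the-art methods are typically over-smoothed and lack geometric detail. The paper therefore introduces SuperCarver, a 3D geometry super-resolution framework designed to add texture-consistent surface details to a given coarse mesh. The key steps are: first, rendering the original textured mesh into the image domain from multiple viewpoints; second, developing a deterministic prior-guided normal diffusion model, fine-tuned on a carefully curated dataset of paired low-poly and high-poly normal renderings, to generate geometric details; and finally, designing a simple yet effective noise-resistant inverse rendering scheme based on distance field deformation to optimize the mesh structure from potentially imperfect normal map predictions. Experiments show that SuperCarver generates realistic and expressive surface details consistent with the given texture appearance, making it a powerful tool for automatically upgrading large numbers of outdated low-quality assets and shortening the iteration cycle of high-quality mesh production.

Link: https://arxiv.org/abs/2503.09439
Authors: Qijian Zhang, Xiaozheng Jian, Xuan Zhang, Wenping Wang, Junhui Hou
Affiliations: Tencent Games; Texas A&M University; City University of Hong Kong
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: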

Abstract:The traditional production workflow of high-precision 3D mesh assets necessitates a cumbersome and laborious process of manual sculpting by specialized modelers. Recent years have witnessed remarkable advances in AI-empowered 3D content creation. However, although the latest state-of-the-art methods are already capable of generating plausible structures and intricate appearances from images or text prompts, the actual mesh surfaces are typically over-smoothed and lack geometric details. This paper introduces SuperCarver, a 3D geometry super-resolution framework particularly tailored for adding texture-consistent surface details to given coarse meshes. Technically, we start by rendering the original textured mesh into the image domain from multiple viewpoints. To achieve geometric detail generation, we develop a deterministic prior-guided normal diffusion model fine-tuned on a carefully curated dataset of paired low-poly and high-poly normal renderings. To optimize mesh structures from potentially imperfect normal map predictions, we design a simple yet effective noise-resistant inverse rendering scheme based on distance field deformation. Extensive experiments show that SuperCarver generates realistic and expressive surface details as depicted by specific texture appearances, making it a powerful tool for automatically upgrading massive outdated low-quality assets and shortening the iteration cycle of high-quality mesh production in practical applications.

[CV-25] Efficient Alignment of Unconditioned Action Prior for Language-conditioned Pick and Place in Clutter

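[Quick read]: This paper addresses language-conditioned pick and place in clutter, where a robot must grasp a target object in open clutter and move it to a specified location. Existing methods either train end-to-end policies on large datasets or combine foundation models in a zero-shot setting; they suffer from heavy data requirements and cascading errors respectively, and pay little attention to action priors. The paper therefore proposes building an effective policy by integrating foundation priors from vision, language, and action. The key is A², an action prior alignment method that aligns unconditioned action priors with 3D vision-language priors by learning a single attention layer. This alignment lets the policy train with less data while preserving zero-shot generalization. A policy shared between pick and place actions improves performance on each task, and a policy adaptation scheme accommodates the multi-modal nature of actions. Experiments show that the method achieves higher task success rates with fewer steps in both simulation and the real world, and generalizes effectively to unseen objects and language instructions.

Link: https://arxiv.org/abs/2503.09423
Authors: Kechun Xu, Xunlong Xia, Kaixuan Wang, Yifei Yang, Yunxuan Mao, Bing Deng, Rong Xiong, Yue Wang
Affiliations: Zhejiang University, Hangzhou, China; Alibaba Cloud, Hangzhou, China
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: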

Abstract:We study the task of language-conditioned pick and place in clutter, where a robot should grasp a target object in open clutter and move it to a specified place. Some approaches learn end-to-end policies with features from vision foundation models, requiring large datasets. Others combine foundation models in a zero-shot setting, suffering from cascading errors. In addition, they primarily leverage vision and language foundation models, focusing less on action priors. In this paper, we aim to develop an effective policy by integrating foundation priors from vision, language, and action. We propose A², an action prior alignment method that aligns unconditioned action priors with 3D vision-language priors by learning one attention layer. The alignment formulation enables our policy to train with less data and preserve zero-shot generalization capabilities. We show that a shared policy for both pick and place actions enhances the performance for each task, and introduce a policy adaptation scheme to accommodate the multi-modal nature of actions. Extensive experiments in simulation and the real world show that our policy achieves higher task success rates with fewer steps for both pick and place tasks in clutter, effectively generalizing to unseen objects and language instructions.

[CV-26] Alias-Free Latent Diffusion Models: Improving Fractional Shift Equivariance of Diffusion Latent Space

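[Quick read]: This paper addresses the instability of the generation process in Latent Diffusion Models (LDMs): small perturbations of the input noise can produce significantly different outputs, which limits their use in applications that require consistent results. The key solution is to redesign LDMs to be more shift-equivariant. Introducing anti-aliasing operations partially improves shift-equivariance, but significant aliasing and inconsistency remain because of challenges specific to LDMs, namely aliasing amplification during variational autoencoder (VAE) training and multiple U-Net inferences, and self-attention modules that inherently lack shift-equivariance. The paper therefore redesigns the attention modules to be shift-equivariant and proposes an equivariance loss that effectively suppresses the frequency bandwidth of features in the continuous domain. The resulting alias-free LDM (AF-LDM) achieves strong shift-equivariance, is robust to irregular warping, and produces significantly more consistent results in applications such as video editing and image-to-image translation.

Link: https://arxiv.org/abs/2503.09419
Authors: Yifan Zhou, Zeqi Xiao, Shuai Yang, Xingang Pan
Affiliations: S-Lab, Nanyang Technological University; Wangxuan Institute of Computer Technology, Peking University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: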

Abstract:Latent Diffusion Models (LDMs) are known to have an unstable generation process, where even small perturbations or shifts in the input noise can lead to significantly different outputs. This hinders their applicability in applications requiring consistent results. In this work, we redesign LDMs to enhance consistency by making them shift-equivariant. While introducing anti-aliasing operations can partially improve shift-equivariance, significant aliasing and inconsistency persist due to the unique challenges in LDMs, including 1) aliasing amplification during VAE training and multiple U-Net inferences, and 2) self-attention modules that inherently lack shift-equivariance. To address these issues, we redesign the attention modules to be shift-equivariant and propose an equivariance loss that effectively suppresses the frequency bandwidth of the features in the continuous domain. The resulting alias-free LDM (AF-LDM) achieves strong shift-equivariance and is also robust to irregular warping. Extensive experiments demonstrate that AF-LDM produces significantly more consistent results than vanilla LDM across various applications, including video editing and image-to-image translation. Code is available at: this https URL
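
The notion of shift-equivariance mentioned in the abstract can be illustrated with a toy 1D example. The sketch below is not AF-LDM and only covers integer shifts (the paper targets fractional shifts in a latent space); the function names and the sample signal are hypothetical. It shows that a circular convolution commutes with a shift, while naive strided downsampling (a common source of aliasing) does not.

```python
def circular_shift(signal, k):
    """Shift a list to the right by k positions, wrapping around."""
    n = len(signal)
    return [signal[(i - k) % n] for i in range(n)]


def circular_conv(signal, kernel):
    """Circular convolution of a signal with a short kernel."""
    n = len(signal)
    return [
        sum(kernel[j] * signal[(i - j) % n] for j in range(len(kernel)))
        for i in range(n)
    ]


def downsample(signal, stride=2):
    """Keep every stride-th sample with no low-pass filter (aliasing-prone)."""
    return signal[::stride]


signal = [1, 4, 2, 8, 5, 7, 3, 6]  # hypothetical toy input
kernel = [1, 2, 1]

# Convolution: shifting before or after gives the same result.
conv_is_equivariant = (
    circular_shift(circular_conv(signal, kernel), 1)
    == circular_conv(circular_shift(signal, 1), kernel)
)

# Strided downsampling: a one-sample input shift is not matched by
# any shift of the downsampled output.
down_of_shifted = downsample(circular_shift(signal, 1))
down_of_original = downsample(signal)
downsample_is_equivariant = any(
    down_of_shifted == circular_shift(down_of_original, s)
    for s in range(len(down_of_original))
)

print(conv_is_equivariant, downsample_is_equivariant)
```

In this toy setting the convolution passes the check and the unfiltered downsampling fails it, which is the kind of inconsistency that anti-aliasing operations are meant to reduce.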

[CV-27] OpenVidVRD: Open-Vocabulary Video Visual Relation Detection via Prompt-Driven Semantic Space Alignment

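[Quick read]: This paper addresses the challenges of video visual relation detection (VidVRD) that arise from dynamic content, high annotation costs, and the long-tailed distribution of relation categories. In addition, existing methods based on visual language models (VLMs) often overlook the connections between visual regions and their relations, and applying VLMs directly to video relations is difficult because of the large disparity between images and videos. The paper proposes OpenVidVRD, a novel open-vocabulary VidVRD framework whose key idea is to transfer the rich knowledge and strong capabilities of VLMs to VidVRD through prompt learning. Specifically, it first uses a VLM to extract text representations from region captions generated automatically from the video's regions. It then develops a spatiotemporal refiner module that derives object-level relationship representations by integrating cross-modal spatiotemporal complementary information. Finally, a prompt-driven strategy aligns the semantic spaces, harnessing the semantic understanding of VLMs to improve the overall generalization of OpenVidVRD. Experiments show that the model outperforms existing methods on the public VidVRD and VidOR datasets.

Link: https://arxiv.org/abs/2503.09416
Authors: Qi Liu, Weiying Xue, Yuxiao Wang, Zhenao Wei
Affiliations: South China University of Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: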

Abstract:The video visual relation detection (VidVRD) task is to identify objects and their relationships in videos, which is challenging due to the dynamic content, high annotation costs, and long-tailed distribution of relations. Visual language models (VLMs) help explore open-vocabulary visual relation detection tasks, yet often overlook the connections between various visual regions and their relations. Moreover, using VLMs to directly identify visual relations in videos poses significant challenges because of the large disparity between images and videos. Therefore, we propose a novel open-vocabulary VidVRD framework, termed OpenVidVRD, which transfers VLMs’ rich knowledge and powerful capabilities to improve VidVRD tasks through prompt learning. Specifically, we use a VLM to extract text representations from automatically generated region captions based on the video’s regions. Next, we develop a spatiotemporal refiner module to derive object-level relationship representations in the video by integrating cross-modal spatiotemporal complementary information. Furthermore, a prompt-driven strategy to align semantic spaces is employed to harness the semantic understanding of VLMs, enhancing the overall generalization ability of OpenVidVRD. Extensive experiments conducted on the VidVRD and VidOR public datasets show that the proposed model outperforms existing methods.

[CV-28] Monte Carlo Diffusion for Generalizable Learning-Based RANSAC

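[Quick read]: This paper addresses the limited generalization of existing learning-based RANSAC methods to out-of-distribution data at inference time. These methods are usually trained and tested on data generated by the same algorithms, so they adapt poorly to real noise conditions. To address this, the paper proposes a new diffusion-based paradigm that progressively injects noise into ground-truth data to simulate noisy conditions during training, making learning-based RANSAC more robust. To increase data diversity, it incorporates Monte Carlo sampling, introducing different types of randomness at multiple stages to approximate diverse data distributions. The key is this Monte Carlo diffusion mechanism, which significantly improves the generalization ability of learning-based RANSAC; comprehensive ablation studies verify the effectiveness of each component of the framework.

Link: https://arxiv.org/abs/2503.09410
Authors: Jiale Wang, Chen Zhao, Wei Ke, Tong Zhang
Affiliations: Xi’an Jiaotong University; EPFL; University of Chinese Academy of Sciences
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: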

Abstract:Random Sample Consensus (RANSAC) is a fundamental approach for robustly estimating parametric models from noisy data. Existing learning-based RANSAC methods utilize deep learning to enhance the robustness of RANSAC against outliers. However, these approaches are trained and tested on the data generated by the same algorithms, leading to limited generalization to out-of-distribution data during inference. Therefore, in this paper, we introduce a novel diffusion-based paradigm that progressively injects noise into ground-truth data, simulating the noisy conditions for training learning-based RANSAC. To enhance data diversity, we incorporate Monte Carlo sampling into the diffusion paradigm, approximating diverse data distributions by introducing different types of randomness at multiple stages. We evaluate our approach in the context of feature matching through comprehensive experiments on the ScanNet and MegaDepth datasets. The experimental results demonstrate that our Monte Carlo diffusion mechanism significantly improves the generalization ability of learning-based RANSAC. We also develop extensive ablation studies that highlight the effectiveness of key components in our framework.
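
For readers unfamiliar with the baseline, the following is a minimal sketch of classical RANSAC for 2D line fitting. It is the textbook procedure, not the learning-based or Monte Carlo diffusion method from the paper; the synthetic data, threshold, iteration count, and function names are all assumptions made for illustration.

```python
import random


def fit_line(p, q):
    """Return (slope, intercept) of the line through p and q, or None if vertical."""
    (x1, y1), (x2, y2) = p, q
    if x1 == x2:
        return None
    slope = (y2 - y1) / (x2 - x1)
    return slope, y1 - slope * x1


def ransac_line(points, iterations=200, threshold=0.5, seed=0):
    """Classical RANSAC: sample minimal sets, keep the model with most inliers."""
    rng = random.Random(seed)
    best_model, best_inliers = None, -1
    for _ in range(iterations):
        model = fit_line(*rng.sample(points, 2))
        if model is None:
            continue
        slope, intercept = model
        inliers = sum(
            1 for x, y in points if abs(y - (slope * x + intercept)) < threshold
        )
        if inliers > best_inliers:
            best_model, best_inliers = model, inliers
    return best_model, best_inliers


# Hypothetical data: 30 points on y = 2x + 1 plus 10 scattered outliers.
inlier_points = [(float(x), 2.0 * x + 1.0) for x in range(30)]
outlier_points = [(float(x), 60.0 + 13.0 * ((7 * x) % 5)) for x in range(0, 30, 3)]
points = inlier_points + outlier_points

model, count = ransac_line(points)
print(model, count)
```

Because a quarter of the points are gross outliers, an ordinary least-squares fit would be pulled away from the true line, whereas the consensus step here selects the model supported by the 30 collinear points.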

[CV-29] Diff-CL: A Novel Cross Pseudo-Supervision Method for Semi-supervised Medical Image Segmentation

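[Quick read]: This paper addresses the insufficient modeling of the overall data distribution by semi-supervised learning in medical image segmentation, which makes robust and accurate segmentation difficult when only a small amount of labeled data is available. The key is to combine the strengths of diffusion models (DM) and convolutional neural networks (CNNs) in a novel semi-supervised framework designed from the data distribution perspective, Diff-CL. Specifically, a cross-pseudo-supervision learning mechanism combines the diffusion model's ability to learn the data distribution with the convolutional network's ability to correct fine details; a high-frequency Mamba module captures boundary and detail information globally; and contrastive learning propagates labels from labeled to unlabeled data. The method achieves state-of-the-art performance on three datasets: left atrium, brain tumor, and NIH pancreas.

Link: https://arxiv.org/abs/2503.09408
Authors: Xiuzhen Guo, Lianyuan Yu, Ji Shi, Na Lei, Hongxiao Wang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: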

Abstract:Semi-supervised learning utilizes insights from unlabeled data to improve model generalization, thereby reducing reliance on large labeled datasets. Most existing studies focus on limited samples and fail to capture the overall data distribution. We contend that combining distributional information with detailed information is crucial for achieving more robust and accurate segmentation results. On the one hand, with their robust generative capabilities, diffusion models (DM) learn the data distribution effectively. However, they struggle to capture fine details, leading to generated images with misleading details. Combining DM with convolutional neural networks (CNNs) enables the former to learn the data distribution while the latter corrects fine details. However, capturing complete high-frequency details with CNNs requires substantial computational resources and is susceptible to local noise. On the other hand, given that both labeled and unlabeled data come from the same distribution, we believe that regions in unlabeled data that are similar to the overall class semantics of labeled data are likely to belong to the same class, while regions with minimal similarity are less likely to. This work introduces a semi-supervised medical image segmentation framework from the distribution perspective (Diff-CL). Firstly, we propose a cross-pseudo-supervision learning mechanism between diffusion and convolution segmentation networks. Secondly, we design a high-frequency mamba module to capture boundary and detail information globally. Finally, we apply contrastive learning for label propagation from labeled to unlabeled data. Our method achieves state-of-the-art (SOTA) performance across three datasets, including left atrium, brain tumor, and NIH pancreas datasets.

[CV-30] Multi-Agent Image Restoration

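[Quick read]: This paper addresses the challenge that complex mixed degradations pose for image restoration (IR); existing methods struggle to handle multiple degradation types at once and are inefficient. The key contribution is MAIR, a Multi-Agent approach for complex IR problems. It introduces a three-stage restoration framework built on a real-world degradation prior: degradations are grouped into scene, imaging, and compression types and are reversed in the opposite order to that in which they occur. The design emulates a team of collaborating human specialists, with a "scheduler" responsible for overall planning and several "experts" each dedicated to a specific degradation. This multi-agent architecture significantly reduces the search space and trial effort, improving image quality while lowering inference cost. A registry mechanism also makes it easy to integrate new tools. Experiments show that MAIR outperforms the previous agentic image restoration system in both performance and efficiency on synthetic and real-world datasets.

Link: https://arxiv.org/abs/2503.09403
Authors: Xu Jiang, Gehui Li, Bin Chen, Jian Zhang
Affiliations: School of Electronic and Computer Engineering, Peking University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: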

Abstract:Image restoration (IR) is challenging due to the complexity of real-world degradations. While many specialized and all-in-one IR models have been developed, they fail to effectively handle complex, mixed degradations. Recent agentic methods RestoreAgent and AgenticIR leverage intelligent, autonomous workflows to alleviate this issue, yet they suffer from suboptimal results and inefficiency due to their resource-intensive finetunings, and ineffective searches and tool execution trials for satisfactory outputs. In this paper, we propose MAIR, a novel Multi-Agent approach for complex IR problems. We introduce a real-world degradation prior, categorizing degradations into three types: (1) scene, (2) imaging, and (3) compression, which are observed to occur sequentially in real world, and reverse them in the opposite order. Built upon this three-stage restoration framework, MAIR emulates a team of collaborative human specialists, including a “scheduler” for overall planning and multiple “experts” dedicated to specific degradations. This design minimizes search space and trial efforts, improving image quality while reducing inference costs. In addition, a registry mechanism is introduced to enable easy integration of new tools. Experiments on both synthetic and real-world datasets show that proposed MAIR achieves competitive performance and improved efficiency over the previous agentic IR system. Code and models will be made available.
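
The ordering idea in the abstract (degradations arise as scene, then imaging, then compression, and are undone in the opposite order) together with a tool registry can be sketched as follows. This is a toy illustration only: the degradation names, tool functions, and registry layout are hypothetical and are not taken from the MAIR code, and the "tools" merely record their names instead of processing images.

```python
# Order in which degradations are assumed to occur in the real world.
DEGRADATION_ORDER = ["scene", "imaging", "compression"]

# Hypothetical registry: degradation name -> (stage, restoration tool).
REGISTRY = {}


def register(degradation, stage):
    """Decorator that adds a restoration tool to the registry."""
    def decorator(tool):
        REGISTRY[degradation] = (stage, tool)
        return tool
    return decorator


@register("haze", "scene")
def dehaze(log):
    return log + ["dehaze"]


@register("noise", "imaging")
def denoise(log):
    return log + ["denoise"]


@register("jpeg", "compression")
def remove_jpeg_artifacts(log):
    return log + ["remove JPEG artifacts"]


def schedule(detected):
    """Sort detected degradations so the last-occurring stage is restored first."""
    restoration_order = list(reversed(DEGRADATION_ORDER))
    return sorted(detected, key=lambda d: restoration_order.index(REGISTRY[d][0]))


def restore(detected):
    """Apply the registered tools in scheduled order; returns the applied steps."""
    log = []
    for degradation in schedule(detected):
        log = REGISTRY[degradation][1](log)
    return log


steps = restore(["haze", "jpeg", "noise"])
print(steps)
```

Fixing the stage order in advance means the scheduler never has to search over permutations of tools across stages, which is one plausible reading of how such a prior shrinks the search space.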

[CV-31] VLog: Video-Language Models by Generative Retrieval of Narration Vocabulary CVPR2025

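[Quick read]: This paper addresses a limitation in how existing generative video-language models define their vocabulary and proposes VLog, a novel video understanding framework. Conventional generative video-language models rely on subword vocabularies, whereas VLog defines video narrations themselves as the vocabulary. The solution rests on three innovations: (i) a generative retrieval model that combines the complex reasoning capabilities of a language model with the efficient similarity search of contrastive retrieval; (ii) a hierarchical vocabulary derived from large-scale video narrations with a narration pair encoding algorithm, which indexes specific events (e.g., cutting a tomato) efficiently by identifying broader scenarios and attaching expressive postfixes; and (iii) a vocabulary update strategy that uses generative models to extend the vocabulary with novel events encountered during inference. Together these improve the conciseness, contextual accuracy, and efficiency of video narration.

Link: https://arxiv.org/abs/2503.09402
Authors: Kevin Qinghong Lin, Mike Zheng Shou
Affiliations: Show Lab, National University of Singapore
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by CVPR 2025. Github: this https URL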

Abstract:Human daily activities can be concisely narrated as sequences of routine events (e.g., turning off an alarm) in video streams, forming an event vocabulary. Motivated by this, we introduce VLog, a novel video understanding framework that defines video narrations as vocabulary, going beyond the typical subword vocabularies in existing generative video-language models. Built on the lightweight language model GPT-2, VLog features three key innovations: (i) A generative retrieval model, marrying language model’s complex reasoning capabilities with contrastive retrieval’s efficient similarity search. (ii) A hierarchical vocabulary derived from large-scale video narrations using our narration pair encoding algorithm, enabling efficient indexing of specific events (e.g., cutting a tomato) by identifying broader scenarios (e.g., kitchen) with expressive postfixes (e.g., by the left hand). (iii) A vocabulary update strategy leveraging generative models to extend the vocabulary for novel events encountered during inference. To validate our approach, we introduce VidCap-Eval, a development set requiring concise narrations with reasoning relationships (e.g., before and after). Experiments on EgoSchema, COIN, and HiREST further demonstrate the effectiveness of VLog, highlighting its ability to generate concise, contextually accurate, and efficient narrations, offering a novel perspective on video understanding. Codes are released at this https URL.

[CV-32] ForAug: Recombining Foregrounds and Backgrounds to Improve Vision Transformer Training with Bias Mitigation

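[Quick read]: This paper addresses the fact that Transformers, and Vision Transformers (ViTs) in particular, need large amounts of data for large-scale image classification and are prone to biases that limit their robustness and generalization. The key solution is ForAug, a novel data augmentation scheme that uses pretrained foundation models to separate foreground objects from their backgrounds and recombine them with different backgrounds, explicitly introducing inductive biases into the training data. The approach increases the diversity of the training data and improves model performance, raising the accuracy of ViTs and other architectures by up to 4.5 percentage points on ImageNet and 7.3 percentage points on downstream tasks. ForAug also introduces new metrics for background robustness, foreground focus, center bias, and size bias, and shows that training on ForNet substantially reduces these biases compared with training on ImageNet.

Link: https://arxiv.org/abs/2503.09399
Authors: Tobias Christian Nauen, Brian Moser, Federico Raue, Stanislav Frolov, Andreas Dengel
Affiliations: RPTU Kaiserslautern-Landau; German Research Center for Artificial Intelligence (DFKI)
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: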

Abstract:Transformers, particularly Vision Transformers (ViTs), have achieved state-of-the-art performance in large-scale image classification. However, they often require large amounts of data and can exhibit biases that limit their robustness and generalizability. This paper introduces ForAug, a novel data augmentation scheme that addresses these challenges and explicitly includes inductive biases, which commonly are part of the neural network architecture, into the training data. ForAug is constructed by using pretrained foundation models to separate and recombine foreground objects with different backgrounds, enabling fine-grained control over image composition during training. It thus increases the data diversity and effective number of training samples. We demonstrate that training on ForNet, the application of ForAug to ImageNet, significantly improves the accuracy of ViTs and other architectures by up to 4.5 percentage points (p.p.) on ImageNet and 7.3 p.p. on downstream tasks. Importantly, ForAug enables novel ways of analyzing model behavior and quantifying biases. Namely, we introduce metrics for background robustness, foreground focus, center bias, and size bias and show that training on ForNet substantially reduces these biases compared to training on ImageNet. In summary, ForAug provides a valuable tool for analyzing and mitigating biases, enabling the development of more robust and reliable computer vision models. Our code and dataset are publicly available at this https URL.
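
The recombination step described above can be sketched as simple mask-based compositing. This is a minimal illustration under strong assumptions: in the paper the foreground is separated by pretrained foundation models, whereas here the mask is written by hand, images are tiny grayscale grids stored as nested lists, and all names and values are hypothetical.

```python
def composite(foreground, background, mask):
    """Paste foreground pixels onto the background wherever mask is 1."""
    return [
        [m * f + (1 - m) * b for f, b, m in zip(f_row, b_row, m_row)]
        for f_row, b_row, m_row in zip(foreground, background, mask)
    ]


# Hypothetical 2x3 grayscale "images"; the mask marks the foreground object.
foreground = [[9, 9, 9], [9, 9, 9]]
mask = [[0, 1, 0], [0, 1, 1]]
backgrounds = {
    "grass": [[1, 1, 1], [1, 1, 1]],
    "sky": [[5, 5, 5], [5, 5, 5]],
}

# One foreground object yields one training sample per background.
augmented = {
    name: composite(foreground, background, mask)
    for name, background in backgrounds.items()
}
print(augmented)
```

Because foreground and background are chosen independently, the same object can be rendered on many backgrounds, which is what allows both the increase in data diversity and the background-robustness measurements the abstract mentions.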

[CV-33] Close-up-GS: Enhancing Close-Up View Synthesis in 3D Gaussian Splatting with Progressive Self-Training

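[Quick read]: This paper addresses the drop in rendering quality of 3D Gaussian Splatting (3DGS) when synthesizing views that deviate far from the training viewpoints, focusing on close-up views that are much nearer to the object than any view in the training set. The problem stems from the model's weak generalization to out-of-distribution scenarios and from the difficulty of interpolating fine details under large resolution changes and occlusions. The solution rests on three key ideas: first, using See3D, a recently introduced 3D-aware generative model, to enhance the details of rendered views; second, a strategy for progressively expanding the "trust regions" of the 3DGS model and updating its set of reference views; and third, a carefully designed fine-tuning strategy that updates the 3DGS model with the self-generated data. Together these improve performance on close-up view generation.

Link: https://arxiv.org/abs/2503.09396
Authors: Jiatong Xia, Lingqiao Liu
Affiliations: Australian Institute for Machine Learning, The University of Adelaide
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: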

Abstract:3D Gaussian Splatting (3DGS) has demonstrated impressive performance in synthesizing novel views after training on a given set of viewpoints. However, its rendering quality deteriorates when the synthesized view deviates significantly from the training views. This decline occurs due to (1) the model’s difficulty in generalizing to out-of-distribution scenarios and (2) challenges in interpolating fine details caused by substantial resolution changes and occlusions. A notable case of this limitation is close-up view generation: producing views that are significantly closer to the object than those in the training set. To tackle this issue, we propose a novel approach for close-up view generation based on progressively training the 3DGS model with self-generated data. Our solution is based on three key ideas. First, we leverage the See3D model, a recently introduced 3D-aware generative model, to enhance the details of rendered views. Second, we propose a strategy to progressively expand the “trust regions” of the 3DGS model and update a set of reference views for See3D. Finally, we introduce a fine-tuning strategy to carefully update the 3DGS model with training data generated from the above schemes. We further define metrics for close-up view evaluation to facilitate better research on this problem. By conducting evaluations on specifically selected scenarios for close-up views, our proposed approach demonstrates a clear advantage over competitive solutions.

[CV-34] Bidirectional Prototype-Reward co-Evolution for Test-Time Adaptation of Vision-Language Models

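Quick read: This paper addresses the challenges of test-time adaptation (TTA) for vision-language models (VLMs), specifically how to maintain model performance under real-world distribution shifts when the source data or target labels are unavailable. Existing methods rely on CLIP's output probability distribution for feature evaluation, which can introduce bias under domain shift and cause features to be misclassified because of text priors or incorrect textual associations. To address these limitations, the paper proposes a new framework called Bidirectional Prototype-Reward co-Evolution (BPRE). The key to BPRE is coupling feature quality assessment with prototype evolution through a synergistic feedback loop: a Multi-Dimensional Quality-Aware Reward Module first evaluates feature quality precisely and guides prototype refinement, and Prototype-Reward Interactive Evolution then keeps improving prototype quality, which in turn makes the multi-dimensional quality-aware reward scores more robust. Through this bidirectional interaction, reward precision and prototype evolution reinforce each other and form a self-evolving cycle.

Link: https://arxiv.org/abs/2503.09394
Authors: Xiaozhen Qiao,Peng Huang,Jiakang Yuan,Xianda Guo,Bowen Ye,Zhe Sun,Xuelong Li
Affiliations: TeleAI; University of Science and Technology of China (中国科学技术大学); Fudan University (复旦大学); Shanghai Jiao Tong University (上海交通大学); Wuhan University (武汉大学); Northwestern Polytechnical University (西北工业大学)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: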


Abstract:Test-time adaptation (TTA) is crucial in maintaining Vision-Language Models (VLMs) performance when facing real-world distribution shifts, particularly when the source data or target labels are inaccessible. Existing TTA methods rely on CLIP’s output probability distribution for feature evaluation, which can introduce biases under domain shifts. This misalignment may cause features to be misclassified due to text priors or incorrect textual associations. To address these limitations, we propose Bidirectional Prototype-Reward co-Evolution (BPRE), a novel TTA framework for VLMs that integrates feature quality assessment with prototype evolution through a synergistic feedback loop. BPRE first employs a Multi-Dimensional Quality-Aware Reward Module to evaluate feature quality and guide prototype refinement precisely. The continuous refinement of prototype quality through Prototype-Reward Interactive Evolution will subsequently enhance the computation of more robust Multi-Dimensional Quality-Aware Reward Scores. Through the bidirectional interaction, the precision of rewards and the evolution of prototypes mutually reinforce each other, forming a self-evolving cycle. Extensive experiments are conducted across 15 diverse recognition datasets encompassing natural distribution shifts and cross-dataset generalization scenarios. Results demonstrate that BPRE consistently achieves superior average performance compared to state-of-the-art methods across different model architectures, such as ResNet-50 and ViT-B/16. By emphasizing comprehensive feature evaluation and bidirectional knowledge refinement, BPRE advances VLM generalization capabilities, offering a new perspective on TTA.

[CV-35] VideoScan: Enabling Efficient Streaming Video Understanding via Frame-level Semantic Carriers

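Quick read: This paper targets the heavy computational overhead that long-duration and real-time video understanding incurs, in particular the cost caused by very long visual token sequences. To this end it proposes VideoScan, an efficient inference framework for vision-language models (VLMs). The key idea is to represent each frame with a single semantic carrier token and to reduce compute and memory overhead progressively across a two-phase inference process, prefilling and decoding. The embedding of the semantic carrier token is obtained from an optimized aggregation of frame-level visual features, so the representation is compact yet semantically rich; the corresponding key-value pairs are trained to retain contextual semantics from earlier frames, which allows efficient memory management without sacrificing temporal coherence. During inference, the visual tokens of each frame are processed once in the prefilling phase and then discarded, which avoids redundant computation. This design keeps VLM inference efficient even under strict real-time constraints. Experiments show that LLaVA-Video equipped with this method achieves speedups of up to about 5× over its original version and 1.29× over existing efficient streaming video understanding methods, while maintaining competitive performance and stable GPU memory consumption (consistently about 18 GB, independent of video duration).

Link: https://arxiv.org/abs/2503.09387
Authors: Ruanjun Li,Yuedong Tan,Yuanming Shi,Jiawei Shao
Affiliations: ShanghaiTech University; Xidian University; TeleAI
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 11 pages, 4 figures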


Abstract:This paper introduces VideoScan, an efficient vision-language model (VLM) inference framework designed for real-time video interaction that effectively comprehends and retains streamed video inputs while delivering rapid and accurate responses. A longstanding challenge in video understanding, particularly for long-term or real-time applications, stems from the substantial computational overhead caused by the extensive length of visual tokens. To address this, VideoScan employs a single semantic carrier token to represent each frame, progressively reducing computational and memory overhead during its two-phase inference process: prefilling and decoding. The embedding of the semantic carrier token is derived from an optimized aggregation of frame-level visual features, ensuring compact yet semantically rich representations. Critically, the corresponding key-value pairs are trained to retain contextual semantics from prior frames, enabling efficient memory management without sacrificing temporal coherence. During inference, the visual tokens of each frame are processed only once during the prefilling phase and subsequently discarded in the decoding stage, eliminating redundant computations. This design ensures efficient VLM inference even under stringent real-time constraints. Comprehensive experiments on diverse offline and online benchmarks demonstrate that LLaVA-Video, supported by our method, achieves up to approximately 5× and 1.29× speedups compared to its original version and previous efficient streaming video understanding approaches, respectively. Crucially, these improvements are attained while maintaining competitive performance and ensuring stable GPU memory consumption (consistently about 18 GB, independent of video duration).

[CV-36] Pig behavior dataset and Spatial-temporal perception and enhancement networks based on the attention mechanism for pig behavior recognition

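Quick read: This paper addresses the lack of publicly available behavior datasets in pig behavior recognition, which limits the development of new algorithms and hurts model robustness. It contributes a dataset covering 13 behaviors that significantly affect pig welfare, and designs attention-based Spatial-Temporal Perception and Enhancement Networks. The model consists of a spatiotemporal perception network and a spatiotemporal feature enhancement network. Its key idea is to model the spatiotemporal features of pig behaviors and their interaction regions in video by establishing connections between pigs and the key regions of their behaviors, and then to strengthen the important spatial features of individual pigs and capture the long-range dependencies of individual behaviors, which improves the model's perception of spatiotemporal changes in pig behavior. On the dataset built in the paper, the proposed model reaches an mAP of 75.92%, an improvement of 8.17% over the best-performing traditional model.

Link: https://arxiv.org/abs/2503.09378
Authors: Fangzheng Qi,Zhenjie Hou,En Lin,Xing Li,iuzhen Liang,Xinwen Zhou
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: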


Abstract:The recognition of pig behavior plays a crucial role in smart farming and welfare assurance for pigs. Currently, in the field of pig behavior recognition, the lack of publicly available behavioral datasets not only limits the development of innovative algorithms but also hampers model robustness and algorithm [...]. This paper proposes a dataset containing 13 pig behaviors that significantly impact pig welfare. Based on this dataset, this paper proposes a spatial-temporal perception and enhancement network based on the attention mechanism to model the spatiotemporal features of pig behaviors and their associated interaction areas in video data. The network is composed of a spatiotemporal perception network and a spatiotemporal feature enhancement network. The spatiotemporal perception network is responsible for establishing connections between the pigs and the key regions of their behaviors in the video data. The spatiotemporal feature enhancement network further strengthens the important spatial features of individual pigs and captures the long-term dependencies of the spatiotemporal features of individual behaviors by remodeling these connections, thereby enhancing the model’s perception of spatiotemporal changes in pig behaviors. Experimental results demonstrate that on the dataset established in this paper, our proposed model achieves an mAP score of 75.92%, which is an 8.17% improvement over the best-performing traditional model. This study not only improves the accuracy and generalizability of individual pig behavior recognition but also provides new technological tools for modern smart farming. The dataset and related code will be made publicly available alongside this paper.

[CV-37] Revisiting Medical Image Retrieval via Knowledge Consolidation

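Quick read: This paper addresses two main problems of existing medical image retrieval methods: they fail to produce representative hash codes from blended embeddings, and they do not handle out-of-distribution (OOD) samples or adversarial attacks effectively. The proposed method consolidates knowledge from hierarchical features and optimisation functions by introducing Depth-aware Representation Fusion (DaRF) and Structure-aware Contrastive Hashing (SCH). DaRF dynamically fuses shallow and deep representations into blended features, while SCH uses image fingerprints to make positive/negative pairing more adaptive; together they improve OOD detection and content-based recommendation and support a secure AI-driven healthcare environment. The paper also presents content-guided ranking to improve the robustness and reproducibility of retrieval results. Experiments show that the method recognises OOD samples effectively and significantly outperforms existing approaches in medical image retrieval (p<0.05), with a 5.6-38.9% improvement in mean Average Precision (mAP) on the anatomical radiology dataset.

Link: https://arxiv.org/abs/2503.09370
Authors: Yang Nan,Huichi Zhou,Xiaodan Xing,Giorgos Papanastasiou,Lei Zhu,Zhifan Gao,Alejandro F Fangi,Guang Yang
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: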


Abstract:As artificial intelligence and digital medicine increasingly permeate healthcare systems, robust governance frameworks are essential to ensure ethical, secure, and effective implementation. In this context, medical image retrieval becomes a critical component of clinical data management, playing a vital role in decision-making and safeguarding patient information. Existing methods usually learn hash functions using bottleneck features, which fail to produce representative hash codes from blended embeddings. Although contrastive hashing has shown superior performance, current approaches often treat image retrieval as a classification task, using category labels to create positive/negative pairs. Moreover, many methods fail to address the out-of-distribution (OOD) issue when models encounter external OOD queries or adversarial attacks. In this work, we propose a novel method to consolidate knowledge of hierarchical features and optimisation functions. We formulate the knowledge consolidation by introducing Depth-aware Representation Fusion (DaRF) and Structure-aware Contrastive Hashing (SCH). DaRF adaptively integrates shallow and deep representations into blended features, and SCH incorporates image fingerprints to enhance the adaptability of positive/negative pairings. These blended features further facilitate OOD detection and content-based recommendation, contributing to a secure AI-driven healthcare environment. Moreover, we present a content-guided ranking to improve the robustness and reproducibility of retrieval results. Our comprehensive assessments demonstrate that the proposed method could effectively recognise OOD samples and significantly outperform existing approaches in medical image retrieval (p<0.05). In particular, our method achieves a 5.6-38.9% improvement in mean Average Precision on the anatomical radiology dataset.
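
As a rough illustration of how hashing-based retrieval is commonly scored, the sketch below ranks database items by Hamming distance to a query's hash code and computes mean Average Precision. It is not the paper's DaRF/SCH pipeline: the bit-string codes, the function names, and the use of category labels as the relevance criterion are all assumptions made for this example.

```python
def hamming(a, b):
    """Number of differing positions between two equal-length bit strings."""
    return sum(x != y for x, y in zip(a, b))


def average_precision(ranked_relevance):
    """AP of one ranked list, given True/False relevance flags in rank order."""
    hits = 0
    precision_sum = 0.0
    for rank, relevant in enumerate(ranked_relevance, start=1):
        if relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this rank
    return precision_sum / hits if hits else 0.0


def mean_average_precision(queries, database):
    """queries and database are lists of (hash_code, label) pairs (hypothetical layout)."""
    aps = []
    for q_code, q_label in queries:
        # Stable sort: ties in Hamming distance keep database order.
        ranked = sorted(database, key=lambda item: hamming(q_code, item[0]))
        aps.append(average_precision([label == q_label for _, label in ranked]))
    return sum(aps) / len(aps)


database = [("0000", "a"), ("0001", "a"), ("1111", "b"), ("1110", "b")]
queries = [("0000", "a"), ("1100", "a")]
score = mean_average_precision(queries, database)
```

Treating label equality as relevance is the conventional evaluation shortcut; a different relevance criterion would only change the flags passed to `average_precision`.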

[CV-38] PerCoV2: Improved Ultra-Low Bit-Rate Perceptual Image Compression with Implicit Hierarchical Masked Image Modeling

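Quick read: This paper tackles image compression for low-bandwidth and storage-constrained settings, aiming for higher image fidelity at extremely low bit-rates while preserving good perceptual quality. Its key contribution is PerCoV2, which extends the earlier work of Careil et al. to the Stable Diffusion 3 ecosystem and improves entropy coding efficiency by explicitly modeling the discrete hyper-latent image distribution. The paper also compares several autoregressive methods (VAR and MaskGIT) for entropy modeling and validates the approach on the large-scale MSCOCO-30k benchmark. Compared with previous work, PerCoV2 achieves higher image fidelity at lower bit-rates, supports a hybrid generation mode for further bit-rate savings, and is built entirely on public components.

Link: https://arxiv.org/abs/2503.09368
Authors: Nikolai Körber,Eduard Kromer,Andreas Siebert,Sascha Hauke,Daniel Mueller-Gritschneder,Björn Schuller
Affiliations: Technical University of Munich (慕尼黑工业大学); University of Applied Sciences Landshut (兰茨胡特应用技术大学); TU Wien (维也纳技术大学)
Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments: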


Abstract:We introduce PerCoV2, a novel and open ultra-low bit-rate perceptual image compression system designed for bandwidth- and storage-constrained applications. Building upon prior work by Careil et al., PerCoV2 extends the original formulation to the Stable Diffusion 3 ecosystem and enhances entropy coding efficiency by explicitly modeling the discrete hyper-latent image distribution. To this end, we conduct a comprehensive comparison of recent autoregressive methods (VAR and MaskGIT) for entropy modeling and evaluate our approach on the large-scale MSCOCO-30k benchmark. Compared to previous work, PerCoV2 (i) achieves higher image fidelity at even lower bit-rates while maintaining competitive perceptual quality, (ii) features a hybrid generation mode for further bit-rate savings, and (iii) is built solely on public components. Code and trained models will be released at this https URL.

[CV-39] Post-interactive Multimodal Trajectory Prediction for Autonomous Driving

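Quick read: This paper addresses the difficulty of modeling interactions in trajectory prediction for autonomous driving, which stems from the uncertainty of agent behavior, and focuses on post-interaction features, that is, interactions within the predicted trajectories, which have rarely been considered. It proposes Pioformer, a coarse-to-fine Transformer architecture whose key idea is to extract post-interaction features explicitly to improve prediction accuracy. First, a Coarse Trajectory Network generates coarse trajectories from the observed trajectories and lane segments and extracts low-order interaction features. Next, a hypergraph neural network-based Trajectory Proposal Network learns high-order interaction features and generates trajectory proposals. Finally, a Proposal Refinement Network refines these proposals: it takes the observed trajectories together with the proposals as input and learns post-interaction features by combining the previous interaction features with trajectory consistency features. The paper also designs a three-stage training scheme to facilitate learning. Compared with the baseline HiVT-64, the method reduces prediction errors on minADE6, minFDE6, MR6, and brier-minFDE6 by 4.4%, 8.4%, 14.4%, and 5.7%, respectively.

Link: https://arxiv.org/abs/2503.09366
Authors: Ziyi Huang,Yang Li,Dushuai Li,Yao Mu,Hongmao Qin,Nan Zheng
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: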


Abstract:Modeling the interactions among agents for trajectory prediction of autonomous driving has been challenging due to the inherent uncertainty in agents’ behavior. The interactions involved in the predicted trajectories of agents, also called post-interactions, have rarely been considered in trajectory prediction models. To this end, we propose a coarse-to-fine Transformer for multimodal trajectory prediction, i.e., Pioformer, which explicitly extracts the post-interaction features to enhance the prediction accuracy. Specifically, we first build a Coarse Trajectory Network to generate coarse trajectories based on the observed trajectories and lane segments, in which the low-order interaction features are extracted with the graph neural networks. Next, we build a hypergraph neural network-based Trajectory Proposal Network to generate trajectory proposals, where the high-order interaction features are learned by the hypergraphs. Finally, the trajectory proposals are sent to the Proposal Refinement Network for further refinement. The observed trajectories and trajectory proposals are concatenated together as the inputs of the Proposal Refinement Network, in which the post-interaction features are learned by combining the previous interaction features and trajectory consistency features. Moreover, we propose a three-stage training scheme to facilitate the learning process. Extensive experiments on the Argoverse 1 dataset demonstrate the superiority of our method. Compared with the baseline HiVT-64, our model has reduced the prediction errors by 4.4%, 8.4%, 14.4%, 5.7% regarding metrics minADE6, minFDE6, MR6, and brier-minFDE6, respectively.
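
For readers unfamiliar with the reported metrics, the following is a minimal sketch of how minADE, minFDE, and miss rate over k candidate trajectories can be computed. The function names, the 2.0 threshold, and the choice to take each minimum independently over candidates are assumptions for illustration; some benchmark implementations instead report the ADE of the candidate with the lowest endpoint error.

```python
import math


def displacement_errors(pred, gt):
    """Return (ADE, FDE) for one predicted trajectory against the ground truth."""
    dists = [math.dist(p, g) for p, g in zip(pred, gt)]
    return sum(dists) / len(dists), dists[-1]


def min_ade_fde(candidates, gt):
    """Best average and best final displacement error over k candidates."""
    errors = [displacement_errors(c, gt) for c in candidates]
    return min(e[0] for e in errors), min(e[1] for e in errors)


def miss_rate(scenarios, threshold=2.0):
    """Fraction of scenarios whose best endpoint error exceeds the threshold.

    scenarios is a list of (candidates, ground_truth) pairs; the threshold
    value is an assumed default, not taken from the paper.
    """
    misses = sum(1 for cands, gt in scenarios if min_ade_fde(cands, gt)[1] > threshold)
    return misses / len(scenarios)


gt = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
drifting = [(0.0, 0.0), (1.0, 0.0), (2.0, 3.0)]  # accurate early, wrong endpoint
offset = [(0.0, 1.0), (1.0, 1.0), (2.0, 1.0)]    # constant 1-unit lateral offset
best_ade, best_fde = min_ade_fde([drifting, offset], gt)
mr = miss_rate([([drifting], gt), ([drifting, offset], gt)])
```

With six candidates per agent, these quantities correspond to the minADE6, minFDE6, and MR6 figures quoted in the abstract.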

[CV-40] Deep Learning for Climate Action: Computer Vision Analysis of Visual Narratives on X

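Quick read: This paper asks how to analyze public sentiment and narratives about climate change effectively in the post-API era, when access to social media data is restricted. The key to the study is a combined approach that integrates statistical analysis, image classification, object detection, and sentiment analysis to explore visual narratives in climate discourse. The study also provides a graphical user interface (GUI) for interactive data analysis. With this approach, the paper reveals key themes in climate communication, highlights the divergence in sentiment between images and text, and assesses the strengths and limitations of foundation models for analyzing social media imagery. Finally, it releases open-source code and tools to support future research at the intersection of climate change, social media, and computer vision.

Link: https://arxiv.org/abs/2503.09361
Authors: Katharina Prasse,Marcel Kleinmann,Inken Adam,Kerstin Beckersjuergen,Andreas Edte,Jona Frroku,Timotheus Gumpp,Steffen Jung,Isaac Bravo,Stefanie Walter,Margret Keuper
Affiliations: Data & Web Science Group, University of Mannheim (数据与网络科学组，曼海姆大学); School of Governance, Technical University of Munich (治理学院，慕尼黑工业大学); University of Mannheim and Max Planck Institute for Informatics, Saarland Informatics Campus (曼海姆大学和马克斯·普朗克信息学研究所，萨尔州计算机科学校园)
Categories: Computer Vision and Pattern Recognition (cs.CV); Social and Information Networks (cs.SI)
Comments: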


Abstract:Climate change is one of the most pressing challenges of the 21st century, sparking widespread discourse across social media platforms. Activists, policymakers, and researchers seek to understand public sentiment and narratives while access to social media data has become increasingly restricted in the post-API era. In this study, we analyze a dataset of climate change-related tweets from X (formerly Twitter) shared in 2019, containing 730k tweets along with the shared images. Our approach integrates statistical analysis, image classification, object detection, and sentiment analysis to explore visual narratives in climate discourse. Additionally, we introduce a graphical user interface (GUI) to facilitate interactive data exploration. Our findings reveal key themes in climate communication, highlight sentiment divergence between images and text, and underscore the strengths and limitations of foundation models in analyzing social media imagery. By releasing our code and tools, we aim to support future research on the intersection of climate change, social media, and computer vision.

[CV-41] GIGP: A Global Information Interacting and Geometric Priors Focusing Framework for Semi-supervised Medical Image Segmentation

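Quick read: This paper addresses two key problems in semi-supervised medical image segmentation. First, the distribution discrepancy between limited labeled data and abundant unlabeled data can hinder generalization, and most existing methods rely on local similarity matching, which may introduce bias. Second, global geometric priors such as volumes and moments are not fully exploited. The proposed solution is a Global Information Interaction and Geometric Priors focusing framework (GIGP) with three modules: a Global Information Interaction Mamba module that reduces the distribution discrepancy between labeled and unlabeled data, a Geometric Moment Attention Mechanism that extracts richer global geometric features, and Global Geometric Perturbation Consistency, which simulates organ dynamics and geometric variations to improve generalization. Experiments on the NIH Pancreas and Left Atrium datasets confirm the method's superior performance.

Link: https://arxiv.org/abs/2503.09355
Authors: Lianyuan Yu,Xiuzhen Guo,Ji Shi,Hongxiao Wang,Hongwei Li
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: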


Abstract:Semi-supervised learning enhances medical image segmentation by leveraging unlabeled data, reducing reliance on extensive labeled datasets. On the one hand, the distribution discrepancy between limited labeled data and abundant unlabeled data can hinder model generalization. Most existing methods rely on local similarity matching, which may introduce bias. In contrast, Mamba effectively models global context with linear complexity, learning more comprehensive data representations. On the other hand, medical images usually exhibit consistent anatomical structures defined by geometric features. Most existing methods fail to fully utilize global geometric priors, such as volumes, moments etc. In this work, we introduce a global information interaction and geometric priors focus framework (GIGP). Firstly, we present a Global Information Interaction Mamba module to reduce distribution discrepancy between labeled and unlabeled data. Secondly, we propose a Geometric Moment Attention Mechanism to extract richer global geometric features. Finally, we propose Global Geometric Perturbation Consistency to simulate organ dynamics and geometric variations, enhancing the ability of the model to learn generalized features. The superior performance on the NIH Pancreas and Left Atrium datasets demonstrates the effectiveness of our approach.
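
To make "global geometric priors, such as volumes, moments" concrete, here is a small sketch that computes the zeroth-order moment (voxel volume) and the first-order moments (centroid) of a binary segmentation mask. The nested-list mask layout and the function name are assumptions for this example; the paper's Geometric Moment Attention Mechanism is not reproduced here.

```python
def geometric_moments(mask):
    """Return (volume, centroid) of a binary 3D mask indexed as mask[z][y][x].

    volume is the number of foreground voxels (zeroth-order moment);
    centroid is the mean (z, y, x) index of those voxels (first-order
    moments divided by the volume), or None for an empty mask.
    """
    volume = 0
    sum_z = sum_y = sum_x = 0
    for z, plane in enumerate(mask):
        for y, row in enumerate(plane):
            for x, value in enumerate(row):
                if value:
                    volume += 1
                    sum_z += z
                    sum_y += y
                    sum_x += x
    if volume == 0:
        return 0, None
    return volume, (sum_z / volume, sum_y / volume, sum_x / volume)


# Toy 2x2x2 mask with two foreground voxels at opposite corners.
toy_mask = [
    [[1, 0], [0, 0]],
    [[0, 0], [0, 1]],
]
volume, centroid = geometric_moments(toy_mask)
```

Quantities like these stay roughly stable for a given organ across patients, which is why they are useful as shape priors.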

[CV-42] Fully-Synthetic Training for Visual Quality Inspection in Automotive Production

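Quick read: This paper addresses the heavy dependence of traditional computer vision (CV) models for visual quality inspection on large-scale annotated real data, which is costly, time-consuming, and prone to manual labeling errors. The key solution is to generate synthetic images with domain randomization and assign their labels automatically, which offers a cost-effective and scalable alternative. Experiments show that an object detection model trained only on synthetic data can outperform models trained on real images in real inspection scenarios.

Link: https://arxiv.org/abs/2503.09354
Authors: Christoph Huber,Dino Knoll,Michael Guthe
Affiliations: University of Bayreuth (拜罗伊特大学); BMW Group (宝马集团)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted for publication in Procedia CIRP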


Abstract:Visual Quality Inspection plays a crucial role in modern manufacturing environments as it ensures customer safety and satisfaction. The introduction of Computer Vision (CV) has revolutionized visual quality inspection by improving the accuracy and efficiency of defect detection. However, traditional CV models heavily rely on extensive datasets for training, which can be costly, time-consuming, and error-prone. To overcome these challenges, synthetic images have emerged as a promising alternative. They offer a cost-effective solution with automatically generated labels. In this paper, we propose a pipeline for generating synthetic images using domain randomization. We evaluate our approach in three real inspection scenarios and demonstrate that an object detection model trained solely on synthetic data can outperform models trained on real images.
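
The core of domain randomization is that scene parameters are sampled at random and the label follows directly from the sampled parameters, so no manual annotation is needed. The sketch below shows only that sampling step; every parameter name, range, and label string is hypothetical, and rendering the scene to an image is left out.

```python
import random


def sample_scene(rng):
    """Draw one randomized scene description (all names and ranges are illustrative)."""
    part_mounted = rng.random() < 0.5
    return {
        "light_intensity": rng.uniform(0.2, 1.5),
        "camera_yaw_deg": rng.uniform(-30.0, 30.0),
        "texture_id": rng.randrange(10),
        "part_mounted": part_mounted,
        # The label is derived from the sampled state, not annotated by hand.
        "label": "ok" if part_mounted else "missing_part",
    }


def generate_dataset(n, seed=0):
    """Generate n scene descriptions reproducibly from a seed."""
    rng = random.Random(seed)
    return [sample_scene(rng) for _ in range(n)]


scenes = generate_dataset(50, seed=1)
```

Seeding the generator keeps a synthetic dataset reproducible, which helps when comparing models trained on different randomization ranges.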

[CV-43] Unified Dense Prediction of Video Diffusion CVPR2025

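Quick read: This paper addresses the simultaneous generation of videos and their corresponding entity segmentation and depth maps from text prompts. Existing datasets do not contain captions, videos, segmentation, and depth maps together, which has limited progress on these tasks. The paper proposes a unified network architecture that uses learnable task embeddings to bring multiple dense prediction tasks into a single model, so that video generation and dense prediction (entity masks and depth maps) are optimized together. The key innovation is to represent entity masks and depth maps with colormaps and to integrate dense prediction tightly with RGB video generation, which improves video consistency and motion smoothness without increasing computational cost. The paper also builds a large-scale dense prediction video dataset, which further improves the practicality and performance of the method.

Link: https://arxiv.org/abs/2503.09344
Authors: Lehan Yang,Lu Qi,Xiangtai Li,Sheng Li,Varun Jampani,Ming-Hsuan Yang
Affiliations: University of Virginia (弗吉尼亚大学); University of California, Merced (加州大学默塞德分校); Nanyang Technological University (南洋理工大学); Stability AI (Stability AI)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by CVPR2025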


Abstract:We present a unified network for simultaneously generating videos and their corresponding entity segmentation and depth maps from text prompts. We use colormaps to represent entity masks and depth maps, tightly integrating dense prediction with RGB video generation. Introducing dense prediction information improves video generation’s consistency and motion smoothness without increasing computational costs. Incorporating learnable task embeddings brings multiple dense prediction tasks into a single model, enhancing flexibility and further boosting performance. We further propose a large-scale dense prediction video dataset, addressing the issue that existing datasets do not concurrently contain captions, videos, segmentation, and depth maps. Comprehensive experiments demonstrate the high efficiency of our method, surpassing the state-of-the-art in terms of video quality, consistency, and motion smoothness.

[CV-44] GASPACHO: Gaussian Splatting for Controllable Humans and Objects

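Quick read: This paper addresses the generation of photorealistic, controllable renderings of human-object interactions from multi-view images. Existing methods usually focus on human reconstruction and treat objects as background, so they cannot produce controllable renderings of novel interactions from novel viewpoints. The key idea is to reconstruct animatable templates of the human and the object simultaneously as separate sets of Gaussians, and to constrain the Gaussians that generate the rendered images to be a linear function of a set of canonical Gaussians (a linear deformation function), which makes it possible to render novel human-object interactions in different poses from novel viewpoints. The paper also proposes a feature-based way to define the object template and an occlusion-aware photometric loss for high-quality reconstruction and synthesis under significant occlusion. Experiments show that the method performs well on both the BEHAVE and DNA-Rendering datasets.

Link: https://arxiv.org/abs/2503.09342
Authors: Aymen Mir,Arthur Moreau,Helisa Dhamo,Zhensong Zhang,Eduardo Pérez-Pellitero
Affiliations: Huawei Noah’s Ark Lab (华为诺亚方舟实验室); University of Tübingen (蒂宾根大学)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: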


Abstract:We present GASPACHO: a method for generating photorealistic controllable renderings of human-object interactions. Given a set of multi-view RGB images of human-object interactions, our method reconstructs animatable templates of the human and object as separate sets of Gaussians simultaneously. Different from existing work, which focuses on human reconstruction and ignores objects as background, our method explicitly reconstructs both humans and objects, thereby allowing for controllable renderings of novel human-object interactions in different poses from novel camera viewpoints. During reconstruction, we constrain the Gaussians that generate rendered images to be a linear function of a set of canonical Gaussians. By simply changing the parameters of the linear deformation functions after training, our method can generate renderings of novel human-object interactions in novel poses from novel camera viewpoints. We learn the 3D Gaussian properties of the canonical Gaussians on the underlying 2D manifold of the canonical human and object templates. This in turn requires a canonical object template with a fixed UV unwrapping. To define such an object template, we use a feature-based representation to track the object across the multi-view sequence. We further propose an occlusion-aware photometric loss that allows for reconstructions under significant occlusions. Several experiments on two human-object datasets, BEHAVE and DNA-Rendering, demonstrate that our method allows for high-quality reconstruction of human and object templates under significant occlusion and the synthesis of controllable renderings of novel human-object interactions in novel human poses from novel camera views.

[CV-45] Stealthy Patch-Wise Backdoor Attack in 3D Point Cloud via Curvature Awareness

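Quick read: This paper addresses the limited stealthiness of existing 3D point cloud backdoor attacks, which rely on sample-wise global modifications. Its key contribution is the Stealthy Patch-Wise Backdoor Attack (SPBA), which introduces the first patch-wise trigger for point clouds and restricts perturbations to local regions to improve stealthiness significantly. Specifically, SPBA decomposes a point cloud into local patches and evaluates their geometric complexity with a curvature-based patch imperceptibility score, so that the trigger can be applied strategically to multiple geometrically complex patches with lower visual sensitivity and remain hard to notice with the naked eye. SPBA also uses the Graph Fourier Transform (GFT) to optimize a patch-wise spectral trigger that perturbs the spectral features of the selected patches while preserving the global geometric structure of the point cloud, which improves attack effectiveness and stealthiness at the same time. Experiments show that SPBA achieves an attack success rate (ASR) above 96.5% across different models and state-of-the-art imperceptibility.

Link: https://arxiv.org/abs/2503.09336
Authors: Yu Feng,Dingxin Zhang,Runkai Zhao,Yong Xia,Heng Huang,Weidong Cai
Affiliations: The University of Sydney (悉尼大学); Northwestern Polytechnical University (西北工业大学); University of Maryland College Park (马里兰大学帕克分校)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 12 pages, 8 figures, 6 tables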


Abstract:Backdoor attacks pose a severe threat to deep neural networks (DNN) by implanting hidden backdoors that can be activated with predefined triggers to manipulate model behaviors maliciously. Existing 3D point cloud backdoor attacks primarily rely on sample-wise global modifications, resulting in suboptimal stealthiness. To address this limitation, we propose Stealthy Patch-Wise Backdoor Attack (SPBA), which employs the first patch-wise trigger for 3D point clouds and restricts perturbations to local regions, significantly enhancing stealthiness. Specifically, SPBA decomposes point clouds into local patches and evaluates their geometric complexity using a curvature-based patch imperceptibility score, ensuring that the trigger remains less perceptible to the human eye by strategically applying it across multiple geometrically complex patches with lower visual sensitivity. By leveraging the Graph Fourier Transform (GFT), SPBA optimizes a patch-wise spectral trigger that perturbs the spectral features of selected patches, enhancing attack effectiveness while preserving the global geometric structure of the point cloud. Extensive experiments on ModelNet40 and ShapeNetPart demonstrate that SPBA consistently achieves an attack success rate (ASR) exceeding 96.5% across different models while achieving state-of-the-art imperceptibility compared to existing backdoor attack methods.
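
The attack success rate quoted above can be computed as sketched below. This follows a common convention, assumed here rather than taken from the paper, of counting only triggered samples whose true class differs from the attacker's target class; the function and argument names are illustrative.

```python
def attack_success_rate(predictions, true_labels, target_label):
    """Fraction of triggered, non-target-class samples predicted as the target class.

    predictions are the model outputs on inputs that carry the trigger;
    samples that already belong to the target class are excluded because
    they would count as successes regardless of the backdoor.
    """
    relevant = [p for p, t in zip(predictions, true_labels) if t != target_label]
    if not relevant:
        raise ValueError("no samples outside the target class")
    return sum(p == target_label for p in relevant) / len(relevant)


# Four triggered samples; the last one is already of the target class (1).
asr = attack_success_rate(predictions=[1, 1, 0, 1], true_labels=[0, 2, 3, 1], target_label=1)
```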

[CV-46] SDD-4DGS: Static-Dynamic Aware Decoupling in Gaussian Splatting for 4D Scene Reconstruction

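Quick read: This paper addresses the performance limits that arise because existing 4D reconstruction methods do not distinguish dynamic from static scene components. The key to the solution is SDD-4DGS, a static-dynamic decoupled 4D scene reconstruction framework based on Gaussian Splatting. Its core innovation is a novel probabilistic dynamic perception coefficient that is integrated naturally into the Gaussian reconstruction pipeline and enables adaptive separation of static and dynamic components. The method learns the motion patterns of dynamic elements effectively while keeping the geometry of static structures stable.

Link: https://arxiv.org/abs/2503.09332
Authors: Dai Sun,Huhao Guan,Kun Zhang,Xike Xie,S. Kevin Zhou
Affiliations: School of Biomedical Engineering, Division of Life Sciences and Medicine, University of Science and Technology of China (中国科学技术大学), Hefei, Anhui, 230026, P.R.China
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: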


Abstract:Dynamic and static components in scenes often exhibit distinct properties, yet most 4D reconstruction methods treat them indiscriminately, leading to suboptimal performance in both cases. This work introduces SDD-4DGS, the first framework for static-dynamic decoupled 4D scene reconstruction based on Gaussian Splatting. Our approach is built upon a novel probabilistic dynamic perception coefficient that is naturally integrated into the Gaussian reconstruction pipeline, enabling adaptive separation of static and dynamic components. With carefully designed implementation strategies to realize this theoretical framework, our method effectively facilitates explicit learning of motion patterns for dynamic elements while maintaining geometric stability for static structures. Extensive experiments on five benchmark datasets demonstrate that SDD-4DGS consistently outperforms state-of-the-art methods in reconstruction fidelity, with enhanced detail restoration for static structures and precise modeling of dynamic motions. The code will be released.

[CV-47] DAVE: Diagnostic benchmark for Audio Visual Evaluation

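Quick read: This paper addresses two problems of existing multimodal learning benchmarks: a strong visual bias, and the reporting of only aggregate scores, which makes it hard to tell whether a model struggles with visual understanding, audio interpretation, or audio-visual alignment. The key solution is DAVE (Diagnostic Audio Visual Evaluation), a new benchmark dataset designed to evaluate audio-visual models systematically under controlled challenges. DAVE overcomes the limitations of existing datasets by ensuring that both modalities are necessary to answer correctly and by decoupling evaluation into atomic subcategories. This standardized diagnostic framework is intended to support more robust development of audio-visual models.

Link: https://arxiv.org/abs/2503.09321
Authors: Gorjan Radevski,Teodora Popordanoska,Matthew B. Blaschko,Tinne Tuytelaars
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: First two authors contributed equally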


Abstract:Audio-visual understanding is a rapidly evolving field that seeks to integrate and interpret information from both auditory and visual modalities. Despite recent advances in multi-modal learning, existing benchmarks often suffer from strong visual bias – where answers can be inferred from visual data alone – and provide only aggregate scores that conflate multiple sources of error. This makes it difficult to determine whether models struggle with visual understanding, audio interpretation, or audio-visual alignment. In this work, we introduce DAVE (Diagnostic Audio Visual Evaluation), a novel benchmark dataset designed to systematically evaluate audio-visual models across controlled challenges. DAVE alleviates existing limitations by (i) ensuring both modalities are necessary to answer correctly and (ii) decoupling evaluation into atomic subcategories. Our detailed analysis of state-of-the-art models reveals specific failure modes and provides targeted insights for improvement. By offering this standardized diagnostic framework, we aim to facilitate more robust development of audio-visual models. The dataset is released: this https URL

[CV-48] 2HandedAfforder: Learning Precise Actionable Bimanual Affordances from Human Videos

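Quick read: This paper asks how to extract precise object affordance regions and their class labels from video data and how to predict affordances from vision. Current vision-based affordance prediction methods often reduce the problem to naive object part segmentation and do not fully account for task requirements or the complexity of bimanual interaction. The key solution is a framework for extracting affordance data from human activity video datasets, which yields the 2HANDS dataset with precise affordance region segmentations and class labels and also covers bimanual actions. The paper further presents 2HandedAfforder, a VLM-based affordance prediction model that outperforms baselines in affordance region segmentation across various activities, and shows in robotic manipulation scenarios that the predicted affordance regions are actionable.

Link: https://arxiv.org/abs/2503.09320
Authors: Marvin Heidinger,Snehal Jauhri,Vignesh Prasad,Georgia Chalvatzaki
Affiliations: Technische Universität Darmstadt (达姆施塔特工业大学); Hessian.AI (海森堡人工智能中心)
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Robotics (cs.RO)
Comments: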


Abstract:When interacting with objects, humans effectively reason about which regions of objects are viable for an intended action, i.e., the affordance regions of the object. They can also account for subtle differences in object regions based on the task to be performed and whether one or two hands need to be used. However, current vision-based affordance prediction methods often reduce the problem to naive object part segmentation. In this work, we propose a framework for extracting affordance data from human activity video datasets. Our extracted 2HANDS dataset contains precise object affordance region segmentations and affordance class-labels as narrations of the activity performed. The data also accounts for bimanual actions, i.e., two hands co-ordinating and interacting with one or more objects. We present a VLM-based affordance prediction model, 2HandedAfforder, trained on the dataset and demonstrate superior performance over baselines in affordance region segmentation for various activities. Finally, we show that our predicted affordance regions are actionable, i.e., can be used by an agent performing a task, through demonstration in robotic manipulation scenarios.

[CV-49] Revealing the Implicit Noise-based Imprint of Generative Models

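Quick read: This paper addresses the poor performance of existing AI-generated image detectors on emerging generative models, which results from inadequate generalization. It proposes a new framework built on noise-based model-specific imprints. The key is a novel noise-based imprint simulator that captures the intrinsic patterns different generative models leave in their images; by aggregating imprints from various generative models, the imprints of future models can be extrapolated to expand the training data, which improves generalization and robustness. The paper also designs a new pipeline that, for the first time, combines noise patterns from a noise-based imprint extractor with other visual features for AI-generated image detection, which improves detection performance significantly and achieves state-of-the-art results on three public benchmarks: GenImage, Synthbuster, and Chameleon.

Link: https://arxiv.org/abs/2503.09314
Authors: Xinghan Li,Jingjing Chen,Yue Yu,Xue Song,Haijun Shan,Yu-Gang Jiang
Affiliations: Shanghai Key Lab of Intell. Info. Processing, School of CS, Fudan University (复旦大学), China; CEC GienTech Technology Co.,Ltd. (中电科 Gentle Tech Co., Ltd.), Shanghai China
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: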


Abstract:With the rapid advancement of vision generation models, the potential security risks stemming from synthetic visual content have garnered increasing attention, posing significant challenges for AI-generated image detection. Existing methods suffer from inadequate generalization capabilities, resulting in unsatisfactory performance on emerging generative models. To address this issue, this paper presents a novel framework that leverages noise-based model-specific imprint for the detection task. Specifically, we propose a novel noise-based imprint simulator to capture intrinsic patterns imprinted in images generated by different models. By aggregating imprints from various generative models, imprints of future models can be extrapolated to expand training data, thereby enhancing generalization and robustness. Furthermore, we design a new pipeline that pioneers the use of noise patterns, derived from a noise-based imprint extractor, alongside other visual features for AI-generated image detection, resulting in a significant improvement in performance. Our approach achieves state-of-the-art performance across three public benchmarks including GenImage, Synthbuster and Chameleon.

[CV-50] Revealing Unintentional Information Leakage in Low-Dimensional Facial Portrait Representations

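[Quick Read]: This paper evaluates how much information unintentionally leaks into the low-dimensional output of a neural network, by reconstructing the input image from a 40- or 32-dimensional feature vector intended only to describe abstract attributes of a facial portrait. Unlike previous work, it draws on recent knowledge about image generation and facial similarity to build a method that outperforms the current state of the art. The key is a pretrained StyleGAN combined with a new loss function that compares the perceptual similarity of portraits by mapping faces into the latent space of a FaceNet embedding; in addition, a new technique fuses the outputs of an ensemble to deliberately generate specific aspects of the reconstructed image.

Link: https://arxiv.org/abs/2503.09306
Authors: Kathleen Anderson, Thomas Martinetz
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: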

Abstract:We evaluate the information that can unintentionally leak into the low-dimensional output of a neural network, by reconstructing an input image from a 40- or 32-element feature vector that is intended only to describe abstract attributes of a facial portrait. The reconstruction uses black-box access to the image encoder that generates the feature vector. Unlike previous work, we leverage recent knowledge about image generation and facial similarity, implementing a method that outperforms the current state-of-the-art. Our strategy uses a pretrained StyleGAN and a new loss function that compares the perceptual similarity of portraits by mapping them into the latent space of a FaceNet embedding. Additionally, we present a new technique that fuses the output of an ensemble, to deliberately generate specific aspects of the recreated image.

[CV-51] IQPFR: An Image Quality Prior for Blind Face Restoration and Beyond

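[Quick Read]: This paper addresses a limitation in Blind Face Restoration (BFR): conventional methods that learn features from ground-truth (GT) data are constrained by inherent imperfections in the training datasets. Such methods typically restore images only to the mean quality level of the training data rather than to the maximum attainable visual quality. To overcome this, the paper proposes a framework that introduces an Image Quality Prior (IQP), derived from No-Reference Image Quality Assessment (NR-IQA) models, to guide restoration toward optimal high-quality outputs.

The key to the solution lies in two innovations. First, during codebook learning, a dual-branch codebook architecture decouples feature extraction into universal structural components and high-quality-specific attributes, giving a comprehensive representation of both common and high-quality facial features. Second, in the codebook lookup stage, a Transformer-based quality-conditioned framework uses NR-IQA-derived quality scores as dynamic conditioning signals that steer restoration toward the highest feasible quality standard. This quality-conditioned paradigm can enhance existing BFR architectures in a plug-and-play manner, without modifying their structure. The paper also proposes a quality optimization strategy based on discrete representations, which avoids the over-optimization artifacts common in continuous latent-space approaches. Experiments show that the method outperforms current state-of-the-art techniques on multiple benchmarks and yields consistent improvements when combined with existing BFR models.

Link: https://arxiv.org/abs/2503.09294
Authors: Peng Hu, Chunming He, Lei Xu, Jingduo Tian, Sina Farsiu, Yulun Zhang, Pei Liu, Xiu Li
Affiliations: Shenzhen International Graduate School, Tsinghua University; Duke University; Shanghai Artificial Intelligence Laboratory; Media Technology Lab, Huawei; Shanghai Jiao Tong University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: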

Abstract:Blind Face Restoration (BFR) addresses the challenge of reconstructing degraded low-quality (LQ) facial images into high-quality (HQ) outputs. Conventional approaches predominantly rely on learning feature representations from ground-truth (GT) data; however, inherent imperfections in GT datasets constrain restoration performance to the mean quality level of the training data, rather than attaining maximally attainable visual quality. To overcome this limitation, we propose a novel framework that incorporates an Image Quality Prior (IQP) derived from No-Reference Image Quality Assessment (NR-IQA) models to guide the restoration process toward optimal HQ reconstructions. Our methodology synergizes this IQP with a learned codebook prior through two critical innovations: (1) During codebook learning, we devise a dual-branch codebook architecture that disentangles feature extraction into universal structural components and HQ-specific attributes, ensuring comprehensive representation of both common and high-quality facial characteristics. (2) In the codebook lookup stage, we implement a quality-conditioned Transformer-based framework. NR-IQA-derived quality scores act as dynamic conditioning signals to steer restoration toward the highest feasible quality standard. This score-conditioned paradigm enables plug-and-play enhancement of existing BFR architectures without modifying the original structure. We also formulate a discrete representation-based quality optimization strategy that circumvents over-optimization artifacts prevalent in continuous latent space approaches. Extensive experiments demonstrate that our method outperforms state-of-the-art techniques across multiple benchmarks. Besides, our quality-conditioned framework demonstrates consistent performance improvements when integrated with prior BFR models. The code will be released.

[CV-52] Better Together: Unified Motion Capture and 3D Avatar Reconstruction

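[Quick Read]: This paper proposes Better Together, a method that simultaneously solves human pose estimation and reconstructs a photorealistic 3D human avatar from multi-view videos. Unlike prior work, which treats the two problems separately, its key idea is to jointly optimize skeletal motion and a renderable 3D body model, producing a synergy: more precise motion capture and better visual quality of real-time rendered avatars. To this end, the authors introduce an animatable avatar with 3D Gaussians rigged on a personalized mesh and optimize the motion sequence with time-dependent MLPs that give accurate and temporally consistent pose estimates. Evaluated on challenging yoga poses, the method reaches state-of-the-art accuracy in multi-view human pose estimation, reducing error by 35% on body joints and 45% on hand joints compared with keypoint-based methods, while significantly improving the visual quality of animatable avatars on novel view synthesis (+2dB PSNR).

Link: https://arxiv.org/abs/2503.09293
Authors: Arthur Moreau, Mohammed Brahimi, Richard Shaw, Athanasios Papaioannou, Thomas Tanay, Zhensong Zhang, Eduardo Pérez-Pellitero
Affiliations: Huawei Noah’s Ark Lab; TU Munich
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 14 pages, 6 figures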

Abstract:We present Better Together, a method that simultaneously solves the human pose estimation problem while reconstructing a photorealistic 3D human avatar from multi-view videos. While prior art usually solves these problems separately, we argue that joint optimization of skeletal motion with a 3D renderable body model brings synergistic effects, i.e. yields more precise motion capture and improved visual quality of real-time rendering of avatars. To achieve this, we introduce a novel animatable avatar with 3D Gaussians rigged on a personalized mesh and propose to optimize the motion sequence with time-dependent MLPs that provide accurate and temporally consistent pose estimates. We first evaluate our method on highly challenging yoga poses and demonstrate state-of-the-art accuracy on multi-view human pose estimation, reducing error by 35% on body joints and 45% on hand joints compared to keypoint-based methods. At the same time, our method significantly boosts the visual quality of animatable avatars (+2dB PSNR on novel view synthesis) on diverse challenging subjects.

[CV-53] Noise2Score3D: Tweedie's Approach for Unsupervised Point Cloud Denoising

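[Quick Read]: This paper addresses a bottleneck of unsupervised point cloud denoising, in particular the lack of clean data to serve as training labels. It proposes the Noise2Score3D framework, which learns the score function of the point cloud distribution directly from noisy data and so removes the dependence on clean data. Its key innovation is to use Tweedie's formula to denoise in a single step, replacing the iterative process of existing unsupervised methods and thereby improving both accuracy and efficiency. It also introduces Total Variation for Point Clouds as a denoising quality metric, which allows unknown noise parameters to be estimated. Experiments show that Noise2Score3D achieves state-of-the-art performance among unsupervised methods on Chamfer distance and point-to-mesh metrics, and generalizes well beyond its training datasets.

Link: https://arxiv.org/abs/2503.09283
Authors: Xiangbin Wei
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: arXiv admin note: substantial text overlap with arXiv:2502.16826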

Abstract:Building on recent advances in Bayesian statistics and image denoising, we propose Noise2Score3D, a fully unsupervised framework for point cloud denoising. Noise2Score3D learns the score function of the underlying point cloud distribution directly from noisy data, eliminating the need for clean data during training. Using Tweedie’s formula, our method performs denoising in a single step, avoiding the iterative processes used in existing unsupervised methods, thus improving both accuracy and efficiency. Additionally, we introduce Total Variation for Point Clouds as a denoising quality metric, which allows for the estimation of unknown noise parameters. Experimental results demonstrate that Noise2Score3D achieves state-of-the-art performance on standard benchmarks among unsupervised learning methods in Chamfer distance and point-to-mesh metrics. Noise2Score3D also demonstrates strong generalization ability beyond training datasets. Our method, by addressing the generalization issue and challenge of the absence of clean data in learning-based methods, paves the way for learning-based point cloud denoising methods in real-world applications.
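
To make the single-step idea concrete, here is a minimal sketch of Tweedie's formula, x_hat = y + sigma^2 * score(y), for Gaussian noise of variance sigma^2. In the paper the score is learned by a network from noisy point clouds; the closed-form Gaussian score below is only a stand-in assumption so the example runs on its own, and the function names are hypothetical.

```python
def gaussian_marginal_score(y, mean, prior_var, noise_var):
    """Score (gradient of log density) of the noisy marginal.

    Assumes each clean coordinate is N(mean, prior_var) and the noise is
    N(0, noise_var), so the noisy marginal is N(mean, prior_var + noise_var).
    This analytic score stands in for a learned score network.
    """
    total_var = prior_var + noise_var
    return [-(yi - mi) / total_var for yi, mi in zip(y, mean)]


def tweedie_denoise(y, score, noise_var):
    """One-step posterior-mean estimate: y + noise_var * score(y)."""
    return [yi + noise_var * si for yi, si in zip(y, score)]


# Toy coordinates of a noisy point (illustrative values only).
noisy = [2.0, 0.0, -1.0]
score = gaussian_marginal_score(noisy, mean=[0.0, 0.0, 0.0],
                                prior_var=1.0, noise_var=1.0)
denoised = tweedie_denoise(noisy, score, noise_var=1.0)
```

With equal prior and noise variance, each coordinate is pulled halfway toward the prior mean, which matches the known Gaussian posterior mean.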

[CV-54] Cockatiel: Ensembling Synthetic and Human Preferenced Training for Detailed Video Caption

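[Quick Read]: This paper tackles two key problems in Video Detailed Captioning (VDC): existing state-of-the-art methods have capabilities biased toward specific captioning aspects, and they are misaligned with human preferences. To address these deficiencies, it proposes Cockatiel, a three-stage training pipeline whose key idea is to ensemble synthetic and human-aligned training to improve VDC performance. First, a scorer derived from a meticulously annotated dataset selects synthetic captions that perform well on fine-grained video-caption alignment and are preferred by humans. Next, Cockatiel-13B is trained on this curated dataset so that it absorbs the strengths of the ensembled models and human preferences. Finally, Cockatiel-8B is distilled from Cockatiel-13B for ease of use. Extensive quantitative and qualitative experiments show that the method sets a new state of the art on VDCSCORE in a dimension-balanced way and surpasses leading alternatives by a large margin in human evaluation.

Link: https://arxiv.org/abs/2503.09279
Authors: Luozheng Qin, Zhiyu Tan, Mengping Yang, Xiaomeng Yang, Hao Li
Affiliations: Shanghai Academy of Artificial Intelligence for Science; Fudan University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: For more details, please refer to our project page: this https URL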

Abstract:Video Detailed Captioning (VDC) is a crucial task for vision-language bridging, enabling fine-grained descriptions of complex video content. In this paper, we first comprehensively benchmark current state-of-the-art approaches and systematically identify two critical limitations: biased capability towards specific captioning aspects and misalignment with human preferences. To address these deficiencies, we propose Cockatiel, a novel three-stage training pipeline that ensembles synthetic and human-aligned training for improving VDC performance. In the first stage, we derive a scorer from a meticulously annotated dataset to select synthetic captions that perform well on certain fine-grained video-caption alignment and are human-preferred, while disregarding the others. Then, we train Cockatiel-13B, using this curated dataset to infuse it with assembled model strengths and human preferences. Finally, we further distill Cockatiel-8B from Cockatiel-13B for ease of use. Extensive quantitative and qualitative experiments reflect the effectiveness of our method, as we not only set new state-of-the-art performance on VDCSCORE in a dimension-balanced way but also surpass leading alternatives on human preference by a large margin as depicted by the human evaluation results.

[CV-55] UniCombine: Unified Multi-Conditional Combination with Diffusion Transformer

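[Quick Read]: This paper addresses an open challenge for diffusion models in multi-conditional controllable generation: effectively combining multiple conditional inputs while staying consistent with all of them. Its key contribution is UniCombine, a DiT-based framework that handles any combination of conditions, such as text prompts, spatial maps, and subject images. The core techniques are a novel Conditional MMDiT Attention mechanism and a trainable LoRA module, which together yield both training-free and training-based versions. The paper also designs a new pipeline to build SubjectSpatial200K, the first dataset for multi-conditional generation covering both subject-driven and spatially-aligned conditions. Experiments confirm the approach's generality and state-of-the-art performance on multi-conditional generation.

Link: https://arxiv.org/abs/2503.09277
Authors: Haoxuan Wang, Jinlong Peng, Qingdong He, Hao Yang, Ying Jin, Jiafu Wu, Xiaobin Hu, Yanjie Pan, Zhenye Gan, Mingmin Chi, Bo Peng, Yabiao Wang
Affiliations: Fudan University; Tencent Youtu Lab; Shanghai Jiao Tong University; Shanghai Ocean University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: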

Abstract:With the rapid development of diffusion models in image generation, the demand for more powerful and flexible controllable frameworks is increasing. Although existing methods can guide generation beyond text prompts, the challenge of effectively combining multiple conditional inputs while maintaining consistency with all of them remains unsolved. To address this, we introduce UniCombine, a DiT-based multi-conditional controllable generative framework capable of handling any combination of conditions, including but not limited to text prompts, spatial maps, and subject images. Specifically, we introduce a novel Conditional MMDiT Attention mechanism and incorporate a trainable LoRA module to build both the training-free and training-based versions. Additionally, we propose a new pipeline to construct SubjectSpatial200K, the first dataset designed for multi-conditional generative tasks covering both the subject-driven and spatially-aligned conditions. Extensive experimental results on multi-conditional generation demonstrate the outstanding universality and powerful capability of our approach with state-of-the-art performance.

[CV-56] DitHub: A Modular Framework for Incremental Open-Vocabulary Object Detection

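[Quick Read]: This paper addresses the difficulty open-vocabulary object detectors have in detecting rare classes or specializing in particular domains. Whereas most recent methods adapt a single set of model weights, this paper takes a modular deep learning approach. The key is the DitHub framework, which borrows from version control systems: expert modules are organized like branches that can be fetched and merged as needed. This modular approach achieves state-of-the-art performance and is the first in object detection to study how adaptation modules combine.

Link: https://arxiv.org/abs/2503.09271
Authors: Chiara Cappellino, Gianluca Mancusi, Matteo Mosconi, Angelo Porrello, Simone Calderara, Rita Cucchiara
Affiliations: AImageLab - University of Modena and Reggio Emilia
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: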

Abstract:Open-Vocabulary object detectors can recognize a wide range of categories using simple textual prompts. However, improving their ability to detect rare classes or specialize in certain domains remains a challenge. While most recent methods rely on a single set of model weights for adaptation, we take a different approach by using modular deep learning. We introduce DitHub, a framework designed to create and manage a library of efficient adaptation modules. Inspired by Version Control Systems, DitHub organizes expert modules like branches that can be fetched and merged as needed. This modular approach enables a detailed study of how adaptation modules combine, making it the first method to explore this aspect in Object Detection. Our approach achieves state-of-the-art performance on the ODinW-13 benchmark and ODinW-O, a newly introduced benchmark designed to evaluate how well models adapt when previously seen classes reappear. For more details, visit our project page: this https URL

[CV-57] Bayesian Test-Time Adaptation for Vision-Language Models

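[Quick Read]: This paper addresses a gap in test-time adaptation of pre-trained vision-language models such as CLIP for zero-shot image classification: existing methods adapt only the likelihood, by adjusting class embeddings, and overlook the importance of the prior. Analyzing the process with Bayes' theorem, the paper finds that the final prediction is driven by both the likelihood and the prior. To fill this gap, it proposes Bayesian Class Adaptation (BCA), which not only continuously updates the class embeddings to adapt the likelihood but also uses the posterior of incoming samples to continuously update the prior of each class embedding. This dual updating mechanism lets the model adapt better to distribution shifts and predict more accurately. BCA surpasses existing methods on performance metrics while keeping fast inference and low memory usage, making it practical for real-world applications.

Link: https://arxiv.org/abs/2503.09248
Authors: Lihua Zhou, Mao Ye, Shuaifeng Li, Nianxin Li, Xiatian Zhu, Lei Deng, Hongbin Liu, Zhen Lei
Affiliations: CAIR, HKSIS, CAS; UESTC; University of Surrey; Shenzhen University; MAIS, Institute of Automation, CAS; SAI, UCAS; M.U.S.T
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: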

Abstract:Test-time adaptation with pre-trained vision-language models, such as CLIP, aims to adapt the model to new, potentially out-of-distribution test data. Existing methods calculate the similarity between visual embedding and learnable class embeddings, which are initialized by text embeddings, for zero-shot image classification. In this work, we first analyze this process based on Bayes' theorem, and observe that the core factors influencing the final prediction are the likelihood and the prior. However, existing methods essentially focus on adapting class embeddings to adapt the likelihood, but they often ignore the importance of the prior. To address this gap, we propose a novel approach, Bayesian Class Adaptation (BCA), which in addition to continuously updating class embeddings to adapt the likelihood, also uses the posterior of incoming samples to continuously update the prior for each class embedding. This dual updating mechanism allows the model to better adapt to distribution shifts and achieve higher prediction accuracy. Our method not only surpasses existing approaches in terms of performance metrics but also maintains superior inference rates and memory usage, making it highly efficient and practical for real-world applications.
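
The likelihood-times-prior view in the abstract can be sketched in a few lines. This is an illustrative toy, not the paper's algorithm: the softmax over similarity scores as the likelihood, the running-mean prior update, and all function names are assumptions made for the example.

```python
import math


def softmax(logits):
    """Numerically stable softmax, used here as a stand-in class likelihood."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def posterior(likelihood, prior):
    """Bayes' rule: posterior is proportional to likelihood times prior."""
    joint = [l * p for l, p in zip(likelihood, prior)]
    norm = sum(joint)
    return [j / norm for j in joint]


def update_prior(prior, post, count):
    """Running mean of posteriors (a hypothetical update rule).

    The current prior is treated as the average of `count` earlier estimates.
    """
    return [(p * count + q) / (count + 1) for p, q in zip(prior, post)]


# Stream of image-to-class similarity scores for two classes (toy values).
prior, count = [0.5, 0.5], 1
for logits in ([2.0, 0.0], [1.5, 0.5], [0.2, 0.0]):
    post = posterior(softmax(logits), prior)
    prior = update_prior(prior, post, count)
    count += 1
```

Because the stream favors the first class, the prior drifts toward it, so a later ambiguous sample is resolved in favor of the class seen more often at test time.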

[CV-58] How To Make Your Cell Tracker Say “I dunno!”

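[Quick Read]: This paper addresses the need for reliable, uncertainty-aware analysis tools in high-throughput live-cell microscopy, where the volume of data exceeds what humans can inspect manually. Specifically, it focuses on quantifying uncertainty in cell tracking algorithms based on linear assignment. The key is a framework-like approach to uncertainty quantification that views tracking both as a Bayesian inference problem and as a classification problem, and that can equip any frame-to-frame tracking method with uncertainty estimates. Applied to several existing tracking algorithms, including recent Transformer-based trackers, it yields useful and well-calibrated tracking uncertainties.

Link: https://arxiv.org/abs/2503.09244
Authors: Richard D. Paul, Johannes Seiffarth, David Rügamer, Hanno Scharr, Katharina Nöh
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Quantitative Methods (q-bio.QM); Applications (stat.AP)
Comments: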

Abstract:Cell tracking is a key computational task in live-cell microscopy, but fully automated analysis of high-throughput imaging requires reliable and, thus, uncertainty-aware data analysis tools, as the amount of data recorded within a single experiment exceeds what humans are able to overlook. We here propose and benchmark various methods to reason about and quantify uncertainty in linear assignment-based cell tracking algorithms. Our methods take inspiration from statistics and machine learning, leveraging two perspectives on the cell tracking problem explored throughout this work: Considering it as a Bayesian inference problem and as a classification problem. Our methods admit a framework-like character in that they equip any frame-to-frame tracking method with uncertainty quantification. We demonstrate this by applying it to various existing tracking algorithms including the recently presented Transformer-based trackers. We demonstrate empirically that our methods yield useful and well-calibrated tracking uncertainties.

[CV-59] GarmentPile: Point-Level Visual Affordance Guided Retrieval and Adaptation for Cluttered Garments Manipulation

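[Quick Read]: This paper tackles the manipulation of cluttered garments, which is hard because garments are complex and deformable and their relations are intricate, and because the garments must stay clean and manipulation must stay stable. Its key idea is to learn point-level affordance, a dense representation that models the complex space and multi-modal manipulation candidates while accounting for garment geometry, structure, and inter-object relations. Because a specific garment is hard to retrieve directly from extremely entangled clutter, the paper also introduces an adaptation module that, guided by the learned affordance, reorganizes highly entangled garments into states suitable for manipulation. The framework proves effective across diverse garment types and pile configurations in both simulation and the real world.

Link: https://arxiv.org/abs/2503.09243
Authors: Ruihai Wu, Ziyu Zhu, Yuran Wang, Yue Chen, Jiarui Wang, Hao Dong
Affiliations: Unknown
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: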

Abstract:Cluttered garments manipulation poses significant challenges due to the complex, deformable nature of garments and intricate garment relations. Unlike single-garment manipulation, cluttered scenarios require managing complex garment entanglements and interactions, while maintaining garment cleanliness and manipulation stability. To address these demands, we propose to learn point-level affordance, the dense representation modeling the complex space and multi-modal manipulation candidates, while being aware of garment geometry, structure, and inter-object relations. Additionally, as it is difficult to directly retrieve a garment in some extremely entangled clutters, we introduce an adaptation module, guided by learned affordance, to reorganize highly-entangled garments into states plausible for manipulation. Our framework demonstrates effectiveness over environments featuring diverse garment types and pile configurations in both simulation and the real world. Project page: this https URL.

[CV-60] NAMI: Efficient Image Generation via Progressive Rectified Flow Transformers

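[Quick Read]: This paper addresses the high inference deployment cost of flow-based Transformer models for image generation, despite their state-of-the-art performance. To improve inference efficiency while preserving generation quality, it proposes progressive rectified flow Transformers. The key is to divide the rectified flow into stages by resolution: the low-resolution stages use fewer Transformer layers to generate image layouts and concept contours, and more layers are added progressively as the resolution increases. The approach converges quickly and reduces inference time, by 40% for a 1024-resolution image, while maintaining generation quality.

Link: https://arxiv.org/abs/2503.09242
Authors: Yuhang Ma, Bo Cheng, Shanyuan Liu, Ao Ma, Xiaoyu Wu, Liebucha Wu, Dawei Leng, Yuhui Yin
Affiliations: 360 AI Research
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: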

Abstract:Flow-based transformer models for image generation have achieved state-of-the-art performance with larger model parameters, but their inference deployment cost remains high. To enhance inference performance while maintaining generation quality, we propose progressive rectified flow transformers. We divide the rectified flow into different stages according to resolution, using fewer transformer layers at the low-resolution stages to generate image layouts and concept contours, and progressively adding more layers as the resolution increases. Experiments demonstrate that our approach achieves fast convergence and reduces inference time while ensuring generation quality. The main contributions of this paper are summarized as follows: (1) We introduce progressive rectified flow transformers that enable multi-resolution training, accelerating model convergence; (2) NAMI leverages piecewise flow and spatial cascading of Diffusion Transformer (DiT) to rapidly generate images, reducing inference time by 40% to generate a 1024 resolution image; (3) We propose NAMI-1K benchmark to evaluate human preference performance, aiming to mitigate distributional bias and prevent data leakage from open-source benchmarks. The results show that our model is competitive with state-of-the-art models.

[CV-61] Active Learning Inspired ControlNet Guidance for Augmenting Semantic Segmentation Datasets

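[Quick Read]: This paper asks how a controlled diffusion model can be guided to generate the synthetic samples that are most informative for a specific task. Existing approaches generate high-quality images that respect user-specified constraints but do not directly optimize for target task performance. Inspired by active learning, the authors propose integrating active-learning selection metrics into the backward diffusion process. The key is to take techniques commonly used in active learning, namely uncertainty, query by committee, and expected model change, and use them to guide sample generation through gradient approximation. The method requires no additional training: it only modifies the backward diffusion process of a pretrained ControlNet. Segmentation models trained on the guided synthetic data outperform those trained on non-guided synthetic data, underscoring the potential of synthetic data generation for downstream task optimization.

Link: https://arxiv.org/abs/2503.09221
Authors: Hannah Kniesel, Pedro Hermosilla, Timo Ropinski
Affiliations: Ulm University; TU Vienna
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: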

Abstract:Recent advances in conditional image generation from diffusion models have shown great potential in achieving impressive image quality while preserving the constraints introduced by the user. In particular, ControlNet enables precise alignment between ground truth segmentation masks and the generated image content, allowing the enhancement of training datasets in segmentation tasks. This raises a key question: Can ControlNet additionally be guided to generate the most informative synthetic samples for a specific task? Inspired by active learning, where the most informative real-world samples are selected based on sample difficulty or model uncertainty, we propose the first approach to integrate active learning-based selection metrics into the backward diffusion process for sample generation. Specifically, we explore uncertainty, query by committee, and expected model change, which are commonly used in active learning, and demonstrate their application for guiding the sample generation process through gradient approximation. Our method is training-free, modifying only the backward diffusion process, allowing it to be used on any pretrained ControlNet. Using this process, we show that segmentation models trained with guided synthetic data outperform those trained on non-guided synthetic data. Our work underscores the need for advanced control mechanisms for diffusion-based models, which are not only aligned with image content but additionally downstream task performance, highlighting the true potential of synthetic data generation.
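
For readers unfamiliar with the active learning metrics named above, the sketch below computes two of them for per-pixel class probabilities: predictive entropy as an uncertainty score, and a query-by-committee disagreement score (entropy of the averaged prediction minus the average member entropy). It only scores predictions; how the paper turns such scores into gradients that steer the backward diffusion process is not shown, and the function names are hypothetical.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of one class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def mean_pixel_entropy(prob_map):
    """Uncertainty score: average entropy over a list of per-pixel distributions."""
    return sum(entropy(p) for p in prob_map) / len(prob_map)


def committee_disagreement(member_probs):
    """Query-by-committee score for one pixel.

    Entropy of the consensus (mean) prediction minus the mean entropy of
    the members; zero when all members agree exactly.
    """
    n_members = len(member_probs)
    n_classes = len(member_probs[0])
    consensus = [sum(m[c] for m in member_probs) / n_members
                 for c in range(n_classes)]
    mean_member_entropy = sum(entropy(m) for m in member_probs) / n_members
    return entropy(consensus) - mean_member_entropy


# Two pixels: one confidently classified, one maximally uncertain.
uncertainty = mean_pixel_entropy([[1.0, 0.0], [0.5, 0.5]])
# Two committee members that contradict each other on one pixel.
disagreement = committee_disagreement([[1.0, 0.0], [0.0, 1.0]])
```

Higher values of either score mark samples a segmentation model finds harder, which is the kind of signal the abstract proposes to favor during generation.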

[CV-62] Other Vehicle Trajectories Are Also Needed: A Driving World Model Unifies Ego-Other Vehicle Trajectories in Video Latent Space

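[Quick Read]: This paper addresses the fact that existing driving world models focus on the ego vehicle's trajectory and leave other vehicles uncontrollable, which limits their ability to realistically simulate interaction between the ego vehicle and the driving scene. Matching multiple trajectories to the individual vehicles in a video in order to control video generation is also still a challenge. To address both issues, the paper proposes EOT-WM, a driving world model whose key idea is to unify ego and other vehicle trajectories in video. Specifically, ego and other vehicle trajectories in BEV space are first projected into image coordinates so that each trajectory matches its vehicle in the video. The resulting trajectory videos are then encoded by a Spatial-Temporal Variational Auto Encoder so that they align spatially and temporally with the driving video latents in a unified visual space. A trajectory-injected diffusion Transformer then denoises the noisy video latents to generate video under the guidance of the ego and other vehicle trajectories. The paper also proposes a metric based on control latent similarity to evaluate trajectory controllability. On the nuScenes dataset the model outperforms the state-of-the-art method by 30% in FID and 55% in FVD, and it can predict unseen driving scenes with self-produced trajectories.

Link: https://arxiv.org/abs/2503.09215
Authors: Jian Zhu, Zhengyu Jia, Tian Gao, Jiaxin Deng, Shidi Li, Fu Liu, Peng Jia, Xianpeng Lang, Xiaolong Sun
Affiliations: Li Auto Inc.
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 8 pages, 7 figures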

Abstract:Advanced end-to-end autonomous driving systems predict other vehicles’ motions and plan ego vehicle’s trajectory. The world model that can foresee the outcome of the trajectory has been used to evaluate the end-to-end autonomous driving system. However, existing world models predominantly emphasize the trajectory of the ego vehicle and leave other vehicles uncontrollable. This limitation hinders their ability to realistically simulate the interaction between the ego vehicle and the driving scenario. In addition, it remains a challenge to match multiple trajectories with each vehicle in the video to control the video generation. To address the above issues, a driving World Model named EOT-WM is proposed in this paper, unifying Ego-Other vehicle Trajectories in videos. Specifically, we first project ego and other vehicle trajectories in the BEV space into the image coordinate to match each trajectory with its corresponding vehicle in the video. Then, trajectory videos are encoded by the Spatial-Temporal Variational Auto Encoder to align with driving video latents spatially and temporally in the unified visual space. A trajectory-injected diffusion Transformer is further designed to denoise the noisy video latents for video generation with the guidance of ego-other vehicle trajectories. In addition, we propose a metric based on control latent similarity to evaluate the controllability of trajectories. Extensive experiments are conducted on the nuScenes dataset, and the proposed model outperforms the state-of-the-art method by 30% in FID and 55% in FVD. The model can also predict unseen driving scenes with self-produced trajectories.
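
The first step described above, projecting trajectories into image coordinates, can be illustrated with a standard pinhole camera model. This sketch assumes the trajectory points have already been transformed from BEV/ego coordinates into the camera frame (x right, y down, z forward); the intrinsics and the function names are illustrative assumptions, not values or code from the paper.

```python
def project_point(point_cam, fx, fy, cx, cy):
    """Pinhole projection of one camera-frame point to pixel coordinates.

    Returns None for points at or behind the image plane (z <= 0).
    """
    x, y, z = point_cam
    if z <= 0.0:
        return None
    return (fx * x / z + cx, fy * y / z + cy)


def project_trajectory(points_cam, fx, fy, cx, cy):
    """Project a trajectory, dropping points that are not in front of the camera."""
    pixels = (project_point(p, fx, fy, cx, cy) for p in points_cam)
    return [uv for uv in pixels if uv is not None]


# A vehicle 2 m to the right and 1 m below the camera, moving away,
# plus one point behind the camera that gets discarded (toy values).
trajectory = [(2.0, 1.0, 10.0), (2.0, 1.0, 20.0), (0.0, 1.0, -5.0)]
pixels = project_trajectory(trajectory, fx=1000.0, fy=1000.0, cx=800.0, cy=450.0)
```

As the vehicle moves farther away, its projected points converge toward the principal point (cx, cy), which is why a projected trajectory traces the vehicle's path across the frame.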

[CV-63] Robust Asymmetric Heterogeneous Federated Learning with Corrupted Clients

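[Quick Read]: This paper studies robust federated learning with model-heterogeneous and data-corrupted clients, where clients have different local model structures. Data corruption is unavoidable in real-world deployment because of factors such as random noise, compression artifacts, or environmental conditions, and it can severely damage the whole federated system. To address this, the paper introduces the Robust Asymmetric Heterogeneous Federated Learning (RAHFL) framework. Its key components are twofold. A Diversity-enhanced Supervised Contrastive Learning technique uses complex augmented samples, produced by a mixed-data augmentation strategy, for supervised contrastive learning, improving the robustness and adaptability of local models under various corruption patterns. An Asymmetric Heterogeneous Federated Learning strategy resists corrupted feedback from external clients by letting clients perform selective one-way learning during the collaborative learning phase, so they avoid incorporating lower-quality information from less robust or underperforming collaborators. Experiments confirm the method's effectiveness and robustness in diverse, challenging federated learning environments.

Link: https://arxiv.org/abs/2503.09206
Authors: Xiuwen Fang, Mang Ye, Bo Du
Affiliations: School of Computer Science, Taikang Center for Life and Medical Sciences, Wuhan University
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: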

Abstract:This paper studies a challenging robust federated learning task with model-heterogeneous and data-corrupted clients, where the clients have different local model structures. Data corruption is unavoidable due to factors such as random noise, compression artifacts, or environmental conditions in real-world deployment, drastically crippling the entire federated system. To address these issues, this paper introduces a novel Robust Asymmetric Heterogeneous Federated Learning (RAHFL) framework. We propose a Diversity-enhanced supervised Contrastive Learning technique to enhance the resilience and adaptability of local models on various data corruption patterns. Its basic idea is to utilize complex augmented samples obtained by the mixed-data augmentation strategy for supervised contrastive learning, thereby enhancing the ability of the model to learn robust and diverse feature representations. Furthermore, we design an Asymmetric Heterogeneous Federated Learning strategy to resist corrupt feedback from external clients. The strategy allows clients to perform selective one-way learning during the collaborative learning phase, enabling clients to refrain from incorporating lower-quality information from less robust or underperforming collaborators. Extensive experimental results demonstrate the effectiveness and robustness of our approach in diverse, challenging federated learning environments. Our code and models are publicly available at this https URL.

[CV-64] Teaching LMMs for Image Quality Scoring and Interpreting

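[Quick Read]: This paper addresses the long-standing practice in Image Quality Assessment (IQA) of handling image quality scoring and interpreting as independent tasks. From the perspective of the Human Visual System (HVS) and the Perception-Decision Integration Model, the two are inherently linked: interpreting provides the basis for scoring, and scoring is an abstract summary of interpreting. The key is Q-SiT (Quality Scoring and Interpreting joint Teaching), a unified framework that lets large multimodal models (LMMs) learn both capabilities simultaneously. This is achieved by converting conventional IQA datasets into learnable question-answering datasets and training on them together with human-annotated quality interpreting data. The paper also introduces an efficient scoring and interpreting balance strategy, which first determines the optimal data mix ratio on lightweight LMMs and then maps that ratio to the primary LMMs for fine-tuning adjustment. This strategy mitigates task interference, improves cross-task knowledge transfer, and costs far less compute than optimizing directly on full-scale LMMs. The resulting models, Q-SiT and its lightweight variant Q-SiT-mini, perform strongly on both tasks and generalize well.

Link: https://arxiv.org/abs/2503.09197
Authors: Zicheng Zhang, Haoning Wu, Ziheng Jia, Weisi Lin, Guangtao Zhai
Affiliations: Institute of Image Communication and Network Engineering, Shanghai Jiao Tong University; S-Lab, Nanyang Technological University; School of Computer Science and Engineering, Nanyang Technological University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: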

Abstract:Image quality scoring and interpreting are two fundamental components of Image Quality Assessment (IQA). The former quantifies image quality, while the latter enables descriptive question answering about image quality. Traditionally, these two tasks have been addressed independently. However, from the perspective of the Human Visual System (HVS) and the Perception-Decision Integration Model, they are inherently interconnected: interpreting serves as the foundation for scoring, while scoring provides an abstract summary of interpreting. Thus, unifying these capabilities within a single model is both intuitive and logically coherent. In this paper, we propose Q-SiT (Quality Scoring and Interpreting joint Teaching), a unified framework that enables large multimodal models (LMMs) to learn both image quality scoring and interpreting simultaneously. We achieve this by transforming conventional IQA datasets into learnable question-answering datasets and incorporating human-annotated quality interpreting data for training. Furthermore, we introduce an efficient scoring interpreting balance strategy, which first determines the optimal data mix ratio on lightweight LMMs and then maps this ratio to primary LMMs for fine-tuning adjustment. This strategy not only mitigates task interference and enhances cross-task knowledge transfer but also significantly reduces computational costs compared to direct optimization on full-scale LMMs. With this joint learning framework and corresponding training strategy, we develop Q-SiT, the first model capable of simultaneously performing image quality scoring and interpreting tasks, along with its lightweight variant, Q-SiT-mini. Experimental results demonstrate that Q-SiT achieves strong performance in both tasks with superior generalization in IQA. Project page at this https URL.

[CV-65] Learning Appearance and Motion Cues for Panoptic Tracking

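[Quick Read]: This paper addresses panoptic tracking, the pixel-level scene understanding of videos obtained by integrating instance tracking into panoptic segmentation, which gives robots a spatio-temporal understanding of dynamic environments. The key is a novel approach that simultaneously captures general semantic information and instance-specific appearance and motion features. Unlike existing methods, which overlook dynamic scene attributes, it exploits appearance and motion cues through dedicated network heads that use multi-scale deformable convolutions to combine semantic context with motion-enhanced appearance features and learn tracking embeddings. It also introduces a two-step fusion module that first matches instances from the current time step with instances propagated from previous time steps and then refines the associations using motion-enhanced appearance embeddings, improving robustness in challenging scenarios.

Link: https://arxiv.org/abs/2503.09191
Authors: Juana Valeria Hurtado, Sajad Marvi, Rohit Mohan, Abhinav Valada
Affiliations: Department of Computer Science, University of Freiburg, Germany
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: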

Abstract:Panoptic tracking enables pixel-level scene interpretation of videos by integrating instance tracking in panoptic segmentation. This provides robots with a spatio-temporal understanding of the environment, an essential attribute for their operation in dynamic environments. In this paper, we propose a novel approach for panoptic tracking that simultaneously captures general semantic information and instance-specific appearance and motion features. Unlike existing methods that overlook dynamic scene attributes, our approach leverages both appearance and motion cues through dedicated network heads. These interconnected heads employ multi-scale deformable convolutions that reason about scene motion offsets with semantic context and motion-enhanced appearance features to learn tracking embeddings. Furthermore, we introduce a novel two-step fusion module that integrates the outputs from both heads by first matching instances from the current time step with propagated instances from previous time steps and subsequently refines associations using motion-enhanced appearance embeddings, improving robustness in challenging scenarios. Extensive evaluations of our proposed model on two benchmark datasets demonstrate that it achieves state-of-the-art performance in panoptic tracking accuracy, surpassing prior methods in maintaining object identities over time. To facilitate future research, we make the code available at this http URL

[CV-66] Polygonizing Roof Segments from High-Resolution Aerial Images Using Yolov8-Based Edge Detection

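[Quick read]: This paper addresses challenges in automatic roof structure vectorization, in particular extracting and vectorizing roof details from remote sensing images. Whereas previous geometric-primitive-based methods rely on corner detection, the proposed method uses edge detection as the core mechanism for roof reconstruction and uses geometric relationships to define corners and faces. The key is adapting the YOLOv8 OBB model, originally designed for rotated object detection, so that it extracts roof edges effectively. The method is robust to noise and occlusion and yields precise vectorized roof representations. Experiments show that it outperforms the state-of-the-art segmentation model SAM at the raster level and, under several metrics, markedly improves the vectorization accuracy of complex roof structures at the vector level.

Link: https://arxiv.org/abs/2503.09187
Authors: Qipeng Mei, Dimitri Bulatov, Dorota Iwaszczuk
Affiliations: Technical University of Darmstadt; Fraunhofer IOSB
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 12 pages, 6 figures, conference paper (VISAPP 2025, part of the 20th International Joint Conference on Computer Vision, Imaging, and Computer Graphics Theory and Applications)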

Abstract:This study presents a novel approach for roof detail extraction and vectorization using remote sensing images. Unlike previous geometric-primitive-based methods that rely on the detection of corners, our method focuses on edge detection as the primary mechanism for roof reconstruction, while utilizing geometric relationships to define corners and faces. We adapt the YOLOv8 OBB model, originally designed for rotated object detection, to extract roof edges effectively. Our method demonstrates robustness against noise and occlusion, leading to precise vectorized representations of building roofs. Experiments conducted on the SGA and Melville datasets highlight the method’s effectiveness. At the raster level, our model outperforms the state-of-the-art foundation segmentation model (SAM), achieving an mIoU between 0.85 and 1 for most samples and an ovIoU close to 0.97. At the vector level, evaluation using the Hausdorff distance, PolyS metric, and our raster-vector-metric demonstrates significant improvements after polygonization, with a close approximation to the reference data. The method successfully handles diverse roof structures and refines edge gaps, even on complex roof structures from new datasets excluded from training. Our findings underscore the potential of this approach to address challenges in automatic roof structure vectorization, supporting various applications such as urban terrain reconstruction.
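
The abstract evaluates results with IoU at the raster level and the Hausdorff distance at the vector level. As a rough illustration only, the sketch below computes both on toy inputs. The function names and the choice to measure the Hausdorff distance on polygon vertices (rather than on full boundaries) are assumptions made here for brevity, not the paper's evaluation code.

```python
import math


def iou(mask_a, mask_b):
    """Intersection over union of two binary masks given as nested lists of 0/1."""
    inter = union = 0
    for row_a, row_b in zip(mask_a, mask_b):
        for a, b in zip(row_a, row_b):
            inter += 1 if (a and b) else 0
            union += 1 if (a or b) else 0
    # Two empty masks are treated as a perfect match.
    return inter / union if union else 1.0


def hausdorff(points_a, points_b):
    """Symmetric Hausdorff distance between two finite point sets."""
    def directed(src, dst):
        # Largest distance from any source point to its nearest target point.
        return max(min(math.dist(p, q) for q in dst) for p in src)
    return max(directed(points_a, points_b), directed(points_b, points_a))


predicted = [[1, 1], [0, 0]]
reference = [[1, 0], [0, 0]]
raster_score = iou(predicted, reference)
vertex_gap = hausdorff([(0, 0), (1, 0)], [(0, 0), (1, 1)])
```

A lower Hausdorff distance means the predicted polygon vertices lie closer to the reference ones, while a higher IoU means better raster overlap.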

[CV-67] Incomplete Multi-view Clustering via Diffusion Contrastive Generation

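[Quick read]: This paper addresses the key challenge of incomplete multi-view clustering (IMVC): clustering efficiently and accurately when multi-view datasets commonly have missing data. Conventional methods usually recover the missing views first and then apply standard multi-view clustering, but such imputation-based methods have two notable limitations: they depend heavily on paired data, which is impractical at high missing rates, and the generated data lacks diversity and discriminability, leading to suboptimal clustering. To overcome these shortcomings, the paper proposes Diffusion Contrastive Generation (DCG). Its core idea is to exploit the consistency between the diffusion and clustering processes: forward diffusion and reverse denoising are applied to intra-view data to learn its distribution characteristics, and contrastive learning aligns the generated views with the real views, enabling accurate view recovery under arbitrary missing-view scenarios. DCG also integrates instance-level and category-level interactive learning to exploit the consistent and complementary information in multi-view data, achieving robust end-to-end clustering. Experiments show that the method outperforms state-of-the-art approaches.

Link: https://arxiv.org/abs/2503.09185
Authors: Yuanyang Zhang, Yijie Lin, Weiqing Yan, Li Yao, Xinhang Wan, Guangyuan Li, Chao Zhang, Guanzhou Ke, Jie Xu
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: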

Abstract:Incomplete multi-view clustering (IMVC) has garnered increasing attention in recent years due to the common issue of missing data in multi-view datasets. The primary approach to address this challenge involves recovering the missing views before applying conventional multi-view clustering methods. Although imputation-based IMVC methods have achieved significant improvements, they still encounter notable limitations: 1) heavy reliance on paired data for training the data recovery module, which is impractical in real scenarios with high missing data rates; 2) the generated data often lacks diversity and discriminability, resulting in suboptimal clustering results. To address these shortcomings, we propose a novel IMVC method called Diffusion Contrastive Generation (DCG). Motivated by the consistency between the diffusion and clustering processes, DCG learns the distribution characteristics to enhance clustering by applying forward diffusion and reverse denoising processes to intra-view data. By performing contrastive learning on a limited set of paired multi-view samples, DCG can align the generated views with the real views, facilitating accurate recovery of views across arbitrary missing view scenarios. Additionally, DCG integrates instance-level and category-level interactive learning to exploit the consistent and complementary information available in multi-view data, achieving robust and end-to-end clustering. Extensive experiments demonstrate that our method outperforms state-of-the-art approaches.
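
The abstract mentions applying a forward diffusion process to intra-view data. As a hedged illustration, the sketch below shows the standard DDPM-style closed-form noising step, x_t = sqrt(ᾱ_t) x_0 + sqrt(1 − ᾱ_t) ε. The linear noise schedule, the step count, and all function names are assumptions chosen for the example; the abstract does not specify DCG's actual schedule or parameterization.

```python
import math
import random


def linear_betas(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced noise variances (an assumed schedule; needs num_steps >= 2)."""
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def cumulative_alphas(betas):
    """Running product of (1 - beta_t), usually written alpha-bar_t."""
    out, prod = [], 1.0
    for beta in betas:
        prod *= 1.0 - beta
        out.append(prod)
    return out


def forward_diffuse(x0, t, alpha_bars, rng):
    """Sample x_t directly from x_0 using the closed-form Gaussian."""
    scale = math.sqrt(alpha_bars[t])
    sigma = math.sqrt(1.0 - alpha_bars[t])
    return [scale * v + sigma * rng.gauss(0.0, 1.0) for v in x0]


alpha_bars = cumulative_alphas(linear_betas(1000))
noisy = forward_diffuse([1.0, -1.0, 0.5], t=999, alpha_bars=alpha_bars,
                        rng=random.Random(0))
```

At small t the sample stays close to the original feature vector; at the last step it is almost pure Gaussian noise, which is what the reverse denoising process then learns to undo.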

[CV-68] WonderVerse: Extendable 3D Scene Generation with Video Generative Models

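[Quick read]: This paper addresses the geometric distortions and inconsistencies of existing 3D scene generation methods while increasing the scale and extendability of the generated environments. Existing methods typically rely on iterative depth estimation and image inpainting, which easily leads to geometric problems. In response, the paper proposes WonderVerse, a simple but effective framework whose key idea is to use the world-level priors embedded in video generative foundation models to create highly immersive and geometrically coherent 3D environments. It also introduces a new technique for controllable 3D scene extension and an abnormal sequence detection module, based on camera trajectory, that addresses geometric inconsistency in the generated videos. WonderVerse is notable for its simple pipeline and its compatibility with various 3D reconstruction methods, which allows efficient, high-quality generation that markedly outperforms existing works relying on more complex architectures.

Link: https://arxiv.org/abs/2503.09160
Authors: Hao Feng, Zhi Zuo, Jia-hui Pan, Ka-hei Hui, Yi-hua Shao, Qi Dou, Wei Xie, Zheng-zhe Liu
Affiliations: Central China Normal University; Nanjing University of Aeronautics and Astronautics; The Chinese University of Hong Kong; Autodesk; University of Science and Technology Beijing; Lingnan University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: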

Abstract:We introduce WonderVerse, a simple but effective framework for generating extendable 3D scenes. Unlike existing methods that rely on iterative depth estimation and image inpainting, often leading to geometric distortions and inconsistencies, WonderVerse leverages the powerful world-level priors embedded within video generative foundation models to create highly immersive and geometrically coherent 3D environments. Furthermore, we propose a new technique for controllable 3D scene extension to substantially increase the scale of the generated environments. Besides, we introduce a novel abnormal sequence detection module that utilizes camera trajectory to address geometric inconsistency in the generated videos. Finally, WonderVerse is compatible with various 3D reconstruction methods, allowing both efficient and high-quality generation. Extensive experiments on 3D scene generation demonstrate that our WonderVerse, with an elegant and simple pipeline, delivers extendable and highly-realistic 3D scenes, markedly outperforming existing works that rely on more complex architectures.

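[CV-69] FaVChat: Unlocking Fine-Grained Facial Video Understanding with Multimodal Large Language Models

[Quick read]: This paper addresses the limited ability of video-based multimodal large language models (VMLLMs) in fine-grained facial comprehension. The core question is how to improve a model's perception and description of complex face-related visual information, which matters for human-centric AI systems. To fill this gap, the paper proposes FaVChat, a model designed specifically for fine-grained facial video understanding.

The key to the solution is a large-scale facial video dataset of more than 60,000 videos, in which 83 fine-grained facial attributes are used to enrich GPT-4o-generated captions, yielding high-quality video-summary pairs and an additional 170,000 fine-grained question-answering (QA) pairs. The paper also proposes a hybrid architecture consisting of a general visual encoder, a dedicated facial encoder, and a mixture-of-experts-based adapter that adaptively fuses multi-source visual features. To reduce information loss during feature transformation, multi-granularity representations are extracted from the facial encoder and integrated into the subsequent language model, improving its handling of visual details at different levels. Finally, a progressive training strategy moves from video summarization to a high-quality video QA subset, gradually increasing task difficulty to strengthen the model's understanding of fine-grained visual information.

Link: https://arxiv.org/abs/2503.09158
Authors: Fufangchen Zhao, Ming Li, Linrui Xu, Wenhao Jiang, Jian Gao, Danfeng Yan
Affiliations: State Key Laboratory of Networking and Switching Technology, BUPT; Guangdong Laboratory of Artificial Intelligence and Digital Economy (SZ); Central South University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: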

Abstract:Video-based multimodal large language models (VMLLMs) have demonstrated remarkable potential in cross-modal video understanding. However, their abilities in fine-grained face comprehension remain largely underexplored. Given its pivotal role in human-centric intelligence, developing VMLLMs for facial understanding is a fundamental problem. To address this gap, we propose FaVChat, the first VMLLM specifically designed for fine-grained facial video understanding. To facilitate its training, we construct a large-scale facial video dataset comprising over 60k videos, with the majority annotated with 83 fine-grained facial attributes. These attributes are incorporated to enrich GPT-4o-generated captions, yielding 60k high-quality video-summary pairs and an additional 170k fine-grained question-answering (QA) pairs. To effectively capture rich facial clues, we propose a hybrid model architecture composed of a general visual encoder, a dedicated facial encoder, and a mixture-of-experts-enhanced adapter for adaptive fusion of multi-source visual features. To mitigate information loss during feature transformation, we extract multi-granularity representations from the facial encoder and integrate them into the subsequent LLM. This design enhances the model’s ability to comprehend and respond to questions involving diverse levels of visual details. We employ a progressive training paradigm, transitioning from video summarization to a high-quality subset of video QA, gradually increasing task complexity to enhance the model’s fine-grained visual perception. We conduct extensive zero-shot evaluation on a couple of public benchmarks, demonstrating that FaVChat consistently surpasses existing VMLLMs across multiple tasks.

[CV-70] SwapAnyone: Consistent and Realistic Video Synthesis for Swapping Any Person into Any Video

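[Quick read]: This paper addresses challenges in video body-swapping. Existing methods decompose the task into several sub-tasks and therefore lack end-to-end optimization, which causes luminance variations across frames, disorganized occlusion relationships, and a noticeable separation between bodies and background. The key is to redefine video body-swapping as an independent task and to propose three core consistencies: identity consistency, motion consistency, and environment consistency. To this end, the authors design an end-to-end model named SwapAnyone that treats video body-swapping as a video inpainting task with reference fidelity and motion control. To improve environmental harmony, particularly luminance harmony, the paper introduces a novel EnvHarmony strategy for progressive training. Experiments show that the method achieves state-of-the-art (SOTA) performance among open-source methods while approaching or surpassing closed-source models across multiple dimensions.

Link: https://arxiv.org/abs/2503.09154
Authors: Chengshu Zhao, Yunyang Ge, Xinhua Cheng, Bin Zhu, Yatian Pang, Bin Lin, Fan Yang, Feng Gao, Li Yuan
Affiliations: Peking University; Rabbitpre Intelligence; National University of Singapore
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: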

Abstract:Video body-swapping aims to replace the body in an existing video with a new body from arbitrary sources, which has garnered more attention in recent years. Existing methods treat video body-swapping as a composite of multiple tasks instead of an independent task and typically rely on various models to achieve video body-swapping sequentially. However, these methods fail to achieve end-to-end optimization for the video body-swapping which causes issues such as variations in luminance among frames, disorganized occlusion relationships, and the noticeable separation between bodies and background. In this work, we define video body-swapping as an independent task and propose three critical consistencies: identity consistency, motion consistency, and environment consistency. We introduce an end-to-end model named SwapAnyone, treating video body-swapping as a video inpainting task with reference fidelity and motion control. To improve the ability to maintain environmental harmony, particularly luminance harmony in the resulting video, we introduce a novel EnvHarmony strategy for training our model progressively. Additionally, we provide a dataset named HumanAction-32K covering various videos about human actions. Extensive experiments demonstrate that our method achieves State-Of-The-Art (SOTA) performance among open-source methods while approaching or surpassing closed-source models across multiple dimensions. All code, model weights, and the HumanAction-32K dataset will be open-sourced at this https URL.

[CV-71] Reangle-A-Video: 4D Video Generation as Video-to-Video Translation

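[Quick read]: This paper addresses the generation of synchronized multi-view videos from a single input video and proposes a unified framework called Reangle-A-Video. Unlike mainstream approaches that train multi-view video diffusion models on large-scale 4D datasets, it reframes multi-view video generation as a video-to-videos translation task that leverages publicly available image and video diffusion priors. The method has two key stages: (1) multi-view motion learning, in which an image-to-video diffusion transformer is fine-tuned in a self-supervised manner to distill view-invariant motion; and (2) at inference time, cross-view consistency guidance is used while warping and inpainting the first frame of the input video, producing multi-view consistent starting images and hence consistent multi-view videos. Experiments show that Reangle-A-Video surpasses existing methods on static view transport and dynamic camera control, offering a new solution for multi-view video generation.

Link: https://arxiv.org/abs/2503.09151
Authors: Hyeonho Jeong, Suhyeon Lee, Jong Chul Ye
Affiliations: KAIST
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Project page: this https URL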

Abstract:We introduce Reangle-A-Video, a unified framework for generating synchronized multi-view videos from a single input video. Unlike mainstream approaches that train multi-view video diffusion models on large-scale 4D datasets, our method reframes the multi-view video generation task as video-to-videos translation, leveraging publicly available image and video diffusion priors. In essence, Reangle-A-Video operates in two stages. (1) Multi-View Motion Learning: An image-to-video diffusion transformer is synchronously fine-tuned in a self-supervised manner to distill view-invariant motion from a set of warped videos. (2) Multi-View Consistent Image-to-Images Translation: The first frame of the input video is warped and inpainted into various camera perspectives under an inference-time cross-view consistency guidance using DUSt3R, generating multi-view consistent starting images. Extensive experiments on static view transport and dynamic camera control show that Reangle-A-Video surpasses existing methods, establishing a new solution for multi-view video generation. We will publicly release our code and data. Project page: this https URL

[CV-72] Memory-enhanced Retrieval Augmentation for Long Video Understanding

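[Quick read]: This paper addresses a fundamental limitation of conventional retrieval-augmented generation (RAG) methods for long-video understanding (LVU): their dependence on explicit search queries, which are unavailable in many situations and therefore limit performance. To overcome this, the paper proposes MemVid, a novel RAG approach inspired by human cognitive memory. The key is a framework of four basic steps: memorizing holistic video information, reasoning about the task's information needs based on that memory, retrieving critical moments according to those needs, and focusing on the retrieved moments to produce the final answer. To strengthen memory-grounded reasoning and achieve the best end-to-end performance, a curriculum learning strategy starts from supervised learning and then progressively explores and reinforces more plausible reasoning outcomes through reinforcement learning. Experiments show that MemVid significantly outperforms existing methods on several popular LVU benchmarks.

Link: https://arxiv.org/abs/2503.09149
Authors: Huaying Yuan, Zheng Liu, Minhao Qin, Hongjin Qian, Y Shu, Zhicheng Dou, Ji-Rong Wen
Affiliations: Gaoling School of Artificial Intelligence, Renmin University of China; Beijing Academy of Artificial Intelligence; Institute of Automation, Chinese Academy of Sciences; University of Trento
Categories: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Comments: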

Abstract:Retrieval-augmented generation (RAG) shows strong potential in addressing long-video understanding (LVU) tasks. However, traditional RAG methods remain fundamentally limited due to their dependence on explicit search queries, which are unavailable in many situations. To overcome this challenge, we introduce a novel RAG-based LVU approach inspired by the cognitive memory of human beings, which is called MemVid. Our approach operates with four basic steps: memorizing holistic video information, reasoning about the task’s information needs based on the memory, retrieving critical moments based on the information needs, and focusing on the retrieved moments to produce the final answer. To enhance the system’s memory-grounded reasoning capabilities and achieve optimal end-to-end performance, we propose a curriculum learning strategy. This approach begins with supervised learning on well-annotated reasoning results, then progressively explores and reinforces more plausible reasoning outcomes through reinforcement learning. We perform extensive evaluations on popular LVU benchmarks, including MLVU, VideoMME and LVBench. In our experiment, MemVid significantly outperforms existing RAG-based methods and popular LVU models, which demonstrates the effectiveness of our approach. Our model and source code will be made publicly available upon acceptance.

[CV-73] Generative Frame Sampler for Long Video Understanding

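[Quick read]: This paper addresses the heavy computational burden of long video understanding, that is, how to perceive long videos containing thousands of frames efficiently. The proposed solution is a plug-and-play module called Generative Frame Sampler (GenS). The key idea is to use the vision-language capabilities of a lightweight VideoLLM to identify question-relevant frames and thereby select frames efficiently. To support effective frame retrieval, the authors construct GenS-Video-150K, a large-scale video instruction dataset with dense frame relevance annotations. Experiments show that GenS significantly improves long video understanding for both open-source models and proprietary systems.

Link: https://arxiv.org/abs/2503.09146
Authors: Linli Yao, Haoning Wu, Kun Ouyang, Yuanxing Zhang, Caiming Xiong, Bei Chen, Xu Sun, Junnan Li
Affiliations: Peking University; Salesforce Research; Independent Researcher
Categories: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Comments: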

Abstract:Despite recent advances in Video Large Language Models (VideoLLMs), effectively understanding long-form videos remains a significant challenge. Perceiving lengthy videos containing thousands of frames poses substantial computational burden. To mitigate this issue, this paper introduces Generative Frame Sampler (GenS), a plug-and-play module integrated with VideoLLMs to facilitate efficient lengthy video perception. Built upon a lightweight VideoLLM, GenS leverages its inherent vision-language capabilities to identify question-relevant frames. To facilitate effective retrieval, we construct GenS-Video-150K, a large-scale video instruction dataset with dense frame relevance annotations. Extensive experiments demonstrate that GenS consistently boosts the performance of various VideoLLMs, including open-source models (Qwen2-VL-7B, Aria-25B, VILA-40B, LLaVA-Video-7B/72B) and proprietary assistants (GPT-4o, Gemini). When equipped with GenS, open-source VideoLLMs achieve impressive state-of-the-art results on long-form video benchmarks: LLaVA-Video-72B reaches 66.8 (+4.3) on LongVideoBench and 77.0 (+2.7) on MLVU, while Aria obtains 39.2 on HourVideo surpassing the Gemini-1.5-pro by 1.9 points. We will release all datasets and models at this https URL.
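
Once each frame has a question-relevance score, the selection step can be very small. The sketch below assumes such scores are already available as a list of floats (in GenS they come from a lightweight VideoLLM; here they are hypothetical inputs) and keeps the top-k frames in temporal order. The function name and the top-k rule are illustrative choices, not the paper's exact procedure.

```python
def select_frames(relevance, k):
    """Return the indices of the k most relevant frames, in temporal order."""
    # Rank frame indices by score, highest first, and keep the best k.
    top = sorted(range(len(relevance)), key=lambda i: relevance[i], reverse=True)[:k]
    # Restore temporal order so the downstream model sees a coherent clip.
    return sorted(top)


scores = [0.1, 0.9, 0.3, 0.8]  # hypothetical per-frame relevance scores
kept = select_frames(scores, k=2)
```

Only the selected frames would then be passed to the larger VideoLLM, which is where the computational saving comes from.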

[CV-74] Exo2Ego: Exocentric Knowledge Guided MLLM for Egocentric Video Understanding

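[Quick read]: This paper addresses two main problems: current Multimodal Large Language Models (MLLMs) focus on third-person (exocentric) vision and overlook the unique aspects of first-person (egocentric) videos, and high data acquisition costs limit data size and thus MLLM performance. The key idea is to learn the mapping between the exocentric and egocentric domains so that the extensive exocentric knowledge in existing MLLMs can enhance egocentric video understanding. To this end, the authors introduce the Ego-ExoClip pre-training dataset and a progressive training pipeline with three stages: Teacher Self-Preparation, Teacher-Student Guidance, and Student Self-Practice. They also introduce EgoIT, an instruction-tuning dataset drawn from multiple sources, and the EgoBench benchmark, to strengthen instruction following and evaluate models thoroughly. Experiments show that existing MLLMs perform poorly on egocentric video understanding, while the proposed model significantly outperforms leading methods.

Link: https://arxiv.org/abs/2503.09143
Authors: Haoyu Zhang, Qiaohui Chu, Meng Liu, Yunxiao Wang, Bin Wen, Fan Yang, Tingting Gao, Di Zhang, Yaowei Wang, Liqiang Nie
Affiliations: Harbin Institute of Technology (Shenzhen); Pengcheng Laboratory; Shandong Jianzhu University; Shandong University; Kuaishou
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project: this https URL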

Abstract:AI personal assistants, deployed through robots or wearables, require embodied understanding to collaborate effectively with humans. Current Multimodal Large Language Models (MLLMs) primarily focus on third-person (exocentric) vision, overlooking the unique aspects of first-person (egocentric) videos. Additionally, high acquisition costs limit data size, impairing MLLM performance. To address these challenges, we propose learning the mapping between exocentric and egocentric domains, leveraging the extensive exocentric knowledge within existing MLLMs to enhance egocentric video understanding. To this end, we introduce Ego-ExoClip, a pre-training dataset comprising 1.1M synchronized ego-exo clip-text pairs derived from Ego-Exo4D. Our approach features a progressive training pipeline with three stages: Teacher Self-Preparation, Teacher-Student Guidance, and Student Self-Practice. Additionally, we propose an instruction-tuning data EgoIT from multiple sources to strengthen the model’s instruction-following capabilities, along with the EgoBench benchmark comprising eight different tasks for thorough evaluation. Extensive experiments across diverse egocentric tasks reveal that existing MLLMs perform inadequately in egocentric video understanding, while our model significantly outperforms these leading models.

[CV-75] Investigation of Frame Differences as Motion Cues for Video Object Segmentation ICML

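[Quick read]: This paper addresses the real-time use of Automatic Video Object Segmentation (AVOS) on resource-constrained devices. Conventional methods extract motion information with optical flow, whose computational cost limits real-time use. In response, the paper proposes frame differences as an alternative to optical flow for extracting motion cues. The key is an extended U-Net-like model that takes the frame to be segmented and a frame difference as inputs and outputs an estimated segmentation map. The authors show that, on videos captured by stationary cameras, this achieves performance comparable to using optical flow as input at a much lower computational cost.

Link: https://arxiv.org/abs/2503.09132
Authors: Sota Kawamura, Hirotada Honda, Shugo Nakamura, Takashi Sano
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 8 pages, 3 figures, 2 tables. Accepted to The 9th International Conference on Machine Learning and Soft Computing (ICMLSC 2025)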

Abstract:Automatic Video Object Segmentation (AVOS) refers to the task of autonomously segmenting target objects in video sequences without relying on human-provided annotations in the first frames. In AVOS, the use of motion information is crucial, with optical flow being a commonly employed method for capturing motion cues. However, the computation of optical flow is resource-intensive, making it unsuitable for real-time applications, especially on edge devices with limited computational resources. In this study, we propose using frame differences as an alternative to optical flow for motion cue extraction. We developed an extended U-Net-like AVOS model that takes a frame on which segmentation is performed and a frame difference as inputs, and outputs an estimated segmentation map. Our experimental results demonstrate that the proposed model achieves performance comparable to the model with optical flow as an input, particularly when applied to videos captured by stationary cameras. Our results suggest the usefulness of employing frame differences as motion cues in cases with limited computational resources.
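
A frame difference is cheap to compute, which is the point of the proposal. The sketch below is a minimal illustration on tiny grayscale frames stored as nested lists. Using the absolute difference, and stacking it with the current frame as a second channel, are assumptions made here; the abstract does not state exactly how the difference is computed or fed to the network.

```python
def frame_difference(prev_frame, curr_frame):
    """Absolute per-pixel difference of two equally sized grayscale frames."""
    return [
        [abs(c - p) for p, c in zip(prev_row, curr_row)]
        for prev_row, curr_row in zip(prev_frame, curr_frame)
    ]


def build_model_input(prev_frame, curr_frame):
    """Stack the current frame and the difference map as two input channels."""
    return [curr_frame, frame_difference(prev_frame, curr_frame)]


prev = [[10, 10, 10], [10, 10, 10]]
curr = [[10, 200, 10], [10, 10, 10]]  # one pixel changed between frames
diff = frame_difference(prev, curr)
model_input = build_model_input(prev, curr)
```

With a stationary camera the background cancels out and only moving regions remain non-zero, which matches the setting where the abstract reports results comparable to optical flow.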

[CV-76] MP-HSIR: A Multi-Prompt Framework for Universal Hyperspectral Image Restoration

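[Quick read]: This paper addresses the severe spectral and spatial distortions that hyperspectral images (HSIs) suffer from diverse and unknown degradations during imaging. Existing methods usually rely on specific degradation assumptions, which limits their effectiveness in complex scenarios. To address this, the paper proposes MP-HSIR, a novel multi-prompt framework. Its key idea is to integrate spectral, textual, and visual prompts to achieve universal HSI restoration across degradation types and intensities. Specifically, it develops a prompt-guided spatial-spectral transformer that combines spatial self-attention with a prompt-guided dual-branch spectral self-attention. To enhance spectral reconstruction, spectral prompts provide universal low-rank spectral patterns as prior knowledge. In addition, a text-visual synergistic prompt fuses high-level semantic representations with fine-grained visual features to encode degradation information and guide the restoration process. Experiments show that MP-HSIR outperforms existing all-in-one methods on all tasks and surpasses state-of-the-art task-specific methods on several tasks.

Link: https://arxiv.org/abs/2503.09131
Authors: Zhehui Wu, Yong Chen, Naoto Yokoya, Wei He
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: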

Abstract:Hyperspectral images (HSIs) often suffer from diverse and unknown degradations during imaging, leading to severe spectral and spatial distortions. Existing HSI restoration methods typically rely on specific degradation assumptions, limiting their effectiveness in complex scenarios. In this paper, we propose MP-HSIR, a novel multi-prompt framework that effectively integrates spectral, textual, and visual prompts to achieve universal HSI restoration across diverse degradation types and intensities. Specifically, we develop a prompt-guided spatial-spectral transformer, which incorporates spatial self-attention and a prompt-guided dual-branch spectral self-attention. Since degradations affect spectral features differently, we introduce spectral prompts in the local spectral branch to provide universal low-rank spectral patterns as prior knowledge for enhancing spectral reconstruction. Furthermore, the text-visual synergistic prompt fuses high-level semantic representations with fine-grained visual features to encode degradation information, thereby guiding the restoration process. Extensive experiments on 9 HSI restoration tasks, including all-in-one scenarios, generalization tests, and real-world cases, demonstrate that MP-HSIR not only consistently outperforms existing all-in-one methods but also surpasses state-of-the-art task-specific approaches across multiple tasks. The code and models will be released at this https URL.

[CV-77] InteractEdit: Zero-Shot Editing of Human-Object Interactions in Images

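[Quick read]: This paper addresses zero-shot Human-Object Interaction (HOI) editing: transforming an existing interaction in an image into a new, desired interaction while preserving the identities of the subject and object. Unlike simple attribute manipulation, object replacement, or style transfer, the task involves complex spatial, contextual, and relational dependencies. Existing methods are usually tied to the source image structure and struggle with the substantial structural changes that new interactions require. The key solution is the InteractEdit framework, which decomposes the scene into subject, object, and background components and uses Low-Rank Adaptation (LoRA) and selective fine-tuning to learn the visual identity of the source image while preserving pretrained interaction priors, striking an effective balance between interaction editing and identity consistency.

Link: https://arxiv.org/abs/2503.09130
Authors: Jiun Tian Hoe, Weipeng Hu, Wei Zhou, Chao Xie, Ziwei Wang, Chee Seng Chan, Xudong Jiang, Yap-Peng Tan
Affiliations: Nanyang Technological University; Sun Yat-sen University; Nanjing Forestry University; Universiti Malaya
Categories: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Comments: Website: this https URL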

Abstract:This paper presents InteractEdit, a novel framework for zero-shot Human-Object Interaction (HOI) editing, addressing the challenging task of transforming an existing interaction in an image into a new, desired interaction while preserving the identities of the subject and object. Unlike simpler image editing scenarios such as attribute manipulation, object replacement or style transfer, HOI editing involves complex spatial, contextual, and relational dependencies inherent in humans-objects interactions. Existing methods often overfit to the source image structure, limiting their ability to adapt to the substantial structural modifications demanded by new interactions. To address this, InteractEdit decomposes each scene into subject, object, and background components, then employs Low-Rank Adaptation (LoRA) and selective fine-tuning to preserve pretrained interaction priors while learning the visual identity of the source image. This regularization strategy effectively balances interaction edits with identity consistency. We further introduce IEBench, the most comprehensive benchmark for HOI editing, which evaluates both interaction editing and identity preservation. Our extensive experiments show that InteractEdit significantly outperforms existing methods, establishing a strong baseline for future HOI editing research and unlocking new possibilities for creative and practical applications. Code will be released upon publication.
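
The abstract relies on Low-Rank Adaptation (LoRA) to keep pretrained priors intact while learning identity. The sketch below shows the generic LoRA forward pass, y = W x + (alpha / r) · B (A x), where W stays frozen and only the small matrices A and B would be trained. It is a toy pure-Python illustration of LoRA in general; which layers InteractEdit adapts, and with what rank or scaling, is not stated in the abstract and is not modeled here.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, alpha=1.0):
    """Compute W x + (alpha / r) * B (A x); W is frozen, A and B are the adapters."""
    r = len(A)  # rank of the update = number of rows of A
    base = matvec(W, x)               # pretrained path, left untouched
    update = matvec(B, matvec(A, x))  # low-rank correction
    return [b + (alpha / r) * u for b, u in zip(base, update)]


W = [[1.0, 0.0], [0.0, 1.0]]  # frozen pretrained weight (d_out x d_in)
A = [[1.0, 1.0]]              # down-projection (r x d_in), r = 1
x = [2.0, 3.0]

unchanged = lora_forward(W, A, [[0.0], [0.0]], x)  # B = 0 leaves the layer as pretrained
adapted = lora_forward(W, A, [[1.0], [0.0]], x)
```

Initializing B to zero is the usual LoRA convention, so training starts from exactly the pretrained behavior and drifts only as far as the low-rank update allows.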

[CV-78] AdvAD: Exploring Non-Parametric Diffusion for Imperceptible Adversarial Attacks NEURIPS NEURIPS2024

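[Quick read]: This paper addresses imperceptible adversarial attacks, which fool deep neural networks (DNNs) by adding perturbations to the input that humans can hardly perceive. Existing methods usually improve imperceptibility by combining common attack paradigms with specially designed perception-based losses or the capabilities of generative models. This paper instead proposes AdvAD (Adversarial Attacks in Diffusion), a novel modeling framework whose key idea is to treat attacking as a non-parametric diffusion process rather than relying on the denoising or generation abilities of regular diffusion models. At each step, AdvAD uses only the attacked model itself to craft very subtle yet effective adversarial guidance, gradually steering the diffusion process from the original image to the desired imperceptible adversarial example without any additional network. Grounded in the theory of this non-parametric diffusion process, it achieves high attack success and imperceptibility with inherently lower overall perturbation strength. An enhanced version, AdvAD-X, is also proposed to evaluate the limit of the framework under an ideal scenario. Experiments show that AdvAD outperforms state-of-the-art imperceptible attacks in both attack efficacy and imperceptibility.

Link: https://arxiv.org/abs/2503.09124
Authors: Jin Li, Ziqiang He, Anwei Luo, Jian-Fang Hu, Z. Jane Wang, Xiangui Kang
Affiliations: Guangdong Key Lab of Information Security, School of Computer Science and Engineering, Sun Yat-Sen University; Electrical and Computer Engineering Dept, University of British Columbia
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by NeurIPS 2024. Please cite this paper using the following format: J. Li, Z. He, A. Luo, J. Hu, Z. Wang, X. Kang*, “AdvAD: Exploring Non-Parametric Diffusion for Imperceptible Adversarial Attacks”, the 38th Annual Conference on Neural Information Processing Systems (NeurIPS), Vancouver, Canada, Dec 9-15, 2024. Code: this https URL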

Abstract:Imperceptible adversarial attacks aim to fool DNNs by adding imperceptible perturbation to the input data. Previous methods typically improve the imperceptibility of attacks by integrating common attack paradigms with specifically designed perception-based losses or the capabilities of generative models. In this paper, we propose Adversarial Attacks in Diffusion (AdvAD), a novel modeling framework distinct from existing attack paradigms. AdvAD innovatively conceptualizes attacking as a non-parametric diffusion process by theoretically exploring a basic modeling approach rather than using the denoising or generation abilities of regular diffusion models requiring neural networks. At each step, much subtler yet effective adversarial guidance is crafted using only the attacked model without any additional network, which gradually leads the end of diffusion process from the original image to a desired imperceptible adversarial example. Grounded in a solid theoretical foundation of the proposed non-parametric diffusion process, AdvAD achieves high attack efficacy and imperceptibility with intrinsically lower overall perturbation strength. Additionally, an enhanced version AdvAD-X is proposed to evaluate the extreme of our novel framework under an ideal scenario. Extensive experiments demonstrate the effectiveness of the proposed AdvAD and AdvAD-X. Compared with state-of-the-art imperceptible attacks, AdvAD achieves an average of 99.9% (+17.3%) ASR with 1.34 (-0.97) l_2 distance, 49.74 (+4.76) PSNR and 0.9971 (+0.0043) SSIM against four prevalent DNNs with three different architectures on the ImageNet-compatible dataset. Code is available at this https URL.
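
The abstract reports imperceptibility with the l_2 distance and PSNR. For readers unfamiliar with these metrics, the sketch below computes both for two flattened images using the standard definitions. The pixel range (`max_val=1.0`) and the flattened-list representation are assumptions for the example; the paper's exact normalization is not given in the abstract.

```python
import math


def l2_distance(x, y):
    """Euclidean norm of the perturbation between two flattened images."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def psnr(x, y, max_val=1.0):
    """Peak signal-to-noise ratio in dB; higher means a less visible change."""
    mse = sum((a - b) ** 2 for a, b in zip(x, y)) / len(x)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_val ** 2 / mse)


clean = [0.0, 0.0]
perturbed = [0.1, 0.1]  # hypothetical adversarial example
distance = l2_distance(clean, perturbed)
quality = psnr(clean, perturbed)
```

A smaller l_2 distance and a larger PSNR both indicate a weaker, less perceptible perturbation, which is the direction of the improvements quoted in the abstract.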

[CV-79] Training Data Provenance Verification: Did Your Model Use Synthetic Data from My Generative Model for Training? CVPR2025

[Quick read]: This paper addresses the important and unresolved problem of verifying the provenance of training data with respect to generative text-to-image models. It proposes Training data Provenance Verification (TrainProVe), which rests on the principle of the generalization error bound: for two models with the same task, the smaller the distance between their training data distributions, the closer their generalization ability. Validated on four text-to-image models, TrainProVe achieves a verification accuracy above 99%, surpassing existing methods.

Link: https://arxiv.org/abs/2503.09122
Authors: Yuechen Xie, Jie Song, Huiqiong Wang, Mingli Song
Affiliations: Zhejiang University; Ningbo Innovation Center, Zhejiang University; State Key Laboratory of Blockchain and Security, Zhejiang University; Hangzhou High-Tech Zone (Binjiang) Institute of Blockchain and Data Security
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by CVPR 2025

Abstract:High-quality open-source text-to-image models have lowered the threshold for obtaining photorealistic images significantly, but also face potential risks of misuse. Specifically, suspects may use synthetic data generated by these generative models to train models for specific tasks without permission, especially when lacking real data resources. Protecting these generative models is crucial for the well-being of their owners. In this work, we propose the first method to address this important yet unresolved issue, called Training data Provenance Verification (TrainProVe). The rationale behind TrainProVe is grounded in the principle of generalization error bound, which suggests that, for two models with the same task, if the distance between their training data distributions is smaller, their generalization ability will be closer. We validate the efficacy of TrainProVe across four text-to-image models (Stable Diffusion v1.4, latent consistency model, PixArt-α, and Stable Cascade). The results show that TrainProVe achieves a verification accuracy of over 99% in determining the provenance of suspicious model training data, surpassing all previous methods. Code is available at this https URL.

[CV-80] Freeze and Cluster: A Simple Baseline for Rehearsal-Free Continual Category Discovery

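[Quick read]: This paper addresses Rehearsal-Free Continual Category Discovery (RF-CCD): continuously identifying novel classes by leveraging knowledge from labeled data, without using stored historical data. Existing methods typically train from scratch, overlooking the potential of base models, and rely on data storage to prevent forgetting. Moreover, because RF-CCD combines continual learning and novel class discovery, previous approaches have struggled to integrate advanced techniques from both fields, which makes comparisons less convincing and fails to reveal the challenges specific to RF-CCD.

To address this, the paper first integrates recent advances from both fields and conducts extensive experiments and analyses. It finds that this integration reaches state-of-the-art performance and further shows that, when a pre-trained model is available, introducing unlabeled data may degrade the representation rather than improve it. The paper therefore proposes a simple but highly effective baseline: first estimate the number of novel classes using prior knowledge of the known categories; then obtain representations from a model trained specifically on the base classes, generate high-quality pseudo-labels with k-means clustering, and train only the classifier layer. Extensive experiments on multiple benchmarks (such as Stanford Cars, CUB, iNat, and Tiny-ImageNet) validate the conclusions and the method. The results demonstrate the advantages of the proposed baseline for RF-CCD and lay a foundation for future research.

Link: https://arxiv.org/abs/2503.09106
Authors: Chuyu Zhang, Xueyang Yu, Peiyan Gu, Xuming He
Affiliations: ShanghaiTech University, China; Shanghai Engineering Research Center of Intelligent Vision and Imaging, China
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Under review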

Abstract:This paper addresses the problem of Rehearsal-Free Continual Category Discovery (RF-CCD), which focuses on continuously identifying novel classes by leveraging knowledge from labeled data. Existing methods typically train from scratch, overlooking the potential of base models, and often resort to data storage to prevent forgetting. Moreover, because RF-CCD encompasses both continual learning and novel class discovery, previous approaches have struggled to effectively integrate advanced techniques from these fields, resulting in less convincing comparisons and failing to reveal the unique challenges posed by RF-CCD. To address these challenges, we lead the way in integrating advancements from both domains and conducting extensive experiments and analyses. Our findings demonstrate that this integration can achieve state-of-the-art results, leading to the conclusion that in the presence of pre-trained models, the representation does not improve and may even degrade with the introduction of unlabeled data. To mitigate representation degradation, we propose a straightforward yet highly effective baseline method. This method first utilizes prior knowledge of known categories to estimate the number of novel classes. It then acquires representations using a model specifically trained on the base classes, generates high-quality pseudo-labels through k-means clustering, and trains only the classifier layer. We validate our conclusions and methods by conducting extensive experiments across multiple benchmarks, including the Stanford Cars, CUB, iNat, and Tiny-ImageNet datasets. The results clearly illustrate our findings, demonstrate the effectiveness of our baseline, and pave the way for future advancements in RF-CCD.
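
The pseudo-labelling step of the baseline (features from a frozen model, then k-means) can be sketched in a few lines. The example below assumes the frozen backbone has already produced feature vectors, here hard-coded toy 2-D points, and that the number of clusters `k` is known. The farthest-point initialization is a choice made to keep the sketch deterministic, not something the abstract specifies, and the subsequent classifier-only training is not shown.

```python
import math


def kmeans(features, k, iters=20):
    """Plain Lloyd's k-means; returns (pseudo_labels, centroids)."""
    # Farthest-point initialization keeps this sketch deterministic.
    centroids = [list(features[0])]
    while len(centroids) < k:
        far = max(features, key=lambda f: min(math.dist(f, c) for c in centroids))
        centroids.append(list(far))
    labels = [0] * len(features)
    for _ in range(iters):
        # Assignment step: each feature takes the label of its nearest centroid.
        labels = [min(range(k), key=lambda j: math.dist(f, centroids[j]))
                  for f in features]
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [f for f, lab in zip(features, labels) if lab == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Hypothetical features from a frozen backbone for four unlabeled samples.
frozen_features = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0)]
pseudo_labels, centers = kmeans(frozen_features, k=2)
```

Because the backbone is never updated, the features cannot degrade when unlabeled data arrives; only the classifier head would be fit to the pseudo-labels.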

[CV-81] The Shape of Attraction in UMAP: Exploring the Embedding Forces in Dimensionality Reduction

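[Quick read]: This paper analyzes the attractive and repulsive forces in Uniform Manifold Approximation and Projection (UMAP) to show how they shape cluster formation and visualization, and proposes an improvement. The core of the work is a close study of both forces: repulsion emphasizes differences between data points and controls cluster boundaries and inter-cluster distance, whereas attraction is subtler, since in the low-dimensional mapping it can manifest as either attraction or repulsion. This explains why learning rate annealing is needed and motivates treating the attractive and repulsive terms differently. By modifying attraction, the paper also improves the consistency of cluster formation under random initialization. Overall, the analysis makes UMAP and similar methods more interpretable, robust, and accurate.

Link: https://arxiv.org/abs/2503.09101
Authors: Mohammad Tariqul Islam, Jason W. Fleischer
Affiliations: Department of Electrical and Computer Engineering, Princeton University; Media Lab, Massachusetts Institute of Technology
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Notes: 9 pages + appendix

Click to view abstract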

Abstract:Uniform manifold approximation and projection (UMAP) is among the most popular neighbor embedding methods. The method relies on attractive and repulsive forces among high-dimensional data points to obtain a low-dimensional embedding. In this paper, we analyze the forces to reveal their effects on cluster formations and visualization. Repulsion emphasizes differences, controlling cluster boundaries and inter-cluster distance. Attraction is more subtle, as attractive tension between points can manifest simultaneously as attraction and repulsion in the lower-dimensional mapping. This explains the need for learning rate annealing and motivates the different treatments between attractive and repulsive terms. Moreover, by modifying attraction, we improve the consistency of cluster formation under random initialization. Overall, our analysis makes UMAP and similar embedding methods more interpretable, more robust, and more accurate.
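
For orientation, the attractive and repulsive forces the abstract refers to can be written down for the standard UMAP objective. The sketch below assumes the usual low-dimensional similarity 1 / (1 + a·d^(2b)) and the per-pair gradient coefficients commonly used in UMAP implementations; the values of `a` and `b` are approximate fits for the default `min_dist=0.1`, and `eps` is a small stabiliser. It shows the baseline forces only, not the modified attraction the paper proposes.

```python
def attractive_coeff(dist_sq, a=1.577, b=0.895):
    """Gradient coefficient for a neighbouring pair; negative values pull points together."""
    return -2.0 * a * b * dist_sq ** (b - 1.0) / (1.0 + a * dist_sq ** b)


def repulsive_coeff(dist_sq, a=1.577, b=0.895, eps=1e-3):
    """Gradient coefficient for a negative sample; positive values push points apart."""
    return 2.0 * b / ((eps + dist_sq) * (1.0 + a * dist_sq ** b))


# Compare both coefficients at a few squared embedding distances.
for dist_sq in (0.01, 1.0, 25.0):
    print(dist_sq, attractive_coeff(dist_sq), repulsive_coeff(dist_sq))
```

Each coefficient multiplies the difference vector between two embedded points, so repulsion is strongest at short range and fades quickly with distance.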

[CV-82] Tacchi 2.0: A Low Computational Cost and Comprehensive Dynamic Contact Simulator for Vision-based Tactile Sensors

[Quick read]: This paper addresses the high cost of using vision-based tactile sensors in contact-rich robotic tasks, which stems mainly from their limited durability driving up data acquisition costs. Data-driven methods for generating tactile data lack robustness, while Finite Element Method (FEM) approaches carry significant computational overhead. The key contribution is integrating a pinhole camera model into Tacchi, a low-computational-cost vision-based tactile simulator that uses the Material Point Method (MPM), which enables efficient simulation of marker motion images. The upgraded simulator, Tacchi 2.0, can generate tactile images, marker motion images, and joint images under different motion states such as pressing, slipping, and rotating. Experiments support the reliability of the method and its robustness across a variety of vision-based tactile sensors.

Link: https://arxiv.org/abs/2503.09100
Authors: Yuhao Sun, Shixin Zhang, Wenzhuang Li, Jie Zhao, Jianhua Shan, Zirong Shen, Zixi Chen, Fuchun Sun, Di Guo, Bin Fang
Affiliations: School of Artificial Intelligence, Beijing University of Posts and Telecommunications; School of Engineering and Technology, China University of Geosciences (Beijing); School of International, Beijing University of Posts and Telecommunications; School of Mechanical Engineering, Anhui University of Technology; Zhili College, Tsinghua University; Biorobotics Institute and the Department of Excellence in Robotics and AI, Scuola Superiore Sant’Anna; Institute for Artificial Intelligence, Department of Computer Science and Technology, Beijing National Research Center for Information Science and Technology, Tsinghua University
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Notes:

Click to view abstract

Abstract: With the development of robotics technology, some tactile sensors, such as vision-based sensors, have been applied to contact-rich robotics tasks. However, the durability of vision-based tactile sensors significantly increases the cost of tactile information acquisition. Utilizing simulation to generate tactile data has emerged as a reliable approach to address this issue. While data-driven methods for tactile data generation lack robustness, finite element methods (FEM) based approaches require significant computational costs. To address these issues, we integrated a pinhole camera model into the low computational cost vision-based tactile simulator Tacchi, which uses the Material Point Method (MPM) as the simulation method, completing the simulation of marker motion images. We upgraded Tacchi and introduced Tacchi 2.0. This simulator can simulate tactile images, marker motion images, and joint images under different motion states like pressing, slipping, and rotating. Experimental results demonstrate the reliability of our method and its robustness across various vision-based tactile sensors.
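
The pinhole camera model mentioned in the abstract maps 3D points in the camera frame to pixel coordinates. The sketch below shows only that textbook projection applied to a single marker before and after a small displacement. The intrinsics (`fx`, `fy`, `cx`, `cy`) and marker positions are hypothetical values, and nothing here reflects how Tacchi 2.0 itself is implemented.

```python
def project_pinhole(point, fx, fy, cx, cy):
    """Project a 3D point (camera frame, metres) to pixel coordinates (u, v)."""
    x, y, z = point
    if z <= 0:
        raise ValueError("point must lie in front of the camera (z > 0)")
    return (fx * x / z + cx, fy * y / z + cy)


# Hypothetical intrinsics for a 640x480 image.
intrinsics = dict(fx=500.0, fy=500.0, cx=320.0, cy=240.0)

# A marker on the gel surface before and after a small sideways slip.
marker_before = (0.000, 0.000, 0.020)
marker_after = (0.001, 0.000, 0.020)

u0, v0 = project_pinhole(marker_before, **intrinsics)
u1, v1 = project_pinhole(marker_after, **intrinsics)
print("pixel displacement:", (u1 - u0, v1 - v0))
```

Repeating this projection for every simulated marker position at each time step would yield the kind of 2D marker motion field the abstract refers to.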

[CV-83] Causal-Ex: Causal Graph-based Micro and Macro Expression Spotting

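[Quick read]: This paper targets a problem in expression spotting and emotion recognition: biases unintentionally introduced by training datasets cause models to wrongly associate facial Action Units (AUs) with particular emotion classes. Such biases introduce spurious correlations that hurt classification accuracy. The proposed remedy replaces conventional AU adjacency information with a representation based on causal graphs. By building a causal graph over facial action units, the method identifies and removes the spurious connections and keeps only unbiased information for classification. The model, named Causal-Ex, uses a fast causal inference algorithm to construct the AU causal graph and selects causally relevant AUs. It reports F1-scores of 0.388 on CAS(ME)^2 and 0.3701 on SAMM-Long Video, an improvement over prior state-of-the-art methods.

Link: https://arxiv.org/abs/2503.09098
Authors: Pei-Sze Tan, Sailaja Rajanala, Arghya Pal, Raphaël C.-W. Phan, Huey-Fang Ong
Affiliations: Monash University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes: 7 pages, 6 figures. The paper is under consideration at Pattern Recognition Letters

Click to view abstract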

Abstract:Detecting concealed emotions within apparently normal expressions is crucial for identifying potential mental health issues and facilitating timely support and intervention. The task of spotting macro and micro-expressions involves predicting the emotional timeline within a video, accomplished by identifying the onset, apex, and offset frames of the displayed emotions. Utilizing foundational facial muscle movement cues, known as facial action units, boosts the accuracy. However, an overlooked challenge from previous research lies in the inadvertent integration of biases into the training model. These biases arising from datasets can spuriously link certain action unit movements to particular emotion classes. We tackle this issue by novel replacement of action unit adjacency information with the action unit causal graphs. This approach aims to identify and eliminate undesired spurious connections, retaining only unbiased information for classification. Our model, named Causal-Ex (Causal-based Expression spotting), employs a rapid causal inference algorithm to construct a causal graph of facial action units. This enables us to select causally relevant facial action units. Our work demonstrates improvement in overall F1-scores compared to state-of-the-art approaches with 0.388 on CAS(ME)^2 and 0.3701 on SAMM-Long Video datasets.

[CV-84] C2 ATTACK: Towards Representation Backdoor on CLIP via Concept Confusion

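[Quick read]: This paper addresses the ease with which traditional backdoor attacks are detected and blocked by existing defenses. Traditional methods insert explicit triggers (such as external patches or perturbations) into the input data, which current defenses can often identify. The paper revisits backdoor attacks from the perspective of the reasoning process in deep learning systems and, drawing on interpretable AI, treats backdoor activation as the manipulation of learned concepts in the model's latent representations. On this basis the authors propose a new backdoor attack framework, the Concept Confusion Attack (C^2 ATTACK). Its key idea is to use concepts internal to the model's reasoning as "triggers", with no explicit external modification, by directly manipulating specific concepts in latent space. According to the paper, this makes the attack stealthier and harder for existing defenses to detect, while keeping a high attack success rate and robustness against advanced defenses.

Link: https://arxiv.org/abs/2503.09095
Authors: Lijie Hu, Junchi Liao, Weimin Lyu, Shaopeng Fu, Tianhao Huang, Shu Yang, Guimin Hu, Di Wang
Affiliations: Provable Responsible AI and Data Analytics (PRADA) Lab; King Abdullah University of Science and Technology; University of Electronic Science and Technology of China; Stony Brook University; University of Virginia; University of Copenhagen
Categories: Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)
Notes:

Click to view abstract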

Abstract:Backdoor attacks pose a significant threat to deep learning models, enabling adversaries to embed hidden triggers that manipulate the behavior of the model during inference. Traditional backdoor attacks typically rely on inserting explicit triggers (e.g., external patches, or perturbations) into input data, but they often struggle to evade existing defense mechanisms. To address this limitation, we investigate backdoor attacks through the lens of the reasoning process in deep learning systems, drawing insights from interpretable AI. We conceptualize backdoor activation as the manipulation of learned concepts within the model’s latent representations. Thus, existing attacks can be seen as implicit manipulations of these activated concepts during inference. This raises interesting questions: why not manipulate the concepts explicitly? This idea leads to our novel backdoor attack framework, Concept Confusion Attack (C^2 ATTACK), which leverages internal concepts in the model’s reasoning as “triggers” without introducing explicit external modifications. By avoiding the use of real triggers and directly activating or deactivating specific concepts in latent spaces, our approach enhances stealth, making detection by existing defenses significantly harder. Using CLIP as a case study, experimental results demonstrate the effectiveness of C^2 ATTACK, achieving high attack success rates while maintaining robustness against advanced defenses.

[CV-85] Multi-Modal Foundation Models for Computational Pathology: A Survey

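[Quick read]: This survey reviews the state of multi-modal foundation models in computational pathology (CPath), motivated by the limited scalability and generality of traditional uni-modal models for analyzing histopathological images. It examines how multi-modal foundation models integrate heterogeneous data sources (such as textual reports, structured domain knowledge, and molecular profiles), focusing on models built on hematoxylin and eosin (HE) stained whole slide images (WSIs) and their tile-level representations. The survey groups 32 state-of-the-art multi-modal foundation models into three paradigms (vision-language, vision-knowledge graph, and vision-gene expression) and further divides vision-language models into non-LLM-based and LLM-based approaches. It also analyzes 28 multi-modal datasets designed for pathology, presents a taxonomy of downstream tasks, discusses training and evaluation strategies, and identifies open challenges and future research directions. The central message is that foundation models which effectively integrate multiple modalities can support more accurate and more general histopathological image analysis.

Link: https://arxiv.org/abs/2503.09091
Authors: Dong Li, Guihong Wan, Xintao Wu, Xinyu Wu, Xiaohui Chen, Yi He, Christine G. Lian, Peter K. Sorger, Yevgeniy R. Semenov, Chen Zhao
Affiliations: Baylor University; Harvard Medical School; University of Arkansas; William & Mary
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Notes:

Click to view abstract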

Abstract:Foundation models have emerged as a powerful paradigm in computational pathology (CPath), enabling scalable and generalizable analysis of histopathological images. While early developments centered on uni-modal models trained solely on visual data, recent advances have highlighted the promise of multi-modal foundation models that integrate heterogeneous data sources such as textual reports, structured domain knowledge, and molecular profiles. In this survey, we provide a comprehensive and up-to-date review of multi-modal foundation models in CPath, with a particular focus on models built upon hematoxylin and eosin (HE) stained whole slide images (WSIs) and tile-level representations. We categorize 32 state-of-the-art multi-modal foundation models into three major paradigms: vision-language, vision-knowledge graph, and vision-gene expression. We further divide vision-language models into non-LLM-based and LLM-based approaches. Additionally, we analyze 28 available multi-modal datasets tailored for pathology, grouped into image-text pairs, instruction datasets, and image-other modality pairs. Our survey also presents a taxonomy of downstream tasks, highlights training and evaluation strategies, and identifies key challenges and future directions. We aim for this survey to serve as a valuable resource for researchers and practitioners working at the intersection of pathology and AI.

[CV-86] Everything Can Be Described in Words: A Simple Unified Multi-Modal Framework with Semantic and Temporal Alignment

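[Quick read]: This paper tackles Long Video Question Answering (LVQA), which is hard because it requires temporal reasoning and large-scale multimodal data processing; existing methods struggle to retrieve cross-modal information from long videos, especially when the relevant details are sparsely distributed. The proposed solution is UMaT (Unified Multi-modal as Text), a method built on the retrieval-augmented generation (RAG) framework. UMaT converts visual and auditory data into a unified textual representation that preserves semantic and temporal alignment, organizes it into temporally aligned segments, and applies adaptive filtering to remove redundancy while retaining salient details, so that extremely long videos can be processed efficiently with cross-modal coherence. Retrieval through a vector database then enables precise lookup, improving multimodal integration, long-form video understanding, and sparse information retrieval.

Link: https://arxiv.org/abs/2503.09081
Authors: Xiaowei Bi, Zheyuan Xu
Affiliations: Northwestern University; IEEE; University of Washington
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Notes:

Click to view abstract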

Abstract:Long Video Question Answering (LVQA) is challenging due to the need for temporal reasoning and large-scale multimodal data processing. Existing methods struggle with retrieving cross-modal information from long videos, especially when relevant details are sparsely distributed. We introduce UMaT (Unified Multi-modal as Text), a retrieval-augmented generation (RAG) framework that efficiently processes extremely long videos while maintaining cross-modal coherence. UMaT converts visual and auditory data into a unified textual representation, ensuring semantic and temporal alignment. Short video clips are analyzed using a vision-language model, while automatic speech recognition (ASR) transcribes dialogue. These text-based representations are structured into temporally aligned segments, with adaptive filtering to remove redundancy and retain salient details. The processed data is embedded into a vector database, enabling precise retrieval of dispersed yet relevant content. Experiments on a benchmark LVQA dataset show that UMaT outperforms existing methods in multimodal integration, long-form video understanding, and sparse information retrieval. Its scalability and interpretability allow it to process videos over an hour long while maintaining semantic and temporal coherence. These findings underscore the importance of structured retrieval and multimodal synchronization for advancing LVQA and long-form AI systems.
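
The "everything as text" idea in the abstract (captions and ASR transcripts merged into temporally aligned segments, then retrieved by similarity) can be illustrated with a toy example. This is a rough sketch under strong simplifications: a bag-of-words cosine similarity stands in for the embedding model and vector database, the adaptive filtering step is omitted, and the function names and sample data are hypothetical.

```python
import math
from collections import Counter


def build_segments(captions, transcripts):
    """Merge (start, end, text) captions and ASR lines that share a time window."""
    windows = {}
    for start, end, text in captions + transcripts:
        windows.setdefault((start, end), []).append(text)
    return [(window, " ".join(texts)) for window, texts in sorted(windows.items())]


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(count * b[word] for word, count in a.items())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query, segments, top_k=1):
    """Return the top_k segments most similar to the query text."""
    q = Counter(query.lower().split())
    ranked = sorted(segments, key=lambda s: cosine(q, Counter(s[1].lower().split())), reverse=True)
    return ranked[:top_k]


# Hypothetical outputs of a captioning model and an ASR system, in seconds.
captions = [(0, 10, "a man walks into a kitchen"), (10, 20, "a red car drives past")]
transcripts = [(0, 10, "good morning"), (10, 20, "watch out for the car")]

segments = build_segments(captions, transcripts)
print(retrieve("red car", segments))
```

Because each segment keeps its time window, a retrieved hit can be mapped straight back to a position in the video.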

[CV-87] Theoretical Guarantees for High Order Trajectory Refinement in Generative Flows

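[Quick read]: This paper addresses the unexplored question of whether higher-order flow matching, viewed as a distribution estimator, enjoys worst-case optimality guarantees. It proves that higher-order flow matching preserves the same worst-case optimality as standard flow matching, and derives upper bounds on the estimation error of second-order flow matching, showing that the convergence rates depend on the smoothness of the target distribution (quantified via Besov spaces) and on key parameters of the ordinary differential equation (ODE) dynamics. The analysis uses neural network approximations with carefully controlled depth, width, and sparsity to bound acceleration errors over both small and large time intervals, and unifies these results into a general worst-case optimal bound for all time steps.

Link: https://arxiv.org/abs/2503.09069
Authors: Chengyue Gong, Xiaoyu Li, Yingyu Liang, Jiangxuan Long, Zhenmei Shi, Zhao Song, Yu Tian
Affiliations: The University of Texas at Austin; Stevens Institute of Technology; The University of Hong Kong; University of Wisconsin-Madison; South China University of Technology; The Simons Institute for the Theory of Computing at the UC, Berkeley; Independent Researcher
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
Notes: arXiv admin note: text overlap with arXiv:2410.11261

Click to view abstract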

Abstract:Flow matching has emerged as a powerful framework for generative modeling, offering computational advantages over diffusion models by leveraging deterministic Ordinary Differential Equations (ODEs) instead of stochastic dynamics. While prior work established the worst case optimality of standard flow matching under Wasserstein distances, the theoretical guarantees for higher-order flow matching - which incorporates acceleration terms to refine sample trajectories - remain unexplored. In this paper, we bridge this gap by proving that higher-order flow matching preserves worst case optimality as a distribution estimator. We derive upper bounds on the estimation error for second-order flow matching, demonstrating that the convergence rates depend polynomially on the smoothness of the target distribution (quantified via Besov spaces) and key parameters of the ODE dynamics. Our analysis employs neural network approximations with carefully controlled depth, width, and sparsity to bound acceleration errors across both small and large time intervals, ultimately unifying these results into a general worst case optimal bound for all time steps.
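
To see why an acceleration term can refine a sample trajectory, as the abstract describes for higher-order flow matching, compare a first-order Euler step with a second-order Taylor step on a toy ODE. This is only an illustration under stated assumptions: the velocity and acceleration fields are hand-written stand-ins for learned networks, and the ODE dx/dt = x is chosen because its exact solution at t = 1 is known to be e.

```python
import math


def integrate(x0, velocity, acceleration=None, steps=10):
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with a fixed step size."""
    h = 1.0 / steps
    x, t = x0, 0.0
    for _ in range(steps):
        dx = h * velocity(x, t)
        if acceleration is not None:
            # Second-order Taylor correction using the acceleration field.
            dx += 0.5 * h * h * acceleration(x, t)
        x, t = x + dx, t + h
    return x


def toy_velocity(x, t):
    return x  # hypothetical field: dx/dt = x


def toy_acceleration(x, t):
    return x  # for dx/dt = x, the second derivative d2x/dt2 is also x


first_order = integrate(1.0, toy_velocity)
second_order = integrate(1.0, toy_velocity, toy_acceleration)
print("first-order error: ", abs(first_order - math.e))
print("second-order error:", abs(second_order - math.e))
```

With the same number of steps, the second-order update lands noticeably closer to the exact endpoint.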

[CV-88] Implicit Contrastive Representation Learning with Guided Stop-gradient NEURIPS2023

[Quick read]: This paper addresses the tendency of Siamese networks, when learning transformation-invariance in self-supervised representation learning, to collapse into a degenerate solution. It also notes that contrastive learning algorithms based on negative sampling are not robust to a reduction in the number of negative samples. The proposed approach exploits the asymmetric architecture of source and target encoders to implicitly incorporate the idea of contrastive learning. As a concrete implementation, the paper presents the guided stop-gradient method and applies it to the benchmark algorithms SimSiam and BYOL, showing that it stabilizes training, improves performance, works well with small batch sizes, and avoids collapse even without a predictor.

Link: https://arxiv.org/abs/2503.09058
Authors: Byeongchan Lee, Sehyun Lee
Affiliations: Gauss Labs; KAIST
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Notes: NeurIPS 2023

Click to view abstract

Abstract:In self-supervised representation learning, Siamese networks are a natural architecture for learning transformation-invariance by bringing representations of positive pairs closer together. But it is prone to collapse into a degenerate solution. To address the issue, in contrastive learning, a contrastive loss is used to prevent collapse by moving representations of negative pairs away from each other. But it is known that algorithms with negative sampling are not robust to a reduction in the number of negative samples. So, on the other hand, there are algorithms that do not use negative pairs. Many positive-only algorithms adopt asymmetric network architecture consisting of source and target encoders as a key factor in coping with collapse. By exploiting the asymmetric architecture, we introduce a methodology to implicitly incorporate the idea of contrastive learning. As its implementation, we present a novel method guided stop-gradient. We apply our method to benchmark algorithms SimSiam and BYOL and show that our method stabilizes training and boosts performance. We also show that the algorithms with our method work well with small batch sizes and do not collapse even when there is no predictor. The code is available at this https URL.

[CV-89] Discovering Influential Neuron Path in Vision Transformers ICLR2025

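[Quick read]: This paper addresses the challenges and risks that the opacity of vision Transformer models poses for practical applications, and in particular the gap left by existing methods in analyzing the holistic path of information flow across layers and in identifying the key neuron paths that drive model inference. It makes two main contributions. First, a joint influence measure that assesses the contribution of a set of neurons to the model outcome. Second, a layer-progressive neuron locating approach that efficiently selects the most influential neuron at each layer in order to discover the crucial neuron path from input to output. Experiments indicate that the method outperforms existing baselines at finding the most influential neuron path and reveal that vision Transformers exhibit specific inner working mechanisms when processing images of the same category. Further analysis shows that these neuron paths matter for image classification and preserve the model's capability on downstream tasks, which may inform practical applications such as model pruning.

Link: https://arxiv.org/abs/2503.09046
Authors: Yifan Wang, Yifei Liu, Yingdong Shi, Changming Li, Anqi Pang, Sibei Yang, Jingyi Yu, Kan Ren
Affiliations: ShanghaiTech University; Tencent PCG
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Notes: Accepted in ICLR 2025

Click to view abstract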

Abstract: Vision Transformer models exhibit immense power yet remain opaque to human understanding, posing challenges and risks for practical applications. While prior research has attempted to demystify these models through input attribution and neuron role analysis, there’s been a notable gap in considering layer-level information and the holistic path of information flow across layers. In this paper, we investigate the significance of influential neuron paths within vision Transformers, where a neuron path is a path of neurons from the model input to output that impacts the model inference most significantly. We first propose a joint influence measure to assess the contribution of a set of neurons to the model outcome. We further provide a layer-progressive neuron locating approach that efficiently selects the most influential neuron at each layer, aiming to discover the crucial neuron path from input to output within the target model. Our experiments demonstrate the superiority of our method over existing baseline solutions in finding the most influential neuron path along which the information flows. Additionally, the neuron paths illustrate that vision Transformers exhibit some specific inner working mechanism for processing the visual information within the same image category. We further analyze the key effects of these neurons on the image classification task, showcasing that the found neuron paths preserve the model capability on downstream tasks, which may also shed some light on real-world applications like model pruning. The project website including implementation code is available at this https URL.

[CV-90] Motion Blender Gaussian Splatting for Dynamic Reconstruction

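[Quick read]: This paper addresses the limited motion controllability of existing high-fidelity dynamic scene reconstruction methods. Existing Gaussian splatting methods rely mainly on implicit motion representations (such as encoding motion into neural networks or per-Gaussian parameters), which makes the reconstructed motion hard to manipulate further. This lack of explicit controllability restricts them to replaying recorded motion and hinders wider application. The paper proposes Motion Blender Gaussian Splatting (MB-GS), a framework whose key idea is to use a motion graph as an explicit, sparse motion representation. The motion of graph links is propagated to individual Gaussians via dual quaternion skinning, learnable weight painting functions determine the influence of each link, and the motion graphs and 3D Gaussians are jointly optimized from input videos via differentiable rendering. Experiments show that MB-GS achieves state-of-the-art performance on the iPhone dataset and is competitive on HyperNeRF, and demonstrate its potential for generating novel object motions and robot demonstrations.

Link: https://arxiv.org/abs/2503.09040
Authors: Xinyu Zhang, Haonan Chang, Yuhan Liu, Abdeslam Boularias
Affiliations: Department of Computer Science, Rutgers University
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Notes:

Click to view abstract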

Abstract:Gaussian splatting has emerged as a powerful tool for high-fidelity reconstruction of dynamic scenes. However, existing methods primarily rely on implicit motion representations, such as encoding motions into neural networks or per-Gaussian parameters, which makes it difficult to further manipulate the reconstructed motions. This lack of explicit controllability limits existing methods to replaying recorded motions only, which hinders a wider application. To address this, we propose Motion Blender Gaussian Splatting (MB-GS), a novel framework that uses motion graph as an explicit and sparse motion representation. The motion of graph links is propagated to individual Gaussians via dual quaternion skinning, with learnable weight painting functions determining the influence of each link. The motion graphs and 3D Gaussians are jointly optimized from input videos via differentiable rendering. Experiments show that MB-GS achieves state-of-the-art performance on the iPhone dataset while being competitive on HyperNeRF. Additionally, we demonstrate the application potential of our method in generating novel object motions and robot demonstrations through motion editing. Video demonstrations can be found at this https URL.

[CV-91] Adaptive Temperature Based on Logits Correlation in Knowledge Distillation

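[Quick read]: This paper addresses the lack of clarity about how the temperature parameter in knowledge distillation promotes the transfer of logit information from teacher to student. The key contribution is a new way of computing the temperature that depends only on the maximum logit produced by the teacher model, which reduces computation time relative to state-of-the-art methods. Experiments indicate that the proposed dynamic temperature yields promising results for different student and teacher models on a standard benchmark dataset, and support the earlier argument that distillation conveys the correlation of logits.

Link: https://arxiv.org/abs/2503.09030
Authors: Kazuhiro Matsuyama, Usman Anjum, Satoko Matsuyama, Tetsuo Shoda, Justin Zhan
Affiliations: University of Cincinnati; Cincinnati Children’s Hospital Medical Center
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Notes:

Click to view abstract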

Abstract: Knowledge distillation is a technique for reproducing the performance of a deep learning model in another, smaller model. It uses the outputs of one model to train another model to comparable accuracy. The two models mirror the way information is passed on in human society, with one acting as the “teacher” and the other as the “student”. Softmax makes the logits generated by the two models comparable by converting them into probability distributions. It delivers the logits of a teacher to a student, compressed through a parameter named temperature. Tuning this parameter improves distillation performance. Although this parameter alone mediates the interaction of logits, it is not clear how temperatures promote information transfer. In this paper, we propose a novel approach to calculate the temperature. Our method only refers to the maximum logit generated by a teacher model, which reduces computational time compared with state-of-the-art methods. Our method shows promising results for different student and teacher models on a standard benchmark dataset. Algorithms that use temperature can benefit by plugging in this dynamic approach. Furthermore, the approximation of the distillation process converges to a correlation of logits by both models. This reinforces the previous argument that the distillation conveys the relevance of logits. We report that this approximating algorithm yields a higher temperature compared to the commonly used static values in testing.
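
The role of temperature described in the abstract can be made concrete with the standard temperature-scaled softmax and distillation loss. The sketch below assumes the common formulation (KL divergence between softened teacher and student outputs, scaled by T squared). The abstract does not give the paper's rule for deriving the temperature from the teacher's maximum logit, so the temperature is simply passed in as an argument, and the logits are hypothetical.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities; a higher temperature flattens the output."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(teacher_logits, student_logits, temperature):
    """KL(teacher || student) on softened outputs, scaled by T^2 (a common convention)."""
    p = softmax_with_temperature(teacher_logits, temperature)
    q = softmax_with_temperature(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return temperature ** 2 * kl


teacher = [6.0, 2.0, 1.0]  # hypothetical teacher logits
student = [3.0, 2.5, 1.0]  # hypothetical student logits
for temperature in (1.0, 4.0):
    probs = softmax_with_temperature(teacher, temperature)
    print(temperature, probs, distillation_loss(teacher, student, temperature))
```

A per-sample or dynamic temperature, such as the one the paper proposes, would replace the fixed values in the loop.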

[CV-92] Measure Twice Cut Once: Grasping Video Structures and Event Semantics with LLMs for Video Temporal Localization

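[Quick read]: This paper addresses temporal localization, the task of locating user-queried events through natural language, where existing methods struggle to exploit the strong semantic understanding of video large language models (Video LLMs). The proposed solution is MeCo, a timestamp-free framework that lets video LLMs use their intrinsic semantic capabilities for temporal localization. Rather than outputting boundary timestamps, MeCo partitions a video into holistic event and transition segments through a structural token generation and grounding pipeline derived from video LLMs' understanding of temporal structure. The paper further designs a query-focused captioning task that pushes the LLM to extract fine-grained, event-specific details, bridging the gap between localization and higher-level semantics and improving localization performance. Experiments show that MeCo consistently outperforms boundary-centric methods across diverse temporal localization tasks.

Link: https://arxiv.org/abs/2503.09027
Authors: Zongshang Pang, Mayu Otani, Yuta Nakashima
Affiliations: Osaka University; CyberAgent, Inc.
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes:

Click to view abstract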

Abstract:Localizing user-queried events through natural language is crucial for video understanding models. Recent methods predominantly adapt Video LLMs to generate event boundary timestamps to handle temporal localization tasks, which struggle to leverage LLMs’ powerful semantic understanding. In this work, we introduce MeCo, a novel timestamp-free framework that enables video LLMs to fully harness their intrinsic semantic capabilities for temporal localization tasks. Rather than outputting boundary timestamps, MeCo partitions videos into holistic event and transition segments based on the proposed structural token generation and grounding pipeline, derived from video LLMs’ temporal structure understanding capability. We further propose a query-focused captioning task that compels the LLM to extract fine-grained, event-specific details, bridging the gap between localization and higher-level semantics and enhancing localization performance. Extensive experiments on diverse temporal localization tasks show that MeCo consistently outperforms boundary-centric methods, underscoring the benefits of a semantic-driven approach for temporal localization with video LLMs.

[CV-93] Prompt to Restore Restore to Prompt: Cyclic Prompting for Universal Adverse Weather Removal

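[Quick read]: This paper addresses universal adverse weather removal (UAWR), that is, handling multiple weather-induced image degradations within a unified framework. Existing methods are mainly inspired by prompt learning with pre-trained vision-language models (such as CLIP) and use degradation-aware prompts to support weather-free image restoration, with significant improvements. However, they still fall short in adaptability and generalizability.

The paper proposes a cyclic prompt approach, CyclicPrompt, to improve the effectiveness, adaptability, and generalizability of UAWR. It has two core components: 1) a composite context prompt that integrates weather-related knowledge and context-aware representations into the network to guide restoration. Unlike earlier methods, this prompt combines learnable input-conditional vectors with weather-specific knowledge, improving adaptability across degradation scenarios; 2) an erase-and-paste mechanism that, after the initial guided restoration, substitutes constrained restoration priors for the weather-specific knowledge and introduces high-quality weather-free concepts into the composite prompt to further refine restoration. Together these form a cyclic "Prompt-Restore-Prompt" pipeline that draws on weather-specific knowledge, textual contexts, and reliable textures. Experiments on synthetic and real-world datasets show superior performance.

Link: https://arxiv.org/abs/2503.09013
Authors: Rongxin Liao, Feng Li, Yanyan Wei, Zenglin Shi, Le Zhang, Huihui Bai, Meng Wang
Affiliations: School of Computer Science and Information Engineering, Hefei University of Technology, Hefei, Anhui 230601, China; School of Information and Communication Engineering, University of Electronic Science and Technology of China, Chengdu, Sichuan 611731, China; School of Computer Science and Technology, Beijing Jiaotong University, Beijing 100044, China
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes:

Click to view abstract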

Abstract: Universal adverse weather removal (UAWR) seeks to address various weather degradations within a unified framework. Recent methods are inspired by prompt learning using pre-trained vision-language models (e.g., CLIP), leveraging degradation-aware prompts to facilitate weather-free image restoration, yielding significant improvements. In this work, we propose CyclicPrompt, an innovative cyclic prompt approach designed to enhance the effectiveness, adaptability, and generalizability of UAWR. CyclicPrompt comprises two key components: 1) a composite context prompt that integrates weather-related information and context-aware representations into the network to guide restoration. This prompt differs from previous methods by marrying learnable input-conditional vectors with weather-specific knowledge, thereby improving adaptability across various degradations. 2) The erase-and-paste mechanism, after the initial guided restoration, substitutes weather-specific knowledge with constrained restoration priors, inducing high-quality weather-free concepts into the composite prompt to further fine-tune the restoration process. Therefore, we can form a cyclic “Prompt-Restore-Prompt” pipeline that adeptly harnesses weather-specific knowledge, textual contexts, and reliable textures. Extensive experiments on synthetic and real-world datasets validate the superior performance of CyclicPrompt. The code is available at: this https URL.

[CV-94] Dual-Domain Homogeneous Fusion with Cross-Modal Mamba and Progressive Decoder for 3D Object Detection

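[Quick read]: This paper addresses two main problems of multi-modal feature fusion for 3D object detection in autonomous driving: information loss caused by excessive compression of multi-modal features in bird's-eye-view (BEV) space, and the high computational cost and inefficient query generation of feature fusion in dense voxel spaces. It proposes the Dual-Domain Homogeneous Fusion network (DDHFusion), which exploits the complementary strengths of the BEV and sparse voxel domains while mitigating their respective drawbacks. Specifically, image features are transformed into BEV and sparse voxel spaces using LSS and a semantic-aware feature sampling module, which significantly reduces computational overhead; two networks are designed for BEV and voxel feature fusion, with cross-modal voxel and BEV Mamba blocks to resolve feature misalignment; voxel features are injected into the BEV space to compensate for the detail lost to height compression; a progressive query generation module alleviates false negatives caused by feature compression and small objects; and a progressive decoder aggregates context-rich BEV features and geometry-aware voxel features for more precise confidence prediction and bounding box regression. Experiments show that DDHFusion achieves state-of-the-art performance on the NuScenes dataset.

Link: https://arxiv.org/abs/2503.08992
Authors: Xuzhong Hu, Zaipeng Duan, Pei An, Jun Zhang, Jie Ma
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes: 13 pages, 9 figures

Click to view abstract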

Abstract: Fusing LiDAR point cloud features and image features in a homogeneous BEV space has been widely adopted for 3D object detection in autonomous driving. However, such methods are limited by the excessive compression of multi-modal features. While some works explore feature fusion in dense voxel spaces, they suffer from high computational costs and inefficiencies in query generation. To address these limitations, we propose a Dual-Domain Homogeneous Fusion network (DDHFusion), which leverages the complementary advantages of both BEV and voxel domains while mitigating their respective drawbacks. Specifically, we first transform image features into BEV and sparse voxel spaces using LSS and our proposed semantic-aware feature sampling module, which can significantly reduce computational overhead by filtering unimportant voxels. For feature encoding, we design two networks for BEV and voxel feature fusion, incorporating novel cross-modal voxel and BEV Mamba blocks to resolve feature misalignment and enable efficient yet comprehensive scene perception. The output voxel features are injected into the BEV space to compensate for the loss of 3D details caused by height compression. For feature decoding, a progressive query generation module is implemented in the BEV domain to alleviate false negatives during query selection caused by feature compression and small object sizes. Finally, a progressive decoder can sequentially aggregate not only context-rich BEV features but also geometry-aware voxel features, ensuring more precise confidence prediction and bounding box regression. On the NuScenes dataset, DDHFusion achieves state-of-the-art performance, and further experiments demonstrate its superiority over other homogeneous fusion methods.

[CV-95] Decoupled Doubly Contrastive Learning for Cross Domain Facial Action Unit Detection

[Quick read]: This paper addresses the sensitivity of vision-based facial action unit (AU) detection to variation across domains and the fact that cross-domain AU detection is under-explored. It proposes a decoupled doubly contrastive adaptation (D^2CA) approach, whose key idea is to decompose latent representations into AU-relevant and AU-irrelevant components and to promote adaptation only within the AU-relevant subspace. D^2CA learns to disentangle AU and domain factors by assessing the quality of faces synthesized in cross-domain scenarios when either AU or domain attributes are modified. To further strengthen feature decoupling, particularly when AU data diversity is limited, it uses a doubly contrastive learning mechanism, with image-level and feature-level contrastive learning, to ensure the quality of synthesized faces and mitigate feature ambiguities. This yields an automatic separation of AU-relevant and domain-relevant factors and supports intuitive, scale-specific control of cross-domain facial image synthesis. Experiments show that D^2CA decouples AU and domain factors and improves cross-domain AU detection, with an average F1 score improvement of 6%-14% over existing methods.

Link: https://arxiv.org/abs/2503.08977
Authors: Yong Li, Menglin Liu, Zhen Cui, Yi Ding, Yuan Zong, Wenming Zheng, Shiguang Shan, Cuntai Guan
Affiliations: School of Computer Science and Engineering and the Key Laboratory of New Generation Artificial Intelligence Technology and Its Interdisciplinary Applications, Southeast University; Key Laboratory of Intelligent Perception and Systems for High-Dimensional Information, Ministry of Education, School of Computer Science and Engineering, Nanjing University of Science and Technology; School of Artificial Intelligence, Beijing Normal University; State Key Laboratory of AI Safety, Institute of Computing Technology, Chinese Academy of Sciences; University of Chinese Academy of Sciences; School of Computer Science and Engineering, Nanyang Technological University; Key Laboratory of Child Development and Learning Science of Ministry of Education, School of Biological Science and Medical Engineering, Southeast University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes: Accepted by IEEE Transactions on Image Processing 2025. A novel and elegant feature decoupling method for cross-domain facial action unit detection

Click to view abstract

Abstract: Despite the impressive performance of current vision-based facial action unit (AU) detection approaches, they are heavily susceptible to the variations across different domains and the cross-domain AU detection methods are under-explored. In response to this challenge, we propose a decoupled doubly contrastive adaptation (D²CA) approach to learn a purified AU representation that is semantically aligned for the source and target domains. Specifically, we decompose latent representations into AU-relevant and AU-irrelevant components, with the objective of exclusively facilitating adaptation within the AU-relevant subspace. To achieve the feature decoupling, D²CA is trained to disentangle AU and domain factors by assessing the quality of synthesized faces in cross-domain scenarios when either AU or domain attributes are modified. To further strengthen feature decoupling, particularly in scenarios with limited AU data diversity, D²CA employs a doubly contrastive learning mechanism comprising image and feature-level contrastive learning to ensure the quality of synthesized faces and mitigate feature ambiguities. This new framework leads to an automatically learned, dedicated separation of AU-relevant and domain-relevant factors, and it enables intuitive, scale-specific control of the cross-domain facial image synthesis. Extensive experiments demonstrate the efficacy of D²CA in successfully decoupling AU and domain factors, yielding visually pleasing cross-domain synthesized facial images. Meanwhile, D²CA consistently outperforms state-of-the-art cross-domain AU detection approaches, achieving an average F1 score improvement of 6%-14% across various cross-domain scenarios.
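
The feature-level contrastive learning mentioned in the abstract can be illustrated with a generic InfoNCE-style loss. This is a minimal sketch and not the loss used in the paper, whose exact formulation the abstract does not give; the function names, the use of cosine similarity, and the temperature value are assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def info_nce(anchor, positive, negatives, temperature=0.1):
    """Generic InfoNCE loss for one anchor (illustrative, not the paper's loss)."""
    logits = [cosine(anchor, positive) / temperature]
    logits += [cosine(anchor, neg) / temperature for neg in negatives]
    peak = max(logits)  # subtract the maximum for numerical stability
    log_sum_exp = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_sum_exp - logits[0]
```

The loss is close to zero when the anchor is nearer to its positive than to every negative, and it grows when a negative is nearer.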

[CV-96] Beyond Overfitting: Doubly Adaptive Dropout for Generalizable AU Detection

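[Quick Read]: This paper addresses the limited cross-domain applicability of automatic facial action unit (AU) detection systems, which stems mainly from existing methods overfitting to specific datasets and individual features. It proposes a doubly adaptive dropout approach that makes convolutional feature maps and spatial tokens more robust to domain shift. The key is combining a Channel Drop Unit (CD-Unit) and a Token Drop Unit (TD-Unit), which reduce domain-specific noise at the channel level and the token level respectively. The CD-Unit preserves domain-agnostic local patterns in feature maps, while the TD-Unit helps the model identify AU relationships that generalize across domains. An auxiliary domain classifier integrated at each layer guides the selective omission of domain-sensitive features, and a progressive training strategy prevents excessive feature dropout. Experiments show that the method consistently outperforms existing techniques on cross-domain AU detection, and attention-map visualizations further support its effectiveness.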

Link: https://arxiv.org/abs/2503.08974
Authors: Yong Li, Yi Ren, Xuesong Niu, Yi Ding, Xiu-Shen Wei, Cuntai Guan
Affiliations: School of Computer Science and Engineering and the Key Laboratory of New Generation Artificial Intelligence Technology and Its Interdisciplinary Applications, Southeast University, Nanjing 210096, China; School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing, 210094 China; Beijing Institute for General Artificial Intelligence, Beijing, China; School of Computer Science and Engineering, and Key Laboratory of New Generation Artificial Intelligence Technology and Its Interdisciplinary Applications, Southeast University, Nanjing 210096, China; School of Computer Science and Engineering, Nanyang Technological University, 50 Nanyang Avenue, Singapore, 639798
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by IEEE Transactions on Affective Computing 2025. A novel method for cross-domain facial action unit detection

Abstract:Facial Action Units (AUs) are essential for conveying psychological states and emotional expressions. While automatic AU detection systems leveraging deep learning have progressed, they often overfit to specific datasets and individual features, limiting their cross-domain applicability. To overcome these limitations, we propose a doubly adaptive dropout approach for cross-domain AU detection, which enhances the robustness of convolutional feature maps and spatial tokens against domain shifts. This approach includes a Channel Drop Unit (CD-Unit) and a Token Drop Unit (TD-Unit), which work together to reduce domain-specific noise at both the channel and token levels. The CD-Unit preserves domain-agnostic local patterns in feature maps, while the TD-Unit helps the model identify AU relationships generalizable across domains. An auxiliary domain classifier, integrated at each layer, guides the selective omission of domain-sensitive features. To prevent excessive feature dropout, a progressive training strategy is used, allowing for selective exclusion of sensitive features at any model layer. Our method consistently outperforms existing techniques in cross-domain AU detection, as demonstrated by extensive experimental evaluations. Visualizations of attention maps also highlight clear and meaningful patterns related to both individual and combined AUs, further validating the approach’s effectiveness.
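
The channel-level dropping described in the abstract can be pictured with a small sketch. It assumes each channel already has a domain-sensitivity score (a hypothetical stand-in for whatever signal the auxiliary domain classifier provides) and zeroes a fixed fraction of the most sensitive channels. The function name, the scores, and the fixed `drop_ratio` are illustrative assumptions, not the paper's CD-Unit.

```python
def drop_domain_sensitive_channels(feature_map, sensitivity, drop_ratio=0.25):
    """Zero the channels with the highest (hypothetical) domain-sensitivity scores.

    feature_map: list of channels, each a flat list of activations.
    sensitivity: one score per channel; higher means more domain-specific.
    """
    if len(feature_map) != len(sensitivity):
        raise ValueError("one sensitivity score is required per channel")
    n_drop = int(len(feature_map) * drop_ratio)
    # Rank channel indices from most to least domain-sensitive.
    ranked = sorted(range(len(sensitivity)), key=lambda c: sensitivity[c], reverse=True)
    dropped = set(ranked[:n_drop])
    return [
        [0.0] * len(channel) if c in dropped else list(channel)
        for c, channel in enumerate(feature_map)
    ]


features = [[1, 2], [3, 4], [5, 6], [7, 8]]
scores = [0.1, 0.9, 0.2, 0.3]
kept = drop_domain_sensitive_channels(features, scores)
```

With four channels and a ratio of 0.25, only the channel with score 0.9 is zeroed; the others pass through unchanged.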

[CV-97] Are ECGs enough? Deep learning classification of cardiac anomalies using only electrocardiograms

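[Quick Read]: This paper addresses the unclear role of electrocardiography (ECG) analysis in diagnosing cardiac anomalies and the limited generalization of existing models. Many current approaches either rely on additional imaging modalities such as Computed Tomography Pulmonary Angiography (CTPA), which are not always available, or fail to generalize across classification tasks. Public ECG datasets are also scarce and usually small, which makes optimizing the learning strategy essential. The key is to evaluate several neural network architectures and to examine whether transfer learning from large ECG datasets (such as PTB-XL and CPSC18) to a smaller, more challenging pulmonary embolism (PE) detection dataset improves learning efficiency and predictive performance on limited data.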

Link: https://arxiv.org/abs/2503.08960
Authors: Joao D.S. Marques, Arlindo L. Oliveira
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Electrocardiography (ECG) is an essential tool for diagnosing multiple cardiac anomalies: it provides valuable clinical insights, while being affordable, fast and available in many settings. However, in the current literature, the role of ECG analysis is often unclear: many approaches either rely on additional imaging modalities, such as Computed Tomography Pulmonary Angiography (CTPA), which may not always be available, or do not effectively generalize across different classification problems. Furthermore, the availability of public ECG datasets is limited and, in practice, these datasets tend to be small, making it essential to optimize learning strategies. In this study, we investigate the performance of multiple neural network architectures in order to assess the impact of various approaches. Moreover, we check whether these practices enhance model generalization when transfer learning is used to translate information learned in larger ECG datasets, such as PTB-XL and CPSC18, to a smaller, more challenging dataset for pulmonary embolism (PE) detection. By leveraging transfer learning, we analyze the extent to which we can improve learning efficiency and predictive performance on limited data. Code available at this https URL .

[CV-98] KAN-Mixers: a new deep learning architecture for image classification

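[Quick Read]: This paper designs KAN-Mixers, a new mixer-based architecture built on Kolmogorov-Arnold Networks (KANs), and evaluates it on image classification. The problem it targets is that existing architectures based on multilayer perceptrons (MLPs), such as MLP-Mixer, may fall short in refined feature extraction. The key is to replace the conventional MLP with KANs as the main layers, aiming to improve accuracy and interpretability over MLP models while remaining competitive. On Fashion-MNIST and CIFAR-10, KAN-Mixers reaches average accuracies of 0.9030 and 0.6980 respectively, outperforming the MLP, MLP-Mixer, and KAN models.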

Link: https://arxiv.org/abs/2503.08939
Authors: Jorge Luiz dos Santos Canuto, Linnyer Beatrys Ruiz Aylon, Rodrigo Clemente Thom de Souza
Affiliations: Maringá State University; Paraná Federal University; Maringá State University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 8 pages, 6 figures

Abstract: Due to their effective performance, Convolutional Neural Network (CNN) and Vision Transformer (ViT) architectures have become the standard for solving computer vision tasks. Such architectures require large data sets and rely on convolution and self-attention operations. In 2021, MLP-Mixer emerged, an architecture that relies only on Multilayer Perceptrons (MLPs) and achieves extremely competitive results when compared to CNNs and ViTs. Despite its good performance in computer vision tasks, the MLP-Mixer architecture may not be suitable for refined feature extraction in images. Recently, the Kolmogorov-Arnold Network (KAN) was proposed as a promising alternative to MLP models. KANs promise to improve accuracy and interpretability when compared to MLPs. Therefore, the present work aims to design a new mixer-based architecture, called KAN-Mixers, using KANs as main layers and evaluate its performance, in terms of several performance metrics, in the image classification task. In the main results, the KAN-Mixers model outperformed the MLP, MLP-Mixer and KAN models on the Fashion-MNIST and CIFAR-10 datasets, with average accuracies of 0.9030 and 0.6980, respectively.
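
In a Kolmogorov-Arnold Network the learnable nonlinearities sit on the edges, so each output is a sum of univariate functions applied to the individual inputs. The sketch below shows that structure for one layer. It uses piecewise-linear edge functions on shared knots purely for brevity (an assumption; KANs are usually described with splines), and it is not the KAN-Mixers implementation. All names are hypothetical.

```python
import bisect


def piecewise_linear(x, knots, values):
    """Evaluate a 1-D piecewise-linear function, clamped outside the knot range."""
    if x <= knots[0]:
        return values[0]
    if x >= knots[-1]:
        return values[-1]
    i = bisect.bisect_right(knots, x) - 1
    t = (x - knots[i]) / (knots[i + 1] - knots[i])
    return values[i] + t * (values[i + 1] - values[i])


def kan_layer(inputs, knots, edge_values):
    """edge_values[j][i] holds the knot values of the function on edge i -> j."""
    return [
        sum(piecewise_linear(x, knots, edge_values[j][i]) for i, x in enumerate(inputs))
        for j in range(len(edge_values))
    ]


knots = [-1.0, 0.0, 1.0]
identity = [-1.0, 0.0, 1.0]   # edge function f(x) = x on [-1, 1]
absolute = [1.0, 0.0, 1.0]    # edge function f(x) = |x| on [-1, 1]
outputs = kan_layer([0.5, -0.5], knots, [[identity, absolute], [absolute, identity]])
```

Training such a layer would mean learning the per-edge `values`; here they are fixed by hand so the arithmetic is easy to follow.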

[CV-99] PromptGAR: Flexible Promptive Group Activity Recognition

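[Quick Read]: This paper addresses the limited real-world applicability of existing Group Activity Recognition (GAR) methods, which rely on full prompt annotations, lack long-term actor consistency, and under-explore multi-group scenarios. It proposes a new framework, PromptGAR, whose key innovation is input flexibility across prompts, frames, and instances without retraining. It achieves this by unifying bounding boxes, skeletal keypoints, and areas as point prompts and by using a recognition decoder that cross-updates class and prompt tokens. To keep long activities consistent, it introduces a relative instance attention mechanism that directly encodes instance IDs. Finally, PromptGAR uses area prompts to selectively recognize a particular group activity in videos that contain multiple concurrent groups. Comprehensive evaluations show that PromptGAR performs well with both full and diverse prompt inputs, demonstrating input flexibility and generalization for real-world use.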

Link: https://arxiv.org/abs/2503.08933
Authors: Zhangyu Jin, Andrew Feng, Ankur Chemburkar, Celso M. De Melo
Affiliations: University of Southern California, Institute for Creative Technologies; Army Research Laboratory
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract: We present PromptGAR, a novel framework that addresses the limitations of current Group Activity Recognition (GAR) approaches by leveraging multi-modal prompts to achieve both input flexibility and high recognition accuracy. The existing approaches suffer from limited real-world applicability due to their reliance on full prompt annotations, the lack of long-term actor consistency, and under-exploration of multi-group scenarios. To bridge the gap, we propose PromptGAR, which is the first GAR model to provide input flexibility across prompts, frames, and instances without the need for retraining. Specifically, we unify bounding boxes, skeletal keypoints, and areas as point prompts and employ a recognition decoder for cross-updating class and prompt tokens. To ensure long-term consistency for extended activity durations, we also introduce a relative instance attention mechanism that directly encodes instance IDs. Finally, PromptGAR explores the use of area prompts to enable the selective recognition of the particular group activity within videos that contain multiple concurrent groups. Comprehensive evaluations demonstrate that PromptGAR achieves competitive performance both on full prompts and diverse prompt inputs, establishing its effectiveness on input flexibility and generalization ability for real-world applications.

[CV-100] HessianForge: Scalable LiDAR reconstruction with Physics-Informed Neural Representation and Smoothness Energy Constraints

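[Quick Read]: This paper addresses high-accuracy 3D mapping of large-scale outdoor environments from LiDAR measurements, with a focus on smooth, artifact-free surface reconstruction. State-of-the-art methods emphasize memory-efficient neural representations for high-fidelity surfaces but often cannot avoid artifacts caused by noisy and sparse inputs. The paper reframes surface mapping as a physics-informed energy optimization problem and enforces smoothness by optimizing an energy functional that penalizes sharp surface ridges. The key is a deep learning approach that learns the signed distance field (SDF) of the surface manifold from raw LiDAR point clouds using a physics-informed loss that optimizes the L₂-Hessian energy of the surface. The framework also includes a hierarchical octree-based input feature encoding and a multi-scale neural network that iteratively refines the SDF at different resolutions. A test-time refinement strategy corrects topological inconsistencies and edge distortions in the generated mesh, using a CUDA-accelerated least-squares optimization that locally adjusts vertex positions for feature-preserving smoothing. Experiments show better accuracy and smoothness than current state-of-the-art methods.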

Link: https://arxiv.org/abs/2503.08929
Authors: Hrishikesh Viswanath, Md Ashiqur Rahman, Chi Lin, Damon Conover, Aniket Bera
Affiliations: Dept of CS, Purdue University, West Lafayette, IN, USA; DEVCOM Army Research Laboratory, USA
Categories: Graphics (cs.GR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Robotics (cs.RO); Image and Video Processing (eess.IV)
Comments:

Abstract: Accurate and efficient 3D mapping of large-scale outdoor environments from LiDAR measurements is a fundamental challenge in robotics, particularly towards ensuring smooth and artifact-free surface reconstructions. Although the state-of-the-art methods focus on memory-efficient neural representations for high-fidelity surface generation, they often fail to produce artifact-free manifolds, with artifacts arising due to noisy and sparse inputs. To address this issue, we frame surface mapping as a physics-informed energy optimization problem, enforcing surface smoothness by optimizing an energy functional that penalizes sharp surface ridges. Specifically, we propose a deep learning based approach that learns the signed distance field (SDF) of the surface manifold from raw LiDAR point clouds using a physics-informed loss function that optimizes the L₂-Hessian energy of the surface. Our learning framework includes a hierarchical octree based input feature encoding and a multi-scale neural network to iteratively refine the signed distance field at different scales of resolution. Lastly, we introduce a test-time refinement strategy to correct topological inconsistencies and edge distortions that can arise in the generated mesh. We propose a CUDA-accelerated least-squares optimization that locally adjusts vertex positions to enforce feature-preserving smoothing. We evaluate our approach on large-scale outdoor datasets and demonstrate that our approach outperforms current state-of-the-art methods in terms of improved accuracy and smoothness. Our code is available at this https URL
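
A Hessian energy of the kind named in the abstract sums the squared second derivatives of a field, so flat or planar regions cost nothing while curved regions and sharp ridges are penalized. The sketch below approximates it with central finite differences on a 2-D grid. The paper works with a neural SDF in 3-D and its loss is not spelled out in the abstract, so the 2-D setting, the discretization, and the function name are assumptions for illustration only.

```python
def hessian_energy(grid, spacing=1.0):
    """Approximate the integral of the squared Frobenius norm of the Hessian.

    grid: 2-D list of samples f[row][col] on a regular grid.
    Only interior cells are used, with central finite differences.
    """
    h2 = spacing * spacing
    energy = 0.0
    for r in range(1, len(grid) - 1):
        for c in range(1, len(grid[0]) - 1):
            f_xx = (grid[r][c + 1] - 2 * grid[r][c] + grid[r][c - 1]) / h2
            f_yy = (grid[r + 1][c] - 2 * grid[r][c] + grid[r - 1][c]) / h2
            f_xy = (grid[r + 1][c + 1] - grid[r + 1][c - 1]
                    - grid[r - 1][c + 1] + grid[r - 1][c - 1]) / (4 * h2)
            # f_xy appears twice in the Hessian, hence the factor of 2.
            energy += (f_xx ** 2 + 2 * f_xy ** 2 + f_yy ** 2) * h2
    return energy


plane = [[2 * c + 3 * r for c in range(4)] for r in range(4)]   # f = 2x + 3y
bowl = [[c * c for c in range(4)] for r in range(4)]            # f = x^2
```

The planar field has zero energy, while the quadratic field has a constant second derivative of 2 along x and therefore a positive energy.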

[CV-101] The Detection of Saccadic Eye Movements and Per-Eye Comparisons using Virtual Reality Eye Tracking Devices

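[Quick Read]: This paper asks how accurately saccadic eye movements can be detected with eye trackers in virtual reality (VR) headsets that run at 60 Hz or 90 Hz. Saccade detection is well established in traditional eye-tracking research but not yet for headset-mounted VR eye trackers. The key is to build prototype software on VR eye-tracking technology that detects saccades and per-eye differences within an individual. The approach combines VR eye tracking with neuroscience methods and takes advantage of commercial, consumer-grade devices to offset the high cost and poor portability of traditional specialized eye trackers, offering neuroscience a new tool for detecting saccadic eye movements and related neurological and neurodegenerative disorders. The project is limited by its short time frame and by the eye tracker's maximum sampling frequency of 90 Hz.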

Link: https://arxiv.org/abs/2503.08926
Authors: Teran Bukenberger, Brent Davis
Affiliations: Unknown
Categories: Human-Computer Interaction (cs.HC); Computer Vision and Pattern Recognition (cs.CV)
Comments: 16 pages, 8 figures

Abstract: Eye tracking has been found to be useful in various tasks including diagnostic and screening tools. However, traditional eye trackers had a complicated setup and operated at a higher frequency to measure eye movements. The use of more commonly available eye trackers such as those in head-mounted virtual reality (VR) headsets greatly expands the utility of these eye trackers for research and analytical purposes. In this study, the research question is focused on detecting saccades, which is a common task when analyzing eye tracking data, but it is not well-established for VR headset-mounted eye trackers. The aim is to determine how accurately saccadic eye movements can be detected using an eye tracker that operates at 60 or 90 Hz. The study involves VR eye tracking technology and neuroscience with respect to saccadic eye movements. The goal is to build prototype software implemented using VR eye tracking technology to detect saccadic eye movements and per-eye differences in an individual. It is anticipated that the software will be able to accurately detect when saccades occur and analyze the differences in saccadic eye movements per-eye. The field of research surrounding VR eye tracking software is still developing rapidly, specifically its applications to neuroscience. Since previous methods of eye tracking involved specialized equipment, using commercially available, consumer-grade VR eye tracking technology to assist in the detection of saccades and per-eye differences would be novel. This project will impact the field of neuroscience by providing a tool that can be used to detect saccadic eye movements and neurological and neurodegenerative disorders. However, this project is limited by the short time frame and by the fact that the eye tracker used in this study operates at a maximum frequency of 90 Hz.
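
The abstract does not say which detection algorithm the prototype uses. One common baseline for this task is velocity thresholding: compute the angular speed between consecutive gaze samples and mark runs that exceed a threshold. The sketch below shows that baseline under stated assumptions: gaze is given as (horizontal, vertical) angles in degrees, the 30 deg/s threshold is a placeholder rather than a value from the paper, and the function and variable names are hypothetical.

```python
import math


def detect_saccades(gaze_deg, sample_rate_hz, velocity_threshold=30.0):
    """Return (start_index, end_index) sample ranges whose angular speed exceeds the threshold.

    gaze_deg: list of (horizontal, vertical) gaze angles in degrees, one per sample.
    velocity_threshold: speed in degrees per second (placeholder value).
    """
    dt = 1.0 / sample_rate_hz
    saccades = []
    start = None
    for i in range(1, len(gaze_deg)):
        dx = gaze_deg[i][0] - gaze_deg[i - 1][0]
        dy = gaze_deg[i][1] - gaze_deg[i - 1][1]
        speed = math.hypot(dx, dy) / dt
        if speed > velocity_threshold:
            if start is None:
                start = i - 1          # the saccade begins at the previous sample
        elif start is not None:
            saccades.append((start, i - 1))
            start = None
    if start is not None:              # recording ended mid-saccade
        saccades.append((start, len(gaze_deg) - 1))
    return saccades


left_eye = [(0, 0), (0.1, 0), (0.2, 0), (5, 0), (10, 0), (10.1, 0), (10.2, 0)]
left_saccades = detect_saccades(left_eye, sample_rate_hz=90)
```

Running the same function on each eye's samples separately gives per-eye saccade lists that can then be compared. At 90 Hz the samples are about 11 ms apart, so a short saccade may span only a few samples, which is the sampling limitation the abstract points to.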

[CV-102] Zero-Shot Action Generalization with Limited Observations AISTATS2025

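[Quick Read]: This paper addresses a challenge reinforcement learning (RL) faces in real-world settings: conventional methods struggle to generalize to actions not seen during training. Some existing work on zero-shot action generalization relies on large datasets of action observations to capture the behavior of new actions, which is often impractical. The paper proposes a new zero-shot framework, Action Generalization from Limited Observations (AGLO), built from two main modules: an action representation learning module and a policy learning module. The former extracts discriminative action embeddings from limited observations; the latter uses these learned representations, together with augmented synthetic action representations, to train a policy that can handle tasks with unseen actions. Experiments show that the framework significantly outperforms state-of-the-art methods on multiple benchmark tasks, generalizing to new actions from only a few action observations.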

Link: https://arxiv.org/abs/2503.08867
Authors: Abdullah Alchihabi, Hanping Zhang, Yuhong Guo
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: AISTATS 2025

Abstract:Reinforcement Learning (RL) has demonstrated remarkable success in solving sequential decision-making problems. However, in real-world scenarios, RL agents often struggle to generalize when faced with unseen actions that were not encountered during training. Some previous works on zero-shot action generalization rely on large datasets of action observations to capture the behaviors of new actions, making them impractical for real-world applications. In this paper, we introduce a novel zero-shot framework, Action Generalization from Limited Observations (AGLO). Our framework has two main components: an action representation learning module and a policy learning module. The action representation learning module extracts discriminative embeddings of actions from limited observations, while the policy learning module leverages the learned action representations, along with augmented synthetic action representations, to learn a policy capable of handling tasks with unseen actions. The experimental results demonstrate that our framework significantly outperforms state-of-the-art methods for zero-shot action generalization across multiple benchmark tasks, showcasing its effectiveness in generalizing to new actions with minimal action observations.

[CV-103] Keypoint Semantic Integration for Improved Feature Matching in Outdoor Agricultural Environments

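[Quick Read]: This paper addresses the loss of visual feature matching accuracy caused by perceptual aliasing in outdoor robot navigation, especially in vineyards, where repetitive vine trunks and other natural elements produce ambiguous descriptors that hinder reliable keypoint matching. The key is to tie semantic information to keypoint positions, making descriptors more distinctive in semantically meaningful image regions so that visually similar local features can be told apart more accurately. The method is validated on two tasks, relative pose estimation and visual localisation, and improves matching accuracy by 12.6%, demonstrating its effectiveness in challenging vineyard conditions.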

Link: https://arxiv.org/abs/2503.08843
Authors: Rajitha de Silva, Jonathan Cox, Marija Popovic, Cesar Cadena, Cyrill Stachniss, Riccardo Polvara
Affiliations: Lincoln Centre for Autonomous Systems (L-CAS), School of Engineering & Physical Sciences, University of Lincoln, UK; MAVLab, Faculty of Aerospace Engineering, TU Delft, Netherlands; Robotics Systems Lab, Institute of Robotics and Intelligent Systems, ETH Zurich, Switzerland
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: Project Page: this https URL

Abstract:Robust robot navigation in outdoor environments requires accurate perception systems capable of handling visual challenges such as repetitive structures and changing appearances. Visual feature matching is crucial to vision-based pipelines but remains particularly challenging in natural outdoor settings due to perceptual aliasing. We address this issue in vineyards, where repetitive vine trunks and other natural elements generate ambiguous descriptors that hinder reliable feature matching. We hypothesise that semantic information tied to keypoint positions can alleviate perceptual aliasing by enhancing keypoint descriptor distinctiveness. To this end, we introduce a keypoint semantic integration technique that improves the descriptors in semantically meaningful regions within the image, enabling more accurate differentiation even among visually similar local features. We validate this approach in two vineyard perception tasks: (i) relative pose estimation and (ii) visual localisation. Across all tested keypoint types and descriptors, our method improves matching accuracy by 12.6%, demonstrating its effectiveness over multiple months in challenging vineyard conditions.

[CV-104] Filter Like You Test: Data-Driven Data Filtering for CLIP Pretraining

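[Quick Read]: This paper addresses efficient filtering of large-scale vision-language datasets. The key is Filter Like You Test (FLYT), which trains a scoring model to estimate how useful each data point is as a pretraining example, learning the weights from gradient signals of downstream-task training sets. Mixing-FLYT (M-FLYT) then unifies the per-example scores produced by different scoring methods into a single score. Soft Cap Sampling (SCS) builds the filtered pretraining dataset by sampling from the per-example probabilities while a repetition penalty prevents over-representation of any example. Together, the methods reach 40.1% ImageNet zero-shot accuracy on the DataComp medium-scale filtering benchmark, a 1.9% absolute improvement over all previous results and a 5.5% improvement over results that use only public resources.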

Link: https://arxiv.org/abs/2503.08805
Authors: Mikey Shechter, Yair Carmon
Affiliations: Tel Aviv University
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract:We introduce Filter Like You Test (FLYT), a method for curating large-scale vision-language datasets that learns the usefulness of each data point as a pretraining example. FLYT trains a scoring model that learns to weigh each example using gradient signals from downstream tasks training sets. Using the same training methodology, we develop Mixing-FLYT (M-FLYT), which takes the per-example scores generated by different scoring methods and learns to unify them into a single score. Our training methodology naturally produces a distribution over the training examples, which we leverage through Soft Cap Sampling (SCS), a strategy for obtaining a filtered pretraining dataset from per-example probabilities that samples examples while preventing over-representation through a repetition penalty. Using all three methods, we achieve 40.1% ImageNet zero-shot accuracy on the DataComp medium scale filtering benchmark, a 1.9% absolute accuracy increase over all previous results and a 5.5% increase over results that – like us – use only public resources.
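
The abstract describes Soft Cap Sampling only at a high level: sample from per-example probabilities while a repetition penalty limits over-representation. One way to picture that idea is the sketch below, which draws one example at a time from a softmax over scores and lowers the score of whatever was just drawn. The one-at-a-time loop, the additive penalty, and every name here are assumptions for illustration, not the paper's procedure.

```python
import math
import random


def soft_cap_sample(logits, n_samples, penalty=1.0, seed=0):
    """Draw n_samples indices, penalizing an example each time it is drawn.

    Returns how many times each example was selected.
    """
    rng = random.Random(seed)
    logits = list(logits)
    counts = [0] * len(logits)
    for _ in range(n_samples):
        peak = max(logits)
        # Unnormalized softmax weights; random.choices normalizes them.
        weights = [math.exp(x - peak) for x in logits]
        idx = rng.choices(range(len(logits)), weights=weights, k=1)[0]
        counts[idx] += 1
        logits[idx] -= penalty   # repetition penalty: less likely next time
    return counts
```

With a small penalty, high-scoring examples can repeat several times; with a very large penalty, the procedure approaches sampling without replacement.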

[CV-105] Disentangled World Models: Learning to Transfer Semantic Knowledge from Distracting Videos for Reinforcement Learning

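[Quick Read]: This paper addresses a major challenge in training visual reinforcement learning (RL) in practical scenarios: RL agents have low sample efficiency in environments with variations. Existing methods mitigate this with disentangled representation learning, but they usually learn from scratch and lack prior knowledge of the world. The paper instead learns underlying semantic variations from distracting videos through offline-to-online latent distillation and flexible disentanglement constraints. Its key contribution is an interpretable model-based RL framework, Disentangled World Models (DisWM). An action-free video prediction model is first pretrained offline with disentanglement regularization to extract semantic knowledge, and its disentanglement capability is then transferred to the world model through latent distillation. During online finetuning, the world model draws on the pretrained model's knowledge under a disentanglement constraint, while actions and rewards from online interaction enrich data diversity and strengthen disentangled representation learning. Experiments confirm the method's advantage on a range of benchmarks.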

Link: https://arxiv.org/abs/2503.08751
Authors: Qi Wang, Zhipeng Zhang, Baao Xie, Xin Jin, Yunbo Wang, Shiyu Wang, Liaomo Zheng, Xiaokang Yang, Wenjun Zeng
Affiliations: MoE Key Lab of Artificial Intelligence, AI Institute, Shanghai Jiao Tong University, Shanghai, China; Ningbo Institute of Digital Twin, Eastern Institute of Technology, Ningbo, China; Zhejiang Key Laboratory of Industrial Intelligence and Digital Twin, Eastern Institute of Technology, Ningbo, China; University of Chinese Academy of Sciences, Beijing, China; Shenyang Institute of Computing Technology, Chinese Academy of Sciences, Shenyang, China; Shenyang CASNC Technology Co., Ltd, Shenyang, China
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract: Training visual reinforcement learning (RL) in practical scenarios presents a significant challenge, i.e., RL agents suffer from low sample efficiency in environments with variations. While various approaches have attempted to alleviate this issue by disentangled representation learning, these methods usually start learning from scratch without prior knowledge of the world. This paper, in contrast, tries to learn and understand underlying semantic variations from distracting videos via offline-to-online latent distillation and flexible disentanglement constraints. To enable effective cross-domain semantic knowledge transfer, we introduce an interpretable model-based RL framework, dubbed Disentangled World Models (DisWM). Specifically, we pretrain the action-free video prediction model offline with disentanglement regularization to extract semantic knowledge from distracting videos. The disentanglement capability of the pretrained model is then transferred to the world model through latent distillation. For finetuning in the online environment, we exploit the knowledge from the pretrained model and introduce a disentanglement constraint to the world model. During the adaptation phase, the incorporation of actions and rewards from online environment interactions enriches the diversity of the data, which in turn strengthens the disentangled representation learning. Experimental results validate the superiority of our approach on various benchmarks.

[CV-106] Oasis: One Image is All You Need for Multimodal Instruction Data Synthesis

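[Quick Read]: This paper addresses the problem that the training data of many multimodal large language models (MLLMs) is unavailable, because of privacy concerns and because collecting multimodal data is expensive and labor-intensive. It proposes Oasis, whose key idea is to synthesize high-quality multimodal training data from images alone, which greatly broadens data diversity, combined with careful quality control to ensure data quality. Experiments show that the method significantly improves MLLM performance and also supports targeting domain-specific model abilities.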

Link: https://arxiv.org/abs/2503.08741
Authors: Letian Zhang, Quan Cui, Bingchen Zhao, Cheng Yang
Affiliations: Tongji University; Bytedance; University of Edinburgh
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract: The success of multi-modal large language models (MLLMs) has been largely attributed to the large-scale training data. However, the training data of many MLLMs is unavailable due to privacy concerns. The expensive and labor-intensive process of collecting multi-modal data further exacerbates the problem. Is it possible to synthesize multi-modal training data automatically without compromising diversity and quality? In this paper, we propose a new method, Oasis, to synthesize high-quality multi-modal data with only images. Oasis breaks through traditional methods by prompting only images to the MLLMs, thus extending the data diversity by a large margin. Our method features a delicate quality control method which ensures the data quality. We collected over 500k data samples and conducted incremental experiments on LLaVA-NeXT. Extensive experiments demonstrate that our method can significantly improve the performance of MLLMs. The image-based synthesis also allows us to focus on the specific-domain ability of MLLMs. Code and data will be publicly available.

[CV-107] Representing 3D Shapes With 64 Latent Vectors for 3D Diffusion Models

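[Quick Read]: This paper addresses how to compress the latent space for efficient 3D diffusion models without degrading reconstruction quality. The key is COD-VAE, a variational autoencoder (VAE) that encodes 3D shapes into a compact set of 1D latent vectors. COD-VAE uses a two-stage autoencoder scheme to improve compression and decoding efficiency: first, the encoder block progressively compresses point clouds into compact latent vectors via intermediate point patches; second, a triplane-based decoder reconstructs dense triplanes from the latent vectors instead of decoding neural fields directly, which greatly reduces the computational overhead of neural field decoding. The paper also proposes uncertainty-guided token pruning, which allocates resources adaptively by skipping computation in simpler regions and further improves decoder efficiency. Experiments show that COD-VAE achieves 16x compression over the baseline and a 20.8x generation speedup, indicating that high-quality reconstruction and generation do not require a large number of latent vectors.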

Link: https://arxiv.org/abs/2503.08737
Authors: In Cho, Youngbeom Yoo, Subin Jeon, Seon Joo Kim
Affiliations: Yonsei University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Constructing a compressed latent space through a variational autoencoder (VAE) is the key for efficient 3D diffusion models. This paper introduces COD-VAE, a VAE that encodes 3D shapes into a COmpact set of 1D latent vectors without sacrificing quality. COD-VAE introduces a two-stage autoencoder scheme to improve compression and decoding efficiency. First, our encoder block progressively compresses point clouds into compact latent vectors via intermediate point patches. Second, our triplane-based decoder reconstructs dense triplanes from latent vectors instead of directly decoding neural fields, significantly reducing computational overhead of neural fields decoding. Finally, we propose uncertainty-guided token pruning, which allocates resources adaptively by skipping computations in simpler regions and improves the decoder efficiency. Experimental results demonstrate that COD-VAE achieves 16x compression compared to the baseline while maintaining quality. This enables 20.8x speedup in generation, highlighting that a large number of latent vectors is not a prerequisite for high-quality reconstruction and generation.

[CV-108] FairDeFace: Evaluating the Fairness and Adversarial Robustness of Face Obfuscation Methods

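[Quick Read]: This paper addresses the difficulty of evaluating face obfuscation methods in the absence of a common platform and benchmark datasets, and fills a gap in research on the fairness of these methods. The key is FairDeFace, a comprehensive framework whose modules cover data benchmarks, face detection and recognition algorithms, adversarial models, utility detection models, and fairness metrics, enabling thorough evaluation and comparison of face obfuscation methods. Any face obfuscation method can be integrated for rigorous testing, and the included attacks and privacy, utility, and fairness metrics reveal how existing methods behave in terms of adversarial robustness and bias against particular gender or racial groups. At its core, the solution is a standardized, extensible platform for research on the fairness and security of face obfuscation.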

Link: https://arxiv.org/abs/2503.08731
Authors: Seyyed Mohammad Sadegh Moosavi Khorzooghi, Poojitha Thota, Mohit Singhal, Abolfazl Asudeh, Gautam Das, Shirin Nilizadeh
Affiliations: University of Illinois Chicago; Northeastern University; The University of Texas at Arlington
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract:The lack of a common platform and benchmark datasets for evaluating face obfuscation methods has been a challenge, with every method being tested using arbitrary experiments, datasets, and metrics. While prior work has demonstrated that face recognition systems exhibit bias against some demographic groups, there exists a substantial gap in our understanding regarding the fairness of face obfuscation methods. Providing fair face obfuscation methods can ensure equitable protection across diverse demographic groups, especially since they can be used to preserve the privacy of vulnerable populations. To address these gaps, this paper introduces a comprehensive framework, named FairDeFace, designed to assess the adversarial robustness and fairness of face obfuscation methods. The framework introduces a set of modules encompassing data benchmarks, face detection and recognition algorithms, adversarial models, utility detection models, and fairness metrics. FairDeFace serves as a versatile platform where any face obfuscation method can be integrated, allowing for rigorous testing and comparison with other state-of-the-art methods. In its current implementation, FairDeFace incorporates 6 attacks, and several privacy, utility and fairness metrics. Using FairDeFace, and by conducting more than 500 experiments, we evaluated and compared the adversarial robustness of seven face obfuscation methods. This extensive analysis led to many interesting findings both in terms of the degree of robustness of existing methods and their biases against some gender or racial groups. FairDeFace also uses visualization of focused areas for both obfuscation and verification attacks to show not only which areas are mostly changed in the obfuscation process for some demographics, but also why they failed through focus area comparison of obfuscation and verification.
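
The abstract does not list which fairness metrics FairDeFace uses. As a generic illustration of what a group-level fairness check can look like, the sketch below computes, for each demographic group, the rate at which obfuscation defeated an attack, and reports the gap between the best-protected and worst-protected groups. The record format, the group labels, and the max-minus-min gap are assumptions for illustration and are not taken from the framework.

```python
def group_rate_gap(records):
    """Per-group protection rates and the gap between the best and worst group.

    records: iterable of (group, protected) pairs, where protected is True
    when the obfuscated face defeated the attack.
    """
    totals, hits = {}, {}
    for group, protected in records:
        totals[group] = totals.get(group, 0) + 1
        hits[group] = hits.get(group, 0) + (1 if protected else 0)
    rates = {group: hits[group] / totals[group] for group in totals}
    return rates, max(rates.values()) - min(rates.values())


# Hypothetical outcomes for two placeholder groups "A" and "B".
outcomes = [("A", True), ("A", True), ("A", False), ("A", True),
            ("B", True), ("B", False)]
rates, gap = group_rate_gap(outcomes)
```

A gap of zero would mean every group is protected at the same rate; a larger gap suggests the obfuscation method works better for some groups than for others.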

[CV-109] Preserving Product Fidelity in Large Scale Image Recontextualization with Diffusion Models

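[Quick Read]: This paper presents a framework for high-fidelity product image recontextualization based on text-to-image diffusion models and a new data augmentation pipeline, addressing the limitations of real-world data collection for this task. The key is to disentangle product representations and to generate synthetic training data with image-to-video diffusion and in/outpainting negatives, which improves the quality and diversity of generated images and strengthens the model's understanding of product characteristics.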

Link: https://arxiv.org/abs/2503.08729
Authors: Ishaan Malhi, Praneet Dutta, Ellie Talius, Sally Ma, Brendan Driscoll, Krista Holden, Garima Pruthi, Arunachalam Narayanaswamy
Affiliations: Google
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:We present a framework for high-fidelity product image recontextualization using text-to-image diffusion models and a novel data augmentation pipeline. This pipeline leverages image-to-video diffusion, in/outpainting negatives to create synthetic training data, addressing limitations of real-world data collection for this task. Our method improves the quality and diversity of generated images by disentangling product representations and enhancing the model’s understanding of product characteristics. Evaluation on the ABO dataset and a private product dataset, using automated metrics and human assessment, demonstrates the effectiveness of our framework in generating realistic and compelling product visualizations, with implications for applications such as e-commerce and virtual product showcasing.

[CV-110] Is CLIP ideal? No. Can we fix it? Yes!

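[Quick read]: This paper addresses the limitations of CLIP (Contrastive Language-Image Pre-Training) in handling complex visual-textual interactions. Although CLIP performs well at multimodal semantic learning, its latent space cannot simultaneously and correctly represent four properties: basic descriptions and image content, attribute binding, spatial location and relationships, and negation. By rigorously analyzing the geometry of CLIP's latent space, the paper proves that no CLIP-like joint embedding space can correctly satisfy any two of these requirements at the same time.

To overcome this fundamental limitation, the paper proposes Dense Cosine Similarity Maps (DCSMs), a principled and interpretable scoring method that retains the semantic topology of image patches and text tokens. Experiments show that DCSMs improve on classical CLIP-like joint encoder models across a wide range of benchmarks. Code and data are released for reproducibility.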

Link: https://arxiv.org/abs/2503.08723
Authors: Raphi Kang, Yue Song, Georgia Gkioxari, Pietro Perona
Affiliations: California Institute of Technology
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Notes:

Abstract:Contrastive Language-Image Pre-Training (CLIP) is a popular method for learning multimodal latent spaces with well-organized semantics. Despite its wide range of applications, CLIP’s latent space is known to fail at handling complex visual-textual interactions. Recent works attempt to address its shortcomings with data-centric or algorithmic approaches. But what if the problem is more fundamental, and lies in the geometry of CLIP? Toward this end, we rigorously analyze CLIP’s latent space properties, and prove that no CLIP-like joint embedding space exists which can correctly do any two of the following at the same time: 1. represent basic descriptions and image content, 2. represent attribute binding, 3. represent spatial location and relationships, 4. represent negation. Informed by this analysis, we propose Dense Cosine Similarity Maps (DCSMs) as a principled and interpretable scoring method for CLIP-like models, which solves the fundamental limitations of CLIP by retaining the semantic topology of the image patches and text tokens. This method improves upon the performance of classical CLIP-like joint encoder models on a wide array of benchmarks. We share our code and data here for reproducibility: this https URL
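
One way to picture "retaining the semantic topology of the image patches and text tokens" is to keep the full patch-by-token cosine similarity matrix instead of collapsing each modality into a single pooled vector. The sketch below is only an illustration of that idea under stated assumptions: the function names and the toy 2-D embeddings are hypothetical, and how the paper turns such a map into a final score is not described in this listing.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def dense_cosine_similarity_map(patch_embeddings, token_embeddings):
    """Return a (num_patches x num_tokens) matrix of cosine similarities."""
    patches = [l2_normalize(p) for p in patch_embeddings]
    tokens = [l2_normalize(t) for t in token_embeddings]
    return [[sum(a * b for a, b in zip(p, t)) for t in tokens] for p in patches]


# Toy 2-D embeddings (hypothetical values, not taken from the paper).
patches = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
tokens = [[3.0, 0.0], [0.0, 1.0]]
dcsm = dense_cosine_similarity_map(patches, tokens)
```

Each row of `dcsm` corresponds to one image patch and each column to one text token, so the map keeps track of which patch matches which token, information that a single pooled similarity score discards.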

[CV-111] A Recipe for Improving Remote Sensing VLM Zero Shot Generalization

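[Quick read]: This paper addresses the scarcity of training data for vision-language model (VLM) foundation models in remote sensing (RS) and aims to improve zero-shot cross-modal retrieval performance. Its key contribution is two new image-caption datasets: one pairs imagery with captions generated by Gemini, and the other is built from web images and their alt-text, providing diverse training data. Pre-training the MaMMUT model on these datasets yields state-of-the-art zero-shot cross-modal retrieval performance. To strengthen localization, the paper further introduces a novel attention-pooling mechanism, the Smooth-Attention-Operation, which is used to generate pseudo-labels from the image-level knowledge gained during vision-language contrastive training and to improve region localization accuracy.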

Link: https://arxiv.org/abs/2503.08722
Authors: Aviad Barzilai, Yotam Gigi, Vered Silverman, Yehonathan Refael, Bolous Jaber, Amr Helmy, Tomer Shekel, George Leifman, Genady Beryozkin
Affiliations: Google Research
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Notes:

Abstract:Foundation models have had a significant impact across various AI applications, enabling use cases that were previously impossible. Contrastive Visual Language Models (VLMs), in particular, have outperformed other techniques in many tasks. However, their prevalence in remote sensing (RS) is still limited, due to the scarcity of diverse remote-sensing visual-language datasets. In this work we introduce two novel image-caption datasets for training of remote sensing foundation models. The first dataset pairs aerial and satellite imagery with captions generated by Gemini using landmarks extracted from Google Maps. The second dataset utilizes public web images and their corresponding alt-text, filtered for the remote sensing domain, resulting in a diverse dataset with greater breadth in image styles and subject matter. These datasets are used to pre-train the MaMMUT (Kuo et al., 2023) VLM architecture, resulting in state-of-the-art generalization performance in zero-shot cross-modal retrieval on well-known public benchmarks. Finally, we present our ongoing research to distill image-level knowledge gained in the VLM contrastive training procedure to enhance the model’s localization ability. Specifically, we iteratively generate pseudo-labels for image regions based on the model’s attention maps and use these labels for further training. To mitigate noisy attention maps and create robust segmentation masks, we introduce a novel attention-pooling mechanism called the Smooth-Attention-Operation.

[CV-112] Versatile Multimodal Controls for Whole-Body Talking Human Animation

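[Quick read]: This paper addresses flexible, controllable whole-body human animation from a single reference image. The goal is to synthesize whole-body motion, not only head or half-body motion, driven flexibly by audio signals and text prompts. Most existing methods only support pre-specified head or half-body motion aligned with audio inputs, which limits their applicability.

The key to the solution is VersaAnimator, a versatile human animation method with three components: 1) a text-controlled, audio-driven motion generator that produces whole-body motion representations in 3D synchronized with the audio while following textual motion descriptions; 2) a code-pose translation module that links variational autoencoder (VAE) codebooks with 2D DWposes extracted from template videos to promote natural, smooth motion; and 3) a multi-modal video diffusion model that generates photorealistic human animation from a reference image, audio inputs, and whole-body motion representations. Experiments show that VersaAnimator outperforms existing methods in visual quality, identity preservation, and audio-lip synchronization.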

Link: https://arxiv.org/abs/2503.08714
Authors: Zheng Qin, Ruobing Zheng, Yabing Wang, Tianqi Li, Zixin Zhu, Minghui Yang, Ming Yang, Le Wang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Notes:

Abstract:Human animation from a single reference image should be flexible enough to synthesize whole-body motion for either a headshot or a whole-body portrait, with the motions readily controlled by audio signals and text prompts. This is hard for most existing methods, as they only support producing pre-specified head or half-body motion aligned with audio inputs. In this paper, we propose a versatile human animation method, VersaAnimator, which generates whole-body talking humans from arbitrary portrait images, not only driven by audio signals but also flexibly controlled by text prompts. Specifically, we design a text-controlled, audio-driven motion generator that produces whole-body motion representations in 3D synchronized with audio inputs while following textual motion descriptions. To promote natural smooth motion, we propose a code-pose translation module to link VAE codebooks with 2D DWposes extracted from template videos. Moreover, we introduce a multi-modal video diffusion model that generates photorealistic human animation from a reference image according to both audio inputs and whole-body motion representations. Extensive experiments show that VersaAnimator outperforms existing methods in visual quality, identity preservation, and audio-lip synchronization.

[CV-113] SDTrack: A Baseline for Event-based Tracking via Spiking Neural Networks

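[Quick read]: This paper addresses the reduced energy efficiency and limited tracking performance of existing approaches that combine Artificial Neural Networks (ANNs) with Spiking Neural Networks (SNNs) and rely on suboptimal architectures. It proposes the first Transformer-based spike-driven tracking pipeline and introduces the Global Trajectory Prompt (GTP) method, which captures global trajectory information and aggregates it with event streams into event images to enhance spatiotemporal representation. It also presents SDTrack, a Transformer-based spike-driven tracker that consists of a Spiking MetaFormer backbone and a simple tracking head that directly predicts normalized coordinates. The key is an end-to-end framework that tracks without data augmentation or post-processing while keeping the lowest parameter count and energy consumption, achieving state-of-the-art performance on multiple event-based tracking benchmarks and providing a solid baseline for neuromorphic vision research.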

Link: https://arxiv.org/abs/2503.08703
Authors: Yimeng Shan, Zhenbang Ren, Haodi Wu, Wenjie Wei, Rui-Jie Zhu, Shuai Wang, Dehao Zhang, Yichen Xiao, Jieyuan Zhang, Kexin Shi, Jingzhinan Wang, Jason K. Eshraghian, Haicheng Qu, Jiqing Zhang, Malu Zhang, Yang Yang
Affiliations: University of Electronic Science and Technology of China, China; Liaoning Technical University, China; University of California, Santa Cruz, USA; Dalian Maritime University, China
Categories: Neural and Evolutionary Computing (cs.NE); Computer Vision and Pattern Recognition (cs.CV)
Notes: 11 pages, 7 figures, 4 tables

Abstract:Event cameras provide superior temporal resolution, dynamic range, power efficiency, and pixel bandwidth. Spiking Neural Networks (SNNs) naturally complement event data through discrete spike signals, making them ideal for event-based tracking. However, current approaches that combine Artificial Neural Networks (ANNs) and SNNs, along with suboptimal architectures, compromise energy efficiency and limit tracking performance. To address these limitations, we propose the first Transformer-based spike-driven tracking pipeline. Our Global Trajectory Prompt (GTP) method effectively captures global trajectory information and aggregates it with event streams into event images to enhance spatiotemporal representation. We then introduce SDTrack, a Transformer-based spike-driven tracker comprising a Spiking MetaFormer backbone and a simple tracking head that directly predicts normalized coordinates using spike signals. The framework is end-to-end and does not require data augmentation or post-processing. Extensive experiments demonstrate that SDTrack achieves state-of-the-art performance while maintaining the lowest parameter count and energy consumption across multiple event-based tracking benchmarks, establishing a solid baseline for future research in the field of neuromorphic vision.

[CV-114] Real-Time Semantic Segmentation of Aerial Images Using an Embedded U-Net: A Comparison of CPU, GPU, and FPGA Workflows

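[Quick read]: This study addresses real-time deployment of lightweight semantic segmentation models on embedded computing platforms, with a focus on efficient use of Commercial Off-The-Shelf (COTS) hardware. Its key contribution is a lightweight U-Net optimized for aerial images that maintains accuracy on a real-world dataset while reducing model parameters and Multiply-Accumulate (MAC) operations by a factor of 16. The core of the work is a comprehensive evaluation of three hardware platforms (CPU, GPU, and FPGA) and five toolchains (TVM, FINN, Vitis AI, TensorFlow GPU, and cuDNN) on latency, power consumption, memory footprint, energy efficiency, and FPGA resource usage, which exposes the trade-offs between platforms and toolchains and the deployment challenges of real-world applications. The results indicate that the FPGA with Vitis AI is the best choice in terms of performance, energy efficiency, and maturity, but that it requires specialized hardware knowledge, so selecting an embedded computing solution means balancing practicality against technical complexity.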

Link: https://arxiv.org/abs/2503.08700
Authors: Julien Posso, Hugo Kieffer, Nicolas Menga, Omar Hlimi, Sébastien Tarris, Hubert Guerard, Guy Bois, Matthieu Couderc, Eric Jenn
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR); Machine Learning (cs.LG)
Notes: ERTS2024, Jun 2024, Toulouse, France

Abstract:This study introduces a lightweight U-Net model optimized for real-time semantic segmentation of aerial images, targeting the efficient utilization of Commercial Off-The-Shelf (COTS) embedded computing platforms. We maintain the accuracy of the U-Net on a real-world dataset while significantly reducing the model’s parameters and Multiply-Accumulate (MAC) operations by a factor of 16. Our comprehensive analysis covers three hardware platforms (CPU, GPU, and FPGA) and five different toolchains (TVM, FINN, Vitis AI, TensorFlow GPU, and cuDNN), assessing each on metrics such as latency, power consumption, memory footprint, energy efficiency, and FPGA resource usage. The results highlight the trade-offs between these platforms and toolchains, with a particular focus on the practical deployment challenges in real-world applications. Our findings demonstrate that while the FPGA with Vitis AI emerges as the superior choice due to its performance, energy efficiency, and maturity, it requires specialized hardware knowledge, emphasizing the need for a balanced approach in selecting embedded computing solutions for semantic segmentation tasks.

[CV-115] Out-of-Distribution Segmentation in Autonomous Driving: Problems and State of the Art

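[Quick read]: This paper addresses Out-of-Distribution (OoD) segmentation in practical applications, in particular road obstacle detection in autonomous driving. It analyzes the performance of existing methods on two widely used benchmarks, SegmentMeIfYouCan Obstacle Track and LostAndFound-NoKnown, and highlights their strengths, limitations, and real-world applicability. The paper also discusses key challenges in OoD segmentation and outlines potential research directions, with the goal of advancing the field toward safer and more reliable autonomous driving systems.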

Link: https://arxiv.org/abs/2503.08695
Authors: Youssef Shoeb, Azarm Nowzad, Hanno Gottschalk
Affiliations: Continental AG; Technische Universität Berlin
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO); Image and Video Processing (eess.IV)
Notes:

Abstract:In this paper, we review the state of the art in Out-of-Distribution (OoD) segmentation, with a focus on road obstacle detection in automated driving as a real-world application. We analyse the performance of existing methods on two widely used benchmarks, SegmentMeIfYouCan Obstacle Track and LostAndFound-NoKnown, highlighting their strengths, limitations, and real-world applicability. Additionally, we discuss key challenges and outline potential research directions to advance the field. Our goal is to provide researchers and practitioners with a comprehensive perspective on the current landscape of OoD segmentation and to foster further advancements toward safer and more reliable autonomous driving systems.

[CV-116] Orientation tracking method for anisotropic particles

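[Quick read]: This paper addresses tracking the spatial location and orientation of anisotropic particles. It proposes an algorithm based on (high-speed) multi-camera recordings that uses the known shape of the particles to reconstruct their 3D location and orientation from different viewpoints. The key is an algorithm that tracks the location and orientation of multiple anisotropic particles over time; its robustness and applicability are assessed by quantifying the error and analyzing the effects of noise, image size, the number of cameras, and the camera arrangement. The method works for a wide range of particle shapes, tracks multiple particles simultaneously, and can distinguish between different types of particles.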

Link: https://arxiv.org/abs/2503.08694
Authors: Mees M. Flapper, Elian Bernard, Sander G. Huisman
Affiliations: University of Twente; ENS de Lyon; CNRS
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes: 12 pages, 20 figures

Abstract:A method for particle orientation tracking is developed and demonstrated specifically for anisotropic particles. Using (high-speed) multi-camera recordings of anisotropic particles from different viewpoints, we reconstruct the 3D location and orientation of these particles using their known shape. This paper describes an algorithm which tracks the location and orientation of multiple anisotropic particles over time, enabling detailed investigations of location, orientation, and rotation statistics. The robustness and error of this method are quantified, and we explore the effects of noise, image size, the number of cameras used, and the camera arrangement by applying the algorithm to synthetic images. We showcase use cases of this method in several experiments (in both quiescent and turbulent fluids), demonstrating the effectiveness and broad applicability of the described tracking method. The proposed method is shown to work for widely different particle shapes, to track multiple particles simultaneously, and to distinguish between different types of particles.

[CV-117] ProReflow: Progressive Reflow with Decomposed Velocity

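[Quick read]: This paper addresses the high computational cost of diffusion models, which persists despite their significant progress in image and video generation. Flow matching tackles this by reflowing the diffusion process into a straight line to enable few-step or even one-step generation. The paper argues that the original flow matching training pipeline is not optimal and proposes two improvements. First, progressive reflow reflows the diffusion model progressively over local timesteps until the whole diffusion process is covered, which reduces the difficulty of flow matching. Second, aligned v-prediction emphasizes that direction matching matters more than magnitude matching in flow matching. Experiments on SDv1.5 and SDXL demonstrate the effectiveness of the method; for example, on the MSCOCO2014 validation set SDv1.5 reaches an FID of 10.70 with only 4 sampling steps, close to the teacher model (32 DDIM steps, FID = 10.05).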

Link: https://arxiv.org/abs/2503.04824
Authors: Lei Ke, Haohang Xu, Xuefei Ning, Yu Li, Jiajun Li, Haoling Li, Yuxuan Lin, Dongsheng Jiang, Yujiu Yang, Linfeng Zhang
Affiliations: Tsinghua University; Shanghai Jiao Tong University; Huawei Inc.; University of Electronic Science and Technology of China
Categories: Graphics (cs.GR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Notes: Our code will be released on GitHub

Abstract:Diffusion models have achieved significant progress in both image and video generation but still suffer from huge computation costs. As an effective solution, flow matching aims to reflow the diffusion process of diffusion models into a straight line for few-step and even one-step generation. However, in this paper, we suggest that the original training pipeline of flow matching is not optimal and introduce two techniques to improve it. First, we introduce progressive reflow, which progressively reflows the diffusion models in local timesteps until the whole diffusion process is covered, reducing the difficulty of flow matching. Second, we introduce aligned v-prediction, which highlights the importance of direction matching over magnitude matching in flow matching. Experimental results on SDv1.5 and SDXL demonstrate the effectiveness of our method; for example, applying it to SDv1.5 achieves an FID of 10.70 on the MSCOCO2014 validation set with only 4 sampling steps, close to our teacher model (32 DDIM steps, FID = 10.05).
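
To make "direction matching over magnitude matching" concrete, the sketch below shows one possible loss that penalizes the angle between a predicted and a target velocity at full weight and the difference in their lengths at a smaller weight. This is an assumption made for illustration only: the function name, the cosine-based direction term, and the `magnitude_weight` value are hypothetical and are not taken from the paper.

```python
import math


def direction_weighted_velocity_loss(v_pred, v_target, magnitude_weight=0.1):
    """Hypothetical loss: full-weight direction term plus down-weighted magnitude term."""
    dot = sum(p * t for p, t in zip(v_pred, v_target))
    norm_pred = math.sqrt(sum(p * p for p in v_pred))
    norm_target = math.sqrt(sum(t * t for t in v_target))
    # Cosine similarity; the small constant guards against division by zero.
    cosine = dot / (norm_pred * norm_target + 1e-12)
    direction_term = 1.0 - cosine
    magnitude_term = (norm_pred - norm_target) ** 2
    return direction_term + magnitude_weight * magnitude_term


# Same direction, wrong length: only the down-weighted magnitude term contributes.
loss_wrong_length = direction_weighted_velocity_loss([2.0, 0.0], [1.0, 0.0])
# Same length, orthogonal direction: the direction term dominates.
loss_wrong_direction = direction_weighted_velocity_loss([0.0, 1.0], [1.0, 0.0])
```

With these toy vectors the orthogonal prediction is penalized far more than the one that only overshoots in length, which is the qualitative behavior the abstract's wording suggests.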

[CV-118] Fair Federated Medical Image Classification Against Quality Shift via Inter-Client Progressive State Matching

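[Quick read]: This paper addresses a fairness concern in federated learning for medical imaging: inconsistent data quality across institutions, where a minority of clients provide lower-quality data, biases the model toward high-quality images. Existing fair federated learning methods mainly align a single 0th- or 1st-order state of convergence (such as training loss or sharpness), but the paper argues that such a single metric cannot fully capture the convergence characteristics and is therefore insufficient for guiding fair learning.

The key to the solution is a generalized framework, FedISM+, whose core idea is to pursue fairness by assessing multiple states of convergence, defined as sharpness or perturbed loss at different search distances. FedISM+ evolves the search distance over time to focus progressively on different states, and introduces two components in local training and global aggregation to ensure cross-client fairness for each state. This makes convergence equitable across all states and thereby improves fairness at test time. Experiments on the RSNA ICH and ISIC 2019 datasets show that FedISM+ outperforms existing state-of-the-art fair federated learning methods.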

Link: https://arxiv.org/abs/2503.09587
Authors: Nannan Wu, Zhuo Kuang, Zengqiang Yan, Ping Wang, Li Yu
Affiliations: School of Electronic Information and Communications, Huazhong University of Science and Technology, Wuhan, China; Department of Electrical Engineering and Computer Science, Lassonde School of Engineering, York University, Toronto, Canada
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Notes: Preprint

Abstract:Despite the potential of federated learning in medical applications, inconsistent imaging quality across institutions, stemming from lower-quality data from a minority of clients, biases federated models toward more common high-quality images. This raises significant fairness concerns. Existing fair federated learning methods have demonstrated some effectiveness in solving this problem by aligning a single 0th- or 1st-order state of convergence (e.g., training loss or sharpness). However, we argue in this work that fairness based on such a single state is still not an adequate surrogate for fairness during testing, as these single metrics fail to fully capture the convergence characteristics, making them suboptimal for guiding fair learning. To address this limitation, we develop a generalized framework. Specifically, we propose assessing convergence using multiple states, defined as sharpness or perturbed loss computed at varying search distances. Building on this comprehensive assessment, we propose promoting fairness for these states across clients to achieve our ultimate fairness objective. This is accomplished through the proposed method, FedISM+. In FedISM+, the search distance evolves over time, progressively focusing on different states. We then incorporate two components in local training and global aggregation to ensure cross-client fairness for each state. This gradually makes convergence equitable for all states, thereby improving fairness during testing. Our empirical evaluations, performed on the well-known RSNA ICH and ISIC 2019 datasets, demonstrate the superiority of FedISM+ over existing state-of-the-art methods for fair federated learning. The code is available at this https URL.
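
The claim that a single convergence state is an inadequate fairness surrogate can be illustrated with a toy example. The sketch below evaluates "perturbed loss minus loss" at several search distances for two hypothetical one-parameter clients; the client loss functions, the function names, and the endpoint-only search (which is exact only for convex 1-D losses) are assumptions made for illustration and are not FedISM+ itself.

```python
def perturbed_loss(loss_fn, weight, search_distance):
    """Worst-case loss within +/- search_distance of weight (1-D, endpoints only)."""
    return max(loss_fn(weight - search_distance), loss_fn(weight + search_distance))


def convergence_states(loss_fn, weight, search_distances):
    """Sharpness (perturbed loss minus loss) at each search distance."""
    base = loss_fn(weight)
    return [perturbed_loss(loss_fn, weight, d) - base for d in search_distances]


def flat_client_loss(w):
    """Hypothetical client whose loss surface is flat around w = 1."""
    return 0.5 * (w - 1.0) ** 2


def sharp_client_loss(w):
    """Hypothetical client whose loss surface is sharp around w = 1."""
    return 4.0 * (w - 1.0) ** 2


distances = [0.0, 0.5, 1.0]
flat_states = convergence_states(flat_client_loss, 1.0, distances)
sharp_states = convergence_states(sharp_client_loss, 1.0, distances)
```

Both clients have the same loss at `w = 1.0`, so a fairness criterion that only compares training loss treats them as equal, yet their states diverge as the search distance grows. That gap is the kind of information a multi-state assessment is meant to expose.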

[CV-119] FCaS: Fine-grained Cardiac Image Synthesis based on 3D Template Conditional Diffusion Model

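[Quick read]: This paper addresses the difficulty of reconstructing fine-grained anatomical structures in cardiac medical imaging. Because of the stringent topological consistency, fragile coronary features, and complex 3D morphological heterogeneity of cardiac images, existing methods have limited effectiveness at generating fine structures. The paper proposes the Fine-grained Cardiac image Synthesis (FCaS) framework, built on a 3D template conditional diffusion model. Its key innovations are: a Template-guided Conditional Diffusion Model (TCDM) that generates precise cardiac structures through bidirectional mechanisms, using template guidance to provide the fine-grained topological structure of the target image; a deformable Mask Generation Module (MGM) that mitigates the scarcity of high-quality, diverse reference masks; and a Confidence-aware Adaptive Learning (CAL) strategy that obtains confidence maps through Skip-Sampling Variance (SSV) estimation and uses them to rectify the pre-training of downstream segmentation tasks. Experiments show that images generated by FCaS achieve state-of-the-art topological consistency and visual quality and significantly benefit downstream tasks.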

Link: https://arxiv.org/abs/2503.09560
Authors: Jiahao Xia, Yutao Hu, Yaolei Qi, Zhenliang Li, Wenqi Shao, Junjun He, Ying Fu, Longjiang Zhang, Guanyu Yang
Affiliations: Key Laboratory of New Generation Artificial Intelligence Technology and Its Interdisciplinary Applications (Southeast University), Ministry of Education, China; Jiangsu Province Joint International Research Laboratory of Medical Information Processing, Southeast University, Nanjing, China; Univ Rennes, CHU Rennes, Inserm, LTSI - UMR 1099, F-35000 Rennes, France; Shanghai AI Laboratory; School of Computer Science and Technology, Beijing Institute of Technology, Beijing, China; Department of Radiology, Jinling Hospital, Affiliated Hospital of Medical School, Nanjing University, Nanjing, China
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Notes: 16 pages, 9 figures

Abstract:Solving medical imaging data scarcity through semantic image generation has attracted significant attention in recent years. However, existing methods primarily focus on generating whole-organ or large-tissue structures, showing limited effectiveness for organs with fine-grained structure. Due to stringent topological consistency, fragile coronary features, and complex 3D morphological heterogeneity in cardiac imaging, accurately reconstructing fine-grained anatomical details of the heart remains a great challenge. To address this problem, in this paper, we propose the Fine-grained Cardiac image Synthesis (FCaS) framework, established on a 3D template conditional diffusion model. FCaS achieves precise cardiac structure generation using a Template-guided Conditional Diffusion Model (TCDM) through bidirectional mechanisms, which provides the fine-grained topological structure information of the target image through the guidance of a template. Meanwhile, we design a deformable Mask Generation Module (MGM) to mitigate the scarcity of high-quality and diverse reference masks in the generation process. Furthermore, to alleviate the confusion caused by imprecise synthetic images, we propose a Confidence-aware Adaptive Learning (CAL) strategy to facilitate the pre-training of downstream segmentation tasks. Specifically, we introduce Skip-Sampling Variance (SSV) estimation to obtain confidence maps, which are subsequently employed to rectify the pre-training on downstream tasks. Experimental results demonstrate that images generated by FCaS achieve state-of-the-art performance in topological consistency and visual quality, which significantly facilitates the downstream tasks as well. Code will be released in the future.

[CV-120] The R2D2 Deep Neural Network Series for Scalable Non-Cartesian Magnetic Resonance Imaging

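[Quick read]: This paper addresses fast and scalable image reconstruction in Magnetic Resonance Imaging (MRI), in particular from highly accelerated non-Cartesian k-space acquisitions. Unrolled deep neural network (DNN) architectures provide a robust image formation approach through data-consistency layers, but embedding non-uniform fast Fourier transform (NUFFT) operators in a DNN becomes impractical to train at large scale, for example in multi-coil 2D MRI or higher-dimensional imaging. Plug-and-play approaches are not affected by this limitation, but their highly iterative nature makes reconstruction slow. To address this scalability challenge, the paper adopts the R2D2 paradigm, which forms the reconstruction as a series of iteratively estimated residual images and enables ultra-fast large-scale Fourier imaging. The approach can be interpreted as a learned version of the Matching Pursuit algorithm. The key is that R2D2 builds a series of sequentially trained DNN modules, each taking the previous iteration's data residual as input, which yields efficient, high-quality reconstruction; experiments on the fastMRI dataset indicate that it outperforms the unrolled R2D2-Net and a state-of-the-art diffusion-based approach.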

Link: https://arxiv.org/abs/2503.09559
Authors: Yiwei Chen, Amir Aghabiglou, Shijie Chen, Motahare Torki, Chao Tang, Ruud B. van Heeswijk, Yves Wiaux
Affiliations: Institute of Sensors, Signals and Systems, Heriot-Watt University, Edinburgh, United Kingdom; EPCC, University of Edinburgh, United Kingdom; Department of Diagnostic Imaging and Interventional Radiology, Lausanne University and University Hospital, Switzerland
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Signal Processing (eess.SP)
Notes: 13 pages, 10 figures

Abstract:We introduce the R2D2 Deep Neural Network (DNN) series paradigm for fast and scalable image reconstruction from highly-accelerated non-Cartesian k-space acquisitions in Magnetic Resonance Imaging (MRI). While unrolled DNN architectures provide a robust image formation approach via data-consistency layers, embedding non-uniform fast Fourier transform operators in a DNN can become impractical to train at large scale, e.g., in 2D MRI with a large number of coils, or for higher-dimensional imaging. Plug-and-play approaches that alternate a learned denoiser blind to the measurement setting with a data-consistency step are not affected by this limitation, but their highly iterative nature implies slow reconstruction. To address this scalability challenge, we leverage the R2D2 paradigm that was recently introduced to enable ultra-fast reconstruction for large-scale Fourier imaging in radio astronomy. R2D2’s reconstruction is formed as a series of residual images iteratively estimated as outputs of DNN modules taking the previous iteration’s data residual as input. The method can be interpreted as a learned version of the Matching Pursuit algorithm. A series of R2D2 DNN modules were sequentially trained in a supervised manner on the fastMRI dataset and validated for 2D multi-coil MRI in simulation and on real data, targeting highly under-sampled radial k-space sampling. Results suggest that a series with only a few DNNs achieves superior reconstruction quality over its unrolled incarnation R2D2-Net (whose training is also much less scalable), and over the state-of-the-art diffusion-based “Decomposed Diffusion Sampler” approach (also characterised by a slower reconstruction process).
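
The structure described above, a reconstruction built up as a series of residual images driven by the data residual of the previous iterate, can be sketched on a toy problem. In the example below the measurement operator is a hypothetical diagonal weighting rather than a non-uniform Fourier transform, and each trained DNN module is replaced by a fixed stand-in that simply returns the back-projected residual. It shows only the shape of the iteration, not the authors' networks or training procedure.

```python
def forward(image, weights):
    """Toy measurement operator: element-wise weighting (stand-in for a NUFFT)."""
    return [w * x for w, x in zip(weights, image)]


def adjoint(data, weights):
    """Adjoint of the toy operator (a real diagonal operator is self-adjoint)."""
    return [w * d for w, d in zip(weights, data)]


def stand_in_module(back_projected_residual, step=1.0):
    """Placeholder for a trained DNN module: returns a scaled residual image."""
    return [step * value for value in back_projected_residual]


def reconstruct_series(measurements, weights, num_modules):
    """Accumulate residual images, one per module, starting from a zero image."""
    image = [0.0] * len(measurements)
    for _ in range(num_modules):
        predicted = forward(image, weights)
        data_residual = [y - p for y, p in zip(measurements, predicted)]
        residual_image = stand_in_module(adjoint(data_residual, weights))
        image = [x + r for x, r in zip(image, residual_image)]
    return image


weights = [1.0, 0.5]
true_image = [2.0, 4.0]
measurements = forward(true_image, weights)
after_three = reconstruct_series(measurements, weights, num_modules=3)
after_thirty = reconstruct_series(measurements, weights, num_modules=30)
```

With the fixed stand-in, the loop reduces to a plain gradient-type iteration that needs many steps for the poorly measured component. The abstract's point is that learned modules aim to reach good quality with only a few terms in the series.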

[CV-121] Fast computation of the TGOSPA metric for multiple target tracking via unbalanced optimal transport

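[Quick read]: This paper addresses the high computational cost of evaluating the trajectory generalized optimal sub-pattern assignment (TGOSPA) metric in large multiple target tracking scenarios. The key to the solution is to cast the TGOSPA problem as an unbalanced multimarginal optimal transport problem, introduce an entropy regularization, and derive an iterative scheme for solving the Lagrangian dual of the regularized problem, which improves computational efficiency while still providing an adequate approximation of the metric.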

Link: https://arxiv.org/abs/2503.09449
Authors: Viktor Nevelius Wernholm, Alfred Wärnsäter, Axel Ringh
Affiliations: Unknown
Categories: Optimization and Control (math.OC); Computer Vision and Pattern Recognition (cs.CV); Systems and Control (eess.SY)
Notes: 6 pages

Abstract:In multiple target tracking, it is important to be able to evaluate the performance of different tracking algorithms. The trajectory generalized optimal sub-pattern assignment metric (TGOSPA) is a recently proposed metric for such evaluations. The TGOSPA metric is computed as the solution to an optimization problem, but for large tracking scenarios, solving this problem becomes computationally demanding. In this paper, we present an approximation algorithm for evaluating the TGOSPA metric, based on casting the TGOSPA problem as an unbalanced multimarginal optimal transport problem. Following recent advances in computational optimal transport, we introduce an entropy regularization and derive an iterative scheme for solving the Lagrangian dual of the regularized problem. Numerical results suggest that our proposed algorithm is more computationally efficient than the alternative of computing the exact metric using a linear programming solver, while still providing an adequate approximation of the metric.
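
For readers unfamiliar with entropy-regularized optimal transport, the sketch below shows the classical Sinkhorn scaling iteration in its simplest setting: two marginals with equal total mass. This is a deliberately simplified stand-in. The TGOSPA formulation in the paper is unbalanced and multimarginal, which this example does not implement, and the cost matrix, `epsilon`, and iteration count are hypothetical toy values.

```python
import math


def sinkhorn(cost, a, b, epsilon=0.1, iterations=200):
    """Entropy-regularized balanced OT between histograms a and b (Sinkhorn scaling)."""
    n, m = len(a), len(b)
    # Gibbs kernel derived from the cost matrix.
    kernel = [[math.exp(-cost[i][j] / epsilon) for j in range(m)] for i in range(n)]
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(iterations):
        # Alternately rescale rows and columns to match the two marginals.
        u = [a[i] / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]


# Two sources and two targets; matching index i to index i is free, crossing costs 1.
cost = [[0.0, 1.0], [1.0, 0.0]]
plan = sinkhorn(cost, a=[0.5, 0.5], b=[0.5, 0.5])
```

The returned `plan` places almost all mass on the zero-cost diagonal while its row and column sums match the prescribed marginals. Smaller values of `epsilon` give a sharper plan at the price of slower and less stable iterations.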

[CV-122] Mono2D: A Trainable Monogenic Layer for Robust Knee Cartilage Segmentation on Out-of-Distribution 2D Ultrasound Data

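[Quick read]: This paper addresses the domain shift that affects automated knee cartilage segmentation across point-of-care ultrasound devices, with the aim of improving the generalizability of segmentation algorithms for knee osteoarthritis management. Its key contribution is Mono2D, a monogenic layer that uses trainable bandpass quadrature filters to extract multi-scale, contrast- and intensity-invariant local phase features, which mitigates domain shift and improves generalization to out-of-distribution domains. Mono2D is placed before the first layer of a segmentation network and is optimized jointly with the network parameters. Experiments show that Mono2D outperforms other methods on single-source domain generalization, and results on a multi-site prostate MRI dataset further support its potential for domain generalization in medical imaging.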

Link: https://arxiv.org/abs/2503.09050
Authors: Alvin Kimbowa, Arjun Parmar, Maziar Badii, David Liu, Matthew Harkey, Ilker Hacihaliloglu
Affiliations: Unknown
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Notes:

Abstract:Automated knee cartilage segmentation using point-of-care ultrasound devices and deep-learning networks has the potential to enhance the management of knee osteoarthritis. However, segmentation algorithms often struggle with domain shifts caused by variations in ultrasound devices and acquisition parameters, limiting their generalizability. In this paper, we propose Mono2D, a monogenic layer that extracts multi-scale, contrast- and intensity-invariant local phase features using trainable bandpass quadrature filters. This layer mitigates domain shifts, improving generalization to out-of-distribution domains. Mono2D is integrated before the first layer of a segmentation network, and its parameters are jointly trained alongside the network’s parameters. We evaluated Mono2D on a multi-domain 2D ultrasound knee cartilage dataset for single-source domain generalization (SSDG). Our results demonstrate that Mono2D outperforms other SSDG methods in terms of Dice score and mean average surface distance. To further assess its generalizability, we evaluate Mono2D on a multi-site prostate MRI dataset, where it continues to outperform other SSDG methods, highlighting its potential to improve domain generalization in medical imaging. Nevertheless, further evaluation on diverse datasets is still necessary to assess its clinical utility.

[CV-123] Evaluation of state-of-the-art deep learning models in the segmentation of the heart ventricles in parasternal short-axis echocardiograms

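[Quick read]: This paper addresses segmentation of the heart ventricles in parasternal short-axis echocardiograms (PSAX-echo) and evaluates state-of-the-art deep-learning models on a small dataset. The focus is on whether specific-domain models (the Unet-ResNet family) improve segmentation accuracy with limited data, in support of computing metrics that can aid in diagnosing cardio-pulmonary diseases and other cardiomyopathies. By comparing specific-domain and general-domain models, the paper shows that specific-domain models such as Unet-Resnet101 perform better on a small, locally acquired dataset, with average values of 0.83, 4.93 pixels, and 106 pixel² for the Dice similarity coefficient (DSC), Hausdorff distance (HD), and difference in cross-sectional area (DCSA), respectively.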

Link: https://arxiv.org/abs/2503.08970
Authors: Julian Rene Cuellar Buritica, Vu Dinh, Manjula Burri, Julie Roelandts, James Wendling, Jon D. Klingensmith
Affiliations: Unknown
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Notes: 25 pages, 13 figures, 6 tables

Abstract:Previous studies on echocardiogram segmentation are focused on the left ventricle in parasternal long-axis views. In this study, deep-learning models were evaluated on the segmentation of the ventricles in parasternal short-axis echocardiograms (PSAX-echo). Segmentation of the ventricles in complementary echocardiogram views will allow the computation of important metrics with the potential to aid in diagnosing cardio-pulmonary diseases and other cardiomyopathies. Evaluating state-of-the-art models with small datasets can reveal if they improve performance on limited data. PSAX-echo were performed on 33 volunteer women. An experienced cardiologist identified end-diastole and end-systole frames from 387 scans, and expert observers manually traced the contours of the cardiac structures. Traced frames were pre-processed and used to create labels to train 2 specific-domain (Unet-Resnet101 and Unet-ResNet50), and 4 general-domain (3 Segment Anything (SAM) variants, and the Detectron2) deep-learning models. The performance of the models was evaluated using the Dice similarity coefficient (DSC), Hausdorff distance (HD), and difference in cross-sectional area (DCSA). The Unet-Resnet101 model provided superior performance in the segmentation of the ventricles with 0.83, 4.93 pixels, and 106 pixel² on average for DSC, HD, and DCSA respectively. A fine-tuned MedSAM model provided a performance of 0.82, 6.66 pixels, and 1252 pixel², while the Detectron2 model provided 0.78, 2.12 pixels, and 116 pixel² for the same metrics respectively. Deep-learning models are suitable for the segmentation of the left and right ventricles in PSAX-echo. This study demonstrated that specific-domain trained models such as Unet-ResNet provide higher accuracy for echo segmentation than general-domain segmentation models when working with small and locally acquired datasets.
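
The three evaluation metrics named above have standard definitions that can be sketched for small binary masks. The code below is a minimal illustration, not the study's evaluation pipeline: the function names and the toy 3x3 masks are hypothetical, and the Hausdorff distance is computed over all foreground pixels, whereas an implementation based on traced contours may give different values.

```python
import math


def foreground_points(mask):
    """List (row, col) coordinates of non-zero pixels in a nested-list mask."""
    return [(r, c) for r, row in enumerate(mask) for c, v in enumerate(row) if v]


def dice_coefficient(mask_a, mask_b):
    """Dice similarity coefficient: 2 * |A and B| / (|A| + |B|)."""
    points_a = set(foreground_points(mask_a))
    points_b = set(foreground_points(mask_b))
    return 2.0 * len(points_a & points_b) / (len(points_a) + len(points_b))


def hausdorff_distance(mask_a, mask_b):
    """Symmetric Hausdorff distance in pixels (both masks must be non-empty)."""
    points_a, points_b = foreground_points(mask_a), foreground_points(mask_b)

    def directed(source, target):
        return max(min(math.dist(p, q) for q in target) for p in source)

    return max(directed(points_a, points_b), directed(points_b, points_a))


def area_difference(mask_a, mask_b):
    """Absolute difference in cross-sectional area, in squared pixels."""
    return abs(len(foreground_points(mask_a)) - len(foreground_points(mask_b)))


predicted = [[1, 1, 0], [1, 1, 0], [0, 0, 0]]
reference = [[1, 1, 1], [1, 1, 1], [0, 0, 0]]
dsc = dice_coefficient(predicted, reference)
hd = hausdorff_distance(predicted, reference)
dcsa = area_difference(predicted, reference)
```

For these toy masks the prediction misses one column of the reference, which lowers the Dice score and produces a Hausdorff distance of one pixel. Higher DSC is better, while lower HD and DCSA are better.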

[CV-124] On the status of current quantum machine learning software

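[Quick read]: This paper studies the feasibility and challenges of running hybrid quantum-classical models on real, publicly available quantum devices, using satellite image segmentation as the example task, and analyzes the implementation difficulty, the cost, and the change in model quality. Its focus is on how software limitations, beyond hardware, affect quantum computing capabilities, and on how to overcome these limitations for practical use. The key to the solution is designing algorithms and architectures suited to current noisy intermediate-scale quantum (NISQ) devices while evaluating their performance and resource overhead in real environments.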

Link: https://arxiv.org/abs/2503.08962
Authors: Manish K. Gupta, Tomasz Rybotycki, Piotr Gawron
Affiliations: Unknown
Categories: Quantum Physics (quant-ph); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Notes: 8 pages, 1 figure, 1 table

Abstract:Recent advancements in the implementation of noisy intermediate-scale quantum (NISQ) devices allow us to study their application to real-life computational problems. However, hardware challenges are not the only ones that hinder our quantum computation capabilities. Software limitations are the other, less explored side of this coin. Using satellite image segmentation as an example task, we investigated how difficult it is to run a hybrid quantum-classical model on a real, publicly available quantum device. We also analyzed the costs of such an endeavor and the change in the quality of the model.

[CV-125] Acoustic Neural 3D Reconstruction Under Pose Drift

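[Quick read]: This paper addresses 3D reconstruction by optimizing neural implicit surfaces from acoustic images collected with drifting sensor poses. Current state-of-the-art 3D acoustic modeling algorithms depend heavily on accurate pose estimation, and small errors in sensor pose can cause severe reconstruction artifacts. The key to the solution is an algorithm that jointly optimizes the neural scene representation and the sonar poses by parameterizing the six-degree-of-freedom (6DoF) poses as learnable parameters and backpropagating gradients through the neural renderer and the implicit representation. Validation on real and simulated datasets shows that the method produces high-fidelity 3D reconstructions even under significant pose drift.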

Link: https://arxiv.org/abs/2503.08930
Authors: Tianxiang Lin, Mohamad Qadri, Kevin Zhang, Adithya Pediredla, Christopher A. Metzler, Michael Kaess
Affiliations: Robotics Institute, Carnegie Mellon University; Department of Computer Science, University of Maryland; Computer Science Department, Dartmouth College
Categories: Signal Processing (eess.SP); Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: 8 pages, 8 figures. This paper is under review

Abstract:We consider the problem of optimizing neural implicit surfaces for 3D reconstruction using acoustic images collected with drifting sensor poses. The accuracy of current state-of-the-art 3D acoustic modeling algorithms is highly dependent on accurate pose estimation; small errors in sensor pose can lead to severe reconstruction artifacts. In this paper, we propose an algorithm that jointly optimizes the neural scene representation and sonar poses. Our algorithm does so by parameterizing the 6DoF poses as learnable parameters and backpropagating gradients through the neural renderer and implicit representation. We validated our algorithm on both real and simulated datasets. It produces high-fidelity 3D reconstructions even under significant pose drift.

[CV-126] Reconstruct Anything Model: a lightweight foundation model for computational imaging

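[Quick read]: This paper targets two weaknesses of existing learning-based methods for imaging inverse problems. Iterative algorithms such as plug-and-play and diffusion methods rely on pretrained denoisers but are computationally costly and often give suboptimal reconstructions, while unrolled architectures are trained end-to-end but are usually specific to a single inverse problem and expensive to train. The paper proposes a non-iterative, lightweight architecture that incorporates knowledge of the forward operator (acquisition physics and noise parameters) without unrolling. The model handles a wide range of inverse problems, including deblurring, magnetic resonance imaging, computed tomography, inpainting, and super-resolution, and it can be adapted to unseen inverse problems or datasets in a self-supervised way with a few fine-tuning steps (on as few as a handful of images) and no ground-truth references. Experiments in settings ranging from medical imaging to low-photon imaging and microscopy show performance that matches or exceeds the state of the art.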

Link: https://arxiv.org/abs/2503.08915
Authors: Matthieu Terris, Samuel Hurault, Maxime Song, Julian Tachella
Affiliations: Université Paris-Saclay, Inria, CEA; ENS Paris, PSL, CNRS; CNRS UAR 851, Université Paris-Saclay; ENSL, CNRS UMR 5672
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Most existing learning-based methods for solving imaging inverse problems can be roughly divided into two classes: iterative algorithms, such as plug-and-play and diffusion methods, that leverage pretrained denoisers, and unrolled architectures that are trained end-to-end for specific imaging problems. Iterative methods in the first class are computationally costly and often provide suboptimal reconstruction performance, whereas unrolled architectures are generally specific to a single inverse problem and require expensive training. In this work, we propose a novel non-iterative, lightweight architecture that incorporates knowledge about the forward operator (acquisition physics and noise parameters) without relying on unrolling. Our model is trained to solve a wide range of inverse problems beyond denoising, including deblurring, magnetic resonance imaging, computed tomography, inpainting, and super-resolution. The proposed model can be easily adapted to unseen inverse problems or datasets with a few fine-tuning steps (up to a few images) in a self-supervised way, without ground-truth references. Throughout a series of experiments, we demonstrate state-of-the-art performance from medical imaging to low-photon imaging and microscopy.

[CV-127] Residual Learning and Filtering Networks for End-to-End Lossless Video Compression

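[Quick read]: This paper addresses inaccurate motion estimation and inadequate motion compensation structures in learning-based video compression, which lead to compression errors and a suboptimal rate-distortion trade-off. It proposes an end-to-end video compression method whose key components are an autoencoder-type network with a residual skip connection that compresses motion information efficiently, and motion vector and residual frame filtering networks that mitigate compression errors. The motion compensation architecture is further improved with nonlinear transforms such as the Parametric Rectified Linear Unit (PReLU), and a buffer is introduced to fine-tune previous reference frames and improve the quality of reconstructed frames. These modules are trained with a carefully designed loss function to optimize overall video quality. Experiments show competitive performance on several datasets.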

Link: https://arxiv.org/abs/2503.08819
Authors: Md baharul Islam, Afsana Ahsan Jeny
Affiliations: Bahcesehir University, Istanbul, Turkey; Florida Gulf Coast University, Fort Myers FL 33965, United States; Bahcesehir University, Istanbul, Turkey; University of Connecticut, Storrs CT 06269, United States
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Existing learning-based video compression methods still face challenges related to inaccurate motion estimates and inadequate motion compensation structures. These issues result in compression errors and a suboptimal rate-distortion trade-off. To address these challenges, this work presents an end-to-end video compression method that incorporates several key operations. Specifically, we propose an autoencoder-type network with a residual skip connection to efficiently compress motion information. Additionally, we design motion vector and residual frame filtering networks to mitigate compression errors in the video compression system. To improve the effectiveness of the motion compensation network, we utilize powerful nonlinear transforms, such as the Parametric Rectified Linear Unit (PReLU), to delve deeper into the motion compensation architecture. Furthermore, a buffer is introduced to fine-tune the previous reference frames, thereby enhancing the reconstructed frame quality. These modules are combined with a carefully designed loss function that assesses the trade-off and enhances the overall video quality of the decoded output. Experimental results showcase the competitive performance of our method on various datasets, including HEVC (sequences B, C, and D), UVG, VTL, and MCL-JCV. The proposed approach tackles the challenges of accurate motion estimation and motion compensation in video compression, and the results highlight its competitive performance compared to existing methods.

[CV-128] Deformable Registration Framework for Augmented Reality-based Surgical Guidance in Head and Neck Tumor Resection

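[Quick read]: This paper addresses the difficulty of localizing positive margins in surgery for head and neck squamous cell carcinoma (HNSCC), where specimen shrinkage and complex 3D anatomy make relocation hard. Positive-margin assessment based on frozen section analysis (FSA) suffers from large target registration error (TRE) when margin information from the specimen is mapped back onto the resection site. The key contribution is a deformable registration framework that uses both the pre-resection upper surface and the post-resection site of the specimen to bring thickness information into the registration and adapt better to thicker specimens. For tongue specimens, which have complex 3D anatomy and the highest clinical significance, the method improves TRE by up to 33% over prior deformable registration. The authors also integrate the framework with an augmented reality (AR) based auto-alignment system for intraoperative visualization, which in a pilot study reduced the surgeons' average target relocation error.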

Link: https://arxiv.org/abs/2503.08802
Authors: Qingyun Yang, Fangjie Li, Jiayi Xu, Zixuan Liu, Sindhura Sridhar, Whitney Jin, Jennifer Du, Jon Heiselman, Michael Miga, Michael Topf, Jie Ying Wu
Affiliations: unknown
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Head and neck squamous cell carcinoma (HNSCC) has one of the highest rates of recurrence cases among solid malignancies. Recurrence rates can be reduced by improving positive margins localization. Frozen section analysis (FSA) of resected specimens is the gold standard for intraoperative margin assessment. However, because of the complex 3D anatomy and the significant shrinkage of resected specimens, accurate margin relocation from specimen back onto the resection site based on FSA results remains challenging. We propose a novel deformable registration framework that uses both the pre-resection upper surface and the post-resection site of the specimen to incorporate thickness information into the registration process. The proposed method significantly improves target registration error (TRE), demonstrating enhanced adaptability to thicker specimens. In tongue specimens, the proposed framework improved TRE by up to 33% as compared to prior deformable registration. Notably, tongue specimens exhibit complex 3D anatomies and hold the highest clinical significance compared to other head and neck specimens from the buccal and skin. We analyzed distinct deformation behaviors in different specimens, highlighting the need for tailored deformation strategies. To further aid intraoperative visualization, we also integrated this framework with an augmented reality-based auto-alignment system. The combined system can accurately and automatically overlay the deformed 3D specimen mesh with positive margin annotation onto the resection site. With a pilot study of the AR guided framework involving two surgeons, the integrated system improved the surgeons’ average target relocation error from 9.8 cm to 4.8 cm.
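
To make the error measures in this abstract concrete, the sketch below computes a target registration error as the mean Euclidean distance between corresponding 3D points, plus a relative reduction between two error values. The function names are hypothetical, and the mean-distance definition of TRE is a common convention assumed here, not necessarily the exact protocol used in the paper.

```python
import math


def target_registration_error(registered_points, true_points):
    """Mean Euclidean distance between corresponding 3D target points."""
    distances = [math.dist(p, q) for p, q in zip(registered_points, true_points)]
    return sum(distances) / len(distances)


def relative_reduction(before, after):
    """Fractional reduction of an error value, e.g. 0.25 means 25% lower."""
    return (before - after) / before


# Toy points (arbitrary units): one target is off by 5, the other is exact.
registered = [(0.0, 0.0, 0.0), (1.0, 2.0, 2.0)]
truth = [(3.0, 4.0, 0.0), (1.0, 2.0, 2.0)]
print(target_registration_error(registered, truth))

# Relative reduction implied by the relocation errors quoted in the abstract.
print(relative_reduction(9.8, 4.8))
```

Applied to the two relocation errors quoted in the abstract, the second function gives a reduction of roughly one half.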

[CV-129] QUIET-SR: Quantum Image Enhancement Transformer for Single Image Super-Resolution

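[Quick read]: This paper addresses two obstacles in single-image super-resolution (SISR): the high computational cost of classical models caused by their large parameter counts, and the scalability challenges of quantum algorithms for image processing. It proposes the Quantum Image Enhancement Transformer for Super-Resolution (QUIET-SR), a hybrid framework that extends the Swin transformer architecture with a novel shifted quantum window attention mechanism built on variational quantum neural networks. The mechanism captures complex residual mappings between low-resolution and high-resolution images and uses quantum attention to improve feature extraction and image restoration while requiring only a small number of qubits, which makes it suitable for the Noisy Intermediate-Scale Quantum (NISQ) era. On MNIST, FashionMNIST, and the MedMNIST datasets, QUIET-SR achieves peak signal-to-noise ratio (PSNR) and structural similarity index (SSIM) scores close to state-of-the-art methods with fewer parameters. The results highlight the potential of scalable variational quantum machine learning models for SISR and mark a step toward practical quantum-enhanced image super-resolution.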

Link: https://arxiv.org/abs/2503.08759
Authors: Siddhant Dutta, Nouhaila Innan, Khadijeh Najafi, Sadok Ben Yahia, Muhammad Shafique
Affiliations: SVKM’s Dwarkadas J. Sanghvi College of Engineering, India; eBRAIN Lab, Division of Engineering, New York University Abu Dhabi (NYUAD), UAE; Center for Quantum and Topological Systems (CQTS), NYUAD Research Institute, NYUAD, UAE; The Maersk Mc-Kinney Moller Institute, University of Southern Denmark, Denmark; Department of Software Science, Tallinn University of Technology, Estonia
Categories: Quantum Physics (quant-ph); Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments: 10 figures, 3 pages

Abstract:Recent advancements in Single-Image Super-Resolution (SISR) using deep learning have significantly improved image restoration quality. However, the high computational cost of processing high-resolution images due to the large number of parameters in classical models, along with the scalability challenges of quantum algorithms for image processing, remains a major obstacle. In this paper, we propose the Quantum Image Enhancement Transformer for Super-Resolution (QUIET-SR), a hybrid framework that extends the Swin transformer architecture with a novel shifted quantum window attention mechanism, built upon variational quantum neural networks. QUIET-SR effectively captures complex residual mappings between low-resolution and high-resolution images, leveraging quantum attention mechanisms to enhance feature extraction and image restoration while requiring a minimal number of qubits, making it suitable for the Noisy Intermediate-Scale Quantum (NISQ) era. We evaluate our framework in MNIST (30.24 PSNR, 0.989 SSIM), FashionMNIST (29.76 PSNR, 0.976 SSIM) and the MedMNIST dataset collection, demonstrating that QUIET-SR achieves PSNR and SSIM scores comparable to state-of-the-art methods while using fewer parameters. These findings highlight the potential of scalable variational quantum machine learning models for SISR, marking a step toward practical quantum-enhanced image super-resolution.
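
For readers unfamiliar with the PSNR figures quoted above, the following sketch shows the standard peak signal-to-noise ratio formula, 10·log10(MAX² / MSE). It is a generic illustration and not the QUIET-SR evaluation code: the function name and the assumption that pixel values lie in [0, 1] are choices made here.

```python
import math


def psnr(reference, restored, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two equally sized 2D images."""
    squared_errors = [
        (r - s) ** 2
        for ref_row, res_row in zip(reference, restored)
        for r, s in zip(ref_row, res_row)
    ]
    mse = sum(squared_errors) / len(squared_errors)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


reference_image = [[0.0, 1.0]]
restored_image = [[0.1, 0.9]]
print(psnr(reference_image, restored_image))
```

Because the scale is logarithmic, each additional 10 dB corresponds to a tenfold drop in mean squared error.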

[CV-130] Frequency selection for the diagnostic characterization of human brain tumours

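[Quick read]: This paper addresses the extraction of effective features from non-invasive measurements for accurate classification in brain tumour diagnosis. The key is to combine an ad hoc spectral frequency selection procedure with nonlinear classification so that meaningful features can be extracted from the high-dimensional metabolic information provided by magnetic resonance spectroscopy, with the aim of improving diagnostic accuracy.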

Link: https://arxiv.org/abs/2503.08756
Authors: Carlos Arizmendi, Alfredo Vellido, Enrique Romero
Affiliations: unknown
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Applications (stat.AP); Computation (stat.CO)
Comments:

Abstract:The diagnosis of brain tumours is an extremely sensitive and complex clinical task that must rely upon information gathered through non-invasive techniques. One such technique is magnetic resonance, in the modalities of imaging or spectroscopy. The latter provides plenty of metabolic information about the tumour tissue, but its high dimensionality makes resorting to pattern recognition techniques advisable. In this brief paper, an international database of brain tumours is analyzed resorting to an ad hoc spectral frequency selection procedure combined with nonlinear classification.

[CV-131] Neural Network for Blind Unmixing: a novel MatrixConv Unmixing (MCU) Approach

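[Quick read]: This paper addresses two main challenges in hyperspectral image (HSI) unmixing: (1) existing deep-learning methods often produce results that lack physical meaning, such as spectral signatures corresponding to unknown or non-existent materials; and (2) general-purpose convolutional neural network (CNN) structures are not explicitly tailored to unmixing. The paper proposes a network structure that combines double deep image prior (DIP) techniques with algorithm unrolling. It first formulates MatrixConv Unmixing (MCU) approaches for endmember and abundance estimation, each solvable with an iterative solver, and then unrolls these solvers into two sub-networks, the endmember estimation DIP (UEDIP) and the abundance estimation DIP (UADIP), which output the endmember and abundance estimates. The overall network is assembled from these two sub-networks. To obtain meaningful unmixing results, the paper designs a composite loss function and adds an explicit regularizer for each of the endmember and abundance estimates. The method is validated on both synthetic and real datasets.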

Link: https://arxiv.org/abs/2503.08745
Authors: Chao Zhou, Wei Pu, Miguel Rodrigues
Affiliations: Department of Electronic and Electrical Engineering, University College London; Department of Information and Communication Engineering, University of Electronic Science and Technology of China
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract:Hyperspectral image (HSI) unmixing is a challenging research problem that tries to identify the constituent components, known as endmembers, and their corresponding proportions, known as abundances, in the scene by analysing images captured by hyperspectral cameras. Recently, many deep learning based unmixing approaches have been proposed with the surge of machine learning techniques, especially convolutional neural networks (CNN). However, these methods face two notable challenges: 1. They frequently yield results lacking physical significance, such as signatures corresponding to unknown or non-existent materials. 2. CNNs, as general-purpose network structures, are not explicitly tailored for unmixing tasks. In response to these concerns, our work draws inspiration from double deep image prior (DIP) techniques and algorithm unrolling, presenting a novel network structure that effectively addresses both issues. Specifically, we first propose a MatrixConv Unmixing (MCU) approach for endmember and abundance estimation, respectively, which can be solved via certain iterative solvers. We then unroll these solvers to build two sub-networks, endmember estimation DIP (UEDIP) and abundance estimation DIP (UADIP), to generate the estimation of endmember and abundance, respectively. The overall network is constructed by assembling these two sub-networks. In order to generate meaningful unmixing results, we also propose a composite loss function. To further improve the unmixing quality, we also add explicitly a regularizer for endmember and abundance estimation, respectively. The proposed methods are tested for effectiveness on both synthetic and real datasets.

[CV-132] A Bi-channel Aided Stitching of Atomic Force Microscopy Images

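[Quick read]: This paper addresses the failure of conventional image stitching tools on feature-sparse microscopy images and their inability to handle all image transformations. It proposes a bi-channel aided feature-based image stitching method and validates it on biofilm images generated by atomic force microscopy (AFM). The key is to use the amplitude channel of the AFM data to maximize the number of matching features and to estimate the positions of the original topographical images, which improves stitching over the traditional approach. The study also finds that differentiating the topographical images along the x-axis yields feature information similar to the amplitude channel image, so the method still applies when amplitude images are unavailable. Beyond AFM, the approach could be extended to optical microscopy with brightfield and fluorescence channels, helping experimentalists avoid erroneous analysis and discovery caused by incorrect stitching.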

Link: https://arxiv.org/abs/2503.08735
Authors: Huanhuan Zhao, Ruben Millan Solsona, Marti Checa, Spenser R. Brown, Jennifer L. Morrell-Falvey, Liam Collins, Arpan Biswas
Affiliations: unknown
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: The manuscript has 21 pages with 8 figures in main-text and 2 figures in Supplementary materials

Abstract:Microscopy is an essential tool in scientific research, enabling the visualization of structures at micro- and nanoscale resolutions. However, the field of microscopy often encounters limitations in field-of-view (FOV), restricting the amount of sample that can be imaged in a single capture. To overcome this limitation, image stitching techniques have been developed to seamlessly merge multiple overlapping images into a single, high-resolution composite. The images collected from microscope need to be optimally stitched before accurate physical information can be extracted from post analysis. However, the existing stitching tools either struggle to stitch images together when the microscopy images are feature sparse or cannot address all the transformations of images. To address these issues, we propose a bi-channel aided feature-based image stitching method and demonstrate its use on AFM generated biofilm images. The topographical channel image of AFM data captures the morphological details of the sample, and a stitched topographical image is desired for researchers. We utilize the amplitude channel of AFM data to maximize the matching features and to estimate the position of the original topographical images and show that the proposed bi-channel aided stitching method outperforms the traditional stitching approach. Furthermore, we found that the differentiation of the topographical images along the x-axis provides similar feature information to the amplitude channel image, which generalizes our approach when the amplitude images are not available. Here we demonstrated the application on AFM, but similar approaches could be employed of optical microscopy with brightfield and fluorescence channels. We believe this proposed workflow will benefit the experimentalist to avoid erroneous analysis and discovery due to incorrect stitching.
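
The abstract notes that differentiating the topographical image along the x-axis gives feature information similar to the amplitude channel. A minimal sketch of that preprocessing step is shown below, using a forward difference over a height map stored as nested lists. The function name and the choice of forward differences are assumptions made for illustration; the authors' actual derivative filter is not specified in the abstract.

```python
def x_derivative(height_map):
    """Forward-difference derivative along the x-axis (columns) of a 2D height map.

    Each output row has one fewer column than the input row.
    """
    return [
        [row[j + 1] - row[j] for j in range(len(row) - 1)]
        for row in height_map
    ]


# Toy topography: a flat region followed by a step edge in the second row.
topography = [[0, 1, 3],
              [2, 2, 5]]
print(x_derivative(topography))
```

Edges in the height map become large values in the derivative image, which is the kind of structure a feature detector can match across overlapping tiles.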

[CV-133] QuantU-Net: Efficient Wearable Medical Imaging Using Bitwidth as a Trainable Parameter

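[Quick read]: This paper addresses the high computational and memory requirements that make U-Net hard to deploy for medical image segmentation (tumor segmentation in particular) on resource-constrained devices such as wearable medical systems. It proposes QuantU-Net, a U-Net quantized through quantization-aware training with the Brevitas library, which reduces precision to an average of 4.24 bits while keeping a validation accuracy of 94.25%, only 1.89% below the floating-point baseline. A composite loss function combining Binary Cross-Entropy (BCE) loss, Dice loss, and a custom bitwidth loss shrinks the model by roughly 8x and cuts the search for the best trade-off between bitwidth and accuracy from a hypothetical 6^23 training sessions to a single training session. The use of integer arithmetic points to deployment on FPGAs and other dedicated AI accelerator hardware, supporting real-time, low-power diagnostics on resource-constrained devices.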

Link: https://arxiv.org/abs/2503.08719
Authors: Christiaan Boerkamp, Akhil John Thomas
Affiliations: unknown
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract:Medical image segmentation, particularly tumor segmentation, is a critical task in medical imaging, with U-Net being a widely adopted convolutional neural network (CNN) architecture for this purpose. However, U-Net’s high computational and memory requirements pose challenges for deployment on resource-constrained devices such as wearable medical systems. This paper addresses these challenges by introducing QuantU-Net, a quantized version of U-Net optimized for efficient deployment on low-power devices like Field-Programmable Gate Arrays (FPGAs). Using Brevitas, a PyTorch library for quantization-aware training, we quantize the U-Net model, reducing its precision to an average of 4.24 bits while maintaining a validation accuracy of 94.25%, only 1.89% lower than the floating-point baseline. The quantized model achieves an approximately 8x reduction in size, making it suitable for real-time applications in wearable medical devices. We employ a custom loss function that combines Binary Cross-Entropy (BCE) Loss, Dice Loss, and a bitwidth loss function to optimize both segmentation accuracy and the size of the model. Using this custom loss function, we have significantly reduced the training time required to find an optimal combination of bitwidth and accuracy from a hypothetical 6^23 number of training sessions to a single training session. The model’s usage of integer arithmetic highlights its potential for deployment on FPGAs and other designated AI accelerator hardware. This work advances the field of medical image segmentation by enabling the deployment of deep learning models on resource-constrained devices, paving the way for real-time, low-power diagnostic solutions in wearable healthcare applications.
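
The composite loss described above can be sketched in a few lines. The version below is only an illustration under stated assumptions: the three terms are simply summed, the bitwidth term is taken to be the mean per-layer bitwidth normalized by 32 bits, and `bit_weight` is a hypothetical weighting factor. None of these details are given in the abstract, and the real model is trained with Brevitas in PyTorch rather than plain Python. The last lines check the quoted figures: 32 bits against 4.24 bits, and a 6^23 search space, read here as 6 candidate bitwidths for each of 23 quantized layers (an assumed interpretation).

```python
import math


def bce_loss(probs, targets, eps=1e-7):
    """Mean binary cross-entropy over flattened pixel probabilities."""
    total = 0.0
    for p, t in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -(t * math.log(p) + (1 - t) * math.log(1.0 - p))
    return total / len(probs)


def dice_loss(probs, targets, eps=1e-7):
    """Soft Dice loss: 1 minus the soft Dice coefficient."""
    overlap = sum(p * t for p, t in zip(probs, targets))
    return 1.0 - (2.0 * overlap + eps) / (sum(probs) + sum(targets) + eps)


def bitwidth_penalty(bitwidths, full_precision_bits=32):
    """Mean layer bitwidth as a fraction of full precision (assumed form)."""
    return sum(bitwidths) / (len(bitwidths) * full_precision_bits)


def composite_loss(probs, targets, bitwidths, bit_weight=0.1):
    """Segmentation terms plus a weighted bitwidth term (hypothetical weights)."""
    return (bce_loss(probs, targets)
            + dice_loss(probs, targets)
            + bit_weight * bitwidth_penalty(bitwidths))


probs = [0.9, 0.2, 0.8, 0.1]
targets = [1, 0, 1, 0]
print(composite_loss(probs, targets, [8] * 23))
print(composite_loss(probs, targets, [4] * 23))

size_reduction = 32 / 4.24   # float32 baseline vs. average quantized bitwidth
search_space = 6 ** 23       # exhaustive per-layer bitwidth combinations
print(size_reduction, search_space)
```

With identical predictions, the lower-bitwidth configuration gets the lower loss, which is how a single training run can trade accuracy against model size. The ratio 32 / 4.24 comes out near 7.5, consistent with the abstract's "approximately 8x".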

[CV-134] SHAP-Integrated Convolutional Diagnostic Networks for Feature-Selective Medical Analysis

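[Quick read]: This paper addresses the challenge posed by data privacy regulations that restrict access to medical datasets. It proposes the SHAP-integrated convolutional diagnostic network (SICDN), an interpretable feature selection method designed for limited datasets. SICDN uses SHAP values for feature selection and adds a historical weighted moving average technique to strengthen that selection. On pneumonia and breast cancer classification tasks it achieves over 97% accuracy and outperforms four popular convolutional neural network (CNN) models.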

Link: https://arxiv.org/abs/2503.08712
Authors: Yan Hu, Ahmad Chaddad
Affiliations: Laboratory for AIPM, School of Artificial Intelligence, Guilin University of Electronic Technology, China; The Laboratory for Imagery, Vision and Artificial Intelligence, École de Technologie Supérieure, Canada
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 5 pages

Abstract:This study introduces the SHAP-integrated convolutional diagnostic network (SICDN), an interpretable feature selection method designed for limited datasets, to address the challenge posed by data privacy regulations that restrict access to medical datasets. The SICDN model was tested on classification tasks using pneumonia and breast cancer datasets, demonstrating over 97% accuracy and surpassing four popular CNN models. We also integrated a historical weighted moving average technique to enhance feature selection. The SICDN shows potential in medical image prediction, with the code available on this https URL.

[CV-135] Large model enhanced computational ghost imaging

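[Quick read]: This paper aims to improve reconstruction quality in ghost imaging (GI), including imaging in complex environments. GI reconstructs a 2D image through the high-order correlation of 1D bucket detection signals with 2D light field information, but reconstruction quality is sensitive to noise, and recovery of low-resolution or severely degraded images is limited. The paper proposes the ghost imaging large model (GILM), a large imaging model with 1.4 billion parameters that incorporates the physical principles of GI. GILM uses a skip connection mechanism to mitigate gradient explosion in deep architectures and a multi-head attention mechanism to learn spatial dependencies across pixel points, which lets it capture intricate correlations among single-pixel measurements of the object and improve image recovery from the acquired data. Experiments on simulated objects, free-space imaging, and long-distance underwater imaging support its effectiveness and robustness.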

Link: https://arxiv.org/abs/2503.08710
Authors: Yifan Chen, Hongjun An, Zhe Sun, Tong Tian, Mingliang Chen, Christian Spielmann, Xuelong Li
Affiliations: Northwestern Polytechnical University
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Ghost imaging (GI) achieves 2D image reconstruction through high-order correlation of 1D bucket signals and 2D light field information, particularly demonstrating enhanced detection sensitivity and high-quality image reconstruction via efficient photon collection in scattering media. Recent investigations have established that deep learning (DL) can substantially enhance the ghost imaging reconstruction quality. Furthermore, with the emergence of large models like SDXL, GPT-4, etc., the constraints of conventional DL in parameters and architecture have been transcended, enabling models to comprehensively explore relationships among all distinct positions within feature sequences. This paradigm shift has significantly advanced the capability of DL in restoring severely degraded and low-resolution imagery, making it particularly advantageous for noise-robust image reconstruction in GI applications. In this paper, we propose the first large imaging model with 1.4 billion parameters that incorporates the physical principles of GI (GILM). The proposed GILM implements a skip connection mechanism to mitigate gradient explosion challenges inherent in deep architectures, ensuring sufficient parametric capacity to capture intricate correlations among object single-pixel measurements. Moreover, GILM leverages multi-head attention mechanism to learn spatial dependencies across pixel points during image reconstruction, facilitating the extraction of comprehensive object information for subsequent reconstruction. We validated the effectiveness of GILM through a series of experiments, including simulated object imaging, imaging objects in free space, and imaging object located 52 meters away in underwater environment. The experimental results show that GILM effectively analyzes the fluctuation trends of the collected signals, thereby optimizing the recovery of the object’s image from the acquired data.
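
The correlation principle behind ghost imaging can be shown with a small simulation. The sketch below is the classical covariance-based reconstruction, not GILM: random illumination patterns are projected onto a flattened 4x4 binary object, a single-pixel "bucket" value is recorded per pattern, and each pixel is estimated from the covariance between the bucket signal and that pixel's illumination. The function names, the object, the pattern count, and the noise-free bucket model are all assumptions chosen for illustration.

```python
import random


def simulate_bucket_signals(obj, patterns):
    """Total light collected by a single-pixel detector for each pattern."""
    return [sum(p * o for p, o in zip(pattern, obj)) for pattern in patterns]


def correlation_reconstruction(patterns, buckets):
    """Estimate each pixel as the covariance of bucket signal and illumination."""
    n = len(patterns)
    mean_bucket = sum(buckets) / n
    image = []
    for i in range(len(patterns[0])):
        mean_pixel = sum(pattern[i] for pattern in patterns) / n
        covariance = sum(
            (b - mean_bucket) * (pattern[i] - mean_pixel)
            for pattern, b in zip(patterns, buckets)
        ) / n
        image.append(covariance)
    return image


rng = random.Random(0)  # fixed seed so the demo is repeatable
obj = [0, 1, 0, 1,
       1, 0, 1, 0,
       0, 1, 0, 1,
       1, 0, 1, 0]      # flattened 4x4 checkerboard object
patterns = [[rng.random() for _ in obj] for _ in range(4000)]
buckets = simulate_bucket_signals(obj, patterns)
estimate = correlation_reconstruction(patterns, buckets)

bright = [e for e, o in zip(estimate, obj) if o == 1]
dark = [e for e, o in zip(estimate, obj) if o == 0]
print(min(bright), max(dark))
```

Transmitting pixels end up with clearly larger covariance values than blocked ones, so thresholding the estimate recovers the checkerboard. Learned models such as the one described in the abstract aim to do better than this baseline when measurements are few or noisy.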

Artificial Intelligence

[AI-0] Auspex: Building Threat Modeling Tradecraft into an Artificial Intelligence-based Copilot

Link: https://arxiv.org/abs/2503.09586
Authors: Andrew Crossman, Andrew R. Plummer, Chandra Sekharudu, Deepak Warrier, Mohammad Yekrangian
Categories: Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Comments:

Abstract:We present Auspex - a threat modeling system built using a specialized collection of generative artificial intelligence-based methods that capture threat modeling tradecraft. This new approach, called tradecraft prompting, centers on encoding the on-the-ground knowledge of threat modelers within the prompts that drive a generative AI-based threat modeling system. Auspex employs tradecraft prompts in two processing stages. The first stage centers on ingesting and processing system architecture information using prompts that encode threat modeling tradecraft knowledge pertaining to system decomposition and description. The second stage centers on chaining the resulting system analysis through a collection of prompts that encode tradecraft knowledge on threat identification, classification, and mitigation. The two-stage process yields a threat matrix for a system that specifies threat scenarios, threat types, information security categorizations and potential mitigations. Auspex produces formalized threat model output in minutes, relative to the weeks or months a manual process takes. More broadly, the focus on bespoke tradecraft prompting, as opposed to fine-tuning or agent-based add-ons, makes Auspex a lightweight, flexible, modular, and extensible foundational system capable of addressing the complexity, resource, and standardization limitations of both existing manual and automated threat modeling processes. In this connection, we establish the baseline value of Auspex to threat modelers through an evaluation procedure based on feedback collected from cybersecurity subject matter experts measuring the quality and utility of threat models generated by Auspex on real banking systems. We conclude with a discussion of system performance and plans for enhancements to Auspex.

[AI-1] Block Diffusion: Interpolating Between Autoregressive and Diffusion Language Models ICLR2025

Link: https://arxiv.org/abs/2503.09573
Authors: Marianne Arriola, Aaron Gokaslan, Justin T Chiu, Zhihan Yang, Zhixuan Qi, Jiaqi Han, Subham Sekhar Sahoo, Volodymyr Kuleshov
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: ICLR 2025 Oral. We provide the code at this https URL

Abstract:Diffusion language models offer unique benefits over autoregressive models due to their potential for parallelized generation and controllability, yet they lag in likelihood modeling and are limited to fixed-length generation. In this work, we introduce a class of block diffusion language models that interpolate between discrete denoising diffusion and autoregressive models. Block diffusion overcomes key limitations of both approaches by supporting flexible-length generation and improving inference efficiency with KV caching and parallel token sampling. We propose a recipe for building effective block diffusion models that includes an efficient training algorithm, estimators of gradient variance, and data-driven noise schedules to minimize the variance. Block diffusion sets a new state-of-the-art performance among diffusion models on language modeling benchmarks and enables generation of arbitrary-length sequences. We provide the code, along with the model weights and blog post on the project page: this https URL

[AI-2] Global Convergence and Rich Feature Learning in L-Layer Infinite-Width Neural Networks under μP Parametrization

Link: https://arxiv.org/abs/2503.09565
Authors: Zixiang Chen, Greg Yang, Qingyue Zhao, Quanquan Gu
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Optimization and Control (math.OC); Machine Learning (stat.ML)
Comments: 29 pages, 5 figures, 2 tables

Abstract: Despite deep neural networks’ powerful representation learning capabilities, theoretical understanding of how networks can simultaneously achieve meaningful feature learning and global convergence remains elusive. Existing approaches like the neural tangent kernel (NTK) are limited because features stay close to their initialization in this parametrization, leaving open questions about feature properties during substantial evolution. In this paper, we investigate the training dynamics of infinitely wide, $L$-layer neural networks using the tensor program (TP) framework. Specifically, we show that, when trained with stochastic gradient descent (SGD) under the Maximal Update parametrization ($\mu$P) and mild conditions on the activation function, SGD enables these networks to learn linearly independent features that substantially deviate from their initial values. This rich feature space captures relevant data information and ensures that any convergent point of the training process is a global minimum. Our analysis leverages both the interactions among features across layers and the properties of Gaussian random variables, providing new insights into deep representation learning. We further validate our theoretical findings through experiments on real-world datasets.

[AI-3] The Value of Goal Commitment in Planning

Link: https://arxiv.org/abs/2503.09545
Authors: Alberto Pozanco, Marianela Morales, Daniel Borrajo, Manuela Veloso
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:In this paper, we revisit the concept of goal commitment from early planners in the presence of current forward chaining heuristic planners. We present a compilation that extends the original planning task with commit actions that enforce the persistence of specific goals once achieved, thereby committing to them in the search sub-tree. This approach imposes a specific goal achievement order in parts of the search tree, potentially introducing dead-end states. This can reduce search effort if the goal achievement order is correct. Otherwise, the search algorithm can expand nodes in the open list where goals do not persist. Experimental results demonstrate that the reformulated tasks suit state-of-the-art agile planners, enabling them to find better

[AI-4] Differentially Private Equilibrium Finding in Polymatrix Games

Link: https://arxiv.org/abs/2503.09538
Authors: Mingyang Liu, Gabriele Farina, Asuman Ozdaglar
Categories: Computer Science and Game Theory (cs.GT); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Comments:

Abstract: We study equilibrium finding in polymatrix games under differential privacy constraints. To start, we show that high accuracy and asymptotically vanishing differential privacy budget (as the number of players goes to infinity) cannot be achieved simultaneously under either of the two settings: (i) We seek to establish equilibrium approximation guarantees in terms of Euclidean distance to the equilibrium set, and (ii) the adversary has access to all communication channels. Then, assuming the adversary has access to a constant number of communication channels, we develop a novel distributed algorithm that recovers strategies with simultaneously vanishing Nash gap (in expected utility, also referred to as exploitability) and privacy budget as the number of players increases.

[AI-5] PairVDN - Pair-wise Decomposed Value Functions

Link: https://arxiv.org/abs/2503.09521
Authors: Zak Buzzard
Categories: Artificial Intelligence (cs.AI)
Comments: 8 pages, 5 figures

Abstract:Extending deep Q-learning to cooperative multi-agent settings is challenging due to the exponential growth of the joint action space, the non-stationary environment, and the credit assignment problem. Value decomposition allows deep Q-learning to be applied at the joint agent level, at the cost of reduced expressivity. Building on past work in this direction, our paper proposes PairVDN, a novel method for decomposing the value function into a collection of pair-wise, rather than per-agent, functions, improving expressivity at the cost of requiring a more complex (but still efficient) dynamic programming maximisation algorithm. Our method enables the representation of value functions which cannot be expressed as a monotonic combination of per-agent functions, unlike past approaches such as VDN and QMIX. We implement a novel many-agent cooperative environment, Box Jump, and demonstrate improved performance over these baselines in this setting. We open-source our code and environment at this https URL.
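
The dynamic-programming maximisation mentioned in the abstract can be illustrated on a simplified case. The sketch below assumes the joint value is a sum of pair-wise tables over a chain of agents, Q(a) = Σ_i q_i(a_i, a_{i+1}), and finds the maximising joint action with a Viterbi-style pass instead of enumerating every joint action. The chain structure, the function name, and the toy tables are assumptions for illustration; the paper's actual pairing scheme and algorithm may differ.

```python
from itertools import product


def maximise_chain(pair_tables):
    """Maximise sum_i pair_tables[i][a_i][a_{i+1}] over all joint actions.

    pair_tables[i][a][b] is the value contributed when agent i plays a and
    agent i + 1 plays b. Returns (best_value, best_joint_action).
    """
    num_actions = len(pair_tables[0])
    best = [0.0] * num_actions  # best prefix value given the latest agent's action
    back_pointers = []
    for table in pair_tables:
        new_best, pointers = [], []
        for b in range(num_actions):
            scores = [best[a] + table[a][b] for a in range(num_actions)]
            a_star = max(range(num_actions), key=scores.__getitem__)
            new_best.append(scores[a_star])
            pointers.append(a_star)
        back_pointers.append(pointers)
        best = new_best
    last = max(range(num_actions), key=best.__getitem__)
    actions = [last]
    for pointers in reversed(back_pointers):  # walk back to the first agent
        actions.append(pointers[actions[-1]])
    actions.reverse()
    return best[last], actions


# Three agents, two actions each; the first table rewards matching actions,
# which no sum of independent per-agent terms can express.
tables = [
    [[1, 0], [0, 3]],
    [[0, 2], [1, 0]],
]
value, joint_action = maximise_chain(tables)
print(value, joint_action)

# Brute-force check over all 2^3 joint actions.
brute_force = max(
    sum(tables[i][acts[i]][acts[i + 1]] for i in range(len(tables)))
    for acts in product(range(2), repeat=len(tables) + 1)
)
print(brute_force)
```

The pass costs time proportional to the number of agents times the square of the number of actions, whereas brute force grows exponentially with the number of agents. That gap is the efficiency argument the abstract alludes to.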

[AI-6] RESTRAIN: Reinforcement Learning-Based Secure Framework for Trigger-Action IoT Environment

Link: https://arxiv.org/abs/2503.09513
Authors: Md Morshed Alam, Lokesh Chandra Das, Sandip Roy, Sachin Shetty, Weichao Wang
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)

Abstract:Internet of Things (IoT) platforms with trigger-action capability allow event conditions to trigger actions in IoT devices autonomously by creating a chain of interactions. Adversaries exploit this chain of interactions to maliciously inject fake event conditions into IoT hubs, triggering unauthorized actions on target IoT devices to implement remote injection attacks. Existing defense mechanisms focus mainly on the verification of event transactions using physical event fingerprints to enforce the security policies to block unsafe event transactions. These approaches are designed to provide offline defense against injection attacks. The state-of-the-art online defense mechanisms offer real-time defense, but extensive reliability on the inference of attack impacts on the IoT network limits the generalization capability of these approaches. In this paper, we propose a platform-independent multi-agent online defense system, namely RESTRAIN, to counter remote injection attacks at runtime. RESTRAIN allows the defense agent to profile attack actions at runtime and leverages reinforcement learning to optimize a defense policy that complies with the security requirements of the IoT network. The experimental results show that the defense agent effectively takes real-time defense actions against complex and dynamic remote injection attacks and maximizes the security gain with minimal computational overhead.

[AI-7] PromptMap: An Alternative Interaction Style for AI-Based Image Generation

Link: https://arxiv.org/abs/2503.09436
Authors: Krzysztof Adamkiewicz, Paweł W. Woźniak, Julia Dominiak, Andrzej Romanowski, Jakob Karolus, Stanislav Frolov
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments: To be published in the proceedings of 30th International Conference on Intelligent User Interfaces (IUI '25), March 24-27, 2025, Cagliari, Italy

Abstract: Recent technological advances popularized the use of image generation among the general public. Crafting effective prompts can, however, be difficult for novice users. To tackle this challenge, we developed PromptMap, a new interaction style for text-to-image AI that allows users to freely explore a vast collection of synthetic prompts through a map-like view with semantic zoom. PromptMap groups images visually by their semantic similarity, allowing users to discover relevant examples. We evaluated PromptMap in a between-subject online study (n=60) and a qualitative within-subject study (n=12). We found that PromptMap supported users in crafting prompts by providing them with examples. We also demonstrated the feasibility of using LLMs to create vast example collections. Our work contributes a new interaction style that supports users unfamiliar with prompting in achieving a satisfactory image output.

[AI-8] CASTLE: Benchmarking Dataset for Static Code Analyzers and LLMs towards CWE Detection

Link: https://arxiv.org/abs/2503.09433
Authors: Richard A. Dubniczky, Krisztofer Zoltán Horvát, Tamás Bisztray, Mohamed Amine Ferrag, Lucas C. Cordeiro, Norbert Tihanyi
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)

Abstract:Identifying vulnerabilities in source code is crucial, especially in critical software components. Existing methods such as static analysis, dynamic analysis, formal verification, and recently Large Language Models are widely used to detect security flaws. This paper introduces CASTLE (CWE Automated Security Testing and Low-Level Evaluation), a benchmarking framework for evaluating the vulnerability detection capabilities of different methods. We assess 13 static analysis tools, 10 LLMs, and 2 formal verification tools using a hand-crafted dataset of 250 micro-benchmark programs covering 25 common CWEs. We propose the CASTLE Score, a novel evaluation metric to ensure fair comparison. Our results reveal key differences: ESBMC (a formal verification tool) minimizes false positives but struggles with vulnerabilities beyond model checking, such as weak cryptography or SQL injection. Static analyzers suffer from high false positives, increasing manual validation efforts for developers. LLMs perform exceptionally well in the CASTLE dataset when identifying vulnerabilities in small code snippets. However, their accuracy declines, and hallucinations increase as the code size grows. These results suggest that LLMs could play a pivotal role in future security solutions, particularly within code completion frameworks, where they can provide real-time guidance to prevent vulnerabilities. The dataset is accessible at this https URL.

[AI-9] Multimodal Language Modeling for High-Accuracy Single Cell Transcriptomics Analysis and Generation

Link: https://arxiv.org/abs/2503.09427
Authors: Yaorui Shi, Jiaqi Yang, Sihang Li, Junfeng Fang, Xiang Wang, Zhiyuan Liu, Yang Zhang
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract: Pre-trained language models (PLMs) have revolutionized scientific research, yet their application to single-cell analysis remains limited. Text PLMs cannot process single-cell RNA sequencing data, while cell PLMs lack the ability to handle free text, restricting their use in multimodal tasks. Existing efforts to bridge these modalities often suffer from information loss or inadequate single-modal pre-training, leading to suboptimal performance. To address these challenges, we propose Single-Cell MultiModal Generative Pre-trained Transformer (scMMGPT), a unified PLM for joint cell and text modeling. scMMGPT effectively integrates the state-of-the-art cell and text PLMs, facilitating cross-modal knowledge sharing for improved performance. To bridge the text-cell modality gap, scMMGPT leverages dedicated cross-modal projectors, and undergoes extensive pre-training on 27 million cells, the largest dataset for multimodal cell-text PLMs to date. This large-scale pre-training enables scMMGPT to excel in joint cell-text tasks, achieving an 84% relative improvement of textual discrepancy for cell description generation, 20.5% higher accuracy for cell type annotation, and 4% improvement in k-NN accuracy for text-conditioned pseudo-cell generation, outperforming baselines.

[AI-10] AI-based Framework for Robust Model-Based Connector Mating in Robotic Wire Harness Installation

Link: https://arxiv.org/abs/2503.09409
Authors: Claudius Kienle, Benjamin Alt, Finn Schneider, Tobias Pertlwieser, Rainer Jäkel, Rania Rayyes
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG)
Comments: 6 pages, 6 figures, 4 tables, submitted to the 2025 IEEE 21st International Conference on Automation Science and Engineering

Abstract:Despite the widespread adoption of industrial robots in automotive assembly, wire harness installation remains a largely manual process, as it requires precise and flexible manipulation. To address this challenge, we design a novel AI-based framework that automates cable connector mating by integrating force control with deep visuotactile learning. Our system optimizes search-and-insertion strategies using first-order optimization over a multimodal transformer architecture trained on visual, tactile, and proprioceptive data. Additionally, we design a novel automated data collection and optimization pipeline that minimizes the need for machine learning expertise. The framework optimizes robot programs that run natively on standard industrial controllers, permitting human experts to audit and certify them. Experimental validations on a center console assembly task demonstrate significant improvements in cycle times and robustness compared to conventional robot programming approaches. Videos are available under this https URL.

[AI-11] Towards Next-Generation Recommender Systems: A Benchmark for Personalized Recommendation Assistant with LLMs

Link: https://arxiv.org/abs/2503.09382
Authors: Jiani Huang, Shijie Wang, Liang-bo Ning, Wenqi Fan, Shuaiqiang Wang, Dawei Yin, Qing Li
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)

Abstract: Recommender systems (RecSys) are widely used across various modern digital platforms and have garnered significant attention. Traditional recommender systems usually focus only on fixed and simple recommendation scenarios, making it difficult to generalize to new and unseen recommendation tasks in an interactive paradigm. Recently, the advancement of large language models (LLMs) has revolutionized the foundational architecture of RecSys, driving their evolution into more intelligent and interactive personalized recommendation assistants. However, most existing studies rely on fixed task-specific prompt templates to generate recommendations and evaluate the performance of personalized assistants, which limits comprehensive assessment of their capabilities. This is because commonly used datasets lack high-quality textual user queries that reflect real-world recommendation scenarios, making them unsuitable for evaluating LLM-based personalized recommendation assistants. To address this gap, we introduce RecBench+, a new dataset benchmark designed to assess LLMs’ ability to handle intricate user recommendation needs in the era of LLMs. RecBench+ encompasses a diverse set of queries that span both hard conditions and soft preferences, with varying difficulty levels. We evaluated commonly used LLMs on RecBench+ and uncovered the following findings: 1) LLMs demonstrate preliminary abilities to act as recommendation assistants, and 2) LLMs are better at handling queries with explicitly stated conditions, while facing challenges with queries that require reasoning or contain misleading information. Our dataset has been released at this https URL.

[AI-12] Membership Inference Attacks fueled by Few-Shot Learning to detect privacy leakage tackling data integrity

Link: https://arxiv.org/abs/2503.09365
Authors: Daniel Jiménez-López, Nuria Rodríguez-Barroso, M. Victoria Luzón, Francisco Herrera
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)

Abstract:Deep learning models have an intrinsic privacy issue as they memorize parts of their training data, creating a privacy leakage. Membership Inference Attacks (MIA) exploit it to obtain confidential information about the data used for training, aiming to steal information. They can be repurposed as a measurement of data integrity by inferring whether it was used to train a machine learning model. While state-of-the-art attacks achieve a significant privacy leakage, their requirements are not feasible enough, hindering their role as practical tools to assess the magnitude of the privacy risk. Moreover, the most appropriate evaluation metric of MIA, the True Positive Rate at low False Positive Rate lacks interpretability. We claim that the incorporation of Few-Shot Learning techniques to the MIA field and a proper qualitative and quantitative privacy evaluation measure should deal with these issues. In this context, our proposal is twofold. We propose a Few-Shot learning based MIA, coined as the FeS-MIA model, which eases the evaluation of the privacy breach of a deep learning model by significantly reducing the number of resources required for the purpose. Furthermore, we propose an interpretable quantitative and qualitative measure of privacy, referred to as Log-MIA measure. Jointly, these proposals provide new tools to assess the privacy leakage and to ease the evaluation of the training data integrity of deep learning models, that is, to analyze the privacy breach of a deep learning model. Experiments carried out with MIA over image classification and language modeling tasks and its comparison to the state-of-the-art show that our proposals excel at reporting the privacy leakage of a deep learning model with little extra information.
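
The evaluation metric the abstract refers to, True Positive Rate at low False Positive Rate, can be computed as in the sketch below. It assumes each sample has an attack score where a higher value means "more likely a training member"; the function name `tpr_at_fpr` and the toy scores are hypothetical, and this is not the paper's Log-MIA measure.

```python
def tpr_at_fpr(member_scores, nonmember_scores, max_fpr=0.01):
    """Return the highest TPR over thresholds whose FPR stays <= max_fpr.

    A sample is predicted "member" when its score is >= the threshold.
    """
    thresholds = sorted(set(member_scores) | set(nonmember_scores), reverse=True)
    best_tpr = 0.0
    for t in thresholds:
        fpr = sum(s >= t for s in nonmember_scores) / len(nonmember_scores)
        if fpr > max_fpr:
            break  # FPR only grows as the threshold decreases
        tpr = sum(s >= t for s in member_scores) / len(member_scores)
        best_tpr = max(best_tpr, tpr)
    return best_tpr


# Toy attack scores (illustrative values only).
members = [0.9, 0.8, 0.4, 0.3]
nonmembers = [0.7, 0.2, 0.1, 0.05]
print(tpr_at_fpr(members, nonmembers, max_fpr=0.0))
print(tpr_at_fpr(members, nonmembers, max_fpr=0.25))
```

The result is a single number tied to an arbitrary FPR budget, which is one way to read the abstract's remark that the metric lacks interpretability.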

[AI-13] Automatic Operator-level Parallelism Planning for Distributed Deep Learning – A Mixed-Integer Programming Approach

Link: https://arxiv.org/abs/2503.09357
Authors: Ruifeng She, Bowen Pang, Kai Li, Zehua Liu, Tao Zhong
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Discrete Mathematics (cs.DM)

Abstract: As the artificial intelligence community advances into the era of large models with billions of parameters, distributed training and inference have become essential. While various parallelism strategies (data, model, sequence, and pipeline) have been successfully implemented for popular neural networks on mainstream hardware, optimizing the distributed deployment schedule requires extensive expertise and manual effort. Furthermore, while existing frameworks handle most simple chain-like structures, they struggle with complex non-linear architectures. Mixture-of-experts and multi-modal models feature intricate MIMO and branch-rich topologies that require fine-grained operator-level parallelization beyond the capabilities of existing frameworks. We propose formulating parallelism planning as a scheduling optimization problem using mixed-integer programming. We propose a bi-level solution framework balancing optimality with computational efficiency, automatically generating effective distributed plans that capture both the heterogeneous structure of modern neural networks and the underlying hardware constraints. In experiments comparing against expert-designed strategies like DeepSeek’s DualPipe, our framework achieves comparable or superior performance, reducing computational bubbles by half under the same memory constraints. The framework’s versatility extends beyond throughput optimization to incorporate hardware utilization maximization, memory capacity constraints, and other considerations or potential strategies. Such capabilities position our solution as both a valuable research tool for exploring optimal parallelization strategies and a practical industrial solution for large-scale AI deployment.

[AI-14] NVP-HRI: Zero Shot Natural Voice and Posture-based Human-Robot Interaction via Large Language Model

Link: https://arxiv.org/abs/2503.09335
Authors: Yuzhi Lai, Shenghai Yuan, Youssef Nassar, Mingyu Fan, Thomas Weber, Matthias Rätsch
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: This work has been accepted for publication in ESWA @ 2025 Elsevier. Personal use of this material is permitted. Permission from Elsevier must be obtained for all other uses, including reprinting/redistribution, creating new works, or reuse of any copyrighted components of this work in other media

Abstract:Effective Human-Robot Interaction (HRI) is crucial for future service robots in aging societies. Existing solutions are biased toward only well-trained objects, creating a gap when dealing with new objects. Currently, HRI systems using predefined gestures or language tokens for pretrained objects pose challenges for all individuals, especially elderly ones. These challenges include difficulties in recalling commands, memorizing hand gestures, and learning new names. This paper introduces NVP-HRI, an intuitive multi-modal HRI paradigm that combines voice commands and deictic posture. NVP-HRI utilizes the Segment Anything Model (SAM) to analyze visual cues and depth data, enabling precise structural object representation. Through a pre-trained SAM network, NVP-HRI allows interaction with new objects via zero-shot prediction, even without prior knowledge. NVP-HRI also integrates with a large language model (LLM) for multimodal commands, coordinating them with object selection and scene distribution in real time for collision-free trajectory solutions. We also regulate the action sequence with the essential control syntax to reduce LLM hallucination risks. The evaluation of diverse real-world tasks using a Universal Robot showcased up to 59.2% efficiency improvement over traditional gesture control, as illustrated in the video this https URL. Our code and design will be openly available at this https URL.

[AI-15] CyberLLMInstruct: A New Dataset for Analysing Safety of Fine-Tuned LLMs Using Cyber Security Data SIGIR

Link: https://arxiv.org/abs/2503.09334
Authors: Adel ElZemity, Budi Arief, Shujun Li
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: The paper is submitted to “The 48th International ACM SIGIR Conference on Research and Development in Information Retrieval” and is currently under review

Abstract:The integration of large language models (LLMs) into cyber security applications presents significant opportunities, such as enhancing threat analysis and malware detection, but can also introduce critical risks and safety concerns, including personal data leakage and automated generation of new malware. To address these challenges, we developed CyberLLMInstruct, a dataset of 54,928 instruction-response pairs spanning cyber security tasks such as malware analysis, phishing simulations, and zero-day vulnerabilities. The dataset was constructed through a multi-stage process. This involved sourcing data from multiple resources, filtering and structuring it into instruction-response pairs, and aligning it with real-world scenarios to enhance its applicability. Seven open-source LLMs were chosen to test the usefulness of CyberLLMInstruct: Phi 3 Mini 3.8B, Mistral 7B, Qwen 2.5 7B, Llama 3 8B, Llama 3.1 8B, Gemma 2 9B, and Llama 2 70B. In our primary example, we rigorously assess the safety of fine-tuned models using the OWASP top 10 framework, finding that fine-tuning reduces safety resilience across all tested LLMs and every adversarial attack (e.g., the security score of Llama 3.1 8B against prompt injection drops from 0.95 to 0.15). In our second example, we show that these same fine-tuned models can also achieve up to 92.50 percent accuracy on the CyberMetric benchmark. These findings highlight a trade-off between performance and safety, showing the importance of adversarial testing and further research into fine-tuning methodologies that can mitigate safety risks while still improving performance across diverse datasets and domains. All scripts required to reproduce the dataset, along with examples and relevant resources for replicating our results, will be made available upon the paper’s acceptance.

[AI-16] Group-robust Machine Unlearning

Link: https://arxiv.org/abs/2503.09330
Authors: Thomas De Min, Subhankar Roy, Stéphane Lathuilière, Elisa Ricci, Massimiliano Mancini
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Work in progress

Abstract:Machine unlearning is an emerging paradigm to remove the influence of specific training data (i.e., the forget set) from a model while preserving its knowledge of the rest of the data (i.e., the retain set). Previous approaches assume the forget data to be uniformly distributed from all training datapoints. However, if the data to unlearn is dominant in one group, we empirically show that performance for this group degrades, leading to fairness issues. This work tackles the overlooked problem of non-uniformly distributed forget sets, which we call group-robust machine unlearning, by presenting a simple, effective strategy that mitigates the performance loss in dominant groups via sample distribution reweighting. Moreover, we present MIU (Mutual Information-aware Machine Unlearning), the first approach for group robustness in approximate machine unlearning. MIU minimizes the mutual information between model features and group information, achieving unlearning while reducing performance degradation in the dominant group of the forget set. Additionally, MIU exploits sample distribution reweighting and mutual information calibration with the original model to preserve group robustness. We conduct experiments on three datasets and show that MIU outperforms standard methods, achieving unlearning without compromising model robustness. Source code available at this https URL.

[AI-17] Adaptive political surveys and GPT-4: Tackling the cold start problem with simulated user interactions

Link: https://arxiv.org/abs/2503.09311
Authors: Fynn Bachmann, Daan van der Weijden, Lucien Heitz, Cristina Sarasua, Abraham Bernstein
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 23 pages. Under review at PLOS One

Abstract:Adaptive questionnaires dynamically select the next question for a survey participant based on their previous answers. Due to digitalisation, they have become a viable alternative to traditional surveys in application areas such as political science. One limitation, however, is their dependency on data to train the model for question selection. Often, such training data (i.e., user interactions) are unavailable a priori. To address this problem, we (i) test whether Large Language Models (LLM) can accurately generate such interaction data and (ii) explore if these synthetic data can be used to pre-train the statistical model of an adaptive political survey. To evaluate this approach, we utilise existing data from the Swiss Voting Advice Application (VAA) Smartvote in two ways: First, we compare the distribution of LLM-generated synthetic data to the real distribution to assess its similarity. Second, we compare the performance of an adaptive questionnaire that is randomly initialised with one pre-trained on synthetic data to assess their suitability for training. We benchmark these results against an “oracle” questionnaire with perfect prior knowledge. We find that an off-the-shelf LLM (GPT-4) accurately generates answers to the Smartvote questionnaire from the perspective of different Swiss parties. Furthermore, we demonstrate that initialising the statistical model with synthetic data can (i) significantly reduce the error in predicting user responses and (ii) increase the candidate recommendation accuracy of the VAA. Our work emphasises the considerable potential of LLMs to create training data to improve the data collection process in adaptive questionnaires in LLM-affine areas such as political surveys.

[AI-18] Steering No-Regret Agents in MFGs under Model Uncertainty AISTATS2025

Link: https://arxiv.org/abs/2503.09309
Authors: Leo Widmer, Jiawei Huang, Niao He
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA); Machine Learning (stat.ML)
Comments: AISTATS 2025; 34 Pages

Abstract: Incentive design is a popular framework for guiding agents’ learning dynamics towards desired outcomes by providing additional payments beyond intrinsic rewards. However, most existing works focus on a finite, small set of agents or assume complete knowledge of the game, limiting their applicability to real-world scenarios involving large populations and model uncertainty. To address this gap, we study the design of steering rewards in Mean-Field Games (MFGs) with density-independent transitions, where both the transition dynamics and intrinsic reward functions are unknown. This setting presents non-trivial challenges, as the mediator must incentivize the agents to explore for its model learning under uncertainty, while simultaneously steering them to converge to desired behaviors without incurring excessive incentive payments. Assuming agents exhibit no(-adaptive) regret behaviors, we contribute novel optimistic exploration algorithms. Theoretically, we establish sub-linear regret guarantees for the cumulative gaps between the agents’ behaviors and the desired ones. In terms of the steering cost, we demonstrate that our total incentive payments incur only sub-linear excess, competing with a baseline steering strategy that stabilizes the target policy as an equilibrium. Our work presents an effective framework for steering agents’ behaviors in large-population systems under uncertainty.

[AI-19] DeepInnovation AI: A Global Dataset Mapping the AI innovation and technology Transfer from Academic Research to Industrial Patents

Link: https://arxiv.org/abs/2503.09257
Authors: Haixing Gong, Hui Zou, Xingzhou Liang, Shiyuan Meng, Pinlong Cai, Xingcheng Xu, Jingjing Qu
Categories: Databases (cs.DB); Artificial Intelligence (cs.AI); Digital Libraries (cs.DL)
Comments: 32 pages and 8 figures

Abstract:In the rapidly evolving field of artificial intelligence (AI), mapping innovation patterns and understanding effective technology transfer from academic research to practical applications are essential for economic growth. This paper introduces DeepInnovationAI, the first comprehensive global dataset designed to bridge the gap between academic papers and industrial patents. However, existing data infrastructures face three major limitations: fragmentation, incomplete coverage, and insufficient evaluative capacity. Here, we present DeepInnovationAI, a comprehensive global dataset documenting AI innovation trajectories. The dataset comprises three structured files: this http URL: Contains 2,356,204 patent records with 8 field-specific attributes. this http URL: Encompasses 3,511,929 academic publications with 13 metadata fields. These two datasets employ large language models, multilingual text analysis and dual-layer BERT classifiers to accurately identify AI-related content and utilizing hypergraph analysis methods to create robust innovation metrics. In addition, this http URL: By applying semantic vector proximity analysis, this file presents approximately one hundred million calculated paper-patent similarity pairs to enhance understanding of how theoretical advancements translate into commercial technologies. This enables researchers, policymakers, and industry leaders to anticipate trends and identify emerging areas for collaboration. With its extensive temporal and geographical scope, DeepInnovationAI supports detailed analysis of technological development patterns and international competition dynamics, providing a robust foundation for modeling AI innovation dynamics and technology transfer processes.
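
The paper-patent similarity pairs described in the abstract can be pictured with a small sketch. The abstract only says "semantic vector proximity analysis"; using cosine similarity here is an assumption, and the identifiers, toy vectors, and function names (`cosine`, `top_pairs`) are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def top_pairs(paper_vecs, patent_vecs, k=2):
    """Score every paper-patent pair and return the k most similar ones."""
    scored = [
        (cosine(pv, tv), paper_id, patent_id)
        for paper_id, pv in paper_vecs.items()
        for patent_id, tv in patent_vecs.items()
    ]
    return sorted(scored, reverse=True)[:k]


# Toy 2-D embeddings standing in for real text embeddings.
papers = {"paper_a": [1.0, 0.0], "paper_b": [0.0, 1.0]}
patents = {"patent_x": [2.0, 0.0], "patent_y": [1.0, 1.0]}
for score, paper_id, patent_id in top_pairs(papers, patents, k=2):
    print(paper_id, patent_id, round(score, 3))
```

Scoring every pair exhaustively, as done here, would not scale to millions of records on each side; a dataset of that size would need some form of approximate nearest-neighbour search, which this sketch omits.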

[AI-20] SCOPE-DTI: Semi-Inductive Dataset Construction and Framework Optimization for Practical Usability Enhancement in Deep Learning-Based Drug Target Interaction Prediction

Link: https://arxiv.org/abs/2503.09251
Authors: Yigang Chen, Xiang Ji, Ziyue Zhang, Yuming Zhou, Yang-Chi-Dung Lin, Hsi-Yuan Huang, Tao Zhang, Yi Lai, Ke Chen, Chang Su, Xingqiao Lin, Zihao Zhu, Yanggyi Zhang, Kangping Wei, Jiehui Fu, Yixian Huang, Shidong Cui, Shih-Chung Yen, Ariel Warshel, Hsien-Da Huang
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Quantitative Methods (q-bio.QM)

Abstract:Deep learning-based drug-target interaction (DTI) prediction methods have demonstrated strong performance; however, real-world applicability remains constrained by limited data diversity and modeling complexity. To address these challenges, we propose SCOPE-DTI, a unified framework combining a large-scale, balanced semi-inductive human DTI dataset with advanced deep learning modeling. Constructed from 13 public repositories, the SCOPE dataset expands data volume by up to 100-fold compared to common benchmarks such as the Human dataset. The SCOPE model integrates three-dimensional protein and compound representations, graph neural networks, and bilinear attention mechanisms to effectively capture cross domain interaction patterns, significantly outperforming state-of-the-art methods across various DTI prediction tasks. Additionally, SCOPE-DTI provides a user-friendly interface and database. We further validate its effectiveness by experimentally identifying anticancer targets of Ginsenoside Rh1. By offering comprehensive data, advanced modeling, and accessible tools, SCOPE-DTI accelerates drug discovery research.

[AI-21] In-Context Defense in Computer Agents: An Empirical Study

Link: https://arxiv.org/abs/2503.09241
Authors: Pei Yang, Hai Ci, Mike Zheng Shou
Categories: Artificial Intelligence (cs.AI)

Abstract: Computer agents powered by vision-language models (VLMs) have significantly advanced human-computer interaction, enabling users to perform complex tasks through natural language instructions. However, these agents are vulnerable to context deception attacks, an emerging threat where adversaries embed misleading content into the agent’s operational environment, such as a pop-up window containing deceptive instructions. Existing defenses, such as instructing agents to ignore deceptive elements, have proven largely ineffective. As the first systematic study on protecting computer agents, we introduce in-context defense, leveraging in-context learning and chain-of-thought (CoT) reasoning to counter such attacks. Our approach involves augmenting the agent’s context with a small set of carefully curated exemplars containing both malicious environments and corresponding defensive responses. These exemplars guide the agent to first perform explicit defensive reasoning before action planning, reducing susceptibility to deceptive attacks. Experiments demonstrate the effectiveness of our method, reducing attack success rates by 91.2% on pop-up window attacks and by 74.6% on average on environment injection attacks, while achieving 100% successful defenses against distracting advertisements. Our findings highlight that (1) defensive reasoning must precede action planning for optimal performance, and (2) a minimal number of exemplars (fewer than three) is sufficient to induce an agent’s defensive behavior.

[AI-22] LREF: A Novel LLM-based Relevance Framework for E-commerce

Link: https://arxiv.org/abs/2503.09223
Authors: Tian Tang, Zhixing Tian, Zhenyu Zhu, Chenyang Wang, Haiqing Hu, Guoyu Tang, Lin Liu, Sulong Xu
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)

Abstract:Query and product relevance prediction is a critical component for ensuring a smooth user experience in e-commerce search. Traditional studies mainly focus on BERT-based models to assess the semantic relevance between queries and products. However, the discriminative paradigm and limited knowledge capacity of these approaches restrict their ability to comprehend the relevance between queries and products fully. With the rapid advancement of Large Language Models (LLMs), recent research has begun to explore their application to industrial search systems, as LLMs provide extensive world knowledge and flexible optimization for reasoning processes. Nonetheless, directly leveraging LLMs for relevance prediction tasks introduces new challenges, including a high demand for data quality, the necessity for meticulous optimization of reasoning processes, and an optimistic bias that can result in over-recall. To overcome the above problems, this paper proposes a novel framework called the LLM-based RElevance Framework (LREF) aimed at enhancing e-commerce search relevance. The framework comprises three main stages: supervised fine-tuning (SFT) with Data Selection, Multiple Chain of Thought (Multi-CoT) tuning, and Direct Preference Optimization (DPO) for de-biasing. We evaluate the performance of the framework through a series of offline experiments on large-scale real-world datasets, as well as online A/B testing. The results indicate significant improvements in both offline and online metrics. Ultimately, the model was deployed in a well-known e-commerce application, yielding substantial commercial benefits.
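
For readers unfamiliar with the third stage, the sketch below shows the standard per-example Direct Preference Optimization loss. How LREF builds its preference pairs for de-biasing is not stated in the abstract, so the inputs here are placeholder log-probabilities, and the function name `dpo_loss` and the default `beta` are assumptions.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) response pair.

    Each argument is the summed log-probability of a full response under
    the trainable policy or the frozen reference model.
    """
    # How much more the policy favours the chosen response than the reference does.
    margin = (policy_chosen_logp - ref_chosen_logp) - (
        policy_rejected_logp - ref_rejected_logp
    )
    # -log(sigmoid(x)) written as log(1 + exp(-x)).
    return math.log1p(math.exp(-beta * margin))


# Policy identical to the reference: the loss equals log(2).
print(dpo_loss(-1.0, -1.0, -1.0, -1.0))
# Policy already prefers the chosen response: the loss is lower.
print(dpo_loss(-0.5, -2.0, -1.0, -1.0))
```

The loss falls as the policy shifts probability toward the chosen response relative to the reference model, which is how a preference dataset can counteract a systematic bias such as over-recall.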

[AI-23] Evaluating the Generalizability of LLMs in Automated Program Repair ICSE2025

Link: https://arxiv.org/abs/2503.09217
Authors: Fengjie Li, Jiajun Jiang, Jiajun Sun, Hongyu Zhang
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: 5 pages, 1 figure, to be published in ICSE2025-NIER

Abstract:LLM-based automated program repair methods have attracted significant attention for their state-of-the-art performance. However, they were primarily evaluated on a few well known datasets like Defects4J, raising questions about their effectiveness on new datasets. In this study, we evaluate 11 top-performing LLMs on DEFECTS4J-TRANS, a new dataset derived from transforming Defects4J while maintaining the original semantics. Results from experiments on both Defects4J and DEFECTS4J-TRANS show that all studied LLMs have limited generalizability in APR tasks, with the average number of correct and plausible patches decreasing by 49.48% and 42.90%, respectively, on DEFECTS4J-TRANS. Further investigation into incorporating additional repair-relevant information in repair prompts reveals that, although this information significantly enhances the LLMs’ capabilities (increasing the number of correct and plausible patches by up to 136.67% and 121.82%, respectively), performance still falls short of their original results. This indicates that prompt engineering alone is insufficient to substantially enhance LLMs’ repair capabilities. Based on our study, we also offer several recommendations for future research.

[AI-24] GENEOnet: Statistical analysis supporting explainability and trustworthiness

Link: https://arxiv.org/abs/2503.09199
Authors: Giovanni Bocchi, Patrizio Frosini, Alessandra Micheletti, Alessandro Pedretti, Carmen Gratteri, Filippo Lunghini, Andrea Rosario Beccari, Carmine Talarico
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Biomolecules (q-bio.BM); Applications (stat.AP)

Abstract:Group Equivariant Non-Expansive Operators (GENEOs) have emerged as mathematical tools for constructing networks for Machine Learning and Artificial Intelligence. Recent findings suggest that such models can be inserted within the domain of eXplainable Artificial Intelligence (XAI) due to their inherent interpretability. In this study, we aim to verify this claim with respect to GENEOnet, a GENEO network developed for an application in computational biochemistry by employing various statistical analyses and experiments. Such experiments first allow us to perform a sensitivity analysis on GENEOnet’s parameters to test their significance. Subsequently, we show that GENEOnet exhibits a significantly higher proportion of equivariance compared to other methods. Lastly, we demonstrate that GENEOnet is on average robust to perturbations arising from molecular dynamics. These results collectively serve as proof of the explainability, trustworthiness, and robustness of GENEOnet and confirm the beneficial use of GENEOs in the context of Trustworthy Artificial Intelligence.

[AI-25] Long-Term Planning Around Humans in Domestic Environments with 3D Scene Graphs

Link: https://arxiv.org/abs/2503.09173
Authors: Ermanno Bartoli, Dennis Rotondi, Kai O. Arras, Iolanda Leite
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: 5 pages, 2 figures, 1 table

Abstract:Long-term planning for robots operating in domestic environments poses unique challenges due to the interactions between humans, objects, and spaces. Recent advancements in trajectory planning have leveraged vision-language models (VLMs) to extract contextual information for robots operating in real-world environments. While these methods achieve satisfying performance, they do not explicitly model human activities. Such activities influence surrounding objects and reshape spatial constraints. This paper presents a novel approach to trajectory planning that integrates human preferences, activities, and spatial context through an enriched 3D scene graph (3DSG) representation. By incorporating activity-based relationships, our method captures the spatial impact of human actions, leading to more context-sensitive trajectory adaptation. Preliminary results demonstrate that our approach effectively assigns costs to spaces influenced by human activities, ensuring that the robot trajectory remains contextually appropriate and sensitive to the ongoing environment. This balance between task efficiency and social appropriateness enhances context-aware human-robot interactions in domestic settings. Future work includes implementing a full planning pipeline and conducting user studies to evaluate trajectory acceptability.

[AI-26] AI-Driven Decision Support in Oncology: Evaluating Data Readiness for Skin Cancer Treatment

Link: https://arxiv.org/abs/2503.09164
Authors: Joscha Grüger, Tobias Geyer, Tobias Brix, Michael Storck, Sonja Leson, Laura Bley, Carsten Weishaupt, Ralph Bergmann, Stephan A. Braun
Subjects: Artificial Intelligence (cs.AI)

Abstract:This research focuses on evaluating and enhancing data readiness for the development of an Artificial Intelligence (AI)-based Clinical Decision Support System (CDSS) in the context of skin cancer treatment. The study, conducted at the Skin Tumor Center of the University Hospital Münster, delves into the essential role of data quality, availability, and extractability in implementing effective AI applications in oncology. By employing a multifaceted methodology, including literature review, data readiness assessment, and expert workshops, the study addresses the challenges of integrating AI into clinical decision-making. The research identifies crucial data points for skin cancer treatment decisions, evaluates their presence and quality in various information systems, and highlights the difficulties in extracting information from unstructured data. The findings underline the significance of high-quality, accessible data for the success of AI-driven CDSS in medical settings, particularly in the complex field of oncology.

[AI-27] Efficient UAV Swarm-Based Multi-Task Federated Learning with Dynamic Task Knowledge Sharing

Link: https://arxiv.org/abs/2503.09144
Authors: Yubo Yang, Tao Yang, Xiaofeng Wu, Ziyu Guo, Bo Hu
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Due to the limitation “The abstract field cannot be longer than 1,920 characters”, the abstract here is shorter than that in the PDF file

Abstract:UAV swarms are widely used in emergency communications, area monitoring, and disaster relief. Coordinated by control centers, they are ideal for federated learning (FL) frameworks. However, current UAV-assisted FL methods primarily focus on single tasks, overlooking the need for multi-task training. In disaster relief scenarios, UAVs perform tasks such as crowd detection, road feasibility analysis, and disaster assessment, which exhibit time-varying demands and potential correlations. In order to meet the time-varying requirements of tasks and complete multiple tasks efficiently under resource constraints, in this paper, we propose a UAV swarm based multi-task FL framework, where ground emergency vehicles (EVs) collaborate with UAVs to accomplish multiple tasks efficiently under constrained energy and bandwidth resources. Through theoretical analysis, we identify key factors affecting task performance and introduce a task attention mechanism to dynamically evaluate task importance, thereby achieving efficient resource allocation. Additionally, we propose a task affinity (TA) metric to capture the dynamic correlation among tasks, thereby promoting task knowledge sharing to accelerate training and improve the generalization ability of the model in different scenarios. To optimize resource allocation, we formulate a two-layer optimization problem to jointly optimize UAV transmission power, computation frequency, bandwidth allocation, and UAV-EV associations. For the inner problem, we derive closed-form solutions for transmission power, computation frequency, and bandwidth allocation and apply a block coordinate descent method for optimization. For the outer problem, a two-stage algorithm is designed to determine optimal UAV-EV associations. Furthermore, theoretical analysis reveals a trade-off between UAV energy consumption and multi-task performance.

[AI-28] On the Internal Representations of Graph Metanetworks ICLR2025

Link: https://arxiv.org/abs/2503.09120
Authors: Taesun Yeom, Jaeho Lee
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: ICLR 2025 Workshop on Weight Space Learning

Abstract:Weight space learning is an emerging paradigm in the deep learning community. The primary goal of weight space learning is to extract informative features from a set of parameters using specially designed neural networks, often referred to as metanetworks. However, it remains unclear how these metanetworks learn solely from parameters. To address this, we take the first step toward understanding representations of metanetworks, specifically graph metanetworks (GMNs), which achieve state-of-the-art results in this field, using centered kernel alignment (CKA). Through various experiments, we reveal that GMNs and general neural networks (e.g., multi-layer perceptrons (MLPs) and convolutional neural networks (CNNs)) differ in terms of their representation space.
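
For readers unfamiliar with centered kernel alignment, the sketch below computes the linear-kernel variant of CKA between two activation matrices with the same number of rows (examples). Whether the paper uses the linear or another kernel is not stated in the abstract, so the linear choice and all function names here are assumptions.

```python
import math


def center_columns(m):
    """Subtract the per-column mean from a row-major matrix (list of rows)."""
    n = len(m)
    means = [sum(col) / n for col in zip(*m)]
    return [[v - mu for v, mu in zip(row, means)] for row in m]


def cross_frobenius_sq(a, b):
    """Return ||a^T b||_F^2 for two matrices with the same number of rows."""
    total = 0.0
    for col_a in zip(*a):
        for col_b in zip(*b):
            dot = sum(x * y for x, y in zip(col_a, col_b))
            total += dot * dot
    return total


def linear_cka(x, y):
    """Linear CKA: ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F), after centering.

    Raises ZeroDivisionError if either input has constant columns only.
    """
    xc, yc = center_columns(x), center_columns(y)
    numerator = cross_frobenius_sq(xc, yc)
    denominator = math.sqrt(cross_frobenius_sq(xc, xc)
                            * cross_frobenius_sq(yc, yc))
    return numerator / denominator
```

Linear CKA equals 1 when one representation is a rotation or isotropic rescaling of the other, which is what makes it usable for comparing layers of different networks.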

[AI-29] Sometimes Painful but Certainly Promising: Feasibility and Trade-offs of Language Model Inference at the Edge

Link: https://arxiv.org/abs/2503.09114
Authors: Maximilian Abstreiter, Sasu Tarkoma, Roberto Morabito
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Performance (cs.PF)
Comments: This paper is currently under review for publication in an ACM journal. If accepted, the copyright will be transferred to ACM

Abstract:The rapid rise of Language Models (LMs) has expanded the capabilities of natural language processing, powering applications from text generation to complex decision-making. While state-of-the-art LMs often boast hundreds of billions of parameters and are primarily deployed in data centers, recent trends show a growing focus on compact models, typically under 10 billion parameters, enabled by quantization and other model compression techniques. This shift paves the way for LMs on edge devices, offering potential benefits such as enhanced privacy, reduced latency, and improved data sovereignty. However, the inherent complexity of even these smaller models, combined with the limited computing resources of edge hardware, raises critical questions about the practical trade-offs in executing LM inference outside the cloud. To address these challenges, we present a comprehensive evaluation of generative LM inference on representative CPU-based and GPU-accelerated edge devices. Our study measures key performance indicators, including memory usage, inference speed, and energy consumption, across various device configurations. Additionally, we examine throughput-energy trade-offs, cost considerations, and usability, alongside an assessment of qualitative model performance. While quantization helps mitigate memory overhead, it does not fully eliminate resource bottlenecks, especially for larger models. Our findings quantify the memory and energy constraints that must be considered for practical real-world deployments, offering concrete insights into the trade-offs between model size, inference performance, and efficiency. The exploration of LMs at the edge is still in its early stages. We hope this study provides a foundation for future research, guiding the refinement of models, the enhancement of inference efficiency, and the advancement of edge-centric AI systems.

[AI-30] Constraint-Guided Learning of Data-driven Health Indicator Models: An Application on the Pronostia Bearing Dataset

Link: https://arxiv.org/abs/2503.09113
Authors: Yonas Tefera, Quinten Van Baelen, Maarten Meire, Stijn Luca, Peter Karsmakers
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:This paper presents a constraint-guided deep learning framework for developing physically consistent health indicators in bearing prognostics and health management. Conventional data-driven methods often lack physical plausibility, while physics-based models are limited by incomplete system knowledge. To address this, we integrate domain knowledge into deep learning using constraints to enforce monotonicity, bound output values between 1 and 0 (representing healthy to failed states), and ensure consistency between signal energy trends and health indicator estimates. This eliminates the need for complex loss term balancing. We implement constraint-guided gradient descent within an autoencoder architecture, creating a constrained autoencoder. However, the framework is adaptable to other architectures. Using time-frequency representations of accelerometer signals from the Pronostia dataset, our constrained model generates smoother, more reliable degradation profiles compared to conventional methods, aligning with expected physical behavior. Performance is assessed using three metrics: trendability, robustness, and consistency. Compared to a conventional baseline, the constrained model improves all three. Another baseline, incorporating monotonicity via a soft-ranking loss function, outperforms in trendability but falls short in robustness and consistency. An ablation study confirms that the monotonicity constraint enhances trendability, the boundary constraint ensures consistency, and the energy-health consistency constraint improves robustness. These findings highlight the effectiveness of constraint-guided deep learning in producing reliable, physically meaningful health indicators, offering a promising direction for future prognostic applications.

[AI-31] Probing Network Decisions: Capturing Uncertainties and Unveiling Vulnerabilities Without Label Information ICPR

Link: https://arxiv.org/abs/2503.09068
Authors: Youngju Joung, Sehyun Lee, Jaesik Choi
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Comments: ICPRAI 2024

Abstract:To improve trust and transparency, it is crucial to be able to interpret the decisions of Deep Neural classifiers (DNNs). Instance-level examinations, such as attribution techniques, are commonly employed to interpret the model decisions. However, when interpreting misclassified decisions, human intervention may be required. Analyzing the attributions across each class within one instance can be particularly labor-intensive and influenced by the bias of the human interpreter. In this paper, we present a novel framework to uncover the weakness of the classifier via counterfactual examples. A prober is introduced to learn the correctness of the classifier’s decision in terms of a binary code: hit or miss. It enables the creation of the counterfactual example concerning the prober’s decision. We test the performance of our prober’s misclassification detection and verify its effectiveness on the image classification benchmark datasets. Furthermore, by generating counterfactuals that penetrate the prober, we demonstrate that our framework effectively identifies vulnerabilities in the target classifier without relying on label information on the MNIST dataset.

[AI-32] Probing Latent Subspaces in LLM for AI Security: Identifying and Manipulating Adversarial States

Link: https://arxiv.org/abs/2503.09066
Authors: Xin Wei Chia, Jonathan Pan
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Comments: 4 figures

Abstract:Large Language Models (LLMs) have demonstrated remarkable capabilities across various tasks, yet they remain vulnerable to adversarial manipulations such as jailbreaking via prompt injection attacks. These attacks bypass safety mechanisms to generate restricted or harmful content. In this study, we investigated the underlying latent subspaces of safe and jailbroken states by extracting hidden activations from a LLM. Inspired by attractor dynamics in neuroscience, we hypothesized that LLM activations settle into semi stable states that can be identified and perturbed to induce state transitions. Using dimensionality reduction techniques, we projected activations from safe and jailbroken responses to reveal latent subspaces in lower dimensional spaces. We then derived a perturbation vector that when applied to safe representations, shifted the model towards a jailbreak state. Our results demonstrate that this causal intervention results in statistically significant jailbreak responses in a subset of prompts. Next, we probed how these perturbations propagate through the model’s layers, testing whether the induced state change remains localized or cascades throughout the network. Our findings indicate that targeted perturbations induced distinct shifts in activations and model responses. Our approach paves the way for potential proactive defenses, shifting from traditional guardrail based methods to preemptive, model agnostic techniques that neutralize adversarial states at the representation level.

[AI-33] TreeX: Generating Global Graphical GNN Explanations via Critical Subtree Extraction

Link: https://arxiv.org/abs/2503.09051
Authors: Shengyao Lu, Jiuding Yang, Baochun Li, Di Niu
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:The growing demand for transparency and interpretability in critical domains has driven increased interests in comprehending the explainability of Message-Passing (MP) Graph Neural Networks (GNNs). Although substantial research efforts have been made to generate explanations for individual graph instances, identifying global explaining concepts for a GNN still poses great challenges, especially when concepts are desired in a graphical form on the dataset level. While most prior works treat GNNs as black boxes, in this paper, we propose to unbox GNNs by analyzing and extracting critical subtrees incurred by the inner workings of message passing, which correspond to critical subgraphs in the datasets. By aggregating subtrees in an embedding space with an efficient algorithm, which does not require complex subgraph matching or search, we can make intuitive graphical explanations for Message-Passing GNNs on local, class and global levels. We empirically show that our proposed approach not only generates clean subgraph concepts on a dataset level in contrast to existing global explaining methods which generate non-graphical rules (e.g., language or embeddings) as explanations, but it is also capable of providing explanations for individual instances with a comparable or even superior performance as compared to leading local-level GNN explainers.

[AI-34] ManeuverGPT Agentic Control for Safe Autonomous Stunt Maneuvers IROS

Link: https://arxiv.org/abs/2503.09035
Authors: Shawn Azdam, Pranav Doma, Aliasghar Moj Arab
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
Comments: 6 Pages, Submitted to IROS

Abstract:The next generation of active safety features in autonomous vehicles should be capable of safely executing evasive hazard-avoidance maneuvers akin to those performed by professional stunt drivers to achieve high-agility motion at the limits of vehicle handling. This paper presents a novel framework, ManeuverGPT, for generating and executing high-dynamic stunt maneuvers in autonomous vehicles using large language model (LLM)-based agents as controllers. We target aggressive maneuvers, such as J-turns, within the CARLA simulation environment and demonstrate an iterative, prompt-based approach to refine vehicle control parameters, starting tabula rasa without retraining model weights. We propose an agentic architecture comprised of three specialized agents: (1) a Query Enricher Agent for contextualizing user commands, (2) a Driver Agent for generating maneuver parameters, and (3) a Parameter Validator Agent that enforces physics-based and safety constraints. Experimental results demonstrate successful J-turn execution across multiple vehicle models through textual prompts that adapt to differing vehicle dynamics. We evaluate performance via established success criteria and discuss limitations regarding numeric precision and scenario complexity. Our findings underscore the potential of LLM-driven control for flexible, high-dynamic maneuvers, while highlighting the importance of hybrid approaches that combine language-based reasoning with algorithmic validation.

[AI-35] RFUAV: A Benchmark Dataset for Unmanned Aerial Vehicle Detection and Identification

Link: https://arxiv.org/abs/2503.09033
Authors: Rui Shi, Xiaodong Yu, Shengming Wang, Yijia Zhang, Lu Xu, Peng Pan, Chunlai Ma
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: 23 pages, 13 figures, conference

Abstract:In this paper, we propose RFUAV as a new benchmark dataset for radio-frequency based (RF-based) unmanned aerial vehicle (UAV) identification and address the following challenges: Firstly, many existing datasets feature a restricted variety of drone types and insufficient volumes of raw data, which fail to meet the demands of practical applications. Secondly, existing datasets often lack raw data covering a broad range of signal-to-noise ratios (SNR), or do not provide tools for transforming raw data to different SNR levels. This limitation undermines the validity of model training and evaluation. Lastly, many existing datasets do not offer open-access evaluation tools, leading to a lack of unified evaluation standards in current research within this field. RFUAV comprises approximately 1.3 TB of raw frequency data collected from 37 distinct UAVs using the Universal Software Radio Peripheral (USRP) device in real-world environments. Through in-depth analysis of the RF data in RFUAV, we define a drone feature sequence called RF drone fingerprint, which aids in distinguishing drone signals. In addition to the dataset, RFUAV provides a baseline preprocessing method and model evaluation tools. Rigorous experiments demonstrate that these preprocessing methods achieve state-of-the-art (SOTA) performance using the provided evaluation tools. The RFUAV dataset and baseline implementation are publicly available at this https URL.

[AI-36] Enhancing High-Quality Code Generation in Large Language Models with Comparative Prefix-Tuning

Link: https://arxiv.org/abs/2503.09020
Authors: Yuan Jiang, Yujian Zhang, Liang Lu, Christoph Treude, Xiaohong Su, Shan Huang, Tiantian Wang
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)

Abstract:Large Language Models (LLMs) have been widely adopted in commercial code completion engines, significantly enhancing coding efficiency and productivity. However, LLMs may generate code with quality issues that violate coding standards and best practices, such as poor code style and maintainability, even when the code is functionally correct. This necessitates additional effort from developers to improve the code, potentially negating the efficiency gains provided by LLMs. To address this problem, we propose a novel comparative prefix-tuning method for controllable high-quality code generation. Our method introduces a single, property-specific prefix that is prepended to the activations of the LLM, serving as a lightweight alternative to fine-tuning. Unlike existing methods that require training multiple prefixes, our approach trains only one prefix and leverages pairs of high-quality and low-quality code samples, introducing a sequence-level ranking loss to guide the model’s training. This comparative approach enables the model to better understand the differences between high-quality and low-quality code, focusing on aspects that impact code quality. Additionally, we design a data construction pipeline to collect and annotate pairs of high-quality and low-quality code, facilitating effective training. Extensive experiments on the Code Llama 7B model demonstrate that our method improves code quality by over 100% in certain task categories, while maintaining functional correctness. We also conduct ablation studies and generalization experiments, confirming the effectiveness of our method’s components and its strong generalization capability.
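
The abstract mentions a sequence-level ranking loss over pairs of high-quality and low-quality code but does not give its form. One common shape for such a loss is a pairwise margin (hinge) loss on sequence log-likelihoods, sketched below. This is an assumed stand-in rather than the authors' loss, and the names and `margin=1.0` are hypothetical.

```python
def sequence_logp(token_logps):
    """Sequence log-likelihood as the sum of per-token log-probabilities."""
    return sum(token_logps)


def pairwise_ranking_loss(high_quality_logps, low_quality_logps, margin=1.0):
    """Hinge loss that is zero once the high-quality sample outscores the
    low-quality one by at least `margin` (an illustrative value)."""
    gap = sequence_logp(high_quality_logps) - sequence_logp(low_quality_logps)
    return max(0.0, margin - gap)
```

In a prefix-tuning setup only the prefix parameters would receive gradients from such a loss; the base model weights stay frozen.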

[AI-37] Towards Quantifying Long-Range Interactions in Graph Machine Learning: a Large Graph Dataset and a Measurement

Link: https://arxiv.org/abs/2503.09008
Authors: Huidong Liang, Haitz Sáez de Ocáriz Borde, Baskaran Sripathmanathan, Michael Bronstein, Xiaowen Dong
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: work in progress

Abstract:Long-range dependencies are critical for effective graph representation learning, yet most existing datasets focus on small graphs tailored to inductive tasks, offering limited insight into long-range interactions. Current evaluations primarily compare models employing global attention (e.g., graph transformers) with those using local neighborhood aggregation (e.g., message-passing neural networks) without a direct measurement of long-range dependency. In this work, we introduce City-Networks, a novel large-scale transductive learning dataset derived from real-world city roads. This dataset features graphs with over 10^5 nodes and significantly larger diameters than those in existing benchmarks, naturally embodying long-range information. We annotate the graphs using an eccentricity-based approach, ensuring that the classification task inherently requires information from distant nodes. Furthermore, we propose a model-agnostic measurement based on the Jacobians of neighbors from distant hops, offering a principled quantification of long-range dependencies. Finally, we provide theoretical justifications for both our dataset design and the proposed measurement - particularly by focusing on over-smoothing and influence score dilution - which establishes a robust foundation for further exploration of long-range interactions in graph neural networks.
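
The labels in City-Networks are described as eccentricity-based. As background, a node's eccentricity is its largest shortest-path distance to any other node, so computing it requires information from the far side of the graph. The sketch below computes it with breadth-first search, assuming an unweighted, connected graph given as an adjacency dict. How the paper turns eccentricities into class labels, and whether it uses weighted distances or an approximation, is not stated in the abstract.

```python
from collections import deque


def eccentricity(adj, source):
    """Largest hop distance from `source` to any reachable node (BFS)."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return max(dist.values())


# Toy example: a path graph 0 - 1 - 2 - 3 - 4.
path_graph = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
ecc = {node: eccentricity(path_graph, node) for node in path_graph}
# End nodes have eccentricity 4, the center node has eccentricity 2.
```

Running one BFS per node costs O(V(V + E)), which is expensive on graphs with more than 10^5 nodes, so a practical pipeline would likely need an approximation.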

[AI-38] KNighter: Transforming Static Analysis with LLM-Synthesized Checkers

Link: https://arxiv.org/abs/2503.09002
Authors: Chenyuan Yang, Zijie Zhao, Zichen Xie, Haoyu Li, Lingming Zhang
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Operating Systems (cs.OS)

Abstract:Static analysis is a powerful technique for bug detection in critical systems like operating system kernels. However, designing and implementing static analyzers is challenging, time-consuming, and typically limited to predefined bug patterns. While large language models (LLMs) have shown promise for static analysis, directly applying them to scan large codebases remains impractical due to computational constraints and contextual limitations. We present KNighter, the first approach that unlocks practical LLM-based static analysis by automatically synthesizing static analyzers from historical bug patterns. Rather than using LLMs to directly analyze massive codebases, our key insight is leveraging LLMs to generate specialized static analyzers guided by historical patch knowledge. KNighter implements this vision through a multi-stage synthesis pipeline that validates checker correctness against original patches and employs an automated refinement process to iteratively reduce false positives. Our evaluation on the Linux kernel demonstrates that KNighter generates high-precision checkers capable of detecting diverse bug patterns overlooked by existing human-written analyzers. To date, KNighter-synthesized checkers have discovered 70 new bugs/vulnerabilities in the Linux kernel, with 56 confirmed and 41 already fixed. 11 of these findings have been assigned CVE numbers. This work establishes an entirely new paradigm for scalable, reliable, and traceable LLM-based static analysis for real-world systems via checker synthesis.

[AI-39] DistJoin: A Decoupled Join Cardinality Estimator based on Adaptive Neural Predicate Modulation

Link: https://arxiv.org/abs/2503.08994
Authors: Kaixin Zhang, Hongzhi Wang, Ziqi Li, Yabin Lu, Yingze Li, Yu Yan, Yiming Guan
Subjects: Databases (cs.DB); Artificial Intelligence (cs.AI)

Abstract:Research on learned cardinality estimation has achieved significant progress in recent years. However, existing methods still face distinct challenges that hinder their practical deployment in production environments. We conceptualize these challenges as the “Trilemma of Cardinality Estimation”, where learned cardinality estimation methods struggle to balance generality, accuracy, and updatability. To address these challenges, we introduce DistJoin, a join cardinality estimator based on efficient distribution prediction using multi-autoregressive models. Our contributions are threefold: (1) We propose a method for estimating both equi and non-equi join cardinality by leveraging the conditional probability distributions of individual tables in a decoupled manner. (2) To meet the requirements of efficient training and inference for DistJoin, we develop Adaptive Neural Predicate Modulation (ANPM), a high-throughput conditional probability distribution estimation model. (3) We formally analyze the variance of existing similar methods and demonstrate that such approaches suffer from variance accumulation issues. To mitigate this problem, DistJoin employs a selectivity-based approach rather than a count-based approach to infer join cardinality, effectively reducing variance. In summary, DistJoin not only represents the first data-driven method to effectively support both equi and non-equi joins but also demonstrates superior accuracy while enabling fast and flexible updates. We evaluate DistJoin on JOB-light and JOB-light-ranges, extending the evaluation to non-equi join conditions. The results demonstrate that our approach achieves the highest accuracy, robustness to data updates, generality, and comparable update and inference speed relative to existing methods.

[AI-40] FP3: A 3D Foundation Policy for Robotic Manipulation

Link: https://arxiv.org/abs/2503.08950
Authors: Rujia Yang, Geng Chen, Chuan Wen, Yang Gao
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: Project website: this https URL

Abstract:Following its success in natural language processing and computer vision, foundation models that are pre-trained on large-scale multi-task datasets have also shown great potential in robotics. However, most existing robot foundation models rely solely on 2D image observations, ignoring 3D geometric information, which is essential for robots to perceive and reason about the 3D world. In this paper, we introduce FP3, a first large-scale 3D foundation policy model for robotic manipulation. FP3 builds on a scalable diffusion transformer architecture and is pre-trained on 60k trajectories with point cloud observations. With the model design and diverse pre-training data, FP3 can be efficiently fine-tuned for downstream tasks while exhibiting strong generalization capabilities. Experiments on real robots demonstrate that with only 80 demonstrations, FP3 is able to learn a new task with over 90% success rates in novel environments with unseen objects, significantly surpassing existing robot foundation models.

[AI-41] Simulator Ensembles for Trustworthy Autonomous Driving Testing

Link: https://arxiv.org/abs/2503.08936
Authors: Lev Sorokin, Matteo Biagiola, Andrea Stocco
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Robotics (cs.RO)

Abstract:Scenario-based testing with driving simulators is extensively used to identify failing conditions of automated driving assistance systems (ADAS) and reduce the amount of in-field road testing. However, existing studies have shown that repeated test execution in the same as well as in distinct simulators can yield different outcomes, which can be attributed to sources of flakiness or different implementations of the physics, among other factors. In this paper, we present MultiSim, a novel approach to multi-simulation ADAS testing based on a search-based testing approach that leverages an ensemble of simulators to identify failure-inducing, simulator-agnostic test scenarios. During the search, each scenario is evaluated jointly on multiple simulators. Scenarios that produce consistent results across simulators are prioritized for further exploration, while those that fail on only a subset of simulators are given less priority, as they may reflect simulator-specific issues rather than generalizable failures. Our case study, which involves testing a deep neural network-based ADAS on different pairs of three widely used simulators, demonstrates that MultiSim outperforms single-simulator testing by achieving on average a higher rate of simulator-agnostic failures by 51%. Compared to a state-of-the-art multi-simulator approach that combines the outcome of independent test generation campaigns obtained in different simulators, MultiSim identifies 54% more simulator-agnostic failing tests while showing a comparable validity rate. An enhancement of MultiSim that leverages surrogate models to predict simulator disagreements and bypass executions does not only increase the average number of valid failures but also improves efficiency in finding the first valid failure.

[AI-42] Robust Unsupervised Fault Diagnosis For High-Dimensional Nonlinear Noisy Data

Link: https://arxiv.org/abs/2503.08916
Authors: Dandan Zhao, Hongpeng Yin, Jintang Bian, Han Zhou
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Traditional fault diagnosis methods struggle to handle fault data with complex characteristics such as high dimensionality and heavy noise. Deep learning is a promising solution, but it typically works well only when labeled fault data are available. To address these problems, a robust unsupervised fault diagnosis method using machine learning is proposed in this paper. First, a special dimension reduction method for the high-dimensional fault data is designed. Second, the extracted features are enhanced by incorporating nonlinear information through the learning of a graph structure. Third, to alleviate the problem of reduced fault-diagnosis accuracy attributed to noise and outliers, $\ell_{2,1}$-norm and typicality-aware constraints are introduced from the perspective of model optimization, respectively. Finally, this paper provides comprehensive theoretical and experimental evidence supporting the effectiveness and robustness of the proposed method. The experiments on both the benchmark Tennessee-Eastman process and a real hot-steel milling process show that the proposed method exhibits better robustness compared to other methods, maintaining high diagnostic accuracy even in the presence of outliers or noise.
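For readers unfamiliar with the norm named in this abstract, the following is a minimal sketch of the usual row-wise definition, in which the Euclidean norms of the rows of a matrix are summed. The row-wise convention and the function name `l21_norm` are assumptions made for illustration; the abstract does not say how the paper applies the norm.

```python
import math


def l21_norm(rows):
    """Sum of the Euclidean norms of each row (row-wise convention, assumed)."""
    return sum(math.sqrt(sum(v * v for v in row)) for row in rows)


def squared_frobenius_norm(rows):
    """Sum of all squared entries, shown only for comparison."""
    return sum(v * v for row in rows for v in row)


# Hypothetical residual matrix: two small rows and one outlier row.
residual = [[0.1, 0.0], [0.0, 0.2], [3.0, 4.0]]

# The outlier row contributes its norm (5.0) to the l2,1 value but its
# squared norm (25.0) to the squared Frobenius value.
print(l21_norm(residual))
print(squared_frobenius_norm(residual))
```

Because each row enters linearly rather than quadratically, a single outlier row weighs less heavily than under a squared-error penalty, which is the usual motivation for this norm in robust models.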

[AI-43] Imitation Learning of Correlated Policies in Stackelberg Games

Link: https://arxiv.org/abs/2503.08883
Authors: Kuang-Da Wang, Ping-Chun Hsieh, Wen-Chih Peng
Subjects: Artificial Intelligence (cs.AI)
Comments: Preprint. Code will be released at this GitHub link: this https URL

Abstract:Stackelberg games, widely applied in domains like economics and security, involve asymmetric interactions where a leader’s strategy drives follower responses. Accurately modeling these dynamics allows domain experts to optimize strategies in interactive scenarios, such as turn-based sports like badminton. In multi-agent systems, agent behaviors are interdependent, and traditional Multi-Agent Imitation Learning (MAIL) methods often fail to capture these complex interactions. Correlated policies, which account for opponents’ strategies, are essential for accurately modeling such dynamics. However, even methods designed for learning correlated policies, like CoDAIL, struggle in Stackelberg games due to their asymmetric decision-making, where leaders and followers cannot simultaneously account for each other’s actions, often leading to non-correlated policies. Furthermore, existing MAIL methods that match occupancy measures or use adversarial techniques like GAIL or Inverse RL face scalability challenges, particularly in high-dimensional environments, and suffer from unstable training. To address these challenges, we propose a correlated policy occupancy measure specifically designed for Stackelberg games and introduce the Latent Stackelberg Differential Network (LSDN) to match it. LSDN models two-agent interactions as shared latent state trajectories and uses multi-output Geometric Brownian Motion (MO-GBM) to effectively capture joint policies. By leveraging MO-GBM, LSDN disentangles environmental influences from agent-driven transitions in latent space, enabling the simultaneous learning of interdependent policies. This design eliminates the need for adversarial training and simplifies the learning process. Extensive experiments on Iterative Matrix Games and multi-agent particle environments demonstrate that LSDN can better reproduce complex interaction dynamics than existing MAIL methods.

[AI-44] Meta-Reinforcement Learning with Discrete World Models for Adaptive Load Balancing

Link: https://arxiv.org/abs/2503.08872
Authors: Cameron Redovian
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Operating Systems (cs.OS)
Comments: 6 pages, 1 figure, to be published in ACMSE 2025

Abstract:We integrate a meta-reinforcement learning algorithm with the DreamerV3 architecture to improve load balancing in operating systems. This approach enables rapid adaptation to dynamic workloads with minimal retraining, outperforming the Advantage Actor-Critic (A2C) algorithm in standard and adaptive trials. It demonstrates robust resilience to catastrophic forgetting, maintaining high performance under varying workload distributions and sizes. These findings have important implications for optimizing resource management and performance in modern operating systems. By addressing the challenges posed by dynamic and heterogeneous workloads, our approach advances the adaptability and efficiency of reinforcement learning in real-world system management tasks.

[AI-45] Robust Multi-Objective Controlled Decoding of Large Language Models

Link: https://arxiv.org/abs/2503.08796
Authors: Seongho Son, William Bankes, Sangwoong Yoon, Shyam Sundhar Ramesh, Xiaohang Tang, Ilija Bogunovic
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 24 pages, 9 figures

Abstract:Test-time alignment of Large Language Models (LLMs) to human preferences offers a flexible way to generate responses aligned to diverse objectives without extensive retraining of LLMs. Existing methods achieve alignment to multiple objectives simultaneously (e.g., instruction-following, helpfulness, conciseness) by optimizing their corresponding reward functions. However, they often rely on predefined weights or optimize for averages, sacrificing one objective for another and leading to unbalanced outcomes. To address this, we introduce Robust Multi-Objective Decoding (RMOD), a novel inference-time algorithm that optimizes for improving worst-case rewards. RMOD formalizes the robust decoding problem as a maximin two-player game between reward weights and the sampling policy, solving for the Nash equilibrium. We show that the game reduces to a convex optimization problem to find the worst-case weights, while the best response policy can be computed analytically. We also introduce a practical RMOD variant designed for efficient decoding with contemporary LLMs, incurring minimal computational overhead compared to non-robust Multi-Objective Decoding (MOD) methods. Our experimental results showcase the effectiveness of RMOD in generating responses equitably aligned with diverse objectives, outperforming baselines up to 20%.
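The maximin idea in this abstract can be illustrated with a toy selection problem. This is only a sketch under stated assumptions, not the RMOD algorithm: it picks one response from a fixed candidate set, and the names `pick_robust`, `pick_weighted`, and the reward values are hypothetical. It relies on the fact that the minimum of a weighted sum over all weight vectors on the simplex is reached at a vertex, so the worst-case reward of a response equals its lowest per-objective reward.

```python
def pick_weighted(candidates, weights):
    """Pick the candidate with the highest fixed-weight average reward."""
    return max(
        candidates,
        key=lambda name: sum(w * r for w, r in zip(weights, candidates[name])),
    )


def pick_robust(candidates):
    """Pick the candidate whose worst objective is best (maximin).

    min over simplex weights of sum_i w_i * r_i equals min_i r_i,
    so the inner minimization reduces to taking the smallest reward.
    """
    return max(candidates, key=lambda name: min(candidates[name]))


# Hypothetical per-objective rewards (e.g., helpfulness, conciseness).
candidates = {
    "a": [1.0, 0.2],
    "b": [0.5, 0.5],
    "c": [0.7, 0.4],
}

print(pick_weighted(candidates, [0.5, 0.5]))  # favors the high average
print(pick_robust(candidates))                # favors the balanced response
```

With uniform weights the average-based rule selects the response that neglects the second objective, while the maximin rule selects the balanced one.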

[AI-46] Combining Local Symmetry Exploitation and Reinforcement Learning for Optimised Probabilistic Inference – A Work In Progress IJCAI2024

Link: https://arxiv.org/abs/2503.08786
Authors: Sagad Hamid, Tanya Braun
Subjects: Artificial Intelligence (cs.AI)
Comments: Contributed to: Sixth Data Science Meets Optimisation (DSO) Workshop at IJCAI 2024

Abstract:Efficient probabilistic inference by variable elimination in graphical models requires an optimal elimination order. However, finding an optimal order is a challenging combinatorial optimisation problem for models with a large number of random variables. Most recently, a reinforcement learning approach has been proposed to find efficient contraction orders in tensor networks. Due to the duality between graphical models and tensor networks, we adapt this approach to probabilistic inference in graphical models. Furthermore, we incorporate structure exploitation into the process of finding an optimal order. Currently, the agent’s cost function is formulated in terms of intermediate result sizes which are exponential in the number of indices (i.e., random variables). We show that leveraging specific structures during inference allows for introducing compact encodings of intermediate results which can be significantly smaller. By considering the compact encoding sizes for the cost function instead, we enable the agent to explore more efficient contraction orders. The structure we consider in this work is the presence of local symmetries (i.e., symmetries within a model’s factors).
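To make concrete why the elimination order matters, here is a small sketch that scores an order by the total size of the intermediate tables it creates. The cost definition (sum of the table sizes over the neighbours of each eliminated variable) and the function name `elimination_cost` are assumptions for illustration; the sketch contains neither the reinforcement learning agent nor the symmetry-based compact encodings described in the abstract.

```python
from itertools import combinations


def elimination_cost(edges, domain_size, order):
    """Total size of intermediate results for a given elimination order.

    edges: pairs of variables that share a factor.
    domain_size: dict mapping each variable to its number of values.
    order: sequence in which the variables are eliminated.
    """
    neighbors = {v: set() for v in domain_size}
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)

    total = 0
    for v in order:
        scope = neighbors.pop(v)
        # Summing out v leaves a table over all of its current neighbours.
        size = 1
        for u in scope:
            size *= domain_size[u]
        total += size
        # Remove v and connect its neighbours (fill-in edges).
        for u in scope:
            neighbors[u].discard(v)
        for a, b in combinations(scope, 2):
            neighbors[a].add(b)
            neighbors[b].add(a)
    return total


# Star-shaped model: hub "h" shares a factor with each binary leaf.
edges = [("h", "a"), ("h", "b"), ("h", "c")]
sizes = {"h": 2, "a": 2, "b": 2, "c": 2}

print(elimination_cost(edges, sizes, ["a", "b", "c", "h"]))  # leaves first
print(elimination_cost(edges, sizes, ["h", "a", "b", "c"]))  # hub first
```

Eliminating the hub first couples all leaves into one large table, so that order is more expensive than eliminating the leaves first.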

[AI-47] Neurosymbolic Decision Trees

Link: https://arxiv.org/abs/2503.08762
Authors: Matthias Möller, Arvid Norlander, Pedro Zuidberg Dos Martires, Luc De Raedt
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Logic in Computer Science (cs.LO)
Comments:

Abstract:Neurosymbolic (NeSy) AI studies the integration of neural networks (NNs) and symbolic reasoning based on logic. Usually, NeSy techniques focus on learning the neural, probabilistic and/or fuzzy parameters of NeSy models. Learning the symbolic or logical structure of such models has, so far, received less attention. We introduce neurosymbolic decision trees (NDTs), as an extension of decision trees together with a novel NeSy structure learning algorithm, which we dub NeuID3. NeuID3 adapts the standard top-down induction of decision tree algorithms and combines it with a neural probabilistic logic representation, inherited from the DeepProbLog family of models. The key advantage of learning NDTs with NeuID3 is the support of both symbolic and subsymbolic data (such as images), and that they can exploit background knowledge during the induction of the tree structure. In our experimental evaluation we demonstrate the benefits of NeSy structure learning over more traditional approaches such as purely data-driven learning with neural networks.

[AI-48] Heterogeneous Graph Structure Learning through the Lens of Data-generating Processes

Link: https://arxiv.org/abs/2503.08760
Authors: Keyue Jiang, Bohan Tang, Xiaowen Dong, Laura Toni
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments:

Abstract:Inferring the graph structure from observed data is a key task in graph machine learning to capture the intrinsic relationship between data entities. While significant advancements have been made in learning the structure of homogeneous graphs, many real-world graphs exhibit heterogeneous patterns where nodes and edges have multiple types. This paper fills this gap by introducing the first approach for heterogeneous graph structure learning (HGSL). To this end, we first propose a novel statistical model for the data-generating process (DGP) of heterogeneous graph data, namely hidden Markov networks for heterogeneous graphs (H2MN). Then we formalize HGSL as a maximum a posteriori estimation problem parameterized by such a DGP and derive an alternating optimization method to obtain a solution together with a theoretical justification of the optimization conditions. Finally, we conduct extensive experiments on both synthetic and real-world datasets to demonstrate that our proposed method excels in learning structure on heterogeneous graphs in terms of edge type identification and edge weight recovery.

[AI-49] Source-free domain adaptation based on label reliability for cross-domain bearing fault diagnosis

Link: https://arxiv.org/abs/2503.08749
Authors: Wenyi Wu, Hao Zhang, Zhisen Wei, Xiao-Yuan Jing, Qinghua Zhang, Songsong Wu
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 15 pages, 12 figures and 7 tables

Abstract:Source-free domain adaptation (SFDA) has been exploited for cross-domain bearing fault diagnosis without access to source data. Current methods select partial target samples with reliable pseudo-labels for model adaptation, which is sub-optimal due to the ignored target samples. We argue that every target sample can contribute to model adaptation, and accordingly propose in this paper a novel SFDA-based approach for bearing fault diagnosis that exploits both reliable and unreliable pseudo-labels. We develop a data-augmentation-based label voting strategy to divide the target samples into reliable and unreliable ones. We propose to explore the underlying relation between feature space and label space by using the reliable pseudo-labels as ground-truth labels, while alleviating negative transfer by maximizing the entropy of the unreliable pseudo-labels. The proposed method achieves a good balance between discriminability and diversity by taking advantage of reliable and unreliable pseudo-labels. Extensive experiments are conducted on two bearing fault benchmarks, demonstrating that our approach achieves significant performance improvements against existing SFDA-based bearing fault diagnosis methods. Our code is available at this https URL.
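The two ingredients named in this abstract, label voting over augmented views and an entropy term for unreliable samples, can be sketched as follows. This is a hedged illustration only: the agreement rule (all views must agree by default), the parameter `min_agreement`, and both function names are assumptions, since the abstract does not specify the paper's voting criterion.

```python
import math
from collections import Counter


def vote(predictions, min_agreement=1.0):
    """Return (majority label, is_reliable) for one target sample.

    predictions: labels predicted for several augmented views of the sample.
    min_agreement: fraction of views that must agree (hypothetical parameter).
    """
    label, count = Counter(predictions).most_common(1)[0]
    return label, count / len(predictions) >= min_agreement


def entropy(probs):
    """Shannon entropy of a predicted class distribution (natural log)."""
    return -sum(p * math.log(p) for p in probs if p > 0)


# Views agree: the pseudo-label would be treated as ground truth.
print(vote([2, 2, 2]))
# Views disagree: the sample would instead enter the entropy term.
print(vote([2, 1, 2]))
print(entropy([0.5, 0.5]), entropy([0.99, 0.01]))
```

Maximizing the entropy on disagreeing samples pushes their predictions away from a confident but possibly wrong class, which is one way to read the abstract's remark about alleviating negative transfer.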

[AI-50] Mirror Descent and Novel Exponentiated Gradient Algorithms Using Trace-Form Entropies and Deformed Logarithms

Link: https://arxiv.org/abs/2503.08748
Authors: Andrzej Cichocki, Toshihisa Tanaka, Sergio Cruces
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 22 pages, 9 figures

Abstract:In this paper we propose and investigate a wide class of Mirror Descent (MD) updates and associated novel Generalized Exponentiated Gradient (GEG) algorithms by exploiting various trace-form entropies and associated deformed logarithms and their inverses, the deformed (generalized) exponential functions. The proposed algorithms can be considered an extension of entropic MD and a generalization of multiplicative updates. The literature now contains over fifty mathematically well-defined generalized entropies, so it is impossible to exploit all of them in one research paper. We therefore focus on a few of the most popular entropies and associated logarithms, such as the Tsallis, Kaniadakis and Sharma-Taneja-Mittal entropies, and some of their extensions, such as the Tempesta or Kaniadakis-Scarfone entropies. The shape and properties of the deformed logarithms and their inverses are tuned by one or more hyperparameters. By learning these hyperparameters, we can adapt to the distribution of the training data, and the updates can be tailored to the specific geometry of the optimization problem, leading to potentially faster convergence and better performance. Using generalized entropies and associated deformed logarithms in the Bregman divergence, which serves as a regularization term, provides new insight into exponentiated gradient descent updates.
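As a concrete instance of a deformed logarithm, the sketch below uses the standard Tsallis q-logarithm and its inverse, and plugs them into a mirror-descent-style update. The update form (apply the deformed logarithm, take a gradient step, map back with the deformed exponential, then renormalize) and the function `geg_step` are assumptions for illustration and may differ from the updates derived in the paper; the sketch is intended for q no greater than 1.

```python
import math


def tsallis_log(x, q):
    """Tsallis q-logarithm; reduces to the natural log as q -> 1."""
    if q == 1.0:
        return math.log(x)
    return (x ** (1.0 - q) - 1.0) / (1.0 - q)


def tsallis_exp(x, q):
    """Tsallis q-exponential, the inverse of tsallis_log on its range."""
    if q == 1.0:
        return math.exp(x)
    base = 1.0 + (1.0 - q) * x
    if base <= 0.0:
        # Outside the range of the q-logarithm; clipped for q < 1.
        return 0.0 if q < 1.0 else math.inf
    return base ** (1.0 / (1.0 - q))


def geg_step(weights, grads, eta, q):
    """One generalized exponentiated gradient step (illustrative form)."""
    raw = [tsallis_exp(tsallis_log(w, q) - eta * g, q)
           for w, g in zip(weights, grads)]
    total = sum(raw)
    return [r / total for r in raw]  # renormalization is an assumption


# With q = 1 the step is the classical exponentiated gradient update.
print(geg_step([0.5, 0.5], [1.0, 0.0], eta=math.log(2), q=1.0))
# With q = 0.5 the same gradient produces a different multiplicative shape.
print(geg_step([0.5, 0.5], [1.0, 0.0], eta=0.5, q=0.5))
```

Here q plays the role of the hyperparameter mentioned in the abstract: changing it changes the link function and therefore the geometry of the update.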

[AI-51] HeGMN: Heterogeneous Graph Matching Network for Learning Graph Similarity

Link: https://arxiv.org/abs/2503.08739
Authors: Shilong Sang, Ke-Jia Chen, Zheng Liu
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Graph similarity learning (GSL), also referred to as graph matching in many scenarios, is a fundamental problem in computer vision, pattern recognition, and graph learning. However, previous GSL methods assume that graphs are homogeneous and struggle to maintain their performance on heterogeneous graphs. To address this problem, this paper proposes a Heterogeneous Graph Matching Network (HeGMN), which is an end-to-end graph similarity learning framework composed of a two-tier matching mechanism. Firstly, a heterogeneous graph isomorphism network is proposed as the encoder, which reinvents the graph isomorphism network for heterogeneous graphs by perceiving different semantic relationships during aggregation. Secondly, graph-level and node-level matching modules are designed, both employing type-aligned matching principles. The former conducts graph-level matching by node type alignment, and the latter computes the interactions between cross-graph nodes of the same type, thus reducing noise interference and computational overhead. Finally, the graph-level and node-level matching features are combined and fed into fully connected layers for predicting graph similarity scores. In experiments, we propose a heterogeneous graph resampling method to construct heterogeneous graph pairs and define the corresponding heterogeneous graph edit distance, filling the gap in missing datasets. Extensive experiments demonstrate that HeGMN consistently achieves advanced performance on graph similarity prediction across all datasets.

[AI-52] Zero-to-One IDV: A Conceptual Model for AI-Powered Identity Verification

Link: https://arxiv.org/abs/2503.08734
Authors: Aniket Vaidya, Anurag Awasthi
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: 7 pages

Abstract:In today’s increasingly digital interactions, robust Identity Verification (IDV) is crucial for security and trust. Artificial Intelligence (AI) is transforming IDV, enhancing accuracy and fraud detection. This paper introduces “Zero to One,” a holistic conceptual framework for developing AI-powered IDV products. This paper outlines the foundational problem and research objectives that necessitate a new framework for IDV in the age of AI. It details the evolution of identity verification and the current regulatory landscape to contextualize the need for a robust conceptual model. The core of the paper is the presentation of the “Zero to One” framework itself, dissecting its four essential components: Document Verification, Biometric Verification, Risk Assessment, and Orchestration. The paper concludes by discussing the implications of this conceptual model and suggesting future research directions focused on the framework’s further development and application. The framework addresses security, privacy, UX, and regulatory compliance, offering a structured approach to building effective IDV solutions. Successful IDV platforms require a balanced conceptual understanding of verification methods, risk management, and operational scalability, with AI as a key enabler. This paper presents the “Zero to One” framework as a refined conceptual model, detailing verification layers, and AI’s transformative role in shaping next-generation IDV products.

[AI-53] Enhancing Traffic Signal Control through Model-based Reinforcement Learning and Policy Reuse

Link: https://arxiv.org/abs/2503.08728
Authors: Yihong Li, Chengwei Zhang, Furui Zhan, Wanting Liu, Kailing Zhou, Longji Zheng
Subjects: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
Comments:

Abstract:Multi-agent reinforcement learning (MARL) has shown significant potential in traffic signal control (TSC). However, current MARL-based methods often suffer from insufficient generalization due to the fixed traffic patterns and road network conditions used during training. This limitation results in poor adaptability to new traffic scenarios, leading to high retraining costs and complex deployment. To address this challenge, we propose two algorithms: PLight and PRLight. PLight employs a model-based reinforcement learning approach, pretraining control policies and environment models using predefined source-domain traffic scenarios. The environment model predicts the state transitions, which facilitates the comparison of environmental features. PRLight further enhances adaptability by adaptively selecting pre-trained PLight agents based on the similarity between the source and target domains to accelerate the learning process in the target domain. We evaluated the algorithms through two transfer settings: (1) adaptability to different traffic scenarios within the same road network, and (2) generalization across different road networks. The results show that PRLight significantly reduces the adaptation time compared to learning from scratch in new TSC scenarios, achieving optimal performance using similarities between available and target scenarios.

[AI-54] Training Plug-n-Play Knowledge Modules with Deep Context Distillation

Link: https://arxiv.org/abs/2503.08727
Authors: Lucas Caccia, Alan Ansell, Edoardo Ponti, Ivan Vulić, Alessandro Sordoni
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Preprint

Abstract:Dynamically integrating new or rapidly evolving information after (Large) Language Model pre-training remains challenging, particularly in low-data scenarios or when dealing with private and specialized documents. In-context learning and retrieval-augmented generation (RAG) face limitations, including their high inference costs and their inability to capture global document information. In this paper, we propose a way of modularizing knowledge by training document-level Knowledge Modules (KMs). KMs are lightweight components implemented as parameter-efficient LoRA modules, which are trained to store information about new documents and can be easily plugged into models on demand. We show that next-token prediction performs poorly as the training objective for KMs. We instead propose Deep Context Distillation: we learn KMs parameters such as to simulate hidden states and logits of a teacher that takes the document in context. Our method outperforms standard next-token prediction and pre-instruction training techniques, across two datasets. Finally, we highlight synergies between KMs and retrieval-augmented generation.

[AI-55] SIMAC: A Semantic-Driven Integrated Multimodal Sensing And Communication Framework

Link: https://arxiv.org/abs/2503.08726
Authors: Yubo Peng, Luping Xiang, Kun Yang, Feibo Jiang, Kezhi Wang, Dapeng Oliver Wu
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Signal Processing (eess.SP)
Comments:

Abstract:Traditional single-modality sensing faces limitations in accuracy and capability, and its decoupled implementation with communication systems increases latency in bandwidth-constrained environments. Additionally, single-task-oriented sensing systems fail to address users’ diverse demands. To overcome these challenges, we propose a semantic-driven integrated multimodal sensing and communication (SIMAC) framework. This framework leverages a joint source-channel coding architecture to achieve simultaneous sensing decoding and transmission of sensing results. Specifically, SIMAC first introduces a multimodal semantic fusion (MSF) network, which employs two extractors to extract semantic information from radar signals and images, respectively. MSF then applies cross-attention mechanisms to fuse these unimodal features and generate multimodal semantic representations. Secondly, we present a large language model (LLM)-based semantic encoder (LSE), where relevant communication parameters and multimodal semantics are mapped into a unified latent space and input to the LLM, enabling channel-adaptive semantic encoding. Thirdly, a task-oriented sensing semantic decoder (SSD) is proposed, in which different decoded heads are designed according to the specific needs of tasks. Simultaneously, a multi-task learning strategy is introduced to train the SIMAC framework, achieving diverse sensing services. Finally, experimental simulations demonstrate that the proposed framework achieves diverse sensing services and higher accuracy.

[AI-56] The Algorithmic State Architecture (ASA): An Integrated Framework for AI-Enabled Government

Link: https://arxiv.org/abs/2503.08725
Authors: Zeynep Engin, Jon Crowcroft, David Hand, Philip Treleaven
Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Multiagent Systems (cs.MA); Systems and Control (eess.SY)
Comments: Main text: 25 pages, with references: 35 pages, 2 figures

Abstract:As artificial intelligence transforms public sector operations, governments struggle to integrate technological innovations into coherent systems for effective service delivery. This paper introduces the Algorithmic State Architecture (ASA), a novel four-layer framework conceptualising how Digital Public Infrastructure, Data-for-Policy, Algorithmic Government/Governance, and GovTech interact as an integrated system in AI-enabled states. Unlike approaches that treat these as parallel developments, ASA positions them as interdependent layers with specific enabling relationships and feedback mechanisms. Through comparative analysis of implementations in Estonia, Singapore, India, and the UK, we demonstrate how foundational digital infrastructure enables systematic data collection, which powers algorithmic decision-making processes, ultimately manifesting in user-facing services. Our analysis reveals that successful implementations require balanced development across all layers, with particular attention to integration mechanisms between them. The framework contributes to both theory and practice by bridging previously disconnected domains of digital government research, identifying critical dependencies that influence implementation success, and providing a structured approach for analysing the maturity and development pathways of AI-enabled government systems.

[AI-57] AI for Just Work: Constructing Diverse Imaginations of AI beyond “Replacing Humans”

Link: https://arxiv.org/abs/2503.08720
Authors: Weina Jin, Nicholas Vincent, Ghassan Hamarneh
Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments:

Abstract:The AI community usually focuses on “how” to develop AI techniques, but lacks thorough open discussions on “why” we develop AI. Lacking critical reflections on the general visions and purposes of AI may make the community vulnerable to manipulation. In this position paper, we explore the “why” question of AI. We denote answers to the “why” question the imaginations of AI, which depict our general visions, frames, and mindsets for the prospects of AI. We identify that the prevailing vision in the AI community is largely a monoculture that emphasizes objectives such as replacing humans and improving productivity. Our critical examination of this mainstream imagination highlights its underpinning and potentially unjust assumptions. We then call to diversify our collective imaginations of AI, embedding ethical assumptions from the outset in the imaginations of AI. To facilitate the community’s pursuit of diverse imaginations, we demonstrate one process for constructing a new imagination of “AI for just work,” and showcase its application in the medical image synthesis task to make it more ethical. We hope this work will help the AI community to open dialogues with civil society on the visions and purposes of AI, and inspire more technical works and advocacy in pursuit of diverse and ethical imaginations to restore the value of AI for the public good.

[AI-58] A Semantic Link Network Model for Supporting Traceability of Logistics on Blockchain

Link: https://arxiv.org/abs/2503.08717
Authors: Xiaoping Sun, Sirui Zhuge, Hai Zhuge
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Databases (cs.DB); Social and Information Networks (cs.SI)
Comments:

Abstract:The ability of tracing states of logistic transportations requires an efficient storage and retrieval of the state of logistic transportations and locations of logistic objects. However, the restriction of sharing states and locations of logistic objects across organizations from different countries makes it hard to deploy a centralized database for implementing the traceability in a cross-border logistic system. This paper proposes a semantic data model on Blockchain to represent a logistic process based on the Semantic Link Network model where each semantic link represents a logistic transportation of a logistic object between two parties. A state representation model is designed to represent the states of a logistic transportation with semantic links. It enables the locations of logistic objects to be derived from the link states. A mapping from the semantic links to the blockchain transactions is designed to enable schema of semantic links and states of semantic links to be published in blockchain transactions. To improve the efficiency of tracing a path of semantic links on blockchain platform, an algorithm is designed to build shortcuts along the path of semantic links to enable a query on the path of a logistic object to reach the target in logarithmic steps on the blockchain platform. A reward-penalty policy is designed to allow participants to confirm the state of links on blockchain. Analysis and simulation demonstrate the flexibility, effectiveness and the efficiency of Semantic Link Network on immutable blockchain for implementing logistic traceability.

[AI-59] AuthorMist: Evading AI Text Detectors with Reinforcement Learning

Link: https://arxiv.org/abs/2503.08716
Authors: Isaac David, Arthur Gervais
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:In the age of powerful AI-generated text, automatic detectors have emerged to identify machine-written content. This poses a threat to author privacy and freedom, as text authored with AI assistance may be unfairly flagged. We propose AuthorMist, a novel reinforcement learning-based system to transform AI-generated text into human-like writing. AuthorMist leverages a 3-billion-parameter language model as a backbone, fine-tuned with Group Relative Policy Optimization (GRPO) to paraphrase text in a way that evades AI detectors. Our framework establishes a generic approach where external detector APIs (GPTZero, WinstonAI, this http URL, etc.) serve as reward functions within the reinforcement learning loop, enabling the model to systematically learn outputs that these detectors are less likely to classify as AI-generated. This API-as-reward methodology can be applied broadly to optimize text against any detector with an accessible interface. Experiments on multiple datasets and detectors demonstrate that AuthorMist effectively reduces the detectability of AI-generated text while preserving the original meaning. Our evaluation shows attack success rates ranging from 78.6% to 96.2% against individual detectors, significantly outperforming baseline paraphrasing methods. AuthorMist maintains high semantic similarity (above 0.94) with the original text while successfully evading detection. These results highlight limitations in current AI text detection technologies and raise questions about the sustainability of the detection-evasion arms race.

[AI-60] Simulating Influence Dynamics with LLM Agents

Link: https://arxiv.org/abs/2503.08709
Authors: Mehwish Nasim, Syed Muslim Gilani, Amin Qasmi, Usman Naseem
Subjects: Social and Information Networks (cs.SI); Artificial Intelligence (cs.AI)
Comments:

Abstract:This paper introduces a simulator designed for opinion dynamics researchers to model competing influences within social networks in the presence of LLM-based agents. By integrating established opinion dynamics principles with state-of-the-art LLMs, this tool enables the study of influence propagation and counter-misinformation strategies. The simulator is particularly valuable for researchers in social science, psychology, and operations research, allowing them to analyse societal phenomena without requiring extensive coding expertise. Additionally, the simulator will be openly available on GitHub, ensuring accessibility and adaptability for those who wish to extend its capabilities for their own research.

[AI-61] TH-Bench: Evaluating Evading Attacks via Humanizing AI Text on Machine-Generated Text Detectors

Link: https://arxiv.org/abs/2503.08708
Authors: Jingyi Zheng, Junfeng Wang, Zhen Sun, Wenhan Dong, Yule Liu, Xinlei He
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments:

Abstract:As Large Language Models (LLMs) advance, Machine-Generated Texts (MGTs) have become increasingly fluent, high-quality, and informative. Existing wide-range MGT detectors are designed to identify MGTs to prevent the spread of plagiarism and misinformation. However, adversaries attempt to humanize MGTs to evade detection (named evading attacks), which requires only minor modifications to bypass MGT detectors. Unfortunately, existing attacks generally lack a unified and comprehensive evaluation framework, as they are assessed using different experimental settings, model architectures, and datasets. To fill this gap, we introduce the Text-Humanization Benchmark (TH-Bench), the first comprehensive benchmark to evaluate evading attacks against MGT detectors. TH-Bench evaluates attacks across three key dimensions: evading effectiveness, text quality, and computational overhead. Our extensive experiments evaluate 6 state-of-the-art attacks against 13 MGT detectors across 6 datasets, spanning 19 domains and generated by 11 widely used LLMs. Our findings reveal that no single evading attack excels across all three dimensions. Through in-depth analysis, we highlight the strengths and limitations of different attacks. More importantly, we identify a trade-off among three dimensions and propose two optimization insights. Through preliminary experiments, we validate their correctness and effectiveness, offering potential directions for future research.

[AI-62] Life-Cycle Routing Vulnerabilities of LLM Router

Link: https://arxiv.org/abs/2503.08704
Authors: Qiqi Lin, Xiaoyang Ji, Shengfang Zhai, Qingni Shen, Zhi Zhang, Yuejian Fang, Yansong Gao
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: Work in progress

Abstract:Large language models (LLMs) have achieved remarkable success in natural language processing, yet their performance and computational costs vary significantly. LLM routers play a crucial role in dynamically balancing these trade-offs. While previous studies have primarily focused on routing efficiency, security vulnerabilities throughout the entire LLM router life cycle, from training to inference, remain largely unexplored. In this paper, we present a comprehensive investigation into the life-cycle routing vulnerabilities of LLM routers. We evaluate both white-box and black-box adversarial robustness, as well as backdoor robustness, across several representative routing models under extensive experimental settings. Our experiments uncover several key findings: 1) Mainstream DNN-based routers tend to exhibit the weakest adversarial and backdoor robustness, largely due to their strong feature extraction capabilities that amplify vulnerabilities during both training and inference; 2) Training-free routers demonstrate the strongest robustness across different attack types, benefiting from the absence of learnable parameters that can be manipulated. These findings highlight critical security risks spanning the entire life cycle of LLM routers and provide insights for developing more robust models.

[AI-63] Blockchain As a Platform For Artificial Intelligence (AI) Transparency

Link: https://arxiv.org/abs/2503.08699
Authors: Afroja Akther, Ayesha Arobee, Abdullah Al Adnan, Omum Auyon, ASM Johirul Islam, Farhad Akter
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments: 14 pages, 2 figures, 5 tables

Abstract:As artificial intelligence (AI) systems become increasingly complex and autonomous, concerns over transparency and accountability have intensified. The “black box” problem in AI decision-making limits stakeholders’ ability to understand, trust, and verify outcomes, particularly in high-stakes sectors such as healthcare, finance, and autonomous systems. Blockchain technology, with its decentralized, immutable, and transparent characteristics, presents a potential solution to enhance AI transparency and auditability. This paper explores the integration of blockchain with AI to improve decision traceability, data provenance, and model accountability. By leveraging blockchain as an immutable record-keeping system, AI decision-making can become more interpretable, fostering trust among users and regulatory compliance. However, challenges such as scalability, integration complexity, and computational overhead must be addressed to fully realize this synergy. This study discusses existing research, proposes a framework for blockchain-enhanced AI transparency, and highlights practical applications, benefits, and limitations. The findings suggest that blockchain could be a foundational technology for ensuring AI systems remain accountable, ethical, and aligned with regulatory standards.

[AI-64] Dubito Ergo Sum: Exploring AI Ethics

Link: https://arxiv.org/abs/2503.06788
Authors: Viktor Dorfler, Giles Cuthbert
Categories: Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)
Comments: 10 pages, 1 figure, HICSS 57: Hawaii International Conference on System Sciences, Honolulu, HI, published January 2024

Abstract:We paraphrase Descartes’ famous dictum in the area of AI ethics where the “I doubt and therefore I am” is suggested as a necessary aspect of morality. Therefore AI, which cannot doubt itself, cannot possess moral agency. Of course, this is not the end of the story. We explore various aspects of the human mind that substantially differ from AI, which includes the sensory grounding of our knowing, the act of understanding, and the significance of being able to doubt ourselves. The foundation of our argument is the discipline of ethics, one of the oldest and largest knowledge projects of human history, yet, we seem only to be beginning to get a grasp of it. After a couple of thousand years of studying the ethics of humans, we (humans) arrived at a point where moral psychology suggests that our moral decisions are intuitive, and all the models from ethics become relevant only when we explain ourselves. This recognition has a major impact on what and how we can do regarding AI ethics. We do not offer a solution, we explore some ideas and leave the problem open, but we hope somewhat better understood than before our study.

[AI-65] Receding Hamiltonian-Informed Optimal Neural Control and State Estimation for Closed-Loop Dynamical Systems

Link: https://arxiv.org/abs/2411.01297
Authors: Josue N. Rivera, Dengfeng Sun
Categories: Systems and Control (eess.SY); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Machine Learning (cs.LG); Robotics (cs.RO)

Abstract:This paper formalizes Hamiltonian-Informed Optimal Neural (Hion) controllers, a novel class of neural network-based controllers for dynamical systems and explicit non-linear model predictive control. Hion controllers estimate future states and compute optimal control inputs using Pontryagin’s Maximum Principle. The proposed framework allows for customization of transient behavior, addressing limitations of existing methods. The Taylored Multi-Faceted Approach for Neural ODE and Optimal Control (T-mano) architecture facilitates training and ensures accurate state estimation. Optimal control strategies are demonstrated for both linear and non-linear dynamical systems.

[AI-66] Single-Qudit Quantum Neural Networks for Multiclass Classification

Link: https://arxiv.org/abs/2503.09269
Authors: Leandro C. Souza, Renato Portugal
Categories: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 24 pages, 3 figures, 6 tables

Abstract:This paper proposes a single-qudit quantum neural network for multiclass classification that exploits the enhanced representational capacity of high-dimensional qudit states. Our design employs a d-dimensional unitary operator, where d corresponds to the number of classes, constructed using the Cayley transform of a skew-symmetric matrix, to efficiently encode and process class information. This architecture enables a direct mapping between class labels and quantum measurement outcomes, reducing circuit depth and computational overhead. To optimize network parameters, we introduce a hybrid training approach that combines an extended activation function – derived from a truncated multivariable Taylor series expansion – with support vector machine optimization for weight determination. We evaluate our model on the MNIST and EMNIST datasets, demonstrating competitive accuracy while maintaining a compact single-qudit quantum circuit. Our findings highlight the potential of qudit-based QNNs as scalable alternatives to classical deep learning models, particularly for multiclass classification. However, practical implementation remains constrained by current quantum hardware limitations. This research advances quantum machine learning by demonstrating the feasibility of higher-dimensional quantum systems for efficient learning tasks.
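
As a rough illustration of the Cayley-transform construction mentioned in the abstract, the sketch below builds a real orthogonal (hence unitary) d x d matrix from a skew-symmetric one and reads class probabilities off the squared amplitudes. The function names, the restriction to a real skew-symmetric matrix, the convention (I + A)^{-1}(I - A), and the basis-state input are all assumptions made for illustration; the abstract does not specify the paper's actual encoding or training procedure.

```python
import numpy as np


def cayley_orthogonal(params, d):
    """Map d*(d-1)/2 free parameters to a d x d orthogonal matrix.

    Hypothetical helper: fills the strict upper triangle of A with the
    parameters, makes A skew-symmetric, and applies the Cayley transform.
    """
    A = np.zeros((d, d))
    A[np.triu_indices(d, k=1)] = params
    A = A - A.T  # skew-symmetric: A.T == -A
    I = np.eye(d)
    # I + A is always invertible for real skew-symmetric A
    # (its eigenvalues are 1 + i*lambda with lambda real).
    return np.linalg.solve(I + A, I - A)


def measurement_probs(U, state):
    """Probabilities of the d measurement outcomes after applying U."""
    amplitudes = U @ state
    return np.abs(amplitudes) ** 2


d = 3  # one outcome per class
U = cayley_orthogonal(np.array([0.3, -1.2, 0.7]), d)
probs = measurement_probs(U, np.array([1.0, 0.0, 0.0]))
predicted_class = int(np.argmax(probs))
```

Because the transform yields an orthogonal matrix for any parameter values, the outcome probabilities sum to one without an explicit normalization step.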

[AI-67] Towards Interpretable Protein Structure Prediction with Sparse Autoencoders ICLR2025

Link: https://arxiv.org/abs/2503.08764
Authors: Nithin Parsan, David J. Yang, John J. Yang
Categories: Biomolecules (q-bio.BM); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Published at the GEMBio ICLR 2025 Workshop

Abstract:Protein language models have revolutionized structure prediction, but their nonlinear nature obscures how sequence representations inform structure prediction. While sparse autoencoders (SAEs) offer a path to interpretability here by learning linear representations in high-dimensional space, their application has been limited to smaller protein language models unable to perform structure prediction. In this work, we make two key advances: (1) we scale SAEs to ESM2-3B, the base model for ESMFold, enabling mechanistic interpretability of protein structure prediction for the first time, and (2) we adapt Matryoshka SAEs for protein language models, which learn hierarchically organized features by forcing nested groups of latents to reconstruct inputs independently. We demonstrate that our Matryoshka SAEs achieve comparable or better performance than standard architectures. Through comprehensive evaluations, we show that SAEs trained on ESM2-3B significantly outperform those trained on smaller models for both biological concept discovery and contact map prediction. Finally, we present an initial case study demonstrating how our approach enables targeted steering of ESMFold predictions, increasing structure solvent accessibility while fixing the input sequence. To facilitate further investigation by the broader community, we open-source our code, dataset, and pretrained models (this https URL) and visualizer (this https URL).

[AI-68] Quantifying Circadian Desynchrony in ICU Patients and Its Association with Delirium

Link: https://arxiv.org/abs/2503.08732
Authors: Yuanfang Ren, Andrea E. Davidson, Jiaqing Zhang, Miguel Contreras, Ayush K. Patel, Michelle Gumz, Tezcan Ozrazgat-Baslanti, Parisa Rashidi, Azra Bihorac
Categories: Quantitative Methods (q-bio.QM); Artificial Intelligence (cs.AI)

Abstract:Background: Circadian desynchrony, characterized by the misalignment between an individual’s internal biological rhythms and external environmental cues, significantly affects various physiological processes and health outcomes. Quantifying circadian desynchrony often requires prolonged and frequent monitoring, and currently, an easy tool for this purpose is missing. Additionally, its association with the incidence of delirium has not been clearly explored. Methods: A prospective observational study was carried out in intensive care units (ICU) of a tertiary hospital. Circadian transcriptomics of blood monocytes from 86 individuals were collected on two consecutive days, although a second sample could not be obtained from all participants. Using two public datasets comprised of healthy volunteers, we replicated a model for determining internal circadian time. We developed an approach to quantify circadian desynchrony by comparing internal circadian time and external blood collection time. We applied the model and quantified circadian desynchrony index among ICU patients, and investigated its association with the incidence of delirium. Results: The replicated model for determining internal circadian time achieved comparably high accuracy. The quantified circadian desynchrony index was significantly higher among critically ill ICU patients compared to healthy subjects, with values of 10.03 hours vs 2.50-2.95 hours (p < 0.001). Most ICU patients had a circadian desynchrony index greater than 9 hours. Additionally, the index was lower in patients whose blood samples were drawn after 3pm, with values of 5.00 hours compared to 10.01-10.90 hours in other groups (p < 0.001)…

[AI-69] A Beam Search Based Parallel Algorithm for the Two-Dimensional Strip Packing Problem

Link: https://arxiv.org/abs/2503.08711
Authors: Yajie Wen, Defu Zhang
Categories: Optimization and Control (math.OC); Artificial Intelligence (cs.AI)
Comments: 9 pages, 4 figures

Abstract:This paper introduces BSPA, a parallel algorithm that leverages beam search to address the two-dimensional strip packing problem. The study begins with a comprehensive review of existing approaches and methodologies, followed by a detailed presentation of the BSPA algorithm. Experimental results demonstrate the effectiveness of the proposed method. To facilitate further research, both the code and datasets are publicly available.

[AI-70] A Block-Based Heuristic Algorithm for the Three-Dimensional Nuclear Waste Packing Problem

Link: https://arxiv.org/abs/2503.08705
Authors: Yajie Wen, Defu Zhang
Categories: Optimization and Control (math.OC); Artificial Intelligence (cs.AI)
Comments: 10 pages, 7 figures

Abstract:In this study, we present a block-based heuristic search algorithm to address the nuclear waste container packing problem in the context of real-world nuclear power plants. Additionally, we provide a dataset comprising 1600 problem instances for future researchers to use. Experimental results on this dataset demonstrate that the proposed algorithm effectively enhances the disposal pool’s space utilization while minimizing the radiation dose within the pool. The code and data employed in this study are publicly available to facilitate reproducibility and further investigation.

Machine Learning

[LG-0] Parsing the Language of Expression: Enhancing Symbolic Regression with Domain-Aware Symbolic Priors

Link: https://arxiv.org/abs/2503.09592
Authors: Sikai Huang, Yixin Berry Wen, Tara Adusumilli, Kusum Choudhary, Haizhao Yang
Categories: Machine Learning (cs.LG); Symbolic Computation (cs.SC)

Abstract:Symbolic regression is essential for deriving interpretable expressions that elucidate complex phenomena by exposing the underlying mathematical and physical relationships in data. In this paper, we present an advanced symbolic regression method that integrates symbol priors from diverse scientific domains - including physics, biology, chemistry, and engineering - into the regression process. By systematically analyzing domain-specific expressions, we derive probability distributions of symbols to guide expression generation. We propose novel tree-structured recurrent neural networks (RNNs) that leverage these symbol priors, enabling domain knowledge to steer the learning process. Additionally, we introduce a hierarchical tree structure for representing expressions, where unary and binary operators are organized to facilitate more efficient learning. To further accelerate training, we compile characteristic expression blocks from each domain and include them in the operator dictionary, providing relevant building blocks. Experimental results demonstrate that leveraging symbol priors significantly enhances the performance of symbolic regression, resulting in faster convergence and higher accuracy.

[LG-1] Minimax Optimality of the Probability Flow ODE for Diffusion Models

Link: https://arxiv.org/abs/2503.09583
Authors: Changxiao Cai, Gen Li
Categories: Machine Learning (cs.LG); Information Theory (cs.IT); Statistics Theory (math.ST); Machine Learning (stat.ML)

Abstract:Score-based diffusion models have become a foundational paradigm for modern generative modeling, demonstrating exceptional capability in generating samples from complex high-dimensional distributions. Despite the dominant adoption of probability flow ODE-based samplers in practice due to their superior sampling efficiency and precision, rigorous statistical guarantees for these methods have remained elusive in the literature. This work develops the first end-to-end theoretical framework for deterministic ODE-based samplers that establishes near-minimax optimal guarantees under mild assumptions on target data distributions. Specifically, focusing on subgaussian distributions with \beta-Hölder smooth densities for \beta \leq 2, we propose a smooth regularized score estimator that simultaneously controls both the L^2 score error and the associated mean Jacobian error. Leveraging this estimator within a refined convergence analysis of the ODE-based sampling process, we demonstrate that the resulting sampler achieves the minimax rate in total variation distance, modulo logarithmic factors. Notably, our theory comprehensively accounts for all sources of error in the sampling process and does not require strong structural conditions such as density lower bounds or Lipschitz/smooth scores on target distributions, thereby covering a broad range of practical data distributions.

[LG-2] Manify: A Python Library for Learning Non-Euclidean Representations

Link: https://arxiv.org/abs/2503.09576
Authors: Philippe Chlenski, Kaizhu Du, Dylan Satow, Itsik Pe’er
Categories: Machine Learning (cs.LG)
Comments: 30 pages, 4 figures, 4 tables. Preprint

Abstract:We present Manify, an open-source Python library for non-Euclidean representation learning. Leveraging manifold learning techniques, Manify provides tools for learning embeddings in (products of) non-Euclidean spaces, performing classification and regression with data that lives in such spaces, and estimating the curvature of a manifold. Manify aims to advance research and applications in machine learning by offering a comprehensive suite of tools for manifold-based data analysis. Our source code, examples, datasets, results, and documentation are available at this https URL

[LG-3] Strategyproof Reinforcement Learning from Human Feedback

Link: https://arxiv.org/abs/2503.09561
Authors: Thomas Kleine Buening, Jiarui Gan, Debmalya Mandal, Marta Kwiatkowska
Categories: Machine Learning (cs.LG)

Abstract:We study Reinforcement Learning from Human Feedback (RLHF), where multiple individuals with diverse preferences provide feedback strategically to sway the final policy in their favor. We show that existing RLHF methods are not strategyproof, which can result in learning a substantially misaligned policy even when only one out of k individuals reports their preferences strategically. In turn, we also find that any strategyproof RLHF algorithm must perform k-times worse than the optimal policy, highlighting an inherent trade-off between incentive alignment and policy alignment. We then propose a pessimistic median algorithm that, under appropriate coverage assumptions, is approximately strategyproof and converges to the optimal policy as the number of individuals and samples increases.

[LG-4] Large Language Models for Multi-Facility Location Mechanism Design

Link: https://arxiv.org/abs/2503.09533
Authors: Nguyen Thach, Fei Liu, Houyu Zhou, Hau Chan
Categories: Machine Learning (cs.LG)
Comments: Under review

Abstract:Designing strategyproof mechanisms for multi-facility location that optimize social costs based on agent preferences has been challenging due to the extensive domain knowledge required and poor worst-case guarantees. Recently, deep learning models have been proposed as alternatives. However, these models require some domain knowledge and extensive hyperparameter tuning, and they lack interpretability, which is crucial in practice when transparency of the learned mechanisms is mandatory. In this paper, we introduce a novel approach, named LLMMech, that addresses these limitations by incorporating large language models (LLMs) into an evolutionary framework for generating interpretable, hyperparameter-free, empirically strategyproof, and nearly optimal mechanisms. Our experimental results, evaluated on various problem settings where the social cost is arbitrarily weighted across agents and the agent preferences may not be uniformly distributed, demonstrate that the LLM-generated mechanisms generally outperform existing handcrafted baselines and deep learning models. Furthermore, the mechanisms exhibit impressive generalizability to out-of-distribution agent preferences and to larger instances with more agents.

[LG-5] Federated Smoothing ADMM for Localization

Link: https://arxiv.org/abs/2503.09497
Authors: Reza Mirzaeifard, Ashkan Moradi, Masahiro Yukawa, Stefan Werner
Categories: Machine Learning (cs.LG)

Abstract:This paper addresses the challenge of localization in federated settings, which are characterized by distributed data, non-convexity, and non-smoothness. To tackle the scalability and outlier issues inherent in such environments, we propose a robust algorithm that employs an \ell_1-norm formulation within a novel federated ADMM framework. This approach addresses the problem by integrating an iterative smooth approximation for the total variation consensus term and employing a Moreau envelope approximation for the convex function that appears in a subtracted form. This transformation ensures that the problem is smooth and weakly convex in each iteration, which results in enhanced computational efficiency and improved estimation accuracy. The proposed algorithm supports asynchronous updates and multiple client updates per iteration, which ensures its adaptability to real-world federated systems. To validate the reliability of the proposed algorithm, we show that the method converges to a stationary point, and numerical simulations highlight its superior performance in convergence speed and outlier resilience compared to existing state-of-the-art localization methods.
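
To make the Moreau envelope smoothing mentioned in the abstract more concrete, here is a minimal sketch for the simplest non-smooth convex function, the absolute value. The function names and the smoothing parameter `mu` are illustrative assumptions. The abstract does not say which convex function the paper smooths, so this only demonstrates the general technique: the envelope of |x| is the smooth Huber function.

```python
import numpy as np


def soft_threshold(x, mu):
    """Proximal operator of |.| with parameter mu."""
    return np.sign(x) * np.maximum(np.abs(x) - mu, 0.0)


def moreau_envelope_abs(x, mu):
    """Moreau envelope of |.|: min over y of |y| + (x - y)^2 / (2 * mu).

    The minimizer is the proximal point, so we evaluate the objective there.
    """
    y = soft_threshold(x, mu)
    return np.abs(y) + (x - y) ** 2 / (2.0 * mu)


def huber(x, mu):
    """Closed form of the same envelope: quadratic near zero, linear in the tails."""
    return np.where(np.abs(x) <= mu, x ** 2 / (2.0 * mu), np.abs(x) - mu / 2.0)


xs = np.linspace(-3.0, 3.0, 13)
envelope_values = moreau_envelope_abs(xs, mu=0.5)
```

The envelope is differentiable everywhere and never exceeds |x|, which is why this kind of approximation is a common way to obtain a smooth surrogate at each iteration.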

[LG-6] Representation Retrieval Learning for Heterogeneous Data Integration

Link: https://arxiv.org/abs/2503.09494
Authors: Qi Xu, Annie Qu
Categories: Machine Learning (cs.LG); Methodology (stat.ME)

Abstract:In the era of big data, large-scale, multi-modal datasets are increasingly ubiquitous, offering unprecedented opportunities for predictive modeling and scientific discovery. However, these datasets often exhibit complex heterogeneity, such as covariate shift, posterior drift, and missing modalities, that can hinder the accuracy of existing prediction algorithms. To address these challenges, we propose a novel Representation Retrieval ( R^2 ) framework, which integrates a representation learning module (the representer) with a sparsity-induced machine learning model (the learner). Moreover, we introduce the notion of “integrativeness” for representers, characterized by the effective data sources used in learning representers, and propose a Selective Integration Penalty (SIP) to explicitly improve the property. Theoretically, we demonstrate that the R^2 framework relaxes the conventional full-sharing assumption in multi-task learning, allowing for partially shared structures, and that SIP can improve the convergence rate of the excess risk bound. Extensive simulation studies validate the empirical performance of our framework, and applications to two real-world datasets further confirm its superiority over existing approaches.

[LG-7] Learning Cascade Ranking as One Network

Link: https://arxiv.org/abs/2503.09492
Authors: Yunli Wang, Zhen Zhang, Zhiqiang Wang, Zixuan Yang, Yu Li, Jian Yang, Shiyang Wen, Peng Jiang, Kun Gai
Categories: Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments: 16 pages, 2 figures

Abstract:Cascade Ranking is a prevalent architecture in large-scale top-k selection systems like recommendation and advertising platforms. Traditional training methods focus on single-stage optimization, neglecting interactions between stages. Recent advances such as RankFlow and FS-LTR have introduced interaction-aware training paradigms but still struggle to 1) align training objectives with the goal of the entire cascade ranking (i.e., end-to-end recall) and 2) learn effective collaboration patterns for different stages. To address these challenges, we propose LCRON, which introduces a novel surrogate loss function derived from the lower bound probability that ground truth items are selected by cascade ranking, ensuring alignment with the overall objective of the system. According to the properties of the derived bound, we further design an auxiliary loss for each stage to drive the reduction of this bound, leading to a more robust and effective top-k selection. LCRON enables end-to-end training of the entire cascade ranking system as a unified network. Experimental results demonstrate that LCRON achieves significant improvement over existing methods on public benchmarks and industrial applications, addressing key limitations in cascade ranking training and significantly enhancing system performance.
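
The end-to-end recall objective the abstract refers to can be illustrated with a toy two-stage cascade. This is a minimal sketch with made-up scores and hypothetical function names. It shows only the evaluation quantity, not LCRON's surrogate loss, which the abstract does not spell out.

```python
import numpy as np


def cascade_select(stage_scores, stage_ks):
    """Run a cascade: each stage keeps its top-k among the previous stage's survivors."""
    survivors = np.arange(len(stage_scores[0]))
    for scores, k in zip(stage_scores, stage_ks):
        order = np.argsort(-scores[survivors], kind="stable")
        survivors = survivors[order[:k]]
    return survivors


def end_to_end_recall(selected, relevant):
    """Fraction of ground-truth items that survive every stage."""
    relevant = set(relevant)
    return len(relevant & set(selected.tolist())) / len(relevant)


# Illustrative scores for six items (assumed values, not from the paper).
retrieval_scores = np.array([0.9, 0.1, 0.8, 0.7, 0.2, 0.6])
ranking_scores = np.array([0.2, 0.9, 0.1, 0.8, 0.7, 0.95])

selected = cascade_select([retrieval_scores, ranking_scores], stage_ks=[4, 2])
recall = end_to_end_recall(selected, relevant=[1, 5])
```

In this toy case item 1 has a high ranking-stage score but is dropped by the retrieval stage, so it can never reach the final list. Losses of this kind are what stage-by-stage training misses and what an end-to-end objective is meant to capture.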

[LG-8] A Novel Approach for Intrinsic Dimension Estimation

Link: https://arxiv.org/abs/2503.09485
Authors: Kadir Özçoban, Murat Manguoğlu, Emrullah Fatih Yetkin
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)

Abstract:Real-life data have a complex and non-linear structure by nature. These non-linearities and the large number of features can cause problems such as the empty-space phenomenon and the well-known curse of dimensionality. Finding a nearly optimal representation of the dataset in a lower-dimensional space (i.e., dimensionality reduction) offers an applicable mechanism for improving the success of machine learning tasks. However, estimating the data dimension required for the nearly optimal representation (the intrinsic dimension) can be very costly, particularly when dealing with big data. We propose a highly efficient and robust intrinsic dimension estimation approach that only relies on matrix-vector products for dimensionality reduction methods. An experimental study is also conducted to compare the performance of the proposed method with state-of-the-art approaches.

[LG-9] Neural reservoir control of a soft bio-hybrid arm

Link: https://arxiv.org/abs/2503.09477
Authors: Noel Naughton, Arman Tekinalp, Keshav Shivam, Seung Hung Kim, Volodymyr Kindratenko, Mattia Gazzola
Categories: Robotics (cs.RO); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
Comments: 12 pages; 4 figures

Abstract:A long-standing engineering problem, the control of soft robots is difficult because of their highly non-linear, heterogeneous, anisotropic, and distributed nature. Here, bridging engineering and biology, a neural reservoir is employed for the dynamic control of a bio-hybrid model arm made of multiple muscle-tendon groups enveloping an elastic spine. We show how the use of reservoirs facilitates simultaneous control and self-modeling across a set of challenging tasks, outperforming classic neural network approaches. Further, by implementing a spiking reservoir on neuromorphic hardware, energy efficiency is achieved, with nearly two orders of magnitude improvement relative to standard CPUs, with implications for the on-board control of untethered, small-scale soft robots.

[LG-10] SO(3)-Equivariant Neural Networks for Learning Vector Fields on Spheres

Link: https://arxiv.org/abs/2503.09456
Authors: Francesco Ballerin, Nello Blaser, Erlend Grong
Categories: Machine Learning (cs.LG)

Abstract:Analyzing vector fields on the sphere, such as wind speed and direction on Earth, is a difficult task. Models should respect both the rotational symmetries of the sphere and the inherent symmetries of the vector fields. In this paper, we introduce a deep learning architecture that respects both symmetry types using novel techniques based on group convolutions in the 3-dimensional rotation group. This architecture is suitable for scalar and vector fields on the sphere as they can be described as equivariant signals on the 3-dimensional rotation group. Experiments show that our architecture achieves lower prediction and reconstruction error when tested on rotated data compared to both standard CNNs and spherical CNNs.

[LG-11] How Well Does Your Tabular Generator Learn the Structure of Tabular Data? ICLR2025

Link: https://arxiv.org/abs/2503.09453
Authors: Xiangjian Jiang, Nikola Simidjievski, Mateja Jamnik
Categories: Machine Learning (cs.LG)
Comments: Accepted by ICLR 2025 workshops (DeLTa and SynthData)

Abstract:Heterogeneous tabular data poses unique challenges in generative modelling due to its fundamentally different underlying data structure compared to homogeneous modalities, such as images and text. Although previous research has sought to adapt the successes of generative modelling in homogeneous modalities to the tabular domain, defining an effective generator for tabular data remains an open problem. One major reason is that the evaluation criteria inherited from other modalities often fail to adequately assess whether tabular generative models effectively capture or utilise the unique structural information encoded in tabular data. In this paper, we carefully examine the limitations of the prevailing evaluation framework and introduce TabStruct, a novel evaluation benchmark that positions structural fidelity as a core evaluation dimension. Specifically, TabStruct evaluates the alignment of causal structures in real and synthetic data, providing a direct measure of how effectively tabular generative models learn the structure of tabular data. Through extensive experiments using generators from eight categories on seven datasets with expert-validated causal graphical structures, we show that structural fidelity offers a task-independent, domain-agnostic evaluation dimension. Our findings highlight the importance of tabular data structure and offer practical guidance for developing more effective and robust tabular generative models. Code is available at this https URL.

[LG-12] Efficient dynamic modal load reconstruction using physics-informed Gaussian processes based on frequency-sparse Fourier basis functions

Link: https://arxiv.org/abs/2503.09418
Authors: Gledson Rodrigo Tondo, Igor Kavrakov, Guido Morgenthal
Categories: Machine Learning (cs.LG)

Abstract:Knowledge of the force time history of a structure is essential to assess its behaviour, ensure safety and maintain reliability. However, direct measurement of external forces is often challenging due to sensor limitations, unknown force characteristics, or inaccessible load points. This paper presents an efficient dynamic load reconstruction method using physics-informed Gaussian processes (GP) based on frequency-sparse Fourier basis functions. The GP’s covariance matrices are built using the description of the system dynamics, and the model is trained using structural response measurements. This provides support and interpretability to the machine learning model, in contrast to purely data-driven methods. In addition, the model filters out irrelevant components in the Fourier basis function by leveraging the sparsity of structural responses in the frequency domain, thereby reducing computational complexity during optimization. The trained model for structural responses is then integrated with the differential equation for a harmonic oscillator, creating a probabilistic dynamic load model that predicts load patterns without requiring force data during training. The model’s effectiveness is validated through two case studies: a numerical model of a wind-excited 76-story building and an experiment using a physical scale model of the Lillebælt Bridge in Denmark, excited by a servo motor. For both cases, validation of the reconstructed forces is provided using comparison metrics for several signal properties. The developed model holds potential for applications in structural health monitoring, damage prognosis, and load model validation.

[LG-13] Mitigating Membership Inference Vulnerability in Personalized Federated Learning

Link: https://arxiv.org/abs/2503.09414
Authors: Kangsoo Jung, Sayan Biswas, Catuscia Palamidessi
Categories: Machine Learning (cs.LG); Cryptography and Security (cs.CR)

Abstract:Federated Learning (FL) has emerged as a promising paradigm for collaborative model training without the need to share clients’ personal data, thereby preserving privacy. However, the non-IID nature of the clients’ data introduces major challenges for FL, highlighting the importance of personalized federated learning (PFL) methods. In PFL, models are trained to cater to specific feature distributions present in the population data. A notable method for PFL is the Iterative Federated Clustering Algorithm (IFCA), which mitigates the concerns associated with the non-IID-ness by grouping clients with similar data distributions. While it has been shown that IFCA enhances both accuracy and fairness, its strategy of dividing the population into smaller clusters increases vulnerability to Membership Inference Attacks (MIA), particularly among minorities with limited training samples. In this paper, we introduce IFCA-MIR, an improved version of IFCA that integrates MIA risk assessment into the clustering process. Allowing clients to select clusters based on both model performance and MIA vulnerability, IFCA-MIR achieves an improved performance with respect to accuracy, fairness, and privacy. We demonstrate that IFCA-MIR significantly reduces MIA risk while maintaining comparable model accuracy and fairness as the original IFCA.

[LG-14] Benefits of Learning Rate Annealing for Tuning-Robustness in Stochastic Optimization

Link: https://arxiv.org/abs/2503.09411
Authors: Amit Attia,Tomer Koren
Categories: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
Comments: 22 pages

Click to view abstract

Abstract:The learning rate in stochastic gradient methods is a critical hyperparameter that is notoriously costly to tune via standard grid search, especially for training modern large-scale models with billions of parameters. We identify a theoretical advantage of learning rate annealing schemes that decay the learning rate to zero at a polynomial rate, such as the widely-used cosine schedule, by demonstrating their increased robustness to initial parameter misspecification due to a coarse grid search. We present an analysis in a stochastic convex optimization setup demonstrating that the convergence rate of stochastic gradient descent with annealed schedules depends sublinearly on the multiplicative misspecification factor $\rho$ (i.e., the grid resolution), achieving a rate of $O(\rho^{1/(2p+1)}/\sqrt{T})$ where $p$ is the degree of polynomial decay and $T$ is the number of steps, in contrast to the $O(\rho/\sqrt{T})$ rate that arises with fixed stepsizes and exhibits a linear dependence on $\rho$. Experiments confirm the increased robustness compared to tuning with a fixed stepsize, which has significant implications for the computational overhead of hyperparameter search in practical training scenarios.
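
To make the schedules and rates in this abstract concrete, here is a minimal Python sketch. The function names and the exact schedule forms (`eta0 * (1 - t/T)^p` for polynomial decay, the usual half-cosine for the cosine schedule) are assumptions chosen for illustration and are not taken from the paper; `misspecification_factor` only reproduces the dependence on the factor rho quoted above, with constants and the 1/sqrt(T) term dropped.

```python
import math


def poly_decay(eta0, t, T, p):
    """Hypothetical polynomial annealing: eta0 * (1 - t/T)^p, reaching zero at t = T."""
    return eta0 * (1.0 - t / T) ** p


def cosine_decay(eta0, t, T):
    """Standard half-cosine schedule, which also reaches zero at t = T."""
    return 0.5 * eta0 * (1.0 + math.cos(math.pi * t / T))


def misspecification_factor(rho, p=None):
    """Dependence on rho in the quoted rates (constants and 1/sqrt(T) dropped).

    p=None models a fixed stepsize (linear in rho);
    otherwise the annealed rate rho^(1/(2p+1)) is used.
    """
    if p is None:
        return rho
    return rho ** (1.0 / (2 * p + 1))
```

For example, with a grid that overshoots the ideal stepsize by a factor of 8, the fixed-stepsize bound scales with 8, while linear decay (p = 1) scales with 8^(1/3) = 2.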

[LG-15] Adjusted Count Quantification Learning on Graphs

Link: https://arxiv.org/abs/2503.09395
Authors: Clemens Damke,Eyke Hüllermeier
Categories: Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Quantification learning is the task of predicting the label distribution of a set of instances. We study this problem in the context of graph-structured data, where the instances are vertices. Previously, this problem has only been addressed via node clustering methods. In this paper, we extend the popular Adjusted Classify Count (ACC) method to graphs. We show that the prior probability shift assumption upon which ACC relies is often not fulfilled and propose two novel graph quantification techniques: Structural importance sampling (SIS) makes ACC applicable in graph domains with covariate shift. Neighborhood-aware ACC improves quantification in the presence of non-homophilic edges. We show the effectiveness of our techniques on multiple graph quantification tasks.
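
As background for the method being extended, the following is a minimal sketch of the classical binary Adjusted Classify Count estimator on ordinary (non-graph) data. It is not the paper's SIS or neighborhood-aware variant; the function names are hypothetical, and the true and false positive rates are assumed to have been estimated on held-out labeled data.

```python
def classify_count(predictions):
    """Fraction of instances that a binary classifier labels positive (0/1 predictions)."""
    return sum(predictions) / len(predictions)


def adjusted_classify_count(predictions, tpr, fpr):
    """Binary ACC: correct the raw positive rate using the classifier's TPR and FPR.

    Under prior probability shift, E[CC] = p * tpr + (1 - p) * fpr,
    so p is recovered as (CC - fpr) / (tpr - fpr), clipped to [0, 1].
    """
    if tpr == fpr:
        raise ValueError("tpr and fpr must differ for the correction to be defined")
    estimate = (classify_count(predictions) - fpr) / (tpr - fpr)
    return min(1.0, max(0.0, estimate))
```

The correction is only valid when the classifier's error rates are the same on the target set as on the held-out data, which is the prior probability shift assumption the abstract says often fails on graphs.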

[LG-16] Context-aware Constrained Reinforcement Learning Based Energy-Efficient Power Scheduling for Non-stationary XR Data Traffic

Link: https://arxiv.org/abs/2503.09391
Authors: Kexuan Wang,An Liu
Categories: Systems and Control (eess.SY); Emerging Technologies (cs.ET); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:In XR downlink transmission, energy-efficient power scheduling (EEPS) is essential for conserving power resources while delivering large data packets within hard-latency constraints. Traditional constrained reinforcement learning (CRL) algorithms show promise in EEPS but still struggle with non-convex stochastic constraints, non-stationary data traffic, and sparse delayed packet dropout feedback (rewards) in XR. To overcome these challenges, this paper models the EEPS in XR as a dynamic parameter-constrained Markov decision process (DP-CMDP) with a varying transition function linked to the non-stationary data traffic and solves it by a proposed context-aware constrained reinforcement learning (CACRL) algorithm, which consists of a context inference (CI) module and a CRL module. The CI module trains an encoder and multiple potential networks to characterize the current transition function and reshape the packet dropout rewards according to the context, transforming the original DP-CMDP into a general CMDP with immediate dense rewards. The CRL module employs a policy network to make EEPS decisions under this CMDP and optimizes the policy using a constrained stochastic successive convex approximation (CSSCA) method, which is better suited for non-convex stochastic constraints. Finally, theoretical analyses provide deep insights into the CACRL algorithm, while extensive simulations demonstrate that it outperforms advanced baselines in both power conservation and satisfying packet dropout constraints.

[LG-17] Evaluating Reinforcement Learning Safety and Trustworthiness in Cyber-Physical Systems

Link: https://arxiv.org/abs/2503.09388
Authors: Katherine Dearstyne,Pedro (Tony) Alarcon Granadeno,Theodore Chambers,Jane Cleland-Huang
Categories: Software Engineering (cs.SE); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Cyber-Physical Systems (CPS) often leverage Reinforcement Learning (RL) techniques to adapt dynamically to changing environments and optimize performance. However, it is challenging to construct safety cases for RL components. We therefore propose the SAFE-RL (Safety and Accountability Framework for Evaluating Reinforcement Learning) for supporting the development, validation, and safe deployment of RL-based CPS. We adopt a design science approach to construct the framework and demonstrate its use in three RL applications in small Uncrewed Aerial Systems (sUAS).

[LG-18] Revisiting Agnostic Boosting

Link: https://arxiv.org/abs/2503.09384
Authors: Arthur da Cunha,Mikael Møller Høgsgaard,Andrea Paudice,Yuxin Sun
Categories: Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Boosting is a key method in statistical learning, allowing for converting weak learners into strong ones. While well studied in the realizable case, the statistical properties of weak-to-strong learning remain less understood in the agnostic setting, where there are no assumptions on the distribution of the labels. In this work, we propose a new agnostic boosting algorithm with substantially improved sample complexity compared to prior works under very general assumptions. Our approach is based on a reduction to the realizable case, followed by a margin-based filtering step to select high-quality hypotheses. We conjecture that the error rate achieved by our proposed method is optimal up to logarithmic factors.

[LG-19] Towards Graph Foundation Models: A Transferability Perspective

Link: https://arxiv.org/abs/2503.09363
Authors: Yuxiang Wang,Wenqi Fan,Suhang Wang,Yao Ma
Categories: Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:In recent years, Graph Foundation Models (GFMs) have gained significant attention for their potential to generalize across diverse graph domains and tasks. Some works focus on Domain-Specific GFMs, which are designed to address a variety of tasks within a specific domain, while others aim to create General-Purpose GFMs that extend the capabilities of domain-specific models to multiple domains. Regardless of the type, transferability is crucial for applying GFMs across different domains and tasks. However, achieving strong transferability is a major challenge due to the structural, feature, and distributional variations in graph data. To date, there has been no systematic research examining and analyzing GFMs from the perspective of transferability. To bridge the gap, we present the first comprehensive taxonomy that categorizes and analyzes existing GFMs through the lens of transferability, structuring GFMs around their application scope (domain-specific vs. general-purpose) and their approaches to knowledge acquisition and transfer. We provide a structured perspective on current progress and identify potential pathways for advancing GFM generalization across diverse graph datasets and tasks. We aim to shed light on the current landscape of GFMs and inspire future research directions in GFM development.

[LG-20] Online multidimensional dictionary learning

Link: https://arxiv.org/abs/2503.09337
Authors: Ferdaous Ait Addi,Abdeslem Hafid Bentbib,Khalide Jbilou
Categories: Numerical Analysis (math.NA); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Dictionary learning is a widely used technique in signal processing and machine learning that aims to represent data as a linear combination of a few elements from an overcomplete dictionary. In this work, we propose a generalization of the dictionary learning technique using the t-product framework, enabling efficient handling of multidimensional tensor data. We address the dictionary learning problem through online methods suitable for tensor structures. To effectively address the sparsity problem, we utilize an accelerated Iterative Shrinkage-Thresholding Algorithm (ISTA) enhanced with an extrapolation technique known as Anderson acceleration. This approach significantly improves signal reconstruction results. Extensive experiments prove that our proposed method outperforms existing acceleration techniques, particularly in applications such as data completion. These results suggest that our approach can be highly beneficial for large-scale tensor data analysis in various domains.
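
For readers unfamiliar with ISTA, the sketch below shows the plain (unaccelerated) algorithm for the ordinary matrix sparse coding problem, minimizing 0.5 * ||Ax - b||^2 + lam * ||x||_1. It deliberately omits the t-product tensor setting and the Anderson acceleration described in the abstract; the function names and the list-of-lists matrix representation are illustrative assumptions.

```python
def soft_threshold(v, tau):
    """Proximal operator of tau * ||.||_1: shrink every entry toward zero by tau."""
    return [(1.0 if x > 0 else -1.0) * max(abs(x) - tau, 0.0) for x in v]


def ista(A, b, lam, step, iters=100):
    """Plain ISTA for min_x 0.5 * ||A x - b||^2 + lam * ||x||_1.

    A is a list of rows; step should be at most 1 / L, where L is the
    largest eigenvalue of A^T A, for the iteration to converge.
    """
    m, n = len(A), len(A[0])
    x = [0.0] * n
    for _ in range(iters):
        # Residual r = A x - b and gradient g = A^T r of the smooth term.
        r = [sum(A[i][j] * x[j] for j in range(n)) - b[i] for i in range(m)]
        g = [sum(A[i][j] * r[i] for i in range(m)) for j in range(n)]
        # Gradient step followed by shrinkage.
        x = soft_threshold([x[j] - step * g[j] for j in range(n)], step * lam)
    return x
```

With an identity dictionary the solution is simply the soft-thresholded signal, which makes a convenient sanity check: small coefficients are set exactly to zero, and large ones are shrunk by `lam`.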

[LG-21] Energy Optimized Piecewise Polynomial Approximation Utilizing Modern Machine Learning Optimizers

Link: https://arxiv.org/abs/2503.09329
Authors: Hannes Waclawek,Stefan Huber
Categories: Machine Learning (cs.LG)
Comments: Submitted to Austrian Robotics Workshop 2025 (2 page student paper)

Click to view abstract

Abstract:This work explores an extension of ML-optimized piecewise polynomial approximation by incorporating energy optimization as an additional objective. Traditional closed-form solutions enable continuity and approximation targets but lack flexibility in accommodating complex optimization goals. By leveraging modern gradient descent optimizers within TensorFlow, we introduce a framework that minimizes total curvature in cam profiles, leading to smoother motion and reduced energy consumption for input data that is unfavorable for sole approximation and continuity optimization. Experimental results confirm the effectiveness of this approach, demonstrating its potential to improve efficiency in scenarios where input data is noisy or suboptimal for conventional methods.

[LG-22] ShuffleGate: An Efficient and Self-Polarizing Feature Selection Method for Large-Scale Deep Models in Industry

Link: https://arxiv.org/abs/2503.09315
Authors: Yihong Huang
Categories: Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Deep models in industrial applications rely on thousands of features for accurate predictions, such as deep recommendation systems. While new features are introduced to capture evolving user behavior, outdated or redundant features often remain, significantly increasing storage and computational costs. To address this issue, feature selection methods are widely adopted to identify and remove less important features. However, existing approaches face two major challenges: (1) they often require complex Hyperparameter (Hp) tuning, making them difficult to employ in practice, and (2) they fail to produce well-separated feature importance scores, which complicates straightforward feature removal. Moreover, the impact of removing unimportant features can only be evaluated through retraining the model, a time-consuming and resource-intensive process that severely hinders efficient feature selection. To solve these challenges, we propose a novel feature selection approach, Shuffle-Gate. In particular, it shuffles all feature values across instances simultaneously and uses a gating mechanism that allows the model to dynamically learn the weights for combining the original and shuffled inputs. Notably, it can generate well-separated feature importance scores and estimate the performance without retraining the model, while introducing only a single Hp. Experiments on four public datasets show that our approach outperforms state-of-the-art methods in selecting the top half of the feature set for model retraining. Moreover, it has been successfully integrated into the daily iteration of Bilibili’s search models across various scenarios, where it significantly reduces feature set size and computational resource usage, while maintaining comparable performance. 
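
The shuffle-and-gate idea described in this abstract can be illustrated with a small sketch. This is a guess at the general mechanism, not the author's implementation: the convex mixing form `g * x + (1 - g) * shuffled_x`, the per-feature gate vector, and all function names are assumptions, and in a real model the gates would be learned jointly with the network rather than supplied by hand.

```python
import random


def shuffle_columns(batch, rng):
    """Shuffle each feature column independently across the instances of a batch."""
    columns = [list(col) for col in zip(*batch)]
    for col in columns:
        rng.shuffle(col)
    return [list(row) for row in zip(*columns)]


def gated_mix(batch, gates, rng):
    """Mix original and shuffled inputs per feature (assumed form, see lead-in).

    A gate near 1 keeps the real feature value; a gate near 0 replaces it
    with a value from another instance, which carries no signal.
    """
    shuffled = shuffle_columns(batch, rng)
    return [
        [g * x + (1.0 - g) * s for g, x, s in zip(gates, row, shuffled_row)]
        for row, shuffled_row in zip(batch, shuffled)
    ]
```

Under this reading, a feature whose gate can be driven to 0 without hurting the loss is one the model does not need, which is how gate values could serve as importance scores.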

[LG-23] Priority-Aware Preemptive Scheduling for Mixed-Priority Workloads in MoE Inference

Link: https://arxiv.org/abs/2503.09304
Authors: Mohammad Siavashi,Faezeh Keshmiri Dindarloo,Dejan Kostic,Marco Chiesa
Categories: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
Comments:

Click to view abstract

Abstract:Large Language Models have revolutionized natural language processing, yet serving them efficiently in data centers remains challenging due to mixed workloads comprising latency-sensitive (LS) and best-effort (BE) jobs. Existing inference systems employ iteration-level first-come-first-served scheduling, causing head-of-line blocking when BE jobs delay LS jobs. We introduce QLLM, a novel inference system designed for Mixture of Experts (MoE) models, featuring a fine-grained, priority-aware preemptive scheduler. QLLM enables expert-level preemption, deferring BE job execution while minimizing LS time-to-first-token (TTFT). Our approach removes iteration-level scheduling constraints, enabling the scheduler to preempt jobs at any layer based on priority. Evaluations on an Nvidia A100 GPU show that QLLM significantly improves performance. It reduces LS TTFT by an average of $65.5\times$ and meets the SLO at up to 7 requests/sec, whereas the baseline fails to do so under the tested workload. Additionally, it cuts LS turnaround time by up to $12.8\times$ without impacting throughput. QLLM is modular, extensible, and seamlessly integrates with Hugging Face MoE models.

[LG-24] Rule-Guided Reinforcement Learning Policy Evaluation and Improvement

Link: https://arxiv.org/abs/2503.09270
Authors: Martin Tappler,Ignacio D. Lopez-Miguel,Sebastian Tschiatschek,Ezio Bartocci
Categories: Machine Learning (cs.LG); Software Engineering (cs.SE)
Comments: 11 pages, 3 figures, accompanying source code available at this https URL

Click to view abstract

Abstract:We consider the challenging problem of using domain knowledge to improve deep reinforcement learning policies. To this end, we propose LEGIBLE, a novel approach, following a multi-step process, which starts by mining rules from a deep RL policy, constituting a partially symbolic representation. These rules describe which decisions the RL policy makes and which it avoids making. In the second step, we generalize the mined rules using domain knowledge expressed as metamorphic relations. We adapt these relations from software testing to RL to specify expected changes of actions in response to changes in observations. The third step is evaluating generalized rules to determine which generalizations improve performance when enforced. These improvements show weaknesses in the policy, where it has not learned the general rules and thus can be improved by rule guidance. LEGIBLE supported by metamorphic relations provides a principled way of expressing and enforcing domain knowledge about RL environments. We show the efficacy of our approach by demonstrating that it effectively finds weaknesses, accompanied by explanations of these weaknesses, in eleven RL environments and by showcasing that guiding policy execution with rules improves performance w.r.t. gained reward.

[LG-25] Neural Normalized Cut: A Differential and Generalizable Approach for Spectral Clustering

Link: https://arxiv.org/abs/2503.09260
Authors: Wei He,Shangzhi Zhang,Chun-Guang Li,Xianbiao Qi,Rong Xiao,Jun Guo
Categories: Machine Learning (cs.LG)
Comments: 5 figures, 8 tables, accepted by Pattern Recognition (2025-03-11)

Click to view abstract

Abstract:Spectral clustering, as a popular tool for data clustering, requires an eigen-decomposition step on a given affinity to obtain the spectral embedding. Nevertheless, such a step suffers from the lack of generalizability and scalability. Moreover, the obtained spectral embeddings can hardly provide a good approximation to the ground-truth partition and thus a k-means step is adopted to quantize the embedding. In this paper, we propose a simple yet effective scalable and generalizable approach, called Neural Normalized Cut (NeuNcut), to learn the clustering membership for spectral clustering directly. In NeuNcut, we properly reparameterize the unknown cluster membership via a neural network, and train the neural network via stochastic gradient descent with a properly relaxed normalized cut loss. As a result, our NeuNcut enjoys a desired generalization ability to directly infer clustering membership for out-of-sample unseen data and hence brings us an efficient way to handle clustering task with ultra large-scale data. We conduct extensive experiments on both synthetic data and benchmark datasets and experimental results validate the effectiveness and the superiority of our approach. Our code is available at: this https URL.
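
The objective behind this line of work is the normalized cut. The sketch below evaluates the standard discrete k-way normalized cut of a hard partition on a small weighted graph; it is not the relaxed, differentiable loss that NeuNcut trains with, and the function name and the dense list-of-lists affinity format are assumptions for illustration.

```python
def normalized_cut(weights, clusters):
    """k-way normalized cut: sum over clusters of cut(A, rest) / vol(A).

    weights is a symmetric affinity matrix (list of rows);
    clusters is a list of lists of node indices forming a partition.
    """
    n = len(weights)
    degree = [sum(row) for row in weights]
    total = 0.0
    for cluster in clusters:
        members = set(cluster)
        # vol(A): total degree of the cluster's nodes.
        volume = sum(degree[i] for i in members)
        # cut(A, rest): weight of edges leaving the cluster.
        cut = sum(weights[i][j] for i in members for j in range(n) if j not in members)
        total += cut / volume
    return total
```

A partition that separates weakly connected groups gets a small value, while one that cuts through strong edges gets a large one, so lower is better.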

[LG-26] Large-scale Regional Traffic Signal Control Based on Single-Agent Reinforcement Learning

Link: https://arxiv.org/abs/2503.09252
Authors: Qiang Li,Jin Niu,Qin Luo,Lina Yu
Categories: Machine Learning (cs.LG); Systems and Control (eess.SY)
Comments: 16 pages, 8 figures. arXiv admin note: text overlap with arXiv:2503.02279

Click to view abstract

Abstract:In the context of global urbanization and motorization, traffic congestion has become a significant issue, severely affecting the quality of life, environment, and economy. This paper puts forward a single-agent reinforcement learning (RL)-based regional traffic signal control (TSC) model. Different from multi-agent systems, this model can coordinate traffic signals across a large area, with the goals of alleviating regional traffic congestion and minimizing the total travel time. The TSC environment is precisely defined through specific state space, action space, and reward functions. The state space consists of the current congestion state, which is represented by the queue lengths of each link, and the current signal phase scheme of intersections. The action space is designed to select an intersection first and then adjust its phase split. Two reward functions are meticulously crafted. One focuses on alleviating congestion and the other aims to minimize the total travel time while considering the congestion level. The experiments are carried out with the SUMO traffic simulation software. The performance of the TSC model is evaluated by comparing it with a base case where no signal-timing adjustments are made. The results show that the model can effectively control congestion. For example, the queuing length is significantly reduced in the scenarios tested. Moreover, when the reward is set to both alleviate congestion and minimize the total travel time, the average travel time is remarkably decreased, which indicates that the model can effectively improve traffic conditions. This research provides a new approach for large-scale regional traffic signal control and offers valuable insights for future urban traffic management.

[LG-27] MarineGym: A High-Performance Reinforcement Learning Platform for Underwater Robotics

Link: https://arxiv.org/abs/2503.09203
Authors: Shuguang Chu,Zebin Huang,Yutong Li,Mingwei Lin,Ignacio Carlucho,Yvan R. Petillot,Canjun Yang
Categories: Robotics (cs.RO); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:This work presents the MarineGym, a high-performance reinforcement learning (RL) platform specifically designed for underwater robotics. It aims to address the limitations of existing underwater simulation environments in terms of RL compatibility, training efficiency, and standardized benchmarking. MarineGym integrates a proposed GPU-accelerated hydrodynamic plugin based on Isaac Sim, achieving a rollout speed of 250,000 frames per second on a single NVIDIA RTX 3060 GPU. It also provides five models of unmanned underwater vehicles (UUVs), multiple propulsion systems, and a set of predefined tasks covering core underwater control challenges. Additionally, the DR toolkit allows flexible adjustments of simulation and task parameters during training to improve Sim2Real transfer. Further benchmark experiments demonstrate that MarineGym improves training efficiency over existing platforms and supports robust policy adaptation under various perturbations. We expect this platform could drive further advancements in RL research for underwater robotics. For more details about MarineGym and its applications, please visit our project page: this https URL.

[LG-28] Time-EAPCR: A Deep Learning-Based Novel Approach for Anomaly Detection Applied to the Environmental Field

Link: https://arxiv.org/abs/2503.09200
Authors: Lei Liu,Yuchao Lu,Ling An,Huajie Liang,Chichun Zhou,Zhenyu Zhang
Categories: Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:As human activities intensify, environmental systems such as aquatic ecosystems and water treatment systems face increasingly complex pressures, impacting ecological balance, public health, and sustainable development, making intelligent anomaly monitoring essential. However, traditional monitoring methods suffer from delayed responses, insufficient data processing capabilities, and weak generalisation, making them unsuitable for complex environmental monitoring. In recent years, machine learning has been widely applied to anomaly detection, but the multi-dimensional features and spatiotemporal dynamics of environmental ecological data, especially the long-term dependencies and strong variability in the time dimension, limit the effectiveness of traditional methods. Deep learning, with its ability to automatically learn features, captures complex nonlinear relationships, improving detection performance. However, its application in environmental monitoring is still in its early stages and requires further study. This paper introduces a new deep learning method, Time-EAPCR (Time-Embedding-Attention-Permutated CNN-Residual), and applies it to environmental science. The method uncovers feature correlations, captures temporal evolution patterns, and enables precise anomaly detection in environmental data. We validated Time-EAPCR’s high accuracy and robustness across four publicly available environmental datasets. Experimental results show that the method efficiently handles multi-source data, improves detection accuracy, and excels across various scenarios with strong adaptability and generalisation. Additionally, a real-world river monitoring dataset confirmed the feasibility of its deployment, providing reliable technical support for environmental monitoring.

[LG-29] Differential Privacy Personalized Federated Learning Based on Dynamically Sparsified Client Updates

Link: https://arxiv.org/abs/2503.09192
Authors: Chuanyin Wang,Yifei Zhang,Neng Gao,Qiang Luo
Categories: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
Comments: 10 pages, 2 figures

Click to view abstract

Abstract:Personalized federated learning is extensively utilized in scenarios characterized by data heterogeneity, facilitating more efficient and automated local training on data-owning terminals. This includes the automated selection of high-performance model parameters for upload, thereby enhancing the overall training process. However, it entails significant risks of privacy leakage. Existing studies have attempted to mitigate these risks by utilizing differential privacy. Nevertheless, these studies present two major limitations: (1) The integration of differential privacy into personalized federated learning lacks sufficient personalization, leading to the introduction of excessive noise into the model. (2) It fails to adequately control the spatial scope of model update information, resulting in a suboptimal balance between data privacy and model effectiveness in differential privacy federated learning. In this paper, we propose a differentially private personalized federated learning approach that employs dynamically sparsified client updates through reparameterization and adaptive norm (DP-pFedDSU). Reparameterization training effectively selects personalized client update information, thereby reducing the quantity of updates. This approach minimizes the introduction of noise to the greatest extent possible. Additionally, dynamic adaptive norm refers to controlling the norm space of model updates during the training process, mitigating the negative impact of clipping on the update information. These strategies substantially enhance the effective integration of differential privacy and personalized federated learning. Experimental results on EMNIST, CIFAR-10, and CIFAR-100 demonstrate that our proposed scheme achieves superior performance and is well-suited for more complex personalized federated learning scenarios.

[LG-30] Rethinking Bimanual Robotic Manipulation: Learning with Decoupled Interaction Framework

Link: https://arxiv.org/abs/2503.09186
Authors: Jian-Jian Jiang,Xiao-Ming Wu,Yi-Xiang He,Ling-An Zeng,Yi-Lin Wei,Dandan Zhang,Wei-Shi Zheng
Categories: Robotics (cs.RO); Machine Learning (cs.LG)
Comments: 14 pages, 8 figures

Click to view abstract

Abstract:Bimanual robotic manipulation is an emerging and critical topic in the robotics community. Previous works primarily rely on integrated control models that take the perceptions and states of both arms as inputs to directly predict their actions. However, we think bimanual manipulation involves not only coordinated tasks but also various uncoordinated tasks that do not require explicit cooperation during execution, such as grasping objects with the closest hand, which integrated control frameworks fail to consider due to their enforced cooperation in the early inputs. In this paper, we propose a novel decoupled interaction framework that considers the characteristics of different tasks in bimanual manipulation. The key insight of our framework is to assign an independent model to each arm to enhance the learning of uncoordinated tasks, while introducing a selective interaction module that adaptively learns weights from its own arm to improve the learning of coordinated tasks. Extensive experiments on seven tasks in the RoboTwin dataset demonstrate that: (1) Our framework achieves outstanding performance, with a 23.5% boost over the SOTA method. (2) Our framework is flexible and can be seamlessly integrated into existing methods. (3) Our framework can be effectively extended to multi-agent manipulation tasks, achieving a 28% boost over the integrated control SOTA. (4) The performance boost stems from the decoupled design itself, surpassing the SOTA by 16.5% in success rate with only 1/6 of the model size.

[LG-31] Exploiting Unstructured Sparsity in Fully Homomorphic Encrypted DNNs EUROSYS’25

Link: https://arxiv.org/abs/2503.09184
Authors: Aidan Ferguson,Perry Gibson,Lara D’Agata,Parker McLeod,Ferhat Yaman,Amitabh Das,Ian Colbert,José Cano
Categories: Cryptography and Security (cs.CR); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG); Performance (cs.PF)
Comments: Accepted to 5th Workshop on Machine Learning and Systems (EuroMLSys) co-located with EuroSys '25

Click to view abstract

Abstract:The deployment of deep neural networks (DNNs) in privacy-sensitive environments is constrained by computational overheads in fully homomorphic encryption (FHE). This paper explores unstructured sparsity in FHE matrix multiplication schemes as a means of reducing this burden while maintaining model accuracy requirements. We demonstrate that sparsity can be exploited in arbitrary matrix multiplication, providing runtime benefits compared to a baseline naive algorithm at all sparsity levels. This is a notable departure from the plaintext domain, where there is a trade-off between sparsity and the overhead of the sparse multiplication algorithm. In addition, we propose three sparse multiplication schemes in FHE based on common plaintext sparse encodings. We demonstrate the performance gain is scheme-invariant; however, some sparse schemes vastly reduce the memory storage requirements of the encrypted matrix at high sparsity values. Our proposed sparse schemes yield an average performance gain of 2.5x at 50% unstructured sparsity, with our multi-threading scheme providing a 32.5x performance increase over the equivalent single-threaded sparse computation when utilizing 64 cores.

[LG-32] Dynamic Feature Selection from Variable Feature Sets Using Features of Features

Link: https://arxiv.org/abs/2503.09181
Authors: Katsumi Takahashi,Koh Takeuchi,Hisashi Kashima
Categories: Machine Learning (cs.LG); Information Theory (cs.IT)
Comments:

Click to view abstract

Abstract:Machine learning models usually assume that a set of feature values used to obtain an output is fixed in advance. However, in many real-world problems, a cost is associated with measuring these features. To address the issue of reducing measurement costs, various methods have been proposed to dynamically select which features to measure, but existing methods assume that the set of measurable features remains constant, which makes them unsuitable for cases where the set of measurable features varies from instance to instance. To overcome this limitation, we define a new problem setting for Dynamic Feature Selection (DFS) with variable feature sets and propose a deep learning method that utilizes prior information about each feature, referred to as “features of features”. Experimental results on several datasets demonstrate that the proposed method effectively selects features based on the prior information, even when the set of measurable features changes from instance to instance.

[LG-33] Effective Feature Selection for Predicting Spreading Factor with ML in Large LoRaWAN-based Mobile IoT Networks

Link: https://arxiv.org/abs/2503.09170
Authors: Aman Prakash,Nikumani Choudhury,Anakhi Hazarika,Alekhya Gorrela
Categories: Machine Learning (cs.LG)
Comments: Accepted at 31st National Conference on Communications

Click to view abstract

Abstract:LoRaWAN is a low-power long-range protocol that enables reliable and robust communication. This paper addresses the challenge of predicting the spreading factor (SF) in LoRaWAN networks using machine learning (ML) techniques. Optimal SF allocation is crucial for optimizing data transmission in IoT-enabled mobile devices, yet it remains a challenging task due to the fluctuation in environment and network conditions. We evaluated ML model performance on a large publicly available dataset to identify the best features among key LoRaWAN features such as RSSI, SNR, frequency, distance between end devices and gateways, and antenna height of the end device. We also experimented with all 31 possible combinations of these 5 features. We trained and evaluated the model using k-nearest neighbors (k-NN), Decision Tree Classifier (DTC), Random Forest (RF), and Multinomial Logistic Regression (MLR) algorithms. The combination of RSSI and SNR was identified as the best feature set. The findings of this paper provide valuable information for reducing the overall cost of dataset collection for ML model training and extending the battery life of LoRaWAN devices. This work contributes to a more reliable LoRaWAN system by understanding the importance of specific feature sets for optimized SF allocation.
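
The figure of 31 combinations is the number of non-empty subsets of 5 features, 2^5 - 1. A short Python sketch enumerating them follows; the feature labels are shorthand for the five features named in the abstract, not the dataset's actual column names.

```python
from itertools import combinations

# Shorthand labels for the five features listed in the abstract (hypothetical names).
FEATURES = ["RSSI", "SNR", "frequency", "distance", "antenna_height"]


def feature_subsets(features):
    """Return every non-empty subset of the given features, smallest subsets first."""
    return [
        subset
        for size in range(1, len(features) + 1)
        for subset in combinations(features, size)
    ]
```

Each subset would then be used to train and score one model per algorithm, which is how a pair such as RSSI and SNR can be compared against all other candidate feature sets.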

[LG-34] Unreflected Use of Tabular Data Repositories Can Undermine Research Quality

Link: https://arxiv.org/abs/2503.09159
Authors: Andrej Tschalzev,Lennart Purucker,Stefan Lüdtke,Frank Hutter,Christian Bartelt,Heiner Stuckenschmidt
Categories: Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Data repositories have accumulated a large number of tabular datasets from various domains. Machine Learning researchers are actively using these datasets to evaluate novel approaches. Consequently, data repositories have an important standing in tabular data research. They not only host datasets but also provide information on how to use them in supervised learning tasks. In this paper, we argue that, despite great achievements in usability, the unreflected usage of datasets from data repositories may have led to reduced research quality and scientific rigor. We present examples from prominent recent studies that illustrate the problematic use of datasets from OpenML, a large data repository for tabular data. Our illustrations help users of data repositories avoid falling into the traps of (1) using suboptimal model selection strategies, (2) overlooking strong baselines, and (3) inappropriate preprocessing. In response, we discuss possible solutions for how data repositories can prevent the inappropriate use of datasets and become the cornerstones for improved overall quality of empirical research studies.

[LG-35] Clustering by Nonparametric Smoothing

Link: https://arxiv.org/abs/2503.09134
Authors: David P. Hofmeyr
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
*Comments: Under submission for possible publication by IEEE

Click to view abstract

Abstract:A novel formulation of the clustering problem is introduced in which the task is expressed as an estimation problem, where the object to be estimated is a function which maps a point to its distribution of cluster membership. Unlike existing approaches which implicitly estimate such a function, like Gaussian Mixture Models (GMMs), the proposed approach bypasses any explicit modelling assumptions and exploits the flexible estimation potential of nonparametric smoothing. An intuitive approach for selecting the tuning parameters governing estimation is provided, which allows the proposed method to automatically determine both an appropriate level of flexibility and also the number of clusters to extract from a given data set. Experiments on a large collection of publicly available data sets are used to document the strong performance of the proposed approach, in comparison with relevant benchmarks from the literature. R code to implement the proposed approach is available from this https URL CNS

[LG-36] Urban Region Representation Learning: A Flexible Approach

Link: https://arxiv.org/abs/2503.09128
Authors: Fengze Sun,Yanchuan Chang,Egemen Tanin,Shanika Karunasekera,Jianzhong Qi
Categories: Machine Learning (cs.LG)
*Comments:

Click to view abstract

Abstract:The increasing availability of urban data offers new opportunities for learning region representations, which can be used as input to machine learning models for downstream tasks such as check-in or crime prediction. While existing solutions have produced promising results, an issue is their fixed formation of regions and fixed input region features, which may not suit the needs of different downstream tasks. To address this limitation, we propose a model named FlexiReg for urban region representation learning that is flexible with both the formation of urban regions and the input region features. FlexiReg is based on a spatial grid partitioning over the spatial area of interest. It learns representations for the grid cells, leveraging publicly accessible data, including POI, land use, satellite imagery, and street view imagery. We propose adaptive aggregation to fuse the cell representations and prompt learning techniques to tailor the representations towards different tasks, addressing the needs of varying formations of urban regions and downstream tasks. Extensive experiments on five real-world datasets demonstrate that FlexiReg outperforms state-of-the-art models by up to 202% in terms of the accuracy of four diverse downstream tasks using the produced urban region representations.

[LG-37] Drift-Aware Federated Learning: A Causal Perspective

Link: https://arxiv.org/abs/2503.09116
Authors: Yunjie Fang,Sheng Wu,Tao Yang,Xiaofeng Wu,Bo Hu
Categories: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
*Comments:

Click to view abstract

Abstract:Federated learning (FL) facilitates collaborative model training among multiple clients while preserving data privacy, often resulting in enhanced performance compared to models trained by individual clients. However, factors such as communication frequency and data distribution can contribute to feature drift, hindering the attainment of optimal training performance. This paper examines the relationship between model update drift and the global as well as local optimizers from a causal perspective. The influence of the global optimizer on feature drift primarily arises from the participation frequency of certain clients in server updates, whereas the effect of the local optimizer is typically associated with imbalanced data. To mitigate this drift, we propose a novel framework termed Causal drift-Aware Federated lEarning (CAFE). CAFE exploits the causal relationship between feature-invariant components and classification outcomes to independently calibrate local client sample features and classifiers during the training phase. In the inference phase, it eliminates the drifts in the global model that favor frequently communicating nodes. Experimental results demonstrate that CAFE’s integration of feature calibration, parameter calibration, and historical information effectively reduces both drift towards majority classes and tendencies toward frequently communicating nodes.

[LG-38] Adaptive Backdoor Attacks with Reasonable Constraints on Graph Neural Networks

Link: https://arxiv.org/abs/2503.09049
Authors: Xuewen Dong,Jiachen Li,Shujun Li,Zhichao You,Qiang Qu,Yaroslav Kholodov,Yulong Shen
Categories: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*Comments:

Click to view abstract

Abstract:Recent studies show that graph neural networks (GNNs) are vulnerable to backdoor attacks. Existing backdoor attacks against GNNs use fixed-pattern triggers and lack reasonable trigger constraints, overlooking individual graph characteristics and rendering insufficient evasiveness. To tackle the above issues, we propose ABARC, the first Adaptive Backdoor Attack with Reasonable Constraints, applying to both graph-level and node-level tasks in GNNs. For graph-level tasks, we propose a subgraph backdoor attack independent of the graph’s topology. It dynamically selects trigger nodes for each target graph and modifies node features with constraints based on graph similarity, feature range, and feature type. For node-level tasks, our attack begins with an analysis of node features, followed by selecting and modifying trigger features, which are then constrained by node similarity, feature range, and feature type. Furthermore, an adaptive edge-pruning mechanism is designed to reduce the impact of neighbors on target nodes, ensuring a high attack success rate (ASR). Experimental results show that even with reasonable constraints for attack evasiveness, our attack achieves a high ASR while incurring a marginal clean accuracy drop (CAD). When combined with the state-of-the-art defense randomized smoothing (RS) method, our attack maintains an ASR over 94%, surpassing existing attacks by more than 7%.

[LG-39] Feasibility-aware Imitation Learning from Observations through a Hand-mounted Demonstration Interface

Link: https://arxiv.org/abs/2503.09018
Authors: Kei Takahashi,Hikaru Sasaki,Takamitsu Matsubara
Categories: Robotics (cs.RO); Machine Learning (cs.LG)
*Comments:

Click to view abstract

Abstract:Imitation learning through a demonstration interface is expected to learn policies for robot automation from intuitive human demonstrations. However, due to the differences in human and robot movement characteristics, a human expert might unintentionally demonstrate an action that the robot cannot execute. We propose feasibility-aware behavior cloning from observation (FABCO). In the FABCO framework, the feasibility of each demonstration is assessed using the robot’s pre-trained forward and inverse dynamics models. This feasibility information is provided as visual feedback to the demonstrators, encouraging them to refine their demonstrations. During policy learning, estimated feasibility serves as a weight for the demonstration data, improving both the data efficiency and the robustness of the learned policy. We experimentally validated FABCO’s effectiveness by applying it to a pipette insertion task involving a pipette and a vial. Four participants assessed the impact of the feasibility feedback and the weighted policy learning in FABCO. Additionally, we used the NASA Task Load Index (NASA-TLX) to evaluate the workload induced by demonstrations with visual feedback.

[LG-40] Natural Humanoid Robot Locomotion with Generative Motion Prior

Link: https://arxiv.org/abs/2503.09015
Authors: Haodong Zhang,Liang Zhang,Zhenghan Chen,Lu Chen,Yue Wang,Rong Xiong
Categories: Robotics (cs.RO); Machine Learning (cs.LG)
*Comments:

Click to view abstract

Abstract:Natural and lifelike locomotion remains a fundamental challenge for humanoid robots to interact with human society. However, previous methods either neglect motion naturalness or rely on unstable and ambiguous style rewards. In this paper, we propose a novel Generative Motion Prior (GMP) that provides fine-grained motion-level supervision for the task of natural humanoid robot locomotion. To leverage natural human motions, we first employ whole-body motion retargeting to effectively transfer them to the robot. Subsequently, we train a generative model offline to predict future natural reference motions for the robot based on a conditional variational auto-encoder. During policy training, the generative motion prior serves as a frozen online motion generator, delivering precise and comprehensive supervision at the trajectory level, including joint angles and keypoint positions. The generative motion prior significantly enhances training stability and improves interpretability by offering detailed and dense guidance throughout the learning process. Experimental results in both simulation and real-world environments demonstrate that our method achieves superior motion naturalness compared to existing approaches. Project page can be found at this https URL

[LG-41] From Task-Specific Models to Unified Systems: A Review of Model Merging Approaches

Link: https://arxiv.org/abs/2503.08998
Authors: Wei Ruan,Tianze Yang,Yifan Zhou,Tianming Liu,Jin Lu
Categories: Machine Learning (cs.LG)
*Comments: 9 pages, 3 figures

Click to view abstract

Abstract:Model merging has achieved significant success, with numerous innovative methods proposed to enhance capabilities by combining multiple models. However, challenges persist due to the lack of a unified framework for classification and systematic comparative analysis, leading to inconsistencies in terminologies and categorizations. Meanwhile, as an increasing number of fine-tuned models are publicly available, their original training data often remain inaccessible due to privacy concerns or intellectual property restrictions. This makes traditional multi-task learning based on shared training data impractical. In scenarios where direct access to training data is infeasible, merging model parameters to create a unified model with broad generalization across multiple domains becomes crucial, further underscoring the importance of model merging techniques. Despite the rapid progress in this field, a comprehensive taxonomy and survey summarizing recent advances and predicting future directions are still lacking. This paper addresses these gaps by establishing a new taxonomy of model merging methods, systematically comparing different approaches, and providing an overview of key developments. By offering a structured perspective on this evolving area, we aim to help newcomers quickly grasp the field’s landscape and inspire further innovations.

[LG-42] Unified Locomotion Transformer with Simultaneous Sim-to-Real Transfer for Quadrupeds

Link: https://arxiv.org/abs/2503.08997
Authors: Dikai Liu,Tianwei Zhang,Jianxiong Yin,Simon See
Categories: Robotics (cs.RO); Machine Learning (cs.LG)
*Comments: Project website for video: this https URL

Click to view abstract

Abstract:Quadrupeds have gained rapid advancement in their capability of traversing across complex terrains. The adoption of deep Reinforcement Learning (RL), transformers and various knowledge transfer techniques can greatly reduce the sim-to-real gap. However, the classical teacher-student framework commonly used in existing locomotion policies requires a pre-trained teacher and leverages the privilege information to guide the student policy. With the implementation of large-scale models in robotics controllers, especially transformers-based ones, this knowledge distillation technique starts to show its weakness in efficiency, due to the requirement of multiple supervised stages. In this paper, we propose Unified Locomotion Transformer (ULT), a new transformer-based framework to unify the processes of knowledge transfer and policy optimization in a single network while still taking advantage of privilege information. The policies are optimized with reinforcement learning, next state-action prediction, and action imitation, all in just one training stage, to achieve zero-shot deployment. Evaluation results demonstrate that with ULT, optimal teacher and student policies can be obtained at the same time, greatly easing the difficulty in knowledge transfer, even with complex transformer-based models.

[LG-43] TetraGrip: Sensor-Driven Multi-Suction Reactive Object Manipulation in Cluttered Scenes

Link: https://arxiv.org/abs/2503.08978
Authors: Paolo Torrado,Joshua Levin,Markus Grotz,Joshua Smith
Categories: Robotics (cs.RO); Machine Learning (cs.LG); Systems and Control (eess.SY)
*Comments:

Click to view abstract

Abstract:Warehouse robotic systems equipped with vacuum grippers must reliably grasp a diverse range of objects from densely packed shelves. However, these environments present significant challenges, including occlusions, diverse object orientations, stacked and obstructed items, and surfaces that are difficult to suction. We introduce TetraGrip, a novel vacuum-based grasping strategy featuring four suction cups mounted on linear actuators. Each actuator is equipped with an optical time-of-flight (ToF) proximity sensor, enabling reactive grasping. We evaluate TetraGrip in a warehouse-style setting, demonstrating its ability to manipulate objects in stacked and obstructed configurations. Our results show that our RL-based policy improves picking success in stacked-object scenarios by 22.86% compared to a single-suction gripper. Additionally, we demonstrate that TetraGrip can successfully grasp objects in scenarios where a single-suction gripper fails due to physical limitations, specifically in two cases: (1) picking an object occluded by another object and (2) retrieving an object in a complex scenario. These findings highlight the advantages of multi-actuated, suction-based grasping in unstructured warehouse environments. The project website is available at: this https URL.

[LG-44] Not All Edges are Equally Robust: Evaluating the Robustness of Ranking-Based Federated Learning

Link: https://arxiv.org/abs/2503.08976
Authors: Zirui Gong,Yanjun Zhang,Leo Yu Zhang,Zhaoxi Zhang,Yong Xiang,Shirui Pan
Categories: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Distributed, Parallel, and Cluster Computing (cs.DC)
*Comments: 18 pages. To appear in the IEEE Symposium on Security and Privacy 2025

Click to view abstract

Abstract:Federated Ranking Learning (FRL) is a state-of-the-art FL framework that stands out for its communication efficiency and resilience to poisoning attacks. It diverges from the traditional FL framework in two ways: 1) it leverages discrete rankings instead of gradient updates, significantly reducing communication costs and limiting the potential space for malicious updates, and 2) it uses majority voting on the server side to establish the global ranking, ensuring that individual updates have minimal influence since each client contributes only a single vote. These features enhance the system’s scalability and position FRL as a promising paradigm for FL training. However, our analysis reveals that FRL is not inherently robust, as certain edges are particularly vulnerable to poisoning attacks. Through a theoretical investigation, we prove the existence of these vulnerable edges and establish a lower bound and an upper bound for identifying them in each layer. Based on this finding, we introduce a novel local model poisoning attack against FRL, namely the Vulnerable Edge Manipulation (VEM) attack. The VEM attack focuses on identifying and perturbing the most vulnerable edges in each layer and leveraging an optimization-based approach to maximize the attack’s impact. Through extensive experiments on benchmark datasets, we demonstrate that our attack achieves an overall 53.23% attack impact and is 3.7x more impactful than existing methods. Our findings highlight significant vulnerabilities in ranking-based FL systems and underline the urgency for the development of new robust FL frameworks.

[LG-45] Quantitative Analysis of Deeply Quantized Tiny Neural Networks Robust to Adversarial Attacks

Link: https://arxiv.org/abs/2503.08973
Authors: Idris Zakariyya,Ferheen Ayaz,Mounia Kharbouche-Harrari,Jeremy Singer,Sye Loong Keoh,Danilo Pau,José Cano
Categories: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Performance (cs.PF)
*Comments: arXiv admin note: substantial text overlap with arXiv:2304.12829

Click to view abstract

Abstract:Reducing the memory footprint of Machine Learning (ML) models, especially Deep Neural Networks (DNNs), is imperative to facilitate their deployment on resource-constrained edge devices. However, a notable drawback of DNN models lies in their susceptibility to adversarial attacks, wherein minor input perturbations can deceive them. A primary challenge revolves around the development of accurate, resilient, and compact DNN models suitable for deployment on resource-constrained edge devices. This paper presents the outcomes of a compact DNN model that exhibits resilience against both black-box and white-box adversarial attacks. This work has achieved this resilience through training with the QKeras quantization-aware training framework. The study explores the potential of QKeras and an adversarial robustness technique, Jacobian Regularization (JR), to co-optimize the DNN architecture through a per-layer JR methodology. As a result, this paper has devised a DNN model employing this co-optimization strategy based on Stochastic Ternary Quantization (STQ). Its performance was compared against existing DNN models in the face of various white-box and black-box attacks. The experimental findings revealed that the proposed DNN model had a small footprint and, on average, exhibited better performance than the Quanos and DS-CNN MLCommons/TinyML (MLC/T) benchmarks when challenged with white-box and black-box attacks, respectively, on the CIFAR-10 image and Google Speech Commands audio datasets.

[LG-46] Multiplayer Information Asymmetric Contextual Bandits

Link: https://arxiv.org/abs/2503.08961
Authors: William Chang,Yuanhao Lu
Categories: Machine Learning (cs.LG)
*Comments:

Click to view abstract

Abstract:Single-player contextual bandits are a well-studied problem in reinforcement learning that has seen applications in various fields such as advertising, healthcare, and finance. In light of the recent work on \emph{information asymmetric bandits} \cite{chang2022online, chang2023online}, we propose a novel multiplayer information asymmetric contextual bandit framework where there are multiple players each with their own set of actions. At every round, they observe the same context vectors and simultaneously take an action from their own set of actions, giving rise to a joint action. However, upon taking this action the players are subjected to information asymmetry in (1) actions and/or (2) rewards. We designed an algorithm \texttt{LinUCB} by modifying the classical single-player algorithm \texttt{LinUCB} in \cite{chu2011contextual} to achieve the optimal regret $O(\sqrt{T})$ when only one kind of asymmetry is present. We then propose a novel algorithm \texttt{ETC} that is built on explore-then-commit principles to achieve the same optimal regret when both types of asymmetry are present.

[LG-47] Data-driven Nonlinear Modal Analysis with Physics-constrained Deep Learning: Numerical and Experimental Study

Link: https://arxiv.org/abs/2503.08952
Authors: Abdolvahhab Rostamijavanani,Shanwu Li,Yongchao Yang
Categories: Machine Learning (cs.LG); Signal Processing (eess.SP)
*Comments:

Click to view abstract

Abstract:To fully understand, analyze, and determine the behavior of dynamical systems, it is crucial to identify their intrinsic modal coordinates. In nonlinear dynamical systems, this task is challenging as the modal transformation based on the superposition principle that works well for linear systems is no longer applicable. To understand the nonlinear dynamics of a system, one of the main approaches is to use the framework of Nonlinear Normal Modes (NNMs), which attempts to provide an in-depth representation. In this research, we examine the effectiveness of NNMs in characterizing nonlinear dynamical systems. Given the difficulty of obtaining closed-form models or equations for these real-world systems, we present a data-driven framework that combines physics and deep learning to identify the nonlinear modal transformation function of NNMs from response data only. We assess the framework’s ability to represent the system by analyzing its mode decomposition, reconstruction, and prediction accuracy using a nonlinear beam as an example. Initially, we perform numerical simulations on a nonlinear beam at different energy levels in both linear and nonlinear scenarios. Afterward, using experimental vibration data of a nonlinear beam, we isolate the first two NNMs. It is observed that the NNMs’ frequency values increase as the excitation energy level increases, and the configuration plots become more twisted (more nonlinear). In the experiment, the framework successfully decomposed the first two NNMs of the nonlinear beam using experimental free vibration data and captured the dynamics of the structure via prediction and reconstruction of some physical points of the beam.

[LG-48] Extragradient Preference Optimization (EGPO): Beyond Last-Iterate Convergence for Nash Learning from Human Feedback

Link: https://arxiv.org/abs/2503.08942
Authors: Runlong Zhou,Maryam Fazel,Simon S. Du
Categories: Machine Learning (cs.LG)
*Comments:

Click to view abstract

Abstract:Reinforcement learning from human feedback (RLHF) has become essential for improving language model capabilities, but traditional approaches rely on the assumption that human preferences follow a transitive Bradley-Terry model. This assumption fails to capture the non-transitive nature of populational human preferences. Nash learning from human feedback (NLHF), targeting non-transitive preferences, is a problem of computing the Nash equilibrium (NE) of the two-player constant-sum game defined by the human preference. We introduce Extragradient preference optimization (EGPO), a novel algorithm for NLHF achieving last-iterate linear convergence to the NE of KL-regularized games and polynomial convergence to the NE of original games, while being robust to noise. Unlike previous approaches that rely on nested optimization, we derive an equivalent implementation using gradients of an online variant of the identity preference optimization (IPO) loss, enabling more faithful implementation for neural networks. Our empirical evaluations demonstrate EGPO’s superior performance over baseline methods when training for the same number of epochs, as measured by pairwise win-rates using the ground truth preference. These results validate both the theoretical strengths and practical advantages of EGPO for language model alignment with non-transitive human preferences.

[LG-49] Near-Optimal Sample Complexity for Iterated CVaR Reinforcement Learning with a Generative Model

Link: https://arxiv.org/abs/2503.08934
Authors: Zilong Deng,Simon Khan,Shaofeng Zou
Categories: Machine Learning (cs.LG)
*Comments: arXiv admin note: text overlap with arXiv:2305.16589 by other authors

Click to view abstract

Abstract:In this work, we study the sample complexity problem of risk-sensitive Reinforcement Learning (RL) with a generative model, where we aim to maximize the Conditional Value at Risk (CVaR) with risk tolerance level $\tau$ at each step, named Iterated CVaR. We consider the sample complexity of obtaining an $\epsilon$-optimal policy in an infinite horizon discounted MDP, given access to a generative model. We first build a connection between Iterated CVaR RL with $(s, a)$-rectangular distributional robust RL with the specific uncertainty set for CVaR. We develop nearly matching upper and lower bounds on the sample complexity for this problem. Specifically, we first prove that a value iteration-based algorithm, ICVaR-VI, achieves an $\epsilon$-optimal policy with at most $\tilde{O}\left(\frac{SA}{(1-\gamma)^4\tau^2\epsilon^2}\right)$ samples, where $\gamma$ is the discount factor, and $S, A$ are the sizes of the state and action spaces. Furthermore, if $\tau \geq \gamma$, then the sample complexity can be further improved to $\tilde{O}\left(\frac{SA}{(1-\gamma)^3\epsilon^2}\right)$. We further show a minimax lower bound of $\tilde{O}\left(\frac{(1-\gamma\tau)SA}{(1-\gamma)^4\tau\epsilon^2}\right)$. For a constant risk level $0 < \tau \leq 1$, our upper and lower bounds match with each other, demonstrating the tightness and optimality of our analyses. We also investigate a limiting case with a small risk level $\tau$, called Worst-Path RL, where the objective is to maximize the minimum possible cumulative reward. We develop matching upper and lower bounds of $\tilde{O}\left(\frac{SA}{p_{\min}}\right)$, where $p_{\min}$ denotes the minimum non-zero reaching probability of the transition kernel.
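
For readers unfamiliar with CVaR, the quantity being maximized can be illustrated on a single discrete reward distribution. The sketch below uses the standard lower-tail definition (the mean of the worst $\tau$-fraction of outcomes); the function name and the example numbers are assumptions for illustration, and this is not the paper's ICVaR-VI algorithm.

```python
def cvar(values, probs, tau):
    """Lower-tail CVaR at level tau of a discrete reward distribution.

    Averages the worst tau-fraction of probability mass: tau = 1 gives the
    plain expectation, and small tau approaches the worst outcome.
    """
    if not 0 < tau <= 1:
        raise ValueError("tau must lie in (0, 1]")
    remaining = tau  # probability mass still to be collected from the tail
    total = 0.0
    for value, prob in sorted(zip(values, probs)):  # worst rewards first
        mass = min(prob, remaining)  # take only part of the boundary atom
        total += mass * value
        remaining -= mass
        if remaining <= 1e-12:
            break
    return total / tau


rewards = [0.0, 1.0, 10.0]
weights = [0.1, 0.4, 0.5]
print(cvar(rewards, weights, 1.0))  # the ordinary mean
print(cvar(rewards, weights, 0.5))  # mean of the worst half of the mass
print(cvar(rewards, weights, 0.1))  # only the worst outcome contributes
```

In an iterated setting, an operator of this kind would be applied to the next-state value distribution at every step rather than once to the total return, which is what distinguishes Iterated CVaR from a single end-of-episode risk measure.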

[LG-50] Enhancing Large Language Models for Hardware Verification: A Novel SystemVerilog Assertion Dataset

Link: https://arxiv.org/abs/2503.08923
Authors: Anand Menon,Samit S Miftah,Shamik Kundu,Souvik Kundu,Amisha Srivastava,Arnab Raha,Gabriel Theodor Sonnenschein,Suvadeep Banerjee,Deepak Mathaikutty,Kanad Basu
Categories: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*Comments: 29 Pages

Click to view abstract

Abstract:Hardware verification is crucial in modern SoC design, consuming around 70% of development time. SystemVerilog assertions ensure correct functionality. However, existing industrial practices rely on manual efforts for assertion generation, which becomes increasingly untenable as hardware systems become complex. Recent research shows that Large Language Models (LLMs) can automate this process. However, proprietary SOTA models like GPT-4o often generate inaccurate assertions and require expensive licenses, while smaller open-source LLMs need fine-tuning to manage HDL code complexities. To address these issues, we introduce VERT, an open-source dataset designed to enhance SystemVerilog assertion generation using LLMs. VERT enables researchers in academia and industry to fine-tune open-source models, outperforming larger proprietary ones in both accuracy and efficiency while ensuring data privacy through local fine-tuning and eliminating costly licenses. The dataset is curated by systematically augmenting variables from open-source HDL repositories to generate synthetic code snippets paired with corresponding assertions. Experimental results demonstrate that fine-tuned models like Deepseek Coder 6.7B and Llama 3.1 8B outperform GPT-4o, achieving up to 96.88% improvement over base models and 24.14% over GPT-4o on platforms including OpenTitan, CVA6, OpenPiton and Pulpissimo. VERT is available at this https URL.

[LG-51] Multilevel Generative Samplers for Investigating Critical Phenomena ICLR2025

Link: https://arxiv.org/abs/2503.08918
Authors: Ankur Singha,Elia Cellini,Kim A. Nicoli,Karl Jansen,Stefan Kühn,Shinichi Nakajima
Categories: Machine Learning (cs.LG); High Energy Physics - Lattice (hep-lat); Machine Learning (stat.ML)
*Comments: 10 pages, 4 figures (main text); 13th International Conference on Learning Representations (ICLR 2025)

Click to view abstract

Abstract:Investigating critical phenomena or phase transitions is of high interest in physics and chemistry, for which Monte Carlo (MC) simulations, a crucial tool for numerically analyzing macroscopic properties of given systems, are often hindered by an emerging divergence of correlation length – known as scale invariance at criticality (SIC) in the renormalization group theory. SIC causes the system to behave the same at any length scale, from which many existing sampling methods suffer: long-range correlations cause critical slowing down in Markov chain Monte Carlo (MCMC), and require intractably large receptive fields for generative samplers. In this paper, we propose a Renormalization-informed Generative Critical Sampler (RiGCS) – a novel sampler specialized for near-critical systems, where SIC is leveraged as an advantage rather than a nuisance. Specifically, RiGCS builds on MultiLevel Monte Carlo (MLMC) with Heat Bath (HB) algorithms, which perform ancestral sampling from low-resolution to high-resolution lattice configurations with site-wise-independent conditional HB sampling. Although MLMC-HB is highly efficient under exact SIC, it suffers from a low acceptance rate under slight SIC violation. Notably, SIC violation always occurs in finite-size systems, and may induce long-range and higher-order interactions in the renormalized distributions, which are not considered by independent HB samplers. RiGCS enhances MLMC-HB by replacing a part of the conditional HB sampler with generative models that capture those residual interactions and improve the sampling efficiency. Our experiments show that the effective sample size of RiGCS is a few orders of magnitude higher than state-of-the-art generative model baselines in sampling configurations for 128x128 two-dimensional Ising systems.
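
As background for the heat-bath (HB) step mentioned above, the following is a minimal sketch of the textbook single-site heat-bath rule for a nearest-neighbor Ising spin. It is not RiGCS or the multilevel conditional sampler from the paper; the function names, the unit coupling, and the example temperature are assumptions for illustration.

```python
import math
import random


def heat_bath_prob_up(neighbor_sum, beta, coupling=1.0):
    """P(s = +1 | neighbors) for an Ising spin.

    neighbor_sum is the sum of the neighboring spins, so the local field is
    coupling * neighbor_sum and the conditional is a logistic function of it.
    """
    return 1.0 / (1.0 + math.exp(-2.0 * beta * coupling * neighbor_sum))


def heat_bath_sample(neighbor_sum, beta, rng=random):
    """Draw one spin (+1 or -1) from its exact conditional distribution."""
    return 1 if rng.random() < heat_bath_prob_up(neighbor_sum, beta) else -1


beta = 0.44  # example inverse temperature, chosen near the 2D critical value
for field in (-4, -2, 0, 2, 4):
    print(field, round(heat_bath_prob_up(field, beta), 4))
```

Because each spin is drawn from its conditional given fixed neighbors, many sites can be sampled independently in one pass, which is the site-wise independence the abstract refers to.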

[LG-52] From Models To Experiments: Shallow Recurrent Decoder Networks on the DYNASTY Experimental Facility

Link: https://arxiv.org/abs/2503.08907
Authors: Carolina Introini, Stefano Riva, J. Nathan Kutz, Antonio Cammi
Categories: Machine Learning (cs.LG); Computational Physics (physics.comp-ph); Fluid Dynamics (physics.flu-dyn)
Comments:

Abstract:The Shallow Recurrent Decoder networks are a novel paradigm recently introduced for state estimation, combining sparse observations with high-dimensional model data. This architecture features important advantages compared to standard data-driven methods, including: the ability to use only three sensors (even randomly selected) for reconstructing the entire dynamics of a physical system; the ability to train on compressed data spanned by a reduced basis; the ability to measure a single field variable (easy to measure) and reconstruct coupled spatio-temporal fields that are not observable; and minimal hyper-parameter tuning. This approach has been verified on different test cases within different fields, including nuclear reactors, but an application to a real experimental facility using in-situ observed quantities is still missing. This work aims to fill this gap by applying the Shallow Recurrent Decoder architecture to the DYNASTY facility, built at Politecnico di Milano, which studies the natural circulation established by internally heated fluids for Generation IV applications, especially in the case of Circulating Fuel reactors. The RELAP5 code is used to generate the high-fidelity data, and temperature measurements extracted from the facility are used as input for the state estimation. The results of this work will provide a validation of the Shallow Recurrent Decoder architecture for engineering systems, showing the capabilities of this approach to provide an accurate state estimation.

[LG-53] Towards Efficient Parametric State Estimation in Circulating Fuel Reactors with Shallow Recurrent Decoder Networks

Link: https://arxiv.org/abs/2503.08904
Authors: Stefano Riva, Carolina Introini, J. Nathan Kutz, Antonio Cammi
Categories: Machine Learning (cs.LG); Computational Engineering, Finance, and Science (cs.CE); Computational Physics (physics.comp-ph)
Comments: arXiv admin note: text overlap with arXiv:2409.12550

Abstract:The recent developments in data-driven methods have paved the way for new methodologies that provide accurate state reconstruction of engineering systems; nuclear reactors represent particularly challenging applications for this task due to the complexity of the strongly coupled physics involved and the extremely harsh and hostile environments, especially for new technologies such as Generation-IV reactors. Data-driven techniques can combine different sources of information, including computational proxy models and local noisy measurements on the system, to robustly estimate the state. This work leverages the novel Shallow Recurrent Decoder architecture to infer the entire state vector (including neutron fluxes, precursor concentrations, temperature, pressure and velocity) of a reactor from three out-of-core time-series neutron flux measurements alone. In particular, this work extends the standard architecture to treat parametric time-series data, ensuring the possibility of investigating different accident scenarios and showing the capabilities of this approach to provide an accurate state estimation in various operating conditions. This paper considers as a test case the Molten Salt Fast Reactor (MSFR), a Generation-IV reactor concept, characterised by strong coupling between the neutronics and the thermal hydraulics due to the liquid nature of the fuel. The promising results of this work are further strengthened by the possibility of quantifying the uncertainty associated with the state estimation, due to the considerably low training cost. The accurate reconstruction of every characteristic field in real-time makes this approach suitable for monitoring and control purposes in the framework of a reactor digital twin.

[LG-54] Comprehensive Benchmarking of Machine Learning Methods for Risk Prediction Modelling from Large-Scale Survival Data: A UK Biobank Study

Link: https://arxiv.org/abs/2503.08870
Authors: Rafael R. Oexner, Robin Schmitt, Hyunchan Ahn, Ravi A. Shah, Anna Zoccarato, Konstantinos Theofilatos, Ajay M. Shah
Categories: Machine Learning (cs.LG); Applications (stat.AP)
Comments:

Abstract:Predictive modelling is vital to guide preventive efforts. Whilst large-scale prospective cohort studies and a diverse toolkit of available machine learning (ML) algorithms have facilitated such survival task efforts, choosing the best-performing algorithm remains challenging. Benchmarking studies to date focus on relatively small-scale datasets and it is unclear how well such findings translate to large datasets that combine omics and clinical features. We sought to benchmark eight distinct survival task implementations, ranging from linear to deep learning (DL) models, within the large-scale prospective cohort study UK Biobank (UKB). We compared discrimination and computational requirements across heterogeneous predictor matrices and endpoints. Finally, we assessed how well different architectures scale with sample sizes ranging from n = 5,000 to n = 250,000 individuals. Our results show that discriminative performance across a multitude of metrics is dependent on endpoint frequency and predictor matrix properties, with very robust performance of (penalised) Cox Proportional Hazards (Cox-PH) models. Of note, there are certain scenarios which favour more complex frameworks, specifically if working with larger numbers of observations and relatively simple predictor matrices. The observed computational requirements were vastly different, and we provide solutions in cases where current implementations were impracticable. In conclusion, this work delineates how optimal model choice is dependent on a variety of factors, including sample size, endpoint frequency and predictor matrix properties, thus constituting an informative resource for researchers working on similar datasets. Furthermore, we showcase how linear models still display a highly effective and scalable platform to perform risk modelling at scale and suggest that those are reported alongside non-linear ML models.
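
Illustrative note (not from the paper): the abstract compares "discrimination" across survival models without naming the metrics here. One widely used discrimination measure for censored data is Harrell's concordance index, sketched below in plain Python. The function name and the toy data are hypothetical, and this is only an assumption about the kind of metric such a benchmark might report.

```python
def concordance_index(times, events, risk_scores):
    """Harrell's C-index for right-censored data (illustrative sketch).

    A pair (i, j) is comparable when subject i had an observed event
    and a strictly shorter follow-up time than subject j. The pair is
    concordant when the model gave i the higher risk score.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # censored subjects cannot anchor a comparable pair
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5  # ties in risk count as half
    return concordant / comparable if comparable else float("nan")
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking of event times by predicted risk.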

[LG-55] Smoothing ADMM for Non-convex and Non-smooth Hierarchical Federated Learning

Link: https://arxiv.org/abs/2503.08869
Authors: Reza Mirzaeifard, Stefan Werner
Categories: Machine Learning (cs.LG)
Comments:

Abstract:This paper presents a hierarchical federated learning (FL) framework that extends the alternating direction method of multipliers (ADMM) with smoothing techniques, tailored for non-convex and non-smooth objectives. Unlike traditional hierarchical FL methods, our approach supports asynchronous updates and multiple updates per iteration, enhancing adaptability to heterogeneous data and system settings. Additionally, we introduce a flexible mechanism to leverage diverse regularization functions at each layer, allowing customization to the specific prior information within each cluster and accommodating (possibly) non-smooth penalty objectives. Depending on the learning goal, the framework supports both consensus and personalization: the total variation norm can be used to enforce consensus across layers, while non-convex penalties such as minimax concave penalty (MCP) or smoothly clipped absolute deviation (SCAD) enable personalized learning. Experimental results demonstrate the superior convergence rates and accuracy of our method compared to conventional approaches, underscoring its robustness and versatility for a wide range of FL scenarios.
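
Illustrative note (not from the paper): the two non-convex penalties named in the abstract, MCP and SCAD, have standard closed forms. The sketch below evaluates them for a scalar argument. The parameter names `lam`, `gamma`, and `a` are conventional choices assumed here, not taken from the abstract, and the smoothing and ADMM steps of the paper are not reproduced.

```python
def mcp(t: float, lam: float, gamma: float) -> float:
    """Minimax concave penalty with threshold lam > 0 and concavity gamma > 0."""
    x = abs(t)
    if x <= gamma * lam:
        return lam * x - x * x / (2.0 * gamma)
    return 0.5 * gamma * lam * lam  # penalty is flat beyond gamma * lam


def scad(t: float, lam: float, a: float) -> float:
    """Smoothly clipped absolute deviation penalty with lam > 0 and a > 2."""
    x = abs(t)
    if x <= lam:
        return lam * x  # behaves like the L1 penalty near zero
    if x <= a * lam:
        return (2.0 * a * lam * x - x * x - lam * lam) / (2.0 * (a - 1.0))
    return lam * lam * (a + 1.0) / 2.0  # constant for large arguments
```

Both penalties match the L1 penalty near zero and become constant for large arguments, so large coefficients are not shrunk further, which is why they suit personalized (rather than consensus) objectives.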

[LG-56] Seal Your Backdoor with Variational Defense

Link: https://arxiv.org/abs/2503.08829
Authors: Ivan Sabolić, Matej Grcić, Siniša Šegvić
Categories: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
Comments:

Abstract:We propose VIBE, a model-agnostic framework that trains classifiers resilient to backdoor attacks. The key concept behind our approach is to treat malicious inputs and corrupted labels from the training dataset as observed random variables, while the actual clean labels are latent. VIBE then recovers the corresponding latent clean label posterior through variational inference. The resulting training procedure follows the expectation-maximization (EM) algorithm. The E-step infers the clean pseudolabels by solving an entropy-regularized optimal transport problem, while the M-step updates the classifier parameters via gradient descent. Being modular, VIBE can seamlessly integrate with recent advancements in self-supervised representation learning, which enhance its ability to resist backdoor attacks. We experimentally validate the method's effectiveness against contemporary backdoor attacks on standard datasets, a large-scale setup with 1k classes, and a dataset poisoned with multiple attacks. VIBE consistently outperforms previous defenses across all tested scenarios.
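
Illustrative note (not from the paper): entropy-regularized optimal transport, which the abstract names as the core of the E-step, is commonly solved with Sinkhorn iterations. The sketch below is a generic pure-Python Sinkhorn solver. Reading rows as samples, columns as classes, and costs as negative log-probabilities is an assumption made for illustration, and the function name and default parameters are hypothetical rather than VIBE's actual formulation.

```python
import math


def sinkhorn(cost, row_marg, col_marg, eps=0.1, n_iters=500):
    """Return a transport plan for entropy-regularized OT via Sinkhorn scaling.

    cost:     n x m list of lists (e.g. negative log-probabilities, assumed)
    row_marg: target mass per row (e.g. uniform over samples)
    col_marg: target mass per column (e.g. assumed class proportions)
    eps:      entropy regularization strength
    """
    n, m = len(cost), len(cost[0])
    # Gibbs kernel induced by the cost matrix.
    kernel = [[math.exp(-cost[i][j] / eps) for j in range(m)] for i in range(n)]
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(n_iters):
        # Alternately rescale rows and columns to match the marginals.
        u = [row_marg[i] / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [col_marg[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]
```

Normalizing each row of the returned plan by its row mass would give a soft pseudolabel distribution per sample under the assumed reading.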

[LG-57] Enhanced Estimation Techniques for Certified Radii in Randomized Smoothing

Link: https://arxiv.org/abs/2503.08801
Authors: Zixuan Liang
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: IEEE The 8th International Conference on Artificial Intelligence and Big Data (ICAIBD 2025)

Abstract:This paper presents novel methods for estimating certified radii in randomized smoothing, a technique crucial for certifying the robustness of neural networks against adversarial perturbations. Our proposed techniques significantly improve certified test-set accuracy by providing tighter bounds on the certified radii. We introduce advanced algorithms for both discrete and continuous domains, demonstrating their effectiveness on CIFAR-10 and ImageNet datasets. The new methods show considerable improvements over existing approaches, particularly in reducing discrepancies in certified radii estimates. We also explore the impact of various hyperparameters, including sample size, standard deviation, and temperature, on the performance of these methods. Our findings highlight the potential for more efficient certification processes and pave the way for future research on tighter confidence sequences and improved theoretical frameworks. The study concludes with a discussion of potential future directions, including enhanced estimation techniques for discrete domains and further theoretical advancements to bridge the gap between empirical and theoretical performance in randomized smoothing.
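
Illustrative note (not from the paper): for Gaussian randomized smoothing, the commonly used L2 certificate is R = sigma * Phi^{-1}(p_lower), where p_lower is a high-confidence lower bound on the top-class probability under noise. The sketch below uses a simple Hoeffding bound for p_lower. That choice is an assumption made to keep the example dependency-free, and it is the kind of loose baseline estimate that tighter methods aim to improve on. Function names are hypothetical.

```python
import math
from statistics import NormalDist


def hoeffding_lower_bound(successes: int, n: int, alpha: float) -> float:
    """One-sided Hoeffding lower bound on a Bernoulli mean, holding w.p. >= 1 - alpha."""
    return max(0.0, successes / n - math.sqrt(math.log(1.0 / alpha) / (2.0 * n)))


def certified_radius(successes: int, n: int, sigma: float, alpha: float = 0.001):
    """Return the certified L2 radius, or None to abstain.

    successes: number of noisy samples classified as the predicted class
    n:         total number of noisy samples
    sigma:     standard deviation of the Gaussian smoothing noise
    """
    p_lower = hoeffding_lower_bound(successes, n, alpha)
    if p_lower <= 0.5:
        return None  # the majority class cannot be certified
    return sigma * NormalDist().inv_cdf(p_lower)
```

With the same empirical success rate, a larger sample size tightens p_lower and therefore enlarges the certified radius, which is the sample-size effect the abstract mentions.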

[LG-58] Contextual Speech Extraction: Leveraging Textual History as an Implicit Cue for Target Speech Extraction ICASSP2025

Link: https://arxiv.org/abs/2503.08798
Authors: Minsu Kim, Rodrigo Mira, Honglie Chen, Stavros Petridis, Maja Pantic
Categories: Sound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
Comments: Accepted to ICASSP 2025

Abstract:In this paper, we investigate a novel approach for Target Speech Extraction (TSE), which relies solely on textual context to extract the target speech. We refer to this task as Contextual Speech Extraction (CSE). Unlike traditional TSE methods that rely on pre-recorded enrollment utterances, video of the target speaker’s face, spatial information, or other explicit cues to identify the target stream, our proposed method requires only a few turns of previous dialogue (or monologue) history. This approach is naturally feasible in mobile messaging environments where voice recordings are typically preceded by textual dialogue that can be leveraged implicitly. We present three CSE models and analyze their performances on three datasets. Through our experiments, we demonstrate that even when the model relies purely on dialogue history, it can achieve over 90% accuracy in identifying the correct target stream with only two previous dialogue turns. Furthermore, we show that by leveraging both textual context and enrollment utterances as cues during training, we further enhance our model’s flexibility and effectiveness, allowing us to use either cue during inference, or combine both for improved performance. Samples and code available on this https URL.

[LG-59] Automatic welding detection by an intelligent tool pipe inspection

Link: https://arxiv.org/abs/2503.08757
Authors: C J Arizmendi, W L Garcia, M A Quintero
Categories: Machine Learning (cs.LG)
Comments:

Abstract:This work provides a machine learning model for weld recognition, based on signals obtained from an in-line inspection tool, known as a smart pig, in oil and gas pipelines. The model uses a signal noise reduction phase by means of preprocessing algorithms and attribute selection techniques. The noise reduction techniques were selected after a literature review and testing with survey data. Subsequently, the model was trained using recognition and classification algorithms, specifically artificial neural networks and support vector machines. Finally, the trained model was validated with different data sets and the performance was measured with cross validation and ROC analysis. The results show that it is possible to identify welds automatically with an efficiency between 90 and 98 percent.

[LG-60] Large Neighborhood Search and Bitmask Dynamic Programming for Wireless Mobile Charging Electric Vehicle Routing Problems in Medical Transportation

Link: https://arxiv.org/abs/2503.08752
Authors: Jingyi Zhao, Haoxiang Yang, Yang Liu
Categories: Machine Learning (cs.LG)
Comments: 33 pages, 12 figures

Abstract:The transition to electric vehicles (EVs) is critical to achieving sustainable transportation, but challenges such as limited driving range and insufficient charging infrastructure have hindered the widespread adoption of EVs, especially in time-sensitive logistics such as medical transportation. This paper presents a new model to break through this barrier by combining wireless mobile charging technology with optimization. We propose the Wireless Mobile Charging Electric Vehicle Routing Problem (WMC-EVRP), which enables Medical Transportation Electric Vehicles (MTEVs) to be charged while traveling via Mobile Charging Carts (MCTs). This eliminates the time wastage of stopping for charging and ensures uninterrupted operation of MTEVs for such time-sensitive transportation problems. However, in this problem, the decisions of these two types of heterogeneous vehicles are coupled with each other, which greatly increases the difficulty of vehicle routing optimizations. To address this complex problem, we develop a mathematical model and a tailored meta-heuristic algorithm that combines Bit Mask Dynamic Programming (BDP) and Large Neighborhood Search (LNS). The BDP approach efficiently optimizes charging strategies, while the LNS framework utilizes custom operators to optimize the MTEV routes under capacity and synchronization constraints. Our approach outperforms traditional solvers in providing solutions for medium and large instances. Using actual hospital locations in Singapore as data, we validated the practical applicability of the model through extensive experiments and provided important insights into minimizing costs and ensuring the timely delivery of healthcare services.
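
Illustrative note (not from the paper): bitmask dynamic programming encodes a set of visited items as the bits of an integer so that subsets can index a DP table. The sketch below applies the idea to a toy closed-tour routing problem (the classic Held-Karp recursion). It is not the paper's BDP for charging strategies, and the function name and distance matrix are hypothetical.

```python
def held_karp(dist):
    """Length of the shortest tour that starts and ends at node 0.

    dp[mask][j] is the cheapest cost of a path that starts at node 0,
    visits exactly the nodes whose bits are set in mask, and ends at j.
    """
    n = len(dist)
    inf = float("inf")
    dp = [[inf] * n for _ in range(1 << n)]
    dp[1][0] = 0.0  # only node 0 visited, standing at node 0
    for mask in range(1 << n):
        if not mask & 1:
            continue  # every valid state includes the start node
        for j in range(n):
            cur = dp[mask][j]
            if cur == inf:
                continue
            for k in range(n):
                if mask & (1 << k):
                    continue  # node k already visited
                nxt = mask | (1 << k)
                cand = cur + dist[j][k]
                if cand < dp[nxt][k]:
                    dp[nxt][k] = cand
    full = (1 << n) - 1
    return min(dp[full][j] + dist[j][0] for j in range(n))
```

The table has 2^n * n entries, so exact bitmask DP is only practical for small subproblems, which is consistent with pairing it with a heuristic such as large neighborhood search for the full instance.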

[LG-61] Shedding Light in Task Decomposition in Program Synthesis: The Driving Force of the Synthesizer Model ICLR2025

Link: https://arxiv.org/abs/2503.08738
Authors: Janis Zenkner, Tobias Sesterhenn, Christian Bartelt
Categories: Software Engineering (cs.SE); Machine Learning (cs.LG); Programming Languages (cs.PL)
Comments: Accepted at ICLR 2025 Workshop Deep Learning for Code

Abstract:Task decomposition is a fundamental mechanism in program synthesis, enabling complex problems to be broken down into manageable subtasks. ExeDec, a state-of-the-art program synthesis framework, employs this approach by combining a Subgoal Model for decomposition and a Synthesizer Model for program generation to facilitate compositional generalization. In this work, we develop REGISM, an adaptation of ExeDec that removes decomposition guidance and relies solely on iterative execution-driven synthesis. By comparing these two exemplary approaches (ExeDec, which leverages task decomposition, and REGISM, which does not), we investigate the interplay between task decomposition and program generation. Our findings indicate that ExeDec exhibits significant advantages in length generalization and concept composition tasks, likely due to its explicit decomposition strategies. At the same time, REGISM frequently matches or surpasses ExeDec’s performance across various scenarios, with its solutions often aligning more closely with ground truth decompositions. These observations highlight the importance of repeated execution-guided synthesis in driving task-solving performance, even within frameworks that incorporate explicit decomposition strategies. Our analysis suggests that task decomposition approaches like ExeDec hold significant potential for advancing program synthesis, though further work is needed to clarify when and why these strategies are most effective.

[LG-62] Neural Network-Based Change Point Detection for Large-Scale Time-Evolving Data

Link: https://arxiv.org/abs/2503.09541
Authors: Jialiang Geng, George Michailidis
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Applications (stat.AP); Computation (stat.CO); Methodology (stat.ME)
Comments:

Abstract:The paper studies the problem of detecting and locating change points in multivariate time-evolving data. The problem has a long history in statistics and signal processing and various algorithms have been developed primarily for simple parametric models. In this work, we focus on modeling the data through feed-forward neural networks and develop a detection strategy based on the following two-step procedure. In the first step, the neural network is trained over a prespecified window of the data, and its test error function is calibrated over another prespecified window. Then, the test error function is used over a moving window to identify the change point. Once a change point is detected, the procedure involving these two steps is repeated until all change points are identified. The proposed strategy yields consistent estimates for both the number and the locations of the change points under temporal dependence of the data-generating process. The effectiveness of the proposed strategy is illustrated on synthetic data sets, which provide insights on how to select the tuning parameters of the algorithm in practice, and on real data sets. Finally, we note that although the detection strategy is general and can work with different neural network architectures, the theoretical guarantees provided are specific to feed-forward neural architectures.
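
Illustrative note (not from the paper): the two-step procedure in the abstract (fit on one window, calibrate the test error on a second window, then scan a moving window and restart after each detection) can be sketched on a univariate series. To stay dependency-free, the sketch swaps the feed-forward network for a constant mean predictor and uses a mean-plus-z-standard-deviations threshold. Both are assumptions made for illustration, and all names and default values are hypothetical.

```python
from statistics import mean, pstdev


def detect_change_points(series, train_len=20, calib_len=20, window=5, z=4.0):
    """Return indices at which the moving-window test error exceeds its calibrated level."""
    change_points = []
    start = 0
    n = len(series)
    while start + train_len + calib_len + window <= n:
        train = series[start:start + train_len]
        calib = series[start + train_len:start + train_len + calib_len]
        prediction = mean(train)  # stand-in for the trained network
        calib_err = [(y - prediction) ** 2 for y in calib]
        threshold = mean(calib_err) + z * pstdev(calib_err) + 1e-12
        detected = None
        t = start + train_len + calib_len
        while t + window <= n:
            err = mean((y - prediction) ** 2 for y in series[t:t + window])
            if err > threshold:
                detected = t + window - 1  # newest point in the flagged window
                break
            t += 1
        if detected is None:
            break
        change_points.append(detected)
        start = detected  # repeat both steps on the data after the change
    return change_points
```

The window lengths and the multiplier `z` play the role of the tuning parameters that the abstract says are studied on synthetic data.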

[LG-63] Precoder Learning by Leveraging Unitary Equivariance Property

Link: https://arxiv.org/abs/2503.09398
Authors: Yilun Ge, Shuyao Liao, Shengqian Han, Chenyang Yang
Categories: Signal Processing (eess.SP); Machine Learning (cs.LG); Systems and Control (eess.SY); Group Theory (math.GR)
Comments:

Abstract:Incorporating mathematical properties of a wireless policy to be learned into the design of deep neural networks (DNNs) is effective for enhancing learning efficiency. The multi-user precoding policy in multi-antenna systems, which is the mapping from channel matrix to precoding matrix, possesses a permutation equivariance property, which has been harnessed to design the parameter sharing structure of the weight matrix of DNNs. In this paper, we study a stronger property than permutation equivariance, namely unitary equivariance, for precoder learning. We first show that a DNN with unitary equivariance designed by further introducing parameter sharing into a permutation equivariant DNN is unable to learn the optimal precoder. We proceed to develop a novel non-linear weighting process satisfying unitary equivariance and then construct a joint unitary and permutation equivariant DNN. Simulation results demonstrate that the proposed DNN not only outperforms existing learning methods in learning performance and generalizability but also reduces training complexity.

[LG-64] Terrier: A Deep Learning Repeat Classifier

Link: https://arxiv.org/abs/2503.09312
Authors: Robert Turnbull, Neil D. Young, Edoardo Tescari, Lee F. Skerratt, Tiffany A. Kosch
Categories: Genomics (q-bio.GN); Machine Learning (cs.LG)
Comments: 11 pages, 9 figures

Abstract:Repetitive DNA sequences underpin genome architecture and evolutionary processes, yet they remain challenging to classify accurately. Terrier is a deep learning model designed to overcome these challenges by classifying repetitive DNA sequences using a publicly available, curated repeat sequence library trained under the RepeatMasker schema. Existing tools often struggle to classify divergent taxa due to biases in reference libraries, limiting our understanding of repeat evolution and function. Terrier overcomes these challenges by leveraging deep learning for improved accuracy. Trained on RepBase, which includes over 100,000 repeat families – four times more than Dfam – Terrier maps 97.1% of RepBase sequences to RepeatMasker categories, offering the most comprehensive classification system available. When benchmarked against DeepTE, TERL, and TEclass2 in model organisms (rice and fruit flies), Terrier achieved superior accuracy while classifying a broader range of sequences. Further validation in non-model amphibian and flatworm genomes highlights its effectiveness in improving classification in non-model species, facilitating research on repeat-driven evolution, genomic instability, and phenotypic variation.

[LG-65] Towards Regulatory-Confirmed Adaptive Clinical Trials: Machine Learning Opportunities and Solutions AISTATS2025

Link: https://arxiv.org/abs/2503.09226
Authors: Omer Noy Klein, Alihan Hüyük, Ron Shamir, Uri Shalit, Mihaela van der Schaar
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments: AISTATS 2025

Abstract:Randomized Controlled Trials (RCTs) are the gold standard for evaluating the effect of new medical treatments. Treatments must pass stringent regulatory conditions in order to be approved for widespread use, yet even after the regulatory barriers are crossed, real-world challenges might arise: Who should get the treatment? What is its true clinical utility? Are there discrepancies in the treatment effectiveness across diverse and under-served populations? We introduce two new objectives for future clinical trials that integrate regulatory constraints and treatment policy value for both the entire population and under-served populations, thus answering some of the questions above in advance. Designed to meet these objectives, we formulate Randomize First Augment Next (RFAN), a new framework for designing Phase III clinical trials. Our framework consists of a standard randomized component followed by an adaptive one, jointly meant to efficiently and safely acquire and assign patients into treatment arms during the trial. Then, we propose strategies for implementing RFAN based on causal, deep Bayesian active learning. Finally, we empirically evaluate the performance of our framework using synthetic and real-world semi-synthetic datasets.

[LG-66] Addressing pitfalls in implicit unobserved confounding synthesis using explicit block hierarchical ancestral sampling

Link: https://arxiv.org/abs/2503.09194
Authors: Xudong Sun, Alex Markham, Pratik Misra, Carsten Marr
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST)
Comments:

Abstract:Unbiased data synthesis is crucial for evaluating causal discovery algorithms in the presence of unobserved confounding, given the scarcity of real-world datasets. A common approach, implicit parameterization, encodes unobserved confounding by modifying the off-diagonal entries of the idiosyncratic covariance matrix while preserving positive definiteness. Within this approach, state-of-the-art protocols have two distinct issues that hinder unbiased sampling from the complete space of causal models: first, the use of diagonally dominant constructions, which restrict the spectrum of partial correlation matrices; and second, the restriction of possible graphical structures when sampling bidirected edges, unnecessarily ruling out valid causal models. To address these limitations, we propose an improved explicit modeling approach for unobserved confounding, leveraging block-hierarchical ancestral generation of ground truth causal graphs. Algorithms for converting the ground truth DAG into an ancestral graph are provided so that the output of causal discovery algorithms can be compared against it. We prove that our approach fully covers the space of causal models, including those generated by the implicit parameterization, thus enabling more robust evaluation of methods for causal discovery and inference.

[LG-67] Self-Consistent Equation-guided Neural Networks for Censored Time-to-Event Data

Link: https://arxiv.org/abs/2503.09097
Authors: Sehwan Kim, Rui Wang, Wenbin Lu
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Methodology (stat.ME)
Comments:

Abstract:In survival analysis, estimating the conditional survival function given predictors is often of interest. There is a growing trend in the development of deep learning methods for analyzing censored time-to-event data, especially when dealing with high-dimensional predictors that are complexly interrelated. Many existing deep learning approaches for estimating the conditional survival functions extend the Cox regression models by replacing the linear function of predictor effects by a shallow feed-forward neural network while maintaining the proportional hazards assumption. Their implementation can be computationally intensive due to the use of the full dataset at each iteration because the use of batch data may distort the at-risk set of the partial likelihood function. To overcome these limitations, we propose a novel deep learning approach to non-parametric estimation of the conditional survival functions using the generative adversarial networks leveraging self-consistent equations. The proposed method is model-free and does not require any parametric assumptions on the structure of the conditional survival function. We establish the convergence rate of our proposed estimator of the conditional survival function. In addition, we evaluate the performance of the proposed method through simulation studies and demonstrate its application on a real-world dataset.

[LG-68] Differentiable Folding for Nearest Neighbor Model Optimization

Link: https://arxiv.org/abs/2503.09085
Authors: Ryan K. Krueger, Sharon Aviran, David H. Mathews, Jeffrey Zuber, Max Ward
Categories: Biomolecules (q-bio.BM); Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
Comments:

Abstract:The Nearest Neighbor model is the de facto thermodynamic model of RNA secondary structure formation and is a cornerstone of RNA structure prediction and sequence design. The current functional form (Turner 2004) contains approximately 13,000 underlying thermodynamic parameters, and fitting these to both experimental and structural data is computationally challenging. Here, we leverage recent advances in differentiable folding, a method for directly computing gradients of the RNA folding algorithms, to devise an efficient, scalable, and flexible means of parameter optimization that uses known RNA structures and thermodynamic experiments. Our method yields a significantly improved parameter set that outperforms existing baselines on all metrics, including an increase in the average predicted probability of ground-truth sequence-structure pairs for a single RNA family by over 23 orders of magnitude. Our framework provides a path towards drastically improved RNA models, enabling the flexible incorporation of new experimental data, definition of novel loss terms, large training sets, and even treatment as a module in larger deep learning pipelines. We make available a new database, RNAometer, with experimentally-determined stabilities for small RNA model systems.

[LG-69] Beam Selection in ISAC using Contextual Bandit with Multi-modal Transformer and Transfer Learning

Link: https://arxiv.org/abs/2503.08937
Authors: Mohammad Farzanullah, Han Zhang, Akram Bin Sediq, Ali Afana, Melike Erol-Kantarci
Categories: Signal Processing (eess.SP); Machine Learning (cs.LG)
Comments: 6 pages, 4 figures, 2 tables, IEEE International Conference on Communications 2025

Abstract:Sixth generation (6G) wireless technology is anticipated to introduce Integrated Sensing and Communication (ISAC) as a transformative paradigm. ISAC unifies wireless communication and RADAR or other forms of sensing to optimize spectral and hardware resources. This paper presents a pioneering framework that leverages ISAC sensing data to enhance beam selection processes in complex indoor environments. By integrating multi-modal transformer models with a multi-agent contextual bandit algorithm, our approach utilizes ISAC sensing data to improve communication performance and achieves high spectral efficiency (SE). Specifically, the multi-modal transformer can capture inter-modal relationships, enhancing model generalization across diverse scenarios. Experimental evaluations on the DeepSense 6G dataset demonstrate that our model outperforms traditional deep reinforcement learning (DRL) methods, achieving superior beam prediction accuracy and adaptability. In the single-user scenario, we achieve an average SE regret improvement of 49.6% as compared to DRL. Furthermore, we employ transfer reinforcement learning to reduce training time and improve model performance in multi-user environments. In the multi-user scenario, this approach enhances the average SE regret, which is a measure to demonstrate how far the learned policy is from the optimal SE policy, by 19.7% compared to training from scratch, even when the latter is trained 100 times longer.

[LG-70] Revisiting Frank-Wolfe for Structured Nonconvex Optimization

Link: https://arxiv.org/abs/2503.08921
Authors: Hoomaan Maskan, Yikun Hou, Suvrit Sra, Alp Yurtsever
Categories: Optimization and Control (math.OC); Machine Learning (cs.LG)
Comments: 16 pages, 6 figures

Abstract:We introduce a new projection-free (Frank-Wolfe) method for optimizing structured nonconvex functions that are expressed as a difference of two convex (DC) functions. This problem class subsumes smooth nonconvex minimization, positioning our method as a promising alternative to the classical Frank-Wolfe algorithm. DC decompositions are not unique; by carefully selecting a decomposition, we can better exploit the problem structure, improve computational efficiency, and adapt to the underlying problem geometry to find better local solutions. We prove that the proposed method achieves a first-order stationary point in $O(1/\epsilon^2)$ iterations, matching the complexity of the standard Frank-Wolfe algorithm for smooth nonconvex minimization in general. Specific decompositions can, for instance, yield a gradient-efficient variant that requires only $O(1/\epsilon)$ calls to the gradient oracle. Finally, we present numerical experiments demonstrating the effectiveness of the proposed method compared to the standard Frank-Wolfe algorithm.
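
Illustrative note (not from the paper): the classical Frank-Wolfe algorithm that the abstract uses as its baseline replaces projection with a linear minimization step over the feasible set. The sketch below runs it over the probability simplex, where that step reduces to picking the coordinate with the smallest gradient entry. It shows only the classical method on a convex toy problem, not the proposed DC variant, and the function name and step-size rule 2/(k+2) are standard choices assumed here.

```python
def frank_wolfe_simplex(grad, dim, n_iters=200):
    """Classical Frank-Wolfe over the probability simplex.

    grad: callable mapping a point (list of floats) to its gradient (list of floats)
    dim:  dimension of the simplex
    """
    x = [1.0 / dim] * dim  # start at the barycenter
    for k in range(n_iters):
        g = grad(x)
        # Linear minimization oracle: the best vertex is the coordinate
        # with the smallest gradient entry.
        i_min = min(range(dim), key=lambda i: g[i])
        step = 2.0 / (k + 2.0)
        # Convex combination keeps the iterate inside the simplex,
        # so no projection is ever needed.
        x = [(1.0 - step) * xi for xi in x]
        x[i_min] += step
    return x
```

For example, minimizing 0.5 * ||x - b||^2 with b inside the simplex uses the gradient `lambda v: [vi - bi for vi, bi in zip(v, b)]`, and the iterates approach b.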

[LG-71] A Deep Bayesian Nonparametric Framework for Robust Mutual Information Estimation

Link: https://arxiv.org/abs/2503.08902
Authors: Forough Fazeliasl, Michael Minyi Zhang, Bei Jiang, Linglong Kong
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Applications (stat.AP); Computation (stat.CO)
Comments:

Abstract:Mutual Information (MI) is a crucial measure for capturing dependencies between variables, but exact computation is challenging in high dimensions with intractable likelihoods, impacting accuracy and robustness. One idea is to use an auxiliary neural network to train an MI estimator; however, methods based on the empirical distribution function (EDF) can introduce sharp fluctuations in the MI loss due to poor out-of-sample performance, destabilizing convergence. We present a Bayesian nonparametric (BNP) solution for training an MI estimator by constructing the MI loss with a finite representation of the Dirichlet process posterior to incorporate regularization in the training process. With this regularization, the MI loss integrates both prior knowledge and empirical data to reduce the loss sensitivity to fluctuations and outliers in the sample data, especially in small sample settings like mini-batches. This approach addresses the challenge of balancing accuracy and low variance by effectively reducing variance, leading to stabilized and robust MI loss gradients during training and enhancing the convergence of the MI approximation while offering stronger theoretical guarantees for convergence. We explore the application of our estimator in maximizing MI between the data space and the latent space of a variational autoencoder. Experimental results demonstrate significant improvements in convergence over EDF-based methods, with applications across synthetic and real datasets, notably in 3D CT image generation, yielding enhanced structure discovery and reduced overfitting in data synthesis. While this paper focuses on generative models in application, the proposed estimator is not restricted to this setting and can be applied more broadly in various BNP learning procedures.

[LG-72] Risk-sensitive Bandits: Arm Mixture Optimality and Regret-efficient Algorithms AISTATS2025

Link: https://arxiv.org/abs/2503.08896
Authors: Meltem Tatlı, Arpan Mukherjee, Prashanth L.A., Karthikeyan Shanmugam, Ali Tajer
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments: AISTATS 2025

Abstract: This paper introduces a general framework for risk-sensitive bandits that integrates the notions of risk-sensitive objectives by adopting a rich class of distortion riskmetrics. The introduced framework subsumes the various existing risk-sensitive models. An important and hitherto unknown observation is that for a wide range of riskmetrics, the optimal bandit policy involves selecting a mixture of arms. This is in sharp contrast to the convention in multi-armed bandit algorithms that there is generally a solitary arm that maximizes the utility, whether purely reward-centric or risk-sensitive. This creates a major departure from the principles for designing bandit algorithms since there are uncountable mixture possibilities. The contributions of the paper are as follows: it (i) formalizes a general framework for risk-sensitive bandits, (ii) identifies standard risk-sensitive bandit models for which solitary arm selection is not optimal, and (iii) designs regret-efficient algorithms whose sampling strategies can accurately track optimal arm mixtures (when a mixture is optimal) or the solitary arms (when a solitary arm is optimal). The algorithms are shown to achieve a regret that scales according to O((\log T / T)^\nu), where T is the horizon and \nu > 0 is a riskmetric-specific constant.
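
The abstract does not define a distortion riskmetric. As a rough illustration only, the sketch below assumes the standard definition from the risk literature, where a distortion function h reweights the survival probabilities of the reward distribution. The function names, the sample values, and the two distortion functions are hypothetical choices for this example and are not taken from the paper.

```python
def empirical_distortion_riskmetric(samples, distortion):
    """Distortion riskmetric of the empirical distribution of `samples`.

    Assumed definition: the i-th smallest of n values gets weight
    h((n - i + 1) / n) - h((n - i) / n), where h is the distortion
    function applied to survival probabilities.
    """
    xs = sorted(samples)
    n = len(xs)
    if n == 0:
        raise ValueError("samples must be non-empty")
    total = 0.0
    for i, x in enumerate(xs, start=1):
        weight = distortion((n - i + 1) / n) - distortion((n - i) / n)
        total += weight * x
    return total


def identity(u):
    """h(u) = u recovers the ordinary mean."""
    return u


def upper_tail(alpha):
    """h(u) = min(u / alpha, 1) averages the largest alpha-fraction of values."""
    def h(u):
        return min(u / alpha, 1.0)
    return h


rewards = [1.0, 2.0, 3.0, 4.0]
mean_value = empirical_distortion_riskmetric(rewards, identity)
tail_value = empirical_distortion_riskmetric(rewards, upper_tail(0.5))
```

With the identity distortion every sample gets weight 1/n, so the riskmetric reduces to the sample mean. The tail distortion puts all of its weight on the top half of the sorted values, which is one way a risk-sensitive objective can differ from a purely reward-centric one.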

[LG-73] Learning Pareto manifolds in high dimensions: How can regularization help? AISTATS

Link: https://arxiv.org/abs/2503.08849
Authors: Tobias Wegel, Filip Kovačević, Alexandru Ţifrea, Fanny Yang
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments: Published in Proceedings of the 28th International Conference on Artificial Intelligence and Statistics (AISTATS) 2025

Abstract: Simultaneously addressing multiple objectives is becoming increasingly important in modern machine learning. At the same time, data is often high-dimensional and costly to label. For a single objective such as prediction risk, conventional regularization techniques are known to improve generalization when the data exhibits low-dimensional structure like sparsity. However, it is largely unexplored how to leverage this structure in the context of multi-objective learning (MOL) with multiple competing objectives. In this work, we discuss how the application of vanilla regularization approaches can fail, and propose a two-stage MOL framework that can successfully leverage low-dimensional structure. We demonstrate its effectiveness experimentally for multi-distribution learning and fairness-risk trade-offs.

[LG-74] Multimodal Stock Price Prediction: A Case Study of the Russian Securities Market

Link: https://arxiv.org/abs/2503.08696
Authors: Kasymkhan Khubiev, Mikhail Semenov
Categories: Statistical Finance (q-fin.ST); Machine Learning (cs.LG); Computational Finance (q-fin.CP)
Comments: NSCF-2024, PROGRAM SYSTEMS: THEORY AND APPLICATIONS

Abstract: Classical asset price forecasting methods primarily rely on numerical data, such as price time series, trading volumes, limit order book data, and technical analysis indicators. However, the news flow plays a significant role in price formation, making the development of multimodal approaches that combine textual and numerical data for improved prediction accuracy highly relevant. This paper addresses the problem of forecasting financial asset prices using a multimodal approach that combines candlestick time series and textual news flow data. A unique dataset was collected for the study, which includes time series for 176 Russian stocks traded on the Moscow Exchange and 79,555 financial news articles in Russian. For processing textual data, pre-trained models RuBERT and Vikhr-Qwen2.5-0.5b-Instruct (a large language model) were used, while time series and vectorized text data were processed using an LSTM recurrent neural network. The experiments compared models based on a single modality (time series only) and two modalities, as well as various methods for aggregating text vector representations. Prediction quality was estimated using two key metrics: Accuracy (direction of price movement prediction: up or down) and Mean Absolute Percentage Error (MAPE), which measures the deviation of the predicted price from the true price. The experiments showed that incorporating textual modality reduced the MAPE value by 55%. The resulting multimodal dataset holds value for the further adaptation of language models in the financial sector. Future research directions include optimizing textual modality parameters, such as the time window, sentiment, and chronological order of news messages.
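
The two metrics named in the abstract can be sketched as follows. This is a minimal illustration, not the authors' evaluation code: the function names and the price series are made up, and treating "up" as strictly greater than the previous price (so that an unchanged price counts as "down") is an assumption, since the abstract does not say how ties are handled.

```python
def mape(actual, predicted):
    """Mean Absolute Percentage Error, returned in percent."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    ratios = [abs((a - p) / a) for a, p in zip(actual, predicted)]
    return 100.0 * sum(ratios) / len(ratios)


def direction_accuracy(previous, actual, predicted):
    """Fraction of steps where the predicted move (up vs. not up, relative
    to the previous price) matches the actual move."""
    if not (len(previous) == len(actual) == len(predicted)) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    hits = sum(
        (a > prev) == (p > prev)
        for prev, a, p in zip(previous, actual, predicted)
    )
    return hits / len(actual)


# Hypothetical prices, for illustration only.
previous_close = [100.0, 102.0, 101.0, 105.0]
actual_close = [102.0, 101.0, 105.0, 104.0]
predicted_close = [101.0, 103.0, 104.0, 103.0]

error_pct = mape(actual_close, predicted_close)
hit_rate = direction_accuracy(previous_close, actual_close, predicted_close)
```

The two metrics measure different things: a forecast can have a low MAPE while still getting the direction wrong whenever the predicted and true prices fall on opposite sides of the previous close, as happens at the second step of this example.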

[LG-75] Leveraging neural control variates for enhanced precision in lattice field theory

Link: https://arxiv.org/abs/2312.08228
Authors: Paulo F. Bedaque, Hyunwoo Oh
Categories: High Energy Physics - Lattice (hep-lat); Machine Learning (cs.LG); Nuclear Theory (nucl-th)
Comments: 6 pages, 3 figures

Abstract: Results obtained with stochastic methods have an inherent uncertainty due to the finite number of samples that can be achieved in practice. In lattice QCD this problem is particularly salient in some observables like, for instance, observables involving one or more baryons, and it is the main problem preventing the calculation of nuclear forces from first principles. The method of control variates has been used extensively in statistics and it amounts to computing the expectation value of the difference between the observable of interest and another observable whose average is known to be zero but is correlated with the observable of interest. Recently, control variate methods have emerged as a promising solution in the context of lattice field theories. In our current study, instead of relying on an educated guess to determine the control variate, we utilize a neural network to parametrize this function. Using 1+1 dimensional scalar field theory as a testbed, we demonstrate that this neural network approach yields substantial improvements. Notably, our findings indicate that the neural network ansatz is particularly effective in the strong coupling regime.
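
The control variate idea described in the abstract can be illustrated on a toy integral. The sketch below is not the paper's neural-network method: the observable exp(x), the hand-picked control variate x - 0.5, the scaling coefficient c, and all names are assumptions chosen so that the example stays self-contained.

```python
import math
import random
import statistics


def control_variate_estimate(n_samples=10_000, seed=0):
    """Estimate E[exp(X)] for X ~ Uniform(0, 1), with and without a control variate."""
    rng = random.Random(seed)
    xs = [rng.random() for _ in range(n_samples)]

    f = [math.exp(x) for x in xs]  # observable of interest
    g = [x - 0.5 for x in xs]      # control variate: E[g] = 0 exactly, correlated with f

    # Scale g by c = Cov(f, g) / Var(g), estimated from the same samples,
    # which minimizes the sample variance of f - c * g.
    mean_f = statistics.fmean(f)
    mean_g = statistics.fmean(g)
    cov_fg = sum((a - mean_f) * (b - mean_g) for a, b in zip(f, g)) / (n_samples - 1)
    c = cov_fg / statistics.variance(g)

    corrected = [a - c * b for a, b in zip(f, g)]
    return {
        "plain_mean": mean_f,
        "cv_mean": statistics.fmean(corrected),
        "plain_var": statistics.variance(f),
        "cv_var": statistics.variance(corrected),
    }


result = control_variate_estimate()
```

Because g has zero mean, subtracting c * g leaves the expectation unchanged (up to the small bias from estimating c on the same samples) while removing the part of the fluctuation that f shares with g. The exact value here is e - 1, and the corrected samples have a much smaller variance than the plain ones. The paper replaces the hand-picked g with a function parametrized by a neural network.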

Information Retrieval

[IR-0] LLM-Driven Usefulness Labeling for IR Evaluation

Link: https://arxiv.org/abs/2503.08965
Authors: Mouly Dewan, Jiqun Liu, Chirag Shah
Categories: Information Retrieval (cs.IR)
Comments:

Abstract: In the information retrieval (IR) domain, evaluation plays a crucial role in optimizing search experiences and supporting diverse user intents. In the recent LLM era, research has been conducted to automate document relevance labels, as these labels have traditionally been assigned by crowd-sourced workers, a process that is both time-consuming and costly. This study focuses on LLM-generated usefulness labels, a crucial evaluation metric that considers the user's search intents and task objectives, an aspect where relevance falls short. Our experiment utilizes task-level, query-level, and document-level features along with user search behavior signals, which are essential in defining the usefulness of a document. Our research finds that (i) pre-trained LLMs can generate moderate usefulness labels by understanding the comprehensive search task session, and (ii) pre-trained LLMs make better judgments in short search sessions when provided with search session contexts. Additionally, we investigated whether LLMs can capture the unique divergence between relevance and usefulness, along with conducting an ablation study to identify the most critical metrics for accurate usefulness label generation. In conclusion, this work explores LLM-generated usefulness labels by evaluating critical metrics and optimizing for practicality in real-world settings.
