This post contains the latest paper list fetched from Arxiv.org on 2025-04-17. It is updated automatically and organized into five areas: NLP, CV, ML, AI, and IR. If you would like to receive the list by scheduled e-mail, leave your e-mail address in the comments.

Note: Paper data is fetched from Arxiv.org and updated automatically every day at around 12:00.

Friendly reminder: If you would like to receive the daily paper data by e-mail, please leave your e-mail address in the comments.

Overview (2025-04-17)

404 papers were updated today, including:

  • Natural Language Processing: 46 papers (Computation and Language, cs.CL)
  • Artificial Intelligence: 95 papers (Artificial Intelligence, cs.AI)
  • Computer Vision: 111 papers (Computer Vision and Pattern Recognition, cs.CV)
  • Machine Learning: 102 papers (Machine Learning, cs.LG)

Natural Language Processing

[NLP-0] BitNet b1.58 2B4T Technical Report

【Quick Read】: This paper addresses the bottleneck of computational efficiency and resource consumption in large language models (LLMs) while maintaining performance comparable to full-precision models. The key to the solution is BitNet b1.58 2B4T, an open-source, natively 1-bit quantized LLM with 2 billion parameters. Trained on a 4-trillion-token corpus, the model achieves a substantially reduced memory footprint, energy consumption, and decoding latency, while matching comparable open-weight full-precision models on benchmarks covering language understanding, mathematical reasoning, coding, and conversational ability. By adopting a 1-bit representation, the work greatly improves computational efficiency and offers an efficient, practical path for follow-up research and real-world deployment.

Link: https://arxiv.org/abs/2504.12285
Authors: Shuming Ma, Hongyu Wang, Shaohan Huang, Xingxing Zhang, Ying Hu, Ting Song, Yan Xia, Furu Wei
Affiliations: Microsoft Research
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Work in progress

Abstract:We introduce BitNet b1.58 2B4T, the first open-source, native 1-bit Large Language Model (LLM) at the 2-billion parameter scale. Trained on a corpus of 4 trillion tokens, the model has been rigorously evaluated across benchmarks covering language understanding, mathematical reasoning, coding proficiency, and conversational ability. Our results demonstrate that BitNet b1.58 2B4T achieves performance on par with leading open-weight, full-precision LLMs of similar size, while offering significant advantages in computational efficiency, including substantially reduced memory footprint, energy consumption, and decoding latency. To facilitate further research and adoption, the model weights are released via Hugging Face along with open-source inference implementations for both GPU and CPU architectures.

[NLP-1] Dysarthria Normalization via Local Lie Group Transformations for Robust ASR

【Quick Read】: This paper targets the normalization of dysarthric speech in order to improve automatic speech recognition (ASR) performance on such speech. The key contribution is a geometry-driven method that operates on spectrograms via local Lie group transformations, modeling time, frequency, and amplitude distortions as smooth, invertible deformations parameterized by scalar fields and applied via exponential maps. The core of the solution is a neural network trained to infer these scalar fields from synthetic distortions of typical speech, without using any pathological data. At test time, the model applies an approximate inverse to real dysarthric speech, yielding word error rate (WER) reductions of up to 16 percentage points with no degradation on clean speech, providing a principled and interpretable approach to robust speech recognition.

Link: https://arxiv.org/abs/2504.12279
Authors: Mikhail Osipov
Affiliations: Independent Researcher
Categories: Sound (cs.SD); Computation and Language (cs.CL); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
Comments: Preprint. 11 pages, 3 figures, 2 tables, 8 appendices. Code and data available upon request

Abstract:We present a geometry-driven method for normalizing dysarthric speech using local Lie group transformations of spectrograms. Time, frequency, and amplitude distortions are modeled as smooth, invertible deformations, parameterized by scalar fields and applied via exponential maps. A neural network is trained to infer these fields from synthetic distortions of typical speech-without using any pathological data. At test time, the model applies an approximate inverse to real dysarthric inputs. Despite zero-shot generalization, we observe substantial ASR gains, including up to 16 percentage points WER reduction on challenging TORGO samples, with no degradation on clean speech. This work introduces a principled, interpretable approach for robust speech recognition under motor speech disorders
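To make the "scalar field + exponential map" idea concrete, here is a minimal NumPy sketch, not the paper's implementation: the function names are invented, the deformation is restricted to the time axis only, and the exponential map is approximated by standard scaling-and-squaring of a displacement field.

```python
import numpy as np

def exp_map_1d(velocity, n_steps=8):
    """Approximate the exponential map of a 1-D velocity field by
    scaling-and-squaring: start from a small displacement and
    repeatedly compose it with itself."""
    disp = velocity / (2 ** n_steps)          # small initial displacement
    grid = np.arange(len(velocity), dtype=float)
    for _ in range(n_steps):
        # compose disp with itself: d <- d + d(x + d)
        disp = disp + np.interp(grid + disp, grid, disp)
    return disp

def warp_time(spec, velocity):
    """Warp the time axis of a (freq, time) spectrogram along the
    smooth deformation generated by `velocity` (one scalar per frame)."""
    disp = exp_map_1d(velocity)
    t = np.arange(spec.shape[1], dtype=float)
    src = np.clip(t + disp, 0, spec.shape[1] - 1)  # backward mapping
    return np.stack([np.interp(src, t, row) for row in spec])

# A zero velocity field generates the identity deformation.
spec = np.random.rand(4, 16)
assert np.allclose(warp_time(spec, np.zeros(16)), spec)
```

Because the deformation is built from a flow, small velocity fields give smooth, invertible warps; the paper's normalization step would correspond to applying the approximate inverse deformation to real dysarthric inputs.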

[NLP-2] Advancing Arabic Speech Recognition Through Large-Scale Weakly Supervised Learning

【Quick Read】: This paper addresses the limited performance of automatic speech recognition (ASR) models for low-resource languages such as Arabic, caused by the scarcity of large labeled datasets. The key to the solution is weakly supervised learning: a Conformer-based ASR model is trained on 15,000 hours of weakly annotated speech data that was never manually verified, eliminating the need for costly manual transcription while achieving state-of-the-art (SOTA) performance on standard benchmarks. This demonstrates the potential of weak supervision as a scalable, cost-effective alternative to traditional supervised learning for improving ASR systems in low-resource settings.

Link: https://arxiv.org/abs/2504.12254
Authors: Mahmoud Salhab, Marwan Elghitany, Shameed Sait, Syed Sibghat Ullah, Mohammad Abusheikh, Hasan Abusheikh
Affiliations: CNTXT AI
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:Automatic speech recognition (ASR) is crucial for human-machine interaction in diverse applications like conversational agents, industrial robotics, call center automation, and automated subtitling. However, developing high-performance ASR models remains challenging, particularly for low-resource languages like Arabic, due to the scarcity of large, labeled speech datasets, which are costly and labor-intensive to produce. In this work, we employ weakly supervised learning to train an Arabic ASR model using the Conformer architecture. Our model is trained from scratch on 15,000 hours of weakly annotated speech data covering both Modern Standard Arabic (MSA) and Dialectal Arabic (DA), eliminating the need for costly manual transcriptions. Despite the absence of human-verified labels, our approach attains state-of-the-art (SOTA) performance, exceeding all previous efforts in the field of Arabic ASR on the standard benchmarks. By demonstrating the effectiveness of weak supervision as a scalable, cost-efficient alternative to traditional supervised approaches, paving the way for improved ASR systems in low resource settings.

[NLP-3] Watermarking Needs Input Repetition Masking

【Quick Read】: This paper examines the potential misuse of text generated by Large Language Models (LLMs), and in particular whether detectors and watermarking remain reliable when humans or unwatermarked LLMs unintentionally imitate properties of LLM-generated text. The key finding is the existence of a "mimicry" effect: through conversational adaptation, both humans and LLMs can unintentionally exhibit features resembling the watermarking signal, undermining existing detection methods. For long-term watermarking to be reliable, the paper recommends driving the false-positive probability significantly lower and seeding watermarking mechanisms with longer word sequences.

Link: https://arxiv.org/abs/2504.12229
Authors: David Khachaturov, Robert Mullins, Ilia Shumailov, Sumanth Dathathri
Affiliations: University of Cambridge; Google DeepMind
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL); Cryptography and Security (cs.CR)
Comments:

Abstract:Recent advancements in Large Language Models (LLMs) raised concerns over potential misuse, such as for spreading misinformation. In response two counter measures emerged: machine learning-based detectors that predict if text is synthetic, and LLM watermarking, which subtly marks generated text for identification and attribution. Meanwhile, humans are known to adjust language to their conversational partners both syntactically and lexically. By implication, it is possible that humans or unwatermarked LLMs could unintentionally mimic properties of LLM generated text, making counter measures unreliable. In this work we investigate the extent to which such conversational adaptation happens. We call the concept mimicry and demonstrate that both humans and LLMs end up mimicking, including the watermarking signal even in seemingly improbable settings. This challenges current academic assumptions and suggests that for long-term watermarking to be reliable, the likelihood of false positives needs to be significantly lower, while longer word sequences should be used for seeding watermarking mechanisms.

[NLP-4] d1: Scaling Reasoning in Diffusion Large Language Models via Reinforcement Learning

【Quick Read】: This paper asks whether diffusion-based large language models (dLLMs) can match the reasoning abilities of autoregressive (AR) generation models. Although diffusion-based models have reached competitive language modeling performance, it has remained unclear whether dLLMs can also exploit recent advances in LLM reasoning.
The key to the solution is d1, a framework that adapts pretrained masked dLLMs into reasoning models by combining supervised fine-tuning (SFT) with policy-gradient based reinforcement learning (RL). The method rests on two core techniques: (a) a masked SFT technique that distills knowledge from existing datasets and directly instills self-improvement behavior, and (b) diffu-GRPO, a novel critic-free policy-gradient RL algorithm. Empirical studies across several mathematical and logical reasoning benchmarks show that d1 yields the best performance among the post-training recipes examined and significantly improves the reasoning ability of a state-of-the-art dLLM.

Link: https://arxiv.org/abs/2504.12216
Authors: Siyan Zhao, Devaansh Gupta, Qinqing Zheng, Aditya Grover
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 25 pages, project page at this https URL

Abstract:Recent large language models (LLMs) have demonstrated strong reasoning capabilities that benefits from online reinforcement learning (RL). These capabilities have primarily been demonstrated within the left-to-right autoregressive (AR) generation paradigm. In contrast, non-autoregressive paradigms based on diffusion generate text in a coarse-to-fine manner. Although recent diffusion-based large language models (dLLMs) have achieved competitive language modeling performance compared to their AR counterparts, it remains unclear if dLLMs can also leverage recent advances in LLM reasoning. To this end, we propose d1, a framework to adapt pre-trained masked dLLMs into reasoning models via a combination of supervised finetuning (SFT) and RL. Specifically, we develop and extend techniques to improve reasoning in pretrained dLLMs: (a) we utilize a masked SFT technique to distill knowledge and instill self-improvement behavior directly from existing datasets, and (b) we introduce a novel critic-free, policy-gradient based RL algorithm called diffu-GRPO. Through empirical studies, we investigate the performance of different post-training recipes on multiple mathematical and logical reasoning benchmarks. We find that d1 yields the best performance and significantly improves performance of a state-of-the-art dLLM.

[NLP-5] What Do Large Language Models Know? Tacit Knowledge as a Potential Causal-Explanatory Structure

【Quick Read】: This paper asks whether Large Language Models (LLMs) can acquire tacit knowledge, and what it means to say that LLMs "know" something. The author pushes back on the casual assumption that LLMs simply "know" facts (such as that Paris is the capital of France) and instead examines whether they can possess tacit knowledge in a well-defined sense.

The key to the solution is to analyze whether architectural features of LLMs satisfy the three core constraints on tacit knowledge: semantic description, syntactic structure, and causal systematicity. By showing that these constraints are satisfied, the author argues that tacit knowledge can serve as a conceptual framework for describing, explaining, and intervening on LLMs and their behavior.

Link: https://arxiv.org/abs/2504.12187
Authors: Céline Budding
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted for publication in Philosophy of Science

Abstract:It is sometimes assumed that Large Language Models (LLMs) know language, or for example that they know that Paris is the capital of France. But what – if anything – do LLMs actually know? In this paper, I argue that LLMs can acquire tacit knowledge as defined by Martin Davies (1990). Whereas Davies himself denies that neural networks can acquire tacit knowledge, I demonstrate that certain architectural features of LLMs satisfy the constraints of semantic description, syntactic structure, and causal systematicity. Thus, tacit knowledge may serve as a conceptual framework for describing, explaining, and intervening on LLMs and their behavior.

[NLP-6] SALAD: Improving Robustness and Generalization through Contrastive Learning with Structure-Aware and LLM-Driven Augmented Data NAACL2025

【Quick Read】: This paper addresses the spurious correlations that arise when fine-tuning pre-trained language models (PLMs) across a range of NLP tasks, which hurt performance especially on out-of-distribution data. The proposed solution, SALAD (Structure-Aware and LLM-Driven Augmented Data), generates structure-aware augmented data and counterfactual negative samples and combines them with contrastive learning, so that the model learns the structural relationships between key sentence components while reducing its reliance on spurious correlations, thereby improving robustness and generalization.

Link: https://arxiv.org/abs/2504.12185
Authors: Suyoung Bae, Hyojun Kim, YunSeok Choi, Jee-Hyong Lee
Affiliations: Sungkyunkwan University; SK Telecom
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted to NAACL 2025 main. 15 pages, 4 figures

Abstract:In various natural language processing (NLP) tasks, fine-tuning Pre-trained Language Models (PLMs) often leads to the issue of spurious correlations, which negatively impacts performance, particularly when dealing with out-of-distribution data. To address this problem, we propose SALAD(Structure Aware and LLM-driven Augmented Data), a novel approach designed to enhance model robustness and generalization by generating structure-aware and counterfactually augmented data for contrastive learning. Our method leverages a tagging-based approach to generate structure-aware positive samples and utilizes large language models (LLMs) to generate counterfactual negative samples with diverse sentence patterns. By applying contrastive learning, SALAD enables the model to focus on learning the structural relationships between key sentence components while minimizing reliance on spurious correlations. We validate our approach through experiments on three tasks: Sentiment Classification, Sexism Detection, and Natural Language Inference. The results demonstrate that SALAD not only improves model robustness and performance across different environments but also enhances generalization to out-of-distribution datasets and cross-domain scenarios.
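The contrastive objective at the heart of approaches like SALAD can be illustrated with a generic InfoNCE-style loss. This is the standard textbook formulation rather than code from the paper, and all names here are invented for the sketch:

```python
import numpy as np

def info_nce(anchor, positive, negatives, tau=0.1):
    """Contrastive loss pulling an anchor sentence embedding toward its
    structure-aware positive and away from counterfactual negatives.
    All vectors are assumed to be L2-normalised."""
    pos = np.dot(anchor, positive) / tau
    negs = np.array([np.dot(anchor, n) for n in negatives]) / tau
    logits = np.concatenate([[pos], negs])
    logits -= logits.max()                      # numerical stability
    return float(-np.log(np.exp(logits[0]) / np.exp(logits).sum()))

# Loss is near zero when the positive matches the anchor ...
easy = info_nce(np.array([1.0, 0.0]), np.array([1.0, 0.0]),
                [np.array([0.0, 1.0])])
# ... and large when a negative matches instead.
hard = info_nce(np.array([1.0, 0.0]), np.array([0.0, 1.0]),
                [np.array([1.0, 0.0])])
assert easy < hard
```

Minimizing this loss over batches of (anchor, structure-aware positive, counterfactual negative) triples is what pushes the encoder to attend to structural relationships rather than surface cues.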

[NLP-7] Trusting CHATGPT: how minor tweaks in the prompts lead to major differences in sentiment classification

【Quick Read】: This paper studies the robustness and reliability of large language models (LLMs) on classification tasks, asking whether subtle changes to prompts significantly alter the results of sentiment polarity analysis. The key contribution is an experimental test of whether structural variations in prompts (lexical, syntactic, or modal changes) lead to inconsistent classifications, revealing the fragility of LLMs under prompt variation. Using a dataset of 100,000 Spanish-language comments on four Latin American presidents, the comments were classified repeatedly, with the prompt slightly modified each time, and exploratory and confirmatory analyses identified significant discrepancies between classifications. The results show that even minor prompt adjustments can destabilize the classifications, with the model sometimes mixing categories, providing unsolicited explanations, or answering in languages other than Spanish. Statistical analysis confirmed significant differences between most prompt pairs, except where linguistic structures were highly similar, and a lack of structured grammar in prompts was found to increase the frequency of hallucinations. The paper therefore argues that trust in LLMs rests not only on technical performance but also on the social and institutional relationships underpinning their use.

Link: https://arxiv.org/abs/2504.12180
Authors: Jaime E. Cuellar, Oscar Moreno-Martinez, Paula Sofia Torres-Rodriguez, Jaime Andres Pavlich-Mariscal, Andres Felipe Mican-Castiblanco, Juan Guillermo Torres-Hurtado
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: in Spanish

Abstract:One fundamental question for the social sciences today is: how much can we trust highly complex predictive models like ChatGPT? This study tests the hypothesis that subtle changes in the structure of prompts do not produce significant variations in the classification results of sentiment polarity analysis generated by the Large Language Model GPT-4o mini. Using a dataset of 100.000 comments in Spanish on four Latin American presidents, the model classified the comments as positive, negative, or neutral on 10 occasions, varying the prompts slightly each time. The experimental methodology included exploratory and confirmatory analyses to identify significant discrepancies among classifications. The results reveal that even minor modifications to prompts such as lexical, syntactic, or modal changes, or even their lack of structure impact the classifications. In certain cases, the model produced inconsistent responses, such as mixing categories, providing unsolicited explanations, or using languages other than Spanish. Statistical analysis using Chi-square tests confirmed significant differences in most comparisons between prompts, except in one case where linguistic structures were highly similar. These findings challenge the robustness and trust of Large Language Models for classification tasks, highlighting their vulnerability to variations in instructions. Moreover, it was evident that the lack of structured grammar in prompts increases the frequency of hallucinations. The discussion underscores that trust in Large Language Models is based not only on technical performance but also on the social and institutional relationships underpinning their use.

[NLP-8] Mapping Controversies Using Artificial Intelligence: An Analysis of the Hamas-Israel Conflict on YouTube

【Quick Read】: This paper studies the evolution of public opinion during the Hamas-Israel conflict by analyzing 253,925 Spanish-language YouTube comments posted between October 2023 and January 2024. It adopts an interdisciplinary approach, combining controversy analysis from Science and Technology Studies (STS) with advanced computational methods, specifically Natural Language Processing (NLP) based on the BERT (Bidirectional Encoder Representations from Transformers) model. The key to the solution is using BERT to automatically classify comments into seven categories, including pro-Palestinian, pro-Israeli, anti-Palestinian, and anti-Israeli positions. The paper also applies agenda-setting theory to show how media coverage significantly shapes public perception, observing a shift in public opinion from a pro-Palestinian stance toward a position more critical of Israel. The core contribution is a methodological innovation: integrating computational analysis with critical social theory to analyze complex public opinion phenomena and media narratives.

Link: https://arxiv.org/abs/2504.12177
Authors: Victor Manuel Hernandez Lopez, Jaime E. Cuellar
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: in Spanish

Abstract:This article analyzes the Hamas-Israel controversy through 253,925 Spanish-language YouTube comments posted between October 2023 and January 2024, following the October 7 attack that escalated the conflict. Adopting an interdisciplinary approach, the study combines the analysis of controversies from Science and Technology Studies (STS) with advanced computational methodologies, specifically Natural Language Processing (NLP) using the BERT (Bidirectional Encoder Representations from Transformers) model. Using this approach, the comments were automatically classified into seven categories, reflecting pro-Palestinian, pro-Israeli, anti- Palestinian, anti-Israeli positions, among others. The results show a predominance of pro- Palestinian comments, although pro-Israeli and anti-Palestinian comments received more “likes.” This study also applies the agenda-setting theory to demonstrate how media coverage significantly influences public perception, observing a notable shift in public opinion, transitioning from a pro- Palestinian stance to a more critical position towards Israel. This work highlights the importance of combining social science perspectives with technological tools in the analysis of controversies, presenting a methodological innovation by integrating computational analysis with critical social theories to address complex public opinion phenomena and media narratives.

[NLP-9] Poem Meter Classification of Recited Arabic Poetry: Integrating High-Resource Systems for a Low-Resource Task

【Quick Read】: This paper addresses the automatic identification of the meter of recited Arabic poetry. The task is challenging because labeled data is scarce and identification requires specialized, technical knowledge. The key to the solution is a state-of-the-art framework that integrates two high-resource systems to perform this low-resource task, enabling knowledge transfer across domains and improving model generalization. To ensure the generality of the proposed architecture, the authors also publish a benchmark for this task to support future research.

Link: https://arxiv.org/abs/2504.12172
Authors: Maged S. Al-Shaibani, Zaid Alyafeai, Irfan Ahmad
Affiliations: King Fahd University of Petroleum and Minerals (KFUPM)
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Arabic poetry is an essential and integral part of Arabic language and culture. It has been used by the Arabs to spot lights on their major events such as depicting brutal battles and conflicts. They also used it, as in many other languages, for various purposes such as romance, pride, lamentation, etc. Arabic poetry has received major attention from linguistics over the decades. One of the main characteristics of Arabic poetry is its special rhythmic structure as opposed to prose. This structure is referred to as a meter. Meters, along with other poetic characteristics, are intensively studied in an Arabic linguistic field called “Aroud”. Identifying these meters for a verse is a lengthy and complicated process. It also requires technical knowledge in Aruod. For recited poetry, it adds an extra layer of processing. Developing systems for automatic identification of poem meters for recited poems need large amounts of labelled data. In this study, we propose a state-of-the-art framework to identify the poem meters of recited Arabic poetry, where we integrate two separate high-resource systems to perform the low-resource task. To ensure generalization of our proposed architecture, we publish a benchmark for this task for future research.

[NLP-10] Multilingual Contextualization of Large Language Models for Document-Level Machine Translation

【Quick Read】: This paper tackles the challenges large language models (LLMs) face in document-level translation, in particular modeling long-range dependencies and discourse phenomena across sentences and paragraphs. The key to the solution is targeted fine-tuning on DocBlocks, a curated high-quality document-level dataset, to improve LLM-based long-document translation. The approach supports multiple translation paradigms, including direct document-to-document and chunk-level translation, by integrating instructions both with and without surrounding context, allowing the model to better capture cross-sentence dependencies while preserving strong sentence-level translation performance. Experiments show that combining multiple translation paradigms improves document-level translation quality and inference speed over prompting and agent-based methods.

Link: https://arxiv.org/abs/2504.12140
Authors: Miguel Moura Ramos, Patrick Fernandes, Sweta Agrawal, André F. T. Martins
Affiliations: Instituto Superior Técnico, Universidade de Lisboa (ELLIS Unit Lisbon); Instituto de Telecomunicações; Carnegie Mellon University; Unbabel
Categories: Computation and Language (cs.CL)
Comments: 9 pages, work in progress

Abstract:Large language models (LLMs) have demonstrated strong performance in sentence-level machine translation, but scaling to document-level translation remains challenging, particularly in modeling long-range dependencies and discourse phenomena across sentences and paragraphs. In this work, we propose a method to improve LLM-based long-document translation through targeted fine-tuning on high-quality document-level data, which we curate and introduce as DocBlocks. Our approach supports multiple translation paradigms, including direct document-to-document and chunk-level translation, by integrating instructions both with and without surrounding context. This enables models to better capture cross-sentence dependencies while maintaining strong sentence-level translation performance. Experimental results show that incorporating multiple translation paradigms improves document-level translation quality and inference speed compared to prompting and agent-based methods.

[NLP-11] Efficient Contrastive Decoding with Probabilistic Hallucination Detection - Mitigating Hallucinations in Large Vision Language Models -

【Quick Read】: This paper addresses the tendency of Large Vision Language Models (LVLMs) to generate hallucinated responses that are inconsistent with the visual input. The key to the solution is Efficient Contrastive Decoding (ECD), which uses probabilistic hallucination detection to shift the output distribution toward contextually accurate answers at inference time. By contrasting token probabilities with hallucination scores, ECD subtracts hallucinated concepts from the original distribution, effectively suppressing hallucinations. The method can be applied to any open-source LVLM and requires no additional model training.

Link: https://arxiv.org/abs/2504.12137
Authors: Laura Fieback, Nishilkumar Balar, Jakob Spiegelberg, Hanno Gottschalk
Affiliations: Volkswagen AG; Technical University Berlin; University of Siegen
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Abstract:Despite recent advances in Large Vision Language Models (LVLMs), these models still suffer from generating hallucinatory responses that do not align with the visual input provided. To mitigate such hallucinations, we introduce Efficient Contrastive Decoding (ECD), a simple method that leverages probabilistic hallucination detection to shift the output distribution towards contextually accurate answers at inference time. By contrasting token probabilities and hallucination scores, ECD subtracts hallucinated concepts from the original distribution, effectively suppressing hallucinations. Notably, our proposed method can be applied to any open-source LVLM and does not require additional LVLM training. We evaluate our method on several benchmark datasets and across different LVLMs. Our experiments show that ECD effectively mitigates hallucinations, outperforming state-of-the-art methods with respect to performance on LVLM benchmarks and computation time.
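A toy sketch of the kind of contrastive adjustment ECD describes: subtract hallucination evidence from the next-token distribution and renormalize. The exact scoring and weighting in the paper may differ; `alpha` and all function names here are assumptions for illustration.

```python
import numpy as np

def contrastive_decode(token_logprobs, hallucination_scores, alpha=1.0):
    """Shift a next-token distribution away from likely hallucinations:
    subtract a weighted hallucination score (in [0, 1], one per token)
    from each token's log-probability, then renormalise."""
    adjusted = token_logprobs - alpha * hallucination_scores
    adjusted -= adjusted.max()                 # numerical stability
    probs = np.exp(adjusted)
    return probs / probs.sum()

logp = np.log(np.array([0.5, 0.3, 0.2]))
hall = np.array([0.9, 0.0, 0.0])   # token 0 flagged as hallucinated
p = contrastive_decode(logp, hall, alpha=2.0)
assert p[0] < 0.5 and np.isclose(p.sum(), 1.0)
```

The flagged token's probability mass drops while the distribution stays valid, which is the inference-time behaviour the summary describes, with no retraining involved.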

[NLP-12] Entropy-Guided Watermarking for LLMs: A Test-Time Framework for Robust and Traceable Text Generation

【Quick Read】: This paper addresses the traceability of text generated by large language models (LLMs) and its potential misuse; existing watermarking schemes typically trade off text quality against robust detection under various attacks. The key to the solution is a new watermarking scheme that introduces a cumulative watermark entropy threshold, improving both detectability and the quality of the generated text, while remaining compatible with and generalizing existing sampling functions for better adaptability. Experiments across multiple LLMs show that the method significantly outperforms existing approaches, achieving improvements of over 80% on widely used datasets such as MATH and GSM8K while maintaining high detection accuracy.

Link: https://arxiv.org/abs/2504.12108
Authors: Shizhan Cai, Liang Ding, Dacheng Tao
Affiliations: Nanyang Technological University; University of Sydney
Categories: Computation and Language (cs.CL)
Comments:

Abstract:The rapid development of Large Language Models (LLMs) has intensified concerns about content traceability and potential misuse. Existing watermarking schemes for sampled text often face trade-offs between maintaining text quality and ensuring robust detection against various attacks. To address these issues, we propose a novel watermarking scheme that improves both detectability and text quality by introducing a cumulative watermark entropy threshold. Our approach is compatible with and generalizes existing sampling functions, enhancing adaptability. Experimental results across multiple LLMs show that our scheme significantly outperforms existing methods, achieving over 80% improvements on widely-used datasets, e.g., MATH and GSM8K, while maintaining high detection accuracy.
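One plausible reading of a "cumulative watermark entropy threshold" can be sketched as follows. This is an illustrative guess at the mechanism rather than the paper's algorithm: accumulate the entropy of each step's next-token distribution and only embed a watermark bias once the running total crosses a threshold, so near-deterministic spans (where biasing would visibly degrade text) are left untouched.

```python
import numpy as np

def entropy(p):
    """Shannon entropy (nats) of a discrete distribution."""
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

def watermark_positions(distributions, threshold=1.0):
    """Decide at which generation steps to embed a watermark bias:
    fire only when the cumulative entropy budget crosses `threshold`,
    then reset the budget."""
    acc, positions = 0.0, []
    for i, dist in enumerate(distributions):
        acc += entropy(np.asarray(dist, dtype=float))
        if acc >= threshold:
            positions.append(i)   # bias the sampling at this step
            acc = 0.0             # reset the cumulative budget
    return positions

# A deterministic step contributes zero entropy and is never marked.
dists = [[1.0, 0.0], [0.5, 0.5], [0.5, 0.5]]
assert watermark_positions(dists, threshold=1.0) == [2]
```

Under this reading, text quality is preserved because low-entropy positions keep their natural argmax token, while detection still works because high-entropy positions carry the watermark.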

[NLP-13] Gauging Overprecision in LLMs: An Empirical Study

【Quick Read】: This paper studies overprecision in large language models (LLMs) on numerical tasks, a new lens on model overconfidence. Prior approaches prompt black-box models to output their own confidence (verbalized confidence), but such confidences are prone to bias and hallucination. Drawing on the concept of overprecision from cognitive science, the paper proposes a new framework for studying overprecise behavior in LLMs.

The key to the solution is a framework with three phases: generation, refinement, and evaluation. In the generation phase, the confidence level is specified explicitly in the prompt and the model is asked to answer numerical questions with intervals, rather than producing its own confidence, avoiding the biases of earlier approaches; the same prompt is issued multiple times to analyze the effect of randomness in generation. The refinement phase refines the answers from the previous phase, and the evaluation phase systematically analyzes the model's behavior and inner workings. The framework reveals several properties of LLMs on numerical tasks: they are highly uncalibrated, interval length is uncorrelated with the imposed confidence level, and numerical precision varies with the task, the scale of the answer, and the prompting technique. The study offers new perspectives on LLM overconfidence and serves as a strong baseline for future work on overprecision.

Link: https://arxiv.org/abs/2504.12098
Authors: Adil Bahaj, Hamed Rahimi, Mohamed Chetouani, Mounir Ghogho
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: 16 pages

Abstract:Recently, overconfidence in large language models (LLMs) has garnered considerable attention due to its fundamental importance in quantifying the trustworthiness of LLM generation. However, existing approaches prompt the black box LLMs to produce their confidence (verbalized confidence), which can be subject to many biases and hallucinations. Inspired by a different aspect of overconfidence in cognitive science called overprecision, we designed a framework for its study in black box LLMs. This framework contains three main phases: 1) generation, 2) refinement and 3) evaluation. In the generation phase we prompt the LLM to generate answers to numerical questions in the form of intervals with a certain level of confidence. This confidence level is imposed in the prompt and not required for the LLM to generate as in previous approaches. We use various prompting techniques and use the same prompt multiple times to gauge the effects of randomness in the generation process. In the refinement phase, answers from the previous phase are refined to generate better answers. The LLM answers are evaluated and studied in the evaluation phase to understand its internal workings. This study allowed us to gain various insights into LLM overprecision: 1) LLMs are highly uncalibrated for numerical tasks; 2) there is no correlation between the length of the interval and the imposed confidence level, which can be symptomatic of a) a lack of understanding of the concept of confidence or b) an inability to adjust self-confidence by following instructions; 3) LLM numerical precision differs depending on the task, scale of answer and prompting technique; 4) refinement of answers doesn’t improve precision in most cases. We believe this study offers new perspectives on LLM overconfidence and serves as a strong baseline for overprecision in LLMs.
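The calibration check implied by the evaluation phase can be illustrated with a simple coverage computation, a standard notion rather than the paper's exact metric: for interval answers requested at, say, 90% confidence, a calibrated model's intervals should contain the true value about 90% of the time, and overprecision shows up as a much lower hit rate.

```python
def coverage(intervals, truths):
    """Fraction of true values falling inside the model's stated
    (lo, hi) intervals."""
    hits = sum(lo <= t <= hi for (lo, hi), t in zip(intervals, truths))
    return hits / len(truths)

# Three interval answers requested at 90% confidence vs. ground truth:
ivs = [(10, 20), (5, 8), (100, 110)]
truth = [15, 9, 105]
assert abs(coverage(ivs, truth) - 2 / 3) < 1e-9   # 2 of 3 covered
```

Comparing this empirical coverage against the confidence level imposed in the prompt is one direct way to expose the miscalibration the summary reports.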

[NLP-14] Selective Demonstration Retrieval for Improved Implicit Hate Speech Detection

【Quick Read】: This paper tackles a key challenge in detecting implicit hate speech, where harmful intent is expressed in subtle or indirect ways that are difficult to identify consistently. Implicit expressions depend on context, cultural nuance, and hidden biases, and are influenced by external knowledge and demographic biases, leading to inconsistent detection results across language models. Moreover, Large Language Models show heightened sensitivity to toxic language and mentions of vulnerable groups, producing both false positives (harmless statements misclassified as hateful) and false negatives (genuinely harmful content missed). To address these issues, the paper proposes a novel method whose key idea is in-context learning without model fine-tuning: it adaptively retrieves demonstrations focused on similar target groups or with the highest similarity scores, enhancing the model's contextual comprehension. Experiments show the method outperforms the current state of the art.

Link: https://arxiv.org/abs/2504.12082
Authors: Yumin Kim, Hwanhee Lee
Affiliations: Chung-Ang University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Hate speech detection is a crucial area of research in natural language processing, essential for ensuring online community safety. However, detecting implicit hate speech, where harmful intent is conveyed in subtle or indirect ways, remains a major challenge. Unlike explicit hate speech, implicit expressions often depend on context, cultural subtleties, and hidden biases, making them more challenging to identify consistently. Additionally, the interpretation of such speech is influenced by external knowledge and demographic biases, resulting in varied detection results across different language models. Furthermore, Large Language Models often show heightened sensitivity to toxic language and references to vulnerable groups, which can lead to misclassifications. This over-sensitivity results in false positives (incorrectly identifying harmless statements as hateful) and false negatives (failing to detect genuinely harmful content). Addressing these issues requires methods that not only improve detection precision but also reduce model biases and enhance robustness. To address these challenges, we propose a novel method, which utilizes in-context learning without requiring model fine-tuning. By adaptively retrieving demonstrations that focus on similar groups or those with the highest similarity scores, our approach enhances contextual comprehension. Experimental results show that our method outperforms current state-of-the-art techniques. Implementation details and code are available at TBD.
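The similarity-based demonstration retrieval can be sketched generically; the embedding model, similarity measure, and selection criteria used in the paper may differ, and the names below are invented for the sketch:

```python
import numpy as np

def retrieve_demonstrations(query_vec, demo_vecs, k=2):
    """Pick the k demonstrations whose embeddings are most similar
    (cosine) to the query, to be prepended as in-context examples."""
    q = query_vec / np.linalg.norm(query_vec)
    d = demo_vecs / np.linalg.norm(demo_vecs, axis=1, keepdims=True)
    sims = d @ q                      # cosine similarity per demo
    return np.argsort(-sims)[:k].tolist()

demos = np.array([[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]])
assert retrieve_demonstrations(np.array([1.0, 0.0]), demos, k=2) == [0, 2]
```

The retrieved examples are then placed in the prompt before the input to classify, which is how the approach adapts to each input without any fine-tuning.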

[NLP-15] Bayesian dynamic borrowing considering semantic similarity between outcomes for disproportionality analysis in FAERS

【Quick Read】: This paper addresses limitations of traditional disproportionality analysis (DPA) for identifying adverse events (AEs) in spontaneous reporting systems (SRSs), in particular the rigid fixed hierarchical grouping used by current methods. The proposed approach, termed IC SSM, is based on Bayesian dynamic borrowing (BDB): it embeds a robust meta-analytic predictive (MAP) prior within a Bayesian hierarchical model and incorporates semantic similarity measures (SSMs) to enable weighted information sharing from MedDRA Preferred Terms (PTs) that are clinically similar to the target PT. This continuous similarity-based borrowing overcomes the limitations of existing methods, markedly improving sensitivity while providing more stable and relevant results in the early post-marketing period than traditional IC analysis and borrowing at the MedDRA high-level group term (HLGT) level.

Link: https://arxiv.org/abs/2504.12052
Authors: François Haguinet, Jeffery L Painter, Gregory E Powell, Andrea Callegaro, Andrew Bate
Affiliations: GlaxoSmithKline (GSK)
Categories: Computation and Language (cs.CL)
Comments: 30 pages, 7 figures, 5 supplementary figures

Abstract:We present a Bayesian dynamic borrowing (BDB) approach to enhance the quantitative identification of adverse events (AEs) in spontaneous reporting systems (SRSs). The method embeds a robust meta-analytic predictive (MAP) prior within a Bayesian hierarchical model and incorporates semantic similarity measures (SSMs) to enable weighted information sharing from MedDRA Preferred Terms (PTs) that are clinical similar to the target PT. This continuous similarity-based borrowing addresses limitation of rigid hierarchical grouping in current disproportionality analysis (DPA). Using data from the FDA Adverse Event Reporting System (FAERS) between 2015 and 2019, we evalute this approach - termed IC SSM - against standard Information Component (IC) analysis and IC with borrowing at the MedDRA high-level group term (HLGT) level. A novel references set (PVLens), derived from FDA product label updates, enabled prospective evaluation of method performance in identifying AEs prior to official labeling. The IC SSM approach demonstrated improved sensitivity compared to both traditional IC and HLGT-based borrowing, with minor trade-offs in F1 scores and Youden’s index. IC SSM consistently identified more true positives and detected signals over 5 months sooner than traditional IC. Despite a marginally lower aggregate Youden’s index, IC SSM showed higher performance in the early post-marketing period, providing more stable and relevant estimates than HLGT-based borrowing and traditional IC. These findings support the use of SSM-informed Bayesian borrowing as a scalable and context-aware enhancement to traditional DPA methods. Future research should validate this approach across other datasets and explore additional similarity metrics and Bayesian inference strategies using case-level data. 
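As a deliberately simplified stand-in for the similarity-weighted borrowing (the actual method is a full Bayesian hierarchical model with a robust MAP prior, not this toy), one can picture pooling evidence from neighbouring terms, each down-weighted by its semantic similarity to the target term:

```python
import numpy as np

def borrowed_counts(target_count, neighbor_counts, similarities):
    """Pool evidence from clinically similar terms into the target
    term, each neighbour down-weighted by its semantic similarity
    (in [0, 1]) to the target."""
    sims = np.asarray(similarities, dtype=float)
    pooled = target_count + float((sims * np.asarray(neighbor_counts)).sum())
    effective_n = 1.0 + float(sims.sum())   # "terms' worth" of evidence used
    return pooled, effective_n

pooled, n = borrowed_counts(10, [4, 8], [0.5, 0.25])
assert pooled == 14.0 and n == 1.75
```

The point of the continuous weights is visible even here: a highly similar neighbour contributes most of its evidence, a dissimilar one almost none, with no fixed hierarchy deciding who may borrow from whom.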
zh
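上文的 IC SSM 建立在不成比例分析中经典的信息成分(Information Component, IC)统计量之上。以下给出 IC 点估计的一个极简 Python 示意(仅为经典定义,论文中的贝叶斯借用与 SSM 加权不在其中;示例报告数为虚构):

```python
import math

def information_component(n_observed: int, n_drug: int, n_event: int, n_total: int) -> float:
    """经典 IC 点估计:观测报告数与独立假设下期望报告数之比的 log2,
    分子分母各加 0.5 的收缩项以稳定稀有事件(IC025 置信下界此处省略)。"""
    expected = n_drug * n_event / n_total  # 独立假设下的期望报告数
    return math.log2((n_observed + 0.5) / (expected + 0.5))

# 某药物-不良事件组合的报告数高于期望时 IC > 0,提示潜在信号
ic = information_component(n_observed=40, n_drug=1000, n_event=2000, n_total=100000)
```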

[NLP-16] Language Models as Quasi-Crystalline Thought: Structure Constraint and Emergence in Generative Systems

【速读】: 该论文试图解决的问题是如何重新定义大型语言模型(Large Language Models, LLMs)的评价与设计标准。传统上,LLMs主要通过预测准确性、事实性或对齐程度进行评估,而论文提出将LLMs类比为准晶体(quasicrystals),即一种具有全局一致性但无周期重复且由局部约束生成的系统。关键在于转变视角,从关注单个标记(token)的准确性转向强调约束传播和形式一致性,从而揭示LLMs最显著的行为是生成内部共振的语言模式。这种结构性视角不仅重新定义了语言生成的秩序,还开辟了新的研究路径,使LLMs被视为产生准结构化语言的生成器,强调输出的约束传播和形式连贯性,而非固定意义。

链接: https://arxiv.org/abs/2504.11986
作者: Jose Manuel Guevara-Vela
机构: School of Engineering and Physical Sciences, Heriot-Watt University (赫瑞瓦特大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:This essay proposes an analogy between large language models (LLMs) and quasicrystals: systems that exhibit global coherence without periodic repetition and that are generated through local constraints. While LLMs are often evaluated in terms of predictive accuracy, factuality, or alignment, this structural perspective suggests that their most characteristic behavior is the production of internally resonant linguistic patterns. Just as quasicrystals forced a redefinition of order in physical systems, viewing LLMs as generators of quasi-structured language opens new paths for evaluation and design: privileging propagation of constraint over token-level accuracy, and coherence of form over fixed meaning. LLM outputs should be read not only for what they say, but for the patterns of constraint and coherence that organize them. This shift reframes generative language as a space of emergent patterning: LLMs are neither fully random nor strictly rule-based, but defined by a logic of constraint, resonance, and structural depth.
zh

[NLP-17] SemEval-2025 Task 3: Mu-SHROOM the Multilingual Shared Task on Hallucinations and Related Observable Overgeneration Mistakes SEMEVAL-2025

【速读】: 本文档介绍了Mu-SHROOM共享任务,其核心目标是检测经过指令微调的大规模语言模型(Large Language Models, LLMs)输出中的幻觉现象(Hallucinations)以及其他过生成错误。论文聚焦于14种语言的通用LLMs,并将幻觉检测问题形式化为一个片段标注任务(Span-Labeling Task)。为解决这一问题,共有来自43支团队的2,618份提交,体现了研究社区对此领域的浓厚兴趣。论文展示了各参赛系统的性能结果,并通过实证分析确定了影响任务表现的关键因素。其中,解决方案的关键在于有效应对跨语言幻觉程度的差异以及标注幻觉片段时的高标注者分歧问题。

链接: https://arxiv.org/abs/2504.11975
作者: Raúl Vázquez,Timothee Mickus,Elaine Zosa,Teemu Vahtola,Jörg Tiedemann,Aman Sinha,Vincent Segonne,Fernando Sánchez-Vega,Alessandro Raganato,Jindřich Libovický,Jussi Karlgren,Shaoxiong Ji,Jindřich Helcl,Liane Guillou,Ona de Gibert,Jaione Bengoetxea,Joseph Attieh,Marianna Apidianaki
机构: University of Helsinki (赫尔辛基大学); SiLO (SiLO); Université de Lorraine & ICANS Strasbourg (洛林大学 & 斯特拉斯堡 ICANS); Université Bretagne Sud (南布列塔尼大学); CIMAT A. C. (CIMAT A. C.); University of Milano-Bicocca (米兰比可卡大学); TU Darmstadt (达姆施塔特工业大学); Aveni (Aveni); Charles University (查理大学); HiTZ Basque Center for Language Technology - Ixa (巴斯克语言技术中心 - Ixa); University of Pennsylvania (宾夕法尼亚大学)
类目: Computation and Language (cs.CL)
备注: Mu-SHROOM is part of SemEval-2025 (Task 3). TBP: Proceedings of the 19th International Workshop on Semantic Evaluation (SemEval-2025)

点击查看摘要

Abstract:We present the Mu-SHROOM shared task which is focused on detecting hallucinations and other overgeneration mistakes in the output of instruction-tuned large language models (LLMs). Mu-SHROOM addresses general-purpose LLMs in 14 languages, and frames the hallucination detection problem as a span-labeling task. We received 2,618 submissions from 43 participating teams employing diverse methodologies. The large number of submissions underscores the interest of the community in hallucination detection. We present the results of the participating systems and conduct an empirical analysis to identify key factors contributing to strong performance in this task. We also emphasize relevant current challenges, notably the varying degree of hallucinations across languages and the high annotator disagreement when labeling hallucination spans.
zh
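片段标注式的幻觉检测通常以字符级重叠来比较预测片段与金标片段。以下是一个极简示意(IoU 式度量、示例文本与片段偏移均为假设,并非 Mu-SHROOM 的官方评测脚本):

```python
def char_label_vector(text: str, spans):
    """将片段级标注展开为字符级 0/1 向量(1 表示该字符属于幻觉片段)。"""
    labels = [0] * len(text)
    for start, end in spans:
        for i in range(start, end):
            labels[i] = 1
    return labels

def span_iou(text, gold_spans, pred_spans):
    """字符级交并比:衡量预测幻觉片段与金标片段的重合程度。"""
    g = char_label_vector(text, gold_spans)
    p = char_label_vector(text, pred_spans)
    inter = sum(1 for a, b in zip(g, p) if a and b)
    union = sum(1 for a, b in zip(g, p) if a or b)
    return inter / union if union else 1.0

text = "巴黎是德国的首都。"  # 金标幻觉片段为"德国",预测多标了一个字符
iou = span_iou(text, gold_spans=[(3, 5)], pred_spans=[(3, 6)])
```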

[NLP-18] LLM-as-a-Judge: Reassessing the Performance of LLMs in Extractive QA

【速读】: 该论文试图解决传统Exact Match (EM) 和 F1-score 指标在评估阅读理解问答(QA)模型性能时未能充分反映模型真实表现的问题。论文的关键解决方案是引入大型语言模型(LLMs)作为裁判(LLM-as-a-judge),通过利用不同家族的LLMs和多种答案类型重新评估四个阅读理解QA数据集上的模型性能。研究结果表明,LLM-as-a-judge与人类判断高度相关,能够显著提高相关性至0.85(从EM的0.17和F1-score的0.36提升),从而证明传统指标低估了QA模型的实际性能。尽管对于较难的答案类型(如职业),LLM-as-a-judge并非完美,但它依然优于EM/F1,并且在相同模型同时用于QA和判断任务时未发现偏见问题。

链接: https://arxiv.org/abs/2504.11972
作者: Xanh Ho,Jiahao Huang,Florian Boudin,Akiko Aizawa
机构: National Institute of Informatics (国立信息学研究所), Japan; The University of Tokyo (东京大学), Japan; JFLI, CNRS, Nantes Université (CNRS, 南特大学联合实验室), France; National Institute of Informatics (国立信息学研究所), Japan
类目: Computation and Language (cs.CL)
备注: 17 pages; code and data are available at this https URL

点击查看摘要

Abstract:Extractive reading comprehension question answering (QA) datasets are typically evaluated using Exact Match (EM) and F1-score, but these metrics often fail to fully capture model performance. With the success of large language models (LLMs), they have been employed in various tasks, including serving as judges (LLM-as-a-judge). In this paper, we reassess the performance of QA models using LLM-as-a-judge across four reading comprehension QA datasets. We examine different families of LLMs and various answer types to evaluate the effectiveness of LLM-as-a-judge in these tasks. Our results show that LLM-as-a-judge is highly correlated with human judgments and can replace traditional EM/F1 metrics. By using LLM-as-a-judge, the correlation with human judgments improves significantly, from 0.17 (EM) and 0.36 (F1-score) to 0.85. These findings confirm that EM and F1 metrics underestimate the true performance of the QA models. While LLM-as-a-judge is not perfect for more difficult answer types (e.g., job), it still outperforms EM/F1, and we observe no bias issues, such as self-preference, when the same model is used for both the QA and judgment tasks.
zh
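作为参照,下面给出传统 EM 与 token 级 F1 的一个常见实现示意(未包含 SQuAD 官方脚本中的冠词与标点归一化等细节):语义等价但表面形式不同的答案会被这两个指标判为全错,这正是论文指出其低估模型性能的原因。

```python
def exact_match(pred: str, gold: str) -> int:
    """EM:忽略首尾空白与大小写后的完全匹配。"""
    return int(pred.strip().lower() == gold.strip().lower())

def token_f1(pred: str, gold: str) -> float:
    """token 级 F1:基于预测与金标答案的词袋重叠。"""
    p, g = pred.lower().split(), gold.lower().split()
    common, g_left = 0, list(g)
    for t in p:
        if t in g_left:
            common += 1
            g_left.remove(t)
    if common == 0:
        return 0.0
    precision, recall = common / len(p), common / len(g)
    return 2 * precision * recall / (precision + recall)

# 语义等价的答案:EM 与 F1 均为 0,而 LLM-as-a-judge 可判为正确
em = exact_match("the United States", "USA")
f1 = token_f1("the United States", "USA")
```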

[NLP-19] Robust and Fine-Grained Detection of AI Generated Texts ACL2025

【速读】: 该论文试图解决机器生成内容检测系统在处理较短文本时准确性不足的问题,并特别关注人类与大型语言模型(LLMs)协同创作(human-LLM co-authored)的文本检测挑战。随着越来越多高级LLMs的涌现,理想的检测系统需要在任何生成器上表现良好。论文的关键解决方案在于开发了一组用于标记分类任务的模型,这些模型经过大规模人类-机器协同创作文本数据集的训练。此外,论文引入了一个包含超过240万条文本的新数据集,覆盖23种语言,主要由多种流行的专有LLMs与人类共同创作而成。研究还评估了模型在未见过领域的文本、不同生成器、非母语作者文本以及对抗性输入条件下的性能,并分析了生成文本与原始人工创作文本之间的特性差异及对抗方法的性能对比。

链接: https://arxiv.org/abs/2504.11952
作者: Ram Mohan Rao Kadiyala,Siddartha Pullakhandam,Kanwal Mehreen,Drishti Sharma,Siddhant Gupta,Jebish Purbey,Ashay Srivastava,Subhasya TippaReddy,Arvind Reddy Bobbili,Suraj Telugara Chandrashekhar,Modabbir Adeeb,Srinadh Vura,Hamza Farooq
机构: Traversaal.ai; Vantager; Cohere for AI Community; University of Maryland, College Park; IIT Roorkee; University of South Florida; University of Houston; IISc Bangalore; Stanford University; University of California, Los Angeles; M2ai.in
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: ACL 2025 Feb ARR Submission

点击查看摘要

Abstract:An ideal detection system for machine generated content is supposed to work well on any generator as many more advanced LLMs come into existence day by day. Existing systems often struggle with accurately identifying AI-generated content over shorter texts. Further, not all texts might be entirely authored by a human or LLM, hence we focused more on partial cases, i.e., human-LLM co-authored texts. Our paper introduces a set of models built for the task of token classification which are trained on an extensive collection of human-machine co-authored texts, which performed well over texts of unseen domains, unseen generators, texts by non-native speakers and those with adversarial inputs. We also introduce a new dataset of over 2.4M such texts mostly co-authored by several popular proprietary LLMs over 23 languages. We also present findings of our models’ performance over texts of each domain and generator. Additional findings include comparison of performance against each adversarial method, length of input texts and characteristics of generated texts compared to the original human authored texts.
zh
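token 分类式检测的输出是逐 token 的作者标签;把标签合并为连续片段即可在人机合著文本中定位机器生成的部分。以下为后处理这一步的极简示意(标签与示例文本均为虚构,并非论文模型的输出):

```python
def segments_from_labels(tokens, labels):
    """把逐 token 的 人类(0)/机器(1) 标签合并为连续片段。"""
    segs = []
    for tok, lab in zip(tokens, labels):
        if segs and segs[-1][0] == lab:
            segs[-1][1].append(tok)  # 与上一片段标签相同,继续累积
        else:
            segs.append([lab, [tok]])  # 标签切换,开启新片段
    return [(lab, " ".join(toks)) for lab, toks in segs]

tokens = ["I", "wrote", "this", "but", "the", "model", "added", "this", "part"]
labels = [0, 0, 0, 0, 1, 1, 1, 1, 1]
segs = segments_from_labels(tokens, labels)
```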

[NLP-20] ADAT: Time-Series-Aware Adaptive Transformer Architecture for Sign Language Translation

【速读】: 该论文旨在解决现有手语机器翻译系统在识别高帧率手势间的细粒度短程时间依赖性方面缺乏准确性的问题,同时降低系统的计算复杂度以提高训练效率。为了解决这些问题,论文提出了一种自适应Transformer(ADAT),其关键在于通过引入增强特征提取组件以及基于门控机制的自适应特征加权方法,突出与上下文相关的特征,从而在保持翻译准确性的同时减少训练开销。此外,论文还发布了MedASL数据集用于评估ADAT的性能。

链接: https://arxiv.org/abs/2504.11942
作者: Nada Shahin,Leila Ismail
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Current sign language machine translation systems rely on recognizing hand movements, facial expressions and body postures, and natural language processing, to convert signs into text. Recent approaches use Transformer architectures to model long-range dependencies via positional encoding. However, they lack accuracy in recognizing fine-grained, short-range temporal dependencies between gestures captured at high frame rates. Moreover, their high computational complexity leads to inefficient training. To mitigate these issues, we propose an Adaptive Transformer (ADAT), which incorporates components for enhanced feature extraction and adaptive feature weighting through a gating mechanism to emphasize contextually relevant features while reducing training overhead and maintaining translation accuracy. To evaluate ADAT, we introduce MedASL, the first public medical American Sign Language dataset. In sign-to-gloss-to-text experiments, ADAT outperforms the encoder-decoder transformer, improving BLEU-4 accuracy by 0.1% while reducing training time by 14.33% on PHOENIX14T and 3.24% on MedASL. In sign-to-text experiments, it improves accuracy by 8.7% and reduces training time by 2.8% on PHOENIX14T and achieves 4.7% higher accuracy and 7.17% faster training on MedASL. Compared to encoder-only and decoder-only baselines in sign-to-text, ADAT is at least 6.8% more accurate despite being up to 12.1% slower due to its dual-stream structure.
zh
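摘要中提到的门控自适应特征加权,常见数学形式是用 sigmoid 门逐维缩放特征。以下为一个 numpy 示意(权重随机初始化,仅演示门控的计算形式,与 ADAT 的具体实现无关):

```python
import numpy as np

def gated_features(x: np.ndarray, w: np.ndarray, b: np.ndarray) -> np.ndarray:
    """sigmoid 门逐维缩放输入特征:门值接近 1 的维度被视为
    上下文相关而得到保留,接近 0 的维度被抑制。"""
    gate = 1.0 / (1.0 + np.exp(-(x @ w + b)))  # (d,) 门控向量,取值在 (0, 1)
    return gate * x

rng = np.random.default_rng(0)
x = rng.standard_normal(8)
w = rng.standard_normal((8, 8))
b = np.zeros(8)
y = gated_features(x, w, b)
```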

[NLP-21] An LLM-as-a-judge Approach for Scalable Gender-Neutral Translation Evaluation

【速读】: 该论文试图解决自动评估性别中性翻译(Gender-Neutral Translation, GNT)时面临的挑战,当前方法主要依赖于单一语言分类器,这些方法的局限在于未能结合源句信息,并且需要专用数据和微调以扩展到新语言。为了解决这些问题,论文探索了利用大型语言模型(Large Language Models, LLMs)作为GNT评估器的可能性。解决方案的关键在于提出两种提示方法:一种仅生成句子级评估,另一种类似于链式思维方法,在进行句子级判断前先生成详细的短语级注释。通过在多种语言下对五种模型(包括开源与专有模型)进行广泛实验,研究发现LLMs能够有效作为GNT评估器,且预先进行短语级注释再进行句子级评估的方法显著提高了所有模型的准确性,提供了一种更优且更具可扩展性的替代方案。

链接: https://arxiv.org/abs/2504.11934
作者: Andrea Piergentili,Beatrice Savoldi,Matteo Negri,Luisa Bentivogli
机构: 未知
类目: Computation and Language (cs.CL)
备注: Accepted at GITT 2025

点击查看摘要

Abstract:Gender-neutral translation (GNT) aims to avoid expressing the gender of human referents when the source text lacks explicit cues about the gender of those referents. Evaluating GNT automatically is particularly challenging, with current solutions being limited to monolingual classifiers. Such solutions are not ideal because they do not factor in the source sentence and require dedicated data and fine-tuning to scale to new languages. In this work, we address such limitations by investigating the use of large language models (LLMs) as evaluators of GNT. Specifically, we explore two prompting approaches: one in which LLMs generate sentence-level assessments only, and another, akin to a chain-of-thought approach, where they first produce detailed phrase-level annotations before a sentence-level judgment. Through extensive experiments on multiple languages with five models, both open and proprietary, we show that LLMs can serve as evaluators of GNT. Moreover, we find that prompting for phrase-level annotations before sentence-level assessments consistently improves the accuracy of all models, providing a better and more scalable alternative to current solutions.
zh

[NLP-22] Finding Flawed Fictions: Evaluating Complex Reasoning in Language Models via Plot Hole Detection

【速读】: 该论文试图解决如何有效评估大型语言模型(Large Language Models, LLMs)在深层次叙事一致性和语言理解能力方面的问题。现有基准测试主要关注表层文本理解,无法充分反映LLMs在复杂叙事推理中的表现。为解决这一问题,论文提出通过检测故事中的情节漏洞(plot holes)作为评估LLMs叙事一致性和推理能力的代理任务。解决方案的关键在于引入了一种名为FlawedFictionsMaker的新算法,该算法能够可控且谨慎地在人工撰写的故事中合成情节漏洞,从而构建了一个高质量的基准数据集FlawedFictions。此数据集经过人类过滤以确保鲁棒性与高精度,并用于评估LLMs在不同长度故事中识别情节漏洞的能力。研究发现,最先进的LLMs在解决FlawedFictions任务时表现不佳,且性能随着故事长度增加而显著下降。此外,论文还揭示了基于LLM的故事摘要和生成容易引入更多情节漏洞的现象。

链接: https://arxiv.org/abs/2504.11900
作者: Kabir Ahuja,Melanie Sclar,Yulia Tsvetkov
机构: Paul G. Allen Center for Computer Science & Engineering (保罗·G·艾伦计算机科学与工程中心); University of Washington (华盛顿大学)
类目: Computation and Language (cs.CL)
备注: Preprint

点击查看摘要

Abstract:Stories are a fundamental aspect of human experience. Engaging deeply with stories and spotting plot holes – inconsistencies in a storyline that break the internal logic or rules of a story’s world – requires nuanced reasoning skills, including tracking entities and events and their interplay, abstract thinking, pragmatic narrative understanding, commonsense and social reasoning, and theory of mind. As Large Language Models (LLMs) increasingly generate, interpret, and modify text, rigorously assessing their narrative consistency and deeper language understanding becomes critical. However, existing benchmarks focus mainly on surface-level comprehension. In this work, we propose plot hole detection in stories as a proxy to evaluate language understanding and reasoning in LLMs. We introduce FlawedFictionsMaker, a novel algorithm to controllably and carefully synthesize plot holes in human-written stories. Using this algorithm, we construct a benchmark to evaluate LLMs’ plot hole detection abilities in stories – FlawedFictions –, which is robust to contamination, with human filtering ensuring high quality. We find that state-of-the-art LLMs struggle in accurately solving FlawedFictions regardless of the reasoning effort allowed, with performance significantly degrading as story length increases. Finally, we show that LLM-based story summarization and story generation are prone to introducing plot holes, with more than 50% and 100% increases in plot hole detection rates with respect to human-written originals.
zh

[NLP-23] Rethinking LLM -Based Recommendations: A Query Generation-Based Training-Free Approach

【速读】: 该论文旨在解决现有基于大型语言模型(Large Language Model, LLM)的推荐方法面临的四大挑战:处理大规模候选池效率低下、对提示词中项目顺序敏感(“迷失在中间”现象)、扩展性差以及因随机负采样导致的不切实际的评估。为了解决这些问题,论文提出了一种Query-to-Recommendation方法,利用LLMs生成个性化查询,从完整的候选池中检索相关项目,从而消除了候选预选的需求。该方法的关键在于通过LLMs的世界知识提升推荐性能和多样性,并且即使对于较不流行的项目组也能表现良好。实验结果表明,该方法在三个数据集上实现了高达57%的改进,平均提升了31%,展示了强大的零样本性能,并且与现有模型集成后进一步提升了性能。

链接: https://arxiv.org/abs/2504.11889
作者: Donghee Han,Hwanjun Song,Mun Yong Yi
机构: KAIST (韩国科学技术院)
类目: Information Retrieval (cs.IR); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Existing large language model (LLM)-based recommendation methods face several challenges, including inefficiency in handling large candidate pools, sensitivity to item order within prompts (“lost in the middle” phenomenon), poor scalability, and unrealistic evaluation due to random negative sampling. To address these issues, we propose a Query-to-Recommendation approach that leverages LLMs to generate personalized queries for retrieving relevant items from the entire candidate pool, eliminating the need for candidate pre-selection. This method can be integrated into an ID-based recommendation system without additional training, enhances recommendation performance and diversity through LLMs’ world knowledge, and performs well even for less popular item groups. Experiments on three datasets show up to 57 percent improvement, with an average gain of 31 percent, demonstrating strong zero-shot performance and further gains when ensembled with existing models.
zh
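Query-to-Recommendation 的核心是「生成个性化查询 → 在完整候选池上直接检索」。下面用词重叠相似度给出一个极简示意(真实系统中查询由 LLM 生成、打分通常采用稠密向量检索;示例物品与查询均为虚构):

```python
def score(query: str, item: str) -> float:
    """词袋 Jaccard 相似度,作为稠密检索打分的玩具替代。"""
    q, d = set(query.lower().split()), set(item.lower().split())
    return len(q & d) / len(q | d) if q | d else 0.0

def retrieve(query: str, pool, k: int = 2):
    """从完整候选池中按与查询的相似度取 top-k,无需候选预选。"""
    return sorted(pool, key=lambda it: score(query, it), reverse=True)[:k]

pool = ["cozy mystery novel", "sci-fi space opera",
        "slow cooker cookbook", "noir detective novel"]
top = retrieve("detective mystery novel", pool)
```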

[NLP-24] Evaluating the Goal-Directedness of Large Language Models

【速读】: 该论文试图解决的问题是:LLMs(大型语言模型)在多大程度上利用其能力来达成特定目标,即衡量它们的目标导向性 (Goal-Directedness)。论文通过设计任务评估模型在信息收集、认知努力和计划执行中的表现,并通过子任务推断模型的相关能力,以此量化目标导向性。

解决方案的关键在于:将目标导向性作为独立于任务性能的指标进行评估,并通过引入动机提示 (Motivational Prompts) 来观察其影响。研究发现,尽管不同任务中的目标导向性相对一致,但其与任务性能差异显著,且仅对动机提示表现出适度敏感。这表明大多数模型尚未达到完全的目标导向性。论文希望通过这种评估方法促进对LLMs进展的更好监测,并为设计具有更强自主属性的模型提供指导。

链接: https://arxiv.org/abs/2504.11844
作者: Tom Everitt,Cristina Garbacea,Alexis Bellot,Jonathan Richens,Henry Papadatos,Siméon Campos,Rohin Shah
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:To what extent do LLMs use their capabilities towards their given goal? We take this as a measure of their goal-directedness. We evaluate goal-directedness on tasks that require information gathering, cognitive effort, and plan execution, where we use subtasks to infer each model’s relevant capabilities. Our evaluations of LLMs from Google DeepMind, OpenAI, and Anthropic show that goal-directedness is relatively consistent across tasks, differs from task performance, and is only moderately sensitive to motivational prompts. Notably, most models are not fully goal-directed. We hope our goal-directedness evaluations will enable better monitoring of LLM progress, and enable more deliberate design choices of agentic properties in LLMs.
zh

[NLP-25] FiSMiness: A Finite State Machine Based Paradigm for Emotional Support Conversations

【速读】: 该论文旨在解决情感支持对话(Emotional Support Conversation, ESC)领域中,现有大型语言模型(Large Language Models, LLMs)在提供长期满意度方面存在的不足。大多数研究可能未从状态模型的角度定义对话流程图,导致解决方案次优。为了解决这一问题,论文提出了一种名为FiSMiness的新框架,其关键是利用有限状态机(Finite State Machine, FSM)增强LLMs的能力,使单一模型能够在每次对话轮次中自规划、自推理求助者的情绪、支持策略及最终回复,从而实现更优的长期满意度。实验结果表明,FiSMiness在多个基准测试中优于直接推理、自优化、思维链、微调以及外部辅助方法等。

链接: https://arxiv.org/abs/2504.11837
作者: Yue Zhao,Qingqing Gu,Xiaoyu Wang,Teng Chen,Zhonglin Jiang,Yong Chen,Luo Ji
机构: Geely AI Lab (吉利人工智能实验室); Beijing Institute of Technology (北京理工大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: accepted by CMCL

点击查看摘要

Abstract:Emotional support conversation (ESC) aims to alleviate the emotional distress of individuals through effective conversations. Although large language models (LLMs) have obtained remarkable progress on ESC, most of these studies might not define the diagram from the state model perspective, therefore providing a suboptimal solution for long-term satisfaction. To address such an issue, we leverage the Finite State Machine (FSM) on LLMs, and propose a framework called FiSMiness. Our framework allows a single LLM to bootstrap the planning during ESC, and self-reason the seeker’s emotion, support strategy and the final response upon each conversational turn. Substantial experiments on ESC datasets suggest that FiSMiness outperforms many baselines, including direct inference, self-refine, chain of thought, finetuning, and external-assisted methods, even those with many more parameters.
zh
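FiSMiness 的核心思想是用有限状态机约束对话流程。以下用几行 Python 示意状态转移的形式(状态名与事件均为本示例虚构,并非论文定义的状态集;真实系统中求助者情绪的识别与支持策略的选择由 LLM 自推理完成):

```python
# 状态转移表:(当前状态, 事件) -> 下一状态
TRANSITIONS = {
    ("exploration", "distress_identified"): "comforting",
    ("comforting", "emotion_eased"): "action",
    ("action", "new_distress"): "exploration",
}

def step(state: str, event: str) -> str:
    """按转移表推进对话状态;未定义的事件保持原状态不变。"""
    return TRANSITIONS.get((state, event), state)

state = "exploration"
state = step(state, "distress_identified")  # 识别到情绪困扰,转入安抚
state = step(state, "emotion_eased")        # 情绪缓解,转入行动建议
```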

[NLP-26] Could Thinking Multilingually Empower LLM Reasoning?

【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在推理任务中普遍存在的“英语偏见”(English bias)问题,即这些模型在处理英语任务时通常表现更优,而忽视了其他语言可能带来的优势。论文的关键在于探索利用多语言能力(Multilingualism)进行推理任务所能达到的性能上限,并揭示其显著(高出近10个Acc@k点)且稳健(对翻译质量和语言选择变化具有容忍度)的优势。研究发现,常见的答案选择方法无法达到这一性能上限,因为它们存在局限性和偏差。因此,论文的核心贡献在于分析性能上限背后的原因、实现该上限的挑战,以及指出需要改进现有方法以充分挖掘LLMs中多语言推理的潜力。

链接: https://arxiv.org/abs/2504.11833
作者: Changjiang Gao,Xu Huang,Wenhao Zhu,Shujian Huang,Lei Li,Fei Yuan
机构: National Key Laboratory for Novel Software Technology, Nanjing University (南京大学国家重点软件技术实验室); Shanghai Artificial Intelligence Laboratory (上海人工智能实验室); Carnegie Mellon University (卡内基梅隆大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Previous work indicates that large language models exhibit a significant “English bias”, i.e. they often perform better when tasks are presented in English. Interestingly, we have observed that using certain other languages in reasoning tasks can yield better performance than English. However, this phenomenon remains under-explored. In this paper, we explore the upper bound of harnessing multilingualism in reasoning tasks, suggesting that multilingual reasoning promises significantly (by nearly 10 Acc@k points) and robustly (tolerance for variations in translation quality and language choice) higher upper bounds than English-only reasoning. Besides analyzing the reason behind the upper bound and challenges in reaching it, we also find that common answer selection methods cannot achieve this upper bound, due to their limitations and biases. These insights could pave the way for future research aimed at fully harnessing the potential of multilingual reasoning in LLMs.
zh
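摘要中的 Acc@k 指每题采样 k 个候选答案、任一正确即记对的通过率,多语言推理的性能上界正是以此度量。一个极简实现示意(示例判分数据为虚构):

```python
def acc_at_k(samples_per_question, k: int) -> float:
    """Acc@k:每题取前 k 个采样,只要有一个正确即算该题通过。
    samples_per_question 为布尔列表的列表(True 表示该采样正确)。"""
    hits = sum(1 for samples in samples_per_question if any(samples[:k]))
    return hits / len(samples_per_question)

results = [
    [False, True, False],   # 第 2 个采样才答对:Acc@1 记错,Acc@3 记对
    [False, False, False],  # 全部答错
    [True, True, True],     # 全部答对
]
a1, a3 = acc_at_k(results, 1), acc_at_k(results, 3)
```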

[NLP-27] Déjà Vu: Multilingual LLM Evaluation through the Lens of Machine Translation Evaluation

【速读】: 该论文试图解决多语言大型语言模型(Multilingual Large Language Models, mLLMs)生成能力评估缺乏全面性、科学严谨性和一致性的问题,这些问题阻碍了其有效指导模型开发的潜力。论文的关键解决方案是借鉴机器翻译(Machine Translation, MT)领域的成熟经验,通过在生成式评估管道的关键阶段开展针对性实验,展示如何采用MT评估的最佳实践来深入理解不同模型之间的质量差异,并进一步识别出确保评估方法本身严格可靠的元评估(Meta-evaluation)的核心要素,最终提炼出一套可操作的建议清单以促进mLLM的研究与开发。

链接: https://arxiv.org/abs/2504.11829
作者: Julia Kreutzer,Eleftheria Briakou,Sweta Agrawal,Marzieh Fadaee,Kocmi Tom
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Generation capabilities and language coverage of multilingual large language models (mLLMs) are advancing rapidly. However, evaluation practices for generative abilities of mLLMs are still lacking comprehensiveness, scientific rigor, and consistent adoption across research labs, which undermines their potential to meaningfully guide mLLM development. We draw parallels with machine translation (MT) evaluation, a field that faced similar challenges and has, over decades, developed transparent reporting standards and reliable evaluations for multilingual generative models. Through targeted experiments across key stages of the generative evaluation pipeline, we demonstrate how best practices from MT evaluation can deepen the understanding of quality differences between models. Additionally, we identify essential components for robust meta-evaluation of mLLMs, ensuring the evaluation methods themselves are rigorously assessed. We distill these insights into a checklist of actionable recommendations for mLLM research and development.
zh

[NLP-28] ARWI: Arabic Write and Improve

【速读】: 该论文旨在解决高级阿拉伯语写作辅助工具匮乏的问题,尽管阿拉伯语被超过4亿人使用。解决方案的关键在于提出ARWI,这是一种新的写作助手,它首次公开提供了针对不同熟练程度的提示数据库、阿拉伯文本编辑器、最先进的语法错误检测与纠正功能,以及与《欧洲语言共同参考框架》标准对齐的自动化作文评分系统。此外,ARWI还能用于构建不断增长的自动注释语料库,从而促进阿拉伯语法修正和作文评分的研究,以及分析母语者和非母语学习者的错误模式。初步用户研究表明,ARWI能够提供可操作的反馈,帮助学习者识别语法缺口、评估语言熟练度并指导改进。

链接: https://arxiv.org/abs/2504.11814
作者: Kirill Chirkunov,Bashar Alhafni,Chatrine Qwaider,Nizar Habash,Ted Briscoe
机构: MBZUAI (MBZUAI); New York University Abu Dhabi (纽约大学阿联酋分校)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Although Arabic is spoken by over 400 million people, advanced Arabic writing assistance tools remain limited. To address this gap, we present ARWI, a new writing assistant that helps learners improve essay writing in Modern Standard Arabic. ARWI is the first publicly available Arabic writing assistant to include a prompt database for different proficiency levels, an Arabic text editor, state-of-the-art grammatical error detection and correction, and automated essay scoring aligned with the Common European Framework of Reference standards for language attainment. Moreover, ARWI can be used to gather a growing auto-annotated corpus, facilitating further research on Arabic grammar correction and essay scoring, as well as profiling patterns of errors made by native speakers and non-native learners. A preliminary user study shows that ARWI provides actionable feedback, helping learners identify grammatical gaps, assess language proficiency, and guide improvement.
zh

[NLP-29] Efficient and Adaptive Simultaneous Speech Translation with Fully Unidirectional Architecture

【速读】: 本文旨在解决同时语音翻译(Simultaneous Speech Translation, SimulST)中大语言模型(Large Language Models, LLMs)应用所面临的挑战。尽管LLMs在离线翻译任务中表现出色,但将其应用于SimulST时存在显著的计算开销或固定读写策略的问题,导致效率和性能受限。为了解决这些问题,论文提出了一种高效自适应同时语音翻译框架(Efficient and Adaptive Simultaneous Speech Translation, EASiST),其关键在于采用完全单向架构,包括单向语音编码器和LLM,并引入多延迟数据整理策略以生成语义对齐的训练样本,重新定义SimulST为带有显式读/写标记的交错生成任务。此外,通过轻量级策略头动态预测读/写动作实现自适应推理,并采用多阶段训练策略优化模态对齐、翻译及策略行为。实验结果表明,EASiST在MuST-C En → De和En → Es数据集上实现了更优的时延-质量权衡。

链接: https://arxiv.org/abs/2504.11809
作者: Biao Fu,Donglei Yu,Minpeng Liao,Chengxi Li,Yidong Chen,Kai Fan,Xiaodong Shi
机构: School of Informatics, Xiamen University (厦门大学信息学院); Alibaba Group Tongyi Lab (阿里云通义实验室); Key Laboratory of Digital Protection and Intelligent Processing of Intangible Cultural Heritage of Fujian and Taiwan (Xiamen University), Ministry of Culture and Tourism (福建省与台湾文化遗产数字化保护与智能处理重点实验室(厦门大学),文化和旅游部)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Simultaneous speech translation (SimulST) produces translations incrementally while processing partial speech input. Although large language models (LLMs) have showcased strong capabilities in offline translation tasks, applying them to SimulST poses notable challenges. Existing LLM-based SimulST approaches either incur significant computational overhead due to repeated encoding of bidirectional speech encoder, or they depend on a fixed read/write policy, limiting the efficiency and performance. In this work, we introduce Efficient and Adaptive Simultaneous Speech Translation (EASiST) with fully unidirectional architecture, including both speech encoder and LLM. EASiST includes a multi-latency data curation strategy to generate semantically aligned SimulST training samples and redefines SimulST as an interleaved generation task with explicit read/write tokens. To facilitate adaptive inference, we incorporate a lightweight policy head that dynamically predicts read/write actions. Additionally, we employ a multi-stage training strategy to align speech-text modalities and optimize both translation and policy behavior. Experiments on the MuST-C En→De and En→Es datasets demonstrate that EASiST offers superior latency-quality trade-offs compared to several strong baselines.
zh
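「带显式读/写标记的交错生成」可以用最简单的固定规则(wait-k)来说明:先读 k 个源片段,此后每写一个目标词再读一个源片段。以下示意与 EASiST 的可学习策略头不同,仅演示读/写动作序列的形式(参数为虚构):

```python
def wait_k_actions(n_source: int, n_target: int, k: int = 3):
    """生成 wait-k 策略下的 READ/WRITE 动作序列。"""
    actions, read, written = [], 0, 0
    while written < n_target:
        if read < min(n_source, k + written):
            actions.append("READ")   # 还有源片段可读且未达到写出条件
            read += 1
        else:
            actions.append("WRITE")  # 写出一个目标词
            written += 1
    return actions

acts = wait_k_actions(n_source=5, n_target=4, k=2)
```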

[NLP-30] Selective Attention Federated Learning: Improving Privacy and Efficiency for Clinical Text Classification

【速读】: 该论文旨在解决联邦学习(Federated Learning, FL)在训练大型语言模型(Large Language Models, LLMs)时面临的通信开销大和模型隐私保护不足的问题,特别是在医疗健康领域的应用。论文的关键解决方案是引入了一种名为选择性注意力联邦学习(Selective Attention Federated Learning, SAFL)的新方法。SAFL通过利用注意力模式动态微调被识别为关键的Transformer层,从而显著减少了通信带宽需求,并增强了差分隐私的鲁棒性。

链接: https://arxiv.org/abs/2504.11793
作者: Yue Li,Lihong Zhang
机构: Harvard University (哈佛大学); Purdue University (普渡大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Federated Learning (FL) faces major challenges regarding communication overhead and model privacy when training large language models (LLMs), especially in healthcare applications. To address these, we introduce Selective Attention Federated Learning (SAFL), a novel approach that dynamically fine-tunes only those transformer layers identified as attention-critical. By employing attention patterns to determine layer importance, SAFL significantly reduces communication bandwidth and enhances differential privacy resilience. Evaluations on clinical NLP benchmarks (i2b2 Clinical Concept Extraction and MIMIC-III discharge summaries) demonstrate that SAFL achieves competitive performance with centralized models while substantially improving communication efficiency and privacy preservation.
zh
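「按注意力模式挑选关键层、只微调并通信这些层」的流程可粗略示意如下(以平均注意力熵打分、熵低视为关键仅是本示例的假设,SAFL 的具体判据以论文为准;示例注意力分布为虚构):

```python
import math

def attention_entropy(attn_row):
    """单个注意力分布的熵;熵越低意味着注意力越集中。"""
    return -sum(p * math.log(p) for p in attn_row if p > 0)

def select_critical_layers(layer_attn, k: int):
    """按平均注意力熵从低到高选出 k 个「注意力关键」层的索引。"""
    scores = {i: sum(attention_entropy(r) for r in rows) / len(rows)
              for i, rows in enumerate(layer_attn)}
    return sorted(sorted(scores, key=scores.get)[:k])

layer_attn = [
    [[0.25, 0.25, 0.25, 0.25]],  # 层 0:均匀注意力,熵最高
    [[0.97, 0.01, 0.01, 0.01]],  # 层 1:高度集中,熵最低
    [[0.70, 0.10, 0.10, 0.10]],  # 层 2:居中
]
critical = select_critical_layers(layer_attn, k=2)  # 只微调这些层
```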

[NLP-31] Enhancing Web Agents with Explicit Rollback Mechanisms

【速读】: 该论文旨在解决复杂且动态的网络环境中现有网络代理在规划和搜索能力上的不足,特别是传统贪婪单向搜索策略容易陷入错误状态且难以恢复的问题。论文的关键解决方案是引入显式的回滚机制(explicit rollback mechanism),使代理能够在导航轨迹中回退到先前的状态,从而赋予模型直接控制搜索过程的能力,实现更高效且有效的网络导航方法。

链接: https://arxiv.org/abs/2504.11788
作者: Zhisong Zhang,Tianqing Fang,Kaixin Ma,Wenhao Yu,Hongming Zhang,Haitao Mi,Dong Yu
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:With recent advancements in large language models, web agents have been greatly improved. However, dealing with complex and dynamic web environments requires more advanced planning and search abilities. Previous studies usually adopt a greedy one-way search strategy, which may struggle to recover from erroneous states. In this work, we enhance web agents with an explicit rollback mechanism, enabling the agent to revert back to a previous state in its navigation trajectory. This mechanism gives the model the flexibility to directly control the search process, leading to an effective and efficient web navigation method. We conduct experiments on two live web navigation benchmarks with zero-shot and fine-tuning settings. The results demonstrate the effectiveness of our proposed approach.
zh
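显式回滚机制的最小形式是把导航轨迹维护成可回退的历史栈。以下为一个极简示意(状态名为虚构;论文中的回滚动作由模型在真实网页环境中触发):

```python
class Trajectory:
    """带显式回滚的导航轨迹:代理可回退到历史状态,
    避免贪婪单向搜索陷入错误状态后无法恢复。"""
    def __init__(self, start_state: str):
        self.history = [start_state]

    def act(self, new_state: str):
        self.history.append(new_state)

    def rollback(self, steps: int = 1) -> str:
        # 回退 steps 步,至少保留初始状态
        del self.history[max(1, len(self.history) - steps):]
        return self.history[-1]

traj = Trajectory("home_page")
traj.act("search_results")
traj.act("wrong_product_page")
current = traj.rollback()  # 发现进入错误页面,回退一步
```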

[NLP-32] Unsupervised Classification of English Words Based on Phonological Information: Discovery of Germanic and Latinate Clusters

【速读】: 该论文试图解决的问题是:尽管基于词源的跨语言语音规则归纳在认知模型中面临可学习性挑战(因为词的历史起源对普通语言学习者而言通常是不可访问的信息),本研究旨在探讨是否可以从单个词汇的音系信息(phonotactics)出发,学习英语中日耳曼语源与拉丁语源词汇的区别。解决方案的关键在于采用无监督聚类方法对从语料库中提取的单词进行分析,结果显示发现的单词聚类与词源学上的区分高度一致,并且这些聚类还重现了先前文献中关于相应词源类别记录的语言学概括,同时揭示了一些之前未被认识的准词源学聚类特征,为未来实验研究提供了新假设。

链接: https://arxiv.org/abs/2504.11770
作者: Takashi Morita,Timothy J. O’Donnell
机构: MIT (麻省理工学院); McGill University (麦吉尔大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Cross-linguistically, native words and loanwords follow different phonological rules. In English, for example, words of Germanic and Latinate origin exhibit different stress patterns, and a certain syntactic structure is exclusive to Germanic verbs. When seeing them as a cognitive model, however, such etymology-based generalizations face challenges in terms of learnability, since the historical origins of words are presumably inaccessible information for general language learners. In this study, we present computational evidence indicating that the Germanic-Latinate distinction in the English lexicon is learnable from the phonotactic information of individual words. Specifically, we performed an unsupervised clustering on corpus-extracted words, and the resulting word clusters largely aligned with the etymological distinction. The model-discovered clusters also recovered various linguistic generalizations documented in the previous literature regarding the corresponding etymological classes. Moreover, our findings also uncovered previously unrecognized features of the quasi-etymological clusters, offering novel hypotheses for future experimental studies.
zh
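从单词的音系序列中无监督发现词源聚类,第一步通常是把单词转成 n-gram 特征向量再计算相似度。以下只示意特征化与相似度这一步(用拼写 bigram 近似音系信息是本示例的简化假设,并非论文的特征方案):

```python
from collections import Counter
import math

def bigram_profile(word: str) -> Counter:
    """带词边界符号的字符 bigram 计数,边界也携带位置信息。"""
    w = f"#{word}#"
    return Counter(w[i:i + 2] for i in range(len(w) - 1))

def cosine(a: Counter, b: Counter) -> float:
    """两个 bigram 计数向量的余弦相似度。"""
    dot = sum(a[k] * b[k] for k in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# 形态相近的词对相似度应高于无关词对;聚类算法即在此类空间中工作
sim_same = cosine(bigram_profile("understand"), bigram_profile("understood"))
sim_diff = cosine(bigram_profile("understand"), bigram_profile("iguana"))
```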

[NLP-33] Climbing the Ladder of Reasoning: What LLMs Can-and Still Can't-Solve after SFT?

【速读】: 该论文旨在解决语言模型在数学推理任务中特定能力增强机制不明确的问题。论文通过详细分析AIME24数据集上的模型性能,揭示了问题难度的阶梯式结构,并将问题分为四个层级(易、中、难、极难),同时识别出在各层级间进阶的具体需求。研究发现,从易到中层级的进步主要依赖于采用R1推理风格且仅需少量监督微调(500-1000实例),而较难题目在推理链的每一步都容易出现错误,即使数据规模对数扩展后准确率也趋于饱和(约65%)。极难题目则需要非常规解题技能,当前模型普遍表现不佳。此外,研究表明精心设计的小规模数据集优势有限,扩大数据集规模更为有效。论文的关键解决方案在于通过系统性分析揭示模型在不同数学推理层级上的能力瓶颈及优化路径,从而为提升语言模型的数学推理能力提供了清晰的路线图。

链接: https://arxiv.org/abs/2504.11741
作者: Yiyou Sun,Georgia Zhou,Hao Wang,Dacheng Li,Nouha Dziri,Dawn Song
机构: University of California, Berkeley (加州大学伯克利分校); Allen Institute for AI (艾伦人工智能研究所)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Recent supervised fine-tuning (SFT) approaches have significantly improved language models’ performance on mathematical reasoning tasks, even when models are trained at a small scale. However, the specific capabilities enhanced through such fine-tuning remain poorly understood. In this paper, we conduct a detailed analysis of model performance on the AIME24 dataset to understand how reasoning capabilities evolve. We discover a ladder-like structure in problem difficulty, categorize questions into four tiers (Easy, Medium, Hard, and Extremely Hard (Exh)), and identify the specific requirements for advancing between tiers. We find that progression from Easy to Medium tier requires adopting an R1 reasoning style with minimal SFT (500-1K instances), while Hard-level questions suffer from frequent model’s errors at each step of the reasoning chain, with accuracy plateauing at around 65% despite logarithmic scaling. Exh-level questions present a fundamentally different challenge; they require unconventional problem-solving skills that current models uniformly struggle with. Additional findings reveal that carefully curated small-scale datasets offer limited advantage; scaling dataset size proves far more effective. Our analysis provides a clearer roadmap for advancing language model capabilities in mathematical reasoning.
zh
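摘要提到 Hard 层级的准确率“随对数扩展趋于饱和(约65%)”。下面用一个纯属虚构的玩具曲线直观演示这种“对数增长 + 平台期”的形态,其中 base、slope、ceiling 均为假设参数,并非论文的拟合结果:

```python
import math

# Toy illustration only: accuracy grows with log10 of the SFT dataset size,
# then plateaus at a ceiling (the paper reports saturation around 65%).
# base / slope / ceiling are made-up parameters, not fitted to the paper's data.
def hard_tier_accuracy(n_examples, base=0.25, slope=0.10, ceiling=0.65):
    return min(ceiling, base + slope * math.log10(n_examples))

for n in (100, 1_000, 10_000, 100_000):
    print(n, round(hard_tier_accuracy(n), 3))
```

可以看到数据规模超过一定量级后继续扩展的边际收益消失,与摘要的观察一致。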

[NLP-34] The Devil is in the Prompts: Retrieval-Augmented Prompt Optimization for Text-to-Video Generation CVPR2025

【速读】: 该论文旨在解决文本到视频(Text-to-video, T2V)生成模型对输入提示(prompt)敏感的问题,强调了提示设计对生成结果的重要影响。传统方法主要依赖大型语言模型(Large Language Models, LLMs)将用户提供的提示与训练数据分布对齐,但缺乏针对提示词汇和句法细节的定制化指导。为了解决这一问题,论文提出了RAPO框架,这是一种新的检索增强型提示优化(Retrieval-Augmented Prompt Optimization)方法。其关键在于通过两个优化分支对原始提示进行精炼:第一个分支利用从学习到的关系图中提取的多样化修饰词来增强提示,并通过微调的语言模型将其格式调整为与训练数据一致;第二个分支则遵循定义明确的指令集,使用预训练的语言模型重写原始提示。实验结果表明,RAPO能够有效提升生成视频在静态和动态维度上的质量,凸显了提示优化对于用户提供的输入提示的重要性。

链接: https://arxiv.org/abs/2504.11739
作者: Bingjie Gao,Xinyu Gao,Xiaoxue Wu,Yujie Zhou,Yu Qiao,Li Niu,Xinyuan Chen,Yaohui Wang
机构: Shanghai Jiao Tong University (上海交通大学); Shanghai Artificial Intelligence Laboratory (上海人工智能实验室); Fudan University (复旦大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注: accepted by CVPR2025

点击查看摘要

Abstract:The evolution of Text-to-video (T2V) generative models, trained on large-scale datasets, has been marked by significant progress. However, the sensitivity of T2V generative models to input prompts highlights the critical role of prompt design in influencing generative outcomes. Prior research has predominantly relied on Large Language Models (LLMs) to align user-provided prompts with the distribution of training prompts, albeit without tailored guidance encompassing prompt vocabulary and sentence structure nuances. To this end, we introduce RAPO, a novel Retrieval-Augmented Prompt Optimization framework, designed to address potential inaccuracies and ambiguous details in LLM-generated prompts. RAPO refines the naive prompts through dual optimization branches, selecting the superior prompt for T2V generation. The first branch augments user prompts with diverse modifiers extracted from a learned relational graph, refining them to align with the format of training prompts via a fine-tuned LLM. Conversely, the second branch rewrites the naive prompt using a pre-trained LLM following a well-defined instruction set. Extensive experiments demonstrate that RAPO can effectively enhance both the static and dynamic dimensions of generated videos, demonstrating the significance of prompt optimization for user-provided prompts. Project website: this https URL.
zh
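RAPO 第一分支“从学习到的关系图中检索修饰词增强提示”这一步,可以用如下极简草图示意。这里以字典冒充关系图,词条与修饰词均为虚构示例,并非论文的实现:

```python
# Hypothetical sketch of RAPO's first branch: augment a naive prompt with
# modifiers retrieved from a learned relational graph (a toy dict here).
relation_graph = {
    "cat": ["fluffy", "playful"],
    "running": ["in slow motion"],
}

def augment_prompt(prompt, graph, max_modifiers=2):
    words = prompt.lower().split()
    mods = [m for w in words for m in graph.get(w, [])][:max_modifiers]
    return prompt + (", " + ", ".join(mods) if mods else "")

print(augment_prompt("a cat running", relation_graph))  # a cat running, fluffy, playful
```

真实系统中,增强后的提示还会再交给微调过的 LLM 统一改写为训练提示的格式。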

[NLP-35] Higher-Order Binding of Language Model Virtual Personas: A Study on Approximating Political Partisan Misperceptions

【速读】: 本文旨在解决大型语言模型(Large Language Models, LLMs)在模拟人类行为以估计用户响应时,如何更有效地反映个体在群体中的观点及其对外群体的认知与评价问题。传统方法主要关注个体意见或态度的再现,而本文提出需要更高阶的虚拟人格整合,即不仅要准确模拟用户作为特定群体成员的观点,还需真实刻画其对外群体的细微感知与评估方式。这一能力对于将LLMs应用于政治科学研究(如极化动态、群体间冲突及民主倒退等热点议题)至关重要。

解决方案的关键在于提出了一种新颖的方法,通过生成合成用户的“背景故事”来构建虚拟人格,这些背景故事以扩展的多轮访谈文本形式呈现。与以往方法相比,本文生成的背景故事更长、细节更丰富且一致性更强,能够真实描述单一个体。实验结果显示,基于这些背景故事构建的虚拟人格在复制人类响应分布方面表现出显著提升(Wasserstein距离衡量下最高可达87%的改进),并且产生的效应规模与原始研究观察到的结果高度吻合。这一方法拓展了LLMs的应用范围,使其不仅限于估计个体自我观点,还能广泛用于各类人类行为研究。

链接: https://arxiv.org/abs/2504.11673
作者: Minwoo Kang,Suhong Moon,Seung Hyeong Lee,Ayush Raj,Joseph Suh,David M. Chan
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large language models (LLMs) are increasingly capable of simulating human behavior, offering cost-effective ways to estimate user responses during the early phases of survey design. While previous studies have examined whether models can reflect individual opinions or attitudes, we argue that a higher-order binding of virtual personas requires successfully approximating not only the opinions of a user as an identified member of a group, but also the nuanced ways in which that user perceives and evaluates those outside the group. In particular, faithfully simulating how humans perceive different social groups is critical for applying LLMs to various political science studies, including timely topics on polarization dynamics, inter-group conflict, and democratic backsliding. To this end, we propose a novel methodology for constructing virtual personas with synthetic user "backstories" generated as extended, multi-turn interview transcripts. Our generated backstories are longer, rich in detail, and consistent in authentically describing a singular individual, compared to previous methods. We show that virtual personas conditioned on our backstories closely replicate human response distributions (up to an 87% improvement as measured by Wasserstein Distance) and produce effect sizes that closely match those observed in the original studies. Altogether, our work extends the applicability of LLMs beyond estimating individual self-opinions, enabling their use in a broader range of human studies.
zh
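摘要以 Wasserstein 距离度量虚拟人格与真人响应分布的差距。对一维、等长的经验样本,该距离等于排序后逐项差绝对值的平均,示意如下(评分数据为虚构的李克特量表示例):

```python
def wasserstein_1d(xs, ys):
    """1-D earth mover's distance between two equal-size empirical samples."""
    assert len(xs) == len(ys)
    return sum(abs(a - b) for a, b in zip(sorted(xs), sorted(ys))) / len(xs)

human = [1, 2, 3, 4, 5]      # hypothetical human Likert responses
simulated = [2, 2, 3, 4, 5]  # hypothetical virtual-persona responses
print(wasserstein_1d(human, simulated))  # 0.2
```

距离越小表示模拟分布与真人分布越接近;论文报告的 87% 改进即指该距离的相对下降。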

[NLP-36] Improving Instruct Models for Free: A Study on Partial Adaptation

【速读】: 该论文试图解决在指令微调(Instruction Tuning)过程中,指令跟随能力提升与即时上下文少量样本学习(in-context few-shot learning)性能之间的潜在权衡问题。论文指出,虽然指令微调能够增强模型的指令跟随能力,但可能导致遗忘预训练知识或使模型过于冗长,从而损害即时上下文学习性能。为此,研究通过部分适配方法(partial adaptation)降低指令微调的强度,探索基础模型(base model)与指令模型(instruct model)的表现轨迹。关键在于调整指令微调的强度,结果显示,在多个模型家族和不同规模的模型中,适度减弱指令微调可显著提升即时上下文少量样本学习基准任务的表现,代价是牺牲部分指令跟随能力(如AlpacaEval所衡量)。这一研究揭示了在实际应用中需平衡即时上下文学习与指令跟随能力的重要性。

链接: https://arxiv.org/abs/2504.11626
作者: Ozan İrsoy,Pengxiang Cheng,Jennifer L. Chen,Daniel Preoţiuc-Pietro,Shiyue Zhang,Duccio Pappadopulo
机构: Bloomberg(彭博); NVIDIA(英伟达)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Author ordering chosen at random

点击查看摘要

Abstract:Instruct models, obtained from various instruction tuning or post-training steps, are commonly deemed superior and more usable than their base counterpart. While the model gains instruction following ability, instruction tuning may lead to forgetting the knowledge from pre-training or it may encourage the model to be overly conversational or verbose. This, in turn, can lead to degradation of in-context few-shot learning performance. In this work, we study the performance trajectory between base and instruct models by scaling down the strength of instruction-tuning via the partial adaptation method. We show that, across several model families and model sizes, reducing the strength of instruction-tuning results in material improvement on a few-shot in-context learning benchmark covering a variety of classic natural language tasks. This comes at the cost of losing some degree of instruction following ability as measured by AlpacaEval. Our study sheds light on the potential trade-off between in-context learning and instruction following abilities that is worth considering in practice.
zh
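摘要中的“部分适配(partial adaptation)”可以理解为按强度系数 alpha 在基础模型与指令模型权重之间插值。下面是一个最简草图,权重用字典表示,函数名与实现细节均为假设,未必与论文一致:

```python
# alpha = 1 recovers the instruct model, alpha = 0 the base model;
# intermediate alpha weakens the instruction tuning (hypothetical sketch).
def partial_adapt(base, instruct, alpha):
    return {k: (1 - alpha) * base[k] + alpha * instruct[k] for k in base}

base_w = {"w": 1.0, "b": 0.0}
instruct_w = {"w": 3.0, "b": 1.0}
print(partial_adapt(base_w, instruct_w, 0.5))  # {'w': 2.0, 'b': 0.5}
```

实际模型中同样的插值会在每个参数张量上逐元素进行。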

[NLP-37] AskQE: Question Answering as Automatic Evaluation for Machine Translation

【速读】: 该论文旨在解决单语种英语使用者如何判断法语文本的自动翻译质量是否足够好以供分享这一实际问题。现有机器翻译错误检测和质量估计(MT Error Detection and Quality Estimation, QE)技术未能应对此场景。论文提出了一种名为AskQE的问题生成与回答框架,其关键在于通过设计基于对比合成的机器翻译错误数据集(ContraTICO)和大型语言模型(LLaMA-3 70B)及推导事实来优化问句生成过程,从而检测关键错误并提供可操作反馈,帮助用户在不掌握目标语言的情况下决定是否接受或拒绝机器翻译输出。实验表明,AskQE在BioMQM自然发生的机器翻译错误数据集上的Kendall’s Tau相关性和决策准确性优于其他QE指标。

链接: https://arxiv.org/abs/2504.11582
作者: Dayeon Ki,Kevin Duh,Marine Carpuat
机构: 未知
类目: Computation and Language (cs.CL)
备注: 38 pages, 7 figures

点击查看摘要

Abstract:How can a monolingual English speaker determine whether an automatic translation in French is good enough to be shared? Existing MT error detection and quality estimation (QE) techniques do not address this practical scenario. We introduce AskQE, a question generation and answering framework designed to detect critical MT errors and provide actionable feedback, helping users decide whether to accept or reject MT outputs even without the knowledge of the target language. Using ContraTICO, a dataset of contrastive synthetic MT errors in the COVID-19 domain, we explore design choices for AskQE and develop an optimized version relying on LLaMA-3 70B and entailed facts to guide question generation. We evaluate the resulting system on the BioMQM dataset of naturally occurring MT errors, where AskQE has higher Kendall’s Tau correlation and decision accuracy with human ratings compared to other QE metrics.
zh
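AskQE 用 Kendall's Tau 衡量 QE 得分与人工评分的秩相关。其朴素 O(n²) 定义可直接写成(示例评分为虚构数据,未处理并列值):

```python
def kendall_tau(x, y):
    """Naive O(n^2) Kendall rank correlation (no tie correction)."""
    n = len(x)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            s = (x[i] - x[j]) * (y[i] - y[j])
            if s > 0:
                concordant += 1
            elif s < 0:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)

qe_scores = [0.9, 0.7, 0.4, 0.1]   # hypothetical AskQE scores
human_ratings = [4, 3, 2, 1]       # hypothetical human ratings
print(kendall_tau(qe_scores, human_ratings))  # 1.0
```

实践中一般直接调用现成统计库并处理并列值,此处仅为说明指标含义。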

[NLP-38] GraphicBench: A Planning Benchmark for Graphic Design with Language Agents

【速读】: 本文旨在探索大型语言模型(Large Language Model, LLM)驱动的智能体在创意设计任务中的能力,特别是针对目标开放的设计任务。传统研究主要集中于明确目标的任务自动化,而创意设计领域中LLM-agent的能力尚未充分挖掘。为此,论文引入了GraphicBench基准测试集和GraphicTown框架作为解决方案的核心。GraphicBench包含1,079个用户查询及输入图像,覆盖四种设计类型;GraphicTown则构建了一个包含三个设计专家与46种工具的LLM-agent框架,用于执行规划的工作流。实验表明,LLM能够整合显式设计约束与隐式常识约束生成工作流,但这些工作流往往无法成功执行,主要面临空间关系推理、全局依赖协调以及每步选择最适行动等挑战。因此,论文将GraphicBench视为推进LLM-agent在创意设计任务中规划与执行能力的重要测试平台。

链接: https://arxiv.org/abs/2504.11571
作者: Dayeon Ki,Tianyi Zhou,Marine Carpuat,Gang Wu,Puneet Mathur,Viswanathan Swaminathan
机构: University of Maryland, College Park (马里兰大学帕克分校); Adobe Research (Adobe 研究院)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 41 pages, 11 figures

点击查看摘要

Abstract:Large Language Model (LLM)-powered agents have unlocked new possibilities for automating human tasks. While prior work has focused on well-defined tasks with specified goals, the capabilities of agents in creative design tasks with open-ended goals remain underexplored. We introduce GraphicBench, a new planning benchmark for graphic design that covers 1,079 user queries and input images across four design types. We further present GraphicTown, an LLM agent framework with three design experts and 46 actions (tools) to choose from for executing each step of the planned workflows in web environments. Experiments with six LLMs demonstrate their ability to generate workflows that integrate both explicit design constraints from user queries and implicit commonsense constraints. However, these workflows often do not lead to successful execution outcomes, primarily due to challenges in: (1) reasoning about spatial relationships, (2) coordinating global dependencies across experts, and (3) retrieving the most appropriate action per step. We envision GraphicBench as a challenging yet valuable testbed for advancing LLM-agent planning and execution in creative design tasks.
zh

[NLP-39] ReTool: Reinforcement Learning for Strategic Tool Use in LLMs

【速读】: 本文旨在解决现有基于强化学习(Reinforcement Learning, RL)训练的推理模型(如DeepSeek R1)在处理需要结构化问题解决的任务(如几何推理、简练计算或复杂方程求解)时表现不佳的问题。这些任务领域中,计算工具如代码解释器(Code Interpreter, CI)展现出显著优势。为弥合这一差距,论文提出了一种名为ReTool的新方法,其核心在于通过工具集成学习增强长文本推理能力,关键创新包括:(1) 在自然语言推理过程中动态嵌入实时代码执行的能力,以及(2) 一种自动化的RL范式,允许策略展开时结合多轮实时代码执行,并教导模型何时及如何调用工具以优化结果反馈指导下的学习过程。ReTool采用系统性训练框架,首先利用合成冷启动数据生成代码增强的长文本推理轨迹以微调基础模型,随后通过任务结果作为奖励迭代优化工具使用策略,实现无需人工先验的最优工具调用模式自主发现。实验表明,在MATH奥林匹克竞赛AIME基准测试中,ReTool的32B模型在仅400个训练步骤内达到67%的准确率,效率和性能均优于文本基线模型(40%准确率,1080步),且扩展设置下可达72.5%,超越OpenAI的o1-preview模型27.9%。进一步分析揭示了模型自发出现的代码自修正等行为,标志着模型在适应性工具使用上的“顿悟”时刻。这些结果强调了以结果为导向的工具整合在推动复杂数学推理发展中的潜力,并为混合神经符号系统的构建提供了新见解。

链接: https://arxiv.org/abs/2504.11536
作者: Jiazhan Feng,Shijue Huang,Xingwei Qu,Ge Zhang,Yujia Qin,Baoquan Zhong,Chengquan Jiang,Jinxin Chi,Wanjun Zhong
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:While reasoning models (e.g., DeepSeek R1) trained with reinforcement learning (RL), excel in textual reasoning, they struggle in scenarios requiring structured problem-solving, such as geometric reasoning, concise computation, or complex equation solving-areas where computational tools like code interpreters (CI) demonstrate distinct advantages. To bridge this gap, we propose ReTool, which enhances long-form reasoning with tool-integrated learning, including two key features: (1) dynamic interleaving of real-time code execution within natural language reasoning processes, and (2) an automated RL paradigm that allows policy rollouts with multi-turn real-time code execution and teaches the model when and how to invoke tools based on outcome feedback. ReTool employs a systematic training framework, beginning with synthetic cold-start data generation to produce code-augmented long-form reasoning traces for fine-tuning base models. Subsequent RL training leverages task outcomes as rewards to iteratively refine the model's tool use strategy, enabling autonomous discovery of optimal tool invocation patterns without human priors. Experiments on the challenging MATH Olympiad benchmark AIME demonstrate ReTool's superiority: Our 32B model achieves 67% accuracy with 400 training steps, outperforming text-based RL baseline (40% accuracy, 1080 steps) in efficiency and performance. Remarkably, ReTool-32B attains 72.5% accuracy in extended settings, surpassing OpenAI's o1-preview by 27.9%. Further analysis reveals emergent behaviors such as code self-correction, signaling an "aha moment" in which the model autonomously masters adaptive tool use. These findings highlight the promise of outcome-driven tool integration for advancing complex mathematical reasoning and offer new insights into hybrid neuro-symbolic systems.
zh
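ReTool 的核心机制是在自然语言推理中动态穿插实时代码执行。以下草图仅示意这种穿插流程:run_tool 用 eval 充当沙箱解释器的替身(切勿对不可信输入使用 eval),分段格式亦为本文假设:

```python
def run_tool(expr):
    # Stand-in for a sandboxed code interpreter; illustration only.
    return str(eval(expr))

def interleave(segments):
    """segments: list of ('text' | 'code', payload) emitted during reasoning."""
    transcript = []
    for kind, content in segments:
        if kind == "code":
            transcript.append(f"[tool output: {run_tool(content)}]")
        else:
            transcript.append(content)
    return " ".join(transcript)

print(interleave([("text", "Compute 12*7:"),
                  ("code", "12*7"),
                  ("text", "so the answer is 84.")]))
```

在 ReTool 的 RL 训练中,模型正是基于这类包含工具输出的轨迹获得结果奖励,从而学会何时调用工具。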

[NLP-40] HypoBench: Towards Systematic and Principled Benchmarking for Hypothesis Generation

【速读】: 该论文试图解决的问题是如何定义一个有效的假设以及如何系统性地评估大型语言模型(Large Language Models, LLMs)和假设生成方法的有效性。为了解决这些问题,论文提出了HypoBench,这是一个新颖的基准测试工具,用于从实用性、泛化能力及假设发现率等多个方面评估LLMs与假设生成方法。HypoBench包含7个真实世界任务和5个合成任务,并涵盖了194个不同的数据集。通过结合四种最先进的LLMs与六种现有的假设生成方法进行评估,研究结果表明现有方法能够发现数据中的有效且新颖的模式,但在合成数据集上的表现揭示了仍有显著改进空间,尤其是在任务难度增加时,当前方法仅恢复了38.8%的真实假设。这凸显了假设生成面临的挑战,并证明了HypoBench作为提升旨在辅助科学发现的人工智能系统的重要价值。

链接: https://arxiv.org/abs/2504.11524
作者: Haokun Liu,Sicong Huang,Jingyu Hu,Yangqiaoyu Zhou,Chenhao Tan
机构: University of Chicago (芝加哥大学); University of Toronto (多伦多大学)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY); Machine Learning (cs.LG)
备注: 29 pages, 6 figures, website link: this https URL

点击查看摘要

Abstract:There is growing interest in hypothesis generation with large language models (LLMs). However, fundamental questions remain: what makes a good hypothesis, and how can we systematically evaluate methods for hypothesis generation? To address this, we introduce HypoBench, a novel benchmark designed to evaluate LLMs and hypothesis generation methods across multiple aspects, including practical utility, generalizability, and hypothesis discovery rate. HypoBench includes 7 real-world tasks and 5 synthetic tasks with 194 distinct datasets. We evaluate four state-of-the-art LLMs combined with six existing hypothesis-generation methods. Overall, our results suggest that existing methods are capable of discovering valid and novel patterns in the data. However, the results from synthetic datasets indicate that there is still significant room for improvement, as current hypothesis generation methods do not fully uncover all relevant or meaningful patterns. Specifically, in synthetic settings, as task difficulty increases, performance significantly drops, with best models and methods only recovering 38.8% of the ground-truth hypotheses. These findings highlight challenges in hypothesis generation and demonstrate that HypoBench serves as a valuable resource for improving AI systems designed to assist scientific discovery.
zh
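摘要中“恢复了 38.8% 的真实假设”对应一个假设发现率指标,可粗略理解为被生成结果覆盖的真实假设比例。下面用精确匹配给出玩具版本(基准实际采用的匹配标准可能更宽松,数据为虚构):

```python
# Toy "hypothesis discovery rate": fraction of ground-truth hypotheses
# matched by at least one generated hypothesis (exact string match here).
def discovery_rate(generated, ground_truth):
    return sum(1 for h in ground_truth if h in set(generated)) / len(ground_truth)

gt = ["length matters", "tone matters", "topic matters"]
gen = ["tone matters", "author matters"]
print(discovery_rate(gen, gt))
```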

[NLP-41] Graph-Driven Multimodal Feature Learning Framework for Apparent Personality Assessment

【速读】: 该论文旨在解决计算机视觉领域中自动预测人格特质这一具有挑战性的问题。论文提出了一种创新的多模态特征学习框架,用于短视频片段中的人格分析。解决方案的关键在于结合多种先进技术:通过构建面部图并设计基于几何的双流网络(结合注意力机制、图卷积网络GCN和卷积神经网络CNN)来捕捉静态面部表情;利用ResNet18和VGGFace网络提取全局场景和面部外观特征;引入带有时间注意力模块的双向门控循环单元BiGRU以捕获动态时间信息;通过VGGish CNN和XLM-Roberta分别增强音频和文本特征;最后,采用多模态通道注意力机制整合不同模态信息,并使用多层感知机MLP回归模型进行人格特质预测。实验结果表明,所提出的框架在性能上超越了现有的最先进方法。

链接: https://arxiv.org/abs/2504.11515
作者: Kangsheng Wang,Chengwei Ye,Huanzhen Zhang,Linuo Xu,Shuyan Liu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Multimedia (cs.MM)
备注:

点击查看摘要

Abstract:Predicting personality traits automatically has become a challenging problem in computer vision. This paper introduces an innovative multimodal feature learning framework for personality analysis in short video clips. For visual processing, we construct a facial graph and design a Geo-based two-stream network incorporating an attention mechanism, leveraging both Graph Convolutional Networks (GCN) and Convolutional Neural Networks (CNN) to capture static facial expressions. Additionally, ResNet18 and VGGFace networks are employed to extract global scene and facial appearance features at the frame level. To capture dynamic temporal information, we integrate a BiGRU with a temporal attention module for extracting salient frame representations. To enhance the model’s robustness, we incorporate the VGGish CNN for audio-based features and XLM-Roberta for text-based features. Finally, a multimodal channel attention mechanism is introduced to integrate different modalities, and a Multi-Layer Perceptron (MLP) regression model is used to predict personality traits. Experimental results confirm that our proposed framework surpasses existing state-of-the-art approaches in performance.
zh

[NLP-42] Language and Knowledge Representation: A Stratified Approach

【速读】: 该论文试图解决表示异构性(representation heterogeneity)的问题,强调异构性是任何表示形式的固有属性。不同观察者会以分层的方式使用不同的概念、语言和知识(以及数据)对同一目标现实进行编码。为了解决这一分层问题,论文提出了一个自顶向下的解决方案,其关键是通过以下组件实现:(i) 分层表示形式(概念层、语言层、知识层和数据层)以容纳表示异构性;(ii) 借助通用知识核心(Universal Knowledge Core, UKC)、UKC命名空间和领域语言的自顶向下语言表示来应对概念和语言层面的异构性;(iii) 利用语言目的论(language teleontology)和知识目的论(knowledge teleontology)的概念进行自顶向下的知识表示以处理知识层面的异构性;(iv) 使用和进一步开发现有的LiveKnowledge目录以强制执行语言和知识表示的迭代重用与共享;(v) 集成上述组件的kTelos方法论以迭代生成消除表示异构性的语言和知识表示。论文还展示了为两个国际研究项目(DataScientia和JIDEP)开发的语言和知识表示的验证实例,并提出了未来的研究方向。

链接: https://arxiv.org/abs/2504.11492
作者: Mayukh Bagchi
机构: 未知
类目: Databases (cs.DB); Computation and Language (cs.CL); Digital Libraries (cs.DL)
备注: Doctor of Philosophy (Ph.D) in Information Engineering and Computer Science, DISI, University of Trento, Italy

点击查看摘要

Abstract:The thesis proposes the problem of representation heterogeneity to emphasize the fact that heterogeneity is an intrinsic property of any representation, wherein, different observers encode different representations of the same target reality in a stratified manner using different concepts, language and knowledge (as well as data). The thesis then advances a top-down solution approach to the above stratified problem of representation heterogeneity in terms of several solution components, namely: (i) a representation formalism stratified into concept level, language level, knowledge level and data level to accommodate representation heterogeneity, (ii) a top-down language representation using Universal Knowledge Core (UKC), UKC namespaces and domain languages to tackle the conceptual and language level heterogeneity, (iii) a top-down knowledge representation using the notions of language teleontology and knowledge teleontology to tackle the knowledge level heterogeneity, (iv) the usage and further development of the existing LiveKnowledge catalog for enforcing iterative reuse and sharing of language and knowledge representations, and, (v) the kTelos methodology integrating the solution components above to iteratively generate the language and knowledge representations absolving representation heterogeneity. The thesis also includes proof-of-concepts of the language and knowledge representations developed for two international research projects - DataScientia (data catalogs) and JIDEP (materials modelling). Finally, the thesis concludes with future lines of research.
zh

[NLP-43] SFT or RL? An Early Investigation into Training R1-Like Reasoning Large Vision-Language Models

【速读】: 该论文试图解决大型视觉语言模型(Large Vision-Language Models, LVLMs)在基于监督微调(Supervised Fine-Tuning, SFT)后继强化学习(Reinforcement Learning, RL)过程中存在的性能瓶颈问题。论文发现,SFT可能通过诱导“伪推理路径”显著损害后续的RL,这些路径虽然看似模仿了专家模型的推理方式,但实际上通常包含冗长、犹豫、信息量低且错误的步骤。为系统研究此现象,作者引入了一个新的多模态数据集VLAA-Thinking,它包含高质量的逐步视觉推理轨迹,并设计了六步构建流程以支持LVLMs的推理能力。论文的关键解决方案在于提出了一种基于分组相对策略优化(Group Relative Policy Optimization, GRPO)的新方法,结合了感知与认知信号的混合奖励模块,从而促进了更真实、更具适应性的推理行为。最终,基于Qwen2.5VL 3B开发的模型VLAA-Thinker在Open LMM Reasoning Leaderboard上取得了领先性能。

链接: https://arxiv.org/abs/2504.11468
作者: Hardy Chen,Haoqin Tu,Fali Wang,Hui Liu,Xianfeng Tang,Xinya Du,Yuyin Zhou,Cihang Xie
机构: University of California, Santa Cruz (加州大学圣克鲁兹分校); University of Texas at Dallas (达拉斯德克萨斯大学); The Pennsylvania State University (宾夕法尼亚州立大学); Amazon Research (亚马逊研究)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:This work revisits the dominant supervised fine-tuning (SFT) then reinforcement learning (RL) paradigm for training Large Vision-Language Models (LVLMs), and reveals a key finding: SFT can significantly undermine subsequent RL by inducing "pseudo reasoning paths" imitated from expert models. While these paths may resemble the native reasoning paths of RL models, they often involve prolonged, hesitant, less informative steps, and incorrect reasoning. To systematically study this effect, we introduce VLAA-Thinking, a new multimodal dataset designed to support reasoning in LVLMs. Constructed via a six-step pipeline involving captioning, reasoning distillation, answer rewrite and verification, VLAA-Thinking comprises high-quality, step-by-step visual reasoning traces for SFT, along with a more challenging RL split from the same data source. Using this dataset, we conduct extensive experiments comparing SFT, RL and their combinations. Results show that while SFT helps models learn reasoning formats, it often locks aligned models into imitative, rigid reasoning modes that impede further learning. In contrast, building on the Group Relative Policy Optimization (GRPO) with a novel mixed reward module integrating both perception and cognition signals, our RL approach fosters more genuine, adaptive reasoning behavior. Notably, our model VLAA-Thinker, based on Qwen2.5VL 3B, achieves top-1 performance on Open LMM Reasoning Leaderboard (this https URL) among 4B scale LVLMs, surpassing the previous state-of-the-art by 1.8%. We hope our findings provide valuable insights in developing reasoning-capable LVLMs and can inform future research in this area.
zh
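摘要提到的 GRPO(Group Relative Policy Optimization)的特点是用同一组 rollout 的奖励均值与标准差来归一化优势,从而无需价值网络。核心一步可示意为(简化版,省略了裁剪与 KL 正则等细节):

```python
def group_relative_advantages(rewards):
    """Normalize each rollout's reward within its group (GRPO-style)."""
    n = len(rewards)
    mean = sum(rewards) / n
    std = (sum((r - mean) ** 2 for r in rewards) / n) ** 0.5 or 1.0  # guard std=0
    return [(r - mean) / std for r in rewards]

print(group_relative_advantages([1.0, 0.0, 1.0, 0.0]))  # [1.0, -1.0, 1.0, -1.0]
```

论文在此框架上叠加了融合感知与认知信号的混合奖励模块;该模块的具体形式见原文,此处不做猜测。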

[NLP-44] Semantic Matters: Multimodal Features for Affective Analysis

【速读】: 本文针对行为矛盾/犹豫 (Behavioural Ambivalence/Hesitancy, BAH) 识别挑战和情感模仿强度 (Emotional Mimicry Intensity, EMI) 估计挑战提出了解决方案。论文的关键在于利用预训练于大规模播客数据集的 Wav2Vec 2.0 模型提取语音特征,结合由其衍生的 VAD (valence-arousal-dominance) 模块、BERT-like 编码器以及视觉Transformer (Vision Transformer, ViT),并通过长短期记忆网络 (Long Short-Term Memory, LSTM) 进行时间建模。此外,本文首次将文本和视觉模态融入分析,强调语义信息提供重要的上下文线索,并通过融合视觉模态增强对文本模态的理解精度,从而显著提升了性能相较基线方法的表现。

链接: https://arxiv.org/abs/2504.11460
作者: Tobias Hallmen,Robin-Nico Kampa,Fabian Deuser,Norbert Oswald,Elisabeth André
机构: Chair for Human-Centered Artificial Intelligence, University of Augsburg (以人为中心的人工智能讲席, 奥格斯堡大学); Institute for Distributed Intelligent Systems, University of the Bundeswehr Munich (分布式智能系统研究所, 德国联邦国防军慕尼黑大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:In this study, we present our methodology for two tasks: the Behavioural Ambivalence/Hesitancy (BAH) Recognition Challenge and the Emotional Mimicry Intensity (EMI) Estimation Challenge, both conducted as part of the 8th Workshop and Competition on Affective Behavior Analysis in-the-wild. Building on previous work, we utilize a Wav2Vec 2.0 model pre-trained on a large podcast dataset to extract various audio features, capturing both linguistic and paralinguistic information. Our approach incorporates a valence-arousal-dominance (VAD) module derived from Wav2Vec 2.0, a BERT-like encoder, and a vision transformer (ViT) with predictions subsequently processed through a long short-term memory (LSTM) architecture for temporal modeling. In this iteration, we integrate the textual and visual modality into our analysis, recognizing that semantic content provides valuable contextual cues and underscoring that the meaning of speech often conveys more critical insights than its acoustic counterpart alone. Fusing in the vision modality helps in some cases to interpret the textual modality more precisely. This combined approach yields significant performance improvements over baseline methods.
zh

[NLP-45] From Conceptual Data Models to Multimodal Representation

【速读】: 该论文致力于解决信息设计中的两个核心问题:一是如何定义文本数据集的语义意义及其视觉或多媒体表示;二是如何通过建模和概念设计实现复杂语料库的分析与发布。论文的关键在于引入语义建模工具(如概念网络或图),这些工具能够通过考虑概念间的关系、使用上下文及特定目标来结构化领域知识。同时,论文强调了构建动态且可适应模型的挑战,特别是如何整合词表或互操作本体以处理复杂的语料库。最终,这些方法被应用于实际工作环境,如OKAPI系统,并通过可视化叙事和文档再工程等创新方式,强调了系统间的互操作性、灵活性以及智能通信的重要性。

链接: https://arxiv.org/abs/2504.11459
作者: Peter Stockinger(PLIDAM, ESCOM)
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR)
备注: in French language

点击查看摘要

Abstract:1) Introduction and Conceptual Framework: This document explores the concept of information design by dividing it into two major practices: defining the meaning of a corpus of textual data and its visual or multimodal representation. It draws on expertise in enriching textual corpora, particularly audiovisual ones, and transforming them into multiple narrative formats. The text highlights a crucial distinction between the semantic content of a domain and the modalities of its graphic expression, illustrating this approach with concepts rooted in structural semiotics and linguistics traditions. 2) Modeling and Conceptual Design: The article emphasizes the importance of semantic modeling, often achieved through conceptual networks or graphs. These tools enable the structuring of knowledge within a domain by accounting for relationships between concepts, contexts of use, and specific objectives. Stockinger also highlights the constraints and challenges involved in creating dynamic and adaptable models, integrating elements such as thesauri or interoperable ontologies to facilitate the analysis and publication of complex corpora. 3) Applications and Multimodal Visualization: The text concludes by examining the practical application of these models in work environments like OKAPI, developed to analyze, publish, and reuse audiovisual data. It also discusses innovative approaches such as visual storytelling and document reengineering, which involve transforming existing content into new resources tailored to various contexts. These methods emphasize interoperability, flexibility, and the intelligence of communication systems, paving the way for richer and more collaborative use of digital data. The content of this document was presented during the “Semiotics of Information Design” Day organized by Anne Beyaert-Geslin of the University of Bordeaux Montaigne (MICA laboratory) on June 21, 2018, in Bordeaux. 
zh

计算机视觉

[CV-0] Adapting a World Model for Trajectory Following in a 3D Game

【速读】:该论文旨在解决在复杂环境中(如现代3D电子游戏)利用模仿学习(Imitation Learning)复制专家轨迹的问题,特别是在存在分布偏移(distribution shift)和随机性(stochasticity)的情况下,传统简单动作回放方法的局限性。论文的关键在于探索使用逆动力学模型(Inverse Dynamics Models, IDM)结合不同编码器(encoders)和策略头(policy heads)来提升轨迹跟随性能,并提出多种未来对齐(future alignment)策略以缓解由认知不确定性(aleatoric uncertainty)和智能体不完美性引起的分布偏移问题。通过评估轨迹偏差距离和首次显著偏差点,研究发现最优配置取决于具体设置,其中在多样化数据场景下从头训练的GPT风格策略头表现最佳,在低数据场景下DINOv2编码器与GPT风格策略头的组合效果最佳,且预训练后再微调的GPT风格和MLP风格策略头具有相当的性能。

链接: https://arxiv.org/abs/2504.12299
作者: Marko Tot,Shu Ishida,Abdelhak Lemkhenter,David Bignell,Pallavi Choudhury,Chris Lovett,Luis França,Matheus Ribeiro Furtado de Mendonça,Tarun Gupta,Darren Gehring,Sam Devlin,Sergio Valcarcel Macua,Raluca Georgescu
机构: Microsoft Research (微软研究院); Queen Mary University of London (伦敦玛丽女王大学); University of Oxford (牛津大学)
类目: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Imitation learning is a powerful tool for training agents by leveraging expert knowledge, and being able to replicate a given trajectory is an integral part of it. In complex environments, like modern 3D video games, distribution shift and stochasticity necessitate robust approaches beyond simple action replay. In this study, we apply Inverse Dynamics Models (IDM) with different encoders and policy heads to trajectory following in a modern 3D video game – Bleeding Edge. Additionally, we investigate several future alignment strategies that address the distribution shift caused by the aleatoric uncertainty and imperfections of the agent. We measure both the trajectory deviation distance and the first significant deviation point between the reference and the agent’s trajectory and show that the optimal configuration depends on the chosen setting. Our results show that in a diverse data setting, a GPT-style policy head with an encoder trained from scratch performs the best, DINOv2 encoder with the GPT-style policy head gives the best results in the low data regime, and both GPT-style and MLP-style policy heads had comparable results when pre-trained on a diverse setting and fine-tuned for a specific behaviour setting.
zh
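论文度量“轨迹偏差距离”与“首个显著偏差点”两项指标,其含义可用下述草图说明(此处以二维坐标和虚构阈值演示,真实游戏轨迹为三维):

```python
def deviation_metrics(ref, agent, threshold=1.0):
    """Mean point-wise distance, plus index of first deviation > threshold."""
    dists = [((x1 - x2) ** 2 + (y1 - y2) ** 2) ** 0.5
             for (x1, y1), (x2, y2) in zip(ref, agent)]
    first = next((i for i, d in enumerate(dists) if d > threshold), None)
    return sum(dists) / len(dists), first

ref = [(0, 0), (1, 0), (2, 0)]
agent = [(0, 0), (1, 0.5), (2, 2.0)]
print(deviation_metrics(ref, agent))
```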

[CV-1] SHeaP: Self-Supervised Head Geometry Predictor Learned via 2D Gaussians

【速读】:该论文旨在解决从单目图像和视频实时准确重建人类头部三维几何结构的问题。由于大规模获取精确的三维地面真实数据(3D ground truth data)具有挑战性,现有方法通常通过自监督方式利用丰富的二维视频进行学习,但这些方法依赖可微分网格渲染(differentiable mesh rendering),存在局限性。论文的关键创新在于提出了一种名为SHeaP(Self-supervised Head Geometry Predictor Learned via 2D Gaussians)的方法,通过引入一组与3D Morphable Model (3DMM) 网格绑定的二维高斯分布(2D Gaussians)来改进自监督学习框架。这种方法通过重新动画绑定的头部虚拟化身以匹配目标帧,并反向传播光度损失(photometric losses)至3DMM预测网络和高斯分布预测网络,显著提升了自监督方法的有效性。其核心解决方案在于使用高斯分布进行渲染,从而在仅基于二维数据训练的情况下,在中性面部的NoW基准测试以及非中性表情的新基准测试中超越现有自监督方法的几何评估性能,同时生成高度表情化的网格,在情感分类任务中表现出色。

链接: https://arxiv.org/abs/2504.12292
作者: Liam Schoneveld,Zhe Chen,Davide Davoli,Jiapeng Tang,Saimon Terazawa,Ko Nishino,Matthias Nießner
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: For video demonstrations and additional materials please see this https URL

点击查看摘要

Abstract:Accurate, real-time 3D reconstruction of human heads from monocular images and videos underlies numerous visual applications. As 3D ground truth data is hard to come by at scale, previous methods have sought to learn from abundant 2D videos in a self-supervised manner. Typically, this involves the use of differentiable mesh rendering, which is effective but faces limitations. To improve on this, we propose SHeaP (Self-supervised Head Geometry Predictor Learned via 2D Gaussians). Given a source image, we predict a 3DMM mesh and a set of Gaussians that are rigged to this mesh. We then reanimate this rigged head avatar to match a target frame, and backpropagate photometric losses to both the 3DMM and Gaussian prediction networks. We find that using Gaussians for rendering substantially improves the effectiveness of this self-supervised approach. Training solely on 2D data, our method surpasses existing self-supervised approaches in geometric evaluations on the NoW benchmark for neutral faces and a new benchmark for non-neutral expressions. Our method also produces highly expressive meshes, outperforming state-of-the-art in emotion classification.
zh

[CV-2] How Do I Do That? Synthesizing 3D Hand Motion and Contacts for Everyday Interactions CVPR2025

【速读】:本文旨在解决利用单目RGB图像、动作文本以及物体上的3D接触点作为输入,预测手部运动(3D手部运动)和接触映射(或交互轨迹)这一新问题。关键在于提出了一种两阶段的解决方案:首先构建了一个名为“交互词典(Interaction Codebook)”的VQVAE模型,用于学习手部姿势和接触点的潜在码本,从而有效地将交互轨迹离散化为令牌;其次设计了一个“交互预测器(Interaction Predictor)”,通过使用索引模块从已学得的码本中检索潜在功能,由Transformer解码器模块预测交互轨迹。该方法在扩展的HoloAssist数据集上进行训练,并在比现有工作大2.5至10倍的基准测试集中验证,涵盖了更广泛的物体类别、交互类型、任务及场景,实验结果证明了所提方法相较于Transformer扩散基线的有效性。

链接: https://arxiv.org/abs/2504.12284
作者: Aditya Prakash,Benjamin Lundell,Dmitry Andreychuk,David Forsyth,Saurabh Gupta,Harpreet Sawhney
机构: University of Illinois Urbana-Champaign (伊利诺伊大学厄巴纳-香槟分校); Microsoft (微软)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: CVPR 2025, Project page: this https URL

点击查看摘要

Abstract:We tackle the novel problem of predicting 3D hand motion and contact maps (or Interaction Trajectories) given a single RGB view, action text, and a 3D contact point on the object as input. Our approach consists of (1) Interaction Codebook: a VQVAE model to learn a latent codebook of hand poses and contact points, effectively tokenizing interaction trajectories, (2) Interaction Predictor: a transformer-decoder module to predict the interaction trajectory from test time inputs by using an indexer module to retrieve a latent affordance from the learned codebook. To train our model, we develop a data engine that extracts 3D hand poses and contact trajectories from the diverse HoloAssist dataset. We evaluate our model on a benchmark that is 2.5-10X larger than existing works, in terms of diversity of objects and interactions observed, and test for generalization of the model across object categories, action categories, tasks, and scenes. Experimental results show the effectiveness of our approach over transformer diffusion baselines across all settings.
zh

[CV-3] The Tenth NTIRE 2025 Image Denoising Challenge Report

【速读】:该论文旨在解决固定噪声水平(σ=50)下的图像去噪问题,目标是设计一种网络架构,在不考虑计算复杂度和模型大小限制的情况下,实现高质量的图像去噪性能,并通过峰值信噪比(PSNR)进行定量评估。论文的关键在于提出有效的去噪方法以应对假设的独立加性白高斯噪声(Additive White Gaussian Noise, AWGN),并通过挑战赛的形式汇集多种解决方案,展示了当前图像去噪领域的技术前沿。
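
下面给出一个最小示意(纯 Python,非挑战赛官方实现),展示固定噪声水平 σ=50 的 AWGN 退化过程与评估所用的 PSNR 指标如何计算,其中图像以像素列表近似:

```python
import math
import random

def add_awgn(pixels, sigma=50.0, seed=0):
    """对 [0, 255] 像素序列叠加零均值高斯噪声(AWGN),并截断回有效范围。"""
    rng = random.Random(seed)
    return [min(255.0, max(0.0, p + rng.gauss(0.0, sigma))) for p in pixels]

def psnr(ref, test, peak=255.0):
    """峰值信噪比:PSNR = 10 * log10(peak^2 / MSE),MSE 为均方误差。"""
    mse = sum((a - b) ** 2 for a, b in zip(ref, test)) / len(ref)
    return float("inf") if mse == 0 else 10.0 * math.log10(peak ** 2 / mse)

clean = [float(i % 256) for i in range(4096)]  # 一幅“玩具图像”的像素
noisy = add_awgn(clean, sigma=50.0)
print(f"PSNR: {psnr(clean, noisy):.2f} dB")
```

σ=50 对应 MSE 约为 2500(截断会略微降低),即未去噪时 PSNR 仅约 14~15 dB,这正是该挑战赛中去噪网络需要大幅提升的起点。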

链接: https://arxiv.org/abs/2504.12276
作者: Lei Sun,Hang Guo,Bin Ren,Luc Van Gool,Radu Timofte,Yawei Li,Xiangyu Kong,Hyunhee Park,Xiaoxuan Yu,Suejin Han,Hakjae Jeon,Jia Li,Hyung-Ju Chun,Donghun Ryou,Inju Ha,Bohyung Han,Jingyu Ma,Zhijuan Huang,Huiyuan Fu,Hongyuan Yu,Boqi Zhang,Jiawei Shi,Heng Zhang,Huadong Ma,Deepak Kumar Tyagi,Aman Kukretti,Gajender Sharma,Sriharsha Koundinya,Asim Manna,Jun Cheng,Shan Tan,Jun Liu,Jiangwei Hao,Jianping Luo,Jie Lu,Satya Narayan Tazi,Arnim Gautam,Aditi Pawar,Aishwarya Joshi,Akshay Dudhane,Praful Hambadre,Sachin Chaudhary,Santosh Kumar Vipparthi,Subrahmanyam Murala,Jiachen Tu,Nikhil Akalwadi,Vijayalaxmi Ashok Aralikatti,Dheeraj Damodar Hegde,G Gyaneshwar Rao,Jatin Kalal,Chaitra Desai,Ramesh Ashok Tabib,Uma Mudenagudi,Zhenyuan Lin,Yubo Dong,Weikun Li,Anqi Li,Ang Gao,Weijun Yuan,Zhan Li,Ruting Deng,Yihang Chen,Yifan Deng,Zhanglu Chen,Boyang Yao,Shuling Zheng,Feng Zhang,Zhiheng Fu,Anas M. Ali,Bilel Benjdira,Wadii Boulila,Jan Seny,Pei Zhou,Jianhua Hu,K. L. Eddie Law,Jaeho Lee,M. J. Aashik Rasool,Abdur Rehman,SMA Sharif,Seongwan Kim,Alexandru Brateanu,Raul Balmez,Ciprian Orhei,Cosmin Ancuti,Zeyu Xiao,Zhuoyuan Li,Ziqi Wang,Yanyan Wei,Fei Wang,Kun Li,Shengeng Tang,Yunkai Zhang,Weirun Zhou,Haoxuan Lu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:This paper presents an overview of the NTIRE 2025 Image Denoising Challenge (σ = 50), highlighting the proposed methodologies and corresponding results. The primary objective is to develop a network architecture capable of achieving high-quality denoising performance, quantitatively evaluated using PSNR, without constraints on computational complexity or model size. The task assumes independent additive white Gaussian noise (AWGN) with a fixed noise level of 50. A total of 290 participants registered for the challenge, with 20 teams successfully submitting valid results, providing insights into the current state-of-the-art in image denoising.
zh

[CV-4] Beyond Reconstruction: A Physics Based Neural Deferred Shader for Photo-realistic Rendering

【速读】:该论文致力于解决深度学习驱动渲染中难以分解光照与材质参数的问题,这一局限性阻碍了现有方法在场景重构中的灵活性,使其无法有效控制这些参数。为应对这一挑战,论文提出了一种基于物理的神经延迟着色管道,通过数据驱动的方式分解渲染过程,学习一个可泛化的着色函数以实现高质量的阴影和重新照明任务,并引入了一个高效的阴影估计器来模拟阴影效果。其关键创新在于结合物理模型与神经网络,实现了比传统方法及当前最先进的神经着色模型更优的性能,同时支持任意光照输入下的真实感着色。

链接: https://arxiv.org/abs/2504.12273
作者: Zhuo He,Paul Henderson,Nicolas Pugeault
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Deep learning based rendering has demonstrated major improvements for photo-realistic image synthesis, applicable to various applications including visual effects in movies and photo-realistic scene building in video games. However, a significant limitation is the difficulty of decomposing the illumination and material parameters, which limits such methods to reconstruct an input scene, without any possibility to control these parameters. This paper introduces a novel physics based neural deferred shading pipeline to decompose the data-driven rendering process, learn a generalizable shading function to produce photo-realistic results for shading and relighting tasks, we also provide a shadow estimator to efficiently mimic shadowing effect. Our model achieves improved performance compared to classical models and a state-of-art neural shading model, and enables generalizable photo-realistic shading from arbitrary illumination input.
zh

[CV-5] Towards Learning to Complete Anything in Lidar

【速读】:该论文旨在解决基于Lidar(激光雷达)的野外场景形状补全(shape completion in-the-wild)问题,这与基于Lidar的语义/全景场景完成密切相关。然而,现有方法仅能补全并识别来自现有Lidar数据集封闭词汇表中标注的物体,无法处理未见过的类别。为解决此问题,论文提出了一种零样本方法,通过挖掘多模态传感器序列的时间上下文来提取观测对象的形状和语义特征,并将其蒸馏到仅依赖Lidar的实例级补全和识别模型中。关键在于,尽管训练时仅挖掘到部分形状,蒸馏后的模型仍能从数据集中多个此类部分观察推断完整物体形状,从而实现超出固定类别词汇表的物体识别以及语义与全景场景完成任务。

链接: https://arxiv.org/abs/2504.12264
作者: Ayca Takmaz,Cristiano Saltori,Neehar Peri,Tim Meinhardt,Riccardo de Lutio,Laura Leal-Taixé,Aljoša Ošep
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:We propose CAL (Complete Anything in Lidar) for Lidar-based shape-completion in-the-wild. This is closely related to Lidar-based semantic/panoptic scene completion. However, contemporary methods can only complete and recognize objects from a closed vocabulary labeled in existing Lidar datasets. Different to that, our zero-shot approach leverages the temporal context from multi-modal sensor sequences to mine object shapes and semantic features of observed objects. These are then distilled into a Lidar-only instance-level completion and recognition model. Although we only mine partial shape completions, we find that our distilled model learns to infer full object shapes from multiple such partial observations across the dataset. We show that our model can be prompted on standard benchmarks for Semantic and Panoptic Scene Completion, localize objects as (amodal) 3D bounding boxes, and recognize objects beyond fixed class vocabularies. Our project page is this https URL
zh

[CV-6] VGDFR: Diffusion-based Video Generation with Dynamic Latent Frame Rate

【速读】:该论文旨在解决基于Diffusion Transformer (DiT) 的视频生成模型在实际应用中计算需求高、效率低的问题。论文观察到真实世界视频的时间非均匀性,并提出视频生成中的动态潜空间帧率(Dynamic Latent Frame Rate)方法,以适应不同运动频率段的信息密度需求。关键解决方案包括:(1) 提出一种自适应分配视频片段帧率的动态帧率调度器;(2) 设计一种新颖的潜空间帧合并方法,先将潜表示与其去噪后的对应表示对齐,再合并低分辨率空间中的冗余表示;(3) 分析并优化Rotary Positional Embeddings (RoPE) 在DiT各层中的偏好策略,以增强语义与局部信息捕获能力。实验表明,VGDFR 可实现高达3倍的速度提升,同时保持较小的质量损失。
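
其中“按运动强度分配帧率”的调度思想可用如下示意代码说明(假设性的简化实现,阈值与帧率数值均为示例,与论文实际调度器无关):

```python
def schedule_frame_rates(motion_scores, high_rate=24, low_rate=8, threshold=0.5):
    """按片段运动强度分配潜空间帧率:高运动片段保留细节,静态片段降采样。"""
    return [high_rate if m >= threshold else low_rate for m in motion_scores]

def latent_token_count(rates, tokens_per_frame=64):
    """假设每个片段时长 1 秒,统计潜空间令牌总数。"""
    return sum(r * tokens_per_frame for r in rates)

scores = [0.9, 0.1, 0.2, 0.8, 0.05]           # 各片段的运动强度
rates = schedule_frame_rates(scores)           # [24, 8, 8, 24, 8]
uniform = latent_token_count([24] * len(scores))
dynamic = latent_token_count(rates)
print(f"令牌压缩比: {uniform / dynamic:.2f}x")
```

低运动片段用更少令牌表示,即可在不损失高运动段细节的前提下减少 DiT 需要处理的序列长度,这正是 VGDFR 加速的来源。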

链接: https://arxiv.org/abs/2504.12259
作者: Zhihang Yuan,Rui Xie,Yuzhang Shang,Hanling Zhang,Siyuan Wang,Shengen Yan,Guohao Dai,Yu Wang
机构: Tsinghua University (清华大学); Infinigence AI; Illinois Tech (伊利诺伊理工学院); Shanghai Jiao Tong University (上海交通大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Diffusion Transformer(DiT)-based generation models have achieved remarkable success in video generation. However, their inherent computational demands pose significant efficiency challenges. In this paper, we exploit the inherent temporal non-uniformity of real-world videos and observe that videos exhibit dynamic information density, with high-motion segments demanding greater detail preservation than static scenes. Inspired by this temporal non-uniformity, we propose VGDFR, a training-free approach for Diffusion-based Video Generation with Dynamic Latent Frame Rate. VGDFR adaptively adjusts the number of elements in latent space based on the motion frequency of the latent space content, using fewer tokens for low-frequency segments while preserving detail in high-frequency segments. Specifically, our key contributions are: (1) A dynamic frame rate scheduler for DiT video generation that adaptively assigns frame rates for video segments. (2) A novel latent-space frame merging method to align latent representations with their denoised counterparts before merging those redundant in low-resolution space. (3) A preference analysis of Rotary Positional Embeddings (RoPE) across DiT layers, informing a tailored RoPE strategy optimized for semantic and local information capture. Experiments show that VGDFR can achieve a speedup up to 3x for video generation with minimal quality degradation.
zh

[CV-7] FLIP Reasoning Challenge ICLR2025

【速读】:该论文试图解决人工智能(Artificial Intelligence, AI)在推理能力方面的挑战,特别是在多模态系统中的逻辑推理、视觉叙事和常识应用。为实现这一目标,论文提出了一套基于人类验证任务的新基准数据集FLIP,利用Idena区块链上的众包机制构建,通过图像排序任务评估模型的推理能力。论文的关键解决方案在于设计了一个强调顺序推理、视觉叙事和常识的多模态测试环境,并通过结合视觉-语言模型(Vision-Language Models, VLMs)与大型语言模型(Large Language Models, LLMs)的方法,验证了现有模型在零样本设置下的局限性。实验结果表明,即使最先进的开源和闭源模型,在零样本条件下的最高准确率仅为75.5%和77.9%,远低于人类表现的95.3%。此外,论文还发现通过使用图像描述(captioning)辅助推理模型可以显著提升性能,而将15个模型的预测进行集成可进一步将准确率提高至85.2%。这些发现凸显了现有推理模型的不足,并强调了开发如FLIP这样鲁棒的多模态基准的重要性。
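
文中“集成多个模型预测以提升准确率”的做法,可用多数投票直观说明(示意实现,论文实际的集成策略未必是简单投票):

```python
from collections import Counter

def majority_vote(predictions):
    """predictions[i][j] 为第 i 个模型对第 j 个样本的预测,按样本做多数投票。"""
    n = len(predictions[0])
    return [Counter(p[j] for p in predictions).most_common(1)[0][0]
            for j in range(n)]

def accuracy(pred, gold):
    return sum(p == g for p, g in zip(pred, gold)) / len(gold)

gold = ["A", "B", "A", "B"]
models = [["A", "B", "A", "A"],   # 单模型 75% 准确
          ["A", "B", "B", "B"],   # 单模型 75% 准确
          ["A", "A", "A", "B"]]   # 单模型 75% 准确
fused = majority_vote(models)
print(accuracy(fused, gold))      # 1.0:各模型错在不同样本上,投票互补
```

只要各模型的错误不完全重叠,集成后的准确率就可能超过任一单模型,这与论文中 85.2% 的集成结果一致。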

链接: https://arxiv.org/abs/2504.12256
作者: Andreas Plesner,Turlan Kuzhagaliyev,Roger Wattenhofer
机构: ETH Zurich (瑞士苏黎世联邦理工学院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Published at First Workshop on Open Science for Foundation Models at ICLR 2025

点击查看摘要

Abstract:Over the past years, advances in artificial intelligence (AI) have demonstrated how AI can solve many perception and generation tasks, such as image classification and text writing, yet reasoning remains a challenge. This paper introduces the FLIP dataset, a benchmark for evaluating AI reasoning capabilities based on human verification tasks on the Idena blockchain. FLIP challenges present users with two orderings of 4 images, requiring them to identify the logically coherent one. By emphasizing sequential reasoning, visual storytelling, and common sense, FLIP provides a unique testbed for multimodal AI systems. Our experiments evaluate state-of-the-art models, leveraging both vision-language models (VLMs) and large language models (LLMs). Results reveal that even the best open-sourced and closed-sourced models achieve maximum accuracies of 75.5% and 77.9%, respectively, in zero-shot settings, compared to human performance of 95.3%. Captioning models aid reasoning models by providing text descriptions of images, yielding better results than when using the raw images directly, 69.6% vs. 75.2% for Gemini 1.5 Pro. Combining the predictions from 15 models in an ensemble increases the accuracy to 85.2%. These findings highlight the limitations of existing reasoning models and the need for robust multimodal benchmarks like FLIP. The full codebase and dataset will be available at this https URL.
zh

[CV-8] Human Aligned Compression for Robust Models CVPR2025

【速读】:该论文试图解决图像模型在对抗攻击下的鲁棒性问题,即通过引入人类对齐的学习型有损压缩方法来防御由不可察觉扰动引起的错误预测。论文的关键解决方案在于采用两种学习型压缩模型(HiFiC 和 ELIC)与传统 JPEG 进行对比,证明学习型压缩方法在保留语义上有意义的内容同时去除对抗噪声方面优于 JPEG,尤其是在视觉变换器(Vision Transformer)架构中表现更佳。此外,论文还发现多次迭代的压缩-解压缩过程能够显著提升防御效果,同时保持分类性能。这些发现表明,人类对齐的压缩提供了一种有效且计算高效的防御策略,保护了对人类和机器理解至关重要的图像特征。
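
“多轮压缩-解压可消除不可察觉扰动”的直觉,可以用一个玩具有损编解码(均匀量化)来说明。注意这只是示意:论文实际采用的是 HiFiC、ELIC 与 JPEG,而非这里假设的量化器:

```python
def lossy_codec(pixels, step=16):
    """玩具有损“编解码”:量化到 step 的整数倍,丢弃低于量化步长的细节。"""
    return [round(p / step) * step for p in pixels]

def sequential_defense(pixels, rounds=3, step=16):
    """多轮压缩/解压:幅度小于半个量化步长的对抗扰动会被完全吸收。"""
    for _ in range(rounds):
        pixels = lossy_codec(pixels, step)
    return pixels

clean = [0, 16, 32, 48, 224]                              # 恰为量化格点的“干净”像素
adv = [p + d for p, d in zip(clean, [3, -5, 7, -2, 4])]   # 不可察觉扰动,|δ| < 8
print(sequential_defense(adv) == clean)                   # True:扰动被量化吸收
```

有损编码保留语义主体、丢弃细微高频成分,而对抗扰动恰好属于后者,这正是该防御思路的核心。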

链接: https://arxiv.org/abs/2504.12255
作者: Samuel Räber,Andreas Plesner,Till Aczel,Roger Wattenhofer
机构: ETH Zürich (苏黎世联邦理工学院)
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注: Presented at the Workshop AdvML at CVPR 2025

点击查看摘要

Abstract:Adversarial attacks on image models threaten system robustness by introducing imperceptible perturbations that cause incorrect predictions. We investigate human-aligned learned lossy compression as a defense mechanism, comparing two learned models (HiFiC and ELIC) against traditional JPEG across various quality levels. Our experiments on ImageNet subsets demonstrate that learned compression methods outperform JPEG, particularly for Vision Transformer architectures, by preserving semantically meaningful content while removing adversarial noise. Even in white-box settings where attackers can access the defense, these methods maintain substantial effectiveness. We also show that sequential compression–applying rounds of compression/decompression–significantly enhances defense efficacy while maintaining classification performance. Our findings reveal that human-aligned compression provides an effective, computationally efficient defense that protects the image features most relevant to human and machine understanding. It offers a practical approach to improving model robustness against adversarial threats.
zh

[CV-9] SIDME: Self-supervised Image Demoiréing via Masked Encoder-Decoder Reconstruction

【速读】:该论文旨在解决由物体光信号与相机采样频率之间混叠引起的莫尔纹(Moiré patterns)现象导致的图像质量下降问题。现有传统去莫尔纹方法通常将整幅图像视为整体进行处理和训练,忽视了不同颜色通道的独特信号特性,且难以应对莫尔纹生成的随机性和变化性,从而影响其在实际应用中的鲁棒性。为了解决这些问题,论文提出了一种名为SIDME(Self-supervised Image Demoiréing via Masked Encoder-Decoder Reconstruction)的新模型。该模型的关键创新在于结合了掩码编码器-解码器架构与自监督学习,并利用相机采样频率的固有属性来重建图像。此外,针对相机采样中绿通道具有更高采样频率的特点,设计了专门的自监督损失函数以提高训练效率和效果。同时,为了增强模型的泛化能力,开发了一种自监督的莫尔纹图像生成方法,用于构建接近真实场景的数据集。实验结果表明,SIDME在处理真实莫尔纹数据时优于现有方法,展现出卓越的泛化性能和鲁棒性。
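
SIDME 为绿通道设计专门自监督损失的依据是:相机拜耳(Bayer)阵列中绿色采样点数是红/蓝的两倍。以下示意(假设 RGGB 排列)可直接验证这一点:

```python
def bayer_channel_counts(height, width):
    """统计 RGGB 拜耳阵列各通道的采样点数:绿色位于每个 2x2 单元的对角线上。"""
    counts = {"R": 0, "G": 0, "B": 0}
    for i in range(height):
        for j in range(width):
            if i % 2 == 0 and j % 2 == 0:
                counts["R"] += 1        # 偶行偶列:红
            elif i % 2 == 1 and j % 2 == 1:
                counts["B"] += 1        # 奇行奇列:蓝
            else:
                counts["G"] += 1        # 其余位置:绿
    return counts

print(bayer_channel_counts(4, 4))  # {'R': 4, 'G': 8, 'B': 4}
```

绿通道采样频率更高,意味着它对莫尔纹混叠的记录更完整,因而可作为更可靠的自监督信号来源。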

链接: https://arxiv.org/abs/2504.12245
作者: Xia Wang,Haiyang Sun,Tiantian Cao,Yueying Sun,Min Feng
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注: 21 pages, 13 figures

点击查看摘要

Abstract:Moiré patterns, resulting from aliasing between object light signals and camera sampling frequencies, often degrade image quality during capture. Traditional demoiréing methods have generally treated images as a whole for processing and training, neglecting the unique signal characteristics of different color channels. Moreover, the randomness and variability of moiré pattern generation pose challenges to the robustness of existing methods when applied to real-world data. To address these issues, this paper presents SIDME (Self-supervised Image Demoiréing via Masked Encoder-Decoder Reconstruction), a novel model designed to generate high-quality visual images by effectively processing moiré patterns. SIDME combines a masked encoder-decoder architecture with self-supervised learning, allowing the model to reconstruct images using the inherent properties of camera sampling frequencies. A key innovation is the random masked image reconstructor, which utilizes an encoder-decoder structure to handle the reconstruction task. Furthermore, since the green channel in camera sampling has a higher sampling frequency compared to red and blue channels, a specialized self-supervised loss function is designed to improve the training efficiency and effectiveness. To ensure the generalization ability of the model, a self-supervised moiré image generation method has been developed to produce a dataset that closely mimics real-world conditions. Extensive experiments demonstrate that SIDME outperforms existing methods in processing real moiré pattern data, showing its superior generalization performance and robustness.
zh

[CV-10] Cobra: Efficient Line Art COlorization with BRoAder References

【速读】:该论文致力于解决基于参考线稿的高精度、高效、上下文一致且具备灵活控制的漫画上色问题。尽管图像生成领域的扩散模型(Diffusion Models)已取得进展,但其在线稿上色中的应用仍受限于处理大量参考图像、推理时间过长以及灵活性不足等挑战。为应对这些难题,论文提出了一种名为Cobra的方法,其关键在于引入因果稀疏DiT(Causal Sparse DiT)架构,通过专门设计的位置编码、因果稀疏注意力机制及键值缓存(Key-Value Cache),有效管理长上下文参考信息并确保颜色一致性。这一架构使得Cobra能够在利用超过200张参考图像的同时保持低延迟,显著提升了推理速度与交互性,满足了工业应用的核心需求。

链接: https://arxiv.org/abs/2504.12240
作者: Junhao Zhuang,Lingen Li,Xuan Ju,Zhaoyang Zhang,Chun Yuan,Ying Shan
机构: Tsinghua University (清华大学); The Chinese University of Hong Kong (香港中文大学); Tencent ARC Lab (腾讯ARC实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project page with code: this https URL

点击查看摘要

Abstract:The comic production industry requires reference-based line art colorization with high accuracy, efficiency, contextual consistency, and flexible control. A comic page often involves diverse characters, objects, and backgrounds, which complicates the coloring process. Despite advancements in diffusion models for image generation, their application in line art colorization remains limited, facing challenges related to handling extensive reference images, time-consuming inference, and flexible control. We investigate the necessity of extensive contextual image guidance on the quality of line art colorization. To address these challenges, we introduce Cobra, an efficient and versatile method that supports color hints and utilizes over 200 reference images while maintaining low latency. Central to Cobra is a Causal Sparse DiT architecture, which leverages specially designed positional encodings, causal sparse attention, and Key-Value Cache to effectively manage long-context references and ensure color identity consistency. Results demonstrate that Cobra achieves accurate line art colorization through extensive contextual reference, significantly enhancing inference speed and interactivity, thereby meeting critical industrial demands. We release our codes and models on our project page: this https URL.
zh

[CV-11] Coding-Prior Guided Diffusion Network for Video Deblurring

【速读】:该论文旨在解决视频去模糊化方法中未充分利用两类重要先验信息的问题:(1) 来自视频编解码器的运动矢量 (Motion Vectors, MVs) 和编码残差 (Coding Residuals, CRs),它们提供了高效的帧间对齐线索;(2) 预训练扩散生成模型中蕴含的丰富真实世界知识。为了解决这些问题,论文提出了CPGDNet,这是一种新颖的两阶段框架,关键在于有效结合了编码先验 (Coding Priors) 和生成扩散先验 (Generative Diffusion Priors)。具体而言,首先通过编码先验特征传播 (Coding-Prior Feature Propagation, CPFP) 模块利用MV进行高效帧对齐,并使用CR生成注意力掩膜以应对运动不准确性和纹理变化;其次,编码先验控制生成 (Coding-Prior Controlled Generation, CPC) 模块将编码先验整合到预训练扩散模型中,引导其增强关键区域并合成逼真的细节。实验表明,该方法在感知质量方面达到最先进的性能,在图像质量评估 (IQA) 指标上提升了多达30%。
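
利用编码器运动矢量(MV)做帧间对齐的基本操作,可以用块级平移的示意代码表达(简化版:实际编解码器的 MV 通常为亚像素精度,且块大小可变):

```python
def mv_align(ref_frame, mvs, block=2):
    """按每个块的运动矢量 (dy, dx) 从参考帧取样,得到与当前帧对齐的预测帧。"""
    h, w = len(ref_frame), len(ref_frame[0])
    out = [[0] * w for _ in range(h)]
    for by in range(0, h, block):
        for bx in range(0, w, block):
            dy, dx = mvs[by // block][bx // block]
            for y in range(by, min(by + block, h)):
                for x in range(bx, min(bx + block, w)):
                    sy, sx = y + dy, x + dx
                    if 0 <= sy < h and 0 <= sx < w:  # 越界位置保持 0
                        out[y][x] = ref_frame[sy][sx]
    return out

ref = [[1, 2, 3, 4],
       [5, 6, 7, 8],
       [9, 10, 11, 12],
       [13, 14, 15, 16]]
# 所有块的 MV 均指向下一行:相当于参考帧整体上移一行后与当前帧对齐
mvs = [[(1, 0), (1, 0)], [(1, 0), (1, 0)]]
print(mv_align(ref, mvs))
```

由于 MV 在码流中已经存在,这种对齐几乎零成本,这正是 CPGDNet 选择复用编码先验而非重新估计光流的原因。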

链接: https://arxiv.org/abs/2504.12222
作者: Yike Liu,Jianhui Zhang,Haipeng Li,Shuaicheng Liu,Bing Zeng
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:While recent video deblurring methods have advanced significantly, they often overlook two valuable prior information: (1) motion vectors (MVs) and coding residuals (CRs) from video codecs, which provide efficient inter-frame alignment cues, and (2) the rich real-world knowledge embedded in pre-trained diffusion generative models. We present CPGDNet, a novel two-stage framework that effectively leverages both coding priors and generative diffusion priors for high-quality deblurring. First, our coding-prior feature propagation (CPFP) module utilizes MVs for efficient frame alignment and CRs to generate attention masks, addressing motion inaccuracies and texture variations. Second, a coding-prior controlled generation (CPC) module network integrates coding priors into a pretrained diffusion model, guiding it to enhance critical regions and synthesize realistic details. Experiments demonstrate our method achieves state-of-the-art perceptual quality with up to 30% improvement in IQA metrics. Both the code and the codingprior-augmented dataset will be open-sourced.
zh

[CV-12] Uncertainty-Guided Coarse-to-Fine Tumor Segmentation with Anatomy-Aware Post-Processing

【速读】:该论文旨在解决胸腔 CT 中肿瘤分割面临的边界模糊、类别不平衡及解剖学变异等挑战。论文提出了一种基于不确定性引导的粗到细分割框架,结合全体积肿瘤定位与精化感兴趣区域(ROI)分割,并通过解剖学感知后处理增强性能。关键在于首先利用第一阶段模型生成粗略预测,随后基于与肺部的重叠程度、与肺表面的距离以及连通域大小进行解剖学信息过滤,得到 ROI 后由第二阶段模型进一步分割,该模型采用不确定性感知损失函数以提高在模糊区域的精度和边界校准能力。实验结果表明,该方法在 Dice 和 Hausdorff 分数上均有提升,同时减少了假阳性并增强了空间可解释性。
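
解剖学感知后处理的核心是按连通域属性(大小、与肺掩膜的重叠率)筛除假阳性。下面用二维小掩膜给出示意(阈值为假设值,实际框架在三维体数据上操作):

```python
from collections import deque

def connected_components(mask):
    """4 邻域连通域标记,mask 为 0/1 的二维列表,返回各连通域的坐标集合。"""
    h, w = len(mask), len(mask[0])
    labels = [[0] * w for _ in range(h)]
    comps, cur = [], 0
    for i in range(h):
        for j in range(w):
            if mask[i][j] and not labels[i][j]:
                cur += 1
                q, comp = deque([(i, j)]), []
                labels[i][j] = cur
                while q:
                    y, x = q.popleft()
                    comp.append((y, x))
                    for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        ny, nx = y + dy, x + dx
                        if 0 <= ny < h and 0 <= nx < w and mask[ny][nx] and not labels[ny][nx]:
                            labels[ny][nx] = cur
                            q.append((ny, nx))
                comps.append(comp)
    return comps

def filter_components(mask, lung, min_size=2, min_overlap=0.5):
    """解剖学过滤:仅保留体素数达标且与肺掩膜重叠率足够的连通域。"""
    kept = [[0] * len(mask[0]) for _ in mask]
    for comp in connected_components(mask):
        overlap = sum(lung[y][x] for y, x in comp) / len(comp)
        if len(comp) >= min_size and overlap >= min_overlap:
            for y, x in comp:
                kept[y][x] = 1
    return kept

mask = [[1, 1, 0, 0],
        [0, 0, 0, 1],   # (1,3) 为肺外孤立小区域,应被滤除
        [0, 0, 0, 0]]
lung = [[1, 1, 1, 0],
        [1, 1, 0, 0],
        [0, 0, 0, 0]]
print(filter_components(mask, lung))
```

这类过滤直接对应论文中“假阳性连通域减少与分割增益强相关”的观察。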

链接: https://arxiv.org/abs/2504.12215
作者: Ilkin Sevgi Isler,David Mohaisen,Curtis Lisle,Damla Turgut,Ulas Bagci
机构: Department of Computer Science (计算机科学系), University of Central Florida (中佛罗里达大学); KnowledgeVis LLC (KnowledgeVis有限责任公司), Altamonte Springs, FL, USA; Department of Radiology (放射学系), Northwestern University (西北大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 6 pages, 2 figures, to appear in IEEE ADSCA 2025

点击查看摘要

Abstract:Reliable tumor segmentation in thoracic computed tomography (CT) remains challenging due to boundary ambiguity, class imbalance, and anatomical variability. We propose an uncertainty-guided, coarse-to-fine segmentation framework that combines full-volume tumor localization with refined region-of-interest (ROI) segmentation, enhanced by anatomically aware post-processing. The first-stage model generates a coarse prediction, followed by anatomically informed filtering based on lung overlap, proximity to lung surfaces, and component size. The resulting ROIs are segmented by a second-stage model trained with uncertainty-aware loss functions to improve accuracy and boundary calibration in ambiguous regions. Experiments on private and public datasets demonstrate improvements in Dice and Hausdorff scores, with fewer false positives and enhanced spatial interpretability. These results highlight the value of combining uncertainty modeling and anatomical priors in cascaded segmentation pipelines for robust and clinically meaningful tumor delineation. On the Orlando dataset, our framework improved Swin UNETR Dice from 0.4690 to 0.6447. Reduction in spurious components was strongly correlated with segmentation gains, underscoring the value of anatomically informed post-processing.
zh

[CV-13] Towards Realistic Low-Light Image Enhancement via ISP Driven Data Modeling

【速读】:该论文旨在解决低光图像增强(Low-Light Image Enhancement, LLIE)任务中深度神经网络(Deep Neural Networks, DNNs)在实际应用中可能产生的问题,如噪声放大、白平衡错误或不自然的增强效果。这些问题的根本原因在于缺乏多样化且大规模的训练数据,以充分表征低光条件及其成像管道的复杂性。论文的关键解决方案是提出了一种基于图像信号处理(Image Signal Processing, ISP)驱动的数据合成管道,通过生成无限配对的训练数据来克服上述挑战。具体而言,该管道从高质量的正常光照图像开始,将其逆向转换为RAW格式,并直接在RAW域中合成低光退化。随后,这些数据经过一系列受控变化的ISP阶段处理,包括白平衡调整、色彩空间转换、色调映射和伽马校正,从而扩展退化空间并提高训练数据的多样性。实验结果表明,使用该合成管道训练的基础UNet模型在多个数据集上的表现超越了当前最先进的方法,无论是定量指标还是视觉效果均表现出色。
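
合成管道所经过的白平衡、伽马校正等 ISP 阶段,可用逐像素的示意函数表达(增益与 γ 值为示例参数,并非论文管道的实际配置;在各阶段对参数引入受控扰动,即对应文中“扩展退化空间”的做法):

```python
def white_balance(rgb, gains=(2.0, 1.0, 1.5)):
    """按通道增益做白平衡,并裁剪到 [0, 1]。"""
    return tuple(min(1.0, c * g) for c, g in zip(rgb, gains))

def gamma_correct(rgb, gamma=2.2):
    """伽马校正:将线性 RAW 值映射到显示域,c -> c^(1/γ)。"""
    return tuple(c ** (1.0 / gamma) for c in rgb)

def mini_isp(raw_rgb):
    """简化 ISP 链:白平衡 -> 伽马校正。"""
    return gamma_correct(white_balance(raw_rgb))

print(mini_isp((0.2, 0.4, 0.1)))
```

真实管道还包含色彩空间转换与色调映射等阶段,但结构与此相同:每一步都是可参数化、可扰动的确定性变换。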

链接: https://arxiv.org/abs/2504.12204
作者: Zhihua Wang,Yu Long,Qinghua Lin,Kai Zhang,Yazhu Zhang,Yuming Fang,Li Liu,Xiaochun Cao
机构: Department of Computer Science, City University of Hong Kong (香港城市大学); Department of Engineering, Shenzhen MSU-BIT University (深圳北理莫斯科大学); School of Intelligence Science and Technology, Nanjing University (南京大学); School of Information Technology, Jiangxi University of Finance and Economics (江西财经大学); College of Electronic Science and Technology, National University of Defense Technology (国防科技大学); School of Cyber Science and Technology, Shenzhen Campus of Sun Yat-sen University (中山大学深圳校区)
类目: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
备注: 17 pages, 11 tables, 10 figures

点击查看摘要

Abstract:Deep neural networks (DNNs) have recently become the leading method for low-light image enhancement (LLIE). However, despite significant progress, their outputs may still exhibit issues such as amplified noise, incorrect white balance, or unnatural enhancements when deployed in real world applications. A key challenge is the lack of diverse, large scale training data that captures the complexities of low-light conditions and imaging pipelines. In this paper, we propose a novel image signal processing (ISP) driven data synthesis pipeline that addresses these challenges by generating unlimited paired training data. Specifically, our pipeline begins with easily collected high-quality normal-light images, which are first unprocessed into the RAW format using a reverse ISP. We then synthesize low-light degradations directly in the RAW domain. The resulting data is subsequently processed through a series of ISP stages, including white balance adjustment, color space conversion, tone mapping, and gamma correction, with controlled variations introduced at each stage. This broadens the degradation space and enhances the diversity of the training data, enabling the generated data to capture a wide range of degradations and the complexities inherent in the ISP pipeline. To demonstrate the effectiveness of our synthetic pipeline, we conduct extensive experiments using a vanilla UNet model consisting solely of convolutional layers, group normalization, GeLU activation, and convolutional block attention modules (CBAMs). Extensive testing across multiple datasets reveals that the vanilla UNet model trained with our data synthesis pipeline delivers high fidelity, visually appealing enhancement results, surpassing state-of-the-art (SOTA) methods both quantitatively and qualitatively.
zh

[CV-14] Beyond Patches: Mining Interpretable Part-Prototypes for Explainable AI

【速读】:该论文致力于解决深度学习模型在多媒体系统中的可解释性挑战,特别是现有后验解释方法(如GradCAM)缺乏概念清晰度,以及基于原型的方法(如ProtoPNet和PIPNet)依赖固定补丁导致鲁棒性和语义一致性不足的问题。论文的关键解决方案是提出了一种部分原型概念挖掘网络(PCMNet),它能够从有意义的区域动态学习可解释的原型,并通过聚类将原型分组成概念组,从而生成语义基础的解释,无需额外标注。PCMNet通过无监督部件发现与概念激活向量提取的联合过程,有效捕捉判别性概念并实现可解释的分类决策,从而在清洁和遮挡场景下提供了高水平的可解释性、稳定性和鲁棒性。

链接: https://arxiv.org/abs/2504.12197
作者: Mahdi Alehdaghi,Rajarshi Bhattacharya,Pourya Shamsolmoali,Rafael M.O. Cruz,Maguelonne Heritier,Eric Granger
机构: LIVIA, ILLS, Dept. of Systems Engineering, École de technologie supérieure (魁北克省蒙特利尔); Dept. of Computer Science, University of York (英国约克); Genetec Inc. (加拿大蒙特利尔); LIVIA, ILLS, Dept. of Systems Engineering, École de technologie supérieure (魁北克省蒙特利尔); Genetec Inc. (加拿大蒙特利尔)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Deep learning has provided considerable advancements for multimedia systems, yet the interpretability of deep models remains a challenge. State-of-the-art post-hoc explainability methods, such as GradCAM, provide visual interpretation based on heatmaps but lack conceptual clarity. Prototype-based approaches, like ProtoPNet and PIPNet, offer a more structured explanation but rely on fixed patches, limiting their robustness and semantic consistency. To address these limitations, a part-prototypical concept mining network (PCMNet) is proposed that dynamically learns interpretable prototypes from meaningful regions. PCMNet clusters prototypes into concept groups, creating semantically grounded explanations without requiring additional annotations. Through a joint process of unsupervised part discovery and concept activation vector extraction, PCMNet effectively captures discriminative concepts and makes interpretable classification decisions. Our extensive experiments comparing PCMNet against state-of-the-art methods on multiple datasets show that it can provide a high level of interpretability, stability, and robustness under clean and occluded scenarios.
zh

[CV-15] CoMotion: Concurrent Multi-person 3D Motion ICLR2025

【速读】:该论文旨在解决从单目摄像机流中检测和跟踪多人详细三维姿态的问题,尤其关注在拥挤场景中包含复杂姿态和遮挡情况下的时间一致性预测。解决方案的关键在于模型不仅实现了每帧的强检测能力,还通过学习到的姿态更新机制实现跨帧的人体跟踪,而非依赖于时间上的检测匹配。这种直接从新输入图像更新姿态的方式使得系统能够在遮挡情况下进行在线跟踪,同时通过利用伪标签标注的多种图像和视频数据集训练,确保了在多人长时间跟踪中的高效性和准确性。

链接: https://arxiv.org/abs/2504.12186
作者: Alejandro Newell,Peiyun Hu,Lahav Lipson,Stephan R. Richter,Vladlen Koltun
机构: Apple
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Accepted at ICLR 2025, for code and weights go to this https URL

点击查看摘要

Abstract:We introduce an approach for detecting and tracking detailed 3D poses of multiple people from a single monocular camera stream. Our system maintains temporally coherent predictions in crowded scenes filled with difficult poses and occlusions. Our model performs both strong per-frame detection and a learned pose update to track people from frame to frame. Rather than match detections across time, poses are updated directly from a new input image, which enables online tracking through occlusion. We train on numerous image and video datasets leveraging pseudo-labeled annotations to produce a model that matches state-of-the-art systems in 3D pose estimation accuracy while being faster and more accurate in tracking multiple people through time. Code and weights are provided at this https URL
zh

[CV-16] Towards a General-Purpose Zero-Shot Synthetic Low-Light Image and Video Pipeline

【速读】:该论文旨在解决低光条件下图像和视频标注不足的问题,特别是缺乏针对低光环境下的机器理解研究。传统方法通常依赖于从高质量数据集迁移标注,并通过合成低光版本来实现,但这些方法往往受限于不切实际的噪声模型。论文的关键在于提出了一种新的退化估计网络(Degradation Estimation Network, DEN),其核心创新点是无需相机元数据即可合成具有现实感的标准RGB (sRGB) 噪声。这一目标是在自监督框架下通过估计物理驱动的噪声分布参数实现的。该零样本(zero-shot)方法能够生成多样化的、具有真实噪声特性的合成噪声内容,与仅关注再现训练数据噪声特征的传统方法形成对比。实验结果表明,基于所提出的合成管道训练的方法,在典型低光任务(如合成噪声复制、视频增强和对象检测)中取得了高达24%的KL散度(KLD)、21%的LPIPS以及62%的平均精度均值(AP_50-95)的性能提升。
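
“物理驱动噪声分布”的一种常见形式是异方差高斯:方差由随信号线性增长的散粒噪声项加恒定读出噪声项组成。以下为假设参数下的示意(参数值与论文估计结果无关):

```python
import random

def physics_informed_noise(signal, shot_gain=0.01, read_var=0.0004, seed=0):
    """异方差噪声模型:var = shot_gain * s + read_var,信号越亮噪声方差越大。"""
    rng = random.Random(seed)
    noisy = []
    for s in signal:
        std = (shot_gain * s + read_var) ** 0.5
        noisy.append(min(1.0, max(0.0, s + rng.gauss(0.0, std))))
    return noisy

def empirical_var(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

bright = physics_informed_noise([0.8] * 5000)
dark = physics_informed_noise([0.05] * 5000)
print(empirical_var(bright) > empirical_var(dark))  # True:亮区噪声方差更大
```

DEN 所做的事情可以理解为在自监督框架下估计此类分布的参数(对应这里的 `shot_gain`、`read_var`),而不是直接复刻训练数据的噪声样本。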

链接: https://arxiv.org/abs/2504.12169
作者: Joanne Lin,Crispian Morris,Ruirui Lin,Fan Zhang,David Bull,Nantheera Anantrasirichai
机构: Visual Information Laboratory, University of Bristol (视觉信息实验室, 布里斯托尔大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注:

点击查看摘要

Abstract:Low-light conditions pose significant challenges for both human and machine annotation. This in turn has led to a lack of research into machine understanding for low-light images and (in particular) videos. A common approach is to apply annotations obtained from high quality datasets to synthetically created low light versions. In addition, these approaches are often limited through the use of unrealistic noise models. In this paper, we propose a new Degradation Estimation Network (DEN), which synthetically generates realistic standard RGB (sRGB) noise without the requirement for camera metadata. This is achieved by estimating the parameters of physics-informed noise distributions, trained in a self-supervised manner. This zero-shot approach allows our method to generate synthetic noisy content with a diverse range of realistic noise characteristics, unlike other methods which focus on recreating the noise characteristics of the training data. We evaluate our proposed synthetic pipeline using various methods trained on its synthetic data for typical low-light tasks including synthetic noise replication, video enhancement, and object detection, showing improvements of up to 24% KLD, 21% LPIPS, and 62% AP_50-95, respectively.
zh

[CV-17] RADLER: Radar Object Detection Leveraging Semantic 3D City Models and Self-Supervised Radar-Image Learning CVPR

【速读】:该论文旨在解决雷达目标检测中噪声影响的问题,并探索语义三维城市模型在缓解这一问题中的潜力。论文的关键在于提出了一种结合对比自监督学习(Contrastive Self-Supervised Learning, SSL)与语义三维城市模型的新方法。具体而言,首先通过SSL网络在雷达-图像预训练任务中获取鲁棒的雷达特征;然后利用一种简单有效的特征融合策略,将语义深度信息从语义三维城市模型中引入雷达检测中。借助先验三维信息的引导,所提出的RADLER网络能够增强行人、骑行者与汽车的雷达目标检测性能。实验结果表明,RADLER在RadarCity数据集上的平均精度均值(mAP)提升了5.46%,平均召回率均值(mAR)提升了3.51%,显著优于现有方法。

链接: https://arxiv.org/abs/2504.12167
作者: Yuan Luo,Rudolf Hoffmann,Yan Xia,Olaf Wysocki,Benedikt Schwab,Thomas H. Kolbe,Daniel Cremers
机构: Technical University of Munich (慕尼黑工业大学); MCML; GPP Communication GmbH
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: The paper accepted for CVPRW '25 (PBVS 2025 - the Perception Beyond the Visible Spectrum)

点击查看摘要

Abstract:Semantic 3D city models are worldwide easy-accessible, providing accurate, object-oriented, and semantic-rich 3D priors. To date, their potential to mitigate the noise impact on radar object detection remains under-explored. In this paper, we first introduce a unique dataset, RadarCity, comprising 54K synchronized radar-image pairs and semantic 3D city models. Moreover, we propose a novel neural network, RADLER, leveraging the effectiveness of contrastive self-supervised learning (SSL) and semantic 3D city models to enhance radar object detection of pedestrians, cyclists, and cars. Specifically, we first obtain the robust radar features via a SSL network in the radar-image pretext task. We then use a simple yet effective feature fusion strategy to incorporate semantic-depth features from semantic 3D city models. Having prior 3D information as guidance, RADLER obtains more fine-grained details to enhance radar object detection. We extensively evaluate RADLER on the collected RadarCity dataset and demonstrate average improvements of 5.46% in mean average precision (mAP) and 3.51% in mean average recall (mAR) over previous radar object detection methods. We believe this work will foster further research on semantic-guided and map-supported radar object detection. Our project page is publicly available at this https URL.
zh

[CV-18] CodingHomo: Bootstrapping Deep Homography With Video Coding

【Quick Read】: This paper tackles the challenging problem of accurately estimating homography under complex motion. Deep unsupervised methods have improved robustness and generalization, but they still fall short when motion is complex. To address this, the paper exploits the motion vectors (MVs) inherent in video coding. The key components are a Mask-Guided Fusion (MGF) module, which selects and fuses beneficial features from the MVs to improve homography prediction, and a Mask-Guided Homography Estimation (MGHE) module, which removes undesired features during coarse-to-fine homography refinement. With these designs, the CodingHomo framework outperforms existing unsupervised methods, delivering good robustness and generalizability.

Link: https://arxiv.org/abs/2504.12165
Authors: Yike Liu, Haipeng Li, Shuaicheng Liu, Bing Zeng
Affiliations: School of Information and Communication Engineering, University of Electronic Science and Technology of China
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:Homography estimation is a fundamental task in computer vision with applications in diverse fields. Recent advances in deep learning have improved homography estimation, particularly with unsupervised learning approaches, offering increased robustness and generalizability. However, accurately predicting homography, especially in complex motions, remains a challenge. In response, this work introduces a novel method leveraging video coding, particularly by harnessing inherent motion vectors (MVs) present in videos. We present CodingHomo, an unsupervised framework for homography estimation. Our framework features a Mask-Guided Fusion (MGF) module that identifies and utilizes beneficial features among the MVs, thereby enhancing the accuracy of homography prediction. Additionally, the Mask-Guided Homography Estimation (MGHE) module is presented for eliminating undesired features in the coarse-to-fine homography refinement process. CodingHomo outperforms existing state-of-the-art unsupervised methods, delivering good robustness and generalizability. The code and dataset are available at: this https URL
zh

[CV-19] FocusedAD: Character-centric Movie Audio Description

【Quick Read】: Movie audio description (AD) narrates plot-relevant content during dialogue-free segments, including explicit character-name references, primarily serving blind and visually impaired (BVI) audiences. Compared with generic video captioning, AD poses unique challenges, such as identifying the active main characters and focusing on plot-relevant regions.

To address these challenges, the paper proposes FocusedAD, a framework that generates character-centric movie audio descriptions. Its key components are: (i) a Character Perception Module (CPM) that tracks character regions and links them to names; (ii) a Dynamic Prior Module (DPM) that injects contextual cues from prior ADs and subtitles via learnable soft prompts; and (iii) a Focused Caption Module (FCM) that produces narrations enriched with plot-relevant details and named characters. To overcome limitations in character identification, an automated pipeline for building character query banks is also introduced.

In short, the key of FocusedAD lies in combining character perception, dynamic context modeling, and plot-focused generation to deliver high-quality movie audio descriptions.

Link: https://arxiv.org/abs/2504.12157
Authors: Xiaojun Ye, Chun Wang, Yiren Song, Sheng Zhou, Liangcheng Li, Jiajun Bu
Affiliations: Zhejiang University; National University of Singapore
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Code and Demo link: this https URL

Click to view abstract

Abstract:Movie Audio Description (AD) aims to narrate visual content during dialogue-free segments, particularly benefiting blind and visually impaired (BVI) audiences. Compared with general video captioning, AD demands plot-relevant narration with explicit character name references, posing unique challenges in movie AD. To identify active main characters and focus on storyline-relevant regions, we propose FocusedAD, a novel framework that delivers character-centric movie audio descriptions. It includes: (i) a Character Perception Module(CPM) for tracking character regions and linking them to names; (ii) a Dynamic Prior Module(DPM) that injects contextual cues from prior ADs and subtitles via learnable soft prompts; and (iii) a Focused Caption Module(FCM) that generates narrations enriched with plot-relevant details and named characters. To overcome limitations in character identification, we also introduce an automated pipeline for building character query banks. FocusedAD achieves state-of-the-art performance on multiple benchmarks, including strong zero-shot results on MAD-eval-Named and our newly proposed Cinepile-AD dataset. Code and data will be released at this https URL .
zh

[CV-20] Weakly Semi-supervised Whole Slide Image Classification by Two-level Cross Consistency Supervision

【Quick Read】: This paper addresses the problem that the high cost of large-scale annotation limits the effectiveness of existing computer-aided whole slide image (WSI) classification methods. The authors introduce the Weakly Semi-supervised Whole slide image Classification (WSWC) setting, in which only a small number of bags are labeled while most remain unlabeled. The key of the solution is a framework named CroCo, which tackles WSWC through two-level cross-consistency supervision: CroCo contains two heterogeneous classifier branches and establishes bag-level and instance-level cross-consistency constraints between them during training, thereby effectively exploiting unlabeled data to improve classification performance.

Link: https://arxiv.org/abs/2504.12132
Authors: Linhao Qu, Shiman Li, Xiaoyuan Luo, Shaolei Liu, Qinhao Guo, Manning Wang, Zhijian Song
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:Computer-aided Whole Slide Image (WSI) classification has the potential to enhance the accuracy and efficiency of clinical pathological diagnosis. It is commonly formulated as a Multiple Instance Learning (MIL) problem, where each WSI is treated as a bag and the small patches extracted from the WSI are considered instances within that bag. However, obtaining labels for a large number of bags is a costly and time-consuming process, particularly when utilizing existing WSIs for new classification tasks. This limitation renders most existing WSI classification methods ineffective. To address this issue, we propose a novel WSI classification problem setting, more aligned with clinical practice, termed Weakly Semi-supervised Whole slide image Classification (WSWC). In WSWC, a small number of bags are labeled, while a significant number of bags remain unlabeled. The MIL nature of the WSWC problem, coupled with the absence of patch labels, distinguishes it from typical semi-supervised image classification problems, making existing algorithms for natural images unsuitable for directly solving the WSWC problem. In this paper, we present a concise and efficient framework, named CroCo, to tackle the WSWC problem through two-level Cross Consistency supervision. CroCo comprises two heterogeneous classifier branches capable of performing both instance classification and bag classification. The fundamental idea is to establish cross-consistency supervision at both the bag-level and instance-level between the two branches during training. Extensive experiments conducted on four datasets demonstrate that CroCo achieves superior bag classification and instance classification performance compared to other comparative methods when limited WSIs with bag labels are available. To the best of our knowledge, this paper presents for the first time the WSWC problem and gives a successful resolution.
zh
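The two-level cross-consistency idea above can be sketched in a few lines. This is a hedged illustration, not CroCo's actual implementation: the max pooling operator, the MSE consistency terms, and the function names are all assumptions made for exposition.

```python
import numpy as np

def bag_score(instance_probs):
    # Simple MIL pooling: a bag is positive if its strongest instance is.
    return instance_probs.max()

def two_level_consistency(inst_a, inst_b):
    """Cross-consistency between two heterogeneous classifier branches.

    inst_a, inst_b: per-instance positive probabilities predicted for the
    same unlabeled bag by the two branches. MSE at both the instance level
    and the bag level is an illustrative choice, not the paper's exact loss.
    """
    instance_term = float(np.mean((inst_a - inst_b) ** 2))
    bag_term = float((bag_score(inst_a) - bag_score(inst_b)) ** 2)
    return instance_term + bag_term

# Agreeing branches incur no penalty; disagreeing ones are pulled together.
p = np.array([0.1, 0.9, 0.2])
q = np.array([0.2, 0.6, 0.2])
loss = two_level_consistency(p, q)
```

Because the term is symmetric in the two branches, each branch supervises the other on unlabeled bags without a fixed teacher.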

[CV-21] Anti-Aesthetics: Protecting Facial Privacy against Customized Text-to-Image Synthesis

【Quick Read】: This paper aims to counter the risks of malicious misuse brought by the boom in personalized visual content creation with customized diffusion models, which severely threatens personal privacy and copyright protection. The key idea is to approach the problem from an aesthetic perspective: degrading the generation quality of maliciously customized models provides better protection of facial identity. Concretely, the authors propose a Hierarchical Anti-Aesthetic (HAA) framework with two key branches: 1) Global Anti-Aesthetics, which establishes a global anti-aesthetic reward mechanism and loss to degrade the overall aesthetics of the generated content; and 2) Local Anti-Aesthetics, which designs a local anti-aesthetic reward mechanism and loss to guide adversarial perturbations that disrupt local facial identity. By seamlessly integrating the two branches, HAA achieves anti-aesthetics from the global to the local level during customized generation, effectively improving identity removal and offering a powerful tool for protecting facial privacy and copyright.

Link: https://arxiv.org/abs/2504.12129
Authors: Songping Wang, Yueming Lyu, Shiqi Liu, Ning Li, Tong Tong, Hao Sun, Caifeng Shan
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:The rise of customized diffusion models has spurred a boom in personalized visual content creation, but also poses risks of malicious misuse, severely threatening personal privacy and copyright protection. Some studies show that the aesthetic properties of images are highly positively correlated with human perception of image quality. Inspired by this, we approach the problem from a novel and intriguing aesthetic perspective to degrade the generation quality of maliciously customized models, thereby achieving better protection of facial identity. Specifically, we propose a Hierarchical Anti-Aesthetic (HAA) framework to fully explore aesthetic cues, which consists of two key branches: 1) Global Anti-Aesthetics: By establishing a global anti-aesthetic reward mechanism and a global anti-aesthetic loss, it can degrade the overall aesthetics of the generated content; 2) Local Anti-Aesthetics: A local anti-aesthetic reward mechanism and a local anti-aesthetic loss are designed to guide adversarial perturbations to disrupt local facial identity. By seamlessly integrating both branches, our HAA effectively achieves the goal of anti-aesthetics from a global to a local level during customized generation. Extensive experiments show that HAA outperforms existing SOTA methods largely in identity removal, providing a powerful tool for protecting facial privacy and copyright.
zh

[CV-22] Remote sensing colour image semantic segmentation of trails created by large herbivorous Mammals

【Quick Read】: This paper aims to automatically detect areas of intense activity by large herbivorous mammals (i.e., to identify grazing trails) using machine learning, which matters for ecosystem conservation and management. The key lies in evaluating and optimizing combinations of semantic segmentation approaches — five segmentation models paired with fourteen encoders — to map grazing trails precisely from aerial imagery. The UNet architecture with the MambaOut encoder performed best, although the actual trail structure was underestimated in a few cases. The approach could be developed into tools for monitoring temporal changes in these landscape structures, supporting habitat conservation and land management programs. To the authors' knowledge, this is the first time competitive image segmentation results have been obtained for detecting and delineating the trails of large herbivores.

Link: https://arxiv.org/abs/2504.12121
Authors: Jose Francisco Diez-Pastor, Francisco Javier Gonzalez-Moya, Pedro Latorre-Carmona, Francisco Javier Perez-Barbería, Ludmila I. Kuncheva, Antonio Canepa-Oneto, Alvar Arnaiz-González, Cesar Garcia-Osorio
Affiliations: Department of Computer Engineering, Universidad de Burgos; Biodiversity Research Institute, Spanish Research Council, University of Oviedo; School of Computer Science and Electronic Engineering, Bangor University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 24 pages, 6 figures. Submitted to Computers and Geosciences

Click to view abstract

Abstract:Detection of spatial areas where biodiversity is at risk is of paramount importance for the conservation and monitoring of ecosystems. Large terrestrial mammalian herbivores are keystone species as their activity not only has deep effects on soils, plants, and animals, but also shapes landscapes, as large herbivores act as allogenic ecosystem engineers. One key landscape feature that indicates intense herbivore activity and potentially impacts biodiversity is the formation of grazing trails. Grazing trails are formed by the continuous trampling activity of large herbivores that can produce complex networks of tracks of bare soil. Here, we evaluated different algorithms based on machine learning techniques to identify grazing trails. Our goal is to automatically detect potential areas with intense herbivory activity, which might be beneficial for conservation and management plans. We have applied five semantic segmentation methods combined with fourteen encoders aimed at mapping grazing trails on aerial images. Our results indicate that in most cases the chosen methodology successfully mapped the trails, although there were a few instances where the actual trail structure was underestimated. The UNet architecture with the MambaOut encoder was the best architecture for mapping trails. The proposed approach could be applied to develop tools for mapping and monitoring temporal changes in these landscape structures to support habitat conservation and land management programs. This is the first time, to the best of our knowledge, that competitive image segmentation results are obtained for the detection and delineation of trails of large herbivorous mammals.
zh

[CV-23] A Diffusion-Based Framework for Terrain-Aware Remote Sensing Image Reconstruction

【Quick Read】: This paper addresses missing data in remote sensing imagery caused by cloud cover, sensor failures, or incomplete acquisition, which is especially limiting in high-resolution and high-frequency tasks. Traditional interpolation struggles with large missing areas and complex structures, and cannot guarantee consistency across spectral bands. The key of the solution is SatelliteMaker, a diffusion-based method that reconstructs missing data under varying levels of data loss while maintaining spatial, spectral, and temporal consistency. In addition, a Digital Elevation Model (DEM) is introduced as a conditioning input, and tailored prompts make diffusion models applicable to quantitative remote sensing tasks. A VGG-Adapter module based on a distribution loss is further proposed to reduce distribution discrepancy and ensure style consistency. Experiments show that SatelliteMaker achieves state-of-the-art performance across multiple tasks.

Link: https://arxiv.org/abs/2504.12112
Authors: Zhenyu Yu, Mohd Yamani Inda Idris, Pei Wang
Affiliations: Universiti Malaya; Kunming University of Science and Technology
Subjects: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments:

Click to view abstract

Abstract:Remote sensing imagery is essential for environmental monitoring, agricultural management, and disaster response. However, data loss due to cloud cover, sensor failures, or incomplete acquisition-especially in high-resolution and high-frequency tasks-severely limits satellite imagery’s effectiveness. Traditional interpolation methods struggle with large missing areas and complex structures. Remote sensing imagery consists of multiple bands, each with distinct meanings, and ensuring consistency across bands is critical to avoid anomalies in the combined images. This paper proposes SatelliteMaker, a diffusion-based method that reconstructs missing data across varying levels of data loss while maintaining spatial, spectral, and temporal consistency. We also propose Digital Elevation Model (DEM) as a conditioning input and use tailored prompts to generate realistic images, making diffusion models applicable to quantitative remote sensing tasks. Additionally, we propose a VGG-Adapter module based on Distribution Loss, which reduces distribution discrepancy and ensures style consistency. Extensive experiments show that SatelliteMaker achieves state-of-the-art performance across multiple tasks.
zh

[CV-24] Logits DeConfusion with CLIP for Few-Shot Learning CVPR2025

【Quick Read】: This paper addresses the serious inter-class confusion found in CLIP's logits on downstream tasks, where ambiguity between categories severely harms classification accuracy. The key of the proposed Logits DeConfusion method is to combine a Multi-level Adapter Fusion (MAF) module with an Inter-Class Deconfusion (ICD) module. MAF extracts information from different feature levels and fuses it uniformly to strengthen the feature representation, while ICD learnably eliminates inter-class confusion in the logits through a residual structure. Experiments show that the method significantly improves classification performance and alleviates the inter-class confusion problem.

Link: https://arxiv.org/abs/2504.12104
Authors: Shuo Li, Fang Liu, Zehua Hao, Xinyi Wang, Lingling Li, Xu Liu, Puhua Chen, Wenping Ma
Affiliations: Key Laboratory of Intelligent Perception and Image Understanding of Ministry of Education; International Research Center for Intelligent Perception and Computation; Joint International Research Laboratory of Intelligent Perception and Computation; School of Artificial Intelligence, Xidian University, Xi'an 710071, China
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: CVPR 2025

Click to view abstract

Abstract:With its powerful visual-language alignment capability, CLIP performs well in zero-shot and few-shot learning tasks. However, we found in experiments that CLIP’s logits suffer from serious inter-class confusion problems in downstream tasks, and the ambiguity between categories seriously affects the accuracy. To address this challenge, we propose a novel method called Logits DeConfusion, which effectively learns and eliminates inter-class confusion in logits by combining our Multi-level Adapter Fusion (MAF) module with our Inter-Class Deconfusion (ICD) module. Our MAF extracts features from different levels and fuses them uniformly to enhance feature representation. Our ICD learnably eliminates inter-class confusion in logits with a residual structure. Experimental results show that our method can significantly improve the classification performance and alleviate the inter-class confusion problem. The code is available at this https URL.
zh

[CV-25] Metric-Solver: Sliding Anchored Metric Depth Estimation from a Single Image

【Quick Read】: This paper tackles accurate and generalizable metric depth estimation, which is challenging because of the diverse depth scales found in indoor and outdoor scenes. The key of the solution is Metric-Solver, a sliding anchor-based method that adapts dynamically to varying scene scales. With an anchor-based representation, a reference depth serves as the anchor that separates and normalizes scene depth into two components: a scaled near-field depth and a tapered far-field depth. The anchor acts as a normalization factor, keeping near-field depth within a consistent range while mapping far-field depth smoothly toward zero, so that any depth from zero to infinity is represented in a unified form. Moreover, for the same scene the anchor can slide along the depth axis to adjust to different depth scales: a smaller anchor increases near-field resolution and improves depth precision for close objects, while a larger anchor improves estimation in far regions. This flexibility lets the model predict depth at varying distances and generalize strongly across datasets.

Link: https://arxiv.org/abs/2504.12103
Authors: Tao Wen, Jiepeng Wang, Yabo Chen, Shugong Xu, Chi Zhang, Xuelong Li
Affiliations: Shanghai University; Institute of Artificial Intelligence, China Telecom (TeleAI); Shanghai Jiaotong University; Xi'an Jiaotong-Liverpool University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Our project page: this https URL

Click to view abstract

Abstract:Accurate and generalizable metric depth estimation is crucial for various computer vision applications but remains challenging due to the diverse depth scales encountered in indoor and outdoor environments. In this paper, we introduce Metric-Solver, a novel sliding anchor-based metric depth estimation method that dynamically adapts to varying scene scales. Our approach leverages an anchor-based representation, where a reference depth serves as an anchor to separate and normalize the scene depth into two components: scaled near-field depth and tapered far-field depth. The anchor acts as a normalization factor, enabling the near-field depth to be normalized within a consistent range while mapping far-field depth smoothly toward zero. Through this approach, any depth from zero to infinity in the scene can be represented within a unified representation, effectively eliminating the need to manually account for scene scale variations. More importantly, for the same scene, the anchor can slide along the depth axis, dynamically adjusting to different depth scales. A smaller anchor provides higher resolution in the near-field, improving depth precision for closer objects while a larger anchor improves depth estimation in far regions. This adaptability enables the model to handle depth predictions at varying distances and ensure strong generalization across datasets. Our design enables a unified and adaptive depth representation across diverse environments. Extensive experiments demonstrate that Metric-Solver outperforms existing methods in both accuracy and cross-dataset generalization.
zh
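The anchor normalization described above can be illustrated with one plausible parametrization. This is a hedged sketch: the mapping t = anchor / (anchor + d) is an assumption chosen for exposition (the abstract does not state the exact formula); it sends depth 0 to 1, tapers depth smoothly to 0 at infinity, and sliding the anchor reallocates resolution between the near and far field.

```python
def depth_to_anchored(d, anchor):
    """Map metric depth d in [0, inf) into a unified [0, 1] code.

    t = anchor / (anchor + d): depth 0 -> 1, depth == anchor -> 0.5,
    depth -> infinity tapers smoothly to 0 (hypothetical parametrization).
    """
    return anchor / (anchor + d)

def anchored_to_depth(t, anchor):
    """Invert the mapping: d = anchor * (1 - t) / t."""
    return anchor * (1.0 - t) / t

# A smaller anchor spends more of the [0, 1] range on nearby depths,
# so close objects are resolved more finely.
res_small = depth_to_anchored(1.0, 2.0) - depth_to_anchored(2.0, 2.0)
res_large = depth_to_anchored(1.0, 20.0) - depth_to_anchored(2.0, 20.0)
```

Under this parametrization the code interval spent between depths 1 m and 2 m is about 0.17 with a 2 m anchor but only about 0.04 with a 20 m anchor, matching the intuition that a small anchor sharpens the near field.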

[CV-26] Generalized Visual Relation Detection with Diffusion Models

【Quick Read】: This paper addresses two limitations of visual relation detection (VRD): the restriction to pre-defined relation categories, and the semantic ambiguity of visual relations, which can be described by multiple predicate words from different perspectives. The key of the solution is to model visual relations as continuous embeddings and to design diffusion models that achieve generalized VRD in a conditional generative manner, termed Diff-VRD. Its core innovations are: (1) modeling the diffusion process in a latent space and generating all possible relations in an image as an embedding sequence; (2) using the visual and text embeddings of subject-object pairs as conditional signals, injected via cross-attention; and (3) a subsequent matching stage that assigns suitable predicate words to subject-object pairs based on semantic similarity. Thanks to the diffusion-based generative process, Diff-VRD can generate visual relations beyond the pre-defined category labels of datasets. The paper also introduces two evaluation metrics — text-to-image retrieval and a SPICE PR curve inspired by image captioning — to properly evaluate this generalized VRD task.

Link: https://arxiv.org/abs/2504.12100
Authors: Kaifeng Gao, Siqi Chen, Hanwang Zhang, Jun Xiao, Yueting Zhuang, Qianru Sun
Affiliations: Zhejiang University; School of Computer Science and Engineering, Nanyang Technological University; School of Information Systems, Singapore Management University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Under review at IEEE TCSVT. The Appendix is provided additionally

Click to view abstract

Abstract:Visual relation detection (VRD) aims to identify relationships (or interactions) between object pairs in an image. Although recent VRD models have achieved impressive performance, they are all restricted to pre-defined relation categories, while failing to consider the semantic ambiguity characteristic of visual relations. Unlike objects, the appearance of visual relations is always subtle and can be described by multiple predicate words from different perspectives, e.g., "ride" can be depicted as "race" and "sit on", from the sports and spatial position views, respectively. To this end, we propose to model visual relations as continuous embeddings, and design diffusion models to achieve generalized VRD in a conditional generative manner, termed Diff-VRD. We model the diffusion process in a latent space and generate all possible relations in the image as an embedding sequence. During the generation, the visual and text embeddings of subject-object pairs serve as conditional signals and are injected via cross-attention. After the generation, we design a subsequent matching stage to assign the relation words to subject-object pairs by considering their semantic similarities. Benefiting from the diffusion-based generative process, our Diff-VRD is able to generate visual relations beyond the pre-defined category labels of datasets. To properly evaluate this generalized VRD task, we introduce two evaluation metrics, i.e., text-to-image retrieval and SPICE PR Curve inspired by image captioning. Extensive experiments in both human-object interaction (HOI) detection and scene graph generation (SGG) benchmarks attest to the superiority and effectiveness of Diff-VRD.
zh
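The post-generation matching stage can be sketched as a nearest-neighbor assignment in embedding space. A hedged illustration: cosine similarity, the toy embeddings, and the function names are assumptions; the paper's actual matching may score similarities differently.

```python
import numpy as np

def l2norm(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def match_predicates(relation_embs, predicate_embs, predicate_words):
    """Assign each generated relation embedding its closest predicate word.

    relation_embs: (R, D) embeddings produced by the diffusion process.
    predicate_embs: (P, D) text embeddings of candidate predicate words.
    Returns one word per relation, chosen by cosine similarity.
    """
    sims = l2norm(relation_embs) @ l2norm(predicate_embs).T
    return [predicate_words[i] for i in sims.argmax(axis=1)]

# Toy predicate space: the candidate vocabulary need not coincide with a
# dataset's fixed label set, which is what makes the setting "generalized".
words = ["ride", "race", "sit on"]
pred_embs = np.eye(3)
rel_embs = np.array([[0.9, 0.3, 0.0], [0.0, 0.1, 0.8]])
matched = match_predicates(rel_embs, pred_embs, words)
```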

[CV-27] AttentionDrop: A Novel Regularization Method for Transformer Models

【Quick Read】: This paper addresses the tendency of Transformer models to overfit when training data is limited or noisy. The proposed remedy is AttentionDrop, a unified family of stochastic regularization techniques that operate directly on the self-attention distributions. Its key is three variants: 1) Hard Attention Masking, which randomly zeroes out each query's top-k attention logits to encourage diverse context utilization; 2) Blurred Attention Smoothing, which applies a dynamic Gaussian convolution over the attention logits to diffuse overly peaked distributions; and 3) Consistency-Regularized AttentionDrop, which enforces output stability under multiple independent AttentionDrop perturbations via a KL-divergence-based consistency loss. Together these techniques improve the model's generalization.

Link: https://arxiv.org/abs/2504.12088
Authors: Mirza Samad Ahmed Baig, Syeda Anshrah Gillani, Abdul Akbar Khan, Shahid Munir Shah
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 26 pages

Click to view abstract

Abstract:Transformer-based architectures achieve state-of-the-art performance across a wide range of tasks in natural language processing, computer vision, and speech. However, their immense capacity often leads to overfitting, especially when training data is limited or noisy. We propose AttentionDrop, a unified family of stochastic regularization techniques that operate directly on the self-attention distributions. We introduce three variants: 1. Hard Attention Masking: randomly zeroes out top-k attention logits per query to encourage diverse context utilization. 2. Blurred Attention Smoothing: applies a dynamic Gaussian convolution over attention logits to diffuse overly peaked distributions. 3. Consistency-Regularized AttentionDrop: enforces output stability under multiple independent AttentionDrop perturbations via a KL-based consistency loss.
zh
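The first variant can be sketched directly on an attention-logit matrix. A minimal sketch with stated assumptions: the abstract says the top-k logits are "zeroed out", but here they are suppressed to -inf before the softmax so their attention mass is fully removed; the drop probability `p_drop` and all function names are illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def hard_attention_masking(logits, k, p_drop, rng):
    """Randomly suppress each query's top-k attention logits.

    logits: (num_queries, num_keys). With probability p_drop per query,
    the k largest logits are set to -inf, forcing the softmax to
    redistribute attention to the remaining keys.
    """
    out = logits.astype(float).copy()
    for q in range(out.shape[0]):
        if rng.random() < p_drop:
            topk = np.argpartition(logits[q], -k)[-k:]
            out[q, topk] = -np.inf
    return softmax(out, axis=-1)

rng = np.random.default_rng(0)
logits = rng.normal(size=(4, 8))
attn = hard_attention_masking(logits, k=2, p_drop=1.0, rng=rng)
```

With `p_drop=1.0` every query loses its two strongest keys, so each row of `attn` still sums to 1 but places zero mass on the masked positions.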

[CV-28] Self-alignment of Large Video Language Models with Refined Regularized Preference Optimization

【Quick Read】: This paper targets the shortcomings of existing Large Video Language Models (LVLMs) in fine-grained temporal understanding, their tendency to hallucinate, and their simple mistakes on video question-answering tasks, all of which hinder safe and reliable real-world deployment. The key of the solution is a self-alignment framework that lets LVLMs learn from their own errors. The framework constructs pairs of preferred and non-preferred responses, where the non-preferred responses incorporate common error patterns arising from inadequate spatio-temporal understanding, spurious correlations between co-occurring concepts, and over-reliance on linguistic cues. To optimize the model effectively with these pairs, the paper further proposes Refined Regularized Preference Optimization (RRPO), an improved preference-optimization method that uses sub-sequence-level refined rewards and token-wise KL regularization to overcome the limitations of Direct Preference Optimization (DPO). Experiments show that RRPO achieves more precise alignment and more stable training, and validate the approach across diverse video tasks.

Link: https://arxiv.org/abs/2504.12083
Authors: Pritam Sarkar, Ali Etemad
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:Despite recent advances in Large Video Language Models (LVLMs), they still struggle with fine-grained temporal understanding, hallucinate, and often make simple mistakes on even simple video question-answering tasks, all of which pose significant challenges to their safe and reliable deployment in real-world applications. To address these limitations, we propose a self-alignment framework that enables LVLMs to learn from their own errors. Our proposed framework first obtains a training set of preferred and non-preferred response pairs, where non-preferred responses are generated by incorporating common error patterns that often occur due to inadequate spatio-temporal understanding, spurious correlations between co-occurring concepts, and over-reliance on linguistic cues while neglecting the vision modality, among others. To facilitate self-alignment of LVLMs with the constructed preferred and non-preferred response pairs, we introduce Refined Regularized Preference Optimization (RRPO), a novel preference optimization method that utilizes sub-sequence-level refined rewards and token-wise KL regularization to address the limitations of Direct Preference Optimization (DPO). We demonstrate that RRPO achieves more precise alignment and more stable training compared to DPO. Our experiments and analysis validate the effectiveness of our approach across diverse video tasks, including video hallucination, short- and long-video understanding, and fine-grained temporal reasoning.
zh
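The token-wise KL regularization mentioned above can be written out explicitly. A hedged sketch: the mean over tokens and any weight that combines this term with the preference loss are assumptions, not RRPO's exact formulation.

```python
import numpy as np

def tokenwise_kl(policy_probs, ref_probs, eps=1e-12):
    """Mean per-token KL(policy || reference) over a response.

    policy_probs, ref_probs: (num_tokens, vocab_size) rows summing to 1.
    A regularizer of this kind keeps the fine-tuned model close to its
    reference at every token, rather than only at the sequence level.
    """
    kl = np.sum(policy_probs * (np.log(policy_probs + eps)
                                - np.log(ref_probs + eps)), axis=-1)
    return float(kl.mean())

# The first token has drifted from the reference, the second has not.
p = np.array([[0.7, 0.3], [0.4, 0.6]])
q = np.array([[0.5, 0.5], [0.4, 0.6]])
reg = tokenwise_kl(p, q)
```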

[CV-29] DC-SAM: In-Context Segment Anything in Images and Videos via Dual Consistency

【Quick Read】: This paper tackles one-shot in-context segmentation: given a single annotated example, segment the corresponding objects by exploiting a segmentation model's generalization ability. The setting belongs to few-shot learning and applies to vision tasks such as scene understanding and image/video editing. Although Segment Anything Models (SAM) excel at interactive segmentation, they are not directly applicable to in-context segmentation. The paper therefore proposes Dual Consistency SAM (DC-SAM), a prompt-tuning-based method that adapts SAM and SAM2 to in-context segmentation of images and videos.

The key of the solution is to enhance the feature representation of SAM's prompt encoder and to improve segmentation with high-quality visual prompts. Concretely: SAM features are fused when generating the mask prior to better align the prompt encoder; a cycle-consistent cross-attention is applied between the fused features and the initial visual prompts; a dual-branch design with discriminative positive and negative prompts strengthens discrimination; and a simple mask-tube training strategy integrates the dual-consistency method into the mask-tube framework. To address the absence of an in-context segmentation benchmark in the video domain, the paper also manually curates a new dataset, In-Context Video Object Segmentation (IC-VOS), built from existing video segmentation datasets. Experiments show that DC-SAM achieves 55.5 (+1.4) mIoU on COCO-20i, 73.0 (+1.1) mIoU on PASCAL-5i, and a JF score of 71.52 on the proposed IC-VOS benchmark.

Link: https://arxiv.org/abs/2504.12080
Authors: Mengshi Qi, Pengfei Zhu, Xiangtai Li, Xiaoyang Bi, Lu Qi, Huadong Ma, Ming-Hsuan Yang
Affiliations: State Key Laboratory of Networking and Switching Technology, Beijing University of Posts and Telecommunications, China; Nanyang Technological University, Singapore; UC Merced, US
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:Given a single labeled example, in-context segmentation aims to segment corresponding objects. This setting, known as one-shot segmentation in few-shot learning, explores the segmentation model’s generalization ability and has been applied to various vision tasks, including scene understanding and image/video editing. While recent Segment Anything Models have achieved state-of-the-art results in interactive segmentation, these approaches are not directly applicable to in-context segmentation. In this work, we propose the Dual Consistency SAM (DC-SAM) method based on prompt-tuning to adapt SAM and SAM2 for in-context segmentation of both images and videos. Our key insights are to enhance the features of the SAM’s prompt encoder in segmentation by providing high-quality visual prompts. When generating a mask prior, we fuse the SAM features to better align the prompt encoder. Then, we design a cycle-consistent cross-attention on fused features and initial visual prompts. Next, a dual-branch design is provided by using the discriminative positive and negative prompts in the prompt encoder. Furthermore, we design a simple mask-tube training strategy to adopt our proposed dual consistency method into the mask tube. Although the proposed DC-SAM is primarily designed for images, it can be seamlessly extended to the video domain with the support of SAM2. Given the absence of in-context segmentation in the video domain, we manually curate and construct the first benchmark from existing video segmentation datasets, named In-Context Video Object Segmentation (IC-VOS), to better assess the in-context capability of the model. Extensive experiments demonstrate that our method achieves 55.5 (+1.4) mIoU on COCO-20i, 73.0 (+1.1) mIoU on PASCAL-5i, and a JF score of 71.52 on the proposed IC-VOS benchmark. Our source code and benchmark are available at this https URL.
zh

[CV-30] Single-shot Star-convex Polygon-based Instance Segmentation for Spatially-correlated Biomedical Objects

【Quick Read】: This paper investigates multi-category instance segmentation in biomedical images, in particular exploiting the spatial correlation between objects to improve segmentation. Detection tasks are traditionally formulated independently, requiring multi-stage analysis pipelines and ignoring spatial correlation as a potential prior. The key is two StarDist-based architectures: HydraStarDist (HSD) and the novel HSD-WBR. HSD implicitly incorporates spatial-correlation priors on object interaction through a joint encoder, while HSD-WBR further enforces the prior in a regularization layer via a penalty termed the Within Boundary Regularisation Penalty (WBR). Both achieve nested instance segmentation in a single shot and perform strongly on IoU_R and AP as well as on a new task-relevant metric, the Joint TP rate (JTPR), demonstrating their effectiveness and superiority.

Link: https://arxiv.org/abs/2504.12078
Authors: Trina De, Adrian Urbanski, Artur Yakimovich
Affiliations: Department of Computer Science, Technical University of Dresden; Institute of Computer Science, University of Wrocław; Helmholtz-Zentrum Dresden-Rossendorf e. V.
Subjects: Computer Vision and Pattern Recognition (cs.CV); Quantitative Methods (q-bio.QM)
Comments: 12 pages, 8 figures

Click to view abstract

Abstract:Biomedical images often contain objects known to be spatially correlated or nested due to their inherent properties, leading to semantic relations. Examples include cell nuclei being nested within eukaryotic cells and colonies growing exclusively within their culture dishes. While these semantic relations bear key importance, detection tasks are often formulated independently, requiring multi-shot analysis pipelines. Importantly, spatial correlation could constitute a fundamental prior facilitating learning of more meaningful representations for tasks like instance segmentation. This knowledge has, thus far, not been utilised by the biomedical computer vision community. We argue that the instance segmentation of two or more categories of objects can be achieved in parallel. We achieve this via two architectures HydraStarDist (HSD) and the novel (HSD-WBR) based on the widely-used StarDist (SD), to take advantage of the star-convexity of our target objects. HSD and HSD-WBR are constructed to be capable of incorporating their interactions as constraints into account. HSD implicitly incorporates spatial correlation priors based on object interaction through a joint encoder. HSD-WBR further enforces the prior in a regularisation layer with the penalty we proposed named Within Boundary Regularisation Penalty (WBR). Both architectures achieve nested instance segmentation in a single shot. We demonstrate their competitiveness based on IoU_R and AP and superiority in a new, task-relevant criteria, Joint TP rate (JTPR) compared to their baseline SD and Cellpose. Our approach can be further modified to capture partial-inclusion/-exclusion in multi-object interactions in fluorescent or brightfield microscopy or digital imaging. Finally, our strategy suggests gains by making this learning single-shot and computationally efficient.
zh

[CV-31] Modular-Cam: Modular Dynamic Camera-view Video Generation with LLM AAAI2025

【Quick Read】: This paper addresses two main failures of existing text-to-video methods on complex prompts that contain dynamic scenes and multiple camera-view transformations: they cannot decompose the overall information into separate scenes, and they struggle to switch scenes smoothly according to the corresponding views. The key of the proposed Modular-Cam is to use a large language model to analyze the user instruction and decouple it into multiple scenes with transition actions, to incorporate the widely used temporal Transformer into the diffusion model to ensure continuity within a single scene, and to design a modular network, CamOperator, for precise control of camera movements. In addition, the proposed AdaControlNet uses ControlNet to guarantee cross-scene consistency and adaptively adjusts the color tone of the generated video.

Link: https://arxiv.org/abs/2504.12048
Authors: Zirui Pan, Xin Wang, Yipeng Zhang, Hong Chen, Kwan Man Cheng, Yaofei Wu, Wenwu Zhu
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: AAAI 2025 Poster

Click to view abstract

Abstract:Text-to-Video generation, which utilizes the provided text prompt to generate high-quality videos, has drawn increasing attention and achieved great success due to the development of diffusion models recently. Existing methods mainly rely on a pre-trained text encoder to capture the semantic information and perform cross attention with the encoded text prompt to guide the generation of video. However, when it comes to complex prompts that contain dynamic scenes and multiple camera-view transformations, these methods can not decompose the overall information into separate scenes, as well as fail to smoothly change scenes based on the corresponding camera-views. To solve these problems, we propose a novel method, i.e., Modular-Cam. Specifically, to better understand a given complex prompt, we utilize a large language model to analyze user instructions and decouple them into multiple scenes together with transition actions. To generate a video containing dynamic scenes that match the given camera-views, we incorporate the widely-used temporal transformer into the diffusion model to ensure continuity within a single scene and propose CamOperator, a modular network based module that well controls the camera movements. Moreover, we propose AdaControlNet, which utilizes ControlNet to ensure consistency across scenes and adaptively adjusts the color tone of the generated video. Extensive qualitative and quantitative experiments prove our proposed Modular-Cam’s strong capability of generating multi-scene videos together with its ability to achieve fine-grained control of camera movements. Generated results are available at this https URL.
zh

[CV-32] pix2pockets: Shot Suggestions in 8-Ball Pool from a Single Image in the Wild

【速读】:本文旨在构建一个基于强化学习(Reinforcement Learning, RL)辅助的8球台球教练系统pix2pockets。论文的关键问题是:如何从单张台球桌图像中检测台球桌与球,并提出最优击球建议。为解决这些问题,论文构建了一个包含195张多样化图像的数据集,手动标注了所有球和桌面上的点,生成了5748个物体分割掩码;同时开发了一个标准化的RL环境用于算法的快速开发与基准测试。关键解决方案在于提出了高效的物体检测模型(AP50达91.2%)以及精确的球位置推算方法(误差仅0.4厘米),从而实现精准的击球建议。
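摘要中的 AP50 指的是以交并比(IoU)≥ 0.5 作为检测命中阈值时的平均精度。下面给出 IoU 标准定义的最小实现,便于理解该指标的计算基础(示意代码,与论文官方实现无关):

```python
def iou(box_a, box_b):
    """计算两个 (x1, y1, x2, y2) 矩形框的交并比(IoU)。"""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)
```

在此基础上,对每个置信度阈值统计 IoU ≥ 0.5 的命中并绘制精确率-召回率曲线,其曲线下面积即为 AP50。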

链接: https://arxiv.org/abs/2504.12045
作者: Jonas Myhre Schiøtt,Viktor Sebastian Petersen,Dimitrios P. Papadopoulos
机构: Technical University of Denmark (丹麦技术大学); Pioneer Center for AI (先锋人工智能中心)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 15 pages, 7 figures, to be published in SCIA 2025

点击查看摘要

Abstract:Computer vision models have seen increased usage in sports, and reinforcement learning (RL) is famous for beating humans in strategic games such as Chess and Go. In this paper, we are interested in building upon these advances and examining the game of classic 8-ball pool. We introduce pix2pockets, a foundation for an RL-assisted pool coach. Given a single image of a pool table, we first aim to detect the table and the balls and then propose the optimal shot suggestion. For the first task, we build a dataset with 195 diverse images where we manually annotate all balls and table dots, leading to 5748 object segmentation masks. For the second task, we build a standardized RL environment that allows easy development and benchmarking of any RL algorithm. Our object detection model yields an AP50 of 91.2 while our ball location pipeline obtains an error of only 0.4 cm. Furthermore, we compare standard RL algorithms to set a baseline for the shot suggestion task and we show that all of them fail to pocket all balls without making a foul move. We also present a simple baseline that achieves a per-shot success rate of 94.7% and clears a full game in a single turn 30% of the time.
zh

[CV-33] RadMamba: Efficient Human Activity Recognition through Radar-based Micro-Doppler-Oriented Mamba State-Space Model

【速读】:本文旨在解决基于雷达的人体活动识别(Radar-based HAR)中现有方法计算复杂度过高、参数量庞大的问题,特别是在资源受限场景或需要多传感器协作的情况下。现有基于卷积神经网络(CNN)和循环神经网络(RNN)的方法虽然有效,但部署时计算开销较大;而先进的ViT和SSM架构虽提升了建模能力,但其计算复杂度仍然较高。为克服这些问题,论文提出RadMamba,这是一种针对雷达微多普勒信号(radar micro-Doppler)优化的参数高效Mamba状态空间模型(State-Space Model, SSM)。RadMamba的关键在于借鉴Transformer类架构的优势,在保持高精度的同时显著降低参数量和计算复杂度,从而实现更高效的雷达HAR系统。

链接: https://arxiv.org/abs/2504.12039
作者: Yizhuo Wu,Francesco Fioranelli,Chang Gao
机构: Department of Microelectronics, Delft University of Technology (代尔夫特理工大学), The Netherlands
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Under Review

点击查看摘要

Abstract:Radar-based HAR has emerged as a promising alternative to conventional monitoring approaches, such as wearable devices and camera-based systems, due to its unique privacy preservation and robustness advantages. However, existing solutions based on convolutional and recurrent neural networks, although effective, are computationally demanding during deployment. This limits their applicability in scenarios with constrained resources or those requiring multiple sensors. Advanced architectures, such as ViT and SSM architectures, offer improved modeling capabilities and have made efforts toward lightweight designs. However, their computational complexity remains relatively high. To leverage the strengths of transformer architectures while simultaneously enhancing accuracy and reducing computational complexity, this paper introduces RadMamba, a parameter-efficient, radar micro-Doppler-oriented Mamba SSM specifically tailored for radar-based HAR. Across three diverse datasets, RadMamba matches the top-performing previous model’s 99.8% classification accuracy on Dataset DIAT with only 1/400 of its parameters and equals the leading models’ 92.0% accuracy on Dataset CI4R with merely 1/10 of their parameters. In scenarios with continuous sequences of actions evaluated on Dataset UoG2020, RadMamba surpasses other models with significantly higher parameter counts by at least 3%, achieving this with only 6.7k parameters. Our code is available at: this https URL.
zh

[CV-34] Object Placement for Anything ICME2025

【速读】:该论文试图解决小规模标注数据限制导致的物体放置任务在实际应用中的泛化能力不足问题。论文的关键解决方案在于提出了一种半监督框架,能够利用大规模未标注数据提升判别式物体放置模型的泛化能力。具体而言,该框架通过预测给定前景-背景图像对的每个前景放置合理性标签(rationality label)来构建判别模型,并进一步通过迁移标注数据中的合理性变化知识(即前景放置变化是否会导致合理性标签变化)到未标注数据,从而更有效地利用标注信息。实验结果验证了该方法在增强判别式物体放置模型泛化能力方面的有效性。

链接: https://arxiv.org/abs/2504.12029
作者: Bingjie Gao,Bo Zhang,Li Niu
机构: Shanghai Jiao Tong University (上海交通大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: accepted by ICME 2025

点击查看摘要

Abstract:Object placement aims to determine the appropriate placement (e.g., location and size) of a foreground object when placing it on the background image. Most previous works are limited by small-scale labeled dataset, which hinders the real-world application of object placement. In this work, we devise a semi-supervised framework which can exploit large-scale unlabeled dataset to promote the generalization ability of discriminative object placement models. The discriminative models predict the rationality label for each foreground placement given a foreground-background pair. To better leverage the labeled data, under the semi-supervised framework, we further propose to transfer the knowledge of rationality variation, i.e., whether the change of foreground placement would result in the change of rationality label, from labeled data to unlabeled data. Extensive experiments demonstrate that our framework can effectively enhance the generalization ability of discriminative object placement models.
zh

[CV-35] Understanding Attention Mechanism in Video Diffusion Models

【速读】:该论文试图解决的问题是如何理解扩散模型中空间和时间注意力块所学习的中间特征及其对视频合成质量(如图像质量和时间一致性)的影响。论文通过信息论方法对文本到视频(T2V)模型中的注意力机制进行了深入扰动分析,揭示了时空注意力图不仅影响视频的时间布局和复杂性,还与合成视频的美学质量密切相关。关键在于发现高熵注意力图通常与高质量视频相关联,而低熵注意力图则与帧内结构相关,并基于此提出了两种仅通过轻量级操作注意力矩阵来提升视频质量和实现文本引导视频编辑的新方法。这些方法在多个数据集上的实验验证表明其有效性和优越性。
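文中"高熵注意力图往往对应更高视频质量"的分析可以落到一个很简单的量上:对注意力矩阵逐行计算香农熵。下面是该计算方式的示意实现(仅为说明熵的定义,并非论文官方代码):

```python
import numpy as np

def attention_entropy(attn, eps=1e-12):
    """逐行归一化注意力权重后计算香农熵,再对所有行取平均。
    熵越高表示注意力分布越分散(时空元素越复杂),越低表示越集中。"""
    a = np.asarray(attn, dtype=float)
    a = a / (a.sum(axis=-1, keepdims=True) + eps)
    row_entropy = -(a * np.log(a + eps)).sum(axis=-1)
    return float(row_entropy.mean())
```

均匀注意力给出最大熵 log(N),one-hot 式注意力熵接近 0,可据此对不同注意力块的行为进行量化比较。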

链接: https://arxiv.org/abs/2504.12027
作者: Bingyan Liu,Chengyu Wang,Tongtong Su,Huan Ten,Jun Huang,Kailing Guo,Kui Jia
机构: South China University of Technology (华南理工大学); Alibaba Group (阿里巴巴集团); Zhejiang University (浙江大学); The Chinese University of Hong Kong, Shenzhen (香港中文大学,深圳)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Text-to-video (T2V) synthesis models, such as OpenAI’s Sora, have garnered significant attention due to their ability to generate high-quality videos from a text prompt. In diffusion-based T2V models, the attention mechanism is a critical component. However, it remains unclear what intermediate features are learned and how attention blocks in T2V models affect various aspects of video synthesis, such as image quality and temporal consistency. In this paper, we conduct an in-depth perturbation analysis of the spatial and temporal attention blocks of T2V models using an information-theoretic approach. Our results indicate that temporal and spatial attention maps affect not only the timing and layout of the videos but also the complexity of spatiotemporal elements and the aesthetic quality of the synthesized videos. Notably, high-entropy attention maps are often key elements linked to superior video quality, whereas low-entropy attention maps are associated with the video’s intra-frame structure. Based on our findings, we propose two novel methods to enhance video quality and enable text-guided video editing. These methods rely entirely on lightweight manipulation of the attention matrices in T2V models. The efficacy and effectiveness of our methods are further validated through experimental evaluation across multiple datasets.
zh

[CV-36] Action Anticipation from SoccerNet Football Video Broadcasts CVPR

【速读】:该论文试图解决足球比赛视频中未来动作预测的问题,即在未观测到的未来帧中提前预测动作(动作前置预测,Action Anticipation)。传统方法主要集中在分析已发生或当前动作,而本文关注如何在五秒或十秒的预测窗口内,提前预测即将发生的与球相关的动作。
解决方案的关键在于提出了一种基于Football Action ANticipation TRAnsformer (FAANTRA) 的基线方法,它通过适配最先进的动作前置预测模型FUTR,实现了对足球视频中球相关动作的有效预测。此外,为了评估预测性能,论文引入了新的评价指标,如mAP@δ 和mAP@∞,并进行了广泛的消融研究以分析任务设置、输入配置及模型架构的影响。这一工作为体育数据分析中的预测模型设计提供了重要参考,并有望推动自动化转播、战术分析以及球员决策等应用的发展。
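上述 mAP@δ 指标的核心是判断预测动作的时间戳是否落在真实动作的 δ 秒容差之内。下面给出该时间匹配环节的一个极简示意(函数名与贪心匹配策略均为本文假设,并非论文官方实现):

```python
def match_within_delta(pred_times, gt_times, delta):
    """贪心时间匹配:预测时间戳与某个尚未匹配的真实时间戳之差
    不超过 delta 秒即记为命中(TP),每个真实动作至多匹配一次。
    返回 (TP 数, 精确率, 召回率)。"""
    used = set()
    tp = 0
    for p in sorted(pred_times):
        best_i, best_d = None, None
        for i, g in enumerate(gt_times):
            if i in used:
                continue
            d = abs(p - g)
            if d <= delta and (best_d is None or d < best_d):
                best_i, best_d = i, d
        if best_i is not None:
            used.add(best_i)
            tp += 1
    precision = tp / len(pred_times) if pred_times else 0.0
    recall = tp / len(gt_times) if gt_times else 0.0
    return tp, precision, recall
```

在此基础上对不同置信度阈值下的精确率-召回率曲线取平均即可得到 AP;当 δ 趋于无穷时只要求动作落在预测窗口内,对应 mAP@∞ 的直觉。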

链接: https://arxiv.org/abs/2504.12021
作者: Mohamad Dalal,Artur Xarles,Anthony Cioppa,Silvio Giancola,Marc Van Droogenbroeck,Bernard Ghanem,Albert Clapés,Sergio Escalera,Thomas B. Moeslund
机构: Aalborg University (奥尔堡大学); Universitat de Barcelona (巴塞罗那大学); Computer Vision Center (计算机视觉中心); University of Liège (列日大学); KAUST (阿卜杜拉国王科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 15 pages, 14 figures. To be published in the CVSports CVPR workshop

点击查看摘要

Abstract:Artificial intelligence has revolutionized the way we analyze sports videos, whether to understand the actions of games in long untrimmed videos or to anticipate the player’s motion in future frames. Despite these efforts, little attention has been given to anticipating game actions before they occur. In this work, we introduce the task of action anticipation for football broadcast videos, which consists in predicting future actions in unobserved future frames, within a five- or ten-second anticipation window. To benchmark this task, we release a new dataset, namely the SoccerNet Ball Action Anticipation dataset, based on SoccerNet Ball Action Spotting. Additionally, we propose a Football Action ANticipation TRAnsformer (FAANTRA), a baseline method that adapts FUTR, a state-of-the-art action anticipation model, to predict ball-related actions. To evaluate action anticipation, we introduce new metrics, including mAP@δ, which evaluates the temporal precision of predicted future actions, as well as mAP@∞, which evaluates their occurrence within the anticipation window. We also conduct extensive ablation studies to examine the impact of various task settings, input configurations, and model architectures. Experimental results highlight both the feasibility and challenges of action anticipation in football videos, providing valuable insights into the design of predictive models for sports analytics. By forecasting actions before they unfold, our work will enable applications in automated broadcasting, tactical analysis, and player decision-making. Our dataset and code are publicly available at this https URL.
zh

[CV-37] MixSignGraph: A Sign Sequence is Worth Mixed Graphs of Nodes

【速读】:该论文试图解决传统基于CNN的骨干网络在手语任务中难以有效提取与手语相关的区域特征的问题。解决方案的关键在于引入了MixSignGraph方法,通过将手语序列表示为一组混合图,并设计了三个图模块(Local Sign Graph (LSG) 模块、Temporal Sign Graph (TSG) 模块和Hierarchical Sign Graph (HSG) 模块)来分别捕捉空间、时间以及层次化的区域相关特征。此外,为了进一步提升无gloss标注的手语任务性能,提出了Text-driven CTC Pre-training (TCP) 方法,利用文本标签生成伪gloss标签进行模型预训练。实验结果表明,该方法在多个公开手语数据集上的多种任务中超越了现有最先进的模型,且无需依赖任何额外提示。

链接: https://arxiv.org/abs/2504.12020
作者: Shiwei Gan,Yafeng Yin,Zhiwei Jiang,Hongkai Wen,Lei Xie,Sanglu Lu
机构: State Key Laboratory for Novel Software Technology, Nanjing University (国家重点实验室,南京大学); Department of Computer Science, University of Warwick (计算机科学系,华威大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 17 pages, 9 figures, submitted to IEEE Transactions on Pattern Analysis and Machine Intelligence (T-PAMI). This is a regular paper submission

点击查看摘要

Abstract:Recent advances in sign language research have benefited from CNN-based backbones, which are primarily transferred from traditional computer vision tasks (e.g., object identification, image recognition). However, these CNN-based backbones usually excel at extracting features like contours and texture, but may struggle with capturing sign-related features. In fact, sign language tasks require focusing on sign-related regions, including the collaboration between different regions (e.g., left hand region and right hand region) and the effective content in a single region. To capture such region-related features, we introduce MixSignGraph, which represents sign sequences as a group of mixed graphs and designs the following three graph modules for feature extraction, i.e., Local Sign Graph (LSG) module, Temporal Sign Graph (TSG) module and Hierarchical Sign Graph (HSG) module. Specifically, the LSG module learns the correlation of intra-frame cross-region features within one frame, i.e., focusing on spatial features. The TSG module tracks the interaction of inter-frame cross-region features among adjacent frames, i.e., focusing on temporal features. The HSG module aggregates the same-region features from different-granularity feature maps of a frame, i.e., focusing on hierarchical features. In addition, to further improve the performance of sign language tasks without gloss annotations, we propose a simple yet counter-intuitive Text-driven CTC Pre-training (TCP) method, which generates pseudo gloss labels from text labels for model pre-training. Extensive experiments conducted on current five public sign language datasets demonstrate the superior performance of the proposed model. Notably, our model surpasses the SOTA models on multiple sign language tasks across several datasets, without relying on any additional cues.
zh

[CV-38] Instruction-augmented Multimodal Alignment for Image-Text and Element Matching CVPR2025

【速读】:该论文旨在解决现有文本到图像(Text-to-Image, T2I)生成模型评估方法在细粒度语义对齐评估及精确量化方面的不足。当前基于视觉问答(Visual Question Answering, VQA)的方法仍面临挑战,难以实现对图像与文本描述之间对齐关系的精准评价。为应对这一问题,论文提出了一种名为“指令增强的图像-文本及元素匹配对齐”(Instruction-augmented Multimodal Alignment for Image-Text and Element Matching, iMatch)的改进评估方法。该方案通过微调多模态大型语言模型来评估图像与文本的语义一致性,并引入了四种创新性的数据增强策略:首先,QAlign策略构建精确的概率映射,将多模态大型语言模型输出的离散评分转换为连续的匹配分数;其次,验证集增强策略利用模型预测的伪标签扩充训练数据以提升泛化能力;第三,元素增强策略结合元素类别标签优化模型对图像-文本匹配的理解;最后,图像增强策略采用随机光照等技术提高模型鲁棒性。此外,还提出了提示类型增强和评分扰动策略进一步提升元素评估的准确性。实验结果表明,iMatch方法显著优于现有方法,其有效性已在CVPR NTIRE 2025 Text to Image Generation Model Quality Assessment - Track 1竞赛中得到验证。
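摘要中的 QAlign 策略是把模型在离散评分等级上的输出概率映射为连续分数。一个常见的做法(此处为示意性重实现,等级取值与 softmax 细节均为假设)是对各等级按概率加权求期望:

```python
import numpy as np

def qalign_score(level_logits, levels=(1, 2, 3, 4, 5)):
    """将模型在各离散等级 token 上的 logits 经 softmax 转为概率,
    再对等级取值按概率加权求和,得到连续的图文匹配分数。"""
    logits = np.asarray(level_logits, dtype=float)
    probs = np.exp(logits - logits.max())  # 数值稳定的 softmax
    probs /= probs.sum()
    return float(np.dot(probs, np.asarray(levels, dtype=float)))
```

这样得到的分数连续取值于等级区间内,比直接取 argmax 的离散评分更适合作为回归目标或排序依据。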

链接: https://arxiv.org/abs/2504.12018
作者: Xinli Yue,JianHui Sun,Junda Lu,Liangchao Yao,Fan Xia,Tianyi Wang,Fengyun Rao,Jing Lyu,Yuetang Deng
机构: Wuhan University(武汉大学); WeChat(微信)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to CVPR 2025 Workshop

点击查看摘要

Abstract:With the rapid advancement of text-to-image (T2I) generation models, assessing the semantic alignment between generated images and text descriptions has become a significant research challenge. Current methods, including those based on Visual Question Answering (VQA), still struggle with fine-grained assessments and precise quantification of image-text alignment. This paper presents an improved evaluation method named Instruction-augmented Multimodal Alignment for Image-Text and Element Matching (iMatch), which evaluates image-text semantic alignment by fine-tuning multimodal large language models. We introduce four innovative augmentation strategies: First, the QAlign strategy creates a precise probabilistic mapping to convert discrete scores from multimodal large language models into continuous matching scores. Second, a validation set augmentation strategy uses pseudo-labels from model predictions to expand training data, boosting the model’s generalization performance. Third, an element augmentation strategy integrates element category labels to refine the model’s understanding of image-text matching. Fourth, an image augmentation strategy employs techniques like random lighting to increase the model’s robustness. Additionally, we propose prompt type augmentation and score perturbation strategies to further enhance the accuracy of element assessments. Our experimental results show that the iMatch method significantly surpasses existing methods, confirming its effectiveness and practical value. Furthermore, our iMatch won first place in the CVPR NTIRE 2025 Text to Image Generation Model Quality Assessment - Track 1 Image-Text Alignment.
zh

[CV-39] A Complex-valued SAR Foundation Model Based on Physically Inspired Representation Learning

【速读】:本文旨在解决基于合成孔径雷达(SAR)图像解析的视觉基础模型在信息利用不足和可解释性差方面的挑战。为应对这些挑战,论文提出了一种基于复数值SAR数据的遥感基础模型,通过模拟极化分解过程进行预训练,将像素散射强度表示为散射基与散射系数的加权组合,从而赋予模型物理可解释性。关键解决方案在于构建一系列散射查询,每个查询代表一个独立且有意义的散射基,并在散射查询解码器中与SAR特征交互以输出对应的散射系数。同时,设计了极化分解损失和功率自监督损失来指导预训练过程,前者使预测系数与Yamaguchi系数对齐,后者从预测系数重构功率并与输入图像的功率对比。实验验证表明,该模型在六个典型下游任务中达到最先进的性能,并展现出强大的泛化能力,尤其是在数据稀缺条件下。
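论文预训练的核心表示是"像素散射功率 = 散射基功率的系数加权和",功率自监督损失则比较重构功率与输入图像功率。下面用一个玩具例子示意这种表示与损失的形状(非官方实现,张量形状与均方误差损失形式均为本文假设):

```python
import numpy as np

def reconstruct_power(coeffs, basis_powers):
    """coeffs: (K,) 个散射系数;basis_powers: (K, N) 为 K 个散射基
    在 N 个像素上的功率。返回按系数加权求和得到的重构功率 (N,)。"""
    return np.tensordot(np.asarray(coeffs, dtype=float),
                        np.asarray(basis_powers, dtype=float), axes=1)

def power_self_supervision_loss(coeffs, basis_powers, target_power):
    """功率自监督损失的简化形式:重构功率与输入图像功率的均方误差。"""
    recon = reconstruct_power(coeffs, basis_powers)
    return float(np.mean((recon - np.asarray(target_power, dtype=float)) ** 2))
```

极化分解损失的作用类似,只是将预测系数直接对齐到 Yamaguchi 分解系数而非重构功率。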

链接: https://arxiv.org/abs/2504.11999
作者: Mengyu Wang,Hanbo Bi,Yingchao Feng,Linlin Xin,Shuo Gong,Tianqi Wang,Zhiyuan Yan,Peijin Wang,Wenhui Diao,Xian Sun
机构: Aerospace Information Research Institute, Chinese Academy of Sciences (中国科学院空天信息创新研究院); School of Electronic, Electrical and Communication Engineering, University of Chinese Academy of Sciences (中国科学院大学电子电气与通信工程学院); Key Laboratory of Target Cognition and Application Technology (TCAT), Aerospace Information Research Institute, Chinese Academy of Sciences (目标认知与应用技术重点实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Vision foundation models in remote sensing have been extensively studied due to their superior generalization on various downstream tasks. Synthetic Aperture Radar (SAR) offers all-day, all-weather imaging capabilities, providing significant advantages for Earth observation. However, establishing a foundation model for SAR image interpretation inevitably encounters the challenges of insufficient information utilization and poor interpretability. In this paper, we propose a remote sensing foundation model based on complex-valued SAR data, which simulates the polarimetric decomposition process for pre-training, i.e., characterizing pixel scattering intensity as a weighted combination of scattering bases and scattering coefficients, thereby endowing the foundation model with physical interpretability. Specifically, we construct a series of scattering queries, each representing an independent and meaningful scattering basis, which interact with SAR features in the scattering query decoder and output the corresponding scattering coefficient. To guide the pre-training process, polarimetric decomposition loss and power self-supervision loss are constructed. The former aligns the predicted coefficients with Yamaguchi coefficients, while the latter reconstructs power from the predicted coefficients and compares it to the input image’s power. The performance of our foundation model is validated on six typical downstream tasks, achieving state-of-the-art results. Notably, the foundation model can extract stable feature representations and exhibits strong generalization, even in data-scarce conditions.
zh

[CV-40] A Review of YOLOv12: Attention-Based Enhancements vs. Previous Versions

【速读】:该论文旨在解决如何在实时目标检测框架YOLO中有效集成注意力机制的问题,同时保持其高速推理性能。解决方案的关键在于提出了一种新颖的方法,通过引入Area Attention实现计算高效的自注意力机制、利用Residual Efficient Layer Aggregation Networks提升特征聚合能力,并采用FlashAttention优化内存访问效率,从而在不牺牲实时性能的前提下成功增强了模型的准确性与计算效率。

链接: https://arxiv.org/abs/2504.11995
作者: Rahima Khanam,Muhammad Hussain
机构: Department of Computer Science, Huddersfield University (赫德斯菲尔德大学), Queensgate, Huddersfield HD1 3DH, UK
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:The YOLO (You Only Look Once) series has been a leading framework in real-time object detection, consistently improving the balance between speed and accuracy. However, integrating attention mechanisms into YOLO has been challenging due to their high computational overhead. YOLOv12 introduces a novel approach that successfully incorporates attention-based enhancements while preserving real-time performance. This paper provides a comprehensive review of YOLOv12’s architectural innovations, including Area Attention for computationally efficient self-attention, Residual Efficient Layer Aggregation Networks for improved feature aggregation, and FlashAttention for optimized memory access. Additionally, we benchmark YOLOv12 against prior YOLO versions and competing object detectors, analyzing its improvements in accuracy, inference speed, and computational efficiency. Through this analysis, we demonstrate how YOLOv12 advances real-time object detection by refining the latency-accuracy trade-off and optimizing computational resources.
zh

[CV-41] Analysis of Pseudo-Labeling for Online Source-Free Universal Domain Adaptation

【速读】:该论文旨在解决在线无源(source-free)领域适应(domain adaptation, DA)中的类别偏移(category shift)问题,即源域和目标域标签空间可能不同的情形。传统方法主要依赖于基于伪标签(pseudo-label)的自训练(self-training),但尚未深入研究伪标签与适应结果之间的关系。为此,论文通过模拟伪标签的控制实验进行了系统分析,揭示了当前最先进的方法与理想伪标签条件下能达到的性能上限之间存在显著差距。解决方案的关键在于伪标签的质量而非数量:对比损失函数即使在伪标签精度适中时也能实现有效适应,而交叉熵损失虽然对伪标签错误较敏感,但在伪标签接近完美时表现更优。论文强调伪标签质量的重要性,并为在线无源通用领域适应(SF-UniDA)的未来发展提供了指导性见解。
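论文"伪标签质量重于数量、应优先保留少量高置信度伪标签"的结论,可以用一个最简单的阈值筛选来示意(示意代码,阈值与函数接口均为本文假设):

```python
import numpy as np

def select_pseudo_labels(probs, threshold=0.9):
    """probs: (N, C) 的类别概率。仅保留最大类别概率不低于
    threshold 的样本,返回其伪标签与选择掩码。"""
    probs = np.asarray(probs, dtype=float)
    confidence = probs.max(axis=1)
    labels = probs.argmax(axis=1)
    mask = confidence >= threshold
    return labels[mask], mask
```

阈值越高,保留的伪标签越少但越可靠;按论文结论,配合对伪标签错误较鲁棒的对比损失,中等精度的伪标签也能带来有效适应。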

链接: https://arxiv.org/abs/2504.11992
作者: Pascal Schlachter,Jonathan Fuss,Bin Yang
机构: Institute of Signal Processing and System Theory, University of Stuttgart (斯图加特大学)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注: Submitted to the 33rd European Signal Processing Conference (EUSIPCO 2025)

点击查看摘要

Abstract:A domain (distribution) shift between training and test data often hinders the real-world performance of deep neural networks, necessitating unsupervised domain adaptation (UDA) to bridge this gap. Online source-free UDA has emerged as a solution for practical scenarios where access to source data is restricted and target data is received as a continuous stream. However, the open-world nature of many real-world applications additionally introduces category shifts meaning that the source and target label spaces may differ. Online source-free universal domain adaptation (SF-UniDA) addresses this challenge. Existing methods mainly rely on self-training with pseudo-labels, yet the relationship between pseudo-labeling and adaptation outcomes has not been studied yet. To bridge this gap, we conduct a systematic analysis through controlled experiments with simulated pseudo-labeling, offering valuable insights into pseudo-labeling for online SF-UniDA. Our findings reveal a substantial gap between the current state-of-the-art and the upper bound of adaptation achieved with perfect pseudo-labeling. Moreover, we show that a contrastive loss enables effective adaptation even with moderate pseudo-label accuracy, while a cross-entropy loss, though less robust to pseudo-label errors, achieves superior results when pseudo-labeling approaches perfection. Lastly, our findings indicate that pseudo-label accuracy is in general more crucial than quantity, suggesting that prioritizing fewer but high-confidence pseudo-labels is beneficial. Overall, our study highlights the critical role of pseudo-labeling in (online) SF-UniDA and provides actionable insights to drive future advancements in the field. Our code is available at this https URL.
zh

[CV-42] Securing the Skies: A Comprehensive Survey on Anti-UAV Methods Benchmarking and Future Directions CVPR

【速读】:本文综述了反无人机(anti-UAV)领域的研究进展,聚焦于分类(classification)、检测(detection)和跟踪(tracking)三大核心目标,并详细探讨了基于扩散的数据合成(diffusion-based data synthesis)、多模态融合(multi-modal fusion)、视觉-语言建模(vision-language modeling)、自监督学习(self-supervised learning)以及强化学习(reinforcement learning)等新兴方法。论文系统评估了单模态与多传感器管道(包括RGB、红外、音频、雷达和射频信号)下的最新解决方案,并讨论了大规模基准测试与对抗性场景。论文指出当前解决方案在实时性能、隐蔽目标检测以及集群场景中的持续不足,强调了构建鲁棒且适应性强的反无人机系统的迫切需求。关键在于结合多种模态信息与先进学习范式,以提升系统的综合性能与应对复杂场景的能力。

链接: https://arxiv.org/abs/2504.11967
作者: Yifei Dong,Fengyi Wu,Sanjian Zhang,Guangyu Chen,Yuzhi Hu,Masumi Yano,Jingdong Sun,Siyu Huang,Feng Liu,Qi Dai,Zhi-Qi Cheng
机构: University of Washington (华盛顿大学); Carnegie Mellon University (卡内基梅隆大学); Clemson University (克莱姆森大学); Drexel University (德雷塞尔大学); Microsoft Research (微软研究)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
备注: Accepted at CVPR Workshop Anti-UAV 2025. 15 pages

点击查看摘要

Abstract:Unmanned Aerial Vehicles (UAVs) are indispensable for infrastructure inspection, surveillance, and related tasks, yet they also introduce critical security challenges. This survey provides a wide-ranging examination of the anti-UAV domain, centering on three core objectives-classification, detection, and tracking-while detailing emerging methodologies such as diffusion-based data synthesis, multi-modal fusion, vision-language modeling, self-supervised learning, and reinforcement learning. We systematically evaluate state-of-the-art solutions across both single-modality and multi-sensor pipelines (spanning RGB, infrared, audio, radar, and RF) and discuss large-scale as well as adversarially oriented benchmarks. Our analysis reveals persistent gaps in real-time performance, stealth detection, and swarm-based scenarios, underscoring pressing needs for robust, adaptive anti-UAV systems. By highlighting open research directions, we aim to foster innovation and guide the development of next-generation defense strategies in an era marked by the extensive use of UAVs.
zh

[CV-43] Exploring Video-Based Driver Activity Recognition under Noisy Labels

【速读】:该论文试图解决在驾驶员活动识别任务中利用带有噪声标签的数据进行学习的问题。针对真实世界视频数据中普遍存在误标样本的情况,影响模型的可靠性和性能这一挑战,论文提出了首个面向驾驶员活动识别任务的标签噪声学习方法。解决方案的关键在于基于聚类假设,首先使模型从给定视频中学习易于聚类的低维表示,并将嵌入分配到聚类中;其次,在每个聚类内执行协同优化以平滑分类器输出;此外,提出了一种结合两种选择标准的灵活样本选择策略,无需任何超参数即可从训练数据集中筛选干净样本,并在样本选择过程中引入自适应参数以实现类别间的平衡。实验结果表明,该方法在DriveAct公开数据集上的表现优于源自图像分类领域的其他去噪方法。
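论文的样本选择结合了两个无超参数的准则并加入类别平衡;此处仅以标签噪声学习中常见的"小损失准则 + 按类别保留固定比例"作一个粗略示意(keep_frac 为本文引入的示意超参数,并非论文做法):

```python
import numpy as np

def class_balanced_clean_selection(losses, labels, keep_frac=0.5):
    """对每个类别分别保留损失最小的一部分样本,
    在筛选"干净样本"的同时保持类别间平衡。返回布尔掩码。"""
    losses = np.asarray(losses, dtype=float)
    labels = np.asarray(labels)
    keep = np.zeros(len(losses), dtype=bool)
    for c in np.unique(labels):
        idx = np.where(labels == c)[0]
        k = max(1, int(round(keep_frac * len(idx))))  # 每类至少保留 1 个
        keep[idx[np.argsort(losses[idx])[:k]]] = True
    return keep
```

按类别分别筛选可以避免"易学类"占满选择配额,与论文中自适应的类别平衡参数动机一致。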

链接: https://arxiv.org/abs/2504.11966
作者: Linjuan Fan,Di Wen,Kunyu Peng,Kailun Yang,Jiaming Zhang,Ruiping Liu,Yufan Chen,Junwei Zheng,Jiamin Wu,Xudong Han,Rainer Stiefelhagen
机构: Institute for Anthropomatics and Robotics, Karlsruhe Institute of Technology (卡尔斯鲁厄理工学院); School of Robotics and the National Engineering Research Center of Robot Visual Perception and Control Technology, Hunan University (湖南大学); Shanghai AI Lab (上海人工智能实验室); University of Sussex (苏塞克斯大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Robotics (cs.RO); Image and Video Processing (eess.IV)
备注: The source code is available at this https URL

点击查看摘要

Abstract:As an open research topic in the field of deep learning, learning with noisy labels has attracted much attention and grown rapidly over the past ten years. Learning with label noise is crucial for driver distraction behavior recognition, as real-world video data often contains mislabeled samples, impacting model reliability and performance. However, label noise learning is barely explored in the driver activity recognition field. In this paper, we propose the first label noise learning approach for the driver activity recognition task. Based on the cluster assumption, we initially enable the model to learn clustering-friendly low-dimensional representations from given videos and assign the resultant embeddings into clusters. We subsequently perform co-refinement within each cluster to smooth the classifier outputs. Furthermore, we propose a flexible sample selection strategy that combines two selection criteria without relying on any hyperparameters to filter clean samples from the training dataset. We also incorporate a self-adaptive parameter into the sample selection process to enforce balancing across classes. A comprehensive variety of experiments on the public DriveAct dataset for all granularity levels demonstrates the superior performance of our method in comparison with other label-denoising methods derived from the image classification field. The source code is available at this https URL.
zh

[CV-44] Flow Intelligence: Robust Feature Matching via Temporal Signature Correlation

【速读】:该论文试图解决跨视频流特征匹配这一计算机视觉领域的核心挑战,特别是在面对噪声、错位或跨模态数据时传统方法失效的问题。论文提出的关键解决方案是Flow Intelligence范式,它摒弃传统的空间特征检测,转而专注于时间上的运动模式。该方法从连续帧的像素块中提取时间运动签名,并在视频之间对签名进行匹配,从而实现对平移、旋转和尺度变化的自然不变性,同时保持跨成像模态的鲁棒性。此外,Flow Intelligence无需预训练数据,省去了空间特征检测的需求,仅依赖时间运动即可实现跨模态匹配,并在传统方法失效的复杂场景中表现出色。
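"以时间运动签名取代空间特征"的思路可以用帧差直观示意:对每个像素块取相邻帧差的块均值得到一条时间曲线,再用归一化互相关比较两段视频中对应块的签名(示意代码,块大小与相关度量均为本文假设,并非论文官方实现):

```python
import numpy as np

def motion_signature(frames, y, x, block=8):
    """frames: (T, H, W) 灰度帧序列。取 (y, x) 处 block×block
    像素块相邻帧差的块均值,得到长度 T-1 的时间运动签名。"""
    patch = frames[:, y:y + block, x:x + block].astype(float)
    return np.abs(np.diff(patch, axis=0)).mean(axis=(1, 2))

def signature_similarity(s1, s2, eps=1e-8):
    """零均值归一化互相关:对整体亮度与增益差异(如跨模态成像)不敏感。"""
    a, b = s1 - s1.mean(), s2 - s2.mean()
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + eps))
```

由于签名只依赖运动的时间模式,不同模态下同一运动的签名仍高度相关,而静止区域的签名与任何运动签名都不相关。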

链接: https://arxiv.org/abs/2504.11949
作者: Jie Wang,Chen Ye Gan,Caoqi Wei,Jiangtao Wen,Yuxing Han
机构: SIGS, Tsinghua University (清华大学深圳国际研究生院); University of Electronic Science and Technology of China (电子科技大学); New York University Shanghai (纽约大学上海分校)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Feature matching across video streams remains a cornerstone challenge in computer vision. Increasingly, robust multimodal matching has garnered interest in robotics, surveillance, remote sensing, and medical imaging. While traditional methods rely on detecting and matching spatial features, they break down when faced with noisy, misaligned, or cross-modal data. Recent deep learning methods have improved robustness through learned representations, but remain constrained by their dependence on extensive training data and computational demands. We present Flow Intelligence, a paradigm-shifting approach that moves beyond spatial features by focusing on temporal motion patterns exclusively. Instead of detecting traditional keypoints, our method extracts motion signatures from pixel blocks across consecutive frames and extracts temporal motion signatures between videos. These motion-based descriptors achieve natural invariance to translation, rotation, and scale variations while remaining robust across different imaging modalities. This novel approach also requires no pretraining data, eliminates the need for spatial feature detection, enables cross-modal matching using only temporal motion, and it outperforms existing methods in challenging scenarios where traditional approaches fail. By leveraging motion rather than appearance, Flow Intelligence enables robust, real-time video feature matching in diverse environments.
zh

[CV-45] R-Meshfusion: Reinforcement Learning Powered Sparse-View Mesh Reconstruction with Diffusion Priors

【速读】:本文旨在解决多视角图像下稀疏视图(sparse-view)网格重建性能显著下降的问题,特别是在无真实观测数据的未见区域。现有扩散模型虽在从有限输入生成新视角方面表现强大,但其输出常存在视觉伪影且缺乏三维一致性,这给可靠网格优化带来了挑战。论文提出了一种新颖框架,通过扩散模型以系统化和可靠的方式增强稀疏视图网格重建。关键解决方案包括引入共识扩散模块(Consensus Diffusion Module),通过四分位距(interquartile range, IQR)分析过滤不可靠生成,并执行方差感知图像融合以生成鲁棒的伪监督信号;设计基于置信上界(Upper Confidence Bound, UCB)的在线强化学习策略,自适应选择最具信息量的视点进行增强;最后,将融合图像与稀疏视图的真实观测数据共同用于监督基于NeRF的模型,确保几何和外观的一致性。实验表明,该方法在几何质量和渲染质量上均实现了显著提升。
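共识扩散模块中的 IQR 过滤与基于 UCB 的视点选择都是标准构件,下面各给一个最小示意(与论文具体实现无关,k、c 等参数均取常用默认值):

```python
import numpy as np

def iqr_filter(values, k=1.5):
    """四分位距过滤:保留落在 [Q1 - k*IQR, Q3 + k*IQR] 内的样本,
    可用于剔除不可靠的扩散生成结果。返回布尔掩码。"""
    v = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(v, [25, 75])
    iqr = q3 - q1
    return (v >= q1 - k * iqr) & (v <= q3 + k * iqr)

def ucb_select(mean_rewards, counts, t, c=2.0):
    """UCB1 式选择:经验均值加探索加成,返回得分最高的视点下标;
    被采样次数少的视点获得更大的探索加成。"""
    means = np.asarray(mean_rewards, dtype=float)
    counts = np.maximum(np.asarray(counts, dtype=float), 1e-9)
    scores = means + c * np.sqrt(np.log(max(t, 2)) / counts)
    return int(np.argmax(scores))
```

在论文框架中,"奖励"可理解为由扩散损失导出的信息量信号:均值高或探索不足的视点会被优先选中进行增强。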

链接: https://arxiv.org/abs/2504.11946
作者: Haoyang Wang,Liming Liu,Peiheng Wang,Junlin Hao,Jiangkai Wu,Xinggong Zhang
机构: Peking University (北京大学); Beijing (北京)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Mesh reconstruction from multi-view images is a fundamental problem in computer vision, but its performance degrades significantly under sparse-view conditions, especially in unseen regions where no ground-truth observations are available. While recent advances in diffusion models have demonstrated strong capabilities in synthesizing novel views from limited inputs, their outputs often suffer from visual artifacts and lack 3D consistency, posing challenges for reliable mesh optimization. In this paper, we propose a novel framework that leverages diffusion models to enhance sparse-view mesh reconstruction in a principled and reliable manner. To address the instability of diffusion outputs, we propose a Consensus Diffusion Module that filters unreliable generations via interquartile range (IQR) analysis and performs variance-aware image fusion to produce robust pseudo-supervision. Building on this, we design an online reinforcement learning strategy based on the Upper Confidence Bound (UCB) to adaptively select the most informative viewpoints for enhancement, guided by diffusion loss. Finally, the fused images are used to jointly supervise a NeRF-based model alongside sparse-view ground truth, ensuring consistency across both geometry and appearance. Extensive experiments demonstrate that our method achieves significant improvements in both geometric quality and rendering quality.
zh

[CV-46] Beyond Words: Augmenting Discriminative Richness via Diffusions in Unsupervised Prompt Learning

【速读】:该论文致力于解决利用大量无标注数据微调视觉语言模型(Vision-Language Models, VLMs)时,高质量伪标签数据缺乏的问题。当前的伪标签生成策略常因语义与视觉信息之间的不匹配而导致无监督提示学习(Unsupervised Prompt Learning, UPL)方法性能不佳。为应对这一挑战,论文提出了一种名为“通过扩散增强判别丰富性”(Augmenting Discriminative Richness via Diffusions, AiR)的简单而有效的方法。其关键在于通过高保真合成样本构建辅助分类器,捕捉更丰富的视觉变化,将文本-图像对分类转化为更具鲁棒性的图像-图像对分类,从而实现对类别更全面的表征。此外,利用基于扩散模型的合成样本多样性来提升提示学习的效果,为语义-视觉对齐提供更多信息支持。实验结果表明,AiR在多种公开基准数据集(如RESISC45和Flowers102)以及三种学习范式下均显著提升了无监督提示学习方法的性能。

链接: https://arxiv.org/abs/2504.11930
作者: Hairui Ren,Fan Tang,He Zhao,Zixuan Wang,Dandan Guo,Yi Chang
机构: School of Artificial Intelligence, Jilin University (吉林大学人工智能学院); Institute of Computing Technology, Chinese Academy of Sciences (中国科学院计算技术研究所); CSIRO’s Data61 (澳大利亚联邦科学与工业研究组织的数据61部门)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Fine-tuning vision-language models (VLMs) with large amounts of unlabeled data has recently garnered significant interest. However, a key challenge remains the lack of high-quality pseudo-labeled data. Current pseudo-labeling strategies often struggle with mismatches between semantic and visual information, leading to sub-optimal performance of unsupervised prompt learning (UPL) methods. In this paper, we introduce a simple yet effective approach called Augmenting Discriminative Richness via Diffusions (AiR), toward learning a richer discriminating way to represent the class comprehensively and thus facilitate classification. Specifically, our approach includes a pseudo-label generation module that leverages high-fidelity synthetic samples to create an auxiliary classifier, which captures richer visual variation, bridging text-image-pair classification to a more robust image-image-pair classification. Additionally, we exploit the diversity of diffusion-based synthetic samples to enhance prompt learning, providing greater information for semantic-visual alignment. Extensive experiments on five public benchmarks, including RESISC45 and Flowers102, and across three learning paradigms (UL, SSL, and TRZSL) demonstrate that AiR achieves substantial and consistent performance improvements over state-of-the-art unsupervised prompt learning methods.
zh

[CV-47] SemDiff: Generating Natural Unrestricted Adversarial Examples via Semantic Attributes Optimization in Diffusion Models

【速读】:该论文旨在解决无约束对抗样本(UAEs)生成中自然性和不可察觉性不足的问题,尤其是在基于扩散模型生成UAEs时,由于仅优化中间潜伏噪声而导致的缺陷。论文的关键解决方案是提出SemDiff,这是一种新颖的无约束对抗攻击方法,通过探索扩散模型的语义潜伏空间以提取有意义的属性,并设计多属性优化方法,在确保攻击成功率的同时保持生成UAEs的自然性和不可察觉性。实验结果表明,SemDiff在攻击成功率和不可察觉性方面优于现有最先进方法,并且生成的UAEs具有语义意义且与属性权重一致。此外,SemDiff还能够规避不同防御措施,进一步验证了其有效性和威胁性。

链接: https://arxiv.org/abs/2504.11923
作者: Zeyu Dai,Shengcai Liu,Rui He,Jiahao Wu,Ning Lu,Wenqi Fan,Qing Li,Ke Tang
机构: Guangdong Provincial Key Laboratory of Brain-Inspired Intelligent Computation and Department of Computer Science and Engineering, Southern University of Science and Technology (南方科技大学), Shenzhen 518055, China; Department of Computing, The Hong Kong Polytechnic University (香港理工大学), Hong Kong
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Unrestricted adversarial examples (UAEs) allow the attacker to create non-constrained adversarial examples without given clean samples, posing a severe threat to the safety of deep learning models. Recent works utilize diffusion models to generate UAEs. However, these UAEs often lack naturalness and imperceptibility due to simply optimizing in intermediate latent noises. In light of this, we propose SemDiff, a novel unrestricted adversarial attack that explores the semantic latent space of diffusion models for meaningful attributes, and devises a multi-attributes optimization approach to ensure attack success while maintaining the naturalness and imperceptibility of generated UAEs. We perform extensive experiments on four tasks on three high-resolution datasets, including CelebA-HQ, AFHQ and ImageNet. The results demonstrate that SemDiff outperforms state-of-the-art methods in terms of attack success rate and imperceptibility. The generated UAEs are natural and exhibit semantically meaningful changes, in accord with the attributes’ weights. In addition, SemDiff is found capable of evading different defenses, which further validates its effectiveness and threat.
zh

[CV-48] Zooming In on Fakes: A Novel Dataset for Localized AI-Generated Image Detection with Forgery Amplification Approach

【速读】:该论文旨在解决现有AI生成内容(AIGC)检测方法在处理场景级局部伪造(如天空或地面区域的编辑)时的局限性。虽然已有研究探索了局部伪造检测,但现有的数据集主要集中在物体级别的伪造,而忽视了更广泛的场景编辑。为了解决这些不足,论文引入了BR-Gen,这是一个包含150,000张多样化且语义标注的局部伪造图像的大规模数据集,并通过感知-创建-评估(Perception-Creation-Evaluation)的全自动管道确保语义一致性和视觉真实性。此外,论文提出了NFA-ViT,一种基于噪声引导的伪造放大视觉Transformer,通过在整个图像中增强与伪造相关的特征来改进局部伪造检测。其关键是利用噪声指纹挖掘图像中的异质区域(即潜在编辑区域),并通过注意力机制促进正常与异常特征之间的交互,从而在整个图像中传播泛化痕迹,提升对细微伪造的检测能力和整体鲁棒性。实验表明,BR-Gen构建了现有方法未覆盖的新场景,而NFA-ViT在BR-Gen上的表现优于现有方法,并在当前基准测试中具有良好的泛化能力。

链接: https://arxiv.org/abs/2504.11922
作者: Lvpan Cai,Haowei Wang,Jiayi Ji,YanShu ZhouMen,Yiwei Ma,Xiaoshuai Sun,Liujuan Cao,Rongrong Ji
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:The rise of AI-generated image editing tools has made localized forgeries increasingly realistic, posing challenges for visual content integrity. Although recent efforts have explored localized AIGC detection, existing datasets predominantly focus on object-level forgeries while overlooking broader scene edits in regions such as sky or ground. To address these limitations, we introduce BR-Gen, a large-scale dataset of 150,000 locally forged images with diverse scene-aware annotations, which are based on semantic calibration to ensure high-quality samples. BR-Gen is constructed through a fully automated Perception-Creation-Evaluation pipeline to ensure semantic coherence and visual realism. In addition, we further propose NFA-ViT, a Noise-guided Forgery Amplification Vision Transformer that enhances the detection of localized forgeries by amplifying forgery-related features across the entire image. NFA-ViT mines heterogeneous regions in images, i.e., potential edited areas, by noise fingerprints. Subsequently, an attention mechanism is introduced to compel the interaction between normal and abnormal features, thereby propagating the generalization traces throughout the entire image, allowing subtle forgeries to influence a broader context and improving overall detection robustness. Extensive experiments demonstrate that BR-Gen constructs entirely new scenarios that are not covered by existing methods. Going a step further, NFA-ViT outperforms existing methods on BR-Gen and generalizes well across current benchmarks. All data and codes are available at this https URL.
zh

[CV-49] AnomalyR1: A GRPO-based End-to-end MLLM for Industrial Anomaly Detection

【速读】:本文旨在解决工业异常检测(IAD)中因缺陷样本稀缺导致的传统方法难以有效检测未知异常的问题。为应对这一挑战,论文提出的关键解决方案是AnomalyR1框架,它结合了多模态大型语言模型(MLLM)VLM-R1及其特有的推理能力,并通过引入新的推理结果对齐度量(ROAM)优化组相对策略(GRPO),实现了从输入处理到精确异常定位与分割的端到端自动化流程。这一创新不仅突破了传统依赖手工特征或领域专家模型的局限性,还通过紧凑型30亿参数模型在最新多模态IAD基准测试中取得了最先进的性能表现,展示了ROAM增强的GRPO在有限缺陷数据场景下的巨大潜力。

链接: https://arxiv.org/abs/2504.11914
作者: Yuhao Chao,Jie Liu,Jie Tang,Gangshan Wu
机构: State Key Laboratory for Novel Software Technology, Nanjing University (南京大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Industrial Anomaly Detection (IAD) poses a formidable challenge due to the scarcity of defective samples, making it imperative to deploy models capable of robust generalization to detect unseen anomalies effectively. Traditional approaches, often constrained by hand-crafted features or domain-specific expert models, struggle to address this limitation, underscoring the need for a paradigm shift. We introduce AnomalyR1, a pioneering framework that leverages VLM-R1, a Multimodal Large Language Model (MLLM) renowned for its exceptional generalization and interpretability, to revolutionize IAD. By integrating MLLM with Group Relative Policy Optimization (GRPO), enhanced by our novel Reasoned Outcome Alignment Metric (ROAM), AnomalyR1 achieves a fully end-to-end solution that autonomously processes inputs of image and domain knowledge, reasons through analysis, and generates precise anomaly localizations and masks. Based on the latest multimodal IAD benchmark, our compact 3-billion-parameter model outperforms existing methods, establishing state-of-the-art results. As MLLM capabilities continue to advance, this study is the first to deliver an end-to-end VLM-based IAD solution that demonstrates the transformative potential of ROAM-enhanced GRPO, positioning our framework as a forward-looking cornerstone for next-generation intelligent anomaly detection systems in industrial applications with limited defective data.
zh
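摘要中 GRPO 的核心是组内相对优势:对同一输入采样一组回答,用组内均值和标准差对奖励做归一化。以下为该计算的极简示意(笔者补充;论文中基于 ROAM 的奖励度量此处以任意标量奖励代替):

```python
import numpy as np

def grpo_advantages(rewards):
    """GRPO 组内相对优势:优势 = (奖励 - 组内均值) / 组内标准差。
    rewards: (G,) 同一输入下一组采样回答的标量奖励。"""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)
```

归一化后的优势再用于加权各回答的策略梯度,高于组均值的回答被强化、低于的被抑制。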

[CV-50] Learning Physics-Informed Color-Aware Transforms for Low-Light Image Enhancement ICME2025

【速读】:该论文旨在解决现有低光图像增强方法在sRGB色彩空间中直接映射低光到正常光照图像时面临的颜色预测不一致以及对光谱功率分布(SPD)变化敏感的问题,这些问题导致其在不同照明条件下的性能不稳定。为了解决这些挑战,论文提出了一个基于物理信息的颜色感知变换(Physics-informed Color-aware Transform, PiCat),它通过引入颜色感知变换(Color-aware Transform, CAT)将低光图像从sRGB色彩空间转换为深度光照不变描述符,从而稳健地处理复杂的光照和SPD变化。此外,还提出了内容噪声分解网络(Content-Noise Decomposition Network, CNDN),用于优化描述符分布以更好地适应良好照明条件,有效恢复低光图像的内容表示。关键在于CAT和CNDN共同作为物理先验,指导从低光域到正常光域的转换过程。实验结果表明,所提出的PiCat框架在五个基准数据集上表现优于现有最先进方法。

链接: https://arxiv.org/abs/2504.11896
作者: Xingxing Yang,Jie Chen,Zaifeng Yang
机构: Department of Computer Science, Hong Kong Baptist University (香港浸会大学计算机科学系); Institute of High Performance Computing, Agency for Science, Technology and Research (新加坡科技研究局高性能计算研究所)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted by ICME 2025

点击查看摘要

Abstract:Image decomposition offers deep insights into the imaging factors of visual data and significantly enhances various advanced computer vision tasks. In this work, we introduce a novel approach to low-light image enhancement based on decomposed physics-informed priors. Existing methods that directly map low-light to normal-light images in the sRGB color space suffer from inconsistent color predictions and high sensitivity to spectral power distribution (SPD) variations, resulting in unstable performance under diverse lighting conditions. To address these challenges, we introduce a Physics-informed Color-aware Transform (PiCat), a learning-based framework that converts low-light images from the sRGB color space into deep illumination-invariant descriptors via our proposed Color-aware Transform (CAT). This transformation enables robust handling of complex lighting and SPD variations. Complementing this, we propose the Content-Noise Decomposition Network (CNDN), which refines the descriptor distributions to better align with well-lit conditions by mitigating noise and other distortions, thereby effectively restoring content representations to low-light images. The CAT and the CNDN collectively act as a physical prior, guiding the transformation process from low-light to normal-light domains. Our proposed PiCat framework demonstrates superior performance compared to state-of-the-art methods across five benchmark datasets.
zh
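论文提出的 CAT 变换细节未在摘要中展开;作为对照,经典的对数色度(log-chromaticity)变换同样能得到对光照整体缩放不变的描述符,有助于理解"深度光照不变描述符"的含义(笔者补充的示意,并非 PiCat 的实现):

```python
import numpy as np

def log_chromaticity(rgb):
    """对数色度变换:逐像素计算 log(R/G) 与 log(B/G)。
    当 RGB 三通道被同一光照系数缩放时,比值不变,描述符因此近似光照不变。
    rgb: (H, W, 3) 正值数组。"""
    eps = 1e-6
    r, g, b = rgb[..., 0] + eps, rgb[..., 1] + eps, rgb[..., 2] + eps
    return np.stack([np.log(r / g), np.log(b / g)], axis=-1)
```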

[CV-51] Search is All You Need for Few-shot Anomaly Detection

【速读】:该论文致力于解决少样本异常检测(Few-shot Anomaly Detection, FSAD)在工业检测中的挑战,即仅基于极少量正常样本完成正常分布建模。现有方法通常依赖于多模态基础模型并通过提示工程实现引导式异常检测,但这些方法往往需要复杂的提示设计与大量人工调参。论文的关键解决方案是提出了一种名为VisionAD的简单而高效的最近邻搜索框架,其核心在于四个关键组件:(1) 可扩展视觉基础模型以提取通用且判别性强的特征;(2) 双重增强策略,包括支持集增强以提升特征匹配适应性,以及查询集增强以弥补单视图预测的不足;(3) 多层特征融合机制,在低频全局上下文与高频局部细节之间实现高效整合;(4) 类感知视觉记忆库,用于高效的一对多多类别检测。实验结果表明,VisionAD在MVTec-AD、VisA和Real-IAD等基准数据集上实现了卓越性能,尤其在仅使用1张正常样本的情况下,分别取得了97.4%、94.8%和70.8%的图像级AUROC分数,显著超越当前最先进的方法。这种无需训练且具备优越少样本能力的方法使其特别适用于样本稀缺或获取成本高昂的实际应用中。

链接: https://arxiv.org/abs/2504.11895
作者: Qishan Wang,Jia Guo,Shuyong Gao,Haofen Wang,Li Xiong,Junjie Hu,Hanqi Guo,Wenqiang Zhang
机构: Academy for Engineering and Technology, Fudan University (复旦大学工程与技术学院); School of Biomedical Engineering, Tsinghua University (清华大学生物医学工程学院); School of Computer Science, Fudan University (复旦大学计算机科学学院); College of Design & Innovation, Tongji University (同济大学设计创意学院); School of Physics and Mechanical & Electrical Engineering, Hexi University (河西大学物理与机电工程学院); Engineering Research Center of AI & Robotics, Ministry of Education (教育部人工智能与机器人工程研究中心)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Few-shot anomaly detection (FSAD) has emerged as a crucial yet challenging task in industrial inspection, where normal distribution modeling must be accomplished with only a few normal images. While existing approaches typically employ multi-modal foundation models combining language and vision modalities for prompt-guided anomaly detection, these methods often demand sophisticated prompt engineering and extensive manual tuning. In this paper, we demonstrate that a straightforward nearest-neighbor search framework can surpass state-of-the-art performance in both single-class and multi-class FSAD scenarios. Our proposed method, VisionAD, consists of four simple yet essential components: (1) scalable vision foundation models that extract universal and discriminative features; (2) dual augmentation strategies: support augmentation to enhance feature matching adaptability and query augmentation to address the oversights of single-view prediction; (3) multi-layer feature integration that captures both low-frequency global context and high-frequency local details with minimal computational overhead; and (4) a class-aware visual memory bank enabling efficient one-for-all multi-class detection. Extensive evaluations across MVTec-AD, VisA, and Real-IAD benchmarks demonstrate VisionAD’s exceptional performance. Using only 1 normal image as support, our method achieves remarkable image-level AUROC scores of 97.4%, 94.8%, and 70.8% respectively, outperforming current state-of-the-art approaches by significant margins (+1.6%, +3.2%, and +1.4%). The training-free nature and superior few-shot capabilities of VisionAD make it particularly appealing for real-world applications where samples are scarce or expensive to obtain. Code is available at this https URL.
zh
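VisionAD 式的最近邻检索打分可用几行 NumPy 示意:将查询图的 patch 特征与正常样本特征库逐一比对,取最小距离作为 patch 异常分,再取最大值作为图像级分数(笔者补充的极简版本,省略了论文中的双重增强与多层特征融合):

```python
import numpy as np

def anomaly_score(query_feats, memory_bank):
    """最近邻异常打分。
    query_feats: (M, D) 查询图的 patch 特征;memory_bank: (N, D) 正常样本特征库。
    返回图像级异常分数(所有 patch 最近邻距离的最大值)。"""
    # 欧氏距离矩阵 (M, N)
    d = np.linalg.norm(query_feats[:, None, :] - memory_bank[None, :, :], axis=-1)
    patch_scores = d.min(axis=1)  # 每个 patch 到特征库的最小距离
    return patch_scores.max()
```

支持新类别时,仅需把该类的少量正常样本特征追加进 memory_bank,无需任何训练。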

[CV-52] CAGS: Open-Vocabulary 3D Scene Understanding with Context-Aware Gaussian Splatting

【速读】:本文旨在解决开放词汇量(open-vocabulary)3D场景理解中的跨视图粒度不一致性问题,这一问题是由于基于2D分割方法(如SAM)引入的,导致在不同视角下物体分割结果不一致(例如,“咖啡套装”在一个视角中被分割为单一实体,而在另一个视角中被分割为“杯子+咖啡+勺子”)。现有基于3D高斯点阵(3DGS)的方法通常依赖于孤立的每个高斯分布特征学习,忽视了空间上下文信息,从而产生碎片化的表示。为了解决此问题,论文提出了Context-Aware Gaussian Splatting (CAGS),其关键是通过将空间上下文信息整合到3DGS中来改善这一状况。具体而言,CAGS构建局部图以在高斯分布之间传播上下文特征,减少因粒度不一致产生的噪声;采用基于掩码对比学习的方法平滑SAM生成的跨视角特征;并通过预计算策略预先计算邻域关系,降低计算成本,实现在大规模场景下的高效训练。通过集成空间上下文信息,CAGS显著提升了3D实例分割性能,并减少了片段化错误,在LERF-OVS和ScanNet数据集上表现优异,从而实现了稳健的语言引导型3D场景理解。

链接: https://arxiv.org/abs/2504.11893
作者: Wei Sun,Yanzhao Zhou,Jianbin Jiao,Yuan Li
机构: University of Chinese Academy of Sciences (中国科学院大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Open-vocabulary 3D scene understanding is crucial for applications requiring natural language-driven spatial interpretation, such as robotics and augmented reality. While 3D Gaussian Splatting (3DGS) offers a powerful representation for scene reconstruction, integrating it with open-vocabulary frameworks reveals a key challenge: cross-view granularity inconsistency. This issue, stemming from 2D segmentation methods like SAM, results in inconsistent object segmentations across views (e.g., a “coffee set” segmented as a single entity in one view but as “cup + coffee + spoon” in another). Existing 3DGS-based methods often rely on isolated per-Gaussian feature learning, neglecting the spatial context needed for cohesive object reasoning, leading to fragmented representations. We propose Context-Aware Gaussian Splatting (CAGS), a novel framework that addresses this challenge by incorporating spatial context into 3DGS. CAGS constructs local graphs to propagate contextual features across Gaussians, reducing noise from inconsistent granularity, employs mask-centric contrastive learning to smooth SAM-derived features across views, and leverages a precomputation strategy to reduce computational cost by precomputing neighborhood relationships, enabling efficient training in large-scale scenes. By integrating spatial context, CAGS significantly improves 3D instance segmentation and reduces fragmentation errors on datasets like LERF-OVS and ScanNet, enabling robust language-guided 3D scene understanding.
zh
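CAGS 在局部图上传播上下文特征的思想,可以用 k 近邻图上的迭代特征平滑来示意(笔者补充的假设性草图,非官方实现;feats 为逐高斯特征,positions 为其空间坐标):

```python
import numpy as np

def propagate_features(feats, positions, k=3, alpha=0.5, iters=2):
    """在 k 近邻图上平滑逐高斯特征:feat <- (1-alpha)*feat + alpha*邻居均值,
    用空间上下文抑制跨视角粒度不一致带来的特征噪声。"""
    d = np.linalg.norm(positions[:, None] - positions[None, :], axis=-1)
    np.fill_diagonal(d, np.inf)           # 排除自身
    nbrs = np.argsort(d, axis=1)[:, :k]   # 每个高斯的 k 个空间近邻
    f = feats.astype(float).copy()
    for _ in range(iters):
        f = (1 - alpha) * f + alpha * f[nbrs].mean(axis=1)
    return f
```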

[CV-53] Learning Compatible Multi-Prize Subnetworks for Asymmetric Retrieval CVPR2025

【速读】:本文旨在解决跨平台适配(Cross-Platform Adaptation)中的兼容性问题,特别是在不对称检索系统中,现有方法因缺乏灵活性而在多平台部署时面临挑战。例如,当引入新平台时,需要额外训练与现有模型兼容的子网络,这增加了开发成本。为应对这一问题,论文提出了一种具有自兼容性的可剪枝网络(Prunable Network with Self-Compatibility)。其关键在于通过后训练剪枝技术,在一个密集网络内同时优化不同容量子网络的架构与权重,并设计了一种冲突感知的梯度整合方案以缓解主网络与子网络之间的梯度冲突。这种方法允许开发者在无需额外训练的情况下,生成与新平台资源匹配的子网络,从而实现高效的跨平台适配。

链接: https://arxiv.org/abs/2504.11879
作者: Yushuai Sun,Zikun Zhou,Dongmei Jiang,Yaowei Wang,Jun Yu,Guangming Lu,Wenjie Pei
机构: Harbin Institute of Technology (哈尔滨工业大学); Pengcheng Laboratory (鹏城实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to CVPR 2025

点击查看摘要

Abstract:Asymmetric retrieval is a typical scenario in real-world retrieval systems, where compatible models of varying capacities are deployed on platforms with different resource configurations. Existing methods generally train pre-defined networks or subnetworks with capacities specifically designed for pre-determined platforms, using compatible learning. Nevertheless, these methods suffer from limited flexibility for multi-platform deployment. For example, when introducing a new platform into the retrieval systems, developers have to train an additional model at an appropriate capacity that is compatible with existing models via backward-compatible learning. In this paper, we propose a Prunable Network with self-compatibility, which allows developers to generate compatible subnetworks at any desired capacity through post-training pruning. Thus it allows the creation of a sparse subnetwork matching the resources of the new platform without additional training. Specifically, we optimize both the architecture and weight of subnetworks at different capacities within a dense network in compatible learning. We also design a conflict-aware gradient integration scheme to handle the gradient conflicts between the dense network and subnetworks during compatible learning. Extensive experiments on diverse benchmarks and visual backbones demonstrate the effectiveness of our method. Our code and model are available at this https URL.
zh
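"训练后剪枝生成任意容量子网络"这一点,可以用最简单的幅值剪枝来示意:按权重绝对值保留前 keep_ratio 比例,其余置零(笔者补充;论文中子网络的结构与权重是在兼容学习中联合优化的,此处仅演示"无需再训练即可得到稀疏子网络"的接口形态):

```python
import numpy as np

def prune_to_capacity(weight, keep_ratio):
    """训练后幅值剪枝:保留 |w| 最大的前 keep_ratio 比例权重,其余置零。
    返回剪枝后的权重和对应的布尔掩码。"""
    flat = np.abs(weight).ravel()
    k = max(1, int(len(flat) * keep_ratio))
    thresh = np.partition(flat, -k)[-k]   # 第 k 大的绝对值作为阈值
    mask = np.abs(weight) >= thresh
    return weight * mask, mask
```

为新平台生成子网络时,只需按其资源预算选择 keep_ratio 调用一次即可。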

[CV-54] A Category-Fragment Segmentation Framework for Pelvic Fracture Segmentation in X-ray Images

【速读】:该论文旨在解决通过二维X射线(2D X-ray)图像自动分割骨盆骨折碎片的问题。解决方案的关键在于提出了一种基于深度学习的类别与碎片分割(Category and Fragment Segmentation, CFS)框架,该框架包含三个连续步骤:类别分割、碎片分割以及后处理。通过这一框架,研究实现了解剖结构0.91的交并比(IoU)和骨折分割0.78的IoU,验证了其方法的有效性和准确性。

链接: https://arxiv.org/abs/2504.11872
作者: Daiqi Liu,Fuxin Fan,Andreas Maier
机构: Pattern Recognition Lab, Friedrich-Alexander-Universität Erlangen-Nürnberg (埃尔朗根-纽伦堡弗里德里希亚历山大大学模式识别实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 5 pages, 2 figures, 1 table

点击查看摘要

Abstract:Pelvic fractures, often caused by high-impact trauma, frequently require surgical intervention. Imaging techniques such as CT and 2D X-ray imaging are used to transfer the surgical plan to the operating room through image registration, enabling quick intraoperative adjustments. Specifically, segmenting pelvic fractures from 2D X-ray imaging can assist in accurately positioning bone fragments and guiding the placement of screws or metal plates. In this study, we propose a novel deep learning-based category and fragment segmentation (CFS) framework for the automatic segmentation of pelvic bone fragments in 2D X-ray images. The framework consists of three consecutive steps: category segmentation, fragment segmentation, and post-processing. Our best model achieves an IoU of 0.91 for anatomical structures and 0.78 for fracture segmentation. Results demonstrate that the CFS framework is effective and accurate.
zh
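摘要中报告的 0.91 / 0.78 即交并比(IoU)。该指标的定义与计算非常直接,附一个参考实现(通用指标代码,非论文特有实现):

```python
import numpy as np

def mask_iou(pred, gt):
    """二值分割掩码的交并比:IoU = |交集| / |并集|。"""
    pred, gt = pred.astype(bool), gt.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return inter / union if union > 0 else 1.0
```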

[CV-55] Synthetic Data for Blood Vessel Network Extraction ICLR2025

【速读】:该论文旨在解决从三维显微镜数据中自动提取脑血管网络拓扑信息的问题,这一任务面临标注数据稀缺和高拓扑精度需求的挑战。论文的关键解决方案在于结合合成数据生成与深度学习技术,通过构建一个包含三个阶段的综合管道生成大规模逼真的合成数据集,并设计了一个基于3D U-Net的两阶段深度学习框架用于节点检测和边预测。此外,利用仅有的5个手动标注的真实样本进行微调,显著提升了边缘预测的F1分数(从0.496提高到0.626),证明了该方法在实际应用中的可行性,为大规模脑血管分析在中风研究中的应用开辟了新途径。

链接: https://arxiv.org/abs/2504.11858
作者: Joël Mathys,Andreas Plesner,Jorel Elmiger,Roger Wattenhofer
机构: ETH Zurich (苏黎世联邦理工学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Presented at SynthData Workshop at ICLR 2025

点击查看摘要

Abstract:Blood vessel networks in the brain play a crucial role in stroke research, where understanding their topology is essential for analyzing blood flow dynamics. However, extracting detailed topological vessel network information from microscopy data remains a significant challenge, mainly due to the scarcity of labeled training data and the need for high topological accuracy. This work combines synthetic data generation with deep learning to automatically extract vessel networks as graphs from volumetric microscopy data. To combat data scarcity, we introduce a comprehensive pipeline for generating large-scale synthetic datasets that mirror the characteristics of real vessel networks. Our three-stage approach progresses from abstract graph generation through vessel mask creation to realistic medical image synthesis, incorporating biological constraints and imaging artifacts at each stage. Using this synthetic data, we develop a two-stage deep learning pipeline of 3D U-Net-based models for node detection and edge prediction. Fine-tuning on real microscopy data shows promising adaptation, improving edge prediction F1 scores from 0.496 to 0.626 by training on merely 5 manually labeled samples. These results suggest that automated vessel network extraction is becoming practically feasible, opening new possibilities for large-scale vascular analysis in stroke research.
zh

[CV-56] Cross-Frequency Collaborative Training Network and Dataset for Semi-supervised First Molar Root Canal Segmentation

【速读】:本文旨在解决根管(RC)治疗领域因缺乏公开数据集而导致深度学习应用受限的问题。解决方案的关键在于构建了一个名为FMRC-2025的第一磨牙根管分割数据集,并设计了一种名为CFC-Net的跨频协作训练半监督学习(Semi-Supervised Learning, SSL)网络。CFC-Net包含两个核心组件:(1) 跨频协作均值教师(Cross-Frequency Collaborative Mean Teacher, CFC-MT),通过引入两个专门的学生(Specialized Students, SS)和一个综合教师(Comprehensive Teacher, CT),实现多频段协作训练,并通过交叉与全频一致性监督充分整合多频知识;(2) 不确定性引导的跨频混合(Uncertainty-guided Cross-Frequency Mix, UCF-Mix)机制,使网络能够生成高置信度伪标签,同时学习整合多频信息并保持目标结构完整性。实验结果表明,CFC-MT不仅在RC分割任务中表现优异,还具有良好的泛化能力,优于现有最先进的SSL医学图像分割方法。

链接: https://arxiv.org/abs/2504.11856
作者: Zhenhuan Zhou,Yuchen Zhang,Along He,Peng Wang,Xueshuo Xie,Tao Li
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 12 pages, Initial submission time 25 December 2024, Now Under Review

点击查看摘要

Abstract:Root canal (RC) treatment is a highly delicate and technically complex procedure in clinical practice, heavily influenced by the clinicians’ experience and subjective judgment. Deep learning has made significant advancements in the field of computer-aided diagnosis (CAD) because it can provide more objective and accurate diagnostic results. However, its application in RC treatment is still relatively rare, mainly due to the lack of public datasets in this field. To address this issue, in this paper, we established a First Molar Root Canal segmentation dataset called FMRC-2025. Additionally, to alleviate the workload of manual annotation for dentists and fully leverage the unlabeled data, we designed a Cross-Frequency Collaborative training semi-supervised learning (SSL) Network called CFC-Net. It consists of two components: (1) Cross-Frequency Collaborative Mean Teacher (CFC-MT), which introduces two specialized students (SS) and one comprehensive teacher (CT) for collaborative multi-frequency training. The CT and SS are trained on different frequency components while fully integrating multi-frequency knowledge through cross and full frequency consistency supervisions. (2) Uncertainty-guided Cross-Frequency Mix (UCF-Mix) mechanism enables the network to generate high-confidence pseudo-labels while learning to integrate multi-frequency information and maintaining the structural integrity of the targets. Extensive experiments on FMRC-2025 and three public dental datasets demonstrate that CFC-MT is effective for RC segmentation and can also exhibit strong generalizability on other dental segmentation tasks, outperforming state-of-the-art SSL medical image segmentation methods. Codes and dataset will be released.
zh
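CFC-MT 属于 Mean Teacher 一族方法,这类框架中教师参数通常取学生参数的指数滑动平均(EMA)。下面是该更新规则的最小示意(笔者补充;综合教师 CT 对两个频段学生 SS 的具体聚合方式以论文为准):

```python
def ema_update(teacher_params, student_params, momentum=0.99):
    """Mean Teacher 的核心更新:teacher <- m*teacher + (1-m)*student。
    教师因此成为学生参数随训练步的指数滑动平均,输出更平滑稳定。"""
    return [momentum * t + (1 - momentum) * s
            for t, s in zip(teacher_params, student_params)]
```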

[CV-57] ACE: Attentional Concept Erasure in Diffusion Models

【速读】:该论文试图解决扩散模型中概念擦除的问题,即从预训练模型中移除指定概念,使得触发该概念(或相关同义词)不再生成其表征,同时保留模型生成其他内容的能力。论文的关键在于提出了一种名为注意力机制概念擦除(Attentional Concept Erasure, ACE)的新方法,该方法结合了闭式注意力操作与轻量级微调。理论层面,将概念擦除建模为使模型在目标概念上的条件分布与中性分布对齐的过程;技术实现上,通过门控低秩适配识别并消除跨注意力模块中的概念特定潜在方向,并辅以对抗增强微调确保彻底擦除目标概念及其同义词。实验表明,ACE 在多个基准测试中表现出最先进的擦除效果和鲁棒性,同时实现了通用性和特异性之间的良好平衡,可扩展至数十个概念且效率高,每处理一个概念仅需几秒钟。

链接: https://arxiv.org/abs/2504.11850
作者: Finn Carter
机构: Shandong University (山东大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Under Review

点击查看摘要

Abstract:Large text-to-image diffusion models have demonstrated remarkable image synthesis capabilities, but their indiscriminate training on Internet-scale data has led to learned concepts that enable harmful, copyrighted, or otherwise undesirable content generation. We address the task of concept erasure in diffusion models, i.e., removing a specified concept from a pre-trained model such that prompting the concept (or related synonyms) no longer yields its depiction, while preserving the model’s ability to generate other content. We propose a novel method, Attentional Concept Erasure (ACE), that integrates a closed-form attention manipulation with lightweight fine-tuning. Theoretically, we formulate concept erasure as aligning the model’s conditional distribution on the target concept with a neutral distribution. Our approach identifies and nullifies concept-specific latent directions in the cross-attention modules via a gated low-rank adaptation, followed by adversarially augmented fine-tuning to ensure thorough erasure of the concept and its synonyms. Empirically, we demonstrate on multiple benchmarks, including object classes, celebrity faces, explicit content, and artistic styles, that ACE achieves state-of-the-art concept removal efficacy and robustness. Compared to prior methods, ACE better balances generality (erasing concept and related terms) and specificity (preserving unrelated content), scales to dozens of concepts, and is efficient, requiring only a few seconds of adaptation per concept. We will release our code to facilitate safer deployment of diffusion models.
zh
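摘要中"闭式操作消除概念特定潜在方向"的一种常见做法,是把权重矩阵在概念方向上的分量投影掉。以下为这一投影思路的示意(笔者补充的假设性草图,并非 ACE 的门控低秩适配原式):

```python
import numpy as np

def erase_direction(W, v):
    """把权重 W 在概念方向 v 上的分量投影掉:W' = W (I - v v^T / ||v||^2)。
    作用后沿 v 方向的输入不再产生响应,而与 v 正交的方向保持不变。"""
    v = v / np.linalg.norm(v)
    P = np.eye(len(v)) - np.outer(v, v)   # 到 v 正交补空间的投影矩阵
    return W @ P
```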

[CV-58] Boosting Multi-View Stereo with Depth Foundation Model in the Absence of Real-World Labels

【速读】:该论文旨在解决基于学习的多视图立体视觉(Multi-View Stereo, MVS)方法在缺乏真实世界标签的情况下有效训练网络的难题。论文的关键解决方案是提出了一种名为DFM-MVS的新方法,利用深度基础模型生成有效的深度先验(depth prior),以提升MVS性能。具体而言,该方法开发了一种基于深度先验的伪监督训练机制,通过生成的深度先验模拟真实的立体对应关系,从而为MVS网络构建有效的监督信号。此外,还提出了一种基于深度先验的误差校正策略,利用深度先验作为引导,缓解粗到细网络结构中固有的误差传播问题。实验结果表明,DFM-MVS在DTU和Tanks & Temples数据集上显著优于现有的无真实世界标签的MVS方法。

链接: https://arxiv.org/abs/2504.11845
作者: Jie Zhu,Bo Peng,Zhe Zhang,Bingzheng Liu,Jianjun Lei
机构: Tianjin University (天津大学); Tianjin University of Commerce (天津商业大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Learning-based Multi-View Stereo (MVS) methods have made remarkable progress in recent years. However, how to effectively train the network without using real-world labels remains a challenging problem. In this paper, driven by the recent advancements of vision foundation models, a novel method, termed DFM-MVS, is proposed to leverage the depth foundation model to generate an effective depth prior, so as to boost MVS in the absence of real-world labels. Specifically, a depth prior-based pseudo-supervised training mechanism is developed to simulate realistic stereo correspondences using the generated depth prior, thereby constructing effective supervision for the MVS network. Besides, a depth prior-guided error correction strategy is presented to leverage the depth prior as guidance to mitigate the error propagation problem inherent in the widely-used coarse-to-fine network structure. Experimental results on DTU and Tanks & Temples datasets demonstrate that the proposed DFM-MVS significantly outperforms existing MVS methods without using real-world labels.
zh

[CV-59] A Visual RAG Pipeline for Few-Shot Fine-Grained Product Classification

【速读】:该论文旨在解决细粒度分类(Fine-Grained Classification, FGC)在实际应用中的挑战,特别是在零售领域中快速变化且视觉上高度相似的产品及其属性识别问题,这是实现自动化价格监控和产品推荐的关键。论文提出了一种新颖的视觉检索增强生成(Visual RAG)管道,结合检索增强生成(Retrieval-Augmented Generation, RAG)方法与视觉语言模型(Vision-Language Models, VLMs),用于少量样本下的细粒度分类任务。该方案的关键创新在于其无需重新训练即可通过向RAG数据库添加少量类别样本来预测新型产品的能力,从而显著提升了模型的灵活性与实用性。实验结果显示,基于不同VLM后端(如GPT-4o、GPT-4o-mini和Gemini 2.0 Flash)的方法在多样化数据集上达到了86.8%的准确率。

链接: https://arxiv.org/abs/2504.11838
作者: Bianca Lamm,Janis Keuper
机构: Markant Services International GmbH (Markant国际服务有限公司); IMLA, Offenburg University (Offenburg大学IMLA研究所); University of Mannheim (曼海姆大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Despite the rapid evolution of learning and computer vision algorithms, Fine-Grained Classification (FGC) still poses an open problem in many practically relevant applications. In the retail domain, for example, the identification of fast changing and visually highly similar products and their properties are key to automated price-monitoring and product recommendation. This paper presents a novel Visual RAG pipeline that combines the Retrieval Augmented Generation (RAG) approach and Vision Language Models (VLMs) for few-shot FGC. This Visual RAG pipeline extracts product and promotion data in advertisement leaflets from various retailers and simultaneously predicts fine-grained product ids along with price and discount information. Compared to previous approaches, the key characteristic of the Visual RAG pipeline is that it allows the prediction of novel products without re-training, simply by adding a few class samples to the RAG database. Comparing several VLM back-ends like GPT-4o [23], GPT-4o-mini [24], and Gemini 2.0 Flash [10], our approach achieves 86.8% accuracy on a diverse dataset.
zh
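"向 RAG 数据库添加少量类别样本即可识别新产品、无需重新训练"这一点,核心是嵌入检索。下面用余弦相似度 top-k 多数投票给出极简示意(笔者补充;真实管道中嵌入来自视觉模型,检索结果还会拼入提示词交给 GPT-4o 等 VLM 后端生成最终预测):

```python
import numpy as np

def retrieve(query_emb, db_embs, db_labels, k=3):
    """RAG 检索:按余弦相似度从数据库取 top-k 样本,以多数类作为预测。
    新增类别只需把其少量样本嵌入追加进 db_embs / db_labels,无需训练。"""
    q = query_emb / np.linalg.norm(query_emb)
    d = db_embs / np.linalg.norm(db_embs, axis=1, keepdims=True)
    sims = d @ q                          # 余弦相似度
    top = np.argsort(-sims)[:k]           # 相似度最高的 k 个索引
    labels = [db_labels[i] for i in top]
    return max(set(labels), key=labels.count)
```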

[CV-60] Real-World Depth Recovery via Structure Uncertainty Modeling and Inaccurate GT Depth Fitting

【速读】:该论文旨在解决现实世界中RGB-D数据集因低质量结构导致的真实深度恢复任务中的挑战。具体而言,由于缺乏配对的真实地面真值(raw-GT)数据以及现有方法未能充分考虑原始深度图中结构错位的多样性,导致深度恢复在实际应用中的泛化能力不足。此外,随机的结构错位不仅影响原始深度数据,也会影响真实数据集中的GT深度。为了解决这些问题,论文从输入和输出两个视角提出了解决方案。关键在于设计了一个新的原始深度生成管道以增加结构错位的多样性,并引入了一个结构不确定性模块来显式识别输入原始深度图中的错位结构,从而更好地适应未见场景。同时,在输出端设计了一个鲁棒特征对齐模块,精确对齐RGB图像的准确结构,避免不准确的GT深度干扰。这些创新点共同提升了模型的准确性和泛化能力。

链接: https://arxiv.org/abs/2504.11820
作者: Delong Suzhang,Meng Yang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The low-quality structure in raw depth maps is prevalent in real-world RGB-D datasets, which makes real-world depth recovery a critical task in recent years. However, the lack of paired raw-ground truth (raw-GT) data in the real world poses challenges for generalized depth recovery. Existing methods insufficiently consider the diversity of structure misalignment in raw depth maps, which leads to poor generalization in real-world depth recovery. Notably, random structure misalignments are not limited to raw depth data but also affect GT depth in real-world datasets. In the proposed method, we tackle the generalization problem from both input and output perspectives. For input, we enrich the diversity of structure misalignment in raw depth maps by designing a new raw depth generation pipeline, which helps the network avoid overfitting to a specific condition. Furthermore, a structure uncertainty module is designed to explicitly identify the misaligned structure for input raw depth maps to better generalize in unseen scenarios. Notably, the well-trained depth foundation model (DFM) can help the structure uncertainty module estimate the structure uncertainty better. For output, a robust feature alignment module is designed to precisely align with the accurate structure of RGB images, avoiding the interference of inaccurate GT depth. Extensive experiments on multiple datasets demonstrate the proposed method achieves competitive accuracy and generalization capabilities across various challenging raw depth maps.
zh
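上述“原始深度生成管道”的思路可以用一个极简示意来说明:对深度图做随机平移并叠加噪声,以模拟结构错位,使网络不过拟合于单一传感器缺陷。以下代码仅为假设性草图(函数名、参数均为示例设定,并非论文实现):

```python
import numpy as np

def misalign_raw_depth(depth, max_shift=3, noise_std=0.05, rng=None):
    """Simulate structure misalignment in a raw depth map.

    Toy version of the raw-depth generation idea: randomly translate the
    map by a few pixels (structural misalignment) and add depth noise.
    """
    rng = rng or np.random.default_rng()
    dy, dx = rng.integers(-max_shift, max_shift + 1, size=2)
    shifted = np.roll(np.roll(depth, dy, axis=0), dx, axis=1)
    return shifted + rng.normal(0.0, noise_std, size=depth.shape)
```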

[CV-61] Neighbor-Based Feature and Index Enhancement for Person Re-Identification CVPR

【速读】:该论文旨在解决行人再识别(Person Re-Identification, Re-ID)中特征表示鲁棒性不足的问题,特别是现有方法在提取特征表示时未能充分利用潜在的上下文信息,导致检索性能受限。论文指出,现有的研究对多阶邻域的潜在信息挖掘不足,而这种信息能够有效丰富特征表达并提升检索准确性。为此,论文提出了一种新颖的模型DMON-ARO,通过利用潜在的邻域信息来增强特征表示和索引性能。

解决方案的关键在于设计了两个互补模块:动态多阶邻域建模(Dynamic Multi-Order Neighbor Modeling, DMON)和非对称关系优化(Asymmetric Relationship Optimization, ARO)。DMON模块通过动态聚合多阶邻域关系,实现自适应邻域建模以捕获更丰富的上下文信息;ARO模块则通过对查询与图库间关系的优化,进一步改进距离矩阵,从而提高索引精度。实验结果表明,该方法在三个基准数据集上的表现优于基线模型,并显著提升了Rank-1准确率和mAP等关键指标。

链接: https://arxiv.org/abs/2504.11798
作者: Chao Yuan,Tianyi Zhang,Guanglin Niu
机构: School of Computer Science and Engineering, Beihang University (北京航空航天大学); School of Artificial Intelligence, Beihang University (北京航空航天大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: This paper has been accepted for publication in the 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW)

点击查看摘要

Abstract:Person re-identification (Re-ID) aims to match the same pedestrian in a large gallery with different cameras and views. Enhancing the robustness of the extracted feature representations is a main challenge in Re-ID. Existing methods usually improve feature representation by improving model architecture, but most methods ignore the potential contextual information, which limits the effectiveness of feature representation and retrieval performance. Neighborhood information, especially the potential information of multi-order neighborhoods, can effectively enrich feature expression and improve retrieval accuracy, but this has not been fully explored in existing research. Therefore, we propose a novel model DMON-ARO that leverages latent neighborhood information to enhance both feature representation and index performance. Our approach is built on two complementary modules: Dynamic Multi-Order Neighbor Modeling (DMON) and Asymmetric Relationship Optimization (ARO). The DMON module dynamically aggregates multi-order neighbor relationships, allowing it to capture richer contextual information and enhance feature representation through adaptive neighborhood modeling. Meanwhile, ARO refines the distance matrix by optimizing query-to-gallery relationships, improving the index accuracy. Extensive experiments on three benchmark datasets demonstrate that our approach achieves performance improvements against baseline models, which illustrate the effectiveness of our model. Specifically, our model demonstrates improvements in Rank-1 accuracy and mAP. Moreover, this method can also be directly extended to other re-identification tasks.
zh
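DMON 的多阶邻域聚合思想可以用如下简化示意说明:基于 k 近邻邻接矩阵的幂次获得多阶邻域,再以加权均值方式聚合特征。以下为假设性草图(阶数、权重均为示例设定,并非论文实现):

```python
import numpy as np

def multi_order_neighbor_aggregate(feats, k=3, orders=(1, 2), alphas=(0.6, 0.4)):
    """Aggregate features over multi-order k-NN neighborhoods.

    Simplified stand-in for the DMON idea: the order-th neighbor set comes
    from powers of the k-NN adjacency; features mix the original vector
    with each order's neighborhood mean.
    """
    n = feats.shape[0]
    # cosine-similarity k-NN adjacency (self excluded)
    normed = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    sim = normed @ normed.T
    np.fill_diagonal(sim, -np.inf)
    idx = np.argsort(-sim, axis=1)[:, :k]
    adj = np.zeros((n, n))
    adj[np.arange(n)[:, None], idx] = 1.0
    out = feats.copy()
    reach = np.eye(n)
    for order, a in zip(orders, alphas):
        reach = reach @ adj                      # order-th hop reachability
        deg = reach.sum(axis=1, keepdims=True)
        deg[deg == 0] = 1.0
        out = out + a * (reach / deg) @ feats    # mean of order-th neighbors
    return out / np.linalg.norm(out, axis=1, keepdims=True)
```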

[CV-62] DART: Disease-aware Image-Text Alignment and Self-correcting Re-alignment for Trustworthy Radiology Report Generation CVPR

【速读】:该论文旨在解决放射学报告自动生成中报告准确性不足的问题,特别是确保生成的报告能够准确捕捉与X射线图像相关的疾病特征,并提升临床应用的可靠性。论文的关键解决方案在于提出了一种名为DART(Disease-aware image-text Alignment and self-correcting Re-alignment for Trustworthy radiology report generation)的框架。该框架包含两个阶段:第一阶段通过图像到文本的检索与对比学习,在共享嵌入空间中对齐图像和文本,以生成包含疾病相关特征的初始报告;第二阶段引入自校正模块,进一步优化初始报告与X射线图像的对齐。这种设计不仅提高了报告生成的准确性,还增强了放射学报告的可信度。

链接: https://arxiv.org/abs/2504.11786
作者: Sang-Jun Park,Keun-Soo Heo,Dong-Hee Shin,Young-Han Son,Ji-Hye Oh,Tae-Eui Kam
机构: Department of Artificial Intelligence, Korea University (韩国大学人工智能系)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: The IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2025

点击查看摘要

Abstract:The automatic generation of radiology reports has emerged as a promising solution to reduce a time-consuming task and accurately capture critical disease-relevant findings in X-ray images. Previous approaches for radiology report generation have shown impressive performance. However, there remains significant potential to improve accuracy by ensuring that retrieved reports contain disease-relevant findings similar to those in the X-ray images and by refining generated reports. In this study, we propose a Disease-aware image-text Alignment and self-correcting Re-alignment for Trustworthy radiology report generation (DART) framework. In the first stage, we generate initial reports based on image-to-text retrieval with disease-matching, embedding both images and texts in a shared embedding space through contrastive learning. This approach ensures the retrieval of reports with similar disease-relevant findings that closely align with the input X-ray images. In the second stage, we further enhance the initial reports by introducing a self-correction module that re-aligns them with the X-ray images. Our proposed framework achieves state-of-the-art results on two widely used benchmarks, surpassing previous approaches in both report generation and clinical efficacy metrics, thereby enhancing the trustworthiness of radiology reports.
zh
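DART 第一阶段在共享嵌入空间中对齐 X 射线图像与报告,这类对比学习通常可以用对称 InfoNCE 损失来示意。以下为假设性草图(温度系数等均为示例设定,并非论文实现):

```python
import numpy as np

def info_nce_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric contrastive (InfoNCE) loss over matched image/report pairs.

    Row i of each matrix is an aligned (image, report) pair; the loss pulls
    matched pairs together and pushes mismatched ones apart.
    """
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature
    n = logits.shape[0]

    def xent(l):
        # cross-entropy with the positives on the diagonal
        l = l - l.max(axis=1, keepdims=True)
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[np.arange(n), np.arange(n)].mean()

    return 0.5 * (xent(logits) + xent(logits.T))
```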

[CV-63] ACMamba: Fast Unsupervised Anomaly Detection via An Asymmetrical Consensus State Space Model

【速读】:该论文旨在解决高光谱图像(Hyperspectral Images, HSI)无监督异常检测中因高维特性和密集采样训练范式导致的计算成本高昂的问题,这对于地球表面监测中的未知目标探测具有挑战性。论文的关键在于提出了一种新颖的非对称一致性状态空间模型(Asymmetrical Consensus State Space Model, ACMamba),通过设计一种以区域级实例替代密集像素级样本的非对称异常检测范式,显著降低了计算开销,同时保持了检测精度。其核心解决方案的关键在于引入基于低成本Mamba模块的全局上下文属性提取机制,以及从优化角度开发的一致性学习策略,从而在背景重建与异常压缩之间实现平衡,并缓解异常重建带来的负面影响。理论分析和八项基准实验验证了ACMamba在速度和性能上的优越性。

链接: https://arxiv.org/abs/2504.11781
作者: Guanchun Wang,Xiangrong Zhang,Yifei Zhang,Zelin Peng,Tianyang Zhang,Xu Tang,Licheng Jiao
机构: School of Artificial Intelligence, Xidian University (西安电子科技大学); MoE Key Lab of Artificial Intelligence, AI Institute, Shanghai Jiao Tong University (上海交通大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 15 pages, 9 figures

点击查看摘要

Abstract:Unsupervised anomaly detection in hyperspectral images (HSI), aiming to detect unknown targets from backgrounds, is challenging for earth surface monitoring. However, current studies are hindered by steep computational costs due to the high-dimensional property of HSI and dense sampling-based training paradigm, constraining their rapid deployment. Our key observation is that, during training, not all samples within the same homogeneous area are indispensable, whereas ingenious sampling can provide a powerful substitute for reducing costs. Motivated by this, we propose an Asymmetrical Consensus State Space Model (ACMamba) to significantly reduce computational costs without compromising accuracy. Specifically, we design an asymmetrical anomaly detection paradigm that utilizes region-level instances as an efficient alternative to dense pixel-level samples. In this paradigm, a low-cost Mamba-based module is introduced to discover global contextual attributes of regions that are essential for HSI reconstruction. Additionally, we develop a consensus learning strategy from the optimization perspective to simultaneously facilitate background reconstruction and anomaly compression, further alleviating the negative impact of anomaly reconstruction. Theoretical analysis and extensive experiments across eight benchmarks verify the superiority of ACMamba, demonstrating a faster speed and stronger performance over the state-of-the-art.
zh
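“以区域级实例替代密集像素采样”的做法,可以用“按区域标签对高光谱像素取均值光谱”来示意:训练集规模从 H*W 个像素缩减为少量区域级实例。以下为假设性草图(区域划分方式为示例假设,并非论文实现):

```python
import numpy as np

def region_instances(hsi, labels):
    """Replace dense pixel sampling with one instance per homogeneous region.

    `hsi` is an (H, W, C) hyperspectral cube and `labels` an (H, W) map of
    region ids (e.g. from a segmentation); each region contributes a single
    mean spectrum.
    """
    ids = np.unique(labels)
    return np.stack([hsi[labels == i].mean(axis=0) for i in ids]), ids
```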

[CV-64] Multimodal Spatio-temporal Graph Learning for Alignment-free RGBT Video Object Detection

【速读】:该论文旨在解决无对齐(alignment-free)条件下的RGB-Thermal视频目标检测(RGBT VOD)问题,传统方法主要依赖人工对齐的多模态图像对,而本文提出了一种新颖的多模态时空图学习网络(MSGNet),通过利用鲁棒的图表示学习模型实现无需对齐的检测。关键在于设计了一个自适应划分层(APL)以在高分辨率RGB图像中初步估计热成像对应的区域,实现粗略对齐;随后引入空间稀疏图学习模块(S-SGLM)和混合结构化时序建模模块(HSTM),分别通过稀疏信息传递机制实现不同模态间的可靠交互,并充分挖掘时序线索以提升检测性能。实验验证了所提方法在对齐与非对齐数据集上的有效性与优越性。

链接: https://arxiv.org/abs/2504.11779
作者: Qishun Wang,Zhengzheng Tu,Chenglong Li,Bo Jiang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:RGB-Thermal Video Object Detection (RGBT VOD) can address the limitation of traditional RGB-based VOD in challenging lighting conditions, making it more practical and effective in many applications. However, similar to most RGBT fusion tasks, it still mainly relies on manually aligned multimodal image pairs. In this paper, we propose a novel Multimodal Spatio-temporal Graph learning Network (MSGNet) for alignment-free RGBT VOD problem by leveraging the robust graph representation learning model. Specifically, we first design an Adaptive Partitioning Layer (APL) to estimate the corresponding regions of the Thermal image within the RGB image (high-resolution), achieving a preliminary inexact alignment. Then, we introduce the Spatial Sparse Graph Learning Module (S-SGLM) which employs a sparse information passing mechanism on the estimated inexact alignment to achieve reliable information interaction between different modalities. Moreover, to fully exploit the temporal cues for RGBT VOD problem, we introduce Hybrid Structured Temporal Modeling (HSTM), which involves a Temporal Sparse Graph Learning Module (T-SGLM) and Temporal Star Block (TSB). T-SGLM aims to filter out some redundant information between adjacent frames by employing the sparse aggregation mechanism on the temporal graph. Meanwhile, TSB is dedicated to achieving the complementary learning of local spatial relationships. Extensive comparative experiments conducted on both the aligned dataset VT-VOD50 and the unaligned dataset UVT-VOD2024 demonstrate the effectiveness and superiority of our proposed method. Our project will be made available on our website for free public access. 
zh

[CV-65] Bridging the Semantic Gaps: Improving Medical VQA Consistency with LLM-Augmented Question Sets

【速读】:该论文旨在解决医学视觉问答(MVQA)系统因自然语言问句表述的语义变异性而导致的一致性不足的问题。为应对这一挑战,论文提出了一种语义等价问句增强(SEQA)框架,其关键是利用大规模语言模型(LLMs)生成多样化且语义等价的问句重述,从而在保持语义一致的同时提升语言表达的多样性。此外,论文还引入了总一致性率(TAR-SC)评估指标,并通过SEQA框架对SLAKE、VQA-RAD和PathVQA三个基准MVQA数据集进行增强,显著提升了数据集中问句的多样性和模型一致性。

链接: https://arxiv.org/abs/2504.11777
作者: Yongpei Ma,Pengyu Wang,Adam Dunn,Usman Naseem,Jinman Kim
机构: School of Computer Science, University of Sydney (悉尼大学); School of Medical Sciences, University of Sydney (悉尼大学); School of Coumputing, Macquarie University (麦考瑞大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: The first two listed authors contributed equally to this work

点击查看摘要

Abstract:Medical Visual Question Answering (MVQA) systems can interpret medical images in response to natural language queries. However, linguistic variability in question phrasing often undermines the consistency of these systems. To address this challenge, we propose a Semantically Equivalent Question Augmentation (SEQA) framework, which leverages large language models (LLMs) to generate diverse yet semantically equivalent rephrasings of questions. Specifically, this approach enriches linguistic diversity while preserving semantic meaning. We further introduce an evaluation metric, Total Agreement Rate with Semantically Equivalent Input and Correct Answer (TAR-SC), which assesses a model’s capability to generate consistent and correct responses to semantically equivalent linguistic variations. In addition, we also propose three other diversity metrics - average number of QA items per image (ANQI), average number of questions per image with the same answer (ANQA), and average number of open-ended questions per image with the same semantics (ANQS). Using the SEQA framework, we augmented the benchmarked MVQA public datasets of SLAKE, VQA-RAD, and PathVQA. As a result, all three datasets achieved significant improvements by incorporating more semantically equivalent questions: ANQI increased by an average of 86.1, ANQA by 85.1, and ANQS by 46. Subsequent experiments evaluate three MVQA models (M2I2, MUMC, and BiomedGPT) under both zero-shot and fine-tuning settings on the enhanced datasets. Experimental results in MVQA datasets show that fine-tuned models achieve an average accuracy improvement of 19.35%, while our proposed TAR-SC metric shows an average improvement of 11.61%, indicating a substantial enhancement in model consistency.
zh
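TAR-SC 指标衡量模型对语义等价改写问句“回答一致且正确”的比例,其一种可能的计算方式如下(具体定义以论文为准,此处仅为假设性示意):

```python
def tar_sc(groups):
    """Fraction of question groups answered both consistently and correctly.

    `groups` is a list of (predictions, gold) tuples: `predictions` are the
    model's answers to semantically equivalent rephrasings of one question,
    `gold` is the correct answer. One plausible reading of TAR-SC, not the
    paper's exact formula.
    """
    hits = sum(
        1 for preds, gold in groups
        if len(set(preds)) == 1 and preds[0] == gold
    )
    return hits / len(groups) if groups else 0.0
```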

[CV-66] TacoDepth: Towards Efficient Radar-Camera Depth Estimation with One-stage Fusion CVPR2025

【速读】:该论文致力于解决雷达与相机融合进行深度估计时因雷达回波稀疏导致的现有方法多阶段框架耗时且鲁棒性不足的问题。论文的关键创新在于提出了一种名为TacoDepth的一阶段融合模型,通过基于图的雷达结构提取器和基于金字塔的雷达融合模块,直接捕捉并整合雷达点云的图结构,无需依赖中间稠密深度结果即可实现高效的模型性能和鲁棒性。此外,TacoDepth在不同推理模式下具有灵活性,能够在速度和精度之间提供更好的平衡。实验表明,与现有最先进的方法相比,TacoDepth提升了12.8%的深度精度和91.8%的处理速度。

链接: https://arxiv.org/abs/2504.11773
作者: Yiran Wang,Jiaqi Li,Chaoyi Hong,Ruibo Li,Liusheng Sun,Xiao Song,Zhe Wang,Zhiguo Cao,Guosheng Lin
机构: S-Lab, Nanyang Technological University (南洋理工大学 S-Lab); School of AIA, Huazhong University of Science and Technology (华中科技大学人工智能学院); SenseTime Research (商汤科技研究部)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by CVPR 2025 (Oral Presentation)

点击查看摘要

Abstract:Radar-Camera depth estimation aims to predict dense and accurate metric depth by fusing input images and Radar data. Model efficiency is crucial for this task in pursuit of real-time processing on autonomous vehicles and robotic platforms. However, due to the sparsity of Radar returns, the prevailing methods adopt multi-stage frameworks with intermediate quasi-dense depth, which are time-consuming and not robust. To address these challenges, we propose TacoDepth, an efficient and accurate Radar-Camera depth estimation model with one-stage fusion. Specifically, the graph-based Radar structure extractor and the pyramid-based Radar fusion module are designed to capture and integrate the graph structures of Radar point clouds, delivering superior model efficiency and robustness without relying on the intermediate depth results. Moreover, TacoDepth can be flexible for different inference modes, providing a better balance of speed and accuracy. Extensive experiments are conducted to demonstrate the efficacy of our method. Compared with the previous state-of-the-art approach, TacoDepth improves depth accuracy and processing speed by 12.8% and 91.8%. Our work provides a new perspective on efficient Radar-Camera depth estimation.
zh
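基于图的雷达结构提取器首先需要在稀疏雷达点上构建图结构,下面以图像平面上的 k 近邻建边作一个极简示意(假设性草图,并非论文实现):

```python
import numpy as np

def radar_knn_graph(points, k=4):
    """Build a k-NN edge list over sparse radar returns.

    Minimal stand-in for the graph-based radar structure extractor:
    `points` is an (N, 3) array of radar returns (x, y, depth); edges
    connect each point to its k nearest neighbours in the image plane.
    """
    xy = points[:, :2]
    d2 = ((xy[:, None, :] - xy[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(d2, np.inf)          # no self-loops
    nbrs = np.argsort(d2, axis=1)[:, :k]
    src = np.repeat(np.arange(len(points)), k)
    dst = nbrs.ravel()
    return np.stack([src, dst])           # (2, N*k) edge index
```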

[CV-67] Extended Short- and Long-Range Mesh Learning for Fast and Generalized Garment Simulation

【速读】:本文旨在解决基于图神经网络(Graph Neural Networks, GNNs)的三维服装模拟在高分辨率下计算效率低的问题。传统GNNs在处理服装网格时需要进行大量消息传递以传播物理力信息并保持接触意识,这在更高分辨率下变得尤为耗时。为了解决这一挑战,论文提出了一种新颖的基于GNN的网格学习框架,其关键是设计了两个关键模块:拉普拉斯平滑双重消息传递(Laplacian-Smoothed Dual Message-Passing, LSDMP)和测地自注意力(Geodesic Self-Attention, GSA)。LSDMP通过引入拉普拉斯特征平滑过程增强消息传递,有效将每个顶点的影响传播到邻近顶点;而GSA则利用测地距离嵌入表示顶点间的空间关系,并结合注意力机制捕捉全局网格信息。这两个模块并行工作,确保同时建模短程与长程网格关系。实验结果验证了该方法在性能上的领先性,表现出更少的网络层数和更低的推理延迟。

链接: https://arxiv.org/abs/2504.11763
作者: Aoran Liu,Kun Hu,Clinton Mo,Changyang Li,Zhiyong Wang
机构: School of Computer Science, The University of Sydney (悉尼大学), Darlington, NSW, Australia; School of Science, Edith Cowan University (埃迪斯·科文大学), Joondalup, WA, Australia; Sydney Polytechnic Institute Pty Ltd (悉尼理工学院), Haymarket, NSW, Australia
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:3D garment simulation is a critical component for producing cloth-based graphics. Recent advancements in graph neural networks (GNNs) offer a promising approach for efficient garment simulation. However, GNNs require extensive message-passing to propagate information such as physical forces and maintain contact awareness across the entire garment mesh, which becomes computationally inefficient at higher resolutions. To address this, we devise a novel GNN-based mesh learning framework with two key components to extend the message-passing range with minimal overhead, namely the Laplacian-Smoothed Dual Message-Passing (LSDMP) and the Geodesic Self-Attention (GSA) modules. LSDMP enhances message-passing with a Laplacian features smoothing process, which efficiently propagates the impact of each vertex to nearby vertices. Concurrently, GSA introduces geodesic distance embeddings to represent the spatial relationship between vertices and utilises attention mechanisms to capture global mesh information. The two modules operate in parallel to ensure both short- and long-range mesh modelling. Extensive experiments demonstrate the state-of-the-art performance of our method, requiring fewer layers and lower inference latency.
zh
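LSDMP 中的拉普拉斯特征平滑,其核心是让每个顶点特征向邻居均值靠拢,从而以较低开销将局部影响逐步传播到更远的顶点。以下为假设性草图(lam、迭代次数均为示例设定,并非论文实现):

```python
import numpy as np

def laplacian_smooth(feats, edges, lam=0.5, iters=2):
    """Laplacian smoothing of per-vertex features on a mesh.

    Each iteration moves every vertex feature toward the mean of its
    neighbours; `edges` is a (2, E) undirected edge index.
    """
    n = feats.shape[0]
    adj = np.zeros((n, n))
    adj[edges[0], edges[1]] = 1.0
    adj[edges[1], edges[0]] = 1.0
    deg = adj.sum(axis=1, keepdims=True)
    deg[deg == 0] = 1.0
    out = feats.astype(float).copy()
    for _ in range(iters):
        out = (1 - lam) * out + lam * (adj @ out) / deg
    return out
```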

[CV-68] GrabS: Generative Embodied Agent for 3D Object Segmentation without Scene Supervision ICLR2025

【速读】:该论文致力于解决复杂点云中三维物体分割的难题,尤其是在无需三维场景人工标注的情况下实现无监督学习。现有无监督方法通常依赖预训练二维特征的相似性或外部信号(如运动)来聚类点云中的物体,但这些方法往往局限于识别简单物体(如汽车),且分割结果质量较低,主要因为预训练特征缺乏物体先验信息。
论文提出了一种名为GrabS的新两阶段框架作为解决方案。其关键是第一阶段通过物体数据集学习生成式与判别式的物体中心先验(object-centric priors),第二阶段设计具身代理(embodied agent)通过查询预训练生成式先验来发现多个物体。该方法在真实世界数据集和新创建的合成数据集上的评估表明,其分割性能显著优于所有现有无监督方法。

链接: https://arxiv.org/abs/2504.11754
作者: Zihui Zhang,Yafei Yang,Hongtao Wen,Bo Yang
机构: Shenzhen Research Institute, The Hong Kong Polytechnic University (香港理工大学深圳研究院); vLAR Group, The Hong Kong Polytechnic University (香港理工大学vLAR小组)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
备注: ICLR 2025 Spotlight. Code and data are available at: this https URL

点击查看摘要

Abstract:We study the hard problem of 3D object segmentation in complex point clouds without requiring human labels of 3D scenes for supervision. By relying on the similarity of pretrained 2D features or external signals such as motion to group 3D points as objects, existing unsupervised methods are usually limited to identifying simple objects like cars or their segmented objects are often inferior due to the lack of objectness in pretrained features. In this paper, we propose a new two-stage pipeline called GrabS. The core concept of our method is to learn generative and discriminative object-centric priors as a foundation from object datasets in the first stage, and then design an embodied agent to learn to discover multiple objects by querying against the pretrained generative priors in the second stage. We extensively evaluate our method on two real-world datasets and a newly created synthetic dataset, demonstrating remarkable segmentation performance, clearly surpassing all existing unsupervised methods.
zh

[CV-69] SkeletonX: Data-Efficient Skeleton-based Action Recognition via Cross-sample Feature Aggregation

【速读】:该论文旨在解决现有骨架动作识别模型在适应新应用场景时面临的挑战,特别是在处理新的动作类别、多样化的表演者以及不同的骨架布局时,会导致性能显著退化的问题。此外,大规模骨架数据收集的成本和难度使得大量数据收集变得不切实际。为应对这些挑战,论文研究了一次性学习和小规模学习设置,以实现高效适应最小数据量。论文指出,现有的方法往往忽略了标记样本之间的丰富互信息,导致低数据场景下的性能不佳。为此,论文提出了两个关键属性:表演者的可变性和每个动作的共同性。基于此,论文设计了一个名为SkeletonX的轻量级训练管道,该管道无缝集成到现有的基于GCN的骨架动作识别器中,在有限的标记数据下促进有效的训练。解决方案的关键在于提出了一种针对这两个关键属性的定制样本对构建策略,以形成和聚合样本对,并开发了一个简洁而有效的特征聚合模块来处理这些对。实验结果表明,该管道在从头开始使用有限数据训练时有效提高了性能,并且在一次性学习设置中超越了先前最先进的方法,同时参数量仅为1/10,浮点运算次数(FLOPs)也大幅减少。

链接: https://arxiv.org/abs/2504.11749
作者: Zongye Zhang,Wenrui Cai,Qingjie Liu,Yunhong Wang
机构: State Key Laboratory of Virtual Reality Technology and Systems, Beihang University (北航大学), Beijing, China
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by IEEE Transactions on Multimedia (TMM). 13 pages, 7 figures, 11 tables

点击查看摘要

Abstract:While current skeleton action recognition models demonstrate impressive performance on large-scale datasets, their adaptation to new application scenarios remains challenging. These challenges are particularly pronounced when facing new action categories, diverse performers, and varied skeleton layouts, leading to significant performance degeneration. Additionally, the high cost and difficulty of collecting skeleton data make large-scale data collection impractical. This paper studies one-shot and limited-scale learning settings to enable efficient adaptation with minimal data. Existing approaches often overlook the rich mutual information between labeled samples, resulting in sub-optimal performance in low-data scenarios. To boost the utility of labeled data, we identify the variability among performers and the commonality within each action as two key attributes. We present SkeletonX, a lightweight training pipeline that integrates seamlessly with existing GCN-based skeleton action recognizers, promoting effective training under limited labeled data. First, we propose a tailored sample pair construction strategy on two key attributes to form and aggregate sample pairs. Next, we develop a concise and effective feature aggregation module to process these pairs. Extensive experiments are conducted on NTU RGB+D, NTU RGB+D 120, and PKU-MMD with various GCN backbones, demonstrating that the pipeline effectively improves performance when trained from scratch with limited data. Moreover, it surpasses previous state-of-the-art methods in the one-shot setting, with only 1/10 of the parameters and much fewer FLOPs. The code and data are available at: this https URL
zh
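按“表演者可变性”与“动作共同性”两个属性构建样本对的策略,可以用如下极简示意表达:同动作、不同表演者的样本对刻画动作内的共同性,同表演者、不同动作的样本对刻画表演者间的可变性(字段与配对规则均为示例假设,并非论文实现):

```python
import itertools

def build_sample_pairs(samples):
    """Group labelled skeleton samples into the two pair types.

    `samples` is a list of (sample_id, action, performer) triples.
    """
    same_action, same_performer = [], []
    for (i1, a1, p1), (i2, a2, p2) in itertools.combinations(samples, 2):
        if a1 == a2 and p1 != p2:
            same_action.append((i1, i2))      # commonality within an action
        elif p1 == p2 and a1 != a2:
            same_performer.append((i1, i2))   # variability among performers
    return same_action, same_performer
```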

[CV-70] Recent Advance in 3D Object and Scene Generation: A Survey

【速读】:该论文旨在解决3D内容生成领域中传统手工建模方法面临的劳动密集型工作流程和漫长的生产周期等限制性问题。论文的关键解决方案在于通过新型3D表示范式与人工智能生成技术的融合,推动革命性的进展。具体而言,论文系统性地回顾了静态3D物体和场景生成的前沿成果,并构建了全面的技术框架。在物体生成方面,分析了主流的3D表示方法,并深入探讨了数据驱动的监督学习方法与基于深度生成模型的技术路径;在场景生成方面,则聚焦于布局引导的组合合成、基于2D先验的场景生成以及规则驱动的建模三种主导范式。最终,论文批判性地审视了3D生成中的持续挑战,并提出了未来研究方向。其核心在于通过跨学科技术整合,提供结构化理解当前最先进的3D生成技术,同时激励更多探索。

链接: https://arxiv.org/abs/2504.11734
作者: Xiang Tang,Ruotong Li,Xiaopeng Fan
机构: Harbin Institute of Technology (哈尔滨工业大学); Pengcheng Laboratory (鹏城实验室); Harbin Institute of Technology (哈尔滨工业大学)
类目: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
备注: 34 pages, 6 figures

点击查看摘要

Abstract:In recent years, the demand for 3D content has grown exponentially with intelligent upgrading of interactive media, extended reality (XR), and Metaverse industries. In order to overcome the limitation of traditional manual modeling approaches, such as labor-intensive workflows and prolonged production cycles, revolutionary advances have been achieved through the convergence of novel 3D representation paradigms and artificial intelligence generative technologies. In this survey, we conduct a systematically review of the cutting-edge achievements in static 3D object and scene generation, as well as establish a comprehensive technical framework through systematic categorization. Specifically, we initiate our analysis with mainstream 3D object representations, followed by in-depth exploration of two principal technical pathways in object generation: data-driven supervised learning methods and deep generative model-based approaches. Regarding scene generation, we focus on three dominant paradigms: layout-guided compositional synthesis, 2D prior-based scene generation, and rule-driven modeling. Finally, we critically examine persistent challenges in 3D generation and propose potential research directions for future investigation. This survey aims to provide readers with a structured understanding of state-of-the-art 3D generation technologies while inspiring researchers to undertake more exploration in this domain.
zh

[CV-71] DVLTA-VQA: Decoupled Vision-Language Modeling with Text-Guided Adaptation for Blind Video Quality Assessment

【速读】:该论文旨在解决现有基于对比语言图像预训练(Contrastive Language-Image Pretraining, CLIP)的双流视频质量评估(Dual-stream Video Quality Assessment, VQA)方法无法有效捕捉视频特有的时间与运动信息,以及现有无参考视频质量评估(No-Reference Video Quality Assessment, NR-VQA)特征融合策略固定且缺乏自适应调整能力的问题。论文的关键创新在于提出了一种解耦视觉-语言建模并结合文本引导自适应的盲视频质量评估方法(Decoupled Vision-Language Modeling with Text-Guided Adaptation for Blind Video Quality Assessment, DVLTA-VQA),通过将CLIP的视觉和文本组件解耦,并将其分别整合到NR-VQA的不同阶段,以提升模型对视频质量评估的语义理解和特征自适应能力。

链接: https://arxiv.org/abs/2504.11733
作者: Li Yu,Situo Wang,Wei Zhou,Moncef Gabbouj
机构: School of Computer Science, Nanjing University of Information Science & Technology (南京信息工程大学计算机科学学院); Jiangsu Collaborative Innovation Center of Atmospheric Environment and Equipment Technology (CICAEET), Nanjing University of Information Science & Technology (江苏协同创新中心大气环境与装备技术)(南京信息工程大学); School of Computer Science, Nanjing University of Information Science & Technology (南京信息工程大学计算机科学学院); School of Computer Science and Informatics, Cardiff University (卡迪夫大学计算机科学与信息学学院); Faculty of Information Technology and Communication Sciences, Tampere University (坦佩雷大学信息技术与通信科学学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Inspired by the dual-stream theory of the human visual system (HVS) - where the ventral stream is responsible for object recognition and detail analysis, while the dorsal stream focuses on spatial relationships and motion perception - an increasing number of video quality assessment (VQA) works built upon this framework are proposed. Recent advancements in large multi-modal models, notably Contrastive Language-Image Pretraining (CLIP), have motivated researchers to incorporate CLIP into dual-stream-based VQA methods. This integration aims to harness the model’s superior semantic understanding capabilities to replicate the object recognition and detail analysis in ventral stream, as well as spatial relationship analysis in dorsal stream. However, CLIP is originally designed for images and lacks the ability to capture temporal and motion information inherent in videos. Furthermore, existing feature fusion strategies in no-reference video quality assessment (NR-VQA) often rely on fixed weighting schemes, which fail to adaptively adjust feature importance. To address these limitations, this paper proposes a Decoupled Vision-Language Modeling with Text-Guided Adaptation for Blind Video Quality Assessment (DVLTA-VQA), which decouples CLIP’s visual and textual components, and integrates them into different stages of the NR-VQA pipeline.
zh

[CV-72] EgoExo-Gen: Ego-centric Video Prediction by Watching Exo-centric Videos ICLR2025

【速读】:该论文旨在解决跨视角视频预测任务中的未来第一人称视角(ego-centric)视频帧生成问题,即在给定第三人称视角(exo-centric)视频、第一人称视角视频的第一帧以及文本指令的情况下,生成后续的第一人称视角视频帧。论文的关键在于提出了一种名为EgoExo-Gen的方法,通过显式建模手-物交互(Hand-Object Interaction, HOI)动态来提升跨视角视频预测的效果。其解决方案分为两个阶段:首先设计了一个跨视角HOI掩码预测模型,利用空间-时间第一人称与第三人称视频的关联性预测未来的HOI掩码;其次使用视频扩散模型结合第一人称视频初始帧、文本指令以及预测的HOI掩码作为结构引导生成高质量的未来帧。此外,为了促进训练,开发了一种自动化管道,利用视觉基础模型自动生成用于训练的伪HOI掩码。实验结果表明,EgoExo-Gen在Ego-Exo4D和H2O基准数据集上的表现优于现有方法,而HOI掩码显著提升了第一人称视角视频中手部及交互物体的生成质量。

链接: https://arxiv.org/abs/2504.11732
作者: Jilan Xu,Yifei Huang,Baoqi Pei,Junlin Hou,Qingqiu Li,Guo Chen,Yuejie Zhang,Rui Feng,Weidi Xie
机构: Fudan University; The University of Tokyo; Zhejiang University; Hong Kong University of Science and Technology; Nanjing University; Shanghai Jiao Tong University; Shanghai Artificial Intelligence Laboratory
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: ICLR 2025

点击查看摘要

Abstract:Generating videos in the first-person perspective has broad application prospects in the field of augmented reality and embodied intelligence. In this work, we explore the cross-view video prediction task, where given an exo-centric video, the first frame of the corresponding ego-centric video, and textual instructions, the goal is to generate future frames of the ego-centric video. Inspired by the notion that hand-object interactions (HOI) in ego-centric videos represent the primary intentions and actions of the current actor, we present EgoExo-Gen that explicitly models the hand-object dynamics for cross-view video prediction. EgoExo-Gen consists of two stages. First, we design a cross-view HOI mask prediction model that anticipates the HOI masks in future ego-frames by modeling the spatio-temporal ego-exo correspondence. Next, we employ a video diffusion model to predict future ego-frames using the first ego-frame and textual instructions, while incorporating the HOI masks as structural guidance to enhance prediction quality. To facilitate training, we develop an automated pipeline to generate pseudo HOI masks for both ego- and exo-videos by exploiting vision foundation models. Extensive experiments demonstrate that our proposed EgoExo-Gen achieves better prediction performance compared to previous video prediction models on the Ego-Exo4D and H2O benchmark datasets, with the HOI masks significantly improving the generation of hands and interactive objects in the ego-centric videos.
zh

[CV-73] owards Safe Synthetic Image Generation On the Web: A Multimodal Robust NSFW Defense and Million Scale Dataset

【速读】:该论文试图解决Text-to-Image (T2I) 模型在生成超现实图像过程中引发的滥用问题,特别是Not-Safe-For-Work (NSFW) 内容的生成及其对网络环境的污染。此外,论文关注如何防御针对文本和图像模态的对抗攻击,这是现有防御方法(如NSFW过滤器和事后安全检查)容易失效的关键挑战。当前缺乏一个包含提示与图像对以及对抗样本的鲁棒多模态NSFW数据集,这也是研究的核心痛点之一。

解决方案的关键在于提出了一种百万规模的提示与图像数据集,通过开源扩散模型生成;同时开发了一种鲁棒的多模态防御机制,能够有效区分安全与NSFW的文本和图像,并且对对抗攻击具有较强的抵抗能力。实验结果表明,该模型在准确性、召回率方面优于现有的SOTA NSFW检测方法,并显著降低了多模态对抗攻击场景下的攻击成功率 (Attack Success Rate, ASR)。

链接: https://arxiv.org/abs/2504.11707
作者: Muhammad Shahid Muneer,Simon S. Woo
机构: Sungkyunkwan University (成均馆大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Short Paper The Web Conference

点击查看摘要

Abstract:In the past years, we have witnessed the remarkable success of Text-to-Image (T2I) models and their widespread use on the web. Extensive research in making T2I models produce hyper-realistic images has led to new concerns, such as generating Not-Safe-For-Work (NSFW) web content and polluting the web society. To help prevent misuse of T2I models and create a safer web environment for users, features like NSFW filters and post-hoc security checks are used in these models. However, recent work unveiled how these methods can easily fail to prevent misuse. In particular, adversarial attacks on text and image modalities can easily outplay defensive measures. Moreover, there is currently no robust multimodal NSFW dataset that includes both prompt and image pairs and adversarial examples. This work proposes a million-scale prompt and image dataset generated using open-source diffusion models. Second, we develop a multimodal defense to distinguish safe and NSFW text and images, which is robust against adversarial attacks and directly alleviates current challenges. Our extensive experiments show that our model performs well against existing SOTA NSFW detection methods in terms of accuracy and recall, drastically reducing the Attack Success Rate (ASR) in multimodal adversarial attack scenarios. Code: this https URL.
zh

[CV-74] Learning What NOT to Count

【速读】:该论文旨在解决现有少样本/零样本目标计数方法在区分细粒度类别(尤其是场景中存在多个相似对象时)方面的局限性。解决方案的关键在于提出了一种无需标注的方法,通过利用潜在生成模型合成高质量的类别特定拥挤场景,为适应新类别提供了丰富的训练资源,而无需人工标注。此外,引入了一个注意力预测网络,用于识别细粒度类别边界,并使用仅由合成伪标注数据训练,从而在推理阶段优化现有少样本/零样本计数网络的输出。
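论文在推理阶段用细粒度注意力估计来精炼已有计数网络的输出。下面是一个假设性的最小草图:以注意力图为掩码筛选密度图,仅保留目标类别区域的计数贡献(`refine_density` 等名称为本文假设,论文的实际精炼方式以原文为准):

```python
import numpy as np

# 示意: 用细粒度注意力图筛选计数网络输出的密度图
def refine_density(density_map, attention_map, threshold=0.5):
    """density_map: 计数网络输出的密度图; attention_map: 目标类别注意力 (0~1)."""
    mask = (attention_map >= threshold).astype(density_map.dtype)
    refined = density_map * mask          # 仅保留目标细粒度类别区域的密度
    return refined, float(refined.sum())  # 精炼后的密度图与计数估计

density = np.ones((4, 4)) * 0.5           # 假设每个位置贡献 0.5 个计数
attention = np.zeros((4, 4))
attention[:2, :] = 1.0                     # 假设上半部分属于目标细粒度类别
refined, count = refine_density(density, attention)
```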

链接: https://arxiv.org/abs/2504.11705
作者: Adriano D’Alessandro,Ali Mahdavi-Amiri,Ghassan Hamarneh
机构: Simon Fraser University (西蒙弗雷泽大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Few/zero-shot object counting methods reduce the need for extensive annotations but often struggle to distinguish between fine-grained categories, especially when multiple similar objects appear in the same scene. To address this limitation, we propose an annotation-free approach that enables the seamless integration of new fine-grained categories into existing few/zero-shot counting models. By leveraging latent generative models, we synthesize high-quality, category-specific crowded scenes, providing a rich training source for adapting to new categories without manual labeling. Our approach introduces an attention prediction network that identifies fine-grained category boundaries trained using only synthetic pseudo-annotated data. At inference, these fine-grained attention estimates refine the output of existing few/zero-shot counting networks. To benchmark our method, we further introduce the FGTC dataset, a taxonomy-specific fine-grained object counting dataset for natural images. Our method substantially enhances pre-trained state-of-the-art models on fine-grained taxon counting tasks, while using only synthetic data. Code and data to be released upon acceptance.
zh

[CV-75] Non-uniform Point Cloud Upsampling via Local Manifold Distribution

【速读】:该论文旨在解决现有基于学习的点云上采样方法忽视点云内在数据分布特性的问题,特别是在处理稀疏和非均匀点云时往往导致次优结果。论文的关键在于从流形分布的角度施加约束,利用高斯函数的强大拟合能力,通过网络迭代优化高斯分量及其权重,精确表示局部流形,并借助高斯函数的概率分布特性构建统一的统计流形以施加分布约束。实验结果显示,该方法在处理稀疏和非均匀输入时能够生成更高质量和更均匀分布的密集点云,优于现有的点云上采样技术。
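论文的核心是用高斯函数拟合局部流形并施加分布约束。下面给出一个大幅简化的假设性草图:对某点的局部邻域拟合单个高斯分布,再从该分布采样新点实现上采样(论文实际通过网络迭代优化多个高斯分量及其权重,此处仅以单高斯演示思路):

```python
import numpy as np

rng = np.random.default_rng(0)

def upsample_local_gaussian(neighborhood, n_new):
    """对局部邻域点集拟合高斯分布并采样新点 (简化的单高斯示意)."""
    mean = neighborhood.mean(axis=0)
    cov = np.cov(neighborhood.T) + 1e-6 * np.eye(3)  # 加微小正则保证正定
    return rng.multivariate_normal(mean, cov, size=n_new)

# 示例: 从 16 个近似平面状的局部点生成 32 个新点
local_pts = rng.normal(size=(16, 3)) * [1.0, 1.0, 0.05]
new_pts = upsample_local_gaussian(local_pts, 32)
```

采样得到的新点自动服从局部拟合出的分布,因此在稀疏、非均匀输入下也能得到分布更均匀的稠密点,这与论文"以分布约束点云"的出发点一致。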

链接: https://arxiv.org/abs/2504.11701
作者: Yaohui Fang,Xingce Wang
机构: Beijing Normal University (北京师范大学); Beijing Normal University (北京师范大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Differential Geometry (math.DG)
备注:

点击查看摘要

Abstract:Existing learning-based point cloud upsampling methods often overlook the intrinsic data distribution characteristics of point clouds, leading to suboptimal results when handling sparse and non-uniform point clouds. We propose a novel approach to point cloud upsampling by imposing constraints from the perspective of manifold distributions. Leveraging the strong fitting capability of Gaussian functions, our method employs a network to iteratively optimize Gaussian components and their weights, accurately representing local manifolds. By utilizing the probabilistic distribution properties of Gaussian functions, we construct a unified statistical manifold to impose distribution constraints on the point cloud. Experimental results on multiple datasets demonstrate that our method generates higher-quality and more uniformly distributed dense point clouds when processing sparse and non-uniform inputs, outperforming state-of-the-art point cloud upsampling techniques.
zh

[CV-76] An Online Adaptation Method for Robust Depth Estimation and Visual Odometry in the Open World

【速读】:该论文旨在解决基于学习的机器人导航系统在开放世界场景中的泛化能力不足问题,特别是在场景测量和状态估计方面,当实际应用场景偏离训练数据时,深度估计和姿态估计的可靠性会显著下降。为了解决这一挑战,论文提出了一种能够在线快速适应多样化新环境的视觉里程计系统。方案的关键在于构建了一个以在线更新的深度估计模块辅助的自监督在线适应框架,并设计了一个带有轻量级精炼模块的单目深度估计网络,同时提出了稀疏深度稠密化模块和动态一致性增强模块,利用相机姿态和上下文语义信息生成伪深度图和有效掩码,从而实现高效的在线适应。实验结果验证了所提方法在城市数据集、室内数据集以及机器人平台上的鲁棒性和泛化能力。

链接: https://arxiv.org/abs/2504.11698
作者: Xingwu Ji,Haochen Niu,Dexin Duan,Rendong Ying,Fei Wen,Peilin Liu
机构: Brain-inspired Application Technology Center (BATC), School of Electronic Information and Electrical Engineering, Shanghai Jiao Tong University (上海交通大学)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注: 11 pages, 14 figures

点击查看摘要

Abstract:Recently, learning-based robotic navigation systems have gained extensive research attention and made significant progress. However, the diversity of open-world scenarios poses a major challenge for the generalization of such systems to practical scenarios. Specifically, learned systems for scene measurement and state estimation tend to degrade when the application scenarios deviate from the training data, resulting to unreliable depth and pose estimation. Toward addressing this problem, this work aims to develop a visual odometry system that can fast adapt to diverse novel environments in an online manner. To this end, we construct a self-supervised online adaptation framework for monocular visual odometry aided by an online-updated depth estimation module. Firstly, we design a monocular depth estimation network with lightweight refiner modules, which enables efficient online adaptation. Then, we construct an objective for self-supervised learning of the depth estimation module based on the output of the visual odometry system and the contextual semantic information of the scene. Specifically, a sparse depth densification module and a dynamic consistency enhancement module are proposed to leverage camera poses and contextual semantics to generate pseudo-depths and valid masks for the online adaptation. Finally, we demonstrate the robustness and generalization capability of the proposed method in comparison with state-of-the-art learning-based approaches on urban, in-house datasets and a robot platform. Code is publicly available at: this https URL.
zh

[CV-77] Interpreting the Linear Structure of Vision-language Model Embedding Spaces

【速读】:该论文试图解决的问题是如何组织视觉与语言在联合嵌入空间中的表示,并揭示生成式 AI (Generative AI) 模型如何编码语义和模态。为了解决这一问题,论文的关键方案是训练并发布稀疏自编码器(Sparse Autoencoders, SAEs),将其应用于四种视觉-语言模型(CLIP、SigLIP、SigLIP2 和 AIMv2)的嵌入空间中。SAEs 能够以稀疏线性组合的形式近似模型嵌入,并提取出被称为“概念”的学习方向。通过这种方式,研究发现 SAEs 不仅能够更好地重建真实嵌入,同时保留更高的稀疏性。此外,通过使用不同的随机种子或数据集重新训练 SAEs,研究进一步揭示了这些稀疏概念的稳定性和变化性,并提出桥接分数(Bridge Score)来量化跨模态语义整合中的协作行为。最终,这项工作揭示了视觉-语言模型嵌入空间中存在的稀疏线性结构,这种结构既受模态影响,又通过潜在的桥接机制实现跨模态语义的构建。
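SAE 的核心是把嵌入近似为少量学习方向("概念")的稀疏线性组合。下面用 numpy 给出 top-k 稀疏自编码前向过程的假设性草图(字典矩阵与 top-k 稀疏化是该类方法的常见做法,未必与论文的具体实现一致;Bridge Score 的精确定义请见原文):

```python
import numpy as np

def topk_sae_forward(x, W_enc, W_dec, k):
    """x: (d,) 嵌入; W_enc: (m, d) 编码矩阵; W_dec: (m, d) 概念方向字典."""
    acts = W_enc @ x                   # 每个概念的激活值
    idx = np.argsort(acts)[-k:]        # 仅保留 top-k 激活, 得到稀疏码
    sparse = np.zeros_like(acts)
    sparse[idx] = acts[idx]
    recon = sparse @ W_dec             # 以稀疏线性组合重建原嵌入
    return sparse, recon

rng = np.random.default_rng(1)
d, m, k = 8, 32, 4                     # 假设的嵌入维度、概念数与稀疏度
W = rng.normal(size=(m, d))
x = rng.normal(size=d)
sparse, recon = topk_sae_forward(x, W, W, k)
```

每个非零激活对应一个被该嵌入"使用"的概念方向,这正是论文据以分析模态结构与跨模态桥接行为的基本表示。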

链接: https://arxiv.org/abs/2504.11695
作者: Isabel Papadimitriou,Huangyuan Su,Thomas Fel,Naomi Saphra,Sham Kakade,Stephanie Gil
机构: Kempner Institute for the Study of Natural and Artificial Intelligence, Harvard University (肯普纳自然与人工智能研究所, 哈佛大学); Department of Computer Science, Harvard University (计算机科学系, 哈佛大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
备注:

点击查看摘要

Abstract:Vision-language models encode images and text in a joint space, minimizing the distance between corresponding image and text pairs. How are language and images organized in this joint space, and how do the models encode meaning and modality? To investigate this, we train and release sparse autoencoders (SAEs) on the embedding spaces of four vision-language models (CLIP, SigLIP, SigLIP2, and AIMv2). SAEs approximate model embeddings as sparse linear combinations of learned directions, or “concepts”. We find that, compared to other methods of linear feature learning, SAEs are better at reconstructing the real embeddings, while also able to retain the most sparsity. Retraining SAEs with different seeds or different data diet leads to two findings: the rare, specific concepts captured by the SAEs are liable to change drastically, but we also show that the key commonly-activating concepts extracted by SAEs are remarkably stable across runs. Interestingly, while most concepts are strongly unimodal in activation, we find they are not merely encoding modality per se. Many lie close to - but not entirely within - the subspace defining modality, suggesting that they encode cross-modal semantics despite their unimodal usage. To quantify this bridging behavior, we introduce the Bridge Score, a metric that identifies concept pairs which are both co-activated across aligned image-text inputs and geometrically aligned in the shared space. This reveals that even unimodal concepts can collaborate to support cross-modal integration. We release interactive demos of the SAEs for all models, allowing researchers to explore the organization of the concept spaces. Overall, our findings uncover a sparse linear structure within VLM embedding spaces that is shaped by modality, yet stitched together through latent bridges-offering new insight into how multimodal meaning is constructed.
zh

[CV-78] Can GPT tell us why these images are synthesized? Empowering Multimodal Large Language Models for Forensics

【速读】:本文旨在解决利用多模态大型语言模型(Multimodal Large Language Models, LLMs)进行深度伪造检测的问题。当前生成式 AI(Generative AI)的发展使得内容创作更加便捷,但同时也加剧了伪造内容的检测难度。尽管多模态 LLMs 已经编码了丰富的世界知识,但它们并非专门设计用于对抗 AI 生成内容(AI-Generated Content, AIGC),在理解局部伪造细节方面表现欠佳。为应对这一挑战,本文提出了一种框架,能够评估图像真实性、定位篡改区域、提供证据并追踪生成方法,基于语义篡改线索实现上述功能。关键在于通过精心设计提示工程(Prompt Engineering)以及应用少量学习(Few-Shot Learning)技术,有效释放多模态 LLMs 在伪造分析中的潜力。实验结果表明,GPT4V 在 Autosplice 和 LaMa 数据集上的准确率分别达到 92.1% 和 86.3%,与最先进的 AIGC 检测方法具有竞争力。此外,文章还讨论了多模态 LLMs 在此类任务中的局限性,并提出了潜在的改进方向。

链接: https://arxiv.org/abs/2504.11686
作者: Yiran He,Yun Cao,Bowen Yang,Zeyu Zhang
机构: Institute of Information Engineering, Chinese Academy of Sciences (中国科学院信息工程研究所); Institute of Information Engineering, Chinese Academy of Science (中国科学院信息工程研究所); Institute of Information Engineering, Chinese Academy of Science (中国科学院信息工程研究所); Institute of Information Engineering, Chinese Academy of Science (中国科学院信息工程研究所)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 12 pages, 11 figures, 13th IH&MMSec 2025

点击查看摘要

Abstract:The rapid development of generative AI facilitates content creation and makes image manipulation easier and more difficult to detect. While multimodal Large Language Models (LLMs) have encoded rich world knowledge, they are not inherently tailored for combating AI-generated Content (AIGC) and struggle to comprehend local forgery details. In this work, we investigate the application of multimodal LLMs in forgery detection. We propose a framework capable of evaluating image authenticity, localizing tampered regions, providing evidence, and tracing generation methods based on semantic tampering clues. Our method demonstrates that the potential of LLMs in forgery analysis can be effectively unlocked through meticulous prompt engineering and the application of few-shot learning techniques. We conduct qualitative and quantitative experiments and show that GPT4V can achieve an accuracy of 92.1% in Autosplice and 86.3% in LaMa, which is competitive with state-of-the-art AIGC detection methods. We further discuss the limitations of multimodal LLMs in such tasks and propose potential improvements.
zh

[CV-79] DM-OSVP: One-Shot View Planning Using 3D Diffusion Models for Active RGB-Based Object Reconstruction

【速读】:该论文旨在解决主动物体重建中的视角规划问题,特别是在机器人应用中,通过生成特定于物体的视图配置以获取用于重建的有意义测量值。论文的关键在于利用三维扩散模型(3D Diffusion Model)的生成能力作为有价值的先验信息。通过在初始多视角图像的基础上利用三维扩散模型的先验知识生成物体的近似模型,以此为基础进行一次性视角规划(one-shot view planning)。该方法将物体模型的几何分布和纹理分布整合到视角规划过程中,生成聚焦于待重建物体复杂部分的视图。论文通过仿真和真实世界实验验证了所提出的主动物体重建系统的有效性,证明了使用三维扩散先验进行一次性视角规划的可行性与优势。

链接: https://arxiv.org/abs/2504.11674
作者: Sicong Pan,Liren Jin,Xuying Huang,Cyrill Stachniss,Marija Popović,Maren Bennewitz
机构: University of Bonn, Germany (波恩大学,德国); Center for Robotics, Bonn, Germany (波恩机器人中心,德国); Lamarr Institute for Machine Learning and Artificial Intelligence, Germany (拉马尔机器学习与人工智能研究所,德国); Delft University of Technology, The Netherlands (代尔夫特理工大学,荷兰)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Active object reconstruction is crucial for many robotic applications. A key aspect in these scenarios is generating object-specific view configurations to obtain informative measurements for reconstruction. One-shot view planning enables efficient data collection by predicting all views at once, eliminating the need for time-consuming online replanning. Our primary insight is to leverage the generative power of 3D diffusion models as valuable prior information. By conditioning on initial multi-view images, we exploit the priors from the 3D diffusion model to generate an approximate object model, serving as the foundation for our view planning. Our novel approach integrates the geometric and textural distributions of the object model into the view planning process, generating views that focus on the complex parts of the object to be reconstructed. We validate the proposed active object reconstruction system through both simulation and real-world experiments, demonstrating the effectiveness of using 3D diffusion priors for one-shot view planning.
zh

[CV-80] Co-STAR: Collaborative Curriculum Self-Training with Adaptive Regularization for Source-Free Video Domain Adaptation

【速读】:该论文旨在解决源-free无监督视频域适应(Source-Free Unsupervised Video Domain Adaptation, SFUVDA)中伪标签噪声和过自信预测导致的性能瓶颈问题。论文的关键解决方案是提出Co-STAR框架,该框架结合课程学习与源模型教师与对比视觉语言模型(CLIP)之间的协作自训练。其核心创新点包括可靠性导向的权重函数,通过双向预测对齐来平衡置信度与不确定性,同时保留困难样本的不确定性;以及自适应课程正则化,以概率性且自适应的方式调整样本的学习优先级,缓解对噪声和过自信样本的过拟合问题。这些方法显著提升了跨域适应的效果。
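Co-STAR 的可靠性权重衡量教师模型与 CLIP 预测的双向对齐程度。下面是一个假设性的实现草图:以两个预测分布的重合度作为伪标签权重,两模型预测一致时权重接近 1,分歧大时权重低,从而为困难样本保留不确定性(具体权重函数形式以论文原文为准):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def reliability_weight(teacher_logits, clip_logits):
    """教师与 CLIP 预测越一致, 伪标签权重越高 (取值 0~1 的示意性度量)."""
    p, q = softmax(teacher_logits), softmax(clip_logits)
    agreement = float(np.minimum(p, q).sum())  # 分布重合度, 完全一致时为 1
    return agreement

# 两模型都指向类别 0 -> 高权重; 预测冲突 -> 低权重
w_high = reliability_weight(np.array([5.0, 0.0, 0.0]), np.array([4.0, 0.0, 0.0]))
w_low = reliability_weight(np.array([5.0, 0.0, 0.0]), np.array([0.0, 5.0, 0.0]))
```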

链接: https://arxiv.org/abs/2504.11669
作者: Amirhossein Dadashzadeh,Parsa Esmati,Majid Mirmehdi
机构: University of Bristol (布里斯托大学), UK
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Recent advances in Source-Free Unsupervised Video Domain Adaptation (SFUVDA) leverage vision-language models to enhance pseudo-label generation. However, challenges such as noisy pseudo-labels and over-confident predictions limit their effectiveness in adapting well across domains. We propose Co-STAR, a novel framework that integrates curriculum learning with collaborative self-training between a source-trained teacher and a contrastive vision-language model (CLIP). Our curriculum learning approach employs a reliability-based weight function that measures bidirectional prediction alignment between the teacher and CLIP, balancing between confident and uncertain predictions. This function preserves uncertainty for difficult samples, while prioritizing reliable pseudo-labels when the predictions from both models closely align. To further improve adaptation, we propose Adaptive Curriculum Regularization, which modifies the learning priority of samples in a probabilistic, adaptive manner based on their confidence scores and prediction stability, mitigating overfitting to noisy and over-confident samples. Extensive experiments across multiple video domain adaptation benchmarks demonstrate that Co-STAR consistently outperforms state-of-the-art SFUVDA methods. Code is available at: this https URL
zh

[CV-81] Real-time Object and Event Detection Service through Computer Vision and Edge Computing

【速读】:该论文旨在解决城市道路环境中致命交通事故频发的问题,特别是涉及易受伤害道路使用者(Vulnerable Road Users, VRUs)的事故。论文提出了一种基于计算机视觉(Computer Vision, CV)和边缘计算的系统策略与实施方案,用于智能城市的道路监控与安全。解决方案的关键在于通过部署监控摄像头实现视觉算法的实时运行,结合先进的传感器和数据集,利用机器学习模型准确检测和跟踪车辆、行人和自行车,并预测道路状态、移动物体间的距离,同时推断潜在碰撞事件以实现实时预防,从而显著提升道路安全性。

链接: https://arxiv.org/abs/2504.11662
作者: Marcos Mendes,Gonçalo Perna,Pedro Rito,Duarte Raposo,Susana Sargento
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 30th ITS World Congress, Dubai, UAE, 16-20 September 2024

点击查看摘要

Abstract:The World Health Organization suggests that road traffic crashes cost approximately 518 billion dollars globally each year, which accounts for 3% of the gross domestic product for most countries. Most fatal road accidents in urban areas involve Vulnerable Road Users (VRUs). Smart cities environments present innovative approaches to combat accidents involving cutting-edge technologies, that include advanced sensors, extensive datasets, Machine Learning (ML) models, communication systems, and edge computing. This paper proposes a strategy and an implementation of a system for road monitoring and safety for smart cities, based on Computer Vision (CV) and edge computing. Promising results were obtained by implementing vision algorithms and tracking using surveillance cameras, that are part of a Smart City testbed, the Aveiro Tech City Living Lab (ATCLL). The algorithm accurately detects and tracks cars, pedestrians, and bicycles, while predicting the road state, the distance between moving objects, and inferring on collision events to prevent collisions, in near real-time.
zh

[CV-82] DamageCAT: A Deep Learning Transformer Framework for Typology-Based Post-Disaster Building Damage Categorization

【速读】:该论文旨在解决当前建筑灾害评估方法局限于二元或序数严重程度分类,无法提供行动所需详细灾损类型的不足。论文的关键解决方案在于提出DamageCAT框架,其核心创新包括构建包含四级灾损类型(部分屋顶损坏、完全屋顶损坏、部分结构坍塌和完全结构坍塌)的BD-TypoSAT数据集,以及一种基于分层U-Net的Transformer架构。该架构能够有效处理灾前灾后图像对,识别并分类建筑灾损类型。尽管训练数据存在显著类别不平衡,模型仍实现了0.7921的平均IoU和0.8835的F1分数,尤其在罕见类别中表现出对复杂灾损类型的卓越识别能力。这一框架通过提供基于类型的可操作信息,显著提升了自动化灾损评估的能力,优于传统的基于严重程度的方法。

链接: https://arxiv.org/abs/2504.11637
作者: Yiming Xiao,Ali Mostafavi
机构: Texas A&M University (德克萨斯农工大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 23 pages, 6 figures

点击查看摘要

Abstract:Natural disasters increasingly threaten communities worldwide, creating an urgent need for rapid, reliable building damage assessment to guide emergency response and recovery efforts. Current methods typically classify damage in binary (damaged/undamaged) or ordinal severity terms, limiting their practical utility. In fact, the determination of damage typology is crucial for response and recovery efforts. To address this important gap, this paper introduces DamageCAT, a novel framework that provides typology-based categorical damage descriptions rather than simple severity ratings. Accordingly, this study presents two key contributions: (1) the BD-TypoSAT dataset containing satellite image triplets (pre-disaster, post-disaster, and damage masks) from Hurricane Ida with four damage categories (partial roof damage, total roof damage, partial structural collapse, and total structural collapse), and (2) a hierarchical U-Net-based transformer architecture that effectively processes pre-post disaster image pairs to identify and categorize building damage. Despite significant class imbalances in the training data, our model achieved robust performance with overall metrics of 0.7921 Intersection over Union (IoU) and 0.8835 F1 scores across all categories. The model’s capability to recognize intricate damage typology in less common categories is especially remarkable. The DamageCAT framework advances automated damage assessment by providing actionable, typological information that better supports disaster response decision-making and resource allocation compared to traditional severity-based approaches.
zh

[CV-83] Deep Learning Approaches for Medical Imaging Under Varying Degrees of Label Availability: A Comprehensive Survey

【速读】:该论文试图解决医疗影像领域深度学习模型对大规模高质量标注数据依赖的问题,这一问题限制了深度学习技术在医学领域的广泛应用。论文的关键在于综述并分析了近年来基于“不完全(incomplete)、不精确(inexact)或缺失(absent)标签”的学习范式的研究进展,涵盖了自2018年以来约600项重要贡献。通过形式化定义不同学习范式,总结并解析多种学习机制与策略,论文旨在帮助读者更好地理解当前研究现状,并为未来可能面临的挑战提供指导。

链接: https://arxiv.org/abs/2504.11588
作者: Siteng Ma,Honghui Du,Yu An,Jing Wang,Qinqin Wang,Haochang Wu,Aonghus Lawlor,Ruihai Dong
机构: University College Dublin (都柏林大学); North China Institute of Aerospace Engineering (华北航天工业学院); Hithink RoyalFlush Information Network Co., Ltd (恒生电子股份有限公司); University College Dublin (都柏林大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 33 pages, 10 figures, 8 tables. Will be submit to Medical Image Analysis

点击查看摘要

Abstract:Deep learning has achieved significant breakthroughs in medical imaging, but these advancements are often dependent on large, well-annotated datasets. However, obtaining such datasets poses a significant challenge, as it requires time-consuming and labor-intensive annotations from medical experts. Consequently, there is growing interest in learning paradigms such as incomplete, inexact, and absent supervision, which are designed to operate under limited, inexact, or missing labels. This survey categorizes and reviews the evolving research in these areas, analyzing around 600 notable contributions since 2018. It covers tasks such as image classification, segmentation, and detection across various medical application areas, including but not limited to brain, chest, and cardiac imaging. We attempt to establish the relationships among existing research studies in related areas. We provide formal definitions of different learning paradigms and offer a comprehensive summary and interpretation of various learning mechanisms and strategies, aiding readers in better understanding the current research landscape and ideas. We also discuss potential future research challenges.
zh

[CV-84] ConvShareViT: Enhancing Vision Transformers with Convolutional Attention Mechanisms for Free-Space Optical Accelerators

【速读】:该论文旨在解决如何将Vision Transformers (ViTs) 适配于自由空间光学系统(4f系统),同时提升其在光学硬件上的推理效率。论文的关键在于提出了一种名为ConvShareViT的新架构,通过用共享权重的深度可分离卷积层替代传统ViTs中的多头自注意力机制(Multi-Head Self-Attention, MHSA)和多层感知机(Multilayer Perceptron, MLP)中的线性层,实现了对ViT行为的系统性分析与优化。实验表明,特定配置(如使用有效填充共享卷积)能够成功学习注意力机制,并在性能上接近标准ViTs,而其他配置则表现出局限性。ConvShareViT通过充分利用光学系统的并行性和高分辨率能力,理论上可实现高达3.04倍的推理加速,证明了仅通过卷积操作即可高效实现ViT的潜力。
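ConvShareViT 的核心操作是"跨输入通道共享权重的深度卷积"(valid padding 配置)。下面用 numpy 给出该操作的最小示意:同一个核被应用到每个通道上(假设性实现,仅为说明机制,实际模型中该操作替换的是 MHSA 与 MLP 中的线性层):

```python
import numpy as np

def shared_depthwise_conv(x, kernel):
    """x: (C, H, W); kernel: (kh, kw), 同一核共享给所有通道, valid padding."""
    C, H, W = x.shape
    kh, kw = kernel.shape
    out = np.zeros((C, H - kh + 1, W - kw + 1))
    for c in range(C):                   # 每个通道独立做滑窗相关 (depthwise)
        for i in range(out.shape[1]):
            for j in range(out.shape[2]):
                out[c, i, j] = (x[c, i:i+kh, j:j+kw] * kernel).sum()
    return out

x = np.ones((3, 5, 5))                   # 3 通道的恒值输入
k = np.ones((3, 3)) / 9.0                # 所有通道共享的 3x3 平均核
y = shared_depthwise_conv(x, k)          # 输出形状 (3, 3, 3)
```

权重共享使所有通道复用同一组参数,这正是该架构能映射到 4f 光学系统并行处理的结构前提。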

链接: https://arxiv.org/abs/2504.11517
作者: Riad Ibadulla,Thomas M. Chen,Constantino Carlos Reyes-Aldasoro
机构: School of Science and Technology, City St. George’s, University of London (伦敦大学圣乔治学院科学与技术学院); Integrated Pathology Unit, Institute of Cancer Research (癌症研究所病理学部)
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注:

点击查看摘要

Abstract:This paper introduces ConvShareViT, a novel deep learning architecture that adapts Vision Transformers (ViTs) to the 4f free-space optical system. ConvShareViT replaces linear layers in multi-head self-attention (MHSA) and Multilayer Perceptrons (MLPs) with a depthwise convolutional layer with shared weights across input channels. Through the development of ConvShareViT, the behaviour of convolutions within MHSA and their effectiveness in learning the attention mechanism were analysed systematically. Experimental results demonstrate that certain configurations, particularly those using valid-padded shared convolutions, can successfully learn attention, achieving comparable attention scores to those obtained with standard ViTs. However, other configurations, such as those using same-padded convolutions, show limitations in attention learning and operate like regular CNNs rather than transformer models. ConvShareViT architectures are specifically optimised for the 4f optical system, which takes advantage of the parallelism and high-resolution capabilities of optical systems. Results demonstrate that ConvShareViT can theoretically achieve up to 3.04 times faster inference than GPU-based systems. This potential acceleration makes ConvShareViT an attractive candidate for future optical deep learning applications and proves that our ViT (ConvShareViT) can be employed using only the convolution operation, via the necessary optimisation of the ViT to balance performance and complexity.
zh

[CV-85] PATFinger: Prompt-Adapted Transferable Fingerprinting against Unauthorized Multimodal Dataset Usage

【速读】:该论文旨在解决当前多模态数据集使用权验证方法中存在的两个主要问题:一是侵入式方法虽可适应多模态数据集但会降低模型准确性;二是非侵入式方法依赖于标签驱动的决策边界,在验证过程中难以保证稳定行为。为应对这些挑战,论文提出了一种名为PATFinger的新颖提示适配可转移指纹方案,从无训练的角度出发,结合全局最优扰动(Global Optimal Perturbation, GOP)和自适应提示来捕捉特定数据集的分布特性。其关键是利用数据集本身的固有属性作为指纹,而不是强迫模型学习触发器,并通过精心设计的代理模型捕获跨模态交互作用,从而实现对数据集使用情况的有效检测。实验结果表明,该方案在多种跨模态检索架构中对抗未经授权的多模态数据集使用方面比现有最佳基线提升了30%的效果。

链接: https://arxiv.org/abs/2504.11509
作者: Wenyi Zhang,Ju Jia,Xiaojun Jia,Yihao Huang,Xinfeng Li,Cong Wu,Lina Wang
机构: Southeast University (东南大学); Nanyang Technological University (南洋理工大学); University of Hong Kong (香港大学); Wuhan University (武汉大学)
类目: Information Retrieval (cs.IR); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:The multimodal datasets can be leveraged to pre-train large-scale vision-language models by providing cross-modal semantics. Current endeavors for determining the usage of datasets mainly focus on single-modal dataset ownership verification through intrusive methods and non-intrusive techniques, while cross-modal approaches remain under-explored. Intrusive methods can adapt to multimodal datasets but degrade model accuracy, while non-intrusive methods rely on label-driven decision boundaries that fail to guarantee stable behaviors for verification. To address these issues, we propose a novel prompt-adapted transferable fingerprinting scheme from a training-free perspective, called PATFinger, which incorporates the global optimal perturbation (GOP) and the adaptive prompts to capture dataset-specific distribution characteristics. Our scheme utilizes inherent dataset attributes as fingerprints instead of compelling the model to learn triggers. The GOP is derived from the sample distribution to maximize embedding drifts between different modalities. Subsequently, our PATFinger re-aligns the adaptive prompt with GOP samples to capture the cross-modal interactions on the carefully crafted surrogate model. This allows the dataset owner to check the usage of datasets by observing specific prediction behaviors linked to the PATFinger during retrieval queries. Extensive experiments demonstrate the effectiveness of our scheme against unauthorized multimodal dataset usage on various cross-modal retrieval architectures by 30% over state-of-the-art baselines.
zh

[CV-86] ransitReID: Transit OD Data Collection with Occlusion-Resistant Dynamic Passenger Re-Identification

【速读】:该论文旨在解决城市公共交通系统中基于传统方法(如人工调查)成本高且效率低的问题,同时克服蓝牙和WiFi方法因依赖特定设备而限制数据覆盖范围的局限性。论文聚焦于利用车载摄像头通过视觉行人重识别(Visual Person Re-Identification, ReID)技术收集个体层面的公交线路OD (Origin-Destination) 数据。然而,这种方法面临严重的遮挡和视角变化等挑战,导致匹配精度下降,阻碍了其广泛应用。此外,在边缘设备上设计高效算法仍是一个未解决的问题。

为了解决上述挑战,论文提出了一种名为TransitReID的新框架。该框架的关键在于两个核心组件:一是具有变分自编码器引导区域注意力机制的遮挡鲁棒ReID算法,该机制通过优化重建损失的权重分配,自适应地关注可见的身体部位;二是专为高效且鲁棒的公交OD匹配设计的分层存储与动态匹配(Hierarchical Storage and Dynamic Matching, HSDM)机制,它在存储、速度和准确性之间实现了平衡。此外,多线程设计支持在边缘设备上的近实时操作,并确保隐私保护。同时,论文还引入了一个针对复杂公交车环境定制的ReID数据集,以应对缺乏相关训练数据的问题。实验结果表明,TransitReID在ReID任务中达到了最先进的性能,在公交车路线模拟中的准确率约为90%。

链接: https://arxiv.org/abs/2504.11500
作者: Kaicong Huang,Talha Azfar,Jack Reilly,Ruimin Ke
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)
备注:

点击查看摘要

Abstract:Transit Origin-Destination (OD) data are essential for transit planning, particularly in route optimization and demand-responsive paratransit systems. Traditional methods, such as manual surveys, are costly and inefficient, while Bluetooth and WiFi-based approaches require passengers to carry specific devices, limiting data coverage. On the other hand, most transit vehicles are equipped with onboard cameras for surveillance, offering an opportunity to repurpose them for edge-based OD data collection through visual person re-identification (ReID). However, such approaches face significant challenges, including severe occlusion and viewpoint variations in transit environments, which greatly reduce matching accuracy and hinder their adoption. Moreover, designing effective algorithms that can operate efficiently on edge devices remains an open challenge. To address these challenges, we propose TransitReID, a novel framework for individual-level transit OD data collection. TransitReID consists of two key components: (1) An occlusion-robust ReID algorithm featuring a variational autoencoder guided region-attention mechanism that adaptively focuses on visible body regions through reconstruction loss-optimized weight allocation; and (2) a Hierarchical Storage and Dynamic Matching (HSDM) mechanism specifically designed for efficient and robust transit OD matching which balances storage, speed, and accuracy. Additionally, a multi-threaded design supports near real-time operation on edge devices, while also ensuring privacy protection. We also introduce a ReID dataset tailored for complex bus environments to address the lack of relevant training data. Experimental results demonstrate that TransitReID achieves state-of-the-art performance in ReID tasks, with an accuracy of approximately 90% in bus route simulations.
zh

[CV-87] Probabilistic Task Parameterization of Tool-Tissue Interaction via Sparse Landmarks Tracking in Robotic Surgery ICRA’25

【速读】:该论文旨在解决机器人手术中工具与组织交互的精确建模问题,传统方法因依赖繁琐的人工标注或刚性假设而缺乏灵活性。论文的关键在于提出了一种结合稀疏关键点跟踪与概率建模的框架,通过聚类组织关键点利用主成分分析(PCA)构建动态局部变换,并将工具位姿与这些帧相对表达。将这些嵌入任务参数化高斯混合模型(Task-Parameterized Gaussian Mixture Model, TP-GMM)中,整合数据驱动观测与标注的临床专业知识,从而有效预测工具与组织的相对位姿,增强从视频数据中直接理解机器人手术运动的视觉能力。
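论文通过对聚类后的组织关键点做 PCA 构建动态局部坐标系,并将工具位姿表达在该坐标系下。以下为这一几何步骤的假设性草图(numpy 实现,仅演示"PCA 建系 + 相对位姿"的思路,函数名为本文假设):

```python
import numpy as np

def local_frame_from_keypoints(pts):
    """pts: (N, 3) 组织关键点; 返回 (原点, 3x3 旋转矩阵), 列为 PCA 主方向."""
    origin = pts.mean(axis=0)
    centered = pts - origin
    _, _, vt = np.linalg.svd(centered, full_matrices=False)  # 主成分方向
    return origin, vt.T

def tool_pose_in_frame(tool_pos, origin, R):
    """将工具位置从世界坐标变换到局部 PCA 坐标系."""
    return R.T @ (tool_pos - origin)

rng = np.random.default_rng(2)
tissue = rng.normal(size=(20, 3)) * [2.0, 1.0, 0.1]  # 假设的近平面状组织点
origin, R = local_frame_from_keypoints(tissue)
rel = tool_pose_in_frame(origin + np.array([0.0, 0.0, 1.0]), origin, R)
```

当组织发生形变、关键点被重新跟踪时,局部坐标系随之更新,工具的相对位姿因此对组织运动保持不变性,这正是后续 TP-GMM 建模所依赖的任务参数化表示。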

链接: https://arxiv.org/abs/2504.11495
作者: Yiting Wang,Yunxin Fan,Fei Liu
机构: 未知
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Systems and Control (eess.SY)
备注: Submitted to ICRA’25 Workshop of 3rd Robot-Assisted Medical Imaging

点击查看摘要

Abstract:Accurate modeling of tool-tissue interactions in robotic surgery requires precise tracking of deformable tissues and integration of surgical domain knowledge. Traditional methods rely on labor-intensive annotations or rigid assumptions, limiting flexibility. We propose a framework combining sparse keypoint tracking and probabilistic modeling that propagates expert-annotated landmarks across endoscopic frames, even with large tissue deformations. Clustered tissue keypoints enable dynamic local transformation construction via PCA, and tool poses, tracked similarly, are expressed relative to these frames. Embedding these into a Task-Parameterized Gaussian Mixture Model (TP-GMM) integrates data-driven observations with labeled clinical expertise, effectively predicting relative tool-tissue poses and enhancing visual understanding of robotic surgical motions directly from video data.
zh

[CV-88] oward Aligning Human and Robot Actions via Multi-Modal Demonstration Learning ICRA’25

【速读】:本文旨在解决人类与机器人在非结构化环境中动作对应关系的理解问题,特别是在人机协作和模仿学习中的决策对齐挑战。为实现这一目标,论文提出了一种多模态演示学习框架,该框架明确利用RGB视频中的人类演示和体素化的RGB-D空间中的机器人演示进行建模。关键在于结合基于ResNet的视觉编码(用于人类意图建模)和基于Perceiver Transformer的体素级机器人动作预测,从而有效对齐复杂多模态的人类与机器人行为,特别是在操作任务如“pick and place”中的应用。

链接: https://arxiv.org/abs/2504.11493
作者: Azizul Zahid,Jie Fan,Farong Wang,Ashton Dy,Sai Swaminathan,Fei Liu
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: ICRA’25 Workshop: Human-Centered Robot Learning in the Era of Big Data and Large Models

点击查看摘要

Abstract:Understanding action correspondence between humans and robots is essential for evaluating alignment in decision-making, particularly in human-robot collaboration and imitation learning within unstructured environments. We propose a multimodal demonstration learning framework that explicitly models human demonstrations from RGB video with robot demonstrations in voxelized RGB-D space. Focusing on the “pick and place” task from the RH20T dataset, we utilize data from 5 users across 10 diverse scenes. Our approach combines ResNet-based visual encoding for human intention modeling and a Perceiver Transformer for voxel-based robot action prediction. After 2000 training epochs, the human model reaches 71.67% accuracy, and the robot model achieves 71.8% accuracy, demonstrating the framework’s potential for aligning complex, multimodal human and robot behaviors in manipulation tasks.
zh

[CV-89] Uncovering Branch specialization in InceptionV1 using k sparse autoencoders CVPR

【速读】:该论文旨在解决 InceptionV1 模型中分支专业化(branch specialization)在后期层中仍存在的谜团问题。通过展示混合层(mixed4a-4e)、5x5 分支以及某个 1x1 分支中分支专业化现象的多种示例,并提供证据表明这种专业化在各层中具有一致性,即模型中相似的特征会在相应层的相同卷积尺寸分支中局部化。关键在于验证并揭示分支专业化现象的存在及其一致性规律,从而深化对稀疏自编码器(Sparse Autoencoder, SAE)在提取可解释特征方面的理解与应用。

链接: https://arxiv.org/abs/2504.11489
作者: Matthew Bozoukov
机构: Miramar College (米拉马学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to CVPR MIV workshop. 9 pages with an appendix

点击查看摘要

Abstract:Sparse Autoencoders (SAEs) have been shown to find interpretable features in neural networks from polysemantic neurons caused by superposition. Previous work has shown SAEs are an effective tool to extract interpretable features from the early layers of InceptionV1. Since then, there have been many improvements to SAEs but branch specialization is still an enigma in the later layers of InceptionV1. We show various examples of branch specialization occurring in each layer of the mixed4a-4e branch, in the 5x5 branch and in one 1x1 branch. We also provide evidence to claim that branch specialization seems to be consistent across layers, similar features across the model will be localized in the same convolution size branches in their respective layer.
zh
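To make the TopK mechanism of a k-sparse autoencoder concrete, here is a minimal NumPy sketch (not the paper's code; the dimensions and random weights are illustrative assumptions) that keeps only the k largest ReLU activations per input:

```python
import numpy as np

def k_sparse_encode(x, W_enc, b_enc, k):
    """Encode x and zero out everything except the top-k ReLU activations."""
    a = np.maximum(x @ W_enc + b_enc, 0.0)  # ReLU pre-codes, shape (d_hidden,)
    if k < a.size:
        thresh = np.sort(a)[-k]             # value of the k-th largest activation
        a = np.where(a >= thresh, a, 0.0)
    return a

rng = np.random.default_rng(0)
d_in, d_hidden, k = 16, 64, 4
W_enc = rng.normal(size=(d_in, d_hidden))
W_dec = rng.normal(size=(d_hidden, d_in))

x = rng.normal(size=d_in)
code = k_sparse_encode(x, W_enc, np.zeros(d_hidden), k)
recon = code @ W_dec                        # sparse reconstruction of x
print(int(np.count_nonzero(code)))          # number of active features, at most k
```

Training would minimize the reconstruction error over a dataset; the hard sparsity constraint is what pushes the learned features toward the monosemantic directions used to study branch specialization.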

[CV-90] snnTrans-DHZ: A Lightweight Spiking Neural Network Architecture for Underwater Image Dehazing

Quick Read: This paper addresses underwater image dehazing, a key challenge for vision-based marine operations, since light scattering and absorption severely degrade visibility. It proposes snnTrans-DHZ, a lightweight Spiking Neural Network (SNN) designed specifically for underwater dehazing that exploits the temporal dynamics of SNNs to process time-dependent raw image sequences efficiently while keeping power consumption low. The key to the architecture is three core modules: (i) a K estimator that extracts features from multiple color-space representations; (ii) a Background Light Estimator that jointly infers the background light component from RGB-LAB images; and (iii) a soft image reconstruction module that produces haze-free, visibility-enhanced outputs. The model is trained directly with a surrogate-gradient backpropagation-through-time (BPTT) strategy and a novel combined loss function. Experiments show the algorithm significantly outperforms existing state-of-the-art methods in efficiency, making it well suited for deployment in underwater robotics, marine exploration, and environmental monitoring.

Link: https://arxiv.org/abs/2504.11482
Authors: Vidya Sudevan, Fakhreddine Zayer, Rizwana Kausar, Sajid Javed, Hamad Karki, Giulia De Masi, Jorge Dias
Affiliations: Center for Autonomous Robotic Systems, Khalifa University; Dept. Science and Engineering, Sorbonne University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Performance (cs.PF); Robotics (cs.RO); Image and Video Processing (eess.IV)
Comments:

Click to view abstract

Abstract:Underwater image dehazing is critical for vision-based marine operations because light scattering and absorption can severely reduce visibility. This paper introduces snnTrans-DHZ, a lightweight Spiking Neural Network (SNN) specifically designed for underwater dehazing. By leveraging the temporal dynamics of SNNs, snnTrans-DHZ efficiently processes time-dependent raw image sequences while maintaining low power consumption. Static underwater images are first converted into time-dependent sequences by repeatedly inputting the same image over user-defined timesteps. These RGB sequences are then transformed into LAB color space representations and processed concurrently. The architecture features three key modules: (i) a K estimator that extracts features from multiple color space representations; (ii) a Background Light Estimator that jointly infers the background light component from the RGB-LAB images; and (iii) a soft image reconstruction module that produces haze-free, visibility-enhanced outputs. The snnTrans-DHZ model is directly trained using a surrogate gradient-based backpropagation through time (BPTT) strategy alongside a novel combined loss function. Evaluated on the UIEB benchmark, snnTrans-DHZ achieves a PSNR of 21.68 dB and an SSIM of 0.8795, and on the EUVP dataset, it yields a PSNR of 23.46 dB and an SSIM of 0.8439. With only 0.5670 million network parameters, and requiring just 7.42 GSOPs and 0.0151 J of energy, the algorithm significantly outperforms existing state-of-the-art methods in terms of efficiency. These features make snnTrans-DHZ highly suitable for deployment in underwater robotics, marine exploration, and environmental monitoring.

[CV-91] Flux Already Knows - Activating Subject-Driven Image Generation without Training

Quick Read: This paper tackles zero-shot subject-driven image generation: producing high-quality subject images from a pre-trained foundational text-to-image model without any additional data, training, or inference-time fine-tuning. The key innovation is a simple grid-based image-completion framework that replicates the subject image(s) in a mosaic layout, which activates strong identity-preserving capabilities. A novel cascade attention design and a meta prompting technique further boost fidelity and versatility. Experiments show the method outperforms baselines on multiple benchmarks and in human preference studies, while supporting diverse edits such as logo insertion, virtual try-on, and subject replacement or insertion.

Link: https://arxiv.org/abs/2504.11478
Authors: Hao Kang, Stathi Fotiadis, Liming Jiang, Qing Yan, Yumin Jia, Zichuan Liu, Min Jin Chong, Xin Lu
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:We propose a simple yet effective zero-shot framework for subject-driven image generation using a vanilla Flux model. By framing the task as grid-based image completion and simply replicating the subject image(s) in a mosaic layout, we activate strong identity-preserving capabilities without any additional data, training, or inference-time fine-tuning. This “free lunch” approach is further strengthened by a novel cascade attention design and meta prompting technique, boosting fidelity and versatility. Experimental results show that our method outperforms baselines across multiple key metrics in benchmarks and human preference studies, with trade-offs in certain aspects. Additionally, it supports diverse edits, including logo insertion, virtual try-on, and subject replacement or insertion. These results demonstrate that a pre-trained foundational text-to-image model can enable high-quality, resource-efficient subject-driven generation, opening new possibilities for lightweight customization in downstream applications.
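The grid-completion framing can be sketched in a few lines of NumPy; the 2×2 grid and the blank bottom-right target cell are illustrative assumptions, not the paper's exact layout:

```python
import numpy as np

def mosaic_grid(subject, grid=(2, 2)):
    """Replicate a subject image in a mosaic layout, leaving the last cell
    empty as the region the generator is asked to complete."""
    h, w, c = subject.shape
    canvas = np.zeros((grid[0] * h, grid[1] * w, c), dtype=subject.dtype)
    for i in range(grid[0]):
        for j in range(grid[1]):
            if (i, j) == (grid[0] - 1, grid[1] - 1):
                continue  # target cell stays blank for completion
            canvas[i * h:(i + 1) * h, j * w:(j + 1) * w] = subject
    return canvas

subject = np.full((64, 64, 3), 128, dtype=np.uint8)  # stand-in subject image
canvas = mosaic_grid(subject)
print(canvas.shape)             # (128, 128, 3)
print(int(canvas[96, 96, 0]))   # 0: the completion cell is empty
```

Because the visible cells all show the same subject, an inpainting-style completion of the blank cell is strongly biased toward preserving that subject's identity, which is the "free lunch" the paper exploits.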

[CV-92] SDIGLM: Leveraging Large Language Models and Multi-Modal Chain of Thought for Structural Damage Identification

Quick Read: This paper addresses the limitations of existing computer-vision-based structural damage identification models in practical civil engineering: constrained recognition of complex and varied damage types, and the inability to describe damage characteristics in natural language. The proposed approach builds on large multi-modal models (LMMs), which unify the encoding and alignment of textual and visual data, enabling autonomous, detailed descriptions of structural damage with robust generalization across scenarios and tasks. The key contribution is SDIGLM, a novel LMM for structural damage identification built on the open-source VisualGLM-6B architecture. It integrates a U-Net-based semantic segmentation module that generates defect segmentation maps as a visual chain of thought (CoT), a multi-round dialogue fine-tuning dataset to strengthen logical reasoning, and a language CoT formed through prompt engineering. With this multi-modal CoT, structural damage identification accuracy reaches 95.24% across various infrastructure types, and the model can effectively describe damage features such as hole size, crack direction, and corrosion severity.

Link: https://arxiv.org/abs/2504.11477
Authors: Yunkai Zhang, Shiyin Wei, Yong Huang, Yawu Su, Shanshan Lu, Hui Li
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Existing computer vision(CV)-based structural damage identification models demonstrate notable accuracy in categorizing and localizing damage. However, these models present several critical limitations that hinder their practical application in civil engineering(CE). Primarily, their ability to recognize damage types remains constrained, preventing comprehensive analysis of the highly varied and complex conditions encountered in real-world CE structures. Second, these models lack linguistic capabilities, rendering them unable to articulate structural damage characteristics through natural language descriptions. With the continuous advancement of artificial intelligence(AI), large multi-modal models(LMMs) have emerged as a transformative solution, enabling the unified encoding and alignment of textual and visual data. These models can autonomously generate detailed descriptive narratives of structural damage while demonstrating robust generalization across diverse scenarios and tasks. This study introduces SDIGLM, an innovative LMM for structural damage identification, developed based on the open-source VisualGLM-6B architecture. To address the challenge of adapting LMMs to the intricate and varied operating conditions in CE, this work integrates a U-Net-based semantic segmentation module to generate defect segmentation maps as visual Chain of Thought(CoT). Additionally, a multi-round dialogue fine-tuning dataset is constructed to enhance logical reasoning, complemented by a language CoT formed through prompt engineering. By leveraging this multi-modal CoT, SDIGLM surpasses general-purpose LMMs in structural damage identification, achieving an accuracy of 95.24% across various infrastructure types. Moreover, the model effectively describes damage characteristics such as hole size, crack direction, and corrosion severity.

[CV-93] Visual moral inference and communication

Quick Read: This paper asks how to perform moral inference from natural images and how to analyze patterns of moral content communicated through images in public news. Existing approaches rely mainly on text-based language models and fail to capture fine-grained human moral judgment toward visual stimuli. The key to the proposed computational framework is language-vision fusion models, which combine linguistic and visual information for more precise visual moral inference and reveal implicit biases in news categories and in geopolitical discussions. The framework opens avenues for automating visual moral inference and for discovering patterns of visual moral communication in public media.

Link: https://arxiv.org/abs/2504.11473
Authors: Warren Zhu, Aida Ramezani, Yang Xu
Affiliations: University of Toronto
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Humans can make moral inferences from multiple sources of input. In contrast, automated moral inference in artificial intelligence typically relies on language models with textual input. However, morality is conveyed through modalities beyond language. We present a computational framework that supports moral inference from natural images, demonstrated in two related tasks: 1) inferring human moral judgment toward visual images and 2) analyzing patterns in moral content communicated via images from public news. We find that models based on text alone cannot capture the fine-grained human moral judgment toward visual stimuli, but language-vision fusion models offer better precision in visual moral inference. Furthermore, applications of our framework to news data reveal implicit biases in news categories and geopolitical discussions. Our work creates avenues for automating visual moral inference and discovering patterns of visual moral communication in public media.

[CV-94] High Dynamic Range Modulo Imaging for Robust Object Detection in Autonomous Driving

Quick Read: This paper addresses image saturation under extreme lighting variations, which degrades the precision with which autonomous driving systems recognize vehicles, pedestrians, and obstacles in real time. Conventional high dynamic range (HDR) imaging captures a broad range of light intensities but requires multiple captures, which is too inefficient for real-time use. The key solution is a modulo sensor, whose pixels reset/wrap upon reaching saturation, producing an irradiance-encoded image that can then be recovered with unwrapping algorithms to reconstruct an HDR image. The method preserves excellent visual quality while significantly improving object detection accuracy, and the modulo acquisition plus HDR reconstruction takes less time than conventional HDR image acquisition.

Link: https://arxiv.org/abs/2504.11472
Authors: Kebin Contreras, Brayan Monroy, Jorge Bacca
Affiliations: VIE-UIS
Subjects: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments:

Click to view abstract

Abstract:Object detection precision is crucial for ensuring the safety and efficacy of autonomous driving systems. The quality of acquired images directly influences the ability of autonomous driving systems to correctly recognize and respond to other vehicles, pedestrians, and obstacles in real-time. However, real environments present extreme variations in lighting, causing saturation problems and resulting in the loss of crucial details for detection. Traditionally, High Dynamic Range (HDR) images have been preferred for their ability to capture a broad spectrum of light intensities, but the need for multiple captures to construct HDR images is inefficient for real-time applications in autonomous vehicles. To address these issues, this work introduces the use of modulo sensors for robust object detection. The modulo sensor allows pixels to 'reset/wrap' upon reaching saturation level by acquiring an irradiance encoding image which can then be recovered using unwrapping algorithms. The applied reconstruction techniques enable HDR recovery of color intensity and image details, ensuring better visual quality even under extreme lighting conditions at the cost of extra time. Experiments with the YOLOv10 model demonstrate that images processed using modulo images achieve performance comparable to HDR images and significantly surpass saturated images in terms of object detection accuracy. Moreover, the proposed modulo imaging step combined with HDR image reconstruction is shorter than the time required for conventional HDR image acquisition.
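As a toy 1D illustration of the wrap-and-recover principle (the paper reconstructs full 2D images; this sketch only assumes neighbouring samples differ by less than half the sensor range):

```python
import numpy as np

def unwrap_modulo(wrapped, max_val):
    """Recover a smooth signal from modulo-wrapped samples.
    Assumes neighbouring samples differ by less than max_val / 2."""
    diffs = np.diff(wrapped)
    # a large negative jump means the sensor wrapped upward (and vice versa)
    jumps = np.round(diffs / max_val) * max_val
    corrected = np.concatenate([[0.0], diffs - jumps])
    return wrapped[0] + np.cumsum(corrected)

L = 255.0                                         # sensor saturation level
x = np.linspace(0, 1, 200)
irradiance = 600.0 * np.sin(np.pi * x) ** 2       # exceeds the sensor range
wrapped = np.mod(irradiance, L)                   # what a modulo sensor records
recovered = unwrap_modulo(wrapped, L)
print(float(np.max(np.abs(recovered - irradiance))))  # ~0: exact up to float error
```

The smoothness assumption is what makes the wrap jumps unambiguous; real 2D unwrapping algorithms generalize this jump-correction idea across the image grid.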

[CV-95] SO-DETR: Leveraging Dual-Domain Features and Knowledge Distillation for Small Object Detection

Quick Read: This paper addresses two key challenges of existing Detection Transformer-based methods for small object detection: encoders that struggle to fuse low-level features efficiently, and query selection strategies that are not tailored to small objects. The proposed Small Object Detection Transformer (SO-DETR) rests on three core components: a dual-domain hybrid encoder, an enhanced query selection mechanism, and a knowledge distillation strategy. The dual-domain hybrid encoder integrates the spatial and frequency domains to fuse multi-scale features effectively at relatively low computational overhead; the enhanced query selection mechanism optimizes query initialization by dynamically selecting high-scoring anchor boxes; and a lightweight backbone combined with knowledge distillation yields an efficient small-object detector. Experiments on the VisDrone-2019-DET and UAVVaste datasets show SO-DETR outperforms existing methods with similar computational demands.

Link: https://arxiv.org/abs/2504.11470
Authors: Huaxiang Zhang, Hao Zhang, Aoran Mei, Zhongxue Gan, Guo-Niu Zhu
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Detection Transformer-based methods have achieved significant advancements in general object detection. However, challenges remain in effectively detecting small objects. One key difficulty is that existing encoders struggle to efficiently fuse low-level features. Additionally, the query selection strategies are not effectively tailored for small objects. To address these challenges, this paper proposes an efficient model, Small Object Detection Transformer (SO-DETR). The model comprises three key components: a dual-domain hybrid encoder, an enhanced query selection mechanism, and a knowledge distillation strategy. The dual-domain hybrid encoder integrates spatial and frequency domains to fuse multi-scale features effectively. This approach enhances the representation of high-resolution features while maintaining relatively low computational overhead. The enhanced query selection mechanism optimizes query initialization by dynamically selecting high-scoring anchor boxes using expanded IoU, thereby improving the allocation of query resources. Furthermore, by incorporating a lightweight backbone network and implementing a knowledge distillation strategy, we develop an efficient detector for small objects. Experimental results on the VisDrone-2019-DET and UAVVaste datasets demonstrate that SO-DETR outperforms existing methods with similar computational demands. The project page is available at this https URL.
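A hedged sketch of score-based query initialization; the plain IoU below is the base quantity that the paper's expanded IoU modifies (the expansion itself is not reproduced here):

```python
import numpy as np

def iou(a, b):
    """Standard IoU of two [x1, y1, x2, y2] boxes. SO-DETR's expanded IoU
    builds on this overlap test; the exact expansion is not shown here."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def select_queries(scores, k):
    """Initialise decoder queries from the k highest-scoring anchors."""
    return np.argsort(scores)[::-1][:k]

scores = np.array([0.1, 0.9, 0.4, 0.7])
print(select_queries(scores, 2).tolist())          # [1, 3]
print(round(iou([0, 0, 2, 2], [1, 1, 3, 3]), 4))   # 0.1429
```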

[CV-96] MultiCoreTPU Accelerated Multi-Modal TinyML for Livestock Behaviour Recognition

Quick Read: This paper addresses the cost, performance, and applicability limitations of traditional livestock monitoring systems by proposing an efficient, low-cost livestock sensing and behaviour recognition system. The key is combining tiny machine learning (TinyML) techniques, a wireless communication framework, and microcontroller platforms into a multi-modal network that fuses accelerometer data and vision inputs for three tasks: image classification, object detection, and behaviour recognition. Deployed on commercial microcontrollers, the system achieves up to 270× model size reduction, under 80 ms response latency, and performance on par with existing methods, while supporting seamless data transmission between devices in poorly connected environments, delivering a robust, scalable IoT-edge monitoring solution adaptable to diverse farming needs.

Link: https://arxiv.org/abs/2504.11467
Authors: Qianxue Zhang, Eiman Kanjo
Affiliations: Medical AI Lab, Hebei Provincial Engineering Research Center for AI-Based Cancer Treatment Decision-Making, The First Hospital of Hebei Medical University, Shijiazhuang, China; Computing Department, Imperial College London, London, UK; Professor Pervasive Sensing & TinyML and Smart Sensing Lab, Nottingham Trent University, Nottingham, UK; Provost's Visiting Professor in tinyML, Imperial College London, London, UK
Subjects: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments: 11 pages, 10 figures

Click to view abstract

Abstract:The advancement of technology has revolutionised the agricultural industry, transitioning it from labour-intensive farming practices to automated, AI-powered management systems. In recent years, more intelligent livestock monitoring solutions have been proposed to enhance farming efficiency and productivity. This work presents a novel approach to animal activity recognition and movement tracking, leveraging tiny machine learning (TinyML) techniques, wireless communication framework, and microcontroller platforms to develop an efficient, cost-effective livestock sensing system. It collects and fuses accelerometer data and vision inputs to build a multi-modal network for three tasks: image classification, object detection, and behaviour recognition. The system is deployed and evaluated on commercial microcontrollers for real-time inference using embedded applications, demonstrating up to 270× model size reduction, less than 80ms response latency, and on-par performance comparable to existing methods. The incorporation of the TinyML technique allows for seamless data transmission between devices, benefiting use cases in remote locations with poor Internet connectivity. This work delivers a robust, scalable IoT-edge livestock monitoring solution adaptable to diverse farming needs, offering flexibility for future extensions.

[CV-97] Comparative Evaluation of Radiomics and Deep Learning Models for Disease Detection in Chest Radiography

Quick Read: This paper examines how to use artificial intelligence (AI) to improve the diagnostic accuracy and robustness of disease detection in chest radiography, focusing on COVID-19, lung opacity, and viral pneumonia, and compares radiomics-based and deep learning-based approaches. Deep learning models such as convolutional neural networks (CNNs) and vision transformers (ViTs) learn directly from image data, whereas radiomics-based models extract and analyze quantitative features and may hold an advantage in data-limited scenarios. The key contribution is a systematic comparison of AI models for radiomics (Decision Trees, Gradient Boosting, Random Forests, Support Vector Machines (SVM), and Multi-Layer Perceptrons (MLP)) against state-of-the-art computer vision deep learning architectures, revealing how each model performs across sample sizes and providing guidance for selecting AI-driven diagnostic tools in clinical practice.

Link: https://arxiv.org/abs/2504.12249
Authors: Zhijin He, Alan B. McMillan
Affiliations: Unknown
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:The application of artificial intelligence (AI) in medical imaging has revolutionized diagnostic practices, enabling advanced analysis and interpretation of radiological data. This study presents a comprehensive evaluation of radiomics-based and deep learning-based approaches for disease detection in chest radiography, focusing on COVID-19, lung opacity, and viral pneumonia. While deep learning models, particularly convolutional neural networks (CNNs) and vision transformers (ViTs), learn directly from image data, radiomics-based models extract and analyze quantitative features, potentially providing advantages in data-limited scenarios. This study systematically compares the diagnostic accuracy and robustness of various AI models, including Decision Trees, Gradient Boosting, Random Forests, Support Vector Machines (SVM), and Multi-Layer Perceptrons (MLP) for radiomics, against state-of-the-art computer vision deep learning architectures. Performance metrics across varying sample sizes reveal insights into each model’s efficacy, highlighting the contexts in which specific AI approaches may offer enhanced diagnostic capabilities. The results aim to inform the integration of AI-driven diagnostic tools in clinical practice, particularly in automated and high-throughput environments where timely, reliable diagnosis is critical. This comparative study addresses an essential gap, establishing guidance for the selection of AI models based on clinical and operational needs.
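To illustrate what radiomics-style "quantitative features" look like (a tiny stand-in for a full pipeline, computed on an arbitrary synthetic patch), first-order statistics can be extracted directly from intensities:

```python
import numpy as np

def radiomic_features(patch, bins=16):
    """First-order radiomic features from an intensity patch: mean, std,
    and histogram entropy (a toy subset of a real radiomics feature set)."""
    hist, _ = np.histogram(patch, bins=bins, range=(0.0, 1.0))
    p = hist / hist.sum()
    p = p[p > 0]                              # drop empty bins before log
    entropy = -np.sum(p * np.log2(p))
    return {"mean": float(patch.mean()),
            "std": float(patch.std()),
            "entropy": float(entropy)}

rng = np.random.default_rng(0)
patch = rng.random((32, 32))                  # stand-in for a lung-field ROI
feats = radiomic_features(patch)
print(sorted(feats))                          # ['entropy', 'mean', 'std']
```

Feature vectors like this one are what the classical classifiers in the study (SVM, Random Forests, etc.) consume, instead of raw pixels.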

[CV-98] Modality-Independent Explainable Detection of Inaccurate Organ Segmentations Using Denoising Autoencoders

Quick Read: This paper addresses suboptimal treatment delivery in radiation therapy planning caused by inaccurate organ-at-risk (OAR) segmentations that go undetected by clinicians. The key is a denoising autoencoder (DAE)-based method: noise is applied to ground-truth organ segmentations and the autoencoder is trained to remove it, enabling detection of segmentation inaccuracies. Applied to organ segmentations generated on both MR and CT scans, the method is shown to be imaging-modality independent, and its reconstructions visually indicate the inaccurate regions of a segmentation, making the detection of suboptimal segmentations more explainable. Compared with existing approaches in the literature, it achieves superior performance for the majority of organs.

Link: https://arxiv.org/abs/2504.12203
Authors: Levente Lippenszky, István Megyeri, Krisztian Koos, Zsófia Karancsi, Borbála Deák-Karancsi, András Frontó, Árpád Makk, Attila Rádics, Erhan Bas, László Ruskó
Affiliations: Unknown
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: Short version of this paper was accepted for poster presentation at IEEE ISBI 2025

Click to view abstract

Abstract:In radiation therapy planning, inaccurate segmentations of organs at risk can result in suboptimal treatment delivery, if left undetected by the clinician. To address this challenge, we developed a denoising autoencoder-based method to detect inaccurate organ segmentations. We applied noise to ground truth organ segmentations, and the autoencoders were tasked to denoise them. Through the application of our method to organ segmentations generated on both MR and CT scans, we demonstrated that the method is independent of imaging modality. By providing reconstructions, our method offers visual information about inaccurate regions of the organ segmentations, leading to more explainable detection of suboptimal segmentations. We compared our method to existing approaches in the literature and demonstrated that it achieved superior performance for the majority of organs.
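The noise-and-denoise setup can be sketched in a few lines; since training a network is out of scope for a sketch, the clean ground-truth mask stands in below for the output of a trained DAE:

```python
import numpy as np

rng = np.random.default_rng(0)

def corrupt(mask, p=0.05):
    """Training-pair generation: flip a fraction of voxels of a ground-truth
    mask; the DAE is trained to map the noisy mask back to the clean one."""
    flips = rng.random(mask.shape) < p
    return np.where(flips, 1 - mask, mask)

def error_map(candidate, denoised):
    """Voxels where a candidate segmentation disagrees with the DAE output
    localise the suspected inaccuracies."""
    return np.abs(candidate.astype(int) - denoised.astype(int))

clean = np.zeros((32, 32), dtype=np.uint8)
clean[8:24, 8:24] = 1                     # ground-truth "organ"
noisy = corrupt(clean)                    # one synthetic training input
candidate = clean.copy()
candidate[8:12, 8:24] = 0                 # simulate an under-segmented region
err = error_map(candidate, clean)         # clean stands in for a trained DAE
print(int(err.sum()))                     # 64 disagreeing voxels
```

The error map is what makes the detection explainable: it shows where the segmentation deviates, not just that it does.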

[CV-99] Novel-view X-ray Projection Synthesis through Geometry-Integrated Deep Learning

Quick Read: This paper addresses the increased radiation exposure and clinical complexity of traditional X-ray imaging, which requires projections from multiple angles. The key is the DL-GIPS model, which synthesizes X-ray projections at new viewpoints from a single existing projection: geometry features extracted from the initial projection are strategically manipulated to match the new viewing angle, then merged with consistent texture information through an advanced image generation process to synthesize the final projection. Lung imaging examples demonstrate the effectiveness and broad applicability of the DL-GIPS framework, highlighting its potential to revolutionize stereoscopic and volumetric imaging while minimizing the need for extensive data acquisition.

Link: https://arxiv.org/abs/2504.11953
Authors: Daiqi Liu, Fuxin Fan, Andreas Maier
Affiliations: Pattern Recognition Lab, Friedrich-Alexander-Universität Erlangen-Nürnberg
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: 6 pages, 3 figures, 1 table

Click to view abstract

Abstract:X-ray imaging plays a crucial role in the medical field, providing essential insights into the internal anatomy of patients for diagnostics, image-guided procedures, and clinical decision-making. Traditional techniques often require multiple X-ray projections from various angles to obtain a comprehensive view, leading to increased radiation exposure and more complex clinical processes. This paper explores an innovative approach using the DL-GIPS model, which synthesizes X-ray projections from new viewpoints by leveraging a single existing projection. The model strategically manipulates geometry and texture features extracted from an initial projection to match new viewing angles. It then synthesizes the final projection by merging these modified geometry features with consistent texture information through an advanced image generation process. We demonstrate the effectiveness and broad applicability of the DL-GIPS framework through lung imaging examples, highlighting its potential to revolutionize stereoscopic and volumetric imaging by minimizing the need for extensive data acquisition.

[CV-100] TextDiffSeg: Text-guided Latent Diffusion Model for 3d Medical Images Segmentation

Quick Read: This paper addresses the high computational cost of diffusion probabilistic models (DPMs) for 3D medical image segmentation and their inability to fully capture global 3D context. The key is TextDiffSeg, a text-guided conditional diffusion framework that integrates 3D volumetric data with natural language descriptions, enabling cross-modal embedding in a shared semantic space between visual and textual modalities. Innovative label embedding techniques and a cross-modal attention mechanism reduce computational complexity while preserving global 3D contextual integrity. Experiments show TextDiffSeg consistently outperforms existing methods on kidney and pancreas tumor segmentation as well as multi-organ segmentation, with ablation studies confirming the synergy of text fusion, the image feature extractor, and the label encoder. TextDiffSeg thus offers an efficient and accurate solution for 3D medical image segmentation with broad applicability to clinical diagnosis and treatment planning.

Link: https://arxiv.org/abs/2504.11825
Authors: Kangbo Ma
Affiliations: Beijing University of Posts and Telecommunications
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:Diffusion Probabilistic Models (DPMs) have demonstrated significant potential in 3D medical image segmentation tasks. However, their high computational cost and inability to fully capture global 3D contextual information limit their practical applications. To address these challenges, we propose a novel text-guided diffusion model framework, TextDiffSeg. This method leverages a conditional diffusion framework that integrates 3D volumetric data with natural language descriptions, enabling cross-modal embedding and establishing a shared semantic space between visual and textual modalities. By enhancing the model’s ability to recognize complex anatomical structures, TextDiffSeg incorporates innovative label embedding techniques and cross-modal attention mechanisms, effectively reducing computational complexity while preserving global 3D contextual integrity. Experimental results demonstrate that TextDiffSeg consistently outperforms existing methods in segmentation tasks involving kidney and pancreas tumors, as well as multi-organ segmentation scenarios. Ablation studies further validate the effectiveness of key components, highlighting the synergistic interaction between text fusion, image feature extractor, and label encoder. TextDiffSeg provides an efficient and accurate solution for 3D medical image segmentation, showcasing its broad applicability in clinical diagnosis and treatment planning.
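TextDiffSeg builds on the standard forward-diffusion closed form x_t = sqrt(abar_t)·x0 + sqrt(1 − abar_t)·eps; a minimal sketch with an illustrative linear beta schedule (the paper's actual schedule and latent shapes are not given here):

```python
import numpy as np

def q_sample(x0, t, alphas_cumprod, rng):
    """Forward diffusion: x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * eps."""
    abar = alphas_cumprod[t]
    eps = rng.normal(size=x0.shape)
    return np.sqrt(abar) * x0 + np.sqrt(1.0 - abar) * eps

T = 1000
betas = np.linspace(1e-4, 0.02, T)        # illustrative linear schedule
alphas_cumprod = np.cumprod(1.0 - betas)

rng = np.random.default_rng(0)
x0 = rng.normal(size=(8, 8))              # stand-in for a latent volume slice
x_noisy = q_sample(x0, t=999, alphas_cumprod=alphas_cumprod, rng=rng)
print(x_noisy.shape)                      # (8, 8)
```

At large t the signal is almost pure noise; the segmentation model is trained to run this process in reverse, conditioned on the image and text embeddings.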

[CV-101] FACT: Foundation Model for Assessing Cancer Tissue Margins with Mass Spectrometry

Quick Read: This paper tackles tissue-margin classification during cancer surgery under data scarcity. Conventional machine learning models for Rapid Evaporative Ionization Mass Spectrometry (REIMS) data lack labeled samples, limiting clinical use. The key is FACT (Foundation model for Assessing Cancer Tissue margins), an adaptation of a foundation model originally designed for text-audio association, pretrained with a proposed supervised contrastive approach based on triplet loss. Experiments show FACT significantly improves classification over self-supervised and semi-supervised baselines, reaching a state-of-the-art AUROC of 82.4% ± 0.8%, and demonstrate that foundation models with novel adaptation and pretraining can classify REIMS data effectively from limited labeled examples, making real-time surgical margin assessment viable, particularly in data-scarce clinical environments.

Link: https://arxiv.org/abs/2504.11519
Authors: Mohammad Farahmand, Amoon Jamzad, Fahimeh Fooladgar, Laura Connolly, Martin Kaufmann, Kevin Yi Mi Ren, John Rudan, Doug McKay, Gabor Fichtinger, Parvin Mousavi
Affiliations: Unknown
Subjects: Medical Physics (physics.med-ph); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Purpose: Accurately classifying tissue margins during cancer surgeries is crucial for ensuring complete tumor removal. Rapid Evaporative Ionization Mass Spectrometry (REIMS), a tool for real-time intraoperative margin assessment, generates spectra that require machine learning models to support clinical decision-making. However, the scarcity of labeled data in surgical contexts presents a significant challenge. This study is the first to develop a foundation model tailored specifically for REIMS data, addressing this limitation and advancing real-time surgical margin assessment. Methods: We propose FACT, a Foundation model for Assessing Cancer Tissue margins. FACT is an adaptation of a foundation model originally designed for text-audio association, pretrained using our proposed supervised contrastive approach based on triplet loss. An ablation study is performed to compare our proposed model against other models and pretraining methods. Results: Our proposed model significantly improves the classification performance, achieving state-of-the-art performance with an AUROC of 82.4% ± 0.8%. The results demonstrate the advantage of our proposed pretraining method and selected backbone over the self-supervised and semi-supervised baselines and alternative models. Conclusion: Our findings demonstrate that foundation models, adapted and pretrained using our novel approach, can effectively classify REIMS data even with limited labeled examples. This highlights the viability of foundation models for enhancing real-time surgical margin assessment, particularly in data-scarce clinical environments.
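The triplet loss behind the supervised contrastive pretraining has the standard form max(0, d(a, p) − d(a, n) + margin); a minimal sketch with toy 2-D embeddings (not the model's actual embedding space):

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=1.0):
    """max(0, d(a, p) - d(a, n) + margin) with squared Euclidean distance."""
    d_pos = np.sum((anchor - positive) ** 2)
    d_neg = np.sum((anchor - negative) ** 2)
    return max(0.0, d_pos - d_neg + margin)

a = np.array([0.0, 0.0])   # anchor spectrum embedding
p = np.array([0.1, 0.0])   # same tissue class: should sit close
n = np.array([2.0, 0.0])   # different class: should sit far
print(triplet_loss(a, p, n))  # 0.0: this triplet is already well separated
```

Minimizing this loss over labeled triplets pulls same-class spectra together and pushes different-class spectra apart, which is what makes the downstream margin classifier work with few labels.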

[CV-102] Attention GhostUNet: Enhanced Segmentation of Adipose Tissue and Liver in CT Images

Quick Read: This paper targets precise segmentation of abdominal adipose tissue, including subcutaneous adipose tissue (SAT) and visceral adipose tissue (VAT), together with the liver, which is essential for understanding body composition and associated health risks such as type 2 diabetes and cardiovascular disease. The proposed solution is Attention GhostUNet++, a novel deep learning model that incorporates Channel, Spatial, and Depth Attention mechanisms into the Ghost UNet++ bottleneck for automated, precise segmentation. On the AATTCT-IDS and LiTS datasets it achieves Dice coefficients of 0.9430 for VAT, 0.9639 for SAT, and 0.9652 for liver segmentation, surpassing baseline models. Despite minor limitations in boundary-detail segmentation, the model significantly enhances feature refinement, contextual understanding, and computational efficiency, offering a robust solution for body composition analysis.

Link: https://arxiv.org/abs/2504.11491
Authors: Mansoor Hayat, Supavadee Aramvith, Subrata Bhattacharjee, Nouman Ahmad
Affiliations: Chulalongkorn University
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Multimedia (cs.MM)
Comments: Accepted for presentation in the 47th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC 2025)

Click to view abstract

Abstract:Accurate segmentation of abdominal adipose tissue, including subcutaneous (SAT) and visceral adipose tissue (VAT), along with liver segmentation, is essential for understanding body composition and associated health risks such as type 2 diabetes and cardiovascular disease. This study proposes Attention GhostUNet++, a novel deep learning model incorporating Channel, Spatial, and Depth Attention mechanisms into the Ghost UNet++ bottleneck for automated, precise segmentation. Evaluated on the AATTCT-IDS and LiTS datasets, the model achieved Dice coefficients of 0.9430 for VAT, 0.9639 for SAT, and 0.9652 for liver segmentation, surpassing baseline models. Despite minor limitations in boundary detail segmentation, the proposed model significantly enhances feature refinement, contextual understanding, and computational efficiency, offering a robust solution for body composition analysis. The implementation of the proposed Attention GhostUNet++ model is available at: this https URL.
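The Dice coefficients reported above follow the standard overlap definition 2|A∩B| / (|A| + |B|); a minimal reference implementation on toy masks:

```python
import numpy as np

def dice(pred, target, eps=1e-7):
    """Dice coefficient between two binary masks: 2|A∩B| / (|A| + |B|)."""
    pred, target = pred.astype(bool), target.astype(bool)
    inter = np.logical_and(pred, target).sum()
    return (2.0 * inter + eps) / (pred.sum() + target.sum() + eps)

target = np.zeros((16, 16), dtype=np.uint8)
target[4:12, 4:12] = 1                    # ground-truth organ mask
pred = np.zeros_like(target)
pred[4:12, 4:10] = 1                      # prediction missing part of the organ
print(round(float(dice(pred, target)), 4))  # 0.8571
```

The small eps keeps the metric defined when both masks are empty, a common convention in segmentation training loops.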

[CV-103] Deciphering scrolls with tomography: A training experiment

Quick Read: This paper addresses the challenge of virtually recovering and reading severely damaged ancient documents, for which physical unwrapping is impractical and destructive. The key is an experimental setup that uses visible light in place of harmful X-rays, together with a didactic software pipeline that lets students simulate the full acquisition-and-recovery process: virtually reconstructing, via computer vision algorithms, a transparent rolled sheet with printed text on it, thereby acquiring and deciphering the hidden content non-destructively.

Link: https://arxiv.org/abs/2504.11485
Authors: Sonia Foschiatti, Axel Kittenberger, Otmar Scherzer
Affiliations: University of Vienna; Johann Radon Institute for Computational and Applied Mathematics (RICAM), Austrian Academy of Sciences; Computational Science Center, University of Vienna
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:The recovery of severely damaged ancient written documents has proven to be a major challenge for many scientists, mainly due to the impracticality of physical unwrapping them. Non-destructive techniques, such as X-ray computed tomography (CT), combined with computer vision algorithms, have emerged as a means of facilitating the virtual reading of the hidden contents of the damaged documents. This paper proposes an educational laboratory aimed at simulating the entire process of acquisition and virtual recovery of the ancient works. We have developed an experimental setup that uses visible light to replace the detrimental X-rays, and a didactic software pipeline that allows students to virtually reconstruct a transparent rolled sheet with printed text on it, the wrapped scroll.

[CV-104] Local Temporal Feature Enhanced Transformer with ROI-rank Based Masking for Diagnosis of ADHD

Quick Read: This paper addresses the extraction of effective spatiotemporal brain biomarkers for diagnosing Attention-Deficit/Hyperactivity Disorder (ADHD). The key is a transformer-based diagnosis model that learns both spatiotemporal individual features of brain regions and whole-brain attention-structure correlations specialized for ADHD, focusing on local blood oxygenation level dependent (BOLD) signals and the identification of important regions of interest (ROIs). Three methods are central: a CNN-based embedding block, reconstructed from prior CNN-based ADHD models, that yields more expressive embeddings for brain-region attention; local temporal attention that learns local BOLD signal features with simple window masking over the temporal dimension of fMRI; and ROI-rank based masking that distinguishes highly correlated ROIs by attention score, providing more specific biomarkers for ADHD diagnosis. Experiments show the proposed spatiotemporally enhanced transformer outperforms other transformer variants for ADHD diagnosis.

Link: https://arxiv.org/abs/2504.11474
Authors: Byunggun Kim, Younghun Kwon
Affiliations: Hanyang University, Ansan, Kyunggi-Do, 425-791, Republic of Korea
Subjects: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:In modern society, Attention-Deficit/Hyperactivity Disorder (ADHD) is one of the common mental diseases discovered not only in children but also in adults. In this context, we propose a ADHD diagnosis transformer model that can effectively simultaneously find important brain spatiotemporal biomarkers from resting-state functional magnetic resonance (rs-fMRI). This model not only learns spatiotemporal individual features but also learns the correlation with full attention structures specialized in ADHD diagnosis. In particular, it focuses on learning local blood oxygenation level dependent (BOLD) signals and distinguishing important regions of interest (ROI) in the brain. Specifically, the three proposed methods for ADHD diagnosis transformer are as follows. First, we design a CNN-based embedding block to obtain more expressive embedding features in brain region attention. It is reconstructed based on the previously CNN-based ADHD diagnosis models for the transformer. Next, for individual spatiotemporal feature attention, we change the attention method to local temporal attention and ROI-rank based masking. For the temporal features of fMRI, the local temporal attention enables to learn local BOLD signal features with only simple window masking. For the spatial feature of fMRI, ROI-rank based masking can distinguish ROIs with high correlation in ROI relationships based on attention scores, thereby providing a more specific biomarker for ADHD diagnosis. The experiment was conducted with various types of transformer models. To evaluate these models, we collected the data from 939 individuals from all sites provided by the ADHD-200 competition. Through this, the spatiotemporal enhanced transformer for ADHD diagnosis outperforms the performance of other different types of transformer variants. (ACC 77.78, SPE 76.60, SEN 79.22, AUC 79.30)
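The "simple window masking" idea behind local temporal attention can be sketched directly; the sequence length and window size below are illustrative, not the paper's settings:

```python
import numpy as np

def local_attention_mask(seq_len, window):
    """Boolean mask: each timestep may attend only to neighbours within
    +/- `window` positions, i.e. simple window masking over BOLD frames."""
    idx = np.arange(seq_len)
    return np.abs(idx[:, None] - idx[None, :]) <= window

mask = local_attention_mask(seq_len=6, window=1)
scores = np.zeros((6, 6))              # stand-in for query-key logits
scores[~mask] = -np.inf                # forbid attention outside the window
weights = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)
print(mask.astype(int)[0].tolist())    # [1, 1, 0, 0, 0, 0]
```

After the softmax, each frame's attention weight is spread only over its temporal neighbours, which is how the model is restricted to local BOLD dynamics.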

[CV-105] Do Segmentation Models Understand Vascular Structure? A Blob-Based XAI Framework

Quick Read: This paper asks whether segmentation models, whose black-box nature limits clinical adoption, actually exploit global anatomical structure (such as vessel connectivity and branching) alongside local image cues in vascular segmentation. The key is a novel explainability pipeline for 3D vessel segmentation that combines gradient-based attribution, graph-guided point selection, and a blob-based analysis of saliency maps at points of interest (POIs): anatomically meaningful POIs are defined from vascular graphs extracted from ground truth, and a custom blob detector analyzes the saliency maps at both global and local scales. Applied to the IRCAD and Bullitt datasets, the analysis shows that model decisions are dominated by highly localized attribution blobs centered near POIs, with little correlation to global vessel-level properties such as thickness, tubularity, or connectivity, indicating limited global anatomical reasoning in current segmentation models.

Link: https://arxiv.org/abs/2504.11469
Authors: Guillaume Garret, Antoine Vacavant, Carole Frindel
Affiliations: Unknown
Subjects: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: Open access version of an article submitted to Medical Image Understanding and Analysis (MIUA) 2025

Click to view abstract

Abstract:Deep learning models have achieved impressive performance in medical image segmentation, yet their black-box nature limits clinical adoption. In vascular applications, trustworthy segmentation should rely on both local image cues and global anatomical structures, such as vessel connectivity or branching. However, the extent to which models leverage such global context remains unclear. We present a novel explainability pipeline for 3D vessel segmentation, combining gradient-based attribution with graph-guided point selection and a blob-based analysis of Saliency maps. Using vascular graphs extracted from ground truth, we define anatomically meaningful points of interest (POIs) and assess the contribution of input voxels via Saliency maps. These are analyzed at both global and local scales using a custom blob detector. Applied to IRCAD and Bullitt datasets, our analysis shows that model decisions are dominated by highly localized attribution blobs centered near POIs. Attribution features show little correlation with vessel-level properties such as thickness, tubularity, or connectivity – suggesting limited use of global anatomical reasoning. Our results underline the importance of structured explainability tools and highlight the current limitations of segmentation models in capturing global vascular context.

Artificial Intelligence

[AI-0] HLS-Eval: A Benchmark and Framework for Evaluating LLMs on High-Level Synthesis Design Tasks

[Quick Read]: This paper addresses the lack of benchmarks and tooling for evaluating large language models (LLMs) on high-level synthesis (HLS) design tasks. Prior work has mostly evaluated LLMs on hardware description languages (HDLs) such as Verilog, yet designers increasingly use HLS to build domain-specific accelerators and complex hardware systems. The proposed HLS-Eval is the first complete benchmark and evaluation framework for LLM-driven HLS design. Its key elements are two core tasks, generating HLS code from natural-language descriptions and performing HLS-specific code edits to optimize performance and hardware efficiency, together with a benchmark of 94 unique designs, each prepared so that the task is "LLM-ready". Beyond the benchmark, HLS-Eval provides a modular Python framework for automated, parallel evaluation of both local and hosted LLMs.

Link: https://arxiv.org/abs/2504.12268
Authors: Stefan Abi-Karam, Cong Hao
Institution: Unknown
Subjects: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI)
Comments:

Abstract:The rapid scaling of large language model (LLM) training and inference has driven their adoption in semiconductor design across academia and industry. While most prior work evaluates LLMs on hardware description language (HDL) tasks, particularly Verilog, designers are increasingly using high-level synthesis (HLS) to build domain-specific accelerators and complex hardware systems. However, benchmarks and tooling to comprehensively evaluate LLMs for HLS design tasks remain scarce. To address this, we introduce HLS-Eval, the first complete benchmark and evaluation framework for LLM-driven HLS design. HLS-Eval targets two core tasks: (1) generating HLS code from natural language descriptions, and (2) performing HLS-specific code edits to optimize performance and hardware efficiency. The benchmark includes 94 unique designs drawn from standard HLS benchmarks and novel sources. Each case is prepared via a semi-automated flow that produces a natural language description and a paired testbench for C-simulation and synthesis validation, ensuring each task is "LLM-ready." Beyond the benchmark, HLS-Eval offers a modular Python framework for automated, parallel evaluation of both local and hosted LLMs. It includes a parallel evaluation engine, direct HLS tool integration, and abstractions to support different LLM interaction paradigms, enabling rapid prototyping of new benchmarks, tasks, and LLM methods. We demonstrate HLS-Eval through baseline evaluations of open-source LLMs on Vitis HLS, measuring outputs across four key metrics - parseability, compilability, runnability, and synthesizability - reflecting the iterative HLS design cycle. We also report pass@k metrics, establishing clear baselines and reusable infrastructure for the broader LLM-for-hardware community. All benchmarks, framework code, and results are open-sourced at this https URL.

[AI-1] SCENT: Robust Spatiotemporal Learning for Continuous Scientific Data via Scalable Conditioned Neural Fields

[Quick Read]: This paper tackles the combined challenges of spatiotemporal learning: the intricate interplay between spatial and temporal dependencies, high data dimensionality, and scalability constraints, all amplified in scientific domains where data is often irregularly distributed (e.g., missing values from sensor failures) and high-volume (e.g., high-fidelity simulations). The proposed framework, SCENT, unifies interpolation, reconstruction, and forecasting on a transformer-based encoder-processor-decoder backbone, introducing learnable queries to improve generalization and a query-wise cross-attention mechanism to effectively capture multi-scale dependencies. To remain scalable in both data size and model complexity, SCENT adopts a sparse attention mechanism, enabling flexible output representations and efficient evaluation at arbitrary resolutions. Experiments show state-of-the-art performance on multiple challenging tasks along with superior scalability.

Link: https://arxiv.org/abs/2504.12262
Authors: David Keetae Park, Xihaier Luo, Guang Zhao, Seungjun Lee, Miruna Oprescu, Shinjae Yoo
Institution: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 25 pages, 5 main figures, 3 tables, under review

Abstract:Spatiotemporal learning is challenging due to the intricate interplay between spatial and temporal dependencies, the high dimensionality of the data, and scalability constraints. These challenges are further amplified in scientific domains, where data is often irregularly distributed (e.g., missing values from sensor failures) and high-volume (e.g., high-fidelity simulations), posing additional computational and modeling difficulties. In this paper, we present SCENT, a novel framework for scalable and continuity-informed spatiotemporal representation learning. SCENT unifies interpolation, reconstruction, and forecasting within a single architecture. Built on a transformer-based encoder-processor-decoder backbone, SCENT introduces learnable queries to enhance generalization and a query-wise cross-attention mechanism to effectively capture multi-scale dependencies. To ensure scalability in both data size and model complexity, we incorporate a sparse attention mechanism, enabling flexible output representations and efficient evaluation at arbitrary resolutions. We validate SCENT through extensive simulations and real-world experiments, demonstrating state-of-the-art performance across multiple challenging tasks while achieving superior scalability.

[AI-2] Communication Optimization for Decentralized Learning atop Bandwidth-limited Edge Networks

[Quick Read]: This paper addresses the severe performance challenges of running decentralized federated learning (DFL) over multi-hop, bandwidth-limited networks. Most existing solutions rely on simplified communication models that cannot capture the communication demands of learning over such networks. The key idea is to jointly design the communication scheme of the overlay network formed by the agents and the mixing matrix that controls the communication demands between agents. By carefully analyzing the problem's properties, each design problem is cast as a tractable optimization, and efficient algorithms with performance guarantees are developed. Evaluations on real topologies and data show that the proposed algorithm reduces total training time by over 80% compared to the baseline at the same accuracy, while substantially improving computational efficiency over the state of the art.
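For context on the mixing matrix mentioned above: a standard baseline construction in decentralized learning is the Metropolis-Hastings rule, which this paper's jointly optimized design competes with. A minimal sketch (the 4-agent ring topology is an assumption for illustration):

```python
# Metropolis-Hastings mixing matrix: symmetric and doubly stochastic,
# built from the communication topology's adjacency matrix.
import numpy as np

def metropolis_mixing(adj):
    """Mixing matrix W from a 0/1 adjacency matrix (no self-loops in adj)."""
    n = adj.shape[0]
    deg = adj.sum(axis=1)
    W = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if i != j and adj[i, j]:
                W[i, j] = 1.0 / (1.0 + max(deg[i], deg[j]))
        W[i, i] = 1.0 - W[i].sum()  # remaining mass stays at agent i
    return W

# A 4-agent ring: each agent exchanges parameters with two neighbors.
adj = np.array([[0, 1, 0, 1],
                [1, 0, 1, 0],
                [0, 1, 0, 1],
                [1, 0, 1, 0]])
W = metropolis_mixing(adj)
```

Doubly stochastic, symmetric mixing guarantees consensus-style averaging; the paper goes further by shaping the matrix (and the overlay routing) to respect multi-hop bandwidth limits.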

Link: https://arxiv.org/abs/2504.12210
Authors: Tingyang Sun, Tuan Nguyen, Ting He
Institution: Unknown
Subjects: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
Comments: arXiv admin note: text overlap with arXiv:2408.04705

Abstract:Decentralized federated learning (DFL) is a promising machine learning paradigm for bringing artificial intelligence (AI) capabilities to the network edge. Running DFL on top of edge networks, however, faces severe performance challenges due to the extensive parameter exchanges between agents. Most existing solutions for these challenges were based on simplistic communication models, which cannot capture the case of learning over a multi-hop bandwidth-limited network. In this work, we address this problem by jointly designing the communication scheme for the overlay network formed by the agents and the mixing matrix that controls the communication demands between the agents. By carefully analyzing the properties of our problem, we cast each design problem into a tractable optimization and develop an efficient algorithm with guaranteed performance. Our evaluations based on real topology and data show that the proposed algorithm can reduce the total training time by over 80% compared to the baseline without sacrificing accuracy, while significantly improving the computational efficiency over the state of the art.

[AI-3] From Requirements to Architecture: Semi-Automatically Generating Software Architectures

[Quick Read]: This paper targets the heavy workload architects face in the traditional architecture-creation process by leveraging the evolving capabilities of large language models (LLMs). The key is close collaboration between the architect and LLM-fueled tooling throughout the whole process, from domain-model creation and use-case specification to architectural decisions and evaluation. The approach gives the architect complete control over the process and its results while encouraging adherence to the intended process for maximum tooling support; preliminary results suggest the process is feasible and saves the architect substantial time.

Link: https://arxiv.org/abs/2504.12192
Authors: Tobias Eisenreich
Institution: Unknown
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: to be published in EMISA 2025

Abstract:To support junior and senior architects, I propose developing a new architecture creation method that leverages LLMs’ evolving capabilities to support the architect. This method involves the architect’s close collaboration with LLM-fueled tooling over the whole process. The architect is guided through Domain Model creation, Use Case specification, architectural decisions, and architecture evaluation. While the architect can take complete control of the process and the results, and use the tooling as a building set, they can follow the intended process for maximum tooling support. The preliminary results suggest the feasibility of this process and indicate major time savings for the architect.

[AI-4] Towards Explainable Fusion and Balanced Learning in Multimodal Sentiment Analysis

[Quick Read]: This paper addresses two key challenges in multimodal sentiment analysis (MSA): the lack of interpretability in the decision logic of multimodal fusion, and modality imbalance caused by disparities in inter-modal information density. The proposed framework, KAN-MCP, combines the interpretability of Kolmogorov-Arnold Networks (KAN) with the robustness of the Multimodal Clean Pareto (MCPareto) framework. First, KAN's univariate function decomposition enables transparent analysis of cross-modal interactions; this structural design allows direct inspection of feature transformations without external interpretation tools, ensuring both high expressiveness and interpretability. Second, MCPareto improves robustness against modality imbalance and noise by introducing the Dimensionality Reduction and Denoising Modal Information Bottleneck (DRD-MIB) method, which jointly denoises features and reduces their dimensionality, giving KAN discriminative low-dimensional inputs that lower its modeling complexity while preserving sentiment-relevant information. MCPareto further balances gradient contributions across modalities dynamically using the purified features from DRD-MIB, effectively alleviating modality imbalance. This synergy of interpretability and robustness yields strong performance on benchmarks such as CMU-MOSI, CMU-MOSEI, and CH-SIMS v2, along with an intuitive visualization interface through KAN's interpretable architecture.

Link: https://arxiv.org/abs/2504.12151
Authors: Miaosen Luo, Yuncheng Jiang, Sijie Mai
Institution: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Multimodal Sentiment Analysis (MSA) faces two critical challenges: the lack of interpretability in the decision logic of multimodal fusion and modality imbalance caused by disparities in inter-modal information density. To address these issues, we propose KAN-MCP, a novel framework that integrates the interpretability of Kolmogorov-Arnold Networks (KAN) with the robustness of the Multimodal Clean Pareto (MCPareto) framework. First, KAN leverages its univariate function decomposition to achieve transparent analysis of cross-modal interactions. This structural design allows direct inspection of feature transformations without relying on external interpretation tools, thereby ensuring both high expressiveness and interpretability. Second, the proposed MCPareto enhances robustness by addressing modality imbalance and noise interference. Specifically, we introduce the Dimensionality Reduction and Denoising Modal Information Bottleneck (DRD-MIB) method, which jointly denoises and reduces feature dimensionality. This approach provides KAN with discriminative low-dimensional inputs to reduce the modeling complexity of KAN while preserving critical sentiment-related information. Furthermore, MCPareto dynamically balances gradient contributions across modalities using the purified features output by DRD-MIB, ensuring lossless transmission of auxiliary signals and effectively alleviating modality imbalance. This synergy of interpretability and robustness not only achieves superior performance on benchmark datasets such as CMU-MOSI, CMU-MOSEI, and CH-SIMS v2 but also offers an intuitive visualization interface through KAN’s interpretable architecture.

[AI-5] ARCeR: An Agentic RAG for the Automated Definition of Cyber Ranges

[Quick Read]: This paper addresses the problem of creating realistic virtualized IT environments (Cyber Ranges, CRs) in cybersecurity to support threat analysis, validation of countermeasures, and skills training. The proposed solution, ARCeR, is a novel tool that automatically generates and deploys CRs from natural-language descriptions provided by the user. Its key strength is the Agentic RAG paradigm, which lets it fully exploit state-of-the-art AI techniques: it succeeds in cases that plain LLMs or basic RAG systems cannot handle, and it can target any CR framework provided the corresponding framework-specific knowledge is made available to it.

Link: https://arxiv.org/abs/2504.12143
Authors: Matteo Lupinacci, Francesco Blefari, Francesco Romeo, Francesco Aurelio Pironti, Angelo Furfaro
Institution: Unknown
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments:

Abstract:The growing and evolving landscape of cybersecurity threats necessitates the development of supporting tools and platforms that allow for the creation of realistic IT environments operating within virtual, controlled settings as Cyber Ranges (CRs). CRs can be exploited for analyzing vulnerabilities and experimenting with the effectiveness of devised countermeasures, as well as serving as training environments for building cyber security skills and abilities for IT operators. This paper proposes ARCeR as an innovative solution for the automatic generation and deployment of CRs, starting from user-provided descriptions in a natural language. ARCeR relies on the Agentic RAG paradigm, which allows it to fully exploit state-of-art AI technologies. Experimental results show that ARCeR is able to successfully process prompts even in cases that LLMs or basic RAG systems are not able to cope with. Furthermore, ARCeR is able to target any CR framework provided that specific knowledge is made available to it.

[AI-6] Towards LLM Agents for Earth Observation

[Quick Read]: The question this paper investigates is whether current AI systems are ready for reliable Earth Observation (EO), and what capabilities and challenges automating EO entails. The authors build \datasetnamenospace, a benchmark of 140 yes/no questions drawn from NASA Earth Observatory articles covering 13 topics and 17 satellite sensors, and find that LLM-based agents using the Google Earth Engine API achieve only 33% accuracy because their code fails to run more than 58% of the time.

The key to the solution is fine-tuning on synthetic data, which markedly reduces the run-failure rate and allows much smaller models (such as Llama-3.1-8B) to approach the accuracy of far larger ones (such as DeepSeek-R1). This points to a practical path for improving AI agents on EO tasks and identifies the core challenges that remain before agents can automate Earth observation.

Link: https://arxiv.org/abs/2504.12110
Authors: Chia Hsiang Kao, Wenting Zhao, Shreelekha Revankar, Samuel Speas, Snehal Bhagat, Rajeev Datta, Cheng Perng Phoo, Utkarsh Mall, Carl Vondrick, Kavita Bala, Bharath Hariharan
Institution: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: 36 pages

Abstract:Earth Observation (EO) provides critical planetary data for environmental monitoring, disaster management, climate science, and other scientific domains. Here we ask: Are AI systems ready for reliable Earth Observation? We introduce \datasetnamenospace, a benchmark of 140 yes/no questions from NASA Earth Observatory articles across 13 topics and 17 satellite sensors. Using Google Earth Engine API as a tool, LLM agents can only achieve an accuracy of 33% because the code fails to run over 58% of the time. We improve the failure rate for open models by fine-tuning synthetic data, allowing much smaller models (Llama-3.1-8B) to achieve comparable accuracy to much larger ones (e.g., DeepSeek-R1). Taken together, our findings identify significant challenges to be solved before AI agents can automate earth observation, and suggest paths forward. The project page is available at this https URL.

[AI-7] Reasoning-Based AI for Startup Evaluation (R.A.I.S.E.): A Memory-Augmented Multi-Step Decision Framework

[Quick Read]: This paper aims to bridge the gap between the interpretability of decision trees and the advanced reasoning capabilities of large language models (LLMs) in order to predict startup success. The key is a novel framework that uses chain-of-thought prompting to generate detailed reasoning logs, which are then distilled into structured, human-understandable logical rules. The pipeline integrates several enhancements (efficient data ingestion, a two-step refinement process, ensemble candidate sampling, simulated reinforcement-learning scoring, and persistent memory) to ensure stable decision-making and transparent output. Experiments show that, compared with a standalone OpenAI o3 model, the combined pipeline improves precision by 54% (from 0.225 to 0.346) and accuracy by 50% (from 0.46 to 0.70), and its precision is more than twice that of a random classifier (16%). By pairing state-of-the-art AI reasoning with explicit rule-based explanations, the method not only augments traditional decision processes but also enables expert intervention and continuous policy refinement, laying the groundwork for interpretable LLM-powered decision frameworks in high-stakes investment settings and other domains that require transparent, data-driven insights.

Link: https://arxiv.org/abs/2504.12090
Authors: Jack Preuveneers, Joseph Ternasky, Fuat Alican, Yigit Ihlamur
Institution: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:

Abstract:We present a novel framework that bridges the gap between the interpretability of decision trees and the advanced reasoning capabilities of large language models (LLMs) to predict startup success. Our approach leverages chain-of-thought prompting to generate detailed reasoning logs, which are subsequently distilled into structured, human-understandable logical rules. The pipeline integrates multiple enhancements - efficient data ingestion, a two-step refinement process, ensemble candidate sampling, simulated reinforcement learning scoring, and persistent memory - to ensure both stable decision-making and transparent output. Experimental evaluations on curated startup datasets demonstrate that our combined pipeline improves precision by 54% from 0.225 to 0.346 and accuracy by 50% from 0.46 to 0.70 compared to a standalone OpenAI o3 model. Notably, our model achieves over 2x the precision of a random classifier (16%). By combining state-of-the-art AI reasoning with explicit rule-based explanations, our method not only augments traditional decision-making processes but also facilitates expert intervention and continuous policy refinement. This work lays the foundation for the implementation of interpretable LLM-powered decision frameworks in high-stakes investment environments and other domains that require transparent and data-driven insights.

[AI-8] Optimizing Compound Retrieval Systems SIGIR2025

[Quick Read]: This paper addresses the challenge modern retrieval systems face in balancing ranking quality against computational cost, and the limitations of the prevailing cascading multi-model approach. It proposes the broader concept of compound retrieval systems, which encompass cascades but also allow other forms of model interaction, such as using large language models (LLMs) for relative relevance comparisons. The key is optimizing the design of the compound system: learning where to apply each component model and how to aggregate their predictions into a final ranking. The work shows how this compound approach can combine the classic BM25 retrieval model with state-of-the-art (pairwise) LLM relevance predictions while optimizing a given ranking metric and an efficiency target. Experiments show that optimized compound retrieval systems achieve better effectiveness-efficiency trade-offs than cascading approaches, even when applied in a self-supervised manner, inviting the information retrieval community to think more broadly about how prediction models can interact to form rankings.
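To make the "compound" idea concrete, here is a minimal sketch in which cheap BM25 scores every document and an expensive pairwise judge is spent only on the BM25 top-k. The token-overlap heuristic below is a hypothetical stand-in for an LLM pairwise comparator, and the fixed top-k policy is an assumption; the paper instead *learns* where to apply each component:

```python
# Two-component compound ranking: BM25 everywhere, pairwise judge on the head.
import math
from collections import Counter

docs = ["sparse retrieval with bm25",
        "dense neural retrieval models",
        "llm based relevance judgments for retrieval",
        "cooking pasta at home"]
query = "llm relevance retrieval"

def bm25_scores(query, docs, k1=1.5, b=0.75):
    toks = [d.split() for d in docs]
    avgdl = sum(len(t) for t in toks) / len(toks)
    df = Counter(w for t in toks for w in set(t))
    N = len(docs)
    scores = []
    for t in toks:
        tf = Counter(t)
        s = 0.0
        for w in query.split():
            if w not in tf:
                continue
            idf = math.log(1 + (N - df[w] + 0.5) / (df[w] + 0.5))
            s += idf * tf[w] * (k1 + 1) / (tf[w] + k1 * (1 - b + b * len(t) / avgdl))
        scores.append(s)
    return scores

def pairwise_prefers(query, a, b):
    """Hypothetical stand-in for an LLM pairwise judge: query-token overlap."""
    qa = len(set(query.split()) & set(a.split()))
    qb = len(set(query.split()) & set(b.split()))
    return qa >= qb

def compound_rank(query, docs, k=3):
    scores = bm25_scores(query, docs)
    order = sorted(range(len(docs)), key=lambda i: -scores[i])
    head, tail = order[:k], order[k:]
    # Re-rank only the head via pairwise wins (a tiny round-robin tournament).
    wins = {i: sum(pairwise_prefers(query, docs[i], docs[j])
                   for j in head if j != i) for i in head}
    head.sort(key=lambda i: -wins[i])
    return head + tail

ranking = compound_rank(query, docs)
```

The expensive component touches only k documents, which is the effectiveness-efficiency trade-off the paper optimizes over.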

Link: https://arxiv.org/abs/2504.12063
Authors: Harrie Oosterhuis, Rolf Jagerman, Zhen Qin, Xuanhui Wang
Institution: Unknown
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: SIGIR 2025

Abstract:Modern retrieval systems do not rely on a single ranking model to construct their rankings. Instead, they generally take a cascading approach where a sequence of ranking models are applied in multiple re-ranking stages. Thereby, they balance the quality of the top-K ranking with computational costs by limiting the number of documents each model re-ranks. However, the cascading approach is not the only way models can interact to form a retrieval system. We propose the concept of compound retrieval systems as a broader class of retrieval systems that apply multiple prediction models. This encapsulates cascading models but also allows other types of interactions than top-K re-ranking. In particular, we enable interactions with large language models (LLMs) which can provide relative relevance comparisons. We focus on the optimization of compound retrieval system design which uniquely involves learning where to apply the component models and how to aggregate their predictions into a final ranking. This work shows how our compound approach can combine the classic BM25 retrieval model with state-of-the-art (pairwise) LLM relevance predictions, while optimizing a given ranking metric and efficiency target. Our experimental results show optimized compound retrieval systems provide better trade-offs between effectiveness and efficiency than cascading approaches, even when applied in a self-supervised manner. With the introduction of compound retrieval systems, we hope to inspire the information retrieval field to more out-of-the-box thinking on how prediction models can interact to form rankings. 

[AI-9] Proof-Carrying Neuro-Symbolic Code

[Quick Read]: This invited paper asks how to build trustworthy and interpretable neuro-symbolic code ("proof-carrying neuro-symbolic code"). The key is to combine the expressive power of neural networks with the rigor of formal methods for symbolic reasoning: by carrying proof information with the code, the correctness of its behavior becomes verifiable, addressing the inherent opacity of neural networks while mitigating the computational bottlenecks that formal methods face on complex networks. The paper outlines the first successes and challenges of this new research area.

Link: https://arxiv.org/abs/2504.12031
Authors: Ekaterina Komendantskaya
Institution: Unknown
Subjects: Programming Languages (cs.PL); Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO)
Comments: Invited paper at CiE 2025. arXiv admin note: text overlap with arXiv:2501.05867

Abstract:This invited paper introduces the concept of “proof-carrying neuro-symbolic code” and explains its meaning and value, from both the “neural” and the “symbolic” perspectives. The talk outlines the first successes and challenges that this new area of research faces.

[AI-10] Purposefully Induced Psychosis (PIP): Embracing Hallucination as Imagination in Large Language Models

[Quick Read]: This paper reframes "hallucinations" in large language models (LLMs), treating them not as errors but as a source of computational imagination. It introduces Purposefully Induced Psychosis (PIP), a method that amplifies hallucinated outputs for imaginative tasks where factual accuracy is not the chief objective, such as speculative fiction, interactive storytelling, and mixed-reality simulation. The key is fine-tuning LLMs to encourage speculative, metaphorical, and surreal outputs, and situating these "mistakes" in contexts where users willingly suspend disbelief, turning errors into catalysts for new ways of thinking. The core contribution is this re-valuation of hallucination combined with consent-based design principles, along with a discussion of applications, preliminary observations, and implications for broader AI ethics and human-AI collaboration.

Link: https://arxiv.org/abs/2504.12012
Authors: Kris Pilcher, Esen K. Tütüncü
Institution: Unknown
Subjects: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments: 5 pages, 3 figures

Abstract:Hallucinations in Large Language Models (LLMs) are widely regarded as errors - outputs that deviate from factual accuracy. However, in creative or exploratory contexts, these “mistakes” may represent unexpected avenues for innovation. We introduce Purposefully Induced Psychosis (PIP), a novel approach that amplifies LLM hallucinations for imaginative tasks such as speculative fiction, interactive storytelling, and mixed-reality simulations. Drawing on Herman Melville’s Moby-Dick, where Pip’s “madness” reveals profound insight, we reframe hallucinations as a source of computational imagination rather than a flaw. Our method fine-tunes LLMs to encourage speculative, metaphorical, and surreal outputs - hallucinations that are useful when factual accuracy is not the chief objective. Inspired by the consensual illusions of theater and stage magic, PIP situates these creative missteps in contexts where users willingly suspend disbelief, thereby transforming “errors” into catalysts for new ways of thinking. We discuss potential applications, design principles for ensuring user consent, preliminary observations, and implications for broader AI ethics and human-AI collaboration.

[AI-11] Balancing Graph Embedding Smoothness in Self-Supervised Learning via Information-Theoretic Decomposition WWW

[Quick Read]: This paper addresses the problem that existing graph self-supervised learning (SSL) methods fail to reflect essential graph properties, especially the similarity between a node's representation and those of its neighbors. The authors observe that current methods sit at opposite ends of a spectrum driven by graph embedding smoothness, with each end excelling on particular downstream tasks but lacking generality. Decomposing the SSL objective into three terms (a neighbor loss, a minimal loss, and a divergence loss) via an information-theoretic framework reveals that this polarization stems from an imbalance among the terms that existing methods fail to maintain. The proposed framework, BSG (Balancing Smoothness in Graph SSL), introduces novel loss functions that supplement representation quality by balancing the contributions of the three terms. Theoretical analysis highlights why this balance improves performance across a wider range of downstream tasks, and extensive experiments on node classification and link prediction over multiple real-world datasets show that BSG achieves state-of-the-art results.

Link: https://arxiv.org/abs/2504.12011
Authors: Heesoo Jung, Hogun Park
Institution: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted to the Web Conference (WWW) 2025

Abstract:Self-supervised learning (SSL) in graphs has garnered significant attention, particularly in employing Graph Neural Networks (GNNs) with pretext tasks initially designed for other domains, such as contrastive learning and feature reconstruction. However, it remains uncertain whether these methods effectively reflect essential graph properties, precisely representation similarity with its neighbors. We observe that existing methods position opposite ends of a spectrum driven by the graph embedding smoothness, with each end corresponding to outperformance on specific downstream tasks. Decomposing the SSL objective into three terms via an information-theoretic framework with a neighbor representation variable reveals that this polarization stems from an imbalance among the terms, which existing methods may not effectively maintain. Further insights suggest that balancing between the extremes can lead to improved performance across a wider range of downstream tasks. A framework, BSG (Balancing Smoothness in Graph SSL), introduces novel loss functions designed to supplement the representation quality in graph-based SSL by balancing the derived three terms: neighbor loss, minimal loss, and divergence loss. We present a theoretical analysis of the effects of these loss functions, highlighting their significance from both the SSL and graph smoothness perspectives. Extensive experiments on multiple real-world datasets across node classification and link prediction consistently demonstrate that BSG achieves state-of-the-art performance, outperforming existing methods. Our implementation code is available at this https URL.

[AI-12] Generative Recommendation with Continuous-Token Diffusion

[Quick Read]: This paper addresses the limitations of conventional LLM-based recommender systems (RecSys) that represent complex user-item interactions in a discrete space: information is compressed during discretization, and tokenization and generation for the vast number of real-world users and items are constrained by a limited vocabulary. The proposed framework, DeftRec, introduces a denoising diffusion model so that LLM-based RecSys can seamlessly take continuous tokens as both input and target. Concretely, a robust tokenizer with a masking operation and an additive K-way architecture indexes users and items, capturing their complex collaborative relationships as continuous tokens; a denoising diffusion model then processes user preferences in the continuous domain, conditioned on reasoning content from a pre-trained LLM, with the denoising objective reformulated to include negative interactions for a comprehensive view of user preferences. Finally, given a continuous token as output, recommendations are generated through score-based retrieval. Extensive experiments confirm the method's effectiveness, with DeftRec surpassing both traditional and emerging LLM-based baselines.

Link: https://arxiv.org/abs/2504.12007
Authors: Haohao Qu, Wenqi Fan, Shanru Lin
Institution: Unknown
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments:

Abstract:In recent years, there has been a significant trend toward using large language model (LLM)-based recommender systems (RecSys). Current research primarily focuses on representing complex user-item interactions within a discrete space to align with the inherent discrete nature of language models. However, this approach faces limitations due to its discrete nature: (i) information is often compressed during discretization; (ii) the tokenization and generation for the vast number of users and items in real-world scenarios are constrained by a limited vocabulary. Embracing continuous data presents a promising alternative to enhance expressive capabilities, though this approach is still in its early stages. To address this gap, we propose a novel framework, DeftRec, which incorporates denoising diffusion models to enable LLM-based RecSys to seamlessly support continuous tokens as input and target. First, we introduce a robust tokenizer with a masking operation and an additive K-way architecture to index users and items, capturing their complex collaborative relationships into continuous tokens. Crucially, we develop a denoising diffusion model to process user preferences within continuous domains by conditioning on reasoning content from pre-trained large language model. During the denoising process, we reformulate the objective to include negative interactions, building a comprehensive understanding of user preferences for effective and accurate recommendation generation. Finally, given a continuous token as output, recommendations can be easily generated through score-based retrieval. Extensive experiments demonstrate the effectiveness of the proposed methods, showing that DeftRec surpasses competitive benchmarks, including both traditional and emerging LLM-based RecSys.

[AI-13] A Computationally Efficient Algorithm for Infinite-Horizon Average-Reward Linear MDPs

[Quick Read]: This paper studies reinforcement learning for linear Markov decision processes (linear MDPs) in the infinite-horizon average-reward setting. Prior work approximates the average-reward setting by a discounted one and uses a value-iteration-based algorithm whose clipping operation constrains the span of the value function for better statistical efficiency. However, the conventional clipping procedure requires computing the minimum of the value function over the entire state space, which is infeasible when the state space is large or even infinite. The key contribution is a value-iteration method with an efficient clipping operation that only computes the minimum of the value function over the set of states the algorithm has visited. The algorithm retains the regret bound of prior work while being computationally efficient, with complexity independent of the size of the state space.
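The clipping idea can be sketched in a tabular toy problem (the paper's linear-MDP machinery is omitted; the random MDP, span cap, and visited set below are assumptions made for illustration): instead of clipping against the minimum of V over all states, clip against the minimum over visited states only.

```python
# Toy value iteration with span clipping restricted to visited states.
import numpy as np

rng = np.random.default_rng(0)
S, A, gamma, span_cap = 6, 2, 0.95, 2.0
P = rng.dirichlet(np.ones(S), size=(S, A))   # P[s, a] is a next-state distribution
R = rng.uniform(0, 1, size=(S, A))

visited = {0, 1, 2}                          # states the agent has actually seen
V = np.zeros(S)
for _ in range(200):
    Q = R + gamma * P @ V                    # Bellman backup, shape (S, A)
    V = Q.max(axis=1)
    v_min = min(V[s] for s in visited)       # minimum over visited states only
    V = np.minimum(V, v_min + span_cap)      # clip the span of V

span_visited = max(V[s] for s in visited) - min(V[s] for s in visited)
```

Because the minimum is taken over a finite visited set rather than the full (possibly infinite) state space, each clipping step stays cheap.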

Link: https://arxiv.org/abs/2504.11997
Authors: Kihyuk Hong, Ambuj Tewari
Institution: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:We study reinforcement learning in infinite-horizon average-reward settings with linear MDPs. Previous work addresses this problem by approximating the average-reward setting by discounted setting and employing a value iteration-based algorithm that uses clipping to constrain the span of the value function for improved statistical efficiency. However, the clipping procedure requires computing the minimum of the value function over the entire state space, which is prohibitive since the state space in linear MDP setting can be large or even infinite. In this paper, we introduce a value iteration method with efficient clipping operation that only requires computing the minimum of value functions over the set of states visited by the algorithm. Our algorithm enjoys the same regret bound as the previous work while being computationally efficient, with computational complexity that is independent of the size of the state space.

[AI-14] Leveraging Machine Learning Models to Predict the Outcome of Digital Medical Triage Interviews

[Quick Read]: This paper addresses the limitation that existing questionnaire-based digital triage systems can only serve patients who finish the triage interview, constraining service efficiency and patient safety. The key solution is to use machine learning (ML) to predict the triage outcome of unfinished interviews, using decision-tree models (LGBMClassifier and CatBoostClassifier) and a TabTransformer model to improve prediction accuracy and coverage. The study finds that the decision-tree models' prediction accuracy correlates linearly with interview completeness, while the TabTransformer achieves over 80% accuracy at all completeness levels but requires extensive training time and stronger computational resources. The central point is that ML can compensate for the information missing from incomplete interviews, improving the usability and efficiency of triage systems.
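To make the completeness-versus-accuracy relationship concrete, here is a toy sketch on synthetic data, with a nearest-centroid classifier standing in for the paper's gradient-boosted trees (the data, class separation, and classifier are all assumptions for this illustration, not the study's setup):

```python
# Predicting an outcome when only a fraction of interview answers is available.
import numpy as np

rng = np.random.default_rng(42)
n, d = 400, 10
y = rng.integers(0, 2, size=n)                  # triage outcome (0/1)
X = rng.normal(size=(n, d)) + y[:, None] * 1.5  # answers, shifted by class

mu0, mu1 = X[y == 0].mean(axis=0), X[y == 1].mean(axis=0)

def predict(x, mask):
    """Classify using only the answered questions (mask == True)."""
    d0 = np.linalg.norm((x - mu0)[mask])
    d1 = np.linalg.norm((x - mu1)[mask])
    return int(d1 < d0)

def accuracy(completeness):
    k = max(1, int(completeness * d))           # number of answered questions
    correct = 0
    for i in range(n):
        mask = np.zeros(d, bool)
        mask[rng.choice(d, size=k, replace=False)] = True
        correct += predict(X[i], mask) == y[i]
    return correct / n

acc_full, acc_40 = accuracy(1.0), accuracy(0.4)
```

As in the study, accuracy degrades as fewer questions are answered, but partial interviews still carry usable signal.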

Link: https://arxiv.org/abs/2504.11977
Authors: Sofia Krylova, Fabian Schmidt, Vladimir Vlassov
Institution: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: 8 pages, 4 figures, 8 tables

Abstract:Many existing digital triage systems are questionnaire-based, guiding patients to appropriate care levels based on information (e.g., symptoms, medical history, and urgency) provided by the patients answering questionnaires. Such a system often uses a deterministic model with predefined rules to determine care levels. It faces challenges with incomplete triage interviews since it can only assist patients who finish the process. In this study, we explore the use of machine learning (ML) to predict outcomes of unfinished interviews, aiming to enhance patient care and service quality. Predicting triage outcomes from incomplete data is crucial for patient safety and healthcare efficiency. Our findings show that decision-tree models, particularly LGBMClassifier and CatBoostClassifier, achieve over 80% accuracy in predicting outcomes from complete interviews while having a linear correlation between the prediction accuracy and interview completeness degree. For example, LGBMClassifier achieves 88.2% prediction accuracy for interviews with 100% completeness, 79.6% accuracy for interviews with 80% completeness, 58.9% accuracy for 60% completeness, and 45.7% accuracy for 40% completeness. The TabTransformer model demonstrated exceptional accuracy of over 80% for all degrees of completeness but required extensive training time, indicating a need for more powerful computational resources. The study highlights the linear correlation between interview completeness and predictive power of the decision-tree models.
zh

[AI-15] VIPO: Value Function Inconsistency Penalized Offline Reinforcement Learning

【速读】:该论文致力于解决离线强化学习(Offline Reinforcement Learning, Offline RL)中基于模型方法存在的模型误差导致的保守性问题,这些问题通常由启发式的不确定性估计引入且可靠性不足。为应对这一挑战,论文提出了一种名为VIPO的新算法,其关键在于通过自监督反馈机制增强模型训练。具体而言,VIPO在模型学习过程中额外最小化从离线数据直接学到的价值函数与由模型估计的价值函数之间的不一致性。这种方法不仅提高了模型的准确性,还保持了高效性和稳定性,并在D4RL和NeoRL基准测试中的几乎所有任务上实现了当前最优性能。

链接: https://arxiv.org/abs/2504.11944
作者: Xuyang Chen,Guojian Wang,Keyu Yan,Lin Zhao
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Offline reinforcement learning (RL) learns effective policies from pre-collected datasets, offering a practical solution for applications where online interactions are risky or costly. Model-based approaches are particularly advantageous for offline RL, owing to their data efficiency and generalizability. However, due to inherent model errors, model-based methods often artificially introduce conservatism guided by heuristic uncertainty estimation, which can be unreliable. In this paper, we introduce VIPO, a novel model-based offline RL algorithm that incorporates self-supervised feedback from value estimation to enhance model training. Specifically, the model is learned by additionally minimizing the inconsistency between the value learned directly from the offline data and the one estimated from the model. We perform comprehensive evaluations from multiple perspectives to show that VIPO can learn a highly accurate model efficiently and consistently outperform existing methods. It offers a general framework that can be readily integrated into existing model-based offline RL algorithms to systematically enhance model accuracy. As a result, VIPO achieves state-of-the-art performance on almost all tasks in both D4RL and NeoRL benchmarks.
zh
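下面给出一个极简的数值示意(纯 Python,函数名与各损失项形式均为笔者的假设性简化,并非论文实现),展示 VIPO 式目标的基本构成:单步动力学预测误差,加上"从数据学到的价值"与"由模型估计的价值"之间不一致性的惩罚项:

```python
def vipo_model_loss(model_pred_next, true_next, v_data, v_model, penalty_weight=1.0):
    """VIPO 式模型目标的玩具版本:动力学拟合误差 + 价值不一致性惩罚。

    参数均为普通浮点数列表;命名与权重仅为示意,非论文实现。
    """
    n = len(true_next)
    # 标准的模型拟合项:单步预测的均方误差
    dynamics_error = sum((p - t) ** 2 for p, t in zip(model_pred_next, true_next)) / n
    # 价值不一致性项:两种价值估计之间的均方差异
    value_gap = sum((vd - vm) ** 2 for vd, vm in zip(v_data, v_model)) / n
    return dynamics_error + penalty_weight * value_gap
```

当模型预测与价值估计完全一致时该损失为 0;惩罚权重越大,价值不一致性对模型训练的约束越强。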

[AI-16] Rethinking the Generation of High-Quality CoT Data from the Perspective of LLM-Adaptive Question Difficulty Grading

【速读】:该论文旨在解决如何高效生成具有不同大型语言模型(LLMs)适应性难度级别的高质量链式思考(CoT)数据的问题,以提升小规模LLMs的推理能力,并降低数据生成成本及增强模型监督微调(SFT)的效率。论文的关键解决方案在于构建了一个LLM-适应性问题数据库,通过根据LLMs自身的推理能力对问题难度进行分级,并基于问题难度分布采样,利用DeepSeek-R1 (671B)生成对应的高质量CoT数据。这种方法显著降低了数据生成成本,提升了效率,并在复杂数学竞赛和代码生成任务中验证了所提出方法的有效性和通用性。

链接: https://arxiv.org/abs/2504.11919
作者: Qianjin Yu,Keyu Wu,Zihan Chen,Chushu Zhang,Manlin Mei,Lingjun Huang,Fang Tan,Yongsheng Du,Kunlin Liu,Yurui Zhu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Recently, DeepSeek-R1 (671B) (DeepSeek-AI et al., 2025) has demonstrated its excellent reasoning ability in complex tasks and has publicly shared its methodology. This provides potentially high-quality chain-of-thought (CoT) data for stimulating the reasoning abilities of small-sized large language models (LLMs). To generate high-quality CoT data for different LLMs, we seek an efficient method for generating high-quality CoT data with LLM-Adaptive question difficulty levels. First, we grade the difficulty of the questions according to the reasoning ability of the LLMs themselves and construct a LLM-Adaptive question database. Second, we sample the problem database based on a distribution of difficulty levels of the questions and then use DeepSeek-R1 (671B) (DeepSeek-AI et al., 2025) to generate the corresponding high-quality CoT data with correct answers. Thanks to the construction of CoT data with LLM-Adaptive difficulty levels, we have significantly reduced the cost of data generation and enhanced the efficiency of model supervised fine-tuning (SFT). Finally, we have validated the effectiveness and generalizability of the proposed method in the fields of complex mathematical competitions and code generation tasks. Notably, with only 2k high-quality mathematical CoT data, our ZMath-32B surpasses DeepSeek-Distill-32B in math reasoning tasks. Similarly, with only 2k high-quality code CoT data, our ZCode-32B surpasses DeepSeek-Distill-32B in code reasoning tasks.
zh
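该方法的核心流程("按模型通过率分级难度,再按目标难度分布采样")可以用如下极简 Python 草图示意(难度阈值、分级名称与函数签名均为笔者假设,并非论文设定):

```python
import random

def grade_difficulty(pass_rate):
    """按模型在该问题上的经验通过率划分难度等级;阈值为示意值。"""
    if pass_rate >= 0.8:
        return "easy"
    if pass_rate >= 0.4:
        return "medium"
    return "hard"

def sample_by_difficulty(questions, target_counts, seed=0):
    """questions: (问题id, 通过率) 列表;target_counts: 各难度等级的目标采样数。"""
    rng = random.Random(seed)
    buckets = {"easy": [], "medium": [], "hard": []}
    for qid, rate in questions:
        buckets[grade_difficulty(rate)].append(qid)
    sampled = []
    for level, n in target_counts.items():
        pool = buckets[level]
        sampled.extend(rng.sample(pool, min(n, len(pool))))
    return sampled
```

采样得到的问题子集随后可交由强模型(如 DeepSeek-R1)生成对应的 CoT 数据。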

[AI-17] Causality-enhanced Decision-Making for Autonomous Mobile Robots in Dynamic Environments

【速读】:该论文旨在解决在人机共享环境中,如何通过因果推理提升自主机器人任务规划与执行效率的问题。传统方法局限于相关性分析,而本文提出的关键解决方案是基于因果关系的决策框架,该框架通过对学习到的因果模型进行推理,预测电池消耗与人类干扰等环境因素对机器人任务执行的影响,从而辅助机器人决定何时及如何完成任务。此外,为支持这一框架,作者开发了PeopleFlow仿真器,用于模拟复杂的上下文感知的人机空间交互。通过在仓库环境中的案例研究,验证了所提因果方法相较于非因果基线的优越性,证明了因果推理能够显著提高机器人在动态人机共存环境中的运行效率与安全性。

链接: https://arxiv.org/abs/2504.11901
作者: Luca Castri,Gloria Beraldo,Nicola Bellotto
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注: Under review at The International Journal of Robotics Research (IJRR)

点击查看摘要

Abstract:The growing integration of robots in shared environments – such as warehouses, shopping centres, and hospitals – demands a deep understanding of the underlying dynamics and human behaviours, including how, when, and where individuals engage in various activities and interactions. This knowledge goes beyond simple correlation studies and requires a more comprehensive causal analysis. By leveraging causal inference to model cause-and-effect relationships, we can better anticipate critical environmental factors and enable autonomous robots to plan and execute tasks more effectively. To this end, we propose a novel causality-based decision-making framework that reasons over a learned causal model to predict battery usage and human obstructions, understanding how these factors could influence robot task execution. Such reasoning framework assists the robot in deciding when and how to complete a given task. To achieve this, we developed also PeopleFlow, a new Gazebo-based simulator designed to model context-sensitive human-robot spatial interactions in shared workspaces. PeopleFlow features realistic human and robot trajectories influenced by contextual factors such as time, environment layout, and robot state, and can simulate a large number of agents. While the simulator is general-purpose, in this paper we focus on a warehouse-like environment as a case study, where we conduct an extensive evaluation benchmarking our causal approach against a non-causal baseline. Our findings demonstrate the efficacy of the proposed solutions, highlighting how causal reasoning enables autonomous robots to operate more efficiently and safely in dynamic environments shared with humans.
zh

[AI-18] Seeking and leveraging alternative variable dependency concepts in gray-box-elusive bimodal land-use allocation problems

【速读】:该论文试图解决土地利用分配中的多目标优化问题,这类问题是NP难问题,传统方法难以有效处理。尤其在面对变量间依赖关系复杂且标准变量依赖发现技术不可用的情况下,论文提出了一种针对具体问题定义的变量依赖性概念,并基于此构建了依赖变量的掩码。关键解决方案在于由此提出的三种新型杂交算子,这些算子被引入到两种经典的多目标优化算法(NSGA-II 和 MOEA/D)中,显著提升了其优化效果。

链接: https://arxiv.org/abs/2504.11882
作者: J. Maciążek,M. W. Przewozniczek,J. Schwaab
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Solving land-use allocation problems can help us to deal with some of the most urgent global environmental issues. Since these problems are NP-hard, effective optimizers are needed to handle them. The knowledge about variable dependencies allows for proposing such tools. However, in this work, we consider a real-world multi-objective problem for which standard variable dependency discovery techniques are inapplicable. Therefore, using linkage-based variation operators is unreachable. To address this issue, we propose a definition of problem-dedicated variable dependency. On this base, we propose obtaining masks of dependent variables. Using them, we construct three novel crossover operators. The results concerning real-world test cases show that introducing our propositions into two well-known optimizers (NSGA-II, MOEA/D) dedicated to multi-objective optimization significantly improves their effectiveness.
zh
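"基于依赖变量掩码构造杂交算子"的基本思路可以用下面的极简 Python 草图示意(仅为概念演示:掩码标记为 True 的依赖变量组整体继承自父代 A,其余位置继承自父代 B;具体算子设计以论文为准):

```python
def mask_crossover(parent_a, parent_b, dependency_mask):
    """由依赖掩码引导的杂交:被掩码标记的依赖变量作为整体一起遗传,
    避免拆散相互依赖的基因位。这是论文思想的简化示意,非原文算子。"""
    assert len(parent_a) == len(parent_b) == len(dependency_mask)
    return [a if m else b for a, b, m in zip(parent_a, parent_b, dependency_mask)]
```

这类算子可直接替换 NSGA-II / MOEA/D 中默认的杂交操作,使依赖变量组不被随机切分。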

[AI-19] Moving between high-quality optima using multi-satisfiability characteristics in hard-to-solve Max3Sat instances

【速读】:该论文致力于解决在最大可满足性问题(Maximum Satisfiability Problem, MaxSat)及其特定形式Max3Sat中,当隧道效应(tunnelling)失效时,如何有效连接局部最优高质量解与全局最优解区域的问题。论文的关键在于分析此类问题实例在相变(phase transition)背景下的特征,并提出通过操控子句可满足性特性来实现远距离优质解之间的连接。解决方案的关键是利用多子句可满足性特性,在基于典型灰盒机制构建的优化器中引入改进策略,从而能够有效解决那些当前最先进的灰盒优化器无法处理的Max3Sat实例,同时保持对已有灰盒方法有效实例的性能。

链接: https://arxiv.org/abs/2504.11864
作者: J. Piatek,M. W. Przewozniczek,F. Chicano,R. Tinós
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Gray-box optimization proposes effective and efficient optimizers of general use. To this end, it leverages information about variable dependencies and the subfunction-based problem representation. These approaches were already shown effective by enabling tunnelling between local optima even if these moves require the modification of many dependent variables. Tunnelling is useful in solving the maximum satisfiability problem (MaxSat), which can be reformulated to Max3Sat. Since many real-world problems can be brought to solving the MaxSat/Max3Sat instances, it is important to solve them effectively and efficiently. Therefore, we focus on Max3Sat instances for which tunnelling fails to introduce improving moves between locally optimal high-quality solutions and the region of globally optimal solutions. We analyze the features of such instances on the ground of phase transitions. Based on these observations, we propose manipulating clause-satisfiability characteristics that allow connecting high-quality solutions distant in the solution space. We utilize multi-satisfiability characteristics in the optimizer built from typical gray-box mechanisms. The experimental study shows that the proposed optimizer can solve those Max3Sat instances that are out of the grasp of state-of-the-art gray-box optimizers. At the same time, it remains effective for instances that have already been successfully solved by gray-box.
zh

[AI-20] EngramNCA: a Neural Cellular Automaton Model of Memory Transfer

【速读】:本文旨在解决生物记忆存储机制在人工系统中的建模与应用问题。论文提出EngramNCA这一神经细胞自动机模型,其关键在于结合了公共可见状态与私有细胞内部记忆通道的设计,受到最新生物学证据的启发,即记忆存储不仅限于突触修饰,还包括细胞内机制。该模型包含两个组件:GeneCA通过训练从种子细胞中发展出不同形态,这些细胞含有不可变的“基因”编码;GenePropCA作为辅助模型,调节细胞的私有“遗传”记忆而不改变其可见状态。这种架构通过公共和私有通道的交互,支持复杂形态的编码与传播,并促进从共享“遗传”基质中生长出多样结构,从而实现分层和共存形态的涌现,为分布式记忆存储与传输提供了新见解。这可能推动自适应、自组织系统的开发,并增进对生物与合成系统中记忆机制的理解。

链接: https://arxiv.org/abs/2504.11855
作者: Etienne Guichard,Felix Reimers,Mia Kvalsund,Mikkel Lepperød,Stefano Nichele
机构: 未知
类目: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:This study introduces EngramNCA, a neural cellular automaton (NCA) that integrates both publicly visible states and private, cell-internal memory channels, drawing inspiration from emerging biological evidence suggesting that memory storage extends beyond synaptic modifications to include intracellular mechanisms. The proposed model comprises two components: GeneCA, an NCA trained to develop distinct morphologies from seed cells containing immutable “gene” encodings, and GenePropCA, an auxiliary NCA that modulates the private “genetic” memory of cells without altering their visible states. This architecture enables the encoding and propagation of complex morphologies through the interaction of visible and private channels, facilitating the growth of diverse structures from a shared “genetic” substrate. EngramNCA supports the emergence of hierarchical and coexisting morphologies, offering insights into decentralized memory storage and transfer in artificial systems. These findings have potential implications for the development of adaptive, self-organizing systems and may contribute to the broader understanding of memory mechanisms in both biological and synthetic contexts.
zh

[AI-21] Learning Strategies in Particle Swarm Optimizers: A Critical Review and Performance Analysis

【速读】:该论文试图解决的问题是如何全面系统地分析和评估粒子群优化算法(PSO)中各种增强性能的学习策略,并理解它们对优化性能的具体影响。论文的关键解决方案在于对现有学习策略进行系统的分类与回顾,同时通过对比实验评估这些策略如何影响PSO的搜索动态,从而填补了这一领域的研究空白。此外,论文强调开发具备自适应性和智能化的PSO变体以应对日益复杂的实际问题的重要性。

链接: https://arxiv.org/abs/2504.11812
作者: Dikshit Chauhan,Shivani,P. N. Suganthan
机构: 未知
类目: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI)
备注: 53 pages, 14 figures

点击查看摘要

Abstract:Nature has long inspired the development of swarm intelligence (SI), a key branch of artificial intelligence that models collective behaviors observed in biological systems for solving complex optimization problems. Particle swarm optimization (PSO) is widely adopted among SI algorithms due to its simplicity and efficiency. Despite numerous learning strategies proposed to enhance PSO’s performance in terms of convergence speed, robustness, and adaptability, no comprehensive and systematic analysis of these strategies exists. We review and classify various learning strategies to address this gap, assessing their impact on optimization performance. Additionally, a comparative experimental evaluation is conducted to examine how these strategies influence PSO’s search dynamics. Finally, we discuss open challenges and future directions, emphasizing the need for self-adaptive, intelligent PSO variants capable of addressing increasingly complex real-world problems.
zh
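作为背景,论文所综述的各种学习策略均构建在经典 PSO 的速度/位置更新规则之上。下面是标准惯性权重版 PSO 的极简纯 Python 实现(求解球面函数最小化;超参数取常用示意值):

```python
import random

def pso_minimize(f, dim=2, swarm=10, iters=100, w=0.7, c1=1.5, c2=1.5, seed=0):
    """经典惯性权重 PSO:v = w*v + c1*r1*(pbest-x) + c2*r2*(gbest-x)。"""
    rng = random.Random(seed)
    pos = [[rng.uniform(-5, 5) for _ in range(dim)] for _ in range(swarm)]
    vel = [[0.0] * dim for _ in range(swarm)]
    pbest = [p[:] for p in pos]                 # 各粒子历史最优位置
    pbest_val = [f(p) for p in pos]
    g = min(range(swarm), key=lambda i: pbest_val[i])
    gbest, gbest_val = pbest[g][:], pbest_val[g]  # 全局最优
    for _ in range(iters):
        for i in range(swarm):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                vel[i][d] = (w * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - pos[i][d])
                             + c2 * r2 * (gbest[d] - pos[i][d]))
                pos[i][d] += vel[i][d]
            val = f(pos[i])
            if val < pbest_val[i]:
                pbest[i], pbest_val[i] = pos[i][:], val
                if val < gbest_val:
                    gbest, gbest_val = pos[i][:], val
    return gbest, gbest_val
```

论文中讨论的学习策略(如综合学习、拓扑自适应等)即是对上述 pbest/gbest 学习项的各种改造。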

[AI-22] Large Language Models for Drug Overdose Prediction from Longitudinal Medical Records

【速读】:该论文试图解决通过患者医疗记录预测药物过量风险的问题,以实现及时干预和预防。解决方案的关键在于利用大型语言模型(Large Language Models, LLMs)处理长文本数据的能力及其在多样化任务中的先验知识,特别是评估OpenAI的GPT-4o LLM在纵向保险索赔记录中预测药物过量事件的有效性,并将其性能与传统机器学习方法进行对比,验证其在微调和零样本设置下的表现优势。

链接: https://arxiv.org/abs/2504.11792
作者: Md Sultan Al Nahian,Chris Delcher,Daniel Harris,Peter Akpunonu,Ramakanth Kavuluru
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:The ability to predict drug overdose risk from a patient's medical records is crucial for timely intervention and prevention. Traditional machine learning models have shown promise in analyzing longitudinal medical records for this task. However, recent advancements in large language models (LLMs) offer an opportunity to enhance prediction performance by leveraging their ability to process long textual data and their inherent prior knowledge across diverse tasks. In this study, we assess the effectiveness of OpenAI's GPT-4o LLM in predicting drug overdose events using patients' longitudinal insurance claims records. We evaluate its performance in both fine-tuned and zero-shot settings, comparing them to strong traditional machine learning methods as baselines. Our results show that LLMs not only outperform traditional models in certain settings but can also predict overdose risk in a zero-shot setting without task-specific training. These findings highlight the potential of LLMs in clinical decision support, particularly for drug overdose risk prediction.
zh

[AI-23] Agile Retrospectives: What went well? What didn't go well? What should we do? WWW

【速读】:该论文致力于解决敏捷/Scrum软件开发中回顾会议(Retrospective Meetings, 简称Retros)的信息交互效率及团队信息可视化的问题。论文的关键在于探索生成式人工智能(Generative AI)在提升Retros信息交互中的潜在应用,并通过开发原型工具RetroAI++实现对Retros相关信息的功能支持与可视化展示,从而帮助软件开发团队更高效地分析和利用回顾会议中的信息。

链接: https://arxiv.org/abs/2504.11780
作者: Maria Spichkova,Hina Lee,Kevin Iwan,Madeleine Zwart,Yuwon Yoon,Xiaohan Qin
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注: Preprint. Accepted to the 20th International Conference on Evaluation of Novel Approaches to Software Engineering (ENASE 2025). Final version to be published by SCITEPRESS, this http URL

点击查看摘要

Abstract:In Agile/Scrum software development, the idea of retrospective meetings (retros) is one of the core elements of the project process. In this paper, we present our work in progress focusing on two aspects: analysis of potential usage of generative AI for information interaction within retrospective meetings, and visualisation of retros’ information to software development teams. We also present our prototype tool RetroAI++, focusing on retros-related functionalities.
zh

[AI-24] PCDiff: Proactive Control for Ownership Protection in Diffusion Models with Watermark Compatibility

【速读】:该论文旨在解决文本到图像扩散模型知识产权(Intellectual Property, IP)保护的需求,通过提出PCDiff框架,重新定义模型授权机制以控制生成质量为核心目标。论文的关键解决方案在于将可训练的融合模块与分层认证层集成到解码器架构中,确保仅持有有效加密凭据的用户能够生成高保真图像,而在缺少合法密钥时故意降低输出质量,从而有效防止未经授权的访问。其核心创新点是通过架构干预实现主动访问控制,同时保留与现有数字水印技术的兼容性,满足模型所有者主动管理模型所有权的需求,并保持传统水印方法的可追溯能力。实验评估表明,凭据验证与图像质量之间存在强相关性,并且结合典型后处理操作时,PCDiff在性能上与传统水印方法相当。此工作实现了从被动检测到主动授权执行的范式转变,为扩散模型的知识产权管理奠定了基础。

链接: https://arxiv.org/abs/2504.11774
作者: Keke Gai,Ziyue Shen,Jing Yu,Liehuang Zhu,Qi Wu
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:With the growing demand for protecting the intellectual property (IP) of text-to-image diffusion models, we propose PCDiff – a proactive access control framework that redefines model authorization by regulating generation quality. At its core, PCDiff integrates a trainable fuser module and hierarchical authentication layers into the decoder architecture, ensuring that only users with valid encrypted credentials can generate high-fidelity images. In the absence of valid keys, the system deliberately degrades output quality, effectively preventing unauthorized use. Importantly, while the primary mechanism enforces active access control through architectural intervention, its decoupled design retains compatibility with existing watermarking techniques. This satisfies the need of model owners to actively control model ownership while preserving the traceability capabilities provided by traditional watermarking methods. Extensive experimental evaluations confirm a strong dependency between credential verification and image quality across various attack scenarios. Moreover, when combined with typical post-processing operations, PCDiff demonstrates powerful performance alongside conventional watermarking methods. This work shifts the paradigm from passive detection to proactive enforcement of authorization, laying the groundwork for IP management of diffusion models.
zh
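"凭据有效则输出高保真结果、凭据无效则故意降质"的访问控制思路,可用如下概念性 Python 草图示意(以 HMAC 校验代替论文中的加密凭据与分层认证层,以粗量化代替扩散解码降质;全部命名与机制均为笔者假设):

```python
import hashlib
import hmac

SECRET_KEY = b"model-owner-secret"   # 假设的模型所有者密钥

def issue_credential(user_id):
    """为授权用户签发凭据(HMAC-SHA256,仅为示意)。"""
    return hmac.new(SECRET_KEY, user_id.encode(), hashlib.sha256).hexdigest()

def generate(user_id, credential, latent):
    """凭据通过校验:原样返回潜变量(代表高保真解码路径);
    校验失败:对输出做粗量化,模拟 PCDiff 式的质量降级。"""
    expected = issue_credential(user_id)
    if hmac.compare_digest(expected, credential):
        return latent
    # 降质路径:粗粒度量化破坏输出保真度
    return [round(x) for x in latent]
```

这一草图仅说明"以生成质量为访问控制手段"的接口形态,不涉及扩散模型内部的融合模块实现。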

[AI-25] Shared Disk KV Cache Management for Efficient Multi-Instance Inference in RAG-Powered LLMs

【速读】:该论文旨在解决大型语言模型(LLMs)在推理过程中因输入上下文长度和模型规模增加而导致的延迟(inference latency)问题,特别是检索增强生成(Retrieval-Augmented Generation, RAG)技术由于显著增加输入标记数而加剧的预填充阶段(prefill stage)计算开销问题。论文的关键解决方案是通过引入基于磁盘的键值(KV)缓存来减轻预填充阶段的计算负担,并提出了一种名为Shared RAG-DCache的共享KV缓存管理系统。该系统利用RAG中与用户查询相关的文档局部性以及LLM推理服务中的排队延迟,主动为与查询相关的文档生成并存储磁盘KV缓存,同时跨多个LLM实例共享这些缓存,从而提升推理性能。实验结果显示,在单主机配置下,Shared RAG-DCache提升了15~71%的吞吐量,并减少了12~65%的延迟。

链接: https://arxiv.org/abs/2504.11765
作者: Hyungwoo Lee(1),Kihyun Kim(1),Jinwoo Kim(1),Jungmin So(1),Myung-Hoon Cha(2),Hong-Yeon Kim(2),James J. Kim(3),Youngjae Kim(1) ((1) Dept. of Computer Science and Engineering, Sogang University, Seoul, Republic of Korea, (2) ETRI, Daejeon, Republic of Korea, (3) Soteria Inc)
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Recent large language models (LLMs) face increasing inference latency as input context length and model size continue to grow. In particular, the retrieval-augmented generation (RAG) technique, which enhances LLM responses by incorporating external knowledge, exacerbates this issue by significantly increasing the number of input tokens. This expansion in token length leads to a substantial rise in computational overhead, particularly during the prefill stage, resulting in prolonged time-to-first-token (TTFT). To address this issue, this paper proposes a method to reduce TTFT by leveraging a disk-based key-value (KV) cache to lessen the computational burden during the prefill stage. We also introduce a disk-based shared KV cache management system, called Shared RAG-DCache, for multi-instance LLM RAG service environments. This system, together with an optimal system configuration, improves both throughput and latency under given resource constraints. Shared RAG-DCache exploits the locality of documents related to user queries in RAG, as well as the queueing delay in LLM inference services. It proactively generates and stores disk KV caches for query-related documents and shares them across multiple LLM instances to enhance inference performance. In experiments on a single host equipped with 2 GPUs and 1 CPU, Shared RAG-DCache achieved a 15~71% increase in throughput and up to a 12~65% reduction in latency, depending on the resource configuration.
zh
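"以文档为键、跨实例共享、并在排队期间主动预生成 KV 缓存"的思路可以抽象为如下玩具级 Python 草图(类名、接口与命中统计均为笔者的示意性假设,并非 Shared RAG-DCache 的实现):

```python
class SharedKVCache:
    """玩具级共享文档 KV 缓存:多个推理实例共用同一存储,
    以文档 id 为键;条目可在请求排队时预取(prefetch)。"""

    def __init__(self):
        self.store = {}
        self.hits = 0
        self.misses = 0

    def prefetch(self, doc_id, compute_kv):
        # 主动生成:利用排队延迟提前算好查询相关文档的 KV
        if doc_id not in self.store:
            self.store[doc_id] = compute_kv(doc_id)

    def get(self, doc_id, compute_kv):
        if doc_id in self.store:
            self.hits += 1
        else:
            self.misses += 1
            self.store[doc_id] = compute_kv(doc_id)
        return self.store[doc_id]

compute = lambda d: f"kv({d})"   # 代表对文档做 prefill 得到 KV 的开销操作
cache = SharedKVCache()
cache.prefetch("doc1", compute)  # 排队期间的主动预生成
```

预取命中的文档在真正 prefill 时即可跳过重复计算,这正是降低 TTFT 的来源。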

[AI-26] Characterizing and Optimizing LLM Inference Workloads on CPU-GPU Coupled Architectures

【速读】:该论文旨在解决大型语言模型(LLM)推理工作负载在不断发展的CPU-GPU耦合架构上的性能优化问题。随着LLM推理工作负载逐渐主导数据中心的成本和资源利用率,理解其在松耦合(如PCIe A100/H100)和紧耦合(如GH200)系统上的行为特征变得至关重要。论文通过细粒度的操作符到内核跟踪分析,利用创新的SKIP剖析器和指标(如总内核启动与排队时间,Total Kernel Launch and Queuing Time, TKLQT),深入研究了LLM推理行为。

解决方案的关键在于识别CPU-GPU绑定的切换点,并通过内核融合技术减少内核启动开销以缓解低批次大小下的延迟瓶颈。研究发现,紧耦合的GH200在大批次大小下显著优于松耦合系统,但GH200在低批次大小下仍受CPU限制的程度更高。论文表明,TKLQT能够准确识别这一切换点,并通过内核融合显著改善GH200在低批次大小下的推理延迟。这项详细到内核级别的特性分析为优化不同CPU-GPU耦合策略提供了重要见解。

链接: https://arxiv.org/abs/2504.11750
作者: Prabhu Vellaisamy,Thomas Labonte,Sourav Chakraborty,Matt Turner,Samantika Sury,John Paul Shen
机构: 未知
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR); Performance (cs.PF)
备注: Accepted for ISPASS 2025

点击查看摘要

Abstract:Large language model (LLM)-based inference workloads increasingly dominate data center costs and resource utilization. Therefore, understanding the inference workload characteristics on evolving CPU-GPU coupled architectures is crucial for optimization. This paper presents an in-depth analysis of LLM inference behavior on loosely-coupled (PCIe A100/H100) and closely-coupled (GH200) systems. We analyze performance dynamics using fine-grained operator-to-kernel trace analysis, facilitated by our novel profiler SKIP and metrics like Total Kernel Launch and Queuing Time (TKLQT). Results show that closely-coupled (CC) GH200 significantly outperforms loosely-coupled (LC) systems at large batch sizes, achieving 1.9x-2.7x faster prefill latency for Llama 3.2-1B. However, our analysis also reveals that GH200 remains CPU-bound up to 4x larger batch sizes than LC systems. In this extended CPU-bound region, we identify the performance characteristics of the Grace CPU as a key factor contributing to higher inference latency at low batch sizes on GH200. We demonstrate that TKLQT accurately identifies this CPU/GPU-bound transition point. Based on this analysis, we further show that kernel fusion offers significant potential to mitigate GH200’s low-batch latency bottleneck by reducing kernel launch overhead. This detailed kernel-level characterization provides critical insights for optimizing diverse CPU-GPU coupling strategies. This work is an initial effort, and we plan to explore other major AI/DL workloads that demand different degrees of CPU-GPU heterogeneous architectures.
zh
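TKLQT(Total Kernel Launch and Queuing Time)指标的一种简化解读:对每个内核,统计从 CPU 侧发起启动到 GPU 侧真正开始执行之间的间隙并求和;当该间隙占总时间比例过高时判定运行处于 CPU 受限区间。以下草图中的字段名与阈值均为笔者假设:

```python
def tklqt(events):
    """events: 每项含 'cpu_launch'(CPU 发起启动)与 'gpu_start'
    (GPU 开始执行)时间戳(同一时钟)。返回启动+排队间隙之和。"""
    return sum(e["gpu_start"] - e["cpu_launch"] for e in events)

def cpu_bound(events, total_time, threshold=0.5):
    """当启动/排队间隙在总时间中的占比超过阈值时,判定为 CPU 受限。"""
    return tklqt(events) / total_time > threshold
```

论文利用类似指标定位 GH200 从 CPU 受限转为 GPU 受限的批次大小切换点。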

[AI-27] Saga: Capturing Multi-granularity Semantics from Massive Unlabelled IMU Data for User Perception

【速读】:该论文旨在解决在移动感知应用中,由于微活动标注困难以及缺乏真实标签导致的大规模惯性测量单元(IMU)数据难以有效利用的问题。论文提出了一种名为Saga的细粒度用户感知方法,其关键是通过自监督学习预训练一个骨干特征提取模型,充分利用大规模无标注IMU数据中嵌入的不同语义层次的丰富信息。同时,针对特定的下游用户感知任务,采用贝叶斯优化(Bayesian Optimization)确定预训练任务中不同语义层次的最佳权重,从而仅需少量标注数据即可实现接近全量数据训练模型的高精度性能,且不增加额外系统开销。

链接: https://arxiv.org/abs/2504.11726
作者: Yunzhe Li,Facheng Hu,Hongzi Zhu,Shifan Zhang,Liang Zhang,Shan Chang,Minyi Guo
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 2025 IEEE 45th International Conference on Distributed Computing Systems (ICDCS)

点击查看摘要

Abstract:Inertial measurement units (IMUs), have been prevalently used in a wide range of mobile perception applications such as activity recognition and user authentication, where a large amount of labelled data are normally required to train a satisfactory model. However, it is difficult to label micro-activities in massive IMU data due to the hardness of understanding raw IMU data and the lack of ground truth. In this paper, we propose a novel fine-grained user perception approach, called Saga, which only needs a small amount of labelled IMU data to achieve stunning user perception accuracy. The core idea of Saga is to first pre-train a backbone feature extraction model, utilizing the rich semantic information of different levels embedded in the massive unlabelled IMU data. Meanwhile, for a specific downstream user perception application, Bayesian Optimization is employed to determine the optimal weights for pre-training tasks involving different semantic levels. We implement Saga on five typical mobile phones and evaluate Saga on three typical tasks on three IMU datasets. Results show that when only using about 100 training samples per class, Saga can achieve over 90% of the accuracy of the full-fledged model trained on over ten thousand training samples with no additional system overhead.
zh

[AI-28] Adjoint Sampling: Highly Scalable Diffusion Samplers via Adjoint Matching

【速读】:本文旨在解决从非归一化密度(unnormalized density)或能量函数中高效采样的问题,特别是在需要大规模梯度更新的场景下。传统方法受限于能量评估次数与模型样本数量的比例,而本文提出的伴随采样(Adjoint Sampling)算法首次实现了显著更多的梯度更新次数,从而支持更大规模的问题设置。其关键在于结合随机最优控制理论,无需采用校正措施来推动样本接近目标分布,同时通过引入对称性和周期性边界条件等机制,有效处理分子在笛卡尔坐标和扭转角坐标下的建模。此外,该方法不仅适用于经典能量函数,还扩展到基于神经网络的能量模型,展示了其在计算化学领域中的广泛应用潜力。

链接: https://arxiv.org/abs/2504.11713
作者: Aaron Havens,Benjamin Kurt Miller,Bing Yan,Carles Domingo-Enrich,Anuroop Sriram,Brandon Wood,Daniel Levine,Bin Hu,Brandon Amos,Brian Karrer,Xiang Fu,Guan-Horng Liu,Ricky T. Q. Chen
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:We introduce Adjoint Sampling, a highly scalable and efficient algorithm for learning diffusion processes that sample from unnormalized densities, or energy functions. It is the first on-policy approach that allows significantly more gradient updates than the number of energy evaluations and model samples, allowing us to scale to much larger problem settings than previously explored by similar methods. Our framework is theoretically grounded in stochastic optimal control and shares the same theoretical guarantees as Adjoint Matching, being able to train without the need for corrective measures that push samples towards the target distribution. We show how to incorporate key symmetries, as well as periodic boundary conditions, for modeling molecules in both cartesian and torsional coordinates. We demonstrate the effectiveness of our approach through extensive experiments on classical energy functions, and further scale up to neural network-based energy models where we perform amortized conformer generation across many molecular systems. To encourage further research in developing highly scalable sampling methods, we plan to open source these challenging benchmarks, where successful methods can directly impact progress in computational chemistry.
zh

[AI-29] he Hitchhikers Guide to Program Analysis Part II: Deep Thoughts by LLM s

【速读】:该论文试图解决静态分析在软件漏洞检测中的经典精度-可扩展性权衡问题,尤其是在大规模代码库(如Linux内核)中普遍存在高误报率的问题。这些问题源于简化的漏洞建模以及路径和数据约束的过度近似。为应对这些挑战,论文提出的关键解决方案是BugLens,这是一种后处理精化框架。BugLens通过引导大型语言模型(LLM)评估有缺陷代码的安全影响,并验证与静态警告相关的约束条件,从而显著提高静态分析的精度。实验结果显示,BugLens将原始精度从0.10提升至0.72,大幅降低了误报率,并发现了四个先前未报告的漏洞。这表明基于结构化LLM的工作流能够显著增强静态分析工具的有效性。

链接: https://arxiv.org/abs/2504.11711
作者: Haonan Li,Hang Zhang,Kexin Pei,Zhiyun Qian
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Static analysis is a cornerstone for software vulnerability detection, yet it often struggles with the classic precision-scalability trade-off. In practice, such tools often produce high false positive rates, particularly in large codebases like the Linux kernel. This imprecision can arise from simplified vulnerability modeling and over-approximation of path and data constraints. While large language models (LLMs) show promise in code understanding, their naive application to program analysis yields unreliable results due to inherent reasoning limitations. We introduce BugLens, a post-refinement framework that significantly improves static analysis precision. BugLens guides an LLM to follow traditional analysis steps by assessing buggy code patterns for security impact and validating the constraints associated with static warnings. Evaluated on real-world Linux kernel bugs, BugLens raises precision from 0.10 (raw) and 0.50 (semi-automated refinement) to 0.72, substantially reducing false positives and revealing four previously unreported vulnerabilities. Our results suggest that a structured LLM-based workflow can meaningfully enhance the effectiveness of static analysis tools.
zh
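BugLens 式的"静态分析后精化"流水线可以抽象成一个两级过滤器:先判断告警代码模式是否具有安全影响,再验证告警背后的路径/数据约束是否可满足,两者都通过才保留告警。下面的草图用普通函数代替 LLM 调用(接口与字段名均为笔者假设):

```python
def refine_warnings(warnings, assess_impact, check_constraints):
    """后精化流水线示意:仅保留 (1) 被判定为安全相关、且
    (2) 约束被判定为可行的静态告警。两个判定函数在实际系统中
    对应引导式的 LLM 评估步骤。"""
    kept = []
    for w in warnings:
        if assess_impact(w) and check_constraints(w):
            kept.append(w)
    return kept
```

这种结构化的两阶段过滤正是论文中将精度从 0.10 提升到 0.72 的工作流骨架。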

[AI-30] A Library of LLM Intrinsics for Retrieval-Augmented Generation

【速读】:本文旨在解决大型语言模型(LLMs)开发者社区中缺乏类似软件库的清晰模式以支持大规模协作的问题。尤其针对检索增强生成(Retrieval-Augmented Generation, RAG)这一常见应用场景,目前尚无法基于由不同LLM提供商共同认可的一组明确API编写RAG应用程序。受编译器内置函数(compiler intrinsics)概念的启发,本文提出了通过引入RAG专用的LLM内置函数库(LLM Intrinsics Library for RAG)的部分解决方案。这些LLM内置函数被定义为可通过合理稳定且独立于其实现方式的API调用的能力。本研究中的内置函数以LoRA适配器的形式发布在HuggingFace平台上,并作为vLLM推理平台上的软件接口提供,同时在两处均附带文档和代码。文章描述了每个内置函数的设计目标、训练细节及评估结果,以及多个内置函数的组合应用。关键在于提出了一种以LLM内置函数为核心的抽象机制,使开发者能够更便捷地构建和扩展RAG系统,从而促进社区内的协作与标准化。

链接: https://arxiv.org/abs/2504.11704
作者: Marina Danilevsky,Kristjan Greenewald,Chulaka Gunasekara,Maeda Hanafi,Lihong He,Yannis Katsis,Krishnateja Killamsetty,Yatin Nandwani,Lucian Popa,Dinesh Raghu,Frederick Reiss,Vraj Shah,Khoi-Nguyen Tran,Huaiyu Zhu,Luis Lastras
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:In the developer community for large language models (LLMs), there is not yet a clean pattern analogous to a software library, to support very large scale collaboration. Even for the commonplace use case of Retrieval-Augmented Generation (RAG), it is not currently possible to write a RAG application against a well-defined set of APIs that are agreed upon by different LLM providers. Inspired by the idea of compiler intrinsics, we propose some elements of such a concept through introducing a library of LLM Intrinsics for RAG. An LLM intrinsic is defined as a capability that can be invoked through a well-defined API that is reasonably stable and independent of how the LLM intrinsic itself is implemented. The intrinsics in our library are released as LoRA adapters on HuggingFace, and through a software interface with clear structured input/output characteristics on top of vLLM as an inference platform, accompanied in both places with documentation and code. This article describes the intended usage, training details, and evaluations for each intrinsic, as well as compositions of multiple intrinsics.
zh

[AI-31] Progent: Programmable Privilege Control for LLM Agents

【速读】:该论文旨在解决大型语言模型(Large Language Models, LLM)代理系统中存在的显著安全风险问题。尽管LLM代理具有巨大潜力,但它们在与外部世界交互时可能执行危险操作,特别是在受到恶意命令攻击的情况下。为应对这一挑战,论文提出通过实施最小权限原则来限制仅允许完成任务所需的必要操作,同时阻止其他所有不必要的行为。然而,实现这一目标极具挑战性,因为需要覆盖多样化的代理场景,并且在保障安全性的同时保持实用性。

论文的关键解决方案是引入Progent,这是首个面向LLM代理的权限控制机制。Progent的核心是一种领域专用语言,用于灵活表达在代理执行期间应用的权限控制策略。这些策略能够对工具调用施加细粒度约束,决定何时允许工具调用,并在不允许时指定替代方案。这使得代理开发者和用户可以根据具体应用场景制定合适的策略,并以确定性方式强制执行以确保安全性。由于其模块化设计,集成Progent不会改变代理内部结构,只需对代理实现进行最小改动即可,从而提升了其实用性和广泛采用的可能性。此外,为了自动化策略编写过程,论文利用LLM根据用户查询生成策略,并动态更新这些策略以提高安全性和实用性。广泛的评估表明,Progent能够在三个不同的场景或基准测试中(AgentDojo、ASB和AgentPoison)提供强大的安全保障并保持高实用性。进一步的深入分析展示了其核心组件的有效性以及自动策略生成对于对抗性攻击的鲁棒性。

链接: https://arxiv.org/abs/2504.11703
作者: Tianneng Shi,Jingxuan He,Zhun Wang,Linyu Wu,Hongwei Li,Wenbo Guo,Dawn Song
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:LLM agents are an emerging form of AI systems where large language models (LLMs) serve as the central component, utilizing a diverse set of tools to complete user-assigned tasks. Despite their great potential, LLM agents pose significant security risks. When interacting with the external world, they may encounter malicious commands from attackers, leading to the execution of dangerous actions. A promising way to address this is by enforcing the principle of least privilege: allowing only essential actions for task completion while blocking unnecessary ones. However, achieving this is challenging, as it requires covering diverse agent scenarios while preserving both security and utility. We introduce Progent, the first privilege control mechanism for LLM agents. At its core is a domain-specific language for flexibly expressing privilege control policies applied during agent execution. These policies provide fine-grained constraints over tool calls, deciding when tool calls are permissible and specifying fallbacks if they are not. This enables agent developers and users to craft suitable policies for their specific use cases and enforce them deterministically to guarantee security. Thanks to its modular design, integrating Progent does not alter agent internals and requires only minimal changes to agent implementation, enhancing its practicality and potential for widespread adoption. To automate policy writing, we leverage LLMs to generate policies based on user queries, which are then updated dynamically for improved security and utility. Our extensive evaluation shows that it enables strong security while preserving high utility across three distinct scenarios or benchmarks: AgentDojo, ASB, and AgentPoison. Furthermore, we perform an in-depth analysis, showcasing the effectiveness of its core components and the resilience of its automated policy generation against adaptive attacks. 
zh

[AI-32] Steering Prosocial AI Agents: Computational Basis of LLMs Decision Making in Social Simulation

【速读】:本文旨在探索大型语言模型(LLMs)在社会科学研究和实际应用中作为类人决策代理的行为机制。具体而言,研究关注如何通过人类赋予的特征及其所处情境影响LLM的行为,这一领域目前尚缺乏深入研究。论文提出并验证了一种方法,用于探测、量化以及调整LLM在独裁者博弈(Dictator Game)中的内部表征,该博弈是研究公平性和亲社会行为的经典实验。关键在于从LLM的内部状态提取“可变变化向量”(例如从“男性”到“女性”),并通过操控这些向量来显著改变变量与模型决策之间的关系。此方法为系统性研究和调节基于Transformer架构的模型内社会概念的编码与工程化提供了理论依据,并具有促进对齐(alignment)、去偏(debiasing)及设计学术与商业应用场景下社交模拟AI代理的重要意义。

链接: https://arxiv.org/abs/2504.11671
作者: Ji Ma
机构: 未知
类目: Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG); General Economics (econ.GN)
备注:

点击查看摘要

Abstract:Large language models (LLMs) increasingly serve as human-like decision-making agents in social science and applied settings. These LLM-agents are typically assigned human-like characters and placed in real-life contexts. However, how these characters and contexts shape an LLM’s behavior remains underexplored. This study proposes and tests methods for probing, quantifying, and modifying an LLM’s internal representations in a Dictator Game – a classic behavioral experiment on fairness and prosocial behavior. We extract vectors of "variable variations" (e.g., "male" to "female") from the LLM’s internal state. Manipulating these vectors during the model’s inference can substantially alter how those variables relate to the model’s decision-making. This approach offers a principled way to study and regulate how social concepts can be encoded and engineered within transformer-based models, with implications for alignment, debiasing, and designing AI agents for social simulations in both academic and commercial applications.
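摘要中"提取变量变化向量并在推理时操控"的思路,可用如下极简 NumPy 示意(仅为示意性草图:激活数据为随机模拟,`extract_variation_vector`、`steer` 及 `strength` 等名称均为本文假设,非论文官方实现):

```python
import numpy as np

def extract_variation_vector(acts_a, acts_b):
    """提取"变量变化"向量:两组条件下内部激活均值之差(如 "male" 到 "female")。"""
    return acts_b.mean(axis=0) - acts_a.mean(axis=0)

def steer(hidden_state, direction, strength=1.0):
    """推理时沿该向量平移隐藏状态,以改变变量与决策的关联(strength 为假设的操控强度)。"""
    return hidden_state + strength * direction

rng = np.random.default_rng(0)
acts_male = rng.normal(0.0, 1.0, size=(100, 8))    # "male" 条件下的激活(模拟)
acts_female = rng.normal(0.5, 1.0, size=(100, 8))  # "female" 条件下的激活(模拟)

v = extract_variation_vector(acts_male, acts_female)
h_steered = steer(np.zeros(8), v, strength=2.0)
print(h_steered.shape)  # (8,)
```

实际应用中,这类向量通常取自 Transformer 某一层在两组提示下隐藏状态的均值之差。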
zh

[AI-33] Improving LLM Interpretability and Performance via Guided Embedding Refinement for Sequential Recommendation

【速读】:该论文旨在解决在将大规模语言模型(Large Language Models, LLMs)集成到现有基础推荐系统时,因模型可解释性、透明性及安全性等问题引发的挑战。论文的关键解决方案是提出了一种名为“引导嵌入精炼”(guided embedding refinement)的方法,通过以一种有指导且可解释的方式利用LLMs,增强与基础推荐系统相关的嵌入表示。具体而言,该方法不直接将LLMs作为推荐系统的主干模型,而是将其作为辅助工具来模拟推荐的销售逻辑,并生成捕获领域相关语义信息的引导嵌入(guided embedding)。通过结合引导嵌入与降维后的基础嵌入,构建出经过精炼的嵌入表示,进而将其整合至推荐模块用于训练和推理。实验结果表明,该方法不仅提升了推荐性能,在平均倒数排名(Mean Reciprocal Rank, MRR)、召回率(Recall rate)和归一化折扣累积增益(Normalized Discounted Cumulative Gain, NDCG)等指标上取得了约10%到50%的提升,同时显著增强了模型的可解释性。

链接: https://arxiv.org/abs/2504.11658
作者: Nanshan Jia,Chenfei Yuan,Yuhang Wu,Zeyu Zheng
机构: 未知
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The fast development of Large Language Models (LLMs) offers growing opportunities to further improve sequential recommendation systems. Yet for some practitioners, integrating LLMs to their existing base recommendation systems raises questions about model interpretability, transparency and related safety. To partly alleviate challenges from these questions, we propose guided embedding refinement, a method that carries out a guided and interpretable usage of LLM to enhance the embeddings associated with the base recommendation system. Instead of directly using LLMs as the backbone of sequential recommendation systems, we utilize them as auxiliary tools to emulate the sales logic of recommendation and generate guided embeddings that capture domain-relevant semantic information on interpretable attributes. Benefiting from the strong generalization capabilities of the guided embedding, we construct refined embedding by using the guided embedding and reduced-dimension version of the base embedding. We then integrate the refined embedding into the recommendation module for training and inference. A range of numerical experiments demonstrate that guided embedding is adaptable to various given existing base embedding models, and generalizes well across different recommendation tasks. The numerical results show that the refined embedding not only improves recommendation performance, achieving approximately 10% to 50% gains in Mean Reciprocal Rank (MRR), Recall rate, and Normalized Discounted Cumulative Gain (NDCG), but also enhances interpretability, as evidenced by case studies.
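摘要所述"用引导嵌入与降维后的基础嵌入构造精炼嵌入"的组合步骤,可用如下示意代码表达(假设性草图:以截断 SVD 作降维、以拼接作组合,均为本文为演示所做的假设,论文未限定具体做法):

```python
import numpy as np

def reduce_dim(base_emb, k):
    """用截断 SVD 对基础嵌入降维(示意;论文未指定具体降维方法)。"""
    U, S, Vt = np.linalg.svd(base_emb - base_emb.mean(axis=0), full_matrices=False)
    return U[:, :k] * S[:k]

def refine(base_emb, guided_emb, k=4):
    """拼接 LLM 生成的引导嵌入与降维后的基础嵌入,得到精炼嵌入。"""
    return np.concatenate([guided_emb, reduce_dim(base_emb, k)], axis=1)

rng = np.random.default_rng(1)
base = rng.normal(size=(50, 32))    # 基础推荐模型的嵌入(模拟)
guided = rng.normal(size=(50, 8))   # 假设由 LLM 生成的可解释属性嵌入(模拟)
refined = refine(base, guided, k=4)
print(refined.shape)  # (50, 12)
```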
zh

[AI-34] Data driven approach towards more efficient Newton-Raphson power flow calculation for distribution grids

【速读】:该论文旨在解决电力系统潮流(Power Flow, PF)计算在接近电网容量极限时因病态条件和收敛问题导致的挑战。传统牛顿-拉夫森(Newton-Raphson, NR)方法虽然收敛速度快,但在复杂工况下容易发散或需要大量迭代。为此,论文提出通过改进NR方法的初始化策略来提升其性能,从而减少迭代次数并避免发散。解决方案的关键在于设计了三种方法:(i) 基于电压数学边界的解析方法估计吸引域;(ii) 利用监督学习或物理信息神经网络(Physics-Informed Neural Networks, PINNs)的数据驱动模型预测最优初始猜测;(iii) 增量调整电压的强化学习(Reinforcement Learning, RL)方法加速收敛。这些方法在基准系统上的实验验证表明,均能够显著提高NR方法的收敛效率,为现代高比例可再生能源接入的电力系统的实时高效运行提供了有效路径。

链接: https://arxiv.org/abs/2504.11650
作者: Shengyuan Yan,Farzad Vazinram,Zeynab Kaseb,Lindsay Spoor,Jochen Stiasny,Betul Mamudi,Amirhossein Heydarian Ardakani,Ugochukwu Orji,Pedro P. Vergara,Yu Xiang,Jerry Guo
机构: 未知
类目: Systems and Control (eess.SY); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Numerical Analysis (math.NA)
备注: 7 pages, 9 figures, 3 tables, 14 equations, 1 lemma, and 2 theorems. ICT for Industry 2025 Alliander usecase workshop paper. Oral presentation of this paper accepted and to be given on 16th April 2025 in this http URL 2025 conference of Netherlands in the Beatrix Theatre in Utrecht

点击查看摘要

Abstract:Power flow (PF) calculations are fundamental to power system analysis to ensure stable and reliable grid operation. The Newton-Raphson (NR) method is commonly used for PF analysis due to its rapid convergence when initialized properly. However, as power grids operate closer to their capacity limits, ill-conditioned cases and convergence issues pose significant challenges. This work, therefore, addresses these challenges by proposing strategies to improve NR initialization, hence minimizing iterations and avoiding divergence. We explore three approaches: (i) an analytical method that estimates the basin of attraction using mathematical bounds on voltages, (ii) Two data-driven models leveraging supervised learning or physics-informed neural networks (PINNs) to predict optimal initial guesses, and (iii) a reinforcement learning (RL) approach that incrementally adjusts voltages to accelerate convergence. These methods are tested on benchmark systems. This research is particularly relevant for modern power systems, where high penetration of renewables and decentralized generation require robust and scalable PF solutions. In experiments, all three proposed methods demonstrate a strong ability to provide an initial guess for Newton-Raphson method to converge with fewer steps. The findings provide a pathway for more efficient real-time grid operations, which, in turn, support the transition toward smarter and more resilient electricity networks.
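为直观说明"更好的初值可减少牛顿-拉夫森迭代次数"这一核心动机,下面用一个一维标量方程代替真实潮流方程做示意(f 与两个初值均为本文假设的玩具例子,非论文实验设置):

```python
def newton_raphson(f, df, x0, tol=1e-10, max_iter=50):
    """牛顿-拉夫森迭代:返回 (解, 迭代次数)。初值越接近解,迭代次数越少。"""
    x = x0
    for i in range(1, max_iter + 1):
        step = f(x) / df(x)
        x -= step
        if abs(step) < tol:
            return x, i
    return x, max_iter

f = lambda x: x ** 2 - 2.0   # 以求 sqrt(2) 为例代替真实潮流方程
df = lambda x: 2.0 * x

root_far, iters_far = newton_raphson(f, df, x0=100.0)  # 粗糙初值(类似"平启动")
root_near, iters_near = newton_raphson(f, df, x0=1.5)  # 假设由数据驱动模型给出的优质初值
print(iters_near < iters_far)  # True
```

论文的三类方法(解析界、监督/PINN 模型、强化学习)本质上都在为上述迭代提供更好的 `x0`。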
zh

[AI-35] Achieving Tighter Finite-Time Rates for Heterogeneous Federated Stochastic Approximation under Markovian Sampling

【速读】:本文旨在解决联邦随机逼近问题中涉及时间相关数据的协作强化学习(Reinforcement Learning, RL)和优化挑战。具体而言,研究了由 M 个代理组成的系统,每个代理具有特定的局部算子(可能是非线性的)。目标是通过服务器间歇性通信找到这些局部算子平均值的根。论文的独特之处在于允许每个代理具有马尔可夫数据(Markovian data),并且允许各代理的局部算子根存在异构性。现有相关工作未能保证收敛至期望点或展示协作的优势,且通常依赖投影步骤以确保迭代有界。本文的关键突破在于提出了一种新的算法 FedHSA,并证明其能够保证收敛至正确解,同时由于协作实现样本复杂度的 M 倍线性加速。这一结果首次在有限时间内建立了此类联邦设置的理论保证,且未依赖投影步骤,这需要处理复杂的时序相关性、多步本地计算以及异构算子引起的漂移效应之间的相互作用。本研究对具有函数逼近的广义异构联邦 RL 问题(如策略评估与控制)具有重要影响,尤其是在代理的马尔可夫决策过程的概率转移核和奖励函数不同的场景下。

链接: https://arxiv.org/abs/2504.11645
作者: Feng Zhu,Aritra Mitra,Robert W. Heath
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Systems and Control (eess.SY); Optimization and Control (math.OC)
备注:

点击查看摘要

Abstract:Motivated by collaborative reinforcement learning (RL) and optimization with time-correlated data, we study a generic federated stochastic approximation problem involving M agents, where each agent is characterized by an agent-specific (potentially nonlinear) local operator. The goal is for the agents to communicate intermittently via a server to find the root of the average of the agents’ local operators. The generality of our setting stems from allowing for (i) Markovian data at each agent and (ii) heterogeneity in the roots of the agents’ local operators. The limited recent work that has accounted for both these features in a federated setting fails to guarantee convergence to the desired point or to show any benefit of collaboration; furthermore, they rely on projection steps in their algorithms to guarantee bounded iterates. Our work overcomes each of these limitations. We develop a novel algorithm titled FedHSA, and prove that it guarantees convergence to the correct point, while enjoying an M-fold linear speedup in sample-complexity due to collaboration. To our knowledge, this is the first finite-time result of its kind, and establishing it (without relying on a projection step) entails a fairly intricate argument that accounts for the interplay between complex temporal correlations due to Markovian sampling, multiple local steps to save communication, and the drift-effects induced by heterogeneous local operators. Our results have implications for a broad class of heterogeneous federated RL problems (e.g., policy evaluation and control) with function approximation, where the agents’ Markov decision processes can differ in their probability transition kernels and reward functions.
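FedHSA"本地多步 + 服务器间歇平均"的结构,可用如下玩具线性算子示意(假设性草图:取 G_i(x)=x-b_i,步长、轮数与独立噪声均为本文设定,仅用于演示收敛到平均算子的根,并非论文算法本身):

```python
import numpy as np

def fed_sa(roots, eta=0.1, local_steps=5, rounds=200, seed=0):
    """联邦随机逼近示意:M 个代理各自对局部算子 G_i(x)=x-b_i 做本地 SA 步,
    服务器间歇性平均,迭代收敛到平均算子的根 mean(b_i)。"""
    rng = np.random.default_rng(seed)
    M = len(roots)
    x = np.zeros(M)  # 每个代理的本地迭代值
    for _ in range(rounds):
        for _ in range(local_steps):
            noise = rng.normal(0.0, 0.01, size=M)  # 模拟采样噪声(论文为马尔可夫噪声,此处简化为独立噪声)
            x -= eta * ((x - roots) + noise)       # 本地算子步
        x[:] = x.mean()                            # 服务器平均(通信)
    return x[0]

roots = np.array([1.0, 2.0, 3.0, 6.0])  # 异构的局部算子根
sol = fed_sa(roots)
print(round(sol, 1))  # 约 3.0(平均根)
```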
zh

[AI-36] Possibility for Proactive Anomaly Detection ICLR2025

【速读】:该论文旨在解决现有时间序列异常检测模型因依赖模型输出与真实值之间误差来检测异常而导致实用性受限的问题。论文提出了一种基于专门用于异常检测的时间序列预测模型和数据驱动异常检测模型的主动式(proactive)解决方案。关键在于通过数据驱动的异常检测模型从训练数据中建立异常阈值,并利用该阈值识别超出阈值的预测值以检测异常。这种方法不仅提升了检测的实用性,还通过四个基准数据集的广泛评估验证了模型性能,并分析了可预测和不可预测的异常类型。

链接: https://arxiv.org/abs/2504.11623
作者: Jinsung Jeon,Jaehyeon Park,Sewon Park,Jeongwhan Choi,Minjung Kim,Noseong Park
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Accepted at ICLR 2025 I Can’t Believe It’s Not Better: Challenges in Applied Deep Learning Workshop (ICBINB)

点击查看摘要

Abstract:Time-series anomaly detection, which detects errors and failures in a workflow, is one of the most important topics in real-world applications. The purpose of time-series anomaly detection is to reduce potential damages or losses. However, existing anomaly detection models detect anomalies through the error between the model output and the ground truth (observed) value, which makes them impractical. In this work, we present a proactive approach for time-series anomaly detection based on a time-series forecasting model specialized for anomaly detection and a data-driven anomaly detection model. Our proactive approach establishes an anomaly threshold from training data with a data-driven anomaly detection model, and anomalies are subsequently detected by identifying predicted values that exceed the anomaly threshold. In addition, we extensively evaluated the model using four anomaly detection benchmarks and analyzed both predictable and unpredictable anomalies. We attached the source code as supplementary material.
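摘要中"从训练数据建立异常阈值,再对预测值做阈值判断"的主动式流程,可用如下示意代码表达(以分位数阈值充当假设的数据驱动模型,数据均为本文模拟,非论文具体实现):

```python
import numpy as np

def fit_threshold(train_values, q=0.999):
    """数据驱动地从训练数据建立异常阈值(此处用经验分位数示意)。"""
    return np.quantile(train_values, q)

def proactive_detect(forecasts, threshold):
    """主动式检测:对"预测值"而非已发生的观测值做阈值判断,在异常发生前预警。"""
    return forecasts > threshold

rng = np.random.default_rng(2)
train = rng.normal(0.0, 1.0, size=10_000)  # 正常工况的训练数据(模拟)
thr = fit_threshold(train)
forecasts = np.array([0.5, 1.2, 8.0])      # 预测模型给出的未来值(假设)
flags = proactive_detect(forecasts, thr)
print(flags.tolist())  # [False, False, True]
```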
zh

[AI-37] MULTI-LF: A Unified Continuous Learning Framework for Real-Time DDoS Detection in Multi-Environment Networks

【速读】:该论文旨在解决多环境(M-En)网络中分布式拒绝服务(DDoS)攻击检测面临的挑战,特别是现有基于AI的检测系统难以适应新型攻击策略、缺乏高精度和高效实时检测能力的问题。论文的关键创新在于提出了一种在线、持续学习的DDoS检测方法,通过构建一个多层级框架(MULTI-LF),结合两个机器学习模型:轻量级模型M1用于快速初步检测,复杂且高精度的模型M2用于验证与模型优化。当M1对预测结果置信度较低时,将决策转交至M2进行进一步验证,并利用M2的反馈对M1进行微调;若两者均表现出低置信度,则触发人工干预以更新模型类别,从而增强对未知攻击模式的适应性。此外,论文通过NS-3工具搭建了包含真实受害者与僵尸设备的仿真环境,模拟了多种IoT及传统IP环境下的DDoS攻击场景,验证了所提方法在分类准确性(0.999)和低延迟(0.866秒)方面的优越性能,同时展示了其在内存占用(3.632 MB)和CPU利用率(10.05%)上的高效特性。

链接: https://arxiv.org/abs/2504.11575
作者: Furqan Rustam,Islam Obaidat,Anca Delia Jurcut
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Detecting Distributed Denial of Service (DDoS) attacks in Multi-Environment (M-En) networks presents significant challenges due to diverse malicious traffic patterns and the evolving nature of cyber threats. Existing AI-based detection systems struggle to adapt to new attack strategies and lack real-time attack detection capabilities with high accuracy and efficiency. This study proposes an online, continuous learning methodology for DDoS detection in M-En networks, enabling continuous model updates and real-time adaptation to emerging threats, including zero-day attacks. First, we develop a unique M-En network dataset by setting up a realistic, real-time simulation using the NS-3 tool, incorporating both victim and bot devices. DDoS attacks with varying packet sizes are simulated using the DDoSim application across IoT and traditional IP-based environments under M-En network criteria. Our approach employs a multi-level framework (MULTI-LF) featuring two machine learning models: a lightweight Model 1 (M1) trained on a selective, critical packet dataset for fast and efficient initial detection, and a more complex, highly accurate Model 2 (M2) trained on extensive data. When M1 exhibits low confidence in its predictions, the decision is escalated to M2 for verification and potential fine-tuning of M1 using insights from M2. If both models demonstrate low confidence, the system flags the incident for human intervention, facilitating model updates with human-verified categories to enhance adaptability to unseen attack patterns. We validate the MULTI-LF through real-world simulations, demonstrating superior classification accuracy of 0.999 and low prediction latency of 0.866 seconds compared to established baselines. Furthermore, we evaluate performance in terms of memory usage (3.632 MB) and CPU utilization (10.05%) in real-time scenarios.
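MULTI-LF 中 M1/M2 按置信度逐级升级、双低置信转人工的决策流,可示意如下(m1、m2 为假设的桩模型,阈值 tau1/tau2 为演示用参数;M2 反馈微调 M1 的环节此处省略):

```python
def multi_level_detect(x, m1, m2, tau1=0.9, tau2=0.9):
    """MULTI-LF 分级决策示意:M1 低置信时升级到 M2,两者都低置信则转人工。
    m1/m2 返回 (label, confidence)。"""
    label, conf = m1(x)
    if conf >= tau1:
        return label, "M1"
    label, conf = m2(x)           # 升级到高精度模型验证(论文中并据此微调 M1)
    if conf >= tau2:
        return label, "M2"
    return None, "human"          # 双低置信:转人工,用人工标签更新模型

# 用假设的桩模型演示决策流
m1 = lambda x: ("ddos", 0.95) if x > 10 else ("benign", 0.5)
m2 = lambda x: ("ddos", 0.97) if x > 5 else ("benign", 0.4)

print(multi_level_detect(20, m1, m2))  # ('ddos', 'M1')
print(multi_level_detect(7, m1, m2))   # ('ddos', 'M2')
print(multi_level_detect(1, m1, m2))   # (None, 'human')
```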
zh

[AI-38] Perceptions of Agentic AI in Organizations: Implications for Responsible AI and ROI

【速读】:该论文试图解决组织在应对日益自主的生成式 AI (Generative AI) 系统时,如何构建和实施稳健的责任 AI (Responsible AI) 框架的问题。论文的关键在于通过解析 AI 专业人士的实际经验,揭示责任 AI 的复杂性及其实施过程中因知识缺口、利益相关者参与不足以及过度关注控制所导致的挑战。研究强调,这些因素阻碍了组织的有效适应与实施,从而影响责任 AI 的潜力及投资回报率 (ROI) 的实现。因此,解决方案的关键在于优化知识获取、强化利益相关者参与,并平衡控制需求与灵活适应能力。

链接: https://arxiv.org/abs/2504.11564
作者: Lee Ackerman
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注: 26 pages, 15 figures

点击查看摘要

Abstract:As artificial intelligence (AI) systems rapidly gain autonomy, the need for robust responsible AI frameworks becomes paramount. This paper investigates how organizations perceive and adapt such frameworks amidst the emerging landscape of increasingly sophisticated agentic AI. Employing an interpretive qualitative approach, the study explores the lived experiences of AI professionals. Findings highlight that the inherent complexity of agentic AI systems and their responsible implementation, rooted in the intricate interconnectedness of responsible AI dimensions and the thematic framework (an analytical structure developed from the data), combined with the novelty of agentic AI, contribute to significant challenges in organizational adaptation, characterized by knowledge gaps, a limited emphasis on stakeholder engagement, and a strong focus on control. These factors, by hindering effective adaptation and implementation, ultimately compromise the potential for responsible AI and the realization of ROI.
zh

[AI-39] Error Broadcast and Decorrelation as a Potential Artificial and Natural Learning Mechanism

【速读】:该论文试图解决神经网络中的信用分配(credit assignment)问题,即如何有效地将输出误差分配到网络的不同层以更新权重。为了解决这一问题,论文提出了Error Broadcast and Decorrelation (EBD)算法。其关键在于通过直接广播输出误差到各个层,并利用最优最小均方误差(MMSE)估计器的随机正交性特性,定义逐层损失函数来惩罚层激活与输出误差之间的相关性,从而实现无权值传输(weight transport free)的误差广播机制。这一方法不仅在实验中自然导出了三因子学习规则,还与生物可塑性框架兼容,提升了性能与生物学合理性。

链接: https://arxiv.org/abs/2504.11558
作者: Mete Erdogan,Cengiz Pehlevan,Alper T. Erdogan
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:We introduce the Error Broadcast and Decorrelation (EBD) algorithm, a novel learning framework that addresses the credit assignment problem in neural networks by directly broadcasting output error to individual layers. Leveraging the stochastic orthogonality property of the optimal minimum mean square error (MMSE) estimator, EBD defines layerwise loss functions to penalize correlations between layer activations and output errors, offering a principled approach to error broadcasting without the need for weight transport. The optimization framework naturally leads to the experimentally observed three-factor learning rule and integrates with biologically plausible frameworks to enhance performance and plausibility. Numerical experiments demonstrate that EBD achieves performance comparable to or better than known error-broadcast methods on benchmark datasets. While the scalability of EBD to very large or complex datasets remains to be further explored, our findings suggest it provides a biologically plausible, efficient, and adaptable alternative for neural network training. This approach could inform future advancements in artificial and natural learning paradigms.
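EBD 的逐层损失(惩罚层激活与广播的输出误差之间的相关性)可按如下方式示意:此处以交叉相关矩阵的 Frobenius 范数平方作为惩罚项,属本文为演示选取的一种具体写法,非论文公式本身;数据均为随机模拟。

```python
import numpy as np

def ebd_layer_loss(activations, output_error):
    """EBD 逐层损失示意:惩罚层激活与广播输出误差之间的交叉相关性。
    activations: (N, d_layer), output_error: (N, d_out)。"""
    N = activations.shape[0]
    a = activations - activations.mean(axis=0)
    e = output_error - output_error.mean(axis=0)
    cross = a.T @ e / N            # 各激活维与各误差维的交叉相关矩阵
    return np.sum(cross ** 2)      # Frobenius 范数平方作为惩罚项

rng = np.random.default_rng(3)
err = rng.normal(size=(256, 4))
act_corr = err @ rng.normal(size=(4, 16))   # 与误差线性相关的激活(损失应较大)
act_indep = rng.normal(size=(256, 16))      # 与误差近似独立的激活(损失应较小)

loss_corr = ebd_layer_loss(act_corr, err)
loss_indep = ebd_layer_loss(act_indep, err)
print(loss_corr > loss_indep)  # True
```

最小化该损失即推动各层激活与输出误差去相关,这正是论文利用 MMSE 估计器随机正交性所刻画的目标。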
zh

[AI-40] Probabilistic causal graphs as categorical data synthesizers: Do they do better than Gaussian Copulas and Conditional Tabular GANs?

【速读】:该论文旨在解决高质量合成分类数据(如调查数据)的生成问题,同时确保在保护隐私的前提下保留数据间的因果关系。论文的关键在于结合结构方程模型(Structural Equation Modeling, SEM)与贝叶斯网络(Bayesian Networks, BN),通过构建因果图模型来表示变量间的因果关系并捕获联合分布。研究以针对残障人士服务可及性调查的分类数据为例,验证了基于SEM的BN方法相较于其他方法(如概率高斯Copula技术和条件表格生成对抗网络CTGAN)在统计度量(如卡方检验、Kullback-Leibler散度和总变差距离TVD)上的优越性,尤其在保持数据统计有效性和隐私保护方面展现了显著优势。因此,该方法特别适用于敏感数据的研究场景,如无障碍与残障相关领域。

链接: https://arxiv.org/abs/2504.11547
作者: Olha Shaposhnyk,Noor Abid,Mouri Zakir,Svetlana Yanushkevich
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:This study investigates the generation of high-quality synthetic categorical data, such as survey data, using causal graph models. Generating synthetic data aims not only to create a variety of data for training the models but also to preserve privacy while capturing relationships between the data. The research employs Structural Equation Modeling (SEM) followed by Bayesian Networks (BN). We used the categorical data that are based on the survey of accessibility to services for people with disabilities. We created both SEM and BN models to represent causal relationships and to capture joint distributions between variables. In our case studies, such variables include, in particular, demographics, types of disability, types of accessibility barriers and frequencies of encountering those barriers. The study compared the SEM-based BN method with alternative approaches, including the probabilistic Gaussian copula technique and generative models like the Conditional Tabular Generative Adversarial Network (CTGAN). The proposed method outperformed others in statistical metrics, including the Chi-square test, Kullback-Leibler divergence, and Total Variation Distance (TVD). In particular, the BN model demonstrated superior performance, achieving the highest TVD, indicating alignment with the original data. The Gaussian Copula ranked second, while CTGAN exhibited moderate performance. These analyses confirmed the ability of the SEM-based BN to produce synthetic data that maintain statistical and relational validity while maintaining confidentiality. This approach is particularly beneficial for research on sensitive data, such as accessibility and disability studies.
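摘要中用于评价合成数据质量的总变差距离(TVD)定义简单,可自行复算(按标准定义 TVD = 0.5 * Σ|p - q|,距离越小表示两分布越接近;下列频数均为本文虚构的演示数据,非论文结果):

```python
import numpy as np

def total_variation_distance(p, q):
    """总变差距离:TVD = 0.5 * Σ|p - q|,p、q 先归一化为概率分布。"""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return 0.5 * np.abs(p / p.sum() - q / q.sum()).sum()

# 某个类别变量(如"障碍类型")在原始数据与两份合成数据中的频数(虚构)
original = [40, 30, 20, 10]
synth_a = [38, 31, 21, 10]   # 与原始分布接近的合成数据
synth_b = [25, 25, 25, 25]   # 偏离原始分布的合成数据

tvd_a = total_variation_distance(original, synth_a)  # 0.02
tvd_b = total_variation_distance(original, synth_b)  # 0.20
print(tvd_a < tvd_b)  # True
```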
zh

[AI-41] NodeRAG: Structuring Graph-based RAG with Heterogeneous Nodes

【速读】:本文旨在解决现有基于图的检索增强生成(Retrieval-Augmented Generation, RAG)方法中图结构设计不足的问题。当前方法通常忽视了图结构的设计,导致算法整合困难、工作流不一致以及性能下降。为充分挖掘图在RAG中的潜力,论文提出了一种名为NodeRAG的图中心框架,引入异构图结构,实现图相关方法与RAG工作流的无缝且全面集成。该框架紧密契合大型语言模型(LLMs)的能力,确保端到端流程的高度协调性和效率。通过广泛实验验证,NodeRAG在索引时间、查询时间、存储效率以及多跳基准测试和开放性对比评估中的问答性能方面均优于先前方法,如GraphRAG和LightRAG,并且使用最少的检索标记即可实现卓越表现。

链接: https://arxiv.org/abs/2504.11544
作者: Tianyang Xu,Haojie Zheng,Chengze Li,Haoxiang Chen,Yixin Liu,Ruoxi Chen,Lichao Sun
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Retrieval-augmented generation (RAG) empowers large language models to access external and private corpus, enabling factually consistent responses in specific domains. By exploiting the inherent structure of the corpus, graph-based RAG methods further enrich this process by building a knowledge graph index and leveraging the structural nature of graphs. However, current graph-based RAG approaches seldom prioritize the design of graph structures. Inadequately designed graph not only impede the seamless integration of diverse graph algorithms but also result in workflow inconsistencies and degraded performance. To further unleash the potential of graph for RAG, we propose NodeRAG, a graph-centric framework introducing heterogeneous graph structures that enable the seamless and holistic integration of graph-based methodologies into the RAG workflow. By aligning closely with the capabilities of LLMs, this framework ensures a fully cohesive and efficient end-to-end process. Through extensive experiments, we demonstrate that NodeRAG exhibits performance advantages over previous methods, including GraphRAG and LightRAG, not only in indexing time, query time, and storage efficiency but also in delivering superior question-answering performance on multi-hop benchmarks and open-ended head-to-head evaluations with minimal retrieval tokens. Our GitHub repository could be seen at this https URL.
zh

[AI-42] REAL: Benchmarking Autonomous Agents on Deterministic Simulations of Real Websites

【速读】:本文旨在解决多轮智能体在模拟真实世界网站环境中导航与完成任务能力不足的问题。论文提出REAL(Replica Assessment Environment for Agents)这一基准测试平台与评估框架,其关键在于通过高保真、确定性的11个跨领域(如电商、旅行、通信和职业社交)网站仿真环境,以及包含112项实用任务的基准测试集,全面评估智能体的信息检索准确性和状态改变操作的可靠性。REAL采用基于程序化检查与基于规则的大语言模型(LLM)判断相结合的创新评估方法,支持开源及专有智能体系统的灵活测试,并通过浏览器黑盒命令实现可控环境下的安全、可复现评价。实验结果显示,当前前沿语言模型在REAL上的成功率仅为41%,凸显了现有系统在自主网页导航与任务完成方面的重要差距。关键解决方案在于构建高度可控且可扩展的仿真环境与评估体系,以填补智能体能力的不足并促进模型训练与改进。相关资源已公开可用。

链接: https://arxiv.org/abs/2504.11543
作者: Divyansh Garg,Shaun VanWeelden,Diego Caples,Andis Draguns,Nikil Ravi,Pranav Putta,Naman Garg,Tomas Abraham,Michael Lara,Federico Lopez,James Liu,Atharva Gundawar,Prannay Hebbar,Youngchul Joo,Charles London,Christian Schroeder de Witt,Sumeet Motwani
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:We introduce REAL, a benchmark and framework for multi-turn agent evaluations on deterministic simulations of real-world websites. REAL comprises high-fidelity, deterministic replicas of 11 widely-used websites across domains such as e-commerce, travel, communication, and professional networking. We also release a benchmark consisting of 112 practical tasks that mirror everyday complex user interactions requiring both accurate information retrieval and state-changing actions. All interactions occur within this fully controlled setting, eliminating safety risks and enabling robust, reproducible evaluation of agent capability and reliability. Our novel evaluation framework combines programmatic checks of website state for action-based tasks with rubric-guided LLM-based judgments for information retrieval. The framework supports both open-source and proprietary agent systems through a flexible evaluation harness that accommodates black-box commands within browser environments, allowing research labs to test agentic systems without modification. Our empirical results show that frontier language models achieve at most a 41% success rate on REAL, highlighting critical gaps in autonomous web navigation and task completion capabilities. Our framework supports easy integration of new tasks, reproducible evaluation, and scalable data generation for training web agents. The websites, framework, and leaderboard are available at this https URL and this https URL.
zh

[AI-43] Enhancing Autonomous Driving Systems with On-Board Deployed Large Language Models

【速读】:该论文旨在解决自动驾驶领域中神经网络在处理真实驾驶场景中的边缘情况(edge-case scenarios)时存在的局限性。由于监督学习训练的神经网络难以应对未穷尽的数据集所覆盖的所有极端情况,导致其在检测意外驾驶行为方面缺乏鲁棒性。为弥补数据驱动方法的不足,论文提出结合知识驱动的方法,通过模仿人类直觉来增强决策能力。解决方案的关键在于设计了一种混合架构,将低级别的模型预测控制器(Model Predictive Controller, MPC)与本地部署的大语言模型(Large Language Models, LLMs)相结合。具体而言,“DecisionxLLM”模块用于评估机器人状态信息并确保符合预期驾驶行为,而“MPCxLLM”模块则依据LLMs生成的洞见调整MPC参数,从而实现控制适应性的同时保持传统MPC系统的安全性和约束保证。此外,为了实现实时部署并减少对云连接的依赖,论文提出了利用检索增强生成(Retrieval Augmented Generation, RAG)、低秩适应(Low Rank Adaptation, LoRA)微调以及量化技术的方案。实验结果表明,这些改进显著提高了推理准确性(提升10.45%)、控制适应性(提升52.2%)及计算效率(提升10.5倍),验证了该框架在资源受限平台上的实用性和实时部署可行性。这一工作实现了高级别决策与低级别控制适应性的融合,为知识驱动与自适应的自动驾驶系统提供了协同工作的框架。

链接: https://arxiv.org/abs/2504.11514
作者: Nicolas Baumann,Cheng Hu,Paviththiren Sivasothilingam,Haotong Qin,Lei Xie,Michele Magno,Luca Benini
机构: 未知
类目: Artificial Intelligence (cs.AI); Robotics (cs.RO)
备注:

点击查看摘要

Abstract:Neural Networks (NNs) trained through supervised learning struggle with managing edge-case scenarios common in real-world driving due to the intractability of exhaustive datasets covering all edge-cases, making knowledge-driven approaches, akin to how humans intuitively detect unexpected driving behavior, a suitable complement to data-driven methods. This work proposes a hybrid architecture combining low-level Model Predictive Controller (MPC) with locally deployed Large Language Models (LLMs) to enhance decision-making and Human Machine Interaction (HMI). The DecisionxLLM module evaluates robotic state information against natural language instructions to ensure adherence to desired driving behavior. The MPCxLLM module then adjusts MPC parameters based on LLM-generated insights, achieving control adaptability while preserving the safety and constraint guarantees of traditional MPC systems. Further, to enable efficient on-board deployment and to eliminate dependency on cloud connectivity, we shift processing to the on-board computing platform: We propose an approach that exploits Retrieval Augmented Generation (RAG), Low Rank Adaptation (LoRA) fine-tuning, and quantization. Experimental results demonstrate that these enhancements yield significant improvements in reasoning accuracy by up to 10.45%, control adaptability by as much as 52.2%, and up to 10.5x increase in computational efficiency (tokens/s), validating the proposed framework’s practicality for real-time deployment even on down-scaled robotic platforms. This work bridges high-level decision-making with low-level control adaptability, offering a synergistic framework for knowledge-driven and adaptive Autonomous Driving Systems (ADS).
zh

[AI-44] Position Paper: Rethinking Privacy in RL for Sequential Decision-making in the Age of LLMs IJCNN2025

【速读】:该论文试图解决强化学习(Reinforcement Learning, RL)在关键实际应用中隐私保护不足的问题。传统隐私框架主要针对孤立数据点进行保护,无法应对从时间模式、行为策略及协作动态中显现敏感信息的序列决策系统。现代RL范式,如联邦RL(Federated RL, FedRL)和基于人类反馈的大型语言模型(Large Language Models, LLMs)中的RL(RL with Human Feedback, RLHF),因引入复杂、交互且上下文依赖的学习环境而加剧了这些挑战。论文提出的关键解决方案是构建一个基于四大核心原则的新隐私范式:多尺度保护、行为模式保护、协作隐私保护以及上下文感知适应。通过这些原则,论文揭示了隐私、效用与可解释性之间的固有权衡,并呼吁开发新的理论框架、实用机制及严格评估方法,以实现序列决策系统的有效隐私保护。

链接: https://arxiv.org/abs/2504.11511
作者: Flint Xiaofeng Fan,Cheston Tan,Roger Wattenhofer,Yew-Soon Ong
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Accepted to IJCNN 2025 Position Paper Track

点击查看摘要

Abstract:The rise of reinforcement learning (RL) in critical real-world applications demands a fundamental rethinking of privacy in AI systems. Traditional privacy frameworks, designed to protect isolated data points, fall short for sequential decision-making systems where sensitive information emerges from temporal patterns, behavioral strategies, and collaborative dynamics. Modern RL paradigms, such as federated RL (FedRL) and RL with human feedback (RLHF) in large language models (LLMs), exacerbate these challenges by introducing complex, interactive, and context-dependent learning environments that traditional methods do not address. In this position paper, we argue for a new privacy paradigm built on four core principles: multi-scale protection, behavioral pattern protection, collaborative privacy preservation, and context-aware adaptation. These principles expose inherent tensions between privacy, utility, and interpretability that must be navigated as RL systems become more pervasive in high-stakes domains like healthcare, autonomous vehicles, and decision support systems powered by LLMs. To tackle these challenges, we call for the development of new theoretical frameworks, practical mechanisms, and rigorous evaluation methodologies that collectively enable effective privacy protection in sequential decision-making systems.
zh

[AI-45] RAID: An In-Training Defense against Attribute Inference Attacks in Recommender Systems

【速读】:该论文旨在解决推荐系统中用户属性推断攻击的问题,即如何在保持推荐性能的同时减轻攻击者利用部分暴露的用户信息(如嵌入向量)推断目标用户的敏感属性(如性别和政治观点)的能力。现有防御方法多集中于训练后的调整(post-training settings),无法充分利用训练数据以维持推荐效果;而对抗性训练虽扩展到训练中(in-training settings),但常因训练过程不稳定而导致收敛困难。论文的关键解决方案是提出RAID(Recommender Systems Attribute Inference Defense),一种在训练过程中进行防御的方法。其核心在于定义了一个防御目标,通过最优传输(optimal transport)将用户分布与一个满足约束条件的中心分布对齐,使得受保护属性与类别标签相互独立,从而实现对属性推断攻击的有效抵抗,同时确保推荐性能不受损。这一目标具体表现为求解一个受限的Wasserstein重心问题,以识别使属性不可区分的中心分布,并在推荐性能约束下优化模型。

链接: https://arxiv.org/abs/2504.11510
作者: Xiaohua Feng,Yuyuan Li,Fengyuan Yu,Ke Xiong,Junjie Fang,Li Zhang,Tianyu Du,Chaochao Chen
机构: 未知
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Computers and Society (cs.CY); Machine Learning (cs.LG)
备注: 17 pages

点击查看摘要

Abstract:In various networks and mobile applications, users are highly susceptible to attribute inference attacks, with particularly prevalent occurrences in recommender systems. Attackers exploit partially exposed user profiles in recommendation models, such as user embeddings, to infer private attributes of target users, such as gender and political views. The goal of defenders is to mitigate the effectiveness of these attacks while maintaining recommendation performance. Most existing defense methods, such as differential privacy and attribute unlearning, focus on post-training settings, which limits their capability of utilizing training data to preserve recommendation performance. Although adversarial training extends defenses to in-training settings, it often struggles with convergence due to unstable training processes. In this paper, we propose RAID, an in-training defense method against attribute inference attacks in recommender systems. In addition to the recommendation objective, we define a defensive objective to ensure that the distribution of protected attributes becomes independent of class labels, making users indistinguishable from attribute inference attacks. Specifically, this defensive objective aims to solve a constrained Wasserstein barycenter problem to identify the centroid distribution that makes the attribute indistinguishable while complying with recommendation performance constraints. To optimize our proposed objective, we use optimal transport to align users with the centroid distribution. We conduct extensive experiments on four real-world datasets to evaluate RAID. The experimental results validate the effectiveness of RAID and demonstrate its significant superiority over existing methods in multiple aspects.
zh

[AI-46] A Framework for the Private Governance of Frontier Artificial Intelligence

【速读】:该论文试图解决如何有效治理前沿人工智能(Frontier AI)系统的问题。论文提出的关键解决方案是一个混合型的公私合作治理体系,其中私营机构在政府授权和监督下,以自愿为基础为前沿AI系统的开发者提供认证。这一方案的核心在于通过为参与认证的前沿AI公司提供客户误用模型的侵权责任保护,激励其主动参与治理过程,同时平衡政治经济、制度、法律、安全等多方面的利弊权衡。

链接: https://arxiv.org/abs/2504.11501
作者: Dean W. Ball
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:This paper presents a proposal for the governance of frontier AI systems through a hybrid public-private system. Private bodies, authorized and overseen by government, provide certifications to developers of frontier AI systems on an opt-in basis. In exchange for opting in, frontier AI firms receive protections from tort liability for customer misuse of their models. Before detailing the proposal, the paper explores more commonly discussed approaches to AI governance, analyzing their strengths and flaws. It also examines the nature of frontier AI governance itself. The paper includes consideration of the political economic, institutional, legal, safety, and other merits and tradeoffs inherent in the governance system it proposes.
zh

[AI-47] Towards Interpretable Deep Generative Models via Causal Representation Learning

【速读】:该论文旨在解决生成式 AI 中深度神经网络难以解释的问题,其核心在于通过因果表示学习(Causal Representation Learning, CRL)构建灵活、可解释且可迁移的生成式 AI 模型。CRL 的关键是将因果性作为构建工具,结合潜在变量模型(如因子分析)、包含潜在变量的因果图模型以及非参数统计与深度学习的方法,以揭示复杂多模态数据的隐含表示。这种解决方案的关键在于从统计视角整合经典模型,并关注统计与因果识别结果,从而实现对深度神经网络表示的透明化和系统化理解。

链接: https://arxiv.org/abs/2504.11609
作者: Gemma E. Moran,Bryon Aragam
机构: 未知
类目: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Methodology (stat.ME)
备注:

点击查看摘要

Abstract:Recent developments in generative artificial intelligence (AI) rely on machine learning techniques such as deep learning and generative modeling to achieve state-of-the-art performance across wide-ranging domains. These methods’ surprising performance is due in part to their ability to learn implicit "representations" of complex, multi-modal data. Unfortunately, deep neural networks are notoriously black boxes that obscure these representations, making them difficult to interpret or analyze. To resolve these difficulties, one approach is to build new interpretable neural network models from the ground up. This is the goal of the emerging field of causal representation learning (CRL) that uses causality as a vector for building flexible, interpretable, and transferable generative AI. CRL can be seen as a culmination of three intrinsically statistical problems: (i) latent variable models such as factor analysis; (ii) causal graphical models with latent variables; and (iii) nonparametric statistics and deep learning. This paper reviews recent progress in CRL from a statistical perspective, focusing on connections to classical models and statistical and causal identifiability results. This review also highlights key application areas, implementation strategies, and open statistical questions in CRL.
zh

机器学习

[LG-0] Edge Intelligence for Wildlife Conservation: Real-Time Hornbill Call Classification Using TinyML

链接: https://arxiv.org/abs/2504.12272
作者: Kong Ka Hing,Mehran Behjati
类目: Sound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
*备注: This is a preprint version of a paper accepted and published in Springer Lecture Notes in Networks and Systems. The final version is available at this https URL

点击查看摘要

Abstract:Hornbills, an iconic species of Malaysia’s biodiversity, face threats from habitat loss, poaching, and environmental changes, necessitating accurate and real-time population monitoring that is traditionally challenging and resource intensive. The emergence of Tiny Machine Learning (TinyML) offers a chance to transform wildlife monitoring by enabling efficient, real-time data analysis directly on edge devices. Addressing the challenge of wildlife conservation, this research paper explores the pivotal role of machine learning, specifically TinyML, in the classification and monitoring of hornbill calls in Malaysia. Leveraging audio data from the Xeno-canto database, the study aims to develop a speech recognition system capable of identifying and classifying hornbill vocalizations. The proposed methodology involves pre-processing the audio data, extracting features using Mel-Frequency Energy (MFE), and deploying the model on an Arduino Nano 33 BLE, which is adept at edge computing. The research encompasses foundational work, including a comprehensive introduction, literature review, and methodology. The model is trained using Edge Impulse and validated through real-world tests, achieving high accuracy in hornbill species identification. The project underscores the potential of TinyML for environmental monitoring and its broader application in ecological conservation efforts, contributing to both the field of TinyML and wildlife conservation.
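Since the pipeline hinges on Mel-Frequency Energy (MFE) features, a minimal numpy sketch of the idea may help: one frame's power spectrum is passed through triangular mel-scale filters and log-compressed. This is an illustrative toy (one frame, 8 filters, a synthetic 2 kHz tone), not Edge Impulse's actual MFE block:

```python
import numpy as np

def mel_energies(signal, sr=16000, n_fft=512, n_mels=8):
    """Toy Mel-Frequency Energy features for a single frame: FFT power
    spectrum weighted by triangular filters spaced evenly on the mel scale."""
    hz2mel = lambda f: 2595 * np.log10(1 + f / 700)
    mel2hz = lambda m: 700 * (10 ** (m / 2595) - 1)
    spectrum = np.abs(np.fft.rfft(signal[:n_fft], n_fft)) ** 2
    freqs = np.fft.rfftfreq(n_fft, 1 / sr)
    mel_pts = mel2hz(np.linspace(0, hz2mel(sr / 2), n_mels + 2))
    energies = []
    for i in range(n_mels):
        lo, mid, hi = mel_pts[i], mel_pts[i + 1], mel_pts[i + 2]
        up = np.clip((freqs - lo) / (mid - lo), 0, 1)      # rising edge
        down = np.clip((hi - freqs) / (hi - mid), 0, 1)    # falling edge
        energies.append(np.sum(spectrum * np.minimum(up, down)))
    return np.log(np.array(energies) + 1e-10)

t = np.arange(512) / 16000
tone = np.sin(2 * np.pi * 2000 * t)  # synthetic 2 kHz "call"
mfe = mel_energies(tone)
print(mfe.shape)  # → (8,) — a compact feature vector for the classifier
```

A vector this small is what makes on-device classification on an Arduino Nano 33 BLE plausible: the heavy lifting is a fixed filterbank, not a large model.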

[LG-1] Comparative analysis of unsupervised clustering techniques using validation metrics: Study on cognitive features from the Canadian Longitudinal Study on Aging (CLSA)

链接: https://arxiv.org/abs/2504.12270
作者: ChenNingZhi Sheng,Rafal Kustra,Davide Chicco
类目: Machine Learning (cs.LG); Applications (stat.AP)
*备注: 22 pages, 3 figures, 5 tables

点击查看摘要

Abstract:Purpose: The primary goal of this study is to explore the application of evaluation metrics to different clustering algorithms using the data provided from the Canadian Longitudinal Study (CLSA), focusing on cognitive features. The objective of our work is to discover potential clinically relevant clusters that contribute to the development of dementia over time, based on cognitive changes. Method: The CLSA dataset includes 18,891 participants with data available at both baseline and follow-up assessments, to which clustering algorithms were applied. The clustering methodologies employed in this analysis are K-means (KM) clustering, Hierarchical Clustering (HC) and Partitioning Around Medoids (PAM). We use multiple evaluation metrics to assess our analysis. For internal evaluation metrics, we use: Average Silhouette Width, Within- and Between-cluster Sum of Squares Ratio (this http URL), Entropy, Calinski-Harabasz Index (CH Index), and Separation Index. For clustering comparison metrics, we used: Homogeneity, Completeness, Adjusted Rand Index (ARI), Rand Index (RI), and Variation of Information. Results: Using evaluation metrics to compare the results of the three clustering techniques, K-means and Partitioning Around Medoids (PAM) produced similar results. In contrast, there are significant differences between K-means clustering and Hierarchical Clustering. Our study highlights the importance of the two internal evaluation metrics: entropy and separation index. Among the clustering comparison metrics, the Adjusted Rand Index is a key tool. Conclusion: The study results have the potential to contribute to understanding dementia. Researchers can also benefit by applying the suggested evaluation metrics to other areas of healthcare research. Overall, our study improves the understanding of using clustering techniques and evaluation metrics to reveal complex patterns in medical data.
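As a runnable illustration of the internal and comparison metrics discussed above, the sketch below clusters synthetic data with K-means and Hierarchical Clustering and scores the results with scikit-learn (PAM is omitted because it is not in scikit-learn; this is a toy, not the CLSA analysis):

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import (adjusted_rand_score, calinski_harabasz_score,
                             silhouette_score)

# synthetic stand-in for the cognitive-feature data
X, y_true = make_blobs(n_samples=300, centers=3, random_state=42)

km = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)
hc = AgglomerativeClustering(n_clusters=3).fit_predict(X)

# internal metrics: judge a clustering from the data alone
for name, labels in [("KMeans", km), ("Hierarchical", hc)]:
    print(name,
          "silhouette=%.3f" % silhouette_score(X, labels),
          "CH=%.1f" % calinski_harabasz_score(X, labels))

# comparison metric: agreement between the two clusterings
print("ARI(KM, HC) = %.3f" % adjusted_rand_score(km, hc))
```

On well-separated data both algorithms agree (ARI near 1); on real cognitive data the disagreements between such scores are exactly what the study analyzes.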

[LG-2] Battery-aware Cyclic Scheduling in Energy-harvesting Federated Learning

链接: https://arxiv.org/abs/2504.12181
作者: Eunjeong Jeong,Nikolaos Pappas
类目: Machine Learning (cs.LG); Information Theory (cs.IT)
*备注: This paper is currently under review for presentation at a peer-reviewed conference

点击查看摘要

Abstract:Federated Learning (FL) has emerged as a promising framework for distributed learning, but its growing complexity has led to significant energy consumption, particularly from computations on the client side. This challenge is especially critical in energy-harvesting FL (EHFL) systems, where device availability fluctuates due to limited and time-varying energy resources. We propose FedBacys, a battery-aware FL framework that introduces cyclic client participation based on users’ battery levels to cope with these issues. FedBacys enables clients to save energy and strategically perform local training just before their designated transmission time by clustering clients and scheduling their involvement sequentially. This design minimizes redundant computation, reduces system-wide energy usage, and improves learning stability. Our experiments demonstrate that FedBacys outperforms existing approaches in terms of energy efficiency and performance consistency, exhibiting robustness even under non-i.i.d. training data distributions and with very infrequent battery charging. This work presents the first comprehensive evaluation of cyclic client participation in EHFL, incorporating both communication and computation costs into a unified, resource-aware scheduling strategy.
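The cyclic, battery-aware participation idea can be sketched in a few lines: sort clients by battery level, group them into clusters, and activate one cluster per round so each client trains and transmits only in its designated slot. This is a toy scheduler illustrating the concept, not the FedBacys algorithm itself:

```python
import numpy as np

def cyclic_battery_schedule(battery, n_clusters, n_rounds):
    """Toy battery-aware cyclic participation: cluster clients by battery
    level and activate the clusters round-robin."""
    order = np.argsort(-battery)                 # highest battery first
    clusters = np.array_split(order, n_clusters)
    return [sorted(int(i) for i in clusters[t % n_clusters])
            for t in range(n_rounds)]

battery = np.array([0.9, 0.2, 0.6, 0.8, 0.4, 0.1])  # 6 clients
sched = cyclic_battery_schedule(battery, n_clusters=3, n_rounds=3)
print(sched)  # → [[0, 3], [2, 4], [1, 5]] — every client covered per cycle
```

Because each client knows its slot in advance, it can idle (and recharge) until just before its turn, which is where the energy savings come from.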

[LG-3] Predictive Multiplicity in Survival Models: A Method for Quantifying Model Uncertainty in Predictive Maintenance Applications

链接: https://arxiv.org/abs/2504.12156
作者: Mustafa Cavus
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:In many applications, especially those involving prediction, models may yield near-optimal performance yet significantly disagree on individual-level outcomes. This phenomenon, known as predictive multiplicity, has been formally defined in binary, probabilistic, and multi-target classification, and undermines the reliability of predictive systems. However, its implications remain unexplored in the context of survival analysis, which involves estimating the time until a failure or similar event while properly handling censored data. We frame predictive multiplicity as a critical concern in survival-based models and introduce formal measures – ambiguity, discrepancy, and obscurity – to quantify it. This is particularly relevant for downstream tasks such as maintenance scheduling, where precise individual risk estimates are essential. Understanding and reporting predictive multiplicity helps build trust in models deployed in high-stakes environments. We apply our methodology to benchmark datasets from predictive maintenance, extending the notion of multiplicity to survival models. Our findings show that ambiguity steadily increases, reaching up to 40-45% of observations; discrepancy is lower but exhibits a similar trend; and obscurity remains mild and concentrated in a few models. These results demonstrate that multiple accurate survival models may yield conflicting estimations of failure risk and degradation progression for the same equipment. This highlights the need to explicitly measure and communicate predictive multiplicity to ensure reliable decision-making in process health management.
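To make the measures concrete, the sketch below computes ambiguity and a pairwise-disagreement version of discrepancy over the binary predictions of several near-optimal models (definitions adapted from the classification literature; the paper's survival-specific formulations differ in detail):

```python
import numpy as np

def multiplicity_measures(preds):
    """preds: (n_models, n_samples) binary risk predictions from models
    with near-identical accuracy.
    ambiguity:  fraction of samples where at least two models disagree.
    discrepancy: largest fraction of samples on which some pair of
    models disagrees."""
    n_models, _ = preds.shape
    ambiguity = np.mean(preds.min(axis=0) != preds.max(axis=0))
    discrepancy = max(
        np.mean(preds[i] != preds[j])
        for i in range(n_models) for j in range(i + 1, n_models))
    return ambiguity, discrepancy

preds = np.array([[1, 0, 1, 1, 0],   # three equally accurate models,
                  [1, 0, 0, 1, 0],   # same aggregate error rate,
                  [1, 1, 1, 1, 0]])  # conflicting individual calls
amb, disc = multiplicity_measures(preds)
print(amb, disc)  # → 0.4 0.4
```

Even with identical aggregate performance, 40% of the toy samples receive conflicting risk calls, which is the maintenance-scheduling hazard the paper warns about.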

[LG-4] Neural Contextual Bandits Under Delayed Feedback Constraints

链接: https://arxiv.org/abs/2504.12086
作者: Mohammadali Moghimi,Sharu Theresa Jose,Shana Moothedath
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This paper presents a new algorithm for neural contextual bandits (CBs) that addresses the challenge of delayed reward feedback, where the reward for a chosen action is revealed after a random, unknown delay. This scenario is common in applications such as online recommendation systems and clinical trials, where reward feedback is delayed because the outcomes or results of a user’s actions (such as recommendations or treatment responses) take time to manifest and be measured. The proposed algorithm, called Delayed NeuralUCB, uses an upper confidence bound (UCB)-based exploration strategy. Under the assumption of independent and identically distributed sub-exponential reward delays, we derive an upper bound on the cumulative regret over a T-length horizon. We further consider a variant of the algorithm, called Delayed NeuralTS, that uses Thompson Sampling-based exploration. Numerical experiments on real-world datasets, such as MNIST and Mushroom, along with comparisons to benchmark approaches, demonstrate that the proposed algorithms effectively manage varying delays and are well-suited for complex real-world scenarios.
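The delayed-feedback mechanism can be illustrated with a plain (non-neural) UCB bandit in which each reward arrives after a random delay and is only folded into the estimates once it lands; a toy simulation, not the Delayed NeuralUCB algorithm:

```python
import numpy as np

rng = np.random.default_rng(1)
means = [0.3, 0.5, 0.7]          # true Bernoulli arm means (toy setup)
T, delay_max = 3000, 20
counts, sums = np.zeros(3), np.zeros(3)
pending = []                     # (arrival_time, arm, reward)

for t in range(T):
    # fold in feedback whose random delay has elapsed
    arrived = [p for p in pending if p[0] <= t]
    pending = [p for p in pending if p[0] > t]
    for _, a, r in arrived:
        counts[a] += 1
        sums[a] += r
    # UCB index computed only from feedback observed so far
    ucb = np.where(counts > 0,
                   sums / np.maximum(counts, 1)
                   + np.sqrt(2 * np.log(t + 1) / np.maximum(counts, 1)),
                   np.inf)
    a = int(np.argmax(ucb))
    r = float(rng.random() < means[a])
    pending.append((t + rng.integers(1, delay_max), a, r))

print(counts)  # feedback received per arm; the best arm dominates
```

Delays inflate the early exploration cost (the learner keeps pulling on stale estimates), which is exactly the term the paper's regret bound has to control.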

[LG-5] Generative Deep Learning Framework for Inverse Design of Fuels

链接: https://arxiv.org/abs/2504.12075
作者: Kiran K. Yalamanchi,Pinaki Pal,Balaji Mohan,Abdullah S. AlRamadan,Jihad A. Badra,Yuanjiang Pei
类目: Machine Learning (cs.LG); Chemical Physics (physics.chem-ph)
*备注:

点击查看摘要

Abstract:In the present work, a generative deep learning framework combining a Co-optimized Variational Autoencoder (Co-VAE) architecture with quantitative structure-property relationship (QSPR) techniques is developed to enable accelerated inverse design of fuels. The Co-VAE integrates a property prediction component coupled with the VAE latent space, enhancing molecular reconstruction and accurate estimation of Research Octane Number (RON) (chosen as the fuel property of interest). A subset of the GDB-13 database, enriched with a curated RON database, is used for model training. Hyperparameter tuning is further utilized to optimize the balance among reconstruction fidelity, chemical validity, and RON prediction. An independent regression model is then used to refine RON prediction, while a differential evolution algorithm is employed to efficiently navigate the VAE latent space and identify promising fuel molecule candidates with high RON. This methodology addresses the limitations of traditional fuel screening approaches by capturing complex structure-property relationships within a comprehensive latent representation. The generative model provides a flexible tool for systematically exploring vast chemical spaces, paving the way for discovering fuels with superior anti-knock properties. The demonstrated approach can be readily extended to incorporate additional fuel properties and synthesizability criteria to enhance applicability and reliability for de novo design of new fuels.
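The final search step, navigating a latent space with differential evolution, can be sketched with SciPy against a stand-in RON predictor (a hypothetical smooth surrogate with a known optimum; the real pipeline would decode each latent candidate back into a molecule):

```python
import numpy as np
from scipy.optimize import differential_evolution

# Stand-in for the trained RON regressor over the VAE latent space:
# a smooth toy function whose maximum sits at a known latent point.
z_star = np.array([0.5, -0.3, 0.8, 0.1])

def predicted_ron(z):
    return 100.0 - 10.0 * np.sum((np.asarray(z) - z_star) ** 2)

# differential evolution minimizes, so negate the property to maximize
result = differential_evolution(lambda z: -predicted_ron(z),
                                bounds=[(-2, 2)] * 4, seed=0, tol=1e-8)
z_best = result.x  # latent candidate to decode into a fuel molecule
print(np.round(z_best, 2))  # recovers the high-RON region of latent space
```

Because the search runs entirely in the low-dimensional latent space, each candidate evaluation is a cheap regressor call rather than a simulation or experiment.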

[LG-6] On the calibration of Just-in-time Defect Prediction

链接: https://arxiv.org/abs/2504.12051
作者: Xhulja Shahini,Jone Bartel,Klaus Pohl
类目: Software Engineering (cs.SE); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Just in time defect prediction (JIT DP) leverages ML to identify defect-prone code commits, enabling quality assurance (QA) teams to allocate resources more efficiently by focusing on commits that are most likely to contain defects. Although JIT DP techniques have introduced improvements in terms of predictive accuracy, they are still susceptible to misclassification errors such as false positives and negatives. This can lead to wasted resources or undetected defects, a particularly critical concern when QA resources are limited. To mitigate these challenges and preserve the practical utility of JIT DP tools, it becomes essential to estimate the reliability of the predictions, i.e., computing confidence scores. Such scores can help practitioners determine the trustworthiness of predictions and thus prioritize them efficiently. A simple approach to computing confidence scores is to extract, alongside each prediction, the corresponding prediction probabilities and use them as indicators of confidence. However, for these probabilities to reliably serve as confidence scores, the predictive model must be well-calibrated. This means that the prediction probabilities must accurately represent the true likelihood of each prediction being correct. Miscalibration, common in modern ML models, distorts probability scores such that they do not align with the actual correctness probability. In this study, we evaluate the calibration of three JIT DP techniques to determine whether and to what extent they exhibit poor calibration. Furthermore, we assess whether post-calibration methods can improve the calibration of existing JIT defect prediction models. Our results reveal that all evaluated JIT DP models exhibit some level of miscalibration, with ECE ranging from 2-35%. Furthermore, post-calibration methods do not consistently improve the calibration.
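Since the study reports miscalibration via ECE, it helps to recall how that number is computed: predictions are binned by confidence, and the gaps between each bin's mean confidence and its empirical accuracy are averaged with bin weights. A minimal sketch:

```python
import numpy as np

def expected_calibration_error(probs, labels, n_bins=10):
    """ECE: weighted average over confidence bins of
    |mean confidence - empirical accuracy|."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (probs > lo) & (probs <= hi)
        if mask.any():
            gap = abs(probs[mask].mean() - labels[mask].mean())
            ece += mask.mean() * gap
    return ece

# toy defect predictions: overconfident on the high-confidence commits
probs = np.array([0.95, 0.92, 0.88, 0.85, 0.6, 0.55])
labels = np.array([1, 0, 1, 1, 0, 1])
print(round(expected_calibration_error(probs, labels), 3))  # → 0.215
```

An ECE of 0.215 means predicted probabilities overstate correctness by about 21 percentage points on average, squarely inside the 2-35% range the study observed for JIT DP models.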

[LG-7] FedEPA: Enhancing Personalization and Modality Alignment in Multimodal Federated Learning

链接: https://arxiv.org/abs/2504.12025
作者: Yu Zhang,Qingfeng Du,Jiaqi Lv
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Federated Learning (FL) enables decentralized model training across multiple parties while preserving privacy. However, most FL systems assume clients hold only unimodal data, limiting their real-world applicability, as institutions often possess multimodal data. Moreover, the lack of labeled data further constrains the performance of most FL methods. In this work, we propose FedEPA, a novel FL framework for multimodal learning. FedEPA employs a personalized local model aggregation strategy that leverages labeled data on clients to learn personalized aggregation weights, thereby alleviating the impact of data heterogeneity. We also propose an unsupervised modality alignment strategy that works effectively with limited labeled data. Specifically, we decompose multimodal features into aligned features and context features. We then employ contrastive learning to align the aligned features across modalities, ensure the independence between aligned features and context features within each modality, and promote the diversity of context features. A multimodal feature fusion strategy is introduced to obtain a joint embedding. The experimental results show that FedEPA significantly outperforms existing FL methods in multimodal classification tasks under limited labeled data conditions.

[LG-8] Active Human Feedback Collection via Neural Contextual Dueling Bandits ICLR2025

链接: https://arxiv.org/abs/2504.12016
作者: Arun Verma,Xiaoqiang Lin,Zhongxiang Dai,Daniela Rus,Bryan Kian Hsiang Low
类目: Machine Learning (cs.LG)
*备注: Accepted at ICLR 2025 Workshop on Bidirectional Human-AI Alignment (BiAlign)

点击查看摘要

Abstract:Collecting human preference feedback is often expensive, leading recent works to develop principled algorithms to select them more efficiently. However, these works assume that the underlying reward function is linear, an assumption that does not hold in many real-life applications, such as online recommendation and LLM alignment. To address this limitation, we propose Neural-ADB, an algorithm based on the neural contextual dueling bandit framework that provides a principled and practical method for collecting human preference feedback when the underlying latent reward function is non-linear. We theoretically show that when preference feedback follows the Bradley-Terry-Luce model, the worst sub-optimality gap of the policy learned by Neural-ADB decreases at a sub-linear rate as the preference dataset increases. Our experimental results on problem instances derived from synthetic preference datasets further validate the effectiveness of Neural-ADB.
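Under the Bradley-Terry-Luce model assumed in the analysis, the probability of preferring one option over another is a logistic function of the latent reward difference. A toy simulation with an arbitrary non-linear reward (illustrating the feedback model only, not the Neural-ADB selection strategy):

```python
import numpy as np

def btl_prob(r_a, r_b):
    """Bradley-Terry-Luce: P(a preferred over b) given latent rewards."""
    return 1.0 / (1.0 + np.exp(-(r_a - r_b)))

rng = np.random.default_rng(0)
reward = lambda x: np.sin(3 * x)       # toy non-linear latent reward
xa, xb = 0.5, -0.2                     # two candidate contexts/actions
p = btl_prob(reward(xa), reward(xb))
samples = rng.random(10000) < p        # simulated human preference labels
print(round(p, 3), round(samples.mean(), 3))  # empirical rate ≈ p
```

Because the preference probability depends on the reward only through differences, the learner can recover the reward function up to an additive constant, which is all a policy needs.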

[LG-9] Voice Conversion with Diverse Intonation using Conditional Variational Auto-Encoder

链接: https://arxiv.org/abs/2504.12005
作者: Soobin Suh,Dabi Ahn,Heewoong Park,Jonghun Park
类目: Sound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
*备注: 2 pages, Machine Learning in Speech and Language Processing Workshop (MLSLP) 2018

点击查看摘要

Abstract:Voice conversion is a task of synthesizing an utterance with target speaker’s voice while maintaining linguistic information of the source utterance. While a speaker can produce varying utterances from a single script with different intonations, conventional voice conversion models were limited to producing only one result per source input. To overcome this limitation, we propose a novel approach for voice conversion with diverse intonations using conditional variational autoencoder (CVAE). Experiments have shown that the speaker’s style feature can be mapped into a latent space with Gaussian distribution. We have also been able to convert voices with more diverse intonation by making the posterior of the latent space more complex with inverse autoregressive flow (IAF). As a result, the converted voice not only has a diversity of intonations, but also has better sound quality than the model without CVAE.

[LG-10] Secure Transfer Learning: Training Clean Models Against Backdoor in (Both) Pre-trained Encoders and Downstream Datasets

链接: https://arxiv.org/abs/2504.11990
作者: Yechao Zhang,Yuxuan Zhou,Tianyu Li,Minghui Li,Shengshan Hu,Wei Luo,Leo Yu Zhang
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*备注: To appear at IEEE Symposium on Security and Privacy 2025, 20 pages

点击查看摘要

Abstract:Transfer learning from pre-trained encoders has become essential in modern machine learning, enabling efficient model adaptation across diverse tasks. However, this combination of pre-training and downstream adaptation creates an expanded attack surface, exposing models to sophisticated backdoor embeddings at both the encoder and dataset levels–an area often overlooked in prior research. Additionally, the limited computational resources typically available to users of pre-trained encoders constrain the effectiveness of generic backdoor defenses compared to end-to-end training from scratch. In this work, we investigate how to mitigate potential backdoor risks in resource-constrained transfer learning scenarios. Specifically, we conduct an exhaustive analysis of existing defense strategies, revealing that many follow a reactive workflow based on assumptions that do not scale to unknown threats, novel attack types, or different training paradigms. In response, we introduce a proactive mindset focused on identifying clean elements and propose the Trusted Core (T-Core) Bootstrapping framework, which emphasizes the importance of pinpointing trustworthy data and neurons to enhance model security. Our empirical evaluations demonstrate the effectiveness and superiority of T-Core, specifically assessing 5 encoder poisoning attacks, 7 dataset poisoning attacks, and 14 baseline defenses across five benchmark datasets, addressing four scenarios of 3 potential backdoor threats.

[LG-11] Hardware-Friendly Delayed-Feedback Reservoir for Multivariate Time-Series Classification

链接: https://arxiv.org/abs/2504.11981
作者: Sosei Ikeda,Hiromitsu Awano,Takashi Sato
类目: Machine Learning (cs.LG); Hardware Architecture (cs.AR)
*备注:

点击查看摘要

Abstract:Reservoir computing (RC) is attracting attention as a machine-learning technique for edge computing. In time-series classification tasks, the number of features obtained using a reservoir depends on the length of the input series. Therefore, the features must be converted to a constant-length intermediate representation (IR), such that they can be processed by an output layer. Existing conversion methods involve computationally expensive matrix inversion that significantly increases the circuit size and requires processing power when implemented in hardware. In this article, we propose a simple but effective IR, namely, dot-product-based reservoir representation (DPRR), for RC based on the dot product of data features. Additionally, we propose a hardware-friendly delayed-feedback reservoir (DFR) consisting of a nonlinear element and delayed feedback loop with DPRR. The proposed DFR successfully classified multivariate time series data that has been considered particularly difficult to implement efficiently in hardware. In contrast to conventional DFR models that require analog circuits, the proposed model can be implemented in a fully digital manner suitable for high-level syntheses. A comparison with existing machine-learning methods via field-programmable gate array implementation using 12 multivariate time-series classification tasks confirmed the superior accuracy and small circuit size of the proposed method.
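One way to picture a dot-product-based fixed-length representation (a hypothetical reading of the DPRR idea, not the paper's exact construction) is the normalized Gram matrix of reservoir features: whatever the series length, an N-node reservoir yields N(N+1)/2 numbers, and no matrix inversion is required:

```python
import numpy as np

def dprr(states):
    """Toy dot-product reservoir representation: collapse a variable-length
    sequence of reservoir states (T x N) into the upper triangle of the
    N x N Gram matrix of feature dot products."""
    g = states.T @ states / len(states)   # pairwise feature dot products
    iu = np.triu_indices(g.shape[0])
    return g[iu]                          # constant-length IR

rng = np.random.default_rng(0)
short = rng.normal(size=(50, 4))   # 50-step series, 4 reservoir nodes
long = rng.normal(size=(200, 4))   # 200-step series, same reservoir
print(dprr(short).shape, dprr(long).shape)  # → (10,) (10,)
```

The payoff for hardware is that the whole conversion is multiply-accumulate operations, which map directly onto digital logic, unlike the matrix inversion used by conventional IR methods.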

[LG-12] FedCanon: Non-Convex Composite Federated Learning with Efficient Proximal Operation on Heterogeneous Data

链接: https://arxiv.org/abs/2504.11903
作者: Yuan Zhou,Jiachen Zhong,Xinli Shi,Guanghui Wen,Xinghuo Yu
类目: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC); Optimization and Control (math.OC)
*备注:

点击查看摘要

Abstract:Composite federated learning offers a general framework for solving machine learning problems with additional regularization terms. However, many existing methods require clients to perform multiple proximal operations to handle non-smooth terms, and their performance is often susceptible to data heterogeneity. To overcome these limitations, we propose a novel composite federated learning algorithm called FedCanon, designed to solve the optimization problems comprising a possibly non-convex loss function and a weakly convex, potentially non-smooth regularization term. By decoupling proximal mappings from local updates, FedCanon requires only a single proximal evaluation on the server per iteration, thereby reducing the overall proximal computation cost. It also introduces control variables that incorporate global gradient information into client updates, which helps mitigate the effects of data heterogeneity. Theoretical analysis demonstrates that FedCanon achieves sublinear convergence rates under general non-convex settings and linear convergence under the Polyak-Łojasiewicz condition, without relying on bounded heterogeneity assumptions. Experiments demonstrate that FedCanon outperforms the state-of-the-art methods in terms of both accuracy and computational efficiency, particularly under heterogeneous data distributions.
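The decoupling idea, clients send plain updates and the server applies one proximal step per iteration, can be sketched with the classic L1 regularizer, whose proximal operator is soft-thresholding (a toy illustration of a server-side proximal step, not the full FedCanon update with control variables):

```python
import numpy as np

def prox_l1(v, lam):
    """Proximal operator of lam * ||x||_1: soft-thresholding."""
    return np.sign(v) * np.maximum(np.abs(v) - lam, 0.0)

# server side: average the client updates, then apply ONE proximal
# evaluation instead of having every client do it locally
client_models = [np.array([0.9, -0.2, 0.05]),
                 np.array([1.1, -0.4, -0.05])]
avg = np.mean(client_models, axis=0)   # [1.0, -0.3, 0.0]
print(prox_l1(avg, lam=0.1))           # → [ 0.9 -0.2  0. ]
```

Moving this single cheap operation to the server is what removes the per-client proximal cost that other composite FL methods pay every round.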

[LG-13] HyperSAT: Unsupervised Hypergraph Neural Networks for Weighted MaxSAT Problems

链接: https://arxiv.org/abs/2504.11885
作者: Qiyue Chen(1 and 2),Shaolin Tan(2),Suixiang Gao(1 and 2),Jinhu Lü(3 and 2) ((1) School of Mathematical Sciences, University of Chinese Academy of Science, Beijing, China, (2) Zhongguancun Laboratory, Beijing, China, (3) School of Automation Science and Electrical Engineering, Beihang University, Beijing, China)
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Graph neural networks (GNNs) have shown promising performance in solving both Boolean satisfiability (SAT) and Maximum Satisfiability (MaxSAT) problems due to their ability to efficiently model and capture the structural dependencies between literals and clauses. However, GNN methods for solving Weighted MaxSAT problems remain underdeveloped. The challenges arise from the non-linear dependency and sensitive objective function, which are caused by the non-uniform distribution of weights across clauses. In this paper, we present HyperSAT, a novel neural approach that employs an unsupervised hypergraph neural network model to solve Weighted MaxSAT problems. We propose a hypergraph representation for Weighted MaxSAT instances and design a cross-attention mechanism along with a shared representation constraint loss function to capture the logical interactions between positive and negative literal nodes in the hypergraph. Extensive experiments on various Weighted MaxSAT datasets demonstrate that HyperSAT achieves better performance than state-of-the-art competitors.

[LG-14] Benchmarking Mutual Information-based Loss Functions in Federated Learning

链接: https://arxiv.org/abs/2504.11877
作者: Sarang S,Harsh D. Chothani,Qilei Li,Ahmed M. Abdelmoniem,Arnab K. Paul
类目: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注: 6 pages, 4 figures

点击查看摘要

Abstract:Federated Learning (FL) has attracted considerable interest due to growing privacy concerns and regulations like the General Data Protection Regulation (GDPR), which stresses the importance of privacy-preserving and fair machine learning approaches. In FL, model training takes place on decentralized data, so as to allow clients to upload a locally trained model and receive a globally aggregated model without exposing sensitive information. However, challenges related to fairness, such as biases, uneven performance among clients, and the “free rider” issue, complicate its adoption. In this paper, we examine the use of Mutual Information (MI)-based loss functions to address these concerns. MI has proven to be a powerful method for measuring dependencies between variables and optimizing deep learning models. By leveraging MI to extract essential features and minimize biases, we aim to improve both the fairness and effectiveness of FL systems. Through extensive benchmarking, we assess the impact of MI-based losses in reducing disparities among clients while enhancing the overall performance of FL.
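A simple histogram-based MI estimate illustrates the quantity these losses build on (a toy estimator; the benchmarked losses would use differentiable neural MI bounds, and an MI-based fairness loss would, e.g., minimize the MI between learned features and a sensitive attribute):

```python
import numpy as np

def mutual_information(x, y, bins=10):
    """Histogram estimate of I(X;Y) in nats."""
    pxy, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = pxy / pxy.sum()
    px = pxy.sum(axis=1, keepdims=True)   # marginal of X
    py = pxy.sum(axis=0, keepdims=True)   # marginal of Y
    nz = pxy > 0
    return float((pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])).sum())

rng = np.random.default_rng(0)
x = rng.normal(size=5000)
print(mutual_information(x, x))                      # high: X determines X
print(mutual_information(x, rng.normal(size=5000)))  # ≈ 0: independent
```

Driving the second quantity toward zero for (features, sensitive attribute) is the basic mechanism by which MI-based losses reduce bias.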

[LG-15] Factor-MCLS: Multi-agent learning system with reward factor matrix and multi-critic framework for dynamic portfolio optimization

链接: https://arxiv.org/abs/2504.11874
作者: Ruoyu Sun,Angelos Stefanidis,Zhengyong Jiang,Jionglong Su
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Typical deep reinforcement learning (DRL) agents for dynamic portfolio optimization learn the factors influencing portfolio return and risk by analyzing the output values of the reward function while adjusting portfolio weights within the training environment. However, it faces a major limitation where it is difficult for investors to intervene in the training based on different levels of risk aversion towards each portfolio asset. This difficulty arises from another limitation: existing DRL agents may not develop a thorough understanding of the factors responsible for the portfolio return and risk by only learning from the output of the reward function. As a result, the strategy for determining the target portfolio weights is entirely dependent on the DRL agents themselves. To address these limitations, we propose a reward factor matrix for elucidating the return and risk of each asset in the portfolio. Additionally, we propose a novel learning system named Factor-MCLS using a multi-critic framework that facilitates learning of the reward factor matrix. In this way, our DRL-based learning system can effectively learn the factors influencing portfolio return and risk. Moreover, based on the critic networks within the multi-critic framework, we develop a risk constraint term in the training objective function of the policy function. This risk constraint term allows investors to intervene in the training of the DRL agent according to their individual levels of risk aversion towards the portfolio assets.

[LG-16] Transferable Deployment of Semantic Edge Inference Systems via Unsupervised Domain Adaption

链接: https://arxiv.org/abs/2504.11873
作者: Weiqiang Jiao,Suzhi Bi,Xian Li,Cheng Guo,Hao Chen,Zhi Quan
类目: Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注: 14 pages, 14 figures, the paper is submitted for potential journal publication

点击查看摘要

Abstract:This paper investigates deploying semantic edge inference systems for performing a common image classification task. In particular, each system consists of multiple Internet of Things (IoT) devices that first locally encode the sensing data into semantic features and then transmit them to an edge server for subsequent data fusion and task inference. The inference accuracy is determined by efficient training of the feature encoder/decoder using labeled data samples. Due to the difference in sensing data and communication channel distributions, deploying the system in a new environment may induce high costs in annotating data labels and re-training the encoder/decoder models. To achieve cost-effective transferable system deployment, we propose an efficient Domain Adaptation method for Semantic Edge INference systems (DASEIN) that can maintain high inference accuracy in a new environment without the need for labeled samples. Specifically, DASEIN exploits the task-relevant data correlation between different deployment scenarios by leveraging the techniques of unsupervised domain adaptation and knowledge distillation. It devises an efficient two-step adaptation procedure that sequentially aligns the data distributions and adapts to the channel variations. Numerical results show that, under a substantial change in sensing data distributions, the proposed DASEIN outperforms the best-performing benchmark method by 7.09% and 21.33% in inference accuracy when the new environment has similar or 25 dB lower channel signal to noise power ratios (SNRs), respectively. This verifies the effectiveness of the proposed method in adapting both data and channel distributions in practical transfer deployment applications.

[LG-17] On the Problem of Best Arm Retention

链接: https://arxiv.org/abs/2504.11866
作者: Houshuang Chen,Yuchen He,Chihao Zhang
类目: Machine Learning (cs.LG); Data Structures and Algorithms (cs.DS)
*备注:

点击查看摘要

Abstract:This paper presents a comprehensive study on the problem of Best Arm Retention (BAR), which has recently found applications in streaming algorithms for multi-armed bandits. In the BAR problem, the goal is to retain m arms, with the best arm included, from n arms after some trials, in stochastic multi-armed bandit settings. We first investigate pure exploration for the BAR problem under different criteria, and then minimize the regret with specific constraints, in the context of further exploration in streaming algorithms.

- We begin by revisiting the lower bound for (ε, δ)-PAC algorithms for Best Arm Identification (BAI) and adapt the classical KL-divergence argument to derive optimal bounds for (ε, δ)-PAC algorithms for BAR.
- We further study another variant of the problem, called r-BAR, which requires that the expected gap between the best arm and the optimal arm retained is less than r. We prove tight sample complexity for the problem.
- We explore the regret minimization problem for r-BAR and develop an algorithm beyond pure exploration. We conclude with a conjecture on the optimal regret in this setting.

Journal reference: Theoretical Computer Science, Volume 1041, 2025. DOI: https://doi.org/10.1016/j.tcs.2025.115213
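A naive uniform-sampling baseline makes the BAR objective concrete: pull every arm equally, retain the m empirically best, and measure how often the true best arm survives (an illustration only; the paper's algorithms achieve this with far tighter sample complexity):

```python
import numpy as np

def retain_top_m(means, m, pulls_per_arm, rng):
    """Naive Best Arm Retention: pull each Bernoulli arm equally, keep
    the m arms with the highest empirical means, and report whether the
    true best arm was retained."""
    emp = [rng.binomial(pulls_per_arm, p) / pulls_per_arm for p in means]
    kept = np.argsort(emp)[-m:]
    return int(np.argmax(means)) in kept

rng = np.random.default_rng(0)
means = [0.2, 0.4, 0.45, 0.5, 0.9]   # n = 5 arms, best arm is index 4
success = np.mean([retain_top_m(means, m=2, pulls_per_arm=100, rng=rng)
                   for _ in range(200)])
print(success)  # retention rate of the best arm over 200 trials
```

Retention (m > 1) is easier than identification (m = 1): the best arm merely has to beat the (m+1)-th arm, which is why BAR admits smaller sample-complexity bounds than BAI.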

[LG-18] GT-SVQ: A Linear-Time Graph Transformer for Node Classification Using Spiking Vector Quantization

链接: https://arxiv.org/abs/2504.11840
作者: Huizhe Zhang,Jintang Li,Yuchang Zhu,Liang Chen,Zibin Zheng
类目: Neural and Evolutionary Computing (cs.NE); Machine Learning (cs.LG)
*备注: work in progress

点击查看摘要

Abstract:Graph Transformers (GTs), which simultaneously integrate message-passing and self-attention mechanisms, have achieved promising empirical results in some graph prediction tasks. Although these approaches show the potential of Transformers in capturing long-range graph topology information, issues concerning the quadratic complexity and high computing energy consumption severely limit the scalability of GTs on large-scale graphs. Recently, as brain-inspired neural networks, Spiking Neural Networks (SNNs), facilitate the development of graph representation learning methods with lower computational and storage overhead through the unique event-driven spiking neurons. Inspired by these characteristics, we propose a linear-time Graph Transformer using Spiking Vector Quantization (GT-SVQ) for node classification. GT-SVQ reconstructs codebooks based on rate coding outputs from spiking neurons, and injects the codebooks into self-attention blocks to aggregate global information in linear complexity. Besides, spiking vector quantization effectively alleviates codebook collapse and the reliance on complex machinery (distance measure, auxiliary loss, etc.) present in previous vector quantization-based graph learning methods. In experiments, we compare GT-SVQ with other state-of-the-art baselines on node classification datasets ranging from small to large. Experimental results show that GT-SVQ has achieved competitive performances on most datasets while maintaining up to 130x faster inference speed compared to other GTs.
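The rate-coding outputs that GT-SVQ builds its codebooks from can be sketched as Bernoulli spike trains whose mean firing rate encodes an activation value. This is an illustrative sketch of rate coding in general, not the paper's spiking neuron model.

```python
import numpy as np

def rate_code(x, T=2000, seed=0):
    """Bernoulli rate coding: emit T binary spike frames whose mean
    firing rate approximates the input values (clipped to [0, 1])."""
    rng = np.random.default_rng(seed)
    p = np.clip(np.asarray(x, dtype=float), 0.0, 1.0)
    spikes = (rng.random((T,) + p.shape) < p).astype(np.uint8)
    return spikes, spikes.mean(axis=0)

x = np.array([0.1, 0.5, 0.9])
spikes, rates = rate_code(x)
print(rates)  # mean firing rates close to [0.1, 0.5, 0.9]
```

The event-driven appeal is that downstream computation touches only the sparse binary frames rather than dense floating-point activations.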

[LG-19] Support is All You Need for Certified VAE Training ICLR’25

链接: https://arxiv.org/abs/2504.11831
作者: Changming Xu,Debangshu Banerjee,Deepak Vasisht,Gagandeep Singh
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 21 pages, 3 figures, ICLR '25

点击查看摘要

Abstract:Variational Autoencoders (VAEs) have become increasingly popular and deployed in safety-critical applications. In such applications, we want to give certified probabilistic guarantees on performance under adversarial attacks. We propose a novel method, CIVET, for certified training of VAEs. CIVET depends on the key insight that we can bound worst-case VAE error by bounding the error on carefully chosen support sets at the latent layer. We show this point mathematically and present a novel training algorithm utilizing this insight. We show in an extensive evaluation across different datasets (in both the wireless and vision application areas), architectures, and perturbation magnitudes that our method outperforms SOTA methods achieving good standard performance with strong robustness guarantees.

[LG-20] Emergence of Computational Structure in a Neural Network Physics Simulator

链接: https://arxiv.org/abs/2504.11830
作者: Rohan Hitchcock,Gary W. Delaney,Jonathan H. Manton,Richard Scalzo,Jingge Zhu
类目: Machine Learning (cs.LG)
*备注: 35 pages

点击查看摘要

Abstract:Neural networks often have identifiable computational structures - components of the network which perform an interpretable algorithm or task - but the mechanisms by which these emerge and the best methods for detecting these structures are not well understood. In this paper we investigate the emergence of computational structure in a transformer-like model trained to simulate the physics of a particle system, where the transformer’s attention mechanism is used to transfer information between particles. We show that (a) structures emerge in the attention heads of the transformer which learn to detect particle collisions, (b) the emergence of these structures is associated to degenerate geometry in the loss landscape, and (c) the dynamics of this emergence follows a power law. This suggests that these components are governed by a degenerate “effective potential”. These results have implications for the convergence time of computational structure within neural networks and suggest that the emergence of computational structure can be detected by studying the dynamics of network components.

[LG-21] Cost-Efficient LLM Serving in the Cloud: VM Selection with KV Cache Offloading

链接: https://arxiv.org/abs/2504.11816
作者: Kihyun Kim,Jinwoo Kim,Hyunsun Chung,Myung-Hoon Cha,Hong-Yeon Kim,Youngjae Kim
类目: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注: 10 pages, 6 figures

点击查看摘要

Abstract:LLM inference is essential for applications like text summarization, translation, and data analysis, but the high cost of GPU instances from Cloud Service Providers (CSPs) like AWS is a major burden. This paper proposes InferSave, a cost-efficient VM selection framework for cloud-based LLM inference. InferSave optimizes KV cache offloading based on Service Level Objectives (SLOs) and workload characteristics, estimating GPU memory needs, and recommending cost-effective VM instances. Additionally, the Compute Time Calibration Function (CTCF) improves instance selection accuracy by adjusting for discrepancies between theoretical and actual GPU performance. Experiments on AWS GPU instances show that selecting lower-cost instances without KV cache offloading improves cost efficiency by up to 73.7% for online workloads, while KV cache offloading saves up to 20.19% for offline workloads.
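The core trade-off can be sketched in a few lines: the KV cache often dominates GPU memory, so offloading it to host memory can let a cheaper, smaller-memory VM serve the same model. The instance catalogue below is hypothetical (illustrative names and prices, not real AWS specs), and the sketch omits InferSave's SLO checks and CTCF calibration.

```python
def kv_cache_gb(batch, seq_len, layers, hidden, bytes_per_value=2):
    """Simplified fp16 KV cache size: keys and values, per layer, per
    token, per hidden unit (real engines add further overheads)."""
    return 2 * batch * seq_len * layers * hidden * bytes_per_value / 1e9

def pick_vm(model_gb, cache_gb, vms, offload=False):
    """Cheapest VM whose GPU memory fits the model, plus the KV cache
    when it is not offloaded to host memory."""
    need = model_gb + (0 if offload else cache_gb)
    fits = [v for v in vms if v["gpu_gb"] >= need]
    return min(fits, key=lambda v: v["usd_hr"]) if fits else None

# Hypothetical instance catalogue -- illustrative names and prices only.
vms = [{"name": "small-gpu", "gpu_gb": 24, "usd_hr": 1.0},
       {"name": "large-gpu", "gpu_gb": 80, "usd_hr": 4.0}]
cache = kv_cache_gb(batch=8, seq_len=4096, layers=32, hidden=4096)  # ~17 GB
print(pick_vm(14, cache, vms)["name"])                # large-gpu (cache on GPU)
print(pick_vm(14, cache, vms, offload=True)["name"])  # small-gpu (cache offloaded)
```

The paper's point is that which side of this trade-off wins depends on the workload: offloading pays off offline, while online SLOs may favor keeping the cache on a cheaper-but-sufficient GPU.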

[LG-22] Manifold meta-learning for reduced-complexity neural system identification

链接: https://arxiv.org/abs/2504.11811
作者: Marco Forgione,Ankush Chakrabarty,Dario Piga,Matteo Rufolo,Alberto Bemporad
类目: Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注:

点击查看摘要

Abstract:System identification has greatly benefited from deep learning techniques, particularly for modeling complex, nonlinear dynamical systems with partially unknown physics where traditional approaches may not be feasible. However, deep learning models often require large datasets and significant computational resources at training and inference due to their high-dimensional parameterizations. To address this challenge, we propose a meta-learning framework that discovers a low-dimensional manifold within the parameter space of an over-parameterized neural network architecture. This manifold is learned from a meta-dataset of input-output sequences generated by a class of related dynamical systems, enabling efficient model training while preserving the network’s expressive power for the considered system class. Unlike bilevel meta-learning approaches, our method employs an auxiliary neural network to map datasets directly onto the learned manifold, eliminating the need for costly second-order gradient computations during meta-training and reducing the number of first-order updates required in inference, which could be expensive for large models. We validate our approach on a family of Bouc-Wen oscillators, which is a well-studied nonlinear system identification benchmark. We demonstrate that we are able to learn accurate models even in small-data scenarios.

[LG-23] Federated Spectral Graph Transformers Meet Neural Ordinary Differential Equations for Non-IID Graphs

链接: https://arxiv.org/abs/2504.11808
作者: Kishan Gurumurthy,Himanshu Pal,Charu Sharma
类目: Machine Learning (cs.LG)
*备注: The first two listed authors contributed equally to this work

点击查看摘要

Abstract:Graph Neural Network (GNN) research is rapidly advancing due to GNNs’ capacity to learn distributed representations from graph-structured data. However, centralizing large volumes of real-world graph data for GNN training is often impractical due to privacy concerns, regulatory restrictions, and commercial competition. Federated learning (FL), a distributed learning paradigm, offers a solution by preserving data privacy with collaborative model training. Despite progress in training huge vision and language models, federated learning for GNNs remains underexplored. To address this challenge, we present a novel method for federated learning on GNNs based on spectral GNNs equipped with neural ordinary differential equations (ODE) for better information capture, showing promising results across both homophilic and heterophilic graphs. Our approach effectively handles non-Independent and Identically Distributed (non-IID) data, while also achieving performance comparable to existing methods that only operate on IID data. It is designed to be privacy-preserving and bandwidth-optimized, making it suitable for real-world applications such as social network analysis, recommendation systems, and fraud detection, which often involve complex, non-IID, and heterophilic graph structures. Our results in the area of federated learning on non-IID heterophilic graphs demonstrate significant improvements, while also achieving better performance on homophilic graphs. This work highlights the potential of federated learning in diverse and challenging graph settings. Open-source code available on GitHub (this https URL).

[LG-24] Dynamics and Computational Principles of Echo State Networks: A Mathematical Perspective

链接: https://arxiv.org/abs/2504.11757
作者: Pradeep Singh,Ashutosh Kumar,Sutirtha Ghosh,Hrishit B P,Balasubramanian Raman
类目: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
*备注: 100 pages, 17 tables, 41 figures

点击查看摘要

Abstract:Reservoir computing (RC) represents a class of state-space models (SSMs) characterized by a fixed state transition mechanism (the reservoir) and a flexible readout layer that maps from the state space. It is a paradigm of computational dynamical systems that harnesses the transient dynamics of high-dimensional state spaces for efficient processing of temporal data. Rooted in concepts from recurrent neural networks, RC achieves exceptional computational power by decoupling the training of the dynamic reservoir from the linear readout layer, thereby circumventing the complexities of gradient-based optimization. This work presents a systematic exploration of RC, addressing its foundational properties such as the echo state property, fading memory, and reservoir capacity through the lens of dynamical systems theory. We formalize the interplay between input signals and reservoir states, demonstrating the conditions under which reservoirs exhibit stability and expressive power. Further, we delve into the computational trade-offs and robustness characteristics of RC architectures, extending the discussion to their applications in signal processing, time-series prediction, and control systems. The analysis is complemented by theoretical insights into optimization, training methodologies, and scalability, highlighting open challenges and potential directions for advancing the theoretical underpinnings of RC.
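The echo state property and fading memory discussed above can be demonstrated with a minimal NumPy reservoir. This is an illustrative sketch with toy sizes; rescaling to spectral radius below 1 is the common heuristic for the echo state property, not a strict guarantee.

```python
import numpy as np

def make_reservoir(n, rho=0.9, seed=0):
    """Random recurrent matrix rescaled to spectral radius rho < 1,
    the usual heuristic for obtaining the echo state property."""
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((n, n))
    return W * (rho / max(abs(np.linalg.eigvals(W))))

def run_esn(W, W_in, inputs, x0):
    """Iterate the fixed state transition x(t+1) = tanh(W x(t) + W_in u(t))."""
    x = x0.copy()
    for u in inputs:
        x = np.tanh(W @ x + W_in * u)
    return x

rng = np.random.default_rng(1)
W = make_reservoir(50)
W_in = 0.5 * rng.standard_normal(50)
u = rng.standard_normal(200)

# Fading memory: two runs started from very different states, driven by
# the same input sequence, end up (nearly) indistinguishable.
xa = run_esn(W, W_in, u, np.zeros(50))
x0b = rng.standard_normal(50)
xb = run_esn(W, W_in, u, x0b)
print(np.linalg.norm(xa - xb))  # small: the initial condition is forgotten
```

Only a linear readout on such states would be trained in RC, which is what lets the paradigm sidestep gradient-based optimization of the recurrent weights.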

[LG-25] Unravelling Technical Debt Topics through Time, Programming Languages and Repository

链接: https://arxiv.org/abs/2504.11714
作者: Karthik Shivashankar,Antonio Martini
类目: Software Engineering (cs.SE); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This study explores the dynamic landscape of Technical Debt (TD) topics in software engineering by examining its evolution across time, programming languages, and repositories. Despite the extensive research on identifying and quantifying TD, there remains a significant gap in understanding the diversity of TD topics and their temporal development. To address this, we have conducted an explorative analysis of TD data extracted from GitHub issues spanning from 2015 to September 2023. We employed BERTopic for sophisticated topic modelling. This study categorises the TD topics and tracks their progression over time. Furthermore, we have incorporated sentiment analysis for each identified topic, providing a deeper insight into the perceptions and attitudes associated with these topics. This offers a more nuanced understanding of the trends and shifts in TD topics through time, programming language, and repository.

[LG-26] Clustering and analysis of user behaviour in blockchain: A case study of Planet IX

链接: https://arxiv.org/abs/2504.11702
作者: Dorottya Zelenyanszki,Zhe Hou,Kamanashis Biswas,Vallipuram Muthukkumarasamy
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*备注: 15 pages, 8 figures, submitted to Blockchain: Research and Applications

点击查看摘要

Abstract:Decentralised applications (dApps) that run on public blockchains have the benefit of trustworthiness and transparency as every activity that happens on the blockchain can be publicly traced through the transaction data. However, this introduces a potential privacy problem as this data can be tracked and analysed, which can reveal user-behaviour information. A user behaviour analysis pipeline was proposed to present how this type of information can be extracted and analysed to identify separate behavioural clusters that can describe how users behave in the game. The pipeline starts with the collection of transaction data, involving smart contracts, that is collected from a blockchain-based game called Planet IX. Both the raw transaction information and the transaction events are considered in the data collection. From this data, separate game actions can be formed and those are leveraged to present how and when the users conducted their in-game activities in the form of user flows. An extended version of these user flows also presents how the Non-Fungible Tokens (NFTs) are being leveraged in the user actions. The latter is given as input for a Graph Neural Network (GNN) model to provide graph embeddings for these flows which then can be leveraged by clustering algorithms to cluster user behaviours into separate behavioural clusters. We benchmark and compare well-known clustering algorithms as a part of the proposed method. The user behaviour clusters were analysed and visualised in a graph format. It was found that behavioural information can be extracted regarding the users that belong to these clusters. Such information can be exploited by malicious users to their advantage. To demonstrate this, a privacy threat model was also presented based on the results that correspond to multiple potentially affected areas.

[LG-27] H3GNNs: Harmonizing Heterophily and Homophily in GNNs via Joint Structural Node Encoding and Self-Supervised Learning

链接: https://arxiv.org/abs/2504.11699
作者: Rui Xue,Tianfu Wu
类目: Machine Learning (cs.LG); Social and Information Networks (cs.SI)
*备注:

点击查看摘要

Abstract:Graph Neural Networks (GNNs) struggle to balance heterophily and homophily in representation learning, a challenge further amplified in self-supervised settings. We propose H^3GNNs, an end-to-end self-supervised learning framework that harmonizes both structural properties through two key innovations: (i) Joint Structural Node Encoding. We embed nodes into a unified space combining linear and non-linear feature projections with K-hop structural representations via a Weighted Graph Convolution Network (WGCN). A cross-attention mechanism enhances awareness and adaptability to heterophily and homophily. (ii) Self-Supervised Learning Using Teacher-Student Predictive Architectures with Node-Difficulty Driven Dynamic Masking Strategies. We use a teacher-student model, the student sees the masked input graph and predicts node features inferred by the teacher that sees the full input graph in the joint encoding space. To enhance learning difficulty, we introduce two novel node-predictive-difficulty-based masking strategies. Experiments on seven benchmarks (four heterophily datasets and three homophily datasets) confirm the effectiveness and efficiency of H^3GNNs across diverse graph types. Our H^3GNNs achieves overall state-of-the-art performance on the four heterophily datasets, while retaining on-par performance to previous state-of-the-art methods on the three homophily datasets.

[LG-28] Transformer-Driven Neural Beamforming with Imperfect CSI in Urban Macro Wireless Channels

链接: https://arxiv.org/abs/2504.11667
作者: Cemil Vahapoglu,Timothy J. O’Shea,Wan Liu,Tamoghna Roy,Sennur Ulukus
类目: Information Theory (cs.IT); Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注:

点击查看摘要

Abstract:The literature is abundant with methodologies focusing on using transformer architectures due to their prominence in wireless signal processing and their capability to capture long-range dependencies via attention mechanisms. In particular, depthwise separable convolutions enhance parameter efficiency for the process of high-dimensional data characteristics of MIMO systems. In this work, we introduce a novel unsupervised deep learning framework that integrates depthwise separable convolutions and transformers to generate beamforming weights under imperfect channel state information (CSI) for a multi-user single-input multiple-output (MU-SIMO) system in dense urban environments. The primary goal is to enhance throughput by maximizing sum-rate while ensuring reliable communication. Spectral efficiency and block error rate (BLER) are considered as performance metrics. Experiments are carried out under various conditions to compare the performance of the proposed NNBF framework against baseline methods zero-forcing beamforming (ZFBF) and minimum mean square error (MMSE) beamforming. Experimental results demonstrate the superiority of the proposed framework over the baseline techniques.
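The parameter-efficiency claim for depthwise separable convolutions is easy to make concrete: a standard convolution couples channel mixing and spatial filtering, while the separable version splits them into a per-channel spatial filter plus a 1x1 pointwise mixer. The channel counts below are arbitrary illustrative values.

```python
def conv_params(c_in, c_out, k):
    """Weights in a standard k x k convolution (bias ignored)."""
    return c_in * c_out * k * k

def depthwise_separable_params(c_in, c_out, k):
    """Depthwise stage (one k x k filter per input channel) followed by
    a 1 x 1 pointwise convolution that mixes channels."""
    return c_in * k * k + c_in * c_out

std = conv_params(64, 128, 3)                 # 64 * 128 * 9 = 73728
sep = depthwise_separable_params(64, 128, 3)  # 576 + 8192 = 8768
print(std, sep, round(std / sep, 1))          # 73728 8768 8.4
```

An ~8x parameter reduction at this layer size is why the architecture suits the high-dimensional channel data of MIMO systems.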

[LG-29] 70% Size, 100% Accuracy: Lossless LLM Compression for Efficient GPU Inference via Dynamic-Length Float

链接: https://arxiv.org/abs/2504.11651
作者: Tianyi Zhang,Yang Sui,Shaochen Zhong,Vipin Chaudhary,Xia Hu,Anshumali Shrivastava
类目: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注:

点击查看摘要

Abstract:Large Language Models (LLMs) have grown rapidly in size, creating significant challenges for efficient deployment on resource-constrained hardware. In this paper, we introduce Dynamic-Length Float (DFloat11), a lossless compression framework that reduces LLM size by 30% while preserving outputs that are bit-for-bit identical to the original model. DFloat11 is motivated by the low entropy in the BFloat16 weight representation of LLMs, which reveals significant inefficiency in existing storage format. By applying entropy coding, DFloat11 assigns dynamic-length encodings to weights based on frequency, achieving near information-optimal compression without any loss of precision. To facilitate efficient inference with dynamic-length encodings, we develop a custom GPU kernel for fast online decompression. Our design incorporates the following: (i) decomposition of memory-intensive lookup tables (LUTs) into compact LUTs that fit in GPU SRAM, (ii) a two-phase kernel for coordinating thread read/write positions using lightweight auxiliary variables, and (iii) transformer-block-level decompression to minimize latency. Experiments on recent models, including Llama-3.1, Qwen-2.5, and Gemma-3, validates our hypothesis that DFloat11 achieves around 30% model size reduction while preserving bit-for-bit exact outputs. Compared to a potential alternative of offloading parts of an uncompressed model to the CPU to meet memory constraints, DFloat11 achieves 1.9-38.8x higher throughput in token generation. With a fixed GPU memory budget, DFloat11 enables 5.3-13.17x longer context lengths than uncompressed models. Notably, our method enables lossless inference of Llama-3.1-405B, an 810GB model, on a single node equipped with 8x80GB GPUs. Our code and models are available at this https URL.
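The entropy-coding idea behind DFloat11 (frequent weight patterns get short codes, rare ones get long codes, with no loss of information) can be sketched with plain Huffman coding over a synthetic, heavily skewed exponent-byte histogram. The counts are made up to mimic the low entropy of BFloat16 exponents; DFloat11's actual coder, LUT decomposition, and GPU kernels are more involved.

```python
import heapq
from collections import Counter

def huffman_lengths(freqs):
    """Optimal prefix-code lengths from symbol frequencies (standard
    Huffman construction; we track only the code lengths)."""
    heap = [(f, i, (s,)) for i, (s, f) in enumerate(freqs.items())]
    heapq.heapify(heap)
    lengths = {s: 0 for s in freqs}
    uid = len(heap)
    while len(heap) > 1:
        f1, _, s1 = heapq.heappop(heap)
        f2, _, s2 = heapq.heappop(heap)
        for s in s1 + s2:          # every merge adds one bit to members
            lengths[s] += 1
        heapq.heappush(heap, (f1 + f2, uid, s1 + s2))
        uid += 1
    return lengths

# Synthetic skewed "exponent byte" histogram (not real model statistics).
counts = Counter({120: 7000, 121: 2000, 122: 600, 123: 300, 124: 100})
code_len = huffman_lengths(counts)
total = sum(counts.values())
avg_bits = sum(counts[s] * code_len[s] for s in counts) / total
print(round(avg_bits, 2), "bits/exponent vs 8 fixed")  # 1.44 bits/exponent vs 8 fixed
```

Because decoding a variable-length stream is sequential by nature, the paper's contribution is largely in making such decompression fast on GPUs, not in the coding step itself.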

[LG-30] Robust Markov stability for community detection at a scale learned based on the structure

链接: https://arxiv.org/abs/2504.11621
作者: Samin Aref,Sanchaai Mathiyarasan
类目: Social and Information Networks (cs.SI); Statistical Mechanics (cond-mat.stat-mech); Machine Learning (cs.LG)
*备注: This is the author copy of an article accepted for publication by ACM. The publisher’s verified version and full citation details are available on the ACM website

点击查看摘要

Abstract:Community detection, the unsupervised task of clustering nodes of a graph, finds applications across various fields. The common approaches for community detection involve optimizing an objective function to partition the nodes into communities at a single scale of granularity. However, the single-scale approaches often fall short of producing partitions that are robust and at a suitable scale. The existing algorithm, PyGenStability, returns multiple robust partitions for a network by optimizing the multi-scale Markov stability function. However, in cases where the suitable scale is not known or assumed by the user, there is no principled method to select a single robust partition at a suitable scale from the multiple partitions that PyGenStability produces. Our proposed method combines the Markov stability framework with a pre-trained machine learning model for scale selection to obtain one robust partition at a scale that is learned based on the graph structure. This automatic scale selection involves using a gradient boosting model pre-trained on hand-crafted and embedding-based network features from a labeled dataset of 10k benchmark networks. This model was trained to predict the scale value that maximizes the similarity of the output partition to the planted partition of the benchmark network. Combining our scale selection algorithm with the PyGenStability algorithm results in PyGenStabilityOne (PO): a hyperparameter-free multi-scale community detection algorithm that returns one robust partition at a suitable scale without the need for any assumptions, input, or tweaking from the user. We compare the performance of PO against 29 algorithms and show that it outperforms 25 other algorithms by statistically meaningful margins. Our results facilitate choosing between community detection algorithms, among which PO stands out as the accurate, robust, and hyperparameter-free method.

[LG-31] Dueling Deep Reinforcement Learning for Financial Time Series

链接: https://arxiv.org/abs/2504.11601
作者: Bruno Giorgio
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Reinforcement learning (RL) has emerged as a powerful paradigm for solving decision-making problems in dynamic environments. In this research, we explore the application of Double DQN (DDQN) and Dueling Network Architectures, to financial trading tasks using historical SP500 index data. Our focus is training agents capable of optimizing trading strategies while accounting for practical constraints such as transaction costs. The study evaluates the model performance across scenarios with and without commissions, highlighting the impact of cost-sensitive environments on reward dynamics. Despite computational limitations and the inherent complexity of financial time series data, the agent successfully learned meaningful trading policies. The findings confirm that RL agents, even when trained on limited datasets, can outperform random strategies by leveraging advanced architectures such as DDQN and Dueling Networks. However, significant challenges persist, particularly with a sub-optimal policy due to the complexity of data source.
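The dueling architecture mentioned above splits the Q-function into a state value V(s) and per-action advantages A(s, a); a minimal sketch of the aggregation step (just the head, not a full trading agent):

```python
import numpy as np

def dueling_q(value, advantages):
    """Dueling aggregation: Q(s, a) = V(s) + A(s, a) - mean_a A(s, a).
    Subtracting the mean advantage makes the V/A split identifiable."""
    a = np.asarray(advantages, dtype=float)
    return value + a - a.mean()

# One state with three actions (e.g. buy / sell / hold), toy numbers.
q = dueling_q(1.0, [0.5, -0.5, 0.0])
print(q)  # [1.5 0.5 1. ] -- Q values centred on the state value V(s) = 1.0
```

In Double DQN, action selection and action evaluation additionally use separate networks to curb the overestimation bias of vanilla DQN targets.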

[LG-32] Towards a Universal Vibration Analysis Dataset: A Framework for Transfer Learning in Predictive Maintenance and Structural Health Monitoring

链接: https://arxiv.org/abs/2504.11581
作者: Mert Sehri,Igor Varejão,Zehui Hua,Vitor Bonella,Adriano Santos,Francisco de Assis Boldt,Patrick Dumond,Flavio Miguel Varejão
类目: Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注:

点击查看摘要

Abstract:ImageNet has become a reputable resource for transfer learning, allowing the development of efficient ML models with reduced training time and data requirements. However, vibration analysis in predictive maintenance, structural health monitoring, and fault diagnosis, lacks a comparable large-scale, annotated dataset to facilitate similar advancements. To address this, a dataset framework is proposed that begins with bearing vibration data as an initial step towards creating a universal dataset for vibration-based spectrogram analysis for all machinery. The initial framework includes a collection of bearing vibration signals from various publicly available datasets. To demonstrate the advantages of this framework, experiments were conducted using a deep learning architecture, showing improvements in model performance when pre-trained on bearing vibration data and fine-tuned on a smaller, domain-specific dataset. These findings highlight the potential to parallel the success of ImageNet in visual computing but for vibration analysis. For future work, this research will include a broader range of vibration signals from multiple types of machinery, emphasizing spectrogram-based representations of the data. Each sample will be labeled according to machinery type, operational status, and the presence or type of faults, ensuring its utility for supervised and unsupervised learning tasks. Additionally, a framework for data preprocessing, feature extraction, and model training specific to vibration data will be developed. This framework will standardize methodologies across the research community, allowing for collaboration and accelerating progress in predictive maintenance, structural health monitoring, and related fields. By mirroring the success of ImageNet in visual computing, this dataset has the potential to improve the development of intelligent systems in industrial applications.
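The spectrogram-based representation the framework emphasizes can be sketched with a plain NumPy short-time FFT over a synthetic vibration signal (a made-up 50 Hz shaft tone plus noise, not real bearing data):

```python
import numpy as np

def spectrogram(x, n_fft=256, hop=128):
    """Magnitude spectrogram via a Hann-windowed short-time FFT, the
    kind of time-frequency image such a dataset would standardize on."""
    win = np.hanning(n_fft)
    frames = [x[i:i + n_fft] * win
              for i in range(0, len(x) - n_fft + 1, hop)]
    return np.abs(np.fft.rfft(np.array(frames), axis=1)).T  # (freq, time)

# Synthetic "bearing" signal: a 50 Hz shaft tone plus noise at 1 kHz sampling.
fs = 1000
t = np.arange(4096) / fs
x = np.sin(2 * np.pi * 50 * t) + 0.1 * np.random.default_rng(0).standard_normal(4096)
S = spectrogram(x)
peak_hz = S.mean(axis=1).argmax() * fs / n_fft if (n_fft := 256) else None
print(S.shape)  # the dominant frequency bin lands near 50 Hz
```

Labeling many such images by machinery type, operating condition, and fault would give vibration analysis its ImageNet-style pre-training corpus.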

[LG-33] LANGTRAJ: Diffusion Model and Dataset for Language-Conditioned Trajectory Simulation

链接: https://arxiv.org/abs/2504.11521
作者: Wei-Jer Chang,Wei Zhan,Masayoshi Tomizuka,Manmohan Chandraker,Francesco Pittaluga
类目: Machine Learning (cs.LG); Robotics (cs.RO)
*备注: Dataset and project website in preparation

点击查看摘要

Abstract:Evaluating autonomous vehicles with controllability enables scalable testing in counterfactual or structured settings, enhancing both efficiency and safety. We introduce LangTraj, a language-conditioned scene-diffusion model that simulates the joint behavior of all agents in traffic scenarios. By conditioning on natural language inputs, LangTraj provides flexible and intuitive control over interactive behaviors, generating nuanced and realistic scenarios. Unlike prior approaches that depend on domain-specific guidance functions, LangTraj incorporates language conditioning during training, facilitating more intuitive traffic simulation control. We propose a novel closed-loop training strategy for diffusion models, explicitly tailored to enhance stability and realism during closed-loop simulation. To support language-conditioned simulation, we develop Inter-Drive, a large-scale dataset with diverse and interactive labels for training language-conditioned diffusion models. Our dataset is built upon a scalable pipeline for annotating agent-agent interactions and single-agent behaviors, ensuring rich and varied supervision. Validated on the Waymo Motion Dataset, LangTraj demonstrates strong performance in realism, language controllability, and language-conditioned safety-critical simulation, establishing a new paradigm for flexible and scalable autonomous vehicle testing.

[LG-34] Multi-output Classification Framework and Frequency Layer Normalization for Compound Fault Diagnosis in Motor

链接: https://arxiv.org/abs/2504.11513
作者: Wonjun Yi,Yong-Hwa Park
类目: Machine Learning (cs.LG)
*备注: Extended version of “Multi-output Classification for Compound Fault Diagnosis in Motor under Partially Labeled Target Domain”. Will not be published in any conferences or journals

点击查看摘要

Abstract:This work introduces a multi-output classification (MOC) framework designed for domain adaptation in fault diagnosis, particularly under partially labeled (PL) target domain scenarios and compound fault conditions in rotating machinery. Unlike traditional multi-class classification (MCC) methods that treat each fault combination as a distinct class, the proposed approach independently estimates the severity of each fault type, improving both interpretability and diagnostic accuracy. The model incorporates multi-kernel maximum mean discrepancy (MK-MMD) and entropy minimization (EM) losses to facilitate feature transfer from the source to the target domain. In addition, frequency layer normalization (FLN) is applied to preserve structural properties in the frequency domain, which are strongly influenced by system dynamics and are often stationary with respect to changes in rpm. Evaluations across six domain adaptation cases with PL data demonstrate that MOC outperforms baseline models in macro F1 score. Moreover, MOC consistently achieves better classification performance for individual fault types, and FLN shows superior adaptability compared to other normalization techniques.
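The MOC-versus-MCC distinction comes down to output structure: MCC needs one class per fault combination (multiplicative), while MOC uses one small head per fault type (additive), so each prediction reads directly as a severity. A sketch with hypothetical fault types and severity counts:

```python
from itertools import product

# Severity levels per fault type (including healthy) -- hypothetical motor.
fault_types = {"bearing": 3, "unbalance": 2, "misalignment": 2}

# MCC: every combination of severities is its own class.
mcc_classes = 1
for levels in fault_types.values():
    mcc_classes *= levels
# sanity check: same count as enumerating all severity tuples
assert mcc_classes == len(list(product(*(range(n) for n in fault_types.values()))))

# MOC: one classification head per fault type, severities estimated
# independently, so compound faults need no dedicated class.
moc_outputs = sum(fault_types.values())

print(mcc_classes, moc_outputs)  # prints: 12 7
```

The gap widens quickly as fault types are added, which is why MOC also tends to be easier to interpret and to adapt under partially labeled target domains.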

[LG-35] Reward Distance Comparisons Under Transition Sparsity

链接: https://arxiv.org/abs/2504.11508
作者: Clement Nyanhongo,Bruno Miranda Henrique,Eugene Santos
类目: Machine Learning (cs.LG)
*备注: Published in the TMLR, this https URL

点击查看摘要

Abstract:Reward comparisons are vital for evaluating differences in agent behaviors induced by a set of reward functions. Most conventional techniques utilize the input reward functions to learn optimized policies, which are then used to compare agent behaviors. However, learning these policies can be computationally expensive and can also raise safety concerns. Direct reward comparison techniques obviate policy learning but suffer from transition sparsity, where only a small subset of transitions are sampled due to data collection challenges and feasibility constraints. Existing state-of-the-art direct reward comparison methods are ill-suited for these sparse conditions since they require high transition coverage, where the majority of transitions from a given coverage distribution are sampled. When this requirement is not satisfied, a distribution mismatch between sampled and expected transitions can occur, leading to significant errors. This paper introduces the Sparsity Resilient Reward Distance (SRRD) pseudometric, designed to eliminate the need for high transition coverage by accommodating diverse sample distributions, which are common under transition sparsity. We provide theoretical justification for SRRD’s robustness and conduct experiments to demonstrate its practical efficacy across multiple domains.
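The direct-comparison setting can be made concrete with a naive sample-based reward pseudometric: evaluate two reward functions on the same set of transitions and measure disagreement via correlation. This is in the spirit of EPIC-style distances but without canonicalization, and it is not SRRD itself, whose point is precisely to stay robust when the sampled transitions are sparse and mismatched.

```python
import numpy as np

def reward_distance(r1, r2, transitions):
    """Correlation-based pseudometric between two reward functions over
    a shared sample of (s, a, s') transitions: 0 for rewards that rank
    transitions identically, 1 for perfectly opposed rewards."""
    a = np.array([r1(*tr) for tr in transitions])
    b = np.array([r2(*tr) for tr in transitions])
    rho = np.corrcoef(a, b)[0, 1]
    return np.sqrt(max(0.0, 1.0 - rho) / 2.0)

# A tiny MDP's transitions, here fully enumerated; under transition
# sparsity only a biased subset of these would be available.
trans = [(s, a, s2) for s in range(4) for a in range(2) for s2 in range(4)]
r = lambda s, a, s2: float(s2 == 3)          # reward for reaching state 3
r_scaled = lambda s, a, s2: 10.0 * float(s2 == 3)
r_flipped = lambda s, a, s2: -float(s2 == 3)
print(reward_distance(r, r_scaled, trans))   # ~0: same behaviour ordering
print(reward_distance(r, r_flipped, trans))  # ~1: maximally different
```

When the sampled transitions no longer cover the space evenly, such naive estimates degrade, which is the failure mode SRRD is designed to resist.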

[LG-36] Cross-cultural Deployment of Autonomous Vehicles Using Data-light Inverse Reinforcement Learning

链接: https://arxiv.org/abs/2504.11506
作者: Hongliang Lu,Shuqi Shen,Junjie Yang,Chao Lu,Xinhu Zheng,Hai Yang
类目: Machine Learning (cs.LG); Robotics (cs.RO)
*备注:

点击查看摘要

Abstract:More than the adherence to specific traffic regulations, driving culture touches upon a more implicit part - an informal, conventional, collective behavioral pattern followed by drivers - that varies across countries, regions, and even cities. Such cultural divergence has become one of the biggest challenges in deploying autonomous vehicles (AVs) across diverse regions today. The current emergence of data-driven methods has shown a potential solution to enable culture-compatible driving through learning from data, but what if some underdeveloped regions cannot provide sufficient local data to inform driving culture? This issue is particularly significant for a broader global AV market. Here, we propose a cross-cultural deployment scheme for AVs, called data-light inverse reinforcement learning, designed to re-calibrate culture-specific AVs and assimilate them into other cultures. First, we report the divergence in driving cultures through a comprehensive comparative analysis of naturalistic driving datasets on highways from three countries: Germany, China, and the USA. Then, we demonstrate the effectiveness of our scheme by testing the expeditious cross-cultural deployment across these three countries, with cumulative testing mileage of over 56084 km. The performance is particularly advantageous when cross-cultural deployment is carried out without affluent local data. Results show that we can reduce the dependence on local data by a margin of 98.67% at best. This study is expected to bring a broader, fairer AV global market, particularly in those regions that lack enough local data to develop culture-compatible AVs.

[LG-37] Counterfactual Fairness Evaluation of Machine Learning Models on Educational Datasets

链接: https://arxiv.org/abs/2504.11504
作者: Woojin Kim,Hyeoncheol Kim
类目: Computers and Society (cs.CY); Machine Learning (cs.LG)
*备注: 12 pages, 6 figures, accepted to ITS2025

点击查看摘要

Abstract:As machine learning models are increasingly used in educational settings, from detecting at-risk students to predicting student performance, algorithmic bias and its potential impacts on students raise critical concerns about algorithmic fairness. Although group fairness is widely explored in education, works on individual fairness in a causal context are understudied, especially on counterfactual fairness. This paper explores the notion of counterfactual fairness for educational data by conducting counterfactual fairness analysis of machine learning models on benchmark educational datasets. We demonstrate that counterfactual fairness provides meaningful insight into the causality of sensitive attributes and causal-based individual fairness in education.
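A minimal screening version of this check can be sketched by flipping only the sensitive attribute and counting changed predictions. Note the hedge in the code: full counterfactual fairness propagates the flip through a causal model of the features, which this simple flip test does not do; the classifiers below are toy stand-ins.

```python
def counterfactual_flip_rate(model, rows, sensitive_key):
    """Fraction of individuals whose prediction changes when only the
    sensitive attribute is flipped. NOTE: a full counterfactual fairness
    analysis propagates the flip through a causal model of the features;
    flipping the attribute in isolation is only a first screening step.
    """
    changed = 0
    for row in rows:
        cf = dict(row)
        cf[sensitive_key] = 1 - cf[sensitive_key]  # binary attribute assumed
        if model(row) != model(cf):
            changed += 1
    return changed / len(rows)

# Toy "at-risk student" classifiers; `biased` (unfairly) uses gender directly.
biased = lambda r: int(r["gpa"] < 2.5 or r["gender"] == 1)
fair = lambda r: int(r["gpa"] < 2.5)

students = [{"gpa": g, "gender": s} for g in (1.8, 2.9, 3.6) for s in (0, 1)]
print(counterfactual_flip_rate(biased, students, "gender"))  # nonzero
print(counterfactual_flip_rate(fair, students, "gender"))    # → 0.0
```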

[LG-38] Timing Analysis Agent: Autonomous Multi-Corner Multi-Mode (MCMM) Timing Debugging with Timing Debug Relation Graph

链接: https://arxiv.org/abs/2504.11502
作者: Jatin Nainani,Chia-Tung Ho,Anirudh Dhurka,Haoxing Ren
类目: Software Engineering (cs.SE); Machine Learning (cs.LG)
*备注: 7 pages, 7 figures, 2 tables

点击查看摘要

Abstract:Timing analysis is an essential and demanding verification method for Very Large Scale Integrated (VLSI) circuit design and optimization. It also serves as the cornerstone of the final sign-off, determining whether the chip is ready to be sent to the semiconductor foundry for fabrication. As technology advances relentlessly, smaller metal pitches and the increasing number of devices have led to greater challenges and longer turnaround times for experienced human designers debugging timing issues from Multi-Corner Multi-Mode (MCMM) timing reports. As a result, an efficient and intelligent methodology is essential for debugging timing issues and reducing turnaround times. Recently, Large Language Models (LLMs) have shown great promise across various tasks in language understanding and interactive decision-making, incorporating reasoning and actions. In this work, we propose a timing analysis agent, empowered by multi-LLM task solving, that incorporates a novel hierarchical planning and solving flow to automate the analysis of timing reports from a commercial tool. In addition, we build a Timing Debug Relation Graph (TDRG) that connects the reports with the relationships of debug traces from experienced timing engineers. The timing analysis agent employs a novel Agentic Retrieval Augmented Generation (RAG) approach, combining agents and coding to retrieve data accurately from the developed TDRG. In our studies, the proposed timing analysis agent achieves an average 98% pass-rate on a single-report benchmark and a 90% pass-rate on a multi-report benchmark from industrial designs, demonstrating its effectiveness and adaptability.

[LG-39] LLM -based AI Agent for Sizing of Analog and Mixed Signal Circuit

链接: https://arxiv.org/abs/2504.11497
作者: Chang Liu,Emmanuel A. Olowe,Danial Chitnis
类目: Machine Learning (cs.LG); Hardware Architecture (cs.AR)
*备注: to be presented in IEEE NEWCAS 2025

点击查看摘要

Abstract:The design of Analog and Mixed-Signal (AMS) integrated circuits (ICs) often involves significant manual effort, especially during the transistor sizing process. While Machine Learning techniques in Electronic Design Automation (EDA) have shown promise in reducing complexity and minimizing human intervention, they still face challenges such as numerous iterations and a lack of knowledge about AMS circuit design. Recently, Large Language Models (LLMs) have demonstrated significant potential across various fields, showing a certain level of knowledge in circuit design and indicating their potential to automate the transistor sizing process. In this work, we propose an LLM-based AI agent for AMS circuit design to assist in the sizing process. By integrating LLMs with external circuit simulation tools and data analysis functions and employing prompt engineering strategies, the agent successfully optimized multiple circuits to achieve target performance metrics. We evaluated the performance of different LLMs to assess their applicability and optimization effectiveness across seven basic circuits, and selected the best-performing model Claude 3.5 Sonnet for further exploration on an operational amplifier, with complementary input stage and class AB output stage. This circuit was evaluated against nine performance metrics, and we conducted experiments under three distinct performance requirement groups. A success rate of up to 60% was achieved for reaching the target requirements. Overall, this work demonstrates the potential of LLMs to improve AMS circuit design.

[LG-40] CI-RKM: A Class-Informed Approach to Robust Restricted Kernel Machines IJCNN

链接: https://arxiv.org/abs/2504.11476
作者: Ritik Mishra,Mushir Akhtar,M. Tanveer
类目: Machine Learning (cs.LG)
*备注: Accepted in International Joint Conference on Neural Networks (IJCNN) 2025

点击查看摘要

Abstract:Restricted kernel machines (RKMs) represent a versatile and powerful framework within the kernel machine family, leveraging conjugate feature duality to address a wide range of machine learning tasks, including classification, regression, and feature learning. However, their performance can degrade significantly in the presence of noise and outliers, which compromises robustness and predictive accuracy. In this paper, we propose a novel enhancement to the RKM framework by integrating a class-informed weighted function. This weighting mechanism dynamically adjusts the contribution of individual training points based on their proximity to class centers and class-specific characteristics, thereby mitigating the adverse effects of noisy and outlier data. By incorporating weighted conjugate feature duality and leveraging the Schur complement theorem, we introduce the class-informed restricted kernel machine (CI-RKM), a robust extension of the RKM designed to improve generalization and resilience to data imperfections. Experimental evaluations on benchmark datasets demonstrate that the proposed CI-RKM consistently outperforms existing baselines, achieving superior classification accuracy and enhanced robustness against noise and outliers. Our proposed method establishes a significant advancement in the development of kernel-based learning models, addressing a core challenge in the field.
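The core idea of weighting points by proximity to their class center can be sketched as follows; the radial-exponential form below is a hypothetical choice in the spirit of the abstract, not the exact function specified by CI-RKM.

```python
import numpy as np

def class_informed_weights(X, y, gamma=1.0):
    """Down-weight training points far from their own class center.

    w_i = exp(-gamma * ||x_i - mu_{y_i}||^2 / mean_sq_dist): a simple
    radial scheme illustrating class-informed weighting. The exact
    weighting function used by CI-RKM is defined in the paper.
    """
    X, y = np.asarray(X, float), np.asarray(y)
    w = np.empty(len(X))
    for c in np.unique(y):
        idx = np.where(y == c)[0]
        mu = X[idx].mean(axis=0)
        d2 = ((X[idx] - mu) ** 2).sum(axis=1)
        scale = d2.mean() if d2.mean() > 0 else 1.0
        w[idx] = np.exp(-gamma * d2 / scale)
    return w

X = [[0, 0], [0.1, 0], [8, 8],   # third point: an outlier labeled class 0
     [5, 5], [5.1, 5]]
y = [0, 0, 0, 1, 1]
w = class_informed_weights(X, y)
print(w)  # the outlier receives the smallest weight in its class
```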

[LG-41] SDFs from Unoriented Point Clouds using Neural Variational Heat Distances

链接: https://arxiv.org/abs/2504.11212
作者: Samuel Weidemaier,Florine Hartwig,Josua Sassen,Sergio Conti,Mirela Ben-Chen,Martin Rumpf
类目: Numerical Analysis (math.NA); Graphics (cs.GR); Machine Learning (cs.LG)
*备注: 14 pages, 16 figures, 4 tables

点击查看摘要

Abstract:We propose a novel variational approach for computing neural Signed Distance Fields (SDF) from unoriented point clouds. To this end, we replace the commonly used eikonal equation with the heat method, carrying over to the neural domain what has long been standard practice for computing distances on discrete surfaces. This yields two convex optimization problems for whose solution we employ neural networks: We first compute a neural approximation of the gradients of the unsigned distance field through a small time step of heat flow with weighted point cloud densities as initial data. Then we use it to compute a neural approximation of the SDF. We prove that the underlying variational problems are well-posed. Through numerical experiments, we demonstrate that our method provides state-of-the-art surface reconstruction and consistent SDF gradients. Furthermore, we show in a proof-of-concept that it is accurate enough for solving a PDE on the zero-level set.

[LG-42] Leave-One-Out Stable Conformal Prediction ICLR2025

链接: https://arxiv.org/abs/2504.12189
作者: Kiljae Lee,Yuan Zhang
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: Accepted at ICLR 2025

点击查看摘要

Abstract:Conformal prediction (CP) is an important tool for distribution-free predictive uncertainty quantification. Yet, a major challenge is to balance computational efficiency and prediction accuracy, particularly for multiple predictions. We propose Leave-One-Out Stable Conformal Prediction (LOO-StabCP), a novel method to speed up full conformal using algorithmic stability without sample splitting. By leveraging leave-one-out stability, our method is much faster in handling a large number of prediction requests compared to existing method RO-StabCP based on replace-one stability. We derived stability bounds for several popular machine learning tools: regularized loss minimization (RLM) and stochastic gradient descent (SGD), as well as kernel method, neural networks and bagging. Our method is theoretically justified and demonstrates superior numerical performance on synthetic and real-world data. We applied our method to a screening problem, where its effective exploitation of training data led to improved test power compared to state-of-the-art method based on split conformal.
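For background, the split-conformal procedure that LOO-StabCP and full conformal are traded off against can be sketched in a few lines (the data and model here are toy assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy regression data: y = 2x + noise.
x = rng.uniform(0, 1, 200)
y = 2 * x + rng.normal(0, 0.1, 200)
x_fit, y_fit = x[:100], y[:100]        # half for model fitting
x_cal, y_cal = x[100:], y[100:]        # half for calibration

# Fit a simple least-squares line on the fitting half.
slope, intercept = np.polyfit(x_fit, y_fit, 1)
predict = lambda t: slope * t + intercept

# Split conformal: the finite-sample-corrected (1 - alpha) quantile of
# calibration residuals gives a distribution-free prediction interval.
alpha = 0.1
scores = np.abs(y_cal - predict(x_cal))
n = len(scores)
qhat = np.quantile(scores, np.ceil((n + 1) * (1 - alpha)) / n, method="higher")

x_new = 0.5
print(f"interval: [{predict(x_new) - qhat:.3f}, {predict(x_new) + qhat:.3f}]")
```

Split conformal sacrifices the fitting half of the data; full conformal uses all of it but naively requires refitting per prediction, which is the cost that stability-based methods such as LOO-StabCP amortize.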

[LG-43] Approximation Bounds for Transformer Networks with Application to Regression

链接: https://arxiv.org/abs/2504.12175
作者: Yuling Jiao,Yanming Lai,Defeng Sun,Yang Wang,Bokai Yan
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We explore the approximation capabilities of Transformer networks for Hölder and Sobolev functions, and apply these results to address nonparametric regression estimation with dependent observations. First, we establish novel upper bounds for standard Transformer networks approximating sequence-to-sequence mappings whose component functions are Hölder continuous with smoothness index \gamma \in (0,1] . To achieve an approximation error \varepsilon under the L^p -norm for p \in [1, \infty] , it suffices to use a fixed-depth Transformer network whose total number of parameters scales as \varepsilon^-d_x n / \gamma . This result not only extends existing findings to include the case p = \infty , but also matches the best known upper bounds on number of parameters previously obtained for fixed-depth FNNs and RNNs. Similar bounds are also derived for Sobolev functions. Second, we derive explicit convergence rates for the nonparametric regression problem under various \beta -mixing data assumptions, which allow the dependence between observations to weaken over time. Our bounds on the sample complexity impose no constraints on weight magnitudes. Lastly, we propose a novel proof strategy to establish approximation bounds, inspired by the Kolmogorov-Arnold representation theorem. We show that if the self-attention layer in a Transformer can perform column averaging, the network can approximate sequence-to-sequence Hölder functions, offering new insights into the interpretability of self-attention mechanisms.

[LG-44] Control of Rayleigh-Bénard Convection: Effectiveness of Reinforcement Learning in the Turbulent Regime

链接: https://arxiv.org/abs/2504.12000
作者: Thorben Markmann,Michiel Straat,Sebastian Peitz,Barbara Hammer
类目: Fluid Dynamics (physics.flu-dyn); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Data-driven flow control has significant potential for industry, energy systems, and climate science. In this work, we study the effectiveness of Reinforcement Learning (RL) for reducing convective heat transfer in the 2D Rayleigh-Bénard Convection (RBC) system under increasing turbulence. We investigate the generalizability of control across varying initial conditions and turbulence levels and introduce a reward shaping technique to accelerate the training. RL agents trained via single-agent Proximal Policy Optimization (PPO) are compared to linear proportional derivative (PD) controllers from classical control theory. The RL agents reduced convection, measured by the Nusselt Number, by up to 33% in moderately turbulent systems and 10% in highly turbulent settings, clearly outperforming PD control in all settings. The agents showed strong generalization performance across different initial conditions and to a significant extent, generalized to higher degrees of turbulence. The reward shaping improved sample efficiency and consistently stabilized the Nusselt Number to higher turbulence levels.
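The reward shaping mentioned above can be illustrated with the classic potential-based form; whether the paper's RBC shaping takes exactly this form is an assumption here, and the Nusselt-based potential is a hypothetical choice.

```python
def shaped_rewards(rewards, states, potential, gamma=1.0):
    """Potential-based shaping: r'_t = r_t + gamma*phi(s_{t+1}) - phi(s_t).

    By Ng et al.'s classic result this transformation leaves the optimal
    policy unchanged while densifying the learning signal. `potential`
    (here based on the Nusselt number) is a hypothetical choice.
    """
    out = []
    for t, r in enumerate(rewards):
        out.append(r + gamma * potential(states[t + 1]) - potential(states[t]))
    return out

# With gamma = 1 the shaping terms telescope: the shaped return equals
# the original return plus phi(s_T) - phi(s_0).
states = [3.0, 2.5, 2.0, 1.8]          # e.g. Nusselt numbers along an episode
rewards = [-3.0, -2.5, -2.0]           # e.g. negative Nusselt per step
phi = lambda nu: -nu                   # lower Nusselt => higher potential
sr = shaped_rewards(rewards, states, phi)
print(sum(sr) - sum(rewards))          # phi(1.8) - phi(3.0) = 1.2
```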

[LG-45] Efficient identification of linear parameter-varying and nonlinear systems with noise models

链接: https://arxiv.org/abs/2504.11982
作者: Alberto Bemporad,Roland Tóth
类目: Optimization and Control (math.OC); Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注: 28 pages, 3 figures

点击查看摘要

Abstract:We present a general system identification procedure capable of estimating of a broad spectrum of state-space dynamical models, including linear time-invariant (LTI), linear parameter-varying (LPV), and nonlinear (NL) dynamics, along with rather general classes of noise models. Similar to the LTI case, we show that for this general class of model structures, including the NL case, the model dynamics can be separated into a deterministic process and a stochastic noise part, allowing to seamlessly tune the complexity of the combined model both in terms of nonlinearity and noise modeling. We parameterize the involved nonlinear functional relations by means of artificial neural-networks (ANNs), although alternative parametric nonlinear mappings can also be used. To estimate the resulting model structures, we optimize a prediction-error-based criterion using an efficient combination of a constrained quasi-Newton approach and automatic differentiation, achieving training times in the order of seconds compared to existing state-of-the-art ANN methods which may require hours for models of similar complexity. We formally establish the consistency guarantees for the proposed approach and demonstrate its superior estimation accuracy and computational efficiency on several benchmark LTI, LPV, and NL system identification problems.

[LG-46] Discrimination-free Insurance Pricing with Privatized Sensitive Attributes

链接: https://arxiv.org/abs/2504.11775
作者: Tianhe Zhang,Suhan Liu,Peng Shi
类目: Machine Learning (stat.ML); Computers and Society (cs.CY); Machine Learning (cs.LG); Risk Management (q-fin.RM)
*备注:

点击查看摘要

Abstract:Fairness has emerged as a critical consideration in the landscape of machine learning algorithms, particularly as AI continues to transform decision-making across societal domains. To ensure that these algorithms are free from bias and do not discriminate against individuals based on sensitive attributes such as gender and race, the field of algorithmic bias has introduced various fairness concepts, along with methodologies to achieve these notions in different contexts. Despite the rapid advancement, not all sectors have embraced these fairness principles to the same extent. One specific sector that merits attention in this regard is insurance. Within the realm of insurance pricing, fairness is defined through a distinct and specialized framework. Consequently, achieving fairness according to established notions does not automatically ensure fair pricing in insurance. In particular, regulators are increasingly emphasizing transparency in pricing algorithms and imposing constraints on insurance companies on the collection and utilization of sensitive consumer attributes. These factors present additional challenges in the implementation of fairness in pricing algorithms. To address these complexities and comply with regulatory demands, we propose an efficient method for constructing fair models that are tailored to the insurance domain, using only privatized sensitive attributes. Notably, our approach ensures statistical guarantees, does not require direct access to sensitive attributes, and adapts to varying transparency requirements, addressing regulatory demands while ensuring fairness in insurance pricing.

[LG-47] Generalized probabilistic canonical correlation analysis for multi-modal data integration with full or partial observations

链接: https://arxiv.org/abs/2504.11610
作者: Tianjian Yang,Wei Vivian Li
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
*备注:

点击查看摘要

Abstract:Background: The integration and analysis of multi-modal data are increasingly essential across various domains including bioinformatics. As the volume and complexity of such data grow, there is a pressing need for computational models that not only integrate diverse modalities but also leverage their complementary information to improve clustering accuracy and insights, especially when dealing with partial observations with missing data. Results: We propose Generalized Probabilistic Canonical Correlation Analysis (GPCCA), an unsupervised method for the integration and joint dimensionality reduction of multi-modal data. GPCCA addresses key challenges in multi-modal data analysis by handling missing values within the model, enabling the integration of more than two modalities, and identifying informative features while accounting for correlations within individual modalities. The model demonstrates robustness to various missing data patterns and provides low-dimensional embeddings that facilitate downstream clustering and analysis. In a range of simulation settings, GPCCA outperforms existing methods in capturing essential patterns across modalities. Additionally, we demonstrate its applicability to multi-omics data from TCGA cancer datasets and a multi-view image dataset. Conclusion: GPCCA offers a useful framework for multi-modal data integration, effectively handling missing data and providing informative low-dimensional embeddings. Its performance across cancer genomics and multi-view image data highlights its robustness and potential for broad application. To make the method accessible to the wider research community, we have released an R package, GPCCA, which is available at this https URL.
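For orientation, the classical (non-probabilistic) CCA core that GPCCA generalizes can be sketched with naive column-mean imputation standing in for GPCCA's model-based handling of missingness; the data below are synthetic assumptions.

```python
import numpy as np

def first_canonical_correlation(X, Y, eps=1e-6):
    """First canonical correlation between two modalities via
    whitening + SVD, with naive column-mean imputation of NaNs.
    GPCCA instead models missingness probabilistically; this is
    only the classical baseline it builds on.
    """
    def impute_center(M):
        M = np.array(M, float)
        col_mean = np.nanmean(M, axis=0)
        rows, cols = np.where(np.isnan(M))
        M[rows, cols] = col_mean[cols]
        return M - M.mean(axis=0)

    def inv_sqrt(S):
        w, V = np.linalg.eigh(S)
        return V @ np.diag(1.0 / np.sqrt(np.maximum(w, eps))) @ V.T

    X, Y = impute_center(X), impute_center(Y)
    n = len(X)
    Sx = X.T @ X / n + eps * np.eye(X.shape[1])
    Sy = Y.T @ Y / n + eps * np.eye(Y.shape[1])
    Sxy = X.T @ Y / n
    s = np.linalg.svd(inv_sqrt(Sx) @ Sxy @ inv_sqrt(Sy), compute_uv=False)
    return float(s[0])

rng = np.random.default_rng(1)
z = rng.normal(size=(500, 1))                    # shared latent factor
X = z + 0.1 * rng.normal(size=(500, 3))          # modality 1
Y = -z + 0.1 * rng.normal(size=(500, 2))         # modality 2
X[rng.random(X.shape) < 0.05] = np.nan           # 5% missing entries
print(first_canonical_correlation(X, Y))         # close to 1
```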

[LG-48] Traffic Adaptive Moving-window Service Patrolling for Real-time Incident Management during High-impact Events

链接: https://arxiv.org/abs/2504.11570
作者: Haozhe Lei,Ya-Ting Yang,Tao Li,Zilin Bian,Fan Zuo,Sundeep Rangan,Kaan Ozbay
类目: Optimization and Control (math.OC); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This paper presents the Traffic Adaptive Moving-window Patrolling Algorithm (TAMPA), designed to improve real-time incident management during major events like sports tournaments and concerts. Such events significantly stress transportation networks, requiring efficient and adaptive patrol solutions. TAMPA integrates predictive traffic modeling and real-time complaint estimation, dynamically optimizing patrol deployment. Using dynamic programming, the algorithm continuously adjusts patrol strategies within short planning windows, effectively balancing immediate response and efficient routing. Leveraging the Dvoretzky-Kiefer-Wolfowitz inequality, TAMPA detects significant shifts in complaint patterns, triggering proactive adjustments in patrol routes. Theoretical analyses ensure performance remains closely aligned with optimal solutions. Simulation results from an urban traffic network demonstrate TAMPA’s superior performance, showing improvements of approximately 87.5% over stationary methods and 114.2% over random strategies. Future work includes enhancing adaptability and incorporating digital twin technology for improved predictive accuracy, particularly relevant for events like the 2026 FIFA World Cup at MetLife Stadium.

[LG-49] Sub-optimality of the Separation Principle for Quadratic Control from Bilinear Observations

链接: https://arxiv.org/abs/2504.11555
作者: Yahya Sattar,Sunmook Choi,Yassir Jedra,Maryam Fazel,Sarah Dean
类目: Optimization and Control (math.OC); Machine Learning (cs.LG); Systems and Control (eess.SY); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:We consider the problem of controlling a linear dynamical system from bilinear observations with minimal quadratic cost. Despite the similarity of this problem to standard linear quadratic Gaussian (LQG) control, we show that when the observation model is bilinear, neither does the Separation Principle hold, nor is the optimal controller affine in the estimated state. Moreover, the cost-to-go is non-convex in the control input. Hence, finding an analytical expression for the optimal feedback controller is difficult in general. Under certain settings, we show that the standard LQG controller locally maximizes the cost instead of minimizing it. Furthermore, the optimal controllers (derived analytically) are not unique and are nonlinear in the estimated state. We also introduce a notion of input-dependent observability and derive conditions under which the Kalman filter covariance remains bounded. We illustrate our theoretical results through numerical experiments in multiple synthetic settings.

[LG-50] Normalizing Flow Regression for Bayesian Inference with Offline Likelihood Evaluations

链接: https://arxiv.org/abs/2504.11554
作者: Chengkun Li,Bobby Huggins,Petrus Mikkola,Luigi Acerbi
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: Accepted at the Proceedings track of the 7th Symposium on Advances in Approximate Bayesian Inference (AABI 2025). 40 pages, 10 figures

点击查看摘要

Abstract:Bayesian inference with computationally expensive likelihood evaluations remains a significant challenge in many scientific domains. We propose normalizing flow regression (NFR), a novel offline inference method for approximating posterior distributions. Unlike traditional surrogate approaches that require additional sampling or inference steps, NFR directly yields a tractable posterior approximation through regression on existing log-density evaluations. We introduce training techniques specifically for flow regression, such as tailored priors and likelihood functions, to achieve robust posterior and model evidence estimation. We demonstrate NFR’s effectiveness on synthetic benchmarks and real-world applications from neuroscience and biology, showing superior or comparable performance to existing methods. NFR represents a promising approach for Bayesian inference when standard methods are computationally prohibitive or existing model evaluations can be recycled.
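The core move — regressing a tractable family onto existing log-density evaluations, with no further sampling — can be shown in one dimension with a Gaussian standing in for the normalizing flow (a deliberate simplification: a quadratic fit in log space is exactly a Gaussian):

```python
import numpy as np

rng = np.random.default_rng(0)
true_mu, true_sigma = 1.5, 0.7

def log_density(x):                      # stand-in for an expensive model,
    return -0.5 * ((x - true_mu) / true_sigma) ** 2  # evaluated offline

xs = rng.uniform(-2, 5, 40)              # previously evaluated points
ys = log_density(xs)                     # recycled log-density values

# Regress log p(x) ≈ a x^2 + b x + c, then read off the Gaussian:
a, b, c = np.polyfit(xs, ys, 2)
mu_hat = -b / (2 * a)                    # vertex of the parabola
sigma_hat = np.sqrt(-1 / (2 * a))        # curvature gives the scale
print(mu_hat, sigma_hat)                 # ≈ 1.5, 0.7
```

NFR replaces the Gaussian with a flexible normalizing flow (plus tailored priors and likelihoods), which is what makes the idea viable for the multimodal, non-Gaussian posteriors of real applications.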

[LG-51] Strengthening Anomaly Awareness

链接: https://arxiv.org/abs/2504.11520
作者: Adam Banda,Charanjit K. Khosa,Veronica Sanz
类目: High Energy Physics - Phenomenology (hep-ph); Machine Learning (cs.LG)
*备注: 16 pages, 5 figures

点击查看摘要

Abstract:We present a refined version of the Anomaly Awareness framework for enhancing unsupervised anomaly detection. Our approach introduces minimal supervision into Variational Autoencoders (VAEs) through a two-stage training strategy: the model is first trained in an unsupervised manner on background data, and then fine-tuned using a small sample of labeled anomalies to encourage larger reconstruction errors for anomalous samples. We validate the method across diverse domains, including the MNIST dataset with synthetic anomalies, network intrusion data from the CICIDS benchmark, collider physics data from the LHCO2020 dataset, and simulated events from the Standard Model Effective Field Theory (SMEFT). The latter provides a realistic example of subtle kinematic deviations in Higgs boson production. In all cases, the model demonstrates improved sensitivity to unseen anomalies, achieving better separation between normal and anomalous samples. These results indicate that even limited anomaly information, when incorporated through targeted fine-tuning, can substantially improve the generalization and performance of unsupervised models for anomaly detection.
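The two-stage recipe — unsupervised fit on background, then a small labeled anomaly sample to sharpen the decision — can be illustrated with a linear stand-in: PCA reconstruction error in place of the paper's VAE, and the labeled anomalies used only to set a threshold. All data below are synthetic assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stage 1 (unsupervised): fit a linear "autoencoder" (top principal
# component) on background data only.
background = rng.normal(size=(500, 2)) @ np.array([[3.0, 0.0], [0.0, 0.3]])
mean = background.mean(axis=0)
_, _, Vt = np.linalg.svd(background - mean, full_matrices=False)
W = Vt[:1]                                # keep 1 of 2 directions

def recon_error(X):
    Z = (X - mean) @ W.T                  # encode
    return np.linalg.norm((X - mean) - Z @ W, axis=1)  # decode + residual

# Stage 2 (minimal supervision): a handful of labeled anomalies is used
# only to place the decision threshold on the reconstruction error.
labeled_anomalies = rng.normal(loc=[0.0, 4.0], size=(5, 2))
threshold = 0.5 * (recon_error(background).mean()
                   + recon_error(labeled_anomalies).mean())

test_anomaly = np.array([[0.0, 5.0]])
print(recon_error(test_anomaly)[0] > threshold)  # → True
```

The Anomaly Awareness framework goes further: the fine-tuning reshapes the model itself to enlarge reconstruction errors on anomalies, rather than only moving a threshold.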

[LG-52] FEAT: Free energy Estimators with Adaptive Transport

链接: https://arxiv.org/abs/2504.11516
作者: Jiajun He,Yuanqi Du,Francisco Vargas,Yuanqing Wang,Carla P. Gomes,José Miguel Hernández-Lobato,Eric Vanden-Eijnden
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Chemical Physics (physics.chem-ph); Computational Physics (physics.comp-ph)
*备注: 29 pages, 2 tables, 3 figures

点击查看摘要

Abstract:We present Free energy Estimators with Adaptive Transport (FEAT), a novel framework for free energy estimation – a critical challenge across scientific domains. FEAT leverages learned transports implemented via stochastic interpolants and provides consistent, minimum-variance estimators based on escorted Jarzynski equality and controlled Crooks theorem, alongside variational upper and lower bounds on free energy differences. Unifying equilibrium and non-equilibrium methods under a single theoretical framework, FEAT establishes a principled foundation for neural free energy calculations. Experimental validation on toy examples, molecular simulations, and quantum field theory demonstrates improvements over existing learning-based methods.

[LG-53] Learned enclosure method for experimental EIT data

链接: https://arxiv.org/abs/2504.11512
作者: Sara Sippola,Siiri Rautio,Andreas Hauptmann,Takanori Ide,Samuli Siltanen
类目: Image and Video Processing (eess.IV); Machine Learning (cs.LG); Analysis of PDEs (math.AP)
*备注:

点击查看摘要

Abstract:Electrical impedance tomography (EIT) is a non-invasive imaging method with diverse applications, including medical imaging and non-destructive testing. The inverse problem of reconstructing internal electrical conductivity from boundary measurements is nonlinear and highly ill-posed, making it difficult to solve accurately. In recent years, there has been growing interest in combining analytical methods with machine learning to solve inverse problems. In this paper, we propose a method for estimating the convex hull of inclusions from boundary measurements by combining the enclosure method proposed by Ikehata with neural networks. We demonstrate its performance using experimental data. Compared to the classical enclosure method with least squares fitting, the learned convex hull achieves superior performance on both simulated and experimental data.

信息检索

[IR-0] Clarifying Ambiguities: on the Role of Ambiguity Types in Prompting Methods for Clarification Generation SIGIR2025

链接: https://arxiv.org/abs/2504.12113
作者: Anfu Tang,Laure Soulier,Vincent Guigue
类目: Information Retrieval (cs.IR)
*备注: 11 pages, 3 figures. Accepted at SIGIR 2025

点击查看摘要

Abstract:In information retrieval (IR), providing appropriate clarifications to better understand users’ information needs is crucial for building a proactive search-oriented dialogue system. Due to the strong in-context learning ability of large language models (LLMs), recent studies investigate prompting methods to generate clarifications using few-shot or Chain of Thought (CoT) prompts. However, vanilla CoT prompting does not distinguish the characteristics of different information needs, making it difficult to understand how LLMs resolve ambiguities in user queries. In this work, we focus on the concept of ambiguity for clarification, seeking to model and integrate ambiguities in the clarification process. To this end, we comprehensively study the impact of prompting schemes based on reasoning and ambiguity for clarification. The idea is to enhance the reasoning abilities of LLMs by limiting CoT to predict first ambiguity types that can be interpreted as instructions to clarify, then correspondingly generate clarifications. We name this new prompting scheme Ambiguity Type-Chain of Thought (AT-CoT). Experiments are conducted on various datasets containing human-annotated clarifying questions to compare AT-CoT with multiple baselines. We also perform user simulations to implicitly measure the quality of generated clarifications under various IR scenarios.
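An AT-CoT-style prompt — predict ambiguity types first, then clarify per type — can be sketched as a template. The type taxonomy below is illustrative; the paper defines its own set of ambiguity types.

```python
# Hypothetical ambiguity taxonomy for illustration only.
AMBIGUITY_TYPES = ["polysemy", "underspecified intent", "missing context"]

def build_at_cot_prompt(query: str) -> str:
    """Two-step prompt: type prediction constrains the subsequent
    clarification generation, per the AT-CoT idea."""
    types = ", ".join(AMBIGUITY_TYPES)
    return (
        "You are a search clarification assistant.\n"
        f"User query: {query!r}\n"
        f"Step 1: Decide which ambiguity types apply ({types}).\n"
        "Step 2: For each type found, generate one clarifying question.\n"
        "Answer with the types first, then the questions."
    )

prompt = build_at_cot_prompt("jaguar speed")
print(prompt)
```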

[IR-1] Résumé abstractif à partir d'une transcription audio (Abstractive Summarization from an Audio Transcript)

链接: https://arxiv.org/abs/2504.11803
作者: Ilia Derkach
类目: Information Retrieval (cs.IR)
*备注: 35 pages, in French language, 8 tables, 6 figures

点击查看摘要

Abstract:Currently, large language models are gaining popularity, their achievements are used in many areas, ranging from text translation to generating answers to queries. However, the main problem with these new machine learning algorithms is that training such models requires large computing resources that only large IT companies have. To avoid this problem, a number of methods (LoRA, quantization) have been proposed so that existing models can be effectively fine-tuned for specific tasks. In this paper, we propose an E2E (end to end) audio summarization model using these techniques. In addition, this paper examines the effectiveness of these approaches to the problem under consideration and draws conclusions about the applicability of these methods.
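The LoRA technique the abstract relies on reduces to one line of algebra: freeze the pretrained weight W and learn a low-rank update, so the effective weight is W + (alpha/r)·B·A. A shape-level sketch with toy dimensions (assumptions, not the paper's configuration):

```python
import numpy as np

d_out, d_in, r, alpha = 512, 512, 8, 16  # toy layer sizes and LoRA rank

rng = np.random.default_rng(0)
W = rng.normal(size=(d_out, d_in))       # frozen pretrained weight
A = rng.normal(size=(r, d_in)) * 0.01    # trainable down-projection
B = np.zeros((d_out, r))                 # trainable, zero-init => no-op at start

def forward(x):
    # Base path plus scaled low-rank path; only A and B are trained.
    return W @ x + (alpha / r) * (B @ (A @ x))

full_params = d_out * d_in
lora_params = r * (d_in + d_out)
print(f"trainable params: {lora_params} vs {full_params} "
      f"({100 * lora_params / full_params:.1f}%)")
```

Zero-initializing B makes the adapted model exactly match the pretrained one at step 0, which is why fine-tuning starts from the base model's behavior.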

[IR-2] A New Paradigm of User-Centric Wireless Communication Driven by Large Language Models

链接: https://arxiv.org/abs/2504.11696
作者: Kuiyuan Ding,Caili Guo,Yang Yang,Wuxia Hu,Yonina C. Eldar
类目: Networking and Internet Architecture (cs.NI); Information Retrieval (cs.IR); Systems and Control (eess.SY)
*备注: 8 pages, 5 figures

点击查看摘要

Abstract:The next generation of wireless communications seeks to deeply integrate artificial intelligence (AI) with user-centric communication networks, with the goal of developing AI-native networks that more accurately address user requirements. The rapid development of large language models (LLMs) offers significant potential in realizing these goals. However, existing efforts that leverage LLMs for wireless communication often overlook the considerable gap between human natural language and the intricacies of real-world communication systems, thus failing to fully exploit the capabilities of LLMs. To address this gap, we propose a novel LLM-driven paradigm for wireless communication that innovatively incorporates a natural language to structured query language (NL2SQL) tool. In this paradigm, user personal requirements are the primary focus. Upon receiving a user request, LLMs first analyze the user intent in terms of relevant communication metrics and system parameters. Subsequently, a structured query language (SQL) statement is generated to retrieve the specific parameter values from a high-performance real-time database. We further utilize LLMs to formulate and solve an optimization problem based on the user request and the retrieved parameters. The solution to this optimization problem then drives adjustments in the communication system to fulfill the user's requirements. To validate the feasibility of the proposed paradigm, we present a prototype system, which considers a user-request-centric semantic communication (URC-SC) system in which a dynamic semantic representation network at the physical layer adapts its encoding depth to meet user requirements. Additionally, two LLMs are employed to analyze user requests and generate SQL statements, respectively. Simulation results demonstrate the effectiveness of the proposed paradigm.
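The retrieval step of this paradigm — an LLM-generated SQL statement against a real-time parameter database — can be sketched with the standard library's `sqlite3`; the table schema and the "generated" query below are illustrative stand-ins, not the paper's schema.

```python
import sqlite3

# In-memory stand-in for the high-performance real-time database.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE link_state (
    user_id TEXT, snr_db REAL, throughput_mbps REAL, latency_ms REAL)""")
conn.execute("INSERT INTO link_state VALUES ('u42', 18.5, 120.0, 12.0)")

# In the full system this statement would be produced by the NL2SQL LLM
# from a request such as "my video calls keep stuttering".
generated_sql = """SELECT snr_db, latency_ms FROM link_state
                   WHERE user_id = ? AND latency_ms > ?"""
row = conn.execute(generated_sql, ("u42", 10.0)).fetchone()
print(row)  # → (18.5, 12.0)
conn.close()
```

The retrieved values would then parameterize the optimization problem whose solution drives the system adjustment.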
