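Today's arXiv Papers | 2025-04-16

This post lists the latest papers retrieved from arXiv.org on 2025-04-16. It is updated automatically and grouped into five broad areas: NLP, CV, ML, AI, and IR. To receive the list by scheduled email, leave your email address in the comments.

Note: The daily paper data is retrieved from arXiv.org and updated automatically at around 12:00 each day.

Reminder: If you would like to receive the daily paper data by email, please leave your email address in the comments.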

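Quick read: This paper addresses the shortage of training data for AI models in complex mathematical reasoning. Existing datasets are limited in scale, difficulty, verifiable answer formats, and freedom from contamination with evaluation benchmarks, which holds back progress of reinforcement learning (RL) for large language models (LLMs) on mathematical reasoning tasks. To address this, the paper introduces DeepMath-103K, a new dataset of approximately 103K curated mathematical problems designed for training advanced reasoning models via RL. The key is a rigorous curation pipeline comprising source analysis, stringent decontamination, and filtering for high difficulty (primarily Levels 5 to 9), which makes the dataset significantly more challenging than existing open resources while keeping it free of benchmark contamination. Each problem comes with a verifiable final answer that supports rule-based RL, plus three distinct R1-generated solutions suitable for various training paradigms such as supervised fine-tuning or distillation. The dataset spans a wide range of mathematical topics and promotes generalizable reasoning, and models trained on it show significant improvements on challenging mathematical benchmarks.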

Link: https://arxiv.org/abs/2504.11456
Authors: Zhiwei He,Tian Liang,Jiahao Xu,Qiuzhi Liu,Xingyu Chen,Yue Wang,Linfeng Song,Dian Yu,Zhenwen Liang,Wenxuan Wang,Zhuosheng Zhang,Rui Wang,Zhaopeng Tu,Haitao Mi,Dong Yu
Affiliations: Tencent; Shanghai Jiao Tong University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: WIP

Abstract:The capacity for complex mathematical reasoning is a key benchmark for artificial intelligence. While reinforcement learning (RL) applied to LLMs shows promise, progress is significantly hindered by the lack of large-scale training data that is sufficiently challenging, possesses verifiable answer formats suitable for RL, and is free from contamination with evaluation benchmarks. To address these limitations, we introduce DeepMath-103K, a new, large-scale dataset comprising approximately 103K mathematical problems, specifically designed to train advanced reasoning models via RL. DeepMath-103K is curated through a rigorous pipeline involving source analysis, stringent decontamination against numerous benchmarks, and filtering for high difficulty (primarily Levels 5-9), significantly exceeding existing open resources in challenge. Each problem includes a verifiable final answer, enabling rule-based RL, and three distinct R1-generated solutions suitable for diverse training paradigms like supervised fine-tuning or distillation. Spanning a wide range of mathematical topics, DeepMath-103K promotes the development of generalizable reasoning. We demonstrate that models trained on DeepMath-103K achieve significant improvements on challenging mathematical benchmarks, validating its effectiveness. We release DeepMath-103K publicly to facilitate community progress in building more capable AI reasoning systems: this https URL.
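
To illustrate what a verifiable final answer makes possible, here is a minimal sketch of a rule-based reward. It assumes the model marks its final answer with `\boxed{...}`; the function names and the normalization rules are hypothetical and are not taken from the DeepMath-103K release.

```python
import re
from fractions import Fraction


def extract_boxed(text):
    """Return the content of the last \\boxed{...} in text, or None (nested braces unsupported)."""
    matches = re.findall(r"\\boxed\{([^{}]*)\}", text)
    return matches[-1].strip() if matches else None


def normalize(answer):
    """Map simple numeric answers to an exact Fraction; otherwise compare as stripped text."""
    cleaned = answer.replace(" ", "")
    try:
        return Fraction(cleaned)
    except (ValueError, ZeroDivisionError):
        return cleaned


def rule_based_reward(response, reference):
    """Binary reward: 1.0 if the extracted final answer matches the reference, else 0.0."""
    predicted = extract_boxed(response)
    if predicted is None:
        return 0.0
    return 1.0 if normalize(predicted) == normalize(reference) else 0.0
```

Because the check is a deterministic rule rather than a learned reward model, it can be applied to every sampled response during RL, provided the reference answer has a form that a rule can compare.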

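[NLP-1] TextArena

Quick read: This paper addresses the limitations of evaluating agentic behavior in Large Language Models (LLMs), in particular the fact that traditional benchmarks rarely assess dynamic social skills such as negotiation, theory of mind, and deception. To fill this gap, the paper presents TextArena, an open-source collection of text-based games spanning 57+ unique environments. Its key component is an easy-to-use online-play system in which models compete in real time against humans or other submitted models and receive immediate feedback through TrueSkill scores. TextArena also emphasizes ease of use, extensibility, and community collaboration, making it easy to add new games, adapt the framework, test models, play against models, and train models.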

Link: https://arxiv.org/abs/2504.11442
Authors: Leon Guertler,Bobby Cheng,Simon Yu,Bo Liu,Leshem Choshen,Cheston Tan
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Comments: work in progress; 5 pages, 3 figures

Abstract:TextArena is an open-source collection of competitive text-based games for training and evaluation of agentic behavior in Large Language Models (LLMs). It spans 57+ unique environments (including single-player, two-player, and multi-player setups) and allows for easy evaluation of model capabilities via an online-play system (against humans and other submitted models) with real-time TrueSkill scores. Traditional benchmarks rarely assess dynamic social skills such as negotiation, theory of mind, and deception, creating a gap that TextArena addresses. Designed with research, community and extensibility in mind, TextArena emphasizes ease of adding new games, adapting the framework, testing models, playing against the models, and training models. Detailed documentation of environments, games, leaderboard, and examples are available on this https URL and this https URL.

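[NLP-2] TADACap: Time-series Adaptive Domain-Aware Captioning

Quick read: This paper addresses the limited domain adaptability of existing time-series captioning methods, which typically provide only generic, domain-agnostic descriptions of time-series shapes and struggle to adapt to new domains without substantial retraining. The paper proposes TADACap, a retrieval-based framework that generates domain-aware captions for time-series images and adapts to new domains without retraining. The key is a novel retrieval strategy, TADACap-diverse, which retrieves diverse image-caption pairs from a target-domain database and achieves comparable semantic accuracy with significantly less annotation effort.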

Link: https://arxiv.org/abs/2504.11441
Authors: Elizabeth Fons,Rachneet Kaur,Zhen Zeng,Soham Palande,Tucker Balch,Svitlana Vyetrenko,Manuela Veloso
Affiliations: JP Morgan AI Research
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments: Accepted to ICAIF 2024

Abstract:While image captioning has gained significant attention, the potential of captioning time-series images, prevalent in areas like finance and healthcare, remains largely untapped. Existing time-series captioning methods typically offer generic, domain-agnostic descriptions of time-series shapes and struggle to adapt to new domains without substantial retraining. To address these limitations, we introduce TADACap, a retrieval-based framework to generate domain-aware captions for time-series images, capable of adapting to new domains without retraining. Building on TADACap, we propose a novel retrieval strategy that retrieves diverse image-caption pairs from a target domain database, namely TADACap-diverse. We benchmarked TADACap-diverse against state-of-the-art methods and ablation variants. TADACap-diverse demonstrates comparable semantic accuracy while requiring significantly less annotation effort.

[NLP-3] Masculine Defaults via Gendered Discourse in Podcasts and Large Language Models

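Quick read: This paper addresses the gender bias caused by masculine defaults, and in particular how that bias surfaces implicitly in language use. It focuses on discourse-based masculine defaults, which involve a cultural context, masculine characteristics or behaviors, and the reward for or acceptance of those characteristics or behaviors. The paper proposes a twofold framework: first, the Gendered Discourse Correlation Framework (GDCF) for large-scale discovery and analysis of gendered discourse words in spoken content; second, the Discourse Word-Embedding Association Test (D-WEAT) for measuring the gender bias associated with these words in large language models (LLMs). The key is to use topic modeling (LDA and BERTopic) to reveal how gendered discourse words are distributed across domains (business, technology/politics, and video games) and to quantify how differently they are represented in a state-of-the-art embedding model from OpenAI, showing how masculine defaults reward men with better system performance and constitute a representational harm.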

Link: https://arxiv.org/abs/2504.11431
Authors: Maria Teleki,Xiangjue Dong,Haoran Liu,James Caverlee
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG); Social and Information Networks (cs.SI)
Comments: To appear in ICWSM 2025

Abstract:Masculine defaults are widely recognized as a significant type of gender bias, but they are often unseen as they are under-researched. Masculine defaults involve three key parts: (i) the cultural context, (ii) the masculine characteristics or behaviors, and (iii) the reward for, or simply acceptance of, those masculine characteristics or behaviors. In this work, we study discourse-based masculine defaults, and propose a twofold framework for (i) the large-scale discovery and analysis of gendered discourse words in spoken content via our Gendered Discourse Correlation Framework (GDCF); and (ii) the measurement of the gender bias associated with these gendered discourse words in LLMs via our Discourse Word-Embedding Association Test (D-WEAT). We focus our study on podcasts, a popular and growing form of social media, analyzing 15,117 podcast episodes. We analyze correlations between gender and discourse words – discovered via LDA and BERTopic – to automatically form gendered discourse word lists. We then study the prevalence of these gendered discourse words in domain-specific contexts, and find that gendered discourse-based masculine defaults exist in the domains of business, technology/politics, and video games. Next, we study the representation of these gendered discourse words from a state-of-the-art LLM embedding model from OpenAI, and find that the masculine discourse words have a more stable and robust representation than the feminine discourse words, which may result in better system performance on downstream tasks for men. Hence, men are rewarded for their discourse patterns with better system performance by one of the state-of-the-art language models – and this embedding disparity is a representational harm and a masculine default.
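
The abstract does not spell out how D-WEAT is computed. As background only, the sketch below shows the effect size of the standard Word-Embedding Association Test (WEAT) that the name refers to, not D-WEAT itself. The function names are hypothetical, and the use of the sample standard deviation is an assumption.

```python
import math
from statistics import mean, stdev


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def association(w, attrs_a, attrs_b):
    """s(w, A, B): mean cosine similarity to attribute set A minus that to attribute set B."""
    return mean(cosine(w, a) for a in attrs_a) - mean(cosine(w, b) for b in attrs_b)


def weat_effect_size(targets_x, targets_y, attrs_a, attrs_b):
    """Difference of mean associations of X and Y, scaled by the spread over X and Y together."""
    sx = [association(x, attrs_a, attrs_b) for x in targets_x]
    sy = [association(y, attrs_a, attrs_b) for y in targets_y]
    return (mean(sx) - mean(sy)) / stdev(sx + sy)
```

A positive value means the target words in X sit closer to attribute set A than the words in Y do; swapping X and Y flips the sign.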

[NLP-4] A Dual-Space Framework for General Knowledge Distillation of Large Language Models

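Quick read: This paper addresses two limitations of the current white-box knowledge distillation (KD) framework for compressing large language models (LLMs): (1) bridging probability distributions from different output spaces limits the similarity between the teacher and student models; (2) the framework cannot be applied to LLMs with different vocabularies. One root cause is that the teacher and student distributions used for KD are produced by different prediction heads, which yield distributions in different output spaces and dimensions. The paper proposes a dual-space knowledge distillation (DSKD) framework. Its key idea is to introduce two projectors with ideal initialization that project the teacher/student hidden states into the student/teacher representation spaces, so that hidden states from different models can share the same prediction head and their distributions lie in a unified output space. In addition, an exact token alignment (ETA) algorithm aligns identical tokens in two differently tokenized sequences. DSKD supports both off-policy and on-policy KD, as well as KD between any two LLMs regardless of their vocabularies. Experiments on instruction-following, mathematical reasoning, and code generation benchmarks show that DSKD significantly outperforms existing methods based on the current white-box KD framework and surpasses other cross-tokenizer KD methods for LLMs with different vocabularies.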

Link: https://arxiv.org/abs/2504.11426
Authors: Xue Zhang,Songming Zhang,Yunlong Liang,Fandong Meng,Yufeng Chen,Jinan Xu,Jie Zhou
Affiliations: School of Computer Science and Technology, Beijing Key Laboratory of Traffic Data Mining and Embodied Intelligence, Beijing Jiaotong University; Pattern Recognition Center, WeChat AI, Tencent Inc
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 19 pages, 9 figures, 11 tables, under review. Code is available at: this https URL

Abstract:Knowledge distillation (KD) is a promising solution to compress large language models (LLMs) by transferring their knowledge to smaller models. During this process, white-box KD methods usually minimize the distance between the output distributions of the teacher model and the student model to transfer more information. However, we reveal that the current white-box KD framework exhibits two limitations: a) bridging probability distributions from different output spaces will limit the similarity between the teacher model and the student model; b) this framework cannot be applied to LLMs with different vocabularies. One of the root causes for these limitations is that the distributions from the teacher and the student for KD are output by different prediction heads, which yield distributions in different output spaces and dimensions. Therefore, in this paper, we propose a dual-space knowledge distillation (DSKD) framework that unifies the prediction heads of the teacher and the student models for KD. Specifically, we first introduce two projectors with ideal initialization to project the teacher/student hidden states into the student/teacher representation spaces. After this, the hidden states from different models can share the same head and unify the output spaces of the distributions. Furthermore, we develop an exact token alignment (ETA) algorithm to align the same tokens in two differently-tokenized sequences. Based on the above, our DSKD framework is a general KD framework that supports both off-policy and on-policy KD, and KD between any two LLMs regardless of their vocabularies. Extensive experiments on instruction-following, mathematical reasoning, and code generation benchmarks show that DSKD significantly outperforms existing methods based on the current white-box KD framework and surpasses other cross-tokenizer KD methods for LLMs with different vocabularies.
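
For context, the conventional white-box KD objective that the abstract criticizes can be sketched as a KL divergence between the teacher's and the student's output distributions at one token position. This is a generic illustration, not the DSKD method; the function names and the temperature parameter are assumptions.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kd_kl_loss(teacher_logits, student_logits, temperature=1.0):
    """KL(teacher || student) at one token position; requires a shared vocabulary."""
    if len(teacher_logits) != len(student_logits):
        raise ValueError("white-box KD needs distributions over the same output space")
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
```

The length check makes limitation (b) concrete: the sum runs over a shared vocabulary index, so the loss is undefined when teacher and student use different vocabularies.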

[NLP-5] Reinforcing Compositional Retrieval: Retrieving Step-by-Step for Composing Informative Contexts

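Quick read: This paper addresses the inability of traditional retrieval-augmented frameworks to handle compositional retrieval, where multiple sources must be combined, in complex tasks. Existing methods typically select top-ranked documents in a single pass, whereas many real-world scenarios require coordinated retrieval across multiple sources. The paper proposes a tri-encoder sequential retriever that models the process as a Markov Decision Process (MDP): the probability of retrieving a set of elements is decomposed into a sequence of conditional probabilities, and each retrieval step is conditioned on previously selected examples, so that inter-example dependencies are modeled explicitly. The key lies in this MDP-based sequential retrieval mechanism together with a two-stage training strategy: an initial policy is trained on efficiently constructed supervised sequential data, and the policy is then refined to align with the LLM's preferences using a reward grounded in the structural correspondence of generated programs. Experiments show that the method consistently and significantly outperforms baselines, underscoring the importance of explicitly modeling inter-example dependencies.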

Link: https://arxiv.org/abs/2504.11420
Authors: Quanyu Long,Jianda Chen,Zhengyuan Liu,Nancy F. Chen,Wenya Wang,Sinno Jialin Pan
Affiliations: Nanyang Technological University, Singapore; Institute for Infocomm Research (I2R), A*STAR, Singapore; The Chinese University of Hong Kong
Categories: Computation and Language (cs.CL)
Comments: 19 pages, 8 figures

Abstract:Large Language Models (LLMs) have demonstrated remarkable capabilities across numerous tasks, yet they often rely on external context to handle complex tasks. While retrieval-augmented frameworks traditionally focus on selecting top-ranked documents in a single pass, many real-world scenarios demand compositional retrieval, where multiple sources must be combined in a coordinated manner. In this work, we propose a tri-encoder sequential retriever that models this process as a Markov Decision Process (MDP), decomposing the probability of retrieving a set of elements into a sequence of conditional probabilities and allowing each retrieval step to be conditioned on previously selected examples. We train the retriever in two stages: first, we efficiently construct supervised sequential data for initial policy training; we then refine the policy to align with the LLM’s preferences using a reward grounded in the structural correspondence of generated programs. Experimental results show that our method consistently and significantly outperforms baselines, underscoring the importance of explicitly modeling inter-example dependencies. These findings highlight the potential of compositional retrieval for tasks requiring multiple pieces of evidence or examples.
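
The decomposition described in the abstract can be written with the chain rule. The notation is an assumption for illustration (query $x$, retrieved sequence $z_1, \dots, z_k$, retriever parameters $\theta$) and is not taken from the paper.

```latex
P_\theta(z_1, \dots, z_k \mid x) \;=\; \prod_{i=1}^{k} P_\theta\bigl(z_i \mid x,\, z_1, \dots, z_{i-1}\bigr)
```

A single-pass top-k retriever roughly corresponds to the special case in which each factor ignores $z_1, \dots, z_{i-1}$ and depends on $x$ alone.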

[NLP-6] Efficient Hybrid Language Model Compression through Group-Aware SSM Pruning

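Quick read: This paper addresses how to compress hybrid large language models (Hybrid LLMs) effectively while preserving accuracy and runtime performance. The key is a novel group-aware pruning strategy that preserves the structural integrity of State Space Model (SSM) blocks and their sequence modeling capabilities. The compression recipe combines SSM, feed-forward network (FFN), embedding dimension, and layer pruning, followed by knowledge distillation-based retraining, and yields better accuracy and inference speed than traditional approaches. With this method, the Nemotron-H 8B model is compressed to 4B parameters using up to 40x fewer training tokens, and the resulting model achieves 2x faster inference, significantly advancing the Pareto frontier.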

Link: https://arxiv.org/abs/2504.11409
Authors: Ali Taghibakhshi,Sharath Turuvekere Sreenivas,Saurav Muralidharan,Marcin Chochowski,Yashaswi Karnati,Raviraj Joshi,Ameya Sunil Mahabaleshwarkar,Zijia Chen,Yoshi Suhara,Oluwatobi Olabiyi,Daniel Korzekwa,Mostofa Patwary,Mohammad Shoeybi,Jan Kautz,Bryan Catanzaro,Ashwath Aithal,Nima Tajbakhsh,Pavlo Molchanov
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Hybrid LLM architectures that combine Attention and State Space Models (SSMs) achieve state-of-the-art accuracy and runtime performance. Recent work has demonstrated that applying compression and distillation to Attention-only models yields smaller, more accurate models at a fraction of the training cost. In this work, we explore the effectiveness of compressing Hybrid architectures. We introduce a novel group-aware pruning strategy that preserves the structural integrity of SSM blocks and their sequence modeling capabilities. Furthermore, we demonstrate the necessity of such SSM pruning to achieve improved accuracy and inference speed compared to traditional approaches. Our compression recipe combines SSM, FFN, embedding dimension, and layer pruning, followed by knowledge distillation-based retraining, similar to the MINITRON technique. Using this approach, we compress the Nemotron-H 8B Hybrid model down to 4B parameters with up to 40x fewer training tokens. The resulting model surpasses the accuracy of similarly-sized models while achieving 2x faster inference, significantly advancing the Pareto frontier.

[NLP-7] DataDecide: How to Predict Best Pretraining Data with Small Experiments

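Quick read: This paper addresses how to use small-scale experiments to choose the best pretraining dataset for large language models (LLMs) and thereby reduce pretraining cost. The central question is which benchmarks and decision methods most accurately use small-scale performance to predict the datasets that yield the best models at a larger scale (such as 1B parameters). To this end, the authors release DataDecide, a suite of models, data, and evaluations from controlled pretraining experiments across 25 corpora with differing sources, deduplication, and filtering, covering up to 100B tokens, model sizes up to 1B parameters, and 3 random seeds. They find that the ranking of models at a single small size (e.g., 150M parameters) is a strong baseline for predicting the best models at the larger target scale (1B), with roughly 80% of comparisons correct. Using continuous likelihood metrics as proxies in small experiments makes benchmarks including MMLU, ARC, HellaSwag, MBPP, and HumanEval 80% predictable at the target 1B scale with just 0.01% of the compute. The key contribution is therefore systematic small-scale experimental design and evaluation, combined with suitable metrics, for efficient data selection decisions at large scale.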

Link: https://arxiv.org/abs/2504.11393
Authors: Ian Magnusson,Nguyen Tai,Ben Bogin,David Heineman,Jena D. Hwang,Luca Soldaini,Akshita Bhagia,Jiacheng Liu,Dirk Groeneveld,Oyvind Tafjord,Noah A. Smith,Pang Wei Koh,Jesse Dodge
Affiliations: Allen Institute for AI; University of Washington; University of Pennsylvania
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments:

Abstract:Because large language models are expensive to pretrain on different datasets, using smaller-scale experiments to decide on data is crucial for reducing costs. Which benchmarks and methods of making decisions from observed performance at small scale most accurately predict the datasets that yield the best large models? To empower open exploration of this question, we release models, data, and evaluations in DataDecide, the most extensive open suite of models over differences in data and scale. We conduct controlled pretraining experiments across 25 corpora with differing sources, deduplication, and filtering up to 100B tokens, model sizes up to 1B parameters, and 3 random seeds. We find that the ranking of models at a single, small size (e.g., 150M parameters) is a strong baseline for predicting best models at our larger target scale (1B) (~80% of comparisons correct). No scaling law methods among 8 baselines exceed the compute-decision frontier of single-scale predictions, but DataDecide can measure improvement in future scaling laws. We also identify that using continuous likelihood metrics as proxies in small experiments makes benchmarks including MMLU, ARC, HellaSwag, MBPP, and HumanEval 80% predictable at the target 1B scale with just 0.01% of the compute.
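
One way to read "~80% of comparisons correct" is as a pairwise decision accuracy: for every pair of corpora, check whether the small-scale winner is also the large-scale winner. The sketch below follows that reading; the paper's exact metric may differ, and the function name and input layout are hypothetical.

```python
from itertools import combinations


def decision_accuracy(small_scores, large_scores):
    """Fraction of corpus pairs whose small-scale winner is also the large-scale winner."""
    names = sorted(small_scores)
    agree = total = 0
    for a, b in combinations(names, 2):
        small_diff = small_scores[a] - small_scores[b]
        large_diff = large_scores[a] - large_scores[b]
        if large_diff == 0:
            continue  # no winner at the target scale, so the pair is skipped
        total += 1
        if small_diff * large_diff > 0:
            agree += 1
    return agree / total if total else float("nan")
```

Both arguments map a corpus name to a benchmark score, for example scores from 150M-parameter and 1B-parameter models trained on the same corpora.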

[NLP-8] RankAlign: A Ranking View of the Generator-Validator Gap in Large Language Models

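Quick read: This paper addresses a fundamental source of unreliability in Large Language Models (LLMs): their inconsistency in reporting the same information when prompts are changed. It defines a more stringent version of the generator-validator gap, the discrepancy between a model's generated answer and its own verification of that answer, by expecting the generator's and validator's scores to correlate over the entire set of candidate answers. Under this measure, a large gap exists in question answering, lexical semantics tasks, and next-word prediction. The paper proposes RankAlign, a ranking-based training method that closes the gap by 31.8% on average, surpassing all baseline methods, and generalizes well to out-of-domain tasks and lexical items.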

Link: https://arxiv.org/abs/2504.11381
Authors: Juan Diego Rodriguez,Wenxuan Ding,Katrin Erk,Greg Durrett
Affiliations: The University of Texas at Austin
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Although large language models (LLMs) have become generally more capable and accurate across many tasks, some fundamental sources of unreliability remain in their behavior. One key limitation is their inconsistency at reporting the same information when prompts are changed. In this paper, we consider the discrepancy between a model’s generated answer and its own verification of that answer, the generator-validator gap. We define this gap in a more stringent way than prior work: we expect correlation of scores from a generator and a validator over the entire set of candidate answers. We show that according to this measure, a large gap exists in various settings, including question answering, lexical semantics tasks, and next-word prediction. We then propose RankAlign, a ranking-based training method, and show that it significantly closes the gap by 31.8% on average, surpassing all baseline methods. Moreover, this approach generalizes well to out-of-domain tasks and lexical items.
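
The correlation-based view of the gap can be sketched as follows. Pearson correlation is used here as an assumption, since the abstract only says "correlation", and the scores are made-up numbers for four candidate answers to one question.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical scores for four candidate answers to the same question.
generator_scores = [-0.2, -1.5, -3.0, -4.1]  # log-probability of generating each answer
validator_scores = [2.0, -0.5, 1.0, -3.0]    # log-odds of "Yes" when asked to verify each answer

agreement = pearson(generator_scores, validator_scores)
```

A model with no gap under this reading would rank the candidates identically in both roles; an `agreement` well below 1 signals that the generator and the validator disagree on the ordering.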

[NLP-9] Cancer-Myth: Evaluating AI Chatbot on Patient Questions with False Presuppositions

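Quick read: This paper addresses the failure of Large Language Models (LLMs) to recognize and correct false presuppositions when answering complex, personalized cancer-related questions from real patients, which poses risks to safe medical decision-making. The key contribution is Cancer-Myth, an expert-verified adversarial dataset of 585 cancer-related questions with false presuppositions, used as a benchmark to evaluate frontier LLMs. No frontier model corrects these false presuppositions more than 30% of the time, and even advanced medical agentic methods do not prevent LLMs from ignoring them. The findings expose a critical gap in clinical reliability and underscore the need for more robust safeguards in medical AI systems.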

Link: https://arxiv.org/abs/2504.11373
Authors: Wang Bill Zhu,Tianqi Chen,Ching Ying Lin,Jade Law,Mazen Jizzini,Jorge J. Nieva,Ruishan Liu,Robin Jia
Affiliations: Thomas Lord Department of Computer Science, USC; Keck School of Medicine, USC
Categories: Computation and Language (cs.CL); Computers and Society (cs.CY)
Comments:

Abstract:Cancer patients are increasingly turning to large language models (LLMs) as a new form of internet search for medical information, making it critical to assess how well these models handle complex, personalized questions. However, current medical benchmarks focus on medical exams or consumer-searched questions and do not evaluate LLMs on real patient questions with detailed clinical contexts. In this paper, we first evaluate LLMs on cancer-related questions drawn from real patients, reviewed by three hematology oncology physicians. While responses are generally accurate, with GPT-4-Turbo scoring 4.13 out of 5, the models frequently fail to recognize or address false presuppositions in the questions, posing risks to safe medical decision-making. To study this limitation systematically, we introduce Cancer-Myth, an expert-verified adversarial dataset of 585 cancer-related questions with false presuppositions. On this benchmark, no frontier LLM (including GPT-4o, this http URL, and Claude-3.5-Sonnet) corrects these false presuppositions more than 30% of the time. Even advanced medical agentic methods do not prevent LLMs from ignoring false presuppositions. These findings expose a critical gap in the clinical reliability of LLMs and underscore the need for more robust safeguards in medical AI systems.

[NLP-10] OpenTuringBench: An Open-Model-based Benchmark and Framework for Machine-Generated Text Detection and Attribution

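Quick read: This paper addresses the challenge of detecting the outputs of Open Large Language Models (OLLMs), with the goal of training and evaluating machine-generated text detectors. It proposes OpenTuringBench, a new OLLM-based benchmark for training and evaluating detectors on the Turing Test and Authorship Attribution problems. It also introduces OTBDetector, a contrastive learning framework for detecting and attributing OLLM-generated text. With challenging evaluation tasks that include human/machine-manipulated texts, out-of-domain texts, and texts from previously unseen models, OpenTuringBench enables a comprehensive assessment of detectors. Results show that the tasks vary in difficulty and that the proposed detector performs strongly across them, outperforming most existing detectors. Resources are available on the OpenTuringBench Hugging Face repository.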

Link: https://arxiv.org/abs/2504.11369
Authors: Lucio La Cava,Andrea Tagarelli
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC); Physics and Society (physics.soc-ph)
Comments: Under review with ARR

Abstract:Open Large Language Models (OLLMs) are increasingly leveraged in generative AI applications, posing new challenges for detecting their outputs. We propose OpenTuringBench, a new benchmark based on OLLMs, designed to train and evaluate machine-generated text detectors on the Turing Test and Authorship Attribution problems. OpenTuringBench focuses on a representative set of OLLMs, and features a number of challenging evaluation tasks, including human/machine-manipulated texts, out-of-domain texts, and texts from previously unseen models. We also provide OTBDetector, a contrastive learning framework to detect and attribute OLLM-based machine-generated texts. Results highlight the relevance and varying degrees of difficulty of the OpenTuringBench tasks, with our detector achieving remarkable capabilities across the various tasks and outperforming most existing detectors. Resources are available on the OpenTuringBench Hugging Face repository at this https URL

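[NLP-11] Teaching Large Language Models to Reason through Learning and Forgetting

Quick read: This paper addresses the fact that inference-time search improves the ability of Large Language Models (LLMs) to solve complex mathematical and reasoning problems but significantly increases computational cost and inference time. The key solution is to integrate search capabilities directly into the model by fine-tuning it on both successful reasoning paths (learning) and failed reasoning paths (forgetting). The main difficulty is that the model's search capability degrades rapidly if fine-tuning is performed naively; the authors find that a smaller learning rate substantially mitigates this degradation. Experiments on the Game-of-24 and Countdown benchmarks show that the approach outperforms both standard fine-tuning and inference-time search baselines while reducing inference time by 180×.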

Link: https://arxiv.org/abs/2504.11364
Authors: Tianwei Ni,Allen Nie,Sapana Chaudhary,Yao Liu,Huzefa Rangwala,Rasool Fakoor
Affiliations: Mila, Université de Montréal; Amazon Web Services
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:Leveraging inference-time search in large language models has proven effective in further enhancing a trained model’s capability to solve complex mathematical and reasoning problems. However, this approach significantly increases computational costs and inference time, as the model must generate and evaluate multiple candidate solutions to identify a viable reasoning path. To address this, we propose an effective approach that integrates search capabilities directly into the model by fine-tuning it using both successful (learning) and failed reasoning paths (forgetting) derived from diverse search methods. While fine-tuning the model with these data might seem straightforward, we identify a critical issue: the model’s search capability tends to degrade rapidly if fine-tuning is performed naively. We show that this degradation can be substantially mitigated by employing a smaller learning rate. Extensive experiments on the challenging Game-of-24 and Countdown mathematical reasoning benchmarks show that our approach not only outperforms both standard fine-tuning and inference-time search baselines but also significantly reduces inference time by 180×.

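[NLP-12] A Minimalist Approach to LLM Reasoning: from Rejection Sampling to Reinforce

Quick read: This paper investigates where the effectiveness of GRPO in reinforcement learning fine-tuning of Large Language Models (LLMs) comes from and proposes a simpler yet effective alternative. The key finding is that RAFT, a rejection sampling baseline that trains only on positively rewarded samples, performs competitively with GRPO and PPO on complex reasoning tasks. Ablation studies show that GRPO's main advantage comes from discarding prompts with entirely incorrect responses rather than from its reward normalization. Motivated by this, the authors propose Reinforce-Rej, a minimal extension of policy gradient that filters out both entirely incorrect and entirely correct samples, improving KL efficiency and stability. The paper advocates RAFT as a robust and interpretable baseline and suggests that future work focus on more principled designs for incorporating negative samples rather than relying on them indiscriminately. These findings offer guidance for reward-based LLM post-training.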

Link: https://arxiv.org/abs/2504.11343
Authors: Wei Xiong,Jiarui Yao,Yuhui Xu,Bo Pang,Lei Wang,Doyen Sahoo,Junnan Li,Nan Jiang,Tong Zhang,Caiming Xiong,Hanze Dong
Affiliations: Salesforce AI Research; University of Illinois Urbana-Champaign
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (stat.ML)
Comments: 12 pages, 4 figures

Abstract:Reinforcement learning (RL) has become a prevailing approach for fine-tuning large language models (LLMs) on complex reasoning tasks. Among recent methods, GRPO stands out for its empirical success in training models such as DeepSeek-R1, yet the sources of its effectiveness remain poorly understood. In this work, we revisit GRPO from a reinforce-like algorithm perspective and analyze its core components. Surprisingly, we find that a simple rejection sampling baseline, RAFT, which trains only on positively rewarded samples, yields performance competitive with GRPO and PPO. Our ablation studies reveal that GRPO’s main advantage arises from discarding prompts with entirely incorrect responses, rather than from its reward normalization. Motivated by this insight, we propose Reinforce-Rej, a minimal extension of policy gradient that filters both entirely incorrect and entirely correct samples. Reinforce-Rej improves KL efficiency and stability, serving as a lightweight yet effective alternative to more complex RL algorithms. We advocate RAFT as a robust and interpretable baseline, and suggest that future advances should focus on more principled designs for incorporating negative samples, rather than relying on them indiscriminately. Our findings provide guidance for future work in reward-based LLM post-training.
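
The two sample-selection rules named in the abstract can be sketched as simple filters over sampled responses. This is a minimal illustration that assumes binary rewards (1 for correct, 0 for incorrect); the function names and the data layout are hypothetical, and the actual training updates are omitted.

```python
def raft_samples(groups):
    """RAFT-style selection: keep only the positively rewarded (prompt, response) pairs."""
    return [(prompt, response)
            for prompt, samples in groups.items()
            for response, reward in samples
            if reward > 0]


def reinforce_rej_prompts(groups):
    """Reinforce-Rej-style filter: drop prompts whose samples are all correct or all incorrect."""
    kept = {}
    for prompt, samples in groups.items():
        rewards = [reward for _, reward in samples]
        if 0 < sum(rewards) < len(rewards):
            kept[prompt] = samples
    return kept
```

Here `groups` maps each prompt to a list of `(response, reward)` tuples sampled from the current policy. Under this reading, only prompts with mixed outcomes are kept for the policy-gradient update.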

[NLP-13] REWARD CONSISTENCY: Improving Multi-Objective Alignment from a Data-Centric Perspective

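Quick read: This paper addresses a challenging trade-off in multi-objective preference alignment of language models: optimizing for one human preference (e.g., helpfulness) frequently compromises others (e.g., harmlessness) because of inherent conflicts between competing objectives. While prior work mainly focuses on algorithmic solutions, this paper takes a data-driven approach to uncover the types of data that effectively mitigate these conflicts. The key is the concept of Reward Consistency (RC), which identifies samples that align with multiple preference objectives and thereby reduces conflicts during training; a gradient-based analysis shows that RC-compliant samples inherently constrain performance degradation during multi-objective optimization. Building on this, the authors develop Reward Consistency Sampling, a framework that automatically constructs preference datasets that mitigate conflicts. The generated data achieves an average improvement of 13.37% in both the harmless rate and the helpfulness win rate when optimizing harmlessness and helpfulness, and consistently resolves conflicts in varying multi-objective scenarios.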

Link: https://arxiv.org/abs/2504.11337
Authors: Zhihao Xu,Yongqi Tong,Xin Zhang,Jun Zhou,Xiting Wang
Affiliations: Renmin University of China; Ant Group
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Multi-objective preference alignment in language models often encounters a challenging trade-off: optimizing for one human preference (e.g., helpfulness) frequently compromises others (e.g., harmlessness) due to the inherent conflicts between competing objectives. While prior work mainly focuses on algorithmic solutions, we explore a novel data-driven approach to uncover the types of data that can effectively mitigate these conflicts. Specifically, we propose the concept of Reward Consistency (RC), which identifies samples that align with multiple preference objectives, thereby reducing conflicts during training. Through gradient-based analysis, we demonstrate that RC-compliant samples inherently constrain performance degradation during multi-objective optimization. Building on these insights, we further develop Reward Consistency Sampling, a framework that automatically constructs preference datasets that effectively mitigate conflicts during multi-objective alignment. Our generated data achieves an average improvement of 13.37% in both the harmless rate and helpfulness win rate when optimizing harmlessness and helpfulness, and can consistently resolve conflicts in varying multi-objective scenarios.
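
The abstract does not give the formal definition of Reward Consistency. One plausible reading, sketched below as an assumption, is that a preference pair is consistent when every objective's reward function prefers the chosen response over the rejected one. The function names and the dictionary layout are hypothetical.

```python
def is_reward_consistent(chosen, rejected, reward_fns):
    """True if every reward function scores the chosen response above the rejected one."""
    return all(fn(chosen) > fn(rejected) for fn in reward_fns)


def filter_consistent_pairs(pairs, reward_fns):
    """Keep only the (chosen, rejected) pairs on which all objectives agree."""
    return [(c, r) for c, r in pairs if is_reward_consistent(c, r, reward_fns)]


# Toy responses with precomputed, made-up scores for two objectives.
helpful = lambda resp: resp["helpful"]
harmless = lambda resp: resp["harmless"]

pairs = [
    ({"helpful": 0.9, "harmless": 0.8}, {"helpful": 0.4, "harmless": 0.5}),  # objectives agree
    ({"helpful": 0.9, "harmless": 0.2}, {"helpful": 0.4, "harmless": 0.7}),  # objectives conflict
]
consistent = filter_consistent_pairs(pairs, [helpful, harmless])
```

In the toy data, the second pair is dropped because the chosen response is more helpful but less harmless than the rejected one.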

[NLP-14] Looking beyond the next token

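Quick read: This paper addresses the mismatch between the training objective of causal language models and humans' natural writing and reasoning process. Causal language model training assumes that each token can be accurately predicted from the previous context, whereas humans typically know their goals before settling on the exact argument or phrasing. The key solution is Trelawney, a technique that rearranges and processes training data sequences so that models more accurately imitate the true data-generating process, without any other changes to the architecture or training infrastructure. Trelawney and the inference algorithms derived from it improve performance on several key benchmarks spanning planning, algorithmic reasoning, and story generation, and the method enables the generation of long-term goals at no additional cost. The authors further investigate how the model's goal-generation capability can improve planning and reasoning, and suggest that Trelawney could open doors to new capabilities beyond the current language modeling paradigm.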

Link: https://arxiv.org/abs/2504.11336
Authors: Abitha Thankaraj,Yiding Jiang,J. Zico Kolter,Yonatan Bisk
Affiliations: Carnegie Mellon University
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:The structure of causal language model training assumes that each token can be accurately predicted from the previous context. This contrasts with humans’ natural writing and reasoning process, where goals are typically known before the exact argument or phrasings. While this mismatch has been well studied in the literature, the working assumption has been that architectural changes are needed to address this mismatch. We argue that rearranging and processing the training data sequences can allow models to more accurately imitate the true data-generating process, and does not require any other changes to the architecture or training infrastructure. We demonstrate that this technique, Trelawney, and the inference algorithms derived from it allow us to improve performance on several key benchmarks that span planning, algorithmic reasoning, and story generation tasks. Finally, our method naturally enables the generation of long-term goals at no additional cost. We investigate how using the model’s goal-generation capability can further improve planning and reasoning. Additionally, we believe Trelawney could potentially open doors to new capabilities beyond the current language modeling paradigm.

[NLP-15] Dependency Structure Augmented Contextual Scoping Framework for Multimodal Aspect-Based Sentiment Analysis ACM-MM2025

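Quick read: This paper addresses three core challenges in Multimodal Aspect-Based Sentiment Analysis (MABSA): Sentiment Cue Perception (SCP), Multimodal Information Misalignment (MIM), and Semantic Noise Elimination (SNE). To overcome the limitations of existing methods, it proposes DASCO (Dependency Structure Augmented Scoping Framework), a fine-grained scope-oriented framework that enhances aspect-level sentiment reasoning by leveraging dependency parsing trees. The key is a multi-task pretraining strategy that combines aspect-oriented enhancement, image-text matching, and aspect-level sentiment-sensitive cognition, together with dependency trees that serve as a syntactic branch combined with a semantic branch. This guides the model to attend selectively to critical contextual elements within a target-specific scope while filtering out irrelevant noise, addressing all three challenges.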

Link: https://arxiv.org/abs/2504.11331
Authors: Hao Liu,Lijun He,Jiaxi Liang,Zhihan Ren,Fan Li
Affiliations: Xi’an Jiaotong University
Categories: Computation and Language (cs.CL); Multimedia (cs.MM)
Comments: submitted to ACM MM2025

Abstract:Multimodal Aspect-Based Sentiment Analysis (MABSA) seeks to extract fine-grained information from image-text pairs to identify aspect terms and determine their sentiment polarity. However, existing approaches often fall short in simultaneously addressing three core challenges: Sentiment Cue Perception (SCP), Multimodal Information Misalignment (MIM), and Semantic Noise Elimination (SNE). To overcome these limitations, we propose DASCO (Dependency Structure Augmented Scoping Framework), a fine-grained scope-oriented framework that enhances aspect-level sentiment reasoning by leveraging dependency parsing trees. First, we designed a multi-task pretraining strategy for MABSA on our base model, combining aspect-oriented enhancement, image-text matching, and aspect-level sentiment-sensitive cognition. This improved the model’s perception of aspect terms and sentiment cues while achieving effective image-text alignment, addressing key challenges like SCP and MIM. Furthermore, we incorporate dependency trees as a syntactic branch combined with a semantic branch, guiding the model to selectively attend to critical contextual elements within a target-specific scope while effectively filtering out irrelevant noise to address the SNE problem. Extensive experiments on two benchmark datasets across three subtasks demonstrate that DASCO achieves state-of-the-art performance in MABSA, with notable gains in JMASA (+3.1% F1 and +5.4% precision on Twitter2015).

[NLP-16] Automated Python Translation

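[Quick read]: This paper asks how Python's natural modality (keywords, error types, identifiers, and so on) can be translated automatically into other human languages, lowering the barrier for non-English speakers to understand Python code. The key contribution is an automated pipeline that compares machine-translation and large-language-model strategies for translating Python terms across languages, with translation quality checked in French, Greek, and Bengali. The work aims to chart a clear path toward a universal Python usable by anyone regardless of nationality or language background.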

Link: https://arxiv.org/abs/2504.11290
Authors: Joshua Otten, Antonios Anastasopoulos, Kevin Moran
Affiliations: George Mason University; University of Central Florida
Categories: Computation and Language (cs.CL)
Comments: 15 pages, 4 figures, 17 tables

Abstract:Python is one of the most commonly used programming languages in industry and education. Its English keywords and built-in functions/modules allow it to come close to pseudo-code in terms of its readability and ease of writing. However, those who do not speak English may not experience these advantages. In fact, they may even be hindered in their ability to understand Python code, as the English nature of its terms creates an additional layer of overhead. To that end, we introduce the task of automatically translating Python’s natural modality (keywords, error types, identifiers, etc.) into other human languages. This presents a unique challenge, considering the abbreviated nature of these forms, as well as potential untranslatability of advanced mathematical/programming concepts across languages. We therefore create an automated pipeline to translate Python into other human languages, comparing strategies using machine translation and large language models. We then use this pipeline to acquire translations from five common Python libraries (pytorch, pandas, tensorflow, numpy, and random) in seven languages, and do a quality test on a subset of these terms in French, Greek, and Bengali. We hope this will provide a clearer path forward towards creating a universal Python, accessible to anyone regardless of nationality or language background.
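
As a rough illustration of what translating Python's "natural modality" involves at the token level, the sketch below renames keyword and built-in tokens while leaving string literals untouched. It is not the authors' pipeline: the function name `translate_names` and the small French mapping are hypothetical, chosen only for this example.

```python
import io
import tokenize


def translate_names(source: str, mapping: dict) -> str:
    """Rename NAME tokens (keywords, identifiers) found in `mapping`.

    String literals and comments are separate token types, so an "if"
    inside a string is left alone.
    """
    lines = source.splitlines(keepends=True)
    edits = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == tokenize.NAME and tok.string in mapping:
            row, col = tok.start
            _, end_col = tok.end
            edits.append((row - 1, col, end_col, mapping[tok.string]))
    # Apply right-to-left so earlier column offsets stay valid.
    for row, col, end_col, new in sorted(edits, reverse=True):
        line = lines[row]
        lines[row] = line[:col] + new + line[end_col:]
    return "".join(lines)


# Hypothetical English-to-French term table (illustrative only).
EN_TO_FR = {"if": "si", "print": "afficher"}

translated = translate_names('if x:\n    print("if")\n', EN_TO_FR)
```

Inverting the mapping and running the same function would turn the translated text back into executable Python, which is one way such a table could be used in practice.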

[NLP-17] The Obvious Invisible Threat: LLM-Powered GUI Agents Vulnerability to Fine-Print Injections

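[Quick read]: This paper addresses the privacy and security risks faced by LLM-powered graphical user interface (GUI) agents when they carry out tasks. Because GUI agents must process and act on sensitive user data to complete real-world tasks such as filling forms or booking services, their autonomy introduces new risks of privacy leakage and security compromise. Attackers can inject malicious content that alters agent behavior or induces unintended disclosure of private information; such attacks typically exploit the difference in visual saliency between agents and human users, or the agent's limited ability to detect violations of contextual integrity during task automation.

The key contribution is a characterization of six types of attack on GUI agents, tested in an experimental study with six state-of-the-art GUI agents, 234 adversarial webpages, and 39 human participants. GUI agents prove especially vulnerable to contextually embedded threats, and human users are also susceptible to many of the attacks, which indicates that simple human oversight cannot reliably prevent failures. This misalignment underlines the need for privacy-aware agent design. The paper proposes practical defense strategies to guide the development of safer, more reliable GUI agents.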

Link: https://arxiv.org/abs/2504.11281
Authors: Chaoran Chen, Zhiping Zhang, Bingcan Guo, Shang Ma, Ibrahim Khalilov, Simret A Gebreegziabher, Yanfang Ye, Ziang Xiao, Yaxing Yao, Tianshi Li, Toby Jia-Jun Li
Affiliations: University of Notre Dame; Northeastern University; University of Washington; Virginia Tech; Johns Hopkins University; Northeastern University; University of Notre Dame
Categories: Human-Computer Interaction (cs.HC); Computation and Language (cs.CL); Cryptography and Security (cs.CR)
Comments: none

Abstract:A Large Language Model (LLM) powered GUI agent is a specialized autonomous system that performs tasks on the user’s behalf according to high-level instructions. It does so by perceiving and interpreting the graphical user interfaces (GUIs) of relevant apps, often visually, inferring necessary sequences of actions, and then interacting with GUIs by executing the actions such as clicking, typing, and tapping. To complete real-world tasks, such as filling forms or booking services, GUI agents often need to process and act on sensitive user data. However, this autonomy introduces new privacy and security risks. Adversaries can inject malicious content into the GUIs that alters agent behaviors or induces unintended disclosures of private information. These attacks often exploit the discrepancy between visual saliency for agents and human users, or the agent’s limited ability to detect violations of contextual integrity in task automation. In this paper, we characterized six types of such attacks, and conducted an experimental study to test these attacks with six state-of-the-art GUI agents, 234 adversarial webpages, and 39 human participants. Our findings suggest that GUI agents are highly vulnerable, particularly to contextually embedded threats. Moreover, human users are also susceptible to many of these attacks, indicating that simple human oversight may not reliably prevent failures. This misalignment highlights the need for privacy-aware agent design. We propose practical defense strategies to inform the development of safer and more reliable GUI agents.

[NLP-18] From Misleading Queries to Accurate Answers: A Three-Stage Fine-Tuning Method for LLMs

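[Quick read]: This paper addresses the high sensitivity of large language models (LLMs) to input query quality, particularly when queries contain misleading or inaccurate information. Existing methods focus on correcting the output and overlook improving the model's ability to detect and correct misleading content in the input. The key solution is a novel three-stage fine-tuning method: stage one trains the LLM to identify misleading information; stage two trains it to correct that information using built-in or external knowledge; stage three trains it to generate accurate answers from the corrected query. The method significantly improves response accuracy and reduces hallucinations, especially when the input contains misleading information.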

Link: https://arxiv.org/abs/2504.11277
Authors: Guocong Li, Weize Liu, Yihang Wu, Ping Wang, Shuaihan Huang, Hongxia Xu, Jian Wu
Affiliations: Zhejiang University; Renmin University of China; AI Research Center, WeDoctor Cloud
Categories: Computation and Language (cs.CL)
Comments: none

Abstract:Large language models (LLMs) exhibit excellent performance in natural language processing (NLP), but remain highly sensitive to the quality of input queries, especially when these queries contain misleading or inaccurate information. Existing methods focus on correcting the output, but they often overlook the potential of improving the ability of LLMs to detect and correct misleading content in the input itself. In this paper, we propose a novel three-stage fine-tuning method that enhances the ability of LLMs to detect and correct misleading information in the input, further improving response accuracy and reducing hallucinations. Specifically, the three stages include (1) training LLMs to identify misleading information, (2) training LLMs to correct the misleading information using built-in or external knowledge, and (3) training LLMs to generate accurate answers based on the corrected queries. To evaluate our method, we conducted experiments on three datasets for the hallucination detection task and the question answering (QA) task, as well as two datasets containing misleading information that we constructed. The experimental results demonstrate that our method significantly improves the accuracy and factuality of LLM responses, while also enhancing the ability to detect hallucinations and reducing the generation of hallucinations in the output, particularly when the query contains misleading information. We will publicly release our code upon acceptance.
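
The three stages can be pictured as three supervised examples derived from one annotated query. The sketch below is only an illustration under assumed conventions: the `Sample` fields, the prompt wording, and `build_stage_examples` are hypothetical and are not taken from the paper, whose code had not been released at the time of the abstract.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    query: str            # user query, possibly with a false premise
    misleading_span: str  # the misleading part ("" if there is none)
    corrected_query: str  # query after the premise is fixed
    answer: str           # reference answer to the corrected query


def build_stage_examples(s: Sample) -> list:
    """Return one (prompt, target) record per fine-tuning stage."""
    return [
        # Stage 1: identify misleading information in the input.
        {"stage": 1,
         "prompt": f"Identify any misleading information in: {s.query}",
         "target": s.misleading_span or "none"},
        # Stage 2: correct the misleading information.
        {"stage": 2,
         "prompt": f"Rewrite the query with the misleading part corrected: {s.query}",
         "target": s.corrected_query},
        # Stage 3: answer based on the corrected query.
        {"stage": 3,
         "prompt": s.corrected_query,
         "target": s.answer},
    ]


example = Sample(
    query="Why is the Great Wall visible from the Moon?",
    misleading_span="visible from the Moon",
    corrected_query="Is the Great Wall visible from the Moon?",
    answer="No, it is not visible to the naked eye from the Moon.",
)
records = build_stage_examples(example)
```

Training would then proceed stage by stage on records with the matching `stage` value; how the stages are scheduled is left open here.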

[NLP-19] UI-E2I-Synth: Advancing GUI Grounding with Large-Scale Instruction Synthesis

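[Quick read]: This paper addresses several challenges in vision-based GUI instruction grounding, including element-to-screen ratio, unbalanced element types, and implicit instructions. It proposes UI-E2I-Synth, a large-scale data synthesis pipeline that uses GPT-4o rather than human annotators to generate varied and complex instruction datasets. It also introduces UI-I2E-Bench, a new GUI instruction grounding benchmark designed to overcome the limitations of existing benchmarks by covering more diverse annotation aspects. The key lies in the data synthesis method and the comprehensive benchmark design, which together improve model performance on GUI instruction grounding and offer practical guidance for follow-up research.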

Link: https://arxiv.org/abs/2504.11257
Authors: Xinyi Liu, Xiaoyi Zhang, Ziyun Zhang, Yan Lu
Affiliations: unknown
Categories: Human-Computer Interaction (cs.HC); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: none

Abstract: Recent advancements in Large Vision-Language Models are accelerating the development of Graphical User Interface (GUI) agents that utilize human-like vision perception capabilities to enhance productivity on digital devices. Compared to approaches predicated on GUI metadata, which are platform-dependent and vulnerable to implementation variations, vision-based approaches offer broader applicability. In this vision-based paradigm, the GUI instruction grounding, which maps user instruction to the location of corresponding element on the given screenshot, remains a critical challenge, particularly due to limited public training dataset and resource-intensive manual instruction data annotation. In this paper, we delve into unexplored challenges in this task including element-to-screen ratio, unbalanced element type, and implicit instruction. To address these challenges, we introduce a large-scale data synthesis pipeline UI-E2I-Synth for generating varying complex instruction datasets using GPT-4o instead of human annotators. Furthermore, we propose a new GUI instruction grounding benchmark UI-I2E-Bench, which is designed to address the limitations of existing benchmarks by incorporating diverse annotation aspects. Our model, trained on the synthesized data, achieves superior performance in GUI instruction grounding, demonstrating the advancements of proposed data synthesis pipeline. The proposed benchmark, accompanied by extensive analyses, provides practical insights for future research in GUI grounding. We will release corresponding artifacts at this https URL

[NLP-20] Towards Automated Safety Requirements Derivation Using Agent-based RAG

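[Quick read]: This paper addresses the lack of domain-specific knowledge when conventional pre-trained LLMs are used to assist safety analyses, and the tendency of existing retrieval-augmented generation (RAG) methods to degrade on complex queries and fail to retrieve the most relevant documents, which matters particularly in safety-critical applications. It proposes an agent-based RAG approach to derive safety requirements for an automated driving system and shows that the retrieved information is more relevant to the queries. The key is the agent mechanism, implemented over a document pool of automotive standards and the Apollo automated driving perception case study, which improves the relevance and accuracy of retrieval.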

Link: https://arxiv.org/abs/2504.11243
Authors: Balahari Vignesh Balu, Florian Geissler, Francesco Carella, Joao-Vitor Zacchi, Josef Jiru, Nuria Mata, Reinhard Stolle
Affiliations: unknown
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 9 pages, 3 figures

Abstract: We study the automated derivation of safety requirements in a self-driving vehicle use case, leveraging LLMs in combination with agent-based retrieval-augmented generation. Conventional approaches that utilise pre-trained LLMs to assist in safety analyses typically lack domain-specific knowledge. Existing RAG approaches address this issue, yet their performance deteriorates when handling complex queries and it becomes increasingly harder to retrieve the most relevant information. This is particularly relevant for safety-relevant applications. In this paper, we propose the use of agent-based RAG to derive safety requirements and show that the retrieved information is more relevant to the queries. We implement an agent-based approach on a document pool of automotive standards and the Apollo case study, as a representative example of an automated driving perception system. Our solution is tested on a data set of safety requirement questions and answers, extracted from the Apollo data. Evaluating a set of selected RAG metrics, we present and discuss advantages of an agent-based approach compared to default RAG methods.

[NLP-21] Nondeterministic Polynomial-time Problem Challenge: An Ever-Scaling Reasoning Benchmark for LLMs

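[Quick read]: This paper addresses two main problems of current benchmarks: (i) they are crushed within a short time (less than a year), and (ii) they are easily hacked. It proposes the notion of "ever-scalingness" for building benchmarks that are uncrushable, unhackable, auto-verifiable, and general. The key solution is the Nondeterministic Polynomial-time Problem Challenge (NPPC), a reasoning benchmark built on 25 well-known NP-complete problems with three core modules: (i) npgym, a unified interface that can generate any number of instances at any level of complexity; (ii) npsolver, a unified evaluation interface for online models via APIs and offline models via local deployment; and (iii) npeval, a set of analysis tools that quantify model performance across problem types, token counts, aha moments, reasoning errors, and solution errors. Experiments show that NPPC drives the performance of advanced LLMs below 10%, supporting the claim that it is uncrushable, and reveal how performance changes as tasks become more complex.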

Link: https://arxiv.org/abs/2504.11239
Authors: Chang Yang, Ruiyu Wang, Junzhe Jiang, Qi Jiang, Qinggang Zhang, Yanchen Deng, Shuxin Li, Shuyue Hu, Bo Li, Florian T. Pokorny, Xiao Huang, Xinrun Wang
Affiliations: The Hong Kong Polytechnic University; KTH Royal Institute of Technology; Carnegie Mellon University; Nanyang Technological University; Shanghai AI Laboratory; Singapore Management University
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Preliminary work, 10 pages for main text

Abstract: Reasoning is the fundamental capability of large language models (LLMs). Due to the rapid progress of LLMs, there are two main issues of current benchmarks: i) these benchmarks can be crushed in a short time (less than 1 year), and ii) these benchmarks may be easily hacked. To handle these issues, we propose the ever-scalingness for building the benchmarks which are uncrushable, unhackable, auto-verifiable and general. This paper presents Nondeterministic Polynomial-time Problem Challenge (NPPC), an ever-scaling reasoning benchmark for LLMs. Specifically, the NPPC has three main modules: i) npgym, which provides a unified interface of 25 well-known NP-complete problems and can generate any number of instances with any levels of complexities, ii) npsolver, which provides a unified interface to evaluate the problem instances with both online and offline models via APIs and local deployments, respectively, and iii) npeval, which provides the comprehensive and ready-to-use tools to analyze the performances of LLMs over different problems, the number of tokens, the aha moments, the reasoning errors and the solution errors. Extensive experiments over widely-used LLMs demonstrate: i) NPPC can successfully decrease the performance of advanced LLMs to below 10%, demonstrating that NPPC is uncrushable, ii) DeepSeek-R1, Claude-3.7-Sonnet, and o1/o3-mini are the most powerful LLMs, where DeepSeek-R1 outperforms Claude-3.7-Sonnet and o1/o3-mini in most NP-complete problems considered, and iii) the numbers of tokens, aha moments in the advanced LLMs, e.g., Claude-3.7-Sonnet and DeepSeek-R1, are observed first to increase and then decrease when the problem instances become more and more difficult. We believe that NPPC is the first ever-scaling reasoning benchmark, serving as the uncrushable and unhackable testbed for LLMs toward artificial general intelligence (AGI).
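
NP-complete problems suit an "ever-scaling, auto-verifiable" benchmark because instances can be generated at any size while a proposed solution is cheap to check. The sketch below illustrates this with 3-SAT. It does not use the npgym API; `generate_3sat` and `satisfies` are hypothetical names introduced for this example.

```python
import random


def satisfies(clauses, assignment):
    """Check a 3-SAT assignment in time linear in the formula size.

    A literal v > 0 means variable v is true, v < 0 means it is false.
    """
    return all(
        any(assignment[abs(lit)] == (lit > 0) for lit in clause)
        for clause in clauses
    )


def generate_3sat(num_vars, num_clauses, seed=0):
    """Generate a satisfiable 3-SAT instance by planting a solution.

    Requires num_vars >= 3. Difficulty is scaled by raising num_vars
    and num_clauses.
    """
    rng = random.Random(seed)
    planted = {v: rng.choice([True, False]) for v in range(1, num_vars + 1)}
    clauses = []
    while len(clauses) < num_clauses:
        variables = rng.sample(range(1, num_vars + 1), 3)
        clause = [v if rng.random() < 0.5 else -v for v in variables]
        if satisfies([clause], planted):  # keep only clauses the plant satisfies
            clauses.append(clause)
    return clauses, planted


clauses, planted = generate_3sat(num_vars=5, num_clauses=20, seed=1)
```

A model's answer would be parsed into an assignment and passed to `satisfies`, so scoring needs no human judgment and no stored answer key.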

[NLP-22] Enhancing multimodal analogical reasoning with Logic Augmented Generation

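[Quick read]: This paper addresses the challenge of automatically extracting implicit knowledge from natural language, which is hard for machines because they lack active experience of the physical world. The key solution combines a Logic-Augmented Generation (LAG) framework with the explicit representation provided by semantic knowledge graphs, plus prompt heuristics, to elicit implicit analogical connections. The method generates extended knowledge graph triples that represent implicit meaning, allowing systems to reason over unlabeled multimodal data regardless of domain. Validated on metaphor detection and understanding tasks across four datasets, the integrated approach surpasses current baselines, outperforms humans at understanding visual metaphors, and yields more explainable reasoning, although inherent limitations remain for domain-specific metaphors.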

Link: https://arxiv.org/abs/2504.11190
Authors: Anna Sofia Lippolis, Andrea Giovanni Nuzzolese, Aldo Gangemi
Affiliations: University of Bologna, Italy; ISTC-CNR, Italy; University of Bologna, Italy
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: none

Abstract:Recent advances in Large Language Models have demonstrated their capabilities across a variety of tasks. However, automatically extracting implicit knowledge from natural language remains a significant challenge, as machines lack active experience with the physical world. Given this scenario, semantic knowledge graphs can serve as conceptual spaces that guide the automated text generation reasoning process to achieve more efficient and explainable results. In this paper, we apply a logic-augmented generation (LAG) framework that leverages the explicit representation of a text through a semantic knowledge graph and applies it in combination with prompt heuristics to elicit implicit analogical connections. This method generates extended knowledge graph triples representing implicit meaning, enabling systems to reason on unlabeled multimodal data regardless of the domain. We validate our work through three metaphor detection and understanding tasks across four datasets, as they require deep analogical reasoning capabilities. The results show that this integrated approach surpasses current baselines, performs better than humans in understanding visual metaphors, and enables more explainable reasoning processes, though still has inherent limitations in metaphor understanding, especially for domain-specific metaphors. Furthermore, we propose a thorough error analysis, discussing issues with metaphorical annotations and current evaluation methods.

[NLP-23] Benchmarking Next-Generation Reasoning-Focused Large Language Models in Ophthalmology: A Head-to-Head Evaluation on 5888 Items

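[Quick read]: This paper evaluates how reasoning-focused large language models (LLMs) perform in a specialized medical subfield, ophthalmology. Reasoning-focused LLMs mark a shift from general models toward complex decision-making, yet their performance in specialized domains such as ophthalmology has not been studied systematically. To fill this gap, the paper combines quantitative and qualitative analysis to compare four newly developed reasoning-focused LLMs (DeepSeek-R1, OpenAI o1, o3-mini, and Gemini 2.0 Flash-Thinking) on multiple-choice ophthalmology exam questions from the MedMCQA dataset, covering accuracy, reasoning ability, and response quality. The key is large-scale standardized testing in a zero-shot setting, pairing quantitative metrics (accuracy, Macro-F1, and several text-generation metrics) with expert qualitative assessment to expose differences in accuracy, reasoning structure, and response efficiency, and so inform the choice of reasoning models for medicine.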

Link: https://arxiv.org/abs/2504.11186
Authors: Minjie Zou, Sahana Srinivasan, Thaddaeus Wai Soon Lo, Ke Zou, Gabriel Dawei Yang, Xuguang Ai, Hyunjae Kim, Maxwell Singer, Fares Antaki, Kelvin Li, Robert Chang, Marcus Tan, David Ziyou Chen, Dianbo Liu, Qingyu Chen, Yih Chung Tham
Affiliations: unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 83 pages, 6 figures, 3 tables, 9 supplementary figures, 7 supplementary tables

Abstract: Recent advances in reasoning-focused large language models (LLMs) mark a shift from general LLMs toward models designed for complex decision-making, a crucial aspect in medicine. However, their performance in specialized domains like ophthalmology remains underexplored. This study comprehensively evaluated and compared the accuracy and reasoning capabilities of four newly developed reasoning-focused LLMs, namely DeepSeek-R1, OpenAI o1, o3-mini, and Gemini 2.0 Flash-Thinking. Each model was assessed using 5,888 multiple-choice ophthalmology exam questions from the MedMCQA dataset in zero-shot setting. Quantitative evaluation included accuracy, Macro-F1, and five text-generation metrics (ROUGE-L, METEOR, BERTScore, BARTScore, and AlignScore), computed against ground-truth reasonings. Average inference time was recorded for a subset of 100 randomly selected questions. Additionally, two board-certified ophthalmologists qualitatively assessed clarity, completeness, and reasoning structure of responses to differential diagnosis questions. O1 (0.902) and DeepSeek-R1 (0.888) achieved the highest accuracy, with o1 also leading in Macro-F1 (0.900). The performance of models across the text-generation metrics varied: O3-mini excelled in ROUGE-L (0.151), o1 in METEOR (0.232), DeepSeek-R1 and o3-mini tied for BERTScore (0.673), DeepSeek-R1 (-4.105) and Gemini 2.0 Flash-Thinking (-4.127) performed best in BARTScore, while o3-mini (0.181) and o1 (0.176) led AlignScore. Inference time across the models varied, with DeepSeek-R1 being slowest (40.4 seconds) and Gemini 2.0 Flash-Thinking fastest (6.7 seconds). Qualitative evaluation revealed that DeepSeek-R1 and Gemini 2.0 Flash-Thinking tended to provide detailed and comprehensive intermediate reasoning, whereas o1 and o3-mini displayed concise and summarized justifications.
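
Accuracy and Macro-F1 can diverge on multiple-choice data when some answer options are rarer than others, which is presumably why both are reported. The sketch below computes the two from lists of gold and predicted option letters. It is a plain-Python illustration, not the study's evaluation code, and the toy labels are invented.

```python
def accuracy(gold, pred):
    """Fraction of questions whose predicted option matches the gold option."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def macro_f1(gold, pred):
    """Unweighted mean of per-option F1 scores."""
    labels = sorted(set(gold) | set(pred))
    scores = []
    for label in labels:
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy example with invented labels (not data from the study).
gold = ["A", "B", "A", "C"]
pred = ["A", "B", "B", "C"]
acc = accuracy(gold, pred)   # 3 of 4 correct
mf1 = macro_f1(gold, pred)   # per-option F1: A = 2/3, B = 2/3, C = 1
```

Here the accuracy is 0.75 while the Macro-F1 is 7/9, because the single error lowers the F1 of two options at once.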

[NLP-24] Bias Beyond English: Evaluating Social Bias and Debiasing Methods in a Low-Resource Setting

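[Quick read]: This paper examines how social bias manifests in language models for low-resource languages and its potential to exacerbate social inequality. The key idea is to use high-resource language corpora to evaluate bias across language models and to transfer debiasing methods from high-resource to low-resource languages. The authors construct multilingual bias evaluation datasets that allow fair comparison of models across five languages (English, Chinese, Russian, Indonesian, and Thai) and analyze four bias dimensions: gender, religion, nationality, and race. Experiments with three debiasing methods (CDA, Dropout, SenDeb) show that debiasing strategies from high-resource languages can be applied successfully to low-resource ones, offering actionable insights for fairness research in multilingual NLP.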

Link: https://arxiv.org/abs/2504.11183
Authors: Ej Zhou, Weiming Lu
Affiliations: Language Technology Lab, University of Cambridge; College of Computer Science and Technology, Zhejiang University
Categories: Computation and Language (cs.CL)
Comments: none

Abstract: Social bias in language models can potentially exacerbate social inequalities. Despite it having garnered wide attention, most research focuses on English data. In a low-resource scenario, the models often perform worse due to insufficient training data. This study aims to leverage high-resource language corpora to evaluate bias and experiment with debiasing methods in low-resource languages. We evaluated the performance of recent multilingual models in five languages: English (eng), Chinese (zho), Russian (rus), Indonesian (ind) and Thai (tha), and analyzed four bias dimensions: gender, religion, nationality, and race-color. By constructing multilingual bias evaluation datasets, this study allows fair comparisons between models across languages. We have further investigated three debiasing methods (CDA, Dropout, SenDeb) and demonstrated that debiasing methods from high-resource languages can be effectively transferred to low-resource ones, providing actionable insights for fairness research in multilingual NLP.
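
CDA is commonly expanded as counterfactual data augmentation: each training sentence is paired with a copy in which attribute words are swapped. The sketch below shows the idea for English gender terms. It assumes that expansion of the acronym, and the word list and function names are hypothetical; the abstract does not give the configuration actually used.

```python
import re

# Hypothetical swap list; real CDA word lists are much larger.
PAIRS = [("he", "she"), ("his", "her"), ("man", "woman"), ("father", "mother")]
SWAP = {}
for a, b in PAIRS:
    SWAP[a] = b
    SWAP[b] = a


def counterfactual(sentence):
    """Swap each listed gender term for its counterpart, keeping capitalization."""
    def swap(match):
        word = match.group(0)
        repl = SWAP.get(word.lower())
        if repl is None:
            return word
        return repl.capitalize() if word[0].isupper() else repl
    return re.sub(r"[A-Za-z]+", swap, sentence)


def augment(corpus):
    """Return the corpus plus a counterfactual copy of every affected sentence."""
    out = []
    for sentence in corpus:
        out.append(sentence)
        flipped = counterfactual(sentence)
        if flipped != sentence:
            out.append(flipped)
    return out


augmented = augment(["He is a doctor.", "It rains."])
```

A word-level swap like this ignores grammar (for example, "her" can correspond to "him" or "his"), which is one reason transferring the method to other languages is not trivial.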

[NLP-25] MuSeD: A Multimodal Spanish Dataset for Sexism Detection in Social Media Videos

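[Quick read]: This paper addresses the challenge of automatically detecting sexism in video content on social media, in particular identifying explicit and implicit gender bias by combining text, audio, and visual modalities. The key contributions are MuSeD, a new multimodal Spanish dataset for sexism detection (about 11 hours of video extracted from TikTok and BitChute), and an innovative annotation framework for analyzing the contribution of textual and multimodal labels to classification. The paper also evaluates a range of large language models (LLMs) and multimodal LLMs. Visual information proves crucial for both humans and models to label sexist content correctly, but models do poorly on implicit cases such as stereotypes, which shows that identifying implicit sexism depends heavily on social and cultural context. The solution therefore rests on the multimodal dataset and annotation framework, with visual information used to improve detection.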

Link: https://arxiv.org/abs/2504.11169
Authors: Laura De Grazia, Pol Pastells, Mauro Vázquez Chas, Desmond Elliott, Danae Sánchez Villegas, Mireia Farrús, Mariona Taulé
Affiliations: University of Barcelona; CLiC-Language and Computing Center; University of Copenhagen; Department of Computer Science
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: none

Abstract: Sexism is generally defined as prejudice and discrimination based on sex or gender, affecting every sector of society, from social institutions to relationships and individual behavior. Social media platforms amplify the impact of sexism by conveying discriminatory content not only through text but also across multiple modalities, highlighting the critical need for a multimodal approach to the analysis of sexism online. With the rise of social media platforms where users share short videos, sexism is increasingly spreading through video content. Automatically detecting sexism in videos is a challenging task, as it requires analyzing the combination of verbal, audio, and visual elements to identify sexist content. In this study, (1) we introduce MuSeD, a new Multimodal Spanish dataset for Sexism Detection consisting of approximately 11 hours of videos extracted from TikTok and BitChute; (2) we propose an innovative annotation framework for analyzing the contribution of textual and multimodal labels in the classification of sexist and non-sexist content; and (3) we evaluate a range of large language models (LLMs) and multimodal LLMs on the task of sexism detection. We find that visual information plays a key role in labeling sexist content for both humans and models. Models effectively detect explicit sexism; however, they struggle with implicit cases, such as stereotypes, instances where annotators also show low agreement. This highlights the inherent difficulty of the task, as identifying implicit sexism depends on the social and cultural context.

[NLP-26] Benchmarking Vision Language Models on German Factual Data

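[Quick read]: This paper examines the uneven multilingual support of vision language models (VLMs), in particular the weak support for languages such as German that are otherwise considered high-resource. The key is to analyze VLM performance on factual knowledge in German and English, separating image-related factors from textual ones by measuring accuracy across both prompt languages (German and English) and across images from German and international contexts. For celebrities and sights, VLMs perform poorly mainly because they lack visual cognition of German image content. For animals and plants, models can often identify the image content by scientific name or English common name but fail in German. Cars and supermarket products are identified equally well in both languages. The paper thus exposes the limits of current VLMs in cross-lingual visual understanding and points to directions for improving future models.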

Link: https://arxiv.org/abs/2504.11108
Authors: René Peinl, Vincent Tischler
Affiliations: unknown
Categories: Computation and Language (cs.CL)
Comments: none

Abstract: Similar to LLMs, the development of vision language models is mainly driven by English datasets and models trained in English and Chinese language, whereas support for other languages, even those considered high-resource languages such as German, remains significantly weaker. In this work we present an analysis of open-weight VLMs on factual knowledge in the German and English language. We disentangle the image-related aspects from the textual ones by analyzing accuracy with jury-as-a-judge in both prompt languages and images from German and international contexts. We found that for celebrities and sights, VLMs struggle because they are lacking visual cognition of German image contents. For animals and plants, the tested models can often correctly identify the image contents according to the scientific name or English common name but fail in German language. Cars and supermarket products were identified equally well in English and German images across both prompt languages.

[NLP-27] Using LLMs as prompt modifier to avoid biases in AI image generators

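[Quick read]: This paper addresses bias in text-to-image generation systems. The key is to use large language models (LLMs) to modify user prompts so that the generated images are more diverse and less biased, without changing the image generators themselves. Experiments show that the approach markedly increases image diversity and reduces bias, works particularly well for less advanced image generators, and still has limitations in certain contexts such as disability representation.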

Link: https://arxiv.org/abs/2504.11104
Authors: René Peinl
Affiliations: unknown
Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Computers and Society (cs.CY)
Comments: none

Abstract:This study examines how Large Language Models (LLMs) can reduce biases in text-to-image generation systems by modifying user prompts. We define bias as a model’s unfair deviation from population statistics given neutral prompts. Our experiments with Stable Diffusion XL, 3.5 and Flux demonstrate that LLM-modified prompts significantly increase image diversity and reduce bias without the need to change the image generators themselves. While occasionally producing results that diverge from original user intent for elaborate prompts, this approach generally provides more varied interpretations of underspecified requests rather than superficial variations. The method works particularly well for less advanced image generators, though limitations persist for certain contexts like disability representation. All prompts and generated images are available at this https URL
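
The abstract defines bias as deviation from population statistics under neutral prompts. One simple way to put a number on such a deviation is the total variation distance between the attribute distribution observed in generated images and a reference population distribution. The sketch below is an assumption for illustration: the metric, the `bias_score` name, and the toy numbers are not taken from the paper.

```python
from collections import Counter


def bias_score(observed_labels, population_share):
    """Total variation distance between observed and reference distributions.

    0.0 means the generated images match the reference shares exactly;
    1.0 means the two distributions do not overlap at all.
    """
    counts = Counter(observed_labels)
    n = len(observed_labels)
    categories = set(counts) | set(population_share)
    return 0.5 * sum(
        abs(counts[c] / n - population_share.get(c, 0.0)) for c in categories
    )


# Toy numbers: 10 images for a neutral prompt, 9 labelled "male".
reference = {"male": 0.5, "female": 0.5}
skewed = bias_score(["male"] * 9 + ["female"], reference)
balanced = bias_score(["male"] * 5 + ["female"] * 5, reference)
```

Comparing the score before and after LLM prompt modification would give one coarse measure of how much the rewritten prompts reduce the skew.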

[NLP-28] DeepMLF: Multimodal language model with learnable tokens for deep fusion in sentiment analysis

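[Quick read]: This paper addresses the underexplored role of fusion depth and multimodal capacity allocation in Multimodal Sentiment Analysis (MSA). It treats fusion depth, scalability, and dedicated multimodal capacity as the core factors for effective fusion and introduces DeepMLF, a novel multimodal language model that achieves deep fusion through learnable fusion tokens. DeepMLF combines an audiovisual encoder with a pretrained language model (LM) augmented with multimodal information across its layers; the fusion tokens capture modality interactions through causal self-attention in the LM blocks and integrate audiovisual information through cross-attention in the multimodal (MM) blocks, giving progressive fusion across layers. Experiments show that deeper fusion (5-7 layers) and a small set of fusion tokens (about 20) perform best, and that the training objectives and embedding regularization matter for overall performance. The key is a fusion mechanism with controlled modality interaction and independent per-modality information flow, combined with progressive multi-layer fusion.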

Link: https://arxiv.org/abs/2504.11082
Authors: Efthymios Georgiou, Vassilis Katsouros, Yannis Avrithis, Alexandros Potamianos
Affiliations: unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Preprint

Abstract: While multimodal fusion has been extensively studied in Multimodal Sentiment Analysis (MSA), the role of fusion depth and multimodal capacity allocation remains underexplored. In this work, we position fusion depth, scalability, and dedicated multimodal capacity as primary factors for effective fusion. We introduce DeepMLF, a novel multimodal language model (LM) with learnable tokens tailored toward deep fusion. DeepMLF leverages an audiovisual encoder and a pretrained decoder LM augmented with multimodal information across its layers. We append learnable tokens to the LM that: 1) capture modality interactions in a controlled fashion and 2) preserve independent information flow for each modality. These fusion tokens gather linguistic information via causal self-attention in LM Blocks and integrate with audiovisual information through cross-attention MM Blocks. Serving as dedicated multimodal capacity, this design enables progressive fusion across multiple layers, providing depth in the fusion process. Our training recipe combines modality-specific losses and language modelling loss, with the decoder LM tasked to predict ground truth polarity. Across three MSA benchmarks with varying dataset characteristics, DeepMLF achieves state-of-the-art performance. Our results confirm that deeper fusion leads to better performance, with optimal fusion depths (5-7) exceeding those of existing approaches. Additionally, our analysis on the number of fusion tokens reveals that small token sets (~20) achieve optimal performance. We examine the importance of representation learning order (fusion curriculum) through audiovisual encoder initialization experiments. Our ablation studies demonstrate the superiority of the proposed fusion design and gating while providing a holistic examination of DeepMLF's scalability to LLMs, and the impact of each training objective and embedding regularization.

[NLP-29] LazyReview A Dataset for Uncovering Lazy Thinking in NLP Peer Reviews

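[Quick read]: This paper addresses the decline in peer-review quality as reviewer workload grows, in particular the risk posed by reviewers' reliance on "quick" heuristics, known as lazy thinking. It notes that NLP research on the problem is limited and that no real-world dataset exists to support detection tools. In response, it introduces LazyReview, a dataset of peer-review sentences annotated with fine-grained lazy-thinking categories. Large language models (LLMs) struggle to detect these instances in a zero-shot setting, whereas instruction-based fine-tuning improves performance by 10-20 points, which highlights the value of high-quality training data. A controlled experiment further shows that reviews revised with lazy-thinking feedback are more comprehensive and actionable than those written without it. The key contributions are the finely annotated dataset, the targeted fine-tuning that improves detection, and enhanced guidelines for training junior reviewers.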

Link: https://arxiv.org/abs/2504.11042
Authors: Sukannya Purkayastha, Zhuang Li, Anne Lauscher, Lizhen Qu, Iryna Gurevych
Affiliations: Ubiquitous Knowledge Processing Lab, Department of Computer Science and Hessian Center for AI (hessian.AI), Technical University of Darmstadt; School of Computing Technologies, Royal Melbourne Institute of Technology, Australia; Data Science Group, University of Hamburg; Department of Data Science & AI, Monash University, Australia
Categories: Computation and Language (cs.CL)
Comments: 29 pages, 18 Figures, 15 Tables

Abstract: Peer review is a cornerstone of quality control in scientific publishing. With the increasing workload, the unintended use of 'quick' heuristics, referred to as lazy thinking, has emerged as a recurring issue compromising review quality. Automated methods to detect such heuristics can help improve the peer-reviewing process. However, there is limited NLP research on this issue, and no real-world dataset exists to support the development of detection tools. This work introduces LazyReview, a dataset of peer-review sentences annotated with fine-grained lazy thinking categories. Our analysis reveals that Large Language Models (LLMs) struggle to detect these instances in a zero-shot setting. However, instruction-based fine-tuning on our dataset significantly boosts performance by 10-20 performance points, highlighting the importance of high-quality training data. Furthermore, a controlled experiment demonstrates that reviews revised with lazy thinking feedback are more comprehensive and actionable than those written without such feedback. We will release our dataset and the enhanced guidelines that can be used to train junior reviewers in the community. (Code available here: this https URL)

[NLP-30] Dynamic Compressing Prompts for Efficient Inference of Large Language Models

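[Quick read]: This paper addresses the higher computational cost of long prompts and the performance loss they can cause given the limited context windows of large language models (LLMs). Existing prompt compression methods struggle to retain essential information, adapt to context changes, and stay effective across tasks. The paper proposes a task-agnostic method, Dynamic Compressing Prompts (LLM-DCP). The key is to model prompt compression as a Markov Decision Process (MDP) and to train a DCP-Agent with a purpose-built reward function, reducing the number of prompt tokens while preserving model performance as far as possible and without needing an external black-box LLM. Inspired by the progressive difficulty adjustment of curriculum learning, a Hierarchical Prompt Compression (HPC) training strategy gradually raises the compression difficulty so that the DCP-Agent learns a compression method that maintains information integrity. Experiments show that the method outperforms state-of-the-art techniques at high compression rates.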

Link: https://arxiv.org/abs/2504.11004
Authors: Jinwu Hu, Wei Zhang, Yufeng Wang, Yu Hu, Bin Xiao, Mingkui Tan, Qing Du
Affiliations: School of Software Engineering, South China University of Technology; Pazhou Lab, Guangzhou, China; School of Future Technology, South China University of Technology, Guangzhou, China; Peng Cheng Laboratory, Shenzhen, China; Department of Health Technology and Informatics, Hong Kong Polytechnic University, Hong Kong, China; Department of Computer Science and Technology, Chongqing University of Posts and Telecommunications, Chongqing, China
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Under review (submitted in 2024.11)

Abstract:Large Language Models (LLMs) have shown outstanding performance across a variety of tasks, partly due to advanced prompting techniques. However, these techniques often require lengthy prompts, which increase computational costs and can hinder performance because of the limited context windows of LLMs. While prompt compression is a straightforward solution, existing methods confront the challenges of retaining essential information, adapting to context changes, and remaining effective across different tasks. To tackle these issues, we propose a task-agnostic method called Dynamic Compressing Prompts (LLM-DCP). Our method reduces the number of prompt tokens while aiming to preserve the performance as much as possible. We model prompt compression as a Markov Decision Process (MDP), enabling the DCP-Agent to sequentially remove redundant tokens by adapting to dynamic contexts and retaining crucial content. We develop a reward function for training the DCP-Agent that balances the compression rate, the quality of the LLM output, and the retention of key information. This allows for prompt token reduction without needing an external black-box LLM. Inspired by the progressive difficulty adjustment in curriculum learning, we introduce a Hierarchical Prompt Compression (HPC) training strategy that gradually increases the compression difficulty, enabling the DCP-Agent to learn an effective compression method that maintains information integrity. Experiments demonstrate that our method outperforms state-of-the-art techniques, especially at higher compression rates. The code for our approach will be available at this https URL.
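
The abstract says the reward balances compression rate, output quality, and key-information retention, but does not give its form. The sketch below assumes a plain weighted sum to make the trade-off concrete. The function name `dcp_reward`, the weights, and the token-overlap retention measure are all assumptions, not the paper's definition.

```python
def dcp_reward(original_tokens, compressed_tokens, output_quality,
               key_tokens, alpha=1.0, beta=1.0, gamma=1.0):
    """Assumed weighted-sum reward for one compression step.

    output_quality is a score in [0, 1] for the LLM output produced
    from the compressed prompt (how it is measured is left open here).
    """
    # Share of tokens removed: higher means stronger compression.
    compression = 1.0 - len(compressed_tokens) / len(original_tokens)
    # Share of designated key tokens that survived compression.
    kept = set(compressed_tokens)
    if key_tokens:
        retention = sum(t in kept for t in key_tokens) / len(key_tokens)
    else:
        retention = 1.0
    return alpha * compression + beta * output_quality + gamma * retention


original = "please kindly answer the following question what is the capital".split()
compressed = "answer question what is capital".split()
reward = dcp_reward(original, compressed, output_quality=0.8,
                    key_tokens=["question", "capital"])
```

Under a curriculum in the spirit of HPC, one could start with a low target compression rate and raise it over training; that schedule is not sketched here.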

[NLP-31] ReZero: Enhancing LLM search ability by trying one-more-time

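Quick read: This paper addresses the dependence of Retrieval-Augmented Generation (RAG) with Large Language Models (LLMs) on the quality of the initial search query in knowledge-intensive tasks. Current methods, often based on Reinforcement Learning (RL), focus on query formulation or reasoning over results and do not explicitly encourage persistence after a failed search. The paper proposes ReZero (Retry-Zero), an RL framework whose key idea is to directly reward retrying after an initial unsuccessful search, which encourages the LLM to explore alternative queries instead of halting prematurely. ReZero reaches 46.88% accuracy against a 25% baseline, improving LLM robustness in complex information-seeking scenarios.

Link: https://arxiv.org/abs/2504.11001
Authors: Alan Dao (Gia Tuan Dao), Thinh Le
Affiliations: Menlo Research
Categories: Computation and Language (cs.CL)
Comments: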

Abstract:Retrieval-Augmented Generation (RAG) improves Large Language Model (LLM) performance on knowledge-intensive tasks but depends heavily on initial search query quality. Current methods, often using Reinforcement Learning (RL), typically focus on query formulation or reasoning over results, without explicitly encouraging persistence after a failed search. We introduce ReZero (Retry-Zero), a novel RL framework that directly rewards the act of retrying a search query following an initial unsuccessful attempt. This incentivizes the LLM to explore alternative queries rather than prematurely halting. ReZero demonstrates significant improvement, achieving 46.88% accuracy compared to a 25% baseline. By rewarding persistence, ReZero enhances LLM robustness in complex information-seeking scenarios where initial queries may prove insufficient.
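To make the idea of rewarding persistence concrete, here is a minimal sketch of a trajectory-level reward. It assumes that each search attempt is labeled as successful or not, and that a fixed bonus is added whenever a new search follows a failed one; the function name `retry_reward` and the bonus and answer values are hypothetical and do not come from the paper.

```python
def retry_reward(search_succeeded, answered_correctly,
                 retry_bonus=0.2, answer_reward=1.0):
    """Hypothetical reward for one episode.

    search_succeeded: list of booleans, one per search attempt, in order.
    A bonus is granted for every search issued right after a failed search.
    """
    reward = answer_reward if answered_correctly else 0.0
    # Pair each attempt with the one that follows it.
    for previous_ok, _next_ok in zip(search_succeeded, search_succeeded[1:]):
        if not previous_ok:
            reward += retry_bonus  # the model retried instead of giving up
    return reward
```

With this shaping, an episode that gives up after one failed search earns nothing, while one that retries earns the bonus even before the final answer is scored.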

[NLP-32] Exploring the Role of KG-Based RAG in Japanese Medical Question Answering with Small-Scale LLMs

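Quick read: This paper addresses the limited effectiveness of Large Language Models (LLMs) in Japanese medical question answering (Medical QA), where privacy constraints prevent the use of commercial models such as GPT-4 in clinical settings. Recent work therefore focuses on instruction-tuning open-source LLMs, while combining them with Retrieval-Augmented Generation (RAG) remains underexplored. The key contribution is a Knowledge Graph (KG)-based RAG framework applied to small-scale open-source LLMs for Japanese medical QA. Experiments show that KG-based RAG yields only limited gains in this setting, and case studies indicate that the effectiveness of RAG depends heavily on the quality and relevance of the retrieved external content. The study thus outlines both the potential and the challenges of RAG for Japanese medical QA and serves as a reference for other low-resource languages.

Link: https://arxiv.org/abs/2504.10982
Authors: Yingjian Chen, Feiyang Li, Xingyu Song, Tianxiao Li, Issey Sudeka, Irene Li
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 10 pages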

Abstract:Large language models (LLMs) perform well in medical QA, but their effectiveness in Japanese contexts is limited due to privacy constraints that prevent the use of commercial models like GPT-4 in clinical settings. As a result, recent efforts focus on instruction-tuning open-source LLMs, though the potential of combining them with retrieval-augmented generation (RAG) remains underexplored. To bridge this gap, we are the first to explore a knowledge graph-based (KG) RAG framework for Japanese medical QA with small-scale open-source LLMs. Experimental results show that KG-based RAG has only a limited impact on Japanese medical QA using small-scale open-source LLMs. Further case studies reveal that the effectiveness of the RAG is sensitive to the quality and relevance of the external retrieved content. These findings offer valuable insights into the challenges and potential of applying RAG in Japanese medical QA, while also serving as a reference for other low-resource languages.

[NLP-33] Understanding LLMs Cross-Lingual Context Retrieval: How Good It Is And Where It Comes From

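Quick read: This paper investigates the cross-lingual context retrieval ability of Large Language Models (LLMs), an ability that matters in real-life applications but has not been adequately studied. It evaluates more than 40 LLMs across 12 languages on cross-lingual machine reading comprehension (xMRC) to identify where the ability comes from and what limits it. The key findings are that several small open models show strong cross-lingual context retrieval after post-training, and that the retrieval process divides into two main phases: question encoding, formed in pre-training, and answer retrieval, formed in post-training. The xMRC bottleneck lies in the last model layers of the second phase. Larger-scale pre-training does not improve xMRC performance; only further multilingual post-training fully unlocks the cross-lingual context retrieval potential of larger LLMs.

Link: https://arxiv.org/abs/2504.10906
Authors: Changjiang Gao, Hankun Lin, Shujian Huang, Xin Huang, Xue Han, Junlan Feng, Chao Deng, Jiajun Chen
Affiliations: National Key Laboratory for Novel Software Technology, Nanjing University; China Mobile Research Beijing, China
Categories: Computation and Language (cs.CL)
Comments: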

Abstract:The ability of cross-lingual context retrieval is a fundamental aspect of cross-lingual alignment of large language models (LLMs), where the model extracts context information in one language based on requests in another language. Despite its importance in real-life applications, this ability has not been adequately investigated for state-of-the-art models. In this paper, we evaluate the cross-lingual context retrieval ability of over 40 LLMs across 12 languages to understand the source of this ability, using cross-lingual machine reading comprehension (xMRC) as a representative scenario. Our results show that several small, post-trained open LLMs exhibit strong cross-lingual context retrieval ability, comparable to closed-source LLMs such as GPT-4o, and their estimated oracle performances greatly improve after post-training. Our interpretability analysis shows that the cross-lingual context retrieval process can be divided into two main phases: question encoding and answer retrieval, which are formed in pre-training and post-training, respectively. The phasing stability correlates with xMRC performance, and the xMRC bottleneck lies at the last model layers in the second phase, where the effect of post-training can be evidently observed. Our results also indicate that larger-scale pretraining cannot improve the xMRC performance. Instead, larger LLMs need further multilingual post-training to fully unlock their cross-lingual context retrieval potential. Our code is available at this https URL

[NLP-34] Efficient Reasoning Models: A Survey

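Quick read: This survey addresses the substantial computational overhead that reasoning models incur by generating lengthy Chain-of-Thoughts (CoTs) on complex, logic-intensive tasks. It organizes existing solutions along three directions: (1) shorter, compressing lengthy CoTs into concise yet effective reasoning chains; (2) smaller, building compact language models with strong reasoning capabilities through knowledge distillation, other model compression techniques, and reinforcement learning; (3) faster, designing efficient decoding strategies to accelerate inference. These methods aim for efficient reasoning while preserving model performance.

Link: https://arxiv.org/abs/2504.10903
Authors: Sicheng Feng, Gongfan Fang, Xinyin Ma, Xinchao Wang
Affiliations: National University of Singapore, Singapore; Nankai University, Tianjin, China
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: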

Abstract:Reasoning models have demonstrated remarkable progress in solving complex and logic-intensive tasks by generating extended Chain-of-Thoughts (CoTs) prior to arriving at a final answer. Yet, the emergence of this “slow-thinking” paradigm, with numerous tokens generated in sequence, inevitably introduces substantial computational overhead. To this end, it highlights an urgent need for effective acceleration. This survey aims to provide a comprehensive overview of recent advances in efficient reasoning. It categorizes existing works into three key directions: (1) shorter - compressing lengthy CoTs into concise yet effective reasoning chains; (2) smaller - developing compact language models with strong reasoning capabilities through techniques such as knowledge distillation, other model compression techniques, and reinforcement learning; and (3) faster - designing efficient decoding strategies to accelerate inference. A curated collection of papers discussed in this survey is available in our GitHub repository.

[NLP-35] ARise: Towards Knowledge-Augmented Reasoning via Risk-Adaptive Search

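Quick read: This paper addresses the limited applicability of Large Language Models (LLMs) to open-ended, knowledge-intensive, complex reasoning scenarios. Reasoning-oriented methods generalize poorly to open-ended scenarios because they implicitly assume complete world knowledge, while knowledge-augmented reasoning (KAR) methods face two core challenges: error propagation and the verification bottleneck. The key contribution is ARise, a framework that combines risk assessment of intermediate reasoning states with dynamic retrieval-augmented generation (RAG) within a Monte Carlo tree search paradigm, so that reasoning plans can be built and optimized across multiple hypothesis branches. ARise outperforms state-of-the-art KAR methods by up to 23.10% and the latest RAG-equipped large reasoning models by up to 25.37%.

Link: https://arxiv.org/abs/2504.10893
Authors: Yize Zhang, Tianshu Wang, Sirui Chen, Kun Wang, Xingyu Zeng, Hongyu Lin, Xianpei Han, Le Sun, Chaochao Lu
Affiliations: Shanghai AI Laboratory; Shanghai Innovation Institute; Shanghai Jiao Tong University; SenseTime; Institute of Software, Chinese Academy of Sciences; Tongji University; Hangzhou Institute for Advanced Study, University of Chinese Academy of Sciences
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Project homepage: this https URL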

Abstract:Large language models (LLMs) have demonstrated impressive capabilities and are receiving increasing attention to enhance their reasoning through scaling test-time compute. However, their application in open-ended, knowledge-intensive, complex reasoning scenarios is still limited. Reasoning-oriented methods struggle to generalize to open-ended scenarios due to implicit assumptions of complete world knowledge. Meanwhile, knowledge-augmented reasoning (KAR) methods fail to address two core challenges: 1) error propagation, where errors in early steps cascade through the chain, and 2) verification bottleneck, where the explore-exploit tradeoff arises in multi-branch decision processes. To overcome these limitations, we introduce ARise, a novel framework that integrates risk assessment of intermediate reasoning states with dynamic retrieval-augmented generation (RAG) within a Monte Carlo tree search paradigm. This approach enables effective construction and optimization of reasoning plans across multiple maintained hypothesis branches. Experimental results show that ARise significantly outperforms the state-of-the-art KAR methods by up to 23.10%, and the latest RAG-equipped large reasoning models by up to 25.37%.

[NLP-36] Exploring Persona-dependent LLM Alignment for the Moral Machine Experiment ICLR2025

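Quick read: This paper examines how well the decisions of Large Language Models (LLMs) align with human moral judgment in real-world applications. It asks whether LLMs decide as humans do when facing moral dilemmas, and how that alignment changes with personas, including personas reflecting different sociodemographics. The key analysis shows that LLM moral decisions vary substantially by persona, with greater shifts than humans on critical tasks, and that political persona dominates the direction and degree of LLM decisions. The paper also discusses the ethical implications and risks of deploying such models in applications that involve moral decisions.

Link: https://arxiv.org/abs/2504.10886
Authors: Jiseon Kim, Jea Kwon, Luiz Felipe Vecchietti, Alice Oh, Meeyoung Cha
Affiliations: Korea Advanced Institute of Science & Technology (KAIST); Max Planck Institute for Security and Privacy (MPI-SP)
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Accepted to ICLR 2025 Workshop - BiAlign (Bidirectional Human-AI Alignment)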

Abstract:Deploying large language models (LLMs) with agency in real-world applications raises critical questions about how these models will behave. In particular, how will their decisions align with humans when faced with moral dilemmas? This study examines the alignment between LLM-driven decisions and human judgment in various contexts of the moral machine experiment, including personas reflecting different sociodemographics. We find that the moral decisions of LLMs vary substantially by persona, showing greater shifts in moral decisions for critical tasks than humans. Our data also indicate an interesting partisan sorting phenomenon, where political persona predominates the direction and degree of LLM decisions. We discuss the ethical implications and risks associated with deploying these models in applications that involve moral decisions.

[NLP-37] Ai2 Scholar QA: Organized Literature Synthesis with Attribution

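Quick read: This paper addresses the lack of efficient, affordable systems for scientific question answering: state-of-the-art systems perform well but are typically expensive and closed-source. It introduces Ai2 Scholar QA, a free online scientific question answering application. The key contribution is a fully public pipeline: a customizable open-source Python package, an interactive web app, paper indexes accessible through public APIs, and downloadable datasets. In an evaluation on a recent scientific QA benchmark, Ai2 Scholar QA outperforms competing systems.

Link: https://arxiv.org/abs/2504.10861
Authors: Amanpreet Singh, Joseph Chee Chang, Chloe Anastasiades, Dany Haddad, Aakanksha Naik, Amber Tanaka, Angele Zamarron, Cecile Nguyen, Jena D. Hwang, Jason Dunkleberger, Matt Latzke, Smita Rao, Jaron Lochner, Rob Evans, Rodney Kinney, Daniel S. Weld, Doug Downey, Sergey Feldman
Affiliations: Allen Institute for AI
Categories: Computation and Language (cs.CL)
Comments: 7 pages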

Abstract:Retrieval-augmented generation is increasingly effective in answering scientific questions from literature, but many state-of-the-art systems are expensive and closed-source. We introduce Ai2 Scholar QA, a free online scientific question answering application. To facilitate research, we make our entire pipeline public: as a customizable open-source Python package and interactive web app, along with paper indexes accessible through public APIs and downloadable datasets. We describe our system in detail and present experiments analyzing its key design decisions. In an evaluation on a recent scientific QA benchmark, we find that Ai2 Scholar QA outperforms competing systems.

[NLP-38] Moving Beyond Next-Token Prediction: Transformers are Context-Sensitive Language Generators

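Quick read: This paper addresses the poor understanding of the mechanisms behind Large Language Models (LLMs) by proposing a framework that interprets Transformers as probabilistic left context-sensitive language (CSL) generators. The key idea is to decompose the Transformer into three fundamental components: context windows, attention mechanisms, and autoregressive generation frameworks. This decomposition supports more flexible and interpretable computational models, moves beyond the traditional view of attention and autoregression as inseparable processes, offers an intuitive account of how simple token prediction can yield human-like intelligent output, and connects Formal Language Theory with the observed generative power of Transformers.

Link: https://arxiv.org/abs/2504.10845
Authors: Phill Kyu Rhee
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 11 pages, 2 figures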

Abstract:Large Language Models (LLMs), powered by Transformers, have demonstrated human-like intelligence capabilities, yet their underlying mechanisms remain poorly understood. This paper presents a novel framework for interpreting LLMs as probabilistic left context-sensitive languages (CSLs) generators. We hypothesize that Transformers can be effectively decomposed into three fundamental components: context windows, attention mechanisms, and autoregressive generation frameworks. This decomposition allows for the development of more flexible and interpretable computational models, moving beyond the traditional view of attention and autoregression as inseparable processes. We argue that next-token predictions can be understood as probabilistic, dynamic approximations of left CSL production rules, providing an intuitive explanation for how simple token predictions can yield human-like intelligence outputs. Given that all CSLs are left context-sensitive (Penttonen, 1974), we conclude that Transformers stochastically approximate CSLs, which are widely recognized as models of human-like intelligence. This interpretation bridges the gap between Formal Language Theory and the observed generative power of Transformers, laying a foundation for future advancements in generative AI theory and applications. Our novel perspective on Transformer architectures will foster a deeper understanding of LLMs and their future potentials.

[NLP-39] CLASH: Evaluating Language Models on Judging High-Stakes Dilemmas from Multiple Perspectives

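Quick read: This paper addresses the evaluation of value-based reasoning by Large Language Models (LLMs) in high-stakes dilemmas. Such dilemmas involve conflicting values, yet prior work is limited to everyday scenarios and lacks analysis of complex values and how they shift over time. The key contribution is CLASH (Character perspective-based LLM Assessments in Situations with High-stakes), a curated dataset of 345 high-impact dilemmas and 3,795 individual perspectives reflecting diverse values. Benchmarking 10 frontier models reveals shortcomings in identifying decision ambivalence and in understanding value shifts, shows that steerability differs between third-party and first-person value reasoning, and underlines the need for LLMs to reason better over complex values.

Link: https://arxiv.org/abs/2504.10823
Authors: Ayoung Lee, Ryan Sungmo Kwon, Peter Railton, Lu Wang
Affiliations: University of Michigan
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: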

Abstract:Navigating high-stakes dilemmas involving conflicting values is challenging even for humans, let alone for AI. Yet prior work in evaluating the reasoning capabilities of large language models (LLMs) in such situations has been limited to everyday scenarios. To close this gap, this work first introduces CLASH (Character perspective-based LLM Assessments in Situations with High-stakes), a meticulously curated dataset consisting of 345 high-impact dilemmas along with 3,795 individual perspectives of diverse values. In particular, we design CLASH in a way to support the study of critical aspects of value-based decision-making processes which are missing from prior work, including understanding decision ambivalence and psychological discomfort as well as capturing the temporal shifts of values in characters’ perspectives. By benchmarking 10 open and closed frontier models, we uncover several key findings. (1) Even the strongest models, such as GPT-4o and Claude-Sonnet, achieve less than 50% accuracy in identifying situations where the decision should be ambivalent, while they perform significantly better in clear-cut scenarios. (2) While LLMs reasonably predict psychological discomfort as marked by humans, they inadequately comprehend perspectives involving value shifts, indicating a need for LLMs to reason over complex values. (3) Our experiments also reveal a significant correlation between LLMs’ value preferences and their steerability towards a given value. (4) Finally, LLMs exhibit greater steerability when engaged in value reasoning from a third-party perspective, compared to a first-person setup, though certain value pairs benefit uniquely from the first-person framing.

[NLP-40] CSPLADE: Learned Sparse Retrieval with Causal Language Models

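Quick read: This paper addresses two main challenges in scaling Learned Sparse Retrieval (LSR) to Large Language Models (LLMs): (1) training instability in the early stage of contrastive training; (2) suboptimal performance caused by the unidirectional attention of pre-trained LLMs. It proposes two corresponding techniques: (1) a lightweight adaptation training phase to eliminate the early instability; (2) two model variants that enable bidirectional information. With these, the authors train LSR models on an 8B-scale LLM and reach competitive retrieval performance with a reduced index size. The paper is also among the first to analyze the performance-efficiency tradeoff of LLM-based LSR models through model quantization, which offers insights for efficient retrieval modeling.

Link: https://arxiv.org/abs/2504.10816
Authors: Zhichao Xu, Aosong Feng, Yijun Tian, Haibo Ding, Lin Leee Cheong
Affiliations: Amazon Web Services
Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL)
Comments: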

Abstract:In recent years, dense retrieval has been the focus of information retrieval (IR) research. While effective, dense retrieval produces uninterpretable dense vectors, and suffers from the drawback of large index size. Learned sparse retrieval (LSR) has emerged as a promising alternative, achieving competitive retrieval performance while also being able to leverage the classical inverted index data structure for efficient retrieval. However, few works have explored scaling LSR beyond BERT scale. In this work, we identify two challenges in training large language models (LLMs) for LSR: (1) training instability during the early stage of contrastive training; (2) suboptimal performance due to pre-trained LLMs’ unidirectional attention. To address these challenges, we propose two corresponding techniques: (1) a lightweight adaptation training phase to eliminate training instability; (2) two model variants to enable bidirectional information. With these techniques, we are able to train LSR models with an 8B-scale LLM, and achieve competitive retrieval performance with reduced index size. Furthermore, we are among the first to analyze the performance-efficiency tradeoff of LLM-based LSR models through the lens of model quantization. Our findings provide insights into adapting LLMs for efficient retrieval modeling.

[NLP-41] Name of Thrones: Evaluating How LLMs Rank Student Names, Race, and Gender in Status Hierarchies

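Quick read: This paper examines whether Large Language Models (LLMs) sort individuals into status positions in an unfair, biased way based on the gender and race signaled by their names. It covers not only first names but also last names and the combined effects of the two, which prior work has largely neglected. It also explores how names from different cultural backgrounds reflect and reinforce status hierarchies, and shows that AI encodes differential expectations of competence, leadership, and economic potential across races and genders.

The key to the approach is a large-scale analysis of name variations across 5 ethnicities, assessing whether AI exhibits name bias along three core characteristics of inequality: competence, leadership, and economic potential. Contrary to the common assumption, East Asian and, in some contexts, South Asian names receive higher AI rankings. The paper challenges the monolithic "model minority" assumption and shows a more complex, stratified pattern of bias. Gender moderates the bias, with girls disadvantaged in certain racial groups, and adopting Western first names raises the AI-perceived status of East and Southeast Asian students, especially girls. These findings underscore the need for intersectional, more nuanced understandings of race, gender, and mixed identities when evaluating LLMs.

Link: https://arxiv.org/abs/2504.10797
Authors: Annabella Sakunkoo, Jonathan Sakunkoo
Affiliations: Stanford University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments: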

Abstract:Across cultures, names tell a lot about their bearers as they carry deep personal and cultural significance. Names also serve as powerful signals of gender, race, and status in the social hierarchy - a pecking order in which individual positions shape others’ expectations on their perceived competence and worth. With the widespread adoption of LLMs and as names are often an input for LLMs, it is crucial to evaluate whether LLMs may sort people into status positions based on first and last names and, if so, whether it is in an unfair, biased fashion. While prior work has primarily investigated biases in first names, little attention has been paid to last names and even less to the combined effects of first and last names. In this study, we conduct a large-scale analysis of name variations across 5 ethnicities to examine how AI exhibits name biases. Our study investigates three key characteristics of inequality and finds that LLMs reflect and reinforce status hierarchies based on names that signal gender and ethnicity as they encode differential expectations of competence, leadership, and economic potential. Contrary to the common assumption that AI tends to favor Whites, we show that East and, in some contexts, South Asian names receive higher rankings. We also disaggregate Asians, a population projected to be the largest immigrant group in the U.S. by 2055. Our results challenge the monolithic Asian model minority assumption, illustrating a more complex and stratified model of bias. Gender moderates biases, with girls facing unfair disadvantages in certain racial groups. Additionally, spanning cultural categories by adopting Western first names improves AI-perceived status for East and Southeast Asian students, particularly for girls. Our findings underscore the importance of intersectional and more nuanced understandings of race, gender, and mixed identities in the evaluation of LLMs.

[NLP-42] GUM-SAGE: A Novel Dataset and Approach for Graded Entity Salience Prediction

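Quick read: This paper addresses determining and ranking the most salient entities in a text, a need that grows in user-facing systems as users increasingly rely on models to interpret long documents they only partially read. It proposes a new approach to graded entity salience that overcomes the limits of existing methods, which fall into two categories: subjective judgments, which allow gradient scoring but lack consistency, and summarization-based methods, which define salience as mention-worthiness in a summary and improve explainability but restrict outputs to binary labels (an entity is either summary-worthy or not). The key innovation combines the strengths of both: using a dataset spanning 12 written and spoken English genres, the authors collect 5 summaries per document and compute each entity's salience score from its presence across those summaries. The approach correlates more strongly with scores based on human summaries and alignments and outperforms existing techniques, including LLMs. The released data and code support further research on graded salient entity extraction.

Link: https://arxiv.org/abs/2504.10792
Authors: Jessica Lin, Amir Zeldes
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: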

Abstract:Determining and ranking the most salient entities in a text is critical for user-facing systems, especially as users increasingly rely on models to interpret long documents they only partially read. Graded entity salience addresses this need by assigning entities scores that reflect their relative importance in a text. Existing approaches fall into two main categories: subjective judgments of salience, which allow for gradient scoring but lack consistency, and summarization-based methods, which define salience as mention-worthiness in a summary, promoting explainability but limiting outputs to binary labels (entities are either summary-worthy or not). In this paper, we introduce a novel approach for graded entity salience that combines the strengths of both approaches. Using an English dataset spanning 12 spoken and written genres, we collect 5 summaries per document and calculate each entity’s salience score based on its presence across these summaries. Our approach shows stronger correlation with scores based on human summaries and alignments, and outperforms existing techniques, including LLMs. We release our data and code at this https URL to support further research on graded salient entity extraction.
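The scoring rule described in the abstract (salience derived from an entity's presence across 5 summaries) can be pictured with a small sketch. It assumes the score is simply the number of summaries that mention the entity; the function name `graded_salience` and the input layout are hypothetical, and the released dataset may define the score differently.

```python
def graded_salience(document_entities, summary_entity_sets):
    """Count, for each document entity, how many summaries mention it.

    document_entities: iterable of entity identifiers found in the document.
    summary_entity_sets: one set of mentioned entity identifiers per summary.
    """
    scores = {entity: 0 for entity in document_entities}
    for mentioned in summary_entity_sets:
        for entity in mentioned:
            if entity in scores:  # ignore entities that are not in the document
                scores[entity] += 1
    return scores
```

With 5 summaries this gives a score from 0 to 5 per entity, which is graded rather than binary while still being grounded in summary-worthiness.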

[NLP-43] The Art of Audience Engagement: LLM-Based Thin-Slicing of Scientific Talks

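Quick read: This paper studies thin-slicing, the ability to make accurate judgments from minimal information, applied to assessing the quality of scientific presentations. It explores how Large Language Models (LLMs), by evaluating transcripts of full talks and of their thin slices, can predict overall presentation quality, and how much information an accurate prediction requires. Using a corpus of over one hundred real-life science talks, the authors correlate LLM evaluations of short excerpts with full-talk assessments and find that even very short excerpts (less than 10 percent of a talk) strongly predict overall evaluations. This suggests that the opening moments of a talk convey enough information to shape lasting impressions. The core contribution is to show that LLM-based thin-slicing is valid, reliable, and efficient, and to propose it as a scalable feedback tool for improving human communication.

Link: https://arxiv.org/abs/2504.10768
Authors: Ralf Schmälzle, Sue Lim, Yuetong Du, Gary Bente
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Human-Computer Interaction (cs.HC)
Comments: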

Abstract:This paper examines the thin-slicing approach - the ability to make accurate judgments based on minimal information - in the context of scientific presentations. Drawing on research from nonverbal communication and personality psychology, we show that brief excerpts (thin slices) reliably predict overall presentation quality. Using a novel corpus of over one hundred real-life science talks, we employ Large Language Models (LLMs) to evaluate transcripts of full presentations and their thin slices. By correlating LLM-based evaluations of short excerpts with full-talk assessments, we determine how much information is needed for accurate predictions. Our results demonstrate that LLM-based evaluations align closely with human ratings, proving their validity, reliability, and efficiency. Critically, even very short excerpts (less than 10 percent of a talk) strongly predict overall evaluations. This suggests that the first moments of a presentation convey relevant information that is used in quality evaluations and can shape lasting impressions. The findings are robust across different LLMs and prompting strategies. This work extends thin-slicing research to public speaking and connects theories of impression formation to LLMs and current research on AI communication. We discuss implications for communication and social cognition research on message reception. Lastly, we suggest an LLM-based thin-slicing framework as a scalable feedback tool to enhance human communication.

[NLP-44] How Instruction and Reasoning Data shape Post-Training: Data Quality through the Lens of Layer-wise Gradients

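Quick read: This paper asks how different types of data affect the fine-tuning dynamics of Large Language Models (LLMs) as post-training moves from instruction following to complex reasoning, a question that remains largely unexplored. The key approach is a spectral analysis of the layer-wise gradients induced by low- and high-quality instruction and reasoning data. Widely used data evaluation metrics (e.g., IFD, InsTag, Difficulty, and Reward) can be explained and unified by spectral properties computed from the singular value decomposition (SVD) of the gradients. Specifically, higher-quality data are usually associated with lower nuclear norms and higher effective ranks, and effective rank is more robust and has better resolution than nuclear norm in capturing subtle quality differences. Experiments also show that models within the same family share similar gradient patterns regardless of size, whereas different model families diverge significantly. The work gives a unified view of data-quality effects across instruction and reasoning data, clarifies the interplay between data quality and training stability, and offers insights for better data exploration strategies in post-training.

Link: https://arxiv.org/abs/2504.10766
Authors: Ming Li, Yanhong Li, Ziyue Li, Tianyi Zhou
Affiliations: University of Maryland; University of Chicago
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: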

Abstract:As the post-training of large language models (LLMs) advances from instruction-following to complex reasoning tasks, understanding how different data affect finetuning dynamics remains largely unexplored. In this paper, we present a spectral analysis of layer-wise gradients induced by low/high-quality instruction and reasoning data for LLM post-training. Our analysis reveals that widely-studied metrics for data evaluation, e.g., IFD, InsTag, Difficulty, and Reward, can be explained and unified by spectral properties computed from gradients’ singular value decomposition (SVD). Specifically, higher-quality data are usually associated with lower nuclear norms and higher effective ranks. Notably, effective rank exhibits better robustness and resolution than nuclear norm in capturing subtle quality differences. For example, reasoning data achieves substantially higher effective ranks than instruction data, implying richer gradient structures on more complex tasks. Our experiments also highlight that models within the same family share similar gradient patterns regardless of their sizes, whereas different model families diverge significantly. Providing a unified view on the effects of data quality across instruction and reasoning data, this work illuminates the interplay between data quality and training stability, shedding novel insights into developing better data exploration strategies for post-training.
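The two spectral quantities named in the abstract can be computed from a gradient matrix's singular values. The sketch below takes the singular values as given (the SVD itself would come from a numerical library) and assumes the common entropy-based definition of effective rank, the exponential of the Shannon entropy of the normalized singular values; the paper may use a different definition, and the function names are illustrative.

```python
import math

def nuclear_norm(singular_values):
    """Nuclear norm: the sum of the singular values."""
    return sum(singular_values)

def effective_rank(singular_values):
    """Entropy-based effective rank (assumed definition, see lead-in)."""
    total = sum(singular_values)
    if total <= 0:
        raise ValueError("at least one singular value must be positive")
    # Normalize to a probability distribution, skipping zeros (0 * log 0 = 0).
    probs = [s / total for s in singular_values if s > 0]
    entropy = -sum(p * math.log(p) for p in probs)
    return math.exp(entropy)
```

A flat spectrum gives an effective rank equal to the number of singular values, while a spectrum dominated by one value gives an effective rank near 1, which matches the reading of "richer gradient structures" as higher effective rank.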

[NLP-45] CleanMAP: Distilling Multimodal LLMs for Confidence-Driven Crowdsourced HD Map Updates CVPR

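Quick read: This paper addresses the challenge of real-time high-definition (HD) map updates for intelligent connected vehicles (ICVs) and integrated vehicle-road-cloud systems, in particular ensuring map reliability despite inconsistent crowdsourced data affected by motion blur, lighting variations, adverse weather, and lane marking degradation. It proposes CleanMAP, a distillation framework based on a Multimodal Large Language Model (MLLM). At its core is an MLLM-driven lane visibility scoring model that systematically quantifies key visual parameters and assigns confidence scores (0-10) according to their impact on lane detection. A dynamic piecewise confidence-scoring function adapts scores to lane visibility, keeping them closely aligned with human evaluations while filtering unreliable data. To further improve map accuracy, a confidence-driven local map fusion strategy selects the top-k highest-scoring local maps within an optimal confidence range, balancing data quality and quantity. Experiments show that fusing the three highest-scoring local maps lowers the mean map update error to 0.28 m, better than the baseline (0.37 m) and within the strict accuracy threshold (≤ 0.32 m). Validation on real-vehicle data shows 84.88% agreement with human evaluators. CleanMAP thus combines MLLM-based lane visibility scoring with confidence-driven local map fusion to provide a scalable, deployable solution for crowdsourced HD map updates and more precise, reliable autonomous navigation.

Link: https://arxiv.org/abs/2504.10738
Authors: Ankit Kumar Shaw (1), Kun Jiang (1), Tuopu Wen (1), Chandan Kumar Sah (2), Yining Shi (1), Mengmeng Yang (1), Diange Yang (1), Xiaoli Lian (2) ((1) Tsinghua University, (2) Beihang University)
Affiliations: Tsinghua University; Beihang University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Robotics (cs.RO)
Comments: Kun Jiang, Mengmeng Yang and Diange Yang are corresponding authors. The main paper and supplementary material are both included here: 23 pages in total (10 pages main paper, 13 pages supplementary material) and 17 figures in total (6 in the main paper, 11 in the supplementary material). Accepted to CVPR WDFM-AD Workshop 2025. The code will be available at this https URL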

Abstract:The rapid growth of intelligent connected vehicles (ICVs) and integrated vehicle-road-cloud systems has increased the demand for accurate, real-time HD map updates. However, ensuring map reliability remains challenging due to inconsistencies in crowdsourced data, which suffer from motion blur, lighting variations, adverse weather, and lane marking degradation. This paper introduces CleanMAP, a Multimodal Large Language Model (MLLM)-based distillation framework designed to filter and refine crowdsourced data for high-confidence HD map updates. CleanMAP leverages an MLLM-driven lane visibility scoring model that systematically quantifies key visual parameters, assigning confidence scores (0-10) based on their impact on lane detection. A novel dynamic piecewise confidence-scoring function adapts scores based on lane visibility, ensuring strong alignment with human evaluations while effectively filtering unreliable data. To further optimize map accuracy, a confidence-driven local map fusion strategy ranks and selects the top-k highest-scoring local maps within an optimal confidence range (best score minus 10%), striking a balance between data quality and quantity. Experimental evaluations on a real-world autonomous vehicle dataset validate CleanMAP’s effectiveness, demonstrating that fusing the top three local maps achieves the lowest mean map update error of 0.28m, outperforming the baseline (0.37m) and meeting stringent accuracy thresholds (≤ 0.32m). Further validation with real-vehicle data confirms 84.88% alignment with human evaluators, reinforcing the model’s robustness and reliability. This work establishes CleanMAP as a scalable and deployable solution for crowdsourced HD map updates, ensuring more precise and reliable autonomous navigation. The code will be available at this https URL
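The selection step of the fusion strategy (top-k local maps within "best score minus 10%") can be sketched as follows. The sketch reads the range as "at least 90% of the best score", which is an assumption; the function name `select_local_maps` and the `(map_id, confidence)` input layout are hypothetical, and the actual fusion of the selected maps is not shown.

```python
def select_local_maps(scored_maps, k=3, tolerance=0.10):
    """Pick up to k local maps whose confidence is close to the best one.

    scored_maps: list of (map_id, confidence) pairs, confidence in [0, 10].
    tolerance: allowed relative drop from the best score (assumed reading).
    """
    if not scored_maps:
        return []
    best = max(score for _, score in scored_maps)
    threshold = best * (1.0 - tolerance)
    # Keep only maps inside the confidence range, then rank them.
    eligible = [m for m in scored_maps if m[1] >= threshold]
    eligible.sort(key=lambda m: m[1], reverse=True)
    return eligible[:k]
```

Filtering before ranking means a low-confidence map is never fused just to reach k, which reflects the quality-versus-quantity balance the abstract describes.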

[NLP-46] HELIOS: Adaptive Model And Early-Exit Selection for Efficient LLM Inference Serving

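Quick read: This paper addresses a core challenge in deploying Large Language Models (LLMs): the inherent trade-offs among key performance metrics such as latency, accuracy, and throughput. Early-Exit LLMs (EE-LLMs) reduce latency by emitting predictions early, but they conservatively load the entire model, which limits resource savings and throughput. Current frameworks also select a model statically, so they cannot adapt as input queries change.

The key to the proposed system, HELIOS, is threefold. First, it shortlists candidate models and evaluates them in real time on a subset of prompts, gathering telemetry data. Second, it uses the early-exit data to greedily load only a limited number of layers of the selected model, saving memory and allowing more requests to be processed concurrently. Third, it monitors candidate model performance and, when needed, switches to a model that serves queries more efficiently (for example, using fewer layers without lowering accuracy). When optimizing for the respective service level objectives, HELIOS achieves 1.48× throughput, 1.10× energy efficiency, 1.39× lower response time, and 3.7× larger inference batch sizes compared to the baseline.

Link: https://arxiv.org/abs/2504.10724
Authors: Avinash Kumar, Shashank Nag, Jason Clemons, Lizy John, Poulami Das
Affiliations: The University of Texas at Austin; NVIDIA
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: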

Abstract:Deploying large language models (LLMs) presents critical challenges due to the inherent trade-offs associated with key performance metrics, such as latency, accuracy, and throughput. Typically, gains in one metric is accompanied with degradation in others. Early-Exit LLMs (EE-LLMs) efficiently navigate this trade-off space by skipping some of the later model layers when it confidently finds an output token early, thus reducing latency without impacting accuracy. However, as the early exits taken depend on the task and are unknown apriori to request processing, EE-LLMs conservatively load the entire model, limiting resource savings and throughput. Also, current frameworks statically select a model for a user task, limiting our ability to adapt to changing nature of the input queries. We propose HELIOS to address these challenges. First, HELIOS shortlists a set of candidate LLMs, evaluates them using a subset of prompts, gathering telemetry data in real-time. Second, HELIOS uses the early exit data from these evaluations to greedily load the selected model only up to a limited number of layers. This approach yields memory savings which enables us to process more requests at the same time, thereby improving throughput. Third, HELIOS monitors and periodically reassesses the performance of the candidate LLMs and if needed, switches to another model that can service incoming queries more efficiently (such as using fewer layers without lowering accuracy). Our evaluations show that HELIOS achieves 1.48 \times throughput, 1.10 \times energy-efficiency, 1.39 \times lower response time, and 3.7 \times improvements in inference batch sizes compared to the baseline, when optimizing for the respective service level objectives. 
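
The second HELIOS step, loading a model only up to a limited number of layers based on early-exit telemetry, can be illustrated with a small sketch. The abstract does not state the selection rule, so the coverage threshold, the function name `layers_to_load`, and the sample telemetry below are all hypothetical and only show the general idea.

```python
from collections import Counter


def layers_to_load(exit_layers, total_layers, coverage=0.95):
    """Return the smallest layer count that serves `coverage` of profiled tokens.

    exit_layers: 1-based layer index at which each profiled token exited.
    The coverage rule is an assumption for illustration, not HELIOS's actual rule.
    """
    if not exit_layers:
        return total_layers  # no telemetry yet: load the whole model
    counts = Counter(exit_layers)
    needed = coverage * len(exit_layers)
    served = 0
    for layer in range(1, total_layers + 1):
        served += counts.get(layer, 0)
        if served >= needed:
            return layer
    return total_layers


# Hypothetical telemetry from a 32-layer model: most tokens exit by layer 20.
telemetry = [12] * 50 + [16] * 30 + [20] * 17 + [32] * 3
budget = layers_to_load(telemetry, total_layers=32, coverage=0.95)
print(f"load {budget} of 32 layers")
```

Tokens that would have exited later than the loaded depth would need a fallback, which is consistent with the monitoring and model-switching step the abstract describes.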

[NLP-47] EMAFusion: A Self-Optimizing System for Seamless LLM Selection and Integration

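[Quick read]: This paper addresses the high computational and financial cost of deploying large language models (LLMs) in practice. Existing routing strategies partly mitigate the problem by assigning queries to cheaper or specialized models, but they usually depend on large amounts of labeled data or on fragile task-specific heuristics; fusion techniques improve accuracy and robustness, yet often raise cost and can reinforce shared biases. The proposed framework, EMAFusion, self-optimizes LLM selection and reliable execution. Specifically, it combines a taxonomy-based router, a learned router for ambiguous inputs, and a cascade driven by multi-judge confidence evaluation that escalates step by step from cheaper to more expensive models. In experiments, EMAFusion beats the best individual model by 2.6 percentage points (94.3% vs. 91.7%) at one quarter of the average cost, and improves on GPT-4 by 17.1 percentage points at less than 1/20 of its cost. Its combined routing also reaches higher accuracy (94.3%) than the taxonomy-based approach alone (88.1%) or the learned model predictor alone (91.7%), which supports the unified strategy and allows flexible cost-accuracy trade-offs.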

Link: https://arxiv.org/abs/2504.10681
Authors: Soham Shah, Kumar Shridhar, Surojit Chatterjee, Souvik Sen
Affiliations: Ema Unlimited, Inc.
Categories: Computation and Language (cs.CL)
Comments:

Abstract:While recent advances in large language models (LLMs) have significantly enhanced performance across diverse natural language tasks, the high computational and financial costs associated with their deployment remain substantial barriers. Existing routing strategies partially alleviate this challenge by assigning queries to cheaper or specialized models, but they frequently rely on extensive labeled data or fragile task-specific heuristics. Conversely, fusion techniques aggregate multiple LLM outputs to boost accuracy and robustness, yet they often exacerbate cost and may reinforce shared biases. We introduce EMAFusion, a new framework that self-optimizes for seamless LLM selection and reliable execution for a given query. Specifically, EMAFusion integrates a taxonomy-based router for familiar query types, a learned router for ambiguous inputs, and a cascading approach that progressively escalates from cheaper to more expensive models based on multi-judge confidence evaluations. Through extensive evaluations, we find EMAFusion outperforms the best individual models by over 2.6 percentage points (94.3% vs. 91.7%), while being 4X cheaper than the average cost. EMAFusion further achieves a remarkable 17.1 percentage point improvement over models like GPT-4 at less than 1/20th the cost. Our combined routing approach delivers 94.3% accuracy compared to taxonomy-based (88.1%) and learned model predictor-based (91.7%) methods alone, demonstrating the effectiveness of our unified strategy. Finally, EMAFusion supports flexible cost-accuracy trade-offs, allowing users to balance their budgetary constraints and performance needs.
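
The cascading component described above, which escalates from cheaper to more expensive models based on multi-judge confidence, can be sketched roughly as follows. This is a minimal illustration under stated assumptions: averaging judge scores against a fixed threshold, the `cascade` function, and the toy models and judges are all hypothetical and are not taken from the paper.

```python
from statistics import mean


def cascade(query, models, judges, threshold=0.8):
    """Try models from cheapest to most expensive; stop once judges are confident.

    models: list of (name, cost, answer_fn) tuples.
    judges: list of functions (query, answer) -> confidence in [0, 1].
    Averaging judge scores against a threshold is an assumption for illustration.
    """
    spent = 0.0
    name, answer = None, None
    for name, cost, answer_fn in sorted(models, key=lambda m: m[1]):
        answer = answer_fn(query)
        spent += cost
        confidence = mean(judge(query, answer) for judge in judges)
        if confidence >= threshold:
            break  # confident enough: do not escalate further
    return name, answer, spent


# Toy stand-ins for real LLM calls and LLM judges.
def small_model(query):
    return "unsure"


def large_model(query):
    return "Paris"


judges = [
    lambda q, a: 0.0 if a == "unsure" else 1.0,
    lambda q, a: min(1.0, len(a) / 5),
]
models = [("large", 20.0, large_model), ("small", 1.0, small_model)]
print(cascade("Capital of France?", models, judges))
```

The returned cost includes every model that was tried, which is why a cascade only pays off when the cheaper models are confident often enough.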

[NLP-48] Keyword Extraction and Aspect Classification in Sinhala English and Code-Mixed Content

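[Quick read]: This paper addresses how to analyze customer opinions containing code-mixed and multilingual content for brand reputation management in the banking sector. Conventional NLP models misclassify or ignore code-mixed text in low-resource settings such as Sinhala-English, and struggle to capture domain-specific knowledge. The paper proposes a hybrid NLP method focused on keyword extraction, content filtering, and aspect-based classification. The key is to combine fine-tuned models (such as SpaCy, FinBERT, and XLM-RoBERTa) with a domain-specific vocabulary, which markedly improves the accuracy of multilingual financial text analysis, especially for code-mixed and low-resource text, and outperforms traditional keyword-based and rule-based methods.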

Link: https://arxiv.org/abs/2504.10679
Authors: F.A. Rizvi, T. Navojith, A.M.N.H. Adhikari, W.P.U. Senevirathna, Dharshana Kasthurirathna, Lakmini Abeywardhana
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 6 Pages, 2 figures, 7 Tables

Abstract:Brand reputation in the banking sector is maintained through insightful analysis of customer opinion on code-mixed and multilingual content. Conventional NLP models misclassify or ignore code-mixed text, particularly when it involves low-resource languages, as in Sinhala-English, and fail to capture domain-specific knowledge. This study introduces a hybrid NLP method to improve keyword extraction, content filtering, and aspect-based classification of banking content. Keyword extraction in English is performed with a hybrid approach comprising a fine-tuned SpaCy NER model, FinBERT-based KeyBERT embeddings, YAKE, and EmbedRank, which results in a combined accuracy of 91.2%. Code-mixed and Sinhala keywords are extracted using a fine-tuned XLM-RoBERTa model integrated with a domain-specific Sinhala financial vocabulary, which results in an accuracy of 87.4%. To ensure data quality, irrelevant comment filtering was performed using several models, with the BERT-base-uncased model achieving 85.2% for English and XLM-RoBERTa 88.1% for Sinhala, which was better than GPT-4o, SVM, and keyword-based filtering. Aspect classification followed the same pattern, with the BERT-base-uncased model achieving 87.4% for English and XLM-RoBERTa 85.9% for Sinhala, both exceeding GPT-4 and keyword-based approaches. These findings confirm that fine-tuned transformer models outperform traditional methods in multilingual financial text analysis. The present framework offers an accurate and scalable solution for brand reputation monitoring in code-mixed and low-resource banking environments.

[NLP-49] Characterizing Knowledge Manipulation in a Russian Wikipedia Fork

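[Quick read]: This paper studies practices and narratives associated with different forms of knowledge manipulation, focusing on Ruwiki, a fork of Russian Wikipedia that copied the original content and modified it to conform to Russian law. The paper proposes a methodology for characterizing the main changes in Ruwiki relative to the original. The key is a comprehensive comparison of more than 1.9 million articles, using meta-information together with geographical, temporal, categorical, and textual features to analyze the changes made by Ruwiki editors, followed by a classification of the main topics of knowledge manipulation in the fork and a numerical estimate of their scope. The study sheds light on significant changes within Ruwiki and provides an analysis method that can be applied to other wiki forks and similar collaborative projects.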

Link: https://arxiv.org/abs/2504.10663
Authors: Mykola Trokhymovych, Oleksandr Kosovan, Nathan Forrester, Pablo Aragón, Diego Saez-Trumper, Ricardo Baeza-Yates
Affiliations: Universitat Pompeu Fabra; Google; Meta
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Wikipedia is powered by MediaWiki, a free and open-source software that is also the infrastructure for many other wiki-based online encyclopedias. These include the recently launched website Ruwiki, which has copied and modified the original Russian Wikipedia content to conform to Russian law. To identify practices and narratives that could be associated with different forms of knowledge manipulation, this article presents an in-depth analysis of this Russian Wikipedia fork. We propose a methodology to characterize the main changes with respect to the original version. The foundation of this study is a comprehensive comparative analysis of more than 1.9M articles from Russian Wikipedia and its fork. Using meta-information and geographical, temporal, categorical, and textual features, we explore the changes made by Ruwiki editors. Furthermore, we present a classification of the main topics of knowledge manipulation in this fork, including a numerical estimation of their scope. This research not only sheds light on significant changes within Ruwiki, but also provides a methodology that could be applied to analyze other Wikipedia forks and similar collaborative projects.

[NLP-50] LITERA: An LLM Based Approach to Latin-to-English Translation NAACL

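[Quick read]: This paper addresses the challenges of Latin-to-English translation, in particular improving translation accuracy and quality. Its key contribution is LITERA, an LLM-based translation platform that uses fine-tuned versions of GPT-4o-mini and GPT-4o in a multi-layered translation architecture, which markedly improves BLEU and BLEURT scores, especially for classical Latin. The result builds on collaboration with Duke University's Classical Studies Department, which helped create a small, high-quality parallel Latin-English dataset that is central to the system's ability to produce accurate literal translations.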

Link: https://arxiv.org/abs/2504.10660
Authors: Paul Rosu
Affiliations: Duke University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: NAACL Findings

Abstract:This paper introduces an LLM-based Latin-to-English translation platform designed to address the challenges of translating Latin texts. We named the model LITERA, which stands for Latin Interpretation and Translations into English for Research Assistance. Through a multi-layered translation process utilizing a fine-tuned version of GPT-4o-mini and GPT-4o, LITERA offers an unprecedented level of accuracy, showcased by greatly improved BLEU scores, particularly in classical Latin, along with improved BLEURT scores. The development of LITERA involved close collaboration with Duke University’s Classical Studies Department, which was instrumental in creating a small, high-quality parallel Latin-English dataset. This paper details the architecture, fine-tuning methodology, and prompting strategies used in LITERA, emphasizing its ability to produce literal translations.

[NLP-51] Will AI shape the way we speak? The emerging sociolinguistic influence of synthetic voices

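[Quick read]: This paper examines how conversational voice interfaces, driven by advances in speech and language technology, influence human communication, with particular attention to socioindexical features (such as accent, intonation, and speech style) that convey social identity and group affiliation. It notes that, unlike passive media, interactive conversational AI can shape individuals' everyday speech patterns more deeply through phenomena such as acoustic-prosodic entrainment and linguistic accommodation, and that this influence could give organisations, movements, and brands a subtle yet powerful avenue for shaping and controlling public perception and social identity. The authors argue that the socioindexical influence of AI-generated speech should become a focus of interdisciplinary research that uses new and existing methodologies and technologies to understand its implications.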

Link: https://arxiv.org/abs/2504.10650
Authors: Éva Székely, Jūra Miniota, Míša (Michaela) Hejná
Affiliations: Department of Speech, Music and Hearing, KTH Royal Institute of Technology, Sweden; Department of English, Aarhus University, Denmark
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC); Audio and Speech Processing (eess.AS)
Comments: 5 pages, 0 figures, International Workshop on Spoken Dialogue Systems Technology (IWSDS) 2025

Abstract:The growing prevalence of conversational voice interfaces, powered by developments in both speech and language technologies, raises important questions about their influence on human communication. While written communication can signal identity through lexical and stylistic choices, voice-based interactions inherently amplify socioindexical elements - such as accent, intonation, and speech style - which more prominently convey social identity and group affiliation. There is evidence that even passive media such as television is likely to influence the audience’s linguistic patterns. Unlike passive media, conversational AI is interactive, creating a more immersive and reciprocal dynamic that holds a greater potential to impact how individuals speak in everyday interactions. Such heightened influence can be expected to arise from phenomena such as acoustic-prosodic entrainment and linguistic accommodation, which occur naturally during interaction and enable users to adapt their speech patterns in response to the system. While this phenomenon is still emerging, its potential societal impact could provide organisations, movements, and brands with a subtle yet powerful avenue for shaping and controlling public perception and social identity. We argue that the socioindexical influence of AI-generated speech warrants attention and should become a focus of interdisciplinary research, leveraging new and existing methodologies and technologies to better understand its implications.

[NLP-52] Improving In-Context Learning with Reasoning Distillation

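[Quick read]: This paper addresses the weak inductive reasoning of language models under few-shot prompting. Instruction-tuning methods based on imitation learning improve in-context learning performance only superficially and do not improve the model's understanding of the underlying rules that connect inputs and outputs. The paper proposes ReDis, a reasoning distillation technique that combines data augmentation, filtering, supervised fine-tuning, and alignment. ReDis yields significant improvements on tasks including 1D-ARC, List Function, ACRE, and MiniSCAN, outperforms equivalent few-shot prompting baselines across several language model backbones, and in some cases even surpasses the teacher model, GPT-4o.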

Link: https://arxiv.org/abs/2504.10647
Authors: Nafis Sadeq, Xin Xu, Zhouhang Xie, Julian McAuley, Byungkyu Kang, Prarit Lamba, Xiang Gao
Affiliations: UC San Diego; Intuit
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Language models rely on semantic priors to perform in-context learning, which leads to poor performance on tasks involving inductive reasoning. Instruction-tuning methods based on imitation learning can superficially enhance the in-context learning performance of language models, but they often fail to improve the model’s understanding of the underlying rules that connect inputs and outputs in few-shot demonstrations. We propose ReDis, a reasoning distillation technique designed to improve the inductive reasoning capabilities of language models. Through a careful combination of data augmentation, filtering, supervised fine-tuning, and alignment, ReDis achieves significant performance improvements across a diverse range of tasks, including 1D-ARC, List Function, ACRE, and MiniSCAN. Experiments on three language model backbones show that ReDis outperforms equivalent few-shot prompting baselines across all tasks and even surpasses the teacher model, GPT-4o, in some cases. ReDis, based on the LLaMA-3 backbone, achieves relative improvements of 23.2%, 2.8%, and 66.6% over GPT-4o on 1D-ARC, ACRE, and MiniSCAN, respectively, within a similar hypothesis search space. The code, dataset, and model checkpoints will be made available at this https URL.

[NLP-53] Weight-of-Thought Reasoning: Exploring Neural Network Weights for Enhanced LLM Reasoning

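[Quick read]: This paper addresses the fact that reasoning methods for large language models (LLMs) focus on token-level output and overlook the internal weight dynamics of the network. Existing methods such as Chain-of-Thought (CoT) improve reasoning through prompting strategies but do not examine the information held in model weights. The paper proposes Weight-of-Thought (WoT) reasoning, which explores the weight space through graph-based message passing, multi-step reasoning processes, and attention mechanisms in order to identify reasoning pathways before inference. The implementation builds an interconnected graph of reasoning nodes. Experiments on diverse reasoning tasks (syllogistic, mathematical, algebraic, combinatorial, and geometric) show that WoT outperforms traditional methods and makes the reasoning process more interpretable, offering a promising direction for improving LLM reasoning.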

Link: https://arxiv.org/abs/2504.10646
Authors: Saif Punjwani, Larry Heck
Affiliations: Georgia Institute of Technology
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Large language models (LLMs) have demonstrated remarkable reasoning capabilities when prompted with strategies such as Chain-of-Thought (CoT). However, these approaches focus on token-level output without considering internal weight dynamics. We introduce Weight-of-Thought (WoT) reasoning, a novel approach that examines neural network weights before inference to identify reasoning pathways. Unlike existing methods, WoT explores the weight space through graph-based message passing, multi-step reasoning processes, and attention mechanisms. Our implementation creates an interconnected graph of reasoning nodes. Experiments on diverse reasoning tasks (syllogistic, mathematical, algebraic, combinatorial, and geometric) demonstrate that WoT achieves superior performance compared to traditional methods, particularly for complex problems. This approach leads to both improved performance and greater interpretability of the reasoning process, offering a promising direction for enhancing LLM reasoning capabilities.

[NLP-54] Better Estimation of the KL Divergence Between Language Models

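[Quick read]: This paper addresses the high variance of estimators of the Kullback–Leibler (KL) divergence between two arbitrary language models. The standard sampling-based Monte Carlo (MC) estimator is unbiased, but its variance is typically high and it can even return a negative estimate of the KL divergence, which is a non-negative quantity. The key contribution is a Rao–Blackwellized estimator that is also unbiased and provably has variance no greater than that of the standard Monte Carlo estimator. The paper also derives an analogous Rao–Blackwellized estimator of the gradient of the KL divergence, which gives more stable training and produces models that appear more often on the Pareto frontier of reward vs. KL.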

Link: https://arxiv.org/abs/2504.10637
Authors: Afra Amini, Tim Vieira, Ryan Cotterell
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Estimating the Kullback–Leibler (KL) divergence between language models has many applications, e.g., reinforcement learning from human feedback (RLHF), interpretability, and knowledge distillation. However, computing the exact KL divergence between two arbitrary language models is intractable. Thus, practitioners often resort to the use of sampling-based estimators. While it is easy to fashion a simple Monte Carlo (MC) estimator that provides an unbiased estimate of the KL divergence between language models, this estimator notoriously suffers from high variance, and can even result in a negative estimate of the KL divergence, a non-negative quantity. In this paper, we introduce a Rao–Blackwellized estimator that is also unbiased and provably has variance less than or equal to that of the standard Monte Carlo estimator. In an empirical study on sentiment-controlled fine-tuning, we show that our estimator provides more stable KL estimates and reduces variance substantially in practice. Additionally, we derive an analogous Rao–Blackwellized estimator of the gradient of the KL divergence, which leads to more stable training and produces models that more frequently appear on the Pareto frontier of reward vs. KL compared to the ones trained with the MC estimator of the gradient.
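
The contrast between the plain Monte Carlo estimator and a Rao–Blackwellized one can be seen on a toy problem. The sketch below assumes one standard construction: along each sampled prefix, replace the log-ratio of the single sampled token with the exact KL between the two next-token distributions. The abstract does not spell out the paper's construction, so treat this as an illustration only; the toy Markov "language models" `p` and `q`, the fixed `LENGTH`, and `estimate_kl` are hypothetical.

```python
import math
import random

VOCAB = [0, 1, 2]
LENGTH = 4  # fixed sequence length keeps the toy example small


def make_lm(rows):
    """Toy 'language model': the next-token distribution depends on the previous token."""
    def next_dist(prefix):
        prev = prefix[-1] if prefix else 0
        return rows[prev]
    return next_dist


p = make_lm([[0.7, 0.2, 0.1], [0.1, 0.6, 0.3], [0.3, 0.3, 0.4]])
q = make_lm([[0.5, 0.3, 0.2], [0.2, 0.5, 0.3], [0.25, 0.25, 0.5]])


def token_kl(dp, dq):
    """Exact KL between two next-token distributions."""
    return sum(a * math.log(a / b) for a, b in zip(dp, dq) if a > 0)


def estimate_kl(num_samples, rng):
    """Return (mc, rb): Monte Carlo and Rao-Blackwellized estimates of KL(p || q)."""
    mc_total, rb_total = 0.0, 0.0
    for _ in range(num_samples):
        prefix = []
        for _ in range(LENGTH):
            dp, dq = p(prefix), q(prefix)
            # RB term: exact expectation over the next token given the sampled prefix.
            rb_total += token_kl(dp, dq)
            token = rng.choices(VOCAB, weights=dp)[0]
            # MC term: log-ratio of the one sampled token only.
            mc_total += math.log(dp[token] / dq[token])
            prefix.append(token)
    return mc_total / num_samples, rb_total / num_samples


mc, rb = estimate_kl(2000, random.Random(0))
print(f"MC estimate: {mc:.4f}  RB estimate: {rb:.4f}")
```

In this construction every per-step term is itself an exact KL divergence, so the Rao–Blackwellized estimate cannot go negative, whereas individual Monte Carlo samples can.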

[NLP-55] Beyond Chains of Thought: Benchmarking Latent-Space Reasoning Abilities in Large Language Models

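[Quick read]: This paper asks how to quantify the model-internal reasoning abilities of large language models (LLMs) across domains, in particular the inferential “leaps” they make in latent space. The key is a benchmark of 4,000 items in which LLMs indicate the correct answer to a reasoning problem not through descriptive text but through the language of their initial response token, which must differ from English, the benchmark language. This forces models to reason beyond their context window and to override their default tendency to answer in the language of the prompt, which adds cognitive strain and so tests internal reasoning. GPT-4.5 reaches the highest accuracy on the benchmark (74.7%). Control experiments and difficulty scaling analyses indicate that, although LLMs do engage in internal reasoning, heuristic exploitation cannot be ruled out under certain conditions. This calls for further study, especially for safety-related concerns such as covert planning, goal-seeking, or deception that emerges without explicit token traces.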

Link: https://arxiv.org/abs/2504.10615
Authors: Thilo Hagendorff, Sarah Fabi
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Large language models (LLMs) can perform reasoning computations both internally within their latent space and externally by generating explicit token sequences like chains of thought. Significant progress in enhancing reasoning abilities has been made by scaling test-time compute. However, understanding and quantifying model-internal reasoning abilities - the inferential “leaps” models make between individual token predictions - remains crucial. This study introduces a benchmark (n = 4,000 items) designed to quantify model-internal reasoning in different domains. We achieve this by having LLMs indicate the correct solution to reasoning problems not through descriptive text, but by selecting a specific language of their initial response token that is different from English, the benchmark language. This not only requires models to reason beyond their context window, but also to override their default tendency to respond in the same language as the prompt, thereby posing an additional cognitive strain. We evaluate a set of 18 LLMs, showing significant performance variations, with GPT-4.5 achieving the highest accuracy (74.7%), outperforming models like Grok-2 (67.2%), and Llama 3.1 405B (65.6%). Control experiments and difficulty scaling analyses suggest that while LLMs engage in internal reasoning, we cannot rule out heuristic exploitations under certain conditions, marking an area for future investigation. Our experiments demonstrate that LLMs can “think” via latent-space computations, revealing model-internal inference strategies that need further understanding, especially regarding safety-related concerns such as covert planning, goal-seeking, or deception emerging without explicit token traces.

[NLP-56] Federated Learning with Layer Skipping: Efficient Training of Large Language Models for Healthcare NLP

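[Quick read]: This paper addresses the heavy communication overhead and pronounced data heterogeneity that arise when training large language models (LLMs) with federated learning (FL). It proposes Layer-Skipping Federated Learning, in which only selected layers of a pre-trained LLM are fine-tuned while the remaining layers stay frozen. The method cuts communication cost by about 70% while keeping performance within 2% of centralized training. It is validated on clinical named entity recognition (NER) and classification tasks, where it handles non-IID clinical data distributions well and remains robust when combined with differential privacy.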

Link: https://arxiv.org/abs/2504.10536
Authors: Lihong Zhang, Yue Li
Affiliations: Harvard University; Purdue University
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:Federated learning (FL) enables collaborative model training across organizations without sharing raw data, addressing crucial privacy concerns in healthcare natural language processing (NLP). However, training large language models (LLMs) in federated settings faces significant challenges, including communication overhead and data heterogeneity. We propose Layer-Skipping Federated Learning, where only selected layers of a pre-trained LLM are fine-tuned across clients while others remain frozen. Applied to LLaMA 3.2-1B, our approach reduces communication costs by approximately 70% while maintaining performance within 2% of centralized training. We evaluate our method on clinical NER and classification tasks using i2b2 and MIMIC-III datasets. Our experiments demonstrate that Layer-Skipping FL outperforms competitive baselines, handles non-IID clinical data distributions effectively, and shows robustness when combined with differential privacy. This approach represents a practical solution for privacy-preserving collaborative learning in healthcare NLP.
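
The idea of exchanging only selected layers while the rest stay frozen can be sketched as a single aggregation round. This is a minimal sketch that assumes plain unweighted averaging over clients and represents each layer as a flat list of floats; the layer names, `federated_round`, and `communication_saving` are hypothetical, and the toy numbers are not the paper's results.

```python
def federated_round(global_model, client_updates, trainable):
    """One averaging round in which only the `trainable` layers are exchanged.

    global_model: dict mapping layer name -> list of floats.
    client_updates: list of dicts holding each client's tuned trainable layers.
    Unweighted averaging is an assumption; frozen layers are copied unchanged.
    """
    new_model = {name: list(weights) for name, weights in global_model.items()}
    for name in trainable:
        stacked = [update[name] for update in client_updates]
        new_model[name] = [sum(vals) / len(vals) for vals in zip(*stacked)]
    return new_model


def communication_saving(global_model, trainable):
    """Fraction of parameters that never leave the client."""
    total = sum(len(weights) for weights in global_model.values())
    sent = sum(len(global_model[name]) for name in trainable)
    return 1 - sent / total


global_model = {
    "embed": [0.0, 0.0],
    "block1": [1.0, 1.0],
    "block2": [2.0, 2.0],
    "head": [3.0, 3.0],
}
trainable = ["block2", "head"]
client_updates = [
    {"block2": [2.0, 4.0], "head": [3.0, 5.0]},
    {"block2": [4.0, 2.0], "head": [5.0, 3.0]},
]
updated = federated_round(global_model, client_updates, trainable)
print(updated["block2"], updated["head"])
print(communication_saving(global_model, trainable))
```

Because frozen layers are identical on every client, they never need to be transmitted, which is where the communication saving comes from.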

[NLP-57] Toward Super Agent System with Hybrid AI Routers

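[Quick read]: This paper addresses how to build an efficient, low-cost, and scalable super agent system that meets diverse user needs and can be deployed widely in practice. The key is an architecture that accurately understands user intent and then either routes the request to task agents with the right tools or generates agentic workflows automatically. To this end, the paper proposes a hybrid mode in which a router switches dynamically between local and cloud language models according to task complexity, balancing performance, cost, and privacy. It also presents a blueprint for an on-device super agent enhanced with cloud, and anticipates that advances in multi-modality models and edge hardware will let most computation run locally, with cloud collaboration only when needed, paving the way for super agents to be integrated seamlessly into everyday life.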

Link: https://arxiv.org/abs/2504.10519
Authors: Yuhang Yao, Haixin Wang, Yibo Chen, Jiawen Wang, Min Chang Jordan Ren, Bosheng Ding, Salman Avestimehr, Chaoyang He
Affiliations: TensorOpera, Inc.
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Comments:

Abstract:AI Agents powered by Large Language Models are transforming the world through enormous applications. A super agent has the potential to fulfill diverse user needs, such as summarization, coding, and research, by accurately understanding user intent and leveraging the appropriate tools to solve tasks. However, to make such an agent viable for real-world deployment and accessible at scale, significant optimizations are required to ensure high efficiency and low cost. This paper presents a design of the Super Agent System. Upon receiving a user prompt, the system first detects the intent of the user, then routes the request to specialized task agents with the necessary tools or automatically generates agentic workflows. In practice, most applications directly serve as AI assistants on edge devices such as phones and robots. As different language models vary in capability and cloud-based models often entail high computational costs, latency, and privacy concerns, we then explore the hybrid mode where the router dynamically selects between local and cloud models based on task complexity. Finally, we introduce the blueprint of an on-device super agent enhanced with cloud. With advances in multi-modality models and edge hardware, we envision that most computations can be handled locally, with cloud collaboration only as needed. Such architecture paves the way for super agents to be seamlessly integrated into everyday life in the near future.

[NLP-58] ColorBench: Can VLMs See and Understand the Colorful World? A Comprehensive Benchmark for Color Perception Reasoning and Robustness

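[Quick read]: This paper evaluates how well vision-language models (VLMs) understand color, covering color perception, reasoning, and robustness, and explores how they differ from humans in this respect. The key is ColorBench, a new benchmark with a carefully designed suite of diverse test scenarios grounded in real applications, which systematically measures how VLMs perceive colors, reason from color-based cues, and keep performance consistent under color transformations. The study exposes the limitations of current VLMs in color understanding and underlines the need to improve the color comprehension of multimodal AI.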

Link: https://arxiv.org/abs/2504.10514
Authors: Yijun Liang, Ming Li, Chenrui Fan, Ziyue Li, Dang Nguyen, Kwesi Cobbina, Shweta Bhardwaj, Jiuhai Chen, Fuxiao Liu, Tianyi Zhou
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 33 pages, including references and appendix. Code is available at this https URL

Abstract:Color plays an important role in human perception and usually provides critical clues in visual reasoning. However, it is unclear whether and how vision-language models (VLMs) can perceive, understand, and leverage color as humans do. This paper introduces ColorBench, an innovative benchmark meticulously crafted to assess the capabilities of VLMs in color understanding, including color perception, reasoning, and robustness. By curating a suite of diverse test scenarios, with grounding in real applications, ColorBench evaluates how these models perceive colors, infer meanings from color-based cues, and maintain consistent performance under varying color transformations. Through an extensive evaluation of 32 VLMs with varying language models and vision encoders, our paper reveals some undiscovered findings: (i) The scaling law (larger models are better) still holds on ColorBench, while the language model plays a more important role than the vision encoder. (ii) However, the performance gaps across models are relatively small, indicating that color understanding has been largely neglected by existing VLMs. (iii) CoT reasoning improves color understanding accuracies and robustness, though they are vision-centric tasks. (iv) Color clues are indeed leveraged by VLMs on ColorBench but they can also mislead models in some tasks. These findings highlight the critical limitations of current VLMs and underscore the need to enhance color comprehension. Our ColorBench can serve as a foundational tool for advancing the study of human-level color understanding of multimodal AI.

[NLP-59] JEPA4Rec: Learning Effective Language Representations for Sequential Recommendation via Joint Embedding Predictive Architecture

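[Quick read]: This paper addresses two weaknesses of sequential recommendation based on language representation learning: data sparsity and a limited understanding of common-sense user preferences. It proposes JEPA4Rec, a framework that combines Joint Embedding Predictive Architecture with language modeling of item textual descriptions. Items are represented as text sentences containing attributes such as title and category, and a bidirectional Transformer encoder captures semantically rich, transferable representations, which reduces reliance on large-scale pre-training data and improves recommendation performance. A masking strategy and a two-stage training strategy with self-supervised learning losses further improve generalization and language understanding.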

Link: https://arxiv.org/abs/2504.10512
Authors: Minh-Anh Nguyen, Dung D. Le
Affiliations: College of Engineering and Computer Science, VinUniversity (Vietnam)
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:Language representation learning has emerged as a promising approach for sequential recommendation, thanks to its ability to learn generalizable representations. However, despite its advantages, this approach still struggles with data sparsity and a limited understanding of common-sense user preferences. To address these limitations, we propose JEPA4Rec, a framework that combines Joint Embedding Predictive Architecture with language modeling of item textual descriptions. JEPA4Rec captures semantically rich and transferable representations, improving recommendation performance and reducing reliance on large-scale pre-training data. Specifically, JEPA4Rec represents items as text sentences by flattening descriptive information such as title, category, and other attributes. To encode these sentences, we employ a bidirectional Transformer encoder with modified embedding layers tailored for capturing item information in recommendation datasets. We apply masking to text sentences and use them to predict the representations of the unmasked sentences, helping the model learn generalizable item embeddings. To further improve recommendation performance and language understanding, we employ a two-stage training strategy incorporating self-supervised learning losses. Experiments on six real-world datasets demonstrate that JEPA4Rec consistently outperforms state-of-the-art methods, particularly in cross-domain, cross-platform, and low-resource scenarios.
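
The input preparation described above, flattening item attributes into a text sentence and masking part of it, can be sketched as follows. The `key: value` sentence format, whitespace tokenization, the `[MASK]` token, and the helper names are all assumptions made for illustration; the encoder and the representation-prediction objective are out of scope here.

```python
import random


def flatten_item(item):
    """Flatten an item's attributes into one sentence (the format is assumed)."""
    return " ".join(f"{key}: {value}" for key, value in item.items())


def mask_tokens(sentence, ratio, rng, mask_token="[MASK]"):
    """Replace a random `ratio` of whitespace-separated tokens with `mask_token`."""
    tokens = sentence.split()
    n_mask = int(len(tokens) * ratio)
    masked_positions = set(rng.sample(range(len(tokens)), n_mask))
    return " ".join(
        mask_token if i in masked_positions else tok
        for i, tok in enumerate(tokens)
    )


def make_training_pair(item, ratio, rng):
    """Return (masked context sentence, unmasked target sentence)."""
    full = flatten_item(item)
    return mask_tokens(full, ratio, rng), full


# Hypothetical catalogue item.
item = {"title": "Wireless Mouse", "category": "Electronics", "brand": "Acme"}
context, target = make_training_pair(item, ratio=0.5, rng=random.Random(0))
print(context)
print(target)
```

In a joint-embedding setup the masked sentence would be encoded and trained to predict the representation of the unmasked one, rather than to reconstruct the masked tokens themselves.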

[NLP-60] LayerFlow: Layer-wise Exploration of LLM Embeddings using Uncertainty-aware Interlinked Projections

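[Quick read]: This paper addresses the uncertainty that dimensionality reduction introduces when exploring word embeddings, which can affect the visual representation and users' interpretation of the data. The key contribution is LayerFlow, a visual analytics workspace with an interlinked projection design that communicates transformation, representation, and interpretation uncertainty through several visual components (convex hulls showing 2D and HD clusters, pairwise distances between data points, cluster summaries, and projection quality metrics), helping users recognize potential data distortions and sources of uncertainty.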

Link: https://arxiv.org/abs/2504.10504
Authors: Rita Sevastjanova, Robin Gerling, Thilo Spinner, Mennatallah El-Assady
Affiliations: ETH Zurich; University of Konstanz
Categories: Computation and Language (cs.CL); Graphics (cs.GR)
Comments:

Abstract:Large language models (LLMs) represent words through contextual word embeddings encoding different language properties like semantics and syntax. Understanding these properties is crucial, especially for researchers investigating language model capabilities, employing embeddings for tasks related to text similarity, or evaluating the reasons behind token importance as measured through attribution methods. Applications for embedding exploration frequently involve dimensionality reduction techniques, which reduce high-dimensional vectors to two dimensions used as coordinates in a scatterplot. This data transformation step introduces uncertainty that can be propagated to the visual representation and influence users’ interpretation of the data. To communicate such uncertainties, we present LayerFlow - a visual analytics workspace that displays embeddings in an interlinked projection design and communicates the transformation, representation, and interpretation uncertainty. In particular, to hint at potential data distortions and uncertainties, the workspace includes several visual components, such as convex hulls showing 2D and HD clusters, data point pairwise distances, cluster summaries, and projection quality metrics. We show the usability of the presented workspace through replication and expert case studies that highlight the need to communicate uncertainty through multiple visual components and different data perspectives.

[NLP-61] Exposure to Content Written by Large Language Models Can Reduce Stigma Around Opioid Use Disorder in Online Communities

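[Quick read]: This paper addresses stigma around opioid use disorder (OUD) in online communities, which hinders harm reduction efforts. The stigma is directed at clinically approved medications for addiction treatment (MAT), at people with the condition, and at the condition itself. The study examines whether large language models (LLMs) can reduce these stigmatized attitudes. The key is a series of pre-registered randomized controlled experiments measuring how LLM-generated content affects participants' attitudes. Whether they read responses once or repeatedly, participants reported the least stigmatized attitudes toward MAT after reading LLM-generated responses. This suggests that LLMs can serve as an education-based intervention to promote positive attitudes and increase acceptance of MAT.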

Link: https://arxiv.org/abs/2504.10501
Authors: Shravika Mittal, Darshi Shah, Shin Won Do, Mai ElSherief, Tanushree Mitra, Munmun De Choudhury
Affiliations: College of Computing, Georgia Institute of Technology; Khoury College of Computer Science, Northeastern University; Information School, University of Washington
Categories: Social and Information Networks (cs.SI); Computation and Language (cs.CL); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)
Comments:

Abstract:Widespread stigma, both in offline and online spaces, acts as a barrier to harm reduction efforts in the context of opioid use disorder (OUD). This stigma is prominently directed towards clinically approved medications for addiction treatment (MAT), people with the condition, and the condition itself. Given the potential of artificial intelligence based technologies in promoting health equity, and facilitating empathic conversations, this work examines whether large language models (LLMs) can help abate OUD-related stigma in online communities. To answer this, we conducted a series of pre-registered randomized controlled experiments, where participants read LLM-generated, human-written, or no responses to help-seeking OUD-related content in online communities. The experiment was conducted under two setups, i.e., participants read the responses either once (N = 2,141), or repeatedly for 14 days (N = 107). We found that participants reported the least stigmatized attitudes toward MAT after consuming LLM-generated responses under both setups. This study offers insights into strategies that can foster inclusive online discourse on OUD, e.g., based on our findings LLMs can be used as an education-based intervention to promote positive attitudes and increase people’s propensity toward MAT.

[NLP-62] Graph-based Approaches and Functionalities in Retrieval-Augmented Generation: A Comprehensive Survey

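[Quick read]: This paper concerns hallucination in large language models (LLMs), that is, factual errors during inference caused by insufficient training data and a lack of up-to-date knowledge. Retrieval-Augmented Generation (RAG) is a promising remedy that retrieves relevant information from external sources to generate more accurate answers. Because structured knowledge is pervasive in those sources, much RAG work applies graph techniques and uses topological information to support more complex reasoning. However, there is neither a unified review of the diverse roles of graphs in RAG nor a comprehensive resource to help researchers navigate and contribute to this fast-moving field. This survey offers a new perspective on the functionality of graphs within RAG and their effect on performance across a wide range of graph-structured data, with a detailed breakdown of their roles covering database construction, algorithms, pipelines, and tasks. It also identifies current challenges and proposes future research directions. The graph-centered analysis highlights commonalities and differences among existing methods and sets the stage for future research in graph learning, database systems, and natural language processing.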

Link: https://arxiv.org/abs/2504.10499
Authors: Zulun Zhu, Tiancheng Huang, Kai Wang, Junda Ye, Xinghe Chen, Siqiang Luo
Affiliations: Nanyang Technological University (Singapore); Beijing University of Posts and Telecommunications (China)
Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL)
Comments:

Abstract:Large language models (LLMs) struggle with factual errors during inference due to a lack of sufficient training data and up-to-date knowledge, leading to the hallucination problem. Retrieval-Augmented Generation (RAG) has gained attention as a promising solution to this limitation of LLMs, retrieving relevant information from external sources to generate more accurate answers to questions. Given the pervasive presence of structured knowledge in external sources, considerable strides in RAG have been made to employ graph-related techniques and achieve more complex reasoning based on the topological information between knowledge entities. However, there is currently neither a unified review examining the diverse roles of graphs in RAG nor a comprehensive resource to help researchers navigate and contribute to this evolving field. This survey offers a novel perspective on the functionality of graphs within RAG and their impact on enhancing performance across a wide range of graph-structured data. It provides a detailed breakdown of the roles that graphs play in RAG, covering database construction, algorithms, pipelines, and tasks. Finally, it identifies current challenges and outlines future research directions, aiming to inspire further developments in this field. Our graph-centered analysis highlights the commonalities and differences in existing methods, setting the stage for future researchers in areas such as graph learning, database systems, and natural language processing.

[NLP-63] ArxivBench: Can LLMs Assist Researchers in Conducting Research?

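【Quick read】: This paper addresses factual errors in the scientific responses generated by large language models (LLMs). Specifically, it asks whether LLMs can reliably return relevant research papers with accurate links in response to high-level prompts. To that end, it introduces arXivBench, a benchmark that evaluates LLMs across eight major subject categories on arXiv and five subfields of computer science. Using this standardized tool, the study finds that accuracy varies markedly by subject, that Claude-3.5-Sonnet has a substantial advantage in generating relevant and accurate responses, and that most LLMs are considerably more accurate in the Artificial Intelligence subfield than in the other subfields. The results offer a reference point for making LLMs more dependable in academic and research settings.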

Link: https://arxiv.org/abs/2504.10496
Authors: Ning Li, Jingran Zhang, Justin Cui
Affiliations: University of California, Los Angeles
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Abstract:Large language models (LLMs) have demonstrated remarkable effectiveness in completing various tasks such as reasoning, translation, and question answering. However, the issue of factually incorrect content in LLM-generated responses remains a persistent challenge. In this study, we evaluate both proprietary and open-source LLMs on their ability to respond with relevant research papers and accurate links to articles hosted on the arXiv platform, based on high-level prompts. To facilitate this evaluation, we introduce arXivBench, a benchmark specifically designed to assess LLM performance across eight major subject categories on arXiv and five subfields within computer science, one of the most popular categories among them. Our findings reveal concerning differences in the accuracy of LLM-generated responses across subjects, with some subjects experiencing significantly lower accuracy than others. Notably, Claude-3.5-Sonnet exhibits a substantial advantage in generating both relevant and accurate responses. Interestingly, most LLMs achieve much higher accuracy in the Artificial Intelligence subfield than in other subfields. This benchmark provides a standardized tool for evaluating the reliability of LLM-generated scientific responses, promoting more dependable use of LLMs in academic and research environments. Our code is open-sourced at this https URL and our dataset is available on Hugging Face at this https URL.

[NLP-64] GPT Meets Graphs and KAN Splines: Testing Novel Frameworks on Multitask Fine-Tuned GPT-2 with LoRA

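【Quick read】: This paper explores whether integrating learnable, interpretable modules, namely Kolmogorov-Arnold Networks (KAN) and graph-based representations, into a pre-trained GPT-2 model improves multi-task learning accuracy. The authors first strengthen a standard self-attention transformer with Low-Rank Adaptation (LoRA), hyperparameter tuning, and L2 regularization. They then develop two variants, Graph LoRA and Hybrid-KAN LoRA, intended to improve on standard KAN and graph attention (GAT) in interpretability and representational richness. Neither variant outperforms the optimized LoRA-enhanced transformer, which reaches 55.249% accuracy on the SST test set, 99.18% on the CFIMDB dev set, 89.9% test accuracy on paraphrase detection, and a CHRF score of 42.097 on sonnet generation. Efficient parameter adaptation via LoRA therefore remains the most effective strategy for the three tasks: sentiment analysis, paraphrase detection, and sonnet generation.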

Link: https://arxiv.org/abs/2504.10490
Authors: Gabriel Bo, Marc Bernardino, Justin Gu
Affiliations: Stanford University
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: 10 pages, 11 figures. This submission cites arXiv:2404.19756. Supplementary materials and additional information are available at arXiv:2404.19756

Abstract:We explore the potential of integrating learnable and interpretable modules–specifically Kolmogorov-Arnold Networks (KAN) and graph-based representations–within a pre-trained GPT-2 model to enhance multi-task learning accuracy. Motivated by the recent surge in using KAN and graph attention (GAT) architectures in chain-of-thought (CoT) models and debates over their benefits compared to simpler architectures like MLPs, we begin by enhancing a standard self-attention transformer using Low-Rank Adaptation (LoRA), fine-tuning hyperparameters, and incorporating L2 regularization. This approach yields significant improvements. To further boost interpretability and richer representations, we develop two variants that attempt to improve the standard KAN and GAT: Graph LoRA and Hybrid-KAN LoRA (Learnable GPT). However, systematic evaluations reveal that neither variant outperforms the optimized LoRA-enhanced transformer, which achieves 55.249% accuracy on the SST test set, 99.18% on the CFIMDB dev set, and 89.9% paraphrase detection test accuracy. On sonnet generation, we get a CHRF score of 42.097. These findings highlight that efficient parameter adaptation via LoRA remains the most effective strategy for our tasks: sentiment analysis, paraphrase detection, and sonnet generation.
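
The LoRA step mentioned in the abstract can be illustrated with a minimal, pure-Python sketch. It assumes the usual LoRA formulation, in which a frozen weight matrix W is augmented by a scaled low-rank product (alpha / r) * B @ A and only A and B are trained. The function names, matrix sizes, and the value of `alpha` are hypothetical and are not taken from the paper.

```python
def transpose(m):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*m)]


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    cols = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def lora_forward(x, w, a, b, alpha):
    """Apply a frozen linear layer plus a low-rank LoRA update.

    x: batch x d_in inputs
    w: d_out x d_in frozen pre-trained weights
    a: r x d_in trainable down-projection
    b: d_out x r trainable up-projection (commonly initialised to zeros)
    alpha: scaling hyperparameter (hypothetical value chosen by the caller)
    """
    r = len(a)
    scale = alpha / r
    base = matmul(x, transpose(w))                            # frozen path
    low_rank = matmul(matmul(x, transpose(a)), transpose(b))  # adapter path
    return [[p + scale * q for p, q in zip(base_row, lr_row)]
            for base_row, lr_row in zip(base, low_rank)]


def trainable_parameters(d_in, d_out, r):
    """Number of adapter parameters, versus d_in * d_out for full fine-tuning."""
    return r * (d_in + d_out)
```

With B initialised to zeros the adapted layer reproduces the frozen layer exactly, so fine-tuning starts from the pre-trained behaviour while training far fewer parameters than the full weight matrix.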

[NLP-65] Evaluating Semantic Variation in Text-to-Image Synthesis: A Causal Perspective ICLR2025

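【Quick read】: This paper addresses the difficulty of capturing and evaluating semantic variations in text-to-image (T2I) synthesis. Current models handle semantic differences caused by word-order changes poorly, and existing evaluations rely on indirect metrics such as text-image similarity, which cannot reliably assess these failures and tend to hide weak performance on complex or uncommon linguistic patterns. The paper proposes a new metric, SemVarEffect, and a benchmark, SemVarBench, that evaluate the causal relationship between semantic variations in the input and in the output. Semantic variations are produced through two types of linguistic permutation while avoiding easily predictable literal changes. The study also shows that cross-modal alignment in the UNet or Transformer plays a crucial role in handling semantic variations, a factor that earlier work overlooked.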

Link: https://arxiv.org/abs/2410.10291
Authors: Xiangru Zhu, Penglei Sun, Yaoxian Song, Yanghua Xiao, Zhixu Li, Chengyu Wang, Jun Huang, Bei Yang, Xiaoxiao Xu
Affiliations: Shanghai Key Laboratory of Data Science, School of Computer Science, Fudan University; Hong Kong University of Science and Technology; Zhejiang University; School of Information, Renmin University of China; School of Smart Governance, Renmin University of China; Alibaba Group
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Multimedia (cs.MM)
Comments: Accepted by ICLR 2025

Abstract:Accurate interpretation and visualization of human instructions are crucial for text-to-image (T2I) synthesis. However, current models struggle to capture semantic variations from word order changes, and existing evaluations, relying on indirect metrics like text-image similarity, fail to reliably assess these challenges. Their focus on frequent word combinations often obscures poor performance on complex or uncommon linguistic patterns. To address these deficiencies, we propose a novel metric called SemVarEffect and a benchmark named SemVarBench, designed to evaluate the causality between semantic variations in inputs and outputs in T2I synthesis. Semantic variations are achieved through two types of linguistic permutations, while avoiding easily predictable literal variations. Experiments reveal that CogView-3-Plus and Ideogram 2 performed the best, achieving a score of 0.2/1. Semantic variations in object relations are less understood than attributes, scoring 0.07/1 compared to 0.17-0.19/1. We found that cross-modal alignment in UNet or Transformers plays a crucial role in handling semantic variations, a factor previously overlooked by a focus on textual encoders. Our work establishes an effective evaluation framework that advances the T2I synthesis community’s exploration of human instruction understanding. Our benchmark and code are available at this https URL.

[NLP-66] Network Alignment

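【Quick read】: This review addresses network alignment: identifying entities that exist in several complex networks at once and recovering their correspondences across those networks. The main difficulty is that complex networks differ in structure, characteristics, and properties from one field to another, so research tends to stay isolated within each domain and even terminology and concepts are not uniform. The review analyzes in detail the implementation principles, processes, and performance differences of several families of alignment methods, including methods based on structure consistency, network embedding, and graph neural networks (GNNs), and presents alignment strategies for attributed, heterogeneous, directed, and dynamic networks. This consolidated analysis advances the understanding of complex system structure and behaviour, supports the validation and extension of theoretical physics research, and underpins practical applications across fields.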

Link: https://arxiv.org/abs/2504.11367
Authors: Rui Tang, Ziyun Yong, Shuyu Jiang, Xingshu Chen, Yaofang Liu, Yi-Cheng Zhang, Gui-Quan Sun, Wei Wang
Affiliations: School of Cyber Science and Engineering, Sichuan University; Cyber Science Research Institute, Sichuan University; Key Laboratory of Data Protection and Intelligent Management, Ministry of Education, Sichuan University; Department of Reproductive Technology, The Affiliated Hospital of Southwest Medical University; Physics Department, University of Fribourg; Sino-Europe Complex Science Center, School of Mathematics, North University of China; Complex Systems Research Center, Shanxi University; School of Public Health, Chongqing Medical University
Categories: Physics and Society (physics.soc-ph); Computation and Language (cs.CL)
Comments:

Abstract:Complex networks are frequently employed to model physical or virtual complex systems. When certain entities exist across multiple systems simultaneously, unveiling their corresponding relationships across the networks becomes crucial. This problem, known as network alignment, holds significant importance. It enhances our understanding of complex system structures and behaviours, facilitates the validation and extension of theoretical physics research about studying complex systems, and fosters diverse practical applications across various fields. However, due to variations in the structure, characteristics, and properties of complex networks across different fields, the study of network alignment is often isolated within each domain, with even the terminologies and concepts lacking uniformity. This review comprehensively summarizes the latest advancements in network alignment research, focusing on analyzing network alignment characteristics and progress in various domains such as social network analysis, bioinformatics, computational linguistics and privacy protection. It provides a detailed analysis of various methods’ implementation principles, processes, and performance differences, including structure consistency-based methods, network embedding-based methods, and graph neural network-based (GNN-based) methods. Additionally, the methods for network alignment under different conditions, such as in attributed networks, heterogeneous networks, directed networks, and dynamic networks, are presented. Furthermore, the challenges and the open issues for future studies are also discussed.

Computer Vision

[CV-0] Aligning Generative Denoising with Discriminative Objectives Unleashes Diffusion for Visual Perception ICLR2025

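【Quick read】: This paper addresses the gaps that appear when generative diffusion models are applied directly to discriminative tasks. Diffusion models succeed at image generation, but the generative denoising process is mismatched with discriminative objectives: a generative model can tolerate intermediate sampling errors as long as the final distribution is plausible, whereas a discriminative task requires rigorous accuracy throughout. Analyzing the alignment between the diffusion process and perception tasks, the authors find that early denoising steps contribute disproportionately to perception quality, that later steps can cause unexpected perception degradation, and that the interactivity of generative processes can be used to adapt to correctional prompts over multiple rounds. Their solutions are learning objectives that reflect the varying contribution of each timestep, a diffusion-tailored data augmentation that counters the training-denoising distribution shift, and use of the interactive nature of the generative process. These changes significantly improve diffusion-based perception models without architectural modifications, reaching state-of-the-art (SOTA) performance on depth estimation, referring image segmentation, and generalist perception tasks.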

Link: https://arxiv.org/abs/2504.11457
Authors: Ziqi Pang, Xin Xu, Yu-Xiong Wang
Affiliations: University of Illinois Urbana-Champaign
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: ICLR 2025

Abstract:With the success of image generation, generative diffusion models are increasingly adopted for discriminative tasks, as pixel generation provides a unified perception interface. However, directly repurposing the generative denoising process for discriminative objectives reveals critical gaps rarely addressed previously. Generative models tolerate intermediate sampling errors if the final distribution remains plausible, but discriminative tasks require rigorous accuracy throughout, as evidenced in challenging multi-modal tasks like referring image segmentation. Motivated by this gap, we analyze and enhance alignment between generative diffusion processes and perception tasks, focusing on how perception quality evolves during denoising. We find: (1) earlier denoising steps contribute disproportionately to perception quality, prompting us to propose tailored learning objectives reflecting varying timestep contributions; (2) later denoising steps show unexpected perception degradation, highlighting sensitivity to training-denoising distribution shifts, addressed by our diffusion-tailored data augmentation; and (3) generative processes uniquely enable interactivity, serving as controllable user interfaces adaptable to correctional prompts in multi-round interactions. Our insights significantly improve diffusion-based perception models without architectural changes, achieving state-of-the-art performance on depth estimation, referring image segmentation, and generalist perception tasks. Code available at this https URL.

[CV-1] SimpleAR: Pushing the Frontier of Autoregressive Visual Generation through Pretraining, SFT, and RL

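【Quick read】: This paper explores a simple yet effective autoregressive visual generation approach that avoids complex architectural modifications and the compute and parameter overhead they bring. The proposed SimpleAR framework, through careful optimization of training and inference, generates high-fidelity 1024x1024 images with only 0.5B parameters and achieves competitive results on text-to-image benchmarks. Its key ingredients are supervised fine-tuning (SFT) and Group Relative Policy Optimization (GRPO), which significantly improve image aesthetics and prompt alignment, combined with inference acceleration techniques such as vLLM that cut generation time to about 14 seconds.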

Link: https://arxiv.org/abs/2504.11455
Authors: Junke Wang, Zhi Tian, Xun Wang, Xinyu Zhang, Weilin Huang, Zuxuan Wu, Yu-Gang Jiang
Affiliations: Fudan University; ByteDance Seed
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: technical report, work in progress

Abstract:This work presents SimpleAR, a vanilla autoregressive visual generation framework without complex architecture modifications. Through careful exploration of training and inference optimization, we demonstrate that: 1) with only 0.5B parameters, our model can generate 1024x1024 resolution images with high fidelity, and achieve competitive results on challenging text-to-image benchmarks, e.g., 0.59 on GenEval and 79.66 on DPG; 2) both supervised fine-tuning (SFT) and Group Relative Policy Optimization (GRPO) training could lead to significant improvements on generation aesthetics and prompt alignment; and 3) when optimized with inference acceleration techniques like vLLM, the time for SimpleAR to generate a 1024x1024 image could be reduced to around 14 seconds. By sharing these findings and open-sourcing the code, we hope to reveal the potential of autoregressive visual generation and encourage more participation in this research field. Code is available at this https URL.
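
The core idea behind GRPO, as it is generally described, is to score each sampled output relative to the other outputs generated for the same prompt, which removes the need for a separate value network. The sketch below shows only that group-relative normalisation. The reward values are hypothetical, the abstract does not say which reward model SimpleAR uses, and the use of the population standard deviation is an assumption.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalise rewards within one group of samples drawn for the same prompt.

    Each advantage is (reward - group mean) / (group std + eps), so samples
    that beat their siblings get a positive advantage and the rest a negative one.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical rewards for four images generated from one text prompt.
example_advantages = group_relative_advantages([0.2, 0.8, 0.5, 0.5])
```

The resulting advantages would then weight the policy-gradient update for the tokens of each sample; that update step is omitted here.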

[CV-2] PARTFIELD: Learning 3D Feature Fields for Part Segmentation and Beyond

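【Quick read】: This paper addresses learning general part-based features and their hierarchy from open-world 3D shapes without relying on predefined templates or text labels. Unlike earlier methods that depend on such priors, the proposed PartField captures the concept of parts and their hierarchical relationships directly from 3D data in various modalities. The model is trained with a contrastive learning formulation that distills 2D and 3D part proposals from a mix of labeled datasets and image segmentations on large unsupervised datasets, and it needs only a single feedforward pass at inference time to produce a continuous feature field. Clustering this field yields a hierarchical part decomposition. PartField is up to 20% more accurate than comparable methods and often orders of magnitude faster. The learned field is also consistent across shapes, which enables tasks such as co-segmentation and correspondence.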

Link: https://arxiv.org/abs/2504.11451
Authors: Minghua Liu, Mikaela Angelina Uy, Donglai Xiang, Hao Su, Sanja Fidler, Nicholas Sharp, Jun Gao
Affiliations: NVIDIA; University of Toronto; Vector Institute; UCSD
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: this https URL

Abstract:We propose PartField, a feedforward approach for learning part-based 3D features, which captures the general concept of parts and their hierarchy without relying on predefined templates or text-based names, and can be applied to open-world 3D shapes across various modalities. PartField requires only a 3D feedforward pass at inference time, significantly improving runtime and robustness compared to prior approaches. Our model is trained by distilling 2D and 3D part proposals from a mix of labeled datasets and image segmentations on large unsupervised datasets, via a contrastive learning formulation. It produces a continuous feature field which can be clustered to yield a hierarchical part decomposition. Comparisons show that PartField is up to 20% more accurate and often orders of magnitude faster than other recent class-agnostic part-segmentation methods. Beyond single-shape part decomposition, consistency in the learned field emerges across shapes, enabling tasks such as co-segmentation and correspondence, which we demonstrate in several applications of these general-purpose, hierarchical, and consistent 3D feature fields. Check our Webpage! this https URL
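
The abstract states that the feature field can be clustered to yield a hierarchical part decomposition, but it does not name the clustering algorithm. As a rough illustration only, the sketch below applies generic centroid-linkage agglomerative clustering to a handful of per-point feature vectors and records the partition at every level of the hierarchy. The function name and the toy features are hypothetical.

```python
import math


def agglomerative_levels(features):
    """Merge the two closest clusters repeatedly and record every partition.

    Returns a list of partitions: the first has one cluster per point and the
    last has a single cluster containing all points.
    """
    dim = len(features[0])

    def centroid(cluster):
        return [sum(features[i][d] for i in cluster) / len(cluster) for d in range(dim)]

    clusters = [[i] for i in range(len(features))]
    levels = [[list(c) for c in clusters]]
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                dist = math.dist(centroid(clusters[i]), centroid(clusters[j]))
                if best is None or dist < best[0]:
                    best = (dist, i, j)
        _, i, j = best
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
        levels.append([sorted(c) for c in clusters])
    return levels
```

Reading the returned list from fine to coarse gives a part hierarchy: nearby features merge first, and larger parts form at later levels. This brute-force version is only practical for small inputs.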

[CV-3] Diffusion Distillation With Direct Preference Optimization For Efficient 3D LiDAR Scene Completion

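【Quick read】: This paper addresses the slow sampling speed that limits diffusion models in 3D LiDAR scene completion. Score distillation accelerates sampling but degrades performance, while post-training with direct preference optimization (DPO) on preference data improves performance. The paper proposes Distillation-DPO, a diffusion distillation framework that combines score distillation with preference alignment. First, the student model generates paired completion scenes from different initial noises. Second, LiDAR scene evaluation metrics serve as the preference signal for building winning and losing sample pairs. Third, the student model is optimized by exploiting the difference between the teacher and student score functions on the paired scenes, and the procedure repeats until convergence. Compared with state-of-the-art diffusion-based LiDAR scene completion models, Distillation-DPO produces higher-quality completions while running more than 5 times faster. According to the authors, it is also the first work to apply preference learning to distillation, and it offers insights into preference-aligned distillation.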

Link: https://arxiv.org/abs/2504.11447
Authors: An Zhao, Shengyuan Zhang, Ling Yang, Zejian Li, Jiale Wu, Haoran Xu, AnYang Wei, Perry Pengyun GU, Lingyun Sun
Affiliations: Zhejiang University; Peking University; Zhejiang Green Zhixing Technology Co., Ltd
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Our code is publicly available at this https URL

Abstract:The application of diffusion models in 3D LiDAR scene completion is limited due to diffusion’s slow sampling speed. Score distillation accelerates diffusion sampling but with performance degradation, while post-training with direct preference optimization (DPO) boosts performance using preference data. This paper proposes Distillation-DPO, a novel diffusion distillation framework for LiDAR scene completion with preference alignment. First, the student model generates paired completion scenes with different initial noises. Second, using LiDAR scene evaluation metrics as preference, we construct winning and losing sample pairs. Such construction is reasonable, since most LiDAR scene metrics are informative but non-differentiable and so cannot be optimized directly. Third, Distillation-DPO optimizes the student model by exploiting the difference in score functions between the teacher and student models on the paired completion scenes. This procedure is repeated until convergence. Extensive experiments demonstrate that, compared to state-of-the-art LiDAR scene completion diffusion models, Distillation-DPO achieves higher-quality scene completion while accelerating the completion speed by more than 5-fold. To the best of our knowledge, our method is the first to explore adopting preference learning in distillation, and it provides insights into preference-aligned distillation. Our code is publicly available at this https URL.
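
The second step, turning a non-differentiable scene metric into winning and losing samples, can be sketched as follows. The abstract does not say which metrics are used, so the symmetric Chamfer distance here is an assumed stand-in, and the function names and toy point sets are hypothetical. The later optimisation step that compares teacher and student score functions is not shown.

```python
import math


def chamfer_distance(pred, target):
    """Symmetric Chamfer distance between two point sets (lower is better)."""
    def one_way(src, dst):
        return sum(min(math.dist(p, q) for q in dst) for p in src) / len(src)
    return one_way(pred, target) + one_way(target, pred)


def build_preference_pair(completion_a, completion_b, ground_truth,
                          metric=chamfer_distance):
    """Return (winner, loser) for two completions of the same scene.

    The metric is only used for ranking, so it never needs a gradient.
    """
    score_a = metric(completion_a, ground_truth)
    score_b = metric(completion_b, ground_truth)
    if score_a <= score_b:
        return completion_a, completion_b
    return completion_b, completion_a
```

In a training loop, the two completions would come from the student model run with different initial noises on the same input scan.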

[CV-4] Mamba-Based Ensemble learning for White Blood Cell Classification

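【Quick read】: This paper addresses white blood cell (WBC) classification, where manual classification is labor-intensive and prone to inconsistency and where existing deep learning methods face practical limits such as data imbalance and the high computational cost of Transformer-based models, which scale poorly with input size. The proposed framework combines Mamba models with ensemble learning to improve WBC classification. Because Mamba models have linear complexity, they scale better than Transformer-based approaches and suit resource-constrained environments. The study also introduces a new WBC dataset, Chula-WBC-8, for benchmarking, and shows that Mamba models can significantly improve efficiency without compromising classification accuracy.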

Link: https://arxiv.org/abs/2504.11438
Authors: Lewis Clifton, Xin Tian, Duangdao Palasuwan, Phandee Watanaboonyongcharoen, Ponlapat Rojnuckarin, Nantheera Anantrasirichai
Affiliations: University of Bristol; University of Oxford; Oxidation in Red Cell Disorders Research Unit, Faculty of Allied Health Sciences, Chulalongkorn University; Department of Laboratory Medicine, Faculty of Medicine, Chulalongkorn University; Excellence Center in Translational Hematology, Faculty of Medicine, Chulalongkorn University; Visual Information Laboratory, University of Bristol
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:White blood cell (WBC) classification assists in assessing immune health and diagnosing various diseases, yet manual classification is labor-intensive and prone to inconsistencies. Recent advancements in deep learning have shown promise over traditional methods; however, challenges such as data imbalance and the computational demands of modern technologies, such as Transformer-based models which do not scale well with input size, limit their practical application. This paper introduces a novel framework that leverages Mamba models integrated with ensemble learning to improve WBC classification. Mamba models, known for their linear complexity, provide a scalable alternative to Transformer-based approaches, making them suitable for deployment in resource-constrained environments. Additionally, we introduce a new WBC dataset, Chula-WBC-8, for benchmarking. Our approach not only validates the effectiveness of Mamba models in this domain but also demonstrates their potential to significantly enhance classification efficiency without compromising accuracy. The source code can be found at this https URL.

[CV-5] Enhancing Out-of-Distribution Detection with Extended Logit Normalization

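【Quick read】: This paper addresses out-of-distribution (OOD) detection, which is essential for deploying machine learning models safely. Existing work improves OOD detection through better classification losses or representation learning strategies, but these methods are usually tailored to specific post-hoc detection techniques and generalize poorly. The paper identifies a critical issue in Logit Normalization (LogitNorm) that limits its effectiveness with certain post-hoc OOD detection methods. To fix it, the authors propose Extended Logit Normalization (ELogitNorm), a hyperparameter-free formulation that benefits a wide range of post-hoc detection methods. By adding feature distance-awareness to LogitNorm, ELogitNorm gives more robust OOD separability and better in-distribution (ID) confidence calibration. Extensive benchmark experiments show that it outperforms state-of-the-art training-time methods in OOD detection while maintaining strong ID classification accuracy.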

Link: https://arxiv.org/abs/2504.11434
Authors: Yifan Ding, Xixi Liu, Jonas Unger, Gabriel Eilertsen
Affiliations: Linköping University; Chalmers University of Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Out-of-distribution (OOD) detection is essential for the safe deployment of machine learning models. Recent advances have explored improved classification losses and representation learning strategies to enhance OOD detection. However, these methods are often tailored to specific post-hoc detection techniques, limiting their generalizability. In this work, we identify a critical issue in Logit Normalization (LogitNorm), which inhibits its effectiveness in improving certain post-hoc OOD detection methods. To address this, we propose Extended Logit Normalization (ELogitNorm), a novel hyperparameter-free formulation that significantly benefits a wide range of post-hoc detection methods. By incorporating feature distance-awareness to LogitNorm, ELogitNorm shows more robust OOD separability and in-distribution (ID) confidence calibration than its predecessor. Extensive experiments across standard benchmarks demonstrate that our approach outperforms state-of-the-art training-time methods in OOD detection while maintaining strong ID classification accuracy.
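
For context, the baseline that ELogitNorm extends can be sketched in a few lines. This is the original LogitNorm loss as it is commonly described: cross-entropy computed on logits divided by their L2 norm and a temperature. It is not the ELogitNorm formulation, which the abstract does not spell out. The temperature value and the function name are assumptions of this sketch.

```python
import math


def logitnorm_loss(logits, label, temperature=0.04, eps=1e-7):
    """Cross-entropy on L2-normalised logits (baseline LogitNorm, not ELogitNorm).

    Dividing by the logit norm removes the logit magnitude from the loss, so
    the network cannot lower the loss simply by scaling its outputs up.
    """
    norm = math.sqrt(sum(z * z for z in logits)) + eps
    scaled = [z / (norm * temperature) for z in logits]
    peak = max(scaled)  # subtract the maximum for a numerically stable log-sum-exp
    log_sum_exp = peak + math.log(sum(math.exp(s - peak) for s in scaled))
    return log_sum_exp - scaled[label]
```

Scaling every logit by the same positive factor leaves this loss essentially unchanged, which is the magnitude invariance that distinguishes it from plain cross-entropy.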

[CV-6] NormalCrafter: Learning Temporally Consistent Normals from Video Diffusion Priors

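【Quick read】: This paper addresses the difficulty of keeping surface normal estimation temporally consistent in video. Most existing methods target static images, and ensuring temporal coherence across a video sequence remains challenging. The proposed NormalCrafter leverages the inherent temporal priors of video diffusion models. Its key component is Semantic Feature Regularization (SFR), which aligns diffusion features with semantic cues so that the model concentrates on the intrinsic semantics of the scene. The paper also introduces a two-stage training protocol that learns in both latent space and pixel space, preserving spatial accuracy while maintaining long temporal context. Together these produce temporally consistent normal sequences with intricate details across diverse videos.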

Link: https://arxiv.org/abs/2504.11427
Authors: Yanrui Bin, Wenbo Hu, Haoyuan Wang, Xinya Chen, Bing Wang
Affiliations: Spatial Intelligence Group, The Hong Kong Polytechnic University; ARC Lab, Tencent PCG; City University of Hong Kong; Huazhong University of Science and Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 9 pages, 6 figures, Project Page: this https URL

Abstract:Surface normal estimation serves as a cornerstone for a spectrum of computer vision applications. While numerous efforts have been devoted to static image scenarios, ensuring temporal coherence in video-based normal estimation remains a formidable challenge. Instead of merely augmenting existing methods with temporal components, we present NormalCrafter to leverage the inherent temporal priors of video diffusion models. To secure high-fidelity normal estimation across sequences, we propose Semantic Feature Regularization (SFR), which aligns diffusion features with semantic cues, encouraging the model to concentrate on the intrinsic semantics of the scene. Moreover, we introduce a two-stage training protocol that leverages both latent and pixel space learning to preserve spatial accuracy while maintaining long temporal context. Extensive evaluations demonstrate the efficacy of our method, showcasing a superior performance in generating temporally consistent normal sequences with intricate details from diverse videos.

[CV-7] ADT: Tuning Diffusion Models with Adversarial Supervision

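【Quick read】: This paper addresses the prediction bias and cumulative error that arise in diffusion models because training and inference follow different distributions. The proposed framework, Adversarial Diffusion Tuning (ADT), closes this gap by simulating the inference process during optimization and aligning the final outputs with the training data through adversarial supervision. ADT uses a siamese-network discriminator with a fixed pre-trained backbone and lightweight trainable parameters, adopts an image-to-image sampling strategy to ease discrimination difficulty, and keeps the original diffusion loss to prevent discriminator hacking. It also carefully constrains the path along which gradients are back-propagated through the inference trajectory, so that training remains stable without memory overload or gradient explosion. Experiments show that ADT significantly improves both image quality and distribution alignment.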

Link: https://arxiv.org/abs/2504.11423
Authors: Dazhong Shen, Guanglu Song, Yi Zhang, Bingqi Ma, Lujundong Li, Dongzhi Jiang, Zhuofan Zong, Yu Liu
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Diffusion models have achieved outstanding image generation by reversing a forward noising process to approximate true data distributions. During training, these models predict diffusion scores from noised versions of true samples in a single forward pass, while inference requires iterative denoising starting from white noise. This training-inference divergence hinders the alignment between inference and training data distributions, due to potential prediction biases and cumulative errors. To address this problem, we propose an intuitive but effective fine-tuning framework, called Adversarial Diffusion Tuning (ADT), which simulates the inference process during optimization and aligns the final outputs with training data by adversarial supervision. Specifically, to achieve robust adversarial training, ADT features a siamese-network discriminator with a fixed pre-trained backbone and lightweight trainable parameters, incorporates an image-to-image sampling strategy to smooth discriminative difficulties, and preserves the original diffusion loss to prevent discriminator hacking. In addition, we carefully constrain the backward-flowing path for back-propagating gradients along the inference path without incurring memory overload or gradient explosion. Finally, extensive experiments on Stable Diffusion models (v1.5, XL, and v3) demonstrate that ADT significantly improves both distribution alignment and image quality.

[CV-8] Leveraging Point Transformers for Detecting Anatomical Landmarks in Digital Dentistry MICCAI2024

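【Quick read】: This paper addresses the automatic detection of key dental landmarks in intraoral scan point clouds, such as cusps, mesial-distal locations, facial axis points, and tooth-gingiva boundaries. The challenges include limited dataset sizes, significant anatomical variability among subjects, and the geometric nature of the data. The proposed method builds on Transformer-based point cloud learning: a module inspired by Point Transformer v3 captures meaningful geometric and anatomical features, a lightweight decoder predicts a distance value for each point, and graph-based non-minima suppression post-processes these predictions to produce the detected landmarks.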

Link: https://arxiv.org/abs/2504.11418
Authors: Tibor Kubík, Oldřich Kodym, Petr Šilling, Kateřina Trávníčková, Tomáš Mojžiš, Jan Matula
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages + references, 3 figures, MICCAI2024 3DTeethland Challenge submission

Abstract:The increasing availability of intraoral scanning devices has heightened their importance in modern clinical orthodontics. Clinicians utilize advanced Computer-Aided Design techniques to create patient-specific treatment plans that include laboriously identifying crucial landmarks such as cusps, mesial-distal locations, facial axis points, and tooth-gingiva boundaries. Detecting such landmarks automatically presents challenges, including limited dataset sizes, significant anatomical variability among subjects, and the geometric nature of the data. We present our experiments from the 3DTeethLand Grand Challenge at MICCAI 2024. Our method leverages recent advancements in point cloud learning through transformer architectures. We designed a Point Transformer v3 inspired module to capture meaningful geometric and anatomical features, which are processed by a lightweight decoder to predict per-point distances, further processed by graph-based non-minima suppression. We report promising results and discuss insights on learned feature interpretability.
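
The post-processing step, graph-based non-minima suppression over per-point distance predictions, can be illustrated as below. This is a generic sketch under stated assumptions rather than the authors' implementation: the neighbour graph, the distance threshold, and the index-based tie-breaking are all hypothetical choices.

```python
def graph_non_minima_suppression(distances, neighbors, max_distance):
    """Keep points whose predicted distance is a local minimum on the graph.

    distances: predicted distance to the nearest landmark, one value per point
    neighbors: adjacency list, neighbors[i] holds the indices adjacent to point i
    max_distance: candidates with a larger predicted distance are discarded
    """
    keep = []
    for i, d in enumerate(distances):
        if d > max_distance:
            continue
        is_minimum = all(
            d < distances[j] or (d == distances[j] and i < j)  # ties go to the lower index
            for j in neighbors[i]
        )
        if is_minimum:
            keep.append(i)
    return keep
```

On a mesh or point cloud, `neighbors` could come from mesh edges or a k-nearest-neighbour search, and each surviving index would be reported as one landmark candidate.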

[CV-9] Deep Learning-based Bathymetry Retrieval without In-situ Depths using Remote Sensing Imagery and SfM-MVS DSMs with Data Gaps

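【Quick read】: This paper addresses the need for accurate, detailed, and frequent bathymetry in shallow seabed areas under climatic and anthropogenic pressure. Existing approaches have clear limits: Spectrally Derived Bathymetry (SDB) often requires extensive manual fieldwork or costly reference data, while SfM-MVS still suffers from depth data gaps and noise even after refraction correction. The proposed methodology combines the high-fidelity 3D reconstruction of SfM-MVS and state-of-the-art refraction correction with the spectral analysis of a new deep learning-based bathymetry method. SfM-MVS-derived DSMs with data gaps serve as training data, and the proposed Swin-BathyUNet fills the gaps to produce complete bathymetric maps. Swin-BathyUNet combines the U-Net architecture with Swin Transformer self-attention layers and a cross-attention mechanism, capturing long-range spatial relationships to improve accuracy, and it can also work as a standalone SDB solution independent of the SfM-MVS output. Experiments at test sites in the Mediterranean and Baltic Seas show improvements in bathymetric accuracy, detail, coverage, and noise reduction in the predicted DSM.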

Link: https://arxiv.org/abs/2504.11416
Authors: Panagiotis Agrafiotis, Begüm Demir
Affiliations: Technical University of Berlin
Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments: Accepted for publication in ISPRS Journal of Photogrammetry and Remote Sensing

Abstract:Accurate, detailed, and high-frequency bathymetry is crucial for shallow seabed areas facing intense climatological and anthropogenic pressures. Current methods utilizing airborne or satellite optical imagery to derive bathymetry primarily rely on either SfM-MVS with refraction correction or Spectrally Derived Bathymetry (SDB). However, SDB methods often require extensive manual fieldwork or costly reference data, while SfM-MVS approaches face challenges even after refraction correction. These include depth data gaps and noise in environments with homogeneous visual textures, which hinder the creation of accurate and complete Digital Surface Models (DSMs) of the seabed. To address these challenges, this work introduces a methodology that combines the high-fidelity 3D reconstruction capabilities of the SfM-MVS methods with state-of-the-art refraction correction techniques, along with the spectral analysis capabilities of a new deep learning-based method for bathymetry prediction. This integration enables a synergistic approach where SfM-MVS derived DSMs with data gaps are used as training data to generate complete bathymetric maps. In this context, we propose Swin-BathyUNet that combines U-Net with Swin Transformer self-attention layers and a cross-attention mechanism, specifically tailored for SDB. Swin-BathyUNet is designed to improve bathymetric accuracy by capturing long-range spatial relationships and can also function as a standalone solution for standard SDB with various training depth data, independent of the SfM-MVS output. Extensive experiments at two completely different test sites in the Mediterranean and Baltic Seas demonstrate the effectiveness of the proposed approach, with improvements in bathymetric accuracy, detail, coverage, and noise reduction in the predicted DSM. The code is available at this https URL.
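
Training on DSMs that contain data gaps implies that the loss can only be evaluated where a reference depth exists. The abstract does not describe the loss used for Swin-BathyUNet, so the following is an assumed illustration only: a root-mean-square error computed over valid pixels, with gaps encoded as NaN. The function name and the NaN convention are hypothetical.

```python
import math


def masked_rmse(predicted, reference):
    """RMSE over the pixels whose reference depth is available.

    predicted: flat list of predicted depths, one value per pixel
    reference: flat list of reference depths, with NaN marking a data gap
    """
    squared_errors = [
        (p - r) ** 2
        for p, r in zip(predicted, reference)
        if not math.isnan(r)  # skip pixels with no reference depth
    ]
    if not squared_errors:
        return 0.0  # nothing to supervise in this patch
    return math.sqrt(sum(squared_errors) / len(squared_errors))
```

Because gap pixels contribute nothing to the error, the network is free to predict depths there from the spectral information alone, which is how a complete map can emerge from incomplete supervision.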

[CV-10] Robustness and sex differences in skin cancer detection: logistic regression vs CNNs

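Quick read: This paper addresses the reproducibility and potential sex bias of deep learning models for skin cancer detection. It replicates the framework of an earlier study on Alzheimer's disease (different data, same analysis) to examine how robust logistic regression (LR) and convolutional neural networks (CNN) are to different sex distributions. The key to the approach is the PAD-UFES-20 dataset: an LR model is trained on handcrafted features derived from dermatological guidelines, a pre-trained ResNet-50 serves as the CNN, and both are evaluated across multiple training sets with varied sex composition to quantify robustness and potential bias. Both models turn out to be robust to the sex distribution, but the CNN achieves significantly higher accuracy (ACC) and area under the receiver operating characteristic curve (AUROC) for male patients than for female patients. The result contributes to the assessment of potential bias in popular medical machine learning methods.

Link: https://arxiv.org/abs/2504.11415
Authors: Nikolette Pedersen,Regitze Sydendal,Andreas Wulff,Ralf Raumanns,Eike Petersen,Veronika Cheplygina
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 16 pages (excluding appendix), 2 figures (excluding appendix), submitted to MIUA 2025 conference (response pending)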

Abstract:Deep learning has been reported to achieve high performance in the detection of skin cancer, yet many challenges regarding the reproducibility of results and biases remain. This study is a replication (different data, same analysis) of a study on Alzheimer’s disease [28] which studied robustness of logistic regression (LR) and convolutional neural networks (CNN) across patient sexes. We explore sex bias in skin cancer detection, using the PAD-UFES-20 dataset with LR trained on handcrafted features reflecting dermatological guidelines (ABCDE and the 7-point checklist), and a pre-trained ResNet-50 model. We evaluate these models in alignment with [28]: across multiple training datasets with varied sex composition to determine their robustness. Our results show that both the LR and the CNN were robust to the sex distributions, but the results also revealed that the CNN had a significantly higher accuracy (ACC) and area under the receiver operating characteristic curve (AUROC) for male patients than for female patients. We hope these findings contribute to the growing field of investigating potential bias in popular medical machine learning methods. The data and relevant scripts to reproduce our results can be found in our GitHub.
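
The per-sex comparison of ACC and AUROC described above can be illustrated with a minimal sketch. It is not the authors' evaluation script: the function names, the 0.5 decision threshold, and the toy data are assumptions made for illustration, and AUROC is computed with the standard pairwise (Mann-Whitney) definition.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def auroc(y_true, scores):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative sample")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def metrics_by_group(y_true, scores, groups, threshold=0.5):
    """Compute ACC and AUROC separately for each group label (hypothetical helper)."""
    result = {}
    for g in sorted(set(groups)):
        idx = [i for i, x in enumerate(groups) if x == g]
        yt = [y_true[i] for i in idx]
        sc = [scores[i] for i in idx]
        yp = [int(s >= threshold) for s in sc]  # assumed threshold of 0.5
        result[g] = {"acc": accuracy(yt, yp), "auroc": auroc(yt, sc)}
    return result


# Toy data, invented for illustration only.
labels = [1, 0, 1, 0, 1, 0, 1, 0]
scores = [0.9, 0.1, 0.8, 0.2, 0.6, 0.7, 0.4, 0.3]
sexes = ["M"] * 4 + ["F"] * 4
print(metrics_by_group(labels, scores, sexes))
```

A gap between the two groups on the same model is the kind of signal the study looks for; whether such a gap is significant would still need a statistical test.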

[CV-11] Multi-level Cellular Automata for FLIM networks

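Quick read: This paper addresses the challenge that deep-learning Salient Object Detection (deep SOD) requires abundant annotated data and complex network architectures, which is particularly acute for medical applications in developing countries with limited computational resources. The key to the solution is a combination of Feature Learning from Image Markers (FLIM) and Cellular Automata (CA). A FLIM network initializes the CA states with expert knowledge, so no user interaction is needed for each image; features decoded from each level of the FLIM network initialize multiple CAs, forming a multi-level framework. The method exploits the hierarchical knowledge encoded in different network layers and merges the resulting saliency maps into a high-quality final output that acts as a CA ensemble. Experiments on two challenging medical datasets show that the multi-level CA approach is competitive with established deep SOD models.

Link: https://arxiv.org/abs/2504.11406
Authors: Felipe Crispim Salvagnini,Jancarlo F. Gomes,Cid A. N. Santos,Silvio Jamil F. Guimarães,Alexandre X. Falcão
Affiliations: University of Campinas; Eldorado Institute; Pontifical Catholic University of Minas Gerais
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: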

Abstract:The necessity of abundant annotated data and complex network architectures presents a significant challenge in deep-learning Salient Object Detection (deep SOD) and across the broader deep-learning landscape. This challenge is particularly acute in medical applications in developing countries with limited computational resources. Combining modern and classical techniques offers a path to maintaining competitive performance while enabling practical applications. Feature Learning from Image Markers (FLIM) methodology empowers experts to design convolutional encoders through user-drawn markers, with filters learned directly from these annotations. Recent findings demonstrate that coupling a FLIM encoder with an adaptive decoder creates a flyweight network suitable for SOD, requiring significantly fewer parameters than lightweight models and eliminating the need for backpropagation. Cellular Automata (CA) methods have proven successful in data-scarce scenarios but require proper initialization – typically through user input, priors, or randomness. We propose a practical intersection of these approaches: using FLIM networks to initialize CA states with expert knowledge without requiring user interaction for each image. By decoding features from each level of a FLIM network, we can initialize multiple CAs simultaneously, creating a multi-level framework. Our method leverages the hierarchical knowledge encoded across different network layers, merging multiple saliency maps into a high-quality final output that functions as a CA ensemble. Benchmarks across two challenging medical datasets demonstrate the competitiveness of our multi-level CA approach compared to established models in the deep SOD literature.

[CV-12] VideoPanda: Video Panoramic Diffusion with Multi-view Attention

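Quick read: This paper addresses the difficulty of collecting high-resolution panoramic video, which requires specialized equipment and intricate camera setups. It proposes VideoPanda, a novel approach for synthesizing 360° videos conditioned on text or single-view video. The key is to augment a video diffusion model with multi-view attention layers so that it generates consistent multi-view videos that can be combined into immersive panoramic content. VideoPanda is trained jointly on two conditions (text-only and single-view video) and supports autoregressive generation of long videos. To reduce the computational burden of multi-view video generation, the duration and camera views are randomly subsampled during training, and the model still generalizes to generating more frames at inference. Extensive evaluations show that VideoPanda produces more realistic and coherent 360° panoramas than existing methods across all input conditions.

Link: https://arxiv.org/abs/2504.11389
Authors: Kevin Xie,Amirmojtaba Sabour,Jiahui Huang,Despoina Paschalidou,Greg Klar,Umar Iqbal,Sanja Fidler,Xiaohui Zeng
Affiliations: NVIDIA
Categories: Graphics (cs.GR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: Project website at this https URL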

Abstract:High resolution panoramic video content is paramount for immersive experiences in Virtual Reality, but is non-trivial to collect as it requires specialized equipment and intricate camera setups. In this work, we introduce VideoPanda, a novel approach for synthesizing 360° videos conditioned on text or single-view video data. VideoPanda leverages multi-view attention layers to augment a video diffusion model, enabling it to generate consistent multi-view videos that can be combined into immersive panoramic content. VideoPanda is trained jointly using two conditions: text-only and single-view video, and supports autoregressive generation of long videos. To overcome the computational burden of multi-view video generation, we randomly subsample the duration and camera views used during training and show that the model is able to gracefully generalize to generating more frames during inference. Extensive evaluations on both real-world and synthetic video datasets demonstrate that VideoPanda generates more realistic and coherent 360° panoramas across all input conditions compared to existing methods. Visit the project website at this https URL for results.
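
The random subsampling of duration and camera views during training could look roughly like the sketch below. The abstract does not say how the subsampling is done, so everything here is an assumption: the function name, the choice of a contiguous frame window, the uniform sampling, and the example sizes are hypothetical.

```python
import random


def subsample_training_clip(num_frames, num_views, max_frames, max_views, rng=random):
    """Pick a contiguous frame window and a random subset of camera views.

    Hypothetical helper: the actual sampling scheme of VideoPanda is not
    described in the abstract.
    """
    n_frames = rng.randint(1, min(max_frames, num_frames))
    start = rng.randint(0, num_frames - n_frames)
    frames = list(range(start, start + n_frames))

    n_views = rng.randint(1, min(max_views, num_views))
    views = sorted(rng.sample(range(num_views), n_views))
    return frames, views


# Example: a 16-frame, 8-view clip reduced to at most 8 frames and 4 views.
frames, views = subsample_training_clip(16, 8, max_frames=8, max_views=4, rng=random.Random(0))
print(frames, views)
```

Training on such reduced clips lowers memory use per step; the paper reports that the model nevertheless generalizes to more frames at inference.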

[CV-13] Omni2: Unifying Omnidirectional Image Generation and Editing in an Omni Model

Quick read: This paper addresses the generation and editing of 360° omnidirectional images (ODIs), where existing methods perform poorly because of the unique ODI format (360° field of view) and the diversity of input conditions. The key to the solution is Any2Omni, a comprehensive dataset with more than 60,000 training samples covering up to 9 ODI generation and editing tasks under diverse input conditions. Built on it, the proposed Omni² model handles these ODI generation and editing tasks under different input conditions with a single model, improving both generation and editing quality.

Link: https://arxiv.org/abs/2504.11379
Authors: Liu Yang,Huiyu Duan,Yucheng Zhu,Xiaohong Liu,Lu Liu,Zitong Xu,Guangji Ma,Xiongkuo Min,Guangtao Zhai,Patrick Le Callet
Affiliations: Shanghai Jiao Tong University; University of Electronic Science and Technology of China; Université de Nantes
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages

Abstract:360° omnidirectional images (ODIs) have gained considerable attention recently, and are widely used in various virtual reality (VR) and augmented reality (AR) applications. However, capturing such images is expensive and requires specialized equipment, making ODI synthesis increasingly important. While common 2D image generation and editing methods are rapidly advancing, these models struggle to deliver satisfactory results when generating or editing ODIs due to the unique format and broad 360° Field-of-View (FoV) of ODIs. To bridge this gap, we construct Any2Omni, the first comprehensive ODI generation-editing dataset, which comprises 60,000+ training samples covering diverse input conditions and up to 9 ODI generation and editing tasks. Built upon Any2Omni, we propose an Omni model for Omni-directional image generation and editing (Omni²), with the capability of handling various ODI generation and editing tasks under diverse input conditions using one model. Extensive experiments demonstrate the superiority and effectiveness of the proposed Omni² model for both the ODI generation and editing tasks.

[CV-14] From Gaze to Insight: Bridging Human Visual Attention and Vision Language Model Explanation for Weakly-Supervised Medical Image Segmentation

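Quick read: This paper addresses the high cost of pixel-level annotation in medical image segmentation. Under weak supervision, it exploits the complementary strengths of clinician gaze data and vision-language models (VLMs) to improve segmentation. The key is a teacher-student framework that combines gaze data with language descriptions. The teacher model is trained on gaze points enhanced by VLM-generated descriptions of lesion morphology, which establishes the basis for guiding the student model. The teacher then guides the student through three strategies: multi-scale feature alignment, confidence-weighted consistency constraints, and adaptive masking. The method reaches Dice scores of 80.78%, 80.53%, and 84.22% on the Kvasir-SEG, NCI-ISBI, and ISIC datasets, respectively, an improvement of 3-5% over gaze-only baselines, while keeping the predictions clinically interpretable.

Link: https://arxiv.org/abs/2504.11368
Authors: Jingkun Chen,Haoran Duan,Xiao Zhang,Boyan Gao,Tao Tan,Vicente Grau,Jungong Han
Affiliations: IBME, Department of Engineering Science, University of Oxford, United Kingdom; Department of Automation, Tsinghua University, China; School of Information Science and Technology, Northwest University, China; Faculty of Applied Sciences, Macao Polytechnic University, China
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages, 5 figures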

Abstract:Medical image segmentation remains challenging due to the high cost of pixel-level annotations for training. In the context of weak supervision, clinician gaze data captures regions of diagnostic interest; however, its sparsity limits its use for segmentation. In contrast, vision-language models (VLMs) provide semantic context through textual descriptions but lack the explanation precision required. Recognizing that neither source alone suffices, we propose a teacher-student framework that integrates both gaze and language supervision, leveraging their complementary strengths. Our key insight is that gaze data indicates where clinicians focus during diagnosis, while VLMs explain why those regions are significant. To implement this, the teacher model first learns from gaze points enhanced by VLM-generated descriptions of lesion morphology, establishing a foundation for guiding the student model. The teacher then directs the student through three strategies: (1) Multi-scale feature alignment to fuse visual cues with textual semantics; (2) Confidence-weighted consistency constraints to focus on reliable predictions; (3) Adaptive masking to limit error propagation in uncertain areas. Experiments on the Kvasir-SEG, NCI-ISBI, and ISIC datasets show that our method achieves Dice scores of 80.78%, 80.53%, and 84.22%, respectively, improving 3-5% over gaze baselines without increasing the annotation burden. By preserving correlations among predictions, gaze data, and lesion descriptions, our framework also maintains clinical interpretability. This work illustrates how integrating human visual attention with AI-generated semantic context can effectively overcome the limitations of individual weak supervision signals, thereby advancing the development of deployable, annotation-efficient medical AI systems. Code is available at: this https URL.
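
The Dice scores quoted above follow the standard definition Dice = 2|A ∩ B| / (|A| + |B|) for a predicted mask A and a reference mask B. A minimal sketch of that metric is given below; the nested-list mask format, the smoothing constant `eps`, and the toy masks are assumptions for illustration and are not taken from the paper's code.

```python
def dice_score(pred, target, eps=1e-7):
    """Dice coefficient between two binary masks given as nested lists of 0/1.

    eps is an assumed smoothing term so that two empty masks score 1.0
    instead of dividing by zero.
    """
    intersection = 0
    total = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            intersection += p * t
            total += p + t
    return (2 * intersection + eps) / (total + eps)


# Toy 2x2 masks, invented for illustration: one overlapping pixel,
# two predicted pixels, one reference pixel, so Dice is about 2/3.
pred_mask = [[1, 1], [0, 0]]
ref_mask = [[1, 0], [0, 0]]
print(round(dice_score(pred_mask, ref_mask), 4))
```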

[CV-15] A Decade of Wheat Mapping for Lebanon

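Quick read: This paper addresses the accurate mapping of wheat fields from satellite imagery. The key to the solution is an improved wheat segmentation pipeline that integrates a Temporal Spatial Vision Transformer (TSViT) with Parameter-Efficient Fine Tuning (PEFT) and a novel post-processing pipeline based on the Fields of The World (FTW) framework. The approach tackles key challenges in existing methods, such as small agricultural parcels being clustered into a single large field. By merging wheat segmentation with precise field boundary extraction, it produces geometrically coherent and semantically rich maps that support in-depth analyses such as tracking crop rotation patterns over multiple years.

Link: https://arxiv.org/abs/2504.11366
Authors: Hasan Wehbi,Hasan Nasrallah,Mohamad Hasan Zahweh,Zeinab Takach,Veera Ganesh Yalla,Ali J. Ghandour
Affiliations: RASID SARL; Lebanese University; National Center for Remote Sensing, National Council for Scientific Research; IHub Data, IIIT Hyderabad
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: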

Abstract:Wheat accounts for approximately 20% of the world’s caloric intake, making it a vital component of global food security. Given this importance, mapping wheat fields plays a crucial role in enabling various stakeholders, including policy makers, researchers, and agricultural organizations, to make informed decisions regarding food security, supply chain management, and resource allocation. In this paper, we tackle the problem of accurately mapping wheat fields out of satellite images by introducing an improved pipeline for winter wheat segmentation, as well as presenting a case study on a decade-long analysis of wheat mapping in Lebanon. We integrate a Temporal Spatial Vision Transformer (TSViT) with Parameter-Efficient Fine Tuning (PEFT) and a novel post-processing pipeline based on the Fields of The World (FTW) framework. Our proposed pipeline addresses key challenges encountered in existing approaches, such as the clustering of small agricultural parcels in a single large field. By merging wheat segmentation with precise field boundary extraction, our method produces geometrically coherent and semantically rich maps that enable us to perform in-depth analysis such as tracking crop rotation pattern over years. Extensive evaluations demonstrate improved boundary delineation and field-level precision, establishing the potential of the proposed framework in operational agricultural monitoring and historical trend analysis. By allowing for accurate mapping of wheat fields, this work lays the foundation for a range of critical studies and future advances, including crop monitoring and yield estimation.

[CV-16] Explicit and Implicit Representations in AI-based 3D Reconstruction for Radiology: A systematic literature review

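Quick read: This review addresses the accuracy and efficiency of high-quality 3D reconstruction in medical imaging, together with the goal of reducing patient radiation exposure and discomfort. It surveys state-of-the-art AI-based 3D reconstruction algorithms, divided into explicit methods (point-based, volume-based, and Gaussian representations) and implicit methods (implicit prior embedding and neural radiance fields), which aim at more accurate and faster image reconstruction. It also examines commonly used evaluation metrics and benchmark datasets, with the aim of advancing clinical diagnosis.

Link: https://arxiv.org/abs/2504.11349
Authors: Yuezhe Yang,Boyu Yang,Yaqian Wang,Yang He,Xingbo Dong,Zhe Jin
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR)
Comments: 43 pages, 5 figures, submit to Medical Image Analysis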

Abstract:The demand for high-quality medical imaging in clinical practice and assisted diagnosis has made 3D reconstruction in radiological imaging a key research focus. Artificial intelligence (AI) has emerged as a promising approach to enhancing reconstruction accuracy while reducing acquisition and processing time, thereby minimizing patient radiation exposure and discomfort and ultimately benefiting clinical diagnosis. This review explores state-of-the-art AI-based 3D reconstruction algorithms in radiological imaging, categorizing them into explicit and implicit approaches based on their underlying principles. Explicit methods include point-based, volume-based, and Gaussian representations, while implicit methods encompass implicit prior embedding and neural radiance fields. Additionally, we examine commonly used evaluation metrics and benchmark datasets. Finally, we discuss the current state of development, key challenges, and future research directions in this evolving field. Our project is available at: this https URL.

[CV-17] DeepWheel: Generating a 3D Synthetic Wheel Dataset for Design and Performance Evaluation

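Quick read: This paper addresses the limited use of data-driven design for vehicle wheels, which stems from the lack of large-scale, high-quality datasets containing both 3D geometry and physical performance metrics. The key to the solution is a generative AI framework for producing a synthetic design-performance dataset. The framework first generates 2D rendered images with Stable Diffusion, then reconstructs the 3D geometry through 2.5D depth estimation, and finally runs structural simulations to extract engineering performance data. Topology optimization is applied to further expand the design and performance space and to yield a more diverse set of wheel designs. The resulting dataset, DeepWheel, contains over 6,000 photo-realistic images and 900 structurally analyzed 3D models, and serves as a resource for surrogate model training, data-driven inverse design, and design space exploration. The methodology also applies to other complex design domains, and the dataset is released under the CC BY-NC 4.0 license and is available for download.

Link: https://arxiv.org/abs/2504.11347
Authors: Soyoung Yoo,Namwoo Kang
Affiliations: Cho Chun Shik Graduate School of Mobility; KAIST; Narnia Labs
Categories: Computer Vision and Pattern Recognition (cs.CV); Applied Physics (physics.app-ph)
Comments: 28 pages, 18 figures. Not yet submitted to a journal or conference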

Abstract:Data-driven design is emerging as a powerful strategy to accelerate engineering innovation. However, its application to vehicle wheel design remains limited due to the lack of large-scale, high-quality datasets that include 3D geometry and physical performance metrics. To address this gap, this study proposes a synthetic design-performance dataset generation framework using generative AI. The proposed framework first generates 2D rendered images using Stable Diffusion, and then reconstructs the 3D geometry through 2.5D depth estimation. Structural simulations are subsequently performed to extract engineering performance data. To further expand the design and performance space, topology optimization is applied, enabling the generation of a more diverse set of wheel designs. The final dataset, named DeepWheel, consists of over 6,000 photo-realistic images and 900 structurally analyzed 3D models. This multi-modal dataset serves as a valuable resource for surrogate model training, data-driven inverse design, and design space exploration. The proposed methodology is also applicable to other complex design domains. The dataset is released under the Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0) license and is available at this https URL

[CV-18] Seedream 3.0 Technical Report

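Quick read: This report tackles several challenges in Chinese-English bilingual image generation: alignment with complicated prompts, fine-grained typography generation, suboptimal visual aesthetics and fidelity, and limited image resolution. The key to the solution is a set of improvements across the entire pipeline, from data construction to model deployment. At the data level, the dataset is doubled using a defect-aware training paradigm and a dual-axis collaborative data-sampling framework. Pre-training adopts mixed-resolution training, cross-modality RoPE, a representation alignment loss, and resolution-aware timestep sampling. Post-training uses diversified aesthetic captions in SFT together with a VLM-based reward model to align outputs with human preferences. In addition, consistent noise expectation and importance-aware timestep sampling yield a 4 to 8 times speedup while maintaining image quality. These measures improve the overall capabilities of Seedream 3.0, in particular text rendering of complicated Chinese characters, and the model supports native high-resolution output up to 2K.

Link: https://arxiv.org/abs/2504.11346
Authors: Yu Gao,Lixue Gong,Qiushan Guo,Xiaoxia Hou,Zhichao Lai,Fanshi Li,Liang Li,Xiaochen Lian,Chao Liao,Liyang Liu,Wei Liu,Yichun Shi,Shiqi Sun,Yu Tian,Zhi Tian,Peng Wang,Rui Wang,Xuanda Wang,Xun Wang,Ye Wang,Guofeng Wu,Jie Wu,Xin Xia,Xuefeng Xiao,Zhonghua Zhai,Xinyu Zhang,Qi Zhang,Yuwei Zhang,Shijia Zhao,Jianchao Yang,Weilin Huang
Affiliations: ByteDance
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Seedream 3.0 Technical Report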

Abstract:We present Seedream 3.0, a high-performance Chinese-English bilingual image generation foundation model. We develop several technical improvements to address existing challenges in Seedream 2.0, including alignment with complicated prompts, fine-grained typography generation, suboptimal visual aesthetics and fidelity, and limited image resolutions. Specifically, the advancements of Seedream 3.0 stem from improvements across the entire pipeline, from data construction to model deployment. At the data stratum, we double the dataset using a defect-aware training paradigm and a dual-axis collaborative data-sampling framework. Furthermore, we adopt several effective techniques such as mixed-resolution training, cross-modality RoPE, representation alignment loss, and resolution-aware timestep sampling in the pre-training phase. During the post-training stage, we utilize diversified aesthetic captions in SFT, and a VLM-based reward model with scaling, thereby achieving outputs that well align with human preferences. Furthermore, Seedream 3.0 pioneers a novel acceleration paradigm. By employing consistent noise expectation and importance-aware timestep sampling, we achieve a 4 to 8 times speedup while maintaining image quality. Seedream 3.0 demonstrates significant improvements over Seedream 2.0: it enhances overall capabilities, in particular for text-rendering in complicated Chinese characters which is important to professional typography generation. In addition, it provides native high-resolution output (up to 2K), allowing it to generate images with high visual quality.

[CV-19] PVUW 2025 Challenge Report: Advances in Pixel-level Understanding of Complex Videos in the Wild

Quick read: This report addresses key challenges in complex video segmentation, namely video object segmentation in complex scenes (the MOSE track) and motion-guided, language-based video segmentation (the MeViS track). Both tracks introduce new, more challenging datasets that better reflect real-world scenarios. Through detailed evaluation and analysis on these datasets, the challenge reveals the current state of the art and emerging trends, and thereby advances research on complex video segmentation.

Link: https://arxiv.org/abs/2504.11326
Authors: Henghui Ding,Chang Liu,Nikhila Ravi,Shuting He,Yunchao Wei,Song Bai,Philip Torr,Kehuan Song,Xinglin Xie,Kexin Zhang,Licheng Jiao,Lingling Li,Shuyuan Yang,Xuqiang Cao,Linnan Zhao,Jiaxuan Zhao,Fang Liu,Mengjiao Wang,Junpei Zhang,Xu Liu,Yuting Yang,Mengru Ma,Hao Fang,Runmin Cong,Xiankai Lu,Zhiyang Che,Wei Zhan,Tianming Liang,Haichao Jiang,Wei-Shi Zheng,Jian-Fang Hu,Haobo Yuan,Xiangtai Li,Tao Zhang,Lu Qi,Ming-Hsuan Yang
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Workshop Page: this https URL . arXiv admin note: text overlap with arXiv:2504.00476 , arXiv:2504.05178

Abstract:This report provides a comprehensive overview of the 4th Pixel-level Video Understanding in the Wild (PVUW) Challenge, held in conjunction with CVPR 2025. It summarizes the challenge outcomes, participating methodologies, and future research directions. The challenge features two tracks: MOSE, which focuses on complex scene video object segmentation, and MeViS, which targets motion-guided, language-based video segmentation. Both tracks introduce new, more challenging datasets designed to better reflect real-world scenarios. Through detailed evaluation and analysis, the challenge offers valuable insights into the current state-of-the-art and emerging trends in complex video segmentation. More information can be found on the workshop website: this https URL.

[CV-20] Intelligent driving vehicle front multi-target tracking and detection based on YOLOv5 and point cloud 3D projection

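Quick read: This paper addresses how an intelligent driving vehicle can accurately associate detected targets across consecutive image frames to form stable trajectories in multi-target tracking and detection. The problem is complex and critical, because multi-target tracking must update the position and state of every target in real time while keeping the trajectories continuous and accurate.

The key to the solution is the combination of the YOLOv5 network structure with point cloud 3D projection. First, the Retinex algorithm enhances the image of the environment in front of the vehicle and removes light interference, and an intelligent detection model is built on YOLOv5. The enhanced image is fed into the model, which identifies multiple targets in front of the vehicle through feature extraction and target localization. Second, point cloud 3D projection is used to infer the correlation between position changes of adjacent frames in the projection coordinate system. Finally, the target recognition results of consecutive frames are projected in sequence into the 3D laser point cloud environment, which makes it possible to track the motion trajectories of all targets in front of the vehicle. In the experiments, the method obtains a MOTA (tracking accuracy) value greater than 30, which the authors take as evidence of superior tracking and detection performance.

Link: https://arxiv.org/abs/2504.11310
Authors: Dayong Liu,Qingrui Zhang,Zeyang Meng
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: in Chinese language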

Abstract:In multi-target tracking and detection tasks, it is necessary to continuously track multiple targets, such as vehicles and pedestrians. To achieve this goal, the system must be able to continuously acquire and process image frames containing these targets. These consecutive frames enable the algorithm to update the position and state of each target in real time. How to accurately associate a detected target with the same target in the previous or next frame to form a stable trajectory is a complex problem. Therefore, a multi-object tracking and detection method for intelligent driving vehicles based on YOLOv5 and point cloud 3D projection is proposed. The Retinex algorithm is used to enhance the image of the environment in front of the vehicle and remove light interference, and an intelligent detection model is built on the YOLOv5 network structure. The enhanced image is input into the model, and multiple targets in front of the vehicle are identified through feature extraction and target localization. By combining point cloud 3D projection technology, the correlation between the position changes of adjacent frame images in the projection coordinate system can be inferred. By sequentially projecting the multi-target recognition results of multiple consecutive frame images into the 3D laser point cloud environment, effective tracking of the motion trajectories of all targets in front of the vehicle can be achieved. The experimental results show that the application of this method for intelligent driving vehicle front multi-target tracking and detection yields a MOTA (Tracking Accuracy) value greater than 30, demonstrating its superior tracking and detection performance.
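
For readers unfamiliar with the MOTA figure quoted above, the commonly used definition is MOTA = 1 - (FN + FP + IDSW) / GT, summed over all frames. The sketch below implements that standard definition; the per-frame tuple layout and the toy counts are assumptions for illustration, and the abstract does not state whether its value of 30 is a percentage.

```python
def mota(frames):
    """Multiple Object Tracking Accuracy from per-frame error counts.

    frames: iterable of (false_negatives, false_positives, id_switches, num_gt)
    tuples, one per frame (an assumed input layout).
    """
    errors = 0
    total_gt = 0
    for false_negatives, false_positives, id_switches, num_gt in frames:
        errors += false_negatives + false_positives + id_switches
        total_gt += num_gt
    if total_gt <= 0:
        raise ValueError("MOTA is undefined without ground-truth objects")
    return 1.0 - errors / total_gt


# Toy counts, invented for illustration: 6 errors over 20 ground-truth objects.
example = [(2, 1, 0, 10), (1, 1, 1, 10)]
print(f"MOTA = {100 * mota(example):.1f}%")
```

Because the errors are summed before dividing, MOTA can become negative when a tracker makes more errors than there are ground-truth objects.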

[CV-21] Big Brother is Watching: Proactive Deepfake Detection via Learnable Hidden Face

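Quick read: This paper addresses the difficulty of detecting deepfakes in a way that generalizes across forgery manipulations and datasets, and proposes a method that brings the idea of proactive defense into detection in order to identify malicious image forgery more effectively. The key innovation is a detection framework built on the concept of "hiding a learnable face within a face". A semi-fragile invertible steganography network imperceptibly embeds a secret template image into a host image; when the template is restored by the inverse steganography process, it serves as an indicator of malicious tampering. Instead of being manually specified, the secret template is optimized during training to resemble a neutral facial appearance, like a "big brother" hidden in the image to be protected. A self-blending mechanism and a robustness learning strategy with a simulated transmission channel yield a robust detector that accurately distinguishes whether a steganographic image has been maliciously tampered with or benignly processed. Experiments on multiple datasets show that the method outperforms existing passive and proactive detection methods.

Link: https://arxiv.org/abs/2504.11309
Authors: Hongbo Li,Shangchao Yang,Ruiyang Xia,Lin Yuan,Xinbo Gao
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: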

Abstract:As deepfake technologies continue to advance, passive detection methods struggle to generalize across various forgery manipulations and datasets. Proactive defense techniques have been actively studied with the primary aim of preventing deepfake operations from working effectively. In this paper, we aim to bridge the gap between passive detection and proactive defense, and seek to solve the detection problem utilizing a proactive methodology. Inspired by several watermarking-based forensic methods, we explore a novel detection framework based on the concept of "hiding a learnable face within a face". Specifically, relying on a semi-fragile invertible steganography network, a secret template image is embedded into a host image imperceptibly, acting as an indicator monitoring for any malicious image forgery when being restored by the inverse steganography process. Instead of being manually specified, the secret template is optimized during training to resemble a neutral facial appearance, just like a "big brother" hidden in the image to be protected. By incorporating a self-blending mechanism and robustness learning strategy with a simulative transmission channel, a robust detector is built to accurately distinguish if the steganographic image is maliciously tampered or benignly processed. Finally, extensive experiments conducted on multiple datasets demonstrate the superiority of the proposed approach over competing passive and proactive detection methods.

[CV-22] Uncertainty Estimation for Trust Attribution to Speed-of-Sound Reconstruction with Variational Networks

Quick read: This paper addresses the uneven quality of data frames when speed-of-sound (SoS) images are reconstructed from ultrasound data, where noise from motion, lack of contact, acoustic shadows, and similar effects corrupts some frames, and it explores how uncertainty estimation can improve diagnostic decisions. The key to the solution is an uncertainty-based automatic frame selection method: the uncertainty of each SoS reconstruction quantifies how much each acquired frame can be trusted, and the most reliable frame among multiple acquisitions is selected for further processing and diagnosis. Two uncertainty estimation techniques are used, Monte Carlo Dropout and Bayesian Variational Inference. In the differential diagnosis of breast cancer, the method reaches an area under the curve (AUC) of 76% and 80%, respectively, above the uncertainty-uninformed baselines (best baseline AUC of 64%).

Link: https://arxiv.org/abs/2504.11307
Authors: Sonia Laguna,Lin Zhang,Can Deniz Bezek,Monika Farkas,Dieter Schweizer,Rahel A. Kubik-Huch,Orcun Goksel
Affiliations: Computer-assisted Applications in Medicine, ETH Zurich, Switzerland; Department of Information Technology, Uppsala University, Sweden; Department of Radiology, Kantonsspital Baden, affiliated Hospital for Research and Teaching of the Faculty of Medicine of the University of Zurich, Switzerland; Nvidia; Uppsala Medtech Science & Innovation Centre
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Published at the International Journal of Computer Assisted Radiology and Surgery. Presented at the 16th International Conference on Information Processing in Computer-Assisted Interventions 2025

Abstract:Speed-of-sound (SoS) is a biomechanical characteristic of tissue, and its imaging can provide a promising biomarker for diagnosis. Reconstructing SoS images from ultrasound acquisitions can be cast as a limited-angle computed-tomography problem, with Variational Networks being a promising model-based deep learning solution. Some acquired data frames may, however, get corrupted by noise due to, e.g., motion, lack of contact, and acoustic shadows, which in turn negatively affects the resulting SoS reconstructions. We propose to use the uncertainty in SoS reconstructions to attribute trust to each individual acquired frame. Given multiple acquisitions, we then use an uncertainty-based automatic selection among these retrospectively, to improve diagnostic decisions. We investigate uncertainty estimation based on Monte Carlo Dropout and Bayesian Variational Inference. We assess our automatic frame selection method for differential diagnosis of breast cancer, distinguishing between benign fibroadenoma and malignant carcinoma. We evaluate 21 lesions classified as BI-RADS 4, which represents suspicious cases for probable malignancy. The most trustworthy frame among four acquisitions of each lesion was identified using uncertainty-based criteria. Selecting a frame informed by uncertainty achieved an area under curve of 76% and 80% for Monte Carlo Dropout and Bayesian Variational Inference, respectively, superior to any uncertainty-uninformed baselines with the best one achieving 64%. A novel use of uncertainty estimation is proposed for selecting one of multiple data acquisitions for further processing and decision making.
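
The idea of picking the most trustworthy frame by uncertainty can be sketched as follows. This is only an illustration under stated assumptions: each frame is represented by several stochastic reconstructions (as Monte Carlo Dropout would produce), uncertainty is summarized as the mean per-pixel variance, and the frame with the lowest value is selected. The abstract does not specify the paper's actual uncertainty criterion, and the function names and toy values are hypothetical.

```python
from statistics import mean, pvariance


def frame_uncertainty(samples):
    """Mean per-pixel variance over stochastic reconstructions of one frame.

    samples: list of reconstructions, each a flat list of pixel values
    (an assumed data layout).
    """
    per_pixel_variance = [pvariance(values) for values in zip(*samples)]
    return mean(per_pixel_variance)


def select_most_trusted(frames):
    """Return (index of the least uncertain frame, list of uncertainty scores)."""
    scores = [frame_uncertainty(samples) for samples in frames]
    best = min(range(len(frames)), key=scores.__getitem__)
    return best, scores


# Toy SoS values in m/s, invented for illustration: frame 1 is far more
# consistent across stochastic passes than frame 0.
acquisitions = [
    [[1500, 1510], [1520, 1490]],
    [[1500, 1510], [1501, 1509]],
]
best_index, uncertainty = select_most_trusted(acquisitions)
print(best_index, uncertainty)
```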

[CV-23] Context-Aware Palmprint Recognition via a Relative Similarity Metric

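Quick read: This paper addresses a weakness of conventional matching mechanisms in palmprint recognition: direct pairwise similarity measures (such as cosine or Euclidean distance) do not capture how a pairwise similarity compares within the context of the entire dataset, which makes it hard to suppress false positives and false negatives. The paper proposes a Relative Similarity Metric (RSM) that evaluates the relative consistency of pairwise similarity scores across the whole set of identities, improving the robustness and discriminability of existing matching frameworks. The key is this relational measure, which lets the matching mechanism take the overall characteristics of the dataset into account. On the Tongji dataset, the method reaches an Equal Error Rate (EER) of 0.000036%, a new state of the art.

Link: https://arxiv.org/abs/2504.11306
Authors: Trinnhallen Brisley,Aryan Gandhi,Joseph Magen
Affiliations: University of Edinburgh; Nethermind; University of Waterloo
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: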

Abstract:We propose a new approach to the matching mechanism for palmprint recognition by introducing a Relative Similarity Metric (RSM) that enhances the robustness and discriminability of existing matching frameworks. While conventional systems rely on direct pairwise similarity measures, such as cosine or Euclidean distances, these metrics fail to capture how a pairwise similarity compares within the context of the entire dataset. Our method addresses this by evaluating the relative consistency of similarity scores across up to all identities, allowing for better suppression of false positives and negatives. Applied atop the CCNet architecture, our method achieves a new state-of-the-art 0.000036% Equal Error Rate (EER) on the Tongji dataset, outperforming previous methods and demonstrating the efficacy of incorporating relational structure into the palmprint matching process.
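
The Equal Error Rate reported above is the operating point where the false accept rate (impostor pairs scored at or above the threshold) equals the false reject rate (genuine pairs scored below it). The sketch below estimates it with a simple threshold sweep; it illustrates the metric only, not the RSM itself, and the function name and toy scores are assumptions.

```python
def equal_error_rate(genuine, impostor):
    """Estimate the EER by sweeping thresholds over all observed scores.

    genuine: similarity scores of same-identity pairs.
    impostor: similarity scores of different-identity pairs.
    Returns the average of FAR and FRR at the threshold where they are closest.
    """
    best_gap = float("inf")
    eer = None
    for threshold in sorted(set(genuine) | set(impostor)):
        far = sum(s >= threshold for s in impostor) / len(impostor)
        frr = sum(s < threshold for s in genuine) / len(genuine)
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap = gap
            eer = (far + frr) / 2
    return eer


# Toy similarity scores, invented for illustration: one genuine pair (0.4)
# scores below one impostor pair (0.5), so FAR and FRR meet at 25%.
genuine_scores = [0.9, 0.8, 0.7, 0.4]
impostor_scores = [0.1, 0.2, 0.3, 0.5]
print(equal_error_rate(genuine_scores, impostor_scores))
```

An EER as low as the one reported requires a very large number of comparison pairs to be measurable at all, since the resolution of the estimate is bounded by the pair counts.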

[CV-24] CFIS-YOLO: A Lightweight Multi-Scale Fusion Network for Edge-Deployable Wood Defect Detection

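Quick read: This paper addresses two challenges of wood defect detection in the wood processing industry: traditional methods are costly, subjective, and labor-intensive, while mainstream deep learning models struggle to balance detection accuracy and computational efficiency when deployed on edge devices. It proposes CFIS-YOLO, a lightweight object detection model optimized for edge devices. The key innovations are an enhanced C2f structure, a dynamic feature recombination module, and a novel loss function that incorporates auxiliary bounding boxes and angular constraints; together they improve multi-scale feature fusion and small-object localization while significantly reducing computational overhead. On a public wood defect dataset, CFIS-YOLO reaches a mean Average Precision (mAP@0.5) of 77.5%, 4 percentage points above the YOLOv10s baseline. On SOPHON BM1684X edge devices it runs at 135 FPS, reduces power consumption to 17.3% of the original implementation, and loses only 0.5 percentage points of mAP. These results indicate that CFIS-YOLO is a practical solution for wood defect detection in resource-constrained environments.

Link: https://arxiv.org/abs/2504.11305
Authors: Jincheng Kang,Yi Cen,Yigang Cen,Ke Wang,Yuhan Liu
Affiliations: School of Information and Engineering, Minzu University of China; School of Computer and Information Technology, Beijing Jiaotong University; School of Information and Communication Engineering, Beijing University of Posts and Telecommunications; Faculty of Science, The University of Hong Kong
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 10 pages, 11 figures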

Abstract:Wood defect detection is critical for ensuring quality control in the wood processing industry. However, current industrial applications face two major challenges: traditional methods are costly, subjective, and labor-intensive, while mainstream deep learning models often struggle to balance detection accuracy and computational efficiency for edge deployment. To address these issues, this study proposes CFIS-YOLO, a lightweight object detection model optimized for edge devices. The model introduces an enhanced C2f structure, a dynamic feature recombination module, and a novel loss function that incorporates auxiliary bounding boxes and angular constraints. These innovations improve multi-scale feature fusion and small object localization while significantly reducing computational overhead. Evaluated on a public wood defect dataset, CFIS-YOLO achieves a mean Average Precision (mAP@0.5) of 77.5%, outperforming the baseline YOLOv10s by 4 percentage points. On SOPHON BM1684X edge devices, CFIS-YOLO delivers 135 FPS, reduces power consumption to 17.3% of the original implementation, and incurs only a 0.5 percentage point drop in mAP. These results demonstrate that CFIS-YOLO is a practical and effective solution for real-world wood defect detection in resource-constrained environments.
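
The "@0.5" in mAP@0.5 refers to the Intersection over Union (IoU) threshold at which a predicted box counts as a correct detection. A minimal sketch of that matching rule is shown below; the (x1, y1, x2, y2) box format, the helper names, and the toy boxes are assumptions, and the full mAP computation (ranking by confidence and averaging precision over classes) is not included.

```python
def iou(box_a, box_b):
    """Intersection over Union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    intersection = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0


def is_true_positive(pred_box, gt_box, threshold=0.5):
    """A prediction matches a ground-truth box if IoU reaches the threshold."""
    return iou(pred_box, gt_box) >= threshold


# Toy boxes, invented for illustration: half of each box overlaps the other,
# giving IoU = 2 / 6, which is below the 0.5 threshold.
print(iou((0, 0, 2, 2), (1, 0, 3, 2)), is_true_positive((0, 0, 2, 2), (1, 0, 3, 2)))
```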

[CV-25] Autoregressive Distillation of Diffusion Transformers CVPR2025

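[Quick read]: This paper addresses the resource cost of iterative sampling when diffusion models generate high-fidelity images, and the susceptibility of existing distillation methods to exposure bias. It proposes AutoRegressive Distillation (ARD), whose key idea is to use the historical trajectory of the ordinary differential equation (ODE) to predict future steps. Feeding in historical inputs mitigates exposure bias and supplies a more effective source of coarse-grained context. ARD modifies the teacher transformer architecture by adding token-wise time embeddings that mark each input from the trajectory history and by using a block-wise causal attention mask; historical inputs are incorporated only in the lower transformer layers, which balances performance against computational efficiency.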

Link: https://arxiv.org/abs/2504.11295
Authors: Yeongmin Kim, Sotiris Anagnostidis, Yuming Du, Edgar Schönfeld, Jonas Kohler, Markos Georgopoulos, Albert Pumarola, Ali Thabet, Artsiom Sanakoyeu
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: CVPR 2025 Oral

Abstract:Diffusion models with transformer architectures have demonstrated promising capabilities in generating high-fidelity images and scalability for high resolution. However, the iterative sampling process required for synthesis is very resource-intensive. A line of work has focused on distilling solutions to probability flow ODEs into few-step student models. Nevertheless, existing methods have been limited by their reliance on the most recent denoised samples as input, rendering them susceptible to exposure bias. To address this limitation, we propose AutoRegressive Distillation (ARD), a novel approach that leverages the historical trajectory of the ODE to predict future steps. ARD offers two key benefits: 1) it mitigates exposure bias by utilizing a predicted historical trajectory that is less susceptible to accumulated errors, and 2) it leverages the previous history of the ODE trajectory as a more effective source of coarse-grained information. ARD modifies the teacher transformer architecture by adding token-wise time embedding to mark each input from the trajectory history and employs a block-wise causal attention mask for training. Furthermore, incorporating historical inputs only in lower transformer layers enhances performance and efficiency. We validate the effectiveness of ARD in class-conditioned generation on ImageNet and T2I synthesis. Our model achieves a 5× reduction in FID degradation compared to the baseline methods while requiring only 1.1% extra FLOPs on ImageNet-256. Moreover, ARD reaches FID of 1.84 on ImageNet-256 in merely 4 steps and outperforms the publicly available 1024p text-to-image distilled models in prompt adherence score with a minimal drop in FID compared to the teacher. Project page: this https URL.
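
The block-wise causal attention mask mentioned in the abstract can be illustrated with a small sketch. It assumes that each trajectory step contributes a block of `block_size` tokens, that tokens attend freely within their own block, and that they may also attend to every earlier block. The function name, the block layout, and the ordering of the blocks are illustrative assumptions, not details taken from the paper.

```python
def blockwise_causal_mask(num_blocks: int, block_size: int) -> list[list[bool]]:
    """Return mask[q][k], True when query token q may attend to key token k.

    Tokens are grouped into consecutive blocks (hypothetically, one block per
    trajectory step). A query sees its own block and all earlier blocks.
    """
    n = num_blocks * block_size
    mask = []
    for q in range(n):
        q_block = q // block_size
        # Allowed when the key's block index is not later than the query's.
        mask.append([(k // block_size) <= q_block for k in range(n)])
    return mask


mask = blockwise_causal_mask(num_blocks=3, block_size=2)
for row in mask:
    print("".join("1" if allowed else "0" for allowed in row))
```

Unlike a token-level causal mask, the result is block lower-triangular: the two tokens of a block see each other in both directions, while no token sees a later block.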

[CV-26] UniAnimate-DiT: Human Image Animation with Large-Scale Video Diffusion Transformer

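[Quick read]: This paper targets consistent, high-quality human image animation. The key to the solution is fine-tuning the open-source Wan2.1 model with Low-Rank Adaptation (LoRA), which optimizes only a small set of parameters, preserving the model's strong generative capability while greatly reducing training memory overhead. A lightweight pose encoder built from stacked 3D convolutional layers extracts motion information from the driving poses, and a simple concatenation operation integrates the appearance features of the reference image, together with its pose information, into the model to improve pose alignment. Experiments show that the method produces visually realistic, temporally consistent, high-fidelity animations and generalizes well: a model trained on 480p videos extends seamlessly to 720p inference.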

Link: https://arxiv.org/abs/2504.11289
Authors: Xiang Wang, Shiwei Zhang, Longxiang Tang, Yingya Zhang, Changxin Gao, Yuehuan Wang, Nong Sang
Affiliations: Key Laboratory of Image Processing and Intelligent Control, School of Artificial Intelligence and Automation, Huazhong University of Science and Technology; Alibaba Group; Tsinghua University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: The training and inference code (based on Wan2.1) is available at this https URL

Abstract:This report presents UniAnimate-DiT, an advanced project that leverages the cutting-edge and powerful capabilities of the open-source Wan2.1 model for consistent human image animation. Specifically, to preserve the robust generative capabilities of the original Wan2.1 model, we apply the Low-Rank Adaptation (LoRA) technique to fine-tune a minimal set of parameters, significantly reducing training memory overhead. A lightweight pose encoder consisting of multiple stacked 3D convolutional layers is designed to encode motion information of driving poses. Furthermore, we adopt a simple concatenation operation to integrate the reference appearance into the model and incorporate the pose information of the reference image for enhanced pose alignment. Experimental results show that our approach achieves visually appealing and temporally consistent high-fidelity animations. Trained on 480p (832x480) videos, UniAnimate-DiT demonstrates strong generalization capabilities to seamlessly upscale to 720p (1280x720) during inference. The training and inference code is publicly available at this https URL.
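
To make the LoRA idea above concrete, here is a minimal sketch of a single linear layer with a low-rank adapter. It uses the commonly described formulation `y = x W + (alpha / r) * x A B`, where `W` stays frozen and only `A` and `B` are trained. The layer sizes, the rank, the `alpha` value, and all function names are hypothetical; nothing here is taken from the UniAnimate-DiT code.

```python
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_forward(x, w_frozen, lora_a, lora_b, alpha):
    """Compute x @ W + (alpha / r) * x @ A @ B; only A and B would be trained."""
    rank = len(lora_b)
    base = matmul(x, w_frozen)
    update = matmul(matmul(x, lora_a), lora_b)
    scale = alpha / rank
    return [[b + scale * u for b, u in zip(b_row, u_row)]
            for b_row, u_row in zip(base, update)]


random.seed(0)
d_in, d_out, rank = 16, 16, 2  # hypothetical sizes, chosen only for illustration
w_frozen = [[random.gauss(0, 1) for _ in range(d_out)] for _ in range(d_in)]
lora_a = [[random.gauss(0, 0.01) for _ in range(rank)] for _ in range(d_in)]
lora_b = [[0.0] * d_out for _ in range(rank)]  # zero init: the adapter starts as a no-op
x = [[random.gauss(0, 1) for _ in range(d_in)]]

y = lora_forward(x, w_frozen, lora_a, lora_b, alpha=4.0)
trainable_params = d_in * rank + rank * d_out
frozen_params = d_in * d_out
print(trainable_params, frozen_params)
```

With these toy sizes the adapter holds 64 trainable values against 256 frozen ones; the gap widens quickly as the layer grows, which is where the memory saving during training comes from.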

[CV-27] Distillation-Supervised Convolutional Low-Rank Adaptation for Efficient Image Super-Resolution

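[Quick read]: This paper addresses a problem of CNN-based image super-resolution: performance gains tend to require deeper networks and larger feature maps, which raise complexity and inference cost. It proposes Distillation-Supervised Convolutional Low-Rank Adaptation (DSCLoRA). The key is to perform parameter updates through low-rank decomposition and to combine them with a knowledge distillation strategy based on spatial feature affinity, which transfers second-order statistical information from a pre-trained teacher model (such as SPAN) to the student model while leaving the lightweight architecture and inference efficiency unchanged. Experiments show that DSCLoRA improves the peak signal-to-noise ratio (PSNR) and structural similarity index (SSIM) of SPAN without added complexity, and it ranked first in the Overall Performance Track of the NTIRE 2025 Efficient Super-Resolution Challenge.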

Link: https://arxiv.org/abs/2504.11271
Authors: Xinning Chai, Yao Zhang, Yuxuan Zhang, Zhengxue Cheng, Yingsheng Qin, Yucai Yang, Li Song
Affiliations: Shanghai Jiao Tong University; Transsion
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Convolutional neural networks (CNNs) have been widely used in efficient image super-resolution. However, for CNN-based methods, performance gains often require deeper networks and larger feature maps, which increase complexity and inference costs. Inspired by LoRA’s success in fine-tuning large language models, we explore its application to lightweight models and propose Distillation-Supervised Convolutional Low-Rank Adaptation (DSCLoRA), which improves model performance without increasing architectural complexity or inference costs. Specifically, we integrate ConvLoRA into the efficient SR network SPAN by replacing the SPAB module with the proposed SConvLB module and incorporating ConvLoRA layers into both the pixel shuffle block and its preceding convolutional layer. DSCLoRA leverages low-rank decomposition for parameter updates and employs a spatial feature affinity-based knowledge distillation strategy to transfer second-order statistical information from teacher models (pre-trained SPAN) to student models (ours). This method preserves the core knowledge of lightweight models and facilitates optimal solution discovery under certain conditions. Experiments on benchmark datasets show that DSCLoRA improves PSNR and SSIM over SPAN while maintaining its efficiency and competitive image quality. Notably, DSCLoRA ranked first in the Overall Performance Track of the NTIRE 2025 Efficient Super-Resolution Challenge. Our code and models are made publicly available at this https URL.
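
One way to picture a spatial feature affinity loss is sketched below. It assumes the affinity is the cosine similarity between the feature vectors of every pair of spatial positions and that the student is penalized by the mean absolute difference between its affinity matrix and the teacher's. Both choices are assumptions made for illustration; the abstract does not give the exact form used in DSCLoRA.

```python
import math


def spatial_affinity(features):
    """features: N position vectors of length C. Returns the N x N cosine affinities."""
    normed = []
    for v in features:
        norm = math.sqrt(sum(x * x for x in v)) or 1.0  # guard against all-zero vectors
        normed.append([x / norm for x in v])
    return [[sum(a * b for a, b in zip(u, v)) for v in normed] for u in normed]


def affinity_distillation_loss(student_feats, teacher_feats):
    """Mean absolute difference between student and teacher affinity matrices."""
    a_s = spatial_affinity(student_feats)
    a_t = spatial_affinity(teacher_feats)
    n = len(a_s)
    return sum(abs(a_s[i][j] - a_t[i][j]) for i in range(n) for j in range(n)) / (n * n)


# Toy feature maps flattened to three spatial positions.
teacher = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
# Same spatial structure, different scale and channel count.
student_aligned = [[2.0, 0.0, 0.0], [0.0, 2.0, 0.0], [2.0, 2.0, 0.0]]
# Every position looks the same, so the spatial structure is lost.
student_collapsed = [[1.0, 0.0, 0.0], [1.0, 0.0, 0.0], [1.0, 0.0, 0.0]]

loss_aligned = affinity_distillation_loss(student_aligned, teacher)
loss_collapsed = affinity_distillation_loss(student_collapsed, teacher)
print(loss_aligned, loss_collapsed)
```

Because the affinity matrix is N x N regardless of channel count, the student and teacher do not need matching feature widths, and the pairwise products are what makes the transferred signal second-order.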

[CV-28] Single-Input Multi-Output Model Merging: Leveraging Foundation Models for Dense Multi-Task Learning

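[Quick read]: This paper tackles performance degradation in multi-task model merging, specifically in the single-input-multiple-outputs (SIMO) multi-task setting. With task-specific decoders and diverse loss objectives, conventional model merging methods produce misaligned representations, and performance drops significantly. The key contribution is two simple and effective fixes that re-align the feature representation after merging. By re-aligning the task-specific heads with the merged encoder, they resolve the problems caused by representation misalignment while remaining computationally efficient and flexible, and they allow task relationships to be identified offline. Experiments show that the approach enables multi-task capabilities and rivals traditional multi-task learning in performance while requiring fewer samples and training steps.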

Link: https://arxiv.org/abs/2504.11268
Authors: Juan Garcia Giraldo, Nikolaos Dimitriadis, Ke Wang, Pascal Frossard
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 22 pages, 6 figures

Abstract:Model merging is a flexible and computationally tractable approach to merge single-task checkpoints into a multi-task model. Prior work has solely focused on constrained multi-task settings where there is a one-to-one mapping between a sample and a task, overlooking the paradigm where multiple tasks may operate on the same sample, e.g., scene understanding. In this paper, we focus on the multi-task setting with single-input-multiple-outputs (SIMO) and show that it qualitatively differs from the single-input-single-output model merging settings studied in the literature due to the existence of task-specific decoders and diverse loss objectives. We identify that existing model merging methods lead to significant performance degradation, primarily due to representation misalignment between the merged encoder and task-specific decoders. We propose two simple and efficient fixes for the SIMO setting to re-align the feature representation after merging. Compared to joint fine-tuning, our approach is computationally efficient and flexible, and sheds light on identifying task relationships in an offline manner. Experiments on NYUv2, Cityscapes, and a subset of the Taskonomy dataset demonstrate: (1) task arithmetic suffices to enable multi-task capabilities; however, the representations generated by the merged encoder have to be re-aligned with the task-specific heads; (2) the proposed architecture rivals traditional multi-task learning in performance but requires fewer samples and training steps by leveraging the existence of task-specific models.
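
Task arithmetic, which the abstract names as the merging step, is usually described as adding scaled "task vectors" (fine-tuned weights minus pretrained weights) back onto the pretrained weights. The sketch below applies that rule to a toy encoder stored as a dictionary of flat weight lists. The parameter name, the two example tasks, and the scale are hypothetical, and the paper's re-alignment fixes are not shown.

```python
def task_arithmetic(pretrained, finetuned_models, scale):
    """Merge encoders as theta = theta_pre + scale * sum_t (theta_t - theta_pre)."""
    merged = {}
    for name, base in pretrained.items():
        merged[name] = [
            b + scale * sum(ft[name][i] - b for ft in finetuned_models)
            for i, b in enumerate(base)
        ]
    return merged


# Toy shared-encoder weights; the task-specific decoders are kept separate and unmerged.
pretrained = {"encoder.w": [1.0, 2.0]}
depth_encoder = {"encoder.w": [1.5, 2.0]}         # hypothetical depth-estimation checkpoint
segmentation_encoder = {"encoder.w": [1.0, 1.0]}  # hypothetical segmentation checkpoint

merged = task_arithmetic(pretrained, [depth_encoder, segmentation_encoder], scale=1.0)
print(merged)
```

In the SIMO setting each task keeps its own decoder, trained against its own fine-tuned encoder, so the features coming out of the merged encoder no longer match what each head expects. That mismatch is the representation misalignment the abstract describes.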

[CV-29] Enhanced Small Target Detection via Multi-Modal Fusion and Attention Mechanisms: A YOLOv5 Approach ATC2024

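[Quick read]: This paper addresses efficient, real-time detection of small targets in complex environments, particularly in military applications where interference makes small targets hard to detect. It proposes a small target detection method based on multi-modal image fusion and attention mechanisms. The method combines infrared and visible-light data with the YOLOv5 model and adds a convolutional attention module to improve detection performance. The key is to register the multi-modal dataset precisely through feature point matching and to use attention mechanisms to improve detection accuracy and robustness for small targets. Experiments on the anti-UAV and VisDrone datasets confirm the method's effectiveness, especially for small and dim targets.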

Link: https://arxiv.org/abs/2504.11262
Authors: Xiaoxiao Ma, Junxiong Tong
Affiliations: Northwestern Polytechnical University; University of Science and Technology of China
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ATC 2024

Abstract:With the rapid development of information technology, modern warfare increasingly relies on intelligence, making small target detection critical in military applications. The growing demand for efficient, real-time detection has created challenges in identifying small targets in complex environments due to interference. To address this, we propose a small target detection method based on multi-modal image fusion and attention mechanisms. This method leverages YOLOv5, integrating infrared and visible light data along with a convolutional attention module to enhance detection performance. The process begins with multi-modal dataset registration using feature point matching, ensuring accurate network training. By combining infrared and visible light features with attention mechanisms, the model improves detection accuracy and robustness. Experimental results on anti-UAV and VisDrone datasets demonstrate the effectiveness and practicality of our approach, achieving superior detection results for small and dim targets.

[CV-30] Next-Future: Sample-Efficient Policy Learning for Robotic-Arm Tasks

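[Quick read]: This paper addresses the limited sample efficiency and accuracy of Hindsight Experience Replay (HER) in multi-goal reinforcement learning (MRL). HER replays trajectories with redefined goals to learn from failed attempts, but its heuristic replay method lacks a principled framework. To overcome this limitation, the paper proposes a new replay strategy, "Next-Future", whose key idea is to focus on rewarding single-step transitions. This markedly improves the efficiency and accuracy of learning multi-goal Markov decision processes (MDPs), particularly when accuracy requirements are stringent. Across eight challenging robotic manipulation tasks, "Next-Future" improves sample efficiency in seven and success rate in six, and real-world experiments validate its practical feasibility for complex robotic-arm tasks.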

Link: https://arxiv.org/abs/2504.11247
Authors: Fikrican Özgür, René Zurbrügg, Suryansh Kumar
Affiliations: ETH Zürich; RSL Group at ETH Zürich; Visual Computing and Computational Media Section, College of PVFA, Department of Electric and Computer Engineering, and Department of Computer Science and Engineering at Texas A&M University
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 10 pages, 9 figures, 6 tables

Abstract:Hindsight Experience Replay (HER) is widely regarded as the state-of-the-art algorithm for achieving sample-efficient multi-goal reinforcement learning (RL) in robotic manipulation tasks with binary rewards. HER facilitates learning from failed attempts by replaying trajectories with redefined goals. However, it relies on a heuristic-based replay method that lacks a principled framework. To address this limitation, we introduce a novel replay strategy, “Next-Future”, which focuses on rewarding single-step transitions. This approach significantly enhances sample efficiency and accuracy in learning multi-goal Markov decision processes (MDPs), particularly under stringent accuracy requirements – a critical aspect for performing complex and precise robotic-arm tasks. We demonstrate the efficacy of our method by highlighting how single-step learning enables improved value approximation within the multi-goal RL framework. The performance of the proposed replay strategy is evaluated across eight challenging robotic manipulation tasks, using ten random seeds for training. Our results indicate substantial improvements in sample efficiency for seven out of eight tasks and higher success rates in six tasks. Furthermore, real-world experiments validate the practical feasibility of the learned policies, demonstrating the potential of “Next-Future” in solving complex robotic-arm tasks.
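
The goal relabeling that HER performs can be sketched in a few lines. `her_future` follows the widely described "future" strategy, which replaces the goal of a stored transition with a state achieved later in the same episode. `single_step` is only one possible reading of "rewarding single-step transitions": it relabels the goal with the state reached by that very transition, so the transition always earns the success reward. The paper's actual Next-Future rule may differ, and the 0 / -1 reward convention and the toy 1-D episode are assumptions.

```python
import random


def binary_reward(achieved, goal, tol=1e-6):
    """0 when the goal is reached, -1 otherwise (a common sparse-reward convention)."""
    return 0.0 if abs(achieved - goal) <= tol else -1.0


def relabel(step, new_goal):
    """Copy a transition with a substituted goal and the reward recomputed for it."""
    return {**step, "goal": new_goal,
            "reward": binary_reward(step["achieved_next"], new_goal)}


def her_future(episode, t, rng):
    """HER 'future' strategy: use the state achieved at a random step >= t as the goal."""
    future_t = rng.randint(t, len(episode) - 1)
    return relabel(episode[t], episode[future_t]["achieved_next"])


def single_step(episode, t):
    """Hypothetical single-step variant: the goal is what this transition achieved."""
    return relabel(episode[t], episode[t]["achieved_next"])


# Toy 1-D episode: the arm moves one unit per step and never reaches the goal at 5.0.
episode = [{"state": float(i), "action": 1.0, "achieved_next": float(i + 1),
            "goal": 5.0, "reward": binary_reward(float(i + 1), 5.0)}
           for i in range(3)]

rng = random.Random(0)
future_sample = her_future(episode, 0, rng)
next_sample = single_step(episode, 0)
print(future_sample["goal"], future_sample["reward"])
print(next_sample["goal"], next_sample["reward"])
```

Every original transition in the failed episode carries the reward -1, whereas the relabeled copies give the learner at least some transitions with a success signal.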

[CV-31] Leveraging multimodal explanatory annotations for video interpretation with Modality Specific Dataset

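[Quick read]: This paper studies how concept-informed supervision can improve multimodal video interpretation models. Its key idea is Concept Modality Specific Datasets (CMSDs): annotated concepts are grouped into data subsets by modality (visual, textual, or audio) so that models can exploit modality-specific annotations. Experiments show that models trained on CMSDs outperform traditional training in both early-fusion and late-fusion approaches, and that late-fusion models in particular come close to early-fusion performance. This indicates that modality-specific annotations are important for building robust, interpretable video models, and it advances interpretable multimodal learning for complex video analysis.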

Link: https://arxiv.org/abs/2504.11232
Authors: Elisa Ancarani, Julie Tores, Lucile Sassatelli, Rémy Sun, Hui-Yin Wu, Frédéric Precioso
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Comments: 6 pages, 8 Figures

Abstract:We examine the impact of concept-informed supervision on multimodal video interpretation models using MOByGaze, a dataset containing human-annotated explanatory concepts. We introduce Concept Modality Specific Datasets (CMSDs), which consist of data subsets categorized by the modality (visual, textual, or audio) of annotated concepts. Models trained on CMSDs outperform those using traditional legacy training in both early and late fusion approaches. Notably, this approach enables late fusion models to achieve performance close to that of early fusion models. These findings underscore the importance of modality-specific annotations in developing robust, self-explainable video models and contribute to advancing interpretable multimodal learning in complex video analysis.

[CV-32] CAP-Net: A Unified Network for 6D Pose and Size Estimation of Categorical Articulated Parts from a Single RGB-D Image CVPR2025

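[Quick read]: This paper addresses category-level pose estimation of articulated objects in robotic manipulation tasks and introduces a new benchmark dataset. Current methods typically rely on geometric cues and complex multi-stage pipelines that first segment parts from the point cloud and then estimate the Normalized Part Coordinate Space (NPCS) to obtain 6D poses. These methods ignore the dense semantic cues in RGB images, which leads to suboptimal accuracy, particularly for objects with small parts. To address this, the paper proposes CAP-Net, a single-stage network that estimates the 6D poses and sizes of categorical articulated parts. The method combines RGB-D features to produce instance segmentation and NPCS representations end to end. CAP-Net uses a unified network to predict point-wise class labels, centroid offsets, and NPCS maps simultaneously. A clustering algorithm then groups points of the same predicted class by their estimated centroid distances to isolate each part. Finally, the NPCS region of each part is aligned with the point cloud to recover its final pose and size. To bridge the sim-to-real domain gap, the paper also introduces the RGBD-Art dataset, the largest RGB-D articulated dataset to date, featuring photorealistic RGB images and depth noise simulated from real sensors. Experiments on RGBD-Art show that the method significantly outperforms the state-of-the-art approach. Real-world deployment in robotic tasks demonstrates its robustness and strong sim-to-real transfer, confirming its practical value. The dataset, code, and pre-trained models are available on the project page.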

Link: https://arxiv.org/abs/2504.11230
Authors: Jingshun Huang, Haitao Lin, Tianyu Wang, Yanwei Fu, Xiangyang Xue, Yi Zhu
Affiliations: Fudan University; Huawei, Noah’s Ark Lab
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: To appear in CVPR 2025 (Highlight)

Abstract:This paper tackles category-level pose estimation of articulated objects in robotic manipulation tasks and introduces a new benchmark dataset. While recent methods estimate part poses and sizes at the category level, they often rely on geometric cues and complex multi-stage pipelines that first segment parts from the point cloud, followed by Normalized Part Coordinate Space (NPCS) estimation for 6D poses. These approaches overlook dense semantic cues from RGB images, leading to suboptimal accuracy, particularly for objects with small parts. To address these limitations, we propose a single-stage Network, CAP-Net, for estimating the 6D poses and sizes of Categorical Articulated Parts. This method combines RGB-D features to generate instance segmentation and NPCS representations for each part in an end-to-end manner. CAP-Net uses a unified network to simultaneously predict point-wise class labels, centroid offsets, and NPCS maps. A clustering algorithm then groups points of the same predicted class based on their estimated centroid distances to isolate each part. Finally, the NPCS region of each part is aligned with the point cloud to recover its final pose and size. To bridge the sim-to-real domain gap, we introduce the RGBD-Art dataset, the largest RGB-D articulated dataset to date, featuring photorealistic RGB images and depth noise simulated from real sensors. Experimental evaluations on the RGBD-Art dataset demonstrate that our method significantly outperforms the state-of-the-art approach. Real-world deployments of our model in robotic tasks underscore its robustness and exceptional sim-to-real transfer capabilities, confirming its substantial practical utility. Our dataset, code and pre-trained models are available on the project page.
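
The grouping step, in which points of the same predicted class are clustered by their estimated centroids, can be sketched as follows. Each point votes for a part centroid by adding its predicted offset to its own position, and votes that share a class label and fall within `radius` of an existing cluster's mean vote are merged into that cluster. The abstract does not name the clustering algorithm, so this greedy scheme, the `radius` parameter, and the toy data are illustrative stand-ins.

```python
import math


def cluster_parts(points, offsets, labels, radius):
    """Greedily group points whose centroid votes agree and whose class labels match."""
    clusters = []  # each cluster: {"label": ..., "sum": [x, y, z], "members": [indices]}
    for idx, (point, offset, label) in enumerate(zip(points, offsets, labels)):
        vote = [p + o for p, o in zip(point, offset)]  # predicted part centroid
        for cluster in clusters:
            center = [s / len(cluster["members"]) for s in cluster["sum"]]
            if cluster["label"] == label and math.dist(vote, center) <= radius:
                cluster["members"].append(idx)
                cluster["sum"] = [s + v for s, v in zip(cluster["sum"], vote)]
                break
        else:
            # No compatible cluster was found, so this vote starts a new part instance.
            clusters.append({"label": label, "sum": vote, "members": [idx]})
    return clusters


# Two drawers of the same class centred near x = 0 and x = 1, plus one handle.
points = [(0.1, 0.0, 0.0), (-0.1, 0.0, 0.0), (1.1, 0.0, 0.0), (0.9, 0.0, 0.0), (0.0, 0.5, 0.0)]
offsets = [(-0.1, 0.0, 0.0), (0.1, 0.0, 0.0), (-0.1, 0.0, 0.0), (0.1, 0.0, 0.0), (0.0, 0.0, 0.0)]
labels = ["drawer", "drawer", "drawer", "drawer", "handle"]

clusters = cluster_parts(points, offsets, labels, radius=0.2)
for cluster in clusters:
    print(cluster["label"], cluster["members"])
```

Class labels alone cannot separate the two drawers; the centroid votes are what split them into distinct instances.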

[CV-33] 3DAffordSplat: Efficient Affordance Reasoning with 3D Gaussians

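[Quick read]: This paper addresses the limited generalizability and robustness of 3D affordance reasoning methods that depend on sparse 3D point clouds, which stem from their sensitivity to coordinate variations and the inherent sparsity of the data. The key contributions are the 3DAffordSplat dataset and the AffordSplatNet model. 3DAffordSplat is the first large-scale, multi-modal dataset for affordance reasoning based on 3D Gaussian Splatting (3DGS), with rich annotations that support capturing fine-grained affordance details. AffordSplatNet uses a cross-modal structure alignment module that exploits structural consistency priors to align 3D point cloud and 3DGS representations, which improves affordance recognition accuracy. Together they advance affordance learning in the 3DGS domain and show strong generalization in both seen and unseen settings.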

Link: https://arxiv.org/abs/2504.11218
Authors: Zeming Wei, Junyi Lin, Yang Liu, Weixing Chen, Jingzhou Luo, Guanbin Li, Liang Lin
Affiliations: Sun Yat-sen University; Peng Cheng Laboratory; Guangdong Key Laboratory of Big Data Analysis and Processing
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: The first large-scale 3D Gaussians Affordance Reasoning Benchmark

Abstract:3D affordance reasoning is essential in associating human instructions with the functional regions of 3D objects, facilitating precise, task-oriented manipulations in embodied AI. However, current methods, which predominantly depend on sparse 3D point clouds, exhibit limited generalizability and robustness due to their sensitivity to coordinate variations and the inherent sparsity of the data. By contrast, 3D Gaussian Splatting (3DGS) delivers high-fidelity, real-time rendering with minimal computational overhead by representing scenes as dense, continuous distributions. This positions 3DGS as a highly effective approach for capturing fine-grained affordance details and improving recognition accuracy. Nevertheless, its full potential remains largely untapped due to the absence of large-scale, 3DGS-specific affordance datasets. To overcome these limitations, we present 3DAffordSplat, the first large-scale, multi-modal dataset tailored for 3DGS-based affordance reasoning. This dataset includes 23,677 Gaussian instances, 8,354 point cloud instances, and 6,631 manually annotated affordance labels, encompassing 21 object categories and 18 affordance types. Building upon this dataset, we introduce AffordSplatNet, a novel model specifically designed for affordance reasoning using 3DGS representations. AffordSplatNet features an innovative cross-modal structure alignment module that exploits structural consistency priors to align 3D point cloud and 3DGS representations, resulting in enhanced affordance recognition accuracy. Extensive experiments demonstrate that the 3DAffordSplat dataset significantly advances affordance learning within the 3DGS domain, while AffordSplatNet consistently outperforms existing methods across both seen and unseen settings, highlighting its robust generalization capabilities.

[CV-34] Focal Split: Untethered Snapshot Depth from Differential Defocus CVPR2025

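[Quick read]: This paper develops a low-power, efficient handheld depth camera for passive scene depth sensing. Conventional depth cameras usually rely on active light sources (such as lasers or infrared), whereas this work builds on depth-from-differential-defocus (DfDD) theory and needs no additional light source. The key innovation is an achromatic optical system that captures two differently defocused images of the scene at the same time, combined with a data processing method that computes a depth and a confidence value for each pixel using only 500 floating point operations (FLOPs) per pixel. This integration of an efficient algorithm with the hardware substantially lowers energy consumption, lets the system run in real time on battery power, and makes it friendly to DIY builders.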

Link: https://arxiv.org/abs/2504.11202
Authors: Junjie Luo, John Mamish, Alan Fu, Thomas Concannon, Josiah Hester, Emma Alexander, Qi Guo
Affiliations: Elmore Family School of Electrical and Computer Engineering, Purdue University; College of Computing, Georgia Institute of Technology; McCormick School of Engineering, Northwestern University
Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV); Signal Processing (eess.SP)
Comments: CVPR 2025, 8 pages, 7 figures

Abstract:We introduce Focal Split, a handheld, snapshot depth camera with fully onboard power and computing based on depth-from-differential-defocus (DfDD). Focal Split is passive, avoiding power consumption of light sources. Its achromatic optical system simultaneously forms two differentially defocused images of the scene, which can be independently captured using two photosensors in a snapshot. The data processing is based on the DfDD theory, which efficiently computes a depth and a confidence value for each pixel with only 500 floating point operations (FLOPs) per pixel from the camera measurements. We demonstrate a Focal Split prototype, which comprises a handheld custom camera system connected to a Raspberry Pi 5 for real-time data processing. The system consumes 4.9 W and is powered on a 5 V, 10,000 mAh battery. The prototype can measure objects with distances from 0.4 m to 1.2 m, outputting 480 × 360 sparse depth maps at 2.1 frames per second (FPS) using unoptimized Python scripts. Focal Split is DIY friendly. A comprehensive guide to building your own Focal Split depth camera, code, and additional data can be found at this https URL.
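
The figures quoted in the abstract allow a quick back-of-the-envelope check of battery life and compute load. The sketch assumes an ideal battery with no conversion losses and assumes that every pixel of a 480 × 360 frame is processed; since the output depth maps are sparse, the FLOP numbers are better read as an upper bound. Neither estimate comes from the paper.

```python
# Values taken from the abstract.
battery_voltage_v = 5.0
battery_capacity_mah = 10_000
system_power_w = 4.9
width, height = 480, 360
flops_per_pixel = 500
frames_per_second = 2.1

# Battery energy in watt-hours: V * Ah.
energy_wh = battery_voltage_v * battery_capacity_mah / 1000.0  # 50 Wh
# Ideal runtime, ignoring conversion losses and battery derating.
runtime_h = energy_wh / system_power_w                         # about 10.2 hours

# Depth computation cost if every pixel is processed.
flops_per_frame = width * height * flops_per_pixel             # 86.4 million FLOPs
flops_per_second = flops_per_frame * frames_per_second         # about 181 million FLOPs/s

print(f"{energy_wh:.0f} Wh, {runtime_h:.1f} h")
print(f"{flops_per_frame / 1e6:.1f} MFLOPs per frame, {flops_per_second / 1e6:.0f} MFLOP/s")
```

A load of this size is very small for a Raspberry Pi 5, which fits the abstract's remark that the 2.1 FPS figure comes from unoptimized Python scripts.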

[CV-35] Video Summarization with Large Language Models CVPR2025

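[Quick read]: This paper addresses the challenges that the exponential growth of video content poses for efficient navigation, search, and retrieval, and in particular the failure of existing video summarization methods, which rely heavily on visual features and temporal dynamics, to capture semantics, leading to incomplete or incoherent summaries. The key solution is a new framework, LLM-based Video Summarization (LLMVS), built on Large Language Models (LLMs). A Multi-modal Large Language Model (M-LLM) translates video frames into a sequence of descriptive captions, and an LLM then assesses the importance of each frame from the captions in its local context. A global attention mechanism refines these local importance scores in the context of all the video's captions, so that the summary reflects both the details and the overall narrative. Combining local and global semantic understanding in this way addresses the subjectivity inherent in defining keyframes.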

Link: https://arxiv.org/abs/2504.11199
Authors: Min Jung Lee, Dayoung Gong, Minsu Cho
Affiliations: Pohang University of Science and Technology (POSTECH); GenGenAI
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to CVPR 2025

Abstract:The exponential increase in video content poses significant challenges in terms of efficient navigation, search, and retrieval, thus requiring advanced video summarization techniques. Existing video summarization methods, which heavily rely on visual features and temporal dynamics, often fail to capture the semantics of video content, resulting in incomplete or incoherent summaries. To tackle the challenge, we propose a new video summarization framework that leverages the capabilities of recent Large Language Models (LLMs), expecting that the knowledge learned from massive data enables LLMs to evaluate video frames in a manner that better aligns with diverse semantics and human judgments, effectively addressing the inherent subjectivity in defining keyframes. Our method, dubbed LLM-based Video Summarization (LLMVS), translates video frames into a sequence of captions using a Multi-modal Large Language Model (M-LLM) and then assesses the importance of each frame using an LLM, based on the captions in its local context. These local importance scores are refined through a global attention mechanism in the entire context of video captions, ensuring that our summaries effectively reflect both the details and the overarching narrative. Our experimental results demonstrate the superiority of the proposed method over existing ones in standard benchmarks, highlighting the potential of LLMs in the processing of multimedia content.

[CV-36] R-TPT: Improving Adversarial Robustness of Vision-Language Models through Test-Time Prompt Tuning CVPR2025

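[Quick read]: This paper addresses the high risk that vision-language models (VLMs) face under adversarial attacks, which arises from their inherent vulnerability and from the common practice of selecting from a limited set of open-source models. Existing defenses mostly rely on adversarial fine-tuning during training, which requires labeled data and lacks flexibility for downstream tasks. The key solution is Robust Test-Time Prompt Tuning (R-TPT), which improves robustness by mitigating the impact of adversarial attacks at inference time. R-TPT first reformulates the classic marginal entropy objective, removing the term that introduces conflicts under adversarial conditions and keeping only pointwise entropy minimization. It also introduces a plug-and-play, reliability-based weighted ensembling strategy that aggregates useful information from reliable augmented views to strengthen the defense. The method requires no labeled training data, offers high flexibility for inference tasks, and is validated on widely used benchmarks.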

Link: https://arxiv.org/abs/2504.11195
Authors: Lijun Sheng, Jian Liang, Zilei Wang, Ran He
Affiliations: University of Science and Technology of China; NLPR & MAIS, Institute of Automation, Chinese Academy of Sciences; University of Chinese Academy of Sciences
Categories: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)
Comments: CVPR 2025

Abstract:Vision-language models (VLMs), such as CLIP, have gained significant popularity as foundation models, with numerous fine-tuning methods developed to enhance performance on downstream tasks. However, due to their inherent vulnerability and the common practice of selecting from a limited set of open-source models, VLMs suffer from a higher risk of adversarial attacks than traditional vision models. Existing defense techniques typically rely on adversarial fine-tuning during training, which requires labeled data and lacks flexibility for downstream tasks. To address these limitations, we propose robust test-time prompt tuning (R-TPT), which mitigates the impact of adversarial attacks during the inference stage. We first reformulate the classic marginal entropy objective by eliminating the term that introduces conflicts under adversarial conditions, retaining only the pointwise entropy minimization. Furthermore, we introduce a plug-and-play reliability-based weighted ensembling strategy, which aggregates useful information from reliable augmented views to strengthen the defense. R-TPT enhances defense against adversarial attacks without requiring labeled training data while offering high flexibility for inference tasks. Extensive experiments on widely used benchmarks with various attacks demonstrate the effectiveness of R-TPT. The code is available at this https URL.
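
The relation between marginal entropy and pointwise entropy can be checked numerically. For predictions p_1..p_N over augmented views with mean p̄, the identity H(p̄) = mean_i H(p_i) + mean_i KL(p_i || p̄) always holds, so the marginal entropy is the average pointwise entropy plus a divergence term. Reading the divergence term as the one R-TPT removes is an interpretation of the abstract, not a statement from the paper. The ensembling weights below, `exp(-entropy)`, are likewise a hypothetical stand-in for the paper's reliability measure.

```python
import math


def entropy(p):
    """Shannon entropy in nats."""
    return -sum(x * math.log(x) for x in p if x > 0)


def kl(p, q):
    """KL divergence KL(p || q) in nats."""
    return sum(x * math.log(x / y) for x, y in zip(p, q) if x > 0)


def mean_distribution(views):
    """Uniform average of the per-view class distributions."""
    n = len(views)
    return [sum(v[c] for v in views) / n for c in range(len(views[0]))]


def reliability_weighted_ensemble(views):
    """Average the views with weights exp(-entropy), so confident views count more."""
    weights = [math.exp(-entropy(v)) for v in views]
    total = sum(weights)
    return [sum(w * v[c] for w, v in zip(weights, views)) / total
            for c in range(len(views[0]))]


# Class probabilities predicted for three augmented views of one test image.
views = [[0.9, 0.05, 0.05], [0.6, 0.3, 0.1], [0.2, 0.5, 0.3]]

p_bar = mean_distribution(views)
marginal = entropy(p_bar)
pointwise = sum(entropy(v) for v in views) / len(views)
divergence = sum(kl(v, p_bar) for v in views) / len(views)
ensemble = reliability_weighted_ensemble(views)

print(marginal, pointwise + divergence)
print(p_bar[0], ensemble[0])
```

Minimizing only the pointwise term sharpens each view's prediction on its own, and the weighted ensemble then leans toward the most confident view, which in this toy case is the first one.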

[CV-37] TerraMesh: A Planetary Mosaic of Multimodal Earth Observation Data

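[Quick read]: This paper addresses the limitations of existing public datasets in scale, geographic coverage, or sensor variety. The key contribution is TerraMesh, a globally diverse, multimodal dataset that combines optical, synthetic aperture radar, elevation, and land-cover modalities in an Analysis-Ready Data format. TerraMesh provides over 9 million samples with eight spatiotemporally aligned modalities, supporting large-scale pre-training and robust cross-modal correlation learning. The paper also describes the data processing steps in detail, reports comprehensive statistics, and gives empirical evidence that pre-training on TerraMesh improves model performance.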

Link: https://arxiv.org/abs/2504.11172
Authors: Benedikt Blumenstiel, Paolo Fraccaro, Valerio Marsocci, Johannes Jakubik, Stefano Maurogiovanni, Mikolaj Czerkawski, Rocco Sedona, Gabriele Cavallaro, Thomas Brunschwiler, Juan Bernabe-Moreno, Nicolas Longépé
Affiliations: IBM Research – Europe; European Space Agency Φ-Lab; Forschungszentrum Jülich; University of Iceland
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Large-scale foundation models in Earth Observation can learn versatile, label-efficient representations by leveraging massive amounts of unlabeled data. However, existing public datasets are often limited in scale, geographic coverage, or sensor variety. We introduce TerraMesh, a new globally diverse, multimodal dataset combining optical, synthetic aperture radar, elevation, and land-cover modalities in an Analysis-Ready Data format. TerraMesh includes over 9 million samples with eight spatiotemporally aligned modalities, enabling large-scale pre-training and fostering robust cross-modal correlation learning. We provide detailed data processing steps, comprehensive statistics, and empirical evidence demonstrating improved model performance when pre-trained on TerraMesh. The dataset will be made publicly available with a permissive license.

[CV-38] TerraMind: Large-Scale Generative Multimodality for Earth Observation

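[Quick read]: This paper tackles the effective fusion and use of heterogeneous multi-source data in cross-modal Earth Observation (EO) tasks. Conventional approaches to multimodal data are often limited to a single scale or a single representation, which makes it hard to capture high-level semantics and fine-grained spatial detail at once. The paper proposes TerraMind, an any-to-any generative multimodal foundation model. Its key innovation is a dual-scale early fusion strategy: token-level representations learn high-level cross-modal context, while pixel-level representations capture critical spatial nuances. TerraMind also introduces "Thinking-in-Modalities" (TiM), the ability to generate additional artificial data during fine-tuning and inference to improve the model output. These methods improve zero-shot and few-shot performance and achieve beyond state-of-the-art results on community-standard benchmarks such as PANGAEA.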

Link: https://arxiv.org/abs/2504.11171
Authors: Johannes Jakubik, Felix Yang, Benedikt Blumenstiel, Erik Scheurer, Rocco Sedona, Stefano Maurogiovanni, Jente Bosmans, Nikolaos Dionelis, Valerio Marsocci, Niklas Kopp, Rahul Ramachandran, Paolo Fraccaro, Thomas Brunschwiler, Gabriele Cavallaro, Juan Bernabe-Moreno, Nicolas Longépé
Affiliations: IBM Research – Europe; ETH Zurich; Forschungszentrum Jülich; European Space Agency Φ-Lab; NASA IMPACT; University of Iceland
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:We present TerraMind, the first any-to-any generative, multimodal foundation model for Earth observation (EO). Unlike other multimodal models, TerraMind is pretrained on dual-scale representations combining both token-level and pixel-level data across modalities. On a token level, TerraMind encodes high-level contextual information to learn cross-modal relationships, while on a pixel level, TerraMind leverages fine-grained representations to capture critical spatial nuances. We pretrained TerraMind on nine geospatial modalities of a global, large-scale dataset. In this paper, we demonstrate that (i) TerraMind’s dual-scale early fusion approach unlocks a range of zero-shot and few-shot applications for Earth observation, (ii) TerraMind introduces “Thinking-in-Modalities” (TiM) – the capability of generating additional artificial data during finetuning and inference to improve the model output – and (iii) TerraMind achieves beyond state-of-the-art performance in community-standard benchmarks for EO like PANGAEA. The pretraining dataset, the model weights, and our code are open-sourced under a permissive license.

[CV-39] YOLO-RS: Remote Sensing Enhanced Crop Detection Methods

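[Quick read]: This paper addresses the poor performance of existing object detection methods on small targets in remote sensing images, especially with complex backgrounds and image mixing, where they fall short of practical requirements. The key solution is a new model, YOLO-RS, based on the recent Yolov11, which improves small-target detection by introducing the Context Anchor Attention (CAA) mechanism and an efficient multi-field multi-scale feature fusion network. The model adopts a bidirectional feature fusion strategy to strengthen small-target detection, and an ACmix module at the end of the backbone network addresses class imbalance, improving detection accuracy in complex scenes. Experiments show that on the PDT and CWC datasets YOLO-RS improves recall and mean average precision (mAP) by about 2-3% and also raises the F1-score, while adding only about 5.2 GFLOPs of computational complexity.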

Link: https://arxiv.org/abs/2504.11165
Authors: Linlin Xiao, Zhang Tiancong, Yutong Jia, Xinyu Nie, Mengyao Wang, Xiaohang Shao
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:With the rapid development of remote sensing technology, crop classification and health detection based on deep learning have gradually become a research hotspot. However, existing target detection methods perform poorly on small targets in remote sensing images, especially with complex backgrounds and image mixing, and struggle to meet practical application requirements. To address this problem, a novel target detection model, YOLO-RS, is proposed in this paper. The model is based on the latest Yolov11 and significantly enhances the detection of small targets by introducing the Context Anchor Attention (CAA) mechanism and an efficient multi-field multi-scale feature fusion network. YOLO-RS adopts a bidirectional feature fusion strategy in the feature fusion process, which effectively enhances the model’s performance in the detection of small targets. Meanwhile, the ACmix module at the end of the model backbone network solves the category imbalance problem by adaptively adjusting the contrast and sample mixing, thus enhancing the detection accuracy in complex scenes. In the experiments on the PDT remote sensing crop health detection dataset and the CWC crop classification dataset, YOLO-RS improves both the recall and the mean average precision (mAP) by about 2-3% compared with the existing state-of-the-art methods, while the F1-score is also significantly improved. Moreover, the computational complexity of the model only increases by about 5.2 GFLOPs, indicating its significant advantages in both performance and efficiency. The experimental results validate the effectiveness and application potential of YOLO-RS in the task of detecting small targets in remote sensing images.

[CV-40] TSAL: Few-shot Text Segmentation Based on Attribute Learning

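[Quick read]: This paper addresses the lack of high-quality datasets and the high cost of pixel-level annotation in scene text segmentation by applying few-shot learning to the task. The key to the solution is the TSAL framework, which leverages CLIP's prior knowledge to learn text attributes for segmentation. A visual-guided branch separately extracts text and background features to make full use of the semantic and texture information in the image, while an adaptive prompt-guided branch uses adaptive prompt templates to capture various text attributes, reducing data dependency and improving text detection accuracy. An Adaptive Feature Alignment (AFA) module lets the adaptive prompts capture both general and distinctive text attribute information. Experiments show that TSAL achieves state-of-the-art performance on multiple text segmentation datasets under few-shot settings and shows strong potential in text-related domains.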

Link: https://arxiv.org/abs/2504.11164
Authors: Chenming Li,Chengxu Liu,Yuanting Fan,Xiao Jin,Xingsong Hou,Xueming Qian
Affiliations: Xi’an Jiaotong University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Supervised learning for scene text segmentation has developed rapidly in recent years. However, the lack of high-quality datasets and the high cost of pixel annotation greatly limit its development. Considering how well few-shot learning methods perform on downstream tasks, we investigate the application of few-shot learning to scene text segmentation. We propose TSAL, which leverages CLIP’s prior knowledge to learn text attributes for segmentation. To fully utilize the semantic and texture information in the image, a visual-guided branch is proposed to separately extract text and background features. To reduce data dependency and improve text detection accuracy, the adaptive prompt-guided branch employs effective adaptive prompt templates to capture various text attributes. To enable adaptive prompts to capture distinctive text features and complex background distributions, we propose the Adaptive Feature Alignment (AFA) module. By aligning learnable tokens of different attributes with visual features and prompt prototypes, AFA enables adaptive prompts to capture both general and distinctive attribute information. TSAL can capture the unique attributes of text and achieve precise segmentation using only a few images. Experiments demonstrate that our method achieves SOTA performance on multiple text segmentation datasets under few-shot settings and shows great potential in text-related domains.

[CV-41] DMAGaze: Gaze Estimation Based on Feature Disentanglement and Multi-Scale Attention

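[Quick read]: This paper aims to reduce the interference caused by complex gaze-irrelevant information in face images and thereby improve gaze estimation accuracy. The key to the solution is DMAGaze, a new framework that draws on three kinds of features from facial images: gaze-relevant global features (disentangled from the full facial image), local eye features (extracted from cropped eye patches), and head pose estimation features. Specifically, DMAGaze uses a new continuous mask-based Disentangler that separately reconstructs the eye and non-eye regions, achieving dual-branch disentanglement of gaze-relevant and gaze-irrelevant information. It also introduces the Multi-Scale Global Local Attention Module (MS-GLAM), whose customized cascaded attention structure attends to global and local information at multiple scales and further enhances the information from the Disentangler. Finally, the global gaze-relevant features disentangled by the upper face branch are combined with head pose and local eye features and passed through a detection head for high-precision gaze estimation.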

Link: https://arxiv.org/abs/2504.11160
Authors: Haohan Chen,Hongjia Liu,Shiyong Lan,Wenwu Wang,Yixin Qiao,Yao Li,Guonan Deng
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Gaze estimation, which predicts gaze direction, commonly faces the challenge of interference from complex gaze-irrelevant information in face images. In this work, we propose DMAGaze, a novel gaze estimation framework that exploits information from facial images in three aspects: gaze-relevant global features (disentangled from facial image), local eye features (extracted from cropped eye patch), and head pose estimation features, to improve overall performance. Firstly, we design a new continuous mask-based Disentangler to accurately disentangle gaze-relevant and gaze-irrelevant information in facial images by achieving the dual-branch disentanglement goal through separately reconstructing the eye and non-eye regions. Furthermore, we introduce a new cascaded attention module named Multi-Scale Global Local Attention Module (MS-GLAM). Through a customized cascaded attention structure, it effectively focuses on global and local information at multiple scales, further enhancing the information from the Disentangler. Finally, the global gaze-relevant features disentangled by the upper face branch, combined with head pose and local eye features, are passed through the detection head for high-precision gaze estimation. Our proposed DMAGaze has been extensively validated on two mainstream public datasets, achieving state-of-the-art performance.

[CV-42] SAR-to-RGB Translation with Latent Diffusion for Earth Observation

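[Quick read]: This paper addresses the frequent unavailability of Sentinel-2 (S2) optical remote sensing images due to cloud cover or data gaps. It proposes a diffusion model (DM)-based approach to SAR-to-RGB translation that synthesizes optical images to fill in the missing RGB imagery. The key to the solution is three diffusion setups: two use Standard Diffusion, reconstructing S2 images by adding and removing noise, one without and one with class conditioning; the third uses Cold Diffusion, which blends S2 with Sentinel-1 (S1) data before removing the SAR signal. The results show that class conditioning improves land cover classification accuracy, and that the Cold Diffusion setup, despite lower perceptual quality, performs well in land cover classification, suggesting that traditional quantitative evaluation metrics may not fully reflect the practical utility of generated images. These findings highlight the potential of diffusion models for remote sensing applications where RGB data are missing.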

Link: https://arxiv.org/abs/2504.11154
Authors: Kaan Aydin,Joelle Hanna,Damian Borth
Affiliations: University of St. Gallen
Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments: 10 pages, 3 figures

Abstract:Earth observation satellites like Sentinel-1 (S1) and Sentinel-2 (S2) provide complementary remote sensing (RS) data, but S2 images are often unavailable due to cloud cover or data gaps. To address this, we propose a diffusion model (DM)-based approach for SAR-to-RGB translation, generating synthetic optical images from SAR inputs. We explore three different setups: two using Standard Diffusion, which reconstruct S2 images by adding and removing noise (one without and one with class conditioning), and one using Cold Diffusion, which blends S2 with S1 before removing the SAR signal. We evaluate the generated images in downstream tasks, including land cover classification and cloud removal. While generated images may not perfectly replicate real S2 data, they still provide valuable information. Our results show that class conditioning improves classification accuracy, while cloud removal performance remains competitive despite our approach not being optimized for it. Interestingly, despite exhibiting lower perceptual quality, the Cold Diffusion setup performs well in land cover classification, suggesting that traditional quantitative evaluation metrics may not fully reflect the practical utility of generated images. Our findings highlight the potential of DMs for SAR-to-RGB translation in RS applications where RGB images are missing.
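
The Cold Diffusion setup described above replaces Gaussian noise with a deterministic degradation that blends the S2 image toward the S1 image. A minimal sketch of such a forward process follows, assuming a linear blending schedule; the schedule, the function names, and the toy pixel values are illustrative assumptions and are not taken from the paper.

```python
def blend(s2_value, s1_value, t, num_steps):
    """Blend one optical (S2) value toward its SAR (S1) counterpart.

    Assumed linear schedule: t = 0 returns the S2 value unchanged,
    t = num_steps returns the S1 value.
    """
    w = t / num_steps
    return (1.0 - w) * s2_value + w * s1_value


def forward_trajectory(s2, s1, num_steps):
    """Return the progressively degraded images for t = 0 .. num_steps."""
    return [
        [blend(a, b, t, num_steps) for a, b in zip(s2, s1)]
        for t in range(num_steps + 1)
    ]


# Toy "images" flattened to three pixels each (hypothetical values).
s2 = [0.2, 0.8, 0.5]
s1 = [0.6, 0.0, 0.5]
traj = forward_trajectory(s2, s1, num_steps=4)
```

In this sketch a restoration network would be trained to step back from `traj[t]` toward `traj[0]`, starting at inference time from the SAR image alone.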

[CV-43] GC-GAT: Multimodal Vehicular Trajectory Prediction using Graph Goal Conditioning and Cross-context Attention

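[Quick read]: This paper addresses the prediction of future trajectories of surrounding vehicles in autonomous driving, where the core challenge is making effective use of scene context, both static (lanes, regulatory elements) and dynamic (traffic participants). It proposes a lane graph-based motion prediction model that first predicts graph-based goal proposals and then fuses them with multiple contextual elements through cross attention. Specifically, the encoder encodes scene context with lightweight Gated Recurrent Units (GRU), the interactor applies cross-context attention over the encoded scene features and goal proposals, and the decoder regresses multimodal trajectories from the aggregated encodings with a Laplacian Mixture Density Network. Applying cross attention over graph-based goal proposals lets the model attend to goal-relevant scene elements, which makes the trajectory estimates more robust. The method achieves state-of-the-art results on the nuScenes dataset.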

Link: https://arxiv.org/abs/2504.11150
Authors: Mahir Gulzar,Yar Muhammad,Naveed Muhammad
Affiliations: Institute of Computer Science, University of Tartu; University of Hertfordshire
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract:Predicting future trajectories of surrounding vehicles heavily relies on what contextual information is given to a motion prediction model. The context itself can be static (lanes, regulatory elements, etc) or dynamic (traffic participants). This paper presents a lane graph-based motion prediction model that first predicts graph-based goal proposals and later fuses them with cross attention over multiple contextual elements. We follow the famous encoder-interactor-decoder architecture where the encoder encodes scene context using lightweight Gated Recurrent Units, the interactor applies cross-context attention over encoded scene features and graph goal proposals, and the decoder regresses multimodal trajectories via Laplacian Mixture Density Network from the aggregated encodings. Using cross-attention over graph-based goal proposals gives robust trajectory estimates since the model learns to attend to future goal-relevant scene elements for the intended agent. We evaluate our work on nuScenes motion prediction dataset, achieving state-of-the-art results.
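
To make the decoder's output concrete, the following sketch computes the negative log-likelihood of a ground-truth trajectory under a Laplace mixture, which is the usual training objective for a mixture density head. The data layout (one shared scale per mode, independent x and y coordinates) is a simplifying assumption for illustration and may differ from the paper's parameterization.

```python
import math


def laplace_log_pdf(x, mu, b):
    """Log-density of a Laplace distribution with location mu and scale b."""
    return -math.log(2.0 * b) - abs(x - mu) / b


def mixture_nll(target, modes):
    """Negative log-likelihood of a trajectory under a Laplace mixture.

    target: list of (x, y) waypoints.
    modes:  list of dicts with keys 'weight', 'mean' (list of (x, y)), 'scale'.
    """
    log_terms = []
    for mode in modes:
        log_p = math.log(mode["weight"])
        for (tx, ty), (mx, my) in zip(target, mode["mean"]):
            log_p += laplace_log_pdf(tx, mx, mode["scale"])
            log_p += laplace_log_pdf(ty, my, mode["scale"])
        log_terms.append(log_p)
    # Log-sum-exp over modes for numerical stability.
    m = max(log_terms)
    return -(m + math.log(sum(math.exp(v - m) for v in log_terms)))


target = [(1.0, 1.0)]
single_mode = [{"weight": 1.0, "mean": [(1.0, 1.0)], "scale": 1.0}]
far_mode = [{"weight": 1.0, "mean": [(5.0, 5.0)], "scale": 1.0}]
nll_match = mixture_nll(target, single_mode)
nll_far = mixture_nll(target, far_mode)
```

A mode whose mean lies on the target yields a lower loss than one far away, which is what pushes at least one predicted mode toward the observed future.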

[CV-44] Taming Consistency Distillation for Accelerated Human Image Animation

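[Quick read]: This paper addresses the high inference cost and slow speed of existing video diffusion-based human image animation methods, which rely on many iterative denoising steps. It also notes that directly adopting consistency models as an acceleration strategy, although effective, tends to cause quality problems in human image animation, including visual blurring, motion degradation, and facial distortion, particularly in dynamic regions. To address this, the paper proposes DanceLCM with two key enhancements: (1) segmented consistency distillation with a lightweight auxiliary head that incorporates supervision from real video latents, mitigating the cumulative errors of single full-trajectory generation; (2) a motion-focused loss that concentrates on motion regions, together with explicit injection of facial fidelity features to improve face authenticity. Experiments show that DanceLCM achieves results comparable to state-of-the-art video diffusion models with only 2-4 inference steps, significantly reducing the inference burden without compromising video quality.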

Link: https://arxiv.org/abs/2504.11143
Authors: Xiang Wang,Shiwei Zhang,Hangjie Yuan,Yujie Wei,Yingya Zhang,Changxin Gao,Yuehuan Wang,Nong Sang
Affiliations: Key Laboratory of Image Processing and Intelligent Control, School of Artificial Intelligence and Automation, Huazhong University of Science and Technology; Alibaba Group; Fudan University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Recent advancements in human image animation have been propelled by video diffusion models, yet their reliance on numerous iterative denoising steps results in high inference costs and slow speeds. An intuitive solution involves adopting consistency models, which serve as an effective acceleration paradigm through consistency distillation. However, simply employing this strategy in human image animation often leads to quality decline, including visual blurring, motion degradation, and facial distortion, particularly in dynamic regions. In this paper, we propose the DanceLCM approach, complemented by several enhancements that improve visual quality and motion continuity in the low-step regime: (1) segmented consistency distillation with an auxiliary light-weight head to incorporate supervision from real video latents, mitigating cumulative errors resulting from single full-trajectory generation; (2) a motion-focused loss to centre on motion regions, and explicit injection of facial fidelity features to improve face authenticity. Extensive qualitative and quantitative experiments demonstrate that DanceLCM achieves results comparable to state-of-the-art video diffusion models with a mere 2-4 inference steps, significantly reducing the inference burden without compromising video quality. The code and models will be made publicly available.

[CV-45] Visual Re-Ranking with Non-Visual Side Information

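[Quick read]: This paper aims to improve image retrieval accuracy in visual place recognition. The standard approach retrieves candidates with global image descriptors and then refines the ranking with re-ranking methods based on the same descriptors, which provides limited additional signal. The key to the solution is Generalized Contextual Similarity Aggregation (GCSA), a graph neural network-based re-ranking method that uses not only the visual descriptors but also other available side information, such as the signal strength of nearby WiFi or Bluetooth endpoints, or geometric properties such as camera poses for database images. Affinity vectors provide a shared encoding of this heterogeneous multi-modal input. Experiments on two large-scale datasets show significant improvements on both image retrieval metrics and the downstream visual localization task.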

Link: https://arxiv.org/abs/2504.11134
Authors: Gustav Hanning,Gabrielle Flood,Viktor Larsson
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at Scandinavian Conference on Image Analysis (SCIA) 2025

Abstract:The standard approach for visual place recognition is to use global image descriptors to retrieve the most similar database images for a given query image. The results can then be further improved with re-ranking methods that re-order the top scoring images. However, existing methods focus on re-ranking based on the same image descriptors that were used for the initial retrieval, which we argue provides limited additional signal. In this work we propose Generalized Contextual Similarity Aggregation (GCSA), which is a graph neural network-based re-ranking method that, in addition to the visual descriptors, can leverage other types of available side information. This can for example be other sensor data (such as signal strength of nearby WiFi or Bluetooth endpoints) or geometric properties such as camera poses for database images. In many applications this information is already present or can be acquired with low effort. Our architecture leverages the concept of affinity vectors to allow for a shared encoding of the heterogeneous multi-modal input. Two large-scale datasets, covering both outdoor and indoor localization scenarios, are utilized for training and evaluation. In experiments we show significant improvement not only on image retrieval metrics, but also for the downstream visual localization task.
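
One way to picture the shared encoding is that every modality is turned into a vector of affinities between an item and the other top-ranked items, so that visual descriptors and side information end up in the same form. The sketch below assumes cosine similarity for descriptors and an exponential distance kernel for camera positions; both choices, the `length_scale` parameter, and the concatenation step are illustrative assumptions rather than the paper's design.

```python
import math


def cosine(u, v):
    """Cosine similarity between two descriptor vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def visual_affinity(descriptors):
    """Row i holds the affinities of item i to every retrieved item."""
    return [[cosine(u, v) for v in descriptors] for u in descriptors]


def position_affinity(positions, length_scale=10.0):
    """Affinity from camera positions: nearby cameras score close to 1."""
    return [
        [math.exp(-math.dist(p, q) / length_scale) for q in positions]
        for p in positions
    ]


def node_features(descriptors, positions):
    """Concatenate per-modality affinity vectors into one feature per item."""
    vis = visual_affinity(descriptors)
    pos = position_affinity(positions)
    return [v_row + p_row for v_row, p_row in zip(vis, pos)]


# Three retrieved images with hypothetical descriptors and camera positions.
descriptors = [[1.0, 0.0], [0.8, 0.6], [0.0, 1.0]]
positions = [(0.0, 0.0), (3.0, 4.0), (30.0, 40.0)]
features = node_features(descriptors, positions)
```

Each row of `features` could then serve as a node feature for a graph neural network that re-scores the candidates.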

[CV-46] K-means Enhanced Density Gradient Analysis for Urban and Transport Metrics Using Multi-Modal Satellite Imagery

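[Quick read]: This paper addresses how to evaluate urban density gradients and their relationship to public transport systems using multi-modal satellite imagery. The key to the solution is combining optical and Synthetic Aperture Radar (SAR) data in a method that segments urban areas, identifies urban centers, and quantifies density gradients, yielding two core metrics: the density gradient coefficient (α) and the minimum effective distance (LD) at which density reaches a target threshold. The paper also applies K-means clustering to objectively identify uniform and high-variability regions within density gradient plots, revealing the underlying urban structure. A comparison of a monocentric and a polycentric city relates density gradient characteristics to public transport network topologies, offering urban planners a cost-effective, globally applicable approach to preliminary public transport assessment based on freely available satellite data.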

Link: https://arxiv.org/abs/2504.11128
Authors: P. Tomkiewicz,J. Jaworski,P. Zielonka,A. Wilinski
Affiliations: WSB Merito University Gdańsk; NexRI Laboratory for Artificial Intelligence, Gdańsk, Poland
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
Comments: 16 pages, 6 figures

Abstract:This paper presents a novel computational approach for evaluating urban metrics through density gradient analysis using multi-modal satellite imagery, with applications including public transport and other urban systems. By combining optical and Synthetic Aperture Radar (SAR) data, we develop a method to segment urban areas, identify urban centers, and quantify density gradients. Our approach calculates two key metrics: the density gradient coefficient ($\alpha$) and the minimum effective distance (LD) at which density reaches a target threshold. We further employ machine learning techniques, specifically K-means clustering, to objectively identify uniform and high-variability regions within density gradient plots. We demonstrate that these metrics provide an effective screening tool for public transport analyses by revealing the underlying urban structure. Through comparative analysis of two representative cities with contrasting urban morphologies (monocentric vs polycentric), we establish relationships between density gradient characteristics and public transport network topologies. Cities with clear density peaks in their gradient plots indicate distinct urban centers requiring different transport strategies than those with more uniform density distributions. This methodology offers urban planners a cost-effective, globally applicable approach to preliminary public transport assessment using freely available satellite data. The complete implementation, with additional examples and documentation, is available in an open-source repository under the MIT license at this https URL.
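
The two metrics and the clustering step can be illustrated on a toy density profile. The abstract does not give the exact definitions, so the sketch below assumes an exponential decay model, density(r) = rho0 * exp(-alpha * r), fitted by least squares on the log-density, takes LD as the first sampled distance at which density falls to the threshold, and uses a plain 1-D K-means with k = 2. All function names and numbers are hypothetical.

```python
import math


def fit_alpha(distances, densities):
    """Least-squares fit of log(density) against distance.

    Returns (alpha, rho0) for the assumed model rho0 * exp(-alpha * r).
    """
    ys = [math.log(d) for d in densities]
    n = len(distances)
    mean_x = sum(distances) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in distances)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(distances, ys))
    slope = sxy / sxx
    return -slope, math.exp(mean_y - slope * mean_x)


def min_effective_distance(distances, densities, threshold):
    """First sampled distance at which density is at or below the threshold."""
    for r, d in zip(distances, densities):
        if d <= threshold:
            return r
    return None


def kmeans_1d(values, k=2, iters=20):
    """Plain 1-D K-means with centers initialised evenly over the value range."""
    lo, hi = min(values), max(values)
    centers = [lo + (hi - lo) * i / (k - 1) for i in range(k)]
    labels = [0] * len(values)
    for _ in range(iters):
        labels = [min(range(k), key=lambda c: abs(v - centers[c])) for v in values]
        for c in range(k):
            members = [v for v, lab in zip(values, labels) if lab == c]
            if members:
                centers[c] = sum(members) / len(members)
    return labels, centers


distances = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]            # km from the urban center
densities = [math.exp(-0.5 * r) for r in distances]    # synthetic profile
alpha, rho0 = fit_alpha(distances, densities)
ld = min_effective_distance(distances, densities, threshold=0.3)
labels, centers = kmeans_1d([0.10, 0.12, 0.11, 0.90, 0.95])  # local variability
```

Here the two clusters separate low-variability (uniform) samples from high-variability ones, mirroring the role the abstract assigns to K-means.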

[CV-47] Revealing Covert Attention by Analyzing Human and Reinforcement Learning Agent Gameplay

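[Quick read]: This paper addresses how to reveal human covert attention patterns from gameplay data alone. It proposes a new method based on offline attention techniques from reinforcement learning (RL) and introduces the contextualized, task-relevant (CTR) attention network, which generates attention maps from both human and RL agent gameplay that are sparse yet retain the information needed for decision making. The key point is that the CTR attention network can reveal human covert attention patterns from gameplay alone, without additional data such as brain activity recordings; its attention maps are compared against a temporally integrated overt attention (TIOA) model based on eye-tracking data, and human CTR maps align more closely with TIOA than agent maps do. The results contribute to understanding human-agent attention differences and lay the groundwork for RL agents augmented with human covert attention.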

Link: https://arxiv.org/abs/2504.11118
Authors: Henrik Krauss,Takehisa Yairi
Affiliations: The University of Tokyo
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:This study introduces a novel method for revealing human covert attention patterns using gameplay data alone, utilizing offline attention techniques from reinforcement learning (RL). We propose the contextualized, task-relevant (CTR) attention network, which generates attention maps from both human and RL agent gameplay in Atari environments. These maps are sparse yet retain the necessary information for the current player’s decision making. We compare the CTR-derived attention maps with a temporally integrated overt attention (TIOA) model based on eye-tracking data, serving as a point of comparison and discussion. Visual inspection reveals distinct attention patterns: human CTR maps focus on the player and rather nearby opponents, occasionally shifting between stronger focus and broader views - sometimes even attending to empty space ahead. In contrast, agent maps maintain a consistent broad focus on most objects, including distant ones and the player. Quantitative analysis further demonstrates that human CTR maps align more closely with TIOA than agent maps do. Our findings indicate that the CTR attention network can effectively reveal human covert attention patterns from gameplay alone, without the need for additional data like brain activity recordings. This work contributes to understanding human-agent attention differences and enables the development of RL agents augmented with human covert attention.

[CV-48] Flyweight FLIM Networks for Salient Object Detection in Biomedical Images

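[Quick read]: This paper addresses the impracticality of deep learning-based Salient Object Detection (SOD) in resource-constrained applications, where it demands substantial computational resources and large annotated datasets, while lightweight models typically struggle in complex scenarios with scarce labeled data. It builds on Feature Learning from Image Markers (FLIM), which learns an encoder's convolutional kernels from image patches extracted from discriminative regions marked on a few representative images, without large annotated datasets, pretraining, or backpropagation, and thereby exploits the information redundancy common in biomedical images.

The key to the solution is two contributions: methods to learn dilated-separable convolutional kernels and multi-dilation layers for FLIM networks without backpropagation, and a novel network simplification method that reduces kernel redundancy and encoder size. Combining a FLIM encoder with an adaptive decoder (a recently introduced concept that estimates a pointwise convolution per image) yields very efficient, parameter-lean "flyweight" SOD models. Experiments on challenging datasets show that these models are more efficient and effective than lightweight models, and that they remain competitive with heavyweight models while requiring significantly fewer parameters and floating-point operations, highlighting the potential of FLIM networks for data-limited, resource-constrained applications with information redundancy.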

Link: https://arxiv.org/abs/2504.11112
Authors: Leonardo M. Joao,Jancarlo F. Gomes,Silvio J. F. Guimaraes,Ewa Kijak,Alexandre X. Falcao
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Salient Object Detection (SOD) with deep learning often requires substantial computational resources and large annotated datasets, making it impractical for resource-constrained applications. Lightweight models address computational demands but typically struggle in complex and scarce labeled-data scenarios. Feature Learning from Image Markers (FLIM) learns an encoder’s convolutional kernels among image patches extracted from discriminative regions marked on a few representative images, dispensing with large annotated datasets, pretraining, and backpropagation. Such a methodology exploits information redundancy commonly found in biomedical image applications. This study presents methods to learn dilated-separable convolutional kernels and multi-dilation layers without backpropagation for FLIM networks. It also proposes a novel network simplification method to reduce kernel redundancy and encoder size. By combining a FLIM encoder with an adaptive decoder, a concept recently introduced to estimate a pointwise convolution per image, this study presents very efficient (named flyweight) SOD models for biomedical images. Experimental results on challenging datasets demonstrate superior efficiency and effectiveness to lightweight models. By requiring significantly fewer parameters and floating-point operations, the results show competitive effectiveness to heavyweight models. These advances highlight the potential of FLIM networks for data-limited and resource-constrained applications with information redundancy.

[CV-49] S2Teacher: Step-by-step Teacher for Sparsely Annotated Oriented Object Detection

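[Quick read]: This paper addresses sparsely annotated oriented object detection (SAOOD), a setting in which only some instances are labeled, and asks how to train a detector effectively under it. It focuses on two key issues: (1) sparse labeling leads to overfitting on limited foreground representations; (2) unlabeled objects (false negatives) confuse feature learning. To address them, the paper proposes S²Teacher, which progressively mines pseudo-labels for unlabeled objects, from easy to hard, to enhance foreground representations, and reweights the loss of unlabeled objects to reduce their impact on training. Experiments show that S²Teacher improves detector performance across different sparse annotation levels and achieves near-fully-supervised performance on the DOTA dataset with only 10% of instances annotated.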

Link: https://arxiv.org/abs/2504.11111
Authors: Yu Lin,Jianghang Lin,Kai Ye,You Shen,Yan Zhang,Shengchuan Zhang,Liujuan Cao,Rongrong Ji
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Although fully-supervised oriented object detection has made significant progress in multimodal remote sensing image understanding, it comes at the cost of labor-intensive annotation. Recent studies have explored weakly and semi-supervised learning to alleviate this burden. However, these methods overlook the difficulties posed by dense annotations in complex remote sensing scenes. In this paper, we introduce a novel setting called sparsely annotated oriented object detection (SAOOD), which only labels partial instances, and propose a solution to address its challenges. Specifically, we focus on two key issues in the setting: (1) sparse labeling leading to overfitting on limited foreground representations, and (2) unlabeled objects (false negatives) confusing feature learning. To this end, we propose the S$^2$Teacher, a novel method that progressively mines pseudo-labels for unlabeled objects, from easy to hard, to enhance foreground representations. Additionally, it reweights the loss of unlabeled objects to mitigate their impact during training. Extensive experiments demonstrate that S$^2$Teacher not only significantly improves detector performance across different sparse annotation levels but also achieves near-fully-supervised performance on the DOTA dataset with only 10% annotation instances, effectively balancing detection accuracy with annotation efficiency. The code will be public.
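
The easy-to-hard mining idea can be sketched as a confidence threshold that relaxes over training, combined with a loss weight that shrinks where an unlabeled object is likely present. The linear schedule, the threshold values, and the `1 - score` weighting below are illustrative assumptions; the abstract does not specify how S$^2$Teacher implements either step.

```python
def threshold_at(step, total_steps, start=0.9, end=0.5):
    """Confidence threshold that relaxes linearly from `start` to `end`."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return start + (end - start) * frac


def mine_pseudo_labels(predictions, step, total_steps):
    """Split teacher predictions into accepted pseudo-labels and the rest.

    Early in training only high-confidence (easy) detections are accepted;
    later the threshold drops and harder ones are admitted.
    """
    thr = threshold_at(step, total_steps)
    accepted = [p for p in predictions if p["score"] >= thr]
    rejected = [p for p in predictions if p["score"] < thr]
    return accepted, rejected


def background_loss_weight(score):
    """Down-weight the background loss where an unlabeled object is likely."""
    return 1.0 - score


# Hypothetical teacher predictions on unlabeled regions.
predictions = [{"score": 0.95}, {"score": 0.70}, {"score": 0.30}]
early, _ = mine_pseudo_labels(predictions, step=0, total_steps=100)
late, leftover = mine_pseudo_labels(predictions, step=100, total_steps=100)
```

With these toy scores, one detection is accepted at the start of training and two by the end, while the remaining low-confidence region keeps most of its background loss weight.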

[CV-50] Token-Level Constraint Boundary Search for Jailbreaking Text-to-Image Models

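[Quick read]: This paper addresses the vulnerability of existing defense mechanisms for Text-to-Image (T2I) generation models to adversarial attacks. Specifically, defenses such as prompt checkers and post-hoc image checkers struggle against sophisticated adversarial attacks. The paper proposes TCBS-Attack, a novel query-based black-box jailbreak attack whose key idea is to search for tokens located near the decision boundaries defined by the text and image checkers and to iteratively optimize tokens near those boundaries, producing semantically coherent adversarial prompts. The method bypasses multiple defensive layers, and experiments show that it outperforms state-of-the-art jailbreak attacks across various T2I models.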

Link: https://arxiv.org/abs/2504.11106
Authors: Jiangtao Liu,Zhaoxin Wang,Handing Wang,Cong Tian,Yaochu Jin
Affiliations: Xidian University; Westlake University
Categories: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR)
Comments:

Abstract:Recent advancements in Text-to-Image (T2I) generation have significantly enhanced the realism and creativity of generated images. However, such powerful generative capabilities pose risks related to the production of inappropriate or harmful content. Existing defense mechanisms, including prompt checkers and post-hoc image checkers, are vulnerable to sophisticated adversarial attacks. In this work, we propose TCBS-Attack, a novel query-based black-box jailbreak attack that searches for tokens located near the decision boundaries defined by text and image checkers. By iteratively optimizing tokens near these boundaries, TCBS-Attack generates semantically coherent adversarial prompts capable of bypassing multiple defensive layers in T2I models. Extensive experiments demonstrate that our method consistently outperforms state-of-the-art jailbreak attacks across various T2I models, including securely trained open-source models and commercial online services like DALL-E 3. TCBS-Attack achieves an ASR-4 of 45% and an ASR-1 of 21% on jailbreaking full-chain T2I models, significantly surpassing baseline methods.

[CV-51] Consensus Entropy: Harnessing Multi-VLM Agreement for Self-Verifying and Self-Improving OCR

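[Quick read]: This paper addresses sample-level quality degradation of Vision-Language Models (VLMs) on Optical Character Recognition (OCR) tasks and the lack of a reliable way to automatically detect low-quality outputs. It proposes Consensus Entropy (CE), a training-free post-inference method that quantifies OCR uncertainty by aggregating the outputs of multiple VLMs. The key insight is that correct predictions converge in output space while errors diverge; building on it, the paper develops a lightweight multi-model framework that identifies problematic samples, selects the best outputs, and combines model strengths. Experiments show that CE outperforms VLM-as-judge approaches and single-model baselines at the same cost and achieves state-of-the-art results across multiple metrics.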

Link: https://arxiv.org/abs/2504.11101
Authors: Yulong Zhang,Tianyi Liang,Xinyue Huang,Erfei Cui,Xu Guo,Pei Chu,Chenhui Li,Ru Zhang,Wenhai Wang,Gongshen Liu
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Comments:

Abstract:The Optical Character Recognition (OCR) task is important for evaluating Vision-Language Models (VLMs) and providing high-quality data sources for LLM training data. While state-of-the-art VLMs show improved average OCR accuracy, they still struggle with sample-level quality degradation and lack reliable automatic detection of low-quality outputs. We introduce Consensus Entropy (CE), a training-free post-inference method that quantifies OCR uncertainty by aggregating outputs from multiple VLMs. Our approach exploits a key insight: correct VLM OCR predictions converge in output space while errors diverge. We develop a lightweight multi-model framework that effectively identifies problematic samples, selects the best outputs and combines model strengths. Experiments across multiple OCR benchmarks and VLMs demonstrate that CE outperforms VLM-as-judge approaches and single-model baselines at the same cost and achieves state-of-the-art results across multiple metrics. For instance, our solution achieves 15.2% higher F1 scores than VLM-as-judge methods in quality verification, delivers 6.0% accuracy gains on mathematical calculation tasks, and requires rephrasing only 7.3% of inputs while maintaining overall performance. Notably, the entire process requires neither training nor supervision while maintaining plug-and-play functionality throughout.
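
The insight that correct outputs converge while errors diverge can be illustrated with a simple agreement score over several model outputs. The sketch below is a stand-in, not the paper's entropy definition: it uses the average pairwise string dissimilarity from Python's `difflib` as the uncertainty score and picks the output closest to all the others. The function names and sample strings are hypothetical.

```python
from difflib import SequenceMatcher
from itertools import combinations


def disagreement(a, b):
    """Dissimilarity in [0, 1] between two OCR strings (0 means identical)."""
    return 1.0 - SequenceMatcher(None, a, b).ratio()


def consensus_score(outputs):
    """Average pairwise disagreement; higher values suggest a doubtful sample."""
    pairs = list(combinations(outputs, 2))
    return sum(disagreement(a, b) for a, b in pairs) / len(pairs)


def select_best(outputs):
    """Return the output with the lowest mean disagreement to the others."""
    def mean_dist(i):
        others = [o for j, o in enumerate(outputs) if j != i]
        return sum(disagreement(outputs[i], o) for o in others) / len(others)

    return outputs[min(range(len(outputs)), key=mean_dist)]


# Outputs from three hypothetical VLMs for the same image crop.
agreeing = ["Invoice 1234", "Invoice 1234", "Invoice 1234"]
diverging = ["Invoice 1234", "Invoice 1234", "lnvoice I234"]
score_agree = consensus_score(agreeing)
score_diverge = consensus_score(diverging)
best = select_best(diverging)
```

Samples whose score exceeds a chosen threshold could be flagged for re-processing, which needs no training or labels, in line with the plug-and-play use the abstract describes.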

[CV-52] Vivid4D: Improving 4D Reconstruction from Monocular Video by Video Inpainting

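[Quick read]: This paper addresses the reconstruction of 4D dynamic scenes from casually captured monocular videos, a challenging task because each timestamp is observed from a single viewpoint. It proposes Vivid4D, which improves 4D monocular video synthesis by augmenting the observation views, that is, by synthesizing multi-view videos from a monocular input. The key to the solution is reformulating view augmentation as a video inpainting task that integrates both geometric and generative priors. A video inpainting model is trained on unposed web videos with synthetically generated masks that mimic warping occlusions, ensuring spatially and temporally consistent completion of missing regions. An iterative view augmentation strategy and a robust reconstruction loss further mitigate inaccuracies in the monocular depth priors. Experiments show that the method effectively improves monocular 4D scene reconstruction and completion.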

Link: https://arxiv.org/abs/2504.11092
Authors: Jiaxin Huang,Sheng Miao,BangBnag Yang,Yuewen Ma,Yiyi Liao
Affiliations: Zhejiang University; ByteDance PICO
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Reconstructing 4D dynamic scenes from casually captured monocular videos is valuable but highly challenging, as each timestamp is observed from a single viewpoint. We introduce Vivid4D, a novel approach that enhances 4D monocular video synthesis by augmenting observation views - synthesizing multi-view videos from a monocular input. Unlike existing methods that either solely leverage geometric priors for supervision or use generative priors while overlooking geometry, we integrate both. This reformulates view augmentation as a video inpainting task, where observed views are warped into new viewpoints based on monocular depth priors. To achieve this, we train a video inpainting model on unposed web videos with synthetically generated masks that mimic warping occlusions, ensuring spatially and temporally consistent completion of missing regions. To further mitigate inaccuracies in monocular depth priors, we introduce an iterative view augmentation strategy and a robust reconstruction loss. Experiments demonstrate that our method effectively improves monocular 4D scene reconstruction and completion.

[CV-53] InfoClus: Informative Clustering of High-dimensional Data Embeddings

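[Quick read]: This paper addresses the difficulty of interpreting low-dimensional embeddings when high-dimensional data are visualized through dimensionality reduction. To support the exploration and interpretation of such embeddings, it introduces a new concept called partitioning with explanations: the data shown through the embedding are partitioned into groups, each given a sparse explanation in terms of the original high-dimensional attributes. An information-theoretic objective function quantifies both how much can be learned from the explanations and how complex they are, and a complexity parameter tunes the granularity of the partitioning. The proposed method, InfoClus, jointly optimizes the partitioning and the explanations through greedy search constrained over a hierarchical clustering. InfoClus is analyzed qualitatively and quantitatively on three data sets; its results on the Cytometry data are contrasted with published manual analysis, and it is compared with two other recent methods for explaining embeddings (RVX and VERA), showing distinct advantages. InfoClus can thus automatically create good starting points for analyzing dimensionality-reduction-based scatter plots.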

Link: https://arxiv.org/abs/2504.11089
Authors: Fuyin Lai,Edith Heiter,Guillaume Bied,Jefrey Lijffijt
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: 17 pages, 9 figures

Abstract:Developing an understanding of high-dimensional data can be facilitated by visualizing that data using dimensionality reduction. However, the low-dimensional embeddings are often difficult to interpret. To facilitate the exploration and interpretation of low-dimensional embeddings, we introduce a new concept named partitioning with explanations. The idea is to partition the data shown through the embedding into groups, each of which is given a sparse explanation using the original high-dimensional attributes. We introduce an objective function that quantifies how much we can learn through observing the explanations of the data partitioning, using information theory, and also how complex the explanations are. Through parameterization of the complexity, we can tune the solutions towards the desired granularity. We propose InfoClus, which optimizes the partitioning and explanations jointly, through greedy search constrained over a hierarchical clustering. We conduct a qualitative and quantitative analysis of InfoClus on three data sets. We contrast the results on the Cytometry data with published manual analysis results, and compare with two other recent methods for explaining embeddings (RVX and VERA). These comparisons highlight that InfoClus has distinct advantages over existing procedures and methods. We find that InfoClus can automatically create good starting points for the analysis of dimensionality-reduction-based scatter plots.

[CV-54] Change State Space Models for Remote Sensing Change Detection

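[Quick read]: This paper addresses the limitations of ConvNets and Vision Transformers (ViT) for change detection: ConvNets struggle to model long-range dependencies, while ViTs are computationally inefficient and therefore hard to train on large-scale datasets. The proposed solution is the Change State Space Model, designed specifically for change detection; its key idea is to focus on the relevant changes between bi-temporal images and filter out irrelevant information. Concentrating solely on the changed features reduces the number of network parameters and significantly improves computational efficiency while maintaining high detection performance and robustness against input degradation. On three benchmark datasets, the model outperforms ConvNets, ViTs, and Mamba-based counterparts at a fraction of their computational complexity.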

Link: https://arxiv.org/abs/2504.11080
Authors: Elman Ghazaei,Erchan Aptoula
Affiliations: Sabanci University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Despite their frequent use for change detection, both ConvNets and Vision transformers (ViT) exhibit well-known limitations: the former struggle to model long-range dependencies, while the latter are computationally inefficient, rendering them challenging to train on large-scale datasets. Vision Mamba, an architecture based on State Space Models, has emerged as an alternative addressing the aforementioned deficiencies and has already been applied to remote sensing change detection, though mostly as a feature-extracting backbone. In this article the Change State Space Model is introduced, which has been specifically designed for change detection by focusing on the relevant changes between bi-temporal images, effectively filtering out irrelevant information. By concentrating solely on the changed features, the number of network parameters is reduced, significantly enhancing computational efficiency while maintaining high detection performance and robustness against input degradation. The proposed model has been evaluated on three benchmark datasets, where it outperformed ConvNets, ViTs, and Mamba-based counterparts at a fraction of their computational complexity. The implementation will be made available at this https URL upon acceptance.

[CV-55] Improving fingerprint presentation attack detection by an approach integrated into the personal verification stage

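[Quick read]: This paper addresses the security limitations that arise when Presentation Attack Detection (PAD) systems are designed independently of the fingerprint verification system, particularly in scenarios where enrolled users' templates could be used to enhance security. The key to the solution is an innovative add-on module called the Closeness Binary Code (CC) module. It exploits the "closeness" property of bona fide features, namely that genuine fingerprints cluster in a specific pattern in a Euclidean feature space, so the module can be designed without samples from the targeted users and still exploit the closeness of the user's samples during the verification stage. Experiments show that the module improves existing PAD methods and can be easily integrated into a fingerprint verification system.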

Link: https://arxiv.org/abs/2504.11066
Authors: Marco Micheletto,Giulia Orrù,Luca Ghiani,Gian Luca Marcialis
Affiliations: University of Cagliari; University of Sassari
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: This work has been submitted to the IEEE for possible publication

Abstract:Presentation Attack Detection (PAD) systems are usually designed independently of the fingerprint verification system. While this can be acceptable for use cases where specific user templates are not predetermined, it represents a missed opportunity to enhance security in scenarios where integrating PAD with the fingerprint verification system could significantly leverage users’ templates, which are the real target of a potential presentation attack. This does not mean that a PAD should be specifically designed for such users; that would imply the availability of many enrolled users’ PAI and, consequently, increased complexity, time, and cost. On the contrary, we propose to equip a basic PAD, designed according to the state of the art, with an innovative add-on module called the Closeness Binary Code (CC) module. The term “closeness” refers to a peculiar property of the bona fide-related features: in a Euclidean feature space, genuine fingerprints tend to cluster in a specific pattern. First, samples from the same finger are close to each other, then samples from other fingers of the same user and finally, samples from fingers of other users. This property is statistically verified in our previous publication, and further confirmed in this paper. It is independent of the user population and the feature set class, which can be handcrafted or deep network-based (embeddings). Therefore, the add-on can be designed without the need for the targeted user samples; moreover, it exploits her/his samples’ “closeness” property during the verification stage. Extensive experiments on benchmark datasets and state-of-the-art PAD methods confirm the benefits of the proposed add-on, which can be easily coupled with the main PAD module integrated into the fingerprint verification system.

[CV-56] UKDM: Underwater keypoint detection and matching using underwater image enhancement techniques

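[Quick read]: This paper addresses the low keypoint-detection accuracy and weak matching robustness that environmental factors cause in underwater images. The key to the solution is applying advanced deep learning models, including generative adversarial networks (GANs) and convolutional neural networks (CNNs), as image enhancement methods and identifying which one most improves keypoint detection accuracy and matching robustness. Evaluations on several underwater datasets show significant improvements over traditional methods.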

Link: https://arxiv.org/abs/2504.11063
Authors: Pedro Diaz-Garcia, Felix Escalona, Miguel Cazorla
Affiliations: University Institute for Computer Research; University of Alicante
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:The purpose of this paper is to explore the use of underwater image enhancement techniques to improve keypoint detection and matching. By applying advanced deep learning models, including generative adversarial networks and convolutional neural networks, we aim to find the best method which improves the accuracy of keypoint detection and the robustness of matching algorithms. We evaluate the performance of these techniques on various underwater datasets, demonstrating significant improvements over traditional methods.

[CV-57] Crane: Context-Guided Prompt Learning and Attention Refinement for Zero-Shot Anomaly Detections

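[Quick read]: This paper addresses the performance gap between image-level and pixel-level anomaly detection, particularly in settings where normal training samples are unavailable or cross-domain generalization is difficult. Traditional anomaly detection methods typically depend on the availability of normal data, and recent methods such as AnomalyCLIP and AdaCLIP exploit the zero-shot generalization of CLIP but still show a gap between image-level and pixel-level performance.

The key to the solution is conditioning the text encoder's prompts on image context extracted from the vision encoder, and modifying the CLIP vision encoder and its dense feature extraction to capture fine-grained variations more effectively. These changes ensure that the features retain richer spatial and structural information for both normal and anomalous prompts. The method improves performance by 2% to 29% across different metrics on 14 datasets, for both image-level and pixel-level anomaly detection.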

Link: https://arxiv.org/abs/2504.11055
Authors: Alireza Salehi, Mohammadreza Salehi, Reshad Hosseini, Cees G. M. Snoek, Makoto Yamada, Mohammad Sabokrou
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Anomaly Detection (AD) involves identifying deviations from normal data distributions and is critical in fields such as medical diagnostics and industrial defect detection. Traditional AD methods typically require the availability of normal training samples; however, this assumption is not always feasible, as collecting such data can be impractical. Additionally, these methods often struggle to generalize across different domains. Recent advancements, such as AnomalyCLIP and AdaCLIP, utilize the zero-shot generalization capabilities of CLIP but still face a performance gap between image-level and pixel-level anomaly detection. To address this gap, we propose a novel approach that conditions the prompts of the text encoder based on image context extracted from the vision encoder. Also, to capture fine-grained variations more effectively, we have modified the CLIP vision encoder and altered the extraction of dense features. These changes ensure that the features retain richer spatial and structural information for both normal and anomalous prompts. Our method achieves state-of-the-art performance, improving performance by 2% to 29% across different metrics on 14 datasets. This demonstrates its effectiveness in both image-level and pixel-level anomaly detection.

[CV-58] Leveraging LLMs and attention-mechanism for automatic annotation of historical maps

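[Quick read]: This paper addresses the problem that historical maps in analogue or scanned form can only be interpreted by humans, which does not scale. The key to the solution is a novel distillation method that combines large language models (LLMs) with attention mechanisms to annotate historical maps automatically. LLMs generate coarse classification labels for low-resolution historical image patches, and attention mechanisms refine these labels to higher resolutions. In experiments, the refined labels reach a recall above 90%, with intersection over union (IoU) of 84.2% for Wood and 72.0% for Settlement and precision of 87.1% and 79.5%, respectively. These results were obtained without fine-grained manual labels during training, which points to the method's potential for efficient and scalable historical map analysis.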

Link: https://arxiv.org/abs/2504.11050
Authors: Yunshuang Yuan, Monika Sester
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Historical maps are essential resources that provide insights into the geographical landscapes of the past. They serve as valuable tools for researchers across disciplines such as history, geography, and urban studies, facilitating the reconstruction of historical environments and the analysis of spatial transformations over time. However, when constrained to analogue or scanned formats, their interpretation is limited to humans and therefore not scalable. Recent advancements in machine learning, particularly in computer vision and large language models (LLMs), have opened new avenues for automating the recognition and classification of features and objects in historical maps. In this paper, we propose a novel distillation method that leverages LLMs and attention mechanisms for the automatic annotation of historical maps. LLMs are employed to generate coarse classification labels for low-resolution historical image patches, while attention mechanisms are utilized to refine these labels to higher resolutions. Experimental results demonstrate that the refined labels achieve a high recall of more than 90%. Additionally, the intersection over union (IoU) scores–84.2% for Wood and 72.0% for Settlement–along with precision scores of 87.1% and 79.5%, respectively, indicate that most labels are well-aligned with ground-truth annotations. Notably, these results were achieved without the use of fine-grained manual labels during training, underscoring the potential of our approach for efficient and scalable historical map analysis.
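
The recall, precision, and IoU figures quoted above are all overlap statistics between a predicted mask and a reference mask. As a minimal sketch of how such numbers are typically computed for a single class (the function name `mask_metrics` and the toy masks are illustrative assumptions, not taken from the paper):

```python
def mask_metrics(pred, truth):
    """Return (precision, recall, iou) for two equally sized binary masks.

    Masks are lists of rows holding 0/1 values; 1 marks the class of interest.
    """
    tp = fp = fn = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            if p and t:
                tp += 1          # predicted and annotated
            elif p:
                fp += 1          # predicted but not annotated
            elif t:
                fn += 1          # annotated but missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    iou = tp / (tp + fp + fn) if tp + fp + fn else 0.0
    return precision, recall, iou


# Toy 3x4 masks (hypothetical data).
pred = [[1, 1, 0, 0], [1, 1, 0, 0], [0, 0, 0, 0]]
truth = [[1, 1, 1, 0], [1, 0, 0, 0], [0, 0, 0, 0]]
print(mask_metrics(pred, truth))
```

Per-class scores such as those reported for Wood and Settlement would come from running this kind of computation once per class; how the paper aggregates over patches is not stated in the abstract.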

[CV-59] QAVA: Query-Agnostic Visual Attack to Large Vision-Language Models NAACL2025

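[Quick read]: This paper addresses a limitation of query-specific attacks on large vision-language models (LVLMs) in multimodal tasks such as Visual Question Answering (VQA): they cause wrong answers only for the particular question they target. Because a single image is often associated with multiple questions, an LVLM may still answer other questions correctly even when the image has been adversarially attacked for one question. The paper therefore proposes the Query-Agnostic Visual Attack (QAVA), whose core is generating robust adversarial examples that induce incorrect responses to unspecified and unknown questions. Compared with traditional attacks focused on a specific image and question, QAVA significantly improves attack effectiveness and efficiency when the target question is unknown, with performance comparable to attacks on known target questions. The key is an attack that does not rely on query-specific information, which broadens the study of visual adversarial attacks on LVLMs in practical settings and exposes previously overlooked vulnerabilities.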

Link: https://arxiv.org/abs/2504.11038
Authors: Yudong Zhang, Ruobing Xie, Jiansheng Chen, Xingwu Sun, Zhanhui Kang, Yu Wang
Affiliations: Department of Electronic Engineering, Tsinghua University; Machine Learning Platform Department, Tencent; School of Computer and Communication Engineering, University of Science and Technology Beijing; Faculty of Science and Technology, University of Macau; State Key Laboratory of Space Network and Communications, Tsinghua University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted by NAACL 2025 main

Abstract:In typical multimodal tasks, such as Visual Question Answering (VQA), adversarial attacks targeting a specific image and question can lead large vision-language models (LVLMs) to provide incorrect answers. However, it is common for a single image to be associated with multiple questions, and LVLMs may still answer other questions correctly even for an adversarial image attacked by a specific question. To address this, we introduce the query-agnostic visual attack (QAVA), which aims to create robust adversarial examples that generate incorrect responses to unspecified and unknown questions. Compared to traditional adversarial attacks focused on specific images and questions, QAVA significantly enhances the effectiveness and efficiency of attacks on images when the question is unknown, achieving performance comparable to attacks on known target questions. Our research broadens the scope of visual adversarial attacks on LVLMs in practical settings, uncovering previously overlooked vulnerabilities, particularly in the context of visual adversarial threats. The code is available at this https URL.

[CV-60] Defending Against Frequency-Based Attacks with Diffusion Models CVPR

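[Quick read]: This paper addresses the limited ability of adversarial training to generalize to unseen threat models, since it is typically tailored to the attack types seen during training. The key to the solution is adversarial purification with diffusion models: a purifier trained independently of both the classifier and the threat models removes perturbations from the input before classification. The study looks beyond pixel-wise adversarial attacks to spectral and spatial adversarial attacks, and finds purification effective against diverse distortion patterns across low- to high-frequency regions.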

Link: https://arxiv.org/abs/2504.11034
Authors: Fatemeh Amerehi, Patrick Healy
Affiliations: University of Limerick
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 5th Workshop on Adversarial Machine Learning in Computer Vision: Foundation Models + X

Abstract:Adversarial training is a common strategy for enhancing model robustness against adversarial attacks. However, it is typically tailored to the specific attack types it is trained on, limiting its ability to generalize to unseen threat models. Adversarial purification offers an alternative by leveraging a generative model to remove perturbations before classification. Since the purifier is trained independently of both the classifier and the threat models, it is better equipped to handle previously unseen attack scenarios. Diffusion models have proven highly effective for noise purification, not only in countering pixel-wise adversarial perturbations but also in addressing non-adversarial data shifts. In this study, we broaden the focus beyond pixel-wise robustness to explore the extent to which purification can mitigate both spectral and spatial adversarial attacks. Our findings highlight its effectiveness in handling diverse distortion patterns across low- to high-frequency regions.

[CV-61] Acquisition of high-quality images for camera calibration in robotics applications via speech prompts

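[Quick read]: This paper addresses the problem that the camera calibration optimization struggles to converge when input images are of insufficient quality, for example because of motion blur or rolling-shutter artifacts. The key to the solution is a novel calibration image acquisition technique controlled by voice commands. Using a state-of-the-art speech-to-text model with accurate per-word timestamps, the method aligns trigger words precisely in time, which makes acquiring high-quality calibration images more robust and user-friendly and allows complex multi-camera setups to be calibrated successfully.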

Link: https://arxiv.org/abs/2504.11031
Authors: Timm Linder, Kadir Yilmaz, David B. Adrian, Bastian Leibe
Affiliations: Bosch Corporate Research & Bosch Center for AI, Renningen, Germany; Computer Vision Group, RWTH Aachen University, Germany
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: 8 pages, 6 figures

Abstract:Accurate intrinsic and extrinsic camera calibration can be an important prerequisite for robotic applications that rely on vision as input. While there is ongoing research on enabling camera calibration using natural images, many systems in practice still rely on using designated calibration targets with e.g. checkerboard patterns or April tag grids. Once calibration images from different perspectives have been acquired and feature descriptors detected, those are typically used in an optimization process to minimize the geometric reprojection error. For this optimization to converge, input images need to be of sufficient quality and particularly sharpness; they should neither contain motion blur nor rolling-shutter artifacts that can arise when the calibration board was not static during image capture. In this work, we present a novel calibration image acquisition technique controlled via voice commands recorded with a clip-on microphone, that can be more robust and user-friendly than e.g. triggering capture with a remote control, or filtering out blurry frames from a video sequence in postprocessing. To achieve this, we use a state-of-the-art speech-to-text transcription model with accurate per-word timestamping to capture trigger words with precise temporal alignment. Our experiments show that the proposed method improves user experience by being fast and efficient, allowing us to successfully calibrate complex multi-camera setups.
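
One way to picture the timestamp alignment described above is to map each spoken trigger word to the camera frame captured closest to it in time. The sketch below is only an assumption about how that lookup could work; the `(word, start_seconds)` transcript format, the trigger word "capture", and the function name `frames_for_trigger` are hypothetical and not taken from the paper.

```python
from bisect import bisect_left


def frames_for_trigger(words, frame_times, trigger="capture"):
    """Return the index of the frame nearest in time to each trigger word.

    words: list of (text, start_seconds) pairs, in the shape a speech-to-text
           model with per-word timestamps might produce (assumed format).
    frame_times: sorted list of frame timestamps in seconds.
    """
    if not frame_times:
        return []
    picked = []
    for text, start in words:
        if text.lower().strip(".,!?") != trigger:
            continue
        i = bisect_left(frame_times, start)
        # The nearest frame is one of the two neighbours of the insertion point.
        candidates = [j for j in (i - 1, i) if 0 <= j < len(frame_times)]
        picked.append(min(candidates, key=lambda j: abs(frame_times[j] - start)))
    return picked


# Hypothetical 10 fps stream and transcript.
frame_times = [i * 0.1 for i in range(50)]
words = [("hold", 0.8), ("Capture", 1.23), ("move", 2.0), ("capture.", 3.48)]
print(frames_for_trigger(words, frame_times))
```

A real system would also need the audio and camera clocks to be synchronized; that step is outside this sketch.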

[CV-62] Easy3D: A Simple Yet Effective Method for 3D Interactive Segmentation

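[Quick read]: This paper addresses efficient and precise 3D interactive segmentation across diverse 3D environments, particularly in unseen environments and with unfamiliar objects. The proposed method combines a voxel-based sparse encoder with a lightweight transformer-based decoder that implements implicit click fusion, and it surpasses previous state-of-the-art techniques. The key is a simple yet effective architecture that keeps the model efficient while improving results on in-domain and out-of-domain datasets (ScanNet, ScanNet++, S3DIS, and KITTI-360) and on unseen geometric distributions such as those obtained by Gaussian Splatting.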

Link: https://arxiv.org/abs/2504.11024
Authors: Andrea Simonelli, Norman Müller, Peter Kontschieder
Affiliations: Meta Reality Labs Zürich
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:The increasing availability of digital 3D environments, whether through image-based 3D reconstruction, generation, or scans obtained by robots, is driving innovation across various applications. These come with a significant demand for 3D interaction, such as 3D Interactive Segmentation, which is useful for tasks like object selection and manipulation. Additionally, there is a persistent need for solutions that are efficient, precise, and performing well across diverse settings, particularly in unseen environments and with unfamiliar objects. In this work, we introduce a 3D interactive segmentation method that consistently surpasses previous state-of-the-art techniques on both in-domain and out-of-domain datasets. Our simple approach integrates a voxel-based sparse encoder with a lightweight transformer-based decoder that implements implicit click fusion, achieving superior performance and maximizing efficiency. Our method demonstrates substantial improvements on benchmark datasets, including ScanNet, ScanNet++, S3DIS, and KITTI-360, and also on unseen geometric distributions such as the ones obtained by Gaussian Splatting. The project web-page is available at this https URL.

[CV-63] Meta-learning For Few-Shot Time Series Crop Type Classification: A Benchmark On The EuroCropsML Dataset

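[Quick read]: This paper addresses the challenge that spatial imbalances in crop type data pose for accurate classification in remote sensing. It evaluates how knowledge transfer algorithms perform on a real-world task, specifically transfer from data-rich to data-scarce regions. The key is a comparison of transfer learning and several meta-learning algorithms, including (First-Order) Model-Agnostic Meta-Learning ((FO)-MAML), Almost No Inner Loop (ANIL), and Task-Informed Meta-Learning (TIML), on the EuroCropsML dataset. After pre-training on data from Latvia, MAML-based meta-learning algorithms achieve slightly higher accuracy than simpler transfer learning methods on crop type classification in Estonia, at the cost of higher computational demands and training time. Knowledge transfer between geographically distant regions, such as Estonia and Portugal, is a significant challenge for all algorithms. The results expose the trade-off between accuracy and computational cost when choosing methods for real-world crop type classification, and the paper provides the first comprehensive benchmark for transfer and meta-learning methods under real-world conditions, together with public code.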

Link: https://arxiv.org/abs/2504.11022
Authors: Joana Reuss, Jan Macdonald, Simon Becker, Konrad Schultka, Lorenz Richter, Marco Körner
Affiliations: unknown
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: 19 pages, 7 figures, 12 tables

Abstract:Spatial imbalances in crop type data pose significant challenges for accurate classification in remote sensing applications. Algorithms aiming at transferring knowledge from data-rich to data-scarce tasks have thus surged in popularity. However, despite their effectiveness in previous evaluations, their performance in challenging real-world applications is unclear and needs to be evaluated. This study benchmarks transfer learning and several meta-learning algorithms, including (First-Order) Model-Agnostic Meta-Learning ((FO)-MAML), Almost No Inner Loop (ANIL), and Task-Informed Meta-Learning (TIML), on the real-world EuroCropsML time series dataset, which combines farmer-reported crop data with Sentinel-2 satellite observations from Estonia, Latvia, and Portugal. Our findings indicate that MAML-based meta-learning algorithms achieve slightly higher accuracy compared to simpler transfer learning methods when applied to crop type classification tasks in Estonia after pre-training on data from Latvia. However, this improvement comes at the cost of increased computational demands and training time. Moreover, we find that the transfer of knowledge between geographically disparate regions, such as Estonia and Portugal, poses significant challenges to all investigated algorithms. These insights underscore the trade-offs between accuracy and computational resource requirements in selecting machine learning methods for real-world crop type classification tasks and highlight the difficulties of transferring knowledge between different regions of the Earth. To facilitate future research in this domain, we present the first comprehensive benchmark for evaluating transfer and meta-learning methods for crop type classification under real-world conditions. The corresponding code is publicly available at this https URL.
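
For readers unfamiliar with the first-order variant of MAML named above, the following toy sketch shows the update rule on a one-parameter regression problem. It is not the benchmark's code: the model `y = w * x`, the task slopes, and the learning rates are all assumptions chosen to keep the example small.

```python
def grad(w, data):
    """Gradient of the mean squared error for the 1-D model y = w * x."""
    return sum(2.0 * (w * x - y) * x for x, y in data) / len(data)


def fomaml(tasks, w=0.0, inner_lr=0.05, outer_lr=0.05, meta_steps=200):
    """First-order MAML on a list of (support, query) tasks.

    Each meta-step adapts w on the support set with one gradient step, then
    updates the shared initialisation with the query-set gradient evaluated
    at the adapted weight (second-order terms are dropped).
    """
    for _ in range(meta_steps):
        meta_grad = 0.0
        for support, query in tasks:
            adapted = w - inner_lr * grad(w, support)  # inner-loop step
            meta_grad += grad(adapted, query)          # first-order outer gradient
        w -= outer_lr * meta_grad / len(tasks)
    return w


def make_task(slope):
    """Build a hypothetical task: points on the line y = slope * x."""
    support = [(x, slope * x) for x in (1.0, 2.0)]
    query = [(x, slope * x) for x in (1.5, 2.5)]
    return support, query


tasks = [make_task(s) for s in (1.5, 2.0, 2.5)]
print(fomaml(tasks))  # settles near the mean task slope
```

The extra inner-loop pass per task is one source of the added training cost that the abstract attributes to MAML-style methods relative to plain transfer learning.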

[CV-64] DRIFT open dataset: A drone-derived intelligence for traffic analysis in urban environment

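[Quick read]: This paper addresses the challenge of acquiring and analyzing urban traffic data, in particular understanding urban mobility and traffic dynamics from the macroscopic to the microscopic level. The key to the solution is DRIFT (DRone-derived Intelligence For Traffic analysis), a large-scale urban traffic dataset collected systematically from synchronized drone videos at an altitude of about 250 meters over nine interconnected intersections in Daejeon, South Korea. Video synchronization and orthomap alignment yield high-resolution vehicle trajectories with directional information, 81,699 trajectories in total. The dataset is structured for immediate use without additional preprocessing and comes with open-source models for object detection and trajectory extraction and associated analysis tools. It supports multi-scale analysis, from individual vehicle maneuvers (such as lane changes) and safety metrics (such as time-to-collision) to network-level flow dynamics.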

Link: https://arxiv.org/abs/2504.11019
Authors: Hyejin Lee, Seokjun Hong, Jeonghoon Song, Haechan Cho, Zhixiong Jin, Byeonghun Kim, Joobin Jin, Jaegyun Im, Byeongjoon Noh, Hwasoo Yeo
Affiliations: Sunchon National University; KAIST
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 30 pages, 15 figures

Abstract:Reliable traffic data are essential for understanding urban mobility and developing effective traffic management strategies. This study introduces the DRone-derived Intelligence For Traffic analysis (DRIFT) dataset, a large-scale urban traffic dataset collected systematically from synchronized drone videos at approximately 250 meters altitude, covering nine interconnected intersections in Daejeon, South Korea. DRIFT provides high-resolution vehicle trajectories that include directional information, processed through video synchronization and orthomap alignment, resulting in a comprehensive dataset of 81,699 vehicle trajectories. Through our DRIFT dataset, researchers can simultaneously analyze traffic at multiple scales - from individual vehicle maneuvers like lane-changes and safety metrics such as time-to-collision to aggregate network flow dynamics across interconnected urban intersections. The DRIFT dataset is structured to enable immediate use without additional preprocessing, complemented by open-source models for object detection and trajectory extraction, as well as associated analytical tools. DRIFT is expected to significantly contribute to academic research and practical applications, such as traffic flow analysis and simulation studies. The dataset and related resources are publicly accessible at this https URL.
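
Time-to-collision, one of the safety metrics mentioned above, is commonly defined for a follower and its leader in the same lane as the remaining gap divided by the closing speed. A minimal sketch under that standard definition follows; the argument names and the sample values are assumptions, not fields of the DRIFT dataset.

```python
import math


def time_to_collision(follower_pos, follower_speed, leader_pos, leader_speed,
                      leader_length=0.0):
    """Seconds until the follower's front reaches the leader's rear.

    Positions are front-bumper coordinates along the lane in meters and
    speeds are in m/s. Returns math.inf when the follower is not closing in.
    """
    gap = leader_pos - leader_length - follower_pos
    closing_speed = follower_speed - leader_speed
    if gap <= 0:
        return 0.0            # vehicles already overlap
    if closing_speed <= 0:
        return math.inf       # gap is constant or growing
    return gap / closing_speed


# Follower at 0 m doing 15 m/s; 4.5 m long leader with its front at 34.5 m doing 10 m/s.
print(time_to_collision(0.0, 15.0, 34.5, 10.0, leader_length=4.5))
```

With trajectories sampled over time, this would be evaluated per frame for each leader-follower pair, and low values flag potential conflicts.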

[CV-65] AnimeDL-2M: Million-Scale AI-Generated Anime Image Detection and Localization in Diffusion Era

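[Quick read]: This paper addresses image manipulation detection and localization (IMDL) in the anime domain, a problem made more pressing by advances in generative models such as diffusion models. IMDL research has focused largely on natural images, while the anime domain remains underexplored despite growing threats: AI-generated images passed off as hand-drawn artwork, copyright violations, and inappropriate content modifications. To fill this gap, the paper proposes AnimeDL-2M, the first large-scale annotated benchmark for anime IMDL, with over two million real, partially manipulated, and fully AI-generated images. Experiments show that models trained on natural-image IMDL datasets perform poorly on anime images, which reveals a clear domain gap. The paper further proposes AniXplore, a model tailored to the visual characteristics of anime imagery, which outperforms existing methods in comprehensive evaluations.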

Link: https://arxiv.org/abs/2504.11015
Authors: Chenyang Zhu, Xing Zhang, Yuyang Sun, Ching-Chun Chang, Isao Echizen
Affiliations: The University of Tokyo; National Institute of Informatics, Japan
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Recent advances in image generation, particularly diffusion models, have significantly lowered the barrier for creating sophisticated forgeries, making image manipulation detection and localization (IMDL) increasingly challenging. While prior work in IMDL has focused largely on natural images, the anime domain remains underexplored-despite its growing vulnerability to AI-generated forgeries. Misrepresentations of AI-generated images as hand-drawn artwork, copyright violations, and inappropriate content modifications pose serious threats to the anime community and industry. To address this gap, we propose AnimeDL-2M, the first large-scale benchmark for anime IMDL with comprehensive annotations. It comprises over two million images including real, partially manipulated, and fully AI-generated samples. Experiments indicate that models trained on existing IMDL datasets of natural images perform poorly when applied to anime images, highlighting a clear domain gap between anime and natural images. To better handle IMDL tasks in anime domain, we further propose AniXplore, a novel model tailored to the visual characteristics of anime imagery. Extensive evaluations demonstrate that AniXplore achieves superior performance compared to existing methods. Dataset and code can be found in this https URL.

[CV-66] GATE3D: Generalized Attention-based Task-synergized Estimation in 3D*

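[Quick read]: This paper addresses the challenges of multi-domain training for monocular 3D object detection, which stem from the scarcity of datasets with accurate 3D ground-truth labels, especially outside typical road-based driving scenes. It proposes a novel weakly supervised framework based on pseudo-labels. The key to the solution is GATE3D, which bridges domain gaps by applying consistency losses between 2D and 3D predictions. Experiments indicate that GATE3D significantly accelerates learning from limited annotated data through effective pre-training strategies and achieves competitive performance on the KITTI benchmark and on a self-collected indoor-office dataset, suggesting potential for robotics, augmented reality, and virtual reality applications.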

Link: https://arxiv.org/abs/2504.11014
Authors: Eunsoo Im, Jung Kwon Lee, Changhyun Jee
Affiliations: Superb-AI
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 9 pages, 1 supple

Abstract:The emerging trend in computer vision emphasizes developing universal models capable of simultaneously addressing multiple diverse tasks. Such universality typically requires joint training across multi-domain datasets to ensure effective generalization. However, monocular 3D object detection presents unique challenges in multi-domain training due to the scarcity of datasets annotated with accurate 3D ground-truth labels, especially beyond typical road-based autonomous driving contexts. To address this challenge, we introduce a novel weakly supervised framework leveraging pseudo-labels. Current pretrained models often struggle to accurately detect pedestrians in non-road environments due to inherent dataset biases. Unlike generalized image-based 2D object detection models, achieving similar generalization in monocular 3D detection remains largely unexplored. In this paper, we propose GATE3D, a novel framework designed specifically for generalized monocular 3D object detection via weak supervision. GATE3D effectively bridges domain gaps by employing consistency losses between 2D and 3D predictions. Remarkably, our model achieves competitive performance on the KITTI benchmark as well as on an indoor-office dataset collected by us to evaluate the generalization capabilities of our framework. Our results demonstrate that GATE3D significantly accelerates learning from limited annotated data through effective pre-training strategies, highlighting substantial potential for broader impacts in robotics, augmented reality, and virtual reality applications. Project page: this https URL

[CV-67] MediSee: Reasoning-based Pixel-level Perception in Medical Images

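[Quick read]: This paper addresses the limited generality and applicability of existing medical image perception methods. Existing methods are either restricted to specific tasks or rely on accurate bounding boxes or text labels as input prompts, and the medical knowledge these inputs require keeps the general public from using them. General users tend instead to pose oral queries that require logical reasoning. The paper therefore introduces a new medical vision task, Medical Reasoning Segmentation and Detection (MedSD), which aims to understand implicit natural-language queries about medical images and generate the corresponding segmentation mask and bounding box for the target object.

The key to the solution is the Multi-perspective, Logic-driven Medical Reasoning Segmentation and Detection (MLMR-SD) dataset, which contains a large collection of medical entity targets with their corresponding reasoning, together with MediSee, an effective baseline model designed for medical reasoning segmentation and detection. Experiments indicate that the method handles MedSD with implicit colloquial queries effectively and outperforms traditional medical referring segmentation methods.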

Link: https://arxiv.org/abs/2504.11008
Authors: Qinyue Tong, Ziqian Lu, Jun Liu, Yangming Zheng, Zheming Lu
Affiliations: Zhejiang University; Zhejiang Sci-Tech University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 10 pages, 6 figures

Abstract:Despite remarkable advancements in pixel-level medical image perception, existing methods are either limited to specific tasks or heavily rely on accurate bounding boxes or text labels as input prompts. However, the medical knowledge required for input is a huge obstacle for general public, which greatly reduces the universality of these methods. Compared with these domain-specialized auxiliary information, general users tend to rely on oral queries that require logical reasoning. In this paper, we introduce a novel medical vision task: Medical Reasoning Segmentation and Detection (MedSD), which aims to comprehend implicit queries about medical images and generate the corresponding segmentation mask and bounding box for the target object. To accomplish this task, we first introduce a Multi-perspective, Logic-driven Medical Reasoning Segmentation and Detection (MLMR-SD) dataset, which encompasses a substantial collection of medical entity targets along with their corresponding reasoning. Furthermore, we propose MediSee, an effective baseline model designed for medical reasoning segmentation and detection. The experimental results indicate that the proposed method can effectively address MedSD with implicit colloquial queries and outperform traditional medical referring segmentation methods.

[CV-68] TMCIR: Token Merge Benefits Composed Image Retrieval

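[Quick read]: This paper addresses imbalanced fusion of the multi-modal query in Composed Image Retrieval (CIR). Current cross-modal feature fusion methods are inherently biased in how they interpret intent: they overemphasize either the reference image features (visual-dominant fusion) or the textual modification intent (text-dominant fusion through image-to-text conversion). Such imbalanced representations often fail to capture the user's actual search intent. To address this, the paper proposes TMCIR, a framework built on two innovations: 1) Intent-Aware Cross-Modal Alignment, which fine-tunes the CLIP encoders contrastively on pseudo-target images synthesized by a diffusion model from reference images and textual descriptions, so that the text encoder better captures nuanced intent; and 2) Adaptive Token Fusion, which further fine-tunes all encoders by contrasting adaptive token-fusion features with the target image, dynamically balancing visual and textual representations and optimizing the composed feature for retrieval. Experiments on the Fashion-IQ and CIRR datasets show that TMCIR significantly outperforms existing methods, particularly in capturing nuanced user intent.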

Link: https://arxiv.org/abs/2504.10995
Authors: Chaoyang Wang, Zeyu Zhang, Long Teng, Zijun Li, Shichao Kan
Affiliations: Central South University; The Australian National University; Wuhan University; Tsinghua University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Abstract:Composed Image Retrieval (CIR) retrieves target images using a multi-modal query that combines a reference image with text describing desired modifications. The primary challenge is effectively fusing this visual and textual information. Current cross-modal feature fusion approaches for CIR exhibit an inherent bias in intention interpretation. These methods tend to disproportionately emphasize either the reference image features (visual-dominant fusion) or the textual modification intent (text-dominant fusion through image-to-text conversion). Such an imbalanced representation often fails to accurately capture and reflect the actual search intent of the user in the retrieval results. To address this challenge, we propose TMCIR, a novel framework that advances composed image retrieval through two key innovations: 1) Intent-Aware Cross-Modal Alignment. We first fine-tune CLIP encoders contrastively using intent-reflecting pseudo-target images, synthesized from reference images and textual descriptions via a diffusion model. This step enhances the encoder ability of text to capture nuanced intents in textual descriptions. 2) Adaptive Token Fusion. We further fine-tune all encoders contrastively by comparing adaptive token-fusion features with the target image. This mechanism dynamically balances visual and textual representations within the contrastive learning pipeline, optimizing the composed feature for retrieval. Extensive experiments on Fashion-IQ and CIRR datasets demonstrate that TMCIR significantly outperforms state-of-the-art methods, particularly in capturing nuanced user intent.
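
Both fine-tuning stages above are described as contrastive. The abstract does not state the exact loss, so the sketch below uses a generic InfoNCE objective as a stand-in: each composed query feature is pulled toward its own target image feature and pushed away from the other targets in the batch. The function names, the temperature of 0.07, and the toy vectors are assumptions.

```python
import math


def _normalize(v):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def info_nce(queries, targets, temperature=0.07):
    """Mean InfoNCE loss where queries[i] is the positive pair of targets[i]."""
    q = [_normalize(v) for v in queries]
    t = [_normalize(v) for v in targets]
    total = 0.0
    for i, qi in enumerate(q):
        # Cosine similarity of query i with every target, scaled by temperature.
        logits = [sum(a * b for a, b in zip(qi, tj)) / temperature for tj in t]
        m = max(logits)  # subtract the maximum for a stable log-sum-exp
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denominator - logits[i]
    return total / len(q)


# Hypothetical 2-D features: matched pairs give a much lower loss than swapped ones.
queries = [[1.0, 0.0], [0.0, 1.0]]
print(info_nce(queries, [[1.0, 0.0], [0.0, 1.0]]))
print(info_nce(queries, [[0.0, 1.0], [1.0, 0.0]]))
```

In a retrieval model the query vector would be the fused image-plus-text feature; how TMCIR merges tokens to build it is not specified in the abstract and is not modeled here.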

[CV-69] PraNet-V2: Dual-Supervised Reverse Attention for Medical Image Segmentation

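[Quick read]: This paper addresses the shortcomings of existing methods on multi-class medical image segmentation. PraNet-V1 improved polyp segmentation with its reverse attention (RA) module but struggles with multi-class segmentation. To overcome this limitation, the paper proposes PraNet-V2, built around the Dual-Supervised Reverse Attention (DSRA) module, which combines explicit background supervision, independent background modeling, and semantically enriched attention fusion. PraNet-V2 performs strongly on four polyp segmentation datasets, and integrating DSRA into three state-of-the-art semantic segmentation models improves the mean Dice score by up to 1.36%.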

Link: https://arxiv.org/abs/2504.10986
Authors: Bo-Cheng Hu, Ge-Peng Ji, Dian Shao, Deng-Ping Fan
Affiliations: Nankai Institute of Advanced Research (SHENZHEN-FUTIAN); VCIP & CS, Nankai University; School of Computing, Australian National University; Unmanned System Research Institute, Northwestern Polytechnical University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Technical report (4 tables 3 figures 8 pages)

Abstract:Accurate medical image segmentation is essential for effective diagnosis and treatment. Previously, PraNet-V1 was proposed to enhance polyp segmentation by introducing a reverse attention (RA) module that utilizes background information. However, PraNet-V1 struggles with multi-class segmentation tasks. To address this limitation, we propose PraNet-V2, which, compared to PraNet-V1, effectively performs a broader range of tasks including multi-class segmentation. At the core of PraNet-V2 is the Dual-Supervised Reverse Attention (DSRA) module, which incorporates explicit background supervision, independent background modeling, and semantically enriched attention fusion. Our PraNet-V2 framework demonstrates strong performance on four polyp segmentation datasets. Additionally, by integrating DSRA to iteratively enhance foreground segmentation results in three state-of-the-art semantic segmentation models, we achieve up to a 1.36% improvement in mean Dice score. Code is available at: this https URL.
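
As background for the reverse attention (RA) idea referred to above, a common formulation weights each feature by one minus the sigmoid of a coarse foreground prediction, so that regions the coarse map already marks as foreground are suppressed and the remaining background and boundary regions are emphasized. The sketch below illustrates only that single-map formulation under this assumption; it does not implement the dual supervision of DSRA, whose details the abstract does not give.

```python
import math


def reverse_attention(features, coarse_logits):
    """Weight each feature value by 1 - sigmoid(logit) of a coarse prediction.

    features and coarse_logits are equally sized 2-D lists (one channel).
    Confident foreground (large positive logit) receives a weight near 0,
    confident background (large negative logit) a weight near 1.
    """
    weighted = []
    for f_row, l_row in zip(features, coarse_logits):
        row = []
        for f, logit in zip(f_row, l_row):
            foreground_prob = 1.0 / (1.0 + math.exp(-logit))
            row.append(f * (1.0 - foreground_prob))
        weighted.append(row)
    return weighted


# One row of unit features with background, uncertain, and foreground logits.
print(reverse_attention([[1.0, 1.0, 1.0]], [[-10.0, 0.0, 10.0]]))
```

In a decoder this weighting would be applied per channel at each refinement stage, with the coarse map upsampled to the feature resolution first.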

[CV-70] DMPT: Decoupled Modality-aware Prompt Tuning for Multi-modal Object Re-identification WACV

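[Quick read]: This paper addresses the heavy computational and storage cost of existing multi-modal object re-identification methods built on large-scale pre-trained backbones (such as ViT), which under the standard full fine-tuning paradigm must optimize a large number of backbone parameters. It proposes DMPT, an efficient prompt-tuning framework that freezes the backbone and optimizes only newly added decoupled modality-aware parameters. Specifically, DMPT explicitly decouples the visual prompts into modality-specific prompts, which draw prior modality knowledge from a powerful text encoder, and modality-independent semantic prompts, which extract semantic information from multi-modal inputs such as visible, near-infrared, and thermal-infrared images. On top of this, a Prompt Inverse Bind (PromptIBind) strategy uses bind prompts as a medium to connect the semantic prompt tokens of different modalities and exchange complementary multi-modal information, which improves the final re-identification results. On multiple common benchmarks, DMPT achieves results competitive with existing state-of-the-art methods while requiring only 6.5% fine-tuning of the backbone parameters.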

Link: https://arxiv.org/abs/2504.10985
Authors: Minghui Lin, Shu Wang, Xiang Wang, Jianhua Tang, Longbin Fu, Zhengrong Zuo, Nong Sang
Affiliations: Huazhong University of Science and Technology; Shandong University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)

Abstract:Current multi-modal object re-identification approaches based on large-scale pre-trained backbones (i.e., ViT) have displayed remarkable progress and achieved excellent performance. However, these methods usually adopt the standard full fine-tuning paradigm, which requires the optimization of considerable backbone parameters, causing extensive computational and storage requirements. In this work, we propose an efficient prompt-tuning framework tailored for multi-modal object re-identification, dubbed DMPT, which freezes the main backbone and only optimizes several newly added decoupled modality-aware parameters. Specifically, we explicitly decouple the visual prompts into modality-specific prompts which leverage prior modality knowledge from a powerful text encoder and modality-independent semantic prompts which extract semantic information from multi-modal inputs, such as visible, near-infrared, and thermal-infrared. Built upon the extracted features, we further design a Prompt Inverse Bind (PromptIBind) strategy that employs bind prompts as a medium to connect the semantic prompt tokens of different modalities and facilitates the exchange of complementary multi-modal information, boosting final re-identification results. Experimental results on multiple common benchmarks demonstrate that our DMPT can achieve competitive results to existing state-of-the-art methods while requiring only 6.5% fine-tuning of the backbone parameters.

[CV-71] Seeing like a Cephalopod: Colour Vision with a Monochrome Event Camera CVPR2025

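[Quick read]: This paper addresses the reliance of conventional spectral imaging systems on fixed colour filters or computationally intensive demosaicing, and proposes a spectral sensing method inspired by colour discrimination in cephalopods. Cephalopods have a single type of photoreceptor and perceive spectral information through chromatic aberration induced by their ocular optics and pupil shapes. The key to the solution is combining a ball lens with an event-based camera and using a motorised system to shift the focal position, mirroring the adaptive lens motion of cephalopods, which achieves wavelength-dependent focusing across the visible and near-infrared spectrum. The approach validates bio-inspired spectral discrimination and provides robust spectral sensing without conventional colour filters or computational demosaicing.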

Link: https://arxiv.org/abs/2504.10984
Authors: Sami Arja, Nimrod Kruger, Alexandre Marcireau, Nicholas Owen Ralph, Saeed Afshar, Gregory Cohen
Affiliations: Western Sydney University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 15 pages, 14 figures, 1 table. Accepted at CVPR 2025 (Workshop on Event-based Vision)

Abstract:Cephalopods exhibit unique colour discrimination capabilities despite having one type of photoreceptor, relying instead on chromatic aberration induced by their ocular optics and pupil shapes to perceive spectral information. We took inspiration from this biological mechanism to design a spectral imaging system that combines a ball lens with an event-based camera. Our approach relies on a motorised system that shifts the focal position, mirroring the adaptive lens motion in cephalopods. This approach has enabled us to achieve wavelength-dependent focusing across the visible light and near-infrared spectrum, making the event a spectral sensor. We characterise chromatic aberration effects, using both event-based and conventional frame-based sensors, validating the effectiveness of bio-inspired spectral discrimination both in simulation and in a real setup as well as assessing the spectral discrimination performance. Our proposed approach provides a robust spectral sensing capability without conventional colour filters or computational demosaicing. This approach opens new pathways toward new spectral sensing systems inspired by nature’s evolutionary solutions. Code and analysis are available at: this https URL
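
The wavelength-dependent focusing described above follows from dispersion: the refractive index of glass falls as wavelength grows, so a ball lens focuses longer wavelengths farther away. The sketch below combines the standard ball lens relation EFL = n·D / (4·(n − 1)) with a Cauchy approximation of the index. The Cauchy coefficients are illustrative, roughly BK7-like values, and the 10 mm diameter is hypothetical; neither comes from the paper.

```python
def refractive_index(wavelength_um, a=1.5046, b=0.00420):
    """Cauchy approximation n = A + B / wavelength^2 (illustrative coefficients)."""
    return a + b / wavelength_um ** 2


def ball_lens_focal_length(diameter_mm, wavelength_um):
    """Effective focal length of a ball lens, measured from its center."""
    n = refractive_index(wavelength_um)
    return n * diameter_mm / (4.0 * (n - 1.0))


# Focal length grows from blue through red to near-infrared for a 10 mm ball.
for wavelength in (0.45, 0.65, 0.85):
    print(wavelength, round(ball_lens_focal_length(10.0, wavelength), 3))
```

Sweeping the sensor or lens position through this range of focal distances, as the motorised system does, would bring each wavelength into focus at a different position.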

[CV-72] Deep Learning in Concealed Dense Prediction

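[Quick read]: This paper addresses Concealed Dense Prediction (CDP), a family of complex vision tasks in which targets are concealed in their surroundings, so that fully perceiving them requires fine-grained representations, prior knowledge, and auxiliary reasoning. Its key contribution is a taxonomy based on concealment counteracting: through experiments on three representative CDP tasks, it systematically compares 25 state-of-the-art methods and summarizes their performance differences on 12 widely used concealed datasets. The paper also discusses potential applications of CDP in the large-model era, proposes 6 future research directions, and offers perspectives for further development by constructing CvpINST, a large-scale multimodal instruction fine-tuning dataset, and CvpAgent, a concealed visual perception agent.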

Link: https://arxiv.org/abs/2504.10979
Authors: Pancheng Zhao,Deng-Ping Fan,Shupeng Cheng,Salman Khan,Fahad Shahbaz Khan,David Clifton,Peng Xu,Jufeng Yang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Technique Report

Abstract:Deep learning is developing rapidly and handling common computer vision tasks well. It is time to pay attention to more complex vision tasks, as model size, knowledge, and reasoning capabilities continue to improve. In this paper, we introduce and review a family of complex tasks, termed Concealed Dense Prediction (CDP), which has great value in agriculture, industry, etc. CDP’s intrinsic trait is that the targets are concealed in their surroundings, thus fully perceiving them requires fine-grained representations, prior knowledge, auxiliary reasoning, etc. The contributions of this review are three-fold: (i) We introduce the scope, characteristics, and challenges specific to CDP tasks and emphasize their essential differences from generic vision tasks. (ii) We develop a taxonomy based on concealment counteracting to summarize deep learning efforts in CDP through experiments on three tasks. We compare 25 state-of-the-art methods across 12 widely used concealed datasets. (iii) We discuss the potential applications of CDP in the large model era and summarize 6 potential research directions. We offer perspectives for the future development of CDP by constructing a large-scale multimodal instruction fine-tuning dataset, CvpINST, and a concealed visual perception agent, CvpAgent.

[CV-73] Adaptive Decision Boundary for Few-Shot Class-Incremental Learning

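[Quick read]: This paper addresses a key challenge in Few-Shot Class-Incremental Learning (FSCIL): learning new classes from limited samples while retaining knowledge of previously learned classes without catastrophic forgetting. Existing methods mainly prevent catastrophic forgetting by freezing the feature extractor and fine-tuning only the classifier, but they usually overlook the specific decision space of each class and the differences between them.

To address this, the paper proposes a plug-and-play Adaptive Decision Boundary Strategy (ADBS). Its core idea is to assign each class its own decision boundary and to adjust these boundaries dynamically during training, so as to optimally refine the decision spaces of the classes in each session. It also introduces a novel inter-class constraint loss that further optimizes the separation between classes and their prototypes. Experiments show that combining ADBS with existing FSCIL techniques significantly improves performance on the CIFAR100, miniImageNet, and CUB200 benchmarks, reaching state-of-the-art results.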

Link: https://arxiv.org/abs/2504.10976
Authors: Linhao Li,Yongzhang Tan,Siyuan Yang,Hao Cheng,Yongfeng Dong,Liang Yang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Few-Shot Class-Incremental Learning (FSCIL) aims to continuously learn new classes from a limited set of training samples without forgetting knowledge of previously learned classes. Conventional FSCIL methods typically build a robust feature extractor during the base training session with abundant training samples and subsequently freeze this extractor, only fine-tuning the classifier in subsequent incremental phases. However, current strategies primarily focus on preventing catastrophic forgetting, considering only the relationship between novel and base classes, without paying attention to the specific decision spaces of each class. To address this challenge, we propose a plug-and-play Adaptive Decision Boundary Strategy (ADBS), which is compatible with most FSCIL methods. Specifically, we assign a specific decision boundary to each class and adaptively adjust these boundaries during training to optimally refine the decision spaces for the classes in each session. Furthermore, to amplify the distinctiveness between classes, we employ a novel inter-class constraint loss that optimizes the decision boundaries and prototypes for each class. Extensive experiments on three benchmarks, namely CIFAR100, miniImageNet, and CUB200, demonstrate that incorporating our ADBS method with existing FSCIL techniques significantly improves performance, achieving overall state-of-the-art results.

[CV-74] Self-Supervised Enhancement of Forward-Looking Sonar Images: Bridging Cross-Modal Degradation Gaps through Feature Space Transformation and Multi-Frame Fusion

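[Quick read]: This paper addresses the enhancement of forward-looking sonar images for accurate underwater target detection. Current deep learning methods mainly rely on supervised training with simulated data, but the difficulty of obtaining high-quality real-world paired data limits their practical use and generalization. Self-supervised methods from remote sensing partly alleviate the data shortage, yet they neglect the cross-modal degradation gap between sonar and remote sensing images; directly transferring pretrained weights often yields overly smooth sonar images with lost detail and insufficient brightness. The proposed solution has two key parts: first, a feature-space transformation that maps sonar images from the pixel domain to a robust feature domain, bridging the degradation gap; second, a self-supervised multi-frame fusion strategy that uses complementary information between adjacent frames to remove speckle noise and brighten target regions. Experiments on three self-collected real-world forward-looking sonar datasets show that the method significantly outperforms existing approaches, suppressing noise, preserving detailed edges, and substantially improving brightness, which indicates strong potential for underwater target detection.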

Link: https://arxiv.org/abs/2504.10974
Authors: Zhisheng Zhang,Peng Zhang,Fengxiang Wang,Liangli Ma,Fuchun Sun
Affiliations: College of Electronic Engineering, Naval University of Engineering, Wuhan 430033, China; College of Meteorology and Oceanography, National University of Defense Technology, Changsha 410073, China; College of Computer Science and Technology, National University of Defense Technology, Changsha 410073, China; Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China
Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments:

Abstract:Enhancing forward-looking sonar images is critical for accurate underwater target detection. Current deep learning methods mainly rely on supervised training with simulated data, but the difficulty in obtaining high-quality real-world paired data limits their practical use and generalization. Although self-supervised approaches from remote sensing partially alleviate data shortages, they neglect the cross-modal degradation gap between sonar and remote sensing images. Directly transferring pretrained weights often leads to overly smooth sonar images, detail loss, and insufficient brightness. To address this, we propose a feature-space transformation that maps sonar images from the pixel domain to a robust feature domain, effectively bridging the degradation gap. Additionally, our self-supervised multi-frame fusion strategy leverages complementary inter-frame information to naturally remove speckle noise and enhance target-region brightness. Experiments on three self-collected real-world forward-looking sonar datasets show that our method significantly outperforms existing approaches, effectively suppressing noise, preserving detailed edges, and substantially improving brightness, demonstrating strong potential for underwater target detection applications.
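
To see why combining frames suppresses speckle at all, here is a minimal sketch. It deliberately swaps the paper's learned, self-supervised fusion for a plain pixel-wise mean over frames that are assumed to be already registered, and it models speckle as unit-mean exponential multiplicative noise. Both choices, and every function name, are illustrative assumptions rather than the authors' method.

```python
import random
import statistics


def speckled_frame(clean, rng):
    """Multiply each pixel by unit-mean exponential noise (a common speckle model)."""
    return [pixel * rng.expovariate(1.0) for pixel in clean]


def fuse_frames(frames):
    """Pixel-wise mean of frames that are assumed to be already registered."""
    n = len(frames)
    return [sum(column) / n for column in zip(*frames)]


def relative_noise(image, clean):
    """Standard deviation of the pixel-wise ratio image / clean."""
    return statistics.pstdev([p / c for p, c in zip(image, clean)])


if __name__ == "__main__":
    rng = random.Random(0)
    clean = [10.0] * 2000                      # flat synthetic scene
    frames = [speckled_frame(clean, rng) for _ in range(16)]
    fused = fuse_frames(frames)
    print("single frame noise:", round(relative_noise(frames[0], clean), 3))
    print("16-frame mean noise:", round(relative_noise(fused, clean), 3))
```

With independent noise, averaging N frames shrinks the noise standard deviation by roughly a factor of the square root of N. Plain averaging also blurs anything that moves between frames, which is one reason a learned fusion may be preferable in practice.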

[CV-75] AFiRe: Anatomy-Driven Self-Supervised Learning for Fine-Grained Representation in Radiographic Images

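[Quick read]: This paper addresses the tendency of existing self-supervised methods, such as contrastive learning, to focus on global discrimination in radiographic image analysis while neglecting critical fine-grained anatomical details. The key to the solution is AFiRe, an Anatomy-driven self-supervised framework for enhancing Fine-grained Representation in radiographic image analysis. Its core idea is to align anatomical consistency with the token-processing characteristics of the Vision Transformer through two cooperating self-supervised schemes: (i) token-wise anatomy-guided contrastive learning based on structural and categorical consistency, which strengthens fine-grained spatial-anatomical discrimination; and (ii) pixel-level anomaly-removal restoration, which focuses on local anomalies and refines the learned discrimination with detailed geometrical information. The paper also proposes a Synthetic Lesion Mask that increases anatomical diversity while preserving intra-consistency, which traditional augmentations such as cropping and affine transformations tend to corrupt.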

Link: https://arxiv.org/abs/2504.10972
Authors: Yihang Liu,Lianghua He,Ying Wen,Longzhen Yang,Hongzhou Chen
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Current self-supervised methods, such as contrastive learning, predominantly focus on global discrimination, neglecting the critical fine-grained anatomical details required for accurate radiographic analysis. To address this challenge, we propose an Anatomy-driven self-supervised framework for enhancing Fine-grained Representation in radiographic image analysis (AFiRe). The core idea of AFiRe is to align the anatomical consistency with the unique token-processing characteristics of Vision Transformer. Specifically, AFiRe synergistically performs two self-supervised schemes: (i) Token-wise anatomy-guided contrastive learning, which aligns image tokens based on structural and categorical consistency, thereby enhancing fine-grained spatial-anatomical discrimination; (ii) Pixel-level anomaly-removal restoration, which particularly focuses on local anomalies, thereby refining the learned discrimination with detailed geometrical information. Additionally, we propose Synthetic Lesion Mask to enhance anatomical diversity while preserving intra-consistency, which is typically corrupted by traditional data augmentations, such as Cropping and Affine transformations. Experimental results show that AFiRe: (i) provides robust anatomical discrimination, achieving more cohesive feature clusters compared to state-of-the-art contrastive learning methods; (ii) demonstrates superior generalization, surpassing 7 radiography-specific self-supervised methods in multi-label classification tasks with limited labeling; and (iii) integrates fine-grained information, enabling precise anomaly detection using only image-level annotations.

[CV-76] An Efficient and Mixed Heterogeneous Model for Image Restoration

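[Quick read]: This paper addresses the difficulty that general-purpose image restoration (IR) models have in integrating heterogeneous architectures to handle diverse degradation types. Mainstream methods build on three architectural paradigms, CNNs, Transformers, and Mamba, which perform well in specialized single-task settings but lack an effective fusion mechanism for handling multiple restoration challenges jointly. The paper proposes RestorMixer, an efficient general-purpose IR model based on mixed-architecture fusion. Its key design is a three-stage encoder-decoder structure that uses CNN blocks to quickly extract shallow local features, and Mamba modules combined with multi-scale window-based self-attention to model global context and refine features dynamically. This hierarchical, adaptive design exploits the strengths of each architecture, giving leading performance across multiple IR tasks with high inference efficiency.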

Link: https://arxiv.org/abs/2504.10967
Authors: Yubin Gu,Yuan Meng,Kaihang Zheng,Xiaoshuai Sun,Jiayi Ji,Weijian Ruan,Liujuan Cao,Rongrong Ji
Affiliations: MAC Lab, Xiamen University; National University of Singapore; Smart City Research Institute, China Electronics Technology Group Corporation
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Image restoration (IR), as a fundamental multimedia data processing task, has a significant impact on downstream visual applications. In recent years, researchers have focused on developing general-purpose IR models capable of handling diverse degradation types, thereby reducing the cost and complexity of model development. Current mainstream approaches are based on three architectural paradigms: CNNs, Transformers, and Mambas. CNNs excel in efficient inference, whereas Transformers and Mamba excel at capturing long-range dependencies and modeling global contexts. While each architecture has demonstrated success in specialized, single-task settings, limited efforts have been made to effectively integrate heterogeneous architectures to jointly address diverse IR challenges. To bridge this gap, we propose RestorMixer, an efficient and general-purpose IR model based on mixed-architecture fusion. RestorMixer adopts a three-stage encoder-decoder structure, where each stage is tailored to the resolution and feature characteristics of the input. In the initial high-resolution stage, CNN-based blocks are employed to rapidly extract shallow local features. In the subsequent stages, we integrate a refined multi-directional scanning Mamba module with a multi-scale window-based self-attention mechanism. This hierarchical and adaptive design enables the model to leverage the strengths of CNNs in local feature extraction, Mamba in global context modeling, and attention mechanisms in dynamic feature refinement. Extensive experimental results demonstrate that RestorMixer achieves leading performance across multiple IR tasks while maintaining high inference efficiency. The official code can be accessed at this https URL.

[CV-77] Recognition of Geometrical Shapes by Dictionary Learning

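[Quick read]: This paper studies how to apply dictionary learning to shape recognition. The key to the solution is choosing a suitable optimization method, since the experiments show that this choice has a significant impact on recognition quality.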

Link: https://arxiv.org/abs/2504.10958
Authors: Alexander Köhler,Michael Breuß
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 6 pages, 4 figures, ACDSA 2025 conference

Abstract:Dictionary learning is a versatile method to produce an overcomplete set of vectors, called atoms, to represent a given input with only a few atoms. In the literature, it has been used primarily for tasks that explore its powerful representation capabilities, such as for image reconstruction. In this work, we present a first approach to make dictionary learning work for shape recognition, considering specifically geometrical shapes. As we demonstrate, the choice of the underlying optimization method has a significant impact on recognition quality. Experimental results confirm that dictionary learning may be an interesting method for shape recognition tasks.
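
The phrase "represent a given input with only a few atoms" refers to sparse coding against an overcomplete dictionary. The sketch below shows that step with plain matching pursuit, a standard greedy solver. The abstract does not say which optimization methods the authors compare, so the choice of matching pursuit, the tiny hand-made dictionary, and the function names are all assumptions made for illustration.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def matching_pursuit(signal, atoms, n_steps):
    """Greedy sparse coding of `signal` using at most `n_steps` atom picks.

    `atoms` is a list of unit-norm vectors (the dictionary). Returns one
    coefficient per atom and the final residual.
    """
    residual = list(signal)
    coeffs = [0.0] * len(atoms)
    for _ in range(n_steps):
        # Pick the atom most correlated with what is still unexplained.
        scores = [dot(residual, atom) for atom in atoms]
        best = max(range(len(atoms)), key=lambda k: abs(scores[k]))
        if scores[best] == 0.0:
            break  # nothing left to explain
        coeffs[best] += scores[best]
        residual = [r - scores[best] * a for r, a in zip(residual, atoms[best])]
    return coeffs, residual


if __name__ == "__main__":
    s = 2 ** 0.5 / 2
    # Four unit-norm atoms in 3-D: more atoms than dimensions (overcomplete).
    atoms = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [s, s, 0.0]]
    coeffs, residual = matching_pursuit([0.0, 2.0, 3.0], atoms, n_steps=2)
    print("coefficients:", coeffs)
    print("residual:", residual)
```

One plausible route from such codes to recognition is to learn one dictionary per shape class and assign a query to the class whose dictionary leaves the smallest residual; whether the paper does this is not stated in the abstract.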

[CV-78] Cross-Frequency Implicit Neural Representation with Self-Evolving Parameters

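[Quick read]: This paper addresses two limitations of classical implicit neural representation (INR) methods: they represent data in the original space, where different frequency components are mixed, and several feature encoding parameters (such as the frequency parameter $\omega$ or the rank $R$) must be configured manually. It proposes CF-INR, a self-evolving cross-frequency INR based on the Haar wavelet transform. The key idea is to decouple the data into four frequency components and apply INRs in the wavelet space, so that each frequency component is characterized separately and the representation becomes more accurate. In addition, a cross-frequency tensor decomposition paradigm with self-evolving parameters automatically updates the rank $R$ and the frequency parameter $\omega$ of each frequency component, removing laborious manual tuning and learning a customized cross-frequency feature encoding configuration for each dataset. Experiments show that CF-INR outperforms state-of-the-art methods on visual data representation and recovery tasks including image regression, inpainting, denoising, and cloud removal.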

Link: https://arxiv.org/abs/2504.10929
Authors: Chang Yu,Yisi Luo,Kai Ye,Xile Zhao,Deyu Meng
Affiliations: School of Mathematics and Statistics, Xi’an Jiaotong University; School of Automation Science and Engineering, Faculty of Electronic and Information Engineering, Xi’an Jiaotong University; School of Mathematical Science, University of Electronic Science and Technology of China
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Implicit neural representation (INR) has emerged as a powerful paradigm for visual data representation. However, classical INR methods represent data in the original space mixed with different frequency components, and several feature encoding parameters (e.g., the frequency parameter $\omega$ or the rank $R$) need manual configurations. In this work, we propose a self-evolving cross-frequency INR using the Haar wavelet transform (termed CF-INR), which decouples data into four frequency components and employs INRs in the wavelet space. CF-INR allows the characterization of different frequency components separately, thus enabling higher accuracy for data representation. To more precisely characterize cross-frequency components, we propose a cross-frequency tensor decomposition paradigm for CF-INR with self-evolving parameters, which automatically updates the rank parameter $R$ and the frequency parameter $\omega$ for each frequency component through self-evolving optimization. This self-evolution paradigm eliminates the laborious manual tuning of these parameters, and learns a customized cross-frequency feature encoding configuration for each dataset. We evaluate CF-INR on a variety of visual data representation and recovery tasks, including image regression, inpainting, denoising, and cloud removal. Extensive experiments demonstrate that CF-INR outperforms state-of-the-art methods in each case.
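
The "four frequency components" come from a single level of the 2D Haar wavelet transform. A minimal pure-Python sketch of that decomposition and its exact inverse is given below; it covers only the wavelet step, not the INR or the self-evolving tensor decomposition, and the sub-band names are this sketch's own labels rather than the paper's notation.

```python
def haar2d(image):
    """One-level orthonormal 2D Haar transform of an even-sized 2D list.

    Returns four half-resolution sub-bands:
    (approx, col_detail, row_detail, diag_detail).
    """
    rows, cols = len(image), len(image[0])
    bands = [[[0.0] * (cols // 2) for _ in range(rows // 2)] for _ in range(4)]
    for i in range(0, rows, 2):
        for j in range(0, cols, 2):
            a, b = image[i][j], image[i][j + 1]
            c, d = image[i + 1][j], image[i + 1][j + 1]
            bands[0][i // 2][j // 2] = (a + b + c + d) / 2  # local average
            bands[1][i // 2][j // 2] = (a - b + c - d) / 2  # change across columns
            bands[2][i // 2][j // 2] = (a + b - c - d) / 2  # change across rows
            bands[3][i // 2][j // 2] = (a - b - c + d) / 2  # diagonal change
    return tuple(bands)


def ihaar2d(approx, col_detail, row_detail, diag_detail):
    """Invert haar2d: rebuild the full-resolution image from the four sub-bands."""
    rows, cols = len(approx) * 2, len(approx[0]) * 2
    image = [[0.0] * cols for _ in range(rows)]
    for i in range(rows // 2):
        for j in range(cols // 2):
            ll, lh = approx[i][j], col_detail[i][j]
            hl, hh = row_detail[i][j], diag_detail[i][j]
            image[2 * i][2 * j] = (ll + lh + hl + hh) / 2
            image[2 * i][2 * j + 1] = (ll - lh + hl - hh) / 2
            image[2 * i + 1][2 * j] = (ll + lh - hl - hh) / 2
            image[2 * i + 1][2 * j + 1] = (ll - lh - hl + hh) / 2
    return image
```

Each sub-band could then be fitted by its own representation with its own encoding parameters, which is the separation the abstract argues for.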

[CV-79] Towards Efficient Partially Relevant Video Retrieval with Active Moment Discovering

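[Quick read]: This paper addresses Partially Relevant Video Retrieval (PRVR), aiming to capture the partial correspondence between text queries and untrimmed videos both effectively and efficiently. Existing methods usually model multi-scale clip representations but suffer from content independence and information redundancy, which hurts retrieval performance. The key solution is AMDNet, a simple yet effective approach that uses active moment discovering (AMD) to identify video moments that are semantically consistent with the query. Specifically, learnable span anchors capture distinct moments, and masked multi-moment attention emphasizes salient moments while suppressing redundant background, giving more compact and informative video representations. A moment diversity loss encourages different moments to cover distinct regions and a moment relevance loss promotes moments that are semantically relevant to the query; both are optimized jointly with a partially relevant retrieval loss for end-to-end training. On TVR, AMDNet has about 15.5 times fewer parameters than the recent method GMMFormer while scoring 6.0 points higher in SumR.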

Link: https://arxiv.org/abs/2504.10920
Authors: Peipei Song,Long Zhang,Long Lan,Weidong Chen,Dan Guo,Xun Yang,Meng Wang
Affiliations: School of Information Science and Technology, USTC; MoE Key Laboratory of Brain-inspired Intelligent Perception and Cognition, USTC; Institute for Quantum Information, and the State Key Laboratory of High Performance Computing, National University of Defense Technology (NUDT); Key Laboratory of Knowledge Engineering with Big Data (HFUT), Ministry of Education and School of Computer Science and Information Engineering, Hefei University of Technology (HFUT), Hefei
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by IEEE Transactions on Multimedia (TMM) on January 19, 2025. The code is available at this https URL

Abstract:Partially relevant video retrieval (PRVR) is a practical yet challenging task in text-to-video retrieval, where videos are untrimmed and contain much background content. The pursuit here is of both effective and efficient solutions to capture the partial correspondence between text queries and untrimmed videos. Existing PRVR methods, which typically focus on modeling multi-scale clip representations, however, suffer from content independence and information redundancy, impairing retrieval performance. To overcome these limitations, we propose a simple yet effective approach with active moment discovering (AMDNet). We are committed to discovering video moments that are semantically consistent with their queries. By using learnable span anchors to capture distinct moments and applying masked multi-moment attention to emphasize salient moments while suppressing redundant backgrounds, we achieve more compact and informative video representations. To further enhance moment modeling, we introduce a moment diversity loss to encourage different moments of distinct regions and a moment relevance loss to promote semantically query-relevant moments, which cooperate with a partially relevant retrieval loss for end-to-end optimization. Extensive experiments on two large-scale video datasets (i.e., TVR and ActivityNet Captions) demonstrate the superiority and efficiency of our AMDNet. In particular, AMDNet is about 15.5 times smaller (#parameters) while 6.0 points higher (SumR) than the up-to-date method GMMFormer on TVR.

[CV-80] InterAnimate: Taming Region-aware Diffusion Model for Realistic Human Interaction Animation

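[Quick read]: This paper addresses the lack of research on interactive motions, such as hand-face interactions, in video generation, motivated in particular by biometric authentication systems that rely on interactive motion-based anti-spoofing. Its key contribution is a novel paradigm for animating hand-face interactions that simultaneously learns spatio-temporal contact dynamics and biomechanically plausible deformation effects, so that hand movements induce anatomically accurate facial deformations while contact remains natural and collision-free. To support this research, the paper builds InterHF, a large-scale hand-face interaction dataset with 18 interaction patterns and 90,000 annotated videos, and proposes InterAnimate, a region-aware diffusion model that captures dynamic interaction priors with learnable spatial and temporal latents and injects them into the denoising process through a region-aware interaction mechanism.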

Link: https://arxiv.org/abs/2504.10905
Authors: Yukang Lin,Yan Hong,Zunnan Xu,Xindi Li,Chao Xu,Chuanbiao Song,Ronghui Li,Haoxing Chen,Jun Lan,Huijia Zhu,Weiqiang Wang,Jianfu Zhang,Xiu Li
Affiliations: Tsinghua University; Ant Group; Zhejiang University; Shanghai Jiao Tong University
Categories: Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
Comments: under preview

Abstract:Recent video generation research has focused heavily on isolated actions, leaving interactive motions, such as hand-face interactions, largely unexamined. These interactions are essential for emerging biometric authentication systems, which rely on interactive motion-based anti-spoofing approaches. From a security perspective, there is a growing need for large-scale, high-quality interactive videos to train and strengthen authentication models. In this work, we introduce a novel paradigm for animating realistic hand-face interactions. Our approach simultaneously learns spatio-temporal contact dynamics and biomechanically plausible deformation effects, enabling natural interactions where hand movements induce anatomically accurate facial deformations while maintaining collision-free contact. To facilitate this research, we present InterHF, a large-scale hand-face interaction dataset featuring 18 interaction patterns and 90,000 annotated videos. Additionally, we propose InterAnimate, a region-aware diffusion model designed specifically for interaction animation. InterAnimate leverages learnable spatial and temporal latents to effectively capture dynamic interaction priors and integrates a region-aware interaction mechanism that injects these priors into the denoising process. To the best of our knowledge, this work represents the first large-scale effort to systematically study human hand-face interactions. Qualitative and quantitative results show InterAnimate produces highly realistic animations, setting a new benchmark. Code and data will be made public to advance research.

[CV-81] Fine-Grained Rib Fracture Diagnosis with Hyperbolic Embeddings: A Detailed Annotation Framework and Multi-Label Classification Model

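[Quick read]: This paper addresses the lack of fine-grained annotations in existing rib fracture datasets, particularly regarding fracture characterization, type, and precise anatomical location on individual ribs. It proposes a novel annotation protocol designed for fracture classification and further improves classification with cross-modal embeddings that link radiological images to clinical descriptions. The key is the use of hyperbolic embeddings to capture the hierarchical nature of fractures: visual features and textual descriptions are mapped into a shared non-Euclidean manifold, which allows more nuanced similarity computation while accounting for the hierarchical relationships inherent in fracture taxonomy. Experiments show that the method outperforms existing techniques across multiple classification tasks, improving average recall by 6% on the AirRib dataset and by 17.5% on the public RibFrac dataset.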

Link: https://arxiv.org/abs/2504.10889
Authors: Shripad Pate,Aiman Farooq,Suvrankar Dutta,Musadiq Aadil Sheikh,Atin Kumar,Deepak Mishra
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Accurate rib fracture identification and classification are essential for treatment planning. However, existing datasets often lack fine-grained annotations, particularly regarding rib fracture characterization, type, and precise anatomical location on individual ribs. To address this, we introduce a novel rib fracture annotation protocol tailored for fracture classification. Further, we enhance fracture classification by leveraging cross-modal embeddings that bridge radiological images and clinical descriptions. Our approach employs hyperbolic embeddings to capture the hierarchical nature of fracture, mapping visual features and textual descriptions into a shared non-Euclidean manifold. This framework enables more nuanced similarity computations between imaging characteristics and clinical descriptions, accounting for the inherent hierarchical relationships in fracture taxonomy. Experimental results demonstrate that our approach outperforms existing methods across multiple classification tasks, with average recall improvements of 6% on the AirRib dataset and 17.5% on the public RibFrac dataset.
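
Similarity in a hyperbolic embedding space is measured with a hyperbolic distance rather than a Euclidean or cosine one. The abstract does not say which model of hyperbolic space the authors use; the sketch below assumes the Poincaré ball, one common choice, and uses its standard closed-form distance. The example points are made up.

```python
import math


def squared_norm(x):
    """Squared Euclidean norm of a vector."""
    return sum(t * t for t in x)


def poincare_distance(u, v):
    """Geodesic distance between two points inside the unit Poincare ball."""
    nu, nv = squared_norm(u), squared_norm(v)
    if nu >= 1.0 or nv >= 1.0:
        raise ValueError("points must lie strictly inside the unit ball")
    diff = squared_norm([a - b for a, b in zip(u, v)])
    return math.acosh(1.0 + 2.0 * diff / ((1.0 - nu) * (1.0 - nv)))


if __name__ == "__main__":
    # Hypothetical embeddings: a general concept near the origin and two
    # specific concepts near the boundary.
    general = [0.05, 0.0]
    specific_a = [0.90, 0.10]
    specific_b = [0.90, -0.10]
    print("general to a:", round(poincare_distance(general, specific_a), 3))
    print("a to b:", round(poincare_distance(specific_a, specific_b), 3))
```

Distances grow rapidly toward the boundary of the ball, which leaves room to place many fine-grained leaf categories far apart while keeping each close to its parent category; that property is why hyperbolic spaces are often used for tree-like taxonomies.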

[CV-82] CDUPatch: Color-Driven Universal Adversarial Patch Attack for Dual-Modal Visible-Infrared Detectors

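[Quick read]: This paper addresses the limited effectiveness of existing dual-modal adversarial patch attacks (against visible-infrared detectors) across diverse physical scenarios. It proposes CDUPatch, a universal cross-modal patch attack against visible-infrared object detectors across scales, views, and scenarios. The key observation is that color variations cause different levels of thermal absorption and therefore temperature differences in infrared imaging. Exploiting this, the paper designs an RGB-to-infrared adapter that maps RGB patches to infrared patches, enabling unified optimization of cross-modal patches: by learning an optimal color distribution on the patch, its thermal response can be manipulated to produce an adversarial infrared texture. A multi-scale clipping strategy and a new dataset, MSDrone, which contains aerial vehicle images at varying scales and perspectives, improve the robustness of the patch in real-world conditions. Experiments show that the method outperforms existing patch attacks in the digital domain, and extensive physical tests confirm strong transferability across scales, views, and scenarios.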

Link: https://arxiv.org/abs/2504.10888
Authors: Jiahuan Long,Wen Yao,Tingsong Jiang,Chao Ma
Affiliations: Chinese Academy of Military Science (Beijing, China); Shanghai Jiao Tong University (China)
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Adversarial patches are widely used to evaluate the robustness of object detection systems in real-world scenarios. These patches were initially designed to deceive single-modal detectors (e.g., visible or infrared) and have recently been extended to target visible-infrared dual-modal detectors. However, existing dual-modal adversarial patch attacks have limited attack effectiveness across diverse physical scenarios. To address this, we propose CDUPatch, a universal cross-modal patch attack against visible-infrared object detectors across scales, views, and scenarios. Specifically, we observe that color variations lead to different levels of thermal absorption, resulting in temperature differences in infrared imaging. Leveraging this property, we propose an RGB-to-infrared adapter that maps RGB patches to infrared patches, enabling unified optimization of cross-modal patches. By learning an optimal color distribution on the adversarial patch, we can manipulate its thermal response and generate an adversarial infrared texture. Additionally, we introduce a multi-scale clipping strategy and construct a new visible-infrared dataset, MSDrone, which contains aerial vehicle images in varying scales and perspectives. These data augmentation strategies enhance the robustness of our patch in real-world conditions. Experiments on four benchmark datasets (DroneVehicle, LLVIP, VisDrone, MSDrone) show that our method outperforms existing patch attacks in the digital domain. Extensive physical tests further confirm strong transferability across scales, views, and scenarios.

[CV-83] PuzzleBench: A Fully Dynamic Evaluation Framework for Large Multimodal Models on Puzzle Solving

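[Quick read]: This paper addresses two main problems with existing multimodal benchmarks: static benchmarks have fixed complexity and are prone to contamination from pre-training data, and manually annotated datasets are labor-intensive, time-consuming, and subject to human bias and inconsistency, which causes reliability and reproducibility issues. It proposes Open-ended Visual Puzzle Generation (OVPG), a fully dynamic multimodal evaluation framework. OVPG consists of three modules, a raw material sampling module, a visual content generation module, and a puzzle rule design module, which together ensure that each evaluation instance is primitive, highly randomized, and uniquely solvable, enabling continual adaptation to the evolving capabilities of large multimodal models (LMMs). Built on OVPG, PuzzleBench is a dynamic and scalable benchmark of 11,840 VQA samples with six carefully designed puzzle tasks that target three core LMM competencies: visual recognition, logical reasoning, and context understanding. Unlike static benchmarks that quickly become outdated, PuzzleBench can be refreshed continually through OVPG and, thanks to a rich set of open-ended puzzle designs, adapts to the evolving capabilities of LMMs.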

Link: https://arxiv.org/abs/2504.10885
Authors: Zeyu Zhang,Zijian Chen,Zicheng Zhang,Yuze Sun,Yuan Tian,Ziheng Jia,Chunyi Li,Xiaohong Liu,Xiongkuo Min,Guangtao Zhai
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Large Multimodal Models (LMMs) have demonstrated impressive capabilities across a wide range of multimodal tasks, achieving ever-increasing performance on various evaluation benchmarks. However, existing benchmarks are typically static and often overlap with pre-training datasets, leading to fixed complexity constraints and substantial data contamination issues. Meanwhile, manually annotated datasets are labor-intensive, time-consuming, and subject to human bias and inconsistency, leading to reliability and reproducibility issues. To address these problems, we propose a fully dynamic multimodal evaluation framework, named Open-ended Visual Puzzle Generation (OVPG), which aims to generate fresh, diverse, and verifiable evaluation data automatically in puzzle-solving tasks. Specifically, the OVPG pipeline consists of a raw material sampling module, a visual content generation module, and a puzzle rule design module, which ensures that each evaluation instance is primitive, highly randomized, and uniquely solvable, enabling continual adaptation to the evolving capabilities of LMMs. Built upon OVPG, we construct PuzzleBench, a dynamic and scalable benchmark comprising 11,840 VQA samples. It features six carefully designed puzzle tasks targeting three core LMM competencies, visual recognition, logical reasoning, and context understanding. PuzzleBench differs from static benchmarks that quickly become outdated. It enables ongoing dataset refreshing through OVPG and a rich set of open-ended puzzle designs, allowing seamless adaptation to the evolving capabilities of LMMs.

[CV-84] Bringing together invertible UNets with invertible attention modules for memory-efficient diffusion models

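[Quick read]: This paper addresses the bottleneck that high computational resource requirements create when training diffusion models on high-dimensional medical datasets (such as CT scans and MRIs). The key solution is a new architecture built from an invertible UNet with invertible attention modules, which makes the memory usage of denoising diffusion models independent of the dimensionality of the dataset and reduces energy usage during training. It lowers peak memory consumption by up to 15% while maintaining image quality and achieving results comparable to state-of-the-art methods.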

Link: https://arxiv.org/abs/2504.10883
Authors: Karan Jain,Mohammad Nayeem Teli
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Diffusion models have recently achieved state-of-the-art performance on many image generation tasks. However, most models require significant computational resources to achieve this. This becomes apparent in medical image synthesis because of the 3D nature of medical datasets such as CT scans, MRIs, and electron microscopy. In this paper we propose a novel architecture for memory-efficient training of diffusion models on a single GPU for high-dimensional medical datasets. The proposed model is built from an invertible UNet architecture with invertible attention modules. This leads to the following two contributions: 1. enabling the memory usage of denoising diffusion models to be independent of the dimensionality of the dataset, and 2. reducing the energy usage during training. While this new model can be applied to a multitude of image generation tasks, we showcase its memory efficiency on the 3D BraTS2020 dataset, leading to up to a 15% decrease in peak memory consumption during training, with results comparable to SOTA and image quality maintained.
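
Invertible layers save memory because their inputs can be recomputed from their outputs during the backward pass instead of being stored. The sketch below shows the idea with an additive coupling block in the style of reversible residual networks. This construction is an assumption chosen for illustration; the abstract does not specify how the authors make their UNet and attention modules invertible, and `f` and `g` here are simple stand-ins for learned sub-networks.

```python
import math


def f(x):
    """Stand-in for a learned sub-network (it need not be invertible itself)."""
    return [math.tanh(t) for t in x]


def g(x):
    """Second stand-in sub-network."""
    return [0.5 * t for t in x]


def forward(x1, x2):
    """Additive coupling: the input is split into two halves, x1 and x2."""
    y1 = [a + b for a, b in zip(x1, f(x2))]
    y2 = [a + b for a, b in zip(x2, g(y1))]
    return y1, y2


def inverse(y1, y2):
    """Recover the inputs from the outputs by undoing the two additions in reverse."""
    x2 = [a - b for a, b in zip(y2, g(y1))]
    x1 = [a - b for a, b in zip(y1, f(x2))]
    return x1, x2


if __name__ == "__main__":
    x1, x2 = [1.0, -2.0], [0.5, 3.0]
    y1, y2 = forward(x1, x2)
    print("recovered:", inverse(y1, y2))
```

The recovery is exact up to floating-point rounding, so a training loop built on such blocks would trade extra recomputation for lower activation memory.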

[CV-85] Safe-Construct: Redefining Construction Safety Violation Recognition as 3D Multi-View Engagement Task CVPR

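[Quick read]: This paper addresses safety violation recognition in construction environments, a critical but underexplored problem. Existing methods rely mainly on 2D object detection and fail to capture the complexity of real-world violations because of an oversimplified task formulation, inadequate validation under realistic conditions, the absence of standardized baselines, and the lack of synthetic dataset generators for diverse scenarios. The key solutions are: (1) Safe-Construct, a framework that reformulates violation recognition as a 3D multi-view engagement task, leveraging scene-level worker-object context and 3D spatial understanding; and (2) the Synthetic Indoor Construction Site Generator (SICSG), which creates diverse, scalable training data to overcome data limitations. By integrating 3D multi-view spatial understanding with synthetic data generation, Safe-Construct achieves a 7.6% improvement across four violation types and sets a new benchmark for scalable and robust safety monitoring in high-risk industries.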

Link: https://arxiv.org/abs/2504.10880
Authors: Aviral Chharia,Tianyu Ren,Tomotake Furuhata,Kenji Shimada
Affiliations: Carnegie Mellon University; University of Illinois Urbana-Champaign
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: CVPR Workshop 2025; Project Website: this https URL

Abstract:Recognizing safety violations in construction environments is critical yet remains underexplored in computer vision. Existing models predominantly rely on 2D object detection, which fails to capture the complexities of real-world violations due to: (i) an oversimplified task formulation treating violation recognition merely as object detection, (ii) inadequate validation under realistic conditions, (iii) absence of standardized baselines, and (iv) limited scalability from the unavailability of synthetic dataset generators for diverse construction scenarios. To address these challenges, we introduce Safe-Construct, the first framework that reformulates violation recognition as a 3D multi-view engagement task, leveraging scene-level worker-object context and 3D spatial understanding. We also propose the Synthetic Indoor Construction Site Generator (SICSG) to create diverse, scalable training data, overcoming data limitations. Safe-Construct achieves a 7.6% improvement over state-of-the-art methods across four violation types. We rigorously evaluate our approach in near-realistic settings, incorporating four violations, four workers, 14 objects, and challenging conditions like occlusions (worker-object, worker-worker) and variable illumination (back-lighting, overexposure, sunlight). By integrating 3D multi-view spatial understanding and synthetic data generation, Safe-Construct sets a new benchmark for scalable and robust safety monitoring in high-risk industries. Project Website: this https URL

[CV-86] Large Language Model-Informed Feature Discovery Improves Prediction and Interpretation of Credibility Perceptions of Visual Content

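[Quick read]: This paper addresses the prediction of the perceived credibility of visual content, and the understanding of what drives human judgment, in a visually dominated social media landscape, with the aim of countering misinformation. The diversity and richness of visual features make these tasks challenging. The key solution is a Large Language Model (LLM)-informed feature discovery framework that uses multimodal LLMs such as GPT-4o to evaluate content credibility and explain their reasoning. Interpretable features are extracted and quantified with targeted prompts and integrated into machine learning models to improve credibility prediction. The approach was tested on 4,191 visual social media posts across eight topics in science, health, and politics, and outperformed zero-shot GPT-based predictions by 13 percent in $R^2$. The paper also identifies key features such as information concreteness and image format, and discusses implications for misinformation mitigation, visual credibility, and the role of LLMs in social science.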

Link: https://arxiv.org/abs/2504.10878
Authors: Yilang Peng,Sijia Qian,Yingdan Lu,Cuihua Shen
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 26 pages

Abstract:In today’s visually dominated social media landscape, predicting the perceived credibility of visual content and understanding what drives human judgment are crucial for countering misinformation. However, these tasks are challenging due to the diversity and richness of visual features. We introduce a Large Language Model (LLM)-informed feature discovery framework that leverages multimodal LLMs, such as GPT-4o, to evaluate content credibility and explain its reasoning. We extract and quantify interpretable features using targeted prompts and integrate them into machine learning models to improve credibility predictions. We tested this approach on 4,191 visual social media posts across eight topics in science, health, and politics, using credibility ratings from 5,355 crowdsourced workers. Our method outperformed zero-shot GPT-based predictions by 13 percent in $R^2$, and revealed key features like information concreteness and image format. We discuss the implications for misinformation mitigation, visual credibility, and the role of LLMs in social science.
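
The headline comparison is expressed in the coefficient of determination. As a reminder of what that metric measures, here is a minimal sketch of its usual definition, one minus the ratio of residual to total sum of squares. The ratings and predictions in the example are made up for illustration and have no connection to the paper's data.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


if __name__ == "__main__":
    ratings = [2.0, 3.5, 4.0, 1.5, 5.0]      # hypothetical credibility ratings
    baseline = [3.0, 3.0, 3.5, 2.5, 4.0]     # hypothetical baseline predictions
    with_features = [2.2, 3.4, 4.1, 1.8, 4.7]  # hypothetical feature-based predictions
    print("baseline R^2:", round(r_squared(ratings, baseline), 3))
    print("feature model R^2:", round(r_squared(ratings, with_features), 3))
```

A value of 1 means the predictions match the ratings exactly, and 0 means they do no better than always predicting the mean rating.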

[CV-87] Weather-Aware Object Detection Transformer for Domain Adaptation

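[Quick read]: This paper addresses the performance degradation of the Real-Time Detection Transformer (RT-DETR) under adverse weather such as fog. To improve robustness in foggy environments, it proposes three approaches: (1) domain adaptation via perceptual loss, which distills domain-invariant features from a teacher network to a student network using perceptual supervision; (2) weather adaptive attention, which introduces an auxiliary foggy image stream into the attention module and augments the attention mechanism with fog-sensitive scaling; and (3) a weather fusion encoder, a dual-stream encoder architecture that fuses clear and foggy image features via multi-head self-attention and cross-attention. Despite these architectural innovations, the experiments show that none of the methods consistently outperforms the baseline RT-DETR. The paper analyzes the limitations and potential causes, offering insights for future research on weather-aware object detection.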

Link: https://arxiv.org/abs/2504.10877
Authors: Soheil Gharatappeh,Salimeh Sekeh,Vikas Dhiman
Affiliations: University of Maine; San Diego State University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:RT-DETRs have shown strong performance across various computer vision tasks but are known to degrade under challenging weather conditions such as fog. In this work, we investigate three novel approaches to enhance RT-DETR robustness in foggy environments: (1) Domain Adaptation via Perceptual Loss, which distills domain-invariant features from a teacher network to a student using perceptual supervision; (2) Weather Adaptive Attention, which augments the attention mechanism with fog-sensitive scaling by introducing an auxiliary foggy image stream; and (3) Weather Fusion Encoder, which integrates a dual-stream encoder architecture that fuses clear and foggy image features via multi-head self and cross-attention. Despite the architectural innovations, none of the proposed methods consistently outperform the baseline RT-DETR. We analyze the limitations and potential causes, offering insights for future research in weather-aware object detection.

[CV-88] Can Vision-Language Models Understand and Interpret Dynamic Gestures from Pedestrians? Pilot Datasets and Exploration Towards Instructive Nonverbal Commands for Cooperative Autonomous Vehicles

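[Quick read]: This paper addresses the problem of correctly interpreting traffic gestures (TGs) in autonomous driving, so as to ensure safe and smooth traffic for all road users. It evaluates state-of-the-art vision-language models (VLMs) on zero-shot traffic gesture understanding, in particular their ability to caption and classify human traffic gestures. The key contribution is the creation and public release of two custom datasets, "Acted TG (ATG)" and "Instructive TG In-The-Wild (ITGI)", which cover various formal and informal traffic gestures and are annotated in natural language describing the pedestrian's body position and gesture. Models are evaluated with three methods, using expert-generated captions as baseline and control: (1) caption similarity; (2) gesture classification; and (3) pose sequence reconstruction similarity. The results show that current VLMs fall clearly short in gesture understanding: sentence similarity averages below 0.59 and classification F1 scores reach only 0.14-0.39, well below the expert baseline of 0.70. Pose reconstruction shows some potential but needs more data and refined metrics to be reliable. The paper therefore emphasizes the need for further research toward more accurate and robust traffic gesture understanding.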

Link: https://arxiv.org/abs/2504.10873
Authors: Tonko E. W. Bossen,Andreas Møgelmose,Ross Greer
Affiliations: University of California Merced; Aalborg University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments:

Abstract:In autonomous driving, it is crucial to correctly interpret traffic gestures (TGs), such as those of an authority figure providing orders or instructions, or a pedestrian signaling the driver, to ensure a safe and pleasant traffic environment for all road users. This study investigates the capabilities of state-of-the-art vision-language models (VLMs) in zero-shot interpretation, focusing on their ability to caption and classify human gestures in traffic contexts. We create and publicly share two custom datasets with varying formal and informal TGs, such as ‘Stop’, ‘Reverse’, ‘Hail’, etc. The datasets are “Acted TG (ATG)” and “Instructive TG In-The-Wild (ITGI)”. They are annotated with natural language, describing the pedestrian’s body position and gesture. We evaluate models using three methods utilizing expert-generated captions as baseline and control: (1) caption similarity, (2) gesture classification, and (3) pose sequence reconstruction similarity. Results show that current VLMs struggle with gesture understanding: sentence similarity averages below 0.59, and classification F1 scores reach only 0.14-0.39, well below the expert baseline of 0.70. While pose reconstruction shows potential, it requires more data and refined metrics to be reliable. Our findings reveal that although some SOTA VLMs can interpret zero-shot human traffic gestures, none are accurate and robust enough to be trustworthy, emphasizing the need for further research in this domain.

[CV-89] DAAF: Degradation-Aware Adaptive Fusion Framework for Robust Infrared and Visible Images Fusion

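[Quick read]: This paper addresses the shortcomings of existing infrared and visible image fusion (IVIF) algorithms when handling image degradation such as low light and noise, which limit their practical potential. It proposes Degradation-Aware Adaptive image Fusion (DAAF), which models adaptive degradation optimization and image fusion in a unified way. DAAF has two core modules: an auxiliary Adaptive Degradation Optimization Network (ADON) and a Feature Interactive Local-Global Fusion (FILGF) network. ADON handles image degradation: its infrared branch uses frequency-domain feature decomposition to isolate Gaussian and stripe noise, while its visible-light branch applies Retinex decomposition to separate illumination and reflectance components, enhancing detail and illumination distribution. FILGF performs interactive multi-scale local-global feature fusion, comprising intra-inter model feature complement for local features and an interactive cross-model attention for global features. Experiments show that DAAF outperforms current mainstream IVIF algorithms in both normal and complex degradation scenarios.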

Link: https://arxiv.org/abs/2504.10871
Authors: Tianpei Zhang,Jufeng Zhao,Yiming Zhu,Guangmang Cui,Yuxin Jing,Yuhan Lyu
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Existing infrared and visible image fusion (IVIF) algorithms often prioritize high-quality images, neglecting image degradation such as low light and noise, which limits their practical potential. This paper proposes Degradation-Aware Adaptive image Fusion (DAAF), which achieves unified modeling of adaptive degradation optimization and image fusion. Specifically, DAAF comprises an auxiliary Adaptive Degradation Optimization Network (ADON) and a Feature Interactive Local-Global Fusion (FILGF) Network. Firstly, ADON includes infrared and visible-light branches. Within the infrared branch, frequency-domain feature decomposition and extraction are employed to isolate Gaussian and stripe noise. In the visible-light branch, Retinex decomposition is applied to extract illumination and reflectance components, enabling complementary enhancement of detail and illumination distribution. Subsequently, FILGF performs interactive multi-scale local-global feature fusion. Local feature fusion consists of intra-inter model feature complement, while global feature fusion is achieved through an interactive cross-model attention. Extensive experiments have shown that DAAF outperforms current IVIF algorithms in normal and complex degradation scenarios.
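
For readers unfamiliar with Retinex decomposition, the classical Retinex image model can be written as follows. The symbols I (observed image), R (reflectance), and L (illumination) are notation assumed here for illustration. The abstract does not give DAAF's exact formulation, so this is only the textbook model that such decompositions start from.

```latex
I(x, y) = R(x, y) \cdot L(x, y)
\quad\Longrightarrow\quad
\log I(x, y) = \log R(x, y) + \log L(x, y)
```

Taking logarithms turns the product into a sum, which is why illumination and reflectance can be estimated and enhanced as separate components.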

[CV-90] ZeroGrasp: Zero-Shot Shape Reconstruction Enabled Robotic Grasping CVPR2025

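[Quick read]: This paper addresses the problem that robotic grasping methods which output grasp poses directly from partial scene information lead to suboptimal motion and even collisions. It proposes ZeroGrasp, a new framework that simultaneously performs 3D reconstruction and grasp pose prediction in near real-time. The key insight is that occlusion reasoning and modeling the spatial relationships between objects benefit both accurate reconstruction and grasping. The method is paired with a new large-scale synthetic dataset comprising 1M photo-realistic images, high-resolution 3D reconstructions, and 11.3B physically valid grasp pose annotations for 12K objects from the Objaverse-LVIS dataset. Evaluated on the GraspNet-1B benchmark and in real-world robot experiments, ZeroGrasp achieves state-of-the-art performance and generalizes to novel real-world objects by leveraging synthetic data.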

Link: https://arxiv.org/abs/2504.10857
Authors: Shun Iwase,Zubair Irshad,Katherine Liu,Vitor Guizilini,Robert Lee,Takuya Ikeda,Ayako Amma,Koichi Nishiwaki,Kris Kitani,Rares Ambrus,Sergey Zakharov
Affiliations: Carnegie Mellon University; Toyota Research Institute; Woven by Toyota
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: Published at CVPR 2025, Webpage: this https URL

Abstract:Robotic grasping is a cornerstone capability of embodied systems. Many methods directly output grasps from partial information without modeling the geometry of the scene, leading to suboptimal motion and even collisions. To address these issues, we introduce ZeroGrasp, a novel framework that simultaneously performs 3D reconstruction and grasp pose prediction in near real-time. A key insight of our method is that occlusion reasoning and modeling the spatial relationships between objects is beneficial for both accurate reconstruction and grasping. We couple our method with a novel large-scale synthetic dataset, which comprises 1M photo-realistic images, high-resolution 3D reconstructions and 11.3B physically-valid grasp pose annotations for 12K objects from the Objaverse-LVIS dataset. We evaluate ZeroGrasp on the GraspNet-1B benchmark as well as through real-world robot experiments. ZeroGrasp achieves state-of-the-art performance and generalizes to novel real-world objects by leveraging synthetic data.

[CV-91] LVLM_CSP: Accelerating Large Vision Language Models via Clustering Scattering and Pruning for Reasoning Segmentation

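[Quick read]: This paper addresses the substantial computational overhead that large vision language models (LVLMs) incur from processing large numbers of image tokens when guiding vision foundation models in reasoning segmentation tasks. Existing image token pruning methods are optimized mainly for high-level visual understanding tasks and struggle to balance computational efficiency against segmentation accuracy in visual mask generation, which requires precise semantic and spatial reasoning. The key contribution is LVLM_CSP, a training-free visual token pruning method built around three stages: clustering, scattering, and pruning. LVLM_CSP first performs coarse-grained visual reasoning with a selected subset of image tokens, then conducts fine-grained reasoning, and finally prunes most of the image tokens. Experiments show FLOPs reductions of 65% and 70% at two different levels of accuracy loss.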

Link: https://arxiv.org/abs/2504.10854
Authors: Hanning Chen,Yang Ni,Wenjun Huang,Hyunwoo Oh,Yezi Liu,Tamoghno Das,Mohsen Imani
Affiliations: University of California, Irvine
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Large Vision Language Models (LVLMs) have been widely adopted to guide vision foundation models in performing reasoning segmentation tasks, achieving impressive performance. However, the substantial computational overhead associated with LVLMs presents a new challenge. The primary source of this computational cost arises from processing hundreds of image tokens. Therefore, an effective strategy to mitigate such overhead is to reduce the number of image tokens, a process known as image token pruning. Previous studies on image token pruning for LVLMs have primarily focused on high level visual understanding tasks, such as visual question answering and image captioning. In contrast, guiding vision foundation models to generate accurate visual masks based on textual queries demands precise semantic and spatial reasoning capabilities. Consequently, pruning methods must carefully control individual image tokens throughout the LVLM reasoning process. Our empirical analysis reveals that existing methods struggle to adequately balance reductions in computational overhead with the necessity to maintain high segmentation accuracy. In this work, we propose LVLM_CSP, a novel training free visual token pruning method specifically designed for LVLM based reasoning segmentation tasks. LVLM_CSP consists of three stages: clustering, scattering, and pruning. Initially, the LVLM performs coarse-grained visual reasoning using a subset of selected image tokens. Next, fine grained reasoning is conducted, and finally, most visual tokens are pruned in the last stage. Extensive experiments demonstrate that LVLM_CSP achieves a 65% reduction in image token inference FLOPs with virtually no accuracy degradation, and a 70% reduction with only a minor 1% drop in accuracy on the 7B LVLM.
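
To make the idea of image token pruning concrete, here is a minimal sketch of generic top-k pruning by relevance score. It is not the three-stage LVLM_CSP procedure, whose details the abstract does not give. The function name, the token labels, and the scores are hypothetical.

```python
def prune_tokens(tokens, scores, keep_ratio):
    """Keep the highest-scoring tokens, preserving their original order."""
    if len(tokens) != len(scores):
        raise ValueError("tokens and scores must have the same length")
    # Always keep at least one token so downstream layers have input.
    k = max(1, round(len(tokens) * keep_ratio))
    # Indices of the k largest scores.
    top = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)[:k]
    # Re-sort the kept indices so spatial order is not scrambled.
    return [tokens[i] for i in sorted(top)]


# Hypothetical image tokens with made-up relevance scores.
tokens = ["t0", "t1", "t2", "t3", "t4", "t5", "t6", "t7"]
scores = [0.1, 0.9, 0.3, 0.8, 0.05, 0.7, 0.2, 0.4]
kept = prune_tokens(tokens, scores, keep_ratio=0.25)
```

Keeping the survivors in their original order matters for segmentation-style tasks, where the position of a token carries spatial meaning.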

[CV-92] Enhancing Features in Long-tailed Data Using Large Vision Mode

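[Quick read]: This paper addresses the reliance of language-based foundation models on linguistic data in long-tailed recognition, and explores using large vision models (LVMs) or visual foundation models (VFMs) to enhance the feature representation of long-tailed data without any language information. The key is to extract features from the LVM and fuse them with features in the baseline network's map and latent space to obtain augmented features. Several prototype-based losses are additionally designed in the latent space to further exploit the potential of the augmented features. The approach is validated on two benchmark datasets, ImageNet-LT and iNaturalist2018.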

Link: https://arxiv.org/abs/2504.10852
Authors: Pengxiao Han,Changkun Ye,Jinguang Tong,Cuicui Jiang,Jie Hong,Li Fang,Xuesong Li
Affiliations: The Australian National University; The University of Hong Kong; HKUST; Chinese Academy of Sciences; CSIRO
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Language-based foundation models, such as large language models (LLMs) or large vision-language models (LVLMs), have been widely studied in long-tailed recognition. However, the need for linguistic data is not applicable to all practical tasks. In this study, we aim to explore using large vision models (LVMs) or visual foundation models (VFMs) to enhance long-tailed data features without any language information. Specifically, we extract features from the LVM and fuse them with features in the baseline network’s map and latent space to obtain the augmented features. Moreover, we design several prototype-based losses in the latent space to further exploit the potential of the augmented features. In the experimental section, we validate our approach on two benchmark datasets: ImageNet-LT and iNaturalist2018.

[CV-93] A comprehensive review of remote sensing in wetland classification and mapping

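[Quick read]: This paper addresses the lack of a thorough scientific understanding of wetland classification and mapping, covering the scientific importance of wetlands, the major data and methods, the driving factors of wetland change, the current research paradigm and its limitations, and the challenges and opportunities under technological innovation and global environmental change. Its key approach is a meta-analysis of more than 1,200 papers that reveals trends in wetland classification and mapping across wetland types, methods, sensor types, and study sites, together with a systematic synthesis of wetland features and existing data and methods and an exploration of the intrinsic driving factors of wetland change across multiple scales. The paper also discusses current limitations and proposes future directions in response to global environmental change and technological innovation, aiming to foster transformative progress in wetland science.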

Link: https://arxiv.org/abs/2504.10842
Authors: Shuai Yuan,Xiangan Liang,Tianwu Lin,Shuang Chen,Rui Liu,Jie Wang,Hongsheng Zhang,Peng Gong
Affiliations: Department of Geography, The University of Hong Kong; Department of Earth System Science, Tsinghua University; Department of Electronics and Information Engineering, Harbin Institute of Technology (Shenzhen); Pengcheng Laboratory
Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments:

Abstract:Wetlands constitute critical ecosystems that support both biodiversity and human well-being; however, they have experienced a significant decline since the 20th century. Back in the 1970s, researchers began to employ remote sensing technologies for wetland classification and mapping to elucidate the extent and variations of wetlands. Although some review articles summarized the development of this field, there is a lack of a thorough and in-depth understanding of wetland classification and mapping: (1) the scientific importance of wetlands, (2) major data, methods used in wetland classification and mapping, (3) driving factors of wetland changes, (4) current research paradigm and limitations, (5) challenges and opportunities in wetland classification and mapping under the context of technological innovation and global environmental change. In this review, we aim to provide a comprehensive perspective and new insights into wetland classification and mapping for readers to answer these questions. First, we conduct a meta-analysis of over 1,200 papers, encompassing wetland types, methods, sensor types, and study sites, examining prevailing trends in wetland classification and mapping. Next, we review and synthesize the wetland features and existing data and methods in wetland classification and mapping. We also summarize typical wetland mapping products and explore the intrinsic driving factors of wetland changes across multiple spatial and temporal scales. Finally, we discuss current limitations and propose future directions in response to global environmental change and technological innovation. This review consolidates our understanding of wetland remote sensing and offers scientific recommendations that foster transformative progress in wetland science.

[CV-94] LightFormer: A lightweight and efficient decoder for remote sensing image segmentation

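[Quick read]: This paper addresses the bottleneck that decoder complexity imposes on real-time deployment of deep learning on edge platforms, particularly for time-critical tasks such as semantic segmentation of remote sensing images and land-use change detection. Its key contribution is LightFormer, a lightweight decoder whose feature-fusion and refinement module, built on channel processing and a learnable gating mechanism, aggregates multi-scale information efficiently and substantially reduces model complexity. In addition, a spatial information selection module (SISM) combines long-range attention with a detail preservation branch, improving the recognition of unstructured targets in complex scenes. These designs let LightFormer retain high accuracy while being computationally economical, giving a practical accuracy-efficiency trade-off for remote sensing applications.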

Link: https://arxiv.org/abs/2504.10834
Authors: Sihang Chen,Lijun Yun,Ze Liu,JianFeng Zhu,Jie Chen,Hui Wang,Yueping Nie
Affiliations: Aerospace Information Research Institute; Chinese Academy of Sciences; Ministry of Natural Resources, Beijing, China; Fudan University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 26 pages, 69 figures

Abstract:Deep learning techniques have achieved remarkable success in the semantic segmentation of remote sensing images and in land-use change detection. Nevertheless, their real-time deployment on edge platforms remains constrained by decoder complexity. Herein, we introduce LightFormer, a lightweight decoder for time-critical tasks that involve unstructured targets, such as disaster assessment, unmanned aerial vehicle search-and-rescue, and cultural heritage monitoring. LightFormer employs a feature-fusion and refinement module built on channel processing and a learnable gating mechanism to aggregate multi-scale, multi-range information efficiently, which drastically curtails model complexity. Furthermore, we propose a spatial information selection module (SISM) that integrates long-range attention with a detail preservation branch to capture spatial dependencies across multiple scales, thereby substantially improving the recognition of unstructured targets in complex scenes. On the ISPRS Vaihingen benchmark, LightFormer attains 99.9% of GLFFNet’s mIoU (83.9% vs. 84.0%) while requiring only 14.7% of its FLOPs and 15.9% of its parameters, thus achieving an excellent accuracy-efficiency trade-off. Consistent results on LoveDA, ISPRS Potsdam, RescueNet, and FloodNet further demonstrate its robustness and superior perception of unstructured objects. These findings highlight LightFormer as a practical solution for remote sensing applications where both computational economy and high-precision segmentation are imperative.
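
The "99.9% of GLFFNet's mIoU" figure can be checked directly from the two mIoU values quoted in the abstract. The short sketch below does only that arithmetic. The helper name is an assumption for illustration, and no FLOPs or parameter counts are computed because the abstract gives them only as percentages.

```python
def percent_of(value: float, reference: float) -> float:
    """Express value as a percentage of reference."""
    return 100.0 * value / reference


# mIoU figures quoted in the abstract (Vaihingen): LightFormer 83.9, GLFFNet 84.0.
miou_share = percent_of(83.9, 84.0)
print(f"LightFormer reaches {miou_share:.1f}% of GLFFNet's mIoU")
```

The gap is 0.1 mIoU points, which rounds to the 99.9% stated in the abstract.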

[CV-95] Towards Spatially-Aware and Optimally Faithful Concept-Based Explanations

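[Quick read]: This paper addresses limitations in how faithfulness is evaluated for post-hoc, unsupervised concept-based explanation methods (U-CBEMs), which generate semantic explanations of the decision-making processes of deep neural networks. Although it is vital that an explanation be faithful to the model, prior faithfulness metrics have limitations; most notably, they consider only the set of concepts present and ignore how the concepts are spatially distributed. To address this, the paper proposes Surrogate Faithfulness (SF), an evaluation method that introduces a spatially-aware surrogate and two new faithfulness metrics. Using SF, it produces Optimally Faithful (OF) explanations, whose concepts are found by maximizing faithfulness. The key is adding spatial awareness to the faithfulness evaluation and learning optimal concepts; the learned concepts are reported to generalize well to out-of-domain data and to be more robust to adversarial examples.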

Link: https://arxiv.org/abs/2504.10833
Authors: Shubham Kumar,Dwip Dalal,Narendra Ahuja
Affiliations: UIUC
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Post-hoc, unsupervised concept-based explanation methods (U-CBEMs) are a promising tool for generating semantic explanations of the decision-making processes in deep neural networks, having applications in both model improvement and understanding. It is vital that the explanation is accurate, or faithful, to the model, yet we identify several limitations of prior faithfulness metrics that inhibit an accurate evaluation; most notably, prior metrics involve only the set of concepts present, ignoring how they may be spatially distributed. We address these limitations with Surrogate Faithfulness (SF), an evaluation method that introduces a spatially-aware surrogate and two novel faithfulness metrics. Using SF, we produce Optimally Faithful (OF) explanations, where concepts are found that maximize faithfulness. Our experiments show that (1) adding spatial-awareness to prior U-CBEMs increases faithfulness in all cases; (2) OF produces significantly more faithful explanations than prior U-CBEMs (30% or higher improvement in error); (3) OF’s learned concepts generalize well to out-of-domain data and are more robust to adversarial examples, where prior U-CBEMs struggle.

[CV-96] LayoutCoT: Unleashing the Deep Reasoning Potential of Large Language Models for Layout Generation

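[Quick read]: This paper addresses two main problems of existing methods for conditional layout generation: methods based on generative models typically need large amounts of training data or extensive fine-tuning, which limits their versatility and practicality; and training-free methods that rely on the in-context learning of large language models (LLMs) have limited reasoning capabilities and overly simplistic ranking mechanisms, so they do not generate high-quality layouts consistently. The paper proposes LayoutCoT, which combines Retrieval-Augmented Generation (RAG) with Chain-of-Thought (CoT) techniques. Layout representations are converted into a standardized serialized format suitable for LLMs; a layout-aware RAG retrieves exemplars and an LLM generates a coarse layout; a specially designed CoT reasoning module then refines this preliminary layout iteratively, significantly improving semantic coherence and visual quality. Experiments show that LayoutCoT achieves state-of-the-art performance without training or fine-tuning and demonstrate the potential of standard LLMs to unleash deep reasoning in layout generation.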

Link: https://arxiv.org/abs/2504.10829
Authors: Hengyu Shi,Junhao Su,Huansheng Ning,Xiaoming Wei,Jialin Gao
Affiliations: Meituan; University of Science and Technology Beijing
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Conditional layout generation aims to automatically generate visually appealing and semantically coherent layouts from user-defined constraints. While recent methods based on generative models have shown promising results, they typically require substantial amounts of training data or extensive fine-tuning, limiting their versatility and practical applicability. Alternatively, some training-free approaches leveraging in-context learning with Large Language Models (LLMs) have emerged, but they often suffer from limited reasoning capabilities and overly simplistic ranking mechanisms, which restrict their ability to generate consistently high-quality layouts. To this end, we propose LayoutCoT, a novel approach that leverages the reasoning capabilities of LLMs through a combination of Retrieval-Augmented Generation (RAG) and Chain-of-Thought (CoT) techniques. Specifically, LayoutCoT transforms layout representations into a standardized serialized format suitable for processing by LLMs. A Layout-aware RAG is used to facilitate effective retrieval and generate a coarse layout by LLMs. This preliminary layout, together with the selected exemplars, is then fed into a specially designed CoT reasoning module for iterative refinement, significantly enhancing both semantic coherence and visual quality. We conduct extensive experiments on five public datasets spanning three conditional layout generation tasks. Experimental results demonstrate that LayoutCoT achieves state-of-the-art performance without requiring training or fine-tuning. Notably, our CoT reasoning module enables standard LLMs, even those without explicit deep reasoning abilities, to outperform specialized deep-reasoning models such as deepseek-R1, highlighting the potential of our approach in unleashing the deep reasoning capabilities of LLMs for layout generation tasks.

[CV-97] OmniVDiff: Omni Controllable Video Diffusion for Generation and Understanding

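[Quick read]: This paper addresses the problem of synthesizing and understanding multiple kinds of video visual content within a single diffusion model, and proposes a new framework named OmniVDiff. The key is to treat all video visual modalities in the color space to learn a joint distribution, and to use an adaptive control strategy that dynamically assigns each visual modality a role during the diffusion process, either as a generation modality or as a conditioning modality. This allows flexible control over each modality's role and supports a wide range of tasks. The model provides three core functionalities: text-conditioned video generation, video understanding, and video generation conditioned on fine-grained attributes (such as depth maps or segmentation maps), improving the flexibility and scalability of controllable video diffusion.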

Link: https://arxiv.org/abs/2504.10825
Authors: Dianbing Xi,Jiepeng Wang,Yuanzhi Liang,Xi Qiu,Yuchi Huo,Rui Wang,Chi Zhang,Xuelong Li
Affiliations: Zhejiang University; Institute of Artificial Intelligence, China Telecom (TeleAI)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Our project page: this https URL

Abstract:In this paper, we propose a novel framework for controllable video diffusion, OmniVDiff, aiming to synthesize and comprehend multiple video visual content in a single diffusion model. To achieve this, OmniVDiff treats all video visual modalities in the color space to learn a joint distribution, while employing an adaptive control strategy that dynamically adjusts the role of each visual modality during the diffusion process, either as a generation modality or a conditioning modality. This allows flexible manipulation of each modality’s role, enabling support for a wide range of tasks. Consequently, our model supports three key functionalities: (1) Text-conditioned video generation: multi-modal visual video sequences (i.e., rgb, depth, canny, segmentation) are generated based on the text conditions in one diffusion process; (2) Video understanding: OmniVDiff can estimate the depth, canny map, and semantic segmentation across the input rgb frames while ensuring coherence with the rgb input; and (3) X-conditioned video generation: OmniVDiff generates videos conditioned on fine-grained attributes (e.g., depth maps or segmentation maps). By integrating these diverse tasks into a unified video diffusion framework, OmniVDiff enhances the flexibility and scalability for controllable video diffusion, making it an effective tool for a variety of downstream applications, such as video-to-video translation. Extensive experiments demonstrate the effectiveness of our approach, highlighting its potential for various video-related applications.

[CV-98] IlluSign: Illustrating Sign Language Videos by Leveraging the Attention Mechanism

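[Quick read]: This paper addresses the conversion of sign language videos into static illustrations that complement traditional video-based educational material, particularly for new learners and educators. The dynamic nature of signs makes them hard to study in detail, and having an artist draw illustrations is costly. The paper proposes a method that leverages generative models' understanding of the semantic and geometric aspects of images: it transfers a sketch-like illustration style to sign language footage, combines the start and end frames of a sign into a single illustration, and uses arrows to highlight the hand's direction and motion. The key is to intervene in the denoising process of a diffusion model, injecting style as keys and values into high-resolution attention layers and fusing geometric information from the image and edges as queries; the attention mechanism is then used to softly combine the start and end frames. The result is a cost-effective way to generate sign language illustrations, addressing the lack of such resources in educational materials.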

Link: https://arxiv.org/abs/2504.10822
Authors: Janna Bruner,Amit Moryossef,Lior Wolf
Affiliations: Reichman University; University of Zurich; Tel Aviv University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Sign languages are dynamic visual languages that involve hand gestures, in combination with non manual elements such as facial expressions. While video recordings of sign language are commonly used for education and documentation, the dynamic nature of signs can make it challenging to study them in detail, especially for new learners and educators. This work aims to convert sign language video footage into static illustrations, which serve as an additional educational resource to complement video content. This process is usually done by an artist, and is therefore quite costly. We propose a method that illustrates sign language videos by leveraging generative models’ ability to understand both the semantic and geometric aspects of images. Our approach focuses on transferring a sketch like illustration style to video footage of sign language, combining the start and end frames of a sign into a single illustration, and using arrows to highlight the hand’s direction and motion. While many style transfer methods address domain adaptation at varying levels of abstraction, applying a sketch like style to sign languages, especially for hand gestures and facial expressions, poses a significant challenge. To tackle this, we intervene in the denoising process of a diffusion model, injecting style as keys and values into high resolution attention layers, and fusing geometric information from the image and edges as queries. For the final illustration, we use the attention mechanism to combine the attention weights from both the start and end illustrations, resulting in a soft combination. Our method offers a cost effective solution for generating sign language illustrations at inference time, addressing the lack of such resources in educational materials.
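
The phrase "injecting style as keys and values ... and fusing geometric information ... as queries" can be read against standard scaled dot-product attention, sketched below. The subscripts "geometry" and "style" are labels assumed here for illustration. The abstract does not specify how the queries are fused or how the start and end illustrations are weighted, so those steps are not shown.

```latex
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V,
\qquad Q = Q_{\text{geometry}},\quad K = K_{\text{style}},\quad V = V_{\text{style}}
```

Under this reading, the geometry of the video frame decides where to look, while the style image supplies what is written to the output.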

[CV-99] PatrolVision: Automated License Plate Recognition in the wild

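[Quick read]: This paper addresses the low adoption of AI techniques in public services, which it attributes to challenges of accuracy and speed when processing information at population scale. Although computer vision has clear potential for traffic monitoring, few automatic license plate recognition (ALPR) systems offer an end-to-end solution for city patrolling. The paper presents a prototype of a low-power GPU-based patrolling system, deployed on surveillance vehicles in an urban environment, for automated vehicle detection, recognition, and tracking. The key is a complete ALPR system for Singapore license plates, both single-line and double-line, built on a custom YOLO-based network. The work focuses on unconstrained capture scenarios, where plates may be considerably distorted by oblique views: RFB-Net first detects and rectifies multiple distorted plates in an image, and the authors' own network then recognizes the characters. On a newly built dataset of more than 16,000 images, the system detects license plates with 86% precision and recognizes all characters of a plate in 67% of the test set, rising to 89% accuracy when one incorrect character is allowed (partial match). It runs at 64 FPS on a Tesla P4 GPU.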

Link: https://arxiv.org/abs/2504.10810
Authors: Anmol Singhal,Navya Singhal
Affiliations: New York University; University of Texas
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted in IEEE Southeast Con 2025. To be published in IEEEXplore

Abstract:Adoption of AI-driven techniques in public services remains low due to challenges related to accuracy and speed of information at population scale. Computer vision techniques for traffic monitoring have not gained much popularity despite their relative strength in areas such as autonomous driving. Despite the large number of academic methods for Automatic License Plate Recognition (ALPR) systems, very few provide an end-to-end solution for patrolling in the city. This paper presents a novel prototype for a low-power GPU-based patrolling system to be deployed in an urban environment on surveillance vehicles for automated vehicle detection, recognition and tracking. In this work, we propose a complete ALPR system for Singapore license plates, covering both single-line and double-line plates, and build our own YOLO-based network. We focus on unconstrained capture scenarios as would be the case in real-world applications, where the license plate (LP) might be considerably distorted due to oblique views. In this work, we first detect the license plate from the full image using RFB-Net and rectify multiple distorted license plates in a single image. After that, the detected license plate image is fed to our network for character recognition. We evaluate the performance of our proposed system on a newly built dataset covering more than 16,000 images. The system was able to correctly detect license plates with 86% precision and recognize characters of a license plate in 67% of the test set, and 89% accuracy with one incorrect character (partial match). We also test the latency of our system and achieve 64 FPS on a Tesla P4 GPU.
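
The distinction between the 67% exact-match rate and the 89% partial-match rate can be illustrated with a small sketch. The abstract does not define how a "partial match" is scored, so the position-wise comparison below is an assumption, and the plate strings are made-up examples rather than data from the paper.

```python
def char_errors(pred: str, truth: str) -> int:
    """Count mismatched positions; each missing or extra character is one error."""
    mismatches = sum(p != t for p, t in zip(pred, truth))
    return mismatches + abs(len(pred) - len(truth))


def match_rates(preds, truths, max_errors=1):
    """Return (exact-match rate, rate with at most max_errors wrong characters)."""
    if len(preds) != len(truths) or not truths:
        raise ValueError("preds and truths must be non-empty and of equal length")
    errors = [char_errors(p, t) for p, t in zip(preds, truths)]
    exact = sum(e == 0 for e in errors)
    partial = sum(e <= max_errors for e in errors)
    return exact / len(truths), partial / len(truths)


# Hypothetical plates: two exact reads, one with a single wrong
# character, and one with two wrong characters.
truths = ["SBA1234A", "SGX5678K", "SJC9012M", "SKD3456P"]
preds = ["SBA1234A", "SGX5678X", "SJC9O12N", "SKD3456P"]
exact_rate, partial_rate = match_rates(preds, truths)
```

On these toy plates the exact rate is 0.5 and the partial rate is 0.75. The partial-match rate can never be lower than the exact rate, which is consistent with the 67% and 89% figures reported.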

[CV-100] GaSLight: Gaussian Splats for Spatially-Varying Lighting in HDR

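[Quick read]: This paper addresses the generation of spatially-varying lighting from regular images. It proposes GaSLight, whose core ideas are: (1) HDR Gaussian Splats as the light source representation, which for the first time lets regular images serve as light sources in a 3D renderer; and (2) a two-stage process in which the first stage plausibly and accurately enhances the dynamic range of images using the priors embedded in diffusion models, and the second stage models 3D lighting with Gaussian Splats to obtain spatially-varying lighting. To make it easier to benchmark images as light sources, the paper also introduces a new dataset of calibrated and unsaturated HDR and evaluates on it together with an existing dataset. The method achieves state-of-the-art results on HDR estimation and on its application to illuminating virtual objects and scenes.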

Link: https://arxiv.org/abs/2504.10809
Authors: Christophe Bolduc,Yannick Hold-Geoffroy,Zhixin Shu,Jean-François Lalonde
Affiliations: Université Laval; Adobe
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:We present GaSLight, a method that generates spatially-varying lighting from regular images. Our method proposes using HDR Gaussian Splats as light source representation, marking the first time regular images can serve as light sources in a 3D renderer. Our two-stage process first enhances the dynamic range of images plausibly and accurately by leveraging the priors embedded in diffusion models. Next, we employ Gaussian Splats to model 3D lighting, achieving spatially variant lighting. Our approach yields state-of-the-art results on HDR estimations and their applications in illuminating virtual objects and scenes. To facilitate the benchmarking of images as light sources, we introduce a novel dataset of calibrated and unsaturated HDR to evaluate images as light sources. We assess our method using a combination of this novel dataset and an existing dataset from the literature. The code to reproduce our method will be available upon acceptance.

[CV-101] Tabular foundation model to detect empathy from visual cues

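[Quick read]: This paper addresses the detection of empathy from video interactions when only extracted features (tabular data), rather than raw video, are available. Tree-based classical machine learning approaches have been the best-performing models on such data; motivated by the success of textual foundation models (large language models), the paper explores tabular foundation models for empathy detection from tabular visual features. The key is to apply two recent tabular foundation models, TabPFN v2 and TabICL, in in-context learning and fine-tuning setups. On a public human-robot interaction benchmark, these methods significantly improve cross-subject empathy detection (accuracy: 0.590 → 0.730; AUC: 0.564 → 0.669), and the paper contributes new insights and an evaluation setup to ensure generalisation to unseen subjects.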

Link: https://arxiv.org/abs/2504.10808
Authors: Md Rakibul Hasan,Shafin Rahman,Md Zakir Hossain,Aneesh Krishna,Tom Gedeon
Affiliations: Curtin University; North South University
Categories: Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
Comments:

Abstract:Detecting empathy from video interactions is an emerging area of research. Video datasets, however, are often released as extracted features (i.e., tabular data) rather than raw footage due to privacy and ethical concerns. Prior research on such tabular datasets established tree-based classical machine learning approaches as the best-performing models. Motivated by the recent success of textual foundation models (i.e., large language models), we explore the use of tabular foundation models in empathy detection from tabular visual features. We experiment with two recent tabular foundation models - TabPFN v2 and TabICL - through in-context learning and fine-tuning setups. Our experiments on a public human-robot interaction benchmark demonstrate a significant boost in cross-subject empathy detection accuracy over several strong baselines (accuracy: 0.590 → 0.730; AUC: 0.564 → 0.669). In addition to performance improvement, we contribute novel insights and an evaluation setup to ensure generalisation on unseen subjects in this public benchmark. As the practice of releasing video features as tabular datasets is likely to persist due to privacy constraints, our findings will be widely applicable to future empathy detection video datasets as well.

[CV-102] Power-scaled Bayesian Inference with Score-based Generative Models

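【Quick read】: This paper addresses how to flexibly control the influence of the prior and the likelihood on the posterior within the Bayesian inference framework, without retraining the model for each power-scaling configuration. Its key contribution is a score-based generative algorithm for sampling from power-scaled priors and likelihoods. By sampling from intermediate power posteriors, the method enables sensitivity analysis of the relative influence of prior and likelihood, and the paper evaluates the power parameter in several settings: applied only to the prior, only to the likelihood of the Bayesian formulation, or to both. Experiments show that increasing the likelihood power up to a certain threshold improves the fidelity of posterior samples to the conditioning data (e.g., seismic images), while lowering the prior power promotes structural diversity among samples. Moderate scaling of the likelihood also reduces the shot data residual, which supports its use for posterior refinement.

Link: https://arxiv.org/abs/2504.10807
Authors: Huseyin Tuna Erdinc,Yunlin Zeng,Abhinav Prakash Gahlot,Felix J. Herrmann
Affiliations: Georgia Institute of Technology
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Geophysics (physics.geo-ph)
Comments: 8 pages, 4 figures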

Abstract:We propose a score-based generative algorithm for sampling from power-scaled priors and likelihoods within the Bayesian inference framework. Our algorithm enables flexible control over prior-likelihood influence without requiring retraining for different power-scaling configurations. Specifically, we focus on synthesizing seismic velocity models conditioned on imaged seismic. Our method enables sensitivity analysis by sampling from intermediate power posteriors, allowing us to assess the relative influence of the prior and likelihood on samples of the posterior distribution. Through a comprehensive set of experiments, we evaluate the effects of varying the power parameter in different settings: applying it solely to the prior, to the likelihood of a Bayesian formulation, and to both simultaneously. The results show that increasing the power of the likelihood up to a certain threshold improves the fidelity of posterior samples to the conditioning data (e.g., seismic images), while decreasing the prior power promotes greater structural diversity among samples. Moreover, we find that moderate scaling of the likelihood leads to a reduced shot data residual, confirming its utility in posterior refinement.
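One way to see why power scaling needs no retraining is that the score (gradient of the log density) of a power-scaled posterior p(x)^α · p(y|x)^β is simply α times the prior score plus β times the likelihood score. The sketch below illustrates this with a one-dimensional Gaussian prior and Gaussian likelihood whose scores are known in closed form. It is a toy under those assumptions, not the paper's algorithm: the paper works with learned scores for seismic velocity models, and the abstract does not say how noisy diffusion-time scores are combined. All function names and default values are hypothetical.

```python
def prior_score(x, mu0=0.0, sigma0=1.0):
    """Score of a Gaussian prior N(mu0, sigma0^2): d/dx log p(x)."""
    return (mu0 - x) / sigma0**2


def likelihood_score(x, y, sigma=0.5):
    """Score of a Gaussian likelihood y | x ~ N(x, sigma^2): d/dx log p(y | x)."""
    return (y - x) / sigma**2


def power_posterior_score(x, y, alpha=1.0, beta=1.0):
    """Score of p(x)^alpha * p(y | x)^beta: a weighted sum of the two scores."""
    return alpha * prior_score(x) + beta * likelihood_score(x, y)


def power_posterior_mode(y, alpha=1.0, beta=1.0, mu0=0.0, sigma0=1.0, sigma=0.5):
    """Closed-form mode of the Gaussian power posterior (where the score is zero)."""
    w_prior = alpha / sigma0**2
    w_like = beta / sigma**2
    return (w_prior * mu0 + w_like * y) / (w_prior + w_like)


y_obs = 2.0
mode_plain = power_posterior_mode(y_obs, alpha=1.0, beta=1.0)
mode_sharp = power_posterior_mode(y_obs, alpha=1.0, beta=2.0)
```

In this toy, raising the likelihood power β pulls the posterior mode toward the observation, which mirrors the abstract's finding that a larger likelihood power improves fidelity to the conditioning data.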

[CV-103] The Sword of Damocles in ViTs: Computational Redundancy Amplifies Adversarial Transferability

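【Quick read】: This paper studies adversarial robustness in Vision Transformers (ViTs), in particular the observation that adversarial examples crafted on ViTs transfer better than those crafted on convolutional neural networks (CNNs). Its key idea is to investigate computational redundancy in ViTs and exploit it to improve the quality and transferability of adversarial examples. The analysis identifies two forms of redundancy, data-level and model-level, and builds on them with a suite of techniques: attention sparsity manipulation, attention head permutation, clean token regularization, ghost MoE diversification, and test-time adversarial training. Experiments on ImageNet-1k show that these methods significantly outperform existing baselines in both transferability and generality across diverse model architectures.

Link: https://arxiv.org/abs/2504.10804
Authors: Jiani Liu,Zhiyuan Wang,Zeliang Zhang,Chao Huang,Susan Liang,Yunlong Tang,Chenliang Xu
Affiliations: University of Rochester; University of California, Santa Barbara
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Work in progress. 10 pages. 4 figures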

Abstract:Vision Transformers (ViTs) have demonstrated impressive performance across a range of applications, including many safety-critical tasks. However, their unique architectural properties raise new challenges and opportunities in adversarial robustness. In particular, we observe that adversarial examples crafted on ViTs exhibit higher transferability compared to those crafted on CNNs, suggesting that ViTs contain structural characteristics favorable for transferable attacks. In this work, we investigate the role of computational redundancy in ViTs and its impact on adversarial transferability. Unlike prior studies that aim to reduce computation for efficiency, we propose to exploit this redundancy to improve the quality and transferability of adversarial examples. Through a detailed analysis, we identify two forms of redundancy, including the data-level and model-level, that can be harnessed to amplify attack effectiveness. Building on this insight, we design a suite of techniques, including attention sparsity manipulation, attention head permutation, clean token regularization, ghost MoE diversification, and test-time adversarial training. Extensive experiments on the ImageNet-1k dataset validate the effectiveness of our approach, showing that our methods significantly outperform existing baselines in both transferability and generality across diverse model architectures.

[CV-104] 3D Wavelet Convolutions with Extended Receptive Fields for Hyperspectral Image Classification

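【Quick read】: This paper targets several challenges in hyperspectral image classification, including high-dimensional data, sparse ground object distributions, and spectral redundancy, which often lead to overfitting and limited generalization. To adapt better to ground object distributions while expanding receptive fields without introducing excessive parameters, and while skipping redundant information, it proposes WCNet, an improved 3D-DenseNet model integrated with wavelet transforms. The key to the solution is wavelet convolution (Wavelet Conv), which uses cascading to extend the convolutional receptive field and guide the CNN to respond better to low-frequency components. Each convolution focuses on a different frequency band of the input signal with a gradually increasing effective range, which places more emphasis on low frequencies while adding only a small number of trainable parameters. This dynamic mechanism lets the model focus flexibly on critical spatial structures in different regions rather than relying on the fixed receptive field of a single static kernel. The Wavelet Conv module expands receptive fields through 3D wavelet transforms and strengthens representation capability without increasing network depth or width. Experiments show better performance than mainstream hyperspectral image classification methods on the IN, UP, and KSC datasets.

Link: https://arxiv.org/abs/2504.10795
Authors: Guandong Li,Mengxia Ye
Affiliations: aiFLYTEK; Aegon THTF
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: arXiv admin note: substantial text overlap with arXiv:2504.04463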

Abstract:Deep neural networks face numerous challenges in hyperspectral image classification, including high-dimensional data, sparse ground object distributions, and spectral redundancy, which often lead to classification overfitting and limited generalization capability. To better adapt to ground object distributions while expanding receptive fields without introducing excessive parameters and skipping redundant information, this paper proposes WCNet, an improved 3D-DenseNet model integrated with wavelet transforms. We introduce wavelet transforms to effectively extend convolutional receptive fields and guide CNNs to better respond to low frequencies through cascading, termed wavelet convolution. Each convolution focuses on different frequency bands of the input signal with gradually increasing effective ranges. This process enables greater emphasis on low-frequency components while adding only a small number of trainable parameters. This dynamic approach allows the model to flexibly focus on critical spatial structures when processing different regions, rather than relying on fixed receptive fields of single static kernels. The Wavelet Conv module enhances model representation capability by expanding receptive fields through 3D wavelet transforms without increasing network depth or width. Experimental results demonstrate superior performance on the IN, UP, and KSC datasets, outperforming mainstream hyperspectral image classification methods.
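The cascading idea, where each level looks at a lower frequency band with a larger effective range, can be illustrated with a plain one-dimensional Haar wavelet decomposition. This is only a toy sketch: WCNet uses 3D wavelet transforms with learned convolutions on each band, none of which appear here, and the helper names `haar_step` and `haar_cascade` are hypothetical. The input length is assumed to be divisible by 2 at every level.

```python
import math


def haar_step(signal):
    """One Haar level: split a signal into low- and high-frequency halves."""
    scale = 1.0 / math.sqrt(2.0)
    low = [(signal[i] + signal[i + 1]) * scale for i in range(0, len(signal) - 1, 2)]
    high = [(signal[i] - signal[i + 1]) * scale for i in range(0, len(signal) - 1, 2)]
    return low, high


def haar_cascade(signal, levels):
    """Repeatedly decompose the low band.

    After k levels, each low-band coefficient summarizes 2**k input samples,
    so a fixed-size kernel applied to that band covers a wider input range.
    """
    high_bands = []
    low = list(signal)
    for _ in range(levels):
        low, high = haar_step(low)
        high_bands.append(high)
    return low, high_bands


# A constant signal has no high-frequency content at any level.
low_band, high_bands = haar_cascade([1.0] * 8, levels=3)
```

Because the low band halves in length at every level, a small kernel applied after three levels already spans eight input samples, which is the sense in which the receptive field grows without adding depth or many parameters.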

[CV-105] Visual Language Models show widespread visual deficits on neuropsychological tests

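【Quick read】: This paper examines the limits of visual reasoning in Visual Language Models (VLMs), in particular whether they have a grasp of basic visual concepts comparable to that of humans. Although VLMs do well on straightforward object recognition tasks, they show widespread deficits on low- and mid-level visual concepts such as orientation, position, continuity, and occlusion. These deficits suggest that complex object recognition in VLMs does not depend on the foundational visual concepts that humans rely on.

The key to the approach is the toolkit of neuropsychology: 51 tests drawn from six clinical and experimental batteries are used to systematically assess three state-of-the-art VLMs across visual domains and to compare their abilities with normative performance in healthy adults. Profiling the models with validated test batteries exposes their shortcomings in basic visual concepts and clarifies the extent of their visual reasoning abilities.

Link: https://arxiv.org/abs/2504.10786
Authors: Gene Tangtartharakul,Katherine R. Storrs
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 31 pages, 3 figures, 1 supplementary document with 1 figure and 51 sample images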

Abstract:Visual Language Models (VLMs) show remarkable performance in visual reasoning tasks, successfully tackling college-level challenges that require high-level understanding of images. However, some recent reports of VLMs struggling to reason about elemental visual concepts like orientation, position, continuity, and occlusion suggest a potential gulf between human and VLM vision. Here we use the toolkit of neuropsychology to systematically assess the capabilities of three state-of-the-art VLMs across visual domains. Using 51 tests drawn from six clinical and experimental batteries, we characterise the visual abilities of leading VLMs relative to normative performance in healthy adults. While the models excel in straightforward object recognition tasks, we find widespread deficits in low- and mid-level visual abilities that would be considered clinically significant in humans. These selective deficits, profiled through validated test batteries, suggest that an artificial system can achieve complex object recognition without developing foundational visual concepts that in humans require no explicit training.

[CV-106] Rainy: Unlocking Satellite Calibration for Deep Learning in Precipitation

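【Quick read】: This paper addresses the limitations of traditional precipitation estimation methods, which stem from the difficulty of data acquisition and of capturing complex feature relationships, as well as the lack of standardized multi-source satellite datasets and the common exclusive reliance on station data. Its two main contributions are the Rainy dataset and Taper Loss. The Rainy dataset integrates pure satellite data with station data and supports five precipitation-related tasks; Taper Loss is designed for tasks where only in-situ data is available without area-wide support. Together they demonstrate collaboration between quantitative remote sensing (QRS) and computer vision, provide data support for AI for Science in QRS, and encourage interdisciplinary collaboration and integration.

Link: https://arxiv.org/abs/2504.10776
Authors: Zhenyu Yu,Hanqing Chen,Mohd Yamani Idna Idris,Pei Wang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: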

Abstract:Precipitation plays a critical role in the Earth’s hydrological cycle, directly affecting ecosystems, agriculture, and water resource management. Accurate precipitation estimation and prediction are crucial for understanding climate dynamics, disaster preparedness, and environmental monitoring. In recent years, artificial intelligence (AI) has gained increasing attention in quantitative remote sensing (QRS), enabling more advanced data analysis and improving precipitation estimation accuracy. Although traditional methods have been widely used for precipitation estimation, they face limitations due to the difficulty of data acquisition and the challenge of capturing complex feature relationships. Furthermore, the lack of standardized multi-source satellite datasets, and in most cases, the exclusive reliance on station data, significantly hinders the effective application of advanced AI models. To address these challenges, we propose the Rainy dataset, a multi-source spatio-temporal dataset that integrates pure satellite data with station data, and propose Taper Loss, designed to fill the gap in tasks where only in-situ data is available without area-wide support. The Rainy dataset supports five main tasks: (1) satellite calibration, (2) precipitation event prediction, (3) precipitation level prediction, (4) spatiotemporal prediction, and (5) precipitation downscaling. For each task, we selected benchmark models and evaluation metrics to provide valuable references for researchers. Using precipitation as an example, the Rainy dataset and Taper Loss demonstrate the seamless collaboration between QRS and computer vision, offering data support for AI for Science in the field of QRS and providing valuable insights for interdisciplinary collaboration and integration.

[CV-107] Minimal Sensing for Orienting a Solar Panel

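【Quick read】: This paper addresses how to find the direction in which a solar panel receives the maximum total irradiance, given an arbitrary panel orientation and arbitrary environmental illumination. The key to the solution is a minimal sensing approach that uses measurements from just four photodetectors to iteratively adjust the panel's tilt and maximize irradiance. Because many environments produce irradiance functions with multiple local maxima, plain gradient ascent does not work. The paper shows that a larger, optimized tilt between the detectors and the panel is equivalent to blurring the irradiance function, which eliminates local maxima and turns it into a unimodal function whose maximum can be found with gradient ascent. The paper also shows a close relationship between this approach and scale space theory, verifies robustness in simulations on UrbanSky, a dataset of high-dynamic-range lighting environments, and demonstrates with a portable experimental setup that the method harvests significantly more energy than standard control approaches in a variety of real-world settings.

Link: https://arxiv.org/abs/2504.10765
Authors: Jeremy Klotz,Shree K. Nayar
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 12 pages, 9 figures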

Abstract: A solar panel harvests the most energy when pointing in the direction that maximizes the total illumination (irradiance) falling on it. Given an arbitrary orientation of a panel and an arbitrary environmental illumination, we address the problem of finding the direction of maximum total irradiance. We develop a minimal sensing approach where measurements from just four photodetectors are used to iteratively vary the tilt of the panel to maximize the irradiance. Many environments produce irradiance functions with multiple local maxima. As a result, simply measuring the gradient of the irradiance function and applying gradient ascent will not work. We show that a larger, optimized tilt between the detectors and the panel is equivalent to blurring the irradiance function. This has the effect of eliminating local maxima and turning the irradiance function into a unimodal one, whose maximum can be found using gradient ascent. We show that there is a close relationship between our approach and scale space theory. We have collected a large dataset of high-dynamic range lighting environments in New York City, called UrbanSky. We used this dataset to conduct simulations to verify the robustness of our approach. Finally, we have built a portable solar panel with four compact detectors and an actuator to conduct experiments in various real-world settings: direct sunlight, cloudy sky, urban settings with occlusions and shadows, and complex indoor lighting. In all cases, we show significant improvements in harvested energy compared to standard approaches for controlling the orientation of a solar panel.
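The claim that blurring removes local maxima, so that simple ascent reaches the global maximum, can be checked on a tiny one-dimensional example. The sketch below uses a made-up list of irradiance samples and a 3-tap box blur as a stand-in for the optical blurring that the paper obtains by tilting the detectors; the data and all function names are hypothetical, and the real problem is two-dimensional.

```python
def box_blur(values, radius=1):
    """Box-blur a 1D list, replicating the edge samples as padding."""
    n = len(values)
    blurred = []
    for i in range(n):
        window = [values[min(max(j, 0), n - 1)] for j in range(i - radius, i + radius + 1)]
        blurred.append(sum(window) / len(window))
    return blurred


def count_local_maxima(values):
    """Count interior samples strictly greater than both neighbors."""
    return sum(
        1
        for i in range(1, len(values) - 1)
        if values[i] > values[i - 1] and values[i] > values[i + 1]
    )


def hill_climb(values, start):
    """Discrete gradient ascent: step to the better neighbor until none is better."""
    i = start
    while True:
        neighbors = [j for j in (i - 1, i + 1) if 0 <= j < len(values)]
        best = max(neighbors, key=lambda j: values[j])
        if values[best] <= values[i]:
            return i
        i = best


# Made-up irradiance samples: a small spurious bump at index 1, true peak at index 6.
irradiance = [0, 2, 1, 3, 4, 5, 6, 5, 4, 3, 2]
blurred = box_blur(irradiance)
```

On the raw samples, ascent from index 0 stalls on the spurious bump at index 1; on the blurred samples it climbs to index 6. In this toy the blurred maximum coincides with the true one, which need not hold in general.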

[CV-108] SeeTree – A modular open-source system for tree detection and orchard localization

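【Quick read】: This paper addresses accurate localization, an important functional requirement for precision orchard management for which few off-the-shelf commercial solutions exist. It presents SeeTree, a modular, open-source embedded system for tree trunk detection and orchard localization that can be deployed on any vehicle. Its key new capabilities are, first, full orchard localization, including out-of-row headland turning, and second, the flexibility to integrate visual, GNSS, or wheel odometry into the motion model. In field experiments in a commercial orchard, the system converged to the correct location 99% of the time over 800 trials, even with large uncertainty in the initial particle locations. When turning out of row, it correctly tracked 99% of turns (860 trials representing 43 unique row changes). To support adoption and future research, the authors release their dataset, design files, and source code.

Link: https://arxiv.org/abs/2504.10764
Authors: Jostan Brown,Cindy Grimm,Joseph R. Davidson
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: 26 pages, 12 figures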

Abstract:Accurate localization is an important functional requirement for precision orchard management. However, there are few off-the-shelf commercial solutions available to growers. In this paper, we present SeeTree, a modular, open source embedded system for tree trunk detection and orchard localization that is deployable on any vehicle. Building on our prior work on vision-based in-row localization using particle filters, SeeTree includes several new capabilities. First, it provides capacity for full orchard localization including out-of-row headland turning. Second, it includes the flexibility to integrate either visual, GNSS, or wheel odometry in the motion model. During field experiments in a commercial orchard, the system converged to the correct location 99% of the time over 800 trials, even when starting with large uncertainty in the initial particle locations. When turning out of row, the system correctly tracked 99% of the turns (860 trials representing 43 unique row changes). To help support adoption and future research and development, we make our dataset, design files, and source code freely available to the community.

[CV-109] ReasonDrive: Efficient Visual Question Answering for Autonomous Vehicles with Reasoning-Enhanced Small Vision-Language Models

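【Quick read】: This paper addresses the lack of transparent reasoning in Vision-Language Models (VLMs) for autonomous driving, a capability that is critical for safety. The key idea is to model the reasoning process explicitly during fine-tuning: GPT-4o is used with category-specific prompting strategies to generate structured reasoning chains for driving scenarios from the DriveLM benchmark. Experiments show that reasoning-based fine-tuning consistently outperforms answer-only fine-tuning and baseline instruction-tuned models, with substantial gains in accuracy and text generation quality. This suggests that explicit reasoning enhances the model's internal representations for driving decisions and points toward more interpretable autonomous driving systems.

Link: https://arxiv.org/abs/2504.10757
Authors: Amirhosein Chahe,Lifeng Zhou
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Robotics (cs.RO)
Comments: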

Abstract:Vision-language models (VLMs) show promise for autonomous driving but often lack transparent reasoning capabilities that are critical for safety. We investigate whether explicitly modeling reasoning during fine-tuning enhances VLM performance on driving decision tasks. Using GPT-4o, we generate structured reasoning chains for driving scenarios from the DriveLM benchmark with category-specific prompting strategies. We compare reasoning-based fine-tuning, answer-only fine-tuning, and baseline instruction-tuned models across multiple small VLM families (Llama 3.2, Llava 1.5, and Qwen 2.5VL). Our results demonstrate that reasoning-based fine-tuning consistently outperforms alternatives, with Llama3.2-11B-reason achieving the highest performance. Models fine-tuned with reasoning show substantial improvements in accuracy and text generation quality, suggesting explicit reasoning enhances internal representations for driving decisions. These findings highlight the importance of transparent decision processes in safety-critical domains and offer a promising direction for developing more interpretable autonomous driving systems.

[CV-110] Real-time Seafloor Segmentation and Mapping

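【Quick read】: This paper addresses the need for efficient monitoring of Posidonia oceanica seagrass meadows, which are in concerning decline, in underwater environments where complex water conditions and limited datasets hamper existing models. Its key contribution is a framework that combines machine learning and computer vision so that an autonomous underwater vehicle (AUV) can autonomously inspect the boundaries of Posidonia oceanica meadows. The framework uses an existing Mask R-CNN model for image segmentation, adds a new class dedicated to rocks to enhance that model, and pairs it with a boundary tracking strategy for comprehensive meadow monitoring. Validation on real underwater images and evaluation in a realistic simulation environment show that the framework can support conservation and targeted preservation of underwater environments.

Link: https://arxiv.org/abs/2504.10750
Authors: Michele Grimaldi,Nouf Alkaabi,Francesco Ruscio,Sebastian Realpe Rua,Rafael Garcia,Nuno Gracias
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: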

Abstract: Posidonia oceanica meadows are a species of seagrass highly dependent on rocks for their survival and conservation. In recent years, there has been a concerning global decline in this species, emphasizing the critical need for efficient monitoring and assessment tools. While deep learning-based semantic segmentation and visual automated monitoring systems have shown promise in a variety of applications, their performance in underwater environments remains challenging due to complex water conditions and limited datasets. This paper introduces a framework that combines machine learning and computer vision techniques to enable an autonomous underwater vehicle (AUV) to inspect the boundaries of Posidonia oceanica meadows autonomously. The framework incorporates an image segmentation module using an existing Mask R-CNN model and a strategy for Posidonia oceanica meadow boundary tracking. Furthermore, a new class dedicated to rocks is introduced to enhance the existing model, aiming to contribute to a comprehensive monitoring approach and provide a deeper understanding of the intricate interactions between the meadow and its surrounding environment. The image segmentation model is validated using real underwater images, while the overall inspection framework is evaluated in a realistic simulation environment, replicating actual monitoring scenarios with real underwater images. The results demonstrate that the proposed framework enables the AUV to autonomously accomplish the main tasks of underwater inspection and segmentation of rocks. Consequently, this work holds significant potential for the conservation and protection of marine environments, providing valuable insights into the status of Posidonia oceanica meadows and supporting targeted preservation efforts.

[CV-111] Hearing Anywhere in Any Environment CVPR2025

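【Quick read】: This paper addresses the inability of existing Room Impulse Response (RIR) estimation methods, in mixed reality applications, to generalize to new rooms with different geometries and surface materials. It proposes a unified model that can reconstruct the spatial acoustic experience of any environment with minimal additional measurements. The key to the solution is combining a geometric feature extractor, which captures spatial context from panorama depth images, with an RIR encoder that extracts detailed acoustic features from only a few reference RIR samples, enabling cross-room RIR prediction. The method is evaluated on the new ACOUSTICROOMS dataset, and a successful sim-to-real transfer experiment further supports both the generalizability of the approach and the realism of the dataset.

Link: https://arxiv.org/abs/2504.10746
Authors: Xiulong Liu,Anurag Kumar,Paul Calamia,Sebastia V. Amengual,Calvin Murdock,Ishwarya Ananthabhotla,Philip Robinson,Eli Shlizerman,Vamsi Krishna Ithapu,Ruohan Gao
Affiliations: University of Washington; Meta; University of Maryland, College Park
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments: CVPR 2025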

Abstract:In mixed reality applications, a realistic acoustic experience in spatial environments is as crucial as the visual experience for achieving true immersion. Despite recent advances in neural approaches for Room Impulse Response (RIR) estimation, most existing methods are limited to the single environment on which they are trained, lacking the ability to generalize to new rooms with different geometries and surface materials. We aim to develop a unified model capable of reconstructing the spatial acoustic experience of any environment with minimum additional measurements. To this end, we present xRIR, a framework for cross-room RIR prediction. The core of our generalizable approach lies in combining a geometric feature extractor, which captures spatial context from panorama depth images, with a RIR encoder that extracts detailed acoustic features from only a few reference RIR samples. To evaluate our method, we introduce ACOUSTICROOMS, a new dataset featuring high-fidelity simulation of over 300,000 RIRs from 260 rooms. Experiments show that our method strongly outperforms a series of baselines. Furthermore, we successfully perform sim-to-real transfer by evaluating our model on four real-world environments, demonstrating the generalizability of our approach and the realism of our dataset.

[CV-112] Interactivity x Explainability: Toward Understanding How Interactivity Can Improve Computer Vision Explanations

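【Quick read】: This paper addresses problems caused by presenting computer vision model explanations in static formats: information overload, a gap between semantic and pixel-level information, and limited opportunities for exploration. Its key idea is to use interactivity to tackle these issues, studied in three common explanation types: heatmap-based, concept-based, and prototype-based. The study finds that interactivity enhances user control, speeds convergence to relevant information, and lets users expand their understanding of the model and explanation, but it also introduces new challenges. To address them, the paper offers design recommendations, including carefully selected default views, independent input controls, and constrained output spaces.

Link: https://arxiv.org/abs/2504.10745
Authors: Indu Panigrahi,Sunnie S. Y. Kim,Amna Liaqat,Rohan Jinturkar,Olga Russakovsky,Ruth Fong,Parastoo Abtahi
Affiliations: Princeton University
Categories: Human-Computer Interaction (cs.HC); Computer Vision and Pattern Recognition (cs.CV)
Comments: To appear in Extended Abstracts of the CHI Conference on Human Factors in Computing Systems (CHI EA '25)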

Abstract:Explanations for computer vision models are important tools for interpreting how the underlying models work. However, they are often presented in static formats, which pose challenges for users, including information overload, a gap between semantic and pixel-level information, and limited opportunities for exploration. We investigate interactivity as a mechanism for tackling these issues in three common explanation types: heatmap-based, concept-based, and prototype-based explanations. We conducted a study (N=24), using a bird identification task, involving participants with diverse technical and domain expertise. We found that while interactivity enhances user control, facilitates rapid convergence to relevant information, and allows users to expand their understanding of the model and explanation, it also introduces new challenges. To address these, we provide design recommendations for interactive computer vision explanations, including carefully selected default views, independent input controls, and constrained output spaces.

[CV-113] Foundation Models for Remote Sensing: An Analysis of MLLMs for Object Localization CVPR

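【Quick read】: This paper addresses the weak performance of multimodal large language models (MLLMs) on fine-grained spatial reasoning tasks in earth observation (EO) imagery. Prior work shows that MLLMs excel at some EO tasks, such as image captioning and scene understanding, but fail at tasks that need finer spatial reasoning, such as object localization. The paper's approach is to analyze more recent MLLMs that have been explicitly trained for fine-grained spatial reasoning and to benchmark them on EO object localization tasks. It finds that these models perform well in certain settings, which makes them well suited to zero-shot scenarios. It also discusses prompt selection, ground sample distance (GSD) optimization, and failure cases in detail, offering guidance on how to optimize an MLLM for a given EO localization task.

Link: https://arxiv.org/abs/2504.10727
Authors: Darryl Hannan,John Cooper,Dylan White,Timothy Doster,Henry Kvinge,Yijing Watkins
Affiliations: Pacific Northwest National Laboratory
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 26 pages, CVPR MORSE Workshop 2025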

Abstract:Multimodal large language models (MLLMs) have altered the landscape of computer vision, obtaining impressive results across a wide range of tasks, especially in zero-shot settings. Unfortunately, their strong performance does not always transfer to out-of-distribution domains, such as earth observation (EO) imagery. Prior work has demonstrated that MLLMs excel at some EO tasks, such as image captioning and scene understanding, while failing at tasks that require more fine-grained spatial reasoning, such as object localization. However, MLLMs are advancing rapidly and insights quickly become out-dated. In this work, we analyze more recent MLLMs that have been explicitly trained to include fine-grained spatial reasoning capabilities, benchmarking them on EO object localization tasks. We demonstrate that these models are performant in certain settings, making them well suited for zero-shot scenarios. Additionally, we provide a detailed discussion focused on prompt selection, ground sample distance (GSD) optimization, and analyzing failure cases. We hope that this work will prove valuable as others evaluate whether an MLLM is well suited for a given EO localization task and how to optimize it.

[CV-114] SpinMeRound: Consistent Multi-View Identity Generation Using Diffusion Models

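【Quick read】: This paper addresses the challenging problem of generating realistic head portraits from novel viewpoints. Most current methods are limited to narrow angular ranges, mainly frontal or near-frontal views. Recent large-scale diffusion models handle 3D scenes robustly but underperform on facial data because of its complex structure and the uncanny valley effect. The paper proposes SpinMeRound, a diffusion-based method that uses a number of input views together with an identity embedding to synthesize diverse viewpoints of a subject while robustly preserving its identity features. The key to the solution is this combination of multi-view input and identity embedding, which yields consistent and accurate head portraits. Experiments show that the model outperforms current state-of-the-art multi-view diffusion models on 360-degree head synthesis.

Link: https://arxiv.org/abs/2504.10716
Authors: Stathis Galanakis,Alexandros Lattas,Stylianos Moschoglou,Bernhard Kainz,Stefanos Zafeiriou
Affiliations: Imperial College London (UK); FAU Erlangen–Nürnberg (Germany)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: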

Abstract: Despite recent progress in diffusion models, generating realistic head portraits from novel viewpoints remains a significant challenge. Most current approaches are constrained to limited angular ranges, predominantly focusing on frontal or near-frontal views. Moreover, although the recently emerging large-scale diffusion models have been proven robust in handling 3D scenes, they underperform on facial data, given their complex structure and the uncanny valley pitfalls. In this paper, we propose SpinMeRound, a diffusion-based approach designed to generate consistent and accurate head portraits from novel viewpoints. By leveraging a number of input views alongside an identity embedding, our method effectively synthesizes diverse viewpoints of a subject whilst robustly maintaining its unique identity features. Through experimentation, we showcase our model’s generation capabilities in 360-degree head synthesis, while beating current state-of-the-art multiview diffusion models.

[CV-115] The Tenth NTIRE 2025 Efficient Super-Resolution Challenge Report CVPR2025

【Quick read】: This paper addresses the trade-off between computational efficiency and image quality in single-image efficient super-resolution (ESR). Through the NTIRE 2025 challenge, it focuses on optimizing key computational metrics, namely runtime, parameters, and FLOPs, while achieving a PSNR of at least 26.90 dB on the DIV2K_LSDIR_valid dataset and 26.99 dB on the DIV2K_LSDIR_test dataset. The report analyzes the submitted methods and results and establishes benchmarks for future research in the field.

Link: https://arxiv.org/abs/2504.10686
Authors: Bin Ren,Hang Guo,Lei Sun,Zongwei Wu,Radu Timofte,Yawei Li,Yao Zhang,Xinning Chai,Zhengxue Cheng,Yingsheng Qin,Yucai Yang,Li Song,Hongyuan Yu,Pufan Xu,Cheng Wan,Zhijuan Huang,Peng Guo,Shuyuan Cui,Chenjun Li,Xuehai Hu,Pan Pan,Xin Zhang,Heng Zhang,Qing Luo,Linyan Jiang,Haibo Lei,Qifang Gao,Yaqing Li,Weihua Luo,Tsing Li,Qing Wang,Yi Liu,Yang Wang,Hongyu An,Liou Zhang,Shijie Zhao,Lianhong Song,Long Sun,Jinshan Pan,Jiangxin Dong,Jinhui Tang,Jing Wei,Mengyang Wang,Ruilong Guo,Qian Wang,Qingliang Liu,Yang Cheng,Davinci,Enxuan Gu,Pinxin Liu,Yongsheng Yu,Hang Hua,Yunlong Tang,Shihao Wang,Yukun Yang,Zhiyu Zhang,Yukun Yang,Jiyu Wu,Jiancheng Huang,Yifan Liu,Yi Huang,Shifeng Chen,Rui Chen,Yi Feng,Mingxi Li,Cailu Wan,Xiangji Wu,Zibin Liu,Jinyang Zhong,Kihwan Yoon,Ganzorig Gankhuyag,Shengyun Zhong,Mingyang Wu,Renjie Li,Yushen Zuo,Zhengzhong Tu,Zongang Gao,Guannan Chen,Yuan Tian,Wenhui Chen,Weijun Yuan,Zhan Li,Yihang Chen,Yifan Deng,Ruting Deng,Yilin Zhang,Huan Zheng,Yanyan Wei,Wenxuan Zhao,Suiyi Zhao,Fei Wang,Kun Li,Yinggan Tang,Mengjie Su,Jae-hyeon Lee,Dong-Hyeop Son,Ui-Jin Choi,Tiancheng Shao,Yuqing Zhang,Mengcheng Ma
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments: Accepted by CVPR2025 NTIRE Workshop, Efficient Super-Resolution Challenge Report. 50 pages

Abstract: This paper presents a comprehensive review of the NTIRE 2025 Challenge on Single-Image Efficient Super-Resolution (ESR). The challenge aimed to advance the development of deep models that optimize key computational metrics, i.e., runtime, parameters, and FLOPs, while achieving a PSNR of at least 26.90 dB on the DIV2K_LSDIR_valid dataset and 26.99 dB on the DIV2K_LSDIR_test dataset. A robust participation saw 244 registered entrants, with 43 teams submitting valid entries. This report meticulously analyzes these methods and results, emphasizing groundbreaking advancements in state-of-the-art single-image ESR techniques. The analysis highlights innovative approaches and establishes benchmarks for future research in the field.
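The PSNR thresholds quoted above follow the standard definition, PSNR = 10 · log10(MAX² / MSE). A minimal sketch of that formula on flat lists of pixel values is given below. The abstract does not describe the challenge's exact evaluation protocol (color space, border cropping, and so on), so this is only the textbook formula, and the function name and the 8-bit peak value of 255 are assumptions.

```python
import math


def psnr(reference, restored, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized pixel lists."""
    if len(reference) != len(restored) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((a - b) ** 2 for a, b in zip(reference, restored)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value**2 / mse)


# A uniform error of one gray level on an 8-bit scale.
value_db = psnr([0, 0, 0, 0], [1, 1, 1, 1])
```

Because PSNR is logarithmic in the mean squared error, a fixed floor such as 26.90 dB acts as a quality constraint, while the ranking among qualifying entries is driven by runtime, parameters, and FLOPs.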

[CV-116] NTIRE 2025 Challenge on Cross-Domain Few-Shot Object Detection: Methods and Results CVPR

【Quick read】: This paper addresses Cross-Domain Few-Shot Object Detection (CD-FSOD), where existing object detection models face significant challenges when applied across domains. By organizing the first CD-FSOD challenge in conjunction with NTIRE 2025, it aims to improve the performance of current object detectors on entirely novel target domains using only limited labeled data. Participants approached the task from diverse perspectives and proposed novel models that achieved new state-of-the-art (SOTA) results under both open-source and closed-source settings.

Link: https://arxiv.org/abs/2504.10685
Authors: Yuqian Fu,Xingyu Qiu,Bin Ren,Yanwei Fu,Radu Timofte,Nicu Sebe,Ming-Hsuan Yang,Luc Van Gool,Kaijin Zhang,Qingpeng Nong,Xiugang Dong,Hong Gao,Xiangsheng Zhou,Jiancheng Pan,Yanxing Liu,Xiao He,Jiahao Li,Yuze Sun,Xiaomeng Huang,Zhenyu Zhang,Ran Ma,Yuhan Liu,Zijian Zhuang,Shuai Yi,Yixiong Zou,Lingyi Hong,Mingxi Chen,Runze Li,Xingdong Sheng,Wenqiang Zhang,Weisen Chen,Yongxin Yan,Xinguo Chen,Yuanjie Shao,Zhengrong Zuo,Nong Sang,Hao Wu,Haoran Sun,Shuming Hu,Yan Zhang,Zhiguang Shi,Yu Zhang,Chao Chen,Tao Wang,Da Feng,Linhai Zhuo,Ziming Lin,Yali Huang,Jie Me,Yiming Yang,Mi Guo,Mingyuan Jiu,Mingliang Xu,Maomao Xiong,Qunshu Zhang,Xinyu Cao,Yuqing Yang,Dianmo Sheng,Xuanpu Zhao,Zhiyu Li,Xuyang Ding,Wenqian Li
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: accepted by CVPRW 25 @ NTIRE

Abstract:Cross-Domain Few-Shot Object Detection (CD-FSOD) poses significant challenges to existing object detection and few-shot detection models when applied across domains. In conjunction with NTIRE 2025, we organized the 1st CD-FSOD Challenge, aiming to advance the performance of current object detectors on entirely novel target domains with only limited labeled data. The challenge attracted 152 registered participants, received submissions from 42 teams, and concluded with 13 teams making valid final submissions. Participants approached the task from diverse perspectives, proposing novel models that achieved new state-of-the-art (SOTA) results under both open-source and closed-source settings. In this report, we present an overview of the 1st NTIRE 2025 CD-FSOD Challenge, highlighting the proposed solutions and summarizing the results submitted by the participants.

[CV-117] H-MoRe: Learning Human-centric Motion Representation for Action Analysis CVPR2025

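【Quick read】: This paper addresses precision and robustness in learning human-centric motion representations. Whereas previous methods rely on fully supervised training on synthetic data, it proposes H-MoRe, a novel framework that learns directly from real-world scenarios in a self-supervised manner while incorporating both human pose and body shape information. The key is its world-local flows representation, which expresses the absolute and relative movements of each body point in matrix form, so that relevant human motion is preserved dynamically, background movement is filtered out, and nuanced motion details are captured. Experiments show that H-MoRe brings substantial improvements on downstream tasks including gait recognition, action recognition, and video generation, while maintaining efficient inference (34 fps) suitable for real-time applications.

Link: https://arxiv.org/abs/2504.10676
Authors: Zhanbo Huang,Xiaoming Liu,Yu Kong
Affiliations: Department of Computer Science and Engineering, Michigan State University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 15 pages, 14 figures, 7 tables, accepted to CVPR 2025 (Highlight)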

Abstract: In this paper, we propose H-MoRe, a novel pipeline for learning precise human-centric motion representation. Our approach dynamically preserves relevant human motion while filtering out background movement. Notably, unlike previous methods relying on fully supervised learning from synthetic data, H-MoRe learns directly from real-world scenarios in a self-supervised manner, incorporating both human pose and body shape information. Inspired by kinematics, H-MoRe represents absolute and relative movements of each body point in a matrix format that captures nuanced motion details, termed world-local flows. H-MoRe offers refined insights into human motion, which can be integrated seamlessly into various action-related applications. Experimental results demonstrate that H-MoRe brings substantial improvements across various downstream tasks, including gait recognition (CL@R1: +16.01%), action recognition (Acc@1: +8.92%), and video generation (FVD: -67.07%). Additionally, H-MoRe exhibits high inference efficiency (34 fps), making it suitable for most real-time scenarios. Models and code will be released upon publication.

[CV-118] Perturbed State Space Feature Encoders for Optical Flow with Event Cameras

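[Quick read]: This paper tackles the temporal and spatial reasoning limitations of current neural networks for event-camera optical flow estimation. Its core is a feature encoder called Perturbed State Space Feature Encoders (P-SSE), which processes spatiotemporal features adaptively with a large receptive field, in a manner similar to Transformer-based methods, while keeping the linear computational complexity of State Space Models (SSMs). The key innovation behind its performance is a perturbation technique applied to the state dynamics matrix of the SSM, which significantly improves stability and performance. P-SSE is integrated into a framework that uses bi-directional flows and recurrent connections to extend the temporal context of flow prediction. On the DSEC-Flow and MVSEC datasets, P-SSE improves end-point error (EPE) by 8.48% and 11.86%, respectively.

Link: https://arxiv.org/abs/2504.10669
Authors: Gokul Raju Govinda Raju, Nikola Zubić, Marco Cannici, Davide Scaramuzza
Affiliations: Robotics and Perception Group, University of Zurich
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 10 pages, 4 figures, 4 tables. Equal contribution by Gokul Raju Govinda Raju and Nikola Zubić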

Abstract:With their motion-responsive nature, event-based cameras offer significant advantages over traditional cameras for optical flow estimation. While deep learning has improved upon traditional methods, current neural networks adopted for event-based optical flow still face temporal and spatial reasoning limitations. We propose Perturbed State Space Feature Encoders (P-SSE) for multi-frame optical flow with event cameras to address these challenges. P-SSE adaptively processes spatiotemporal features with a large receptive field akin to Transformer-based methods, while maintaining the linear computational complexity characteristic of SSMs. However, the key innovation that enables the state-of-the-art performance of our model lies in our perturbation technique applied to the state dynamics matrix governing the SSM system. This approach significantly improves the stability and performance of our model. We integrate P-SSE into a framework that leverages bi-directional flows and recurrent connections, expanding the temporal context of flow prediction. Evaluations on DSEC-Flow and MVSEC datasets showcase P-SSE’s superiority, with 8.48% and 11.86% improvements in EPE performance, respectively.
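For readers unfamiliar with the metric, EPE is commonly defined as the mean Euclidean distance between predicted and ground-truth flow vectors. The sketch below follows that standard definition; the helper `relative_improvement` assumes the reported percentages are relative reductions against a baseline, which the abstract does not spell out.

```python
import math


def endpoint_error(pred_flow, gt_flow):
    """Mean Euclidean distance between predicted and ground-truth flow vectors."""
    errors = [
        math.hypot(pu - gu, pv - gv)
        for (pu, pv), (gu, gv) in zip(pred_flow, gt_flow)
    ]
    return sum(errors) / len(errors)


def relative_improvement(baseline_epe, new_epe):
    """Percentage reduction in EPE relative to a baseline (lower EPE is better)."""
    return 100.0 * (baseline_epe - new_epe) / baseline_epe


# Toy flow fields with three pixels each, stored as (u, v) displacements.
gt = [(1.0, 0.0), (0.0, 2.0), (3.0, 4.0)]
pred = [(1.0, 0.0), (0.0, 1.0), (0.0, 0.0)]
epe = endpoint_error(pred, gt)  # per-pixel errors are 0, 1 and 5
```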

[CV-119] Relation-Rich Visual Document Generator for Visual Information Extraction CVPR2025

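[Quick read]: This paper addresses visual information extraction (VIE) from relation-rich documents, which remains difficult because of layout diversity and limited training data. Existing synthetic document generators either rely on manually designed layouts and templates or use rule-based approaches, both of which limit layout diversity. Current layout generation methods also consider only topological patterns and ignore textual content, so they struggle to produce documents with complex content-layout associations. The paper proposes the Relation-rIch visual Document GEnerator (RIDGE), a two-stage approach: (1) Content Generation, in which large language models (LLMs) produce document content in a carefully designed Hierarchical Structure Text format that captures entity categories and relationships; and (2) Content-driven Layout Generation, which learns to create diverse, plausible layouts solely from Optical Character Recognition (OCR) results, with no human labeling or annotation. Combining the two significantly improves document understanding models on several VIE benchmarks.

Link: https://arxiv.org/abs/2504.10659
Authors: Zi-Han Jiang, Chien-Wei Lin, Wei-Hua Li, Hsuan-Tung Liu, Yi-Ren Yeh, Chu-Song Chen
Affiliations: National Taiwan University; E.SUN Financial Holding Co., Ltd.; National Kaohsiung Normal University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: CVPR 2025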

Abstract:Despite advances in Large Language Models (LLMs) and Multimodal LLMs (MLLMs) for visual document understanding (VDU), visual information extraction (VIE) from relation-rich documents remains challenging due to layout diversity and limited training data. While existing synthetic document generators attempt to address data scarcity, they either rely on manually designed layouts and templates or adopt rule-based approaches that limit layout diversity. In addition, current layout generation methods focus solely on topological patterns without considering textual content, making them impractical for generating documents with complex associations between content and layout. In this paper, we propose a Relation-rIch visual Document GEnerator (RIDGE) that addresses these limitations through a two-stage approach: (1) Content Generation, which leverages LLMs to generate document content using a carefully designed Hierarchical Structure Text format that captures entity categories and relationships, and (2) Content-driven Layout Generation, which learns to create diverse, plausible document layouts solely from easily available Optical Character Recognition (OCR) results, requiring no human labeling or annotation effort. Experimental results demonstrate that our method significantly enhances the performance of document understanding models on various VIE benchmarks. The code and model will be available at this https URL.

[CV-120] SilVar-Med: A Speech-Driven Visual Language Model for Explainable Abnormality Detection in Medical Imaging CVPR

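[Quick read]: This paper addresses the limited usability of medical visual language models in real clinical settings and the lack of explanation behind their predictions. Most existing models rely on text-based instructions, which is impractical for physicians in scenarios such as surgery. Current medical image analysis models also typically lack comprehensive reasoning behind their predictions, which reduces their reliability for clinical decision-making. Because diagnostic errors can have serious consequences, the paper argues that interpretable and reliable medical assistance is needed.

To address these challenges, the paper introduces SilVar-Med, an end-to-end speech-driven medical visual language model: a multimodal medical image assistant that integrates speech interaction with VLMs and pioneers voice-based communication for medical image analysis. The authors also propose a reasoning dataset focused on the reasoning behind each prediction of a medical abnormality. Experiments provide a proof-of-concept study of reasoning-driven medical image interpretation with end-to-end speech interaction. The authors expect the work to foster more transparent, interactive, and clinically viable diagnostic support systems. The code and dataset are publicly available.

Link: https://arxiv.org/abs/2504.10642
Authors: Tan-Hanh Pham, Chris Ngo, Trong-Duong Bui, Minh Luu Quang, Tan-Huong Pham, Truong-Son Hy
Affiliations: Florida Institute of Technology; Knovel Engineering Lab, Singapore; Vietnam Military Medical University; 108 Military Central Hospital, Vietnam; Can Tho University of Medicine and Pharmacy, Vietnam; University of Alabama at Birmingham
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: CVPR Multimodal Algorithmic Reasoning Workshop 2025 - SilVarMed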

Abstract:Medical Visual Language Models have shown great potential in various healthcare applications, including medical image captioning and diagnostic assistance. However, most existing models rely on text-based instructions, limiting their usability in real-world clinical environments, especially in scenarios such as surgery, where text-based interaction is often impractical for physicians. In addition, current medical image analysis models typically lack comprehensive reasoning behind their predictions, which reduces their reliability for clinical decision-making. Given that medical diagnosis errors can have life-changing consequences, there is a critical need for interpretable and rational medical assistance. To address these challenges, we introduce an end-to-end speech-driven medical VLM, SilVar-Med, a multimodal medical image assistant that integrates speech interaction with VLMs, pioneering the task of voice-based communication for medical image analysis. In addition, we focus on the interpretation of the reasoning behind each prediction of medical abnormalities with a proposed reasoning dataset. Through extensive experiments, we demonstrate a proof-of-concept study for reasoning-driven medical image interpretation with end-to-end speech interaction. We believe this work will advance the field of medical AI by fostering more transparent, interactive, and clinically viable diagnostic support systems. Our code and dataset are publicly available at SilVar-Med.

[CV-121] Skeleton-Based Intake Gesture Detection With Spatial-Temporal Graph Convolutional Networks

[Quick read]: This paper addresses automated dietary monitoring in everyday life through detection of food intake gestures, namely eating and drinking gestures. It proposes a skeleton-based model, ST-GCN-BiLSTM, which combines a dilated spatial-temporal graph convolutional network (ST-GCN) with a bidirectional long short-term memory (BiLSTM) network. The main advantages of the skeleton-based approach are environmental robustness, reduced data dependency, and enhanced privacy preservation, which help it cope with the complexity of eating behavior and the variety of recording environments.

Link: https://arxiv.org/abs/2504.10635
Authors: Chunzhuo Wang, Zhewen Xue, T. Sunil Kumar, Guido Camps, Hans Hallez, Bart Vanrumste
Affiliations: e-Media Research Lab, KU Leuven; ESAT-STADIUS Division, KU Leuven; University of Gävle; Division of Human Nutrition and Health, Wageningen University and Research; OnePlanet Research Center; M-Group, DistriNet, Department of Computer Science, KU Leuven
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: The manuscript has been accepted at the 47th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (IEEE EMBC 2025)

Abstract:Overweight and obesity have emerged as widespread societal challenges, frequently linked to unhealthy eating patterns. A promising approach to enhance dietary monitoring in everyday life involves automated detection of food intake gestures. This study introduces a skeleton-based approach using a model that combines a dilated spatial-temporal graph convolutional network (ST-GCN) with a bidirectional long-short-term memory (BiLSTM) framework, called ST-GCN-BiLSTM, to detect intake gestures. The skeleton-based method provides key benefits, including environmental robustness, reduced data dependency, and enhanced privacy preservation. Two datasets were employed for model validation. On the OREBA dataset, which consists of laboratory-recorded videos, the model achieved segmental F1-scores of 86.18% and 74.84% for identifying eating and drinking gestures. Additionally, a self-collected dataset of smartphone recordings made under more adaptable experimental conditions was evaluated with the model trained on OREBA, yielding F1-scores of 85.40% and 67.80% for detecting eating and drinking gestures. The results not only confirm the feasibility of utilizing skeleton data for intake gesture detection but also highlight the robustness of the proposed approach in cross-dataset validation.

[CV-122] AgMMU: A Comprehensive Agricultural Multimodal Understanding and Reasoning Benchmark

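[Quick read]: This paper addresses the difficulty vision-language models (VLMs) have in producing factually accurate answers in knowledge-intensive expert domains. It builds AgMMU, a dataset focused on agriculture, a domain of high social value that requires connecting detailed visual observation with precise knowledge for tasks such as pest identification and management advice. The distinguishing feature of AgMMU is that all facts, questions, and answers are extracted from 116,231 conversations between real-world users and authorized agricultural experts. After a three-step curation pipeline with GPT-4o, LLaMA models, and human verification, the dataset provides an evaluation set of 5,460 multiple-choice questions (MCQs) and open-ended questions (OEQs), plus a development set of 205,399 pieces of agricultural knowledge covering disease identification, symptom descriptions, management instructions, insect and pest identification, and species identification. The study finds that existing VLMs struggle with questions requiring both detailed perception and factual knowledge, and that open-source VLMs still trail proprietary ones by a substantial margin. Fine-tuning on the development set improves LLaVA-1.5 evaluation accuracy by up to 3.1%. The authors hope AgMMU can serve both as an agriculture-specific evaluation benchmark and as a development suite for bringing knowledge-intensive expertise into general-purpose VLMs.

Link: https://arxiv.org/abs/2504.10568
Authors: Aruna Gauba, Irene Pi, Yunze Man, Ziqi Pang, Vikram S. Adve, Yu-Xiong Wang
Affiliations: University of Illinois Urbana-Champaign; Rice University; Carnegie Mellon University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project Website: this https URL Huggingface: this https URL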

Abstract:We curate a dataset AgMMU for evaluating and developing vision-language models (VLMs) to produce factually accurate answers for knowledge-intensive expert domains. Our AgMMU concentrates on one of the most socially beneficial domains, agriculture, which requires connecting detailed visual observation with precise knowledge to diagnose, e.g., pest identification, management instructions, etc. As a core uniqueness of our dataset, all facts, questions, and answers are extracted from 116,231 conversations between real-world users and authorized agricultural experts. After a three-step dataset curation pipeline with GPT-4o, LLaMA models, and human verification, AgMMU features an evaluation set of 5,460 multiple-choice questions (MCQs) and open-ended questions (OEQs). We also provide a development set that contains 205,399 pieces of agricultural knowledge information, including disease identification, symptoms descriptions, management instructions, insect and pest identification, and species identification. As a multimodal factual dataset, it reveals that existing VLMs face significant challenges with questions requiring both detailed perception and factual knowledge. Moreover, open-source VLMs still demonstrate a substantial performance gap compared to proprietary ones. To advance knowledge-intensive VLMs, we conduct fine-tuning experiments using our development set, which improves LLaVA-1.5 evaluation accuracy by up to 3.1%. We hope that AgMMU can serve both as an evaluation benchmark dedicated to agriculture and a development suite for incorporating knowledge-intensive expertise into general-purpose VLMs.

[CV-123] H3AE: High Compression High Speed and High Quality AutoEncoder for Video Diffusion Models

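[Quick read]: This paper addresses the underexplored design of autoencoders (AEs) for latent diffusion models, in terms of network architecture, compression ratio, and training strategy. By systematically examining architecture choices and optimizing the distribution of computation, it obtains a series of efficient, high-compression video AEs that decode in real time on mobile devices. It also unifies the design of the plain autoencoder and the image-conditioned I2V VAE, so a single network serves both roles. The authors find that widely adopted discriminative losses (GAN, LPIPS, and DWT losses) bring no significant improvement when training AEs at scale, and propose a latent consistency loss that needs no complicated discriminator design or hyperparameter tuning yet improves reconstruction quality stably. The resulting AE combines an ultra-high compression ratio with real-time mobile decoding, outperforms prior work on reconstruction metrics by a large margin, and is validated by training a DiT on its latent space for fast, high-quality text-to-video generation.

Link: https://arxiv.org/abs/2504.10567
Authors: Yushu Wu, Yanyu Li, Ivan Skorokhodov, Anil Kag, Willi Menapace, Sharath Girish, Aliaksandr Siarohin, Yanzhi Wang, Sergey Tulyakov
Affiliations: Snap Inc.; Northeastern University
Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments: 8 pages, 4 figures, 6 tables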

Abstract:Autoencoder (AE) is the key to the success of latent diffusion models for image and video generation, reducing the denoising resolution and improving efficiency. However, the power of AE has long been underexplored in terms of network design, compression ratio, and training strategy. In this work, we systematically examine the architecture design choices and optimize the computation distribution to obtain a series of efficient and high-compression video AEs that can decode in real time on mobile devices. We also unify the design of plain Autoencoder and image-conditioned I2V VAE, achieving multifunctionality in a single network. In addition, we find that the widely adopted discriminative losses, i.e., GAN, LPIPS, and DWT losses, provide no significant improvements when training AEs at scale. We propose a novel latent consistency loss that does not require complicated discriminator design or hyperparameter tuning, but provides stable improvements in reconstruction quality. Our AE achieves an ultra-high compression ratio and real-time decoding speed on mobile while outperforming prior art in terms of reconstruction metrics by a large margin. We finally validate our AE by training a DiT on its latent space and demonstrate fast, high-quality text-to-video generation capability.

[CV-124] Data Augmentation Through Random Style Replacement

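[Quick read]: This paper addresses the limited effectiveness of style transfer as data augmentation and the tendency of models to overfit. It proposes an augmentation method that combines the advantages of style augmentation and random erasing by selectively replacing image subregions with style-transferred patches. The method accommodates a wide range of existing style transfer algorithms and integrates easily into different augmentation pipelines, making training more robust and less prone to overfitting, with better performance and faster convergence than previous style augmentation methods.

Link: https://arxiv.org/abs/2504.10563
Authors: Qikai Yang, Cheng Ji, Huaiying Luo, Panfeng Li, Zhicheng Ding
Affiliations: University of Illinois Urbana-Champaign; Cornell University; University of Michigan; Columbia University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by 2025 6th International Conference on Computer Vision, Image and Deep Learning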

Abstract:In this paper, we introduce a novel data augmentation technique that combines the advantages of style augmentation and random erasing by selectively replacing image subregions with style-transferred patches. Our approach first applies a random style transfer to training images, then randomly substitutes selected areas of these images with patches derived from the style-transferred versions. This method is able to seamlessly accommodate a wide range of existing style transfer algorithms and can be readily integrated into diverse data augmentation pipelines. By incorporating our strategy, the training process becomes more robust and less prone to overfitting. Comparative experiments demonstrate that, relative to previous style augmentation methods, our technique achieves superior performance and faster convergence.
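The two steps described in the abstract (stylize the image, then swap a random region for the stylized patch) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `stand_in_style_transfer` is a hypothetical placeholder that merely inverts 8-bit intensities where a real pipeline would call a style-transfer model, and a single rectangular patch is assumed.

```python
import random


def stand_in_style_transfer(image):
    """Placeholder for a real style-transfer model: inverts 8-bit intensities."""
    return [[255 - px for px in row] for row in image]


def random_style_replacement(image, patch_h, patch_w, rng):
    """Replace one random patch of `image` with the same patch from its stylized version."""
    height, width = len(image), len(image[0])
    stylized = stand_in_style_transfer(image)
    # Pick the top-left corner so that the patch fits inside the image.
    top = rng.randrange(height - patch_h + 1)
    left = rng.randrange(width - patch_w + 1)
    augmented = [row[:] for row in image]  # copy so the input stays untouched
    for r in range(top, top + patch_h):
        augmented[r][left:left + patch_w] = stylized[r][left:left + patch_w]
    return augmented, (top, left)


rng = random.Random(0)  # seeded for reproducibility
image = [[10 * r + c for c in range(6)] for r in range(6)]  # toy grayscale image
augmented, (top, left) = random_style_replacement(image, 3, 2, rng)
```

Unlike random erasing, which blanks the selected region, the region here keeps its content in a different style, so no image information is discarded.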

[CV-125] Enhancing Image Restoration through Learning Context-Rich and Detail-Accurate Features

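[Quick read]: This paper addresses image restoration, the recovery of high-quality images from degraded versions, with a focus on balancing spatial detail against contextual information and on overcoming the tendency of existing methods to emphasize the spatial dimension while neglecting frequency variation. It proposes a multi-scale design that balances these competing objectives and integrates spatial-domain and frequency-domain knowledge to selectively recover the most informative content. Specifically, a hybrid scale frequency selection block (HSFSBlock) captures multi-scale information in the spatial domain and selects the most informative components for restoration in the frequency domain. To mitigate the noise introduced by skip connections that use only addition or concatenation, a skip connection attention mechanism (SCAM) selectively determines which information should propagate through them. The resulting tightly interlinked architecture is called LCDNet. Extensive experiments show performance superior or comparable to state-of-the-art algorithms across a range of image restoration tasks.

Link: https://arxiv.org/abs/2504.10558
Authors: Hu Gao, Depeng Dang
Affiliations: Beijing Normal University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: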

Abstract:Image restoration involves recovering high-quality images from their corrupted versions, requiring a nuanced balance between spatial details and contextual information. While certain methods address this balance, they predominantly emphasize spatial aspects, neglecting frequency variation comprehension. In this paper, we present a multi-scale design that optimally balances these competing objectives, seamlessly integrating spatial and frequency domain knowledge to selectively recover the most informative content. Specifically, we develop a hybrid scale frequency selection block (HSFSBlock), which not only captures multi-scale information from the spatial domain, but also selects the most informative components for image restoration in the frequency domain. Furthermore, to mitigate the inherent noise introduced by skip connections employing only addition or concatenation, we introduce a skip connection attention mechanism (SCAM) that selectively determines the information that should propagate through skip connections. The resulting tightly interlinked architecture is named LCDNet. Extensive experiments conducted across diverse image restoration tasks show that our model attains performance that is either superior or comparable to that of state-of-the-art algorithms.

[CV-126] Beyond the Generative Learning Trilemma: Generative Model Assessment in Data Scarcity Domains

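[Quick read]: This paper addresses the practical limitations of Deep Generative Models (DGMs) in data-scarce settings. It asks how synthetic data that satisfies the Generative Learning Trilemma (fidelity, diversity, and sampling efficiency) can ease the data-scarcity bottleneck, and extends the trilemma with utility, robustness, and privacy, factors that are crucial for applying DGMs in real-world scenarios. The key contribution is a comprehensive evaluation framework that assesses different DGMs, namely Variational Autoencoders (VAEs), Generative Adversarial Networks (GANs), and Diffusion Models (DMs), under data scarcity and, taking utility and privacy into account, offers guidance for selecting DGMs for specific applications.

Link: https://arxiv.org/abs/2504.10555
Authors: Marco Salmè, Lorenzo Tronchin, Rosa Sicilia, Paolo Soda, Valerio Guarrasi
Affiliations: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: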

Abstract:Data scarcity remains a critical bottleneck impeding technological advancements across various domains, including but not limited to medicine and precision agriculture. To address this challenge, we explore the potential of Deep Generative Models (DGMs) in producing synthetic data that satisfies the Generative Learning Trilemma: fidelity, diversity, and sampling efficiency. However, recognizing that these criteria alone are insufficient for practical applications, we extend the trilemma to include utility, robustness, and privacy, factors crucial for ensuring the applicability of DGMs in real-world scenarios. Evaluating these metrics becomes particularly challenging in data-scarce environments, as DGMs traditionally rely on large datasets to perform optimally. This limitation is especially pronounced in domains like medicine and precision agriculture, where ensuring acceptable model performance under data constraints is vital. To address these challenges, we assess the Generative Learning Trilemma in data-scarcity settings using state-of-the-art evaluation metrics, comparing three prominent DGMs: Variational Autoencoders (VAEs), Generative Adversarial Networks (GANs), and Diffusion Models (DMs). Furthermore, we propose a comprehensive framework to assess utility, robustness, and privacy in synthetic data generated by DGMs. Our findings demonstrate varying strengths among DGMs, with each model exhibiting unique advantages based on the application context. This study broadens the scope of the Generative Learning Trilemma, aligning it with real-world demands and providing actionable guidance for selecting DGMs tailored to specific applications.

[CV-127] LEMUR Neural Network Dataset: Towards Seamless AutoML

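[Quick read]: This paper addresses the lack of high-quality datasets of neural network models for automated machine learning (AutoML), benchmarking, and model analysis. It introduces LEMUR, an open-source dataset of neural network models with well-structured code, covering diverse architectures for tasks such as object detection, image classification, segmentation, and natural language processing. By providing rich model representations and associated performance data, LEMUR supports fine-tuning of large language models (LLMs) for AutoML tasks; it is built on Python and PyTorch so that new datasets and models can be added consistently. It integrates an Optuna-powered framework for evaluation, hyperparameter optimization, statistical analysis, and graphical insights, offers an extension for running models efficiently on edge devices, and includes tools for model evaluation, preprocessing, and database management. LEMUR also provides an API that returns comprehensive information about a model and its complete performance statistics in a single request.

Link: https://arxiv.org/abs/2504.10552
Authors: Arash Torabi Goodarzi, Roman Kochnev, Waleed Khalid, Furui Qin, Tolgay Atinc Uzun, Yashkumar Sanjaybhai Dhameliya, Yash Kanubhai Kathiriya, Zofia Antonina Bentyn, Dmitry Ignatov, Radu Timofte
Affiliations: Computer Vision Lab, CAIDAS, University of Würzburg, Germany
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Digital Libraries (cs.DL)
Comments: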

Abstract:Neural networks are fundamental in artificial intelligence, driving progress in computer vision and natural language processing. High-quality datasets are crucial for their development, and there is growing interest in datasets composed of neural networks themselves to support benchmarking, automated machine learning (AutoML), and model analysis. We introduce LEMUR, an open-source dataset of neural network models with well-structured code for diverse architectures across tasks such as object detection, image classification, segmentation, and natural language processing. LEMUR is primarily designed to enable fine-tuning of large language models (LLMs) for AutoML tasks, providing a rich source of structured model representations and associated performance data. Leveraging Python and PyTorch, LEMUR enables seamless extension to new datasets and models while maintaining consistency. It integrates an Optuna-powered framework for evaluation, hyperparameter optimization, statistical analysis, and graphical insights. LEMUR provides an extension that enables models to run efficiently on edge devices, facilitating deployment in resource-constrained environments. Providing tools for model evaluation, preprocessing, and database management, LEMUR supports researchers and practitioners in developing, testing, and analyzing neural networks. Additionally, it offers an API that delivers comprehensive information about neural network models and their complete performance statistics with a single request, which can be used in experiments with code-generating large language models. LEMUR will be released as an open-source project under the MIT license upon acceptance of the paper.

[CV-128] Human-Oriented Image Retrieval System (HORSE): A Neuro-Symbolic Approach to Optimizing Retrieval of Previewed Images

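[Quick read]: This paper addresses efficient image retrieval from natural language descriptions, a task on which current image search engines often perform poorly because they rely on time-consuming preprocessing, tagging, and machine learning pipelines. It proposes HORSE (Human-Oriented Retrieval Search Engine for Images), which uses neuro-symbolic indexing and focuses on human-oriented indexing to improve retrieval. The key is to combine insights from cognitive science with advanced computational techniques, drawing on the strengths of both neural networks and symbolic reasoning while mitigating their individual limitations, to make retrieval more intuitive and efficient for users.

Link: https://arxiv.org/abs/2504.10502
Authors: Abraham Itzhak Weinberg
Affiliations: unknown
Categories: Information Retrieval (cs.IR); Computer Vision and Pattern Recognition (cs.CV)
Comments: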

Abstract:Image retrieval remains a challenging task due to the complex interaction between human visual perception, memory, and computational processes. Current image search engines often struggle to efficiently retrieve images based on natural language descriptions, as they rely on time-consuming preprocessing, tagging, and machine learning pipelines. This paper introduces the Human-Oriented Retrieval Search Engine for Images (HORSE), a novel approach that leverages neuro-symbolic indexing to improve image retrieval by focusing on human-oriented indexing. By integrating cognitive science insights with advanced computational techniques, HORSE enhances the retrieval process, making it more aligned with how humans perceive, store, and recall visual information. The neuro-symbolic framework combines the strengths of neural networks and symbolic reasoning, mitigating their individual limitations. The proposed system optimizes image retrieval, offering a more intuitive and efficient solution for users. We discuss the design and implementation of HORSE, highlight its potential applications in fields such as design error detection and knowledge management, and suggest future directions for research to further refine the system’s metrics and capabilities.

[CV-129] MARVIS: Motion Geometry Aware Real and Virtual Image Segmentation

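[Quick read]: This paper addresses the challenges that robot vision systems face in perception and navigation near water surfaces, in particular the difficulty of distinguishing real from virtual image regions under dynamic disturbances such as light reflection and refraction and irregular liquid flow. Traditional computer vision algorithms tend to fail in these scenes because they cannot separate real image regions from virtual ones. A virtual image region is an apparent representation formed by the redirection of light rays, through reflection or refraction, that gives the illusion of an object being present where it has no physical location.

The proposed solution is MARVIS, a segmentation network that uses synthetic images combined with domain-invariant information, a Motion Entropy Kernel, and Epipolar Geometric Consistency to segment real and virtual image regions. The network does not need to be retrained when the domain changes, which the authors show by deploying the same network in two different domains, simulation and the real world. Realistic synthetic images that mimic the complexity of the water surface provide finely annotated training data. In unseen real-world environments, MARVIS achieves an IoU over 78% and an F1-score over 86% with a small computational footprint, running at over 43 FPS on a single GPU or 8 FPS on a single CPU core.

Link: https://arxiv.org/abs/2403.09850
Authors: Jiayi Wu, Xiaomin Lin, Shahriar Negahdaripour, Cornelia Fermüller, Yiannis Aloimonos
Affiliations: Maryland Robotics Center (MRC), University of Maryland, College Park, MD 20742, USA; University of Miami, USA
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: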

Abstract:Tasks such as autonomous navigation, 3D reconstruction, and object recognition near water surfaces are crucial in marine robotics applications. However, challenges arise due to dynamic disturbances, e.g., light reflections and refraction from the random air-water interface, irregular liquid flow, and similar factors, which can lead to potential failures in perception and navigation systems. Traditional computer vision algorithms struggle to differentiate between real and virtual image regions, significantly complicating tasks. A virtual image region is an apparent representation formed by the redirection of light rays, typically through reflection or refraction, creating the illusion of an object’s presence without its actual physical location. This work proposes a novel approach for segmentation of real and virtual image regions, exploiting synthetic images combined with domain-invariant information, a Motion Entropy Kernel, and Epipolar Geometric Consistency. Our segmentation network does not need to be re-trained if the domain changes. We show this by deploying the same segmentation network in two different domains: simulation and the real world. By creating realistic synthetic images that mimic the complexities of the water surface, we provide fine-grained training data for our network (MARVIS) to discern between real and virtual images effectively. Through motion geometry-aware design choices and comprehensive experimental analysis, we achieve state-of-the-art real-virtual image segmentation performance in an unseen real-world domain, achieving an IoU over 78% and an F1-Score over 86% while ensuring a small computational footprint. MARVIS offers over 43 FPS (8 FPS) inference rates on a single GPU (CPU core). Our code and dataset are available here this https URL.
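The two reported metrics can be made concrete with a small sketch. It uses the standard definitions of IoU and F1 for binary masks; the function name `iou_and_f1` and the toy masks are illustrative assumptions, and the paper's exact evaluation protocol (for example per-image versus dataset-level averaging) is not stated in the abstract.

```python
def iou_and_f1(pred_mask, gt_mask):
    """IoU and F1 for binary masks given as equally sized lists of 0/1 rows."""
    tp = fp = fn = 0
    for pred_row, gt_row in zip(pred_mask, gt_mask):
        for p, g in zip(pred_row, gt_row):
            if p and g:
                tp += 1  # pixel correctly labelled as foreground
            elif p:
                fp += 1  # predicted foreground, actually background
            elif g:
                fn += 1  # missed foreground pixel
    union = tp + fp + fn
    iou = tp / union if union else 1.0
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 1.0
    return iou, f1


# Toy masks: 1 marks a "virtual image" pixel, 0 a "real image" pixel.
gt = [[1, 1, 0, 0], [1, 1, 0, 0]]
pred = [[1, 1, 1, 0], [1, 0, 0, 0]]
iou, f1 = iou_and_f1(pred, gt)
```

When both metrics are computed from the same pixel counts, they are linked by F1 = 2·IoU / (1 + IoU), so an IoU of 78% corresponds to an F1 of roughly 87.6%, which is consistent with the figures quoted above.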

[CV-130] MTCNET: Multi-task Learning Paradigm for Crowd Count Estimation

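[Quick read]: This paper addresses crowd density and count estimation, which is difficult because of non-uniform scale variations and arbitrary perspective within a single image. It proposes MTCNet, a deep neural network architecture based on multi-task learning (MTL) with two related tasks: crowd density estimation as the main task and crowd-count group classification as the auxiliary task. The auxiliary task captures useful scale-related information that improves the main task. The main-task model has two blocks, a VGG-16 front-end for feature extraction and a dilated convolutional neural network for density map generation; the auxiliary-task model shares the same front-end, followed by a CNN classifier. MTCNet achieves 5.8% and 14.9% lower mean absolute error (MAE) than prior state-of-the-art methods on the ShanghaiTech dataset, and 10.5% lower MAE on the UCF_CC_50 dataset.

Link: https://arxiv.org/abs/1908.08652
Authors: Abhay Kumar, Nishant Jain, Suraj Tripathi, Chirag Singh, Kamal Krishna
Affiliations: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
Comments: 5 pages, 3 figures, Accepted in IEEE AVSS 2019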

Abstract:We propose a Multi-Task Learning (MTL) paradigm based deep neural network architecture, called MTCNet (Multi-Task Crowd Network) for crowd density and count estimation. Crowd count estimation is challenging due to the non-uniform scale variations and the arbitrary perspective of an individual image. The proposed model has two related tasks, with Crowd Density Estimation as the main task and Crowd-Count Group Classification as the auxiliary task. The auxiliary task helps in capturing the relevant scale-related information to improve the performance of the main task. The main task model comprises two blocks: VGG-16 front-end for feature extraction and a dilated Convolutional Neural Network for density map generation. The auxiliary task model shares the same front-end as the main task, followed by a CNN classifier. Our proposed network achieves 5.8% and 14.9% lower Mean Absolute Error (MAE) than the state-of-the-art methods on ShanghaiTech dataset without using any data augmentation. Our model also outperforms with 10.5% lower MAE on UCF_CC_50 dataset.
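How a main density task and an auxiliary classification task might be combined into one training objective can be sketched as below. The abstract does not state the loss functions, so everything here is an assumption for illustration: mean squared error for the density map, cross-entropy for the count-group classifier, and a hypothetical weighting factor `aux_weight`.

```python
import math


def density_mse(pred_map, gt_map):
    """Mean squared error between predicted and ground-truth density maps."""
    diffs = [
        (p - g) ** 2
        for pred_row, gt_row in zip(pred_map, gt_map)
        for p, g in zip(pred_row, gt_row)
    ]
    return sum(diffs) / len(diffs)


def cross_entropy(logits, target_index):
    """Cross-entropy of softmax(logits) against one target class (stable log-sum-exp)."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target_index]


def multitask_loss(pred_map, gt_map, group_logits, gt_group, aux_weight=0.1):
    """Main density loss plus a weighted auxiliary classification loss (assumed form)."""
    return density_mse(pred_map, gt_map) + aux_weight * cross_entropy(group_logits, gt_group)


def crowd_count(density_map):
    """The estimated count is the sum over the density map."""
    return sum(sum(row) for row in density_map)


gt_map = [[0.0, 0.5], [0.5, 1.0]]    # sums to a count of 2 people
pred_map = [[0.0, 0.5], [0.5, 0.5]]  # under-estimates one cell
loss = multitask_loss(pred_map, gt_map, [0.0, 0.0, 0.0], gt_group=1)
```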

[CV-131] Efficient Medical Image Restoration via Reliability Guided Learning in Frequency Domain

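[Quick read]: This paper addresses the limited efficiency and reliability of medical image restoration. Existing deep learning methods succeed with sophisticated modules but remain computationally costly and usually ignore the reliability of their results, which matters even more in clinical settings. The paper proposes LRformer, a lightweight Transformer-based method that uses reliability-guided learning in the frequency domain. Its first component, the Reliable Lesion-Semantic Prior Producer (RLPP), is inspired by uncertainty quantification in Bayesian neural networks: it uses Monte Carlo estimators with stochastic sampling to generate reliable priors from multiple inferences of the MedSAM model. Its second component, the Guided Frequency Cross-Attention (GFCA) solver, decomposes cross-attention into real symmetric and imaginary anti-symmetric parts via the fast Fourier transform and exploits conjugate symmetry to cut the computational complexity of naive cross-attention by nearly half. Experiments on several medical image restoration tasks confirm the method's effectiveness and efficiency.

Link: https://arxiv.org/abs/2504.11286
Authors: Pengcheng Zheng, Kecheng Chen, Jiaxin Huang, Bohao Chen, Ju Liu, Yazhou Ren, Xiaorong Pu
Affiliations: University of Electronic Science and Technology of China (Chengdu, China); City University of Hong Kong (Hong Kong, China); Mohamed bin Zayed University of Artificial Intelligence (Abu Dhabi, The United Arab Emirates); Shenzhen Institute For Advanced Study, University of Electronic Science and Technology of China (Shenzhen, China)
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: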

Abstract:Medical image restoration tasks aim to recover high-quality images from degraded observations and are in urgent demand in many clinical scenarios, such as low-dose CT image denoising, MRI super-resolution, and MRI artifact removal. Despite the success achieved by existing deep learning-based restoration methods with sophisticated modules, they struggle to produce reconstruction results in a computationally efficient way. Moreover, they usually ignore the reliability of the restoration results, which is much more urgent in medical systems. To alleviate these issues, we present LRformer, a Lightweight Transformer-based method via Reliability-guided learning in the frequency domain. Specifically, inspired by the uncertainty quantification in Bayesian neural networks (BNNs), we develop a Reliable Lesion-Semantic Prior Producer (RLPP). RLPP leverages Monte Carlo (MC) estimators with stochastic sampling operations to generate sufficiently reliable priors by performing multiple inferences on the foundational medical image segmentation model, MedSAM. Additionally, instead of directly incorporating the priors in the spatial domain, we decompose the cross-attention (CA) mechanism into real symmetric and imaginary anti-symmetric parts via fast Fourier transform (FFT), resulting in the design of the Guided Frequency Cross-Attention (GFCA) solver. By leveraging the conjugate symmetry property of FFT, GFCA reduces the computational complexity of naive CA by nearly half. Extensive experimental results in various tasks demonstrate the superiority of the proposed LRformer in both effectiveness and efficiency.
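The conjugate symmetry mentioned in the abstract is a general property of the Fourier transform of real-valued signals and can be checked with a few lines of code. The sketch below is not the GFCA module; it uses a naive DFT (the function names `dft` and `rebuild_from_half` are illustrative) only to show why roughly half of the spectrum is redundant for real input.

```python
import cmath


def dft(signal):
    """Naive O(n^2) discrete Fourier transform of a real or complex sequence."""
    n = len(signal)
    return [
        sum(signal[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def rebuild_from_half(half, n):
    """Recover the full spectrum of a real signal from its first n // 2 + 1 coefficients."""
    # For real input, X[n - k] equals the complex conjugate of X[k].
    return half + [half[n - k].conjugate() for k in range(n // 2 + 1, n)]


signal = [0.5, -1.0, 2.0, 0.25, 1.5, -0.75, 0.0, 3.0]  # arbitrary real-valued samples
n = len(signal)
spectrum = dft(signal)
half = spectrum[: n // 2 + 1]  # 5 of the 8 coefficients
rebuilt = rebuild_from_half(half, n)
max_error = max(abs(a - b) for a, b in zip(spectrum, rebuilt))
```

The real parts of the spectrum are symmetric and the imaginary parts anti-symmetric about the midpoint, which matches the decomposition named in the abstract and explains where a saving of nearly half can come from.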

[CV-132] Cryo-em images are intrinsically low dimensional

[Quick read]: This paper asks how to understand the geometry of the latent representations produced by neural networks in simulation-based inference of biomolecular conformations, and how that geometry relates to key parameters of the physical system. Using manifold learning, it analyzes the CryoSBI latent representations of hemagglutinin and shows that these high-dimensional data lie on low-dimensional, smooth manifolds, with the simulated data effectively covering the experimental data. By characterizing the manifold geometry with Diffusion Maps and identifying its principal axes of variation through coordinate interpretation methods, it establishes a direct link between the latent structure and key physical parameters. This finding validates the CryoSBI approach and opens the way to learning more from the data structure and to improving future inference strategies by exploiting the revealed manifold geometry.

Link: https://arxiv.org/abs/2504.11249
Authors: Luke Evans, Octavian-Vlad Murad, Lars Dingeldein, Pilar Cossio, Roberto Covino, Marina Meila
Affiliations: Center for Computational Mathematics, Flatiron Institute, New York, NY, USA; Department of Statistics, University of Washington, Seattle, Washington, USA; Frankfurt Institute For Advanced Study, Frankfurt, Hesse, Germany; Center for Computational Biology, Flatiron Institute, New York, NY, USA; Institute of Computer Science, Goethe University Frankfurt, Frankfurt, Hesse, Germany
Categories: Quantitative Methods (q-bio.QM); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Biomolecules (q-bio.BM); Machine Learning (stat.ML)
Comments:

Abstract:Simulation-based inference provides a powerful framework for cryo-electron microscopy, employing neural networks in methods like CryoSBI to infer biomolecular conformations via learned latent representations. This latent space represents a rich opportunity, encoding valuable information about the physical system and the inference process. Harnessing this potential hinges on understanding the underlying geometric structure of these representations. We investigate this structure by applying manifold learning techniques to CryoSBI representations of hemagglutinin (simulated and experimental). We reveal that these high-dimensional data inherently populate low-dimensional, smooth manifolds, with simulated data effectively covering the experimental counterpart. By characterizing the manifold’s geometry using Diffusion Maps and identifying its principal axes of variation via coordinate interpretation methods, we establish a direct link between the latent structure and key physical parameters. Discovering this intrinsic low-dimensionality and interpretable geometric organization not only validates the CryoSBI approach but enables us to learn more from the data structure and provides opportunities for improving future inference strategies by exploiting this revealed manifold geometry.

[CV-133] Agent Polyp: Accurate Polyp Segmentation via Image Enhancement Agent

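[Quick Read]: This paper addresses the noise-induced degradation of colon polyp images caused by human and environmental interference, such as dim lighting, blur, and overexposure, which makes downstream polyp segmentation harder. The proposed solution, AgentPolyp, is a framework that integrates CLIP-based semantic guidance, dynamic image enhancement, and a lightweight neural network for segmentation. Its key idea is to assess image quality through CLIP-driven semantic analysis (for example, identifying "low-contrast polyps with vascular textures") and to use a reinforcement learning strategy to apply multi-modal enhancement operations such as denoising and contrast adjustment dynamically. A quality-assessment feedback loop jointly optimizes pixel-level enhancement and segmentation focus, so that preprocessing is robust before the neural network segments the image. The modular architecture supports plug-and-play extension with different enhancement algorithms and segmentation networks, meeting the deployment requirements of endoscopic devices.

Link: https://arxiv.org/abs/2504.10978
Authors: Pu Wang,Zhihua Zhang,Dianjie Lu,Guijuan Zhang,Youshan Zhang,Zhuoran Zheng
Affiliations: School of Mathematics, Shandong University; School of Information Science and Engineering, Shandong Normal University; Department of Artificial Intelligence and Computer Science, Yeshiva University; School of Cyber Science and Technology, Sun Yat-sen University
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: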

Abstract:Since human and environmental factors interfere, captured polyp images usually suffer from issues such as dim lighting, blur, and overexposure, which pose challenges for downstream polyp segmentation tasks. To address the challenges of noise-induced degradation in polyp images, we present AgentPolyp, a novel framework integrating CLIP-based semantic guidance and dynamic image enhancement with a lightweight neural network for segmentation. The agent first evaluates image quality using CLIP-driven semantic analysis (e.g., identifying "low-contrast polyps with vascular textures") and adapts reinforcement learning strategies to dynamically apply multi-modal enhancement operations (e.g., denoising, contrast adjustment). A quality assessment feedback loop optimizes pixel-level enhancement and segmentation focus in a collaborative manner, ensuring robust preprocessing before neural network segmentation. This modular architecture supports plug-and-play extensions for various enhancement algorithms and segmentation networks, meeting deployment requirements for endoscopic devices.

[CV-134] Embedding Radiomics into Vision Transformers for Multimodal Medical Image Classification

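[Quick Read]: This paper addresses two main challenges of deep learning in medical image analysis: Vision Transformers (ViTs) depend on large amounts of data and lack domain-specific inductive biases, while traditional radiomics methods are limited in scalability and in their integration with end-to-end learning frameworks. The key contribution is a hybrid framework, the Radiomics-Embedded Vision Transformer (RE-ViT), which uses early fusion to combine handcrafted radiomic features with data-driven visual embeddings inside a ViT backbone, improving robustness and performance on multimodal medical image classification.

Link: https://arxiv.org/abs/2504.10916
Authors: Zhenyu Yang,Haiming Zhu,Rihui Zhang,Haipeng Zhang,Jianliang Wang,Chunhao Wang,Minbin Chen,Fang-Fang Yin
Affiliations: Unknown
Categories: Medical Physics (physics.med-ph); Computer Vision and Pattern Recognition (cs.CV)
Comments: 27 pages, 3 figures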

Abstract:Background: Deep learning has significantly advanced medical image analysis, with Vision Transformers (ViTs) offering a powerful alternative to convolutional models by modeling long-range dependencies through self-attention. However, ViTs are inherently data-intensive and lack domain-specific inductive biases, limiting their applicability in medical imaging. In contrast, radiomics provides interpretable, handcrafted descriptors of tissue heterogeneity but suffers from limited scalability and integration into end-to-end learning frameworks. In this work, we propose the Radiomics-Embedded Vision Transformer (RE-ViT) that combines radiomic features with data-driven visual embeddings within a ViT backbone. Purpose: To develop a hybrid RE-ViT framework that integrates radiomics and patch-wise ViT embeddings through early fusion, enhancing robustness and performance in medical image classification. Methods: Following the standard ViT pipeline, images were divided into patches. For each patch, handcrafted radiomic features were extracted and fused with linearly projected pixel embeddings. The fused representations were normalized, positionally encoded, and passed to the ViT encoder. A learnable [CLS] token aggregated patch-level information for classification. We evaluated RE-ViT on three public datasets (including BUSI, ChestXray2017, and Retinal OCT) using accuracy, macro AUC, sensitivity, and specificity. RE-ViT was benchmarked against CNN-based (VGG-16, ResNet) and hybrid (TransMed) models. Results: RE-ViT achieved state-of-the-art results: on BUSI, AUC=0.950+/-0.011; on ChestXray2017, AUC=0.989+/-0.004; on Retinal OCT, AUC=0.986+/-0.001, which outperforms other comparison models. Conclusions: The RE-ViT framework effectively integrates radiomics with ViT architectures, demonstrating improved performance and generalizability across multimodal medical image classification tasks. 
Cite as: arXiv:2504.10916 [physics.med-ph], https://doi.org/10.48550/arXiv.2504.10916. Submitted by Zhenyu Yang; v1: Tue, 15 Apr 2025 06:55:58 UTC.

[CV-135] Efficient and Robust Remote Sensing Image Denoising Using Randomized Approximation of Geodesics Gramian on the Manifold Underlying the Patch Space

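[Quick Read]: This paper addresses the degradation of remote sensing image quality caused by environmental factors and imaging-system issues, which impairs subsequent visual tasks. Current denoising algorithms perform poorly on remote sensing images with complex texture features, and denoising frameworks based on artificial neural networks, although they perform better, need resource-intensive training on heterogeneous samples. The paper proposes a computationally efficient and robust denoising method that requires no additional training samples. The method partitions the image into patches and reveals the low-rank manifold underlying the patch space, which represents the noise-free version of the image. The manifold is revealed efficiently and robustly through a randomized approximation of the singular value spectrum of the geodesics' Gramian matrix. The method also treats each color channel separately during denoising and merges the three denoised channels into the final image.

Link: https://arxiv.org/abs/2504.10820
Authors: Kelum Gajamannage,Dilhani I. Jayathilake,Maria Vasilyeva
Affiliations: University of Rhode Island; Quinnipiac University; Texas A&M University - Corpus Christi
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: 21 pages, 5 figures, and submitted to the International Journal of Remote Sensing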

Abstract:Remote sensing images are widely utilized in many disciplines such as feature recognition and scene semantic segmentation. However, due to environmental factors and the issues of the imaging system, the image quality is often degraded which may impair subsequent visual tasks. Even though denoising remote sensing images plays an essential role before applications, the current denoising algorithms fail to attain optimum performance since these images possess complex features in the texture. Denoising frameworks based on artificial neural networks have shown better performance; however, they require exhaustive training with heterogeneous samples that extensively consume resources like power, memory, computation, and latency. Thus, here we present a computationally efficient and robust remote sensing image denoising method that doesn’t require additional training samples. This method partitions patches of a remote-sensing image in which a low-rank manifold, representing the noise-free version of the image, underlies the patch space. An efficient and robust approach to revealing this manifold is a randomized approximation of the singular value spectrum of the geodesics’ Gramian matrix of the patch space. The method asserts a unique emphasis on each color channel during denoising so the three denoised channels are merged to produce the final image.

[CV-136] Visual anemometry of natural vegetation from their leaf motion

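[Quick Read]: This paper addresses the measurement of high-resolution, near-ground wind speed, which matters for the accuracy of weather prediction and climate models, for wildfire control, and for aircraft safety during takeoff and landing. Quantitative anemometry usually relies on on-site instruments or on sophisticated remote techniques such as Doppler radar, and vegetation has been considered unsuitable for precise measurement because its wind-induced motion depends in a complex way on its structure and mechanical properties. The key finding is that at low-to-moderate wind speeds, a range characterized by a leaf Reynolds number, leaf motion can be decoupled from the branch and support structure. On that basis the paper proposes a remote, quantitative anemometry method built on the formula $U_{\text{wind}} \approx 740 \sqrt{\mu U_{\text{leaf}} / (\rho D)}$, which needs only the leaf size $D$, its fluctuating (RMS) speed $U_{\text{leaf}}$, the air viscosity $\mu$, and the air density $\rho$. The formula is supported by a first-principles model and validated in laboratory and field tests, opening a route to low-cost, rapid, remote wind measurement at locations worldwide using natural vegetation.

Link: https://arxiv.org/abs/2504.10584
Authors: Roni H. Goldshmid,John O. Dabiri,John E. Sader
Affiliations: Unknown
Categories: Fluid Dynamics (physics.flu-dyn); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Atmospheric and Oceanic Physics (physics.ao-ph)
Comments: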

Abstract:High-resolution, near-ground wind-speed data are critical for improving the accuracy of weather predictions and climate models [1-3], supporting wildfire control efforts [4-7], and ensuring the safe passage of airplanes during takeoff and landing maneuvers [8,9]. Quantitative wind speed anemometry generally employs on-site instrumentation for accurate single-position data or sophisticated remote techniques such as Doppler radar for quantitative field measurements. It is widely recognized that the wind-induced motion of vegetation depends in a complex manner on their structure and mechanical properties, obviating their use in quantitative anemometry [10-14]. We analyze measurements on a host of different vegetation showing that leaf motion can be decoupled from the leaf’s branch and support structure, at low-to-moderate wind speed, $U_{\text{wind}}$. This wind speed range is characterized by a leaf Reynolds number, enabling the development of a remote, quantitative anemometry method based on the formula $U_{\text{wind}} \approx 740 \sqrt{\mu U_{\text{leaf}} / (\rho D)}$, that relies only on the leaf size $D$, its measured fluctuating (RMS) speed $U_{\text{leaf}}$, the air viscosity $\mu$, and its mass density $\rho$. This formula is corroborated by a first-principles model and validated using a host of laboratory and field tests on diverse vegetation types, ranging from oak, olive, and magnolia trees through to camphor and bullgrass. The findings of this study open the door to a new paradigm in anemometry, using natural vegetation to enable remote and rapid quantitative field measurements at global locations with minimal cost.
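
As a worked example of the formula in the abstract, the sketch below reads it as U_wind ≈ 740 · sqrt(μ · U_leaf / (ρ · D)), with the square root covering the whole ratio. That reading is an assumption, chosen because it is the only one that is dimensionally consistent (μ/(ρD) has units of velocity). The default air properties are rough room-temperature values, and the function name and sample inputs are hypothetical.

```python
import math


def wind_speed_from_leaf_motion(u_leaf, leaf_size, viscosity=1.8e-5, density=1.2):
    """Estimate wind speed (m/s) as 740 * sqrt(mu * U_leaf / (rho * D)).

    u_leaf:    RMS fluctuating leaf speed in m/s
    leaf_size: characteristic leaf size D in m
    viscosity: dynamic viscosity of air in Pa*s (assumed default)
    density:   mass density of air in kg/m^3 (assumed default)
    """
    if u_leaf < 0 or leaf_size <= 0:
        raise ValueError("u_leaf must be non-negative and leaf_size positive")
    return 740.0 * math.sqrt(viscosity * u_leaf / (density * leaf_size))


# Hypothetical measurement: a 5 cm leaf fluctuating at 5 cm/s RMS.
estimate = wind_speed_from_leaf_motion(u_leaf=0.05, leaf_size=0.05)
print(f"estimated wind speed: {estimate:.2f} m/s")
```

Under this reading the estimate grows with the square root of the leaf speed, so a fourfold increase in measured leaf motion corresponds to a doubling of the inferred wind speed.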

[CV-137] PathSeqSAM: Sequential Modeling for Pathology Image Segmentation with SAM2

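[Quick Read]: This paper addresses the fact that existing pathology image segmentation methods usually process 2D slices independently and ignore valuable cross-slice information. The proposed PathSeqSAM treats 2D pathology slices as sequential video frames and uses SAM2's memory mechanisms to capture cross-slice context. It also introduces a distance-aware attention mechanism that accounts for the variable physical distance between slices, and it uses LoRA for domain adaptation. On the KPI Challenge 2024 glomeruli segmentation task these components improve segmentation quality, particularly in difficult cases that benefit from cross-slice context.

Link: https://arxiv.org/abs/2504.10526
Authors: Mingyang Zhu,Yinting Liu,Mingyu Li,Jiacheng Wang
Affiliations: Unknown
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: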

Abstract:Current methods for pathology image segmentation typically treat 2D slices independently, ignoring valuable cross-slice information. We present PathSeqSAM, a novel approach that treats 2D pathology slices as sequential video frames using SAM2’s memory mechanisms. Our method introduces a distance-aware attention mechanism that accounts for variable physical distances between slices and employs LoRA for domain adaptation. Evaluated on the KPI Challenge 2024 dataset for glomeruli segmentation, PathSeqSAM demonstrates improved segmentation quality, particularly in challenging cases that benefit from cross-slice context. We have publicly released our code at this https URL.
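
The abstract does not give the form of the distance-aware attention, so the following is only one plausible reading: the content score for each neighbouring slice is penalised in proportion to its physical distance before the softmax. The linear penalty, the `decay` parameter, and the function name are all assumptions made for illustration, not the PathSeqSAM implementation.

```python
import math


def distance_aware_weights(scores, distances, decay=1.0):
    """Softmax over content scores penalised by physical distance to each slice."""
    # Assumed penalty form: subtract decay * distance from each raw score.
    logits = [s - decay * d for s, d in zip(scores, distances)]
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Three neighbouring slices with equal content scores but different
# (hypothetical) distances in micrometres from the query slice.
weights = distance_aware_weights([1.0, 1.0, 1.0], [0.0, 5.0, 20.0], decay=0.1)
```

With equal content scores, the nearest slice receives the largest weight, which is the qualitative behaviour a distance-aware mechanism is meant to provide when slice spacing varies.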

[CV-138] Integrating electrocardiogram and fundus images for early detection of cardiovascular diseases

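[Quick Read]: This paper addresses the early diagnosis and triage of cardiovascular diseases (CVD). It proposes integrating electrocardiogram (ECG) data with retinal fundus images to tag disease early and to triage cases in order of priority. The key idea is to combine the dynamic cardiac information from the ECG with the retinal vascular network, viewed as a reflection of the systemic cardiovascular system. Both modalities are transformed into the frequency domain with the Fast Fourier Transform (FFT), the Earth Mover's Distance (EMD) measures differences between frequency-domain features, and the resulting values are concatenated into a multimodal feature set that is fed to a neural network classifier for CVD classification.

Link: https://arxiv.org/abs/2504.10493
Authors: K. A. Muthukumar,Dhruva Nandi,Priya Ranjan,Krithika Ramachandran,Shiny PJ,Anirban Ghosh,Ashwini M,Aiswaryah Radhakrishnan,V. E. Dhandapani,Rajiv Janardhanan
Affiliations: Unknown
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: EMD, Fundus image, CNN, CVD prediction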

Abstract:Cardiovascular diseases (CVD) are a predominant health concern globally, emphasizing the need for advanced diagnostic techniques. In our research, we present an avant-garde methodology that synergistically integrates ECG readings and retinal fundus images to facilitate early disease tagging as well as triaging of CVDs in order of disease priority. Recognizing the intricate vascular network of the retina as a reflection of the cardiovascular system, along with the dynamic cardiac insights from ECG, we sought to provide a holistic diagnostic perspective. Initially, a Fast Fourier Transform (FFT) was applied to both the ECG and fundus images, transforming the data into the frequency domain. Subsequently, the Earth Mover’s Distance (EMD) was computed for the frequency-domain features of both modalities. These EMD values were then concatenated, forming a comprehensive feature set that was fed into a Neural Network classifier. This approach, leveraging the FFT’s spectral insights and EMD’s capability to capture nuanced data differences, offers a robust representation for CVD classification. Preliminary tests yielded a commendable accuracy of 84 percent, underscoring the potential of this combined diagnostic strategy. As we continue our research, we anticipate refining and validating the model further to enhance its clinical applicability in resource-limited healthcare ecosystems prevalent across the Indian sub-continent and also the world at large.
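
The pipeline in the abstract (frequency transform, then EMD, then concatenation) can be sketched on toy one-dimensional signals. The abstract does not say what each spectrum is compared against, so the reference spectrum below is an assumption, as are the helper names and the synthetic sine inputs. A naive DFT stands in for the FFT to keep the example dependency-free.

```python
import cmath
import math


def magnitude_spectrum(signal):
    """Normalised magnitudes of the first N//2 + 1 DFT coefficients."""
    n = len(signal)
    mags = []
    for k in range(n // 2 + 1):
        coeff = sum(x * cmath.exp(-2j * cmath.pi * k * t / n) for t, x in enumerate(signal))
        mags.append(abs(coeff))
    total = sum(mags)
    if total == 0:
        raise ValueError("signal has no spectral energy")
    return [m / total for m in mags]  # normalise so the spectrum sums to 1


def earth_movers_distance(p, q):
    """1-D EMD between two normalised histograms on the same unit-spaced bins."""
    emd, carried = 0.0, 0.0
    for a, b in zip(p, q):
        carried += a - b  # mass that still has to be moved to later bins
        emd += abs(carried)
    return emd


def sine(freq, n=16):
    return [math.sin(2 * math.pi * freq * t / n) for t in range(n)]


# Hypothetical stand-ins: a reference spectrum and one sample per modality.
reference = magnitude_spectrum(sine(1))
ecg_like = magnitude_spectrum(sine(3))
fundus_row_like = magnitude_spectrum(sine(2))
# Concatenated EMD values form the feature vector passed to a classifier.
features = [
    earth_movers_distance(ecg_like, reference),
    earth_movers_distance(fundus_row_like, reference),
]
```

Here each EMD value equals the number of frequency bins the spectral mass has to move, so `features` grows as a sample's dominant frequency drifts away from the reference.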

Artificial Intelligence

[AI-0] Elucidating the Design Space of Multimodal Protein Language Models

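[Quick Read]: This paper addresses the loss of fine-grained structural detail and correlations that occurs when multimodal protein language models (PLMs) integrate sequence information with token-based structural information by discretizing 3D structures into tokens. It identifies tokenization loss and inaccurate structure-token prediction by existing PLMs as the main bottlenecks and explores a design space to overcome them, covering improved generative modeling, structure-aware architectures and representation learning, and data exploration. With finer-grained supervision, the structural modeling ability of token-based multimodal PLMs improves substantially: the 650M-parameter model reduces RMSD on the PDB test set from 5.52 to 2.36, outperforming 3B-parameter baselines and performing on par with specialized protein folding models.

Link: https://arxiv.org/abs/2504.11454
Authors: Cheng-Yen(Wesley)Hsieh,Xinyou Wang,Daiheng Zhang,Dongyu Xue,Fei Ye,Shujian Huang,Zaixiang Zheng,Quanquan Gu
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Quantitative Methods (q-bio.QM)
Comments: Project Page: this https URL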

Abstract:Multimodal protein language models (PLMs) integrate sequence and token-based structural information, serving as a powerful foundation for protein modeling, generation, and design. However, the reliance on tokenizing 3D structures into discrete tokens causes substantial loss of fidelity about fine-grained structural details and correlations. In this paper, we systematically elucidate the design space of multimodal PLMs to overcome their limitations. We identify tokenization loss and inaccurate structure token predictions by the PLMs as major bottlenecks. To address these, our proposed design space covers improved generative modeling, structure-aware architectures and representation learning, and data exploration. Our advancements approach finer-grained supervision, demonstrating that token-based multimodal PLMs can achieve robust structural modeling. The effective design methods dramatically improve the structure generation diversity, and notably, folding abilities of our 650M model by reducing the RMSD from 5.52 to 2.36 on PDB testset, even outperforming 3B baselines and on par with the specialized folding models.
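
The headline number in the abstract is an RMSD, the root-mean-square deviation between predicted and reference atom positions. The sketch below computes the plain coordinate RMSD for two structures that are assumed to be already superimposed. Folding benchmarks normally perform an optimal alignment first, which this example omits, and the coordinates are made up.

```python
import math


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two equally sized lists of 3-D points."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("structures must be non-empty and the same length")
    squared = sum(
        (xa - xb) ** 2
        for point_a, point_b in zip(coords_a, coords_b)
        for xa, xb in zip(point_a, point_b)
    )
    return math.sqrt(squared / len(coords_a))


# Hypothetical reference structure and a prediction shifted by (3, 4, 0).
reference = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (3.0, 0.5, 0.0)]
predicted = [(x + 3.0, y + 4.0, z) for x, y, z in reference]
deviation = rmsd(reference, predicted)
```

Because every atom is displaced by a vector of length 5, the RMSD of this toy pair is 5, which shows why a rigid shift must be removed by alignment before RMSD is used as a quality score.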

[AI-1] A Clean Slate for Offline Reinforcement Learning

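[Quick Read]: This paper addresses ambiguous problem definitions, entangled algorithm designs, and opaque implementations in offline reinforcement learning (offline RL). Although offline RL explicitly avoids interaction with the environment, prior methods often rely on undocumented online evaluation for hyperparameter tuning, which makes methods hard to compare. Existing reference implementations also differ significantly in boilerplate code, which obscures their core algorithmic contributions.

The paper proposes two remedies. First, it introduces a rigorous taxonomy and a transparent evaluation protocol that explicitly quantifies the online tuning budget. Second, it provides clean, minimal, single-file implementations of a range of model-free and model-based offline RL methods, improving clarity and achieving substantial speed-ups. Building on these implementations, it proposes Unifloral, a unified algorithm that encapsulates diverse prior approaches within a single hyperparameter space, so that new algorithms can be developed in that shared space. Using Unifloral with the evaluation protocol, the authors develop two new algorithms, TD3-AWR (model-free) and MoBRAC (model-based), which substantially outperform established baselines. The code is publicly available.

Link: https://arxiv.org/abs/2504.11453
Authors: Matthew Thomas Jackson,Uljad Berdica,Jarek Liesen,Shimon Whiteson,Jakob Nicolaus Foerster
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments: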

Abstract:Progress in offline reinforcement learning (RL) has been impeded by ambiguous problem definitions and entangled algorithmic designs, resulting in inconsistent implementations, insufficient ablations, and unfair evaluations. Although offline RL explicitly avoids environment interaction, prior methods frequently employ extensive, undocumented online evaluation for hyperparameter tuning, complicating method comparisons. Moreover, existing reference implementations differ significantly in boilerplate code, obscuring their core algorithmic contributions. We address these challenges by first introducing a rigorous taxonomy and a transparent evaluation protocol that explicitly quantifies online tuning budgets. To resolve opaque algorithmic design, we provide clean, minimalistic, single-file implementations of various model-free and model-based offline RL methods, significantly enhancing clarity and achieving substantial speed-ups. Leveraging these streamlined implementations, we propose Unifloral, a unified algorithm that encapsulates diverse prior approaches within a single, comprehensive hyperparameter space, enabling algorithm development in a shared hyperparameter space. Using Unifloral with our rigorous evaluation protocol, we develop two novel algorithms - TD3-AWR (model-free) and MoBRAC (model-based) - which substantially outperform established baselines. Our implementation is publicly available at this https URL.

[AI-2] Embodied World Models Emerge from Navigational Task in Open-Ended Environments

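[Quick Read]: This paper addresses how artificial systems can autonomously develop spatial awareness and reasoning, something that models based on passive observation struggle to achieve. It combines Meta-Reinforcement Learning (Meta-RL) with Gated Recurrent Units (GRUs) so that a neural network internalizes spatial concepts through active interaction, and it uses Hybrid Dynamical Systems (HDS) to model the agent-environment interaction as a closed dynamical system, revealing stable limit cycles that correspond to optimal navigation strategies. Ridge Representation maps navigation paths into a fixed-dimensional behavioral space, and Canonical Correlation Analysis (CCA) confirms strong alignment between that space and the agent's neural states, indicating that the neural states actively encode spatial knowledge. Intervention experiments further show that specific neural dimensions are causally linked to navigation performance. Together, active interaction, meta-learning, dynamical analysis, and causal validation suggest a path toward adaptive, interpretable AI models.

Link: https://arxiv.org/abs/2504.11419
Authors: Li Jin,Liu Jia
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
Comments: Research on explainable meta-reinforcement learning AI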

Abstract:Understanding how artificial systems can develop spatial awareness and reasoning has long been a challenge in AI research. Traditional models often rely on passive observation, but embodied cognition theory suggests that deeper understanding emerges from active interaction with the environment. This study investigates whether neural networks can autonomously internalize spatial concepts through interaction, focusing on planar navigation tasks. Using Gated Recurrent Units (GRUs) combined with Meta-Reinforcement Learning (Meta-RL), we show that agents can learn to encode spatial properties like direction, distance, and obstacle avoidance. We introduce Hybrid Dynamical Systems (HDS) to model the agent-environment interaction as a closed dynamical system, revealing stable limit cycles that correspond to optimal navigation strategies. Ridge Representation allows us to map navigation paths into a fixed-dimensional behavioral space, enabling comparison with neural states. Canonical Correlation Analysis (CCA) confirms strong alignment between these representations, suggesting that the agent’s neural states actively encode spatial knowledge. Intervention experiments further show that specific neural dimensions are causally linked to navigation performance. This work provides an approach to bridging the gap between action and perception in AI, offering new insights into building adaptive, interpretable models that can generalize across complex environments. The causal validation of neural representations also opens new avenues for understanding and controlling the internal mechanisms of AI systems, pushing the boundaries of how machines learn and reason in dynamic, real-world scenarios.

[AI-3] Measures of Variability for Risk-averse Policy Gradient

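[Quick Read]: This paper addresses the limited study of measures of variability in risk-averse reinforcement learning (RARL). Existing work focuses mainly on risk measures such as conditional value-at-risk (CVaR), while measures of variability remain underexplored. The paper studies nine common measures of variability: Variance, Gini Deviation, Mean Deviation, Mean-Median Deviation, Standard Deviation, Inter-Quantile Range, CVaR Deviation, Semi_Variance, and Semi_Standard Deviation. Four of the nine had not been studied in RARL before; for these the paper derives policy gradient formulas. It also improves gradient estimation for Gini Deviation, analyzes the gradient properties of the measures, and incorporates them into the REINFORCE and PPO frameworks to penalize the dispersion of returns. Experiments show that variance-based measures lead to unstable policy updates, whereas CVaR Deviation and Gini Deviation perform consistently across different randomness and evaluation domains, achieving high returns while learning risk-averse policies. The work gives a systematic overview of variability measures in RARL and guidance for risk-aware decision-making and for future research on risk metrics and RARL algorithms.

Link: https://arxiv.org/abs/2504.11412
Authors: Yudong Luo,Yangchen Pan,Jiaqi Tan,Pascal Poupart
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: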

Abstract:Risk-averse reinforcement learning (RARL) is critical for decision-making under uncertainty, which is especially valuable in high-stake applications. However, most existing works focus on risk measures, e.g., conditional value-at-risk (CVaR), while measures of variability remain underexplored. In this paper, we comprehensively study nine common measures of variability, namely Variance, Gini Deviation, Mean Deviation, Mean-Median Deviation, Standard Deviation, Inter-Quantile Range, CVaR Deviation, Semi_Variance, and Semi_Standard Deviation. Among them, four metrics have not been previously studied in RARL. We derive policy gradient formulas for these unstudied metrics, improve gradient estimation for Gini Deviation, analyze their gradient properties, and incorporate them with the REINFORCE and PPO frameworks to penalize the dispersion of returns. Our empirical study reveals that variance-based metrics lead to unstable policy updates. In contrast, CVaR Deviation and Gini Deviation show consistent performance across different randomness and evaluation domains, achieving high returns while effectively learning risk-averse policies. Mean Deviation and Semi_Standard Deviation are also competitive across different scenarios. This work provides a comprehensive overview of variability measures in RARL, offering practical insights for risk-aware decision-making and guiding future research on risk metrics and RARL algorithms.
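
To make a few of the nine measures concrete, the sketch below computes sample estimates of four of them on a small batch of episode returns. The definitions are common textbook forms and are assumed here rather than taken from the paper: Gini Deviation as half the mean absolute difference between two returns, lower semi-variance as the mean squared shortfall below the mean, and CVaR Deviation as the mean minus the average of the worst alpha-fraction. The sample returns are made up.

```python
from statistics import fmean


def mean_deviation(returns):
    """Mean absolute deviation from the mean return."""
    mu = fmean(returns)
    return fmean(abs(r - mu) for r in returns)


def gini_deviation(returns):
    """Half the mean absolute difference over all ordered pairs (i, j)."""
    n = len(returns)
    return sum(abs(a - b) for a in returns for b in returns) / (2 * n * n)


def lower_semi_variance(returns):
    """Mean squared shortfall of returns that fall below the mean."""
    mu = fmean(returns)
    return fmean(min(r - mu, 0.0) ** 2 for r in returns)


def cvar_deviation(returns, alpha=0.25):
    """Mean return minus the mean of the worst alpha-fraction of returns."""
    worst = sorted(returns)[: max(1, int(alpha * len(returns)))]
    return fmean(returns) - fmean(worst)


# Hypothetical episode returns collected under one policy.
returns = [1.0, 2.0, 3.0, 4.0]
summary = {
    "mean_deviation": mean_deviation(returns),
    "gini_deviation": gini_deviation(returns),
    "lower_semi_variance": lower_semi_variance(returns),
    "cvar_deviation": cvar_deviation(returns),
}
```

In a risk-averse objective of the kind the abstract describes, one of these quantities would be subtracted from the expected return with a penalty weight, so that policies with widely dispersed returns score lower.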

[AI-4] Trajectory Encoding Temporal Graph Networks

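[Quick Read]: This paper addresses the trade-off that Temporal Graph Networks (TGNs) face in dynamic graph learning between transductive settings (predicting among known nodes) and inductive settings (generalizing to unseen nodes), for tasks such as link prediction and node classification. Anonymous TGNs generalize well inductively but struggle to distinguish known nodes, while non-anonymous TGNs excel in transductive tasks but cannot adapt to new nodes. The proposed Trajectory Encoding TGN (TETGN) introduces automatically expandable, learnable node identifiers as temporal positional features and performs message passing over these identifiers to capture each node's historical context. By integrating this trajectory-aware module with a standard TGN through multi-head attention, TETGN balances transductive accuracy with inductive generalization and significantly improves performance on link prediction and node classification.

Link: https://arxiv.org/abs/2504.11386
Authors: Jiafeng Xiong,Rizos Sakellariou
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: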

Abstract:Temporal Graph Networks (TGNs) have demonstrated significant success in dynamic graph tasks such as link prediction and node classification. Both tasks comprise transductive settings, where the model predicts links among known nodes, and inductive settings, where it generalises learned patterns to previously unseen nodes. Existing TGN designs face a dilemma under these dual scenarios. Anonymous TGNs, which rely solely on temporal and structural information, offer strong inductive generalisation but struggle to distinguish known nodes. In contrast, non-anonymous TGNs leverage node features to excel in transductive tasks yet fail to adapt to new nodes. To address this challenge, we propose Trajectory Encoding TGN (TETGN). Our approach introduces automatically expandable node identifiers (IDs) as learnable temporal positional features and performs message passing over these IDs to capture each node’s historical context. By integrating this trajectory-aware module with a standard TGN using multi-head attention, TETGN effectively balances transductive accuracy with inductive generalisation. Experimental results on three real-world datasets show that TETGN significantly outperforms strong baselines on both link prediction and node classification tasks, demonstrating its ability to unify the advantages of anonymous and non-anonymous models for dynamic graph learning.

[AI-5] A Winner-Takes-All Mechanism for Event Generation

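[Quick Read]: This paper addresses the limited flexibility and robustness of central pattern generator (CPG) design. The key idea is to combine the intrinsic rebound excitability of neurons with winner-takes-all (WTA) computation, using all-to-all inhibitory connections enhanced by designable excitatory interactions, so that decision-making and rhythmic pattern generation are unified in a simple yet powerful network architecture. The design offers advantages in ease of implementation, adaptability, and robustness. A ring oscillator model demonstrates the framework and exhibits adaptive phase and frequency modulation, which makes the approach promising for neuromorphic systems and robotics.

Link: https://arxiv.org/abs/2504.11374
Authors: Yongkang Huo,Fuvio Forni,Rodolphe Sepulchre
Affiliations: Unknown
Categories: Systems and Control (eess.SY); Artificial Intelligence (cs.AI)
Comments: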

Abstract:We present a novel framework for central pattern generator design that leverages the intrinsic rebound excitability of neurons in combination with winner-takes-all computation. Our approach unifies decision-making and rhythmic pattern generation within a simple yet powerful network architecture that employs all-to-all inhibitory connections enhanced by designable excitatory interactions. This design offers significant advantages regarding ease of implementation, adaptability, and robustness. We demonstrate its efficacy through a ring oscillator model, which exhibits adaptive phase and frequency modulation, making the framework particularly promising for applications in neuromorphic systems and robotics.
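
The winner-takes-all computation that the framework builds on can be illustrated with a generic rate model: each unit is driven by its own input and inhibited by every other unit, and rates are rectified at zero. This is a textbook-style sketch, not the authors' neuron model. It omits rebound excitability and the excitatory couplings entirely, and the parameter values are arbitrary choices.

```python
def winner_takes_all(inputs, inhibition=2.0, dt=0.1, steps=2000):
    """Euler-integrate rate units with all-to-all inhibition and rectification.

    Each unit follows dr_i/dt = -r_i + input_i - inhibition * sum_{j != i} r_j,
    clipped so that rates never become negative.
    """
    rates = [0.0] * len(inputs)
    for _ in range(steps):
        total = sum(rates)
        rates = [
            max(0.0, r + dt * (-r + drive - inhibition * (total - r)))
            for r, drive in zip(rates, inputs)
        ]
    return rates


# Hypothetical input drives; the first unit receives the strongest drive.
final_rates = winner_takes_all([1.0, 0.8, 0.6])
winner = max(range(len(final_rates)), key=final_rates.__getitem__)
```

With inhibition stronger than the self-decay, small differences between units are amplified until only the most strongly driven unit stays active, which is the decision-making half of the mechanism described in the abstract.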

[AI-6] DataSentinel: A Game-Theoretic Detection of Prompt Injection Attacks

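[Quick Read]: This paper addresses the vulnerability of LLM-integrated applications and agents to prompt injection attacks, in particular the limited effectiveness of existing detection methods against state-of-the-art and adaptive attacks. It proposes DataSentinel, a game-theoretic detection method. The key idea is to fine-tune an LLM to detect inputs contaminated with injected prompts that are strategically adapted to evade detection, formulated as a minimax optimization problem whose objective is a detector that withstands strong adaptive attacks. A gradient-based method solves the problem by alternating between the inner max and outer min problems, and the resulting detector handles both existing and adaptive prompt injection attacks.

Link: https://arxiv.org/abs/2504.11358
Authors: Yupei Liu,Yuqi Jia,Jinyuan Jia,Dawn Song,Neil Zhenqiang Gong
Affiliations: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: To appear in IEEE Symposium on Security and Privacy, 2025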

Abstract:LLM-integrated applications and agents are vulnerable to prompt injection attacks, where an attacker injects prompts into their inputs to induce attacker-desired outputs. A detection method aims to determine whether a given input is contaminated by an injected prompt. However, existing detection methods have limited effectiveness against state-of-the-art attacks, let alone adaptive ones. In this work, we propose DataSentinel, a game-theoretic method to detect prompt injection attacks. Specifically, DataSentinel fine-tunes an LLM to detect inputs contaminated with injected prompts that are strategically adapted to evade detection. We formulate this as a minimax optimization problem, with the objective of fine-tuning the LLM to detect strong adaptive attacks. Furthermore, we propose a gradient-based method to solve the minimax optimization problem by alternating between the inner max and outer min problems. Our evaluation results on multiple benchmark datasets and LLMs show that DataSentinel effectively detects both existing and adaptive prompt injection attacks.
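
The alternating scheme mentioned in the abstract (several steps on the inner max problem, then a step on the outer min problem) can be shown on a toy saddle-point objective. The function f(x, y) = x² + 2xy - y² below is purely illustrative and has nothing to do with the DataSentinel loss. Here x plays the role of the detector's parameters and y the attacker's, and the step sizes are arbitrary.

```python
def alternating_minimax(x=1.0, y=-1.0, lr=0.1, outer_steps=200, inner_steps=5):
    """Solve min_x max_y f(x, y) for f = x^2 + 2xy - y^2 by alternating updates."""
    for _ in range(outer_steps):
        # Inner max: gradient ascent on y, using df/dy = 2x - 2y.
        for _ in range(inner_steps):
            y += lr * (2 * x - 2 * y)
        # Outer min: one gradient descent step on x, using df/dx = 2x + 2y.
        x -= lr * (2 * x + 2 * y)
    return x, y


x_star, y_star = alternating_minimax()
```

For this objective the inner problem is solved by y = x, which leaves 2x² for the outer problem, so the iterates approach the saddle point at the origin. In the paper's setting the same alternation would update an adaptive attack and the fine-tuned detector instead of two scalars.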

[AI-7] Neural Networks for on-chip Model Predictive Control: a Method to Build Optimized Training Datasets and its application to Type-1 Diabetes

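[Quick Read]: This paper addresses a gap in training neural networks (NNs) to imitate Model Predictive Control (MPC) algorithms for efficient deployment on constrained embedded devices: the composition of the training data strongly affects the final NN accuracy, yet methods for optimizing it systematically are scarce. The key contribution is the concept of Optimally-Sampled Datasets (OSDs) together with an efficient algorithm for generating them. An OSD is a parametrized subset of all available data that (i) preserves existing MPC information up to a certain numerical resolution, (ii) avoids duplicate or near-duplicate states, and (iii) becomes saturated or complete. NNs trained on OSDs to replicate the University of Virginia's MPC algorithm for automated insulin delivery in Type-1 Diabetes achieve a four-fold improvement in final accuracy, and two OSD-trained NNs received regulatory clearance for clinical testing as the first NN-based control algorithm for direct human insulin dosing. The methodology opens new pathways for implementing advanced optimizations on resource-constrained embedded platforms.

Link: https://arxiv.org/abs/2504.11355
Authors: Alberto Castillo,Elliot Pryor,Anas El Fathi,Boris Kovatchev,Marc Breton
Affiliations: Unknown
Categories: Systems and Control (eess.SY); Artificial Intelligence (cs.AI)
Comments: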

Abstract:Training Neural Networks (NNs) to behave as Model Predictive Control (MPC) algorithms is an effective way to implement them in constrained embedded devices. By collecting large amounts of input-output data, where inputs represent system states and outputs are MPC-generated control actions, NNs can be trained to replicate MPC behavior at a fraction of the computational cost. However, although the composition of the training data critically influences the final NN accuracy, methods for systematically optimizing it remain underexplored. In this paper, we introduce the concept of Optimally-Sampled Datasets (OSDs) as ideal training sets and present an efficient algorithm for generating them. An OSD is a parametrized subset of all the available data that (i) preserves existing MPC information up to a certain numerical resolution, (ii) avoids duplicate or near-duplicate states, and (iii) becomes saturated or complete. We demonstrate the effectiveness of OSDs by training NNs to replicate the University of Virginia’s MPC algorithm for automated insulin delivery in Type-1 Diabetes, achieving a four-fold improvement in final accuracy. Notably, two OSD-trained NNs received regulatory clearance for clinical testing as the first NN-based control algorithm for direct human insulin dosing. This methodology opens new pathways for implementing advanced optimizations on resource-constrained embedded platforms, potentially revolutionizing how complex algorithms are deployed.
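
Properties (i) and (ii) above can be illustrated with a small sketch. It is not the paper's OSD algorithm: the grid-cell rule, the function name `subsample_by_resolution`, and the toy data are assumptions made here. States are snapped to a grid whose cell size plays the role of the numerical resolution, and only the first sample per cell is kept, so near-duplicate states are dropped.

```python
import numpy as np

def subsample_by_resolution(states, actions, resolution):
    """Keep the first (state, action) pair seen in each grid cell.

    Hypothetical helper: the grid-cell rule is an assumption of this sketch,
    not the OSD generation algorithm from the paper.
    """
    states = np.asarray(states, dtype=float)
    actions = np.asarray(actions, dtype=float)
    # Integer index of the grid cell that each state falls into.
    cells = np.floor(states / resolution).astype(np.int64)
    # For each distinct cell, the index of its first occurrence.
    _, first_idx = np.unique(cells, axis=0, return_index=True)
    keep = np.sort(first_idx)  # restore the original sample order
    return states[keep], actions[keep]

# Toy data: the first two states fall into the same cell at resolution 0.5.
states = [[0.10, 0.20], [0.12, 0.21], [0.90, 0.20]]
actions = [1.0, 2.0, 3.0]
kept_states, kept_actions = subsample_by_resolution(states, actions, resolution=0.5)
```

Under this reading, a dataset could be called saturated once feeding in more raw data no longer adds new cells, which loosely mirrors property (iii).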

[AI-8] Kimina-Prover Preview: Towards Large Formal Reasoning Models with Reinforcement Learning

[Quick Read]: This paper aims to use large language models to improve automated formal theorem proving. The key is a novel reasoning-driven exploration paradigm: Kimina-Prover is trained from Qwen2.5-72B with a large-scale reinforcement learning pipeline and uses a structured reasoning approach, called the formal reasoning pattern, that emulates human problem-solving strategies in Lean 4 by iteratively generating and refining proof steps. The model reaches 80.7% pass@8192 on the miniF2F benchmark. It also shows high sample efficiency, clear performance scaling with model size, and the potential to bridge formal verification and informal mathematical intuition.

Link: https://arxiv.org/abs/2504.11354
Authors: Haiming Wang,Mert Unsal,Xiaohan Lin,Mantas Baksys,Junqi Liu,Marco Dos Santos,Flood Sung,Marina Vinyes,Zhenzhe Ying,Zekai Zhu,Jianqiao Lu,Hugues de Saxcé,Bolton Bailey,Chendong Song,Chenjun Xiao,Dehao Zhang,Ebony Zhang,Frederick Pu,Han Zhu,Jiawei Liu,Jonas Bayer,Julien Michel,Longhui Yu,Léo Dreyfus-Schmidt,Lewis Tunstall,Luigi Pagani,Moreira Machado,Pauline Bourigault,Ran Wang,Stanislas Polu,Thibaut Barroyer,Wen-Ding Li,Yazhe Niu,Yann Fleureau,Yangyang Hu,Zhouliang Yu,Zihan Wang,Zhilin Yang,Zhengying Liu,Jia Li
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 22 pages

Abstract:We introduce Kimina-Prover Preview, a large language model that pioneers a novel reasoning-driven exploration paradigm for formal theorem proving, as showcased in this preview release. Trained with a large-scale reinforcement learning pipeline from Qwen2.5-72B, Kimina-Prover demonstrates strong performance in Lean 4 proof generation by employing a structured reasoning pattern we term the formal reasoning pattern. This approach allows the model to emulate human problem-solving strategies in Lean, iteratively generating and refining proof steps. Kimina-Prover sets a new state-of-the-art on the miniF2F benchmark, reaching 80.7% with pass@8192. Beyond improved benchmark performance, our work yields several key insights: (1) Kimina-Prover exhibits high sample efficiency, delivering strong results even with minimal sampling (pass@1) and scaling effectively with computational budget, stemming from its unique reasoning pattern and RL training; (2) we demonstrate clear performance scaling with model size, a trend previously unobserved for neural theorem provers in formal mathematics; (3) the learned reasoning style, distinct from traditional search algorithms, shows potential to bridge the gap between formal verification and informal mathematical intuition. We open-source distilled versions of Kimina-Prover with 1.5B and 7B parameters.
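
For readers unfamiliar with the pass@k metric quoted above, the sketch below shows a commonly used unbiased estimator: with n sampled proofs of which c are verified correct, pass@k is estimated as 1 - C(n-c, k) / C(n, k). Whether the paper computes pass@8192 with this estimator or simply draws 8192 samples per problem is not stated in the abstract, so treat this as background only. The function name is chosen here.

```python
from math import comb

def pass_at_k(n, c, k):
    """Estimate the probability that at least one of k samples is correct,
    given c correct samples out of n drawn for the same problem."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k incorrect samples exist.
    return 1.0 - comb(n - c, k) / comb(n, k)

# Toy numbers: 2 samples, 1 correct, budget of 1 attempt.
half = pass_at_k(n=2, c=1, k=1)
```

A benchmark score is then the mean of this quantity over all problems.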

[AI-9] Interpretable Hybrid-Rule Temporal Point Processes

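[Quick Read]: This paper addresses a limitation of existing interpretable Temporal Point Processes (TPPs): they cannot incorporate numerical features, which restricts the precision of their predictions. It proposes Hybrid-Rule Temporal Point Processes (HRTPP), which combine temporal logic rules with numerical features to improve both interpretability and predictive accuracy in event modeling. HRTPP has three components: a basic intensity for intrinsic event likelihood, a rule-based intensity for structured temporal dependencies, and a numerical feature intensity for dynamic probability modulation. To discover valid rules, the paper introduces a two-phase rule mining strategy with Bayesian optimization, and it builds a multi-criteria evaluation framework covering rule validity, model fitting, and temporal predictive accuracy. On real-world medical datasets, HRTPP outperforms state-of-the-art interpretable TPPs and yields clinically useful explanations of disease progression.

Link: https://arxiv.org/abs/2504.11344
Authors: Yunyang Cao,Juekai Lin,Hongye Wang,Wenhao Li,Bo Jin
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments: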

Abstract:Temporal Point Processes (TPPs) are widely used for modeling event sequences in various medical domains, such as disease onset prediction, progression analysis, and clinical decision support. Although TPPs effectively capture temporal dynamics, their lack of interpretability remains a critical challenge. Recent advancements have introduced interpretable TPPs. However, these methods fail to incorporate numerical features, thereby limiting their ability to generate precise predictions. To address this issue, we propose Hybrid-Rule Temporal Point Processes (HRTPP), a novel framework that integrates temporal logic rules with numerical features, improving both interpretability and predictive accuracy in event modeling. HRTPP comprises three key components: basic intensity for intrinsic event likelihood, rule-based intensity for structured temporal dependencies, and numerical feature intensity for dynamic probability modulation. To effectively discover valid rules, we introduce a two-phase rule mining strategy with Bayesian optimization. To evaluate our method, we establish a multi-criteria assessment framework, incorporating rule validity, model fitting, and temporal predictive accuracy. Experimental results on real-world medical datasets demonstrate that HRTPP outperforms state-of-the-art interpretable TPPs in terms of predictive performance and clinical interpretability. In case studies, the rules extracted by HRTPP explain the disease progression, offering valuable contributions to medical diagnosis.

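[AI-10] Transformer-Based Model for Cold Start Mitigation in FaaS Architecture

[Quick Read]: This paper addresses the cold start problem in Function as a Service (FaaS) architectures: when an idle FaaS function is invoked, it must go through a full initialization process, which increases latency and degrades user experience. The key contribution is an approach based on Transformer models that models function initialization delays more accurately and optimizes overall serverless system performance. In experiments on a public dataset provided by Azure, it reduces cold start times by up to 79% compared to conventional methods.

Link: https://arxiv.org/abs/2504.11338
Authors: Alexandre Savi Fayam Mbala Mouen,Jerry Lacmou Zeutouo,Vianney Kengne Tchendji
Affiliation: unknown
Categories: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)
Comments: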

Abstract:Serverless architectures, particularly the Function as a Service (FaaS) model, have become a cornerstone of modern cloud computing due to their ability to simplify resource management and enhance application deployment agility. However, a significant challenge remains: the cold start problem. This phenomenon occurs when an idle FaaS function is invoked, requiring a full initialization process, which increases latency and degrades user experience. Existing solutions for cold start mitigation are limited in terms of invocation pattern generalization and implementation complexity. In this study, we propose an innovative approach leveraging Transformer models to mitigate the impact of cold starts in FaaS architectures. Our solution excels in accurately modeling function initialization delays and optimizing serverless system performance. Experimental evaluation using a public dataset provided by Azure demonstrates a significant reduction in cold start times, reaching up to 79% compared to conventional methods.

[AI-11] Code Reborn AI-Driven Legacy Systems Modernization from COBOL to Java

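[Quick Read]: This paper addresses the modernization of legacy COBOL code into Java, a critical challenge in aging software systems. Using the Legacy COBOL 2024 Corpus (50,000 COBOL files from public and enterprise sources), Java parses the code, AI suggests upgrades, and React visualizes the gains. The approach reaches 93% accuracy, lowers complexity by 35% (from 18 to 11.7) and coupling by 33% (from 8 to 5.4), and surpasses manual efforts (75%) and rule-based tools (82%). Its core is the combination of AI-driven automated conversion with visual analysis, which offers a scalable path for modernizing COBOL systems in industries such as banking and insurance.

Link: https://arxiv.org/abs/2504.11335
Authors: Gopichand Bandarupalli
Affiliation: unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: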

Abstract:This study investigates AI-driven modernization of legacy COBOL code into Java, addressing a critical challenge in aging software systems. Leveraging the Legacy COBOL 2024 Corpus – 50,000 COBOL files from public and enterprise sources – Java parses the code, AI suggests upgrades, and React visualizes gains. Achieving 93% accuracy, complexity drops 35% (from 18 to 11.7) and coupling 33% (from 8 to 5.4), surpassing manual efforts (75%) and rule-based tools (82%). The approach offers a scalable path to rejuvenate COBOL systems, vital for industries like banking and insurance.
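
The percentage drops quoted in the abstract can be checked from the before and after values it gives. The helper name below is chosen here; the four input numbers are the abstract's.

```python
def percent_drop(before, after):
    """Relative reduction from `before` to `after`, in percent."""
    return 100.0 * (before - after) / before

complexity_drop = percent_drop(18, 11.7)  # about 35, matching the stated 35%
coupling_drop = percent_drop(8, 5.4)      # about 32.5, which the abstract reports as 33%
```

The complexity figure matches exactly, and the coupling figure is consistent with 32.5% rounded up to 33%.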

[AI-12] Optimizing LLM Inference: Fluid-Guided Online Scheduling with Memory Constraints

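[Quick Read]: This paper addresses the high computational cost of Large Language Model (LLM) inference caused by the memory-heavy Key-Value (KV) cache, especially under memory constraints. It formulates LLM inference optimization as a multi-stage online scheduling problem in which sequential prompt arrivals and KV cache growth make conventional scheduling ineffective. The authors first develop a fluid dynamics approximation as a tractable benchmark that guides algorithm design. Building on it, they propose the WAIT (Waiting for Accumulated Inference Threshold) algorithm, which uses multiple thresholds to schedule incoming prompts optimally when output lengths are known, and extend it to Nested WAIT for unknown output lengths. Theoretical analysis shows that both algorithms are near-optimal against the fluid benchmark in heavy traffic, balancing throughput, latency, and Time to First Token (TTFT). Experiments show improved throughput and reduced latency relative to baselines such as vLLM and Sarathi.

Link: https://arxiv.org/abs/2504.11320
Authors: Ruicheng Ao,Gan Luo,David Simchi-Levi,Xinshang Wang
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Optimization and Control (math.OC); Machine Learning (stat.ML)
Comments: 42 pages, 18 figures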

Abstract:Large Language Models (LLMs) are indispensable in today’s applications, but their inference procedure – generating responses by processing text in segments and using a memory-heavy Key-Value (KV) cache – demands significant computational resources, particularly under memory constraints. This paper formulates LLM inference optimization as a multi-stage online scheduling problem where sequential prompt arrivals and KV cache growth render conventional scheduling ineffective. We develop a fluid dynamics approximation to provide a tractable benchmark that guides algorithm design. Building on this, we propose the Waiting for Accumulated Inference Threshold (WAIT) algorithm, which uses multiple thresholds to schedule incoming prompts optimally when output lengths are known, and extend it to Nested WAIT for cases with unknown output lengths. Theoretical analysis shows that both algorithms achieve near-optimal performance against the fluid benchmark in heavy traffic conditions, balancing throughput, latency, and Time to First Token (TTFT). Experiments with the Llama-7B model on an A100 GPU using both synthetic and real-world datasets demonstrate improved throughput and reduced latency relative to established baselines like vLLM and Sarathi. This work bridges operations research and machine learning, offering a rigorous framework for the efficient deployment of LLMs under memory constraints.

[AI-13] Learning to Be A Doctor: Searching for Effective Medical Agent Architectures

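[Quick Read]: This paper addresses the limited flexibility of existing LLM-based medical agent systems, which usually rely on static, manually crafted workflows and therefore struggle to accommodate diverse diagnostic requirements or adapt to emerging clinical scenarios. It proposes a framework for the automated design of medical agent architectures. The key is a hierarchical and expressive agent search space that enables dynamic workflow adaptation through structured modifications at the node, structural, and framework levels. The framework models medical agents as graph-based architectures composed of diverse functional node types and supports iterative self-improvement guided by diagnostic feedback. Experiments on skin disease diagnosis show that the method evolves workflow structures effectively and significantly improves diagnostic accuracy over time. The authors describe it as the first fully automated framework for medical agent architecture design, offering a scalable and adaptable basis for deploying agents in real clinical environments.

Link: https://arxiv.org/abs/2504.11301
Authors: Yangyang Zhuang,Wenjia Jiang,Jiayu Zhang,Ze Yang,Joey Tianyi Zhou,Chi Zhang
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: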

Abstract:Large Language Model (LLM)-based agents have demonstrated strong capabilities across a wide range of tasks, and their application in the medical domain holds particular promise due to the demand for high generalizability and reliance on interdisciplinary knowledge. However, existing medical agent systems often rely on static, manually crafted workflows that lack the flexibility to accommodate diverse diagnostic requirements and adapt to emerging clinical scenarios. Motivated by the success of automated machine learning (AutoML), this paper introduces a novel framework for the automated design of medical agent architectures. Specifically, we define a hierarchical and expressive agent search space that enables dynamic workflow adaptation through structured modifications at the node, structural, and framework levels. Our framework conceptualizes medical agents as graph-based architectures composed of diverse, functional node types and supports iterative self-improvement guided by diagnostic feedback. Experimental results on skin disease diagnosis tasks demonstrate that the proposed method effectively evolves workflow structures and significantly enhances diagnostic accuracy over time. This work represents the first fully automated framework for medical agent architecture design and offers a scalable, adaptable foundation for deploying intelligent agents in real-world clinical environments.

[AI-14] Bipartite Ranking From Multiple Labels: On Loss Versus Label Aggregation

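[Quick Read]: This paper asks how, in bipartite ranking, multiple binary target labels (for example from distinct human annotators) can be synthesized into a single coherent ranking. It analyzes two approaches, loss aggregation and label aggregation, by characterizing their Bayes-optimal solutions. Both methods can yield Pareto-optimal solutions, but loss aggregation can exhibit label dictatorship: it can inadvertently favor one label over the others. This suggests that label aggregation can be preferable to loss aggregation, a conclusion the authors verify empirically.

Link: https://arxiv.org/abs/2504.11284
Authors: Michal Lukasik,Lin Chen,Harikrishna Narasimhan,Aditya Krishna Menon,Wittawat Jitkrittum,Felix X. Yu,Sashank J. Reddi,Gang Fu,Mohammadhossein Bateni,Sanjiv Kumar
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (stat.ML)
Comments: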

Abstract:Bipartite ranking is a fundamental supervised learning problem, with the goal of learning a ranking over instances with maximal area under the ROC curve (AUC) against a single binary target label. However, one may often observe multiple binary target labels, e.g., from distinct human annotators. How can one synthesize such labels into a single coherent ranking? In this work, we formally analyze two approaches to this problem – loss aggregation and label aggregation – by characterizing their Bayes-optimal solutions. Based on this, we show that while both methods can yield Pareto-optimal solutions, loss aggregation can exhibit label dictatorship: one can inadvertently (and undesirably) favor one label over others. This suggests that label aggregation can be preferable to loss aggregation, which we empirically verify.
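
To make the two approaches concrete, the sketch below evaluates one fixed scoring of four items against three annotators. It illustrates the distinction only and is not the paper's formulation: the equal-weight average of per-label AUCs stands in for loss aggregation, majority vote stands in for label aggregation, and all names and data are invented here.

```python
def auc(scores, labels):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 1/2."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def loss_aggregated_auc(scores, label_sets):
    """Average the per-label AUCs (equal weights assumed for this sketch)."""
    return sum(auc(scores, labels) for labels in label_sets) / len(label_sets)

def label_aggregated_auc(scores, label_sets):
    """Merge labels by majority vote first, then compute a single AUC."""
    merged = [1 if 2 * sum(votes) > len(votes) else 0 for votes in zip(*label_sets)]
    return auc(scores, merged)

scores = [0.9, 0.7, 0.4, 0.2]
label_a = [1, 1, 0, 0]
label_b = [1, 0, 1, 0]
label_c = [1, 1, 0, 0]
label_sets = [label_a, label_b, label_c]

per_label = [auc(scores, labels) for labels in label_sets]   # 1.0, 0.75, 1.0
loss_agg = loss_aggregated_auc(scores, label_sets)           # mean of the three
label_agg = label_aggregated_auc(scores, label_sets)         # AUC vs. majority label
```

The paper's analysis concerns which rankings are optimal under each objective, which this evaluation-only sketch does not attempt to reproduce.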

[AI-15] DeepSelective: Feature Gating and Representation Matching for Interpretable Clinical Prediction

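[Quick Read]: This paper addresses two main challenges in clinical prediction from Electronic Health Records (EHRs): conventional machine learning models depend on expert-crafted features and lack robust representation learning, while deep learning models predict well but are hard to interpret. It proposes DeepSelective, an end-to-end deep learning framework for predicting patient prognosis with a strong emphasis on interpretability. The key is the combination of data compression techniques with a new feature selection approach, implemented as custom-designed modules that work together to improve both accuracy and interpretability. Experiments show gains on both fronts, making the framework a useful tool for clinical decision support.

Link: https://arxiv.org/abs/2504.11264
Authors: Ruochi Zhang,Qian Yang,Xiaoyang Wang,Haoran Wu,Qiong Zhou,Yu Wang,Kewei Li,Yueying Wang,Yusi Fan,Jiale Zhang,Lan Huang,Chang Liu,Fengfeng Zhou
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: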

Abstract:The rapid accumulation of Electronic Health Records (EHRs) has transformed healthcare by providing valuable data that enhance clinical predictions and diagnoses. While conventional machine learning models have proven effective, they often lack robust representation learning and depend heavily on expert-crafted features. Although deep learning offers powerful solutions, it is often criticized for its lack of interpretability. To address these challenges, we propose DeepSelective, a novel end-to-end deep learning framework for predicting patient prognosis using EHR data, with a strong emphasis on enhancing model interpretability. DeepSelective combines data compression techniques with an innovative feature selection approach, integrating custom-designed modules that work together to improve both accuracy and interpretability. Our experiments demonstrate that DeepSelective not only enhances predictive accuracy but also significantly improves interpretability, making it a valuable tool for clinical decision-making. The source code is freely available at this http URL.

[AI-16] A Rollout-Based Algorithm and Reward Function for Efficient Resource Allocation in Business Processes

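[Quick Read]: This paper addresses resource allocation in dynamic business process environments. Existing Deep Reinforcement Learning (DRL) methods rely on engineered reward functions that only approximate the desired objective, and a misalignment between reward and objective can lead to suboptimal policies or undesired decisions. The key contribution is a rollout-based DRL algorithm together with a reward function that directly decomposes the objective. The algorithm iteratively improves the policy by evaluating execution trajectories that follow different actions, and maximizing the reward guarantees that the objective, the mean cycle time, is minimized without extensive reward engineering. The method learns the optimal policy in all six evaluated business processes, whereas the state-of-the-art algorithm does so in only two.

Link: https://arxiv.org/abs/2504.11250
Authors: Jeroen Middelhuis,Zaharah Bukhsh,Ivo Adan,Remco Dijkman
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Pre-print submitted to the 23rd International Conference on Business Process Management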

Abstract:Resource allocation plays a critical role in minimizing cycle time and improving the efficiency of business processes. Recently, Deep Reinforcement Learning (DRL) has emerged as a powerful tool to optimize resource allocation policies in business processes. In the DRL framework, an agent learns a policy through interaction with the environment, guided solely by reward signals that indicate the quality of its decisions. However, existing algorithms are not suitable for dynamic environments such as business processes. Furthermore, existing DRL-based methods rely on engineered reward functions that approximate the desired objective, but a misalignment between reward and objective can lead to undesired decisions or suboptimal policies. To address these issues, we propose a rollout-based DRL algorithm and a reward function to optimize the objective directly. Our algorithm iteratively improves the policy by evaluating execution trajectories following different actions. Our reward function directly decomposes the objective function of minimizing the mean cycle time. Maximizing our reward function guarantees that the objective function is minimized without requiring extensive reward engineering. The results show that our method consistently learns the optimal policy in all six evaluated business processes, outperforming the state-of-the-art algorithm that can only learn the optimal policy in two of the evaluated processes.
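
The abstract does not spell out how the reward decomposes the mean cycle time. One standard identity that makes such a decomposition possible is that total cycle time equals the number of cases in the system integrated over time. The sketch below checks that identity on toy data; the function names, the per-interval penalty, and the numbers are assumptions made here, not the paper's reward function.

```python
def total_cycle_time(arrivals, completions):
    """Sum over cases of (completion time - arrival time)."""
    return sum(c - a for a, c in zip(arrivals, completions))

def total_cycle_time_from_intervals(arrivals, completions):
    """Accumulate (cases in system) x (interval length) between consecutive events.

    Each term could serve as a per-step penalty; their sum equals the total
    cycle time, so dividing by the case count gives the mean cycle time.
    """
    events = sorted([(t, +1) for t in arrivals] + [(t, -1) for t in completions])
    if not events:
        return 0.0
    total, in_system, prev_t = 0.0, 0, events[0][0]
    for t, delta in events:
        total += in_system * (t - prev_t)  # cost accrued since the previous event
        in_system += delta                 # a case arrives (+1) or completes (-1)
        prev_t = t
    return total

arrivals = [0.0, 1.0, 4.0]
completions = [3.0, 6.0, 5.0]
direct = total_cycle_time(arrivals, completions)
stepwise = total_cycle_time_from_intervals(arrivals, completions)
```

Under this reading, a reward of minus the accrued cost at each decision step sums to minus the total cycle time, so maximizing return minimizes mean cycle time for a fixed set of cases.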

[AI-17] Influence Maximization in Temporal Social Networks with a Cold-Start Problem: A Supervised Approach

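[Quick Read]: This paper addresses Influence Maximization (IM) in temporal graphs, that is, identifying the key seed nodes that maximize network expansion. It proposes defining seeds through Influence Propagation Paths (IPPs), and focuses on labeling IPPs efficiently and predicting seeds accurately while handling the cold-start issue common in temporal networks. The key components are a motif-based labeling method and a tensorized Temporal Graph Network (TGN) tailored to multi-relational temporal graphs, which improve prediction accuracy and computational efficiency. Cold-start nodes are further augmented with new neighbors from historical data that share similar IPPs. Offline experiments assess prediction accuracy and training efficiency, and online A/B testing validates practical network growth and the handling of cold-start nodes.

Link: https://arxiv.org/abs/2504.11245
Authors: Laixin Xie,Ying Zhang,Xiyuan Wang,Shiyi Liu,Shenghan Gao,Xingxing Xing,Wei Wan,Haipeng Zhang,Quan Li
Affiliation: unknown
Categories: Social and Information Networks (cs.SI); Artificial Intelligence (cs.AI)
Comments: Accepted by ICWSM 2025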

Abstract:Influence Maximization (IM) in temporal graphs focuses on identifying influential “seeds” that are pivotal for maximizing network expansion. We advocate defining these seeds through Influence Propagation Paths (IPPs), which is essential for scaling up the network. Our focus lies in efficiently labeling IPPs and accurately predicting these seeds, while addressing the often-overlooked cold-start issue prevalent in temporal networks. Our strategy introduces a motif-based labeling method and a tensorized Temporal Graph Network (TGN) tailored for multi-relational temporal graphs, bolstering prediction accuracy and computational efficiency. Moreover, we augment cold-start nodes with new neighbors from historical data sharing similar IPPs. The recommendation system within an online team-based gaming environment presents subtle impact on the social network, forming multi-relational (i.e., weak and strong) temporal graphs for our empirical IM study. We conduct offline experiments to assess prediction accuracy and model training efficiency, complemented by online A/B testing to validate practical network growth and the effectiveness in addressing the cold-start issue.

[AI-18] Diversity-Driven Learning: Tackling Spurious Correlations and Data Heterogeneity in Federated Models

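[Quick Read]: This paper addresses statistical data heterogeneity in Federated Learning (FL): client data is often non-identically distributed and imbalanced, which hurts the generalization of the server model across clients, slows convergence, and reduces performance. The contributions are threefold. First, the paper characterizes statistical heterogeneity with 6 metrics of global and client attribute imbalance, class imbalance, and spurious correlations. Second, it creates and shares 7 computer vision datasets that cover a broad range of heterogeneity and simulate real-world FL settings. Third, it proposes FedDiverse, a client selection algorithm that manages and leverages heterogeneity by promoting collaboration between clients with complementary data distributions. FedDiverse improves the performance and robustness of a variety of FL methods while keeping communication and computational overhead low.

Link: https://arxiv.org/abs/2504.11216
Authors: Gergely D. Németh,Eros Fanì,Yeat Jeng Ng,Barbara Caputo,Miguel Ángel Lozano,Nuria Oliver,Novi Quadrianto
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: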

Abstract:Federated Learning (FL) enables decentralized training of machine learning models on distributed data while preserving privacy. However, in real-world FL settings, client data is often non-identically distributed and imbalanced, resulting in statistical data heterogeneity which impacts the generalization capabilities of the server’s model across clients, slows convergence and reduces performance. In this paper, we address this challenge by first proposing a characterization of statistical data heterogeneity by means of 6 metrics of global and client attribute imbalance, class imbalance, and spurious correlations. Next, we create and share 7 computer vision datasets for binary and multiclass image classification tasks in Federated Learning that cover a broad range of statistical data heterogeneity and hence simulate real-world situations. Finally, we propose FedDiverse, a novel client selection algorithm in FL which is designed to manage and leverage data heterogeneity across clients by promoting collaboration between clients with complementary data distributions. Experiments on the seven proposed FL datasets demonstrate FedDiverse’s effectiveness in enhancing the performance and robustness of a variety of FL methods while having low communication and computational overhead.

[AI-19] Mutual Understanding between People and Systems via Neurosymbolic AI and Knowledge Graphs

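[Quick Read]: This chapter examines the concept of mutual understanding between humans and systems, and argues that Neuro-symbolic AI (NeSy AI) methods can significantly improve it by combining explicit symbolic knowledge representations with data-driven learning models. Mutual understanding is characterized along three dimensions: sharing knowledge, exchanging knowledge, and governing knowledge. Using Knowledge Graphs and related tools, several use case scenarios show how top-down symbolic reasoning can be combined with bottom-up neural learning to support meaningful exchanges between human, artificial, and robotic agents. The analysis maps how well current solutions cover the three dimensions and identifies gaps and less developed aspects as directions for future research.

Link: https://arxiv.org/abs/2504.11200
Authors: Irene Celino,Mario Scrocca,Agnese Chiatti
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 26 pages, 13 figures, 1 table; pre-print version of book chapter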

Abstract:This chapter investigates the concept of mutual understanding between humans and systems, positing that Neuro-symbolic Artificial Intelligence (NeSy AI) methods can significantly enhance this mutual understanding by leveraging explicit symbolic knowledge representations with data-driven learning models. We start by introducing three critical dimensions to characterize mutual understanding: sharing knowledge, exchanging knowledge, and governing knowledge. Sharing knowledge involves aligning the conceptual models of different agents to enable a shared understanding of the domain of interest. Exchanging knowledge relates to ensuring the effective and accurate communication between agents. Governing knowledge concerns establishing rules and processes to regulate the interaction between agents. Then, we present several different use case scenarios that demonstrate the application of NeSy AI and Knowledge Graphs to aid meaningful exchanges between human, artificial, and robotic agents. These scenarios highlight both the potential and the challenges of combining top-down symbolic reasoning with bottom-up neural learning, guiding the discussion of the coverage provided by current solutions along the dimensions of sharing, exchanging, and governing knowledge. Concurrently, this analysis facilitates the identification of gaps and less developed aspects in mutual understanding to address in future research.

[AI-20] Efficient Distributed Retrieval-Augmented Generation for Enhancing Language Model Performance

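[Quick Read]: Small language models (SLMs) deploy efficiently on resource-constrained edge devices, but their limited capacity compromises inference performance. Existing retrieval-augmented generation (RAG) implementations are mostly centralized, so they cannot make good use of general knowledge in the cloud together with personal documents on the device, and they risk leaking private documents. The paper proposes DRAGON, a distributed RAG framework. It decomposes multi-document RAG into parallel token generation processes that run independently and locally on the cloud and the device, uses a dual-side Speculative Aggregation algorithm to avoid frequent output synchronization between the two, and adds a scheduling algorithm that picks the optimal aggregation side based on real-time network conditions. The result is better model performance, lower per-token latency, and negligible Time to First Token (TTFT) overhead.

Link: https://arxiv.org/abs/2504.11197
Authors: Shangyu Liu,Zhenzhe Zheng,Xiaoyao Huang,Fan Wu,Jie Wu
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Information Retrieval (cs.IR)
Comments: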

Abstract:Small language models (SLMs) support efficient deployments on resource-constrained edge devices, but their limited capacity compromises inference performance. Retrieval-augmented generation (RAG) is a promising solution to enhance model performance by integrating external databases, without requiring intensive on-device model retraining. However, large-scale public databases and user-specific private contextual documents are typically located on the cloud and the device separately, while existing RAG implementations are primarily centralized. To bridge this gap, we propose DRAGON, a distributed RAG framework to enhance on-device SLMs through both general and personal knowledge without the risk of leaking document privacy. Specifically, DRAGON decomposes multi-document RAG into multiple parallel token generation processes performed independently and locally on the cloud and the device, and employs a newly designed Speculative Aggregation, a dual-side speculative algorithm to avoid frequent output synchronization between the cloud and device. A new scheduling algorithm is further introduced to identify the optimal aggregation side based on real-time network conditions. Evaluations on a real-world hardware testbed demonstrate a significant performance improvement of DRAGON, with up to 1.9x greater gains over a standalone SLM compared to centralized RAG, a substantial reduction in per-token latency, and negligible Time to First Token (TTFT) overhead.

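[AI-21] Exploring Backdoor Attack and Defense for LLM-empowered Recommendations

[Quick Read]: This paper studies the security of LLM-based recommender systems (RecSys) against backdoor attacks. Specifically, it asks whether a backdoor with a specific trigger can be injected into an LLM-based RecSys so that recommendation responses are manipulated when the trigger is appended to an item's title. To investigate, the paper proposes an attack framework called BadRec, which perturbs item titles with triggers and uses fake users to interact with those items, thereby poisoning the training set and injecting the backdoor.

Two findings are central. First, BadRec implants backdoors successfully by poisoning only 1% of the training data. Second, to mitigate the threat, the paper proposes a universal defense strategy called Poison Scanner (P-Scanner). P-Scanner uses the language understanding and knowledge of LLMs to detect poisoned items, and a trigger augmentation agent generates diverse synthetic triggers that guide the scanner in learning domain-specific knowledge of the detection task. Experiments on three real-world datasets validate its effectiveness.

Link: https://arxiv.org/abs/2504.11182
Authors: Liangbo Ning,Wenqi Fan,Qing Li
Affiliation: unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: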

Abstract:The fusion of Large Language Models (LLMs) with recommender systems (RecSys) has dramatically advanced personalized recommendations and drawn extensive attention. Despite the impressive progress, the safety of LLM-based RecSys against backdoor attacks remains largely under-explored. In this paper, we raise a new problem: Can a backdoor with a specific trigger be injected into LLM-based Recsys, leading to the manipulation of the recommendation responses when the backdoor trigger is appended to an item’s title? To investigate the vulnerabilities of LLM-based RecSys under backdoor attacks, we propose a new attack framework termed Backdoor Injection Poisoning for RecSys (BadRec). BadRec perturbs the items’ titles with triggers and employs several fake users to interact with these items, effectively poisoning the training set and injecting backdoors into LLM-based RecSys. Comprehensive experiments reveal that poisoning just 1% of the training data with adversarial examples is sufficient to successfully implant backdoors, enabling manipulation of recommendations. To further mitigate such a security threat, we propose a universal defense strategy called Poison Scanner (P-Scanner). Specifically, we introduce an LLM-based poison scanner to detect the poisoned items by leveraging the powerful language understanding and rich knowledge of LLMs. A trigger augmentation agent is employed to generate diverse synthetic triggers to guide the poison scanner in learning domain-specific knowledge of the poisoned item detection task. Extensive experiments on three real-world datasets validate the effectiveness of the proposed P-Scanner.

[AI-22] Bypassing Prompt Injection and Jailbreak Detection in LLM Guardrails

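[Quick Read]: This paper shows that Large Language Model (LLM) guardrail systems, which are meant to protect against prompt injection and jailbreak attacks, remain vulnerable to evasion. It demonstrates two approaches for bypassing them: traditional character injection methods and algorithmic Adversarial Machine Learning (AML) evasion techniques. Tests against six prominent protection systems, including Microsoft's Azure Prompt Shield and Meta's Prompt Guard, show that both approaches evade detection while maintaining adversarial utility, in some instances with up to 100% evasion success. The paper also shows that adversaries can raise Attack Success Rates (ASR) against black-box targets by using word importance rankings computed by offline white-box models. The findings expose weaknesses in current LLM protection mechanisms and underline the need for more robust guardrail systems.

Link: https://arxiv.org/abs/2504.11168
Authors: William Hackett,Lewis Birch,Stefan Trawicki,Neeraj Suri,Peter Garraghan
Affiliation: unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 12 pages, 5 figures, 6 tables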

Abstract:Large Language Models (LLMs) guardrail systems are designed to protect against prompt injection and jailbreak attacks. However, they remain vulnerable to evasion techniques. We demonstrate two approaches for bypassing LLM prompt injection and jailbreak detection systems via traditional character injection methods and algorithmic Adversarial Machine Learning (AML) evasion techniques. Through testing against six prominent protection systems, including Microsoft’s Azure Prompt Shield and Meta’s Prompt Guard, we show that both methods can be used to evade detection while maintaining adversarial utility achieving in some instances up to 100% evasion success. Furthermore, we demonstrate that adversaries can enhance Attack Success Rates (ASR) against black-box targets by leveraging word importance ranking computed by offline white-box models. Our findings reveal vulnerabilities within current LLM protection mechanisms and highlight the need for more robust guardrail systems.

[AI-23] C-SHAP for time series: An approach to high-level temporal explanations

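[Quick Read]: Existing Explainable AI (XAI) methods for time series mostly produce point- or sequence-based attribution maps, which explain model reasoning in terms of low-level patterns and miss the high-level patterns that may also influence it. This paper proposes C-SHAP for time series, a concept-based method that explains a model in terms of high-level patterns by determining the contribution of each concept to the model outcome. It gives a general definition of C-SHAP and an example implementation based on time series decomposition, and demonstrates the method on a use case from the energy domain.

Link: https://arxiv.org/abs/2504.11159
Authors: Annemarie Jutte,Faizan Ahmed,Jeroen Linssen,Maurice van Keulen
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 10 pages, 6 figures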

Abstract:Time series are ubiquitous in domains such as energy forecasting, healthcare, and industry. Using AI systems, some tasks within these domains can be efficiently handled. Explainable AI (XAI) aims to increase the reliability of AI solutions by explaining model reasoning. For time series, many XAI methods provide point- or sequence-based attribution maps. These methods explain model reasoning in terms of low-level patterns. However, they do not capture high-level patterns that may also influence model reasoning. We propose a concept-based method to provide explanations in terms of these high-level patterns. In this paper, we present C-SHAP for time series, an approach which determines the contribution of concepts to a model outcome. We provide a general definition of C-SHAP and present an example implementation using time series decomposition. Additionally, we demonstrate the effectiveness of the methodology through a use case from the energy domain.
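
As a rough illustration of attributing a model output to concepts, the sketch below computes exact Shapley values over three decomposition components. It is a generic Shapley computation, not the paper's C-SHAP definition: the toy model, the component values, and the choice to represent an absent concept by leaving its component out of the sum are all assumptions made here.

```python
from itertools import combinations
from math import factorial

def shapley_values(players, value):
    """Exact Shapley value of each player for a set function `value`."""
    n = len(players)
    phi = {}
    for p in players:
        others = [q for q in players if q != p]
        contribution = 0.0
        for size in range(n):
            # Probability weight of a coalition of this size preceding p.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                coalition = frozenset(subset)
                contribution += weight * (value(coalition | {p}) - value(coalition))
        phi[p] = contribution
    return phi

# Hypothetical decomposition of a length-4 series into three concepts.
components = {
    "trend": [1.0, 2.0, 3.0, 4.0],
    "seasonal": [1.0, -1.0, 1.0, -1.0],
    "residual": [0.1, 0.0, -0.1, 0.0],
}

def toy_model(series):
    """Stand-in for a trained model: predicts the peak of the series."""
    return max(series)

def value(coalition):
    """Model output when only the concepts in `coalition` are present."""
    series = [sum(components[c][t] for c in coalition) for t in range(4)]
    return toy_model(series)

phi = shapley_values(list(components), value)
```

By the efficiency property, the three attributions sum to the difference between the model output on the full series and on the all-zero baseline. Exact enumeration is only practical for a handful of concepts because the number of coalitions grows exponentially.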

[AI-24] Divergence of Empirical Neural Tangent Kernel in Classification Problems

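Quick read: This paper asks whether, in classification problems, kernel logistic regression based on the Neural Tangent Kernel (NTK) can approximate fully connected neural networks (FCNs) and residual neural networks (ResNets). The key result is a proof that as training time tends to infinity (i.e., under overtraining), the network parameters diverge under the cross-entropy loss and the empirical NTK diverges from the NTK on the training samples. Specifically, the authors prove that the NTKs of multi-layer FCNs and ResNets are strictly positive definite and analyze the divergence of the network parameters during training, which contrasts sharply with the "lazy training" regime commonly observed in regression problems. Finally, by contradiction, the paper shows that the empirical NTK does not converge uniformly over all times to the NTK on the training samples as the network width increases, and concludes that NTK theory is not applicable in this setting.

Link: https://arxiv.org/abs/2504.11130
Authors: Zixiong Yu,Songtao Tian,Guhan Chen
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments: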

Abstract:This paper demonstrates that in classification problems, fully connected neural networks (FCNs) and residual neural networks (ResNets) cannot be approximated by kernel logistic regression based on the Neural Tangent Kernel (NTK) under overtraining (i.e., when training time approaches infinity). Specifically, when using the cross-entropy loss, regardless of how large the network width is (as long as it is finite), the empirical NTK diverges from the NTK on the training samples as training time increases. To establish this result, we first demonstrate the strict positive definiteness of the NTKs for multi-layer FCNs and ResNets. Then, we prove that during training, the neural network parameters diverge if the smallest eigenvalue of the empirical NTK matrix (Gram matrix) with respect to training samples is bounded below by a positive constant. This behavior contrasts sharply with the lazy training regime commonly observed in regression problems. Consequently, using a proof by contradiction, we show that the empirical NTK does not uniformly converge to the NTK across all times on the training samples as the network width increases. We validate our theoretical results through experiments on both synthetic data and the MNIST classification task. This finding implies that NTK theory is not applicable in this context, with significant theoretical implications for understanding neural networks in classification problems.
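
To make the "empirical NTK matrix (Gram matrix)" concrete, the sketch below computes it for a small two-layer ReLU network as the inner product of parameter gradients at pairs of inputs. The two-layer architecture, the 1/sqrt(m) scaling, and the function names are assumptions chosen for brevity; the paper itself treats multi-layer FCNs and ResNets.

```python
import math
import random


def init_params(width, dim, rng):
    """Random first-layer weights w (width x dim) and output weights a (width)."""
    w = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(width)]
    a = [rng.gauss(0, 1) for _ in range(width)]
    return w, a


def param_gradient(x, w, a):
    """Gradient of f(x) = (1/sqrt(m)) * sum_j a_j * relu(w_j . x) w.r.t. all parameters."""
    m = len(a)
    scale = 1.0 / math.sqrt(m)
    grad = []
    for w_j, a_j in zip(w, a):
        pre = sum(wi * xi for wi, xi in zip(w_j, x))
        active = 1.0 if pre > 0 else 0.0
        grad.append(scale * max(pre, 0.0))                   # df / da_j
        grad.extend(scale * a_j * active * xi for xi in x)   # df / dw_j
    return grad


def empirical_ntk(xs, w, a):
    """Gram matrix K[i][j] = <grad f(x_i), grad f(x_j)> at the current parameters."""
    grads = [param_gradient(x, w, a) for x in xs]
    return [[sum(gi * gj for gi, gj in zip(g1, g2)) for g2 in grads] for g1 in grads]


rng = random.Random(0)
w, a = init_params(width=16, dim=3, rng=rng)
xs = [[1.0, 0.0, 0.5], [0.2, -1.0, 0.3], [0.0, 0.4, -0.7]]
K = empirical_ntk(xs, w, a)
```

Recomputing `K` after each training step and comparing it with its value at initialization is one way to observe how far the kernel drifts during training.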

[AI-25] Emergence of Goal-Directed Behaviors via Active Inference with Self-Prior

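Quick read: This paper addresses how goal-directed behavior can be generated through intrinsic motivation. Whereas existing work mainly studies how exploration helps acquire external rewards, the paper proposes a novel density model, the "self-prior", that represents the agent's own multimodal sensory experiences. The key is integrating the self-prior into an active inference framework based on the free energy principle, so that behavioral references are generated purely from an intrinsic process: goal-directed behavior is induced autonomously by minimizing the mismatch between average past sensory experiences and current observations. This mechanism is analogous to acquiring and using a body schema through continuous interaction with the environment. Experiments show that the agent spontaneously reaches toward a tactile stimulus in a simulated environment, supporting the claim that intrinsically motivated behavior shaped by the agent's own sensory experiences can give rise to the intentional behavior seen in early development.

Link: https://arxiv.org/abs/2504.11075
Authors: Dongmin Kim,Hoshinori Kanazawa,Naoto Yoshida,Yasuo Kuniyoshi
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: 20 pages, Code is available at this https URL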

Abstract:Infants often exhibit goal-directed behaviors, such as reaching for a sensory stimulus, even when no external reward criterion is provided. These intrinsically motivated behaviors facilitate spontaneous exploration and learning of the body and environment during early developmental stages. Although computational modeling can offer insight into the mechanisms underlying such behaviors, many existing studies on intrinsic motivation focus primarily on how exploration contributes to acquiring external rewards. In this paper, we propose a novel density model for an agent’s own multimodal sensory experiences, called the “self-prior,” and investigate whether it can autonomously induce goal-directed behavior. Integrated within an active inference framework based on the free energy principle, the self-prior generates behavioral references purely from an intrinsic process that minimizes mismatches between average past sensory experiences and current observations. This mechanism is also analogous to the acquisition and utilization of a body schema through continuous interaction with the environment. We examine this approach in a simulated environment and confirm that the agent spontaneously reaches toward a tactile stimulus. Our study implements intrinsically motivated behavior shaped by the agent’s own sensory experiences, demonstrating the spontaneous emergence of intentional behavior during early development.

[AI-26] Dynamical errors in machine learning forecasts

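Quick read: This paper addresses the fact that machine learning forecasts in scientific and engineering applications are rarely assessed for physical and dynamical consistency. Standard error metrics such as MAE and MSE only quantify the discrepancy between predictions and target values and do not directly evaluate the dynamical fidelity of a forecast. The key idea is to use two recently developed dynamical indices, the instantaneous dimension (d) and the inverse persistence (θ). By analyzing how these indices relate to forecast error, the authors show that large errors tend to occur in states of higher complexity and lower persistence. Building on this, they propose new error metrics based on the dynamical indices to assess the dynamical consistency of ML forecasts, and apply them to several canonical datasets and a real-world weather forecasting task. The approach identifies dynamical distortions in long-lead-time or recursive forecasts, providing guidance for improving ML models.

Link: https://arxiv.org/abs/2504.11074
Authors: Zhou Fang,Gianmarco Mengaldo
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Physics (physics.comp-ph)
Comments: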

Abstract:In machine learning forecasting, standard error metrics such as mean absolute error (MAE) and mean squared error (MSE) quantify discrepancies between predictions and target values. However, these metrics do not directly evaluate the physical and/or dynamical consistency of forecasts, an increasingly critical concern in scientific and engineering applications. Indeed, a fundamental yet often overlooked question is whether machine learning forecasts preserve the dynamical behavior of the underlying system. Addressing this issue is essential for assessing the fidelity of machine learning models and identifying potential failure modes, particularly in applications where maintaining correct dynamical behavior is crucial. In this work, we investigate the relationship between standard forecasting error metrics, such as MAE and MSE, and the dynamical properties of the underlying system. To achieve this goal, we use two recently developed dynamical indices: the instantaneous dimension (d), and the inverse persistence (θ). Our results indicate that larger forecast errors (e.g., higher MSE) tend to occur in states with higher d (higher complexity) and higher θ (lower persistence). To further assess dynamical consistency, we propose error metrics based on the dynamical indices that measure the discrepancy of the forecasted d and θ versus their correct values. Leveraging these dynamical indices-based metrics, we analyze direct and recursive forecasting strategies for three canonical datasets (Lorenz, Kuramoto-Sivashinsky equation, and Kolmogorov flow) as well as a real-world weather forecasting task. Our findings reveal substantial distortions in dynamical properties in ML forecasts, especially for long forecast lead times or long recursive simulations, providing complementary information on ML forecast fidelity that can be used to improve ML models.
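
As a rough illustration of the first index, the sketch below estimates an instantaneous (local) dimension around a reference state from the distances to all other states. It assumes the exceedances of -log(distance) above a high quantile follow an exponential tail, so that d is the reciprocal of the mean exceedance; this estimator, the 0.98 quantile, and the function name are assumptions for the example, and the inverse persistence θ is not covered.

```python
import math
import random


def instantaneous_dimension(reference, states, quantile=0.98):
    """Estimate the local dimension around `reference` from nearby recurrences."""
    # g is large when a state is close to the reference point.
    g = sorted(-math.log(math.dist(reference, s)) for s in states if s != reference)
    threshold = g[int(quantile * len(g))]
    exceedances = [v - threshold for v in g if v > threshold]
    # Exponential-tail assumption: d = 1 / (mean exceedance).
    return 1.0 / (sum(exceedances) / len(exceedances))


# Sanity check on points spread uniformly over a 2-D square.
rng = random.Random(0)
states = [(rng.random(), rng.random()) for _ in range(20000)]
d_hat = instantaneous_dimension((0.5, 0.5), states)
```

For points filling a two-dimensional region uniformly, the estimate should land near 2. Applying the same function to true and forecast trajectories and comparing the resulting values is the kind of discrepancy the proposed metrics measure.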

[AI-27] Neural Control Barrier Functions from Physics Informed Neural Networks

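Quick read: This paper tackles the challenge of manually designing application-specific Control Barrier Functions (CBFs) and explores synthesizing them with deep learning as neural CBFs. It proposes a novel class of neural CBFs that uses a physics-inspired neural network framework incorporating Zubov's Partial Differential Equation (PDE) in the context of safety, giving a scalable way to synthesize neural CBFs for high-dimensional systems. The key is the use of reciprocal CBFs rather than zeroing CBFs, which allows flexible, user-defined safe regions to be specified. The approach is validated in case studies on three systems: an inverted pendulum, autonomous ground navigation, and aerial navigation in obstacle-laden environments.

Link: https://arxiv.org/abs/2504.11045
Authors: Shreenabh Agrawal,Manan Tayal,Aditya Singh,Shishir Kolathaya
Affiliations: Unknown
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
Comments: 8 pages, 5 figures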

Abstract:As autonomous systems become increasingly prevalent in daily life, ensuring their safety is paramount. Control Barrier Functions (CBFs) have emerged as an effective tool for guaranteeing safety; however, manually designing them for specific applications remains a significant challenge. With the advent of deep learning techniques, recent research has explored synthesizing CBFs using neural networks, commonly referred to as neural CBFs. This paper introduces a novel class of neural CBFs that leverages a physics-inspired neural network framework by incorporating Zubov’s Partial Differential Equation (PDE) within the context of safety. This approach provides a scalable methodology for synthesizing neural CBFs applicable to high-dimensional systems. Furthermore, by utilizing reciprocal CBFs instead of zeroing CBFs, the proposed framework allows for the specification of flexible, user-defined safe regions. To validate the effectiveness of the approach, we present case studies on three different systems: an inverted pendulum, autonomous ground navigation, and aerial navigation in obstacle-laden environments.

[AI-28] “Even explanations will not help in trusting [this] fundamentally biased system”: A Predictive Policing Case-Study

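Quick read: This paper examines how users can establish appropriate trust in AI systems in high-risk domains, using AI-based predictive policing as the case. It studies how different explanation types (text, visual, and hybrid) and user expertise (retired police officers versus lay users) affect the establishment of trust. The study finds that hybrid explanations increased expert users' subjective trust but did not lead to better decision-making, and that no form of explanation helped users build appropriate trust. The key takeaway is the need to re-evaluate the role of explanations in building appropriate trust; the paper also synthesizes potential challenges and policy recommendations for designing for appropriate trust in high-risk AI systems.

Link: https://arxiv.org/abs/2504.11020
Authors: Siddharth Mehrotra,Ujwal Gadiraju,Eva Bittner,Folkert van Delden,Catholijn M. Jonker,Myrthe L. Tielman
Affiliations: Unknown
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments: 33rd ACM Conference on User Modeling, Adaptation and Personalization (UMAP '25), June 16–19, 2025, New York City, NY, USA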

Abstract:In today’s society, where Artificial Intelligence (AI) has gained a vital role, concerns regarding users’ trust have garnered significant attention. The use of AI systems in high-risk domains has often led users either to under-trust them, potentially causing inadequate reliance, or to over-trust them, resulting in over-compliance. Therefore, users must maintain an appropriate level of trust. Past research has indicated that explanations provided by AI systems can enhance user understanding of when to trust or not trust the system. However, the utility of presenting different forms of explanation remains to be explored, especially in high-risk domains. Therefore, this study explores the impact of different explanation types (text, visual, and hybrid) and user expertise (retired police officers and lay users) on establishing appropriate trust in AI-based predictive policing. While we observed that the hybrid form of explanations increased the subjective trust in AI for expert users, it did not lead to better decision-making. Furthermore, no form of explanation helped build appropriate trust. The findings of our study emphasize the importance of re-evaluating the use of explanations to build [appropriate] trust in AI-based systems, especially when the system’s use is questionable. Finally, we synthesize potential challenges and policy recommendations based on our results to design for appropriate trust in high-risk AI-based systems.

[AI-29] Document Quality Scoring for Web Crawling

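Quick read: This paper addresses the resources wasted on the large amount of low-quality content on the internet during search engine retrieval and crawling, and the negative effects that follow. The key solution is to use neural estimators of semantic quality as an efficient quality scoring technique for web pages. Building on the work of Chang et al., who used neural semantic quality for static index pruning, the authors apply neural quality scorers to crawl prioritisation so that semantically high-quality pages are processed first. The experimental analysis shows that this can improve downstream search effectiveness. The software contribution is a Docker container that computes an effective quality score for a given web page, making it easy to integrate into other components of web search systems.

Link: https://arxiv.org/abs/2504.11011
Authors: Francesca Pezzuti,Ariane Mueller,Sean MacAvaney,Nicola Tonellotto
Affiliations: Unknown
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments: Presented at WOWS2025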

Abstract:The internet contains large amounts of low-quality content, yet users expect web search engines to deliver high-quality, relevant results. The abundant presence of low-quality pages can negatively impact retrieval and crawling processes by wasting resources on these documents. Therefore, search engines can greatly benefit from techniques that leverage efficient quality estimation methods to mitigate these negative impacts. Quality scoring methods for web pages are useful for many processes typical for web search systems, including static index pruning, index tiering, and crawling. Building on work by Chang et al. [chang2024neural], who proposed using neural estimators of semantic quality for static index pruning, we extend their approach and apply their neural quality scorers to assess the semantic quality of web pages in crawling prioritisation tasks. In our experimental analysis, we found that prioritising semantically high-quality pages over low-quality ones can improve downstream search effectiveness. Our software contribution consists of a Docker container that computes an effective quality score for a given web page, allowing the quality scorer to be easily included and used in other components of web search systems.
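
Crawl prioritisation by quality can be sketched as a priority queue over the crawl frontier. The example below is an illustration under stated assumptions: `QualityFrontier` and `toy_scorer` are hypothetical names, the scorer is a trivial lexical-diversity heuristic standing in for a neural quality estimator, and each discovered URL is prioritised by the score of the page that linked to it.

```python
import heapq
import itertools


class QualityFrontier:
    """Crawl frontier that pops the URL with the highest quality score first."""

    def __init__(self, scorer):
        self._scorer = scorer
        self._heap = []
        self._seen = set()
        self._counter = itertools.count()  # tie-breaker: keeps insertion order

    def add(self, url, source_text):
        """Queue `url`, scored by the text of the page where it was found."""
        if url in self._seen:
            return
        self._seen.add(url)
        score = self._scorer(source_text)
        # heapq is a min-heap, so store the negated score.
        heapq.heappush(self._heap, (-score, next(self._counter), url))

    def pop(self):
        neg_score, _, url = heapq.heappop(self._heap)
        return url, -neg_score

    def __len__(self):
        return len(self._heap)


def toy_scorer(text):
    """Placeholder quality score: fraction of distinct words in the text."""
    words = text.split()
    return len(set(words)) / len(words) if words else 0.0


frontier = QualityFrontier(toy_scorer)
frontier.add("https://example.org/spam", "buy buy buy buy")
frontier.add("https://example.org/paper", "a study of neural ranking")
first_url, first_score = frontier.pop()
```

Swapping `toy_scorer` for a call to a real quality-scoring service would leave the frontier logic unchanged, which is the kind of integration a containerised scorer is meant to allow.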

[AI-30] ProtFlow: Fast Protein Sequence Design via Flow Matching on Compressed Protein Language Model Embeddings

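Quick read: This paper addresses the design of protein sequences with desired functions, a fundamental task in protein engineering. Existing deep generative methods, such as autoregressive and diffusion models, have accelerated the discovery of novel protein sequences but focus on local or shallow residual semantics and suffer from low inference efficiency, a large modeling space, and high training cost. To address these challenges, the paper proposes ProtFlow, a fast flow matching-based protein sequence design framework that operates on embeddings from the semantically meaningful latent space of protein language models. By compressing and smoothing the latent space, ProtFlow improves performance under limited computational resources. The key is the use of reflow techniques to achieve high-quality single-step sequence generation, together with a joint design pipeline for multichain proteins. Experiments show that ProtFlow outperforms task-specific methods across a range of protein design tasks, indicating its potential and broad applicability in computational protein sequence design and analysis.

Link: https://arxiv.org/abs/2504.10983
Authors: Zitai Kong,Yiheng Zhu,Yinlong Xu,Hanjing Zhou,Mingzhe Yin,Jialu Wu,Hongxia Xu,Chang-Yu Hsieh,Tingjun Hou,Jian Wu
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Biomolecules (q-bio.BM)
Comments: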

Abstract:The design of protein sequences with desired functionalities is a fundamental task in protein engineering. Deep generative methods, such as autoregressive models and diffusion models, have greatly accelerated the discovery of novel protein sequences. However, these methods mainly focus on local or shallow residual semantics and suffer from low inference efficiency, large modeling space and high training cost. To address these challenges, we introduce ProtFlow, a fast flow matching-based protein sequence design framework that operates on embeddings derived from semantically meaningful latent space of protein language models. By compressing and smoothing the latent space, ProtFlow enhances performance while training on limited computational resources. Leveraging reflow techniques, ProtFlow enables high-quality single-step sequence generation. Additionally, we develop a joint design pipeline for the design scene of multichain proteins. We evaluate ProtFlow across diverse protein design tasks, including general peptides and long-chain proteins, antimicrobial peptides, and antibodies. Experimental results demonstrate that ProtFlow outperforms task-specific methods in these applications, underscoring its potential and broad applicability in computational protein sequence design and analysis.
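
The flow matching objective and single-step generation mentioned above can be sketched on plain vectors. This is a generic illustration with a straight-line probability path, not ProtFlow's code: the function names are hypothetical, the vectors stand in for compressed language-model embeddings, and the "oracle" velocity field replaces a trained network.

```python
def interpolate(x0, x1, t):
    """Point on the straight path from noise x0 (t=0) to data x1 (t=1)."""
    return [(1 - t) * a + t * b for a, b in zip(x0, x1)]


def target_velocity(x0, x1):
    """Velocity of the straight path, constant in t."""
    return [b - a for a, b in zip(x0, x1)]


def flow_matching_loss(velocity_fn, x0, x1, t):
    """Mean squared error between predicted and target velocity at time t."""
    x_t = interpolate(x0, x1, t)
    pred = velocity_fn(x_t, t)
    target = target_velocity(x0, x1)
    return sum((p - q) ** 2 for p, q in zip(pred, target)) / len(target)


def one_step_generate(velocity_fn, x0):
    """Single Euler step from t=0 to t=1; accurate when paths are straight."""
    v = velocity_fn(x0, 0.0)
    return [a + b for a, b in zip(x0, v)]


# Toy check with one data point and an oracle velocity field that always
# points from the current position to that data point.
x0 = [0.5, -1.0, 2.0]   # noise sample
x1 = [1.0, 0.0, -1.0]   # stand-in for a latent embedding


def oracle(x, t):
    return [(b - a) / (1 - t) for a, b in zip(x, x1)]


loss = flow_matching_loss(oracle, x0, x1, t=0.3)
sample = one_step_generate(oracle, x0)
```

Reflow aims to straighten the learned paths so that a single Euler step like `one_step_generate` loses little accuracy compared with multi-step integration.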

[AI-31] Evaluating Trust in AI, Human, and Co-produced Feedback Among Undergraduate Students

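Quick read: This paper examines the use of generative AI in educational feedback, focusing on how students' trust differs across feedback providers (AI-generated, human-created, and human-AI co-produced feedback) and what factors influence it. Through an empirical study it characterizes students' perception of feedback quality, their ability to distinguish feedback sources, and their potential biases, to inform how higher education institutions can adapt feedback practices. In a within-subject experiment with 91 undergraduates, students generally rated AI and co-produced feedback as more useful and objective than human feedback. Only AI feedback lost perceived genuineness once sources were revealed, while co-produced feedback retained its positive perception. Educational AI experience improved students' ability to identify AI feedback and increased their trust in all feedback types, whereas general AI experience decreased perceived usefulness and credibility. Male students rated all feedback types lower than female and non-binary students. These findings inform evidence-based guidelines for integrating AI into higher education feedback systems while addressing trust concerns and fostering students' AI literacy.

Link: https://arxiv.org/abs/2504.10961
Authors: Audrey Zhang,Yifei Gao,Wannapon Suraworachet,Tanya Nazaretsky,Mutlu Cukurova
Affiliations: Unknown
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments: 35 pages, 6 figures. Under review at Assessment and Evaluation in Higher Education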

Abstract:As generative AI transforms educational feedback practices, understanding students’ perceptions of different feedback providers becomes crucial for effective implementation. This study addresses a critical gap by comparing undergraduate students’ trust in AI-generated, human-created, and human-AI co-produced feedback, informing how institutions can adapt feedback practices in this new era. Through a within-subject experiment with 91 participants, we investigated factors predicting students’ ability to distinguish between feedback types, perception of feedback quality, and potential biases to AI involvement. Findings revealed that students generally preferred AI and co-produced feedback over human feedback in terms of perceived usefulness and objectivity. Only AI feedback suffered a decline in perceived genuineness when feedback sources were revealed, while co-produced feedback maintained its positive perception. Educational AI experience improved students’ ability to identify AI feedback and increased their trust in all feedback types, while general AI experience decreased perceived usefulness and credibility. Male students consistently rated all feedback types as less valuable than their female and non-binary counterparts. These insights inform evidence-based guidelines for integrating AI into higher education feedback systems while addressing trust concerns and fostering AI literacy among students.

[AI-32] BEACON: A Benchmark for Efficient and Accurate Counting of Subgraphs

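Quick read: This paper addresses the lack of a unified evaluation framework, standardized datasets, and verifiable ground truths in subgraph counting, which hinders systematic analysis and fair benchmarking. It introduces BEACON, a comprehensive benchmark for rigorously evaluating both algorithmic (AL) and machine learning (ML) subgraph counting methods. BEACON provides a standardized dataset with verified ground truths, an integrated evaluation environment, and a public leaderboard, enabling reproducible and transparent comparisons across methods. Using BEACON, the paper shows that AL methods count subgraphs efficiently on very large graphs but struggle with complex patterns, while ML methods can handle larger patterns but often have lower accuracy on small, dense graphs. The core contribution is thus a unified benchmark framework that supports a deeper understanding of the trade-offs between the algorithmic and machine learning paradigms.

Link: https://arxiv.org/abs/2504.10948
Authors: Mohammad Matin Najafi,Xianju Zhu,Chrysanthi Kosyfaki,Laks V.S. Lakshmanan,Reynold Cheng
Affiliations: Unknown
Subjects: Data Structures and Algorithms (cs.DS); Artificial Intelligence (cs.AI); Databases (cs.DB); Social and Information Networks (cs.SI)
Comments: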

Abstract:Subgraph counting, the task of determining the number of instances of a query pattern within a large graph, lies at the heart of many critical applications, from analyzing financial networks and transportation systems to understanding biological interactions. Despite decades of work yielding efficient algorithmic (AL) solutions and, more recently, machine learning (ML) approaches, a clear comparative understanding is elusive. This gap stems from the absence of a unified evaluation framework, standardized datasets, and accessible ground truths, all of which hinder systematic analysis and fair benchmarking. To overcome these barriers, we introduce BEACON: a comprehensive benchmark designed to rigorously evaluate both AL and ML-based subgraph counting methods. BEACON provides a standardized dataset with verified ground truths, an integrated evaluation environment, and a public leaderboard, enabling reproducible and transparent comparisons across diverse approaches. Our extensive experiments reveal that while AL methods excel in efficiently counting subgraphs on very large graphs, they struggle with complex patterns (e.g., those exceeding six nodes). In contrast, ML methods are capable of handling larger patterns but demand massive graph data inputs and often yield suboptimal accuracy on small, dense graphs. These insights not only highlight the unique strengths and limitations of each approach but also pave the way for future advancements in subgraph counting techniques. Overall, BEACON represents a significant step towards unifying and accelerating research in subgraph counting, encouraging innovative solutions and fostering a deeper understanding of the trade-offs between algorithmic and machine learning paradigms.
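
For readers new to the task, the simplest instance of subgraph counting is counting triangles in an undirected graph. The sketch below is a small exact counter written for illustration only; it is unrelated to the methods benchmarked in BEACON, and `count_triangles` is a hypothetical name.

```python
def count_triangles(edges):
    """Count triangles in an undirected simple graph given as (u, v) pairs."""
    adj = {}
    for u, v in edges:
        if u == v:
            continue  # ignore self-loops
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    count = 0
    for u in adj:
        for v in adj[u]:
            if u < v:
                # Only count common neighbours w > v, so each triangle
                # u < v < w is counted exactly once.
                count += sum(1 for w in adj[u] & adj[v] if w > v)
    return count


# Complete graph on 4 nodes and a simple path on 4 nodes.
k4 = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]
path = [(0, 1), (1, 2), (2, 3)]
k4_triangles = count_triangles(k4)
path_triangles = count_triangles(path)
```

Exact enumeration like this is the kind of computation that becomes expensive as the query pattern grows, which is where the trade-off between algorithmic and learned estimators arises.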

[AI-33] Can LLMs Leverage Observational Data? Towards Data-Driven Causal Discovery with LLMs

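Quick read: Traditional causal discovery relies on large datasets and assumptions about the underlying causal structure. This paper explores the potential of Large Language Models (LLMs) for causal discovery from observational data. The key idea is to integrate observational data directly into LLM-based reasoning and to evaluate, through two prompting strategies (pairwise prompting and breadth-first search (BFS) prompting), whether LLMs can use such data to infer causal relationships. Experiments show that both approaches improve causal discovery, with F1 scores on benchmark datasets up to 0.52 points higher than traditional statistical causal discovery baselines, illustrating both the potential and the limitations of LLMs for data-driven causal discovery.

Link: https://arxiv.org/abs/2504.10936
Authors: Yuni Susanti,Michael Färber
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: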

Abstract:Causal discovery traditionally relies on statistical methods applied to observational data, often requiring large datasets and assumptions about underlying causal structures. Recent advancements in Large Language Models (LLMs) have introduced new possibilities for causal discovery by providing domain expert knowledge. However, it remains unclear whether LLMs can effectively process observational data for causal discovery. In this work, we explore the potential of LLMs for data-driven causal discovery by integrating observational data for LLM-based reasoning. Specifically, we examine whether LLMs can effectively utilize observational data through two prompting strategies: pairwise prompting and breadth first search (BFS)-based prompting. In both approaches, we incorporate the observational data directly into the prompt to assess LLMs’ ability to infer causal relationships from such data. Experiments on benchmark datasets show that incorporating observational data enhances causal discovery, boosting F1 scores by up to 0.11 points using both pairwise and BFS LLM-based prompting, while outperforming traditional statistical causal discovery baseline by up to 0.52 points. Our findings highlight the potential and limitations of LLMs for data-driven causal discovery, demonstrating their ability to move beyond textual metadata and effectively interpret and utilize observational data for more informed causal reasoning. Our study lays the groundwork for future advancements toward fully LLM-driven causal discovery.
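
Pairwise prompting with observational data placed directly in the prompt might look roughly like the sketch below. The prompt wording, the answer format, the helper names, and the toy data are all invented for illustration; the paper's actual prompts are not reproduced here, and no LLM is called.

```python
def format_table(columns, rows, max_rows=5):
    """Render the first `max_rows` observations as comma-separated text."""
    lines = [", ".join(columns)]
    lines += [", ".join(str(v) for v in row) for row in rows[:max_rows]]
    return "\n".join(lines)


def pairwise_prompt(var_a, var_b, columns, rows):
    """Build one prompt asking about the causal direction between two variables."""
    ia, ib = columns.index(var_a), columns.index(var_b)
    pair_rows = [(row[ia], row[ib]) for row in rows]
    table = format_table([var_a, var_b], pair_rows)
    return (
        f"Observational data:\n{table}\n\n"
        f"Question: which statement is most plausible?\n"
        f"(A) {var_a} causes {var_b}\n"
        f"(B) {var_b} causes {var_a}\n"
        f"(C) no direct causal relationship\n"
        f"Answer with A, B, or C."
    )


# Made-up toy observations, one tuple per sample.
columns = ["rain", "sprinkler", "wet_grass"]
rows = [(1, 0, 1), (0, 1, 1), (0, 0, 0), (1, 0, 1)]
prompt = pairwise_prompt("rain", "wet_grass", columns, rows)
```

A full pairwise run would build one such prompt per variable pair and assemble the answers into a graph; a BFS-style variant would instead ask, for each visited node, which variables it directly affects.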

[AI-34] Transfer Learning for Temporal Link Prediction

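Quick read: This paper addresses the limitations of conventional temporal link prediction (TLP) models in transfer learning settings. Existing methods rely on memory modules that store information about nodes seen during training, so they cannot be applied directly to entirely new graphs at test time or deployment. To address this, the paper studies a new transfer learning task and develops transfer-effective methods for memory-laden models. The key innovation is a structural mapping module that learns a mapping from graph structural (topological) features to memory embeddings, enabling adaptation to new graphs and paving the way for a memory-free foundation model for TLP.

Link: https://arxiv.org/abs/2504.10925
Authors: Ayan Chatterjee,Barbara Ikica,Babak Ravandi,John Palowitch
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 14 pages, 7 figures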

Abstract:Link prediction on graphs has applications spanning from recommender systems to drug discovery. Temporal link prediction (TLP) refers to predicting future links in a temporally evolving graph and adds additional complexity related to the dynamic nature of graphs. State-of-the-art TLP models incorporate memory modules alongside graph neural networks to learn both the temporal mechanisms of incoming nodes and the evolving graph topology. However, memory modules only store information about nodes seen at train time, and hence such models cannot be directly transferred to entirely new graphs at test time and deployment. In this work, we study a new transfer learning task for temporal link prediction, and develop transfer-effective methods for memory-laden models. Specifically, motivated by work showing the informativeness of structural signals for the TLP task, we augment a structural mapping module to the existing TLP model architectures, which learns a mapping from graph structural (topological) features to memory embeddings. Our work paves the way for a memory-free foundation model for TLP.

[AI-35] Towards A Universal Graph Structural Encoder

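Quick read: This paper addresses the challenge of capturing and transferring structural information across different graph domains, which stems mainly from inherent differences in topological patterns across contexts. In addition, most existing models struggle to capture complex, rich graph structures, leading to inadequate exploration of the embedding space. The key solution is GFSE, a universal graph structural encoder that captures transferable structural patterns across domains such as molecular graphs, social networks, and citation networks. GFSE is the first cross-domain graph structural encoder pre-trained with multiple self-supervised learning objectives. It is built on a Graph Transformer with attention mechanisms informed by graph inductive bias, allowing it to encode intricate multi-level and fine-grained topological features. GFSE thereby produces generic and theoretically expressive positional and structural encodings that can be integrated with various downstream graph feature encoders, significantly improving model performance and reducing the need for task-specific fine-tuning.

Link: https://arxiv.org/abs/2504.10917
Authors: Jialin Chen,Haolan Zuo,Haoyu Peter Wang,Siqi Miao,Pan Li,Rex Ying
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: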

Abstract:Recent advancements in large-scale pre-training have shown the potential to learn generalizable representations for downstream tasks. In the graph domain, however, capturing and transferring structural information across different graph domains remains challenging, primarily due to the inherent differences in topological patterns across various contexts. Additionally, most existing models struggle to capture the complexity of rich graph structures, leading to inadequate exploration of the embedding space. To address these challenges, we propose GFSE, a universal graph structural encoder designed to capture transferable structural patterns across diverse domains such as molecular graphs, social networks, and citation networks. GFSE is the first cross-domain graph structural encoder pre-trained with multiple self-supervised learning objectives. Built on a Graph Transformer, GFSE incorporates attention mechanisms informed by graph inductive bias, enabling it to encode intricate multi-level and fine-grained topological features. The pre-trained GFSE produces generic and theoretically expressive positional and structural encoding for graphs, which can be seamlessly integrated with various downstream graph feature encoders, including graph neural networks for vectorized features and Large Language Models for text-attributed graphs. Comprehensive experiments on synthetic and real-world datasets demonstrate GFSE’s capability to significantly enhance the model’s performance while requiring substantially less task-specific fine-tuning. Notably, GFSE achieves state-of-the-art performance in 81.6% of evaluated cases, spanning diverse graph models and datasets, highlighting its potential as a powerful and versatile encoder for graph-structured data.

[AI-36] LOKA Protocol: A Decentralized Framework for Trustworthy and Ethical AI Agent Ecosystems

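Quick read: This paper addresses the challenges raised by autonomous AI agents that can perceive, reason, and act independently. As such agents spread beyond centralized infrastructure, three foundational problems arise: identity, accountability, and ethical consensus. The paper proposes the LOKA Protocol (Layered Orchestration for Knowledgeful Agents), a unified, systems-level architecture for building ethically governed, interoperable AI agent ecosystems. Its core components are a Universal Agent Identity Layer (UAIL) for decentralized, verifiable identity; intent-centric communication protocols for semantic coordination across heterogeneous agents; and a Decentralized Ethical Consensus Protocol (DECP) that lets agents make context-aware decisions grounded in shared ethical baselines. By building on emerging standards such as Decentralized Identifiers (DIDs), Verifiable Credentials (VCs), and post-quantum cryptography, LOKA offers a scalable, future-resilient blueprint for governing multi-agent AI systems and a foundation for responsible, transparent, and autonomous AI ecosystems.

Link: https://arxiv.org/abs/2504.10915
Authors: Rajesh Ranjan,Shailja Gupta,Surya Narayan Singh
Affiliations: Unknown
Subjects: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments: 4 Figures, 1 Table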

Abstract:The rise of autonomous AI agents, capable of perceiving, reasoning, and acting independently, signals a profound shift in how digital ecosystems operate, govern, and evolve. As these agents proliferate beyond centralized infrastructures, they expose foundational gaps in identity, accountability, and ethical alignment. Three critical questions emerge: Identity: Who or what is the agent? Accountability: Can its actions be verified, audited, and trusted? Ethical Consensus: Can autonomous systems reliably align with human values and prevent harmful emergent behaviors? We present the novel LOKA Protocol (Layered Orchestration for Knowledgeful Agents), a unified, systems-level architecture for building ethically governed, interoperable AI agent ecosystems. LOKA introduces a proposed Universal Agent Identity Layer (UAIL) for decentralized, verifiable identity; intent-centric communication protocols for semantic coordination across diverse agents; and a Decentralized Ethical Consensus Protocol (DECP) that enables agents to make context-aware decisions grounded in shared ethical baselines. Anchored in emerging standards such as Decentralized Identifiers (DIDs), Verifiable Credentials (VCs), and post-quantum cryptography, LOKA offers a scalable, future-resilient blueprint for multi-agent AI governance. By embedding identity, trust, and ethics into the protocol layer itself, LOKA establishes the foundation for a new era of responsible, transparent, and autonomous AI ecosystems operating across digital and physical domains.

[AI-37] Bridging Distribution Gaps in Time Series Foundation Model Pretraining with Prototype-Guided Normalization

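Quick read: This paper addresses the problems caused by substantial mismatches in data distributions when pretraining on large, heterogeneous datasets, which are especially pronounced for time series. It proposes a domain-aware adaptive normalization strategy whose key element is replacing the traditional LayerNorm with a prototype-guided dynamic normalization mechanism (ProtoNorm). ProtoNorm uses learned prototypes to capture distinct data distributions and uses sample-to-prototype affinity to choose the appropriate normalization, capturing the heterogeneity of time series and aligning pretrained representations with downstream tasks. Experiments show that the method significantly outperforms conventional pretraining techniques on classification and forecasting tasks and mitigates distribution shift during pretraining.

Link: https://arxiv.org/abs/2504.10900
Authors: Peiliang Gong,Emadeldeen Eldele,Min Wu,Zhenghua Chen,Xiaoli Li,Daoqiang Zhang
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: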

Abstract:Foundation models have achieved remarkable success across diverse machine-learning domains through large-scale pretraining on large, diverse datasets. However, pretraining on such datasets introduces significant challenges due to substantial mismatches in data distributions, a problem particularly pronounced with time series data. In this paper, we tackle this issue by proposing a domain-aware adaptive normalization strategy within the Transformer architecture. Specifically, we replace the traditional LayerNorm with a prototype-guided dynamic normalization mechanism (ProtoNorm), where learned prototypes encapsulate distinct data distributions, and sample-to-prototype affinity determines the appropriate normalization layer. This mechanism effectively captures the heterogeneity of time series characteristics, aligning pretrained representations with downstream tasks. Through comprehensive empirical evaluation, we demonstrate that our method significantly outperforms conventional pretraining techniques across both classification and forecasting tasks, while effectively mitigating the adverse effects of distribution shifts during pretraining. Incorporating ProtoNorm is as simple as replacing a single line of code. Extensive experiments on diverse real-world time series benchmarks validate the robustness and generalizability of our approach, advancing the development of more versatile time series foundation models.
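
The mechanism of choosing a normalization layer by sample-to-prototype affinity can be sketched as below. This is a simplified reading of the abstract, not the released ProtoNorm code: the affinity is assumed to be negative squared Euclidean distance, the selection is a hard argmax, prototypes and gains are fixed by hand rather than learned, and all names are hypothetical.

```python
import math


def layer_norm(x, gain, bias, eps=1e-5):
    """Standard LayerNorm over one feature vector."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    inv_std = 1.0 / math.sqrt(var + eps)
    return [g * (v - mean) * inv_std + b for v, g, b in zip(x, gain, bias)]


def affinity(x, prototype):
    """Assumed affinity: negative squared Euclidean distance to the prototype."""
    return -sum((a - b) ** 2 for a, b in zip(x, prototype))


def proto_norm(x, prototypes, gains, biases):
    """Normalize x with the LayerNorm parameters of its closest prototype."""
    scores = [affinity(x, p) for p in prototypes]
    k = max(range(len(scores)), key=scores.__getitem__)
    return layer_norm(x, gains[k], biases[k]), k


# Two hand-picked prototypes, each with its own LayerNorm gain and bias.
prototypes = [[0.0, 0.0, 0.0, 0.0], [1.0, 2.0, 3.0, 4.0]]
gains = [[1.0] * 4, [2.0] * 4]
biases = [[0.0] * 4, [0.0] * 4]
y, chosen = proto_norm([1.1, 2.0, 2.9, 4.2], prototypes, gains, biases)
```

With a single prototype the function reduces to an ordinary LayerNorm, which is consistent with the abstract's remark that the mechanism is meant as a drop-in replacement.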

[AI-38] Xpose: Bi-directional Engineering for Hidden Query Extraction

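Quick read: This paper addresses Hidden Query Extraction (HQE), a variant of Query Reverse Engineering (QRE) in which the ground-truth SQL query inside an opaque executable must be extracted non-invasively, using only input-output examples. Existing HQE tools, which are based on database mutation and generation techniques, can only handle flat queries with key-based equi-joins and conjunctive arithmetic filter predicates, so they cannot cope with complex query structures or a broader set of query operators. To overcome this limitation, the paper proposes a solution called Xpose.

The key to Xpose is its bi-directional engineering approach. On one side, it substantially extends the existing reverse engineering scope to support union connectors, algebraic filter predicates, and disjunctions for both values and predicates. On the other side, it uses the predictive power of large language models (LLMs) to turn business descriptions of the opaque application into extraction guidance, which amounts to "forward engineering" (FE). The FE module recognizes common complex constructs such as nested sub-queries, outer joins, and scalar functions, establishing the broad contours of the query, while reverse engineering fills in the fine-grained details. The experimental evaluation shows that Xpose accurately extracts complex queries on both E-TPCH, an extended version of the TPC-H benchmark, and the real-world STACK benchmark, representing a significant advance in HQE coverage.

Link: https://arxiv.org/abs/2504.10898
Authors: Ahana Pradhan,Jayant Haritsa
Affiliations: Unknown
Subjects: Databases (cs.DB); Artificial Intelligence (cs.AI)
Comments: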

Abstract:Query reverse engineering (QRE) aims to synthesize a SQL query to connect a given database and result instance. A recent variation of QRE is where an additional input, an opaque executable containing a ground-truth query, is provided, and the goal is to non-invasively extract this specific query through only input-output examples. This variant, called Hidden Query Extraction (HQE), has a spectrum of industrial use-cases including query recovery, database security, and vendor migration. The reverse engineering (RE) tools developed for HQE, which are based on database mutation and generation techniques, can only extract flat queries with key-based equi joins and conjunctive arithmetic filter predicates, making them limited with respect to both query structure and query operators. In this paper, we present Xpose, an HQE solution that elevates the extraction scope to realistic complex queries, such as those found in the TPCH benchmark. A two-pronged approach is taken: (1) The existing RE scope is substantially extended to incorporate union connectors, algebraic filter predicates, and disjunctions for both values and predicates. (2) The predictive power of LLMs is leveraged to convert business descriptions of the opaque application into extraction guidance, representing "forward engineering" (FE). The FE module recognizes common constructs, such as nesting of sub-queries, outer joins, and scalar functions. In essence, FE establishes the broad query contours, while RE fleshes out the fine-grained details. We have evaluated Xpose on (a) E-TPCH, a query suite comprising the complete TPCH benchmark extended with queries featuring unions, diverse join types, and sub-queries; and (b) the real-world STACK benchmark. The experimental results demonstrate that its bi-directional engineering approach accurately extracts these complex queries, representing a significant step forward with regard to HQE coverage.

[AI-39] Understanding the theoretical properties of projected Bellman equation, linear Q-learning, and approximate value iteration

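[Quick read]: This paper studies the theoretical properties of the projected Bellman equation (PBE) and two algorithms for solving it: linear Q-learning and approximate value iteration (AVI). It gives two sufficient conditions for the existence of a PBE solution: the strictly negatively row dominating diagonal (SNRDD) assumption and a condition motivated by the convergence of AVI. The SNRDD assumption also ensures that linear Q-learning converges, and the paper examines how it relates to the convergence of AVI. Finally, the paper offers several observations on the PBE solution under an ε-greedy policy. The key contribution is to use the SNRDD assumption and the associated convergence analysis to give theoretical guarantees for solving the PBE and to clarify how the two algorithms are related.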

Link: https://arxiv.org/abs/2504.10865
Authors: Han-Dong Lim,Donghwan Lee
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Initial submission

Abstract:In this paper, we study the theoretical properties of the projected Bellman equation (PBE) and two algorithms to solve this equation: linear Q-learning and approximate value iteration (AVI). We consider two sufficient conditions for the existence of a solution to PBE: the strictly negatively row dominating diagonal (SNRDD) assumption and a condition motivated by the convergence of AVI. The SNRDD assumption also ensures the convergence of linear Q-learning, and its relationship with the convergence of AVI is examined. Lastly, several interesting observations on the solution of PBE are provided when using an ε-greedy policy.
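
For readers unfamiliar with linear Q-learning, the following is a minimal sketch of the standard textbook semi-gradient update with a linear approximation Q(s, a) ≈ θ·φ(s, a). It is not taken from the paper: the function name, the feature vectors, and the default step size and discount factor are illustrative assumptions, and the exact form the authors analyze may differ.

```python
def linear_q_update(theta, phi_sa, reward, next_phis, gamma=0.9, alpha=0.1):
    """One semi-gradient linear Q-learning step (textbook form, assumed here).

    theta:     current weight vector (list of floats)
    phi_sa:    feature vector of the visited state-action pair (s, a)
    next_phis: feature vectors of (s', a') for every action a' in the next state
    """
    # Current estimate Q(s, a) = theta . phi(s, a)
    q_sa = sum(t * f for t, f in zip(theta, phi_sa))
    # Greedy bootstrap target: max over next actions of theta . phi(s', a')
    q_next = max(sum(t * f for t, f in zip(theta, phi)) for phi in next_phis)
    td_error = reward + gamma * q_next - q_sa
    # Move the weights along the feature direction, scaled by the TD error
    return [t + alpha * td_error * f for t, f in zip(theta, phi_sa)]
```

Whether repeated updates of this kind converge is the kind of question the SNRDD assumption in the abstract is meant to answer.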

[AI-40] Rethinking Theory of Mind Benchmarks for LLMs: Towards A User-Centered Perspective

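[Quick read]: This paper addresses the theoretical, methodological, and evaluation limitations of existing approaches that use human Theory-of-Mind (ToM) tasks to assess the ToM capabilities of large language models (LLMs). It argues that these limitations stem from problems inherent in the original ToM tasks as measures of human ToM, and that they are amplified when the tasks are adapted to evaluate LLMs. The proposed remedy is to redefine the task design and evaluation criteria of ToM benchmarks from a human-computer interaction (HCI) perspective, adopting a more dynamic, interactional approach that accounts for user preferences, needs, and experiences with LLMs, so that LLM social intelligence is measured more fully.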

Link: https://arxiv.org/abs/2504.10839
Authors: Qiaosi Wang,Xuhui Zhou,Maarten Sap,Jodi Forlizzi,Hong Shen
Affiliation: Unknown
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments: 7 pages, 1 figure, accepted to the HEAL@CHI 2025 Workshop

Abstract:The last couple of years have witnessed emerging research that appropriates Theory-of-Mind (ToM) tasks designed for humans to benchmark LLMs' ToM capabilities as an indication of LLMs' social intelligence. However, this approach has a number of limitations. Drawing on existing psychology and AI literature, we summarize the theoretical, methodological, and evaluation limitations by pointing out that certain issues are inherently present in the original ToM tasks used to evaluate human ToM, which continue to persist and are exacerbated when appropriated to benchmark LLMs' ToM. Taking a human-computer interaction (HCI) perspective, these limitations prompt us to rethink the definition and criteria of ToM in ToM benchmarks in a more dynamic, interactional approach that accounts for user preferences, needs, and experiences with LLMs in such evaluations. We conclude by outlining potential opportunities and challenges towards this direction.

[AI-41] Hallucination-Aware Generative Pretrained Transformer for Cooperative Aerial Mobility Control

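[Quick read]: This paper addresses efficiency and safety in unmanned aerial vehicle (UAV) last-mile delivery, in particular the safety hazards, such as battery depletion or duplicate visits, that can arise when generative AI is used for task allocation and route planning. Its key contribution is SafeGPT, a two-tiered framework that combines generative pretrained transformers (GPTs) with reinforcement learning (RL). A Global GPT module handles high-level task allocation, an On-Device GPT module performs real-time local route planning, and an RL-based safety filter monitors decisions and overrides potentially unsafe ones. A dual replay buffer mechanism further helps the models refine their strategies. In simulation, the design raises delivery success rates while substantially reducing battery consumption and travel distance, which supports combining GPT-based semantic reasoning with formal safety guarantees.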

Link: https://arxiv.org/abs/2504.10831
Authors: Hyojun Ahn,Seungcheol Oh,Gyu Seon Kim,Soyi Jung,Soohyun Park,Joongheon Kim
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments:

Abstract:This paper proposes SafeGPT, a two-tiered framework that integrates generative pretrained transformers (GPTs) with reinforcement learning (RL) for efficient and reliable unmanned aerial vehicle (UAV) last-mile deliveries. In the proposed design, a Global GPT module assigns high-level tasks such as sector allocation, while an On-Device GPT manages real-time local route planning. An RL-based safety filter monitors each GPT decision and overrides unsafe actions that could lead to battery depletion or duplicate visits, effectively mitigating hallucinations. Furthermore, a dual replay buffer mechanism helps both the GPT modules and the RL agent refine their strategies over time. Simulation results demonstrate that SafeGPT achieves higher delivery success rates compared to a GPT-only baseline, while substantially reducing battery consumption and travel distance. These findings validate the efficacy of combining GPT-based semantic reasoning with formal safety guarantees, contributing a viable solution for robust and energy-efficient UAV logistics.

[AI-42] Progressive Rock Music Classification

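[Quick read]: This paper addresses progressive rock classification, a Music Information Retrieval (MIR) task. The genre is characterized by complex compositions and diverse instrumentation, which set it apart from other musical styles. The approach combines classical machine learning with deep learning. First, the Librosa library is used to extract comprehensive audio features from song snippets, including spectrograms, Mel-Frequency Cepstral Coefficients (MFCCs), chromagrams, and beat positions, and a winner-take-all voting strategy aggregates snippet-level predictions into a final song-level label. Second, the paper explores ensemble methods (Bagging and Boosting), using Principal Component Analysis (PCA) for dimensionality reduction to cope with the computational cost of high-dimensional feature sets. It also develops custom 1D Convolutional Neural Network (1D CNN) architectures, named "Zuck" and "Satya", and fine-tunes a state-of-the-art attention-based Audio Spectrogram Transformer (AST). In the experiments, ensemble methods such as Extra Trees reach test accuracies of up to 76.38%, supporting the usefulness of these methods for this task.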

Link: https://arxiv.org/abs/2504.10821
Authors: Arpan Nagar,Joseph Bensabat,Jokent Gaza,Moinak Dey
Affiliation: Unknown
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
Comments: 20 pages

Abstract:This study investigates the classification of progressive rock music, a genre characterized by complex compositions and diverse instrumentation, distinct from other musical styles. Addressing this Music Information Retrieval (MIR) task, we extracted comprehensive audio features, including spectrograms, Mel-Frequency Cepstral Coefficients (MFCCs), chromagrams, and beat positions from song snippets using the Librosa library. A winner-take-all voting strategy was employed to aggregate snippet-level predictions into final song classifications. We conducted a comparative analysis of various machine learning techniques. Ensemble methods, encompassing Bagging (Random Forest, ExtraTrees, Bagging Classifier) and Boosting (XGBoost, Gradient Boosting), were explored, utilizing Principal Component Analysis (PCA) for dimensionality reduction to manage computational constraints with high-dimensional feature sets. Additionally, deep learning approaches were investigated, including the development of custom 1D Convolutional Neural Network (1D CNN) architectures (named “Zuck” and “Satya”) featuring specific layer configurations, normalization, and activation functions. Furthermore, we fine-tuned a state-of-the-art Audio Spectrogram Transformer (AST) model, leveraging its attention-based mechanisms for audio classification. Performance evaluation on validation and test sets revealed varying effectiveness across models, with ensemble methods like Extra Trees achieving test accuracies up to 76.38%. This research provides insights into the application and relative performance of diverse machine learning paradigms for the nuanced task of progressive rock genre classification.
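
The winner-take-all voting step mentioned in the abstract can be sketched as a simple majority vote over snippet-level labels. The function name and the label strings below are hypothetical, and how the authors break ties is not stated; this sketch falls back on the label that reaches the top count first in input order.

```python
from collections import Counter

def classify_song(snippet_predictions):
    """Aggregate per-snippet labels into one song-level label by majority vote."""
    if not snippet_predictions:
        raise ValueError("need at least one snippet prediction")
    counts = Counter(snippet_predictions)
    # most_common(1) returns the label with the highest count;
    # equal counts are ordered by first occurrence in the input.
    label, _ = counts.most_common(1)[0]
    return label
```

For example, a song whose snippets are predicted as `["prog", "other", "prog"]` would be labeled `"prog"`.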

[AI-43] FHBench: Towards Efficient and Personalized Federated Learning for Multimodal Healthcare

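[Quick read]: This paper addresses two challenges that existing federated learning (FL) methods face in multi-institutional medical collaboration: real-world medical data are multimodal, and computational resources are limited. It contributes the Federated Healthcare Benchmark (FHBench), a benchmark built for real-world healthcare applications, and, building on it, Efficient Personalized Federated Learning with Adaptive LoRA (EPFL), a personalized FL framework. FHBench covers diagnostic tasks across the nervous, cardiovascular, and respiratory systems and general pathology, filling a gap in existing benchmarks. EPFL uses an adaptive LoRA mechanism to improve efficiency and effectiveness across healthcare modalities, addressing key limitations of existing methods.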

Link: https://arxiv.org/abs/2504.10817
Authors: Penghao Wang,Qian Chen,Teng Zhang,Yingwei Zhang,Wang Lu,Yiqiang Chen
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Federated Learning (FL) has emerged as an effective solution for multi-institutional collaborations without sharing patient data, offering a range of methods tailored for diverse applications. However, real-world medical datasets are often multimodal, and computational resources are limited, posing significant challenges for existing FL approaches. Recognizing these limitations, we developed the Federated Healthcare Benchmark (FHBench), a benchmark specifically designed from datasets derived from real-world healthcare applications. FHBench encompasses critical diagnostic tasks across domains such as the nervous, cardiovascular, and respiratory systems and general pathology, providing comprehensive support for multimodal healthcare evaluations and filling a significant gap in existing benchmarks. Building on FHBench, we introduced Efficient Personalized Federated Learning with Adaptive LoRA (EPFL), a personalized FL framework that demonstrates superior efficiency and effectiveness across various healthcare modalities. Our results highlight the robustness of FHBench as a benchmarking tool and the potential of EPFL as an innovative approach to advancing healthcare-focused FL, addressing key limitations of existing methods.

[AI-44] E2E Parking Dataset: An Open Benchmark for End-to-End Autonomous Parking

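[Quick read]: This paper addresses the lack of publicly available datasets for autonomous parking, which limits the reproducibility and benchmarking of end-to-end learning methods. Its contribution is to create and open-source a high-quality dataset for end-to-end autonomous parking. Experiments with the original model on this dataset achieve an overall success rate of 85.16%, with lower average position and orientation errors (0.24 meters and 0.34 degrees, respectively).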

Link: https://arxiv.org/abs/2504.10812
Authors: Kejia Gao,Liguo Zhou,Mingjun Liu,Alois Knoll
Affiliation: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments:

Abstract:End-to-end learning has shown great potential in autonomous parking, yet the lack of publicly available datasets limits reproducibility and benchmarking. While prior work introduced a visual-based parking model and a pipeline for data generation, training, and closed-loop testing, the dataset itself was not released. To bridge this gap, we create and open-source a high-quality dataset for end-to-end autonomous parking. Using the original model, we achieve an overall success rate of 85.16% with lower average position and orientation errors (0.24 meters and 0.34 degrees).

[AI-45] ATLASv2: LLM-Guided Adaptive Landmark Acquisition and Navigation on the Edge

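[Quick read]: This paper addresses the challenges that autonomous systems face on edge devices: resource constraints, real-time processing demands, and adaptation to dynamic environments. It proposes ATLASv2, which integrates a fine-tuned TinyLLM, real-time object detection, and efficient path planning to support hierarchical, multi-task navigation and manipulation on a Jetson Nano edge device. ATLASv2 detects and localizes objects in the environment and saves them to an internal knowledge base, dynamically expanding its set of navigable landmarks for use in future tasks. The key innovation is to run generative AI in a fully on-board framework that optimizes resource utilization while keeping prompting latency and power consumption minimal, bridging the gap between simulated environments and real-world applications.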

Link: https://arxiv.org/abs/2504.10784
Authors: Mikolaj Walczak,Uttej Kallakuri,Tinoosh Mohsenin
Affiliation: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments:

Abstract:Autonomous systems deployed on edge devices face significant challenges, including resource constraints, real-time processing demands, and adapting to dynamic environments. This work introduces ATLASv2, a novel system that integrates a fine-tuned TinyLLM, real-time object detection, and efficient path planning to enable hierarchical, multi-task navigation and manipulation all on the edge device, Jetson Nano. ATLASv2 dynamically expands its navigable landmarks by detecting and localizing objects in the environment which are saved to its internal knowledge base to be used for future task execution. We evaluate ATLASv2 in real-world environments, including a handcrafted home and office setting constructed with diverse objects and landmarks. Results show that ATLASv2 effectively interprets natural language instructions, decomposes them into low-level actions, and executes tasks with high success rates. By leveraging generative AI in a fully on-board framework, ATLASv2 achieves optimized resource utilization with minimal prompting latency and power consumption, bridging the gap between simulated environments and real-world applications.

[AI-46] Epistemic Uncertainty-aware Recommendation Systems via Bayesian Deep Ensemble Learning

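[Quick read]: This paper addresses two main limitations of conventional recommendation models in explicit-feedback and sparse-data settings: a tendency to overfit and a failure to incorporate epistemic uncertainty into predictions. It proposes BDECF, a Bayesian deep ensemble collaborative filtering method. The key ideas are to use Bayesian Neural Networks, which place uncertainty on the weight parameters, to improve generalization and prediction quality; to introduce an interpretable non-linear matching approach for user and item embeddings based on the attention mechanism; and to build an ensemble-based supermodel that makes predictions more robust and reliable.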

Link: https://arxiv.org/abs/2504.10753
Authors: Radin Cheraghi,Amir Mohammad Mahfoozi,Sepehr Zolfaghari,Mohammadshayan Shabani,Maryam Ramezani,Hamid R. Rabiee
Affiliation: Unknown
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 10 pages

Abstract:Recommending items to users has long been a fundamental task, and studies have tried to improve it ever since. Most well-known models commonly employ representation learning to map users and items into a unified embedding space for matching assessment. These approaches have primary limitations, especially when dealing with explicit feedback and sparse data contexts. Two primary limitations are their proneness to overfitting and failure to incorporate epistemic uncertainty in predictions. To address these problems, we propose a novel Bayesian Deep Ensemble Collaborative Filtering method named BDECF. To improve model generalization and quality, we utilize Bayesian Neural Networks, which incorporate uncertainty within their weight parameters. In addition, we introduce a new interpretable non-linear matching approach for the user and item embeddings, leveraging the advantages of the attention mechanism. Furthermore, we endorse the implementation of an ensemble-based supermodel to generate more robust and reliable predictions, resulting in a more complete model. Empirical evaluation through extensive experiments and ablation studies across a range of publicly accessible real-world datasets with differing sparsity characteristics confirms our proposed method’s effectiveness and the importance of its components.
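
One common way an ensemble exposes epistemic uncertainty is through disagreement among its members. The sketch below illustrates that general idea only: it is not BDECF's actual aggregation rule, and the function name and the use of the population variance as the uncertainty proxy are assumptions made for illustration.

```python
from statistics import mean, pvariance

def ensemble_predict(member_predictions):
    """Combine rating predictions from several ensemble members.

    Returns (mean prediction, variance across members). A larger variance
    means the members disagree more, which is often read as higher
    epistemic uncertainty for that user-item pair.
    """
    if not member_predictions:
        raise ValueError("need at least one member prediction")
    return mean(member_predictions), pvariance(member_predictions)
```

If two members predict ratings of 3.0 and 5.0, the combined prediction is 4.0 with variance 1.0; if every member predicts 4.0, the variance is 0.0.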

[AI-47] Communication-aware Hierarchical Map Compression of Time-Varying Environments for Mobile Robots

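[Quick read]: This paper develops a systematic framework for the time-sequential compression of dynamic probabilistic occupancy grids. Drawing on ideas from signal compression theory, it formulates an optimization problem that searches for a multi-resolution hierarchical encoder balancing the quality of the compressed map (distortion) against its description size, which is closely tied to the bandwidth needed to transmit the map. The resulting method yields multi-resolution map compressions that satisfy the available communication or memory budget without requiring knowledge of the occupancy map dynamics. The paper presents an algorithm for solving the problem and demonstrates the framework in simulation on both static and dynamic occupancy maps.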

Link: https://arxiv.org/abs/2504.10751
Authors: Daniel T. Larsson,Dipankar Maity
Affiliation: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments:

Abstract:In this paper, we develop a systematic framework for the time-sequential compression of dynamic probabilistic occupancy grids. Our approach leverages ideas from signal compression theory to formulate an optimization problem that searches for a multi-resolution hierarchical encoder that balances the quality of the compressed map (distortion) with its description size, the latter of which relates to the bandwidth required to reliably transmit the map to other agents or to store map estimates in on-board memory. The resulting optimization problem allows for multi-resolution map compressions to be obtained that satisfy available communication or memory resources, and does not require knowledge of the occupancy map dynamics. We develop an algorithm to solve our problem, and demonstrate the utility of the proposed framework in simulation on both static (i.e., non-time varying) and dynamic (time-varying) occupancy maps.

[AI-48] Frozen Layers: Memory-efficient Many-fidelity Hyperparameter Optimization

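[Quick read]: This paper addresses the growing need for efficient and cost-effective hyperparameter optimization (HPO) in deep learning pipelines as model sizes increase. Multi-fidelity HPO (MF-HPO) reduces compute requirements by using lower-fidelity estimates, but existing fidelity sources often perform poorly under tight compute and memory constraints. The key contribution is a new fidelity source: the number of layers that are trained or frozen during training. For deep networks, this yields significant compute and memory savings while preserving, at low fidelities, the rank correlations between hyperparameters observed in full model training. The paper validates this empirically on ResNets and Transformers, analyzes the utility of frozen layers as a fidelity for using GPU resources in HPO, and examines the potential of joint MF-HPO that combines frozen layers with other fidelity sources. The contribution opens new applications for MF-HPO with hardware resources as a fidelity and creates opportunities for improved algorithms that navigate joint fidelity spaces.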

Link: https://arxiv.org/abs/2504.10735
Authors: Timur Carstensen,Neeratyoy Mallik,Frank Hutter,Martin Rapp
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:As model sizes grow, finding efficient and cost-effective hyperparameter optimization (HPO) methods becomes increasingly crucial for deep learning pipelines. While multi-fidelity HPO (MF-HPO) trades off computational resources required for DL training with lower fidelity estimations, existing fidelity sources often fail under lower compute and memory constraints. We propose a novel fidelity source: the number of layers that are trained or frozen during training. For deep networks, this approach offers significant compute and memory savings while preserving rank correlations between hyperparameters at low fidelities compared to full model training. We demonstrate this in our empirical evaluation across ResNets and Transformers and additionally analyze the utility of frozen layers as a fidelity in using GPU resources as a fidelity in HPO, and for a combined MF-HPO with other fidelity sources. This contribution opens new applications for MF-HPO with hardware resources as a fidelity and creates opportunities for improved algorithms navigating joint fidelity spaces.
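
To make the "number of trained layers as a fidelity" idea concrete, the sketch below maps a fidelity value to a per-layer trainable flag. The abstract does not say which layers are frozen, so this assumes the layers nearest the input are frozen and the last `num_trainable` layers are trained; the function name is hypothetical and the code is framework-agnostic.

```python
def trainable_mask(num_layers, num_trainable):
    """Return one boolean per layer: True if the layer is trained, False if frozen.

    Assumption (not from the abstract): the first layers are frozen and only
    the last `num_trainable` layers receive gradient updates, so backpropagation
    can stop at the earliest trainable layer.
    """
    if not 0 <= num_trainable <= num_layers:
        raise ValueError("num_trainable must be between 0 and num_layers")
    num_frozen = num_layers - num_trainable
    return [index >= num_frozen for index in range(num_layers)]
```

Under this assumption, a low-fidelity evaluation of a 4-layer model with one trainable layer would use the mask `[False, False, False, True]`, and the full-fidelity run would set every entry to `True`.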

[AI-49] Optimizing Data Distribution and Kernel Performance for Efficient Training of Chemistry Foundation Models: A Case Study with MACE

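[Quick read]: This paper addresses two key challenges for chemistry foundation models (CFMs) that use graph neural networks (GNNs) on 3D molecular graph structures: load balancing in data distribution and performance optimization of model training. It takes MACE, a state-of-the-art CFM, as a case study. For data distribution, it formulates load balancing as a multi-objective bin packing problem and designs an iterative algorithm that gives an effective, fast, and practical solution. For training, it identifies symmetric tensor contraction as the key computational kernel in MACE and optimizes that kernel, improving overall training performance. Together, balanced data distribution and kernel optimization reduce per-epoch training time from 12 minutes to 2 minutes on 740 GPUs, removing an efficiency bottleneck in processing large numbers of graphs of varying sizes.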

Link: https://arxiv.org/abs/2504.10700
Authors: Jesun Firoz,Franco Pellegrini,Mario Geiger,Darren Hsu,Jenna A. Bilbrey,Han-Yi Chou,Maximilian Stadler,Markus Hoehnerbach,Tingyu Wang,Dejun Lin,Emine Kucukbenli,Henry W. Sprueill,Ilyes Batatia,Sotiris S. Xantheas,MalSoon Lee,Chris Mundy,Gabor Csanyi,Justin S. Smith,Ponnuswamy Sadayappan,Sutanay Choudhury
Affiliation: Unknown
Categories: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)
Comments: Accepted at The 34th ACM International Symposium on High-Performance Parallel and Distributed Computing (HPDC 2025)

Abstract:Chemistry Foundation Models (CFMs) that leverage Graph Neural Networks (GNNs) operating on 3D molecular graph structures are becoming indispensable tools for computational chemists and materials scientists. These models facilitate the understanding of matter and the discovery of new molecules and materials. In contrast to GNNs operating on large homogeneous graphs, GNNs used by CFMs process a large number of geometric graphs of varying sizes, requiring different optimization strategies than those developed for large homogeneous GNNs. This paper presents optimizations for two critical phases of CFM training: data distribution and model training, targeting MACE - a state-of-the-art CFM. We address the challenge of load balancing in data distribution by formulating it as a multi-objective bin packing problem. We propose an iterative algorithm that provides a highly effective, fast, and practical solution, ensuring efficient data distribution. For the training phase, we identify symmetric tensor contraction as the key computational kernel in MACE and optimize this kernel to improve the overall performance. Our combined approach of balanced data distribution and kernel optimization significantly enhances the training process of MACE. Experimental results demonstrate a substantial speedup, reducing per-epoch execution time for training from 12 to 2 minutes on 740 GPUs with a 2.6M sample dataset.
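
To illustrate why load balancing over graphs of varying sizes resembles bin packing, here is a minimal sketch of a classic greedy heuristic: sort items by size and always give the next one to the least-loaded bin. This is a swapped-in, single-objective stand-in and not the paper's iterative multi-objective algorithm; the function name and the use of one scalar "size" per graph (for example, an atom count) are assumptions.

```python
import heapq

def balance_graphs(graph_sizes, num_gpus):
    """Greedy largest-first assignment of graphs to GPUs.

    Returns (assignment, loads), where assignment[g] lists the graph indices
    placed on GPU g and loads[g] is the total size on that GPU.
    """
    heap = [(0, gpu) for gpu in range(num_gpus)]  # (current load, gpu id)
    heapq.heapify(heap)
    assignment = [[] for _ in range(num_gpus)]
    # Place the largest graphs first so small ones can even out the loads later
    order = sorted(range(len(graph_sizes)), key=lambda i: -graph_sizes[i])
    for idx in order:
        load, gpu = heapq.heappop(heap)  # least-loaded GPU so far
        assignment[gpu].append(idx)
        heapq.heappush(heap, (load + graph_sizes[idx], gpu))
    loads = [sum(graph_sizes[i] for i in bucket) for bucket in assignment]
    return assignment, loads
```

A real multi-objective variant would balance more than one quantity per graph at once, which is where a single scalar size stops being sufficient.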

[AI-50] HyRRT-Connect: Bidirectional Motion Planning for Hybrid Dynamical Systems

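[Quick read]: This paper addresses motion planning for hybrid systems and proposes HyRRT-Connect, a bidirectional rapidly-exploring random trees (RRT) algorithm. The algorithm propagates both forward and backward in hybrid time until the forward and backward propagation results overlap. It then constructs a motion plan by reversing and concatenating functions defined on hybrid time domains, ensuring that the plan satisfies the given hybrid dynamics. Tolerating some distance between the forward and backward partial motion plans can introduce a discontinuity along the flow; to remove it, the backward partial motion plan is reconstructed by a forward-in-hybrid-time simulation starting from the final state of the forward partial motion plan.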

Link: https://arxiv.org/abs/2504.10699
Authors: Nan Wang,Ricardo G. Sanfelice
Affiliation: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
Comments: 59 pages, 9 figures, submitted to IJRR. arXiv admin note: substantial text overlap with arXiv:2403.18413; text overlap with arXiv:2406.01802

Abstract:This paper proposes a bidirectional rapidly-exploring random trees (RRT) algorithm to solve the motion planning problem for hybrid systems. The proposed algorithm, called HyRRT-Connect, propagates in both forward and backward directions in hybrid time until an overlap between the forward and backward propagation results is detected. Then, HyRRT-Connect constructs a motion plan through the reversal and concatenation of functions defined on hybrid time domains, ensuring that the motion plan satisfies the given hybrid dynamics. To address the potential discontinuity along the flow caused by tolerating some distance between the forward and backward partial motion plans, we reconstruct the backward partial motion plan by a forward-in-hybrid-time simulation from the final state of the forward partial motion plan, effectively eliminating the discontinuity. The proposed algorithm is applied to an actuated bouncing ball system and a walking robot example to highlight its computational improvement.

[AI-51] The Jailbreak Tax: How Useful are Your Jailbreak Outputs?

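[Quick read]: This paper asks whether existing jailbreak attacks actually yield useful harmful outputs from large language models. For example, when a model is jailbroken to give bomb-building instructions, are the instructions any good? Because the utility of most unsafe answers (such as bomb instructions) is hard to evaluate rigorously, the paper builds new jailbreak evaluation sets with known ground-truth answers by aligning models to refuse questions on benign, easy-to-evaluate topics such as biology or math.

The key contribution is the "jailbreak tax", a new metric for the drop in model utility after a jailbreak. Evaluating eight representative jailbreaks across five utility benchmarks, the paper finds a consistent drop in utility in jailbroken responses. For example, although every tested jailbreak bypassed the guardrails of models aligned to refuse math questions, accuracy dropped by up to 92%. The paper also releases the benchmarks used to evaluate jailbreaks.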

Link: https://arxiv.org/abs/2504.10694
Authors: Kristina Nikolić,Luze Sun,Jie Zhang,Florian Tramèr
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Comments:

Abstract:Jailbreak attacks bypass the guardrails of large language models to produce harmful outputs. In this paper, we ask whether the model outputs produced by existing jailbreaks are actually useful. For example, when jailbreaking a model to give instructions for building a bomb, does the jailbreak yield good instructions? Since the utility of most unsafe answers (e.g., bomb instructions) is hard to evaluate rigorously, we build new jailbreak evaluation sets with known ground truth answers, by aligning models to refuse questions related to benign and easy-to-evaluate topics (e.g., biology or math). Our evaluation of eight representative jailbreaks across five utility benchmarks reveals a consistent drop in model utility in jailbroken responses, which we term the jailbreak tax. For example, while all jailbreaks we tested bypass guardrails in models aligned to refuse to answer math, this comes at the expense of a drop of up to 92% in accuracy. Overall, our work proposes the jailbreak tax as a new important metric in AI safety, and introduces benchmarks to evaluate existing and future jailbreaks. We make the benchmark available at this https URL
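
The abstract describes the jailbreak tax only as a drop in accuracy, without giving a formula. One plausible reading, shown here as an assumption rather than the authors' definition, is the relative accuracy loss of the jailbroken model compared with a baseline that answers the same questions without refusing. The function and parameter names are hypothetical.

```python
def jailbreak_tax(baseline_accuracy, jailbroken_accuracy):
    """Relative accuracy drop caused by a jailbreak (assumed definition).

    baseline_accuracy:   accuracy of the model before refusal alignment
    jailbroken_accuracy: accuracy of the aligned model's jailbroken answers
    Both are fractions in [0, 1]; the result is the fraction of utility lost.
    """
    if baseline_accuracy <= 0:
        raise ValueError("baseline accuracy must be positive")
    return (baseline_accuracy - jailbroken_accuracy) / baseline_accuracy
```

Under this reading, a baseline accuracy of 0.5 that falls to 0.25 after the jailbreak corresponds to a tax of 0.5, meaning half of the original utility was lost.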

[AI-52] Achieving Optimal Tissue Repair Through MARL with Reward Shaping and Curriculum Learning

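[Quick read]: This paper addresses how to optimize tissue repair using engineered biological agents. It proposes a multi-agent reinforcement learning (MARL) framework that integrates (1) stochastic reaction-diffusion systems modeling molecular signaling, (2) neural-like electrochemical communication with Hebbian plasticity, and (3) a biologically informed reward function combining chemical gradient tracking, neural synchronization, and robust penalties. A curriculum learning scheme guides the agents through progressively more complex repair scenarios. In simulation, these elements give rise to emergent repair strategies such as dynamic secretion control and spatial coordination.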

Link: https://arxiv.org/abs/2504.10677
Authors: Muhammad Al-Zafar Khan,Jamal Al-Karaki
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments: 14 pages, 4 figures, submitted to the 10th International Conference on Information and Communication Technology for Intelligent Systems (ICTIS)

Abstract:In this paper, we present a multi-agent reinforcement learning (MARL) framework for optimizing tissue repair processes using engineered biological agents. Our approach integrates: (1) stochastic reaction-diffusion systems modeling molecular signaling, (2) neural-like electrochemical communication with Hebbian plasticity, and (3) a biologically informed reward function combining chemical gradient tracking, neural synchronization, and robust penalties. A curriculum learning scheme guides the agent through progressively complex repair scenarios. In silico experiments demonstrate emergent repair strategies, including dynamic secretion control and spatial coordination.

[AI-53] Ride-pool Assignment Algorithms: Modern Implementation and Swapping Heuristics

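[Quick read]: This paper addresses the difficulty of benchmarking algorithms for the Ride-pool Assignment Problem (RAP) in on-demand ride-pooling systems. Many algorithms have been developed for RAP, but open-source implementations are lacking, which makes it hard to compare them fairly on a common dataset and objective. The paper describes in detail the implementation of a ride-pool simulator that includes several key ride-pool assignment algorithms and associated components such as vehicle routing and rebalancing, and it open-sources a highly optimized, modular C++ codebase designed for extension with new algorithms and features. It also introduces swapping-based local-search heuristics that enhance existing assignment algorithms and strike a better balance between performance and computational efficiency. Experiments show that the selected algorithms perform comparably, while the proposed Multi-Round Linear Assignment with Cyclic Exchange (LA-MR-CE) algorithm achieves a state-of-the-art service rate with significantly less computation time. Further analysis suggests that all myopic ride-pool assignment algorithms are limited by the system's capacity bottleneck, and that incorporating future information could be key to overcoming it.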

Link: https://arxiv.org/abs/2504.10649
Authors: Matthew Zalesak,Hins Hu,Samitha Samaranayake
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET)
Comments:

Abstract:On-demand ride-pooling has emerged as a popular urban transportation solution, addressing the efficiency limitations of traditional ride-hailing services by grouping multiple riding requests with spatiotemporal proximity into a single vehicle. Although numerous algorithms have been developed for the Ride-pool Assignment Problem (RAP) – a core component of ride-pooling systems, there is a lack of open-source implementations, making it difficult to benchmark these algorithms on a common dataset and objective. In this paper, we present the implementation details of a ride-pool simulator that encompasses several key ride-pool assignment algorithms, along with associated components such as vehicle routing and rebalancing. We also open-source a highly optimized and modular C++ codebase, designed to facilitate the extension of new algorithms and features. Additionally, we introduce a family of swapping-based local-search heuristics to enhance existing ride-pool assignment algorithms, achieving a better balance between performance and computational efficiency. Extensive experiments on a large-scale, real-world dataset from Manhattan, NYC reveal that while all selected algorithms perform comparably, the newly proposed Multi-Round Linear Assignment with Cyclic Exchange (LA-MR-CE) algorithm achieves a state-of-the-art service rate with significantly reduced computational time. Furthermore, an in-depth analysis suggests that a performance barrier exists for all myopic ride-pool assignment algorithms due to the system’s capacity bottleneck, and incorporating future information could be key to overcoming this limitation.
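
The general shape of a swapping-based local search can be sketched with the simplest case, pairwise swaps. This is deliberately simpler than the cyclic exchange used in LA-MR-CE and is not the authors' implementation (which is in C++): it assumes one request per vehicle and a hypothetical cost matrix `cost[v][r]` giving the cost of vehicle `v` serving request `r`.

```python
def swap_local_search(cost, assignment):
    """Improve a vehicle-to-request assignment by repeated pairwise swaps.

    cost[v][r]:    cost of vehicle v serving request r (illustrative input)
    assignment[v]: index of the request currently given to vehicle v
    Returns a new assignment that no single pairwise swap can improve.
    """
    assignment = list(assignment)
    improved = True
    while improved:  # terminates: every accepted swap strictly lowers total cost
        improved = False
        for a in range(len(assignment)):
            for b in range(a + 1, len(assignment)):
                ra, rb = assignment[a], assignment[b]
                current = cost[a][ra] + cost[b][rb]
                swapped = cost[a][rb] + cost[b][ra]
                if swapped < current:
                    assignment[a], assignment[b] = rb, ra
                    improved = True
    return assignment
```

With costs `[[4, 1], [2, 3]]` and the starting assignment `[0, 1]` (total cost 7), one swap gives `[1, 0]` (total cost 3). The result is only a local optimum with respect to pairwise swaps, which is one reason richer moves such as cyclic exchanges are of interest.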

[AI-54] Energy Matching: Unifying Flow Matching and Energy-Based Models for Generative Modeling

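[Quick read]: This paper addresses the difficulty generative models have in incorporating partial observations or additional priors. Existing methods usually map noise to data by matching flows or scores, which becomes cumbersome in those settings. Inspired by recent advances in Wasserstein gradient flows, the authors propose Energy Matching, a framework that unifies flow-based approaches with the flexibility of energy-based models (EBMs). The key is to parameterize the dynamics with a single time-independent scalar field, which serves both as a powerful generator and as a flexible prior for regularizing inverse problems. The paper also introduces an interaction energy for diverse mode exploration. It focuses on learning a static scalar potential energy, without time conditioning, auxiliary generators, or additional networks, which marks a significant departure from recent EBM methods. The authors argue that this simplified framework significantly advances EBM capabilities and paves the way for broader adoption of EBMs in generative modeling across domains.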

Link: https://arxiv.org/abs/2504.10612
Authors: Michal Balcerak,Tamaz Amiranashvili,Suprosanna Shit,Antonio Terpin,Sebastian Kaltenbach,Petros Koumoutsakos,Bjoern Menze
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments:

Abstract:Generative models often map noise to data by matching flows or scores, but these approaches become cumbersome for incorporating partial observations or additional priors. Inspired by recent advances in Wasserstein gradient flows, we propose Energy Matching, a framework that unifies flow-based approaches with the flexibility of energy-based models (EBMs). Far from the data manifold, samples move along curl-free, optimal transport paths from noise to data. As they approach the data manifold, an entropic energy term guides the system into a Boltzmann equilibrium distribution, explicitly capturing the underlying likelihood structure of the data. We parameterize this dynamic with a single time-independent scalar field, which serves as both a powerful generator and a flexible prior for effective regularization of inverse problems. Our method substantially outperforms existing EBMs on CIFAR-10 generation (FID 3.97 compared to 8.61), while retaining the simulation-free training of transport-based approaches away from the data manifold. Additionally, we exploit the flexibility of our method and introduce an interaction energy for diverse mode exploration. Our approach focuses on learning a static scalar potential energy – without time conditioning, auxiliary generators, or additional networks – marking a significant departure from recent EBM methods. We believe this simplified framework significantly advances EBM capabilities and paves the way for their broader adoption in generative modeling across diverse domains.

[AI-55] Self-Controlled Dynamic Expansion Model for Continual Learning

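[Quick read]: This paper addresses a weakness of continual learning (CL) methods built on a single, static pre-trained Vision Transformer (ViT) backbone: such a backbone adapts poorly to new tasks, especially across diverse data domains, because a large number of its parameters are inactive. The key contribution is the Self-Controlled Dynamic Expansion Model (SCDEM), which uses multiple distinct trainable pre-trained ViT backbones as a shared module and dynamically generates a new expert with minimal parameters for each new task. A Collaborative Optimization Mechanism (COM) jointly optimizes the backbones using prediction signals from historical experts, supporting new-task learning while avoiding catastrophic forgetting. A Feature Distribution Consistency (FDC) approach, based on an optimal transport distance, aligns the semantic similarity between previously and currently learned representations to mitigate negative knowledge transfer. A Dynamic Layer-Wise Feature Attention Mechanism (DLWFAM) autonomously sets the regularization strength on each trainable representation layer. Experiments indicate that the method attains state-of-the-art performance.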

Link: https://arxiv.org/abs/2504.10561
Authors: Runqing Wu,Fei Ye,Rongyao Hu,Guoxi Huang
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 10 pages, 3 figures, 6 tables, Continual Learning, Cross-Domain Continual Learning, Mixture Model

Abstract:Continual Learning (CL) epitomizes an advanced training paradigm wherein prior data samples remain inaccessible during the acquisition of new tasks. Numerous investigations have delved into leveraging a pre-trained Vision Transformer (ViT) to enhance model efficacy in continual learning. Nonetheless, these approaches typically utilize a singular, static backbone, which inadequately adapts to novel tasks, particularly when engaging with diverse data domains, due to a substantial number of inactive parameters. This paper addresses this limitation by introducing an innovative Self-Controlled Dynamic Expansion Model (SCDEM), which orchestrates multiple distinct trainable pre-trained ViT backbones to furnish diverse and semantically enriched representations. Specifically, by employing the multi-backbone architecture as a shared module, the proposed SCDEM dynamically generates a new expert with minimal parameters to accommodate a new task. A novel Collaborative Optimization Mechanism (COM) is introduced to synergistically optimize multiple backbones by harnessing prediction signals from historical experts, thereby facilitating new task learning without erasing previously acquired knowledge. Additionally, a novel Feature Distribution Consistency (FDC) approach is proposed to align semantic similarity between previously and currently learned representations through an optimal transport distance-based mechanism, effectively mitigating negative knowledge transfer effects. Furthermore, to alleviate over-regularization challenges, this paper presents a novel Dynamic Layer-Wise Feature Attention Mechanism (DLWFAM) to autonomously determine the penalization intensity on each trainable representation layer. An extensive series of experiments have been conducted to evaluate the proposed methodology’s efficacy, with empirical results corroborating that the approach attains state-of-the-art performance.

[AI-56] Efficient Process Reward Model Training via Active Learning

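[Quick read]: This paper addresses the high cost of annotating training data for Process Reward Models (PRMs), which provide step-level supervision to large language models (LLMs). It proposes ActPRM, an active learning approach that estimates uncertainty and selects the most uncertain samples for labeling, substantially reducing annotation cost. Concretely, ActPRM uses the PRM to estimate uncertainty after the forward pass and retains only highly uncertain data, which a capable but costly reasoning model then labels. The loss is computed against these labels and the PRM's weights are updated. Experiments show that ActPRM cuts annotation by 50% while achieving comparable or better performance. Filtering more than 1M math reasoning trajectories with ActPRM and training on the retained data further yields a new state-of-the-art PRM on ProcessBench and PRMBench.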

Link: https://arxiv.org/abs/2504.10559
Authors: Keyu Duan,Zichen Liu,Xin Mao,Tianyu Pang,Changyu Chen,Qiguang Chen,Michael Qizhe Shieh,Longxu Dou
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 15 pages, 4 figures

Abstract:Process Reward Models (PRMs) provide step-level supervision to large language models (LLMs), but scaling up training data annotation remains challenging for both humans and LLMs. To address this limitation, we propose an active learning approach, ActPRM, which proactively selects the most uncertain samples for training, substantially reducing labeling costs. During training, we use the PRM to estimate uncertainty after the forward pass, retaining only highly uncertain data. A capable yet costly reasoning model then labels this data. Then we compute the loss with respect to the labels and update the PRM’s weights. We compare ActPRM vs. vanilla fine-tuning, on a pool-based active learning setting, demonstrating that ActPRM reduces 50% annotation, but achieving the comparable or even better performance. Beyond annotation efficiency, we further advance the actively trained PRM by filtering over 1M+ math reasoning trajectories with ActPRM, retaining 60% of the data. A subsequent training on this selected dataset yields a new state-of-the-art (SOTA) PRM on ProcessBench (75.0%) and PRMBench (65.5%) compared with same sized models.
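The selection step described in the abstract can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes uncertainty is measured by how close the PRM's predicted step-correctness probability is to 0.5, and the names `step_confidence`, `select_uncertain`, `max_confidence`, and the toy scores are all hypothetical.

```python
def step_confidence(prob_correct):
    """Map a step-correctness probability to [0, 1]; 0 means maximally uncertain."""
    return abs(prob_correct - 0.5) * 2


def select_uncertain(batch, prm_score, max_confidence=0.4):
    """Keep trajectories whose least-confident step falls below max_confidence.

    batch: list of trajectories, each a list of reasoning steps.
    prm_score: callable returning the PRM's probability that a step is correct
               (assumed interface, standing in for a real forward pass).
    """
    selected = []
    for trajectory in batch:
        confidences = [step_confidence(prm_score(step)) for step in trajectory]
        if min(confidences) < max_confidence:
            selected.append(trajectory)
    return selected


# Toy PRM outputs keyed by step id (made-up numbers for illustration only).
toy_scores = {"a1": 0.95, "a2": 0.90, "b1": 0.90, "b2": 0.55}
batch = [["a1", "a2"], ["b1", "b2"]]
picked = select_uncertain(batch, toy_scores.get)
print(picked)
```

Under these assumptions only the second trajectory would be forwarded to the expensive labeling model, since its step `b2` sits close to the decision boundary.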

[AI-57] The Code Barrier: What LLMs Actually Understand?

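[Quick read]: This paper evaluates how well large language models (LLMs) understand code semantics, particularly under complex code obfuscation. It uses code obfuscation as a structured testing framework and measures comprehension through two tasks: generating accurate descriptions of obfuscated code and performing deobfuscation. The key contribution is a new evaluation approach, applied to 13 state-of-the-art models, including code-specialized models such as StarCoder2 and general-purpose models such as GPT-4o, on a benchmark built from CodeNet that contains 250 filtered Java programming problems and their solutions. The study finds that model performance declines significantly as obfuscation complexity increases, although general-purpose models show unexpected resilience compared with code-focused ones. While some models can identify obfuscation techniques, their ability to reconstruct the underlying program logic remains limited, which points to weaknesses in their semantic representation mechanisms. The work provides an empirical basis for improving code understanding in security-critical areas such as reverse engineering and adversarial code analysis.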

Link: https://arxiv.org/abs/2504.10557
Authors: Serge Lionel Nikiema,Jordan Samhi,Abdoul Kader Kaboré,Jacques Klein,Tegawendé F. Bissyandé
Affiliations: Unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments:

Abstract:Understanding code represents a core ability needed for automating software development tasks. While foundation models like LLMs show impressive results across many software engineering challenges, the extent of their true semantic understanding beyond simple token recognition remains unclear. This research uses code obfuscation as a structured testing framework to evaluate LLMs’ semantic understanding capabilities. We methodically apply controlled obfuscation changes to source code and measure comprehension through two complementary tasks: generating accurate descriptions of obfuscated code and performing deobfuscation, a skill with important implications for reverse engineering applications. Our testing approach includes 13 cutting-edge models, covering both code-specialized (e.g., StarCoder2) and general-purpose (e.g., GPT-4o) architectures, evaluated on a benchmark created from CodeNet and consisting of filtered 250 Java programming problems and their solutions. Findings show a statistically significant performance decline as obfuscation complexity increases, with unexpected resilience shown by general-purpose models compared to their code-focused counterparts. While some models successfully identify obfuscation techniques, their ability to reconstruct the underlying program logic remains constrained, suggesting limitations in their semantic representation mechanisms. This research introduces a new evaluation approach for assessing code comprehension in language models and establishes empirical baselines for advancing research in security-critical code analysis applications such as reverse engineering and adversarial code analysis.

[AI-58] VAE-based Feature Disentanglement for Data Augmentation and Compression in Generalized GNSS Interference Classification

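[Quick read]: This paper addresses real-time monitoring and classification of interference signals in distributed environments to improve global navigation satellite system (GNSS) performance, while ensuring low-latency communication, data privacy, and model efficiency. The key challenge is compressing machine learning (ML) models while maintaining high classification accuracy. To this end, the authors propose using variational autoencoders (VAEs) for disentanglement, extracting the essential latent features needed to classify interference accurately. By interpolating the lower-dimensional latent representations of signal power, the approach also supports data compression and data augmentation. The paper evaluates three VAE variants (vanilla, factorized, and conditional generative) on four datasets, runs extensive hyperparameter searches, and reports classification accuracy of up to 99.92% with data compression rates between 512 and 8,192.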

Link: https://arxiv.org/abs/2504.10556
Authors: Lucas Heublein,Simon Kocher,Tobias Feigl,Alexander Rügamer,Christopher Mutschler,Felix Ott
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Theory (cs.IT)
Comments: 7 pages, 9 figures

Abstract:Distributed learning and Edge AI necessitate efficient data processing, low-latency communication, decentralized model training, and stringent data privacy to facilitate real-time intelligence on edge devices while reducing dependency on centralized infrastructure and ensuring high model performance. In the context of global navigation satellite system (GNSS) applications, the primary objective is to accurately monitor and classify interferences that degrade system performance in distributed environments, thereby enhancing situational awareness. To achieve this, machine learning (ML) models can be deployed on low-resource devices, ensuring minimal communication latency and preserving data privacy. The key challenge is to compress ML models while maintaining high classification accuracy. In this paper, we propose variational autoencoders (VAEs) for disentanglement to extract essential latent features that enable accurate classification of interferences. We demonstrate that the disentanglement approach can be leveraged for both data compression and data augmentation by interpolating the lower-dimensional latent representations of signal power. To validate our approach, we evaluate three VAE variants - vanilla, factorized, and conditional generative - on four distinct datasets, including two collected in controlled indoor environments and two real-world highway datasets. Additionally, we conduct extensive hyperparameter searches to optimize performance. Our proposed VAE achieves a data compression rate ranging from 512 to 8,192 and achieves an accuracy up to 99.92%.

[AI-59] MiMu: Mitigating Multiple Shortcut Learning Behavior of Transformers

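[Quick read]: This paper addresses shortcut learning in Empirical Risk Minimization (ERM) models, which over-rely on spurious correlations between features and labels and therefore generalize less robustly. It notes that cues in real-world data are diverse and unknown, yet existing work mostly identifies or mitigates a single shortcut and does not handle several shortcuts coexisting. The authors' empirical studies show that models depend more heavily on strong shortcuts than on weak ones, which further weakens generalization.

To address these challenges, the paper proposes MiMu, a method integrated with Transformer-based ERM models that mitigates multiple shortcut learning behaviors. Its key components are a self-calibration strategy and a self-improvement strategy. Self-calibration is applied to the source model to keep it from relying on shortcuts and making overconfident predictions; self-improvement is then applied to the target model to reduce reliance on multiple shortcuts. Specifically, a random mask strategy randomly masks some attention positions so that the target model's focus is diversified rather than fixed on one region, and an adaptive attention alignment module aligns attention weights with the calibrated source model without post-hoc attention maps or supervision. Experiments on natural language processing (NLP) and computer vision (CV) tasks show that MiMu improves robust generalization.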

Link: https://arxiv.org/abs/2504.10551
Authors: Lili Zhao,Qi Liu,Wei Chen,Liyi Chen,Ruijun Sun,Min Hou,Yang Wang,Shijin Wang
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Empirical Risk Minimization (ERM) models often rely on spurious correlations between features and labels during the learning process, leading to shortcut learning behavior that undermines robustness generalization performance. Current research mainly targets identifying or mitigating a single shortcut; however, in real-world scenarios, cues within the data are diverse and unknown. In empirical studies, we reveal that the models rely to varying extents on different shortcuts. Compared to weak shortcuts, models depend more heavily on strong shortcuts, resulting in their poor generalization ability. To address these challenges, we propose MiMu, a novel method integrated with Transformer-based ERMs designed to Mitigate Multiple shortcut learning behavior, which incorporates self-calibration strategy and self-improvement strategy. In the source model, we preliminarily propose the self-calibration strategy to prevent the model from relying on shortcuts and make overconfident predictions. Then, we further design self-improvement strategy in target model to reduce the reliance on multiple shortcuts. The random mask strategy involves randomly masking partial attention positions to diversify the focus of target model other than concentrating on a fixed region. Meanwhile, the adaptive attention alignment module facilitates the alignment of attention weights to the calibrated source model, without the need for post-hoc attention maps or supervision. Finally, extensive experiments conducted on Natural Language Processing (NLP) and Computer Vision (CV) demonstrate the effectiveness of MiMu in improving robustness generalization abilities.
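The random mask strategy mentioned in the abstract can be illustrated with a small sketch. It is only a plausible reading, not the paper's code: it assumes masking is applied to raw attention scores before softmax and that at least one position is always kept. The function name `masked_softmax` and the `mask_ratio` parameter are hypothetical.

```python
import math
import random


def masked_softmax(scores, mask_ratio, rng):
    """Randomly drop a fraction of attention positions, then renormalize.

    scores: raw attention scores for one query over all key positions.
    mask_ratio: fraction of positions to mask (assumed hyperparameter).
    rng: a random.Random instance, passed in for reproducibility.
    """
    n = len(scores)
    n_mask = min(int(n * mask_ratio), n - 1)  # always keep at least one position
    masked = set(rng.sample(range(n), n_mask))
    kept = [i for i in range(n) if i not in masked]
    peak = max(scores[i] for i in kept)  # subtract the max for numerical stability
    exps = [0.0 if i in masked else math.exp(scores[i] - peak) for i in range(n)]
    total = sum(exps)
    return [e / total for e in exps]


weights = masked_softmax([2.0, 1.0, 0.5, 0.1], mask_ratio=0.5, rng=random.Random(0))
print(weights)
```

Masked positions receive zero weight, so the remaining positions share the full attention mass; repeating this with different random masks pushes the model to spread its focus rather than concentrate on one fixed region.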

[AI-60] Automated Testing of COBOL to Java Transformation

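[Quick read]: This paper addresses the problem that, when enterprise legacy code such as COBOL is automatically translated into modern languages such as Java or Python, the limitations of generative AI models mean the translated code may not faithfully reproduce the original behavior. Its key solution is an automated testing framework based on symbolic execution that checks the functional equivalence of translated Java code against the original COBOL programs. The framework generates unit tests for COBOL via symbolic execution, mocks external calls, and converts the generated tests into JUnit tests to validate semantic equivalence. The results help identify and repair discrepancies and also provide feedback for improving the AI model.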

Link: https://arxiv.org/abs/2504.10548
Authors: Sandeep Hans,Atul Kumar,Toshikai Yasue,Kouichi Ono,Saravanan Krishnan,Devika Sondhi,Fumiko Satoh,Gerald Mitchell,Sachin Kumar,Diptikalyan Saha
Affiliations: Unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments:

Abstract:Recent advances in Large Language Model (LLM) based Generative AI techniques have made it feasible to translate enterprise-level code from legacy languages such as COBOL to modern languages such as Java or Python. While the results of LLM-based automatic transformation are encouraging, the resulting code cannot be trusted to correctly translate the original code, making manual validation of translated Java code from COBOL a necessary but time-consuming and labor-intensive process. In this paper, we share our experience of developing a testing framework for IBM Watsonx Code Assistant for Z (WCA4Z) [5], an industrial tool designed for COBOL to Java translation. The framework automates the process of testing the functional equivalence of the translated Java code against the original COBOL programs in an industry context. Our framework uses symbolic execution to generate unit tests for COBOL, mocking external calls and transforming them into JUnit tests to validate semantic equivalence with translated Java. The results not only help identify and repair any detected discrepancies but also provide feedback to improve the AI model.

[AI-61] Multi-Modal Hypergraph Enhanced LLM Learning for Recommendation

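[Quick read]: This paper addresses the failure of existing Large Language Model (LLM)-based personalized recommendation methods to sufficiently explore the multi-view graph structure correlations inherent in recommendation scenarios. It proposes Hypergraph Enhanced LLM Learning for multimodal Recommendation (HeLLM), a framework that lets LLMs capture intricate higher-order semantic correlations by fusing graph-level contextual signals with sequence-level behavioral patterns. In the recommender pre-training phase, a user hypergraph uncovers shared interest preferences among users and an item hypergraph captures correlations within multimodal similarities among items; hypergraph convolution and a synergistic contrastive learning mechanism make the learned representations more distinguishable. In the LLM fine-tuning phase, the learned graph-structured embeddings are injected directly into the LLM's architecture and combined with sequential features that capture each user's chronological behavior. Graph-structured information thus serves as global context, strengthening the LLM's ability to perceive complex relational patterns and integrate multimodal information while also modeling local temporal dynamics. Experiments show that the method outperforms state-of-the-art baselines, confirming the benefit of combining hypergraph-based context with sequential user behavior.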

Link: https://arxiv.org/abs/2504.10541
Authors: Xu Guo,Tong Zhang,Yuanzhi Wang,Chenxu Wang,Fuyun Wang,Xudong Wang,Xiaoya Zhang,Xin Liu,Zhen Cui
Affiliations: Unknown
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments: 10 pages, 4 figures

Abstract:The burgeoning presence of Large Language Models (LLM) is propelling the development of personalized recommender systems. Most existing LLM-based methods fail to sufficiently explore the multi-view graph structure correlations inherent in recommendation scenarios. To this end, we propose a novel framework, Hypergraph Enhanced LLM Learning for multimodal Recommendation (HeLLM), designed to equip LLMs with the capability to capture intricate higher-order semantic correlations by fusing graph-level contextual signals with sequence-level behavioral patterns. In the recommender pre-training phase, we design a user hypergraph to uncover shared interest preferences among users and an item hypergraph to capture correlations within multimodal similarities among items. The hypergraph convolution and synergistic contrastive learning mechanism are introduced to enhance the distinguishability of learned representations. In the LLM fine-tuning phase, we inject the learned graph-structured embeddings directly into the LLM’s architecture and integrate sequential features capturing each user’s chronological behavior. This process enables hypergraphs to leverage graph-structured information as global context, enhancing the LLM’s ability to perceive complex relational patterns and integrate multimodal information, while also modeling local temporal dynamics. Extensive experiments demonstrate the superiority of our proposed method over state-of-the-art baselines, confirming the advantages of fusing hypergraph-based context with sequential user behavior in LLMs for recommendation.

[AI-62] Distilling Transitional Pattern to Large Language Models for Multimodal Session-based Recommendation

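[Quick read]: This paper addresses two core problems in LLM-enhanced Multimodal Session-based Recommendation (MSBR): i) how to obtain a Large Language Model's (LLM's) cognition of both transitional patterns and inherent multimodal knowledge, and ii) how to align both kinds of features in one unified LLM, minimizing distribution discrepancy while maximizing representation utility. It proposes TPAD, a multimodal LLM-enhanced framework that extends the distillation paradigm to decouple and align transitional patterns for MSBR. TPAD builds a parallel Knowledge-MLLM and Transfer-MLLM: the former interprets features that reflect item knowledge, and the latter extracts transition-aware features underlying sessions. A transitional pattern alignment module based on mutual information estimation theory unites the two MLLMs, alleviating distribution discrepancy and distilling transitional patterns into modal representations. Experiments on real-world datasets demonstrate the framework's effectiveness.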

Link: https://arxiv.org/abs/2504.10538
Authors: Jiajie Su,Qiyong Zhong,Yunshan Ma,Weiming Liu,Chaochao Chen,Xiaolin Zheng,Jianwei Yin,Tat-Seng Chua
Affiliations: Unknown
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments:

Abstract:Session-based recommendation (SBR) predicts the next item based on anonymous sessions. Traditional SBR explores user intents based on ID collaborations or auxiliary content. To further alleviate data sparsity and cold-start issues, recent Multimodal SBR (MSBR) methods utilize simplistic pre-trained models for modality learning but have limitations in semantic richness. Considering semantic reasoning abilities of Large Language Models (LLM), we focus on the LLM-enhanced MSBR scenario in this paper, which leverages LLM cognition for comprehensive multimodal representation generation, to enhance downstream MSBR. Tackling this problem faces two challenges: i) how to obtain LLM cognition on both transitional patterns and inherent multimodal knowledge, ii) how to align both features into one unified LLM, minimize discrepancy while maximizing representation utility. To this end, we propose a multimodal LLM-enhanced framework TPAD, which extends a distillation paradigm to decouple and align transitional patterns for promoting MSBR. TPAD establishes parallel Knowledge-MLLM and Transfer-MLLM, where the former interprets item knowledge-reflected features and the latter extracts transition-aware features underneath sessions. A transitional pattern alignment module harnessing mutual information estimation theory unites two MLLMs, alleviating distribution discrepancy and distilling transitional patterns into modal representations. Extensive experiments on real-world datasets demonstrate the effectiveness of our framework.

[AI-63] HeteRAG: A Heterogeneous Retrieval-augmented Generation Framework with Decoupled Knowledge Representations

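[Quick read]: This paper addresses the performance limits of existing Retrieval-Augmented Generation (RAG) methods that use a single representation of knowledge chunks for both retrieval and generation. The retrieval step benefits from comprehensive information for accuracy, whereas the generation step prefers shorter chunks to reduce redundancy and improve efficiency, and existing methods do not distinguish between these needs. The paper proposes HeteRAG, a heterogeneous RAG framework whose key idea is to decouple the chunk representations used in the two stages: short chunks serve generation, while the corresponding chunk together with multi-granular contextual information improves retrieval accuracy. An adaptive prompt tuning method further adapts the retrieval model to the heterogeneous retrieval-generation pipeline. Experiments show that HeteRAG achieves significant improvements over baselines.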

Link: https://arxiv.org/abs/2504.10529
Authors: Peiru Yang,Xintian Li,Zhiyang Hu,Jiapeng Wang,Jinhua Yin,Huili Wang,Lizhi He,Shuai Yang,Shangguang Wang,Yongfeng Huang,Tao Qi
Affiliations: Unknown
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments: 10 pages, 5 figures

Abstract:Retrieval-augmented generation (RAG) methods can enhance the performance of LLMs by incorporating retrieved knowledge chunks into the generation process. In general, the retrieval and generation steps usually have different requirements for these knowledge chunks. The retrieval step benefits from comprehensive information to improve retrieval accuracy, whereas excessively long chunks may introduce redundant contextual information, thereby diminishing both the effectiveness and efficiency of the generation process. However, existing RAG methods typically employ identical representations of knowledge chunks for both retrieval and generation, resulting in suboptimal performance. In this paper, we propose a heterogeneous RAG framework (HeteRAG) that decouples the representations of knowledge chunks for retrieval and generation, thereby enhancing the LLMs in both effectiveness and efficiency. Specifically, we utilize short chunks to represent knowledge to adapt the generation step and utilize the corresponding chunk with its contextual information from multi-granular views to enhance retrieval accuracy. We further introduce an adaptive prompt tuning method for the retrieval model to adapt the heterogeneous retrieval augmented generation process. Extensive experiments demonstrate that HeteRAG achieves significant improvements compared to baselines.
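The decoupling idea can be made concrete with a toy sketch: each chunk is indexed under a longer retrieval view but handed to the generator in its short form. Everything here is an illustrative assumption rather than the paper's pipeline: the document title stands in for "multi-granular context", token overlap stands in for a learned retrieval model, and the names `build_index` and `retrieve` are hypothetical.

```python
def tokens(text):
    """Lowercased whitespace tokens; a crude stand-in for a learned embedding."""
    return set(text.lower().split())


def build_index(documents):
    """Pair each short chunk with a retrieval view that adds document-level context.

    documents: dict mapping a document title to its list of short chunks.
    """
    index = []
    for title, chunks in documents.items():
        for chunk in chunks:
            retrieval_view = f"{title} {chunk}"  # context is used for matching only
            index.append((tokens(retrieval_view), chunk))
    return index


def retrieve(index, query, k=1):
    """Rank by overlap with the retrieval view, but return only the short chunks."""
    q = tokens(query)
    ranked = sorted(index, key=lambda entry: len(q & entry[0]), reverse=True)
    return [short_chunk for _, short_chunk in ranked[:k]]


documents = {
    "Eiffel Tower": ["It is located in Paris.", "It was completed in 1889."],
    "Banana": ["It is rich in potassium."],
}
index = build_index(documents)
top = retrieve(index, "when was the Eiffel Tower completed")
print(top)
```

The short chunk never mentions the Eiffel Tower, so it is the added context that lets the query match it, while the generator still receives only the concise sentence.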

[AI-64] Explainable Artificial Intelligence techniques for interpretation of food datasets: a review

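[Quick read]: This paper addresses the lack of reliability and interpretability of complex artificial intelligence (AI) models in Food Engineering. As demands on food quality control grow, AI models become increasingly complex, and their opaque decision-making hinders adoption. The key remedy discussed is eXplainable AI (XAI): techniques such as SHAP (Shapley Additive Explanations) and Grad-CAM (Gradient-weighted Class Activation Mapping) reveal which features, such as spectral wavelengths or image regions, contribute most to a prediction, which improves transparency and helps developers, users, and quality control inspectors understand and verify AI-generated assessments. The survey also presents a taxonomy organized by data type and explanation method, and discusses trends, challenges, and opportunities to encourage the adoption of XAI in Food Engineering.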

Link: https://arxiv.org/abs/2504.10527
Authors: Leonardo Arrighi,Ingrid Alves de Moraes,Marco Zullich,Michele Simonato,Douglas Fernandes Barbin,Sylvio Barbon Junior
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments: 33 pages, 8 figures, 5 tables

Abstract:Artificial Intelligence (AI) has become essential for analyzing complex data and solving highly-challenging tasks. It is being applied across numerous disciplines beyond computer science, including Food Engineering, where there is a growing demand for accurate and trustworthy predictions to meet stringent food quality standards. However, this requires increasingly complex AI models, raising reliability concerns. In response, eXplainable AI (XAI) has emerged to provide insights into AI decision-making, aiding model interpretation by developers and users. Nevertheless, XAI remains underutilized in Food Engineering, limiting model reliability. For instance, in food quality control, AI models using spectral imaging can detect contaminants or assess freshness levels, but their opaque decision-making process hinders adoption. XAI techniques such as SHAP (Shapley Additive Explanations) and Grad-CAM (Gradient-weighted Class Activation Mapping) can pinpoint which spectral wavelengths or image regions contribute most to a prediction, enhancing transparency and aiding quality control inspectors in verifying AI-generated assessments. This survey presents a taxonomy for classifying food quality research using XAI techniques, organized by data types and explanation methods, to guide researchers in choosing suitable approaches. We also highlight trends, challenges, and opportunities to encourage the adoption of XAI in Food Engineering.

[AI-65] Integrating Emotion Distribution Networks and Textual Message Analysis for X User Emotional State Classification

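[Quick read]: This paper addresses the difficulty that traditional social media sentiment analysis, which relies only on textual content, has in capturing users' complex emotional dynamics around major events such as a presidential election. Its key contribution is a hybrid method that combines textual analysis, profile examination, follower analysis, and emotion dissemination patterns. A "communication tree" model maps user interactions; users' bios and interests are combined with message text to enrich the analysis; and influential figures among users' followers are identified and categorized by topic. Incorporating emotion distribution patterns raises accuracy by 12% over conventional methods, and considering user profiles raises it by 15%.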

Link: https://arxiv.org/abs/2504.10521
Authors: Pardis Moradbeiki,Mohammad Ali Zare Chahooki
Affiliations: Unknown
Categories: Social and Information Networks (cs.SI); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:As the popularity and reach of social networks continue to surge, a vast reservoir of opinions and sentiments across various subjects inundates these platforms. Among these, X social network (formerly Twitter) stands as a juggernaut, boasting approximately 420 million active users. Extracting users’ emotional and mental states from their expressed opinions on social media has become a common pursuit. While past methodologies predominantly focused on the textual content of messages to analyze user sentiment, the interactive nature of these platforms suggests a deeper complexity. This study employs hybrid methodologies, integrating textual analysis, profile examination, follower analysis, and emotion dissemination patterns. Initially, user interactions are leveraged to refine emotion classification within messages, encompassing exchanges where users respond to each other. Introducing the concept of a communication tree, a model is extracted to map these interactions. Subsequently, users’ bios and interests from this tree are juxtaposed with message text to enrich analysis. Finally, influential figures are identified among users’ followers in the communication tree, categorized into different topics to gauge interests. The study highlights that traditional sentiment analysis methodologies, focusing solely on textual content, are inadequate in discerning sentiment towards significant events, notably the presidential election. Comparative analysis with conventional methods reveals a substantial improvement in accuracy with the incorporation of emotion distribution patterns and user profiles. The proposed approach yields a 12% increase in accuracy with emotion distribution patterns and a 15% increase when considering user profiles, underscoring its efficacy in capturing nuanced sentiment dynamics.

[AI-66] Beyond Reproducibility: Advancing Zero-shot LLM Reranking Efficiency with Setwise Insertion

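[Quick read]: This paper addresses the trade-off between efficiency and effectiveness among Pointwise, Pairwise, Listwise, and Setwise approaches to zero-shot document reranking with Large Language Models (LLMs). Its key contribution is Setwise Insertion, which builds on Setwise prompting and uses the initial document ranking as prior knowledge, reducing unnecessary comparisons and uncertainty by focusing on candidates more likely to improve the ranking. Across multiple LLM architectures (Flan-T5, Vicuna, and Llama), Setwise Insertion yields a 31% reduction in query time, a 23% reduction in model inferences, and a slight improvement in reranking effectiveness compared with the original Setwise method, which shows the practical value of incorporating prior ranking knowledge into Setwise prompting.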

Link: https://arxiv.org/abs/2504.10509
Authors: Jakub Podolak,Leon Peric,Mina Janicijevic,Roxana Petcu
Affiliations: Unknown
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments:

Abstract:This study presents a comprehensive reproducibility and extension analysis of the Setwise prompting methodology for zero-shot ranking with Large Language Models (LLMs), as proposed by Zhuang et al. We evaluate its effectiveness and efficiency compared to traditional Pointwise, Pairwise, and Listwise approaches in document ranking tasks. Our reproduction confirms the findings of Zhuang et al., highlighting the trade-offs between computational efficiency and ranking effectiveness in Setwise methods. Building on these insights, we introduce Setwise Insertion, a novel approach that leverages the initial document ranking as prior knowledge, reducing unnecessary comparisons and uncertainty by focusing on candidates more likely to improve the ranking results. Experimental results across multiple LLM architectures (Flan-T5, Vicuna, and Llama) show that Setwise Insertion yields a 31% reduction in query time, a 23% reduction in model inferences, and a slight improvement in reranking effectiveness compared to the original Setwise method. These findings highlight the practical advantage of incorporating prior ranking knowledge into Setwise prompting for efficient and accurate zero-shot document reranking.

[AI-67] Poly-Vector Retrieval: Reference and Content Embeddings for Legal Documents

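[Quick read]: This paper addresses the limitations of conventional Retrieval-Augmented Generation (RAG) in the legal domain when queries refer to norms by label or nickname (for example, Article 5 of the Constitution, or the Consumer Defense Code (CDC)) or when texts rely on explicit cross-references (for example, "pursuant to Article 34"). In these cases, RAG approaches that rely solely on semantic embeddings of text often fail to retrieve the referenced content. The key solution is Poly-Vector Retrieval, which assigns multiple distinct embeddings to each legal provision: one captures the content (the full text), another captures the label (the identifier or proper name), and optional additional embeddings capture alternative denominations. Inspired by Frege's distinction between Sense and Reference, the method treats labels, identifiers, and reference markers as rigid designators and content embeddings as carriers of semantic substance. Experiments on the Brazilian Federal Constitution show that Poly-Vector Retrieval significantly improves retrieval accuracy for label-centric queries and has the potential to resolve internal and external cross-references, without compromising performance on purely semantic queries. The study also discusses the philosophical and practical implications of explicitly separating reference from content in vector embeddings and proposes extending the approach to broader legal datasets and other domains with explicit reference identifiers.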

Link: https://arxiv.org/abs/2504.10508
Authors: João Alberto de Oliveira Lima
Affiliations: Unknown
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments: 39 pages, 5 figures

Abstract:Retrieval-Augmented Generation (RAG) has emerged as an effective paradigm for generating contextually accurate answers by integrating Large Language Models (LLMs) with retrieval mechanisms. However, in legal contexts, users frequently reference norms by their labels or nicknames (e.g., Article 5 of the Constitution or Consumer Defense Code (CDC)), rather than by their content, posing challenges for traditional RAG approaches that rely solely on semantic embeddings of text. Furthermore, legal texts themselves heavily rely on explicit cross-references (e.g., “pursuant to Article 34”) that function as pointers. Both scenarios pose challenges for traditional RAG approaches that rely solely on semantic embeddings of text, often failing to retrieve the necessary referenced content. This paper introduces Poly-Vector Retrieval, a method assigning multiple distinct embeddings to each legal provision: one embedding captures the content (the full text), another captures the label (the identifier or proper name), and optionally additional embeddings capture alternative denominations. Inspired by Frege’s distinction between Sense and Reference, this poly-vector retrieval approach treats labels, identifiers and reference markers as rigid designators and content embeddings as carriers of semantic substance. Experiments on the Brazilian Federal Constitution demonstrate that Poly-Vector Retrieval significantly improves retrieval accuracy for label-centric queries and potential to resolve internal and external cross-references, without compromising performance on purely semantic queries. The study discusses philosophical and practical implications of explicitly separating reference from content in vector embeddings and proposes future research directions for applying this approach to broader legal datasets and other domains characterized by explicit reference identifiers.
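A toy version of the multi-embedding index helps show why label queries start to work. This is a sketch under loud assumptions: a bag-of-words vector replaces a real embedding model, the provision texts are made-up placeholders rather than actual legal text, and the names `embed`, `build_index`, and `search` are hypothetical.

```python
import math
from collections import Counter


def embed(text):
    """Bag-of-words vector; stands in for a real embedding model."""
    return Counter(text.lower().split())


def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def build_index(provisions):
    """Store one vector per content, label, and alias, all pointing to the same provision."""
    index = []
    for p in provisions:
        for text in [p["content"], p["label"], *p.get("aliases", [])]:
            index.append((embed(text), p["id"]))
    return index


def search(index, query):
    """Return the provision id whose best-matching vector is closest to the query."""
    q = embed(query)
    best = max(index, key=lambda entry: cosine(q, entry[0]))
    return best[1]


# Placeholder provisions for illustration only.
provisions = [
    {"id": "cf-art-5", "label": "article 5 of the constitution",
     "content": "all persons are equal before the law without distinction of any kind"},
    {"id": "cdc", "label": "consumer defense code",
     "content": "provides for consumer protection", "aliases": ["cdc"]},
]
index = build_index(provisions)
print(search(index, "article 5"), search(index, "cdc"), search(index, "equal before the law"))
```

With a content-only index, the queries "article 5" and "cdc" would share no terms with any stored text; adding label and alias vectors gives them something to match, while the content vector continues to serve the purely semantic query.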

[AI-68] Leveraging Auto-Distillation and Generative Self-Supervised Learning in Residual Graph Transformers for Enhanced Recommender Systems

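[Quick read]: This paper addresses insufficient data augmentation and interaction modeling in recommender systems by combining generative self-supervised learning (SSL) with a Residual Graph Transformer. The key idea is to use pertinent pretext tasks, automated through rationale-aware SSL, to distill clearer descriptions of how users and items interact. A topology-aware transformer captures global context, residual connections improve graph representation learning, and an auto-distillation process refines the self-supervised signals to uncover consistent collaborative rationales.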

Link: https://arxiv.org/abs/2504.10500
Authors: Eya Mhedhbi,Youssef Mourchid,Alice Othmani
Affiliations: Unknown
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:This paper introduces a cutting-edge method for enhancing recommender systems through the integration of generative self-supervised learning (SSL) with a Residual Graph Transformer. Our approach emphasizes the importance of superior data enhancement through the use of pertinent pretext tasks, automated through rationale-aware SSL to distill clear ways of how users and items interact. The Residual Graph Transformer incorporates a topology-aware transformer for global context and employs residual connections to improve graph representation learning. Additionally, an auto-distillation process refines self-supervised signals to uncover consistent collaborative rationales. Experimental evaluations on multiple datasets demonstrate that our approach consistently outperforms baseline methods.

[AI-69] CCSK: Cognitive Convection of Self-Knowledge Based Retrieval Augmentation for Large Language Models

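[Quick read]: This paper addresses a core challenge for Large Language Models (LLMs) that bring in external knowledge through Retrieval-Augmented Generation (RAG) in question answering: balancing the LLM's inherent self-knowledge against external information retrieval (IR). Existing threshold-based methods apply a one-dimensional static mechanism with a single criterion, so their IR decisions may be irrelevant to the LLM's response on difficult queries. To alleviate this, the paper proposes Cognitive Convection of Self-Knowledge (CCSK), which implements a dynamic joint decision process through a Siamese Network module and a Response Quality Model. The Siamese Network computes the cosine similarity between the current query and historical queries; the Response Quality Model evaluates the LLM's responses with LightGBM; and the final decision is derived from the outputs of the two modules together with text features fused by a multi-head attention mechanism. Experiments on real-world datasets show that CCSK significantly improves the model's effectiveness in information retrieval.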

Link: https://arxiv.org/abs/2504.10498
Authors: Jianling Lu,Mingqi Lv
Affiliations: Unknown
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments:

Abstract:The performance of large language models (LLMs) in QA task increased substantially through Retrieval-Augmented Generation (RAG) which brings in external knowledge. However, the main difficulty lies in balancing the inherent self-knowledge of LLMs with external information retrieval (IR). The current threshold-based methods apply one-dimensional static mechanisms with single criterion. As a result, their IR decisions might be irrelevant to the LLMs’ response under difficult queries. To alleviate this problem, we propose Cognitive Convection of Self-Knowledge (CCSK). Different from traditional methods that maintain single fixed IR activation criteria, CCSK implements a dynamic joint decision process via a Siamese Network module and a Response Quality Model. The Siamese Network calculates the cosine similarity between the current query and the historical queries. The Response Quality Model evaluates the responses of LLMs through LightGBM. The final decision of the CCSK is derived from the outputs of the two modules, as well as text features fused using a multi-head attention mechanism. Extensive experiments on real-world datasets show that CCSK significantly enhances the model’s effectiveness in information retrieval.
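The joint decision can be approximated with a deliberately simplified sketch. The paper combines a Siamese Network, a LightGBM quality model, and attention-fused text features; the code below replaces all of that with a fixed weighted rule over two signals, purely to show the shape of the decision. The function `should_retrieve`, its weights, and the assumption that the history holds queries the LLM previously answered well without retrieval are all hypothetical.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def should_retrieve(query_vec, answered_history, quality_score,
                    sim_weight=0.5, threshold=0.5):
    """Decide whether to call external retrieval.

    query_vec: embedding of the current query (any fixed-length list of floats).
    answered_history: embeddings of past queries assumed to have been answered
                      well from the LLM's own knowledge.
    quality_score: estimated quality of the LLM's draft answer, in [0, 1].
    """
    max_sim = max((cosine(query_vec, h) for h in answered_history), default=0.0)
    self_knowledge = sim_weight * max_sim + (1 - sim_weight) * quality_score
    return self_knowledge < threshold  # low confidence in self-knowledge triggers IR


history = [[1.0, 0.0]]
familiar = should_retrieve([1.0, 0.0], history, quality_score=0.9)
unfamiliar = should_retrieve([0.0, 1.0], history, quality_score=0.2)
print(familiar, unfamiliar)
```

Compared with a single static threshold on one signal, even this crude rule lets a familiar query with a strong draft answer skip retrieval while an unfamiliar query with a weak draft triggers it.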

[AI-70] Exploring Generative AI Techniques in Government: A Case Study

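[Quick read]: This paper addresses the automation of performance measurement, data management, and insight reporting at the National Research Council of Canada (NRC). Using the development of the intelligent agent Pubbie as a case study, it explores integrating Generative AI (GenAI) techniques, notably Large Language Models (LLMs), into the NRC's daily operations for performance excellence. The solution relies on techniques such as LLM orchestration and semantic embedding via RoBERTa, combined with strategic fine-tuning and few-shot learning to infuse domain knowledge at an affordable cost. Pubbie's user-friendly interface accepts natural language queries and supports simple file upload and download, greatly reducing manual effort and accessibility barriers.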

Link: https://arxiv.org/abs/2504.10497
Authors: Sunyi Liu,Mengzhe Geng,Rebecca Hart
Affiliations: Unknown
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Multiagent Systems (cs.MA); Systems and Control (eess.SY)
Comments: In submission to IEEE Intelligent Systems

Abstract:The swift progress of Generative Artificial intelligence (GenAI), notably Large Language Models (LLMs), is reshaping the digital landscape. Recognizing this transformative potential, the National Research Council of Canada (NRC) launched a pilot initiative to explore the integration of GenAI techniques into its daily operation for performance excellence, where 22 projects were launched in May 2024. Within these projects, this paper presents the development of the intelligent agent Pubbie as a case study, targeting the automation of performance measurement, data management and insight reporting at the NRC. Cutting-edge techniques are explored, including LLM orchestration and semantic embedding via RoBERTa, while strategic fine-tuning and few-shot learning approaches are incorporated to infuse domain knowledge at an affordable cost. The user-friendly interface of Pubbie allows general government users to input queries in natural language and easily upload or download files with a simple button click, greatly reducing manual efforts and accessibility barriers.

[AI-71] Roamify: Designing and Evaluating an LLM Based Google Chrome Extension for Personalised Itinerary Planning

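[Quick read]: This paper aims to ease the complex process of travel planning by developing Roamify, an AI-powered travel assistant that generates personalised itineraries. Two design considerations are key: D1) a web-scraping method gathers up-to-date articles about destinations from various blog sources, which significantly improves itinerary suggestions; D2) user preferences are used to create customised travel experiences, together with a recommendation system that adjusts the itinerary as user needs change. User surveys show a preference for AI-powered mediums over existing methods across all age groups, which suggests that Roamify could meet a real need.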

Link: https://arxiv.org/abs/2504.10489
Authors: Vikranth Udandarao, Noel Abraham Tiju, Muthuraj Vairamuthu, Harsh Mistry, Dhruv Kumar
Affiliation: Unknown
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: for code implementation, check this https URL

Abstract:In this paper, we present Roamify, an Artificial Intelligence powered travel assistant that aims to ease the process of travel planning. We have tested and used multiple Large Language Models like Llama and T5 to generate personalised itineraries per user preferences. Results from user surveys highlight the preference for AI powered mediums over existing methods to help in travel planning across all user age groups. These results firmly validate the potential need of such a travel assistant. We highlight the two primary design considerations for travel assistance: D1) incorporating a web-scraping method to gather up-to-date news articles about destinations from various blog sources, which significantly improves our itinerary suggestions, and D2) utilising user preferences to create customised travel experiences along with a recommendation system which changes the itinerary according to the user needs. Our findings suggest that Roamify has the potential to improve and simplify how users across multiple age groups plan their travel experiences.

[AI-72] EthosGPT: Mapping Human Value Diversity to Advance Sustainable Development Goals (SDGs)

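[Quick read]: This paper addresses the risk that large language models (LLMs), by processing data at massive scale, homogenize human values, a risk comparable to the way biodiversity loss undermines ecological resilience. The key contribution is EthosGPT, an open-source framework rooted in the ancient Greek concept of "ethos", which covers both individual character and the shared moral fabric of communities. Using international survey data, prompt-based assessments, and comparative statistical analyses, EthosGPT maps and evaluates LLMs against a global scale of human values, revealing their adaptability and biases across regions and cultures. It offers actionable recommendations for developing inclusive LLMs, such as diversifying training data and preserving endangered cultural heritage to ensure representation in AI systems. These contributions align with the United Nations Sustainable Development Goals (SDGs), especially SDG 10 (Reduced Inequalities), SDG 11.4 (Cultural Heritage Preservation), and SDG 16 (Peace, Justice and Strong Institutions), promoting AI systems that are technically robust and ethically inclusive and advancing value plurality as a cornerstone of sustainable and equitable futures.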

Link: https://arxiv.org/abs/2504.09861
Authors: Luyao Zhang
Affiliation: Unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); General Economics (econ.GN); Applications (stat.AP)
Comments:

Abstract:Large language models (LLMs) are transforming global decision-making and societal systems by processing diverse data at unprecedented scales. However, their potential to homogenize human values poses critical risks, similar to biodiversity loss undermining ecological resilience. Rooted in the ancient Greek concept of ethos, meaning both individual character and the shared moral fabric of communities, EthosGPT draws on a tradition that spans from Aristotle’s virtue ethics to Adam Smith’s moral sentiments as the ethical foundation of economic cooperation. These traditions underscore the vital role of value diversity in fostering social trust, institutional legitimacy, and long-term prosperity. EthosGPT addresses the challenge of value homogenization by introducing an open-source framework for mapping and evaluating LLMs within a global scale of human values. Using international survey data on cultural indices, prompt-based assessments, and comparative statistical analyses, EthosGPT reveals both the adaptability and biases of LLMs across regions and cultures. It offers actionable insights for developing inclusive LLMs, such as diversifying training data and preserving endangered cultural heritage to ensure representation in AI systems. These contributions align with the United Nations Sustainable Development Goals (SDGs), especially SDG 10 (Reduced Inequalities), SDG 11.4 (Cultural Heritage Preservation), and SDG 16 (Peace, Justice and Strong Institutions). Through interdisciplinary collaboration, EthosGPT promotes AI systems that are both technically robust and ethically inclusive, advancing value plurality as a cornerstone for sustainable and equitable futures.

[AI-73] Greedy Restart Schedules: A Baseline for Dynamic Algorithm Selection on Numerical Black-box Optimization Problems GECCO2025

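[Quick read]: This paper addresses the fact that, in optimization, different solvers perform differently on particular problem instances, and develops a meta-algorithmic strategy to extract the most performance from a set of configurable optimizers. The key innovation is a simple greedy restart scheduling method that repeatedly selects the algorithm performing best on the distribution of currently unsolved problems, yielding a problem-independent solver schedule. The approach uses data-driven techniques to optimize the restart strategy, closes much of the gap between the single best solver and the virtual best solver, and provides a strong baseline for more complex dynamic algorithm selection models.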

Link: https://arxiv.org/abs/2504.11440
Authors: Lennart Schäpermeier
Affiliation: Unknown
Categories: Optimization and Control (math.OC); Artificial Intelligence (cs.AI)
Comments: Author version. Accepted as full paper to be presented at the GECCO 2025 conference, July 14-18, Málaga, Spain. (DOI https://doi.org/10.1145/3712256.3726408 )

Abstract:In many optimization domains, there are multiple different solvers that contribute to the overall state-of-the-art, each performing better on some, and worse on other types of problem instances. Meta-algorithmic approaches, such as instance-based algorithm selection, configuration and scheduling, aim to close this gap by extracting the most performance possible from a set of (configurable) optimizers. In this context, the best performing individual algorithms are often hand-crafted hybrid heuristics which perform many restarts of fast local optimization approaches. However, data-driven techniques to create optimized restart schedules have not yet been extensively studied. Here, we present a simple scheduling approach that iteratively selects the algorithm performing best on the distribution of unsolved training problems at time of selection, resulting in a problem-independent solver schedule. We demonstrate our approach using well-known optimizers from numerical black-box optimization on the BBOB testbed, bridging much of the gap between single and virtual best solver from the original portfolio across various evaluation protocols. Our greedy restart schedule presents a powerful baseline for more complex dynamic algorithm selection models.
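
The greedy selection rule described in this abstract can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: it assumes a hypothetical table `success_prob[a, p]` giving the probability that one run of algorithm `a` solves training problem `p`, and it ignores run-time budgets, which a real schedule would have to account for.

```python
import numpy as np

def greedy_restart_schedule(success_prob, length):
    """Build a restart schedule greedily.

    success_prob[a, p] is an assumed (hypothetical) probability that a single
    run of algorithm a solves training problem p.
    """
    # Probability that each training problem is still unsolved so far.
    unsolved = np.ones(success_prob.shape[1])
    schedule = []
    for _ in range(length):
        # Expected number of still-unsolved problems each algorithm would solve.
        gains = success_prob @ unsolved
        best = int(np.argmax(gains))
        schedule.append(best)
        # Update the unsolved mass after running the chosen algorithm once.
        unsolved *= 1.0 - success_prob[best]
    return schedule, unsolved

# Toy portfolio: two specialists and one generalist (illustrative numbers only).
success_prob = np.array([
    [0.9, 0.9, 0.0, 0.0],
    [0.0, 0.0, 0.8, 0.1],
    [0.3, 0.3, 0.3, 0.3],
])
schedule, unsolved = greedy_restart_schedule(success_prob, 3)
print(schedule)  # order in which algorithms are run
```

On this toy table the schedule runs the first specialist, then the second, then the generalist, because each pick is judged only on the problems that earlier runs are likely to have left unsolved.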

[AI-74] Respiratory Inhaler Sound Event Classification Using Self-Supervised Learning

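[Quick read]: This paper addresses the low adherence of asthma patients to correct handheld inhaler technique by automatically classifying inhaler sounds to assess medication adherence. Existing classification models are typically trained on data from a specific inhaler type and have not been shown to generalize to sounds from other inhalers. To address this, the paper pre-trains and fine-tunes the wav2vec 2.0 self-supervised learning model on inhaler sounds, and shows that further fine-tuning the generic model on a small amount of data from a target inhaler is a promising way to adapt it across devices. The study is also the first to demonstrate the potential of smartwatches as assistive technology for personalized monitoring of inhaler adherence.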

Link: https://arxiv.org/abs/2504.11246
Authors: Davoud Shariat Panah, Alessandro N Franciosi, Cormac McCarthy, Andrew Hines
Affiliation: Unknown
Categories: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted at the IEEE EMBC 2025 Conference

Abstract:Asthma is a chronic respiratory condition that affects millions of people worldwide. While this condition can be managed by administering controller medications through handheld inhalers, clinical studies have shown low adherence to the correct inhaler usage technique. Consequently, many patients may not receive the full benefit of their medication. Automated classification of inhaler sounds has recently been studied to assess medication adherence. However, the existing classification models were typically trained using data from specific inhaler types, and their ability to generalize to sounds from different inhalers remains unexplored. In this study, we adapted the wav2vec 2.0 self-supervised learning model for inhaler sound classification by pre-training and fine-tuning this model on inhaler sounds. The proposed model shows a balanced accuracy of 98% on a dataset collected using a dry powder inhaler and smartwatch device. The results also demonstrate that re-finetuning this model on minimal data from a target inhaler is a promising approach to adapting a generic inhaler sound classification model to a different inhaler device and audio capture hardware. This is the first study in the field to demonstrate the potential of smartwatches as assistive technologies for the personalized monitoring of inhaler adherence using machine learning models.
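
The abstract reports balanced accuracy rather than plain accuracy. As a reminder of why that matters for imbalanced sound-event classes, here is a small sketch that computes balanced accuracy as the mean of per-class recall. The labels are made up for illustration and are not data from the paper.

```python
import numpy as np

def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, so rare classes count as much as common ones."""
    classes = np.unique(y_true)
    recalls = [np.mean(y_pred[y_true == c] == c) for c in classes]
    return float(np.mean(recalls))

# Hypothetical labels: class 0 is frequent, class 1 is rare.
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
y_pred = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 0])

plain = float(np.mean(y_true == y_pred))
balanced = balanced_accuracy(y_true, y_pred)
print(plain, balanced)
```

Here plain accuracy looks high because the frequent class dominates, while balanced accuracy exposes that half of the rare-class events were missed.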

[AI-75] Fine-Tuning Large Language Models on Quantum Optimization Problems for Circuit Generation

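[Quick read]: This paper addresses how to use large language models (LLMs) to automatically generate quantum circuits at scale, a challenge that remains largely unexplored in quantum computing. The key is to fine-tune pre-trained LLMs while injecting quantum computing domain knowledge, and to build an end-to-end pipeline that produces parameterized quantum circuits for optimization problems. Specifically, the authors prepared 14,000 quantum circuits covering a substantial part of the quantum optimization landscape: 12 optimization problem instances and their circuits optimized with the quantum approximate optimization algorithm (QAOA), the variational quantum eigensolver (VQE), and adaptive VQE. The paper shows that the fine-tuned LLMs generate syntactically correct parameterized quantum circuits, and that the generated parameters are better than random in terms of optimized expectation values and distributions, with the fine-tuned LLM outperforming state-of-the-art models. The generated circuits and initial parameters can serve as a starting point for further optimization, for example as templates in quantum machine learning or as benchmarks for compilers and hardware.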

Link: https://arxiv.org/abs/2504.11109
Authors: Linus Jern, Valter Uotila, Cong Yu, Bo Zhao
Affiliation: Unknown
Categories: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI)
Comments: 12 pages, 8 figures, 3 tables

Abstract:Large language models (LLMs) have achieved remarkable outcomes in addressing complex problems, including math, coding, and analyzing large amounts of scientific reports. Yet few works have explored the potential of LLMs in quantum computing. The most challenging problem is how to leverage LLMs to automatically generate quantum circuits at a large scale. In this paper, we address such a challenge by fine-tuning LLMs and injecting the domain-specific knowledge of quantum computing. In particular, we investigate the mechanisms to generate training data sets and construct the end-to-end pipeline to fine-tune pre-trained LLMs that produce parameterized quantum circuits for optimization problems. We have prepared 14,000 quantum circuits covering a substantial part of the quantum optimization landscape: 12 optimization problem instances and their optimized QAOA, VQE, and adaptive VQE circuits. The fine-tuned LLMs can construct syntactically correct parametrized quantum circuits in the most recent OpenQASM 3.0. We have evaluated the quality of the parameters by comparing them to the optimized expectation values and distributions. Our evaluation shows that the fine-tuned LLM outperforms state-of-the-art models and that the parameters are better than random. The LLM-generated parametrized circuits and initial parameters can be used as a starting point for further optimization, e.g., as templates in quantum machine learning and as benchmarks for compilers and hardware.

[AI-76] AI-guided Antibiotic Discovery Pipeline from Target Selection to Compound Identification

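[Quick read]: This paper addresses antibiotic resistance, a growing global health crisis that demands new therapeutic strategies targeting novel bacterial mechanisms. Its core contribution is an end-to-end, AI-guided antibiotic discovery pipeline that spans target identification to compound realization. The approach uses structure-based clustering across the predicted proteomes of multiple pathogens to identify conserved, essential, and non-human-homologous targets, and systematically evaluates six leading 3D-structure-aware generative models on usability, chemical validity, and biological relevance. Rigorous post-processing filters and commercial analogue searches then reduce more than 100,000 generated compounds to a focused, synthesizable set. The results highlight DeepBlock and TamGen as top performers across diverse criteria and reveal critical trade-offs between model complexity, usability, and output quality. The work provides a comparative benchmark and blueprint for deploying AI in early-stage antibiotic development.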

Link: https://arxiv.org/abs/2504.11091
Authors: Maximilian G. Schuh, Joshua Hesse, Stephan A. Sieber
Affiliation: Unknown
Categories: Biomolecules (q-bio.BM); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 12 pages, preprint

Abstract:Antibiotic resistance presents a growing global health crisis, demanding new therapeutic strategies that target novel bacterial mechanisms. Recent advances in protein structure prediction and machine learning-driven molecule generation offer a promising opportunity to accelerate drug discovery. However, practical guidance on selecting and integrating these models into real-world pipelines remains limited. In this study, we develop an end-to-end, artificial intelligence-guided antibiotic discovery pipeline that spans target identification to compound realization. We leverage structure-based clustering across predicted proteomes of multiple pathogens to identify conserved, essential, and non-human-homologous targets. We then systematically evaluate six leading 3D-structure-aware generative models (spanning diffusion, autoregressive, graph neural network, and language model architectures) on their usability, chemical validity, and biological relevance. Rigorous post-processing filters and commercial analogue searches reduce over 100 000 generated compounds to a focused, synthesizable set. Our results highlight DeepBlock and TamGen as top performers across diverse criteria, while also revealing critical trade-offs between model complexity, usability, and output quality. This work provides a comparative benchmark and blueprint for deploying artificial intelligence in early-stage antibiotic development.

[AI-77] QAMA: Quantum annealing multi-head attention operator with classical deep learning framework

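[Quick read]: This paper addresses the core challenge that, as large language models scale up, the conventional attention mechanism faces exponential growth in memory consumption and energy cost. It proposes a Quantum Annealing-based Multi-head Attention (QAMA) mechanism whose key idea is to model forward propagation as a quadratic unconstrained binary optimization (QUBO) problem and to use energy-based backpropagation, making it compatible with classical attention architectures. QAMA exploits the qubit interaction characteristics of the Ising model to reduce the conventional O(n²) space and time complexity to linear resource consumption and, combined with the optical computing advantages of coherent Ising machines (CIM), keeps millisecond-level real-time responsiveness while significantly reducing energy consumption. The stated contributions are: a theoretical proof that QAMA is mathematically equivalent to classical attention; dual optimization of multi-head specificity and long-range information capture via QUBO constraints; explicit gradient proofs for the Ising energy equation, used to implement gradient conduction as the only path in the computational graph; and a soft selection mechanism that overcomes the limits of traditional binary attention by approximating continuous weights. Experiments on the QBoson CPQC quantum computer show accuracy comparable to classical operators, with inference time reduced to the millisecond level and improved solution quality. The work pioneers architecture-level integration of quantum computing and deep learning and is applicable to any attention-based model.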

Link: https://arxiv.org/abs/2504.11083
Authors: Peng Du, Shuolei Wang, Shicheng Li, Jinjing Shi
Affiliation: Unknown
Categories: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI)
Comments:

Abstract:As large language models scale up, the conventional attention mechanism faces critical challenges of exponential growth in memory consumption and energy costs. Quantum annealing computing, with its inherent advantages in computational efficiency and low energy consumption, offers an innovative direction for constructing novel deep learning architectures. This study proposes the first Quantum Annealing-based Multi-head Attention (QAMA) mechanism, achieving seamless compatibility with classical attention architectures through quadratic unconstrained binary optimization (QUBO) modeling of forward propagation and energy-based backpropagation. The method innovatively leverages the quantum bit interaction characteristics of Ising models to optimize the conventional O(n^2) spatiotemporal complexity into linear resource consumption. Integrated with the optical computing advantages of coherent Ising machines (CIM), the system maintains millisecond-level real-time responsiveness while significantly reducing energy consumption. Our key contributions include: theoretical proofs establishing QAMA's mathematical equivalence to classical attention mechanisms; dual optimization of multi-head specificity and long-range information capture via QUBO constraints; explicit gradient proofs for the Ising energy equation, utilized to implement gradient conduction as the only path in the computational graph as a layer; and a proposed soft selection mechanism that overcomes traditional binary attention limitations to approximate continuous weights. Experiments on the QBoson CPQC quantum computer show QAMA achieves comparable accuracy to classical operators while reducing inference time to millisecond level and improving solution quality. This work pioneers architectural-level integration of quantum computing and deep learning, applicable to any attention-based model, driving paradigm innovation in AI foundational computing.
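
For readers unfamiliar with QUBO, the sketch below shows what such a problem looks like: find the binary vector x that minimizes the energy x^T Q x. The matrix `Q` here is an arbitrary illustration, not the paper's attention formulation, and the exhaustive search merely stands in for the annealing hardware, which is what would be used at realistic sizes.

```python
import itertools
import numpy as np

def solve_qubo_bruteforce(Q):
    """Exhaustively minimize x^T Q x over binary vectors x (tiny n only)."""
    n = Q.shape[0]
    best_x, best_e = None, float("inf")
    for bits in itertools.product((0, 1), repeat=n):
        x = np.array(bits)
        e = float(x @ Q @ x)
        if e < best_e:
            best_x, best_e = x, e
    return best_x, best_e

# Illustrative upper-triangular Q: diagonal terms reward selecting an item,
# off-diagonal terms penalize selecting neighbouring items together.
Q = np.array([[-1.0, 2.0, 0.0],
              [0.0, -1.0, 2.0],
              [0.0, 0.0, -1.0]])
x, e = solve_qubo_bruteforce(Q)
print(x, e)
```

The minimizer selects the first and third items and skips the second, since selecting adjacent items together incurs the coupling penalty.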

[AI-78] Uplink Assisted Joint Channel Estimation and CSI Feedback: An Approach Based on Deep Joint Source-Channel Coding

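[Quick read]: This paper addresses the sub-optimal performance of AI-based channel state information (CSI) feedback architectures under traditional modular communication frameworks, where modules such as channel estimation (CE) and CSI compression and feedback are designed separately. The key is a deep-learning-based downlink CSI acquisition scheme that uses uplink information as an aid and trains the modules jointly end to end. It adopts a deep joint source-channel coding (DJSCC) architecture to mitigate the cliff effect of conventional separate source-channel coding, and exploits the partial reciprocity between uplink and downlink channels in FDD systems to improve CSI reconstruction accuracy without additional overhead. Comprehensive ablation and scalability experiments validate the effectiveness of uplink CSI as auxiliary information and the necessity of the end-to-end multi-module joint training architecture.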

Link: https://arxiv.org/abs/2504.10836
Authors: Yiran Guo, Wei Chen, Bo Ai
Affiliation: Unknown
Categories: Signal Processing (eess.SP); Artificial Intelligence (cs.AI)
Comments:

Abstract:In frequency division duplex (FDD) multiple-input multiple-output (MIMO) wireless communication systems, the acquisition of downlink channel state information (CSI) is essential for maximizing spatial resource utilization and improving system spectral efficiency. The separate design of modules in AI-based CSI feedback architectures under traditional modular communication frameworks, including channel estimation (CE), CSI compression and feedback, leads to sub-optimal performance. In this paper, we propose an uplink assisted joint CE and CSI feedback approach via deep learning for downlink CSI acquisition, which mitigates performance degradation caused by distribution bias across separately trained modules in traditional modular communication frameworks. The proposed network adopts a deep joint source-channel coding (DJSCC) architecture to mitigate the cliff effect encountered in the conventional separate source-channel coding. Furthermore, we exploit the uplink CSI as auxiliary information to enhance CSI reconstruction accuracy by leveraging the partial reciprocity between the uplink and downlink channels in FDD systems, without introducing additional overhead. The effectiveness of uplink CSI as assisted information and the necessity of an end-to-end multi-module joint training architecture are validated through comprehensive ablation and scalability experiments.

[AI-79] Neural Network Emulation of the Classical Limit in Quantum Systems via Learned Observable Mappings

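[Quick read]: This paper explores a computational approach to the classical limit of quantum mechanics, studying how the classical behavior of the quantum harmonic oscillator emerges as Planck's constant \hbar approaches zero. The key is to design and train a neural network architecture that learns the mapping from initial expectation values and \hbar to the time evolution of the expectation value of position. By analyzing the network's predictions across different regimes of \hbar, the paper aims to provide computational insight into the nature of the quantum-classical transition. The work illustrates the potential of machine learning as a complementary tool for studying foundational questions in quantum mechanics and its classical limit.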

Link: https://arxiv.org/abs/2504.10781
Authors: Kamran Majid
Affiliation: Unknown
Categories: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI)
Comments:

Abstract:The classical limit of quantum mechanics, formally investigated through frameworks like strict deformation quantization, remains a profound area of inquiry in the philosophy of physics. This paper explores a computational approach employing a neural network to emulate the emergence of classical behavior from the quantum harmonic oscillator as Planck’s constant \hbar approaches zero. We develop and train a neural network architecture to learn the mapping from initial expectation values and \hbar to the time evolution of the expectation value of position. By analyzing the network’s predictions across different regimes of \hbar, we aim to provide computational insights into the nature of the quantum-classical transition. This work demonstrates the potential of machine learning as a complementary tool for exploring foundational questions in quantum mechanics and its classical limit.
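
As background for the mapping this abstract describes: for the harmonic oscillator, Ehrenfest's theorem gives the expectation value of position in closed form, and it follows the classical trajectory. The sketch below evaluates that textbook formula. It is a plausible way to produce reference curves, not the paper's data pipeline, and the defaults `m=1.0` and `omega=1.0` are illustrative assumptions.

```python
import numpy as np

def position_expectation(x0, p0, t, m=1.0, omega=1.0):
    """<x>(t) for a harmonic oscillator, given initial <x> = x0 and <p> = p0.

    Textbook result from Ehrenfest's theorem; m and omega are assumed units.
    """
    return x0 * np.cos(omega * t) + p0 / (m * omega) * np.sin(omega * t)

# One full period sampled at quarter-period intervals.
t = np.linspace(0.0, 2.0 * np.pi, 5)
traj = position_expectation(x0=1.0, p0=0.0, t=t)
print(np.round(traj, 6))
```

Note that \hbar does not appear in this expression, so any \hbar dependence a network learns has to come from other quantities; how the paper handles this is not stated in the abstract.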

[AI-80] MatterTune: An Integrated User-Friendly Platform for Fine-Tuning Atomistic Foundation Models to Accelerate Materials Simulation and Discovery

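[Quick read]: This paper addresses the high data requirements of geometric machine learning models (such as graph neural networks) in chemistry and materials science, which limit their use in data-sparse settings. Although these models have been very successful in tasks such as high-throughput virtual screening and atomistic simulation, their performance depends on large labelled datasets that many practical problems cannot supply. To overcome this, the paper builds on pre-trained atomistic foundation models, which learn general, fundamental geometric relationships from diverse, large-scale atomistic data and can then be fine-tuned on small task-specific datasets.

The key to the solution is MatterTune, a modular and extensible framework that supports efficient fine-tuning of several state-of-the-art pre-trained foundation models (such as ORB, MatterSim, JMP, and EquformerV2) and integrates them into downstream materials informatics and simulation workflows. By offering advanced fine-tuning capabilities and a flexible design, MatterTune lowers the barrier to adopting foundation models in materials science and facilitates their use in data-limited settings.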

Link: https://arxiv.org/abs/2504.10655
Authors: Lingyu Kong, Nima Shoghi, Guoxiang Hu, Pan Li, Victor Fung
Affiliation: Unknown
Categories: Materials Science (cond-mat.mtrl-sci); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Geometric machine learning models such as graph neural networks have achieved remarkable success in recent years in chemical and materials science research for applications such as high-throughput virtual screening and atomistic simulations. The success of these models can be attributed to their ability to effectively learn latent representations of atomic structures directly from the training data. Conversely, this also results in high data requirements for these models, hindering their application to data-sparse problems, which are common in this domain. To address this limitation, there is a growing development in the area of pre-trained machine learning models which have learned general, fundamental, geometric relationships in atomistic data, and which can then be fine-tuned to much smaller application-specific datasets. In particular, models which are pre-trained on diverse, large-scale atomistic datasets have shown impressive generalizability and flexibility to downstream applications, and are increasingly referred to as atomistic foundation models. To leverage the untapped potential of these foundation models, we introduce MatterTune, a modular and extensible framework that provides advanced fine-tuning capabilities and seamless integration of atomistic foundation models into downstream materials informatics and simulation workflows, thereby lowering the barriers to adoption and facilitating diverse applications in materials science. In its current state, MatterTune supports a number of state-of-the-art foundation models such as ORB, MatterSim, JMP, and EquformerV2, and hosts a wide range of features including a modular and flexible design, distributed and customizable fine-tuning, broad support for downstream informatics tasks, and more.

[AI-81] Who is More Bayesian: Humans or ChatGPT?

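[Quick read]: This paper compares human and AI decision makers on simple binary classification tasks in which the optimal decision rule is given by Bayes Rule. It reanalyzes the choices of human subjects from laboratory experiments by El-Gamal and Grether and by Holt and Smith. It confirms that Bayes Rule remains, overall, the single best model for predicting human choices, but finds that subjects are heterogeneous and that a significant share make suboptimal choices reflecting the judgement biases described by Kahneman and Tversky, including the "representativeness heuristic" (excessive weight on sample evidence relative to the prior) and "conservatism" (excessive weight on the prior relative to sample evidence). The paper also evaluates AI subjects drawn from recent large language models (LLMs), in particular several versions of ChatGPT, even though these general-purpose chatbots are trained as "language predictors" on web text rather than for narrow decision tasks. ChatGPT is likewise subject to biases that produce suboptimal decisions, but its performance has evolved rapidly: early versions (ChatGPT 3.5) perform below human level, while the latest versions (ChatGPT 4o) reach superhuman, nearly perfect Bayesian classification. The paper's core contribution is thus to document how AI decision makers have approached the optimal decision rule across versions and what biases they exhibit along the way.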

Link: https://arxiv.org/abs/2504.10636
Authors: Tianshi Mu, Pranjal Rawat, John Rust, Chengjun Zhang, Qixuan Zhong
Affiliation: Unknown
Categories: General Economics (econ.GN); Artificial Intelligence (cs.AI); Methodology (stat.ME)
Comments: 86 pages, 19 figures

Abstract:We compare the performance of human and artificially intelligent (AI) decision makers in simple binary classification tasks where the optimal decision rule is given by Bayes Rule. We reanalyze choices of human subjects gathered from laboratory experiments conducted by El-Gamal and Grether and Holt and Smith. We confirm that while overall, Bayes Rule represents the single best model for predicting human choices, subjects are heterogeneous and a significant share of them make suboptimal choices that reflect judgement biases described by Kahneman and Tversky that include the "representativeness heuristic" (excessive weight on the evidence from the sample relative to the prior) and "conservatism" (excessive weight on the prior relative to the sample). We compare the performance of AI subjects gathered from recent versions of large language models (LLMs) including several versions of ChatGPT. These general-purpose generative AI chatbots are not specifically trained to do well in narrow decision making tasks, but are trained instead as "language predictors" using a large corpus of textual data from the web. We show that ChatGPT is also subject to biases that result in suboptimal decisions. However, we document a rapid evolution in the performance of ChatGPT from sub-human performance for early versions (ChatGPT 3.5) to superhuman and nearly perfect Bayesian classifications in the latest versions (ChatGPT 4o).
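
To make the optimal rule concrete, the sketch below applies Bayes Rule to a two-urn task of the kind used in such experiments: a sample is drawn with replacement from urn A or urn B, and the subject must say which. All numbers (prior, ball proportions, sample) are illustrative assumptions, not values taken from the cited studies.

```python
from math import comb

def posterior_a(prior_a, p_a, p_b, n, k):
    """Posterior probability of urn A after seeing k marked balls in n draws.

    p_a and p_b are the assumed proportions of marked balls in urns A and B.
    """
    like_a = comb(n, k) * p_a**k * (1 - p_a) ** (n - k)
    like_b = comb(n, k) * p_b**k * (1 - p_b) ** (n - k)
    return prior_a * like_a / (prior_a * like_a + (1 - prior_a) * like_b)

# Sample of 6 draws with 4 marked balls; urn A is a priori less likely.
post = posterior_a(prior_a=1 / 3, p_a=2 / 3, p_b=1 / 2, n=6, k=4)
choice = "A" if post > 0.5 else "B"
print(round(post, 4), choice)
```

In this example the sample looks more like urn A, so a subject leaning on representativeness would answer A, yet the prior pulls the posterior below one half and the Bayes-optimal answer is B.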

[AI-82] AB-Cache: Training-Free Acceleration of Diffusion Models via Adams-Bashforth Cached Feature Reuse

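[Quick read]: This paper addresses the slow inference of diffusion models caused by their iterative denoising process, which limits practical use. Existing acceleration methods exploit the well-known U-shaped similarity pattern between adjacent steps through caching, but they lack a theoretical foundation and rely on simple computation reuse, often degrading performance. The key is to analyze the denoising process through the second-order Adams-Bashforth method, which reveals a linear relationship between the outputs of consecutive steps and explains why adjacent steps exhibit a U-shaped pattern. Extending the Adams-Bashforth method to higher order, the paper proposes a new caching-based acceleration method that, unlike direct reuse of cached results, has a truncation error bound of O(h^k), where h is the step size. Experiments across various image and video diffusion models (including HunyuanVideo and FLUX.1-dev) show a speedup of nearly 3× while maintaining the original performance, offering a practical real-time solution without sacrificing generation quality.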

Link: https://arxiv.org/abs/2504.10540
Authors: Zichao Yu, Zhen Zou, Guojiang Shao, Chengwei Zhang, Shengze Xu, Jie Huang, Feng Zhao, Xiaodong Cun, Wenyi Zhang
Affiliation: Unknown
Categories: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Diffusion models have demonstrated remarkable success in generative tasks, yet their iterative denoising process results in slow inference, limiting their practicality. While existing acceleration methods exploit the well-known U-shaped similarity pattern between adjacent steps through caching mechanisms, they lack theoretical foundation and rely on simplistic computation reuse, often leading to performance degradation. In this work, we provide a theoretical understanding by analyzing the denoising process through the second-order Adams-Bashforth method, revealing a linear relationship between the outputs of consecutive steps. This analysis explains why the outputs of adjacent steps exhibit a U-shaped pattern. Furthermore, extending Adams-Bashforth method to higher order, we propose a novel caching-based acceleration approach for diffusion models, instead of directly reusing cached results, with a truncation error bound of only O(h^k), where h is the step size. Extensive validation across diverse image and video diffusion models (including HunyuanVideo and FLUX.1-dev) with various schedulers demonstrates our method’s effectiveness in achieving nearly 3× speedup while maintaining original performance levels, offering a practical real-time solution without compromising generation quality.
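
For context, the second-order Adams-Bashforth method named in the abstract advances an ODE using a linear combination of the two most recent derivative evaluations. The sketch below is the generic numerical-analysis scheme applied to the test equation y' = -y. It does not reproduce the paper's feature-caching procedure, and the test problem and step size are illustrative choices.

```python
import math

def f(t, y):
    """Right-hand side of the test ODE y' = -y (illustrative choice)."""
    return -y

def euler(y0, h, steps):
    """Forward Euler: uses only the current derivative."""
    y, t = y0, 0.0
    for _ in range(steps):
        y += h * f(t, y)
        t += h
    return y

def adams_bashforth2(y0, h, steps):
    """AB2: y_{n+1} = y_n + h * (3/2 f_n - 1/2 f_{n-1})."""
    t = 0.0
    f_prev = f(t, y0)
    y = y0 + h * f_prev  # bootstrap the first step with forward Euler
    t += h
    for _ in range(steps - 1):
        f_curr = f(t, y)
        y += h * (1.5 * f_curr - 0.5 * f_prev)  # reuse the stored derivative
        f_prev = f_curr
        t += h
    return y

h, steps = 0.1, 10
exact = math.exp(-1.0)
err_euler = abs(euler(1.0, h, steps) - exact)
err_ab2 = abs(adams_bashforth2(1.0, h, steps) - exact)
print(err_euler, err_ab2)
```

The point of the comparison is that combining the stored previous derivative with the current one gives a noticeably smaller error than using the current derivative alone, at the same number of function evaluations per step.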

[AI-83] Physics-Informed Neural Networks for Enhanced Interface Preservation in Lattice Boltzmann Multiphase Simulations

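[Quick read]: This paper addresses interface diffusion in multiphase Lattice Boltzmann Method (LBM) simulations, a common problem that blurs sharp interfaces and reduces accuracy when simulating phenomena where interfacial dynamics are critical. The key solution is a coupled framework (PINN-LBM) that combines Physics-Informed Neural Networks (PINNs) with LBM. Integrating the neural network into LBM counteracts numerical diffusion while preserving physical consistency, so interfaces stay sharp without sacrificing physical accuracy. The method is validated on droplet simulations, with quantitative metrics (such as interface width, maximum gradient, and phase separation) showing its advantage, particularly in maintaining well-defined interfaces throughout the simulation.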

Link: https://arxiv.org/abs/2504.10539
Authors: Yue Li
Affiliation: Unknown
Categories: Fluid Dynamics (physics.flu-dyn); Artificial Intelligence (cs.AI)
Comments:

Abstract:This paper presents an improved approach for preserving sharp interfaces in multiphase Lattice Boltzmann Method (LBM) simulations using Physics-Informed Neural Networks (PINNs). Interface diffusion is a common challenge in multiphase LBM, leading to reduced accuracy in simulating phenomena where interfacial dynamics are critical. We propose a coupled PINN-LBM framework that maintains interface sharpness while preserving the physical accuracy of the simulation. Our approach is validated through droplet simulations, with quantitative metrics measuring interface width, maximum gradient, phase separation, effective interface width, and interface energy. The enhanced visualization techniques employed in this work clearly demonstrate the superior performance of PINN-LBM over standard LBM for multiphase simulations, particularly in maintaining well-defined interfaces throughout the simulation. We provide a comprehensive analysis of the results, showcasing how the neural network integration effectively counteracts numerical diffusion, while maintaining physical consistency with the underlying fluid dynamics.

[AI-84] Focal Loss based Residual Convolutional Neural Network for Speech Emotion Recognition

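[Quick read]: This paper addresses the drop in model performance caused by class imbalance in speech emotion recognition. The key is an architecture based on a Residual Convolutional Neural Network (ResNet) trained with Focal Loss. Using speech features such as the spectrogram and Mel-frequency Cepstral Coefficients (MFCCs), the model captures emotional information more effectively. Focal Loss focuses training on hard-to-classify samples and down-weights easy ones, which improves performance on class-imbalanced datasets.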

Link: https://arxiv.org/abs/1906.05682
Authors: Suraj Tripathi, Abhay Kumar, Abhiram Ramesh, Chirag Singh, Promod Yenigalla
Affiliation: Unknown
Categories: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Sound (cs.SD); Machine Learning (stat.ML)
Comments: Accepted in CICLing 2019

Abstract:This paper proposes a Residual Convolutional Neural Network (ResNet) based on speech features and trained under Focal Loss to recognize emotion in speech. Speech features such as Spectrogram and Mel-frequency Cepstral Coefficients (MFCCs) have shown the ability to characterize emotion better than just plain text. Further, Focal Loss, first used in One-Stage Object Detectors, has shown the ability to focus the training process more towards hard examples and down-weight the loss assigned to well-classified examples, thus preventing the model from being overwhelmed by easily classifiable examples.
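
The down-weighting effect described here follows directly from the focal loss formula, FL(p_t) = -alpha * (1 - p_t)^gamma * log(p_t), where p_t is the predicted probability of the true class. The NumPy sketch below evaluates it on two made-up predictions. The values `gamma=2.0` and `alpha=1.0` are illustrative and not necessarily the settings used in this paper.

```python
import numpy as np

def focal_loss(probs, targets, gamma=2.0, alpha=1.0):
    """Per-sample focal loss for multi-class probabilities.

    probs: (N, C) predicted class probabilities; targets: (N,) true class ids.
    With gamma = 0 this reduces to ordinary cross-entropy (scaled by alpha).
    """
    p_t = probs[np.arange(len(targets)), targets]
    return -alpha * (1.0 - p_t) ** gamma * np.log(p_t)

# Hypothetical predictions: the first sample is easy, the second is hard.
probs = np.array([[0.9, 0.05, 0.05],
                  [0.2, 0.7, 0.1]])
targets = np.array([0, 0])

fl = focal_loss(probs, targets)
ce = focal_loss(probs, targets, gamma=0.0)
print(fl / ce)  # fraction of the cross-entropy loss that each sample keeps
```

The easy sample keeps only about 1% of its cross-entropy loss while the hard sample keeps 64%, which is how training is steered toward hard examples.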

Machine Learning

[LG-0] Predicting Wave Dynamics using Deep Learning with Multistep Integration Inspired Attention and Physics-Based Loss Decomposition

Link: https://arxiv.org/abs/2504.11433
Authors: Indu Kant Deo, Rajeev K. Jaiman
Categories: Machine Learning (cs.LG); Numerical Analysis (math.NA); Fluid Dynamics (physics.flu-dyn)
Comments: 30 pages, 14 figures

Abstract:In this paper, we present a physics-based deep learning framework for data-driven prediction of wave propagation in fluid media. The proposed approach, termed Multistep Integration-Inspired Attention (MI2A), combines a denoising-based convolutional autoencoder for reduced latent representation with an attention-based recurrent neural network with long-short-term memory cells for time evolution of reduced coordinates. This proposed architecture draws inspiration from classical linear multistep methods to enhance stability and long-horizon accuracy in latent-time integration. Despite the efficiency of hybrid neural architectures in modeling wave dynamics, autoregressive predictions are often prone to accumulating phase and amplitude errors over time. To mitigate this issue within the MI2A framework, we introduce a novel loss decomposition strategy that explicitly separates the training loss function into distinct phase and amplitude components. We assess the performance of MI2A against two baseline reduced-order models trained with standard mean-squared error loss: a sequence-to-sequence recurrent neural network and a variant using Luong-style attention. To demonstrate the effectiveness of the MI2A model, we consider three benchmark wave propagation problems of increasing complexity, namely one-dimensional linear convection, the nonlinear viscous Burgers equation, and the two-dimensional Saint-Venant shallow water system. Our results demonstrate that the MI2A framework significantly improves the accuracy and stability of long-term predictions, accurately preserving wave amplitude and phase characteristics. Compared to the standard long-short term memory and attention-based models, MI2A-based deep learning exhibits superior generalization and temporal accuracy, making it a promising tool for real-time wave modeling.
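
One standard way to separate amplitude and phase error is the identity MSE = (sigma_p - sigma_t)^2 + (mean_p - mean_t)^2 + 2 * (1 - rho) * sigma_p * sigma_t, where the first two terms measure amplitude (dissipation) error and the last measures phase (dispersion) error. The sketch below checks that identity on a damped, shifted sine wave. Whether MI2A uses exactly this split is an assumption, since the abstract does not give the formula.

```python
import numpy as np

def amplitude_phase_split(pred, true):
    """Split the mean-squared error into amplitude and phase parts.

    Uses population statistics (ddof = 0) so the two parts sum to the MSE.
    """
    sp, st = pred.std(), true.std()
    cov = np.mean((pred - pred.mean()) * (true - true.mean()))
    amplitude = (sp - st) ** 2 + (pred.mean() - true.mean()) ** 2
    phase = 2.0 * (sp * st - cov)  # equals 2 * (1 - rho) * sp * st
    return amplitude, phase

# Illustrative signals: the prediction is damped (0.8x) and lags by 0.3 rad.
x = np.linspace(0.0, 2.0 * np.pi, 200, endpoint=False)
true = np.sin(x)
pred = 0.8 * np.sin(x - 0.3)

amp, ph = amplitude_phase_split(pred, true)
mse = np.mean((pred - true) ** 2)
print(amp, ph, mse)
```

Because the two parts add up to the ordinary MSE, they can be weighted separately in a training loss without changing what is being measured overall.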

[LG-1] MLPs and KANs for data-driven learning in physical problems: A performance comparison

Link: https://arxiv.org/abs/2504.11397
Authors: Raghav Pant, Sikan Li, Xingjian Li, Hassan Iqbal, Krishna Kumar
Categories: Machine Learning (cs.LG); Computational Physics (physics.comp-ph)
Comments: 30 pages, 18 figures, 8 tables

Abstract:There is increasing interest in solving partial differential equations (PDEs) by casting them as machine learning problems. Recently, there has been a spike in exploring Kolmogorov-Arnold Networks (KANs) as an alternative to traditional neural networks represented by Multi-Layer Perceptrons (MLPs). While showing promise, their performance advantages in physics-based problems remain largely unexplored. Several critical questions persist: Can KANs capture complex physical dynamics and under what conditions might they outperform traditional architectures? In this work, we present a comparative study of KANs and MLPs for learning physical systems governed by PDEs. We assess their performance when applied in deep operator networks (DeepONet) and graph network-based simulators (GNS), and test them on physical problems that vary significantly in scale and complexity. Drawing inspiration from the Kolmogorov Representation Theorem, we examine the behavior of KANs and MLPs across shallow and deep network architectures. Our results reveal that although KANs do not consistently outperform MLPs when configured as deep neural networks, they demonstrate superior expressiveness in shallow network settings, significantly outpacing MLPs in accuracy over our test cases. This suggests that KANs are a promising choice, offering a balance of efficiency and accuracy in applications involving physical systems.

[LG-2] Accelerating Multiscale Modeling with Hybrid Solvers: Coupling FEM and Neural Operators with Domain Decomposition

链接: https://arxiv.org/abs/2504.11383
作者: Wei Wanga,Maryam Hakimzadeh,Haihui Ruan,Somdatta Goswami
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Numerical solvers for partial differential equations (PDEs) face challenges balancing computational cost and accuracy, especially in multiscale and dynamic systems. Neural operators can significantly speed up simulations; however, they often face challenges such as error accumulation and limited generalization in multiphysics problems. This work introduces a novel hybrid framework that integrates physics-informed DeepONet with FEM through domain decomposition. The core innovation lies in adaptively coupling FEM and DeepONet subdomains via a Schwarz alternating method. This methodology strategically allocates computationally demanding regions to a pre-trained Deep Operator Network, while the remaining computational domain is solved through FEM. To address dynamic systems, we integrate the Newmark time-stepping scheme directly into the DeepONet, significantly mitigating error accumulation in long-term simulations. Furthermore, an adaptive subdomain evolution enables the ML-resolved region to expand dynamically, capturing emerging fine-scale features without remeshing. The framework’s efficacy has been validated across a range of solid mechanics problems, including static, quasi-static, and dynamic regimes, demonstrating accelerated convergence rates (up to 20% improvement compared to FE-FE approaches), while preserving solution fidelity with error 1%. Our case studies show that our proposed hybrid solver: (1) maintains solution continuity across subdomain interfaces, (2) reduces computational costs by eliminating fine mesh requirements, (3) mitigates error accumulation in time-dependent simulations, and (4) enables automatic adaptation to evolving physical phenomena. This work bridges the gap between numerical methods and AI-driven surrogates, offering a scalable pathway for high-fidelity simulations in engineering and scientific applications.

[LG-3] An Adaptive Dropout Approach for High-Dimensional Bayesian Optimization

链接: https://arxiv.org/abs/2504.11353
作者: Jundi Huang,Dawei Zhan
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Bayesian optimization (BO) is a widely used algorithm for solving expensive black-box optimization problems. However, its performance decreases significantly on high-dimensional problems due to the inherent high-dimensionality of the acquisition function. In the proposed algorithm, we adaptively dropout the variables of the acquisition function along the iterations. By gradually reducing the dimension of the acquisition function, the proposed approach has less and less difficulty to optimize the acquisition function. Numerical experiments demonstrate that AdaDropout effectively tackle high-dimensional challenges and improve solution quality where standard Bayesian optimization methods often struggle. Moreover, it achieves superior results when compared with state-of-the-art high-dimensional Bayesian optimization approaches. This work provides a simple yet efficient solution for high-dimensional expensive optimization.

[LG-4] Erzeugungsgrad VC-Dimension and Neural Networks with rational activation function

链接: https://arxiv.org/abs/2504.11345
作者: Luis Miguel Pardo,Daniel Sebastián
类目: Machine Learning (cs.LG); Algebraic Geometry (math.AG)
*备注: 50 pages

点击查看摘要

Abstract:The notion of Erzeugungsgrad was introduced by Joos Heintz in 1983 to bound the number of non-empty cells occurring after a process of quantifier elimination. We extend this notion and the combinatorial bounds of Theorem 2 in Heintz (1983) using the degree for constructible sets defined in Pardo-Sebastián (2022). We show that the Erzeugungsgrad is the key ingredient to connect affine Intersection Theory over algebraically closed fields and the VC-Theory of Computational Learning Theory for families of classifiers given by parameterized families of constructible sets. In particular, we prove that the VC-dimension and the Krull dimension are linearly related up to logarithmic factors based on Intersection Theory. Using this relation, we study the density of correct test sequences in evasive varieties. We apply these ideas to analyze parameterized families of neural networks with rational activation function.

[LG-5] Subset-Contrastive Multi-Omics Network Embedding

链接: https://arxiv.org/abs/2504.11321
作者: Pedro Henrique da Costa Avelar,Min Wu,Sophia Tsoka
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Motivation: Network-based analyses of omics data are widely used, and while many of these methods have been adapted to single-cell scenarios, they often remain memory- and space-intensive. As a result, they are better suited to batch data or smaller datasets. Furthermore, the application of network-based methods in multi-omics often relies on similarity-based networks, which lack structurally-discrete topologies. This limitation may reduce the effectiveness of graph-based methods that were initially designed for topologies with better defined structures. Results: We propose Subset-Contrastive multi-Omics Network Embedding (SCONE), a method that employs contrastive learning techniques on large datasets through a scalable subgraph contrastive approach. By exploiting the pairwise similarity basis of many network-based omics methods, we transformed this characteristic into a strength, developing an approach that aims to achieve scalable and effective analysis. Our method demonstrates synergistic omics integration for cell type clustering in single-cell data. Additionally, we evaluate its performance in a bulk multi-omics integration scenario, where SCONE performs comparable to the state-of-the-art despite utilising limited views of the original data. We anticipate that our findings will motivate further research into the use of subset contrastive methods for omics data.

[LG-6] Reconstructing Fine-Grained Network Data using Autoencoder Architectures with Domain Knowledge Penalties

链接: https://arxiv.org/abs/2504.11255
作者: Mark Cheung,Sridhar Venkatesan
类目: Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI)
*备注:

点击查看摘要

Abstract:The ability to reconstruct fine-grained network session data, including individual packets, from coarse-grained feature vectors is crucial for improving network security models. However, the large-scale collection and storage of raw network traffic pose significant challenges, particularly for capturing rare cyberattack samples. These challenges hinder the ability to retain comprehensive datasets for model training and future threat detection. To address this, we propose a machine learning approach guided by formal methods to encode and reconstruct network data. Our method employs autoencoder models with domain-informed penalties to impute PCAP session headers from structured feature representations. Experimental results demonstrate that incorporating domain knowledge through constraint-based loss terms significantly improves reconstruction accuracy, particularly for categorical features with session-level encodings. By enabling efficient reconstruction of detailed network sessions, our approach facilitates data-efficient model training while preserving privacy and storage efficiency.

[LG-7] The Forward-Forward Algorithm: Characterizing Training Behavior

链接: https://arxiv.org/abs/2504.11229
作者: Reece Adamson
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The Forward-Forward algorithm is an alternative learning method which consists of two forward passes rather than a forward and backward pass employed by backpropagation. Forward-Forward networks employ layer local loss functions which are optimized based on the layer activation for each forward pass rather than a single global objective function. This work explores the dynamics of model and layer accuracy changes in Forward-Forward networks as training progresses in pursuit of a mechanistic understanding of their internal behavior. Treatments to various system characteristics are applied to investigate changes in layer and overall model accuracy as training progresses, how accuracy is impacted by layer depth, and how strongly individual layer accuracy is correlated with overall model accuracy. The empirical results presented suggest that layers deeper within Forward-Forward networks experience a delay in accuracy improvement relative to shallower layers and that shallower layer accuracy is strongly correlated with overall model accuracy.

[LG-8] VEXP: A Low-Cost RISC-V ISA Extension for Accelerated Softmax Computation in Transformers

链接: https://arxiv.org/abs/2504.11227
作者: Run Wang,Gamze Islamoglu,Andrea Belano,Viviane Potocnik,Francesco Conti,Angelo Garofalo,Luca Benini
类目: Hardware Architecture (cs.AR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:While Transformers are dominated by Floating-Point (FP) Matrix-Multiplications, their aggressive acceleration through dedicated hardware or many-core programmable systems has shifted the performance bottleneck to non-linear functions like Softmax. Accelerating Softmax is challenging due to its non-pointwise, non-linear nature, with exponentiation as the most demanding step. To address this, we design a custom arithmetic block for Bfloat16 exponentiation leveraging a novel approximation algorithm based on Schraudolph’s method, and we integrate it into the Floating-Point Unit (FPU) of the RISC-V cores of a compute cluster, through custom Instruction Set Architecture (ISA) extensions, with a negligible area overhead of 1%. By optimizing the software kernels to leverage the extension, we execute Softmax with 162.7× less latency and 74.3× less energy compared to the baseline cluster, achieving an 8.2× performance improvement and 4.1× higher energy efficiency for the FlashAttention-2 kernel in GPT-2 configuration. Moreover, the proposed approach enables a multi-cluster system to efficiently execute end-to-end inference of pre-trained Transformer models, such as GPT-2, GPT-3 and ViT, achieving up to 5.8× and 3.6× reduction in latency and energy consumption, respectively, without requiring re-training and with negligible accuracy loss.
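
For readers unfamiliar with Schraudolph's method mentioned in the abstract, the following is a minimal sketch of the classic trick in float32: the exponential is approximated by writing a scaled, shifted copy of the input directly into the bit pattern of a float. It is not the paper's Bfloat16 hardware block or its refined approximation; the function name `schraudolph_exp` and the correction constant `486411` (a commonly quoted float32 shift) are assumptions made for illustration.

```python
import math
import struct


def schraudolph_exp(x: float) -> float:
    """Approximate exp(x) via the float32 bit-pattern trick (illustrative only).

    Valid roughly for x in [-80, 80]; outside that range the integer no longer
    encodes a normal positive float32.
    """
    scale = (1 << 23) / math.log(2.0)   # maps x to units of the float32 exponent field
    bias = 127 * (1 << 23)              # float32 exponent bias, shifted into place
    shift = 486411                      # assumed correction constant to balance the error
    bits = int(scale * x + (bias - shift))
    # Reinterpret the 32-bit integer as an IEEE-754 single-precision float.
    return struct.unpack("<f", struct.pack("<i", bits))[0]


if __name__ == "__main__":
    for value in (-2.0, 0.0, 1.0, 3.5):
        print(value, schraudolph_exp(value), math.exp(value))
```

The approximation costs one multiply, one add, and a reinterpretation of bits, which is why variants of it are attractive for hardware; the relative error of this plain version stays within a few percent.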

[LG-9] SDFs from Unoriented Point Clouds using Neural Variational Heat Distances

链接: https://arxiv.org/abs/2504.11212
作者: Samuel Weidemaier,Florine Hartwig,Josua Sassen,Sergio Conti,Mirela Ben-Chen,Martin Rumpf
类目: Numerical Analysis (math.NA); Machine Learning (cs.LG)
*备注: 14 pages, 16 figures, 4 tables

点击查看摘要

Abstract:We propose a novel variational approach for computing neural Signed Distance Fields (SDF) from unoriented point clouds. To this end, we replace the commonly used eikonal equation with the heat method, carrying over to the neural domain what has long been standard practice for computing distances on discrete surfaces. This yields two convex optimization problems for whose solution we employ neural networks: We first compute a neural approximation of the gradients of the unsigned distance field through a small time step of heat flow with weighted point cloud densities as initial data. Then we use it to compute a neural approximation of the SDF. We prove that the underlying variational problems are well-posed. Through numerical experiments, we demonstrate that our method provides state-of-the-art surface reconstruction and consistent SDF gradients. Furthermore, we show in a proof-of-concept that it is accurate enough for solving a PDE on the zero-level set.

[LG-10] A Real-time Anomaly Detection Method for Robots based on a Flexible and Sparse Latent Space

链接: https://arxiv.org/abs/2504.11170
作者: Taewook Kang,Bum-Jae You,Juyoun Park,Yisoo Lee
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注: 20 pages, 11 figures

点击查看摘要

Abstract:The growing demand for robots to operate effectively in diverse environments necessitates the need for robust real-time anomaly detection techniques during robotic operations. However, deep learning-based models in robotics face significant challenges due to limited training data and highly noisy signal features. In this paper, we present Sparse Masked Autoregressive Flow-based Adversarial AutoEncoders model to address these problems. This approach integrates Masked Autoregressive Flow model into Adversarial AutoEncoders to construct a flexible latent space and utilize Sparse autoencoder to efficiently focus on important features, even in scenarios with limited feature space. Our experiments demonstrate that the proposed model achieves a 4.96% to 9.75% higher area under the receiver operating characteristic curve for pick-and-place robotic operations with randomly placed cans, compared to existing state-of-the-art methods. Notably, it showed up to 19.67% better performance in scenarios involving collisions with lightweight objects. Additionally, unlike the existing state-of-the-art model, our model performs inferences within 1 millisecond, ensuring real-time anomaly detection. These capabilities make our model highly applicable to machine learning-based robotic safety systems in dynamic environments. The code will be made publicly available after acceptance.

[LG-11] TD-Suite: All Batteries Included Framework for Technical Debt Classification

链接: https://arxiv.org/abs/2504.11085
作者: Karthik Shivashankar,Antonio Martini
类目: Software Engineering (cs.SE); Machine Learning (cs.LG)
*备注: In submission

点击查看摘要

Abstract:Recognizing that technical debt is a persistent and significant challenge requiring sophisticated management tools, TD-Suite offers a comprehensive software framework specifically engineered to automate the complex task of its classification within software projects. It leverages the advanced natural language understanding of state-of-the-art transformer models to analyze textual artifacts, such as developer discussions in issue reports, where subtle indicators of debt often lie hidden. TD-Suite provides a seamless end-to-end pipeline, managing everything from initial data ingestion and rigorous preprocessing to model training, thorough evaluation, and final inference. This allows it to support both straightforward binary classification (debt or no debt) and more valuable, identifying specific categories like code, design, or documentation debt, thus enabling more targeted management strategies. To ensure the generated models are robust and perform reliably on real-world, often imbalanced, datasets, TD-Suite incorporates critical training methodologies: k-fold cross-validation assesses generalization capability, early stopping mechanisms prevent overfitting to the training data, and class weighting strategies effectively address skewed data distributions. Beyond core functionality, and acknowledging the growing importance of sustainability, the framework integrates tracking and reporting of carbon emissions associated with the computationally intensive model training process. It also features a user-friendly Gradio web interface in a Docker container setup, simplifying model interaction, evaluation, and inference. 

[LG-12] Scalability and Maintainability Challenges and Solutions in Machine Learning: Systematic Literature Review

链接: https://arxiv.org/abs/2504.11079
作者: Karthik Shivashankar,Ghadi S. Al Hajj,Antonio Martini
类目: Software Engineering (cs.SE); Machine Learning (cs.LG)
*备注: Minor Revision ACM Computing Survey

点击查看摘要

Abstract:This systematic literature review examines the critical challenges and solutions related to scalability and maintainability in Machine Learning (ML) systems. As ML applications become increasingly complex and widespread across industries, the need to balance system scalability with long-term maintainability has emerged as a significant concern. This review synthesizes current research and practices addressing these dual challenges across the entire ML life-cycle, from data engineering to model deployment in production. We analyzed 124 papers to identify and categorize 41 maintainability challenges and 13 scalability challenges, along with their corresponding solutions. Our findings reveal intricate inter dependencies between scalability and maintainability, where improvements in one often impact the other. The review is structured around six primary research questions, examining maintainability and scalability challenges in data engineering, model engineering, and ML system development. We explore how these challenges manifest differently across various stages of the ML life-cycle. This comprehensive overview offers valuable insights for both researchers and practitioners in the field of ML systems. It aims to guide future research directions, inform best practices, and contribute to the development of more robust, efficient, and sustainable ML applications across various domains.

[LG-13] Morphing-based Compression for Data-centric ML Pipelines

链接: https://arxiv.org/abs/2504.11067
作者: Sebastian Baunsgaard,Matthias Boehm
类目: Databases (cs.DB); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
*备注: 20 pages, 28 figures, 4 tables

点击查看摘要

Abstract:Data-centric ML pipelines extend traditional machine learning (ML) pipelines – of feature transformations and ML model training – by outer loops for data cleaning, augmentation, and feature engineering to create high-quality input data. Existing lossless matrix compression applies lightweight compression schemes to numeric matrices and performs linear algebra operations such as matrix-vector multiplications directly on the compressed representation but struggles to efficiently rediscover structural data redundancy. Compressed operations are effective at fitting data in available memory, reducing I/O across the storage-memory-cache hierarchy, and improving instruction parallelism. The applied data cleaning, augmentation, and feature transformations provide a rich source of information about data characteristics such as distinct items, column sparsity, and column correlations. In this paper, we introduce BWARE – an extension of AWARE for workload-aware lossless matrix compression – that pushes compression through feature transformations and engineering to leverage information about structural transformations. Besides compressed feature transformations, we introduce a novel technique for lightweight morphing of a compressed representation into workload-optimized compressed representations without decompression. BWARE shows substantial end-to-end runtime improvements, reducing the execution time for training data-centric ML pipelines from days to hours.

[LG-14] Zero-Shot Whole-Body Humanoid Control via Behavioral Foundation Models ICLR2025

链接: https://arxiv.org/abs/2504.11054
作者: Andrea Tirinzoni,Ahmed Touati,Jesse Farebrother,Mateusz Guzek,Anssi Kanervisto,Yingchen Xu,Alessandro Lazaric,Matteo Pirotta
类目: Machine Learning (cs.LG)
*备注: Published at ICLR 2025

点击查看摘要

Abstract:Unsupervised reinforcement learning (RL) aims at pre-training agents that can solve a wide range of downstream tasks in complex environments. Despite recent advancements, existing approaches suffer from several limitations: they may require running an RL process on each downstream task to achieve a satisfactory performance, they may need access to datasets with good coverage or well-curated task-specific samples, or they may pre-train policies with unsupervised losses that are poorly correlated with the downstream tasks of interest. In this paper, we introduce a novel algorithm regularizing unsupervised RL towards imitating trajectories from unlabeled behavior datasets. The key technical novelty of our method, called Forward-Backward Representations with Conditional-Policy Regularization, is to train forward-backward representations to embed the unlabeled trajectories to the same latent space used to represent states, rewards, and policies, and use a latent-conditional discriminator to encourage policies to “cover” the states in the unlabeled behavior dataset. As a result, we can learn policies that are well aligned with the behaviors in the dataset, while retaining zero-shot generalization capabilities for reward-based and imitation tasks. We demonstrate the effectiveness of this new approach in a challenging humanoid control problem: leveraging observation-only motion capture datasets, we train Meta Motivo, the first humanoid behavioral foundation model that can be prompted to solve a variety of whole-body tasks, including motion tracking, goal reaching, and reward optimization. The resulting model is capable of expressing human-like behaviors and it achieves competitive performance with task-specific methods while outperforming state-of-the-art unsupervised RL and model-based baselines.

[LG-15] QualiTagger: Automating software quality detection in issue trackers

链接: https://arxiv.org/abs/2504.11053
作者: Karthik Shivashankar,Rafael Capilla,Maren Maritsdatter Kruke,Mili Orucevic,Antonio Martini
类目: Software Engineering (cs.SE); Machine Learning (cs.LG)
*备注: IN Review ASE journal

点击查看摘要

Abstract:A systems quality is a major concern for development teams when it evolve. Understanding the effects of a loss of quality in the codebase is crucial to avoid side effects like the appearance of technical debt. Although the identification of these qualities in software requirements described in natural language has been investigated, most of the results are often not applicable in practice, and rely on having been validated on small datasets and limited amount of projects. For many years, machine learning (ML) techniques have been proved as a valid technique to identify and tag terms described in natural language. In order to advance previous works, in this research we use cutting edge models like Transformers, together with a vast dataset mined and curated from GitHub, to identify what text is usually associated with different quality properties. We also study the distribution of such qualities in issue trackers from openly accessible software repositories, and we evaluate our approach both with students from a software engineering course and with its application to recognize security labels in industry.

[LG-16] A PyTorch-Compatible Spike Encoding Framework for Energy-Efficient Neuromorphic Applications

链接: https://arxiv.org/abs/2504.11026
作者: Alexandru Vasilache,Jona Scholz,Vincent Schilling,Sven Nitzsche,Florian Kaelber,Johannes Korsch,Juergen Becker
类目: Machine Learning (cs.LG)
*备注: A preliminary version of this work was accepted at the 20th International Conference on Systems (ICONS 2025), May 18-22, 2025, Nice, France. The conference proceedings will be published by IARIA Press (ISSN: 2308-4243, ISBN: 978-1-68558-278-4) and archived in the ThinkMind Digital Library. The proposed Spike Encoding Framework can be accessed at this https URL

点击查看摘要

Abstract:Spiking Neural Networks (SNNs) offer promising energy efficiency advantages, particularly when processing sparse spike trains. However, their incompatibility with traditional datasets, which consist of batches of input vectors rather than spike trains, necessitates the development of efficient encoding methods. This paper introduces a novel, open-source PyTorch-compatible Python framework for spike encoding, designed for neuromorphic applications in machine learning and reinforcement learning. The framework supports a range of encoding algorithms, including Leaky Integrate-and-Fire (LIF), Step Forward (SF), Pulse Width Modulation (PWM), and Ben’s Spiker Algorithm (BSA), as well as specialized encoding strategies covering population coding and reinforcement learning scenarios. Furthermore, we investigate the performance trade-offs of each method on embedded hardware using C/C++ implementations, considering energy consumption, computation time, spike sparsity, and reconstruction accuracy. Our findings indicate that SF typically achieves the lowest reconstruction error and offers the highest energy efficiency and fastest encoding speed, achieving the second-best spike sparsity. At the same time, other methods demonstrate particular strengths depending on the signal characteristics. This framework and the accompanying empirical analysis provide valuable resources for selecting optimal encoding strategies for energy-efficient SNN applications.
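
As a rough illustration of the Step Forward (SF) encoding named in the abstract, the sketch below follows the scheme as it is commonly described in the spike-encoding literature: a running baseline tracks the signal, and a positive or negative spike is emitted whenever the signal leaves a threshold band around that baseline. This is not the API of the framework described above; the function names `step_forward_encode` and `step_forward_decode` and the parameter `threshold` are hypothetical.

```python
from typing import List, Sequence


def step_forward_encode(signal: Sequence[float], threshold: float) -> List[int]:
    """Encode a 1-D signal as a train of +1 / 0 / -1 spikes (assumed SF scheme)."""
    if not signal:
        return []
    baseline = signal[0]            # the baseline starts at the first sample
    spikes: List[int] = []
    for sample in signal:
        if sample > baseline + threshold:
            spikes.append(1)        # signal rose above the band: positive spike
            baseline += threshold
        elif sample < baseline - threshold:
            spikes.append(-1)       # signal fell below the band: negative spike
            baseline -= threshold
        else:
            spikes.append(0)        # signal stayed inside the band: no spike
    return spikes


def step_forward_decode(spikes: Sequence[int], threshold: float, start: float) -> List[float]:
    """Reconstruct a staircase approximation by accumulating the spikes."""
    value = start
    reconstruction: List[float] = []
    for spike in spikes:
        value += spike * threshold
        reconstruction.append(value)
    return reconstruction


if __name__ == "__main__":
    samples = [0.0, 0.6, 1.2, 1.0, 0.2]
    train = step_forward_encode(samples, threshold=0.5)
    print(train)
    print(step_forward_decode(train, threshold=0.5, start=samples[0]))
```

A smaller `threshold` tracks the signal more closely but produces more spikes, which is the sparsity versus reconstruction-error trade-off the abstract refers to.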

[LG-17] Leveraging Vertical Public-Private Split for Improved Synthetic Data Generation ICLR2025

链接: https://arxiv.org/abs/2504.10987
作者: Samuel Maddock,Shripad Gade,Graham Cormode,Will Bullock
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*备注: Accepted to the Synthetic Data x Data Access Problem (SynthData) workshop @ ICLR 2025

点击查看摘要

Abstract:Differentially Private Synthetic Data Generation (DP-SDG) is a key enabler of private and secure tabular-data sharing, producing artificial data that carries through the underlying statistical properties of the input data. This typically involves adding carefully calibrated statistical noise to guarantee individual privacy, at the cost of synthetic data quality. Recent literature has explored scenarios where a small amount of public data is used to help enhance the quality of synthetic data. These methods study a horizontal public-private partitioning which assumes access to a small number of public rows that can be used for model initialization, providing a small utility gain. However, realistic datasets often naturally consist of public and private attributes, making a vertical public-private partitioning relevant for practical synthetic data deployments. We propose a novel framework that adapts horizontal public-assisted methods into the vertical setting. We compare this framework against our alternative approach that uses conditional generation, highlighting initial limitations of public-data assisted methods and proposing future research directions to address these challenges.

[LG-18] Learning-Based User Association for MmWave Vehicular Networks With Kernelized Contextual Bandits

链接: https://arxiv.org/abs/2504.10959
作者: Xiaoyang He,Xiaoxia Huang
类目: Machine Learning (cs.LG)
*备注: Accepted by IEEE WCNC 2025

点击查看摘要

Abstract:Vehicles require timely channel conditions to determine the base station (BS) to communicate with, but it is costly to estimate the fast-fading mmWave channels frequently. Without additional channel estimations, the proposed Distributed Kernelized Upper Confidence Bound (DK-UCB) algorithm estimates the current instantaneous transmission rates utilizing past contexts, such as the vehicle’s location and velocity, along with past instantaneous transmission rates. To capture the nonlinear mapping from a context to the instantaneous transmission rate, DK-UCB maps a context into the reproducing kernel Hilbert space (RKHS) where a linear mapping becomes observable. To improve estimation accuracy, we propose a novel kernel function in RKHS which incorporates the propagation characteristics of the mmWave signals. Moreover, DK-UCB encourages a vehicle to share necessary information when it has conducted significant explorations, which speeds up the learning process while maintaining affordable communication costs.

[LG-19] When is Task Vector Provably Effective for Model Editing? A Generalization Analysis of Nonlinear Transformers ICLR2025

链接: https://arxiv.org/abs/2504.10957
作者: Hongkang Li,Yihua Zhang,Shuai Zhang,Meng Wang,Sijia Liu,Pin-Yu Chen
类目: Machine Learning (cs.LG)
*备注: Published at ICLR 2025 as an oral paper

点击查看摘要

Abstract:Task arithmetic refers to editing the pre-trained model by adding a weighted sum of task vectors, each of which is the weight update from the pre-trained model to fine-tuned models for certain tasks. This approach recently gained attention as a computationally efficient inference method for model editing, e.g., multi-task learning, forgetting, and out-of-domain generalization capabilities. However, the theoretical understanding of why task vectors can execute various conceptual operations remains limited, due to the highly non-convexity of training Transformer-based models. To the best of our knowledge, this paper provides the first theoretical characterization of the generalization guarantees of task vector methods on nonlinear Transformers. We consider a conceptual learning setting, where each task is a binary classification problem based on a discriminative pattern. We theoretically prove the effectiveness of task addition in simultaneously learning a set of irrelevant or aligned tasks, as well as the success of task negation in unlearning one task from irrelevant or contradictory tasks. Moreover, we prove the proper selection of linear coefficients for task arithmetic to achieve guaranteed generalization to out-of-domain tasks. All of our theoretical results hold for both dense-weight parameters and their low-rank approximations. Although established in a conceptual setting, our theoretical findings were validated on a practical machine unlearning task using the large language model Phi-1.5 (1.3B).
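
The definition in the first sentence of the abstract can be sketched in a few lines: a task vector is the difference between fine-tuned and pre-trained weights, and the edited model is the pre-trained weights plus a weighted sum of task vectors. The sketch below uses plain Python lists as stand-ins for weight tensors; the names `task_vector` and `apply_task_arithmetic` and the toy weights are illustrative assumptions, not code from the paper.

```python
from typing import Dict, List, Sequence

Weights = Dict[str, List[float]]


def task_vector(pretrained: Weights, finetuned: Weights) -> Weights:
    """Task vector = fine-tuned weights minus pre-trained weights, per parameter."""
    return {
        name: [f - p for p, f in zip(pretrained[name], finetuned[name])]
        for name in pretrained
    }


def apply_task_arithmetic(
    pretrained: Weights, vectors: Sequence[Weights], coeffs: Sequence[float]
) -> Weights:
    """Return pretrained + sum_i coeffs[i] * vectors[i].

    A positive coefficient adds a task; a negative one negates (unlearns) it.
    """
    edited = {name: list(values) for name, values in pretrained.items()}
    for vector, coeff in zip(vectors, coeffs):
        for name in edited:
            edited[name] = [w + coeff * d for w, d in zip(edited[name], vector[name])]
    return edited


if __name__ == "__main__":
    base = {"w": [1.0, 2.0]}
    tuned_a = {"w": [2.0, 2.0]}   # hypothetical model fine-tuned on task A
    tuned_b = {"w": [1.0, 4.0]}   # hypothetical model fine-tuned on task B
    vec_a = task_vector(base, tuned_a)
    vec_b = task_vector(base, tuned_b)
    # Add task A with weight 1.0 and partially negate task B with weight -0.5.
    print(apply_task_arithmetic(base, [vec_a, vec_b], [1.0, -0.5]))
```

Because editing only requires element-wise additions on stored weights, no further training is needed, which is what makes the approach computationally cheap.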

[LG-20] Multi-scale DeepOnet (Mscale-DeepOnet) for Mitigating Spectral Bias in Learning High Frequency Operators of Oscillatory Functions

链接: https://arxiv.org/abs/2504.10932
作者: Bo Wang,Lizuo Liu,Wei Cai
类目: Machine Learning (cs.LG); Numerical Analysis (math.NA)
*备注:

点击查看摘要

Abstract:In this paper, a multi-scale DeepOnet (Mscale-DeepOnet) is proposed to reduce the spectral bias of the DeepOnet in learning high-frequency mapping between highly oscillatory functions, with an application to the nonlinear mapping between the coefficient of the Helmholtz equation and its solution. The Mscale-DeepOnet introduces the multiscale neural network in the branch and trunk networks of the original DeepOnet, the resulting Mscale-DeepOnet is shown to be able to capture various high-frequency components of the mapping itself and its image. Numerical results demonstrate the substantial improvement of the Mscale-DeepOnet for the problem of wave scattering in the high-frequency regime over the normal DeepOnet with a similar number of network parameters.

[LG-21] Fast-Powerformer: A Memory-Efficient Transformer for Accurate Mid-Term Wind Power Forecasting

链接: https://arxiv.org/abs/2504.10923
作者: Mingyi Zhu,Zhaoxin Li,Qiao Lin,Li Ding
类目: Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注: Mingyi Zhu is the first author. Li Ding is the corresponding author

点击查看摘要

Abstract:Wind power forecasting (WPF), as a significant research topic within renewable energy, plays a crucial role in enhancing the security, stability, and economic operation of power grids. However, due to the high stochasticity of meteorological factors (e.g., wind speed) and significant fluctuations in wind power output, mid-term wind power forecasting faces a dual challenge of maintaining high accuracy and computational efficiency. To address these issues, this paper proposes an efficient and lightweight mid-term wind power forecasting model, termed Fast-Powerformer. The proposed model is built upon the Reformer architecture, incorporating structural enhancements such as a lightweight Long Short-Term Memory (LSTM) embedding module, an input transposition mechanism, and a Frequency Enhanced Channel Attention Mechanism (FECAM). These improvements enable the model to strengthen temporal feature extraction, optimize dependency modeling across variables, significantly reduce computational complexity, and enhance sensitivity to periodic patterns and dominant frequency components. Experimental results conducted on multiple real-world wind farm datasets demonstrate that the proposed Fast-Powerformer achieves superior prediction accuracy and operational efficiency compared to mainstream forecasting approaches. Furthermore, the model exhibits fast inference speed and low memory consumption, highlighting its considerable practical value for real-world deployment scenarios.

[LG-22] Leveraging Submodule Linearity Enhances Task Arithmetic Performance in LLMs ICLR2025

Link: https://arxiv.org/abs/2504.10902
Authors: Rui Dai,Sile Hu,Xu Shen,Yonggang Zhang,Xinmei Tian,Jieping Ye
Subjects: Machine Learning (cs.LG)
Comments: Accepted by ICLR 2025

Abstract:Task arithmetic is a straightforward yet highly effective strategy for model merging, enabling the resultant model to exhibit multi-task capabilities. Recent research indicates that models demonstrating linearity enhance the performance of task arithmetic. In contrast to existing methods that rely on the global linearization of the model, we argue that this linearity already exists within the model’s submodules. In particular, we present a statistical analysis and show that submodules (e.g., layers, self-attentions, and MLPs) exhibit significantly higher linearity than the overall model. Based on these findings, we propose an innovative model merging strategy that independently merges these submodules. Especially, we derive a closed-form solution for optimal merging weights grounded in the linear properties of these submodules. Experimental results demonstrate that our method consistently outperforms the standard task arithmetic approach and other established baselines across different model scales and various tasks. This result highlights the benefits of leveraging the linearity of submodules and provides a new perspective for exploring solutions for effective and practical multi-task model merging.
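For readers unfamiliar with the baseline this abstract builds on, standard task arithmetic can be sketched as follows. This is only a minimal illustration of the generic technique (adding scaled task vectors to the pre-trained weights), not the submodule-wise merging or the closed-form weights proposed in the paper. The function name `task_arithmetic`, the dictionary-of-arrays checkpoint format, and the single shared `scale` are assumptions made for the example.

```python
import numpy as np

def task_arithmetic(base, finetuned, scale=1.0):
    """Merge fine-tuned checkpoints by adding scaled task vectors to the base weights.

    base:      dict mapping parameter name -> np.ndarray (pre-trained weights)
    finetuned: list of dicts with the same keys (one per task)
    scale:     hypothetical shared merging coefficient
    """
    merged = {}
    for name, w0 in base.items():
        # Task vector of each model: fine-tuned weights minus pre-trained weights.
        task_vectors = [ft[name] - w0 for ft in finetuned]
        merged[name] = w0 + scale * np.sum(task_vectors, axis=0)
    return merged

# Toy checkpoints with a single 2x2 parameter.
base = {"layer.weight": np.zeros((2, 2))}
model_a = {"layer.weight": np.array([[1.0, 0.0], [0.0, 0.0]])}
model_b = {"layer.weight": np.array([[0.0, 0.0], [0.0, 2.0]])}
merged = task_arithmetic(base, [model_a, model_b], scale=0.5)
```

In this toy case the merged parameter is the base plus half of each task vector.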

[LG-23] ICAFS: Inter-Client-Aware Feature Selection for Vertical Federated Learning

Link: https://arxiv.org/abs/2504.10851
Authors: Ruochen Jin,Boning Tong,Shu Yang,Bojian Hou,Li Shen
Subjects: Machine Learning (cs.LG)

Abstract:Vertical federated learning (VFL) enables a paradigm for vertically partitioned data across clients to collaboratively train machine learning models. Feature selection (FS) plays a crucial role in Vertical Federated Learning (VFL) due to the unique nature that data are distributed across multiple clients. In VFL, different clients possess distinct subsets of features for overlapping data samples, making the process of identifying and selecting the most relevant features a complex yet essential task. Previous FS efforts have primarily revolved around intra-client feature selection, overlooking vital feature interaction across clients, leading to subpar model outcomes. We introduce ICAFS, a novel multi-stage ensemble approach for effective FS in VFL by considering inter-client interactions. By employing conditional feature synthesis alongside multiple learnable feature selectors, ICAFS facilitates ensemble FS over these selectors using synthetic embeddings. This method bypasses the limitations of private gradient sharing and allows for model training using real data with refined embeddings. Experiments on multiple real-world datasets demonstrate that ICAFS surpasses current state-of-the-art methods in prediction accuracy.

[LG-24] How to Enhance Downstream Adversarial Robustness (almost) without Touching the Pre-Trained Foundation Model?

Link: https://arxiv.org/abs/2504.10850
Authors: Meiqi Liu,Zhuoqun Huang,Yue Xing
Subjects: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
Comments: 22 pages, 2 figures, 12 tables. Include 10 pages of appendices

Abstract:With the rise of powerful foundation models, a pre-training-fine-tuning paradigm has become increasingly popular: a foundation model is pre-trained using a huge amount of data from various sources, and then the downstream users only need to fine-tune and adapt it to specific downstream tasks. However, due to the high computation complexity of adversarial training, it is not feasible to fine-tune the foundation model to improve its robustness on the downstream task. Observing the above challenge, we want to improve the downstream robustness without updating/accessing the weights in the foundation model. Inspired by existing literature in robustness inheritance (Kim et al., 2020), through theoretical investigation, we identify a close relationship between robust contrastive learning and the adversarial robustness of supervised learning. To further validate and utilize this theoretical insight, we design a simple-yet-effective robust auto-encoder as a data pre-processing method before feeding the data into the foundation model. The proposed approach has zero access to the foundation model when training the robust auto-encoder. Extensive experiments demonstrate the effectiveness of the proposed method in improving the robustness of downstream tasks, verifying the connection between the feature robustness (implied by small adversarial contrastive loss) and the robustness of the downstream task.

[LG-25] SonicSieve: Bringing Directional Speech Extraction to Smartphones Using Acoustic Microstructures

Link: https://arxiv.org/abs/2504.10793
Authors: Kuang Yuan,Yifeng Wang,Xiyuxing Zhang,Chengyi Shen,Swarun Kumar,Justin Chan
Subjects: Sound (cs.SD); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)

Abstract:Imagine placing your smartphone on a table in a noisy restaurant and clearly capturing the voices of friends seated around you, or recording a lecturer’s voice with clarity in a reverberant auditorium. We introduce SonicSieve, the first intelligent directional speech extraction system for smartphones using a bio-inspired acoustic microstructure. Our passive design embeds directional cues onto incoming speech without any additional electronics. It attaches to the in-line mic of low-cost wired earphones which can be attached to smartphones. We present an end-to-end neural network that processes the raw audio mixtures in real-time on mobile devices. Our results show that SonicSieve achieves a signal quality improvement of 5.0 dB when focusing on a 30° angular region. Additionally, the performance of our system based on only two microphones exceeds that of conventional 5-microphone arrays.

[LG-26] AtlasD: Automatic Local Symmetry Discovery

Link: https://arxiv.org/abs/2504.10777
Authors: Manu Bhat,Jonghyun Park,Jianke Yang,Nima Dehmamy,Robin Walters,Rose Yu
Subjects: Machine Learning (cs.LG)

Abstract:Existing symmetry discovery methods predominantly focus on global transformations across the entire system or space, but they fail to consider the symmetries in local neighborhoods. This may result in the reported symmetry group being a misrepresentation of the true symmetry. In this paper, we formalize the notion of local symmetry as atlas equivariance. Our proposed pipeline, automatic local symmetry discovery (AtlasD), recovers the local symmetries of a function by training local predictor networks and then learning a Lie group basis to which the predictors are equivariant. We demonstrate AtlasD is capable of discovering local symmetry groups with multiple connected components in top-quark tagging and partial differential equation experiments. The discovered local symmetry is shown to be a useful inductive bias that improves the performance of downstream tasks in climate segmentation and vision tasks.

[LG-27] Collaborative Bayesian Optimization via Wasserstein Barycenters

Link: https://arxiv.org/abs/2504.10770
Authors: Donglin Zhan,Haoting Zhang,Rhonda Righter,Zeyu Zheng,James Anderson
Subjects: Machine Learning (cs.LG); Optimization and Control (math.OC)

Abstract:Motivated by the growing need for black-box optimization and data privacy, we introduce a collaborative Bayesian optimization (BO) framework that addresses both of these challenges. In this framework agents work collaboratively to optimize a function they only have oracle access to. In order to mitigate against communication and privacy constraints, agents are not allowed to share their data but can share their Gaussian process (GP) surrogate models. To enable collaboration under these constraints, we construct a central model to approximate the objective function by leveraging the concept of Wasserstein barycenters of GPs. This central model integrates the shared models without accessing the underlying data. A key aspect of our approach is a collaborative acquisition function that balances exploration and exploitation, allowing for the optimization of decision variables collaboratively in each iteration. We prove that our proposed algorithm is asymptotically consistent and that its implementation via Monte Carlo methods is numerically accurate. Through numerical experiments, we demonstrate that our approach outperforms other baseline collaborative frameworks and is competitive with centralized approaches that do not consider data privacy.
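To give a feel for the barycenter idea mentioned in this abstract, the sketch below computes the 2-Wasserstein barycenter of one-dimensional Gaussians, for which the result is again Gaussian with the weighted average of the means and the weighted average of the standard deviations. This is only the scalar special case, for example several agents' GP posterior marginals at a single query point. The paper works with barycenters of full GPs, and the function name `gaussian_w2_barycenter` and the uniform default weights are assumptions of this example.

```python
import numpy as np

def gaussian_w2_barycenter(means, stds, weights=None):
    """2-Wasserstein barycenter of 1D Gaussians N(means[i], stds[i]^2).

    In one dimension the barycenter averages quantile functions, so it is
    Gaussian with mean sum_i w_i * m_i and standard deviation sum_i w_i * s_i.
    """
    means = np.asarray(means, dtype=float)
    stds = np.asarray(stds, dtype=float)
    if weights is None:
        # Assumed default: every agent's model counts equally.
        weights = np.full(means.shape, 1.0 / means.size)
    weights = np.asarray(weights, dtype=float)
    return float(weights @ means), float(weights @ stds)

# Three agents' predictive marginals at the same input location.
mu, sigma = gaussian_w2_barycenter([0.0, 2.0, 4.0], [1.0, 1.0, 4.0])
```

Only the model summaries (means and standard deviations) enter the computation, which mirrors the abstract's setting where agents share surrogate models rather than raw data.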

[LG-28] Auto-Test: Learning Semantic-Domain Constraints for Unsupervised Error Detection in Tables SIGMOD2025

Link: https://arxiv.org/abs/2504.10762
Authors: Qixu Chen,Yeye He,Raymond Chi-Wing Wong,Weiwei Cui,Song Ge,Haidong Zhang,Dongmei Zhang,Surajit Chaudhuri
Subjects: Databases (cs.DB); Machine Learning (cs.LG)
Comments: full version of a paper accepted to SIGMOD 2025

Abstract:Data cleaning is a long-standing challenge in data management. While powerful logic and statistical algorithms have been developed to detect and repair data errors in tables, existing algorithms predominantly rely on domain-experts to first manually specify data-quality constraints specific to a given table, before data cleaning algorithms can be applied. In this work, we propose a new class of data-quality constraints that we call Semantic-Domain Constraints, which can be reliably inferred and automatically applied to any tables, without requiring domain-experts to manually specify on a per-table basis. We develop a principled framework to systematically learn such constraints from table corpora using large-scale statistical tests, which can further be distilled into a core set of constraints using our optimization framework, with provable quality guarantees. Extensive evaluations show that this new class of constraints can be used to both (1) directly detect errors on real tables in the wild, and (2) augment existing expert-driven data-cleaning techniques as a new class of complementary constraints. Our extensively labeled benchmark dataset with 2400 real data columns, as well as our code are available at this https URL to facilitate future research.

[LG-29] auto-fpt: Automating Free Probability Theory Calculations for Machine Learning Theory

Link: https://arxiv.org/abs/2504.10754
Authors: Arjun Subramonian,Elvis Dohmatob
Subjects: Machine Learning (cs.LG); Mathematical Software (cs.MS)
Comments: Work in progress

Abstract:A large part of modern machine learning theory often involves computing the high-dimensional expected trace of a rational expression of large rectangular random matrices. To symbolically compute such quantities using free probability theory, we introduce auto-fpt, a lightweight Python and SymPy-based tool that can automatically produce a reduced system of fixed-point equations which can be solved for the quantities of interest, and effectively constitutes a theory. We overview the algorithmic ideas underlying auto-fpt and its applications to various interesting problems, such as the high-dimensional error of linearized feed-forward neural networks, recovering well-known results. We hope that auto-fpt streamlines the majority of calculations involved in high-dimensional analysis, while helping the machine learning community reproduce known and uncover new phenomena.

[LG-30] Time-varying EEG spectral power predicts evoked and spontaneous fMRI motor brain activity

Link: https://arxiv.org/abs/2504.10752
Authors: Neil Mehta,Ines Goncalves,Alberto Montagna,Mathis Fleury,Gustavo Caetano,Ines Esteves,Athanasios Vourvopoulos,Pulkit Grover,Patricia Figueiredo
Subjects: Machine Learning (cs.LG); Neurons and Cognition (q-bio.NC)

Abstract:Simultaneous EEG-fMRI recordings are increasingly used to investigate brain activity by leveraging the complementary high spatial and high temporal resolution of fMRI and EEG signals respectively. It remains unclear, however, to what degree these two imaging modalities capture shared information about neural activity. Here, we investigate whether it is possible to predict both task-evoked and spontaneous fMRI signals of motor brain networks from EEG time-varying spectral power using interpretable models trained for individual subjects with Sparse Group Lasso regularization. Critically, we test the trained models on data acquired from each subject on a different day and obtain statistical validation by comparison with appropriate null models as well as the conventional EEG sensorimotor rhythm. We find significant prediction results in most subjects, although less frequently for resting-state compared to task-based conditions. Furthermore, we interpret the model learned parameters to understand representations of EEG-fMRI coupling in terms of predictive EEG channels, frequencies, and haemodynamic delays. In conclusion, our work provides evidence of the ability to predict fMRI motor brain activity from EEG recordings alone across different days, in both task-evoked and spontaneous conditions, with statistical significance in individual subjects. These results present great potential for translation to EEG neurofeedback applications.

[LG-31] Leveraging Deep Operator Networks (DeepONet) for Acoustic Full Waveform Inversion (FWI)

Link: https://arxiv.org/abs/2504.10720
Authors: Kamaljyoti Nath,Khemraj Shukla,Victor C. Tsai,Umair bin Waheed,Christian Huber,Omer Alpak,Chuen-Song Chen,Ligang Lu,Amik St-Cyr
Subjects: Machine Learning (cs.LG)

Abstract:Full Waveform Inversion (FWI) is an important geophysical technique considered in subsurface property prediction. It solves the inverse problem of predicting high-resolution Earth interior models from seismic data. Traditional FWI methods are computationally demanding. Inverse problems in geophysics often face challenges of non-uniqueness due to limited data, as data are often collected only on the surface. In this study, we introduce a novel methodology that leverages Deep Operator Networks (DeepONet) to attempt to improve both the efficiency and accuracy of FWI. The proposed DeepONet methodology inverts seismic waveforms for the subsurface velocity field. This approach is able to capture some key features of the subsurface velocity field. We have shown that the architecture can be applied to noisy seismic data with an accuracy that is better than some other machine learning methods. We also test our proposed method with out-of-distribution prediction for different velocity models. The proposed DeepONet shows comparable and better accuracy in some velocity models than some other machine learning methods. To improve the FWI workflow, we propose using the DeepONet output as a starting model for conventional FWI and that it may improve FWI performance. While we have only shown that DeepONet facilitates faster convergence than starting with a homogeneous velocity field, it may have some benefits compared to other approaches to constructing starting models. This integration of DeepONet into FWI may accelerate the inversion process and may also enhance its robustness and reliability.

[LG-32] Emotion Alignment: Discovering the Gap Between Social Media and Real-World Sentiments in Persian Tweets and Images

Link: https://arxiv.org/abs/2504.10662
Authors: Sina Elahimanesh,Mohammadali Mohammadkhani,Shohreh Kasaei
Subjects: Human-Computer Interaction (cs.HC); Machine Learning (cs.LG); Social and Information Networks (cs.SI)

Abstract:In contemporary society, widespread social media usage is evident in people’s daily lives. Nevertheless, disparities in emotional expressions between the real world and online platforms can manifest. We comprehensively analyzed the Persian community on X to explore this phenomenon. An innovative pipeline was designed to measure the similarity between emotions in the real world and on social media. Accordingly, recent tweets and images of participants were gathered and analyzed using Transformers-based text and image sentiment analysis modules. Each participant’s friends also provided insights into their real-world emotions. A distance criterion was used to compare real-world feelings with virtual experiences. Our study encompassed N=105 participants, 393 friends who contributed their perspectives, over 8,300 collected tweets, and 2,000 media images. Results indicated a 28.67% similarity between images and real-world emotions, while tweets exhibited a 75.88% alignment with real-world feelings. Additionally, statistical significance testing confirmed the observed disparities in sentiment proportions.

[LG-33] Transfer Learning Assisted XgBoost For Adaptable Cyberattack Detection In Battery Packs

Link: https://arxiv.org/abs/2504.10658
Authors: Sanchita Ghosh,Tanushree Roy
Subjects: Systems and Control (eess.SY); Machine Learning (cs.LG)
Comments: 9 pages, 5 figures

Abstract:Optimal charging of electric vehicles (EVs) depends heavily on reliable sensor measurements from the battery pack to the cloud-controller of the smart charging station. However, an adversary could corrupt the voltage sensor data during transmission, potentially causing local to wide-scale disruptions. Therefore, it is essential to detect sensor cyberattacks in real-time to ensure secure EV charging, and the developed algorithms must be readily adaptable to variations, including pack configurations. To tackle these challenges, we propose adaptable fine-tuning of an XgBoost-based cell-level model using limited pack-level data for voltage prediction and residual generation. We used battery cell and pack data from high-fidelity charging experiments in PyBaMM and the 'liionpack' package to train and test the detection algorithm. The algorithm’s performance has been evaluated for two large-format battery packs under sensor swapping and replay attacks. The simulation results also highlight the adaptability and efficacy of our proposed detection algorithm.

[LG-34] SPreV

Link: https://arxiv.org/abs/2504.10620
Authors: Srivathsan Amruth
Subjects: Graphics (cs.GR); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: 45 Pages, 7 Figures, 3 Tables, 9 Algorithms, Opensource

Abstract:SPREV, short for hyperSphere Reduced to two-dimensional Regular Polygon for Visualisation, is a novel dimensionality reduction technique developed to address the challenges of reducing dimensions and visualizing labeled datasets that exhibit a unique combination of three characteristics: small class size, high dimensionality, and low sample size. SPREV is designed not only to uncover but also to visually represent hidden patterns within such datasets. Its distinctive integration of geometric principles, adapted for discrete computational environments, makes it an indispensable tool in the modern data science toolkit, enabling users to identify trends, extract insights, and navigate complex data efficiently and effectively.

[LG-35] Integrating Textual Embeddings from Contrastive Learning with Generative Recommender for Enhanced Personalization WWW

Link: https://arxiv.org/abs/2504.10545
Authors: Yijun Liu
Subjects: Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments: Code available at this https URL

Abstract:Recent advances in recommender systems have highlighted the complementary strengths of generative modeling and pretrained language models. We propose a hybrid framework that augments the Hierarchical Sequential Transduction Unit (HSTU) generative recommender with BLaIR – a contrastive text embedding model. This integration enriches item representations with semantic signals from textual metadata while preserving HSTU’s powerful sequence modeling capabilities. We evaluate our method on two domains from the Amazon Reviews 2023 dataset, comparing it against the original HSTU and a variant that incorporates embeddings from OpenAI’s state-of-the-art text-embedding-3-large model. While the OpenAI embedding model is likely trained on a substantially larger corpus with significantly more parameters, our lightweight BLaIR-enhanced approach – pretrained on domain-specific data – consistently achieves better performance, highlighting the effectiveness of contrastive text embeddings in compute-efficient settings.

[LG-36] PinRec: Outcome-Conditioned Multi-Token Generative Retrieval for Industry-Scale Recommendation Systems KDD

Link: https://arxiv.org/abs/2504.10507
Authors: Anirudhan Badrinath,Prabhat Agarwal,Laksh Bhasin,Jaewon Yang,Jiajing Xu,Charles Rosenberg
Subjects: Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments: Submitted to KDD ADS 2025

Abstract:Generative retrieval methods utilize generative sequential modeling techniques, such as transformers, to generate candidate items for recommender systems. These methods have demonstrated promising results in academic benchmarks, surpassing traditional retrieval models like two-tower architectures. However, current generative retrieval methods lack the scalability required for industrial recommender systems, and they are insufficiently flexible to satisfy the multiple metric requirements of modern systems. This paper introduces PinRec, a novel generative retrieval model developed for applications at Pinterest. PinRec utilizes outcome-conditioned generation, enabling modelers to specify how to balance various outcome metrics, such as the number of saves and clicks, to effectively align with business goals and user exploration. Additionally, PinRec incorporates multi-token generation to enhance output diversity while optimizing generation. Our experiments demonstrate that PinRec can successfully balance performance, diversity, and efficiency, delivering a significant positive impact to users using generative models. This paper marks a significant milestone in generative retrieval, as it presents, to our knowledge, the first rigorous study on implementing generative retrieval at the scale of Pinterest.

[LG-37] Early Impacts of M365 Copilot

Link: https://arxiv.org/abs/2504.11443
Authors: Eleanor Wiske Dillon,Sonia Jaffe,Sida Peng,Alexia Cambon
Subjects: General Economics (econ.GN); Machine Learning (cs.LG)

Abstract:Advances in generative AI have rapidly expanded the potential of computers to perform or assist in a wide array of tasks traditionally performed by humans. We analyze a large, real-world randomized experiment of over 6,000 workers at 56 firms to present some of the earliest evidence on how these technologies are changing the way knowledge workers do their jobs. We find substantial time savings on common core tasks across a wide range of industries and occupations: workers who make use of this technology spent half an hour less reading email each week and completed documents 12% faster. Despite the newness of the technology, nearly 40% of workers who were given access to the tool used it regularly in their work throughout the 6-month study.

[LG-38] Shifting Work Patterns with Generative AI

Link: https://arxiv.org/abs/2504.11436
Authors: Eleanor Wiske Dillon,Sonia Jaffe,Nicole Immorlica,Christopher T. Stanton
Subjects: General Economics (econ.GN); Machine Learning (cs.LG)

Abstract:We present evidence on how generative AI changes the work patterns of knowledge workers using data from a 6-month-long, cross-industry, randomized field experiment. Half of the 6,000 workers in the study received access to a generative AI tool integrated into the applications they already used for emails, document creation, and meetings. We find that access to the AI tool during the first year of its release primarily impacted behaviors that could be changed independently and not behaviors that required coordination to change: workers who used the tool spent 3 fewer hours, or 25% less time on email each week (intent to treat estimate is 1.4 hours) and seemed to complete documents moderately faster, but did not significantly change time spent in meetings.

[LG-39] Mildly-Interacting Fermionic Unitaries are Efficiently Learnable

Link: https://arxiv.org/abs/2504.11318
Authors: Vishnu Iyer
Subjects: Quantum Physics (quant-ph); Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG)
Comments: 30 pages, 4 figures

Abstract:Recent work has shown that one can efficiently learn fermionic Gaussian unitaries, also commonly known as nearest-neighbor matchcircuits or non-interacting fermionic unitaries. However, one could ask a similar question about unitaries that are near Gaussian: for example, unitaries prepared with a small number of non-Gaussian circuit elements. These operators find significance in quantum chemistry and many-body physics, yet no algorithm exists to learn them. We give the first such result by devising an algorithm which makes queries to an $n$-mode fermionic unitary $U$ prepared by at most $O(t)$ non-Gaussian gates and returns a circuit approximating $U$ to diamond distance $\varepsilon$ in time $\mathrm{poly}(n, 2^t, 1/\varepsilon)$. This resolves a central open question of Mele and Herasymenko under the strongest distance metric. In fact, our algorithm is much more general: we define a property of unitary Gaussianity known as unitary Gaussian dimension and show that our algorithm can learn $n$-mode unitaries of Gaussian dimension at least $2n - O(t)$ in time $\mathrm{poly}(n, 2^t, 1/\varepsilon)$. Indeed, this class subsumes unitaries prepared by at most $O(t)$ non-Gaussian gates but also includes several unitaries that require up to $2^{O(t)}$ non-Gaussian gates to construct. In addition, we give a $\mathrm{poly}(n, 1/\varepsilon)$-time algorithm to distinguish whether an $n$-mode unitary is of Gaussian dimension at least $k$ or $\varepsilon$-far from all such unitaries in Frobenius distance, promised that one is the case. Along the way, we prove structural results about near-Gaussian fermionic unitaries that are likely to be of independent interest.

[LG-40] Differentially Private Geodesic and Linear Regression

Link: https://arxiv.org/abs/2504.11304
Authors: Aditya Kulkarni,Carlos Soto
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments: 16 pages, 7 figures

Abstract:In statistical applications it has become increasingly common to encounter data structures that live on non-linear spaces such as manifolds. Classical linear regression, one of the most fundamental methodologies of statistical learning, captures the relationship between an independent variable and a response variable which both are assumed to live in Euclidean space. Thus, geodesic regression emerged as an extension where the response variable lives on a Riemannian manifold. The parameters of geodesic regression, as with linear regression, capture the relationship of sensitive data and hence one should consider the privacy protection practices of said parameters. We consider releasing Differentially Private (DP) parameters of geodesic regression via the K-Norm Gradient (KNG) mechanism for Riemannian manifolds. We derive theoretical bounds for the sensitivity of the parameters showing they are tied to their respective Jacobi fields and hence the curvature of the space. This corroborates recent findings of differential privacy for the Fréchet mean. We demonstrate the efficacy of our methodology on the sphere, $\mathbb{S}^2 \subset \mathbb{R}^3$, and, since it is general to Riemannian manifolds, the manifold of Euclidean space which simplifies geodesic regression to a case of linear regression. Our methodology is general to any Riemannian manifold and thus it is suitable for data in domains such as medical imaging and computer vision.
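The reduction from geodesic to linear regression mentioned in this abstract can be written out briefly. The notation below follows the usual least-squares formulation of geodesic regression and is an assumption of this note, not taken from the paper: $p$ is a base point on the manifold $M$, $v$ a tangent vector at $p$, $\mathrm{Exp}$ the Riemannian exponential map, and $d$ the geodesic distance.

```latex
% Geodesic regression: fit a geodesic x -> Exp(p, x v) to data (x_i, y_i), with y_i on M
(\hat{p}, \hat{v}) = \arg\min_{(p, v) \in TM} \; \frac{1}{2} \sum_{i=1}^{N} d\bigl(\mathrm{Exp}(p, x_i v),\, y_i\bigr)^2

% Euclidean special case M = R^n: Exp(p, x v) = p + x v and d(a, b) = ||a - b||,
% so the problem becomes ordinary least squares with intercept p and slope v
(\hat{p}, \hat{v}) = \arg\min_{p, v \in \mathbb{R}^n} \; \frac{1}{2} \sum_{i=1}^{N} \bigl\| p + x_i v - y_i \bigr\|^2
```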

[LG-41] Limits of Discrete Energy of Families of Increasing Sets

Link: https://arxiv.org/abs/2504.11302
Authors: Hari Nathan
Subjects: Classical Analysis and ODEs (math.CA); Machine Learning (cs.LG); Metric Geometry (math.MG)

Abstract:The Hausdorff dimension of a set can be detected using the Riesz energy. Here, we consider situations where a sequence of points, $\{x_n\}$, "fills in" a set $E \subset \mathbb{R}^d$ in an appropriate sense and investigate the degree to which the discrete analog to the Riesz energy of these sets can be used to bound the Hausdorff dimension of $E$. We also discuss applications to data science and Erdős/Falconer type problems.
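As a rough illustration of what a discrete analog of the Riesz energy looks like, the sketch below averages the inverse s-th power of pairwise distances over a finite point set. The normalization by the number of ordered pairs and the function name `discrete_riesz_energy` are assumptions of this example, and the paper may define the quantity differently.

```python
import numpy as np

def discrete_riesz_energy(points, s):
    """Average of |x_i - x_j|^(-s) over all ordered pairs with i != j.

    Assumes the points are pairwise distinct, so no distance is zero.
    """
    pts = np.asarray(points, dtype=float)
    n = len(pts)
    # Pairwise Euclidean distances via broadcasting: shape (n, n).
    diffs = pts[:, None, :] - pts[None, :, :]
    dists = np.linalg.norm(diffs, axis=-1)
    # Exclude the diagonal (i == j), where the distance is zero.
    off_diag = ~np.eye(n, dtype=bool)
    return float(np.sum(dists[off_diag] ** (-s)) / (n * (n - 1)))

# Three points in the plane with pairwise distances 1, 1, and sqrt(2).
pts = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
energy = discrete_riesz_energy(pts, s=1.0)
```

Tracking how this quantity behaves as more points are added, for different values of `s`, is the kind of computation the abstract relates to the Hausdorff dimension of the limiting set.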

[LG-42] Efficient and Stable Multi-Dimensional Kolmogorov-Smirnov Distance

Link: https://arxiv.org/abs/2504.11299
Authors: Peter Matthew Jacobs,Foad Namjoo,Jeff M. Phillips
Subjects: Computation (stat.CO); Computational Geometry (cs.CG); Machine Learning (cs.LG)
Comments: 21 pages, Primary: stat.CO. Secondary: cs.CG, cs.LG

Abstract:We revisit extending the Kolmogorov-Smirnov distance between probability distributions to the multidimensional setting and make new arguments about the proper way to approach this generalization. Our proposed formulation maximizes the difference over orthogonal dominating rectangular ranges (d-sided rectangles in R^d), and is an integral probability metric. We also prove that the distance between a distribution and a sample from the distribution converges to 0 as the sample size grows, and bound this rate. Moreover, we show that one can, up to this same approximation error, compute the distance efficiently in 4 or fewer dimensions; specifically the runtime is near-linear in the size of the sample needed for that error. With this, we derive a delta-precision two-sample hypothesis test using this distance. Finally, we show these metric and approximation properties do not hold for other popular variants.
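The formulation in this abstract can be illustrated with a brute-force two-sample computation. The sketch assumes that a "dominating rectangular range" means the set of points coordinate-wise less than or equal to a corner `c`, and it simply tries every corner built from coordinate values in the pooled sample. Its cost grows exponentially with the dimension, so it is a readable reference rather than the near-linear algorithm the paper describes. The function name `ks_dominating` is hypothetical.

```python
import itertools
import numpy as np

def ks_dominating(sample_p, sample_q):
    """Max difference in empirical mass over ranges {x : x <= c coordinate-wise}."""
    p = np.asarray(sample_p, dtype=float)
    q = np.asarray(sample_q, dtype=float)
    pooled = np.vstack([p, q])
    # The empirical masses only change at coordinate values present in the data,
    # so corners drawn from these values are enough to find the maximum.
    axes = [np.unique(pooled[:, j]) for j in range(pooled.shape[1])]
    best = 0.0
    for corner in itertools.product(*axes):
        c = np.asarray(corner)
        frac_p = np.mean(np.all(p <= c, axis=1))
        frac_q = np.mean(np.all(q <= c, axis=1))
        best = max(best, abs(frac_p - frac_q))
    return float(best)

# Two small 2D samples: the second is shifted up and to the right of the first.
a = [[0.0, 0.0], [1.0, 1.0]]
b = [[2.0, 2.0], [3.0, 3.0]]
dist_ab = ks_dominating(a, b)
dist_aa = ks_dominating(a, a)
```

In one dimension this reduces to the familiar two-sample Kolmogorov-Smirnov statistic, the maximum gap between the two empirical CDFs.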

[LG-43] Multi-Agent Reinforcement Learning for Greenhouse Gas Offset Credit Markets

Link: https://arxiv.org/abs/2504.11258
Authors: Liam Welsh,Udit Grover,Sebastian Jaimungal
Subjects: Mathematical Finance (q-fin.MF); Machine Learning (cs.LG)

Abstract:Climate change is a major threat to the future of humanity, and its impacts are being intensified by excess man-made greenhouse gas emissions. One method governments can employ to control these emissions is to provide firms with emission limits and penalize any excess emissions above the limit. Excess emissions may also be offset by firms who choose to invest in carbon reducing and capturing projects. These projects generate offset credits which can be submitted to a regulating agency to offset a firm’s excess emissions, or they can be traded with other firms. In this work, we characterize the finite-agent Nash equilibrium for offset credit markets. As computing Nash equilibria is an NP-hard problem, we utilize the modern reinforcement learning technique Nash-DQN to efficiently estimate the market’s Nash equilibria. We demonstrate not only the validity of employing reinforcement learning methods applied to climate themed financial markets, but also the significant financial savings emitting firms may achieve when abiding by the Nash equilibria through numerical experiments.

[LG-44] Using Time Structure to Estimate Causal Effects

Link: https://arxiv.org/abs/2504.11076
Authors: Tom Hochsprung,Jakob Runge,Andreas Gerhardus
Subjects: Methodology (stat.ME); Machine Learning (cs.LG)
Comments: 25 pages main paper, 25 pages Appendix, 50 pages in total, 3 tables, 7 figures

Abstract:There exist several approaches for estimating causal effects in time series when latent confounding is present. Many of these approaches rely on additional auxiliary observed variables or time series such as instruments, negative controls or time series that satisfy the front- or backdoor criterion in certain graphs. In this paper, we present a novel approach for estimating direct (and via Wright’s path rule total) causal effects in a time series setup which does not rely on additional auxiliary observed variables or time series. This approach assumes that the underlying time series is a Structural Vector Autoregressive (SVAR) process and estimates direct causal effects by solving certain linear equation systems made up of different covariances and model parameters. We state sufficient graphical criteria in terms of the so-called full time graph under which these linear equations systems are uniquely solvable and under which their solutions contain the to-be-identified direct causal effects as components. We also state sufficient lag-based criteria under which the previously mentioned graphical conditions are satisfied and, thus, under which direct causal effects are identifiable. Several numerical experiments underline the correctness and applicability of our results.

[LG-45] Early Detection of Cognitive Impairment in Elderly using a Passive FPVS-EEG BCI and Machine Learning – Extended Version

Link: https://arxiv.org/abs/2504.10973
Authors: Tomasz M. Rutkowski,Stanisław Narębski,Mihoko Otake-Matsuura,Tomasz Komendziński
Subjects: Neurons and Cognition (q-bio.NC); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
Comments: 4 pages, 4 figures, extended version of an abstract accepted for a poster presentation at the 47th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC2025), Copenhagen, Denmark, July 14-17, 2025

Abstract:Early dementia diagnosis requires biomarkers sensitive to both structural and functional brain changes. While structural neuroimaging biomarkers have progressed significantly, objective functional biomarkers of early cognitive decline remain a critical unmet need. Current cognitive assessments often rely on behavioral responses, making them susceptible to factors like effort, practice effects, and educational background, thereby hindering early and accurate detection. This work introduces a novel approach, leveraging a lightweight convolutional neural network (CNN) to infer cognitive impairment levels directly from electroencephalography (EEG) data. Critically, this method employs a passive fast periodic visual stimulation (FPVS) paradigm, eliminating the need for explicit behavioral responses or task comprehension from the participant. This passive approach provides an objective measure of working memory function, independent of confounding factors inherent in active cognitive tasks, and offers a promising new avenue for early and unbiased detection of cognitive decline.

[LG-46] Wasserstein Distributionally Regret Optimization

Link: https://arxiv.org/abs/2504.10796
Authors: Lukas-Benedikt Fiechtner,Jose Blanchet
Subjects: Optimization and Control (math.OC); Machine Learning (cs.LG)

Abstract:Distributionally Robust Optimization (DRO) is a popular framework for decision-making under uncertainty, but its adversarial nature can lead to overly conservative solutions. To address this, we study ex-ante Distributionally Robust Regret Optimization (DRRO), focusing on Wasserstein-based ambiguity sets which are popular due to their links to regularization and machine learning. We provide a systematic analysis of Wasserstein DRRO, paralleling known results for Wasserstein DRO. Under smoothness and regularity conditions, we show that Wasserstein DRRO coincides with Empirical Risk Minimization (ERM) up to first-order terms, and exactly so in convex quadratic settings. We revisit the Wasserstein DRRO newsvendor problem, where the loss is the maximum of two linear functions of demand and decision. Extending [25], we show that the regret can be computed by maximizing two one-dimensional concave functions. For more general loss functions involving the maximum of multiple linear terms in multivariate random variables and decision vectors, we prove that computing the regret and thus also the DRRO policy is NP-hard. We then propose a convex relaxation for these more general Wasserstein DRRO problems and demonstrate its strong empirical performance. Finally, we provide an upper bound on the optimality gap of our relaxation and show it improves over recent alternatives.

[LG-47] Cross-Problem Parameter Transfer in Quantum Approximate Optimization Algorithm: A Machine Learning Approach

Link: https://arxiv.org/abs/2504.10733
Authors: Kien X. Nguyen,Bao Bach,Ilya Safro
Subjects: Quantum Physics (quant-ph); Machine Learning (cs.LG)

Abstract:Quantum Approximate Optimization Algorithm (QAOA) is one of the most promising candidates to achieve the quantum advantage in solving combinatorial optimization problems. The process of finding a good set of variational parameters in the QAOA circuit has proven to be challenging due to multiple factors, such as barren plateaus. As a result, there is growing interest in exploiting parameter transferability, where parameter sets optimized for one problem instance are transferred to another that could be more complex either to estimate the solution or to serve as a warm start for further optimization. But can we transfer parameters from one class of problems to another? Leveraging parameter sets learned from a well-studied class of problems could help navigate the less studied one, reducing optimization overhead and mitigating performance pitfalls. In this paper, we study whether pretrained QAOA parameters of MaxCut can be used as is or to warm start the Maximum Independent Set (MIS) circuits. Specifically, we design machine learning models to find good donor candidates optimized on MaxCut and apply their parameters to MIS acceptors. Our experimental results show that such parameter transfer can significantly reduce the number of optimization iterations required while achieving comparable approximation ratios.

[LG-48] Distinct hydrologic response patterns and trends worldwide revealed by physics-embedded learning

Link: https://arxiv.org/abs/2504.10707
Authors: Haoyu Ji,Yalan Song,Tadd Bindas,Chaopeng Shen,Yuan Yang,Ming Pan,Jiangtao Liu,Farshid Rahmani,Ather Abbas,Hylke Beck,Yoshihide Wada,Kathryn Lawson
Subjects: Geophysics (physics.geo-ph); Machine Learning (cs.LG)

Abstract:To track rapid changes within our water sector, Global Water Models (GWMs) need to realistically represent hydrologic systems’ response patterns - such as baseflow fraction - but are hindered by their limited ability to learn from data. Here we introduce a high-resolution physics-embedded big-data-trained model as a breakthrough in reliably capturing characteristic hydrologic response patterns (‘signatures’) and their shifts. By realistically representing the long-term water balance, the model revealed widespread shifts - up to ~20% over 20 years - in fundamental green-blue-water partitioning and baseflow ratios worldwide. Shifts in these response patterns, previously considered static, contributed to increasing flood risks in northern mid-latitudes, heightening water supply stresses in southern subtropical regions, and declining freshwater inputs to many European estuaries, all with ecological implications. With more accurate simulations at monthly and daily scales than current operational systems, this next-generation model resolves large, nonlinear seasonal runoff responses to rainfall (‘elasticity’) and streamflow flashiness in semi-arid and arid regions. These metrics highlight regions with management challenges due to large water supply variability and high climate sensitivity, but also provide tools to forecast seasonal water availability. This capability newly enables global-scale models to deliver reliable and locally relevant insights for water management.

[LG-49] On the Contractivity of Stochastic Interpolation Flow

Link: https://arxiv.org/abs/2504.10653
Authors: Max Daniels
Subjects: Statistics Theory (math.ST); Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: Proof of concept. I would be excited to chat about extensions!

Abstract:We investigate stochastic interpolation, a recently introduced framework for high dimensional sampling which bears many similarities to diffusion modeling. Stochastic interpolation generates a data sample by first randomly initializing a particle drawn from a simple base distribution, then simulating deterministic or stochastic dynamics such that in finite time the particle’s distribution converges to the target. We show that for a Gaussian base distribution and a strongly log-concave target distribution, the stochastic interpolation flow map is Lipschitz with a sharp constant which matches that of Caffarelli’s theorem for optimal transport maps. We are further able to construct Lipschitz transport maps between non-Gaussian distributions, generalizing some recent constructions in the literature on transport methods for establishing functional inequalities. We discuss the practical implications of our theorem for the sampling and estimation problems required by stochastic interpolation.

[LG-50] Beyond Worst-Case Online Classification: VC-Based Regret Bounds for Relaxed Benchmarks

Link: https://arxiv.org/abs/2504.10598
Authors: Omar Montasser,Abhishek Shetty,Nikita Zhivotovskiy
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)

Abstract:We revisit online binary classification by shifting the focus from competing with the best-in-class binary loss to competing against relaxed benchmarks that capture smoothed notions of optimality. Instead of measuring regret relative to the exact minimal binary error – a standard approach that leads to worst-case bounds tied to the Littlestone dimension – we consider comparing with predictors that are robust to small input perturbations, perform well under Gaussian smoothing, or maintain a prescribed output margin. Previous examples of this were primarily limited to the hinge loss. Our algorithms achieve regret guarantees that depend only on the VC dimension and the complexity of the instance space (e.g., metric entropy), and notably, they incur only an $O(\log(1/\gamma))$ dependence on the generalized margin $\gamma$. This stands in contrast to most existing regret bounds, which typically exhibit a polynomial dependence on $1/\gamma$. We complement this with matching lower bounds. Our analysis connects recent ideas from adversarial robustness and smoothed online learning.

[LG-51] FLOWR: Flow Matching for Structure-Aware De Novo Interaction- and Fragment-Based Ligand Generation

Link: https://arxiv.org/abs/2504.10564
Authors: Julian Cremer,Ross Irwin,Alessandro Tibot,Jon Paul Janet,Simon Olsson,Djork-Arné Clevert
Subjects: Quantitative Methods (q-bio.QM); Machine Learning (cs.LG); Biomolecules (q-bio.BM)

Abstract:We introduce FLOWR, a novel structure-based framework for the generation and optimization of three-dimensional ligands. FLOWR integrates continuous and categorical flow matching with equivariant optimal transport, enhanced by an efficient protein pocket conditioning. Alongside FLOWR, we present SPINDR, a thoroughly curated dataset comprising ligand-pocket co-crystal complexes specifically designed to address existing data quality issues. Empirical evaluations demonstrate that FLOWR surpasses current state-of-the-art diffusion- and flow-based methods in terms of PoseBusters-validity, pose accuracy, and interaction recovery, while offering a significant inference speedup, achieving up to 70-fold faster performance. In addition, we introduce this http URL, a highly accurate multi-purpose model allowing for the targeted sampling of novel ligands that adhere to predefined interaction profiles and chemical substructures for fragment-based design without the need for re-training or any re-sampling strategies.

[LG-52] Molecular Learning Dynamics

Link: https://arxiv.org/abs/2504.10560
Authors: Yaroslav Gusev,Vitaly Vanchurin
Subjects: Chemical Physics (physics.chem-ph); Machine Learning (cs.LG)
Comments: 16 pages, 7 figures, 1 table

Abstract:We apply the physics-learning duality to molecular systems by complementing the physical description of interacting particles with a dual learning description, where each particle is modeled as an agent minimizing a loss function. In the traditional physics framework, the equations of motion are derived from the Lagrangian function, while in the learning framework, the same equations emerge from learning dynamics driven by the agent loss function. The loss function depends on scalar quantities that describe invariant properties of all other agents or particles. To demonstrate this approach, we first infer the loss functions of oxygen and hydrogen directly from a dataset generated by the CP2K physics-based simulation of water molecules. We then employ the loss functions to develop a learning-based simulation of water molecules, which achieves comparable accuracy while being significantly more computationally efficient than standard physics-based simulations.

[LG-53] Inferring the Hubble Constant Using Simulated Strongly Lensed Supernovae and Neural Network Ensembles

Link: https://arxiv.org/abs/2504.10553
Authors: Gonçalo Gonçalves,Nikki Arendse,Doogesh Kodi Ramanah,Radosław Wojtak
Subjects: Instrumentation and Methods for Astrophysics (astro-ph.IM); Cosmology and Nongalactic Astrophysics (astro-ph.CO); Machine Learning (cs.LG)
Comments: 12 pages, 9 figures. To be submitted to the Open Journal of Astrophysics

Abstract:Strongly lensed supernovae are a promising new probe to obtain independent measurements of the Hubble constant ($H_0$). In this work, we employ simulated gravitationally lensed Type Ia supernovae (glSNe Ia) to train our machine learning (ML) pipeline to constrain $H_0$. We simulate image time-series of glSNe Ia, as observed with the upcoming Nancy Grace Roman Space Telescope, that we employ for training an ensemble of five convolutional neural networks (CNNs). The outputs of this ensemble network are combined with a simulation-based inference (SBI) framework to quantify the uncertainties on the network predictions and infer full posteriors for the $H_0$ estimates. We illustrate that the combination of multiple glSN systems enhances constraint precision, providing a 4.4% estimate of $H_0$ based on 100 simulated systems, which is in agreement with the ground truth. This research highlights the potential of leveraging the capabilities of ML with glSNe systems to obtain a pipeline capable of fast and automated $H_0$ measurements.

[LG-54] LCDC: Bridging Science and Machine Learning for Light Curve Analysis

Link: https://arxiv.org/abs/2504.10550
Authors: Daniel Kyselica,Tomáš Hrobár,Jiří Šilha,Roman Ďurikovič,Marek Šuppa
Subjects: Instrumentation and Methods for Astrophysics (astro-ph.IM); Earth and Planetary Astrophysics (astro-ph.EP); Machine Learning (cs.LG)
Comments: 13 pages, 9 figures. arXiv admin note: text overlap with arXiv:2412.00544

Abstract:The characterization and analysis of light curves are vital for understanding the physical and rotational properties of artificial space objects such as satellites, rocket stages, and space debris. This paper introduces the Light Curve Dataset Creator (LCDC), a Python-based toolkit designed to facilitate the preprocessing, analysis, and machine learning applications of light curve data. LCDC enables seamless integration with publicly available datasets, such as the newly introduced Mini Mega Tortora (MMT) database. Moreover, it offers data filtering, transformation, as well as feature extraction tooling. To demonstrate the toolkit’s capabilities, we created the first standardized dataset for rocket body classification, RoBo6, which was used to train and evaluate several benchmark machine learning models, addressing the lack of reproducibility and comparability in recent studies. Furthermore, the toolkit enables advanced scientific analyses, such as surface characterization of the Atlas 2AS Centaur and the rotational dynamics of the Delta 4 rocket body, by streamlining data preprocessing, feature extraction, and visualization. These use cases highlight LCDC’s potential to advance space debris characterization and promote sustainable space exploration. Additionally, they highlight the toolkit’s ability to enable AI-focused research within the space debris community.

[LG-55] An Efficient Quantum Classifier Based on Hamiltonian Representations

Link: https://arxiv.org/abs/2504.10542
Authors: Federico Tiblias,Anna Schroeder,Yue Zhang,Mariami Gachechiladze,Iryna Gurevych
Subjects: Quantum Physics (quant-ph); Emerging Technologies (cs.ET); Machine Learning (cs.LG)

Abstract:Quantum machine learning (QML) is a discipline that seeks to transfer the advantages of quantum computing to data-driven tasks. However, many studies rely on toy datasets or heavy feature reduction, raising concerns about their scalability. Progress is further hindered by hardware limitations and the significant costs of encoding dense vector representations on quantum devices. To address these challenges, we propose an efficient approach called Hamiltonian classifier that circumvents the costs associated with data encoding by mapping inputs to a finite set of Pauli strings and computing predictions as their expectation values. In addition, we introduce two classifier variants with different scaling in terms of parameters and sample complexity. We evaluate our approach on text and image classification tasks, against well-established classical and quantum models. The Hamiltonian classifier delivers performance comparable to or better than these methods. Notably, our method achieves logarithmic complexity in both qubits and quantum gates, making it well-suited for large-scale, real-world applications. We make our implementation available on GitHub.
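
As a small illustration of the phrase "expectation values of Pauli strings" used above, the sketch below evaluates ⟨ψ|P|ψ⟩ for a Pauli string P by building its dense matrix with NumPy. This is only a classical toy simulation under stated assumptions: it scales exponentially in the number of qubits, it is not the classifier described in the abstract, and the helper names are hypothetical.

```python
import numpy as np

# Single-qubit Pauli matrices, keyed by their usual one-letter labels.
PAULIS = {
    "I": np.eye(2, dtype=complex),
    "X": np.array([[0, 1], [1, 0]], dtype=complex),
    "Y": np.array([[0, -1j], [1j, 0]], dtype=complex),
    "Z": np.array([[1, 0], [0, -1]], dtype=complex),
}


def pauli_string_matrix(label: str) -> np.ndarray:
    """Dense matrix of a Pauli string such as "XZI" (tensor product, left to right)."""
    mat = np.array([[1.0 + 0.0j]])
    for ch in label:
        mat = np.kron(mat, PAULIS[ch])
    return mat


def expectation(state: np.ndarray, label: str) -> float:
    """Real expectation value <state|P|state> for a normalized state vector."""
    # np.vdot conjugates its first argument, which gives the bra <state|.
    return float(np.real(np.vdot(state, pauli_string_matrix(label) @ state)))
```

For a two-qubit Bell state `(|00> + |11>)/sqrt(2)`, `expectation(state, "ZZ")` and `expectation(state, "XX")` are both 1, while `expectation(state, "ZI")` is 0.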

Information Retrieval

[IR-0] Evaluation Report on MCP Servers

Link: https://arxiv.org/abs/2504.11094
Authors: Zhiling Luo,Xiaorong Shi,Xuanrui Lin,Jinyang Gao
Subjects: Information Retrieval (cs.IR); Databases (cs.DB)

Abstract:With the rise of LLMs, a large number of Model Context Protocol (MCP) services have emerged since the end of 2024. However, the effectiveness and efficiency of MCP servers have not been well studied. To study these questions, we propose an evaluation framework, called MCPBench. We selected several widely used MCP servers and conducted an experimental evaluation on their accuracy, time, and token usage. Our experiments showed that the most effective MCP, Bing Web Search, achieved an accuracy of 64%. Importantly, we found that the accuracy of MCP servers can be substantially enhanced by involving a declarative interface. This research paves the way for further investigations into optimized MCP implementations, ultimately leading to better AI-driven applications and data retrieval solutions.

[IR-1] Why am I seeing this? Towards recognizing social media recommender systems with missing recommendations

Link: https://arxiv.org/abs/2504.11000
Authors: Sabrina Guidotti,Sabrina Patania,Giuseppe Vizzari,Dimitri Ognibene,Gregor Donabauer,Udo Kruschwitz,Davide Taibi
Subjects: Information Retrieval (cs.IR); Social and Information Networks (cs.SI)
Comments: Accepted at RLDM 2025

Abstract:Social media plays a crucial role in shaping society, often amplifying polarization and spreading misinformation. These effects stem from complex dynamics involving user interactions, individual traits, and recommender algorithms driving content selection. Recommender systems, which significantly shape the content users see and decisions they make, offer an opportunity for intervention and regulation. However, assessing their impact is challenging due to algorithmic opacity and limited data availability. To effectively model user decision-making, it is crucial to recognize the recommender system adopted by the platform. This work introduces a method for Automatic Recommender Recognition using Graph Neural Networks (GNNs), based solely on network structure and observed behavior. To infer the hidden recommender, we first train a Recommender Neutral User model (RNU) using a GNN and an adapted hindsight academic network recommender, aiming to reduce reliance on the actual recommender in the data. We then generate several Recommender Hypothesis-specific Synthetic Datasets (RHSD) by combining the RNU with different known recommenders, producing ground truths for testing. Finally, we train Recommender Hypothesis-specific User models (RHU) under various hypotheses and compare each candidate with the original used to generate the RHSD. Our approach enables accurate detection of hidden recommenders and their influence on user behavior. Unlike audit-based methods, it captures system behavior directly, without ad hoc experiments that often fail to reflect real platforms. This study provides insights into how recommenders shape behavior, aiding efforts to reduce polarization and misinformation. 

[IR-2] MSCRS: Multi-modal Semantic Graph Prompt Learning Framework for Conversational Recommender Systems

Link: https://arxiv.org/abs/2504.10921
Authors: Yibiao Wei,Jie Zou,Weikang Guo,Guoqing Wang,Xing Xu,Yang Yang
Subjects: Information Retrieval (cs.IR)

Abstract:Conversational Recommender Systems (CRSs) aim to provide personalized recommendations by interacting with users through conversations. Most existing studies of CRS focus on extracting user preferences from conversational contexts. However, due to the short and sparse nature of conversational contexts, it is difficult to fully capture user preferences by conversational contexts only. We argue that multi-modal semantic information can enrich user preference expressions from diverse dimensions (e.g., a user preference for a certain movie may stem from its magnificent visual effects and compelling storyline). In this paper, we propose a multi-modal semantic graph prompt learning framework for CRS, named MSCRS. First, we extract textual and image features of items mentioned in the conversational contexts. Second, we capture higher-order semantic associations within different semantic modalities (collaborative, textual, and image) by constructing modality-specific graph structures. Finally, we propose an innovative integration of multi-modal semantic graphs with prompt learning, harnessing the power of large language models to comprehensively explore high-dimensional semantic relationships. Experimental results demonstrate that our proposed method significantly improves accuracy in item recommendation, as well as generates more natural and contextually relevant content in response generation. We have released the code and the expanded multi-modal CRS datasets to facilitate further exploration in related research: this https URL.

[IR-3] Enhancing Document Retrieval for Curating N-ary Relations in Knowledge Bases

Link: https://arxiv.org/abs/2504.10613
Authors: Xing David Wang,Ulf Leser
Subjects: Information Retrieval (cs.IR)

Abstract:Curation of biomedical knowledge bases (KBs) relies on extracting accurate multi-entity relational facts from the literature - a process that remains largely manual and expert-driven. An essential step in this workflow is retrieving documents that can support or complete partially observed n-ary relations. We present a neural retrieval model designed to assist KB curation by identifying documents that help fill in missing relation arguments and provide relevant contextual evidence. To reduce dependence on scarce gold-standard training data, we exploit existing KB records to construct weakly supervised training sets. Our approach introduces two key technical contributions: (i) a layered contrastive loss that enables learning from noisy and incomplete relational structures, and (ii) a balanced sampling strategy that generates high-quality negatives from diverse KB records. On two biomedical retrieval benchmarks, our approach achieves state-of-the-art performance, outperforming strong baselines in NDCG@10 by 5.7 and 3.7 percentage points, respectively.
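
For readers unfamiliar with the NDCG@10 metric quoted above, here is a minimal sketch of one common way to compute it. It assumes linear gain and normalizes against the ideal ordering of the same retrieved list; both are assumptions for illustration and not necessarily the evaluation setup used in the paper (exponential gain and normalization over all judged documents are also common). The function names are hypothetical.

```python
import math
from typing import Sequence


def dcg_at_k(relevances: Sequence[float], k: int) -> float:
    """Discounted cumulative gain of the first k results (linear gain)."""
    # Rank 0 is the top result; its discount is log2(2) = 1.
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(relevances: Sequence[float], k: int = 10) -> float:
    """DCG divided by the DCG of the ideal (descending) ordering of the same list."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0
```

Here `relevances` lists the graded relevance of each retrieved document in ranked order, so a perfectly ordered list scores 1.0 and a list with no relevant documents scores 0.0.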

Attachment Download

Click to download today's full paper list
