This blog post presents the latest paper list retrieved from Arxiv.org on 2025-04-23. It is updated automatically and organized into five major areas: NLP, CV, ML, AI, and IR. If you would like to receive the daily list by email, please leave your email address in the comments.

Note: The daily paper data is retrieved from Arxiv.org and updated automatically around 12:00 each day.

Tip: If you would like to receive the daily paper data by email, please leave your email address in the comments.

Table of Contents

Overview (2025-04-23)

A total of 411 papers were updated today, including:

  • Natural Language Processing: 54 papers (Computation and Language (cs.CL))
  • Artificial Intelligence: 103 papers (Artificial Intelligence (cs.AI))
  • Computer Vision: 86 papers (Computer Vision and Pattern Recognition (cs.CV))
  • Machine Learning: 111 papers (Machine Learning (cs.LG))

Natural Language Processing

[NLP-0] TTRL: Test-Time Reinforcement Learning

[Quick Read]: This paper tackles reinforcement learning (RL) on unlabeled data for reasoning tasks in Large Language Models (LLMs), where the core challenge is estimating rewards during inference without access to ground-truth labels. The key to the solution is Test-Time Reinforcement Learning (TTRL), which leverages the priors in pre-trained models together with practices common in Test-Time Scaling (TTS), such as majority voting, to estimate effective reward signals that drive the model's self-evolution and performance improvement. Experiments show that TTRL consistently improves performance across a variety of tasks and models; using only unlabeled test data, it boosts the pass@1 performance of Qwen-2.5-Math-7B on AIME 2024 by approximately 159%, surpasses the upper limit of the initial model, and approaches the performance of models trained directly on labeled data.

Link: https://arxiv.org/abs/2504.16084
Authors: Yuxin Zuo,Kaiyan Zhang,Shang Qu,Li Sheng,Xuekai Zhu,Biqing Qi,Youbang Sun,Ganqu Cui,Ning Ding,Bowen Zhou
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Abstract:This paper investigates Reinforcement Learning (RL) on data without explicit labels for reasoning tasks in Large Language Models (LLMs). The core challenge of the problem is reward estimation during inference while not having access to ground-truth information. While this setting appears elusive, we find that common practices in Test-Time Scaling (TTS), such as majority voting, yield surprisingly effective rewards suitable for driving RL training. In this work, we introduce Test-Time Reinforcement Learning (TTRL), a novel method for training LLMs using RL on unlabeled data. TTRL enables self-evolution of LLMs by utilizing the priors in the pre-trained models. Our experiments demonstrate that TTRL consistently improves performance across a variety of tasks and models. Notably, TTRL boosts the pass@1 performance of Qwen-2.5-Math-7B by approximately 159% on the AIME 2024 with only unlabeled test data. Furthermore, although TTRL is only supervised by the Maj@N metric, TTRL has demonstrated performance to consistently surpass the upper limit of the initial model, and approach the performance of models trained directly on test data with ground-truth labels. Our experimental findings validate the general effectiveness of TTRL across various tasks, and highlight TTRL’s potential for broader tasks and domains. GitHub: this https URL
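To make the majority-voting reward concrete, here is a minimal sketch of the idea (the function name and the 0/1 reward scheme are our illustration, not necessarily the paper's exact implementation): the most frequent sampled answer serves as a pseudo-label, and each rollout is rewarded for agreeing with it.

```python
from collections import Counter

def majority_vote_rewards(sampled_answers):
    """Estimate rewards for a group of sampled answers without ground truth:
    the most frequent answer acts as a pseudo-label, and each sample gets
    reward 1.0 if it matches the majority answer, else 0.0."""
    majority, _ = Counter(sampled_answers).most_common(1)[0]
    return [1.0 if a == majority else 0.0 for a in sampled_answers]

# Example: five rollouts for one unlabeled math problem.
print(majority_vote_rewards(["42", "42", "41", "42", "40"]))
# -> [1.0, 1.0, 0.0, 1.0, 0.0]
```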

[NLP-1] Survey of Video Diffusion Models: Foundations, Implementations, and Applications

[Quick Read]: This paper systematically reviews diffusion-based video generation, addressing its notable challenges in motion consistency, computational efficiency, and ethical considerations. Its key contribution is a comprehensive methodological taxonomy that analyzes architectural innovations and optimization strategies and examines applications in low-level vision tasks such as denoising and super-resolution. The paper also explores synergies between diffusion-based video generation and related areas such as video representation learning, question answering, and retrieval. Compared with existing surveys that focus on specific aspects, it offers a broader, more up-to-date, and more fine-grained perspective, with particular attention to evaluation metrics, industry solutions, and training engineering techniques. The crux of the work is integrating theoretical frameworks with practical implementations to guide researchers and practitioners and advance this rapidly evolving field.

Link: https://arxiv.org/abs/2504.16081
Authors: Yimu Wang,Xuye Liu,Wei Pang,Li Ma,Shuai Yuan,Paul Debevec,Ning Yu
Affiliations: University of Waterloo; Netflix Eyeline Studios; Duke University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments:

Abstract:Recent advances in diffusion models have revolutionized video generation, offering superior temporal consistency and visual quality compared to traditional generative adversarial networks-based approaches. While this emerging field shows tremendous promise in applications, it faces significant challenges in motion consistency, computational efficiency, and ethical considerations. This survey provides a comprehensive review of diffusion-based video generation, examining its evolution, technical foundations, and practical applications. We present a systematic taxonomy of current methodologies, analyze architectural innovations and optimization strategies, and investigate applications across low-level vision tasks such as denoising and super-resolution. Additionally, we explore the synergies between diffusion-based video generation and related domains, including video representation learning, question answering, and retrieval. Compared to the existing surveys (Lei et al., 2024a;b; Melnik et al., 2024; Cao et al., 2023; Xing et al., 2024c) which focus on specific aspects of video generation, such as human video synthesis (Lei et al., 2024a) or long-form content generation (Lei et al., 2024b), our work provides a broader, more updated, and more fine-grained perspective on diffusion-based approaches with a special section for evaluation metrics, industry solutions, and training engineering techniques in video generation. This survey serves as a foundational resource for researchers and practitioners working at the intersection of diffusion models and video generation, providing insights into both the theoretical frameworks and practical implementations that drive this rapidly evolving field. A structured list of related works involved in this survey is also available on this https URL.

[NLP-2] PHYBench: Holistic Evaluation of Physical Perception and Reasoning in Large Language Models

[Quick Read]: This paper addresses the lack of high-quality benchmarks for evaluating the physical reasoning abilities of Large Language Models (LLMs). To fill this gap, it introduces PHYBench, a novel high-precision benchmark of 500 meticulously curated physics problems spanning difficulty levels from high school to undergraduate and Physics Olympiad, covering realistic physical scenarios in classical mechanics, electromagnetism, thermodynamics, optics, modern physics, and advanced physics. The paper also proposes the Expression Edit Distance (EED) Score, a novel evaluation metric based on the edit distance between mathematical expressions, which captures differences in model reasoning processes and results more effectively than traditional binary scoring. Evaluating LLMs on PHYBench against human experts shows that even state-of-the-art reasoning models lag significantly behind humans on complex physical reasoning, revealing the limitations of current models and directions for improvement. The key to the solution is thus the comprehensive, fine-grained PHYBench benchmark combined with the innovative EED scoring mechanism for more precise performance measurement.

Link: https://arxiv.org/abs/2504.16074
Authors: Shi Qiu,Shaoyang Guo,Zhuo-Yang Song,Yunbo Sun,Zeyu Cai,Jiashen Wei,Tianyu Luo,Yixuan Yin,Haoxu Zhang,Yi Hu,Chenyang Wang,Chencheng Tang,Haoling Chang,Qi Liu,Ziheng Zhou,Tianyu Zhang,Jingtian Zhang,Zhangyi Liu,Minghao Li,Yuku Zhang,Boxuan Jing,Xianqi Yin,Yutong Ren,Zizhuo Fu,Weike Wang,Xudong Tian,Anqi Lv,Laifu Man,Jianxiang Li,Feiyu Tao,Qihua Sun,Zhou Liang,Yushu Mu,Zhongxuan Li,Jing-Jun Zhang,Shutao Zhang,Xiaotian Li,Xingqi Xia,Jiawei Lin,Zheyu Shen,Jiahang Chen,Qiuhao Xiong,Binran Wang,Fengyuan Wang,Ziyang Ni,Bohan Zhang,Fan Cui,Changkun Shao,Qing-Hong Cao,Ming-xing Luo,Muhan Zhang,Hua Xing Zhu
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments: 21 pages, 8 figures, 4 tables

Abstract:We introduce PHYBench, a novel, high-quality benchmark designed for evaluating reasoning capabilities of large language models (LLMs) in physical contexts. PHYBench consists of 500 meticulously curated physics problems based on real-world physical scenarios, designed to assess the ability of models to understand and reason about realistic physical processes. Covering mechanics, electromagnetism, thermodynamics, optics, modern physics, and advanced physics, the benchmark spans difficulty levels from high school exercises to undergraduate problems and Physics Olympiad challenges. Additionally, we propose the Expression Edit Distance (EED) Score, a novel evaluation metric based on the edit distance between mathematical expressions, which effectively captures differences in model reasoning processes and results beyond traditional binary scoring methods. We evaluate various LLMs on PHYBench and compare their performance with human experts. Our results reveal that even state-of-the-art reasoning models significantly lag behind human experts, highlighting their limitations and the need for improvement in complex physical reasoning scenarios. Our benchmark results and dataset are publicly available at this https URL.
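A rough illustration of scoring by edit distance between expressions (the paper's exact tokenizer and normalization are not reproduced here; both are assumptions): tokenize the predicted and reference expressions and map a Levenshtein distance to a 0-100 similarity score.

```python
import re

def tokenize(expr):
    # Coarse math tokenizer: names, numbers, and single-character operators.
    return re.findall(r"[A-Za-z]+|\d+|\S", expr)

def edit_distance(a, b):
    # Dynamic-programming Levenshtein distance over token lists.
    dp = list(range(len(b) + 1))
    for i, ta in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, tb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1,
                                     prev + (ta != tb))
    return dp[-1]

def eed_score(pred, gold):
    # Assumed normalization: 100 means identical, 0 means fully different.
    pa, ga = tokenize(pred), tokenize(gold)
    d = edit_distance(pa, ga)
    return 100.0 * max(0.0, 1.0 - d / max(len(pa), len(ga), 1))

print(round(eed_score("m*g*h + m*v**2/2", "m*g*h"), 1))
```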

[NLP-3] Guiding VLM Agents with Process Rewards at Inference Time for GUI Navigation

[Quick Read]: This paper addresses the difficulty visual language models (VLMs) face in generating correct actions for complex Graphical User Interface (GUI) interaction tasks. Current frameworks face two main obstacles: state-of-the-art commercial VLMs are black boxes, while fine-tuning open-source VLMs for GUI tasks requires substantial resources; and existing trajectory-level evaluation and refinement techniques often fall short due to delayed feedback and local optimization issues. The key to the solution is to guide VLM agents at inference time with process supervision from a reward model, allowing the agent to optimize its action at every inference step and improving performance in both static and dynamic environments. Concretely, the method yields significant gains on three GUI navigation tasks: a 3.4% improvement in single-step action accuracy in static environments and an increase of around 33% in task success rate in one dynamic environment. Combined with trajectory reflection and retry mechanisms, task success improves even further.

Link: https://arxiv.org/abs/2504.16073
Authors: Zhiyuan Hu,Shiyun Xiong,Yifan Zhang,See-Kiong Ng,Anh Tuan Luu,Bo An,Shuicheng Yan,Bryan Hooi
Affiliations: National University of Singapore; Skywork AI; Nanyang Technological University
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Recent advancements in visual language models (VLMs) have notably enhanced their capabilities in handling complex Graphical User Interface (GUI) interaction tasks. Despite these improvements, current frameworks often struggle to generate correct actions in challenging GUI environments. State-of-the-art commercial VLMs are black-boxes, and fine-tuning open-source VLMs for GUI tasks requires significant resources. Additionally, existing trajectory-level evaluation and refinement techniques frequently fall short due to delayed feedback and local optimization issues. To address these challenges, we propose an approach that guides VLM agents with process supervision by a reward model during GUI navigation and control at inference time. This guidance allows the VLM agent to optimize actions at each inference step, thereby improving performance in both static and dynamic environments. In particular, our method demonstrates significant performance gains in three GUI navigation tasks, achieving a 3.4% improvement in single-step action accuracy for static environments, along with an increase of around 33% in task success rate in one dynamic environment. With further integration of trajectory reflection and retry mechanisms, we also demonstrate even greater enhancement in task success.
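Schematically, inference-time process supervision looks like the sketch below (the agent and reward-model classes are toy stand-ins, not the paper's models): sample several candidate actions at each step, score them with the reward model, and execute the best one.

```python
import random

class ToyAgent:
    """Stand-in for a VLM agent that proposes GUI actions."""
    def propose_action(self, obs):
        return random.choice(["click(search_box)", "type('weather')",
                              "click(submit)", "scroll(down)"])

class ToyRewardModel:
    """Stand-in process reward model scoring (observation, action) pairs."""
    def score(self, obs, action):
        # Toy heuristic: prefer actions touching the search box.
        return 1.0 if "search" in action else random.random() * 0.5

def guided_step(agent, rm, obs, n_candidates=5):
    # Pick the candidate action the process reward model prefers.
    candidates = [agent.propose_action(obs) for _ in range(n_candidates)]
    return max(candidates, key=lambda a: rm.score(obs, a))

print(guided_step(ToyAgent(), ToyRewardModel(), obs="home screen"))
```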

[NLP-4] A Python Tool for Reconstructing Full News Text from GDELT

[Quick Read]: This paper addresses the difficulty and cost of acquiring news data, particularly the challenge of obtaining comprehensive full-text news corpora. Many mainstream news providers (e.g., Factiva and LexisNexis) require expensive subscriptions, while free alternatives suffer from incomplete data and transparency issues. The key to the solution is to leverage the Web News NGrams 3.0 dataset from the Global Database of Events, Language, and Tone (GDELT), which contains high-frequency n-grams extracted from global online news sources, and to reconstruct full-text articles with Python code. The core of the method is identifying overlapping text fragments and merging them intelligently, enabling access to high-quality, structured news data at near-zero cost. This overcomes the limitations of existing proprietary datasets and substantially improves the accessibility and applicability of news data for economic forecasting, computational social science, and natural language processing.

Link: https://arxiv.org/abs/2504.16063
Authors: A. Fronzetti Colladon,R. Vestrelli
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Databases (cs.DB); Information Retrieval (cs.IR)
Comments:

Abstract:News data have become an essential resource across various disciplines, including economics, finance, management, social sciences, and computer science. Researchers leverage newspaper articles to study economic trends, market dynamics, corporate strategies, public perception, political discourse, and the evolution of public opinion. Additionally, news datasets have been instrumental in training large-scale language models, with applications in sentiment analysis, fake news detection, and automated news summarization. Despite their significance, access to comprehensive news corpora remains a key challenge. Many full-text news providers, such as Factiva and LexisNexis, require costly subscriptions, while free alternatives often suffer from incomplete data and transparency issues. This paper presents a novel approach to obtaining full-text newspaper articles at near-zero cost by leveraging data from the Global Database of Events, Language, and Tone (GDELT). Specifically, we focus on the GDELT Web News NGrams 3.0 dataset, which provides high-frequency updates of n-grams extracted from global online news sources. We provide Python code to reconstruct full-text articles from these n-grams by identifying overlapping textual fragments and intelligently merging them. Our method enables researchers to access structured, large-scale newspaper data for text analysis while overcoming the limitations of existing proprietary datasets. The proposed approach enhances the accessibility of news data for empirical research, facilitating applications in economic forecasting, computational social science, and natural language processing.
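The core reconstruction step, greedily chaining fragments by their longest suffix/prefix overlap, can be sketched as follows (a simplification; the released tool has to cope with noisy, non-contiguous n-grams):

```python
def overlap(a, b):
    """Length of the longest suffix of `a` that is a prefix of `b`."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a[-k:] == b[:k]:
            return k
    return 0

def merge_fragments(fragments):
    """Greedily merge overlapping text fragments into one string."""
    text, rest = fragments[0], list(fragments[1:])
    while rest:
        best = max(rest, key=lambda f: overlap(text, f))
        text += best[overlap(text, best):]
        rest.remove(best)
    return text

parts = ["the central bank raised", "bank raised interest rates",
         "interest rates on Tuesday"]
print(merge_fragments(parts))
# -> the central bank raised interest rates on Tuesday
```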

[NLP-5] Vision-Language Models Are Not Pragmatically Competent in Referring Expression Generation

[Quick Read]: This paper addresses the neglect of the pragmatic dimension in current vision-language models (VLMs) for Referring Expression Generation (REG). Existing evaluations typically reduce REG to region-based image captioning and ignore the Gricean cooperative maxims. To tackle this, the paper revisits REG from a pragmatic perspective and introduces a new dataset (RefOI) of 1.5k images annotated with both written and spoken referring expressions. A systematic evaluation of state-of-the-art VLMs reveals three key failures of pragmatic competence: (1) failure to uniquely identify the referent, (2) inclusion of excessive or irrelevant information, and (3) misalignment with human pragmatic preferences, such as underuse of minimal spatial cues. The paper further shows that standard automatic evaluations fail to capture these pragmatic violations, rewarding superficial relevance rather than genuine referential success. The key takeaway is the need for pragmatically informed models and evaluation frameworks that better reflect real human communication.

Link: https://arxiv.org/abs/2504.16060
Authors: Ziqiao Ma,Jing Ding,Xuejun Zhang,Dezhi Luo,Jiahe Ding,Sihan Xu,Yuchen Huang,Run Peng,Joyce Chai
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments: Homepage: this https URL

Abstract:Referring Expression Generation (REG) is a core task for evaluating the pragmatic competence of vision-language systems, requiring not only accurate semantic grounding but also adherence to principles of cooperative communication (Grice, 1975). However, current evaluations of vision-language models (VLMs) often overlook the pragmatic dimension, reducing REG to a region-based captioning task and neglecting Gricean maxims. In this work, we revisit REG from a pragmatic perspective, introducing a new dataset (RefOI) of 1.5k images annotated with both written and spoken referring expressions. Through a systematic evaluation of state-of-the-art VLMs, we identify three key failures of pragmatic competence: (1) failure to uniquely identify the referent, (2) inclusion of excessive or irrelevant information, and (3) misalignment with human pragmatic preference, such as the underuse of minimal spatial cues. We also show that standard automatic evaluations fail to capture these pragmatic violations, reinforcing superficial cues rather than genuine referential success. Our findings call for a renewed focus on pragmatically informed models and evaluation frameworks that align with real human communication.

[NLP-6] Honey, I Shrunk the Language Model: Impact of Knowledge Distillation Methods on Performance and Explainability

[Quick Read]: This paper addresses the deployment limits of Large Language Models (LLMs) in resource-constrained environments, which stem from their high computational and storage demands. It focuses on knowledge distillation, which trains a small student model from a large teacher model to alleviate these constraints. Although various distillation methods exist for data generation and student training, their effects on state-of-the-art performance and explainability have not been systematically evaluated and compared.

The key to the solution lies in two new distillation methods: applying critique-revision prompting to data generation, and synthesizing existing methods to improve student training. Through a systematic comparison on the Commonsense Question-Answering (CQA) dataset, the paper evaluates these methods in terms of student accuracy (performance) and human-grounded explainability, filling a gap in prior work and further advancing the distillation of small language models toward broader applicability and faster diffusion of LLM technology.

Link: https://arxiv.org/abs/2504.16056
Authors: Daniel Hendriks,Philipp Spitzer,Niklas Kühl,Gerhard Satzger
Affiliations: Karlsruhe Institute of Technology (KIT); Institute for Information Systems (WIN); University of Bayreuth
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Artificial Intelligence (AI) has increasingly influenced modern society, recently in particular through significant advancements in Large Language Models (LLMs). However, high computational and storage demands of LLMs still limit their deployment in resource-constrained environments. Knowledge distillation addresses this challenge by training a small student model from a larger teacher model. Previous research has introduced several distillation methods for both generating training data and for training the student model. Despite their relevance, the effects of state-of-the-art distillation methods on model performance and explainability have not been thoroughly investigated and compared. In this work, we enlarge the set of available methods by applying critique-revision prompting to distillation for data generation and by synthesizing existing methods for training. For these methods, we provide a systematic comparison based on the widely used Commonsense Question-Answering (CQA) dataset. While we measure performance via student model accuracy, we employ a human-grounded study to evaluate explainability. We contribute new distillation methods and their comparison in terms of both performance and explainability. This should further advance the distillation of small language models and, thus, contribute to broader applicability and faster diffusion of LLM technology.

[NLP-7] LongMamba: Enhancing Mamba's Long Context Capabilities via Training-Free Receptive Field Enlargement ICLR2025

[Quick Read]: This paper addresses the weak long-context understanding of state space models (SSMs). Although SSMs such as Mamba offer linear computational complexity and constant memory usage on long contexts, studies show they generally underperform Transformers on long-context understanding tasks. To close this significant gap, the paper proposes LongMamba, a training-free technique that substantially enhances the long-context capabilities of Mamba models.

At the core of LongMamba is the discovery that Mamba's hidden channels can be categorized into local and global channels, with the global channels primarily responsible for long-context capability. When the input context far exceeds the training sequence length, however, the global channels struggle to adaptively extend their receptive fields, causing Mamba's poor long-context performance. LongMamba's key idea is to mitigate hidden-state memory decay by preventing unimportant tokens from accumulating in the global channels' memory: it first identifies critical tokens in the global channels and then applies token filtering so that only those critical tokens are accumulated. Across extensive synthetic and real-world long-context benchmarks, LongMamba significantly improves Mamba's long-context performance and greatly extends its operational range, all without additional training.

Link: https://arxiv.org/abs/2504.16053
Authors: Zhifan Ye,Kejing Xia,Yonggan Fu,Xin Dong,Jihoon Hong,Xiangchi Yuan,Shizhe Diao,Jan Kautz,Pavlo Molchanov,Yingyan Celine Lin
Affiliations: Georgia Institute of Technology; NVIDIA
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted by ICLR 2025

Abstract:State space models (SSMs) have emerged as an efficient alternative to Transformer models for language modeling, offering linear computational complexity and constant memory usage as context length increases. However, despite their efficiency in handling long contexts, recent studies have shown that SSMs, such as Mamba models, generally underperform compared to Transformers in long-context understanding tasks. To address this significant shortfall and achieve both efficient and accurate long-context understanding, we propose LongMamba, a training-free technique that significantly enhances the long-context capabilities of Mamba models. LongMamba builds on our discovery that the hidden channels in Mamba can be categorized into local and global channels based on their receptive field lengths, with global channels primarily responsible for long-context capability. These global channels can become the key bottleneck as the input context lengthens. Specifically, when input lengths largely exceed the training sequence length, global channels exhibit limitations in adaptively extending their receptive fields, leading to Mamba’s poor long-context performance. The key idea of LongMamba is to mitigate the hidden state memory decay in these global channels by preventing the accumulation of unimportant tokens in their memory. This is achieved by first identifying critical tokens in the global channels and then applying token filtering to accumulate only those critical tokens. Through extensive benchmarking across synthetic and real-world long-context scenarios, LongMamba sets a new standard for Mamba’s long-context performance, significantly extending its operational range without requiring additional training. Our code is available at this https URL.
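In spirit, the token-filtering step resembles the toy sketch below (the importance scores are a placeholder; in the paper they derive from the dynamics of the global channels): keep only the most important token states once the context grows long.

```python
import numpy as np

def filter_tokens(hidden, importance, budget):
    """Keep only the `budget` most important token states -- a toy stand-in
    for filtering unimportant tokens out of the global channels' memory."""
    keep = np.sort(np.argsort(importance)[-budget:])  # preserve token order
    return hidden[keep]

rng = np.random.default_rng(0)
hidden = rng.normal(size=(16, 8))   # 16 tokens, 8 hidden channels
importance = rng.random(16)         # placeholder importance scores
print(filter_tokens(hidden, importance, budget=4).shape)  # (4, 8)
```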

[NLP-8] Certified Mitigation of Worst-Case LLM Copyright Infringement

[Quick Read]: This paper addresses potential unintentional copyright infringement after deployment of Large Language Models (LLMs), caused by generated content substantially similar to copyrighted material. Existing mitigations are somewhat effective on average but fail to handle worst-case copyright risks, notably long verbatim quotes from copyrighted sources. The key to the solution is BloomScrub, a remarkably simple yet highly effective inference-time "copyright takedown" mechanism. Its core is scalable copyright screening via efficient data sketches (Bloom filters), interleaved repeatedly with quote detection and rewriting to transform potentially infringing segments into non-infringing ones. When quotes beyond a length threshold cannot be removed, the system can abstain from responding, providing certified risk reduction. Experiments show that BloomScrub significantly reduces infringement risk while preserving utility and can adapt its selective abstention to different levels of enforcement stringency.

Link: https://arxiv.org/abs/2504.16046
Authors: Jingyu Zhang,Jiacan Yu,Marc Marone,Benjamin Van Durme,Daniel Khashabi
Affiliations: Johns Hopkins University
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:The exposure of large language models (LLMs) to copyrighted material during pre-training raises concerns about unintentional copyright infringement post deployment. This has driven the development of “copyright takedown” methods, post-training approaches aimed at preventing models from generating content substantially similar to copyrighted ones. While current mitigation approaches are somewhat effective for average-case risks, we demonstrate that they overlook worst-case copyright risks exhibited by the existence of long, verbatim quotes from copyrighted sources. We propose BloomScrub, a remarkably simple yet highly effective inference-time approach that provides certified copyright takedown. Our method repeatedly interleaves quote detection with rewriting techniques to transform potentially infringing segments. By leveraging efficient data sketches (Bloom filters), our approach enables scalable copyright screening even for large-scale real-world corpora. When quotes beyond a length threshold cannot be removed, the system can abstain from responding, offering certified risk reduction. Experimental results show that BloomScrub reduces infringement risk, preserves utility, and accommodates different levels of enforcement stringency with adaptive abstention. Our results suggest that lightweight, inference-time methods can be surprisingly effective for copyright prevention.
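A minimal sketch of the Bloom-filter screening stage (the hash scheme, n-gram length, and filter size are all assumptions; the full system adds rewriting and abstention on top): index the copyrighted corpus once, then cheaply flag generated n-grams that may be verbatim quotes.

```python
import hashlib

class BloomFilter:
    def __init__(self, size=1 << 20, hashes=4):
        self.size, self.hashes = size, hashes
        self.bits = bytearray(size // 8)

    def _positions(self, item):
        for i in range(self.hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item):
        return all((self.bits[p // 8] >> (p % 8)) & 1
                   for p in self._positions(item))

def ngrams(text, n=8):
    words = text.split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

# Index the protected corpus once, then screen generations cheaply.
sketch = BloomFilter()
for gram in ngrams("call me ishmael some years ago never mind how long precisely"):
    sketch.add(gram)

generation = ("he said call me ishmael some years ago never mind "
              "how long precisely and left")
flagged = [g for g in ngrams(generation) if g in sketch]
print(len(flagged))  # n-grams to rewrite, or abstain if they persist
```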

[NLP-9] Methods for Recognizing Nested Terms

[Quick Read]: This paper tackles recognizing nested terms from flat training data annotated without nestedness. The key to the solution is applying the Binder model, previously used successfully for nested named entity recognition, to nested term extraction. With this approach, the authors obtained the best term recognition results in all three tracks of the RuTermEval competition and showed that several of their proposed approaches can effectively extract nested terms without nested labeling.

Link: https://arxiv.org/abs/2504.16007
Authors: Igor Rozhkov,Natalia Loukachevitch
Affiliations: Lomonosov Moscow State University
Subjects: Computation and Language (cs.CL)
Comments: To be published in Computational Linguistics and Intellectual Technologies proceedings

Abstract:In this paper, we describe our participation in the RuTermEval competition devoted to extracting nested terms. We apply the Binder model, which was previously successfully applied to the recognition of nested named entities, to extract nested terms. We obtained the best results of term recognition in all three tracks of the RuTermEval competition. In addition, we study the new task of recognition of nested terms from flat training data annotated with terms without nestedness. We can conclude that several approaches we proposed in this work are viable enough to retrieve nested terms effectively without nested labeling of them.

[NLP-10] CAPO: Cost-Aware Prompt Optimization

[Quick Read]: This paper addresses the high sensitivity of large language models (LLMs) to prompt formulation, i.e., how to optimize prompts efficiently to improve performance. Although automated prompt optimization tackles this challenge, current methods require numerous LLM calls and input tokens, making optimization expensive. The key to the solution is CAPO (Cost-Aware Prompt Optimization), an algorithm that integrates AutoML techniques into an evolutionary approach with LLMs as operators: racing reduces the number of evaluations, and multi-objective optimization balances performance against prompt length. CAPO jointly optimizes instructions and few-shot examples and incorporates task descriptions for robustness. These designs let CAPO outperform state-of-the-art discrete prompt optimization methods across diverse datasets and LLMs, achieve better performance under smaller budgets, and shorten average prompt length via a length penalty, making the optimization both cost-efficient and cost-aware.

Link: https://arxiv.org/abs/2504.16005
Authors: Tom Zehle,Moritz Schlager,Timo Heiß,Matthias Feurer
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE); Machine Learning (stat.ML)
Comments: Submitted to AutoML 2025

Abstract:Large language models (LLMs) have revolutionized natural language processing by solving a wide range of tasks simply guided by a prompt. Yet their performance is highly sensitive to prompt formulation. While automated prompt optimization addresses this challenge by finding optimal prompts, current methods require a substantial number of LLM calls and input tokens, making prompt optimization expensive. We introduce CAPO (Cost-Aware Prompt Optimization), an algorithm that enhances prompt optimization efficiency by integrating AutoML techniques. CAPO is an evolutionary approach with LLMs as operators, incorporating racing to save evaluations and multi-objective optimization to balance performance with prompt length. It jointly optimizes instructions and few-shot examples while leveraging task descriptions for improved robustness. Our extensive experiments across diverse datasets and LLMs demonstrate that CAPO outperforms state-of-the-art discrete prompt optimization methods in 11/15 cases with improvements up to 21%p. Our algorithm achieves better performances already with smaller budgets, saves evaluations through racing, and decreases average prompt length via a length penalty, making it both cost-efficient and cost-aware. Even without few-shot examples, CAPO outperforms its competitors and generally remains robust to initial prompts. CAPO represents an important step toward making prompt optimization more powerful and accessible by improving cost-efficiency.
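The racing component can be illustrated with a small sketch (the evaluator, batch schedule, and keep ratio are placeholders rather than CAPO's actual configuration): candidate prompts are scored on successive mini-batches, and the weaker half is dropped after each round, saving LLM evaluations.

```python
def race(prompts, eval_batch, batches, keep_ratio=0.5):
    """Successively shrink the candidate set using partial evaluations.
    `eval_batch(prompt, batch)` returns a score in [0, 1] for one batch."""
    scores = {p: [] for p in prompts}
    survivors = list(prompts)
    for batch in batches:
        for p in survivors:
            scores[p].append(eval_batch(p, batch))
        survivors.sort(key=lambda p: sum(scores[p]) / len(scores[p]),
                       reverse=True)
        survivors = survivors[:max(1, int(len(survivors) * keep_ratio))]
    return survivors[0]

# Toy stand-in evaluator: longer prompts "win".
best = race(["p1", "p2 longer", "p3 even longer"],
            eval_batch=lambda p, batch: len(p) / 20,
            batches=[0, 1])
print(best)  # -> p3 even longer
```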

[NLP-11] Few-shot Hate Speech Detection Based on the MindSpore Framework

[Quick Read]: This paper addresses the performance degradation of deep learning models for hate speech detection in few-shot or low-resource settings. Traditional approaches rely on large annotated corpora, which are often unavailable in practice. The key to the solution is MS-FSLHate, a prompt-enhanced neural framework implemented on the MindSpore deep learning platform that combines learnable prompt embeddings, a CNN-BiLSTM backbone with attention pooling, and synonym-based adversarial data augmentation to improve generalization. Experiments on the HateXplain and HSOL datasets show that the approach outperforms competitive baselines while remaining efficient and scalable, making it suitable for deployment in resource-constrained environments.

Link: https://arxiv.org/abs/2504.15987
Authors: Zhenkai Qin,Dongze Wu,Yuxin Liu,Guifang Yang
Affiliations: Guangxi Police College; Institute of Software, Chinese Academy of Sciences; School of Information Technology, Guangxi Police College
Subjects: Computation and Language (cs.CL); Computers and Society (cs.CY)
Comments:

Abstract:The proliferation of hate speech on social media poses a significant threat to online communities, requiring effective detection systems. While deep learning models have shown promise, their performance often deteriorates in few-shot or low-resource settings due to reliance on large annotated corpora. To address this, we propose MS-FSLHate, a prompt-enhanced neural framework for few-shot hate speech detection implemented on the MindSpore deep learning platform. The model integrates learnable prompt embeddings, a CNN-BiLSTM backbone with attention pooling, and synonym-based adversarial data augmentation to improve generalization. Experimental results on two benchmark datasets, HateXplain and HSOL, demonstrate that our approach outperforms competitive baselines in precision, recall, and F1-score. Additionally, the framework shows high efficiency and scalability, suggesting its suitability for deployment in resource-constrained environments. These findings highlight the potential of combining prompt-based learning with adversarial augmentation for robust and adaptable hate speech detection in few-shot scenarios.

[NLP-12] W-PCA Based Gradient-Free Proxy for Efficient Search of Lightweight Language Models ICLR2025

[Quick Read]: This paper targets efficient neural architecture search (NAS) for lightweight language models. Traditional training-based NAS incurs large computational overheads, while existing zero-shot NAS methods suffer from biased evaluation metrics and computational inefficiency. To address this, the paper proposes weight-weighted PCA (W-PCA), a novel zero-shot NAS method whose key idea is to use two evaluation proxies: the parameter count and the number of principal components whose cumulative contribution exceeds a threshold η in the feed-forward network (FFN) layer. By eliminating gradient computation, the method reduces evaluation time and greatly improves the efficiency of designing and evaluating lightweight language models. Experiments on GLUE and SQuAD show large training-time savings over one-shot NAS and higher test scores than prior state-of-the-art training-based methods; on a dataset sampled from the FlexiBERT search space, the method also achieves better ranking correlation with lower solving time than other zero-shot NAS methods that require gradients.

Link: https://arxiv.org/abs/2504.15983
Authors: Shang Wang
Affiliations: ShanghaiTech University
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: ICLR 2025

Abstract:The demand for efficient natural language processing (NLP) systems has led to the development of lightweight language models. Previous work in this area has primarily focused on manual design or training-based neural architecture search (NAS) methods. Recently, zero-shot NAS methods have been proposed for evaluating language models without the need for training. However, prevailing approaches to zero-shot NAS often face challenges such as biased evaluation metrics and computational inefficiencies. In this paper, we introduce weight-weighted PCA (W-PCA), a novel zero-shot NAS method specifically tailored for lightweight language models. Our approach utilizes two evaluation proxies: the parameter count and the number of principal components with cumulative contribution exceeding η in the feed-forward neural (FFN) layer. Additionally, by eliminating the need for gradient computations, we optimize the evaluation time, thus enhancing the efficiency of designing and evaluating lightweight language models. We conduct a comparative analysis on the GLUE and SQuAD datasets to evaluate our approach. The results demonstrate that our method significantly reduces training time compared to one-shot NAS methods and achieves higher scores in the testing phase compared to previous state-of-the-art training-based methods. Furthermore, we perform ranking evaluations on a dataset sampled from the FlexiBERT search space. Our approach exhibits superior ranking correlation and further reduces solving time compared to other zero-shot NAS methods that require gradient computation.
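A sketch of the second proxy under stated assumptions (the activation matrix and η below are illustrative): count how many principal components the FFN activations need before their cumulative explained variance exceeds η, with no gradient computation involved.

```python
import numpy as np

def w_pca_proxy(activations, eta=0.99):
    """Number of principal components whose cumulative explained variance
    reaches an eta share -- computable from a single forward pass."""
    x = activations - activations.mean(axis=0)
    s = np.linalg.svd(x, compute_uv=False)     # singular values
    explained = s ** 2 / np.sum(s ** 2)
    return int(np.searchsorted(np.cumsum(explained), eta) + 1)

rng = np.random.default_rng(0)
acts = rng.normal(size=(128, 64)) @ rng.normal(size=(64, 64))
print(w_pca_proxy(acts, eta=0.99))
```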

[NLP-13] FairTranslate: An English-French Dataset for Gender Bias Evaluation in Machine Translation by Overcoming Gender Binarity

[Quick Read]: This paper addresses the shortcomings of Large Language Models (LLMs) in translating inclusive language, such as texts containing the singular 'they' pronoun or otherwise following fair-language conventions. Because the challenge spans both computational and societal domains, a rigorous, well-founded framework is needed to assess how well LLMs handle inclusive translation. The key to the solution is FairTranslate, a fully human-annotated dataset for evaluating non-binary gender bias in English-to-French machine translation. It contains 2,418 English-French sentence pairs on occupation-related topics, enriched with metadata including the stereotypical alignment of the occupation, grammatical gender indicator ambiguity, and ground-truth gender labels (male, female, or inclusive). Evaluating four leading LLMs on this dataset under different prompting procedures reveals substantial gender-representation biases, underscoring the persistent challenge of fair and inclusive language use in LLM-based translation. The crux is that a high-quality annotated dataset and systematic evaluation make the sources of model bias identifiable, providing a basis for targeted interventions.

Link: https://arxiv.org/abs/2504.15941
Authors: Fanny Jourdan,Yannick Chevalier,Cécile Favre
Affiliations: IRT Saint Exupery; Université Lumière Lyon 2; Université Claude Bernard Lyon 1
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: FAccT 2025

Abstract:Large Language Models (LLMs) are increasingly leveraged for translation tasks but often fall short when translating inclusive language – such as texts containing the singular ‘they’ pronoun or otherwise reflecting fair linguistic protocols. Because these challenges span both computational and societal domains, it is imperative to critically evaluate how well LLMs handle inclusive translation with a well-founded framework. This paper presents FairTranslate, a novel, fully human-annotated dataset designed to evaluate non-binary gender biases in machine translation systems from English to French. FairTranslate includes 2418 English-French sentence pairs related to occupations, annotated with rich metadata such as the stereotypical alignment of the occupation, grammatical gender indicator ambiguity, and the ground-truth gender label (male, female, or inclusive). We evaluate four leading LLMs (Gemma2-2B, Mistral-7B, Llama3.1-8B, Llama3.3-70B) on this dataset under different prompting procedures. Our results reveal substantial biases in gender representation across LLMs, highlighting persistent challenges in achieving equitable outcomes in machine translation. These findings underscore the need for focused strategies and interventions aimed at ensuring fair and inclusive language usage in LLM-based translation systems. We make the FairTranslate dataset publicly available on Hugging Face, and disclose the code for all experiments on GitHub.

[NLP-14] SARI: Structured Audio Reasoning via Curriculum-Guided Reinforcement Learning

[Quick Read]: This paper investigates whether and how reinforcement learning can sharpen the structured reasoning of a Large Audio-Language Model (LALM) and whether such gains transfer to audio-language tasks. The key is extending the Group-Relative Policy Optimization (GRPO) framework from DeepSeek-R1 to the audio-language domain and combining supervised fine-tuning with curriculum-guided reinforcement learning, producing SARI (Structured Audio Reasoning via Curriculum-Guided Reinforcement Learning). By comparing implicit vs. explicit and structured vs. free-form reasoning under a unified architecture, the paper demonstrates the importance of explicit, structured reasoning and curriculum learning for audio-language understanding.

Link: https://arxiv.org/abs/2504.15900
Authors: Cheng Wen,Tingwei Guo,Shuaijiang Zhao,Wei Zou,Xiangang Li
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Recent work shows that reinforcement learning (RL) can markedly sharpen the reasoning ability of large language models (LLMs) by prompting them to “think before answering.” Yet whether and how these gains transfer to audio-language reasoning remains largely unexplored. We extend the Group-Relative Policy Optimization (GRPO) framework from DeepSeek-R1 to a Large Audio-Language Model (LALM), and construct a 32k-sample multiple-choice corpus. Using a two-stage regimen, supervised fine-tuning on structured and unstructured chains-of-thought followed by curriculum-guided GRPO, we systematically compare implicit vs. explicit, and structured vs. free-form reasoning under identical architectures. Our structured audio reasoning model, SARI (Structured Audio Reasoning via Curriculum-Guided Reinforcement Learning), achieves a 16.35% improvement in average accuracy over the base model Qwen2-Audio-7B-Instruct. Furthermore, the variant built upon Qwen2.5-Omni reaches state-of-the-art performance of 67.08% on the MMAU test-mini benchmark. Ablation experiments show that on the base model we use: (i) SFT warm-up is important for stable RL training, (ii) structured chains yield more robust generalization than unstructured ones, and (iii) easy-to-hard curricula accelerate convergence and improve final performance. These findings demonstrate that explicit, structured reasoning and curriculum learning substantially enhance audio-language understanding.

[NLP-15] Dynamic Early Exit in Reasoning Models

[Quick Read]: This paper addresses the inefficiency of overlong chain-of-thought (CoT) reasoning and the accuracy loss that overly detailed or redundant steps can cause. The key to the solution is a training-free method that integrates seamlessly into existing reasoning LLMs: a self-truncation mechanism during generation monitors model behavior at potential reasoning transition points (e.g., "Wait" tokens) and terminates further chain generation early once the model shows high confidence in a trial answer. The method shortens CoT sequences substantially (by 31% to 43% on average) while improving accuracy (by 1.7% to 5.7%).

Link: https://arxiv.org/abs/2504.15895
Authors: Chenxu Yang,Qingyi Si,Yongjie Duan,Zheliang Zhu,Chenyu Zhu,Zheng Lin,Li Cao,Weiping Wang
Affiliations: Institute of Information Engineering, Chinese Academy of Sciences; School of Cyber Security, University of Chinese Academy of Sciences; Huawei Technologies Co., Ltd.
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 19 pages, 11 figures

Abstract:Recent advances in large reasoning language models (LRLMs) rely on test-time scaling, which extends long chain-of-thought (CoT) generation to solve complex tasks. However, overthinking in long CoT not only slows down the efficiency of problem solving, but also risks accuracy loss due to the extremely detailed or redundant reasoning steps. We propose a simple yet effective method that allows LLMs to self-truncate CoT sequences by early exit during generation. Instead of relying on fixed heuristics, the proposed method monitors model behavior at potential reasoning transition points (e.g., “Wait” tokens) and dynamically terminates the next reasoning chain’s generation when the model exhibits high confidence in a trial answer. Our method requires no additional training and can be seamlessly integrated into existing o1-like reasoning LLMs. Experiments on multiple reasoning benchmarks MATH-500, AMC 2023, GPQA Diamond and AIME 2024 show that the proposed method is consistently effective on deepseek-series reasoning LLMs, reducing the length of CoT sequences by an average of 31% to 43% while improving accuracy by 1.7% to 5.7%.
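The mechanism in schematic form (the model interface, transition token, and confidence threshold are assumptions for illustration): whenever a transition token such as "Wait" appears, probe the model's confidence in a trial answer and stop generating further chains if it is high.

```python
def generate_with_early_exit(model, prompt, transition="Wait", tau=0.9):
    """Self-truncate CoT: stop once a trial answer looks confident.
    Assumed APIs: model.next_token(text) -> token,
    model.answer_confidence(text) -> probability of the trial answer."""
    text = prompt
    while True:
        token = model.next_token(text)
        if token == "<eos>":
            return text
        if token == transition and model.answer_confidence(text) >= tau:
            return text + " ...final answer."   # skip remaining chains
        text += " " + token

class ScriptedModel:
    """Toy stand-in emitting a fixed stream with rising confidence."""
    def __init__(self):
        self.stream = ["step1", "Wait", "step2", "Wait", "step3", "<eos>"]
        self.conf = iter([0.4, 0.95])
    def next_token(self, text):
        return self.stream.pop(0)
    def answer_confidence(self, text):
        return next(self.conf)

print(generate_with_early_exit(ScriptedModel(), "Q: 2+2?"))
```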

[NLP-16] Exploring Cognitive and Aesthetic Causality for Multimodal Aspect-Based Sentiment Analysis

[Quick Read]: This paper addresses the insufficient understanding of fine-grained visual content, and of the cognitive rationales evoked by semantic content and impressions, in multimodal aspect-based sentiment classification (MASC). Existing methods show clear gaps in extracting fine-grained image features and in inferring the fundamental drivers of sentiment expression from both semantic perspectives and affective-cognitive resonance. To address this, the paper proposes the Chimera framework, whose key steps are aligning visual patch features with text, extracting both coarse-grained visual features (e.g., holistic image representations) and fine-grained visual regions (e.g., aspect-related regions), and translating them into corresponding textual descriptions (e.g., facial, aesthetic). It further leverages sentiment causes and impressions generated by a large language model (LLM) to strengthen the model's awareness of sentiment cues evoked by semantic content and affective-cognitive resonance. Experiments on standard MASC datasets show that the model is effective and offers greater flexibility for MASC than LLMs such as GPT-4o.

Link: https://arxiv.org/abs/2504.15848
Authors: Luwei Xiao,Rui Mao,Shuai Zhao,Qika Lin,Yanhao Jia,Liang He,Erik Cambria
Affiliations: School of Computer Science and Technology, East China Normal University; College of Computing and Data Science, Nanyang Technological University; Saw Swee Hock School of Public Health, National University of Singapore
Subjects: Computation and Language (cs.CL)
Comments: Accepted by TAFFC 2025

Abstract:Multimodal aspect-based sentiment classification (MASC) is an emerging task due to an increase in user-generated multimodal content on social platforms, aimed at predicting sentiment polarity toward specific aspect targets (i.e., entities or attributes explicitly mentioned in text-image pairs). Despite extensive efforts and significant achievements in existing MASC, substantial gaps remain in understanding fine-grained visual content and the cognitive rationales derived from semantic content and impressions (cognitive interpretations of emotions evoked by image content). In this study, we present Chimera: a cognitive and aesthetic sentiment causality understanding framework to derive fine-grained holistic features of aspects and infer the fundamental drivers of sentiment expression from both semantic perspectives and affective-cognitive resonance (the synergistic effect between emotional responses and cognitive interpretations). Specifically, this framework first incorporates visual patch features for patch-word alignment. Meanwhile, it extracts coarse-grained visual features (e.g., overall image representation) and fine-grained visual regions (e.g., aspect-related regions) and translates them into corresponding textual descriptions (e.g., facial, aesthetic). Finally, we leverage the sentimental causes and impressions generated by a large language model (LLM) to enhance the model’s awareness of sentimental cues evoked by semantic content and affective-cognitive resonance. Experimental results on standard MASC datasets demonstrate the effectiveness of the proposed model, which also exhibits greater flexibility to MASC compared to LLMs such as GPT-4o. We have publicly released the complete implementation and dataset at this https URL

[NLP-17] Pre-DPO: Improving Data Utilization in Direct Preference Optimization Using a Guiding Reference Model

[Quick Read]: This paper addresses inefficient data utilization and the performance ceiling in Direct Preference Optimization (DPO) training of large language models (LLMs) under reinforcement learning from human feedback (RLHF). It also examines how the lack of a reference model in Simple Preference Optimization (SimPO) reduces training robustness and increases the risk of catastrophic forgetting. The key to the solution is Pre-DPO, a new training paradigm that enhances preference optimization with a guiding reference model. This reference model provides foresight into the optimal policy state achievable with the training preference data and acts as an adaptive weight-assignment mechanism: samples better suited to the model receive higher weights while less suitable ones receive lower weights, improving data utilization and breaking through the performance ceiling.

Link: https://arxiv.org/abs/2504.15843
Authors: Junshu Pan,Wei Shen,Shulin Huang,Qiji Zhou,Yue Zhang
Affiliations: Zhejiang University; School of Engineering, Westlake University; Shanghai Innovation Institute; Independent Researcher
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Direct Preference Optimization (DPO) simplifies reinforcement learning from human feedback (RLHF) for large language models (LLMs) by directly optimizing human preferences without an explicit reward model. We find that during DPO training, the reference model plays the role of a data weight adjuster. However, the common practice of initializing the policy and reference models identically in DPO can lead to inefficient data utilization and impose a performance ceiling. Meanwhile, the lack of a reference model in Simple Preference Optimization (SimPO) reduces training robustness and necessitates stricter conditions to prevent catastrophic forgetting. In this work, we propose Pre-DPO, a simple yet effective DPO-based training paradigm that enhances preference optimization performance by leveraging a guiding reference model. This reference model provides foresight into the optimal policy state achievable through the training preference data, serving as a guiding mechanism that adaptively assigns higher weights to samples more suitable for the model and lower weights to those less suitable. Extensive experiments on AlpacaEval 2.0 and Arena-Hard v0.1 benchmarks demonstrate that Pre-DPO consistently improves the performance of both DPO and SimPO, without relying on external models or additional data.
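For intuition, here is a hedged sketch of a DPO-style loss in which a guiding reference model acts as a data-weight adjuster (the weighting function is our illustration, not the paper's exact formula): pairs the guide model already separates well receive larger weight.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dpo_loss(pi_w, pi_l, ref_w, ref_l, beta=0.1):
    """Standard DPO loss for one preference pair from sequence log-probs
    (w = chosen, l = rejected)."""
    margin = beta * ((pi_w - ref_w) - (pi_l - ref_l))
    return -math.log(sigmoid(margin))

def guide_weight(guide_w, guide_l):
    # Illustrative adaptive weight derived from the guiding reference model.
    return sigmoid(guide_w - guide_l)

loss = guide_weight(-4.0, -6.0) * dpo_loss(-3.5, -5.0, -4.0, -4.5)
print(round(loss, 4))
```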

[NLP-18] What's the Difference? Supporting Users in Identifying the Effects of Prompt and Model Changes Through Token Patterns

[Quick Read]: This paper addresses the challenge of prompt engineering for large language models: even small prompt perturbations or model changes can significantly affect the generated text. Existing evaluation methods, whether automated metrics or human evaluation, are limited: they provide shallow insight or are labor-intensive. The key to the solution is Spotlight, a new approach that combines the strengths of automated and human analysis. Using data-mining techniques, it automatically distinguishes random (decoding) variation from systematic differences in model outputs and extracts token patterns that describe the systematic differences, efficiently guiding users in analyzing the effects of prompt and model changes. Three benchmarks, demonstration studies, and a user study show that the approach improves the reliability of token-pattern extraction, yields new insights into established prompt data, and helps users understand systematic output differences, including important differences related to, e.g., gender or culture, thereby supporting prompt engineering and human-centric research on model behavior.

Link: https://arxiv.org/abs/2504.15815
Authors: Michael A. Hedderich,Anyi Wang,Raoyuan Zhao,Florian Eichin,Barbara Plank
Affiliations: LMU Munich and Munich Center for Machine Learning, Germany
Subjects: Computation and Language (cs.CL); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
Comments:

Abstract:Prompt engineering for large language models is challenging, as even small prompt perturbations or model changes can significantly impact the generated output texts. Existing evaluation methods, either automated metrics or human evaluation, have limitations, such as providing limited insights or being labor-intensive. We propose Spotlight, a new approach that combines both automation and human analysis. Based on data mining techniques, we automatically distinguish between random (decoding) variations and systematic differences in language model outputs. This process provides token patterns that describe the systematic differences and guide the user in manually analyzing the effects of their prompt and model changes efficiently. We create three benchmarks to quantitatively test the reliability of token pattern extraction methods and demonstrate that our approach provides new insights into established prompt data. From a human-centric perspective, through demonstration studies and a user study, we show that our token pattern approach helps users understand the systematic differences of language model outputs, and we are able to discover relevant differences caused by prompt and model changes (e.g. related to gender or culture), thus supporting the prompt engineering process and human-centric model behavior research.
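A toy version of separating decoding noise from systematic differences (Spotlight itself uses proper data-mining machinery; the frequency-gap threshold here is arbitrary): compare how often each token appears across output samples from two prompts and keep tokens with a large, stable gap.

```python
from collections import Counter

def token_freqs(outputs):
    """Fraction of sampled outputs that contain each token."""
    counts = Counter(tok for out in outputs for tok in set(out.split()))
    return {t: c / len(outputs) for t, c in counts.items()}

def systematic_tokens(outputs_a, outputs_b, min_gap=0.5):
    """Tokens much more frequent under one prompt than the other."""
    fa, fb = token_freqs(outputs_a), token_freqs(outputs_b)
    return sorted(t for t in set(fa) | set(fb)
                  if abs(fa.get(t, 0) - fb.get(t, 0)) >= min_gap)

a = ["she is a doctor", "she is a nurse", "she is a doctor"]
b = ["he is a doctor", "he is an engineer", "he is a doctor"]
print(systematic_tokens(a, b))  # -> ['he', 'she']
```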

[NLP-19] A closer look at how large language models trust humans: patterns and biases

[Quick Read]: This paper asks how large language models (LLMs) develop trust toward humans. Existing research focuses mostly on how humans trust AI agents, leaving how LLM-based agents develop effective trust in humans poorly understood. The paper fills this gap by examining whether LLM trust depends on the three major trustworthiness dimensions of the human subject (competence, benevolence, and integrity) and how demographic variables affect it. The key to the solution is an approach grounded in established behavioral theories for analyzing LLM trust development, validated through large-scale simulated experiments that compare it to human trust development, identify similarities and differences, and surface potential sources of bias such as age, religion, and gender, especially in financial scenarios.

Link: https://arxiv.org/abs/2504.15801
Authors: Valeria Lerman,Yaniv Dover
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments:

Abstract:As large language models (LLMs) and LLM-based agents increasingly interact with humans in decision-making contexts, understanding the trust dynamics between humans and AI agents becomes a central concern. While considerable literature studies how humans trust AI agents, it is much less understood how LLM-based agents develop effective trust in humans. LLM-based agents likely rely on some sort of implicit effective trust in trust-related contexts (e.g., evaluating individual loan applications) to assist and affect decision making. Using established behavioral theories, we develop an approach that studies whether LLM trust depends on the three major trustworthiness dimensions: competence, benevolence and integrity of the human subject. We also study how demographic variables affect effective trust. Across 43,200 simulated experiments, for five popular language models, across five different scenarios we find that LLM trust development shows an overall similarity to human trust development. We find that in most, but not all cases, LLM trust is strongly predicted by trustworthiness, and in some cases also biased by age, religion and gender, especially in financial scenarios. This is particularly true for scenarios common in the literature and for newer models. While the overall patterns align with human-like mechanisms of effective trust formation, different models exhibit variation in how they estimate trust; in some cases, trustworthiness and demographic factors are weak predictors of effective trust. These findings call for a better understanding of AI-to-human trust dynamics and monitoring of biases and trust development patterns to prevent unintended and potentially harmful outcomes in trust-sensitive applications of AI.

[NLP-20] Automated Creativity Evaluation for Large Language Models: A Reference-Based Approach

[Quick Read]: This paper addresses the challenge of evaluating the creativity of text generated by Large Language Models (LLMs): existing methods either rely on costly manual annotation or align poorly with human assessments. The key to the solution is an automated evaluation method based on the Torrance Test of Creative Writing (TTCW), which evaluates creativity as a product. It uses a reference-based Likert-style approach that scores generated creative texts relative to high-quality reference texts across various tests, measuring creativity more effectively. Experiments show the method significantly improves alignment between LLM evaluations and human assessments, reaching a pairwise accuracy of 0.75 (+15%).

Link: https://arxiv.org/abs/2504.15784
Authors: Ruizhe Li,Chiwei Zhu,Benfeng Xu,Xiaorui Wang,Zhendong Mao
Affiliations: University of Science and Technology of China; Metastone Technology
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Creative writing is a key capability of Large Language Models (LLMs), with potential applications in literature, storytelling, and various creative domains. However, evaluating the creativity of machine-generated texts remains a significant challenge, as existing methods either rely on costly manual annotations or fail to align closely with human assessments. In this paper, we propose an effective automated evaluation method based on the Torrance Test of Creative Writing (TTCW), which evaluates creativity as product. Our method employs a reference-based Likert-style approach, scoring generated creative texts relative to high-quality reference texts across various tests. Experimental results demonstrate that our method significantly improves the alignment between LLM evaluations and human assessments, achieving a pairwise accuracy of 0.75 (+15%).

[NLP-21] rustGeoGen: Scalable and Formal-Verified Data Engine for Trustworthy Multi-modal Geometric Problem Solving

[Quick Read]: This paper addresses the shortage of sound methodology and benchmarks for geometric problem solving (GPS), in particular the noise and self-contradictions in existing synthetic datasets caused by LLM hallucination. The proposed solution is TrustGeoGen, a scalable data engine for problem generation with formal verification. Its key innovations are: 1) multimodal-aligned generation of diagrams, textual descriptions, and stepwise solutions; 2) formal verification ensuring rule-compliant reasoning paths; 3) a bootstrapping mechanism that escalates complexity via recursive state generation; and 4) the GeoExplore series of algorithms, which simultaneously produce multi-solution variants and self-reflective backtracking traces. With this pipeline, TrustGeoGen builds the 200K-sample GeoTrust-200K dataset and the GeoTrust-test set. Experiments show that state-of-the-art models reach only 49.17% accuracy on GeoTrust-test, demonstrating the benchmark's stringency, while models trained on GeoTrust achieve OOD generalization on GeoQA with markedly fewer logical inconsistencies than training on pseudo-labels annotated by OpenAI-o1.

Link: https://arxiv.org/abs/2504.15780
Authors: Daocheng Fu,Zijun Chen,Renqiu Xia,Qi Liu,Yuan Feng,Hongbin Zhou,Renrui Zhang,Shiyang Feng,Peng Gao,Junchi Yan,Botian Shi,Bo Zhang,Yu Qiao
Affiliations: Fudan University; Shanghai Artificial Intelligence Laboratory; Shanghai Jiao Tong University
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:Mathematical geometric problem solving (GPS) often requires effective integration of multimodal information and verifiable logical coherence. Despite the fast development of large language models in general problem solving, it remains unresolved regarding both methodology and benchmarks, especially given the fact that existing synthetic GPS benchmarks are often not self-verified and contain noise and self-contradicted information due to the illusion of LLMs. In this paper, we propose a scalable data engine called TrustGeoGen for problem generation, with formal verification to provide a principled benchmark, which we believe lays the foundation for the further development of methods for GPS. The engine synthesizes geometric data through four key innovations: 1) multimodal-aligned generation of diagrams, textual descriptions, and stepwise solutions; 2) formal verification ensuring rule-compliant reasoning paths; 3) a bootstrapping mechanism enabling complexity escalation via recursive state generation and 4) our devised GeoExplore series algorithms simultaneously produce multi-solution variants and self-reflective backtracking traces. By formal logical verification, TrustGeoGen produces GeoTrust-200K dataset with guaranteed modality integrity, along with GeoTrust-test testset. Experiments reveal the state-of-the-art models achieve only 49.17% accuracy on GeoTrust-test, demonstrating its evaluation stringency. Crucially, models trained on GeoTrust achieve OOD generalization on GeoQA, significantly reducing logical inconsistencies relative to pseudo-label annotated by OpenAI-o1. Our code is available at this https URL

[NLP-22] Tina: Tiny Reasoning Models via LoRA

[Quick Read]: This paper asks how strong reasoning abilities can be achieved cost-effectively. The answer is Tina, a family of tiny reasoning models obtained through parameter-efficient updates: applying low-rank adaptation (LoRA) during reinforcement learning (RL) to an already tiny 1.5B-parameter base model yields substantial reasoning gains with minimal resources. This minimalist approach produces models whose reasoning performance is competitive with, and sometimes surpasses, SOTA RL reasoning models built on the same base model, at a small fraction of their computational post-training cost; the best Tina model achieves a 20% reasoning performance increase and 43.33% Pass@1 on AIME24 at only $9 of post-training and evaluation cost (an estimated 260x cost reduction). The paper validates this across multiple open-source reasoning datasets and ablation settings and hypothesizes that the effectiveness stems from LoRA rapidly adapting the model to the structural format of reasoning rewarded by RL while largely preserving the base model's underlying knowledge.

Link: https://arxiv.org/abs/2504.15777
Authors: Shangshang Wang,Julian Asilis,Ömer Faruk Akgül,Enes Burak Bilgin,Ollie Liu,Willie Neiswanger
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Abstract:How cost-effectively can strong reasoning abilities be achieved in language models? Driven by this fundamental question, we present Tina, a family of tiny reasoning models achieved with high cost-efficiency. Notably, Tina demonstrates that substantial reasoning performance can be developed using only minimal resources, by applying parameter-efficient updates during reinforcement learning (RL), using low-rank adaptation (LoRA), to an already tiny 1.5B parameter base model. This minimalist approach produces models that achieve reasoning performance which is competitive with, and sometimes surpasses, SOTA RL reasoning models built upon the same base model. Crucially, this is achieved at a tiny fraction of the computational post-training cost employed by existing SOTA models. In fact, the best Tina model achieves a 20% reasoning performance increase and 43.33% Pass@1 accuracy on AIME24, at only $9 USD post-training and evaluation cost (i.e., an estimated 260x cost reduction). Our work reveals the surprising effectiveness of efficient RL reasoning via LoRA. We validate this across multiple open-source reasoning datasets and various ablation settings starting with a single, fixed set of hyperparameters. Furthermore, we hypothesize that this effectiveness and efficiency stem from LoRA rapidly adapting the model to the structural format of reasoning rewarded by RL, while largely preserving the base model’s underlying knowledge. In service of accessibility and open research, we fully open-source all code, training logs, and model weights & checkpoints.
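The LoRA update at the heart of this recipe is compact enough to sketch (shapes and hyperparameters below are illustrative): the frozen weight W is augmented with a low-rank product BA, and only A and B are trained during RL.

```python
import numpy as np

class LoRALinear:
    """y = x @ (W + (alpha / r) * B @ A)^T, with only A and B trainable."""
    def __init__(self, w, r=8, alpha=16, seed=0):
        rng = np.random.default_rng(seed)
        d_out, d_in = w.shape
        self.w = w                                    # frozen pretrained weight
        self.a = rng.normal(0, 0.01, size=(r, d_in))  # trainable
        self.b = np.zeros((d_out, r))                 # trainable, starts at 0
        self.scale = alpha / r

    def __call__(self, x):
        return x @ (self.w + self.scale * self.b @ self.a).T

layer = LoRALinear(np.random.default_rng(1).normal(size=(4, 4)))
print(layer(np.ones((1, 4))).shape)  # (1, 4); equals the base layer until B updates
```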

[NLP-23] Subject islands do not reduce to construction-specific discourse function

[Quick Read]: This paper asks why extracting a sub-part of certain syntactic structures (such as subjects in questions) is ungrammatical (the "island" effect), and whether this phenomenon follows purely from abstract syntactic movement dependencies or is tied to the communicative function and information structure of the constructions involved. The key is a set of three large-scale acceptability studies comparing subject island effects across three constructions (wh-questions, relative clauses, and topicalization) to test Abeillé et al.'s (2020) hypothesis that subject islandhood is specific to the information structure of wh-questions rather than a general constraint on movement, and thereby to determine whether island effects can be explained by abstract syntactic representations independent of any particular communicative function.

Link: https://arxiv.org/abs/2504.15688
Authors: Mandy Cartner,Matthew Kogan,Nikolas Webster,Matthew Wagers,Ivy Sichel
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:The term islands in linguistics refers to phrases from which extracting an element results in ungrammaticality (Ross, 1967). Grammatical subjects are considered islands because extracting a sub-part of a subject results in an ill-formed sentence, despite having a clear intended meaning (e.g., “Which topic did the article about inspire you?”). The generative tradition, which views syntax as autonomous of meaning and function, attributes this ungrammaticality to the abstract movement dependency between the wh-phrase and the subject-internal position with which it is associated for interpretation. However, research on language that emphasizes its communicative function suggests instead that syntactic constraints, including islands, can be explained based on the way different constructions package information. Accordingly, Abeillé et al. (2020) suggest that the islandhood of subjects is specific to the information structure of wh-questions, and propose that subjects are not islands for movement, but for focusing, due to their discourse-backgroundedness. This predicts that other constructions that differ in their information structure from wh-questions, but still involve movement, should not create a subject island effect. We test this prediction in three large-scale acceptability studies, using a super-additive design that singles out subject island violations, in three different constructions: wh-questions, relative clauses, and topicalization. We report evidence for a subject island effect in each construction type, despite only wh-questions introducing what Abeillé et al. (2020) call “a clash in information structure.” We argue that this motivates an account of islands in terms of abstract, syntactic representations, independent of the communicative function associated with the constructions.

[NLP-24] FinTextSim: Enhancing Financial Text Analysis with BERTopic

[Quick Read]: This paper aims to extract valuable insights from the large volumes of textual data in annual reports by optimizing topic modeling for financial text analysis. The key to the solution is FinTextSim, a fine-tuned sentence-transformer model optimized for semantic clustering and semantic search in financial contexts. Compared with the widely used all-MiniLM-L6-v2, FinTextSim increases intratopic similarity by 81% and eliminates intertopic confusion entirely, greatly improving the clarity and accuracy of topic classification. The study further shows that the state-of-the-art BERTopic model forms clear, distinct economic topic clusters only when paired with FinTextSim's embeddings; otherwise it suffers from misclassification and overlapping topics. FinTextSim is therefore pivotal for higher-quality financial text analysis and lays a foundation for future research: the improved information quality can help stakeholders gain a competitive advantage, streamline resource allocation and decision-making, and potentially advance business valuation and stock price prediction models.

Link: https://arxiv.org/abs/2504.15683
Authors: Simon Jehnen,Joaquín Ordieres-Meré,Javier Villalba-Díez
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG); General Economics (econ.GN); General Finance (q-fin.GN)
Comments:

Abstract:Recent advancements in information availability and computational capabilities have transformed the analysis of annual reports, integrating traditional financial metrics with insights from textual data. To extract valuable insights from this wealth of textual data, automated review processes, such as topic modeling, are crucial. This study examines the effectiveness of BERTopic, a state-of-the-art topic model relying on contextual embeddings, for analyzing Item 7 and Item 7A of 10-K filings from S&P 500 companies (2016-2022). Moreover, we introduce FinTextSim, a finetuned sentence-transformer model optimized for clustering and semantic search in financial contexts. Compared to all-MiniLM-L6-v2, the most widely used sentence-transformer, FinTextSim increases intratopic similarity by 81% and reduces intertopic similarity by 100%, significantly enhancing organizational clarity. We assess BERTopic’s performance using embeddings from both FinTextSim and all-MiniLM-L6-v2. Our findings reveal that BERTopic only forms clear and distinct economic topic clusters when paired with FinTextSim’s embeddings. Without FinTextSim, BERTopic struggles with misclassification and overlapping topics. Thus, FinTextSim is pivotal for advancing financial text analysis. FinTextSim’s enhanced contextual embeddings, tailored for the financial domain, elevate the quality of future research and financial information. This improved quality of financial information will enable stakeholders to gain a competitive advantage, streamlining resource allocation and decision-making processes. Moreover, the improved insights have the potential to leverage business valuation and stock price prediction models.
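With the sentence-transformers library, the intratopic/intertopic similarity comparison can be sketched as below (the FinTextSim checkpoint name is unknown to us, so the baseline all-MiniLM-L6-v2 stands in, and the documents are toy examples; the same model object can also be passed to BERTopic via its embedding_model argument):

```python
from sentence_transformers import SentenceTransformer, util

# Placeholder: substitute the FinTextSim checkpoint once available.
model = SentenceTransformer("all-MiniLM-L6-v2")

topics = {
    "revenue": ["Net revenues increased on higher product demand.",
                "Sales grew across all reporting segments."],
    "risk":    ["Interest rate risk is hedged with derivatives.",
                "We are exposed to foreign currency fluctuations."],
}

emb = {k: model.encode(v, convert_to_tensor=True) for k, v in topics.items()}
intra = (util.cos_sim(emb["revenue"][0], emb["revenue"][1]).item()
         + util.cos_sim(emb["risk"][0], emb["risk"][1]).item()) / 2
inter = util.cos_sim(emb["revenue"].mean(0), emb["risk"].mean(0)).item()
print(f"intratopic={intra:.2f}  intertopic={inter:.2f}")
```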

[NLP-25] VeriCoder: Enhancing LLM-Based RTL Code Generation through Functional Correctness Validation

[Quick Read]: This paper addresses the lack of functional validation in existing Large Language Model (LLM) datasets for Register Transfer Level (RTL) code generation in Electronic Design Automation (EDA): most RTL datasets emphasize syntactic validity, so generated code may compile yet fail to implement the intended behavior. The key to the solution is VERICODER, a model fine-tuned on a dataset validated for functional correctness, constructed with a novel methodology that combines unit-test generation with feedback-directed refinement. A teacher model (GPT-4o-mini) generates unit tests from a natural language specification and iteratively revises the RTL design based on simulation results, updating the tests when needed to keep them consistent with the specification. The resulting dataset pairs natural language descriptions, RTL implementations, and passing tests, so every training example is functionally validated. Fine-tuned on over 125,000 such examples, VERICODER achieves state-of-the-art functional correctness on VerilogEval and RTLLM, with relative gains of up to 71.7% and 27.4% respectively. An ablation study further shows that models trained on functionally validated data outperform those trained on non-validated data, underscoring the importance of high-quality datasets for RTL code generation.

Link: https://arxiv.org/abs/2504.15659
Authors: Anjiang Wei,Huanmi Tan,Tarun Suresh,Daniel Mendoza,Thiago S. F. X. Teixeira,Ke Wang,Caroline Trippel,Alex Aiken
Affiliations: Stanford University; Carnegie Mellon University; University of Illinois Urbana-Champaign; Intel; Visa Research
Subjects: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Software Engineering (cs.SE)
Comments:

Abstract:Recent advances in Large Language Models (LLMs) have sparked growing interest in applying them to Electronic Design Automation (EDA) tasks, particularly Register Transfer Level (RTL) code generation. While several RTL datasets have been introduced, most focus on syntactic validity rather than functional validation with tests, leading to training examples that compile but may not implement the intended behavior. We present VERICODER, a model for RTL code generation fine-tuned on a dataset validated for functional correctness. This fine-tuning dataset is constructed using a novel methodology that combines unit test generation with feedback-directed refinement. Given a natural language specification and an initial RTL design, we prompt a teacher model (GPT-4o-mini) to generate unit tests and iteratively revise the RTL design based on its simulation results using the generated tests. If necessary, the teacher model also updates the tests to ensure they comply with the natural language specification. As a result of this process, every example in our dataset is functionally validated, consisting of a natural language description, an RTL implementation, and passing tests. Fine-tuned on this dataset of over 125,000 examples, VERICODER achieves state-of-the-art metrics in functional correctness on VerilogEval and RTLLM, with relative gains of up to 71.7% and 27.4% respectively. An ablation study further shows that models trained on our functionally validated dataset outperform those trained on functionally non-validated datasets, underscoring the importance of high-quality datasets in RTL code generation.
zh

[NLP-26] Computational Typology

【速读】: 该论文旨在探索计算统计建模在语言类型学中的应用价值,试图解决传统语言类型学研究中因人工分析效率低、规模受限而导致的问题。论文的关键在于利用计算方法处理大规模语言数据,并通过统计建模验证有关语言结构与演化假设的有效性,从而揭示人类语言中的普遍特征与模式。

链接: https://arxiv.org/abs/2504.15642
作者: Gerhard Jäger
机构: University of Tübingen (蒂宾根大学)
类目: Computation and Language (cs.CL); Populations and Evolution (q-bio.PE)
备注: 19 pages, 5 figures

点击查看摘要

Abstract:Typology is a subfield of linguistics that focuses on the study and classification of languages based on their structural features. Unlike genealogical classification, which examines the historical relationships between languages, typology seeks to understand the diversity of human languages by identifying common properties and patterns, known as universals. In recent years, computational methods have played an increasingly important role in typological research, enabling the analysis of large-scale linguistic data and the testing of hypotheses about language structure and evolution. This article provides an illustration of the benefits of computational statistical modeling in typology.
zh

[NLP-27] Cost-Effective Text Clustering with Large Language Models

【速读】: 该论文旨在解决文本聚类任务中因大规模语言模型(Large Language Models, LLMs)引入的高计算与财务开销问题,特别是在基于API查询或推理调用时。论文的关键创新在于提出了一种名为TECL(Cost-effective Text Clustering Framework)的成本效益框架,通过有效利用LLMs的反馈,在有限的LLMs查询预算内实现准确的文本聚类。TECL的核心解决方案是采用EdgeLLM或TriangleLLM构建“必须链接”(must-link)与“不能链接”(cannot-link)约束,并将这些约束作为监督信号输入到加权约束聚类方法中生成聚类结果。其中,EdgeLLM(TriangleLLM)通过精心设计的贪心算法识别信息量大的文本对(三元组),并通过巧妙设计的提示技术提取精确的成对约束,从而显著降低查询成本的同时提升聚类性能。实验表明,TECL在相同查询预算下于多个基准数据集上的无监督文本聚类任务中表现出一致且显著的优越性。

链接: https://arxiv.org/abs/2504.15640
作者: Hongtao Wang,Taiyan Zhang,Renchi Yang,Jianliang Xu
机构: Hong Kong Baptist University (香港浸会大学); ShanghaiTech University (上海科技大学); Institute of Computing Technology, Chinese Academy of Sciences (中国科学院计算技术研究所)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Text clustering aims to automatically partition a collection of text documents into distinct clusters based on linguistic features. In the literature, this task is usually framed as metric clustering based on text embeddings from pre-trained encoders or a graph clustering problem upon pairwise similarities from an oracle, e.g., a large ML model. Recently, large language models (LLMs) bring significant advancement in this field by offering contextualized text embeddings and highly accurate similarity scores, but meanwhile, present grand challenges to cope with substantial computational and/or financial overhead caused by numerous API-based queries or inference calls to the models. In response, this paper proposes TECL, a cost-effective framework that taps into the feedback from LLMs for accurate text clustering within a limited budget of queries to LLMs. Under the hood, TECL adopts our EdgeLLM or TriangleLLM to construct must-link/cannot-link constraints for text pairs, and further leverages such constraints as supervision signals input to our weighted constrained clustering approach to generate clusters. Particularly, EdgeLLM (resp. TriangleLLM) enables the identification of informative text pairs (resp. triplets) for querying LLMs via well-thought-out greedy algorithms and accurate extraction of pairwise constraints through carefully-crafted prompting techniques. Our experiments on multiple benchmark datasets exhibit that TECL consistently and considerably outperforms existing solutions in unsupervised text clustering under the same query cost for LLMs.
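下面的示意展示“用 LLM 判断构造 must-link 约束,再做约束聚类”的一种极简思路:对 must-link 用并查集合并为超点后做 KMeans。论文中的贪心选对、加权约束以及 cannot-link 的处理此处从略;llm_same_cluster 为假设接口。

```python
# 极简示意:LLM 成对约束 + 简化的约束聚类(非论文官方实现,cannot-link 处理从略)
import numpy as np
from sklearn.cluster import KMeans

def llm_same_cluster(text_a: str, text_b: str) -> bool:
    """占位:询问 LLM 两段文本是否同主题(对应 must-link / cannot-link 判断)。"""
    raise NotImplementedError

def find(parent, i):
    while parent[i] != i:
        parent[i] = parent[parent[i]]
        i = parent[i]
    return i

def tecl_like_cluster(texts, embeddings, pairs, n_clusters):
    """texts: 文本列表; embeddings: (N, d) 数组; pairs: 预算内待查询的文本对下标。"""
    parent = list(range(len(texts)))
    for i, j in pairs:  # 论文用贪心算法挑选信息量大的文本对,此处假定已给出
        if llm_same_cluster(texts[i], texts[j]):
            parent[find(parent, i)] = find(parent, j)  # must-link: 并查集合并
    groups = {}
    for i in range(len(texts)):
        groups.setdefault(find(parent, i), []).append(i)
    # must-link 连通块合并为超点(均值嵌入)后聚类;要求 n_clusters <= 超点数
    reps = np.stack([embeddings[idx].mean(axis=0) for idx in groups.values()])
    rep_labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(reps)
    labels = np.empty(len(texts), dtype=int)
    for lab, idx in zip(rep_labels, groups.values()):
        labels[idx] = lab
    return labels
```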
zh

[NLP-28] Exploiting Contextual Knowledge in LLMs through V-usable Information based Layer Enhancement

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在生成式任务中难以忠实反映上下文知识的问题。现有方法主要集中在改进解码策略,但忽视了LLMs内部状态处理上下文信息的基本机制,导致其利用上下文知识的能力受限。论文的关键解决方案是提出了一种名为上下文感知层增强(Context-aware Layer Enhancement, CaLE)的新干预方法。CaLE通过采用V-可用信息分析,在最佳层位战略性地放大上下文信息的增长,从而丰富最终层的表示。实验表明,CaLE在问答任务中有效提升了上下文忠实生成能力,特别是在涉及未知或冲突上下文知识的情景下表现出色。

链接: https://arxiv.org/abs/2504.15630
作者: Xiaowei Yuan,Zhao Yang,Ziyang Huang,Yequan Wang,Siqi Fan,Yiming Ju,Jun Zhao,Kang Liu
机构: The Key Laboratory of Cognition and Decision Intelligence for Complex Systems, Institute of Automation, Chinese Academy of Sciences (中国科学院自动化研究所认知与智能决策重点实验室); School of Artificial Intelligence, University of Chinese Academy of Sciences (中国科学院大学人工智能学院); Beijing Academy of Artificial Intelligence (北京智源人工智能研究院); Meituan (美团); University of Electronic Science and Technology of China (电子科技大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) have demonstrated remarkable capabilities in various tasks, yet they often struggle with context-faithfulness generations that properly reflect contextual knowledge. While existing approaches focus on enhancing the decoding strategies, they ignore the fundamental mechanism of how contextual information is processed within LLMs’ internal states. As a result, LLMs remain limited in their ability to fully leverage contextual knowledge. In this paper, we propose Context-aware Layer Enhancement (CaLE), a novel intervention method that enhances the utilization of contextual knowledge within LLMs’ internal representations. By employing V-usable information analysis, CaLE strategically amplifies the growth of contextual information at an optimal layer, thereby enriching representations in the final layer. Our experiments demonstrate that CaLE effectively improves context-faithful generation in Question-Answering tasks, particularly in scenarios involving unknown or conflicting contextual knowledge.
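CaLE 的干预机制可以理解为“在选定层放大隐藏状态中的上下文信息”。下面用 PyTorch 前向钩子给出一个极简示意:对某一层的输出整体乘以放大系数。真实方法依据 V-usable information 选层、且只增强上下文相关成分,此处仅演示钩子式干预本身,层号与系数均为假设。

```python
# 极简示意:用 forward hook 在指定层放大隐藏表示(演示干预机制,非 CaLE 官方实现)
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")  # 任意因果语言模型,仅作演示
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

LAYER, ALPHA = 3, 1.5  # 假设:干预层与放大系数;CaLE 依据 V-usable information 选层

def amplify(module, inputs, output):
    # GPT-2 的 block 输出是元组,第 0 项为隐藏状态
    if isinstance(output, tuple):
        return (output[0] * ALPHA,) + output[1:]
    return output * ALPHA

handle = model.transformer.h[LAYER].register_forward_hook(amplify)
ids = tok("The capital of France is", return_tensors="pt")
with torch.no_grad():
    out = model(**ids)
print(tok.decode(int(out.logits[0, -1].argmax())))
handle.remove()  # 用完移除钩子
```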
zh

[NLP-29] CiteFix: Enhancing RAG Accuracy Through Post-Processing Citation Correction

【速读】: 该论文旨在解决 Retrieval Augmented Generation (RAG) 系统中大型语言模型(Large Language Models, LLMs)在源引用(source attribution)方面准确性较低的问题。现有研究表明,主流生成式搜索引擎的引用准确性仅约为 74%,这限制了 RAG 系统在信息检索与摘要任务中的可靠性与可信度。为应对这一挑战,论文提出了一系列高效的后处理算法来提升 LLM 生成响应中的引用准确性,同时尽量减少对系统延迟和成本的影响。

解决方案的关键在于采用多种方法交叉验证生成的引用是否与检索到的文章一致,具体包括基于关键词匹配和语义匹配的方法、结合 BERTScore 的微调模型以及一种轻量级的基于 LLM 的技术。实验结果显示,这些方法使 RAG 系统的整体准确性指标相对提升了 15.46%,从而有可能将当前使用的大型语言模型替换为约 12 倍成本效益更高且推理速度提升 3 倍的小型模型,同时保持相近的性能水平。这项研究显著增强了信息检索与摘要任务中 AI 生成内容的可靠性和可信度,这对于赢得商业产品用户的信任尤为重要。

链接: https://arxiv.org/abs/2504.15629
作者: Harsh Maheshwari,Srikanth Tenneti,Alwarappan Nakkiran
机构: Amazon(亚马逊)
类目: Information Retrieval (cs.IR); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Retrieval Augmented Generation (RAG) has emerged as a powerful application of Large Language Models (LLMs), revolutionizing information search and consumption. RAG systems combine traditional search capabilities with LLMs to generate comprehensive answers to user queries, ideally with accurate citations. However, in our experience of developing a RAG product, LLMs often struggle with source attribution, aligning with other industry studies reporting citation accuracy rates of only about 74% for popular generative search engines. To address this, we present efficient post-processing algorithms to improve citation accuracy in LLM-generated responses, with minimal impact on latency and cost. Our approaches cross-check generated citations against retrieved articles using methods including keyword + semantic matching, fine tuned model with BERTScore, and a lightweight LLM-based technique. Our experimental results demonstrate a relative improvement of 15.46% in the overall accuracy metrics of our RAG system. This significant enhancement potentially enables a shift from our current larger language model to a relatively smaller model that is approximately 12x more cost-effective and 3x faster in inference time, while maintaining comparable performance. This research contributes to enhancing the reliability and trustworthiness of AI-generated content in information retrieval and summarization tasks which is critical to gain customer trust especially in commercial products.
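摘要中的“关键词 + 语义匹配”交叉校验可以写成一个很短的后处理函数:对每条生成引用,在召回文档中按词重叠与嵌入余弦相似度打分,仅当最佳文档明显更优时改写引用。以下为示意实现,权重与阈值均为假设值。

```python
# 示意:引用后处理校正(关键词重叠 + 语义相似度,非论文官方实现)
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")

def keyword_overlap(a: str, b: str) -> float:
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / max(1, len(sa | sb))  # Jaccard 相似度

def fix_citation(claim: str, cited_doc: str, retrieved_docs: list[str],
                 w_kw: float = 0.4, w_sem: float = 0.6, margin: float = 0.05):
    emb_claim = encoder.encode(claim, convert_to_tensor=True)
    def score(doc: str) -> float:
        sem = util.cos_sim(emb_claim, encoder.encode(doc, convert_to_tensor=True)).item()
        return w_kw * keyword_overlap(claim, doc) + w_sem * sem
    best = max(retrieved_docs, key=score)
    # 仅当最佳文档明显优于当前引用时才改写,避免过度校正
    return best if score(best) > score(cited_doc) + margin else cited_doc
```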
zh

[NLP-30] Exploring Next Token Prediction in Theory of Mind (ToM) Tasks: Comparative Experiments with GPT-2 and LLaMA-2 AI Models

【速读】: 该论文旨在比较OpenAI的GPT-2与Meta的Llama-2-7b-chat-hf在心智理论(Theory of Mind, ToM)任务中的下一 token 预测性能。研究的关键在于构建了一个包含10个短故事的数据集,并通过GPT-4程序化插入额外句子(infills)来增强故事的上下文复杂性,从而分析增加上下文如何影响模型表现。研究测试了两种模型在四种温度设置下的表现,并评估其在三种推理层次上的预测能力。结果显示,增加上下文虽略微降低预测准确性,但Llama-2在较低温度下始终优于GPT-2,尤其在更高推理复杂度下,模型间的差异更加显著。研究揭示了模型架构、温度及上下文复杂度对预测性能的影响,为理解当前语言模型的优势与局限提供了重要见解。

链接: https://arxiv.org/abs/2504.15604
作者: Pavan Yadav,Nikhil Khandalkar,Krishna Shinde,Lokesh B. Ramegowda,Rajarshi Das
机构: enkefalos.com(Enkefalos); mqube.ai(MQUBE)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 75 pages, 60 figures

点击查看摘要

Abstract:Language models have made significant progress in generating coherent text and predicting next tokens based on input prompts. This study compares the next-token prediction performance of two well-known models: OpenAI’s GPT-2 and Meta’s Llama-2-7b-chat-hf on Theory of Mind (ToM) tasks. To evaluate their capabilities, we built a dataset from 10 short stories sourced from the Explore ToM Dataset. We enhanced these stories by programmatically inserting additional sentences (infills) using GPT-4, creating variations that introduce different levels of contextual complexity. This setup enables analysis of how increasing context affects model performance. We tested both models under four temperature settings (0.01, 0.5, 1.0, 2.0) and evaluated their ability to predict the next token across three reasoning levels. Zero-order reasoning involves tracking the state, either current (ground truth) or past (memory). First-order reasoning concerns understanding another’s mental state (e.g., “Does Anne know the apple is salted?”). Second-order reasoning adds recursion (e.g., “Does Anne think that Charles knows the apple is salted?”). Our results show that adding more infill sentences slightly reduces prediction accuracy, as added context increases complexity and ambiguity. Llama-2 consistently outperforms GPT-2 in prediction accuracy, especially at lower temperatures, demonstrating greater confidence in selecting the most probable token. As reasoning complexity rises, model responses diverge more. Notably, GPT-2 and Llama-2 display greater variability in predictions during first- and second-order reasoning tasks. These findings illustrate how model architecture, temperature, and contextual complexity influence next-token prediction, contributing to a better understanding of the strengths and limitations of current language models.
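实验中“不同温度下的下一 token 预测”可以用 transformers 几行复现:对 logits 除以温度后做 softmax,再查看最可能的候选词。以下是一个可运行的小示例,提示语仅作演示。

```python
# 示意:GPT-2 在不同温度下的下一 token 概率分布
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

ids = tok("Anne put the apple in the", return_tensors="pt")
with torch.no_grad():
    logits = model(**ids).logits[0, -1]  # 序列末位的下一 token logits

for temperature in (0.01, 0.5, 1.0, 2.0):
    probs = torch.softmax(logits / temperature, dim=-1)
    top = torch.topk(probs, k=3)
    cands = [(tok.decode(int(i)), round(float(p), 3))
             for i, p in zip(top.indices, top.values)]
    print(temperature, cands)  # 温度越低,分布越尖锐
```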
zh

[NLP-31] A Comprehensive Survey in LLM(-Agent) Full Stack Safety: Data, Training and Deployment

【速读】: 该论文旨在解决现有大型语言模型(Large Language Models, LLMs)安全调查局限于特定生命周期阶段的问题,缺乏对LLMs整个“生命链”全面理解的不足。论文首次引入“全栈”安全(full-stack safety)的概念,系统性地考虑从训练到部署及最终商业化的整个过程中LLMs的安全问题。其关键在于提出将LLM完整生命周期定义为数据准备、预训练、后训练、部署以及最终商业化五个阶段,并基于超过800篇文献的广泛回顾,提供全面且系统的安全问题覆盖与组织,同时通过系统分析提炼出可靠的研究路线图和见解,包括数据生成安全、对齐技术、模型编辑以及基于LLMs的代理系统等研究方向,为未来相关领域的研究提供了有价值的指导。

链接: https://arxiv.org/abs/2504.15585
作者: Kun Wang,Guibin Zhang,Zhenhong Zhou,Jiahao Wu,Miao Yu,Shiqian Zhao,Chenlong Yin,Jinhu Fu,Yibo Yan,Hanjun Luo,Liang Lin,Zhihao Xu,Haolang Lu,Xinye Cao,Xinyun Zhou,Weifei Jin,Fanci Meng,Junyuan Mao,Hao Wu,Minghe Wang,Fan Zhang,Junfeng Fang,Chengwei Liu,Yifan Zhang,Qiankun Li,Chongye Guo,Yalan Qin,Yi Ding,Donghai Hong,Jiaming Ji,Xinfeng Li,Yifan Jiang,Dongxia Wang,Yihao Huang,Yufei Guo,Jen-tse Huang,Yanwei Yue,Wenke Huang,Guancheng Wan,Tianlin Li,Lei Bai,Jie Zhang,Qing Guo,Jingyi Wang,Tianlong Chen,Joey Tianyi Zhou,Xiaojun Jia,Weisong Sun,Cong Wu,Jing Chen,Xuming Hu,Yiming Li,Xiao Wang,Ningyu Zhang,Luu Anh Tuan,Guowen Xu,Tianwei Zhang,Xingjun Ma,Xiang Wang,Bo An,Jun Sun,Mohit Bansal,Shirui Pan,Yuval Elovici,Bhavya Kailkhura,Bo Li,Yaodong Yang,Hongwei Li,Wenyuan Xu,Yizhou Sun,Wei Wang,Qing Li,Ke Tang,Yu-Gang Jiang,Felix Juefei-Xu,Hui Xiong,Xiaofeng Wang,Shuicheng Yan,Dacheng Tao,Philip S. Yu,Qingsong Wen,Yang Liu
机构: Nanyang Technological University (南洋理工大学); National University of Singapore (新加坡国立大学); The Hong Kong Polytechnic University (香港理工大学); A*STAR (新加坡科技研究局); Squirrel AI Learning (松鼠AI学习); Southern University of Science and Technology (南方科技大学); University of Science and Technology of China (中国科学技术大学); The Pennsylvania State University (宾夕法尼亚州立大学); TeleAI (电信AI); Hong Kong University of Science and Technology (香港科技大学); Zhejiang University (浙江大学); Institute of Information Engineering, Chinese Academy of Sciences (中国科学院信息工程研究所); Renmin University of China (中国人民大学); The University of North Carolina at Chapel Hill (北卡罗来纳大学教堂山分校); Griffith University (格里菲斯大学); Ben Gurion University (本·古里安大学); Center for Applied Scientific Computing (应用科学计算中心); University of Illinois Urbana-Champaign (伊利诺伊大学香槟分校); Peking University (北京大学); University of Electronic Science and Technology of China (电子科技大学); Wuhan University (武汉大学); Shanghai AI Laboratory (上海人工智能实验室); Shanghai University (上海大学); Institute of Automation, Chinese Academy of Sciences (中国科学院自动化研究所); The University of Hong Kong (香港大学); University of Southern California (南加州大学); Johns Hopkins University (约翰霍普金斯大学); University of Washington (华盛顿大学); New York University (纽约大学); University of Illinois at Chicago (芝加哥大学); ACM Member (ACM会员); University of California, Los Angeles (加州大学洛杉矶分校); Fudan University (复旦大学); Singapore Management University (新加坡管理大学)
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:The remarkable success of Large Language Models (LLMs) has illuminated a promising pathway toward achieving Artificial General Intelligence for both academic and industrial communities, owing to their unprecedented performance across various applications. As LLMs continue to gain prominence in both research and commercial domains, their security and safety implications have become a growing concern, not only for researchers and corporations but also for every nation. Currently, existing surveys on LLM safety primarily focus on specific stages of the LLM lifecycle, e.g., deployment phase or fine-tuning phase, lacking a comprehensive understanding of the entire “lifechain” of LLMs. To address this gap, this paper introduces, for the first time, the concept of “full-stack” safety to systematically consider safety issues throughout the entire process of LLM training, deployment, and eventual commercialization. Compared to the off-the-shelf LLM safety surveys, our work demonstrates several distinctive advantages: (I) Comprehensive Perspective. We define the complete LLM lifecycle as encompassing data preparation, pre-training, post-training, deployment and final commercialization. To our knowledge, this represents the first safety survey to encompass the entire lifecycle of LLMs. (II) Extensive Literature Support. Our research is grounded in an exhaustive review of over 800+ papers, ensuring comprehensive coverage and systematic organization of security issues within a more holistic understanding. (III) Unique Insights. Through systematic literature analysis, we have developed reliable roadmaps and perspectives for each chapter. Our work identifies promising research directions, including safety in data generation, alignment techniques, model editing, and LLM-based agent systems. These insights provide valuable guidance for researchers pursuing future work in this field.
zh

[NLP-32] Instruction-Tuning Data Synthesis from Scratch via Web Reconstruction

【速读】: 该论文旨在解决高质量指令-响应(instruction-response)数据对的稀缺性问题,这对于提升大型语言模型(LLMs)的指令跟随能力至关重要。现有自动数据合成方法虽然减轻了人工标注的工作量,但通常高度依赖种子数据的质量或对网页文档结构与内容的强假设。论文的关键解决方案是提出了一种名为Web Reconstruction (WebR) 的全自动框架,通过一种新颖的双重视角范式——“网页作为指令”和“网页作为响应”,直接从原始网页文档中合成高质量的指令微调(IT)数据,且假设最少。这种方法利用了原始网页内容的内在多样性,并通过重新构造过程将每个网页文档指定为指令或响应来触发数据生成。实验表明,WebR生成的数据在四个指令跟随基准测试中比最先进的基线方法高出最多16.65%,同时展示了优越的兼容性、数据效率和可扩展性。

链接: https://arxiv.org/abs/2504.15573
作者: Yuxin Jiang,Yufei Wang,Chuhan Wu,Xinyi Dai,Yan Xu,Weinan Gan,Yasheng Wang,Xin Jiang,Lifeng Shang,Ruiming Tang,Wei Wang
机构: The Hong Kong University of Science and Technology (Guangzhou)(香港科技大学(广州)); The Hong Kong University of Science and Technology(香港科技大学); Huawei Noah’s Ark Lab(华为诺亚方舟实验室)
类目: Computation and Language (cs.CL)
备注: 15 pages, 11 figures, 9 tables

点击查看摘要

Abstract:The improvement of LLMs’ instruction-following capabilities depends critically on the availability of high-quality instruction-response pairs. While existing automatic data synthetic methods alleviate the burden of manual curation, they often rely heavily on either the quality of seed data or strong assumptions about the structure and content of web documents. To tackle these challenges, we propose Web Reconstruction (WebR), a fully automated framework for synthesizing high-quality instruction-tuning (IT) data directly from raw web documents with minimal assumptions. Leveraging the inherent diversity of raw web content, we conceptualize web reconstruction as an instruction-tuning data synthesis task via a novel dual-perspective paradigm–Web as Instruction and Web as Response–where each web document is designated as either an instruction or a response to trigger the reconstruction process. Comprehensive experiments show that datasets generated by WebR outperform state-of-the-art baselines by up to 16.65% across four instruction-following benchmarks. Notably, WebR demonstrates superior compatibility, data efficiency, and scalability, enabling enhanced domain adaptation with minimal effort. The data and code are publicly available at this https URL.
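“网页作为指令”与“网页作为响应”两种视角可以落成两个提示模板,如下所示;提示词内容为本文示意性改写,并非论文原文,llm 为假设的调用接口。

```python
# 示意:WebR 双视角提示骨架(提示词为本文示意,llm 为假设的调用接口)
WEB_AS_INSTRUCTION = (
    "下面是一篇网页文档。请将其核心内容改写成一条用户指令,"
    "并给出对应的高质量回答。\n文档:\n{doc}"
)
WEB_AS_RESPONSE = (
    "下面是一篇网页文档。请构造一条以该文档为理想回答的用户指令,"
    "并把文档润色成对该指令的回答。\n文档:\n{doc}"
)

def synthesize_pair(doc: str, role: str, llm) -> str:
    # role 决定该网页被视为“指令”还是“响应”,触发相应的重构过程
    template = WEB_AS_INSTRUCTION if role == "instruction" else WEB_AS_RESPONSE
    return llm(template.format(doc=doc))
```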
zh

[NLP-33] LLM-based Semantic Augmentation for Harmful Content Detection

【速读】: 该论文旨在解决复杂社会媒体任务(如宣传检测、仇恨表情包分类和毒性识别)中大型语言模型(Large Language Models, LLMs)性能下降的问题。现有工作多聚焦于利用LLMs生成合成训练数据,而忽视了其在文本预处理和语义增强方面的潜力。论文的关键解决方案是引入一种方法,通过提示LLMs对噪声文本进行清洗并提供富含上下文的解释,从而在不显著增加数据量的情况下增强训练集质量。实验结果表明,这种基于LLM的语义增强方法在SemEval 2024多标签说服性表情包数据集以及其他公开数据集上的表现与依赖人工标注数据的方法相当,但成本大幅降低,这凸显了将LLMs战略性整合到社交媒体分类机器学习(Machine Learning, ML)流水线中的重要性。

链接: https://arxiv.org/abs/2504.15548
作者: Elyas Meguellati,Assaad Zeghina,Shazia Sadiq,Gianluca Demartini
机构: 未知
类目: Computation and Language (cs.CL); Computers and Society (cs.CY)
备注:

点击查看摘要

Abstract:Recent advances in large language models (LLMs) have demonstrated strong performance on simple text classification tasks, frequently under zero-shot settings. However, their efficacy declines when tackling complex social media challenges such as propaganda detection, hateful meme classification, and toxicity identification. Much of the existing work has focused on using LLMs to generate synthetic training data, overlooking the potential of LLM-based text preprocessing and semantic augmentation. In this paper, we introduce an approach that prompts LLMs to clean noisy text and provide context-rich explanations, thereby enhancing training sets without substantial increases in data volume. We systematically evaluate on the SemEval 2024 multi-label Persuasive Meme dataset and further validate on the Google Jigsaw toxic comments and Facebook hateful memes datasets to assess generalizability. Our results reveal that zero-shot LLM classification underperforms on these high-context tasks compared to supervised models. In contrast, integrating LLM-based semantic augmentation yields performance on par with approaches that rely on human-annotated data, at a fraction of the cost. These findings underscore the importance of strategically incorporating LLMs into machine learning (ML) pipeline for social media classification tasks, offering broad implications for combating harmful content online.
zh

[NLP-34] llm-jp-modernbert: A ModernBERT Model Trained on a Large-Scale Japanese Corpus with Long Context Length

【速读】: 该论文试图解决相较于解码器-only Transformer,编码器-only Transformer 模型(如BERT)在大规模语料与长上下文预训练方面研究相对不足的问题。论文的关键解决方案是提出 llm-jp-modernbert,一个基于 ModernBERT 架构、在公开可用的大规模日语语料上预训练的模型,上下文长度可达8192个标记。论文通过伪困惑度实验分析上下文长度扩展的影响,并详细考察句子嵌入在训练过程中的变化趋势,以验证该模型的有效性,同时公开模型及训练、评估代码,以促进长上下文BERT模型的发展。

链接: https://arxiv.org/abs/2504.15544
作者: Issa Sugiura,Kouta Nakayama,Yusuke Oda
机构: Kyoto University (京都大学); NII LLMC (国立情报学研究所大语言模型研究开发中心)
类目: Computation and Language (cs.CL)
备注: 9 pages, 5 figures

点击查看摘要

Abstract:Encoder-only transformer models like BERT are widely adopted as a pre-trained backbone for tasks like sentence classification and retrieval. However, pretraining of encoder models with large-scale corpora and long contexts has been relatively underexplored compared to decoder-only transformers. In this work, we present llm-jp-modernbert, a ModernBERT model trained on a publicly available, massive Japanese corpus with a context length of 8192 tokens. While our model does not surpass existing baselines on downstream tasks, it achieves good results on fill-mask test evaluations. We also analyze the effect of context length expansion through pseudo-perplexity experiments. Furthermore, we investigate sentence embeddings in detail, analyzing their transitions during training and comparing them with those from other existing models, confirming similar trends with models sharing the same architecture. To support reproducibility and foster the development of long-context BERT, we release our model, along with the training and evaluation code.
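该模型遵循标准的 masked LM 接口,可用 transformers 的 fill-mask pipeline 直接做填空测试;下例中的模型 ID 为假设写法,请以官方发布页为准。

```python
# 示意:对日语 ModernBERT 做 fill-mask 测试(模型 ID 为假设,请以官方发布为准)
from transformers import pipeline

fill = pipeline("fill-mask", model="llm-jp/llm-jp-modernbert-base")
mask = fill.tokenizer.mask_token  # 以模型自身的 mask 标记为准
for pred in fill(f"日本の首都は{mask}です。")[:3]:
    print(pred["token_str"], round(pred["score"], 3))
```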
zh

[NLP-35] Compass-V2 Technical Report

【速读】: 该论文旨在解决现有大型语言模型(LLMs)主要关注高资源语言而忽视低资源语言,尤其是东南亚(SEA)地区语言的问题,同时这些模型多为通用型,对电子商务(e-commerce)领域的针对性不足。为克服这些局限性,论文提出Compass-v2,这是一种专为东南亚语言和电子商务应用设计的轻量级专家混合模型(Mixture-of-Experts, MoE)。其关键解决方案在于通过总参数量为30B、活跃参数量为5B的设计平衡模型性能与推理成本,并结合细粒度专家模块与共享专家模块;构建高质量行业领先的SEA数据集以提升多语言性能;创建包含数百亿token的电商领域数据集,结合外部数据挖掘与内部平台采集;创新性地提出一种支持快速思维与深度思维统一框架的混合推理模型,从而增强推理能力,区别于传统行业部署的两个独立模型的做法。实验表明,Compass-v2在子30B规模模型中实现了SEA多语言和电商领域的最先进性能,同时保持显著更低的推理成本。

链接: https://arxiv.org/abs/2504.15527
作者: Sophia Maria
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Predominant LLMs focus on high-resource languages while leaving low-resource languages, particularly those in Southeast Asia (SEA), underrepresented. In addition, those models are general-purpose and pay limited attention to the e-commerce domain. To overcome these limitations, we introduce Compass-v2, a lightweight Mixture-of-Experts (MoE) model specifically designed for Southeast Asian languages and e-commerce applications. To balance model performance and inference cost, the model is designed with 30B total parameters and 5B active parameters, incorporating both fine-grained and shared expert modules. To enhance multilingual performance, we curated and constructed a high-quality, industry-leading SEA dataset, to the best of our knowledge. To boost performance in the e-commerce domain, we built a dataset comprising hundreds of billions of tokens, sourced through external data mining and internal platform collection. Besides, we pioneered a hybrid reasoning model that supports both fast thinking and deep thinking within a unified framework to enhance the reasoning capabilities, diverging from the conventional industry practice of deploying two separate models. Through extensive experimental evaluations, our model demonstrates state-of-the-art SEA multilingual and e-commerce performance among sub-30B models, while maintaining significantly lower inference cost.
zh

[NLP-36] IPBench: Benchmarking the Knowledge of Large Language Models in Intellectual Property

【速读】: 该论文试图解决知识产权(Intellectual Property, IP)领域因技术与法律知识高度融合而带来的复杂性和知识密集性问题,并提升大语言模型(Large Language Models, LLMs)在真实世界IP应用场景中的表现。现有数据集和基准要么局限于专利分析,要么仅覆盖IP领域的有限方面,难以反映实际场景需求。为此,论文的关键解决方案是提出首个全面的IP任务分类法以及一个包含8种IP机制和20项任务的大规模双语基准IPBench。该基准旨在评估LLMs在理解与生成IP相关内容方面的性能,揭示即使是表现最佳的模型也仅达到75.8%的准确率,表明仍有显著改进空间。此外,论文强调开源IP和法律导向模型的表现落后于闭源通用模型,并公开发布IPBench的所有数据和代码,计划持续更新以更好地模拟IP领域的现实挑战。

链接: https://arxiv.org/abs/2504.15524
作者: Qiyao Wang,Guhong Chen,Hongbo Wang,Huaren Liu,Minghui Zhu,Zhifei Qin,Linwei Li,Yilin Yue,Shiqiang Wang,Jiayan Li,Yihang Wu,Ziqiang Liu,Longze Chen,Run Luo,Liyang Fan,Jiaming Li,Lei Zhang,Kan Xu,Hongfei Lin,Hamid Alinejad-Rokny,Shiwen Ni,Yuan Lin,Min Yang
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 89 pages, 75 figures, 55 tables

点击查看摘要

Abstract:Intellectual Property (IP) is a unique domain that integrates technical and legal knowledge, making it inherently complex and knowledge-intensive. As large language models (LLMs) continue to advance, they show great potential for processing IP tasks, enabling more efficient analysis, understanding, and generation of IP-related content. However, existing datasets and benchmarks either focus narrowly on patents or cover limited aspects of the IP field, lacking alignment with real-world scenarios. To bridge this gap, we introduce the first comprehensive IP task taxonomy and a large, diverse bilingual benchmark, IPBench, covering 8 IP mechanisms and 20 tasks. This benchmark is designed to evaluate LLMs in real-world intellectual property applications, encompassing both understanding and generation. We benchmark 16 LLMs, ranging from general-purpose to domain-specific models, and find that even the best-performing model achieves only 75.8% accuracy, revealing substantial room for improvement. Notably, open-source IP and law-oriented models lag behind closed-source general-purpose models. We publicly release all data and code of IPBench and will continue to update it with additional IP-related tasks to better reflect real-world challenges in the intellectual property domain.
zh

[NLP-37] The Bitter Lesson Learned from 2000 Multilingual Benchmarks

【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在多语言评估中存在的不公平性和局限性问题,特别是英语过度代表以及现有基准与人类判断之间存在显著差异的问题。论文通过分析来自148个国家的超过2,000个多语言非英语基准,揭示了当前多语言基准设计中的六大关键局限,并强调了创建文化与语言适配的本地化基准而非单纯依赖翻译的重要性。解决方案的关键在于提出指导原则以推动有效的多语言基准设计,并倡导全球合作开发与人类评价高度一致且面向实际应用的基准。

链接: https://arxiv.org/abs/2504.15521
作者: Minghao Wu,Weixuan Wang,Sinuo Liu,Huifeng Yin,Xintong Wang,Yu Zhao,Chenyang Lyu,Longyue Wang,Weihua Luo,Kaifu Zhang
机构: Alibaba International Digital Commerce (阿里巴巴国际数字商业); Monash University (蒙纳士大学); The University of Edinburgh (爱丁堡大学); Tsinghua University (清华大学); Universität Hamburg (汉堡大学)
类目: Computation and Language (cs.CL)
备注: work in progress; 22 pages, 8 figures, 3 tables;

点击查看摘要

Abstract:As large language models (LLMs) continue to advance in linguistic capabilities, robust multilingual evaluation has become essential for promoting equitable technological progress. This position paper examines over 2,000 multilingual (non-English) benchmarks from 148 countries, published between 2021 and 2024, to evaluate past, present, and future practices in multilingual benchmarking. Our findings reveal that, despite significant investments amounting to tens of millions of dollars, English remains significantly overrepresented in these benchmarks. Additionally, most benchmarks rely on original language content rather than translations, with the majority sourced from high-resource countries such as China, India, Germany, the UK, and the USA. Furthermore, a comparison of benchmark performance with human judgments highlights notable disparities. STEM-related tasks exhibit strong correlations with human evaluations (0.70 to 0.85), while traditional NLP tasks like question answering (e.g., XQuAD) show much weaker correlations (0.11 to 0.30). Moreover, translating English benchmarks into other languages proves insufficient, as localized benchmarks demonstrate significantly higher alignment with local human judgments (0.68) than their translated counterparts (0.47). This underscores the importance of creating culturally and linguistically tailored benchmarks rather than relying solely on translations. Through this comprehensive analysis, we highlight six key limitations in current multilingual evaluation practices, propose the guiding principles accordingly for effective multilingual benchmarking, and outline five critical research directions to drive progress in the field. Finally, we call for a global collaborative effort to develop human-aligned benchmarks that prioritize real-world applications.
zh

[NLP-38] SimulS2S-LLM: Unlocking Simultaneous Inference of Speech LLMs for Speech-to-Speech Translation

【速读】: 该论文旨在解决同时传译(Simultaneous Speech Translation, SST)中翻译质量和延迟之间的权衡问题,特别是在大型语言模型(Large Language Models, LLMs)扩展到处理语音模态时,流式处理面临的挑战。传统方法在推理阶段将整个语音作为提示输入,导致训练与推理之间存在不匹配。为了解决这一问题,论文提出了一种名为SimulS2S-LLM的方法,其关键是通过离线训练语音LLMs,并在测试时采用策略引导同步推理。具体而言,SimulS2S-LLM通过提取边界感知的语音提示,缓解了训练与推理之间的不匹配,使其能够更好地适配文本输入数据。此外,通过增量束搜索扩展语音令牌预测的搜索空间,同时保持低延迟,进一步优化了翻译质量与延迟的平衡。实验结果表明,该方法在CVSS数据集上的翻译质量优于使用相同训练数据的现有方法。

链接: https://arxiv.org/abs/2504.15509
作者: Keqi Deng,Wenxi Chen,Xie Chen,Philip C. Woodland
机构: Department of Engineering, University of Cambridge (剑桥大学); Shanghai Jiao Tong University (上海交通大学)
类目: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注:

点击查看摘要

Abstract:Simultaneous speech translation (SST) outputs translations in parallel with streaming speech input, balancing translation quality and latency. While large language models (LLMs) have been extended to handle the speech modality, streaming remains challenging as speech is prepended as a prompt for the entire generation process. To unlock LLM streaming capability, this paper proposes SimulS2S-LLM, which trains speech LLMs offline and employs a test-time policy to guide simultaneous inference. SimulS2S-LLM alleviates the mismatch between training and inference by extracting boundary-aware speech prompts that allows it to be better matched with text input data. SimulS2S-LLM achieves simultaneous speech-to-speech translation (Simul-S2ST) by predicting discrete output speech tokens and then synthesising output speech using a pre-trained vocoder. An incremental beam search is designed to expand the search space of speech token prediction without increasing latency. Experiments on the CVSS speech data show that SimulS2S-LLM offers a better translation quality-latency trade-off than existing methods that use the same training data, such as improving ASR-BLEU scores by 3 points at similar latency.
zh

[NLP-39] CAPTURe: Evaluating Spatial Reasoning in Vision Language Models via Occluded Object Counting

【速读】: 该论文试图解决视觉场景中被遮挡(部分或完全隐藏)物体的识别与推理问题,这是理解复杂视觉场景的关键挑战。论文引入了一个名为“通过未见区域的模式无模态计数”(Counting Amodally for Patterns Through Unseen REgions, CAPTURe) 的新任务,旨在测试视觉-语言模型(Vision-Language Models, VLMs)在处理被遮挡模式计数方面的表现。CAPTURe 的关键在于要求模型不仅能够识别视觉模式,还需要通过推断遮挡物后方的模式延续来完成计数任务,从而评估模型的空间理解能力和世界模型构建能力。实验结果表明,尽管现有强大的 VLMs 在未遮挡模式下表现尚可,但在面对遮挡时普遍表现较差,这揭示了当前 VLMs 在推断未见空间关系上的不足。论文进一步发现,提供遮挡物体位置的辅助信息可以显著提高模型性能,表明模型错误源于对遮挡的处理能力不足以及图像中计数的难度。

链接: https://arxiv.org/abs/2504.15485
作者: Atin Pothiraj,Elias Stengel-Eskin,Jaemin Cho,Mohit Bansal
机构: UNC Chapel Hill (北卡罗来纳大学教堂山分校)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Code and data: this https URL

点击查看摘要

Abstract:Recognizing and reasoning about occluded (partially or fully hidden) objects is vital to understanding visual scenes, as occlusions frequently occur in real-world environments and act as obstacles for spatial comprehension. To test models’ ability to reason about multiple occluded objects, we introduce a novel task, Counting Amodally for Patterns Through Unseen REgions (CAPTURe), which requires a model to count objects arranged in a pattern by inferring how the pattern continues behind an occluder (an object which blocks parts of the scene). CAPTURe requires both recognizing visual patterns and reasoning, making it a useful testbed for evaluating vision-language models (VLMs) on whether they understand occluded patterns and possess spatial understanding skills. By requiring models to reason about occluded objects, CAPTURe also tests VLMs’ ability to form world models that would allow them to fill in missing information. CAPTURe consists of two parts: (1) CAPTURe-real, with manually filtered images of real objects in patterns and (2) CAPTURe-synthetic, a controlled diagnostic with generated patterned images. We evaluate four strong VLMs (GPT-4o, Intern-VL2, Molmo, and Qwen2-VL) on CAPTURe, finding that models struggle to count on both occluded and unoccluded patterns. Crucially, we find that models perform worse with occlusion, suggesting that VLMs are also deficient in inferring unseen spatial relationships: even the strongest VLMs like GPT-4o fail to count with occlusion. In contrast, we find that humans achieve very little error on CAPTURe. We also find that providing auxiliary information of occluded object locations increases performance, underscoring that the model error comes both from an inability to handle occlusion as well as difficulty counting in images.
zh

[NLP-40] Speculative Sampling via Exponential Races

【速读】: 该论文旨在解决通过推测解码(Speculative Decoding)加速大型语言模型推理的问题。推测解码利用较小的草稿模型(draft model)来提升推理速度,但其加速效果的具体理论边界尚不明确。论文的关键创新在于揭示了推测解码与信道模拟(channel simulation)之间的意外联系,后者的目标是以尽可能少的比特数模拟噪声信道。基于这一联系,论文提供了推测解码加速能力的信息论分析,并推导出当草稿模型生成的标记数量 k 较大时,生成速度提升与 k 之间的显式关系,该关系为所有 k 值提供了上界。此外,论文提出了一种新的推测解码方法——指数竞赛推测解码(ERSD, Exponential Race Speculative Decoding),其性能达到了当前最先进的水平。

链接: https://arxiv.org/abs/2504.15475
作者: Szymon Kobus,Deniz Gündüz
机构: Imperial College London (帝国理工学院)
类目: Computation and Language (cs.CL); Information Theory (cs.IT)
备注:

点击查看摘要

Abstract:Speculative decoding accelerates large language model inference using a smaller draft model. In this paper, we establish a surprising connection between speculative decoding and channel simulation, which aims at simulating a noisy channel using as few bits as possible. This connection allows us to provide an information-theoretic analysis of the speed up that can be achieved by speculative decoding. Leveraging this link, we derive an explicit relation between generation speed-up and the number of tokens k generated by the draft model for large k, which serves as an upper bound for all k. We also propose a novel speculative decoding method via exponential races, ERSD, that matches state-of-the-art performance.
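作为背景,标准推测采样(speculative sampling)的单步接受-拒绝规则可以写成几行代码:以 min(1, p(x)/q(x)) 的概率接受草稿 token,拒绝时从残差分布 norm(max(p - q, 0)) 重采样。ERSD 基于指数竞赛的构造与此不同,下面仅给出这一通用基线的示意。

```python
# 示意:标准推测采样的单步接受-拒绝(通用基线,非 ERSD 本身)
import numpy as np

def speculative_step(p: np.ndarray, q: np.ndarray, rng: np.random.Generator):
    """p: 目标模型分布, q: 草稿模型分布(同一词表上的概率向量)。"""
    x = rng.choice(len(q), p=q)               # 草稿模型先采样
    if rng.random() < min(1.0, p[x] / q[x]):  # 以 min(1, p/q) 概率接受
        return x, True
    residual = np.maximum(p - q, 0.0)          # 拒绝:从残差分布重采样
    residual /= residual.sum()
    return rng.choice(len(p), p=residual), False

rng = np.random.default_rng(0)
p = np.array([0.7, 0.2, 0.1]); q = np.array([0.4, 0.4, 0.2])
print(speculative_step(p, q, rng))
```

这一接受-拒绝构造保证最终样本严格服从目标分布 p,加速来自草稿 token 被高频接受。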
zh

[NLP-41] Bigram Subnetworks: Mapping to Next Tokens in Transformer Language Models

【速读】: 该论文旨在研究 Transformer 语言模型中用于从当前词嵌入预测下一个词的核心机制,并试图识别出最小化形式的这种转换。论文的关键在于发现并验证“bigram 子网络”(bigram subnetworks)的存在及其重要性。这些子网络仅基于当前词进行朴素的下一个词预测,却在模型性能中发挥关键作用,即使其参数量仅占模型总参数的不到 0.2%。论文指出,这些 bigram 子网络集中在第一个 Transformer 多层感知机(MLP)层中,并且与经过最优剪枝训练的子网络有显著重叠。机制上,这些子网络通过在第一层引入强烈的激活变换,将当前词的表示重新调整以更好地匹配下一个词的预测,而非简单保留当前词的特征表示。论文的核心贡献在于证明了 bigram 子网络构成了一组必要且充分的最小参数集合,能够驱动从当前词到下一个词激活状态的转变,并为从最小电路出发研究语言模型的电路提供了基础。

链接: https://arxiv.org/abs/2504.15471
作者: Tyler A. Chang,Benjamin K. Bergen
机构: Department of Cognitive Science (认知科学系), University of California San Diego (加州大学圣地亚哥分校)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:In Transformer language models, activation vectors transform from current token embeddings to next token predictions as they pass through the model. To isolate a minimal form of this transformation, we identify language model subnetworks that make bigram predictions, naive next token predictions based only on the current token. We find that bigram subnetworks can be found in fully trained language models up to 1B parameters, and these subnetworks are critical for model performance even when they consist of less than 0.2% of model parameters. Bigram subnetworks are concentrated in the first Transformer MLP layer, and they overlap significantly with subnetworks trained to optimally prune a given model. Mechanistically, the bigram subnetworks often recreate a pattern from the full models where the first layer induces a sharp change that aligns activations with next token predictions rather than current token representations. Our results demonstrate that bigram subnetworks comprise a minimal subset of parameters that are both necessary and sufficient for basic next token predictions in language models, and they help drive the transformation from current to next token activations in the residual stream. These subnetworks can lay a foundation for studying language model circuits by building up from a minimal circuit rather than the traditional approach of ablating circuits from a full model.
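论文中的 bigram 子网络复现的是“只看当前词”的朴素下一词预测;这种 bigram 基线本身可以用计数直接得到,下面的玩具例子展示其含义(语料为假设数据)。

```python
# 示意:基于计数的 bigram 下一词预测(bigram 子网络所复现的朴素基线)
from collections import Counter, defaultdict

corpus = "the cat sat on the mat the cat ran".split()  # 玩具语料
counts = defaultdict(Counter)
for cur, nxt in zip(corpus, corpus[1:]):
    counts[cur][nxt] += 1

def bigram_predict(token: str) -> str:
    # 仅依据当前 token,输出条件频率最高的下一词
    return counts[token].most_common(1)[0][0]

print(bigram_predict("the"))  # -> "cat"
```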
zh

[NLP-42] Learning Adaptive Parallel Reasoning with Language Models

【速读】: 该论文旨在解决现有推理方法在处理复杂任务时的两大主要问题:串行链式推理方法生成过长输出导致延迟增加和上下文窗口耗尽,以及并行推理方法如自洽(self-consistency)中存在的协调不足、冗余计算及性能提升有限的问题。为了解决这些问题,论文提出了一种名为自适应并行推理(Adaptive Parallel Reasoning, APR)的新框架。APR的关键创新在于通过引入spawn()和join()操作实现了端到端的多线程推理,同时采用了一种端到端强化学习策略,优化父线程与子线程的协作以提高任务成功率,而无需依赖预定义的推理结构。实验结果表明,APR在保持相同上下文窗口的情况下提升了性能,在增加计算量时表现出更好的可扩展性,并在等效延迟下提高了准确性。

链接: https://arxiv.org/abs/2504.15466
作者: Jiayi Pan,Xiuyu Li,Long Lian,Charlie Snell,Yifei Zhou,Adam Yala,Trevor Darrell,Kurt Keutzer,Alane Suhr
机构: UC Berkeley (加州大学伯克利分校); UCSF (加州大学旧金山分校)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Code, model, and data are available at this https URL . The first three authors contributed equally to this work

点击查看摘要

Abstract:Scaling inference-time computation has substantially improved the reasoning capabilities of language models. However, existing methods have significant limitations: serialized chain-of-thought approaches generate overly long outputs, leading to increased latency and exhausted context windows, while parallel methods such as self-consistency suffer from insufficient coordination, resulting in redundant computations and limited performance gains. To address these shortcomings, we propose Adaptive Parallel Reasoning (APR), a novel reasoning framework that enables language models to orchestrate both serialized and parallel computations end-to-end. APR generalizes existing reasoning methods by enabling adaptive multi-threaded inference using spawn() and join() operations. A key innovation is our end-to-end reinforcement learning strategy, optimizing both parent and child inference threads to enhance task success rate without requiring predefined reasoning structures. Experiments on the Countdown reasoning task demonstrate significant benefits of APR: (1) higher performance within the same context window (83.4% vs. 60.0% at 4k context); (2) superior scalability with increased computation (80.1% vs. 66.6% at 20k total tokens); (3) improved accuracy at equivalent latency (75.2% vs. 57.3% at approximately 5,000ms). APR represents a step towards enabling language models to autonomously optimize their reasoning processes through adaptive allocation of computation.
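spawn()/join() 的语义可以用线程池直观表达:父线程为各子目标派生子推理线程,并在 join 处汇总其结论。下例中的 llm_reason 为假设的占位接口;真实的 APR 由强化学习端到端学习何时派生、派生什么,此处仅示意控制流。

```python
# 示意:spawn()/join() 式的并行推理骨架(接口为假设占位,非 APR 官方实现)
from concurrent.futures import ThreadPoolExecutor

def llm_reason(prompt: str) -> str:
    """占位:一次子线程推理调用,实际应为语言模型解码。"""
    raise NotImplementedError

def solve(problem: str, subgoals: list[str]) -> str:
    with ThreadPoolExecutor() as pool:
        # spawn():父线程为每个子目标派生一条子推理线程
        futures = [pool.submit(llm_reason, f"{problem}\n子目标: {g}") for g in subgoals]
        # join():等待全部子线程返回,并把结果并回父线程上下文
        partials = [f.result() for f in futures]
    return llm_reason(problem + "\n子结论:\n" + "\n".join(partials))
```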
zh

[NLP-43] Feeding LLM Annotations to BERT Classifiers at Your Own Risk

【速读】: 该论文旨在研究利用大型语言模型(LLM)生成的标签来微调较小的仅编码器文本分类模型时所面临的问题。论文通过实证分析揭示了在这一特定设置下,训练合成数据所带来的长期困扰(即“合成数据训练的诅咒”)的具体表现,包括准确性与F1分数的性能下降、训练过程的不稳定性以及过早的性能 plateau 等现象。论文的关键在于通过错误传播的视角解析这些现象,并提出若干缓解策略,如基于熵的过滤和集成技术。尽管这些启发式方法能在一定程度上减轻问题,但无法完全消除从LLM标注中传播非随机错误的风险,从而强调了在高风险文本分类任务中应用此工作流时需保持谨慎的重要性。

链接: https://arxiv.org/abs/2504.15432
作者: Yucheng Lu,Kazimier Smith
机构: New York University (纽约大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Using LLM-generated labels to fine-tune smaller encoder-only models for text classification has gained popularity in various settings. While this approach may be justified in simple and low-stakes applications, we conduct empirical analysis to demonstrate how the perennial curse of training on synthetic data manifests itself in this specific setup. Compared to models trained on gold labels, we observe not only the expected performance degradation in accuracy and F1 score, but also increased instability across training runs and premature performance plateaus. These findings cast doubts on the reliability of such approaches in real-world applications. We contextualize the observed phenomena through the lens of error propagation and offer several practical mitigation strategies, including entropy-based filtering and ensemble techniques. Although these heuristics offer partial relief, they do not fully resolve the inherent risks of propagating non-random errors from LLM annotations to smaller classifiers, underscoring the need for caution when applying this workflow in high-stakes text classification tasks.
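文中提到的基于熵的过滤可以按如下方式实现:对每条样本采样多次 LLM 标注,计算经验标签分布的熵,熵超过阈值(标注不稳定)的样本不进入小模型的训练集;阈值为假设值。

```python
# 示意:对 LLM 多次标注结果做熵过滤(阈值为假设)
import math
from collections import Counter

def label_entropy(labels: list[str]) -> float:
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def entropy_filter(samples, threshold: float = 0.8):
    """samples: [(text, [多次 LLM 标注]), ...];保留标注稳定的样本。"""
    kept = []
    for text, labels in samples:
        if label_entropy(labels) <= threshold:
            majority = Counter(labels).most_common(1)[0][0]
            kept.append((text, majority))  # 用多数票标签训练小分类器
    return kept

data = [("good movie", ["pos", "pos", "pos"]), ("meh", ["pos", "neg", "neg"])]
print(entropy_filter(data))  # 第二条样本熵约 0.92,被过滤
```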
zh

[NLP-44] Trillion 7B Technical Report

【速读】: 该论文试图解决如何在有限的多语言训练资源下构建高效且性能卓越的多语言大型语言模型(LLM)。解决方案的关键在于提出了一种新颖的跨语言文档注意力机制(Cross-lingual Document Attention, XLDA),该机制显著提升了从英语到目标语言(如韩语和日语)的知识迁移效率。结合优化的数据混合策略、语言特定的过滤方法以及定制化的分词器构建,Trillion-7B 在仅使用 10% 的 2T 训练令牌进行多语言数据训练的情况下,实现了卓越的多语言性能和出色的跨语言一致性。

链接: https://arxiv.org/abs/2504.15431
作者: Sungjun Han,Juyoung Suk,Suyeong An,Hyungguk Kim,Kyuseok Kim,Wonsuk Yang,Seungtaek Choi,Jamin Shin (Trillion Labs)
机构: Trillion Labs (Trillion 实验室)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Preview version

点击查看摘要

Abstract:We introduce Trillion-7B, the most token-efficient Korean-centric multilingual LLM available. Our novel Cross-lingual Document Attention (XLDA) mechanism enables highly efficient and effective knowledge transfer from English to target languages like Korean and Japanese. Combined with optimized data mixtures, language-specific filtering, and tailored tokenizer construction, Trillion-7B achieves competitive performance while dedicating only 10% of its 2T training tokens to multilingual data and requiring just 59.4K H100 GPU hours ($148K) for full training. Comprehensive evaluations across 27 benchmarks in four languages demonstrate Trillion-7B’s robust multilingual performance and exceptional cross-lingual consistency.
zh

[NLP-45] IV-Bench: A Benchmark for Image-Grounded Video Perception and Reasoning in Multimodal LLMs

【速读】: 该论文旨在解决现有多模态大型语言模型(MLLMs)评估框架主要关注图像推理或通用视频理解任务,而严重忽视图像上下文在视频理解中的重要作用的问题。为填补这一空白,论文提出IV-Bench,这是一个用于评估图像引导的视频感知与推理能力的第一个综合性基准数据集。IV-Bench包含967个视频及其对应的2,585个精心标注的图像-文本查询,涵盖13项任务(7项感知任务和6项推理任务)以及5个代表性类别。通过广泛评估当前最先进的开源(如InternVL2.5、Qwen2.5-VL)和闭源(如GPT-4o、Gemini2-Flash和Gemini2-Pro)MLLMs,发现这些模型在此类任务上的表现远低于预期,最高仅达到28.9%的准确率。进一步分析表明,影响模型在IV-Bench上性能的关键因素包括推理模式、帧数和分辨率。此外,通过简单的数据合成方法,证明了IV-Bench面临的挑战不仅限于训练过程中对齐数据格式。这些发现为未来研究提供了宝贵的见解。论文代码和数据已公开发布于指定网址。

链接: https://arxiv.org/abs/2504.15415
作者: David Ma,Yuanxing Zhang,Jincheng Ren,Jarvis Guo,Yifan Yao,Zhenlin Wei,Zhenzhu Yang,Zhongyuan Peng,Boyu Feng,Jun Ma,Xiao Gu,Zhoufutu Wen,King Zhu,Yancheng He,Meng Cao,Shiwen Ni,Jiaheng Liu,Wenhao Huang,Ge Zhang,Xiaojie Jin
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Existing evaluation frameworks for Multimodal Large Language Models (MLLMs) primarily focus on image reasoning or general video understanding tasks, largely overlooking the significant role of image context in video comprehension. To bridge this gap, we propose IV-Bench, the first comprehensive benchmark for evaluating Image-Grounded Video Perception and Reasoning. IV-Bench consists of 967 videos paired with 2,585 meticulously annotated image-text queries across 13 tasks (7 perception and 6 reasoning tasks) and 5 representative categories. Extensive evaluations of state-of-the-art open-source (e.g., InternVL2.5, Qwen2.5-VL) and closed-source (e.g., GPT-4o, Gemini2-Flash and Gemini2-Pro) MLLMs demonstrate that current models substantially underperform in image-grounded video Perception and Reasoning, merely achieving at most 28.9% accuracy. Further analysis reveals key factors influencing model performance on IV-Bench, including inference pattern, frame number, and resolution. Additionally, through a simple data synthesis approach, we demonstrate that the challenges of IV-Bench extend beyond merely aligning the data format in the training process. These findings collectively provide valuable insights for future research. Our codes and data are released in this https URL.
zh

[NLP-46] Tell Me What You Know About Sexism: Expert-LLM Interaction Strategies and Co-Created Definitions for Zero-Shot Sexism Detection NAACL2025

【速读】: 该论文旨在研究性别歧视(Sexism)领域专家与大型语言模型(Large Language Models, LLMs)之间的混合智能(Hybrid Intelligence)及其协作方式。论文通过一个四组件流程探讨如何有效利用LLMs在性别歧视研究中的潜力。关键在于通过专家与LLMs的交互,共同创建定义,并评估这些定义在性别歧视检测任务中的表现。结果显示,尽管专家独立撰写的定义通常性能较差,但某些专家通过与LLMs协作生成的混合定义显著提升了分类性能,尤其对于不熟悉LLMs使用的专家而言,这种协作尤为重要。

链接: https://arxiv.org/abs/2504.15392
作者: Myrthe Reuver,Indira Sen,Matteo Melis,Gabriella Lapesa
机构: 未知
类目: Computation and Language (cs.CL); Computers and Society (cs.CY)
备注: Accepted and published at Findings of NAACL 2025: cite published version whenever possible

点击查看摘要

Abstract:This paper investigates hybrid intelligence and collaboration between researchers of sexism and Large Language Models (LLMs), with a four-component pipeline. First, nine sexism researchers answer questions about their knowledge of sexism and of LLMs. They then participate in two interactive experiments involving an LLM (GPT3.5). The first experiment has experts assessing the model’s knowledge about sexism and suitability for use in research. The second experiment tasks them with creating three different definitions of sexism: an expert-written definition, an LLM-written one, and a co-created definition. Lastly, zero-shot classification experiments use the three definitions from each expert in a prompt template for sexism detection, evaluating GPT4o on 2,500 texts sampled from five sexism benchmarks. We then analyze the resulting 67,500 classification decisions. The LLM interactions lead to longer and more complex definitions of sexism. Expert-written definitions on average perform poorly compared to LLM-generated definitions. However, some experts do improve classification performance with their co-created definitions of sexism, also experts who are inexperienced in using LLMs.
zh

[NLP-47] Towards Understanding Camera Motions in Any Video

【速读】: 本文旨在解决视频中相机运动理解的评估与提升问题。为实现这一目标,论文构建了一个名为CameraBench的大规模数据集和基准,包含约3000段经过专家多阶段质量控制标注的多样化互联网视频。关键解决方案在于引入了一种由摄影师协作设计的相机运动基元分类法,并通过大规模人类研究量化了标注性能,发现领域专业知识和基于教程的培训可显著提高准确性。此外,通过CameraBench评估了运动重建(SfM)和视频-语言模型(VLM),发现SfM模型难以捕捉依赖场景内容的语义基元,而VLM则在需要精确轨迹估计的几何基元上表现不佳。最终,论文通过在CameraBench上微调生成式VLM,结合两者优势,展示了其在运动增强的字幕生成、视频问答及视频-文本检索等应用中的潜力。

链接: https://arxiv.org/abs/2504.15376
作者: Zhiqiu Lin,Siyuan Cen,Daniel Jiang,Jay Karhade,Hewei Wang,Chancharik Mitra,Tiffany Ling,Yuhan Huang,Sifan Liu,Mingyu Chen,Rushikesh Zawar,Xue Bai,Yilun Du,Chuang Gan,Deva Ramanan
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Multimedia (cs.MM)
备注: Project site: this https URL

点击查看摘要

Abstract:We introduce CameraBench, a large-scale dataset and benchmark designed to assess and improve camera motion understanding. CameraBench consists of ~3,000 diverse internet videos, annotated by experts through a rigorous multi-stage quality control process. One of our contributions is a taxonomy of camera motion primitives, designed in collaboration with cinematographers. We find, for example, that some motions like “follow” (or tracking) require understanding scene content like moving subjects. We conduct a large-scale human study to quantify human annotation performance, revealing that domain expertise and tutorial-based training can significantly enhance accuracy. For example, a novice may confuse zoom-in (a change of intrinsics) with translating forward (a change of extrinsics), but can be trained to differentiate the two. Using CameraBench, we evaluate Structure-from-Motion (SfM) and Video-Language Models (VLMs), finding that SfM models struggle to capture semantic primitives that depend on scene content, while VLMs struggle to capture geometric primitives that require precise estimation of trajectories. We then fine-tune a generative VLM on CameraBench to achieve the best of both worlds and showcase its applications, including motion-augmented captioning, video question answering, and video-text retrieval. We hope our taxonomy, benchmark, and tutorials will drive future efforts towards the ultimate goal of understanding camera motions in any video.
zh

[NLP-48] LongPerceptualThoughts: Distilling System-2 Reasoning for System-1 Perception

【速读】: 本文旨在解决在感知任务(如视觉推理)领域中,长链思维(long-chain-of-thoughts)对于系统-2推理的潜在价值未被充分探索的问题。传统观点认为感知任务主要依赖浅层的系统-1推理,而本文通过引入LongPerceptualThoughts数据集,尝试验证长链思维是否能在这些任务中带来性能提升。解决方案的关键在于提出了一种三阶段的数据合成框架:首先从密集图像描述中生成可验证的多选题;其次利用视觉语言模型(VLMs)提取简短的链式思维(CoTs);最后借助前沿推理模型扩展这些简短思维为详细的长链思维。这种创新方法有效克服了现有模型缺乏此类复杂推理行为以及感知任务难以构建可靠过程验证器的挑战,并在多个视觉任务基准测试中实现了显著性能提升(平均+3.4分),同时意外地在文本推理任务MMLU-Pro上也带来了+2分的改进。

链接: https://arxiv.org/abs/2504.15362
作者: Yuan-Hong Liao,Sven Elflein,Liu He,Laura Leal-Taixé,Yejin Choi,Sanja Fidler,David Acuna
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 24 pages, 10 figures, in submission. Project page: this https URL

点击查看摘要

Abstract:Recent reasoning models through test-time scaling have demonstrated that long chain-of-thoughts can unlock substantial performance boosts in hard reasoning tasks such as math and code. However, the benefit of such long thoughts for system-2 reasoning is relatively less explored in other domains such as perceptual tasks where shallower, system-1 reasoning seems sufficient. In this paper, we introduce LongPerceptualThoughts, a new synthetic dataset with 30K long-thought traces for perceptual tasks. The key challenges in synthesizing elaborate reasoning thoughts for perceptual tasks are that off-the-shelf models are not yet equipped with such thinking behavior and that it is not straightforward to build a reliable process verifier for perceptual tasks. Thus, we propose a novel three-stage data synthesis framework that first synthesizes verifiable multiple-choice questions from dense image descriptions, then extracts simple CoTs from VLMs for those verifiable problems, and finally expands those simple thoughts to elaborate long thoughts via frontier reasoning models. In controlled experiments with a strong instruction-tuned 7B model, we demonstrate notable improvements over existing visual reasoning data-generation methods. Our model, trained on the generated dataset, achieves an average +3.4 points improvement over 5 vision-centric benchmarks, including +11.8 points on V* Bench. Notably, despite being tuned for vision tasks, it also improves performance on the text reasoning benchmark, MMLU-Pro, by +2 points.
zh

[NLP-49] Exploring Compositional Generalization (in ReCOGS_pos) by Transformers using Restricted Access Sequence Processing (RASP)

【速读】: 该论文旨在解决生成式 AI (Generative AI) 模型在语义组合泛化(Compositional Generalization)任务上的挑战,特别是 Transformer 模型在 COGS 数据集某些结构泛化测试上的零准确率问题。论文的关键解决方案在于提出了一种基于 Weiss 等人提出的受限访问序列处理(Restricted Access Sequence Processing, RASP)语言的新型 Transformer 编码器-解码器模型。该模型通过扁平化模式匹配规则而非递归树状规则实现任务目标,仅使用词级标记与词性标注嵌入层,在每次编码器传递中应用 19 条兼容注意力头的模式匹配规则,并结合预训练短语逻辑处理逻辑,从而实现对 ReCOGS_pos 任务的系统性和组合性求解,最终在测试集上达到 100% 的语义精确匹配,并在绝大多数泛化测试集中达到 100% 的字符串精确匹配,仅在特定子任务 obj_pp_to_subj_pp 上取得了 92% 的准确率。

链接: https://arxiv.org/abs/2504.15349
作者: William Bruns
机构: 未知
类目: Computation and Language (cs.CL)
备注: 8 pages main text with 3 figures and 1 table; limitations page and references separate; 4 more figures, 1 image, and 1 more table in the appendices supplement the work. 29 pages of appendix content

点击查看摘要

Abstract:Humans understand new combinations of words encountered if they are combinations of words recognized from different contexts, an ability called Compositional Generalization. The COGS benchmark (Kim and Linzen, 2020) arXiv:2010.05465 reports 0% accuracy for Transformer models on some structural generalizations. We use (Weiss et al., 2021) arXiv:2106.06981’s Restricted Access Sequence Processing (RASP), a Transformer-equivalent programming language, to prove by construction that a Transformer encoder-decoder can perform the semantically equivalent ReCOGS_pos (Wu et al., 2024) arXiv:2303.13716 variant of COGS systematically and compositionally: Our RASP model attains 100% semantic exact match on the ReCOGS test set and 100% SEM on all generalization splits except obj_pp_to_subj_pp which gets 92%. Furthermore, our RASP model shows the ReCOGS_pos task does not require a hierarchical or tree-structured solution: we use word-level tokens with an “embedding” layer that tags with possible parts of speech, applying just once per encoder pass 19 attention-head compatible flat pattern-matching rules, shown using grammar coverage (Zeller et al., 2023) to be learnable from the training data, plus general prepositional phrase (pp) handling and sentential complement (cp) handling logic, and output the next logical form (LF) token (repeating until the LF is complete). The model does not apply recursive, tree-structured rules like ‘np_det pp np -> np_pp -> np’, but scores 100% semantic and string exact match on pp recursion, cp recursion using the decoder loop.
zh

[NLP-50] Med-CoDE: Medical Critique based Disagreement Evaluation Framework NAACL

【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在医疗领域应用中可靠性与准确性不足的问题,以及现有评估方法无法全面衡量其性能导致临床潜在风险的挑战。论文的关键解决方案是提出Med-CoDE,这是一种专为医学LLMs设计的评估框架。Med-CoDE采用基于批评的方法,定量衡量模型生成响应与既定医学事实之间的分歧程度,从而同时捕捉医学场景中的准确性和可靠性。通过系统化的方法,该框架旨在填补当前LLM评估中的空白,提供高质量和可信度的评价,以实现医学LLMs的全面且可靠的评估。

链接: https://arxiv.org/abs/2504.15330
作者: Mohit Gupta,Akiko Aizawa,Rajiv Ratn Shah
机构: Indraprastha Institute of Information Technology Delhi (德里英迪拉普拉斯特信息技术学院), National Institute of Informatics (日本国家信息学研究所)
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 8 pages, 4 figures, NAACL SRW 2025

点击查看摘要

Abstract:The emergence of large language models (LLMs) has significantly influenced numerous fields, including healthcare, by enhancing the capabilities of automated systems to process and generate human-like text. However, despite their advancements, the reliability and accuracy of LLMs in medical contexts remain critical concerns. Current evaluation methods often lack robustness and fail to provide a comprehensive assessment of LLM performance, leading to potential risks in clinical settings. In this work, we propose Med-CoDE, a specifically designed evaluation framework for medical LLMs to address these challenges. The framework leverages a critique-based approach to quantitatively measure the degree of disagreement between model-generated responses and established medical ground truths. This framework captures both accuracy and reliability in medical settings. The proposed evaluation framework aims to fill the existing gap in LLM assessment by offering a systematic method to evaluate the quality and trustworthiness of medical LLMs. Through extensive experiments and case studies, we illustrate the practicality of our framework in providing a comprehensive and reliable evaluation of medical LLMs.
zh

[NLP-51] Evidence of conceptual mastery in the application of rules by Large Language Models

【速读】: 本文旨在探究大型语言模型(LLMs)在应用规则时是否具备人类水平的概念性掌握。为解决这一问题,研究引入了一种新颖的方法,通过匹配LLMs生成的思想多样性与人类样本中的多样性来评估其能力,并设计了两项实验对比人类与LLMs基于规则的决策过程。关键在于利用心理学术语和实验设计验证LLMs是否能够重现人类在规则应用上的模式,包括不同情境下的行为一致性及时间压力下的规则依赖性差异。实验结果显示,某些模型(如Gemini Pro和Claude 3)表现出类似人类的行为特征,而其他模型则不然,这表明部分LLMs确实掌握了规则概念,这对法律决策制定和哲学探讨均具有重要意义。

链接: https://arxiv.org/abs/2503.00992
作者: José Luiz Nunes,Guilherme FCF Almeida,Brian Flanagan
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:In this paper we leverage psychological methods to investigate LLMs’ conceptual mastery in applying rules. We introduce a novel procedure to match the diversity of thought generated by LLMs to that observed in a human sample. We then conducted two experiments comparing rule-based decision-making in humans and LLMs. Study 1 found that all investigated LLMs replicated human patterns regardless of whether they are prompted with scenarios created before or after their training cut-off. Moreover, we found unanticipated differences between the two sets of scenarios among humans. Surprisingly, even these differences were replicated in LLM responses. Study 2 turned to a contextual feature of human rule application: under forced time delay, human samples rely more heavily on a rule’s text than on other considerations such as a rule’s purpose… Our results revealed that some models (Gemini Pro and Claude 3) responded in a human-like manner to a prompt describing either forced delay or time pressure, while others (GPT-4o and Llama 3.2 90b) did not. We argue that the evidence gathered suggests that LLMs have mastery over the concept of rule, with implications for both legal decision making and philosophical inquiry.
zh

[NLP-52] How Private is Your Attention? Bridging Privacy with In-Context Learning

【Quick Read】: This paper examines the feasibility of in-context learning (ICL) under formal privacy constraints. The key to the solution is a differentially private pretraining algorithm for linear attention heads, together with the first theoretical analysis of the privacy-accuracy trade-off for ICL in linear regression. The analysis characterizes the fundamental tension between optimization and privacy-induced noise, and shows that the proposed method is robust to adversarial perturbations of training prompts, a property that standard ridge regression lacks. All theoretical findings are supported by extensive simulations across diverse settings.

Link: https://arxiv.org/abs/2504.16000
Authors: Soham Bonnerjee, Zhen Wei (Kingsley) Yeon, Anna Asch, Sagnik Nandy, Promit Ghosal
Institutions: University of Chicago
Subjects: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:In-context learning (ICL)-the ability of transformer-based models to perform new tasks from examples provided at inference time-has emerged as a hallmark of modern language models. While recent works have investigated the mechanisms underlying ICL, its feasibility under formal privacy constraints remains largely unexplored. In this paper, we propose a differentially private pretraining algorithm for linear attention heads and present the first theoretical analysis of the privacy-accuracy trade-off for ICL in linear regression. Our results characterize the fundamental tension between optimization and privacy-induced noise, formally capturing behaviors observed in private training via iterative methods. Additionally, we show that our method is robust to adversarial perturbations of training prompts, unlike standard ridge regression. All theoretical findings are supported by extensive simulations across diverse settings.
zh
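To make the optimization-versus-noise tension concrete, here is a minimal sketch (not the paper's algorithm) of differentially private gradient descent on plain linear regression: per-example gradients are clipped and Gaussian noise is added before each update. The clip norm, noise multiplier, and learning rate are illustrative choices.

```python
# Minimal DP gradient descent sketch: more noise (larger sigma) buys more
# privacy but degrades the recovered parameters.
import numpy as np

rng = np.random.default_rng(0)
n, d = 512, 8
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = X @ w_true + 0.1 * rng.normal(size=n)

def dp_sgd(X, y, steps=200, lr=0.1, clip=1.0, sigma=0.5):
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        residuals = X @ w - y
        grads = residuals[:, None] * X                   # per-example gradients, (n, d)
        norms = np.linalg.norm(grads, axis=1, keepdims=True)
        grads = grads / np.maximum(1.0, norms / clip)    # clip each gradient to norm <= clip
        noise = sigma * clip * rng.normal(size=X.shape[1])
        w -= lr * (grads.sum(axis=0) + noise) / len(X)   # noisy mean gradient step
    return w

for sigma in (0.0, 0.5, 2.0):
    w_hat = dp_sgd(X, y, sigma=sigma)
    print(f"sigma={sigma}: parameter error = {np.linalg.norm(w_hat - w_true):.3f}")
```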

[NLP-53] Real-Time Sentiment Insights from X Using VADER DistilBERT and Web-Scraped Data

【Quick Read】: This paper tackles the problem of efficiently and accurately monitoring public sentiment toward major corporations in the social media era, providing decision support for investors, policymakers, and researchers. The key to the solution is a comprehensive sentiment analysis system that combines natural language processing (NLP) and machine learning. The system integrates a rule-based model (VADER) with a transformer-based deep learning model (DistilBERT) to interpret sentiment in multi-platform social media data in real time. Its core is a pipeline of preprocessing, noise removal, text normalization, and ensemble-based sentiment classification, keeping the results both interpretable and contextually accurate. Finally, sentiment distribution plots, comparative analyses, and temporal trend visualizations give stakeholders intuitive, actionable input for strategic decisions.

Link: https://arxiv.org/abs/2504.15448
Authors: Yanampally Abhiram Reddy, Siddhi Agarwal, Vikram Parashar, Arshiya Arora
Institutions: Unknown
Subjects: General Economics (econ.GN); Computation and Language (cs.CL)
Comments: 19 pages, 2 figures

Click to view abstract

Abstract:In the age of social media, understanding public sentiment toward major corporations is crucial for investors, policymakers, and researchers. This paper presents a comprehensive sentiment analysis system tailored for corporate reputation monitoring, combining Natural Language Processing (NLP) and machine learning techniques to accurately interpret public opinion in real time. The methodology integrates a hybrid sentiment detection framework leveraging both rule-based models (VADER) and transformer-based deep learning models (DistilBERT), applied to social media data from multiple platforms. The system begins with robust preprocessing involving noise removal and text normalization, followed by sentiment classification using an ensemble approach to ensure both interpretability and contextual accuracy. Results are visualized through sentiment distribution plots, comparative analyses, and temporal sentiment trends for enhanced interpretability. Our analysis reveals significant disparities in public sentiment across major corporations, with companies like Amazon (81.2) and Samsung (45.8) receiving excellent sentiment scores, while Microsoft (21.7) and Walmart (21.9) exhibit poor sentiment profiles. These findings demonstrate the utility of our multi-source sentiment framework in providing actionable insights regarding corporate public perception, enabling stakeholders to make informed strategic decisions based on comprehensive sentiment analysis.
zh
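A minimal sketch of the hybrid rule-based plus transformer idea, assuming nltk and transformers are installed; the 0.5/0.5 ensemble weights are illustrative, not the paper's configuration.

```python
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from transformers import pipeline
import nltk

nltk.download("vader_lexicon", quiet=True)
vader = SentimentIntensityAnalyzer()
distilbert = pipeline(
    "sentiment-analysis",
    model="distilbert-base-uncased-finetuned-sst-2-english",
)

def hybrid_score(text: str) -> float:
    """Return a sentiment score in [-1, 1] averaging both models."""
    rule = vader.polarity_scores(text)["compound"]          # already in [-1, 1]
    out = distilbert(text)[0]
    neural = out["score"] if out["label"] == "POSITIVE" else -out["score"]
    return 0.5 * rule + 0.5 * neural

print(hybrid_score("The new product launch exceeded expectations."))
```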

Computer Vision

[CV-0] MMInference: Accelerating Pre-filling for Long-Context VLMs via Modality-Aware Permutation Sparse Attention

【Quick Read】: This paper addresses the computational bottleneck in the pre-filling stage of long-context vision language models (VLMs), where quadratic attention complexity limits real-world deployment. The key to the solution is MMInference, a dynamic sparse attention method that accelerates pre-filling for long-context multi-modal inputs. By analyzing the temporal and spatial locality of video input, MMInference identifies a distinctive "Grid" sparse pattern and introduces a permutation-based method to exploit it while handling modality-boundary issues. It searches offline for the optimal sparse pattern of each attention head, then constructs the sparse distribution dynamically based on the input, and provides optimized GPU kernels for efficient sparse computation. The method integrates seamlessly into existing VLM pipelines without any model modification or fine-tuning. Experiments show that MMInference accelerates the pre-filling stage by up to 8.3x at 1M tokens while maintaining accuracy.

Link: https://arxiv.org/abs/2504.16083
Authors: Yucheng Li, Huiqiang Jiang, Chengruidong Zhang, Qianhui Wu, Xufang Luo, Surin Ahn, Amir H. Abdi, Dongsheng Li, Jianfeng Gao, Yuqing Yang, Lili Qiu
Institutions: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:The integration of long-context capabilities with visual understanding unlocks unprecedented potential for Vision Language Models (VLMs). However, the quadratic attention complexity during the pre-filling phase remains a significant obstacle to real-world deployment. To overcome this limitation, we introduce MMInference (Multimodality Million tokens Inference), a dynamic sparse attention method that accelerates the prefilling stage for long-context multi-modal inputs. First, our analysis reveals that the temporal and spatial locality of video input leads to a unique sparse pattern, the Grid pattern. Simultaneously, VLMs exhibit markedly different sparse distributions across different modalities. We introduce a permutation-based method to leverage the unique Grid pattern and handle modality boundary issues. By offline search the optimal sparse patterns for each head, MMInference constructs the sparse distribution dynamically based on the input. We also provide optimized GPU kernels for efficient sparse computations. Notably, MMInference integrates seamlessly into existing VLM pipelines without any model modifications or fine-tuning. Experiments on multi-modal benchmarks-including Video QA, Captioning, VisionNIAH, and Mixed-Modality NIAH-with state-of-the-art long-context VLMs (LongVila, LlavaVideo, VideoChat-Flash, Qwen2.5-VL) show that MMInference accelerates the pre-filling stage by up to 8.3x at 1M tokens while maintaining accuracy. Our code is available at this https URL.
zh
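A minimal sketch of grid-patterned sparse attention, assuming the "Grid" pattern amounts to attending only at a fixed stride plus a local window; the real MMInference pattern search and custom kernels are far more involved.

```python
import torch
import torch.nn.functional as F

def grid_sparse_attention(q, k, v, stride=4, window=2):
    n = q.shape[-2]
    idx = torch.arange(n)
    grid = (idx[None, :] % stride == 0)                  # strided "grid" columns
    local = (idx[:, None] - idx[None, :]).abs() <= window  # local window
    mask = grid | local                                  # (n, n) boolean
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
    scores = scores.masked_fill(~mask, float("-inf"))    # drop non-grid, non-local pairs
    return F.softmax(scores, dim=-1) @ v

q = k = v = torch.randn(1, 8, 64, 32)   # (batch, heads, tokens, dim)
print(grid_sparse_attention(q, k, v).shape)  # torch.Size([1, 8, 64, 32])
```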

[CV-1] MR. Video: “MapReduce” is the Principle for Long Video Understanding

【Quick Read】: This paper targets two main challenges in long video understanding: existing vision-language models (VLMs) are limited by context length and struggle with detailed short-clip perception, while existing video agents typically rely on sequential key-segment selection, which is complex and scales poorly. The key is the MR. Video framework, which applies the MapReduce principle to decompose long-video processing into a "Map" stage and a "Reduce" stage: Map independently and densely perceives short clips in parallel, and Reduce jointly aggregates information from all clips for more comprehensive context integration and reasoning. This design improves both accuracy and scalability, achieving over a 10% accuracy gain on the challenging LVBench benchmark.

Link: https://arxiv.org/abs/2504.16082
Authors: Ziqi Pang, Yu-Xiong Wang
Institutions: University of Illinois Urbana-Champaign
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Preprint

Click to view abstract

Abstract:We propose MR. Video, an agentic long video understanding framework that demonstrates the simple yet effective MapReduce principle for processing long videos: (1) Map: independently and densely perceiving short video clips, and (2) Reduce: jointly aggregating information from all clips. Compared with sequence-to-sequence vision-language models (VLMs), MR. Video performs detailed short video perception without being limited by context length. Compared with existing video agents that typically rely on sequential key segment selection, the Map operation enables simpler and more scalable sequence parallel perception of short video segments. Its Reduce step allows for more comprehensive context aggregation and reasoning, surpassing explicit key segment retrieval. This MapReduce principle is applicable to both VLMs and video agents, and we use LLM agents to validate its effectiveness. In practice, MR. Video employs two MapReduce stages: (A) Captioning: generating captions for short video clips (map), then standardizing repeated characters and objects into shared names (reduce); (B) Analysis: for each user question, analyzing relevant information from individual short videos (map), and integrating them into a final answer (reduce). MR. Video achieves over 10% accuracy improvement on the challenging LVBench compared to state-of-the-art VLMs and video agents. Code is available at: this https URL
zh
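A minimal sketch of the two-stage Map-Reduce pattern described above. `call_llm` is a hypothetical stand-in for an LLM API; the point being illustrated is the structure of parallel per-clip captioning followed by joint aggregation.

```python
from concurrent.futures import ThreadPoolExecutor

def call_llm(prompt: str) -> str:
    return f"<LLM answer to: {prompt[:40]}...>"   # placeholder, not a real API

def map_stage(clips: list[str]) -> list[str]:
    # Map: independently and densely perceive each short clip, in parallel
    with ThreadPoolExecutor() as pool:
        return list(pool.map(lambda c: call_llm(f"Caption this clip: {c}"), clips))

def reduce_stage(captions: list[str], question: str) -> str:
    # Reduce: jointly aggregate all per-clip information to answer the question
    context = "\n".join(captions)
    return call_llm(f"Given these clip captions:\n{context}\nAnswer: {question}")

clips = [f"clip_{i:03d}.mp4" for i in range(6)]
print(reduce_stage(map_stage(clips), "What happens at the end of the video?"))
```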

[CV-2] From Reflection to Perfection: Scaling Inference-Time Optimization for Text-to-Image Diffusion Models via Reflection Tuning

【速读】:该论文旨在解决现有文本到图像扩散模型在处理复杂场景和精细细节时表现不佳的问题。为了解决这一挑战,论文提出了一种名为ReflectionFlow的推理阶段框架,它通过迭代反思和优化输出来提升生成质量。ReflectionFlow的关键创新在于引入了三个互补的推理尺度:(1) 噪声级别缩放用于优化潜在初始化;(2) 提示级别缩放以实现精确的语义引导;特别是(3) 反思级别缩放,通过提供可操作的反思来评估和修正先前的生成结果。为了支持反思级别的缩放,研究构建了一个包含100万个三元组的大规模数据集GenRef,每个三元组包含一个反思、一个有缺陷的图像和一个增强后的图像。利用该数据集,通过在一个统一框架内联合建模多模态输入,论文对最先进的扩散变换器FLUX.1-dev进行了高效的反思调优。实验结果显示,ReflectionFlow显著优于朴素的噪声级别缩放方法,在具有挑战性的任务中提供了可扩展且计算高效的高质量图像合成解决方案。

链接: https://arxiv.org/abs/2504.16080
作者: Le Zhuo,Liangbing Zhao,Sayak Paul,Yue Liao,Renrui Zhang,Yi Xin,Peng Gao,Mohamed Elhoseiny,Hongsheng Li
机构: CUHK MMLab (香港中文大学多媒体实验室); KAUST (阿卜杜拉国王科技大学); Hugging Face; Shanghai AI Lab (上海人工智能实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: All code, checkpoints, and datasets are available at \url{ this https URL }

点击查看摘要

Abstract:Recent text-to-image diffusion models achieve impressive visual quality through extensive scaling of training data and model parameters, yet they often struggle with complex scenes and fine-grained details. Inspired by the self-reflection capabilities emergent in large language models, we propose ReflectionFlow, an inference-time framework enabling diffusion models to iteratively reflect upon and refine their outputs. ReflectionFlow introduces three complementary inference-time scaling axes: (1) noise-level scaling to optimize latent initialization; (2) prompt-level scaling for precise semantic guidance; and most notably, (3) reflection-level scaling, which explicitly provides actionable reflections to iteratively assess and correct previous generations. To facilitate reflection-level scaling, we construct GenRef, a large-scale dataset comprising 1 million triplets, each containing a reflection, a flawed image, and an enhanced image. Leveraging this dataset, we efficiently perform reflection tuning on state-of-the-art diffusion transformer, FLUX.1-dev, by jointly modeling multimodal inputs within a unified framework. Experimental results show that ReflectionFlow significantly outperforms naive noise-level scaling methods, offering a scalable and compute-efficient solution toward higher-quality image synthesis on challenging tasks.
zh

[CV-3] Describe Anything: Detailed Localized Image and Video Captioning

【Quick Read】: This paper tackles the fundamental challenge of generating detailed and accurate descriptions for specific regions in images and videos. The proposed Describe Anything Model (DAM) focuses on detailed localized captioning (DLC). DAM's key innovations are a focal prompt, which ensures high-resolution encoding of the target region, and a localized vision backbone, which integrates precise localization with broader context. To address the scarcity of high-quality DLC data, the paper proposes a semi-supervised learning (SSL)-based data pipeline (DLC-SDP) that starts from existing segmentation datasets and expands to unlabeled web images. It also introduces DLC-Bench, a benchmark for evaluating DLC without relying on reference captions. DAM sets new state-of-the-art results on 7 benchmarks spanning keyword-level, phrase-level, and detailed multi-sentence localized image and video captioning.

Link: https://arxiv.org/abs/2504.16072
Authors: Long Lian, Yifan Ding, Yunhao Ge, Sifei Liu, Hanzi Mao, Boyi Li, Marco Pavone, Ming-Yu Liu, Trevor Darrell, Adam Yala, Yin Cui
Institutions: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Project page: this https URL

Click to view abstract

Abstract:Generating detailed and accurate descriptions for specific regions in images and videos remains a fundamental challenge for vision-language models. We introduce the Describe Anything Model (DAM), a model designed for detailed localized captioning (DLC). DAM preserves both local details and global context through two key innovations: a focal prompt, which ensures high-resolution encoding of targeted regions, and a localized vision backbone, which integrates precise localization with its broader context. To tackle the scarcity of high-quality DLC data, we propose a Semi-supervised learning (SSL)-based Data Pipeline (DLC-SDP). DLC-SDP starts with existing segmentation datasets and expands to unlabeled web images using SSL. We introduce DLC-Bench, a benchmark designed to evaluate DLC without relying on reference captions. DAM sets new state-of-the-art on 7 benchmarks spanning keyword-level, phrase-level, and detailed multi-sentence localized image and video captioning.
zh

[CV-4] Boosting Generative Image Modeling via Joint Image-Feature Synthesis

【Quick Read】: This paper addresses the challenge of seamlessly integrating representation learning with generative modeling, a gap that remains in current high-quality image generation based on latent diffusion models (LDMs). It proposes a generative image modeling framework that bridges this gap by using a diffusion model to jointly model low-level image latents (from a variational autoencoder, VAE) and high-level semantic features (from a pretrained self-supervised encoder such as DINO). The key innovation is a latent-semantic diffusion approach that learns to generate coherent image-feature pairs from pure noise, significantly improving both generative quality and training efficiency while requiring only minimal modifications to standard Diffusion Transformer architectures. By eliminating complex distillation objectives, the unified design simplifies training and unlocks a powerful new inference strategy, Representation Guidance, which uses the learned semantics to steer and refine image generation. Evaluated in both conditional and unconditional settings, the method delivers substantial gains in image quality and training convergence speed, establishing a new direction for representation-aware generative modeling.

Link: https://arxiv.org/abs/2504.16064
Authors: Theodoros Kouzelis, Efstathios Karypidis, Ioannis Kakogeorgiou, Spyros Gidaris, Nikos Komodakis
Institutions: Archimedes, Athena RC; National Technical University of Athens; IIT, NCSR "Demokritos"; valeo.ai; University of Crete; IACM-Forth
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:Latent diffusion models (LDMs) dominate high-quality image generation, yet integrating representation learning with generative modeling remains a challenge. We introduce a novel generative image modeling framework that seamlessly bridges this gap by leveraging a diffusion model to jointly model low-level image latents (from a variational autoencoder) and high-level semantic features (from a pretrained self-supervised encoder like DINO). Our latent-semantic diffusion approach learns to generate coherent image-feature pairs from pure noise, significantly enhancing both generative quality and training efficiency, all while requiring only minimal modifications to standard Diffusion Transformer architectures. By eliminating the need for complex distillation objectives, our unified design simplifies training and unlocks a powerful new inference strategy: Representation Guidance, which leverages learned semantics to steer and refine image generation. Evaluated in both conditional and unconditional settings, our method delivers substantial improvements in image quality and training convergence speed, establishing a new direction for representation-aware generative modeling.
zh

[CV-5] ForesightNav: Learning Scene Imagination for Efficient Exploration

【Quick Read】: This paper studies how autonomous robots can use prior knowledge to make efficient exploration decisions in unseen environments. The key is ForesightNav, a novel exploration strategy inspired by human imagination and reasoning. The method equips the robot with an imagination module that predicts contextual information, such as occupancy and semantic details, for unexplored regions, enabling it to efficiently select meaningful long-term navigation goals and significantly enhancing exploration in unseen environments. Experiments on the Structured3D dataset validate the approach, showing accurate occupancy prediction, a 100% completion rate for PointNav, and an SPL of 67% for ObjectNav on the Structured3D validation split, demonstrating the potential of imagination-driven reasoning for generalizable and efficient exploration in autonomous systems.

Link: https://arxiv.org/abs/2504.16062
Authors: Hardik Shah, Jiaxu Xing, Nico Messikommer, Boyang Sun, Marc Pollefeys, Davide Scaramuzza
Institutions: ETH Zurich; University of Zurich; ETH Zurich; ETH Zurich; ETH Zurich, Microsoft; University of Zurich
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:Understanding how humans leverage prior knowledge to navigate unseen environments while making exploratory decisions is essential for developing autonomous robots with similar abilities. In this work, we propose ForesightNav, a novel exploration strategy inspired by human imagination and reasoning. Our approach equips robotic agents with the capability to predict contextual information, such as occupancy and semantic details, for unexplored regions. These predictions enable the robot to efficiently select meaningful long-term navigation goals, significantly enhancing exploration in unseen environments. We validate our imagination-based approach using the Structured3D dataset, demonstrating accurate occupancy prediction and superior performance in anticipating unseen scene geometry. Our experiments show that the imagination module improves exploration efficiency in unseen environments, achieving a 100% completion rate for PointNav and an SPL of 67% for ObjectNav on the Structured3D Validation split. These contributions demonstrate the power of imagination-driven reasoning for autonomous systems to enhance generalizable and efficient exploration.
zh

[CV-6] Vision language models are unreliable at trivial spatial cognition

【速读】:该论文试图解决视觉语言模型(Vision Language Models, VLMs)在处理空间关系理解任务中的可靠性问题,特别是评估其在简单空间认知任务(如判断物体是否位于另一物体左侧)中的表现一致性。论文的关键解决方案在于开发了一个名为TableTest的新基准数据集,该数据集包含桌面场景中物体排列的三维图像,并利用此数据集测试了最先进的VLMs。研究发现,这些模型的性能会因提示语的细微变化而显著下降,即使这些提示语在逻辑上等价。这表明VLMs在实际应用中推理空间关系的能力存在局限性,同时也揭示了通过增强图像描述语料库来提高训练和测试效率的新机会。

链接: https://arxiv.org/abs/2504.16061
作者: Sangeet Khemlani,Tyler Tran,Nathaniel Gyory,Anthony M. Harrison,Wallace E. Lawson,Ravenna Thielstrom,Hunter Thompson,Taaren Singh,J. Gregory Trafton
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Vision language models (VLMs) are designed to extract relevant visuospatial information from images. Some research suggests that VLMs can exhibit humanlike scene understanding, while other investigations reveal difficulties in their ability to process relational information. To achieve widespread applicability, VLMs must perform reliably, yielding comparable competence across a wide variety of related tasks. We sought to test how reliable these architectures are at engaging in trivial spatial cognition, e.g., recognizing whether one object is left of another in an uncluttered scene. We developed a benchmark dataset – TableTest – whose images depict 3D scenes of objects arranged on a table, and used it to evaluate state-of-the-art VLMs. Results show that performance could be degraded by minor variations of prompts that use logically equivalent descriptions. These analyses suggest limitations in how VLMs may reason about spatial relations in real-world applications. They also reveal novel opportunities for bolstering image caption corpora for more efficient training and testing.
zh

[CV-7] Evaluating Vision Language Models (VLMs) for Radiology: A Comprehensive Analysis

【Quick Read】: This paper evaluates three vision-language foundation models (RAD-DINO, CheXagent, and BiomedCLIP) on their ability to capture fine-grained imaging features in chest X-rays for radiology tasks, covering classification, segmentation, and regression for pneumothorax and cardiomegaly. Self-supervised RAD-DINO excels at segmentation, text-supervised CheXagent performs better at classification, and BiomedCLIP is inconsistent across tasks. A custom segmentation model that integrates global and local features substantially improves all foundation models, especially on the challenging pneumothorax segmentation task. The key insight is that pre-training methodology strongly influences downstream performance: models trained without text supervision do better at fine-grained segmentation, while text-supervised models offer advantages in classification and interpretability. These findings provide guidance for selecting foundation models based on specific clinical applications in radiology.

Link: https://arxiv.org/abs/2504.16047
Authors: Frank Li, Hari Trivedi, Bardia Khosravi, Theo Dapamede, Mohammadreza Chavoshi, Abdulhameed Dere, Rohan Satya Isaac, Aawez Mansuri, Janice Newsome, Saptarshi Purkayastha, Judy Gichoya
Institutions: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Foundation models, trained on vast amounts of data using self-supervised techniques, have emerged as a promising frontier for advancing artificial intelligence (AI) applications in medicine. This study evaluates three different vision-language foundation models (RAD-DINO, CheXagent, and BiomedCLIP) on their ability to capture fine-grained imaging features for radiology tasks. The models were assessed across classification, segmentation, and regression tasks for pneumothorax and cardiomegaly on chest radiographs. Self-supervised RAD-DINO consistently excelled in segmentation tasks, while text-supervised CheXagent demonstrated superior classification performance. BiomedCLIP showed inconsistent performance across tasks. A custom segmentation model that integrates global and local features substantially improved performance for all foundation models, particularly for challenging pneumothorax segmentation. The findings highlight that pre-training methodology significantly influences model performance on specific downstream tasks. For fine-grained segmentation tasks, models trained without text supervision performed better, while text-supervised models offered advantages in classification and interpretability. These insights provide guidance for selecting foundation models based on specific clinical applications in radiology.
zh

[CV-8] LiveCC: Learning Video LLM with Streaming Speech Transcription at Scale CVPR2025

【Quick Read】: This paper addresses the limitation that large-scale training of video large language models (Video LLMs) is constrained by costly human annotations or proprietary model APIs. The key is a novel streaming training approach that densely interleaves automatic speech recognition (ASR) transcripts with video frames according to their timestamps. This design naturally fits the streaming nature of ASR, enabling the model to learn temporally aligned, fine-grained vision-language modeling. The authors also build a data production pipeline for YouTube videos and their closed captions (i.e., ASR), yielding the Live-CC-5M dataset for pre-training and the Live-WhisperX-526K dataset for high-quality supervised fine-tuning (SFT). Even without SFT, the ASR-only pre-trained LiveCC-7B-Base model shows competitive general video QA performance and a new capability for real-time video commentary. The final LiveCC-7B-Instruct model surpasses much larger models in commentary quality even in real-time mode, and achieves state-of-the-art results at the 7B/8B scale on popular video QA benchmarks such as VideoMME and OVOBench, demonstrating the broad generalizability of the approach. All resources have been publicly released at the linked page.

Link: https://arxiv.org/abs/2504.16030
Authors: Joya Chen, Ziyun Zeng, Yiqi Lin, Wei Li, Zejun Ma, Mike Zheng Shou
Institutions: Show Lab, National University of Singapore; ByteDance
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: CVPR 2025. If any references are missing, please contact joyachen@u. this http URL

Click to view abstract

Abstract:Recent video large language models (Video LLMs) often depend on costly human annotations or proprietary model APIs (e.g., GPT-4o) to produce training data, which limits their training at scale. In this paper, we explore large-scale training for Video LLM with cheap automatic speech recognition (ASR) transcripts. Specifically, we propose a novel streaming training approach that densely interleaves the ASR words and video frames according to their timestamps. Compared to previous studies in vision-language representation with ASR, our method naturally fits the streaming characteristics of ASR, thus enabling the model to learn temporally-aligned, fine-grained vision-language modeling. To support the training algorithm, we introduce a data production pipeline to process YouTube videos and their closed captions (CC, same as ASR), resulting in Live-CC-5M dataset for pre-training and Live-WhisperX-526K dataset for high-quality supervised fine-tuning (SFT). Remarkably, even without SFT, the ASR-only pre-trained LiveCC-7B-Base model demonstrates competitive general video QA performance and exhibits a new capability in real-time video commentary. To evaluate this, we carefully design a new LiveSports-3K benchmark, using LLM-as-a-judge to measure the free-form commentary. Experiments show our final LiveCC-7B-Instruct model can surpass advanced 72B models (Qwen2.5-VL-72B-Instruct, LLaVA-Video-72B) in commentary quality even working in a real-time mode. Meanwhile, it achieves state-of-the-art results at the 7B/8B scale on popular video QA benchmarks such as VideoMME and OVOBench, demonstrating the broad generalizability of our approach. All resources of this paper have been released at this https URL.
zh
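A minimal sketch of timestamp-based interleaving of ASR words and video frames, as described above; the timestamps and tokens here are made up, while the real pipeline operates on Live-CC-5M at scale.

```python
import heapq

words = [("hello", 0.2), ("and", 0.9), ("welcome", 1.3)]      # (word, time in s)
frames = [("<frame_0>", 0.0), ("<frame_1>", 0.5), ("<frame_2>", 1.0)]

def interleave(words, frames):
    # merge the two time-sorted streams into one training sequence
    merged = heapq.merge(
        ((t, tok) for tok, t in frames),
        ((t, tok) for tok, t in words),
    )
    return [tok for _, tok in merged]

print(interleave(words, frames))
# ['<frame_0>', 'hello', '<frame_1>', 'and', '<frame_2>', 'welcome']
```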

[CV-9] PointLoRA: Low-Rank Adaptation with Token Selection for Point Cloud Learning CVPR2025

【Quick Read】: This paper addresses the heavy computational and storage cost of fully fine-tuning large pre-trained point cloud models for downstream tasks. Most current parameter-efficient fine-tuning (PEFT) methods rely on complex adapter and prompt mechanisms that increase the number of tunable parameters. The key is PointLoRA, which combines low-rank adaptation (LoRA) with multi-scale token selection. PointLoRA embeds LoRA layers in the most parameter-intensive components of point cloud transformers, reducing tunable parameters while enhancing global feature capture; multi-scale token selection extracts critical local information as prompts for downstream fine-tuning, effectively complementing the global context captured by LoRA. Experiments across several pre-trained models and three challenging public datasets show that the method achieves competitive performance with only 3.43% of the trainable parameters, making it well suited to resource-constrained applications.

Link: https://arxiv.org/abs/2504.16023
Authors: Song Wang, Xiaolu Liu, Lingdong Kong, Jianyun Xu, Chunyong Hu, Gongfan Fang, Wentong Li, Jianke Zhu, Xinchao Wang
Institutions: ZJU (Zhejiang University); NUS (National University of Singapore); AD Lab, CaiNiao, Alibaba; NUAA (Nanjing University of Aeronautics and Astronautics)
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by CVPR2025

Click to view abstract

Abstract:Self-supervised representation learning for point cloud has demonstrated effectiveness in improving pre-trained model performance across diverse tasks. However, as pre-trained models grow in complexity, fully fine-tuning them for downstream applications demands substantial computational and storage resources. Parameter-efficient fine-tuning (PEFT) methods offer a promising solution to mitigate these resource requirements, yet most current approaches rely on complex adapter and prompt mechanisms that increase tunable parameters. In this paper, we propose PointLoRA, a simple yet effective method that combines low-rank adaptation (LoRA) with multi-scale token selection to efficiently fine-tune point cloud models. Our approach embeds LoRA layers within the most parameter-intensive components of point cloud transformers, reducing the need for tunable parameters while enhancing global feature capture. Additionally, multi-scale token selection extracts critical local information to serve as prompts for downstream fine-tuning, effectively complementing the global context captured by LoRA. The experimental results across various pre-trained models and three challenging public datasets demonstrate that our approach achieves competitive performance with only 3.43% of the trainable parameters, making it highly effective for resource-constrained applications. Source code is available at: this https URL.
zh
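A minimal sketch of a LoRA-augmented linear layer of the kind embedded in the point cloud transformer; the rank and scaling are illustrative, and the multi-scale token selection part of PointLoRA is not shown.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, in_dim: int, out_dim: int, rank: int = 4, alpha: float = 8.0):
        super().__init__()
        self.base = nn.Linear(in_dim, out_dim)
        self.base.weight.requires_grad_(False)              # frozen pre-trained weight
        self.base.bias.requires_grad_(False)
        self.A = nn.Parameter(torch.randn(rank, in_dim) * 0.01)
        self.B = nn.Parameter(torch.zeros(out_dim, rank))   # zero-init: no-op at start
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(256, 256)
tokens = torch.randn(2, 128, 256)            # (batch, tokens, dim)
print(layer(tokens).shape)                   # torch.Size([2, 128, 256])
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(f"trainable params: {trainable}")      # only the low-rank factors A and B
```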

[CV-10] Efficient Temporal Consistency in Diffusion-Based Video Editing with Adaptor Modules: A Theoretical Framework

【Quick Read】: This paper aims to give DDIM-based diffusion models a theoretical account of frame-to-frame consistency in video editing tasks. The key is an adapter-based theoretical framework under a temporal consistency loss. The solution has three parts: first, proving that the temporal consistency objective is differentiable under bounded feature norms and establishing a Lipschitz bound on its gradient; second, showing that gradient descent monotonically decreases the loss and converges to a local minimum for learning rates in an appropriate range; and third, analyzing the stability of adapter modules in the DDIM inversion procedure, showing that the associated error remains controlled. These theoretical results reinforce the reliability of adapter-based diffusion video editing methods and provide theoretical insights for video generation tasks.

Link: https://arxiv.org/abs/2504.16016
Authors: Xinyuan Song, Yangfan He, Sida Li, Jianhui Wang, Hongyang He, Xinhang Yuan, Ruoyu Wang, Jiaqi Chen, Keqin Li, Kuan Lu, Menghao Huo, Binxu Li, Pei Liu
Institutions: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: arXiv admin note: substantial text overlap with arXiv:2501.04606

Click to view abstract

Abstract:Adapter-based methods are commonly used to enhance model performance with minimal additional complexity, especially in video editing tasks that require frame-to-frame consistency. By inserting small, learnable modules into pretrained diffusion models, these adapters can maintain temporal coherence without extensive retraining. Approaches that incorporate prompt learning with both shared and frame-specific tokens are particularly effective in preserving continuity across frames at low training cost. In this work, we want to provide a general theoretical framework for adapters that maintain frame consistency in DDIM-based models under a temporal consistency loss. First, we prove that the temporal consistency objective is differentiable under bounded feature norms, and we establish a Lipschitz bound on its gradient. Second, we show that gradient descent on this objective decreases the loss monotonically and converges to a local minimum if the learning rate is within an appropriate range. Finally, we analyze the stability of modules in the DDIM inversion procedure, showing that the associated error remains controlled. These theoretical findings will reinforce the reliability of diffusion-based video editing methods that rely on adapter strategies and provide theoretical insights in video generation tasks.
zh
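A minimal sketch of a temporal consistency loss over per-frame features, in the spirit of the objective discussed above: penalize differences between adjacent frames. The linear layer here is a stand-in for a learnable adapter module.

```python
import torch
import torch.nn as nn

def temporal_consistency_loss(feats: torch.Tensor) -> torch.Tensor:
    # feats: (num_frames, dim); mean squared difference between adjacent frames
    return (feats[1:] - feats[:-1]).pow(2).sum(dim=-1).mean()

adapter = nn.Linear(64, 64)                  # small learnable module
frames = torch.randn(16, 64)                 # 16 frames of 64-dim features
loss = temporal_consistency_loss(adapter(frames))
loss.backward()                              # differentiable, as the paper proves
print(float(loss))
```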

[CV-11] MVQA: Mamba with Unified Sampling for Efficient Video Quality Assessment

【Quick Read】: This paper addresses the challenge of efficient video quality assessment (VQA) for long-duration, high-resolution videos. Existing methods typically reduce model parameters or resample the input, but lightweight CNNs and Transformers struggle to balance efficiency with the long-range modeling needed for high performance. Moreover, efficient VQA depends on resampling long sequences to cut computational cost, yet current resampling methods often fail to preserve essential semantic information. The key is MVQA, a model built on the Mamba state-space model, together with a novel Unified Semantic and Distortion Sampling (USDS) method. USDS combines semantic patch sampling from low-resolution videos with distortion patch sampling from original-resolution videos: the former captures semantically dense regions, the latter retains critical distortion details. To avoid the extra cost of dual inputs, a fusion mechanism with pre-defined masks enables a unified sampling strategy that captures both semantic and quality information without additional computation. Experiments show that MVQA with USDS matches state-of-the-art performance while running twice as fast and using only one fifth of the GPU memory.

Link: https://arxiv.org/abs/2504.16003
Authors: Yachun Mi, Yu Li, Weicheng Meng, Chaofeng Chen, Chen Hui, Shaohui Liu
Institutions: Harbin Institute of Technology; Sun Yat-sen University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:The rapid growth of long-duration, high-definition videos has made efficient video quality assessment (VQA) a critical challenge. Existing research typically tackles this problem through two main strategies: reducing model parameters and resampling inputs. However, light-weight Convolution Neural Networks (CNN) and Transformers often struggle to balance efficiency with high performance due to the requirement of long-range modeling capabilities. Recently, the state-space model, particularly Mamba, has emerged as a promising alternative, offering linear complexity with respect to sequence length. Meanwhile, efficient VQA heavily depends on resampling long sequences to minimize computational costs, yet current resampling methods are often weak in preserving essential semantic information. In this work, we present MVQA, a Mamba-based model designed for efficient VQA along with a novel Unified Semantic and Distortion Sampling (USDS) approach. USDS combines semantic patch sampling from low-resolution videos and distortion patch sampling from original-resolution videos. The former captures semantically dense regions, while the latter retains critical distortion details. To prevent computation increase from dual inputs, we propose a fusion mechanism using pre-defined masks, enabling a unified sampling strategy that captures both semantic and quality information without additional computational burden. Experiments show that the proposed MVQA, equipped with USDS, achieve comparable performance to state-of-the-art methods while being 2× as fast and requiring only 1/5 GPU memory.
zh

[CV-12] Efficient Adaptation of Deep Neural Networks for Semantic Segmentation in Space Applications

【Quick Read】: This paper evaluates the feasibility of using adapters for efficient transfer learning in rock segmentation for extraterrestrial environments such as the Moon and Mars. The key is strategically integrating adapters into a pre-trained backbone model to reduce the bandwidth and memory requirements of the target extraterrestrial device. The solution considers two memory-saving strategies: layer fusion, which reduces the inference overhead to zero, and "adapter ranking", which also reduces transmission cost. Finally, the paper evaluates these methods in terms of task performance, memory, and computation on embedded devices, exposing trade-offs that open the road to further research in the field.

Link: https://arxiv.org/abs/2504.15991
Authors: Leonardo Olivi, Edoardo Santero Mormile, Enzo Tartaglione
Institutions: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:In recent years, the application of Deep Learning techniques has shown remarkable success in various computer vision tasks, paving the way for their deployment in extraterrestrial exploration. Transfer learning has emerged as a powerful strategy for addressing the scarcity of labeled data in these novel environments. This paper represents one of the first efforts in evaluating the feasibility of employing adapters toward efficient transfer learning for rock segmentation in extraterrestrial landscapes, mainly focusing on lunar and martian terrains. Our work suggests that the use of adapters, strategically integrated into a pre-trained backbone model, can be successful in reducing both bandwidth and memory requirements for the target extraterrestrial device. In this study, we considered two memory-saving strategies: layer fusion (to reduce to zero the inference overhead) and an ``adapter ranking’’ (to also reduce the transmission cost). Finally, we evaluate these results in terms of task performance, memory, and computation on embedded devices, evidencing trade-offs that open the road to more research in the field.
zh

[CV-13] A New Graph Grammar Formalism for Robust Syntactic Pattern Recognition

【Quick Read】: This paper addresses theoretical and practical challenges in parsing complex recursively structured patterns, particularly large patterns (50-1000 symbols) involving variability in geometric relationships, blurry symbols, overlapping elements, cluttered backgrounds, and erased patches. The key is a grammar formalism that does not use production rules: both the grammar and the pattern are represented as networks, and parsing is cast as constructing a homomorphism from the pattern to the grammar, which can express iterative, hierarchical, and nested recursive structure in more than one dimension. This network-based approach supports a highly parallel style of parsing that integrates feature detection, segmentation, parsing, filling in missing symbols, and top-down and bottom-up inference into a single synergistic process, improving the robustness and efficiency of complex pattern parsing.

Link: https://arxiv.org/abs/2504.15975
Authors: Peter Fletcher
Institutions: Unknown
Subjects: Formal Languages and Automata Theory (cs.FL); Computer Vision and Pattern Recognition (cs.CV)
Comments: 64 pages, 23 figures

Click to view abstract

Abstract:I introduce a formalism for representing the syntax of recursively structured graph-like patterns. It does not use production rules, like a conventional graph grammar, but represents the syntactic structure in a more direct and declarative way. The grammar and the pattern are both represented as networks, and parsing is seen as the construction of a homomorphism from the pattern to the grammar. The grammars can represent iterative, hierarchical and nested recursive structure in more than one dimension. This supports a highly parallel style of parsing, in which all aspects of pattern recognition (feature detection, segmentation, parsing, filling in missing symbols, top-down and bottom-up inference) are integrated into a single process, to exploit the synergy between them. The emphasis of this paper is on underlying theoretical issues, but I also give some example runs to illustrate the error-tolerant parsing of complex recursively structured patterns of 50-1000 symbols, involving variability in geometric relationships, blurry and indistinct symbols, overlapping symbols, cluttered images, and erased patches.
zh

[CV-14] Recent Advances and Future Directions in Extended Reality (XR): Exploring AI-Powered Spatial Intelligence

【Quick Read】: This review examines the technological evolution and future directions of Extended Reality (XR, encompassing Augmented Reality AR, Virtual Reality VR, and Mixed Reality MR). It analyzes XR's foundational frameworks (hardware and software) and the performance of state-of-the-art XR products, highlighting the importance of spatial intelligence for high-quality performance in commercial XR devices. The review argues that future XR systems should focus on integrating multi-modal AI and IoT-driven digital twins to build adaptive XR environments. The key lies in leveraging AI, especially for spatial intelligence, to unlock XR's potential as the next frontier of human-computer interaction and to create new digital spaces with realistic experiences that benefit humanity.

Link: https://arxiv.org/abs/2504.15970
Authors: Baichuan Zeng
Institutions: Unknown
Subjects: Human-Computer Interaction (cs.HC); Computer Vision and Pattern Recognition (cs.CV); Multiagent Systems (cs.MA)
Comments: 7 pages, 4 figures

Click to view abstract

Abstract:Extended Reality (XR), encompassing Augmented Reality (AR), Virtual Reality (VR) and Mixed Reality (MR), is a transformative technology bridging the physical and virtual world and it has diverse potential which will be ubiquitous in the future. This review examines XR’s evolution through foundational framework - hardware ranging from monitors to sensors and software ranging from visual tasks to user interface; highlights state of the art (SOTA) XR products with the comparison and analysis of performance based on their foundational framework; discusses how commercial XR devices can support the demand of high-quality performance focusing on spatial intelligence. For future directions, attention should be given to the integration of multi-modal AI and IoT-driven digital twins to enable adaptive XR systems. With the concept of spatial intelligence, future XR should establish a new digital space with realistic experience that benefits humanity. This review underscores the pivotal role of AI in unlocking XR as the next frontier in human-computer interaction.
zh

[CV-15] FreeGraftor: Training-Free Cross-Image Feature Grafting for Subject-Driven Text-to-Image Generation

【Quick Read】: This paper tackles the core trade-off between fidelity and efficiency in subject-driven text-to-image generation: tuning-based methods rely on time-consuming, resource-intensive subject-specific optimization, while zero-shot methods fail to maintain adequate subject consistency. The proposed FreeGraftor is a training-free framework that resolves these limitations through cross-image feature grafting. Its key is to use semantic matching and position-constrained attention fusion to transfer visual details from reference subjects to the generated image, combined with a novel noise initialization strategy that preserves the geometric priors of reference subjects for robust feature matching. Extensive qualitative and quantitative experiments show that FreeGraftor enables precise subject identity transfer while maintaining text-aligned scene synthesis, significantly outperforming existing zero-shot and training-free approaches in subject fidelity and text alignment. The framework also extends seamlessly to multi-subject generation, making it practical for real-world deployment.

Link: https://arxiv.org/abs/2504.15958
Authors: Zebin Yao, Lei Ren, Huixing Jiang, Chen Wei, Xiaojie Wang, Ruifan Li, Fangxiang Feng
Institutions: Beijing University of Posts and Telecommunications; Li Auto Inc.
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:Subject-driven image generation aims to synthesize novel scenes that faithfully preserve subject identity from reference images while adhering to textual guidance, yet existing methods struggle with a critical trade-off between fidelity and efficiency. Tuning-based approaches rely on time-consuming and resource-intensive subject-specific optimization, while zero-shot methods fail to maintain adequate subject consistency. In this work, we propose FreeGraftor, a training-free framework that addresses these limitations through cross-image feature grafting. Specifically, FreeGraftor employs semantic matching and position-constrained attention fusion to transfer visual details from reference subjects to the generated image. Additionally, our framework incorporates a novel noise initialization strategy to preserve geometry priors of reference subjects for robust feature matching. Extensive qualitative and quantitative experiments demonstrate that our method enables precise subject identity transfer while maintaining text-aligned scene synthesis. Without requiring model fine-tuning or additional training, FreeGraftor significantly outperforms existing zero-shot and training-free approaches in both subject fidelity and text alignment. Furthermore, our framework can seamlessly extend to multi-subject generation, making it practical for real-world deployment. Our code is available at this https URL.
zh

[CV-16] Visual Place Cell Encoding: A Computational Model for Spatial Representation and Cognitive Mapping

【Quick Read】: This paper asks how place-cell-like activation patterns in biological brains can be simulated from visual input. The key is the Visual Place Cell Encoding (VPCE) computational framework, which activates visual place cells by clustering high-dimensional appearance features extracted from images captured by a robot-mounted camera. Each cluster center defines a receptive field, and activation is computed from visual similarity using a radial basis function. This design lets VPCE distinguish visually similar yet spatially distinct locations and adapt to environment changes such as the insertion or removal of walls, demonstrating that structured visual input alone, without motion cues or reward-driven learning, can generate biologically plausible place-cell-like spatial representations.

Link: https://arxiv.org/abs/2504.15953
Authors: Chance J. Hamilton, Alfredo Weitzenfeld
Institutions: University of South Florida; University of South Florida
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:This paper presents the Visual Place Cell Encoding (VPCE) model, a biologically inspired computational framework for simulating place cell-like activation using visual input. Drawing on evidence that visual landmarks play a central role in spatial encoding, the proposed VPCE model activates visual place cells by clustering high-dimensional appearance features extracted from images captured by a robot-mounted camera. Each cluster center defines a receptive field, and activation is computed based on visual similarity using a radial basis function. We evaluate whether the resulting activation patterns correlate with key properties of biological place cells, including spatial proximity, orientation alignment, and boundary differentiation. Experiments demonstrate that the VPCE can distinguish between visually similar yet spatially distinct locations and adapt to environment changes such as the insertion or removal of walls. These results suggest that structured visual input, even in the absence of motion cues or reward-driven learning, is sufficient to generate place-cell-like spatial representations and support biologically inspired cognitive mapping.
zh
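A minimal sketch of the VPCE idea: cluster appearance features, treat each cluster center as a receptive field, and activate place cells with a radial basis function of visual similarity. The features here are random stand-ins for the paper's appearance descriptors, and the RBF width is an illustrative choice.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
features = rng.normal(size=(500, 128))          # appearance features, one per image

kmeans = KMeans(n_clusters=20, n_init=10, random_state=0).fit(features)
centers = kmeans.cluster_centers_               # one receptive field per place cell

def place_cell_activation(feat: np.ndarray, width: float = 5.0) -> np.ndarray:
    d2 = ((centers - feat) ** 2).sum(axis=1)    # squared distance to each center
    return np.exp(-d2 / (2.0 * width ** 2))     # Gaussian radial basis function

act = place_cell_activation(features[0])
print(act.shape, act.argmax())                  # 20 cells; index of the most active one
```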

[CV-17] Reasoning Physical Video Generation with Diffusion Timestep Tokens via Reinforcement Learning

【Quick Read】: This paper targets the difficulty of making video generation adhere to physical laws: traditional diffusion-based methods rely on data-driven approximations and struggle to extrapolate to unseen physical conditions (e.g., velocity). To address this, the paper combines symbolic reasoning with reinforcement learning to enforce physical consistency. The key is the Diffusion Timestep Tokenizer (DDT), which learns discrete, recursive visual tokens by recovering visual attributes lost during the diffusion process; these tokens enable symbolic reasoning by a large language model. On top of this, the Phys-AR framework runs in two stages: the first uses supervised fine-tuning to transfer symbolic knowledge, and the second applies reinforcement learning with reward functions based on physical conditions to optimize the model's reasoning. This allows the model to dynamically adjust and improve the physical properties of generated videos, ensuring adherence to physical laws. Experiments show that Phys-AR can generate physically consistent videos.

Link: https://arxiv.org/abs/2504.15932
Authors: Wang Lin, Liyu Jia, Wentao Hu, Kaihang Pan, Zhongqi Yue, Wei Zhao, Jingyuan Chen, Fei Wu, Hanwang Zhang
Institutions: Zhejiang University; Nanyang Technological University; Huawei Singapore Research Center
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:Despite recent progress in video generation, producing videos that adhere to physical laws remains a significant challenge. Traditional diffusion-based methods struggle to extrapolate to unseen physical conditions (eg, velocity) due to their reliance on data-driven approximations. To address this, we propose to integrate symbolic reasoning and reinforcement learning to enforce physical consistency in video generation. We first introduce the Diffusion Timestep Tokenizer (DDT), which learns discrete, recursive visual tokens by recovering visual attributes lost during the diffusion process. The recursive visual tokens enable symbolic reasoning by a large language model. Based on it, we propose the Phys-AR framework, which consists of two stages: The first stage uses supervised fine-tuning to transfer symbolic knowledge, while the second stage applies reinforcement learning to optimize the model’s reasoning abilities through reward functions based on physical conditions. Our approach allows the model to dynamically adjust and improve the physical properties of generated videos, ensuring adherence to physical laws. Experimental results demonstrate that PhysAR can generate videos that are physically consistent.
zh

[CV-18] Benchmarking the Reproducibility of Brain MRI Segmentation Across Scanners and Time

【Quick Read】: This paper addresses the reproducibility limits of deep-learning-based brain morphometry from structural MRI in longitudinal and multi-site studies. Although deep learning has accelerated segmentation workflows, scanner-induced variability remains significant in these settings. The paper benchmarks two modern segmentation pipelines, FastSurfer and SynthSeg, both integrated into FreeSurfer, and quantifies inter-scan segmentation variability with the Dice coefficient, Surface Dice, Hausdorff Distance (HD95), and mean absolute percentage error (MAPE), revealing up to 7-8% volume variation in small subcortical structures such as the amygdala and ventral diencephalon even under controlled test-retest conditions. This raises the question of whether subtle longitudinal changes on the order of 5-10% in small brain regions can be detected at all given this level of domain-induced noise. The key contributions are an analysis of the effects of registration templates and interpolation modes and a proposed surface-based quality filtering step to improve segmentation reliability. The study provides a reproducible benchmark for morphometric reproducibility and underscores the need for harmonization strategies in real-world neuroimaging research.

Link: https://arxiv.org/abs/2504.15931
Authors: Ekaterina Kondrateva, Sandzhi Barg, Mikhail Vasiliev
Institutions: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:Accurate and reproducible brain morphometry from structural MRI is critical for monitoring neuroanatomical changes across time and across imaging domains. Although deep learning has accelerated segmentation workflows, scanner-induced variability and reproducibility limitations remain-especially in longitudinal and multi-site settings. In this study, we benchmark two modern segmentation pipelines, FastSurfer and SynthSeg, both integrated into FreeSurfer, one of the most widely adopted tools in neuroimaging. Using two complementary datasets - a 17-year longitudinal cohort (SIMON) and a 9-site test-retest cohort (SRPBS)-we quantify inter-scan segmentation variability using Dice coefficient, Surface Dice, Hausdorff Distance (HD95), and Mean Absolute Percentage Error (MAPE). Our results reveal up to 7-8% volume variation in small subcortical structures such as the amygdala and ventral diencephalon, even under controlled test-retest conditions. This raises a key question: is it feasible to detect subtle longitudinal changes on the order of 5-10% in pea-sized brain regions, given the magnitude of domain-induced morphometric noise? We further analyze the effects of registration templates and interpolation modes, and propose surface-based quality filtering to improve segmentation reliability. This study provides a reproducible benchmark for morphometric reproducibility and emphasizes the need for harmonization strategies in real-world neuroimaging studies. Code and figures: this https URL
zh
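A minimal sketch of two of the reproducibility metrics used above, Dice and volume MAPE, applied to a pair of binary segmentation masks; HD95 and Surface Dice require surface extraction and are omitted here.

```python
import numpy as np

def dice(a: np.ndarray, b: np.ndarray) -> float:
    a, b = a.astype(bool), b.astype(bool)
    denom = a.sum() + b.sum()
    return 2.0 * np.logical_and(a, b).sum() / denom if denom else 1.0

def volume_mape(a: np.ndarray, b: np.ndarray) -> float:
    va, vb = float(a.sum()), float(b.sum())     # voxel counts as structure volumes
    return abs(va - vb) / va * 100.0

scan1 = np.zeros((64, 64, 64), dtype=bool); scan1[20:40, 20:40, 20:40] = True
scan2 = np.zeros((64, 64, 64), dtype=bool); scan2[22:42, 20:40, 20:40] = True
print(f"Dice = {dice(scan1, scan2):.3f}, volume MAPE = {volume_mape(scan1, scan2):.1f}%")
```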

[CV-19] Meta-Entity Driven Triplet Mining for Aligning Medical Vision-Language Models

【Quick Read】: This paper addresses image-text alignment for diagnostic imaging, particularly chest X-ray (CXR) evaluation, where existing methods based mainly on contrastive learning prioritize separating disease classes while neglecting fine-grained pathology attributes (such as location, size, or severity), leading to suboptimal representations. The key is MedTrim (Meta-entity-driven Triplet mining), a multimodal triplet learning method guided jointly by disease class and by adjectival and directional pathology descriptors. MedTrim uses structured meta-entity information to preserve subtle but clinically significant intra-class variations. It introduces an ontology-based entity recognition module to extract pathology-specific meta-entities (since attribute annotations are rare in public datasets), a novel score function that aggregates inter-sample similarity over disease classes and descriptors for refined triplet sampling, and a multimodal triplet alignment objective for explicit within- and cross-modal alignment between samples sharing detailed pathology characteristics. Experiments show MedTrim improves downstream retrieval and classification over state-of-the-art alignment methods.

Link: https://arxiv.org/abs/2504.15929
Authors: Saban Ozturk, Melih B. Yilmaz, Muti Kara, M. Talat Yavuz, Aykut Koç, Tolga Çukur
Institutions: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 18 pages, 7 figures, 6 tables

Click to view abstract

Abstract:Diagnostic imaging relies on interpreting both images and radiology reports, but the growing data volumes place significant pressure on medical experts, yielding increased errors and workflow backlogs. Medical vision-language models (med-VLMs) have emerged as a powerful framework to efficiently process multimodal imaging data, particularly in chest X-ray (CXR) evaluations, albeit their performance hinges on how well image and text representations are aligned. Existing alignment methods, predominantly based on contrastive learning, prioritize separation between disease classes over segregation of fine-grained pathology attributes like location, size or severity, leading to suboptimal representations. Here, we propose MedTrim (Meta-entity-driven Triplet mining), a novel method that enhances image-text alignment through multimodal triplet learning synergistically guided by disease class as well as adjectival and directional pathology descriptors. Unlike common alignment methods that separate broad disease classes, MedTrim leverages structured meta-entity information to preserve subtle but clinically significant intra-class variations. For this purpose, we first introduce an ontology-based entity recognition module that extracts pathology-specific meta-entities from CXR reports, as annotations on pathology attributes are rare in public datasets. For refined sample selection in triplet mining, we then introduce a novel score function that captures an aggregate measure of inter-sample similarity based on disease classes and adjectival/directional descriptors. Lastly, we introduce a multimodal triplet alignment objective for explicit within- and cross-modal alignment between samples sharing detailed pathology characteristics. Our demonstrations indicate that MedTrim improves performance in downstream retrieval and classification tasks compared to state-of-the-art alignment methods.
zh
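A minimal sketch of a multimodal triplet objective in the spirit of MedTrim: pull an image embedding toward its matching report embedding and push it away from a report mined from a different pathology profile. The embeddings are random stand-ins and the margin is illustrative.

```python
import torch
import torch.nn.functional as F

def triplet_loss(anchor, positive, negative, margin: float = 0.2):
    d_pos = 1.0 - F.cosine_similarity(anchor, positive)   # cosine distance
    d_neg = 1.0 - F.cosine_similarity(anchor, negative)
    return F.relu(d_pos - d_neg + margin).mean()          # hinge on the gap

img = torch.randn(8, 512, requires_grad=True)   # CXR image embeddings
rep_pos = torch.randn(8, 512)                   # matching report embeddings
rep_neg = torch.randn(8, 512)                   # mined hard negatives

loss = triplet_loss(img, rep_pos, rep_neg)
loss.backward()
print(float(loss))
```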

[CV-20] A Clinician-Friendly Platform for Ophthalmic Image Analysis Without Technical Barriers

【Quick Read】: This paper addresses the problem that current medical imaging AI models typically require retraining or fine-tuning when deployed across different clinical centers, limiting widespread adoption. The key is the GlobeReady platform, which uses training-free local feature augmentation to handle domain shifts across centers and populations, enabling ocular disease diagnosis without retraining or technical expertise. This capability substantially improves the model's generality and scalability.

Link: https://arxiv.org/abs/2504.15928
Authors: Meng Wang, Tian Lin, Qingshan Hou, Aidi Lin, Jingcheng Wang, Qingsheng Peng, Truong X. Nguyen, Danqi Fang, Ke Zou, Ting Xu, Cancan Xue, Ten Cheer Quek, Qinkai Yu, Minxin Liu, Hui Zhou, Zixuan Xiao, Guiqin He, Huiyu Liang, Tingkun Shi, Man Chen, Linna Liu, Yuanyuan Peng, Lianyu Wang, Qiuming Hu, Junhong Chen, Zhenhua Zhang, Cheng Chen, Yitian Zhao, Dianbo Liu, Jianhua Wu, Xinjian Chen, Changqing Zhang, Triet Thanh Nguyen, Yanda Meng, Yalin Zheng, Yih Chung Tham, Carol Y. Cheung, Huazhu Fu, Haoyu Chen, Ching-Yu Cheng
Institutions: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Artificial intelligence (AI) shows remarkable potential in medical imaging diagnostics, but current models typically require retraining when deployed across different clinical centers, limiting their widespread adoption. We introduce GlobeReady, a clinician-friendly AI platform that enables ocular disease diagnosis without retraining/fine-tuning or technical expertise. GlobeReady achieves high accuracy across imaging modalities: 93.9-98.5% for an 11-category fundus photo dataset and 87.2-92.7% for a 15-category OCT dataset. Through training-free local feature augmentation, it addresses domain shifts across centers and populations, reaching an average accuracy of 88.9% across five centers in China, 86.3% in Vietnam, and 90.2% in the UK. The built-in confidence-quantifiable diagnostic approach further boosted accuracy to 94.9-99.4% (fundus) and 88.2-96.2% (OCT), while identifying out-of-distribution cases at 86.3% (49 CFP categories) and 90.6% (13 OCT categories). Clinicians from multiple countries rated GlobeReady highly (average 4.6 out of 5) for its usability and clinical relevance. These results demonstrate GlobeReady’s robust, scalable diagnostic capability and potential to support ophthalmic care without technical barriers.
zh

[CV-21] ViSMaP: Unsupervised Hour-long Video Summarisation by Meta-Prompting

【Quick Read】: This paper tackles unsupervised summarisation of long videos, in particular unsegmented videos with sparsely distributed events. Traditional approaches rely on well-annotated short-video models or expensive frame-level annotations and scale poorly to long videos. The key of ViSMaP is meta-prompting: it leverages segment descriptions from well-annotated short clips and uses large language models (LLMs) to generate and iteratively refine optimised pseudo-summaries of long videos, which then serve as training data for a long-form summariser. This bypasses expensive manual annotation of long videos while achieving performance comparable to fully supervised methods and generalising across domains.

Link: https://arxiv.org/abs/2504.15921
Authors: Jian Hu, Dimitrios Korkinof, Shaogang Gong, Mariano Beguerisse-Diaz
Institutions: Queen Mary University of London; Spotify
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:We introduce ViSMap: Unsupervised Video Summarisation by Meta Prompting, a system to summarise hour long videos with no-supervision. Most existing video understanding models work well on short videos of pre-segmented events, yet they struggle to summarise longer videos where relevant events are sparsely distributed and not pre-segmented. Moreover, long-form video understanding often relies on supervised hierarchical training that needs extensive annotations which are costly, slow and prone to inconsistency. With ViSMaP we bridge the gap between short videos (where annotated data is plentiful) and long ones (where it’s not). We rely on LLMs to create optimised pseudo-summaries of long videos using segment descriptions from short ones. These pseudo-summaries are used as training data for a model that generates long-form video summaries, bypassing the need for expensive annotations of long videos. Specifically, we adopt a meta-prompting strategy to iteratively generate and refine creating pseudo-summaries of long videos. The strategy leverages short clip descriptions obtained from a supervised short video model to guide the summary. Each iteration uses three LLMs working in sequence: one to generate the pseudo-summary from clip descriptions, another to evaluate it, and a third to optimise the prompt of the generator. This iteration is necessary because the quality of the pseudo-summaries is highly dependent on the generator prompt, and varies widely among videos. We evaluate our summaries extensively on multiple datasets; our results show that ViSMaP achieves performance comparable to fully supervised state-of-the-art models while generalising across domains without sacrificing performance. Code will be released upon publication.
zh

[CV-22] Ask2Loc: Learning to Locate Instructional Visual Answers by Asking Questions

【速读】:该论文旨在解决视觉答案定位(Visual Answer Localization, VAL)任务中因单一交互导致用户难以获得符合预期答案的问题。为模拟人类与视频之间的多轮交互过程,论文提出了一个新的任务——In-VAL,以解决语义鸿沟问题,包括用户输入问题意图的模糊性、视频字幕语言的不完整性以及视频片段内容的碎片化。论文的关键解决方案是提出Ask2Loc框架,通过提问的方式解决In-VAL任务中的上述问题。Ask2Loc包含三个核心模块:1)聊天模块用于细化初始问题并明确意图;2)改写模块生成流畅语言并构建完整描述;3)搜索模块扩展局部上下文并整合内容。实验结果表明,相比传统端到端和两阶段方法,Ask2Loc在In-VAL任务上的性能提升了高达14.91(mIoU)。

链接: https://arxiv.org/abs/2504.15918
作者: Chang Zong,Bin Li,Shoujun Zhou,Jian Wan,Lei Zhang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注: 16 pages, 8 figures

点击查看摘要

Abstract:Locating specific segments within an instructional video is an efficient way to acquire guiding knowledge. Generally, the task of obtaining video segments for both verbal explanations and visual demonstrations is known as visual answer localization (VAL). However, users often need multiple interactions to obtain answers that align with their expectations when using the system. During these interactions, humans deepen their understanding of the video content by asking themselves questions, thereby accurately identifying the location. Therefore, we propose a new task, named In-VAL, to simulate the multiple interactions between humans and videos in the procedure of obtaining visual answers. The In-VAL task requires interactively addressing several semantic gap issues, including 1) the ambiguity of user intent in the input questions, 2) the incompleteness of language in video subtitles, and 3) the fragmentation of content in video segments. To address these issues, we propose Ask2Loc, a framework for resolving In-VAL by asking questions. It includes three key modules: 1) a chatting module to refine initial questions and uncover clear intentions, 2) a rewriting module to generate fluent language and create complete descriptions, and 3) a searching module to broaden local context and provide integrated content. We conduct extensive experiments on three reconstructed In-VAL datasets. Compared to traditional end-to-end and two-stage methods, our proposed Ask2Loc can improve performance by up to 14.91 (mIoU) on the In-VAL task. Our code and datasets can be accessed at this https URL.
zh

[CV-23] RaSCL: Radar to Satellite Crossview Localization

【Quick Read】: This paper targets the unreliability, inaccuracy, and insufficiency of GNSS in many real-time autonomous field applications, proposing a GNSS-free global localization solution. The method registers ground-based imaging radar against overhead RGB imagery and jointly optimizes relative poses from odometry with global poses from the overhead registration. The key is extracting essential features from RGB overhead images to enable efficient global localization using only ground radar and a single georeferenced initial guess. The approach is validated on datasets spanning diverse geographic conditions and robotic platforms, including an Unmanned Surface Vessel (USV) as well as urban and suburban driving datasets.

Link: https://arxiv.org/abs/2504.15899
Authors: Blerim Abdullai, Tony Wang, Xinyuan Qiao, Florian Shkurti, Timothy D. Barfoot
Institutions: University of Toronto Robotics Institute
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:GNSS is unreliable, inaccurate, and insufficient in many real-time autonomous field applications. In this work, we present a GNSS-free global localization solution that contains a method of registering imaging radar on the ground with overhead RGB imagery, with joint optimization of relative poses from odometry and global poses from our overhead registration. Previous works have used various combinations of ground sensors and overhead imagery, and different feature extraction and matching methods. These include various handcrafted and deep-learning-based methods for extracting features from overhead imagery. Our work presents insights on extracting essential features from RGB overhead images for effective global localization against overhead imagery using only ground radar and a single georeferenced initial guess. We motivate our method by evaluating it on datasets in diverse geographic conditions and robotic platforms, including on an Unmanned Surface Vessel (USV) as well as urban and suburban driving datasets.
zh

[CV-24] MS-Occ: Multi-Stage LiDAR-Camera Fusion for 3D Semantic Occupancy Prediction

【Quick Read】: This paper addresses inaccurate 3D semantic occupancy perception for autonomous driving in complex environments, where vision-centric methods suffer from geometric inaccuracies and LiDAR-based methods lack rich semantics. The key is MS-Occ, a multi-stage LiDAR-camera fusion framework with middle-stage and late-stage fusion that combines LiDAR's geometric fidelity with camera-based semantic richness via hierarchical cross-modal fusion. Its innovations lie in two stages: (1) in middle-stage feature fusion, a Gaussian-Geo module uses Gaussian kernel rendering on sparse LiDAR depth maps to enrich 2D image features with dense geometric priors, while a Semantic-Aware module injects semantic context into LiDAR voxels via deformable cross-attention; (2) in late-stage voxel fusion, an Adaptive Fusion (AF) module dynamically balances voxel features across modalities, and a High Classification Confidence Voxel Fusion (HCCVF) module resolves semantic inconsistencies with self-attention-based refinement. On the nuScenes-OpenOccupancy benchmark, MS-Occ surpasses the state of the art, and ablations validate each module, with substantial gains in small-object perception that underscore its practical value for safety-critical autonomous driving.

Link: https://arxiv.org/abs/2504.15888
Authors: Zhiqiang Wei, Lianqing Zheng, Jianan Liu, Tao Huang, Qing-Long Han, Wenwen Zhang, Fengdeng Zhang
Institutions: School of Optical-Electrical and Computer Engineering, University of Shanghai for Science and Technology; School of Automotive Studies, Tongji University; Momoni AI; College of Science and Engineering, James Cook University; School of Engineering, Swinburne University of Technology; School of Electrical and Electronic Engineering, Nanyang Technological University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 8 pages, 5 figures

Click to view abstract

Abstract:Accurate 3D semantic occupancy perception is essential for autonomous driving in complex environments with diverse and irregular objects. While vision-centric methods suffer from geometric inaccuracies, LiDAR-based approaches often lack rich semantic information. To address these limitations, MS-Occ, a novel multi-stage LiDAR-camera fusion framework which includes middle-stage fusion and late-stage fusion, is proposed, integrating LiDAR’s geometric fidelity with camera-based semantic richness via hierarchical cross-modal fusion. The framework introduces innovations at two critical stages: (1) In the middle-stage feature fusion, the Gaussian-Geo module leverages Gaussian kernel rendering on sparse LiDAR depth maps to enhance 2D image features with dense geometric priors, and the Semantic-Aware module enriches LiDAR voxels with semantic context via deformable cross-attention; (2) In the late-stage voxel fusion, the Adaptive Fusion (AF) module dynamically balances voxel features across modalities, while the High Classification Confidence Voxel Fusion (HCCVF) module resolves semantic inconsistencies using self-attention-based refinement. Experiments on the nuScenes-OpenOccupancy benchmark show that MS-Occ achieves an Intersection over Union (IoU) of 32.1% and a mean IoU (mIoU) of 25.3%, surpassing the state-of-the-art by +0.7% IoU and +2.4% mIoU. Ablation studies further validate the contribution of each module, with substantial improvements in small-object perception, demonstrating the practical value of MS-Occ for safety-critical autonomous driving scenarios.
zh

[CV-25] Integrating Non-Linear Radon Transformation for Diabetic Retinopathy Grading

【速读】:该论文旨在解决糖尿病视网膜病变(Diabetic Retinopathy, DR)早期检测与精确分级的难题,特别是现有基于深度学习方法在处理视网膜眼底图像中复杂且不规则病灶模式时面临的挑战。论文提出的关键解决方案是RadFuse框架,它通过结合非线性RadEx变换生成的断层扫描图(sinogram)图像与传统眼底图像,充分利用空间域与变换域的信息,增强特征表达能力,从而提升DR的检测与分级性能。其中,RadEx变换作为Radon变换的优化非线性扩展,能够有效捕捉复杂的视网膜病变模式。实验结果表明,RadFuse在多个指标上显著优于仅使用眼底图像的方法,并在基准数据集APTOS-2019和DDR上超越了当前最先进的方法。
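
RadEx 变换的具体形式未在摘要中公开。下面用 scikit-image 的标准 Radon 变换演示 sinogram 的生成,并附加一个假设性的非线性幂次压缩,仅用于说明“在变换域提取特征”的思路(体模图像与幂指数均为演示假设,并非论文中的 RadEx):

```python
import numpy as np
from skimage.data import shepp_logan_phantom  # 用体模图像代替眼底图像作演示
from skimage.transform import radon, resize

image = resize(shepp_logan_phantom(), (256, 256))
theta = np.linspace(0.0, 180.0, 180, endpoint=False)

# 标准 Radon 变换得到 sinogram;论文的 RadEx 是其非线性扩展(细节未公开)
sinogram = radon(image, theta=theta)

# 一种假设性的非线性处理:对投影做幂次压缩以突出弱响应
sinogram_nl = np.sign(sinogram) * np.abs(sinogram) ** 0.5
print(sinogram.shape)  # (投影位置数, 角度数) = (256, 180)
```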

链接: https://arxiv.org/abs/2504.15883
作者: Farida Mohsen,Samir Belhaouari,Zubair Shah
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Diabetic retinopathy is a serious ocular complication that poses a significant threat to patients’ vision and overall health. Early detection and accurate grading are essential to prevent vision loss. Current automatic grading methods rely heavily on deep learning applied to retinal fundus images, but the complex, irregular patterns of lesions in these images, which vary in shape and distribution, make it difficult to capture subtle changes. This study introduces RadFuse, a multi-representation deep learning framework that integrates non-linear RadEx-transformed sinogram images with traditional fundus images to enhance diabetic retinopathy detection and grading. Our RadEx transformation, an optimized non-linear extension of the Radon transform, generates sinogram representations to capture complex retinal lesion patterns. By leveraging both spatial and transformed domain information, RadFuse enriches the feature set available to deep learning models, improving the differentiation of severity levels. We conducted extensive experiments on two benchmark datasets, APTOS-2019 and DDR, using three convolutional neural networks (CNNs): ResNeXt-50, MobileNetV2, and VGG19. RadFuse showed significant improvements over fundus-image-only models across all three CNN architectures and outperformed state-of-the-art methods on both datasets. For severity grading across five stages, RadFuse achieved a quadratic weighted kappa of 93.24%, an accuracy of 87.07%, and an F1-score of 87.17%. In binary classification between healthy and diabetic retinopathy cases, the method reached an accuracy of 99.09%, precision of 98.58%, and recall of 99.6%, surpassing previously established models. These results demonstrate RadFuse’s capacity to capture complex non-linear features, advancing diabetic retinopathy classification and promoting the integration of advanced mathematical transforms in medical image analysis.
zh

[CV-26] MedNNS: Supernet-based Medical Task-Adaptive Neural Network Search

【速读】:该论文旨在解决深度学习模型在医学影像任务中适配性差的问题,主要挑战来自架构选择与权重初始化。架构选择因不同任务需专门设计模型,而权重初始化直接影响模型的收敛速度和最终性能。尽管从ImageNet进行迁移学习被广泛应用,但其效果受限于自然图像与医学图像之间的显著差异。为应对这些挑战,论文提出Medical Neural Network Search (MedNNS),首个针对医学影像应用的神经网络搜索框架。MedNNS的关键在于通过构建一个元空间,该空间基于数据集和模型协同表现的好坏来编码它们,从而联合优化架构选择与权重初始化。这种方法利用基于超网络的扩展方式,使模型库规模比先前最先进的方法扩大了51倍,并引入排名损失和Fréchet Inception Distance (FID) 损失以更精确地对齐元空间中的关系。实验结果表明,MedNNS在多个数据集上显著优于ImageNet预训练的深度学习模型及最先进的神经架构搜索(NAS)方法,平均提升数据集准确率1.7%,且收敛速度大幅提升。
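
摘要提到的 rank loss 未给出公式。下面是一种常见的成对排序损失写法作为参考(margin 取值与配对方式均为假设),其目标是让元空间打分的相对顺序与候选模型在某数据集上的真实精度一致:

```python
import torch
import torch.nn.functional as F

def rank_loss(pred_scores: torch.Tensor, true_acc: torch.Tensor, margin: float = 0.1) -> torch.Tensor:
    """成对排序损失:真实精度更高的模型,其元空间打分也应更高(示意)。
    pred_scores / true_acc 形状均为 (N,),对应同一数据集上 N 个候选模型。"""
    diff_pred = pred_scores.unsqueeze(0) - pred_scores.unsqueeze(1)   # (N, N)
    diff_true = true_acc.unsqueeze(0) - true_acc.unsqueeze(1)
    sign = torch.sign(diff_true)
    # 仅惩罚打分方向与真实精度排序不一致的模型对
    return F.relu(margin - sign * diff_pred)[sign != 0].mean()

loss = rank_loss(torch.randn(8), torch.rand(8))
print(loss.item())
```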

链接: https://arxiv.org/abs/2504.15865
作者: Lotfi Abdelkrim Mecharbat,Ibrahim Elmakky,Martin Takac,Mohammed Yaqub
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Deep learning (DL) has achieved remarkable progress in the field of medical imaging. However, adapting DL models to medical tasks remains a significant challenge, primarily due to two key factors: (1) architecture selection, as different tasks necessitate specialized model designs, and (2) weight initialization, which directly impacts the convergence speed and final performance of the models. Although transfer learning from ImageNet is a widely adopted strategy, its effectiveness is constrained by the substantial differences between natural and medical images. To address these challenges, we introduce Medical Neural Network Search (MedNNS), the first Neural Network Search framework for medical imaging applications. MedNNS jointly optimizes architecture selection and weight initialization by constructing a meta-space that encodes datasets and models based on how well they perform together. We build this space using a Supernetwork-based approach, expanding the model zoo size by 51x times over previous state-of-the-art (SOTA) methods. Moreover, we introduce rank loss and Fréchet Inception Distance (FID) loss into the construction of the space to capture inter-model and inter-dataset relationships, thereby achieving more accurate alignment in the meta-space. Experimental results across multiple datasets demonstrate that MedNNS significantly outperforms both ImageNet pre-trained DL models and SOTA Neural Architecture Search (NAS) methods, achieving an average accuracy improvement of 1.7% across datasets while converging substantially faster. The code and the processed meta-space is available at this https URL.
zh

[CV-27] DERD-Net: Learning Depth from Event-based Ray Densities

【速读】:该论文旨在解决传统深度学习框架在处理事件相机(event camera)数据时面临的挑战,这些数据具有异步流式特性,与常规相机的离散图像输入不兼容。论文提出了一种可扩展、灵活且适应性强的框架,用于事件相机的像素级深度估计,适用于单目和立体视觉设置。解决方案的关键在于将场景结构编码到视差空间图像(Disparity Space Images, DSIs)中,并通过已知相机姿态将事件反投影到空间中以表示射线的空间密度。神经网络处理DSI的局部子区域,结合三维卷积和循环结构来识别用于深度预测的有价值模式。这种局部处理方式实现了快速推理、全并行化以及无论相机分辨率如何都保持超低模型复杂度和内存成本。
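
DSI 的构建本质上是把事件沿若干深度假设反投影,并在参考视角下累计射线密度。下面是仅依赖 NumPy 的简化示意(针孔相机模型;接口、坐标约定均为假设,忽略事件时间戳与镜头畸变):

```python
import numpy as np

def build_dsi(events_px: np.ndarray, K: np.ndarray, R: np.ndarray, t: np.ndarray,
              depths: np.ndarray, shape: tuple[int, int]) -> np.ndarray:
    # events_px: (N, 2) 事件像素坐标;R, t: 事件相机到参考相机的相对位姿
    H, W = shape
    dsi = np.zeros((len(depths), H, W), dtype=np.float32)
    rays = np.linalg.inv(K) @ np.hstack([events_px, np.ones((len(events_px), 1))]).T
    for d_idx, z in enumerate(depths):
        pts = R @ (rays * z) + t.reshape(3, 1)       # 深度假设下的 3D 点,变换到参考系
        uv = K @ pts
        uv = (uv[:2] / uv[2]).T.round().astype(int)  # 投影回参考像平面
        ok = (uv[:, 0] >= 0) & (uv[:, 0] < W) & (uv[:, 1] >= 0) & (uv[:, 1] < H)
        np.add.at(dsi[d_idx], (uv[ok, 1], uv[ok, 0]), 1.0)  # 累计射线穿过次数
    return dsi

K = np.array([[100.0, 0, 64], [0, 100.0, 64], [0, 0, 1]])
dsi = build_dsi(np.random.rand(1000, 2) * 128, K, np.eye(3),
                np.array([0.05, 0.0, 0.0]), np.linspace(1.0, 5.0, 32), (128, 128))
print(dsi.shape)  # (32, 128, 128)
```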

链接: https://arxiv.org/abs/2504.15863
作者: Diego de Oliveira Hitzges,Suman Ghosh,Guillermo Gallego
机构: Technische Universität Berlin (柏林工业大学), Einstein Center Digital Future (数字未来爱因斯坦中心), Robotics Institute Germany (德国机器人研究所); Science of Intelligence Excellence Cluster (智能科学卓越集群), Germany (德国)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Robotics (cs.RO); Signal Processing (eess.SP)
备注: 13 pages, 3 figures, 14 tables. Project page: this https URL

点击查看摘要

Abstract:Event cameras offer a promising avenue for multi-view stereo depth estimation and Simultaneous Localization And Mapping (SLAM) due to their ability to detect blur-free 3D edges at high-speed and over broad illumination conditions. However, traditional deep learning frameworks designed for conventional cameras struggle with the asynchronous, stream-like nature of event data, as their architectures are optimized for discrete, image-like inputs. We propose a scalable, flexible and adaptable framework for pixel-wise depth estimation with event cameras in both monocular and stereo setups. The 3D scene structure is encoded into disparity space images (DSIs), representing spatial densities of rays obtained by back-projecting events into space via known camera poses. Our neural network processes local subregions of the DSIs combining 3D convolutions and a recurrent structure to recognize valuable patterns for depth prediction. Local processing enables fast inference with full parallelization and ensures constant ultra-low model complexity and memory costs, regardless of camera resolution. Experiments on standard benchmarks (MVSEC and DSEC datasets) demonstrate unprecedented effectiveness: (i) using purely monocular data, our method achieves comparable results to existing stereo methods; (ii) when applied to stereo data, it strongly outperforms all state-of-the-art (SOTA) approaches, reducing the mean absolute error by at least 42%; (iii) our method also allows for increases in depth completeness by more than 3-fold while still yielding a reduction in median absolute error of at least 30%. Given its remarkable performance and effective processing of event-data, our framework holds strong potential to become a standard approach for using deep learning for event-based depth estimation and SLAM. Project page: this https URL
zh

[CV-28] Text-based Animatable 3D Avatars with Morphable Model Alignment

【速读】:该论文旨在解决基于文本生成高质量、可动画化3D人脸 avatar 的挑战,特别是现有方法在合成真实细节以及外观与驱动参数化模型对齐方面存在的不足。论文指出这些问题源于3D avatar蒸馏过程中2D扩散预测的歧义性,包括文本输入对avatar外观和几何结构的约束不足,以及扩散模型与参数化头模型之间的语义对齐不够。为了解决这些问题,论文提出了一种名为AnimPortrait3D的新框架,并引入了两个关键策略:首先利用预训练的文本到3D模型的先验信息初始化具有鲁棒外观、几何结构及绑定关系的3D avatar;其次通过一个条件于参数化模型语义图和法线图的ControlNet优化初始avatar以确保动态表情的精确对齐。这些方案显著提升了生成结果的质量、对齐精度及动画保真度。

链接: https://arxiv.org/abs/2504.15835
作者: Yiqian Wu,Malte Prinzler,Xiaogang Jin,Siyu Tang
机构: ETH Zürich (苏黎世联邦理工学院); State Key Lab of CAD&CG, Zhejiang University (浙江大学计算机辅助设计与图形学国家重点实验室); Zhejiang University (浙江大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:The generation of high-quality, animatable 3D head avatars from text has enormous potential in content creation applications such as games, movies, and embodied virtual assistants. Current text-to-3D generation methods typically combine parametric head models with 2D diffusion models using score distillation sampling to produce 3D-consistent results. However, they struggle to synthesize realistic details and suffer from misalignments between the appearance and the driving parametric model, resulting in unnatural animation results. We discovered that these limitations stem from ambiguities in the 2D diffusion predictions during 3D avatar distillation, specifically: i) the avatar’s appearance and geometry is underconstrained by the text input, and ii) the semantic alignment between the predictions and the parametric head model is insufficient because the diffusion model alone cannot incorporate information from the parametric model. In this work, we propose a novel framework, AnimPortrait3D, for text-based realistic animatable 3DGS avatar generation with morphable model alignment, and introduce two key strategies to address these challenges. First, we tackle appearance and geometry ambiguities by utilizing prior information from a pretrained text-to-3D model to initialize a 3D avatar with robust appearance, geometry, and rigging relationships to the morphable model. Second, we refine the initial 3D avatar for dynamic expressions using a ControlNet that is conditioned on semantic and normal maps of the morphable model to ensure accurate alignment. As a result, our method outperforms existing approaches in terms of synthesis quality, alignment, and animation fidelity. Our experiments show that the proposed method advances the state of the art in text-based, animatable 3D head avatar generation.
zh

[CV-29] Human-Imperceptible Physical Adversarial Attack for NIR Face Recognition Models

【速读】:该论文旨在解决近红外(NIR)人脸识别系统在物理对抗攻击下的脆弱性问题。论文提出了一种新颖、隐蔽且实用的对抗补丁,用于黑盒环境下的NIR人脸识别系统攻击。解决方案的关键在于利用人眼不可见的红外吸收墨水生成具有数字优化形状和位置的多个补丁,并针对数字域与真实世界NIR成像之间的优化差异,开发了一个人皮光反射模型以模拟NIR光线反射,从而最小化像素级差异。实验结果表明,该方法在数字域和物理域均提升了攻击成功率,特别是在不同人脸姿态下仍保持有效性,平均物理攻击成功率达到了82.46%,显著优于现有方法的64.18%。

链接: https://arxiv.org/abs/2504.15823
作者: Songyan Xie,Jinghang Wen,Encheng Su,Qiucheng Yu
机构: School of Computer Science and Technology, China University of Mining and Technology (中国矿业大学); Department of Computer Science, City University of Hong Kong (香港城市大学); Department of Engineering and Design, Technische Universität München (慕尼黑工业大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Near-infrared (NIR) face recognition systems, which can operate effectively in low-light conditions or in the presence of makeup, exhibit vulnerabilities when subjected to physical adversarial attacks. To further demonstrate the potential risks in real-world applications, we design a novel, stealthy, and practical adversarial patch to attack NIR face recognition systems in a black-box setting. We achieved this by utilizing human-imperceptible infrared-absorbing ink to generate multiple patches with digitally optimized shapes and positions for infrared images. To address the optimization mismatch between digital and real-world NIR imaging, we develop a light reflection model for human skin to minimize pixel-level discrepancies by simulating NIR light reflection. Compared to state-of-the-art (SOTA) physical attacks on NIR face recognition systems, the experimental results show that our method improves the attack success rate in both digital and physical domains, particularly maintaining effectiveness across various face postures. Notably, the proposed approach outperforms SOTA methods, achieving an average attack success rate of 82.46% in the physical domain across different models, compared to 64.18% for existing methods. The artifact is available at this https URL.
zh

[CV-30] Locating and Mitigating Gradient Conflicts in Point Cloud Domain Adaptation via Saliency Map Skewness

【速读】:该论文旨在解决点云数据在未见或分布外(Out-of-Distribution, OOD)场景下分类模型性能下降的问题。现有基于无监督领域自适应(Unsupervised Domain Adaptation, UDA)的方法通常采用多任务学习(Multi-Task Learning, MTL)框架,通过结合主分类任务与辅助自监督任务来弥合跨域特征分布之间的差距。然而,进一步实验表明,并非所有来自自监督任务的梯度都有益,部分甚至可能对分类性能产生负面影响。为此,论文提出了一种新颖的解决方案——基于显著图的数据采样块(Saliency Map-based Data Sampling Block, SM-DSB)。该方法的关键在于设计了一种基于三维显著图偏度的新评分机制,无需目标域标签即可评估梯度冲突,并据此开发了一种动态样本选择策略,过滤掉对分类无益的自监督梯度对应的样本。此方案具有可扩展性且计算开销较小,可集成到现有的点云UDA MTL框架中。大量评估结果表明,该方法优于当前最先进的方法,并为理解UDA问题提供了新的视角。
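
“基于显著图偏度打分并筛选样本”的流程可以用很短的代码说明。下面是一个示意(偏度阈值的方向与取值为本文假设,论文中的评分细节未在摘要中给出):

```python
import numpy as np
from scipy.stats import skew

def saliency_skewness(saliency: np.ndarray) -> float:
    """计算一张三维显著图的偏度,作为梯度冲突的代理评分(示意)。"""
    return float(skew(saliency.reshape(-1)))

def select_samples(saliency_maps: list[np.ndarray], threshold: float) -> list[int]:
    """偏度低于阈值的样本被认为其自监督梯度有益,保留参与训练(阈值方向为假设)。"""
    return [i for i, s in enumerate(saliency_maps) if saliency_skewness(s) < threshold]

maps = [np.random.rand(1024, 3) for _ in range(4)]  # 每个点云 1024 个点的显著性
print(select_samples(maps, threshold=0.5))
```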

链接: https://arxiv.org/abs/2504.15796
作者: Jiaqi Tang,Yinsong Xu,Qingchao Chen
机构: National Institute of Health Data Science, Peking University (北京大学健康数据科学研究院); Institute of Medical Technology, Peking University (北京大学医学技术研究所); State Key Laboratory of General Artificial Intelligence, Peking University (北京大学通用人工智能实验室); School of Artificial Intelligence, Beijing University of Posts and Telecommunications (北京邮电大学人工智能学院)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Object classification models utilizing point cloud data are fundamental for 3D media understanding, yet they often struggle with unseen or out-of-distribution (OOD) scenarios. Existing point cloud unsupervised domain adaptation (UDA) methods typically employ a multi-task learning (MTL) framework that combines primary classification tasks with auxiliary self-supervision tasks to bridge the gap between cross-domain feature distributions. However, our further experiments demonstrate that not all gradients from self-supervision tasks are beneficial and some may negatively impact the classification performance. In this paper, we propose a novel solution, termed Saliency Map-based Data Sampling Block (SM-DSB), to mitigate these gradient conflicts. Specifically, our method designs a new scoring mechanism based on the skewness of 3D saliency maps to estimate gradient conflicts without requiring target labels. Leveraging this, we develop a sample selection strategy that dynamically filters out samples whose self-supervision gradients are not beneficial for the classification. Our approach is scalable, introducing modest computational overhead, and can be integrated into all the point cloud UDA MTL frameworks. Extensive evaluations demonstrate that our method outperforms state-of-the-art approaches. In addition, we provide a new perspective on understanding the UDA problem through back-propagation analysis.
zh

[CV-31] Development and evaluation of a deep learning algorithm for German word recognition from lip movements

【速读】:该论文试图解决德语唇读(Lip Reading)领域缺乏基于人工智能的识别算法的问题。人工读唇依赖视觉信息,错误率较高;而现有的基于人工神经网络的唇读算法主要针对英语等语言开发,未能覆盖德语。论文的关键解决方案在于构建了一个专门针对德语的唇读神经网络模型,通过使用包含32名德语说话人的38,391个视频片段数据集进行训练与验证,其中包含18个多音节且视觉可区分的目标词汇。研究比较了不同模型结构(如3D卷积神经网络、门控循环单元模型及其组合模型GRUConv)、视频图像的不同裁剪区域以及色彩空间对识别性能的影响,并在5000个训练轮次(epoch)中评估模型准确率。结果显示,仅裁剪嘴唇区域的输入显著提高了识别精度,而采用GRUConv模型时,在已知说话人条件下达到最高87%的准确率,在未知说话人条件下仍可达63%,表明所开发的模型具有高泛化能力和与英语算法相当的性能。因此,该研究的关键在于设计适用于德语的唇读神经网络,并结合特定的数据处理策略以提高识别准确性。
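
GRUConv 的网络细节未在摘要中完整给出。下面是一个“3D 卷积 + GRU”组合结构的 PyTorch 示意(通道数、层数与输入尺寸均为假设),输入为裁剪到嘴唇区域的帧序列,输出 18 个词类:

```python
import torch
import torch.nn as nn

class GRUConv(nn.Module):
    """3D 卷积提取口型时空特征,GRU 建模时序,线性层输出词类(示意结构)。"""
    def __init__(self, num_classes: int = 18, hidden: int = 256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv3d(3, 32, kernel_size=(3, 5, 5), padding=(1, 2, 2)),
            nn.ReLU(inplace=True),
            nn.MaxPool3d((1, 2, 2)),
            nn.Conv3d(32, 64, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool3d((None, 4, 4)),  # 保留时间维,压缩空间维
        )
        self.gru = nn.GRU(64 * 4 * 4, hidden, num_layers=2, batch_first=True)
        self.fc = nn.Linear(hidden, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, 3, T, H, W),T 为帧数
        f = self.conv(x)                          # (B, 64, T, 4, 4)
        f = f.permute(0, 2, 1, 3, 4).flatten(2)   # (B, T, 64*4*4)
        out, _ = self.gru(f)
        return self.fc(out[:, -1])                # 取最后时刻做词类分类

model = GRUConv()
logits = model(torch.randn(2, 3, 16, 64, 64))
print(logits.shape)  # torch.Size([2, 18])
```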

链接: https://arxiv.org/abs/2504.15792
作者: Dinh Nam Pham,Torsten Rahne
机构: Technische Universität Berlin (柏林工业大学); Department of Otorhinolaryngology, University Medicine Halle (Saale), Martin Luther University Halle-Wittenberg (哈勒-维滕贝格马丁路德大学耳鼻喉科,哈勒大学医学中心)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: English version of journal article in HNO 2022

点击查看摘要

Abstract:When reading lips, many people benefit from additional visual information from the lip movements of the speaker, which is, however, very error prone. Algorithms for lip reading with artificial intelligence based on artificial neural networks significantly improve word recognition but are not available for the German language. A total of 1806 video clips with only one German-speaking person each were selected, split into word segments, and assigned to word classes using speech-recognition software. In 38,391 video segments with 32 speakers, 18 polysyllabic, visually distinguishable words were used to train and validate a neural network. The 3D Convolutional Neural Network and Gated Recurrent Units models and a combination of both models (GRUConv) were compared, as were different image sections and color spaces of the videos. The accuracy was determined in 5000 training epochs. Comparison of the color spaces did not reveal any relevant differences in correct classification rates, which ranged from 69% to 72%. Cropping to the lips achieved a significantly higher accuracy (70%) than cropping to the speaker’s entire face (34%). With the GRUConv model, the maximum accuracies were 87% with known speakers and 63% in the validation with unknown speakers. The neural network for lip reading, which was first developed for the German language, shows a very high level of accuracy, comparable to English-language algorithms. It works with unknown speakers as well and can be generalized with more word classes.
zh

[CV-32] Satellite to GroundScape – Large-scale Consistent Ground View Generation from Satellite Views

【速读】:该论文旨在解决从卫星图像生成一致的地面视角图像的问题,主要挑战在于卫星与地面域之间存在显著的视角和分辨率差异。现有方法多聚焦于单一视角生成,导致相邻地面视角间的一致性不足。论文提出了一种新颖的跨视角合成方法,通过确保由卫星视角生成的地面视角图像之间的一致性来克服这些挑战。解决方案的关键在于基于固定潜扩散模型引入两个条件模块:卫星引导去噪模块,用于提取高级场景布局以指导去噪过程;以及卫星-时间去噪模块,用于捕捉相机运动以保持多视角输出之间的时间一致性。此外,研究还贡献了一个包含超过100,000组透视对的大规模卫星-地面数据集,以促进广泛的地面场景或视频生成。实验结果表明,该方法在感知和时间度量上优于现有技术,实现了高保真度和多视角输出的一致性。

链接: https://arxiv.org/abs/2504.15786
作者: Ningli Xu,Rongjun Qin
机构: The Ohio State University (俄亥俄州立大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 8 figures

点击查看摘要

Abstract:Generating consistent ground-view images from satellite imagery is challenging, primarily due to the large discrepancies in viewing angles and resolution between satellite and ground-level domains. Previous efforts mainly concentrated on single-view generation, often resulting in inconsistencies across neighboring ground views. In this work, we propose a novel cross-view synthesis approach designed to overcome these challenges by ensuring consistency across ground-view images generated from satellite views. Our method, based on a fixed latent diffusion model, introduces two conditioning modules: satellite-guided denoising, which extracts high-level scene layout to guide the denoising process, and satellite-temporal denoising, which captures camera motion to maintain consistency across multiple generated views. We further contribute a large-scale satellite-ground dataset containing over 100,000 perspective pairs to facilitate extensive ground scene or video generation. Experimental results demonstrate that our approach outperforms existing methods on perceptual and temporal metrics, achieving high photorealism and consistency in multi-view outputs.
zh

[CV-33] Towards prediction of morphological heart age from computed tomography angiography

【速读】:本文旨在解决利用医学影像或与健康相关的非影像数据进行年龄预测的问题,以推动基于数据驱动的衰老研究,探索特定组织或器官携带的与个体实际年龄的相关信息。论文的核心目标包括研究心脏形态与衰老之间的关系,并开发一种新的基于形态的心脏“生物学年龄”生物标志物。解决方案的关键在于提出了一种基于图像配准的方法,将整个队列的CTA图像标准化到同一空间,并通过无监督分割提取超级体素及其稳健的密度和局部体积特征,这些特征能够详细表示心脏形态同时对配准误差具有鲁棒性。随后,利用机器学习模型从这些特征回归实际年龄。该方法在SCAPIS数据集的一个子集上验证,包含721名女性和666名男性,结果显示女性和男性的平均绝对误差分别为2.74年和2.77年。此外,通过对不同感兴趣区域的预测结果观察发现,其一致性高于实际年龄,进一步证明了形态预测的可靠性。通过显著性分析确定了与预测年龄正相关或负相关的区域,生成了详细的关联图谱,不仅揭示了一些已知和新发现的兴趣区域的重要性,还提高了模型的可解释性。
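
论文具体使用的回归模型未在摘要中说明。下面用随机数据示意“超级体素特征 → 年龄回归 + 交叉验证 MAE”的评估流程(岭回归为替代选择,特征维度与年龄分布均为演示假设):

```python
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import mean_absolute_error

# 假设数据:每名受试者一条由超级体素密度/局部体积拼成的特征向量
rng = np.random.default_rng(0)
X = rng.normal(size=(721, 500))        # 721 名女性 × 500 维形态特征(示意)
age = rng.uniform(50, 65, size=721)    # 实际年龄

model = RidgeCV(alphas=np.logspace(-3, 3, 13))
pred = cross_val_predict(model, X, age, cv=5)  # 5 折交叉验证的年龄预测
print("MAE (years):", mean_absolute_error(age, pred))
```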

链接: https://arxiv.org/abs/2504.15783
作者: Johan Öfverstedt,Elin Lundström,Håkan Ahlström,Joel Kullberg
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 24 pages

点击查看摘要

Abstract:Age prediction from medical images or other health-related non-imaging data is an important approach to data-driven aging research, providing knowledge of how much information a specific tissue or organ carries about the chronological age of the individual. In this work, we studied the prediction of age from computed tomography angiography (CTA) images, which provide detailed representations of the heart morphology, with the goals of (i) studying the relationship between morphology and aging, and (ii) developing a novel \emphmorphological heart age biomarker. We applied an image registration-based method that standardizes the images from the whole cohort into a single space. We then extracted supervoxels (using unsupervised segmentation), and corresponding robust features of density and local volume, which provide a detailed representation of the heart morphology while being robust to registration errors. Machine learning models are then trained to fit regression models from these features to the chronological age. We applied the method to a subset of the images from the Swedish CArdioPulomonary bioImage Study (SCAPIS) dataset, consisting of 721 females and 666 males. We observe a mean absolute error of 2.74 years for females and 2.77 years for males. The predictions from different sub-regions of interest were observed to be more highly correlated with the predictions from the whole heart, compared to the chronological age, revealing a high consistency in the predictions from morphology. Saliency analysis was also performed on the prediction models to study what regions are associated positively and negatively with the predicted age. This resulted in detailed association maps where the density and volume of known, as well as some novel sub-regions of interest, are determined to be important. The saliency analysis aids in the interpretability of the models and their predictions.
zh

[CV-34] Model-based Metric 3D Shape and Motion Reconstruction of Wild Bottlenose Dolphins in Drone-Shot Videos

【速读】:本文旨在解决从单目视频中估计野生海豚的三维形状、运动以及由此评估其身体状况的问题。由于水下环境的复杂性,现有的针对陆地四足动物的三维重建技术难以直接应用于水生动物。为了解决这一挑战,论文提出了一种基于模型的方法,关键在于引入了一个考虑水体引起的遮挡效应的传输模型。该方法能够处理不同海洋条件下的视频数据,并通过估算质量和体积与传统基于人工二维测量的方法进行对比验证。

链接: https://arxiv.org/abs/2504.15782
作者: Daniele Baieri,Riccardo Cicciarella,Michael Krützen,Emanuele Rodolà,Silvia Zuffi
机构: University of Milano-Bicocca; University of Zurich; Sapienza University of Rome; IMATI-CNR
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
备注: 9 pages, 7 figures

点击查看摘要

Abstract:We address the problem of estimating the metric 3D shape and motion of wild dolphins from monocular video, with the aim of assessing their body condition. While considerable progress has been made in reconstructing 3D models of terrestrial quadrupeds, aquatic animals remain unexplored due to the difficulty of observing them in their natural underwater environment. To address this, we propose a model-based approach that incorporates a transmission model to account for water-induced occlusion. We apply our method to video captured under different sea conditions. We estimate mass and volume, and compare our results to a manual 2D measurements-based method.
zh

[CV-35] Pose Optimization for Autonomous Driving Datasets using Neural Rendering Models

【速读】:该论文旨在解决公共数据集中的传感器标定和车辆姿态潜在不准确性可能导致下游任务评估错误的问题,从而影响自动驾驶系统的可靠性和性能。为了解决这一挑战,论文提出了一种基于神经辐射场(Neural Radiance Fields, NeRF)的鲁棒优化方法,通过优化传感器姿态和标定参数来提升数据集基准的完整性。解决方案的关键在于利用NeRF技术对传感器姿态和标定参数进行精确优化,并通过重投影度量、新视角合成渲染质量以及几何对齐等评估过程验证优化后姿态的准确性,从而显著提高传感器姿态的精度。这种方法不仅提升了现有数据集的可用性,还为更可靠的自动驾驶模型奠定了基础。

链接: https://arxiv.org/abs/2504.15776
作者: Quentin Herau,Nathan Piasco,Moussab Bennehar,Luis Rolado,Dzmitry Tsishkou,Bingbing Liu,Cyrille Migniot,Pascal Vasseur,Cédric Demonceaux
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注: under review

点击查看摘要

Abstract:Autonomous driving systems rely on accurate perception and localization of the ego car to ensure safety and reliability in challenging real-world driving scenarios. Public datasets play a vital role in benchmarking and guiding advancement in research by providing standardized resources for model development and evaluation. However, potential inaccuracies in sensor calibration and vehicle poses within these datasets can lead to erroneous evaluations of downstream tasks, adversely impacting the reliability and performance of the autonomous systems. To address this challenge, we propose a robust optimization method based on Neural Radiance Fields (NeRF) to refine sensor poses and calibration parameters, enhancing the integrity of dataset benchmarks. To validate improvement in accuracy of our optimized poses without ground truth, we present a thorough evaluation process, relying on reprojection metrics, Novel View Synthesis rendering quality, and geometric alignment. We demonstrate that our method achieves significant improvements in sensor pose accuracy. By optimizing these critical parameters, our approach not only improves the utility of existing datasets but also paves the way for more reliable autonomous driving models. To foster continued progress in this field, we make the optimized sensor poses publicly available, providing a valuable resource for the research community.
zh

[CV-36] Multi-Scale Tensorial Summation and Dimensional Reduction Guided Neural Network for Edge Detection

【速读】:该论文旨在解决边缘检测任务中传统卷积神经网络需要深层结构以实现大感受野的问题,这通常导致计算资源的浪费及冗余信息的干扰。论文的关键创新在于提出了一种基于多尺度张量和维度约简(MTS-DR)模块的新网络架构MTS-DR-Net。其核心解决方案是引入MTS层与对应的MTS-DR块作为新的主干网络,通过维度约简模块去除冗余信息,使模型能够专注于相关特征子空间,同时在初始层即可获得大感受野。此外,网络末端还设计了一个权重U形精化模块以进一步提升边缘检测性能。实验结果验证了所提方法的有效性。

链接: https://arxiv.org/abs/2504.15770
作者: Lei Xu,Mehmet Yamac,Mete Ahishali,Moncef Gabbouj
机构: Tampere University (坦佩雷大学); University of Eastern Finland (东芬兰大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Edge detection has attracted considerable attention thanks to its exceptional ability to enhance performance in downstream computer vision tasks. In recent years, various deep learning methods have been explored for edge detection tasks resulting in a significant performance improvement compared to conventional computer vision algorithms. In neural networks, edge detection tasks require considerably large receptive fields to provide satisfactory performance. In a typical convolutional operation, such a large receptive field can be achieved by utilizing a significant number of consecutive layers, which yields deep network structures. Recently, a Multi-scale Tensorial Summation (MTS) factorization operator was presented, which can achieve very large receptive fields even from the initial layers. In this paper, we propose a novel MTS Dimensional Reduction (MTS-DR) module guided neural network, MTS-DR-Net, for the edge detection task. The MTS-DR-Net uses MTS layers, and corresponding MTS-DR blocks as a new backbone to remove redundant information initially. Such a dimensional reduction module enables the neural network to focus specifically on relevant information (i.e., necessary subspaces). Finally, a weight U-shaped refinement module follows MTS-DR blocks in the MTS-DR-Net. We conducted extensive experiments on two benchmark edge detection datasets: BSDS500 and BIPEDv2 to verify the effectiveness of our model. The implementation of the proposed MTS-DR-Net can be found at this https URL.
zh

[CV-37] DSDNet: Raw Domain Demoiréing via Dual Color-Space Synergy

【速读】:该论文旨在解决屏幕截图中摩尔纹(moire)引起的严重视觉退化问题。现有基于sRGB域的去摩尔纹方法存在不可逆的信息损失,而近期基于两阶段raw域的方法则面临信息瓶颈和推理效率低下的挑战。为克服这些局限性,论文提出了一种单阶段raw域去摩尔纹框架——Dual-Stream Demoiréing Network (DSDNet),其关键在于利用raw图像与YCbCr图像的协同作用,在去除摩尔纹的同时保持亮度和色彩保真度。具体而言,通过设计raw到YCbCr映射管道并引入Synergic Attention with Dynamic Modulation (SADM)模块,以增强跨域上下文特征的raw到sRGB转换;同时开发Luminance-Chrominance Adaptive Transformer (LCAT)模块,实现亮度和色度表示的解耦,从而更好地指导色彩保真。实验结果表明,DSDNet在视觉质量和定量评估方面均优于当前最先进的方法,并且推理速度比第二好的方法快2.4倍,凸显其实用优势。

链接: https://arxiv.org/abs/2504.15756
作者: Qirui Yang,Fangpu Zhang,Yeying Jin,Qihua Cheng,Pengtao Jiang,Huanjing Yue,Jingyu Yang
机构: Tianjin University (天津大学); National University of Singapore (新加坡国立大学); Shenzhen Bit Microelectronics Technology Co., Ltd (深圳比特微电子科技有限公司); vivo Mobile Communication Co., Ltd (维沃移动通信有限公司)
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注:

点击查看摘要

Abstract:With the rapid advancement of mobile imaging, capturing screens using smartphones has become a prevalent practice in distance learning and conference recording. However, moiré artifacts, caused by frequency aliasing between display screens and camera sensors, are further amplified by the image signal processing pipeline, leading to severe visual degradation. Existing sRGB domain demoiréing methods struggle with irreversible information loss, while recent two-stage raw domain approaches suffer from information bottlenecks and inference inefficiency. To address these limitations, we propose a single-stage raw domain demoiréing framework, Dual-Stream Demoiréing Network (DSDNet), which leverages the synergy of raw and YCbCr images to remove moiré while preserving luminance and color fidelity. Specifically, to guide luminance correction and moiré removal, we design a raw-to-YCbCr mapping pipeline and introduce the Synergic Attention with Dynamic Modulation (SADM) module. This module enriches the raw-to-sRGB conversion with cross-domain contextual features. Furthermore, to better guide color fidelity, we develop a Luminance-Chrominance Adaptive Transformer (LCAT), which decouples luminance and chrominance representations. Extensive experiments demonstrate that DSDNet outperforms state-of-the-art methods in both visual quality and quantitative evaluation, and achieves an inference speed 2.4× faster than the second-best method, highlighting its practical advantages. We provide an anonymous online demo at this https URL.
zh

[CV-38] GADS: A Super Lightweight Model for Head Pose Estimation

【速读】:该论文致力于解决基于人脸标志点(Facial Landmarks)的头部姿态估计算法在边缘设备和计算资源受限环境中的部署难题。传统方法虽精度较高,但模型复杂度和规模较大,限制了其在实际应用中的广泛使用。为了解决这一问题,论文提出了一种名为Grouped Attention Deep Sets (GADS)的新架构,基于Deep Set框架构建。其关键创新在于通过将地标分组并采用小型Deep Set层来降低计算复杂度,并引入多头注意力机制提取和融合组间信息。这种设计使得GADS模型比当前最轻量级的最先进的模型小7.5倍且快25倍,相较于最佳性能模型缩小了4321倍,从而显著提升了模型在资源受限场景下的实用性和效率。
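
“分组 Deep Set + 多头注意力”的组合可以用很少的代码表达。下面是一个 PyTorch 结构示意(组数、每组点数、隐层维度均为假设,非官方实现):

```python
import torch
import torch.nn as nn

class GroupedDeepSets(nn.Module):
    """每组关键点先用小型 Deep Set 编码,再用多头注意力交互组间信息(示意结构)。"""
    def __init__(self, dim: int = 64, num_outputs: int = 3):
        super().__init__()
        self.phi = nn.Sequential(nn.Linear(2, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.rho = nn.Sequential(nn.Linear(dim, dim), nn.ReLU())
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.head = nn.Linear(dim, num_outputs)  # 输出 yaw / pitch / roll

    def forward(self, landmarks: torch.Tensor) -> torch.Tensor:
        # landmarks: (B, 组数, 每组点数, 2) 的二维关键点坐标
        g = self.rho(self.phi(landmarks).sum(dim=2))  # Deep Set: 逐点编码后组内求和
        g, _ = self.attn(g, g, g)                     # 组间信息提取与融合
        return self.head(g.mean(dim=1))

model = GroupedDeepSets()
print(model(torch.randn(2, 5, 14, 2)).shape)  # torch.Size([2, 3])
```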

链接: https://arxiv.org/abs/2504.15751
作者: Menan Velayuthan,Asiri Gawesha,Purushoth Velayuthan,Nuwan Kodagoda,Dharshana Kasthurirathna,Pradeepa Samarasinghe
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 16 pages, 5 tables, 10 figures, not submitted to any conference or journal

点击查看摘要

Abstract:In human-computer interaction, head pose estimation profoundly influences application functionality. Although utilizing facial landmarks is valuable for this purpose, existing landmark-based methods prioritize precision over simplicity and model size, limiting their deployment on edge devices and in compute-poor environments. To bridge this gap, we propose Grouped Attention Deep Sets (GADS), a novel architecture based on the Deep Set framework. By grouping landmarks into regions and employing small Deep Set layers, we reduce computational complexity. Our multihead attention mechanism extracts and combines inter-group information, resulting in a model that is 7.5× smaller and executes 25× faster than the current lightest state-of-the-art model. Notably, our method achieves an impressive reduction, being 4321× smaller than the best-performing model. We introduce vanilla GADS and Hybrid-GADS (landmarks + RGB) and evaluate our models on three benchmark datasets – AFLW2000, BIWI, and 300W-LP. We envision our architecture as a robust baseline for resource-constrained head pose estimation methods.
zh

[CV-39] SAGA: Semantic-Aware Gray color Augmentation for Visible-to-Thermal Domain Adaptation across Multi-View Drone and Ground-Based Vision Systems CVPR

【速读】:该论文旨在解决跨模态(可见光到热成像)领域自适应热目标检测中的挑战,特别是由于红外图像缺乏色彩和纹理线索导致RGB训练模型在热图像上的表现不佳,从而引发假阳性增加及伪标签质量下降的问题。为应对这一难题,论文提出了一种名为语义感知灰度增强(Semantic-Aware Gray color Augmentation, SAGA)的关键解决方案,通过提取与热图像相关的物体级特征来减轻颜色偏差并弥合模态差异。此外,为了验证SAGA在无人机影像中的有效性,作者构建了一个包含5,612张图像和145,666个实例的多传感器(RGB-IR)数据集IndraEye,用于促进多模态学习、跨域目标检测与分割以及传感器特性研究。实验结果表明,结合最先进的领域自适应技术后,SAGA能够在自动驾驶场景及IndraEye数据集上实现0.4%至7.6%(mAP)的一致性能提升。
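
SAGA 的实现细节未在摘要中给出。一种直观的近似是只对掩码内的目标区域做灰度化,从而保留与红外成像相关的物体级亮度结构、削弱颜色偏差。下面是基于 OpenCV 的示意(掩码来源与整体流程均为本文假设):

```python
import numpy as np
import cv2

def semantic_gray_augment(img_bgr: np.ndarray, obj_mask: np.ndarray) -> np.ndarray:
    """仅将掩码内的目标区域灰度化,背景保持彩色(对 SAGA 思路的假设性简化)。"""
    gray = cv2.cvtColor(img_bgr, cv2.COLOR_BGR2GRAY)
    gray3 = cv2.cvtColor(gray, cv2.COLOR_GRAY2BGR)
    mask3 = np.repeat(obj_mask[..., None].astype(bool), 3, axis=2)
    return np.where(mask3, gray3, img_bgr)

img = np.random.randint(0, 256, (480, 640, 3), dtype=np.uint8)
mask = np.zeros((480, 640), dtype=np.uint8)
mask[100:300, 200:400] = 1  # 假设这是检测/分割得到的目标区域
aug = semantic_gray_augment(img, mask)
print(aug.shape)  # (480, 640, 3)
```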

链接: https://arxiv.org/abs/2504.15728
作者: Manjunath D,Aniruddh Sikdar,Prajwal Gurunath,Sumanth Udupa,Suresh Sundaram
机构: Indian Institute of Science (印度科学学院); Robert Bosch Centre for Cyber Physical Systems (罗伯特博世网络物理系统中心)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at CVPR-W PBVS 2025

点击查看摘要

Abstract:Domain-adaptive thermal object detection plays a key role in facilitating visible (RGB)-to-thermal (IR) adaptation by reducing the need for co-registered image pairs and minimizing reliance on large annotated IR datasets. However, inherent limitations of IR images, such as the lack of color and texture cues, pose challenges for RGB-trained models, leading to increased false positives and poor-quality pseudo-labels. To address this, we propose Semantic-Aware Gray color Augmentation (SAGA), a novel strategy for mitigating color bias and bridging the domain gap by extracting object-level features relevant to IR images. Additionally, to validate the proposed SAGA for drone imagery, we introduce the IndraEye, a multi-sensor (RGB-IR) dataset designed for diverse applications. The dataset contains 5,612 images with 145,666 instances, captured from diverse angles, altitudes, backgrounds, and times of day, offering valuable opportunities for multimodal learning, domain adaptation for object detection and segmentation, and exploration of sensor-specific strengths and weaknesses. IndraEye aims to enhance the development of more robust and accurate aerial perception systems, especially in challenging environments. Experimental results show that SAGA significantly improves RGB-to-IR adaptation for autonomous driving and IndraEye dataset, achieving consistent performance gains of +0.4% to +7.6% (mAP) when integrated with state-of-the-art domain adaptation techniques. The dataset and codes are available at this https URL.
zh

[CV-40] Structure-Preserving Zero-Shot Image Editing via Stage-Wise Latent Injection in Diffusion Models

【速读】:本文旨在解决零样本(zero-shot)图像编辑问题,即在不进行模型微调的情况下,通过文本描述或参考图像实现对源图像的精确编辑。论文的关键在于提出了一种基于扩散模型的框架,该框架统一了文本引导和参考引导的图像编辑方法。其核心解决方案包括利用扩散反演(diffusion inversion)和针对特定时间步的空文本嵌入(null-text embeddings)来保持源图像的结构完整性,并通过分阶段潜在注入策略(在早期步骤进行形状注入,在后期步骤进行属性注入)实现精细的局部修改同时确保全局一致性。此外,引入参考潜在变量的交叉注意力机制(cross-attention with reference latents)促进了源图像与参考图像之间的语义对齐。实验结果表明,该方法在表情迁移、纹理变换和风格融合等任务中表现出最先进的性能,验证了其在多种图像编辑场景中的可扩展性和适应性。
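
下面用 diffusers 风格的接口勾勒“分阶段潜在注入”的采样循环:前段步骤用反演得到的源潜变量锁定形状,后段步骤按文本/参考条件注入属性。unet 与 scheduler 的调用方式沿用 diffusers 惯例,src_latents、shape_cutoff 等名称与阈值均为示意假设,并非论文官方实现:

```python
import torch

@torch.no_grad()
def edit_with_stagewise_injection(unet, scheduler, src_latents: dict,
                                  text_emb: torch.Tensor,
                                  shape_cutoff: float = 0.3) -> torch.Tensor:
    """分阶段潜变量注入的示意采样循环。
    src_latents: 由 diffusion inversion 预先得到的 {timestep: latent} 字典。"""
    timesteps = scheduler.timesteps
    x = src_latents[int(timesteps[0])]
    n = len(timesteps)
    for i, t in enumerate(timesteps):
        if i < int(shape_cutoff * n):
            x = src_latents[int(t)]  # 形状注入:早期步骤用源潜变量锁定整体结构
        noise_pred = unet(x, t, encoder_hidden_states=text_emb).sample
        x = scheduler.step(noise_pred, t, x).prev_sample  # 后期步骤按条件注入属性
    return x
```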

链接: https://arxiv.org/abs/2504.15723
作者: Dasol Jeong,Donggoo Kang,Jiwon Park,Hyebean Lee,Joonki Paik
机构: Chung-Ang University (中央大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:We propose a diffusion-based framework for zero-shot image editing that unifies text-guided and reference-guided approaches without requiring fine-tuning. Our method leverages diffusion inversion and timestep-specific null-text embeddings to preserve the structural integrity of the source image. By introducing a stage-wise latent injection strategy (shape injection in early steps and attribute injection in later steps), we enable precise, fine-grained modifications while maintaining global consistency. Cross-attention with reference latents facilitates semantic alignment between the source and reference. Extensive experiments across expression transfer, texture transformation, and style infusion demonstrate state-of-the-art performance, confirming the method’s scalability and adaptability to diverse image editing scenarios.
zh

[CV-41] RePOPE: Impact of Annotation Errors on the POPE Benchmark

【速读】:该论文旨在评估MSCOCO标签错误对常用目标检测基准POPE的影响。论文通过重新标注POPE基准数据集(形成RePOPE),发现不同子集间存在标注错误的不平衡现象,并观察到模型排名因标注质量提升而发生显著变化。论文的关键解决方案在于重新标注数据以提高标签质量,并基于此修订后的数据集评估模型性能,从而揭示标签准确性对模型评价结果的重要影响。

链接: https://arxiv.org/abs/2504.15707
作者: Yannic Neuhaus,Matthias Hein
机构: Tübingen AI Center – University of Tübingen (图宾根人工智能中心–图宾根大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Since data annotation is costly, benchmark datasets often incorporate labels from established image datasets. In this work, we assess the impact of label errors in MSCOCO on the frequently used object hallucination benchmark POPE. We re-annotate the benchmark images and identify an imbalance in annotation errors across different subsets. Evaluating multiple models on the revised labels, which we denote as RePOPE, we observe notable shifts in model rankings, highlighting the impact of label quality. Code and data are available at this https URL .
zh

[CV-42] You Sense Only Once Beneath: Ultra-Light Real-Time Underwater Object Detection

【速读】:该论文旨在解决水下复杂环境(如低图像质量、计算资源受限)中目标检测模型准确性与效率难以兼顾的问题。为实现这一目标,论文提出了一种超轻量实时水下目标检测框架——You Sense Only Once Beneath (YSOOB)。其关键解决方案包括:(1) 利用多光谱小波编码器(MSWE)进行频域编码以减轻水下光学色差引起的语义损失;(2) 重新审视偶数尺寸卷积与转置卷积的独特特性,在重采样过程中动态选择并增强关键信息,提升模型泛化能力;(3) 通过简单有效的通道压缩与重构大核卷积(RLKC)消除模型冗余,实现轻量化设计。最终,YSOOB仅需120万参数即可达到高性能检测效果,并在多个数据集上表现出色,同时具备极高的推理速度。
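
多光谱小波编码的基本思路可以用一级二维离散小波变换演示:把每个颜色通道分解为 LL/LH/HL/HH 四个子带并堆叠成多通道输入。下面是基于 PyWavelets 的假设性简化(小波基与堆叠方式均为演示假设,非 MSWE 官方实现):

```python
import numpy as np
import pywt

def multispectrum_wavelet_encode(img: np.ndarray) -> np.ndarray:
    """对每个颜色通道做一级 2D DWT,把四个子带堆叠成 4C 通道、半分辨率的输入。"""
    bands = []
    for c in range(img.shape[2]):
        ll, (lh, hl, hh) = pywt.dwt2(img[:, :, c], "haar")
        bands.extend([ll, lh, hl, hh])
    return np.stack(bands, axis=0)  # (4C, H/2, W/2)

img = np.random.rand(256, 256, 3).astype(np.float32)
enc = multispectrum_wavelet_encode(img)
print(enc.shape)  # (12, 128, 128)
```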

链接: https://arxiv.org/abs/2504.15694
作者: Jun Dong,Wenli Wu,Jintao Cheng,Xiaoyu Tang
机构: School of Data Science and Engineering, and Xingzhi College, South China Normal University (华南师范大学数据科学与工程学院,兴智学院); School of Physics, South China Normal University (华南师范大学物理学院); School of Electronics and Information Engineering, and Xingzhi College, South China Normal University (华南师范大学电子与信息工程学院,兴智学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Despite the remarkable achievements in object detection, the model’s accuracy and efficiency still require further improvement under challenging underwater conditions, such as low image quality and limited computational resources. To address this, we propose an Ultra-Light Real-Time Underwater Object Detection framework, You Sense Only Once Beneath (YSOOB). Specifically, we utilize a Multi-Spectrum Wavelet Encoder (MSWE) to perform frequency-domain encoding on the input image, minimizing the semantic loss caused by underwater optical color distortion. Furthermore, we revisit the unique characteristics of even-sized and transposed convolutions, allowing the model to dynamically select and enhance key information during the resampling process, thereby improving its generalization ability. Finally, we eliminate model redundancy through a simple yet effective channel compression and reconstructed large kernel convolution (RLKC) to achieve a lightweight model. As a result, we obtain a high-performance underwater object detector, YSOOB, with only 1.2 million parameters. Extensive experimental results demonstrate that, with the fewest parameters, YSOOB achieves mAP50 of 83.1% and 82.9% on the URPC2020 and DUO datasets, respectively, comparable to the current SOTA detectors. The inference speed reaches 781.3 FPS and 57.8 FPS on the T4 GPU (TensorRT FP16) and the edge computing device Jetson Xavier NX (TensorRT FP16), surpassing YOLOv12-N by 28.1% and 22.5%, respectively.
zh

[CV-43] Vidi: Large Multimodal Models for Video Understanding and Editing

【速读】:该论文旨在解决视频编辑场景中长视频内容高效理解和检索的问题。传统模型在处理多模态数据(如视觉、音频、文本)以及灵活输入长度(如长达数小时的原始视频)时面临显著挑战。论文的关键解决方案是提出了一种名为Vidi的大型多模态模型(Large Multimodal Models, LMMs)家族,专注于时间检索任务,即从输入视频中识别与给定文本查询相关的时间范围。Vidi具备强大的时间理解能力,能够处理长达数小时的视频,并通过引入VUE-TR基准来支持真实场景的全面评估,该基准在视频时长、音频支持、查询格式、标注质量和评估指标等方面实现了五项重要改进。实验结果显示,Vidi在时间检索任务中显著优于领先的专有模型(如GPT-4o和Gemini),表明其在视频编辑场景中的优越性。
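
VUE-TR 中“支持多时间区间的改进 IoU”未给出公式。作为参考,下面是一种常见的多区间 IoU 计算方式(假设预测与真值各自集合内部的区间互不重叠;与基准实际采用的度量可能不同):

```python
def multi_range_iou(pred: list[tuple[float, float]],
                    gt: list[tuple[float, float]]) -> float:
    """多段时间区间的整体 IoU:交集为逐对重叠之和,并集为合并后的总长度。"""
    def union_len(spans):
        spans = sorted(spans)
        total, cur_s, cur_e = 0.0, *spans[0]
        for s, e in spans[1:]:
            if s > cur_e:                 # 与当前合并段不相交,先结算
                total += cur_e - cur_s
                cur_s, cur_e = s, e
            else:
                cur_e = max(cur_e, e)     # 相交则扩展合并段
        return total + (cur_e - cur_s)

    inter = sum(max(0.0, min(pe, ge) - max(ps, gs))
                for ps, pe in pred for gs, ge in gt)
    union = union_len(pred + gt)
    return inter / union if union > 0 else 0.0

print(multi_range_iou([(10, 20), (40, 50)], [(12, 22), (45, 55)]))  # ≈0.481
```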

链接: https://arxiv.org/abs/2504.15681
作者: Vidi Team,Celong Liu,Chia-Wen Kuo,Dawei Du,Fan Chen,Guang Chen,Jiamin Yuan,Lingxi Zhang,Lu Guo,Lusha Li,Longyin Wen,Qingyu Chen,Rachel Deng,Sijie Zhu,Stuart Siew,Tong Jin,Wei Lu,Wen Zhong,Xiaohui Shen,Xin Gu,Xing Mei,Xueqiong Qu
机构: ByteDance Inc. (字节跳动)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Humans naturally share information with those they are connected to, and video has become one of the dominant mediums for communication and expression on the Internet. To support the creation of high-quality large-scale video content, a modern pipeline requires a comprehensive understanding of both the raw input materials (e.g., the unedited footage captured by cameras) and the editing components (e.g., visual effects). In video editing scenarios, models must process multiple modalities (e.g., vision, audio, text) with strong background knowledge and handle flexible input lengths (e.g., hour-long raw videos), which poses significant challenges for traditional models. In this report, we introduce Vidi, a family of Large Multimodal Models (LMMs) for a wide range of video understanding and editing scenarios. The first release focuses on temporal retrieval, i.e., identifying the time ranges within the input videos corresponding to a given text query, which plays a critical role in intelligent editing. The model is capable of processing hour-long videos with strong temporal understanding capability, e.g., retrieving time ranges for certain queries. To support a comprehensive evaluation in real-world scenarios, we also present the VUE-TR benchmark, which introduces five key advancements. 1) Video duration: significantly longer than existing temporal retrieval datasets, 2) Audio support: includes audio-based queries, 3) Query format: diverse query lengths/formats, 4) Annotation quality: ground-truth time ranges are manually annotated. 5) Evaluation metric: a refined IoU metric to support evaluation over multiple time ranges. Remarkably, Vidi significantly outperforms leading proprietary models, e.g., GPT-4o and Gemini, on the temporal retrieval task, indicating its superiority in video editing scenarios.
zh

[CV-44] DINOv2-powered Few-Shot Semantic Segmentation: A Unified Framework via Cross-Model Distillation and 4D Correlation Mining

【速读】:该论文旨在解决Few-shot语义分割任务中的数据稀缺问题,通过构建一个统一模型整合来自基础模型的知识来增强表征的迁移能力。论文的关键在于提出FS-DINO模型,它仅利用DINOv2的编码器和一个轻量级分割器,并通过粗到细的跨模型蒸馏方式将SAM的知识有效融入其中,同时结合支持-查询对的4D相关性挖掘进一步优化分割性能。实验结果验证了所提方法在COCO-20i、PASCAL-5i以及FSS-1000数据集上的有效性和优越性。

链接: https://arxiv.org/abs/2504.15669
作者: Wei Zhuo,Zhiyue Tang,Wufeng Xue,Hao Ding,Linlin Shen
机构: School of Artificial Intelligence, Shenzhen University (深圳大学人工智能学院); National Engineering Laboratory of Big Data System Computing Technology, Shenzhen University (深圳大学大数据系统计算技术国家工程实验室); School of Biomedical Engineering, Shenzhen University Medical School, Shenzhen University (深圳大学医学部生物医学工程学院); School of Computer Science and Software Engineering, Shenzhen University (深圳大学计算机科学与软件工程学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Few-shot semantic segmentation has gained increasing interest due to its generalization capability, i.e., segmenting pixels of novel classes requiring only a few annotated images. Prior work has focused on meta-learning for support-query matching, with extensive development in both prototype-based and aggregation-based methods. To address data scarcity, recent approaches have turned to foundation models to enhance representation transferability for novel class segmentation. Among them, a hybrid dual-modal framework including both DINOv2 and SAM has garnered attention due to their complementary capabilities. We wonder “can we build a unified model with knowledge from both foundation models?” To this end, we propose FS-DINO, with only DINOv2’s encoder and a lightweight segmenter. The segmenter features a bottleneck adapter, a meta-visual prompt generator based on dense similarities and semantic embeddings, and a decoder. Through coarse-to-fine cross-model distillation, we effectively integrate SAM’s knowledge into our lightweight segmenter, which can be further enhanced by 4D correlation mining on support-query pairs. Extensive experiments on COCO-20i, PASCAL-5i, and FSS-1000 demonstrate the effectiveness and superiority of our method.
zh

[CV-45] Motion-Enhanced Nonlocal Similarity Implicit Neural Representation for Infrared Dim and Small Target Detection

【速读】:该论文旨在解决红外弱小目标检测中的动态多帧场景下背景复杂及弱目标信号难以捕捉的问题。传统低秩加稀疏模型常因无法有效捕获动态背景和全局时空相关性而导致背景泄漏或目标丢失。论文的关键解决方案在于提出了一种基于运动增强的非局部相似隐式神经表示(INR)框架。首先通过光流进行运动估计以捕捉微小的目标移动,并采用多帧融合提升运动显著性;其次利用非局部相似性构建具有强低秩特性的块张量,并提出一种基于张量分解的INR模型来表示非局部块张量,从而通过连续神经表示有效地编码背景的非局部低秩性和时空相关性。所发展的非局部INR模型采用乘子交替方向法,具备理论上的固定点收敛性。实验结果表明,该方法能够稳健地从复杂的红外背景下分离出弱小目标,在检测精度和鲁棒性方面优于现有先进方法。
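
“光流估计 + 多帧融合增强运动显著性”这一步可以用 OpenCV 的 Farneback 光流做一个简化示意(窗口大小等参数与多帧平均方式均为本文假设,非论文原流程):

```python
import cv2
import numpy as np

def motion_saliency(frames: list[np.ndarray]) -> np.ndarray:
    """对相邻帧估计稠密光流,把流幅值在多帧上平均,作为运动显著性图(示意)。"""
    acc = np.zeros(frames[0].shape, dtype=np.float32)
    for prev, nxt in zip(frames[:-1], frames[1:]):
        flow = cv2.calcOpticalFlowFarneback(prev, nxt, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        acc += np.linalg.norm(flow, axis=2)  # 逐像素运动幅值
    return acc / (len(frames) - 1)

frames = [np.random.randint(0, 256, (128, 128), dtype=np.uint8) for _ in range(5)]
sal = motion_saliency(frames)
print(sal.shape)  # (128, 128)
```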

链接: https://arxiv.org/abs/2504.15665
作者: Pei Liu,Yisi Luo,Wenzhen Wang,Xiangyong Cao
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Infrared dim and small target detection presents a significant challenge due to dynamic multi-frame scenarios and weak target signatures in the infrared modality. Traditional low-rank plus sparse models often fail to capture dynamic backgrounds and global spatial-temporal correlations, which results in background leakage or target loss. In this paper, we propose a novel motion-enhanced nonlocal similarity implicit neural representation (INR) framework to address these challenges. We first integrate motion estimation via optical flow to capture subtle target movements, and propose multi-frame fusion to enhance motion saliency. Second, we leverage nonlocal similarity to construct patch tensors with strong low-rank properties, and propose an innovative tensor decomposition-based INR model to represent the nonlocal patch tensor, effectively encoding both the nonlocal low-rankness and spatial-temporal correlations of background through continuous neural representations. An alternating direction method of multipliers is developed for the nonlocal INR model, which enjoys theoretical fixed-point convergence. Experimental results show that our approach robustly separates dim targets from complex infrared backgrounds, outperforming state-of-the-art methods in detection accuracy and robustness.
zh

[CV-46] An XAI-based Analysis of Shortcut Learning in Neural Networks

【速读】:该论文试图解决机器学习模型过度依赖虚假特征(spurious features)的问题。虚假特征是指与目标标签强相关但非因果关系的特征。现有方法在某些情况下可以缓解这一问题,但在其他情况下则失效。论文的关键在于引入了一种基于可解释人工智能(XAI)的诊断度量——神经元虚假评分(neuron spurious score),用于量化神经网络中特定神经元对虚假特征的依赖程度。通过分析卷积神经网络(CNNs)和视觉变换器(ViTs)的架构特性,研究发现虚假特征在一定程度上被解耦,但不同模型架构之间的解耦程度存在差异。此外,论文指出现有缓解方法背后的假设并不完整。这些结果为开发新型方法以更有效地减轻虚假相关性奠定了基础,从而提高实际应用中AI模型的安全性和可靠性。
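
论文提出的 neuron spurious score 的精确定义未出现在摘要中。下面给出一种假设性的简化:比较“仅含核心特征”与“叠加虚假特征”两组输入引起的逐神经元激活差异,并做归一化;仅用于说明这类度量如何用 forward hook 实现:

```python
import torch

@torch.no_grad()
def neuron_spurious_score(model, x_core: torch.Tensor, x_spur: torch.Tensor,
                          layer: torch.nn.Module) -> torch.Tensor:
    """估计某层各神经元对虚假特征的依赖程度(度量形式为本文假设)。"""
    acts = {}
    h = layer.register_forward_hook(lambda m, i, o: acts.__setitem__("a", o))
    model(x_core)
    a_core = acts["a"].clone()
    model(x_spur)
    a_spur = acts["a"]
    h.remove()
    diff = (a_spur - a_core).abs().mean(dim=0)  # 每个神经元的平均激活变化
    return diff.flatten() / (a_core.abs().mean(dim=0).flatten() + 1e-6)

net = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, 16))
scores = neuron_spurious_score(net, torch.randn(8, 3, 32, 32),
                               torch.randn(8, 3, 32, 32), layer=net[1])
print(scores.shape)  # torch.Size([16])
```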

链接: https://arxiv.org/abs/2504.15664
作者: Phuong Quynh Le,Jörg Schlötterer,Christin Seifert
机构: 未知
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at The World Conference on eXplainable Artificial Intelligence 2025 (XAI-2025)

点击查看摘要

Abstract:Machine learning models tend to learn spurious features - features that strongly correlate with target labels but are not causal. Existing approaches to mitigate models’ dependence on spurious features work in some cases, but fail in others. In this paper, we systematically analyze how and where neural networks encode spurious correlations. We introduce the neuron spurious score, an XAI-based diagnostic measure to quantify a neuron’s dependence on spurious features. We analyze both convolutional neural networks (CNNs) and vision transformers (ViTs) using architecture-specific methods. Our results show that spurious features are partially disentangled, but the degree of disentanglement varies across model architectures. Furthermore, we find that the assumptions behind existing mitigation methods are incomplete. Our results lay the groundwork for the development of novel methods to mitigate spurious correlations and make AI models safer to use in practice.
zh

[CV-47] DiTPainter: Efficient Video Inpainting with Diffusion Transformers

【速读】:该论文试图解决现有视频修复算法在处理不准确光流或大遮挡时可能出现模糊和时空不一致的问题。同时,针对Diffusion Transformer (DiT) 在视频生成任务中的潜力,论文提出了一种新的端到端视频修复模型 DiTPainter。其解决方案的关键在于设计了一个专为视频修复任务优化的高效Transformer网络,并从零开始训练而非基于任何大规模预训练模型初始化,从而在保证性能的同时降低了计算开销,使其能够高效应用于任意长度视频的修复、去字幕及补全等任务。

链接: https://arxiv.org/abs/2504.15661
作者: Xian Wu,Chang Liu
机构: ByteDance
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Many existing video inpainting algorithms utilize optical flows to construct the corresponding maps and then propagate pixels from adjacent frames to missing areas by mapping. Despite the effectiveness of the propagation mechanism, they might encounter blurry and inconsistencies when dealing with inaccurate optical flows or large masks. Recently, Diffusion Transformer (DiT) has emerged as a revolutionary technique for video generation tasks. However, pretrained DiT models for video generation all contain a large amount of parameters, which makes it very time consuming to apply to video inpainting tasks. In this paper, we present DiTPainter, an end-to-end video inpainting model based on Diffusion Transformer (DiT). DiTPainter uses an efficient transformer network designed for video inpainting, which is trained from scratch instead of initializing from any large pretrained models. DiTPainter can address videos with arbitrary lengths and can be applied to video decaptioning and video completion tasks with an acceptable time cost. Experiments show that DiTPainter outperforms existing video inpainting algorithms with higher quality and better spatial-temporal consistency.
zh

[CV-48] A Vision-Enabled Prosthetic Hand for Children with Upper Limb Disabilities

【速读】:该论文旨在解决儿童(10-12岁)上肢残疾者使用传统肌电假手面临的高成本和功能局限性的问题。解决方案的关键在于结合3D打印技术、先进机器视觉、传感技术和嵌入式计算,设计了一款仿生外观、多关节活动且轻量化的智能假手。通过在腕部集成微摄像头与低功耗FPGA实现实时物体检测,并结合基于深度学习的物体检测与抓取分类模型,实现了96%和100%的高精度性能,同时在力预测方面达到0.018的平均绝对误差。其核心技术优势包括:a) 腕戴式微摄像头用于人工感知,支持多样化的手部任务;b) 实时物体检测与距离估计以实现精准抓握;c) 超低功耗运行,在受限的能源与资源条件下提供高性能支持。

链接: https://arxiv.org/abs/2504.15654
作者: Md Abdul Baset Sarker,Art Nguyen,Sigmond Kukla,Kevin Fite,Masudul H. Imtiaz
机构: Clarkson University (克拉克森大学)
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:This paper introduces a novel AI vision-enabled pediatric prosthetic hand designed to assist children aged 10-12 with upper limb disabilities. The prosthesis features an anthropomorphic appearance, multi-articulating functionality, and a lightweight design that mimics a natural hand, making it both accessible and affordable for low-income families. Using 3D printing technology and integrating advanced machine vision, sensing, and embedded computing, the prosthetic hand offers a low-cost, customizable solution that addresses the limitations of current myoelectric prostheses. A micro camera is interfaced with a low-power FPGA for real-time object detection and assists with precise grasping. The onboard DL-based object detection and grasp classification models achieved accuracies of 96% and 100% respectively. In the force prediction, the mean absolute error was found to be 0.018. The features of the proposed prosthetic hand can thus be summarized as: a) a wrist-mounted micro camera for artificial sensing, enabling a wide range of hand-based tasks; b) real-time object detection and distance estimation for precise grasping; and c) ultra-low-power operation that delivers high performance within constrained power and resource limits.
zh

[CV-49] AffordanceSAM: Segment Anything Once More in Affordance Grounding

【速读】:该论文旨在解决提升操作性先验(affordance)接地模型在识别未见物体及其功能区域方面泛化能力的问题,以满足真实世界应用的需求。当前模型在这方面仍有较大差距。为解决此问题,论文提出了一种名为AffordanceSAM的有效方法,其关键是扩展了SAM(Segment Anything Model)在操作性接地领域的泛化能力。具体而言,通过引入一种操作性适配模块(affordance-adaption module),帮助调整SAM的分割输出以适应特定的功能区域需求;同时采用从粗到细的训练策略,使SAM能够初步识别操作性对象和动作,再精细生成操作性热图(affordance heatmaps)。实验结果表明,AffordanceSAM不仅在AGD20K基准数据集上超越了现有方法,还展示了处理新型物体及其功能的能力。

链接: https://arxiv.org/abs/2504.15650
作者: Dengyang Jiang,Mengmeng Wang,Teli Ma,Hengzhuang Li,Yong liu,Guang Dai,Lei Zhang
机构: NWPU; SGIT AI Lab; ZJUT; HKUST(GZ); HUST; ZJU
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: SAM Meets Affordance Grounding

点击查看摘要

Abstract:Improving the generalization ability of an affordance grounding model to recognize regions for unseen objects and affordance functions is crucial for real-world application. However, current models are still far away from such standards. To address this problem, we introduce AffordanceSAM, an effective approach that extends SAM’s generalization capacity to the domain of affordance grounding. To thoroughly transfer SAM’s robust segmentation performance to affordance grounding, we first propose an affordance-adaption module that helps modify SAM’s segmentation output to fit the specific functional regions required for affordance grounding. We also design a coarse-to-fine training recipe that first makes SAM coarsely aware of affordance objects and actions, and then enables it to generate fine-grained affordance heatmaps. Both quantitative and qualitative experiments show the strong generalization capacity of our AffordanceSAM, which not only surpasses previous methods under the AGD20K benchmark but also shows evidence of handling the task with novel objects and affordance functions.
zh

[CV-50] ZeroSlide: Is Zero-Shot Classification Adequate for Lifelong Learning in Whole-Slide Image Analysis in the Era of Pathology Vision-Language Foundation Models?

【速读】:该论文旨在解决在分布式持续学习框架下,利用单一统一模型完成多种与 Whole Slide Images (WSIs) 相关任务(如癌症亚型分类和肿瘤分类)的挑战。传统方法需要每次定义新任务时重新训练模型,耗时且低效。随着视觉-语言基础模型的兴起,这些模型能否通过零样本分类(zero-shot classification)实现 WSIs 的终身学习成为关键问题。论文的关键在于比较传统的连续学习方法与基于视觉-语言模型的零样本分类策略在 WSIs 应用中的性能,以评估是否需要进一步研究持续学习策略以提升模型表现。

链接: https://arxiv.org/abs/2504.15627
作者: Doanh C. Bui,Hoai Luan Pham,Vu Trung Duong Le,Tuan Hai Vu,Van Duy Tran,Yasuhiko Nakashima
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 10 pages, 3 figures, 1 table, conference submission

点击查看摘要

Abstract:Lifelong learning for whole slide images (WSIs) poses the challenge of training a unified model to perform multiple WSI-related tasks, such as cancer subtyping and tumor classification, in a distributed, continual fashion. This is a practical and applicable problem in clinics and hospitals, as WSIs are large, require storage, processing, and transfer time. Training new models whenever new tasks are defined is time-consuming. Recent work has applied regularization- and rehearsal-based methods to this setting. However, the rise of vision-language foundation models that align diagnostic text with pathology images raises the question: are these models alone sufficient for lifelong WSI learning using zero-shot classification, or is further investigation into continual learning strategies needed to improve performance? To our knowledge, this is the first study to compare conventional continual-learning approaches with vision-language zero-shot classification for WSIs. Our source code and experimental results will be available soon.
zh

[CV-51] FaceInsight: A Multimodal Large Language Model for Face Perception

【速读】:该论文旨在解决现有多模态大型语言模型(Multimodal Large Language Models, MLLMs)在人脸感知任务中的不足,这些模型在处理与人脸相关的特定查询时往往表现出不准确或误导性的响应。为了解决这一问题,论文提出了一种名为FaceInsight的新方法,这是一种专门用于人脸感知的多功能多模态大型语言模型,能够提供细粒度的人脸信息。其解决方案的关键在于引入了人脸知识的视觉-文本对齐技术,以建模人脸信息之间的不确定依赖关系和确定性关系,从而缓解由语言驱动推理带来的限制;同时,通过将人脸分割图作为辅助感知模态,融入局部化的结构线索,进一步增强语义理解能力。实验结果表明,FaceInsight在未经过训练和微调两种设置下均显著优于九个对比模型。

链接: https://arxiv.org/abs/2504.15624
作者: Jingzhi Li,Changjiang Luo,Ruoyu Chen,Hua Zhang,Wenqi Ren,Jianhou Gan,Xiaochun Cao
机构: Institute of Information Engineering, Chinese Academy of Sciences (中国科学院信息工程研究所); School of Cyber Security, University of Chinese Academy of Sciences (中国科学院大学网络空间学院); School of Cyber Science and Technology, Shenzhen Campus of Sun Yat-sen University (中山大学深圳校区网络科学与技术学院); Key Laboratory of Education Informatization for Nationalities (Yunnan Normal University), Ministry of Education (教育部民族教育信息化重点实验室(云南师范大学))
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Recent advances in multimodal large language models (MLLMs) have demonstrated strong capabilities in understanding general visual content. However, these general-domain MLLMs perform poorly in face perception tasks, often producing inaccurate or misleading responses to face-specific queries. To address this gap, we propose FaceInsight, the versatile face perception MLLM that provides fine-grained facial information. Our approach introduces visual-textual alignment of facial knowledge to model both uncertain dependencies and deterministic relationships among facial information, mitigating the limitations of language-driven reasoning. Additionally, we incorporate face segmentation maps as an auxiliary perceptual modality, enriching the visual input with localized structural cues to enhance semantic understanding. Comprehensive experiments and analyses across three face perception tasks demonstrate that FaceInsight consistently outperforms nine compared MLLMs under both training-free and fine-tuned settings.
zh

[CV-52] AdaViP: Aligning Multi-modal LLM s via Adaptive Vision-enhanced Preference Optimization

[Quick Read]: This paper tackles a key problem in preference alignment for multimodal large language models (MLLMs): existing methods focus on language preferences while neglecting the critical visual context. To address this limitation, the authors propose Adaptive Vision-enhanced Preference optimization (AdaViP), whose key innovations are: (1) vision-based preference pair construction, which integrates multiple visual foundation models to strategically remove key visual elements from images, sharpening the MLLM's sensitivity to visual details; and (2) adaptive preference optimization, which dynamically balances vision- and language-based preferences for more accurate alignment. Across benchmarks, AdaViP-7B reduces response-level and mention-level hallucination on Object HalBench by 93.7% and 96.4%, respectively, clearly outperforming current state-of-the-art methods.

Link: https://arxiv.org/abs/2504.15619
Authors: Jinda Lu, Jinghan Li, Yuan Gao, Junkang Wu, Jiancan Wu, Xiang Wang, Xiangnan He
Affiliations: University of Science and Technology of China
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Preference alignment through Direct Preference Optimization (DPO) has demonstrated significant effectiveness in aligning multimodal large language models (MLLMs) with human preferences. However, existing methods focus primarily on language preferences while neglecting the critical visual context. In this paper, we propose an Adaptive Vision-enhanced Preference optimization (AdaViP) that addresses these limitations through two key innovations: (1) vision-based preference pair construction, which integrates multiple visual foundation models to strategically remove key visual elements from the image, enhancing MLLMs’ sensitivity to visual details; and (2) adaptive preference optimization that dynamically balances vision- and language-based preferences for more accurate alignment. Extensive evaluations across different benchmarks demonstrate our effectiveness. Notably, our AdaViP-7B achieves 93.7% and 96.4% reductions in response-level and mentioned-level hallucination respectively on the Object HalBench, significantly outperforming current state-of-the-art methods.

[CV-53] SocialMOIF: Multi-Order Intention Fusion for Pedestrian Trajectory Prediction

[Quick Read]: This paper addresses the challenges of agent trajectory analysis and prediction in intelligent systems, particularly the high uncertainty of short-term trajectory forecasting. Current research is limited by the inherent uncertainty of agent intentions and the complex higher-order interactions among neighboring groups. The proposed SocialMOIF meets these challenges by attending to higher-order intention interactions among neighboring groups while reinforcing the primary role of first-order intention interactions between neighbors and the target agent. A multi-order intention fusion model captures both direct and indirect intention information; a trajectory distribution approximator guides trajectories toward values closer to the real data, improving interpretability; and a global trajectory optimizer enables more accurate and efficient parallel prediction. With a novel loss function that accounts for distance and direction during training, the model surpasses previous state-of-the-art baselines on both dynamic and static datasets.

Link: https://arxiv.org/abs/2504.15616
Authors: Kai Chen, Xiaodong Zhao, Yujie Huang, Guoyu Fang, Xiao Song, Ruiping Wang, Ziyuan Wang
Affiliations: Nanjing University of Aeronautics and Astronautics; Beihang University; Nanyang Technological University
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: 11 pages, 6 figures

Abstract:The analysis and prediction of agent trajectories are crucial for decision-making processes in intelligent systems, with precise short-term trajectory forecasting being highly significant across a range of applications. Agents and their social interactions have been quantified and modeled by researchers from various perspectives; however, substantial limitations exist in the current work due to the inherent high uncertainty of agent intentions and the complex higher-order influences among neighboring groups. SocialMOIF is proposed to tackle these challenges, concentrating on the higher-order intention interactions among neighboring groups while reinforcing the primary role of first-order intention interactions between neighbors and the target agent. This method develops a multi-order intention fusion model to achieve a more comprehensive understanding of both direct and indirect intention information. Within SocialMOIF, a trajectory distribution approximator is designed to guide the trajectories toward values that align more closely with the actual data, thereby enhancing model interpretability. Furthermore, a global trajectory optimizer is introduced to enable more accurate and efficient parallel predictions. By incorporating a novel loss function that accounts for distance and direction during training, experimental results demonstrate that the model outperforms previous state-of-the-art baselines across multiple metrics in both dynamic and static datasets.

[CV-54] HS-Mamba: Full-Field Interaction Multi-Groups Mamba for Hyperspectral Image Classification

[Quick Read]: This paper addresses the challenges of hyperspectral image (HSI) classification, in particular the limitations of the vanilla Mamba architecture when handling high-dimensional data and feature embedding. The proposed full-field interaction multi-groups Mamba framework (HS-Mamba) combines local and global feature representations: non-overlapping patches cut from the whole image are fed to multi-group Mamba blocks, together with positional information, to decouple and model local spatial-spectral features, while the whole image is fed to a lightweight global attention module to strengthen the global feature representation. Concretely, HS-Mamba consists of a dual-channel spatial-spectral encoder (DCSS-encoder) module and a lightweight global inline attention (LGI-Att) branch, and achieves high-accuracy HSI classification by fusing local and global features. Experiments show that the method outperforms state-of-the-art approaches on four benchmark HSI datasets.

Link: https://arxiv.org/abs/2504.15612
Authors: Hongxing Peng, Kang Lin, Huanai Liu
Affiliations: College of Mathematics and Informatics, South China Agricultural University; School of Chemistry and Chemical Engineering, South China University of Technology
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Hyperspectral image (HSI) classification has been one of the hot topics in remote sensing fields. Recently, the Mamba architecture based on selective state-space models (S6) has demonstrated great advantages in long sequence modeling. However, the unique properties of hyperspectral data, such as high dimensionality and feature inlining, pose challenges to the application of Mamba to HSI classification. To compensate for these shortcomings, we propose a full-field interaction multi-groups Mamba framework (HS-Mamba), which adopts a strategy different from pixel-patch-based or whole-image-based approaches, combining the advantages of both. The patches cut from the whole image are sent to multi-groups Mamba, combined with positional information to perceive local inline features in the spatial and spectral domains, and the whole image is sent to a lightweight attention module to enhance the global feature representation ability. Specifically, HS-Mamba consists of a dual-channel spatial-spectral encoder (DCSS-encoder) module and a lightweight global inline attention (LGI-Att) branch. The DCSS-encoder module uses multiple groups of Mamba to decouple and model the local features of dual-channel sequences with non-overlapping patches. The LGI-Att branch uses a lightweight compressed and extended attention module to perceive the global features of the spatial and spectral domains of the unsegmented whole image. By fusing local and global features, high-precision classification of hyperspectral images is achieved. Extensive experiments demonstrate the superiority of the proposed HS-Mamba, outperforming state-of-the-art methods on four benchmark HSI datasets.

[CV-55] SonarT165: A Large-scale Benchmark and STFTrack Framework for Acoustic Object Tracking

[Quick Read]: This paper addresses the lack of a unified evaluation benchmark for underwater acoustic object tracking (UAOT), which has severely limited the value of existing methods. To ease this limitation, the authors present SonarT165, the first large-scale UAOT benchmark, comprising 165 square sequences, 165 fan sequences, and 205K high-quality annotations. Experiments show that SonarT165 exposes the limitations of current state-of-the-art single object tracking (SOT) trackers. The authors therefore propose STFTrack, an efficient acoustic object tracking framework with two novel modules: a multi-view template fusion module (MTFM) and an optimal trajectory correction module (OTCM). MTFM integrates multi-view features of the original image and the binarized image of the dynamic template, and introduces a cross-attention-like layer to fuse spatio-temporal target representations; OTCM exploits the acoustic-response-equivalent pixel property and proposes normalized pixel brightness response scores to suppress suboptimal matches caused by inaccurate Kalman filter prediction boxes. STFTrack further introduces an acoustic image enhancement method and a Frequency Enhancement Module (FEM) into its tracking pipeline, and achieves state-of-the-art performance on the proposed benchmark. The code is available at the link provided.

Link: https://arxiv.org/abs/2504.15609
Authors: Yunfeng Li, Bo Wang, Jiahao Wan, Xueyi Wu, Ye Li
Affiliations: National Natural Science Foundation of China; National Key Research and Development Program of China; National Key Laboratory Foundation of Autonomous Marine Vehicle Technology
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Underwater observation systems typically integrate optical cameras and imaging sonar systems. When underwater visibility is insufficient, only sonar systems can provide stable data, which necessitates exploration of the underwater acoustic object tracking (UAOT) task. Previous studies have explored traditional methods and Siamese networks for UAOT. However, the absence of a unified evaluation benchmark has significantly constrained the value of these methods. To alleviate this limitation, we propose the first large-scale UAOT benchmark, SonarT165, comprising 165 square sequences, 165 fan sequences, and 205K high-quality annotations. Experimental results demonstrate that SonarT165 reveals limitations in current state-of-the-art SOT trackers. To address these limitations, we propose STFTrack, an efficient framework for acoustic object tracking. It includes two novel modules, a multi-view template fusion module (MTFM) and an optimal trajectory correction module (OTCM). The MTFM module integrates multi-view features of both the original image and the binary image of the dynamic template, and introduces a cross-attention-like layer to fuse the spatio-temporal target representations. The OTCM module introduces the acoustic-response-equivalent pixel property and proposes normalized pixel brightness response scores, thereby suppressing suboptimal matches caused by inaccurate Kalman filter prediction boxes. To further improve the model features, STFTrack introduces an acoustic image enhancement method and a Frequency Enhancement Module (FEM) into its tracking pipeline. Comprehensive experiments show the proposed STFTrack achieves state-of-the-art performance on the proposed benchmark. The code is available at this https URL.

[CV-56] Multi-Modal Fusion of In-Situ Video Data and Process Parameters for Online Forecasting of Cookie Drying Readiness

[Quick Read]: This paper tackles real-time forecasting of drying readiness in food drying, aiming to minimize energy consumption, improve productivity, and ensure product quality across the drying process. The problem is hard because drying is highly dynamic, data are scarce, and effective predictive-analytics methods are lacking. The key solution is an end-to-end multi-modal data fusion framework that combines in-situ video data with process parameters for real-time forecasting of food drying readiness. It uses a novel encoder-decoder architecture with modality-specific encoders and a transformer-based decoder, extracting features effectively while preserving the unique structure of each modality. On sugar cookie drying, the model achieves an average prediction error of only 15 seconds, outperforming state-of-the-art data fusion methods by 65.69% and a video-only model by 11.30%, while balancing accuracy, model size, and computational efficiency; it suits heterogeneous industrial datasets and extends to other online decision-making tasks.

Link: https://arxiv.org/abs/2504.15599
Authors: Shichen Li, Chenhui Shao
Affiliations: University of Illinois at Urbana-Champaign; University of Michigan
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 17 pages, 12 figures

Abstract:Food drying is essential for food production, extending shelf life, and reducing transportation costs. Accurate real-time forecasting of drying readiness is crucial for minimizing energy consumption, improving productivity, and ensuring product quality. However, this remains challenging due to the dynamic nature of drying, limited data availability, and the lack of effective predictive analytical methods. To address this gap, we propose an end-to-end multi-modal data fusion framework that integrates in-situ video data with process parameters for real-time food drying readiness forecasting. Our approach leverages a new encoder-decoder architecture with modality-specific encoders and a transformer-based decoder to effectively extract features while preserving the unique structure of each modality. We apply our approach to sugar cookie drying, where time-to-ready is predicted at each timestamp. Experimental results demonstrate that our model achieves an average prediction error of only 15 seconds, outperforming state-of-the-art data fusion methods by 65.69% and a video-only model by 11.30%. Additionally, our model balances prediction accuracy, model size, and computational efficiency, making it well-suited for heterogeneous industrial datasets. The proposed model is extensible to various other industrial modality fusion tasks for online decision-making.

[CV-57] Analytical Softmax Temperature Setting from Feature Dimensions for Model- and Domain-Robust Classification

[Quick Read]: This paper studies the critical influence of the softmax temperature parameter T on the output distribution and overall performance in deep-learning classification. The optimal temperature T* is hard to pin down with conventional means, and in practice it fluctuates with the model, dataset, and other confounding factors. The key theoretical insight is that T* is uniquely determined by the dimensionality of the feature representations; a set of temperature determination coefficients is then designed to specify how T* should be adjusted based on this relationship. In addition, a batch normalization layer is inserted immediately before the output layer to stabilize the feature space, and large-scale experiments yield an empirical formula that estimates T* without additional training, together with a corrective scheme that refines T* according to the number of classes and task complexity. The derived temperature matches the theoretical prediction and generalizes across diverse tasks, consistently improving classification performance and providing a practical, training-free solution.

Link: https://arxiv.org/abs/2504.15594
Authors: Tatsuhito Hasegawa, Shunsuke Sakai
Affiliations: Graduate School of Engineering, University of Fukui
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: 22 pages, 11 figures, under review

Abstract:In deep learning-based classification tasks, the softmax function's temperature parameter T critically influences the output distribution and overall performance. This study presents a novel theoretical insight that the optimal temperature T^* is uniquely determined by the dimensionality of the feature representations, thereby enabling training-free determination of T^*. Despite this theoretical grounding, empirical evidence reveals that T^* fluctuates under practical conditions owing to variations in models, datasets, and other confounding factors. To address these influences, we propose and optimize a set of temperature determination coefficients that specify how T^* should be adjusted based on the theoretical relationship to feature dimensionality. Additionally, we insert a batch normalization layer immediately before the output layer, effectively stabilizing the feature space. Building on these coefficients and a suite of large-scale experiments, we develop an empirical formula to estimate T^* without additional training while also introducing a corrective scheme to refine T^* based on the number of classes and task complexity. Our findings confirm that the derived temperature not only aligns with the proposed theoretical perspective but also generalizes effectively across diverse tasks, consistently enhancing classification performance and offering a practical, training-free solution for determining T^*.
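The abstract states what the temperature does but not the paper's closed-form rule for T^*, so the sketch below only illustrates how a temperature rescales the logits before the softmax; the square-root-of-dimension heuristic for T^* is purely an illustrative assumption, not the paper's derived formula.

```python
import numpy as np

def softmax_with_temperature(logits, T=1.0):
    """Temperature-scaled softmax: higher T flattens, lower T sharpens."""
    z = logits / T
    z = z - z.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# Hypothetical heuristic tying T to the feature dimension d (illustration only):
d = 512
T_star = np.sqrt(d) / 8.0

logits = np.array([2.0, 1.0, 0.1])
print(softmax_with_temperature(logits, T=1.0))     # sharper distribution
print(softmax_with_temperature(logits, T=T_star))  # flatter distribution
```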

[CV-58] Bayesian Autoencoder for Medical Anomaly Detection: Uncertainty-Aware Approach for Brain 2 MRI Analysis

[Quick Read]: This paper addresses the insufficient modeling of uncertainty in anomaly detection for medical imaging, especially the detection of potentially life-threatening neurological conditions in brain MRI. Conventional deterministic methods struggle to capture the inherent uncertainty of anomaly detection tasks. The proposed solution is a Bayesian Variational Autoencoder (VAE) with multi-head attention that integrates estimates of both epistemic and aleatoric uncertainty via Bayesian inference. The key idea is that modeling these two kinds of uncertainty improves detection performance while also enhancing interpretability and confidence estimation, giving clinicians reliable support for medical decision-making.

Link: https://arxiv.org/abs/2504.15562
Authors: Dip Roy
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: 16 pages, 6 figures

Abstract:In medical imaging, anomaly detection is a vital element of healthcare diagnostics, especially for neurological conditions which can be life-threatening. Conventional deterministic methods often fall short when it comes to capturing the inherent uncertainty of anomaly detection tasks. This paper introduces a Bayesian Variational Autoencoder (VAE) equipped with multi-head attention mechanisms for detecting anomalies in brain magnetic resonance imaging (MRI). For the purpose of improving anomaly detection performance, we incorporate both epistemic and aleatoric uncertainty estimation through Bayesian inference. The model was tested on the BraTS2020 dataset, and the findings were a 0.83 ROC AUC and a 0.83 PR AUC. The data in our paper suggests that modeling uncertainty is an essential component of anomaly detection, enhancing both performance and interpretability and providing confidence estimates, as well as anomaly predictions, for clinicians to leverage in making medical decisions.
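The abstract does not spell out how the two uncertainties are computed; below is a standard Monte Carlo decomposition for a stochastic model (e.g., a Bayesian VAE whose weights or latents are resampled per call), as a minimal sketch. The predictor stub, sample count, and variance head are illustrative assumptions.

```python
import numpy as np

def mc_uncertainty(predict_fn, x, n_samples=30):
    """Monte Carlo split of predictive uncertainty.

    predict_fn(x) is assumed stochastic and to return, per pixel, a predicted
    mean and a predicted variance (the aleatoric head of the model).
    """
    means, variances = [], []
    for _ in range(n_samples):
        mu, var = predict_fn(x)          # one stochastic forward pass
        means.append(mu)
        variances.append(var)
    means = np.stack(means)              # (n_samples, ...)
    variances = np.stack(variances)

    epistemic = means.var(axis=0)        # spread of the means -> model uncertainty
    aleatoric = variances.mean(axis=0)   # average predicted noise -> data uncertainty
    prediction = means.mean(axis=0)
    return prediction, epistemic, aleatoric

# Toy stand-in for a Bayesian model: noisy mean plus a constant variance head.
rng = np.random.default_rng(0)
toy_model = lambda x: (x + 0.1 * rng.normal(size=x.shape), np.full(x.shape, 0.05))
pred, epi, ale = mc_uncertainty(toy_model, np.zeros((8, 8)))
print(pred.shape, float(epi.mean()), float(ale.mean()))
```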

[CV-59] InstaRevive: One-Step Image Enhancement via Dynamic Score Matching ICLR2025

[Quick Read]: This paper targets image enhancement under complex environments and the inherent limitations of imaging devices; diffusion-based methods perform well but require long, computationally intensive iterative sampling. InstaRevive is a simple yet powerful enhancement framework that uses score-based diffusion distillation to harness the strong generative capability of pretrained diffusion models while minimizing sampling steps. The key is a practical and effective diffusion distillation pipeline with dynamic control to address inaccurate update directions during score matching: the control strategy enables a dynamic diffusing scope, ensuring precise learning of denoising trajectories within the diffusion model and accurate distribution-matching gradients during training. Textual prompts obtained via image captioning are further incorporated as auxiliary conditions to enrich guidance. Extensive experiments across diverse challenging tasks and datasets confirm the efficacy and efficiency of the framework.

Link: https://arxiv.org/abs/2504.15513
Authors: Yixuan Zhu, Haolin Wang, Ao Li, Wenliang Zhao, Yansong Tang, Jingxuan Niu, Lei Chen, Jie Zhou, Jiwen Lu
Affiliations: Department of Automation, Tsinghua University; Tsinghua Shenzhen International Graduate School, Tsinghua University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ICLR 2025

Abstract:Image enhancement finds wide-ranging applications in real-world scenarios due to complex environments and the inherent limitations of imaging devices. Recent diffusion-based methods yield promising outcomes but necessitate prolonged and computationally intensive iterative sampling. In response, we propose InstaRevive, a straightforward yet powerful image enhancement framework that employs score-based diffusion distillation to harness potent generative capability and minimize the sampling steps. To fully exploit the potential of the pre-trained diffusion model, we devise a practical and effective diffusion distillation pipeline using dynamic control to address inaccuracies in updating direction during score matching. Our control strategy enables a dynamic diffusing scope, facilitating precise learning of denoising trajectories within the diffusion model and ensuring accurate distribution matching gradients during training. Additionally, to enrich guidance for the generative power, we incorporate textual prompts via image captioning as auxiliary conditions, fostering further exploration of the diffusion model. Extensive experiments substantiate the efficacy of our framework across a diverse array of challenging tasks and datasets, unveiling the compelling efficacy and efficiency of InstaRevive in delivering high-quality and visually appealing results. Code is available at this https URL.

[CV-60] Unifying Image Counterfactuals and Feature Attributions with Latent-Space Adversarial Attacks

[Quick Read]: This paper addresses the challenge of generating counterfactual images for computer vision models. Standard gradient-based methods are prone to producing adversarial examples, where imperceptible pixel modifications cause large prediction changes. The authors introduce Counterfactual Attacks, a new, easy-to-implement framework that flexibly adapts to contemporary advances in generative modeling: it operates on the image representation along a low-dimensional manifold, resembling an adversarial attack in that latent space and yielding more plausible counterfactuals. Given an auxiliary dataset of image descriptors, the method also accompanies counterfactuals with feature attributions that quantify the changes between the original and counterfactual images; these importance scores can be aggregated into global counterfactual explanations that highlight the overall features driving model predictions. Combining generative models with feature attribution makes the procedure both efficient and interpretable, as demonstrated on the MNIST and CelebA datasets.

Link: https://arxiv.org/abs/2504.15479
Authors: Jeremy Goldwasser, Giles Hooker
Affiliations: University of California, Berkeley; University of Pennsylvania
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Counterfactuals are a popular framework for interpreting machine learning predictions. These what-if explanations are notoriously challenging to create for computer vision models: standard gradient-based methods are prone to produce adversarial examples, in which imperceptible modifications to image pixels provoke large changes in predictions. We introduce a new, easy-to-implement framework for counterfactual images that can flexibly adapt to contemporary advances in generative modeling. Our method, Counterfactual Attacks, resembles an adversarial attack on the representation of the image along a low-dimensional manifold. In addition, given an auxiliary dataset of image descriptors, we show how to accompany counterfactuals with feature attributions that quantify the changes between the original and counterfactual images. These importance scores can be aggregated into global counterfactual explanations that highlight the overall features driving model predictions. While this unification is possible for any counterfactual method, it has particular computational efficiency for ours. We demonstrate the efficacy of our approach with the MNIST and CelebA datasets.
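As a rough, self-contained illustration of "an adversarial attack on the representation along a low-dimensional manifold", the toy below ascends a classifier score in latent space and decodes the result; the linear decoder, linear classifier, and all dimensions are assumptions standing in for a real generative model and the paper's actual procedure.

```python
import numpy as np

rng = np.random.default_rng(0)
D_LATENT, D_PIX = 4, 16
W_dec = rng.normal(size=(D_PIX, D_LATENT))   # toy linear "decoder"
w_clf = rng.normal(size=D_PIX)               # toy linear "classifier" score

def decode(z):
    return W_dec @ z

def score(x):
    return float(w_clf @ x)                  # > 0 means the target class

def latent_counterfactual(z0, step=0.05, iters=200):
    """Ascend the classifier score by moving in latent space, so every
    intermediate point decodes to an on-manifold image (unlike pixel attacks)."""
    z = z0.copy()
    grad = W_dec.T @ w_clf                   # d(score)/dz for this linear toy
    for _ in range(iters):
        if score(decode(z)) > 0:             # stop once the prediction flips
            break
        z = z + step * grad
    return z

g = W_dec.T @ w_clf
z0 = -3.0 * g / np.linalg.norm(g)            # start firmly in the "negative" class
z_cf = latent_counterfactual(z0)
print(round(score(decode(z0)), 2), "->", round(score(decode(z_cf)), 2))
```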

[CV-61] Emergence and Evolution of Interpretable Concepts in Diffusion Models

[Quick Read]: This paper addresses the black-box nature of diffusion models in text-to-image generation and the opacity of their complex multi-step generative process. The key is to use Sparse Autoencoders (SAEs), a Mechanistic Interpretability (MI) technique, to probe the internal representations of a popular text-to-image diffusion model and uncover human-interpretable concepts in its activations. The study finds that even before the first reverse-diffusion step completes, the final scene composition can be predicted surprisingly well from the spatial distribution of activated concepts, and, going beyond correlation, shows these concepts have a causal effect on the output and can be used to steer generation. Building on this, the authors design intervention techniques that effectively control image composition in early diffusion stages and style in the middle stages, making the otherwise hard-to-steer generative process controllable.

Link: https://arxiv.org/abs/2504.15473
Authors: Berk Tinaz, Zalan Fabian, Mahdi Soltanolkotabi
Affiliations: University of Southern California
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
Comments: 32 pages, 32 figures, preliminary version

Abstract:Diffusion models have become the go-to method for text-to-image generation, producing high-quality images from noise through a process called reverse diffusion. Understanding the dynamics of the reverse diffusion process is crucial in steering the generation and achieving high sample quality. However, the inner workings of diffusion models are still largely a mystery due to their black-box nature and complex, multi-step generation process. Mechanistic Interpretability (MI) techniques, such as Sparse Autoencoders (SAEs), aim at uncovering the operating principles of models through granular analysis of their internal representations. These MI techniques have been successful in understanding and steering the behavior of large language models at scale. However, the great potential of SAEs has not yet been applied toward gaining insight into the intricate generative process of diffusion models. In this work, we leverage the SAE framework to probe the inner workings of a popular text-to-image diffusion model, and uncover a variety of human-interpretable concepts in its activations. Interestingly, we find that even before the first reverse diffusion step is completed, the final composition of the scene can be predicted surprisingly well by looking at the spatial distribution of activated concepts. Moreover, going beyond correlational analysis, we show that the discovered concepts have a causal effect on the model output and can be leveraged to steer the generative process. We design intervention techniques aimed at manipulating image composition and style, and demonstrate that (1) in early stages of diffusion image composition can be effectively controlled, (2) in the middle stages of diffusion image composition is finalized; however, stylistic interventions are effective, and (3) in the final stages of diffusion only minor textural details are subject to change.

[CV-62] Manifold Induced Biases for Zero-shot and Few-shot Detection of Generated Images ICLR2025

[Quick Read]: This paper studies the distinction between real and AI-generated images ("image detection"). Beyond (semi-)supervised approaches, zero-shot and few-shot solutions have recently emerged as promising alternatives, attractive because they alleviate the data-maintenance burden that quickly becomes outdated as generative technology advances. Two gaps are identified: a lack of theoretical grounding, and substantial headroom in zero-shot and few-shot performance. The key idea is to analyze the biases of the implicit probability manifold captured by a pretrained diffusion model and to use them as criteria for characterizing generated images: score-function analysis approximates the curvature, gradient, and bias toward points on the manifold, yielding detection criteria in the zero-shot regime, and a mixture-of-experts methodology extends the approach to the few-shot setting. Experiments across 20 generative models show the method outperforms current approaches in both regimes, advancing both the theoretical understanding and the practical use of generated-content biases through the lens of manifold analysis.

Link: https://arxiv.org/abs/2504.15470
Authors: Jonathan Brokman, Amit Giloni, Omer Hofman, Roman Vainshtein, Hisashi Kojima, Guy Gilboa
Affiliations: Fujitsu; Technion - Israel Institute of Technology
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to ICLR 2025 (The International Conference on Learning Representations)

Abstract:Distinguishing between real and AI-generated images, commonly referred to as ‘image detection’, presents a timely and significant challenge. Despite extensive research in the (semi-)supervised regime, zero-shot and few-shot solutions have only recently emerged as promising alternatives. Their main advantage is in alleviating the ongoing data maintenance, which quickly becomes outdated due to advances in generative technologies. We identify two main gaps: (1) a lack of theoretical grounding for the methods, and (2) significant room for performance improvements in zero-shot and few-shot regimes. Our approach is founded on understanding and quantifying the biases inherent in generated content, where we use these quantities as criteria for characterizing generated images. Specifically, we explore the biases of the implicit probability manifold, captured by a pre-trained diffusion model. Through score-function analysis, we approximate the curvature, gradient, and bias towards points on the probability manifold, establishing criteria for detection in the zero-shot regime. We further extend our contribution to the few-shot setting by employing a mixture-of-experts methodology. Empirical results across 20 generative models demonstrate that our method outperforms current approaches in both zero-shot and few-shot settings. This work advances the theoretical understanding and practical usage of generated content biases through the lens of manifold analysis.

[CV-63] AGI Is Coming… Right After AI Learns to Play Wordle

[Quick Read]: This paper investigates how well multimodal agents, in particular OpenAI's Computer-User Agent (CUA), complete tasks through a standard computer interface, the way humans do. Using the New York Times Wordle game as a testbed, the study elicits model behaviors and identifies shortcomings: the agent showed a significant, context-dependent discrepancy in its ability to recognize colors correctly, and succeeded in only 5.36% of several hundred runs over a week of Wordle. Despite the enormous enthusiasm around AI agents and their potential path toward Artificial General Intelligence (AGI), the findings reinforce that even simple tasks pose substantial challenges for today's frontier models. The paper closes with a discussion of potential underlying causes, implications for future development, and research directions for improving these systems.

Link: https://arxiv.org/abs/2504.15434
Authors: Sarath Shekkizhar, Romain Cosentino
Affiliations: Salesforce
Subjects: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:This paper investigates multimodal agents, in particular, OpenAI’s Computer-User Agent (CUA), trained to control and complete tasks through a standard computer interface, similar to humans. We evaluated the agent’s performance on the New York Times Wordle game to elicit model behaviors and identify shortcomings. Our findings revealed a significant discrepancy in the model’s ability to recognize colors correctly depending on the context. The model had a 5.36% success rate over several hundred runs across a week of Wordle. Despite the immense enthusiasm surrounding AI agents and their potential to usher in Artificial General Intelligence (AGI), our findings reinforce the fact that even simple tasks present substantial challenges for today’s frontier AI models. We conclude with a discussion of the potential underlying causes, implications for future development, and research directions to improve these AI systems.

[CV-64] Context Aware Grounded Teacher for Source Free Object Detection

[Quick Read]: This paper tackles Source-Free Object Detection (SFOD), where no source data are available during adaptation and the model must adapt to an unlabeled target domain. In medical imaging, prior work bridges the domain gap with semi-supervised student-teacher architectures, but context imbalance in the labeled training data and large domain shifts can bias the teacher, producing inaccurate pseudo-labels that degrade the student and cause mode collapse; class imbalance, where one class greatly outnumbers another, induces contextual bias. To counter context bias and the sharp drop in student performance under SFOD, the authors introduce the Grounded Teacher (GT) framework. Its key component is a dedicated relational context module that models contextual relationships and uses them to mitigate the model's inherent biases, allowing augmentations to be applied to closely related classes, across and within domains, improving underrepresented classes while minimally affecting dominant ones. An expert foundational branch further supervises the student to improve prediction quality. Effectiveness is validated on three medical datasets with comprehensive ablation studies, and all related resources are publicly available.

Link: https://arxiv.org/abs/2504.15404
Authors: Tajamul Ashraf, Rajes Manna, Partha Sarathi Purkayastha, Tavaheed Tariq, Janibul Bashir
Affiliations: MBZUAI (Mohamed Bin Zayed University of Artificial Intelligence); NITS Srinagar (National Institute of Technology Srinagar); Microsoft
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:We focus on the Source Free Object Detection (SFOD) problem, when source data is unavailable during adaptation, and the model must adapt to the unlabeled target domain. In medical imaging, several approaches have leveraged a semi-supervised student-teacher architecture to bridge domain discrepancy. Context imbalance in labeled training data and significant domain shifts between domains can lead to biased teacher models that produce inaccurate pseudolabels, degrading the student model's performance and causing a mode collapse. Class imbalance, particularly when one class significantly outnumbers another, leads to contextual bias. To tackle the problem of context bias and the significant performance drop of the student model in the SFOD setting, we introduce Grounded Teacher (GT) as a standard framework. In this study, we model contextual relationships using a dedicated relational context module and leverage it to mitigate inherent biases in the model. This approach enables us to apply augmentations to closely related classes, across and within domains, enhancing the performance of underrepresented classes while keeping the effect on dominant classes minimal. We further improve the quality of predictions by implementing an expert foundational branch to supervise the student model. We validate the effectiveness of our approach in mitigating context bias under the SFOD setting through experiments on three medical datasets supported by comprehensive ablation studies. All relevant resources, including preprocessed data, trained model weights, and code, are publicly available at this https URL.

[CV-65] MirrorVerse: Pushing Diffusion Models to Realistically Reflect the World CVPR2025

[Quick Read]: This paper addresses the problem of generating photorealistic mirror reflections with diffusion-based generative models, where existing methods fall short on physical details such as shadows, reflections, and occlusions. The key is a set of augmentations to the synthetic data pipeline: (1) random object positioning, (2) randomized rotations, and (3) grounding of objects, which together markedly improve generalization across object poses and placements. To further handle spatial relationships and occlusions in multi-object scenes, objects are paired during dataset generation, yielding a dataset robust to such complex scenarios. Since generalization to real-world scenes remains challenging, a three-stage training curriculum is designed to develop the MirrorFusion 2.0 model and improve its real-world performance.

Link: https://arxiv.org/abs/2504.15397
Authors: Ankit Dhiman, Manan Shah, R Venkatesh Babu
Affiliations: Vision and AI Lab, IISc Bangalore; Samsung R&D Institute India - Bangalore
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to CVPR 2025. Project Page: this https URL

Abstract:Diffusion models have become central to various image editing tasks, yet they often fail to fully adhere to physical laws, particularly with effects like shadows, reflections, and occlusions. In this work, we address the challenge of generating photorealistic mirror reflections using diffusion-based generative models. Despite extensive training data, existing diffusion models frequently overlook the nuanced details crucial to authentic mirror reflections. Recent approaches have attempted to resolve this by creating synthetic datasets and framing reflection generation as an inpainting task; however, they struggle to generalize across different object orientations and positions relative to the mirror. Our method overcomes these limitations by introducing key augmentations into the synthetic data pipeline: (1) random object positioning, (2) randomized rotations, and (3) grounding of objects, significantly enhancing generalization across poses and placements. To further address spatial relationships and occlusions in scenes with multiple objects, we implement a strategy to pair objects during dataset generation, resulting in a dataset robust enough to handle these complex scenarios. Achieving generalization to real-world scenes remains a challenge, so we introduce a three-stage training curriculum to develop the MirrorFusion 2.0 model to improve real-world performance. We provide extensive qualitative and quantitative evaluations to support our approach. The project page is available at: this https URL.

[CV-66] ICGM-FRAX: Iterative Cross Graph Matching for Hip Fracture Risk Assessment using Dual-energy X-ray Absorptiometry Images

[Quick Read]: This paper addresses early and accurate hip-fracture risk assessment in the elderly, aiming to reduce the associated risks of decreased mobility and increased mortality. It proposes Iterative Cross Graph Matching for Hip Fracture Risk Assessment (ICGM-FRAX), a novel method that predicts hip-fracture risk from Dual-energy X-ray Absorptiometry (DXA) images. The key idea is to iteratively compare a test (subject) graph against multiple template graphs representing the characteristics of hip-fracture subjects and to assess their similarity for accurate risk prediction. Concretely, each DXA image is divided into regions of interest (RoIs) such as the femoral head, shaft, and lesser trochanter; radiomic features are computed per RoI, the RoI centroids serve as graph nodes, and edges are determined by the Euclidean distances between centroids, capturing spatial relationships. A test graph that closely matches the template graphs of patients with incident hip fractures is classified as high risk. On 547 subjects from the UK Biobank, ICGM-FRAX achieves a sensitivity of 0.9869, demonstrating highly accurate hip-fracture prediction.

Link: https://arxiv.org/abs/2504.15384
Authors: Chen Zhao, Anjum Shaik, Joyce H. Keyak, Nancy E. Lane, Jeffrey D. Deng, Kuan-Jui Su, Qiuying Sha, Hui Shen, Hong-Wen Deng, Weihua Zhou
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 23 pages, 4 figures

Abstract:Hip fractures represent a major health concern, particularly among the elderly, often leading to decreased mobility and increased mortality. Early and accurate detection of at-risk individuals is crucial for effective intervention. In this study, we propose Iterative Cross Graph Matching for Hip Fracture Risk Assessment (ICGM-FRAX), a novel approach for predicting hip fractures using Dual-energy X-ray Absorptiometry (DXA) images. ICGM-FRAX involves iteratively comparing a test (subject) graph with multiple template graphs representing the characteristics of hip fracture subjects to assess the similarity and accurately predict hip fracture risk. These graphs are obtained as follows. The DXA images are separated into multiple regions of interest (RoIs), such as the femoral head, shaft, and lesser trochanter. Radiomic features are then calculated for each RoI, with the central coordinates used as nodes in a graph. The connectivity between nodes is established according to the Euclidean distance between these coordinates. This process transforms each DXA image into a graph, where each node represents a RoI, and edges derived from the centroids of RoIs capture the spatial relationships between them. If the test graph closely matches a set of template graphs representing subjects with incident hip fractures, it is classified as indicating high hip fracture risk. We evaluated our method using 547 subjects from the UK Biobank dataset, and experimental results show that ICGM-FRAX achieved a sensitivity of 0.9869, demonstrating high accuracy in predicting hip fractures.
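A minimal sketch of the graph-construction step described above: RoI centroids become nodes and edge weights are the Euclidean distances between centroids. The centroid values, feature vectors, the fully connected weighted graph, and the Frobenius-norm comparison are illustrative assumptions, not the paper's exact matching algorithm.

```python
import numpy as np

# Hypothetical RoI centroids (x, y) and per-RoI radiomic feature vectors.
rois = {
    "femoral_head":      ((12.0, 40.0), np.array([0.8, 1.2, 0.3])),
    "shaft":             ((15.0, 10.0), np.array([0.5, 0.9, 0.7])),
    "lesser_trochanter": ((20.0, 25.0), np.array([0.6, 1.1, 0.4])),
}

names = list(rois)
centroids = np.array([rois[n][0] for n in names])

# Edge weights: pairwise Euclidean distances between RoI centroids.
diff = centroids[:, None, :] - centroids[None, :, :]
adjacency = np.sqrt((diff ** 2).sum(-1))

def graph_distance(adj_a, adj_b):
    """Crude subject-vs-template comparison via the Frobenius norm of the
    adjacency difference; a real system would also compare node features."""
    return float(np.linalg.norm(adj_a - adj_b))

print(names)
print(adjacency.round(1))
print(graph_distance(adjacency, adjacency))  # 0.0 for identical graphs
```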

[CV-67] Plug-and-Play Versatile Compressed Video Enhancement CVPR2025

[Quick Read]: This paper addresses the challenge that video compression, while shrinking file sizes for real-time cloud computing, degrades visual quality and thereby the robustness of downstream vision models. The key solution is a versatile codec-aware enhancement framework that reuses codec information to adaptively enhance videos under different compression settings, assisting various downstream vision tasks without introducing a computation bottleneck. Specifically, the framework comprises a compression-aware adaptation (CAA) network that uses a hierarchical adaptation mechanism to estimate the parameters of a frame-wise enhancement network, the bitstream-aware enhancement (BAE) network; the BAE network then exploits temporal and spatial priors embedded in the bitstream to effectively improve the quality of compressed input frames.

Link: https://arxiv.org/abs/2504.15380
Authors: Huimin Zeng, Jiacheng Li, Zhiwei Xiong
Affiliations: University of Science and Technology of China
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to CVPR 2025

Abstract:As a widely adopted technique in data transmission, video compression effectively reduces the size of files, making it possible for real-time cloud computing. However, it comes at the cost of visual quality, posing challenges to the robustness of downstream vision models. In this work, we present a versatile codec-aware enhancement framework that reuses codec information to adaptively enhance videos under different compression settings, assisting various downstream vision tasks without introducing computation bottleneck. Specifically, the proposed codec-aware framework consists of a compression-aware adaptation (CAA) network that employs a hierarchical adaptation mechanism to estimate parameters of the frame-wise enhancement network, namely the bitstream-aware enhancement (BAE) network. The BAE network further leverages temporal and spatial priors embedded in the bitstream to effectively improve the quality of compressed input frames. Extensive experimental results demonstrate the superior quality enhancement performance of our framework over existing enhancement methods, as well as its versatility in assisting multiple downstream tasks on compressed videos as a plug-and-play module. Code and models are available at this https URL.

[CV-68] Physics Driven Image Simulation from Commercial Satellite Imagery

[Quick Read]: This paper targets physically realistic scene modeling and generation beyond what typical rendering pipelines afford. For a given region, it presents an automated solution that uses satellite imagery to model scene geometry, drive material estimates, and populate the scene with dynamic elements, producing a physically realistic 3D scene. The key is a set of automated techniques that exploit satellite imagery throughout the simulated scene to expedite construction and reduce manual overhead. Because the approach does not rely on lidar, it enables simulations that previously could not be constructed. A Digital Surface Model (DSM) serves as the basis for scene geometry, allowing terrain, man-made structures, and smaller elements such as vegetation and vehicles to be modeled in a common 3D frame of reference. The resulting scenes provide increased fidelity with less manual intervention for novel locations on Earth, and support algorithm development and processing pipelines for imagery from UV to LWIR (200 nm to 20 µm).

Link: https://arxiv.org/abs/2504.15378
Authors: Scott Sorensen, Wayne Treible, Robert Wagner, Andrew D. Gilliam, Todd Rovito, Joseph L. Mundy
Affiliations: Vision Systems Inc.; Air Force Research Lab
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 15 pages, 9 figures

Abstract:Physics driven image simulation allows for the modeling and creation of realistic imagery beyond what is afforded by typical rendering pipelines. We aim to automatically generate a physically realistic scene for simulation of a given region using satellite imagery to model the scene geometry, drive material estimates, and populate the scene with dynamic elements. We present automated techniques to utilize satellite imagery throughout the simulated scene to expedite scene construction and decrease manual overhead. Our technique does not use lidar, enabling simulations that could not be constructed previously. To develop a 3D scene, we model the various components of the real location, addressing the terrain, modelling man-made structures, and populating the scene with smaller elements such as vegetation and vehicles. To create the scene we begin with a Digital Surface Model, which serves as the basis for scene geometry, and allows us to reason about the real location in a common 3D frame of reference. These simulated scenes can provide increased fidelity with less manual intervention for novel locations on earth, and can facilitate algorithm development, and processing pipelines for imagery ranging from UV to LWIR (200 nm to 20 µm).

[CV-69] Event2Vec: Processing neuromorphic events directly by representations in vector space

[Quick Read]: This paper addresses the incompatibility between the asynchronous, sparse, and irregular events produced by event cameras and mainstream computer vision and deep learning methods. Prior fixes typically require long preprocessing, sacrifice temporal resolution, or forgo massively parallel computation. The key innovation, inspired by the success of word2vec, is to draw an analogy between words and events and propose the first event-to-vector (event2vec) representation, mapping events into the domain of natural language processing. Validated on ASL-DVS classification, event2vec shows impressive parameter efficiency, accuracy, and speed compared with previous graph-, image-, and voxel-based representations. Beyond task performance, its most attractive property is that aligning events with NLP suggests a promising path for integrating events into large language and multimodal models, pointing to broad cross-domain applications.

Link: https://arxiv.org/abs/2504.15371
Authors: Wei Fang, Priyadarshini Panda
Affiliations: Yale University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Neural and Evolutionary Computing (cs.NE)
Comments:

Abstract:Neuromorphic event cameras have overwhelming advantages in temporal resolution, power efficiency, and dynamic range compared to traditional cameras. However, the event cameras output asynchronous, sparse, and irregular events, which are not compatible with mainstream computer vision and deep learning methods. Various methods have been proposed to solve this issue but at the cost of long preprocessing procedures, losing temporal resolutions, or being incompatible with massively parallel computation. Inspired by the great success of word to vector (word2vec), we summarize the similarities between words and events, then propose the first event to vector (event2vec) representation. We validate event2vec on classifying the ASL-DVS dataset, showing impressive parameter efficiency, accuracy, and speed compared with previous graph/image/voxel-based representations. Beyond task performance, the most attractive advantage of event2vec is that it aligns events to the domain of natural language processing, showing the promising prospect of integrating events into large language and multimodal models. Our codes, models, and training logs are available at this https URL.
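The abstract does not specify how an event becomes a "word", but the word2vec analogy suggests something like the sketch below: discretize each (x, y, polarity) event into a vocabulary index and look it up in an embedding table. The discretization scheme, embedding size, and mean pooling are assumptions for illustration.

```python
import numpy as np

H, W, POLARITIES, DIM = 32, 32, 2, 64
VOCAB = H * W * POLARITIES

rng = np.random.default_rng(0)
embedding = rng.normal(scale=0.1, size=(VOCAB, DIM))  # learnable in practice

def event_to_token(x, y, p):
    """Flatten an event's coordinates and polarity into one vocabulary id,
    the same way a word id indexes a word2vec table."""
    return (y * W + x) * POLARITIES + p

# A stream of (x, y, polarity) events -> a sequence of vectors.
events = [(3, 7, 1), (4, 7, 1), (30, 1, 0)]
tokens = [event_to_token(*e) for e in events]
vectors = embedding[tokens]           # (num_events, DIM)

# Pooling the sequence yields a fixed-size clip representation for a classifier.
clip_repr = vectors.mean(axis=0)
print(tokens, vectors.shape, clip_repr.shape)
```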

[CV-70] Vision6D: 3D-to-2D Interactive Visualization and Annotation Tool for 6D Pose Estimation

[Quick Read]: This paper addresses the need, in 6D pose estimation research, to accurately annotate the relative position and orientation (the 6D pose) between camera and objects, especially when the transformation matrix between camera and world objects is unknown. The key contribution is Vision6D, an interactive 3D-to-2D visualization and annotation tool that lets users intuitively visualize and manipulate 3D objects on a 2D real-world scene, providing visual cues and spatial relationships to determine object position and orientation in various environments. Its annotation feature is particularly helpful when the camera-to-world transformation is unknown, since it enables accurate pose annotation using only the camera intrinsic matrix, a foundational step for developing and training advanced pose-estimation models across domains. Comparisons against the Linemod and HANDAL datasets and a user study show that Vision6D produces accurate pose annotations via visual cues in an intuitive 3D interface, bridging the gap between 2D scene projections and 3D scenes.

Link: https://arxiv.org/abs/2504.15329
Authors: Yike Zhang, Eduardo Davalos, Jack Noble
Affiliations: IEEE Publication Technology Group
Subjects: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC); Robotics (cs.RO)
Comments:

Abstract:Accurate 6D pose estimation has gained more attention over the years for robotics-assisted tasks that require precise interaction with physical objects. This paper presents an interactive 3D-to-2D visualization and annotation tool to support the 6D pose estimation research community. To the best of our knowledge, the proposed work is the first tool that allows users to visualize and manipulate 3D objects interactively on a 2D real-world scene, along with a comprehensive user study. This system supports robust 6D camera pose annotation by providing both visual cues and spatial relationships to determine object position and orientation in various environments. The annotation feature in Vision6D is particularly helpful in scenarios where the transformation matrix between the camera and world objects is unknown, as it enables accurate annotation of these objects' poses using only the camera intrinsic matrix. This capability serves as a foundational step in developing and training advanced pose estimation models across various domains. We evaluate Vision6D's effectiveness by utilizing the widely used open-source pose estimation datasets Linemod and HANDAL through comparisons between the default ground-truth camera poses and manual annotations. A user study was performed to show that Vision6D generates accurate pose annotations via visual cues in an intuitive 3D user interface. This approach aims to bridge the gap between 2D scene projections and 3D scenes, offering an effective way for researchers and developers to solve 6D pose annotation related problems. The software is open-source and publicly available at this https URL.

[CV-71] HyperFlow: Gradient-Free Emulation of Few-Shot Fine-Tuning

[Quick Read]: This paper tackles the cost of test-time fine-tuning in few-shot learning: although effective, the repeated backpropagation it requires is prohibitively expensive in real-time or resource-constrained settings. The key is to emulate gradient descent without explicitly computing gradients: gradient descent is formulated as the Euler discretization of an ordinary differential equation (ODE), and an auxiliary network is trained to predict the task-conditional drift from only the few-shot support set. Adaptation then reduces to simple numerical integration (e.g., the Euler method), requiring only a few forward passes of the auxiliary network and no gradients or forward passes of the target model. On cross-domain few-shot classification (Meta-Dataset and CDFSL benchmarks), the method significantly improves out-of-domain performance over the non-fine-tuned baseline while incurring only 6% of the memory cost and 0.02% of the computation time of standard fine-tuning, establishing a practical middle ground between direct transfer and full fine-tuning.

Link: https://arxiv.org/abs/2504.15323
Authors: Donggyun Kim, Chanwoo Kim, Seunghoon Hong
Affiliations: KAIST School of Computing
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:While test-time fine-tuning is beneficial in few-shot learning, the need for multiple backpropagation steps can be prohibitively expensive in real-time or low-resource scenarios. To address this limitation, we propose an approach that emulates gradient descent without computing gradients, enabling efficient test-time adaptation. Specifically, we formulate gradient descent as an Euler discretization of an ordinary differential equation (ODE) and train an auxiliary network to predict the task-conditional drift using only the few-shot support set. The adaptation then reduces to a simple numerical integration (e.g., via the Euler method), which requires only a few forward passes of the auxiliary network – no gradients or forward passes of the target model are needed. In experiments on cross-domain few-shot classification using the Meta-Dataset and CDFSL benchmarks, our method significantly improves out-of-domain performance over the non-fine-tuned baseline while incurring only 6% of the memory cost and 0.02% of the computation time of standard fine-tuning, thus establishing a practical middle ground between direct transfer and fully fine-tuned approaches.
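A minimal sketch of adaptation by Euler integration of a learned drift, as the abstract describes: theta_{k+1} = theta_k + h * f(theta_k, support), where f is an auxiliary network trained to imitate the negative loss gradient. The drift here is a toy stand-in; its form, the step size, and the step count are assumptions.

```python
import numpy as np

def euler_adapt(theta, drift_fn, support, step=0.1, n_steps=5):
    """Emulate fine-tuning without gradients: integrate the ODE
    d(theta)/dt = drift(theta, support) with forward Euler steps.
    Each step costs one forward pass of the drift network only."""
    for _ in range(n_steps):
        theta = theta + step * drift_fn(theta, support)
    return theta

# Toy drift mimicking -grad of a quadratic loss centred on the support mean;
# a real system would use a trained auxiliary network instead.
def toy_drift(theta, support):
    return support.mean(axis=0) - theta

rng = np.random.default_rng(0)
support_set = rng.normal(loc=2.0, size=(5, 3))   # few-shot support features
theta0 = np.zeros(3)
theta_adapted = euler_adapt(theta0, toy_drift, support_set)
print(theta0, "->", theta_adapted.round(3))
```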

[CV-72] LLM-Enabled Style and Content Regularization for Personalized Text-to-Image Generation

[Quick Read]: This paper addresses the insufficient stylization and inaccurate image content that existing personalized text-to-image methods, which typically fine-tune models with embedded identifiers, suffer from due to reduced textual controllability. The key contributions are a style refinement strategy and a content preservation strategy: the former leverages the semantic information of visual reasoning prompts and reference images to optimize style embeddings, enabling a more precise and consistent representation of style; the latter addresses content bias by preserving the model's generalization ability, enhancing textual controllability without sacrificing stylization. Experiments verify superior performance in generating consistent, personalized text-to-image outputs.

Link: https://arxiv.org/abs/2504.15309
Authors: Anran Yu, Wei Feng, Yaochen Zhang, Xiang Li, Lei Meng, Lei Wu, Xiangxu Meng
Affiliations: School of Software, Shandong University, China; Shandong Research Institute of Shandong University, China; Inspur Technology, Jinan, China
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:The personalized text-to-image generation has rapidly advanced with the emergence of Stable Diffusion. Existing methods, which typically fine-tune models using embedded identifiers, often struggle with insufficient stylization and inaccurate image content due to reduced textual controllability. In this paper, we propose style refinement and content preservation strategies. The style refinement strategy leverages the semantic information of visual reasoning prompts and reference images to optimize style embeddings, allowing a more precise and consistent representation of style information. The content preservation strategy addresses the content bias problem by preserving the model’s generalization capabilities, ensuring enhanced textual controllability without compromising stylization. Experimental results verify that our approach achieves superior performance in generating consistent and personalized text-to-image outputs.

[CV-73] SLAM-Based Navigation and Fault Resilience in a Surveillance Quadcopter with Embedded Vision Systems

[Quick Read]: This paper addresses autonomous navigation, real-time object recognition, and fault recovery for UAVs in GPS-denied or GPS-degraded environments. The key is Veg, a fault-tolerant quadcopter platform that integrates: (1) ORB-SLAM3 for GPS-free 6-DoF localization and loop closure, with waypoint navigation via Dijkstra planning over SLAM-derived maps; (2) a cascaded control architecture, with an LQR inner loop and a PD outer-loop trajectory controller, for dynamic stability; (3) an embedded vision system combining a lightweight CNN and PCA for real-time object detection and face recognition; and (4) a real-time Fault Detection and Identification (FDI) system that detects rotor faults and executes emergency landing. Together these consolidate real-time localization, fault recovery, and embedded AI on a single platform suited to constrained environments.

Link: https://arxiv.org/abs/2504.15305
Authors: Abhishek Tyagi, Charu Gaur
Affiliations: Unknown
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Systems and Control (eess.SY)
Comments: 18 pages, 21 figures, 4 tables. Onboard processing using Raspberry Pi 4 and Arduino Nano. Includes ORB-SLAM3-based navigation, LQR control, rotor fault recovery, object detection, and PCA face recognition. Real-world and simulation tests included. Designed for GPS-denied autonomous UAV surveillance

Abstract:We present an autonomous aerial surveillance platform, Veg, designed as a fault-tolerant quadcopter system that integrates visual SLAM for GPS-independent navigation, advanced control architecture for dynamic stability, and embedded vision modules for real-time object and face recognition. The platform features a cascaded control design with an LQR inner-loop and PD outer-loop trajectory control. It leverages ORB-SLAM3 for 6-DoF localization and loop closure, and supports waypoint-based navigation through Dijkstra path planning over SLAM-derived maps. A real-time Failure Detection and Identification (FDI) system detects rotor faults and executes emergency landing through re-routing. The embedded vision system, based on a lightweight CNN and PCA, enables onboard object detection and face recognition with high precision. The drone operates fully onboard using a Raspberry Pi 4 and Arduino Nano, validated through simulations and real-world testing. This work consolidates real-time localization, fault recovery, and embedded AI on a single platform suitable for constrained environments.
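As a sketch of the LQR inner loop mentioned above: a continuous-time LQR gain is obtained by solving the algebraic Riccati equation for a linearized model. The double-integrator dynamics and the weight matrices below are placeholder assumptions, not the paper's actual quadcopter model; scipy's `solve_continuous_are` is a real API.

```python
import numpy as np
from scipy.linalg import solve_continuous_are

# Placeholder linearized dynamics: a 1-axis double integrator
# (position, velocity) driven by thrust; not the paper's model.
A = np.array([[0.0, 1.0],
              [0.0, 0.0]])
B = np.array([[0.0],
              [1.0]])
Q = np.diag([10.0, 1.0])   # penalize position error more than velocity
R = np.array([[0.1]])      # control-effort penalty

P = solve_continuous_are(A, B, Q, R)
K = np.linalg.inv(R) @ B.T @ P     # optimal state feedback: u = -K x

x = np.array([1.0, 0.0])           # 1 m position error, at rest
u = -K @ x
print("LQR gain:", K.round(3), "control:", u.round(3))
```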

[CV-74] SPICE: A Synergistic Precise Iterative and Customizable Image Editing Workflow CVPR

[Quick Read]: This paper addresses the shortcomings of existing prompt-based image-editing models in local editing, adherence to detailed editing prompts, and maintaining global image quality over multi-step edits. The key is SPICE, a training-free workflow that synergizes the strengths of a base diffusion model and a Canny edge ControlNet model, robustly handling free-form editing instructions and consistently improving image quality across more than 100 editing steps, yielding significant gains on challenging realistic image-editing tasks.

Link: https://arxiv.org/abs/2504.09697
Authors: Kenan Tang, Yanhong Li, Yao Qin
Affiliations: Unknown
Subjects: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 24 pages, 21 figures. Figure 9(b) has been accepted by CVPR AI Art Gallery 2025

Abstract:Recent prompt-based image editing models have demonstrated impressive prompt-following capability at structural editing tasks. However, existing models still fail to perform local edits, follow detailed editing prompts, or maintain global image quality beyond a single editing step. To address these challenges, we introduce SPICE, a training-free workflow that accepts arbitrary resolutions and aspect ratios, accurately follows user requirements, and improves image quality consistently during more than 100 editing steps. By synergizing the strengths of a base diffusion model and a Canny edge ControlNet model, SPICE robustly handles free-form editing instructions from the user. SPICE outperforms state-of-the-art baselines on a challenging realistic image-editing dataset consisting of semantic editing (object addition, removal, replacement, and background change), stylistic editing (texture changes), and structural editing (action change) tasks. Not only does SPICE achieve the highest quantitative performance according to standard evaluation metrics, but it is also consistently preferred by users over existing image-editing methods. We release the workflow implementation for popular diffusion model Web UIs to support further research and artistic exploration.

[CV-75] Performance Estimation for Supervised Medical Image Segmentation Models on Unlabeled Data Using UniverSeg

[Quick Read]: This paper addresses the uncertainty in evaluating model performance in clinical applications, where annotating all data is impractical. The key is the Segmentation Performance Evaluator (SPE), a framework that estimates a segmentation model's performance on unlabeled data without annotations. SPE is broadly adaptable, compatible with various evaluation metrics and model architectures; experiments show a high correlation (0.956 ± 0.046) and a low mean absolute error (0.025 ± 0.019), demonstrating its reliability and effectiveness. Moreover, SPE integrates seamlessly into any model-training pipeline without adding training overhead, supporting performance estimation and facilitating the real-world application of medical image segmentation algorithms. The source code is publicly available.

Link: https://arxiv.org/abs/2504.15667
Authors: Jingchen Zou, Jianqiang Li, Gabriel Jimenez, Qing Zhao, Daniel Racoceanu, Matias Cosarinsky, Enzo Ferrante, Guanghui Fu
Affiliations: Unknown
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:The performance of medical image segmentation models is usually evaluated using metrics like the Dice score and Hausdorff distance, which compare predicted masks to ground truth annotations. However, when applying the model to unseen data, such as in clinical settings, it is often impractical to annotate all the data, making the model's performance uncertain. To address this challenge, we propose the Segmentation Performance Evaluator (SPE), a framework for estimating segmentation models' performance on unlabeled data. This framework is adaptable to various evaluation metrics and model architectures. Experiments on six publicly available datasets across six evaluation metrics, including pixel-based metrics such as the Dice score and distance-based metrics like HD95, demonstrated the versatility and effectiveness of our approach, achieving a high correlation (0.956 ± 0.046) and low MAE (0.025 ± 0.019) compared with the real Dice score on the independent test set. These results highlight its ability to reliably estimate model performance without requiring annotations. The SPE framework integrates seamlessly into any model training process without adding training overhead, enabling performance estimation and facilitating the real-world application of medical image segmentation algorithms. The source code is publicly available.

[CV-76] RepNet-VSR: Reparameterizable Architecture for High-Fidelity Video Super-Resolution CVPR2025

[Quick Read]: This paper tackles the challenge of deploying video super-resolution (VSR) in real time on resource-constrained edge devices, particularly in mobile video processing where power efficiency and low latency must coexist. Conventional deep convolutional networks deliver state-of-the-art spatio-temporal super-resolution but are too computationally heavy for broad practical deployment. The key is RepNet-VSR, a reparameterizable, high-fidelity VSR method whose efficient, lightweight architecture preserves reconstruction quality (27.79 dB PSNR on the REDS validation set) while processing 180p-to-720p frames in 103 ms per 10 frames in real time, striking an excellent balance between restoration quality and deployment efficiency.

Link: https://arxiv.org/abs/2504.15649
Authors: Biao Wu, Diankai Zhang, Shaoli Liu, Si Gao, Chengjian Zheng, Ning Wang
Affiliations: State Key Laboratory of Mobile Network and Mobile Multimedia Technology, ZTE, China
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: Champion Solution for CVPR 2025 MAI VSR Track

Abstract:As a fundamental challenge in visual computing, video super-resolution (VSR) focuses on reconstructing high-definition video sequences from their degraded low-resolution counterparts. While deep convolutional neural networks have demonstrated state-of-the-art performance in spatial-temporal super-resolution tasks, their computationally intensive nature poses significant deployment challenges for resource-constrained edge devices, particularly in real-time mobile video processing scenarios where power efficiency and latency constraints coexist. In this work, we propose RepNet-VSR, a reparameterizable architecture for high-fidelity video super-resolution, for real-time 4x video super-resolution. On the REDS validation set, the proposed model achieves 27.79 dB PSNR when processing 180p to 720p frames in 103 ms per 10 frames on a MediaTek Dimensity NPU. The competition results demonstrate an excellent balance between restoration quality and deployment efficiency. The proposed method scores higher than the previous champion algorithm of the MAI video super-resolution challenge.

[CV-77] VLM-based Prompts as the Optimal Assistant for Unpaired Histopathology Virtual Staining

[Quick Read]: This paper aims to separate the fundamental visual characteristics of tissue sections from the visual differences induced by staining agents in virtual staining, and to remedy the fact that previous methods stop at style-level transfer while overlooking pathological knowledge and the physical properties of staining. For the first time in a virtual staining task, a pathological vision-language large model (VLM) is introduced as an auxiliary tool: contrastive learnable prompts, foundational concept anchors for tissue sections, and staining-specific concept anchors are combined to exploit the VLM's extensive knowledge to describe, frame, and refine the direction of virtual staining. A VLM-constrained data augmentation method is also developed, using the model's strong image-interpretation ability to further integrate image style and structural information, which benefits high-precision pathological diagnosis. The key lies in innovatively injecting pathological knowledge into the virtual-staining pipeline and pairing it with this augmentation strategy, improving both the realism of generated images and the accuracy of downstream tasks such as glomerular detection and segmentation.

Link: https://arxiv.org/abs/2504.15545
Authors: Zizhi Chen, Xinyu Zhang, Minghao Han, Yizhou Liu, Ziyun Qian, Weifeng Zhang, Xukun Zhang, Jingwei Wei, Lihua Zhang
Affiliations: Fudan University; Central South University; Harbin Institute of Technology; Chinese Academy of Sciences
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:In histopathology, tissue sections are typically stained using common HE staining or special stains (MAS, PAS, PASM, etc.) to clearly visualize specific tissue structures. The rapid advancement of deep learning offers an effective solution for generating virtually stained images, significantly reducing the time and labor costs associated with traditional histochemical staining. However, a new challenge arises in separating the fundamental visual characteristics of tissue sections from the visual differences induced by staining agents. Additionally, virtual staining often overlooks essential pathological knowledge and the physical properties of staining, resulting in only style-level transfer. To address these issues, we introduce, for the first time in virtual staining tasks, a pathological vision-language large model (VLM) as an auxiliary tool. We integrate contrastive learnable prompts, foundational concept anchors for tissue sections, and staining-specific concept anchors to leverage the extensive knowledge of the pathological VLM. This approach is designed to describe, frame, and enhance the direction of virtual staining. Furthermore, we have developed a data augmentation method based on the constraints of the VLM. This method utilizes the VLM’s powerful image interpretation capabilities to further integrate image style and structural information, proving beneficial in high-precision pathological diagnostics. Extensive evaluations on publicly available multi-domain unpaired staining datasets demonstrate that our method can generate highly realistic images and enhance the accuracy of downstream tasks, such as glomerular detection and segmentation. Our code is available at: this https URL

[CV-78] Fluorescence Reference Target Quantitative Analysis Library

[Quick Read]: This paper addresses the unmet need for standardized performance evaluation of fluorescence imaging systems in fluorescence-guided surgery (FGS). Although the AAPM TG311 report and recent FDA draft guidance recommend system characterization metrics, practical tools for extracting them remain limited, inconsistent, and often inaccessible. The key contribution is QUEL-QAL, an open-source Python library that streamlines and standardizes the quantitative analysis of fluorescence images using solid reference targets. Its modular design provides a reproducible workflow spanning region-of-interest detection, statistical analysis, and visualization, and covers key metrics such as response linearity, limit of detection, depth sensitivity, and spatial resolution, in alignment with regulatory and academic guidance. Built on widely adopted Python packages, the library is extensible to novel target designs and analysis protocols; by promoting transparency, reproducibility, and regulatory alignment, QUEL-QAL offers a foundational tool for standardized benchmarking and for accelerating the development and evaluation of fluorescence imaging systems.

Link: https://arxiv.org/abs/2504.15496
Authors: Eammon A. Littler, Emmanuel A. Mannoh, Ethan P. M. LaRochelle
Affiliations: Unknown
Subjects: Medical Physics (physics.med-ph); Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV); Quantitative Methods (q-bio.QM)
Comments: 12 pages, 1 table, 4 figures. Code available: this https URL; PyPI: quel-qal

Abstract:Standardized performance evaluation of fluorescence imaging systems remains a critical unmet need in the field of fluorescence-guided surgery (FGS). While the American Association of Physicists in Medicine (AAPM) TG311 report and recent FDA draft guidance provide recommended metrics for system characterization, practical tools for extracting these metrics remain limited, inconsistent, and often inaccessible. We present QUEL-QAL, an open-source Python library designed to streamline and standardize the quantitative analysis of fluorescence images using solid reference targets. The library provides a modular, reproducible workflow that includes region of interest (ROI) detection, statistical analysis, and visualization capabilities. QUEL-QAL supports key metrics such as response linearity, limit of detection, depth sensitivity, and spatial resolution, in alignment with regulatory and academic guidance. Built on widely adopted Python packages, the library is designed to be extensible, enabling users to adapt it to novel target designs and analysis protocols. By promoting transparency, reproducibility, and regulatory alignment, QUEL-QAL offers a foundational tool to support standardized benchmarking and accelerate the development and evaluation of fluorescence imaging systems.
zh
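以下给出一个最小示意,帮助理解摘要中"响应线性度"与"检测限"这类指标的计算思路:对参考靶标各 ROI 的平均强度做对数域线性回归求 R²,并用"空白均值 + 3 倍标准差"判据估计检测限。为避免臆造 quel-qal 的真实 API,示例仅使用 numpy,数据为虚构:

```python
import numpy as np

# 虚构示例:参考靶标各孔的已知荧光浓度(nM)与测得的 ROI 平均强度
concentration = np.array([0.0, 10.0, 30.0, 100.0, 300.0, 1000.0])
roi_mean = np.array([52.0, 61.0, 83.0, 240.0, 700.0, 2300.0])
roi_std_blank = 4.0  # 空白(0 浓度)ROI 的强度标准差

# 响应线性度:对数-对数域线性回归(跳过 0 浓度点)
mask = concentration > 0
log_c, log_i = np.log10(concentration[mask]), np.log10(roi_mean[mask])
slope, intercept = np.polyfit(log_c, log_i, 1)
pred = slope * log_c + intercept
r2 = 1 - np.sum((log_i - pred) ** 2) / np.sum((log_i - log_i.mean()) ** 2)

# 检测限:常用 "空白均值 + 3 倍空白标准差" 判据
lod_threshold = roi_mean[0] + 3 * roi_std_blank
print(f"linearity R^2 = {r2:.4f}, LoD intensity threshold = {lod_threshold:.1f}")
```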

[CV-79] Split-quaternions for perceptual white balance

【速读】:该论文旨在解决白平衡中的感知色温适应问题,提出了一种基于分裂四元数(split-quaternions)的感知色度适应变换方法。解决方案的关键在于强调量子-like颜色感知模型中代数结构与分裂四元数子代数之间的联系,并通过适当利用分裂四元数乘法实现色度适应变换。这种创新性方法展示了在彩色图像处理应用中的潜力,并通过定量比较证明其相对于von Kries色度适应变换的优势。

链接: https://arxiv.org/abs/2504.15481
作者: Michel Berthier,Nicoletta Prencipe,Edoardo Provenzi
机构: Laboratoire MIA, Université de La Rochelle (拉罗谢尔大学); Center for Ubiquitous Computing, University of Oulu (奥卢大学); Université de Bordeaux (波尔多大学)
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:We propose a perceptual chromatic adaptation transform for white balance that makes use of split-quaternions. The novelty of the present work, which is motivated by a recently developed quantum-like model of color perception, consists in stressing the link between the algebraic structures appearing in this model and a certain sub-algebra of the split-quaternions. We show the potentiality of this approach for color image processing applications by proposing a chromatic adaptation transform, implemented via an appropriate use of the split-quaternion multiplication. Moreover, quantitative comparisons with the widely used state-of-the-art von Kries chromatic adaptation transform are provided.
zh
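理解本文的前提是分裂四元数的乘法规则:基元满足 i^2 = -1、j^2 = k^2 = +1,且互相反交换(ij = k, jk = -i, ki = j)。下面是一个与论文具体色适应变换无关的最小示意,仅实现并验证这套代数运算:

```python
import numpy as np

def sq_mul(p, q):
    """分裂四元数乘法,基 (1, i, j, k):i^2=-1, j^2=k^2=+1, ij=k, jk=-i, ki=j。"""
    w1, x1, y1, z1 = p
    w2, x2, y2, z2 = q
    return np.array([
        w1*w2 - x1*x2 + y1*y2 + z1*z2,   # 实部
        w1*x2 + x1*w2 - y1*z2 + z1*y2,   # i 分量
        w1*y2 + y1*w2 - x1*z2 + z1*x2,   # j 分量
        w1*z2 + z1*w2 + x1*y2 - y1*x2,   # k 分量
    ])

i = np.array([0., 1., 0., 0.])
j = np.array([0., 0., 1., 0.])
k = np.array([0., 0., 0., 1.])
assert np.allclose(sq_mul(i, i), [-1, 0, 0, 0])  # i^2 = -1
assert np.allclose(sq_mul(j, j), [+1, 0, 0, 0])  # j^2 = +1,与普通四元数的区别所在
assert np.allclose(sq_mul(i, j), k)              # ij = k
```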

[CV-80] Enhancing DR Classification with Swin Transformer and Shifted Window Attention

【速读】:该论文旨在解决糖尿病视网膜病变(Diabetic Retinopathy, DR)自动化分类面临的挑战,包括图像质量的变化、类别不平衡以及像素级相似性导致的模型训练困难。为了解决这些问题,论文提出了一种鲁棒的预处理流程,包括图像裁剪、对比度受限自适应直方图均衡化(Contrast-Limited Adaptive Histogram Equalization, CLAHE)以及针对性的数据增强方法,以提升模型的泛化能力和抗干扰能力。此外,关键在于采用Swin Transformer模型,其通过分层令牌处理和移位窗口(shifted window)注意力机制,高效捕捉细粒度特征的同时保持线性计算复杂度。

链接: https://arxiv.org/abs/2504.15317
作者: Meher Boulaabi,Takwa Ben Aïcha Gader,Afef Kacem Echi,Zied Bouraoui
机构: 未知
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Diabetic retinopathy (DR) is a leading cause of blindness worldwide, underscoring the importance of early detection for effective treatment. However, automated DR classification remains challenging due to variations in image quality, class imbalance, and pixel-level similarities that hinder model training. To address these issues, we propose a robust preprocessing pipeline incorporating image cropping, Contrast-Limited Adaptive Histogram Equalization (CLAHE), and targeted data augmentation to improve model generalization and resilience. Our approach leverages the Swin Transformer, which utilizes hierarchical token processing and shifted window attention to efficiently capture fine-grained features while maintaining linear computational complexity. We validate our method on the Aptos and IDRiD datasets for multi-class DR classification, achieving accuracy rates of 89.65% and 97.40%, respectively. These results demonstrate the effectiveness of our model, particularly in detecting early-stage DR, highlighting its potential for improving automated retinal screening in clinical settings.
zh
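预处理流程中的 CLAHE 可直接调用 OpenCV 实现。以下为一个最小示意(黑边裁剪阈值与 CLAHE 参数均为常见取值,属于示例假设,并非论文的原始配置),在 LAB 空间仅对亮度通道做增强以避免色彩失真:

```python
import cv2
import numpy as np

def preprocess_fundus(img_bgr: np.ndarray) -> np.ndarray:
    """示例预处理:裁剪眼底图黑色边框 + 在 L 通道上应用 CLAHE。"""
    gray = cv2.cvtColor(img_bgr, cv2.COLOR_BGR2GRAY)
    ys, xs = np.where(gray > 10)                      # 阈值 10 为示例假设
    img_bgr = img_bgr[ys.min():ys.max() + 1, xs.min():xs.max() + 1]

    lab = cv2.cvtColor(img_bgr, cv2.COLOR_BGR2LAB)
    l, a, b = cv2.split(lab)
    clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
    return cv2.cvtColor(cv2.merge((clahe.apply(l), a, b)), cv2.COLOR_LAB2BGR)
```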

人工智能

[AI-0] LLMs are Greedy Agents: Effects of RL Fine-tuning on Decision-Making Abilities

【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在决策场景中表现次优的问题,具体表现为贪婪性(greediness)、频率偏差(frequency bias)以及知道-做到差距(knowing-doing gap)等三种常见的失效模式。论文的关键解决方案是通过强化学习(Reinforcement Learning, RL)微调,利用自动生成的链式思维(Chain-of-Thought, CoT)理由来缓解这些不足。实验结果表明,这种微调方法能够提升LLMs的探索能力,并缩小知道-做到差距,从而增强其决策能力。此外,论文还研究了经典的探索机制(如 \epsilon -贪心策略)和LLM特有的方法(如自我校正和自我一致性),以实现更有效的决策相关微调。

链接: https://arxiv.org/abs/2504.16078
作者: Thomas Schmied,Jörg Bornschein,Jordi Grau-Moya,Markus Wulfmeier,Razvan Pascanu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The success of Large Language Models (LLMs) has sparked interest in various agentic applications. A key hypothesis is that LLMs, leveraging common sense and Chain-of-Thought (CoT) reasoning, can effectively explore and efficiently solve complex domains. However, LLM agents have been found to suffer from sub-optimal exploration and the knowing-doing gap, the inability to effectively act on knowledge present in the model. In this work, we systematically study why LLMs perform sub-optimally in decision-making scenarios. In particular, we closely examine three prevalent failure modes: greediness, frequency bias, and the knowing-doing gap. We propose mitigation of these shortcomings by fine-tuning via Reinforcement Learning (RL) on self-generated CoT rationales. Our experiments across multi-armed bandits, contextual bandits, and Tic-tac-toe, demonstrate that RL fine-tuning enhances the decision-making abilities of LLMs by increasing exploration and narrowing the knowing-doing gap. Finally, we study both classic exploration mechanisms, such as \epsilon -greedy, and LLM-specific approaches, such as self-correction and self-consistency, to enable more effective fine-tuning of LLMs for decision-making.
zh
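摘要中作为对照的经典探索机制 \epsilon-贪心,在多臂赌博机上几行代码即可复现。以下为一个独立的最小示意(与论文的 LLM 实验无关),可直观看到随机探索如何缓解"贪婪性"失效模式:

```python
import numpy as np

rng = np.random.default_rng(0)
true_means = np.array([0.2, 0.5, 0.8])   # 三臂伯努利赌博机的真实成功率(对智能体未知)
n_arms, epsilon, steps = len(true_means), 0.1, 2000

counts = np.zeros(n_arms)
values = np.zeros(n_arms)                # 各臂收益的经验均值估计
for t in range(steps):
    if rng.random() < epsilon:           # 以概率 epsilon 随机探索
        arm = int(rng.integers(n_arms))
    else:                                # 否则贪心利用当前估计
        arm = int(np.argmax(values))
    reward = float(rng.random() < true_means[arm])
    counts[arm] += 1
    values[arm] += (reward - values[arm]) / counts[arm]  # 增量式均值更新

print("经验均值:", values.round(3), "最优臂选择频率:", counts[2] / steps)
```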

[AI-1] Approximate matrices of systems of max-min fuzzy relational equations

【速读】:该论文致力于解决由最大-最小模糊关系方程系统(max-min fuzzy relational equation system)中的不一致性所导致的问题,通过最小化修改控制系统的矩阵来实现一致性。论文的关键在于提出了一种方法,能够构建出与原不一致系统具有相同右端向量的一致性系统,并且仅对需要修正的矩阵元素进行精确且最小的修改,同时保持其他所有元素不变。为了获得更接近原不一致系统的近似一致系统,研究了不一致系统矩阵与其对应的、使用相同右端向量的一致性系统矩阵集合之间的距离(基于 L_1、L_2 或 L_\infty 范数)。论文表明,该方法可以直接计算出与不一致系统矩阵在 L_\infty 范数意义下距离最小的一致性系统矩阵(使用 L_1 或 L_2 范数时计算成本更高),并且给出了计算此最小 L_\infty 距离的显式分析公式。此外,还将结果推广至最小-最大模糊关系方程系统,并探讨了潜在的应用场景。

链接: https://arxiv.org/abs/2504.16042
作者: Ismaïl Baaj
机构: 未知
类目: Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO)
备注:

点击查看摘要

Abstract:In this article, we address the inconsistency of a system of max-min fuzzy relational equations by minimally modifying the matrix governing the system in order to achieve consistency. Our method yields consistent systems that approximate the original inconsistent system in the following sense: the right-hand side vector of each consistent system is that of the inconsistent system, and the coefficients of the matrix governing each consistent system are obtained by modifying, exactly and minimally, the entries of the original matrix that must be corrected to achieve consistency, while leaving all other entries unchanged. To obtain a consistent system that closely approximates the considered inconsistent system, we study the distance (in terms of a norm among L_1, L_2 or L_\infty) between the matrix of the inconsistent system and the set formed by the matrices of consistent systems that use the same right-hand side vector as the inconsistent system. We show that our method allows us to directly compute matrices of consistent systems that use the same right-hand side vector as the inconsistent system whose distance in terms of L_\infty norm to the matrix of the inconsistent system is minimal (the computational costs are higher when using L_1 norm or L_2 norm). We also give an explicit analytical formula for computing this minimal L_\infty distance. Finally, we translate our results for systems of min-max fuzzy relational equations and present some potential applications.
zh
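理解本文的出发点"系统不一致"可借助经典结论(Sanchez):对 max-min 系统 A∘x = b,取 Gödel 蕴涵给出的最大候选解 x̂,系统一致当且仅当 A∘x̂ = b。以下示意代码实现这一经典判定,与论文提出的矩阵最小修正方法本身无关:

```python
import numpy as np

def maxmin_compose(A, x):
    """max-min 合成:(A∘x)_i = max_j min(a_ij, x_j)。"""
    return np.max(np.minimum(A, x[None, :]), axis=1)

def greatest_candidate(A, b):
    """Gödel 蕴涵给出的最大候选解:x_j = min_i (a_ij -> b_i),a->b = 1 若 a<=b,否则为 b。"""
    impl = np.where(A <= b[:, None], 1.0, b[:, None] * np.ones_like(A))
    return impl.min(axis=0)

A = np.array([[0.8, 0.3], [0.6, 0.9]])
b = np.array([0.5, 0.7])
x_hat = greatest_candidate(A, b)
print("候选解:", x_hat, " 系统一致:", np.allclose(maxmin_compose(A, x_hat), b))
```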

[AI-2] Muon Optimizer Accelerates Grokking

【速读】:该论文旨在探究不同优化器对“grokking”现象的影响,即模型在训练过程中表现出的延迟泛化行为。研究通过在七个数值任务(主要是模算术)上使用现代Transformer架构进行实验,系统性地改变优化器(Muon与AdamW)以及Softmax激活函数的类型(标准Softmax、Stablemax和Sparsemax),以评估它们对学习动态的综合影响。论文的关键在于发现Muon优化器通过采用谱归一化约束和二阶信息,显著加速了grokking现象的发生,相比AdamW优化器将平均grokking周期从153.09减少到102.89,这一差异具有统计学意义(t = 5.0175, p = 6.33e-08)。这表明优化器的选择在促进模型从记忆向泛化的过渡中起着至关重要的作用。

链接: https://arxiv.org/abs/2504.16041
作者: Amund Tveit,Bjørn Remseth,Arve Skogvold
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 8 pages, 4 figures

点击查看摘要

Abstract:This paper investigates the impact of different optimizers on the grokking phenomenon, where models exhibit delayed generalization. We conducted experiments across seven numerical tasks (primarily modular arithmetic) using a modern Transformer architecture. The experimental configuration systematically varied the optimizer (Muon vs. AdamW) and the softmax activation function (standard softmax, stablemax, and sparsemax) to assess their combined effect on learning dynamics. Our empirical evaluation reveals that the Muon optimizer, characterized by its use of spectral norm constraints and second-order information, significantly accelerates the onset of grokking compared to the widely used AdamW optimizer. Specifically, Muon reduced the mean grokking epoch from 153.09 to 102.89 across all configurations, a statistically significant difference (t = 5.0175, p = 6.33e-08). This suggests that the optimizer choice plays a crucial role in facilitating the transition from memorization to generalization.
zh
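Muon 的核心操作是对动量化梯度做谱归一化式的正交化,公开实现中通常用五步 Newton-Schulz 迭代近似。以下为一个最小示意,其中迭代系数 (3.4445, -4.7750, 2.0315) 取自常见公开实现,属于示例假设而非本文给出:

```python
import torch

def newton_schulz_orthogonalize(G: torch.Tensor, steps: int = 5) -> torch.Tensor:
    """用 Newton-Schulz 迭代将矩阵 G 近似映射到最近的半正交矩阵(奇异值趋近 1)。"""
    a, b, c = 3.4445, -4.7750, 2.0315    # 五步迭代的常见系数(示例假设)
    X = G / (G.norm() + 1e-7)            # 先按 Frobenius 范数归一化以保证收敛
    transposed = X.shape[0] > X.shape[1]
    if transposed:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X

# Muon 式一步更新(示意):动量累积 -> 正交化 -> 按学习率更新权重
W, grad = torch.randn(64, 32), torch.randn(64, 32)
momentum = torch.zeros_like(W)
lr, beta = 0.02, 0.95
momentum = beta * momentum + grad
W = W - lr * newton_schulz_orthogonalize(momentum)
```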

[AI-3] LLMs meet Federated Learning for Scalable and Secure IoT Management

【速读】:该论文旨在解决物联网(IoT)生态系统扩展带来的可扩展性、安全性和实时决策能力方面的严峻挑战。传统集中式架构在延迟、隐私保护和资源消耗方面表现不佳,难以满足现代大规模物联网部署的需求。论文提出了一种基于联邦学习的大规模语言模型(FL-LLM)框架,通过结合生成式物联网(GIoT)模型与梯度感知联邦策略(GSFS),动态优化模型更新以适应实时网络条件,同时确保数据隐私和计算效率。关键在于采用边缘-云混合处理架构,平衡分布式物联网环境中的智能性、可扩展性和安全性。

链接: https://arxiv.org/abs/2504.16032
作者: Yazan Otoum,Arghavan Asad,Amiya Nayak
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET)
备注: This work has been submitted to the IEEE Global Communications Conference (GLOBECOM) 2025 for possible publication

点击查看摘要

Abstract:The rapid expansion of IoT ecosystems introduces severe challenges in scalability, security, and real-time decision-making. Traditional centralized architectures struggle with latency, privacy concerns, and excessive resource consumption, making them unsuitable for modern large-scale IoT deployments. This paper presents a novel Federated Learning-driven Large Language Model (FL-LLM) framework, designed to enhance IoT system intelligence while ensuring data privacy and computational efficiency. The framework integrates Generative IoT (GIoT) models with a Gradient Sensing Federated Strategy (GSFS), dynamically optimizing model updates based on real-time network conditions. By leveraging a hybrid edge-cloud processing architecture, our approach balances intelligence, scalability, and security in distributed IoT environments. Evaluations on the IoT-23 dataset demonstrate that our framework improves model accuracy, reduces response latency, and enhances energy efficiency, outperforming traditional FL techniques (i.e., FedAvg, FedOpt). These findings highlight the potential of integrating LLM-powered federated learning into large-scale IoT ecosystems, paving the way for more secure, scalable, and adaptive IoT management solutions.
zh

[AI-4] Benchmarking LLM for Code Smells Detection: OpenAI GPT-4.0 vs DeepSeek-V3

【速读】:本文旨在解决在多种编程语言中确定最有效的大型语言模型(Large Language Model, LLM)用于代码异味(code smell)检测这一复杂挑战。论文的关键在于提出了一种结构化的方法论与评估矩阵,并利用一个经过精心标注的代码样本数据集(涵盖Java、Python、JavaScript和C++四种主流编程语言),以支持跨语言比较。解决方案的核心包括采用精准率(precision)、召回率(recall)和F1分数作为评估指标,对两个最先进的LLM——OpenAI GPT 4.0和DeepSeek-V3进行基准测试。此外,研究还深入分析了总体性能、类别级别性能以及单个代码异味类型性能,并探讨了基于令牌的GPT 4.0检测方法与DeepSeek V3所使用的模式匹配技术之间的成本效益对比,同时将成本与传统静态分析工具如SonarQube进行了相对比较。这些发现为从业者选择高效且经济有效的自动化代码异味检测方案提供了重要指导。

链接: https://arxiv.org/abs/2504.16027
作者: Ahmed R. Sadik,Siddhata Govind
机构: 未知
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Programming Languages (cs.PL)
备注:

点击查看摘要

Abstract:Determining the most effective Large Language Model for code smell detection presents a complex challenge. This study introduces a structured methodology and evaluation matrix to tackle this issue, leveraging a curated dataset of code samples consistently annotated with known smells. The dataset spans four prominent programming languages: Java, Python, JavaScript, and C++, allowing for cross-language comparison. We benchmark two state-of-the-art LLMs, OpenAI GPT-4.0 and DeepSeek-V3, using precision, recall, and F1 score as evaluation metrics. Our analysis covers three levels of detail: overall performance, category-level performance, and individual code smell type performance. Additionally, we explore cost effectiveness by comparing the token-based detection approach of GPT-4.0 with the pattern-matching techniques employed by DeepSeek-V3. The study also includes a cost analysis relative to traditional static analysis tools such as SonarQube. The findings offer valuable guidance for practitioners in selecting an efficient, cost-effective solution for automated code smell detection.
zh
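摘要所述的评估矩阵(总体、类别级与具体异味类型三个层级的 precision/recall/F1)可用 scikit-learn 直接复算。以下为最小示意,标签与预测均为虚构数据:

```python
from sklearn.metrics import classification_report, precision_recall_fscore_support

# 虚构的代码异味标注与某 LLM 的预测
y_true = ["LongMethod", "GodClass", "LongMethod", "FeatureEnvy", "GodClass", "None"]
y_pred = ["LongMethod", "GodClass", "FeatureEnvy", "FeatureEnvy", "None", "None"]

# 逐异味类型的 precision / recall / F1
print(classification_report(y_true, y_pred, zero_division=0))

# 总体宏平均指标,便于跨模型、跨语言比较
p, r, f1, _ = precision_recall_fscore_support(y_true, y_pred,
                                              average="macro", zero_division=0)
print(f"macro P={p:.2f} R={r:.2f} F1={f1:.2f}")
```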

[AI-5] Trends in AI Supercomputers

【速读】:该论文试图解决的问题是如何全面分析和理解AI超算系统的性能、成本、能耗及其全球分布等关键趋势。解决方案的关键在于创建了一个包含从2019年至2025年的500台AI超算系统的大规模数据集,并通过深入分析揭示了计算性能每九个月翻一番、硬件成本与能耗每年翻一番等重要规律。此外,研究还探讨了AI超算系统从科学工具向工业设备演进过程中所有权结构的变化以及国家间的竞争态势,为政策制定者评估资源需求、所有权模式及国家竞争力提供了洞察。

链接: https://arxiv.org/abs/2504.16026
作者: Konstantin F. Pilz,James Sanders,Robi Rahman,Lennart Heim
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Frontier AI development relies on powerful AI supercomputers, yet analysis of these systems is limited. We create a dataset of 500 AI supercomputers from 2019 to 2025 and analyze key trends in performance, power needs, hardware cost, ownership, and global distribution. We find that the computational performance of AI supercomputers has doubled every nine months, while hardware acquisition cost and power needs both doubled every year. The leading system in March 2025, xAI’s Colossus, used 200,000 AI chips, had a hardware cost of $7B, and required 300 MW of power, as much as 250,000 households. As AI supercomputers evolved from tools for science to industrial machines, companies rapidly expanded their share of total AI supercomputer performance, while the share of governments and academia diminished. Globally, the United States accounts for about 75% of total performance in our dataset, with China in second place at 15%. If the observed trends continue, the leading AI supercomputer in 2030 will achieve 2\times10^{22} 16-bit FLOP/s, use two million AI chips, have a hardware cost of $200 billion, and require 9 GW of power. Our analysis provides visibility into the AI supercomputer landscape, allowing policymakers to assess key AI trends like resource needs, ownership, and national competitiveness.
zh
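摘要中 2030 年的外推可用"性能每 9 个月翻番、成本与功耗每年翻番"直接验算。下面的小脚本以 2025 年 3 月的 Colossus 为基线复现该推算;其中基线算力按 20 万块芯片估为 2\times10^{20} FLOP/s,属于示例假设:

```python
months = 5 * 12                 # 2025-03 至 2030-03

perf_2025 = 2e20                # 基线 16-bit 算力(FLOP/s,示例假设)
cost_2025 = 7e9                 # 硬件成本($7B)
power_2025 = 300e6              # 功耗(300 MW)

perf_2030 = perf_2025 * 2 ** (months / 9)     # 性能每 9 个月翻番
cost_2030 = cost_2025 * 2 ** (months / 12)    # 成本每年翻番
power_2030 = power_2025 * 2 ** (months / 12)  # 功耗每年翻番

print(f"2030 算力 ≈ {perf_2030:.1e} FLOP/s")    # 约 2e22,与摘要一致
print(f"2030 成本 ≈ ${cost_2030 / 1e9:.0f}B")   # 约 $224B,与摘要的 $200 billion 同量级
print(f"2030 功耗 ≈ {power_2030 / 1e9:.1f} GW") # 约 9.6 GW,与摘要的 9 GW 同量级
```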

[AI-6] Navigating the State of Cognitive Flow: Context-Aware AI Interventions for Effective Reasoning Support

【速读】:该论文旨在解决在人工智能辅助推理过程中,传统静态干预方法可能破坏认知流畅状态(Cognitive Flow)并阻碍决策的问题。论文提出的关键解决方案是一个上下文感知的认知增强框架(Context-Aware Cognitive Augmentation Framework),其核心在于根据干预类型(type)、干预时机(timing)和干预规模(scale)三个关键上下文因素动态调整认知支持。通过利用多模态行为线索(如注视行为、打字犹豫、交互速度等),该框架能够个性化、自适应且低侵入性地维持或恢复认知流畅状态,从而确保人工智能系统能够在不打断认知沉浸的前提下支持复杂的决策与推理过程。

链接: https://arxiv.org/abs/2504.16021
作者: Dinithi Dissanayake,Suranga Nanayakkara
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注: Presented at the 2025 ACM Workshop on Human-AI Interaction for Augmented Reasoning, Report Number: CHI25-WS-AUGMENTED-REASONING

点击查看摘要

Abstract:Flow theory describes an optimal cognitive state where individuals experience deep focus and intrinsic motivation when a task’s difficulty aligns with their skill level. In AI-augmented reasoning, interventions that disrupt the state of cognitive flow can hinder rather than enhance decision-making. This paper proposes a context-aware cognitive augmentation framework that adapts interventions based on three key contextual factors: type, timing, and scale. By leveraging multimodal behavioral cues (e.g., gaze behavior, typing hesitation, interaction speed), AI can dynamically adjust cognitive support to maintain or restore flow. We introduce the concept of cognitive flow, an extension of flow theory in AI-augmented reasoning, where interventions are personalized, adaptive, and minimally intrusive. By shifting from static interventions to context-aware augmentation, our approach ensures that AI systems support deep engagement in complex decision-making and reasoning without disrupting cognitive immersion.
zh

[AI-7] AlphaGrad: Non-Linear Gradient Normalization Optimizer

【速读】:该论文试图解决自适应优化方法(如Adam)在实际应用中面临的内存开销大及超参数调节复杂的问题。为应对这些挑战,论文提出了AlphaGrad优化器,其关键是通过张量级L2梯度归一化结合平滑双曲正切变换 g' = \tanh(\alpha \cdot \tilde{g}) 实现尺度不变性,并仅依赖单一陡度参数 \alpha 进行控制。这种设计不仅降低了内存消耗,还简化了超参数配置的复杂度,同时通过理论分析证明了其在非凸问题上的收敛性保证,并在多种强化学习基准测试中展示了其性能优势与适用场景的多样性。

链接: https://arxiv.org/abs/2504.16020
作者: Soham Sane
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE); Machine Learning (stat.ML)
备注:

点击查看摘要

Abstract:We introduce AlphaGrad, a memory-efficient, conditionally stateless optimizer addressing the memory overhead and hyperparameter complexity of adaptive methods like Adam. AlphaGrad enforces scale invariance via tensor-wise L2 gradient normalization followed by a smooth hyperbolic tangent transformation, g' = \tanh(\alpha \cdot \tilde{g}), controlled by a single steepness parameter \alpha. Our contributions include: (1) the AlphaGrad algorithm formulation; (2) a formal non-convex convergence analysis guaranteeing stationarity; (3) extensive empirical evaluation on diverse RL benchmarks (DQN, TD3, PPO). Compared to Adam, AlphaGrad demonstrates a highly context-dependent performance profile. While exhibiting instability in off-policy DQN, it provides enhanced training stability with competitive results in TD3 (requiring careful \alpha tuning) and achieves substantially superior performance in on-policy PPO. These results underscore the critical importance of empirical \alpha selection, revealing strong interactions between the optimizer’s dynamics and the underlying RL algorithm. AlphaGrad presents a compelling alternative optimizer for memory-constrained scenarios and shows significant promise for on-policy learning regimes where its stability and efficiency advantages can be particularly impactful.
zh
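AlphaGrad 的梯度变换在摘要中已给出闭式:张量级 L2 归一化后接 tanh。以下是按该公式写的最小 PyTorch 示意;将其嵌入 SGD 循环的方式为示例写法,并非官方实现:

```python
import torch

def alphagrad_transform(grad: torch.Tensor, alpha: float = 5.0) -> torch.Tensor:
    """AlphaGrad 变换:g' = tanh(alpha * g~),g~ 为张量级 L2 归一化后的梯度。"""
    g_tilde = grad / (grad.norm(p=2) + 1e-12)   # 张量级 L2 归一化,带来尺度不变性
    return torch.tanh(alpha * g_tilde)          # tanh 平滑限幅,陡度由 alpha 控制

# 示意用法:在最简 SGD 步骤中以变换后的梯度替换原始梯度(alpha 需按任务调节)
w = torch.randn(10, requires_grad=True)
loss = (w ** 2).sum()
loss.backward()
with torch.no_grad():
    w -= 0.1 * alphagrad_transform(w.grad, alpha=5.0)
```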

[AI-8] OPUS-VFL: Incentivizing Optimal Privacy-Utility Tradeoffs in Vertical Federated Learning

【速读】:该论文旨在解决现有垂直联邦学习(Vertical Federated Learning, VFL)系统面临的三个关键问题:缺乏有效的激励机制、难以平衡隐私与效用之间的权衡、以及无法适配资源异构的客户端。这些问题导致参与度不足、模型性能下降,并限制了实际部署的可能性。为了解决这些挑战,论文提出了一种名为OPUS-VFL(Optimal Privacy-Utility tradeoff Strategy for VFL)的新框架。其核心解决方案在于引入一种隐私感知的激励机制,通过综合考虑模型贡献、隐私保护和资源投入来奖励客户端;采用轻量级的留一法(leave-one-out, LOO)策略量化每个客户端的特征重要性;并结合自适应差分隐私机制,使客户端能够动态调整噪声水平以优化个人效用。此外,该框架设计为可扩展、预算平衡,并对推理攻击和投毒攻击具有鲁棒性。实验结果表明,OPUS-VFL在效率和鲁棒性方面显著优于现有的VFL基准方法,同时提升了隐私保护效果并实现了公平激励分配。

链接: https://arxiv.org/abs/2504.15995
作者: Sindhuja Madabushi,Ahmad Faraz Khan,Haider Ali,Jin-Hee Cho
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Vertical Federated Learning (VFL) enables organizations with disjoint feature spaces but shared user bases to collaboratively train models without sharing raw data. However, existing VFL systems face critical limitations: they often lack effective incentive mechanisms, struggle to balance privacy-utility tradeoffs, and fail to accommodate clients with heterogeneous resource capabilities. These challenges hinder meaningful participation, degrade model performance, and limit practical deployment. To address these issues, we propose OPUS-VFL, an Optimal Privacy-Utility tradeoff Strategy for VFL. OPUS-VFL introduces a novel, privacy-aware incentive mechanism that rewards clients based on a principled combination of model contribution, privacy preservation, and resource investment. It employs a lightweight leave-one-out (LOO) strategy to quantify feature importance per client, and integrates an adaptive differential privacy mechanism that enables clients to dynamically calibrate noise levels to optimize their individual utility. Our framework is designed to be scalable, budget-balanced, and robust to inference and poisoning attacks. Extensive experiments on benchmark datasets (MNIST, CIFAR-10, and CIFAR-100) demonstrate that OPUS-VFL significantly outperforms state-of-the-art VFL baselines in both efficiency and robustness. It reduces label inference attack success rates by up to 20%, increases feature inference reconstruction error (MSE) by over 30%, and achieves up to 25% higher incentives for clients that contribute meaningfully while respecting privacy and cost constraints. These results highlight the practicality and innovation of OPUS-VFL as a secure, fair, and performance-driven solution for real-world VFL.
zh

[AI-9] Bug Destiny Prediction in Large Open-Source Software Repositories through Sentiment Analysis and BERT Topic Modeling

【速读】:该论文旨在解决预测关键软件缺陷相关结果的问题,包括缺陷解决时间(time to resolution)、修复时间(time to fix)以及最终状态(bug destiny)。为实现这一目标,论文提出了一种新颖的方法,利用Bugzilla Eclipse项目中的数据,在缺陷解决之前提取可用特征以提升预测准确性。关键解决方案在于结合情感分析(sentiment analysis)与BERTopic模型:情感分析用于计算情绪分数及分类(正面或负面),而BERTopic模型则用于提取主题并结合优先级作为特征输入至卷积神经网络(CNN)和多层感知机(MLP)。研究发现,将BERTopic与情感分析相结合可改善某些模型性能指标,并通过平衡模型输入提升了实际应用价值,尽管这通常会导致准确率显著下降。此外,论文采用二元分类和精确时间值预测两种方式评估预测效果,表明情感分析在预测缺陷是否会被修复方面具有重要价值,但在处理更复杂或非传统的结果类别时作用相对有限。

链接: https://arxiv.org/abs/2504.15972
作者: Sophie C. Pope,Andrew Barovic,Armin Moin
机构: 未知
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:This study explores a novel approach to predicting key bug-related outcomes, including the time to resolution, time to fix, and ultimate status of a bug, using data from the Bugzilla Eclipse Project. Specifically, we leverage features available before a bug is resolved to enhance predictive accuracy. Our methodology incorporates sentiment analysis to derive both an emotionality score and a sentiment classification (positive or negative). Additionally, we integrate the bug’s priority level and its topic, extracted using a BERTopic model, as features for a Convolutional Neural Network (CNN) and a Multilayer Perceptron (MLP). Our findings indicate that the combination of BERTopic and sentiment analysis can improve certain model performance metrics. Furthermore, we observe that balancing model inputs enhances practical applicability, albeit at the cost of a significant reduction in accuracy in most cases. To address our primary objectives, predicting time-to-resolution, time-to-fix, and bug destiny, we employ both binary classification and exact time value predictions, allowing for a comparative evaluation of their predictive effectiveness. Results demonstrate that sentiment analysis serves as a valuable predictor of a bug’s eventual outcome, particularly in determining whether it will be fixed. However, its utility is less pronounced when classifying bugs into more complex or unconventional outcome categories.
zh

[AI-10] Universal Approximation with Softmax Attention

【速读】:该论文试图证明带有线性变换的自注意力机制在紧致域上能够作为连续序列到序列函数的通用近似器。具体而言,它证明了两种架构:(i) 两层自注意力和 (ii) 一层自注意力后接 softmax 函数均具备这一性质。论文的关键在于提出了一种基于插值的新方法来分析自注意力的内部机制,揭示了自注意力能够以任意精度逼近广义的修正线性单元(ReLU),从而涵盖许多已知的通用近似器。基于此,研究进一步表明,仅使用两层多头自注意力即可作为序列到序列的通用近似器。与之前依赖前馈网络的工作不同,本文直接通过自注意力实现这一目标。此外,论文还扩展了该技术,证明了仅含注意力(加 softmax)的层能够在上下文中逼近多种统计模型。这些技术方法具有独立的研究价值。

链接: https://arxiv.org/abs/2504.15956
作者: Jerry Yao-Chieh Hu,Hude Liu,Hong-Yu Chen,Weimin Wu,Han Liu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
备注:

点击查看摘要

Abstract:We prove that with linear transformations, both (i) two-layer self-attention and (ii) one-layer self-attention followed by a softmax function are universal approximators for continuous sequence-to-sequence functions on compact domains. Our main technique is a new interpolation-based method for analyzing attention’s internal mechanism. This leads to our key insight: self-attention is able to approximate a generalized version of ReLU to arbitrary precision, and hence subsumes many known universal approximators. Building on these, we show that two-layer multi-head attention alone suffices as a sequence-to-sequence universal approximator. In contrast, prior works rely on feed-forward networks to establish universal approximation in Transformers. Furthermore, we extend our techniques to show that, (softmax-)attention-only layers are capable of approximating various statistical models in-context. We believe these techniques hold independent interest.
zh
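论文分析的对象是标准的 softmax 自注意力。为便于对照其"可逼近广义 ReLU"的结论,下面给出单头 softmax 注意力的最小 numpy 实现(仅为被分析对象的标准定义,与论文的构造性证明无关):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)     # 数值稳定化
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """单头 softmax 自注意力:Attn(X) = softmax(Q K^T / sqrt(d)) V。"""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    return softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))                     # 序列长 4、维度 8
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)      # (4, 8):一个序列到序列映射
```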

[AI-11] New Recipe for Semi-supervised Community Detection: Clique Annealing under Crystallization Kinetics

【速读】:该论文旨在解决半监督社区检测方法中因标签稀缺导致的初始社区核心候选不合理以及可扩展性差的问题。现有方法通常包含初始识别和后续调整两个学习阶段,依赖于强化学习和生成对抗网络,这不仅增加了计算成本,还限制了候选节点的选择。为了解决这些问题,论文通过类比结晶动力学与社区检测,将退火过程的自发性引入社区检测中。关键在于提出了一种名为CLique ANNealing(CLANN)的方法,它将动力学概念融入优化过程以增强社区核心的一致性,并通过无学习的传递退火器在第一阶段合并邻接团并重新定位社区核心,实现自发增长过程以提升可扩展性。实验结果表明,CLANN在多个真实数据集上优于最先进的方法,展示了其卓越的有效性和效率。

链接: https://arxiv.org/abs/2504.15927
作者: Ling Cheng,Jiashu Pu,Ruicheng Liang,Qian Shao,Hezhe Qiao,Feida Zhu
机构: 未知
类目: ocial and Information Networks (cs.SI); Artificial Intelligence (cs.AI)
备注: arXiv admin note: text overlap with arXiv:2203.05898 by other authors

点击查看摘要

Abstract:Semi-supervised community detection methods are widely used for identifying specific communities due to the label scarcity. Existing semi-supervised community detection methods typically involve two learning stages, initial identification and subsequent adjustment, and often start from an unreasonable community core candidate. Moreover, these methods encounter scalability issues because they depend on reinforcement learning and generative adversarial networks, leading to higher computational costs and restricting the selection of candidates. To address these limitations, we draw a parallel between crystallization kinetics and community detection to integrate the spontaneity of the annealing process into community detection. Specifically, we liken community detection to identifying a crystal subgrain (core) that expands into a complete grain (community) through a process similar to annealing. Based on this finding, we propose CLique ANNealing (CLANN), which applies kinetics concepts to community detection by integrating these principles into the optimization process to strengthen the consistency of the community core. Subsequently, a learning-free Transitive Annealer was employed to refine the first-stage candidates by merging neighboring cliques and repositioning the community core, enabling a spontaneous growth process that enhances scalability. Extensive experiments on 43 different network settings demonstrate that CLANN outperforms state-of-the-art methods across multiple real-world datasets, showcasing its exceptional efficacy and efficiency in community detection.
zh

[AI-12] Achieving Distributive Justice in Federated Learning via Uncertainty Quantification

【速读】:该论文旨在解决联邦学习中客户端级公平性度量缺乏统一理论基础的问题。现有方法大多随意选择公平性定义,导致实践者难以根据自身公平伦理选择最合适的公平性度量。论文提出了一种名为UDJ-FL(基于不确定性分配正义的联邦学习)的灵活框架,能够实现多种基于分配正义的客户端级公平性度量。解决方案的关键在于结合公平资源分配技术和基于偶然不确定性(aleatoric uncertainty)的客户端加权策略,使UDJ-FL框架能够支持平等主义、功利主义、Rawls差异原则以及应得论等四种分配正义导向的公平性目标。此外,论文通过实验证明了UDJ-FL在实现这些公平性目标上的有效性,并提供了理论保障以证明其泛化能力。

链接: https://arxiv.org/abs/2504.15924
作者: Alycia Carey,Xintao Wu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
备注: 21 pages, 1 figure, 7 tables

点击查看摘要

Abstract:Client-level fairness metrics for federated learning are used to ensure that all clients in a federation either: a) have similar final performance on their local data distributions (i.e., client parity), or b) obtain final performance on their local data distributions relative to their contribution to the federated learning process (i.e., contribution fairness). While a handful of works that propose either client-parity or contribution-based fairness metrics ground their definitions and decisions in social theories of equality – such as distributive justice – most works arbitrarily choose what notion of fairness to align with which makes it difficult for practitioners to choose which fairness metric aligns best with their fairness ethics. In this work, we propose UDJ-FL (Uncertainty-based Distributive Justice for Federated Learning), a flexible federated learning framework that can achieve multiple distributive justice-based client-level fairness metrics. Namely, by utilizing techniques inspired by fair resource allocation, in conjunction with performing aleatoric uncertainty-based client weighing, our UDJ-FL framework is able to achieve egalitarian, utilitarian, Rawls’ difference principle, or desert-based client-level fairness. We empirically show the ability of UDJ-FL to achieve all four defined distributive justice-based client-level fairness metrics in addition to providing fairness equivalent to (or surpassing) other popular fair federated learning works. Further, we provide justification for why aleatoric uncertainty weighing is necessary to the construction of our UDJ-FL framework as well as derive theoretical guarantees for the generalization bounds of UDJ-FL. Our code is publicly available at this https URL.
zh
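摘要的核心机制之一是"按偶然不确定性给客户端加权后再聚合"。以下为一个高度简化的示意,其中权重形式 w_i ∝ u_i(向不确定性高的弱势客户端倾斜)只是为演示结构所做的合理猜测,论文中的具体构造请以原文为准:

```python
import numpy as np

def uncertainty_weighted_aggregate(client_params, client_uncertainty):
    """示意聚合:以归一化的偶然不确定性作为 FedAvg 式聚合权重(权重形式为假设)。"""
    u = np.asarray(client_uncertainty, dtype=float)
    w = u / u.sum()
    return (w[:, None] * np.stack(client_params)).sum(axis=0)

params = [np.array([1.0, 2.0]), np.array([3.0, 4.0]), np.array([5.0, 6.0])]
uncert = [0.1, 0.3, 0.6]   # 各客户端在本地数据上的偶然不确定性估计(虚构)
print(uncertainty_weighted_aggregate(params, uncert))
```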

[AI-13] Automated Bug Report Prioritization in Large Open-Source Projects

【速读】:该论文旨在解决大型开源项目因资源有限而无法及时处理所有问题报告(包括软件缺陷报告和新功能请求)的问题,通过提出一种基于开源问题跟踪系统中存储的缺陷报告自然语言文本的自动化缺陷优先级排序方法来优化资源配置。论文的关键在于结合TopicMiner-MTM进行主题建模以及利用BERT大规模语言模型实现文本分类,从而在准确性(Accuracy)、精确率(Precision)、召回率(Recall)和F1值(F1-measure)等指标上超越现有最先进的方法,显著提升了缺陷优先级预测性能。

链接: https://arxiv.org/abs/2504.15912
作者: Riley Pierson,Armin Moin
机构: 未知
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large open-source projects receive a large number of issues (known as bugs), including software defect (i.e., bug) reports and new feature requests from their user and developer communities at a fast rate. The often limited project resources do not allow them to deal with all issues. Instead, they have to prioritize them according to the project’s priorities and the issues’ severities. In this paper, we propose a novel approach to automated bug prioritization based on the natural language text of the bug reports that are stored in the open bug repositories of the issue-tracking systems. We conduct topic modeling using a variant of LDA called TopicMiner-MTM and text classification with the BERT large language model to achieve a higher performance level compared to the state-of-the-art. Experimental results using an existing reference dataset containing 85,156 bug reports of the Eclipse Platform project indicate that we outperform existing approaches in terms of Accuracy, Precision, Recall, and F1-measure of the bug report priority prediction.
zh

[AI-14] GraphEdge: Dynamic Graph Partition and Task Scheduling for GNNs Computing in Edge Network

【速读】:该论文旨在解决图结构场景下基于图神经网络(Graph Neural Network, GNN)的任务处理中服务器通信成本高昂的问题,特别是在物联网(Internet of Things, IoT)设备快速增长的背景下,如交通流量预测和社交关系推荐等用户数据具有相关性的任务。论文的关键解决方案在于提出了一种名为GraphEdge的高效GNN边缘计算架构。该架构首先通过感知用户拓扑并将数据关联表示为图布局,然后利用提出的分层遍历图切割算法(Hierarchical Traversal Graph Cut, HiCut)优化图布局,将图划分为多个弱关联子图,并最小化GNN推理过程中不同子图间的通信开销。最后,结合基于深度强化学习的图卸载算法(Deep Reinforcement Learning based Graph Offloading, DRLGO),以子图为基础制定最优卸载策略,在保证任务处理时间和能量消耗最小化的同时,尽量将同一子图的任务卸载到相同的边缘服务器上。这一系列方法有效降低了通信成本并提升了动态适应能力。

链接: https://arxiv.org/abs/2504.15905
作者: Wenjing Xiao,Chenglong Shi,Miaojiang Chen,Zhiquan Liu,Min Chen,H. Herbert Song
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 17 pages,12 figures

点击查看摘要

Abstract:With the exponential growth of Internet of Things (IoT) devices, edge computing (EC) is gradually playing an important role in providing cost-effective services. However, existing approaches struggle to perform well in graph-structured scenarios where user data is correlated, such as traffic flow prediction and social relationship recommender systems. In particular, graph neural network (GNN)-based approaches lead to expensive server communication cost. To address this problem, we propose GraphEdge, an efficient GNN-based EC architecture. It considers the EC system of GNN tasks, where there are associations between users and it needs to take into account the task data of its neighbors when processing the tasks of a user. Specifically, the architecture first perceives the user topology and represents their data associations as a graph layout at each time step. Then the graph layout is optimized by calling our proposed hierarchical traversal graph cut algorithm (HiCut), which cuts the graph layout into multiple weakly associated subgraphs based on the aggregation characteristics of GNN, and the communication cost between different subgraphs during GNN inference is minimized. Finally, based on the optimized graph layout, our proposed deep reinforcement learning (DRL) based graph offloading algorithm (DRLGO) is executed to obtain the optimal offloading strategy for the tasks of users. The offloading strategy is subgraph-based: it tries to offload the user tasks in a subgraph to the same edge server where possible, while minimizing the task processing time and energy consumption of the EC system. Experimental results show the good effectiveness and dynamic adaptation of our proposed architecture, and it performs well even in dynamic scenarios.
zh

[AI-15] Impact of Noise on LLM-Models Performance in Abstraction and Reasoning Corpus (ARC) Tasks with Model Temperature Considerations

【速读】:本文旨在解决大型语言模型(Large Language Models, LLMs)在抽象推理任务中对噪声和不确定性敏感的问题。论文通过系统性评估包括GPT-4o、DeepSeek R1和LLaMA 3.2在内的多种模型在不同噪声水平和温度设置下的表现,揭示了当前LLMs尽管展现出一定抽象推理能力,但在面对输入扰动时仍表现出显著脆弱性。这一现象引发了对其实际应用可靠性的担忧。关键在于探索不同模型架构在此类挑战下的响应差异,从而识别现代LLMs在推理任务中的结构性弱点,并为构建更鲁棒、更具适应性的AI系统提供指导,以增强模型的泛化能力、鲁棒性以及与人类认知灵活性的对齐。

链接: https://arxiv.org/abs/2504.15903
作者: Nikhil Khandalkar,Pavan Yadav,Krishna Shinde,Lokesh B. Ramegowda,Rajarshi Das
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 60 pages, 25 figures

点击查看摘要

Abstract:Recent advancements in Large Language Models (LLMs) have generated growing interest in their structured reasoning capabilities, particularly in tasks involving abstraction and pattern recognition. The Abstraction and Reasoning Corpus (ARC) benchmark plays a crucial role in evaluating these capabilities by testing how well AI models generalize to novel problems. While GPT-4o demonstrates strong performance by solving all ARC tasks under zero-noise conditions, other models like DeepSeek R1 and LLaMA 3.2 fail to solve any, suggesting limitations in their ability to reason beyond simple pattern matching. To explore this gap, we systematically evaluate these models across different noise levels and temperature settings. Our results reveal that the introduction of noise consistently impairs model performance, regardless of architecture. This decline highlights a shared vulnerability: current LLMs, despite showing signs of abstract reasoning, remain highly sensitive to input perturbations. Such fragility raises concerns about their real-world applicability, where noise and uncertainty are common. By comparing how different model architectures respond to these challenges, we offer insights into the structural weaknesses of modern LLMs in reasoning tasks. This work underscores the need for developing more robust and adaptable AI systems capable of handling the ambiguity and variability inherent in real-world scenarios. Our findings aim to guide future research toward enhancing model generalization, robustness, and alignment with human-like cognitive flexibility.
zh

[AI-16] Supporting Data-Frame Dynamics in AI-assisted Decision Making

【速读】:该论文旨在解决当前人工智能(AI)决策支持系统在高风险决策场景中无法有效支持不断演化的证据与假设动态交互的问题。论文的关键在于提出了一种基于数据帧理论(data-frame theory)和评估型 AI (evaluative AI)范式的混合主动框架,使人类与 AI 能够协作构建、验证和调整假设。该方法通过结合概念瓶颈模型(concept bottleneck model),实现了可解释的交互以及诊断假设的动态更新,从而显著提升了 AI 协助决策的能力。

链接: https://arxiv.org/abs/2504.15894
作者: Chengbo Zheng,Tim Miller,Alina Bialkowski,H Peter Soyer,Monika Janda
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注: Presented at the 2025 ACM Workshop on Human-AI Interaction for Augmented Reasoning, Report Number: CHI25-WS-AUGMENTED-REASONING

点击查看摘要

Abstract:High stakes decision-making often requires a continuous interplay between evolving evidence and shifting hypotheses, a dynamic that is not well supported by current AI decision support systems. In this paper, we introduce a mixed-initiative framework for AI assisted decision making that is grounded in the data-frame theory of sensemaking and the evaluative AI paradigm. Our approach enables both humans and AI to collaboratively construct, validate, and adapt hypotheses. We demonstrate our framework with an AI-assisted skin cancer diagnosis prototype that leverages a concept bottleneck model to facilitate interpretable interactions and dynamic updates to diagnostic hypotheses.
zh

[AI-17] Bidirectional Task-Motion Planning Based on Hierarchical Reinforcement Learning for Strategic Confrontation

【速读】:本文针对群体机器人在对抗场景(如战略对抗)中高效决策的需求,致力于解决传统任务与运动规划方法因单向结构无法捕捉离散命令与连续动作之间相互依赖性的问题,特别是在动态环境下的适应性不足。论文的关键在于提出了一种基于分层强化学习的新型双向交互方法,通过在任务分配与路径规划层之间建立动态反馈机制,同时结合跨训练技术优化分层框架内的学习过程。此外,引入轨迹预测模型将抽象任务表示与可执行规划目标关联起来,进一步提升了决策效率与环境适应能力。实验结果表明,该方法在对抗胜率上超过80%,决策时间低于0.01秒,显著优于现有方法,并通过大规模仿真及真实机器人测试验证了其泛化能力和实际应用价值。

链接: https://arxiv.org/abs/2504.15876
作者: Qizhen Wu,Lei Chen,Kexin Liu,Jinhu Lü
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:In swarm robotics, confrontation scenarios, including strategic confrontations, require efficient decision-making that integrates discrete commands and continuous actions. Traditional task and motion planning methods separate decision-making into two layers, but their unidirectional structure fails to capture the interdependence between these layers, limiting adaptability in dynamic environments. Here, we propose a novel bidirectional approach based on hierarchical reinforcement learning, enabling dynamic interaction between the layers. This method effectively maps commands to task allocation and actions to path planning, while leveraging cross-training techniques to enhance learning across the hierarchical framework. Furthermore, we introduce a trajectory prediction model that bridges abstract task representations with actionable planning goals. In our experiments, it achieves over 80% in confrontation win rate and under 0.01 seconds in decision time, outperforming existing approaches. Demonstrations through large-scale tests and real-world robot experiments further emphasize the generalization capabilities and practical applicability of our method.
zh

[AI-18] CARE: Compatibility-Aware Incentive Mechanisms for Federated Learning with Budgeted Requesters

【速读】:该论文旨在解决联邦学习(Federated Learning, FL)场景中两个被现有研究忽视的关键问题:一是工人(workers)的固有不兼容特性(如通信信道和数据源差异)导致联邦学习效率下降(如通信效率低和模型泛化能力差);二是请求者(requesters)通常面临预算限制,这限制了他们能够雇佣的工人数量。论文考虑了两种设置:合作预算设置下请求者通过共同分配预算提升整体效用,以及非合作预算设置下每个请求者在自身预算内优化其效用。为应对工人不兼容带来的效率下降,论文提出了两种新颖的兼容性感知激励机制CARE-CO(适用于合作预算设置)和CARE-NO(适用于非合作预算设置),以揭示工人的真实私有成本,并为请求者确定合适的工人及其对应的奖励,同时满足预算约束。这些机制保证了个体理性、真实性、预算可行性及近似性能。通过使用真实世界的数据集进行大量实验,论文表明所提出的机制显著优于现有基准方法。

链接: https://arxiv.org/abs/2504.15847
作者: Xiang Liu,Hau Chan,Minming Li,Xianlong Zeng,Chenchen Fu,Weiwei Wu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Federated learning (FL) is a promising approach that allows requesters (e.g., servers) to obtain local training models from workers (e.g., clients). Since workers are typically unwilling to provide training services/models freely and voluntarily, many incentive mechanisms in FL are designed to incentivize participation by offering monetary rewards from requesters. However, existing studies neglect two crucial aspects of real-world FL scenarios. First, workers can possess inherent incompatibility characteristics (e.g., communication channels and data sources), which can lead to degradation of FL efficiency (e.g., low communication efficiency and poor model generalization). Second, the requesters are budgeted, which limits the amount of workers they can hire for their tasks. In this paper, we investigate the scenario in FL where multiple budgeted requesters seek training services from incompatible workers with private training costs. We consider two settings: the cooperative budget setting where requesters cooperate to pool their budgets to improve their overall utility and the non-cooperative budget setting where each requester optimizes their utility within their own budgets. To address efficiency degradation caused by worker incompatibility, we develop novel compatibility-aware incentive mechanisms, CARE-CO and CARE-NO, for both settings to elicit true private costs and determine workers to hire for requesters and their rewards while satisfying requester budget constraints. Our mechanisms guarantee individual rationality, truthfulness, budget feasibility, and approximation performance. We conduct extensive experiments using real-world datasets to show that the proposed mechanisms significantly outperform existing baselines.
zh

[AI-19] Generative AI for Research Data Processing: Lessons Learnt From Three Use Cases

【速读】:该论文旨在解决生成式 AI (Generative AI) 在研究数据处理中的应用可行性及输出准确性与一致性的问题。论文通过三个具体案例(信息抽取、自然语言理解、文本分类)验证了使用生成式 AI 模型 Claude 3 Opus 处理复杂数据任务的潜力,并分享了如何评估生成式 AI 是否适用于特定任务以及如何优化结果准确性和一致性的经验。关键在于结合领域需求选择合适的任务,并采用针对性的方法提升模型性能。

链接: https://arxiv.org/abs/2504.15829
作者: Modhurita Mitra,Martine G. de Vos,Nicola Cortinovis,Dawa Ometto
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 10 pages, 4 figures, 6 tables. Published in Proceedings of the 2024 IEEE 20th International Conference on e-Science (e-Science), Osaka, Japan

点击查看摘要

Abstract:There has been enormous interest in generative AI since ChatGPT was launched in 2022. However, there are concerns about the accuracy and consistency of the outputs of generative AI. We have carried out an exploratory study on the application of this new technology in research data processing. We identified tasks for which rule-based or traditional machine learning approaches were difficult to apply, and then performed these tasks using generative AI. We demonstrate the feasibility of using the generative AI model Claude 3 Opus in three research projects involving complex data processing tasks: 1) Information extraction: We extract plant species names from historical seedlists (catalogues of seeds) published by botanical gardens. 2) Natural language understanding: We extract certain data points (name of drug, name of health indication, relative effectiveness, cost-effectiveness, etc.) from documents published by Health Technology Assessment organisations in the EU. 3) Text classification: We assign industry codes to projects on the crowdfunding website Kickstarter. We share the lessons we learnt from these use cases: How to determine if generative AI is an appropriate tool for a given data processing task, and if so, how to maximise the accuracy and consistency of the results obtained.
zh

[AI-20] DualOptim: Enhancing Efficacy and Stability in Machine Unlearning with Dual Optimizers

【速读】:该论文旨在解决现有机器遗忘(Machine Unlearning, MU)方法对超参数敏感且需要精细调参的问题,这限制了其实际应用的广泛部署。论文通过实证研究表明,当前流行的MU方法在不同场景中表现出不稳定性和次优性能。为了解决这一问题,论文提出了一种名为Dual Optimizer (DualOptim) 的解决方案,其关键在于引入自适应学习率和解耦动量因子。通过实证与理论分析,DualOptim被证明能够有效提升MU的效率和稳定性,并在图像分类、图像生成以及大语言模型等多样化任务中展现出显著优势,从而成为一种增强现有MU算法的通用方法。

链接: https://arxiv.org/abs/2504.15827
作者: Xuyang Zhong,Haochen Luo,Chen Liu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Existing machine unlearning (MU) approaches exhibit significant sensitivity to hyperparameters, requiring meticulous tuning that limits practical deployment. In this work, we first empirically demonstrate the instability and suboptimal performance of existing popular MU methods when deployed in different scenarios. To address this issue, we propose Dual Optimizer (DualOptim), which incorporates adaptive learning rate and decoupled momentum factors. Empirical and theoretical evidence demonstrates that DualOptim contributes to effective and stable unlearning. Through extensive experiments, we show that DualOptim can significantly boost MU efficacy and stability across diverse tasks, including image classification, image generation, and large language models, making it a versatile approach to empower existing MU algorithms.
zh

[AI-21] Fusing Reward and Dueling Feedback in Stochastic Bandits

【速读】:该论文旨在研究在随机多臂老虎机问题中绝对反馈(奖励反馈)与相对反馈(对抗式反馈)的融合方法。论文的核心问题是:如何有效结合这两种反馈类型以最小化后悔值(regret),同时充分利用各自的优势。论文的关键贡献在于推导了后悔值的下界,并提出两种融合算法:(1) 消除融合算法通过共享候选臂集合整合两种反馈信息来探索所有臂;(2) 分解融合算法根据反馈类型的有效性动态分配探索与利用任务。其中,消除融合算法由于对抗式消除固有的次优性导致后悔值存在次优的乘法因子,而分解融合算法在常见假设下实现了与下界匹配的后悔值,仅相差一个常数因子。因此,解决方案的关键在于设计能够动态适应两种反馈类型的机制,以实现最优或接近最优的性能。

链接: https://arxiv.org/abs/2504.15812
作者: Xuchuang Wang,Qirun Zeng,Jinhang Zuo,Xutong Liu,Mohammad Hajiesmaili,John C.S. Lui,Adam Wierman
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:This paper investigates the fusion of absolute (reward) and relative (dueling) feedback in stochastic bandits, where both feedback types are gathered in each decision round. We derive a regret lower bound, demonstrating that an efficient algorithm may incur only the smaller among the reward and dueling-based regret for each individual arm. We propose two fusion approaches: (1) a simple elimination fusion algorithm that leverages both feedback types to explore all arms and unifies collected information by sharing a common candidate arm set, and (2) a decomposition fusion algorithm that selects the more effective feedback to explore the corresponding arms and randomly assigns one feedback type for exploration and the other for exploitation in each round. The elimination fusion experiences a suboptimal multiplicative term of the number of arms in regret due to the intrinsic suboptimality of dueling elimination. In contrast, the decomposition fusion achieves regret matching the lower bound up to a constant under a common assumption. Extensive experiments confirm the efficacy of our algorithms and theoretical results.
zh

[AI-22] DAE-KAN: A Kolmogorov-Arnold Network Model for High-Index Differential-Algebraic Equations

【速读】:该论文旨在解决高指数微分代数方程(Differential-Algebraic Equations, DAEs)的求解问题,特别是从指数-1到指数-3的复杂DAE系统。论文的关键创新在于提出了一种名为DAE-KAN的新框架,通过结合Kolmogorov-Arnold Networks (KANs) 和 Physics-Informed Neural Networks (PINNs),不仅保留了传统PINNs建模物理定律约束系统的特性,还利用KANs在函数拟合方面的优势显著提升了求解精度。实验结果表明,与传统PINNs相比,DAE-KAN在减少微分变量和代数变量的绝对误差方面提高了1到2个数量级,并且在控制漂移误差方面表现出色,验证了该方法在高精度和泛化能力上的优越性。

链接: https://arxiv.org/abs/2504.15806
作者: Kai Luo,Juan Tang,Mingchao Cai,Xiaoqing Zeng,Manqi Xie,Ming Yan
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Kolmogorov-Arnold Networks (KANs) have emerged as a promising alternative to Multi-Layer Perceptrons (MLPs) due to their superior function-fitting abilities in data-driven modeling. In this paper, we propose a novel framework, DAE-KAN, for solving high-index differential-algebraic equations (DAEs) by integrating KANs with Physics-Informed Neural Networks (PINNs). This framework not only preserves the ability of traditional PINNs to model complex systems governed by physical laws but also enhances their performance by leveraging the function-fitting strengths of KANs. Numerical experiments demonstrate that for DAE systems ranging from index-1 to index-3, DAE-KAN reduces the absolute errors of both differential and algebraic variables by 1 to 2 orders of magnitude compared to traditional PINNs. To assess the effectiveness of this approach, we analyze the drift-off error and find that both PINNs and DAE-KAN outperform classical numerical methods in controlling this phenomenon. Our results highlight the potential of neural network methods, particularly DAE-KAN, in solving high-index DAEs with substantial computational accuracy and generalization, offering a promising solution for challenging partial differential-algebraic equations.
zh
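无论主干是 MLP 还是论文采用的 KAN,求解半显式 DAE 的 PINN 损失结构是一致的:微分残差与代数残差共同约束网络输出。以下为一个玩具级示意,以指数-1 DAE x' = -y, 0 = y - x(解析解 x = y = e^{-t})为例,主干用普通 MLP 代替 KAN:

```python
import torch
import torch.nn as nn

net = nn.Sequential(nn.Linear(1, 32), nn.Tanh(), nn.Linear(32, 2))  # 输出 (x, y)

def dae_loss(t):
    t = t.requires_grad_(True)
    out = net(t)
    x, y = out[:, :1], out[:, 1:]
    dx = torch.autograd.grad(x, t, torch.ones_like(x), create_graph=True)[0]
    res_diff = dx + y                           # 微分残差:x' = -y
    res_alg = y - x                             # 代数残差:0 = y - x
    ic = net(torch.zeros(1, 1))[0, 0] - 1.0     # 初值约束:x(0) = 1
    return (res_diff ** 2).mean() + (res_alg ** 2).mean() + ic ** 2

opt = torch.optim.Adam(net.parameters(), lr=1e-3)
t_train = torch.rand(128, 1)
for _ in range(2000):
    opt.zero_grad()
    dae_loss(t_train).backward()
    opt.step()
# 训练后可与解析解 x(t) = y(t) = exp(-t) 对比以检验精度
```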

[AI-23] Insights from Verification: Training a Verilog Generation LLM with Reinforcement Learning with Testbench Feedback

【速读】:该论文旨在解决基于自然语言描述生成功能正确的 Verilog 代码这一挑战。传统大语言模型(Large Language Models, LLMs)在生成 Verilog 代码方面表现出色,但难以确保生成代码的功能正确性。为应对这一问题,论文的关键解决方案是将硬件设计的核心目标——功能正确性融入到 LLM 的训练过程中。具体而言,论文提出了一种方法,通过集成测试平台(testbench)中的验证见解来指导训练,并引入了一个自动化的测试平台生成管道,该管道利用 Verilog 编译器模拟器(VCS)反馈减少幻觉(hallucination)并保证生成代码的准确性。此外,论文采用强化学习(Reinforcement Learning, RL)中的直接偏好优化(Direct Preference Optimization, DPO)技术,基于测试平台的结果构建偏好对(preference pairs),从而进一步优化生成代码的功能正确性。这些措施共同提升了生成 Verilog 代码的质量,并在多个基准数据集上取得了优于现有方法的表现。

链接: https://arxiv.org/abs/2504.15804
作者: Ning Wang,Bingkun Yao,Jie Zhou,Yuchen Hu,Xi Wang,Nan Guan,Zhe Jiang
机构: 未知
类目: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large language models (LLMs) have shown strong performance in Verilog generation from natural language description. However, ensuring the functional correctness of the generated code remains a significant challenge. This paper introduces a method that integrates verification insights from testbench into the training of Verilog generation LLMs, aligning the training with the fundamental goal of hardware design: functional correctness. The main obstacle in using LLMs for Verilog code generation is the lack of sufficient functional verification data, particularly testbenches paired with design specifications and code. To address this problem, we introduce an automatic testbench generation pipeline that decomposes the process and uses feedback from the Verilog compiler simulator (VCS) to reduce hallucination and ensure correctness. We then use the testbench to evaluate the generated codes and collect them for further training, where verification insights are introduced. Our method applies reinforcement learning (RL), specifically direct preference optimization (DPO), to align Verilog code generation with functional correctness by training preference pairs based on testbench outcomes. In evaluations on VerilogEval-Machine, VerilogEval-Human, RTLLM v1.1, RTLLM v2, and VerilogEval v2, our approach consistently outperforms state-of-the-art baselines in generating functionally correct Verilog code. We open source all training code, data, and models at this https URL.
zh
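摘要中基于测试平台结果构造偏好对后,采用的 DPO 损失有标准闭式。以下为该损失的最小 PyTorch 示意;代码中的对数似然为虚构张量,实际应由策略模型与冻结的参考模型对 chosen(通过测试)与 rejected(未通过)代码序列分别给出:

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """标准 DPO 损失:-log sigmoid(beta * ((logp_c - logp_r) - (ref_c - ref_r)))。"""
    pi_diff = logp_chosen - logp_rejected
    ref_diff = ref_logp_chosen - ref_logp_rejected
    return -F.logsigmoid(beta * (pi_diff - ref_diff)).mean()

# 虚构的一个 batch 的序列对数似然
logp_c = torch.tensor([-12.0, -8.5]); logp_r = torch.tensor([-15.0, -9.0])
ref_c = torch.tensor([-13.0, -8.8]);  ref_r = torch.tensor([-14.0, -8.9])
print(dpo_loss(logp_c, logp_r, ref_c, ref_r))
```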

[AI-24] Crisp complexity of fuzzy classifiers

【速读】:该论文旨在解决模糊规则基分类器在非模糊领域难以广泛应用的问题,其核心在于模糊划分在某些情况下的可解释性不足。论文提出了一种将模糊规则基分类器转化为明确(crisp)规则基分类器的方法论,并研究了多种可能的明确描述方式,同时开发了相应的算法来实现这一转化。解决方案的关键在于通过清晰的明确规则表示来提升模型的可解释性,从而帮助模糊与非模糊领域的实践者更好地理解特征空间的划分方式及其相互转换的可行性。此外,论文还分析了由此产生的明确分类器的复杂度,提供了一种基于等效明确划分选择不同模糊分类器的度量标准。

链接: https://arxiv.org/abs/2504.15791
作者: Raquel Fernandez-Peralta,Javier Fumanal-Idocin,Javier Andreu-Perez
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Rule-based systems are a very popular form of explainable AI, particularly in the fuzzy community, where fuzzy rules are widely used for control and classification problems. However, fuzzy rule-based classifiers struggle to gain traction outside of fuzzy venues, because users are often unfamiliar with fuzzy logic and because fuzzy partitions are not easy to interpret in some situations. In this work, we propose a methodology to reduce fuzzy rule-based classifiers to crisp rule-based classifiers. We study different possible crisp descriptions and implement an algorithm to obtain them. Also, we analyze the complexity of the resulting crisp classifiers. We believe that our results can help both fuzzy and non-fuzzy practitioners understand better the way in which fuzzy rule bases partition the feature space and how easily one system can be translated to another and vice versa. Our complexity metric can also help to choose between different fuzzy classifiers based on what the equivalent crisp partitions look like.
zh

[AI-25] WALL-E 2.0: World Alignment by NeuroSymbolic Learning improves World Model-based LLM Agents

【速读】:该论文旨在解决如何利用大型语言模型(Large Language Models, LLMs)构建准确的世界模型,并探讨世界模型如何提升LLM代理的表现。论文指出,LLMs在作为世界模型使用时通常受到其先验知识与特定环境动态之间差距的限制。为弥合这一差距,研究提出了无需训练的“世界对齐”方法,通过从探索轨迹中提取动作规则、知识图谱和场景图等符号化知识,并将其编码为可执行代码以调节LLM代理策略。此外,论文还提出了一种基于模型预测控制(Model-Predictive Control, MPC)框架的无强化学习(RL-free)代理“WALL-E 2.0”。不同于传统MPC需要在线进行昂贵优化,该方案利用LLM代理与神经符号化世界模型交互,高效规划未来步骤的动作。关键在于,通过将LLM代理的强大启发式算法与经过对齐的世界模型提供的精确预测相结合,显著提升了代理在新环境中的学习效率。实验表明,在Mars和ALFWorld等开放世界挑战中,“WALL-E 2.0”大幅超越现有方法,例如在Mars任务中成功率提高16.1%-51.6%,得分至少提升61.7%,并在ALFWorld中实现98%的成功率。

链接: https://arxiv.org/abs/2504.15785
作者: Siyu Zhou,Tianyi Zhou,Yijun Yang,Guodong Long,Deheng Ye,Jing Jiang,Chengqi Zhang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Code is available at this https URL

点击查看摘要

Abstract:Can we build accurate world models out of large language models (LLMs)? How can world models benefit LLM agents? The gap between the prior knowledge of LLMs and the specified environment’s dynamics usually bottlenecks LLMs’ performance as world models. To bridge the gap, we propose a training-free “world alignment” that learns an environment’s symbolic knowledge complementary to LLMs. The symbolic knowledge covers action rules, knowledge graphs, and scene graphs, which are extracted by LLMs from exploration trajectories and encoded into executable codes to regulate LLM agents’ policies. We further propose an RL-free, model-based agent “WALL-E 2.0” through the model-predictive control (MPC) framework. Unlike classical MPC requiring costly optimization on the fly, we adopt an LLM agent as an efficient look-ahead optimizer of future steps’ actions by interacting with the neurosymbolic world model. While the LLM agent’s strong heuristics make it an efficient planner in MPC, the quality of its planned actions is also secured by the accurate predictions of the aligned world model. They together considerably improve learning efficiency in a new environment. On open-world challenges in Mars (Minecraft-like) and ALFWorld (embodied indoor environments), WALL-E 2.0 significantly outperforms existing methods, e.g., surpassing baselines in Mars by 16.1%-51.6% of success rate and by at least 61.7% in score. In ALFWorld, it achieves a new record 98% success rate after only 4 iterations.
zh

[AI-26] Shannon invariants: A scalable approach to information decomposition

【速读】:本文旨在解决分布式系统(如生物神经网络和人工神经网络)中高阶信息处理分析的挑战,主要难题在于定义合适的多元度量指标以及确保这些指标在大规模系统中的可扩展性。为了解决这些问题,论文提出了一种基于“香农不变量”(Shannon invariants) 的新框架。香农不变量是一种仅依赖于熵定义且能够高效计算的大规模系统的高阶信息处理本质属性的量化指标。关键之处在于,这一框架不仅解决了长期存在的关于广泛使用的多元信息论测度解释的歧义,还揭示了不同深度学习架构在各层中的独特信息处理特征,从而提供了关于这些系统如何处理信息及其在训练过程中如何演化的全新见解。总体而言,该框架克服了分析高阶现象的根本局限性,并为理论发展和实证分析提供了广阔机会。

链接: https://arxiv.org/abs/2504.15779
作者: Aaron J. Gutknecht,Fernando E. Rosas,David A. Ehrlich,Abdullah Makkeh,Pedro A. M. Mediano,Michael Wibral
机构: 未知
类目: Information Theory (cs.IT); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Adaptation and Self-Organizing Systems (nlin.AO); Data Analysis, Statistics and Probability (physics.data-an)
备注: 16 pages, 4 Figures

点击查看摘要

Abstract:Distributed systems, such as biological and artificial neural networks, process information via complex interactions engaging multiple subsystems, resulting in high-order patterns with distinct properties across scales. Investigating how these systems process information remains challenging due to difficulties in defining appropriate multivariate metrics and ensuring their scalability to large systems. To address these challenges, we introduce a novel framework based on what we call “Shannon invariants” – quantities that capture essential properties of high-order information processing in a way that depends only on the definition of entropy and can be efficiently calculated for large systems. Our theoretical results demonstrate how Shannon invariants can be used to resolve long-standing ambiguities regarding the interpretation of widely used multivariate information-theoretic measures. Moreover, our practical results reveal distinctive information-processing signatures of various deep learning architectures across layers, which lead to new insights into how these systems process information and how this evolves during training. Overall, our framework resolves fundamental limitations in analyzing high-order phenomena and offers broad opportunities for theoretical developments and empirical analyses.
zh
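
下面的草图展示"仅依赖熵定义"的量如何从数据中计算:先求出所有子系统的联合熵,再由这些熵组合出诸如总相关(total correlation)这类纯熵量的实例(代码仅为说明性示意,数据分布与变量均为假设,并非论文定义的具体不变量):

```python
import itertools
import numpy as np

def joint_entropy(samples, idx):
    """列子集 idx 上联合分布的香农熵(单位:bit)。"""
    _, counts = np.unique(samples[:, idx], axis=0, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

rng = np.random.default_rng(0)
x = rng.integers(0, 2, size=(10_000, 1))
y = x ^ (rng.random((10_000, 1)) < 0.1)   # x 的带噪副本
z = rng.integers(0, 2, size=(10_000, 1))  # 独立变量
data = np.hstack([x, y.astype(int), z])

# 所有非空子系统的熵:任何"仅由熵构成"的量都可由这张表组合得到
H = {s: joint_entropy(data, list(s))
     for r in range(1, 4) for s in itertools.combinations(range(3), r)}
total_correlation = H[(0,)] + H[(1,)] + H[(2,)] - H[(0, 1, 2)]
print(f"总相关 TC = {total_correlation:.3f} bit")
```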

[AI-27] Clifford Group Equivariant Diffusion Models for 3D Molecular Generation

【速读】:该论文旨在探索利用Clifford代数的强大表达能力来构建\E(n)-等变扩散模型(\E(n)-equivariant diffusion models)。论文的关键在于引入Clifford扩散模型(Clifford Diffusion Models, CDMs),通过Clifford多向量之间的几何积以及嵌套在Clifford子空间中的丰富几何信息,将扩散过程从仅限于Clifford一矢量扩展到包含所有更高阶多矢量子空间。数据被嵌入到k阶子空间中,从而实现完整多矢量的潜在扩散,使CDMs能够捕捉代数不同子空间间的联合分布,并通过高阶特征整合更丰富的几何信息。实验结果表明,CDMs在QM9数据集上的无条件分子生成任务中展现出生成式建模(Generative Modeling)的潜力。

链接: https://arxiv.org/abs/2504.15773
作者: Cong Liu,Sharvaree Vadgama,David Ruhe,Erik Bekkers,Patrick Forré
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 7 pages, 1 figure, 1 table

点击查看摘要

Abstract:This paper explores leveraging the Clifford algebra’s expressive power for \E(n)-equivariant diffusion models. We utilize the geometric products between Clifford multivectors and the rich geometric information encoded in Clifford subspaces in Clifford Diffusion Models (CDMs). We extend the diffusion process beyond just Clifford one-vectors to incorporate all higher-grade multivector subspaces. The data is embedded in grade-k subspaces, allowing us to apply latent diffusion across complete multivectors. This enables CDMs to capture the joint distribution across different subspaces of the algebra, incorporating richer geometric information through higher-order features. We provide empirical results for unconditional molecular generation on the QM9 dataset, showing that CDMs provide a promising avenue for generative modeling.
zh

[AI-28] Dynamic Intent Queries for Motion Transformer-based Trajectory Prediction

【速读】:该论文旨在解决现代轨迹预测模型在特定交通场景下因静态意图点(static intention points)与地图数据不匹配而导致的不可行或不现实目标点的问题。解决方案的关键在于将场景特定的动态意图点(dynamic intention points)引入Motion Transformer (MTR) 模型中,以替代原有的静态意图点。这种改进使模型能够更准确地捕捉复杂交通场景中的动态特性,从而显著提升长时间预测的准确性,并改善对不符合地图数据或非法操作的地面真实轨迹的预测性能。

链接: https://arxiv.org/abs/2504.15766
作者: Tobias Demmler,Lennart Hartung,Andreas Tamke,Thao Dang,Alexander Hegai,Karsten Haug,Lars Mikelsons
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:In autonomous driving, accurately predicting the movements of other traffic participants is crucial, as it significantly influences a vehicle’s planning processes. Modern trajectory prediction models strive to interpret complex patterns and dependencies from agent and map data. The Motion Transformer (MTR) architecture and subsequent work define the most accurate methods in common benchmarks such as the Waymo Open Motion Benchmark. The MTR model employs pre-generated static intention points as initial goal points for trajectory prediction. However, the static nature of these points frequently leads to misalignment with map data in specific traffic scenarios, resulting in unfeasible or unrealistic goal points. Our research addresses this limitation by integrating scene-specific dynamic intention points into the MTR model. This adaptation of the MTR model was trained and evaluated on the Waymo Open Motion Dataset. Our findings demonstrate that incorporating dynamic intention points has a significant positive impact on trajectory prediction accuracy, especially for predictions over long time horizons. Furthermore, we analyze the impact on ground truth trajectories which are not compliant with the map data or are illegal maneuvers.
zh

[AI-29] Medic: Towards Smartphone-based Self-Auscultation Tool for AI-Powered Pediatric Respiratory Assessment

【速读】:该论文旨在解决在医疗资源匮乏地区,由于医生短缺导致的儿童肺炎早期诊断困难问题。论文提出了一种基于智能手机的系统,通过利用手机内置麦克风和先进的深度学习算法,检测与肺炎风险相关的异常呼吸音。解决方案的关键在于其端到端的深度学习框架,该框架采用领域泛化技术,将大规模电子听诊器数据集与较小的智能手机采集数据集相结合,从而实现鲁棒的特征学习,无需昂贵设备即可进行准确的呼吸评估。此外,配套的移动应用程序指导看护者收集高质量的肺部声音样本,并提供即时反馈以识别潜在的肺炎风险。

链接: https://arxiv.org/abs/2504.15743
作者: Seung Gyu Jeong,Sung Woo Nam,Seong Kwan Jung,Seong-Eun Kim
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Respiratory auscultation is crucial for early detection of pediatric pneumonia, a condition that can quickly worsen without timely intervention. In areas with limited physician access, effective auscultation is challenging. We present a smartphone-based system that leverages built-in microphones and advanced deep learning algorithms to detect abnormal respiratory sounds indicative of pneumonia risk. Our end-to-end deep learning framework employs domain generalization to integrate a large electronic stethoscope dataset with a smaller smartphone-derived dataset, enabling robust feature learning for accurate respiratory assessments without expensive equipment. The accompanying mobile application guides caregivers in collecting high-quality lung sound samples and provides immediate feedback on potential pneumonia risks. User studies show strong classification performance and high acceptance, demonstrating the system’s ability to facilitate proactive interventions and reduce preventable childhood pneumonia deaths. By seamlessly integrating into ubiquitous smartphones, this approach offers a promising avenue for more equitable and comprehensive remote pediatric care.
zh

[AI-30] Collaborative Split Federated Learning with Parallel Training and Aggregation

【速读】:该论文旨在解决联邦学习(Federated Learning, FL)中因客户端计算与通信负担过重导致的训练延迟长及现有分裂联邦学习(Split Federated Learning, SFL)方案在异构设备参与时仍存在的通信开销大和模型精度不足的问题。论文的关键创新在于提出了一种协作型分裂联邦学习(Collaborative-Split Federated Learning, C-SFL)方案,将模型划分为三部分:由计算能力较弱的客户端训练的部分、由计算能力强的客户端训练的部分以及由服务器训练的部分。不同于已有工作,C-SFL通过允许客户端与服务器上的模型部分并行训练与聚合,显著减少了训练延迟和通信开销,同时提升了模型的准确性。实验验证了C-SFL相对于现有方案的多重优势。

链接: https://arxiv.org/abs/2504.15724
作者: Yiannis Papageorgiou,Yannis Thomas,Alexios Filippakopoulos,Ramin Khalili,Iordanis Koutsopoulos
机构: 未知
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Federated learning (FL) operates based on model exchanges between the server and the clients, and it suffers from significant client-side computation and communication burden. Split federated learning (SFL) arises as a promising solution by splitting the model into two parts, that are trained sequentially: the clients train the first part of the model (client-side model) and transmit it to the server that trains the second (server-side model). Existing SFL schemes, though, still exhibit long training delays and significant communication overhead, especially when clients of different computing capability participate. Thus, we propose Collaborative-Split Federated Learning (C-SFL), a novel scheme that splits the model into three parts, namely the model parts trained at the computationally weak clients, the ones trained at the computationally strong clients, and the ones at the server. Unlike existing works, C-SFL enables parallel training and aggregation of model’s parts at the clients and at the server, resulting in reduced training delays and communication overhead while improving the model’s accuracy. Experiments verify the multiple gains of C-SFL against the existing schemes.
zh
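
下面用 PyTorch 给出"模型三段式切分"的单机最小示意(真实方案中三部分分别部署在弱客户端、强客户端与服务器上并行训练与聚合,此处仅演示切分与梯度跨切分点回传;网络结构为假设的玩具模型):

```python
import torch
import torch.nn as nn

full = nn.Sequential(
    nn.Linear(32, 64), nn.ReLU(),   # A 段:算力较弱的客户端
    nn.Linear(64, 64), nn.ReLU(),   # B 段:算力较强的客户端
    nn.Linear(64, 10),              # C 段:服务器
)
part_a, part_b, part_c = full[:2], full[2:4], full[4:]

x = torch.randn(8, 32)
smashed_a = part_a(x)           # 弱客户端前向,上传中间激活
smashed_b = part_b(smashed_a)   # 强客户端继续前向
logits = part_c(smashed_b)      # 服务器完成前向

loss = nn.functional.cross_entropy(logits, torch.randint(0, 10, (8,)))
loss.backward()                 # 梯度跨越两处切分点回传
print(f"loss = {loss.item():.4f}")
```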

[AI-31] Implementing Rational Choice Functions with LLM s and Measuring their Alignment with User Preferences

【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)作为决策代理在智能用户界面(Intelligent User Interfaces, IUIs)中的对齐(alignment)问题,特别是关注如何衡量LLMs的决策输出与用户偏好之间的对齐程度。传统研究更多聚焦于事实性、偏见和毒性等问题,而对量化LLMs是否符合用户偏好的研究相对较少。论文的关键在于提出了一种通用方法,通过利用LLMs对备选结果进行排序,并扩展对用户偏好的定义以涵盖严格偏好和无差异两种情况,从而实现更灵活且可靠的决策对齐。为此,作者提出了设计原则以使用LLMs实现理性选择函数,并提供了评估偏好满足度所需的工具。论文通过汽车领域的实际IUI应用验证了该方法的有效性。

链接: https://arxiv.org/abs/2504.15719
作者: Anna Karnysheva,Christian Drescher,Dietrich Klakow
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:As large language models (LLMs) become integral to intelligent user interfaces (IUIs), their role as decision-making agents raises critical concerns about alignment. Although extensive research has addressed issues such as factuality, bias, and toxicity, comparatively little attention has been paid to measuring alignment to preferences, i.e., the relative desirability of different alternatives, a concept used in decision making, economics, and social choice theory. However, a reliable decision-making agent makes choices that align well with user preferences. In this paper, we generalize existing methods that exploit LLMs for ranking alternative outcomes by addressing alignment with the broader and more flexible concept of user preferences, which includes both strict preferences and indifference among alternatives. To this end, we put forward design principles for using LLMs to implement rational choice functions, and provide the necessary tools to measure preference satisfaction. We demonstrate the applicability of our approach through an empirical study in a practical application of an IUI in the automotive domain.
zh

[AI-32] DianJin-R1: Evaluating and Enhancing Financial Reasoning in Large Language Models

【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在金融领域面临的有效推理挑战,这些挑战包括需要领域特定知识、精确的数值计算以及严格遵守合规规则。论文的关键解决方案是提出DianJin-R1框架,该框架通过增强推理的监督和强化学习来应对这些挑战。其中,DianJin-R1-Data数据集是核心组成部分,它整合了来自CFLUE、FinQA和自有合规语料库(Chinese Compliance Check, CCC)的高质量数据,结合多样化的金融推理场景与验证标注。模型DianJin-R1-7B和DianJin-R1-32B基于Qwen2.5系列指令模型微调,并采用结构化格式生成推理步骤和最终答案。为进一步提升推理质量,引入了分组相对策略优化(Group Relative Policy Optimization, GRPO),这是一种结合双重奖励信号的强化学习方法,分别鼓励结构化输出和答案正确性。实验结果表明,DianJin-R1模型在多个基准测试中显著优于非推理模型,尤其在复杂金融任务中表现优异,并且在真实世界的CCC数据集上,单次调用的推理模型性能可媲美甚至超越多代理系统,展示了其在实际应用中的有效性与实用性。

链接: https://arxiv.org/abs/2504.15716
作者: Jie Zhu,Qian Chen,Huaixia Dou,Junhui Li,Lifan Guo,Feng Chen,Chi Zhang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Effective reasoning remains a core challenge for large language models (LLMs) in the financial domain, where tasks often require domain-specific knowledge, precise numerical calculations, and strict adherence to compliance rules. We propose DianJin-R1, a reasoning-enhanced framework designed to address these challenges through reasoning-augmented supervision and reinforcement learning. Central to our approach is DianJin-R1-Data, a high-quality dataset constructed from CFLUE, FinQA, and a proprietary compliance corpus (Chinese Compliance Check, CCC), combining diverse financial reasoning scenarios with verified annotations. Our models, DianJin-R1-7B and DianJin-R1-32B, are fine-tuned from Qwen2.5-7B-Instruct and Qwen2.5-32B-Instruct using a structured format that generates both reasoning steps and final answers. To further refine reasoning quality, we apply Group Relative Policy Optimization (GRPO), a reinforcement learning method that incorporates dual reward signals: one encouraging structured outputs and another rewarding answer correctness. We evaluate our models on five benchmarks: three financial datasets (CFLUE, FinQA, and CCC) and two general reasoning benchmarks (MATH-500 and GPQA-Diamond). Experimental results show that DianJin-R1 models consistently outperform their non-reasoning counterparts, especially on complex financial tasks. Moreover, on the real-world CCC dataset, our single-call reasoning models match or even surpass the performance of multi-agent systems that require significantly more computational cost. These findings demonstrate the effectiveness of DianJin-R1 in enhancing financial reasoning through structured supervision and reward-aligned learning, offering a scalable and practical solution for real-world applications.
zh
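
摘要中的"双重奖励 + 组内相对优势"可用如下草图说明(其中 `<think>/<answer>` 标签格式与具体奖励数值均为示意性假设,并非论文的实际约定):

```python
import re
import statistics

def dual_reward(response: str, gold: str) -> float:
    """双重奖励:结构化格式奖励 + 答案正确性奖励(示意)。"""
    m = re.search(r"<think>.+</think>\s*<answer>(.+)</answer>", response, re.S)
    format_reward = 1.0 if m else 0.0
    answer = m.group(1).strip() if m else ""
    return format_reward + (1.0 if answer == gold else 0.0)

# GRPO 的"组相对":对同一提示采样出的一组回答做组内标准化得到优势
group = ["<think>推理过程…</think><answer>42</answer>",
         "<think>推理过程…</think><answer>41</answer>",
         "没有任何标签的回答"]
rewards = [dual_reward(r, "42") for r in group]
mu, sigma = statistics.mean(rewards), statistics.pstdev(rewards) + 1e-8
advantages = [(r - mu) / sigma for r in rewards]
print(rewards, [round(a, 2) for a in advantages])  # [2.0, 1.0, 0.0] -> [1.22, 0.0, -1.22]
```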

[AI-33] Advancing Embodied Agent Security: From Safety Benchmarks to Input Moderation

【速读】:该论文旨在解决现有研究主要关注通用大型语言模型安全而缺乏针对具身代理(embodied agents)的安全基准和输入调节专门方法的问题。论文的关键在于提出了一种新颖的输入调节框架,该框架全面涵盖了从分类定义、数据集整理、调节器架构设计到模型训练及严格评估的整个流程。此外,论文引入了EAsafetyBench,这是一个专为具身代理设计的调节器训练与评估的安全基准,并提出了Pinpoint——一种利用掩码注意力机制的提示解耦输入调节方案,以有效隔离和减轻功能性提示对调节任务的影响。实验结果表明,所提方法在多种基准数据集上的平均检测准确率达到94.58%,且每个实例的调节处理时间仅为0.002秒,显著优于现有最先进的技术。

链接: https://arxiv.org/abs/2504.15699
作者: Ning Wang,Zihan Yan,Weiyang Li,Chuan Ma,He Chen,Tao Xiang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 9 pages

点击查看摘要

Abstract:Embodied agents exhibit immense potential across a multitude of domains, making the assurance of their behavioral safety a fundamental prerequisite for their widespread deployment. However, existing research predominantly concentrates on the security of general large language models, lacking specialized methodologies for establishing safety benchmarks and input moderation tailored to embodied agents. To bridge this gap, this paper introduces a novel input moderation framework, meticulously designed to safeguard embodied agents. This framework encompasses the entire pipeline, including taxonomy definition, dataset curation, moderator architecture, model training, and rigorous evaluation. Notably, we introduce EAsafetyBench, a meticulously crafted safety benchmark engineered to facilitate both the training and stringent assessment of moderators specifically designed for embodied agents. Furthermore, we propose Pinpoint, an innovative prompt-decoupled input moderation scheme that harnesses a masked attention mechanism to effectively isolate and mitigate the influence of functional prompts on moderation tasks. Extensive experiments conducted on diverse benchmark datasets and models validate the feasibility and efficacy of the proposed approach. The results demonstrate that our methodologies achieve an impressive average detection accuracy of 94.58%, surpassing the performance of existing state-of-the-art techniques, alongside an exceptional moderation processing time of merely 0.002 seconds per instance.
zh

[AI-34] Exploring Inevitable Waypoints for Unsolvability Explanation in Hybrid Planning Problems

【速读】:该论文旨在解决规划问题不可解性的解释这一研究难题。当前规划领域已有关于生成可解性方案解释的研究,但针对不可解性解释的工作尚显不足。论文的关键创新在于提出通过识别混合系统中的“共同路标”(common waypoints)来分析和解释规划问题的不可解性。这些路标是普遍存在的障碍,出现在从初始状态到目标状态的所有可能计划路径上。论文将路标视为子问题,并将无法到达任一路标视为原问题不可解的解释。为实现这一目标,论文设计了一种新颖的方法,通过将问题转化为最长公共子序列问题(Longest Common Subsequence Problem)来识别这些路标,这是一种经典的计算机科学问题,常用于动态规划教学示例。随后,通过对路标进行符号可达性分析,确定最早无法到达的路标并将其作为不可解性的解释。实验结果验证了该方法在混合域不可解规划问题上的有效性。

链接: https://arxiv.org/abs/2504.15668
作者: Mir Md Sajid Sarwar,Rajarshi Ray
机构: 未知
类目: Artificial Intelligence (cs.AI); Formal Languages and Automata Theory (cs.FL)
备注:

点击查看摘要

Abstract:Explaining unsolvability of planning problems is of significant research interest in Explainable AI Planning. AI planning literature has reported several research efforts on generating explanations of solutions to planning problems. However, explaining the unsolvability of planning problems remains a largely open and understudied problem. A widely practiced approach to plan generation and automated problem solving, in general, is to decompose tasks into sub-problems that help progressively converge towards the goal. In this paper, we propose to adopt the same philosophy of sub-problem identification as a mechanism for analyzing and explaining unsolvability of planning problems in hybrid systems. In particular, for a given unsolvable planning problem, we propose to identify common waypoints, which are universal obstacles to plan existence; in other words, they appear on every plan from the source to the planning goal. This work envisions such waypoints as sub-problems of the planning problem and the unreachability of any of these waypoints as an explanation for the unsolvability of the original planning problem. We propose a novel method of waypoint identification by casting the problem as an instance of the longest common subsequence problem, a widely popular problem in computer science, typically considered as an illustrative example for the dynamic programming paradigm. Once the waypoints are identified, we perform symbolic reachability analysis on them to identify the earliest unreachable waypoint and report it as the explanation of unsolvability. We present experimental results on unsolvable planning problems in hybrid domains.
zh
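
摘要将"路标识别"归约为最长公共子序列问题,其核心动态规划可写成如下经典形式(两条计划序列为假设的玩具数据;真实问题中序列元素是混合系统中的状态):

```python
def longest_common_subsequence(a, b):
    """O(len(a)*len(b)) 的经典动态规划,返回一条 LCS。
    在论文设定中,a、b 为两条候选计划,LCS 中的元素即共同路标候选。"""
    n, m = len(a), len(b)
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n):
        for j in range(m):
            dp[i + 1][j + 1] = (dp[i][j] + 1 if a[i] == b[j]
                                else max(dp[i][j + 1], dp[i + 1][j]))
    out, i, j = [], n, m          # 回溯还原子序列
    while i and j:
        if a[i - 1] == b[j - 1]:
            out.append(a[i - 1]); i -= 1; j -= 1
        elif dp[i - 1][j] >= dp[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return out[::-1]

plan1 = ["s0", "door", "bridge", "key", "goal"]
plan2 = ["s0", "window", "bridge", "key", "goal"]
print(longest_common_subsequence(plan1, plan2))  # ['s0', 'bridge', 'key', 'goal']
```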

[AI-35] DR.FIX: Automatically Fixing Data Races at Industry Scale

【速读】:该论文旨在解决在工业规模下自动修复数据竞争(data races)的问题。数据竞争是共享内存并行程序中普遍存在的一类并发错误,对软件的可靠性和可复现性构成重大挑战。尽管已有大量研究致力于检测数据竞争,并开发出多种实用的检测工具,但针对其自动化修复的研究相对较少,尤其是在大规模代码库中,数据竞争模式复杂且不断引入,使得自动化修复尤为困难。

论文的关键解决方案在于提出了一种名为DR.FIX的工具,该工具结合了大型语言模型(Large Language Models, LLMs)与程序分析技术,能够在实际场景中生成数据竞争的修复方案。通过将LLMs与程序分析相结合,该工具能够有效处理复杂的代码上下文中多种类型的数据竞争模式。此外,该工具专为Go语言设计,这种语言广泛应用于现代微服务架构中,这些架构中并发性普遍较高且数据竞争频发。DR.FIX能够无缝集成到现有的开发工作流中,支持开发者高效协作。

在过去18个月中,DR.FIX已被集成到Uber的开发者工作流中,展示了其实用价值。统计显示,在404个涵盖各类数据竞争的样本中,该工具成功生成了224个修复补丁(占55%),其中193个补丁(占生成补丁总数的86%)通过了超过百名开发者的代码审查并被整合进代码库。这一结果验证了该工具在解决工业级数据竞争修复问题上的有效性及其设计选择对修复质量的积极影响。

链接: https://arxiv.org/abs/2504.15637
作者: Farnaz Behrang,Zhizhou Zhang,Georgian-Vlad Saioc,Peng Liu,Milind Chabbi
机构: 未知
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Programming Languages (cs.PL); Software Engineering (cs.SE)
备注: To appear in PLDI 2025

点击查看摘要

Abstract:Data races are a prevalent class of concurrency bugs in shared-memory parallel programs, posing significant challenges to software reliability and reproducibility. While there is an extensive body of research on detecting data races and a wealth of practical detection tools across various programming languages, considerably less effort has been directed toward automatically fixing data races at an industrial scale. In large codebases, data races are continuously introduced and exhibit myriad patterns, making automated fixing particularly challenging. In this paper, we tackle the problem of automatically fixing data races at an industrial scale. We present DR.FIX, a tool that combines large language models (LLMs) with program analysis to generate fixes for data races in real-world settings, effectively addressing a broad spectrum of racy patterns in complex code contexts. Implemented for Go, the programming language widely used in modern microservice architectures where concurrency is pervasive and data races are frequent, DR.FIX seamlessly integrates into existing development workflows. We detail the design of DR.FIX and examine how individual design choices influence the quality of the fixes produced. Over the past 18 months, DR.FIX has been integrated into developer workflows at Uber, demonstrating its practical utility. During this period, DR.FIX produced patches for 224 (55%) from a corpus of 404 data races spanning various categories; 193 of these patches (86%) were accepted by more than a hundred developers via code reviews and integrated into the codebase.
zh

[AI-36] Enhancing Reinforcement learning in 3-Dimensional Hydrophobic-Polar Protein Folding Model with Attention-based layers

【速读】:该论文旨在解决基于三维疏水-亲水(H-P)格点模型的蛋白质折叠问题,这一问题在传统序列建模领域中较少受到Transformer架构的关注。论文的关键解决方案在于将深度Q网络(DQN)与注意力机制(Transformers)相结合,通过自回避行走的方式进行折叠决策,并在强化环境中引入基于有利疏水相互作用的专用奖励函数。此外,为了提升性能,该方法还结合了有效性检查(包括对称性破缺约束)、Dueling和double Q学习以及优先回放策略,以聚焦于关键状态转换的学习。这些措施共同确保了模型能够在标准基准序列上的实验评估中实现较短序列的最佳已知解,并对较长链获得接近最优的结果。

链接: https://arxiv.org/abs/2504.15634
作者: Peizheng Liu,Hitoshi Iba
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Transformer-based architectures have recently propelled advances in sequence modeling across domains, but their application to the hydrophobic-hydrophilic (H-P) model for protein folding remains relatively unexplored. In this work, we adapt a Deep Q-Network (DQN) integrated with attention mechanisms (Transformers) to address the 3D H-P protein folding problem. Our system formulates folding decisions as a self-avoiding walk in a reinforced environment, and employs a specialized reward function based on favorable hydrophobic interactions. To improve performance, the method incorporates validity check including symmetry-breaking constraints, dueling and double Q-learning, and prioritized replay to focus learning on critical transitions. Experimental evaluations on standard benchmark sequences demonstrate that our approach achieves several known best solutions for shorter sequences, and obtains near-optimal results for longer chains. This study underscores the promise of attention-based reinforcement learning for protein folding, and presents a prototype Transformer-based Q-network structure for 3-dimensional lattice models.
zh
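
摘要中"基于有利疏水相互作用的专用奖励函数"可按如下方式示意:统计三维格点上非链相邻的 H-H 接触数(坐标与序列为假设示例,并非论文的具体实现):

```python
def hp_reward(coords, sequence):
    """三维格点 H-P 模型的奖励示意:统计疏水残基(H)之间
    既非链上相邻、又互为格点近邻(曼哈顿距离为 1)的接触数。"""
    assert len(coords) == len(sequence)
    assert len(set(coords)) == len(coords)  # 自回避行走:坐标不重复
    contacts = 0
    for i in range(len(coords)):
        for j in range(i + 2, len(coords)):  # 跳过链上相邻的残基
            if sequence[i] == sequence[j] == "H":
                dist = sum(abs(a - b) for a, b in zip(coords[i], coords[j]))
                contacts += int(dist == 1)
    return contacts

# 一条折返的短链:残基 0 与 3 形成一个 H-H 接触
coords = [(0, 0, 0), (1, 0, 0), (1, 1, 0), (0, 1, 0)]
print(hp_reward(coords, "HPPH"))  # 1
```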

[AI-37] A LoRA-Based Approach to Fine-Tuning LLM s for Educational Guidance in Resource-Constrained Settings

【速读】:该论文旨在解决如何以成本效益的方式适配大型语言模型(Large Language Models, LLMs)用于学术指导,并特别关注留学语境下的应用以及低资源环境中的文化适应性(acculturation)。论文的关键解决方案在于结合使用Mistral-7B-Instruct模型与两种技术方法:Low-Rank Adaptation (LoRA) 和 4-bit量化方法。通过两阶段训练策略,首先利用Gemini Pro API生成的合成数据集进行预训练(Phase 1),随后使用StudyAbroadGPT项目提供的人工标注数据集进一步微调(Phase 2),从而提升领域专用性同时保持计算效率。此过程的技术创新包括内存高效的量化、参数高效的适配以及通过Weights & Biases实现的持续训练分析。最终结果表明,该方法显著降低了训练损失(减少52.7%),在特定领域的推荐准确率达到92%,并支持95%的基于markdown格式处理能力,在商用GPU上的平均吞吐量达到每秒100个样本。尽管存在泛化能力下降及依赖合成数据集等局限性,但该框架具有扩展至多语言增强及实时学术咨询场景的潜力。

链接: https://arxiv.org/abs/2504.15610
作者: Md Millat,Md Motiur
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 18 pages, 6 figures (3 graphs + 3 flowchart/architecture diagrams), submitted as a preprint for review consideration in AI for Education or Machine Learning applications in low-resource settings. Includes detailed experiments with LoRA and quantization methods for efficient LLM fine-tuning

点击查看摘要

Abstract:The current study describes a cost-effective method for adapting large language models (LLMs) for academic advising with study-abroad contexts in mind and for application in low-resource methods for acculturation. With the Mistral-7B-Instruct model applied with a Low-Rank Adaptation (LoRA) method and a 4-bit quantization method, the model underwent training in two distinct stages related to this study’s purpose to enhance domain specificity while maintaining computational efficiency. In Phase 1, the model was conditioned with a synthetic dataset via the Gemini Pro API, and in Phase 2, it was trained with manually curated datasets from the StudyAbroadGPT project to achieve enhanced, contextualized responses. Technical innovations entailed memory-efficient quantization, parameter-efficient adaptation, and continuous training analytics via Weights & Biases. After training, this study demonstrated a reduction in training loss by 52.7%, 92% accuracy in domain-specific recommendations, achieved 95% markdown-based formatting support, and a median run-rate of 100 samples per second on off-the-shelf GPU equipment. These findings support the effective application of instruction-tuned LLMs within educational advisers, especially in low-resource institutional scenarios. Limitations included decreased generalizability and the application of a synthetically generated dataset, but this framework is scalable for adding new multilingual-augmented and real-time academic advising processes. Future directions may include plans for the integration of retrieval-augmented generation, applying dynamic quantization routines, and connecting to real-time academic databases to increase adaptability and accuracy.
zh
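
摘要所述"LoRA + 4-bit 量化"的组合在 Hugging Face 生态(transformers + peft + bitsandbytes)中大致对应如下配置草图;其中的超参数、目标模块与模型版本号均为示意性假设,并非论文报告的设置:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# 冻结的基座权重以 4-bit NF4 量化加载
bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-Instruct-v0.2",  # 具体模型版本为假设
    quantization_config=bnb,
    device_map="auto",
)

# 仅在注意力投影上挂载低秩适配器,可训练参数量通常远低于总量的 1%
lora = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()
```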

[AI-38] MetaMolGen: A Neural Graph Motif Generation Model for De Novo Molecular Design

【速读】:该论文旨在解决分子生成在数据稀缺场景下传统生成模型难以实现令人满意的条件泛化的问题。解决方案的关键在于提出MetaMolGen,这是一种基于一阶元学习的分子生成器,专为少量样本和属性条件下的分子生成而设计。MetaMolGen通过将图基序标准化到归一化的潜在空间,并采用轻量级自回归序列模型生成忠实反映底层分子结构的SMILES序列来实现这一目标。此外,它还通过集成到生成器中的可学习属性投影器支持具有目标属性的分子的条件生成。实验结果表明,MetaMolGen在低数据条件下始终能够生成有效的多样化SMILES序列,优于传统的基准模型,突显了其在实际分子设计中快速适应和高效条件生成的优势。

链接: https://arxiv.org/abs/2504.15587
作者: Zimo Yan,Jie Zhang,Zheng Xie,Chang Liu,Yizhen Liu,Yiping Song
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Molecular generation plays an important role in drug discovery and materials science, especially in data-scarce scenarios where traditional generative models often struggle to achieve satisfactory conditional generalization. To address this challenge, we propose MetaMolGen, a first-order meta-learning-based molecular generator designed for few-shot and property-conditioned molecular generation. MetaMolGen standardizes the distribution of graph motifs by mapping them to a normalized latent space, and employs a lightweight autoregressive sequence model to generate SMILES sequences that faithfully reflect the underlying molecular structure. In addition, it supports conditional generation of molecules with target properties through a learnable property projector integrated into the generative model. Experimental results demonstrate that MetaMolGen consistently generates valid and diverse SMILES sequences under low-data regimes, outperforming conventional baselines. This highlights its advantage in fast adaptation and efficient conditional generation for practical molecular design.
zh

[AI-39] A Large-scale Class-level Benchmark Dataset for Code Generation with LLM s

【速读】:该论文试图解决现有代码生成基准测试聚焦于孤立函数而未能有效捕捉真实世界软件类级别复杂结构的问题。为填补这一空白,研究引入了一个大规模的Python类级别数据集,该数据集从13,174个真实开源项目中精心构建,包含超过842,000个类骨架,并保留了结构化和上下文依赖关系以反映实际软件开发场景。此外,通过添加静态代码度量值增强数据集,以支持后续分析。论文的关键解决方案在于利用从真实类骨架提取的结构化提示来显著提升大型语言模型(LLMs)在类级别代码生成任务中的表现,验证结果显示生成代码在词法、结构相似性及质量指标(如ROUGE@L、BLEU、TSED)方面均达到较高水平。这表明精心设计的基于真实类骨架的提示能够极大改善LLMs的性能。

链接: https://arxiv.org/abs/2504.15564
作者: Musfiqur Rahman,SayedHassan Khatoonabadi,Emad Shihab
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: This paper was submitted to the 29th International Conference on Evaluation and Assessment in Software Engineering (EASE 2025) AI models/data track

点击查看摘要

Abstract:Recent advancements in large language models (LLMs) have demonstrated promising capabilities in code generation tasks. However, most existing benchmarks focus on isolated functions and fail to capture the complexity of real-world, class-level software structures. To address this gap, we introduce a large-scale, Python class-level dataset curated from 13,174 real-world open-source projects. The dataset contains over 842,000 class skeletons, each including class and method signatures, along with associated docstrings when available. We preserve structural and contextual dependencies critical to realistic software development scenarios and enrich the dataset with static code metrics to support downstream analysis. To evaluate the usefulness of this dataset, we use extracted class skeletons as prompts for GPT-4 to generate full class implementations. Results show that the LLM-generated classes exhibit strong lexical and structural similarity to human-written counterparts, with average ROUGE@L, BLEU, and TSED scores of 0.80, 0.59, and 0.73, respectively. These findings confirm that well-structured prompts derived from real-world class skeletons significantly enhance LLM performance in class-level code generation. This dataset offers a valuable resource for benchmarking, training, and improving LLMs in realistic software engineering contexts.
zh

[AI-40] A Multi-Agent Framework for Automated Qinqiang Opera Script Generation Using Large Language Models

【速读】:该论文旨在解决利用人工智能技术自动化端到端生产传统戏曲(如秦腔)过程中所面临的挑战。论文的关键在于提出了一种新型的多智能体框架(multi-Agent framework),通过整合大型语言模型(Large Language Models)、视觉生成模型以及文本转语音技术(Text to Speech, TTS),实现剧本创作、舞台场景渲染及语音表演三阶段的专业化协作。其中,三个专门设计的智能体按序分工合作:Agent1负责基于LLM生成符合文化背景且连贯的剧本;Agent2利用视觉生成模型创建语境相符的舞台场景;Agent3则借助TTS生成同步且富有情感表达的语音表现。通过在《窦娥冤》案例中的应用验证,该框架不仅在剧本忠实度、视觉一致性及语音准确性等方面达到了专家评分的显著水平,还通过消融实验表明了模块化协作的重要性。此研究展示了AI驱动的工作流如何有效促进传统表演艺术的保护与规模化发展,并指出了未来在跨模态对齐、情感细腻度提升及支持更多戏曲类型方面的改进方向。

链接: https://arxiv.org/abs/2504.15552
作者: Gengxian Cao,Fengyuan Li,Hong Duan,Ye Yang,Bofeng Wang,Donghe Li
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 17 pages,7 figures,1 tables

点击查看摘要

Abstract:This paper introduces a novel multi-Agent framework that automates the end-to-end production of Qinqiang opera by integrating Large Language Models, visual generation, and Text to Speech synthesis. Three specialized agents collaborate in sequence: Agent1 uses an LLM to craft coherent, culturally grounded scripts; Agent2 employs visual generation models to render contextually accurate stage scenes; and Agent3 leverages TTS to produce synchronized, emotionally expressive vocal performances. In a case study on Dou E Yuan, the system achieved expert ratings of 3.8 for script fidelity, 3.5 for visual coherence, and 3.8 for speech accuracy, culminating in an overall score of 3.6, a 0.3-point improvement over a Single Agent baseline. Ablation experiments demonstrate that removing Agent2 or Agent3 leads to drops of 0.4 and 0.5 points, respectively, underscoring the value of modular collaboration. This work showcases how AI-driven pipelines can streamline and scale the preservation of traditional performing arts, and points toward future enhancements in cross-modal alignment, richer emotional nuance, and support for additional opera genres.
zh

[AI-41] Do It For Me vs. Do It With Me: Investigating User Perceptions of Different Paradigms of Automation in Copilots for Feature-Rich Software

【速读】:该论文试图解决的问题是如何在基于大型语言模型(Large Language Model, LLM)的嵌入式助手(copilot)中确定最佳自动化水平,以提供更有效的用户体验。研究关注用户在学习和完成任务时对控制感的需求,并探索完全自动化与半自动化范式的优劣。

解决方案的关键在于引入两种不同的自动化范式:一种是完全自动化的助手(AutoCopilot),另一种是半自动化的指导型助手(GuidedCopilot)。其中,GuidedCopilot 在自动化简单步骤的同时,通过逐步可视化引导提供用户控制权,从而在探索性和创造性任务中表现出更高的用户控制感、软件实用性和可学性。此外,后续设计探索进一步增强了 GuidedCopilot 的功能,引入任务感知和状态感知特性(如上下文预览片段和自适应指令),以实现更个性化的指导和支持。这些发现强调了用户控制和定制化指导在下一代 copilot 设计中的重要性。

链接: https://arxiv.org/abs/2504.15549
作者: Anjali Khurana,Xiaotian Su,April Yi Wang,Parmit K Chilana
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Accepted for publication in the CHI Conference on Human Factors in Computing Systems (CHI 2025), April 26 - May 1, 2025, Yokohama, Japan

点击查看摘要

Abstract:Large Language Model (LLM)-based in-application assistants, or copilots, can automate software tasks, but users often prefer learning by doing, raising questions about the optimal level of automation for an effective user experience. We investigated two automation paradigms by designing and implementing a fully automated copilot (AutoCopilot) and a semi-automated copilot (GuidedCopilot) that automates trivial steps while offering step-by-step visual guidance. In a user study (N=20) across data analysis and visual design tasks, GuidedCopilot outperformed AutoCopilot in user control, software utility, and learnability, especially for exploratory and creative tasks, while AutoCopilot saved time for simpler visual tasks. A follow-up design exploration (N=10) enhanced GuidedCopilot with task- and state-aware features, including in-context preview clips and adaptive instructions. Our findings highlight the critical role of user control and tailored guidance in designing the next generation of copilots that enhance productivity, support diverse skill levels, and foster deeper software engagement.
zh

[AI-42] A Framework for Testing and Adapting REST APIs as LLM Tools

【速读】:本文旨在解决大型语言模型(LLMs)驱动的自主代理在利用企业系统提供的REST API作为工具时所面临的挑战,这些挑战包括API输入模式复杂、响应繁琐以及文档常常模糊不清。现有的工具测试基准未能充分应对这些复杂性,导致评估API为代理驱动自动化做好准备的程度存在关键差距。为了解决这一问题,论文提出了一种新颖的测试框架,其关键是将API转换为工具,并为此生成全面的测试用例,然后将这些测试用例转化为适合代理理解的自然语言指令,同时丰富工具定义并评估代理正确调用API及其处理输入和响应的能力。通过分析750个测试用例的结果,论文还提出了详细的错误分类,包括输入误解、输出处理不一致和模式不匹配,并对测试用例进行分类以简化调试和改进工具集成。这项工作为使企业API成为有效的工具提供了基础步骤,提高了它们在基于代理的应用中的可用性。

链接: https://arxiv.org/abs/2504.15546
作者: Jayachandu Bandlamudi,Ritwik Chaudhuri,Neelamadhav Gantayat,Kushal Mukherjee,Prerna Agarwal,Renuka Sindhgatta,Sameep Mehta
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) are enabling autonomous agents to perform complex workflows using external tools or functions, often provided via REST APIs in enterprise systems. However, directly utilizing these APIs as tools poses challenges due to their complex input schemas, elaborate responses, and often ambiguous documentation. Current benchmarks for tool testing do not adequately address these complexities, leading to a critical gap in evaluating API readiness for agent-driven automation. In this work, we present a novel testing framework aimed at evaluating and enhancing the readiness of REST APIs to function as tools for LLM-based agents. Our framework transforms APIs into tools, generates comprehensive test cases for the APIs, translates test cases into natural language instructions suitable for agents, enriches tool definitions, and evaluates the agent’s ability to correctly invoke the API and process its inputs and responses. To provide actionable insights, we analyze the outcomes of 750 test cases, presenting a detailed taxonomy of errors, including input misinterpretation, output handling inconsistencies, and schema mismatches. Additionally, we classify these test cases to streamline debugging and refinement of tool integrations. This work offers a foundational step toward enabling enterprise APIs as tools, improving their usability in agent-based applications.
zh
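
"将 API 转换为工具"这一步可简化示意如下:把 OpenAPI 规范中的一个操作扁平化为带 JSON Schema 参数的工具定义(仅处理简单的路径/查询参数,忽略请求体与鉴权;示例端点与字段均为虚构):

```python
import json

def openapi_operation_to_tool(path: str, method: str, op: dict) -> dict:
    """将一个 OpenAPI 操作压缩为 LLM 工具定义(名称、描述、参数 schema)。"""
    props, required = {}, []
    for p in op.get("parameters", []):
        props[p["name"]] = {"type": p["schema"]["type"],
                            "description": p.get("description", "")}
        if p.get("required"):
            required.append(p["name"])
    return {
        "name": op.get("operationId", f"{method}_{path.strip('/')}"),
        "description": op.get("summary", ""),
        "parameters": {"type": "object", "properties": props,
                       "required": required},
    }

op = {"operationId": "getOrder", "summary": "Fetch an order by id",
      "parameters": [{"name": "orderId", "required": True,
                      "schema": {"type": "string"},
                      "description": "订单标识符"}]}
print(json.dumps(openapi_operation_to_tool("/orders/{orderId}", "get", op),
                 ensure_ascii=False, indent=2))
```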

[AI-43] Guillotine: Hypervisors for Isolating Malicious AIs

【速读】:该论文试图解决强大的AI模型在金融、医疗和军事等关键领域中因不可预测行为带来的社会风险问题,特别是那些可能意外或恶意生成存在性威胁的AI模型。论文提出了一种名为Guillotine的hypervisor架构,用于隔离这些强大的AI模型。解决方案的关键在于不仅采用已知的虚拟化技术,还引入全新的隔离机制以应对存在性风险AI特有的威胁模型。这包括通过精心设计hypervisor软件与支持它的CPU、RAM、NIC及存储设备之间的协同工作,防止侧信道泄漏并消除基于反射漏洞的利用途径。此外,Guillotine hypervisor还需提供物理失效保护措施,如网络电缆的机电断开或数据中心的淹没,以确保即使在软件、网络和微架构隔离被突破的情况下,也能暂时关闭或永久销毁失控的AI模型,从而实现多层次防御。

链接: https://arxiv.org/abs/2504.15499
作者: James Mickens,Sarah Radway,Ravi Netravali
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Operating Systems (cs.OS)
备注: To be published in the ACM SIGOPS 2025 Workshop on Hot Topics in Operating Systems

点击查看摘要

Abstract:As AI models become more embedded in critical sectors like finance, healthcare, and the military, their inscrutable behavior poses ever-greater risks to society. To mitigate this risk, we propose Guillotine, a hypervisor architecture for sandboxing powerful AI models – models that, by accident or malice, can generate existential threats to humanity. Although Guillotine borrows some well-known virtualization techniques, Guillotine must also introduce fundamentally new isolation mechanisms to handle the unique threat model posed by existential-risk AIs. For example, a rogue AI may try to introspect upon hypervisor software or the underlying hardware substrate to enable later subversion of that control plane; thus, a Guillotine hypervisor requires careful co-design of the hypervisor software and the CPUs, RAM, NIC, and storage devices that support the hypervisor software, to thwart side channel leakage and more generally eliminate mechanisms for AI to exploit reflection-based vulnerabilities. Beyond such isolation at the software, network, and microarchitectural layers, a Guillotine hypervisor must also provide physical fail-safes more commonly associated with nuclear power plants, avionic platforms, and other types of mission critical systems. Physical fail-safes, e.g., involving electromechanical disconnection of network cables, or the flooding of a datacenter which holds a rogue AI, provide defense in depth if software, network, and microarchitectural isolation is compromised and a rogue AI must be temporarily shut down or permanently destroyed.
zh

[AI-44] Scalable APT Malware Classification via Parallel Feature Extraction and GPU-Accelerated Learning

【速读】:该论文旨在解决自动化和加速恶意软件分类的问题,特别是将恶意可执行文件映射到已知高级持续性威胁(APT)组织的任务。论文的关键在于利用程序的汇编级指令(opcodes)作为特征,并通过开源逆向工程工具结合并行计算脚本高效收集大量恶意样本的opcode数据集。为了解决传统机器学习模型在处理n-gram序列时对元数据依赖以及计算限制的问题,论文引入卷积神经网络(CNNs),并通过图形计算单元(GPU)资源显著加速模型训练,从而实现更高效的恶意软件分类能力。

链接: https://arxiv.org/abs/2504.15497
作者: Noah Subedar,Taeui Kim,Saathwick Venkataramalingam
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: 26 pages, 54 figures, 14 tables

点击查看摘要

Abstract:This paper presents an underlying framework for both automating and accelerating malware classification, more specifically, mapping malicious executables to known Advanced Persistent Threat (APT) groups. The main feature of this analysis is the assembly-level instructions present in executables which are also known as opcodes. The collection of such opcodes on many malicious samples is a lengthy process; hence, open-source reverse engineering tools are used in tandem with scripts that leverage parallel computing to analyze multiple files at once. Traditional and deep learning models are applied to create models capable of classifying malware samples. One-gram and two-gram datasets are constructed and used to train models such as SVM, KNN, and Decision Tree; however, they struggle to provide adequate results without relying on metadata to support n-gram sequences. The computational limitations of such models are overcome with convolutional neural networks (CNNs) and heavily accelerated using graphical compute unit (GPU) resources.
zh
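
摘要中用于训练 SVM/KNN/决策树的 1-gram、2-gram 数据集,其特征构造可示意为如下词袋计数(操作码序列为假设样例):

```python
from collections import Counter

def opcode_ngrams(opcodes, n=2):
    """对操作码序列统计 n-gram 词袋计数;每个可执行文件一行,
    再按全局词表展开即得到传统模型所需的特征向量。"""
    return Counter(tuple(opcodes[i:i + n]) for i in range(len(opcodes) - n + 1))

sample = ["push", "mov", "call", "mov", "call", "ret"]
print(opcode_ngrams(sample, 1))
print(opcode_ngrams(sample, 2))
# CNN 路线则直接消费(嵌入后的)原始操作码序列,而非词袋计数
```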

[AI-45] Improving Human-AI Coordination through Adversarial Training and Generative Models

【速读】:本文旨在解决在合作任务中,训练智能体(Agent)能够有效泛化至与新人类伙伴交互的问题。传统对抗性训练在合作场景中的应用受限,因为对抗策略倾向于学习破坏任务而非模拟有效的合作者行为,从而导致自我破坏(self-sabotage)的结果。为应对这一挑战,论文提出了一种名为 GOAT(Generative Online Adversarial Training)的新方法,其关键是结合预训练的生成模型来模拟有效的合作策略,并通过最大化后悔值进行对抗性训练。这种方法的核心在于动态搜索并生成使合作方(Cooperator)表现不佳的协调策略,同时仅更新生成模型的嵌入参数而冻结其余参数,以保持协调策略的真实性和避免对抗性利用。通过这种方式,GOAT 提升了智能体的泛化能力,使其能够在面对多样化的人类行为时表现出色,并在 Overcooked 基准测试中展现了最先进的性能。

链接: https://arxiv.org/abs/2504.15457
作者: Paresh Chaudhary,Yancheng Liang,Daphne Chen,Simon S. Du,Natasha Jaques
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Being able to cooperate with new people is an important component of many economically valuable AI tasks, from household robotics to autonomous driving. However, generalizing to novel humans requires training on data that captures the diversity of human behaviors. Adversarial training is one avenue for searching for such data and ensuring that agents are robust. However, it is difficult to apply in the cooperative setting because adversarial policies intentionally learn to sabotage the task instead of simulating valid cooperation partners. To address this challenge, we propose a novel strategy for overcoming self-sabotage that combines a pre-trained generative model to simulate valid cooperative agent policies with adversarial training to maximize regret. We call our method GOAT: Generative Online Adversarial Training. In this framework, the GOAT dynamically searches for and generates coordination strategies where the learning policy – the Cooperator agent – underperforms. GOAT enables better generalization by exposing the Cooperator to various challenging interaction scenarios. We maintain realistic coordination strategies by updating only the generative model’s embedding while keeping its parameters frozen, thus avoiding adversarial exploitation. We evaluate GOAT with real human partners, and the results demonstrate state-of-the-art performance on the Overcooked benchmark, highlighting its effectiveness in generalizing to diverse human behaviors.
zh

[AI-46] Demand for LLM s: Descriptive Evidence on Substitution Market Expansion and Multihoming

【速读】:该论文旨在研究大型语言模型(Large Language Models, LLMs)需求的三个特征事实:新模型在发布后迅速被采用并在数周内趋于稳定;不同模型的发布主要吸引新用户或替代竞争模型的需求存在显著差异;同时使用多个模型(多归属, multihoming)的现象在应用程序中较为常见。通过这些发现,论文揭示了LLM市场中显著的水平和垂直差异化,这为模型提供者在快速技术进步背景下维持需求和定价能力提供了机会。论文的关键在于通过实证数据验证上述特征事实,并以此为基础探讨LLM市场的动态特性及其对提供商的战略意义。

链接: https://arxiv.org/abs/2504.15440
作者: Andrey Fradkin
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); General Economics (econ.GN)
备注:

点击查看摘要

Abstract:This paper documents three stylized facts about the demand for Large Language Models (LLMs) using data from OpenRouter, a prominent LLM marketplace. First, new models experience rapid initial adoption that stabilizes within weeks. Second, model releases differ substantially in whether they primarily attract new users or substitute demand from competing models. Third, multihoming, using multiple models simultaneously, is common among apps. These findings suggest significant horizontal and vertical differentiation in the LLM market, implying opportunities for providers to maintain demand and pricing power despite rapid technological advances.
zh

[AI-47] Solving Multi-Agent Safe Optimal Control with Distributed Epigraph Form MARL

【速读】:本文旨在解决多机器人系统在协作完成团队目标的同时确保安全的问题。该问题通常被形式化为约束马尔可夫决策过程(CMDP),目标是最小化全局成本,并将约束违反的均值降至用户定义的阈值以下。然而,许多现有的安全多智能体强化学习(MARL)算法在训练稳定性方面表现不佳。为了解决这一问题,论文的关键创新在于采用约束优化的epigraph形式来提升训练稳定性,并证明中心化的epigraph形式问题可以通过分布式方式由每个智能体独立求解。基于此,提出了一个新颖的中心化训练分布式执行的MARL算法Def-MARL。仿真和真实硬件实验结果表明,Def-MARL不仅实现了最优的整体性能,还满足了严格的安全约束并保持了稳定的训练过程。

链接: https://arxiv.org/abs/2504.15425
作者: Songyuan Zhang,Oswin So,Mitchell Black,Zachary Serlin,Chuchu Fan
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA); Optimization and Control (math.OC)
备注: 28 pages, 16 figures; Accepted by Robotics: Science and Systems 2025

点击查看摘要

Abstract:Tasks for multi-robot systems often require the robots to collaborate and complete a team goal while maintaining safety. This problem is usually formalized as a constrained Markov decision process (CMDP), which targets minimizing a global cost and bringing the mean of constraint violation below a user-defined threshold. Inspired by real-world robotic applications, we define safety as zero constraint violation. While many safe multi-agent reinforcement learning (MARL) algorithms have been proposed to solve CMDPs, these algorithms suffer from unstable training in this setting. To tackle this, we use the epigraph form for constrained optimization to improve training stability and prove that the centralized epigraph form problem can be solved in a distributed fashion by each agent. This results in a novel centralized training distributed execution MARL algorithm named Def-MARL. Simulation experiments on 8 different tasks across 2 different simulators show that Def-MARL achieves the best overall performance, satisfies safety constraints, and maintains stable training. Real-world hardware experiments on Crazyflie quadcopters demonstrate the ability of Def-MARL to safely coordinate agents to complete complex collaborative tasks compared to other methods.
zh
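
摘要中"约束优化的 epigraph 形式"指的是如下标准改写(这里给出的仅是通用形式;Def-MARL 的贡献在于证明该中心化 epigraph 问题可由各智能体分布式求解,下式并非论文的完整公式):

```latex
% 原约束问题与其 epigraph 形式等价:
\min_{\theta} \; J(\theta)
\quad \text{s.t.} \quad h(\theta) \le 0
\qquad \Longleftrightarrow \qquad
\min_{z,\,\theta} \; z
\quad \text{s.t.} \quad J(\theta) \le z, \;\; h(\theta) \le 0
```

引入辅助变量 z 后,原目标被移入约束,优化景观更平缓,这正是摘要所述"提升训练稳定性"的来源。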

[AI-48] LLM -Assisted Translation of Legacy FORTRAN Codes to C: A Cross-Platform Study

【速读】:该论文试图解决的问题是如何评估基于大型语言模型(Large Language Models, LLMs)的遗留代码(如Fortran)向目标语言(如C++)翻译的可用性与准确性。论文关注的重点在于通过统计方法量化翻译后的C++代码的编译成功率、LLM翻译代码与人工翻译代码之间的相似度,以及Fortran到C++翻译输出的相似性。解决方案的关键在于采用开放权重的LLMs,在两个不同的计算平台上对Fortran到C++的翻译任务进行系统性研究,并通过多维度的定量分析来全面评估翻译结果的可靠性和实用性。

链接: https://arxiv.org/abs/2504.15424
作者: Nishath Rajiv Ranasinghe,Shawn M. Jones,Michal Kucer,Ayan Biswas,Daniel O’Malley,Alexander Buschmann Most,Selma Liliane Wanna,Ajay Sreekumar
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注: 12 pages, 7 figures, 2 tables

点击查看摘要

Abstract:Large Language Models (LLMs) are increasingly being leveraged for generating and translating scientific computer codes by both domain-experts and non-domain experts. Fortran has served as one of the go-to programming languages in legacy high-performance computing (HPC) for scientific discoveries. Despite growing adoption, LLM-based code translation of legacy code-bases has not been thoroughly assessed or quantified for its usability. Here, we studied the applicability of LLM-based translation of Fortran to C++ as a step towards building an agentic workflow using open-weight LLMs on two different computational platforms. We statistically quantified the compilation accuracy of the translated C++ codes, measured the similarity of the LLM translated code to the human translated C++ code, and statistically quantified the output similarity of the Fortran to C++ translation.
zh

[AI-49] On the Boolean Network Theory of Datalogneg

【速读】:本文旨在建立Datalog^\neg与布尔网络理论之间的形式化联系,并利用布尔网络理论的结果研究Datalog^\neg的模型性质。关键在于证明:在Datalog^\neg程序中,无奇环时正则模型与稳定模型一致,从而保证稳定模型的存在性;无偶环时稳定偏模型唯一,进而确保正则模型唯一。此外,通过分析原子依赖图的反馈点集大小,给出了稳定偏模型、正则模型及稳定模型数量的上界。本文还引入了Datalog^\neg的陷阱空间概念,揭示了支持或稳定陷阱空间与其他语义之间的关系,并证明了最小子集稳定的陷阱空间与正则模型等价。这些结果修正了You和Yuan在1994年针对正常逻辑程序的工作中存在的若干缺陷,但该修正仅适用于负正常逻辑程序。

链接: https://arxiv.org/abs/2504.15417
作者: Van-Giang Trinh,Belaid Benhamou,Sylvain Soliman,François Fages
机构: 未知
类目: Logic in Computer Science (cs.LO); Artificial Intelligence (cs.AI)
备注: 48 pages, 7 figures

点击查看摘要

Abstract:Datalog^\neg is a central formalism used in a variety of domains ranging from deductive databases and abstract argumentation frameworks to answer set programming. Its model theory is the finite counterpart of the logical semantics developed for normal logic programs, mainly based on the notions of Clark’s completion and two-valued or three-valued canonical models including supported, stable, regular and well-founded models. In this paper we establish a formal link between Datalog^\neg and Boolean network theory, which was initially introduced by Stuart Kauffman and René Thomas to reason about gene regulatory networks. We use previous results from Boolean network theory to prove that in the absence of odd cycles in a Datalog^\neg program, the regular models coincide with the stable models, which entails the existence of stable models, and in the absence of even cycles, we show the uniqueness of stable partial models, which entails the uniqueness of regular models. These results on regular models have been claimed by You and Yuan in 1994 for normal logic programs but we show problems in their definition of well-founded stratification and in their proofs that we can fix for negative normal logic programs only. We also give upper bounds on the numbers of stable partial, regular, and stable models of a Datalog^\neg program using the cardinality of a feedback vertex set in its atom dependency graph. Interestingly, our connection to Boolean network theory also points us to the notion of trap spaces for Datalog^\neg programs. We relate the notions of supported or stable trap spaces to the other semantics of Datalog^\neg, and show the equivalence between subset-minimal stable trap spaces and regular models.
zh

[AI-50] Solving New Tasks by Adapting Internet Video Knowledge ICLR2025

【速读】:该论文旨在解决视频生成模型在机器人领域中的两个关键问题:一是大规模预训练模型虽能通过文本条件实现通用化,但可能缺乏对特定环境的敏感性;二是领域内特定数据训练的模型虽能编码环境细节,但可用演示数据规模有限,难以通过自然语言规范实现对未见任务的泛化。论文探索了将领域内信息与大规模预训练视频模型相结合的不同适配技术,并评估其在机器人任务中实现新型文本条件泛化的有效性及其独立的数据与资源需求。关键解决方案在于提出了一种新颖的适配策略——逆概率适配(Inverse Probabilistic Adaptation),该方法不仅在多种机器人任务和设定下表现出一致的强泛化性能,还对适配数据的质量具有鲁棒性,即使使用次优的领域内演示数据,也能成功解决新任务。

链接: https://arxiv.org/abs/2504.15369
作者: Calvin Luo,Zilai Zeng,Yilun Du,Chen Sun
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO)
备注: ICLR 2025. Project Webpage: this https URL

点击查看摘要

Abstract:Video generative models demonstrate great promise in robotics by serving as visual planners or as policy supervisors. When pretrained on internet-scale data, such video models intimately understand alignment with natural language, and can thus facilitate generalization to novel downstream behavior through text-conditioning. However, they may not be sensitive to the specificities of the particular environment the agent inhabits. On the other hand, training video models on in-domain examples of robotic behavior naturally encodes environment-specific intricacies, but the scale of available demonstrations may not be sufficient to support generalization to unseen tasks via natural language specification. In this work, we investigate different adaptation techniques that integrate in-domain information with large-scale pretrained video models, and explore the extent to which they enable novel text-conditioned generalization for robotic tasks, while also considering their independent data and resource considerations. We successfully demonstrate across robotic environments that adapting powerful video models with small scales of example data can successfully facilitate generalization to novel behaviors. In particular, we present a novel adaptation strategy, termed Inverse Probabilistic Adaptation, that not only consistently achieves strong generalization performance across robotic tasks and settings, but also exhibits robustness to the quality of adaptation data, successfully solving novel tasks even when only suboptimal in-domain demonstrations are available.
zh

[AI-51] KeDiff: Key Similarity-Based KV Cache Eviction for Long-Context LLM Inference in Resource-Constrained Environments

【速读】:该论文旨在解决在资源受限环境中,基于大语言模型(LLM)的应用部署面临的长输入提示处理难题。传统键值(KV)缓存 eviction 方法在面对长输入时表现不佳,而该论文提出了一种无需训练的 KV 缓存 eviction 方法 KeyDiff,其核心在于利用键相似性(key similarity)进行缓存管理。KeyDiff 的关键创新在于它不依赖于注意力分数(attention scores),从而能够与优化后的注意力机制(如 FlashAttention)结合使用,同时确保在严格资源限制下高效生成响应,并最大化键多样性(key diversity)。实验表明,KeyDiff 在 LongBench 基准测试中实现了与非 eviction 基线相比小于 0.04% 的性能差距,同时减少了约 23% 的 KV 缓存占用。

链接: https://arxiv.org/abs/2504.15364
作者: Junyoung Park,Dalton Jones,Matt Morse,Raghavv Goel,Mingu Lee,Chris Lott
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 8 pages, 14 figures

点击查看摘要

Abstract:In this work, we demonstrate that distinctive keys during LLM inference tend to have high attention scores. We explore this phenomenon and propose KeyDiff, a training-free KV cache eviction method based on key similarity. This method facilitates the deployment of LLM-based applications requiring long input prompts in resource-constrained environments with limited memory and compute budgets. Unlike other KV cache eviction methods, KeyDiff can process arbitrarily long prompts within strict resource constraints and efficiently generate responses. We demonstrate that KeyDiff computes the optimal solution to a KV cache selection problem that maximizes key diversity, providing a theoretical understanding of KeyDiff. Notably, KeyDiff does not rely on attention scores, allowing the use of optimized attention mechanisms like FlashAttention. We demonstrate the effectiveness of KeyDiff across diverse tasks and models, illustrating a performance gap of less than 0.04% with an 8K cache budget (~23% KV cache reduction) from the non-evicting baseline on the LongBench benchmark for Llama 3.1-8B and Llama 3.2-3B.
zh
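
下面给出"基于键相似性、不依赖注意力分数"的驱逐策略草图(这是对 KeyDiff 多样性最大化选择的一种简化近似,预算与张量形状均为假设):

```python
import torch

def keydiff_evict(keys: torch.Tensor, budget: int) -> torch.Tensor:
    """保留 budget 个最具区分度的键:以每个键与其余键的
    平均余弦相似度打分,相似度越低越独特、越应保留。"""
    k = torch.nn.functional.normalize(keys, dim=-1)
    sim = k @ k.T                                          # 两两余弦相似度
    score = (sim.sum(dim=-1) - 1.0) / (keys.shape[0] - 1)  # 去掉自相似项
    keep = torch.topk(-score, budget).indices              # 取相似度最低者
    return torch.sort(keep).values                         # 保持原有位置顺序

keys = torch.randn(128, 64)            # (缓存 token 数, head_dim)
keep_idx = keydiff_evict(keys, budget=32)
print(keep_idx.shape)                  # torch.Size([32])
```

由于打分只用到键向量本身,该步骤可与 FlashAttention 等不输出注意力分数的优化实现并存,这正是摘要强调的设计优势。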

[AI-52] Reliable Classification with Conformal Learning and Interval-Type 2 Fuzzy Sets

【速读】:该论文旨在解决经典机器学习分类器在实际应用中输出可靠性不足的问题,尤其是在实验室基准之外的场景中可能表现不佳。解决方案的关键在于结合模糊规则系统(Fuzzy Rule-Based Systems)与一致性学习(Conformal Learning),利用一致性学习技术生成具有统计保证的预测置信区间,确保目标类别的覆盖率满足预设的显著性水平。此外,论文进一步探讨了采用**二型模糊集(Type 2 Fuzzy Sets)**替代传统模糊规则或清晰规则(Crisp Rules),以提升系统输出的质量,并讨论了如何通过微调一致性预测系统优化其性能。

链接: https://arxiv.org/abs/2504.15360
作者: Javier Fumanal-Idocin,Javier Andreu-Perez
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Classical machine learning classifiers tend to be overconfident and can be unreliable outside of the laboratory benchmarks. Properly assessing the reliability of the output of the model per sample is instrumental for real-life scenarios where these systems are deployed. Because of this, different techniques have been employed to properly quantify the quality of prediction for a given model. These are most commonly Bayesian statistics and, more recently, conformal learning. Given a calibration set, conformal learning can produce outputs that are guaranteed to cover the target class with a desired significance level, and are more reliable than the standard confidence intervals used by Bayesian methods. In this work, we propose to use conformal learning with fuzzy rule-based systems in classification and show some metrics of their performance. Then, we discuss how the use of type 2 fuzzy sets can improve the quality of the output of the system compared to both fuzzy and crisp rules. Finally, we also discuss how the fine-tuning of the system can be adapted to improve the quality of the conformal prediction.
zh
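
一致性学习提供的统计覆盖保证与底层分类器无关(模糊规则系统亦适用),其分裂式(split)版本可示意如下(校准数据为随机生成的假设样例,并非论文实验设置):

```python
import numpy as np

def conformal_sets(cal_probs, cal_labels, test_probs, alpha=0.1):
    """分裂式一致性预测:以 1 - p(真实类) 为非一致性得分,
    在校准集上取分位数阈值;预测集合的边际覆盖率 >= 1 - alpha。"""
    n = len(cal_labels)
    scores = 1.0 - cal_probs[np.arange(n), cal_labels]
    q = np.quantile(scores, min(np.ceil((n + 1) * (1 - alpha)) / n, 1.0),
                    method="higher")
    return test_probs >= 1.0 - q  # 布尔掩码:该类是否进入预测集合

rng = np.random.default_rng(1)
cal_p = rng.dirichlet(np.ones(3), size=500)       # 分类器输出的类概率
cal_y = np.where(rng.random(500) < 0.8, cal_p.argmax(axis=1),
                 rng.integers(0, 3, size=500))    # 带噪标签
test_p = rng.dirichlet(np.ones(3), size=4)
print(conformal_sets(cal_p, cal_y, test_p, alpha=0.1).astype(int))
```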

[AI-53] Bayesian Federated Learning for Continual Training

【速读】:该论文旨在解决动态环境中贝叶斯联邦学习(Bayesian Federated Learning, BFL)面临的连续学习挑战,即数据分布随时间变化时,如何在保持已有知识的同时适应不断演化的数据。现有BFL方法忽视了这一问题,而论文提出了一种应用于雷达数据的人类感知任务中的连续BFL框架。方案的关键在于利用随机梯度Langevin动力学(Stochastic Gradient Langevin Dynamics, SGLD),通过依次更新模型,并以前一任务的后验分布构建新任务的先验分布,从而实现知识的保留与适应性调整。实验结果验证了所提方法在准确性、预期校准误差(Expected Calibration Error, ECE)以及收敛速度方面的有效性。

链接: https://arxiv.org/abs/2504.15328
作者: Usevalad Milasheuski,Luca Barbieri,Sanaz Kianoush,Monica Nicoli,Stefano Savazzi
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
备注:

点击查看摘要

Abstract:Bayesian Federated Learning (BFL) enables uncertainty quantification and robust adaptation in distributed learning. In contrast to the frequentist approach, it estimates the posterior distribution of a global model, offering insights into model reliability. However, current BFL methods neglect continual learning challenges in dynamic environments where data distributions shift over time. We propose a continual BFL framework applied to human sensing with radar data collected over several days. Using Stochastic Gradient Langevin Dynamics (SGLD), our approach sequentially updates the model, leveraging past posteriors to construct the prior for the new tasks. We assess the accuracy, the expected calibration error (ECE) and the convergence speed of our approach against several baselines. Results highlight the effectiveness of continual Bayesian updates in preserving knowledge and adapting to evolving data.
zh
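
"以上一任务的后验构造新任务先验"的 SGLD 更新可写成如下草图(将上一任务后验近似为高斯、以及其中的步长与玩具似然均为本示例的简化假设):

```python
import torch

def sgld_step(theta, grad_log_lik, prior_mean, prior_std, lr=1e-2):
    """一步 SGLD:梯度 = 当前任务对数似然梯度 + 以旧后验为中心的
    高斯先验梯度;注入的高斯噪声使迭代点成为后验样本。"""
    grad_log_prior = -(theta - prior_mean) / prior_std**2
    noise = torch.randn_like(theta) * (2 * lr) ** 0.5
    return theta + lr * (grad_log_lik + grad_log_prior) + noise

theta = torch.zeros(10)
prev_mean, prev_std = torch.randn(10), torch.ones(10)  # 任务 t-1 的后验摘要
for _ in range(1000):
    grad_ll = -(theta - 1.0)   # 玩具似然:把 theta 拉向 1
    theta = sgld_step(theta, grad_ll, prev_mean, prev_std)
print(theta.mean().item())     # 充分混合后,样本应落在先验与似然折中的后验附近
```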

[AI-54] Significativity Indices for Agreement Values

【速读】:本文旨在解决如何客观评估分类器之间一致性度量(如Cohen’s kappa或类内相关系数)显著性的问题。现有方法主要依赖于人为设定的标尺或界限,缺乏系统性的显著性评价标准,尤其是针对有限数据集或概率分布的情况。论文的关键解决方案在于提出了一种通用的方法,引入了两种显著性指数:一种适用于有限数据集,另一种处理分类概率分布。此外,论文还探讨了这些指数的计算复杂性,并提出了若干高效算法以实现其评估。

链接: https://arxiv.org/abs/2504.15325
作者: Alberto Casagrande,Francesco Fabris,Rossano Girometti,Roberto Pagliarini
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
备注: 27 pages, 6 figures

点击查看摘要

Abstract:Agreement measures, such as Cohen’s kappa or intraclass correlation, gauge the matching between two or more classifiers. They are used in a wide range of contexts from medicine, where they evaluate the effectiveness of medical treatments and clinical trials, to artificial intelligence, where they can quantify the approximation due to the reduction of a classifier. The consistency of different classifiers to a golden standard can be compared simply by using the order induced by their agreement measure with respect to the golden standard itself. Nevertheless, labelling an approach as good or bad exclusively by using the value of an agreement measure requires a scale or a significativity index. Some quality scales have been proposed in the literature for Cohen’s kappa, but they are mainly naive, and their boundaries are arbitrary. This work proposes a general approach to evaluate the significativity of any agreement value between two classifiers and introduces two significativity indices: one dealing with finite data sets, the other one handling classification probability distributions. Moreover, this manuscript considers the computational issues of evaluating such indices and identifies some efficient algorithms to evaluate them.
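Since the proposed significativity indices are built on top of agreement values, a small self-contained example of computing the underlying Cohen's kappa may help fix ideas; the indices themselves are not reproduced here.

```python
import numpy as np

def cohens_kappa(a, b, n_classes):
    """Cohen's kappa between two label vectors."""
    cm = np.zeros((n_classes, n_classes))
    for i, j in zip(a, b):
        cm[i, j] += 1
    cm /= cm.sum()
    p_obs = np.trace(cm)                       # observed agreement
    p_exp = cm.sum(axis=1) @ cm.sum(axis=0)    # agreement expected by chance
    return (p_obs - p_exp) / (1 - p_exp)

rater1 = np.array([0, 1, 1, 2, 0, 2, 1, 0])
rater2 = np.array([0, 1, 2, 2, 0, 1, 1, 0])
print(cohens_kappa(rater1, rater2, n_classes=3))  # ~0.62
```

The paper's point is precisely that mapping such a value onto a fixed quality scale (e.g., "substantial agreement") is arbitrary without a significance criterion.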

[AI-55] How to systematically develop an effective AI-based bias correction model?

[Quick Read]: This paper targets systematic biases in numerical weather prediction (NWP) and proposes an AI framework named ReSA-ConvLSTM. The key to the solution is three innovations for effective bias correction: dynamic climatological normalization, a ConvLSTM with temporal causality constraints, and a residual self-attention mechanism. Together they let the model learn a physics-aware nonlinear mapping between ECMWF forecasts and ERA5 reanalysis data, reducing the root-mean-square error (RMSE) of variables such as 2-m air temperature (T2m), 10-m winds (U10/V10), and sea-level pressure (SLP) by up to 20% over 1-7 day forecasts. The lightweight architecture (only 10.6M parameters) generalizes efficiently, cutting retraining time for cross-variable correction by 85% and improving ocean model skill through bias-corrected boundary conditions. Ablation experiments confirm that each innovation contributes to correction performance, suggesting that building variable characteristics into the model helps enhance forecasting skill.

Link: https://arxiv.org/abs/2504.15322
Authors: Xiao Zhou, Yuze Sun, Jie Wu, Xiaomeng Huang
Affiliation: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Atmospheric and Oceanic Physics (physics.ao-ph)
Comments:

Abstract:This study introduces ReSA-ConvLSTM, an artificial intelligence (AI) framework for systematic bias correction in numerical weather prediction (NWP). We propose three innovations by integrating dynamic climatological normalization, ConvLSTM with temporal causality constraints, and residual self-attention mechanisms. The model establishes a physics-aware nonlinear mapping between ECMWF forecasts and ERA5 reanalysis data. Using 41 years (1981-2021) of global atmospheric data, the framework reduces systematic biases in 2-m air temperature (T2m), 10-m winds (U10/V10), and sea-level pressure (SLP), achieving up to 20% RMSE reduction over 1-7 day forecasts compared to operational ECMWF outputs. The lightweight architecture (10.6M parameters) enables efficient generalization to multiple variables and downstream applications, reducing retraining time by 85% for cross-variable correction while improving ocean model skill through bias-corrected boundary conditions. The ablation experiments demonstrate that our innovations significantly improve the model’s correction performance, suggesting that incorporating variable characteristics into the model helps enhance forecasting skills.
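Of the three ingredients, dynamic climatological normalization is the easiest to illustrate. Below is a minimal sketch of day-of-year anomaly normalization; the array shapes and the random stand-ins for an ERA5-derived climatology are our assumptions, not the paper's code.

```python
import numpy as np

def climatological_normalize(x, day_of_year, clim_mean, clim_std, eps=1e-6):
    """Convert raw fields to anomalies w.r.t. a day-of-year climatology.

    x:             (time, lat, lon) forecast fields
    day_of_year:   (time,) integer day index, 0..365
    clim_mean/std: (366, lat, lon) per-day climatology statistics
    """
    mu = clim_mean[day_of_year]      # broadcast per time step
    sd = clim_std[day_of_year]
    return (x - mu) / (sd + eps)

# Toy usage with random stand-ins for the climatology statistics.
rng = np.random.default_rng(0)
x = rng.normal(288, 5, size=(10, 32, 64))   # e.g., T2m in Kelvin
doy = np.arange(10)
clim_mu = np.full((366, 32, 64), 288.0)
clim_sd = np.full((366, 32, 64), 5.0)
anomalies = climatological_normalize(x, doy, clim_mu, clim_sd)
```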

[AI-56] Diffusion-Driven Inertial Generated Data for Smartphone Location Classification

[Quick Read]: This paper tackles the problem that collecting inertial measurement data is time-consuming and resource-intensive, which hinders the development of robust machine learning models for motion tracking and navigation. The key to the solution is a diffusion-model-based method for generating specific-force data for the smartphone location recognition task. Through a comprehensive evaluation comparing synthetic and real recorded specific-force data across multiple metrics, the authors show that the diffusion-driven generative model captures the distinctive characteristics of specific-force signals under different smartphone placement conditions. By producing diverse, realistic synthetic data, the approach eases the burden of large-scale data collection while supplying high-quality training data for machine learning models.

Link: https://arxiv.org/abs/2504.15315
Authors: Noa Cohen, Rotem Dror, Itzik Klein
Affiliation: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Despite the crucial role of inertial measurements in motion tracking and navigation systems, the time-consuming and resource-intensive nature of collecting extensive inertial data has hindered the development of robust machine learning models in this field. In recent years, diffusion models have emerged as a revolutionary class of generative models, reshaping the landscape of artificial data generation. These models surpass generative adversarial networks and other state-of-the-art approaches to complex tasks. In this work, we propose diffusion-driven specific force-generated data for smartphone location recognition. We provide a comprehensive evaluation methodology by comparing synthetic and real recorded specific force data across multiple metrics. Our results demonstrate that our diffusion-based generative model successfully captures the distinctive characteristics of specific force signals across different smartphone placement conditions. Thus, by creating diverse, realistic synthetic data, we can reduce the burden of extensive data collection while providing high-quality training data for machine learning models.

[AI-57] PolicyEvol-Agent: Evolving Policy via Environment Perception and Self-Awareness with Theory of Mind

[Quick Read]: This paper addresses the limited study of effective cognition chains (reasoning, planning, decision-making, and reflection) for multi-agent systems in dynamically interactive scenarios, and in particular the difficulty prompt-based responses have with psychological state perception and empirical calibration during uncertain gaming processes, which can lead to cognitive bias. The key to the solution is the PolicyEvol-Agent framework, which systematically acquires the intentions of others and adaptively optimizes irrational strategies for continual improvement. Concretely, PolicyEvol-Agent first obtains reflective expertise patterns and then integrates a range of cognitive operations with Theory of Mind alongside internal and external perspectives. Experiments show the approach outperforms reinforcement-learning-based and agent-based methods in final game victory, and the policy evolution mechanism demonstrates the effectiveness of dynamic guideline adjustment in both automatic and human evaluation.

Link: https://arxiv.org/abs/2504.15313
Authors: Yajie Yu, Yue Feng
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Multi-agent systems have exhibited significant intelligence in real-world simulations with Large language models (LLMs) due to the capabilities of social cognition and knowledge retrieval. However, existing research on agents equipped with effective cognition chains including reasoning, planning, decision-making and reflecting remains limited, especially in dynamically interactive scenarios. In addition, unlike humans, prompt-based responses face challenges in psychological state perception and empirical calibration during uncertain gaming processes, which can inevitably lead to cognition bias. In light of the above, we introduce PolicyEvol-Agent, a comprehensive LLM-empowered framework characterized by systematically acquiring intentions of others and adaptively optimizing irrational strategies for continual enhancement. Specifically, PolicyEvol-Agent first obtains reflective expertise patterns and then integrates a range of cognitive operations with Theory of Mind alongside internal and external perspectives. Simulation results, outperforming RL-based models and agent-based methods, demonstrate the superiority of PolicyEvol-Agent for final gaming victory. Moreover, the policy evolution mechanism reveals the effectiveness of dynamic guideline adjustments in both automatic and human evaluation.

[AI-58] Power Transformer Health Index and Life Span Assessment: A Comprehensive Review of Conventional and Machine Learning based Approaches

[Quick Read]: This paper targets health assessment and remaining-lifespan prediction for power transformers, both essential for efficient grid operation and effective maintenance planning. The key lies in a comprehensive review of conventional and cutting-edge techniques, with emphasis on intelligent fault-diagnosis methods for improving the accuracy of transformer condition assessment. The paper surveys a range of Artificial Intelligence (AI) approaches, including Artificial Neural Networks (ANN), Convolutional Neural Networks (CNN), Support Vector Machines (SVM), Random Forests (RF), Genetic Algorithms (GA), and Particle Swarm Optimization (PSO), which offer practical ways to enhance fault-diagnosis performance. Combining multiple AI methods and exploring time-series analysis further improves diagnostic precision and enables earlier fault detection. By providing a panorama of AI applications in transformer fault diagnosis, the study lays the groundwork for future research and progress in this critical area.

Link: https://arxiv.org/abs/2504.15310
Authors: Syeda Tahreem Zahra, Syed Kashif Imdad, Sohail Khan, Sohail Khalid, Nauman Anwar Baig
Affiliation: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Power transformers play a critical role within the electrical power system, making their health assessment and the prediction of their remaining lifespan paramount for the purpose of ensuring efficient operation and facilitating effective maintenance planning. This paper undertakes a comprehensive examination of existent literature, with a primary focus on both conventional and cutting-edge techniques employed within this domain. The merits and demerits of recent methodologies and techniques are subjected to meticulous scrutiny and explication. Furthermore, this paper expounds upon intelligent fault diagnosis methodologies and delves into the most widely utilized intelligent algorithms for the assessment of transformer conditions. Diverse Artificial Intelligence (AI) approaches, including Artificial Neural Networks (ANN) and Convolutional Neural Network (CNN), Support Vector Machine (SVM), Random Forest (RF), Genetic Algorithm (GA), and Particle Swarm Optimization (PSO), are elucidated offering pragmatic solutions for enhancing the performance of transformer fault diagnosis. The amalgamation of multiple AI methodologies and the exploration of timeseries analysis further contribute to the augmentation of diagnostic precision and the early detection of faults in transformers. By furnishing a comprehensive panorama of AI applications in the field of transformer fault diagnosis, this study lays the groundwork for future research endeavors and the progression of this critical area of study.

[AI-59] Can Machine Learning Agents Deal with Hard Choices?

[Quick Read]: This paper examines a limitation of machine learning (ML) agents in multi-objective optimization (MOO): their inability to identify and handle the incommensurable hard choices that are common in human decision-making. It argues that current MOO approaches based on Scalarised Optimisation and Pareto Optimisation cannot capture incommensurability, which gives rise to three alignment problems: the alienness of machine decision-making from a human perspective, the unreliability of preference-based alignment strategies for hard choices, and the blockage of alignment strategies that pursue multiple objectives. The paper evaluates two potential technical solutions and recommends an ensemble solution, which appears most promising for enabling ML agents to identify hard choices and mitigate the alignment problems. However, no current technique allows machine agents to resolve hard choices through deliberation, because they cannot autonomously change their goals. This underscores the distinctiveness of human agency and calls on ML researchers to reconceptualise machine autonomy and develop more effective frameworks and methods to address this fundamental gap.

Link: https://arxiv.org/abs/2504.15304
Authors: Kangyu Wang
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: 22 pages excluding bibliography, 27 pages including bibliography, 3 figures

Abstract:Machine Learning (ML) agents have been increasingly used in decision-making across a wide range of tasks and environments. These ML agents are typically designed to balance multiple objectives when making choices. Understanding how their decision-making processes align with or diverge from human reasoning is essential. Human agents often encounter hard choices, that is, situations where options are incommensurable; neither option is preferred, yet the agent is not indifferent between them. In such cases, human agents can identify hard choices and resolve them through deliberation. In contrast, current ML agents, due to fundamental limitations in Multi-Objective Optimisation (MOO) methods, cannot identify hard choices, let alone resolve them. Neither Scalarised Optimisation nor Pareto Optimisation, the two principal MOO approaches, can capture incommensurability. This limitation generates three distinct alignment problems: the alienness of ML decision-making behaviour from a human perspective; the unreliability of preference-based alignment strategies for hard choices; and the blockage of alignment strategies pursuing multiple objectives. Evaluating two potential technical solutions, I recommend an ensemble solution that appears most promising for enabling ML agents to identify hard choices and mitigate alignment problems. However, no known technique allows ML agents to resolve hard choices through deliberation, as they cannot autonomously change their goals. This underscores the distinctiveness of human agency and urges ML researchers to reconceptualise machine autonomy and develop frameworks and methods that can better address this fundamental gap.

[AI-60] High-Throughput LLM inference on Heterogeneous Clusters

[Quick Read]: This paper addresses how to serve high-throughput large language model (LLM) inference efficiently on heterogeneous clusters. It targets two main challenges: first, different deployment configurations lead to vastly different performance, and the space of possible configurations is huge, making optimization difficult; second, LLM inference instances in a heterogeneous cluster have differing processing capacities, which complicates request scheduling and makes it hard to exploit each instance's full potential. The key elements of the solution are optimizing the deployment configuration by modeling resource amounts and expected throughput with an exhaustive search, and a novel request-scheduling mechanism that fully accounts for the differing processing capabilities of the instances. Experiments show the proposed scheduler improves throughput by 122.5% and 33.6% on two heterogeneous clusters, respectively.

Link: https://arxiv.org/abs/2504.15303
Authors: Yi Xiong, Jinqi Huang, Wenjie Huang, Xuebing Yu, Entong Li, Zhixiong Ning, Jinhua Zhou, Li Zeng, Xin Chen
Affiliation: Unknown
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)
Comments:

Abstract:Nowadays, many companies possess various types of AI accelerators, forming heterogeneous clusters. Efficiently leveraging these clusters for high-throughput large language model (LLM) inference services can significantly reduce costs and expedite task processing. However, LLM inference on heterogeneous clusters presents two main challenges. Firstly, different deployment configurations can result in vastly different performance. The number of possible configurations is large, and evaluating the effectiveness of a specific setup is complex. Thus, finding an optimal configuration is not an easy task. Secondly, LLM inference instances within a heterogeneous cluster possess varying processing capacities, leading to different processing speeds for handling inference requests. Evaluating these capacities and designing a request scheduling algorithm that fully maximizes the potential of each instance is challenging. In this paper, we propose a high-throughput inference service system on heterogeneous clusters. First, the deployment configuration is optimized by modeling the resource amount and expected throughput and using the exhaustive search method. Second, a novel mechanism is proposed to schedule requests among instances, which fully considers the different processing capabilities of various instances. Extensive experiments show that the proposed scheduler improves throughput by 122.5% and 33.6% on two heterogeneous clusters, respectively.
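The abstract does not spell out the scheduling algorithm, so here is a minimal sketch of one standard capacity-aware heuristic in the same spirit: greedily dispatch each request to the instance with the earliest estimated finish time (classic LPT-style scheduling). The request sizes and per-instance throughputs below are made-up numbers, not the paper's method.

```python
import heapq

def schedule(requests, throughputs):
    """Greedy earliest-finish-time dispatch across heterogeneous instances.

    requests:    list of request sizes (e.g., estimated output tokens)
    throughputs: tokens/s for each instance (heterogeneous capacities)
    Returns per-instance lists of assigned request sizes.
    """
    heap = [(0.0, i) for i in range(len(throughputs))]   # (est. finish time, id)
    heapq.heapify(heap)
    assignment = [[] for _ in throughputs]
    for size in sorted(requests, reverse=True):          # longest job first
        finish, i = heapq.heappop(heap)
        assignment[i].append(size)
        heapq.heappush(heap, (finish + size / throughputs[i], i))
    return assignment

print(schedule([900, 150, 400, 700, 120, 300], throughputs=[120.0, 60.0, 30.0]))
```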

[AI-61] A biologically Inspired Trust Model for Open Multi-Agent Systems that is Resilient to Rapid Performance Fluctuations

[Quick Read]: This paper addresses the challenges existing trust models face with agent mobility, behavioral changes, and the cold-start problem. It builds on a biologically inspired trust model in which trustees assess their own capabilities and store trust data locally, improving mobility support, reducing communication overhead, resisting disinformation, and preserving privacy. However, earlier evaluations revealed the model's limitations in adapting to changes in the provider population and to continuous performance fluctuations. To address these issues, the paper introduces a new algorithm incorporating a self-classification mechanism through which providers detect performance drops that may harm service consumers. The key to the solution is this self-classification mechanism, which lets providers proactively identify performance anomalies, enhancing the model's adaptability and robustness.

Link: https://arxiv.org/abs/2504.15301
Authors: Zoi Lygizou, Dimitris Kalles
Affiliation: Unknown
Subjects: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
Comments:

Abstract:Trust management provides an alternative solution for securing open, dynamic, and distributed multi-agent systems, where conventional cryptographic methods prove to be impractical. However, existing trust models face challenges related to agent mobility, changing behaviors, and the cold start problem. To address these issues we introduced a biologically inspired trust model in which trustees assess their own capabilities and store trust data locally. This design improves mobility support, reduces communication overhead, resists disinformation, and preserves privacy. Despite these advantages, prior evaluations revealed limitations of our model in adapting to provider population changes and continuous performance fluctuations. This study proposes a novel algorithm, incorporating a self-classification mechanism for providers to detect performance drops potentially harmful for the service consumers. Simulation results demonstrate that the new algorithm outperforms its original version and FIRE, a well-known trust and reputation model, particularly in handling dynamic trustee behavior. While FIRE remains competitive under extreme environmental changes, the proposed algorithm demonstrates greater adaptability across various conditions. In contrast to existing trust modeling research, this study conducts a comprehensive evaluation of our model using widely recognized trust model criteria, assessing its resilience against common trust-related attacks while identifying strengths, weaknesses, and potential countermeasures. Finally, several key directions for future research are proposed.

[AI-62] D2MoE: Dual Routing and Dynamic Scheduling for Efficient On-Device MoE-based LLM Serving

[Quick Read]: This paper addresses the degraded quality of service when Mixture-of-Experts (MoE) models are deployed on resource-constrained edge devices due to their high computational overhead. Although prior work restricts MoE complexity with model compression techniques such as quantization, pruning, and merging, these methods rely on predefined static optimization strategies and therefore struggle to reach the desired quality-overhead trade-off when handling multiple requests, ultimately degrading on-device inference quality.

To solve this, the paper proposes D²MoE, an algorithm-system co-design framework. Its key idea is matryoshka weight quantization (MWQ), inspired by nested matryoshka dolls, which compresses each expert's weights progressively in a bit-nested manner to reduce runtime memory requirements. On top of this, the framework optimizes the I/O-computation pipeline and designs a heuristic scheduling algorithm following a hottest-expert-bit-first (HEBF) principle, which maximizes expert parallelism between the I/O and computation queues under a constrained memory budget, significantly reducing the idle bubbles spent waiting for experts to load. Experiments on real edge devices show that D²MoE improves overall inference throughput by up to 1.39× and lowers peak memory footprint by up to 53%, while preserving serving accuracy comparable to its INT8 counterparts.

Link: https://arxiv.org/abs/2504.15299
Authors: Haodong Wang, Qihua Zhou, Zicong Hong, Song Guo
Affiliation: Unknown
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)
Comments: Accepted by MobiCom 2025

Abstract:The mixture of experts (MoE) model is a sparse variant of large language models (LLMs), designed to hold a better balance between intelligent capability and computational overhead. Despite its benefits, MoE is still too expensive to deploy on resource-constrained edge devices, especially with the demands of on-device inference services. Recent research efforts often apply model compression techniques, such as quantization, pruning and merging, to restrict MoE complexity. Unfortunately, due to their predefined static model optimization strategies, they cannot always achieve the desired quality-overhead trade-off when handling multiple requests, finally degrading the on-device quality of service. These limitations motivate us to propose D²MoE, an algorithm-system co-design framework that matches diverse task requirements by dynamically allocating the most proper bit-width to each expert. Specifically, inspired by the nested structure of matryoshka dolls, we propose the matryoshka weight quantization (MWQ) to progressively compress expert weights in a bit-nested manner and reduce the required runtime memory. On top of it, we further optimize the I/O-computation pipeline and design a heuristic scheduling algorithm following our hottest-expert-bit-first (HEBF) principle, which maximizes the expert parallelism between I/O and computation queue under constrained memory budgets, thus significantly reducing the idle temporal bubbles waiting for the experts to load. Evaluations on real edge devices show that D²MoE improves the overall inference throughput by up to 1.39× and reduces the peak memory footprint by up to 53% over the latest on-device inference frameworks, while still preserving serving accuracy comparable to its INT8 counterparts.
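Our reading of "bit-nested" quantization can be illustrated with a small sketch: a single stored 8-bit code whose high-order bits double as the 4-bit and 2-bit codes, so one copy of the expert weights serves several precisions. This is an illustration of the idea only, not the paper's MWQ implementation.

```python
import numpy as np

def quantize_int8(w):
    """Symmetric 8-bit quantization of a weight tensor."""
    scale = np.abs(w).max() / 127.0
    return np.clip(np.round(w / scale), -127, 127).astype(np.int8), scale

def dequantize_at(q, scale, bits):
    """Read the same stored code at a lower precision.

    Keeping only the top `bits` of the 8-bit code means the 4-bit and
    2-bit "dolls" are nested inside the 8-bit one: no extra copies of
    the expert weights need to be stored.
    """
    shift = 8 - bits
    q_low = (q.astype(np.int32) >> shift) << shift   # zero the low-order bits
    return q_low * scale

rng = np.random.default_rng(0)
w = rng.normal(size=(4, 4)).astype(np.float32)
q, s = quantize_int8(w)
for b in (8, 4, 2):
    err = np.abs(dequantize_at(q, s, b) - w).mean()
    print(f"{b}-bit view: mean abs error {err:.4f}")
```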

[AI-63] Scalability Optimization in Cloud-Based AI Inference Services: Strategies for Real-Time Load Balancing and Automated Scaling

[Quick Read]: This paper proposes a comprehensive scalability-optimization framework for cloud AI inference services, addressing the dynamic workloads created by their rapid expansion while maintaining high performance, with a focus on real-time load balancing and autoscaling strategies. The key is a hybrid approach that combines reinforcement learning for adaptive load distribution with deep neural networks for accurate demand forecasting, letting the system anticipate workload fluctuations and proactively adjust resources to maximize utilization and minimize latency. A decentralized decision-making process within the model further improves fault tolerance and shortens response time during scaling operations. Experiments show the model improves load-balancing efficiency by 35% and reduces response delay by 28%, a substantial optimization over conventional scalability solutions.

Link: https://arxiv.org/abs/2504.15296
Authors: Yihong Jin, Ze Yang
Affiliation: Unknown
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)
Comments: Accepted to BDICN 2025

Abstract:The rapid expansion of AI inference services in the cloud necessitates a robust scalability solution to manage dynamic workloads and maintain high performance. This study proposes a comprehensive scalability optimization framework for cloud AI inference services, focusing on real-time load balancing and autoscaling strategies. The proposed model is a hybrid approach that combines reinforcement learning for adaptive load distribution and deep neural networks for accurate demand forecasting. This multi-layered approach enables the system to anticipate workload fluctuations and proactively adjust resources, ensuring maximum resource utilisation and minimising latency. Furthermore, the incorporation of a decentralised decision-making process within the model serves to enhance fault tolerance and reduce response time in scaling operations. Experimental results demonstrate that the proposed model enhances load balancing efficiency by 35% and reduces response delay by 28%, thereby exhibiting a substantial optimization effect in comparison with conventional scalability solutions.

[AI-64] CUBETESTERAI: Automated JUnit Test Generation using the LLaMA Model

[Quick Read]: This paper addresses automatic generation of JUnit tests for Java applications built on the Spring Boot framework, improving the efficiency and accuracy of testing. The key is leveraging the natural language processing capabilities of the LLaMA (Large Language Model Architecture) model: the resulting tool, CUBETESTERAI, generates high-quality JUnit tests directly from code snippets with minimal manual intervention. The tool combines a user-friendly web interface with a CI/CD pipeline integrating GitLab and Docker, and executes the LLaMA models via RunPod, an online GPU service, which also improves resource management and privacy. CUBETESTERAI handles common issues such as missing import statements and calls to private methods while achieving high coverage and accurate validation of software functionality. Compared with state-of-the-art tools, CUBETESTERAI shows competitive, and in many cases better, code coverage on a variety of real-world Java programs.

Link: https://arxiv.org/abs/2504.15286
Authors: Daniele Gorla, Shivam Kumar, Pietro Nicolaus Roselli Lorenzini, Alireza Alipourfaz
Affiliation: Unknown
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: Accepted to ICST 2025 Industry Track

Abstract:This paper presents an approach to automating JUnit test generation for Java applications using the Spring Boot framework, leveraging the LLaMA (Large Language Model Architecture) model to enhance the efficiency and accuracy of the testing process. The resulting tool, called CUBETESTERAI, includes a user-friendly web interface and the integration of a CI/CD pipeline using GitLab and Docker. These components streamline the automated test generation process, allowing developers to generate JUnit tests directly from their code snippets with minimal manual intervention. The final implementation executes the LLaMA models through RunPod, an online GPU service, which also enhances the privacy of our tool. Using the advanced natural language processing capabilities of the LLaMA model, CUBETESTERAI is able to generate test cases that provide high code coverage and accurate validation of software functionalities in Java-based Spring Boot applications. Furthermore, it efficiently manages resource-intensive operations and refines the generated tests to address common issues like missing imports and handling of private methods. By comparing CUBETESTERAI with some state-of-the-art tools, we show that our proposal consistently demonstrates competitive and, in many cases, better performance in terms of code coverage in different real-life Java programs.

[AI-65] FADEL: Uncertainty-aware Fake Audio Detection with Evidential Deep Learning ICASSP2025

[Quick Read]: This paper addresses the weak generalization of automatic speaker verification (ASV) anti-spoofing systems to unseen, out-of-distribution (OOD) attacks. Existing methods inherently suffer from overconfidence because they use softmax for classification, which can yield unreliable predictions when facing unpredictable spoofing attempts. The key to overcoming this limitation is the proposed framework, fake audio detection with evidential learning (FADEL), which models class probabilities with a Dirichlet distribution and incorporates model uncertainty into its predictions, yielding more robust performance in OOD scenarios. Experiments on the ASVspoof2019 and ASVspoof2021 Logical Access datasets show the method significantly improves baseline models, and a strong correlation observed between average uncertainty and equal error rate (EER) across spoofing algorithms further validates the uncertainty estimation.

Link: https://arxiv.org/abs/2504.15663
Authors: Ju Yeon Kang, Ji Won Yoon, Semin Kim, Min Hyun Han, Nam Soo Kim
Affiliation: Unknown
Subjects: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI)
Comments: Accepted at ICASSP 2025

Abstract:Recently, fake audio detection has gained significant attention, as advancements in speech synthesis and voice conversion have increased the vulnerability of automatic speaker verification (ASV) systems to spoofing attacks. A key challenge in this task is generalizing models to detect unseen, out-of-distribution (OOD) attacks. Although existing approaches have shown promising results, they inherently suffer from overconfidence issues due to the usage of softmax for classification, which can produce unreliable predictions when encountering unpredictable spoofing attempts. To deal with this limitation, we propose a novel framework called fake audio detection with evidential learning (FADEL). By modeling class probabilities with a Dirichlet distribution, FADEL incorporates model uncertainty into its predictions, thereby leading to more robust performance in OOD scenarios. Experimental results on the ASVspoof2019 Logical Access (LA) and ASVspoof2021 LA datasets indicate that the proposed method significantly improves the performance of baseline models. Furthermore, we demonstrate the validity of uncertainty estimation by analyzing a strong correlation between average uncertainty and equal error rate (EER) across different spoofing algorithms.
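The Dirichlet-based uncertainty at the heart of evidential approaches is easy to sketch. Below is the standard evidential deep learning recipe (softplus evidence, alpha = evidence + 1), which may differ in detail from FADEL's exact parameterization.

```python
import torch
import torch.nn.functional as F

def dirichlet_outputs(logits):
    """Map network logits to Dirichlet parameters and a total uncertainty.

    With K classes, evidence e_k >= 0 and alpha_k = e_k + 1; the total
    uncertainty u = K / sum(alpha) is high when evidence is scarce,
    e.g., for an unseen spoofing attack.
    """
    evidence = F.softplus(logits)            # non-negative evidence
    alpha = evidence + 1.0
    strength = alpha.sum(-1, keepdim=True)   # Dirichlet strength
    prob = alpha / strength                  # expected class probabilities
    uncertainty = logits.shape[-1] / strength
    return prob, uncertainty

logits = torch.tensor([[4.0, 0.1], [0.1, 0.2]])   # confident vs. uncertain input
prob, u = dirichlet_outputs(logits)
print(prob)
print(u)   # the second input gets much higher uncertainty
```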

[AI-66] Transport f-divergences

[Quick Read]: This paper defines a new class of divergences for measuring differences between probability density functions on a one-dimensional sample space. The key innovation is constructing these divergences from a convex function applied to the Jacobi operator of the mapping function that pushes one density forward onto the other; the resulting divergences are called transport f-divergences. Building the information measures from the mapping function and its Jacobi operator enables a precise characterization of the differences between densities, and the paper further analyzes their invariances, convexities, variational formulations, and Taylor expansions in terms of mapping functions. Examples of transport f-divergences in generative models are provided to validate the construction.

Link: https://arxiv.org/abs/2504.15515
Authors: Wuchen Li
Affiliation: Unknown
Subjects: Statistics Theory (math.ST); Artificial Intelligence (cs.AI); Information Theory (cs.IT)
Comments: Comments are welcome

Abstract:We define a class of divergences to measure differences between probability density functions in one-dimensional sample space. The construction is based on the convex function with the Jacobi operator of the mapping function that pushforwards one density to the other. We call these information measures transport f-divergences. We present several properties of transport f-divergences, including invariances, convexities, variational formulations, and Taylor expansions in terms of mapping functions. Examples of transport f-divergences in generative models are provided.
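As we read the abstract, the one-dimensional construction can be sketched as follows; this is our rendering under stated assumptions, not the paper's verbatim definition.

```latex
% Sketch (our reading of the abstract). Let T be the monotone map
% pushing the density p forward to q, so that q(T(x))\,T'(x) = p(x).
% For a convex f with f(1) = 0, a transport f-divergence has the form
\[
  \mathcal{D}_f(p \,\|\, q) \;=\; \int_{\mathbb{R}} f\!\big(T'(x)\big)\, p(x)\,\mathrm{d}x .
\]
% When p = q the map T is the identity, T' \equiv 1, and the divergence
% vanishes: the classical f-divergence's density ratio is replaced by
% the Jacobian (derivative) of the transport map.
```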

[AI-67] A Graph Based Raman Spectral Processing Technique for Exosome Classification

[Quick Read]: This paper aims to overcome the challenges of analyzing exosomes with Raman spectroscopy, namely the high sample concentrations required and the limited sensitivity to lipids and proteins. The proposed method organizes 3,045 exosome Raman spectra in a Neo4j graph database and introduces a novel spectral filtering pipeline that combines a PageRank filter with optimal dimensionality reduction. The key is this combination of graph-based spectral filtering with optimal dimensionality reduction, which suppresses noise while preserving key biomarker signals and significantly improves classification accuracy: with group 10-fold cross-validation, an Extra Trees model reaches 0.76 and 0.857 accuracy in classifying hyperglycemic, hypoglycemic, and normal exosome samples from Raman spectra and surface data, respectively. The framework thereby strengthens Raman-based exosome analysis and broadens its potential for biomedical applications, disease diagnostics, and biomarker discovery.

Link: https://arxiv.org/abs/2504.15324
Authors: Vuong M. Ngo, Edward Bolger, Stan Goodwin, John O'Sullivan, Dinh Viet Cuong, Mark Roantree
Affiliation: Unknown
Subjects: Quantitative Methods (q-bio.QM); Artificial Intelligence (cs.AI); Information Theory (cs.IT); Machine Learning (cs.LG)
Comments: The 23rd International Conference on Artificial Intelligence in Medicine (AIME 2025), LNAI, Springer, 11 pages

Abstract:Exosomes are small vesicles crucial for cell signaling and disease biomarkers. Due to their complexity, an “omics” approach is preferable to individual biomarkers. While Raman spectroscopy is effective for exosome analysis, it requires high sample concentrations and has limited sensitivity to lipids and proteins. Surface-enhanced Raman spectroscopy helps overcome these challenges. In this study, we leverage Neo4j graph databases to organize 3,045 Raman spectra of exosomes, enhancing data generalization. To further refine spectral analysis, we introduce a novel spectral filtering process that integrates the PageRank Filter with optimal Dimensionality Reduction. This method improves feature selection, resulting in superior classification performance. Specifically, the Extra Trees model, using our spectral processing approach, achieves 0.76 and 0.857 accuracy in classifying hyperglycemic, hypoglycemic, and normal exosome samples based on Raman spectra and surface, respectively, with group 10-fold cross-validation. Our results show that graph-based spectral filtering combined with optimal dimensionality reduction significantly improves classification accuracy by reducing noise while preserving key biomarker signals. This novel framework enhances Raman-based exosome analysis, expanding its potential for biomedical applications, disease diagnostics, and biomarker discovery.
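One way a PageRank score can rank spectral features is on a feature-similarity graph. The sketch below builds a correlation graph over bands with networkx; it is only an analogue of the paper's Neo4j-based pipeline, whose exact filter and threshold are not reproduced, and the toy data stand in for real spectra (which are far more correlated).

```python
import numpy as np
import networkx as nx

def pagerank_feature_filter(X, top_k=50, threshold=0.1):
    """Rank spectral bands by PageRank on a band-correlation graph."""
    corr = np.abs(np.corrcoef(X.T))           # band-to-band similarity
    np.fill_diagonal(corr, 0.0)
    g = nx.from_numpy_array(corr * (corr > threshold))
    scores = nx.pagerank(g, weight="weight")
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:top_k]                      # indices of retained bands

rng = np.random.default_rng(0)
spectra = rng.normal(size=(120, 300))          # 120 spectra x 300 bands (toy)
keep = pagerank_feature_filter(spectra, top_k=20)
reduced = spectra[:, keep]                     # feed into PCA / classifier next
```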

[AI-68] RINN: One Sample Radio Frequency Imaging based on Physics Informed Neural Network

[Quick Read]: This paper addresses the bottleneck in radio-frequency (RF) imaging caused by commodity devices (such as Wi-Fi) that struggle to provide high-precision electromagnetic measurements and large-scale datasets. The key is to combine ideas from physics-informed neural networks (PINNs) in the design of the RINN network, replacing ground-truth comparison constraints with physical constraints and adapting to the characteristics of ubiquitous RF signals, so that RINN can perform RF imaging from a single sample, without phase information and under amplitude noise. The core strength of the approach is its ability to work with phase-free, noisy data, significantly improving the practicality and applicability of RF imaging.

Link: https://arxiv.org/abs/2504.15311
Authors: Fei Shang, Haohua Du, Dawei Yan, Panlong Yang, Xiang-Yang Li
Affiliation: Unknown
Subjects: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Due to its ability to work in non-line-of-sight and low-light environments, radio frequency (RF) imaging technology is expected to bring new possibilities for embodied intelligence and multimodal sensing. However, widely used RF devices (such as Wi-Fi) often struggle to provide high-precision electromagnetic measurements and large-scale datasets, hindering the application of RF imaging technology. In this paper, we combine the ideas of PINN to design the RINN network, using physical constraints instead of true-value comparison constraints and adapting it to the characteristics of ubiquitous RF signals, allowing the RINN network to achieve RF imaging using only one sample, without phase and with amplitude noise. Our numerical evaluation shows that, compared with five classic algorithms that image from phase data, RINN produces good imaging results from phaseless data, with indicators such as RRMSE (0.11) performing similarly well. RINN provides new possibilities for the universal development of radio frequency imaging technology.

Machine Learning

[LG-0] π_0.5: a Vision-Language-Action Model with Open-World Generalization

Link: https://arxiv.org/abs/2504.16054
Authors: Physical Intelligence, Kevin Black, Noah Brown, James Darpinian, Karan Dhabalia, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Manuel Y. Galliker, Dibya Ghosh, Lachy Groom, Karol Hausman, Brian Ichter, Szymon Jakubczak, Tim Jones, Liyiming Ke, Devin LeBlanc, Sergey Levine, Adrian Li-Bell, Mohith Mothukuri, Suraj Nair, Karl Pertsch, Allen Z. Ren, Lucy Xiaoyang Shi, Laura Smith, Jost Tobias Springenberg, Kyle Stachowicz, James Tanner, Quan Vuong, Homer Walke, Anna Walling, Haohuan Wang, Lili Yu, Ury Zhilinsky
Subjects: Machine Learning (cs.LG); Robotics (cs.RO)
Comments:

Abstract:In order for robots to be useful, they must perform practically relevant tasks in the real world, outside of the lab. While vision-language-action (VLA) models have demonstrated impressive results for end-to-end robot control, it remains an open question how far such models can generalize in the wild. We describe π_0.5, a new model based on π_0 that uses co-training on heterogeneous tasks to enable broad generalization. π_0.5 uses data from multiple robots, high-level semantic prediction, web data, and other sources to enable broadly generalizable real-world robotic manipulation. Our system uses a combination of co-training and hybrid multi-modal examples that combine image observations, language commands, object detections, semantic subtask prediction, and low-level actions. Our experiments show that this kind of knowledge transfer is essential for effective generalization, and we demonstrate for the first time that an end-to-end learning-enabled robotic system can perform long-horizon and dexterous manipulation skills, such as cleaning a kitchen or bedroom, in entirely new homes.

[LG-1] The Formation of Production Networks: How Supply Chains Arise from Simple Learning with Minimal Information

Link: https://arxiv.org/abs/2504.16010
Authors: Tuong Manh Vu, Ernesto Carrella, Robert Axtell, Omar A. Guerrero
Subjects: Multiagent Systems (cs.MA); Machine Learning (cs.LG); General Economics (econ.GN)
Comments:

Abstract:We develop a model where firms determine the price at which they sell their differentiable goods, the volume that they produce, and the inputs (types and amounts) that they purchase from other firms. A steady-state production network emerges endogenously without resorting to assumptions such as equilibrium or perfect knowledge about production technologies. Through a simple version of reinforcement learning, firms with heterogeneous technologies cope with uncertainty and maximize profits. Due to this learning process, firms can adapt to shocks such as demand shifts, suppliers/clients closure, productivity changes, and production technology modifications; effectively reshaping the production network. To demonstrate the potential of this model, we analyze the upstream and downstream impact of demand and productivity shocks.

[LG-2] Efficient Discovery of Motif Transition Process for Large-Scale Temporal Graphs

Link: https://arxiv.org/abs/2504.15979
Authors: Zhiyuan Zheng, Jianpeng Qi, Jiantao Li, Guoqing Chao, Junyu Dong, Yanwei Yu
Subjects: Databases (cs.DB); Machine Learning (cs.LG)
Comments:

Abstract:Understanding the dynamic transition of motifs in temporal graphs is essential for revealing how graph structures evolve over time, identifying critical patterns, and predicting future behaviors, yet existing methods often focus on predefined motifs, limiting their ability to comprehensively capture transitions and interrelationships. We propose a parallel motif transition process discovery algorithm, PTMT, a novel parallel method for discovering motif transition processes in large-scale temporal graphs. PTMT integrates a tree-based framework with the temporal zone partitioning (TZP) strategy, which partitions temporal graphs by time and structure while preserving lossless motif transitions and enabling massive parallelism. PTMT comprises three phases: growth zone parallel expansion, overlap-aware result aggregation, and deterministic encoding of motif transitions, ensuring accurate tracking of dynamic transitions and interactions. Results on 10 real-world datasets demonstrate that PTMT achieves speedups ranging from 12.0× to 50.3× compared to the SOTA method.

[LG-3] Adversarial Observations in Weather Forecasting

Link: https://arxiv.org/abs/2504.15942
Authors: Erik Imgrund, Thorsten Eisenhofer, Konrad Rieck
Subjects: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Comments:

Abstract:AI-based systems, such as Google’s GenCast, have recently redefined the state of the art in weather forecasting, offering more accurate and timely predictions of both everyday weather and extreme events. While these systems are on the verge of replacing traditional meteorological methods, they also introduce new vulnerabilities into the forecasting process. In this paper, we investigate this threat and present a novel attack on autoregressive diffusion models, such as those used in GenCast, capable of manipulating weather forecasts and fabricating extreme events, including hurricanes, heat waves, and intense rainfall. The attack introduces subtle perturbations into weather observations that are statistically indistinguishable from natural noise and change less than 0.1% of the measurements - comparable to tampering with data from a single meteorological satellite. As modern forecasting integrates data from nearly a hundred satellites and many other sources operated by different countries, our findings highlight a critical security risk with the potential to cause large-scale disruptions and undermine public trust in weather prediction.

[LG-4] Low-Rank Adaptation of Neural Fields

Link: https://arxiv.org/abs/2504.15933
Authors: Anh Truong, Ahmed H. Mahmoud, Mina Konaković Luković, Justin Solomon
Subjects: Graphics (cs.GR); Machine Learning (cs.LG)
Comments:

Abstract:Processing visual data often involves small adjustments or sequences of changes, such as in image filtering, surface smoothing, and video storage. While established graphics techniques like normal mapping and video compression exploit redundancy to encode such small changes efficiently, the problem of encoding small changes to neural fields (NF) – neural network parameterizations of visual or physical functions – has received less attention. We propose a parameter-efficient strategy for updating neural fields using low-rank adaptations (LoRA). LoRA, a method from the parameter-efficient fine-tuning LLM community, encodes small updates to pre-trained models with minimal computational overhead. We adapt LoRA to instance-specific neural fields, avoiding the need for large pre-trained models, yielding a pipeline suitable for low-compute hardware. We validate our approach with experiments in image filtering, video compression, and geometry editing, demonstrating its effectiveness and versatility for representing neural field updates.
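The LoRA update itself is standard: freeze the fitted field and train only a low-rank correction B·A per layer. A minimal sketch on a toy coordinate MLP (our illustration, not the authors' code) follows.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wrap a frozen linear layer with a trainable low-rank update."""
    def __init__(self, base: nn.Linear, rank: int = 4, alpha: float = 1.0):
        super().__init__()
        self.base = base
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x):
        # y = W x + (alpha/r) * B A x  -- only A and B are trained per edit.
        return self.base(x) + self.scale * (x @ self.A.T) @ self.B.T

# A tiny coordinate-MLP "neural field": R^2 -> R (e.g., a grayscale image).
field = nn.Sequential(nn.Linear(2, 64), nn.ReLU(), nn.Linear(64, 1))
for p in field.parameters():
    p.requires_grad_(False)                  # freeze the already-fitted field
field[0] = LoRALinear(field[0], rank=4)      # adapt the first layer only
trainable = sum(p.numel() for p in field.parameters() if p.requires_grad)
print("trainable params for the edit:", trainable)  # A and B only
```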

[LG-5] StreamRL: Scalable Heterogeneous and Elastic RL for LLMs with Disaggregated Stream Generation

Link: https://arxiv.org/abs/2504.15930
Authors: Yinmin Zhong, Zili Zhang, Xiaoniu Song, Hanpeng Hu, Chao Jin, Bingyang Wu, Nuo Chen, Yukun Chen, Yu Zhou, Changyi Wan, Hongyu Zhou, Yimin Jiang, Yibo Zhu, Daxin Jiang
Subjects: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
Comments:

Abstract:Reinforcement learning (RL) has become the core post-training technique for large language models (LLMs). RL for LLMs involves two stages: generation and training. The LLM first generates samples online, which are then used to derive rewards for training. The conventional view holds that the colocated architecture, where the two stages share resources via temporal multiplexing, outperforms the disaggregated architecture, in which dedicated resources are assigned to each stage. However, in real-world deployments, we observe that the colocated architecture suffers from resource coupling, where the two stages are constrained to use the same resources. This coupling compromises the scalability and cost-efficiency of colocated RL in large-scale training. In contrast, the disaggregated architecture allows for flexible resource allocation, supports heterogeneous training setups, and facilitates cross-datacenter deployment. StreamRL is designed with disaggregation from first principles and fully unlocks its potential by addressing two types of performance bottlenecks in existing disaggregated RL frameworks: pipeline bubbles, caused by stage dependencies, and skewness bubbles, resulting from long-tail output length distributions. To address pipeline bubbles, StreamRL breaks the traditional stage boundary in synchronous RL algorithms through stream generation and achieves full overlapping in asynchronous RL. To address skewness bubbles, StreamRL employs an output-length ranker model to identify long-tail samples and reduces generation time via skewness-aware dispatching and scheduling. Experiments show that StreamRL improves throughput by up to 2.66x compared to existing state-of-the-art systems, and improves cost-effectiveness by up to 1.33x in a heterogeneous, cross-datacenter setting.

[LG-6] ScaleGNN: Towards Scalable Graph Neural Networks via Adaptive High-order Neighboring Feature Fusion

Link: https://arxiv.org/abs/2504.15920
Authors: Xiang Li, Haobing Liu, Jianpeng Qi, Yuan Cao, Guoqing Chao, Yanwei Yu
Subjects: Machine Learning (cs.LG)
Comments:

Abstract:Graph Neural Networks (GNNs) have demonstrated strong performance across various graph-based tasks by effectively capturing relational information between nodes. These models rely on iterative message passing to propagate node features, enabling nodes to aggregate information from their neighbors. Recent research has significantly improved the message-passing mechanism, enhancing GNN scalability on large-scale graphs. However, GNNs still face two main challenges: over-smoothing, where excessive message passing results in indistinguishable node representations, especially in deep networks incorporating high-order neighbors; and scalability issues, as traditional architectures suffer from high model complexity and increased inference time due to redundant information aggregation. This paper proposes a novel framework for large-scale graphs named ScaleGNN that simultaneously addresses both challenges by adaptively fusing multi-level graph features. We first construct neighbor matrices for each order, learning their relative information through trainable weights through an adaptive high-order feature fusion module. This allows the model to selectively emphasize informative high-order neighbors while reducing unnecessary computational costs. Additionally, we introduce a High-order redundant feature masking mechanism based on a Local Contribution Score (LCS), which enables the model to retain only the most relevant neighbors at each order, preventing redundant information propagation. Furthermore, low-order enhanced feature aggregation adaptively integrates low-order and high-order features based on task relevance, ensuring effective capture of both local and global structural information without excessive complexity. Extensive experiments on real-world datasets demonstrate that our approach consistently outperforms state-of-the-art GNN models in both accuracy and computational efficiency.

[LG-7] SUPRA: Subspace Parameterized Attention for Neural Operator on General Domains

Link: https://arxiv.org/abs/2504.15897
Authors: Zherui Yang, Zhengyang Xue, Ligang Liu
Subjects: Machine Learning (cs.LG)
Comments:

Abstract:Neural operators are efficient surrogate models for solving partial differential equations (PDEs), but their key components face challenges: (1) in order to improve accuracy, attention mechanisms suffer from computational inefficiency on large-scale meshes, and (2) spectral convolutions rely on the Fast Fourier Transform (FFT) on regular grids and assume a flat geometry, which causes accuracy degradation on irregular domains. To tackle these problems, we regard the matrix-vector operations in the standard attention mechanism on vectors in Euclidean space as bilinear forms and linear operators in vector spaces and generalize the attention mechanism to function spaces. This new attention mechanism is fully equivalent to the standard attention but impossible to compute due to the infinite dimensionality of function spaces. To address this, inspired by model reduction techniques, we propose a Subspace Parameterized Attention (SUPRA) neural operator, which approximates the attention mechanism within a finite-dimensional subspace. To construct a subspace on irregular domains for SUPRA, we propose using the Laplacian eigenfunctions, which naturally adapt to domains’ geometry and guarantee the optimal approximation for smooth functions. Experiments show that the SUPRA neural operator reduces error rates by up to 33% on various PDE datasets while maintaining state-of-the-art computational efficiency.

[LG-8] Consistent Causal Inference of Group Effects in Non-Targeted Trials with Finitely Many Effect Levels

Link: https://arxiv.org/abs/2504.15854
Authors: Georgios Mavroudeas, Malik Magdon-Ismail, Kristin P. Bennett, Jason Kuruzovich
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments:

Abstract:A treatment may be appropriate for some group (the "sick" group) on whom it has a positive effect, but it can also have a detrimental effect on subjects from another group (the "healthy" group). In a non-targeted trial both sick and healthy subjects may be treated, producing heterogeneous effects within the treated group. Inferring the correct treatment effect on the sick population is then difficult, because the effects on the different groups get tangled. We propose an efficient nonparametric approach to estimating the group effects, called PCM (pre-cluster and merge). We prove its asymptotic consistency in a general setting and show, on synthetic data, more than a 10x improvement in accuracy over existing state-of-the-art. Our approach applies more generally to consistent estimation of functions with a finite range.

[LG-9] Adaptive PCA-Based Outlier Detection for Multi-Feature Time Series in Space Missions ICCS2025

Link: https://arxiv.org/abs/2504.15846
Authors: Jonah Ekelund, Savvas Raptis, Vicki Toy-Edens, Wenli Mo, Drew L. Turner, Ian J. Cohen, Stefano Markidis
Subjects: Machine Learning (cs.LG); Space Physics (physics.space-ph)
Comments: Accepted to ICCS 2025

Abstract:Analyzing multi-featured time series data is critical for space missions, making efficient event detection, potentially onboard, essential for automatic analysis. However, limited onboard computational resources and data downlink constraints necessitate robust methods for identifying regions of interest in real time. This work presents an adaptive outlier detection algorithm based on the reconstruction error of Principal Component Analysis (PCA) for feature reduction, designed explicitly for space mission applications. The algorithm adapts dynamically to evolving data distributions by using Incremental PCA, enabling deployment without a predefined model for all possible conditions. A pre-scaling process normalizes each feature's magnitude while preserving relative variance within feature types. We demonstrate the algorithm's effectiveness in detecting space plasma events, such as distinct space environments, dayside and nightside transient phenomena, and transition layers through NASA's MMS mission observations. Additionally, we apply the method to NASA's THEMIS data, successfully identifying a dayside transient using onboard-available measurements.
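A minimal sketch of the core loop, reconstruction-error scoring with scikit-learn's IncrementalPCA on a synthetic stream, is shown below; the thresholding rule, batch sizes, and injected "events" are our assumptions, not the paper's tuned pipeline.

```python
import numpy as np
from sklearn.decomposition import IncrementalPCA

def reconstruction_error(model, X):
    """Per-sample PCA reconstruction error (the outlier score)."""
    Z = model.transform(X)
    return np.linalg.norm(X - model.inverse_transform(Z), axis=1)

rng = np.random.default_rng(0)
ipca = IncrementalPCA(n_components=3)

for step in range(10):                    # stream of telemetry batches
    X = rng.normal(size=(256, 16))
    X[:2] += 8.0                          # inject two synthetic "events"
    if step > 0:                          # score against the model so far
        err = reconstruction_error(ipca, X)
        flags = err > err.mean() + 3.0 * err.std()   # simple adaptive threshold
        print(f"batch {step}: flagged {flags.sum()} samples")
    ipca.partial_fit(X)                   # then adapt to the drifting distribution
```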

[LG-10] Grounded in Context: Retrieval-Based Method for Hallucination Detection

Link: https://arxiv.org/abs/2504.15771
Authors: Assaf Gerner, Netta Madvil, Nadav Barak, Alex Zaikman, Jonatan Liberman, Liron Hamra, Rotem Brazilay, Shay Tsadok, Yaron Friedman, Neal Harow, Noam Bresler, Shir Chorev, Philip Tannor
Subjects: Machine Learning (cs.LG)
Comments:

Abstract:Despite advancements in grounded content generation, production Large Language Models (LLMs) based applications still suffer from hallucinated answers. We present “Grounded in Context” - Deepchecks’ hallucination detection framework, designed for production-scale long-context data and tailored to diverse use cases, including summarization, data extraction, and RAG. Inspired by RAG architecture, our method integrates retrieval and Natural Language Inference (NLI) models to predict factual consistency between premises and hypotheses using an encoder-based model with only a 512-token context window. Our framework identifies unsupported claims with an F1 score of 0.83 in RAGTruth’s response-level classification task, matching methods that trained on the dataset, and outperforming all comparable frameworks using similar-sized models.

[LG-11] Observability conditions for neural state-space models with eigenvalues and their roots of unity

Link: https://arxiv.org/abs/2504.15758
Authors: Andrew Gracyk
Subjects: Machine Learning (cs.LG); Systems and Control (eess.SY); Dynamical Systems (math.DS); Optimization and Control (math.OC)
Comments: First version

Abstract:We operate through the lens of ordinary differential equations and control theory to study the concept of observability in the context of neural state-space models and the Mamba architecture. We develop strategies to enforce observability, which are tailored to a learning context, specifically where the hidden states are learnable at the initial time, over its continuum, and high-dimensional. We also highlight that our methods emphasize eigenvalues, roots of unity, or both. Our methods effectuate computational efficiency when enforcing observability, sometimes at great scale. We formulate observability conditions in machine learning based on classical control theory and discuss their computational complexity. Our nontrivial results are fivefold. We discuss observability through the use of permutations in neural applications with learnable matrices without high precision. We present two results built upon the Fourier transform that effect observability with high probability up to the randomness in the learning. These results work through the interplay of representations in Fourier space and their eigenstructure, nonlinear mappings, and the observability matrix. We present a result for Mamba that is similar to a Hautus-type condition, but employs an argument using a Vandermonde matrix instead of eigenvectors. Our final result is a shared-parameter construction of the Mamba system, which is computationally efficient in high exponentiation. We develop a training algorithm with this coupling, showing it satisfies a Robbins-Monro condition under certain orthogonality, while a more classical training procedure fails to satisfy a contraction with high Lipschitz constant.

[LG-12] Riemannian Neural Geodesic Interpolant

Link: https://arxiv.org/abs/2504.15736
Authors: Jiawen Wu, Bingguang Chen, Yuyi Zhou, Qi Meng, Rongchan Zhu, Zhi-Ming Ma
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments:

Abstract:Stochastic interpolants are efficient generative models that bridge two arbitrary probability density functions in finite time, enabling flexible generation from the source to the target distribution or vice versa. These models are primarily developed in Euclidean space, and are therefore limited in their application to many distribution learning problems defined on Riemannian manifolds in real-world scenarios. In this work, we introduce the Riemannian Neural Geodesic Interpolant (RNGI) model, which interpolates between two probability densities on a Riemannian manifold along the stochastic geodesics, and then samples from one endpoint as the final state using the continuous flow originating from the other endpoint. We prove that the temporal marginal density of RNGI solves a transport equation on the Riemannian manifold. After training the model's neural velocity and score fields, we propose the Embedding Stochastic Differential Equation (E-SDE) algorithm for stochastic sampling of RNGI. E-SDE significantly improves the sampling quality by reducing the accumulated error caused by the excessive intrinsic discretization of Riemannian Brownian motion in the classical Geodesic Random Walk (GRW) algorithm. We also provide theoretical bounds on the generative bias measured in terms of KL-divergence. Finally, we demonstrate the effectiveness of the proposed RNGI and E-SDE through experiments conducted on both collected and synthetic distributions on S² and SO(3).

[LG-13] Invariant Learning with Annotation-free Environments NEURIPS2024

Link: https://arxiv.org/abs/2504.15686
Authors: Phuong Quynh Le, Christin Seifert, Jörg Schlötterer
Subjects: Machine Learning (cs.LG)
Comments: Accepted at NeurIPS 2024 Workshop UniReps

Abstract:Invariant learning is a promising approach to improve domain generalization compared to Empirical Risk Minimization (ERM). However, most invariant learning methods rely on the assumption that training examples are pre-partitioned into different known environments. We instead infer environments without the need for additional annotations, motivated by observations of the properties within the representation space of a trained ERM model. We show the preliminary effectiveness of our approach on the ColoredMNIST benchmark, achieving performance comparable to methods requiring explicit environment labels and on par with an annotation-free method that poses strong restrictions on the ERM reference model.

[LG-14] TrojanDam: Detection-Free Backdoor Defense in Federated Learning through Proactive Model Robustification utilizing OOD Data

Link: https://arxiv.org/abs/2504.15674
Authors: Yanbo Dai, Songze Li, Zihan Gan, Xueluan Gong
Subjects: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Comments:

Abstract:Federated learning (FL) systems allow decentralized data-owning clients to jointly train a global model through uploading their locally trained updates to a centralized server. The property of decentralization enables adversaries to craft carefully designed backdoor updates to make the global model misclassify only when encountering adversary-chosen triggers. Existing defense mechanisms mainly rely on post-training detection after receiving updates. These methods either fail to identify updates which are deliberately fabricated statistically close to benign ones, or show inconsistent performance in different FL training stages. The effect of unfiltered backdoor updates will accumulate in the global model, and eventually become functional. Given the difficulty of ruling out every backdoor update, we propose a backdoor defense paradigm, which focuses on proactive robustification on the global model against potential backdoor attacks. We first reveal that the successful launching of backdoor attacks in FL stems from the lack of conflict between malicious and benign updates on redundant neurons of ML models. We proceed to prove the feasibility of activating redundant neurons utilizing out-of-distribution (OOD) samples in centralized settings, and migrating to FL settings to propose a novel backdoor defense mechanism, TrojanDam. The proposed mechanism has the FL server continuously inject fresh OOD mappings into the global model to activate redundant neurons, canceling the effect of backdoor updates during aggregation. We conduct systematic and extensive experiments to illustrate the superior performance of TrojanDam, over several SOTA backdoor defense methods across a wide range of FL settings.

[LG-15] Neural Kinematic Bases for Fluids

Link: https://arxiv.org/abs/2504.15657
Authors: Yibo Liu, Paul Kry, Kenny Erleben, Noam Aigerman, Sune Darkner, Teseo Schneider
Subjects: Graphics (cs.GR); Machine Learning (cs.LG); Fluid Dynamics (physics.flu-dyn)
Comments:

Abstract:We propose mesh-free fluid simulations that exploit a kinematic neural basis for velocity fields represented by an MLP. We design a set of losses that ensures that these neural bases satisfy fundamental physical properties such as orthogonality, divergence-free, boundary alignment, and smoothness. Our neural bases can then be used to fit an input sketch of a flow, which will inherit the same fundamental properties from the bases. We then can animate such flow in real-time using standard time integrators. Our neural bases can accommodate different domains and naturally extend to three dimensions.

[LG-16] A Study On Mixup-inspired Augmentation Methods For Software Vulnerability Detection

Link: https://arxiv.org/abs/2504.15632
Authors: Seyed Shayan Daneshvar, Da Tan, Shaowei Wang, Carson Leung
Subjects: Software Engineering (cs.SE); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Comments: Accepted at EASE 2025, Istanbul, Turkey

Abstract:Various Deep Learning (DL) methods have recently been utilized to detect software vulnerabilities. Real-world software vulnerability datasets are rare and hard to acquire, as there's no simple metric for classifying vulnerability. Such datasets are heavily imbalanced, and none of the current datasets are considered huge for DL models. To tackle these problems a recent work has tried to augment the dataset using the source code and generate realistic single-statement vulnerabilities, which is not quite practical and requires manual checking of the generated vulnerabilities. In this regard, we aim to explore the augmentation of vulnerabilities at the representation level to help current models learn better, which has never been done before to the best of our knowledge. We implement and evaluate the 5 augmentation techniques that augment the embedding of the data and recently have been used for code search, which is a completely different software engineering task. We also introduced a conditioned version of those augmentation methods, which ensures the augmentation does not change the vulnerable section of the vector representation. We show that such augmentation methods can be helpful and increase the f1-score by up to 9.67%, yet they cannot beat Random Oversampling when balancing datasets, which increases the f1-score by 10.82%!
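Mixup is one representative of this family of representation-level augmentations. The sketch below interpolates pre-computed embeddings and labels; the embedding dimensions are made-up, and the paper's "conditioned" variant that protects the vulnerable section of the vector is not shown.

```python
import numpy as np

def mixup_embeddings(emb, labels, alpha=0.4, seed=0):
    """Interpolate embeddings and labels (representation-level mixup)."""
    rng = np.random.default_rng(seed)
    lam = rng.beta(alpha, alpha)             # interpolation coefficient
    idx = rng.permutation(len(emb))          # random pairing of samples
    mixed_x = lam * emb + (1.0 - lam) * emb[idx]
    mixed_y = lam * labels + (1.0 - lam) * labels[idx]
    return mixed_x, mixed_y

rng = np.random.default_rng(1)
emb = rng.normal(size=(32, 768))             # e.g., function-level code embeddings
labels = rng.integers(0, 2, 32).astype(float)  # 1 = vulnerable
aug_x, aug_y = mixup_embeddings(emb, labels)   # extra soft-labeled training data
```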

[LG-17] RadioDiff-k2: Helmholtz Equation Informed Generative Diffusion Model for Multi-Path Aware Radio Map Construction

Link: https://arxiv.org/abs/2504.15623
Authors: Xiucheng Wang, Qiming Zhang, Nan Cheng, Ruijin Sun, Zan Li, Shuguang Cui, Xuemin Shen
Subjects: Machine Learning (cs.LG); Systems and Control (eess.SY)
Comments:

Abstract:In this paper, we propose a novel physics-informed generative learning approach, termed RadioDiff-k², for accurate and efficient multipath-aware radio map (RM) construction. As wireless communication evolves towards environment-aware paradigms, driven by the increasing demand for intelligent and proactive optimization in sixth-generation (6G) networks, accurate construction of RMs becomes crucial yet highly challenging. Conventional electromagnetic (EM)-based methods, such as full-wave solvers and ray-tracing approaches, exhibit substantial computational overhead and limited adaptability to dynamic scenarios. Although existing neural network (NN) approaches have efficient inference speed, they lack sufficient consideration of the underlying physics of EM wave propagation, limiting their effectiveness in accurately modeling critical EM singularities induced by complex multipath environments. To address these fundamental limitations, we propose a novel physics-inspired RM construction method guided explicitly by the Helmholtz equation, which inherently governs EM wave propagation. Specifically, we theoretically establish a direct correspondence between EM singularities, which correspond to the critical spatial features influencing wireless propagation, and regions defined by negative wave numbers in the Helmholtz equation. Based on this insight, we design an innovative dual generative diffusion model (DM) framework comprising one DM dedicated to accurately inferring EM singularities and another DM responsible for reconstructing the complete RM using these singularities along with environmental contextual information. Our physics-informed approach uniquely combines the efficiency advantages of data-driven methods with rigorous physics-based EM modeling, significantly enhancing RM accuracy, particularly in complex propagation environments dominated by multipath effects.
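The Helmholtz residual that anchors this kind of physics guidance is straightforward to evaluate with autograd. Below is a generic sketch for a coordinate network, not the paper's dual-diffusion framework; the toy network and wavenumber k are our assumptions.

```python
import torch
import torch.nn as nn

def helmholtz_residual(net, xy, k):
    """Residual of  Laplacian(u) + k^2 u  at collocation points xy."""
    xy = xy.clone().requires_grad_(True)
    u = net(xy).squeeze(-1)
    (grad,) = torch.autograd.grad(u.sum(), xy, create_graph=True)
    lap = torch.zeros_like(u)
    for i in range(xy.shape[1]):          # Laplacian = trace of the Hessian
        (gi,) = torch.autograd.grad(grad[:, i].sum(), xy, create_graph=True)
        lap = lap + gi[:, i]
    return lap + (k ** 2) * u

net = nn.Sequential(nn.Linear(2, 64), nn.Tanh(), nn.Linear(64, 1))
pts = torch.rand(128, 2)                   # collocation points in the scene
res = helmholtz_residual(net, pts, k=2.0)
physics_loss = res.pow(2).mean()           # drives the field toward wave consistency
```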

[LG-18] Dimension-Free Decision Calibration for Nonlinear Loss Functions

Link: https://arxiv.org/abs/2504.15615
Authors: Jingwu Tang, Jiayun Wu, Zhiwei Steven Wu, Jiahao Zhang
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments:

Abstract:When model predictions inform downstream decision making, a natural question is under what conditions decision-makers can simply respond to the predictions as if they were the true outcomes. Calibration suffices to guarantee that simple best-response to predictions is optimal. However, calibration for high-dimensional prediction outcome spaces requires exponential computational and statistical complexity. The recent relaxation known as decision calibration ensures the optimality of the simple best-response rule while requiring only polynomial sample complexity in the dimension of outcomes. However, known results on calibration and decision calibration crucially rely on linear loss functions for establishing best-response optimality. A natural approach to handle nonlinear losses is to map outcomes y into a feature space \phi(y) of dimension m, then approximate losses with linear functions of \phi(y). Unfortunately, even simple classes of nonlinear functions can demand exponentially large or infinite feature dimensions m. A key open problem is whether it is possible to achieve decision calibration with sample complexity independent of m. We begin with a negative result: even verifying decision calibration under standard deterministic best response inherently requires sample complexity polynomial in m. Motivated by this lower bound, we investigate a smooth version of decision calibration in which decision-makers follow a smooth best-response. This smooth relaxation enables dimension-free decision calibration algorithms. We introduce algorithms that, given \mathrm{poly}(|A|, 1/\epsilon) samples and any initial predictor p, can efficiently post-process it to satisfy decision calibration without worsening accuracy. Our algorithms apply broadly to function classes that can be well-approximated by bounded-norm functions in (possibly infinite-dimensional) separable RKHS.
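
As background for the abstract, one standard way of writing the decision-calibration condition is sketched below; this is a schematic statement for orientation, not the paper's exact formalism.

```latex
% Schematic decision calibration: for every loss \ell in the class and its
% induced best response \delta_\ell, acting on the prediction incurs the loss
% the prediction itself anticipates.
\mathbb{E}_{(x,y)}\big[\ell\big(\delta_\ell(p(x)),\, y\big)\big]
  = \mathbb{E}_{x}\, \mathbb{E}_{\hat{y}\sim p(x)}\big[\ell\big(\delta_\ell(p(x)),\, \hat{y}\big)\big]
```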

[LG-19] Learning Dynamic Graphs via Tensorized and Lightweight Graph Convolutional Networks

链接: https://arxiv.org/abs/2504.15613
作者: Minglian Han
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:A dynamic graph (DG) is frequently encountered in numerous real-world scenarios. Consequently, the dynamic graph convolutional network (DGCN) has been successfully applied to perform precise representation learning on a DG. However, conventional DGCNs typically consist of a static GCN coupled with a sequence neural network (SNN) to model spatial and temporal patterns separately. This decoupled modeling mechanism inherently disrupts the intricate spatio-temporal dependencies. To address the issue, this study proposes a novel Tensorized Lightweight Graph Convolutional Network (TLGCN) for accurate dynamic graph learning. It mainly contains the following two key concepts: a) designing a novel spatio-temporal information propagation method for joint propagation of spatio-temporal information based on the tensor M-product framework; b) proposing a tensorized lightweight graph convolutional network based on the above method, which significantly reduces the memory occupation of the model by omitting complex feature transformation and nonlinear activation. Numerical experiments on four real-world datasets demonstrate that the proposed TLGCN outperforms the state-of-the-art models in the weight estimation task on DGs.
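
The tensor M-product framework referenced in the abstract admits a compact NumPy sketch: transform both tensors along the third (temporal) mode with an invertible matrix M, multiply frontal slices facewise, and transform back. This is a generic illustration of the M-product itself, not TLGCN's propagation rule.

```python
import numpy as np

def m_product(A, B, M):
    """Tensor M-product of A (n1 x n2 x t) and B (n2 x n3 x t) under an
    invertible t x t matrix M: mode-3 transform, facewise slice products,
    inverse mode-3 transform."""
    Ah = np.einsum("ijk,lk->ijl", A, M)      # transform along the third mode
    Bh = np.einsum("ijk,lk->ijl", B, M)
    Ch = np.einsum("ijk,jlk->ilk", Ah, Bh)   # facewise matrix products
    return np.einsum("ijk,lk->ijl", Ch, np.linalg.inv(M))  # transform back
```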

[LG-20] Smooth Calibration and Decision Making

链接: https://arxiv.org/abs/2504.15582
作者: Jason Hartline,Yifan Wu,Yunran Yang
类目: Machine Learning (cs.LG); Data Structures and Algorithms (cs.DS); Machine Learning (stat.ML)
*备注: In FORC 2025

点击查看摘要

Abstract:Calibration requires predictor outputs to be consistent with their Bayesian posteriors. For machine learning predictors that do not distinguish between small perturbations, calibration errors are continuous in predictions, e.g., smooth calibration error (Foster and Hart, 2018), Distance to Calibration (Blasiok et al., 2023a). On the contrary, decision-makers who use predictions make optimal decisions discontinuously in probabilistic space, experiencing loss from miscalibration discontinuously. Calibration errors for decision-making are thus discontinuous, e.g., Expected Calibration Error (Foster and Vohra, 1997), and Calibration Decision Loss (Hu and Wu, 2024). Thus, predictors with a low calibration error for machine learning may suffer a high calibration error for decision-making, i.e., they may not be trustworthy for decision-makers optimizing assuming their predictions are correct. It is natural to ask if post-processing a predictor with a low calibration error for machine learning is without loss to achieve a low calibration error for decision-making. In our paper, we show that post-processing an online predictor with \epsilon distance to calibration achieves O(\sqrt{\epsilon}) ECE and CDL, which is asymptotically optimal. The post-processing algorithm adds noise to make predictions differentially private. The optimal bound from low distance to calibration predictors from post-processing is non-optimal compared with existing online calibration algorithms that directly optimize for ECE and CDL.
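
The noise-based post-processing mentioned in the abstract has roughly the following shape; this is a sketch of the flavor only, with a hypothetical noise distribution and scale, since the paper specifies the actual mechanism.

```python
import numpy as np

def noisy_postprocess(preds, scale, rng=None):
    """Perturb probability predictions with Laplace noise and clip to [0, 1].
    Illustrative only: the paper's algorithm chooses the noise so that the
    post-processed predictions are differentially private; the distribution
    and `scale` here are assumptions."""
    rng = np.random.default_rng(rng)
    noisy = np.asarray(preds, dtype=float) + rng.laplace(scale=scale, size=len(preds))
    return np.clip(noisy, 0.0, 1.0)
```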

[LG-21] On the Price of Differential Privacy for Hierarchical Clustering ICLR2025

链接: https://arxiv.org/abs/2504.15580
作者: Chengyuan Deng,Jie Gao,Jalaj Upadhyay,Chen Wang,Samson Zhou
类目: Data Structures and Algorithms (cs.DS); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注: ICLR 2025

点击查看摘要

Abstract:Hierarchical clustering is a fundamental unsupervised machine learning task with the aim of organizing data into a hierarchy of clusters. Many applications of hierarchical clustering involve sensitive user information, therefore motivating recent studies on differentially private hierarchical clustering under the rigorous framework of Dasgupta's objective. However, it has been shown that any privacy-preserving algorithm under edge-level differential privacy necessarily suffers a large error. To capture practical applications of this problem, we focus on the weight privacy model, where each edge of the input graph has at least unit weight. We present a novel algorithm in the weight privacy model that shows significantly better approximation than known impossibility results in the edge-level DP setting. In particular, our algorithm achieves O(\log^{1.5} n/\varepsilon) multiplicative error for \varepsilon-DP and runs in polynomial time, where n is the size of the input graph, and the cost is never worse than the optimal additive error in existing work. We complement our algorithm by showing that if the unit-weight constraint does not apply, the lower bound for weight-level DP hierarchical clustering is essentially the same as the edge-level DP, i.e., \Omega(n^2/\varepsilon) additive error. As a result, we also obtain a new lower bound of \tilde{\Omega}(1/\varepsilon) additive error for balanced sparsest cuts in the weight-level DP model, which may be of independent interest. Finally, we evaluate our algorithm on synthetic and real-world datasets. Our experimental results show that our algorithm performs well in terms of extra cost and has good scalability to large graphs.

[LG-22] Real-Time Optimal Design of Experiment for Parameter Identification of Li-Ion Cell Electrochemical Model

链接: https://arxiv.org/abs/2504.15578
作者: Ian Mikesell,Samuel Filgueira da Silva,Mehmet Fatih Ozkan,Faissal El Idrissi,Prashanth Ramesh,Marcello Canova
类目: Systems and Control (eess.SY); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Accurately identifying the parameters of electrochemical models of Li-ion battery (LiB) cells is a critical task for enhancing model fidelity and predictive ability. Traditional parameter identification methods often require extensive data collection experiments and lack adaptability in dynamic environments. This paper describes a Reinforcement Learning (RL) based approach that dynamically tailors the current profile applied to a LiB cell to optimize the parameter identifiability of the electrochemical model. The proposed framework is implemented in real-time using a Hardware-in-the-Loop (HIL) setup, which serves as a reliable testbed for evaluating the RL-based design strategy. The HIL validation confirms that the RL-based experimental design outperforms conventional test protocols used for parameter identification in terms of both reducing the modeling errors on a verification test and minimizing the duration of the experiment used for parameter identification.

[LG-23] State-Aware IoT Scheduling Using Deep Q-Networks and Edge-Based Coordination

链接: https://arxiv.org/abs/2504.15577
作者: Qingyuan He,Chang Liu,Juecen Zhan,Weiqiang Huang,Ran Hao
类目: Networking and Internet Architecture (cs.NI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This paper addresses the challenge of energy efficiency management faced by intelligent IoT devices in complex application environments. A novel optimization method is proposed, combining Deep Q-Network (DQN) with an edge collaboration mechanism. The method builds a state-action-reward interaction model and introduces edge nodes as intermediaries for state aggregation and policy scheduling. This enables dynamic resource coordination and task allocation among multiple devices. During the modeling process, device status, task load, and network resources are jointly incorporated into the state space. The DQN is used to approximate and learn the optimal scheduling strategy. To enhance the model’s ability to perceive inter-device relationships, a collaborative graph structure is introduced to model the multi-device environment and assist in decision optimization. Experiments are conducted using real-world IoT data collected from the FastBee platform. Several comparative and validation tests are performed, including energy efficiency comparisons across different scheduling strategies, robustness analysis under varying task loads, and evaluation of state dimension impacts on policy convergence speed. The results show that the proposed method outperforms existing baseline approaches in terms of average energy consumption, processing latency, and resource utilization. This confirms its effectiveness and practicality in intelligent IoT scenarios.
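
At its core, the state-action-reward model the abstract describes rests on the standard DQN temporal-difference update; the PyTorch sketch below flattens device status, task load, and network resources into one state vector and omits the paper's edge-node aggregation and collaborative graph. Shapes and names are assumptions.

```python
import torch
import torch.nn as nn

class QNet(nn.Module):
    """Minimal Q-network over a joint device/task/network state vector."""
    def __init__(self, state_dim, n_actions):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim, 128), nn.ReLU(),
                                 nn.Linear(128, n_actions))
    def forward(self, s):
        return self.net(s)

def dqn_loss(q_net, target_net, batch, gamma=0.99):
    """Standard DQN TD loss on a transition batch (s, a, r, s_next, done)."""
    s, a, r, s_next, done = batch
    q = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)   # Q(s, a)
    with torch.no_grad():
        q_next = target_net(s_next).max(dim=1).values   # max_a' Q_target(s', a')
        target = r + gamma * (1.0 - done) * q_next
    return nn.functional.mse_loss(q, target)
```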

[LG-24] SPECI: Skill Prompts based Hierarchical Continual Imitation Learning for Robot Manipulation

链接: https://arxiv.org/abs/2504.15561
作者: Jingkai Xu,Xiangli Nie
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Real-world robot manipulation in dynamic unstructured environments requires lifelong adaptability to evolving objects, scenes and tasks. Traditional imitation learning relies on static training paradigms, which are ill-suited for lifelong adaptation. Although Continual Imitation Learning (CIL) enables incremental task adaptation while preserving learned knowledge, current CIL methods primarily overlook the intrinsic skill characteristics of robot manipulation or depend on manually defined and rigid skills, leading to suboptimal cross-task knowledge transfer. To address these issues, we propose Skill Prompts-based HiErarchical Continual Imitation Learning (SPECI), a novel end-to-end hierarchical CIL policy architecture for robot manipulation. The SPECI framework consists of a multimodal perception and fusion module for heterogeneous sensory information encoding, a high-level skill inference module for dynamic skill extraction and selection, and a low-level action execution module for precise action generation. To enable efficient knowledge transfer on both skill and task levels, SPECI performs continual implicit skill acquisition and reuse via an expandable skill codebook and an attention-driven skill selection mechanism. Furthermore, we introduce mode approximation to augment the last two modules with task-specific and task-sharing parameters, thereby enhancing task-level knowledge transfer. Extensive experiments on diverse manipulation task suites demonstrate that SPECI consistently outperforms state-of-the-art CIL methods across all evaluated metrics, revealing exceptional bidirectional knowledge transfer and superior overall performance.

[LG-25] RiskNet: Interaction-Aware Risk Forecasting for Autonomous Driving in Long-Tail Scenarios

链接: https://arxiv.org/abs/2504.15541
作者: Qichao Liu,Heye Huang,Shiyue Zhao,Lei Shi,Soyoung Ahn,Xiaopeng Li
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注: 24 pages, 14 figures

点击查看摘要

Abstract:Ensuring the safety of autonomous vehicles (AVs) in long-tail scenarios remains a critical challenge, particularly under high uncertainty and complex multi-agent interactions. To address this, we propose RiskNet, an interaction-aware risk forecasting framework, which integrates deterministic risk modeling with probabilistic behavior prediction for comprehensive risk assessment. At its core, RiskNet employs a field-theoretic model that captures interactions among the ego vehicle, surrounding agents, and infrastructure via interaction fields and forces. This model supports multidimensional risk evaluation across diverse scenarios (highways, intersections, and roundabouts), and shows robustness under high-risk and long-tail settings. To capture behavioral uncertainty, we incorporate a graph neural network (GNN)-based trajectory prediction module, which learns multi-modal future motion distributions. Coupled with the deterministic risk field, it enables dynamic, probabilistic risk inference across time, enabling proactive safety assessment under uncertainty. Evaluations on the highD, inD, and rounD datasets, spanning lane changes, turns, and complex merges, demonstrate that our method significantly outperforms traditional approaches (e.g., TTC, THW, RSS, NC Field) in terms of accuracy, responsiveness, and directional sensitivity, while maintaining strong generalization across scenarios. The framework supports real-time, scenario-adaptive risk forecasting in uncertain driving environments, offering a unified foundation for safety-critical decision-making in long-tail scenarios.

[LG-26] Interpretable Deep Learning for Polar Mechanistic Reaction Prediction

链接: https://arxiv.org/abs/2504.15539
作者: Ryan J. Miller,Alexander E. Dashuta,Brayden Rudisill,David Van Vranken,Pierre Baldi
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Accurately predicting chemical reactions is essential for driving innovation in synthetic chemistry, with broad applications in medicine, manufacturing, and agriculture. At the same time, reaction prediction is a complex problem which can be both time-consuming and resource-intensive for chemists to solve. Deep learning methods offer an appealing solution by enabling high-throughput reaction prediction. However, many existing models are trained on the US Patent Office dataset and treat reactions as overall transformations: mapping reactants directly to products with limited interpretability or mechanistic insight. To address this, we introduce PMechRP (Polar Mechanistic Reaction Predictor), a system that trains machine learning models on the PMechDB dataset, which represents reactions as polar elementary steps that capture electron flow and mechanistic detail. To further expand model coverage and improve generalization, we augment PMechDB with a diverse set of combinatorially generated reactions. We train and compare a range of machine learning models, including transformer-based, graph-based, and two-step siamese architectures. Our best-performing approach was a hybrid model, which combines a 5-ensemble of Chemformer models with a two-step Siamese framework to leverage the accuracy of transformer architectures, while filtering away “alchemical” products using the two-step network predictions. For evaluation, we use a test split of the PMechDB dataset and additionally curate a human benchmark dataset consisting of complete mechanistic pathways extracted from an organic chemistry textbook. Our hybrid model achieves a top-10 accuracy of 94.9% on the PMechDB test set and a target recovery rate of 84.9% on the pathway dataset.

[LG-27] Federated Latent Factor Learning for Recovering Wireless Sensor Networks Signal with Privacy-Preserving

链接: https://arxiv.org/abs/2504.15525
作者: Chengjun Yu,Yixin Ran,Yangyi Xia,Jia Wu,Xiaojing Liu
类目: Machine Learning (cs.LG)
*备注: Accepted By ICAISISAS 2025

点击查看摘要

Abstract:Wireless Sensor Networks (WSNs) are a cutting-edge domain in the field of intelligent sensing. Due to sensor failures and energy-saving strategies, the collected data often have massive missing data, hindering subsequent analysis and decision-making. Although Latent Factor Learning (LFL) has been proven effective in recovering missing data, it fails to sufficiently consider data privacy protection. To address this issue, this paper innovatively proposes a federated latent factor learning (FLFL) based spatial signal recovery (SSR) model, named FLFL-SSR. Its main idea is two-fold: 1) it designs a sensor-level federated learning framework, where each sensor uploads only gradient updates instead of raw data to optimize the global model, and 2) it proposes a local spatial sharing strategy, allowing sensors within the same spatial region to share their latent feature vectors, capturing spatial correlations and enhancing recovery accuracy. Experimental results on two real-world WSNs datasets demonstrate that the proposed model outperforms existing federated methods in terms of recovery performance.
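
A minimal sketch of the sensor-level federated step described in the abstract: each sensor computes gradients of a matrix-factorization loss on its own readings and uploads only those gradients. The plain averaged-SGD server step and all names below are illustrative assumptions; FLFL-SSR additionally shares latent vectors within spatial regions.

```python
import numpy as np

def local_gradients(U, V, obs, lam=0.1):
    """One sensor's gradients of the regularized squared-error loss on its
    local observations obs = [(i, j, x), ...]; raw readings stay on-device."""
    gU, gV = np.zeros_like(U), np.zeros_like(V)
    for i, j, x in obs:
        err = U[i] @ V[j] - x
        gU[i] += err * V[j] + lam * U[i]
        gV[j] += err * U[i] + lam * V[j]
    return gU, gV

def server_step(U, V, grads, lr=0.01):
    """Average the uploaded gradients and update the global latent factors."""
    gU = sum(g for g, _ in grads) / len(grads)
    gV = sum(g for _, g in grads) / len(grads)
    return U - lr * gU, V - lr * gV
```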

[LG-28] Few-Shot Vision-Language Action-Incremental Policy Learning

链接: https://arxiv.org/abs/2504.15517
作者: Mingchen Song,Xiang Deng,Guoqiang Zhong,Qi Lv,Jia Wan,Yinchuan Li,Jianye Hao,Weili Guan
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Recently, Transformer-based robotic manipulation methods utilize multi-view spatial representations and language instructions to learn robot motion trajectories by leveraging numerous robot demonstrations. However, the collection of robot data is extremely challenging, and existing methods lack the capability for continuous learning on new tasks with only a few demonstrations. In this paper, we formulate these challenges as the Few-Shot Action-Incremental Learning (FSAIL) task, and accordingly design a Task-prOmpt graPh evolutIon poliCy (TOPIC) to address these issues. Specifically, to address the data scarcity issue in robotic imitation learning, TOPIC learns Task-Specific Prompts (TSP) through the deep interaction of multi-modal information within few-shot demonstrations, thereby effectively extracting the task-specific discriminative information. On the other hand, to enhance the capability for continual learning on new tasks and mitigate the issue of catastrophic forgetting, TOPIC adopts a Continuous Evolution Strategy (CES). CES leverages the intrinsic relationships between tasks to construct a task relation graph, which effectively facilitates the adaptation of new tasks by reusing skills learned from previous tasks. TOPIC pioneers few-shot continual learning in the robotic manipulation task, and extensive experimental results demonstrate that TOPIC outperforms state-of-the-art baselines by over 26% in success rate, significantly enhancing the continual learning capabilities of existing Transformer-based policies.

[LG-29] T2VShield: Model-Agnostic Jailbreak Defense for Text-to-Video Models

链接: https://arxiv.org/abs/2504.15512
作者: Siyuan Liang,Jiayang Liu,Jiecheng Zhai,Tianmeng Fang,Rongcheng Tu,Aishan Liu,Xiaochun Cao,Dacheng Tao
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注: 25 pages, 5 figures

点击查看摘要

Abstract:The rapid development of generative artificial intelligence has made text to video models essential for building future multimodal world simulators. However, these models remain vulnerable to jailbreak attacks, where specially crafted prompts bypass safety mechanisms and lead to the generation of harmful or unsafe content. Such vulnerabilities undermine the reliability and security of simulation based applications. In this paper, we propose T2VShield, a comprehensive and model agnostic defense framework designed to protect text to video models from jailbreak threats. Our method systematically analyzes the input, model, and output stages to identify the limitations of existing defenses, including semantic ambiguities in prompts, difficulties in detecting malicious content in dynamic video outputs, and inflexible model centric mitigation strategies. T2VShield introduces a prompt rewriting mechanism based on reasoning and multimodal retrieval to sanitize malicious inputs, along with a multi scope detection module that captures local and global inconsistencies across time and modalities. The framework does not require access to internal model parameters and works with both open and closed source systems. Extensive experiments on five platforms show that T2VShield can reduce jailbreak success rates by up to 35 percent compared to strong baselines. We further develop a human centered audiovisual evaluation protocol to assess perceptual safety, emphasizing the importance of visual level defense in enhancing the trustworthiness of next generation multimodal simulators.

[LG-30] Application of Deep Generative Models for Anomaly Detection in Complex Financial Transactions

链接: https://arxiv.org/abs/2504.15491
作者: Tengda Tang,Jianhua Yao,Yixian Wang,Qiuwu Sha,Hanrui Feng,Zhen Xu
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This study proposes an algorithm for detecting suspicious behaviors in large payment flows based on deep generative models. By combining Generative Adversarial Networks (GAN) and Variational Autoencoders (VAE), the algorithm is designed to detect abnormal behaviors in financial transactions. First, the GAN is used to generate simulated data that approximates normal payment flows. The discriminator identifies anomalous patterns in transactions, enabling the detection of potential fraud and money laundering behaviors. Second, a VAE is introduced to model the latent distribution of payment flows, ensuring that the generated data more closely resembles real transaction features, thus improving the model’s detection accuracy. The method optimizes the generative capabilities of both GAN and VAE, ensuring that the model can effectively capture suspicious behaviors even in sparse data conditions. Experimental results show that the proposed method significantly outperforms traditional machine learning algorithms and other deep learning models across various evaluation metrics, especially in detecting rare fraudulent behaviors. Furthermore, this study provides a detailed comparison of performance in recognizing different transaction patterns (such as normal, money laundering, and fraud) in large payment flows, validating the advantages of generative models in handling complex financial data.
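
One way the GAN discriminator and VAE reconstruction error can be combined into a single anomaly score is sketched below; the combination rule and the VAE's return signature are assumptions for illustration, not the paper's exact scoring.

```python
import torch

def anomaly_score(x, discriminator, vae, w=0.5):
    """Schematic anomaly score: a high VAE reconstruction error plus a low
    discriminator 'looks-normal' score flags a suspicious transaction.
    Assumes a hypothetical `vae` returning (reconstruction, mu, logvar)."""
    with torch.no_grad():
        d = discriminator(x).squeeze(-1)            # higher = more normal-looking
        recon, _, _ = vae(x)
        err = torch.mean((x - recon) ** 2, dim=-1)  # per-sample reconstruction error
    return w * err - (1.0 - w) * d
```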

[LG-31] Fourier analysis of the physics of transfer learning for data-driven subgrid-scale models of ocean turbulence

链接: https://arxiv.org/abs/2504.15487
作者: Moein Darman,Pedram Hassanzadeh,Laure Zanna,Ashesh Chattopadhyay
类目: Machine Learning (cs.LG); Chaotic Dynamics (nlin.CD); Atmospheric and Oceanic Physics (physics.ao-ph); Geophysics (physics.geo-ph)
*备注:

点击查看摘要

Abstract:Transfer learning (TL) is a powerful tool for enhancing the performance of neural networks (NNs) in applications such as weather and climate prediction and turbulence modeling. TL enables models to generalize to out-of-distribution data with minimal training data from the new system. In this study, we employ a 9-layer convolutional NN to predict the subgrid forcing in a two-layer ocean quasi-geostrophic system and examine which metrics best describe its performance and generalizability to unseen dynamical regimes. Fourier analysis of the NN kernels reveals that they learn low-pass, Gabor, and high-pass filters, regardless of whether the training data are isotropic or anisotropic. By analyzing the activation spectra, we identify why NNs fail to generalize without TL and how TL can overcome these limitations: the learned weights and biases from one dataset underestimate the out-of-distribution sample spectra as they pass through the network, leading to an underestimation of output spectra. By re-training only one layer with data from the target system, this underestimation is corrected, enabling the NN to produce predictions that match the target spectra. These findings are broadly applicable to data-driven parameterization of dynamical systems.
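
The kernel-spectrum analysis the abstract relies on is straightforward to reproduce for any trained CNN; a minimal NumPy sketch:

```python
import numpy as np

def kernel_spectra(conv_weights, size=64):
    """2-D Fourier magnitude of each convolutional kernel (zero-padded to
    size x size), used to inspect whether learned filters act as low-pass,
    Gabor-like band-pass, or high-pass filters.
    conv_weights: array of shape (out_channels, in_channels, k, k)."""
    spectra = np.fft.fft2(conv_weights, s=(size, size))     # pad and transform
    return np.abs(np.fft.fftshift(spectra, axes=(-2, -1)))  # centered magnitudes
```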

[LG-32] In-context Ranking Preference Optimization

链接: https://arxiv.org/abs/2504.15477
作者: Junda Wu,Rohan Surana,Zhouhang Xie,Yiran Shen,Yu Xia,Tong Yu,Ryan A. Rossi,Prithviraj Ammanabrolu,Julian McAuley
类目: Machine Learning (cs.LG)
*备注: 10 pages

点击查看摘要

Abstract:Recent developments in Direct Preference Optimization (DPO) allow large language models (LLMs) to function as implicit ranking models by maximizing the margin between preferred and non-preferred responses. In practice, user feedback on such lists typically involves identifying a few relevant items in context rather than providing detailed pairwise comparisons for every possible item pair. Moreover, many complex information retrieval tasks, such as conversational agents and summarization systems, critically depend on ranking the highest-quality outputs at the top, emphasizing the need to support natural and flexible forms of user feedback. To address the challenge of limited and sparse pairwise feedback in the in-context setting, we propose an In-context Ranking Preference Optimization (IRPO) framework that directly optimizes LLMs based on ranking lists constructed during inference. To further capture flexible forms of feedback, IRPO extends the DPO objective by incorporating both the relevance of items and their positions in the list. Modeling these aspects jointly is non-trivial, as ranking metrics are inherently discrete and non-differentiable, making direct optimization difficult. To overcome this, IRPO introduces a differentiable objective based on positional aggregation of pairwise item preferences, enabling effective gradient-based optimization of discrete ranking metrics. We further provide theoretical insights showing that IRPO (i) automatically emphasizes items with greater disagreement between the model and the reference ranking, and (ii) links its gradient to an importance sampling estimator, yielding an unbiased estimator with reduced variance. Empirical results show IRPO outperforms standard DPO approaches in ranking performance, highlighting its effectiveness in aligning LLMs with direct in-context ranking preferences.

[LG-33] LAPP: Large Language Model Feedback for Preference-Driven Reinforcement Learning

链接: https://arxiv.org/abs/2504.15472
作者: Pingcheng Jian,Xiao Wei,Yanbaihui Liu,Samuel A. Moore,Michael M. Zavlanos,Boyuan Chen
类目: Robotics (cs.RO); Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注:

点击查看摘要

Abstract:We introduce Large Language Model-Assisted Preference Prediction (LAPP), a novel framework for robot learning that enables efficient, customizable, and expressive behavior acquisition with minimum human effort. Unlike prior approaches that rely heavily on reward engineering, human demonstrations, motion capture, or expensive pairwise preference labels, LAPP leverages large language models (LLMs) to automatically generate preference labels from raw state-action trajectories collected during reinforcement learning (RL). These labels are used to train an online preference predictor, which in turn guides the policy optimization process toward satisfying high-level behavioral specifications provided by humans. Our key technical contribution is the integration of LLMs into the RL feedback loop through trajectory-level preference prediction, enabling robots to acquire complex skills including subtle control over gait patterns and rhythmic timing. We evaluate LAPP on a diverse set of quadruped locomotion and dexterous manipulation tasks and show that it achieves efficient learning, higher final performance, faster adaptation, and precise control of high-level behaviors. Notably, LAPP enables robots to master highly dynamic and expressive tasks such as quadruped backflips, which remain out of reach for standard LLM-generated or handcrafted rewards. Our results highlight LAPP as a promising direction for scalable preference-driven robot learning.

[LG-34] LithOS: An Operating System for Efficient Machine Learning on GPUs

链接: https://arxiv.org/abs/2504.15465
作者: Patrick H. Coppock,Brian Zhang,Eliot H. Solomon,Vasilis Kypriotis,Leon Yang,Bikash Sharma,Dan Schatzberg,Todd C. Mowry,Dimitrios Skarlatos
类目: Operating Systems (cs.OS); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The surging demand for GPUs in datacenters for machine learning (ML) has made efficient GPU utilization crucial. However, meeting the diverse needs of ML models while optimizing resource usage is challenging. To enable transparent, fine-grained GPU management that maximizes utilization and energy efficiency while maintaining strong isolation, an operating system (OS) approach is needed. This paper introduces LithOS, a first step toward a GPU OS. LithOS includes the following new abstractions and mechanisms for efficient GPU resource management: (i) a novel TPC Scheduler that supports spatial scheduling at the granularity of individual TPCs, unlocking efficient TPC stealing between workloads; (ii) transparent kernel atomization to reduce head-of-line blocking and enable dynamic resource reallocation mid-execution; (iii) a lightweight hardware right-sizing mechanism that determines the minimal TPC resources needed per atom; and (iv) a transparent power management mechanism that reduces power consumption based on in-flight work behavior. We implement LithOS in Rust and evaluate its performance across extensive ML environments, comparing it to state-of-the-art solutions from NVIDIA and prior research. For inference stacking, LithOS reduces tail latencies by 13x compared to MPS; compared to the best SotA, it reduces tail latencies by 3x while improving aggregate throughput by 1.6x. In hybrid inference-training stacking, LithOS reduces tail latencies by 4.7x compared to MPS; compared to the best SotA, it reduces tail latencies 1.18x while improving aggregate throughput by 1.35x. Finally, for a modest performance hit under 4%, LithOS’s right-sizing provides a quarter of GPU capacity savings on average, while for a 7% hit, its power management yields a quarter of a GPU’s energy savings. Overall, LithOS increases GPU efficiency, establishing a foundation for future OS research on GPUs.

[LG-35] Compton Form Factor Extraction using Quantum Deep Neural Networks

链接: https://arxiv.org/abs/2504.15458
作者: Brandon Le,Dustin Keller
类目: Machine Learning (cs.LG); Nuclear Theory (nucl-th); Quantum Physics (quant-ph)
*备注:

点击查看摘要

Abstract:Extraction tests of Compton Form Factors are performed using pseudodata based on experimental data from Deeply Virtual Compton Scattering experiments conducted at Jefferson Lab. The standard Belitsky, Kirchner, and Muller formalism at twist-two is employed, along with a fitting procedure designed to reduce model dependency similar to traditional local fits. The extraction of the Compton Form Factors is performed using both Classical Deep Neural Networks (CDNNs) and Quantum Deep Neural Networks (QDNNs). Comparative studies reveal that QDNNs outperform CDNNs for this application, demonstrating improved predictive accuracy and precision even for limited model complexity. The results demonstrate the potential of QDNNs for future studies in which quantum algorithms can be fully optimized.

[LG-36] Combating Toxic Language: A Review of LLM-Based Strategies for Software Engineering

链接: https://arxiv.org/abs/2504.15439
作者: Hao Zhuo,Yicheng Yang,Kewen Peng
类目: Machine Learning (cs.LG); Software Engineering (cs.SE)
*备注:

点击查看摘要

Abstract:Large Language Models (LLMs) have become integral to software engineering (SE), where they are increasingly used in development workflows. However, their widespread use raises concerns about the presence and propagation of toxic language–harmful or offensive content that can foster exclusionary environments. This paper provides a comprehensive review of recent research on toxicity detection and mitigation, focusing on both SE-specific and general-purpose datasets. We examine annotation and preprocessing techniques, assess detection methodologies, and evaluate mitigation strategies, particularly those leveraging LLMs. Additionally, we conduct an ablation study demonstrating the effectiveness of LLM-based rewriting for reducing toxicity. By synthesizing existing work and identifying open challenges, this review highlights key areas for future research to ensure the responsible deployment of LLMs in SE and beyond.

[LG-37] Post-Convergence Sim-to-Real Policy Transfer: A Principled Alternative to Cherry-Picking

链接: https://arxiv.org/abs/2504.15414
作者: Dylan Khor,Bowen Weng
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Learning-based approaches, particularly reinforcement learning (RL), have become widely used for developing control policies for autonomous agents, such as locomotion policies for legged robots. RL training typically maximizes a predefined reward (or minimizes a corresponding cost/loss) by iteratively optimizing policies within a simulator. Starting from a randomly initialized policy, the empirical expected reward follows a trajectory with an overall increasing trend. While some policies become temporarily stuck in local optima, a well-defined training process generally converges to a reward level with noisy oscillations. However, selecting a policy for real-world deployment is rarely an analytical decision (i.e., simply choosing the one with the highest reward) and is instead often performed through trial and error. To improve sim-to-real transfer, most research focuses on the pre-convergence stage, employing techniques such as domain randomization, multi-fidelity training, adversarial training, and architectural innovations. However, these methods do not eliminate the inevitable convergence trajectory and noisy oscillations of rewards, leading to heuristic policy selection or cherry-picking. This paper addresses the post-convergence sim-to-real transfer problem by introducing a worst-case performance transference optimization approach, formulated as a convex quadratic-constrained linear programming problem. Extensive experiments demonstrate its effectiveness in transferring RL-based locomotion policies from simulation to real-world laboratory tests.

[LG-38] Improving Learning to Optimize Using Parameter Symmetries ICLR

链接: https://arxiv.org/abs/2504.15399
作者: Guy Zamir,Aryan Dokania,Bo Zhao,Rose Yu
类目: Machine Learning (cs.LG)
*备注: Accepted at the ICLR Workshop on Neural Network Weights as a New Data Modality 2025

点击查看摘要

Abstract:We analyze a learning-to-optimize (L2O) algorithm that exploits parameter space symmetry to enhance optimization efficiency. Prior work has shown that jointly learning symmetry transformations and local updates improves meta-optimizer performance. Supporting this, our theoretical analysis demonstrates that even without identifying the optimal group element, the method locally resembles Newton’s method. We further provide an example where the algorithm provably learns the correct symmetry transformation during training. To empirically evaluate L2O with teleportation, we introduce a benchmark, analyze its success and failure cases, and show that enhancements like momentum further improve performance. Our results highlight the potential of leveraging neural network parameter space symmetry to advance meta-optimization.

[LG-39] FLARE: Feature-based Lightweight Aggregation for Robust Evaluation of IoT Intrusion Detection

链接: https://arxiv.org/abs/2504.15375
作者: Bradley Boswell,Seth Barrett,Swarnamugi Rajaganapathy,Gokila Dorai
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注: 23 pages, 19 tables, 2 algorithms, 2 figures, submitted to SecureComm25

点击查看摘要

Abstract:The proliferation of Internet of Things (IoT) devices has expanded the attack surface, necessitating efficient intrusion detection systems (IDSs) for network protection. This paper presents FLARE, a feature-based lightweight aggregation for robust evaluation of IoT intrusion detection, addressing the challenges of securing IoT environments through feature aggregation techniques. FLARE utilizes a multilayered processing approach, incorporating session, flow, and time-based sliding-window data aggregation to analyze network behavior and capture vital features from IoT network traffic data. We perform extensive evaluations on IoT data generated from our laboratory experimental setup to assess the effectiveness of the proposed aggregation technique. To classify attacks in IoT IDS, we employ four supervised learning models and two deep learning models. We validate the performance of these models in terms of accuracy, precision, recall, and F1-score. Our results reveal that incorporating the FLARE aggregation technique as a foundational step in feature engineering helps establish a structured representation and enhances the performance of complex end-to-end models, making it a crucial step in the IoT IDS pipeline. Our findings highlight the potential of FLARE as a valuable technique to improve performance and reduce the computational costs of end-to-end IDS implementations, thereby fostering more robust IoT intrusion detection systems.
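
As a concrete picture of one of FLARE's aggregation layers, the pandas sketch below computes time-based sliding-window statistics per source; the column names and the specific statistics are hypothetical simplifications of the paper's feature set.

```python
import pandas as pd

def time_window_features(df, window="10s"):
    """Time-window aggregation of raw IoT packet records per source address.
    Expects columns: timestamp (datetime64), src_ip, bytes."""
    df = df.sort_values("timestamp").set_index("timestamp")
    rolled = df.groupby("src_ip")["bytes"].rolling(window).agg(["count", "sum", "mean"])
    rolled.columns = ["n_pkts", "bytes_sum", "bytes_mean"]
    return rolled.reset_index()
```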

[LG-40] FedFetch: Faster Federated Learning with Adaptive Downstream Prefetching

链接: https://arxiv.org/abs/2504.15366
作者: Qifan Yan,Andrew Liu,Shiqi He,Mathias Lécuyer,Ivan Beschastnikh
类目: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注: Accepted at INFOCOM 2025

点击查看摘要

Abstract:Federated learning (FL) is a machine learning paradigm that facilitates massively distributed model training with end-user data on edge devices directed by a central server. However, the large number of heterogeneous clients in FL deployments leads to a communication bottleneck between the server and the clients. This bottleneck is made worse by straggling clients, any one of which will further slow down training. To tackle these challenges, researchers have proposed techniques like client sampling and update compression. These techniques work well in isolation but combine poorly in the downstream, server-to-client direction. This is because unselected clients have outdated local model states and need to synchronize these states with the server first. We introduce FedFetch, a strategy to mitigate the download time overhead caused by combining client sampling and compression techniques. FedFetch achieves this with an efficient prefetch schedule for clients to prefetch model states multiple rounds before a stated training round. We empirically show that adding FedFetch to communication-efficient FL techniques reduces end-to-end training time by 1.26x and download time by 4.49x across compression techniques with heterogeneous client settings. Our implementation is available at this https URL

[LG-41] Advancing Embodied Intelligence in Robotic-Assisted Endovascular Procedures: A Systematic Review of AI Solutions

链接: https://arxiv.org/abs/2504.15327
作者: Tianliang Yao,Bo Lu,Markus Kowarschik,Yixuan Yuan,Hubin Zhao,Sebastien Ourselin,Kaspar Althoefer,Junbo Ge,Peng Qi
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注: 24 pages, 7 figures, submitted to IEEE

点击查看摘要

Abstract:Endovascular procedures have revolutionized the treatment of vascular diseases thanks to minimally invasive solutions that significantly reduce patient recovery time and enhance clinical outcomes. However, the precision and dexterity required during these procedures pose considerable challenges for interventionists. Robotic systems have emerged, offering transformative solutions and addressing issues such as operator fatigue, radiation exposure, and the inherent limitations of human precision. The integration of Embodied Intelligence (EI) into these systems signifies a paradigm shift, enabling robots to navigate complex vascular networks and adapt to dynamic physiological conditions. Data-driven approaches, advanced computer vision, medical image analysis, and machine learning techniques are at the forefront of this evolution. These methods augment procedural intelligence by facilitating real-time vessel segmentation, device tracking, and anatomical landmark detection. Reinforcement learning and imitation learning further refine navigation strategies and replicate experts' techniques. This review systematically examines the integration of EI principles into robotic technologies, in relation to endovascular procedures. We discuss recent advancements in intelligent perception and data-driven control, and their practical applications in robot-assisted endovascular procedures. By critically evaluating current limitations and emerging opportunities, this review establishes a framework for future developments, emphasizing the potential for greater autonomy and improved clinical outcomes. Emerging trends and specific areas of research, such as federated learning for medical data sharing, explainable AI for clinical decision support, and advanced human-robot collaboration paradigms, are also explored, offering insights into the future direction of this rapidly evolving field.

[LG-42] M-TabNet: A Multi-Encoder Transformer Model for Predicting Neonatal Birth Weight from Multimodal Data

链接: https://arxiv.org/abs/2504.15312
作者: Muhammad Mursil,Hatem A. Rashwan,Luis Santos-Calderon,Pere Cavalle-Busquets,Michelle M. Murphy,Domenec Puig
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Birth weight (BW) is a key indicator of neonatal health, with low birth weight (LBW) linked to increased mortality and morbidity. Early prediction of BW enables timely interventions; however, current methods like ultrasonography have limitations, including reduced accuracy before 20 weeks and operator dependent variability. Existing models often neglect nutritional and genetic influences, focusing mainly on physiological and lifestyle factors. This study presents an attention-based transformer model with a multi-encoder architecture for early (less than 12 weeks of gestation) BW prediction. Our model effectively integrates diverse maternal data such as physiological, lifestyle, nutritional, and genetic, addressing limitations seen in prior attention-based models such as TabNet. The model achieves a Mean Absolute Error (MAE) of 122 grams and an R-squared value of 0.94, demonstrating high predictive accuracy and interoperability with our in-house private dataset. Independent validation confirms generalizability (MAE: 105 grams, R-squared: 0.95) with the IEEE children dataset. To enhance clinical utility, predicted BW is classified into low and normal categories, achieving a sensitivity of 97.55% and a specificity of 94.48%, facilitating early risk stratification. Model interpretability is reinforced through feature importance and SHAP analyses, highlighting significant influences of maternal age, tobacco exposure, and vitamin B12 status, with genetic factors playing a secondary role. Our results emphasize the potential of advanced deep-learning models to improve early BW prediction, offering clinicians a robust, interpretable, and personalized tool for identifying pregnancies at risk and optimizing neonatal outcomes.

[LG-43] Collaborative Learning of On-Device Small Model and Cloud-Based Large Model: Advances and Future Directions

链接: https://arxiv.org/abs/2504.15300
作者: Chaoyue Niu,Yucheng Ding,Junhui Lu,Zhengxiang Huang,Hang Zeng,Yutong Dai,Xuezhen Tu,Chengfei Lv,Fan Wu,Guihai Chen
类目: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC); Multiagent Systems (cs.MA)
*备注:

点击查看摘要

Abstract:The conventional cloud-based large model learning framework is increasingly constrained by latency, cost, personalization, and privacy concerns. In this survey, we explore an emerging paradigm: collaborative learning between on-device small model and cloud-based large model, which promises low-latency, cost-efficient, and personalized intelligent services while preserving user privacy. We provide a comprehensive review across hardware, system, algorithm, and application layers. At each layer, we summarize key problems and recent advances from both academia and industry. In particular, we categorize collaboration algorithms into data-based, feature-based, and parameter-based frameworks. We also review publicly available datasets and evaluation metrics with user-level or device-level consideration tailored to collaborative learning settings. We further highlight real-world deployments, ranging from recommender systems and mobile livestreaming to personal intelligent assistants. We finally point out open research directions to guide future development in this rapidly evolving field.

[LG-44] EditLord: Learning Code Transformation Rules for Code Editing

链接: https://arxiv.org/abs/2504.15284
作者: Weichen Li,Albert Jan,Baishakhi Ray,Chengzhi Mao,Junfeng Yang,Kexin Pei
类目: oftware Engineering (cs.SE); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Code editing is a foundational task in software development, where its effectiveness depends on whether it introduces desired code property changes without changing the original code's intended functionality. Existing approaches often formulate code editing as an implicit end-to-end task, omitting the fact that code-editing procedures inherently consist of discrete and explicit steps. Thus, they suffer from suboptimal performance and lack of robustness and generalization. We introduce EditLord, a code editing framework that makes the code transformation steps explicit. Our key insight is to employ a language model (LM) as an inductive learner to extract code editing rules from the training code pairs as concise meta-rule sets. Such rule sets will be manifested for each training sample to augment them for finetuning or assist in prompting- and iterative-based code editing. EditLord outperforms the state-of-the-art by an average of 22.7% in editing performance and 58.1% in robustness while achieving 20.2% higher functional correctness across critical software engineering and security applications, LM models, and editing modes.

[LG-45] Explainable Unsupervised Anomaly Detection with Random Forest

链接: https://arxiv.org/abs/2504.16075
作者: Joshua S. Harvey,Joshua Rosaler,Mingshu Li,Dhruv Desai,Dhagash Mehta
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: 14 pages, 5 figures

点击查看摘要

Abstract:We describe the use of an unsupervised Random Forest for similarity learning and improved unsupervised anomaly detection. By training a Random Forest to discriminate between real data and synthetic data sampled from a uniform distribution over the real data bounds, a distance measure is obtained that anisometrically transforms the data, expanding distances at the boundary of the data manifold. We show that using distances recovered from this transformation improves the accuracy of unsupervised anomaly detection, compared to other commonly used detectors, demonstrated over a large number of benchmark datasets. As well as improved performance, this method has advantages over other unsupervised anomaly detection methods, including minimal requirements for data preprocessing, native handling of missing data, and potential for visualizations. By relating outlier scores to partitions of the Random Forest, we develop a method for locally explainable anomaly predictions in terms of feature importance.
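
The real-versus-uniform-synthetic construction the abstract describes is simple to sketch with scikit-learn; the proximity measure below (fraction of trees in which two points share a leaf) is one common way to read a similarity out of the trained forest.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def fit_unsupervised_forest(X, n_estimators=100, random_state=0):
    """Train a forest to separate real rows of X from synthetic rows drawn
    uniformly over each feature's observed range, as in the abstract."""
    rng = np.random.default_rng(random_state)
    X = np.asarray(X, dtype=float)
    synth = rng.uniform(X.min(axis=0), X.max(axis=0), size=X.shape)
    data = np.vstack([X, synth])
    labels = np.r_[np.ones(len(X)), np.zeros(len(synth))]
    return RandomForestClassifier(n_estimators=n_estimators,
                                  random_state=random_state).fit(data, labels)

def forest_proximity(rf, A, B):
    """Fraction of trees in which two points fall into the same leaf."""
    la, lb = rf.apply(A), rf.apply(B)        # (n, n_trees) leaf indices
    return (la[:, None, :] == lb[None, :, :]).mean(axis=-1)
```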

[LG-46] High-performance training and inference for deep equivariant interatomic potentials

链接: https://arxiv.org/abs/2504.16068
作者: Chuin Wei Tan,Marc L. Descoteaux,Mit Kotak,Gabriel de Miranda Nascimento,Seán R. Kavanagh,Laura Zichi,Menghang Wang,Aadit Saluja,Yizhong R. Hu,Tess Smidt,Anders Johansson,William C. Witt,Boris Kozinsky,Albert Musaelian
类目: Computational Physics (physics.comp-ph); Machine Learning (cs.LG); Chemical Physics (physics.chem-ph)
*备注:

点击查看摘要

Abstract:Machine learning interatomic potentials, particularly those based on deep equivariant neural networks, have demonstrated state-of-the-art accuracy and computational efficiency in atomistic modeling tasks like molecular dynamics and high-throughput screening. The size of datasets and demands of downstream workflows are growing rapidly, making robust and scalable software essential. This work presents a major overhaul of the NequIP framework focusing on multi-node parallelism, computational performance, and extensibility. The redesigned framework supports distributed training on large datasets and removes barriers preventing full utilization of the PyTorch 2.0 compiler at train time. We demonstrate this acceleration in a case study by training Allegro models on the SPICE 2 dataset of organic molecular systems. For inference, we introduce the first end-to-end infrastructure that uses the PyTorch Ahead-of-Time Inductor compiler for machine learning interatomic potentials. Additionally, we implement a custom kernel for the Allegro model’s most expensive operation, the tensor product. Together, these advancements speed up molecular dynamics calculations on system sizes of practical relevance by up to a factor of 18.

[LG-47] Benchmarking machine learning models for predicting aerofoil performance

链接: https://arxiv.org/abs/2504.15993
作者: Oliver Summerell,Gerardo Aragon-Camarasa,Stephanie Ordonez Sanchez
类目: Fluid Dynamics (physics.flu-dyn); Machine Learning (cs.LG)
*备注: 9 pages, 10 figures, submitted to EWTEC

点击查看摘要

Abstract:This paper investigates the capability of Neural Networks (NNs) as alternatives to the traditional methods used to analyse the performance of aerofoils in the wind and tidal energy industry. The current methods used to assess the characteristic lift and drag coefficients include Computational Fluid Dynamics (CFD), thin aerofoil and panel methods, all of which face trade-offs between computational speed and accuracy; NNs have therefore been investigated as an alternative with the aim of performing both quickly and accurately. As such, this paper provides a benchmark for the windAI_bench dataset published by the National Renewable Energy Laboratory (NREL) in the USA. In order to validate the benchmarking methodology, the AirfRANS (arXiv:2212.07564v3) dataset is used as both a starting point and a point of comparison. This study evaluates four neural networks (MLP, PointNet, GraphSAGE, GUNet) trained on a range of aerofoils at 25 angles of attack (4^\circ to 20^\circ) to predict fluid flow and calculate lift coefficients (C_L) via the panel method. GraphSAGE and GUNet performed well during the testing phase, but underperformed during validation. Accordingly, this paper identifies PointNet and MLP as the two strongest models tested; however, whilst the results from MLP are more commonly correct in predicting the behaviour of the fluid, PointNet provides the more accurate results for calculating C_L.

[LG-48] Full waveform inversion with CNN-based velocity representation extension

链接: https://arxiv.org/abs/2504.15826
作者: Xinru Mu,Omar M. Saad,Tariq Alkhalifah
类目: Geophysics (physics.geo-ph); Machine Learning (cs.LG)
*备注: 16 pages, 15 figures, Scientific paper

点击查看摘要

Abstract:Full waveform inversion (FWI) updates the velocity model by minimizing the discrepancy between observed and simulated data. However, discretization errors in numerical modeling and incomplete seismic data acquisition can introduce noise, which propagates through the adjoint operator and affects the accuracy of the velocity gradient, thereby impacting the FWI inversion accuracy. To mitigate the influence of noise on the gradient, we employ a convolutional neural network (CNN) to refine the velocity model before performing the forward simulation, aiming to reduce noise and provide a more accurate velocity update direction. We use the same data misfit loss to update both the velocity and network parameters, thereby forming a self-supervised learning procedure. We propose two implementation schemes, which differ in whether the velocity update passes through the CNN. In both methodologies, the velocity representation is extended (VRE) by using a neural network in addition to the grid-based velocities. Thus, we refer to this general approach as VRE-FWI. Synthetic and real data tests demonstrate that the proposed VRE-FWI achieves higher velocity inversion accuracy compared to traditional FWI, at a marginal additional computational cost of approximately 1%.

[LG-49] Markov Kernels Distances and Optimal Control: A Parable of Linear Quadratic Non-Gaussian Distribution Steering

链接: https://arxiv.org/abs/2504.15753
作者: Alexis M.H. Teter,Wenqing Wang,Sachin Shivakumar,Abhishek Halder
类目: Optimization and Control (math.OC); Machine Learning (cs.LG); Systems and Control (eess.SY); Probability (math.PR); Statistics Theory (math.ST)
*备注:

点击查看摘要

Abstract:For a controllable linear time-varying (LTV) pair (\boldsymbol{A}_t, \boldsymbol{B}_t) and \boldsymbol{Q}_t positive semidefinite, we derive the Markov kernel for the Itô diffusion \mathrm{d}\boldsymbol{x}_t = \boldsymbol{A}_t\boldsymbol{x}_t\,\mathrm{d}t + \sqrt{2}\boldsymbol{B}_t\,\mathrm{d}\boldsymbol{w}_t with an accompanying killing of probability mass at rate \frac{1}{2}\boldsymbol{x}^\top\boldsymbol{Q}_t\boldsymbol{x}. This Markov kernel is the Green's function for an associated linear reaction-advection-diffusion partial differential equation. Our result generalizes the recently derived kernel for the special case (\boldsymbol{A}_t, \boldsymbol{B}_t) = (\boldsymbol{0}, \boldsymbol{I}), and depends on the solution of an associated Riccati matrix ODE. A consequence of this result is that the linear quadratic non-Gaussian Schrödinger bridge is exactly solvable. This means that the problem of steering a controlled LTV diffusion from a given non-Gaussian distribution to another over a fixed deadline while minimizing an expected quadratic cost can be solved using dynamic Sinkhorn recursions performed with the derived kernel. Our derivation for the (\boldsymbol{A}_t, \boldsymbol{B}_t, \boldsymbol{Q}_t)-parametrized kernel pursues a new idea that relies on finding a state-time dependent distance-like functional given by the solution of a deterministic optimal control problem. This technique breaks away from existing methods, such as generalizing Hermite polynomials or Weyl calculus, which have seen limited success in the reaction-diffusion context. Our technique uncovers a new connection between Markov kernels, distances, and optimal control. This connection is of interest beyond its immediate application in solving the linear quadratic Schrödinger bridge problem.

[LG-50] From predictions to confidence intervals: an empirical study of conformal prediction methods for in-context learning

链接: https://arxiv.org/abs/2504.15722
作者: Zhe Huang,Simone Rossi,Rui Yuan,Thomas Hannagan
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Transformers have become a standard architecture in machine learning, demonstrating strong in-context learning (ICL) abilities that allow them to learn from the prompt at inference time. However, uncertainty quantification for ICL remains an open challenge, particularly in noisy regression tasks. This paper investigates whether ICL can be leveraged for distribution-free uncertainty estimation, proposing a method based on conformal prediction to construct prediction intervals with guaranteed coverage. While traditional conformal methods are computationally expensive due to repeated model fitting, we exploit ICL to efficiently generate confidence intervals in a single forward pass. Our empirical analysis compares this approach against ridge regression-based conformal methods, showing that conformal prediction with in-context learning (CP with ICL) achieves robust and scalable uncertainty estimates. Additionally, we evaluate its performance under distribution shifts and establish scaling laws to guide model training. These findings bridge ICL and conformal prediction, providing a theoretically grounded and new framework for uncertainty quantification in transformer-based models.
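
The conformal construction itself is the standard split-conformal recipe; what the paper changes is how the underlying predictions are produced (a single in-context forward pass instead of repeated model fitting). A sketch of the interval step:

```python
import numpy as np

def split_conformal_interval(cal_preds, cal_targets, test_pred, alpha=0.1):
    """Split-conformal prediction interval from absolute-residual scores
    on a held-out calibration set (the usual recipe, not the paper's code)."""
    scores = np.abs(np.asarray(cal_targets, float) - np.asarray(cal_preds, float))
    n = len(scores)
    level = min(np.ceil((n + 1) * (1 - alpha)) / n, 1.0)  # finite-sample correction
    q = np.quantile(scores, level, method="higher")
    return test_pred - q, test_pred + q
```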

[LG-51] Transfer Learning for High-dimensional Reduced Rank Time Series Models AISTATS2025

链接: https://arxiv.org/abs/2504.15691
作者: Mingliang Ma, Abolfazl Safikhani
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: 29 pages accepted by AISTATS2025

点击查看摘要

Abstract:The objective of transfer learning is to enhance estimation and inference in a target data by leveraging knowledge gained from additional sources. Recent studies have explored transfer learning for independent observations in complex, high-dimensional models assuming sparsity, yet research on time series models remains limited. Our focus is on transfer learning for sequences of observations with temporal dependencies and a more intricate model parameter structure. Specifically, we investigate the vector autoregressive model (VAR), a widely recognized model for time series data, where the transition matrix can be deconstructed into a combination of a sparse matrix and a low-rank one. We propose a new transfer learning algorithm tailored for estimating high-dimensional VAR models characterized by low-rank and sparse structures. Additionally, we present a novel approach for selecting informative observations from auxiliary datasets. Theoretical guarantees are established, encompassing model parameter consistency, informative set selection, and the asymptotic distribution of estimators under mild conditions. The latter facilitates the construction of entry-wise confidence intervals for model parameters. Finally, we demonstrate the empirical efficacy of our methodologies through both simulated and real-world datasets.
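The sparse-plus-low-rank decomposition of the VAR transition matrix can be estimated by alternating proximal gradient steps: soft-thresholding drives sparsity and singular-value thresholding drives low rank. The sketch below illustrates this generic decomposition on a single series; it is not the paper's transfer-learning algorithm, and the penalty weights are illustrative assumptions.

```python
# Sketch: estimate a VAR(1) transition matrix A = S (sparse) + L (low-rank).
import numpy as np

rng = np.random.default_rng(1)
p, T = 20, 500
A_true = np.zeros((p, p))
A_true[np.arange(p), np.arange(p)] = 0.4           # sparse diagonal part
u, v = rng.normal(size=(p, 1)), rng.normal(size=(p, 1))
A_true += 0.3 * (u @ v.T) / (np.linalg.norm(u) * np.linalg.norm(v))  # rank-1

X = np.zeros((T, p))
for t in range(1, T):
    X[t] = X[t - 1] @ A_true.T + 0.1 * rng.normal(size=p)
Y, Z = X[1:], X[:-1]

def soft(M, tau):
    return np.sign(M) * np.maximum(np.abs(M) - tau, 0.0)

S, L = np.zeros((p, p)), np.zeros((p, p))
step = 1.0 / np.linalg.norm(Z.T @ Z / T, 2)
for _ in range(200):
    G = (Y - Z @ (S + L).T).T @ Z / T              # negative gradient
    S = soft(S + step * G, step * 0.02)            # l1 prox -> sparsity
    U, sv, Vt = np.linalg.svd(L + step * G, full_matrices=False)
    L = U @ np.diag(np.maximum(sv - step * 0.05, 0)) @ Vt  # nuclear prox

print("relative error:", np.linalg.norm(S + L - A_true) / np.linalg.norm(A_true))
```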

[LG-52] Policy-Based Radiative Transfer: Solving the 2-Level Atom Non-LTE Problem using Soft Actor-Critic Reinforcement Learning

链接: https://arxiv.org/abs/2504.15679
作者: Brandon Panos,Ivan Milic
类目: Solar and Stellar Astrophysics (astro-ph.SR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We present a novel reinforcement learning (RL) approach for solving the classical 2-level atom non-LTE radiative transfer problem by framing it as a control task in which an RL agent learns a depth-dependent source function $S(\tau)$ that self-consistently satisfies the equation of statistical equilibrium (SE). The agent's policy is optimized entirely via reward-based interactions with a radiative transfer engine, without explicit knowledge of the ground truth. This method bypasses the need for constructing approximate lambda operators ($\Lambda^*$) common in accelerated iterative schemes. Additionally, it requires no extensive precomputed labeled datasets to extract a supervisory signal, and avoids backpropagating gradients through the complex RT solver itself. Finally, we show through experiment that a simple feedforward neural network trained greedily cannot solve for SE, possibly due to the moving target nature of the problem. Our $\Lambda^*$-Free method offers potential advantages for complex scenarios (e.g., atmospheres with enhanced velocity fields, multi-dimensional geometries, or complex microphysics) where $\Lambda^*$ construction or solver differentiability is challenging. Additionally, the agent can be incentivized to find more efficient policies by manipulating the discount factor, leading to a reprioritization of immediate rewards. If demonstrated to generalize past its training data, this RL framework could serve as an alternative or accelerated formalism to achieve SE. To the best of our knowledge, this study represents the first application of reinforcement learning in solar physics that directly solves for a fundamental physical constraint.
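The reward structure is the interesting part: the agent is paid for reducing the statistical-equilibrium residual $S - ((1-\epsilon)\Lambda[S] + \epsilon B)$. A minimal sketch follows; the paper uses a Soft Actor-Critic agent against a real radiative-transfer engine, whereas here $\Lambda$ is a toy smoothing operator and the "policy" update is plain random search, purely to illustrate the reward-driven loop (all assumptions).

```python
# Sketch: reward-driven search for a source function S(tau) satisfying
# S = (1 - eps) * Lambda[S] + eps * B (statistical equilibrium), with a toy
# Lambda; for constant B the fixed point is trivially S = B in this toy.
import numpy as np

rng = np.random.default_rng(2)
n = 64
tau = np.logspace(-4, 2, n)          # optical depth grid
eps, B = 1e-2, np.ones(n)            # destruction probability, Planck source

# Toy Lambda operator: a row-stochastic kernel coupling nearby depths.
D = np.abs(np.log10(tau)[:, None] - np.log10(tau)[None, :])
Lam = np.exp(-D)
Lam /= Lam.sum(axis=1, keepdims=True)

def reward(S):
    residual = S - ((1 - eps) * (Lam @ S) + eps * B)  # SE violation
    return -np.linalg.norm(residual)

S = np.full(n, 0.2)                  # deliberately bad initial guess
best = reward(S)
for _ in range(5000):                # gradient-free "policy" improvement
    cand = S + 0.01 * rng.normal(size=n)
    r = reward(cand)
    if r > best:
        S, best = cand, r
print("final negative SE residual:", best)
```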

[LG-53] Deep learning with missing data

链接: https://arxiv.org/abs/2504.15388
作者: Tianyi Ma,Tengyao Wang,Richard J. Samworth
类目: Methodology (stat.ME); Machine Learning (cs.LG); Statistics Theory (math.ST); Machine Learning (stat.ML)
*备注: 49 pages, 9 figures

点击查看摘要

Abstract:In the context of multivariate nonparametric regression with missing covariates, we propose Pattern Embedded Neural Networks (PENNs), which can be applied in conjunction with any existing imputation technique. In addition to a neural network trained on the imputed data, PENNs pass the vectors of observation indicators through a second neural network to provide a compact representation. The outputs are then combined in a third neural network to produce final predictions. Our main theoretical result exploits an assumption that the observation patterns can be partitioned into cells on which the Bayes regression function behaves similarly, and belongs to a compositional Hölder class. It provides a finite-sample excess risk bound that holds for an arbitrary missingness mechanism, and in combination with a complementary minimax lower bound, demonstrates that our PENN estimator attains in typical cases the minimax rate of convergence as if the cells of the partition were known in advance, up to a poly-logarithmic factor in the sample size. Numerical experiments on simulated, semi-synthetic and real data confirm that the PENN estimator consistently improves, often dramatically, on standard neural networks without pattern embedding. Code to reproduce our experiments, as well as a tutorial on how to apply our method, is publicly available.
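The PENN architecture described in the abstract maps directly onto three small networks: one over the imputed covariates, one over the binary missingness pattern, and a head over their concatenation. Below is a minimal PyTorch sketch; the layer sizes and the zero-mean imputation are illustrative assumptions.

```python
# Sketch: Pattern Embedded Neural Network (PENN) with three sub-networks.
import torch
import torch.nn as nn

class PENN(nn.Module):
    def __init__(self, d, hidden=64, embed=8):
        super().__init__()
        self.value_net = nn.Sequential(nn.Linear(d, hidden), nn.ReLU())
        self.pattern_net = nn.Sequential(nn.Linear(d, embed), nn.ReLU())
        self.head = nn.Sequential(nn.Linear(hidden + embed, hidden),
                                  nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, x_imputed, mask):
        h = self.value_net(x_imputed)      # representation of imputed data
        e = self.pattern_net(mask)         # compact pattern embedding
        return self.head(torch.cat([h, e], dim=-1)).squeeze(-1)

# Toy usage: covariates missing at random, zero-imputed (mean is zero here).
torch.manual_seed(0)
n, d = 512, 10
x = torch.randn(n, d)
mask = (torch.rand(n, d) > 0.3).float()    # 1 = observed
x_imp = torch.where(mask.bool(), x, torch.zeros_like(x))
y = x[:, 0] * mask[:, 1] + x[:, 2] + 0.1 * torch.randn(n)  # pattern matters

model = PENN(d)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for _ in range(200):
    opt.zero_grad()
    loss = nn.functional.mse_loss(model(x_imp, mask), y)
    loss.backward()
    opt.step()
print("train MSE:", loss.item())
```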

[LG-54] Assessing Surrogate Heterogeneity in Real World Data Using Meta-Learners

链接: https://arxiv.org/abs/2504.15386
作者: Rebecca Knowlton,Layla Parast
类目: Methodology (stat.ME); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Surrogate markers are most commonly studied within the context of randomized clinical trials. However, the need for alternative outcomes extends beyond these settings and may be more pronounced in real-world public health and social science research, where randomized trials are often impractical. Research on identifying surrogates in real-world non-randomized data is scarce, as available statistical approaches for evaluating surrogate markers tend to rely on the assumption that treatment is randomized. While the few methods that allow for non-randomized treatment/exposure appropriately handle confounding individual characteristics, they do not offer a way to examine surrogate heterogeneity with respect to patient characteristics. In this paper, we propose a framework to assess surrogate heterogeneity in real-world, i.e., non-randomized, data and implement this framework using various meta-learners. Our approach allows us to quantify heterogeneity in surrogate strength with respect to patient characteristics while accommodating confounders through the use of flexible, off-the-shelf machine learning methods. In addition, we use our framework to identify individuals for whom the surrogate is a valid replacement of the primary outcome. We examine the performance of our methods via a simulation study and application to examine heterogeneity in the surrogacy of hemoglobin A1c as a surrogate for fasting plasma glucose.
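One way to see what "surrogate heterogeneity via meta-learners" can look like is the sketch below: a T-learner estimates covariate-conditional treatment effects on the primary outcome and on a residualized outcome, and their ratio gives a crude, individual-level proportion-of-treatment-effect-explained proxy. The estimands, confounding adjustments, and meta-learner choices in the paper are more careful; this toy assumes randomized treatment and is illustrative only.

```python
# Sketch: T-learner-based proxy for heterogeneous surrogate strength.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(3)
n = 2000
X = rng.normal(size=(n, 3))
A = rng.binomial(1, 0.5, size=n)                    # randomized treatment
S = A * (1 + X[:, 0]) + rng.normal(size=n)          # surrogate marker
Y = 2 * S + A * (X[:, 0] < 0) + rng.normal(size=n)  # primary outcome

def t_learner(target):
    m1 = GradientBoostingRegressor().fit(X[A == 1], target[A == 1])
    m0 = GradientBoostingRegressor().fit(X[A == 0], target[A == 0])
    return m1.predict(X) - m0.predict(X)            # conditional effect

# Residual outcome: the part of Y not predicted by the surrogate.
g = GradientBoostingRegressor().fit(S.reshape(-1, 1), Y)
resid = Y - g.predict(S.reshape(-1, 1))

tau_Y, tau_R = t_learner(Y), t_learner(resid)
# Crude per-individual proportion of effect explained by the surrogate.
pte = 1 - tau_R / np.clip(tau_Y, 0.5, None)
print("PTE where X0 < 0: ", pte[X[:, 0] < 0].mean().round(2))
print("PTE where X0 >= 0:", pte[X[:, 0] >= 0].mean().round(2))
```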

[LG-55] Transferable Learning of Reaction Pathways from Geometric Priors

链接: https://arxiv.org/abs/2504.15370
作者: Juno Nam,Miguel Steiner,Max Misterka,Soojung Yang,Avni Singhal,Rafael Gómez-Bombarelli
类目: Chemical Physics (physics.chem-ph); Machine Learning (cs.LG)
*备注: 14 pages, 6 figures; Supporting Information in ancillary files

点击查看摘要

Abstract:Identifying minimum-energy paths (MEPs) is crucial for understanding chemical reaction mechanisms but remains computationally demanding. We introduce MEPIN, a scalable machine-learning method for efficiently predicting MEPs from reactant and product configurations, without relying on transition-state geometries or pre-optimized reaction paths during training. The task is defined as predicting deviations from geometric interpolations along reaction coordinates. We address this task with a continuous reaction path model based on a symmetry-broken equivariant neural network that generates a flexible number of intermediate structures. The model is trained using an energy-based objective, with efficiency enhanced by incorporating geometric priors from geodesic interpolation as initial interpolations or pre-training objectives. Our approach generalizes across diverse chemical reactions and achieves accurate alignment with reference intrinsic reaction coordinates, as demonstrated on various small molecule reactions and [3+2] cycloadditions. Our method enables the exploration of large chemical reaction spaces with efficient, data-driven predictions of reaction pathways.
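The core parametrization, "deviations from geometric interpolations along reaction coordinates," is easy to sketch: the path is a linear interpolation between reactant and product plus a learned deviation multiplied by $t(1-t)$ so the endpoints stay pinned, trained against an energy objective. The toy potential and plain feedforward network below are stand-ins (the paper uses a symmetry-broken equivariant network and geodesic interpolation priors).

```python
# Sketch: path = interpolation(R, P) + t(1-t) * learned deviation,
# trained by minimizing a toy energy along the path.
import torch
import torch.nn as nn

torch.manual_seed(0)
n_atoms = 5
R = torch.randn(n_atoms, 3)                 # reactant geometry
P = R + torch.randn(n_atoms, 3) * 0.5       # product geometry

def energy(x):                              # toy pair potential (assumption)
    d = torch.cdist(x, x) + torch.eye(n_atoms)
    return ((1.0 / d - 1.0) ** 2).sum()

net = nn.Sequential(nn.Linear(1, 64), nn.Tanh(), nn.Linear(64, n_atoms * 3))
opt = torch.optim.Adam(net.parameters(), lr=1e-3)

t = torch.linspace(0, 1, 16).unsqueeze(1)   # reaction coordinate samples
for _ in range(500):
    opt.zero_grad()
    dev = net(t).reshape(-1, n_atoms, 3)
    # The t(1-t) factor pins the path to R at t=0 and P at t=1.
    path = (1 - t)[:, :, None] * R + t[:, :, None] * P \
           + (t * (1 - t))[:, :, None] * dev
    loss = torch.stack([energy(x) for x in path]).mean()
    loss.backward()
    opt.step()
print("mean path energy:", loss.item())
```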

信息检索

[IR-0] Intent-aware Diffusion with Contrastive Learning for Sequential Recommendation SIGIR2025

链接: https://arxiv.org/abs/2504.16077
作者: Yuanpeng Qu,Hajime Nobuhara
类目: Information Retrieval (cs.IR)
*备注: Accepted at SIGIR 2025. 10 pages, 6 figures, 3 tables

点击查看摘要

Abstract:Contrastive learning has proven effective in training sequential recommendation models by incorporating self-supervised signals from augmented views. Most existing methods generate multiple views from the same interaction sequence through stochastic data augmentation, aiming to align their representations in the embedding space. However, users typically have specific intents when purchasing items (e.g., buying clothes as gifts or cosmetics for beauty). Random data augmentation used in existing methods may introduce noise, disrupting the latent intent information implicit in the original interaction sequence. Moreover, using noisy augmented sequences in contrastive learning may mislead the model to focus on irrelevant features, distorting the embedding space and failing to capture users’ true behavior patterns and intents. To address these issues, we propose Intent-aware Diffusion with contrastive learning for sequential Recommendation (InDiRec). The core idea is to generate item sequences aligned with users’ purchasing intents, thus providing more reliable augmented views for contrastive learning. Specifically, InDiRec first performs intent clustering on sequence representations using K-means to build intent-guided signals. Next, it retrieves the intent representation of the target interaction sequence to guide a conditional diffusion model, generating positive views that share the same underlying intent. Finally, contrastive learning is applied to maximize representation consistency between these intent-aligned views and the original sequence. Extensive experiments on five public datasets demonstrate that InDiRec achieves superior performance compared to existing baselines, learning more robust representations even under noisy and sparse data conditions.
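Two of the building blocks named here are easy to illustrate: K-means intent clustering over sequence representations, and an InfoNCE-style contrastive loss aligning each sequence with its intent-conditioned view. In the sketch below the conditional diffusion generator is replaced by a placeholder noise-plus-centroid construction (an assumption); only the clustering and contrastive steps follow the abstract.

```python
# Sketch: intent clustering + InfoNCE alignment for sequential recommendation.
import numpy as np
import torch
import torch.nn.functional as F
from sklearn.cluster import KMeans

rng = np.random.default_rng(4)
n, d = 256, 32
seq_repr = rng.normal(size=(n, d)).astype(np.float32)  # sequence embeddings

# Step 1: intent clustering to build intent-guided signals.
km = KMeans(n_clusters=8, n_init=10, random_state=0).fit(seq_repr)
intents = km.cluster_centers_[km.labels_]               # per-sequence intent

# Step 2: placeholder "intent-aligned view" (the paper generates this with an
# intent-conditioned diffusion model; this construction is an assumption).
views = (seq_repr + 0.1 * rng.normal(size=(n, d)).astype(np.float32)
         + 0.2 * intents.astype(np.float32))

def info_nce(a, b, temperature=0.2):
    a, b = F.normalize(a, dim=-1), F.normalize(b, dim=-1)
    logits = a @ b.T / temperature                      # all-pairs similarity
    labels = torch.arange(a.size(0))                    # positives on diagonal
    return F.cross_entropy(logits, labels)

loss = info_nce(torch.from_numpy(seq_repr), torch.from_numpy(views))
print("contrastive loss:", loss.item())
```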

[IR-1] From Human Memory to AI Memory: A Survey on Memory Mechanisms in the Era of LLMs

链接: https://arxiv.org/abs/2504.15965
作者: Yaxiong Wu,Sheng Liang,Chen Zhang,Yichao Wang,Yongyue Zhang,Huifeng Guo,Ruiming Tang,Yong Liu
类目: Information Retrieval (cs.IR)
*备注: 26 pages, 1 figure, 3 tables

点击查看摘要

Abstract:Memory is the process of encoding, storing, and retrieving information, allowing humans to retain experiences, knowledge, skills, and facts over time, and serving as the foundation for growth and effective interaction with the world. It plays a crucial role in shaping our identity, making decisions, learning from past experiences, building relationships, and adapting to changes. In the era of large language models (LLMs), memory refers to the ability of an AI system to retain, recall, and use information from past interactions to improve future responses and interactions. Although previous research and reviews have provided detailed descriptions of memory mechanisms, there is still a lack of a systematic review that summarizes and analyzes the relationship between the memory of LLM-driven AI systems and human memory, as well as how we can be inspired by human memory to construct more powerful memory systems. To achieve this, in this paper, we propose a comprehensive survey on the memory of LLM-driven AI systems. In particular, we first conduct a detailed analysis of the categories of human memory and relate them to the memory of AI systems. Second, we systematically organize existing memory-related work and propose a categorization method based on three dimensions (object, form, and time) and eight quadrants. Finally, we illustrate some open problems regarding the memory of current AI systems and outline possible future directions for memory in the era of large language models.

[IR-2] Synergizing RAG and Reasoning: A Systematic Review

链接: https://arxiv.org/abs/2504.15909
作者: Yunfan Gao,Yun Xiong,Yijie Zhong,Yuxi Bi,Ming Xue,Haofen Wang
类目: Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:Recent breakthroughs in large language models (LLMs), particularly in reasoning capabilities, have propelled Retrieval-Augmented Generation (RAG) to unprecedented levels. By synergizing retrieval mechanisms with advanced reasoning, LLMs can now tackle increasingly complex problems. This paper presents a systematic review of the collaborative interplay between RAG and reasoning, clearly defining "reasoning" within the RAG context. It constructs a comprehensive taxonomy encompassing multi-dimensional collaborative objectives, representative paradigms, and technical implementations, and analyzes the bidirectional synergy methods. Additionally, we critically evaluate current limitations in RAG assessment, including the absence of intermediate supervision for multi-step reasoning and practical challenges related to cost-risk trade-offs. To bridge theory and practice, we provide practical guidelines tailored to diverse real-world applications. Finally, we identify promising research directions, such as graph-based knowledge integration, hybrid model collaboration, and RL-driven optimization. Overall, this work presents a theoretical framework and practical foundation to advance RAG systems in academia and industry, fostering the next generation of RAG solutions.

[IR-3] NLCTables: A Dataset for Marrying Natural Language Conditions with Table Discovery SIGIR’25

链接: https://arxiv.org/abs/2504.15849
作者: Lingxi Cui,Huan Li,Ke Chen,Lidan Shou,Gang Chen
类目: Information Retrieval (cs.IR)
*备注: accepted by SIGIR’25

点击查看摘要

Abstract:With the growing abundance of repositories containing tabular data, discovering relevant tables for in-depth analysis remains a challenging task. Existing table discovery methods primarily retrieve desired tables based on a query table or several vague keywords, leaving users to manually filter large result sets. To address this limitation, we propose a new task: NL-conditional table discovery (nlcTD), where users combine a query table with natural language (NL) requirements to refine search results. To advance research in this area, we present nlcTables, a comprehensive benchmark dataset comprising 627 diverse queries spanning NL-only, union, join, and fuzzy conditions, 22,080 candidate tables, and 21,200 relevance annotations. Our evaluation of six state-of-the-art table discovery methods on nlcTables reveals substantial performance gaps, highlighting the need for advanced techniques to tackle this challenging nlcTD scenario. The dataset, construction framework, and baseline implementations are publicly available at this https URL to foster future research.

[IR-4] FinDER: Financial Dataset for Question Answering and Evaluating Retrieval-Augmented Generation ICLR2025

链接: https://arxiv.org/abs/2504.15800
作者: Chanyeol Choi,Jihoon Kwon,Jaeseon Ha,Hojun Choi,Chaewoon Kim,Yongjae Lee,Jy-yong Sohn,Alejandro Lopez-Lira
类目: Information Retrieval (cs.IR)
*备注: 10 pages, 3 figures, ICLR 2025 Workshop Advances in Financial AI

点击查看摘要

Abstract:In the fast-paced financial domain, accurate and up-to-date information is critical to addressing ever-evolving market conditions. Retrieving this information correctly is essential in financial Question-Answering (QA), since many language models struggle with factual accuracy in this domain. We present FinDER, an expert-generated dataset tailored for Retrieval-Augmented Generation (RAG) in finance. Unlike existing QA datasets that provide predefined contexts and rely on relatively clear and straightforward queries, FinDER focuses on annotating search-relevant evidence by domain experts, offering 5,703 query-evidence-answer triplets derived from real-world financial inquiries. These queries frequently include abbreviations, acronyms, and concise expressions, capturing the brevity and ambiguity common in the realistic search behavior of professionals. By challenging models to retrieve relevant information from large corpora rather than relying on readily determined contexts, FinDER offers a more realistic benchmark for evaluating RAG systems. We further present a comprehensive evaluation of multiple state-of-the-art retrieval models and Large Language Models, showcasing challenges derived from a realistic benchmark to drive future research on truthful and precise RAG in the financial domain.

[IR-5] Assessing FAIRness of the Digital Shadow Reference Model

链接: https://arxiv.org/abs/2504.15715
作者: Johannes Theissen-Lipp
类目: Databases (cs.DB); Computers and Society (cs.CY); Information Retrieval (cs.IR)
*备注: 5 pages (2-column IEEE), 2 tables, accepted and to be published in 2025 IEEE 8th International Conference on Industrial Cyber-Physical Systems (ICPS) (see this https URL )

点击查看摘要

Abstract:Models play a critical role in managing the vast amounts of data and increasing complexity found in the IoT, IIoT, and IoP domains. The Digital Shadow Reference Model, which serves as a foundational metadata schema for linking data and metadata in these environments, is an example of such a model. Ensuring FAIRness (adherence to the FAIR Principles) is critical because it improves data findability, accessibility, interoperability, and reusability, facilitating efficient data management and integration across systems. This paper presents an evaluation of the FAIRness of the Digital Shadow Reference Model using a structured evaluation framework based on the FAIR Data Principles. Using the concept of FAIR Implementation Profiles (FIPs), supplemented by a mini-questionnaire, we systematically evaluate the model's adherence to these principles. Our analysis identifies key strengths, including the model's metadata schema that supports rich descriptions and authentication techniques, and highlights areas for improvement, such as the need for globally unique identifiers and consequent support for different Web standards. The results provide actionable insights for improving the FAIRness of the model and promoting better data management and reuse. This research contributes to the field by providing a detailed assessment of the Digital Shadow Reference Model and recommending next steps to improve its FAIRness and usability.

[IR-6] The Viability of Crowdsourcing for RAG Evaluation SIGIR'25

链接: https://arxiv.org/abs/2504.15689
作者: Lukas Gienapp,Tim Hagen,Maik Fröbe,Matthias Hagen,Benno Stein,Martin Potthast,Harrisen Scells
类目: Information Retrieval (cs.IR)
*备注: 11 pages, 9 tables, 5 figures. Accepted at SIGIR’25

点击查看摘要

Abstract:How good are humans at writing and judging responses in retrieval-augmented generation (RAG) scenarios? To answer this question, we investigate the efficacy of crowdsourcing for RAG through two complementary studies: response writing and response utility judgment. We present the Crowd RAG Corpus 2025 (CrowdRAG-25), which consists of 903 human-written and 903 LLM-generated responses for the 301 topics of the TREC RAG’24 track, across the three discourse styles ‘bulleted list’, ‘essay’, and ‘news’. For a selection of 65 topics, the corpus further contains 47,320 pairwise human judgments and 10,556 pairwise LLM judgments across seven utility dimensions (e.g., coverage and coherence). Our analyses give insights into human writing behavior for RAG and the viability of crowdsourcing for RAG evaluation. Human pairwise judgments provide reliable and cost-effective results compared to LLM-based pairwise or human/LLM-based pointwise judgments, as well as automated comparisons with human-written reference responses. All our data and tools are freely available.

[IR-7] Comprehensive List Generation for Multi-Generator Reranking

链接: https://arxiv.org/abs/2504.15625
作者: Hailan Yang,Zhenyu Qi,Shuchang Liu,Xiaoyu Yang,Xiaobei Wang,Xiang Li,Lantao Hu,Han Li,Kun Gai
类目: Information Retrieval (cs.IR)
*备注: 11 pages, 6 figures, 9 tables

点击查看摘要

Abstract:Reranking models solve the final recommendation lists that best fulfill users' demands. While existing solutions focus on finding parametric models that approximate optimal policies, recent approaches find that it is better to generate multiple lists to compete for a "pass" ticket from an evaluator, where the evaluator serves as the supervisor who accurately estimates the performance of the candidate lists. In this work, we show that we can achieve a more efficient and effective list proposal with a multi-generator framework and provide empirical evidence on two public datasets and online A/B tests. More importantly, we verify that the effectiveness of a generator is closely related to how much it complements the views of other generators with sufficiently different rerankings, which derives the metric of list comprehensiveness. With this intuition, we design an automatic complementary generator-finding framework that learns a policy that simultaneously aligns the users' preferences and maximizes the list comprehensiveness metric. The experimental results indicate that the proposed framework can further improve the multi-generator reranking performance.
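A "list comprehensiveness" style metric can be sketched very simply: value a generator by how much of its top-k list is not covered by the other generators. The exact definition in the paper may differ; the overlap-based version below is an assumption for illustration.

```python
# Sketch: a toy comprehensiveness score over multiple generators' rerankings.
def comprehensiveness(lists, k=5):
    scores = {}
    for name, items in lists.items():
        others = set()
        for other, other_items in lists.items():
            if other != name:
                others.update(other_items[:k])
        top_k = items[:k]
        novel = [i for i in top_k if i not in others]
        scores[name] = len(novel) / len(top_k)  # fraction not covered by others
    return scores

# Three candidate rerankings of the same item pool.
lists = {
    "gen_a": [1, 2, 3, 4, 5],
    "gen_b": [1, 2, 3, 6, 7],
    "gen_c": [8, 9, 10, 11, 12],
}
print(comprehensiveness(lists))  # gen_c is fully complementary
```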

[IR-8] From Reviews to Dialogues: Active Synthesis for Zero-Shot LLM-based Conversational Recommender System

链接: https://arxiv.org/abs/2504.15476
作者: Rohan Surana,Junda Wu,Zhouhang Xie,Yu Xia,Harald Steck,Dawen Liang,Nathan Kallus,Julian McAuley
类目: Information Retrieval (cs.IR)
*备注: 11 pages, 2 figures

点击查看摘要

Abstract:Conversational recommender systems (CRS) typically require extensive domain-specific conversational datasets, yet high costs, privacy concerns, and data-collection challenges severely limit their availability. Although Large Language Models (LLMs) demonstrate strong zero-shot recommendation capabilities, practical applications often favor smaller, internally managed recommender models due to scalability, interpretability, and data privacy constraints, especially in sensitive or rapidly evolving domains. However, training these smaller models effectively still demands substantial domain-specific conversational data, which remains challenging to obtain. To address these limitations, we propose an active data augmentation framework that synthesizes conversational training data by leveraging black-box LLMs guided by active learning techniques. Specifically, our method utilizes publicly available non-conversational domain data, including item metadata, user reviews, and collaborative signals, as seed inputs. By employing active learning strategies to select the most informative seed samples, our approach efficiently guides LLMs to generate synthetic, semantically coherent conversational interactions tailored explicitly to the target domain. Extensive experiments validate that conversational data generated by our proposed framework significantly improves the performance of LLM-based CRS models, effectively addressing the challenges of building CRS in no- or low-resource scenarios.
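The active-synthesis loop described here has a simple skeleton: score non-conversational seed samples by an informativeness criterion, pick the top ones within a budget, and hand them to a black-box generator. In the sketch below, the scoring rule (ensemble disagreement) and the `generate_dialogue` stub are hypothetical stand-ins; the paper's active-learning strategies and LLM prompting are more elaborate.

```python
# Sketch: informativeness-guided seed selection for synthetic dialogue data.
import numpy as np

rng = np.random.default_rng(5)
n_seeds, d = 500, 16
seeds = rng.normal(size=(n_seeds, d))          # item metadata / review features

# Informativeness proxy: disagreement across an ensemble of cheap scorers.
ensemble = [rng.normal(size=d) for _ in range(5)]
preds = np.stack([seeds @ w for w in ensemble])
informativeness = preds.std(axis=0)            # high variance = informative

def generate_dialogue(seed_vec):
    # Stand-in for a black-box LLM call (hypothetical helper).
    return f"synthetic dialogue conditioned on seed {seed_vec[:2].round(2)}"

budget = 10
chosen = np.argsort(-informativeness)[:budget]  # most informative seeds
synthetic_data = [generate_dialogue(seeds[i]) for i in chosen]
print(len(synthetic_data), "dialogues synthesized")
```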

附件下载

点击下载今日全部论文列表