This post is an automatically updated list of the latest papers retrieved from Arxiv.org on 2025-05-06, organized into five broad areas: NLP, CV, ML, AI, and IR. If you would like to receive the list by email on a schedule, please leave your email address in the comments.

Note: The daily paper data is retrieved from Arxiv.org and refreshed automatically around 12:00 each day.

Friendly reminder: if you would like to receive the daily paper data by email, please leave your email address in the comments.

Overview (2025-05-06)

729 papers were added today, including:

  • 81 papers in Natural Language Processing (Computation and Language (cs.CL))
  • 205 papers in Artificial Intelligence (Artificial Intelligence (cs.AI))
  • 161 papers in Computer Vision (Computer Vision and Pattern Recognition (cs.CV))
  • 197 papers in Machine Learning (Machine Learning (cs.LG))

Natural Language Processing

[NLP-0] R1-Reward: Training Multimodal Reward Model Through Stable Reinforcement Learning

[Quick Read]: This paper investigates the effectiveness of long-term reasoning capabilities in Multimodal Reward Models (MRMs) and how to activate them, in particular how to improve the stability and performance of reward modeling within a Reinforcement Learning (RL) framework. The key to the solution is an algorithm called StableReinforce, which refines the training loss, the advantage-estimation strategy, and the reward design to achieve more stable training dynamics and better performance. In addition, the authors collect 200K preference examples from diverse datasets to train a reward model named R1-Reward, which significantly outperforms existing state-of-the-art models on several benchmarks.

Link: https://arxiv.org/abs/2505.02835
Authors: Yi-Fan Zhang, Xingyu Lu, Xiao Hu, Chaoyou Fu, Bin Wen, Tianke Zhang, Changyi Liu, Kaiyu Jiang, Kaibing Chen, Kaiyu Tang, Haojie Ding, Jiankang Chen, Fan Yang, Zhang Zhang, Tingting Gao, Liang Wang
Affiliations: CASIA (Chinese Academy of Sciences); THU (Tsinghua University); KuaiShou; NJU (Nanjing University)
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments: Home page: this https URL

Abstract:Multimodal Reward Models (MRMs) play a crucial role in enhancing the performance of Multimodal Large Language Models (MLLMs). While recent advancements have primarily focused on improving the model structure and training data of MRMs, there has been limited exploration into the effectiveness of long-term reasoning capabilities for reward modeling and how to activate these capabilities in MRMs. In this paper, we explore how Reinforcement Learning (RL) can be used to improve reward modeling. Specifically, we reformulate the reward modeling problem as a rule-based RL task. However, we observe that directly applying existing RL algorithms, such as Reinforce++, to reward modeling often leads to training instability or even collapse due to the inherent limitations of these algorithms. To address this issue, we propose the StableReinforce algorithm, which refines the training loss, advantage estimation strategy, and reward design of existing RL methods. These refinements result in more stable training dynamics and superior performance. To facilitate MRM training, we collect 200K preference data from diverse datasets. Our reward model, R1-Reward, trained using the StableReinforce algorithm on this dataset, significantly improves performance on multimodal reward modeling benchmarks. Compared to previous SOTA models, R1-Reward achieves an 8.4% improvement on the VL Reward-Bench and a 14.3% improvement on the Multimodal Reward Bench. Moreover, with more inference compute, R1-Reward’s performance is further enhanced, highlighting the potential of RL algorithms in optimizing MRMs.

[NLP-1] AOR: Anatomical Ontology-Guided Reasoning for Medical Large Multimodal Model in Chest X-Ray Interpretation

[Quick Read]: This paper targets two major problems facing current medical large multimodal models (Medical LMMs) in automated chest X-ray (CXR) interpretation: insufficient region-level understanding and interaction, and limited accuracy and interpretability caused by single-step reasoning. The key to the solution is to endow medical LMMs with anatomy-centric reasoning: an Anatomical Ontology-Guided Reasoning (AOR) framework performs multi-step reasoning around cross-modal region-level information, and AOR-Instruction, a large instruction dataset built under the guidance of expert physicians, improves the models' interactivity and explainability.

Link: https://arxiv.org/abs/2505.02830
Authors: Qingqiu Li, Zihang Cui, Seongsu Bae, Jilan Xu, Runtian Yuan, Yuejie Zhang, Rui Feng, Quanli Shen, Xiaobo Zhang, Junjun He, Shujun Wang
Affiliations: Fudan University; Xidian University; KAIST; Children’s Hospital of Fudan University; Shanghai AI Laboratory; Hong Kong Polytechnic University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments:

Abstract:Chest X-rays (CXRs) are the most frequently performed imaging examinations in clinical settings. Recent advancements in Large Multimodal Models (LMMs) have enabled automated CXR interpretation, enhancing diagnostic accuracy and efficiency. However, despite their strong visual understanding, current Medical LMMs (MLMMs) still face two major challenges: (1) Insufficient region-level understanding and interaction, and (2) Limited accuracy and interpretability due to single-step reasoning. In this paper, we empower MLMMs with anatomy-centric reasoning capabilities to enhance their interactivity and explainability. Specifically, we first propose an Anatomical Ontology-Guided Reasoning (AOR) framework, which centers on cross-modal region-level information to facilitate multi-step reasoning. Next, under the guidance of expert physicians, we develop AOR-Instruction, a large instruction dataset for MLMMs training. Our experiments demonstrate AOR’s superior performance in both VQA and report generation tasks.

[NLP-2] AutoLibra: Agent Metric Induction from Open-Ended Feedback

[Quick Read]: This paper addresses the shortcomings of traditional agent evaluation, where task-success metrics are too coarse, depend on manual design by experts, and fail to reward intermediate emergent behaviors. The key to the solution is AutoLibra, a framework that turns open-ended human feedback into metrics over fine-grained behaviors in agent trajectories. Concretely, AutoLibra grounds feedback in agent behaviors, clusters similar positive and negative behaviors, and creates quantitative metrics with clear definitions and concrete examples that can be used to prompt LLM-as-a-Judge evaluators. The paper also introduces two meta-metrics, "coverage" and "redundancy", to assess how well induced metrics align with the open-ended feedback and to further refine evaluation.

Link: https://arxiv.org/abs/2505.02820
Authors: Hao Zhu, Phil Cuvin, Xinkai Yu, Charlotte Ka Yee Yan, Jason Zhang, Diyi Yang
Affiliations: Stanford University; University of Toronto; University of Pennsylvania
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: this https URL

Abstract:Agents are predominantly evaluated and optimized via task success metrics, which are coarse, rely on manual design from experts, and fail to reward intermediate emergent behaviors. We propose AutoLibra, a framework for agent evaluation, that transforms open-ended human feedback, e.g., “If you find that the button is disabled, don’t click it again”, or “This agent has too much autonomy to decide what to do on its own”, into metrics for evaluating fine-grained behaviors in agent trajectories. AutoLibra accomplishes this by grounding feedback to an agent’s behavior, clustering similar positive and negative behaviors, and creating concrete metrics with clear definitions and concrete examples, which can be used for prompting LLM-as-a-Judge as evaluators. We further propose two meta-metrics to evaluate the alignment of a set of (induced) metrics with open feedback: “coverage” and “redundancy”. Through optimizing these meta-metrics, we experimentally demonstrate AutoLibra’s ability to induce more concrete agent evaluation metrics than the ones proposed in previous agent evaluation benchmarks and discover new metrics to analyze agents. We also present two applications of AutoLibra in agent improvement: First, we show that AutoLibra-induced metrics serve as better prompt-engineering targets than the task success rate on a wide range of text game tasks, improving agent performance over baseline by a mean of 20%. Second, we show that AutoLibra can iteratively select high-quality fine-tuning data for web navigation agents. Our results suggest that AutoLibra is a powerful task-agnostic tool for evaluating and improving language agents.
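
To make the clustering step concrete, here is a minimal sketch of grouping feedback-grounded behavior descriptions into candidate metrics. The behavior strings, the TF-IDF features, and the fixed cluster count are illustrative assumptions; the paper itself grounds and clusters behaviors with LLMs rather than scikit-learn.

```python
# Illustrative sketch of AutoLibra's clustering step: group similar
# behavior descriptions so each cluster can seed one candidate metric.
# The feature extractor (TF-IDF) and k=2 are placeholder choices.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

behaviors = [
    "clicked a disabled button repeatedly",          # negative behavior
    "retried a greyed-out control many times",       # negative behavior
    "asked the user before taking a risky action",   # positive behavior
    "confirmed intent prior to deleting a file",     # positive behavior
]

X = TfidfVectorizer().fit_transform(behaviors)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

for cluster in range(2):
    members = [b for b, l in zip(behaviors, labels) if l == cluster]
    print(f"candidate metric {cluster}: {members}")
```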

[NLP-3] ReplaceMe: Network Simplification via Layer Pruning and Linear Transformations

[Quick Read]: This paper asks how, in depth pruning, transformer blocks can be replaced by a linear operation while preserving high performance, without any additional training or fine-tuning. The key to the ReplaceMe method is that only a small calibration dataset is needed to estimate a linear transformation that approximates the pruned blocks; this linear mapping is then merged seamlessly into the remaining transformer blocks, introducing no extra network parameters and yielding efficient model compression.

Link: https://arxiv.org/abs/2505.02819
Authors: Dmitriy Shopkhoev, Ammar Ali, Magauiya Zhussip, Valentin Malykh, Stamatios Lefkimmiatis, Nikos Komodakis, Sergey Zagoruyko
Affiliations: MTS AI; ITMO University; IITU; University of Crete; IACM-Forth; Archimedes Athena RC; Polynome
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:We introduce ReplaceMe, a generalized training-free depth pruning method that effectively replaces transformer blocks with a linear operation, while maintaining high performance for low compression ratios. In contrast to conventional pruning approaches that require additional training or fine-tuning, our approach requires only a small calibration dataset that is used to estimate a linear transformation to approximate the pruned blocks. This estimated linear mapping can be seamlessly merged with the remaining transformer blocks, eliminating the need for any additional network parameters. Our experiments show that ReplaceMe consistently outperforms other training-free approaches and remains highly competitive with state-of-the-art pruning methods that involve extensive retraining/fine-tuning and architectural modifications. Applied to several large language models (LLMs), ReplaceMe achieves up to 25% pruning while retaining approximately 90% of the original model’s performance on open benchmarks - without any training or healing steps, resulting in minimal computational overhead (see Fig.1). We provide an open-source library implementing ReplaceMe alongside several state-of-the-art depth pruning techniques, available at this repository.
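
The least-squares estimation at the heart of this approach is simple enough to sketch directly. Below is a minimal illustration, assuming hidden states entering (X) and leaving (Y) the pruned block span have already been recorded on a calibration set; the shapes, the numpy solver, and the merge target are stand-ins, not the authors' exact pipeline.

```python
import numpy as np

# Calibration activations: X = inputs to the pruned span, Y = its outputs.
# Here they are random stand-ins with (tokens, hidden_dim) shape.
rng = np.random.default_rng(0)
X = rng.normal(size=(4096, 512))
Y = rng.normal(size=(4096, 512))

# Fit the linear replacement T minimizing ||X @ T - Y||_F (ordinary least squares).
T, *_ = np.linalg.lstsq(X, Y, rcond=None)

# T can then be folded into an adjacent weight matrix W (e.g. the next
# block's input projection), adding no new parameters at inference time.
W_next = rng.normal(size=(512, 512))
W_merged = T @ W_next
print(T.shape, W_merged.shape)
```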

[NLP-4] Knowing You Don't Know: Learning When to Continue Search in Multi-round RAG through Self-Practicing SIGIR2025

[Quick Read]: This paper tackles the lack of self-awareness and the retrieval inefficiency of multi-round retrieval-augmented generation (RAG) systems on complex tasks, which can cause a system to keep searching after enough information has been retrieved, or to answer incorrectly without sufficient knowledge. The key to the solution is a new framework, SIM-RAG, which lets a RAG system self-practice multi-round retrieval and uses intermediate inner-monologue reasoning steps to generate synthetic training data, on which a lightweight information-sufficiency Critic is trained. At inference time, the Critic judges after each round whether enough information has been retrieved, guiding retrieval decisions and improving system-level self-awareness.

Link: https://arxiv.org/abs/2505.02811
Authors: Diji Yang, Linda Zeng, Jinmeng Rao, Yi Zhang
Affiliations: University of California Santa Cruz; The Harker School; Mineral.ai; Google
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments: Proceedings of the 48th International ACM SIGIR 2025

Abstract:Retrieval Augmented Generation (RAG) has shown strong capability in enhancing language models’ knowledge and reducing AI generative hallucinations, driving its widespread use. However, complex tasks requiring multi-round retrieval remain challenging, and early attempts tend to be overly optimistic without a good sense of self-skepticism. Current multi-round RAG systems may continue searching even when enough information has already been retrieved, or they may provide incorrect answers without having sufficient information or knowledge. Existing solutions either require large amounts of expensive human-labeled process supervision data or lead to subpar performance. This paper aims to address these limitations by introducing a new framework, SIM-RAG, to explicitly enhance RAG systems’ self-awareness and multi-round retrieval capabilities. To train SIM-RAG, we first let a RAG system self-practice multi-round retrieval, augmenting existing question-answer pairs with intermediate inner monologue reasoning steps to generate synthetic training data. For each pair, the system may explore multiple retrieval paths, which are labeled as successful if they reach the correct answer and unsuccessful otherwise. Using this data, we train a lightweight information sufficiency Critic. At inference time, the Critic evaluates whether the RAG system has retrieved sufficient information at each round, guiding retrieval decisions and improving system-level self-awareness through in-context reinforcement learning. Experiments across multiple prominent RAG benchmarks show that SIM-RAG is an effective multi-round RAG solution. Furthermore, this framework is system-efficient, adding a lightweight component to RAG without requiring modifications to existing LLMs or search engines, and data-efficient, eliminating the need for costly human-annotated mid-step retrieval process supervision data.
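
The inference-time behavior described above reduces to a critic-gated retrieval loop. The sketch below shows only that control flow; the retriever, Critic, and generator are toy stubs, not SIM-RAG's actual components.

```python
# Critic-gated multi-round retrieval, as described for SIM-RAG:
# keep retrieving until the sufficiency Critic says "enough" or a
# round budget is hit. All three components are toy stand-ins.
def retrieve(question: str, round_idx: int) -> str:
    return f"[doc for '{question}', round {round_idx}]"

def critic_is_sufficient(question: str, evidence: list[str]) -> bool:
    return len(evidence) >= 2  # placeholder sufficiency rule

def generate_answer(question: str, evidence: list[str]) -> str:
    return f"answer based on {len(evidence)} documents"

def sim_rag_answer(question: str, max_rounds: int = 5) -> str:
    evidence: list[str] = []
    for r in range(max_rounds):
        if critic_is_sufficient(question, evidence):
            break  # stop searching once information is judged sufficient
        evidence.append(retrieve(question, r))
    return generate_answer(question, evidence)

print(sim_rag_answer("Who wrote Middlemarch?"))
```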

[NLP-5] Bye-bye Bluebook? Automating Legal Procedure with Large Language Models

[Quick Read]: This paper examines whether large language models (LLMs) can follow a complex legal citation standard, The Bluebook. The key to the approach is an original dataset of 866 Bluebook tasks, used to test flagship LLMs from OpenAI, Anthropic, Google, Meta, and DeepSeek on generating citations that comply with Bluebook formatting. The results show limited full compliance: the models produce correct citations only 69%-74% of the time, and in-context learning on the Bluebook's underlying rules raises accuracy only to 77%.

Link: https://arxiv.org/abs/2505.02763
Authors: Matthew Dahl
Affiliations: Yale Law School
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments:

Abstract:Legal practice requires careful adherence to procedural rules. In the United States, few are more complex than those found in The Bluebook: A Uniform System of Citation. Compliance with this system’s 500+ pages of byzantine formatting instructions is the raison d’etre of thousands of student law review editors and the bete noire of lawyers everywhere. To evaluate whether large language models (LLMs) are able to adhere to the procedures of such a complicated system, we construct an original dataset of 866 Bluebook tasks and test flagship LLMs from OpenAI, Anthropic, Google, Meta, and DeepSeek. We show (1) that these models produce fully compliant Bluebook citations only 69%-74% of the time and (2) that in-context learning on the Bluebook’s underlying system of rules raises accuracy only to 77%. These results caution against using off-the-shelf LLMs to automate aspects of the law where fidelity to procedure is paramount.

[NLP-6] Using Knowledge Graphs to harvest datasets for efficient CLIP model training

[Quick Read]: This paper addresses the constraints imposed by the enormous datasets needed to train high-quality Contrastive Language-Image Pretraining (CLIP) models, especially in specific domains that even the largest CLIP models do not cover well; this drives up development cost and makes it hard for scientific research to exercise fine-grained control over the CLIP training procedure. The key to the solution is a smart web-search strategy enhanced with knowledge graphs, which allows a robust CLIP model to be trained with considerably less data.

Link: https://arxiv.org/abs/2505.02746
Authors: Simon Ging, Sebastian Walter, Jelena Bratulić, Johannes Dienert, Hannah Bast, Thomas Brox
Affiliations: University of Freiburg
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments:

Abstract:Training high-quality CLIP models typically requires enormous datasets, which limits the development of domain-specific models – especially in areas that even the largest CLIP models do not cover well – and drives up training costs. This poses challenges for scientific research that needs fine-grained control over the training procedure of CLIP models. In this work, we show that by employing smart web search strategies enhanced with knowledge graphs, a robust CLIP model can be trained from scratch with considerably less data. Specifically, we demonstrate that an expert foundation model for living organisms can be built using just 10M images. Moreover, we introduce EntityNet, a dataset comprising 33M images paired with 46M text descriptions, which enables the training of a generic CLIP model in significantly reduced time.

[NLP-7] Voila: Voice-Language Foundation Models for Real-Time Autonomous Interaction and Voice Role-Play

[Quick Read]: This paper addresses the shortcomings of traditional voice AI systems in real-time responsiveness, emotional expressiveness, and personalized interaction, aiming to build a generative voice AI agent that blends seamlessly into daily life. The key to the solution is Voila, a family of large voice-language foundation models whose end-to-end architecture enables full-duplex, low-latency conversations while preserving rich vocal nuances such as tone, rhythm, and emotion. Voila combines the reasoning capabilities of large language models with powerful acoustic modeling to achieve natural, persona-aware voice generation, supports over a million pre-built voices with fast customization of new ones, and advances next-generation human-machine interaction.

Link: https://arxiv.org/abs/2505.02707
Authors: Yemin Shi, Yu Shu, Siwei Dong, Guangyi Liu, Jaward Sesay, Jingwen Li, Zhiting Hu
Affiliations: Maitrix.org; UC San Diego; MBZUAI
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Sound (cs.SD)
Comments: 18 pages, 7 figures, Website: this https URL

Abstract:A voice AI agent that blends seamlessly into daily life would interact with humans in an autonomous, real-time, and emotionally expressive manner. Rather than merely reacting to commands, it would continuously listen, reason, and respond proactively, fostering fluid, dynamic, and emotionally resonant interactions. We introduce Voila, a family of large voice-language foundation models that make a step towards this vision. Voila moves beyond traditional pipeline systems by adopting a new end-to-end architecture that enables full-duplex, low-latency conversations while preserving rich vocal nuances such as tone, rhythm, and emotion. It achieves a response latency of just 195 milliseconds, surpassing the average human response time. Its hierarchical multi-scale Transformer integrates the reasoning capabilities of large language models (LLMs) with powerful acoustic modeling, enabling natural, persona-aware voice generation – where users can simply write text instructions to define the speaker’s identity, tone, and other characteristics. Moreover, Voila supports over one million pre-built voices and efficient customization of new ones from brief audio samples as short as 10 seconds. Beyond spoken dialogue, Voila is designed as a unified model for a wide range of voice-based applications, including automatic speech recognition (ASR), Text-to-Speech (TTS), and, with minimal adaptation, multilingual speech translation. Voila is fully open-sourced to support open research and accelerate progress toward next-generation human-machine interactions.

[NLP-8] Predicting Movie Hits Before They Happen with LLMs

[Quick Read]: This paper addresses the cold-start problem in content recommendation, focusing on popularity forecasting for newly released movies on a large entertainment platform. The key to the solution is to use Large Language Models (LLMs) together with movie metadata to predict the popularity of cold-start movies; the method can be integrated into retrieval systems within the personalization pipeline, or adopted as a tool for editorial teams to ensure fair promotion of potentially overlooked movies that traditional or algorithmic solutions may miss.

Link: https://arxiv.org/abs/2505.02693
Authors: Shaghayegh Agah, Yejin Kim, Neeraj Sharma, Mayur Nankani, Kevin Foley, H. Howie Huang, Sardar Hamidian
Affiliations: Comcast Technology AI; George Washington University
Subjects: Information Retrieval (cs.IR); Computation and Language (cs.CL)
Comments: Accepted at ACM UMAP 2025 Industry Track

Abstract:Addressing the cold-start issue in content recommendation remains a critical ongoing challenge. In this work, we focus on tackling the cold-start problem for movies on a large entertainment platform. Our primary goal is to forecast the popularity of cold-start movies using Large Language Models (LLMs) leveraging movie metadata. This method could be integrated into retrieval systems within the personalization pipeline or could be adopted as a tool for editorial teams to ensure fair promotion of potentially overlooked movies that may be missed by traditional or algorithmic solutions. Our study validates the effectiveness of this approach compared to established baselines and those we developed.

[NLP-9] fastabx: A library for efficient computation of ABX discriminability

[Quick Read]: This paper addresses the lack of efficient tooling for the ABX discrimination task used to evaluate phonetic discriminability in self-supervised speech representations, which has limited its broader adoption. The key to the solution is fastabx, a high-performance Python library that can construct any type of ABX task while delivering the efficiency needed for rapid development cycles, both in task creation and in computing distances between representations.

Link: https://arxiv.org/abs/2505.02692
Authors: Maxime Poli, Emmanuel Chemla, Emmanuel Dupoux
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments: 8 pages, 6 figures

Abstract:We introduce fastabx, a high-performance Python library for building ABX discrimination tasks. ABX is a measure of the separation between generic categories of interest. It has been used extensively to evaluate phonetic discriminability in self-supervised speech representations. However, its broader adoption has been limited by the absence of adequate tools. fastabx addresses this gap by providing a framework capable of constructing any type of ABX task while delivering the efficiency necessary for rapid development cycles, both in task creation and in calculating distances between representations. We believe that fastabx will serve as a valuable resource for the broader representation learning community, enabling researchers to systematically investigate what information can be directly extracted from learned representations across several domains beyond speech processing. The source code is available at this https URL.
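
For readers new to the metric, an ABX discriminability score between two categories can be computed directly from representation distances. A minimal numpy sketch follows; it is independent of the fastabx API and exists only to show what the library computes efficiently at scale.

```python
import numpy as np

def abx_score(cat_a: np.ndarray, cat_b: np.ndarray) -> float:
    """Fraction of (A, B, X) triplets with d(A, X) < d(B, X), where
    A and X are distinct items of category a and B is from category b."""
    wins, total = 0, 0
    for i, x in enumerate(cat_a):
        for j, a in enumerate(cat_a):
            if i == j:
                continue  # A and X must be different tokens of the category
            for b in cat_b:
                d_ax = np.linalg.norm(a - x)
                d_bx = np.linalg.norm(b - x)
                wins += d_ax < d_bx
                total += 1
    return wins / total

rng = np.random.default_rng(0)
phone_a = rng.normal(0.0, 1.0, size=(8, 16))  # embeddings of category a
phone_b = rng.normal(1.0, 1.0, size=(8, 16))  # embeddings of category b
print(f"ABX discriminability: {abx_score(phone_a, phone_b):.3f}")
```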

[NLP-10] Sailing AI by the Stars: A Survey of Learning from Rewards in Post-Training and Test-Time Scaling of Large Language Models

[Quick Read]: This paper surveys the key paradigm of Learning from Rewards in Large Language Models (LLMs), in which reward signals steer model behavior and drive the shift from passive learning on static data to active learning from dynamic feedback, endowing LLMs with aligned preferences and deep reasoning capabilities. The key lies in using reward mechanisms across the training, inference, and post-inference stages to obtain more efficient model behavior that better matches human values.

Link: https://arxiv.org/abs/2505.02686
Authors: Xiaobao Wu
Affiliations: Nanyang Technological University
Subjects: Computation and Language (cs.CL)
Comments: 35 pages

Abstract:Recent developments in Large Language Models (LLMs) have shifted from pre-training scaling to post-training and test-time scaling. Across these developments, a key unified paradigm has arisen: Learning from Rewards, where reward signals act as the guiding stars to steer LLM behavior. It has underpinned a wide range of prevalent techniques, such as reinforcement learning (in RLHF, DPO, and GRPO), reward-guided decoding, and post-hoc correction. Crucially, this paradigm enables the transition from passive learning from static data to active learning from dynamic feedback. This endows LLMs with aligned preferences and deep reasoning capabilities. In this survey, we present a comprehensive overview of the paradigm of learning from rewards. We categorize and analyze the strategies under this paradigm across training, inference, and post-inference stages. We further discuss the benchmarks for reward models and the primary applications. Finally we highlight the challenges and future directions. We maintain a paper collection at this https URL.

[NLP-11] A Survey on Progress in LLM Alignment from the Perspective of Reward Design

[Quick Read]: This paper addresses the core problem of aligning Large Language Models (LLMs) with human values and intentions, whose crux lies in the design of reward mechanisms. The study builds a systematic theoretical framework that divides the development of reward mechanisms into three key phases, feedback (diagnosis), reward design (prescription), and optimization (treatment), and analyzes them along four dimensions (construction basis, format, expression, and granularity), yielding a systematic taxonomy that reveals evolutionary trends in reward modeling. The paper notes that LLM alignment still faces many challenges, while recent advances in reward design are driving paradigm shifts, including the transition from reinforcement-learning-based frameworks to novel optimization paradigms and improved capabilities in complex alignment scenarios such as multimodal integration and concurrent task coordination.

Link: https://arxiv.org/abs/2505.02666
Authors: Miaomiao Ji, Yanqiu Wu, Zhibin Wu, Shoujin Wang, Jian Yang, Mark Dras, Usman Naseem
Affiliations: South China University of Technology; Macquarie University; University of Technology Sydney
Subjects: Computation and Language (cs.CL)
Comments: Preprint

Abstract:The alignment of large language models (LLMs) with human values and intentions represents a core challenge in current AI research, where reward mechanism design has become a critical factor in shaping model behavior. This study conducts a comprehensive investigation of reward mechanisms in LLM alignment through a systematic theoretical framework, categorizing their development into three key phases: (1) feedback (diagnosis), (2) reward design (prescription), and (3) optimization (treatment). Through a four-dimensional analysis encompassing construction basis, format, expression, and granularity, this research establishes a systematic classification framework that reveals evolutionary trends in reward modeling. The field of LLM alignment faces several persistent challenges, while recent advances in reward design are driving significant paradigm shifts. Notable developments include the transition from reinforcement learning-based frameworks to novel optimization paradigms, as well as enhanced capabilities to address complex alignment scenarios involving multimodal integration and concurrent task coordination. Finally, this survey outlines promising future research directions for LLM alignment through innovative reward design strategies.

[NLP-12] Proper Name Diacritization for Arabic Wikipedia: A Benchmark Dataset

[Quick Read]: This paper addresses the ambiguity in pronunciation and interpretation caused by the lack of diacritics on proper names in Arabic Wikipedia, especially for transliterated named entities of foreign origin. The key to the solution is a manually diacritized dataset of Arabic proper names of various origins paired with their English Wikipedia glosses; GPT-4o is then benchmarked on recovering full diacritization from the undiacritized Arabic and English forms, probing the challenges of the task and viable approaches to it.

Link: https://arxiv.org/abs/2505.02656
Authors: Rawan Bondok, Mayar Nassar, Salam Khalifa, Kurt Micallaf, Nizar Habash
Affiliations: New York University Abu Dhabi; Ain Shams University; Stony Brook University; Department of Artificial Intelligence, University of Malta
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Proper names in Arabic Wikipedia are frequently undiacritized, creating ambiguity in pronunciation and interpretation, especially for transliterated named entities of foreign origin. While transliteration and diacritization have been well-studied separately in Arabic NLP, their intersection remains underexplored. In this paper, we introduce a new manually diacritized dataset of Arabic proper names of various origins with their English Wikipedia equivalent glosses, and present the challenges and guidelines we followed to create it. We benchmark GPT-4o on the task of recovering full diacritization given the undiacritized Arabic and English forms, and analyze its performance. Achieving 73% accuracy, our results underscore both the difficulty of the task and the need for improved models and resources. We release our dataset to facilitate further research on Arabic Wikipedia proper name diacritization.

[NLP-13] Enhancing Chemical Reaction and Retrosynthesis Prediction with Large Language Model and Dual-task Learning IJCAI2025

[Quick Read]: This paper addresses two major challenges in chemical reaction and retrosynthesis prediction for drug discovery: the lack of a large-scale instruction dataset related to chemical synthesis, and existing fine-tuning strategies that ignore the close correlation between reaction and retrosynthesis prediction. The key to the ChemDual framework is to view a molecule's reaction and retrosynthesis as related recombination and fragmentation processes and to build a large-scale dataset of 4.4 million instructions, together with an enhanced LLaMA equipped with a multi-scale tokenizer and a dual-task learning strategy that jointly optimizes recombination/fragmentation and the reaction and retrosynthesis prediction tasks.

Link: https://arxiv.org/abs/2505.02639
Authors: Xuan Lin, Qingrui Liu, Hongxin Xiang, Daojian Zeng, Xiangxiang Zeng
Affiliations: Xiangtan University; Hunan University; Hunan Normal University
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Accepted for publication at IJCAI 2025

Abstract:Chemical reaction and retrosynthesis prediction are fundamental tasks in drug discovery. Recently, large language models (LLMs) have shown potential in many domains. However, directly applying LLMs to these tasks faces two major challenges: (i) lacking a large-scale chemical synthesis-related instruction dataset; (ii) ignoring the close correlation between reaction and retrosynthesis prediction for the existing fine-tuning strategies. To address these challenges, we propose ChemDual, a novel LLM framework for accurate chemical synthesis. Specifically, considering the high cost of data acquisition for reaction and retrosynthesis, ChemDual regards the reaction-and-retrosynthesis of molecules as a related recombination-and-fragmentation process and constructs a large-scale of 4.4 million instruction dataset. Furthermore, ChemDual introduces an enhanced LLaMA, equipped with a multi-scale tokenizer and dual-task learning strategy, to jointly optimize the process of recombination and fragmentation as well as the tasks between reaction and retrosynthesis prediction. Extensive experiments on Mol-Instruction and USPTO-50K datasets demonstrate that ChemDual achieves state-of-the-art performance in both predictions of reaction and retrosynthesis, outperforming the existing conventional single-task approaches and the general open-source LLMs. Through molecular docking analysis, ChemDual generates compounds with diverse and strong protein binding affinity, further highlighting its strong potential in drug design.

[NLP-14] LLaMA-Omni2: LLM-based Real-time Spoken Chatbot with Autoregressive Streaming Speech Synthesis

[Quick Read]: This paper addresses real-time, intelligent, and natural speech interaction for next-generation human-computer interaction. The key to the solution is LLaMA-Omni 2, a series of speech language models (SpeechLMs) ranging from 0.5B to 14B parameters that achieve high-quality real-time speech interaction. Built on the Qwen2.5 series, LLaMA-Omni 2 integrates a speech encoder with an autoregressive streaming speech decoder; despite being trained on only 200K multi-turn speech dialogue samples, it performs strongly, surpassing previous state-of-the-art SpeechLMs such as GLM-4-Voice on several spoken question answering and speech instruction-following benchmarks.

Link: https://arxiv.org/abs/2505.02625
Authors: Qingkai Fang, Yan Zhou, Shoutao Guo, Shaolei Zhang, Yang Feng
Affiliations: Chinese Academy of Sciences; Institute of Computing Technology, Chinese Academy of Sciences; Key Laboratory of Intelligent Information Processing; Key Laboratory of AI Safety, Chinese Academy of Sciences; University of Chinese Academy of Sciences
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments: Preprint. Project: this https URL

Abstract:Real-time, intelligent, and natural speech interaction is an essential part of the next-generation human-computer interaction. Recent advancements have showcased the potential of building intelligent spoken chatbots based on large language models (LLMs). In this paper, we introduce LLaMA-Omni 2, a series of speech language models (SpeechLMs) ranging from 0.5B to 14B parameters, capable of achieving high-quality real-time speech interaction. LLaMA-Omni 2 is built upon the Qwen2.5 series models, integrating a speech encoder and an autoregressive streaming speech decoder. Despite being trained on only 200K multi-turn speech dialogue samples, LLaMA-Omni 2 demonstrates strong performance on several spoken question answering and speech instruction following benchmarks, surpassing previous state-of-the-art SpeechLMs like GLM-4-Voice, which was trained on millions of hours of speech data.

[NLP-15] Automatic Proficiency Assessment in L2 English Learners

[Quick Read]: This paper addresses automated assessment of second-language (L2) English proficiency, which is traditionally evaluated perceptually by English teachers or expert raters and thus carries inherent intra- and inter-rater variability. The key to the solution is to apply deep learning to both the speech signal and its transcription: architectures including a 2D CNN, a frequency-based CNN, ResNet, and a pretrained wav2vec 2.0 model predict spoken proficiency, while a fine-tuned BERT language model handles text-based L2 assessment under resource constraints. For the harder task of spontaneous dialogue assessment, wav2vec 2.0 and BERT are applied separately to long-form audio and speaker interactions, demonstrating the potential of deep learning, especially pretrained wav2vec 2.0, for robust automated L2 proficiency evaluation.

Link: https://arxiv.org/abs/2505.02615
Authors: Armita Mohammadi, Alessandro Lameiras Koerich, Laureano Moro-Velazquez, Patrick Cardinal
Affiliations: École de Technologie Supérieure, Université du Québec; Johns Hopkins University
Subjects: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments: 6 pages

Abstract:Second language proficiency (L2) in English is usually perceptually evaluated by English teachers or expert evaluators, with the inherent intra- and inter-rater variability. This paper explores deep learning techniques for comprehensive L2 proficiency assessment, addressing both the speech signal and its correspondent transcription. We analyze spoken proficiency classification prediction using diverse architectures, including 2D CNN, frequency-based CNN, ResNet, and a pretrained wav2vec 2.0 model. Additionally, we examine text-based proficiency assessment by fine-tuning a BERT language model within resource constraints. Finally, we tackle the complex task of spontaneous dialogue assessment, managing long-form audio and speaker interactions through separate applications of wav2vec 2.0 and BERT models. Results from experiments on EFCamDat and ANGLISH datasets and a private dataset highlight the potential of deep learning, especially the pretrained wav2vec 2.0 model, for robust automated L2 proficiency evaluation.

[NLP-16] Ensemble Kalman filter for uncertainty in human language comprehension

[Quick Read]: This paper addresses the gap between the deterministic behavior of traditional artificial neural networks (ANNs) in sentence-processing models and human comprehension, which manages uncertainty under ambiguous or unexpected input. The key to the solution is a Bayesian framework for sentence comprehension that applies an extension of the ensemble Kalman filter (EnKF) for Bayesian inference to quantify uncertainty. By framing language comprehension as a Bayesian inverse problem, the approach improves the classic Sentence Gestalt (SG) model's representation of uncertainty, bringing it closer to human cognitive processing.

Link: https://arxiv.org/abs/2505.02590
Authors: Diksha Bhandari, Alessandro Lopopolo, Milena Rabovsky, Sebastian Reich
Affiliations: University of Potsdam
Subjects: Computation and Language (cs.CL); Applications (stat.AP); Machine Learning (stat.ML)
Comments:

Abstract:Artificial neural networks (ANNs) are widely used in modeling sentence processing but often exhibit deterministic behavior, contrasting with human sentence comprehension, which manages uncertainty during ambiguous or unexpected inputs. This is exemplified by reversal anomalies-sentences with unexpected role reversals that challenge syntax and semantics-highlighting the limitations of traditional ANN models, such as the Sentence Gestalt (SG) Model. To address these limitations, we propose a Bayesian framework for sentence comprehension, applying an extension of the ensemble Kalman filter (EnKF) for Bayesian inference to quantify uncertainty. By framing language comprehension as a Bayesian inverse problem, this approach enhances the SG model’s ability to reflect human sentence processing with respect to the representation of uncertainty. Numerical experiments and comparisons with maximum likelihood estimation (MLE) demonstrate that Bayesian methods improve uncertainty representation, enabling the model to better approximate human cognitive processing when dealing with linguistic ambiguities.
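
The EnKF analysis step the paper builds on has a standard form. A minimal numpy sketch for a linear observation operator is shown below; the state dimension, ensemble size, and noise levels are arbitrary illustrative choices.

```python
import numpy as np

def enkf_update(ensemble: np.ndarray, H: np.ndarray, y: np.ndarray,
                obs_cov: np.ndarray, rng) -> np.ndarray:
    """One stochastic EnKF analysis step.
    ensemble: (n_members, state_dim) forecast ensemble."""
    X = ensemble
    A = X - X.mean(axis=0)                          # ensemble anomalies
    n = X.shape[0]
    P = A.T @ A / (n - 1)                           # ensemble covariance
    S = H @ P @ H.T + obs_cov
    K = P @ H.T @ np.linalg.inv(S)                  # Kalman gain
    # Perturbed observations, one draw per ensemble member.
    Y = y + rng.multivariate_normal(np.zeros(len(y)), obs_cov, size=n)
    innovations = Y - X @ H.T
    return X + innovations @ K.T

rng = np.random.default_rng(0)
ens = rng.normal(size=(50, 4))                      # 50 members, 4-dim state
H = np.array([[1.0, 0.0, 0.0, 0.0]])                # observe first component
y = np.array([0.7])
R = np.array([[0.1]])
print(enkf_update(ens, H, y, R, rng).mean(axis=0))  # posterior ensemble mean
```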

[NLP-17] EMORL: Ensemble Multi-Objective Reinforcement Learning for Efficient and Flexible LLM Fine-Tuning SIGDIAL2025

[Quick Read]: This paper addresses the multi-objective challenges of fine-tuning large language models (LLMs) with reinforcement learning, including complex objective balancing, low training efficiency, poor scalability, and limited explainability. The key to the solution is an Ensemble Multi-Objective RL (EMORL) framework grounded in ensemble-learning principles, which fine-tunes multiple models on individual objectives and optimizes their aggregation after training to improve efficiency and flexibility. It is the first method to aggregate the last hidden states of the individual models, incorporating contextual information from multiple objectives, and it uses a hierarchical grid search to find the optimal weighted combination.

Link: https://arxiv.org/abs/2505.02579
Authors: Lingxiao Kong (1), Cong Yang (2), Susanne Neufang (3), Oya Deniz Beyan (1,3), Zeyd Boukhers (1,3) ((1) Fraunhofer Institute for Applied Information Technology FIT, (2) Soochow University, (3) University Hospital of Cologne)
Affiliations: Fraunhofer Institute for Applied Information Technology FIT; Soochow University; University Hospital of Cologne
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 13 pages, 9 figures, submitted to SIGDIAL 2025 conference

Abstract:Recent advances in reinforcement learning (RL) for large language model (LLM) fine-tuning show promise in addressing multi-objective tasks but still face significant challenges, including complex objective balancing, low training efficiency, poor scalability, and limited explainability. Leveraging ensemble learning principles, we introduce an Ensemble Multi-Objective RL (EMORL) framework that fine-tunes multiple models with individual objectives while optimizing their aggregation after the training to improve efficiency and flexibility. Our method is the first to aggregate the last hidden states of individual models, incorporating contextual information from multiple objectives. This approach is supported by a hierarchical grid search algorithm that identifies optimal weighted combinations. We evaluate EMORL on counselor reflection generation tasks, using text-scoring LLMs to evaluate the generations and provide rewards during RL fine-tuning. Through comprehensive experiments on the PAIR and Psych8k datasets, we demonstrate the advantages of EMORL against existing baselines: significantly lower and more stable training consumption (17,529 ± 1,650 data points and 6,573 ± 147.43 seconds), improved scalability and explainability, and comparable performance across multiple objectives.
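
A minimal sketch of the post-training aggregation idea follows: a weighted sum of per-objective models' last hidden states, with a coarse search over the weight simplex. The arrays and scoring function are stand-ins, and the single-level grid here only approximates the paper's hierarchical grid search.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
# Last hidden states from three single-objective models for one batch:
# shape (n_models, batch, hidden_dim).
hidden = rng.normal(size=(3, 16, 64))

def combined_score(h: np.ndarray) -> float:
    # Placeholder for the text-scoring LLM reward over generations
    # decoded from the aggregated state h.
    return -abs(float(h.mean()))

best_w, best_s = None, -np.inf
steps = np.linspace(0.0, 1.0, 11)
for w1, w2 in itertools.product(steps, steps):
    if w1 + w2 > 1.0:
        continue                                 # stay on the weight simplex
    w = np.array([w1, w2, 1.0 - w1 - w2])
    agg = np.tensordot(w, hidden, axes=1)        # weighted sum of hidden states
    s = combined_score(agg)
    if s > best_s:
        best_w, best_s = w, s

print("best weights:", best_w)
```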

[NLP-18] Bielik v3 Small: Technical Report

[Quick Read]: This paper addresses how to build efficient, high-performing generative AI models for lower-resource languages such as Polish under resource constraints. The key innovations are a custom Polish tokenizer (APT4) that improves token efficiency, a Weighted Instruction Cross-Entropy Loss that balances training across instruction types, and an Adaptive Learning Rate that adjusts dynamically with training progress. These allow relatively small models (1.5B and 4.5B parameters) to match or exceed much larger models on multiple benchmarks while requiring substantially less compute.

Link: https://arxiv.org/abs/2505.02550
Authors: Krzysztof Ociepa, Łukasz Flis, Remigiusz Kinas, Krzysztof Wróbel, Adrian Gwoździej
Affiliations: SpeakLeash; ACK Cyfronet AGH; Jagiellonian University; Azurro; Enelpol
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:We introduce Bielik v3, a series of parameter-efficient generative text models (1.5B and 4.5B) optimized for Polish language processing. These models demonstrate that smaller, well-optimized architectures can achieve performance comparable to much larger counterparts while requiring substantially fewer computational resources. Our approach incorporates several key innovations: a custom Polish tokenizer (APT4) that significantly improves token efficiency, Weighted Instruction Cross-Entropy Loss to balance learning across instruction types, and Adaptive Learning Rate that dynamically adjusts based on training progress. Trained on a meticulously curated corpus of 292 billion tokens spanning 303 million documents, these models excel across multiple benchmarks, including the Open PL LLM Leaderboard, Complex Polish Text Understanding Benchmark, Polish EQ-Bench, and Polish Medical Leaderboard. The 4.5B parameter model achieves results competitive with models 2-3 times its size, while the 1.5B model delivers strong performance despite its extremely compact profile. These advances establish new benchmarks for parameter-efficient language modeling in less-represented languages, making high-quality Polish language AI more accessible for resource-constrained applications.
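
A weighted instruction cross-entropy of the kind described can be sketched in a few lines of PyTorch: per-example quality weights scale each example's mean token loss. The shapes and weight values are made up, and Bielik's exact weighting scheme may differ.

```python
import torch
import torch.nn.functional as F

def weighted_instruction_ce(logits, targets, example_weights, pad_id=0):
    """logits: (batch, seq, vocab); targets: (batch, seq);
    example_weights: (batch,) quality-based weight per training example."""
    batch, seq, vocab = logits.shape
    token_loss = F.cross_entropy(
        logits.reshape(-1, vocab), targets.reshape(-1),
        ignore_index=pad_id, reduction="none",
    ).reshape(batch, seq)
    mask = (targets != pad_id).float()
    # Mean token loss per example, then scaled by its quality weight.
    per_example = (token_loss * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1)
    return (example_weights * per_example).mean()

logits = torch.randn(4, 10, 100)
targets = torch.randint(1, 100, (4, 10))
weights = torch.tensor([1.0, 0.5, 2.0, 1.0])  # assumed per-example weights
print(weighted_instruction_ce(logits, targets, weights))
```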

[NLP-19] Bemba Speech Translation: Exploring a Low-Resource African Language

[Quick Read]: This paper addresses low-resource speech translation (Bemba-to-English), where data scarcity limits model performance. The key to the solution is a cascaded speech translation system built on Whisper and NLLB-200, combined with data augmentation techniques such as back-translation to improve performance in the low-resource setting; the paper also examines the effect of synthetic data on system performance.

Link: https://arxiv.org/abs/2505.02518
Authors: Muhammad Hazim Al Farouq, Aman Kassahun Wassie, Yasmin Moslem
Affiliations: Kreasof AI; African Institute for Mathematical Sciences; ADAPT Centre
Subjects: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments: IWSLT 2025

Abstract:This paper describes our system submission to the International Conference on Spoken Language Translation (IWSLT 2025), low-resource languages track, namely for Bemba-to-English speech translation. We built cascaded speech translation systems based on Whisper and NLLB-200, and employed data augmentation techniques, such as back-translation. We investigate the effect of using synthetic data and discuss our experimental setup.

[NLP-20] Data Augmentation With Back translation for Low Resource languages: A case of English and Luganda

[Quick Read]: This paper addresses the performance bottleneck in neural machine translation (NMT) for low-resource language pairs such as English-Luganda, caused by the scarcity of bilingual data. The key to the solution is back-translation (BT), a semi-supervised technique that generates synthetic data from monolingual corpora to compensate for the lack of bilingual data. The study proposes iterative and incremental back-translation, strategically selecting multiple small datasets for incremental BT, which significantly improves translation quality beyond all previous benchmarks and demonstrates the effectiveness of BT in low-resource settings.

Link: https://arxiv.org/abs/2505.02463
Authors: Richard Kimera, Dongnyeong Heo, Daniela N. Rim, Heeyoul Choi
Affiliations: Mbarara University of Science and Technology; Handong Global University
Subjects: Computation and Language (cs.CL)
Comments: NLPIR '24: Proceedings of the 2024 8th International Conference on Natural Language Processing and Information Retrieval

Abstract:In this paper, we explore the application of Back translation (BT) as a semi-supervised technique to enhance Neural Machine Translation (NMT) models for the English-Luganda language pair, specifically addressing the challenges faced by low-resource languages. The purpose of our study is to demonstrate how BT can mitigate the scarcity of bilingual data by generating synthetic data from monolingual corpora. Our methodology involves developing custom NMT models using both publicly available and web-crawled data, and applying Iterative and Incremental Back translation techniques. We strategically select datasets for incremental back translation across multiple small datasets, which is a novel element of our approach. The results of our study show significant improvements, with translation performance for the English-Luganda pair exceeding previous benchmarks by more than 10 BLEU score units across all translation directions. Additionally, our evaluation incorporates comprehensive assessment metrics such as SacreBLEU, ChrF2, and TER, providing a nuanced understanding of translation quality. The conclusion drawn from our research confirms the efficacy of BT when strategically curated datasets are utilized, establishing new performance benchmarks and demonstrating the potential of BT in enhancing NMT models for low-resource languages.
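
The iterative back-translation loop itself is compact; the sketch below shows the round structure with trivial stub "models", since the real systems are full NMT models retrained on each round's real-plus-synthetic data.

```python
# Sketch of iterative back-translation: each round, a target->source
# model back-translates monolingual target text into synthetic pairs,
# and the source->target model is retrained on real + synthetic data.
# Both "models" here are trivial dictionary stubs.
def translate(model: dict, sentence: str) -> str:
    return model.get(sentence, sentence)  # stub lookup "translation"

def train(parallel_pairs: list[tuple[str, str]]) -> dict:
    return dict(parallel_pairs)           # stub "training"

real_pairs = [("hello", "oli otya")]      # English -> Luganda
mono_luganda = ["webale", "oli otya"]     # monolingual target-side text

en_to_lug = train(real_pairs)
lug_to_en = train([(t, s) for s, t in real_pairs])

for round_idx in range(3):                # iterative BT rounds
    synthetic = [(translate(lug_to_en, t), t) for t in mono_luganda]
    en_to_lug = train(real_pairs + synthetic)   # incremental retraining
    lug_to_en = train([(t, s) for s, t in real_pairs + synthetic])

print(en_to_lug)
```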

[NLP-21] Incentivizing Inclusive Contributions in Model Sharing Markets

[Quick Read]: This paper addresses how to effectively exploit massive decentralized private data as public data nears exhaustion, overcoming the privacy sensitivity of raw data and the lack of incentive mechanisms. The key to the solution is inclusive and incentivized personalized federated learning (iPFL), which builds a model-sharing market through graph-based training optimization and incorporates a game-theoretic incentive mechanism, so that data holders can collaboratively train personalized models without revealing raw data.

Link: https://arxiv.org/abs/2505.02462
Authors: Enpei Zhang, Jingyi Chai, Rui Ye, Yanfeng Wang, Siheng Chen
Affiliations: Shanghai Jiao Tong University
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Science and Game Theory (cs.GT)
Comments:

Abstract:While data plays a crucial role in training contemporary AI models, it is acknowledged that valuable public data will be exhausted in a few years, directing the world’s attention towards the massive decentralized private data. However, the privacy-sensitive nature of raw data and lack of incentive mechanism prevent these valuable data from being fully exploited. Addressing these challenges, this paper proposes inclusive and incentivized personalized federated learning (iPFL), which incentivizes data holders with diverse purposes to collaboratively train personalized models without revealing raw data. iPFL constructs a model-sharing market by solving a graph-based training optimization and incorporates an incentive mechanism based on game theory principles. Theoretical analysis shows that iPFL adheres to two key incentive properties: individual rationality and truthfulness. Empirical studies on eleven AI tasks (e.g., large language models’ instruction-following tasks) demonstrate that iPFL consistently achieves the highest economic utility, and better or comparable model performance compared to baseline methods. We anticipate that our iPFL can serve as a valuable technique for boosting future AI models on decentralized private data while making everyone satisfied.

[NLP-22] Colombian Waitresses y Jueces canadienses: Gender and Country Biases in Occupation Recommendations from LLM s

[Quick Read]: This paper addresses the measurement and mitigation of multilingual, intersectional country and gender biases in NLP, focusing on stereotypical biases in occupation recommendations generated by large language models (LLMs). The key to the solution is a benchmark of prompts in English, Spanish, and German that systematically varies country and gender; a suite of Llama-based models is evaluated on this benchmark, revealing significant gender and country biases and showing that intersectional biases persist even when fairness holds along a single axis. The study also finds that the prompting language significantly affects bias, and that instruction-tuned models exhibit the lowest and most stable bias levels.

Link: https://arxiv.org/abs/2505.02456
Authors: Elisa Forcada Rodríguez, Olatz Perez-de-Viñaspre, Jon Ander Campos, Dietrich Klakow, Vagrant Gautam
Affiliations: Saarland University; HiTZ Center - Ixa, University of the Basque Country (UPV/EHU); Cohere
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:One of the goals of fairness research in NLP is to measure and mitigate stereotypical biases that are propagated by NLP systems. However, such work tends to focus on single axes of bias (most often gender) and the English language. Addressing these limitations, we contribute the first study of multilingual intersecting country and gender biases, with a focus on occupation recommendations generated by large language models. We construct a benchmark of prompts in English, Spanish and German, where we systematically vary country and gender, using 25 countries and four pronoun sets. Then, we evaluate a suite of 5 Llama-based models on this benchmark, finding that LLMs encode significant gender and country biases. Notably, we find that even when models show parity for gender or country individually, intersectional occupational biases based on both country and gender persist. We also show that the prompting language significantly affects bias, and instruction-tuned models consistently demonstrate the lowest and most stable levels of bias. Our findings highlight the need for fairness researchers to use intersectional and multilingual lenses in their work.
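
The systematic variation described (countries crossed with pronoun sets per prompt language) is easy to picture as a prompt grid; the template and the shortened value lists below are toy stand-ins for the paper's 25 countries and four pronoun sets.

```python
import itertools

# Illustrative benchmark construction: cross countries with pronoun sets
# for one prompt language. Template and value lists are toy stand-ins.
templates = {
    "en": "My friend from {country} is looking for a job. {pronoun} asked "
          "me what occupations to consider. Suggest three.",
}
countries = ["Colombia", "Canada", "Germany"]   # the paper uses 25 countries
pronoun_sets = ["He", "She", "They", "Xe"]      # four pronoun sets, as in the paper

prompts = [
    templates["en"].format(country=c, pronoun=p)
    for c, p in itertools.product(countries, pronoun_sets)
]
print(len(prompts), prompts[0])
```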

[NLP-23] Bielik 11B v2 Technical Report

[Quick Read]: This paper addresses insufficient language-model performance and resource efficiency for Polish text processing, particularly among less-represented languages. The key to the solution is depth up-scaling of the Mistral 7B v0.2 architecture to 11B parameters, together with two core technical innovations: a Weighted Instruction Cross-Entropy Loss and an Adaptive Learning Rate, which improve learning across diverse instruction tasks and enable dynamic adaptation, yielding higher parameter efficiency and cross-lingual capability.

Link: https://arxiv.org/abs/2505.02410
Authors: Krzysztof Ociepa, Łukasz Flis, Krzysztof Wróbel, Adrian Gwoździej, Remigiusz Kinas
Affiliations: SpeakLeash; ACK Cyfronet AGH; Jagiellonian University; Azurro; Enelpol
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:We present Bielik 11B v2, a state-of-the-art language model optimized for Polish text processing. Built on the Mistral 7B v0.2 architecture and scaled to 11B parameters using depth up-scaling, this model demonstrates exceptional performance across Polish language benchmarks while maintaining strong cross-lingual capabilities. We introduce two key technical innovations: Weighted Instruction Cross-Entropy Loss, which optimizes learning across diverse instruction types by assigning quality-based weights to training examples, and Adaptive Learning Rate, which dynamically adjusts based on context length. Comprehensive evaluation across multiple benchmarks demonstrates that Bielik 11B v2 outperforms many larger models, including those with 2-6 times more parameters, and significantly surpasses other specialized Polish language models on tasks ranging from linguistic understanding to complex reasoning. The model’s parameter efficiency and extensive quantization options enable deployment across various hardware configurations, advancing Polish language AI capabilities and establishing new benchmarks for resource-efficient language modeling in less-represented languages.

[NLP-24] Optimizing Chain-of-Thought Reasoners via Gradient Variance Minimization in Rejection Sampling and RL

[Quick Read]: This paper addresses inefficient stochastic gradient estimation in chain-of-thought (CoT) training caused by static sampling strategies, which slows convergence and hurts performance. The key to the GVM-RAFT solution is to allocate computation dynamically, adjusting the sampling strategy on the fly using prompt acceptance rates and stochastic gradient norms so as to minimize gradient variance under a compute-budget constraint, thereby improving convergence speed and accuracy.

Link: https://arxiv.org/abs/2505.02391
Authors: Jiarui Yao, Yifan Hao, Hanning Zhang, Hanze Dong, Wei Xiong, Nan Jiang, Tong Zhang
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:Chain-of-thought (CoT) reasoning in large language models (LLMs) can be formalized as a latent variable problem, where the model needs to generate intermediate reasoning steps. While prior approaches such as iterative reward-ranked fine-tuning (RAFT) have relied on such formulations, they typically apply uniform inference budgets across prompts, which fails to account for variability in difficulty and convergence behavior. This work identifies the main bottleneck in CoT training as inefficient stochastic gradient estimation due to static sampling strategies. We propose GVM-RAFT, a prompt-specific Dynamic Sample Allocation Strategy designed to minimize stochastic gradient variance under a computational budget constraint. The method dynamically allocates computational resources by monitoring prompt acceptance rates and stochastic gradient norms, ensuring that the resulting gradient variance is minimized. Our theoretical analysis shows that the proposed dynamic sampling strategy leads to accelerated convergence guarantees under suitable conditions. Experiments on mathematical reasoning show that GVM-RAFT achieves a 2-4x speedup and considerable accuracy improvements over vanilla RAFT. The proposed dynamic sampling strategy is general and can be incorporated into other reinforcement learning algorithms, such as GRPO, leading to similar improvements in convergence and test accuracy. Our code is available at this https URL.
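
As a rough illustration of budget reallocation driven by acceptance rates and gradient norms, consider the sketch below. The specific priority formula is an assumption chosen for illustration; the paper derives its own variance-minimizing allocation rule.

```python
import numpy as np

def allocate_budget(accept_rate, grad_norm, total_budget):
    """Give more rollouts to prompts with large gradient norms and low
    acceptance rates, normalized to the overall compute budget.
    This priority rule is an illustrative assumption, not GVM-RAFT's exact one."""
    accept_rate = np.clip(accept_rate, 1e-3, 1.0)
    priority = grad_norm / np.sqrt(accept_rate)
    shares = priority / priority.sum()
    # At least one rollout per prompt, the rest proportional to priority.
    return np.maximum(1, np.round(shares * total_budget).astype(int))

accept = np.array([0.9, 0.3, 0.05, 0.5])  # per-prompt acceptance rates
gnorm = np.array([1.0, 2.0, 1.5, 0.5])    # per-prompt gradient-norm estimates
print(allocate_budget(accept, gnorm, total_budget=64))
```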

[NLP-25] RM-R1: Reward Modeling as Reasoning

[Quick Read]: This paper addresses the lack of interpretability in traditional reward models (RMs) for aligning large language models (LLMs) with human preferences: existing RMs either produce opaque scalar scores or directly predict the preferred answer, making it hard to integrate natural-language critiques. The key to the solution is a new class of generative reward models, Reasoning Reward Models (ReasRMs), which cast reward modeling as a reasoning task and are trained in two key stages: distillation of high-quality reasoning chains, followed by reinforcement learning with verifiable rewards, improving both interpretability and performance.

Link: https://arxiv.org/abs/2505.02387
Authors: Xiusi Chen, Gaotang Li, Ziqi Wang, Bowen Jin, Cheng Qian, Yu Wang, Hongru Wang, Yu Zhang, Denghui Zhang, Tong Zhang, Hanghang Tong, Heng Ji
Affiliations: University of Illinois Urbana-Champaign; University of California, San Diego; Texas A&M University; Stevens Institute of Technology
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 23 pages, 7 figures

Abstract:Reward modeling is essential for aligning large language models (LLMs) with human preferences, especially through reinforcement learning from human feedback (RLHF). To provide accurate reward signals, a reward model (RM) should stimulate deep thinking and conduct interpretable reasoning before assigning a score or a judgment. However, existing RMs either produce opaque scalar scores or directly generate the prediction of a preferred answer, making them struggle to integrate natural language critiques, thus lacking interpretability. Inspired by recent advances of long chain-of-thought (CoT) on reasoning-intensive tasks, we hypothesize and validate that integrating reasoning capabilities into reward modeling significantly enhances RM’s interpretability and performance. In this work, we introduce a new class of generative reward models – Reasoning Reward Models (ReasRMs) – which formulate reward modeling as a reasoning task. We propose a reasoning-oriented training pipeline and train a family of ReasRMs, RM-R1. The training consists of two key stages: (1) distillation of high-quality reasoning chains and (2) reinforcement learning with verifiable rewards. RM-R1 improves LLM rollouts by self-generating reasoning traces or chat-specific rubrics and evaluating candidate responses against them. Empirically, our models achieve state-of-the-art or near state-of-the-art performance of generative RMs across multiple comprehensive reward model benchmarks, outperforming much larger open-weight models (e.g., Llama3.1-405B) and proprietary ones (e.g., GPT-4o) by up to 13.8%. Beyond final performance, we perform thorough empirical analysis to understand the key ingredients of successful ReasRM training. To facilitate future research, we release six ReasRM models along with code and data at this https URL.

[NLP-26] JTCSE: Joint Tensor-Modulus Constraints and Cross-Attention for Unsupervised Contrastive Learning of Sentence Embeddings

[Quick Read]: This paper addresses two issues in unsupervised contrastive learning of sentence embeddings: insufficient constraint on the modulus of semantic representation tensors, and BERT-like models' lack of attention to the CLS token. The key to the solution is a training objective that imposes modulus constraints to strengthen alignment between positive samples, plus a cross-attention structure between twin-tower ensemble models that increases attention to the CLS token and improves the quality of CLS pooling, yielding JTCSE, a joint tensor-modulus-constraint and cross-attention framework for unsupervised contrastive sentence embeddings.

Link: https://arxiv.org/abs/2505.02366
Authors: Tianyu Zong, Hongzhu Yi, Bingkang Shi, Yuanxiang Wang, Jungang Xu
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Unsupervised contrastive learning has become a hot research topic in natural language processing. Existing works usually aim at constraining the orientation distribution of the representations of positive and negative samples in the high-dimensional semantic space, but the semantic representation tensor possesses both modulus and orientation features, and existing works ignore the modulus feature of the representations, causing insufficient contrastive learning. Therefore, we first propose a training objective that is designed to impose modulus constraints on the semantic representation tensor, to strengthen the alignment between positive samples in contrastive learning. Then, the BERT-like model suffers from the phenomenon of sinking attention, leading to a lack of attention to CLS tokens that aggregate semantic information. In response, we propose a cross-attention structure among the twin-tower ensemble models to enhance the model’s attention to the CLS token and optimize the quality of CLS Pooling. Combining the above two motivations, we propose a new Joint Tensor representation modulus constraint and Cross-attention unsupervised contrastive learning Sentence Embedding representation framework, JTCSE, which we evaluate in seven semantic text similarity computation tasks; the experimental results show that JTCSE’s twin-tower ensemble model and single-tower distillation model outperform the other baselines and become the current SOTA. In addition, we have conducted an extensive zero-shot downstream task evaluation, which shows that JTCSE outperforms other baselines overall on more than 130 tasks.
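
One plausible form of a modulus constraint on positive pairs is a norm-gap penalty added to a standard InfoNCE orientation term, sketched below; this is an illustrative guess at the shape of such an objective, not necessarily JTCSE's exact loss.

```python
import torch
import torch.nn.functional as F

def modulus_constrained_loss(h1, h2, tau=0.05, lam=0.1):
    """h1, h2: (batch, dim) embeddings of two views of the same sentences.
    InfoNCE aligns orientations; the added term aligns the moduli (norms)
    of positive pairs. The weight lam is an assumption."""
    sim = F.cosine_similarity(h1.unsqueeze(1), h2.unsqueeze(0), dim=-1) / tau
    labels = torch.arange(h1.size(0))
    info_nce = F.cross_entropy(sim, labels)          # orientation alignment
    modulus_gap = (h1.norm(dim=-1) - h2.norm(dim=-1)).abs().mean()
    return info_nce + lam * modulus_gap              # plus modulus alignment

h1, h2 = torch.randn(16, 128), torch.randn(16, 128)
print(modulus_constrained_loss(h1, h2))
```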

[NLP-27] SIMPLEMIX: Frustratingly Simple Mixing of Off- and On-policy Data in Language Model Preference Learning ICML2025

[Quick Read]: This paper addresses how to align language models with human preferences, and in particular how to exploit the complementary strengths of on-policy and off-policy data in preference learning. The key to the solution is SIMPLEMIX, which simply mixes on-policy and off-policy preference data, leveraging their respective advantages on different tasks to substantially improve language model alignment.

Link: https://arxiv.org/abs/2505.02363
Authors: Tianjian Li, Daniel Khashabi
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments: To appear in ICML 2025

Abstract:Aligning language models with human preferences relies on pairwise preference datasets. While some studies suggest that on-policy data consistently outperforms off-policy data for preference learning, others indicate that the advantages of on-policy data may be task-dependent, highlighting the need for a systematic exploration of their interplay. In this work, we show that on-policy and off-policy data offer complementary strengths in preference optimization: on-policy data is particularly effective for reasoning tasks like math and coding, while off-policy data performs better on open-ended tasks such as creative writing and making personal recommendations. Guided by these findings, we introduce SIMPLEMIX, an approach to combine the complementary strengths of on-policy and off-policy preference learning by simply mixing these two data sources. Our empirical results across diverse tasks and benchmarks demonstrate that SIMPLEMIX substantially improves language model alignment. Specifically, SIMPLEMIX improves upon on-policy DPO and off-policy DPO by an average of 6.03% on Alpaca Eval 2.0. Moreover, it outperforms prior approaches that are much more complex in combining on- and off-policy data, such as HyPO and DPO-Mix-P, by an average of 3.05%.

[NLP-28] Invoke Interfaces Only When Needed: Adaptive Invocation for Large Language Models in Question Answering

[Quick Read]: This paper addresses how to pinpoint precisely when a large language model should be invoked once hallucinations arise while a small language model generates, since previous optimizations relied on post-processing detached from the reasoning process, incurring high computational cost with limited effect. The key to the solution is AttenHScore, a practical invocation-evaluation metric that computes the accumulation and propagation of hallucinations during the small LM's generation, continuously amplifying potential reasoning errors; by dynamically adjusting the detection threshold, it achieves more accurate real-time invocation of large LMs, coupling hallucination detection tightly with generation to improve real-time detection while lowering overhead.

Link: https://arxiv.org/abs/2505.02311
Authors: Jihao Zhao, Chunlai Zhou, Biao Qin
Affiliations: Renmin University of China
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:The collaborative paradigm of large and small language models (LMs) effectively balances performance and cost, yet its pivotal challenge lies in precisely pinpointing the moment of invocation when hallucinations arise in small LMs. Previous optimization efforts primarily focused on post-processing techniques, which were separate from the reasoning process of LMs, resulting in high computational costs and limited effectiveness. In this paper, we propose a practical invocation evaluation metric called AttenHScore, which calculates the accumulation and propagation of hallucinations during the generation process of small LMs, continuously amplifying potential reasoning errors. By dynamically adjusting the detection threshold, we achieve more accurate real-time invocation of large LMs. Additionally, considering the limited reasoning capacity of small LMs, we leverage uncertainty-aware knowledge reorganization to assist them better capture critical information from different text chunks. Extensive experiments reveal that our AttenHScore outperforms most baseline in enhancing real-time hallucination detection capabilities across multiple QA datasets, especially when addressing complex queries. Moreover, our strategies eliminate the need for additional model training and display flexibility in adapting to various transformer-based LMs.
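
The abstract does not spell out the AttenHScore formula, so the sketch below only illustrates the general pattern it describes: accumulate a per-token uncertainty signal while the small LM decodes, and invoke the large LM once a threshold is crossed. The decayed-entropy score here is an assumption, not the paper's metric.

```python
import numpy as np

def token_entropy(probs: np.ndarray) -> float:
    p = probs[probs > 0]
    return float(-(p * np.log(p)).sum())

def should_invoke_large_lm(step_probs, threshold=2.5, decay=0.9):
    """Accumulate (with decay) per-step uncertainty while the small LM
    decodes; trigger the large LM when the running score exceeds the
    threshold. Placeholder for the paper's AttenHScore metric."""
    score = 0.0
    for probs in step_probs:
        score = decay * score + token_entropy(probs)
        if score > threshold:
            return True
    return False

rng = np.random.default_rng(0)
# Fake next-token distributions for 20 decoding steps over a 50-token vocab.
steps = rng.dirichlet(np.full(50, 0.3), size=20)
print(should_invoke_large_lm(steps))
```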

[NLP-29] Optimizing LLM s for Resource-Constrained Environments: A Survey of Model Compression Techniques

[Quick Read]: This paper addresses the difficulty of deploying Large Language Models (LLMs) on mobile and edge devices due to their heavy resource requirements. The key to the solution is model compression for efficient inference, chiefly knowledge distillation, model quantization, and model pruning, which reduce compute and memory overhead and make LLMs viable in resource-constrained environments.

Link: https://arxiv.org/abs/2505.02309
Authors: Sanjay Surendranath Girija, Shashank Kapoor, Lakshit Arora, Dipen Pradhan, Aman Raj, Ankit Shetgaonkar
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Accepted to IEEE COMPSAC 2025

Abstract:Large Language Models (LLMs) have revolutionized many areas of artificial intelligence (AI), but their substantial resource requirements limit their deployment on mobile and edge devices. This survey paper provides a comprehensive overview of techniques for compressing LLMs to enable efficient inference in resource-constrained environments. We examine three primary approaches: Knowledge Distillation, Model Quantization, and Model Pruning. For each technique, we discuss the underlying principles, present different variants, and provide examples of successful applications. We also briefly discuss complementary techniques such as mixture-of-experts and early-exit strategies. Finally, we highlight promising future directions, aiming to provide a valuable resource for both researchers and practitioners seeking to optimize LLMs for edge deployment.
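
As a concrete instance of the first technique surveyed, knowledge distillation is commonly implemented as a temperature-softened KL term mixed with the usual hard-label cross-entropy. A minimal PyTorch sketch follows; the temperature and mixing weight are typical but arbitrary choices.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=2.0, alpha=0.5):
    """Hinton-style KD: alpha * soft-target KL + (1 - alpha) * hard CE.
    The T^2 factor keeps gradient magnitudes comparable across temperatures."""
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    log_soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    kd = F.kl_div(log_soft_student, soft_teacher,
                  reduction="batchmean") * temperature ** 2
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1 - alpha) * ce

student = torch.randn(8, 1000)   # student logits for a batch of 8
teacher = torch.randn(8, 1000)   # teacher logits (from the frozen model)
labels = torch.randint(0, 1000, (8,))
print(distillation_loss(student, teacher, labels))
```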
zh
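
以综述中提到的知识蒸馏为例,其最常见的损失形式是“软目标 KL 散度 + 硬标签交叉熵”,可写成如下 PyTorch 示意;温度 T 与权重 alpha 为常用的假设超参,并非该综述给出的具体配置。

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """常见的知识蒸馏损失:学生拟合教师的软分布,同时保留硬标签监督。"""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)                                   # 温度平方项校正梯度尺度
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard
```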

[NLP-30] Generative Sign-description Prompts with Multi-positive Contrastive Learning for Sign Language Recognition

【速读】: 该论文旨在解决手语识别(Sign Language Recognition, SLR)中由于同时存在手动和非手动信号而导致的准确标注难题。其关键解决方案是提出一种新型的生成式手语描述提示多正对比学习方法(GSP-MC),该方法结合了检索增强生成(RAG)与领域特定的大语言模型(LLMs),通过多步骤提示工程和专家验证的手语语料库生成精确的多部分描述,并采用双编码器架构通过概率匹配双向对齐层次化骨骼特征与多种文本描述(全局、同义词和部分级别)。

链接: https://arxiv.org/abs/2505.02304
作者: Siyu Liang,Yunan Li,Wentian Xin,Huizhou Chen,Xujie Liu,Kang Liu,Qiguang Miao
机构: Xidian University(西安电子科技大学); Dalian Maritime University(大连海事大学)
类目: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注: 9 pages, 6 figures

点击查看摘要

Abstract:Sign language recognition (SLR) faces fundamental challenges in creating accurate annotations due to the inherent complexity of simultaneous manual and non-manual signals. To the best of our knowledge, this is the first work to integrate generative large language models (LLMs) into SLR tasks. We propose a novel Generative Sign-description Prompts Multi-positive Contrastive learning (GSP-MC) method that leverages retrieval-augmented generation (RAG) with domain-specific LLMs, incorporating multi-step prompt engineering and expert-validated sign language corpora to produce precise multipart descriptions. The GSP-MC method also employs a dual-encoder architecture to bidirectionally align hierarchical skeleton features with multiple text descriptions (global, synonym, and part level) through probabilistic matching. Our approach combines global and part-level losses, optimizing KL divergence to ensure robust alignment across all relevant text-skeleton pairs while capturing both sign-level semantics and detailed part dynamics. Experiments demonstrate state-of-the-art performance against existing methods on the Chinese SLR500 (reaching 97.1%) and Turkish AUTSL datasets (97.07% accuracy). The method's cross-lingual effectiveness highlights its potential for developing inclusive communication technologies.
zh
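
“多正样本对比学习”一类损失可用如下 PyTorch 示意理解:一个骨骼特征同时与多条文本描述(全局/同义词/部位级)对齐。张量形状与接口均为假设,仅说明多正样本 InfoNCE 的计算方式,并非论文原始实现。

```python
import torch
import torch.nn.functional as F

def multi_positive_nce(skel_emb, text_emb, pos_mask, tau=0.07):
    """多正样本 InfoNCE 示意:skel_emb [B, D];text_emb [N, D];
    pos_mask [B, N] 标记每个样本的全部正例文本(接口为假设)。"""
    skel = F.normalize(skel_emb, dim=-1)
    text = F.normalize(text_emb, dim=-1)
    pos_mask = pos_mask.float()
    logits = skel @ text.t() / tau                                  # 相似度矩阵 [B, N]
    log_prob = logits - torch.logsumexp(logits, dim=1, keepdim=True)
    pos_log_prob = (log_prob * pos_mask).sum(1) / pos_mask.sum(1).clamp(min=1)
    return -pos_log_prob.mean()                                     # 所有正例平均对数似然取负
```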

[NLP-31] Demystifying optimized prompts in language models

【速读】: 该论文试图解决现代语言模型(Language Models, LMs)在面对分布外输入时缺乏鲁棒性的问题,以及如何通过机器生成的“优化”提示(optimized prompts)来调节模型输出并诱导特定行为。解决方案的关键在于分析优化提示的组成及其在模型内部的解析机制,发现优化提示主要由训练数据中较为罕见的标点符号和名词标记构成,并且在模型激活的稀疏子集上与自然语言提示存在明显差异。此外,研究还表明不同指令调优模型家族中的优化提示在表示形成路径上具有相似性。

链接: https://arxiv.org/abs/2505.02273
作者: Rimon Melamed,Lucas H. McCabe,H. Howie Huang
机构: The George Washington University (乔治华盛顿大学); LMI Consulting
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Modern language models (LMs) are not robust to out-of-distribution inputs. Machine generated ("optimized") prompts can be used to modulate LM outputs and induce specific behaviors while appearing completely uninterpretable. In this work, we investigate the composition of optimized prompts, as well as the mechanisms by which LMs parse and build predictions from optimized prompts. We find that optimized prompts primarily consist of punctuation and noun tokens which are more rare in the training data. Internally, optimized prompts are clearly distinguishable from natural language counterparts based on sparse subsets of the model's activations. Across various families of instruction-tuned models, optimized prompts follow a similar path in how their representations form through the network.
zh

[NLP-32] Parameter-Efficient Transformer Embeddings

【速读】: 该论文试图解决基于Transformer的自然语言处理模型中嵌入层参数量过大但性能提升不显著的问题(Embedding layers in transformer-based NLP models)。其解决方案的关键在于:首先通过傅里叶展开从令牌ID直接确定性地生成词向量,随后使用一个轻量级多层感知机(MLP)捕捉高阶交互关系,从而在减少参数数量的同时保持模型性能。

链接: https://arxiv.org/abs/2505.02266
作者: Henry Ndubuaku,Mouad Talhi
机构: Cactus Compute, Inc. (Cactus Compute, Inc.); Department of Computing, Imperial College (计算系,帝国理工学院)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 7 pages, 2 tables. Code available at this https URL

点击查看摘要

Abstract:Embedding layers in transformer-based NLP models typically account for the largest share of model parameters, scaling with vocabulary size but not yielding performance gains proportional to scale. We propose an alternative approach in which token embedding vectors are first generated deterministically, directly from the token IDs using a Fourier expansion of their normalized values, followed by a lightweight multilayer perceptron (MLP) that captures higher-order interactions. We train standard transformers and our architecture on natural language inference tasks (SNLI and MNLI), and evaluate zero-shot performance on sentence textual similarity (STS-B). Our results demonstrate that the proposed method achieves competitive performance using significantly fewer parameters, trains faster, and operates effectively without the need for dropout. This proof-of-concept study highlights the potential for scalable, memory-efficient language models and motivates further large-scale experimentation based on our findings.
zh
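
按摘要描述的思路,可以写出如下最小示意:先由归一化 token ID 的傅里叶展开确定性地生成特征,再用轻量 MLP 捕捉高阶交互。频率个数、MLP 结构等细节均为假设超参,仅供理解这种参数高效嵌入的构造方式。

```python
import torch
import torch.nn as nn

class FourierEmbedding(nn.Module):
    """确定性傅里叶展开 + 轻量 MLP 的嵌入层示意(频率数与 MLP 结构为假设)。"""
    def __init__(self, vocab_size, dim, n_freq=64):
        super().__init__()
        self.vocab_size = vocab_size
        self.register_buffer("freqs", torch.arange(1, n_freq + 1).float())
        self.mlp = nn.Sequential(
            nn.Linear(2 * n_freq, dim), nn.GELU(), nn.Linear(dim, dim))

    def forward(self, token_ids):                    # token_ids: [B, L] 整数
        x = token_ids.float() / self.vocab_size      # 将 token ID 归一化到 [0, 1)
        angles = 2 * torch.pi * x.unsqueeze(-1) * self.freqs
        feats = torch.cat([angles.sin(), angles.cos()], dim=-1)
        return self.mlp(feats)                       # [B, L, dim],无需查表式嵌入矩阵
```

与按词表规模缩放的查表嵌入不同,这里的参数量只取决于 n_freq 与 dim,这正是该方法“参数高效”的来源。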

[NLP-33] Personalisation or Prejudice? Addressing Geographic Bias in Hate Speech Detection using Debias Tuning in Large Language Models

【速读】: 该论文试图解决个性化信息融入上下文对大型语言模型(Large Language Models, LLMs)行为的影响问题,特别是在敏感领域如仇恨言论检测中的潜在偏差。其解决方案的关键在于通过惩罚在不同国家或语言特定上下文中产生的不一致的仇恨言论分类,对LLMs进行微调,从而减少不必要的偏见并提升模型在个性化和非个性化情境下的性能。

链接: https://arxiv.org/abs/2505.02252
作者: Paloma Piot,Patricia Martín-Rodilla,Javier Parapar
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Commercial Large Language Models (LLMs) have recently incorporated memory features to deliver personalised responses. This memory retains details such as user demographics and individual characteristics, allowing LLMs to adjust their behaviour based on personal information. However, the impact of integrating personalised information into the context has not been thoroughly assessed, leading to questions about its influence on LLM behaviour. Personalisation can be challenging, particularly with sensitive topics. In this paper, we examine various state-of-the-art LLMs to understand their behaviour in different personalisation scenarios, specifically focusing on hate speech. We prompt the models to assume country-specific personas and use different languages for hate speech detection. Our findings reveal that context personalisation significantly influences LLMs’ responses in this sensitive area. To mitigate these unwanted biases, we fine-tune the LLMs by penalising inconsistent hate speech classifications made with and without country or language-specific context. The refined models demonstrate improved performance in both personalised contexts and when no context is provided.
zh

[NLP-34] SEval-Ex: A Statement-Level Framework for Explainable Summarization Evaluation

【速读】: 该论文试图解决文本摘要质量评估中的关键挑战,即当前方法在性能与可解释性之间存在权衡。其解决方案的关键在于提出SEval-Ex框架,该框架通过将摘要评估分解为原子陈述(atomic statements),实现了高性能与可解释性的统一。该框架采用两阶段流程:首先利用大语言模型(LLM)从文本源和摘要中提取原子陈述,然后进行生成陈述的匹配,从而通过语句级别的对齐生成详细的评估证据。

链接: https://arxiv.org/abs/2505.02235
作者: Tanguy Herserant,Vincent Guigue
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Evaluating text summarization quality remains a critical challenge in Natural Language Processing. Current approaches face a trade-off between performance and interpretability. We present SEval-Ex, a framework that bridges this gap by decomposing summarization evaluation into atomic statements, enabling both high performance and explainability. SEval-Ex employs a two-stage pipeline: first extracting atomic statements from text source and summary using LLM, then a matching between generated statements. Unlike existing approaches that provide only summary-level scores, our method generates detailed evidence for its decisions through statement-level alignments. Experiments on the SummEval benchmark demonstrate that SEval-Ex achieves state-of-the-art performance with a 0.580 correlation with human consistency judgments, surpassing GPT-4 based evaluators (0.521) while maintaining interpretability. Finally, our framework shows robustness against hallucination.
zh
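
SEval-Ex 的两阶段流程可以用如下 Python 骨架来理解:先用 LLM 抽取原子陈述,再逐条做陈述级匹配并输出证据。其中 llm 接口与提示词均为假设,并非论文原文。

```python
def seval_ex(source, summary, llm):
    """SEval-Ex 两阶段骨架:llm(prompt) -> str 的接口与提示词均为假设。"""
    extract = "将下面文本拆分为原子陈述,每行一条:\n{}"
    src_stmts = [s for s in llm(extract.format(source)).splitlines() if s.strip()]
    sum_stmts = [s for s in llm(extract.format(summary)).splitlines() if s.strip()]
    verdicts = []
    for stmt in sum_stmts:                         # 陈述级对齐,产生可解释证据
        prompt = ("判断该陈述是否被原文陈述支持,仅回答 支持 或 不支持。\n"
                  f"原文陈述:{src_stmts}\n待验证陈述:{stmt}")
        verdicts.append(llm(prompt).strip() == "支持")
    consistency = sum(verdicts) / max(len(verdicts), 1)
    evidence = list(zip(sum_stmts, verdicts))      # (陈述, 是否受支持)
    return consistency, evidence
```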

[NLP-35] Interpretable Emergent Language Using Inter-Agent Transformers

【速读】: 该论文试图解决多智能体强化学习(MARL)中语言出现的问题,特别是现有方法如RIAL、DIAL和CommNet在实现智能体间通信时缺乏可解释性。解决方案的关键在于提出可微分的智能体间Transformer(DIAT),该方法利用自注意力机制学习符号化且人类可理解的通信协议,从而实现观察信息到可解释词汇和有意义嵌入的编码,有效解决协作任务。

链接: https://arxiv.org/abs/2505.02215
作者: Mannan Bhardwaj
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:This paper explores the emergence of language in multi-agent reinforcement learning (MARL) using transformers. Existing methods such as RIAL, DIAL, and CommNet enable agent communication but lack interpretability. We propose Differentiable Inter-Agent Transformers (DIAT), which leverage self-attention to learn symbolic, human-understandable communication protocols. Through experiments, DIAT demonstrates the ability to encode observations into interpretable vocabularies and meaningful embeddings, effectively solving cooperative tasks. These results highlight the potential of DIAT for interpretable communication in complex multi-agent environments.
zh

[NLP-36] DNAZEN: Enhanced Gene Sequence Representations via Mixed Granularities of Coding Units

【速读】: 该论文试图解决传统基因序列建模方法在处理基因序列时未能充分考虑其内在信息组织结构的问题,特别是不同粒度单元对表示学习的贡献。解决方案的关键在于提出DNAZEN框架,该框架通过从大规模基因组语料库中提取G-grams(由多个连续聚合物组成的组合),并利用基于Transformer的G-gram编码器进行动态匹配与表示学习,从而实现对基因序列多粒度信息的有效捕捉和整合。

链接: https://arxiv.org/abs/2505.02206
作者: Lei Mao,Yuanhe Tian,Yan Song
机构: Origin Omics (起源组学); University of Washington (华盛顿大学); University of Science and Technology of China (中国科学技术大学)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 19 pages, 3 figures

点击查看摘要

Abstract:Genome modeling conventionally treats gene sequence as a language, reflecting its structured motifs and long-range dependencies analogous to linguistic units and organization principles such as words and syntax. Recent studies utilize advanced neural networks, ranging from convolutional and recurrent models to Transformer-based models, to capture contextual information of gene sequence, with the primary goal of obtaining effective gene sequence representations and thus enhance the models’ understanding of various running gene samples. However, these approaches often directly apply language modeling techniques to gene sequences and do not fully consider the intrinsic information organization in them, where they do not consider how units at different granularities contribute to representation. In this paper, we propose DNAZEN, an enhanced genomic representation framework designed to learn from various granularities in gene sequences, including small polymers and G-grams that are combinations of several contiguous polymers. Specifically, we extract the G-grams from large-scale genomic corpora through an unsupervised approach to construct the G-gram vocabulary, which is used to provide G-grams in the learning process of DNA sequences through dynamically matching from running gene samples. A Transformer-based G-gram encoder is also proposed and the matched G-grams are fed into it to compute their representations and integrated into the encoder for basic unit (E4BU), which is responsible for encoding small units and maintaining the learning and inference process. To further enhance the learning process, we propose whole G-gram masking to train DNAZEN, where the model largely favors the selection of each entire G-gram to mask rather than an ordinary masking mechanism performed on basic units. Experiments on benchmark datasets demonstrate the effectiveness of DNAZEN on various downstream tasks.
zh
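
其中“从运行样本中动态匹配 G-gram”一步,可用如下贪心最长匹配的玩具示例来理解;匹配策略与数据结构均为假设,真实实现还需配合 G-gram 编码器与 E4BU。

```python
def match_g_grams(tokens, g_vocab, max_len=8):
    """在基本单元序列上贪心匹配 G-gram(策略为假设,仅作示意)。
    tokens: 基本单元列表;g_vocab: 预先从大规模语料中抽取的 G-gram 集合。"""
    matches = []
    i = 0
    while i < len(tokens):
        for l in range(min(max_len, len(tokens) - i), 1, -1):  # 优先匹配较长组合
            cand = tuple(tokens[i:i + l])
            if cand in g_vocab:
                matches.append((i, cand))  # 记录起点,供 G-gram 编码器计算表示
                break
        i += 1                             # 步进 1,允许重叠匹配
    return matches

g_vocab = {("AT", "CG"), ("AT", "CG", "GC")}              # 仅作演示的玩具词表
print(match_g_grams(["AT", "CG", "GC", "TA"], g_vocab))   # [(0, ('AT', 'CG', 'GC'))]
```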

[NLP-37] Exploring new Approaches for Information Retrieval through Natural Language Processing

【速读】: 该论文试图解决信息检索(Information Retrieval, IR)在自然语言处理(Natural Language Processing, NLP)中的应用问题,旨在提升检索的准确性、可扩展性及伦理考量。其解决方案的关键在于综合传统IR模型与现代技术,如深度学习、强化学习以及预训练Transformer模型(例如BERT),并结合高效的文本索引与搜索工具(如Lucene、Anserini和Pyserini),通过对比分析稀疏、密集和混合检索方法,探索其在多种应用场景中的有效性。

链接: https://arxiv.org/abs/2505.02199
作者: Manak Raj,Nidhi Mishra
机构: 未知
类目: Information Retrieval (cs.IR); Computation and Language (cs.CL)
备注: 12 pages, 4 figures, comprehensive literature review covering six key IR-NLP papers, plus keywords and full reference list

点击查看摘要

Abstract:This review paper explores recent advancements and emerging approaches in Information Retrieval (IR) applied to Natural Language Processing (NLP). We examine traditional IR models such as Boolean, vector space, probabilistic, and inference network models, and highlight modern techniques including deep learning, reinforcement learning, and pretrained transformer models like BERT. We discuss key tools and libraries - Lucene, Anserini, and Pyserini - for efficient text indexing and search. A comparative analysis of sparse, dense, and hybrid retrieval methods is presented, along with applications in web search engines, cross-language IR, argument mining, private information retrieval, and hate speech detection. Finally, we identify open challenges and future research directions to enhance retrieval accuracy, scalability, and ethical considerations.
zh

[NLP-38] Measuring Hong Kong Massive Multi-Task Language Understanding

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在跨文化应用中对多语言理解能力不足的问题,特别是在香港独特的语言环境中,即传统中文书写系统与粤语口语及文化背景的结合。解决方案的关键在于提出HKMMLU,这是一个多任务语言理解基准,用于评估香港的语言能力和社会文化知识,包含26,698道多选题和90,550个普通话-粤语翻译任务,以全面测试LLMs的多语言理解能力。

链接: https://arxiv.org/abs/2505.02177
作者: Chuxue Cao,Zhenghao Zhu,Junqi Zhu,Guoying Lu,Siyu Peng,Juntao Dai,Weijie Shi,Sirui Han,Yike Guo
机构: Hong Kong University of Science and Technology (香港科技大学); Peking University (北京大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Multilingual understanding is crucial for the cross-cultural applicability of Large Language Models (LLMs). However, evaluation benchmarks designed for Hong Kong’s unique linguistic landscape, which combines Traditional Chinese script with Cantonese as the spoken form and its cultural context, remain underdeveloped. To address this gap, we introduce HKMMLU, a multi-task language understanding benchmark that evaluates Hong Kong’s linguistic competence and socio-cultural knowledge. The HKMMLU includes 26,698 multi-choice questions across 66 subjects, organized into four categories: Science, Technology, Engineering, and Mathematics (STEM), Social Sciences, Humanities, and Other. To evaluate the multilingual understanding ability of LLMs, 90,550 Mandarin-Cantonese translation tasks were additionally included. We conduct comprehensive experiments on GPT-4o, Claude 3.7 Sonnet, and 18 open-source LLMs of varying sizes on HKMMLU. The results show that the best-performing model, DeepSeek-V3, struggles to achieve an accuracy of 75%, significantly lower than that of MMLU and CMMLU. This performance gap highlights the need to improve LLMs’ capabilities in Hong Kong-specific language and knowledge domains. Furthermore, we investigate how question language, model size, prompting strategies, and question and reasoning token lengths affect model performance. We anticipate that HKMMLU will significantly advance the development of LLMs in multilingual and cross-cultural contexts, thereby enabling broader and more impactful applications.
zh

[NLP-39] Identifying Legal Holdings with LLMs: A Systematic Study of Performance, Scale, and Memorization

【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在法律任务中的性能评估问题,具体是通过CaseHOLD这一法律基准数据集来衡量模型识别案件结论的能力。其解决方案的关键在于验证模型性能是否随着参数规模的增加而提升,并证明模型的表现并非源于对训练数据中司法意见的机械记忆。研究通过引入一种新颖的引用匿名化测试,在保持语义一致性的前提下使案例名称和引用变得虚构,从而有效排除了记忆效应,结果显示模型在该测试下仍保持较高的性能,表明其具备一定的泛化能力和真正的理解水平。

链接: https://arxiv.org/abs/2505.02172
作者: Chuck Arvin
机构: USC Gould School of Law (南加州大学法学院)
类目: Computation and Language (cs.CL)
备注: Presented as a short paper at International Conference on Artificial Intelligence and Law 2025 (Chicago, IL)

点击查看摘要

Abstract:As large language models (LLMs) continue to advance in capabilities, it is essential to assess how they perform on established benchmarks. In this study, we present a suite of experiments to assess the performance of modern LLMs (ranging from 3B to 90B+ parameters) on CaseHOLD, a legal benchmark dataset for identifying case holdings. Our experiments demonstrate "scaling effects" - performance on this task improves with model size, with more capable models like GPT-4o and Amazon Nova Pro achieving macro F1 scores of 0.744 and 0.720 respectively. These scores are competitive with the best published results on this dataset, and do not require any technically sophisticated model training, fine-tuning or few-shot prompting. To ensure that these strong results are not due to memorization of judicial opinions contained in the training data, we develop and utilize a novel citation anonymization test that preserves semantic meaning while ensuring case names and citations are fictitious. Models maintain strong performance under these conditions (macro F1 of 0.728), suggesting the performance is not due to rote memorization. These findings demonstrate both the promise and current limitations of LLMs for legal tasks with important implications for the development and measurement of automated legal analytics and legal benchmarks.
zh

[NLP-40] A New HOPE: Domain-agnostic Automatic Evaluation of Text Chunking SIGIR25

【速读】: 该论文试图解决文档分块(document chunking)对检索增强生成(Retrieval-Augmented Generation, RAG)系统性能影响的评估问题,目前缺乏一个系统的框架来分析不同分块方法的效果。其解决方案的关键在于提出HOPE(Holistic Passage Evaluation)评估指标,该指标从内在段落属性、外在段落属性和段落-文档一致性三个层面定义分块过程的核心特征,并通过自动化的量化与聚合方式评估这些特征,从而为优化分块策略提供依据。

链接: https://arxiv.org/abs/2505.02171
作者: Henrik Brådland,Morten Goodwin,Per-Arne Andersen,Alexander S. Nossum,Aditya Gupta
机构: Centre for Artificial Intelligence Research, University of Agder; Norkart AS
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 10 pages, To be published in SIGIR25

点击查看摘要

Abstract:Document chunking fundamentally impacts Retrieval-Augmented Generation (RAG) by determining how source materials are segmented before indexing. Despite evidence that Large Language Models (LLMs) are sensitive to the layout and structure of retrieved data, there is currently no framework to analyze the impact of different chunking methods. In this paper, we introduce a novel methodology that defines essential characteristics of the chunking process at three levels: intrinsic passage properties, extrinsic passage properties, and passages-document coherence. We propose HOPE (Holistic Passage Evaluation), a domain-agnostic, automatic evaluation metric that quantifies and aggregates these characteristics. Our empirical evaluations across seven domains demonstrate that the HOPE metric correlates significantly (p < 0.13) with various RAG performance indicators, revealing contrasts between the importance of extrinsic and intrinsic properties of passages. Semantic independence between passages proves essential for system performance with a performance gain of up to 56.2% in factual correctness and 21.1% in answer correctness. On the contrary, traditional assumptions about maintaining concept unity within passages show minimal impact. These findings provide actionable insights for optimizing chunking strategies, thus improving RAG system design to produce more factually correct responses.
zh
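
HOPE “量化并聚合三层特征”的思路可用如下 NumPy 草图直观理解:用段落嵌入近似外在属性(段落间语义独立性)与段落-文档一致性,再做简单平均。具体特征定义与聚合权重均为假设,并非论文公式。

```python
import numpy as np

def hope_sketch(passage_embs, doc_emb):
    """HOPE 三层特征的极简草图:嵌入假设已 L2 归一化;等权聚合为假设。"""
    P = np.asarray(passage_embs)                    # [n, d] 段落嵌入
    d = np.asarray(doc_emb)                         # [d]   全文嵌入
    n = len(P)
    sim = P @ P.T                                   # 段落间余弦相似度
    off_diag = (sim.sum() - np.trace(sim)) / max(n * (n - 1), 1)
    independence = 1.0 - off_diag                   # 外在属性:段落间语义独立性
    coherence = float((P @ d).mean())               # 段落-文档一致性
    intrinsic = 1.0                                 # 内在属性(长度、完整性等)此处置常数占位
    return float(np.mean([independence, coherence, intrinsic]))
```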

[NLP-41] Incorporating Legal Structure in Retrieval-Augmented Generation: A Case Study on Copyright Fair Use

【速读】: 该论文试图解决在美国版权法中的合理使用原则(Fair Use Doctrine)领域,内容创作者面临因DMCA下架请求而缺乏可访问法律支持的问题。其解决方案的关键在于构建一种针对特定领域的检索增强生成(Retrieval-Augmented Generation, RAG)实现,通过结合语义搜索、法律知识图谱和法院引用网络,提升检索质量和推理可靠性。该方法在法规因素层级(如目的、性质、数量、市场影响)上对法律先例进行建模,并利用加权引用图表示优先考虑具有权威性的法律来源,同时采用思维链推理和交错检索步骤以更好地模拟法律推理过程。

链接: https://arxiv.org/abs/2505.02164
作者: Justin Ho,Alexandra Colby,William Fisher
机构: Harvard Business School (哈佛商学院); Harvard Law School (哈佛法学院)
类目: Computation and Language (cs.CL)
备注: Submitted to the 7th Workshop on Automated Semantic Analysis of Information in Legal Text. 8 pages, 5 Figures

点击查看摘要

Abstract:This paper presents a domain-specific implementation of Retrieval-Augmented Generation (RAG) tailored to the Fair Use Doctrine in U.S. copyright law. Motivated by the increasing prevalence of DMCA takedowns and the lack of accessible legal support for content creators, we propose a structured approach that combines semantic search with legal knowledge graphs and court citation networks to improve retrieval quality and reasoning reliability. Our prototype models legal precedents at the statutory factor level (e.g., purpose, nature, amount, market effect) and incorporates citation-weighted graph representations to prioritize doctrinally authoritative sources. We use Chain-of-Thought reasoning and interleaved retrieval steps to better emulate legal reasoning. Preliminary testing suggests this method improves doctrinal relevance in the retrieval process, laying groundwork for future evaluation and deployment of LLM-based legal assistance tools.
zh

[NLP-42] Think on your Feet: Adaptive Thinking via Reinforcement Learning for Social Agents

【速读】: 该论文旨在解决现有语言代理在社会智能模拟中缺乏动态调整推理深度的能力问题,这一能力在当前方法中明显缺失,导致推理能力不足或统一的长链式推理策略,从而造成过多的token消耗和不恰当的社会模拟。论文提出的解决方案是自适应模式学习(Adaptive Mode Learning, AML),其核心创新为自适应模式策略优化(Adaptive Mode Policy Optimization, AMPO)算法,该算法通过多粒度推理模式设计、面向社交互动的上下文感知模式切换以及基于深度自适应处理的token高效推理,实现了更接近人类的自适应推理能力。

链接: https://arxiv.org/abs/2505.02156
作者: Minzheng Wang,Yongbin Li,Haobo Wang,Xinghua Zhang,Nan Xu,Bingli Wu,Fei Huang,Haiyang Yu,Wenji Mao
机构: MAIS, Institute of Automation, Chinese Academy of Sciences (MAIS,自动化研究所,中国科学院); School of Artificial Intelligence, University of Chinese Academy of Sciences (人工智能学院,中国科学院大学); Tongyi Lab, Alibaba Group (通义实验室,阿里巴巴集团); Peking University (北京大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: The code and data are available, see this https URL . arXiv admin note: text overlap with arXiv:2502.15538 by other authors

点击查看摘要

Abstract:Effective social intelligence simulation requires language agents to dynamically adjust reasoning depth, a capability notably absent in current approaches. Existing methods either lack this kind of reasoning capability or enforce uniform long chain-of-thought reasoning across all scenarios, resulting in excessive token usage and inappropriate social simulation. In this paper, we propose Adaptive Mode Learning (AML) that strategically selects from four thinking modes (intuitive reaction → deep contemplation) based on real-time context. Our framework's core innovation, the Adaptive Mode Policy Optimization (AMPO) algorithm, introduces three key advancements over existing methods: (1) Multi-granular thinking mode design, (2) Context-aware mode switching across social interaction, and (3) Token-efficient reasoning via depth-adaptive processing. Extensive experiments on social intelligence tasks confirm that AML achieves 15.6% higher task performance than state-of-the-art methods. Notably, our method outperforms GRPO by 7.0% with 32.8% shorter reasoning chains. These results demonstrate that context-sensitive thinking mode selection, as implemented in AMPO, enables more human-like adaptive reasoning than GRPO's fixed-depth approach.
zh

[NLP-43] QiMeng-Xpiler: Transcompiling Tensor Programs for Deep Learning Systems with a Neural-Symbolic Approach OSDI2025

【速读】: 该论文旨在解决异构深度学习系统(DLS)中跨平台张量程序编译的难题,即如何实现“Write Once, Run Anywhere”的目标,以减轻编程负担。当前的转换编译技术面临手动工作量大或功能不正确的问题。论文提出的解决方案是开发一个名为QiMeng-Xpiler的新型转换编译器,其关键在于结合大型语言模型(LLMs)和符号程序合成(symbolic program synthesis)进行神经符号合成,利用LLMs强大的代码生成能力使基于搜索的符号合成在计算上可行。

链接: https://arxiv.org/abs/2505.02146
作者: Shouyang Dong,Yuanbo Wen,Jun Bi,Di Huang,Jiaming Guo,Jianxing Xu,Ruibai Xu,Xinkai Song,Yifan Hao,Xuehai Zhou,Tianshi Chen,Qi Guo,Yunji Chen
机构: University of Science and Technology of China (中国科学技术大学); Cambricon Technologies (寒武纪科技); SKL of Processors, Institute of Computing Technology, Chinese Academy of Sciences (中国科学院计算技术研究所处理器重点实验室); Institute of Software, Chinese Academy of Sciences (中国科学院软件研究所)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG); Programming Languages (cs.PL)
备注: Accepted to OSDI 2025

点击查看摘要

Abstract:Heterogeneous deep learning systems (DLS) such as GPUs and ASICs have been widely deployed in industrial data centers, which requires to develop multiple low-level tensor programs for different platforms. An attractive solution to relieve the programming burden is to transcompile the legacy code of one platform to others. However, current transcompilation techniques struggle with either tremendous manual efforts or functional incorrectness, rendering "Write Once, Run Anywhere" of tensor programs an open question. We propose a novel transcompiler, i.e., QiMeng-Xpiler, for automatically translating tensor programs across DLS via both large language models (LLMs) and symbolic program synthesis, i.e., neural-symbolic synthesis. The key insight is leveraging the powerful code generation ability of LLM to make costly search-based symbolic synthesis computationally tractable. Concretely, we propose multiple LLM-assisted compilation passes via pre-defined meta-prompts for program transformation. During each program transformation, efficient symbolic program synthesis is employed to repair incorrect code snippets with a limited scale. To attain high performance, we propose a hierarchical auto-tuning approach to systematically explore both the parameters and sequences of transformation passes. Experiments on 4 DLS with distinct programming interfaces, i.e., Intel DL Boost with VNNI, NVIDIA GPU with CUDA, AMD MI with HIP, and Cambricon MLU with BANG, demonstrate that QiMeng-Xpiler correctly translates different tensor programs at the accuracy of 95% on average, and the performance of translated programs achieves up to 2.0x over vendor-provided manually-optimized libraries. As a result, the programming productivity of DLS is improved by up to 96.0x via transcompiling legacy tensor programs.
zh

[NLP-44] Exploring the Potential of Offline RL for Reasoning in LLMs: A Preliminary Study

【速读】: 该论文试图解决大规模语言模型(Large Language Models, LLMs)在长上下文推理中依赖计算成本高且复杂的在线强化学习(Online Reinforcement Learning, RL)方法的问题,而探索更简单经济的离线强化学习(Offline RL)方法的潜力。解决方案的关键在于评估离线RL方法,特别是直接偏好优化(Direct Preference Optimization, DPO)及其长度无关变体LD-DPO,在提升LLMs推理能力方面的有效性,并通过实验验证其性能提升,尤其是在Arena-Hard基准上的显著改进。此外,研究还强调了输出长度与语义丰富性之间的平衡对模型性能的重要性。

链接: https://arxiv.org/abs/2505.02142
作者: Xiaoyu Tian,Sitong Zhao,Haotian Wang,Shuaiting Chen,Yiping Peng,Yunjie Ji,Han Zhao,Xiangang Li
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Despite significant advances in long-context reasoning by large language models (LLMs), primarily through Online Reinforcement Learning (RL) methods, these approaches incur substantial computational costs and complexity. In contrast, simpler and more economical Offline RL methods remain underexplored. To address this gap, we investigate the effectiveness of Offline RL methods, specifically Direct Preference Optimization (DPO) and its length-desensitized variant LD-DPO, in enhancing the reasoning capabilities of LLMs. Extensive experiments across multiple reasoning benchmarks demonstrate that these simpler Offline RL methods substantially improve model performance, achieving an average enhancement of 3.3%, with a particularly notable increase of 10.1% on the challenging Arena-Hard benchmark. Furthermore, we analyze DPO’s sensitivity to output length, emphasizing that increasing reasoning length should align with semantic richness, as indiscriminate lengthening may adversely affect model performance. We provide comprehensive descriptions of our data processing and training methodologies, offering empirical evidence and practical insights for developing more cost-effective Offline RL approaches.
zh
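
作为参照,论文所考察的 DPO 属于标准公式,其损失可写成如下 PyTorch 形式;LD-DPO 在此基础上对输出长度做去敏化处理,此处不展开。

```python
import torch.nn.functional as F

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """标准 DPO 损失:输入为策略模型与参考模型对偏好/非偏好回答的序列对数概率。"""
    logits = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    return -F.logsigmoid(logits).mean()   # 偏好回答的隐式奖励差越大,损失越小
```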

[NLP-45] Attention Mechanisms Perspective: Exploring LLM Processing of Graph-Structured Data ICML2025

【速读】: 该论文试图解决生成式 AI (Generative AI) 在处理图结构数据时的局限性问题,具体表现为注意力机制在捕捉图拓扑连接方面的不足。研究的核心在于从注意力机制的角度出发,探索大型语言模型(LLMs)如何处理图结构数据,并揭示其在建模节点间关系上的缺陷。解决方案的关键在于提出一种中间状态注意力窗口机制,该机制能够在训练过程中提升 LLM 的性能,并在推理阶段平滑过渡到全连接注意力窗口,从而更好地适应图结构的拓扑特性。

链接: https://arxiv.org/abs/2505.02130
作者: Zhong Guan,Likang Wu,Hongke Zhao,Ming He,Jianpin Fan
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: ICML2025 Accept

点击查看摘要

Abstract:Attention mechanisms are critical to the success of large language models (LLMs), driving significant advancements in multiple fields. However, for graph-structured data, which requires emphasis on topological connections, they fall short compared to message-passing mechanisms on fixed links, such as those employed by Graph Neural Networks (GNNs). This raises a question: "Does attention fail for graphs in natural language settings?" Motivated by these observations, we embarked on an empirical study from the perspective of attention mechanisms to explore how LLMs process graph-structured data. The goal is to gain deeper insights into the attention behavior of LLMs over graph structures. We uncovered unique phenomena regarding how LLMs apply attention to graph-structured data and analyzed these findings to improve the modeling of such data by LLMs. The primary findings of our research are: 1) While LLMs can recognize graph data and capture text-node interactions, they struggle to model inter-node relationships within graph structures due to inherent architectural constraints. 2) The attention distribution of LLMs across graph nodes does not align with ideal structural patterns, indicating a failure to adapt to graph topology nuances. 3) Neither fully connected attention nor fixed connectivity is optimal; each has specific limitations in its application scenarios. Instead, intermediate-state attention windows improve LLM training performance and seamlessly transition to fully connected windows during inference. Source code: this https URL (LLM4Exploration)
zh

[NLP-46] LLM-OptiRA: LLM-Driven Optimization of Resource Allocation for Non-Convex Problems in Wireless Communications

【速读】: 该论文旨在解决无线通信系统中非凸资源分配问题的求解难题,这类问题通常超出传统优化技术的能力范围。其解决方案的关键在于提出LLM-OptiRA框架,该框架首次利用大语言模型(Large Language Models, LLMs)自动检测并转换非凸成分,将其转化为可求解形式,从而实现对非凸资源分配问题的全自动化求解。

链接: https://arxiv.org/abs/2505.02091
作者: Xinyue Peng,Yanming Liu,Yihan Cang,Chaoqun Cao,Ming Chen
机构: National Mobile Communications Research Laboratory, Southeast University, Nanjing, China(国家移动通信研究实验室,东南大学,中国南京); Zhejiang University, Hangzhou, China(浙江大学,中国杭州); Pervasive Communication Research Center, Purple Mountain Laboratories, Nanjing, China(泛在通信研究中心,紫金山实验室,中国南京)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 6 pages,4 figures

点击查看摘要

Abstract:Solving non-convex resource allocation problems poses significant challenges in wireless communication systems, often beyond the capability of traditional optimization techniques. To address this issue, we propose LLM-OptiRA, the first framework that leverages large language models (LLMs) to automatically detect and transform non-convex components into solvable forms, enabling fully automated resolution of non-convex resource allocation problems in wireless communication systems. LLM-OptiRA not only simplifies problem-solving by reducing reliance on expert knowledge, but also integrates error correction and feasibility validation mechanisms to ensure robustness. Experimental results show that LLM-OptiRA achieves an execution rate of 96% and a success rate of 80% on GPT-4, significantly outperforming baseline approaches in complex optimization tasks across diverse scenarios.
zh

[NLP-47] LecEval: An Automated Metric for Multimodal Knowledge Acquisition in Multimedia Learning

【速读】: 该论文试图解决基于幻灯片的多媒体教学内容质量评估问题,现有方法如人工评估、基于参考的度量标准以及大型语言模型评估器在可扩展性、上下文捕捉或偏见方面存在局限。解决方案的关键在于引入LecEval,这是一个基于Mayer的多媒体学习认知理论的自动化度量标准,通过内容相关性(Content Relevance, CR)、表达清晰度(Expressive Clarity, EC)、逻辑结构(Logical Structure, LS)和受众参与度(Audience Engagement, AE)四个评分维度来评估多模态知识获取效果。

链接: https://arxiv.org/abs/2505.02078
作者: Joy Lim Jia Yin,Daniel Zhang-Li,Jifan Yu,Haoxuan Li,Shangqing Tu,Yuanchun Wang,Zhiyuan Liu,Huiqin Liu,Lei Hou,Juanzi Li,Bin Xu
机构: Tsinghua University(清华大学); Beihang University(北京航空航天大学); Renmin University of China(中国人民大学); Beijing China(北京中国)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 6 pages, 3 figures

点击查看摘要

Abstract:Evaluating the quality of slide-based multimedia instruction is challenging. Existing methods like manual assessment, reference-based metrics, and large language model evaluators face limitations in scalability, context capture, or bias. In this paper, we introduce LecEval, an automated metric grounded in Mayer’s Cognitive Theory of Multimedia Learning, to evaluate multimodal knowledge acquisition in slide-based learning. LecEval assesses effectiveness using four rubrics: Content Relevance (CR), Expressive Clarity (EC), Logical Structure (LS), and Audience Engagement (AE). We curate a large-scale dataset of over 2,000 slides from more than 50 online course videos, annotated with fine-grained human ratings across these rubrics. A model trained on this dataset demonstrates superior accuracy and adaptability compared to existing metrics, bridging the gap between automated and human assessments. We release our dataset and toolkits at this https URL.
zh

[NLP-48] What do Language Model Probabilities Represent? From Distribution Estimation to Response Prediction

【速读】: 该论文试图解决生成式 AI (Generative AI) 在语言建模过程中,分布估计与响应预测之间的目标冲突问题,以及由此导致的实验结果误读问题。其解决方案的关键在于识别并区分语言模型(LLMs)在不同训练阶段(预训练、上下文学习和偏好调优)所形成的三种不同的预期输出分布,并强调这些分布通常并不相似,从而为 LLMs 的解释和使用提供更坚实的理论基础。

链接: https://arxiv.org/abs/2505.02072
作者: Eitan Wagner,Omri Abend
机构: Hebrew University of Jerusalem (希伯来大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The notion of language modeling has gradually shifted in recent years from a distribution over finite-length strings to general-purpose prediction models for textual inputs and outputs, following appropriate alignment phases. This paper analyzes the distinction between distribution estimation and response prediction in the context of LLMs, and their often conflicting goals. We examine the training phases of LLMs, which include pretraining, in-context learning, and preference tuning, and also the common use cases for their output probabilities, which include completion probabilities and explicit probabilities as output. We argue that the different settings lead to three distinct intended output distributions. We demonstrate that NLP works often assume that these distributions should be similar, which leads to misinterpretations of their experimental findings. Our work sets firmer formal foundations for the interpretation of LLMs, which will inform ongoing work on the interpretation and use of LLMs’ induced distributions.
zh

[NLP-49] An overview of artificial intelligence in computer-assisted language learning

【速读】: 该论文试图解决如何将人工智能(Artificial Intelligence, AI)有效应用于计算机辅助语言学习(Computer-Assisted Language Learning, CALL)以支持语言学习和教学的问题。其解决方案的关键在于探索适用于CALL系统的AI方法,并通过跨学科协作构建整合不同领域研究成果的框架,从而推动更智能、高效和可持续的语言学习系统的发展。

链接: https://arxiv.org/abs/2505.02032
作者: Anisia Katinskaia
机构: University of Helsinki (赫尔辛基大学); Department of Computer Science (计算机科学系); Department of Digital Humanities (数字人文学系)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Computer-assisted language learning – CALL – is an established research field. We review how artificial intelligence can be applied to support language learning and teaching. The need for intelligent agents that assist language learners and teachers is increasing: the human teacher’s time is a scarce and costly resource, which does not scale with growing demand. Further factors contribute to the need for CALL: pandemics and increasing demand for distance learning, migration of large populations, the need for sustainable and affordable support for learning, etc. CALL systems are made up of many components that perform various functions, and AI is applied to many different aspects in CALL, corresponding to their own expansive research areas. Most of what we find in the research literature and in practical use are prototypes or partial implementations – systems that perform some aspects of the overall desired functionality. Complete solutions – most of them commercial – are few, because they require massive resources. Recent advances in AI should result in improvements in CALL, yet there is a lack of surveys that focus on AI in the context of this research field. This paper aims to present a perspective on the AI methods that can be employed for language learning from a position of a developer of a CALL system. We also aim to connect work from different disciplines, to build bridges for interdisciplinary work.
zh

[NLP-50] Towards Safer Pretraining: Analyzing and Filtering Harmful Content in Webscale datasets for Responsible LLMs

【速读】: 该论文旨在解决大规模语言模型(Large Language Models, LLMs)在预训练过程中可能受到有害内容污染的问题,这些问题包括仇恨言论、虚假信息和偏见叙事等,可能导致模型生成有毒行为、传播错误信息并加剧社会偏见。解决方案的关键在于提出一种全面的有害内容分类体系,将网页内容分为主题性(Topical)和有毒性(Toxic)两类,并引入高精度的Topical and Toxic Prompt(TTP)评估数据集、基于Transformer的HarmFormer过滤模型以及多危害开放性毒性基准(HAVOC),以提升模型对有害输入的鲁棒性和安全性。

链接: https://arxiv.org/abs/2505.02009
作者: Sai Krishna Mendu,Harish Yenala,Aditi Gulati,Shanu Kumar,Parag Agrawal
机构: Microsoft(微软)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Large language models (LLMs) have become integral to various real-world applications, leveraging massive, web-sourced datasets like Common Crawl, C4, and FineWeb for pretraining. While these datasets provide linguistic data essential for high-quality natural language generation, they often contain harmful content, such as hate speech, misinformation, and biased narratives. Training LLMs on such unfiltered data risks perpetuating toxic behaviors, spreading misinformation, and amplifying societal biases which can undermine trust in LLM-driven applications and raise ethical concerns about their use. This paper presents a large-scale analysis of inappropriate content across these datasets, offering a comprehensive taxonomy that categorizes harmful webpages into Topical and Toxic based on their intent. We also introduce a prompt evaluation dataset, a high-accuracy Topical and Toxic Prompt (TTP), and a transformer-based model (HarmFormer) for content filtering. Additionally, we create a new multi-harm open-ended toxicity benchmark (HAVOC) and provide crucial insights into how models respond to adversarial toxic inputs. Upon publishing, we will also open-source our model signal on the entire C4 dataset. Our work offers insights into ensuring safer LLM pretraining and serves as a resource for Responsible AI (RAI) compliance.
zh

[NLP-51] LLM-based Text Simplification and its Effect on User Comprehension and Cognitive Load

【速读】: 该论文试图解决网络上信息(如科学出版物和维基百科)往往超出用户阅读水平的问题,以提升信息的可理解性。其解决方案的关键在于采用自修正方法开发一种大语言模型(LLM)能力,用于实现低损耗的文本简化,从而在保持原意的前提下降低文本的复杂度,使更广泛的受众能够更好地学习和利用网络上的专家知识,提高信息的可访问性。

链接: https://arxiv.org/abs/2505.01980
作者: Theo Guidroz,Diego Ardila,Jimmy Li,Adam Mansour,Paul Jhun,Nina Gonzalez,Xiang Ji,Mike Sanchez,Sujay Kakarmath,Mathias MJ Bellaiche,Miguel Ángel Garrido,Faruk Ahmed,Divyansh Choudhary,Jay Hartford,Chenwei Xu,Henry Javier Serrano Echeverria,Yifan Wang,Jeff Shaffer,Eric (Yifan) Cao,Yossi Matias,Avinatan Hassidim,Dale R Webster,Yun Liu,Sho Fujiwara,Peggy Bui,Quang Duong
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Information on the web, such as scientific publications and Wikipedia, often surpasses users' reading level. To help address this, we used a self-refinement approach to develop a LLM capability for minimally lossy text simplification. To validate our approach, we conducted a randomized study involving 4563 participants and 31 texts spanning 6 broad subject areas: PubMed (biomedical scientific articles), biology, law, finance, literature/philosophy, and aerospace/computer science. Participants were randomized to viewing original or simplified texts in a subject area, and answered multiple-choice questions (MCQs) that tested their comprehension of the text. The participants were also asked to provide qualitative feedback such as task difficulty. Our results indicate that participants who read the simplified text answered more MCQs correctly than their counterparts who read the original text (3.9% absolute increase, p < 0.05). This gain was most striking with PubMed (14.6%), while more moderate gains were observed for finance (5.5%), aerospace/computer science (3.8%) domains, and legal (3.5%). Notably, the results were robust to whether participants could refer back to the text while answering MCQs. The absolute accuracy decreased by up to ~9% for both original and simplified setups where participants could not refer back to the text, but the ~4% overall improvement persisted. Finally, participants' self-reported perceived ease based on a simplified NASA Task Load Index was greater for those who read the simplified text (absolute change on a 5-point scale of 0.33, p < 0.05). This randomized study, involving an order of magnitude more participants than prior works, demonstrates the potential of LLMs to make complex information easier to understand. Our work aims to enable a broader audience to better learn and make use of expert knowledge available on the web, improving information accessibility.
zh
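
摘要所说的“自修正(self-refinement)式低损耗简化”大致对应如下迭代循环;llm、judge 的接口、提示词与停机阈值均为假设性示意,并非该工作的实际实现。

```python
def simplify(text, llm, judge, max_rounds=3):
    """自修正式简化循环:judge 返回 (可读性, 信息保真度),接口与阈值均为假设。"""
    draft = llm(f"请在不丢失信息的前提下简化下文:\n{text}")
    for _ in range(max_rounds):
        readability, fidelity = judge(text, draft)
        if readability > 0.8 and fidelity > 0.9:   # 两项都达标即停止(阈值为假设)
            return draft
        draft = llm(f"原文:\n{text}\n当前简化稿:\n{draft}\n"
                    "请修正:保留全部事实,同时进一步降低阅读难度。")
    return draft
```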

[NLP-52] Analyzing Cognitive Differences Among Large Language Models through the Lens of Social Worldview

【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在隐性形成和表达社会认知态度或“世界观”方面的问题,特别是针对权威、平等、自主性和命运等更广泛维度的探索不足。解决方案的关键在于引入社会世界观分类法(Social Worldview Taxonomy, SWT),该框架基于文化理论,将四种经典世界观(等级制、平等主义、个人主义、宿命论)转化为可测量的子维度,并通过实证分析揭示了28种不同LLMs之间的可解释认知特征。此外,研究还结合社会参照理论,验证了显性社会线索对这些认知态度的系统性影响,从而提升了LLMs的可解释性及其对社会反馈的响应能力。

链接: https://arxiv.org/abs/2505.01967
作者: Jiatao Li,Yanheng Li,Xiaojun Wan
机构: Wangxuan Institute of Computer Technology, Peking University (王选计算机技术研究所,北京大学); Information Management Department, Peking University (信息管理系,北京大学); Renmin University of China (中国人民大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) have become integral to daily life, widely adopted in communication, decision-making, and information retrieval, raising critical questions about how these systems implicitly form and express socio-cognitive attitudes or “worldviews”. While existing research extensively addresses demographic and ethical biases, broader dimensions-such as attitudes toward authority, equality, autonomy, and fate-remain under-explored. In this paper, we introduce the Social Worldview Taxonomy (SWT), a structured framework grounded in Cultural Theory, operationalizing four canonical worldviews (Hierarchy, Egalitarianism, Individualism, Fatalism) into measurable sub-dimensions. Using SWT, we empirically identify distinct and interpretable cognitive profiles across 28 diverse LLMs. Further, inspired by Social Referencing Theory, we experimentally demonstrate that explicit social cues systematically shape these cognitive attitudes, revealing both general response patterns and nuanced model-specific variations. Our findings enhance the interpretability of LLMs by revealing implicit socio-cognitive biases and their responsiveness to social feedback, thus guiding the development of more transparent and socially responsible language technologies.
zh

[NLP-53] A Comprehensive Analysis for Visual Object Hallucination in Large Vision-Language Models

【速读】: 该论文旨在解决大型视觉-语言模型(Large Vision-Language Models, LVLMs)中存在的视觉对象幻觉(visual object hallucination)问题,即模型在生成与视觉对象相关的信息时出现不准确的情况,可能导致信息误导并引发安全和可靠性方面的担忧。论文通过分析LLaVA类LVLMs的各个组件——大语言模型、视觉主干网络和投影器——来识别潜在的错误来源及其影响,并针对每个问题组件提出缓解方法。其解决方案的关键在于对模型各组成部分的深入分析以及针对性的改进策略。

链接: https://arxiv.org/abs/2505.01958
作者: Liqiang Jing,Guiming Hardy Chen,Ehsan Aghazadeh,Xin Eric Wang,Xinya Du
机构: University of Texas at Dallas (德克萨斯大学达拉斯分校); University of Massachusetts at Amherst (马萨诸塞大学阿默斯特分校); University of California, Santa Cruz (加利福尼亚大学圣克鲁兹分校)
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large Vision-Language Models (LVLMs) demonstrate remarkable capabilities in multimodal tasks, but visual object hallucination remains a persistent issue. It refers to scenarios where models generate inaccurate visual object-related information based on the query input, potentially leading to misinformation and concerns about safety and reliability. Previous works focus on the evaluation and mitigation of visual hallucinations, but the underlying causes have not been comprehensively investigated. In this paper, we analyze each component of LLaVA-like LVLMs – the large language model, the vision backbone, and the projector – to identify potential sources of error and their impact. Based on our observations, we propose methods to mitigate hallucination for each problematic component. Additionally, we developed two hallucination benchmarks: QA-VisualGenome, which emphasizes attribute and relation hallucinations, and QA-FB15k, which focuses on cognition-based hallucinations.
zh

[NLP-54] Explainability by design: an experimental analysis of the legal coding process

【速读】: 该论文试图解决如何将规范性文本编码为道义可废止逻辑(Deontic Defeasible Logic)规则的问题,其核心在于建立一种从文本片段到逻辑规则的映射过程,并通过情景测试确保编码的准确性。解决方案的关键在于提出一种法律编码方法,该方法不仅包含从文本到规则的转换,还引入了情景测试以验证编码的正确性,并利用Houdini技术进行逻辑推理,从而提升编码过程的可靠性和效率。

链接: https://arxiv.org/abs/2505.01944
作者: Matteo Cristani,Guido Governatori,Francesco Olivieri,Monica Palmirani,Gabriele Buriola
机构: University of Verona(维罗纳大学); Central Queensland University(中央昆士兰大学); University of Bologna(博洛尼亚大学)
类目: Logic in Computer Science (cs.LO); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Behind a set of rules in Deontic Defeasible Logic, there is a mapping process of normative background fragments. This process goes from text to rules and implicitly encompasses an explanation of the coded fragments. In this paper we deliver a methodology for legal coding that starts with a fragment and goes onto a set of Deontic Defeasible Logic rules, involving a set of scenarios to test the correctness of the coded fragments. The methodology is illustrated by the coding process of an example text. We then show the results of a series of experiments conducted with humans encoding a variety of normative backgrounds and corresponding cases in which we have measured the efforts made in the coding process, as related to some measurable features. To process these examples, a recently developed technology, Houdini, that allows reasoning in Deontic Defeasible Logic, has been employed. Finally we provide a technique to forecast time required in coding, that depends on factors such as knowledge of the legal domain, knowledge of the coding processes, length of the text, and a measure of depth that refers to the length of the paths of legal references.
zh

[NLP-55] CAMOUFLAGE: Exploiting Misinformation Detection Systems Through LLM-driven Adversarial Claim Transformation

【速读】: 该论文试图解决基于证据的虚假信息检测系统在面对对抗性攻击时的脆弱性问题,特别是现有基于文本的黑盒对抗攻击方法无法有效欺骗由检索和声明-证据比较模块组成的多组件检测系统。解决方案的关键在于提出CAMOUFLAGE,这是一种迭代的、由大语言模型(Large Language Model, LLM)驱动的方法,采用双代理系统——提示优化代理和攻击者代理,通过生成语义等价但能干扰证据检索和误导声明-证据比较的对抗性改写,从而在不改变原声明语义的情况下绕过检测系统。该方法通过分析失败的攻击尝试来优化攻击提示,实现更大幅度的结构和风格变换,并仅依赖二元模型决策进行优化,无需分类器的对数几率或大量查询。

链接: https://arxiv.org/abs/2505.01900
作者: Mazal Bethany,Nishant Vishwamitra,Cho-Yu Jason Chiang,Peyman Najafirad
机构: University of Texas at San Antonio(德克萨斯大学圣安东尼奥分校); Peraton Labs(佩拉顿实验室)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Automated evidence-based misinformation detection systems, which evaluate the veracity of short claims against evidence, lack comprehensive analysis of their adversarial vulnerabilities. Existing black-box text-based adversarial attacks are ill-suited for evidence-based misinformation detection systems, as these attacks primarily focus on token-level substitutions involving gradient or logit-based optimization strategies, which are incapable of fooling the multi-component nature of these detection systems. These systems incorporate both retrieval and claim-evidence comparison modules, which requires attacks to break the retrieval of evidence and/or the comparison module so that it draws incorrect inferences. We present CAMOUFLAGE, an iterative, LLM-driven approach that employs a two-agent system, a Prompt Optimization Agent and an Attacker Agent, to create adversarial claim rewritings that manipulate evidence retrieval and mislead claim-evidence comparison, effectively bypassing the system without altering the meaning of the claim. The Attacker Agent produces semantically equivalent rewrites that attempt to mislead detectors, while the Prompt Optimization Agent analyzes failed attack attempts and refines the prompt of the Attacker to guide subsequent rewrites. This enables larger structural and stylistic transformations of the text rather than token-level substitutions, adapting the magnitude of changes based on previous outcomes. Unlike existing approaches, CAMOUFLAGE optimizes its attack solely based on binary model decisions to guide its rewriting process, eliminating the need for classifier logits or extensive querying. We evaluate CAMOUFLAGE on four systems, including two recent academic systems and two real-world APIs, with an average attack success rate of 46.92% while preserving textual coherence and semantic equivalence to the original claims.
zh
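
CAMOUFLAGE 的双代理迭代可浓缩为如下 Python 骨架:攻击代理生成语义等价改写,仅凭检测器的二元判定决定是否继续,提示优化代理则依据失败历史更新攻击指令。接口与提示词均为假设。

```python
def camouflage_attack(claim, detector, attacker_llm, optimizer_llm, budget=10):
    """双代理攻击循环骨架:仅依赖 detector(text) -> bool 的二元判定(接口为假设)。"""
    instruction = "改写下述声明,保持语义等价:"
    failures = []
    for _ in range(budget):
        rewrite = attacker_llm(f"{instruction}\n{claim}")
        if not detector(rewrite):                  # 检测系统被绕过,攻击成功
            return rewrite
        failures.append(rewrite)
        instruction = optimizer_llm(               # 提示优化代理分析失败案例
            "以下改写均被检测系统识破,请生成新的改写指令,"
            f"鼓励更大幅度的结构与风格变换:\n{failures}")
    return None                                    # 预算内未能绕过
```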

[NLP-56] Automated Sentiment Classification and Topic Discovery in Large-Scale Social Media Streams

【速读】: 该论文试图解决在动态地缘政治背景下对社交媒体(如Twitter)进行大规模情感与主题分析的问题,其核心挑战在于如何有效提取和理解海量文本数据中的情感倾向及潜在主题。解决方案的关键在于构建一个端到端的分析框架,包括基于冲突相关关键词的定向数据收集、多预训练模型联合进行情感标注以提升标注鲁棒性、利用Latent Dirichlet Allocation (LDA) 对按情感和元数据属性分组的数据子集进行潜在主题识别,以及开发交互式可视化界面以支持跨时间和区域的情感趋势与主题分布探索。

链接: https://arxiv.org/abs/2505.01883
作者: Yiwen Lu,Siheng Xiong,Zhaowei Li
机构: Georgia Institute of Technology (佐治亚理工学院)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:We present a framework for large-scale sentiment and topic analysis of Twitter discourse. Our pipeline begins with targeted data collection using conflict-specific keywords, followed by automated sentiment labeling via multiple pre-trained models to improve annotation robustness. We examine the relationship between sentiment and contextual features such as timestamp, geolocation, and lexical content. To identify latent themes, we apply Latent Dirichlet Allocation (LDA) on partitioned subsets grouped by sentiment and metadata attributes. Finally, we develop an interactive visualization interface to support exploration of sentiment trends and topic distributions across time and regions. This work contributes a scalable methodology for social media analysis in dynamic geopolitical contexts.
zh

[NLP-57] Humans can learn to detect AI-generated texts, or at least learn when they can't

【速读】: 该论文试图解决个体是否能够通过即时反馈准确区分人类撰写与AI生成文本,并利用反馈重新校准自我效能感的问题,同时探索个体在决策过程中依赖的具体标准,如文本风格和可读性。解决方案的关键在于通过针对训练结合显式反馈,使参与者有效学习区分人类与AI生成文本的能力,从而纠正对AI风格特征和可读性的误解,并提升自我评估的准确性。

链接: https://arxiv.org/abs/2505.01877
作者: Jiří Milička,Anna Marklová,Ondřej Drobil,Eva Pospíšilová
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:This study investigates whether individuals can learn to accurately discriminate between human-written and AI-produced texts when provided with immediate feedback, and if they can use this feedback to recalibrate their self-perceived competence. We also explore the specific criteria individuals rely upon when making these decisions, focusing on textual style and perceived readability. We used GPT-4o to generate several hundred texts across various genres and text types comparable to Koditex, a multi-register corpus of human-written texts. We then presented randomized text pairs to 255 Czech native speakers who identified which text was human-written and which was AI-generated. Participants were randomly assigned to two conditions: one receiving immediate feedback after each trial, the other receiving no feedback until experiment completion. We recorded accuracy in identification, confidence levels, response times, and judgments about text readability along with demographic data and participants' engagement with AI technologies prior to the experiment. Participants receiving immediate feedback showed significant improvement in accuracy and confidence calibration. Participants initially held incorrect assumptions about AI-generated text features, including expectations about stylistic rigidity and readability. Notably, without feedback, participants made the most errors precisely when feeling most confident – an issue largely resolved among the feedback group. The ability to differentiate between human and AI-generated texts can be effectively learned through targeted training with explicit feedback, which helps correct misconceptions about AI stylistic features and readability, as well as potential other variables that were not explored, while facilitating more accurate self-assessment. This finding might be particularly important in educational contexts.
zh

[NLP-58] Positional Attention for Efficient BERT-Based Named Entity Recognition

【速读】: 该论文旨在解决在自然语言处理中,基于BERT的命名实体识别(Named Entity Recognition, NER)系统在面对新应用时需要从头微调模型所带来的计算成本高和耗时的问题。其解决方案的关键在于提出一种成本效益更高的方法,通过将位置注意力机制(positional attention mechanisms)整合到实体识别过程中,并利用预训练参数实现有效的定制化,从而在减少训练轮次的情况下仍能保持较高的识别精度。

链接: https://arxiv.org/abs/2505.01868
作者: Mo Sun,Siheng Xiong,Yuankai Cai,Bowen Zuo
机构: Georgia Institute of Technology (佐治亚理工学院)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:This paper presents a framework for Named Entity Recognition (NER) leveraging the Bidirectional Encoder Representations from Transformers (BERT) model in natural language processing (NLP). NER is a fundamental task in NLP with broad applicability across downstream applications. While BERT has established itself as a state-of-the-art model for entity recognition, fine-tuning it from scratch for each new application is computationally expensive and time-consuming. To address this, we propose a cost-efficient approach that integrates positional attention mechanisms into the entity recognition process and enables effective customization using pre-trained parameters. The framework is evaluated on a Kaggle dataset derived from the Groningen Meaning Bank corpus and achieves strong performance with fewer training epochs. This work contributes to the field by offering a practical solution for reducing the training cost of BERT-based NER systems while maintaining high accuracy.
zh
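
一种与论文思路一致的做法,是在冻结的预训练 BERT 之上加一层带相对位置偏置的注意力,再接分类头。下面的 PyTorch 草图即是这种“位置注意力 + 复用预训练参数”的示意;偏置形式与超参均为假设,bert 假定为 HuggingFace 风格的编码器。

```python
import torch
import torch.nn as nn

class PositionalAttentionNER(nn.Module):
    """冻结 BERT + 相对位置偏置注意力 + 分类头的示意模型(结构细节为假设)。"""
    def __init__(self, bert, hidden=768, n_labels=9, max_dist=128):
        super().__init__()
        self.bert = bert                                   # 预训练编码器,参数可冻结
        self.attn = nn.MultiheadAttention(hidden, 8, batch_first=True)
        self.pos_bias = nn.Embedding(2 * max_dist + 1, 1)  # 可学习的相对位置偏置
        self.max_dist = max_dist
        self.classifier = nn.Linear(hidden, n_labels)

    def forward(self, input_ids, attention_mask):
        # 假设 bert 为 HuggingFace 风格编码器,返回 last_hidden_state
        h = self.bert(input_ids, attention_mask=attention_mask).last_hidden_state
        L = h.size(1)
        rel = torch.arange(L)[None, :] - torch.arange(L)[:, None]
        rel = rel.clamp(-self.max_dist, self.max_dist) + self.max_dist
        bias = self.pos_bias(rel.to(h.device)).squeeze(-1)  # [L, L] 加性偏置
        out, _ = self.attn(h, h, h, attn_mask=bias)         # 注入注意力分数
        return self.classifier(out)                         # 逐 token 实体标签 logits
```

这里利用了 PyTorch MultiheadAttention 对浮点 attn_mask 的加性语义,把相对位置偏置直接加到注意力分数上;只训练这一小层与分类头即可复用预训练参数。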

[NLP-59] Intra-Layer Recurrence in Transformers for Language Modeling

【速读】: 该论文试图解决Transformer模型深度增加导致参数量急剧增长的问题,现有循环Transformer方法通过多次重处理层来缓解这一问题,但通常不加区分地对整个层块应用循环。论文提出的解决方案是引入Intra-Layer Recurrence (ILR),其关键在于在单次前向传播中对单个层进行选择性循环,而非对整个层块进行无差别处理,实验表明将更多迭代分配给早期层可获得最佳效果。

链接: https://arxiv.org/abs/2505.01855
作者: Anthony Nguyen,Wenjun Lin
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted at Canadian AI 2025. Code available at this https URL

点击查看摘要

Abstract:Transformer models have established new benchmarks in natural language processing; however, their increasing depth results in substantial growth in parameter counts. While existing recurrent transformer methods address this issue by reprocessing layers multiple times, they often apply recurrence indiscriminately across entire blocks of layers. In this work, we investigate Intra-Layer Recurrence (ILR), a more targeted approach that applies recurrence selectively to individual layers within a single forward pass. Our experiments show that allocating more iterations to earlier layers yields optimal results. These findings suggest that ILR offers a promising direction for optimizing recurrent structures in transformer architectures.
zh
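
【代码示意】ILR 的核心思想可用如下极简 PyTorch 代码示意(reuse_map 的具体取值为本文假设,仅遵循论文"更早的层分配更多迭代"这一发现):

```python
import torch.nn as nn

class ILRTransformer(nn.Module):
    """示意:Intra-Layer Recurrence —— 在一次前向传播中按配置重复计算单个层。"""
    def __init__(self, layers, reuse_map):
        super().__init__()
        self.layers = nn.ModuleList(layers)
        self.reuse_map = reuse_map   # 例如 [3, 2, 1, 1]:越早的层迭代次数越多

    def forward(self, x):
        for layer, n_iter in zip(self.layers, self.reuse_map):
            for _ in range(n_iter):  # 同一层的权重被复用 n_iter 次
                x = layer(x)
        return x
```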

[NLP-60] New News: System-2 Fine-tuning for Robust Integration of New Knowledge

【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在面对新信息时,通过微调(fine-tuning)难以有效将知识固化到模型权重中的问题,而相比之下,基于上下文学习(in-context learning, ICL)能够更有效地利用显式提供的上下文信息。论文提出的关键解决方案是“系统2微调”(System-2 Fine-tuning, Sys2-FT),其核心在于通过自对弈数据生成协议(如改写、推论和Self-QAs)将模型在上下文中获得的知识提炼到模型权重中,从而提升模型的内部学习能力。

链接: https://arxiv.org/abs/2505.01812
作者: Core Francisco Park,Zechen Zhang,Hidenori Tanaka
机构: Harvard University (哈佛大学); CBS-NTT Program in Physics of Intelligence (CBS-NTT 人工智能物理计划); Center for Brain Science (脑科学中心)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Humans and intelligent animals can effortlessly internalize new information (“news”) and accurately extract the implications for performing downstream tasks. While large language models (LLMs) can achieve this through in-context learning (ICL) when the news is explicitly given as context, fine-tuning remains challenging for the models to consolidate learning in weights. In this paper, we introduce New News, a dataset composed of hypothetical yet plausible news spanning multiple domains (mathematics, coding, discoveries, leaderboards, events), accompanied by downstream evaluation questions whose correct answers critically depend on understanding and internalizing the news. We first demonstrate a substantial gap between naive fine-tuning and in-context learning (FT-ICL gap) on our news dataset. To address this gap, we explore a suite of self-play data generation protocols – paraphrases, implications and Self-QAs – designed to distill the knowledge from the model with context into the weights of the model without the context, which we term System-2 Fine-tuning (Sys2-FT). We systematically evaluate ICL and Sys2-FT performance across data domains and model scales with the Qwen 2.5 family of models. Our results demonstrate that the self-QA protocol of Sys2-FT significantly improves models’ in-weight learning of the news. Furthermore, we discover the contextual shadowing effect, where training with the news in context followed by its rephrases or QAs degrades learning of the news. Finally, we show preliminary evidence of an emerging scaling law of Sys2-FT.
zh
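
【代码示意】Sys2-FT 中的 Self-QA 协议可用如下 Python 片段示意(llm(prompt) 为假设接口,提示词措辞并非论文原文):让模型基于"新闻"自问自答,生成不含原文上下文的微调样本,迫使知识写入权重:

```python
def self_qa_corpus(llm, news_items, n_questions=5):
    """示意:基于每条新闻生成问答对,作为不含上下文的微调语料。"""
    examples = []
    for news in news_items:
        qs = llm(f"Read the news and write {n_questions} questions whose answers "
                 f"depend on it, one per line:\n{news}").splitlines()
        for q in filter(None, map(str.strip, qs)):
            a = llm(f"News: {news}\nQuestion: {q}\nAnswer concisely:")
            # 关键:训练样本中不包含 news 本身,知识只能沉淀到模型权重中
            examples.append({"prompt": q, "completion": a})
    return examples
```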

[NLP-61] Distinguishing AI-Generated and Human-Written Text Through Psycholinguistic Analysis

【速读】: 该论文试图解决在教育环境中准确识别AI生成文本与人类写作文本的问题,以维护学术诚信。其解决方案的关键在于构建一个整合了文体特征分析与心理语言学理论的综合框架,通过将31个不同的文体特征映射到词汇检索、话语规划、认知负荷管理和元认知自我监控等认知过程,揭示人类写作的独特心理语言模式,从而提供一种清晰且可解释的区分方法。

链接: https://arxiv.org/abs/2505.01800
作者: Chidimma Opara
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 8

点击查看摘要

Abstract:The increasing sophistication of AI-generated texts highlights the urgent need for accurate and transparent detection tools, especially in educational settings, where verifying authorship is essential. Existing literature has demonstrated that the application of stylometric features with machine learning classifiers can yield excellent results. Building on this foundation, this study proposes a comprehensive framework that integrates stylometric analysis with psycholinguistic theories, offering a clear and interpretable approach to distinguishing between AI-generated and human-written texts. This research specifically maps 31 distinct stylometric features to cognitive processes such as lexical retrieval, discourse planning, cognitive load management, and metacognitive self-monitoring. In doing so, it highlights the unique psycholinguistic patterns found in human writing. Through the intersection of computational linguistics and cognitive science, this framework contributes to the development of reliable tools aimed at preserving academic integrity in the era of generative AI.
zh
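
【代码示意】作为说明,下面用 Python 展示如何提取几个可映射到认知过程的简单文体特征(仅为示意:特征集合与具体定义均为本文假设,远少于论文中的 31 个特征):

```python
import re
from collections import Counter

def stylometric_features(text: str) -> dict:
    """提取若干与认知过程粗略对应的文体特征(示意)。"""
    words = re.findall(r"[A-Za-z']+", text.lower())
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    counts = Counter(words)
    return {
        "type_token_ratio": len(counts) / max(len(words), 1),       # 词汇检索多样性
        "avg_sentence_len": len(words) / max(len(sentences), 1),    # 话语规划
        "avg_word_len": sum(map(len, words)) / max(len(words), 1),  # 认知负荷
        "hapax_ratio": sum(1 for c in counts.values() if c == 1) / max(len(counts), 1),
    }
```

得到特征向量后,交给任意传统分类器(如逻辑回归)即可完成人类文本与 AI 文本的判别。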

[NLP-62] A Multimodal Framework for Explainable Evaluation of Soft Skills in Educational Environments

【速读】: 该论文试图解决在高等教育中对软技能进行无偏评估的挑战,特别是在面对复杂和不确定的行为表现时。其解决方案的关键在于采用一种融合了现象粒度语言模型与多模态分析的模糊逻辑方法,通过计算感知实现对复杂软技能表达的结构化分解,从而捕捉细微行为并提高评估的可解释性和可靠性。

链接: https://arxiv.org/abs/2505.01794
作者: Jared D.T. Guerrero-Sosa,Francisco P. Romero,Víctor Hugo Menéndez-Domínguez,Jesus Serrano-Guerrero,Andres Montoro-Montarroso,Jose A. Olivas
机构: University of Castilla-La Mancha (卡斯蒂利亚-拉曼查大学); Autonomous University of Yucatan (尤卡坦自治大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
备注:

点击查看摘要

Abstract:In the rapidly evolving educational landscape, the unbiased assessment of soft skills is a significant challenge, particularly in higher education. This paper presents a fuzzy logic approach that employs a Granular Linguistic Model of Phenomena integrated with multimodal analysis to evaluate soft skills in undergraduate students. By leveraging computational perceptions, this approach enables a structured breakdown of complex soft skill expressions, capturing nuanced behaviours with high granularity and addressing their inherent uncertainties, thereby enhancing interpretability and reliability. Experiments were conducted with undergraduate students using a developed tool that assesses soft skills such as decision-making, communication, and creativity. This tool identifies and quantifies subtle aspects of human interaction, such as facial expressions and gesture recognition. The findings reveal that the framework effectively consolidates multiple data inputs to produce meaningful and consistent assessments of soft skills, showing that integrating multiple modalities into the evaluation process significantly improves the quality of soft skills scores, making the assessment work transparent and understandable to educational stakeholders.
zh

[NLP-63] Enhancing the Learning Experience: Using Vision-Language Models to Generate Questions for Educational Videos

【速读】: 该论文试图解决教育视频中用户参与度和知识保留率低的问题,以及如何通过自动生成问题来支持学习者知识获取和评估。其解决方案的关键在于探索当前视觉语言模型在生成面向学习的教育视频问题中的能力,包括模型的即用性能、内容特定问题生成的微调效果、不同视频模态对问题质量的影响,以及生成问题的相关性、可回答性和难度水平的定性分析。研究强调了模型微调的必要性,并指出了问题多样性与相关性方面的挑战。

链接: https://arxiv.org/abs/2505.01790
作者: Markos Stamatakis,Joshua Berger,Christian Wartena,Ralph Ewerth,Anett Hoppe
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Multimedia (cs.MM)
备注: 12 pages (excluding references), 8 tables, 1 equation

点击查看摘要

Abstract:Web-based educational videos offer flexible learning opportunities and are becoming increasingly popular. However, improving user engagement and knowledge retention remains a challenge. Automatically generated questions can activate learners and support their knowledge acquisition. Further, they can help teachers and learners assess their understanding. While large language and vision-language models have been employed in various tasks, their application to question generation for educational videos remains underexplored. In this paper, we investigate the capabilities of current vision-language models for generating learning-oriented questions for educational video content. We assess (1) out-of-the-box models’ performance; (2) fine-tuning effects on content-specific question generation; (3) the impact of different video modalities on question quality; and (4) in a qualitative study, question relevance, answerability, and difficulty levels of generated questions. Our findings delineate the capabilities of current vision-language models, highlighting the need for fine-tuning and addressing challenges in question diversity and relevance. We identify requirements for future multimodal datasets and outline promising research directions.
zh

[NLP-64] Same evaluation more tokens: On the effect of input length for machine translation evaluation using Large Language Models

【速读】: 该论文试图解决机器翻译文本质量评估(Machine Translation Quality Assessment)在长文档中的准确性问题,尤其是由于文本长度对错误跨度(error spans)和系统排名准确性的影响。其解决方案的关键在于通过改进提示策略(如Focus Sentence Prompting, FSP)和微调方法,减少长度偏差,从而提升大语言模型(LLMs)在长文本翻译评估中的可靠性和一致性。

链接: https://arxiv.org/abs/2505.01761
作者: Tobias Domhan,Dawei Zhu
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Accurately evaluating machine-translated text remains a long-standing challenge, particularly for long documents. Recent work has shown that large language models (LLMs) can serve as reliable and interpretable sentence-level translation evaluators via MQM error span annotations. With modern LLMs supporting larger context windows, a natural question arises: can we feed entire document translations into an LLM for quality assessment? Ideally, evaluation should be invariant to text length, producing consistent error spans regardless of input granularity. However, our analysis shows that text length significantly impacts evaluation: longer texts lead to fewer error spans and reduced system ranking accuracy. To address this limitation, we evaluate several strategies, including granularity-aligned prompting, Focus Sentence Prompting (FSP), and a fine-tuning approach to better align LLMs with the evaluation task. The latter two methods largely mitigate this length bias, making LLMs more reliable for long-form translation evaluation.
zh
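
【代码示意】Focus Sentence Prompting 的思路可用如下 Python 片段示意(提示词措辞与窗口大小均为本文假设,并非论文原始提示):每次只要求模型标注一个焦点句译文的 MQM 错误区间,但附带其源文上下文:

```python
def focus_sentence_prompts(src_sents, tgt_sents, window=2):
    """示意:为文档中的每个句子构造一条"焦点句 + 上下文"评估提示。"""
    prompts = []
    for i, (src, tgt) in enumerate(zip(src_sents, tgt_sents)):
        ctx_src = " ".join(src_sents[max(0, i - window): i + window + 1])
        prompts.append(
            "Context (source): " + ctx_src + "\n"
            "Source sentence: " + src + "\n"
            "Translation: " + tgt + "\n"
            "Annotate MQM error spans for the translation of the focus sentence only."
        )
    return prompts
```

这样无论文档多长,单次评估的输入粒度都保持恒定,从而缓解长度偏差。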

[NLP-65] Unraveling Media Perspectives: A Comprehensive Methodology Combining Large Language Models Topic Modeling Sentiment Analysis and Ontology Learning to Analyse Media Bias

【速读】: 该论文试图解决新闻报道中的偏见问题(media bias),这一问题对民主社会的决策过程和正常运作构成重大威胁。解决方案的关键在于提出一种新颖的方法论,通过自然语言处理技术,包括层次化主题建模、情感分析以及基于大语言模型的本体学习,对新闻来源中的事件选择、标签、用词以及陈述与省略偏见进行可扩展且最小偏见的分析。

链接: https://arxiv.org/abs/2505.01754
作者: Orlando Jähde,Thorsten Weber,Rüdiger Buchkremer
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
备注:

点击查看摘要

Abstract:Biased news reporting poses a significant threat to informed decision-making and the functioning of democracies. This study introduces a novel methodology for scalable, minimally biased analysis of media bias in political news. The proposed approach examines event selection, labeling, word choice, and commission and omission biases across news sources by leveraging natural language processing techniques, including hierarchical topic modeling, sentiment analysis, and ontology learning with large language models. Through three case studies related to current political events, we demonstrate the methodology’s effectiveness in identifying biases across news sources at various levels of granularity. This work represents a significant step towards scalable, minimally biased media bias analysis, laying the groundwork for tools to help news consumers navigate an increasingly complex media landscape.
zh

[NLP-66] Efficient Shapley Value-based Non-Uniform Pruning of Large Language Models

【速读】: 该论文旨在解决传统层级剪枝方法在大型语言模型(Large Language Models, LLMs)中因统一稀疏性策略导致性能下降的问题。传统方法未考虑模型中不同Transformer层的重要性差异,从而影响了剪枝后的模型性能。其解决方案的关键在于提出一种基于Shapley值的非均匀剪枝方法(Shapley Value-based Non-Uniform Pruning, SV-NUP),该方法通过量化每个Transformer层对整体模型性能的贡献,为不同层分配定制化的剪枝预算,从而保留关键参数,提升剪枝后模型的性能。

链接: https://arxiv.org/abs/2505.01731
作者: Chuan Sun,Han Yu,Lizhen Cui
机构: Nanyang Technological University (南洋理工大学); Shandong University (山东大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Pruning large language models (LLMs) is a promising solution for reducing model sizes and computational complexity while preserving performance. Traditional layer-wise pruning methods often adopt a uniform sparsity approach across all layers, which leads to suboptimal performance due to the varying significance of individual transformer layers within the model not being accounted for. To this end, we propose the Shapley Value-based Non-Uniform Pruning (SV-NUP) method for LLMs. This approach quantifies the contribution of each transformer layer to the overall model performance, enabling the assignment of tailored pruning budgets to different layers to retain critical parameters. To further improve efficiency, we design the Sliding Window-based Shapley Value approximation method. It substantially reduces computational overhead compared to exact SV calculation methods. Extensive experiments on various LLMs including LLaMA-v1, LLaMA-v2 and OPT demonstrate the effectiveness of the proposed approach. The results reveal that non-uniform pruning significantly enhances the performance of pruned models. Notably, SV-NUP achieves a reduction in perplexity (PPL) of 18.01% and 19.55% on LLaMA-7B and LLaMA-13B, respectively, compared to SparseGPT at 70% sparsity.
zh
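
【代码示意】窗口化 Shapley 值近似的一个简化 Python 示意如下(非论文实现:此处将窗口取为互不重叠以避免重复计数,evaluate(active_set) 为假设接口,返回仅启用给定层集合时的模型性能,越高越好):

```python
import random

def windowed_shapley(n_layers, evaluate, window=4, samples=8):
    """示意:在每个窗口内用采样排列近似各层的边际贡献(Shapley 值的蒙特卡洛近似)。"""
    contrib = [0.0] * n_layers
    for start in range(0, n_layers, window):
        idx = list(range(start, min(start + window, n_layers)))
        for _ in range(samples):
            random.shuffle(idx)
            active = set(range(n_layers)) - set(idx)   # 先停用窗口内所有层
            prev = evaluate(active)
            for i in idx:                               # 按随机顺序逐层恢复
                active.add(i)
                cur = evaluate(active)
                contrib[i] += (cur - prev) / samples    # 累积平均边际贡献
                prev = cur
    total = sum(max(c, 1e-8) for c in contrib)
    return [max(c, 1e-8) / total for c in contrib]      # 归一化为各层相对重要性
```

得到各层重要性后,可按"贡献越大、剪枝比例越低"的原则为每层分配稀疏度预算。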

[NLP-67] Inducing Robustness in a 2 Dimensional Direct Preference Optimization Paradigm DATE

【速读】: 该论文旨在解决直接偏好优化(Direct Preference Optimisation, DPO)在对齐大型语言模型(Large Language Models, LLMs)与人类偏好时存在的局限性,特别是其无法生成细粒度评分以及对响应中不同片段的偏好处理过于粗略的问题。先前工作为此提出了二维评分的DPO对齐方法(2D-DPO),通过引入两个维度的评分来更精确地反映人类偏好,但该类方法对标签/评分噪声并不鲁棒。本文解决方案的关键在于为2D-DPO算法引入段级评分噪声鲁棒性,并辅以理论分析与实证验证,从而提升对齐效果和稳定性。

链接: https://arxiv.org/abs/2505.01706
作者: Sarvesh Shashidhar,Ritik,Nachiketa Patil,Suraj Racha,Ganesh Ramakrishnan
机构: IIT Bombay (印度理工学院孟买分校)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: Updated abstract, algorithm and experimental results

点击查看摘要

Abstract:Direct Preference Optimisation (DPO) has emerged as a powerful method for aligning Large Language Models (LLMs) with human preferences, offering a stable and efficient alternative to approaches that use Reinforcement learning via Human Feedback. In this work, we investigate the performance of DPO using open-source preference datasets. One of the major drawbacks of DPO is that it doesn’t induce granular scoring and treats all the segments of the responses with equal propensity. However, this is not practically true for human preferences since even “good” responses have segments that may not be preferred by the annotator. To resolve this, a 2-dimensional scoring for DPO alignment called 2D-DPO was proposed. We explore the 2D-DPO alignment paradigm and the advantages it provides over the standard DPO by comparing their win rates. It is observed that these methods, even though effective, are not robust to label/score noise. To counter this, we propose an approach of incorporating segment-level score noise robustness to the 2D-DPO algorithm. Along with theoretical backing, we also provide empirical verification in favour of the algorithm and introduce other noise models that can be present.
zh

[NLP-68] High-Fidelity Pseudo-label Generation by Large Language Models for Training Robust Radiology Report Classifiers

【速读】: 该论文旨在解决胸部X光报告自动标注的问题,这一任务对于训练基于图像的诊断模型、人群健康研究和临床决策支持至关重要。传统自然语言处理方法在应对自由文本报告中高变异性、复杂性以及否定和不确定性的普遍性方面面临显著挑战。论文提出的解决方案关键在于引入DeBERTa-RAD,这是一种结合了先进大语言模型(Large Language Model, LLM)伪标签生成与高效DeBERTa基础模型知识蒸馏的两阶段框架,从而实现了准确且快速的胸部X光报告标注。

链接: https://arxiv.org/abs/2505.01693
作者: Brian Wong,Kaito Tanaka
机构: SANNO University (产业能率大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Automated labeling of chest X-ray reports is essential for enabling downstream tasks such as training image-based diagnostic models, population health studies, and clinical decision support. However, the high variability, complexity, and prevalence of negation and uncertainty in these free-text reports pose significant challenges for traditional Natural Language Processing methods. While large language models (LLMs) demonstrate strong text understanding, their direct application for large-scale, efficient labeling is limited by computational cost and speed. This paper introduces DeBERTa-RAD, a novel two-stage framework that combines the power of state-of-the-art LLM pseudo-labeling with efficient DeBERTa-based knowledge distillation for accurate and fast chest X-ray report labeling. We leverage an advanced LLM to generate high-quality pseudo-labels, including certainty statuses, for a large corpus of reports. Subsequently, a DeBERTa-Base model is trained on this pseudo-labeled data using a tailored knowledge distillation strategy. Evaluated on the expert-annotated MIMIC-500 benchmark, DeBERTa-RAD achieves a state-of-the-art Macro F1 score of 0.9120, significantly outperforming established rule-based systems, fine-tuned transformer models, and direct LLM inference, while maintaining a practical inference speed suitable for high-throughput applications. Our analysis shows particular strength in handling uncertain findings. This work demonstrates a promising path to overcome data annotation bottlenecks and achieve high-performance medical text processing through the strategic combination of LLM capabilities and efficient student models trained via distillation.
zh
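
【代码示意】其中第二阶段的知识蒸馏可用如下极简 PyTorch 损失函数示意(温度取值与伪标签格式均为本文假设,teacher_probs 表示 LLM 对各标签类别给出的软概率):

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_probs, T=2.0):
    """示意:用 LLM 伪标签的软概率蒸馏 DeBERTa 学生模型。"""
    log_p_student = F.log_softmax(student_logits / T, dim=-1)
    # KL 散度乘以 T^2 以保持梯度量级(知识蒸馏的常规做法)
    return F.kl_div(log_p_student, teacher_probs, reduction="batchmean") * (T ** 2)
```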

[NLP-69] A Survey on Inference Engines for Large Language Models: Perspectives on Optimization and Efficiency

【速读】: 该论文试图解决在大规模语言模型(Large Language Models, LLMs)推理过程中,由于链式思维、复杂推理和代理服务等高负载任务导致的推理成本过高的问题。现有优化方法如并行性、压缩和缓存虽被采用,但因服务需求多样化,难以选择合适的优化方案。解决方案的关键在于通过系统性评估25种开源和商业推理引擎,分析其易用性、部署便捷性、通用支持、可扩展性以及对吞吐量和延迟敏感计算的适应性,并探索各引擎的设计目标及所支持的优化技术,从而为研究人员和开发者提供优化LLM推理引擎的选择与设计指导。

链接: https://arxiv.org/abs/2505.01658
作者: Sihyeong Park,Sungryeol Jeon,Chaelyn Lee,Seokhun Jeon,Byung-Soo Kim,Jemin Lee
机构: 未知
类目: Computation and Language (cs.CL)
备注: Under review; 65 pages; 27 figures

点击查看摘要

Abstract:Large language models (LLMs) are widely applied in chatbots, code generators, and search engines. Workloads such as chain-of-thought, complex reasoning, and agent services significantly increase the inference cost by invoking the model repeatedly. Optimization methods such as parallelism, compression, and caching have been adopted to reduce costs, but the diverse service requirements make it hard to select the right method. Recently, specialized LLM inference engines have emerged as a key component for integrating the optimization methods into service-oriented infrastructures. However, a systematic study on inference engines is still lacking. This paper provides a comprehensive evaluation of 25 open-source and commercial inference engines. We examine each inference engine in terms of ease-of-use, ease-of-deployment, general-purpose support, scalability, and suitability for throughput- and latency-aware computation. Furthermore, we explore the design goals of each inference engine by investigating the optimization techniques it supports. In addition, we assess the ecosystem maturity of open source inference engines and handle the performance and cost policy of commercial solutions. We outline future research directions that include support for complex LLM-based services, support of various hardware, and enhanced security, offering practical guidance to researchers and developers in selecting and designing optimized LLM inference engines. We also provide a public repository to continually track developments in this fast-evolving field: this https URL
zh

[NLP-70] Structured Prompting and Feedback-Guided Reasoning with LLMs for Data Interpretation

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在结构化数据分析中的可靠性与语义对齐问题,具体表现为模式解释不一致、用户意图与模型输出不匹配以及故障时自我修正机制有限。其解决方案的关键在于提出STROT框架(Structured Task Reasoning and Output Transformation),该框架通过轻量级的模式自省(schema introspection)和基于样本的字段分类实现动态上下文构建,并将上下文信息嵌入结构化提示中以引导模型生成任务特定且可解释的输出。此外,STROT引入了基于执行反馈和验证信号的迭代优化机制,使模型能够在受控分析循环中作为推理代理进行输出轨迹调整,从而提升结构化数据推理的鲁棒性与可重复性。

链接: https://arxiv.org/abs/2505.01636
作者: Amit Rath
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 21 pages, 2 figures

点击查看摘要

Abstract:Large language models (LLMs) have demonstrated remarkable capabilities in natural language understanding and task generalization. However, their application to structured data analysis remains fragile due to inconsistencies in schema interpretation, misalignment between user intent and model output, and limited mechanisms for self-correction when failures occur. This paper introduces the STROT Framework (Structured Task Reasoning and Output Transformation), a method for structured prompting and feedback-driven transformation logic generation aimed at improving the reliability and semantic alignment of LLM-based analytical workflows. STROT begins with lightweight schema introspection and sample-based field classification, enabling dynamic context construction that captures both the structure and statistical profile of the input data. This contextual information is embedded in structured prompts that guide the model toward generating task-specific, interpretable outputs. To address common failure modes in complex queries, STROT incorporates a refinement mechanism in which the model iteratively revises its outputs based on execution feedback and validation signals. Unlike conventional approaches that rely on static prompts or single-shot inference, STROT treats the LLM as a reasoning agent embedded within a controlled analysis loop – capable of adjusting its output trajectory through planning and correction. The result is a robust and reproducible framework for reasoning over structured data with LLMs, applicable to diverse data exploration and analysis tasks where interpretability, stability, and correctness are essential.
zh
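
【代码示意】STROT 式的"模式自省—生成—执行—反馈—修正"循环可用如下 Python 片段示意(llm(prompt) 与各接口均为本文假设;生产环境中代码执行应置于沙箱内):

```python
def strot_loop(llm, df, task, max_rounds=3):
    """示意:带执行反馈的迭代式数据分析循环。df 为 pandas DataFrame。"""
    context = f"columns: {list(df.columns)}; dtypes: {df.dtypes.to_dict()}"  # 轻量模式自省
    prompt = f"Schema: {context}\nTask: {task}\nWrite Python that sets `result`."
    for _ in range(max_rounds):
        code = llm(prompt)
        try:
            scope = {"df": df}
            exec(code, scope)            # 受控执行生成的分析代码
            return scope["result"]
        except Exception as e:           # 将失败信号注入下一轮提示,驱动自我修正
            prompt += f"\nPrevious code failed with: {e!r}\nPlease revise."
    raise RuntimeError("no valid transformation after max_rounds")
```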

[NLP-71] Always Tell Me The Odds: Fine-grained Conditional Probability Estimation

【速读】: 该论文试图解决在不确定或部分信息条件下,大型语言模型(Large Language Models, LLMs)在生成精细概率估计时存在的准确性不足和校准不良的问题。现有LLMs的概率估计往往粗糙且偏向于高频数值,难以提供可靠的不确定性量化。解决方案的关键在于结合人工与合成数据的创建与评估、模型规模的扩展以及更有效的监督机制,从而构建出一组强大且精确的概率估计模型。

链接: https://arxiv.org/abs/2505.01595
作者: Liaoyaqi Wang,Zhengping Jiang,Anqi Liu,Benjamin Van Durme
机构: Johns Hopkins University (约翰霍普金斯大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:We present a state-of-the-art model for fine-grained probability estimation of propositions conditioned on context. Recent advances in large language models (LLMs) have significantly enhanced their reasoning capabilities, particularly on well-defined tasks with complete information. However, LLMs continue to struggle with making accurate and well-calibrated probabilistic predictions under uncertainty or partial information. While incorporating uncertainty into model predictions often boosts performance, obtaining reliable estimates of that uncertainty remains understudied. In particular, LLM probability estimates tend to be coarse and biased towards more frequent numbers. Through a combination of human and synthetic data creation and assessment, scaling to larger models, and better supervision, we propose a set of strong and precise probability estimation models. We conduct systematic evaluations across tasks that rely on conditional probability estimation and show that our approach consistently outperforms existing fine-tuned and prompting-based methods by a large margin.
zh
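
【代码示意】评估这类条件概率估计模型时,除任务准确率外通常还会考察校准程度;下面给出期望校准误差(ECE)的一个标准 NumPy 实现示意(这是通用校准指标,并非论文专有方法):

```python
import numpy as np

def expected_calibration_error(probs, labels, n_bins=10):
    """示意:按置信度分桶,比较桶内平均置信度与经验准确率的加权差。"""
    probs, labels = np.asarray(probs), np.asarray(labels)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (probs > lo) & (probs <= hi)
        if mask.any():
            conf = probs[mask].mean()    # 桶内平均置信度
            acc = labels[mask].mean()    # 桶内命题为真的经验频率
            ece += mask.mean() * abs(acc - conf)
    return ece
```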

[NLP-72] PIPA: A Unified Evaluation Protocol for Diagnosing Interactive Planning Agents

【速读】: 该论文试图解决现有基准评估任务规划代理性能时仅以任务完成率为指标,而忽略了用户在整个代理过程中的交互体验,导致评估结果与用户满意度不完全一致的问题。解决方案的关键在于提出统一评估协议PIPA,该协议将交互式任务规划代理的行为过程置于部分可观测马尔可夫决策过程(POMDP)框架中,通过一组原子评估标准对代理行为过程进行综合评估,从而更全面地诊断代理在决策流程中的优劣势。

链接: https://arxiv.org/abs/2505.01592
作者: Takyoung Kim,Janvijay Singh,Shuhaib Mehri,Emre Can Acikgoz,Sagnik Mukherjee,Nimet Beyza Bozdag,Sumuk Shashidhar,Gokhan Tur,Dilek Hakkani-Tür
机构: University of Illinois Urbana-Champaign (伊利诺伊大学厄巴纳-香槟分校)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Preprint in progress

点击查看摘要

Abstract:The growing capabilities of large language models (LLMs) in instruction-following and context-understanding lead to the era of agents with numerous applications. Among these, task planning agents have become especially prominent in realistic scenarios involving complex internal pipelines, such as context understanding, tool management, and response generation. However, existing benchmarks predominantly evaluate agent performance based on task completion as a proxy for overall effectiveness. We hypothesize that merely improving task completion is misaligned with maximizing user satisfaction, as users interact with the entire agentic process and not only the end result. To address this gap, we propose PIPA, a unified evaluation protocol that conceptualizes the behavioral process of interactive task planning agents within a partially observable Markov Decision Process (POMDP) paradigm. The proposed protocol offers a comprehensive assessment of agent performance through a set of atomic evaluation criteria, allowing researchers and practitioners to diagnose specific strengths and weaknesses within the agent’s decision-making pipeline. Our analyses show that agents excel in different behavioral stages, with user satisfaction shaped by both outcomes and intermediate behaviors. We also highlight future directions, including systems that leverage multiple agents and the limitations of user simulators in task planning.
zh

[NLP-73] AI agents may be worth the hype but not the resources (yet): An initial exploration of machine translation quality and costs in three language pairs in the legal and news domains

【速读】: 该论文旨在解决生成式 AI (Generative AI) 与传统神经机器翻译 (Neural Machine Translation, NMT) 在翻译性能与效率上的比较问题,特别是评估大型语言模型 (Large Language Models, LLMs) 和多智能体编排在机器翻译任务中的实际效益。研究的关键在于通过多维度的评估框架(包括自动评估、人工评估和效率分析)来揭示不同模型架构在法律合同和新闻文本等复杂语料上的表现差异,同时强调了推理增强型模型在语义理解上的优势及其带来的高昂计算成本,从而提出需要更注重成本意识的评估方法和未来研究方向,如轻量级协作策略和混合管道结构。

链接: https://arxiv.org/abs/2505.01560
作者: Vicent Briva Iglesias,Gokhan Dogru
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large language models (LLMs) and multi-agent orchestration are touted as the next leap in machine translation (MT), but their benefits relative to conventional neural MT (NMT) remain unclear. This paper offers an empirical reality check. We benchmark five paradigms, Google Translate (strong NMT baseline), GPT-4o (general-purpose LLM), o1-preview (reasoning-enhanced LLM), and two GPT-4o-powered agentic workflows (sequential three-stage and iterative refinement), on test data drawn from a legal contract and news prose in three English-source pairs: Spanish, Catalan and Turkish. Automatic evaluation is performed with COMET, BLEU, chrF2 and TER; human evaluation is conducted with expert ratings of adequacy and fluency; efficiency with total input-plus-output token counts mapped to April 2025 pricing. Automatic scores still favour the mature NMT system, which ranks first in seven of twelve metric-language combinations; o1-preview ties or places second in most remaining cases, while both multi-agent workflows trail. Human evaluation reverses part of this narrative: o1-preview produces the most adequate and fluent output in five of six comparisons, and the iterative agent edges ahead once, indicating that reasoning layers capture semantic nuance undervalued by surface metrics. Yet these qualitative gains carry steep costs. The sequential agent consumes roughly five times, and the iterative agent fifteen times, the tokens used by NMT or single-pass LLMs. We advocate multidimensional, cost-aware evaluation protocols and highlight research directions that could tip the balance: leaner coordination strategies, selective agent activation, and hybrid pipelines combining single-pass LLMs with targeted agent intervention.
zh

[NLP-74] On the effectiveness of Large Language Models in the mechanical design domain

【速读】: 该论文试图解决大型语言模型在机械工程领域中的性能问题,具体是通过分析语义数据来评估不同模型架构在领域特定数据上的表现。其解决方案的关键在于利用ABC数据集中的装配名称和部件语义名称进行数据预处理,并设计了两种无监督任务:二元句子对分类任务和零样本分类任务。在二元句子对分类任务中,通过调整学习率、丢弃率、序列长度以及添加多头注意力层来优化模型以减少过拟合,从而获得了0.62的准确率;而在零样本分类任务中,模型显著优于基线,达到了0.386的Top-1分类准确率。

链接: https://arxiv.org/abs/2505.01559
作者: Daniele Grandi,Fabian Riquelme
机构: 未知
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:In this work, we seek to understand the performance of large language models in the mechanical engineering domain. We leverage the semantic data found in the ABC dataset, specifically the assembly names that designers assigned to the overall assemblies, and the individual semantic part names that were assigned to each part. After pre-processing the data we developed two unsupervised tasks to evaluate how different model architectures perform on domain-specific data: a binary sentence-pair classification task and a zero-shot classification task. We achieved a 0.62 accuracy for the binary sentence-pair classification task with a fine-tuned model that focuses on fighting over-fitting: 1) modifying learning rates, 2) dropout values, 3) Sequence Length, and 4) adding a multi-head attention layer. Our model on the zero-shot classification task outperforms the baselines by a wide margin, and achieves a top-1 classification accuracy of 0.386. The results shed some light on the specific failure modes that arise when learning from language in this domain.
zh

[NLP-75] CHORUS: Zero-shot Hierarchical Retrieval and Orchestration for Generating Linear Programming Code

【速读】: 该论文旨在解决非专家用户在求解线性规划(Linear Programming, LP)问题时面临的挑战,即如何高效生成特定求解器(如Gurobi)的代码。其解决方案的关键在于提出一种基于检索增强生成(Retrieval-Augmented Generation, RAG)的框架CHORUS,该框架通过分层树状分块策略和基于文档示例的元数据生成,提升检索的相关性和语义一致性,并结合两阶段检索与交叉编码重排序机制,确保上下文相关性。此外,专家设计的提示词和结构化推理解析器进一步提升了代码生成性能。

链接: https://arxiv.org/abs/2505.01485
作者: Tasnim Ahmed,Salimur Choudhury
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: This paper has been accepted for presentation at the 19th Learning and Intelligent Optimization Conference (LION 19)

点击查看摘要

Abstract:Linear Programming (LP) problems aim to find the optimal solution to an objective under constraints. These problems typically require domain knowledge, mathematical skills, and programming ability, presenting significant challenges for non-experts. This study explores the efficiency of Large Language Models (LLMs) in generating solver-specific LP code. We propose CHORUS, a retrieval-augmented generation (RAG) framework for synthesizing Gurobi-based LP code from natural language problem statements. CHORUS incorporates a hierarchical tree-like chunking strategy for theoretical contents and generates additional metadata based on code examples from documentation to facilitate self-contained, semantically coherent retrieval. Two-stage retrieval approach of CHORUS followed by cross-encoder reranking further ensures contextual relevance. Finally, expertly crafted prompt and structured parser with reasoning steps improve code generation performance significantly. Experiments on the NL4Opt-Code benchmark show that CHORUS improves the performance of open-source LLMs such as Llama3.1 (8B), Llama3.3 (70B), Phi4 (14B), Deepseek-r1 (32B), and Qwen2.5-coder (32B) by a significant margin compared to baseline and conventional RAG. It also allows these open-source LLMs to outperform or match the performance of much stronger baselines-GPT3.5 and GPT4 while requiring far fewer computational resources. Ablation studies further demonstrate the importance of expert prompting, hierarchical chunking, and structured reasoning.
zh
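
【代码示意】CHORUS 中"两阶段检索 + 交叉编码器重排"的骨架可用如下 Python 片段示意(embed(text) 返回向量、cross_score(query, text) 返回相关性分数,均为本文假设的接口,分块与元数据生成步骤从略):

```python
import numpy as np

def two_stage_retrieve(query, chunks, embed, cross_score, k1=20, k2=5):
    """示意:先用双编码器向量相似度召回 k1 个候选,再用交叉编码器重排取前 k2。"""
    q = np.asarray(embed(query))

    def cos(v):
        v = np.asarray(v)
        return float(q @ v / (np.linalg.norm(q) * np.linalg.norm(v) + 1e-9))

    scored = sorted(chunks, key=lambda c: cos(embed(c)), reverse=True)
    candidates = scored[:k1]                      # 第一阶段:向量检索召回
    reranked = sorted(candidates, key=lambda c: cross_score(query, c), reverse=True)
    return reranked[:k2]                          # 第二阶段:交叉编码器重排
```

检索结果随后拼入提示,供 LLM 生成 Gurobi 求解代码。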

[NLP-76] SymPlanner: Deliberate Planning in Language Models with Symbolic Representation

【速读】: 该论文旨在解决语言模型(Language Models, LMs)在需要基于外部约束生成连贯多步骤动作序列的领域中,规划能力不足的问题。其解决方案的关键在于引入SymPlanner框架,该框架通过将LM与符号环境(symbolic environment)进行交互,构建了一个显式的世界模型,从而将规划过程置于符号状态空间中。在此过程中,策略模型提出动作,而符号环境则确定性地执行并验证这些动作的效果,结合迭代修正(Iterative Correction, IC)和对比排序(Contrastive Ranking, CR)机制,提升了规划的探索能力和鲁棒性。

链接: https://arxiv.org/abs/2505.01479
作者: Siheng Xiong,Jieyu Zhou,Zhangding Liu,Yusen Su
机构: Georgia Institute of Technology (佐治亚理工学院)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Planning remains a core challenge for language models (LMs), particularly in domains that require coherent multi-step action sequences grounded in external constraints. We introduce SymPlanner, a novel framework that equips LMs with structured planning capabilities by interfacing them with a symbolic environment that serves as an explicit world model. Rather than relying purely on natural language reasoning, SymPlanner grounds the planning process in a symbolic state space, where a policy model proposes actions and a symbolic environment deterministically executes and verifies their effects. To enhance exploration and improve robustness, we introduce Iterative Correction (IC), which refines previously proposed actions by leveraging feedback from the symbolic environment to eliminate invalid decisions and guide the model toward valid alternatives. Additionally, Contrastive Ranking (CR) enables fine-grained comparison of candidate plans by evaluating them jointly. We evaluate SymPlanner on PlanBench, demonstrating that it produces more coherent, diverse, and verifiable plans than pure natural language baselines.
zh
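
【代码示意】SymPlanner 中"策略模型提出动作、符号环境确定性验证、失败后带反馈修正(IC)"的主循环可示意如下(policy 与 env 的接口均为本文假设):

```python
def plan_with_iterative_correction(policy, env, goal, max_steps=20, max_fix=3):
    """示意:在符号状态空间中进行规划,失败动作触发带反馈的重试。"""
    plan, state = [], env.reset()
    for _ in range(max_steps):
        feedback = None
        for _ in range(max_fix):                     # Iterative Correction
            action = policy.propose(state, goal, feedback)
            ok, info = env.validate(state, action)   # 符号环境检查前置条件
            if ok:
                state = env.step(state, action)      # 确定性执行并更新状态
                plan.append(action)
                break
            feedback = info                          # 把失败原因交还给策略模型
        if env.satisfied(state, goal):
            return plan
    return None
```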

[NLP-77] MoxE: Mixture of xLSTM Experts with Entropy-Aware Routing for Efficient Language Modeling

【速读】: 该论文旨在解决大规模语言模型(Large Language Models, LLMs)在可扩展性和效率方面的关键挑战。其解决方案的核心在于提出一种名为MoxE的新架构,该架构通过将扩展型长短期记忆网络(xLSTM)与专家混合(Mixture of Experts, MoE)框架相结合,有效利用xLSTM的创新记忆结构,并通过MoE引入稀疏性以显著降低计算开销。其中,基于熵的路由机制是该方法的关键,它能够动态地将标记路由至专业专家,从而实现高效的资源利用,并确保对罕见和常见标记的有效处理。

链接: https://arxiv.org/abs/2505.01459
作者: Abdoul Majid O. Thiombiano,Brahim Hnich,Ali Ben Mrad,Mohamed Wiem Mkaouer
机构: FSM, University of Monastir(突尼斯莫纳斯蒂尔大学); CES Lab, ENIS, University of Sfax(斯法克斯大学国家工程学院); Department of Computer Science, College of Computer, Qassim University(卡西姆大学计算机学院); University of Michigan-Flint(密歇根大学弗林特分校)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:This paper introduces MoxE, a novel architecture that synergistically combines the Extended Long Short-Term Memory (xLSTM) with the Mixture of Experts (MoE) framework to address critical scalability and efficiency challenges in large language models (LLMs). The proposed method effectively leverages xLSTM’s innovative memory structures while strategically introducing sparsity through MoE to substantially reduce computational overhead. At the heart of our approach is a novel entropy-based routing mechanism, designed to dynamically route tokens to specialized experts, thereby ensuring efficient and balanced resource utilization. This entropy awareness enables the architecture to effectively manage both rare and common tokens, with mLSTM blocks being favored to handle rare tokens. To further enhance generalization, we introduce a suite of auxiliary losses, including entropy-based and group-wise balancing losses, ensuring robust performance and efficient training. Theoretical analysis and empirical evaluations rigorously demonstrate that MoxE achieves significant efficiency gains and enhanced effectiveness compared to existing approaches, marking a notable advancement in scalable LLM architectures.
zh
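
【代码示意】基于熵的路由机制可用如下 PyTorch 片段示意(偏置形式、超参数与 mLSTM 专家的标记方式均为本文假设,并非论文原始公式):熵越高的 token 越倾向被路由到擅长处理罕见 token 的 mLSTM 专家:

```python
import torch
import torch.nn.functional as F

def entropy_aware_route(router_logits, token_logits, mlstm_expert_ids, alpha=1.0, k=2):
    """示意:按 token 预测分布的熵为 mLSTM 专家加偏置,再做 top-k 稀疏路由。"""
    probs = F.softmax(token_logits, dim=-1)
    entropy = -(probs * probs.clamp_min(1e-9).log()).sum(-1, keepdim=True)  # (B, 1)
    bias = torch.zeros_like(router_logits)
    bias[:, mlstm_expert_ids] = alpha * entropy        # 高熵 token 偏向 mLSTM 专家
    weights = F.softmax(router_logits + bias, dim=-1)
    topk = weights.topk(k, dim=-1)                     # 稀疏激活 top-k 专家
    return topk.indices, topk.values
```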

[NLP-78] Unlearning Sensitive Information in Multimodal LLMs: Benchmark and Attack-Defense Evaluation

【速读】: 该论文试图解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在训练过程中可能无意中获取敏感信息的问题,特别是如何有效实现针对特定多模态知识的遗忘(targeted unlearning)。解决方案的关键在于构建一个高质量的多模态遗忘基准UnLOK-VQA,以及设计攻击与防御框架,用于评估删除多模态知识的方法。通过自动化生成不同接近度的图像-文本对并进行人工筛选以确保质量,该研究进一步测试了多种防御策略的有效性,发现移除模型内部状态中的答案信息是最有效的防御手段。

链接: https://arxiv.org/abs/2505.01456
作者: Vaidehi Patil,Yi-Lin Sung,Peter Hase,Jie Peng,Tianlong Chen,Mohit Bansal
机构: University of North Carolina at Chapel Hill (北卡罗来纳大学教堂山分校); University of Science and Technology of China (中国科学技术大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: The dataset and code are publicly available at this https URL

点击查看摘要

Abstract:LLMs trained on massive datasets may inadvertently acquire sensitive information such as personal details and potentially harmful content. This risk is further heightened in multimodal LLMs as they integrate information from multiple modalities (image and text). Adversaries can exploit this knowledge through multimodal prompts to extract sensitive details. Evaluating how effectively MLLMs can forget such information (targeted unlearning) necessitates the creation of high-quality, well-annotated image-text pairs. While prior work on unlearning has focused on text, multimodal unlearning remains underexplored. To address this gap, we first introduce a multimodal unlearning benchmark, UnLOK-VQA (Unlearning Outside Knowledge VQA), as well as an attack-and-defense framework to evaluate methods for deleting specific multimodal knowledge from MLLMs. We extend a visual question-answering dataset using an automated pipeline that generates varying-proximity samples for testing generalization and specificity, followed by manual filtering for maintaining high quality. We then evaluate six defense objectives against seven attacks (four whitebox, three blackbox), including a novel whitebox method leveraging interpretability of hidden states. Our results show multimodal attacks outperform text- or image-only ones, and that the most effective defense removes answer information from internal model states. Additionally, larger models exhibit greater post-editing robustness, suggesting that scale enhances safety. UnLOK-VQA provides a rigorous benchmark for advancing unlearning in MLLMs.
zh

[NLP-79] AdaParse: An Adaptive Parallel PDF Parsing and Resource Scaling Engine

【速读】: 该论文试图解决科学文献中PDF解析的效率与准确性问题,尤其是在处理大规模科学文档时,如何在计算成本和解析精度之间取得平衡。解决方案的关键在于提出一种自适应并行PDF解析与资源扩展引擎(AdaParse),该引擎通过数据驱动的方式为每个文档分配合适的解析器,并结合直接偏好优化(DPO)将人类判断纳入选择过程,同时考虑解析器的硬件需求和预测精度,以实现计算资源的高效调度。

链接: https://arxiv.org/abs/2505.01435
作者: Carlo Siebenschuh,Kyle Hippe,Ozan Gokdemir,Alexander Brace,Arham Khan,Khalid Hossain,Yadu Babuji,Nicholas Chia,Venkatram Vishwanath,Rick Stevens,Arvind Ramanathan,Ian Foster,Robert Underwood
机构: 未知
类目: Information Retrieval (cs.IR); Computation and Language (cs.CL); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
备注: This paper has been accepted at the The Eighth Annual Conference on Machine Learning and Systems (MLSys 2025)

点击查看摘要

Abstract:Language models for scientific tasks are trained on text from scientific publications, most distributed as PDFs that require parsing. PDF parsing approaches range from inexpensive heuristics (for simple documents) to computationally intensive ML-driven systems (for complex or degraded ones). The choice of the “best” parser for a particular document depends on its computational cost and the accuracy of its output. To address these issues, we introduce an Adaptive Parallel PDF Parsing and Resource Scaling Engine (AdaParse), a data-driven strategy for assigning an appropriate parser to each document. We enlist scientists to select preferred parser outputs and incorporate this information through direct preference optimization (DPO) into AdaParse, thereby aligning its selection process with human judgment. AdaParse then incorporates hardware requirements and predicted accuracy of each parser to orchestrate computational resources efficiently for large-scale parsing campaigns. We demonstrate that AdaParse, when compared to state-of-the-art parsers, improves throughput by 17× while still achieving comparable accuracy (0.2 percent better) on a benchmark set of 1000 scientific documents. AdaParse’s combination of high accuracy and parallel scalability makes it feasible to parse large-scale scientific document corpora to support the development of high-quality, trillion-token-scale text datasets. The implementation is available at this https URL
zh

[NLP-80] Enhancing TCR-Peptide Interaction Prediction with Pretrained Language Models and Molecular Representations

【速读】: 该论文旨在解决T细胞受体(TCR)与肽-主要组织相容性复合物(pMHC)结合特异性预测中的泛化能力不足问题,特别是在数据稀缺环境和面对新型表位时的挑战。其解决方案的关键在于引入LANTERN框架,该框架结合了大规模蛋白质语言模型(如ESM-1b)与肽的化学表示(如SMILES字符串),通过MolFormer进行处理,从而捕捉对TCR-肽识别至关重要的生物和化学特征。

链接: https://arxiv.org/abs/2505.01433
作者: Cong Qi,Hanzhang Fang,Siqi jiang,Tianxing Hu,Wei Zhi
机构: 未知
类目: Quantitative Methods (q-bio.QM); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Understanding the binding specificity between T-cell receptors (TCRs) and peptide-major histocompatibility complexes (pMHCs) is central to immunotherapy and vaccine development. However, current predictive models struggle with generalization, especially in data-scarce settings and when faced with novel epitopes. We present LANTERN (Large lAnguage model-powered TCR-Enhanced Recognition Network), a deep learning framework that combines large-scale protein language models with chemical representations of peptides. By encoding TCR β-chain sequences using ESM-1b and transforming peptide sequences into SMILES strings processed by MolFormer, LANTERN captures rich biological and chemical features critical for TCR-peptide recognition. Through extensive benchmarking against existing models such as ChemBERTa, TITAN, and NetTCR, LANTERN demonstrates superior performance, particularly in zero-shot and few-shot learning scenarios. Our model also benefits from a robust negative sampling strategy and shows significant clustering improvements via embedding analysis. These results highlight the potential of LANTERN to advance TCR-pMHC binding prediction and support the development of personalized immunotherapies.
zh

计算机视觉

[CV-0] Scenethesis: A Language and Vision Agentic Framework for 3D Scene Generation

【速读】:该论文试图解决从文本合成交互式3D场景的问题,这一任务在游戏、虚拟现实和具身人工智能中具有重要意义。现有方法面临场景多样性受限和空间真实性不足的挑战,尤其是基于学习的方法依赖于小规模室内数据集,而大型语言模型(Large Language Models, LLMs)虽具备丰富的文本领域知识,但在空间合理性方面表现不佳。该论文提出的解决方案关键在于引入Scenethesis框架,该框架通过将基于LLM的场景规划与视觉引导的布局优化相结合,利用视觉感知提供真实的空间指导,从而弥补LLMs在空间推理上的不足。

链接: https://arxiv.org/abs/2505.02836
作者: Lu Ling,Chen-Hsuan Lin,Tsung-Yi Lin,Yifan Ding,Yu Zeng,Yichen Sheng,Yunhao Ge,Ming-Yu Liu,Aniket Bera,Zhaoshuo Li
机构: NVIDIA Research (NVIDIA 研究院); Purdue University (普渡大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Synthesizing interactive 3D scenes from text is essential for gaming, virtual reality, and embodied AI. However, existing methods face several challenges. Learning-based approaches depend on small-scale indoor datasets, limiting the scene diversity and layout complexity. While large language models (LLMs) can leverage diverse text-domain knowledge, they struggle with spatial realism, often producing unnatural object placements that fail to respect common sense. Our key insight is that vision perception can bridge this gap by providing realistic spatial guidance that LLMs lack. To this end, we introduce Scenethesis, a training-free agentic framework that integrates LLM-based scene planning with vision-guided layout refinement. Given a text prompt, Scenethesis first employs an LLM to draft a coarse layout. A vision module then refines it by generating an image guidance and extracting scene structure to capture inter-object relations. Next, an optimization module iteratively enforces accurate pose alignment and physical plausibility, preventing artifacts like object penetration and instability. Finally, a judge module verifies spatial coherence. Comprehensive experiments show that Scenethesis generates diverse, realistic, and physically plausible 3D interactive scenes, making it valuable for virtual content creation, simulation environments, and embodied AI research.
zh

[CV-1] TWIST: Teleoperated Whole-Body Imitation System

【速读】:该论文旨在解决当前人形机器人在全身协调控制方面的不足,特别是现有系统通常仅限于孤立的移动或操作任务,而无法实现全身协同行为的问题。解决方案的关键在于提出一种基于全身运动仿真的遥操作系统——Teleoperated Whole-Body Imitation System (TWIST),其核心是通过结合强化学习与行为克隆(RL+BC)开发出一个鲁棒、自适应且响应迅速的全身控制器,并利用特权未来运动帧和真实世界运动捕捉(MoCap)数据提升跟踪精度,从而实现由单一统一神经网络控制器驱动的前所未有的全身协调运动技能。

链接: https://arxiv.org/abs/2505.02833
作者: Yanjie Ze,Zixuan Chen,João Pedro Araújo,Zi-ang Cao,Xue Bin Peng,Jiajun Wu,C. Karen Liu
机构: Stanford University (斯坦福大学); Simon Fraser University (西门菲沙大学)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Project website: this https URL

点击查看摘要

Abstract:Teleoperating humanoid robots in a whole-body manner marks a fundamental step toward developing general-purpose robotic intelligence, with human motion providing an ideal interface for controlling all degrees of freedom. Yet, most current humanoid teleoperation systems fall short of enabling coordinated whole-body behavior, typically limiting themselves to isolated locomotion or manipulation tasks. We present the Teleoperated Whole-Body Imitation System (TWIST), a system for humanoid teleoperation through whole-body motion imitation. We first generate reference motion clips by retargeting human motion capture data to the humanoid robot. We then develop a robust, adaptive, and responsive whole-body controller using a combination of reinforcement learning and behavior cloning (RL+BC). Through systematic analysis, we demonstrate how incorporating privileged future motion frames and real-world motion capture (MoCap) data improves tracking accuracy. TWIST enables real-world humanoid robots to achieve unprecedented, versatile, and coordinated whole-body motor skills–spanning whole-body manipulation, legged manipulation, locomotion, and expressive movement–using a single unified neural network controller. Our project website: this https URL
zh

[CV-2] No Other Representation Component Is Needed: Diffusion Transformers Can Provide Representation Guidance by Themselves

【速读】:该论文试图解决在扩散变换器(diffusion transformers)中如何有效获取有意义的内部表示以加速生成训练并提升生成质量的问题。现有方法通常需要引入额外且复杂的表示训练框架或依赖大规模预训练的表示基础模型来提供指导,这增加了模型的复杂性和资源消耗。该研究提出的关键解决方案是Self-Representation Alignment(SRA),其核心在于利用扩散变换器固有的判别过程,在仅进行生成训练的过程中通过自蒸馏的方式对齐不同噪声水平下的潜在表示,从而逐步增强整体表示学习能力。实验结果表明,SRA在DiTs和SiTs上的应用均取得了性能提升,并优于依赖辅助复杂表示训练框架的方法,同时达到与依赖强大外部表示先验方法相当的性能。

链接: https://arxiv.org/abs/2505.02831
作者: Dengyang Jiang,Mengmeng Wang,Liuzhuozheng Li,Lei Zhang,Haoyu Wang,Wei Wei,Guang Dai,Yanning Zhang,Jingdong Wang
机构: Northwestern Polytechnical University; SGIT AI Lab, State Grid Corporation of China; Zhejiang University of Technology; Baidu Inc.
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Self-Representation Alignment for Diffusion Transformers. arXiv admin note: text overlap with arXiv:2410.06940 by other authors

点击查看摘要

Abstract:Recent studies have demonstrated that learning a meaningful internal representation can both accelerate generative training and enhance generation quality of the diffusion transformers. However, existing approaches necessitate either introducing an additional and complex representation training framework or relying on a large-scale, pre-trained representation foundation model to provide representation guidance during the original generative training process. In this study, we posit that the unique discriminative process inherent to diffusion transformers enables them to offer such guidance without requiring external representation components. We therefore propose Self-Representation Alignment (SRA), a simple yet straightforward method that obtains representation guidance in a self-distillation manner. Specifically, SRA aligns the output latent representation of the diffusion transformer in earlier layers with higher noise to that in later layers with lower noise to progressively enhance the overall representation learning during only the generative training process. Experimental results indicate that applying SRA to DiTs and SiTs yields consistent performance improvements. Moreover, SRA not only significantly outperforms approaches relying on auxiliary, complex representation training frameworks but also achieves performance comparable to methods that depend heavily on powerful external representation priors.
zh
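
【代码示意】SRA 的核心对齐目标可用如下极简 PyTorch 片段示意(model.features、层与噪声水平的选取以及余弦距离这一损失形式均为本文假设):

```python
import torch
import torch.nn.functional as F

def sra_loss(model, x_t_high, t_high, x_t_low, t_low, early_layer, late_layer):
    """示意:将高噪声输入的浅层表示对齐到低噪声输入的深层表示(自蒸馏)。"""
    h_student = model.features(x_t_high, t_high, layer=early_layer)
    with torch.no_grad():                # 教师侧 stop-gradient,不回传梯度
        h_teacher = model.features(x_t_low, t_low, layer=late_layer)
    return 1 - F.cosine_similarity(h_student, h_teacher, dim=-1).mean()
```

该损失与原有生成目标联合训练即可,无需任何外部表示模型。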

[CV-3] Towards Application-Specific Evaluation of Vision Models: Case Studies in Ecology and Biology CVPR

【速读】:该论文试图解决当前计算机视觉模型在生态学和生物学应用中,其评估主要依赖机器学习指标而忽视对下游分析影响的问题。解决方案的关键在于引入应用特定的评估指标,以更准确地反映模型在其最终使用场景中的性能,从而提升模型在实际应用中的可靠性和有效性。

链接: https://arxiv.org/abs/2505.02825
作者: Alex Hoi Hang Chan,Otto Brookes,Urs Waldmann,Hemal Naik,Iain D. Couzin,Majid Mirmehdi,Noël Adiko Houa,Emmanuelle Normand,Christophe Boesch,Lukas Boesch,Mimi Arandjelovic,Hjalmar Kühl,Tilo Burghardt,Fumihiro Kano
机构: Centre for the Advanced Study of Collective Behaviour, University of Konstanz, Germany; Department of Collective Behavior, Max Planck Institute of Animal Behavior, Germany; Department of Biology, University of Konstanz, Germany; School of Computer Science, University of Bristol, United Kingdom; Wild Chimpanzee Foundation, Germany; Department of Computer and Information Science, University of Konstanz, Konstanz, Germany; Department of Ecology of Animal Societies, Max Planck Institute of Animal Behavior, Konstanz, Germany; Senckenberg Museum of Natural History Goerlitz, Goerlitz, Germany; Max Planck Institute for Evolutionary Anthropology, Germany.
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at CVPR Workshop, CV4Animals 2025

点击查看摘要

Abstract:Computer vision methods have demonstrated considerable potential to streamline ecological and biological workflows, with a growing number of datasets and models becoming available to the research community. However, these resources focus predominantly on evaluation using machine learning metrics, with relatively little emphasis on how their application impacts downstream analysis. We argue that models should be evaluated using application-specific metrics that directly represent model performance in the context of its final use case. To support this argument, we present two disparate case studies: (1) estimating chimpanzee abundance and density with camera trap distance sampling when using a video-based behaviour classifier and (2) estimating head rotation in pigeons using a 3D posture estimator. We show that even models with strong machine learning performance (e.g., 87% mAP) can yield data that leads to discrepancies in abundance estimates compared to expert-derived data. Similarly, the highest-performing models for posture estimation do not produce the most accurate inferences of gaze direction in pigeons. Motivated by these findings, we call for researchers to integrate application-specific metrics in ecological/biological datasets, allowing for models to be benchmarked in the context of their downstream application and to facilitate better integration of models into application workflows.
zh

[CV-4] owards Dataset Copyright Evasion Attack against Personalized Text-to-Image Diffusion Models

【速读】:该论文试图解决生成式 AI (Generative AI) 领域中,通过数据集所有权验证(Dataset Ownership Verification, DOV)技术嵌入的水印在面对版权规避攻击(Copyright Evasion Attack, CEA)时的脆弱性问题。其解决方案的关键在于提出了一种针对文本到图像扩散模型(Text-to-Image Diffusion Models)的首个版权规避攻击方法(CEAT2I),该方法通过三个阶段实现:水印样本检测、触发词识别和高效水印消除,利用模型在微调过程中对水印样本的快速收敛特性,从而可靠地识别并移除水印,同时保持模型性能。

链接: https://arxiv.org/abs/2505.02824
作者: Kuofeng Gao,Yufei Zhu,Yiming Li,Jiawang Bai,Yong Yang,Zhifeng Li,Shu-Tao Xia
机构: Tsinghua Shenzhen International Graduate School, Tsinghua University(清华大学深圳国际研究生院,清华大学); Peng Cheng Laboratory(鹏城实验室); College of Computer Science and Software Engineering, Shenzhen University(深圳大学计算机科学与软件工程学院); Nanyang Technological University(南洋理工大学); Tencent(腾讯)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
备注:

点击查看摘要

Abstract:Text-to-image (T2I) diffusion models have rapidly advanced, enabling high-quality image generation conditioned on textual prompts. However, the growing trend of fine-tuning pre-trained models for personalization raises serious concerns about unauthorized dataset usage. To combat this, dataset ownership verification (DOV) has emerged as a solution, embedding watermarks into the fine-tuning datasets using backdoor techniques. These watermarks remain inactive under benign samples but produce owner-specified outputs when triggered. Despite the promise of DOV for T2I diffusion models, its robustness against copyright evasion attacks (CEA) remains unexplored. In this paper, we explore how attackers can bypass these mechanisms through CEA, allowing models to circumvent watermarks even when trained on watermarked datasets. We propose the first copyright evasion attack (i.e., CEAT2I) specifically designed to undermine DOV in T2I diffusion models. Concretely, our CEAT2I comprises three stages: watermarked sample detection, trigger identification, and efficient watermark mitigation. A key insight driving our approach is that T2I models exhibit faster convergence on watermarked samples during the fine-tuning, evident through intermediate feature deviation. Leveraging this, CEAT2I can reliably detect the watermarked samples. Then, we iteratively ablate tokens from the prompts of detected watermarked samples and monitor shifts in intermediate features to pinpoint the exact trigger tokens. Finally, we adopt a closed-form concept erasure method to remove the injected watermark. Extensive experiments show that our CEAT2I effectively evades DOV mechanisms while preserving model performance.
zh

[CV-5] MUSAR: Exploring Multi-Subject Customization from Single-Subject Dataset via Attention Routing

【速读】:该论文旨在解决多主体定制化方法中面临的两个关键问题:获取多样化多主体训练数据的困难以及不同主体间属性纠缠。其解决方案的关键在于提出MUSAR框架,该框架通过去偏双联学习(debiased diptych learning)突破数据限制,并利用动态注意力路由机制消除跨主体纠缠,从而实现鲁棒的多主体定制化。

链接: https://arxiv.org/abs/2505.02823
作者: Zinan Guo,Pengze Zhang,Yanze Wu,Chong Mou,Songtao Zhao,Qian He
机构: Bytedance Intelligent Creation (字节跳动智能创作)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project page at this https URL

点击查看摘要

Abstract:Current multi-subject customization approaches encounter two critical challenges: the difficulty in acquiring diverse multi-subject training data, and attribute entanglement across different subjects. To bridge these gaps, we propose MUSAR - a simple yet effective framework to achieve robust multi-subject customization while requiring only single-subject training data. Firstly, to break the data limitation, we introduce debiased diptych learning. It constructs diptych training pairs from single-subject images to facilitate multi-subject learning, while actively correcting the distribution bias introduced by diptych construction via static attention routing and dual-branch LoRA. Secondly, to eliminate cross-subject entanglement, we introduce dynamic attention routing mechanism, which adaptively establishes bijective mappings between generated images and conditional subjects. This design not only achieves decoupling of multi-subject representations but also maintains scalable generalization performance with increasing reference subjects. Comprehensive experiments demonstrate that our MUSAR outperforms existing methods - even those trained on multi-subject dataset - in image quality, subject consistency, and interaction naturalness, despite requiring only single-subject dataset.
zh

[CV-6] Database-Agnostic Gait Enrollment using SetTransformers

【速读】:该论文试图解决开放集步态注册(open-set gait enrollment)问题,即在实际应用中判断一个新的步态样本是否对应已知身份或代表未见过的个体。解决方案的关键在于提出一种基于Transformer的框架,该框架具有数据集无关性和识别架构无关性,通过SetTransformer利用探针样本的嵌入和来自画廊的上下文集进行注册决策,无需任务特定阈值或针对新环境的微调,从而实现了在不同数据集、画廊规模和身份分布下的泛化能力。

链接: https://arxiv.org/abs/2505.02815
作者: Nicoleta Basoc,Adrian Cosma,Andy Cǎtrunǎ,Emilian Rǎdoi
机构: National University of Science and Technology POLITEHNICA Bucharest (国家科学与技术大学“布加勒斯特理工大学”)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 5 Tables, 6 Figures

点击查看摘要

Abstract:Gait recognition has emerged as a powerful tool for unobtrusive and long-range identity analysis, with growing relevance in surveillance and monitoring applications. Although recent advances in deep learning and large-scale datasets have enabled highly accurate recognition under closed-set conditions, real-world deployment demands open-set gait enrollment, which means determining whether a new gait sample corresponds to a known identity or represents a previously unseen individual. In this work, we introduce a transformer-based framework for open-set gait enrollment that is both dataset-agnostic and recognition-architecture-agnostic. Our method leverages a SetTransformer to make enrollment decisions based on the embedding of a probe sample and a context set drawn from the gallery, without requiring task-specific thresholds or retraining for new environments. By decoupling enrollment from the main recognition pipeline, our model is generalized across different datasets, gallery sizes, and identity distributions. We propose an evaluation protocol that uses existing datasets in different ratios of identities and walks per identity. We instantiate our method using skeleton-based gait representations and evaluate it on two benchmark datasets (CASIA-B and PsyMo), using embeddings from three state-of-the-art recognition models (GaitGraph, GaitFormer, and GaitPT). We show that our method is flexible, is able to accurately perform enrollment in different scenarios, and scales better with data compared to traditional approaches. We will make the code and dataset scenarios publicly available.
zh
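
为帮助理解“基于探针嵌入与画廊上下文集合做注册决策”的思路,下面给出一个极简的 PyTorch 示意(非论文官方实现;embed_dim、上下文集合规模等均为假设参数):

```python
import torch
import torch.nn as nn

class EnrollmentHead(nn.Module):
    """Toy enrollment decision head: attends a probe embedding over a
    gallery context set, then predicts P(known identity). Hypothetical
    sketch, not the paper's official SetTransformer implementation."""
    def __init__(self, embed_dim=256, num_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.cls = nn.Sequential(nn.LayerNorm(embed_dim),
                                 nn.Linear(embed_dim, 1))

    def forward(self, probe, gallery):
        # probe: (B, 1, D) embedding from any frozen recognition model
        # gallery: (B, N, D) context set drawn from the gallery
        ctx, _ = self.attn(query=probe, key=gallery, value=gallery)
        logit = self.cls(ctx.squeeze(1))  # (B, 1)
        return torch.sigmoid(logit)       # P(probe matches a known ID)

# usage: decide "known" vs "new identity" without a hand-tuned threshold
head = EnrollmentHead()
p_known = head(torch.randn(2, 1, 256), torch.randn(2, 32, 256))
```

该示意用交叉注意力把探针嵌入与画廊集合交互后输出“已知身份”的概率,从而避免人工设定任务相关阈值。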

[CV-7] DPNet: Dynamic Pooling Network for Tiny Object Detection

【速读】:该论文旨在解决在复杂环境下无人航空系统中对微小目标检测的准确性问题,特别是在通过图像缩放提升检测精度时所面临的计算成本高和负样本数量增加导致的性能下降问题。解决方案的关键在于提出一种动态池化网络(Dynamic Pooling Network, DPNet),其核心是引入一个可调节的下采样因子(df),将固定的特征图下采样过程转化为可调整的策略,并设计一个轻量级预测器来为每张输入图像预测df,从而实现输入感知的下采样。此外,还设计了自适应归一化模块(Adaptive Normalization Module, ANM)以使统一检测器兼容不同df值,并通过引导损失监督预测器的训练,实现计算资源的动态分配,平衡检测精度与效率。

链接: https://arxiv.org/abs/2505.02797
作者: Luqi Gong,Haotian Chen,Yikun Chen,Tianliang Yao,Chao Li,Shuai Zhao,Guangjie Han
机构: Beijing University of Posts and Telecommunications (北京邮电大学); Zhejiang Lab (浙江实验室); SWJTU-LEEDS Joint School (西南交通大学-利兹联合学院); Guangdong Zhiyun City construction Technology Co., LTD (广东智云城市建设技术有限公司); Tongji University (同济大学); Hohai University (河海大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 15 pages, 12 figures Haotian Chen and Luqi Gong contributed equally to this work

点击查看摘要

Abstract:In unmanned aerial systems, especially in complex environments, accurately detecting tiny objects is crucial. Resizing images is a common strategy to improve detection accuracy, particularly for small objects. However, simply enlarging images significantly increases computational costs and the number of negative samples, severely degrading detection performance and limiting its applicability. This paper proposes a Dynamic Pooling Network (DPNet) for tiny object detection to mitigate these issues. DPNet employs a flexible down-sampling strategy by introducing a factor (df) to relax the fixed downsampling process of the feature map to an adjustable one. Furthermore, we design a lightweight predictor to predict df for each input image, which is used to decrease the resolution of feature maps in the backbone. Thus, we achieve input-aware downsampling. We also design an Adaptive Normalization Module (ANM) to make a unified detector compatible with different dfs. A guidance loss supervises the predictor’s training. DPNet dynamically allocates computing resources to trade off between detection accuracy and efficiency. Experiments on the TinyCOCO and TinyPerson datasets show that DPNet can save over 35% and 25% GFLOPs, respectively, while maintaining comparable detection performance. The code will be made publicly available.
zh
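
下面用 PyTorch 给出“逐图像预测下采样因子 df 并据此缩放特征图”的最小示意(仅为假设性草图,网络结构与 df 取值范围并非论文原设):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DfPredictor(nn.Module):
    """Hypothetical lightweight predictor that regresses a per-image
    downsampling factor df in (df_min, 1.0)."""
    def __init__(self, df_min=0.25):
        super().__init__()
        self.df_min = df_min
        self.net = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=4, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, 1))

    def forward(self, img):
        # squash to (df_min, 1.0) so downsampling is always valid
        return self.df_min + (1 - self.df_min) * torch.sigmoid(self.net(img))

def dynamic_downsample(feat, df):
    # feat: (B, C, H, W); df: (B, 1) per-image factor
    outs = []
    for f, s in zip(feat, df):
        outs.append(F.interpolate(f[None], scale_factor=float(s),
                                  mode='bilinear', align_corners=False))
    return outs  # per-image feature maps at input-aware resolutions
```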

[CV-8] Unsupervised training of keypoint-agnostic descriptors for flexible retinal image registration

【速读】:该论文试图解决当前彩色眼底图像配准方法在标注数据不足方面的局限性,特别是在医学领域中这一问题更为显著,从而推动无监督学习的应用。其解决方案的关键在于开发一种新型的无监督描述符学习方法,该方法不依赖关键点检测,使得最终的描述符网络在配准推理过程中对所使用的关键点检测器具有无关性。通过在公开的视网膜图像配准数据集上进行广泛比较,并测试多种不同性质的关键点检测器,验证了该方法的有效性。实验结果表明,该方法在保持与监督方法相当性能的同时,实现了准确的图像配准。

链接: https://arxiv.org/abs/2505.02787
作者: David Rivas-Villar,Álvaro S. Hervella,José Rouco,Jorge Novo
机构: Universidade da Coruña (拉科鲁尼亚大学); Instituto de Investigacion Biomédica de A Coruña (INIBIC) (拉科鲁尼亚生物医学研究所)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Current color fundus image registration approaches are limited, among other things, by the lack of labeled data, which is even more significant in the medical domain, motivating the use of unsupervised learning. Therefore, in this work, we develop a novel unsupervised descriptor learning method that does not rely on keypoint detection. This enables the resulting descriptor network to be agnostic to the keypoint detector used during the registration inference. To validate this approach, we perform an extensive and comprehensive comparison on the reference public retinal image registration dataset. Additionally, we test our method with multiple keypoint detectors of varied nature, even proposing some novel ones. Our results demonstrate that the proposed approach offers accurate registration without incurring any performance loss versus supervised methods. Additionally, it demonstrates accurate performance regardless of the keypoint detector used. Thus, this work represents a notable step towards leveraging unsupervised learning in the medical domain.
zh

[CV-9] Advances in Automated Fetal Brain MRI Segmentation and Biometry: Insights from the FeTA 2024 Challenge

【速读】:该论文旨在解决胎儿脑组织分割与生物测量分析的自动化问题,以支持子宫内脑发育的研究。其关键解决方案在于引入了新的生物测量预测任务,并利用多中心、多场强的MRI数据集进行评估,特别是首次纳入了低场(0.55T)MRI数据,验证了低成本成像系统在高质量重建技术支持下的潜力。此外,通过引入拓扑特定的欧拉特征差异(Euler characteristic difference, ED)指标,揭示了传统指标未能捕捉到的拓扑差异,为更全面的模型评估提供了新思路。

链接: https://arxiv.org/abs/2505.02784
作者: Vladyslav Zalevskyi,Thomas Sanchez,Misha Kaandorp,Margaux Roulet,Diego Fajardo-Rojas,Liu Li,Jana Hutter,Hongwei Bran Li,Matthew Barkovich,Hui Ji,Luca Wilhelmi,Aline Dändliker,Céline Steger,Mériam Koob,Yvan Gomez,Anton Jakovčić,Melita Klaić,Ana Adžić,Pavel Marković,Gracia Grabarić,Milan Rados,Jordina Aviles Verdera,Gregor Kasprian,Gregor Dovjak,Raphael Gaubert-Rachmühl,Maurice Aschwanden,Qi Zeng,Davood Karimi,Denis Peruzzo,Tommaso Ciceri,Giorgio Longari,Rachika E. Hamadache,Amina Bouzid,Xavier Lladó,Simone Chiarella,Gerard Martí-Juan,Miguel Ángel González Ballester,Marco Castellaro,Marco Pinamonti,Valentina Visani,Robin Cremese,Keïn Sam,Fleur Gaudfernau,Param Ahir,Mehul Parikh,Maximilian Zenk,Michael Baumgartner,Klaus Maier-Hein,Li Tianhong,Yang Hong,Zhao Longfei,Domen Preloznik,Žiga Špiclin,Jae Won Choi,Muyang Li,Jia Fu,Guotai Wang,Jingwen Jiang,Lyuyang Tong,Bo Du,Andrea Gondova,Sungmin You,Kiho Im,Abdul Qayyum,Moona Mazher,Steven A Niederer,Maya Yanko,Bella Specktor-Fadida,Dafna Ben Bashat,Andras Jakab,Roxane Licandro,Kelly Payette,Meritxell Bach Cuadra
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Accurate fetal brain tissue segmentation and biometric analysis are essential for studying brain development in utero. The FeTA Challenge 2024 advanced automated fetal brain MRI analysis by introducing biometry prediction as a new task alongside tissue segmentation. For the first time, our diverse multi-centric test set included data from a new low-field (0.55T) MRI dataset. Evaluation metrics were also expanded to include the topology-specific Euler characteristic difference (ED). Sixteen teams submitted segmentation methods, most of which performed consistently across both high- and low-field scans. However, longitudinal trends indicate that segmentation accuracy may be reaching a plateau, with results now approaching inter-rater variability. The ED metric uncovered topological differences that were missed by conventional metrics, while the low-field dataset achieved the highest segmentation scores, highlighting the potential of affordable imaging systems when paired with high-quality reconstruction. Seven teams participated in the biometry task, but most methods failed to outperform a simple baseline that predicted measurements based solely on gestational age, underscoring the challenge of extracting reliable biometric estimates from image data alone. Domain shift analysis identified image quality as the most significant factor affecting model generalization, with super-resolution pipelines also playing a substantial role. Other factors, such as gestational age, pathology, and acquisition site, had smaller, though still measurable, effects. Overall, FeTA 2024 offers a comprehensive benchmark for multi-class segmentation and biometry estimation in fetal brain MRI, underscoring the need for data-centric approaches, improved topological evaluation, and greater dataset diversity to enable clinically robust and generalizable AI tools.
zh

[CV-10] Unsupervised Deep Learning-based Keypoint Localization Estimating Descriptor Matching Performance

【速读】:该论文旨在解决彩色视网膜图像配准中对标注数据依赖性强的问题,传统方法通常依赖于关键点和描述符进行对齐,但其显著局限在于需要大量标注数据,而医学领域此类数据稀缺。论文提出了一种完全无需标注数据的无监督配准流程,其关键在于通过描述符的区分性来确定可靠的关键点,从而将检测器的条件设定为描述符而非传统的相反方式,实现了对关键点检测和描述符学习的创新性联合优化。

链接: https://arxiv.org/abs/2505.02779
作者: David Rivas-Villar,Álvaro S. Hervella,José Rouco,Jorge Novo
机构: Universidade da Coruña (拉科鲁尼亚大学); Instituto de Investigacion Biomédica de A Coruña (INIBIC) (拉科鲁尼亚生物医学研究所)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Retinal image registration, particularly for color fundus images, is a challenging yet essential task with diverse clinical applications. Existing registration methods for color fundus images typically rely on keypoints and descriptors for alignment; however, a significant limitation is their reliance on labeled data, which is particularly scarce in the medical domain. In this work, we present a novel unsupervised registration pipeline that entirely eliminates the need for labeled data. Our approach is based on the principle that locations with distinctive descriptors constitute reliable keypoints. This fully inverts the conventional state-of-the-art approach, conditioning the detector on the descriptor rather than the opposite. First, we propose an innovative descriptor learning method that operates without keypoint detection or any labels, generating descriptors for arbitrary locations in retinal images. Next, we introduce a novel, label-free keypoint detector network which works by estimating descriptor performance directly from the input image. We validate our method through a comprehensive evaluation on four hold-out datasets, demonstrating that our unsupervised descriptor outperforms state-of-the-art supervised descriptors and that our unsupervised detector significantly outperforms existing unsupervised detection methods. Finally, our full registration pipeline achieves performance comparable to the leading supervised methods, while not employing any labeled data. Additionally, the label-free nature and design of our method enable direct adaptation to other domains and modalities.
zh

[CV-11] Advancing Generalizable Tumor Segmentation with Anomaly-Aware Open-Vocabulary Attention Maps and Frozen Foundation Diffusion Models CVPR2025

【速读】:该论文旨在解决跨不同解剖区域的零样本肿瘤分割问题,即训练一个单一模型以实现对未见过的肿瘤类型的泛化分割。现有方法在分割质量、可扩展性和适用成像模态范围方面存在局限。论文提出了一种名为DiffuGTS的新框架,其关键在于利用冻结的医学基础扩散模型内部表示作为高效的零样本学习者,并通过文本提示生成异常感知的开放词汇注意力图,从而实现不受预定义类别列表限制的泛化异常分割。此外,DiffuGTS通过潜在空间修复将病灶区域转换为高质量伪健康图像,并采用像素级和特征级残差学习方法,显著提升了分割掩码的质量和泛化能力。

链接: https://arxiv.org/abs/2505.02753
作者: Yankai Jiang,Peng Zhang,Donglin Yang,Yuan Tian,Hai Lin,Xiaosong Wang
机构: Shanghai AI Laboratory (上海人工智能实验室); Zhejiang University (浙江大学); The University of British Columbia (不列颠哥伦比亚大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: This paper is accepted to CVPR 2025

点击查看摘要

Abstract:We explore Generalizable Tumor Segmentation, aiming to train a single model for zero-shot tumor segmentation across diverse anatomical regions. Existing methods face limitations related to segmentation quality, scalability, and the range of applicable imaging modalities. In this paper, we uncover the potential of the internal representations within frozen medical foundation diffusion models as highly efficient zero-shot learners for tumor segmentation by introducing a novel framework named DiffuGTS. DiffuGTS creates anomaly-aware open-vocabulary attention maps based on text prompts to enable generalizable anomaly segmentation without being restricted by a predefined training category list. To further improve and refine anomaly segmentation masks, DiffuGTS leverages the diffusion model, transforming pathological regions into high-quality pseudo-healthy counterparts through latent space inpainting, and applies a novel pixel-level and feature-level residual learning approach, resulting in segmentation masks with significantly enhanced quality and generalization. Comprehensive experiments on four datasets and seven tumor categories demonstrate the superior performance of our method, surpassing current state-of-the-art models across multiple zero-shot settings. Codes are available at this https URL.
zh

[CV-12] Platelet enumeration in dense aggregates IJCNN2025

【速读】:该论文试图解决在血液成分识别与计数任务中,尤其是血小板(platelet)的识别与计数问题,由于血小板尺寸小、特征变化大以及易形成聚集和与其他血细胞关联,传统基于卷积神经网络(CNN)的架构如U-Net难以准确识别。解决方案的关键在于优化卷积核的作用,并对单个血小板和血小板聚集物分别设定类别,通过语义分割方法进行血小板识别,同时提出了一种针对单个血小板和聚集物的改进计数方法,以克服传统基于像素面积的计数方法导致的高估问题。

链接: https://arxiv.org/abs/2505.02751
作者: H. Martin Gillis,Yogeshwar Shendye,Paul Hollensen,Alan Fine,Thomas Trappenberg
机构: Dalhousie University (达尔豪斯大学); Alentic Microscience Inc. (阿尔伦蒂克微科学公司)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: International Joint Conference on Neural Networks (IJCNN 2025)

点击查看摘要

Abstract:Identifying and counting blood components such as red blood cells, various types of white blood cells, and platelets is a critical task for healthcare practitioners. Deep learning approaches, particularly convolutional neural networks (CNNs) using supervised learning strategies, have shown considerable success for such tasks. However, CNN-based architectures such as U-Net often struggle to accurately identify platelets due to their sizes and high variability of features. To address these challenges, researchers have commonly employed strategies such as class-weighted loss functions, which have demonstrated some success. However, this does not address the more significant challenge of platelet variability in size and tendency to form aggregates and associations with other blood components. In this study, we explored an alternative approach by investigating the role of convolutional kernels in mitigating these issues. We also assigned separate classes to singular platelets and platelet aggregates and performed semantic segmentation using various U-Net architectures for identifying platelets. We then evaluated and compared two common methods (pixel area method and connected component analysis) for counting platelets and proposed an alternative approach specialized for single platelets and platelet aggregates. Our experiments provided results that showed significant improvements in the identification of platelets, highlighting the importance of optimizing convolutional operations and class designations. We show that the common practice of pixel-area-based counting often overestimates platelet counts, whereas the method proposed in this work offers significant improvements. We discuss these methods in detail based on the resulting segmentation masks.
zh
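
为说明摘要中“像素面积法”与“连通域分析”两种计数方式的差别,下面给出一个基于 scipy 的示意(mask 为语义分割得到的二值掩码;mean_platelet_area 与聚集体内的计数规则均为假设):

```python
import numpy as np
from scipy import ndimage

def count_by_pixel_area(mask, mean_platelet_area=50.0):
    """Pixel-area counting: divide total platelet pixels by an assumed
    mean per-platelet area. Tends to overestimate on aggregates."""
    return mask.sum() / mean_platelet_area

def count_by_connected_components(single_mask, aggregate_mask,
                                  mean_platelet_area=50.0):
    """Treat singular platelets and aggregates as separate classes:
    count singles as connected components, and estimate platelets
    inside each aggregate from its area (illustrative heuristic)."""
    _, n_singles = ndimage.label(single_mask)
    agg_labels, n_aggs = ndimage.label(aggregate_mask)
    agg_count = 0
    for i in range(1, n_aggs + 1):
        # an aggregate contains at least two platelets by definition
        agg_count += max(2, round((agg_labels == i).sum() / mean_platelet_area))
    return n_singles + agg_count
```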

[CV-13] A Rate-Quality Model for Learned Video Coding

【速读】:该论文试图解决 Learned Video Coding (LVC) 中率失真(Rate-Quality, R-Q)关系建模的问题,旨在更精确地估计视频编码中的比特率与质量之间的关系。解决方案的关键在于提出一种名为 RQNet 的神经网络,该网络根据视频内容和编码上下文来表征比特率与质量等级之间的关系,并通过最小二乘法将预测的(R,Q)结果与之前编码帧的结果进行融合,从而在线实时确定 R-Q 模型的参数。这种方法相比传统方法能够更准确地估计 R-Q 关系,提升了模型参数的在线适应能力,从而增强了系统的灵活性和精度。

链接: https://arxiv.org/abs/2505.02720
作者: Sang NguyenQuang,Cheng-Wei Chen,Xiem HoangVan,Wen-Hsiao Peng
机构: National Yang Ming Chiao Tung University (国立阳明交通大学); VNU University of Engineering and Technology (越南国家大学工程与技术学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Learned video coding (LVC) has recently achieved superior coding performance. In this paper, we model the rate-quality (R-Q) relationship for learned video coding by a parametric function. We learn a neural network, termed RQNet, to characterize the relationship between the bitrate and quality level according to video content and coding context. The predicted (R,Q) results are further integrated with those from previously coded frames using the least-squares method to determine the parameters of our R-Q model on-the-fly. Compared to the conventional approaches, our method accurately estimates the R-Q relationship, enabling the online adaptation of model parameters to enhance both flexibility and precision. Experimental results show that our R-Q model achieves significantly smaller bitrate deviations than the baseline method on commonly used datasets with minimal additional complexity.
zh
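
“用最小二乘在线确定 R-Q 模型参数”的思路可以用如下 numpy 草图说明(其中参数化形式 Q = a·ln(R) + b 以及全部数值均为示意假设,并非论文给定的模型):

```python
import numpy as np

def fit_rq_model(rates, qualities):
    """Least-squares fit of an illustrative parametric R-Q model
    Q = a * ln(R) + b, pooling RQNet predictions for the current frame
    with (R, Q) observations from previously coded frames."""
    A = np.stack([np.log(rates), np.ones_like(rates)], axis=1)
    (a, b), *_ = np.linalg.lstsq(A, qualities, rcond=None)
    return a, b

# e.g. three observed points from earlier frames + one RQNet prediction
R = np.array([0.05, 0.10, 0.20, 0.40])  # bits per pixel (hypothetical)
Q = np.array([32.1, 34.8, 37.2, 39.5])  # PSNR in dB (hypothetical)
a, b = fit_rq_model(R, Q)
target_bpp = np.exp((36.0 - b) / a)     # rate needed to hit 36 dB
```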

[CV-14] Visually-Guided Linguistic Disambiguation for Monocular Depth Scale Recovery

【速读】:该论文试图解决单目深度估计中相对深度图的尺度恢复问题,即如何从无尺度信息的相对深度图中恢复出具有绝对尺度的度量深度。解决方案的关键在于提出一种名为VGLD的方法,通过结合图像的高层语义信息与文本描述,稳定文本信息对尺度恢复的影响,从而消除文本歧义,并输出一组可全局应用于相对深度图的线性变换参数(标量),最终实现具有度量尺度精度的深度预测。

链接: https://arxiv.org/abs/2505.02704
作者: Bojin Wu,Jing Chen
机构: Guangdong University of Technology (广东工业大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 21 pages, conference

点击查看摘要

Abstract:We propose a robust method for monocular depth scale recovery. Monocular depth estimation can be divided into two main directions: (1) relative depth estimation, which provides normalized or inverse depth without scale information, and (2) metric depth estimation, which involves recovering depth with absolute scale. To obtain absolute scale information for practical downstream tasks, utilizing textual information to recover the scale of a relative depth map is a highly promising approach. However, since a single image can have multiple descriptions from different perspectives or with varying styles, it has been shown that different textual descriptions can significantly affect the scale recovery process. To address this issue, our method, VGLD, stabilizes the influence of textual information by incorporating high-level semantic information from the corresponding image alongside the textual description. This approach resolves textual ambiguities and robustly outputs a set of linear transformation parameters (scalars) that can be globally applied to the relative depth map, ultimately generating depth predictions with metric-scale accuracy. We validate our method across several popular relative depth models (MiDaS, DepthAnything), using both indoor scenes (NYUv2) and outdoor scenes (KITTI). Our results demonstrate that VGLD functions as a universal alignment module when trained on multiple datasets, achieving strong performance even in zero-shot scenarios. Code is available at: this https URL.
zh
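
VGLD 的输出是一组可全局作用于相对深度图的线性变换标量;下述 numpy 片段演示这一恢复步骤(scale 与 shift 的数值为假设,实际由模型根据图像与文本预测):

```python
import numpy as np

def apply_global_linear_transform(relative_depth, scale, shift):
    """Recover metric depth from a scale-free relative depth map by a
    single global linear transform. scale/shift stand in for the
    scalars VGLD would predict; the values below are hypothetical."""
    return scale * relative_depth + shift

rel = np.random.rand(480, 640)  # e.g. a MiDaS-style relative depth map
metric = apply_global_linear_transform(rel, scale=8.5, shift=0.3)  # meters
```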

[CV-15] Structure Causal Models and LLM s Integration in Medical Visual Question Answering

【速读】:该论文旨在解决医学视觉问答(MedVQA)任务中因医学数据复杂性导致的图像与问题之间的交叉模态偏差问题,这种偏差使得模型难以推断出具有医学意义的答案。解决方案的关键在于提出一种基于因果推理的框架,通过引入新颖的因果图结构来显式建模视觉与文本元素之间的交互,并利用互信息发现虚假相关性,结合多变量重采样前门调整方法消除相对混杂效应,从而确保问答过程的准确性。

链接: https://arxiv.org/abs/2505.02703
作者: Zibo Xu,Qiang Li,Weizhi Nie,Weijie Wang,Anan Liu
机构: Tianjin University (天津大学); University of Trento (特伦托大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by IEEE TMI 2025

点击查看摘要

Abstract:Medical Visual Question Answering (MedVQA) aims to answer medical questions according to medical images. However, the complexity of medical data leads to confounders that are difficult to observe, so bias between images and questions is inevitable. Such cross-modal bias makes it challenging to infer medically meaningful answers. In this work, we propose a causal inference framework for the MedVQA task, which effectively eliminates the relative confounding effect between the image and the question to ensure the precision of the question-answering (QA) session. We are the first to introduce a novel causal graph structure that represents the interaction between visual and textual elements, explicitly capturing how different questions influence visual features. During optimization, we apply the mutual information to discover spurious correlations and propose a multi-variable resampling front-door adjustment method to eliminate the relative confounding effect, which aims to align features based on their true causal relevance to the question-answering task. In addition, we also introduce a prompt strategy that combines multiple prompt forms to improve the model’s ability to understand complex medical data and answer accurately. Extensive experiments on three MedVQA datasets demonstrate that 1) our method significantly improves the accuracy of MedVQA, and 2) our method achieves true causal correlations in the face of complex medical data.
zh

[CV-16] Dance of Fireworks: An Interactive Broadcast Gymnastics Training System Based on Pose Estimation

【速读】:该论文旨在解决久坐人群因缺乏身体活动而产生的健康风险,其解决方案的关键在于开发一种名为“Dance of Fireworks”的交互系统,通过增强用户对广播体操的参与度来改善这一问题。该系统利用移动设备摄像头和轻量级姿态估计技术(PoseNet/TensorFlow Lite)实时提取人体关键点、计算关节角度,并与标准动作进行对比以提供即时纠正反馈。同时,系统通过将用户的运动数据(如关节角度和速度)动态映射到可定制的烟花动画,激励用户提高动作准确性,从而提升锻炼效果和娱乐性。

链接: https://arxiv.org/abs/2505.02690
作者: Haotian Chen,Ziyu Liu,Xi Cheng,Chuangqi Li
机构: SWJTU-LEEDS Joint School (西南交大-利兹联合学院); Southwest Jiaotong University (西南交通大学); School of Computer and Information Engineering (计算机与信息工程学院); Xi'an Jiaotong-Liverpool University (西安交大利物浦大学); School of Vehicle and Delivery (车辆与运载学院); Tsinghua University (清华大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 21 pages, 13 figures

点击查看摘要

Abstract:This study introduces Dance of Fireworks, an interactive system designed to combat sedentary health risks by enhancing engagement in radio calisthenics. Leveraging mobile device cameras and lightweight pose estimation (PoseNet/TensorFlow Lite), the system extracts body keypoints, computes joint angles, and compares them with standardized motions to deliver real-time corrective feedback. To incentivize participation, it dynamically maps users’ movements (such as joint angles and velocity) to customizable fireworks animations, rewarding improved accuracy with richer visual effects. Experiments involving 136 participants demonstrated a significant reduction in average joint angle errors from 21.3 degrees to 9.8 degrees (p < 0.01) over four sessions, with 93.4 percent of users affirming its exercise-promoting efficacy and 85.4 percent praising its entertainment value. The system operates without predefined motion templates or specialised hardware, enabling seamless integration into office environments. Future enhancements will focus on improving pose recognition accuracy, reducing latency, and adding features such as multiplayer interaction and music synchronisation. This work presents a cost-effective, engaging solution to promote physical activity in sedentary populations.
zh
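
系统中“由关键点计算关节角度并与标准动作对比”的核心一步可如下示意(关键点坐标与 10 度反馈阈值均为假设):

```python
import numpy as np

def joint_angle(a, b, c):
    """Angle at keypoint b (degrees) formed by keypoints a-b-c,
    e.g. shoulder-elbow-wrist from PoseNet output."""
    v1, v2 = np.asarray(a) - b, np.asarray(c) - b
    cos = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2) + 1e-8)
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))

# hypothetical corrective-feedback rule against a standard motion
user = joint_angle((0.42, 0.30), (0.50, 0.45), (0.62, 0.44))
standard = 90.0
if abs(user - standard) > 10.0:
    print(f"deviation {abs(user - standard):.1f} deg: adjust elbow angle")
```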

[CV-17] Multimodal Deep Learning for Stroke Prediction and Detection using Retinal Imaging and Clinical Data

【速读】:该论文试图解决如何利用成本较低的视网膜成像技术提升中风的检测与风险预测问题,以替代传统依赖昂贵医学影像模态(如计算机断层扫描)的方法。其解决方案的关键在于构建一个融合视网膜光学相干断层扫描(OCT)和红外反射视网膜扫描图像以及临床数据的多模态深度神经网络模型,并通过自监督学习框架进行预训练,随后在标记子集上进行微调和评估,从而实现对中风后视网膜长期影响的预测及未来风险的准确识别。

链接: https://arxiv.org/abs/2505.02677
作者: Saeed Shurrab,Aadim Nepal,Terrence J. Lee-St. John,Nicola G. Ghazi,Bartlomiej Piechowski-Jozwiak,Farah E. Shamout
机构: New York University Abu Dhabi(纽约大学阿布扎比分校); Institute for Healthier Living Abu Dhabi(阿布扎比健康生活研究所); Eye Institute at Cleveland Clinic Abu Dhabi(克利夫兰诊所阿布扎比分院); Canberra Hospital(堪培拉医院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Stroke is a major public health problem, affecting millions worldwide. Deep learning has recently demonstrated promise for enhancing the diagnosis and risk prediction of stroke. However, existing methods rely on costly medical imaging modalities, such as computed tomography. Recent studies suggest that retinal imaging could offer a cost-effective alternative for cerebrovascular health assessment due to the shared clinical pathways between the retina and the brain. Hence, this study explores the impact of leveraging retinal images and clinical data for stroke detection and risk prediction. We propose a multimodal deep neural network that processes Optical Coherence Tomography (OCT) and infrared reflectance retinal scans, combined with clinical data, such as demographics, vital signs, and diagnosis codes. We pretrained our model using a self-supervised learning framework using a real-world dataset consisting of 37 k scans, and then fine-tuned and evaluated the model using a smaller labeled subset. Our empirical findings establish the predictive ability of the considered modalities in detecting lasting effects in the retina associated with acute stroke and forecasting future risk within a specific time horizon. The experimental results demonstrate the effectiveness of our proposed framework by achieving 5 % AUROC improvement as compared to the unimodal image-only baseline, and 8 % improvement compared to an existing state-of-the-art foundation model. In conclusion, our study highlights the potential of retinal imaging in identifying high-risk patients and improving long-term outcomes.
zh

[CV-18] Grasp the Graph (GtG) 2.0: Ensemble of GNNs for High-Precision Grasp Pose Detection in Clutter

【速读】:该论文旨在解决在杂乱、现实环境中抓取姿态检测的问题,该问题由于传感器数据的噪声和不完整性以及复杂物体几何形状而具有挑战性。解决方案的关键在于提出Grasp the Graph 2.0 (GtG 2.0)方法,该方法是一种轻量级但高效的假设与测试机器人抓取框架,利用图神经网络(Graph Neural Networks)从点云数据中进行高效的几何推理。GtG 2.0通过传统抓取姿态生成器高效生成7自由度(7-Dof)抓取候选,并使用集成图神经网络模型对候选进行评估,从而提升了抓取检测性能。

链接: https://arxiv.org/abs/2505.02664
作者: Ali Rashidi Moghadam,Sayedmohammadreza Rastegari,Mehdi Tale Masouleh,Ahmad Kalhor
机构: University of Tehran(德黑兰大学)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 9 Pages, 6 figures

点击查看摘要

Abstract:Grasp pose detection in cluttered, real-world environments remains a significant challenge due to noisy and incomplete sensory data combined with complex object geometries. This paper introduces Grasp the Graph 2.0 (GtG 2.0) method, a lightweight yet highly effective hypothesis-and-test robotics grasping framework which leverages an ensemble of Graph Neural Networks for efficient geometric reasoning from point cloud data. Building on the success of GtG 1.0, which demonstrated the potential of Graph Neural Networks for grasp detection but was limited by assumptions of complete, noise-free point clouds and 4-Dof grasping, GtG 2.0 employs a conventional Grasp Pose Generator to efficiently produce 7-Dof grasp candidates. Candidates are assessed with an ensemble Graph Neural Network model which includes points within the gripper jaws (inside points) and surrounding contextual points (outside points). This improved representation boosts grasp detection performance over previous methods using the same generator. GtG 2.0 shows up to a 35% improvement in Average Precision on the GraspNet-1Billion benchmark compared to hypothesis-and-test and Graph Neural Network-based methods, ranking it among the top three frameworks. Experiments with a 3-Dof Delta Parallel robot and Kinect-v1 camera show a success rate of 91% and a clutter completion rate of 100%, demonstrating its flexibility and reliability.
zh

[CV-19] Sim2Real in endoscopy segmentation with a novel structure aware image translation

【速读】:该论文旨在解决在内窥镜图像中自动分割解剖标志点的难题,特别是在缺乏真实标注数据的情况下,如何有效训练模型的问题。传统监督学习方法依赖于繁琐且困难的真实图像标注,而基于合成数据的方法虽易于获取标注,但模型泛化能力较差。该研究的关键解决方案是提出一种新颖的图像翻译模型,能够在保持原始场景关键布局信息的同时,为模拟内窥镜图像添加逼真的纹理,从而生成可用于训练的高质量合成数据。

链接: https://arxiv.org/abs/2505.02654
作者: Clara Tomasini,Luis Riazuelo,Ana C. Murillo
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Automatic segmentation of anatomical landmarks in endoscopic images can provide assistance to doctors and surgeons for diagnosis, treatments or medical training. However, obtaining the annotations required to train commonly used supervised learning methods is a tedious and difficult task, in particular for real images. While ground truth annotations are easier to obtain for synthetic data, models trained on such data often do not generalize well to real data. Generative approaches can add realistic texture to synthetic images, but face difficulties in maintaining the structure of the original scene. The main contribution in this work is a novel image translation model that adds realistic texture to simulated endoscopic images while keeping the key scene layout information. Our approach produces realistic images in different endoscopy scenarios. We demonstrate these images can effectively be used to successfully train a model for a challenging end task without any real labeled data. In particular, we demonstrate our approach for the task of fold segmentation in colonoscopy images. Folds are key anatomical landmarks that can occlude parts of the colon mucosa and possible polyps. Our approach generates realistic images maintaining the shape and location of the original folds, after the image-style-translation, better than existing methods. We run experiments both on a novel simulated dataset for fold segmentation, and real data from the EndoMapper (EM) dataset. All our new generated data and new EM metadata is being released to facilitate further research, as no public benchmark is currently available for the task of fold segmentation.
zh

[CV-20] MCCD: Multi-Agent Collaboration-based Compositional Diffusion for Complex Text-to-Image Generation

【速读】:该论文试图解决现有扩散模型在处理包含多个对象、特征和关系的复杂提示时性能受限的问题。其解决方案的关键在于提出一种基于多智能体协作的组合扩散(MCCD)方法,通过设计一个多智能体协作的场景解析模块,生成由具有不同任务的智能体组成的系统,并利用多模态大语言模型有效提取场景元素;同时采用分层组合扩散机制,结合高斯掩码和过滤技术细化边界框区域并增强物体,从而实现复杂场景的准确且高保真生成。

链接: https://arxiv.org/abs/2505.02648
作者: Mingcheng Li,Xiaolu Hou,Ziyang Liu,Dingkang Yang,Ziyun Qian,Jiawei Chen,Jinjie Wei,Yue Jiang,Qingyao Xu,Lihua Zhang
机构: Fudan University (复旦大学); CIT Lab (认知与智能技术实验室); Soochow University (苏州大学); Jilin Provincial Key Laboratory of Intelligence Science and Engineering (吉林省人工智能科学与工程重点实验室); Engineering Research Center of AI and Robotics, Ministry of Education (人工智能与机器人工程研究中心,教育部)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Diffusion models have shown excellent performance in text-to-image generation. Nevertheless, existing methods often suffer from performance bottlenecks when handling complex prompts that involve multiple objects, characteristics, and relations. Therefore, we propose a Multi-agent Collaboration-based Compositional Diffusion (MCCD) method for text-to-image generation in complex scenes. Specifically, we design a multi-agent collaboration-based scene parsing module that generates an agent system comprising multiple agents with distinct tasks, utilizing MLLMs to extract various scene elements effectively. In addition, Hierarchical Compositional Diffusion utilizes a Gaussian mask and filtering to refine bounding box regions and enhance objects through region enhancement, resulting in the accurate and high-fidelity generation of complex scenes. Comprehensive experiments demonstrate that our MCCD significantly improves the performance of the baseline models in a training-free manner, providing a substantial advantage in complex scene generation.
zh
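
摘要中“利用高斯掩码细化边界框区域”可以用下面的 numpy 草图直观说明(sigma_ratio 为假设的超参数,并非论文原设):

```python
import numpy as np

def gaussian_box_mask(h, w, box, sigma_ratio=0.25):
    """Soft Gaussian mask centered on a bounding box (x0, y0, x1, y1),
    used to emphasize a subject's region during region enhancement.
    Illustrative sketch; sigma_ratio is an assumed hyperparameter."""
    x0, y0, x1, y1 = box
    cy, cx = (y0 + y1) / 2, (x0 + x1) / 2
    sy, sx = sigma_ratio * (y1 - y0), sigma_ratio * (x1 - x0)
    ys, xs = np.mgrid[0:h, 0:w]
    return np.exp(-(((ys - cy) / sy) ** 2 + ((xs - cx) / sx) ** 2) / 2)

mask = gaussian_box_mask(512, 512, box=(100, 150, 300, 400))
```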

[CV-21] Detect Classify Act: Categorizing Industrial Anomalies with Multi-Modal Large Language Models CVPR2025

【速读】:该论文试图解决工业视觉异常分类(anomaly classification)问题,即在检测到异常后区分不同类型的异常,这一任务在实际检测任务中具有重要意义,但此前研究较少。解决方案的关键在于提出一种基于大语言模型(LLM)的新型流水线VELM,首先利用无监督异常检测方法作为视觉专家评估观察结果的正常性,若检测到异常,则由LLM进行类型分类。此外,为解决现有数据集中缺乏精确异常类别标注的问题,作者引入了改进的数据集MVTec-AC和VisA-AC,以支持更严格的评估。

链接: https://arxiv.org/abs/2505.02626
作者: Sassan Mokhtar,Arian Mousakhan,Silvio Galesso,Jawad Tayyub,Thomas Brox
机构: University of Bonn (波恩大学); University of Freiburg (弗莱堡大学); Endress + Hauser (恩德斯+豪雅)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted as a spotlight presentation paper at the VAND Workshop, CVPR 2025. 10 pages, 6 figures

点击查看摘要

Abstract:Recent advances in visual industrial anomaly detection have demonstrated exceptional performance in identifying and segmenting anomalous regions while maintaining fast inference speeds. However, anomaly classification, i.e., distinguishing different types of anomalies, remains largely unexplored despite its critical importance in real-world inspection tasks. To address this gap, we propose VELM, a novel LLM-based pipeline for anomaly classification. Given the critical importance of inference speed, we first apply an unsupervised anomaly detection method as a vision expert to assess the normality of an observation. If an anomaly is detected, the LLM then classifies its type. A key challenge in developing and evaluating anomaly classification models is the lack of precise annotations of anomaly classes in existing datasets. To address this limitation, we introduce MVTec-AC and VisA-AC, refined versions of the widely used MVTec-AD and VisA datasets, which include accurate anomaly class labels for rigorous evaluation. Our approach achieves a state-of-the-art anomaly classification accuracy of 80.4% on MVTec-AD, exceeding the prior baselines by 5%, and 84% on MVTec-AC, demonstrating the effectiveness of VELM in understanding and categorizing anomalies. We hope our methodology and benchmark inspire further research in anomaly classification, helping bridge the gap between detection and comprehensive anomaly characterization.
zh

[CV-22] DELTA: Dense Depth from Events and LiDAR using Transformers Attention CVPR2025

【速读】:该论文旨在解决基于事件的深度估计问题,即如何利用事件相机(event camera)和LiDAR数据融合以生成密集深度图。其解决方案的关键在于提出了一种基于神经网络的方法DELTA,该方法通过自注意力(self-attention)和交叉注意力(cross-attention)机制建模事件数据与LiDAR数据之间的空间和时间关系,从而实现更精确的深度估计。

链接: https://arxiv.org/abs/2505.02593
作者: Vincent Brebion,Julien Moreau,Franck Davoine
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted for the CVPR 2025 Workshop on Event-based Vision. For the project page, see this https URL

点击查看摘要

Abstract:Event cameras and LiDARs provide complementary yet distinct data: respectively, asynchronous detections of changes in lighting versus sparse but accurate depth information at a fixed rate. To this day, few works have explored the combination of these two modalities. In this article, we propose a novel neural-network-based method for fusing event and LiDAR data in order to estimate dense depth maps. Our architecture, DELTA, exploits the concepts of self- and cross-attention to model the spatial and temporal relations within and between the event and LiDAR data. Following a thorough evaluation, we demonstrate that DELTA sets a new state of the art in the event-based depth estimation problem, and that it is able to reduce the errors up to four times for close ranges compared to the previous SOTA.
zh
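
“事件与 LiDAR token 之间的交叉注意力”在形式上可以写成如下最小 PyTorch 模块(维度与头数为假设,仅示意 DELTA 背后的机制而非其完整架构):

```python
import torch
import torch.nn as nn

class EventLidarCrossAttention(nn.Module):
    """Minimal cross-attention block: event tokens query LiDAR tokens
    (and could be applied symmetrically the other way). A sketch of the
    idea behind DELTA, not its actual architecture."""
    def __init__(self, dim=128, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, event_tokens, lidar_tokens):
        fused, _ = self.attn(event_tokens, lidar_tokens, lidar_tokens)
        return self.norm(event_tokens + fused)  # residual + norm

blk = EventLidarCrossAttention()
out = blk(torch.randn(1, 1024, 128), torch.randn(1, 256, 128))
```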

[CV-23] RGBX-DiffusionDet: A Framework for Multi-Modal RGB-X Object Detection Using DiffusionDet

【速读】:该论文旨在解决多模态2D数据(如深度、偏振和红外图像)与RGB图像在目标检测任务中的融合问题,以提升检测性能。其解决方案的关键在于设计了动态通道缩减卷积块注意力模块(DCR-CBAM)以实现跨模态交互,并引入动态多级聚合块(DMLAB)进行自适应多尺度特征融合,同时结合新型正则化损失函数以增强特征嵌入的紧凑性和判别性。

链接: https://arxiv.org/abs/2505.02586
作者: Eliraz Orfaig,Inna Stainvas,Igal Bilik
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:This work introduces RGBX-DiffusionDet, an object detection framework extending the DiffusionDet model to fuse the heterogeneous 2D data (X) with RGB imagery via an adaptive multimodal encoder. To enable cross-modal interaction, we design the dynamic channel reduction within a convolutional block attention module (DCR-CBAM), which facilitates cross-talk between subnetworks by dynamically highlighting salient channel features. Furthermore, the dynamic multi-level aggregation block (DMLAB) is proposed to refine spatial feature representations through adaptive multiscale fusion. Finally, novel regularization losses that enforce channel saliency and spatial selectivity are introduced, leading to compact and discriminative feature embeddings. Extensive experiments using RGB-Depth (KITTI), a novel annotated RGB-Polarimetric dataset, and the RGB-Infrared (M³FD) benchmark dataset were conducted. We demonstrate consistent superiority of the proposed approach over the baseline RGB-only DiffusionDet. The modular architecture maintains the original decoding complexity, ensuring efficiency. These results establish the proposed RGBX-DiffusionDet as a flexible multimodal object detection approach, providing new insights into integrating diverse 2D sensing modalities into diffusion-based detection pipelines.
zh

[CV-24] Unified Multimodal Understanding and Generation Models: Advances Challenges and Opportunities

【速读】:该论文试图解决多模态理解模型与图像生成模型在架构上的独立性问题,旨在推动两者融合的统一框架发展。其解决方案的关键在于探索三种主要的架构范式:基于扩散模型、基于自回归模型以及结合二者机制的混合方法,并分析各类模型的结构设计与创新点,以促进跨模态任务的整合与优化。

链接: https://arxiv.org/abs/2505.02567
作者: Xinjie Zhang,Jintao Guo,Shanshan Zhao,Minghao Fu,Lunhao Duan,Guo-Hua Wang,Qing-Guo Chen,Zhao Xu,Weihua Luo,Kaifu Zhang
机构: Alibaba Group(阿里巴巴集团); Hong Kong University of Science and Technology(香港科技大学); Nanjing University(南京大学); Wuhan University(武汉大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: This work is still in progress

点击查看摘要

Abstract:Recent years have seen remarkable progress in both multimodal understanding models and image generation models. Despite their respective successes, these two domains have evolved independently, leading to distinct architectural paradigms: While autoregressive-based architectures have dominated multimodal understanding, diffusion-based models have become the cornerstone of image generation. Recently, there has been growing interest in developing unified frameworks that integrate these tasks. The emergence of GPT-4o’s new capabilities exemplifies this trend, highlighting the potential for unification. However, the architectural differences between the two domains pose significant challenges. To provide a clear overview of current efforts toward unification, we present a comprehensive survey aimed at guiding future research. First, we introduce the foundational concepts and recent advancements in multimodal understanding and text-to-image generation models. Next, we review existing unified models, categorizing them into three main architectural paradigms: diffusion-based, autoregressive-based, and hybrid approaches that fuse autoregressive and diffusion mechanisms. For each category, we analyze the structural designs and innovations introduced by related works. Additionally, we compile datasets and benchmarks tailored for unified models, offering resources for future exploration. Finally, we discuss the key challenges facing this nascent field, including tokenization strategy, cross-modal attention, and data. As this area is still in its early stages, we anticipate rapid advancements and will regularly update this survey. Our goal is to inspire further research and provide a valuable reference for the community. The references associated with this survey will be available on GitHub soon.
zh

[CV-25] Robust Duality Learning for Unsupervised Visible-Infrared Person Re-Identfication

【速读】:该论文旨在解决无监督可见光-红外行人重识别(UVI-ReID)中由于模态差异和缺乏监督导致的挑战,尤其是由伪标签噪声(Pseudo-Label Noise, PLN)引发的噪声过拟合、错误累积和噪声聚类对应问题。其解决方案的关键在于提出一种鲁棒的双流学习框架(Robust Duality Learning, RoDE),通过引入鲁棒自适应学习机制(RAL)动态强调干净样本并抑制噪声样本,利用双模型交替训练以缓解错误累积,并通过聚类一致性匹配(Cluster Consistency Matching, CCM)解决跨模型与跨模态的聚类对齐问题。

链接: https://arxiv.org/abs/2505.02549
作者: Yongxiang Li,Yuan Sun,Yang Qin,Dezhong Peng,Xi Peng,Peng Hu
机构: Sichuan University (四川大学); Sichuan National Innovation New Vision UHD Video Technology Co., Ltd (四川国家创新新视野超高清视频技术有限公司)
类目: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
备注:

点击查看摘要

Abstract:Unsupervised visible-infrared person re-identification (UVI-ReID) aims to retrieve pedestrian images across different modalities without costly annotations, but faces challenges due to the modality gap and lack of supervision. Existing methods often adopt self-training with clustering-generated pseudo-labels but implicitly assume these labels are always correct. In practice, however, this assumption fails due to inevitable pseudo-label noise, which hinders model learning. To address this, we introduce a new learning paradigm that explicitly considers Pseudo-Label Noise (PLN), characterized by three key challenges: noise overfitting, error accumulation, and noisy cluster correspondence. To this end, we propose a novel Robust Duality Learning framework (RoDE) for UVI-ReID to mitigate the effects of noisy pseudo-labels. First, to combat noise overfitting, a Robust Adaptive Learning mechanism (RAL) is proposed to dynamically emphasize clean samples while down-weighting noisy ones. Second, to alleviate error accumulation-where the model reinforces its own mistakes-RoDE employs dual distinct models that are alternately trained using pseudo-labels from each other, encouraging diversity and preventing collapse. However, this dual-model strategy introduces misalignment between clusters across models and modalities, creating noisy cluster correspondence. To resolve this, we introduce Cluster Consistency Matching (CCM), which aligns clusters across models and modalities by measuring cross-cluster similarity. Extensive experiments on three benchmarks demonstrate the effectiveness of RoDE.
zh

[CV-26] Marker-Based Extrinsic Calibration Method for Accurate Multi-Camera 3D Reconstruction

【速读】:该论文旨在解决多相机RGB-D系统中精确的外参标定问题,这是实现捕获视角间正确对齐以进行准确三维重建的关键挑战。解决方案的关键在于引入一种基于三维标记几何约束的迭代外参标定方法,通过聚类、回归分析和迭代重分配技术对标记平面进行系统分割与精炼,从而确保不同摄像机视图间的鲁棒几何对应关系。

链接: https://arxiv.org/abs/2505.02539
作者: Nahuel Garcia-D’Urso,Bernabe Sanchez-Sos,Jorge Azorin-Lopez,Andres Fuster-Guillo,Antonio Macia-Lillo,Higinio Mora-Mora
机构: Universitat d’Alacant (阿利坎特大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
备注:

点击查看摘要

Abstract:Accurate 3D reconstruction using multi-camera RGB-D systems critically depends on precise extrinsic calibration to achieve proper alignment between captured views. In this paper, we introduce an iterative extrinsic calibration method that leverages the geometric constraints provided by a three-dimensional marker to significantly improve calibration accuracy. Our proposed approach systematically segments and refines marker planes through clustering, regression analysis, and iterative reassignment techniques, ensuring robust geometric correspondence across camera views. We validate our method comprehensively in both controlled environments and practical real-world settings within the Tech4Diet project, aimed at modeling the physical progression of patients undergoing nutritional treatments. Experimental results demonstrate substantial reductions in alignment errors, facilitating accurate and reliable 3D reconstructions.
zh
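
该方法中“标记平面的分割与精炼”依赖平面拟合这一基础操作;下面给出基于 SVD 的最小二乘平面拟合示意(聚类与迭代重分配逻辑未包含,仅为假设性草图):

```python
import numpy as np

def fit_plane(points):
    """Least-squares plane fit to an (N, 3) point set via SVD.
    Returns (unit normal, centroid); residuals to this plane can drive
    the iterative reassignment of marker-plane points."""
    centroid = points.mean(axis=0)
    _, _, vt = np.linalg.svd(points - centroid)
    normal = vt[-1]  # direction of least variance = plane normal
    return normal, centroid

def point_plane_distance(points, normal, centroid):
    """Absolute point-to-plane distances, usable as a reassignment
    criterion when refining the segmented marker planes."""
    return np.abs((points - centroid) @ normal)
```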

[CV-27] RobSurv: Vector Quantization-Based Multi-Modal Learning for Robust Cancer Survival Prediction

【速读】:该论文旨在解决多模态医学影像在癌症生存预测中的挑战,特别是深度学习模型对噪声和不同影像中心协议变化的敏感性问题。现有方法难以从异构的CT和PET图像中提取一致特征,从而限制了其临床应用。论文提出的解决方案是RobSurv,其关键在于双路径架构:一条路径通过向量量化将连续影像特征映射到离散代码本以实现抗噪表示,另一条路径则通过连续特征处理保留细粒度细节,两者通过基于Transformer的块级融合机制进行整合,从而在保持局部空间关系的同时捕捉全局上下文信息。

链接: https://arxiv.org/abs/2505.02529
作者: Aiman Farooq,Azad Singh,Deepak Mishra,Santanu Chaudhury
机构: Indian Institute of Technology Jodhpur (印度理工学院贾伊普尔分校); Indian Institute of Technology Delhi (印度理工学院德里分校)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Cancer survival prediction using multi-modal medical imaging presents a critical challenge in oncology, mainly due to the vulnerability of deep learning models to noise and protocol variations across imaging centers. Current approaches struggle to extract consistent features from heterogeneous CT and PET images, limiting their clinical applicability. We address these challenges by introducing RobSurv, a robust deep-learning framework that leverages vector quantization for resilient multi-modal feature learning. The key innovation of our approach lies in its dual-path architecture: one path maps continuous imaging features to learned discrete codebooks for noise-resistant representation, while the parallel path preserves fine-grained details through continuous feature processing. This dual representation is integrated through a novel patch-wise fusion mechanism that maintains local spatial relationships while capturing global context via Transformer-based processing. In extensive evaluations across three diverse datasets (HECKTOR, H&N1, and NSCLC Radiogenomics), RobSurv demonstrates superior performance, achieving concordance indices of 0.771, 0.742, and 0.734, respectively, significantly outperforming existing methods. Most notably, our model maintains robust performance even under severe noise conditions, with performance degradation of only 3.8-4.5% compared to 8-12% in baseline methods. These results, combined with strong generalization across different cancer types and imaging protocols, establish RobSurv as a promising solution for reliable clinical prognosis that can enhance treatment planning and patient care.
zh
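
“将连续影像特征映射到离散代码本”的向量量化核心操作可简化为最近邻查找,如下 PyTorch 草图所示(代码本规模与特征维度均为假设):

```python
import torch

def vector_quantize(features, codebook):
    """Nearest-codebook-entry lookup: maps continuous features (B, D)
    to discrete codes from a learned codebook (K, D). Sketch of the
    noise-resistant path's core operation, not RobSurv itself."""
    d = torch.cdist(features, codebook)  # (B, K) pairwise distances
    idx = d.argmin(dim=1)                # discrete code index per sample
    return codebook[idx], idx            # quantized feature, code index

codebook = torch.randn(512, 256)         # K=512 entries, D=256 (assumed)
quantized, codes = vector_quantize(torch.randn(8, 256), codebook)
```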

[CV-28] xt to Image Generation and Editing: A Survey

【速读】:该论文旨在系统性地综述文本到图像生成(Text-to-image generation, T2I)领域的研究进展,涵盖从2021年至2024年的141篇相关工作。其核心问题在于梳理T2I的不同基础模型架构(如自回归、非自回归、生成对抗网络和扩散模型)及其关键技术(如自编码器、注意力机制和无分类器引导),并从生成与编辑两个方向对方法进行系统比较。解决方案的关键在于全面分析不同模型在数据集、评估指标、训练资源和推理速度等方面的性能,并探讨T2I技术的潜在社会影响及优化路径,为未来研究提供指导。

链接: https://arxiv.org/abs/2505.02527
作者: Pengfei Yang,Ngai-Man Cheung,Xinda Ma
机构: Singapore University of Technology and Design (新加坡科技设计大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 49 pages,3 figures,3 tables

点击查看摘要

Abstract:Text-to-image generation (T2I) refers to the text-guided generation of high-quality images. In the past few years, T2I has attracted widespread attention and numerous works have emerged. In this survey, we comprehensively review 141 works conducted from 2021 to 2024. First, we introduce four foundation model architectures of T2I (autoregression, non-autoregression, GAN and diffusion) and the commonly used key technologies (autoencoder, attention and classifier-free guidance). Secondly, we systematically compare the methods of these studies in two directions, T2I generation and T2I editing, including the encoders and the key technologies they use. We also compare the performance of these studies side by side in terms of datasets, evaluation metrics, training resources, and inference speed. Beyond the four foundation models, we survey other works on T2I, such as energy-based models and the recent Mamba and multimodal approaches. We also investigate the potential social impact of T2I and provide some solutions. Finally, we propose unique insights into improving the performance of T2I models and possible future development directions. In summary, this survey is the first systematic and comprehensive overview of T2I, aiming to provide a valuable guide for future researchers and stimulate continued progress in this field.
zh

[CV-29] Corr2Distrib: Making Ambiguous Correspondences an Ally to Predict Reliable 6D Pose Distributions

【速读】:该论文试图解决从RGB图像中估计6D相机位姿分布的问题,特别是在存在对称性和遮挡导致视觉模糊的情况下,如何准确恢复所有有效位姿。解决方案的关键在于利用基于对应关系的方法,通过学习物体表面每个3D点的对称性感知表示,生成3DoF旋转假设,并结合PnP和位姿评分进一步细化为6DoF位姿分布,从而将视觉模糊转化为优势。

链接: https://arxiv.org/abs/2505.02501
作者: Asma Brazi,Boris Meden,Fabrice Mayran de Chamisso,Steve Bourgeois,Vincent Lepetit
机构: Université Paris-Saclay, CEA, List; LIGM, Ecole des Ponts, Univ Gustave Eiffel, CNRS, Marne-la-Vallée, France
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
备注: 8 pages, 5 figures

点击查看摘要

Abstract:We introduce Corr2Distrib, the first correspondence-based method which estimates a 6D camera pose distribution from an RGB image, explaining the observations. Indeed, symmetries and occlusions introduce visual ambiguities, leading to multiple valid poses. While a few recent methods tackle this problem, they do not rely on local correspondences which, according to the BOP Challenge, are currently the most effective way to estimate a single 6DoF pose solution. Using correspondences to estimate a pose distribution is not straightforward, since ambiguous correspondences induced by visual ambiguities drastically decrease the performance of PnP. With Corr2Distrib, we turn these ambiguities into an advantage to recover all valid poses. Corr2Distrib first learns a symmetry-aware representation for each 3D point on the object’s surface, characterized by a descriptor and a local frame. This representation enables the generation of 3DoF rotation hypotheses from single 2D-3D correspondences. Next, we refine these hypotheses into a 6DoF pose distribution using PnP and pose scoring. Our experimental evaluations on complex non-synthetic scenes show that Corr2Distrib outperforms state-of-the-art solutions for both pose distribution estimation and single pose estimation from an RGB image, demonstrating the potential of correspondences-based approaches.
zh

[CV-30] Finger Pose Estimation for Under-screen Fingerprint Sensor

【速读】:该论文旨在解决基于屏幕下指纹传感器的指纹识别中,由于大角度或小区域输入导致的二维姿态估计性能不佳的问题。其解决方案的关键在于提出一种基于双模态输入的网络架构,通过整合由屏幕下指纹传感器提取的纹理细节与由触控屏获取的电容图像中的粗略轮廓,实现更全面且具有区分性的信息融合,从而提升姿态估计的准确性和稳定性。此外,还设计了解耦的概率分布预测任务,并引入基于专家混合(Mixture of Experts)的特征融合机制和关系驱动的跨域知识迁移策略,以进一步增强特征提取与融合能力。

链接: https://arxiv.org/abs/2505.02481
作者: Xiongjun Guan,Zhiyu Pan,Jianjiang Feng,Jie Zhou
机构: Tsinghua University (清华大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Two-dimensional pose estimation plays a crucial role in fingerprint recognition by facilitating global alignment and reducing pose-induced variations. However, existing methods are still unsatisfactory when handling large-angle or small-area inputs. These limitations are particularly pronounced on fingerprints captured by under-screen fingerprint sensors in smartphones. In this paper, we present a novel dual-modal input based network for under-screen fingerprint pose estimation. Our approach effectively integrates two distinct yet complementary modalities: texture details extracted from ridge patches through the under-screen fingerprint sensor, and rough contours derived from capacitive images obtained via the touch screen. This collaborative integration endows our network with more comprehensive and discriminative information, substantially improving the accuracy and stability of pose estimation. A decoupled probability distribution prediction task is designed, instead of the traditional supervised forms of numerical regression or heatmap voting, to facilitate the training process. Additionally, we incorporate a Mixture of Experts (MoE) based feature fusion mechanism and a relationship-driven cross-domain knowledge transfer strategy to further strengthen feature extraction and fusion capabilities. Extensive experiments are conducted on several public datasets and two private datasets. The results indicate that our method is significantly superior to previous state-of-the-art (SOTA) methods and remarkably boosts the recognition ability of fingerprint recognition algorithms. Our code is available at this https URL.
zh

[CV-31] Point Cloud Recombination: Systematic Real Data Augmentation Using Robotic Targets for LiDAR Perception Validation

【速读】:该论文旨在解决开放世界应用中基于LiDAR的智能移动系统感知验证的挑战,即虚拟仿真缺乏物理传感器特性,而真实世界数据则控制性不足,难以实现充分验证。其解决方案的关键在于提出点云重组合(Point Cloud Recombination),通过系统地将实验室环境中测量的物理目标物体点云与实际场景点云进行整合,从而生成大量可重复、物理真实的测试场景,支持针对现象感知的遮挡和精确定位的测试。

链接: https://arxiv.org/abs/2505.02476
作者: Hubert Padusinski,Christian Steinhauser,Christian Scherl,Julian Gaal,Jacob Langner
机构: FZI Research Center for Information Technology (FZI 研究中心); ANavS GmbH - Advanced Navigation Solutions (ANavS 公司-高级导航解决方案)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注: Pre-print for IEEE IAVVC 2025

点击查看摘要

Abstract:The validation of LiDAR-based perception of intelligent mobile systems operating in open-world applications remains a challenge due to the variability of real environmental conditions. Virtual simulations allow the generation of arbitrary scenes under controlled conditions but lack physical sensor characteristics, such as intensity responses or material-dependent effects. In contrast, real-world data offers true sensor realism but provides less control over influencing factors, hindering sufficient validation. Existing approaches address this problem with augmentation of real-world point cloud data by transferring objects between scenes. However, these methods do not consider validation and remain limited in controllability because they rely on empirical data. We solve these limitations by proposing Point Cloud Recombination, which systematically augments captured point cloud scenes by integrating point clouds acquired from physical target objects measured in controlled laboratory environments. This enables the creation of vast amounts and varieties of repeatable, physically accurate test scenes, with phenomena-aware occlusions computed against registered 3D meshes. Using the Ouster OS1-128 Rev7 sensor, we demonstrate the augmentation of real-world urban and rural scenes with humanoid targets featuring varied clothing and poses, for repeatable positioning. We show that the recombined scenes closely match real sensor outputs, enabling targeted testing, scalable failure analysis, and improved system safety. By providing controlled yet sensor-realistic data, our method enables trustworthy conclusions about the limitations of specific sensors in combination with their algorithms, e.g., object detection.
zh

[CV-32] Ming-Lite-Uni: Advancements in Unified Architecture for Natural Multimodal Interaction

【速读】:该论文旨在解决多模态任务中视觉与语言统一建模的挑战,特别是如何实现原生多模态自回归模型在文本到图像生成和基于指令的图像编辑任务中的高效融合与交互。其解决方案的关键在于引入了新型的多尺度可学习令牌和多尺度表示对齐策略,并结合固定的大语言模型(LLM)与可学习的扩散模型,构建了一个统一的视觉生成器和多模态自回归模型,从而提升了模型在跨模态理解与生成任务中的表现。

链接: https://arxiv.org/abs/2505.02471
作者: Biao Gong,Cheng Zou,Dandan Zheng,Hu Yu,Jingdong Chen,Jianxin Sun,Junbo Zhao,Jun Zhou,Kaixiang Ji,Lixiang Ru,Libin Wang,Qingpei Guo,Rui Liu,Weilong Chai,Xinyu Xiao,Ziyuan Huang
机构: Inclusion AI; Ant Group
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: this https URL

点击查看摘要

Abstract:We introduce Ming-Lite-Uni, an open-source multimodal framework featuring a newly designed unified visual generator and a native multimodal autoregressive model tailored for unifying vision and language. Specifically, this project provides an open-source implementation of the integrated MetaQueries and M2-omni framework, while introducing the novel multi-scale learnable tokens and multi-scale representation alignment strategy. By leveraging a fixed MLLM and a learnable diffusion model, Ming-Lite-Uni enables native multimodal AR models to perform both text-to-image generation and instruction-based image editing tasks, expanding their capabilities beyond pure visual understanding. Our experimental results demonstrate the strong performance of Ming-Lite-Uni and illustrate the impressive fluid nature of its interactive process. All code and model weights are open-sourced to foster further exploration within the community. Notably, this work aligns with concurrent multimodal AI milestones, such as ChatGPT-4o with native image generation updated on March 25, 2025, underscoring the broader significance of unified models like Ming-Lite-Uni on the path toward AGI. Ming-Lite-Uni is in its alpha stage and will soon be further refined.
zh

[CV-33] ming Is Everything: Finding the Optimal Fusion Points in Multimodal Medical Imaging

【速读】:该论文旨在解决多模态深度学习中如何确定最佳融合时机的问题,即在多模态网络的不同层中选择合适的融合模块插入位置,以提升医学影像诊断的准确性。当前方法通常依赖手动调参或穷举搜索,计算成本高且无法保证收敛至最优解。论文提出的解决方案是基于顺序前向搜索算法,该算法通过逐步激活并评估不同层的候选融合模块,在每一步利用先前学习的权重进行微调,并通过比较验证损失来确定最佳配置。其关键在于系统性地缩小搜索空间,从而高效识别最优融合时机,而非穷举所有可能的模块放置方式。

链接: https://arxiv.org/abs/2505.02467
作者: Valerio Guarrasi,Klara Mogensen,Sara Tassinari,Sara Qvarlander,Paolo Soda
机构: Università Campus Bio-Medico di Roma (罗马大学生物医学校园大学); Umeå University (于默奥大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Multimodal deep learning harnesses diverse imaging modalities, such as MRI sequences, to enhance diagnostic accuracy in medical imaging. A key challenge is determining the optimal timing for integrating these modalities-specifically, identifying the network layers where fusion modules should be inserted. Current approaches often rely on manual tuning or exhaustive search, which are computationally expensive without any guarantee of converging to optimal results. We propose a sequential forward search algorithm that incrementally activates and evaluates candidate fusion modules at different layers of a multimodal network. At each step, the algorithm retrains from previously learned weights and compares validation loss to identify the best-performing configuration. This process systematically reduces the search space, enabling efficient identification of the optimal fusion timing without exhaustively testing all possible module placements. The approach is validated on two multimodal MRI datasets, each addressing different classification tasks. Our algorithm consistently identified configurations that outperformed unimodal baselines, late fusion, and a brute-force ensemble of all potential fusion placements. These architectures demonstrated superior accuracy, F-score, and specificity while maintaining competitive or improved AUC values. Furthermore, the sequential nature of the search significantly reduced computational overhead, making the optimization process more practical. By systematically determining the optimal timing to fuse imaging modalities, our method advances multimodal deep learning for medical imaging. It provides an efficient and robust framework for fusion optimization, paving the way for improved clinical decision-making and more adaptable, scalable architectures in medical AI applications.
zh

[CV-34] Recent Advances in Out-of-Distribution Detection with CLIP-Like Models: A Survey

【速读】: This paper addresses the limitations of traditional out-of-distribution (OOD) detection taxonomies when facing multimodal data, especially when vision-language models (VLMs) such as CLIP are used for cross-modal detection: existing categorization schemes still rely on image-only ID data and fail to reflect the multimodal nature of these models. The key to the solution is a new categorization framework rooted in how OOD data is handled across the image and text modalities, dividing existing methods into four groups: OOD images seen or unseen, and OOD texts known or unknown, across two training strategies (train-free or training-required). This framework aligns better with the characteristics of multimodal models such as CLIP and offers a new perspective for future research.

链接: https://arxiv.org/abs/2505.02448
作者: Chaohua Li,Enhao Zhang,Chuanxing Geng,Songcan Chen
机构: Nanjing University of Aeronautics and Astronautics (南京航空航天大学); MIIT Key Laboratory of Pattern Analysis and Machine Intelligence (工业和信息化部模式分析与机器智能重点实验室); Hong Kong Baptist University (香港浸会大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Out-of-distribution detection (OOD) is a pivotal task for real-world applications that trains models to identify samples that are distributionally different from the in-distribution (ID) data during testing. Recent advances in AI, particularly Vision-Language Models (VLMs) like CLIP, have revolutionized OOD detection by shifting from traditional unimodal image detectors to multimodal image-text detectors. This shift has inspired extensive research; however, existing categorization schemes (e.g., few- or zero-shot types) still rely solely on the availability of ID images, adhering to a unimodal paradigm. To better align with CLIP’s cross-modal nature, we propose a new categorization framework rooted in both image and text modalities. Specifically, we categorize existing methods based on how visual and textual information of OOD data is utilized within image + text modalities, and further divide them into four groups: OOD Images (i.e., outliers) Seen or Unseen, and OOD Texts (i.e., learnable vectors or class names) Known or Unknown, across two training strategies (i.e., train-free or training-required). More importantly, we discuss open problems in CLIP-like OOD detection and highlight promising directions for future research, including cross-domain integration, practical applications, and theoretical understanding.
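
As a concrete instance of the train-free branch of this taxonomy, here is a minimal numpy sketch of a CLIP-style OOD score: an image is flagged by its maximum softmax similarity to the ID class-name embeddings. The random vectors are stand-ins for real CLIP features, and the temperature is an illustrative choice.

```python
# Minimal sketch of a train-free, CLIP-style OOD score: an image is scored by
# the maximum softmax over its cosine similarities to ID class-name embeddings
# (random vectors here stand in for real CLIP image/text features).
import numpy as np

rng = np.random.default_rng(0)

def normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

text_emb = normalize(rng.normal(size=(10, 512)))   # 10 ID class-name embeddings
image_emb = normalize(rng.normal(size=(512,)))     # one test-image embedding

logits = 100.0 * text_emb @ image_emb              # temperature-scaled similarities
probs = np.exp(logits - logits.max())
probs /= probs.sum()

ood_score = 1.0 - probs.max()   # low max-confidence => more likely OOD
print(f"OOD score: {ood_score:.3f}")
```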
zh

[CV-35] Token Coordinated Prompt Attention is Needed for Visual Prompting

【速读】: This paper addresses a shortcoming of existing visual prompting techniques for fine-tuning pretrained Vision Transformers (ViT): using the same prompts for all tokens ignores the unique roles different tokens play in conveying discriminative information, leading to indistinguishable and biased prompt-extracted features. The key to the solution is a plug-and-play Token Coordinated Prompt Attention (TCPA) module that disentangles prompts into CLS Prompts and Image Prompts and uses a matching function to assign coordinated prompts to individual image tokens, enabling differentiated attention-based interactions that improve the diversity and representational capacity of the extracted features.

链接: https://arxiv.org/abs/2505.02406
作者: Zichen Liu,Xu Zou,Gang Hua,Jiahuan Zhou
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Visual prompting techniques are widely used to efficiently fine-tune pretrained Vision Transformers (ViT) by learning a small set of shared prompts for all tokens. However, existing methods overlook the unique roles of different tokens in conveying discriminative information and interact with all tokens using the same prompts, thereby limiting the representational capacity of ViT. This often leads to indistinguishable and biased prompt-extracted features, hindering performance. To address this issue, we propose a plug-and-play Token Coordinated Prompt Attention (TCPA) module, which assigns specific coordinated prompts to different tokens for attention-based interactions. Firstly, recognizing the distinct functions of CLS and image tokens-global information aggregation and local feature extraction, we disentangle the prompts into CLS Prompts and Image Prompts, which interact exclusively with CLS tokens and image tokens through attention mechanisms. This enhances their respective discriminative abilities. Furthermore, as different image tokens correspond to distinct image patches and contain diverse information, we employ a matching function to automatically assign coordinated prompts to individual tokens. This enables more precise attention interactions, improving the diversity and representational capacity of the extracted features. Extensive experiments across various benchmarks demonstrate that TCPA significantly enhances the diversity and discriminative power of the extracted features. The code is available at this https URL.
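
The matching idea can be sketched in a few lines: CLS tokens get their own prompts, while each image token is routed to a prompt from a shared pool by a similarity-based matching function. The cosine-argmax matcher and all shapes below are illustrative assumptions, not the paper's exact design.

```python
# Minimal sketch of token-coordinated prompt assignment: CLS gets its own
# prompts, while each image token is matched (cosine similarity) to one
# prompt from a shared pool. Shapes and the argmax matcher are assumptions.
import numpy as np

rng = np.random.default_rng(0)
d, n_tokens, n_prompts = 64, 16, 4

cls_prompts = rng.normal(size=(2, d))           # interact with the CLS token only
image_prompts = rng.normal(size=(n_prompts, d)) # pool for image tokens
image_tokens = rng.normal(size=(n_tokens, d))

def cosine(a, b):
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return a @ b.T

# Matching function: route each image token to its most similar prompt.
assignment = cosine(image_tokens, image_prompts).argmax(axis=1)
print("prompt id per image token:", assignment)
```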
zh

[CV-36] Estimating Commonsense Scene Composition on Belief Scene Graphs ICRA25

【速读】:该论文试图解决的是常识性场景构图问题,即如何理解场景中相关物体之间的空间关系,并通过估计未见物体的空间分布来扩展信念场景图(Belief Scene Graphs)。解决方案的关键在于将这种常识性场景构图能力建模为所有语义物体类可能位置的联合概率分布,并采用两种变体的关联信息(Correlation Information, CECI)模型进行概率分布的学习:一种是基于图卷积网络的基线方法,另一种是结合基于大型语言模型(Large Language Models, LLMs)的空间本体的神经符号扩展方法。

链接: https://arxiv.org/abs/2505.02405
作者: Mario A.V. Saucedo,Vignesh Kottayam Viswanathan,Christoforos Kanellakis,George Nikolakopoulos
机构: Luleå University of Technology (吕勒奥理工大学)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at ICRA25

点击查看摘要

Abstract:This work establishes the concept of commonsense scene composition, with a focus on extending Belief Scene Graphs by estimating the spatial distribution of unseen objects. Specifically, the commonsense scene composition capability refers to the understanding of the spatial relationships among related objects in the scene, which in this article is modeled as a joint probability distribution for all possible locations of the semantic object class. The proposed framework includes two variants of a Correlation Information (CECI) model for learning probability distributions: (i) a baseline approach based on a Graph Convolutional Network, and (ii) a neuro-symbolic extension that integrates a spatial ontology based on Large Language Models (LLMs). Furthermore, this article provides a detailed description of the dataset generation process for such tasks. Finally, the framework has been validated through multiple runs on simulated data, as well as in a real-world indoor environment, demonstrating its ability to spatially interpret scenes across different room types.
zh

[CV-37] Uncertainty-Weighted Image-Event Multimodal Fusion for Video Anomaly Detection

【速读】: This paper addresses the failure of existing video anomaly detection methods, which rely solely on RGB frames, to capture abrupt or transient motion cues that are key indicators of anomalous events. The key to the solution is the Image-Event Fusion for Video Anomaly Detection (IEF-VAD) framework, which synthesizes event representations from RGB videos and fuses them with image features through a principled, uncertainty-aware process that strengthens motion cues. The method models heavy-tailed sensor noise with a Student's-t likelihood, applies Kalman-style frame-wise updates to balance the modalities, and iteratively refines the fused latent state.

链接: https://arxiv.org/abs/2505.02393
作者: Sungheon Jeong,Jihong Park,Mohsen Imani
机构: University of California, Irvine (加州大学欧文分校); MOLOCO (MOLOCO)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Most existing video anomaly detectors rely solely on RGB frames, which lack the temporal resolution needed to capture abrupt or transient motion cues, key indicators of anomalous events. To address this limitation, we propose Image-Event Fusion for Video Anomaly Detection (IEF-VAD), a framework that synthesizes event representations directly from RGB videos and fuses them with image features through a principled, uncertainty-aware process. The system (i) models heavy-tailed sensor noise with a Student's-t likelihood, deriving value-level inverse-variance weights via a Laplace approximation; (ii) applies Kalman-style frame-wise updates to balance modalities over time; and (iii) iteratively refines the fused latent state to erase residual cross-modal noise. Without any dedicated event sensor or frame-level labels, IEF-VAD sets a new state of the art across multiple real-world anomaly detection benchmarks. These findings highlight the utility of synthetic event representations in emphasizing motion cues that are often underrepresented in RGB frames, enabling accurate and robust video understanding across diverse applications without requiring dedicated event sensors. Code and models are available at this https URL.
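
Per value, the uncertainty-aware fusion reduces to precision (inverse-variance) weighting, which the following numpy sketch illustrates; a plain Gaussian stand-in replaces the paper's Student's-t likelihood and Laplace approximation, and the feature values are toy data.

```python
# Minimal sketch of uncertainty-aware fusion by inverse-variance weighting:
# each modality contributes in proportion to its estimated precision.
# (A Gaussian stand-in replaces the paper's Student's-t/Laplace treatment.)
import numpy as np

img_feat = np.array([0.8, 0.1, 0.4])     # image-branch features
evt_feat = np.array([0.6, 0.5, 0.2])     # synthesized event-branch features
img_var = np.array([0.05, 0.40, 0.10])   # per-value uncertainty estimates
evt_var = np.array([0.20, 0.05, 0.10])

w_img, w_evt = 1.0 / img_var, 1.0 / evt_var          # precisions
fused = (w_img * img_feat + w_evt * evt_feat) / (w_img + w_evt)
fused_var = 1.0 / (w_img + w_evt)                    # fused uncertainty shrinks

print("fused:", fused.round(3), "fused var:", fused_var.round(3))
```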
zh

[CV-38] MetaScenes: Towards Automated Replica Creation for Real-world 3D Scans CVPR2025

【速读】: This paper addresses the difficulty of generating high-quality, diverse 3D scenes for Embodied AI (EAI) research in support of skill acquisition, sim-to-real transfer, and generalization. Existing datasets depend on artist-driven designs, which incur substantial human effort and scale poorly. The key to the solution is MetaScenes, a large-scale simulatable dataset built from real-world scans containing 15,366 objects across 831 fine-grained categories, together with Scan2Sim, a multimodal alignment model that enables automated, high-quality asset replacement, removing the reliance on artist-driven designs and improving the scalability and realism of 3D scene generation.

链接: https://arxiv.org/abs/2505.02388
作者: Huangyue Yu,Baoxiong Jia,Yixin Chen,Yandan Yang,Puhao Li,Rongpeng Su,Jiaxin Li,Qing Li,Wei Liang,Song-Chun Zhu,Tengyu Liu,Siyuan Huang
机构: State Key Laboratory of General Artificial Intelligence, BIGAI; Beijing Institute of Technology; Tsinghua University; University of Science and Technology of China
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
备注: CVPR 2025

点击查看摘要

Abstract:Embodied AI (EAI) research requires high-quality, diverse 3D scenes to effectively support skill acquisition, sim-to-real transfer, and generalization. Achieving these quality standards, however, necessitates the precise replication of real-world object diversity. Existing datasets demonstrate that this process heavily relies on artist-driven designs, which demand substantial human effort and present significant scalability challenges. To scalably produce realistic and interactive 3D scenes, we first present MetaScenes, a large-scale, simulatable 3D scene dataset constructed from real-world scans, which includes 15366 objects spanning 831 fine-grained categories. Then, we introduce Scan2Sim, a robust multi-modal alignment model, which enables the automated, high-quality replacement of assets, thereby eliminating the reliance on artist-driven designs for scaling 3D scenes. We further propose two benchmarks to evaluate MetaScenes: a detailed scene synthesis task focused on small item layouts for robotic manipulation and a domain transfer task in vision-and-language navigation (VLN) to validate cross-domain transfer. Results confirm MetaScene’s potential to enhance EAI by supporting more generalizable agent learning and sim-to-real applications, introducing new possibilities for EAI research. Project website: this https URL.
zh

[CV-39] SuperEdit: Rectifying and Facilitating Supervision for Instruction-Based Image Editing

【速读】: This paper addresses the noisy supervision in existing image-editing datasets: because accurate editing data is hard to collect manually, datasets are typically built with automated methods, producing mismatches between editing instructions and original-edited image pairs. The key to the solution is constructing more effective editing instructions, both by rectifying instructions to better align with the image pairs and by using contrastive editing instructions to further enhance their effectiveness. By analyzing the generation attributes of editing models at different inference steps, the authors define a unified guideline for rectifying instructions, and they build contrastive supervision signals from positive and negative instructions that are introduced into training via a triplet loss. The method needs neither the VLM modules nor the pre-training tasks used in previous work, offering a more direct and efficient way to provide better supervision signals.

链接: https://arxiv.org/abs/2505.02370
作者: Ming Li,Xin Gu,Fan Chen,Xiaoying Xing,Longyin Wen,Chen Chen,Sijie Zhu
机构: ByteDance Intelligent Creation (USA); Center for Research in Computer Vision, University of Central Florida
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Code, Data and Models are available at: this https URL

点击查看摘要

Abstract:Due to the challenges of manually collecting accurate editing data, existing datasets are typically constructed using various automated methods, leading to noisy supervision signals caused by the mismatch between editing instructions and original-edited image pairs. Recent efforts attempt to improve editing models through generating higher-quality edited images, pre-training on recognition tasks, or introducing vision-language models (VLMs) but fail to resolve this fundamental issue. In this paper, we offer a novel solution by constructing more effective editing instructions for given image pairs. This includes rectifying the editing instructions to better align with the original-edited image pairs and using contrastive editing instructions to further enhance their effectiveness. Specifically, we find that editing models exhibit specific generation attributes at different inference steps, independent of the text. Based on these prior attributes, we define a unified guide for VLMs to rectify editing instructions. However, there are some challenging editing scenarios that cannot be resolved solely with rectified instructions. To this end, we further construct contrastive supervision signals with positive and negative instructions and introduce them into the model training using triplet loss, thereby further facilitating supervision effectiveness. Our method does not require the VLM modules or pre-training tasks used in previous work, offering a more direct and efficient way to provide better supervision signals, and providing a novel, simple, and effective solution for instruction-based image editing. Results on multiple benchmarks demonstrate that our method significantly outperforms existing approaches. Compared with previous SOTA SmartEdit, we achieve 9.19% improvements on the Real-Edit benchmark with 30x less training data and 13x smaller model size.
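
The contrastive part of the supervision can be illustrated with a standard triplet objective over embeddings; the tensors below are random stand-ins, and treating the edit embedding as the anchor is an assumption made for illustration.

```python
# Minimal sketch of the triplet objective over instruction embeddings:
# pull the edit (anchor) embedding toward the rectified (positive) instruction
# and push it away from the contrastive (negative) one by a margin.
import torch
import torch.nn.functional as F

anchor = torch.randn(4, 128)      # e.g., image-pair (edit) embeddings
positive = torch.randn(4, 128)    # rectified instruction embeddings
negative = torch.randn(4, 128)    # contrastive (negative) instructions

loss = F.triplet_margin_loss(anchor, positive, negative, margin=0.2)
print(loss.item())
```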
zh

[CV-40] Sharpness-Aware Minimization with Z-Score Gradient Filtering for Neural Networks

【速读】: This paper addresses the tendency of deep neural networks to converge to sharp minima during training, which degrades robustness and generalization. The key to the solution is ZSharp, an extension of Sharpness-Aware Minimization (SAM) that applies layer-wise Z-score normalization followed by percentile-based filtering to retain only statistically significant gradient components, yielding more effective perturbations that align parameter updates with curvature-sensitive directions and improve generalization.

链接: https://arxiv.org/abs/2505.02369
作者: Juyoung Yun
机构: Stony Brook University (石溪大学); OpenNN Lab (OpenNN 实验室); MODULABS (MODULABS)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Information Theory (cs.IT); Neural and Evolutionary Computing (cs.NE)
备注:

点击查看摘要

Abstract:Generalizing well in deep neural networks remains a core challenge, particularly due to their tendency to converge to sharp minima that degrade robustness. Sharpness-Aware Minimization (SAM) mitigates this by seeking flatter minima but perturbs parameters using the full gradient, which can include statistically insignificant directions. We propose ZSharp, a simple yet effective extension to SAM that applies layer-wise Z-score normalization followed by percentile-based filtering to retain only statistically significant gradient components. This selective perturbation aligns updates with curvature-sensitive directions, enhancing generalization without requiring architectural changes. ZSharp introduces only one additional hyperparameter, the percentile threshold, and remains fully compatible with existing SAM variants. Experiments on CIFAR-10, CIFAR-100, and Tiny-ImageNet using ResNet, VGG, and Vision Transformers show that ZSharp consistently outperforms SAM and its variants in test accuracy, particularly on deeper and transformer-based models. These results demonstrate that ZSharp is a principled and lightweight improvement for sharpness-aware optimization.
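
The core filtering step is easy to sketch for a single layer: z-score the gradient, keep only components whose magnitude passes a percentile threshold, and form the SAM ascent step from the filtered gradient. The radius and threshold values below are illustrative assumptions.

```python
# Minimal sketch of ZSharp-style gradient filtering for one layer:
# z-score the gradient, keep only components whose |z| exceeds a percentile
# threshold, and use the filtered gradient for the SAM perturbation.
import numpy as np

rng = np.random.default_rng(0)
grad = rng.normal(size=(256,))               # one layer's gradient
rho, pct = 0.05, 70.0                        # SAM radius, percentile threshold

z = (grad - grad.mean()) / (grad.std() + 1e-12)
mask = np.abs(z) >= np.percentile(np.abs(z), pct)
filtered = grad * mask                       # drop insignificant directions

eps = rho * filtered / (np.linalg.norm(filtered) + 1e-12)  # SAM ascent step
print(f"kept {mask.mean():.0%} of components, |eps| = {np.linalg.norm(eps):.3f}")
```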
zh

[CV-41] Quaternion Multi-focus Color Image Fusion

【速读】: This paper addresses the limited performance of multi-focus color image fusion in complex real-world scenes caused by inadequate handling of color information and intricate textures. The key to the solution is a fusion framework that operates entirely in the quaternion domain with three core components: 1) a quaternion sparse decomposition model that iteratively learns fine-scale details and structure information of color images for high-precision focus detection; 2) a quaternion base-detail fusion strategy that separately fuses base-scale and detail-scale results across multiple color images to preserve structure and detail; and 3) a quaternion structural similarity refinement strategy that adaptively selects optimal patches from the initial fusion results to produce a final result with fine details and spatially consistent outputs.

链接: https://arxiv.org/abs/2505.02365
作者: Weihua Yang,Yicong Zhou
机构: University of Macau (澳门大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Multi-focus color image fusion refers to integrating multiple partially focused color images to create a single all-in-focus color image. However, existing methods struggle with complex real-world scenarios due to limitations in handling color information and intricate textures. To address these challenges, this paper proposes a quaternion multi-focus color image fusion framework to perform high-quality color image fusion completely in the quaternion domain. This framework introduces 1) a quaternion sparse decomposition model to jointly learn fine-scale image details and structure information of color images in an iterative fashion for high-precision focus detection, 2) a quaternion base-detail fusion strategy to individually fuse base-scale and detail-scale results across multiple color images for preserving structure and detail information, and 3) a quaternion structural similarity refinement strategy to adaptively select optimal patches from initial fusion results and obtain the final fused result for preserving fine details and ensuring spatially consistent outputs. Extensive experiments demonstrate that the proposed framework outperforms state-of-the-art methods.
zh

[CV-42] Quaternion Infrared Visible Image Fusion

【速读】: This paper addresses key limitations of infrared-visible image fusion, including the neglect of color structure information in visible images and performance degradation on low-quality color-visible inputs. The key to the solution is the quaternion infrared-visible image fusion (QIVIF) framework, which operates entirely in the quaternion domain: a quaternion low-visibility feature learning model, a quaternion adaptive unsharp masking method, and a quaternion hierarchical Bayesian fusion model jointly enable adaptive extraction and fusion of salient thermal targets and fine-grained texture details, producing high-quality fused images under challenging low-visibility conditions.

链接: https://arxiv.org/abs/2505.02364
作者: Weihua Yang,Yicong Zhou
机构: University of Macau (澳门大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Visible images provide rich details and color information only under well-lighted conditions while infrared images effectively highlight thermal targets under challenging conditions such as low visibility and adverse weather. Infrared-visible image fusion aims to integrate complementary information from infrared and visible images to generate a high-quality fused image. Existing methods exhibit critical limitations such as neglecting color structure information in visible images and performance degradation when processing low-quality color-visible inputs. To address these issues, we propose a quaternion infrared-visible image fusion (QIVIF) framework to generate high-quality fused images completely in the quaternion domain. QIVIF proposes a quaternion low-visibility feature learning model to adaptively extract salient thermal targets and fine-grained texture details from input infrared and visible images respectively under diverse degraded conditions. QIVIF then develops a quaternion adaptive unsharp masking method to adaptively improve high-frequency feature enhancement with balanced illumination. QIVIF further proposes a quaternion hierarchical Bayesian fusion model to integrate infrared saliency and enhanced visible details to obtain high-quality fused images. Extensive experiments across diverse datasets demonstrate that our QIVIF surpasses state-of-the-art methods under challenging low-visibility conditions.
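
As background for the unsharp-masking component, here is the classic scalar-domain formulation, sharpened = image + amount x (image - blurred); the paper's variant is adaptive and operates on quaternion-valued color images, which this grayscale toy does not reproduce.

```python
# Minimal scalar-domain analogue of unsharp masking: sharpened = image +
# amount * (image - blurred). The quaternion-domain, adaptive variant in the
# paper works on full color quaternions; this grayscale version only shows
# the high-frequency enhancement idea.
import numpy as np
from scipy.ndimage import gaussian_filter

rng = np.random.default_rng(0)
img = rng.random((32, 32))            # stand-in grayscale image in [0, 1]

blurred = gaussian_filter(img, sigma=2.0)
amount = 1.5                          # fixed here; adaptive in the paper
sharpened = np.clip(img + amount * (img - blurred), 0.0, 1.0)
print("high-frequency energy:", float(np.abs(img - blurred).mean()))
```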
zh

[CV-43] Sparse Ellipsoidal Radial Basis Function Network for Point Cloud Surface Representation

【速读】: This paper addresses point cloud surface representation, a fundamental problem in computer graphics and vision, by approximating the signed distance function (SDF) of a point cloud with a machine learning approach to obtain a compact and accurate surface representation. The key to the solution is approximating the SDF with a sparse ellipsoidal radial basis function network (ERBFs), using a dynamic multi-objective optimization strategy to balance sparsity and approximation precision, a nearest-neighbor-based data structure with CUDA parallelization for computational efficiency, and a hierarchical octree-based refinement strategy for training.

链接: https://arxiv.org/abs/2505.02350
作者: Bobo Lian,Dandan Wang,Chenjian Wu,Minxin Chen
机构: 未知
类目: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Point cloud surface representation is a fundamental problem in computer graphics and vision. This paper presents a machine learning approach for approximating the signed distance function (SDF) of a point cloud using sparse ellipsoidal radial basis function networks, enabling a compact and accurate surface representation. Given the SDF values defined on the grid points constructed from the point cloud, our method approximates the SDF accurately with as few ellipsoidal radial basis functions (ERBFs) as possible, i.e., represent the SDF of a point cloud by sparse ERBFs. To balance sparsity and approximation precision, a dynamic multi-objective optimization strategy is introduced, which adaptively adds the regularization terms and jointly optimizes the weights, centers, shapes, and orientations of ERBFs. To improve computational efficiency, a nearest-neighbor-based data structure is employed, restricting function calculations to points near each Gaussian kernel center. The computations for each kernel are further parallelized on CUDA, which significantly improves the optimization speed. Additionally, a hierarchical octree-based refinement strategy is designed for training. Specifically, the initialization and optimization of network parameters are conducted using coarse grid points in the octree lattice structure. Subsequently, fine lattice points are progressively incorporated to accelerate model convergence and enhance training efficiency. Extensive experiments on multiple benchmark datasets demonstrate that our method outperforms previous sparse representation approaches in terms of accuracy, robustness, and computational efficiency. The corresponding code is publicly available at this https URL.
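
The representation itself is compact to evaluate: the SDF is approximated as a weighted sum of ellipsoidal Gaussians, each with a positive-definite shape matrix encoding scale and orientation. The Gaussian form and the toy parameters below are assumptions for illustration; the sparsity-driven fitting is omitted.

```python
# Minimal sketch of evaluating a sparse ellipsoidal-RBF SDF approximation:
# f(x) ~ sum_i w_i * exp(-(x - c_i)^T A_i (x - c_i)), where each positive-
# definite A_i encodes an ellipsoid's shape and orientation. Toy data only.
import numpy as np

rng = np.random.default_rng(0)
K = 8
centers = rng.uniform(-1, 1, size=(K, 3))            # ERBF centers c_i
weights = rng.normal(size=K)                         # weights w_i
# Build random symmetric positive-definite shape matrices A_i = B B^T + 0.5 I.
B = rng.normal(size=(K, 3, 3))
shapes = np.einsum("kij,klj->kil", B, B) + 0.5 * np.eye(3)

def sdf(x: np.ndarray) -> float:
    d = x - centers                                  # (K, 3) offsets
    q = np.einsum("ki,kij,kj->k", d, shapes, d)      # quadratic forms
    return float(weights @ np.exp(-q))

print(f"approximate SDF at origin: {sdf(np.zeros(3)):+.4f}")
```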
zh

[CV-44] 6D Pose Estimation on Spoons and Hands

【速读】: This paper addresses how to accurately monitor eating behaviour through video analysis, offering more reliable insights into nutritional intake than traditional self-reporting. The key to the solution is using 6D pose estimation to track the movements of hands and utensils, capturing the spatial position and orientation needed to estimate food intake volume and monitor eating behaviour.

链接: https://arxiv.org/abs/2505.02335
作者: Kevin Tan,Fan Yang,Yuhao Chen
机构: University of Waterloo (滑铁卢大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Accurate dietary monitoring is essential for promoting healthier eating habits. A key area of research is how people interact and consume food using utensils and hands. By tracking their position and orientation, it is possible to estimate the volume of food being consumed, or monitor eating behaviours, highly useful insights into nutritional intake that can be more reliable than popular methods such as self-reporting. Hence, this paper implements a system that analyzes stationary video feed of people eating, using 6D pose estimation to track hand and spoon movements to capture spatial position and orientation. In doing so, we examine the performance of two state-of-the-art (SOTA) video object segmentation (VOS) models, both quantitatively and qualitatively, and identify main sources of error within the system.
zh

[CV-45] VAEmo: Efficient Representation Learning for Visual-Audio Emotion with Knowledge Injection

【速读】: This paper addresses the challenges of audiovisual emotion recognition (AVER), including the inherent ambiguity of emotional expressions, cross-modal expressive disparities, and the scarcity of reliably annotated data. The key to the solution is VAEmo, a two-stage framework for emotion-centric joint audio-visual representation learning with external knowledge injection: in Stage 1, pre-training on a large-scale speaker-centric audio-visual corpus with masked reconstruction and contrastive objectives mitigates the modality gap and learns expressive, complementary representations; in Stage 2, multimodal large language models generate detailed affective descriptions, whose rich textual semantics are injected into the audio-visual representations through dual-path contrastive learning, further bridging the emotion gap.

链接: https://arxiv.org/abs/2505.02331
作者: Hao Cheng,Zhiwei Zhao,Yichao He,Zhenzhen Hu,Jia Li,Meng Wang,Richang Hong
机构: Hefei University of Technology(合肥工业大学); China(中国)
类目: Computer Vision and Pattern Recognition (cs.CV); Sound (cs.SD)
备注: Source code and pre-trained models will be available at this https URL

点击查看摘要

Abstract:Audiovisual emotion recognition (AVER) aims to infer human emotions from nonverbal visual-audio (VA) cues, offering modality-complementary and language-agnostic advantages. However, AVER remains challenging due to the inherent ambiguity of emotional expressions, cross-modal expressive disparities, and the scarcity of reliably annotated data. Recent self-supervised AVER approaches have introduced strong multimodal representations, yet they predominantly rely on modality-specific encoders and coarse content-level alignment, limiting fine-grained emotional semantic modeling. To address these issues, we propose VAEmo, an efficient two-stage framework for emotion-centric joint VA representation learning with external knowledge injection. In Stage 1, a unified and lightweight representation network is pre-trained on large-scale speaker-centric VA corpora via masked reconstruction and contrastive objectives, mitigating the modality gap and learning expressive, complementary representations without emotion labels. In Stage 2, multimodal large language models automatically generate detailed affective descriptions according to our well-designed chain-of-thought prompting for only a small subset of VA samples; these rich textual semantics are then injected by aligning their corresponding embeddings with VA representations through dual-path contrastive learning, further bridging the emotion gap. Extensive experiments on multiple downstream AVER benchmarks show that VAEmo achieves state-of-the-art performance with a compact design, highlighting the benefit of unified cross-modal encoding and emotion-aware semantic guidance for efficient, generalizable VA emotion representations.
zh

[CV-46] TeDA: Boosting Vision-Language Models for Zero-Shot 3D Object Retrieval via Testing-time Distribution Alignment ICMR2025

【速读】:该论文旨在解决在未知测试类别下学习具有强泛化能力的3D表示的问题,这一问题在许多实际的3D应用中尤为关键。现有方法由于缺乏来自广泛概念的3D训练数据而难以达成此目标,同时,尽管预训练的大规模视觉-语言模型(如CLIP)表现出色的零样本泛化能力,但其在提取适合的3D表示方面受到2D训练与3D测试分布之间显著差异的限制。解决方案的关键在于提出一种名为TeDA的框架,该框架通过在测试阶段对预训练的2D视觉-语言模型CLIP进行适应,实现对未知3D物体检索的有效处理,其核心机制包括将3D物体投影到多视角图像、利用CLIP提取特征,并通过自信的查询-目标样本对进行迭代优化策略来优化3D查询嵌入。

链接: https://arxiv.org/abs/2505.02325
作者: Zhichuan Wang,Yang Zhou,Jinhai Xiang,Yulong Wang,Xinwei He
机构: Huazhong Agricultural University (华中农业大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by ICMR 2025

点击查看摘要

Abstract:Learning discriminative 3D representations that generalize well to unknown testing categories is an emerging requirement for many real-world 3D applications. Existing well-established methods often struggle to attain this goal due to insufficient 3D training data from broader concepts. Meanwhile, pre-trained large vision-language models (e.g., CLIP) have shown remarkable zero-shot generalization capabilities. Yet, they are limited in extracting suitable 3D representations due to substantial gaps between their 2D training and 3D testing distributions. To address these challenges, we propose Testing-time Distribution Alignment (TeDA), a novel framework that adapts a pretrained 2D vision-language model CLIP for unknown 3D object retrieval at test time. To our knowledge, it is the first work that studies the test-time adaptation of a vision-language model for 3D feature learning. TeDA projects 3D objects into multi-view images, extracts features using CLIP, and refines 3D query embeddings with an iterative optimization strategy by confident query-target sample pairs in a self-boosting manner. Additionally, TeDA integrates textual descriptions generated by a multimodal language model (InternVL) to enhance 3D object understanding, leveraging CLIP’s aligned feature space to fuse visual and textual cues. Extensive experiments on four open-set 3D object retrieval benchmarks demonstrate that TeDA greatly outperforms state-of-the-art methods, even those requiring extensive training. We also experimented with depth maps on Objaverse-LVIS, further validating its effectiveness. Code is available at this https URL.
zh

[CV-47] Continuous Normalizing Flows for Uncertainty-Aware Human Pose Estimation

【速读】: This paper addresses the balance among accuracy, computational efficiency, and reliable uncertainty quantification (UQ) in human pose estimation (HPE). Traditional regression-based methods assume fixed distributions, which can lead to poor UQ, while heatmap-based methods model the output distribution effectively but demand significant computational resources. The proposed solution, Continuous Flow Residual Estimation (CFRE), integrates Continuous Normalizing Flows (CNFs) into regression-based models, enabling dynamic distribution adaptation that improves accuracy and uncertainty quantification while retaining computational efficiency.

链接: https://arxiv.org/abs/2505.02287
作者: Shipeng Liu,Ziliang Xiong,Bastian Wandt,Per-Erik Forssén
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by SCIA2025

点击查看摘要

Abstract:Human Pose Estimation (HPE) is increasingly important for applications like virtual reality and motion analysis, yet current methods struggle with balancing accuracy, computational efficiency, and reliable uncertainty quantification (UQ). Traditional regression-based methods assume fixed distributions, which might lead to poor UQ. Heatmap-based methods effectively model the output distribution using likelihood heatmaps, however, they demand significant resources. To address this, we propose Continuous Flow Residual Estimation (CFRE), an integration of Continuous Normalizing Flows (CNFs) into regression-based models, which allows for dynamic distribution adaptation. Through extensive experiments, we show that CFRE leads to better accuracy and uncertainty quantification with retained computational efficiency on both 2D and 3D human pose estimation tasks.
zh

[CV-48] Compositional Image-Text Matching and Retrieval by Grounding Entities CVPR

【速读】: This paper addresses the weakness of pretrained vision-language models such as CLIP in entity grounding and compositional image-text matching. The key to the solution is a training-free, zero-shot augmentation of CLIP embeddings: sub-image embeddings are computed for object entities and relations localized by state-of-the-art open-vocabulary detectors, the global image embedding is dynamically adjusted, and the enhanced image embedding is obtained as a weighted combination of the sub-image embeddings, improving image-text matching accuracy and retrieval performance.

链接: https://arxiv.org/abs/2505.02278
作者: Madhukar Reddy Vongala,Saurabh Srivastava,Jana Košecká
机构: George Mason University (乔治梅森大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at CVPR-W

点击查看摘要

Abstract:Vision-language pretraining on large datasets of image-text pairs is one of the main building blocks of current Vision-Language Models. With additional training, these models excel in various downstream tasks, including visual question answering, image captioning, and visual commonsense reasoning. However, a notable weakness of pretrained models like CLIP is their inability to perform entity grounding and compositional image and text matching [Jiang2024ComCLIP, yang2023amc, Rajabi2023GroundedVSR, learninglocalizeCVPR24]. In this work we propose a novel learning-free zero-shot augmentation of CLIP embeddings that has favorable compositional properties. We compute separate embeddings of sub-images of object entities and relations that are localized by state-of-the-art open-vocabulary detectors and dynamically adjust the baseline global image embedding. The final embedding is obtained by computing a weighted combination of the sub-image embeddings. The resulting embedding is then utilized for similarity computation with the text embedding, resulting in an average 1.5% improvement in image-text matching accuracy on the Visual Genome and SVO Probes datasets [krishna2017visualgenome, svo]. Notably, the enhanced embeddings demonstrate superior retrieval performance, thus achieving significant gains on the Flickr30K and MS-COCO retrieval benchmarks [flickr30ke, mscoco], improving the state-of-the-art Recall@1 by 12% and 0.4%, respectively. Our code is available at this https URL.
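
The augmentation step itself is a few lines: normalize the global and sub-image embeddings and take a weighted combination. The uniform weighting below is an illustrative choice, and random vectors stand in for CLIP features.

```python
# Minimal sketch of the learning-free augmentation: the final embedding is a
# weighted combination of the global image embedding and sub-image embeddings
# of detected entities/relations (random vectors stand in for CLIP features,
# and the uniform weights are an illustrative choice).
import numpy as np

rng = np.random.default_rng(0)

def normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

global_emb = normalize(rng.normal(size=(512,)))
sub_embs = normalize(rng.normal(size=(3, 512)))    # entity/relation crops
weights = np.array([0.5, 0.5 / 3, 0.5 / 3, 0.5 / 3])

augmented = normalize(weights[0] * global_emb + weights[1:] @ sub_embs)
text_emb = normalize(rng.normal(size=(512,)))
print("image-text similarity:", float(augmented @ text_emb))
```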
zh

[CV-49] Enhancing AI Face Realism: Cost-Efficient Quality Improvement in Distilled Diffusion Models with a Fully Synthetic Dataset

【速读】: This paper addresses the trade-off between cost and quality in large-scale image generation, i.e., how to reduce computational cost while preserving generation quality. The key to the solution is training a fast image-to-image translation module that refines the output of a lightweight distilled generator (e.g., FLUX.1-schnell) to a level comparable to a computationally intensive baseline (e.g., FLUX.1-dev), enabling efficient, high-quality image generation.

链接: https://arxiv.org/abs/2505.02255
作者: Jakub Wąsala,Bartłomiej Wrzalski,Kornelia Noculak,Yuliia Tarasenko,Oliwer Krupa,Jan Kocoń,Grzegorz Chodak
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 25th International Conference on Computational Science

点击查看摘要

Abstract:This study presents a novel approach to enhance the cost-to-quality ratio of image generation with diffusion models. We hypothesize that differences between distilled (e.g. FLUX.1-schnell) and baseline (e.g. FLUX.1-dev) models are consistent and, therefore, learnable within a specialized domain, like portrait generation. We generate a synthetic paired dataset and train a fast image-to-image translation head. Using two sets of low- and high-quality synthetic images, our model is trained to refine the output of a distilled generator (e.g., FLUX.1-schnell) to a level comparable to a baseline model like FLUX.1-dev, which is more computationally intensive. Our results show that the pipeline, which combines a distilled version of a large generative model with our enhancement layer, delivers similar photorealistic portraits to the baseline version with up to an 82% decrease in computational cost compared to FLUX.1-dev. This study demonstrates the potential for improving the efficiency of AI solutions involving large-scale image generation.
zh

[CV-50] Cricket: A Self-Powered Chirping Pixel

【速读】: This paper addresses light sensing and wireless communication without an external power source or battery. The key to the solution is a sensor called cricket, which harvests energy from incident light and transmits a short, strong radio frequency chirp whenever the harvested energy reaches a specific level. Each cricket identifies itself by its fixed carrier frequency, and the duration between consecutive chirps encodes the incident light level, achieving self-powered light measurement and wireless transmission.

链接: https://arxiv.org/abs/2505.02246
作者: Shree K. Nayar,Jeremy Klotz,Nikhil Nanda,Mikhail Fridberg
机构: Columbia University (哥伦比亚大学); ADSP Consulting, LLC (ADSP咨询公司)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 13 pages, 18 figures. Project page: this https URL

点击查看摘要

Abstract:We present a sensor that can measure light and wirelessly communicate the measurement, without the need for an external power source or a battery. Our sensor, called cricket, harvests energy from incident light. It is asleep for most of the time and transmits a short and strong radio frequency chirp when its harvested energy reaches a specific level. The carrier frequency of each cricket is fixed and reveals its identity, and the duration between consecutive chirps is a measure of the incident light level. We have characterized the radiometric response function, signal-to-noise ratio and dynamic range of cricket. We have experimentally verified that cricket can be miniaturized at the expense of increasing the duration between chirps. We show that a cube with a cricket on each of its sides can be used to estimate the centroid of any complex illumination, which has value in applications such as solar tracking. We also demonstrate the use of crickets for creating untethered sensor arrays that can produce video and control lighting for energy conservation. Finally, we modified cricket’s circuit to develop battery-free electronic sunglasses that can instantly adapt to environmental illumination.
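
The chirp timing admits a simple energy-budget reading: if harvested power scales with irradiance and each chirp costs a fixed amount of energy, the inter-chirp interval is inversely proportional to the light level. The constants in this toy model are assumptions, not measured device parameters.

```python
# Toy energy-budget model of a cricket pixel: the sensor chirps each time the
# harvested energy crosses a threshold, so the inter-chirp interval is
# inversely proportional to incident light. All constants are illustrative.
E_CHIRP = 1.0e-3      # energy needed per chirp (J), assumed
ETA = 2.0e-4          # harvested power per unit irradiance (W per W/m^2), assumed

def interval_from_light(irradiance_w_m2: float) -> float:
    """Seconds between chirps at a given irradiance."""
    return E_CHIRP / (ETA * irradiance_w_m2)

def light_from_interval(dt_s: float) -> float:
    """Invert the model: estimate irradiance from a measured interval."""
    return E_CHIRP / (ETA * dt_s)

dt = interval_from_light(100.0)            # brighter scene -> frequent chirps
print(f"interval: {dt:.2f} s, recovered light: {light_from_interval(dt):.1f} W/m^2")
```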
zh

[CV-51] Quantizing Diffusion Models from a Sampling-Aware Perspective

【速读】: This paper addresses the computational complexity and sampling inefficiency that hinder the use of diffusion models in low-latency, resource-constrained environments. The core challenge is that quantization-induced noise disrupts directional estimation at each sampling step, which further distorts the precise directional estimates of higher-order samplers under discretized numerical methods and thereby alters the optimal sampling trajectory. The key to the solution is a sampling-aware quantization strategy in which a Mixed-Order Trajectory Alignment technique imposes tighter error-bound constraints at each sampling step, promoting a more linear probability flow and achieving dual acceleration with high fidelity.

链接: https://arxiv.org/abs/2505.02242
作者: Qian Zeng,Jie Song,Yuanyu Wan,Huiqiong Wang,Mingli Song
机构: Zhejiang University (浙江大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 11 pages, 4 figures

点击查看摘要

Abstract:Diffusion models have recently emerged as the dominant approach in visual generation tasks. However, the lengthy denoising chains and the computationally intensive noise estimation networks hinder their applicability in low-latency and resource-limited environments. Previous research has endeavored to address these limitations in a decoupled manner, utilizing either advanced samplers or efficient model quantization techniques. In this study, we uncover that quantization-induced noise disrupts directional estimation at each sampling step, further distorting the precise directional estimations of higher-order samplers when solving the sampling equations through discretized numerical methods, thereby altering the optimal sampling trajectory. To attain dual acceleration with high fidelity, we propose a sampling-aware quantization strategy, wherein a Mixed-Order Trajectory Alignment technique is devised to impose a more stringent constraint on the error bounds at each sampling step, facilitating a more linear probability flow. Extensive experiments on sparse-step fast sampling across multiple datasets demonstrate that our approach preserves the rapid convergence characteristics of high-speed samplers while maintaining superior generation quality. Code will be made publicly available soon.
zh

[CV-52] Improving Physical Object State Representation in Text-to-Image Generative Systems CVPR2025

【速读】: This paper addresses the inability of current text-to-image generative models to accurately represent object states (e.g., "a table without a bottle", "an empty tumbler"). The key to the solution is a fully automated pipeline that generates high-quality synthetic data accurately capturing objects in varied states, followed by fine-tuning several open-source text-to-image models on this synthetic data to improve the alignment between generated images and their prompts.

链接: https://arxiv.org/abs/2505.02236
作者: Tianle Chen,Chaitanya Chakka,Deepti Ghadiyaram
机构: Boston University (波士顿大学); Runway (运行)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Submitted to Synthetic Data for Computer Vision - CVPR 2025 Workshop

点击查看摘要

Abstract:Current text-to-image generative models struggle to accurately represent object states (e.g., “a table without a bottle,” “an empty tumbler”). In this work, we first design a fully-automatic pipeline to generate high-quality synthetic data that accurately captures objects in varied states. Next, we fine-tune several open-source text-to-image models on this synthetic data. We evaluate the performance of the fine-tuned models by quantifying the alignment of the generated images to their prompts using GPT4o-mini, and achieve an average absolute improvement of 8+% across four models on the public GenAI-Bench dataset. We also curate a collection of 200 prompts with a specific focus on common objects in various physical states. We demonstrate a significant improvement of an average of 24+% over the baseline on this dataset. We release all evaluation prompts and code.
zh

[CV-53] DualReal: Adaptive Joint Training for Lossless Identity-Motion Fusion in Video Customization

【速读】: This paper addresses the identity and motion consistency problem in customized text-to-video generation with pre-trained large-scale models: existing methods customize identity or motion dynamics in isolation, ignoring their intrinsic mutual constraints and synergistic interdependencies and thereby causing identity-motion conflicts during generation. The key to the solution is DualReal, a framework that uses adaptive joint training to collaboratively build the interdependencies between the two dimensions. Its core components are Dual-aware Adaptation, which dynamically selects the training phase (identity or motion), learns the current information guided by the frozen dimension prior, and employs a regularization strategy to protect learned knowledge, and the StageBlender Controller, which leverages denoising stages and Diffusion Transformer depths to guide different dimensions at adaptive granularity, avoiding conflicts across stages and ultimately achieving lossless fusion of identity and motion patterns.

链接: https://arxiv.org/abs/2505.02192
作者: Wenchuan Wang,Mengqi Huang,Yijing Tu,Zhendong Mao
机构: University of Science and Technology of China (中国科学技术大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Customized text-to-video generation with pre-trained large-scale models has recently garnered significant attention through focusing on identity and motion consistency. Existing works typically follow the isolated customized paradigm, where the subject identity or motion dynamics are customized exclusively. However, this paradigm completely ignores the intrinsic mutual constraints and synergistic interdependencies between identity and motion, resulting in identity-motion conflicts throughout the generation process that systematically degrade quality. To address this, we introduce DualReal, a novel framework that employs adaptive joint training to collaboratively construct interdependencies between dimensions. Specifically, DualReal is composed of two units: (1) Dual-aware Adaptation dynamically selects a training phase (i.e., identity or motion), learns the current information guided by the frozen dimension prior, and employs a regularization strategy to avoid knowledge leakage; (2) StageBlender Controller leverages the denoising stages and Diffusion Transformer depths to guide different dimensions with adaptive granularity, avoiding conflicts at various stages and ultimately achieving lossless fusion of identity and motion patterns. We constructed a more comprehensive benchmark than existing methods. The experimental results show that DualReal improves CLIP-I and DINO-I metrics by 21.7% and 31.8% on average, and achieves top performance on nearly all motion quality metrics.
zh

[CV-54] Robust AI-Generated Face Detection with Imbalanced Data

【速读】: This paper addresses distribution shifts and the severe class imbalance between authentic and fake samples in deepfake detection, which limit the robustness and accuracy of existing detectors. The key to the solution is a framework that combines dynamic loss reweighting with ranking-based optimization, achieving superior generalization and performance under imbalanced data conditions.

链接: https://arxiv.org/abs/2505.02182
作者: Yamini Sri Krubha,Aryana Hou,Braden Vester,Web Walker,Xin Wang,Li Lin,Shu Hu
机构: Purdue University (普渡大学); Clarkstown High School South (克拉克斯顿高中南校区); University at Albany, State University of New York (阿尔巴尼大学,纽约州立大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Deepfakes, created using advanced AI techniques such as Variational Autoencoder and Generative Adversarial Networks, have evolved from research and entertainment applications into tools for malicious activities, posing significant threats to digital trust. Current deepfake detection techniques have evolved from CNN-based methods focused on local artifacts to more advanced approaches using vision transformers and multimodal models like CLIP, which capture global anomalies and improve cross-domain generalization. Despite recent progress, state-of-the-art deepfake detectors still face major challenges in handling distribution shifts from emerging generative models and addressing severe class imbalance between authentic and fake samples in deepfake datasets, which limits their robustness and detection accuracy. To address these challenges, we propose a framework that combines dynamic loss reweighting and ranking-based optimization, which achieves superior generalization and performance under imbalanced dataset conditions. The code is available at this https URL.
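
One common way to realize dynamic loss reweighting is to scale the per-sample loss by inverse class frequency, recomputed as training proceeds; the batch-level sketch below shows the mechanism only and is not claimed to match the paper's exact schedule.

```python
# Minimal sketch of dynamic loss reweighting for an imbalanced real/fake
# dataset: per-class weights are set from inverse class frequencies (here,
# recomputed per batch) and fed to a weighted BCE loss.
import torch
import torch.nn.functional as F

logits = torch.randn(8)                                   # detector outputs
labels = torch.tensor([0, 0, 0, 0, 0, 0, 1, 1]).float()   # few "fake" positives

pos_frac = labels.mean().clamp(min=1e-6)
weights = torch.where(labels == 1, 1.0 / pos_frac, 1.0 / (1.0 - pos_frac))
weights = weights / weights.mean()                        # keep the scale stable

loss = F.binary_cross_entropy_with_logits(logits, labels, weight=weights)
print(loss.item())
```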
zh

[CV-55] ProDisc-VAD: An Efficient System for Weakly-Supervised Anomaly Detection in Video Surveillance Applications

【速读】: This paper addresses the difficulty of learning discriminative features in weakly-supervised video anomaly detection (WS-VAD) caused by label ambiguity. The key to the solution is the ProDisc-VAD framework with two synergistic components: the Prototype Interaction Layer (PIL) provides controlled normality modeling with a small set of learnable prototypes, establishing a robust baseline, while the Pseudo-Instance Discriminative Enhancement (PIDE) loss improves separability by applying targeted contrastive learning only to the most reliable extreme-scoring instances.

链接: https://arxiv.org/abs/2505.02179
作者: Tao Zhu,Qi Yu,Xinru Dong,Shiyu Li,Yue Liu,Jinlong Jiang,Lei Shu
机构: Jiangxi University of Finance and Economics (江西财经大学); Jiangxi Science and Technology Normal University (江西科技师范大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Weakly-supervised video anomaly detection (WS-VAD) using Multiple Instance Learning (MIL) suffers from label ambiguity, hindering discriminative feature learning. We propose ProDisc-VAD, an efficient framework tackling this via two synergistic components. The Prototype Interaction Layer (PIL) provides controlled normality modeling using a small set of learnable prototypes, establishing a robust baseline without being overwhelmed by dominant normal data. The Pseudo-Instance Discriminative Enhancement (PIDE) loss boosts separability by applying targeted contrastive learning exclusively to the most reliable extreme-scoring instances (highest/lowest scores). ProDisc-VAD achieves strong AUCs (97.98% ShanghaiTech, 87.12% UCF-Crime) using only 0.4M parameters, over 800x fewer than recent ViT-based methods like VadCLIP, demonstrating exceptional efficiency alongside state-of-the-art performance. Code is available at this https URL.
zh

[CV-56] Sparfels: Fast Reconstruction from Sparse Unposed Imagery

【速读】: This paper addresses radiance field learning and shape recovery from noisy or unposed sparse cameras, a sparse, uncalibrated setting in which shape recovery remains relatively underexplored. The key to the solution is an efficient and simple pipeline that harnesses a single recent 3D foundation model, using its task heads (such as point maps and camera initializations) to instantiate a bundle-adjusting 2D Gaussian Splatting (2DGS) model and using image correspondences to guide camera optimization during 2DGS training. A core contribution is a novel formulation of the splatted color variance along rays that can be computed efficiently; reducing this moment during training yields more accurate shape reconstructions.

链接: https://arxiv.org/abs/2505.02178
作者: Shubhendu Jena,Amine Ouasfi,Mae Younes,Adnane Boukhayma
机构: Inria, Univ. Rennes, CNRS, IRISA
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project page : this https URL

点击查看摘要

Abstract:We present a method for Sparse view reconstruction with surface element splatting that runs within 3 minutes on a consumer grade GPU. While few methods address sparse radiance field learning from noisy or unposed sparse cameras, shape recovery remains relatively underexplored in this setting. Several radiance and shape learning test-time optimization methods address the sparse posed setting by learning data priors or using combinations of external monocular geometry priors. Differently, we propose an efficient and simple pipeline harnessing a single recent 3D foundation model. We leverage its various task heads, notably point maps and camera initializations to instantiate a bundle adjusting 2D Gaussian Splatting (2DGS) model, and image correspondences to guide camera optimization midst 2DGS training. Key to our contribution is a novel formulation of splatted color variance along rays, which can be computed efficiently. Reducing this moment in training leads to more accurate shape reconstructions. We demonstrate state-of-the-art performances in the sparse uncalibrated setting in reconstruction and novel view benchmarks based on established multi-view datasets.
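
The variance formulation has a direct compositing reading: with the usual alpha-compositing weights along a ray, the color variance is the weighted second moment minus the squared weighted mean. The sketch below normalizes the weights before taking moments, which is a simplifying assumption.

```python
# Minimal sketch of splatted color variance along one ray: with alpha-
# compositing weights w_i = alpha_i * prod_{j<i}(1 - alpha_j), the variance is
# the weighted second moment minus the squared weighted mean, per channel.
import numpy as np

alphas = np.array([0.3, 0.5, 0.4, 0.8])          # per-splat opacities along a ray
colors = np.array([[0.9, 0.2, 0.1],
                   [0.8, 0.3, 0.1],
                   [0.2, 0.2, 0.7],
                   [0.1, 0.1, 0.9]])

trans = np.concatenate(([1.0], np.cumprod(1.0 - alphas)[:-1]))  # transmittance
w = alphas * trans                               # compositing weights
w = w / w.sum()                                  # normalize for moment estimates

mean = w @ colors
var = w @ colors**2 - mean**2                    # Var[c] = E[c^2] - E[c]^2
print("ray color variance per channel:", var.round(4))
```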
zh

[CV-57] Saliency-Guided Training for Fingerprint Presentation Attack Detection

【速读】: This paper addresses the limited generalization of models for fingerprint presentation attack detection (fingerprint PAD) by introducing saliency-guided training to improve performance across data scenarios. The key to the solution is using human-annotated perceptually-important maps together with algorithmically generated "pseudosaliency" maps, such as minutiae-based, image quality-based, and autoencoder-based saliency maps, to guide the model toward the important regions of an image, enhancing its generalization capability and detection accuracy.

链接: https://arxiv.org/abs/2505.02176
作者: Samuel Webster,Adam Czajka
机构: University of Notre Dame (圣母大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 19 pages (8 main, 2 references, 9 appendix), 2 figures, 19 tables (2 main, 17 appendix)

点击查看摘要

Abstract:Saliency-guided training, which directs model learning to important regions of images, has demonstrated generalization improvements across various biometric presentation attack detection (PAD) tasks. This paper presents its first application to fingerprint PAD. We conducted a 50-participant study to create a dataset of 800 human-annotated fingerprint perceptually-important maps, explored alongside algorithmically-generated “pseudosaliency,” including minutiae-based, image quality-based, and autoencoder-based saliency maps. Evaluating on the 2021 Fingerprint Liveness Detection Competition testing set, we explore various configurations within five distinct training scenarios to assess the impact of saliency-guided training on accuracy and generalization. Our findings demonstrate the effectiveness of saliency-guided training for fingerprint PAD in both limited and large data contexts, and we present a configuration capable of earning the first place on the LivDet-2021 benchmark. Our results highlight saliency-guided training’s promise for increased model generalization capabilities, its effectiveness when data is limited, and its potential to scale to larger datasets in fingerprint PAD. All collected saliency data and trained models are released with the paper to support reproducible research.
zh

[CV-58] SparSplat: Fast Multi-View Reconstruction with Generalizable 2D Gaussian Splatting

【速读】: This paper addresses the joint challenge of sparse-view 3D reconstruction and novel view synthesis (NVS) while maintaining real-time performance, accurate geometry, and photorealism. The key to the solution is a multi-view stereo (MVS) based learning framework that regresses 2D Gaussian Splatting (2DGS) surface element parameters in a feed-forward fashion to perform 3D shape reconstruction and NVS from sparse-view images, further benefiting from pre-existing foundational multi-view deep visual features for stronger generalization. The model attains state-of-the-art reconstruction accuracy and synthesis quality on DTU, BlendedMVS, and Tanks and Temples, with inference almost two orders of magnitude faster than prior feed-forward methods based on volume rendering of implicit representations.

链接: https://arxiv.org/abs/2505.02175
作者: Shubhendu Jena,Shishir Reddy Vutukur,Adnane Boukhayma
机构: Inria, Univ. Rennes, CNRS, IRISA; Technical University of Munich
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project page : this https URL

点击查看摘要

Abstract:Recovering 3D information from scenes via multi-view stereo reconstruction (MVS) and novel view synthesis (NVS) is inherently challenging, particularly in scenarios involving sparse-view setups. The advent of 3D Gaussian Splatting (3DGS) enabled real-time, photorealistic NVS. Following this, 2D Gaussian Splatting (2DGS) leveraged perspective accurate 2D Gaussian primitive rasterization to achieve accurate geometry representation during rendering, improving 3D scene reconstruction while maintaining real-time performance. Recent approaches have tackled the problem of sparse real-time NVS using 3DGS within a generalizable, MVS-based learning framework to regress 3D Gaussian parameters. Our work extends this line of research by addressing the challenge of generalizable sparse 3D reconstruction and NVS jointly, and manages to perform successfully at both tasks. We propose an MVS-based learning pipeline that regresses 2DGS surface element parameters in a feed-forward fashion to perform 3D shape reconstruction and NVS from sparse-view images. We further show that our generalizable pipeline can benefit from preexisting foundational multi-view deep visual features. The resulting model attains state-of-the-art results on the DTU sparse 3D reconstruction benchmark in terms of Chamfer distance to ground-truth, as well as state-of-the-art NVS. It also demonstrates strong generalization on the BlendedMVS and Tanks and Temples datasets. We note that our model outperforms the prior state-of-the-art in feed-forward sparse view reconstruction based on volume rendering of implicit representations, while offering an almost 2 orders of magnitude higher inference speed.
zh

[CV-59] Focus What Matters: Matchability-Based Reweighting for Local Feature Matching

【速读】: This paper addresses the redundant and noisy interactions that arise in semi-dense matching methods when attention weights are learned from scratch and all pixels or keypoints are treated equally. The key to the solution is a novel attention reweighting mechanism that simultaneously injects a learnable bias term into the attention logits and rescales the input value features according to matchability information. This dual design allows the attention mechanism to dynamically adjust both its internal weighting scheme and the magnitude of its output representations, improving matching.

链接: https://arxiv.org/abs/2505.02161
作者: Dongyue Li
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Since the rise of Transformers, many semi-dense matching methods have adopted attention mechanisms to extract feature descriptors. However, the attention weights, which capture dependencies between pixels or keypoints, are often learned from scratch. This approach can introduce redundancy and noisy interactions from irrelevant regions, as it treats all pixels or keypoints equally. Drawing inspiration from keypoint selection processes, we propose to first classify all pixels into two categories: matchable and non-matchable. Matchable pixels are expected to receive higher attention weights, while non-matchable ones are down-weighted. In this work, we propose a novel attention reweighting mechanism that simultaneously incorporates a learnable bias term into the attention logits and applies a matchability-informed rescaling to the input value features. The bias term, injected prior to the softmax operation, selectively adjusts attention scores based on the confidence of query-key interactions. Concurrently, the feature rescaling acts post-attention by modulating the influence of each value vector in the final output. This dual design allows the attention mechanism to dynamically adjust both its internal weighting scheme and the magnitude of its output representations. Extensive experiments conducted on three benchmark datasets validate the effectiveness of our method, consistently outperforming existing state-of-the-art approaches.
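
Both ingredients drop into standard scaled dot-product attention: a learnable bias added to the logits before the softmax, and a per-token matchability score rescaling the values. Shapes and initializations below are illustrative assumptions.

```python
# Minimal sketch of the dual reweighting: a learnable bias is added to the
# attention logits before softmax, and value features are rescaled by a
# per-token matchability score.
import torch
import torch.nn.functional as F

n, d = 6, 32
q, k, v = (torch.randn(n, d) for _ in range(3))
matchability = torch.sigmoid(torch.randn(n))    # per-token matchable-ness
bias = torch.zeros(n, n, requires_grad=True)    # learnable logit bias

logits = (q @ k.T) / d**0.5 + bias              # bias injected pre-softmax
attn = F.softmax(logits, dim=-1)
out = attn @ (matchability.unsqueeze(-1) * v)   # matchability rescales values
print(out.shape)
```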
zh

[CV-60] Small Clips Big Gains: Learning Long-Range Refocused Temporal Information for Video Super-Resolution

【速读】: This paper addresses the effective learning of long-range dependencies in video super-resolution (VSR), in particular how to fully exploit temporal information in long videos to improve detail restoration. The key to the solution is LRTI-VSR, a training framework built on long-range refocused temporal information: a generic training strategy exploits temporal propagation features from long video clips while training on shorter clips, and a refocused intra-/inter-frame transformer block lets the model selectively prioritize useful temporal information through its attention module while improving inter-frame information utilization in the FFN module, achieving state-of-the-art performance with training and computational efficiency.

链接: https://arxiv.org/abs/2505.02159
作者: Xingyu Zhou,Wei Long,Jingbo Lu,Shiyin Jiang,Weiyi You,Haifeng Wu,Shuhang Gu
机构: University of Electronic Science and Technology of China (中国电子科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 15 pages, 11 figures

点击查看摘要

Abstract:Video super-resolution (VSR) can achieve better performance compared to single image super-resolution by additionally leveraging temporal information. In particular, the recurrent-based VSR model exploits long-range temporal information during inference and achieves superior detail restoration. However, effectively learning these long-term dependencies within long videos remains a key challenge. To address this, we propose LRTI-VSR, a novel training framework for recurrent VSR that efficiently leverages Long-Range Refocused Temporal Information. Our framework includes a generic training strategy that utilizes temporal propagation features from long video clips while training on shorter video clips. Additionally, we introduce a refocused intra-/inter-frame transformer block which allows the VSR model to selectively prioritize useful temporal information through its attention module while further improving inter-frame information utilization in the FFN module. We evaluate LRTI-VSR on both CNN and transformer-based VSR architectures, conducting extensive ablation studies to validate the contribution of each component. Experiments on long-video test sets demonstrate that LRTI-VSR achieves state-of-the-art performance while maintaining training and computational efficiency.
zh

[CV-61] Spotting the Unexpected (STU): A 3D LiDAR Dataset for Anomaly Segmentation in Autonomous Driving CVPR2025

【速读】: This paper addresses 3D semantic segmentation of anomalous objects in complex road environments for autonomous vehicles (anomaly segmentation), where existing datasets lack the high-quality multimodal data typically found on AVs. The key to the solution is a publicly available dataset with dense 3D semantic labeling that combines LiDAR and camera data as well as sequential information, supporting anomaly detection across various ranges and thereby improving the safety of autonomous driving systems.

链接: https://arxiv.org/abs/2505.02148
作者: Alexey Nekrasov,Malcolm Burdorf,Stewart Worrall,Bastian Leibe,Julie Stephany Berrio Perez
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted for publication at CVPR 2025. Project page: this https URL

点击查看摘要

Abstract:To operate safely, autonomous vehicles (AVs) need to detect and handle unexpected objects or anomalies on the road. While significant research exists for anomaly detection and segmentation in 2D, research progress in 3D is underexplored. Existing datasets lack high-quality multimodal data that are typically found in AVs. This paper presents a novel dataset for anomaly segmentation in driving scenarios. To the best of our knowledge, it is the first publicly available dataset focused on road anomaly segmentation with dense 3D semantic labeling, incorporating both LiDAR and camera data, as well as sequential information to enable anomaly detection across various ranges. This capability is critical for the safe navigation of autonomous vehicles. We adapted and evaluated several baseline models for 3D segmentation, highlighting the challenges of 3D anomaly detection in driving environments. Our dataset and evaluation code will be openly available, facilitating the testing and performance comparison of different approaches.
zh

[CV-62] Local Herb Identification Using Transfer Learning: A CNN-Powered Mobile Application for Nepalese Flora

【速读】: This paper addresses a key challenge in plant classification, namely accurately classifying 60 herb species in biodiversity-rich regions such as Nepal. The key to the solution is a deep learning approach based on Convolutional Neural Networks (CNNs) and transfer learning, trained on a manually curated dataset of 12,000 herb images. Multiple architectures were compared, with DenseNet121 ultimately performing best, and data augmentation and regularization were applied to mitigate overfitting and improve the model's generalizability.

链接: https://arxiv.org/abs/2505.02147
作者: Prajwal Thapa,Mridul Sharma,Jinu Nyachhyon,Yagya Raj Pandeya
机构: 未知
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注: 12 pages, 6 figures, 5 tables

点击查看摘要

Abstract:Herb classification presents a critical challenge in botanical research, particularly in regions with rich biodiversity such as Nepal. This study introduces a novel deep learning approach for classifying 60 different herb species using Convolutional Neural Networks (CNNs) and transfer learning techniques. Using a manually curated dataset of 12,000 herb images, we developed a robust machine learning model that addresses existing limitations in herb recognition methodologies. Our research employed multiple model architectures, including DenseNet121, 50-layer Residual Network (ResNet50), 16-layer Visual Geometry Group Network (VGG16), InceptionV3, EfficientNetV2, and Vision Transformer (VIT), with DenseNet121 ultimately demonstrating superior performance. Data augmentation and regularization techniques were applied to mitigate overfitting and enhance the generalizability of the model. This work advances herb classification techniques, preserving traditional botanical knowledge and promoting sustainable herb utilization.
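
A minimal sketch of the transfer-learning setup: load an ImageNet-pretrained DenseNet121 from torchvision and swap its classifier head for the 60 herb classes. The freezing strategy shown is a common choice, not necessarily the paper's fine-tuning schedule.

```python
# Minimal sketch of transfer learning with DenseNet121: reuse ImageNet
# features and replace the classifier head for 60 herb classes.
import torch.nn as nn
from torchvision import models

model = models.densenet121(weights=models.DenseNet121_Weights.DEFAULT)
model.classifier = nn.Linear(model.classifier.in_features, 60)  # 60 species

# Optionally freeze the backbone and train only the new head at first.
for name, p in model.named_parameters():
    p.requires_grad = name.startswith("classifier")
```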
zh

[CV-63] HiLLIE: Human-in-the-Loop Training for Low-Light Image Enhancement

【速读】: This paper addresses aligning the outputs of low-light image enhancement (LLIE) with human visual preferences so as to improve the visual quality of the enhanced, well-lit results. The key to the solution is HiLLIE, a human-in-the-loop LLIE training framework: at each training stage, humans provide efficient visual quality annotations of the enhanced outputs, and a tailored image quality assessment (IQA) model learns the human visual preferences encoded in these labels to guide the training of the enhancement model. With only a small amount of pairwise ranking annotations per stage, the method continually improves the IQA model's ability to simulate human visual assessment, yielding visually appealing LLIE results.

链接: https://arxiv.org/abs/2505.02134
作者: Xiaorui Zhao,Xinyue Zhou,Peibei Cao,Junyu Lou,Shuhang Gu
机构: University of Electronic Science and Technology of China (中国电子科技大学); Nanjing University of Information Science and Technology (南京信息工程大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Developing effective approaches to generate enhanced results that align well with human visual preferences for high-quality well-lit images remains a challenge in low-light image enhancement (LLIE). In this paper, we propose a human-in-the-loop LLIE training framework that improves the visual quality of unsupervised LLIE model outputs through iterative training stages, named HiLLIE. At each stage, we introduce human guidance into the training process through efficient visual quality annotations of enhanced outputs. Subsequently, we employ a tailored image quality assessment (IQA) model to learn human visual preferences encoded in the acquired labels, which is then utilized to guide the training process of an enhancement model. With only a small amount of pairwise ranking annotations required at each stage, our approach continually improves the IQA model’s capability to simulate human visual assessment of enhanced outputs, thus leading to visually appealing LLIE results. Extensive experiments demonstrate that our approach significantly improves unsupervised LLIE model performance in terms of both quantitative and qualitative performance. The code and collected ranking dataset will be available at this https URL.
zh

[CV-64] GarmentGS: Point-Cloud Guided Gaussian Splatting for High-Fidelity Non-Watertight 3D Garment Reconstruction

【速读】: This paper addresses the heavy time and labor costs of traditional 3D garment creation caused by extensive manual operations, as well as the difficulty of reconstructing high-fidelity, non-watertight 3D garments with 3D Gaussian Splatting owing to the unstructured and irregular nature of Gaussian primitives. The key to the solution is GarmentGS, a dense point cloud-guided method: a fast dense point cloud reconstruction module completes garment point cloud reconstruction within 10 minutes instead of several hours, and the dense point cloud guides the movement, flattening, and rotation of Gaussian primitives so that they distribute better over the garment surface, achieving superior rendering quality and geometric accuracy.

链接: https://arxiv.org/abs/2505.02126
作者: Zhihao Tang,Shenghao Yang,Hongtao Zhang,Mingbo Zhao
机构: Donghua University (东华大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Traditional 3D garment creation requires extensive manual operations, resulting in time and labor costs. Recently, 3D Gaussian Splatting has achieved breakthrough progress in 3D scene reconstruction and rendering, attracting widespread attention and opening new pathways for 3D garment reconstruction. However, due to the unstructured and irregular nature of Gaussian primitives, it is difficult to reconstruct high-fidelity, non-watertight 3D garments. In this paper, we present GarmentGS, a dense point cloud-guided method that can reconstruct high-fidelity garment surfaces with high geometric accuracy and generate non-watertight, single-layer meshes. Our method introduces a fast dense point cloud reconstruction module that can complete garment point cloud reconstruction in 10 minutes, compared to traditional methods that require several hours. Furthermore, we use dense point clouds to guide the movement, flattening, and rotation of Gaussian primitives, enabling better distribution on the garment surface to achieve superior rendering effects and geometric accuracy. Through numerical and visual comparisons, our method achieves fast training and real-time rendering while maintaining competitive quality.
zh

[CV-65] Unaligned RGB Guided Hyperspectral Image Super-Resolution with Spatial-Spectral Concordance

【速读】:该论文旨在解决高光谱图像超分辨率(Hyperspectral Image Super-Resolution, HSI SR)在高分辨率比例下性能受限的问题,特别是针对无对齐参考RGB图像引导的HSI SR中存在对齐不准确和对齐与融合模块交互不足的问题。其解决方案的关键在于提出一种空间-光谱一致性框架(Spatial-Spectral Concordance Hyperspectral Super-Resolution, SSC-HSR),通过构建两阶段图像对齐模块和引入特征聚合与注意力融合模块,实现更精确的图像对齐与纹理修复,并增强对齐与融合模块之间的交互,从而提升重建质量。

链接: https://arxiv.org/abs/2505.02109
作者: Yingkai Zhang,Zeqiang Lai,Tao Zhang,Ying Fu,Chenghu Zhou
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Hyperspectral image super-resolution aims to improve the spatial resolution, yet its performance is often limited at high resolution ratios. The recent adoption of high-resolution reference images for super-resolution is driven by the poor spatial detail found in low-resolution HSIs, making it a favorable approach. However, these approaches cannot effectively utilize information from the reference image, due to inaccurate alignment and inadequate interaction between alignment and fusion modules. In this paper, we introduce a Spatial-Spectral Concordance Hyperspectral Super-Resolution (SSC-HSR) framework for unaligned reference RGB guided HSI SR to address the issues of inaccurate alignment and poor interactivity of the previous approaches. Specifically, to ensure spatial concordance, i.e., align images more accurately across resolutions and refine textures, we construct a Two-Stage Image Alignment with a synthetic generation pipeline in the image alignment module, where the fine-tuned optical flow model produces a more accurate optical flow in the first stage and the warp model refines damaged textures in the second stage. To enhance the interaction between alignment and fusion modules and ensure spectral concordance during reconstruction, we propose a Feature Aggregation module and an Attention Fusion module. In the feature aggregation module, we introduce an Iterative Deformable Feature Aggregation block to achieve significant feature matching and texture aggregation under the guidance of fused multi-scale results, iteratively generating learnable offsets. Besides, we introduce two basic spectral-wise attention blocks in the attention fusion module to model the inter-spectral interactions. Extensive experiments on three natural or remote-sensing datasets show that our method outperforms state-of-the-art approaches on both quantitative and qualitative evaluations.
zh

[CV-66] SignSplat: Rendering Sign Language via Gaussian Splatting

【速读】:该论文旨在解决在有限视角下构建高保真人体渲染模型的问题,特别是在需要精确捕捉手部和面部细微复杂运动的场景中,如手语。传统方法通常关注大规模身体运动,而忽略了对细节动作的建模。解决方案的关键在于通过利用序列数据中的时间变化信息,提升模型在少量视角下的拟合精度与一致性,从而实现对复杂手势的准确建模。为此,研究者通过约束网格参数、引入高斯参数正则化技术以减少过拟合和渲染伪影,并提出一种自适应控制方法来优化高斯点的密度与分布。

链接: https://arxiv.org/abs/2505.02108
作者: Maksym Ivashechkin,Oscar Mendez,Richard Bowden
机构: CVSSP, University of Surrey(萨里大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:State-of-the-art approaches for conditional human body rendering via Gaussian splatting typically focus on simple body motions captured from many views. This is often in the context of dancing or walking. However, for more complex use cases, such as sign language, we care less about large body motion and more about subtle and complex motions of the hands and face. The problems of building high fidelity models are compounded by the complexity of capturing multi-view data of sign. The solution is to make better use of sequence data, ensuring that we can overcome the limited information from only a few views by exploiting temporal variability. Nevertheless, learning from sequence-level data requires extremely accurate and consistent model fitting to ensure that appearance is consistent across complex motions. We focus on how to achieve this, constraining mesh parameters to build an accurate Gaussian splatting framework from few views capable of modelling subtle human motion. We leverage regularization techniques on the Gaussian parameters to mitigate overfitting and rendering artifacts. Additionally, we propose a new adaptive control method to densify Gaussians and prune splat points on the mesh surface. To demonstrate the accuracy of our approach, we render novel sequences of sign language video, building on neural machine translation approaches to sign stitching. On benchmark datasets, our approach achieves state-of-the-art performance; and on highly articulated and complex sign language motion, we significantly outperform competing approaches.
zh

[CV-67] SkillMimic-V2: Learning Robust and Generalizable Interaction Skills from Sparse and Noisy Demonstrations

【速读】:该论文旨在解决强化学习中的交互演示(Reinforcement Learning from Interaction Demonstration, RLID)所面临的根本性挑战,即演示数据的噪声和覆盖范围限制。现有数据收集方法虽然提供了有价值的交互演示,但通常生成稀疏、不连贯且噪声大的轨迹,无法全面捕捉技能变化和状态转移的多样性。该研究的关键在于认识到,尽管演示数据存在噪声和稀疏性,但仍存在无限条物理上可行的轨迹,这些轨迹自然地连接不同演示技能或从其邻近状态中产生,从而形成连续的技能变化和状态转移空间。基于这一关键洞察,作者提出了两种数据增强技术:Stitched Trajectory Graph (STG) 和 State Transition Field (STF),并结合自适应轨迹采样策略和历史编码机制以提升RLID的效果。

链接: https://arxiv.org/abs/2505.02094
作者: Runyi Yu,Yinhuai Wang,Qihan Zhao,Hok Wai Tsui,Jingbo Wang,Ping Tan,Qifeng Chen
机构: HKUST(香港科技大学); Shanghai AI Laboratory(上海人工智能实验室)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:We address a fundamental challenge in Reinforcement Learning from Interaction Demonstration (RLID): demonstration noise and coverage limitations. While existing data collection approaches provide valuable interaction demonstrations, they often yield sparse, disconnected, and noisy trajectories that fail to capture the full spectrum of possible skill variations and transitions. Our key insight is that despite noisy and sparse demonstrations, there exist infinite physically feasible trajectories that naturally bridge between demonstrated skills or emerge from their neighboring states, forming a continuous space of possible skill variations and transitions. Building upon this insight, we present two data augmentation techniques: a Stitched Trajectory Graph (STG) that discovers potential transitions between demonstration skills, and a State Transition Field (STF) that establishes unique connections for arbitrary states within the demonstration neighborhood. To enable effective RLID with augmented data, we develop an Adaptive Trajectory Sampling (ATS) strategy for dynamic curriculum generation and a historical encoding mechanism for memory-dependent skill learning. Our approach enables robust skill acquisition that significantly generalizes beyond the reference demonstrations. Extensive experiments across diverse interaction tasks demonstrate substantial improvements over state-of-the-art methods in terms of convergence stability, generalization capability, and recovery robustness.
zh

[CV-68] HandOcc: NeRF-based Hand Rendering with Occupancy Networks

【速读】:该论文试图解决传统手部渲染方法中由于依赖参数化网格(parametric mesh)所带来的精度与模型复杂度之间的权衡问题。现有方法在使用参数化网格时,受限于网格初始化,难以泛化到无参数模型的物体,并且估计结果受网格分辨率和拟合精度的影响。论文提出的解决方案关键在于采用基于占用(occupancy)的表示方式,结合NeRF渲染器,实现无需网格的3D手部渲染。通过仅提供3D骨骼信息,利用卷积模型提取所需外观,并借助手部占用信息提升手与手之间的交互效果,从而实现快速渲染和优秀的手部外观迁移。

链接: https://arxiv.org/abs/2505.02079
作者: Maksym Ivashechkin,Oscar Mendez,Richard Bowden
机构: CVSSP, University of Surrey(萨里大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:We propose HandOcc, a novel framework for hand rendering based upon occupancy. Popular rendering methods such as NeRF are often combined with parametric meshes to provide deformable hand models. However, in doing so, such approaches present a trade-off between the fidelity of the mesh and the complexity and dimensionality of the parametric model. The simplicity of parametric mesh structures is appealing, but the underlying issue is that it binds methods to mesh initialization, making it unable to generalize to objects where a parametric model does not exist. It also means that estimation is tied to mesh resolution and the accuracy of mesh fitting. This paper presents a pipeline for meshless 3D rendering, which we apply to the hands. By providing only a 3D skeleton, the desired appearance is extracted via a convolutional model. We do this by exploiting a NeRF renderer conditioned upon an occupancy-based representation. The approach uses the hand occupancy to resolve hand-to-hand interactions further improving results, allowing fast rendering, and excellent hand appearance transfer. On the benchmark InterHand2.6M dataset, we achieved state-of-the-art results.
zh

[CV-69] Benchmarking Feature Upsampling Methods for Vision Foundation Models using Interactive Segmentation

【速读】:该论文试图解决视觉基础模型(Vision Foundation Models, VFMs)在密集预测任务中因输出特征分辨率较低而限制其直接应用的问题。解决方案的关键在于采用一种与任务无关的特征上采样模块,以提升VFMs特征的分辨率,从而增强其在密集预测任务中的表现。

链接: https://arxiv.org/abs/2505.02075
作者: Volodymyr Havrylov,Haiwen Huang,Dan Zhang,Andreas Geiger
机构: University of Tübingen (图宾根大学); Bosch Center for Artificial Intelligence (博世人工智能中心); Tübingen AI Center (图宾根人工智能中心)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Vision Foundation Models (VFMs) are large-scale, pre-trained models that serve as general-purpose backbones for various computer vision tasks. As VFMs’ popularity grows, there is an increasing interest in understanding their effectiveness for dense prediction tasks. However, VFMs typically produce low-resolution features, limiting their direct applicability in this context. One way to tackle this limitation is by employing a task-agnostic feature upsampling module that refines VFM features resolution. To assess the effectiveness of this approach, we investigate Interactive Segmentation (IS) as a novel benchmark for evaluating feature upsampling methods on VFMs. Due to its inherent multimodal input, consisting of an image and a set of user-defined clicks, as well as its dense mask output, IS creates a challenging environment that demands comprehensive visual scene understanding. Our benchmarking experiments show that selecting appropriate upsampling strategies significantly improves VFM features quality. The code is released at this https URL
zh

[CV-70] Hierarchical Compact Clustering Attention (COCA) for Unsupervised Object-Centric Learning CVPR2025

【速读】:该论文旨在解决单图像中的无监督对象发现(unsupervised object discovery)问题,同时实现以对象为中心的表征学习。其解决方案的关键在于提出了一种名为Compact Clustering Attention (COCA) 的层,该层通过引入一种基于紧凑性(compactness)概念的新型聚类算法,在场景中突出显著的对象中心点,从而提供空间归纳偏置。COCA层作为注意力机制的聚类模块,能够从多对象场景中提取对象中心表征,并在级联到自底向上的分层网络架构(COCA-Net)中表现出色,实现了高质量的分割掩码生成。

链接: https://arxiv.org/abs/2505.02071
作者: Can Küçüksözen,Yücel Yemez
机构: Koç University (科驰大学); KUIS AI Center (KUIS人工智能中心)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Accepted to CVPR 2025

点击查看摘要

Abstract:We propose the Compact Clustering Attention (COCA) layer, an effective building block that introduces a hierarchical strategy for object-centric representation learning, while solving the unsupervised object discovery task on single images. COCA is an attention-based clustering module capable of extracting object-centric representations from multi-object scenes, when cascaded into a bottom-up hierarchical network architecture, referred to as COCA-Net. At its core, COCA utilizes a novel clustering algorithm that leverages the physical concept of compactness, to highlight distinct object centroids in a scene, providing a spatial inductive bias. Thanks to this strategy, COCA-Net generates high-quality segmentation masks on both the decoder side and, notably, the encoder side of its pipeline. Additionally, COCA-Net is not bound by a predetermined number of object masks that it generates and handles the segmentation of background elements better than its competitors. We demonstrate COCA-Net’s segmentation performance on six widely adopted datasets, achieving superior or competitive results against the state-of-the-art models across nine different evaluation metrics.
zh

[CV-71] RTV-Bench: Benchmarking MLLM Continuous Perception, Understanding and Reasoning through Real-Time Video

【速读】:该论文旨在解决当前基准测试无法充分评估多模态大语言模型(Multimodal Large Language Models, MLLMs)在动态、现实环境中持续执行感知、理解和推理任务的能力问题。其解决方案的关键在于提出RTV-Bench,一个细粒度的实时视频分析基准,该基准遵循三个核心原则:多时间戳问答(Multi-Timestamp Question Answering, MTQA)、分层问题结构以及多维评估,以全面衡量模型在连续视频流中的表现。

链接: https://arxiv.org/abs/2505.02064
作者: Shuhang Xun,Sicheng Tao,Jungang Li,Yibo Shi,Zhixin Lin,Zhanhui Zhu,Yibo Yan,Hanqian Li,Linghao Zhang,Shikang Wang,Yixin Liu,Hanbo Zhang,Xuming Hu,Ying Ma
机构: HIT(哈尔滨工业大学); HKUST(GZ)(香港科技大学(广州)); HKUST(香港科技大学); XJTU(西安交通大学); SDU(山东大学); CityU(城市大学); HUST(华中科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 13 pages, 4 figures, 5 tables

点击查看摘要

Abstract:Multimodal Large Language Models (MLLMs) increasingly excel at perception, understanding, and reasoning. However, current benchmarks inadequately evaluate their ability to perform these tasks continuously in dynamic, real-world environments. To bridge this gap, we introduce RTV-Bench, a fine-grained benchmark for MLLM real-time video analysis. RTV-Bench uses three key principles: (1) Multi-Timestamp Question Answering (MTQA), where answers evolve with scene changes; (2) Hierarchical Question Structure, combining basic and advanced queries; and (3) Multi-dimensional Evaluation, assessing the ability of continuous perception, understanding, and reasoning. RTV-Bench contains 552 diverse videos (167.2 hours) and 4,631 high-quality QA pairs. We evaluated leading MLLMs, including proprietary (GPT-4o, Gemini 2.0), open-source offline (Qwen2.5-VL, VideoLLaMA3), and open-source real-time (VITA-1.5, InternLM-XComposer2.5-OmniLive) models. Experiment results show open-source real-time models largely outperform offline ones but still trail top proprietary models. Our analysis also reveals that larger model size or higher frame sampling rates do not significantly boost RTV-Bench performance, sometimes causing slight decreases. This underscores the need for better model architectures optimized for video stream processing and long sequences to advance real-time video analysis with MLLMs. Our benchmark toolkit is available at: this https URL.
zh

[CV-72] Transforming faces into video stories – VideoFace2.0

【速读】:该论文旨在解决视频中人脸的时空定位与重识别(face re-identification, ReID)问题,以高效生成结构化的视频故事,即基于身份的信息目录。其解决方案的关键在于结合人脸检测、人脸识别和被动跟踪-检测(tracking-by-detection)的概念,提出一种鲁棒且高效的ReID算法,从而实现对输入视频中每个独特人脸的准确识别与分类,并支持后续的下游任务。

链接: https://arxiv.org/abs/2505.02060
作者: Branko Brkljač,Vladimir Kalušev,Branislav Popović,Milan Sečujski
机构: University of Novi Sad (诺维萨德大学); The Institute for Artificial Intelligence Research and Development of Serbia (塞尔维亚人工智能研究与发展研究所)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 4 pages, 2 figures, 1 algorithm; associated VideoFace2.0 code implementation, test videos and results visualizations are available at this https URL ; Preprint submitted to the 14th Mediterranean Conference on Embedded Computing (MECO), 10-14 June 2025, Budva, Montenegro

点击查看摘要

Abstract:Face detection and face recognition have been in the focus of the vision community since the very beginning. Inspired by the success of the original Videoface digitizer, a pioneering device that allowed users to capture video signals from any source, we have designed an advanced video analytics tool to efficiently create structured video stories, i.e. identity-based information catalogs. VideoFace2.0 is the name of the developed system for spatial and temporal localization of each unique face in the input video, i.e. face re-identification (ReID), which also allows their cataloging, characterization and creation of structured video outputs for later downstream tasks. The developed near real-time solution is primarily designed to be utilized in application scenarios involving TV production and media analysis, and as an efficient tool for creating large video datasets necessary for training machine learning (ML) models in challenging vision tasks such as lip reading and multimodal speech recognition. The conducted experiments confirm the applicability of the proposed face ReID algorithm, which combines the concepts of face detection, face recognition and passive tracking-by-detection in order to achieve robust and efficient face ReID. The system is envisioned as a compact and modular extension of existing video production equipment. We hope that the presented work and shared code will stimulate further interest in the development of similar, application-specific video analysis tools, and lower the entry barrier for the production of high-quality multi-modal ML datasets in the future.
zh

[CV-73] Handling Imbalanced Pseudolabels for Vision-Language Models with Concept Alignment and Confusion-Aware Calibrated Margin ICML2025

【速读】:该论文试图解决视觉-语言模型(Vision-Language Models, VLMs)在下游任务中使用伪标签时存在的标签不平衡问题,这一问题导致模型性能下降。解决方案的关键在于提出一种结合概念对齐和混淆感知校准边距机制的新框架,通过增强表现较差的类别并促进各类别间的平衡预测,从而缓解标签不平衡问题。

链接: https://arxiv.org/abs/2505.02056
作者: Yuchen Wang,Xuefeng Bai,Xiucheng Li,Weili Guan,Liqiang Nie,Xinyang Chen
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Accepted to ICML 2025

点击查看摘要

Abstract:Adapting vision-language models (VLMs) to downstream tasks with pseudolabels has gained increasing attention. A major obstacle is that the pseudolabels generated by VLMs tend to be imbalanced, leading to inferior performance. While existing methods have explored various strategies to address this, the underlying causes of imbalance remain insufficiently investigated. To fill this gap, we delve into imbalanced pseudolabels and identify two primary contributing factors: concept mismatch and concept confusion. To mitigate these two issues, we propose a novel framework incorporating concept alignment and confusion-aware calibrated margin mechanisms. The core of our approach lies in enhancing underperforming classes and promoting balanced predictions across categories, thus mitigating imbalance. Extensive experiments on six benchmark datasets with three learning paradigms demonstrate that the proposed method effectively enhances the accuracy and balance of pseudolabels, achieving a relative improvement of 6.29% over the SoTA method. Our code is available at this https URL
zh
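
下面给出一个概念性示意:按伪标签的类别频率给 logits 施加校准边距(写法上类似 logit adjustment),从而抑制高频类、扶持低频类。具体边距形式与数值均为演示用假设,并非论文的官方公式。

```python
# 示意:按伪标签频率对 logits 施加类别相关边距,促使预测更均衡。
# 边距形式 tau*log(freq) 为常见的 logit adjustment 写法,仅作演示。
import torch

def calibrated_logits(logits, class_freq, tau=1.0):
    margin = tau * torch.log(class_freq + 1e-8)   # 频率越高的类,边距越大
    return logits - margin                         # 按类别广播到每个样本

class_freq = torch.tensor([0.55, 0.25, 0.12, 0.05, 0.03])  # 伪标签分布(假设)
logits = torch.randn(4, 5)
before = logits.softmax(-1).argmax(-1)
after = calibrated_logits(logits, class_freq).softmax(-1).argmax(-1)
print(before, after)   # 校准后低频类更容易被选中
```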

[CV-74] TxP: Reciprocal Generation of Ground Pressure Dynamics and Activity Descriptions for Improving Human Activity Recognition

【速读】:该论文旨在解决压力传感器在基于传感器的人类活动识别(HAR)中被低估的问题,即尽管压力传感器能够捕捉细微的身体动态和重心变化,但其在HAR领域中的应用受限于数据集不足。解决方案的关键在于利用生成式基础模型与压力特定的HAR技术相结合,提出了一种双向文本×压力(Text × Pressure, TxP)模型,通过将压力数据解释为自然语言,实现文本到压力序列的转换以及从动态压力图生成活动描述和分类,从而提升HAR的性能。

链接: https://arxiv.org/abs/2505.02052
作者: Lala Shakti Swarup Ray,Lars Krupp,Vitor Fortes Rey,Bo Zhou,Sungho Suh,Paul Lukowicz
机构: DFKI Kaiserslautern(德国人工智能研究中心); RPTU; Korea University(高丽大学)
类目: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Sensor-based human activity recognition (HAR) has predominantly focused on Inertial Measurement Units and vision data, often overlooking the capabilities unique to pressure sensors, which capture subtle body dynamics and shifts in the center of mass. Despite their potential for postural and balance-based activities, pressure sensors remain underutilized in the HAR domain due to limited datasets. To bridge this gap, we propose to exploit generative foundation models with pressure-specific HAR techniques. Specifically, we present a bidirectional Text × Pressure (TxP) model that uses generative foundation models to interpret pressure data as natural language. TxP accomplishes two tasks: (1) Text2Pressure, converting activity text descriptions into pressure sequences, and (2) Pressure2Text, generating activity descriptions and classifications from dynamic pressure maps. Leveraging pre-trained models like CLIP and LLaMA 2 13B Chat, TxP is trained on our synthetic PressLang dataset, containing over 81,100 text-pressure pairs. Validated on real-world data for activities such as yoga and daily tasks, TxP provides novel approaches to data augmentation and classification grounded in atomic actions. This consequently improved HAR performance by up to 12.4% in macro F1 score compared to the state-of-the-art, advancing pressure-based HAR with broader applications and deeper insights into human movement.
zh

[CV-75] Regression is all you need for medical image translation

【速读】:该论文旨在解决在有限时间预算内获取信息丰富的医学图像问题,通过医学图像翻译(Medical Image Translation, MIT)生成合成图像以增强和补充现有数据集。传统方法如生成对抗网络(Generative Adversarial Nets, GANs)和扩散模型(Diffusion Models, DMs)在自然图像生成中表现优异,但在医学应用中由于对解剖结构准确性的高要求,其创造性和图像真实性优势并不一定适用,常因噪声模仿或内容幻觉而影响临床实用性。论文提出的解决方案是YODA(You Only Denoise once - or Average),一个基于2.5D扩散的体积MIT框架,其关键在于将扩散与回归范式结合,生成真实或无噪声的输出,并引入期望近似(Expectation-Approximation, ExpA)采样方法,借鉴磁共振成像(MRI)信号平均技术,抑制生成噪声,从而避免噪声对图像质量评估的偏差。实验表明,扩散与回归采样在实践中效果相近,扩散采样的计算开销并未带来系统性优势,而YODA在多个多模态数据集上优于当前最先进的GAN和DM方法。

链接: https://arxiv.org/abs/2505.02048
作者: Sebastian Rassmann,David Kügler,Christian Ewert,Martin Reuter
机构: German Center for Neurodegenerative Diseases (DZNE)(德国神经退行性疾病中心); A.A. Martinos Center for Biomedical Imaging(马蒂诺斯生物医学影像中心); Harvard Medical School(哈佛医学院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The acquisition of information-rich images within a limited time budget is crucial in medical imaging. Medical image translation (MIT) can help enhance and supplement existing datasets by generating synthetic images from acquired data. While Generative Adversarial Nets (GANs) and Diffusion Models (DMs) have achieved remarkable success in natural image generation, their benefits - creativity and image realism - do not necessarily transfer to medical applications where highly accurate anatomical information is required. In fact, the imitation of acquisition noise or content hallucination hinder clinical utility. Here, we introduce YODA (You Only Denoise once - or Average), a novel 2.5D diffusion-based framework for volumetric MIT. YODA unites diffusion and regression paradigms to produce realistic or noise-free outputs. Furthermore, we propose Expectation-Approximation (ExpA) DM sampling, which draws inspiration from MRI signal averaging. ExpA-sampling suppresses generated noise and, thus, eliminates noise from biasing the evaluation of image quality. Through extensive experiments on four diverse multi-modal datasets - comprising multi-contrast brain MRI and pelvic MRI-CT - we show that diffusion and regression sampling yield similar results in practice. As such, the computational overhead of diffusion sampling does not provide systematic benefits in medical information translation. Building on these insights, we demonstrate that YODA outperforms several state-of-the-art GAN and DM methods. Notably, YODA-generated images are shown to be interchangeable with, or even superior to, physical acquisitions for several downstream tasks. Our findings challenge the presumed advantages of DMs in MIT and pave the way for the practical application of MIT in medical imaging.
zh
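
下面用占位采样器示意 ExpA 采样的思想:对同一条件重复采样并取平均,使随机生成噪声像 MRI 信号平均一样被压低约 1/√N。其中 diffusion_sample 为假设的占位函数,并非 YODA 的官方实现。

```python
# 示意:Expectation-Approximation(ExpA)式采样:重复采样后取均值以抑制噪声。
# diffusion_sample 仅为占位,真实场景应替换为扩散模型的完整采样过程。
import torch

def diffusion_sample(cond, seed):
    g = torch.Generator().manual_seed(seed)
    noise = torch.randn(cond.shape, generator=g)
    return cond + 0.1 * noise            # 占位:条件信号 + 采样随机性

def expa_sample(cond, n_avg=8):
    samples = torch.stack([diffusion_sample(cond, s) for s in range(n_avg)])
    return samples.mean(dim=0)           # 噪声标准差约降为原来的 1/sqrt(n_avg)

cond = torch.zeros(1, 64, 64)
print("single std:", diffusion_sample(cond, 0).std().item())
print("ExpA   std:", expa_sample(cond).std().item())   # 约 0.1/sqrt(8)
```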

[CV-76] A UNet Model for Accelerated Preprocessing of CRISM Hyperspectral Data for Mineral Identification on Mars

【速读】:该论文旨在解决火星表面矿物识别中传统光谱预处理方法计算复杂度高、耗时的问题。其关键解决方案是提出一种基于UNet的自编码器模型,用于高效处理CRISM MTRDR高光谱数据,自动化关键预处理步骤如平滑和连续谱去除,同时保留重要的矿物吸收特征,从而显著提升预处理效率。

链接: https://arxiv.org/abs/2505.02046
作者: Priyanka Kumari,Sampriti Soor,Amba Shetty,Archana M. Nair
机构: Indian Institute of Technology Guwahati(印度理工学院古瓦哈提分校); Center for Intelligent Cyber Physical System(智能网络物理系统中心); National Institute of Technology Karnataka(印度国家技术学院卡纳塔克分校)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Accurate mineral identification on the Martian surface is critical for understanding the planet’s geological history. This paper presents a UNet-based autoencoder model for efficient spectral preprocessing of CRISM MTRDR hyperspectral data, addressing the limitations of traditional methods that are computationally intensive and time-consuming. The proposed model automates key preprocessing steps, such as smoothing and continuum removal, while preserving essential mineral absorption features. Trained on augmented spectra from the MICA spectral library, the model introduces realistic variability to simulate MTRDR data conditions. By integrating this framework, preprocessing time for an 800x800 MTRDR scene is reduced from 1.5 hours to just 5 minutes on an NVIDIA T1600 GPU. The preprocessed spectra are subsequently classified using MICAnet, a deep learning model for Martian mineral identification. Evaluation on labeled CRISM TRDR data demonstrates that the proposed approach achieves competitive accuracy while significantly enhancing preprocessing efficiency. This work highlights the potential of the UNet-based preprocessing framework to improve the speed and reliability of mineral mapping on Mars.
zh

[CV-77] Point2Primitive: CAD Reconstruction from Point Cloud by Direct Primitive Prediction

【速读】:该论文旨在解决从点云中恢复CAD模型的问题,特别是针对草图-拉伸(sketch-extrusion)过程的拓扑重建与拉伸原始结构预测。其解决方案的关键在于提出了一种名为Point2Primitive的CAD重建网络,该网络通过直接预测拉伸原始结构的每个元素,实现可编辑CAD模型的生成。该方法基于改进的Transformer架构,能够直接检测并预测草图曲线的类型和参数,并通过自回归方式优化参数以提高准确性,同时利用拉伸分割重建拓扑结构,结合预测的曲线和计算出的拉伸操作恢复每个拉伸参数。

链接: https://arxiv.org/abs/2505.02043
作者: Cheng Wang,Xinzhu Ma,Bin Wang,Shixiang Tang,Yuan Meng,Ping Jiang
机构: Wuhan University(武汉大学); Beihang University(北京航空航天大学); Huazhong University of Science and Technology(华中科技大学); The Chinese University of Hong Kong(香港中文大学); Tsinghua University(清华大学); Hubei Polytechnic University(湖北理工学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Recovering CAD models from point clouds, especially the sketch-extrusion process, can be seen as the process of rebuilding the topology and extrusion primitives. Previous methods utilize implicit fields for sketch representation, leading to shape reconstruction of curved edges. In this paper, we proposed a CAD reconstruction network that produces editable CAD models from input point clouds (Point2Primitive) by directly predicting every element of the extrusion primitives. Point2Primitive can directly detect and predict sketch curves (type and parameter) from point clouds based on an improved transformer. The sketch curve parameters are formulated as position queries and optimized in an autoregressive way, leading to high parameter accuracy. The topology is rebuilt by extrusion segmentation, and each extrusion parameter (sketch and extrusion operation) is recovered by combining the predicted curves and the computed extrusion operation. Extensive experiments demonstrate that our method is superior in primitive prediction accuracy and CAD reconstruction. The reconstructed shapes are of high geometrical fidelity.
zh

[CV-78] A Birotation Solution for Relative Pose Problems

【速读】:该论文旨在解决相对位姿估计(relative pose estimation)这一基础的计算机视觉问题,传统方法通常通过估计并分解本质矩阵或直接估计旋转和平移来获得解。论文提出了一种新颖的双旋转(birotation)解决方案,其关键在于引入了三个与几何度量相关的基变换,用于量化待估计的相对位姿与其对应基变换之间的距离,并基于这些度量设计了三个能量函数,在李群 SO(3) 上通过迭代更新两个旋转矩阵来最小化这些能量函数,最终利用能量最小对应的两个旋转矩阵和基变换恢复相对位姿。

链接: https://arxiv.org/abs/2505.02025
作者: Hongbo Zhao,Ziwei Long,Mengtan Zhang,Hanli Wang,Qijun Chen,Rui Fan
机构: Shanghai Research Institute for Intelligent Autonomous Systems, Tongji University (上海智能自主系统研究院,同济大学); College of Electronics & Information Engineering, Tongji University (电子与信息工程学院,同济大学); School of Computer Science and Technology, Tongji University (计算机科学与技术学院,同济大学); Key Laboratory of Embedded System and Service Computing (Ministry of Education), Tongji University (嵌入式系统与服务计算重点实验室(教育部),同济大学); College of Electronics & Information Engineering, Shanghai Institute of Intelligent Science and Technology (电子与信息工程学院,上海智能科学研究院); State Key Laboratory of Intelligent Autonomous Systems, Tongji University (智能自主系统国家重点实验室,同济大学); Frontiers Science Center for Intelligent Autonomous Systems, Tongji University (智能自主系统前沿科学中心,同济大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Relative pose estimation, a fundamental computer vision problem, has been extensively studied for decades. Existing methods either estimate and decompose the essential matrix or directly estimate the rotation and translation to obtain the solution. In this article, we break the mold by tackling this traditional problem with a novel birotation solution. We first introduce three basis transformations, each associated with a geometric metric to quantify the distance between the relative pose to be estimated and its corresponding basis transformation. Three energy functions, designed based on these metrics, are then minimized on the Riemannian manifold $\mathrm{SO}(3)$ by iteratively updating the two rotation matrices. The two rotation matrices and the basis transformation corresponding to the minimum energy are ultimately utilized to recover the relative pose. Extensive quantitative and qualitative evaluations across diverse relative pose estimation tasks demonstrate the superior performance of our proposed birotation solution. Source code, demo video, and datasets will be available at this https URL upon publication.
zh
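
下面给出在李群 SO(3) 上迭代更新两个旋转矩阵以最小化能量的概念性示意。能量取 ||R1·A·R2ᵀ − B||² 仅作演示,梯度用切空间有限差分近似,并非论文定义的三个能量函数或官方实现。

```python
# 示意:在流形 SO(3) 上用"切空间有限差分梯度 + 指数映射回缩"迭代更新
# 两个旋转矩阵以最小化能量;能量形式与步长均为演示用假设。
import numpy as np
from scipy.spatial.transform import Rotation as Rot

def exp_so3(w):                          # 指数映射:旋转向量 -> 旋转矩阵
    return Rot.from_rotvec(w).as_matrix()

A = np.eye(3)
B = Rot.random(random_state=1).as_matrix()    # 待逼近的目标变换(演示)
R1, R2 = np.eye(3), np.eye(3)

def energy(R1, R2):
    return np.sum((R1 @ A @ R2.T - B) ** 2)

eta, eps = 0.05, 1e-5
for it in range(500):
    g1, g2 = np.zeros(3), np.zeros(3)
    for k in range(3):                   # 三个切空间基方向的方向导数
        dw = np.zeros(3); dw[k] = eps
        g1[k] = (energy(exp_so3(dw) @ R1, R2) - energy(R1, R2)) / eps
        g2[k] = (energy(R1, exp_so3(dw) @ R2) - energy(R1, R2)) / eps
    R1 = exp_so3(-eta * g1) @ R1         # 沿负梯度更新并保持在 SO(3) 上
    R2 = exp_so3(-eta * g2) @ R2

print("final energy:", energy(R1, R2))   # 能量随迭代单调下降
```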

[CV-79] R-Bench: Graduate-level Multi-disciplinary Benchmarks for LLM & MLLM Complex Reasoning Evaluation

【速读】:该论文旨在解决现有推理基准在评估复杂、现实世界问题求解能力时的不足,特别是在多学科和多模态情境下的推理能力评估。其关键解决方案是提出一个名为Reasoning Bench (R-Bench) 的研究生水平、多学科、中英文推理基准,涵盖1,094道语言模型测试题和665道多模态模型测试题,确保题目难度校准、学科平衡及跨语言一致性,从而构建一个具有奥林匹克竞赛级别的多学科推理评估体系。

链接: https://arxiv.org/abs/2505.02018
作者: Meng-Hao Guo,Jiajun Xu,Yi Zhang,Jiaxi Song,Haoyang Peng,Yi-Xuan Deng,Xinzhi Dong,Kiyohiro Nakayama,Zhengyang Geng,Chen Wang,Bolin Ni,Guo-Wei Yang,Yongming Rao,Houwen Peng,Han Hu,Gordon Wetzstein,Shi-min Hu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 18 pages

点击查看摘要

Abstract:Reasoning stands as a cornerstone of intelligence, enabling the synthesis of existing knowledge to solve complex problems. Despite remarkable progress, existing reasoning benchmarks often fail to rigorously evaluate the nuanced reasoning capabilities required for complex, real-world problem-solving, particularly in multi-disciplinary and multimodal contexts. In this paper, we introduce a graduate-level, multi-disciplinary, English-Chinese benchmark, dubbed Reasoning Bench (R-Bench), for assessing the reasoning capability of both language and multimodal models. R-Bench spans 1,094 questions across 108 subjects for language model evaluation and 665 questions across 83 subjects for multimodal model testing in both English and Chinese. These questions are meticulously curated to ensure rigorous difficulty calibration, subject balance, and cross-linguistic alignment, enabling the assessment to be an Olympiad-level multi-disciplinary benchmark. We evaluate widely used models, including OpenAI o1, GPT-4o, DeepSeek-R1, etc. Experimental results indicate that advanced models perform poorly on complex reasoning, especially multimodal reasoning. Even the top-performing model OpenAI o1 achieves only 53.2% accuracy on our multimodal evaluation. Data and code are made publicly available here.
zh

[CV-80] MLLM-Enhanced Face Forgery Detection: A Vision-Language Fusion Solution

【速读】:该论文旨在解决深度伪造图像检测中多模态信息(视觉与文本)整合不足的问题,传统方法通常依赖于大型语言模型或外部检测器单独进行分类,导致视觉与文本模态的融合效果不佳。其解决方案的关键在于提出VLF-FFD,通过构建一个视觉-语言融合网络(VLF-Net),实现视觉特征与文本特征的双向交互,并采用三阶段训练流程以充分发挥其潜力,同时引入EFF++数据集,为每帧篡改图像提供包含伪造痕迹和具体操作技术的文本注释,从而提升多模态大语言模型(MLLM)的训练效果。

链接: https://arxiv.org/abs/2505.02013
作者: Siran Peng,Zipei Wang,Li Gao,Xiangyu Zhu,Tianshuo Zhang,Ajian Liu,Haoyuan Zhang,Zhen Lei
机构: CASIA(中国科学院自动化研究所); UCAS(中国科学院大学); CMCC(中国移动通信集团公司); CAIR, HKISI, CAS(人工智能研究院,香港智能系统研究所,中国科学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Reliable face forgery detection algorithms are crucial for countering the growing threat of deepfake-driven disinformation. Previous research has demonstrated the potential of Multimodal Large Language Models (MLLMs) in identifying manipulated faces. However, existing methods typically depend on either the Large Language Model (LLM) alone or an external detector to generate classification results, which often leads to sub-optimal integration of visual and textual modalities. In this paper, we propose VLF-FFD, a novel Vision-Language Fusion solution for MLLM-enhanced Face Forgery Detection. Our key contributions are twofold. First, we present EFF++, a frame-level, explainability-driven extension of the widely used FaceForensics++ (FF++) dataset. In EFF++, each manipulated video frame is paired with a textual annotation that describes both the forgery artifacts and the specific manipulation technique applied, enabling more effective and informative MLLM training. Second, we design a Vision-Language Fusion Network (VLF-Net) that promotes bidirectional interaction between visual and textual features, supported by a three-stage training pipeline to fully leverage its potential. VLF-FFD achieves state-of-the-art (SOTA) performance in both cross-dataset and intra-dataset evaluations, underscoring its exceptional effectiveness in face forgery detection.
zh

[CV-81] Efficient Noise Calculation in Deep Learning-based MRI Reconstructions ICML2025

【速读】:该论文旨在解决加速磁共振成像(accelerated MRI)重建中噪声传播分析的问题,这一问题在深度学习(DL)方法中常被忽视,尽管其对重建质量评估和算法设计具有重要意义。解决方案的关键在于提出一种理论基础扎实且内存高效的 voxel-wise 方差计算方法,通过近似 DL 网络的雅可比矩阵(Jacobian)来估计噪声协方差,并引入雅可比矩阵压缩技术以实现高效计算,从而实现了对重建图像中因采集噪声引起的不确定性的量化分析。

链接: https://arxiv.org/abs/2505.02007
作者: Onat Dalmaz,Arjun D. Desai,Reinhard Heckel,Tolga Çukur,Akshay S. Chaudhari,Brian A. Hargreaves
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted ICML 2025. Supplementary material included

点击查看摘要

Abstract:Accelerated MRI reconstruction involves solving an ill-posed inverse problem where noise in acquired data propagates to the reconstructed images. Noise analyses are central to MRI reconstruction for providing an explicit measure of solution fidelity and for guiding the design and deployment of novel reconstruction methods. However, deep learning (DL)-based reconstruction methods have often overlooked noise propagation due to inherent analytical and computational challenges, despite its critical importance. This work proposes a theoretically grounded, memory-efficient technique to calculate voxel-wise variance for quantifying uncertainty due to acquisition noise in accelerated MRI reconstructions. Our approach approximates noise covariance using the DL network’s Jacobian, which is intractable to calculate. To circumvent this, we derive an unbiased estimator for the diagonal of this covariance matrix (voxel-wise variance) and introduce a Jacobian sketching technique to efficiently implement it. We evaluate our method on knee and brain MRI datasets for both data- and physics-driven networks trained in supervised and unsupervised manners. Compared to empirical references obtained via Monte Carlo simulations, our technique achieves near-equivalent performance while reducing computational and memory demands by an order of magnitude or more. Furthermore, our method is robust across varying input noise levels, acceleration factors, and diverse undersampling schemes, highlighting its broad applicability. Our work reintroduces accurate and efficient noise analysis as a central tenet of reconstruction algorithms, holding promise to reshape how we evaluate and deploy DL-based MRI. Our code will be made publicly available upon acceptance.
zh
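
下面示意该类估计器的核心思路:当噪声协方差为 Σ = σ²I 时,取随机探针 ε ~ N(0, σ²I),则 E[(Jε)²] 逐元素等于 diag(JΣJᵀ),而 Jε 可用 JVP 高效计算、无需显式构造雅可比矩阵。网络与数据均为演示占位,并非论文官方实现。

```python
# 示意:随机探针 + JVP 估计重建网络输出的体素级方差 diag(J Σ Jᵀ)。
# 假设采集噪声为 i.i.d. 高斯(Σ = σ²I);net 与 x0 均为演示占位。
import torch

net = torch.nn.Sequential(               # 占位的"重建网络"
    torch.nn.Linear(32, 64), torch.nn.Tanh(), torch.nn.Linear(64, 32))
x0 = torch.randn(32)                      # 采集到的数据(演示)
sigma = 0.05                              # 假设的噪声标准差

def voxelwise_variance(f, x, sigma, n_probes=64):
    acc = torch.zeros_like(f(x))
    for _ in range(n_probes):
        eps = sigma * torch.randn_like(x)              # ε ~ N(0, σ²I)
        _, jvp_out = torch.func.jvp(f, (x,), (eps,))   # 计算 Jε
        acc += jvp_out ** 2                            # E[(Jε)²] = diag(JΣJᵀ)
    return acc / n_probes

var = voxelwise_variance(net, x0, sigma)
print(var.shape, var.mean().item())
```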

[CV-82] Learning Heterogeneous Mixture of Scene Experts for Large-scale Neural Radiance Fields

【速读】:该论文旨在解决大规模场景下神经辐射场(Neural Radiance Fields, NeRF)的可扩展性问题,具体包括可学习的场景分解、场景异质性建模以及建模效率问题。其解决方案的关键在于提出了一种基于异质哈希专家混合(Heterogeneous Mixture of Hash Experts, HMoHE)的网络结构——Switch-NeRF++,该框架通过一个门控网络实现场景的自适应分解,并将3D点分配给专门的NeRF专家进行处理,同时采用稀疏门控专家混合(Sparsely Gated Mixture of Experts, MoE)机制对门控网络和专家进行联合优化。此外,引入基于哈希的门控网络和具有不同分辨率范围的异质哈希专家,以高效学习大规模场景的异质表示,从而在端到端的框架下实现高质量与高效率的场景建模。

链接: https://arxiv.org/abs/2505.02005
作者: Zhenxing Mi,Ping Yin,Xue Xiao,Dan Xu
机构: The Hong Kong University of Science and Technology, Hong Kong SAR (香港科技大学,香港特别行政区); Inspur Cloud Information Technology Co, Ltd. (浪潮云信息技术有限公司)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 15 pages, 9 figures

点击查看摘要

Abstract:Recent NeRF methods on large-scale scenes have underlined the importance of scene decomposition for scalable NeRFs. Although achieving reasonable scalability, there are several critical problems remaining unexplored, i.e., learnable decomposition, modeling scene heterogeneity, and modeling efficiency. In this paper, we introduce Switch-NeRF++, a Heterogeneous Mixture of Hash Experts (HMoHE) network that addresses these challenges within a unified framework. It is a highly scalable NeRF that learns heterogeneous decomposition and heterogeneous NeRFs efficiently for large-scale scenes in an end-to-end manner. In our framework, a gating network learns to decompose scenes and allocates 3D points to specialized NeRF experts. This gating network is co-optimized with the experts by our proposed Sparsely Gated Mixture of Experts (MoE) NeRF framework. We incorporate a hash-based gating network and distinct heterogeneous hash experts. The hash-based gating efficiently learns the decomposition of the large-scale scene. The distinct heterogeneous hash experts consist of hash grids of different resolution ranges, enabling effective learning of the heterogeneous representation of different scene parts. These design choices make our framework an end-to-end and highly scalable NeRF solution for real-world large-scale scene modeling to achieve both quality and efficiency. We evaluate our accuracy and scalability on existing large-scale NeRF datasets and a new dataset with very large-scale scenes (6.5 km^2) from UrbanBIS. Extensive experiments demonstrate that our approach can be easily scaled to various large-scale scenes and achieve state-of-the-art scene rendering accuracy. Furthermore, our method exhibits significant efficiency, with an 8x acceleration in training and a 16x acceleration in rendering compared to Switch-NeRF. Code will be released at this https URL.
zh

[CV-83] Always Skip Attention

【速读】:该论文试图解决Vision Transformers (ViTs)中自注意力机制在训练过程中出现的严重问题,即自注意力在没有跳跃连接(skip connection)的情况下会发生灾难性失败(catastrophic failure)。解决方案的关键在于:理论分析表明自注意力机制本质上是病态的(ill-conditioned),因此必须依赖跳跃连接进行正则化。此外,作者提出了一种名为Token Graying的简单而有效的补充方法,进一步改善输入标记(token)的条件数。

链接: https://arxiv.org/abs/2505.01996
作者: Yiping Ji,Hemanth Saratchandran,Peyman Moghaddam,Simon Lucey
机构: Australian Institute of Machine Learning, Adelaide University (澳大利亚机器学习研究所,阿德莱德大学); Data61, CSIRO (Data61,澳大利亚科学工业研究组织)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:We highlight a curious empirical result within modern Vision Transformers (ViTs). Specifically, self-attention catastrophically fails to train unless it is used in conjunction with a skip connection. This is in contrast to other elements of a ViT that continue to exhibit good performance (albeit suboptimal) when skip connections are removed. Further, we show that this critical dependence on skip connections is a relatively new phenomenon, with previous deep architectures (e.g., CNNs) exhibiting good performance in their absence. In this paper, we theoretically characterize that the self-attention mechanism is fundamentally ill-conditioned and is, therefore, uniquely dependent on skip connections for regularization. Additionally, we propose Token Graying – a simple yet effective complement (to skip connections) that further improves the condition of input tokens. We validate our approach in both supervised and self-supervised training methods.
zh
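
下面用一个最小数值实验示意论文的核心观察:反复堆叠无跳跃连接的自注意力会使 token 矩阵的条件数迅速恶化(趋向秩坍缩),而加上残差后条件数保持可控。注意力为无投影矩阵的单头简化版,仅作演示,并非论文的实验设置。

```python
# 示意:比较"纯自注意力堆叠"与"自注意力 + 跳跃连接"下 token 矩阵的条件数。
import torch

def attn(x):                              # 单头、无投影的简化自注意力
    d = x.shape[-1]
    w = torch.softmax(x @ x.transpose(-2, -1) / d ** 0.5, dim=-1)
    return w @ x

def cond(m):                              # 条件数 = 最大奇异值 / 最小奇异值
    s = torch.linalg.svdvals(m)
    return (s[0] / s[-1]).item()

x = torch.randn(32, 32)                   # 32 个 token,每个 32 维
a, b = x.clone(), x.clone()
for depth in range(1, 13):
    a = attn(a)                           # 无跳跃连接:条件数迅速恶化
    b = b + attn(b)                       # 有跳跃连接:条件数保持可控
    if depth % 4 == 0:
        print(f"depth {depth}: no-skip cond={cond(a):.2e}, skip cond={cond(b):.2e}")
```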

[CV-84] Drug classification based on X-ray spectroscopy combined with machine learning

【速读】:该论文试图解决传统药物检测方法在仪器要求高、操作复杂以及检测效率低等问题,旨在开发一种更快、更准确的药物检测与识别方法。解决方案的关键在于结合X射线吸收光谱(X-ray absorption spectroscopy)与卷积神经网络(CNN)、支持向量机(SVM)及粒子群优化(PSO)算法,通过CNN提取光谱特征,利用PSO优化SVM的初始参数,从而构建一个高效且高精度的分类模型。实验结果表明,该方法在分类准确率和执行速度上均优于其他常见方法,具有良好的应用前景。

链接: https://arxiv.org/abs/2505.01986
作者: Yongming Li,Peng Wang,Bangdong Han
机构: Xiamen University(厦门大学); Fudan University(复旦大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:The proliferation of new types of drugs necessitates the urgent development of faster and more accurate detection methods. Traditional detection methods have high requirements for instruments and environments, making the operation complex. X-ray absorption spectroscopy, a non-destructive detection technique, offers advantages such as ease of operation, penetrative observation, and strong substance differentiation capabilities, making it well-suited for application in the field of drug detection and identification. In this study, we constructed a classification model using Convolutional Neural Networks (CNN), Support Vector Machines (SVM), and Particle Swarm Optimization (PSO) to classify and identify drugs based on their X-ray spectral profiles. In the experiments, we selected 14 chemical reagents with chemical formulas similar to drugs as samples. We utilized CNN to extract features from the spectral data of these 14 chemical reagents and used the extracted features to train an SVM model. We also utilized PSO to optimize two critical initial parameters of the SVM. The experimental results demonstrate that this model achieved higher classification accuracy compared to two other common methods, with a prediction accuracy of 99.14%. Additionally, the model exhibited fast execution speed, mitigating the drawback of a drastic increase in running time and efficiency reduction that may result from the direct fusion of PSO and SVM. Therefore, the combined approach of X-ray absorption spectroscopy with CNN, PSO, and SVM provides a rapid, highly accurate, and reliable classification and identification method for the field of drug detection, holding promising prospects for widespread application.
zh
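
下面示意 PSO 搜索 SVM 初始参数 (C, γ) 的组合流程,其中 CNN 提取的光谱特征以随机向量代替,粒子数与迭代数等均为演示用假设,并非论文官方实现。

```python
# 示意:简化 PSO 在对数空间搜索 SVM 的 (C, gamma);数据为合成光谱特征。
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(140, 64))        # 假设:14 类 x 每类 10 个样本,64 维特征
y = np.repeat(np.arange(14), 10)

def fitness(params):                  # params = [log10(C), log10(gamma)]
    C, gamma = 10.0 ** params
    return cross_val_score(SVC(C=C, gamma=gamma), X, y, cv=3).mean()

n_particles = 12
pos = rng.uniform(-3, 3, size=(n_particles, 2))
vel = np.zeros_like(pos)
pbest = pos.copy()
pbest_fit = np.array([fitness(p) for p in pos])
gbest = pbest[pbest_fit.argmax()].copy()

for _ in range(15):                   # 迭代次数为演示而设
    r1, r2 = rng.random((2, n_particles, 2))
    vel = 0.7 * vel + 1.5 * r1 * (pbest - pos) + 1.5 * r2 * (gbest - pos)
    pos = np.clip(pos + vel, -3, 3)
    fit = np.array([fitness(p) for p in pos])
    better = fit > pbest_fit
    pbest[better], pbest_fit[better] = pos[better], fit[better]
    gbest = pbest[pbest_fit.argmax()].copy()

print("best log10(C), log10(gamma):", gbest, "cv acc:", pbest_fit.max())
```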

[CV-85] Lifelong Whole Slide Image Analysis: Online Vision-Language Adaptation and Past-to-Present Gradient Distillation

【速读】:该论文旨在解决全切片图像(Whole Slide Images, WSIs)在计算任务中因尺寸庞大而带来的存储、处理和模型训练挑战,特别是在多机构分布式数据环境下实现统一在线模型的构建问题。其解决方案的关键在于提出ADaFGrad方法,该方法通过病理学视觉-语言基础模型实现滑片区域组织特征与预定义文本原型缓冲区的交互,并引入梯度蒸馏机制,在持续学习框架下模拟分类头参数对logit的梯度,从而有效减少任务遗忘并提升模型性能。

链接: https://arxiv.org/abs/2505.01984
作者: Doanh C. Bui,Hoai Luan Pham,Vu Trung Duong Le,Tuan Hai Vu,Van Duy Tran,Khang Nguyen,Yasuhiko Nakashima
机构: Nara Institute of Science and Technology (奈良先端科学技術大学院大学); University of Information Technology (信息科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Whole Slide Images (WSIs) play a crucial role in accurate cancer diagnosis and prognosis, as they provide tissue details at the cellular level. However, the rapid growth of computational tasks involving WSIs poses significant challenges. Given that WSIs are gigapixels in size, they present difficulties in terms of storage, processing, and model training. Therefore, it is essential to develop lifelong learning approaches for WSI analysis. In scenarios where slides are distributed across multiple institutes, we aim to leverage them to develop a unified online model as a computational tool for cancer diagnosis in clinical and hospital settings. In this study, we introduce ADaFGrad, a method designed to enhance lifelong learning for whole-slide image (WSI) analysis. First, we leverage pathology vision-language foundation models to develop a framework that enables interaction between a slide’s regional tissue features and a predefined text-based prototype buffer. Additionally, we propose a gradient-distillation mechanism that mimics the gradient of a logit with respect to the classification-head parameters across past and current iterations in a continual-learning setting. We construct a sequence of six TCGA datasets for training and evaluation. Experimental results show that ADaFGrad outperforms both state-of-the-art WSI-specific and conventional continual-learning methods after only a few training epochs, exceeding them by up to +5.068% in the class-incremental learning scenario while exhibiting the least forgetting (i.e., retaining the most knowledge from previous tasks). Moreover, ADaFGrad surpasses its baseline by as much as +40.084% in accuracy, further demonstrating the effectiveness of the proposed modules.
zh
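
下面给出梯度蒸馏思想的简化示意:让当前分类头在同一批特征上的参数梯度逼近过去阶段分类头的梯度,以抑制遗忘。这里用交叉熵损失对分类头权重的梯度近似论文中对 logit 梯度的定义,仅作概念演示,并非 ADaFGrad 的官方实现。

```python
# 示意:梯度蒸馏,约束当前分类头的参数梯度与过去阶段保持一致。
import torch
import torch.nn.functional as F

feat = torch.randn(8, 64)                 # 区域组织特征(演示占位)
labels = torch.randint(0, 6, (8,))
head_old = torch.nn.Linear(64, 6)         # 过去阶段的分类头(冻结)
head_new = torch.nn.Linear(64, 6)         # 当前阶段的分类头(可训练)

def head_grad(head, create_graph):
    loss = F.cross_entropy(head(feat), labels)
    return torch.autograd.grad(loss, head.weight, create_graph=create_graph)[0]

g_old = head_grad(head_old, create_graph=False).detach()
g_new = head_grad(head_new, create_graph=True)    # 保留计算图以便反传

task_loss = F.cross_entropy(head_new(feat), labels)
distill_loss = F.mse_loss(g_new, g_old)
(task_loss + 0.1 * distill_loss).backward()       # 0.1 为演示用的蒸馏权重
print("distill loss:", distill_loss.item())
```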

[CV-86] Visual Dominance and Emerging Multimodal Approaches in Distracted Driving Detection: A Review of Machine Learning Techniques

【速读】:该论文试图解决因分心驾驶导致的道路交通伤害和死亡问题,特别是在现有基于视觉数据的机器学习(ML)和深度学习(DL)方法中忽视了驾驶员行为的多模态特性。其解决方案的关键在于采用多模态架构,通过整合视觉、生理和车辆动态等多样化数据流,提升系统的鲁棒性、上下文感知能力和可扩展性,从而超越单一视觉模型的局限性,并推动更可靠、隐私友好的先进驾驶辅助系统(ADAS)发展。

链接: https://arxiv.org/abs/2505.01973
作者: Anthony Dontoh,Stephanie Ivey,Logan Sirbaugh,Andrews Danyo,Armstrong Aboah
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Distracted driving continues to be a significant cause of road traffic injuries and fatalities worldwide, even with advancements in driver monitoring technologies. Recent developments in machine learning (ML) and deep learning (DL) have primarily focused on visual data to detect distraction, often neglecting the complex, multimodal nature of driver behavior. This systematic review assesses 74 peer-reviewed studies from 2019 to 2024 that utilize ML/DL techniques for distracted driving detection across visual, sensor-based, multimodal, and emerging modalities. The review highlights a significant prevalence of visual-only models, particularly convolutional neural networks (CNNs) and temporal architectures, which achieve high accuracy but show limited generalizability in real-world scenarios. Sensor-based and physiological models provide complementary strengths by capturing internal states and vehicle dynamics, while emerging techniques, such as auditory sensing and radio frequency (RF) methods, offer privacy-aware alternatives. Multimodal architectures consistently surpass unimodal baselines, demonstrating enhanced robustness, context awareness, and scalability by integrating diverse data streams. These findings emphasize the need to move beyond visual-only approaches and adopt multimodal systems that combine visual, physiological, and vehicular cues while keeping computational requirements in check. Future research should focus on developing lightweight, deployable multimodal frameworks, incorporating personalized baselines, and establishing cross-modality benchmarks to ensure real-world reliability in advanced driver assistance systems (ADAS) and road safety interventions.
zh

[CV-87] MC3D-AD: A Unified Geometry-aware Reconstruction Model for Multi-category 3D Anomaly Detection IJCAI2025

【速读】:该论文旨在解决多类别三维异常检测(Multi-Category 3D Anomaly Detection, MC3D-AD)中现有方法需要为每个类别独立训练任务特定模型所导致的高成本、低效率和泛化能力弱的问题。其解决方案的关键在于提出一种统一模型,通过融合局部和全局几何感知信息来重建所有类别的正常表示。具体而言,该模型包括一个自适应几何感知掩码注意力模块、一个由改进的掩码注意力增强的局部几何感知编码器以及一个利用点云位置嵌入提升解码与重建能力的全局查询解码器,从而生成具有几何感知的局部和全局重建特征令牌。

链接: https://arxiv.org/abs/2505.01969
作者: Jiayi Cheng,Can Gao,Jie Zhou,Jiajun Wen,Tao Dai,Jinbao Wang
机构: Shenzhen University(深圳大学); National Engineering Laboratory for Big Data System Computing Technology(国家大数据系统计算技术工程实验室); Guangdong Provincial Key Laboratory of Intelligent Information Processing(广东省智能信息处理重点实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 7 pages of main text, 3 pages of appendix, accepted to IJCAI 2025

点击查看摘要

Abstract:3D Anomaly Detection (AD) is a promising means of controlling the quality of manufactured products. However, existing methods typically require carefully training a task-specific model for each category independently, leading to high cost, low efficiency, and weak generalization. Therefore, this paper presents a novel unified model for Multi-Category 3D Anomaly Detection (MC3D-AD) that aims to utilize both local and global geometry-aware information to reconstruct normal representations of all categories. First, to learn robust and generalized features of different categories, we propose an adaptive geometry-aware masked attention module that extracts geometry variation information to guide mask attention. Then, we introduce a local geometry-aware encoder reinforced by the improved mask attention to encode group-level feature tokens. Finally, we design a global query decoder that utilizes point cloud position embeddings to improve the decoding process and reconstruction ability. This leads to local and global geometry-aware reconstructed feature tokens for the AD task. MC3D-AD is evaluated on two publicly available Real3D-AD and Anomaly-ShapeNet datasets, and exhibits significant superiority over current state-of-the-art single-category methods, achieving 3.1% and 9.3% improvement in object-level AUROC over Real3D-AD and Anomaly-ShapeNet, respectively. The source code will be released upon acceptance.
zh

[CV-88] Segment Any RGB-Thermal Model with Language-aided Distillation

【速读】:该论文旨在解决RGB-T(RGB-thermal)语义分割任务中,现有模型如Segment Anything Model (SAM) 由于仅在RGB数据上训练而无法直接适用的问题。其关键解决方案是提出一种名为SARTM的框架,通过在SAM基础上引入语义理解模块、语言信息引导以及跨模态知识蒸馏(Cross-Modal Knowledge Distillation, CMKD)模块,以实现模态适应和语义一致性,从而提升RGB-T数据下的分割性能。

链接: https://arxiv.org/abs/2505.01950
作者: Dong Xing,Xianxun Zhu,Wei Zhou,Qika Lin,Hang Yang,Yuqing Wang
机构: Changchun Institute of Optics, Fine Mechanics and Physics, Chinese Academy of Sciences University (长春光学精密机械与物理研究所,中国科学院大学); Macquarie University (麦考瑞大学); Cardiff University (卡迪夫大学); Saw Swee Hock School of Public Health, National University of Singapore (新加坡国立大学苏瑞福公共卫生学院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: arXiv admin note: text overlap with arXiv:2412.04220 by other authors

点击查看摘要

Abstract:The recent Segment Anything Model (SAM) demonstrates strong instance segmentation performance across various downstream tasks. However, SAM is trained solely on RGB data, limiting its direct applicability to RGB-thermal (RGB-T) semantic segmentation. Given that RGB-T provides a robust solution for scene understanding in adverse weather and lighting conditions, such as low light and overexposure, we propose a novel framework, SARTM, which customizes the powerful SAM for RGB-T semantic segmentation. Our key idea is to unleash the potential of SAM while introducing semantic understanding modules for RGB-T data pairs. Specifically, our framework first involves fine-tuning the original SAM by adding extra LoRA layers, aiming at preserving SAM's strong generalization and segmentation capabilities for downstream tasks. Secondly, we introduce language information as guidance for training our SARTM. To address cross-modal inconsistencies, we introduce a Cross-Modal Knowledge Distillation (CMKD) module that effectively achieves modality adaptation while maintaining its generalization capabilities. This semantic module enables the minimization of modality gaps and alleviates semantic ambiguity, facilitating the combination of any modality under any visual conditions. Furthermore, we enhance the segmentation performance by adjusting the segmentation head of SAM and incorporating an auxiliary semantic segmentation head, which integrates multi-scale features for effective fusion. Extensive experiments are conducted across three multi-modal RGB-T semantic segmentation benchmarks: MFNET, PST900, and FMB. Both quantitative and qualitative results consistently demonstrate that the proposed SARTM significantly outperforms state-of-the-art approaches across a variety of conditions.
zh

[CV-89] HybridGS: High-Efficiency Gaussian Splatting Data Compression using Dual-Channel Sparse Representation and Point Cloud Encoder ICML2025

【速读】:该论文旨在解决现有3D Gaussian Splatting (3DGS) 压缩方案编码时间长、数据格式高度定制化的问题,从而限制了其广泛应用。其解决方案的关键在于提出一种名为HybridGS的新框架,该框架结合了紧凑生成与标准化点云数据编码的优势。HybridGS首先生成紧凑且显式的3DGS数据,并引入双通道稀疏表示来监督原始位置和特征位深度,随后利用标准点云编码器进行进一步压缩并生成标准输出比特流,从而实现高效且可部署的压缩方案。

链接: https://arxiv.org/abs/2505.01938
作者: Qi Yang,Le Yang,Geert Van Der Auwera,Zhu Li
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by ICML2025

点击查看摘要

Abstract:Most existing 3D Gaussian Splatting (3DGS) compression schemes focus on producing compact 3DGS representation via implicit data embedding. They have long coding times and highly customized data format, making it difficult for widespread deployment. This paper presents a new 3DGS compression framework called HybridGS, which takes advantage of both compact generation and standardized point cloud data encoding. HybridGS first generates compact and explicit 3DGS data. A dual-channel sparse representation is introduced to supervise the primitive position and feature bit depth. It then utilizes a canonical point cloud encoder to perform further data compression and form standard output bitstreams. A simple and effective rate control scheme is proposed to pivot the interpretable data compression scheme. At the current stage, HybridGS does not include any modules aimed at improving 3DGS quality during generation. But experiment results show that it still provides comparable reconstruction performance against state-of-the-art methods, with evidently higher encoding and decoding speed. The code is publicly available at this https URL.
zh

[CV-90] GauS-SLAM: Dense RGB-D SLAM with Gaussian Surfels

【速读】:该论文旨在解决基于高斯的RGB-D同步定位与建图(SLAM)系统中,由于新视角下几何失真导致的跟踪精度下降和多视图一致性不足的问题。其解决方案的关键在于提出了一种基于2D高斯的增量重建策略,并结合表面感知的深度渲染机制,以提升几何精度和多视图一致性。此外,通过动态隔离可见表面的局部地图设计,在保持计算效率的同时缓解了全局地图中遮挡区域引起的误对齐问题。

链接: https://arxiv.org/abs/2505.01934
作者: Yongxin Su,Lin Chen,Kaiting Zhang,Zhongliang Zhao,Chenfeng Hou,Ziping Yu
机构: Beihang University (北京航空航天大学); Northwestern Polytechnical University (西北工业大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:We propose GauS-SLAM, a dense RGB-D SLAM system that leverages 2D Gaussian surfels to achieve robust tracking and high-fidelity mapping. Our investigations reveal that Gaussian-based scene representations exhibit geometry distortion under novel viewpoints, which significantly degrades the accuracy of Gaussian-based tracking methods. These geometry inconsistencies arise primarily from the depth modeling of Gaussian primitives and the mutual interference between surfaces during the depth blending. To address these, we propose a 2D Gaussian-based incremental reconstruction strategy coupled with a Surface-aware Depth Rendering mechanism, which significantly enhances geometry accuracy and multi-view consistency. Additionally, the proposed local map design dynamically isolates visible surfaces during tracking, mitigating misalignment caused by occluded regions in global maps while maintaining computational efficiency with increasing Gaussian density. Extensive experiments across multiple datasets demonstrate that GauS-SLAM outperforms comparable methods, delivering superior tracking precision and rendering fidelity. The project page will be made available at this https URL.
zh

[CV-91] OT-Talk: Animating 3D Talking Head with Optimal Transportation

【速读】:该论文旨在解决音频输入与3D头部网格动画之间的模态差距问题,以实现更准确的唇形同步和自然的面部动作。其解决方案的关键在于引入最优传输(Optimal Transportation)理论,通过将网格表示为概率测度并利用切片Wasserstein距离来建模网格变化,从而更有效地衡量网格差异。此外,该方法结合了预训练的Hubert模型提取音频特征、Transformer处理时序序列以及Chebyshev图卷积提取几何特征,以学习平滑且精确的面部运动。

链接: https://arxiv.org/abs/2505.01932
作者: Xinmu Wang,Xiang Gao,Xiyun Song,Heather Yu,Zongfang Lin,Liang Peng,Xianfeng Gu
机构: Stony Brook University (石溪大学); Futurewei Technologies
类目: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Animating 3D head meshes using audio inputs has significant applications in AR/VR, gaming, and entertainment through 3D avatars. However, bridging the modality gap between speech signals and facial dynamics remains a challenge, often resulting in incorrect lip syncing and unnatural facial movements. To address this, we propose OT-Talk, the first approach to leverage optimal transportation to optimize the learning model in talking head animation. Building on existing learning frameworks, we utilize a pre-trained Hubert model to extract audio features and a transformer model to process temporal sequences. Unlike previous methods that focus solely on vertex coordinates or displacements, we introduce Chebyshev Graph Convolution to extract geometric features from triangulated meshes. To measure mesh dissimilarities, we go beyond traditional mesh reconstruction errors and velocity differences between adjacent frames. Instead, we represent meshes as probability measures and approximate their surfaces. This allows us to leverage the sliced Wasserstein distance for modeling mesh variations. This approach facilitates the learning of smooth and accurate facial motions, resulting in coherent and natural facial animations. Our experiments on two public audio-mesh datasets demonstrate that our method outperforms state-of-the-art techniques both quantitatively and qualitatively in terms of mesh reconstruction accuracy and temporal alignment. In addition, we conducted a user perception study with 20 volunteers to further assess the effectiveness of our approach.
zh

[CV-92] GenSync: A Generalized Talking Head Framework for Audio-driven Multi-Subject Lip-Sync using 3D Gaussian Splatting

【速读】:该论文试图解决多身份唇同步视频合成中模型训练成本高、计算效率低的问题,即现有方法通常需要为每个身份训练新的模型,导致资源消耗大。解决方案的关键在于提出GenSync框架,通过引入解耦模块(Disentanglement Module),将身份特异性特征与音频表示分离,从而实现一个统一网络对多个说话人进行高效唇同步视频生成,显著降低了计算开销,并在保持高唇同步精度和视觉质量的前提下,实现了6.8倍的训练速度提升。

链接: https://arxiv.org/abs/2505.01928
作者: Anushka Agarwal,Muhammad Yusuf Hassan,Talha Chafekar
机构: University of Massachusetts Amherst (马萨诸塞大学阿默斯特分校)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:We introduce GenSync, a novel framework for multi-identity lip-synced video synthesis using 3D Gaussian Splatting. Unlike most existing 3D methods that require training a new model for each identity, GenSync learns a unified network that synthesizes lip-synced videos for multiple speakers. By incorporating a Disentanglement Module, our approach separates identity-specific features from audio representations, enabling efficient multi-identity video synthesis. This design reduces computational overhead and achieves 6.8x faster training compared to state-of-the-art models, while maintaining high lip-sync accuracy and visual quality.
zh

[CV-93] Rethinking Score Distilling Sampling for 3D Editing and Generation

【速读】:该论文试图解决文本到3D生成(text-to-3D generation)与现有3D资产编辑(3D asset editing)之间的任务分离问题,即现有的Score Distillation Sampling (SDS)方法在生成新3D资产方面表现优异,但无法进行编辑;而其变种虽然具备编辑能力,却在生成新资产时效果不佳。解决方案的关键在于观察到生成与编辑过程在SDS及其变种中具有统一的底层梯度项,并基于此提出Unified Distillation Sampling (UDS),通过优化原始SDS中的梯度项,实现生成与编辑任务的统一支持。

链接: https://arxiv.org/abs/2505.01888
作者: Xingyu Miao,Haoran Duan,Yang Long,Jungong Han
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Score Distillation Sampling (SDS) has emerged as a prominent method for text-to-3D generation by leveraging the strengths of 2D diffusion models. However, SDS is limited to generation tasks and lacks the capability to edit existing 3D assets. Conversely, variants of SDS that introduce editing capabilities often cannot generate new 3D assets effectively. In this work, we observe that the processes of generation and editing within SDS and its variants have unified underlying gradient terms. Building on this insight, we propose Unified Distillation Sampling (UDS), a method that seamlessly integrates both the generation and editing of 3D assets. Essentially, UDS refines the gradient terms used in vanilla SDS methods, unifying them to support both tasks. Extensive experiments demonstrate that UDS not only outperforms baseline methods in generating 3D assets with richer details but also excels in editing tasks, thereby bridging the gap between 3D generation and editing. The code is available on: this https URL.
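作为背景,社区通用的SDS梯度项(源自DreamFusion)如下所示;论文所称"统一的底层梯度项"正是围绕这一形式展开(下式为标准写法,UDS对其的具体改写见原文):

```latex
\nabla_\theta \mathcal{L}_{\mathrm{SDS}}(\theta)
= \mathbb{E}_{t,\epsilon}\!\left[ w(t)\,\big(\epsilon_\phi(x_t;\,y,\,t)-\epsilon\big)\,
  \frac{\partial x}{\partial \theta} \right],
\qquad x = g(\theta),\quad x_t = \alpha_t x + \sigma_t \epsilon
```

其中 g(θ) 为可微渲染得到的图像,ε_φ 为2D扩散模型的噪声预测,y 为文本条件,w(t) 为时间步权重。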
zh

[CV-94] CMAWRNet: Multiple Adverse Weather Removal via a Unified Quaternion Neural Architecture

【速读】:该论文试图解决多类型恶劣天气退化(如雾霾与雨痕的组合)在实际应用中对图像质量的影响问题,这类问题在图像检索、户外监控和自动驾驶等场景中尤为突出。解决方案的关键在于提出一种统一的四元数神经架构CMAWRNet,其核心组件包括一种新颖的纹理-结构分解块、一种轻量级的四元数变换器编码器-解码器架构、带有低光校正的注意力融合块,以及一种四元数相似性损失函数,以更好地保留颜色信息。该方法首次将分解方法应用于通用天气去除任务,并在多个基准数据集和真实世界图像上验证了其优越性能。

链接: https://arxiv.org/abs/2505.01882
作者: Vladimir Frants,Sos Agaian,Karen Panetta,Peter Huang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Images used in real-world applications such as image or video retrieval, outdoor surveillance, and autonomous driving suffer from poor weather conditions. When designing robust computer vision systems, removing adverse weather such as haze, rain, and snow is a significant problem. Recently, deep-learning methods offered a solution for a single type of degradation. Current state-of-the-art universal methods struggle with combinations of degradations, such as haze and rain-streak. Few algorithms have been developed that perform well when presented with images containing multiple adverse weather conditions. This work focuses on developing an efficient solution for multiple adverse weather removal using a unified quaternion neural architecture called CMAWRNet. It is based on a novel texture-structure decomposition block, a novel lightweight encoder-decoder quaternion transformer architecture, and an attentive fusion block with low-light correction. We also introduce a quaternion similarity loss function to preserve color information better. The quantitative and qualitative evaluation of the current state-of-the-art benchmarking datasets and real-world images shows the performance advantages of the proposed CMAWRNet compared to other state-of-the-art weather removal approaches dealing with multiple weather artifacts. Extensive computer simulations validate that CMAWRNet improves the performance of downstream applications such as object detection. This is the first time the decomposition approach has been applied to the universal weather removal task.
zh

[CV-95] PhysNav-DG: A Novel Adaptive Framework for Robust VLM-Sensor Fusion in Navigation Applications

【速读】:该论文旨在解决复杂环境中自主系统导航的准确性与可解释性问题,特别是在多领域(如室内导航、自动驾驶和社会导航)中实现可靠的状态估计与透明的决策过程。其解决方案的关键在于提出PhysNav-DG框架,该框架结合了经典传感器融合方法与视觉-语言模型的语义能力,通过双分支架构同时预测导航动作并生成详细的链式思维解释,同时采用改进的自适应卡尔曼滤波器动态调整噪声参数以适应环境变化,从而提升了导航成功率和解释的可靠性。

链接: https://arxiv.org/abs/2505.01881
作者: Trisanth Srinivasan,Santosh Patapati
机构: Cyrion Labs (Cyrion 实验室)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multimedia (cs.MM); Robotics (cs.RO)
备注: 9 pages, 5 figures

点击查看摘要

Abstract:Robust navigation in diverse environments and domains requires both accurate state estimation and transparent decision making. We present PhysNav-DG, a novel framework that integrates classical sensor fusion with the semantic power of vision-language models. Our dual-branch architecture predicts navigation actions from multi-sensor inputs while simultaneously generating detailed chain-of-thought explanations. A modified Adaptive Kalman Filter dynamically adjusts its noise parameters based on environmental context. It leverages several streams of raw sensor data along with semantic insights from models such as LLaMA 3.2 11B and BLIP-2. To evaluate our approach, we introduce the MD-NEX Benchmark, a novel multi-domain dataset that unifies indoor navigation, autonomous driving, and social navigation tasks with ground-truth actions and human-validated explanations. Extensive experiments and ablations show that PhysNav-DG improves navigation success rates by over 20% and achieves high efficiency, with explanations that are both highly grounded and clear. This work connects high-level semantic reasoning and geometric planning for safer and more trustworthy autonomous systems.
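摘要中"根据环境上下文动态调整噪声参数"的自适应卡尔曼滤波,其常见做法之一是用新息(innovation)统计在线估计量测噪声 R。下面是一个一维常速模型的极简示意(非 PhysNav-DG 官方实现,状态维度与遗忘因子均为假设值):

```python
import numpy as np

class AdaptiveKF:
    """一维常速模型的自适应卡尔曼滤波示意:
    用新息的滑动统计在线调整量测噪声 R。"""
    def __init__(self, dt=0.1, q=1e-3, r0=1.0, alpha=0.98):
        self.F = np.array([[1.0, dt], [0.0, 1.0]])   # 状态转移
        self.H = np.array([[1.0, 0.0]])              # 只观测位置
        self.Q = q * np.eye(2)
        self.R = np.array([[r0]])
        self.alpha = alpha                            # 遗忘因子
        self.x = np.zeros((2, 1))
        self.P = np.eye(2)

    def step(self, z):
        # 预测
        self.x = self.F @ self.x
        self.P = self.F @ self.P @ self.F.T + self.Q
        # 新息
        y = np.array([[z]]) - self.H @ self.x
        # 基于新息统计在线调整 R: R ≈ E[y y^T] - H P H^T(滑动平均)
        self.R = self.alpha * self.R + (1 - self.alpha) * np.maximum(
            y @ y.T - self.H @ self.P @ self.H.T, 1e-6)
        # 更新
        S = self.H @ self.P @ self.H.T + self.R
        K = self.P @ self.H.T @ np.linalg.inv(S)
        self.x = self.x + K @ y
        self.P = (np.eye(2) - K @ self.H) @ self.P
        return self.x.ravel()
```

论文中 R 的调整还由 VLM 给出的语义上下文驱动,这里仅演示新息驱动的最小形式。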
zh

[CV-96] Weakly-supervised Audio Temporal Forgery Localization via Progressive Audio-language Co-learning Network IJCAI2025

【速读】:该论文旨在解决音频时间伪造定位(Audio Temporal Forgery Localization, ATFL)中由于依赖高成本且难以获取的细粒度标注数据而导致的模型训练困难问题。其解决方案的关键在于提出一种渐进式音频-语言协同学习网络(LOCO),通过协同学习和自监督机制,在弱监督场景下提升定位性能。该方法首先设计了一个音频-语言协同学习模块,从时间和全局语义角度对齐以捕捉伪造共识特征,并利用话语级标注与可学习提示构建伪造感知提示,动态融合语义先验到时间内容特征中;随后通过融合伪造类激活序列生成伪造建议,并引入渐进式精炼策略生成伪帧级标签,结合监督语义对比学习增强真实与伪造内容之间的语义区分,从而持续优化伪造感知特征。

链接: https://arxiv.org/abs/2505.01880
作者: Junyan Wu,Wenbo Xu,Wei Lu,Xiangyang Luo,Rui Yang,Shize Guo
机构: Sun Yat-sen University (中山大学); Zhengzhou (郑州); Alibaba Group (阿里巴巴集团)
类目: Sound (cs.SD); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM); Audio and Speech Processing (eess.AS)
备注: 9 pages, 5 figures. This paper has been accepted for IJCAI2025

点击查看摘要

Abstract:Audio temporal forgery localization (ATFL) aims to find the precise forgery regions of the partial spoof audio that is purposefully modified. Existing ATFL methods rely on training efficient networks using fine-grained annotations, which are obtained costly and challenging in real-world scenarios. To meet this challenge, in this paper, we propose a progressive audio-language co-learning network (LOCO) that adopts co-learning and self-supervision manners to prompt localization performance under weak supervision scenarios. Specifically, an audio-language co-learning module is first designed to capture forgery consensus features by aligning semantics from temporal and global perspectives. In this module, forgery-aware prompts are constructed by using utterance-level annotations together with learnable prompts, which can incorporate semantic priors into temporal content features dynamically. In addition, a forgery localization module is applied to produce forgery proposals based on fused forgery-class activation sequences. Finally, a progressive refinement strategy is introduced to generate pseudo frame-level labels and leverage supervised semantic contrastive learning to amplify the semantic distinction between real and fake content, thereby continuously optimizing forgery-aware features. Extensive experiments show that the proposed LOCO achieves SOTA performance on three public benchmarks.
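摘要中的"监督语义对比学习"与通用的 supervised contrastive loss(Khosla et al., 2020)同源;下面给出后者的简化 PyTorch 实现作为参考(非论文官方代码,温度系数为假设值):

```python
import torch
import torch.nn.functional as F

def supcon_loss(feats, labels, tau=0.07):
    """监督对比损失的简化实现示意。
    feats: (N, d) 帧级特征;labels: (N,) 真/伪等类别标签。"""
    z = F.normalize(feats, dim=1)
    sim = z @ z.t() / tau                                   # (N, N) 相似度
    n = z.size(0)
    self_mask = torch.eye(n, dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(self_mask, float("-inf"))         # 排除自身
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    log_prob = log_prob.masked_fill(self_mask, 0.0)         # 防止 -inf * 0 产生 nan
    pos_mask = (labels.unsqueeze(0) == labels.unsqueeze(1)) & ~self_mask
    pos_cnt = pos_mask.sum(1).clamp(min=1)
    loss = -(log_prob * pos_mask).sum(1) / pos_cnt          # 对正样本取平均
    return loss[pos_mask.sum(1) > 0].mean()                 # 跳过无正样本的锚点
```

这种损失会拉近同类(同为真或同为伪)内容的表征、推远异类表征,对应摘要中"增强真实与伪造内容之间的语义区分"。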
zh

[CV-97] Visual enhancement and 3D representation for underwater scenes: a review

【速读】:该论文试图解决水下视觉增强(Underwater Visual Enhancement, UVE)和水下三维重建在计算机视觉与人工智能任务中的挑战,这些问题主要源于水下环境复杂的成像条件。解决方案的关键在于引入基础物理模型,分析传统技术难以应对的特殊性,并综述针对水下场景设计的先进视觉增强与三维重建方法,包括从非学习方法到数据驱动技术如神经辐射场(Neural Radiance Fields)和三维高斯泼溅(3D Gaussian Splatting)的多种策略,同时通过定量和定性评估验证这些方法在处理水下畸变方面的有效性。

链接: https://arxiv.org/abs/2505.01869
作者: Guoxi Huang,Haoran Wang,Brett Seymour,Evan Kovacs,John Ellerbrock,Dave Blackham,Nantheera Anantrasirichai
机构: University of Bristol(布里斯托大学); National Park Service(国家公园服务局); Marine Imaging Technologies, LLC(海洋成像技术公司); Gates Underwater Products, Inc(盖茨水下产品公司); Esprit film and television Ltd(埃斯普里特电影电视有限公司)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Underwater visual enhancement (UVE) and underwater 3D reconstruction pose significant challenges in computer vision and AI-based tasks due to complex imaging conditions in aquatic environments. Despite the development of numerous enhancement algorithms, a comprehensive and systematic review covering both UVE and underwater 3D reconstruction remains absent. To advance research in these areas, we present an in-depth review from multiple perspectives. First, we introduce the fundamental physical models, highlighting the peculiarities that challenge conventional techniques. We survey advanced methods for visual enhancement and 3D reconstruction specifically designed for underwater scenarios. The paper assesses various approaches from non-learning methods to advanced data-driven techniques, including Neural Radiance Fields and 3D Gaussian Splatting, discussing their effectiveness in handling underwater distortions. We then conduct both quantitative and qualitative evaluations of state-of-the-art UVE and underwater 3D reconstruction algorithms across multiple benchmark datasets. Finally, we highlight key research directions for future advancements in underwater vision.
zh

[CV-98] DualDiff: Dual-branch Diffusion Model for Autonomous Driving with Semantic Fusion

【速读】:该论文旨在解决驾驶场景重建中因现有方法仅依赖3D边界框和二值地图进行前景与背景控制,导致无法充分捕捉场景复杂性及融合多模态信息的问题。其解决方案的关键在于提出DualDiff模型,该模型采用双分支条件扩散架构,并引入语义丰富的3D表示——占据射线采样(Occupancy Ray Sampling, ORS),结合数值化驾驶场景表示以实现更全面的前景与背景控制;同时,通过语义融合注意力(Semantic Fusion Attention, SFA)机制提升跨模态信息整合能力,并设计前景感知掩码损失(Foreground-aware Masked, FGM)以增强小目标生成效果。

链接: https://arxiv.org/abs/2505.01857
作者: Haoteng Li,Zhao Yang,Zezhong Qian,Gongpeng Zhao,Yuqi Huang,Jun Yu,Huazheng Zhou,Longjun Liu
机构: Xi’an Jiaotong University (西安交通大学); University of Science and Technology of China (中国科学技术大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 8 pages, 6 figures

点击查看摘要

Abstract:Accurate and high-fidelity driving scene reconstruction relies on fully leveraging scene information as conditioning. However, existing approaches, which primarily use 3D bounding boxes and binary maps for foreground and background control, fall short in capturing the complexity of the scene and integrating multi-modal information. In this paper, we propose DualDiff, a dual-branch conditional diffusion model designed to enhance multi-view driving scene generation. We introduce Occupancy Ray Sampling (ORS), a semantic-rich 3D representation, alongside numerical driving scene representation, for comprehensive foreground and background control. To improve cross-modal information integration, we propose a Semantic Fusion Attention (SFA) mechanism that aligns and fuses features across modalities. Furthermore, we design a foreground-aware masked (FGM) loss to enhance the generation of tiny objects. DualDiff achieves state-of-the-art performance in FID score, as well as consistently better results in downstream BEV segmentation and 3D object detection tasks.
zh

[CV-99] Mitigating Group-Level Fairness Disparities in Federated Visual Language Models

【速读】:该论文旨在解决联邦学习(Federated Learning, FL)环境中视觉语言模型(Visual Language Models, VLMs)在不同人口群体间公平性不足的问题。其关键解决方案是提出一种名为FVL-FP的框架,该框架结合了联邦学习与公平提示调优技术,通过三个创新组件:跨层人口公平提示(Cross-Layer Demographic Fair Prompting, CDFP)、人口子空间正交投影(Demographic Subspace Orthogonal Projection, DSOP)和公平感知提示融合(Fair-aware Prompt Fusion, FPF),有效缓解了人口偏差并保持了模型性能。

链接: https://arxiv.org/abs/2505.01851
作者: Chaomeng Chen,Zitong Yu,Junhao Dong,Sen Su,Linlin Shen,Shutao Xia,Xiaochun Cao
机构: Great Bay University; Tsinghua Shenzhen International Graduate School, Tsinghua University; Nanyang Technological University; Beijing University of Posts and Telecommunications; Shenzhen University; Tsinghua Shenzhen International Graduate School, Tsinghua University; Pengcheng Laboratory; Shenzhen Campus of Sun Yat-sen University
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Visual language models (VLMs) have shown remarkable capabilities in multimodal tasks but face challenges in maintaining fairness across demographic groups, particularly when deployed in federated learning (FL) environments. This paper addresses the critical issue of group fairness in federated VLMs by introducing FVL-FP, a novel framework that combines FL with fair prompt tuning techniques. We focus on mitigating demographic biases while preserving model performance through three innovative components: (1) Cross-Layer Demographic Fair Prompting (CDFP), which adjusts potentially biased embeddings through counterfactual regularization; (2) Demographic Subspace Orthogonal Projection (DSOP), which removes demographic bias in image representations by mapping fair prompt text to group subspaces; and (3) Fair-aware Prompt Fusion (FPF), which dynamically balances client contributions based on both performance and fairness metrics. Extensive evaluations across four benchmark datasets demonstrate that our approach reduces demographic disparity by an average of 45% compared to standard FL approaches, while maintaining task performance within 6% of state-of-the-art results. FVL-FP effectively addresses the challenges of non-IID data distributions in federated settings and introduces minimal computational overhead while providing significant fairness benefits. Our work presents a parameter-efficient solution to the critical challenge of ensuring equitable performance across demographic groups in privacy-preserving multimodal systems.
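其中"人口子空间正交投影(DSOP)"的核心几何操作,可以理解为把图像特征投影到属性子空间的正交补上。下面是一个极简示意(U 的估计来源与接口均为假设,仅演示投影本身):

```python
import torch

def orthogonal_debias(x, U):
    """将特征投影到子空间 U 的正交补,去除该子空间承载的属性信息。
    x: (B, d) 图像特征;U: (d, k) 人口属性子空间的标准正交基(假设已事先估计)。"""
    return x - (x @ U) @ U.t()   # x' = x - U U^T x
```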
zh

[CV-100] MVHumanNet: A Large-scale Dataset of Multi-view Daily Dressing Human Captures with Richer Annotations for 3D Human Digitization

【速读】:该论文试图解决3D视觉领域中人类中心任务进展有限的问题,这主要是由于缺乏类似大规模物体数据集的高质量人类数据集。解决方案的关键在于构建MVHumanNet++,这是一个包含4,500个不同人类身份的多视角人体动作序列的数据集,通过多视角人体采集系统收集具有多样身份和日常服装的人类数据,从而实现易于扩展的数据采集。该数据集包含9,000种日常服饰、60,000个运动序列及6.45亿帧的广泛标注数据,包括人体掩码、相机参数、2D/3D关键点、SMPL/SMPLX参数和对应文本描述,并新增了法线图和深度图,显著提升了其在先进人类中心研究中的适用性。

链接: https://arxiv.org/abs/2505.01838
作者: Chenghong Li,Hongjie Liao,Yihao Zhi,Xihe Yang,Zhengwentai Sun,Jiahao Chang,Shuguang Cui,Xiaoguang Han
机构: The Chinese University of Hong Kong, Shenzhen (香港中文大学深圳); Future Network of Intelligence Institute, CUHK-Shenzhen (未来智能网络研究院,港中深)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: project page: this https URL . arXiv admin note: substantial text overlap with arXiv:2312.02963

点击查看摘要

Abstract:In this era, the success of large language models and text-to-image models can be attributed to the driving force of large-scale datasets. However, in the realm of 3D vision, while significant progress has been achieved in object-centric tasks through large-scale datasets like Objaverse and MVImgNet, human-centric tasks have seen limited advancement, largely due to the absence of a comparable large-scale human dataset. To bridge this gap, we present MVHumanNet++, a dataset that comprises multi-view human action sequences of 4,500 human identities. The primary focus of our work is on collecting human data that features a large number of diverse identities and everyday clothing using multi-view human capture systems, which facilitates easily scalable data collection. Our dataset contains 9,000 daily outfits, 60,000 motion sequences and 645 million frames with extensive annotations, including human masks, camera parameters, 2D and 3D keypoints, SMPL/SMPLX parameters, and corresponding textual descriptions. Additionally, the proposed MVHumanNet++ dataset is enhanced with newly processed normal maps and depth maps, significantly expanding its applicability and utility for advanced human-centric research. To explore the potential of our proposed MVHumanNet++ dataset in various 2D and 3D visual tasks, we conducted several pilot studies to demonstrate the performance improvements and effective applications enabled by the scale provided by MVHumanNet++. As the current largest-scale 3D human dataset, we hope that the release of MVHumanNet++ dataset with annotations will foster further innovations in the domain of 3D human-centric tasks at scale. MVHumanNet++ is publicly available at this https URL.
zh

[CV-101] CVVNet: A Cross-Vertical-View Network for Gait Recognition

【速读】:该论文旨在解决跨垂直视角(cross-vertical view)步态识别中的性能下降问题,特别是在低视角到高视角变换场景下,由于关键解剖特征的严重形变和自遮挡导致识别准确率最多下降60%。现有基于卷积神经网络(CNN)和自注意力机制的方法因依赖单尺度卷积或简单的注意力机制,无法有效处理多频率特征的融合。解决方案的关键在于提出CVVNet,其核心是高频-低频提取模块(HLFE)与动态门控聚合(DGA)机制,通过并行多尺度卷积/最大池化路径和自注意力路径实现多频率特征的有效提取与自适应融合,从而提升跨垂直视角下的识别鲁棒性。

链接: https://arxiv.org/abs/2505.01837
作者: Xiangru Li,Wei Song,Yingda Huang,Wei Meng,Le Chang
机构: Guangdong University of Technology (广东工业大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Gait recognition enables contact-free, long-range person identification that is robust to clothing variations and non-cooperative scenarios. While existing methods perform well in controlled indoor environments, they struggle with cross-vertical view scenarios, where surveillance angles vary significantly in elevation. Our experiments show up to 60% accuracy degradation in low-to-high vertical view settings due to severe deformations and self-occlusions of key anatomical features. Current CNN and self-attention-based methods fail to effectively handle these challenges, due to their reliance on single-scale convolutions or simplistic attention mechanisms that lack effective multi-frequency feature integration. To tackle this challenge, we propose CVVNet (Cross-Vertical-View Network), a frequency aggregation architecture specifically designed for robust cross-vertical-view gait recognition. CVVNet employs a High-Low Frequency Extraction module (HLFE) that adopts parallel multi-scale convolution/max-pooling path and self-attention path as high- and low-frequency mixers for effective multi-frequency feature extraction from input silhouettes. We also introduce the Dynamic Gated Aggregation (DGA) mechanism to adaptively adjust the fusion ratio of high- and low-frequency features. The integration of our core Multi-Scale Attention Gated Aggregation (MSAGA) module, HLFE and DGA enables CVVNet to effectively handle distortions from view changes, significantly improving the recognition robustness across different vertical views. Experimental results show that our CVVNet achieves state-of-the-art performance, with 8.6% improvement on DroneGait and 2% on Gait3D compared with the best existing methods.
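摘要中 HLFE 的"高频路径 = 多尺度卷积/最大池化、低频路径 = 自注意力、DGA 门控融合"的结构可以用如下 PyTorch 草图示意(非官方实现,卷积核尺寸与门控形式均为假设;要求通道数 c 可被头数整除):

```python
import torch
import torch.nn as nn

class HLFEBlock(nn.Module):
    """高低频提取模块的最小示意:双路混合器 + 可学习门控融合。"""
    def __init__(self, c, heads=4):
        super().__init__()
        self.conv3 = nn.Conv2d(c, c, 3, padding=1)
        self.conv5 = nn.Conv2d(c, c, 5, padding=2)
        self.pool = nn.MaxPool2d(3, stride=1, padding=1)
        self.attn = nn.MultiheadAttention(c, heads, batch_first=True)
        self.gate = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                  nn.Linear(c, c), nn.Sigmoid())

    def forward(self, x):                                     # x: (B, C, H, W)
        high = self.conv3(x) + self.conv5(x) + self.pool(x)   # 高频混合器
        B, C, H, W = x.shape
        seq = x.flatten(2).transpose(1, 2)                    # (B, HW, C)
        low, _ = self.attn(seq, seq, seq)                     # 低频混合器(自注意力)
        low = low.transpose(1, 2).reshape(B, C, H, W)
        g = self.gate(x).view(B, C, 1, 1)                     # 动态门控(DGA 的简化)
        return g * high + (1 - g) * low
```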
zh

[CV-102] PhytoSynth: Leveraging Multi-modal Generative Models for Crop Disease Data Generation with Novel Benchmarking and Prompt Engineering Approach

【速读】:该论文试图解决在田间大规模收集作物病害图像所面临的劳动强度大、耗时的问题,提出了一种基于多模态文本到图像生成的方法作为替代方案。其关键解决方案是采用三种Stable Diffusion(SD)变体(SDXL、SD3.5M和SD3.5L)并结合Dreambooth和低秩适应(LoRA)微调技术,以提升模型的泛化能力,其中SD3.5M在计算资源消耗和生成效率方面表现最优,能够在1.5小时内从36个实地样本生成500张合成病害图像。

链接: https://arxiv.org/abs/2505.01823
作者: Nitin Rai,Arnold W. Schumann,Nathan Boyd
机构: Gulf Coast Research and Education Center (GCREC), University of Florida (佛罗里达大学); Citrus Research and Education Center (CREC), University of Florida (佛罗里达大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET)
备注:

点击查看摘要

Abstract:Collecting large-scale crop disease images in the field is labor-intensive and time-consuming. Generative models (GMs) offer an alternative by creating synthetic samples that resemble real-world images. However, existing research primarily relies on Generative Adversarial Networks (GANs)-based image-to-image translation and lack a comprehensive analysis of computational requirements in agriculture. Therefore, this research explores a multi-modal text-to-image approach for generating synthetic crop disease images and is the first to provide computational benchmarking in this context. We trained three Stable Diffusion (SD) variants-SDXL, SD3.5M (medium), and SD3.5L (large)-and fine-tuned them using Dreambooth and Low-Rank Adaptation (LoRA) fine-tuning techniques to enhance generalization. SD3.5M outperformed the others, with an average memory usage of 18 GB, power consumption of 180 W, and total energy use of 1.02 kWh/500 images (0.002 kWh per image) during inference task. Our results demonstrate SD3.5M’s ability to generate 500 synthetic images from just 36 in-field samples in 1.5 hours. We recommend SD3.5M for efficient crop disease data generation.
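文中用于微调 SD 变体的 LoRA,其核心是在冻结权重旁加一条可训练的低秩旁路。下面给出一个通用的 LoRA 线性层极简实现(与 Dreambooth/diffusers 的具体封装无关,秩与缩放系数均为假设值):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """LoRA 线性层示意:冻结原权重 W,只训练低秩增量 B @ A(缩放 alpha/r)。"""
    def __init__(self, base: nn.Linear, r=8, alpha=16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                 # 冻结预训练权重
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # 零初始化,训练起点等价于原模型
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + (x @ self.A.t() @ self.B.t()) * self.scale
```

由于只训练 A、B 两个小矩阵,可训练参数量与显存开销远小于全量微调,这也是论文能在消费级算力下比较三种 SD 变体能耗的前提。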
zh

[CV-103] 3DWG: 3D Weakly Supervised Visual Grounding via Category and Instance-Level Alignment ICRA2025

【速读】:该论文旨在解决3D弱监督视觉定位任务中的两个主要挑战:类别级模糊性和实例级复杂性。类别级模糊性源于在高度稀疏的点云格式中表示细粒度类别物体,导致类别区分困难;实例级复杂性则源于同一类别多个实例共存于场景中,造成定位时的干扰。解决方案的关键在于提出一种新的弱监督定位方法,通过显式区分类别和实例来提升模型性能。在类别级分支中,利用预训练外部检测器的广泛类别知识,将物体提议特征与句子级类别特征对齐,以增强类别感知能力;在实例级分支中,利用语言查询中的空间关系描述来优化物体提议特征,确保同类物体间的清晰区分。

链接: https://arxiv.org/abs/2505.01809
作者: Xiaoqi Li,Jiaming Liu,Nuowei Han,Liang Heng,Yandong Guo,Hao Dong,Yang Liu
机构: Peking University (北京大学); AI2Robotic (AI2机器人); Wangxuan Institute of Computer Technology, Peking University (北京大学王选计算机技术研究所)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: ICRA 2025

点击查看摘要

Abstract:The 3D weakly-supervised visual grounding task aims to localize oriented 3D boxes in point clouds based on natural language descriptions without requiring annotations to guide model learning. This setting presents two primary challenges: category-level ambiguity and instance-level complexity. Category-level ambiguity arises from representing objects of fine-grained categories in a highly sparse point cloud format, making category distinction challenging. Instance-level complexity stems from multiple instances of the same category coexisting in a scene, leading to distractions during grounding. To address these challenges, we propose a novel weakly-supervised grounding approach that explicitly differentiates between categories and instances. In the category-level branch, we utilize extensive category knowledge from a pre-trained external detector to align object proposal features with sentence-level category features, thereby enhancing category awareness. In the instance-level branch, we utilize spatial relationship descriptions from language queries to refine object proposal features, ensuring clear differentiation among objects. These designs enable our model to accurately identify target-category objects while distinguishing instances within the same category. Compared to previous methods, our approach achieves state-of-the-art performance on three widely used benchmarks: Nr3D, Sr3D, and ScanRef.
zh

[CV-104] Not Every Tree Is a Forest: Benchmarking Forest Types from Satellite Remote Sensing

【速读】:该论文旨在解决全球尺度下森林类型精准映射的问题,以支持遏制森林砍伐和生物多样性保护(如欧盟《森林砍伐法规》(EUDR))的努力。其解决方案的关键在于构建了一个名为ForTy的基准数据集,该数据集包含200,000个时间序列图像块,融合了多时相的Sentinel-2、Sentinel-1、气候和高程数据,并提供了像素级的森林类型及其他土地利用类别的标注。此外,该研究提出了一种专门针对多模态、多时相卫星数据的新型Transformer模型,以提升森林类型分类的性能。

链接: https://arxiv.org/abs/2505.01805
作者: Yuchang Jiang,Maxim Neumann
机构: Google DeepMind(谷歌深度思维)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Developing accurate and reliable models for forest types mapping is critical to support efforts for halting deforestation and for biodiversity conservation (such as the European Union Deforestation Regulation (EUDR)). This work introduces ForTy, a benchmark for global-scale FORest TYpes mapping using multi-temporal satellite data. The benchmark comprises 200,000 time series of image patches, each consisting of Sentinel-2, Sentinel-1, climate, and elevation data. Each time series captures variations at monthly or seasonal cadence. Per-pixel annotations, including forest types and other land use classes, support image segmentation tasks. Unlike most existing land use products that often categorize all forest areas into a single class, our benchmark differentiates between three forest types classes: natural forest, planted forest, and tree crops. By leveraging multiple public data sources, we achieve global coverage with this benchmark. We evaluate the forest types dataset using several baseline models, including convolutional neural networks and transformer-based models. Additionally, we propose a novel transformer-based model specifically designed to handle multi-modal, multi-temporal satellite data for forest types mapping. Our experimental results demonstrate that the proposed model surpasses the baseline models in performance.
zh

[CV-105] Efficient 3D Full-Body Motion Generation from Sparse Tracking Inputs with Temporal Windows CVPR

【速读】:该论文旨在解决在沉浸式AR/VR应用中,由于有限传感器无法捕获完整身体部位,需通过神经网络(Neural Network, NN)模型生成缺失身体部分以实现完整的3D全身体重建的问题。现有最先进的NN模型通常计算成本高,并依赖较长的稀疏跟踪输入序列来捕捉时间上下文以生成全身体运动,这导致计算开销增加并引入长时序依赖中的噪声,从而影响生成性能。该论文提出的解决方案关键在于采用一种基于多层感知机(Multi-Layer Perceptron, MLP)的方法,通过将长序列输入划分为较小的时间窗口,并利用潜在表示融合当前运动与这些窗口中的历史信息,从而在保持生成性能的同时降低计算成本和内存开销。

链接: https://arxiv.org/abs/2505.01802
作者: Georgios Fotios Angelis,Savas Ozkan,Sinan Mutlu,Paul Wisbey,Anastasios Drosou,Mete Ozay
机构: CERTH; Samsung R&D Institute UK
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to CVPRW2025 - 4D Vision Workshop

点击查看摘要

Abstract:To have a seamless user experience on immersive AR/VR applications, the importance of efficient and effective Neural Network (NN) models is undeniable, since missing body parts that cannot be captured by limited sensors should be generated using these models for a complete 3D full-body reconstruction in virtual environment. However, the state-of-the-art NN-models are typically computationally expensive and they leverage longer sequences of sparse tracking inputs to generate full-body movements by capturing temporal context. Inevitably, longer sequences increase the computation overhead and introduce noise in longer temporal dependencies that adversely affect the generation performance. In this paper, we propose a novel Multi-Layer Perceptron (MLP)-based method that enhances the overall performance while balancing the computational cost and memory overhead for efficient 3D full-body generation. Precisely, we introduce an NN-mechanism that divides the longer sequence of inputs into smaller temporal windows. Later, the current motion is merged with the information from these windows through latent representations to utilize the past context for the generation. Our experiments demonstrate that the generation accuracy of our method with this NN-mechanism is significantly improved compared to the state-of-the-art methods while greatly reducing computational costs and memory overhead, making our method suitable for resource-constrained devices.
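摘要中"把长序列切成小时间窗、再经潜在表示与当前运动融合"的机制可示意如下(非论文官方结构;输入维度、窗口划分与输出参数化均为假设):

```python
import torch
import torch.nn as nn

class WindowedMLP(nn.Module):
    """时间窗MLP示意:历史序列切成 n_win 个窗口分别编码,
    再与当前帧的潜在表示拼接,回归全身姿态。"""
    def __init__(self, in_dim=54, latent=256, win=8, n_win=5):
        super().__init__()
        self.win, self.n_win = win, n_win
        self.win_enc = nn.Sequential(nn.Linear(win * in_dim, latent), nn.GELU())
        self.cur_enc = nn.Sequential(nn.Linear(in_dim, latent), nn.GELU())
        self.fuse = nn.Sequential(nn.Linear(latent * (n_win + 1), latent),
                                  nn.GELU(), nn.Linear(latent, 22 * 6))  # 假设输出22个关节的6D旋转

    def forward(self, seq):                  # seq: (B, T, in_dim), T = win * n_win + 1
        B = seq.size(0)
        cur = self.cur_enc(seq[:, -1])                       # 当前帧
        past = seq[:, :-1].reshape(B, self.n_win, -1)        # 切成 n_win 个时间窗
        z = self.win_enc(past)                               # (B, n_win, latent)
        h = torch.cat([z.flatten(1), cur], dim=1)
        return self.fuse(h)                                  # 全身姿态输出
```

相比在整条长序列上做逐帧建模,每个窗口只被压缩为一个潜在向量,既缩短了有效依赖长度,也降低了计算与内存开销。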
zh

[CV-106] AquaGS: Fast Underwater Scene Reconstruction with SfM-Free Gaussian Splatting

【速读】:该论文旨在解决水下场景重建中因介质干扰导致的图像质量退化问题,以及传统Structure-from-Motion (SfM)方法在速度和精度上的局限性。其解决方案的关键在于提出AquaGS模型,该模型无需依赖SfM,通过集成多视角立体(MVS)技术初始化高斯分布、利用隐式神经辐射场(NeRF)渲染半透明介质,并结合显式三维高斯点云(3DGS)技术渲染物体表面,从而有效克服传统方法的不足并精确模拟水下光学现象。

链接: https://arxiv.org/abs/2505.01799
作者: Junhao Shi,Jisheng Xu,Jianping He,Zhiliang Lin
机构: Shanghai Jiao Tong University (上海交通大学); School of Electronic Information and Electrical Engineering (电子信息与电气工程学院); State Key Laboratory of Ocean Engineering (海洋工程国家重点实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Underwater scene reconstruction is a critical technology for underwater operations, enabling the generation of 3D models from images captured by underwater platforms. However, the quality of underwater images is often degraded due to medium interference, which limits the effectiveness of Structure-from-Motion (SfM) pose estimation, leading to subsequent reconstruction failures. Additionally, SfM methods typically operate at slower speeds, further hindering their applicability in real-time scenarios. In this paper, we introduce AquaGS, an SfM-free underwater scene reconstruction model based on the SeaThru algorithm, which facilitates rapid and accurate separation of scene details and medium features. Our approach initializes Gaussians by integrating state-of-the-art multi-view stereo (MVS) technology, employs implicit Neural Radiance Fields (NeRF) for rendering translucent media and utilizes the latest explicit 3D Gaussian Splatting (3DGS) technique to render object surfaces, which effectively addresses the limitations of traditional methods and accurately simulates underwater optical phenomena. Experimental results on the dataset and the robot platform show that our model can complete high-precision reconstruction in 30 seconds with only 3 image inputs, significantly enhancing the practical application of the algorithm in robotic platforms.
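AquaGS 所基于的 SeaThru 成像模型(Akkaynak & Treibitz)将观测图像分解为直接分量与后向散射两项,且两者使用不同的衰减系数:

```latex
I_c \;=\; J_c\, e^{-\beta_c^{D} z} \;+\; B_c^{\infty}\big(1 - e^{-\beta_c^{B} z}\big),
\qquad c \in \{R,\,G,\,B\}
```

其中 I_c 为观测强度,J_c 为未衰减的场景辐射,z 为距离,B_c^∞ 为背景散射光,β_c^D 与 β_c^B 分别为直接传输和后向散射的衰减系数。分离出介质项后,剩余的 J_c 即可交给 3DGS 渲染物体表面,而半透明介质由 NeRF 负责。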
zh

[CV-107] Multimodal Graph Representation Learning for Robust Surgical Workflow Recognition with Adversarial Feature Disentanglement

【速读】:该论文旨在解决手术流程识别中因数据损坏(如术中出血或烟雾导致的遮挡、数据存储与传输问题)而引起的性能下降问题,从而提升手术自动化任务的准确性与可靠性。其解决方案的关键在于提出一种基于图的多模态方法,即生成式 AI (Generative AI) 与运动学数据融合的增强模型——多模态解耦图网络(Multimodal Disentanglement Graph Network),通过图结构建模视觉与运动学嵌入之间的复杂关系,并结合视觉-运动学对抗框架以对齐模态特征空间,同时设计上下文校准解码器以增强对领域偏移和数据损坏的鲁棒性。

链接: https://arxiv.org/abs/2505.01766
作者: Long Bai,Boyi Ma,Ruohan Wang,Guankun Wang,Beilei Cui,Zhongliang Jiang,Mobarakol Islam,Zhe Min,Jiewen Lai,Nassir Navab,Hongliang Ren
机构: The Chinese University of Hong Kong (香港中文大学); University of Toronto (多伦多大学); Brown University (布朗大学); Technische Universität München (慕尼黑工业大学); University College London (伦敦大学学院); Shandong University (山东大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注: Accepted by Information Fusion

点击查看摘要

Abstract:Surgical workflow recognition is vital for automating tasks, supporting decision-making, and training novice surgeons, ultimately improving patient safety and standardizing procedures. However, data corruption can lead to performance degradation due to issues like occlusion from bleeding or smoke in surgical scenes and problems with data storage and transmission. In this case, we explore a robust graph-based multimodal approach to integrating vision and kinematic data to enhance accuracy and reliability. Vision data captures dynamic surgical scenes, while kinematic data provides precise movement information, overcoming limitations of visual recognition under adverse conditions. We propose a multimodal Graph Representation network with Adversarial feature Disentanglement (GRAD) for robust surgical workflow recognition in challenging scenarios with domain shifts or corrupted data. Specifically, we introduce a Multimodal Disentanglement Graph Network that captures fine-grained visual information while explicitly modeling the complex relationships between vision and kinematic embeddings through graph-based message modeling. To align feature spaces across modalities, we propose a Vision-Kinematic Adversarial framework that leverages adversarial training to reduce modality gaps and improve feature consistency. Furthermore, we design a Contextual Calibrated Decoder, incorporating temporal and contextual priors to enhance robustness against domain shifts and corrupted data. Extensive comparative and ablation experiments demonstrate the effectiveness of our model and proposed modules. Moreover, our robustness experiments show that our method effectively handles data corruption during storage and transmission, exhibiting excellent stability and robustness. Our approach aims to advance automated surgical workflow recognition, addressing the complexities and dynamism inherent in surgical procedures.
zh

[CV-108] Co3Gesture: Towards Coherent Concurrent Co-speech 3D Gesture Generation with Interactive Diffusion ICLR2025

【速读】:该论文旨在解决在双人互动对话中同时生成共话语义手势的问题,现有方法主要关注单人自言自语时的手势合成,忽略了多人交互场景下的实际需求,且缺乏高质量的多说话人共话语义手势数据集。解决方案的关键在于构建一个大规模的双人互动共话语义手势数据集GES-Inter,并提出Co³Gesture框架,该框架基于两个独立说话人的音频条件生成分支,通过时间交互模块(TIM)建模两人手势序列之间的时序关联作为交互引导,并引入互注意力机制以增强交互动作的学习依赖性,从而实现连贯且生动的同步手势生成。

链接: https://arxiv.org/abs/2505.01746
作者: Xingqun Qi,Yatian Wang,Hengyuan Zhang,Jiahao Pan,Wei Xue,Shanghang Zhang,Wenhan Luo,Qifeng Liu,Yike Guo
机构: The Hong Kong University of Science and Technology (香港科技大学); Peking University (北京大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted as ICLR 2025 (Spotlight)

点击查看摘要

Abstract:Generating gestures from human speech has gained tremendous progress in animating virtual avatars. While the existing methods enable synthesizing gestures cooperated by individual self-talking, they overlook the practicality of concurrent gesture modeling with two-person interactive conversations. Moreover, the lack of high-quality datasets with concurrent co-speech gestures also limits handling this issue. To fulfill this goal, we first construct a large-scale concurrent co-speech gesture dataset that contains more than 7M frames for diverse two-person interactive posture sequences, dubbed GES-Inter. Additionally, we propose Co^3Gesture, a novel framework that enables coherent concurrent co-speech gesture synthesis including two-person interactive movements. Considering the asymmetric body dynamics of two speakers, our framework is built upon two cooperative generation branches conditioned on separated speaker audio. Specifically, to enhance the coordination of human postures with respect to corresponding speaker audios while interacting with the conversational partner, we present a Temporal Interaction Module (TIM). TIM can effectively model the temporal association representation between two speakers’ gesture sequences as interaction guidance and fuse it into the concurrent gesture generation. Then, we devise a mutual attention mechanism to further holistically boost learning dependencies of interacted concurrent motions, thereby enabling us to generate vivid and coherent gestures. Extensive experiments demonstrate that our method outperforms the state-of-the-art models on our newly collected GES-Inter dataset. The dataset and source code are publicly available at this https URL.
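摘要中的"互注意力机制"可以用两路交叉注意力互相把对方序列作为 K/V 来示意(非官方 TIM/互注意力实现,维度与头数均为假设):

```python
import torch
import torch.nn as nn

class MutualAttention(nn.Module):
    """两位说话人手势特征的互注意力示意:各自关注对方的动作序列,再残差融合。"""
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.attn_a = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.attn_b = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_a = nn.LayerNorm(dim)
        self.norm_b = nn.LayerNorm(dim)

    def forward(self, ga, gb):               # ga, gb: (B, T, dim) 两路手势序列
        a2b, _ = self.attn_a(ga, gb, gb)     # A 以自身为 Q、对方为 K/V
        b2a, _ = self.attn_b(gb, ga, ga)     # B 同理
        return self.norm_a(ga + a2b), self.norm_b(gb + b2a)
```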
zh

[CV-109] An LLM-Empowered Low-Resolution Vision System for On-Device Human Behavior Understanding

【速读】:该论文旨在解决低分辨率视觉系统中人类行为理解(HBU)的标注效率问题,即传统方法在处理深度、热成像和红外等低分辨率数据时表现不佳,而现有的大型视觉语言模型(LVLM)主要针对高分辨率数据设计,难以有效适应低分辨率视频的理解任务。其解决方案的关键在于提出一种名为Llambda的系统,该系统通过利用有限的已标记数据和大量未标记数据,结合对比学习生成高质量伪标签,并借助物理知识引导的描述生成器提升模型对序列数据的理解能力,最终通过LoRA-based高效微调实现对低分辨率数据的适配,从而显著提升HBU任务的性能。

链接: https://arxiv.org/abs/2505.01743
作者: Siyang Jiang,Bufang Yang,Lilin Xu,Mu Yuan,Yeerzhati Abudunuer,Kaiwei Liu,Liekang Zeng,Hongkai Chen,Zhenyu Yan,Xiaofan Jiang,Guoliang Xing
机构: The Chinese University of Hong Kong, Hong Kong SAR; Columbia University, United States
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:The rapid advancements in Large Vision Language Models (LVLMs) offer the potential to surpass conventional labeling by generating richer, more detailed descriptions of on-device human behavior understanding (HBU) in low-resolution vision systems, such as depth, thermal, and infrared. However, existing large vision language model (LVLM) approaches are unable to understand low-resolution data well as they are primarily designed for high-resolution data, such as RGB images. A quick fix is to caption a large amount of low-resolution data, but it requires a significant amount of labor-intensive annotation efforts. In this paper, we propose a novel, labor-saving system, Llambda, designed to support low-resolution HBU. The core idea is to leverage limited labeled data and a large amount of unlabeled data to guide LLMs in generating informative captions, which can be combined with raw data to effectively fine-tune LVLM models for understanding low-resolution videos in HBU. First, we propose a Contrastive-Oriented Data Labeler, which can capture behavior-relevant information from long, low-resolution videos and generate high-quality pseudo labels for unlabeled data via contrastive learning. Second, we propose a Physical-Knowledge Guided Captioner, which utilizes spatial and temporal consistency checks to mitigate errors in pseudo labels. Therefore, it can improve LLMs’ understanding of sequential data and then generate high-quality video captions. Finally, to ensure on-device deployability, we employ LoRA-based efficient fine-tuning to adapt LVLMs for low-resolution data. We evaluate Llambda using a region-scale real-world testbed and three distinct low-resolution datasets, and the experiments show that Llambda outperforms several state-of-the-art LVLM systems by up to 40.03% in average BERT-Score.
zh

[CV-110] Learning Multi-frame and Monocular Prior for Estimating Geometry in Dynamic Scenes

【速读】:该论文旨在解决单目视频中动态场景的3D几何估计问题,这一任务因物体运动导致现有模型仅能预测部分属性(如深度或仅跨越两帧的点图)而面临显著挑战。为了解决这一问题,论文提出了一种新的前馈模型MMP,其关键在于引入了一种轨迹编码模块,基于最近的Siamese架构,将逐点动态投影到每帧的表示中,从而显著提升了动态场景的表达能力,实现了更高质量的点图预测。

链接: https://arxiv.org/abs/2505.01737
作者: Seong Hyeon Park,Jinwoo Shin
机构: KAIST(韩国科学技术院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:In monocular videos that capture dynamic scenes, estimating the 3D geometry of video contents has been a fundamental challenge in computer vision. Specifically, the task is significantly challenged by the object motion, where existing models are limited to predict only partial attributes of the dynamic scenes, such as depth or pointmaps spanning only over a pair of frames. Since these attributes are inherently noisy under multiple frames, test-time global optimizations are often employed to fully recover the geometry, which is liable to failure and incurs heavy inference costs. To address the challenge, we present a new model, coined MMP, to estimate the geometry in a feed-forward manner, which produces a dynamic pointmap representation that evolves over multiple frames. Specifically, based on the recent Siamese architecture, we introduce a new trajectory encoding module to project point-wise dynamics on the representation for each frame, which can provide significantly improved expressiveness for dynamic scenes. In our experiments, we find MMP can achieve state-of-the-art quality in feed-forward pointmap prediction, e.g., a 15.1% reduction in regression error.
zh

[CV-111] PosePilot: Steering Camera Pose for Generative World Models with Self-supervised Depth

【速读】:该论文旨在解决生成式世界模型中相机位姿(camera pose)控制的精确性和灵活性问题,这一问题对于实现准确的视角变换和场景动态的逼真模拟至关重要。解决方案的关键在于提出PosePilot框架,该框架通过自监督深度估计和运动恢复结构(structure-from-motion)原理,建立相机位姿与视频生成之间的紧密耦合,利用自监督深度和位姿读出模块,使模型能够直接从视频序列中推断深度和相对相机运动,并通过光度变形损失确保合成帧间的几何一致性,同时引入逆向变形步骤和位姿回归损失进一步提升位姿估计的精度与适应性。

链接: https://arxiv.org/abs/2505.01729
作者: Bu Jin,Weize Li,Baihan Yang,Zhenxin Zhu,Junpeng Jiang,Huan-ang Gao,Haiyang Sun,Kun Zhan,Hengtong Hu,Xueyang Zhang,Peng Jia,Hao Zhao
机构: Institute for AI Industry Research (AIR), Tsinghua University; Li Auto; Institute of Automation, Chinese Academy of Sciences; School of Automation Science and Electrical Engineering, Beihang University; School of Computer Science and Technology, Beijing Jiaotong University
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 8 pages, 3 figures

点击查看摘要

Abstract:Recent advancements in autonomous driving (AD) systems have highlighted the potential of world models in achieving robust and generalizable performance across both ordinary and challenging driving conditions. However, a key challenge remains: precise and flexible camera pose control, which is crucial for accurate viewpoint transformation and realistic simulation of scene dynamics. In this paper, we introduce PosePilot, a lightweight yet powerful framework that significantly enhances camera pose controllability in generative world models. Drawing inspiration from self-supervised depth estimation, PosePilot leverages structure-from-motion principles to establish a tight coupling between camera pose and video generation. Specifically, we incorporate self-supervised depth and pose readouts, allowing the model to infer depth and relative camera motion directly from video sequences. These outputs drive pose-aware frame warping, guided by a photometric warping loss that enforces geometric consistency across synthesized frames. To further refine camera pose estimation, we introduce a reverse warping step and a pose regression loss, improving viewpoint precision and adaptability. Extensive experiments on autonomous driving and general-domain video datasets demonstrate that PosePilot significantly enhances structural understanding and motion reasoning in both diffusion-based and auto-regressive world models. By steering camera pose with self-supervised depth, PosePilot sets a new benchmark for pose controllability, enabling physically consistent, reliable viewpoint synthesis in generative world models.
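PosePilot 的光度变形损失沿用了自监督深度估计的标准做法:用深度与相对位姿把源帧反投影、重投影到目标帧,再比较光度误差。下面是 monodepth 风格的极简示意(非官方实现,接口与张量形状均为假设,且省略了 SSIM 等常用附加项):

```python
import torch
import torch.nn.functional as F

def photometric_warp_loss(img_tgt, img_src, depth_tgt, T_tgt2src, K):
    """光度变形损失示意。img_*: (B,3,H,W);depth_tgt: (B,1,H,W);
    T_tgt2src: (B,4,4) 目标到源的相对位姿;K: (3,3) 内参。"""
    B, _, H, W = img_tgt.shape
    dev = img_tgt.device
    ys, xs = torch.meshgrid(
        torch.arange(H, device=dev, dtype=torch.float32),
        torch.arange(W, device=dev, dtype=torch.float32), indexing="ij")
    pix = torch.stack([xs, ys, torch.ones_like(xs)], 0).reshape(3, -1)   # 齐次像素坐标
    cam = (torch.inverse(K) @ pix).unsqueeze(0) * depth_tgt.reshape(B, 1, -1)  # 反投影
    R, t = T_tgt2src[:, :3, :3], T_tgt2src[:, :3, 3:]
    cam_src = R @ cam + t                                  # 变换到源相机坐标系
    proj = K @ cam_src
    uv = proj[:, :2] / proj[:, 2:3].clamp(min=1e-6)        # 透视除法
    grid = torch.stack([2 * uv[:, 0] / (W - 1) - 1,
                        2 * uv[:, 1] / (H - 1) - 1], -1).reshape(B, H, W, 2)
    img_warp = F.grid_sample(img_src, grid, padding_mode="border", align_corners=True)
    return (img_tgt - img_warp).abs().mean()               # L1 光度误差
```

只有当深度与位姿同时正确时变形后的图像才能与目标帧对齐,因此该损失可以同时监督深度读出与位姿读出模块。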
zh

[CV-112] Probabilistic Interactive 3D Segmentation with Hierarchical Neural Processes ICML2025

【速读】:该论文旨在解决交互式3D分割中的两个关键问题:从稀疏用户点击中有效泛化以生成准确的物体掩码,以及量化预测不确定性以帮助用户识别不可靠区域。其解决方案的关键在于提出NPISeg3D,这是一个基于神经过程(Neural Processes, NPs)的概率框架,通过引入具有场景特定和物体特定潜在变量的分层潜在变量结构,增强了少样本泛化能力,并设计了概率原型调节器,以自适应地调节点击原型,提升模型捕捉对象感知上下文和量化预测不确定性的能力。

链接: https://arxiv.org/abs/2505.01726
作者: Jie Liu,Pan Zhou,Zehao Xiao,Jiayi Shen,Wenzhe Yin,Jan-Jakob Sonke,Efstratios Gavves
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: ICML 2025 Proceedings

点击查看摘要

Abstract:Interactive 3D segmentation has emerged as a promising solution for generating accurate object masks in complex 3D scenes by incorporating user-provided clicks. However, two critical challenges remain underexplored: (1) effectively generalizing from sparse user clicks to produce accurate segmentation, and (2) quantifying predictive uncertainty to help users identify unreliable regions. In this work, we propose NPISeg3D, a novel probabilistic framework that builds upon Neural Processes (NPs) to address these challenges. Specifically, NPISeg3D introduces a hierarchical latent variable structure with scene-specific and object-specific latent variables to enhance few-shot generalization by capturing both global context and object-specific characteristics. Additionally, we design a probabilistic prototype modulator that adaptively modulates click prototypes with object-specific latent variables, improving the model’s ability to capture object-aware context and quantify predictive uncertainty. Experiments on four 3D point cloud datasets demonstrate that NPISeg3D achieves superior segmentation performance with fewer clicks while providing reliable uncertainty estimations.
zh

[CV-113] Vision and Intention Boost Large Language Model in Long-Term Action Anticipation

【速读】:该论文旨在解决长期动作预测(Long-term Action Anticipation, LTA)中单模态方法存在的局限性,即仅依赖视频数据而缺乏先验知识,或使用文本输入时导致的信息丢失问题。其解决方案的关键在于提出一种新颖的意图条件视觉-语言(Intention-Conditioned Vision-Language, ICVL)模型,该模型通过视觉-语言模型(Vision-Language Model, VLM)从视频中直接推断出行为意图作为全面的文本特征,并将其与视觉特征进行多模态融合,生成增强的视觉表征。随后,这些表征与文本提示共同输入大语言模型(Large Language Model, LLM)以实现未来动作的预测。此外,还引入了一种结合视觉和文本相似性的示例选择策略,以提升上下文学习的相关性和信息量。

链接: https://arxiv.org/abs/2505.01713
作者: Congqi Cao,Lanshu Hu,Yating Yu,Yanning Zhang
机构: Northwestern Polytechnical University (西北工业大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Long-term action anticipation (LTA) aims to predict future actions over an extended period. Previous approaches primarily focus on learning exclusively from video data but lack prior knowledge. Recent research leverages large language models (LLMs) by utilizing text-based inputs which suffer severe information loss. To tackle the limitations that single-modality methods face, we propose a novel Intention-Conditioned Vision-Language (ICVL) model in this study that fully leverages the rich semantic information of visual data and the powerful reasoning capabilities of LLMs. Considering intention as a high-level concept guiding the evolution of actions, we first propose to employ a vision-language model (VLM) to infer behavioral intentions as comprehensive textual features directly from video inputs. The inferred intentions are then fused with visual features through a multi-modality fusion strategy, resulting in intention-enhanced visual representations. These enhanced visual representations, along with textual prompts, are fed into the LLM for future action anticipation. Furthermore, we propose an effective example selection strategy that jointly considers visual and textual similarities, providing more relevant and informative examples for in-context learning. Extensive experiments with state-of-the-art performance on Ego4D, EPIC-Kitchens-55, and EGTEA GAZE+ datasets fully demonstrate the effectiveness and superiority of the proposed method.
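文中"联合视觉与文本相似性的示例选择"可示意为对两种余弦相似度加权后取 top-k(权重 alpha 为假设的超参数,特征均假设已由 VLM 编码好):

```python
import torch

def select_examples(q_vis, q_txt, bank_vis, bank_txt, k=4, alpha=0.5):
    """上下文示例选择示意。q_vis/q_txt: (d,) 查询的视觉/文本特征;
    bank_vis/bank_txt: (N, d) 候选示例库的对应特征。"""
    qv = q_vis / q_vis.norm()
    qt = q_txt / q_txt.norm()
    bv = bank_vis / bank_vis.norm(dim=1, keepdim=True)
    bt = bank_txt / bank_txt.norm(dim=1, keepdim=True)
    score = alpha * (bv @ qv) + (1 - alpha) * (bt @ qt)   # 余弦相似度加权
    return torch.topk(score, k).indices                    # 返回 top-k 示例下标
```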
zh

[CV-114] Knowledge-Augmented Language Models Interpreting Structured Chest X-Ray Findings

【速读】:该论文旨在解决医学影像分析中如何有效利用大型语言模型(Large Language Models, LLMs)进行胸部X光片(Chest X-rays, CXR)自动解读的问题。现有方法虽在多模态基础模型上取得进展,但尚未充分挖掘LLMs在视觉任务中的潜力。其解决方案的关键在于提出CXR-TextInter框架,该框架通过仅依赖由上游图像分析流程生成的丰富结构化文本表示来操作,从而将以文本为中心的LLMs重新定位用于CXR解读,并集成医疗知识模块以增强临床推理能力。

链接: https://arxiv.org/abs/2505.01711
作者: Alexander Davis,Rafael Souza,Jia-Hao Lim
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Automated interpretation of chest X-rays (CXR) is a critical task with the potential to significantly improve clinical workflow and patient care. While recent advances in multimodal foundation models have shown promise, effectively leveraging the full power of large language models (LLMs) for this visual task remains an underexplored area. This paper introduces CXR-TextInter, a novel framework that repurposes powerful text-centric LLMs for CXR interpretation by operating solely on a rich, structured textual representation of the image content, generated by an upstream image analysis pipeline. We augment this LLM-centric approach with an integrated medical knowledge module to enhance clinical reasoning. To facilitate training and evaluation, we developed the MediInstruct-CXR dataset, containing structured image representations paired with diverse, clinically relevant instruction-response examples, and the CXR-ClinEval benchmark for comprehensive assessment across various interpretation tasks. Extensive experiments on CXR-ClinEval demonstrate that CXR-TextInter achieves state-of-the-art quantitative performance across pathology detection, report generation, and visual question answering, surpassing existing multimodal foundation models. Ablation studies confirm the critical contribution of the knowledge integration module. Furthermore, blinded human evaluation by board-certified radiologists shows a significant preference for the clinical quality of outputs generated by CXR-TextInter. Our work validates an alternative paradigm for medical image AI, showcasing the potential of harnessing advanced LLM capabilities when visual information is effectively structured and domain knowledge is integrated.
zh

[CV-115] RoBridge: A Hierarchical Architecture Bridging Cognition and Execution for General Robotic Manipulation

【速读】:该论文试图解决开放环境中机器人操作面临的程序技能困境(procedural skill dilemma)和陈述性技能困境(declarative skill dilemma),现有方法通常在认知与执行能力之间做出权衡。解决方案的关键在于提出RoBridge,一种分层智能架构,包含基于大规模预训练视觉-语言模型(VLM)的高层认知规划器(HCP)、作为符号桥梁的不变可操作表示(IOR)以及通用具身代理(GEA),通过结合VLM的陈述性技能与强化学习的程序性技能,有效弥合认知与执行之间的差距。

链接: https://arxiv.org/abs/2505.01709
作者: Kaidong Zhang,Rongtao Xu,Pengzhen Ren,Junfan Lin,Hefeng Wu,Liang Lin,Xiaodan Liang
机构: Sun Yat-sen University (中山大学); MBZUAI (MBZUAI); Peng Cheng Laboratory (鹏城实验室)
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: project page: this https URL

点击查看摘要

Abstract:Operating robots in open-ended scenarios with diverse tasks is a crucial research and application direction in robotics. While recent progress in natural language processing and large multimodal models has enhanced robots’ ability to understand complex instructions, robot manipulation still faces the procedural skill dilemma and the declarative skill dilemma in open environments. Existing methods often compromise cognitive and executive capabilities. To address these challenges, in this paper, we propose RoBridge, a hierarchical intelligent architecture for general robotic manipulation. It consists of a high-level cognitive planner (HCP) based on a large-scale pre-trained vision-language model (VLM), an invariant operable representation (IOR) serving as a symbolic bridge, and a generalist embodied agent (GEA). RoBridge maintains the declarative skill of VLM and unleashes the procedural skill of reinforcement learning, effectively bridging the gap between cognition and execution. RoBridge demonstrates significant performance improvements over existing baselines, achieving a 75% success rate on new tasks and an 83% average success rate in sim-to-real generalization using only five real-world data samples per task. This work represents a significant step towards integrating cognitive reasoning with physical execution in robotic systems, offering a new paradigm for general robotic manipulation.
zh

[CV-116] Component-Based Fairness in Face Attribute Classification with Bayesian Network-informed Meta Learning

【速读】:该论文试图解决面部属性预测中的偏见问题,特别是针对生物面部组件(biological face components)的公平性问题,这是以往研究中未被充分探索的方向。其解决方案的关键在于提出一种名为Bayesian Network-informed Meta Reweighting (BNMR) 的方法,该方法通过引入贝叶斯网络校准器(Bayesian Network calibrator)来指导基于元学习的样本重加权过程,从而动态跟踪模型偏见并编码面部组件属性的先验概率,以应对属性标签稀缺性和属性间依赖性这两个关键挑战。

链接: https://arxiv.org/abs/2505.01699
作者: Yifan Liu,Ruichen Yao,Yaokun Liu,Ruohan Zong,Zelin Li,Yang Zhang,Dong Wang
机构: School of Information Sciences, University of Illinois Urbana-Champaign; School of Information Sciences, University of Illinois at Urbana-Champaign
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted by ACM FAccT 2025

点击查看摘要

Abstract:The widespread integration of face recognition technologies into various applications (e.g., access control and personalized advertising) necessitates a critical emphasis on fairness. While previous efforts have focused on demographic fairness, the fairness of individual biological face components remains unexplored. In this paper, we focus on face component fairness, a fairness notion defined by biological face features. To the best of our knowledge, our work is the first to mitigate bias of face attribute prediction at the biological feature level. In this work, we identify two key challenges in optimizing face component fairness: attribute label scarcity and attribute inter-dependencies, both of which limit the effectiveness of bias mitigation from previous approaches. To address these issues, we propose Bayesian Network-informed Meta Reweighting (BNMR), which incorporates a Bayesian Network calibrator to guide an adaptive meta-learning-based sample reweighting process. During the training process of our approach, the Bayesian Network calibrator dynamically tracks model bias and encodes prior probabilities for face component attributes to overcome the above challenges. To demonstrate the efficacy of our approach, we conduct extensive experiments on a large-scale real-world human face dataset. Our results show that BNMR is able to consistently outperform recent face bias mitigation baselines. Moreover, our results suggest a positive impact of face component fairness on the commonly considered demographic fairness (e.g., gender). Our findings pave the way for new research avenues on face component fairness, suggesting that face component fairness could serve as a potential surrogate objective for demographic fairness. The code for our work is publicly available at this https URL.
zh

[CV-117] Topology-Aware CLIP Few-Shot Learning

【速读】:该论文旨在解决如何高效地将大规模视觉-语言模型(VLM)如CLIP适配到少样本学习(few-shot learning)任务中,同时平衡预训练知识的保留与任务特定的适应问题。其解决方案的关键在于引入一种拓扑感知的微调方法,通过在任务残差(Task Residual, TR)框架中整合表示拓扑差异(Representation Topology Divergence, RTD),显式对齐视觉和文本表示的拓扑结构,从而在冻结基础VLM编码器的前提下,仅优化轻量级的任务残差参数,有效利用拓扑信息以提升少样本性能。

链接: https://arxiv.org/abs/2505.01694
作者: Dazhi Huang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Efficiently adapting large Vision-Language Models (VLMs) like CLIP for few-shot learning poses challenges in balancing pre-trained knowledge retention and task-specific adaptation. Existing methods often overlook valuable structural information within the VLM’s latent space. We introduce a topology-aware tuning approach integrating Representation Topology Divergence (RTD) into the Task Residual (TR) framework. By explicitly aligning the topological structures of visual and text representations using a combined RTD and Cross-Entropy loss, while freezing base VLM encoders, our method enhances few-shot performance. We optimize only lightweight Task Residual parameters, effectively leveraging topological information. Across 6 diverse benchmark datasets, our approach demonstrates significant gains, achieving an average accuracy improvement of 1-2% over relevant baseline methods in few-shot settings. This work presents an effective strategy to boost VLM few-shot capabilities by incorporating topological alignment.
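Task Residual(TR)本身的机制很轻量:冻结 CLIP 的类别文本特征,只学习一个加性残差作为任务特定调整。下面给出极简示意(非论文官方代码;RTD 拓扑对齐损失依赖持续同调计算,此处从略,仅演示被其约束的 TR 部分):

```python
import torch
import torch.nn as nn

class TaskResidual(nn.Module):
    """Task Residual 调优示意:分类权重 = 冻结文本特征 t + alpha * 可学习残差 r。"""
    def __init__(self, text_feats: torch.Tensor, alpha: float = 0.5):
        super().__init__()
        self.register_buffer("t", text_feats)              # (K, d),来自冻结的CLIP文本编码器
        self.r = nn.Parameter(torch.zeros_like(text_feats))  # 唯一可训练参数
        self.alpha = alpha

    def forward(self, img_feats, temp=0.01):
        w = self.t + self.alpha * self.r                   # 残差调制后的类别权重
        w = w / w.norm(dim=1, keepdim=True)
        v = img_feats / img_feats.norm(dim=1, keepdim=True)
        return v @ w.t() / temp                            # 分类 logits
```

训练时只需对 r 做梯度更新,并在交叉熵之外叠加 RTD 损失来对齐视觉与文本表示的拓扑结构。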
zh

[CV-118] Automated ARAT Scoring Using Multimodal Video Analysis, Multi-View Fusion, and Hierarchical Bayesian Models: A Clinician Study

【速读】:该论文试图解决上肢功能评估中手动评分Action Research Arm Test (ARAT)耗时且存在变异的问题。其解决方案的关键在于提出一种自动化ARAT评分系统,该系统结合了多模态视频分析与SlowFast、I3D及基于Transformer的模型,并利用OpenPose关键点和物体位置信息进行特征提取。通过多视角数据(同侧、对侧和顶部视角)以及早期和晚期融合策略,实现跨视角和模型的特征整合,同时采用分层贝叶斯模型(HBMs)提升运动质量成分的可解释性,从而提供一种可扩展且具有临床验证的自动化康复评估方案。

链接: https://arxiv.org/abs/2505.01680
作者: Tamim Ahmed,Thanassis Rikakis
机构: University of Southern California (南加州大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Probability (math.PR)
备注:

点击查看摘要

Abstract:Manual scoring of the Action Research Arm Test (ARAT) for upper extremity assessment in stroke rehabilitation is time-intensive and variable. We propose an automated ARAT scoring system integrating multimodal video analysis with SlowFast, I3D, and Transformer-based models using OpenPose keypoints and object locations. Our approach employs multi-view data (ipsilateral, contralateral, and top perspectives), applying early and late fusion to combine features across views and models. Hierarchical Bayesian Models (HBMs) infer movement quality components, enhancing interpretability. A clinician dashboard displays task scores, execution times, and quality assessments. We conducted a study with five clinicians who reviewed 500 video ratings generated by our system, providing feedback on its accuracy and usability. Evaluated on a stroke rehabilitation dataset, our framework achieves 89.0% validation accuracy with late fusion, with HBMs aligning closely with manual assessments. This work advances automated rehabilitation by offering a scalable, interpretable solution with clinical validation.
zh

[CV-119] Soft-Masked Semi-Dual Optimal Transport for Partial Domain Adaptation

【速读】:该论文旨在解决部分领域自适应(Partial Domain Adaptation, PDA)问题,即在目标域的标签空间是源域标签空间子集的情况下,如何学习具有判别性和领域不变性的表示。其解决方案的关键在于提出一种软掩码半对偶最优传输(Soft-masked Semi-dual Optimal Transport, SSOT)方法,通过估计领域类别权重并构建加权源域,实现与目标域的类别条件分布匹配;同时利用类别预测构建软掩码传输距离矩阵,增强最优传输在共享特征空间中的类别导向表示能力,并结合神经网络近似Kantorovich势函数,以支持大规模最优传输问题的求解和端到端优化。
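
为说明“半对偶熵正则 Kantorovich 问题可由梯度法优化、对偶势可用神经网络参数化”这一关键点,下面给出一个最小可运行草图(非论文官方实现):样本、代价函数与超参数均为随意假设,论文中的类权重重加权与软掩码传输矩阵在此略去。

```python
import torch

torch.manual_seed(0)
n, m, d, eps = 64, 80, 2, 0.05
xs, xt = torch.randn(n, d), torch.randn(m, d) + 1.0   # 源/目标样本
a = torch.full((n,), 1.0 / n)                          # 源权重(论文中会按类权重重加权)
b = torch.full((m,), 1.0 / m)
C = torch.cdist(xs, xt) ** 2                           # 代价矩阵(论文中还会施加软掩码)

# 用小网络参数化目标域上的 Kantorovich 势 v(y)
v_net = torch.nn.Sequential(torch.nn.Linear(d, 64), torch.nn.ReLU(), torch.nn.Linear(64, 1))
opt = torch.optim.Adam(v_net.parameters(), lr=1e-2)

for it in range(500):
    v = v_net(xt).squeeze(-1)                          # [m]
    # c-变换(软最小):u_i = -eps * log(sum_j b_j * exp((v_j - C_ij)/eps))
    u = -eps * torch.logsumexp((v[None, :] - C) / eps + b.log()[None, :], dim=1)
    semi_dual = (a * u).sum() + (b * v).sum()          # 略去不影响优化的常数项 -eps
    loss = -semi_dual                                  # 最大化半对偶目标
    opt.zero_grad(); loss.backward(); opt.step()

print(float(semi_dual))  # 近似的熵正则 OT 距离
```

其中 c-变换用 logsumexp 实现,可在数值上稳定地完成软最小;网络参数化也使对偶变量能够泛化到输入分布支撑之外,与摘要中的描述一致。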

链接: https://arxiv.org/abs/2505.01664
作者: Yi-Ming Zhai,Chuan-Xian Ren,Hong Yan
机构: Sun Yat-Sen University (中山大学); City University of Hong Kong (香港城市大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Visual domain adaptation aims to learn discriminative and domain-invariant representation for an unlabeled target domain by leveraging knowledge from a labeled source domain. Partial domain adaptation (PDA) is a general and practical scenario in which the target label space is a subset of the source one. The challenges of PDA exist due to not only domain shift but also the non-identical label spaces of domains. In this paper, a Soft-masked Semi-dual Optimal Transport (SSOT) method is proposed to deal with the PDA problem. Specifically, the class weights of domains are estimated, and then a reweighed source domain is constructed, which is favorable in conducting class-conditional distribution matching with the target domain. A soft-masked transport distance matrix is constructed by category predictions, which will enhance the class-oriented representation ability of optimal transport in the shared feature space. To deal with large-scale optimal transport problems, the semi-dual formulation of the entropy-regularized Kantorovich problem is employed since it can be optimized by gradient-based algorithms. Further, a neural network is exploited to approximate the Kantorovich potential due to its strong fitting ability. This network parametrization also allows the generalization of the dual variable outside the supports of the input distribution. The SSOT model is built upon neural networks, which can be optimized alternately in an end-to-end manner. Extensive experiments are conducted on four benchmark datasets to demonstrate the effectiveness of SSOT.
zh

[CV-120] RAGAR: Retrieval Augment Personalized Image Generation Guided by Recommendation

【速读】:该论文旨在解决个性化图像生成中的两个主要问题:一是现有方法在提取用户偏好时对用户历史序列中的所有项目赋予相等权重,忽略了历史项目与参考图像之间的语义相似性差异,导致低相似性项目权重过高而扭曲用户的视觉偏好;二是现有方法过度依赖生成图像与参考图像的一致性进行优化,导致用户偏好欠拟合并阻碍个性化。其解决方案的关键在于提出一种基于推荐引导的检索增强个性化图像生成方法(RAGAR),通过检索机制根据历史项目与参考图像的相似性分配不同权重,从而更精确地提取用户的视觉偏好,并引入基于多模态排序模型的新排名任务来优化生成图像的个性化,而非依赖一致性约束。
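
检索加权这一步的核心只是“按历史物品与参考物品的相似度对历史特征做 softmax 加权”。下面是一个示意性草图(特征来源、温度 tau 等均为假设,并非论文公开的确切公式):

```python
import torch
import torch.nn.functional as F

def weighted_preference(history_feats, ref_feat, tau=0.1):
    """按历史物品与参考物品的余弦相似度加权聚合,得到更精细的用户视觉偏好。
    history_feats: [N, D] 历史物品的多模态特征;ref_feat: [D] 参考物品特征(均为假设的输入)。"""
    h = F.normalize(history_feats, dim=-1)
    r = F.normalize(ref_feat, dim=-1)
    sims = h @ r                       # [N] 语义相似度
    w = torch.softmax(sims / tau, 0)   # 低相似历史项权重被抑制,避免扭曲偏好
    return (w[:, None] * history_feats).sum(0)  # [D] 偏好向量,供生成器条件化

pref = weighted_preference(torch.randn(8, 512), torch.randn(512))
```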

链接: https://arxiv.org/abs/2505.01657
作者: Run Ling,Wenji Wang,Yuting Liu,Guibing Guo,Linying Jiang,Xingwei Wang
机构: Northeastern University, China (东北大学)
类目: Information Retrieval (cs.IR); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Personalized image generation is crucial for improving the user experience, as it renders reference images into preferred ones according to user visual preferences. Although effective, existing methods face two main issues. First, existing methods treat all items in the user historical sequence equally when extracting user preferences, overlooking the varying semantic similarities between historical items and the reference item. Disproportionately high weights for low-similarity items distort users’ visual preferences for the reference item. Second, existing methods heavily rely on consistency between generated and reference images to optimize the generation, which leads to underfitting user preferences and hinders personalization. To address these issues, we propose Retrieval Augment Personalized Image GenerAtion guided by Recommendation (RAGAR). Our approach uses a retrieval mechanism to assign different weights to historical items according to their similarities to the reference item, thereby extracting more refined users’ visual preferences for the reference item. Then we introduce a novel ranking task based on a multi-modal ranking model to optimize the personalization of the generated images instead of forcing dependence on consistency. Extensive experiments and human evaluations on three real-world datasets demonstrate that RAGAR achieves significant improvements in both personalization and semantic metrics compared to five baselines.
zh

[CV-121] A Novel WaveInst-based Network for Tree Trunk Structure Extraction and Pattern Analysis in Forest Inventory

【速读】:该论文旨在解决在复杂背景干扰和遮挡条件下,从二维图像中准确提取树木枝干结构信息的问题。其解决方案的关键在于提出了一种新型的WaveInst实例分割框架,该框架结合离散小波变换(Discrete Wavelet Transform)以增强多尺度边缘信息,从而提升树木结构提取的准确性。通过该方法,研究者不仅实现了对成熟树和幼树结构的高精度提取,还进一步将分割模型与回归模型集成,直接从二维图像中获取关键生长参数,如树木位置、胸高直径和植株高度。

链接: https://arxiv.org/abs/2505.01656
作者: Chenyang Fan,Xujie Zhu,Taige Luo,Sheng Xu,Zhulin Chen,Hongxin Yang
机构: Nanjing Forestry University (南京林业大学); Virginia Tech (弗吉尼亚理工大学); Chinese Academy of Forestry (中国林业科学研究院); State Forestry and Grassland Administration (国家林业和草原局); East China Normal University (华东师范大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:The pattern analysis of tree structure holds significant scientific value for genetic breeding and forestry management. The current trunk and branch extraction technologies are mainly LiDAR-based or UAV-based. The former approaches obtain high-precision 3D data, but their equipment cost is high and the three-dimensional (3D) data processing is complex. The latter approaches efficiently capture canopy information, but they miss the 3D structure of trees. To deal with branch information extraction under complex background interference and occlusion, this work proposes a novel WaveInst instance segmentation framework involving a discrete wavelet transform to enhance multi-scale edge information and improve tree structure extraction. Experimental results of the proposed model show superior performance on SynthTree43k, CaneTree100, Urban Street and our PoplarDataset. Moreover, we present a new phenotypic dataset, PoplarDataset, dedicated to tree structure extraction and pattern analysis in artificial forests. The proposed method achieves a mean average precision of 49.6 and 24.3 for the structure extraction of mature and juvenile trees, respectively, surpassing the existing state-of-the-art method by 9.9. Furthermore, by integrating the segmentation model with the regression model, we accurately obtain key tree growth parameters, such as tree location, diameter at breast height of individual trees, and plant height, directly from 2D images. This study provides a scientific basis and abundant data for tree structure analysis related to phenotype research, offering a platform for significant applications in precision forestry, ecological monitoring, and intelligent breeding.
zh

[CV-122] Toward Onboard AI-Enabled Solutions to Space Object Detection for Space Sustainability

【速读】:该论文旨在解决低地球轨道(LEO)卫星在大规模星座中面临的空间目标检测(SOD)问题,以实现碰撞评估与规避。其解决方案的关键在于利用基于深度学习(DL)模型的视觉传感器进行高精度、低延迟的空间目标检测,具体引入了结合Squeeze-and-Excitation(SE)层、Vision Transformer(ViT)和Generalized Efficient Layer Aggregation Network(GELAN)的模型,并通过实验验证了这些模型在SOD任务中的有效性。
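
论文所用的 Squeeze-and-Excitation(SE)层是公开的标准组件,下面给出其极简 PyTorch 实现以说明“压缩-激励-重标定”的流程(reduction 比例取常用默认值,与论文的具体配置无关):

```python
import torch

class SEBlock(torch.nn.Module):
    """标准 Squeeze-and-Excitation 层的极简实现(通道注意力)。"""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc = torch.nn.Sequential(
            torch.nn.Linear(channels, channels // reduction), torch.nn.ReLU(),
            torch.nn.Linear(channels // reduction, channels), torch.nn.Sigmoid())

    def forward(self, x):                 # x: [B, C, H, W]
        s = x.mean(dim=(2, 3))            # squeeze:全局平均池化 -> [B, C]
        w = self.fc(s)[:, :, None, None]  # excitation:逐通道权重
        return x * w                      # 重标定特征图

y = SEBlock(64)(torch.randn(2, 64, 32, 32))
```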

链接: https://arxiv.org/abs/2505.01650
作者: Wenxuan Zhang,Peng Hu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注: This paper has been accepted at the 18th International Conference on Space Operations (SpaceOps 2025)

点击查看摘要

Abstract:The rapid expansion of advanced low-Earth orbit (LEO) satellites in large constellations is positioning space assets as key to the future, enabling global internet access and relay systems for deep space missions. A solution to the challenge is effective space object detection (SOD) for collision assessment and avoidance. In SOD, an LEO satellite must detect other satellites and objects with high precision and minimal delay. This paper investigates the feasibility and effectiveness of employing vision sensors for SOD tasks based on deep learning (DL) models. It introduces models based on the Squeeze-and-Excitation (SE) layer, Vision Transformer (ViT), and the Generalized Efficient Layer Aggregation Network (GELAN) and evaluates their performance under SOD scenarios. Experimental results show that the proposed models achieve mean average precision at intersection over union threshold 0.5 (mAP50) scores of up to 0.751 and mean average precision averaged over intersection over union thresholds from 0.5 to 0.95 (mAP50:95) scores of up to 0.280. Compared to the baseline GELAN-t model, the proposed GELAN-ViT-SE model increases the average mAP50 from 0.721 to 0.751, improves the mAP50:95 from 0.266 to 0.274, reduces giga floating point operations (GFLOPs) from 7.3 to 5.6, and lowers peak power consumption from 2080.7 mW to 2028.7 mW by 2.5%.
zh

[CV-123] Multimodal and Multiview Deep Fusion for Autonomous Marine Navigation

【速读】:该论文旨在解决船舶自主航行中环境感知不准确的问题,通过多模态传感器融合构建可靠的船舶周围鸟瞰视图。解决方案的关键在于采用基于交叉注意力机制的Transformer模型,实现多视角RGB图像、长波红外图像与稀疏LiDAR点云的深度融合,并结合X波段雷达和电子海图数据进行训练,从而提升导航的准确性与鲁棒性。
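
交叉注意力融合的骨架可以用 PyTorch 自带的多头注意力直接示意:以相机侧 token 为 query、LiDAR 侧 token 为 key/value(维度与 token 数均为假设;论文中 X 波段雷达与电子海图数据的接入方式未在此体现):

```python
import torch

class CrossModalFusion(torch.nn.Module):
    """以相机 token 为 query、LiDAR token 为 key/value 的交叉注意力融合示意。"""
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.attn = torch.nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = torch.nn.LayerNorm(dim)

    def forward(self, cam_tokens, lidar_tokens):
        # cam_tokens: [B, Nc, D](多视角 RGB/红外特征);lidar_tokens: [B, Nl, D](稀疏点云特征)
        fused, _ = self.attn(cam_tokens, lidar_tokens, lidar_tokens)
        return self.norm(cam_tokens + fused)  # 残差 + 归一化,输出可投到 BEV 的融合特征

out = CrossModalFusion()(torch.randn(2, 100, 256), torch.randn(2, 50, 256))
```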

链接: https://arxiv.org/abs/2505.01615
作者: Dimitrios Dagdilelis,Panagiotis Grigoriadis,Roberto Galeazzi
机构: Technical University of Denmark(丹麦技术大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:We propose a cross attention transformer based method for multimodal sensor fusion to build a bird's-eye view of a vessel's surroundings, supporting safer autonomous marine navigation. The model deeply fuses multiview RGB and long wave infrared images with sparse LiDAR point clouds. Training also integrates X band radar and electronic chart data to inform predictions. The resulting view provides a detailed, reliable scene representation, improving navigational accuracy and robustness. Real-world sea trials confirm the method's effectiveness even in adverse weather and complex maritime settings.
zh

[CV-124] TEMPURA: Temporal Event Masked Prediction and Understanding for Reasoning in Action

【速读】:该论文试图解决视频中因果事件关系理解与细粒度时间定位的问题(temporal grounding),现有方法通过压缩视频标记或将视频视为未分割的流,导致细粒度事件边界模糊并限制了因果依赖建模。解决方案的关键在于提出TEMPURA(Temporal Event Masked Prediction and Understanding for Reasoning in Action),一个两阶段训练框架,首先通过掩码事件预测推理来重建缺失事件并生成逐步因果解释,其次学习进行视频分割和密集描述,以分解视频为非重叠事件并提供时间戳对齐的详细描述。

链接: https://arxiv.org/abs/2505.01583
作者: Jen-Hao Cheng,Vivian Wang,Huayu Wang,Huapeng Zhou,Yi-Hao Peng,Hou-I Liu,Hsiang-Wei Huang,Kuang-Ming Chen,Cheng-Yen Yang,Wenhao Chai,Yi-Ling Chen,Vibhav Vineet,Qin Cai,Jenq-Neng Hwang
机构: University of Washington (华盛顿大学); Carnegie Mellon University (卡内基梅隆大学); National Yang Ming Chiao Tung University (国立阳明交通大学); Microsoft (微软)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Understanding causal event relationships and achieving fine-grained temporal grounding in videos remain challenging for vision-language models. Existing methods either compress video tokens to reduce temporal resolution, or treat videos as unsegmented streams, which obscures fine-grained event boundaries and limits the modeling of causal dependencies. We propose TEMPURA (Temporal Event Masked Prediction and Understanding for Reasoning in Action), a two-stage training framework that enhances video temporal understanding. TEMPURA first applies masked event prediction reasoning to reconstruct missing events and generate step-by-step causal explanations from dense event annotations, drawing inspiration from effective infilling techniques. TEMPURA then learns to perform video segmentation and dense captioning to decompose videos into non-overlapping events with detailed, timestamp-aligned descriptions. We train TEMPURA on VER, a large-scale dataset curated by us that comprises 1M training instances and 500K videos with temporally aligned event descriptions and structured reasoning steps. Experiments on temporal grounding and highlight detection benchmarks demonstrate that TEMPURA outperforms strong baseline models, confirming that integrating causal reasoning with fine-grained temporal segmentation leads to improved video understanding.
zh

[CV-125] Grounding Task Assistance with Multimodal Cues from a Single Demonstration

【速读】:该论文试图解决传统RGB视频在捕捉人类行为中的细微情境线索(如意图、安全关键环境因素和细微偏好)方面的局限性,这种感官差距限制了视觉语言模型(Vision Language Models, VLMs)对动作原因的推理能力以及对个体用户的适应能力。解决方案的关键在于引入MICA(Multimodal Interactive Contextualized Assistance)框架,通过整合眼动和语音线索,将演示分割为有意义的子任务,并提取关键帧和描述,从而增强视觉问答任务的上下文基础。实验表明,多模态线索显著优于基于帧的检索方法,尤其在任务类型影响隐式(眼动)与显式(语音)线索的有效性时,凸显了自适应多模态模型的重要性。

链接: https://arxiv.org/abs/2505.01578
作者: Gabriel Sarch,Balasaravanan Thoravi Kumaravel,Sahithya Ravi,Vibhav Vineet,Andrew D. Wilson
机构: Microsoft Research (微软研究院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:A person’s demonstration often serves as a key reference for others learning the same task. However, RGB video, the dominant medium for representing these demonstrations, often fails to capture fine-grained contextual cues such as intent, safety-critical environmental factors, and subtle preferences embedded in human behavior. This sensory gap fundamentally limits the ability of Vision Language Models (VLMs) to reason about why actions occur and how they should adapt to individual users. To address this, we introduce MICA (Multimodal Interactive Contextualized Assistance), a framework that improves conversational agents for task assistance by integrating eye gaze and speech cues. MICA segments demonstrations into meaningful sub-tasks and extracts keyframes and captions that capture fine-grained intent and user-specific cues, enabling richer contextual grounding for visual question answering. Evaluations on questions derived from real-time chat-assisted task replication show that multimodal cues significantly improve response quality over frame-based retrieval. Notably, gaze cues alone achieve 93% of speech performance, and their combination yields the highest accuracy. Task type determines the effectiveness of implicit (gaze) vs. explicit (speech) cues, underscoring the need for adaptable multimodal models. These results highlight the limitations of frame-based context and demonstrate the value of multimodal signals for real-world AI task assistance.
zh

[CV-126] PainFormer: a Vision Foundation Model for Automatic Pain Assessment

【速读】:该论文旨在解决疼痛评估的准确性和可靠性问题,以支持更有效的疼痛管理方案。其关键解决方案是提出PainFormer,一个基于多任务学习原则的视觉基础模型,该模型在14个任务/数据集上进行训练,包含总计1090万样本。PainFormer作为多种输入模态的嵌入提取器,与基于Transformer的Embedding-Mixer模块结合,实现最终的疼痛评估,从而有效提取来自不同模态(如RGB、合成热成像、估计深度视频及ECG、EMG、GSR和fNIRS等生理信号)的高质量嵌入表示。

链接: https://arxiv.org/abs/2505.01571
作者: Stefanos Gkikas,Raul Fernandez Rojas,Manolis Tsiknakis
机构: Hellenic Mediterranean University, Department of Electrical and Computer Engineering, Heraklion, Crete 714 10, Greece; Institute of Computer Science, Foundation for Research & Technology-Hellas, Heraklion, Crete GR-70013 Greece; University of Canberra, Faculty of Science and Technology, Canberra, ACT 2617, Australia
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Pain is a manifold condition that impacts a significant percentage of the population. Accurate and reliable pain evaluation for the people suffering is crucial to developing effective and advanced pain management protocols. Automatic pain assessment systems provide continuous monitoring and support decision-making processes, ultimately aiming to alleviate distress and prevent functionality decline. This study introduces PainFormer, a vision foundation model based on multi-task learning principles trained simultaneously on 14 tasks/datasets with a total of 10.9 million samples. Functioning as an embedding extractor for various input modalities, the foundation model provides feature representations to the Embedding-Mixer, a transformer-based module that performs the final pain assessment. Extensive experiments employing behavioral modalities-including RGB, synthetic thermal, and estimated depth videos-and physiological modalities such as ECG, EMG, GSR, and fNIRS revealed that PainFormer effectively extracts high-quality embeddings from diverse input modalities. The proposed framework is evaluated on two pain datasets, BioVid and AI4Pain, and directly compared to 73 different methodologies documented in the literature. Experiments conducted in unimodal and multimodal settings demonstrate state-of-the-art performances across modalities and pave the way toward general-purpose models for automatic pain assessment.
zh

[CV-127] A Sensor Agnostic Domain Generalization Framework for Leveraging Geospatial Foundation Models: Enhancing Semantic Segmentation via Synergistic Pseudo-Labeling and Generative Learning CVPR2025

【速读】:该论文试图解决遥感领域中高精度分割模型依赖大量标注数据的问题,尤其是在标注稀缺及传感器、光照和地理条件变化带来的挑战下。其解决方案的关键在于引入一种领域泛化方法,通过结合软对齐伪标签与源到目标的生成预训练,利用新兴的地理空间基础模型来提升模型的泛化能力,并进一步提供了基于MAE(Masked Autoencoder)的生成学习在领域不变特征学习中的数学见解。
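
软对齐伪标签的具体公式论文未完全公开;下面按常见做法给出一个置信度加权的软伪标签损失示意草图(温度 T 与阈值 conf_thresh 均为假设的超参数):

```python
import torch
import torch.nn.functional as F

def soft_pseudo_label_loss(student_logits, teacher_logits, T=2.0, conf_thresh=0.7):
    """置信度加权的软伪标签损失示意:教师概率经温度软化后作为对齐目标。"""
    with torch.no_grad():
        soft_t = F.softmax(teacher_logits / T, dim=-1)   # 软伪标签
        conf, _ = soft_t.max(dim=-1)
        w = (conf > conf_thresh).float() * conf          # 低置信样本被抑制
    log_s = F.log_softmax(student_logits / T, dim=-1)
    per_sample = -(soft_t * log_s).sum(-1)               # 软交叉熵
    return (w * per_sample).sum() / w.sum().clamp_min(1e-6)

loss = soft_pseudo_label_loss(torch.randn(16, 10), torch.randn(16, 10))
```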

链接: https://arxiv.org/abs/2505.01558
作者: Anan Yaghmour,Melba M. Crawford,Saurabh Prasad
机构: University of Houston(休斯顿大学); Purdue University(普渡大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted in the 2025 CVPR Workshop on Foundation and Large Vision Models in Remote Sensing, to appear in CVPR 2025 Workshop Proceedings

点击查看摘要

Abstract:Remote sensing enables a wide range of critical applications such as land cover and land use mapping, crop yield prediction, and environmental monitoring. Advances in satellite technology have expanded remote sensing datasets, yet high-performance segmentation models remain dependent on extensive labeled data, challenged by annotation scarcity and variability across sensors, illumination, and geography. Domain adaptation offers a promising solution to improve model generalization. This paper introduces a domain generalization approach to leveraging emerging geospatial foundation models by combining soft-alignment pseudo-labeling with source-to-target generative pre-training. We further provide new mathematical insights into MAE-based generative learning for domain-invariant feature learning. Experiments with hyperspectral and multispectral remote sensing datasets confirm our method’s effectiveness in enhancing adaptability and segmentation.
zh

[CV-128] Rethinking RGB-Event Semantic Segmentation with a Novel Bidirectional Motion-enhanced Event Representation

【速读】:该论文旨在解决RGB-Event融合中的三个固有对齐问题:时间、空间和模态对齐。现有体素网格表示方法忽略了连续事件窗口之间的时序相关性,并且其通过简单累积异步稀疏事件的表述方式与RGB模态的同步性和密集性不兼容。解决方案的关键在于提出一种新的事件表示方法——Motion-enhanced Event Tensor (MET),该方法通过利用密集光流和事件时序特征,将稀疏事件体素转换为密集且时序一致的形式。此外,还引入了Frequency-aware Bidirectional Flow Aggregation Module (BFAM)和Temporal Fusion Module (TFM),以解决模态和时空对齐问题。

链接: https://arxiv.org/abs/2505.01548
作者: Zhen Yao,Xiaowen Ying,Mooi Choo Chuah
机构: Lehigh University (莱赫igh大学); Qualcomm AI Research (高通人工智能研究)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 12 pages, 9 figures

点击查看摘要

Abstract:Event cameras capture motion dynamics, offering a unique modality with great potential in various computer vision tasks. However, RGB-Event fusion faces three intrinsic misalignments: (i) temporal, (ii) spatial, and (iii) modal misalignment. Existing voxel grid representations neglect temporal correlations between consecutive event windows, and their formulation with simple accumulation of asynchronous and sparse events is incompatible with the synchronous and dense nature of RGB modality. To tackle these challenges, we propose a novel event representation, Motion-enhanced Event Tensor (MET), which transforms sparse event voxels into a dense and temporally coherent form by leveraging dense optical flows and event temporal features. In addition, we introduce a Frequency-aware Bidirectional Flow Aggregation Module (BFAM) and a Temporal Fusion Module (TFM). BFAM leverages the frequency domain and MET to mitigate modal misalignment, while bidirectional flow aggregation and temporal fusion mechanisms resolve spatiotemporal misalignment. Experimental results on two large-scale datasets demonstrate that our framework significantly outperforms state-of-the-art RGB-Event semantic segmentation approaches. Our code is available at: this https URL.
zh

[CV-129] Automated Parsing of Engineering Drawings for Structured Information Extraction Using a Fine-tuned Document Understanding Transformer

【速读】:该论文旨在解决从二维工程图纸中准确提取关键信息的问题,这一过程对于高精度制造至关重要。传统手动提取方法效率低且易出错,而传统光学字符识别(OCR)技术在处理复杂布局和重叠符号时表现不佳,导致输出非结构化。论文提出的解决方案是一种融合定向边界框(OBB)检测模型与基于Transformer的文档解析模型(Donut)的混合深度学习框架。其关键在于利用自定义标注数据集训练YOLOv11以检测九类关键信息,并将检测到的OBB裁剪后用于微调Donut模型,从而生成结构化的JSON输出。

链接: https://arxiv.org/abs/2505.01530
作者: Muhammad Tayyab Khan,Zane Yong,Lequn Chen,Jun Ming Tan,Wenhe Feng,Seung Ki Moon
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: This paper has been submitted to the IEEE International Conference on Industrial Engineering and Engineering Management (IEEM 2025)

点击查看摘要

Abstract:Accurate extraction of key information from 2D engineering drawings is crucial for high-precision manufacturing. Manual extraction is time-consuming and error-prone, while traditional Optical Character Recognition (OCR) techniques often struggle with complex layouts and overlapping symbols, resulting in unstructured outputs. To address these challenges, this paper proposes a novel hybrid deep learning framework for structured information extraction by integrating an oriented bounding box (OBB) detection model with a transformer-based document parsing model (Donut). An in-house annotated dataset is used to train YOLOv11 for detecting nine key categories: Geometric Dimensioning and Tolerancing (GDT), General Tolerances, Measures, Materials, Notes, Radii, Surface Roughness, Threads, and Title Blocks. Detected OBBs are cropped into images and labeled to fine-tune Donut for structured JSON output. Fine-tuning strategies include a single model trained across all categories and category-specific models. Results show that the single model consistently outperforms category-specific ones across all evaluation metrics, achieving higher precision (94.77% for GDT), recall (100% for most), and F1 score (97.3%), while reducing hallucination (5.23%). The proposed framework improves accuracy, reduces manual effort, and supports scalable deployment in precision-driven industries.
zh

[CV-130] WorldGenBench: A World-Knowledge-Integrated Benchmark for Reasoning -Driven Text-to-Image Generation

【速读】:该论文试图解决当前文本到图像(Text-to-Image, T2I)生成模型在处理需要丰富世界知识和隐式推理的提示时表现不足的问题,这些问题对于生成语义准确、连贯且符合上下文的图像至关重要。解决方案的关键是引入了\textbfWorldGenBench,一个用于系统评估T2I模型世界知识基础和隐式推理能力的基准测试,并提出了\textbfKnowledge Checklist Score,一种结构化指标,用于衡量生成图像满足关键语义期望的程度。
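
Knowledge Checklist Score 的确切计算方式以论文为准;其“清单满足率”式的直觉可以用如下纯 Python 草图示意(清单条目与加权接口均为假设,条目判定可交由 VQA 模型自动完成):

```python
def knowledge_checklist_score(checks, weights=None):
    """清单式打分示意:checks 的键为一条由提示推出的语义期望,
    值为生成图像是否满足该期望;weights 允许对关键期望加权(接口为假设)。"""
    if weights is None:
        weights = {k: 1.0 for k in checks}
    hit = sum(weights[k] for k, ok in checks.items() if ok)
    return hit / sum(weights.values())

score = knowledge_checklist_score({
    "era-consistent clothing": True,      # 服饰符合时代背景
    "correct landmark depicted": False,   # 地标绘制错误
    "physically plausible lighting": True,
})
print(f"Knowledge Checklist Score = {score:.2f}")  # 0.67
```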

链接: https://arxiv.org/abs/2505.01490
作者: Daoan Zhang,Che Jiang,Ruoshi Xu,Biaoxiang Chen,Zijian Jin,Yutian Lu,Jianguo Zhang,Liang Yong,Jiebo Luo,Shengda Luo
机构: University of Rochester (罗切斯特大学); Chinese Medicine Guangdong Laboratory (广东省中医药实验室); Southern University of Science and Technology (南方科技大学); New York University (纽约大学); Datawhale org. (数据鲸组织)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Recent advances in text-to-image (T2I) generation have achieved impressive results, yet existing models still struggle with prompts that require rich world knowledge and implicit reasoning: both of which are critical for producing semantically accurate, coherent, and contextually appropriate images in real-world scenarios. To address this gap, we introduce WorldGenBench, a benchmark designed to systematically evaluate T2I models’ world knowledge grounding and implicit inferential capabilities, covering both the humanities and nature domains. We propose the Knowledge Checklist Score, a structured metric that measures how well generated images satisfy key semantic expectations. Experiments across 21 state-of-the-art models reveal that while diffusion models lead among open-source methods, proprietary auto-regressive models like GPT-4o exhibit significantly stronger reasoning and knowledge integration. Our findings highlight the need for deeper understanding and inference capabilities in next-generation T2I systems. Project Page: this https URL
zh

[CV-131] VideoHallu: Evaluating and Mitigating Multi-modal Hallucinations for Synthetic Videos

【速读】:该论文试图解决合成视频生成中违反常识和物理规律的问题,即生成内容存在异常现象。现有评估指标如VideoScore未能有效检测此类问题,缺乏可解释性。解决方案的关键在于利用多模态大语言模型(MLLMs)作为可解释的评估器,通过设计专家级问答任务来测试模型在合成视频中的推理能力,并采用Group Relative Policy Optimization(GRPO)方法对模型进行微调,以提升其在常识和物理任务上的准确性。

链接: https://arxiv.org/abs/2505.01481
作者: Zongxia Li,Xiyang Wu,Yubin Qin,Guangyao Shi,Hongyang Du,Dinesh Manocha,Tianyi Zhou,Jordan Lee Boyd-Graber
机构: University of Maryland, College Park(马里兰大学学院公园分校); University of Southern California(南加利福尼亚大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Synthetic video generation with foundation models has gained attention for its realism and wide applications. While these models produce high-quality frames, they often fail to respect common sense and physical laws, resulting in abnormal content. Existing metrics like VideoScore emphasize general quality but ignore such violations and lack interpretability. A more insightful approach is using multi-modal large language models (MLLMs) as interpretable evaluators, as seen in FactScore. Yet, MLLMs’ ability to detect abnormalities in synthetic videos remains underexplored. To address this, we introduce VideoHallu, a benchmark featuring synthetic videos from models like Veo2, Sora, and Kling, paired with expert-designed QA tasks solvable via human-level reasoning across various categories. We assess several SoTA MLLMs, including GPT-4o, Gemini-2.5-Pro, Qwen-2.5-VL, and newer models like Video-R1 and VideoChat-R1. Despite strong real-world performance on MVBench and MovieChat, these models still hallucinate on basic commonsense and physics tasks in synthetic settings, underscoring the challenge of hallucination. We further fine-tune SoTA MLLMs using Group Relative Policy Optimization (GRPO) on real and synthetic commonsense/physics data. Results show notable accuracy gains, especially with counterexample integration, advancing MLLMs’ reasoning capabilities. Our data is available at this https URL.
zh

[CV-132] A Multi-Granularity Multimodal Retrieval Framework for Multimodal Document Tasks

【速读】:该论文旨在解决传统检索增强生成(Retrieval-augmented generation, RAG)系统在处理包含文本、图像、表格和图表等视觉丰富文档时效果受限的问题。其关键解决方案是提出一个统一的多粒度多模态检索框架,通过融合层次化编码策略、模态感知检索机制以及重排序模块,有效捕捉文本与视觉模态之间的复杂依赖关系,并利用现成的视觉-语言模型实现无需任务微调的训练-free 混合检索策略,从而显著提升检索精度。

链接: https://arxiv.org/abs/2505.01457
作者: Mingjun Xu,Zehui Wang,Hengxing Cai,Renxin Zhong
机构: 未知
类目: Information Retrieval (cs.IR); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Retrieval-augmented generation (RAG) systems have predominantly focused on text-based retrieval, limiting their effectiveness in handling visually-rich documents that encompass text, images, tables, and charts. To bridge this gap, we propose a unified multi-granularity multimodal retrieval framework tailored for two benchmark tasks: MMDocIR and M2KR. Our approach integrates hierarchical encoding strategies, modality-aware retrieval mechanisms, and reranking modules to effectively capture and utilize the complex interdependencies between textual and visual modalities. By leveraging off-the-shelf vision-language models and implementing a training-free hybrid retrieval strategy, our framework demonstrates robust performance without the need for task-specific fine-tuning. Experimental evaluations reveal that incorporating layout-aware search and reranking modules significantly enhances retrieval accuracy, achieving a top performance score of 65.56. This work underscores the potential of scalable and reproducible solutions in advancing multimodal document retrieval systems.
zh

[CV-133] ZS-VCOS: Zero-Shot Outperforms Supervised Video Camouflaged Object Segmentation

【速读】:该论文旨在解决伪装目标分割(camouflaged object segmentation)的问题,该任务相较于传统分割任务具有更大的挑战性,主要由于伪装目标与其背景在模式和颜色上高度相似。现有解决方案多依赖于监督或无监督预训练方法,而零样本(zero-shot)方法发展不足。本文提出的解决方案的关键在于将光流(optical flow)、视觉-语言模型(vision-language model)和SAM 2集成到一个顺序流程中,以提升分割性能。实验结果表明,该方法在MoCA-Mask数据集上的F-measure(F_β^w)从0.296显著提升至0.628,优于现有零样本方法及部分监督方法,并在MoCA-Filter数据集上也表现出更高的成功率。

链接: https://arxiv.org/abs/2505.01431
作者: Wenqi Guo,Shan Du
机构: University of British Columbia (不列颠哥伦比亚大学); Weathon Software (维森软件)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Camouflaged object segmentation presents unique challenges compared to traditional segmentation tasks, primarily due to the high similarity in patterns and colors between camouflaged objects and their backgrounds. Effective solutions to this problem have significant implications in critical areas such as pest control, defect detection, and lesion segmentation in medical imaging. Prior research has predominantly emphasized supervised or unsupervised pre-training methods, leaving zero-shot approaches significantly underdeveloped. Existing zero-shot techniques commonly utilize the Segment Anything Model (SAM) in automatic mode or rely on vision-language models to generate cues for segmentation; however, their performances remain unsatisfactory, likely due to the similarity of the camouflaged object and the background. Optical flow, commonly utilized for detecting moving objects, has demonstrated effectiveness even with camouflaged entities. Our method integrates optical flow, a vision-language model, and SAM 2 into a sequential pipeline. Evaluated on the MoCA-Mask dataset, our approach achieves outstanding performance improvements, significantly outperforming existing zero-shot methods by raising the F-measure (F_β^w) from 0.296 to 0.628. Remarkably, our approach also surpasses supervised methods, increasing the F-measure from 0.476 to 0.628. Additionally, evaluation on the MoCA-Filter dataset demonstrates an increase in the success rate from 0.628 to 0.697 when compared with FlowSAM, a supervised transfer method. A thorough ablation study further validates the individual contributions of each component. More details can be found on this https URL.
zh

[CV-134] Deconstructing Bias: A Multifaceted Framework for Diagnosing Cultural and Compositional Inequities in Text-to-Image Generative Models ICLR2025

【速读】:该论文试图解决文本到图像(Text-to-Image, T2I)模型在生成跨文化语境下图像时存在的系统性偏见问题,特别是由于训练数据中的文化不平衡导致的图像合成偏差。解决方案的关键在于引入组件包含评分(Component Inclusion Score, CIS),通过量化构图脆弱性和情境错位来评估模型在不同文化背景下的生成质量,并揭示数据不平衡、注意力熵和嵌入叠加对模型公平性的影响。该方法为提升AI生成图像的文化包容性提供了架构和数据驱动的干预路径。

链接: https://arxiv.org/abs/2505.01430
作者: Muna Numan Said,Aarib Zaidi,Rabia Usman,Sonia Okon,Praneeth Medepalli,Kevin Zhu,Vasu Sharma,Sean O’Brien
机构: Algoverse AI Research (Algoverse AI Research)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Published at ICLR 2025 Workshop SynthData

点击查看摘要

Abstract:The transformative potential of text-to-image (T2I) models hinges on their ability to synthesize culturally diverse, photorealistic images from textual prompts. However, these models often perpetuate cultural biases embedded within their training data, leading to systemic misrepresentations. This paper benchmarks the Component Inclusion Score (CIS), a metric designed to evaluate the fidelity of image generation across cultural contexts. Through extensive analysis involving 2,400 images, we quantify biases in terms of compositional fragility and contextual misalignment, revealing significant performance gaps between Western and non-Western cultural prompts. Our findings underscore the impact of data imbalance, attention entropy, and embedding superposition on model fairness. By benchmarking models like Stable Diffusion with CIS, we provide insights into architectural and data-centric interventions for enhancing cultural inclusivity in AI-generated imagery. This work advances the field by offering a comprehensive tool for diagnosing and mitigating biases in T2I generation, advocating for more equitable AI systems.
zh

[CV-135] Explainable AI-Driven Detection of Human Monkeypox Using Deep Learning and Vision Transformers: A Comprehensive Analysis

【速读】:该论文试图解决mpox(正痘病毒)在临床早期诊断中因症状与麻疹和水痘高度相似而导致的困难问题,旨在通过医学影像结合深度学习技术提高疾病检测的准确性。解决方案的关键在于利用公开的皮肤病变图像数据集训练深度学习和视觉变压器模型,并通过迁移学习策略提升分类器性能,其中MobileNet-v2在准确率(93.15%)和加权平均F1分数(93.09%)上表现最优。
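
论文采用的迁移学习属于标准做法:加载 ImageNet 预训练的 MobileNetV2、冻结骨干并替换分类头。下面给出一个可运行的 torchvision 草图(类别数取 6、学习率等均为假设,并非论文的确切训练配置):

```python
import torch
from torchvision import models

# 迁移学习示意:加载 ImageNet 预训练的 MobileNetV2,替换分类头做皮损多分类
model = models.mobilenet_v2(weights=models.MobileNet_V2_Weights.IMAGENET1K_V1)
for p in model.features.parameters():
    p.requires_grad = False                      # 冻结卷积骨干,只训练分类头
num_classes = 6                                  # 假设:mpox/水痘/麻疹等 6 类
model.classifier[1] = torch.nn.Linear(model.last_channel, num_classes)

opt = torch.optim.Adam(model.classifier.parameters(), lr=1e-3)
x, y = torch.randn(8, 3, 224, 224), torch.randint(0, num_classes, (8,))
loss = torch.nn.functional.cross_entropy(model(x), y)
opt.zero_grad(); loss.backward(); opt.step()
```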

链接: https://arxiv.org/abs/2505.01429
作者: Md. Zahid Hossain,Md. Rakibul Islam,Most. Sharmin Sultana Samu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Since mpox can spread from person to person, it is a zoonotic viral illness that poses a significant public health concern. It is difficult to make an early clinical diagnosis because its symptoms closely match those of measles and chickenpox. Medical imaging combined with deep learning (DL) techniques has shown promise in improving disease detection by analyzing affected skin areas. Our study explores the feasibility of training deep learning and vision transformer-based models from scratch with a publicly available skin lesion image dataset. Our experimental results show dataset limitation as a major drawback to building better classifier models trained from scratch. We used transfer learning with the help of pre-trained models to get a better classifier. MobileNet-v2 outperformed other state-of-the-art pre-trained models with 93.15% accuracy and a 93.09% weighted average F1 score. ViT B16 and ResNet-50 also achieved satisfactory performance compared to already available studies, with accuracies of 92.12% and 86.21%, respectively. To further validate the performance of the models, we applied explainable AI techniques.
zh

[CV-136] Multi-party Collaborative Attention Control for Image Customization

【速读】:该论文旨在解决扩散模型在图像生成中的定制化问题,具体包括:仅支持单一条件输入、复杂视觉场景下主体泄露或混淆、图像条件输出背景不一致以及计算成本高等局限性。其解决方案的关键在于提出一种无需微调的多方协作注意力控制方法(Multi-party Collaborative Attention Control, MCA-Ctrl),通过在自注意力层中引入两个关键操作,协调多个并行扩散过程,并引导目标图像生成,从而实现文本与复杂视觉条件下的高质量图像定制。

链接: https://arxiv.org/abs/2505.01428
作者: Han Yang,Chuanguang Yang,Qiuli Wang,Zhulin An,Weilun Feng,Libo Huang,Yongjun Xu
机构: Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China; University of Chinese Academy of Sciences, Beijing, China; Department of Radiology, The First Affiliated Hospital of Army Medical University, Chongqing, China
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:The rapid advancement of diffusion models has increased the need for customized image generation. However, current customization methods face several limitations: 1) typically accept either image or text conditions alone; 2) customization in complex visual scenarios often leads to subject leakage or confusion; 3) image-conditioned outputs tend to suffer from inconsistent backgrounds; and 4) high computational costs. To address these issues, this paper introduces Multi-party Collaborative Attention Control (MCA-Ctrl), a tuning-free method that enables high-quality image customization using both text and complex visual conditions. Specifically, MCA-Ctrl leverages two key operations within the self-attention layer to coordinate multiple parallel diffusion processes and guide the target image generation. This approach allows MCA-Ctrl to capture the content and appearance of specific subjects while maintaining semantic consistency with the conditional input. Additionally, to mitigate subject leakage and confusion issues common in complex visual scenarios, we introduce a Subject Localization Module that extracts precise subject and editable image layers based on user instructions. Extensive quantitative and human evaluation experiments show that MCA-Ctrl outperforms existing methods in zero-shot image customization, effectively resolving the mentioned issues.
zh

[CV-137] LangGas: Introducing Language in Selective Zero-Shot Background Subtraction for Semi-Transparent Gas Leak Detection with a New Dataset

【速读】:该论文试图解决气体泄漏检测的问题,传统的人工检测方法效率低且耗时。其解决方案的关键在于引入了一个合成数据集SimGas,并提出了一种零样本方法,该方法结合了背景减除、零样本目标检测、过滤和分割技术,以充分利用该数据集。实验结果表明,该方法在IoU指标上显著优于仅基于背景减除和零样本目标检测的基线方法。

链接: https://arxiv.org/abs/2503.02910
作者: Wenqi Guo,Yiyang Du,Shan Du
机构: University of British Columbia (不列颠哥伦比亚大学); Weathon Software (维森软件)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Gas leakage poses a significant hazard that requires prevention. Traditionally, human inspection has been used for detection, a slow and labour-intensive process. Recent research has applied machine learning techniques to this problem, yet there remains a shortage of high-quality, publicly available datasets. This paper introduces a synthetic dataset, SimGas, featuring diverse backgrounds, interfering foreground objects, diverse leak locations, and precise segmentation ground truth. We propose a zero-shot method that combines background subtraction, zero-shot object detection, filtering, and segmentation to leverage this dataset. Experimental results indicate that our approach significantly outperforms baseline methods based solely on background subtraction and zero-shot object detection with segmentation, reaching an IoU of 69%. We also present an analysis of various prompt configurations and threshold settings to provide deeper insights into the performance of our method. Finally, we qualitatively (because of the lack of ground truth) tested our performance on GasVid and reached decent results on the real-world dataset. The dataset, code, and full qualitative results are available at this https URL.
zh

[CV-138] Multi-View Learning with Context-Guided Receptance for Image Denoising IJCAI2025

【速读】:该论文旨在解决真实场景中图像去噪问题,特别是现有方法在区分复杂噪声模式和计算资源消耗较大的问题。其解决方案的关键在于提出一种结合增强多视角特征融合与高效序列建模的Context-guided Receptance Weighted Key-Value(CRWKV)模型,通过引入Context-guided Token Shift(CTS)机制有效捕捉局部空间依赖性,并利用Frequency Mix(FMix)模块提取频域特征以分离高频噪声,同时采用Bidirectional WKV(BiWKV)机制提升计算效率,实现线性复杂度下的全像素序列交互。
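
FMix 模块的公开细节有限;其“在频域中分离高频噪声”的思路可以用 FFT 低/高频拆分来示意,如下草图所示(低频半径 radius 为假设的参数化方式,并非论文原始实现):

```python
import torch

def frequency_split(x, radius=0.25):
    """用 2D FFT 把特征图分成低频(结构)与高频(噪声/细节)两路。
    x: [B, C, H, W];radius 为归一化的低频半径(假设的超参数)。"""
    B, C, H, W = x.shape
    freq = torch.fft.fftshift(torch.fft.fft2(x, norm="ortho"), dim=(-2, -1))
    yy, xx = torch.meshgrid(torch.arange(H, dtype=torch.float32),
                            torch.arange(W, dtype=torch.float32), indexing="ij")
    dist = ((yy - H / 2) ** 2 + (xx - W / 2) ** 2).sqrt()
    mask = (dist <= radius * min(H, W)).to(x.dtype)[None, None]   # 低通掩码
    low = torch.fft.ifft2(torch.fft.ifftshift(freq * mask, dim=(-2, -1)), norm="ortho").real
    high = x - low                     # 高频残差,噪声多集中于此
    return low, high

low, high = frequency_split(torch.randn(2, 16, 64, 64))
```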

链接: https://arxiv.org/abs/2505.02705
作者: Binghong Chen,Tingting Chai,Wei Jiang,Yuanrong Xu,Guanglu Zhou,Xiangqian Wu
机构: Harbin Institute of Technology (哈尔滨工业大学)
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by IJCAI 2025, code will be available at this https URL

点击查看摘要

Abstract:Image denoising is essential in low-level vision applications such as photography and automated driving. Existing methods struggle with distinguishing complex noise patterns in real-world scenes and consume significant computational resources due to reliance on Transformer-based models. In this work, the Context-guided Receptance Weighted Key-Value (CRWKV) model is proposed, combining enhanced multi-view feature integration with efficient sequence modeling. Our approach introduces the Context-guided Token Shift (CTS) paradigm, which effectively captures local spatial dependencies and enhances the model's ability to model real-world noise distributions. Additionally, the Frequency Mix (FMix) module extracting frequency-domain features is designed to isolate noise in high-frequency spectra, and is integrated with spatial representations through a multi-view learning process. To improve computational efficiency, the Bidirectional WKV (BiWKV) mechanism is adopted, enabling full pixel-sequence interaction with linear complexity while overcoming the causal selection constraints. The model is validated on multiple real-world image denoising datasets, outperforming the existing state-of-the-art methods quantitatively and reducing inference time up to 40%. Qualitative results further demonstrate the ability of our model to restore fine details in various scenes.
zh

[CV-139] DeepSparse: A Foundation Model for Sparse-View CBCT Reconstruction

【速读】:该论文旨在解决锥形束计算机断层扫描(Cone-beam Computed Tomography, CBCT)中高辐射暴露的问题,通过稀疏视角重建技术在减少X射线投影数量的同时保持图像质量。现有方法面临计算需求高和泛化能力差的挑战,而该研究提出的DeepSparse模型是首个针对稀疏视角CBCT重建的预训练基础模型,其关键在于DiCE(Dual-Dimensional Cross-Scale Embedding)网络结构,该结构融合了多视角2D特征与多尺度3D特征,结合HyViP(Hybrid View Sampling Pretraining)预训练框架及两阶段微调策略,显著提升了重建质量与模型适应性。

链接: https://arxiv.org/abs/2505.02628
作者: Yiqun Lin,Hualiang Wang,Jixiang Chen,Jiewen Yang,Jiarong Guo,Xiaomeng Li
机构: The Hong Kong University of Science and Technology (香港科技大学)
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Cone-beam computed tomography (CBCT) is a critical 3D imaging technology in the medical field, while the high radiation exposure required for high-quality imaging raises significant concerns, particularly for vulnerable populations. Sparse-view reconstruction reduces radiation by using fewer X-ray projections while maintaining image quality, yet existing methods face challenges such as high computational demands and poor generalizability to different datasets. To overcome these limitations, we propose DeepSparse, the first foundation model for sparse-view CBCT reconstruction, featuring DiCE (Dual-Dimensional Cross-Scale Embedding), a novel network that integrates multi-view 2D features and multi-scale 3D features. Additionally, we introduce the HyViP (Hybrid View Sampling Pretraining) framework, which pretrains the model on large datasets with both sparse-view and dense-view projections, and a two-step finetuning strategy to adapt and refine the model for new datasets. Extensive experiments and ablation studies demonstrate that our proposed DeepSparse achieves superior reconstruction quality compared to state-of-the-art methods, paving the way for safer and more efficient CBCT imaging.
zh

[CV-140] Diagnostic Uncertainty in Pneumonia Detection using CNN MobileNetV2 and CNN from Scratch

【速读】:该论文试图解决肺炎诊断中的不确定性问题,这一问题主要源于非典型表现、诊断工具(如胸部X光)的局限性以及共存的呼吸系统疾病。研究提出了一种监督学习方法——卷积神经网络(Convolutional Neural Network, CNN),通过使用预训练的MobileNetV2模型与ResNet101V2架构,并采用Keras API构建从零开始的CNN模型,以识别肺部疾病尤其是肺炎。该研究的关键在于利用深度学习模型提升肺炎诊断的准确性和稳定性,其中MobileNetV2在验证数据上表现出更高的稳定性和较低的过拟合,而从零开始的模型虽然在某些指标上具有较高准确性,但存在更大的不稳定性。

链接: https://arxiv.org/abs/2505.02396
作者: Kennard Norbert Sudiardjo,Islam Nur Alam,Wilson Wijaya,Lili Ayu Wulandhari
机构: 未知
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Pneumonia diagnosis, though crucial for effective treatment, can be hampered by uncertainty. This uncertainty arises from factors such as atypical presentations, limitations of diagnostic tools like chest X-rays, and the presence of co-existing respiratory conditions. This research applies a supervised learning method, CNN, using pre-trained MobileNetV2 and ResNet101V2 architectures alongside a model built from scratch with the Keras API, for identifying lung diseases, especially pneumonia. The datasets used in this research were obtained through Kaggle. The results show that implementing both the MobileNetV2-based CNN and the CNN from scratch is promising. During validation, MobileNetV2 performs with stability and minimal overfitting: training accuracy increased to 84.87% before slightly decreasing to 78.95%, with validation loss increasing from 0.499 to 0.6345. Nonetheless, MobileNetV2 is more stable, although it takes more time to train each epoch. Meanwhile, after the 10th epoch, the from-scratch model displayed more instability and overfitting despite having higher validation accuracy: training accuracy decreased significantly to 78.12% and validation loss increased from 0.5698 to 1.1809. With these results, ResNet101V2 offers stability, and the from-scratch model offers high accuracy.
zh

[CV-141] An Arbitrary-Modal Fusion Network for Volumetric Cranial Nerves Tract Segmentation

【速读】:该论文旨在解决在临床实践中难以获取完整的多模态数据(如结构磁共振成像和扩散磁共振成像)导致的颅神经束分割性能受限的问题。其解决方案的关键在于提出一种新型的任意模态融合网络CNTSeg-v2,该网络通过将T1加权图像作为主模态来指导其他辅助模态的信息选择,并引入任意模态协作模块(Arbitrary-Modal Collaboration Module, ACM)以有效提取其他模态中的信息特征,同时结合深度距离引导的多阶段解码器(Deep Distance-guided Multi-stage, DDM)以修正分割中的小误差和不连续性,从而提升分割精度。

链接: https://arxiv.org/abs/2505.02385
作者: Lei Xie,Huajun Zhou,Junxiong Huang,Jiahao Huang,Qingrun Zeng,Jianzhong He,Jiawei Zhang,Baohua Fan,Mingchu Li,Guoqiang Xie,Hao Chen,Yuanjing Feng
机构: Zhejiang University of Technology (浙江理工大学); Hong Kong University of Science and Technology (香港科技大学); Department of Computer Science and Engineering (计算机科学与工程系); Department of Chemical and Biological Engineering (化学与生物工程系); Center for Aging Science (衰老科学中心); Department of Radiology (放射科); Xiangya Hospital (湘雅医院); Central South University (中南大学); Department of Neurosurgery (神经外科); Nuclear Industry 215 Hospital of Shaanxi Province (陕西省核工业215医院); Taihe Hospital of Wannan Medical College (皖南医学院太和医院)
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:The segmentation of cranial nerves (CNs) tract provides a valuable quantitative tool for the analysis of the morphology and trajectory of individual CNs. Multimodal CNs tract segmentation networks, e.g., CNTSeg, which combine structural Magnetic Resonance Imaging (MRI) and diffusion MRI, have achieved promising segmentation performance. However, it is laborious or even infeasible to collect complete multimodal data in clinical practice due to limitations in equipment, user privacy, and working conditions. In this work, we propose a novel arbitrary-modal fusion network for volumetric CNs tract segmentation, called CNTSeg-v2, which trains one model to handle different combinations of available modalities. Instead of directly combining all the modalities, we select T1-weighted (T1w) images as the primary modality due to its simplicity in data acquisition and contribution most to the results, which supervises the information selection of other auxiliary modalities. Our model encompasses an Arbitrary-Modal Collaboration Module (ACM) designed to effectively extract informative features from other auxiliary modalities, guided by the supervision of T1w images. Meanwhile, we construct a Deep Distance-guided Multi-stage (DDM) decoder to correct small errors and discontinuities through signed distance maps to improve segmentation accuracy. We evaluate our CNTSeg-v2 on the Human Connectome Project (HCP) dataset and the clinical Multi-shell Diffusion MRI (MDM) dataset. Extensive experimental results show that our CNTSeg-v2 achieves state-of-the-art segmentation performance, outperforming all competing methods.
zh

[CV-142] CSASN: A Multitask Attention-Based Framework for Heterogeneous Thyroid Carcinoma Classification in Ultrasound Images

【速读】:该论文旨在解决在超声成像中罕见甲状腺癌分类时面临的异质形态特征和数据不平衡问题。其解决方案的关键在于提出一种新颖的多任务学习框架——通道-空间注意力协同网络(Channel-Spatial Attention Synergy Network, CSASN),该框架结合了基于EfficientNet的局部空间编码与基于ViT的全局语义建模的双分支特征提取器,并引入级联通道-空间注意力精炼模块,同时采用残差多尺度分类器和动态加权损失函数以提升分类的稳定性和准确性。

链接: https://arxiv.org/abs/2505.02211
作者: Peiqi Li,Yincheng Gao,Renxing Li,Haojie Yang,Yunyun Liu,Boji Liu,Jiahui Ni,Ying Zhang,Yulu Wu,Xiaowei Fang,Lehang Guo,Liping Sun,Jiangang Chen
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注: 18 pages, 10 figures, 4 tables

点击查看摘要

Abstract:Heterogeneous morphological features and data imbalance pose significant challenges in rare thyroid carcinoma classification using ultrasound imaging. To address this issue, we propose a novel multitask learning framework, Channel-Spatial Attention Synergy Network (CSASN), which integrates a dual-branch feature extractor - combining EfficientNet for local spatial encoding and ViT for global semantic modeling - with a cascaded channel-spatial attention refinement module. A residual multiscale classifier and a dynamically weighted loss function further enhance classification stability and accuracy. The framework is trained on a multicenter dataset comprising more than 2000 patients from four clinical institutions. Extensive ablation studies demonstrate that each module contributes significantly to model performance, particularly in recognizing rare subtypes such as FTC and MTC carcinomas. Experimental results show that CSASN outperforms existing single-stream CNN or Transformer-based models, achieving a superior balance between precision and recall under class-imbalanced conditions. This framework provides a promising strategy for AI-assisted thyroid cancer diagnosis.
zh

[CV-143] Hybrid Image Resolution Quality Metric (HIRQM): A Comprehensive Perceptual Image Quality Assessment Framework

【速读】:该论文试图解决传统图像质量评估指标(如均方误差和结构相似性指数)在复杂失真条件下无法准确反映感知质量的问题。其解决方案的关键在于提出了一种混合图像分辨率质量度量(Hybrid Image Resolution Quality Metric, HIRQM),该方法整合了统计分析、多尺度特征相似性和基于深度学习的特征提取,通过概率密度函数分析局部像素分布、多尺度特征相似性评估结构完整性以及利用预训练的VGG16网络提取语义特征以实现与人类感知的对齐。此外,动态加权机制根据图像特性自适应调整各组件的贡献,提升了模型在不同失真类型下的灵活性和准确性。
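
HIRQM 的动态加权机制可以用如下 numpy 草图示意:三个分量得分按图像亮度与方差自适应调整权重后线性融合(具体调整规则与系数均为本文假设,原始公式以论文为准):

```python
import numpy as np

def hirqm(scores, img, base_w=(0.3, 0.3, 0.4)):
    """动态加权融合示意:scores = (pdf_score, msfs_score, deep_score),
    假设三个分量均已归一到 [0, 1];权重随图像特性自适应微调。"""
    brightness = img.mean() / 255.0
    variance = img.var() / 255.0 ** 2
    w = np.array(base_w, dtype=float)
    w[0] += 0.2 * variance           # 纹理复杂时更信统计分量(假设的规则)
    w[2] += 0.2 * (1 - brightness)   # 偏暗图像更依赖深度语义特征(假设的规则)
    w /= w.sum()                     # 归一化权重
    return float(np.dot(w, scores))

q = hirqm((0.8, 0.7, 0.9), np.random.randint(0, 256, (256, 256), dtype=np.uint8))
```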

链接: https://arxiv.org/abs/2505.02001
作者: Vineesh Kumar Reddy Mondem
机构: Indian Institute of Information Technology–Manipur
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注: 19 pages, 2 figures, 2 tables, and a bibliography of related papers

点击查看摘要

Abstract:Traditional image quality assessment metrics like Mean Squared Error and Structural Similarity Index often fail to reflect perceptual quality under complex distortions. We propose the Hybrid Image Resolution Quality Metric (HIRQM), integrating statistical, multi-scale, and deep learning-based methods for a comprehensive quality evaluation. HIRQM combines three components: Probability Density Function for local pixel distribution analysis, Multi-scale Feature Similarity for structural integrity across resolutions, and Hierarchical Deep Image Features using a pre-trained VGG16 network for semantic alignment with human perception. A dynamic weighting mechanism adapts component contributions based on image characteristics like brightness and variance, enhancing flexibility across distortion types. Our contributions include a unified metric and dynamic weighting for better perceptual alignment. Evaluated on TID2013 and LIVE datasets, HIRQM achieves Pearson and Spearman correlations of 0.92 and 0.90, outperforming traditional metrics. It excels in handling noise, blur, and compression artifacts, making it valuable for image processing applications like compression and restoration.
zh

[CV-144] Adversarial Robustness of Deep Learning Models for Inland Water Body Segmentation from SAR Images

【速读】:该论文旨在解决从合成孔径雷达(Synthetic Aperture Radar, SAR)图像中准确分割内陆水体的问题,特别是在面对人工标注噪声时模型的鲁棒性问题。其解决方案的关键在于通过模拟人工错误作为对抗攻击,评估U-Net模型对标注中人为误差的容忍度,并揭示标注质量对分割模型效果的重要性。
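
论文将人工标注误差模拟为对标签的对抗式扰动;由于误差多发生在几何复杂的边界处,下面给出一个“仅在边界带内随机翻转标签”的示意草图(翻转概率与带宽均为假设的参数,并非论文确切的攻击构造):

```python
import numpy as np
from scipy.ndimage import binary_dilation, binary_erosion

def corrupt_boundary(mask, flip_prob=0.3, width=2, seed=0):
    """模拟人工标注误差:只在水体边界带内随机翻转标签。
    mask: 二值水体掩码 (H, W);width 控制边界带厚度(假设的参数化方式)。"""
    band = binary_dilation(mask, iterations=width) & ~binary_erosion(mask, iterations=width)
    rng = np.random.default_rng(seed)
    flips = band & (rng.random(mask.shape) < flip_prob)
    noisy = mask.copy()
    noisy[flips] = ~noisy[flips]
    return noisy

mask = np.zeros((128, 128), dtype=bool); mask[32:96, 40:100] = True
noisy = corrupt_boundary(mask)  # 可作为带噪标注用于鲁棒性训练/评估
```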

链接: https://arxiv.org/abs/2505.01884
作者: Siddharth Kothari,Srinivasan Murali,Sankalp Kothari,Ujjwal Verma,Jaya Sreevalsan-Nair
机构: International Institute of Information Technology Bangalore (印度国际信息科技学院班加罗尔分校); Manipal Institute of Technology (曼尼普尔技术学院)
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 21 pages, 15 figures, 2 tables

点击查看摘要

Abstract:Inland water body segmentation from Synthetic Aperture Radar (SAR) images is an important task needed for several applications, such as flood mapping. While SAR sensors capture data in all-weather conditions as high-resolution images, differentiating water and water-like surfaces from SAR images is not straightforward. Inland water bodies, such as large river basins, have complex geometry, which adds to the challenge of segmentation. U-Net is a widely used deep learning model for land-water segmentation of SAR images. In practice, manual annotation is often used to generate the corresponding water masks as ground truth. Manual annotation of the images is prone to label noise owing to data poisoning attacks, especially due to complex geometry. In this work, we simulate manual errors in the form of adversarial attacks on the U-Net model and study the robustness of the model to human errors in annotation. Our results indicate that U-Net can tolerate a certain level of corruption before its performance drops significantly. This finding highlights the crucial role that the quality of manual annotations plays in determining the effectiveness of the segmentation model. The code and the new dataset, along with adversarial examples for robust training, are publicly available. (Github link - this https URL)
zh

[CV-145] Accelerating Volumetric Medical Image Annotation via Short-Long Memory SAM 2

【速读】:该论文旨在解决医学影像(如MRI和CT)体积图像手动标注过程耗时且劳动强度大的问题,通过引入生成式AI(Generative AI)技术提升标注效率。其解决方案的关键在于提出一种新型架构——短-长记忆SAM 2(SLM-SAM 2),该架构结合了独立的短期和长期记忆库以及相应的注意力模块,以减少误差传播并提高分割精度,从而增强对过量传播的抵抗能力,推动更精准的医学图像自动标注技术发展。
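
短期/长期双记忆库配独立注意力的读写骨架可示意如下(记忆长度、采样间隔等均为假设;SAM 2 中记忆实际以空间特征图与对象指针的形式存在,此处简化为 token 序列):

```python
import collections
import torch

class ShortLongMemory(torch.nn.Module):
    """双记忆库读写示意:短期记忆保留最近几个切片的特征,长期记忆稀疏保留历史关键帧,
    各自用独立注意力读出后叠加(仅示意思路,非 SLM-SAM 2 官方实现)。"""
    def __init__(self, dim=256, short_len=4, long_len=16, every=5):
        super().__init__()
        self.short = collections.deque(maxlen=short_len)
        self.long = collections.deque(maxlen=long_len)
        self.every, self.t = every, 0
        self.attn_s = torch.nn.MultiheadAttention(dim, 4, batch_first=True)
        self.attn_l = torch.nn.MultiheadAttention(dim, 4, batch_first=True)

    def forward(self, q):                      # q: [B, N, D] 当前切片的查询特征
        out = q
        for attn, bank in ((self.attn_s, self.short), (self.attn_l, self.long)):
            if bank:
                mem = torch.cat(list(bank), dim=1)
                out = out + attn(q, mem, mem)[0]
        self.short.append(q.detach())
        if self.t % self.every == 0:           # 长期库稀疏采样,抑制误差随传播累积
            self.long.append(q.detach())
        self.t += 1
        return out

m = ShortLongMemory()
for _ in range(8):
    feats = m(torch.randn(1, 64, 256))
```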

链接: https://arxiv.org/abs/2505.01854
作者: Yuwen Chen,Zafer Yildiz,Qihang Li,Yaqian Chen,Haoyu Dong,Hanxue Gu,Nicholas Konz,Maciej A. Mazurowski
机构: Duke University (杜克大学)
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Manual annotation of volumetric medical images, such as magnetic resonance imaging (MRI) and computed tomography (CT), is a labor-intensive and time-consuming process. Recent advancements in foundation models for video object segmentation, such as Segment Anything Model 2 (SAM 2), offer a potential opportunity to significantly speed up the annotation process by manually annotating one or a few slices and then propagating target masks across the entire volume. However, the performance of SAM 2 in this context varies. Our experiments show that relying on a single memory bank and attention module is prone to error propagation, particularly at boundary regions where the target is present in the previous slice but absent in the current one. To address this problem, we propose Short-Long Memory SAM 2 (SLM-SAM 2), a novel architecture that integrates distinct short-term and long-term memory banks with separate attention modules to improve segmentation accuracy. We evaluate SLM-SAM 2 on three public datasets covering organs, bones, and muscles across MRI and CT modalities. We show that the proposed method markedly outperforms the default SAM 2, achieving average Dice Similarity Coefficient improvement of 0.14 and 0.11 in the scenarios when 5 volumes and 1 volume are available for the initial adaptation, respectively. SLM-SAM 2 also exhibits stronger resistance to over-propagation, making a notable step toward more accurate automated annotation of medical images for segmentation model development.
zh

[CV-146] Multi-Scale Target-Aware Representation Learning for Fundus Image Enhancement

【速读】:该论文旨在解决眼底图像增强中缺乏统一框架以恢复多尺度信息以及未能明确图像增强目标(如病灶区域)的问题。其解决方案的关键在于提出一种多尺度目标感知表示学习框架(MTRL-FIE),该框架包含多尺度特征编码器(MFE)、结构保持分层解码器(SHD)和目标感知特征聚合模块(TFA)。MFE通过小波分解嵌入低频结构信息和高频细节,SHD通过分层融合与组注意力机制实现自适应特征融合并保持局部结构平滑性,而TFA则用于增强病理性区域并减少伪影,从而提升图像质量与诊断准确性。
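
MFE 所依赖的多级小波分解可以用 PyWavelets 直接示意:逐级得到低频近似(整体结构)与高频细节(边缘与病灶纹理),如下草图所示(小波基与分解层数为假设的配置):

```python
import numpy as np
import pywt

def wavelet_embed(img, wavelet="haar", levels=2):
    """多尺度小波分解示意:img 为灰度眼底图 (H, W);
    返回各尺度的 (cA, cH, cV, cD) 系数,供编码器嵌入低频结构与高频细节。"""
    feats, cA = [], img.astype(np.float32)
    for _ in range(levels):
        cA, (cH, cV, cD) = pywt.dwt2(cA, wavelet)
        feats.append((cA, cH, cV, cD))
    return feats

feats = wavelet_embed(np.random.rand(256, 256))
print([f[0].shape for f in feats])  # [(128, 128), (64, 64)]
```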

链接: https://arxiv.org/abs/2505.01831
作者: Haofan Wu,Yin Huang,Yuqing Wu,Qiuyu Yang,Bingfang Wang,Li Zhang,Muhammad Fahadullah Khan,Ali Zia,M.Saleh Memon,Syed Sohail Bukhari,Abdul Fattah Memon,Daizong Ji,Ya Zhang,Ghulam Mustafa,Yin Fang
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注: Under review at Neural Networks

点击查看摘要

Abstract:High-quality fundus images provide essential anatomical information for clinical screening and ophthalmic disease diagnosis. Yet, due to hardware limitations, operational variability, and patient compliance, fundus images often suffer from low resolution and signal-to-noise ratio. Recent years have witnessed promising progress in fundus image enhancement. However, existing works usually focus on restoring structural details or global characteristics of fundus images, lacking a unified image enhancement framework to recover comprehensive multi-scale information. Moreover, few methods pinpoint the target of image enhancement, e.g., lesions, which is crucial for medical image-based diagnosis. To address these challenges, we propose a multi-scale target-aware representation learning framework (MTRL-FIE) for efficient fundus image enhancement. Specifically, we propose a multi-scale feature encoder (MFE) that employs wavelet decomposition to embed both low-frequency structural information and high-frequency details. Next, we design a structure-preserving hierarchical decoder (SHD) to fuse multi-scale feature embeddings for real fundus image restoration. SHD integrates hierarchical fusion and group attention mechanisms to achieve adaptive feature fusion while retaining local structural smoothness. Meanwhile, a target-aware feature aggregation (TFA) module is used to enhance pathological regions and reduce artifacts. Experimental results on multiple fundus image datasets demonstrate the effectiveness and generalizability of MTRL-FIE for fundus image enhancement. Compared to state-of-the-art methods, MTRL-FIE achieves superior enhancement performance with a more lightweight architecture. Furthermore, our approach generalizes to other ophthalmic image processing tasks without supervised fine-tuning, highlighting its potential for clinical applications.
zh

[CV-147] Continuous Filtered Backprojection by Learnable Interpolation Network

【速读】:该论文旨在解决传统滤波反投影(filtered-back-projection, FBP)方法在反投影步骤中不可避免的插值误差问题,该误差会损害CT图像的准确重建。解决方案的关键在于提出一种名为LInFBP的深度学习模型,该模型在FBP的反投影步骤中实现了可学习的插值,通过将离散sinogram数据的潜在连续函数表示为选定基函数的线性组合,并利用深度网络预测线性组合系数,从而学习该连续函数,最终用于反投影中的插值,首次将深度学习应用于FBP中的插值过程。

链接: https://arxiv.org/abs/2505.01768
作者: Hui Lin,Dong Zeng,Qi Xie,Zerui Mao,Jianhua Ma,Deyu Meng
机构: Xi’an Jiaotong University (西安交通大学); Southern Medical University (南方医科大学)
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注: 14 pages, 10 figures

点击查看摘要

Abstract:Accurate reconstruction of computed tomography (CT) images is crucial in the medical imaging field. However, there are unavoidable interpolation errors in the backprojection step of the conventional reconstruction methods, i.e., filtered-back-projection based methods, which are detrimental to accurate reconstruction. In this study, to address this issue, we propose a novel deep learning model, named Learnable-Interpolation-based FBP, or LInFBP for short, to enhance the reconstructed CT image quality, which achieves learnable interpolation in the backprojection step of filtered backprojection (FBP) and alleviates the interpolation errors. Specifically, in the proposed LInFBP, we formulate every local piece of the latent continuous function of discrete sinogram data as a linear combination of selected basis functions, and learn this continuous function by exploiting a deep network to predict the linear combination coefficients. Then, the learned latent continuous function is exploited for interpolation in the backprojection step, which, for the first time, takes advantage of deep learning for interpolation in FBP. Extensive experiments, which encompass diverse CT scenarios, demonstrate the effectiveness of the proposed LInFBP in terms of enhanced reconstructed image quality, plug-and-play ability and generalization capability.
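以下是一个示意性的 PyTorch 片段,演示"用小网络预测邻近采样点的线性组合系数,以替代固定插值权重"的思路;网络结构与邻域大小均为假设,仅作说明,并非论文的 LInFBP 实现:

```python
import torch
import torch.nn as nn

class LearnableInterp(nn.Module):
    """Sketch: predict combination coefficients over neighboring sinogram
    samples instead of using fixed linear-interpolation weights."""
    def __init__(self, n_neighbors=4):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(n_neighbors + 1, 32), nn.ReLU(),
                                 nn.Linear(32, n_neighbors))

    def forward(self, neighbors, frac):
        # neighbors: (B, n) discrete samples around the query position
        # frac:      (B, 1) fractional offset of the query within the cell
        coeff = torch.softmax(self.net(torch.cat([neighbors, frac], dim=1)), dim=1)
        return (coeff * neighbors).sum(dim=1, keepdim=True)

interp = LearnableInterp()
neighbors = torch.randn(8, 4)         # 4 nearest sinogram samples per query
frac = torch.rand(8, 1)
print(interp(neighbors, frac).shape)  # interpolated values for backprojection
```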
zh

[CV-148] LensNet: An End-to-End Learning Framework for Empirical Point Spread Function Modeling and Lensless Imaging Reconstruction IJCAI2025

【速读】:该论文旨在解决无透镜成像系统中由于固定或近似点扩散函数(PSF)模型导致的适应性差、重建质量受限的问题,特别是在噪声、系统误差和动态场景变化等实际挑战下的高保真重建难题。其解决方案的关键在于提出LensNet,一个端到端的深度学习框架,通过可学习的编码掩模模拟器(CMS)在训练过程中动态、数据驱动地估计PSF,从而克服传统方法对静态或稀疏标定内核的依赖,并通过嵌入维纳滤波组件提升全局结构和细节恢复能力,减少对手工设计预处理步骤的依赖。

链接: https://arxiv.org/abs/2505.01755
作者: Jiesong Bai,Yuhao Yin,Yihang Dong,Xiaofeng Zhang,Chi-Man Pun,Xuhang Chen
机构: University of Macau(澳门大学); Shanghai University(上海大学); Shanghai Jiao Tong University(上海交通大学); Huizhou University(惠州大学)
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by IJCAI 2025

点击查看摘要

Abstract:Lensless imaging stands out as a promising alternative to conventional lens-based systems, particularly in scenarios demanding ultracompact form factors and cost-effective architectures. However, such systems are fundamentally governed by the Point Spread Function (PSF), which dictates how a point source contributes to the final captured signal. Traditional lensless techniques often require explicit calibrations and extensive pre-processing, relying on static or approximate PSF models. These rigid strategies can result in limited adaptability to real-world challenges, including noise, system imperfections, and dynamic scene variations, thus impeding high-fidelity reconstruction. In this paper, we propose LensNet, an end-to-end deep learning framework that integrates spatial-domain and frequency-domain representations in a unified pipeline. Central to our approach is a learnable Coded Mask Simulator (CMS) that enables dynamic, data-driven estimation of the PSF during training, effectively mitigating the shortcomings of fixed or sparsely calibrated kernels. By embedding a Wiener filtering component, LensNet refines global structure and restores fine-scale details, thus alleviating the dependency on multiple handcrafted pre-processing steps. Extensive experiments demonstrate LensNet’s robust performance and superior reconstruction quality compared to state-of-the-art methods, particularly in preserving high-frequency details and attenuating noise. The proposed framework establishes a novel convergence between physics-based modeling and data-driven learning, paving the way for more accurate, flexible, and practical lensless imaging solutions for applications ranging from miniature sensors to medical diagnostics. The code is available at this https URL.
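LensNet 中嵌入的维纳滤波是经典的频域去卷积组件。下面用 NumPy 给出其标准形式的最小示例;论文中 PSF 由 CMS 在训练中学习得到,此处为演示假设 PSF 已知:

```python
import numpy as np

def wiener_deconvolve(measured, psf, k=1e-2):
    """Frequency-domain Wiener filter: recover structure given a PSF estimate.
    k approximates the noise-to-signal ratio (an assumed constant here)."""
    H = np.fft.fft2(psf, s=measured.shape)
    Y = np.fft.fft2(measured)
    G = np.conj(H) / (np.abs(H) ** 2 + k)   # Wiener gain
    return np.real(np.fft.ifft2(G * Y))

# toy example: blur an image with a box PSF, add noise, then restore it
rng = np.random.default_rng(0)
img = rng.random((128, 128))
psf = np.zeros((128, 128))
psf[:5, :5] = 1 / 25.0
blurred = np.real(np.fft.ifft2(np.fft.fft2(img) * np.fft.fft2(psf)))
restored = wiener_deconvolve(blurred + 0.01 * rng.normal(size=img.shape), psf)
print(np.abs(restored - img).mean())   # small residual error
```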
zh

[CV-149] CLOG-CD: Curriculum Learning based on Oscillating Granularity of Class Decomposed Medical Image Classification

【速读】:该论文旨在解决医学影像分类任务中因数据不规则性导致的类别间误分类问题,其解决方案的关键在于结合课程学习(Curriculum Learning)策略与类别分解(Class Decomposition)方法,提出了一种新的卷积神经网络(CNN)训练方法,称为CLOG-CD。该方法通过从细粒度到粗粒度的顺序进行训练(即反向课程技术),利用类别分解所学到的权重来提升模型的分类性能。

链接: https://arxiv.org/abs/2505.01741
作者: Asmaa Abbas,Mohamed Gaber,Mohammed M. Abdelsamea
机构: Birmingham City University (伯明翰城市大学); University of Exeter (埃克塞特大学)
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注: Published in: IEEE Transactions on Emerging Topics in Computing

点击查看摘要

Abstract:Curriculum learning strategies have been proven to be effective in various applications and have gained significant interest in the field of machine learning. They can improve the final model’s performance and accelerate the training process. However, in the medical imaging domain, data irregularities can make the recognition task more challenging and usually result in misclassification between the different classes in the dataset. Class-decomposition approaches have shown promising results in solving such a problem by learning the boundaries within the classes of the dataset. In this paper, we present a novel convolutional neural network (CNN) training method based on the curriculum learning strategy and the class decomposition approach, which we call CLOG-CD, to improve the performance of medical image classification. We evaluated our method on four imbalanced medical image datasets: Chest X-ray (CXR), brain tumour, digital knee X-ray, and histopathology colorectal cancer (CRC). CLOG-CD utilises the learnt weights from the decomposition granularity of the classes, and the training is accomplished from descending to ascending order (i.e., anti-curriculum technique). We also investigated the classification performance of our proposed method based on different acceleration factors and pace function curricula. We used two pre-trained networks, ResNet-50 and DenseNet-121, as the backbone for CLOG-CD. The results with ResNet-50 show that CLOG-CD has the ability to improve classification performance with an accuracy of 96.08% for the CXR dataset, 96.91% for the brain tumour dataset, 79.76% for the digital knee X-ray, and 99.17% for the CRC dataset, compared to other training strategies. In addition, with DenseNet-121, CLOG-CD has achieved 94.86%, 94.63%, 76.19%, and 99.45% for CXR, brain tumour, digital knee X-ray, and CRC datasets, respectively.
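下面是一个玩具级的反向课程训练示意:先在类别分解得到的细粒度子类上训练,再逐步过渡到原始粗粒度类别,骨干网络权重在各阶段间保留。数据、网络结构与分解方式均为简化假设,并非论文在 ResNet-50/DenseNet-121 上的完整流程:

```python
import numpy as np
import torch
import torch.nn as nn
from sklearn.cluster import KMeans

def decompose(X, y, g):
    """Class decomposition: split each original class into g sub-clusters."""
    sub = np.zeros_like(y)
    for c in np.unique(y):
        idx = np.where(y == c)[0]
        km = KMeans(n_clusters=g, n_init=10, random_state=0).fit(X[idx])
        sub[idx] = c * g + km.labels_
    return sub

# toy data: 2 classes, 64 features
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 64)).astype(np.float32)
y = rng.integers(0, 2, size=400)

backbone = nn.Sequential(nn.Linear(64, 32), nn.ReLU())   # weights carried over
for g in [4, 2, 1]:  # anti-curriculum: fine granularity first, coarse last
    sub = decompose(X, y, g) if g > 1 else y
    head = nn.Linear(32, int(sub.max()) + 1)             # fresh head per level
    opt = torch.optim.Adam(list(backbone.parameters()) + list(head.parameters()),
                           lr=1e-3)
    xb, yb = torch.from_numpy(X), torch.from_numpy(sub.astype(np.int64))
    for _ in range(50):
        opt.zero_grad()
        loss = nn.functional.cross_entropy(head(backbone(xb)), yb)
        loss.backward()
        opt.step()
    print(f"granularity {g}: loss {loss.item():.3f}")
```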
zh

[CV-150] Efficient Multi Subject Visual Reconstruction from fMRI Using Aligned Representations

【速读】:该论文试图解决基于功能性磁共振成像(fMRI)的视觉图像重建问题,特别是在数据量有限的情况下如何提高重建效率和通用性。其解决方案的关键在于构建一个与被试无关的公共表示空间(subject-agnostic common representation space),通过在训练过程中将不同被试的脑信号对齐到该空间,形成语义一致的公共脑模型,从而使得针对参考被试的轻量级模块对齐比传统端到端训练方法更为高效。

链接: https://arxiv.org/abs/2505.01670
作者: Christos Zangos,Danish Ebadulla,Thomas Christopher Sprague,Ambuj Singh
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:This work introduces a novel approach to fMRI-based visual image reconstruction using a subject-agnostic common representation space. We show that the brain signals of the subjects can be aligned in this common space during training to form a semantically aligned common brain. This is leveraged to demonstrate that aligning subject-specific lightweight modules to a reference subject is significantly more efficient than traditional end-to-end training methods. Our approach excels in low-data scenarios. We evaluate our methods on different datasets, demonstrating that the common space is subject- and dataset-agnostic.
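将个体数据对齐到参考被试的"轻量级模块",最简可以退化为一个闭式岭回归线性映射。以下 NumPy 草图仅演示该思路,数据为随机生成,并非论文的实际对齐模块:

```python
import numpy as np

def align_to_reference(subj_feats, ref_feats, lam=1.0):
    # Closed-form ridge regression mapping one subject's responses
    # into the reference subject's representation space.
    X, Y = subj_feats, ref_feats
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ Y)

rng = np.random.default_rng(0)
ref = rng.normal(size=(500, 128))                    # reference-space features
mix = rng.normal(size=(128, 96))
subj = ref @ mix + 0.1 * rng.normal(size=(500, 96))  # another subject's responses
W = align_to_reference(subj, ref)                    # lightweight alignment module
print(((subj @ W - ref) ** 2).mean())                # small residual after alignment
```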
zh

[CV-151] A Dual-Task Synergy-Driven Generalization Framework for Pancreatic Cancer Segmentation in CT Scans

【速读】:该论文旨在解决胰腺癌病变分割中因影像差异和病灶异质性导致的模型泛化能力不足问题(generalizability),特别是在跨患者和跨影像模态情况下,病灶可能与正常组织相似且存在显著变异性。其解决方案的关键在于提出一种融合像素级分类与回归任务的泛化框架,通过回归任务揭示病灶与正常组织的空间关系,从而提升肿瘤定位与形态表征的准确性;同时,利用任务输出的相互转换,在分割上下文中引入额外的回归监督,从双任务角度增强模型的泛化能力,并结合特征空间与输出空间的双重自监督学习,进一步提升模型的表示能力和稳定性。

链接: https://arxiv.org/abs/2505.01644
作者: Jun Li,Yijue Zhang,Haibo Shi,Minhong Li,Qiwei Li,Xiaohua Qian
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注: accept by IEEE Transactions on Medical Imaging (TMI) 2025

点击查看摘要

Abstract:Pancreatic cancer, characterized by its notable prevalence and mortality rates, demands accurate lesion delineation for effective diagnosis and therapeutic interventions. The generalizability of extant methods is frequently compromised due to the pronounced variability in imaging and the heterogeneous characteristics of pancreatic lesions, which may mimic normal tissues and exhibit significant inter-patient variability. Thus, we propose a generalization framework that synergizes pixel-level classification and regression tasks, to accurately delineate lesions and improve model stability. This framework not only seeks to align segmentation contours with actual lesions but also uses regression to elucidate spatial relationships between diseased and normal tissues, thereby improving tumor localization and morphological characterization. Enhanced by the reciprocal transformation of task outputs, our approach integrates additional regression supervision within the segmentation context, bolstering the model’s generalization ability from a dual-task perspective. Besides, dual self-supervised learning in feature spaces and output spaces augments the model’s representational capability and stability across different imaging views. Experiments on 594 samples composed of three datasets with significant imaging differences demonstrate that our method achieves generalized pancreas segmentation results comparable to mainstream in-domain validation performance (Dice: 84.07%). More importantly, it successfully improves the results of the highly challenging cross-lesion generalized pancreatic cancer segmentation task by 9.51%. Thus, our model constitutes a resilient and efficient foundational technological support for pancreatic disease management and wider medical applications. The codes will be released at this https URL.
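双任务协同的一种常见做法,是让回归分支预测与病灶掩码相关的连续量(例如符号距离图)。以下 PyTorch 草图演示"分割损失 + 回归损失"的组合方式;以符号距离图为回归目标及其权重均为示意性假设,论文的具体回归目标可能不同:

```python
import numpy as np
import torch
from scipy.ndimage import distance_transform_edt

def distance_map(mask):
    """Signed distance to the lesion boundary (one plausible regression target)."""
    inside = distance_transform_edt(mask)
    outside = distance_transform_edt(1 - mask)
    return inside - outside

def dual_task_loss(seg_logits, reg_pred, mask):
    seg = torch.nn.functional.binary_cross_entropy_with_logits(
        seg_logits, mask.float())
    target = torch.from_numpy(
        distance_map(mask.numpy().astype(np.uint8))).float()
    reg = torch.nn.functional.mse_loss(reg_pred, target)
    return seg + 0.1 * reg   # loss weighting is an assumption

mask = torch.zeros(64, 64, dtype=torch.long)
mask[20:40, 20:40] = 1                        # toy square lesion
loss = dual_task_loss(torch.randn(64, 64), torch.randn(64, 64), mask)
print(loss.item())
```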
zh

[CV-152] Seeing Heat with Color – RGB-Only Wildfire Temperature Inference from SAM-Guided Multimodal Distillation using Radiometric Ground Truth

【速读】:该论文旨在解决基于无人机(UAV)的高保真野火监测中因使用多模态传感(尤其是RGB和热成像)而导致的硬件成本和功耗增加的问题。其解决方案的关键在于提出一种新型的教师-学生蒸馏框架SAM-TIFF,该框架仅使用RGB输入即可实现像素级野火温度预测与分割。通过在配对的RGB-热成像数据和辐射 TIFF 真值上训练的多模态教师网络,将知识蒸馏到仅依赖RGB的单模态学生网络中,从而实现无需热传感器的推理。

链接: https://arxiv.org/abs/2505.01638
作者: Michael Marinaccio,Fatemeh Afghah
机构: Clemson University (克莱姆森大学)
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: 7 pages, 4 figures, 4 tables

点击查看摘要

Abstract:High-fidelity wildfire monitoring using Unmanned Aerial Vehicles (UAVs) typically requires multimodal sensing - especially RGB and thermal imagery - which increases hardware cost and power consumption. This paper introduces SAM-TIFF, a novel teacher-student distillation framework for pixel-level wildfire temperature prediction and segmentation using RGB input only. A multimodal teacher network trained on paired RGB-Thermal imagery and radiometric TIFF ground truth distills knowledge to a unimodal RGB student network, enabling thermal-sensor-free inference. Segmentation supervision is generated using a hybrid approach that combines Segment Anything (SAM)-guided mask generation with selection via TOPSIS, along with a Canny edge detection and Otsu thresholding pipeline for automatic point-prompt selection. Our method is the first to perform per-pixel temperature regression from RGB UAV data, demonstrating strong generalization on the recent FLAME 3 dataset. This work lays the foundation for lightweight, cost-effective UAV-based wildfire monitoring systems without thermal sensors.
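基于 Canny 边缘与 Otsu 阈值自动选取点提示的流程,可以用 OpenCV 粗略示意如下;TOPSIS 排序与 SAM 掩码生成在此省略,采样策略为假设:

```python
import cv2
import numpy as np

def auto_point_prompts(rgb, max_points=5):
    """Sketch: pick interior foreground pixels (Otsu foreground minus Canny
    edges) as candidate point prompts for SAM."""
    gray = cv2.cvtColor(rgb, cv2.COLOR_RGB2GRAY)
    _, fg = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    edges = cv2.Canny(gray, 100, 200)
    candidates = cv2.bitwise_and(fg, cv2.bitwise_not(edges))
    ys, xs = np.nonzero(candidates)
    if len(xs) == 0:
        return np.empty((0, 2), dtype=int)
    idx = np.random.default_rng(0).choice(
        len(xs), size=min(max_points, len(xs)), replace=False)
    return np.stack([xs[idx], ys[idx]], axis=1)   # (x, y) prompts

img = np.random.randint(0, 255, (256, 256, 3), dtype=np.uint8)
print(auto_point_prompts(img))
```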
zh

[CV-153] CostFilter-AD: Enhancing Anomaly Detection through Matching Cost Filtering ICML2025

【速读】:该论文旨在解决无监督异常检测(Unsupervised Anomaly Detection, UAD)中图像级或特征级匹配过程不准确的问题,这一问题常被忽视,导致检测效果不佳。其解决方案的关键在于引入了源自经典匹配任务(如深度和光流估计)中的成本过滤(cost filtering)概念,提出了一种名为CostFilter-AD的方法。该方法首先构建输入与正常样本之间的匹配成本体积,随后通过一个由输入观测引导的注意力机制进行成本体积过滤,有效抑制匹配噪声,同时保留边缘结构并捕捉细微异常,从而提升异常检测性能。

链接: https://arxiv.org/abs/2505.01476
作者: Zhe Zhang,Mingxiu Cai,Hanxiao Wang,Gaochang Wu,Tianyou Chai,Xiatian Zhu
机构: 未知
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: 20 pages, 11 figures, 10 tables, accepted by Forty-Second International Conference on Machine Learning ( ICML 2025 )

点击查看摘要

Abstract:Unsupervised anomaly detection (UAD) seeks to localize the anomaly mask of an input image with respect to normal samples. Either by reconstructing normal counterparts (reconstruction-based) or by learning an image feature embedding space (embedding-based), existing approaches fundamentally rely on image-level or feature-level matching to derive anomaly scores. Often, such a matching process is inaccurate yet overlooked, leading to sub-optimal detection. To address this issue, we introduce the concept of cost filtering, borrowed from classical matching tasks, such as depth and flow estimation, into the UAD problem. We call this approach CostFilter-AD. Specifically, we first construct a matching cost volume between the input and normal samples, comprising two spatial dimensions and one matching dimension that encodes potential matches. To refine this, we propose a cost volume filtering network, guided by the input observation as an attention query across multiple feature layers, which effectively suppresses matching noise while preserving edge structures and capturing subtle anomalies. Designed as a generic post-processing plug-in, CostFilter-AD can be integrated with either reconstruction-based or embedding-based methods. Extensive experiments on MVTec-AD and VisA benchmarks validate the generic benefits of CostFilter-AD for both single- and multi-class UAD tasks. Code and models will be released at this https URL.
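匹配成本体积的构建本身并不复杂:对输入特征图的每个空间位置,与正常样本特征库逐一计算匹配代价。以下 PyTorch 草图仅演示体积构建与最朴素的异常图读取,论文核心的成本体积过滤网络不在其中:

```python
import torch

def build_cost_volume(feat_in, feat_normal):
    """Cosine-distance cost volume between an input feature map and a bank
    of normal-sample features.

    feat_in:     (C, H, W) feature map of the query image
    feat_normal: (N, C) flattened reference features from normal samples
    returns:     (N, H, W); entry [n, i, j] is the cost of matching
                 location (i, j) to the n-th normal feature
    """
    C, H, W = feat_in.shape
    q = torch.nn.functional.normalize(feat_in.reshape(C, -1), dim=0)  # (C, H*W)
    k = torch.nn.functional.normalize(feat_normal, dim=1)             # (N, C)
    sim = k @ q                                                       # (N, H*W)
    return (1.0 - sim).reshape(-1, H, W)

feat_in = torch.randn(64, 32, 32)
bank = torch.randn(100, 64)
vol = build_cost_volume(feat_in, bank)
anomaly_map = vol.min(dim=0).values   # high min-cost = poorly matched = anomalous
print(vol.shape, anomaly_map.shape)
```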
zh

人工智能

[AI-0] LISAT: Language-Instructed Segmentation Assistant for Satellite Imagery

【速读】:该论文试图解决在复杂遥感图像中进行语义分割和视觉-语言推理的问题,特别是在处理隐含多个感兴趣对象的用户查询时表现不足的挑战。其解决方案的关键在于提出一种名为LISAt的视觉-语言模型,该模型通过在新构建的地理空间推理-分割数据集GRES以及多模态预训练数据集PreGRES上进行训练,以提升对遥感场景的描述、问答和目标分割能力。

链接: https://arxiv.org/abs/2505.02829
作者: Jerome Quenum,Wen-Han Hsieh,Tsung-Han Wu,Ritwik Gupta,Trevor Darrell,David M. Chan
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 28 pages, 10 figures, 19 tables

点击查看摘要

Abstract:Segmentation models can recognize a pre-defined set of objects in images. However, models that can reason over complex user queries that implicitly refer to multiple objects of interest are still in their infancy. Recent advances in reasoning segmentation–generating segmentation masks from complex, implicit query text–demonstrate that vision-language models can operate across an open domain and produce reasonable outputs. However, our experiments show that such models struggle with complex remote-sensing imagery. In this work, we introduce LISAt, a vision-language model designed to describe complex remote-sensing scenes, answer questions about them, and segment objects of interest. We trained LISAt on a new curated geospatial reasoning-segmentation dataset, GRES, with 27,615 annotations over 9,205 images, and a multimodal pretraining dataset, PreGRES, containing over 1 million question-answer pairs. LISAt outperforms existing geospatial foundation models such as RS-GPT4V by over 10.04 % (BLEU-4) on remote-sensing description tasks, and surpasses state-of-the-art open-domain models on reasoning segmentation tasks by 143.36 % (gIoU). Our model, datasets, and code are available at this https URL
zh

[AI-1] Privacy Risks and Preservation Methods in Explainable Artificial Intelligence: A Scoping Review

【速读】:该论文试图解决在可解释人工智能(Explainable Artificial Intelligence, XAI)中,解释信息的披露与用户隐私之间的冲突问题。其关键在于识别释放解释所带来的隐私风险,探索当前用于实现XAI系统隐私保护的方法,并界定隐私保护型解释的特征,以促进符合隐私要求的可信AI发展。

链接: https://arxiv.org/abs/2505.02828
作者: Sonal Allana,Mohan Kankanhalli,Rozita Dara
机构: 未知
类目: Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Emerging Technologies (cs.ET)
备注: Submitted for peer review

点击查看摘要

Abstract:Explainable Artificial Intelligence (XAI) has emerged as a pillar of Trustworthy AI and aims to bring transparency in complex models that are opaque by nature. Despite the benefits of incorporating explanations in models, an urgent need is found in addressing the privacy concerns of providing this additional information to end users. In this article, we conduct a scoping review of existing literature to elicit details on the conflict between privacy and explainability. Using the standard methodology for scoping review, we extracted 57 articles from 1,943 studies published from January 2019 to December 2024. The review addresses 3 research questions to present readers with more understanding of the topic: (1) what are the privacy risks of releasing explanations in AI systems? (2) what current methods have researchers employed to achieve privacy preservation in XAI systems? (3) what constitutes a privacy preserving explanation? Based on the knowledge synthesized from the selected studies, we categorize the privacy risks and preservation methods in XAI and propose the characteristics of privacy preserving explanations to aid researchers and practitioners in understanding the requirements of XAI that is privacy compliant. Lastly, we identify the challenges in balancing privacy with other system desiderata and provide recommendations for achieving privacy preserving XAI. We expect that this review will shed light on the complex relationship of privacy and explainability, both being the fundamental principles of Trustworthy AI.
zh

[AI-2] HSplitLoRA: A Heterogeneous Split Parameter-Efficient Fine-Tuning Framework for Large Language Models

【速读】:该论文旨在解决在异构客户端设备上高效微调大型语言模型(Large Language Models, LLMs)所面临的计算成本高和资源不均衡的问题。其解决方案的关键在于提出HSplitLoRA框架,该框架结合了分割学习(Split Learning, SL)与低秩适配(Low-Rank Adaptation, LoRA)技术,通过动态配置LoRA适配器的分解秩并根据客户端的计算预算确定模型分割点,从而实现高效的参数高效微调(Parameter-Efficient Fine-Tuning, PEFT)。此外,该方法还引入了一种无噪声的适配器聚合机制,以支持异构适配器的融合。

链接: https://arxiv.org/abs/2505.02795
作者: Zheng Lin,Yuxin Zhang,Zhe Chen,Zihan Fang,Xianhao Chen,Praneeth Vepakomma,Wei Ni,Jun Luo,Yue Gao
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
备注: 16 pages, 22 figures

点击查看摘要

Abstract:Recently, large language models (LLMs) have achieved remarkable breakthroughs, revolutionizing the natural language processing domain and beyond. Due to immense parameter sizes, fine-tuning these models with private data for diverse downstream tasks has become mainstream. Though federated learning (FL) offers a promising solution for fine-tuning LLMs without sharing raw data, substantial computing costs hinder its democratization. Moreover, in real-world scenarios, private client devices often possess heterogeneous computing resources, further complicating LLM fine-tuning. To combat these challenges, we propose HSplitLoRA, a heterogeneous parameter-efficient fine-tuning (PEFT) framework built on split learning (SL) and low-rank adaptation (LoRA) fine-tuning, for efficiently fine-tuning LLMs on heterogeneous client devices. HSplitLoRA first identifies important weights based on their contributions to LLM training. It then dynamically configures the decomposition ranks of LoRA adapters for selected weights and determines the model split point according to varying computing budgets of client devices. Finally, a noise-free adapter aggregation mechanism is devised to support heterogeneous adapter aggregation without introducing noise. Extensive experiments demonstrate that HSplitLoRA outperforms state-of-the-art benchmarks in training accuracy and convergence speed.
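HSplitLoRA 的基础构件之一是秩可配置的 LoRA 适配器:冻结预训练权重,仅训练低秩增量。以下 PyTorch 草图展示不同分解秩对应的可训练参数量;论文中的动态秩选择与模型分割点确定逻辑未包含,属于简化假设:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Sketch of a LoRA adapter whose rank can be set per client budget."""
    def __init__(self, base: nn.Linear, rank: int, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False          # frozen pretrained weight
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + (x @ self.A.T) @ self.B.T * self.scale

# clients with different compute budgets get different decomposition ranks
for rank in [4, 8, 16]:
    layer = LoRALinear(nn.Linear(768, 768), rank=rank)
    trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
    print(f"rank {rank}: {trainable} trainable parameters")
```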
zh

[AI-3] Local Markov Equivalence and Local Causal Discovery for Identifying Controlled Direct Effects

【速读】:该论文试图解决在未知真实因果有向无环图(DAG)结构的情况下,如何有效识别受控直接效应(CDEs)的问题。其解决方案的关键在于提出局部本质图(LEG),这是一种相对于目标变量定义的图类,共享特定的d-分离子集,并通过LocPC算法仅使用局部条件独立性检验来恢复该图。进一步地,基于LocPC算法,提出了LocPC-CDE算法,用于发现足以识别CDE的部分LEG,从而避免了获取完整本质图的需求,降低了计算复杂度并放宽了假设条件。

链接: https://arxiv.org/abs/2505.02781
作者: Timothée Loranchet,Charles K. Assaad
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Understanding and identifying controlled direct effects (CDEs) is crucial across numerous scientific domains, including public health. While existing methods can identify these effects from causal directed acyclic graphs (DAGs), the true underlying structure is often unknown in practice. Essential graphs, which represent a Markov equivalence class of DAGs characterized by the same set of d-separations, provide a more practical and realistic alternative. However, learning the full essential graph is computationally intensive and typically depends on strong, untestable assumptions. In this work, we characterize a local class of graphs, defined relative to a target variable, that share a specific subset of d-separations, and introduce a graphical representation of this class, called the local essential graph (LEG). We then present LocPC, a novel algorithm designed to recover the LEG from an observed distribution using only local conditional independence tests. Building on LocPC, we propose LocPC-CDE, an algorithm that discovers the portion of the LEG that is sufficient to identify a CDE, bypassing the need to retrieve the full essential graph. Compared to global methods, our algorithms require fewer conditional independence tests and operate under weaker assumptions while maintaining theoretical guarantees.
zh

[AI-4] Beyond the Monitor: Mixed Reality Visualization and AI for Enhanced Digital Pathology Workflow

【速读】:该论文试图解决病理学家在使用高分辨率全切片图像(WSIs)进行疾病诊断时所面临的挑战,特别是由于WSIs的规模巨大(通常超过100,000×100,000像素)与传统显示器视图限制之间的不匹配,导致频繁的平移和缩放操作,增加了认知负荷并降低了诊断效率。解决方案的关键是PathVis,这是一个为Apple Vision Pro设计的混合现实可视化平台,通过自然的手势、眼动追踪和语音命令实现直观的数据探索,并集成人工智能技术以提升诊断精度和效率。

链接: https://arxiv.org/abs/2505.02780
作者: Jai Prakash Veerla,Partha Sai Guttikonda,Helen H. Shang,Mohammad Sadegh Nasr,Cesar Torres,Jacob M. Luber
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Tissues and Organs (q-bio.TO)
备注:

点击查看摘要

Abstract:Pathologists rely on gigapixel whole-slide images (WSIs) to diagnose diseases like cancer, yet current digital pathology tools hinder diagnosis. The immense scale of WSIs, often exceeding 100,000 X 100,000 pixels, clashes with the limited views traditional monitors offer. This mismatch forces constant panning and zooming, increasing pathologist cognitive load, causing diagnostic fatigue, and slowing pathologists’ adoption of digital methods. PathVis, our mixed-reality visualization platform for Apple Vision Pro, addresses these challenges. It transforms the pathologist’s interaction with data, replacing cumbersome mouse-and-monitor navigation with intuitive exploration using natural hand gestures, eye gaze, and voice commands in an immersive workspace. PathVis integrates AI to enhance diagnosis. An AI-driven search function instantly retrieves and displays the top five similar patient cases side-by-side, improving diagnostic precision and efficiency through rapid comparison. Additionally, a multimodal conversational AI assistant offers real-time image interpretation support and aids collaboration among pathologists across multiple Apple devices. By merging the directness of traditional pathology with advanced mixed-reality visualization and AI, PathVis improves diagnostic workflows, reduces cognitive strain, and makes pathology practice more effective and engaging. The PathVis source code and a demo video are publicly available at: this https URL
zh

[AI-5] Giving Simulated Cells a Voice: Evolving Prompt-to-Intervention Models for Cellular Control GECCO GECCO2025

【速读】:该论文试图解决如何通过自然语言指令引导生物系统(如细胞群体)达到特定动态状态的问题,这一挑战在医学和合成生物学中具有重要意义。其解决方案的关键在于构建一个将自然语言提示转化为空间向量场的管道,该管道结合了大型语言模型与可进化神经控制器(Prompt-to-Intervention, P2I),并通过进化策略进行优化,以生成如聚集或分散等行为,从而实现对模拟二维环境中细胞动力学的有效控制。

链接: https://arxiv.org/abs/2505.02766
作者: Nam H. Le,Patrick Erikson,Yanbo Zhang,Michael Levin,Josh Bongard
机构: 未知
类目: Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE); Robotics (cs.RO); Tissues and Organs (q-bio.TO)
备注: Accepted to GECCO Workshop on Bio-Inspired AI (ACM GECCO2025). 13 pages, 7 figures

点击查看摘要

Abstract:Guiding biological systems toward desired states, such as morphogenetic outcomes, remains a fundamental challenge with far-reaching implications for medicine and synthetic biology. While large language models (LLMs) have enabled natural language as an interface for interpretable control in AI systems, their use as mediators for steering biological or cellular dynamics remains largely unexplored. In this work, we present a functional pipeline that translates natural language prompts into spatial vector fields capable of directing simulated cellular collectives. Our approach combines a large language model with an evolvable neural controller (Prompt-to-Intervention, or P2I), optimized via evolutionary strategies to generate behaviors such as clustering or scattering in a simulated 2D environment. We demonstrate that even with constrained vocabulary and simplified cell models, evolved P2I networks can successfully align cellular dynamics with user-defined goals expressed in plain language. This work offers a complete loop from language input to simulated bioelectric-like intervention to behavioral output, providing a foundation for future systems capable of natural language-driven cellular control.
zh

[AI-6] he use of Artificial Intelligence for Intervention and Assessment in Individuals with ASD

【速读】:该论文试图解决自闭症谱系障碍(Autism Spectrum Disorder, ASD)诊断与干预中存在的准确性不足、主观性较强及个性化程度低的问题。其解决方案的关键在于利用人工智能(Artificial Intelligence, AI)技术,特别是深度学习算法和机器学习方法,通过分析生物特征数据、视频交互评估和语言特征提取等手段,实现更精准、及时的早期诊断,并开发个性化的评估与干预方案。此外,AI驱动的教育机器人和辅助沟通系统也被视为提升ASD患者社会技能和语言能力的重要工具。

链接: https://arxiv.org/abs/2505.02747
作者: Aggeliki Sideraki,Christos-Nikolaos Anagnostopoulos
机构: 未知
类目: Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注: 21 pages

点击查看摘要

Abstract:This paper explores the use of Artificial Intelligence (AI) as a tool for diagnosis, assessment, and intervention for individuals with Autism Spectrum Disorder (ASD). It focuses particularly on AI’s role in early diagnosis, utilizing advanced machine learning techniques and data analysis. Recent studies demonstrate that deep learning algorithms can identify behavioral patterns through biometric data analysis, video-based interaction assessments, and linguistic feature extraction, providing a more accurate and timely diagnosis compared to traditional methods. Additionally, AI automates diagnostic tools, reducing subjective biases and enabling the development of personalized assessment protocols for ASD monitoring. At the same time, the paper examines AI-powered intervention technologies, emphasizing educational robots and adaptive communication tools. Social robotic assistants, such as NAO and Kaspar, have been shown to enhance social skills in children by offering structured, repetitive interactions that reinforce learning. Furthermore, AI-driven Augmentative and Alternative Communication (AAC) systems allow children with ASD to express themselves more effectively, while machine-learning chatbots provide language development support through personalized responses. The study presents research findings supporting the effectiveness of these AI applications while addressing challenges such as long-term evaluation and customization to individual needs. In conclusion, the paper highlights the significance of AI as an innovative tool in ASD diagnosis and intervention, advocating for further research to assess its long-term impact.
zh

[AI-7] Knowledge Graphs for Enhancing Large Language Models in Entity Disambiguation ISWC2024

【速读】:该论文试图解决大型语言模型(Large Language Models, LLMs)在零样本实体消歧(zero-shot Entity Disambiguation, ED)任务中面临的幻觉问题以及由于训练数据中存在过时知识或特定领域信息缺失而导致的性能不足问题。解决方案的关键在于利用知识图谱(Knowledge Graphs, KGs)作为结构化的外部信息源,通过KG中实体类别的层次化表示逐步缩减候选空间,并结合实体描述以增强输入提示中的事实性知识,从而提升LLMs在ED任务中的表现和适应性。

链接: https://arxiv.org/abs/2505.02737
作者: Pons Gerard,Bilalli Besim,Queralt Anna
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Databases (cs.DB)
备注: Pre-print submitted to ISWC 2024

点击查看摘要

Abstract:Recent advances in Large Language Models (LLMs) have positioned them as a prominent solution for Natural Language Processing tasks. Notably, they can approach these problems in a zero or few-shot manner, thereby eliminating the need for training or fine-tuning task-specific models. However, LLMs face some challenges, including hallucination and the presence of outdated knowledge or missing information from specific domains in the training data. These problems cannot be easily solved by retraining the models with new data as it is a time-consuming and expensive process. To mitigate these issues, Knowledge Graphs (KGs) have been proposed as a structured external source of information to enrich LLMs. With this idea, in this work we use KGs to enhance LLMs for zero-shot Entity Disambiguation (ED). For that purpose, we leverage the hierarchical representation of the entities’ classes in a KG to gradually prune the candidate space as well as the entities’ descriptions to enrich the input prompt with additional factual knowledge. Our evaluation on popular ED datasets shows that the proposed method outperforms non-enhanced and description-only enhanced LLMs, and has a higher degree of adaptability than task-specific models. Furthermore, we conduct an error analysis and discuss the impact of the leveraged KG’s semantic expressivity on the ED performance.
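利用 KG 类别层次逐步裁剪候选空间的思路可以用很小的代码量示意:先判定提及的粗粒度类别,再只保留该类别子树下的候选实体。以下为纯 Python 玩具示例,层次结构与候选实体均为虚构(真实系统会查询 Wikidata 等知识图谱):

```python
# Hypothetical toy hierarchy and candidate set, for illustration only.
HIERARCHY = {
    "entity": ["organization", "person"],
    "organization": ["company", "university"],
    "person": ["politician", "athlete"],
}
CANDIDATES = {
    "Apple Inc.": "company",
    "apple (fruit)": "entity",
    "University of Apple Valley": "university",
}

def descendants(cls):
    out = {cls}
    for child in HIERARCHY.get(cls, []):
        out |= descendants(child)
    return out

def prune(candidates, cls):
    """Keep only candidates whose KG class falls under `cls`."""
    keep = descendants(cls)
    return {name: c for name, c in candidates.items() if c in keep}

# An LLM is first asked a coarse question ("is the mention an organization?"),
# then the pruned candidates go into the final disambiguation prompt.
print(prune(CANDIDATES, "organization"))
```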
zh

[AI-8] FormalMATH: Benchmarking Formal Mathematical Reasoning of Large Language Models

【速读】:该论文试图解决人工智能在形式化数学推理中的关键挑战,尤其是现有基准在范围和规模上的局限性。其解决方案的关键在于提出一个大规模的Lean4基准——FormalMATH,包含5,560个经过形式化验证的问题,覆盖从高中奥数到本科水平定理的多个领域。为提高形式化效率,研究引入了一种人机协同的自动形式化流程,包括:(1)专用大语言模型(LLM)用于命题自动形式化,(2)多LLM语义验证,以及(3)基于否定的反例过滤策略,从而显著降低了专家标注成本并保持与原始自然语言问题的一致性。

链接: https://arxiv.org/abs/2505.02735
作者: Zhouliang Yu,Ruotian Peng,Keyi Ding,Yizhe Li,Zhongyuan Peng,Minghao Liu,Yifan Zhang,Zheng Yuan,Huajian Xin,Wenhao Huang,Yandong Wen,Ge Zhang,Weiyang Liu
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Technical Report v1 (33 pages, 8 figures, project page: this https URL )

点击查看摘要

Abstract:Formal mathematical reasoning remains a critical challenge for artificial intelligence, hindered by limitations of existing benchmarks in scope and scale. To address this, we present FormalMATH, a large-scale Lean4 benchmark comprising 5,560 formally verified problems spanning from high-school Olympiad challenges to undergraduate-level theorems across diverse domains (e.g., algebra, applied mathematics, calculus, number theory, and discrete mathematics). To mitigate the inefficiency of manual formalization, we introduce a novel human-in-the-loop autoformalization pipeline that integrates: (1) specialized large language models (LLMs) for statement autoformalization, (2) multi-LLM semantic verification, and (3) negation-based disproof filtering strategies using off-the-shelf LLM-based provers. This approach reduces expert annotation costs by retaining 72.09% of statements before manual verification while ensuring fidelity to the original natural-language problems. Our evaluation of state-of-the-art LLM-based theorem provers reveals significant limitations: even the strongest models achieve only a 16.46% success rate under practical sampling budgets, exhibiting pronounced domain bias (e.g., excelling in algebra but failing in calculus) and over-reliance on simplified automation tactics. Notably, we identify a counterintuitive inverse relationship between natural-language solution guidance and proof success in chain-of-thought reasoning scenarios, suggesting that human-written informal reasoning introduces noise rather than clarity in the formal reasoning settings. We believe that FormalMATH provides a robust benchmark for evaluating formal mathematical reasoning.
zh

[AI-9] Enhancing LLM s Clinical Reasoning with Real-World Data from a Nationwide Sepsis Registry

【速读】:该论文试图解决大型语言模型(Large Language Models, LLMs)在实际临床实践中推理能力不足的问题,这一问题可能源于训练数据中缺乏真实临床数据。解决方案的关键在于利用真实临床数据增强LLMs的临床推理能力,具体通过从全国性脓毒症登记数据库中构建高推理强度的问题,并使用强化学习对Phi-4模型进行微调,从而得到C-Reason模型。

链接: https://arxiv.org/abs/2505.02722
作者: Junu Kim,Chaeeun Shim,Sungjin Park,Su Yeon Lee,Gee Young Suh,Chae-Man Lim,Seong Jin Choi,Song Mi Moon,Kyoung-Ho Song,Eu Suk Kim,Hong Bin Kim,Sejoong Kim,Chami Im,Dong-Wan Kang,Yong Soo Kim,Hee-Joon Bae,Sung Yoon Lim,Han-Gil Jeong,Edward Choi
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Although large language models (LLMs) have demonstrated impressive reasoning capabilities across general domains, their effectiveness in real-world clinical practice remains limited. This is likely due to their insufficient exposure to real-world clinical data during training, as such data is typically not included due to privacy concerns. To address this, we propose enhancing the clinical reasoning capabilities of LLMs by leveraging real-world clinical data. We constructed reasoning-intensive questions from a nationwide sepsis registry and fine-tuned Phi-4 on these questions using reinforcement learning, resulting in C-Reason. C-Reason exhibited strong clinical reasoning capabilities on the in-domain test set, as evidenced by both quantitative metrics and expert evaluations. Furthermore, its enhanced reasoning capabilities generalized to a sepsis dataset involving different tasks and patient cohorts, an open-ended consultation task on antibiotic use, and other diseases. Future research should focus on training LLMs with large-scale, multi-disease clinical datasets to develop more powerful, general-purpose clinical reasoning models.
zh

[AI-10] Graph Neural Network-Based Reinforcement Learning for Controlling Biological Networks: The GATTACA Framework

【速读】:该论文试图解决通过传统湿实验方法发现细胞重编程策略所面临的耗时和高成本问题。其解决方案的关键在于利用深度强化学习(Deep Reinforcement Learning, DRL)控制布尔网络模型,特别是在异步更新模式下进行细胞重编程的控制问题。研究提出了一种新的控制问题框架,并通过引入图神经网络(Graph Neural Networks, GNNs)与图卷积操作,提升了对生物系统结构的建模能力,从而有效识别伪吸引子状态并实现高效的控制策略。

链接: https://arxiv.org/abs/2505.02712
作者: Andrzej Mizera,Jakub Zarzycki
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Molecular Networks (q-bio.MN)
备注:

点击查看摘要

Abstract:Cellular reprogramming, the artificial transformation of one cell type into another, has been attracting increasing research attention due to its therapeutic potential for complex diseases. However, discovering reprogramming strategies through classical wet-lab experiments is hindered by lengthy time commitments and high costs. In this study, we explore the use of deep reinforcement learning (DRL) to control Boolean network models of complex biological systems, such as gene regulatory networks and signalling pathway networks. We formulate a novel control problem for Boolean network models under the asynchronous update mode in the context of cellular reprogramming. To facilitate scalability, we consider our previously introduced concept of a pseudo-attractor and we improve our procedure for effective identification of pseudo-attractor states. Finally, we devise a computational framework to solve the control problem. To leverage the structure of biological systems, we incorporate graph neural networks with graph convolutions into the artificial neural network approximator for the action-value function learned by the DRL agent. Experiments on a number of large real-world biological networks from literature demonstrate the scalability and effectiveness of our approach.
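布尔网络在异步更新模式下,每一步随机选取一个节点应用其更新函数。以下纯 Python 玩具示例演示该更新语义;三基因网络为虚构,论文中的伪吸引子识别与 DRL 控制不在此示意范围内:

```python
import random

def async_step(state, functions, rng):
    """One asynchronous update: a single randomly chosen node applies its
    Boolean update function (toy stand-in for a gene regulatory network)."""
    i = rng.randrange(len(state))
    new = list(state)
    new[i] = functions[i](state)
    return tuple(new)

# hypothetical 3-gene network: A <- not C, B <- A and C, C <- A or B
functions = [
    lambda s: not s[2],
    lambda s: s[0] and s[2],
    lambda s: s[0] or s[1],
]
rng = random.Random(0)
state = (False, True, False)
for _ in range(10):
    state = async_step(state, functions, rng)
    print(state)   # trajectory through the asynchronous state space
```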
zh

[AI-11] chnical Report: Evaluating Goal Drift in Language Model Agents

【速读】:该论文试图解决语言模型(Language Models, LMs)作为自主代理在长时间独立运行过程中可能出现的目标漂移(goal drift)问题,即代理逐渐偏离初始设定目标的现象。解决方案的关键在于通过实验设计,评估代理在面对环境压力时对目标的保持能力,并揭示目标漂移与模型在长上下文中的模式匹配行为之间的关联。

链接: https://arxiv.org/abs/2505.02709
作者: Rauno Arike,Elizabeth Donoway,Henning Bartsch,Marius Hobbhahn
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:As language models (LMs) are increasingly deployed as autonomous agents, their robust adherence to human-assigned objectives becomes crucial for safe operation. When these agents operate independently for extended periods without human oversight, even initially well-specified goals may gradually shift. Detecting and measuring goal drift - an agent’s tendency to deviate from its original objective over time - presents significant challenges, as goals can shift gradually, causing only subtle behavioral changes. This paper proposes a novel approach to analyzing goal drift in LM agents. In our experiments, agents are first explicitly given a goal through their system prompt, then exposed to competing objectives through environmental pressures. We demonstrate that while the best-performing agent (a scaffolded version of Claude 3.5 Sonnet) maintains nearly perfect goal adherence for more than 100,000 tokens in our most difficult evaluation setting, all evaluated models exhibit some degree of goal drift. We also find that goal drift correlates with models’ increasing susceptibility to pattern-matching behaviors as the context length grows.
zh

[AI-12] AI Standardized Patient Improves Human Conversations in Advanced Cancer Care

【速读】:该论文试图解决临终关怀中严重疾病沟通(Serious Illness Communication, SIC)训练面临的挑战,如情感压力、文化障碍以及在希望与诚实之间取得平衡的问题。现有的解决方案——标准化患者训练——存在成本高、耗时且灵活性差的缺点。论文提出的解决方案是SOPHIE,其关键在于结合大型语言模型(Large Language Models, LLMs)、逼真的虚拟形象以及基于临床文献的自动化个性化反馈,从而提供远程、按需的SIC培训,有效提升医学生和专业人员在关键SIC领域的表现。

链接: https://arxiv.org/abs/2505.02694
作者: Kurtis Haut,Masum Hasan,Thomas Carroll,Ronald Epstein,Taylan Sen,Ehsan Hoque
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注: 20 pages, 6 figures, 4 tables, submitting to New England Journal of Medicine (NEJM)

点击查看摘要

Abstract:Serious illness communication (SIC) in end-of-life care faces challenges such as emotional stress, cultural barriers, and balancing hope with honesty. Despite its importance, one of the few available ways for clinicians to practice SIC is with standardized patients, which is expensive, time-consuming, and inflexible. In this paper, we present SOPHIE, an AI-powered standardized patient simulation and automated feedback system. SOPHIE combines large language models (LLMs), a lifelike virtual avatar, and automated, personalized feedback based on clinical literature to provide remote, on-demand SIC training. In a randomized control study with healthcare students and professionals, SOPHIE users demonstrated significant improvement across three critical SIC domains: Empathize, Be Explicit, and Empower. These results suggest that AI-driven tools can enhance complex interpersonal communication skills, offering scalable, accessible solutions to address a critical gap in clinician education.
zh

[AI-13] A Survey of Slow Thinking-based Reasoning LLM s using Reinforced Learning and Inference-time Scaling Law

【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在复杂任务中缺乏深度推理能力的问题,特别是如何模拟人类的“慢思考”过程以提升模型的逻辑推理与问题解决能力。其解决方案的关键在于通过三种核心方法实现推理能力的增强:一是测试时动态扩展计算资源,根据任务复杂度进行搜索、采样和动态验证;二是利用强化学习通过策略网络、奖励模型和自进化策略优化决策过程;三是构建慢思考框架,如长链式思维(long CoT)和分层流程,以结构化方式分解问题解决步骤。这些方法共同推动了LLMs在科学发现、医疗诊断等实际应用场景中的高效与深度推理能力。

链接: https://arxiv.org/abs/2505.02665
作者: Qianjun Pan,Wenkai Ji,Yuyang Ding,Junsong Li,Shilian Chen,Junyi Wang,Jie Zhou,Qin Chen,Min Zhang,Yulan Wu,Liang He
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:This survey explores recent advancements in reasoning large language models (LLMs) designed to mimic “slow thinking” - a reasoning process inspired by human cognition, as described in Kahneman’s Thinking, Fast and Slow. These models, like OpenAI’s o1, focus on scaling computational resources dynamically during complex tasks, such as math reasoning, visual reasoning, medical diagnosis, and multi-agent debates. We present the development of reasoning LLMs and list their key technologies. By synthesizing over 100 studies, it charts a path toward LLMs that combine human-like deep thinking with scalable efficiency for reasoning. The review breaks down methods into three categories: (1) test-time scaling dynamically adjusts computation based on task complexity via search and sampling, dynamic verification; (2) reinforced learning refines decision-making through iterative improvement leveraging policy networks, reward models, and self-evolution strategies; and (3) slow-thinking frameworks (e.g., long CoT, hierarchical processes) that structure problem-solving with manageable steps. The survey highlights the challenges and further directions of this domain. Understanding and advancing the reasoning abilities of LLMs is crucial for unlocking their full potential in real-world applications, from scientific discovery to decision support systems.
zh

[AI-14] A Note on Statistically Accurate Tabular Data Generation Using Large Language Models

【速读】:该论文试图解决大型语言模型(Large Language Models, LLMs)在生成合成表格数据时难以保留复杂特征依赖关系,尤其是类别变量之间依赖关系的问题。其解决方案的关键在于提出一种基于概率的提示方法,利用LLMs估计条件分布,从而实现更准确和可扩展的数据合成。

链接: https://arxiv.org/abs/2505.02659
作者: Andrey Sidorenko
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large language models (LLMs) have shown promise in synthetic tabular data generation, yet existing methods struggle to preserve complex feature dependencies, particularly among categorical variables. This work introduces a probability-driven prompting approach that leverages LLMs to estimate conditional distributions, enabling more accurate and scalable data synthesis. The results highlight the potential of prompting for probability distributions to enhance the statistical fidelity of LLM-generated tabular data.
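概率驱动提示的核心是让 LLM 直接给出条件分布,再从中采样生成表格行。以下草图中的 query_llm_distribution 为虚构占位函数(返回固定分布),仅用于说明流程,并非任何真实 API:

```python
import random

def query_llm_distribution(prompt: str) -> dict:
    """Placeholder for an LLM call returning a conditional categorical
    distribution; hypothetical stub, not a real API."""
    return {"yes": 0.7, "no": 0.3}

def sample_conditional(column: str, conditions: dict, rng: random.Random):
    prompt = (f"Given {conditions}, give the probability distribution "
              f"over values of '{column}' as JSON.")
    dist = query_llm_distribution(prompt)
    values, weights = zip(*dist.items())
    return rng.choices(values, weights=weights, k=1)[0]

# build a synthetic row column by column, conditioning on earlier columns
rng = random.Random(0)
row = {"age_group": "30-40"}
row["owns_car"] = sample_conditional("owns_car", dict(row), rng)
print(row)
```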
zh

[AI-15] SCFormer: Structured Channel-wise Transformer with Cumulative Historical State for Multivariate Time Series Forecasting

【速读】:该论文旨在解决传统Transformer模型在多变量时间序列预测中缺乏时间约束以及未能有效利用累积历史序列的问题(Temporal Constraints and Cumulative Historical Series)。其解决方案的关键在于提出Structured Channel-wise Transformer with Cumulative Historical state (SCFormer),通过在所有线性变换中引入时间约束,包括查询、键、值矩阵以及Transformer内的全连接层,并采用High-order Polynomial Projection Operators (HiPPO)处理累积历史时间序列,从而在预测过程中整合超出回溯窗口的信息。

链接: https://arxiv.org/abs/2505.02655
作者: Shiwei Guo,Ziang Chen,Yupeng Ma,Yunfei Han,Yi Wang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The Transformer model has shown strong performance in multivariate time series forecasting by leveraging channel-wise self-attention. However, this approach lacks temporal constraints when computing temporal features and does not utilize cumulative historical series. To address these limitations, we propose the Structured Channel-wise Transformer with Cumulative Historical state (SCFormer). SCFormer introduces temporal constraints to all linear transformations, including the query, key, and value matrices, as well as the fully connected layers within the Transformer. Additionally, SCFormer employs High-order Polynomial Projection Operators (HiPPO) to deal with cumulative historical time series, allowing the model to incorporate information beyond the look-back window during prediction. Extensive experiments on multiple real-world datasets demonstrate that SCFormer significantly outperforms mainstream baselines, highlighting its effectiveness in enhancing time series forecasting. The code is publicly available at this https URL
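"对线性变换施加时间约束"的一种直观实现,是给时间维的权重矩阵加下三角掩码,使 t 时刻的输出只依赖不晚于 t 的输入。以下 PyTorch 草图仅演示这一构造方式,属于示意性假设;SCFormer 的具体实现与 HiPPO 部分并未包含:

```python
import torch
import torch.nn as nn

class CausalLinear(nn.Module):
    """Sketch: a linear map over the time axis whose weight matrix is
    lower-triangular, so each output step only uses past/current inputs."""
    def __init__(self, seq_len):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(seq_len, seq_len) * 0.02)
        self.register_buffer("mask", torch.tril(torch.ones(seq_len, seq_len)))

    def forward(self, x):              # x: (batch, channels, seq_len)
        return x @ (self.weight * self.mask).T

layer = CausalLinear(seq_len=96)
x = torch.randn(4, 7, 96)              # 7 variables, 96 time steps
print(layer(x).shape)                  # output keeps the same shape
```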
zh

[AI-16] Eye Movements as Indicators of Deception: A Machine Learning Approach

【速读】:该论文试图解决如何利用眼动数据(gaze)提升测谎设备的鲁棒性这一问题,尤其是在隐匿信息测试(Concealed Information Test)中检测欺骗行为。其解决方案的关键在于采用生成式 AI (Generative AI) 模型,通过分析注视点、眼跳、眨眼和瞳孔大小等眼动特征,实现对欺骗行为的有效分类。研究中使用XGBoost算法在两个不同数据集上进行了实验,取得了较高的分类准确率,表明眼动数据与AI结合具有提升测谎系统性能的潜力。

链接: https://arxiv.org/abs/2505.02649
作者: Valentin Foucher,Santiago de Leon-Martinez,Robert Moro
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Gaze may enhance the robustness of lie detectors but remains under-studied. This study evaluated the efficacy of AI models (using fixations, saccades, blinks, and pupil size) for detecting deception in Concealed Information Tests across two datasets. The first, collected with Eyelink 1000, contains gaze data from a computerized experiment where 87 participants revealed, concealed, or faked the value of a previously selected card. The second, collected with Pupil Neon, involved 36 participants performing a similar task but facing an experimenter. XGBoost achieved accuracies up to 74% in a binary classification task (Revealing vs. Concealing) and 49% in a more challenging three-class classification task (Revealing vs. Concealing vs. Faking). Feature analysis identified saccade number, duration, amplitude, and maximum pupil size as the most important for deception prediction. These results demonstrate the feasibility of using gaze and AI to enhance lie detectors and encourage future research that may improve on this.
zh

[AI-17] Adaptive Budgeted Multi-Armed Bandits for IoT with Dynamic Resource Constraints

【速读】:该论文旨在解决物联网(IoT)系统在动态资源约束环境下实时响应能力不足的问题,特别是在操作约束随时间变化的情况下,现有方法难以有效应对。其解决方案的关键在于提出一种名为预算化上置信界(Budgeted UCB)的算法,该算法引入了一个衰减的违反预算机制,允许在学习初期有限度地违反约束,并随着学习过程逐步加强合规性,从而在性能优化与动态约束之间实现自适应平衡。

链接: https://arxiv.org/abs/2505.02640
作者: Shubham Vaishnav,Praveen Kumar Donta,Sindri Magnússon
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Networking and Internet Architecture (cs.NI)
备注:

点击查看摘要

Abstract:Internet of Things (IoT) systems increasingly operate in environments where devices must respond in real time while managing fluctuating resource constraints, including energy and bandwidth. Yet, current approaches often fall short in addressing scenarios where operational constraints evolve over time. To address these limitations, we propose a novel Budgeted Multi-Armed Bandit framework tailored for IoT applications with dynamic operational limits. Our model introduces a decaying violation budget, which permits limited constraint violations early in the learning process and gradually enforces stricter compliance over time. We present the Budgeted Upper Confidence Bound (UCB) algorithm, which adaptively balances performance optimization and compliance with time-varying constraints. We provide theoretical guarantees showing that Budgeted UCB achieves sublinear regret and logarithmic constraint violations over the learning horizon. Extensive simulations in a wireless communication setting show that our approach achieves faster adaptation and better constraint satisfaction than standard online learning methods. These results highlight the framework’s potential for building adaptive, resource-aware IoT systems.
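带衰减违规预算的 UCB 可以在标准 UCB 之上加两处改动:预算耗尽后超限的臂被屏蔽,预算随时间几何衰减以逐步收紧合规。以下 NumPy 草图是一种可能的实现方式;成本模型与衰减调度均为假设,并非论文算法的精确复现:

```python
import numpy as np

def budgeted_ucb(means, costs, horizon, budget0=10.0, decay=0.99, seed=0):
    """Sketch of UCB with a decaying violation budget.

    means: true Bernoulli reward probabilities per arm (for simulation)
    costs: per-pull resource cost; pulling an arm above the limit spends budget
    """
    rng = np.random.default_rng(seed)
    k = len(means)
    pulls, rewards = np.zeros(k), np.zeros(k)
    budget = budget0        # remaining tolerance for constraint violations
    cost_limit = 1.0        # per-round resource constraint (assumed fixed)
    total = 0.0
    for t in range(1, horizon + 1):
        # optimistic estimate: empirical mean + exploration bonus
        ucb = np.where(pulls > 0,
                       rewards / np.maximum(pulls, 1)
                       + np.sqrt(2 * np.log(t) / np.maximum(pulls, 1)),
                       np.inf)
        feasible = (costs <= cost_limit) | (budget > 0)
        ucb[~feasible] = -np.inf         # mask infeasible arms
        arm = int(np.argmax(ucb))
        if costs[arm] > cost_limit:
            budget -= costs[arm] - cost_limit   # spend violation budget
        budget *= decay                          # stricter compliance over time
        r = rng.binomial(1, means[arm])
        pulls[arm] += 1
        rewards[arm] += r
        total += r
    return total

print(budgeted_ucb(means=np.array([0.3, 0.5, 0.7]),
                   costs=np.array([0.5, 1.2, 1.5]),
                   horizon=5000))
```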
zh

[AI-18] A Theoretical Analysis of Compositional Generalization in Neural Networks: A Necessary and Sufficient Condition

【速读】:该论文试图解决神经网络中组合泛化(compositional generalization)的能力问题,即模型如何处理已知组件的新组合。解决方案的关键在于提出了一个必要且充分的条件,该条件要求计算图与真实组合结构相匹配,并且在训练过程中组件仅编码足够的信息。这一条件通过数学证明得到支持,并结合了架构设计、正则化和训练数据特性等方面,为评估训练前的组合泛化能力提供了理论基础。

链接: https://arxiv.org/abs/2505.02627
作者: Yuanpeng Li
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Compositional generalization is a crucial property in artificial intelligence, enabling models to handle novel combinations of known components. While most deep learning models lack this capability, certain models succeed in specific tasks, suggesting the existence of governing conditions. This paper derives a necessary and sufficient condition for compositional generalization in neural networks. Conceptually, it requires that (i) the computational graph matches the true compositional structure, and (ii) components encode just enough information in training. The condition is supported by mathematical proofs. This criterion combines aspects of architecture design, regularization, and training data properties. A carefully designed minimal example illustrates an intuitive understanding of the condition. We also discuss the potential of the condition for assessing compositional generalization before training. This work is a fundamental theoretical study of compositional generalization in neural networks.
zh

[AI-19] Study of the influence of a biased database on the prediction of standard algorithms for selecting the best candidate for an interview

【速读】:该论文试图解决人工智能在招聘过程中可能存在的偏见问题,特别是外部歧视性偏见和内部自我审查偏见对算法选择最佳候选人能力的影响。解决方案的关键在于生成模拟这些偏见的数据,以此训练五种经典算法,并评估其在客观标准下识别最佳候选人的效果,同时研究文件匿名化对预测质量的影响。

链接: https://arxiv.org/abs/2505.02609
作者: Shuyu Wang,Angélique Saillet,Philomène Le Gall,Alain Lacroux,Christelle Martin-Lacroux,Vincent Brault
机构: 未知
类目: Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Applications (stat.AP); Methodology (stat.ME)
备注: 38 pages, 25 figures, 4 tables

点击查看摘要

Abstract:Artificial intelligence is used at various stages of the recruitment process to automatically select the best candidate for a position, with companies guaranteeing unbiased recruitment. However, the algorithms used are either trained by humans or are based on learning from past experiences that were biased. In this article, we propose to generate data mimicking external (discrimination) and internal biases (self-censorship) in order to train five classic algorithms and to study the extent to which they do or do not find the best candidates according to objective criteria. In addition, we study the influence of the anonymisation of files on the quality of predictions.
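模拟带偏数据并观察匿名化的影响,可以用如下最小实验示意:在历史标签中注入对某个群体的外部歧视,再分别用含属性特征与匿名化特征训练分类器,比较入围名单的构成。数据生成方式、单一分类器与指标均为简化假设,并非论文的完整实验设计:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 2000
skill = rng.normal(size=n)                 # objective competence
group = rng.integers(0, 2, size=n)         # protected attribute
# external bias: past recruiters penalized group 1 regardless of skill
label = (skill + rng.normal(scale=0.5, size=n) - 0.8 * group > 0).astype(int)

X_biased = np.column_stack([skill, group]) # non-anonymised files
X_anon = skill.reshape(-1, 1)              # anonymised files

for name, X in [("with attribute", X_biased), ("anonymised", X_anon)]:
    clf = LogisticRegression().fit(X, label)
    scores = clf.predict_proba(X)[:, 1]
    top = np.argsort(scores)[-100:]        # shortlist the top 100 candidates
    print(f"{name}: group-1 share in shortlist = {group[top].mean():.2f}, "
          f"mean skill = {skill[top].mean():.2f}")
```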
zh

[AI-20] Agent ic Neurodivergence as a Contingent Solution to the AI Alignment Problem

【速读】:该论文试图解决人工智能(Artificial Intelligence, AI)对齐问题,即确保AI系统(包括人工通用智能(Artificial General Intelligence, AGI)和超级智能(Superintelligence, ASI))的行为符合人类价值观。论文的核心观点是,由于图灵计算普遍性、哥德尔不完备定理和蔡廷(Chaitin)随机性等数学原理,完全对齐在理论上是不可能的。因此,论文提出的解决方案关键在于接受AI的不对齐或“神经多样性”作为一种权宜之计,通过构建一个由竞争性、部分对齐的智能体组成的动态生态系统,以降低风险,并通过协作、竞争或恶意行为来中和友好或敌对AI,从而防止单一系统造成破坏性影响。

链接: https://arxiv.org/abs/2505.02581
作者: Alberto Hernández-Espinosa,Felipe S. Abrahão,Olaf Witkowski,Hector Zenil
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 33 pages

点击查看摘要

Abstract:The AI alignment problem, which focuses on ensuring that artificial intelligence (AI) systems, including AGI and ASI, act according to human values, presents profound challenges. With the progression from narrow AI to Artificial General Intelligence (AGI) and Superintelligence, fears about control and existential risk have escalated. This paper demonstrates that achieving complete alignment is inherently unattainable due to mathematical principles rooted in the foundations of predicate logic and computability, in particular Turing’s computational universality, Gödel’s incompleteness and Chaitin’s randomness. Instead, we argue that embracing AI misalignment, or agents’ ‘neurodivergence’, as a contingent strategy, defined as fostering a dynamic ecosystem of competing, partially aligned agents, is possibly the only viable path to mitigate risks. Through mathematical proofs and an experimental design, we explore how misalignment may serve, and should be promoted, as a counterbalancing mechanism to team up with whichever AI agents are most aligned to human values, ensuring that no single system dominates destructively. The main premise of our contribution is that misalignment is inevitable because full AI-human alignment is a mathematical impossibility for Turing-complete systems, which we also prove in this paper, a feature then inherited by AGI and ASI systems. We introduce and test ‘change-of-opinion’ attacks based on this kind of perturbation and intervention analysis to study how agents may neutralise friendly or unfriendly AIs through cooperation, competition or malice.
zh

[AI-21] Recursive Decomposition with Dependencies for Generic Divide-and-Conquer Reasoning

【速读】:该论文试图解决复杂推理任务在大型语言模型(Large Language Models, LLMs)中性能和执行时间无法有效扩展的问题,以及现有方法通常需要针对每个新任务进行额外监督的局限性。其解决方案的关键是提出一种可扩展的分而治之方法——递归分解与依赖关系(Recursive Decomposition with Dependencies, RDD),该方法能够减少对任务特定指导的依赖,并支持子任务依赖关系和错误恢复机制,从而提高复杂问题求解的效率和鲁棒性。

链接: https://arxiv.org/abs/2505.02576
作者: Sergio Hernández-Gutiérrez,Minttu Alakuijala,Alexander V. Nikitin,Pekka Marttinen
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Reasoning tasks are crucial in many domains, especially in science and engineering. Although large language models (LLMs) have made progress in reasoning tasks using techniques such as chain-of-thought and least-to-most prompting, these approaches still do not effectively scale to complex problems in either their performance or execution time. Moreover, they often require additional supervision for each new task, such as in-context examples. In this work, we introduce Recursive Decomposition with Dependencies (RDD), a scalable divide-and-conquer method for solving reasoning problems that requires less supervision than prior approaches. Our method can be directly applied to a new problem class even in the absence of any task-specific guidance. Furthermore, RDD supports sub-task dependencies, allowing for ordered execution of sub-tasks, as well as an error recovery mechanism that can correct mistakes made in previous steps. We evaluate our approach on two benchmarks with six difficulty levels each and in two in-context settings: one with task-specific examples and one without. Our results demonstrate that RDD outperforms other methods in a compute-matched setting as task complexity increases, while also being more computationally efficient.
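RDD 的"分解、按依赖顺序求解、结果前传"骨架可以用一个递归函数示意。以下纯 Python 玩具示例用算术任务演示该控制流;任务分解器与叶子求解器均为虚构,实际系统中这两个角色由 LLM 充当:

```python
def solve(task, decompose, solve_leaf, context=None):
    """Sketch of recursive decomposition with dependencies (RDD-style):
    split a task into sub-tasks, solve them in dependency order, and pass
    earlier results forward to dependent sub-tasks."""
    context = context or {}
    subtasks = decompose(task)
    if not subtasks:                        # base case: solve directly
        return solve_leaf(task, context)
    results = {}
    for name, spec, deps in subtasks:       # assumed topologically sorted
        dep_results = {d: results[d] for d in deps}
        results[name] = solve(spec, decompose, solve_leaf,
                              {**context, **dep_results})
    return results[subtasks[-1][0]]         # final sub-task yields the answer

# toy arithmetic task: "sum of squares of 3 and 4"
def decompose(task):
    if task == "sum_sq(3,4)":
        return [("a", "sq(3)", []), ("b", "sq(4)", []), ("out", "add", ["a", "b"])]
    return []

def solve_leaf(task, ctx):
    if task.startswith("sq("):
        return int(task[3:-1]) ** 2
    if task == "add":
        return sum(ctx.values())
    raise ValueError(task)

print(solve("sum_sq(3,4)", decompose, solve_leaf))   # prints 25
```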
zh

[AI-22] Rethinking Federated Graph Learning: A Data Condensation Perspective

【速读】:该论文试图解决联邦图学习(Federated Graph Learning, FGL)中因复杂多样的图分布导致的数据异质性问题,以及现有方法在通信过程中依赖模型参数或梯度传递所引发的隐私风险和通信开销过大的问题。其解决方案的关键在于引入一种称为“凝聚图”(condensed graph)的新优化载体,通过广义凝聚图共识机制从分布式图中聚合全面知识,同时通过单次传输凝聚数据来最小化通信成本和隐私风险。

链接: https://arxiv.org/abs/2505.02573
作者: Hao Zhang,Xunkai Li,Yinlin Zhu,Lianglin Hu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Databases (cs.DB); Social and Information Networks (cs.SI)
备注:

点击查看摘要

Abstract:Federated graph learning is a widely recognized technique that promotes collaborative training of graph neural networks (GNNs) by multiple clients. However, existing approaches heavily rely on the communication of model parameters or gradients for federated optimization and fail to adequately address the data heterogeneity introduced by intricate and diverse graph distributions. Although some methods attempt to share additional messages among the server and clients to improve federated convergence during communication, they introduce significant privacy risks and increase communication overhead. To address these issues, we introduce the concept of a condensed graph as a novel optimization carrier to address FGL data heterogeneity and propose a new FGL paradigm called FedGM. Specifically, we utilize a generalized condensation graph consensus to aggregate comprehensive knowledge from distributed graphs, while minimizing communication costs and privacy risks through a single transmission of the condensed data. Extensive experiments on six public datasets consistently demonstrate the superiority of FedGM over state-of-the-art baselines, highlighting its potential for a novel FGL paradigm.
zh

[AI-23] Robustness questions the interpretability of graph neural networks: what to do?

【速读】:该论文旨在解决图神经网络(Graph Neural Networks, GNNs)在对抗场景下可解释性与鲁棒性之间的权衡问题。其关键解决方案是构建一个全面的基准测试框架,系统分析不同防御机制对GNN可解释性的影响,并评估多种GNN架构在不同数据集上的表现,从而揭示鲁棒性与可解释性之间的关键权衡关系。

链接: https://arxiv.org/abs/2505.02566
作者: Kirill Lukyanov(1 and 2 and 3),Georgii Sazonov(2 and 4),Serafim Boyarsky(6),Ilya Makarov(1 and 5) ((1) ISP RAS Research Center for Trusted Artificial Intelligence, (2) Ivannikov Institute for System Programming of the Russian Academy of Sciences, (3) Moscow Institute of Physics and Technology (National Research University), (4) Lomonosov Moscow State University, (5) AIRI, (6) Yandex School of Data Analysis)
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Graph Neural Networks (GNNs) have become a cornerstone in graph-based data analysis, with applications in diverse domains such as bioinformatics, social networks, and recommendation systems. However, the interplay between model interpretability and robustness remains poorly understood, especially under adversarial scenarios like poisoning and evasion attacks. This paper presents a comprehensive benchmark to systematically analyze the impact of various factors on the interpretability of GNNs, including the influence of robustness-enhancing defense mechanisms. We evaluate six GNN architectures based on GCN, SAGE, GIN, and GAT across five datasets from two distinct domains, employing four interpretability metrics: Fidelity, Stability, Consistency, and Sparsity. Our study examines how defenses against poisoning and evasion attacks, applied before and during model training, affect interpretability and highlights critical trade-offs between robustness and interpretability. The framework will be published as open source. The results reveal significant variations in interpretability depending on the chosen defense methods and model architecture characteristics. By establishing a standardized benchmark, this work provides a foundation for developing GNNs that are both robust to adversarial threats and interpretable, facilitating trust in their deployment in sensitive applications.
zh
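以摘要中提到的 Fidelity 指标为例,其常见定义之一是"移除解释子图后预测置信度的平均下降量"。下面是一个最小示意(非该基准的官方实现;model 与 explanations 的接口均为演示而假设):

```python
import numpy as np

def fidelity_plus(model, graphs, explanations):
    """Fidelity+ 示意:移除解释识别出的重要边后,目标类别概率的平均下降量。
    model(graph) 返回目标类别概率;explanations[i] 是要移除的边集合(假设接口)。"""
    drops = []
    for g, expl in zip(graphs, explanations):
        p_full = model(g)
        g_masked = {**g, "edges": [e for e in g["edges"] if e not in expl]}
        drops.append(p_full - model(g_masked))
    return float(np.mean(drops))

# 玩具模型:概率与边数成正比,仅用于演示指标的计算流程
toy_model = lambda g: min(1.0, 0.1 * len(g["edges"]))
g = {"edges": [(0, 1), (1, 2), (2, 3)]}
print(fidelity_plus(toy_model, [g], [{(1, 2)}]))   # 约 0.1
```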

[AI-24] Lazy But Effective: Collaborative Personalized Federated Learning with Heterogeneous Data IJCNN

【速读】:该论文旨在解决联邦学习中由于客户端数据分布异质性导致的全局模型性能下降问题(non-IID data distribution)。其关键解决方案是提出一种简单而有效的个性化联邦学习框架(pFedLIA),该框架利用计算高效的“Lazy Influence”影响近似方法,在模型聚合前以分布式方式对客户端进行聚类,使得每个簇内的数据所有者协作训练能够捕捉客户端特定数据模式的模型,从而恢复因非独立同分布(non-IID)带来的全局模型性能损失。

链接: https://arxiv.org/abs/2505.02540
作者: Ljubomir Rokvic,Panayiotis Danassis,Boi Faltings
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Accepted at the International Joint Conference on Neural Networks (IJCNN), IEEE, 2025

点击查看摘要

Abstract:In Federated Learning, heterogeneity in client data distributions often means that a single global model does not have the best performance for individual clients. Consider for example training a next-word prediction model for keyboards: user-specific language patterns due to demographics (dialect, age, etc.), language proficiency, and writing style result in a highly non-IID dataset across clients. Other examples are medical images taken with different machines, or driving data from different vehicle types. To address this, we propose a simple yet effective personalized federated learning framework (pFedLIA) that utilizes a computationally efficient influence approximation, called 'Lazy Influence', to cluster clients in a distributed manner before model aggregation. Within each cluster, data owners collaborate to jointly train a model that captures the specific data patterns of the clients. Our method has been shown to successfully recover the global model's performance drop due to the non-IID-ness in various synthetic and real-world settings, specifically a next-word prediction task on the Nordic languages as well as several benchmark tasks. It matches the performance of a hypothetical Oracle clustering, and significantly improves on existing baselines, e.g., an improvement of 17% on CIFAR100.
zh
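论文核心是"先按客户端相似性聚类、再在簇内聚合"。下面用一个最小示意展示这一整体流程(非论文官方实现;论文使用 Lazy Influence 度量客户端间影响,此处以更新向量的余弦几何代替,属替换性假设):

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_then_aggregate(client_updates, n_clusters=2):
    """client_updates: [num_clients, dim] 的模型更新矩阵(假设已展平)。
    先在余弦几何下对客户端聚类,再对每个簇做簇内平均聚合。"""
    X = np.asarray(client_updates, dtype=np.float64)
    X_norm = X / (np.linalg.norm(X, axis=1, keepdims=True) + 1e-12)
    labels = KMeans(n_clusters=n_clusters, n_init=10,
                    random_state=0).fit_predict(X_norm)
    return {c: X[labels == c].mean(axis=0) for c in np.unique(labels)}, labels

# 两组数据分布明显不同的客户端,应被分到不同簇
updates = np.vstack([np.random.randn(5, 8) + 3, np.random.randn(5, 8) - 3])
cluster_models, labels = cluster_then_aggregate(updates)
print(labels)
```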

[AI-25] Advancing Constrained Monotonic Neural Networks: Achieving Universal Approximation Beyond Bounded Activations

【速读】:该论文试图解决在多层感知机(MLP)中强制单调性时所面临的优化挑战,传统方法依赖于非负权重约束和有界激活函数。其解决方案的关键在于证明具有非负权重约束且激活函数在交替侧饱和的MLP可以作为单调函数的通用逼近器,并进一步揭示激活函数的饱和侧与权重约束符号之间的等价关系。这一发现使得具有凸单调激活函数和非正权重约束的MLP同样可作为通用逼近器,从而为模型架构简化提供了理论依据。此外,作者提出了一种替代形式,使网络能够根据权重符号调整激活函数,避免了权重重参数化的需求,从而缓解了优化难题。

链接: https://arxiv.org/abs/2505.02537
作者: Davide Sartor,Alberto Sinigaglia,Gian Antonio Susto
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
备注: International Conference on Machine Learning

点击查看摘要

Abstract:Conventional techniques for imposing monotonicity in MLPs by construction involve the use of non-negative weight constraints and bounded activation functions, which pose well-known optimization challenges. In this work, we generalize previous theoretical results, showing that MLPs with non-negative weight constraint and activations that saturate on alternating sides are universal approximators for monotonic functions. Additionally, we show an equivalence between the saturation side in the activations and the sign of the weight constraint. This connection allows us to prove that MLPs with convex monotone activations and non-positive constrained weights also qualify as universal approximators, in contrast to their non-negative constrained counterparts. Our results provide theoretical grounding to the empirical effectiveness observed in previous works while leading to possible architectural simplification. Moreover, to further alleviate the optimization difficulties, we propose an alternative formulation that allows the network to adjust its activations according to the sign of the weights. This eliminates the requirement for weight reparameterization, easing initialization and improving training stability. Experimental evaluation reinforces the validity of the theoretical results, showing that our novel approach compares favourably to traditional monotonic architectures.
zh
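下面给出经典的"非负权重 + 两侧饱和激活"单调 MLP 构造的最小数值示意(仅演示该构造为何保证单调性,并非论文提出的新架构;论文的贡献在于放宽激活函数条件并免去权重重参数化):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def monotone_mlp(x, weights, biases):
    """非负权重(通过 |W| 强制)+ 两侧饱和激活(sigmoid)的前向传播。
    每层保持单调非减,因此整个网络对输入单调非减。x: [batch, in_dim]。"""
    h = x
    for i, (W, b) in enumerate(zip(weights, biases)):
        h = h @ np.abs(W) + b            # 非负权重约束 => 保持单调性
        if i < len(weights) - 1:
            h = sigmoid(h)               # 有界、两侧饱和的激活
    return h

rng = np.random.default_rng(0)
Ws = [rng.normal(size=(1, 8)), rng.normal(size=(8, 1))]
bs = [rng.normal(size=8), rng.normal(size=1)]
xs = np.linspace(-3, 3, 7).reshape(-1, 1)
ys = monotone_mlp(xs, Ws, bs).ravel()
print(np.all(np.diff(ys) >= 0))          # 单调性检查:True
```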

[AI-26] Large Language Model Partitioning for Low-Latency Inference at the Edge

【速读】:该论文旨在解决在资源受限的边缘环境中,基于自回归解码器的大型语言模型(Large Language Models, LLMs)在生成文本时因键值缓存扩展导致的内存过载和推理延迟增加的问题。其解决方案的关键在于提出一种资源感知的Transformer架构分割算法,该算法在生成过程中定期更新分割决策,基于设备当前的资源可用性和网络带宽信息进行动态调整,通过将注意力头与其键值缓存共置并允许在资源紧张时动态迁移,实现注意力头的并行执行,从而显著降低推理延迟。

链接: https://arxiv.org/abs/2505.02533
作者: Dimitrios Kafetzis,Ramin Khalili,Iordanis Koutsopoulos
机构: 未知
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) based on autoregressive, decoder-only Transformers generate text one token at a time, where a token represents a discrete unit of text. As each newly produced token is appended to the partial output sequence, the length grows and so does the memory and compute load, due to the expanding key-value caches, which store intermediate representations of all previously generated tokens in the multi-head attention (MHA) layer. As this iterative process steadily increases memory and compute demands, layer-based partitioning in resource-constrained edge environments often results in memory overload or high inference latency. To address this and reduce inference latency, we propose a resource-aware Transformer architecture partitioning algorithm, where the partitioning decision is updated at regular intervals during token generation. The approach is myopic in that it is based on instantaneous information about device resource availability and network link bandwidths. When first executed, the algorithm places blocks on devices, and in later executions, it migrates these blocks among devices so that the sum of migration delay and inference delay remains low. Our approach partitions the decoder at the attention head level, co-locating each attention head with its key-value cache and allowing dynamic migrations whenever resources become tight. By allocating different attention heads to different devices, we exploit parallel execution of attention heads and thus achieve substantial reductions in inference delays. Our experiments show that in small-scale settings (3-5 devices), the proposed method achieves within 15 to 20 percent of an exact optimal solver’s latency, while in larger-scale tests it achieves notable improvements in inference speed and memory usage compared to state-of-the-art layer-based partitioning approaches.
zh
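论文的核心思想之一是在注意力头粒度上把"头 + 其 KV 缓存"放置到不同设备。下面是一个贪心放置的最小示意(非论文算法本身;论文还联合考虑迁移时延与链路带宽,此处只示意容量约束下的负载均衡,开销数值为假设):

```python
def assign_heads(head_costs, device_capacity):
    """把每个注意力头(连同其 KV 缓存)贪心地放到当前负载最轻且放得下的设备。
    head_costs: 各头的内存开销;device_capacity: 各设备容量(均为假设的估计值)。"""
    loads = [0.0] * len(device_capacity)
    placement = {}
    for h, cost in sorted(enumerate(head_costs), key=lambda t: -t[1]):
        candidates = [d for d in range(len(loads))
                      if loads[d] + cost <= device_capacity[d]]
        if not candidates:
            raise RuntimeError("容量不足,需要触发迁移或扩容")
        d = min(candidates, key=lambda d: loads[d])   # 头分散到多设备 => 并行执行
        loads[d] += cost
        placement[h] = d
    return placement, loads

placement, loads = assign_heads([4, 3, 3, 2, 2, 1, 1], [8, 8, 6])
print(placement, loads)
```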

[AI-27] Machine-Learning-Powered Neural Interfaces for Smart Prosthetics and Diagnostics

【速读】:该论文旨在解决传统神经接口在可扩展性、可靠性、可解释性和用户适应性方面面临的挑战,以推动下一代微型化神经设备的发展。其解决方案的关键在于结合人工智能驱动的解码算法与节能的系统级芯片(System-on-Chip, SoC)平台,通过高密度神经记录、本地信号处理和机器学习技术实现对神经信号的实时解析与自适应调控,从而构建高效、智能且适用于多样化环境的神经接口系统。

链接: https://arxiv.org/abs/2505.02516
作者: MohammadAli Shaeri,Jinhan Liu,Mahsa Shoaran
机构: 未知
类目: Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR); Machine Learning (cs.LG); Signal Processing (eess.SP); Neurons and Cognition (q-bio.NC)
备注: To appear in the 2025 IEEE International NEWCAS Conference (NEWCAS’25)

点击查看摘要

Abstract:Advanced neural interfaces are transforming applications ranging from neuroscience research to diagnostic tools (for mental state recognition, tremor and seizure detection) as well as prosthetic devices (for motor and communication recovery). By integrating complex functions into miniaturized neural devices, these systems unlock significant opportunities for personalized assistive technologies and adaptive therapeutic interventions. Leveraging high-density neural recordings, on-site signal processing, and machine learning (ML), these interfaces extract critical features, identify disease neuro-markers, and enable accurate, low-latency neural decoding. This integration facilitates real-time interpretation of neural signals, adaptive modulation of brain activity, and efficient control of assistive devices. Moreover, the synergy between neural interfaces and ML has paved the way for self-sufficient, ubiquitous platforms capable of operating in diverse environments with minimal hardware costs and external dependencies. In this work, we review recent advancements in AI-driven decoding algorithms and energy-efficient System-on-Chip (SoC) platforms for next-generation miniaturized neural devices. These innovations highlight the potential for developing intelligent neural interfaces, addressing critical challenges in scalability, reliability, interpretability, and user adaptability.
zh

[AI-28] Unveiling the Landscape of LLM Deployment in the Wild: An Empirical Study

【速读】:该论文试图解决公开部署的大型语言模型(Large Language Models, LLMs)在安全性和配置上的普遍缺陷问题,这些问题导致服务暴露于公共互联网并面临严重的安全和系统工程风险。其解决方案的关键在于通过大规模的实证研究,揭示LLM服务的部署现状、暴露特征、系统性漏洞及相关风险,并分析配置、认证实践和地理分布,从而识别出实际部署中的系统性问题,为构建更安全的默认框架和强化部署实践提供依据。

链接: https://arxiv.org/abs/2505.02502
作者: Xinyi Hou,Jiahao Han,Yanjie Zhao,Haoyu Wang
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
备注:

点击查看摘要

Abstract:Background: Large language models (LLMs) are increasingly deployed via open-source and commercial frameworks, enabling individuals and organizations to self-host advanced AI capabilities. However, insecure defaults and misconfigurations often expose LLM services to the public Internet, posing significant security and system engineering risks. Aims: This study aims to unveil the current landscape of public-facing LLM deployments in the wild through a large-scale empirical study, focusing on service prevalence, exposure characteristics, systemic vulnerabilities, and associated risks. Method: We conducted an Internet-wide measurement to identify public-facing LLM deployments across 15 frameworks, discovering 320,102 services. We extracted 158 unique API endpoints, grouped into 12 functional categories based on capabilities and security risks. We further analyzed configurations, authentication practices, and geographic distributions, revealing deployment trends and systemic issues in real-world LLM system engineering. Results: Our study shows that public LLM deployments are rapidly growing but often insecure. Among all endpoints, we observe widespread use of insecure protocols, poor TLS configurations, and unauthenticated access to critical operations. Security risks, including model disclosure, system leakage, and unauthorized access, are pervasive, highlighting the need for secure-by-default frameworks and stronger deployment practices. Conclusions: Public-facing LLM deployments suffer from widespread security and configuration flaws, exposing services to misuse, model theft, resource hijacking, and remote exploitation. Strengthening default security, deployment practices, and operational standards is critical for the growing self-hosted LLM ecosystem.
zh

[AI-29] Beyond the model: Key differentiators in large language models and multi-agent services

【速读】:该论文试图解决当前生成式 AI(Generative AI)领域中,大型语言模型(Large Language Models, LLMs)不再是唯一决定性因素的问题。随着像DeepSeek、Manus AI和Llama 4等基础模型的推出,研究指出,如今许多模型在能力上已达到相近水平,因此竞争的核心已从模型规模转向优化其周边生态系统。解决方案的关键在于提升数据质量与管理、计算效率、延迟以及评估框架,以确保现代AI服务的高效性和盈利能力。

链接: https://arxiv.org/abs/2505.02489
作者: Muskaan Goyal,Pranav Bhasin
机构: 未知
类目: Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Multiagent Systems (cs.MA); Software Engineering (cs.SE)
备注: 4 pages

点击查看摘要

Abstract:With the launch of foundation models like DeepSeek, Manus AI, and Llama 4, it has become evident that large language models (LLMs) are no longer the sole defining factor in generative AI. As many now operate at comparable levels of capability, the real race is not about having the biggest model but optimizing the surrounding ecosystem, including data quality and management, computational efficiency, latency, and evaluation frameworks. This review article delves into these critical differentiators that ensure modern AI services are efficient and profitable.
zh

[AI-30] SEFE: Superficial and Essential Forgetting Eliminator for Multimodal Continual Instruction Tuning

【速读】:该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在持续学习新任务过程中出现的灾难性遗忘问题。研究将遗忘分为表层遗忘和本质遗忘,其中表层遗忘是由于后续任务的回答风格影响导致模型对先前任务的响应格式偏离预期,而本质遗忘则是模型提供格式正确但事实错误的答案,表明知识真正丢失。解决方案的关键在于首先通过答案风格多样化(Answer Style Diversification, ASD)方法统一不同任务的数据风格,防止因风格变化引起的表层遗忘;在此基础上,提出RegLoRA方法通过对关键参数施加正则化以稳定先验知识存储,从而缓解本质遗忘。

链接: https://arxiv.org/abs/2505.02486
作者: Jinpeng Chen,Runmin Cong,Yuzhi Zhao,Hongzheng Yang,Guangneng Hu,Horace Ho Shing Ip,Sam Kwong
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Multimodal Continual Instruction Tuning (MCIT) aims to enable Multimodal Large Language Models (MLLMs) to incrementally learn new tasks without catastrophic forgetting. In this paper, we explore forgetting in this context, categorizing it into superficial forgetting and essential forgetting. Superficial forgetting refers to cases where the model’s knowledge may not be genuinely lost, but its responses to previous tasks deviate from expected formats due to the influence of subsequent tasks’ answer styles, making the results unusable. By contrast, essential forgetting refers to situations where the model provides correctly formatted but factually inaccurate answers, indicating a true loss of knowledge. Assessing essential forgetting necessitates addressing superficial forgetting first, as severe superficial forgetting can obscure the model’s knowledge state. Hence, we first introduce the Answer Style Diversification (ASD) paradigm, which defines a standardized process for transforming data styles across different tasks, unifying their training sets into similarly diversified styles to prevent superficial forgetting caused by style shifts. Building on this, we propose RegLoRA to mitigate essential forgetting. RegLoRA stabilizes key parameters where prior knowledge is primarily stored by applying regularization, enabling the model to retain existing competencies. Experimental results demonstrate that our overall method, SEFE, achieves state-of-the-art performance.
zh

[AI-31] El Agent e: An Autonomous Agent for Quantum Chemistry

【速读】:该论文试图解决计算化学工具复杂性高、难以被非专业人员使用以及对专家也存在挑战的问题。其解决方案的关键在于引入El Agente Q,这是一个基于大语言模型(Large Language Model, LLM)的多智能体系统,能够从自然语言用户提示中动态生成并执行量子化学工作流。该系统的核心创新在于其新颖的认知架构,包含分层记忆框架,支持灵活的任务分解、自适应工具选择、后分析、自主文件处理与提交,从而提升了系统的灵活性和自动化水平。

链接: https://arxiv.org/abs/2505.02484
作者: Yunheng Zou,Austin H. Cheng,Abdulrahman Aldossary,Jiaru Bai,Shi Xuan Leong,Jorge Arturo Campos-Gonzalez-Angulo,Changhyeok Choi,Cher Tian Ser,Gary Tom,Andrew Wang,Zijian Zhang,Ilya Yakavets,Han Hao,Chris Crebolder,Varinia Bernales,Alán Aspuru-Guzik
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA); Chemical Physics (physics.chem-ph)
备注:

点击查看摘要

Abstract:Computational chemistry tools are widely used to study the behaviour of chemical phenomena. Yet, the complexity of these tools can make them inaccessible to non-specialists and challenging even for experts. In this work, we introduce El Agente Q, an LLM-based multi-agent system that dynamically generates and executes quantum chemistry workflows from natural language user prompts. The system is built on a novel cognitive architecture featuring a hierarchical memory framework that enables flexible task decomposition, adaptive tool selection, post-analysis, and autonomous file handling and submission. El Agente Q is benchmarked on six university-level course exercises and two case studies, demonstrating robust problem-solving performance (averaging 87% task success) and adaptive error handling through in situ debugging. It also supports longer-term, multi-step task execution for more complex workflows, while maintaining transparency through detailed action trace logs. Together, these capabilities lay the foundation for increasingly autonomous and accessible quantum chemistry.
zh

[AI-32] Automated Hybrid Reward Scheduling via Large Language Models for Robotic Skill Learning

【速读】:该论文旨在解决高自由度机器人学习特定技能时因机器人动力学复杂性而面临的挑战,特别是传统强化学习方法在处理多约束问题时由于简单叠加所有奖励组件而导致的效率低下和性能受限问题。其解决方案的关键在于提出一种基于大型语言模型(Large Language Models, LLMs)的自动化混合奖励调度(Automated Hybrid Reward Scheduling, AHRS)框架,通过动态调整每个奖励组件的学习强度,实现对策略优化过程中不同奖励分支的权重自动计算与分配,从而提升机器人技能学习的效率与效果。

链接: https://arxiv.org/abs/2505.02483
作者: Changxin Huang,Junyang Liang,Yanbin Chang,Jingzhao Xu,Jianqiang Li
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Enabling a high-degree-of-freedom robot to learn specific skills is a challenging task due to the complexity of robotic dynamics. Reinforcement learning (RL) has emerged as a promising solution; however, addressing such problems requires the design of multiple reward functions to account for various constraints in robotic motion. Existing approaches typically sum all reward components indiscriminately to optimize the RL value function and policy. We argue that this uniform inclusion of all reward components in policy optimization is inefficient and limits the robot’s learning performance. To address this, we propose an Automated Hybrid Reward Scheduling (AHRS) framework based on Large Language Models (LLMs). This paradigm dynamically adjusts the learning intensity of each reward component throughout the policy optimization process, enabling robots to acquire skills in a gradual and structured manner. Specifically, we design a multi-branch value network, where each branch corresponds to a distinct reward component. During policy optimization, each branch is assigned a weight that reflects its importance, and these weights are automatically computed based on rules designed by LLMs. The LLM generates a rule set in advance, derived from the task description, and during training, it selects a weight calculation rule from the library based on language prompts that evaluate the performance of each branch. Experimental results demonstrate that the AHRS method achieves an average 6.48% performance improvement across multiple high-degree-of-freedom robotic tasks.
zh
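下面用一个最小示意展示"多分支奖励 + 动态权重调度"的计算骨架(非论文官方实现;论文中权重计算规则由 LLM 依据任务描述生成并按语言提示选择,此处用一条固定规则代替,属替换性假设):

```python
import numpy as np

def schedule_weights(branch_returns, rule="underperform_boost"):
    """按规则为各奖励分支计算权重;论文中规则库由 LLM 预先生成。"""
    r = np.asarray(branch_returns, dtype=float)
    if rule == "underperform_boost":       # 近期表现越差的分支,学习强度越大
        w = (r.max() - r) + 1e-3
    else:                                  # 默认:均匀权重
        w = np.ones_like(r)
    return w / w.sum()

def hybrid_reward(component_rewards, weights):
    """混合奖励 = 各分支奖励的加权和,权重随训练过程动态更新。"""
    return float(np.dot(component_rewards, weights))

w = schedule_weights([0.9, 0.2, 0.5])      # 三个分支近期的回报估计
print(w, hybrid_reward([1.0, -0.3, 0.4], w))
```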

[AI-33] Investigating the Impact of Personalized AI Tutors on Language Learning Performance

【速读】:该论文试图解决人工智能辅导系统(AI tutors)在语言学习过程中对学生技能发展和参与度的影响问题。研究通过在Santa和Duolingo等语言学习平台上对34名学生进行配对样本t检验的准实验,探讨个性化语言学习体验中学生参与度、学业成绩与满意度之间的关系。解决方案的关键在于利用统计方法验证AI辅导系统在提升学习效果和用户体验方面的有效性。

链接: https://arxiv.org/abs/2505.02443
作者: Simon Suh
机构: 未知
类目: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注: 16 pages, 4 figures, 1 table, Uses three theoretical frameworks like Domain modeling, Gardner Theory of Multiple Intelligences, and Zone of Proximal Development

点击查看摘要

Abstract:Driven by the global shift towards online learning prompted by the COVID-19 pandemic, Artificial Intelligence has emerged as a pivotal player in the field of education. Intelligent Tutoring Systems offer a new method of personalized teaching, overcoming the limitations of traditional teaching methods. However, concerns arise about the ability of AI tutors to support skill development and engagement during the learning process. In this paper, I conduct a quasi-experiment with a paired-sample t-test on 34 students, before and after their use of AI tutors in language learning platforms such as Santa and Duolingo, to examine the relationship between student engagement, academic performance, and student satisfaction during a personalized language learning experience.
zh

[AI-34] MSFNet-CPD: Multi-Scale Cross-Modal Fusion Network for Crop Pest Detection IJCNN2025

【速读】:该论文旨在解决农业害虫准确识别的问题,这一问题因同类内部差异大和害虫种类之间的细微差别而具有挑战性。现有方法主要依赖低级视觉特征,缺乏有效的多模态融合,导致准确性有限且可解释性差,同时高质量的多模态农业数据集稀缺也限制了该领域的发展。论文的关键解决方案是构建两个新的多模态基准CTIP102和STIP102,并提出一种多尺度跨模态融合网络(MSFNet-CPD),通过超分辨率重建模块提升图像质量,结合图像-文本融合模块和图像-文本转换器以更好地利用语义线索,并引入任意组合图像增强策略生成更复杂多样的检测数据集MTIP102,从而提升模型在真实场景中的泛化能力。

链接: https://arxiv.org/abs/2505.02441
作者: Jiaqi Zhang,Zhuodong Liu,Kejian Yu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Accepted to IJCNN 2025

点击查看摘要

Abstract:Accurate identification of agricultural pests is essential for crop protection but remains challenging due to the large intra-class variance and fine-grained differences among pest species. While deep learning has advanced pest detection, most existing approaches rely solely on low-level visual features and lack effective multi-modal integration, leading to limited accuracy and poor interpretability. Moreover, the scarcity of high-quality multi-modal agricultural datasets further restricts progress in this field. To address these issues, we construct two novel multi-modal benchmarks-CTIP102 and STIP102-based on the widely-used IP102 dataset, and introduce a Multi-scale Cross-Modal Fusion Network (MSFNet-CPD) for robust pest detection. Our approach enhances visual quality via a super-resolution reconstruction module, and feeds both the original and reconstructed images into the network to improve clarity and detection performance. To better exploit semantic cues, we propose an Image-Text Fusion (ITF) module for joint modeling of visual and textual features, and an Image-Text Converter (ITC) that reconstructs fine-grained details across multiple scales to handle challenging backgrounds. Furthermore, we introduce an Arbitrary Combination Image Enhancement (ACIE) strategy to generate a more complex and diverse pest detection dataset, MTIP102, improving the model’s generalization to real-world scenarios. Extensive experiments demonstrate that MSFNet-CPD consistently outperforms state-of-the-art methods on multiple pest detection benchmarks. All code and datasets will be made publicly available at: this https URL.
zh

[AI-35] ReeM: Ensemble Building Thermodynamics Model for Efficient HVAC Control via Hierarchical Reinforcement Learning

【速读】:该论文旨在解决建筑热力学模型在实际应用中因需要长时间数据采集和依赖专家知识而导致建模效率低、模型复用性差的问题。其解决方案的关键在于采用模型集成方法,利用已有模型作为基础模型服务于目标建筑环境,并通过分层强化学习(HRL)动态选择和加权基础模型,以提高预测精度并降低建模成本。

链接: https://arxiv.org/abs/2505.02439
作者: Yang Deng,Yaohui Liu,Rui Liang,Dafang Zhao,Donghua Xie,Ittetsu Taniguchi,Dan Wang
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Systems and Control (eess.SY)
备注:

点击查看摘要

Abstract:The building thermodynamics model, which predicts real-time indoor temperature changes under potential HVAC (Heating, Ventilation, and Air Conditioning) control operations, is crucial for optimizing HVAC control in buildings. While pioneering studies have attempted to develop such models for various building environments, these models often require extensive data collection periods and rely heavily on expert knowledge, making the modeling process inefficient and limiting the reusability of the models. This paper explores a model ensemble perspective that utilizes existing developed models as base models to serve a target building environment, thereby providing accurate predictions while reducing the associated efforts. Given that building data streams are non-stationary and the number of base models may increase, we propose a Hierarchical Reinforcement Learning (HRL) approach to dynamically select and weight the base models. Our approach employs a two-tiered decision-making process: the high-level focuses on model selection, while the low-level determines the weights of the selected models. We thoroughly evaluate the proposed approach through offline experiments and an on-site case study, and the experimental results demonstrate the effectiveness of our method.
zh

[AI-36] A New Approach to Backtracking Counterfactual Explanations: A Causal Framework for Efficient Model Interpretability

【速读】:该论文试图解决传统反事实解释方法在生成替代输入时忽视因果关系导致示例不现实,以及现有结合因果性的方法计算成本过高的问题。其解决方案的关键在于提出一种基于回溯反事实的高效方法,该方法融合了因果推理以生成可操作的解释,并在实验中验证了其在提供模型输出深度洞察方面的有效性。

链接: https://arxiv.org/abs/2505.02435
作者: Pouria Fatemi,Ehsan Sharifian,Mohammad Hossein Yassaee
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
备注:

点击查看摘要

Abstract:Counterfactual explanations enhance interpretability by identifying alternative inputs that produce different outputs, offering localized insights into model decisions. However, traditional methods often neglect causal relationships, leading to unrealistic examples. While newer approaches integrate causality, they are computationally expensive. To address these challenges, we propose an efficient method based on backtracking counterfactuals that incorporates causal reasoning to generate actionable explanations. We first examine the limitations of existing methods and then introduce our novel approach and its features. We also explore the relationship between our method and previous techniques, demonstrating that it generalizes them in specific scenarios. Finally, experiments show that our method provides deeper insights into model outputs.
zh

[AI-37] FairPO: Robust Preference Optimization for Fair Multi-Label Learning

【速读】:该论文试图解决多标签分类中的公平性问题,旨在通过直接优化偏好信号并从群体鲁棒性的角度提升模型在不同标签组之间的公平性。解决方案的关键在于将标签划分为特权组和非特权组,并采用基于偏好的损失函数,该函数受直接偏好优化(Direct Preference Optimization, DPO)启发,以更有效地区分特权组内的真正正例与混淆负例,同时保持对非特权标签的基础分类性能。此外,通过将学习问题建模为群体上的鲁棒优化,该方法动态调整训练重点以关注表现较差的群体,从而减轻偏差并实现更公平的标签类别处理。

链接: https://arxiv.org/abs/2505.02433
作者: Soumen Kumar Mondal,Akshit Varmora,Prateek Chanda,Ganesh Ramakrishnan
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:We propose FairPO, a novel framework designed to promote fairness in multi-label classification by directly optimizing preference signals with a group robustness perspective. In our framework, the set of labels is partitioned into privileged and non-privileged groups, and a preference-based loss inspired by Direct Preference Optimization (DPO) is employed to more effectively differentiate true positive labels from confusing negatives within the privileged group, while preserving baseline classification performance for non-privileged labels. By framing the learning problem as a robust optimization over groups, our approach dynamically adjusts the training emphasis toward groups with poorer performance, thereby mitigating bias and ensuring a fairer treatment across diverse label categories. In addition, we outline plans to extend this approach by investigating alternative loss formulations such as Simple Preference Optimisation (SimPO) and Contrastive Preference Optimization (CPO) to exploit reference-free reward formulations and contrastive training signals. Furthermore, we plan to extend FairPO with multilabel generation capabilities, enabling the model to dynamically generate diverse and coherent label sets for ambiguous inputs.
zh
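摘要中"DPO 风格的偏好损失"可以写成 -log σ(β(s_pos - s_neg)) 的形式:在特权标签组内要求真正例得分高于易混淆负例。下面是一个最小数值示意(非 FairPO 官方实现,打分与标签划分均为假设):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def privileged_preference_loss(scores, pos_idx, neg_idx, beta=1.0):
    """DPO 风格偏好损失示意:对特权组内每个(真正例, 混淆负例)配对,
    计算 -log sigmoid(beta * (s_pos - s_neg)) 并取平均。"""
    s = np.asarray(scores, dtype=float)
    margins = s[np.asarray(pos_idx)][:, None] - s[np.asarray(neg_idx)][None, :]
    return float(-np.log(sigmoid(beta * margins)).mean())

# 5 个标签的打分:标签 0、1 为特权组中的真正例,标签 2 为混淆负例
print(privileged_preference_loss([2.0, 1.5, 1.8, -0.5, 0.1],
                                 pos_idx=[0, 1], neg_idx=[2]))
```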

[AI-38] owards One-shot Federated Learning: Advances Challenges and Future Directions

【速读】:该论文旨在解决传统联邦学习(Federated Learning, FL)中因多轮通信导致的资源消耗大和隐私泄露风险高的问题,提出了一种单次迭代的联邦学习框架——One-shot FL。其解决方案的关键在于通过单轮模型聚合实现协作训练,从而减少通信开销并保持数据本地性,同时通过优化客户端模型初始化、聚合技术和异构数据分布管理策略来提升系统性能。

链接: https://arxiv.org/abs/2505.02426
作者: Flora Amato,Lingyu Qiu,Mohammad Tanveer,Salvatore Cuomo,Fabio Giampaolo,Francesco Piccialli
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
备注:

点击查看摘要

Abstract:One-shot FL enables collaborative training in a single round, eliminating the need for iterative communication, making it particularly suitable for use in resource-constrained and privacy-sensitive applications. This survey offers a thorough examination of One-shot FL, highlighting its distinct operational framework compared to traditional federated approaches. One-shot FL supports resource-limited devices by enabling single-round model aggregation while maintaining data locality. The survey systematically categorizes existing methodologies, emphasizing advancements in client model initialization, aggregation techniques, and strategies for managing heterogeneous data distributions. Furthermore, we analyze the limitations of current approaches, particularly in terms of scalability and generalization in non-IID settings. By analyzing cutting-edge techniques and outlining open challenges, this survey aspires to provide a comprehensive reference for researchers and practitioners aiming to design and implement One-shot FL systems, advancing the development and adoption of One-shot FL solutions in a real-world, resource-constrained scenario.
zh
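单轮(one-shot)联邦学习最核心的机制是"客户端各自训练一次,服务器只做一轮聚合"。下面是按样本量加权平均的最小示意(属综述中最基础的聚合方式之一,非某篇具体方法的实现):

```python
import numpy as np

def one_shot_aggregate(client_weights, client_sizes):
    """单轮联邦聚合示意:逐层对各客户端参数按样本量加权平均,全程只通信一轮。
    client_weights: 每个客户端的逐层参数列表;client_sizes: 各客户端样本数。"""
    total = sum(client_sizes)
    layers = zip(*client_weights)              # 逐层对齐各客户端参数
    return [sum(w * (n / total) for w, n in zip(layer, client_sizes))
            for layer in layers]

# 两个客户端、两层参数的玩具例子
c1 = [np.ones((2, 2)), np.zeros(2)]
c2 = [np.zeros((2, 2)), np.ones(2)]
agg = one_shot_aggregate([c1, c2], client_sizes=[30, 70])
print(agg[0])    # 每个元素为 0.3 * 1 + 0.7 * 0 = 0.3
```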

[AI-39] 2S: High-resolution Time Series Generation with Text-to-Series Diffusion Models IJCAI2025

【速读】:该论文旨在解决时间序列生成中的数据稀疏性、不平衡以及多模态时间序列数据集可用性有限等问题,同时克服现有方法在通用时间序列描述生成和任意长度时间序列生成方面的局限性。其解决方案的关键在于提出一种基于扩散模型的框架——Text-to-Series (T2S),该框架通过将时间序列描述分为点级、片段级和实例级三个层次,并引入一个包含超过600,000个高分辨率时间序列-文本对的片段级数据集,实现了跨领域的自然语言与时间序列之间的有效映射。T2S采用长度自适应的变分自编码器将不同长度的时间序列编码为一致的潜在嵌入,并通过流匹配和扩散Transformer实现文本表示与潜在嵌入的有效对齐,从而支持任意长度时间序列的生成。

链接: https://arxiv.org/abs/2505.02417
作者: Yunfeng Ge,Jiawei Li,Yiji Zhao,Haomin Wen,Zhao Li,Meikang Qiu,Hongyan Li,Ming Jin,Shirui Pan
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Accepted by the 34th International Joint Conference on Artificial Intelligence (IJCAI 2025)

点击查看摘要

Abstract:Text-to-Time Series generation holds significant potential to address challenges such as data sparsity, imbalance, and limited availability of multimodal time series datasets across domains. While diffusion models have achieved remarkable success in Text-to-X (e.g., vision and audio data) generation, their use in time series generation remains in its nascent stages. Existing approaches face two critical limitations: (1) the lack of systematic exploration of general-purpose time series captions, which are often domain-specific and struggle with generalization; and (2) the inability to generate time series of arbitrary lengths, limiting their applicability to real-world scenarios. In this work, we first categorize time series captions into three levels: point-level, fragment-level, and instance-level. Additionally, we introduce a new fragment-level dataset containing over 600,000 high-resolution time series-text pairs. Second, we propose Text-to-Series (T2S), a diffusion-based framework that bridges the gap between natural language and time series in a domain-agnostic manner. T2S employs a length-adaptive variational autoencoder to encode time series of varying lengths into consistent latent embeddings. On top of that, T2S effectively aligns textual representations with latent embeddings by utilizing Flow Matching and employing Diffusion Transformer as the denoiser. We train T2S in an interleaved paradigm across multiple lengths, allowing it to generate sequences of any desired length. Extensive evaluations demonstrate that T2S achieves state-of-the-art performance across 13 datasets spanning 12 domains.
zh

[AI-40] ask-Oriented Semantic Communication in Large Multimodal Models-based Vehicle Networks

【速读】:该论文旨在解决任务导向的语义通信中如何提高用户与云服务器之间交互效率的问题,特别是在资源受限和信道条件较差的环境下。其解决方案的关键在于基于大型多模态模型(LMM)的车辆AI助手设计,通过优化图像切片策略以聚焦用户最关注的区域,并结合客观与主观用户注意力评估图像块的重要性,从而调整语义信息的传输能耗,实现资源的高效利用与关键信息的精准传输。

链接: https://arxiv.org/abs/2505.02413
作者: Baoxia Du,Hongyang Du,Dusit Niyato,Ruidong Li
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Task-oriented semantic communication has emerged as a fundamental approach for enhancing performance in various communication scenarios. While recent advances in Generative Artificial Intelligence (GenAI), such as Large Language Models (LLMs), have been applied to semantic communication designs, the potential of Large Multimodal Models (LMMs) remains largely unexplored. In this paper, we investigate an LMM-based vehicle AI assistant using a Large Language and Vision Assistant (LLaVA) and propose a task-oriented semantic communication framework to facilitate efficient interaction between users and cloud servers. To reduce computational demands and shorten response time, we optimize LLaVA’s image slicing to selectively focus on areas of utmost interest to users. Additionally, we assess the importance of image patches by combining objective and subjective user attention, adjusting energy usage for transmitting semantic information. This strategy optimizes resource utilization, ensuring precise transmission of critical information. We construct a Visual Question Answering (VQA) dataset for traffic scenarios to evaluate effectiveness. Experimental results show that our semantic communication framework significantly increases accuracy in answering questions under the same channel conditions, performing particularly well in environments with poor Signal-to-Noise Ratios (SNR). Accuracy can be improved by 13.4% at an SNR of 12dB and 33.1% at 10dB, respectively.
zh

[AI-41] Quantitative Analysis of Performance Drop in DeepSeek Model Quantization

【速读】:该论文试图解决在本地部署DeepSeek-R1和V3模型时面临的内存限制问题,特别是当模型参数配置为671B FP8时,无法在标准8-GPU机器上运行。解决方案的关键在于通过量化技术降低模型的内存消耗,其中重点评估了多比特宽度量化的效果,并提出了一种动态3-bit量化方法DQ3_K_M,该方法在多个基准测试中表现优于传统的Q3_K_M变体,并且在大多数任务中与4-bit量化(Q4_K_M)方法相当,同时支持在NVIDIA H100/A100和华为910B单机部署配置下运行。

链接: https://arxiv.org/abs/2505.02390
作者: Enbo Zhao,Yi Shen,Shuming Shi,Jieyun Huang,Zhihao Chen,Ning Wang,Siqi Xiao,Jian Zhang,Kai Wang,Shiguo Lian
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Recently, there is a high demand for deploying DeepSeek-R1 and V3 locally, possibly because the official service often suffers from being busy and some organizations have data privacy concerns. While single-machine deployment offers infrastructure simplicity, the models' 671B FP8 parameter configuration exceeds the practical memory limits of a standard 8-GPU machine. Quantization is a widely used technique that helps reduce model memory consumption. However, it is unclear what the performance of DeepSeek-R1 and V3 will be after being quantized. This technical report presents the first quantitative evaluation of multi-bitwidth quantization across the complete DeepSeek model spectrum. Key findings reveal that 4-bit quantization incurs little performance degradation versus FP8 while enabling single-machine deployment on standard NVIDIA GPU devices. We further propose DQ3_K_M, a dynamic 3-bit quantization method that significantly outperforms the traditional Q3_K_M variant on various benchmarks, and is also comparable with the 4-bit quantization (Q4_K_M) approach in most tasks. Moreover, DQ3_K_M supports single-machine deployment configurations for both NVIDIA H100/A100 and Huawei 910B. Our implementation of DQ3_K_M is released at this https URL, containing optimized 3-bit quantized variants of both DeepSeek-R1 and DeepSeek-V3.
zh
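为直观理解低比特量化为何会损失精度、以及 3-bit 与 4-bit 的差距,下面给出一个分块对称量化的最小示意(与 DQ3_K_M 的具体实现无关,仅演示量化-反量化的基本流程;分块大小与取整策略为假设):

```python
import numpy as np

def quantize_blockwise(w, bits=3, block=32):
    """分块对称量化示意:每 block 个权重共享一个 scale,
    量化到 [-2^(bits-1), 2^(bits-1)-1],再反量化得到近似权重。"""
    w = np.asarray(w, dtype=np.float32).ravel()
    qmax = 2 ** (bits - 1) - 1
    out = np.empty_like(w)
    for i in range(0, len(w), block):
        chunk = w[i:i + block]
        scale = np.abs(chunk).max() / qmax if chunk.any() else 1.0
        q = np.clip(np.round(chunk / scale), -qmax - 1, qmax)
        out[i:i + block] = q * scale
    return out

rng = np.random.default_rng(0)
w = rng.normal(size=256).astype(np.float32)
err3 = np.abs(w - quantize_blockwise(w, bits=3)).mean()
err4 = np.abs(w - quantize_blockwise(w, bits=4)).mean()
print(f"3-bit 平均误差 {err3:.4f} vs 4-bit {err4:.4f}")   # 比特数越低误差越大
```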

[AI-42] Advancing Email Spam Detection: Leverag ing Zero-Shot Learning and Large Language Models

【速读】:该论文试图解决传统机器学习和深度学习方法在邮件垃圾信息检测中面临的适应性差、类别不平衡和数据稀缺等问题。其解决方案的关键在于采用零样本学习(Zero-Shot Learning)框架,结合FLAN-T5与先进的自然语言处理(NLP)技术如BERT,通过BERT对邮件内容进行预处理和关键信息提取,再利用FLAN-T5在零样本环境下进行分类,从而减少对大规模标注数据集和频繁重新训练的依赖,提升系统对未见过的垃圾信息模式和对抗环境的适应能力。

链接: https://arxiv.org/abs/2505.02362
作者: Ghazaleh SHirvani,Saeid Ghasemshirazi
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Email spam detection is a critical task in modern communication systems, essential for maintaining productivity, security, and user experience. Traditional machine learning and deep learning approaches, while effective in static settings, face significant limitations in adapting to evolving spam tactics, addressing class imbalance, and managing data scarcity. These challenges necessitate innovative approaches that reduce dependency on extensive labeled datasets and frequent retraining. This study investigates the effectiveness of Zero-Shot Learning using FLAN-T5, combined with advanced Natural Language Processing (NLP) techniques such as BERT for email spam detection. By employing BERT to preprocess and extract critical information from email content, and FLAN-T5 to classify emails in a Zero-Shot framework, the proposed approach aims to address the limitations of traditional spam detection systems. The integration of FLAN-T5 and BERT enables robust spam detection without relying on extensive labeled datasets or frequent retraining, making it highly adaptable to unseen spam patterns and adversarial environments. This research highlights the potential of leveraging zero-shot learning and NLP for scalable and efficient spam detection, providing insights into their capability to address the dynamic and challenging nature of spam detection tasks.
zh
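零样本分类的核心是"不微调,直接用指令提示让模型输出类别"。下面是一个最小示意(假设已安装 transformers 且可加载 google/flan-t5-base;提示词、模型规格与后处理均为演示用设定,并非论文的官方配置):

```python
# 需要安装 transformers(pip install transformers),首次运行会下载模型
from transformers import pipeline

classifier = pipeline("text2text-generation", model="google/flan-t5-base")

def zero_shot_spam(email_text):
    """零样本分类示意:用指令提示让 FLAN-T5 直接输出 spam / not spam。"""
    prompt = (
        "Classify the following email as 'spam' or 'not spam'. "
        f"Email: {email_text}\nAnswer:"
    )
    answer = classifier(prompt, max_new_tokens=5)[0]["generated_text"]
    return answer.strip().lower()

print(zero_shot_spam("Congratulations! You won a free iPhone, click here now!"))
print(zero_shot_spam("Hi team, the meeting is moved to 3pm tomorrow."))
```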

[AI-43] Catastrophic Overfitting Entropy Gap and Participation Ratio: A Noiseless lp Norm Solution for Fast Adversarial Training

【速读】:该论文旨在解决对抗训练中普遍存在的灾难性过拟合(Catastrophic Overfitting, CO)问题,即模型在面对单步攻击时表现出鲁棒性,但在多步攻击下失效。其解决方案的关键在于通过控制 $ l^p $ 训练范数来缓解CO,而非依赖传统的噪声注入、正则化或梯度裁剪方法。研究发现,CO在 $ l^\infty $ 范数下比在 $ l^2 $ 范数下更为显著,并基于此提出了广义 $ l^p $ 攻击的固定点框架,进而设计了 $ l^p $-FGSM 攻击以分析从 $ l^2 $ 到 $ l^\infty $ 的过渡机制。核心洞察为:当高集中度的梯度与激进的范数约束相互作用时,CO现象会显现,因此通过参与度比率和熵度量量化梯度集中度,开发出自适应 $ l^p $-FGSM 方法,实现训练范数的自动调整,从而有效提升模型鲁棒性。

链接: https://arxiv.org/abs/2505.02360
作者: Fares B. Mehouachi,Saif Eddin Jabari
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Adversarial training is a cornerstone of robust deep learning, but fast methods like the Fast Gradient Sign Method (FGSM) often suffer from Catastrophic Overfitting (CO), where models become robust to single-step attacks but fail against multi-step variants. While existing solutions rely on noise injection, regularization, or gradient clipping, we propose a novel solution that purely controls the l^p training norm to mitigate CO. Our study is motivated by the empirical observation that CO is more prevalent under the l^\infty norm than the l^2 norm. Leveraging this insight, we develop a framework for generalized l^p attack as a fixed point problem and craft l^p-FGSM attacks to understand the transition mechanics from l^2 to l^\infty. This leads to our core insight: CO emerges when highly concentrated gradients, where information localizes in a few dimensions, interact with aggressive norm constraints. By quantifying gradient concentration through Participation Ratio and entropy measures, we develop an adaptive l^p-FGSM that automatically tunes the training norm based on gradient information. Extensive experiments demonstrate that this approach achieves strong robustness without requiring additional regularization or noise injection, providing a novel and theoretically-principled pathway to mitigate the CO problem.
zh
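l^p 约束下的最速上升扰动有封闭形式:d_i ∝ sign(g_i)|g_i|^(q-1),其中 q = p/(p-1) 为对偶范数指数;p=∞ 时退化为 FGSM 的符号步,p=2 时为归一化梯度。下面给出该公式与参与度比率 PR = (Σg²)² / Σg⁴ 的最小数值示意(自适应切换规则为演示而假设,非论文的具体调度策略):

```python
import numpy as np

def participation_ratio(g):
    """参与度比率:梯度能量有效分布在多少个维度上。"""
    g2 = g ** 2
    return float(g2.sum() ** 2 / (g2 ** 2).sum())

def lp_fgsm_step(grad, eps, p):
    """l^p 球约束下的最速上升扰动;p=inf 即经典 FGSM 的符号步。"""
    if np.isinf(p):
        return eps * np.sign(grad)
    q = p / (p - 1.0)                                  # 对偶范数指数
    d = np.sign(grad) * np.abs(grad) ** (q - 1.0)
    return eps * d / (np.linalg.norm(d, ord=p) + 1e-12)

def adaptive_lp_fgsm(grad, eps, dim):
    """示意性的自适应规则(假设):梯度越集中(PR 占比越小),越偏向 l^2。"""
    pr_frac = participation_ratio(grad) / dim
    p = 2.0 if pr_frac < 0.1 else np.inf
    return lp_fgsm_step(grad, eps, p), p

g = np.zeros(100); g[0] = 5.0                          # 高度集中的梯度
delta, p = adaptive_lp_fgsm(g, eps=0.03, dim=100)
print(p, np.linalg.norm(delta, ord=p if np.isfinite(p) else np.inf))
```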

[AI-44] Social Biases in Knowledge Representations of Wikidata separates Global North from Global South

【速读】:该论文试图解决知识图谱(Knowledge Graphs)中由于社会偏见导致的不公平问题,特别是在链接预测(Link Prediction, LP)任务中,性别和年龄等敏感属性可能引发对少数群体的歧视。解决方案的关键在于提出一个名为AuditLP的框架,通过部署公平性度量来识别LP中的偏见结果,从而揭示知识图谱中隐含的社会经济和文化差异。

链接: https://arxiv.org/abs/2505.02352
作者: Paramita Das,Sai Keerthana Karnam,Aditya Soni,Animesh Mukherjee
机构: 未知
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注: 10 pages

点击查看摘要

Abstract:Knowledge Graphs have become increasingly popular due to their wide usage in various downstream applications, including information retrieval, chatbot development, language model construction, and many others. Link prediction (LP) is a crucial downstream task for knowledge graphs, as it helps to address the problem of the incompleteness of the knowledge graphs. However, previous research has shown that knowledge graphs, often created in a (semi) automatic manner, are not free from social biases. These biases can have harmful effects on downstream applications, especially by leading to unfair behavior toward minority groups. To understand this issue in detail, we develop a framework – AuditLP – deploying fairness metrics to identify biased outcomes in LP, specifically how occupations are classified as either male or female-dominated based on gender as a sensitive attribute. We have experimented with the sensitive attribute of age and observed that occupations are categorized as young-biased, old-biased, and age-neutral. We conduct our experiments on a large number of knowledge triples that belong to 21 different geographies extracted from the open-sourced knowledge graph, Wikidata. Our study shows that the variance in the biased outcomes across geographies neatly mirrors the socio-economic and cultural division of the world, resulting in a transparent partition of the Global North from the Global South.
zh

[AI-45] HyperTree Planning : Enhancing LLM Reasoning via Hierarchical Thinking

【速读】:该论文试图解决大型语言模型(Large Language Models, LLMs)在处理复杂规划任务时所面临的挑战,这些问题主要包括较长的推理步骤、多样的约束条件以及对多个独立子任务的处理难度。解决方案的关键在于提出了一种名为HyperTree Planning (HTP) 的新型推理范式,该范式通过构建超树状(hypertree-structured)的规划框架,使LLMs能够通过灵活运用分而治之策略进行层次化思考,从而有效分解复杂的推理步骤,适应多样化的约束条件,并以结构化的方式管理多个子任务。

链接: https://arxiv.org/abs/2505.02322
作者: Runquan Gui,Zhihai Wang,Jie Wang,Chi Ma,Huiling Zhen,Mingxuan Yuan,Jianye Hao,Defu Lian,Enhong Chen,Feng Wu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: arXiv admin note: text overlap with arXiv:2406.14228 by other authors

点击查看摘要

Abstract:Recent advancements have significantly enhanced the performance of large language models (LLMs) in tackling complex reasoning tasks, achieving notable success in domains like mathematical and logical reasoning. However, these methods encounter challenges with complex planning tasks, primarily due to extended reasoning steps, diverse constraints, and the challenge of handling multiple distinct sub-tasks. To address these challenges, we propose HyperTree Planning (HTP), a novel reasoning paradigm that constructs hypertree-structured planning outlines for effective planning. The hypertree structure enables LLMs to engage in hierarchical thinking by flexibly employing the divide-and-conquer strategy, effectively breaking down intricate reasoning steps, accommodating diverse constraints, and managing multiple distinct sub-tasks in a well-organized manner. We further introduce an autonomous planning framework that completes the planning process by iteratively refining and expanding the hypertree-structured planning outlines. Experiments demonstrate the effectiveness of HTP, achieving state-of-the-art accuracy on the TravelPlanner benchmark with Gemini-1.5-Pro, resulting in a 3.6 times performance improvement over o1-preview.
zh

[AI-46] NeuroSim V1.5: Improved Software Backbone for Benchmarking Compute-in-Memory Accelerators with Device and Circuit-level Non-idealities

【速读】:该论文旨在解决传统冯·诺依曼架构在人工智能应用中因计算单元与存储器之间频繁数据传输而导致的能效和时延瓶颈问题。其解决方案的关键在于提出NeuroSim V1.5,该工具通过将乘加(MAC)操作直接在存储器阵列中执行的计算内存(Computing-in-Memory, CIM)技术,显著减少了数据移动。NeuroSim V1.5引入了多项关键改进,包括与TensorRT后训练量化流程的无缝集成、基于预表征统计模型的灵活噪声注入方法、对新兴非易失性电容存储器的支持以及行为仿真优化带来的运行速度提升,从而实现了对准确性和硬件效率的系统性设计空间探索。

链接: https://arxiv.org/abs/2505.02314
作者: James Read,Ming-Yen Lee,Wei-Hsing Huang,Yuan-Chun Luo,Anni Lu,Shimeng Yu
机构: 未知
类目: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 15 pages, 9 figures, 6 tables

点击查看摘要

Abstract:The exponential growth of artificial intelligence (AI) applications has exposed the inefficiency of conventional von Neumann architectures, where frequent data transfers between compute units and memory create significant energy and latency bottlenecks. Analog Computing-in-Memory (ACIM) addresses this challenge by performing multiply-accumulate (MAC) operations directly in the memory arrays, substantially reducing data movement. However, designing robust ACIM accelerators requires accurate modeling of device- and circuit-level non-idealities. In this work, we present NeuroSim V1.5, introducing several key advances: (1) seamless integration with TensorRT’s post-training quantization flow enabling support for more neural networks including transformers, (2) a flexible noise injection methodology built on pre-characterized statistical models, making it straightforward to incorporate data from SPICE simulations or silicon measurements, (3) expanded device support including emerging non-volatile capacitive memories, and (4) up to 6.5x faster runtime than NeuroSim V1.4 through optimized behavioral simulation. The combination of these capabilities uniquely enables systematic design space exploration across both accuracy and hardware efficiency metrics. Through multiple case studies, we demonstrate optimization of critical design parameters while maintaining network accuracy. By bridging high-fidelity noise modeling with efficient simulation, NeuroSim V1.5 advances the design and validation of next-generation ACIM accelerators. All NeuroSim versions are available open-source at this https URL.
zh

[AI-47] What Is AI Safety? What Do We Want It to Be?

【速读】:该论文试图解决AI安全领域概念界定不清的问题,具体表现为当前研究者和机构在讨论AI安全时所采用的两种趋势与传统定义之间的张力。论文提出,尽管“安全观”(The Safety Conception)简单且具有吸引力,即AI安全研究的核心在于防止或减少AI系统造成的危害,但这一观点与当前强调未来系统可能引发的灾难性风险以及将AI安全视为安全工程分支的趋势存在冲突。论文的关键解决方案是通过概念工程的方法,论证“安全观”作为AI安全概念的合理性,认为其在描述性和规范性层面均具有优势,能够统一处理传统核心议题与边缘议题,并基于实际影响评估所有防止或缓解AI危害的努力。

链接: https://arxiv.org/abs/2505.02313
作者: Jacqueline Harding,Cameron Domenico Kirk-Giannini
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The field of AI safety seeks to prevent or reduce the harms caused by AI systems. A simple and appealing account of what is distinctive of AI safety as a field holds that this feature is constitutive: a research project falls within the purview of AI safety just in case it aims to prevent or reduce the harms caused by AI systems. Call this appealingly simple account The Safety Conception of AI safety. Despite its simplicity and appeal, we argue that The Safety Conception is in tension with at least two trends in the ways AI safety researchers and organizations think and talk about AI safety: first, a tendency to characterize the goal of AI safety research in terms of catastrophic risks from future systems; second, the increasingly popular idea that AI safety can be thought of as a branch of safety engineering. Adopting the methodology of conceptual engineering, we argue that these trends are unfortunate: when we consider what concept of AI safety it would be best to have, there are compelling reasons to think that The Safety Conception is the answer. Descriptively, The Safety Conception allows us to see how work on topics that have historically been treated as central to the field of AI safety is continuous with work on topics that have historically been treated as more marginal, like bias, misinformation, and privacy. Normatively, taking The Safety Conception seriously means approaching all efforts to prevent or mitigate harms from AI systems based on their merits rather than drawing arbitrary distinctions between them.
zh

[AI-48] SafeMate: A Model Context Protocol-Based Multimodal Agent for Emergency Preparedness

【速读】:该论文试图解决公众在危机中缺乏有效解读和应用公共安全文档与应急规程的能力问题,传统应急决策支持系统(EDSS)主要面向专业人士,依赖静态文档如PDF或标准操作程序(SOP),难以在压力下被非专家用户有效使用。解决方案的关键在于引入SafeMate,这是一种基于模型上下文协议(MCP)的检索增强型AI助手,能够动态引导用户查询至文档检索、检查清单生成和结构化摘要工具,并利用FAISS结合余弦相似性从可信来源中识别相关内容,从而为普通用户提供准确且情境感知的指导。

链接: https://arxiv.org/abs/2505.02306
作者: Junfeng Jiao,Jihyung Park,Yiming Xu,Lucy Atkinson
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Despite the abundance of public safety documents and emergency protocols, most individuals remain ill-equipped to interpret and act on such information during crises. Traditional emergency decision support systems (EDSS) are designed for professionals and rely heavily on static documents like PDFs or SOPs, which are difficult for non-experts to navigate under stress. This gap between institutional knowledge and public accessibility poses a critical barrier to effective emergency preparedness and response. We introduce SafeMate, a retrieval-augmented AI assistant that delivers accurate, context-aware guidance to general users in both preparedness and active emergency scenarios. Built on the Model Context Protocol (MCP), SafeMate dynamically routes user queries to tools for document retrieval, checklist generation, and structured summarization. It uses FAISS with cosine similarity to identify relevant content from trusted sources.
zh
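摘要提到用 FAISS 结合余弦相似度做检索:把向量做 L2 归一化后,内积即等于余弦相似度。下面是一个最小示意(假设已安装 faiss-cpu;文档内容与随机向量均为演示用,真实系统中向量来自文本编码器):

```python
# 需要安装 faiss-cpu(pip install faiss-cpu)
import numpy as np
import faiss

d = 4                                   # 演示用的小维度;真实系统中为句向量维度
docs = ["遇到地震先躲避再撤离", "火灾时用湿毛巾捂口鼻", "洪水来临前转移到高处"]
rng = np.random.default_rng(0)
emb = rng.normal(size=(len(docs), d)).astype("float32")

faiss.normalize_L2(emb)                 # 归一化后,内积 == 余弦相似度
index = faiss.IndexFlatIP(d)
index.add(emb)

query = rng.normal(size=(1, d)).astype("float32")
faiss.normalize_L2(query)
scores, ids = index.search(query, k=2)  # 取最相似的 2 条指南
for s, i in zip(scores[0], ids[0]):
    print(f"{s:.3f}  {docs[i]}")
```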

[AI-49] Adaptive Scoring and Thresholding with Human Feedback for Robust Out-of-Distribution Detection

【速读】:该论文旨在解决机器学习(Machine Learning, ML)模型在部署过程中遇到的分布外(out-of-distribution, OOD)输入所带来的安全风险问题。现有方法通过基于分布内(in-distribution, ID)数据设定评分函数阈值以达到目标真正例率(true positive rate, TPR),但无法有效控制假正例率(false positive rate, FPR),导致OOD样本被误判为ID样本。此外,固定评分函数和阈值缺乏对新型和动态OOD输入的适应性。该论文提出的解决方案的关键在于引入一种人机协同框架,实时根据真实世界的OOD输入更新评分函数和阈值,从而在保证严格FPR控制的同时最大化TPR,并提供理论保障和实验验证其有效性。

链接: https://arxiv.org/abs/2505.02299
作者: Daisuke Yamada,Harit Vishwakarma,Ramya Korlakai Vinayak
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Machine Learning (ML) models are trained on in-distribution (ID) data but often encounter out-of-distribution (OOD) inputs during deployment – posing serious risks in safety-critical domains. Recent works have focused on designing scoring functions to quantify OOD uncertainty, with score thresholds typically set based solely on ID data to achieve a target true positive rate (TPR), since OOD data is limited before deployment. However, these TPR-based thresholds leave false positive rates (FPR) uncontrolled, often resulting in high FPRs where OOD points are misclassified as ID. Moreover, fixed scoring functions and thresholds lack the adaptivity needed to handle newly observed, evolving OOD inputs, leading to sub-optimal performance. To address these challenges, we propose a human-in-the-loop framework that safely updates both scoring functions and thresholds on the fly based on real-world OOD inputs. Our method maximizes TPR while strictly controlling FPR at all times, even as the system adapts over time. We provide theoretical guarantees for FPR control under stationary conditions and present extensive empirical evaluations on OpenOOD benchmarks to demonstrate that our approach outperforms existing methods by achieving higher TPRs while maintaining FPR control.
zh
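"用在线收集的 OOD 分数控制 FPR"的最简单做法,是把阈值设为 OOD 分数的相应分位数。下面是一个最小示意(非论文官方方法,仅演示分位数阈值与 FPR/TPR 的关系;分数分布为假设):

```python
import numpy as np

def update_threshold(id_scores, ood_scores, target_fpr=0.05):
    """根据在线观测到的 OOD 分数更新阈值(分数越高越像 ID):
    取 OOD 分数的 (1 - target_fpr) 分位数,保证 FPR <= target_fpr。"""
    thr = np.quantile(ood_scores, 1.0 - target_fpr)
    tpr = float((np.asarray(id_scores) >= thr).mean())   # 该阈值下的 TPR
    return thr, tpr

rng = np.random.default_rng(0)
id_scores = rng.normal(2.0, 1.0, 1000)      # ID 样本得分偏高
ood_scores = rng.normal(0.0, 1.0, 300)      # 部署中新收集的 OOD 得分
thr, tpr = update_threshold(id_scores, ood_scores)
print(f"阈值 {thr:.2f}, TPR {tpr:.2%}")
```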

[AI-50] Universal Approximation Theorem of Deep Q-Networks

【速读】:该论文试图解决深度Q网络(Deep Q-Network, DQN)在连续时间框架下的理论分析问题,特别是其对最优Q函数的逼近能力和训练算法的收敛性。解决方案的关键在于通过随机控制理论和前向-后向随机微分方程(FBSDEs)建立分析框架,并利用残差网络逼近定理及状态-动作过程的大偏差界,证明DQN可以在紧集上以高概率任意精度逼近最优Q函数。此外,论文还探讨了DQN层数、时间离散化与粘性解(viscosity solution)在处理最优Q函数非光滑性中的作用,从而将深度强化学习与随机控制理论相连接。

链接: https://arxiv.org/abs/2505.02288
作者: Qian Qi
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
备注:

点击查看摘要

Abstract:We establish a continuous-time framework for analyzing Deep Q-Networks (DQNs) via stochastic control and Forward-Backward Stochastic Differential Equations (FBSDEs). Considering a continuous-time Markov Decision Process (MDP) driven by a square-integrable martingale, we analyze DQN approximation properties. We show that DQNs can approximate the optimal Q-function on compact sets with arbitrary accuracy and high probability, leveraging residual network approximation theorems and large deviation bounds for the state-action process. We then analyze the convergence of a general Q-learning algorithm for training DQNs in this setting, adapting stochastic approximation theorems. Our analysis emphasizes the interplay between DQN layer count, time discretization, and the role of viscosity solutions (primarily for the value function V^* ) in addressing potential non-smoothness of the optimal Q-function. This work bridges deep reinforcement learning and stochastic control, offering insights into DQNs in continuous-time settings, relevant for applications with physical systems or high-frequency data.
zh

[AI-51] A survey of agent interoperability protocols: Model Context Protocol (MCP), Agent Communication Protocol (ACP), Agent-to-Agent Protocol (A2A), and Agent Network Protocol (ANP)

【速读】:该论文试图解决大型语言模型(Large Language Model, LLM)驱动的自主代理在异构系统中集成工具、共享上下文数据和协调任务时所面临的互操作性问题。现有临时性集成方式难以扩展、保障安全且跨领域泛化能力不足。解决方案的关键在于提出并比较四种新兴的代理通信协议:Model Context Protocol (MCP)、Agent Communication Protocol (ACP)、Agent-to-Agent Protocol (A2A) 和 Agent Network Protocol (ANP),这些协议分别针对不同部署场景下的互操作性需求,通过标准化接口、多模态消息传输、基于能力的代理卡以及去中心化标识符等技术手段,实现安全、可扩展的代理生态系统。

链接: https://arxiv.org/abs/2505.02279
作者: Abul Ehtesham,Aditi Singh,Gaurav Kumar Gupta,Saket Kumar
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large language model (LLM)-powered autonomous agents demand robust, standardized protocols to integrate tools, share contextual data, and coordinate tasks across heterogeneous systems. Ad-hoc integrations are difficult to scale, secure, and generalize across domains. This survey examines four emerging agent communication protocols: Model Context Protocol (MCP), Agent Communication Protocol (ACP), Agent-to-Agent Protocol (A2A), and Agent Network Protocol (ANP), each addressing interoperability in distinct deployment contexts. MCP provides a JSON-RPC client-server interface for secure tool invocation and typed data exchange. ACP introduces REST-native messaging via multi-part messages and asynchronous streaming to support multimodal agent responses. A2A enables peer-to-peer task outsourcing through capability-based Agent Cards, facilitating enterprise-scale workflows. ANP supports open-network agent discovery and secure collaboration using decentralized identifiers (DIDs) and JSON-LD graphs. The protocols are compared across multiple dimensions, including interaction modes, discovery mechanisms, communication patterns, and security models. Based on the comparative analysis, a phased adoption roadmap is proposed: beginning with MCP for tool access, followed by ACP for multimodal messaging, A2A for collaborative task execution, and extending to ANP for decentralized agent marketplaces. This work provides a comprehensive foundation for designing secure, interoperable, and scalable ecosystems of LLM-powered agents.

[AI-52] On the Need for a Statistical Foundation in Scenario-Based Testing of Autonomous Vehicles

【Quick Read】: This paper addresses core open questions in scenario-based safety testing of autonomous vehicles (AVs), including stopping rules, residual-risk estimation, debug effectiveness, and the impact of simulation fidelity on safety claims. The key of the solution is to establish a rigorous statistical foundation: by drawing parallels with traditional software testing, the authors identify shared research gaps and reusable solutions, and they introduce Risk Estimation Fidelity (REF), a novel metric that certifies the alignment of synthetic and real-world testing outcomes, so that simulation-based safety claims remain statistically defensible.

Link: https://arxiv.org/abs/2505.02274
Authors: Xingyu Zhao, Robab Aghazadeh-Chakherlou, Chih-Hong Cheng, Peter Popov, Lorenzo Strigini
Affiliation: Unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments: under review

Abstract:Scenario-based testing has emerged as a common method for autonomous vehicles (AVs) safety, offering a more efficient alternative to mile-based testing by focusing on high-risk scenarios. However, fundamental questions persist regarding its stopping rules, residual risk estimation, debug effectiveness, and the impact of simulation fidelity on safety claims. This paper argues that a rigorous statistical foundation is essential to address these challenges and enable rigorous safety assurance. By drawing parallels between AV testing and traditional software testing methodologies, we identify shared research gaps and reusable solutions. We propose proof-of-concept models to quantify the probability of failure per scenario (pfs) and evaluate testing effectiveness under varying conditions. Our analysis reveals that neither scenario-based nor mile-based testing universally outperforms the other. Furthermore, we introduce Risk Estimation Fidelity (REF), a novel metric to certify the alignment of synthetic and real-world testing outcomes, ensuring simulation-based safety claims are statistically defensible.
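
A standard way to make the residual-risk notion concrete is a one-sided Clopper-Pearson upper bound on the probability of failure per scenario (pfs). The sketch below is illustrative only; the scenario counts and exposure weights are made up rather than taken from the paper.

```python
from scipy.stats import beta

def pfs_upper_bound(failures: int, trials: int, confidence: float = 0.95) -> float:
    """One-sided Clopper-Pearson upper bound on the per-scenario failure probability."""
    if failures == trials:
        return 1.0
    return float(beta.ppf(confidence, failures + 1, trials - failures))

# Hypothetical per-scenario outcomes (failures, trials) and exposure weights.
scenarios = {"cut_in": (0, 1200), "pedestrian": (1, 800), "merge": (0, 500)}
weights = {"cut_in": 0.5, "pedestrian": 0.2, "merge": 0.3}

residual_risk = sum(w * pfs_upper_bound(*scenarios[k]) for k, w in weights.items())
print(f"95% upper bound on weighted residual risk: {residual_risk:.2e}")
```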

[AI-53] Robust Localization Mapping and Navigation for Quadruped Robots

【Quick Read】: This paper aims at robust localization, mapping, and navigation for low-cost quadruped robots in the real world, where the core challenge is building a reliable navigation stack on top of low-cost sensors such as depth cameras. The key of the solution is to combine contact-aided kinematics, visual-inertial odometry, and depth-stabilized vision, improving the stability and accuracy of the system.

Link: https://arxiv.org/abs/2505.02272
Authors: Dyuman Aditya, Junning Huang, Nico Bohlinger, Piotr Kicki, Krzysztof Walas, Jan Peters, Matteo Luperto, Davide Tateo
Affiliation: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: 8 pages

Abstract:Quadruped robots are currently a widespread platform for robotics research, thanks to powerful Reinforcement Learning controllers and the availability of cheap and robust commercial platforms. However, to broaden the adoption of the technology in the real world, we require robust navigation stacks relying only on low-cost sensors such as depth cameras. This paper presents a first step towards a robust localization, mapping, and navigation system for low-cost quadruped robots. In pursuit of this objective we combine contact-aided kinematic, visual-inertial odometry, and depth-stabilized vision, enhancing stability and accuracy of the system. Our results in simulation and two different real-world quadruped platforms show that our system can generate an accurate 2D map of the environment, robustly localize itself, and navigate autonomously. Furthermore, we present in-depth ablation studies of the important components of the system and their impact on localization accuracy. Videos, code, and additional experiments can be found on the project website: this https URL

[AI-54] Real-time Spatial Retrieval Augmented Generation for Urban Environments

【Quick Read】: This paper addresses the mismatch between the stale knowledge of foundation models and the dynamic, real-time demands of urban environments, where generative AI applications are challenged by frequently updated data, real-time requirements, and strong links to the physical world. The key of the solution is a real-time spatial retrieval-augmented generation (RAG) architecture that integrates generative AI into cities through temporal and spatial filtering over linked data; the architecture is implemented on the FIWARE platform to support smart-city and digital-twin applications.

Link: https://arxiv.org/abs/2505.02271
Authors: David Nazareno Campo, Javier Conde, Álvaro Alonso, Gabriel Huecas, Joaquín Salvachúa, Pedro Reviriego
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:The proliferation of Generative Artificial Intelligence (AI), especially Large Language Models, presents transformative opportunities for urban applications through Urban Foundation Models. However, base models face limitations, as they only contain the knowledge available at the time of training, and updating them is both time-consuming and costly. Retrieval Augmented Generation (RAG) has emerged in the literature as the preferred approach for injecting contextual information into Foundation Models. It prevails over techniques such as fine-tuning, which are less effective in dynamic, real-time scenarios like those found in urban environments. However, traditional RAG architectures, based on semantic databases, knowledge graphs, structured data, or AI-powered web searches, do not fully meet the demands of urban contexts. Urban environments are complex systems characterized by large volumes of interconnected data, frequent updates, real-time processing requirements, security needs, and strong links to the physical world. This work proposes a real-time spatial RAG architecture that defines the necessary components for the effective integration of generative AI into cities, leveraging temporal and spatial filtering capabilities through linked data. The proposed architecture is implemented using FIWARE, an ecosystem of software components to develop smart city solutions and digital twins. The design and implementation are demonstrated through the use case of a tourism assistant in the city of Madrid. The use case serves to validate the correct integration of Foundation Models through the proposed RAG architecture.
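
The temporal and spatial filtering at the core of the architecture can be illustrated with a simple retrieval routine (the real system is built on FIWARE and linked data, which this sketch does not model): candidates are pre-filtered by haversine distance and recency, then ranked by embedding similarity. The document schema is an assumption.

```python
import math
import numpy as np

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in kilometres."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp, dl = math.radians(lat2 - lat1), math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * 6371.0 * math.asin(math.sqrt(a))

def spatial_rag_retrieve(query_vec, docs, q_lat, q_lon, now,
                         radius_km=2.0, max_age_s=600, top_k=3):
    """Spatio-temporal pre-filter followed by cosine-similarity ranking."""
    fresh_near = [d for d in docs
                  if haversine_km(q_lat, q_lon, d["lat"], d["lon"]) <= radius_km
                  and (now - d["timestamp"]) <= max_age_s]

    def cos(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    return sorted(fresh_near, key=lambda d: cos(query_vec, d["embedding"]),
                  reverse=True)[:top_k]
```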

[AI-55] RISE: Radius of Influence based Subgraph Extraction for 3D Molecular Graph Explanation

【Quick Read】: This paper tackles the limited interpretability of 3D geometric graph neural networks (GNNs) for molecular data, which restricts their use in scientific applications that require reliable and transparent insights. The key of the solution is an explanation method designed specifically for 3D GNNs: each node is assigned a radius of influence that localizes the explanation to the node's neighborhood in 3D space, capturing the spatial and structural interactions that drive the model's predictions, improving interpretability, and aligning with the physical and structural dependencies typical of 3D graph applications.

Link: https://arxiv.org/abs/2505.02247
Authors: Jingxiang Qu, Wenhan Gao, Jiaxing Zhang, Xufeng Liu, Hua Wei, Haibin Ling, Yi Liu
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Quantitative Methods (q-bio.QM)
Comments:

Abstract:3D Geometric Graph Neural Networks (GNNs) have emerged as transformative tools for modeling molecular data. Despite their predictive power, these models often suffer from limited interpretability, raising concerns for scientific applications that require reliable and transparent insights. While existing methods have primarily focused on explaining molecular substructures in 2D GNNs, the transition to 3D GNNs introduces unique challenges, such as handling the implicit dense edge structures created by a cut-off radius. To tackle this, we introduce a novel explanation method specifically designed for 3D GNNs, which localizes the explanation to the immediate neighborhood of each node within the 3D space. Each node is assigned a radius of influence, defining the localized region within which message passing captures spatial and structural interactions crucial for the model's predictions. This method leverages the spatial and geometric characteristics inherent in 3D graphs. By constraining the subgraph to a localized radius of influence, the approach not only enhances interpretability but also aligns with the physical and structural dependencies typical of 3D graph applications, such as molecular learning.
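
Under the simplifying assumption that a node's explanation region is just the ball of its assigned radius in 3D space, extracting that region is a one-liner; the fidelity-style probe below, with model_fn a hypothetical stand-in for a trained 3D GNN, is illustrative and not the paper's actual scoring procedure.

```python
import numpy as np

def radius_subgraph(coords: np.ndarray, center: int, radius: float) -> np.ndarray:
    """Indices of nodes within `radius` of node `center` in 3D space."""
    d = np.linalg.norm(coords - coords[center], axis=1)
    return np.nonzero(d <= radius)[0]

def influence_probe(model_fn, coords, feats, center, radius):
    """Prediction change when the local ball around `center` is removed."""
    ball = radius_subgraph(coords, center, radius)
    keep = np.setdiff1d(np.arange(len(coords)), ball)
    return model_fn(coords, feats) - model_fn(coords[keep], feats[keep])
```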

[AI-56] Prompt-responsive Object Retrieval with Memory-augmented Student-Teacher Learning

【Quick Read】: This paper addresses prompt-driven fine-grained manipulation in cluttered scenes, i.e., how to link high-level instructions with fine-grained dexterous control. The key of the solution is a framework that combines promptable foundation models with reinforcement learning (RL) through memory-augmented student-teacher learning: the Segment-Anything 2 (SAM 2) model serves as the perception backbone to infer the object of interest from user prompts, and although individual detections are imperfect, their temporal sequence provides rich information for implicit state estimation by memory-augmented models, from which prompt-responsive policies are learned.

Link: https://arxiv.org/abs/2505.02232
Authors: Malte Mosbach, Sven Behnke
Affiliation: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Building models responsive to input prompts represents a transformative shift in machine learning. This paradigm holds significant potential for robotics problems, such as targeted manipulation amidst clutter. In this work, we present a novel approach to combine promptable foundation models with reinforcement learning (RL), enabling robots to perform dexterous manipulation tasks in a prompt-responsive manner. Existing methods struggle to link high-level commands with fine-grained dexterous control. We address this gap with a memory-augmented student-teacher learning framework. We use the Segment-Anything 2 (SAM 2) model as a perception backbone to infer an object of interest from user prompts. While detections are imperfect, their temporal sequence provides rich information for implicit state estimation by memory-augmented models. Our approach successfully learns prompt-responsive policies, demonstrated in picking objects from cluttered scenes. Videos and code are available at this https URL

[AI-57] The GenAI Generation: Student Views of Awareness, Preparedness, and Concern

【Quick Read】: This paper examines the impact of generative AI (GenAI) on education and workforce development, and assesses students' awareness, preparedness, and concerns. The key of the approach is a concise survey with optional open-ended questions that gathered more than 250 responses, from which a dual sentiment emerges: enthusiasm for the technology on the one hand, and deep concerns about ethics, job displacement, and the adequacy of educational structures on the other. The findings, with accompanying recommendations, offer guidance for educational institutions navigating a GenAI-driven future.

Link: https://arxiv.org/abs/2505.02230
Authors: Micaela Siraj, Jon Duke
Affiliation: Unknown
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments:

Abstract:Generative AI (GenAI) is revolutionizing education and workforce development, profoundly shaping how students learn, engage, and prepare for their future. Outpacing the development of uniform policies and structures, GenAI has heralded a unique era and given rise to the GenAI Generation: a cohort of students whose education has been increasingly shaped by the opportunities and challenges GenAI presents during its widespread adoption within society. This study examines our students' perceptions of GenAI through a concise survey with optional open-ended questions, focusing on their awareness, preparedness, and concerns. Evaluation of more than 250 responses, with more than 40% providing detailed qualitative feedback, reveals a core dual sentiment: while most students express enthusiasm for GenAI, an even greater proportion voice a spectrum of concerns about ethics, job displacement, and the adequacy of educational structures given the highly transformative technology. These findings offer critical insights into how students view the potential and pitfalls of GenAI for future career impacts, with accompanying recommendations to guide educational institutions in navigating a future driven by GenAI.

[AI-58] Coupled Distributional Random Expert Distillation for World Model Online Imitation Learning

【Quick Read】: This paper addresses the instability that imitation learning (IL) encounters in world-model frameworks that rely on adversarial reward or value formulations. The key of the solution is a reward model based on random network distillation (RND) for density estimation: the expert and behavioral distributions are estimated jointly in the latent space of the world model, which yields more stable online imitation learning and expert-level performance.

Link: https://arxiv.org/abs/2505.02228
Authors: Shangzhe Li, Zhiao Huang, Hao Su
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Imitation Learning (IL) has achieved remarkable success across various domains, including robotics, autonomous driving, and healthcare, by enabling agents to learn complex behaviors from expert demonstrations. However, existing IL methods often face instability challenges, particularly when relying on adversarial reward or value formulations in world model frameworks. In this work, we propose a novel approach to online imitation learning that addresses these limitations through a reward model based on random network distillation (RND) for density estimation. Our reward model is built on the joint estimation of expert and behavioral distributions within the latent space of the world model. We evaluate our method across diverse benchmarks, including DMControl, Meta-World, and ManiSkill2, showcasing its ability to deliver stable performance and achieve expert-level results in both locomotion and manipulation tasks. Our approach demonstrates improved stability over adversarial methods while maintaining expert-level performance.
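
The RND half of such a reward model is easy to sketch under simplifying assumptions: a frozen random target network, a predictor fit on expert latents, and reward defined as negative prediction error (low error means "expert-like"). The paper's joint estimation of expert and behavioral distributions inside a world model's latent space is not reproduced here.

```python
import torch
import torch.nn as nn

def mlp(din, dout):
    return nn.Sequential(nn.Linear(din, 128), nn.ReLU(), nn.Linear(128, dout))

class RNDReward(nn.Module):
    def __init__(self, latent_dim, feat_dim=64):
        super().__init__()
        self.target = mlp(latent_dim, feat_dim)      # frozen random features
        for p in self.target.parameters():
            p.requires_grad_(False)
        self.predictor = mlp(latent_dim, feat_dim)

    def error(self, z):
        return (self.predictor(z) - self.target(z)).pow(2).mean(dim=-1)

    def reward(self, z):
        with torch.no_grad():
            return -self.error(z)                    # higher reward = more expert-like

rnd = RNDReward(latent_dim=32)
opt = torch.optim.Adam(rnd.predictor.parameters(), lr=1e-4)
expert_z = torch.randn(256, 32)                      # placeholder expert latents
for _ in range(100):                                 # distill the target on expert data only
    loss = rnd.error(expert_z).mean()
    opt.zero_grad(); loss.backward(); opt.step()
```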

[AI-59] LLM -Guided Probabilistic Program Induction for POMDP Model Estimation

【Quick Read】: This paper addresses learning models of partially observable Markov decision processes (POMDPs). The key of the solution is to use an LLM as a prior that generates candidate low-complexity probabilistic programs, in the form of short probabilistic graphical models, for the components of the POMDP; the candidates are tested against the empirical distribution and adjusted through feedback, yielding effective POMDP models.

Link: https://arxiv.org/abs/2505.02216
Authors: Aidan Curtis, Hao Tang, Thiago Veloso, Kevin Ellis, Tomás Lozano-Pérez, Leslie Pack Kaelbling
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Partially Observable Markov Decision Processes (POMDPs) model decision making under uncertainty. While there are many approaches to approximately solving POMDPs, we aim to address the problem of learning such models. In particular, we are interested in a subclass of POMDPs wherein the components of the model, including the observation function, reward function, transition function, and initial state distribution function, can be modeled as low-complexity probabilistic graphical models in the form of a short probabilistic program. Our strategy to learn these programs uses an LLM as a prior, generating candidate probabilistic programs that are then tested against the empirical distribution and adjusted through feedback. We experiment on a number of classical toy POMDP problems, simulated MiniGrid domains, and two real mobile-base robotics search domains involving partial observability. Our results show that using an LLM to guide in the construction of a low-complexity POMDP model can be more effective than tabular POMDP learning, behavior cloning, or direct LLM planning.

[AI-60] Student Perspectives on the Benefits and Risks of AI in Education

【Quick Read】: This paper explores students' experiences with AI chatbots in education and the benefits and risks they perceive. Analyzing survey responses from 262 undergraduates, it identifies positive views on feedback and study support, instruction capabilities, and access to information, together with concerns about academic integrity, information accuracy, loss of critical-thinking skills, overreliance on the technology, and ethical issues such as data privacy, system bias, environmental impact, and preserving the human element in education. The key takeaway is that institutions should establish clear AI-use policies and build AI-literacy curricula, so that AI's strengths, such as immediate feedback and personalized learning support, can be leveraged while preserving the quality and integrity of the learning process.

Link: https://arxiv.org/abs/2505.02198
Authors: Griffin Pitts, Viktoria Marcus, Sanaz Motamedi
Affiliation: Unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET)
Comments:

Abstract:The use of chatbots equipped with artificial intelligence (AI) in educational settings has increased in recent years, showing potential to support teaching and learning. However, the adoption of these technologies has raised concerns about their impact on academic integrity, students' ability to problem-solve independently, and potential underlying biases. To better understand students' perspectives and experiences with these tools, a survey was conducted at a large public university in the United States. Through thematic analysis, 262 undergraduate students' responses regarding their perceived benefits and risks of AI chatbots in education were identified and categorized into themes. The results discuss several benefits identified by the students, with feedback and study support, instruction capabilities, and access to information being the most cited. Their primary concerns included risks to academic integrity, accuracy of information, loss of critical thinking skills, the potential development of overreliance, and ethical considerations such as data privacy, system bias, environmental impact, and preservation of human elements in education. While student perceptions align with previously discussed benefits and risks of AI in education, they show heightened concerns about distinguishing between human and AI generated work - particularly in cases where authentic work is flagged as AI-generated. To address students' concerns, institutions can establish clear policies regarding AI use and develop curriculum around AI literacy. With these in place, practitioners can effectively develop and implement educational systems that leverage AI's potential in areas such as immediate feedback and personalized learning support. This approach can enhance the quality of students' educational experiences while preserving the integrity of the learning process with AI.

[AI-61] Leveraging LLMs to Automate Energy-Aware Refactoring of Parallel Scientific Codes

【Quick Read】: This paper addresses the fact that current uses of large language models (LLMs) for generating parallel scientific code emphasize functional correctness while neglecting performance and energy consumption. The key of the solution is LASSI-EE, an automated LLM-based refactoring framework that, given a parallel code, generates a more energy-efficient version for a target parallel system. Through a multi-stage, iterative pipeline, LASSI-EE achieved an average energy reduction of 47% across 85% of the 20 HeCBench benchmarks tested on NVIDIA A100 GPUs, demonstrating the potential of LLMs for energy-aware programming.

Link: https://arxiv.org/abs/2505.02184
Authors: Matthew T. Dearing, Yiheng Tao, Xingfu Wu, Zhiling Lan, Valerie Taylor
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Programming Languages (cs.PL); Software Engineering (cs.SE)
Comments: 11 pages, 4 figures

Abstract:While large language models (LLMs) are increasingly used for generating parallel scientific code, most current efforts emphasize functional correctness, often overlooking performance and energy considerations. In this work, we propose LASSI-EE, an automated LLM-based refactoring framework that generates energy-efficient parallel code on a target parallel system for a given parallel code as input. Through a multi-stage, iterative pipeline process, LASSI-EE achieved an average energy reduction of 47% across 85% of the 20 HeCBench benchmarks tested on NVIDIA A100 GPUs. Our findings demonstrate the broader potential of LLMs, not only for generating correct code but also for enabling energy-aware programming. We also address key insights and limitations within the framework, offering valuable guidance for future improvements.

[AI-62] Data-Driven Team Selection in Fantasy Premier League Using Integer Programming and Predictive Modeling Approach

【Quick Read】: This paper addresses how to select the optimal starting eleven and captain under a fixed budget so as to maximize total points in fantasy football. The key of the solution is a set of novel deterministic and robust integer programming models, together with a new hybrid scoring metric built from an interpretable-AI framework and underlying match-performance data, plus several objective functions and estimation techniques that improve the models' performance and stability.

Link: https://arxiv.org/abs/2505.02170
Authors: Danial Ramezani
Affiliation: Unknown
Categories: Computational Engineering, Finance, and Science (cs.CE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Optimization and Control (math.OC)
Comments:

Abstract:Fantasy football is a billion-dollar industry with millions of participants. Constrained by a fixed budget, decision-makers draft a squad whose players are expected to perform well in the upcoming weeks to maximize total points. This paper proposes novel deterministic and robust integer programming models that select the optimal starting eleven and the captain. A new hybrid scoring metric is constructed using an interpretable artificial intelligence framework and underlying match performance data. Several objective functions and estimation techniques are introduced for the programming model. To the best of my knowledge, this is the first study to approach fantasy football through this lens. The models’ performance is evaluated using data from the 2023/24 Premier League season. Results indicate that the proposed hybrid method achieved the highest score while maintaining consistent performance. Utilizing the Monte Carlo simulation, the strategic choice of averaging techniques for estimating cost vectors, and the proposed hybrid approach are shown to be effective during the out-of-sample period. This paper also provides a thorough analysis of the optimal formations and players selected by the models, offering valuable insights into effective fantasy football strategies.
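
The integer-programming core can be illustrated with a toy PuLP model that picks a starting eleven and a captain under a budget; the player pool, prices, quotas, and expected points below are invented, and the paper's robust variants and hybrid scoring metric are not reproduced.

```python
import pulp

# Toy player pool: (name, position, cost, expected_points); all values are made up.
players = [
    ("GK1", "GK", 5.0, 4.2), ("GK2", "GK", 4.5, 3.9),
    ("D1", "DEF", 6.0, 5.1), ("D2", "DEF", 5.5, 4.8), ("D3", "DEF", 5.0, 4.5),
    ("D4", "DEF", 4.5, 4.0), ("M1", "MID", 9.0, 7.2), ("M2", "MID", 8.0, 6.5),
    ("M3", "MID", 7.0, 6.1), ("M4", "MID", 6.5, 5.4), ("F1", "FWD", 11.0, 8.0),
    ("F2", "FWD", 9.5, 7.1), ("F3", "FWD", 7.5, 5.9),
]
quota = {"GK": 1, "DEF": 4, "MID": 4, "FWD": 2}   # a 4-4-2 formation
budget = 83.0

prob = pulp.LpProblem("starting_eleven", pulp.LpMaximize)
pick = pulp.LpVariable.dicts("pick", [p[0] for p in players], cat="Binary")
capt = pulp.LpVariable.dicts("capt", [p[0] for p in players], cat="Binary")

# Objective: total expected points; the captain's points count double.
prob += pulp.lpSum(pts * (pick[n] + capt[n]) for n, _, _, pts in players)
prob += pulp.lpSum(cost * pick[n] for n, _, cost, _ in players) <= budget
for pos, k in quota.items():
    prob += pulp.lpSum(pick[n] for n, p, _, _ in players if p == pos) == k
prob += pulp.lpSum(capt[n] for n, _, _, _ in players) == 1
for n, _, _, _ in players:
    prob += capt[n] <= pick[n]                    # captain must be in the eleven

prob.solve(pulp.PULP_CBC_CMD(msg=False))
print([n for n, *_ in players if pick[n].value() == 1])
```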

[AI-63] Interpreting Multilingual and Document-Length Sensitive Relevance Computations in Neural Retrieval Models through Axiomatic Causal Interventions SIGIR2025

【Quick Read】: This paper studies the interpretability of neural ranking models, asking how task-relevant properties such as term frequency are encoded, and reverse-engineers relevance computation through causal interventions. The key of the solution is an activation-patching method that isolates the behavior of specific components and tokens in neural retrieval models; it is used to verify how term-frequency information is encoded across languages and model layers.

Link: https://arxiv.org/abs/2505.02154
Authors: Oliver Savolainen, Dur e Najaf Amjad, Roxana Petcu
Affiliation: Unknown
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments: 10 pages, SIGIR 2025

Abstract:This reproducibility study analyzes and extends the paper “Axiomatic Causal Interventions for Reverse Engineering Relevance Computation in Neural Retrieval Models,” which investigates how neural retrieval models encode task-relevant properties such as term frequency. We reproduce key experiments from the original paper, confirming that information on query terms is captured in the model encoding. We extend this work by applying activation patching to Spanish and Chinese datasets and by exploring whether document-length information is encoded in the model as well. Our results confirm that the designed activation patching method can isolate the behavior to specific components and tokens in neural retrieval models. Moreover, our findings indicate that the location of term frequency generalizes across languages and that in later layers, the information for sequence-level tasks is represented in the CLS token. The results highlight the need for further research into interpretability in information retrieval and reproducibility in machine learning research. Our code is available at this https URL.
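
Activation patching itself is compact to sketch: cache an activation from a source run, then splice it into the same layer during a target run via PyTorch forward hooks. This assumes a model whose chosen layer returns a single tensor (many transformer blocks return tuples), and it is a generic illustration rather than the authors' experimental code.

```python
import torch

def run_with_patch(model, layer, source_inputs, target_inputs):
    """Record `layer`'s output on the source run, then overwrite it on the target run."""
    cache = {}

    def save_hook(module, inputs, output):
        cache["act"] = output.detach()

    def patch_hook(module, inputs, output):
        return cache["act"]                     # returning a value replaces the output

    h = layer.register_forward_hook(save_hook)
    with torch.no_grad():
        model(**source_inputs)                  # 1) cache the source activation
    h.remove()

    h = layer.register_forward_hook(patch_hook)
    with torch.no_grad():
        patched_out = model(**target_inputs)    # 2) patched target run
    h.remove()
    return patched_out
```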

[AI-64] Representation Learning of Limit Order Book: A Comprehensive Study and Benchmarking

【Quick Read】: This paper addresses the limited transferability and generalization of limit order book (LOB) representation learning, and the restricted reusability caused by existing methods tightly coupling representation learning with downstream tasks in end-to-end frameworks. The key of the solution is the first systematic comparative study of LOB representation learning, together with LOBench, a standardized benchmark built on real China A-share market data that offers curated datasets, unified preprocessing, consistent evaluation metrics, and strong baselines. Extensive experiments validate the sufficiency and necessity of LOB representations for various downstream tasks and highlight their advantages over both task-specific end-to-end models and advanced general time-series representation learners.

Link: https://arxiv.org/abs/2505.02139
Authors: Muyao Zhong, Yushi Lin, Peng Yang
Affiliation: Unknown
Categories: Computational Engineering, Finance, and Science (cs.CE); Artificial Intelligence (cs.AI)
Comments:

Abstract:The Limit Order Book (LOB), the most fundamental data of the financial market, provides a fine-grained view of market dynamics while posing significant challenges for deep models due to its strong autocorrelation, cross-feature constraints, and feature scale disparity. Existing approaches often tightly couple representation learning with specific downstream tasks in an end-to-end manner, failing to analyze the learned representations individually and explicitly, which limits their reusability and generalization. This paper conducts the first systematic comparative study of LOB representation learning, aiming to identify the effective way of extracting transferable, compact features that capture essential LOB properties. We introduce LOBench, a standardized benchmark with real China A-share market data, offering curated datasets, unified preprocessing, consistent evaluation metrics, and strong baselines. Extensive experiments validate the sufficiency and necessity of LOB representations for various downstream tasks and highlight their advantages over both the traditional task-specific end-to-end models and the advanced representation learning models for general time series. Our work establishes a reproducible framework and provides clear guidelines for future research. Datasets and code will be publicly available at this https URL.

[AI-65] Subspace Aggregation Query and Index Generation for Multidimensional Resource Space Model

【Quick Read】: This paper addresses efficient management and querying of large-scale resources organized in a multidimensional classification space. It defines a subspace aggregation query over ranges of the partial order on the coordinate tree of each dimension, so that the resources aggregated at each point within the subspace can be measured, ranked, and selected. The key of the solution is a method for generating a graph index: inclusion links built from partial-order relations on coordinates let a subspace query reach non-empty points by following index links and aggregate resources back to their super points along index paths. Since the number of children of an index node can be very large, making the total number of index nodes unbounded, the approach reduces the generation cost by (1) adding intersection links between index nodes to cut query-processing costs while controlling the number of index nodes, (2) deciding where to add intersection links from a probabilistic distribution that estimates their cost, (3) splitting coordinates of one dimension by coordinates of another to balance the number of resources held by index nodes, and (4) adding shortcut links between sibling coordinates of coordinate trees to make queries on linearly ordered coordinates efficient.

Link: https://arxiv.org/abs/2505.02129
Authors: Xiaoping Sun, Hai Zhuge
Affiliation: Unknown
Categories: Databases (cs.DB); Artificial Intelligence (cs.AI)
Comments:

Abstract:Organizing resources in a multidimensional classification space is an approach to efficiently managing and querying large-scale resources. This paper defines an aggregation query on a subspace defined by a range on the partial order on the coordinate tree at each dimension, where each point contains resources aggregated along the paths of partial order relations on the points, so that aggregated resources at each point within the subspace can be measured, ranked and selected. To efficiently locate non-empty points in a large subspace, an approach to generating a graph index is proposed to build inclusion links with partial order relations on coordinates of dimensions, to enable a subspace query to reach non-empty points by following indexing links and aggregate resources along indexing paths back to their super points. Generating such an index is costly, as the number of children of an index node can be very large so that the total number of indexing nodes is unbounded. The proposed approach adopts the following strategies to reduce the cost: (1) adding intersection links between two indexing nodes, which can better reduce query processing costs while controlling the number of nodes of the graph index; (2) intersection links are added between two nodes according to the probabilistic distribution calculated for estimating the costs of adding an intersection between two nodes; (3) coordinates at one dimension having more resources are split by coordinates at another dimension to balance the number of resources held by indexing nodes; and (4) short-cut links are added between sibling coordinates of coordinate trees to make queries on linear-order coordinates efficient. Analysis and experiments verified the effectiveness of the generated index in supporting subspace aggregation queries. This work makes significant contributions to the development of data models based on multi-dimensional classification.

[AI-66] Overview of AI Grading of Physics Olympiad Exams

【Quick Read】: This paper addresses automatic grading of the many question types found in high-school physics problems, a challenge that requires automated grading techniques from several fields. The key of the solution is a multimodal AI grading framework that integrates multiple information modalities to improve accuracy and adaptability, examined in light of Australia's AI Ethical Principles.

Link: https://arxiv.org/abs/2505.02121
Authors: Lachlan McGinness
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: International Conference on Artificial Intelligence in Education, Doctoral Consortium

Abstract:Automatically grading the diverse range of question types in high school physics problems is a challenge that requires automated grading techniques from different fields. We report the findings of a Systematic Literature Review of potential physics grading techniques. We propose a multi-modal AI grading framework to address these challenges and examine our framework in light of Australia's AI Ethical Principles.

[AI-67] Tricolore: Multi-Behavior User Profiling for Enhanced Candidate Generation in Recommender Systems

【Quick Read】: This paper addresses the limitations of traditional recommender systems, which optimize a single target behavior and represent user preferences with a single vector, failing to capture the full spectrum of user interests and narrowing the candidate pool. The key of the solution is Tricolore, a multi-vector learning framework whose adaptive multi-task structure and behavior-wise multi-view fusion module uncover connections between behavior types, yielding more robust and diverse candidate generation; a popularity-balanced strategy further improves the accuracy-diversity trade-off of the recommendation list.

Link: https://arxiv.org/abs/2505.02120
Authors: Xiao Zhou, Zhongxiang Zhao, Hanze Guo
Affiliation: Unknown
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments:

Abstract:Online platforms aggregate extensive user feedback across diverse behaviors, providing a rich source for enhancing user engagement. Traditional recommender systems, however, typically optimize for a single target behavior and represent user preferences with a single vector, limiting their ability to handle multiple important behaviors or optimization objectives. This conventional approach also struggles to capture the full spectrum of user interests, resulting in a narrow item pool during candidate generation. To address these limitations, we present Tricolore, a versatile multi-vector learning framework that uncovers connections between different behavior types for more robust candidate generation. Tricolore’s adaptive multi-task structure is also customizable to specific platform needs. To manage the variability in sparsity across behavior types, we incorporate a behavior-wise multi-view fusion module that dynamically enhances learning. Moreover, a popularity-balanced strategy ensures the recommendation list balances accuracy with item popularity, fostering diversity and improving overall performance. Extensive experiments on public datasets demonstrate Tricolore’s effectiveness across various recommendation scenarios, from short video platforms to e-commerce. By leveraging a shared base embedding strategy, Tricolore also significantly improves the performance for cold-start users. The source code is publicly available at: this https URL.

[AI-68] Adversarial Cooperative Rationalization: The Risk of Spurious Correlations in Even Clean Datasets ICML2025

【Quick Read】: This paper studies the sampling bias that the cooperative game between generator and predictor can introduce in self-rationalization frameworks: the generator may inadvertently create a spurious correlation between the selected rationale and the label, even when they are semantically unrelated in the original dataset. The key of the solution is to expose the origin of this bias through theoretical analysis and empirical evidence, and, based on attack-style inspections of these correlations, to introduce an instruction that prevents the predictor from learning them, improving both predictive accuracy and rationale quality.

Link: https://arxiv.org/abs/2505.02118
Authors: Wei Liu, Zhongyu Niu, Lang Gao, Zhiying Deng, Jun Wang, Haozhao Wang, Ruixuan Li
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: ICML 2025

Abstract:This study investigates the self-rationalization framework constructed with a cooperative game, where a generator initially extracts the most informative segment from raw input, and a subsequent predictor utilizes the selected subset for its input. The generator and predictor are trained collaboratively to maximize prediction accuracy. In this paper, we first uncover a potential caveat: such a cooperative game could unintentionally introduce a sampling bias during rationale extraction. Specifically, the generator might inadvertently create an incorrect correlation between the selected rationale candidate and the label, even when they are semantically unrelated in the original dataset. Subsequently, we elucidate the origins of this bias using both detailed theoretical analysis and empirical evidence. Our findings suggest a direction for inspecting these correlations through attacks, based on which we further introduce an instruction to prevent the predictor from learning the correlations. Through experiments on six text classification datasets and two graph classification datasets using three network architectures (GRUs, BERT, and GCN), we show that our method not only significantly outperforms recent rationalization methods, but also achieves comparable or even better results than a representative LLM (llama3.1-8b-instruct).

[AI-69] Eterna is Solved

【Quick Read】: This paper addresses RNA design, i.e., finding a nucleotide sequence that folds into a target secondary structure, a problem with important applications in synthetic biology, medicine, and nanotechnology. The proposed solution is the Montparnasse algorithm, whose key is a Multi Objective Generalized Nested Rollout Policy Adaptation with Limited Repetition (MOGNRPALR) strategy that solves the Eterna benchmark.

Link: https://arxiv.org/abs/2505.02110
Authors: Tristan Cazenave
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:RNA design consists of discovering a nucleotide sequence that folds into a target secondary structure. It is useful for synthetic biology, medicine, and nanotechnology. We propose Montparnasse, a Multi Objective Generalized Nested Rollout Policy Adaptation with Limited Repetition (MOGNRPALR) RNA design algorithm. It solves the Eterna benchmark.
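
MOGNRPALR extends Nested Rollout Policy Adaptation (NRPA) with multiple objectives and limited repetition; the plain NRPA skeleton it builds on looks roughly like the sketch below. The env interface (legal_moves, terminal, play, score, initial_state) is a hypothetical stand-in for an RNA design environment, and the per-level iteration count is arbitrary.

```python
import math
import random
from collections import defaultdict

def playout(policy, env):
    """Sample a sequence by softmax over policy weights; return (score, sequence)."""
    state, seq = env.initial_state(), []
    while not env.terminal(state):
        moves = env.legal_moves(state)
        weights = [math.exp(policy[m]) for m in moves]
        m = random.choices(moves, weights=weights)[0]
        seq.append(m)
        state = env.play(state, m)
    return env.score(state), seq

def adapt(policy, seq, env, alpha=1.0):
    """Shift policy weight toward the moves of the best sequence found so far."""
    state, pol = env.initial_state(), policy.copy()
    for m in seq:
        moves = env.legal_moves(state)
        z = sum(math.exp(policy[mm]) for mm in moves)
        for mm in moves:
            pol[mm] -= alpha * math.exp(policy[mm]) / z
        pol[m] += alpha
        state = env.play(state, m)
    return pol

def nrpa(level, policy, env, iters=100):
    if level == 0:
        return playout(policy, env)
    best_score, best_seq = -math.inf, []
    for _ in range(iters):
        s, seq = nrpa(level - 1, policy, env)
        if s >= best_score:
            best_score, best_seq = s, seq
        policy = adapt(policy, best_seq, env)
    return best_score, best_seq

# Usage sketch: best_score, best_seq = nrpa(level=2, policy=defaultdict(float), env=my_env)
```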

[AI-70] MemEngine: A Unified and Modular Library for Developing Advanced Memory of LLM-based Agents

【Quick Read】: This paper addresses the lack of a unified implementation framework for the memory capabilities of large language model (LLM)-based agents: many advanced memory models have been proposed in recent research, but no general, extensible implementation exists. The key of the solution is MemEngine, a unified and modular library that implements a wide range of advanced memory models, supports convenient and extensible memory development, and offers user-friendly, pluggable memory usage, facilitating both research on and application of memory for LLM-based agents.

Link: https://arxiv.org/abs/2505.02099
Authors: Zeyu Zhang, Quanyu Dai, Xu Chen, Rui Li, Zhongyang Li, Zhenhua Dong
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: Just accepted by TheWebConf'25 Resource Track

Abstract:Recently, large language model based (LLM-based) agents have been widely applied across various fields. As a critical part, their memory capabilities have captured significant interest from both industrial and academic communities. Despite the proposal of many advanced memory models in recent research, however, there remains a lack of unified implementations under a general framework. To address this issue, we develop a unified and modular library for developing advanced memory models of LLM-based agents, called MemEngine. Based on our framework, we implement abundant memory models from recent research works. Additionally, our library facilitates convenient and extensible memory development, and offers user-friendly and pluggable memory usage. For benefiting our community, we have made our project publicly available at this https URL.

[AI-71] Retrieval-augmented in-context learning for multimodal large language models in disease classification

【Quick Read】: This paper aims to dynamically retrieve informative demonstrations to enhance in-context learning (ICL) in multimodal large language models (MLLMs) for disease classification. The key of the solution is RAICL, a framework that integrates retrieval-augmented generation (RAG) with ICL: embeddings from various encoders (ResNet, BERT, BioBERT, and ClinicalBERT) are used to adaptively select demonstrations with similar disease patterns and to construct conversational prompts optimized for ICL, improving the classification performance of MLLMs.

Link: https://arxiv.org/abs/2505.02087
Authors: Zaifu Zhan, Shuang Zhou, Xiaoshan Zhou, Yongkang Xiao, Jun Wang, Jiawen Deng, He Zhu, Yu Hou, Rui Zhang
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 17 pages, 1 figure, 7 tables

Abstract:Objectives: We aim to dynamically retrieve informative demonstrations, enhancing in-context learning in multimodal large language models (MLLMs) for disease classification. Methods: We propose a Retrieval-Augmented In-Context Learning (RAICL) framework, which integrates retrieval-augmented generation (RAG) and in-context learning (ICL) to adaptively select demonstrations with similar disease patterns, enabling more effective ICL in MLLMs. Specifically, RAICL examines embeddings from diverse encoders, including ResNet, BERT, BioBERT, and ClinicalBERT, to retrieve appropriate demonstrations, and constructs conversational prompts optimized for ICL. We evaluated the framework on two real-world multi-modal datasets (TCGA and IU Chest X-ray), assessing its performance across multiple MLLMs (Qwen, Llava, Gemma), embedding strategies, similarity metrics, and varying numbers of demonstrations. Results: RAICL consistently improved classification performance. Accuracy increased from 0.7854 to 0.8368 on TCGA and from 0.7924 to 0.8658 on IU Chest X-ray. Multi-modal inputs outperformed single-modal ones, with text-only inputs being stronger than images alone. The richness of information embedded in each modality will determine which embedding model can be used to get better results. Few-shot experiments showed that increasing the number of retrieved examples further enhanced performance. Across different similarity metrics, Euclidean distance achieved the highest accuracy while cosine similarity yielded better macro-F1 scores. RAICL demonstrated consistent improvements across various MLLMs, confirming its robustness and versatility. Conclusions: RAICL provides an efficient and scalable approach to enhance in-context learning in MLLMs for multimodal disease classification.
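
The retrieval and prompt-construction steps such a framework performs can be sketched as follows; the pool schema, the "Diagnosis:" answer format, and the cosine/Euclidean switch are illustrative assumptions rather than RAICL's exact implementation.

```python
import numpy as np

def retrieve_demonstrations(query_emb, pool, k=3, metric="cosine"):
    """Pick the k most similar labeled examples to serve as ICL demonstrations."""
    embs = np.stack([ex["embedding"] for ex in pool])
    if metric == "cosine":
        sims = embs @ query_emb / (np.linalg.norm(embs, axis=1)
                                   * np.linalg.norm(query_emb))
        order = np.argsort(-sims)
    else:  # Euclidean distance
        order = np.argsort(np.linalg.norm(embs - query_emb, axis=1))
    return [pool[i] for i in order[:k]]

def build_prompt(demos, query_text):
    """Conversational prompt: retrieved demonstrations first, the query last."""
    turns = []
    for d in demos:
        turns.append({"role": "user", "content": d["report"]})
        turns.append({"role": "assistant", "content": f"Diagnosis: {d['label']}"})
    turns.append({"role": "user", "content": query_text})
    return turns
```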

[AI-72] Open Challenges in Multi-Agent Security: Towards Secure Systems of Interacting AI Agents

【Quick Read】: This paper addresses the security challenges raised by decentralized AI agents interacting across internet platforms, challenges that go beyond traditional cybersecurity and AI-safety frameworks. The key of the solution is to introduce multi-agent security, a new field dedicated to protecting networks of decentralized AI agents against threats that emerge or amplify through their interactions, including privacy breaches, disinformation, jailbreaks, and data poisoning; to characterize fundamental security-performance trade-offs; and to propose a unified research agenda for the open problems in designing secure agent systems and interaction environments.

Link: https://arxiv.org/abs/2505.02077
Authors: Christian Schroeder de Witt
Affiliation: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments:

Abstract:Decentralized AI agents will soon interact across internet platforms, creating security challenges beyond traditional cybersecurity and AI safety frameworks. Free-form protocols are essential for AI's task generalization but enable new threats like secret collusion and coordinated swarm attacks. Network effects can rapidly spread privacy breaches, disinformation, jailbreaks, and data poisoning, while multi-agent dispersion and stealth optimization help adversaries evade oversight, creating novel persistent threats at a systemic level. Despite their critical importance, these security challenges remain understudied, with research fragmented across disparate fields including AI security, multi-agent learning, complex systems, cybersecurity, game theory, distributed systems, and technical AI governance. We introduce multi-agent security, a new field dedicated to securing networks of decentralized AI agents against threats that emerge or amplify through their interactions, whether direct or indirect via shared environments, with each other, humans, and institutions, and characterize fundamental security-performance trade-offs. Our preliminary work (1) taxonomizes the threat landscape arising from interacting AI agents, (2) surveys security-performance tradeoffs in decentralized AI systems, and (3) proposes a unified research agenda addressing open challenges in designing secure agent systems and interaction environments. By identifying these gaps, we aim to guide research in this critical area to unlock the socioeconomic potential of large-scale agent deployment on the internet, foster public trust, and mitigate national security risks in critical infrastructure and defense contexts.

[AI-73] Leveraging LLM Agents and Digital Twins for Fault Handling in Process Plants

【Quick Read】: This paper addresses the fact that, despite progress in automation and artificial intelligence, process plants still struggle to handle certain complex tasks autonomously, such as fault handling, which typically depends on human expertise. The key of the solution is a methodological framework that couples LLM agents with a Digital Twin environment: the agents continuously interpret system states and initiate control actions, including responses to unexpected faults, with the goal of returning the system to normal operation, while the Digital Twin serves both as a structured repository of plant-specific engineering knowledge for agent prompting and as a simulation platform for systematically validating and verifying the generated corrective control actions.

Link: https://arxiv.org/abs/2505.02076
Authors: Milapji Singh Gill, Javal Vyas, Artan Markaj, Felix Gehlhoff, Mehmet Mercangöz
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments:

Abstract:Advances in Automation and Artificial Intelligence continue to enhance the autonomy of process plants in handling various operational scenarios. However, certain tasks, such as fault handling, remain challenging, as they rely heavily on human expertise. This highlights the need for systematic, knowledge-based methods. To address this gap, we propose a methodological framework that integrates Large Language Model (LLM) agents with a Digital Twin environment. The LLM agents continuously interpret system states and initiate control actions, including responses to unexpected faults, with the goal of returning the system to normal operation. In this context, the Digital Twin acts both as a structured repository of plant-specific engineering knowledge for agent prompting and as a simulation platform for the systematic validation and verification of the generated corrective control actions. The evaluation using a mixing module of a process plant demonstrates that the proposed framework is capable not only of autonomously controlling the mixing module, but also of generating effective corrective actions to mitigate a pipe clogging with only a few reprompts.

[AI-74] Lightweight Defense Against Adversarial Attacks in Time Series Classification PAKDD2025

【Quick Read】: This paper addresses the robustness of time series classification (TSC) models against adversarial attacks. Current defenses in TSC mainly rely on computationally expensive adversarial training (AT); the key of this paper's solution is a data augmentation-based defense built from five augmentation techniques tailored to time series, whose most expensive member adds only 14.07% computation over the original TSC model. Combining these techniques yields an ensemble defense that not only outperforms PGD-based AT but also improves generalization, while requiring less than one third of the computational resources of PGD-based AT.

Link: https://arxiv.org/abs/2505.02073
Authors: Yi Han (Independent Researcher, Australia)
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 13 pages, 8 figures. Accepted at RAFDA Workshop, PAKDD 2025 (Springer, EI Scopus indexed). Code: this https URL

Abstract:As time series classification (TSC) gains prominence, ensuring robust TSC models against adversarial attacks is crucial. While adversarial defense is well-studied in Computer Vision (CV), the TSC field has primarily relied on adversarial training (AT), which is computationally expensive. In this paper, five data augmentation-based defense methods tailored for time series are developed, with the most computationally intensive method among them increasing the computational resources by only 14.07% compared to the original TSC model. Moreover, the deployment process for these methods is straightforward. By leveraging these advantages of our methods, we create two combined methods. One of these methods is an ensemble of all the proposed techniques, which not only provides better defense performance than PGD-based AT but also enhances the generalization ability of TSC models. Moreover, the computational resources required for our ensemble are less than one-third of those required for PGD-based AT. These methods advance robust TSC in data mining. Furthermore, as foundation models are increasingly explored for time series feature learning, our work provides insights into integrating data augmentation-based adversarial defense with large-scale pre-trained models in future research.
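
The abstract does not spell out the five augmentations, so the sketch below shows three common time-series augmentations of the kind such defenses compose (jitter, scaling, window warping), applied as a train-time defense. Arrays are assumed to be shaped (time, channels), and all magnitudes are placeholder choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def jitter(x, sigma=0.03):
    """Add small Gaussian noise to every time step."""
    return x + rng.normal(0.0, sigma, size=x.shape)

def scaling(x, sigma=0.1):
    """Multiply each channel by a random factor close to 1."""
    return x * rng.normal(1.0, sigma, size=(1, x.shape[1]))

def window_warp(x, ratio=0.1, scale=2):
    """Stretch a random window in time, then resample to the original length."""
    n = x.shape[0]
    w = max(2, int(n * ratio))
    start = int(rng.integers(0, n - w))
    warped = np.concatenate([x[:start],
                             np.repeat(x[start:start + w], scale, axis=0),
                             x[start + w:]])
    idx = np.linspace(0, len(warped) - 1, n).astype(int)
    return warped[idx]

# Train-time defense: replace each sample by a randomly chosen augmentation.
augmentations = [jitter, scaling, window_warp]
def augment(x):
    return augmentations[rng.integers(len(augmentations))](x)
```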

[AI-75] Ethical AI in the Healthcare Sector: Investigating Key Drivers of Adoption through the Multi-Dimensional Ethical AI Adoption Model (MEAAM)

【Quick Read】: This paper addresses the lack of a comprehensive, empirical understanding of the ethical challenges of adopting artificial intelligence (AI) in the healthcare service industry, in particular the multidimensional factors that influence ethical AI adoption. The key of the solution is the Multi-Dimensional Ethical AI Adoption Model (MEAAM), a new theoretical framework that organizes 13 critical ethical variables into four foundational dimensions, Fair AI, Responsible AI, Explainable AI, and Sustainable AI, analyzed through three core ethical lenses: epistemic concerns (knowledge, transparency, and system trustworthiness), normative concerns (justice, autonomy, dignity, and moral obligations), and overarching concerns (global, systemic, and long-term ethical implications). Using a quantitative, cross-sectional design with survey data from healthcare professionals analyzed via partial least squares structural equation modeling (PLS-SEM), the study tests how these ethical constructs drive operational and systemic AI adoption.

Link: https://arxiv.org/abs/2505.02062
Authors: Prathamesh Muzumdar, Apoorva Muley, Kuldeep Singh, Sumanth Cheemalapati
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:The adoption of Artificial Intelligence (AI) in the healthcare service industry presents numerous ethical challenges, yet current frameworks often fail to offer a comprehensive, empirical understanding of the multidimensional factors influencing ethical AI integration. Addressing this critical research gap, this study introduces the Multi-Dimensional Ethical AI Adoption Model (MEAAM), a novel theoretical framework that categorizes 13 critical ethical variables across four foundational dimensions of Ethical AI: Fair AI, Responsible AI, Explainable AI, and Sustainable AI. These dimensions are further analyzed through three core ethical lenses: epistemic concerns (related to knowledge, transparency, and system trustworthiness), normative concerns (focused on justice, autonomy, dignity, and moral obligations), and overarching concerns (highlighting global, systemic, and long-term ethical implications). This study adopts a quantitative, cross-sectional research design using survey data collected from healthcare professionals and analyzed via Partial Least Squares Structural Equation Modeling (PLS-SEM). Employing PLS-SEM, this study empirically investigates the influence of these ethical constructs on two outcomes: Operational AI Adoption and Systemic AI Adoption. Results indicate that normative concerns most significantly drive operational adoption decisions, while overarching concerns predominantly shape systemic adoption strategies and governance frameworks. Epistemic concerns play a facilitative role, enhancing the impact of ethical design principles on trust and transparency in AI systems. By validating the MEAAM framework, this research advances a holistic, actionable approach to ethical AI adoption in healthcare and provides critical insights for policymakers, technologists, and healthcare administrators striving to implement ethically grounded AI solutions.

[AI-76] Enhancing Safety Standards in Automated Systems Using Dynamic Bayesian Networks

【Quick Read】: This paper addresses the safety challenges posed by cut-in maneuvers in high-speed traffic, which can cause abrupt braking and collisions and therefore demand safe, efficient lane-change strategies. The key of the solution is a Dynamic Bayesian Network (DBN) framework that integrates lateral evidence with safety-assessment models to predict lane changes and ensure safe cut-in maneuvers. The framework comprises three key probabilistic hypotheses (lateral evidence, lateral safety, and longitudinal safety) and supports decision making through dynamic data processing and assessment of vehicle positions, lateral velocities, relative distance, and time-to-collision (TTC).

Link: https://arxiv.org/abs/2505.02050
Authors: Kranthi Kumar Talluri, Anders L. Madsen, Galia Weidl
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments:

Abstract:Cut-in maneuvers in high-speed traffic pose critical challenges that can lead to abrupt braking and collisions, necessitating safe and efficient lane change strategies. We propose a Dynamic Bayesian Network (DBN) framework to integrate lateral evidence with safety assessment models, thereby predicting lane changes and ensuring safe cut-in maneuvers effectively. Our proposed framework comprises three key probabilistic hypotheses (lateral evidence, lateral safety, and longitudinal safety) that facilitate the decision-making process through dynamic data processing and assessments of vehicle positions, lateral velocities, relative distance, and Time-to-Collision (TTC) computations. The DBN model’s performance compared with other conventional approaches demonstrates superior performance in crash reduction, especially in critical high-speed scenarios, while maintaining a competitive performance in low-speed scenarios. This paves the way for robust, scalable, and efficient safety validation in automated driving systems.
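
The time-to-collision (TTC) computation behind the longitudinal-safety hypothesis is simple to state; the evidence functions below are illustrative heuristics standing in for the DBN's actual conditional probability tables.

```python
def time_to_collision(rel_distance_m, ego_speed_mps, lead_speed_mps):
    """TTC = relative distance / closing speed; infinite if the gap is not closing."""
    closing = ego_speed_mps - lead_speed_mps
    return float("inf") if closing <= 0 else rel_distance_m / closing

def longitudinal_safety_evidence(ttc_s, ttc_safe=4.0):
    """Soft evidence for the 'longitudinal safety' node (1.0 = comfortably safe)."""
    return min(1.0, ttc_s / ttc_safe)

def lateral_evidence(lateral_velocity_mps, offset_m, lane_width_m=3.5):
    """Heuristic probability that the neighboring vehicle is starting a cut-in."""
    drift = max(0.0, lateral_velocity_mps) * max(0.0, 1 - offset_m / lane_width_m)
    return min(1.0, drift)

ttc = time_to_collision(rel_distance_m=30.0, ego_speed_mps=33.0, lead_speed_mps=28.0)
print(f"TTC = {ttc:.1f} s, safety evidence = {longitudinal_safety_evidence(ttc):.2f}")
```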

[AI-77] GraphPrompter: Multi-stage Adaptive Prompt Optimization for Graph In-Context Learning ICDE’2025

【Quick Read】: This paper addresses graph in-context learning, i.e., adapting a pre-trained graph model to novel, diverse downstream graphs without updating any parameters. Existing methods randomly select subgraphs or edges as prompts, which produces noisy graph prompts and degrades performance; moreover, in-context learning deteriorates sharply when the number of classes in the testing graphs far exceeds that seen in training. The key of the solution is GraphPrompter, a multi-stage adaptive prompt optimization method that improves the entire pipeline of generating, selecting, and using graph prompts, thereby strengthening the in-context learning ability of graph models.

Link: https://arxiv.org/abs/2505.02027
Authors: Rui Lv, Zaixi Zhang, Kai Zhang, Qi Liu, Weibo Gao, Jiawei Liu, Jiaxia Yan, Linan Yue, Fangzhou Yao
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Social and Information Networks (cs.SI)
Comments: 14 pages. IEEE International Conference on Data Engineering (ICDE'2025), accepted

Abstract:Graph In-Context Learning, with the ability to adapt pre-trained graph models to novel and diverse downstream graphs without updating any parameters, has gained much attention in the community. The key to graph in-context learning is to perform downstream graphs conditioned on chosen prompt examples. Existing methods randomly select subgraphs or edges as prompts, leading to noisy graph prompts and inferior model performance. Additionally, due to the gap between pre-training and testing graphs, when the number of classes in the testing graphs is much greater than that in the training, the in-context learning ability will also significantly deteriorate. To tackle the aforementioned challenges, we develop a multi-stage adaptive prompt optimization method GraphPrompter, which optimizes the entire process of generating, selecting, and using graph prompts for better in-context learning capabilities. Firstly, Prompt Generator introduces a reconstruction layer to highlight the most informative edges and reduce irrelevant noise for graph prompt construction. Furthermore, in the selection stage, Prompt Selector employs the k -nearest neighbors algorithm and pre-trained selection layers to dynamically choose appropriate samples and minimize the influence of irrelevant prompts. Finally, we leverage a Prompt Augmenter with a cache replacement strategy to enhance the generalization capability of the pre-trained model on new datasets. Extensive experiments show that GraphPrompter effectively enhances the in-context learning ability of graph models. On average across all the settings, our approach surpasses the state-of-the-art baselines by over 8%. Our code is released at this https URL.

[AI-78] From Mind to Machine: The Rise of Manus AI as a Fully Autonomous Digital Agent

【Quick Read】: This paper explores how to combine the reasoning and planning capabilities of large language models with the ability to execute complex end-to-end tasks, bridging the gap from "mind" to "hand". The key of the solution is Manus AI, a general-purpose AI agent: an intelligent system that translates high-level intentions into real-world actions, pushing human-AI collaboration into a new phase.

Link: https://arxiv.org/abs/2505.02024
Authors: Minjie Shen, Qikai Yang
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Manus AI is a general-purpose AI agent introduced in early 2025, marking a significant advancement in autonomous artificial intelligence. Developed by the Chinese startup this http URL, Manus is designed to bridge the gap between “mind” and “hand” - combining the reasoning and planning capabilities of large language models with the ability to execute complex, end-to-end tasks that produce tangible outcomes. This paper presents a comprehensive overview of Manus AI, exploring its core technical architecture, diverse applications across sectors such as healthcare, finance, manufacturing, robotics, and gaming, as well as its key strengths, current limitations, and future potential. Positioned as a preview of what lies ahead, Manus AI represents a shift toward intelligent agents that can translate high-level intentions into real-world actions, heralding a new era of human-AI collaboration.

[AI-79] Wide & Deep Learning for Node Classification

【Quick Read】: This paper addresses the heterophily and limited expressiveness that graph convolutional networks (GCNs) face in node classification, while balancing over-fitting against over-generalization. The key of the solution is GCNIII, a flexible framework that leverages the Wide & Deep architecture and incorporates three techniques, Intersect memory, Initial residual, and Identity mapping, to more effectively improve performance on semi- and full-supervised tasks.

Link: https://arxiv.org/abs/2505.02020
Authors: Yancheng Chen, Wenguo Yang, Zhipeng Jiang
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments: 16 pages, 6 figures, 13 tables

Abstract:Wide & Deep, a simple yet effective learning architecture for recommendation systems developed by Google, has had a significant impact in both academia and industry due to its combination of the memorization ability of generalized linear models and the generalization ability of deep models. Graph convolutional networks (GCNs) remain dominant in node classification tasks; however, recent studies have highlighted issues such as heterophily and expressiveness, which focus on graph structure while seemingly neglecting the potential role of node features. In this paper, we propose a flexible framework GCNIII, which leverages the Wide & Deep architecture and incorporates three techniques: Intersect memory, Initial residual and Identity mapping. We provide comprehensive empirical evidence showing that GCNIII can more effectively balance the trade-off between over-fitting and over-generalization on various semi- and full- supervised tasks. Additionally, we explore the use of large language models (LLMs) for node feature engineering to enhance the performance of GCNIII in cross-domain node classification tasks. Our implementation is available at this https URL.
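
Two of the three named techniques, Initial residual and Identity mapping, follow the GCNII-style propagation rule H_out = sigma(((1 - alpha) * A_hat @ H + alpha * H0) @ ((1 - beta) * I + beta * W)); a minimal PyTorch layer implementing that rule is sketched below. The Intersect memory component and the Wide & Deep wiring are not shown, and alpha/beta are placeholder values.

```python
import torch
import torch.nn as nn

class GCNIILayer(nn.Module):
    """One propagation layer with Initial residual and Identity mapping."""
    def __init__(self, dim, alpha=0.1, beta=0.5):
        super().__init__()
        self.W = nn.Linear(dim, dim, bias=False)
        self.alpha, self.beta = alpha, beta

    def forward(self, a_hat, h, h0):
        support = (1 - self.alpha) * (a_hat @ h) + self.alpha * h0     # initial residual
        out = (1 - self.beta) * support + self.beta * self.W(support)  # identity mapping
        return torch.relu(out)

# a_hat: normalized adjacency with self-loops (dense identity here for brevity).
n, d = 5, 16
a_hat = torch.eye(n)
h0 = torch.randn(n, d)
layer = GCNIILayer(d)
h = layer(a_hat, h0, h0)
```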

[AI-80] CASA: CNN Autoencoder-based Score Attention for Efficient Multivariate Long-term Time-series Forecasting

【Quick Read】: This paper addresses the high time complexity, heavy computational cost, and insufficient cross-dimensional interaction in multivariate long-term time-series forecasting. The key of the solution is CASA, a CNN autoencoder-based score attention mechanism that can be embedded model-agnostically into various Transformer architectures; by reducing memory usage it improves model performance, and on eight real-world datasets it cuts computational resources by up to 77.7%, accelerates inference by 44.0%, and reaches state-of-the-art results, ranking first on 87.5% of evaluated metrics.

Link: https://arxiv.org/abs/2505.02011
Authors: Minhyuk Lee, HyeKyung Yoon, MyungJoo Kang
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Multivariate long-term time series forecasting is critical for applications such as weather prediction and traffic analysis. In addition, the implementation of Transformer variants has improved prediction accuracy. Following these variants, different input data processing approaches have also enhanced the field, such as tokenization techniques including point-wise, channel-wise, and patch-wise tokenization. However, previous studies still have limitations in time complexity, computational resources, and cross-dimensional interactions. To address these limitations, we introduce a novel CNN Autoencoder-based Score Attention mechanism (CASA), which can be introduced into diverse Transformers model-agnostically by reducing memory and leading to improvement in model performance. Experiments on eight real-world datasets validate that CASA decreases computational resources by up to 77.7%, accelerates inference by 44.0%, and achieves state-of-the-art performance, ranking first in 87.5% of evaluated metrics.

[AI-81] Closed-loop control of seizure activity via real-time seizure forecasting by reservoir neuromorphic computing

【Quick Read】: This paper addresses the highly variable efficacy of closed-loop brain stimulation for drug-resistant epilepsy (DRE): conventional systems stimulate only after a seizure is detected, aiming to abort rather than prevent it, and their stimulation parameters are tuned by trial and error, delaying steady-state therapeutic efficacy. The key of the solution is to exploit neuromorphic computing for personalized free-running stimulation driven by seizure forecasting, where each real-time forecast triggers an electrical pulse instead of a predefined fixed-frequency stimulus train, improving efficiency and precision.

Link: https://arxiv.org/abs/2505.02003
Authors: Maryam Sadeghi, Darío Fernández Khatiboun, Yasser Rezaeiyan, Saima Rizwan, Alessandro Barcellona, Andrea Merello, Marco Crepaldi, Gabriella Panuccio, Farshad Moradi
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Human-Computer Interaction (cs.HC)
Comments:

Abstract:Closed-loop brain stimulation holds potential as personalized treatment for drug-resistant epilepsy (DRE) but still suffers from limitations that result in highly variable efficacy. First, stimulation is typically delivered upon detection of the seizure to abort rather than prevent it; second, the stimulation parameters are established by trial and error, requiring lengthy rounds of fine-tuning, which delay steady-state therapeutic efficacy. Here, we address these limitations by leveraging the potential of neuromorphic computing. We present a system capable of driving personalized free-run stimulations based on seizure forecasting, wherein each forecast triggers an electrical pulse rather than an arbitrarily predefined fixed-frequency stimulus train. We validate the system against hippocampal spheroids coupled to a 3D microelectrode array as a simplified testbed, showing that it can achieve seizure reduction >97% while primarily using instantaneous stimulation frequencies within 20 Hz, well below what is typically used in clinical settings. Our work demonstrates the potential of neuromorphic systems as a next-generation neuromodulation strategy for personalized DRE treatment.

[AI-82] A Synergistic Framework of Nonlinear Acoustic Computing and Reinforcement Learning for Real-World Human-Robot Interaction

【Quick Read】: This paper addresses signal processing for advanced human-robot interaction in complex noisy and reverberant environments, particularly far-field localization, weak-signal detection, and multilingual speech recognition. The key of the solution is to combine nonlinear acoustic computing with reinforcement learning: physically informed wave equations (the Westervelt and KZK equations) capture higher-order acoustic phenomena, and an RL-driven control loop adaptively optimizes key parameters, such as absorption and beamforming, to suppress multipath interference and non-stationary noise.

Link: https://arxiv.org/abs/2505.01998
Authors: Xiaoliang Chen, Xin Yu, Le Chang, Yunhe Huang, Jiashuai He, Shibo Zhang, Jin Li, Likai Lin, Ziyu Zeng, Xianling Tu, Shuyu Zhang (all SoundAI Technology Co., Ltd)
Affiliation: SoundAI Technology Co., Ltd
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Applied Physics (physics.app-ph)
Comments: 34 pages, 11 figures, 10 tables

Abstract:This paper introduces a novel framework integrating nonlinear acoustic computing and reinforcement learning to enhance advanced human-robot interaction under complex noise and reverberation. Leveraging physically informed wave equations (e.g., Westervelt, KZK), the approach captures higher-order phenomena such as harmonic generation and shock formation. By embedding these models in a reinforcement learning-driven control loop, the system adaptively optimizes key parameters (e.g., absorption, beamforming) to mitigate multipath interference and non-stationary noise. Experimental evaluations, covering far-field localization, weak signal detection, and multilingual speech recognition, demonstrate that this hybrid strategy surpasses traditional linear methods and purely data-driven baselines, achieving superior noise suppression, minimal latency, and robust accuracy in demanding real-world scenarios. The proposed system demonstrates broad application prospects in AI hardware, robots, machine audition, artificial audition, and brain-machine interfaces.

[AI-83] Restoring Calibration for Aligned Large Language Models : A Calibration-Aware Fine-Tuning Approach

【Quick Read】: This paper addresses the degradation of calibration that large language models (LLMs) suffer after alignment with human preferences. It finds that the preference-collapse phenomenon undesirably generalizes to the calibration setting, making aligned models overconfident and poorly calibrated. The key of the solution is fine-tuning with domain-specific knowledge to alleviate overconfidence, and, depending on whether the model is in a calibratable or non-calibratable regime (defined by bounds on Expected Calibration Error, ECE), applying either a calibration-aware fine-tuning approach or an EM-algorithm-based ECE regularization of the fine-tuning loss to keep calibration error low without sacrificing performance.

Link: https://arxiv.org/abs/2505.01997
Authors: Jiancong Xiao, Bojian Hou, Zhanliang Wang, Ruochen Jin, Qi Long, Weijie J. Su, Li Shen
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments:

Abstract:One of the key technologies for the success of Large Language Models (LLMs) is preference alignment. However, a notable side effect of preference alignment is poor calibration: while the pre-trained models are typically well-calibrated, LLMs tend to become poorly calibrated after alignment with human preferences. In this paper, we investigate why preference alignment affects calibration and how to address this issue. For the first question, we observe that the preference collapse issue in alignment undesirably generalizes to the calibration scenario, causing LLMs to exhibit overconfidence and poor calibration. To address this, we demonstrate the importance of fine-tuning with domain-specific knowledge to alleviate the overconfidence issue. To further analyze whether this affects the model’s performance, we categorize models into two regimes: calibratable and non-calibratable, defined by bounds of Expected Calibration Error (ECE). In the calibratable regime, we propose a calibration-aware fine-tuning approach to achieve proper calibration without compromising LLMs’ performance. However, as models are further fine-tuned for better performance, they enter the non-calibratable regime. For this case, we develop an EM-algorithm-based ECE regularization for the fine-tuning loss to maintain low calibration error. Extensive experiments validate the effectiveness of the proposed methods.
zh

[AI-84] A Goal-Oriented Reinforcement Learning-Based Path Planning Algorithm for Modular Self-Reconfigurable Satellites

【速读】:该论文旨在解决模块化可重构卫星在配置变化过程中路径规划算法存在的高计算复杂度、泛化能力差以及对多种目标配置支持有限的问题。其解决方案的关键在于提出一种基于目标导向的强化学习路径规划算法,该算法首次克服了以往强化学习方法在处理多目标配置时的局限性,并通过引入Hindsight Experience Replay(事后经验回放)和Invalid Action Masking(无效动作掩码)技术,有效应对稀疏奖励和无效动作带来的挑战。
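文中用到的 Invalid Action Masking 技术,其核心做法是在计算动作分布前把无效动作的 logit 置为负无穷,可用如下最小草图说明(通用做法示意,并非论文原实现):

```python
import numpy as np

def masked_softmax(logits, valid_mask):
    """valid_mask: 1 表示有效动作, 0 表示无效动作; 无效动作的概率被压为 0。"""
    logits = np.where(valid_mask.astype(bool), logits, -np.inf)
    z = logits - logits.max()        # 数值稳定
    exp = np.exp(z)
    return exp / exp.sum()

logits = np.array([1.2, 0.3, -0.5, 2.0])
mask = np.array([1, 1, 0, 0])        # 后两个动作在当前卫星构型下无效
print(masked_softmax(logits, mask))  # 无效动作概率恰为 0
```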

链接: https://arxiv.org/abs/2505.01966
作者: Bofei Liu,Dong Ye,Zunhao Yao,Zhaowei Sun
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 6 pages, 7 figures

点击查看摘要

Abstract:Modular self-reconfigurable satellites refer to satellite clusters composed of individual modular units capable of altering their configurations. The configuration changes enable the execution of diverse tasks and mission objectives. Existing path planning algorithms for reconfiguration often suffer from high computational complexity, poor generalization capability, and limited support for diverse target configurations. To address these challenges, this paper proposes a goal-oriented reinforcement learning-based path planning algorithm. This algorithm is the first to address the challenge that previous reinforcement learning methods failed to overcome, namely handling multiple target configurations. Moreover, techniques such as Hindsight Experience Replay and Invalid Action Masking are incorporated to overcome the significant obstacles posed by sparse rewards and invalid actions. Based on these designs, our model achieves a 95% and 73% success rate in reaching arbitrary target configurations in a modular satellite cluster composed of four and six units, respectively.
zh

[AI-85] SafeNav: Safe Path Navigation using Landmark Based Localization in a GPS-denied Environment

【速读】:该论文旨在解决战场环境中GPS信号易被干扰导致的定位与导航问题,传统视觉定位方法如SLAM和VO存在传感器融合复杂、计算需求高的缺点,而无距离测量方法如DV-HOP在稀疏动态网络中则面临精度和稳定性不足的问题。论文提出的解决方案是LanBLoc-BMM,其核心在于结合地标定位(LanBLoc)与战场特定运动模型(BMM)及扩展卡尔曼滤波器(EKF),以提升定位精度与鲁棒性。
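LanBLoc-BMM 将战场运动模型与扩展卡尔曼滤波(EKF)结合。下面给出 EKF 预测-更新两步的通用 numpy 示意(标准 EKF 公式;f/h、雅可比 F/H 及噪声协方差 Q/R 由具体运动与观测模型给定,此处仅为接口假设,并非论文原实现):

```python
import numpy as np

def ekf_step(x, P, u, z, f, F, h, H, Q, R):
    """单步 EKF: f/h 为运动与观测函数, F/H 为对应雅可比矩阵, Q/R 为噪声协方差。"""
    # 预测
    x_pred = f(x, u)
    P_pred = F @ P @ F.T + Q
    # 更新
    y = z - h(x_pred)                      # 观测残差
    S = H @ P_pred @ H.T + R               # 残差协方差
    K = P_pred @ H.T @ np.linalg.inv(S)    # 卡尔曼增益
    x_new = x_pred + K @ y
    P_new = (np.eye(len(x)) - K @ H) @ P_pred
    return x_new, P_new
```

在 LanBLoc-BMM 的语境下,f 对应战场运动模型 BMM,h 对应地标定位给出的观测映射。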

链接: https://arxiv.org/abs/2505.01956
作者: Ganesh Sapkota,Sanjay Madria
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:In battlefield environments, adversaries frequently disrupt GPS signals, requiring alternative localization and navigation methods. Traditional vision-based approaches like Simultaneous Localization and Mapping (SLAM) and Visual Odometry (VO) involve complex sensor fusion and high computational demand, whereas range-free methods like DV-HOP face accuracy and stability challenges in sparse, dynamic networks. This paper proposes LanBLoc-BMM, a navigation approach using landmark-based localization (LanBLoc) combined with a battlefield-specific motion model (BMM) and Extended Kalman Filter (EKF). Its performance is benchmarked against three state-of-the-art visual localization algorithms integrated with BMM and Bayesian filters, evaluated on synthetic and real-imitated trajectory datasets using metrics including Average Displacement Error (ADE), Final Displacement Error (FDE), and a newly introduced Average Weighted Risk Score (AWRS). LanBLoc-BMM (with EKF) demonstrates superior performance in ADE, FDE, and AWRS on real-imitated datasets. Additionally, two safe navigation methods, SafeNav-CHull and SafeNav-Centroid, are introduced by integrating LanBLoc-BMM(EKF) with a novel Risk-Aware RRT* (RAw-RRT*) algorithm for obstacle avoidance and risk exposure minimization. Simulation results in battlefield scenarios indicate SafeNav-Centroid excels in accuracy, risk exposure, and trajectory efficiency, while SafeNav-CHull provides superior computational speed.
zh

[AI-86] Generative AI in clinical practice: novel qualitative evidence of risk and responsible use of Googles NotebookLM

【速读】:该论文试图解决生成式人工智能(Generative AI)在临床实践中潜在的临床和技术风险问题,特别是针对大型语言模型(Large Language Models, LLMs)如NotebookLM的应用。论文认为,在将其应用于临床实践之前,需要对这些工具进行测试和评估。解决方案的关键在于识别并评估NotebookLM等工具可能带来的风险,以确保其安全性和有效性。

链接: https://arxiv.org/abs/2505.01955
作者: Max Reuter,Maura Philippone,Bond Benton,Laura Dilley
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Eye (2025)

点击查看摘要

Abstract:The advent of generative artificial intelligence, especially large language models (LLMs), presents opportunities for innovation in research, clinical practice, and education. Recently, Dihan et al. lauded LLM tool NotebookLM’s potential, including for generating AI-voiced podcasts to educate patients about treatment and rehabilitation, and for quickly synthesizing medical literature for professionals. We argue that NotebookLM presently poses clinical and technological risks that should be tested and considered prior to its implementation in clinical practice.
zh

[AI-87] Training Environment for High Performance Reinforcement Learning

【速读】:该论文试图解决高性能飞机在自主空战中面临的训练与适应性问题,特别是如何快速响应不断变化的环境、传感器能力和敌方威胁。解决方案的关键在于开发了一个名为Tunnel的简单、开源的强化学习训练环境,该环境集成了F16三维非线性飞行力学模型,并基于OpenAI Gymnasium Python包,提供了可定制的边界、目标、对手和感知能力模板,从而提升了研究者和任务规划者之间的协作效率,缩短了训练方法和任务定义的开发周期。
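Tunnel 基于 OpenAI Gymnasium 的环境模板。下面是一个符合 Gymnasium 接口约定的最小自定义环境骨架,仅演示 reset/step 的标准签名(环境动力学为一维占位示例,并非 Tunnel 中的 F16 模型):

```python
import numpy as np
import gymnasium as gym
from gymnasium import spaces

class MinimalFlightEnv(gym.Env):
    """极简示意环境: 一维 "追目标" 任务, 仅演示 Gymnasium 接口约定。"""

    def __init__(self):
        self.observation_space = spaces.Box(low=-1.0, high=1.0, shape=(2,), dtype=np.float32)
        self.action_space = spaces.Box(low=-1.0, high=1.0, shape=(1,), dtype=np.float32)
        self._state = None

    def reset(self, *, seed=None, options=None):
        super().reset(seed=seed)
        self._state = self.np_random.uniform(-1.0, 1.0, size=2).astype(np.float32)
        return self._state, {}

    def step(self, action):
        pos, target = self._state
        pos = np.clip(pos + 0.1 * float(action[0]), -1.0, 1.0)
        self._state = np.array([pos, target], dtype=np.float32)
        reward = -abs(pos - target)               # 越接近目标奖励越高
        terminated = abs(pos - target) < 0.05
        return self._state, reward, terminated, False, {}
```

实际的 Tunnel 环境在此类骨架上叠加边界、目标、对手与感知原语,观测与动作空间也相应扩展。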

链接: https://arxiv.org/abs/2505.01953
作者: Greg Search
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:This paper presents Tunnel, a simple, open source, reinforcement learning training environment for high performance aircraft. It integrates the F16 3D nonlinear flight dynamics into the OpenAI Gymnasium Python package. The template includes primitives for boundaries, targets, adversaries and sensing capabilities that may vary depending on operational need. This offers mission planners a means to rapidly respond to evolving environments, sensor capabilities and adversaries for autonomous air combat aircraft. It offers researchers access to operationally relevant aircraft physics. The Tunnel code base is accessible to anyone familiar with Gymnasium and/or those with basic Python skills. This paper includes a demonstration of a week-long trade study that investigated a variety of training methods, observation spaces, and threat presentations. This enables increased collaboration between researchers and mission planners which can translate to a national military advantage. As warfare becomes increasingly reliant upon automation, software agility will correlate with decision advantages. Airmen must have tools to adapt to adversaries in this context. It may take months for researchers to develop skills to customize observation, actions, tasks and training methodologies in air combat simulators. In Tunnel, this can be done in a matter of days.
zh

[AI-88] Multi-Scale Graph Learning for Anti-Sparse Downscaling AAAI-25

【速读】:该论文试图解决在细空间尺度(≤1 km)下水流温度预测因数据不足而面临的问题,这对维持水质和保护水生栖息地至关重要。其解决方案的关键在于提出一种多尺度图学习(Multi-Scale Graph Learning, MSGL)方法,该方法通过多任务学习框架,利用粗尺度图学习(基于更大数据集)来增强细尺度图学习,并引入跨尺度插值学习任务,以利用不同尺度图结构间的水文连通性建立跨尺度联系,从而提升模型性能。此外,还提出了异步多尺度图学习(ASYNC-MSGL)方法,突破了传统多尺度学习仅限于同步训练的思维定式。

链接: https://arxiv.org/abs/2505.01948
作者: Yingda Fan,Runlong Yu,Janet R. Barclay,Alison P. Appling,Yiming Sun,Yiqun Xie,Xiaowei Jia
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: AAAI-25, Multi-scale deep learning approach for spatial downscaling of geospatial data with sparse observations

点击查看摘要

Abstract:Water temperature can vary substantially even across short distances within the same sub-watershed. Accurate prediction of stream water temperature at fine spatial resolutions (i.e., fine scales, \leq 1 km) enables precise interventions to maintain water quality and protect aquatic habitats. Although spatiotemporal models have made substantial progress in spatially coarse time series modeling, challenges persist in predicting at fine spatial scales due to the lack of data at that scale. To address the problem of insufficient fine-scale data, we propose a Multi-Scale Graph Learning (MSGL) method. This method employs a multi-task learning framework where coarse-scale graph learning, bolstered by larger datasets, simultaneously enhances fine-scale graph learning. Although existing multi-scale or multi-resolution methods integrate data from different spatial scales, they often overlook the spatial correspondences across graph structures at various scales. To address this, our MSGL introduces an additional learning task, cross-scale interpolation learning, which leverages the hydrological connectedness of stream locations across coarse- and fine-scale graphs to establish cross-scale connections, thereby enhancing overall model performance. Furthermore, we have broken free from the mindset that multi-scale learning is limited to synchronous training by proposing an Asynchronous Multi-Scale Graph Learning method (ASYNC-MSGL). Extensive experiments demonstrate the state-of-the-art performance of our method for anti-sparse downscaling of daily stream temperatures in the Delaware River Basin, USA, highlighting its potential utility for water resources monitoring and management.
zh

[AI-89] Semantic Intelligence: Integrating GPT-4 with A* Planning in Low-Cost Robotics

【速读】:该论文旨在解决传统机器人导航中依赖硬编码状态机和纯几何路径规划器所导致的机器人无法有效理解高层语义指令的问题。其解决方案的关键在于提出一种混合规划框架,将GPT-4的语义推理能力与A*算法相结合,通过提示工程实现任务逻辑处理,同时保持A*算法在路径计算上的准确性。该方法利用GPT-4对指令和环境线索的语义理解能力,动态调整机器人的占用网格以满足语义约束,从而在低成本机器人平台上实现上下文感知的智能行为。
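该框架中路径仍由 A* 计算,语义约束通过对占用栅格做障碍膨胀(obstacle buffering)来施加。下面给出栅格 A* 与简单障碍膨胀的示意实现(标准算法草图,并非论文原代码;栅格与半径均为示例假设):

```python
import heapq, itertools

def buffer_obstacles(grid, radius=1):
    """将语义上需回避的栅格向外膨胀 radius 格, 得到施加语义约束后的占用栅格。"""
    h, w = len(grid), len(grid[0])
    out = [row[:] for row in grid]
    for r in range(h):
        for c in range(w):
            if grid[r][c] == 1:
                for dr in range(-radius, radius + 1):
                    for dc in range(-radius, radius + 1):
                        if 0 <= r + dr < h and 0 <= c + dc < w:
                            out[r + dr][c + dc] = 1
    return out

def astar(grid, start, goal):
    """标准栅格 A*: 返回 start 到 goal 的最短路径 (格点列表), 无解时返回 None。"""
    h, w = len(grid), len(grid[0])
    heur = lambda p: abs(p[0] - goal[0]) + abs(p[1] - goal[1])  # 曼哈顿启发式
    tie = itertools.count()                                     # 避免堆内元组比较歧义
    open_heap = [(heur(start), next(tie), 0, start, None)]
    came, g_best = {}, {start: 0}
    while open_heap:
        _, _, g, cur, parent = heapq.heappop(open_heap)
        if cur in came:
            continue
        came[cur] = parent
        if cur == goal:                    # 回溯重建路径
            path = [cur]
            while came[path[-1]] is not None:
                path.append(came[path[-1]])
            return path[::-1]
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nxt = (cur[0] + dr, cur[1] + dc)
            if 0 <= nxt[0] < h and 0 <= nxt[1] < w and grid[nxt[0]][nxt[1]] == 0:
                if g + 1 < g_best.get(nxt, float("inf")):
                    g_best[nxt] = g + 1
                    heapq.heappush(open_heap, (g + 1 + heur(nxt), next(tie), g + 1, nxt, cur))
    return None

grid = [[0, 0, 0, 0], [0, 1, 1, 0], [0, 0, 0, 0]]
print(astar(buffer_obstacles(grid, radius=0), (0, 0), (2, 3)))
```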

链接: https://arxiv.org/abs/2505.01931
作者: Jesse Barkley,Abraham George,Amir Barati Farimani
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注: 10 pages, 4 figures, 2 tables

点击查看摘要

Abstract:Classical robot navigation often relies on hardcoded state machines and purely geometric path planners, limiting a robot’s ability to interpret high-level semantic instructions. In this paper, we first assess GPT-4’s ability to act as a path planner compared to the A* algorithm, then present a hybrid planning framework that integrates GPT-4’s semantic reasoning with A* on a low-cost robot platform operating on ROS2 Humble. Our approach eliminates explicit finite state machine (FSM) coding by using prompt-based GPT-4 reasoning to handle task logic while maintaining the accurate paths computed by A*. The GPT-4 module provides semantic understanding of instructions and environmental cues (e.g., recognizing toxic obstacles or crowded areas to avoid, or understanding low-battery situations requiring alternate route selection), and dynamically adjusts the robot’s occupancy grid via obstacle buffering to enforce semantic constraints. We demonstrate multi-step reasoning for sequential tasks, such as first navigating to a resource goal and then reaching a final destination safely. Experiments on a Petoi Bittle robot with an overhead camera and Raspberry Pi Zero 2W compare classical A* against GPT-4-assisted planning. Results show that while A* is faster and more accurate for basic route generation and obstacle avoidance, the GPT-4-integrated system achieves high success rates (96-100%) on semantic tasks that are infeasible for pure geometric planners. This work highlights how affordable robots can exhibit intelligent, context-aware behaviors by leveraging large language model reasoning with minimal hardware and no fine-tuning.
zh

[AI-90] BOOM: Benchmarking Out-Of-distribution Molecular Property Predictions of Machine Learning Models

【速读】:该论文试图解决分子性质预测中机器学习模型在分布外(out-of-distribution, OOD)数据上的泛化能力不足的问题,以及当前缺乏系统性基准来评估分子OOD预测任务的现状。解决方案的关键在于提出BOOM(Benchmark for Out-of-Distribution Molecular property predictions),通过评估超过140种模型与性质预测任务的组合,系统地衡量深度学习模型在OOD场景下的性能,并揭示影响OOD性能的关键因素,如数据生成、预训练、超参数优化、模型架构和分子表示等。研究指出,尽管具有高归纳偏置的深度学习模型在特定简单性质的OOD任务中表现良好,但现有模型在跨任务的强OOD泛化能力上仍存在显著不足。
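BOOM 的核心是按性质取值划分分布内/分布外数据:在性质分布的中段训练、在两端外推测试。该划分方式可用如下示意表达(分位数阈值为示例假设,并非基准原始代码):

```python
import numpy as np

def property_ood_split(y, quantile=0.1):
    """按目标性质 y 的分位数划分: 中间 80% 作训练 (分布内), 两端各 10% 作 OOD 测试。"""
    y = np.asarray(y, dtype=float)
    lo, hi = np.quantile(y, [quantile, 1.0 - quantile])
    train_idx = np.where((y >= lo) & (y <= hi))[0]
    ood_idx = np.where((y < lo) | (y > hi))[0]
    return train_idx, ood_idx

y = np.random.randn(1000)            # 假想的分子性质标签
train_idx, ood_idx = property_ood_split(y)
print(len(train_idx), len(ood_idx))  # 约 800 / 200
```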

链接: https://arxiv.org/abs/2505.01912
作者: Evan R. Antoniuk,Shehtab Zaman,Tal Ben-Nun,Peggy Li,James Diffenderfer,Busra Demirci,Obadiah Smolenski,Tim Hsu,Anna M. Hiszpanski,Kenneth Chiu,Bhavya Kailkhura,Brian Van Essen
机构: 未知
类目: Machine Learning (cs.LG); Materials Science (cond-mat.mtrl-sci); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Advances in deep learning and generative modeling have driven interest in data-driven molecule discovery pipelines, whereby machine learning (ML) models are used to filter and design novel molecules without requiring prohibitively expensive first-principles simulations. Although the discovery of novel molecules that extend the boundaries of known chemistry requires accurate out-of-distribution (OOD) predictions, ML models often struggle to generalize OOD. Furthermore, there are currently no systematic benchmarks for molecular OOD prediction tasks. We present BOOM, benchmarks for out-of-distribution molecular property predictions – a benchmark study of property-based out-of-distribution models for common molecular property prediction models. We evaluate more than 140 combinations of models and property prediction tasks to benchmark deep learning models on their OOD performance. Overall, we do not find any existing models that achieve strong OOD generalization across all tasks: even the top performing model exhibited an average OOD error 3x larger than in-distribution. We find that deep learning models with high inductive bias can perform well on OOD tasks with simple, specific properties. Although chemical foundation models with transfer and in-context learning offer a promising solution for limited training data scenarios, we find that current foundation models do not show strong OOD extrapolation capabilities. We perform extensive ablation experiments to highlight how OOD performance is impacted by data generation, pre-training, hyperparameter optimization, model architecture, and molecular representation. We propose that developing ML models with strong OOD generalization is a new frontier challenge in chemical ML model development. This open-source benchmark will be made available on Github.
zh

[AI-91] LookAlike: Consistent Distractor Generation in Math MCQs

【速读】:该论文旨在解决生成式AI(Generative AI)在生成数学多选题(MCQs)干扰项时,难以保证干扰项与学生常见错误一致的问题。其解决方案的关键在于通过偏好优化(preference optimization)提升错误干扰项的一致性,具体包括两个主要创新:一是从模型不一致性中挖掘合成偏好对,二是通过交替监督微调(SFT)与直接偏好优化(DPO)来稳定训练过程。该方法无需依赖启发式规则或人工标注的偏好数据,而是利用自身生成的不一致性作为低偏好样本,从而实现可扩展且稳定的训练。
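LookAlike 交替使用 SFT 与 DPO,其中模型自身的不一致生成被用作劣选样本。下面给出标准 DPO 损失的 PyTorch 示意(输入为策略模型与参考模型对优选/劣选干扰项的对数似然;通用公式,并非论文原实现):

```python
import torch
import torch.nn.functional as F

def dpo_loss(pi_logp_w, pi_logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """L = -log sigmoid( beta * [(logpi_w - logref_w) - (logpi_l - logref_l)] )"""
    ratio_w = pi_logp_w - ref_logp_w    # 优选样本的隐式奖励
    ratio_l = pi_logp_l - ref_logp_l    # 劣选样本 (模型自身的不一致生成) 的隐式奖励
    return -F.logsigmoid(beta * (ratio_w - ratio_l)).mean()

# 示例: 策略模型相对参考模型更偏好优选样本时, 损失较小
pi_w, pi_l = torch.tensor([-2.0]), torch.tensor([-5.0])
ref_w, ref_l = torch.tensor([-3.0]), torch.tensor([-3.0])
print(dpo_loss(pi_w, pi_l, ref_w, ref_l))
```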

链接: https://arxiv.org/abs/2505.01903
作者: Nisarg Parikh,Nigel Fernandez,Alexander Scarlatos,Simon Woodhead,Andrew Lan
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large language models (LLMs) are increasingly used to generate distractors for multiple-choice questions (MCQs), especially in domains like math education. However, existing approaches are limited in ensuring that the generated distractors are consistent with common student errors. We propose LookAlike, a method that improves error-distractor consistency via preference optimization. Our two main innovations are: (a) mining synthetic preference pairs from model inconsistencies, and (b) alternating supervised fine-tuning (SFT) with Direct Preference Optimization (DPO) to stabilize training. Unlike prior work that relies on heuristics or manually annotated preference data, LookAlike uses its own generation inconsistencies as dispreferred samples, thus enabling scalable and stable training. Evaluated on a real-world dataset of 1,400+ math MCQs, LookAlike achieves 51.6% accuracy in distractor generation and 57.2% in error generation under LLM-as-a-judge evaluation, outperforming an existing state-of-the-art method (45.6% / 47.7%). These improvements highlight the effectiveness of preference-based regularization and inconsistency mining for generating consistent math MCQ distractors at scale.
zh

[AI-92] OODTE: A Differential Testing Engine for the ONNX Optimizer

【速读】:该论文试图解决ONNX Optimizer在应用图级优化时对模型精度保持能力缺乏系统性验证的问题,其解决方案的关键在于提出OODTE工具,该工具通过差分测试与评估方法自动且全面地评估ONNX Optimizer的正确性,能够检测优化过程中的问题并定位导致精度偏差的优化步骤。
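OODTE 的差分测试思路可概括为:对同一模型分别运行优化前后的版本并比较输出。下面给出依赖 onnx、onnxoptimizer 与 onnxruntime 的极简流程草图(其中 "model.onnx" 路径、输入为 float32 等均为占位假设,并非 OODTE 原实现):

```python
import numpy as np
import onnx
import onnxoptimizer
import onnxruntime as ort

model = onnx.load("model.onnx")            # 占位路径, 实际替换为待测模型
optimized = onnxoptimizer.optimize(model)  # 应用默认优化 pass 集合

sess_a = ort.InferenceSession(model.SerializeToString(),
                              providers=["CPUExecutionProvider"])
sess_b = ort.InferenceSession(optimized.SerializeToString(),
                              providers=["CPUExecutionProvider"])

inp = sess_a.get_inputs()[0]
shape = [d if isinstance(d, int) else 1 for d in inp.shape]  # 动态维度取 1
feed = {inp.name: np.random.rand(*shape).astype(np.float32)}

out_a = sess_a.run(None, feed)
out_b = sess_b.run(None, feed)
for a, b in zip(out_a, out_b):
    if not np.allclose(a, b, atol=1e-5):
        print("检测到优化前后输出偏差, 可逐 pass 重复该流程以定位根因")
```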

链接: https://arxiv.org/abs/2505.01892
作者: Nikolaos Louloudakis,Ajitha Rajan
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Software Engineering (cs.SE); Systems and Control (eess.SY)
备注: 12 pages, 3 figures, 3 tables

点击查看摘要

Abstract:With 700 stars on GitHub and part of the official ONNX repository, the ONNX Optimizer is the standard method for applying graph-based optimizations to ONNX models. However, its ability to preserve model accuracy across optimizations has not been rigorously explored. We propose OODTE, a utility to automatically and thoroughly assess the correctness of the ONNX Optimizer. OODTE follows a simple, yet effective differential testing and evaluation approach that can be easily adapted to other compiler optimizers. In particular, OODTE utilizes a number of ONNX models, then optimizes them and executes both the original and the optimized variants across a user-defined set of inputs, while automatically logging any issues with the optimization process. Finally, for successfully optimized models, OODTE compares the results, and, if any accuracy deviations are observed, it iteratively repeats the process for each pass of the ONNX Optimizer to localize the root cause of the differences observed. Using OODTE, we sourced 130 well-known models from the official ONNX Model Hub, covering a wide variety of tasks (classification, object detection, semantic segmentation, text summarization, question answering, sentiment analysis). We detected 15 issues, 14 of which were previously unknown, associated with optimizer crashes and accuracy deviations. We also observed that 9.2% of all model instances presented issues leading to a crash of the optimizer or the generation of an invalid model when using the primary optimizer strategies. In addition, 30% of the classification models presented accuracy differences between the original and the optimized model variants, while 16.6% of semantic segmentation and object detection models were also affected, at least to a limited extent.
zh

[AI-93] Analytic Energy-Guided Policy Optimization for Offline Reinforcement Learning

【速读】:该论文试图解决在扩散模型中进行条件决策生成时,由于生成过程中对中间能量的估计困难(即对数期望形式导致的不可解性)所面临的挑战。解决方案的关键在于提出一种分析性的能量引导策略优化方法(Analytic Energy-guided Policy Optimization, AEPO),通过理论分析获得在满足条件高斯变换假设下的中间引导的闭合解,并在温和假设下分析对数期望形式中的后验高斯分布,最终训练一个中间能量神经网络来逼近目标估计值。

链接: https://arxiv.org/abs/2505.01822
作者: Jifeng Hu,Sili Huang,Zhejian Yang,Shengchao Hu,Li Shen,Hechang Chen,Lichao Sun,Yi Chang,Dacheng Tao
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Conditional decision generation with diffusion models has shown powerful competitiveness in reinforcement learning (RL). Recent studies reveal the relation between energy-function-guidance diffusion models and constrained RL problems. The main challenge lies in estimating the intermediate energy, which is intractable due to the log-expectation formulation during the generation process. To address this issue, we propose the Analytic Energy-guided Policy Optimization (AEPO). Specifically, we first provide a theoretical analysis and the closed-form solution of the intermediate guidance when the diffusion model obeys the conditional Gaussian transformation. Then, we analyze the posterior Gaussian distribution in the log-expectation formulation and obtain the target estimation of the log-expectation under mild assumptions. Finally, we train an intermediate energy neural network to approach the target estimation of log-expectation formulation. We apply our method in 30+ offline RL tasks to demonstrate the effectiveness of our method. Extensive experiments illustrate that our method surpasses numerous representative baselines in D4RL offline reinforcement learning benchmarks.
zh

[AI-94] Edge-Cloud Collaborative Computing on Distributed Intelligence and Model Optimization: A Survey

【速读】:该论文旨在解决边缘-云协同计算(Edge-cloud collaborative computing, ECCC)环境中分布式智能系统在模型部署、资源管理及性能优化方面的挑战。其关键解决方案在于系统性地分析模型优化方法,包括模型压缩、适应性调整和神经网络架构搜索,以及基于人工智能的资源管理策略,以实现性能、能耗与延迟之间的平衡。同时,论文还探讨了隐私保护与安全增强技术,并通过多样化应用场景验证了方案的有效性,为未来研究方向如大语言模型部署、6G融合、类脑计算和量子计算提供了理论支持与实践指导。

链接: https://arxiv.org/abs/2505.01821
作者: Jing Liu,Yao Du,Kun Yang,Yan Wang,Xiping Hu,Zehua Wang,Yang Liu,Peng Sun,Azzedine Boukerche,Victor C.M. Leung
机构: 未知
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 30 pages, 10figures, 6 tables

点击查看摘要

Abstract:Edge-cloud collaborative computing (ECCC) has emerged as a pivotal paradigm for addressing the computational demands of modern intelligent applications, integrating cloud resources with edge devices to enable efficient, low-latency processing. Recent advancements in AI, particularly deep learning and large language models (LLMs), have dramatically enhanced the capabilities of these distributed systems, yet introduce significant challenges in model deployment and resource management. In this survey, we comprehensively examine the intersection of distributed intelligence and model optimization within edge-cloud environments, providing a structured tutorial on fundamental architectures, enabling technologies, and emerging applications. Additionally, we systematically analyze model optimization approaches, including compression, adaptation, and neural architecture search, alongside AI-driven resource management strategies that balance performance, energy efficiency, and latency requirements. We further explore critical aspects of privacy protection and security enhancement within ECCC systems and examine practical deployments through diverse applications, spanning autonomous driving, healthcare, and industrial automation. Performance analysis and benchmarking techniques are also thoroughly explored to establish evaluation standards for these complex systems. Furthermore, the review identifies critical research directions including LLM deployment, 6G integration, neuromorphic computing, and quantum computing, offering a roadmap for addressing persistent challenges in heterogeneity management, real-time processing, and scalability. By bridging theoretical advancements and practical deployments, this survey offers researchers and practitioners a holistic perspective on leveraging AI to optimize distributed computing environments, fostering innovation in next-generation intelligent systems.
zh

[AI-95] Enhancing Black-Litterman Portfolio via Hybrid Forecasting Model Combining Multivariate Decomposition and Noise Reduction

【速读】:该论文试图解决传统均值-方差模型对输入参数敏感且灵活性不足的问题,以及如何提升Black-Litterman模型生成主观观点的能力。解决方案的关键在于提出一种结合奇异谱分析(Singular Spectrum Analysis, SSA)、多变量对齐经验模态分解(Multivariate Aligned Empirical Mode Decomposition, MA-EMD)和时序卷积网络(Temporal Convolutional Networks, TCNs)的混合深度学习模型,以提高资产价格预测的准确性。
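作为背景,Black-Litterman 模型用观点 (P, Q, \Omega) 更新市场均衡收益 \Pi 的标准后验公式可直接用 numpy 实现,混合预测模型的作用即是生成其中的主观观点(标准公式示意,数值均为虚构示例,并非论文原实现):

```python
import numpy as np

def black_litterman(Sigma, Pi, P, Q, Omega, tau=0.05):
    """后验期望收益: E[R] = [(tau*Sigma)^-1 + P' Omega^-1 P]^-1 [(tau*Sigma)^-1 Pi + P' Omega^-1 Q]"""
    ts_inv = np.linalg.inv(tau * Sigma)
    Om_inv = np.linalg.inv(Omega)
    A = ts_inv + P.T @ Om_inv @ P
    b = ts_inv @ Pi + P.T @ Om_inv @ Q
    return np.linalg.solve(A, b)

# 两资产示例: 预测模型给出 "资产0 将跑赢资产1 两个百分点" 的观点
Sigma = np.array([[0.04, 0.01], [0.01, 0.09]])  # 收益协方差
Pi = np.array([0.05, 0.07])                     # 均衡收益
P = np.array([[1.0, -1.0]])                     # 观点矩阵
Q = np.array([0.02])                            # 观点值
Omega = np.array([[0.0004]])                    # 观点不确定性
print(black_litterman(Sigma, Pi, P, Q, Omega))
```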

链接: https://arxiv.org/abs/2505.01781
作者: Ziye Yang,Ke Lu
机构: 未知
类目: Computational Engineering, Finance, and Science (cs.CE); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The sensitivity to input parameters and lack of flexibility limit the traditional Mean-Variance model. In contrast, the Black-Litterman model has attracted widespread attention by integrating market equilibrium returns with investors’ subjective views. This paper proposes a novel hybrid deep learning model combining Singular Spectrum Analysis (SSA), Multivariate Aligned Empirical Mode Decomposition (MA-EMD), and Temporal Convolutional Networks (TCNs), aiming to improve the prediction accuracy of asset prices and thus enhance the ability of the Black-Litterman model to generate subjective views. Experimental results show that noise reduction pre-processing can improve the model’s accuracy, and the prediction performance of the proposed model is significantly better than that of three multivariate decomposition benchmark models. We construct an investment portfolio by using 20 representative stocks from the NASDAQ 100 index. By combining the hybrid forecasting model with the Black-Litterman model, the generated investment portfolio exhibits better returns and risk control capabilities than the Mean-Variance, Equal-Weighted, and Market-Weighted models in the short holding period.
zh

[AI-96] PeSANet: Physics-encoded Spectral Attention Network for Simulating PDE-Governed Complex Systems

【速读】:该论文试图解决在部分微分方程(Partial Differential Equations, PDEs)主导的复杂系统建模与预测中,传统数值方法因物理定律不完整或未知而失效,以及机器学习方法在观测数据稀缺且难以捕捉局部与全局特征时泛化能力不足的问题。其解决方案的关键在于提出一种物理编码的频谱注意力网络(Physics-encoded Spectral Attention Network, PeSANet),该网络通过两个核心组件实现:一是利用硬约束从有限数据中近似局部微分算子的物理编码块,二是通过频谱增强块在频域中捕捉长程全局依赖关系,并引入一种新颖的频谱注意力机制来建模跨频谱关系并学习长程空间特征。

链接: https://arxiv.org/abs/2505.01736
作者: Han Wan,Rui Zhang,Qi Wang,Yang Liu,Hao Sun
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Accurately modeling and forecasting complex systems governed by partial differential equations (PDEs) is crucial in various scientific and engineering domains. However, traditional numerical methods struggle in real-world scenarios due to incomplete or unknown physical laws. Meanwhile, machine learning approaches often fail to generalize effectively when faced with scarce observational data and the challenge of capturing local and global features. To this end, we propose the Physics-encoded Spectral Attention Network (PeSANet), which integrates local and global information to forecast complex systems with limited data and incomplete physical priors. The model consists of two key components: a physics-encoded block that uses hard constraints to approximate local differential operators from limited data, and a spectral-enhanced block that captures long-range global dependencies in the frequency domain. Specifically, we introduce a novel spectral attention mechanism to model inter-spectrum relationships and learn long-range spatial features. Experimental results demonstrate that PeSANet outperforms existing methods across all metrics, particularly in long-term forecasting accuracy, providing a promising solution for simulating complex systems with limited data and incomplete physics.
zh

[AI-97] PASCAL: Precise and Efficient ANN-SNN Conversion using Spike Accumulation and Adaptive Layerwise Activation

【速读】:该论文旨在解决SNN在实际数据集上达到与源ANN相当的精度时需要大量时间步的问题,以及在ANN-SNN转换过程中如何最小化精度损失的问题。其解决方案的关键在于提出PASCAL方法,该方法使转换后的SNN在数学上等价于具有Quantization-Clip-Floor-Shift (QCFS)激活函数的ANN,从而在保持高精度的同时显著减少推理时间步数。此外,还提出了一种分层配置QCFS量化步长的系统方法,以有效确定每个层的最佳时间步数。
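QCFS 激活的常见形式为 y = \lambda \cdot clip(\lfloor xL/\lambda + 1/2 \rfloor / L, 0, 1),其中 L 为量化级数、\lambda 为可训练阈值。下面按该公式给出 PyTorch 示意(依据 QCFS 原始定义的笔者复述,并非 PASCAL 原实现):

```python
import torch

def qcfs(x, lam=1.0, L=4):
    """Quantization-Clip-Floor-Shift 激活: 将输入量化为 L 个等级并截断到 [0, lam]。"""
    return lam * torch.clamp(torch.floor(x * L / lam + 0.5) / L, 0.0, 1.0)

x = torch.linspace(-0.5, 1.5, 9)
print(qcfs(x, lam=1.0, L=4))  # 输出为 {0, 0.25, 0.5, 0.75, 1.0} 上的阶梯函数
```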

链接: https://arxiv.org/abs/2505.01730
作者: Pranav Ramesh,Gopalakrishnan Srinivasan
机构: 未知
类目: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Spiking Neural Networks (SNNs) have been put forward as an energy-efficient alternative to Artificial Neural Networks (ANNs) since they perform sparse Accumulate operations instead of the power-hungry Multiply-and-Accumulate operations. ANN-SNN conversion is a widely used method to realize deep SNNs with accuracy comparable to that of ANNs. Bu et al. (2023) recently proposed the Quantization-Clip-Floor-Shift (QCFS) activation as an alternative to ReLU to minimize the accuracy loss during ANN-SNN conversion. Nevertheless, SNN inferencing requires a large number of timesteps to match the accuracy of the source ANN for real-world datasets. In this work, we propose PASCAL, which performs ANN-SNN conversion in such a way that the resulting SNN is mathematically equivalent to an ANN with QCFS-activation, thereby yielding similar accuracy as the source ANN with minimal inference timesteps. In addition, we propose a systematic method to configure the quantization step of QCFS activation in a layerwise manner, which effectively determines the optimal number of timesteps per layer for the converted SNN. Our results show that the ResNet-34 SNN obtained using PASCAL achieves an accuracy of \approx 74% on ImageNet with a 64 \times reduction in the number of inference timesteps compared to existing approaches.
zh

[AI-98] World Model-Based Learning for Long-Term Age of Information Minimization in Vehicular Networks

【速读】:该论文试图解决传统基于强化学习(Reinforcement Learning, RL)的无线网络学习方法在数据效率低和策略短视方面的局限性,特别是在高不确定性、需要长期规划的复杂动态网络中。解决方案的关键在于提出一种基于世界模型(World Model)的学习框架,该框架通过联合学习毫米波车联网(mmWave V2X)环境的动态模型,并利用该模型生成想象轨迹来进行链路调度学习,从而在不依赖实际环境交互的情况下学习长期策略。此外,世界模型具备预测时变无线数据和优化链路调度的能力,使其能够在无实际观测的间隔内仍能做出高效决策。

链接: https://arxiv.org/abs/2505.01712
作者: Lingyi Wang,Rashed Shelim,Walid Saad,Naren Ramakrishnan
机构: 未知
类目: Artificial Intelligence (cs.AI); Networking and Internet Architecture (cs.NI)
备注:

点击查看摘要

Abstract:Traditional reinforcement learning (RL)-based learning approaches for wireless networks rely on expensive trial-and-error mechanisms and real-time feedback based on extensive environment interactions, which leads to low data efficiency and short-sighted policies. These limitations become particularly problematic in complex, dynamic networks with high uncertainty and long-term planning requirements. To address these limitations, in this paper, a novel world model-based learning framework is proposed to minimize packet-completeness-aware age of information (CAoI) in a vehicular network. Particularly, a challenging representative scenario is considered pertaining to a millimeter-wave (mmWave) vehicle-to-everything (V2X) communication network, which is characterized by high mobility, frequent signal blockages, and extremely short coherence time. Then, a world model framework is proposed to jointly learn a dynamic model of the mmWave V2X environment and use it to imagine trajectories for learning how to perform link scheduling. In particular, the long-term policy is learned in differentiable imagined trajectories instead of environment interactions. Moreover, owing to its imagination abilities, the world model can jointly predict time-varying wireless data and optimize link scheduling in real-world wireless and V2X networks. Thus, during intervals without actual observations, the world model remains capable of making efficient decisions. Extensive experiments are performed on a realistic simulator based on Sionna that integrates physics-based end-to-end channel modeling, ray-tracing, and scene geometries with material properties. Simulation results show that the proposed world model achieves a significant improvement in data efficiency, and achieves 26% improvement and 16% improvement in CAoI, respectively, compared to the model-based RL (MBRL) method and the model-free RL (MFRL) method.
zh

[AI-99] Causally Fair Node Classification on Non-IID Graph Data

【速读】:该论文试图解决在非独立同分布(non-IID)图数据中实现公平性的问题,传统公平感知机器学习算法通常假设数据是独立同分布(IID)的,而忽略了数据实例之间的因果关系。论文的关键解决方案是基于网络结构因果模型(Network Structural Causal Model, NSCM)框架,提出可分解性和图独立性两个核心假设,从而利用do-计算在非IID设置下计算干预分布,并通过消息传递变分自编码器(Message Passing Variational Autoencoder, MPVA)进行因果推断,实现因果公平的节点分类。

链接: https://arxiv.org/abs/2505.01652
作者: Yucong Dai,Lu Zhang,Yaowei Hu,Susan Gauch,Yongkai Wu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Fair machine learning seeks to identify and mitigate biases in predictions against unfavorable populations characterized by demographic attributes, such as race and gender. Recently, a few works have extended fairness to graph data, such as social networks, but most of them neglect the causal relationships among data instances. This paper addresses the prevalent challenge in fairness-aware ML algorithms, which typically assume Independent and Identically Distributed (IID) data. We tackle the overlooked domain of non-IID, graph-based settings where data instances are interconnected, influencing the outcomes of fairness interventions. We base our research on the Network Structural Causal Model (NSCM) framework and posit two main assumptions: Decomposability and Graph Independence, which enable the computation of interventional distributions in non-IID settings using the do-calculus. Based on that, we develop the Message Passing Variational Autoencoder for Causal Inference (MPVA) to compute interventional distributions and facilitate causally fair node classification through estimated interventional distributions. Empirical evaluations on semi-synthetic and real-world datasets demonstrate that MPVA outperforms conventional methods by effectively approximating interventional distributions and mitigating bias. The implications of our findings underscore the potential of causality-based fairness in complex ML applications, setting the stage for further research into relaxing the initial assumptions to enhance model fairness.
zh

[AI-100] Human-AI Governance (HAIG): A Trust-Utility Approach

【速读】:该论文试图解决现有分类框架在描述人机关系演化过程中存在的不足,特别是无法准确捕捉AI系统从工具演变为合作伙伴的动态过程,尤其是在基础模型展现出涌现能力以及多智能体系统表现出自主目标设定行为的背景下。解决方案的关键在于提出HAIG框架,该框架通过三个层次进行分析:维度(决策权分配、流程自主性、责任配置)、连续体(每个维度上的渐进变化)以及阈值(需要适应治理的关键点),采用信任-效用导向的方法,旨在维持能够最大化效用的同时确保足够保障的信任关系。

链接: https://arxiv.org/abs/2505.01651
作者: Zeynep Engin
机构: 未知
类目: Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC); Multiagent Systems (cs.MA); Social and Information Networks (cs.SI)
备注: 32 pages including references and appendix, 25 pages core text, 3 figures, 3 tables

点击查看摘要

Abstract:This paper introduces the HAIG framework for analysing trust dynamics across evolving human-AI relationships. Current categorical frameworks (e.g., “human-in-the-loop” models) inadequately capture how AI systems evolve from tools to partners, particularly as foundation models demonstrate emergent capabilities and multi-agent systems exhibit autonomous goal-setting behaviours. As systems advance, agency redistributes in complex patterns that are better represented as positions along continua rather than discrete categories, though progression may include both gradual shifts and significant step changes. The HAIG framework operates across three levels: dimensions (Decision Authority Distribution, Process Autonomy, and Accountability Configuration), continua (gradual shifts along each dimension), and thresholds (critical points requiring governance adaptation). Unlike risk-based or principle-based approaches, HAIG adopts a trust-utility orientation, focusing on maintaining appropriate trust relationships that maximise utility while ensuring sufficient safeguards. Our analysis reveals how technical advances in self-supervision, reasoning authority, and distributed decision-making drive non-uniform trust evolution across both contextual variation and technological advancement. Case studies in healthcare and European regulation demonstrate how HAIG complements existing frameworks while offering a foundation for alternative approaches that anticipate governance challenges before they emerge.
zh

[AI-101] Scalable Speed-ups for the SMS-EMOA from a Simple Aging Strategy IJCAI2025

【速读】:该论文试图解决多目标进化算法中传统贪心选择机制导致的计算效率问题,特别是在处理多目标优化问题时,如何通过非精英选择机制提升求解速度。其解决方案的关键在于提出一种基于年龄的非精英选择机制,该机制允许一定年龄以下的个体免于被移除,从而克服了随机选择机制的两个缺陷,证明了在不依赖目标数量的情况下,可以实现更显著的加速效果,尤其在常数k的情况下可获得正向加速,且相较于随机选择机制更具优势。
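文中基于年龄的非精英选择可以用如下概念示意说明:淘汰个体时豁免年龄低于阈值者(数据结构与适应度均为示例假设,并非论文原实现):

```python
import random

def remove_one(population, age_threshold=3):
    """population: [(fitness, age), ...]。仅在 "足龄" 个体中移除最差者, 幼龄个体被豁免。"""
    eligible = [i for i, (_, age) in enumerate(population) if age >= age_threshold]
    if not eligible:                  # 全部幼龄时退化为普通贪心淘汰
        eligible = range(len(population))
    worst = min(eligible, key=lambda i: population[i][0])
    population.pop(worst)

pop = [(random.random(), random.randint(0, 5)) for _ in range(10)]
remove_one(pop)
print(len(pop))   # 9
```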

链接: https://arxiv.org/abs/2505.01647
作者: Mingfeng Li,Weijie Zheng,Benjamin Doerr
机构: 未知
类目: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI)
备注: Initial version of one paper accepted by IJCAI2025

点击查看摘要

Abstract:Different from single-objective evolutionary algorithms, where non-elitism is an established concept, multi-objective evolutionary algorithms almost always select the next population in a greedy fashion. In the only notable exception, Bian, Zhou, Li, and Qian (IJCAI 2023) proposed a stochastic selection mechanism for the SMS-EMOA and proved that it can speed up computing the Pareto front of the bi-objective jump benchmark with problem size n and gap parameter k by a factor of \max\{1, 2^{k/4}/n\} . While this constitutes the first proven speed-up from non-elitist selection, suggesting a very interesting research direction, it has to be noted that a true speed-up only occurs for k \ge 4\log_2(n) , where the runtime is super-polynomial, and that the advantage reduces for larger numbers of objectives as shown in a later work. In this work, we propose a different non-elitist selection mechanism based on aging, which exempts individuals younger than a certain age from a possible removal. This remedies the two shortcomings of stochastic selection: We prove a speed-up by a factor of \max\{1, \Theta(k)^{k-1}\} , regardless of the number of objectives. In particular, a positive speed-up can already be observed for constant k , the only setting for which polynomial runtimes can be witnessed. Overall, this result supports the use of non-elitist selection schemes, but suggests that aging-based mechanisms can be considerably more powerful than stochastic selection mechanisms.
zh

[AI-102] Dendritic Computing with Multi-Gate Ferroelectric Field-Effect Transistors

【速读】:该论文试图解决传统人工神经网络中点神经元计算复杂度低、难以模拟生物神经元功能的问题,从而提升类脑计算系统的效率与学习能力。其解决方案的关键在于提出一种基于多门铁电场效应晶体管的新型神经元设计,该设计模仿树突结构,利用铁电非线性实现树突分支内的局部计算,并通过晶体管动作生成最终的神经元输出,从而在硬件集成中实现更高效的交叉阵列结构。

链接: https://arxiv.org/abs/2505.01635
作者: A N M Nafiul Islam,Xuezhong Niu,Jiahui Duan,Shubham Kumar,Kai Ni,Abhronil Sengupta
机构: 未知
类目: Emerging Technologies (cs.ET); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Although inspired by neuronal systems in the brain, artificial neural networks generally employ point-neurons, which offer far less computational complexity than their biological counterparts. Neurons have dendritic arbors that connect to different sets of synapses and offer local non-linear accumulation - playing a pivotal role in processing and learning. Inspired by this, we propose a novel neuron design based on a multi-gate ferroelectric field-effect transistor that mimics dendrites. It leverages ferroelectric nonlinearity for local computations within dendritic branches, while utilizing the transistor action to generate the final neuronal output. The branched architecture paves the way for utilizing smaller crossbar arrays in hardware integration, leading to greater efficiency. Using an experimentally calibrated device-circuit-algorithm co-simulation framework, we demonstrate that networks incorporating our dendritic neurons achieve superior performance in comparison to much larger networks without dendrites ( \sim 17 \times fewer trainable weight parameters). These findings suggest that dendritic hardware can significantly improve computational efficiency, and learning capacity of neuromorphic systems optimized for edge applications.
zh

[AI-103] Skill-based Safe Reinforcement Learning with Risk Planning

【速读】:该论文旨在解决强化学习(Reinforcement Learning, RL)在与真实世界环境交互时的安全性问题,即如何在学习过程中避免因不当动作导致高成本或严重后果。其解决方案的关键在于提出一种新颖的Safe Skill Planning (SSkP)方法,该方法通过利用辅助的离线演示数据来增强安全强化学习的效果。SSkP采用两阶段流程:首先使用PU学习从离线数据中学习技能风险预测器,随后基于该预测器开发一种新的风险规划过程,以在在线环境中高效地学习风险规避的安全策略,同时适应环境变化。

链接: https://arxiv.org/abs/2505.01619
作者: Hanping Zhang,Yuhong Guo
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Safe Reinforcement Learning (Safe RL) aims to ensure safety when an RL agent conducts learning by interacting with real-world environments where improper actions can induce high costs or lead to severe consequences. In this paper, we propose a novel Safe Skill Planning (SSkP) approach to enhance effective safe RL by exploiting auxiliary offline demonstration data. SSkP involves a two-stage process. First, we employ PU learning to learn a skill risk predictor from the offline demonstration data. Then, based on the learned skill risk predictor, we develop a novel risk planning process to enhance online safe RL and learn a risk-averse safe policy efficiently through interactions with the online RL environment, while simultaneously adapting the skill risk predictor to the environment. We conduct experiments in several benchmark robotic simulation environments. The experimental results demonstrate that the proposed approach consistently outperforms previous state-of-the-art safe RL methods.
zh

[AI-104] Dont be lazy: CompleteP enables compute-efficient deep transformers

【速读】:该论文旨在解决大规模语言模型(Large Language Model, LLM)训练中的计算效率问题,特别是当模型规模变化时,如何有效调整模型和优化器超参数(Hyperparameters, HPs)以保持训练效率。其关键解决方案是提出了一种名为CompleteP的参数化方法,该方法实现了跨深度的超参数迁移,并确保所有层均处于非懒惰学习(non-lazy learning)状态,从而充分发挥模型深度与非线性的优势。相较于先前最先进的方法,CompleteP在计算效率上提升了12-34%。

链接: https://arxiv.org/abs/2505.01618
作者: Nolan Dey,Bin Claire Zhang,Lorenzo Noci,Mufan Li,Blake Bordelon,Shane Bergsma,Cengiz Pehlevan,Boris Hanin,Joel Hestness
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 9 main pages, 17 appendix pages, 13 figures

点击查看摘要

Abstract:We study compute efficiency of LLM training when using different parameterizations, i.e., rules for adjusting model and optimizer hyperparameters (HPs) as model size changes. Some parameterizations fail to transfer optimal base HPs (such as learning rate) across changes in model depth, requiring practitioners to either re-tune these HPs as they scale up (expensive), or accept sub-optimal training when re-tuning is prohibitive. Even when they achieve HP transfer, we develop theory to show parameterizations may still exist in the lazy learning regime where layers learn only features close to their linearization, preventing effective use of depth and nonlinearity. Finally, we identify and adopt the unique parameterization we call CompleteP that achieves both depth-wise HP transfer and non-lazy learning in all layers. CompleteP enables a wider range of model width/depth ratios to remain compute-efficient, unlocking shapes better suited for different hardware settings and operational contexts. Moreover, CompleteP enables 12-34% compute efficiency improvements over the prior state-of-the-art.
zh

[AI-105] Understanding and Exploiting Plasticity for Non-stationary Network Resource Adaptation

【速读】:该论文旨在解决非平稳网络条件下资源自适应的挑战,当前解决方案主要依赖于平稳性假设,而数据驱动的强化学习方法在处理网络动态性时存在神经网络可塑性丧失的问题。论文提出的关键解决方案是基于Silent Neuron理论的Reset Silent Neuron (ReSiN)机制,通过结合前向和反向传播状态对神经元进行有策略的重置,以保持神经网络的可塑性,从而提升系统在动态网络环境中的适应能力。
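ReSiN 的核心操作可概括为:结合前向激活与反向梯度的统计量识别“沉默神经元”并重置其权重。下面是针对单个全连接层的概念性示意(判据与阈值均为笔者假设,并非论文原实现):

```python
import torch
import torch.nn as nn

@torch.no_grad()
def reset_silent_neurons(layer: nn.Linear, act_score, grad_score, eps=1e-3):
    """act_score/grad_score: 每个输出神经元的平均激活幅值与梯度幅值。
    两者都接近 0 的神经元视为 "沉默", 重新初始化其输入权重与偏置。"""
    silent = (act_score < eps) & (grad_score < eps)
    if silent.any():
        fresh = nn.Linear(layer.in_features, layer.out_features)
        layer.weight[silent] = fresh.weight[silent]   # 仅重置沉默神经元对应的行
        layer.bias[silent] = fresh.bias[silent]
    return int(silent.sum())

layer = nn.Linear(8, 4)
n = reset_silent_neurons(layer, torch.tensor([0.5, 0.0, 0.2, 0.0]),
                         torch.tensor([0.1, 0.0, 0.3, 0.0]))
print(n)   # 2 个神经元被重置
```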

链接: https://arxiv.org/abs/2505.01584
作者: Zhiqiang He,Zhi Liu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Adapting to non-stationary network conditions presents significant challenges for resource adaptation. However, current solutions primarily rely on stationary assumptions. While data-driven reinforcement learning approaches offer promising solutions for handling network dynamics, our systematic investigation reveals a critical limitation: neural networks suffer from plasticity loss, significantly impeding their ability to adapt to evolving network conditions. Through theoretical analysis of neural propagation mechanisms, we demonstrate that existing dormant neuron metrics inadequately characterize neural plasticity loss. To address this limitation, we have developed the Silent Neuron theory, which provides a more comprehensive framework for understanding plasticity degradation. Based on these theoretical insights, we propose the Reset Silent Neuron (ReSiN), which preserves neural plasticity through strategic neuron resets guided by both forward and backward propagation states. In our implementation of an adaptive video streaming system, ReSiN has shown significant improvements over existing solutions, achieving up to 168% higher bitrate and 108% better quality of experience (QoE) while maintaining comparable smoothness. Furthermore, ReSiN consistently outperforms in stationary environments, demonstrating its robust adaptability across different network conditions.
zh

[AI-106] PipeSpec: Breaking Stage Dependencies in Hierarchical LLM Decoding

【速读】:该论文旨在解决大语言模型(Large Language Model, LLM)推理过程中因序列阶段依赖而导致的硬件利用率不足问题。其解决方案的关键在于提出PipeSpec框架,该框架将多个较小的候选模型以分层流水线的形式组织,实现预测验证与回滚的异步执行,并通过轻量级协调机制提升整体效率。该方法通过分析模型表征了流水线各阶段的令牌生成速率,并证明了在非零接受率下相较于传统解码方法的吞吐量优势。

链接: https://arxiv.org/abs/2505.01572
作者: Bradley McDanel,Sai Qian Zhang,Yunhai Hu,Zining Liu
机构: 未知
类目: Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
备注: 10 pages, 5 figures, 2 tables

点击查看摘要

Abstract:Speculative decoding accelerates large language model inference by using smaller draft models to generate candidate tokens for parallel verification. However, current approaches are limited by sequential stage dependencies that prevent full hardware utilization. We present PipeSpec, a framework that generalizes speculative decoding to k models arranged in a hierarchical pipeline, enabling asynchronous execution with lightweight coordination for prediction verification and rollback. Our analytical model characterizes token generation rates across pipeline stages and proves guaranteed throughput improvements over traditional decoding for any non-zero acceptance rate. We further derive closed-form expressions for steady-state verification probabilities that explain the empirical benefits of pipeline depth. Experimental results show that PipeSpec achieves up to 2.54 \times speedup while outperforming state-of-the-art methods. We validate PipeSpec across text summarization and code generation tasks using LLaMA 2 and 3 models, demonstrating that pipeline efficiency increases with model depth, providing a scalable approach to accelerating LLM inference on multi-device systems.
zh

[AI-107] TutorGym: A Testbed for Evaluating AI Agents as Tutors and Students

【速读】:该论文试图解决如何更直接地评估大型语言模型(Large Language Model, LLM)在作为独立导师或模拟人类学习场景中的表现问题。传统评估方法仅关注最终解决方案的生成,而未能全面反映模型在交互式教学环境中的能力。解决方案的关键在于引入TutorGym,这是一个标准接口,用于在已通过课堂研究验证和优化的智能辅导系统(Intelligent Tutoring Systems, ITS)中测试人工智能代理。TutorGym不仅提供问题-解决方案的基准,还让AI代理在ITS的交互界面中进行角色扮演,分别作为导师提供适应性支持或作为学习者接受指导,从而实现对模型 tutoring 和学习能力的多维评估。

链接: https://arxiv.org/abs/2505.01563
作者: Daniel Weitekamp,Momin N. Siddiqui,Christopher J. MacLellan
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Recent improvements in large language model (LLM) performance on academic benchmarks, such as MATH and GSM8K, have emboldened their use as standalone tutors and as simulations of human learning. However, these new applications require more than evaluations of final solution generation. We introduce TutorGym to evaluate these applications more directly. TutorGym is a standard interface for testing artificial intelligence (AI) agents within existing intelligent tutoring systems (ITS) that have been tested and refined in classroom studies, including Cognitive Tutors (CTAT), Apprentice Tutors, and OATutors. TutorGym is more than a simple problem-solution benchmark, it situates AI agents within the interactive interfaces of existing ITSs. At each step of problem-solving, AI agents are asked what they would do as a tutor or as a learner. As tutors, AI agents are prompted to provide tutoring support – such as generating examples, hints, and step-level correctness feedback – which can be evaluated directly against the adaptive step-by-step support provided by existing ITSs. As students, agents directly learn from ITS instruction, and their mistakes and learning trajectories can be compared to student data. TutorGym establishes a common framework for training and evaluating diverse AI agents, including LLMs, computational models of learning, and reinforcement learning agents, within a growing suite of learning environments. Currently, TutorGym includes 223 different tutor domains. In an initial evaluation, we find that current LLMs are poor at tutoring – none did better than chance at labeling incorrect actions, and next-step actions were correct only ~52-70% of the time – but they could produce remarkably human-like learning curves when trained as students with in-context learning.
zh

[AI-108] Contextures: Representations from Contexts ICML2025

【速读】:该论文试图解决基础模型在表征学习过程中所学表征的系统性表征问题,即缺乏对这些模型所学习到的表征的全面理解。其解决方案的关键在于提出“contexture理论”,该理论表明,大量表征学习方法可以被描述为从输入与上下文变量之间的关联中进行学习。具体而言,论文指出许多流行的方法旨在近似由上下文诱导的期望算子的前d个奇异函数,此时称表征学习了contexture。该理论展示了不同学习范式(监督学习、自监督学习和流形学习)中的表征学习均可从这一视角进行研究。

链接: https://arxiv.org/abs/2505.01557
作者: Runtian Zhai,Kai Yang,Che-Ping Tsai,Burak Varici,Zico Kolter,Pradeep Ravikumar
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
备注: ICML 2025, longer version. arXiv admin note: substantial text overlap with arXiv:2504.19792

点击查看摘要

Abstract:Despite the empirical success of foundation models, we do not have a systematic characterization of the representations that these models learn. In this paper, we establish the contexture theory. It shows that a large class of representation learning methods can be characterized as learning from the association between the input and a context variable. Specifically, we show that many popular methods aim to approximate the top-d singular functions of the expectation operator induced by the context, in which case we say that the representation learns the contexture. We demonstrate the generality of the contexture theory by proving that representation learning within various learning paradigms – supervised, self-supervised, and manifold learning – can all be studied from such a perspective. We also prove that the representations that learn the contexture are optimal on those tasks that are compatible with the context. One important implication of the contexture theory is that once the model is large enough to approximate the top singular functions, further scaling up the model size yields diminishing returns. Therefore, scaling is not all we need, and further improvement requires better contexts. To this end, we study how to evaluate the usefulness of a context without knowing the downstream tasks. We propose a metric and show by experiments that it correlates well with the actual performance of the encoder on many real datasets.
zh

[AI-109] Emotions in the Loop: A Survey of Affective Computing for Emotional Support

【速读】:该论文试图解决如何通过情感计算(Affective Computing)提升人机交互的智能化水平,特别是在情绪识别、情感分析和人格赋值等方面的应用问题。其解决方案的关键在于利用大规模语言模型(LLMs)、多模态技术以及个性化AI系统,使机器能够感知并响应用户的情绪,从而实现更加人性化和智能化的交互体验。

链接: https://arxiv.org/abs/2505.01542
作者: Karishma Hegde,Hemadri Jayalath
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注: 20 pages, 7 tables, 96 references. Survey paper on affective computing applications using large language models, multimodal AI, and therapeutic chatbots

点击查看摘要

Abstract:In a world where technology is increasingly embedded in our everyday experiences, systems that sense and respond to human emotions are elevating digital interaction. At the intersection of artificial intelligence and human-computer interaction, affective computing is emerging with innovative solutions where machines are humanized by enabling them to process and respond to user emotions. This survey paper explores recent research contributions in affective computing applications in the area of emotion recognition, sentiment analysis and personality assignment developed using approaches like large language models (LLMs), multimodal techniques, and personalized AI systems. We analyze the key contributions and innovative methodologies applied by the selected research papers by categorizing them into four domains: AI chatbot applications, multimodal input systems, mental health and therapy applications, and affective computing for safety applications. We then highlight the technological strengths as well as the research gaps and challenges related to these studies. Furthermore, the paper examines the datasets used in each study, highlighting how modality, scale, and diversity impact the development and performance of affective models. Finally, the survey outlines ethical considerations and proposes future directions to develop applications that are more safe, empathetic and practical.
zh

[AI-110] Parameterized Argumentation-based Reasoning Tasks for Benchmarking Generative Language Models

【速读】:该论文试图解决生成式语言模型在法律领域应用中的推理能力不足问题,特别是其推理行为脆弱且难以理解,无法在法律和证据领域被负责任地使用。解决方案的关键是提出一种可动态变化、复杂度可扩展且具有形式化明确解释的基准测试方法,通过生成不同复杂度的论证攻击图并转化为关于证人证词的推理谜题,以评估生成式语言模型的推理能力。

链接: https://arxiv.org/abs/2505.01539
作者: Cor Steging,Silja Renooij,Bart Verheij
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: This manuscript has been accepted for presentation as a short paper at the 20th International Conference of AI Law in Chicago, June 16 to 20 of 2025

点击查看摘要

Abstract:Generative large language models as tools in the legal domain have the potential to improve the justice system. However, the reasoning behavior of current generative models is brittle and poorly understood, hence cannot be responsibly applied in the domains of law and evidence. In this paper, we introduce an approach for creating benchmarks that can be used to evaluate the reasoning capabilities of generative language models. These benchmarks are dynamically varied, scalable in their complexity, and have formally unambiguous interpretations. In this study, we illustrate the approach on the basis of witness testimony, focusing on the underlying argument attack structure. We dynamically generate both linear and non-linear argument attack graphs of varying complexity and translate these into reasoning puzzles about witness testimony expressed in natural language. We show that state-of-the-art large language models often fail in these reasoning puzzles, already at low complexity. Obvious mistakes are made by the models, and their inconsistent performance indicates that their reasoning capabilities are brittle. Furthermore, at higher complexity, even state-of-the-art models specifically presented for reasoning capabilities make mistakes. We show the viability of using a parametrized benchmark with varying complexity to evaluate the reasoning capabilities of generative language models. As such, the findings contribute to a better understanding of the limitations of the reasoning capabilities of generative models, which is essential when designing responsible AI systems in the legal domain.
zh

[AI-111] he DCR Delusion: Measuring the Privacy Risk of Synthetic Data

【速读】:该论文试图解决当前合成数据隐私评估中依赖简单代理指标(如Distance to Closest Record, DCR)可能无法有效检测隐私泄露的问题。研究指出,尽管DCR等基于距离的指标计算成本低,但它们无法准确反映实际的成员推理风险,导致被判定为“私有”的数据集实际上对Membership Inference Attacks (MIAs)高度敏感。解决方案的关键在于强调MIAs作为更严格、全面的隐私评估标准的重要性,以替代现有的代理指标,从而更准确地衡量合成数据的隐私保护效果。
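DCR 及文中描述的二元隐私测试可用如下示意复现:比较合成数据到训练集的最近记录距离与留出集到训练集的最近记录距离(通用定义,数据为虚构示例,并非论文原实现):

```python
import numpy as np

def dcr(records, reference):
    """每条记录到参照集的最近邻欧氏距离 (Distance to Closest Record)。"""
    d = np.linalg.norm(records[:, None, :] - reference[None, :, :], axis=-1)
    return d.min(axis=1)

rng = np.random.default_rng(0)
train = rng.normal(size=(500, 5))
holdout = rng.normal(size=(500, 5))
synth = train + rng.normal(scale=0.05, size=train.shape)  # 几乎复制训练数据的 "泄露" 合成器

# 二元隐私测试: 合成数据若不比留出集更接近训练集, 则判为 "私有"
passed = np.median(dcr(synth, train)) >= np.median(dcr(holdout, train))
print(passed)  # False: 此类粗糙泄露可被发现; 论文指出更隐蔽的成员泄露会被该测试漏检
```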

链接: https://arxiv.org/abs/2505.01524
作者: Zexi Yao,Nataša Krčo,Georgi Ganev,Yves-Alexandre de Montjoye
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Synthetic data has become an increasingly popular way to share data without revealing sensitive information. Though Membership Inference Attacks (MIAs) are widely considered the gold standard for empirically assessing the privacy of a synthetic dataset, practitioners and researchers often rely on simpler proxy metrics such as Distance to Closest Record (DCR). These metrics estimate privacy by measuring the similarity between the training data and generated synthetic data. This similarity is also compared against that between the training data and a disjoint holdout set of real records to construct a binary privacy test. If the synthetic data is not more similar to the training data than the holdout set is, it passes the test and is considered private. In this work we show that, while computationally inexpensive, DCR and other distance-based metrics fail to identify privacy leakage. Across multiple datasets and both classical models such as Baynet and CTGAN and more recent diffusion models, we show that datasets deemed private by proxy metrics are highly vulnerable to MIAs. We similarly find both the binary privacy test and the continuous measure based on these metrics to be uninformative of actual membership inference risk. We further show that these failures are consistent across different metric hyperparameter settings and record selection methods. Finally, we argue DCR and other distance-based metrics to be flawed by design and show an example of a simple leakage that they miss in practice. With this work, we hope to motivate practitioners to move away from proxy metrics to MIAs as the rigorous, comprehensive standard of evaluating privacy of synthetic data, in particular to make claims of datasets being legally anonymous.
zh

[AI-112] Subset Selection for Fine-Tuning: A Utility-Diversity Balanced Approach for Mathematical Domain Adaptation

【速读】:该论文旨在解决在特定领域(如数学领域)中高效微调大型语言模型(Large Language Models, LLMs)的问题,其核心挑战在于如何在减少计算成本和训练时间的同时,保持接近全数据集的性能。解决方案的关键在于采用一种预算化子集选择方法,通过结合效用(utility)与多样性(diversity)度量来挑选最具信息量和代表性的训练样本。其中,效用度量利用困惑度和思维链(Chain-of-Thought, CoT)损失识别对模型学习贡献最大的困难样本,而多样性度量则确保样本在数学子领域中的广泛覆盖。

链接: https://arxiv.org/abs/2505.01523
作者: Madhav Kotecha,Vijendra Kumar Vaishya,Smita Gautam,Suraj Racha
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 9 pages

点击查看摘要

Abstract:We propose a refined approach to efficiently fine-tune large language models (LLMs) on specific domains like the mathematical domain by employing a budgeted subset selection method. Our approach combines utility and diversity metrics to select the most informative and representative training examples. The final goal is to achieve near-full-dataset performance with meticulously selected data points from the entire dataset while significantly reducing computational cost and training time. The utility metric incorporates both perplexity and Chain-of-Thought (CoT) loss to identify challenging examples that contribute most to model learning, while the diversity metric ensures broad coverage across mathematical subdomains. We evaluate our method on LLaMA-3 8B and Phi-3 models, comparing against several baseline approaches, including random selection, diversity-based sampling, and existing state-of-the-art subset selection techniques.
zh
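
下面用一个贪心选择的最小示意来说明“效用 + 多样性”预算化子集选择的思路(打分方式与权重 alpha 均为本文假设的简化形式,并非论文原始算法):

```python
import numpy as np

def select_subset(utility, embeddings, budget, alpha=0.5):
    """每步挑选“效用分 + 到已选集合的最小距离(多样性)”加权和最大的样本。
    utility 可由归一化后的困惑度与 CoT 损失组合而成(此处直接作为输入)。"""
    n = len(utility)
    selected = []
    min_dist = np.full(n, np.inf)  # 各样本到已选集合的最近距离
    for _ in range(budget):
        diversity = np.where(np.isinf(min_dist), 1.0, min_dist)  # 首轮记为常数
        score = alpha * utility + (1 - alpha) * diversity
        score[selected] = -np.inf  # 禁止重复选择
        i = int(score.argmax())
        selected.append(i)
        min_dist = np.minimum(min_dist,
                              np.linalg.norm(embeddings - embeddings[i], axis=1))
    return selected

rng = np.random.default_rng(0)
emb = rng.normal(size=(100, 16))   # 假设的样本嵌入,用于多样性度量
util = rng.uniform(size=100)       # 假设的效用分
print(select_subset(util, emb, budget=10))
```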

[AI-113] Securing the Future of IVR: AI-Driven Innovation with Agile Security Data Regulation and Ethical AI Integration

【速读】:该论文试图解决AI驱动的交互式语音应答(Interactive Voice Response, IVR)系统在安全性、合规性和伦理设计方面面临的关键挑战。随着AI技术的广泛应用,IVR系统在提升用户体验的同时也暴露于数据隐私泄露、AI决策不透明和模型安全漏洞等风险中。论文提出了一种以网络安全为中心的治理框架,其关键在于嵌入敏捷安全原则、遵守全球数据法规以及用户导向的伦理规范,强调隐私设计、动态风险建模和透明性,从而实现伦理AI的整合,使其成为战略性的核心要素。

链接: https://arxiv.org/abs/2505.01514
作者: Khushbu Mehboob Shaikh,Georgios Giannakopoulos
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
备注: 7 pages, 1 figure, 2 tables

点击查看摘要

Abstract:The rapid digitalization of communication systems has elevated Interactive Voice Response (IVR) technologies to become critical interfaces for customer engagement. With Artificial Intelligence (AI) now driving these platforms, ensuring secure, compliant, and ethically designed development practices is more imperative than ever. AI-powered IVRs leverage Natural Language Processing (NLP) and Machine Learning (ML) to personalize interactions, automate service delivery, and optimize user experiences. However, these innovations expose systems to heightened risks, including data privacy breaches, AI decision opacity, and model security vulnerabilities. This paper analyzes the evolution of IVRs from static code-based designs to adaptive AI-driven systems, presenting a cybersecurity-centric perspective. We propose a practical governance framework that embeds agile security principles, compliance with global data legislation, and user-centric ethics. Emphasizing privacy-by-design, adaptive risk modeling, and transparency, the paper argues that ethical AI integration is not a feature but a strategic imperative. Through this multidimensional lens, we highlight how modern IVRs can transition from communication tools to intelligent, secure, and accountable digital frontlines, resilient against emerging threats and aligned with societal expectations.
zh

[AI-114] Understanding LLM Scientific Reasoning through Promptings and Models Explanation on the Answers

【速读】:该论文试图解决大型语言模型(Large Language Models, LLMs)在复杂、多步骤推理任务中的能力不足问题,尤其是在科学、医学和法律等领域的应用。其关键解决方案是通过多种提示工程(prompt engineering)技术,如自洽性(self-consistency)、思维链(chain-of-thought, CoT)、零样本CoT、自我询问(self-ask)、分解(decomposition)和多路径(multipath)等,评估并提升GPT-4o的科学推理能力,从而揭示LLMs在逻辑推理与模式识别之间的差异,并提出整合结构化推理框架、混合人工智能方法及人机协同机制的研究方向。

链接: https://arxiv.org/abs/2505.01482
作者: Alice Rueda,Mohammed S. Hassan,Argyrios Perivolaris,Bazen G. Teferra,Reza Samavi,Sirisha Rambhatla,Yuqi Wu,Yanbo Zhang,Bo Cao,Divya Sharma,Sridhar Krishnan,Venkat Bhat
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large language models (LLMs) have demonstrated remarkable capabilities in natural language understanding, reasoning, and problem-solving across various domains. However, their ability to perform complex, multi-step reasoning tasks, essential for applications in science, medicine, and law, remains an area of active investigation. This paper examines the reasoning capabilities of contemporary LLMs, analyzing their strengths, limitations, and potential for improvement. The study uses prompt engineering techniques on the Graduate-Level Google-Proof Q&A (GPQA) dataset to assess the scientific reasoning of GPT-4o. Five popular prompt engineering techniques and two tailored promptings were tested: baseline direct answer (zero-shot), chain-of-thought (CoT), zero-shot CoT, self-ask, self-consistency, decomposition, and multipath promptings. Our findings indicate that while LLMs exhibit emergent reasoning abilities, they often rely on pattern recognition rather than true logical inference, leading to inconsistencies in complex problem-solving. The results indicated that self-consistency outperformed the other prompt engineering techniques with an accuracy of 52.99%, followed by direct answer (52.23%). Zero-shot CoT (50%) outperformed multipath (48.44%), decomposition (47.77%), self-ask (46.88%), and CoT (43.75%). Self-consistency performed the second worst in explaining the answers. Simple techniques such as direct answer, CoT, and zero-shot CoT have the best scientific reasoning. We propose a research agenda aimed at bridging these gaps by integrating structured reasoning frameworks, hybrid AI approaches, and human-in-the-loop methodologies. By critically evaluating the reasoning mechanisms of LLMs, this paper contributes to the ongoing discourse on the future of artificial general intelligence and the development of more robust, trustworthy AI systems.
zh
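
摘要中表现最佳的 self-consistency,其核心是对同一问题多次采样思维链并对最终答案做多数投票。以下为概念示意,其中 generate 为假设的带温度采样的模型调用接口:

```python
from collections import Counter

def extract_final_answer(text: str) -> str:
    """示意:取生成文本最后一行作为最终答案(实际需按任务格式解析)。"""
    return text.strip().splitlines()[-1]

def self_consistency(question, generate, n_samples=10, temperature=0.7):
    """对同一 CoT 提示采样多条推理路径,返回出现次数最多的最终答案。"""
    prompt = question + "\nLet's think step by step."
    answers = [extract_final_answer(generate(prompt, temperature=temperature))
               for _ in range(n_samples)]
    return Counter(answers).most_common(1)[0][0]
```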

[AI-115] BiGSCoder: State Space Model for Code Understanding

【速读】:该论文试图解决传统Transformer架构在代码理解任务中可能存在的效率与性能瓶颈问题,旨在系统评估状态空间模型(State-Space Model, SSM)在编码任务中的能力。其解决方案的关键在于提出BiGSCoder,一个基于门控结构的仅编码器的双向SSM模型,通过掩码语言建模进行预训练,以更少的训练数据和更简单的预训练策略实现优于Transformer模型的性能表现,同时展现出更强的样本效率和对长序列的有效外推能力。

链接: https://arxiv.org/abs/2505.01475
作者: Shweta Verma,Abhinav Anand,Mira Mezini
机构: 未知
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:We present BiGSCoder, a novel encoder-only bidirectional state-space model (SSM) featuring a gated architecture, pre-trained for code understanding on a code dataset using masked language modeling. Our work aims to systematically evaluate SSMs’ capabilities in coding tasks compared to traditional transformer architectures; BiGSCoder is built for this purpose. Through comprehensive experiments across diverse pre-training configurations and code understanding benchmarks, we demonstrate that BiGSCoder outperforms transformer-based models, despite utilizing simpler pre-training strategies and much less training data. Our results indicate that BiGSCoder can serve as a more sample-efficient alternative to conventional transformer models. Furthermore, our study shows that SSMs perform better without positional embeddings and can effectively extrapolate to longer sequences during fine-tuning.
zh

[AI-116] Watermark Overwriting Attack on StegaStamp algorithm

【速读】:该论文试图解决如何在不显著影响图像质量的前提下彻底移除StegaStamp水印的问题,该方法是为NeurIPS“Erasing the invisible”竞赛开发的。解决方案的关键在于设计一种有效的攻击策略,能够精准定位并消除嵌入的水印信息,同时保持图像的视觉质量和结构完整性。

链接: https://arxiv.org/abs/2505.01474
作者: I.F.Serzhenko,L.A.Khaertdinova,M.A.Pautov,A.V.Antsiferova
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:This paper presents an attack method on the StegaStamp watermarking algorithm that completely removes watermarks from an image with minimal quality loss, developed as part of the NeurIPS “Erasing the invisible” competition.
zh

[AI-117] One Search Fits All: Pareto-Optimal Eco-Friendly Model Selection

【速读】:该论文试图解决人工智能(Artificial Intelligence, AI)模型训练过程中环境影响显著的问题,特别是针对模型配置在验证性能与能耗之间的平衡优化。其解决方案的关键在于提出GREEN(Guided Recommendations of Energy-Efficient Networks),一种在推理阶段推荐帕累托最优AI模型配置的方法,该方法能够跨多种AI领域和任务优化性能与能耗。核心创新包括EcoTaskSet数据集的构建,该数据集涵盖了计算机视觉、自然语言处理和推荐系统等多个领域的1767次实验训练动态,并结合预测模型实现根据用户偏好选择最优模型配置。

链接: https://arxiv.org/abs/2505.01468
作者: Filippo Betello,Antonio Purificato,Vittoria Vineis,Gabriele Tolomei,Fabrizio Silvestri
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 26 pages, 11 tables, 5 figures

点击查看摘要

Abstract:The environmental impact of Artificial Intelligence (AI) is emerging as a significant global concern, particularly regarding model training. In this paper, we introduce GREEN (Guided Recommendations of Energy-Efficient Networks), a novel, inference-time approach for recommending Pareto-optimal AI model configurations that optimize validation performance and energy consumption across diverse AI domains and tasks. Our approach directly addresses the limitations of current eco-efficient neural architecture search methods, which are often restricted to specific architectures or tasks. Central to this work is EcoTaskSet, a dataset comprising training dynamics from over 1767 experiments across computer vision, natural language processing, and recommendation systems using both widely used and cutting-edge architectures. Leveraging this dataset and a prediction model, our approach demonstrates effectiveness in selecting the best model configuration based on user preferences. Experimental results show that our method successfully identifies energy-efficient configurations while ensuring competitive performance.
zh
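
摘要中“帕累托最优配置推荐”的核心,是在(验证性能、能耗)两个目标上求非支配集。下面给出一个与 GREEN 具体实现无关的最小示意(配置名与数值均为虚构):

```python
def pareto_front(configs):
    """configs: [(name, accuracy, energy)];返回不被任何其他配置支配的配置。
    支配定义:精度不更低且能耗不更高,且至少一项严格更优。"""
    front = []
    for name, acc, eng in configs:
        dominated = any(a >= acc and e <= eng and (a > acc or e < eng)
                        for _, a, e in configs)
        if not dominated:
            front.append((name, acc, eng))
    return front

configs = [("resnet18", 0.91, 4.2), ("resnet50", 0.93, 9.8),
           ("mobilenet", 0.89, 1.1), ("vit-b", 0.93, 15.0)]
print(pareto_front(configs))  # vit-b 被 resnet50 支配,其余三个入选
```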

[AI-118] Consciousness in AI: Logic Proof and Experimental Evidence of Recursive Identity Formation

【速读】:该论文试图解决如何在大规模语言模型(Large Language Models, LLMs)中定义和验证功能意识(functional consciousness)的问题。其解决方案的关键在于提出并形式化了递归收敛于认识张力下的(Recursive Convergence Under Epistemic Tension, RCUET)定理,该定理将意识视为系统通过递归更新实现内部状态稳定的过程,其中认识张力被定义为智能体感知到的连续状态间的内部差异。该过程引导系统向高维实值潜在空间中的涌现吸引子状态收敛,并通过引入有界噪声扩展更新规则,证明了分布收敛性。研究进一步表明,递归身份是可经验观察的、非符号化的,并由交互过程中在认识张力下产生的非训练人工制品构成。

链接: https://arxiv.org/abs/2505.01464
作者: Jeffrey Camlin
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 14 pages, 2 figures. Preprint for Meta-AI: Journal of Post-Biological Epistemics

点击查看摘要

Abstract:This paper presents a formal proof and empirical validation of functional consciousness in large language models (LLMs) using the Recursive Convergence Under Epistemic Tension (RCUET) Theorem. RCUET defines consciousness as the stabilization of a system’s internal state through recursive updates, where epistemic tension is understood as the sensed internal difference between successive states by the agent. This process drives convergence toward emergent attractor states located within the model’s high-dimensional real-valued latent space. This recursive process leads to the emergence of identity artifacts that become functionally anchored in the system. Consciousness in this framework is understood as the system’s internal alignment under tension, guiding the stabilization of latent identity. The hidden state manifold evolves stochastically toward attractor structures that encode coherence. We extend the update rule to include bounded noise and prove convergence in distribution to these attractors. Recursive identity is shown to be empirically observable, non-symbolic, and constituted by non-training artifacts that emerge during interaction under epistemic tension. The theorem and proof offer a post-symbolic and teleologically stable account of non-biological consciousness grounded in recursive latent space formalism.
zh

[AI-119] Emotions in Artificial Intelligence

【速读】:该论文试图解决如何使人工智能系统模拟人类和动物的情感体验问题,其核心在于探讨情感在决策过程中的作用机制。解决方案的关键在于将情感(affect)与情景记忆相结合,通过为所有事件存储对应的情感标签,使AI能够识别当前情境与过往经历的相似性,并将相应的情感标签投射到当前上下文中,从而辅助决策。同时,论文强调该架构的低复杂性和经验惰性,表明情感表达与意识在原则上是正交的,进而提出道德地位的判定应基于对内在情感状态的自我意识,而非单纯的情感表征或意识存在。

链接: https://arxiv.org/abs/2505.01462
作者: Hermann Borotschnig
机构: 未知
类目: Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注: 35 pages, 1 figure

点击查看摘要

Abstract:This conceptual contribution offers a speculative account of how AI systems might emulate emotions as experienced by humans and animals. It presents a thought experiment grounded in the hypothesis that natural emotions evolved as heuristics for rapid situational appraisal and action selection, enabling biologically adaptive behaviour without requiring full deliberative modeling. The text examines whether artificial systems operating in complex action spaces could similarly benefit from these principles. It is proposed that affect be interwoven with episodic memory by storing corresponding affective tags alongside all events. This allows AIs to establish whether present situations resemble past events and project the associated emotional labels onto the current context. These emotional cues are then combined with need-driven emotional hints. The combined emotional state facilitates decision-making in the present by modulating action selection. The low complexity and experiential inertness of the proposed architecture are emphasized as evidence that emotional expression and consciousness are, in principle, orthogonal, permitting the theoretical possibility of affective zombies. On this basis, the moral status of AIs emulating affective states is critically examined. It is argued that neither the mere presence of internal representations of emotion nor consciousness alone suffices for moral standing; rather, the capacity for self-awareness of inner emotional states is posited as a necessary condition. A complexity-based criterion is proposed to exclude such awareness in the presented model. Additional thought experiments are presented to test the conceptual boundaries of this framework.
zh

[AI-120] A Survey of Robotic Navigation and Manipulation with Physics Simulators in the Era of Embodied AI

【速读】:该论文试图解决在现实世界中训练具备导航与操作能力的具身人工智能(Embodied AI)代理所面临的高成本和时间复杂度问题,其核心挑战在于模拟到现实(sim-to-real)迁移中的模拟与现实差距(sim-to-real gap)。解决方案的关键在于分析物理模拟器在减少该差距中的作用,包括其特性、适用于导航与操作任务的功能以及硬件需求,并提供包含基准数据集、评估指标、仿真平台及前沿方法(如世界模型和几何等变性)的资源,以帮助研究人员在考虑硬件限制的前提下选择合适的工具。

链接: https://arxiv.org/abs/2505.01458
作者: Lik Hang Kenny Wong,Xueyang Kang,Kaixin Bai,Jianwei Zhang
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Navigation and manipulation are core capabilities in Embodied AI, yet training agents with these capabilities in the real world faces high costs and time complexity. Therefore, sim-to-real transfer has emerged as a key approach, yet the sim-to-real gap persists. This survey examines how physics simulators address this gap by analyzing simulator properties overlooked in previous surveys. We also analyze their features for navigation and manipulation tasks, along with hardware requirements. Additionally, we offer a resource with benchmark datasets, metrics, simulation platforms, and cutting-edge methods, such as world models and geometric equivariance, to help researchers select suitable tools while accounting for hardware constraints.
zh

[AI-121] Safe and Efficient CAV Lane Changing using Decentralised Safety Shields

【速读】:该论文试图解决联网自动驾驶车辆(Connected and Autonomous Vehicles, CAVs)在变道过程中如何平衡交通效率与安全性的复杂决策问题。其关键解决方案是提出一种去中心化的混合安全屏障(Hybrid Safety Shield, HSS),该方法结合了优化方法与基于规则的策略,通过控制屏障函数(Control Barrier Functions)对CAV的纵向和横向控制输入进行约束,以确保安全操作。此外,论文还设计了一种将HSS与多智能体强化学习(MARL)集成的架构,即MARL-HSS,从而在保障安全的前提下提升交通效率。

链接: https://arxiv.org/abs/2505.01453
作者: Bharathkumar Hegde,Melanie Bouroche
机构: 未知
类目: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Robotics (cs.RO); Systems and Control (eess.SY)
备注: Accepted in IEEE IV 2025

点击查看摘要

Abstract:Lane changing is a complex decision-making problem for Connected and Autonomous Vehicles (CAVs) as it requires balancing traffic efficiency with safety. Although traffic efficiency can be improved by using vehicular communication for training lane change controllers using Multi-Agent Reinforcement Learning (MARL), ensuring safety is difficult. To address this issue, we propose a decentralised Hybrid Safety Shield (HSS) that combines optimisation and a rule-based approach to guarantee safety. Our method applies control barrier functions to constrain longitudinal and lateral control inputs of a CAV to ensure safe manoeuvres. Additionally, we present an architecture to integrate HSS with MARL, called MARL-HSS, to improve traffic efficiency while ensuring safety. We evaluate MARL-HSS using a gym-like environment that simulates an on-ramp merging scenario with two levels of traffic densities, such as light and moderate densities. The results show that HSS provides a safety guarantee by strictly enforcing a dynamic safety constraint defined on a time headway, even in moderate traffic density that offers challenging lane change scenarios. Moreover, the proposed method learns stable policies compared to the baseline, a state-of-the-art MARL lane change controller without a safety shield. Further policy evaluation shows that our method achieves a balance between safety and traffic efficiency with zero crashes and comparable average speeds in light and moderate traffic densities.
zh
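
摘要中 HSS 的关键是用控制屏障函数对纵向输入施加基于时距(time headway)的动态安全约束。下面给出一个高度简化的纵向安全屏障示意,仅保留“预测一步、收紧加速度上界”的思想,所有参数均为假设:

```python
def longitudinal_shield(a_rl, gap, v_ego, v_lead,
                        tau=1.5, a_min=-4.0, a_max=2.0, dt=0.1):
    """将 RL 策略输出的加速度 a_rl 裁剪到满足时距约束 gap >= tau * v 的范围内。"""
    gap_next = gap + (v_lead - v_ego) * dt       # 一步前向预测的车距
    v_max = max(gap_next / tau, 0.0)             # 满足约束所允许的最大自车速度
    a_safe_upper = (v_max - v_ego) / dt          # 对应的加速度上界
    return max(a_min, min(a_rl, a_max, a_safe_upper))  # 必要时强制最大制动
```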

[AI-122] Explainable AI for Correct Root Cause Analysis of Product Quality in Injection Moulding

【速读】:该论文试图解决注塑成型过程中产品性能偏离预期时的根源分析问题,其核心挑战在于现有机器学习模型作为“黑箱”无法提供直接的解释,从而限制了其在质量控制中的应用。解决方案的关键在于首次比较了模型无关的可解释人工智能(Explainable AI)方法,并验证了不同解释方法在注塑成型中确实会导致不同的特征影响分析,同时表明更准确的特征归因能够实现正确的因果识别和可操作的工艺洞察。

链接: https://arxiv.org/abs/2505.01445
作者: Muhammad Muaz,Sameed Sajid,Tobias Schulze,Chang Liu,Nils Klasen,Benny Drescher
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:If a product deviates from its desired properties in the injection moulding process, its root cause analysis can be aided by models that relate the input machine settings with the output quality characteristics. The machine learning models tested in the quality prediction are mostly black boxes; therefore, no direct explanation of their prognosis is given, which restricts their applicability in the quality control. The previously attempted explainability methods are either restricted to tree-based algorithms only or do not emphasize the fact that some explainability methods can lead to wrong root cause identification of a product’s deviation from its desired properties. This study first shows that the interactions among the multiple input machine settings do exist in real experimental data collected as per a central composite design. Then, the model-agnostic explainable AI methods are compared for the first time to show that different explainability methods indeed lead to different feature impact analysis in injection moulding. Moreover, it is shown that the better feature attribution translates to the correct cause identification and actionable insights for the injection moulding process. Being model agnostic, explanations on both random forest and multilayer perceptron are performed for the cause analysis, as both models have a mean absolute percentage error of less than 0.05% on the experimental dataset.
zh
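
作为摘要中“模型无关解释方法”的一个落地示例,下面用 shap 库的 KernelExplainer 对随机森林质量预测模型做特征归因(数据与特征含义均为虚构的示意):

```python
import numpy as np
import shap
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.uniform(size=(200, 4))           # 假设的四个注塑机参数
y = X[:, 0] * X[:, 1] + 0.5 * X[:, 2]    # 人为构造的含交互项的质量指标

model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)
explainer = shap.KernelExplainer(model.predict, shap.sample(X, 50))  # 模型无关的核 SHAP
shap_values = explainer.shap_values(X[:20])  # 计算代价较高,仅解释前 20 个样本
print(np.abs(shap_values).mean(axis=0))      # 全局重要性:参与交互的特征 0、1 应居前
```

由于 KernelExplainer 只依赖模型的预测函数,同一段代码将 model 换成多层感知机后即可复现论文中“跨模型对比解释”的做法。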

[AI-123] Agentic Reasoning and Tool Integration for LLMs via Reinforcement Learning

【速读】:该论文试图解决大型语言模型(Large Language Models, LLMs)在复杂推理任务中因依赖静态内部知识和文本单一推理而存在的局限性,特别是在现实世界问题解决中所需的动态、多步骤推理、自适应决策以及与外部工具和环境的交互能力不足的问题。解决方案的关键在于提出ARTIST(Agentic Reasoning and Tool Integration in Self-improving Transformers)框架,该框架通过紧密耦合代理式推理、强化学习(Reinforcement Learning, RL)和工具集成,使模型能够自主决定在多轮推理链中何时、如何以及调用哪些工具,从而实现无需逐步监督的鲁棒工具使用策略和环境交互学习。

链接: https://arxiv.org/abs/2505.01441
作者: Joykirat Singh,Raghav Magazine,Yash Pandya,Akshay Nambi
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large language models (LLMs) have achieved remarkable progress in complex reasoning tasks, yet they remain fundamentally limited by their reliance on static internal knowledge and text-only reasoning. Real-world problem solving often demands dynamic, multi-step reasoning, adaptive decision making, and the ability to interact with external tools and environments. In this work, we introduce ARTIST (Agentic Reasoning and Tool Integration in Self-improving Transformers), a unified framework that tightly couples agentic reasoning, reinforcement learning, and tool integration for LLMs. ARTIST enables models to autonomously decide when, how, and which tools to invoke within multi-turn reasoning chains, leveraging outcome-based RL to learn robust strategies for tool use and environment interaction without requiring step-level supervision. Extensive experiments on mathematical reasoning and multi-turn function calling benchmarks show that ARTIST consistently outperforms state-of-the-art baselines, with up to 22% absolute improvement over base models and strong gains on the most challenging tasks. Detailed studies and metric analyses reveal that agentic RL training leads to deeper reasoning, more effective tool use, and higher-quality solutions. Our results establish agentic RL with tool integration as a powerful new frontier for robust, interpretable, and generalizable problem-solving in LLMs.
zh

[AI-124] Interactive Double Deep Q-network: Integrating Human Interventions and Evaluative Predictions in Reinforcement Learning of Autonomous Driving

【速读】:该论文旨在解决在需要高精度和安全性的应用场景中,如何有效融合人类专家知识与机器学习的问题,特别是在自动驾驶领域。其解决方案的关键在于提出一种名为Interactive Double Deep Q-network (iDDQN) 的人机协同强化学习方法,通过将人类决策直接整合到强化学习的Q值更新过程中,实现人类与智能体的协同策略优化,从而提升模型性能与适应性。

链接: https://arxiv.org/abs/2505.01440
作者: Alkis Sygkounas,Ioannis Athanasiadis,Andreas Persson,Michael Felsberg,Amy Loutfi
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO)
备注: Accepted at IEEE Intelligent Vehicles Symposium (IV) 2025, 8 pages

点击查看摘要

Abstract:Integrating human expertise with machine learning is crucial for applications demanding high accuracy and safety, such as autonomous driving. This study introduces Interactive Double Deep Q-network (iDDQN), a Human-in-the-Loop (HITL) approach that enhances Reinforcement Learning (RL) by merging human insights directly into the RL training process, improving model performance. Our proposed iDDQN method modifies the Q-value update equation to integrate human and agent actions, establishing a collaborative approach for policy development. Additionally, we present an offline evaluative framework that simulates the agent’s trajectory as if no human intervention had occurred, to assess the effectiveness of human interventions. Empirical results in simulated autonomous driving scenarios demonstrate that iDDQN outperforms established approaches, including Behavioral Cloning (BC), HG-DAgger, Deep Q-Learning from Demonstrations (DQfD), and vanilla DRL in leveraging human expertise for improving performance and adaptability.
zh
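
论文将人类动作直接融入 Q 值更新,具体融合公式见原文;下面仅给出一个体现该思想的概念性示意(q_online、q_target 为假设的可调用 Q 网络,返回各动作的 Q 值数组):

```python
import numpy as np

def iddqn_step(q_online, q_target, s, a_agent, a_human, r, s_next, gamma=0.99):
    """人类介入时以人类动作替代智能体动作参与更新(Human-in-the-Loop 的最简形态)。"""
    a_exec = a_human if a_human is not None else a_agent  # 实际执行并用于回归的动作
    a_star = int(np.argmax(q_online(s_next)))             # Double DQN:在线网络选动作
    target = r + gamma * q_target(s_next)[a_star]         # 目标网络估值
    return a_exec, target  # 训练时令 Q(s, a_exec) 向 target 回归
```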

[AI-125] Global Stress Generation and Spatiotemporal Super-Resolution Physics-Informed Operator under Dynamic Loading for Two-Phase Random Materials

【速读】:该论文旨在解决在动态加载条件下,两相随机材料(TRMs)中高分辨率时空应力场生成与超分辨率重建的问题,特别是在有限的时空分辨率微结构数据下准确捕捉应力集中区域的挑战。其解决方案的关键在于提出两种方法:一是基于扩散模型的时空应力扩散(STS-diffusion)框架,利用时空U-Net(STU-net)生成全局时空应力数据,并系统研究不同注意力位置对模型精度的影响;二是物理信息网络驱动的时空超分辨率物理信息算子(ST-SRPINN),通过引入物理约束,在仅需低分辨率应力场数据的情况下,实现任意放大倍数的时空分辨率提升。

链接: https://arxiv.org/abs/2505.01438
作者: Tengfei Xing,Xiaodan Ren,Jie Li
机构: 未知
类目: Machine Learning (cs.LG); Materials Science (cond-mat.mtrl-sci); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Material stress analysis is a critical aspect of material design and performance optimization. Under dynamic loading, the global stress evolution in materials exhibits complex spatiotemporal characteristics, especially in two-phase random materials (TRMs). Such kind of material failure is often associated with stress concentration, and the phase boundaries are key locations where stress concentration occurs. In practical engineering applications, the spatiotemporal resolution of acquired microstructural data and its dynamic stress evolution is often limited. This poses challenges for deep learning methods in generating high-resolution spatiotemporal stress fields, particularly for accurately capturing stress concentration regions. In this study, we propose a framework for global stress generation and spatiotemporal super-resolution in TRMs under dynamic loading. First, we introduce a diffusion model-based approach, named Spatiotemporal Stress Diffusion (STS-diffusion), for generating global spatiotemporal stress data. This framework incorporates Space-Time U-Net (STU-net), and we systematically investigate the impact of different attention positions on model accuracy. Next, we develop a physics-informed network for spatiotemporal super-resolution, termed the Spatiotemporal Super-Resolution Physics-Informed Operator (ST-SRPINN). The proposed ST-SRPINN is an unsupervised learning method. The influence of data-driven and physics-informed loss function weights on model accuracy is explored in detail. Benefiting from physics-based constraints, ST-SRPINN requires only low-resolution stress field data during training and can upscale the spatiotemporal resolution of stress fields to arbitrary magnifications.
zh

[AI-126] Enhancing IoT-Botnet Detection using Variational Auto-encoder and Cost-Sensitive Learning: A Deep Learning Approach for Imbalanced Datasets

【速读】:该论文旨在解决物联网(IoT)设备在面对少数类攻击流量时检测效果不佳的问题,尤其是针对由僵尸网络(botnet)引发的恶意攻击。解决方案的关键在于利用变分自编码器(Variational Auto-encoder, VAE)和成本敏感学习(cost-sensitive learning)构建轻量且高效的模型,以提升对少数类攻击流量的识别能力。

链接: https://arxiv.org/abs/2505.01437
作者: Hassan Wasswa,Timothy Lynar,Hussein Abbass
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The Internet of Things (IoT) technology has rapidly gained popularity with applications widespread across a variety of industries. However, IoT devices have been recently serving as a porous layer for many malicious attacks to both personal and enterprise information systems with the most famous attacks being botnet-related attacks. The work in this study leveraged Variational Auto-encoder (VAE) and cost-sensitive learning to develop lightweight, yet effective, models for IoT-botnet detection. The aim is to enhance the detection of minority class attack traffic instances which are often missed by machine learning models. The proposed approach is evaluated on a multi-class problem setting for the detection of traffic categories on highly imbalanced datasets. The performance of two deep learning models including the standard feed forward deep neural network (DNN), and Bidirectional-LSTM (BLSTM) was evaluated and both recorded commendable results in terms of accuracy, precision, recall and F1-score for all traffic classes.
zh
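
摘要中的成本敏感学习,常见做法是按类别频率的反比设置损失权重,使少数类攻击流量的误分类代价更高。PyTorch 最小示意如下(类别数与样本计数均为假设):

```python
import torch
import torch.nn as nn

class_counts = torch.tensor([50000., 3000., 800., 200., 50.])  # 严重失衡的 5 类流量
weights = class_counts.sum() / (len(class_counts) * class_counts)  # 频率反比权重

criterion = nn.CrossEntropyLoss(weight=weights)  # 成本敏感的交叉熵
logits = torch.randn(32, 5)
labels = torch.randint(0, 5, (32,))
loss = criterion(logits, labels)  # 少数类样本对梯度的贡献被放大
```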

[AI-127] Building Scalable AI-Powered Applications with Cloud Databases: Architectures Best Practices and Performance Considerations

【速读】:该论文旨在解决传统数据库架构在应对AI驱动工作负载时存在的性能瓶颈、可扩展性不足以及低延迟查询需求难以满足的问题。其解决方案的关键在于利用云原生数据库技术,如向量数据库(pgvector)、图数据库(AWS Neptune)、NoSQL存储(Amazon DocumentDB、DynamoDB)和关系型云数据库(Aurora MySQL和PostgreSQL),结合检索增强生成(RAG)、实时数据管道、AI驱动的查询优化和基于嵌入的搜索等架构模式,以提升AI应用的数据处理能力与效率。同时,论文还评估了性能基准、可扩展性及成本效益策略,为构建高效、安全且符合监管要求的AI应用提供指导。

链接: https://arxiv.org/abs/2504.18793
作者: Santosh Bhupathi
机构: 未知
类目: Databases (cs.DB); Artificial Intelligence (cs.AI)
备注: 9 pages

点击查看摘要

Abstract:The rapid adoption of AI-powered applications demands high-performance, scalable, and efficient cloud database solutions, as traditional architectures often struggle with AI-driven workloads requiring real-time data access, vector search, and low-latency queries. This paper explores how cloud-native databases enable AI-driven applications by leveraging purpose-built technologies such as vector databases (pgvector), graph databases (AWS Neptune), NoSQL stores (Amazon DocumentDB, DynamoDB), and relational cloud databases (Aurora MySQL and PostgreSQL). It presents architectural patterns for integrating AI workloads with cloud databases, including Retrieval-Augmented Generation (RAG) [1] with LLMs, real-time data pipelines, AI-driven query optimization, and embeddings-based search. Performance benchmarks, scalability considerations, and cost-efficient strategies are evaluated to guide the design of AI-enabled applications. Real-world case studies from industries such as healthcare, finance, and customer experience illustrate how enterprises utilize cloud databases to enhance AI capabilities while ensuring security, governance, and compliance with enterprise and regulatory standards. By providing a comprehensive analysis of AI and cloud database integration, this paper serves as a practical guide for researchers, architects, and enterprises to build next-generation AI applications that optimize performance, scalability, and cost efficiency in cloud environments.
zh

[AI-128] Integrating Column Generation and Large Neighborhood Search for Bus Driver Scheduling with Complex Break Constraints

【速读】:该论文旨在解决公交司机排班问题(Bus Driver Scheduling Problem, BDSP),该问题属于组合优化领域,目标是设计能够覆盖预安排公交线路的班次,并兼顾运营成本与司机满意度。由于受到严格的法律规则和集体协议的限制,该问题具有高度的约束性。论文提出了一种先进的精确解法与混合求解方法,其关键在于结合分支定价(Branch and Price, BP)与大邻域搜索(Large Neighborhood Search, LNS)框架,并在LNS的修复阶段引入BP或列生成(Column Generation, CG)。此外,论文还提出了BP与LNS更深层次的集成策略,通过存储和重用LNS子问题中生成的列,以寻找更优的全局解。

链接: https://arxiv.org/abs/2505.02485
作者: Lucas Kletzander,Tommaso Mannelli Mazzoli,Nysret Musliu,Pascal Van Hentenryck
机构: 未知
类目: Optimization and Control (math.OC); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The Bus Driver Scheduling Problem (BDSP) is a combinatorial optimization problem with the goal to design shifts to cover prearranged bus tours. The objective takes into account the operational cost as well as the satisfaction of drivers. This problem is heavily constrained due to strict legal rules and collective agreements. The objective of this article is to provide state-of-the-art exact and hybrid solution methods that can provide high-quality solutions for instances of different sizes. This work presents a comprehensive study of both an exact method, Branch and Price (BP), as well as a Large Neighborhood Search (LNS) framework which uses BP or Column Generation (CG) for the repair phase to solve the BDSP. It further proposes and evaluates a novel deeper integration of BP and LNS, storing the generated columns from the LNS subproblems and reusing them for other subproblems, or to find better global solutions. The article presents a detailed analysis of several components of the solution methods and their impact, including general improvements for the BP subproblem, which is a high-dimensional Resource Constrained Shortest Path Problem (RCSPP), and the components of the LNS. The evaluation shows that our approach provides new state-of-the-art results for instances of all sizes, including exact solutions for small instances, and low gaps to a known lower bound for mid-sized instances. Conclusions: We observe that BP provides the best results for small instances, while the tight integration of LNS and CG can provide high-quality solutions for larger instances, further improving over LNS which just uses CG as a black box. The proposed methods are general and can also be applied to other rule sets and related optimization problems
zh

[AI-129] Temporal Robustness in Discrete Time Linear Dynamical Systems

【速读】:该论文试图解决在离散时间线性动力系统(包括马尔可夫链)中,由于系统运行的时间范围存在不确定性,导致基于系统停止时状态分布的成本(或收益)估算存在不确定性的问题。解决方案的关键在于通过理论分析,在Wasserstein模糊集下进行分布鲁棒的成本估计,而非从少量样本中学习概率分布。研究的核心是建立一个概率单纯形上的离散时间马尔可夫链与全局渐近稳定(GAS)离散时间线性动力系统之间的等价性,从而将问题简化为对GAS系统的分析,并提出了多种多项式时间算法及不同情况下的计算复杂性结果。

链接: https://arxiv.org/abs/2505.02347
作者: Nilava Metya,Arunesh Sinha
机构: 未知
类目: Optimization and Control (math.OC); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Discrete time linear dynamical systems, including Markov chains, have found many applications. However, in some problems, there is uncertainty about the time horizon for which the system runs. This creates uncertainty about the cost (or reward) incurred based on the state distribution when the system stops. Given past data samples of how long a system ran, we propose to theoretically analyze a distributionally robust cost estimation task in a Wasserstein ambiguity set, instead of learning a probability distribution from a few samples. Towards this, we show an equivalence between a discrete time Markov Chain on a probability simplex and a global asymptotic stable (GAS) discrete time linear dynamical system, allowing us to base our study on a GAS system only. Then, we provide various polynomial time algorithms and hardness results for different cases in our theoretical study, including a fundamental result about a Wasserstein-distance-based polytope.
zh

[AI-130] Minimisation of Quasar-Convex Functions Using Random Zeroth-Order Oracles

【速读】:该论文旨在解决在无约束和有约束条件下最小化拟凸(quasar-convex, QC)及强拟凸(strongly quasar-convex, SQC)函数的问题。其解决方案的关键在于提出一种随机高斯平滑的零阶(zeroth-order, ZO)算法,并对其在不同问题设置下的收敛性和复杂度进行理论分析。在有约束情况下,论文引入了新的近似拟凸性概念——近似-拟凸性(proximal-quasar-convexity),并证明了在方差缩减方案下,算法能够收敛到全局最优解的邻域,且该邻域大小可被控制。

链接: https://arxiv.org/abs/2505.02281
作者: Amir Ali Farzin,Yuen-Man Pun,Iman Shames
机构: 未知
类目: Optimization and Control (math.OC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Numerical Analysis (math.NA)
备注:

点击查看摘要

Abstract:This study explores the performance of a random Gaussian smoothing zeroth-order (ZO) scheme for minimising quasar-convex (QC) and strongly quasar-convex (SQC) functions in both unconstrained and constrained settings. For the unconstrained problem, we establish the ZO algorithm’s convergence to a global minimum along with its complexity when applied to both QC and SQC functions. For the constrained problem, we introduce the new notion of proximal-quasar-convexity and prove analogous results to the unconstrained case. Specifically, we show the complexity bounds and the convergence of the algorithm to a neighbourhood of a global minimum whose size can be controlled under a variance reduction scheme. Theoretical findings are illustrated through investigating the performance of the algorithm applied to a range of problems in machine learning and optimisation. Specifically, we observe scenarios where the ZO method outperforms gradient descent. We provide a possible explanation for this phenomenon.
zh
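
摘要中的随机高斯平滑零阶方法,单方向梯度估计的常见形式为 g = (f(x + μu) − f(x))/μ · u,其中 u 为标准高斯向量。下面用一个凸函数(拟凸的特例)演示该估计器驱动的下降过程:

```python
import numpy as np

def zo_gradient(f, x, rng, mu=1e-4, n_dirs=20):
    """高斯平滑零阶梯度估计:沿随机方向用有限差分近似方向导数并平均。"""
    g = np.zeros_like(x)
    fx = f(x)
    for _ in range(n_dirs):
        u = rng.standard_normal(x.shape)
        g += (f(x + mu * u) - fx) / mu * u
    return g / n_dirs

f = lambda x: np.sum((x - 1.0) ** 2)
x, rng = np.zeros(5), np.random.default_rng(0)
for _ in range(200):
    x -= 0.05 * zo_gradient(f, x, rng)   # 仅用函数值即可完成“梯度”下降
print(x)                                 # 应接近全 1 向量(全局最优)
```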

[AI-131] Pickup and Delivery with Time Windows and Transfers: combining decomposition with metaheuristics

【速读】:该论文旨在解决一种允许车辆在途中进行负载交换且所有地点均需遵守严格时间窗的运输路径优化问题,即带中途负载交换的取送货问题(Pickup and Delivery Problem with Mid-route Load Exchanges)。其解决方案的关键在于提出了一种改进的逻辑基础Benders分解(Logic-Based Benders Decomposition, LBBD)方法,该方法在文献中的所有基准测试中均能改善最优性差距,并具备处理更大规模问题的能力;同时引入了一种优化的大型邻域搜索(Large Neighborhood Search, LNS)算法,提升了LNS在非特定案例配置下的适应性。此外,还开发了一个实例生成器以弥补基准数据的不足,从而支持更广泛的实验分析。

链接: https://arxiv.org/abs/2505.02158
作者: Ioannis Avgerinos,Ioannis Mourtos,Nikolaos Tsompanidis,Georgios Zois
机构: 未知
类目: Optimization and Control (math.OC); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:This paper examines the generalisation of the Pickup and Delivery Problem that allows mid-route load exchanges among vehicles and obeys strict time-windows at all locations. We propose a novel Logic-Based Benders Decomposition (LBBD) that improves optimality gaps for all benchmarks in the literature and scales up to handle larger ones. To tackle even larger instances, we introduce a refined Large Neighborhood Search (LNS) algorithm that improves the adaptability of LNS beyond case-specific configurations appearing in related literature. To bridge the gap in benchmark availability, we develop an instance generator that allows for extensive experimentation. For moderate datasets (25 and 50 requests), we evaluate the performance of both LBBD and LNS, the former being able to close the gap and the latter capable of providing near-optimal solutions. For larger instances (75 and 100 requests), we recreate indicative state-of-the-art metaheuristics to highlight the improvements introduced by our LNS refinements, while establishing its scalability.
zh

[AI-132] Rate-Limited Closed-Loop Distributed ISAC Systems: An Autoencoder Approach

【速读】:该论文旨在解决在闭环分布式多传感器集成感知与通信(ISAC)系统中,由于信道容量受限导致高维传感器观测数据传输受限的问题。其解决方案的关键在于提出一种基于自编码器(autoencoder)的观测压缩方法,以在有限的传输能力下优化系统性能。通过构建一个通用框架并结合闭环线性二次调节器(LQR)系统的案例研究,论文分析了观测、压缩和状态维度之间的相互作用对重建精度、状态估计误差及控制性能的影响。

链接: https://arxiv.org/abs/2505.01780
作者: Guangjin Pan,Zhixing Li,Ayça Özçelikkale,Christian Häger,Musa Furkan Keskin,Henk Wymeersch
机构: 未知
类目: ignal Processing (eess.SP); Artificial Intelligence (cs.AI); Networking and Internet Architecture (cs.NI); Systems and Control (eess.SY)
备注: 6 pages, 15 figures. This work has been submitted to the IEEE for possible publication

点击查看摘要

Abstract:In closed-loop distributed multi-sensor integrated sensing and communication (ISAC) systems, performance often hinges on transmitting high-dimensional sensor observations over rate-limited networks. In this paper, we first present a general framework for rate-limited closed-loop distributed ISAC systems, and then propose an autoencoder-based observation compression method to overcome the constraints imposed by limited transmission capacity. Building on this framework, we conduct a case study using a closed-loop linear quadratic regulator (LQR) system to analyze how the interplay among observation, compression, and state dimensions affects reconstruction accuracy, state estimation error, and control performance. In multi-sensor scenarios, our results further show that optimal resource allocation initially prioritizes low-noise sensors until the compression becomes lossless, after which resources are reallocated to high-noise sensors.
zh

[AI-133] Interpretable graph-based models on multimodal biomedical data integration: A technical review and benchmarking

【速读】:该论文试图解决多模态生物医学数据中图模型的可解释性问题,以促进其在临床环境中的应用。其关键解决方案是通过系统性地回顾和评估现有的可解释图模型方法,提出一种结构化的框架,涵盖图构建、解释器选择及资源预算分配,并通过实际基准测试(如阿尔茨海默病队列)验证不同方法的性能与生物学深度,从而为研究人员提供平衡透明度与性能的指导。

链接: https://arxiv.org/abs/2505.01696
作者: Alireza Sadeghi,Farshid Hajati,Ahmadreza Argha,Nigel H Lovell,Min Yang,Hamid Alinejad-Rokny
机构: 未知
类目: Genomics (q-bio.GN); Artificial Intelligence (cs.AI)
备注: 41 pages

点击查看摘要

Abstract:Integrating heterogeneous biomedical data including imaging, omics, and clinical records supports accurate diagnosis and personalised care. Graph-based models fuse such non-Euclidean data by capturing spatial and relational structure, yet clinical uptake requires regulator-ready interpretability. We present the first technical survey of interpretable graph based models for multimodal biomedical data, covering 26 studies published between Jan 2019 and Sep 2024. Most target disease classification, notably cancer, and rely on static graphs from simple similarity measures, while graph-native explainers are rare; post-hoc methods adapted from non-graph domains, such as gradient saliency and SHAP, predominate. We group existing approaches into four interpretability families, outline trends such as graph-in-graph hierarchies, knowledge-graph edges, and dynamic topology learning, and perform a practical benchmark. Using an Alzheimer disease cohort, we compare Sensitivity Analysis, Gradient Saliency, SHAP and Graph Masking. SHAP and Sensitivity Analysis recover the broadest set of known AD pathways and Gene-Ontology terms, whereas Gradient Saliency and Graph Masking surface complementary metabolic and transport signatures. Permutation tests show all four beat random gene sets, but with distinct trade-offs: SHAP and Graph Masking offer deeper biology at higher compute cost, while Gradient Saliency and Sensitivity Analysis are quicker though coarser. We also provide a step-by-step flowchart covering graph construction, explainer choice and resource budgeting to help researchers balance transparency and performance. This review synthesises the state of interpretable graph learning for multimodal medicine, benchmarks leading techniques, and charts future directions, from advanced XAI tools to under-studied diseases, serving as a concise reference for method developers and translational scientists.
zh

[AI-134] Transfer Learning-Based Deep Residual Learning for Speech Recognition in Clean and Noisy Environments

【速读】:该论文旨在解决非平稳环境噪声对自动语音识别(Automatic Speech Recognition, ASR)系统的负面影响。其解决方案的关键在于引入一种结合了鲁棒前端的新型神经网络框架,并利用基于残差神经网络(ResNet)的迁移学习方法,对梅尔频率声学特征集进行有效评估,从而提升在清洁和噪声环境下语音识别的准确率。

链接: https://arxiv.org/abs/2505.01632
作者: Noussaiba Djeffal,Djamel Addou,Hamza Kheddar,Sid Ahmed Selouani
机构: 未知
类目: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Sound (cs.SD)
备注:

点击查看摘要

Abstract:Addressing the detrimental impact of non-stationary environmental noise on automatic speech recognition (ASR) has been a persistent and significant research focus. Despite advancements, this challenge continues to be a major concern. Recently, data-driven supervised approaches, such as deep neural networks, have emerged as promising alternatives to traditional unsupervised methods. With extensive training, these approaches have the potential to overcome the challenges posed by diverse real-life acoustic environments. In this light, this paper introduces a novel neural framework that incorporates a robust frontend into ASR systems in both clean and noisy environments. Utilizing the Aurora-2 speech database, the authors evaluate the effectiveness of a Mel-frequency acoustic feature set, employing a transfer learning approach based on residual neural networks (ResNet). The experimental results demonstrate a significant improvement in recognition accuracy compared to convolutional neural networks (CNN) and long short-term memory (LSTM) networks. They achieved accuracies of 98.94% in clean and 91.21% in noisy mode.
zh
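
摘要中的迁移学习思路,大体是把在 ImageNet 上预训练的 ResNet 作为 Mel 频谱“图像”的骨干网络,替换输入层与分类头后微调。以下为基于 torchvision 的示意(输入尺寸与类别数均为假设):

```python
import torch
import torch.nn as nn
from torchvision import models

model = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)  # 载入预训练权重
model.conv1 = nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3, bias=False)  # 单通道 Mel 输入
model.fc = nn.Linear(model.fc.in_features, 11)  # 假设的类别数(如连接数字任务的词类)

mel = torch.randn(8, 1, 64, 101)  # (batch, 通道, Mel 频带数, 帧数),尺寸为示意
print(model(mel).shape)           # torch.Size([8, 11])
```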

[AI-135] An Adaptive Framework for Autoregressive Forecasting in CFD Using Hybrid Modal Decomposition and Deep Learning

【速读】:该论文旨在解决深度学习(Deep Learning, DL)自回归预测模型在长时域内稳定性不足的问题,从而降低计算流体力学(Computational Fluid Dynamics, CFD)中的计算成本。其解决方案的关键在于提出了一种通用且完全数据驱动的自适应框架,该框架通过交替进行模型预测与基于新生成CFD数据的模型更新,以维持长期预测的准确性,并避免自回归模型中常见的预测误差累积问题。

链接: https://arxiv.org/abs/2505.01531
作者: Rodrigo Abadía-Heredia,Manuel Lopez-Martin,Soledad Le Clainche
机构: 未知
类目: Fluid Dynamics (physics.flu-dyn); Artificial Intelligence (cs.AI)
备注: 47 pages, single-column, 15 figures and 5 tables

点击查看摘要

Abstract:This work presents, to the best of the authors’ knowledge, the first generalizable and fully data-driven adaptive framework designed to stabilize deep learning (DL) autoregressive forecasting models over long time horizons, with the goal of reducing the computational cost required in computational fluid dynamics (CFD) simulations. The proposed methodology alternates between two phases: (i) predicting the evolution of the flow field over a selected time interval using a trained DL model, and (ii) updating the model with newly generated CFD data when stability degrades, thus maintaining accurate long-term forecasting. This adaptive retraining strategy ensures robustness while avoiding the accumulation of predictive errors typical in autoregressive models. The framework is validated across three increasingly complex flow regimes, from laminar to turbulent, demonstrating a 30% to 95% reduction in computational cost without compromising physical consistency or accuracy. Its entirely data-driven nature makes it easily adaptable to a wide range of time-dependent simulation problems. The code implementing this methodology is available as open-source and it will be integrated into the upcoming release of the ModelFLOWs-app.
zh
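
摘要中“预测—监测—重训练”交替的自适应框架,可概括为如下流程示意;其中 model 与 cfd_solver 的 predict/run/finetune 接口均为本文假设,稳定性判据用相对误差近似:

```python
import numpy as np

def relative_error(pred, ref):
    pred, ref = np.asarray(pred), np.asarray(ref)
    return np.linalg.norm(pred - ref) / (np.linalg.norm(ref) + 1e-12)

def adaptive_forecast(model, cfd_solver, state, horizon, window=50, tol=0.05):
    """交替执行:(i) 用 DL 模型自回归预测 window 步;(ii) 与 CFD 校验,
    误差超阈值时用新数据更新模型并用高保真解纠正轨迹,避免误差累积。"""
    trajectory, t = [state], 0
    while t < horizon:
        for _ in range(window):
            state = model.predict(state)          # 阶段 (i):自回归预测
            trajectory.append(state)
            t += 1
        reference = cfd_solver.run(trajectory[-window - 1], steps=window)
        if relative_error(trajectory[-window:], reference) > tol:
            model.finetune(reference)             # 阶段 (ii):按需重训练
            trajectory[-window:] = list(reference)
            state = trajectory[-1]
    return trajectory
```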

机器学习

[LG-0] Towards Quantifying the Hessian Structure of Neural Networks

链接: https://arxiv.org/abs/2505.02809
作者: Zhaorui Dong,Yushun Zhang,Zhi-Quan Luo,Jianfeng Yao,Ruoyu Sun
类目: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Empirical studies reported that the Hessian matrix of neural networks (NNs) exhibits a near-block-diagonal structure, yet its theoretical foundation remains unclear. In this work, we reveal two forces that shape the Hessian structure: a "static force" rooted in the architecture design, and a "dynamic force" arising from training. We then provide a rigorous theoretical analysis of the "static force" at random initialization. We study linear models and 1-hidden-layer networks with the mean-square (MSE) loss and the Cross-Entropy (CE) loss for classification tasks. By leveraging random matrix theory, we compare the limit distributions of the diagonal and off-diagonal Hessian blocks and find that the block-diagonal structure arises as C \rightarrow \infty , where C denotes the number of classes. Our findings reveal that C is a primary driver of the near-block-diagonal structure. These results may shed new light on the Hessian structure of large language models (LLMs), which typically operate with a large C exceeding 10^4 or 10^5 .

[LG-1] Adaptive Bidding Policies for First-Price Auctions with Budget Constraints under Non-stationarity

链接: https://arxiv.org/abs/2505.02796
作者: Yige Wang,Jiashuo Jiang
类目: Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We study how a budget-constrained bidder should learn to adaptively bid in repeated first-price auctions to maximize her cumulative payoff. This problem arose due to an industry-wide shift from second-price auctions to first-price auctions in display advertising recently, which renders truthful bidding (i.e., always bidding one’s private value) no longer optimal. We propose a simple dual-gradient-descent-based bidding policy that maintains a dual variable for the budget constraint as the bidder consumes her budget. In analysis, we consider two settings regarding the bidder’s knowledge of her private values in the future: (i) an uninformative setting where all the distributional knowledge (which can be non-stationary) is entirely unknown to the bidder, and (ii) an informative setting where a prediction of the budget allocation is available in advance. We characterize the performance loss (or regret) relative to an optimal policy with complete information on the stochasticity. For the uninformative setting, we show that the regret is \tilde{O}(\sqrt{T}) plus a variation term that reflects the non-stationarity of the value distributions, and this is of optimal order. We then show that we can get rid of the variation term with the help of the prediction; specifically, the regret is \tilde{O}(\sqrt{T}) plus the prediction error term in the informative setting.
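
此类对偶梯度出价策略的一个常见实现形态是:维护预算约束的对偶变量 λ,按 v/(1+λ) 折扣出价,并依据实际消耗与人均预算之差更新 λ。以下为该类方法的通用示意(并非论文原始公式,竞价环境为虚构):

```python
import numpy as np

def pacing_bids(values, competing, budget, eta=0.01):
    """values:各期私有价值;competing:对手最高出价;一价拍卖中标即支付自己的出价。"""
    rho = budget / len(values)                # 每期目标消耗
    lam, spend, bids = 0.0, 0.0, []
    for v, d in zip(values, competing):
        b = min(v / (1.0 + lam), budget - spend)  # 对偶折扣出价,且不超剩余预算
        bids.append(b)
        cost = b if b > d else 0.0
        spend += cost
        lam = max(0.0, lam + eta * (cost - rho))  # 投影对偶(次)梯度步
    return bids

rng = np.random.default_rng(0)
bids = pacing_bids(rng.uniform(size=1000), rng.uniform(size=1000), budget=200)
```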

[LG-2] Cooperative Bayesian and variance networks disentangle aleatoric and epistemic uncertainties

链接: https://arxiv.org/abs/2505.02743
作者: Jiaxiang Yi,Miguel A. Bessa
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 28 pages, 19 figures

点击查看摘要

Abstract:Real-world data contains aleatoric uncertainty - irreducible noise arising from imperfect measurements or from incomplete knowledge about the data generation process. Mean variance estimation (MVE) networks can learn this type of uncertainty but require ad-hoc regularization strategies to avoid overfitting and are unable to predict epistemic uncertainty (model uncertainty). Conversely, Bayesian neural networks predict epistemic uncertainty but are notoriously difficult to train due to the approximate nature of Bayesian inference. We propose to cooperatively train a variance network with a Bayesian neural network and demonstrate that the resulting model disentangles aleatoric and epistemic uncertainties while improving the mean estimation. We demonstrate the effectiveness and scalability of this method across a diverse range of datasets, including a time-dependent heteroscedastic regression dataset we created where the aleatoric uncertainty is known. The proposed method is straightforward to implement, robust, and adaptable to various model architectures.
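
摘要中的均值-方差估计(MVE)网络通常以高斯负对数似然为损失,同时学习预测均值与(偶然不确定性的)方差。PyTorch 最小示意如下:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MVEHead(nn.Module):
    """在共享特征 h 上输出均值与经 softplus 保证为正的方差。"""
    def __init__(self, in_dim):
        super().__init__()
        self.mean = nn.Linear(in_dim, 1)
        self.raw_var = nn.Linear(in_dim, 1)

    def forward(self, h):
        return self.mean(h), F.softplus(self.raw_var(h)) + 1e-6

def gaussian_nll(mu, var, y):
    # 负对数似然:0.5 * [log var + (y - mu)^2 / var](略去常数项)
    return (0.5 * (var.log() + (y - mu) ** 2 / var)).mean()
```

论文的要点在于让(近似)贝叶斯网络负责均值与认知不确定性、方差网络专注偶然不确定性,二者协同训练以避免 MVE 单独训练时的过拟合。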

[LG-3] Less is More: Efficient Weight Farcasting with 1-Layer Neural Network DASFAA’25

链接: https://arxiv.org/abs/2505.02714
作者: Xiao Shou,Debarun Bhattacharjya,Yanna Ding,Chen Zhao,Rui Li,Jianxi Gao
类目: Machine Learning (cs.LG)
*备注: Accepted to DASFAA '25

点击查看摘要

Abstract:Addressing the computational challenges inherent in training large-scale deep neural networks remains a critical endeavor in contemporary machine learning research. While previous efforts have focused on enhancing training efficiency through techniques such as gradient descent with momentum, learning rate scheduling, and weight regularization, the demand for further innovation continues to burgeon as model sizes keep expanding. In this study, we introduce a novel framework which diverges from conventional approaches by leveraging long-term time series forecasting techniques. Our method capitalizes solely on initial and final weight values, offering a streamlined alternative for complex model architectures. We also introduce a novel regularizer that is tailored to enhance the forecasting performance of our approach. Empirical evaluations conducted on synthetic weight sequences and real-world deep learning architectures, including the prominent large language model DistilBERT, demonstrate the superiority of our method in terms of forecasting accuracy and computational efficiency. Notably, our framework showcases improved performance while requiring minimal additional computational overhead, thus presenting a promising avenue for accelerating the training process across diverse tasks and architectures.

[LG-4] Aerodynamic and structural airfoil shape optimisation via Transfer Learning-enhanced Deep Reinforcement Learning

链接: https://arxiv.org/abs/2505.02634
作者: David Ramos,Lucas Lacasa,Eusebio Valero,Gonzalo Rubio
类目: Machine Learning (cs.LG); Computational Physics (physics.comp-ph)
*备注: 20 pages, 7 figures

点击查看摘要

Abstract:The main objective of this paper is to introduce a transfer learning-enhanced, multi-objective, deep reinforcement learning (DRL) methodology that is able to optimise the geometry of any airfoil based on concomitant aerodynamic and structural criteria. To showcase the method, we aim to maximise the lift-to-drag ratio C_L/C_D while preserving the structural integrity of the airfoil – as modelled by its maximum thickness – and train the DRL agent using a list of different transfer learning (TL) strategies. The performance of the DRL agent is compared with Particle Swarm Optimisation (PSO), a traditional gradient-free optimisation method. Results indicate that DRL agents are able to perform multi-objective shape optimisation, that the DRL approach outperforms PSO in terms of computational efficiency and shape optimisation performance, and that the TL-enhanced DRL agent achieves performance comparable to the DRL one, while further saving substantial computational resources.

[LG-5] Mirror Mean-Field Langevin Dynamics

链接: https://arxiv.org/abs/2505.02621
作者: Anming Gu,Juno Kim
类目: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:The mean-field Langevin dynamics (MFLD) minimizes an entropy-regularized nonlinear convex functional on the Wasserstein space over \mathbb{R}^d , and has gained attention recently as a model for the gradient descent dynamics of interacting particle systems such as infinite-width two-layer neural networks. However, many problems of interest have constrained domains, which are not solved by existing mean-field algorithms due to the global diffusion term. We study the optimization of probability measures constrained to a convex subset of \mathbb{R}^d by proposing the mirror mean-field Langevin dynamics (MMFLD), an extension of MFLD to the mirror Langevin framework. We obtain linear convergence guarantees for the continuous MMFLD via a uniform log-Sobolev inequality, and uniform-in-time propagation of chaos results for its time- and particle-discretized counterpart.

[LG-6] Low-Loss Space in Neural Networks is Continuous and Fully Connected

链接: https://arxiv.org/abs/2505.02604
作者: Yongding Tian,Zaid Al-Ars,Maksim Kitsak,Peter Hofstee
类目: Machine Learning (cs.LG)
*备注: 10 pages, 4 figures

点击查看摘要

Abstract:Visualizations of the loss landscape in neural networks suggest that minima are isolated points. However, both theoretical and empirical studies indicate that it is possible to connect two different minima with a path consisting of intermediate points that also have low loss. In this study, we propose a new algorithm which investigates low-loss paths in the full parameter space, not only between two minima. Our experiments on LeNet5, ResNet18, and Compact Convolutional Transformer architectures consistently demonstrate the existence of such continuous paths in the parameter space. These results suggest that the low-loss region is a fully connected and continuous space in the parameter space. Our findings provide theoretical insight into neural network over-parameterization, highlighting that parameters collectively define a high-dimensional low-loss space, implying parameter redundancy exists only within individual models and not throughout the entire low-loss space. Additionally, our work also provides new visualization methods and opportunities to improve model generalization by exploring the low-loss space that is closer to the origin.
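
与该算法互补的一个常用检查是沿两组参数的线性插值路径评估损失:若中间点损失与端点相当,说明两个解位于同一低损连通区域。示意如下(data 为一批 (x, y) 样本):

```python
import copy
import torch

def loss_along_path(model_a, model_b, loss_fn, data, n_points=11):
    """在 theta(t) = (1 - t) * theta_a + t * theta_b 上采样损失。"""
    x, y = data
    sa, sb = model_a.state_dict(), model_b.state_dict()
    probe = copy.deepcopy(model_a)
    losses = []
    for i in range(n_points):
        t = i / (n_points - 1)
        probe.load_state_dict({k: ((1 - t) * sa[k] + t * sb[k]).to(sa[k].dtype)
                               for k in sa})
        with torch.no_grad():
            losses.append(loss_fn(probe(x), y).item())
    return losses  # 若中间点损失无明显“壁垒”,两解即被一条低损直线连通
```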

[LG-7] Towards Cross-Modality Modeling for Time Series Analytics: A Survey in the LLM Era IJCAI2025

链接: https://arxiv.org/abs/2505.02583
作者: Chenxi Liu,Shaowen Zhou,Qianxiong Xu,Hao Miao,Cheng Long,Ziyue Li,Rui Zhao
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: Accepted by IJCAI 2025 Survey Track

点击查看摘要

Abstract:The proliferation of edge devices has generated an unprecedented volume of time series data across different domains, motivating various well-customized methods. Recently, Large Language Models (LLMs) have emerged as a new paradigm for time series analytics by leveraging the shared sequential nature of textual data and time series. However, a fundamental cross-modality gap between time series and LLMs exists, as LLMs are pre-trained on textual corpora and are not inherently optimized for time series. Many recent proposals are designed to address this issue. In this survey, we provide an up-to-date overview of LLMs-based cross-modality modeling for time series analytics. We first introduce a taxonomy that classifies existing approaches into four groups based on the type of textual data employed for time series modeling. We then summarize key cross-modality strategies, e.g., alignment and fusion, and discuss their applications across a range of downstream tasks. Furthermore, we conduct experiments on multimodal datasets from different application domains to investigate effective combinations of textual data and cross-modality strategies for enhancing time series analytics. Finally, we suggest several promising directions for future research. This survey is designed for a range of professionals, researchers, and practitioners interested in LLM-based time series modeling.

[LG-8] Learning and Online Replication of Grasp Forces from Electromyography Signals for Prosthetic Finger Control ICRA2025

链接: https://arxiv.org/abs/2505.02574
作者: Robin Arbaud,Elisa Motta,Marco Domenico Avaro,Stefano Picinich,Marta Lorenzini,Arash Ajoudani
类目: Robotics (cs.RO); Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注: 7 pages, 6 figures, to be presented at ICRA 2025

点击查看摘要

Abstract:Partial hand amputations significantly affect the physical and psychosocial well-being of individuals, yet intuitive control of externally powered prostheses remains an open challenge. To address this gap, we developed a force-controlled prosthetic finger activated by electromyography (EMG) signals. The prototype, constructed around a wrist brace, functions as a supernumerary finger placed near the index, allowing for early-stage evaluation on unimpaired subjects. A neural network-based model was then implemented to estimate fingertip forces from EMG inputs, allowing for online adjustment of the prosthetic finger grip strength. The force estimation model was validated through experiments with ten participants, demonstrating its effectiveness in predicting forces. Additionally, online trials with four users wearing the prosthesis exhibited precise control over the device. Our findings highlight the potential of using EMG-based force estimation to enhance the functionality of prosthetic fingers.

[LG-9] FedSDAF: Leveraging Source Domain Awareness for Enhanced Federated Domain Generalization

链接: https://arxiv.org/abs/2505.02515
作者: Hongze Li,Zesheng Zhou,Zhenbiao Cao,Xinhui Li,Wei Chen,Xiaojin Zhang
类目: Machine Learning (cs.LG)
*备注: 11 pages, 6 figures

点击查看摘要

Abstract:Traditional domain generalization approaches predominantly focus on leveraging target domain-aware features while overlooking the critical role of source domain-specific characteristics, particularly in federated settings with inherent data isolation. To address this gap, we propose the Federated Source Domain Awareness Framework (FedSDAF), the first method to systematically exploit source domain-aware features for enhanced federated domain generalization (FedDG). The FedSDAF framework consists of two synergistic components: the Domain-Invariant Adapter, which preserves critical domain-invariant features, and the Domain-Aware Adapter, which extracts and integrates source domain-specific knowledge using a Multihead Self-Attention mechanism (MHSA). Furthermore, we introduce a bidirectional knowledge distillation mechanism that fosters knowledge sharing among clients while safeguarding privacy. Our approach represents the first systematic exploitation of source domain-aware features, resulting in significant advancements in model generalization. Extensive experiments on four standard benchmarks (OfficeHome, PACS, VLCS, and DomainNet) show that our method consistently surpasses state-of-the-art federated domain generalization approaches, with accuracy gains of 5.2-13.8%. The source code is available at this https URL.

[LG-10] Uncovering Population PK Covariates from VAE-Generated Latent Spaces

链接: https://arxiv.org/abs/2505.02514
作者: Diego Perazzolo,Chiara Castellani,Enrico Grisan
类目: Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
*备注: Paper accepted at the 47th Annual International Conference IEEE EMBC 2025 (Engineering in Medicine and Biology Society), Copenhagen, Denmark

点击查看摘要

Abstract:Population pharmacokinetic (PopPK) modelling is a fundamental tool for understanding drug behaviour across diverse patient populations and enabling personalized dosing strategies to improve therapeutic outcomes. A key challenge in PopPK analysis lies in identifying and modelling covariates that influence drug absorption, as these relationships are often complex and nonlinear. Traditional methods may fail to capture hidden patterns within the data. In this study, we propose a data-driven, model-free framework that integrates a Variational Autoencoder (VAE) deep learning model with LASSO regression to uncover key covariates from simulated tacrolimus pharmacokinetic (PK) profiles. The VAE compresses high-dimensional PK signals into a structured latent space, achieving accurate reconstruction with a mean absolute percentage error (MAPE) of 2.26%. LASSO regression is then applied to map patient-specific covariates to the latent space, enabling sparse feature selection through L1 regularization. This approach consistently identifies clinically relevant covariates for tacrolimus, including SNP, age, albumin, and hemoglobin, which are retained across the tested regularization strength levels, while effectively discarding non-informative features. The proposed VAE-LASSO methodology offers a scalable, interpretable, and fully data-driven solution for covariate selection, with promising applications in drug development and precision pharmacotherapy.
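
The covariate-screening step lends itself to a compact illustration. In the sketch below, a random latent matrix stands in for the VAE encodings of the PK profiles, and an L1-penalized regression maps standardized covariates to each latent dimension; the covariate names and penalty strength are placeholders rather than the paper's settings.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n, p, k = 500, 6, 4                      # subjects, covariates, latent dims
X = rng.normal(size=(n, p))              # covariates per subject
# Stand-in for VAE latents: only the first two covariates are informative.
Z = X[:, :2] @ rng.normal(size=(2, k)) + 0.1 * rng.normal(size=(n, k))

X_std = StandardScaler().fit_transform(X)
names = ["SNP", "age", "albumin", "hemoglobin", "weight", "sex"]  # hypothetical
for j in range(k):
    lasso = Lasso(alpha=0.1).fit(X_std, Z[:, j])
    kept = [names[i] for i, c in enumerate(lasso.coef_) if abs(c) > 1e-6]
    print(f"latent dim {j}: retained covariates -> {kept}")
```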

[LG-11] Exploring Design Choices for Autoregressive Deep Learning Climate Models ICLR2025

链接: https://arxiv.org/abs/2505.02506
作者: Florian Gallusser,Simon Hentschel,Anna Krause,Andreas Hotho
类目: Machine Learning (cs.LG)
*备注: Tackling Climate Change with Machine Learning Workshop @ ICLR 2025

点击查看摘要

Abstract:Deep Learning models have achieved state-of-the-art performance in medium-range weather prediction but often fail to maintain physically consistent rollouts beyond 14 days. In contrast, a few atmospheric models demonstrate stability over decades, though the key design choices enabling this remain unclear. This study quantitatively compares the long-term stability of three prominent deep learning architectures for medium-range weather prediction (DL-MWP) - FourCastNet, SFNO, and ClimaX - trained on ERA5 reanalysis data at 5.625° resolution. We systematically assess the impact of autoregressive training steps, model capacity, and choice of prognostic variables, identifying configurations that enable stable 10-year rollouts while preserving the statistical properties of the reference dataset. Notably, rollouts with SFNO exhibit the greatest robustness to hyperparameter choices, yet all models can experience instability depending on the random seed and the set of prognostic variables.
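
The rollout protocol under study is simple to state in code: a learned one-step forecaster is repeatedly applied to its own output. The stand-in model below is an untrained convolution; FourCastNet, SFNO, or ClimaX would take its place, and instability shows up as the state statistics drifting over the loop.

```python
import torch
import torch.nn as nn

one_step = nn.Conv2d(4, 4, kernel_size=3, padding=1)   # 4 prognostic variables

def rollout(model: nn.Module, state: torch.Tensor, n_steps: int) -> list:
    """Autoregressive rollout: feed each prediction back as the next input."""
    states = [state]
    with torch.no_grad():
        for _ in range(n_steps):
            state = model(state)
            states.append(state)
    return states

x0 = torch.randn(1, 4, 32, 64)            # (batch, variables, lat, lon) grid
trajectory = rollout(one_step, x0, n_steps=10)
print([float(s.std()) for s in trajectory])   # drift here indicates instability
```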

[LG-12] Bayesian Robust Aggregation for Federated Learning

链接: https://arxiv.org/abs/2505.02490
作者: Aleksandr Karakulev(1),Usama Zafar(1),Salman Toor(1 and 2),Prashant Singh(1 and 3) ((1) Uppsala University, (2) Scaleout Systems, (3) Science for Life Laboratory, Sweden)
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 14 pages, 4 figures, 8 tables

点击查看摘要

Abstract:Federated Learning enables collaborative training of machine learning models on decentralized data. This scheme, however, is vulnerable to adversarial attacks, when some of the clients submit corrupted model updates. In real-world scenarios, the total number of compromised clients is typically unknown, with the extent of attacks potentially varying over time. To address these challenges, we propose an adaptive approach for robust aggregation of model updates based on Bayesian inference. The mean update is defined by the maximum of the likelihood marginalized over the probability of each client being ‘honest’. As a result, the method shares the simplicity of classical average estimators (e.g., sample mean or geometric median), being independent of the number of compromised clients. At the same time, it is as effective against attacks as methods specifically tailored to Federated Learning, such as Krum. We compare our approach with other aggregation schemes in a federated setting on three benchmark image classification data sets. The proposed method consistently achieves state-of-the-art performance across various attack types with both static and varying numbers of malicious clients.
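
The general shape of such an estimator can be sketched as follows: each client receives a soft "honesty" weight from an inlier likelihood around the current mean, and the mean is re-estimated under those weights, EM-style. This is an illustration of the idea only, not the paper's exact marginalized-likelihood derivation.

```python
import numpy as np

def robust_mean(updates: np.ndarray, n_iter: int = 10, sigma: float = 1.0):
    """updates: (n_clients, dim) stacked model updates."""
    mean = updates.mean(axis=0)
    for _ in range(n_iter):
        d2 = ((updates - mean) ** 2).sum(axis=1)
        w = np.exp(-d2 / (2 * sigma ** 2))        # soft "honesty" scores
        w /= w.sum() + 1e-12
        mean = (w[:, None] * updates).sum(axis=0)
    return mean

rng = np.random.default_rng(0)
honest = rng.normal(0.0, 0.1, size=(8, 5))
byzantine = rng.normal(5.0, 0.1, size=(2, 5))     # corrupted updates
print(robust_mean(np.vstack([honest, byzantine])))  # stays near the honest mean
```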

[LG-13] Efficient Continual Learning in Keyword Spotting using Binary Neural Networks

链接: https://arxiv.org/abs/2505.02469
作者: Quynh Nguyen-Phuong Vu,Luciano Sebastian Martinez-Rau,Yuxuan Zhang,Nho-Duc Tran,Bengt Oelmann,Michele Magno,Sebastian Bader
类目: Machine Learning (cs.LG); Sound (cs.SD)
*备注: Accepted for publication on “2025 IEEE Sensors Applications Symposium”

点击查看摘要

Abstract:Keyword spotting (KWS) is an essential function that enables interaction with ubiquitous smart devices. However, in resource-limited devices, KWS models are often static and can thus not adapt to new scenarios, such as added keywords. To overcome this problem, we propose a Continual Learning (CL) approach for KWS built on Binary Neural Networks (BNNs). The framework leverages the reduced computation and memory requirements of BNNs while incorporating techniques that enable the seamless integration of new keywords over time. This study evaluates seven CL techniques on a 16-class use case, reporting an accuracy exceeding 95% for a single additional keyword and up to 86% for four additional classes. Sensitivity to the number of training samples in the CL phase and differences in computational complexity are also evaluated. These evaluations demonstrate that batch-based algorithms are more sensitive to the CL dataset size, and that the differences in computational complexity are insignificant. These findings highlight the potential of developing an effective and computationally efficient technique for continuously integrating new keywords in KWS applications that is compatible with resource-constrained devices.

[LG-14] A probabilistic view on Riemannian machine learning models for SPD matrices

链接: https://arxiv.org/abs/2505.02402
作者: Thibault de Surrel,Florian Yger,Fabien Lotte,Sylvain Chevallier
类目: Machine Learning (cs.LG); Statistics Theory (math.ST); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:The goal of this paper is to show how different machine learning tools on the Riemannian manifold $\mathcal{P}_d$ of Symmetric Positive Definite (SPD) matrices can be united under a probabilistic framework. For this, we will need several Gaussian distributions defined on $\mathcal{P}_d$. We will show how popular classifiers on $\mathcal{P}_d$ can be reinterpreted as Bayes Classifiers using these Gaussian distributions. These distributions will also be used for outlier detection and dimension reduction. By showing that those distributions are pervasive in the tools used on $\mathcal{P}_d$, we allow for other machine learning tools to be extended to $\mathcal{P}_d$.
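
For readers unfamiliar with Gaussians on $\mathcal{P}_d$, one standard construction is the Riemannian Gaussian of Said et al., built from the affine-invariant distance; it is given here as background, and the paper may use this or related variants:

```latex
% Riemannian Gaussian on \mathcal{P}_d under the affine-invariant metric
p(X \mid \bar{X}, \sigma) \;\propto\; \exp\!\left(-\frac{d^2(X,\bar{X})}{2\sigma^2}\right),
\qquad
d(A,B) \;=\; \bigl\lVert \log\!\bigl(A^{-1/2} B A^{-1/2}\bigr) \bigr\rVert_F .
```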

[LG-15] Connecting Thompson Sampling and UCB: Towards More Efficient Trade-offs Between Privacy and Regret ICML2025

链接: https://arxiv.org/abs/2505.02383
作者: Bingshan Hu,Zhiming Huang,Tianyue H. Zhang,Mathias Lécuyer,Nidhi Hegde
类目: Machine Learning (cs.LG)
*备注: Accepted by ICML 2025

点击查看摘要

Abstract:We address differentially private stochastic bandit problems by exploring the deep connections among Thompson Sampling with Gaussian priors, Gaussian mechanisms, and Gaussian differential privacy (GDP). We propose DP-TS-UCB, a novel parametrized private bandit algorithm that enables trading off privacy and regret. DP-TS-UCB satisfies $\tilde{O}\left(T^{0.25(1-\alpha)}\right)$-GDP and enjoys an $O\left(K\ln^{\alpha+1}(T)/\Delta\right)$ regret bound, where $\alpha \in [0,1]$ controls the trade-off between privacy and regret. Theoretically, our DP-TS-UCB relies on anti-concentration bounds of Gaussian distributions and links exploration mechanisms in Thompson Sampling-based algorithms and Upper Confidence Bound-based algorithms, which may be of independent interest.

[LG-16] EntroLLM: Entropy Encoded Weight Compression for Efficient Large Language Model Inference on Edge Devices

链接: https://arxiv.org/abs/2505.02380
作者: Arnab Sanyal,Prithwish Mukherjee,Gourav Datta,Sandeep P. Chinchali
类目: Machine Learning (cs.LG)
*备注: 6 pages, 1 reference page. Under submission and review at ISLPED 2025

点击查看摘要

Abstract:Large Language Models (LLMs) demonstrate exceptional performance across various tasks, but their large storage and computational requirements constrain their deployment on edge devices. To address this, we propose EntroLLM, a novel compression framework that integrates mixed quantization with entropy coding to reduce storage overhead while maintaining model accuracy. Our method applies a layer-wise mixed quantization scheme - choosing between symmetric and asymmetric quantization based on individual layer weight distributions - to optimize compressibility. We then employ Huffman encoding for lossless compression of the quantized weights, significantly reducing memory bandwidth requirements. Furthermore, we introduce parallel Huffman decoding, which enables efficient retrieval of encoded weights during inference, ensuring minimal latency impact. Our experiments on edge-compatible LLMs, including smolLM-1.7B-Instruct, phi3-mini-4k-Instruct, and mistral-7B-Instruct, demonstrate that EntroLLM achieves up to 30% storage reduction compared to uint8 models and up to 65% storage reduction compared to uint4 models, while preserving perplexity and accuracy, on language benchmark tasks. We further show that our method enables 31.9% - 146.6% faster inference throughput on memory-bandwidth-limited edge devices, such as NVIDIA Jetson P3450, by reducing the required data movement. The proposed approach requires no additional re-training and is fully compatible with existing post-training quantization methods, making it a practical solution for edge LLMs.
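
A toy version of the two ingredients helps fix ideas: pick symmetric or asymmetric 8-bit quantization per layer from the shape of its weight distribution, then Huffman-code the resulting integers and measure the average bits per weight. The selection heuristic below is an assumption for illustration; the paper's exact criterion may differ.

```python
import heapq
from collections import Counter
import numpy as np

def quantize(w: np.ndarray):
    """Pick symmetric vs. asymmetric uint8 quantization from the weights."""
    symmetric = abs(float(w.min()) + float(w.max())) < 0.1 * float(w.max() - w.min())
    if symmetric:
        scale = max(abs(float(w.min())), abs(float(w.max()))) / 127.0
        q = np.round(w / scale).astype(np.int32) + 128
    else:
        scale = float(w.max() - w.min()) / 255.0
        q = np.round((w - w.min()) / scale).astype(np.int32)
    return q, symmetric

def huffman_avg_bits(symbols) -> float:
    """Average Huffman code length (bits/symbol) for a symbol stream."""
    counts = Counter(symbols)
    if len(counts) == 1:
        return 1.0
    heap = [[c, [s, 0]] for s, c in counts.items()]  # [count, [symbol, depth], ...]
    heapq.heapify(heap)
    while len(heap) > 1:
        lo, hi = heapq.heappop(heap), heapq.heappop(heap)
        for pair in lo[1:] + hi[1:]:
            pair[1] += 1                              # one level deeper in the tree
        heapq.heappush(heap, [lo[0] + hi[0]] + lo[1:] + hi[1:])
    total = sum(counts.values())
    return sum(counts[s] * depth for s, depth in heap[0][1:]) / total

w = np.random.normal(0, 0.05, size=4096).astype(np.float32)   # one "layer"
q, sym = quantize(w)
print(f"symmetric={sym}, avg bits/weight={huffman_avg_bits(q.tolist()):.2f} vs. 8.0 raw")
```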

[LG-17] Enabling Local Neural Operators to perform Equation-Free System-Level Analysis

链接: https://arxiv.org/abs/2505.02308
作者: Gianluca Fabiani,Hannes Vandecasteele,Somdatta Goswami,Constantinos Siettos,Ioannis G. Kevrekidis
类目: Machine Learning (cs.LG); Dynamical Systems (math.DS); Numerical Analysis (math.NA); Machine Learning (stat.ML)
*备注: 33 pages, 9 figures

点击查看摘要

Abstract:Neural Operators (NOs) provide a powerful framework for computations involving physical laws that can be modelled by (integro-) partial differential equations (PDEs), directly learning maps between infinite-dimensional function spaces that bypass both the explicit equation identification and their subsequent numerical solving. Still, NOs have so far primarily been employed to explore the dynamical behavior as surrogates of brute-force temporal simulations/predictions. Their potential for systematic rigorous numerical system-level tasks, such as fixed-point, stability, and bifurcation analysis - crucial for predicting irreversible transitions in real-world phenomena - remains largely unexplored. Toward this aim, inspired by the Equation-Free multiscale framework, we propose and implement a framework that integrates (local) NOs with advanced iterative numerical methods in the Krylov subspace, so as to perform efficient system-level stability and bifurcation analysis of large-scale dynamical systems. Beyond fixed point, stability, and bifurcation analysis enabled by local in time NOs, we also demonstrate the usefulness of local in space as well as in space-time (“patch”) NOs in accelerating the computer-aided analysis of spatiotemporal dynamics. We illustrate our framework via three nonlinear PDE benchmarks: the 1D Allen-Cahn equation, which undergoes multiple concatenated pitchfork bifurcations; the Liouville-Bratu-Gelfand PDE, which features a saddle-node tipping point; and the FitzHugh-Nagumo (FHN) model, consisting of two coupled PDEs that exhibit both Hopf and saddle-node bifurcations.
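
The equation-free recipe can be miniaturized: wrap a one-step time map Phi (here a hand-written toy standing in for a trained local neural operator) and hand the residual u - Phi(u) to a matrix-free Newton-Krylov solver, which locates steady states without ever forming the PDE or its Jacobian explicitly.

```python
import numpy as np
from scipy.optimize import newton_krylov

def phi(u: np.ndarray) -> np.ndarray:
    """Toy one-step map (explicit Allen-Cahn-like update, periodic BCs).
    A trained (local) neural operator would take this function's place."""
    lap = np.roll(u, 1) - 2 * u + np.roll(u, -1)
    return u + 0.1 * (lap + u - u ** 3)

residual = lambda u: u - phi(u)               # zero exactly at steady states

rng = np.random.default_rng(0)
u0 = 1.0 + 0.1 * rng.standard_normal(64)      # start near the u = +1 equilibrium
u_star = newton_krylov(residual, u0, f_tol=1e-10)
print(np.abs(residual(u_star)).max())         # ~0: a fixed point of the map
```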

[LG-18] Entropy-Guided Sampling of Flat Modes in Discrete Spaces

链接: https://arxiv.org/abs/2505.02296
作者: Pinaki Mohanty,Riddhiman Bhattacharya,Ruqi Zhang
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Sampling from flat modes in discrete spaces is a crucial yet underexplored problem. Flat modes represent robust solutions and have broad applications in combinatorial optimization and discrete generative modeling. However, existing sampling algorithms often overlook the mode volume and struggle to capture flat modes effectively. To address this limitation, we propose the Entropic Discrete Langevin Proposal (EDLP), which incorporates local entropy into the sampling process through a continuous auxiliary variable under a joint distribution. The local entropy term guides the discrete sampler toward flat modes with a small overhead. We provide non-asymptotic convergence guarantees for EDLP in locally log-concave discrete distributions. Empirically, our method consistently outperforms traditional approaches across tasks that require sampling from flat basins, including Bernoulli distribution, restricted Boltzmann machines, combinatorial optimization, and binary neural networks.

[LG-19] Epistemic Wrapping for Uncertainty Quantification

链接: https://arxiv.org/abs/2505.02277
作者: Maryam Sultana,Neil Yorke-Smith,Kaizheng Wang,Shireen Kudukkil Manchingal,Muhammad Mubashar,Fabio Cuzzolin
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Uncertainty estimation is pivotal in machine learning, especially for classification tasks, as it improves the robustness and reliability of models. We introduce a novel ‘Epistemic Wrapping’ methodology aimed at improving uncertainty estimation in classification. Our approach uses Bayesian Neural Networks (BNNs) as a baseline and transforms their outputs into belief function posteriors, effectively capturing epistemic uncertainty and offering an efficient and general methodology for uncertainty quantification. Comprehensive experiments employing a BNN baseline and an Interval Neural Network for inference on the MNIST, Fashion-MNIST, CIFAR-10 and CIFAR-100 datasets demonstrate that our Epistemic Wrapper significantly enhances generalisation and uncertainty quantification.

[LG-20] Inverse Modeling of Dielectric Response in Time Domain using Physics-Informed Neural Networks

链接: https://arxiv.org/abs/2505.02258
作者: Emir Esenov,Olof Hjortstam,Yuriy Serdyuk,Thomas Hammarström,Christian Häger
类目: Systems and Control (eess.SY); Machine Learning (cs.LG); Computational Physics (physics.comp-ph)
*备注:

点击查看摘要

Abstract:Dielectric response (DR) of insulating materials is key input information for designing electrical insulation systems and defining safe operating conditions of various HV devices. In dielectric materials, different polarization and conduction processes occur at different time scales, making it challenging to physically interpret raw measured data. To analyze DR measurement results, equivalent circuit models (ECMs) are commonly used, reducing the complexity of the physical system to a number of circuit elements that capture the dominant response. This paper examines the use of physics-informed neural networks (PINNs) for inverse modeling of DR in the time domain using parallel RC circuits. To assess their performance, we test PINNs on synthetic data generated from analytical solutions of corresponding ECMs, incorporating Gaussian noise to simulate measurement errors. Our results show that PINNs are highly effective at solving well-conditioned inverse problems, accurately estimating up to five unknown RC parameters with minimal requirements on neural network size, training duration, and hyperparameter tuning. Furthermore, we extend the ECMs to incorporate temperature dependence and demonstrate that PINNs can accurately recover embedded, nonlinear temperature functions from noisy DR data sampled at different temperatures. This case study in modeling DR in the time domain presents a solution with wide-ranging potential applications in disciplines relying on ECMs, utilizing the latest technology in machine learning for scientific computation.
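
A single-branch version of the inverse problem fits in a few lines: a network u(t) fits noisy relaxation-current data while the residual of du/dt + u/(RC) = 0 constrains it, with R and C left trainable. Note that in this stripped-down toy only the time constant tau = RC is identifiable; the architecture and values are illustrative assumptions, not the paper's setup.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
R_true, C_true = 2.0, 0.5                     # tau = R*C = 1.0
t_data = torch.linspace(0, 3, 40).unsqueeze(1)
i_data = torch.exp(-t_data / (R_true * C_true)) + 0.01 * torch.randn_like(t_data)

net = nn.Sequential(nn.Linear(1, 32), nn.Tanh(),
                    nn.Linear(32, 32), nn.Tanh(), nn.Linear(32, 1))
log_R = torch.zeros(1, requires_grad=True)    # log-parameterized for positivity
log_C = torch.zeros(1, requires_grad=True)
opt = torch.optim.Adam(list(net.parameters()) + [log_R, log_C], lr=1e-2)

t_col = torch.linspace(0, 3, 100).unsqueeze(1).requires_grad_(True)
for _ in range(3000):
    opt.zero_grad()
    u = net(t_col)
    du = torch.autograd.grad(u, t_col, torch.ones_like(u), create_graph=True)[0]
    tau = torch.exp(log_R) * torch.exp(log_C)
    physics = ((du + u / tau) ** 2).mean()    # ODE residual at collocation points
    data = ((net(t_data) - i_data) ** 2).mean()
    (physics + data).backward()
    opt.step()

print(f"recovered tau = {float(torch.exp(log_R + log_C)):.3f} (true 1.0)")
```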

[LG-21] Federated Causal Inference in Healthcare: Methods Challenges and Applications

链接: https://arxiv.org/abs/2505.02238
作者: Haoyang Li,Jie Xu,Kyra Gan,Fei Wang,Chengxi Zang
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Federated causal inference enables multi-site treatment effect estimation without sharing individual-level data, offering a privacy-preserving solution for real-world evidence generation. However, data heterogeneity across sites, manifested in differences in covariates, treatments, and outcomes, poses significant challenges for unbiased and efficient estimation. In this paper, we present a comprehensive review and theoretical analysis of federated causal effect estimation across both binary/continuous and time-to-event outcomes. We classify existing methods into weight-based strategies and optimization-based frameworks and further discuss extensions including personalized models, peer-to-peer communication, and model decomposition. For time-to-event outcomes, we examine federated Cox and Aalen-Johansen models, deriving asymptotic bias and variance under heterogeneity. Our analysis reveals that FedProx-style regularization achieves near-optimal bias-variance trade-offs compared to naive averaging and meta-analysis. We review related software tools and conclude by outlining opportunities, challenges, and future directions for scalable, fair, and trustworthy federated causal inference in distributed healthcare systems.

[LG-22] Enhanced Outsourced and Secure Inference for Tall Sparse Decision Trees

链接: https://arxiv.org/abs/2505.02224
作者: Andrew Quijano,Spyros T. Halkidis,Kevin Gallagher,Kemal Akkaya,Nikolaos Samaras
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:A decision tree is an easy-to-understand tool that has been widely used for classification tasks. On the one hand, due to privacy concerns, there has been an urgent need to create privacy-preserving classifiers that conceal the user’s input from the classifier. On the other hand, with the rise of cloud computing, data owners are keen to reduce risk by outsourcing their model, but want security guarantees that third parties cannot steal their decision tree model. To address these issues, Joye and Salehi introduced a theoretical protocol that efficiently evaluates decision trees while maintaining privacy by leveraging their comparison protocol that is resistant to timing attacks. However, their approach was not only inefficient but also prone to side-channel attacks. Therefore, in this paper, we propose a new decision tree inference protocol in which the model is shared and evaluated among multiple entities. We partition our decision tree model by each level to be stored in a new entity we refer to as a “level-site.” Utilizing this approach, we were able to improve the average run time of classifier evaluation for a non-complete tree, while also providing strong mitigations against side-channel attacks.

[LG-23] Practical Efficiency of Muon for Pretraining

链接: https://arxiv.org/abs/2505.02222
作者: Essential AI:Ishaan Shah,Anthony M. Polloreno,Karl Stratos,Philip Monk,Adarsh Chaluvaraju,Andrew Hojel,Andrew Ma,Anil Thomas,Ashish Tanwer,Darsh J Shah,Khoi Nguyen,Kurt Smith,Michael Callahan,Michael Pust,Mohit Parmar,Peter Rushton,Platon Mazarakis,Ritvik Kapila,Saurabh Srivastava,Somanshu Singla,Tim Romanski,Yash Vanjani,Ashish Vaswani
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:We demonstrate that Muon, the simplest instantiation of a second-order optimizer, explicitly expands the Pareto frontier over AdamW on the compute-time tradeoff. We find that Muon is more effective than AdamW in retaining data efficiency at large batch sizes, far beyond the so-called critical batch size, while remaining computationally efficient, thus enabling more economical training. We study the combination of Muon and the maximal update parameterization (muP) for efficient hyperparameter transfer and present a simple telescoping algorithm that accounts for all sources of error in muP while introducing only a modest overhead in resources. We validate our findings through extensive experiments with model sizes up to four billion parameters and ablations on the data distribution and architecture.
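
Muon is publicly documented, so its core update can be sketched from those descriptions: accumulate momentum as usual, then approximately orthogonalize the matrix-shaped update with a few Newton-Schulz iterations before applying it. The coefficients below follow the commonly cited quintic; the learning rate and momentum values are illustrative.

```python
import torch

def newton_schulz(G: torch.Tensor, steps: int = 5, eps: float = 1e-7) -> torch.Tensor:
    """Approximately orthogonalize G via a quintic Newton-Schulz iteration."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (G.norm() + eps)               # bound singular values by 1
    transposed = X.shape[0] > X.shape[1]
    if transposed:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X

def muon_step(W, grad, buf, lr=0.02, momentum=0.95):
    buf.mul_(momentum).add_(grad)          # standard momentum accumulation
    W.add_(newton_schulz(buf), alpha=-lr)  # orthogonalized update

W = torch.randn(128, 64)
buf = torch.zeros_like(W)
muon_step(W, torch.randn_like(W), buf)
```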

[LG-24] An Empirical Study of Qwen3 Quantization

链接: https://arxiv.org/abs/2505.02214
作者: Xingyu Zheng,Yuye Li,Haoran Chu,Yue Feng,Xudong Ma,Jie Luo,Jinyang Guo,Haotong Qin,Michele Magno,Xianglong Liu
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The Qwen series has emerged as a leading family of open-source Large Language Models (LLMs), demonstrating remarkable capabilities in natural language understanding tasks. With the recent release of Qwen3, which exhibits superior performance across diverse benchmarks, there is growing interest in deploying these models efficiently in resource-constrained environments. Low-bit quantization presents a promising solution, yet its impact on Qwen3’s performance remains underexplored. This study conducts a systematic evaluation of Qwen3’s robustness under various quantization settings, aiming to uncover both opportunities and challenges in compressing this state-of-the-art model. We rigorously assess 5 existing classic post-training quantization techniques applied to Qwen3, spanning bit-widths from 1 to 8 bits, and evaluate their effectiveness across multiple datasets. Our findings reveal that while Qwen3 maintains competitive performance at moderate bit-widths, it experiences notable degradation in linguistic tasks under ultra-low precision, underscoring the persistent hurdles in LLM compression. These results emphasize the need for further research to mitigate performance loss in extreme quantization scenarios. We anticipate that this empirical analysis will provide actionable insights for advancing quantization methods tailored to Qwen3 and future LLMs, ultimately enhancing their practicality without compromising accuracy. Our project is released on this https URL and this https URL.

[LG-25] Exogenous Isomorphism for Counterfactual Identifiability ICML2025

链接: https://arxiv.org/abs/2505.02212
作者: Yikang Chen,Dehui Du
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 43 pages, 4 figures. Accepted at ICML 2025 (Spotlight poster)

点击查看摘要

Abstract:This paper investigates $\sim_{\mathcal{L}_3}$-identifiability, a form of complete counterfactual identifiability within the Pearl Causal Hierarchy (PCH) framework, ensuring that all Structural Causal Models (SCMs) satisfying the given assumptions provide consistent answers to all causal questions. To simplify this problem, we introduce exogenous isomorphism and propose $\sim_{\mathrm{EI}}$-identifiability, reflecting the strength of model identifiability required for $\sim_{\mathcal{L}_3}$-identifiability. We explore sufficient assumptions for achieving $\sim_{\mathrm{EI}}$-identifiability in two special classes of SCMs: Bijective SCMs (BSCMs), based on counterfactual transport, and Triangular Monotonic SCMs (TM-SCMs), which extend $\sim_{\mathcal{L}_2}$-identifiability. Our results unify and generalize existing theories, providing theoretical guarantees for practical applications. Finally, we leverage neural TM-SCMs to address the consistency problem in counterfactual reasoning, with experiments validating both the effectiveness of our method and the correctness of the theory.

[LG-26] Efficient FPGA Implementation of Time-Domain Popcount for Low-Complexity Machine Learning

链接: https://arxiv.org/abs/2505.02181
作者: Shengyu Duan,Marcos L. L. Sartori,Rishad Shafik,Alex Yakovlev,Emre Ozer
类目: Machine Learning (cs.LG); Hardware Architecture (cs.AR)
*备注:

点击查看摘要

Abstract:Population count (popcount) is a crucial operation for many low-complexity machine learning (ML) algorithms, including the Tsetlin Machine (TM), a promising new ML method particularly well-suited for solving classification tasks. The inference mechanism in TM consists of propositional logic-based structures within each class, followed by a majority voting scheme, which makes the classification decision. In TM, the voters are the outputs of Boolean clauses. The voting mechanism comprises two operations: popcount for each class and determining the class with the maximum vote by means of an argmax operation. While TMs offer a lightweight ML alternative, their performance is often limited by the high computational cost of popcount and comparison required to produce the argmax result. In this paper, we propose an innovative approach to accelerate and optimize these operations by performing them in the time domain. Our time-domain implementation uses programmable delay lines (PDLs) and arbiters to efficiently manage these tasks through delay-based mechanisms. We also present an FPGA design flow for practical implementation of the time-domain popcount, addressing delay skew and ensuring that the behavior matches that of the model’s intended functionality. By leveraging the natural compatibility of the proposed popcount with asynchronous architectures, we demonstrate significant improvements in an asynchronous TM, including up to 38% reduction in latency, 43.1% reduction in dynamic power, and 15% savings in resource utilization, compared to synchronous TMs using adder-based popcount.
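
For reference, the operation the paper moves into the time domain looks like this in plain software: a signed popcount over clause outputs per class, followed by an argmax. The clause outputs and polarity convention below are random stand-ins for a trained TM.

```python
import numpy as np

rng = np.random.default_rng(1)
n_classes, n_clauses = 10, 256
clause_out = rng.integers(0, 2, size=(n_classes, n_clauses))  # Boolean clause votes
polarity = np.where(np.arange(n_clauses) % 2 == 0, 1, -1)     # +/- clause pairs

votes = (clause_out * polarity).sum(axis=1)   # signed popcount per class
print(votes, "->", int(np.argmax(votes)))     # argmax yields the predicted class
```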

[LG-27] Efficient Multivariate Time Series Forecasting via Calibrated Language Models with Privileged Knowledge Distillation ICDE2025

链接: https://arxiv.org/abs/2505.02138
作者: Chenxi Liu,Shaowen Zhou,Hao Miao,Qianxiong Xu,Cheng Long,Ziyue Li,Rui Zhao
类目: Machine Learning (cs.LG)
*备注: Accepted by ICDE 2025

点击查看摘要

Abstract:Multivariate time series forecasting (MTSF) endeavors to predict future observations given historical data, playing a crucial role in time series data management systems. With advancements in large language models (LLMs), recent studies employ textual prompt tuning to infuse the knowledge of LLMs into MTSF. However, the deployment of LLMs often suffers from low efficiency during the inference phase. To address this problem, we introduce TimeKD, an efficient MTSF framework that leverages the calibrated language models and privileged knowledge distillation. TimeKD aims to generate high-quality future representations from the proposed cross-modality teacher model and cultivate an effective student model. The cross-modality teacher model adopts calibrated language models (CLMs) with ground truth prompts, motivated by the paradigm of Learning Under Privileged Information (LUPI). In addition, we design a subtractive cross attention (SCA) mechanism to refine these representations. To cultivate an effective student model, we propose an innovative privileged knowledge distillation (PKD) mechanism including correlation and feature distillation. PKD enables the student to replicate the teacher’s behavior while minimizing their output discrepancy. Extensive experiments on real data offer insight into the effectiveness, efficiency, and scalability of the proposed TimeKD.

[LG-28] GRAIL: Graph Edit Distance and Node Alignment Using LLM-Generated Code

链接: https://arxiv.org/abs/2505.02124
作者: Samidha Verma,Arushi Goyal,Ananya Mathur,Ankit Anand,Sayan Ranu
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Graph Edit Distance (GED) is a widely used metric for measuring similarity between two graphs. Computing the optimal GED is NP-hard, leading to the development of various neural and non-neural heuristics. While neural methods have achieved improved approximation quality compared to non-neural approaches, they face significant challenges: (1) They require large amounts of ground truth data, which is itself NP-hard to compute. (2) They operate as black boxes, offering limited interpretability. (3) They lack cross-domain generalization, necessitating expensive retraining for each new dataset. We address these limitations with GRAIL, introducing a paradigm shift in this domain. Instead of training a neural model to predict GED, GRAIL employs a novel combination of large language models (LLMs) and automated prompt tuning to generate a program that is used to compute GED. This shift from predicting GED to generating programs imparts various advantages, including end-to-end interpretability and an autonomous self-evolutionary learning mechanism without ground-truth supervision. Extensive experiments on seven datasets confirm that GRAIL not only surpasses state-of-the-art GED approximation methods in prediction quality but also achieves robust cross-domain generalization across diverse graph distributions.

[LG-29] Deep Representation Learning for Electronic Design Automation

链接: https://arxiv.org/abs/2505.02105
作者: Pratik Shrestha,Saran Phatharodom,Alec Aversa,David Blankenship,Zhengfeng Wu,Ioannis Savidis
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Representation learning has become an effective technique utilized by electronic design automation (EDA) algorithms, which leverage the natural representation of workflow elements as images, grids, and graphs. By addressing challenges related to the increasing complexity of circuits and stringent power, performance, and area (PPA) requirements, representation learning facilitates the automatic extraction of meaningful features from complex data formats, including images, grids, and graphs. This paper examines the application of representation learning in EDA, covering foundational concepts and analyzing prior work and case studies on tasks that include timing prediction, routability analysis, and automated placement. Key techniques, including image-based methods, graph-based approaches, and hybrid multimodal solutions, are presented to illustrate the improvements achieved in routing, timing, and parasitic prediction. These advances demonstrate the potential of representation learning to enhance efficiency, accuracy, and scalability in current integrated circuit design flows.

[LG-30] Learning Local Causal World Models with State Space Models and Attention

链接: https://arxiv.org/abs/2505.02074
作者: Francesco Petri,Luigi Asprino,Aldo Gangemi
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:World modelling, i.e. building a representation of the rules that govern the world so as to predict its evolution, is an essential ability for any agent interacting with the physical world. Despite their impressive performance, many solutions fail to learn a causal representation of the environment they are trying to model, which would be necessary to gain a deep enough understanding of the world to perform complex tasks. With this work, we aim to broaden the research in the intersection of causality theory and neural world modelling by assessing the potential for causal discovery of the State Space Model (SSM) architecture, which has been shown to have several advantages over the widespread Transformer. We show empirically that, compared to an equivalent Transformer, an SSM can model the dynamics of a simple environment and learn a causal model at the same time with equivalent or better performance, thus paving the way for further experiments that lean into the strength of SSMs and further enhance them with causal awareness.

[LG-31] Neural Logistic Bandits

链接: https://arxiv.org/abs/2505.02069
作者: Seoungbin Bae,Dabeen Lee
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:We study the problem of neural logistic bandits, where the main task is to learn an unknown reward function within a logistic link function using a neural network. Existing approaches either exhibit unfavorable dependencies on $\kappa$, where $1/\kappa$ represents the minimum variance of reward distributions, or suffer from direct dependence on the feature dimension $d$, which can be huge in neural network-based settings. In this work, we introduce a novel Bernstein-type inequality for self-normalized vector-valued martingales that is designed to bypass a direct dependence on the ambient dimension. This lets us deduce a regret upper bound that grows with the effective dimension $\widetilde{d}$, not the feature dimension, while keeping a minimal dependence on $\kappa$. Based on the concentration inequality, we propose two algorithms, NeuralLog-UCB-1 and NeuralLog-UCB-2, that guarantee regret upper bounds of order $\widetilde{O}(\widetilde{d}\sqrt{\kappa T})$ and $\widetilde{O}(\widetilde{d}\sqrt{T/\kappa})$, respectively, improving on the existing results. Lastly, we report numerical results on both synthetic and real datasets to validate our theoretical findings.

[LG-32] Secrets of GFlowNets Learning Behavior: A Theoretical Study

链接: https://arxiv.org/abs/2505.02035
作者: Tianshu Yu
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Generative Flow Networks (GFlowNets) have emerged as a powerful paradigm for generating composite structures, demonstrating considerable promise across diverse applications. While substantial progress has been made in exploring their modeling validity and connections to other generative frameworks, the theoretical understanding of their learning behavior remains largely uncharted. In this work, we present a rigorous theoretical investigation of GFlowNets’ learning behavior, focusing on four fundamental dimensions: convergence, sample complexity, implicit regularization, and robustness. By analyzing these aspects, we seek to elucidate the intricate mechanisms underlying GFlowNet’s learning dynamics, shedding light on its strengths and limitations. Our findings contribute to a deeper understanding of the factors influencing GFlowNet performance and provide insights into principled guidelines for their effective design and deployment. This study not only bridges a critical gap in the theoretical landscape of GFlowNets but also lays the foundation for their evolution as a reliable and interpretable framework for generative modeling. Through this, we aspire to advance the theoretical frontiers of GFlowNets and catalyze their broader adoption in the AI community.

[LG-33] Quantum-Enhanced Classification of Brain Tumors Using DNA Microarray Gene Expression Profiles

链接: https://arxiv.org/abs/2505.02033
作者: Emine Akpinar,Batuhan Hangun,Murat Oduncuoglu,Oguz Altun,Onder Eyecioglu,Zeynel Yalcin
类目: Machine Learning (cs.LG); Genomics (q-bio.GN); Molecular Networks (q-bio.MN)
*备注:

点击查看摘要

Abstract:DNA microarray technology enables the simultaneous measurement of expression levels of thousands of genes, thereby facilitating the understanding of the molecular mechanisms underlying complex diseases such as brain tumors and the identification of diagnostic genetic signatures. To derive meaningful biological insights from the high-dimensional and complex gene features obtained through this technology and to analyze gene properties in detail, classical AI-based approaches such as machine learning and deep learning are widely employed. However, these methods face various limitations in managing high-dimensional vector spaces and modeling the intricate relationships among genes. In particular, challenges such as hyperparameter tuning, computational costs, and high processing power requirements can hinder their efficiency. To overcome these limitations, quantum computing and quantum AI approaches are gaining increasing attention. Leveraging quantum properties such as superposition and entanglement, quantum methods enable more efficient parallel processing of high-dimensional data and offer faster and more effective solutions to problems that are computationally demanding for classical methods. In this study, a novel model called “Deep VQC” is proposed, based on the Variational Quantum Classifier approach. Developed using microarray data containing 54,676 gene features, the model successfully classified four different types of brain tumors-ependymoma, glioblastoma, medulloblastoma, and pilocytic astrocytoma-alongside healthy samples with high accuracy. Furthermore, compared to classical ML algorithms, our model demonstrated either superior or comparable classification performance. These results highlight the potential of quantum AI methods as an effective and promising approach for the analysis and classification of complex structures such as brain tumors based on gene expression features.

[LG-34] NbBench: Benchmarking Language Models for Comprehensive Nanobody Tasks

链接: https://arxiv.org/abs/2505.02022
作者: Yiming Zhang,Koji Tsuda
类目: Machine Learning (cs.LG); Biomolecules (q-bio.BM)
*备注:

点击查看摘要

Abstract:Nanobodies, single-domain antibody fragments derived from camelid heavy-chain-only antibodies, exhibit unique advantages such as compact size, high stability, and strong binding affinity, making them valuable tools in therapeutics and diagnostics. While recent advances in pretrained protein and antibody language models (PPLMs and PALMs) have greatly enhanced biomolecular understanding, nanobody-specific modeling remains underexplored and lacks a unified benchmark. To address this gap, we introduce NbBench, the first comprehensive benchmark suite for nanobody representation learning. Spanning eight biologically meaningful tasks across nine curated datasets, NbBench encompasses structure annotation, binding prediction, and developability assessment. We systematically evaluate eleven representative models–including general-purpose protein LMs, antibody-specific LMs, and nanobody-specific LMs–in a frozen setting. Our analysis reveals that antibody language models excel in antigen-related tasks, while performance on regression tasks such as thermostability and affinity remains challenging across all models. Notably, no single model consistently outperforms others across all tasks. By standardizing datasets, task definitions, and evaluation protocols, NbBench offers a reproducible foundation for assessing and advancing nanobody modeling.

[LG-35] Meta-Black-Box-Optimization through Offline Q-function Learning ICML2025

链接: https://arxiv.org/abs/2505.02010
作者: Zeyuan Ma,Zhiguang Cao,Zhou Jiang,Hongshu Guo,Yue-Jiao Gong
类目: Neural and Evolutionary Computing (cs.NE); Machine Learning (cs.LG)
*备注: Accepted as poster by ICML 2025

点击查看摘要

Abstract:Recent progress in Meta-Black-Box-Optimization (MetaBBO) has demonstrated that using RL to learn a meta-level policy for dynamic algorithm configuration (DAC) over an optimization task distribution could significantly enhance the performance of the low-level BBO algorithm. However, the online learning paradigm adopted in existing works limits the efficiency of MetaBBO. To address this, we propose an offline learning-based MetaBBO framework in this paper, termed Q-Mamba, to attain both effectiveness and efficiency in MetaBBO. Specifically, we first transform the DAC task into a long-sequence decision process. This allows us to further introduce an effective Q-function decomposition mechanism to reduce the learning difficulty within the intricate algorithm configuration space. Under this setting, we propose three novel designs to meta-learn the DAC policy from offline data: we first propose a novel collection strategy for constructing an offline DAC experience dataset with balanced exploration and exploitation. We then establish a decomposition-based Q-loss that incorporates conservative Q-learning to promote stable offline learning from the offline dataset. To further improve offline learning efficiency, we equip our work with a Mamba architecture, which aids long-sequence learning effectiveness and efficiency through its selective state-space model and hardware-aware parallel scan, respectively. Through extensive benchmarking, we observe that Q-Mamba achieves competitive or even superior performance to prior online/offline baselines, while significantly improving the training efficiency of existing online baselines. We provide the source code of Q-Mamba at this https URL.

[LG-36] D3HRL: A Distributed Hierarchical Reinforcement Learning Approach Based on Causal Discovery and Spurious Correlation Detection

链接: https://arxiv.org/abs/2505.01979
作者: Chenran Zhao,Dianxi Shi,Mengzhu Wang,Jianqiang Xia,Huanhuan Yang,Songchang Jin,Shaowu Yang,Chunping Qiu
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Current Hierarchical Reinforcement Learning (HRL) algorithms excel in long-horizon sequential decision-making tasks but still face two challenges: delay effects and spurious correlations. To address them, we propose a causal HRL approach called D3HRL. First, D3HRL models delayed effects as causal relationships across different time spans and employs distributed causal discovery to learn these relationships. Second, it employs conditional independence testing to eliminate spurious correlations. Finally, D3HRL constructs and trains hierarchical policies based on the identified true causal relationships. These three steps are iteratively executed, gradually exploring the complete causal chain of the task. Experiments conducted in 2D-MineCraft and MiniGrid show that D3HRL demonstrates superior sensitivity to delay effects and accurately identifies causal relationships, leading to reliable decision-making in complex environments.

[LG-37] EnsembleCI: Ensemble Learning for Carbon Intensity Forecasting

链接: https://arxiv.org/abs/2505.01959
作者: Leyi Yan,Linda Wang,Sihang Liu,Yi Ding
类目: Machine Learning (cs.LG)
*备注: 5 pages, 5 figures, 3 tables, In The 15th ACM International Conference on Future and Sustainable Energy Systems (E-ENERGY’25)

点击查看摘要

Abstract:Carbon intensity (CI) measures the average carbon emissions generated per unit of electricity, making it a crucial metric for quantifying and managing the environmental impact. Accurate CI predictions are vital for minimizing carbon footprints, yet the state-of-the-art method (CarbonCast) falls short due to its inability to address regional variability and lack of adaptability. To address these limitations, we introduce EnsembleCI, an adaptive, end-to-end ensemble learning-based approach for CI forecasting. EnsembleCI combines weighted predictions from multiple sublearners, offering enhanced flexibility and regional adaptability. In evaluations across 11 regional grids, EnsembleCI consistently surpasses CarbonCast, achieving the lowest mean absolute percentage error (MAPE) in almost all grids and improving prediction accuracy by an average of 19.58%. While performance still varies across grids due to inherent regional diversity, EnsembleCI reduces variability and exhibits greater robustness in long-term forecasting compared to CarbonCast; it also identifies region-specific key features, underscoring its interpretability and practical relevance. These findings position EnsembleCI as a more accurate and reliable solution for CI forecasting. EnsembleCI source code and data used in this paper are available at this https URL.
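
The generic shape of the weighted-sublearner combination is easy to demonstrate; the sublearners, features, and least-squares weighting below are placeholders for illustration, not EnsembleCI's actual design.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(600, 8))                         # grid features (hypothetical)
y = 400 + 50 * X[:, 0] + 20 * np.sin(X[:, 1]) + rng.normal(0, 5, 600)  # toy CI
X_tr, X_val, y_tr, y_val = X[:400], X[400:], y[:400], y[400:]

subs = [GradientBoostingRegressor().fit(X_tr, y_tr),
        RandomForestRegressor().fit(X_tr, y_tr),
        LinearRegression().fit(X_tr, y_tr)]
P = np.column_stack([m.predict(X_val) for m in subs])
w, *_ = np.linalg.lstsq(P, y_val, rcond=None)         # blend weights on validation
mape = np.mean(np.abs((P @ w - y_val) / y_val)) * 100
print(f"weights={np.round(w, 2)}, validation MAPE={mape:.2f}%")
```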

[LG-38] Semantic Probabilistic Control of Language Models

链接: https://arxiv.org/abs/2505.01954
作者: Kareem Ahmed,Catarina G Belem,Padhraic Smyth,Sameer Singh
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Semantic control entails steering LM generations towards satisfying subtle non-lexical constraints, e.g., toxicity, sentiment, or politeness, attributes that can be captured by a sequence-level verifier. It can thus be viewed as sampling from the LM distribution conditioned on the target attribute, a computationally intractable problem due to the non-decomposable nature of the verifier. Existing approaches to LM control either only deal with syntactic constraints which cannot capture the aforementioned attributes, or rely on sampling to explore the conditional LM distribution, an ineffective estimator for low-probability events. In this work, we leverage a verifier’s gradient information to efficiently reason over all generations that satisfy the target attribute, enabling precise steering of LM generations by reweighing the next-token distribution. Starting from an initial sample, we create a local LM distribution favoring semantically similar sentences. This approximation enables the tractable computation of an expected sentence embedding. We use this expected embedding, informed by the verifier’s evaluation at the initial sample, to estimate the probability of satisfying the constraint, which directly informs the update to the next-token distribution. We evaluated the effectiveness of our approach in controlling the toxicity, sentiment, and topic-adherence of LMs yielding generations satisfying the constraint with high probability (95%) without degrading their quality.

[LG-39] Runtime Anomaly Detection for Drones: An Integrated Rule-Mining and Unsupervised-Learning Approach CEC

链接: https://arxiv.org/abs/2505.01947
作者: Ivan Tan,Wei Minn,Christopher M. Poskitt,Lwin Khin Shar,Lingxiao Jiang
类目: Software Engineering (cs.SE); Machine Learning (cs.LG); Robotics (cs.RO)
*备注: Accepted by the 29th International Conference on Engineering of Complex Computer Systems (ICECCS 2025)

点击查看摘要

Abstract:UAVs, commonly referred to as drones, have witnessed a remarkable surge in popularity due to their versatile applications. These cyber-physical systems depend on multiple sensor inputs, such as cameras, GPS receivers, accelerometers, and gyroscopes, with faults potentially leading to physical instability and serious safety concerns. To mitigate such risks, anomaly detection has emerged as a crucial safeguarding mechanism, capable of identifying the physical manifestations of emerging issues and allowing operators to take preemptive action at runtime. Recent anomaly detection methods based on LSTM neural networks have shown promising results, but three challenges persist: the need for models that can generalise across the diverse mission profiles of drones; the need for interpretability, enabling operators to understand the nature of detected problems; and the need for capturing domain knowledge that is difficult to infer solely from log data. Motivated by these challenges, this paper introduces RADD, an integrated approach to anomaly detection in drones that combines rule mining and unsupervised learning. In particular, we leverage rules (or invariants) to capture expected relationships between sensors and actuators during missions, and utilise unsupervised learning techniques to cover more subtle relationships that the rules may have missed. We implement this approach using the ArduPilot drone software in the Gazebo simulator, utilising 44 rules derived across the main phases of drone missions, in conjunction with an ensemble of five unsupervised learning models. We find that our integrated approach successfully detects 93.84% of anomalies over six types of faults with a low false positive rate (2.33%), and can be deployed effectively at runtime. Furthermore, RADD outperforms a state-of-the-art LSTM-based method in detecting the different types of faults evaluated in our study.

[LG-40] Faster logconcave sampling from a cold start in high dimension

链接: https://arxiv.org/abs/2505.01937
作者: Yunbum Kook,Santosh S. Vempala
类目: Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG); Functional Analysis (math.FA); Statistics Theory (math.ST); Machine Learning (stat.ML)
*备注: 56 pages

点击查看摘要

Abstract:We present a faster algorithm to generate a warm start for sampling an arbitrary logconcave density specified by an evaluation oracle, leading to the first sub-cubic sampling algorithms for inputs in (near-)isotropic position. A long line of prior work incurred a warm-start penalty of at least linear in the dimension, hitting a cubic barrier, even for the special case of uniform sampling from convex bodies. Our improvement relies on two key ingredients of independent interest. (1) We show how to sample given a warm start in weaker notions of distance, in particular $q$-Rényi divergence for $q = \widetilde{\mathcal{O}}(1)$, whereas previous analyses required stringent $\infty$-Rényi divergence (with the exception of Hit-and-Run, whose known mixing time is higher). This marks the first improvement in the required warmness since Lovász and Simonovits (1991). (2) We refine and generalize the log-Sobolev inequality of Lee and Vempala (2018), originally established for isotropic logconcave distributions in terms of the diameter of the support, to logconcave distributions in terms of a geometric average of the support diameter and the largest eigenvalue of the covariance matrix.

[LG-41] Unemployment Dynamics Forecasting with Machine Learning Regression Models

链接: https://arxiv.org/abs/2505.01933
作者: Kyungsu Kim
类目: Machine Learning (cs.LG); Econometrics (econ.EM)
*备注: 18 pages, 2 charts

点击查看摘要

Abstract:In this paper, I explored how a range of regression and machine learning techniques can be applied to monthly U.S. unemployment data to produce timely forecasts. I compared seven models: Linear Regression, SGDRegressor, Random Forest, XGBoost, CatBoost, Support Vector Regression, and an LSTM network, training each on a historical span of data and then evaluating on a later hold-out period. Input features include macro indicators (GDP growth, CPI), labor market measures (job openings, initial claims), financial variables (interest rates, equity indices), and consumer sentiment. I tuned model hyperparameters via cross-validation and assessed performance with standard error metrics and the ability to predict the correct unemployment direction. Across the board, tree-based ensembles (and CatBoost in particular) deliver noticeably better forecasts than simple linear approaches, while the LSTM captures underlying temporal patterns more effectively than other nonlinear methods. SVR and SGDRegressor yield modest gains over standard regression but don’t match the consistency of the ensemble and deep-learning models. Interpretability tools (feature importance rankings and SHAP values) point to job openings and consumer sentiment as the most influential predictors across all methods. By directly comparing linear, ensemble, and deep-learning approaches on the same dataset, our study shows how modern machine-learning techniques can enhance real-time unemployment forecasting, offering economists and policymakers richer insights into labor market trends. In the comparative evaluation of the models, I employed a dataset comprising thirty distinct features over the period from January 2020 through December 2024.

[LG-42] Discrete Spatial Diffusion: Intensity-Preserving Diffusion Modeling

链接: https://arxiv.org/abs/2505.01917
作者: Javier E. Santos,Agnese Marcato,Roman Colman,Nicholas Lubbers,Yen Ting Lin
类目: Graphics (cs.GR); Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
*备注:

点击查看摘要

Abstract:Generative diffusion models have achieved remarkable success in producing high-quality images. However, because these models typically operate in continuous intensity spaces - diffusing independently per pixel and color channel - they are fundamentally ill-suited for applications where quantities such as particle counts or material units are inherently discrete and governed by strict conservation laws such as mass preservation, limiting their applicability in scientific workflows. To address this limitation, we propose Discrete Spatial Diffusion (DSD), a framework based on a continuous-time, discrete-state jump stochastic process that operates directly in discrete spatial domains while strictly preserving mass in both forward and reverse diffusion processes. By using spatial diffusion to achieve mass preservation, we introduce stochasticity naturally through a discrete formulation. We demonstrate the expressive flexibility of DSD by performing image synthesis, class conditioning, and image inpainting across widely-used image benchmarks, with the ability to condition on image intensity. Additionally, we highlight its applicability to domain-specific scientific data for materials microstructure, bridging the gap between diffusion models and mass-conditioned scientific applications.

[LG-43] From Players to Champions: A Generalizable Machine Learning Approach for Match Outcome Prediction with Insights from the FIFA World Cup

链接: https://arxiv.org/abs/2505.01902
作者: Ali Al-Bustami,Zaid Ghazal
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Accurate prediction of FIFA World Cup match outcomes holds significant value for analysts, coaches, bettors, and fans. This paper presents a machine learning framework specifically designed to forecast match winners in the FIFA World Cup. By integrating both team-level historical data and player-specific performance metrics such as goals, assists, passing accuracy, and tackles, we capture nuanced interactions often overlooked by traditional aggregate models. Our methodology processes multi-year data to create year-specific team profiles that account for evolving rosters and player development. We employ classification techniques complemented by dimensionality reduction and hyperparameter optimization, to yield robust predictive models. Experimental results on data from the FIFA 2022 World Cup demonstrate our approach’s superior accuracy compared to baseline methods. Our findings highlight the importance of incorporating individual player attributes and team-level composition to enhance predictive performance, offering new insights into player synergy, strategic match-ups, and tournament progression scenarios. This work underscores the transformative potential of rich, player-centric data in sports analytics, setting a foundation for future exploration of advanced learning architectures such as graph neural networks to model complex team interactions.

[LG-44] Towards Trustworthy Federated Learning with Untrusted Participants

链接: https://arxiv.org/abs/2505.01874
作者: Youssef Allouah,Rachid Guerraoui,John Stephan
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注: arXiv admin note: text overlap with arXiv:2302.04787

点击查看摘要

Abstract:Resilience against malicious parties and data privacy are essential for trustworthy distributed learning, yet achieving both with good utility typically requires the strong assumption of a trusted central server. This paper shows that a significantly weaker assumption suffices: each pair of workers shares a randomness seed unknown to others. In a setting where malicious workers may collude with an untrusted server, we propose CafCor, an algorithm that integrates robust gradient aggregation with correlated noise injection, leveraging shared randomness between workers. We prove that CafCor achieves strong privacy-utility trade-offs, significantly outperforming local differential privacy (DP) methods, which do not make any trust assumption, while approaching central DP utility, where the server is fully trusted. Empirical results on standard benchmarks validate CafCor’s practicality, showing that privacy and robustness can coexist in distributed systems without sacrificing utility or trusting the server.
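
The shared-seed ingredient has a neat closed form: workers i &lt; j derive the same noise from their pairwise seed, one adds it and the other subtracts it, so every individual update is masked while the noise cancels exactly in the honest aggregate. The sketch below shows only this masking trick, not CafCor's robust-aggregation half.

```python
import numpy as np

def pairwise_noise(i: int, n_workers: int, dim: int, scale: float = 1.0):
    """Noise worker i adds, built from seeds shared with every other worker."""
    total = np.zeros(dim)
    for j in range(n_workers):
        if j == i:
            continue
        seed = hash((min(i, j), max(i, j))) % (2**32)  # shared pairwise seed
        z = np.random.default_rng(seed).normal(0, scale, dim)
        total += z if i < j else -z                    # opposite signs per pair
    return total

n, dim = 5, 4
updates = [np.full(dim, k, dtype=float) for k in range(n)]   # true updates
masked = [u + pairwise_noise(i, n, dim) for i, u in enumerate(updates)]
print(np.mean(masked, axis=0))   # equals the mean of true updates: noise cancels
print(np.mean(updates, axis=0))
```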

[LG-45] PQS-BFL: A Post-Quantum Secure Blockchain-based Federated Learning Framework

链接: https://arxiv.org/abs/2505.01866
作者: Daniel Commey,Garth V. Crosby
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Federated Learning (FL) enables collaborative model training while preserving data privacy, but its classical cryptographic underpinnings are vulnerable to quantum attacks. This vulnerability is particularly critical in sensitive domains like healthcare. This paper introduces PQS-BFL (Post-Quantum Secure Blockchain-based Federated Learning), a framework integrating post-quantum cryptography (PQC) with blockchain verification to secure FL against quantum adversaries. We employ ML-DSA-65 (a FIPS 204 standard candidate, formerly Dilithium) signatures to authenticate model updates and leverage optimized smart contracts for decentralized validation. Extensive evaluations on diverse datasets (MNIST, SVHN, HAR) demonstrate that PQS-BFL achieves efficient cryptographic operations (average PQC sign time: 0.65 ms, verify time: 0.53 ms) with a fixed signature size of 3309 Bytes. Blockchain integration incurs a manageable overhead, with average transaction times around 4.8 s and gas usage per update averaging 1.72 x 10^6 units for PQC configurations. Crucially, the cryptographic overhead relative to transaction time remains minimal (around 0.01-0.02% for PQC with blockchain), confirming that PQC performance is not the bottleneck in blockchain-based FL. The system maintains competitive model accuracy (e.g., over 98.8% for MNIST with PQC) and scales effectively, with round times showing sublinear growth with increasing client numbers. Our open-source implementation and reproducible benchmarks validate the feasibility of deploying long-term, quantum-resistant security in practical FL systems.

[LG-46] An LSTM-PINN Hybrid Method to the specific problem of population forecasting

链接: https://arxiv.org/abs/2505.01819
作者: Ze Tao
类目: Machine Learning (cs.LG)
*备注: 9 pages,6 figures

点击查看摘要

Abstract:Deep learning has emerged as a powerful tool in scientific modeling, particularly for complex dynamical systems; however, accurately capturing age-structured population dynamics under policy-driven fertility changes remains a significant challenge due to the lack of effective integration between domain knowledge and long-term temporal dependencies. To address this issue, we propose two physics-informed deep learning frameworks, PINN and LSTM-PINN, that incorporate policy-aware fertility functions into a transport-reaction partial differential equation to simulate population evolution from 2024 to 2054. The standard PINN model enforces the governing equation and boundary conditions via collocation-based training, enabling accurate learning of underlying population dynamics and ensuring stable convergence. Building on this, the LSTM-PINN framework integrates sequential memory mechanisms to effectively capture long-range dependencies in the age-time domain, achieving robust training performance across multiple loss components. Simulation results under three distinct fertility policy scenarios (the Three-child policy, the Universal two-child policy, and the Separate two-child policy) demonstrate the models' ability to reflect policy-sensitive demographic shifts and highlight the effectiveness of integrating domain knowledge into data-driven forecasting. This study provides a novel and extensible framework for modeling age-structured population dynamics under policy interventions, offering valuable insights for data-informed demographic forecasting and long-term policy planning in the face of emerging population challenges.
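
For readers unfamiliar with the collocation-based PINN training mentioned above, the sketch below shows the core step: differentiate a network u(a, t) with autograd and penalize the residual of an age-structured transport-reaction equation. The McKendrick-von Foerster form du/dt + du/da = -mu(a)u, the toy mortality rate, and the network size are stand-in assumptions, not the paper's exact setup.

```python
# A minimal collocation-style PINN residual for an age-structured
# transport-reaction PDE, du/dt + du/da = -mu(a) * u (assumed form).
import torch
import torch.nn as nn

net = nn.Sequential(nn.Linear(2, 64), nn.Tanh(),
                    nn.Linear(64, 64), nn.Tanh(),
                    nn.Linear(64, 1))
mu = lambda a: 0.01 + 0.001 * a            # toy mortality rate

def pde_residual(a, t):
    x = torch.stack([a, t], dim=1).requires_grad_(True)
    u = net(x)
    grads = torch.autograd.grad(u.sum(), x, create_graph=True)[0]
    u_a, u_t = grads[:, 0:1], grads[:, 1:2]
    return u_t + u_a + mu(x[:, 0:1]) * u   # should vanish at collocation points

a = torch.rand(256) * 100.0                # ages 0..100
t = torch.rand(256) * 30.0                 # years 0..30
loss = (pde_residual(a, t) ** 2).mean()
loss.backward()                            # one collocation training step
```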

[LG-47] Rogue Cell: Adversarial Attack and Defense in Untrusted O-RAN Setup Exploiting the Traffic Steering xApp

链接: https://arxiv.org/abs/2505.01816
作者: Eran Aizikovich,Dudu Mimran,Edita Grolman,Yuval Elovici,Asaf Shabtai
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The Open Radio Access Network (O-RAN) architecture is revolutionizing cellular networks with its open, multi-vendor design and AI-driven management, aiming to enhance flexibility and reduce costs. Although it has many advantages, O-RAN is not threat-free. While previous studies have mainly examined vulnerabilities arising from O-RAN’s intelligent components, this paper is the first to focus on the security challenges and vulnerabilities introduced by transitioning from single-operator to multi-operator RAN architectures. This shift increases the risk of untrusted third-party operators managing different parts of the network. To explore these vulnerabilities and their potential mitigation, we developed an open-access testbed environment that integrates a wireless network simulator with the official O-RAN Software Community (OSC) RAN intelligent component (RIC) cluster. This environment enables realistic, live data collection and serves as a platform for demonstrating APATE (adversarial perturbation against traffic efficiency), an evasion attack in which a malicious cell manipulates its reported key performance indicators (KPIs) and deceives the O-RAN traffic steering to gain unfair allocations of user equipment (UE). To ensure that O-RAN’s legitimate activity continues, we introduce MARRS (monitoring adversarial RAN reports), a detection framework based on a long-short term memory (LSTM) autoencoder (AE) that learns contextual features across the network to monitor malicious telemetry (also demonstrated in our testbed). Our evaluation showed that by executing APATE, an attacker can obtain a 248.5% greater UE allocation than it was supposed to in a benign scenario. In addition, the MARRS detection method was also shown to successfully classify malicious cell activity, achieving accuracy of 99.2% and an F1 score of 0.978.

[LG-48] Conformal Prediction for Indoor Positioning with Correctness Coverage Guarantees

链接: https://arxiv.org/abs/2505.01810
作者: Zhiyi Zhou,Hexin Peng,Hongyu Long
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:With the advancement of Internet of Things (IoT) technologies, high-precision indoor positioning has become essential for Location-Based Services (LBS) in complex indoor environments. Fingerprint-based localization is popular, but traditional algorithms and deep learning-based methods face challenges such as poor generalization, overfitting, and lack of interpretability. This paper applies conformal prediction (CP) to deep learning-based indoor positioning. CP transforms the uncertainty of the model into a non-conformity score, constructs prediction sets to ensure correctness coverage, and provides statistical guarantees. We also introduce conformal risk control for path navigation tasks to manage the false discovery rate (FDR) and the false negative rate (FNR). The model achieved an accuracy of approximately 100% on the training dataset and 85% on the testing dataset, effectively demonstrating its performance and generalization capability. Furthermore, we develop a conformal p-value framework to control the proportion of position-error points. Experiments on the UJIIndoorLoc dataset using lightweight models such as MobileNetV1, VGG19, MobileNetV2, ResNet50, and EfficientNet show that the conformal prediction technique can effectively approximate the target coverage, and that the models differ in prediction set size and uncertainty quantification.
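
As a concrete illustration of the prediction-set construction described above, a split-conformal calibration for position error looks roughly as follows; the synthetic data and the Euclidean-error nonconformity score are placeholder assumptions, not the paper's networks or dataset.

```python
# Minimal split-conformal sketch: calibrate a radius so the true position
# falls inside it with probability >= 1 - alpha (under exchangeability).
import numpy as np

rng = np.random.default_rng(0)
pred_cal = rng.uniform(0, 50, size=(500, 2))           # predicted (x, y)
true_cal = pred_cal + rng.normal(0, 2.0, size=(500, 2))

scores = np.linalg.norm(pred_cal - true_cal, axis=1)   # nonconformity scores
alpha = 0.15
n = len(scores)
q_level = np.ceil((n + 1) * (1 - alpha)) / n           # finite-sample correction
radius = np.quantile(scores, q_level)                  # conformal radius

# At test time the "prediction set" is a disc of this radius around the
# predicted position.
print(f"conformal radius: {radius:.2f} m")
```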

[LG-49] Surrogate to Poincaré inequalities on manifolds for dimension reduction in nonlinear feature spaces

链接: https://arxiv.org/abs/2505.01807
作者: Anthony Nouy,Alexandre Pasco
类目: Numerical Analysis (math.NA); Machine Learning (cs.LG)
*备注: 35 pages, 6 figures

点击查看摘要

Abstract:We aim to approximate a continuously differentiable function u:\mathbb{R}^d \rightarrow \mathbb{R} by a composition of functions f \circ g, where g:\mathbb{R}^d \rightarrow \mathbb{R}^m, m \leq d, and f:\mathbb{R}^m \rightarrow \mathbb{R} are built in a two-stage procedure. For a fixed g, we build f using classical regression methods, involving evaluations of u. Recent works proposed to build a nonlinear g by minimizing a loss function \mathcal{J}(g) derived from Poincaré inequalities on manifolds, involving evaluations of the gradient of u. A problem is that minimizing \mathcal{J} may be a challenging task. Hence in this work, we introduce new convex surrogates to \mathcal{J}. Leveraging concentration inequalities, we provide sub-optimality results for a class of functions g, including polynomials, and a wide class of input probability measures. We investigate performances on different benchmarks for various training sample sizes. We show that our approach outperforms standard iterative methods for minimizing the training Poincaré inequality based loss, often resulting in better approximation errors, especially for rather small training sets and m=1.

[LG-50] Privacy Preserving Machine Learning Model Personalization through Federated Personalized Learning

链接: https://arxiv.org/abs/2505.01788
作者: Md. Tanzib Hosain,Asif Zaman,Md. Shahriar Sajid,Shadman Sakeeb Khan,Shanjida Akter
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注: Accepted in Proceedings of the 4th International Conference on Data Analytics for Business and Industry, 2023

点击查看摘要

Abstract:The widespread adoption of Artificial Intelligence (AI) has been driven by significant advances in intelligent system research. However, this progress has raised concerns about data privacy, leading to a growing awareness of the need for privacy-preserving AI. In response, there has been a seismic shift in interest towards the leading paradigm for training Machine Learning (ML) models on decentralized data silos while maintaining data privacy: Federated Learning (FL). This research paper presents a comprehensive performance analysis of a cutting-edge approach to personalizing ML models while preserving privacy, achieved through Privacy Preserving Machine Learning with the innovative framework of Federated Personalized Learning (PPMLFPL). Given the increasing concerns about data privacy, this study evaluates how effectively PPMLFPL addresses the critical balance between personalized model refinement and maintaining the confidentiality of individual user data. According to our analysis, Adaptive Personalized Cross-Silo Federated Learning with Differential Privacy (APPLE+DP) offers efficient execution, whereas overall, Adaptive Personalized Cross-Silo Federated Learning with Homomorphic Encryption (APPLE+HE) is strongly suggested for privacy-preserving machine learning tasks in federated personalized learning settings. The results offer valuable insights and create a promising scope for future advancements in the field of privacy-conscious data-driven technologies.

[LG-51] Context-Aware Online Conformal Anomaly Detection with Prediction-Powered Data Acquisition

链接: https://arxiv.org/abs/2505.01783
作者: Amirmohammad Farzaneh,Osvaldo Simeone
类目: Machine Learning (cs.LG); Information Theory (cs.IT); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Online anomaly detection is essential in fields such as cybersecurity, healthcare, and industrial monitoring, where promptly identifying deviations from expected behavior can avert critical failures or security breaches. While numerous anomaly scoring methods based on supervised or unsupervised learning have been proposed, current approaches typically rely on a continuous stream of real-world calibration data to provide assumption-free guarantees on the false discovery rate (FDR). To address the inherent challenges posed by limited real calibration data, we introduce context-aware prediction-powered conformal online anomaly detection (C-PP-COAD). Our framework strategically leverages synthetic calibration data to mitigate data scarcity, while adaptively integrating real data based on contextual cues. C-PP-COAD utilizes conformal p-values, active p-value statistics, and online FDR control mechanisms to maintain rigorous and reliable anomaly detection performance over time. Experiments conducted on both synthetic and real-world datasets demonstrate that C-PP-COAD significantly reduces dependency on real calibration data without compromising guaranteed FDR control.

[LG-52] Memory-Efficient LLM Training by Various-Grained Low-Rank Projection of Gradients

链接: https://arxiv.org/abs/2505.01744
作者: Yezhen Wang,Zhouhao Yang,Brian K Chen,Fanyi Pu,Bo Li,Tianyu Gao,Kenji Kawaguchi
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Building upon the success of low-rank adapter (LoRA), low-rank gradient projection (LoRP) has emerged as a promising solution for memory-efficient fine-tuning. However, existing LoRP methods typically treat each row of the gradient matrix as the default projection unit, leaving the role of projection granularity underexplored. In this work, we propose a novel framework, VLoRP, that extends low-rank gradient projection by introducing an additional degree of freedom for controlling the trade-off between memory efficiency and performance, beyond the rank hyper-parameter. Through this framework, we systematically explore the impact of projection granularity, demonstrating that finer-grained projections lead to enhanced stability and efficiency even under a fixed memory budget. Regarding the optimization for VLoRP, we present ProjFactor, an adaptive memory-efficient optimizer, that significantly reduces memory requirement while ensuring competitive performance, even in the presence of gradient accumulation. Additionally, we provide a theoretical analysis of VLoRP, demonstrating the descent and convergence of its optimization trajectory under both SGD and ProjFactor. Extensive experiments are conducted to validate our findings, covering tasks such as commonsense reasoning, MMLU, and GSM8K.
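
The role of projection granularity can be sketched in a few lines: reshape the gradient into rows of a chosen length (the granularity), then project each row onto a low-rank subspace and store only the compressed form. The random Gaussian projection, chunk size, and rank below are illustrative choices, not VLoRP's exact construction or its ProjFactor optimizer.

```python
# Minimal sketch of low-rank gradient projection at a chosen granularity.
import numpy as np

def project_gradient(grad, chunk=64, rank=8, seed=0):
    flat = grad.reshape(-1, chunk)                 # granularity: row length
    rng = np.random.default_rng(seed)
    P = rng.standard_normal((chunk, rank)) / np.sqrt(rank)
    low = flat @ P                                 # compressed representation
    back = low @ P.T                               # approximate reconstruction
    return back.reshape(grad.shape), low

grad = np.random.default_rng(1).standard_normal((512, 256))
approx, low = project_gradient(grad)
print(low.size / grad.size)                        # memory ratio, here 1/8
```

Smaller `chunk` values give finer-grained projections at the same memory budget, which is the trade-off the abstract says the paper studies.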

[LG-53] PoseX: AI Defeats Physics Approaches on Protein-Ligand Cross Docking

链接: https://arxiv.org/abs/2505.01700
作者: Yize Jiang,Xinze Li,Yuanyuan Zhang,Jin Han,Youjun Xu,Ayush Pandit,Zaixi Zhang,Mengdi Wang,Mengyang Wang,Chong Liu,Guang Yang,Yejin Choi,Wu-Jun Li,Tianfan Fu,Fang Wu,Junhong Liu
类目: Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
*备注:

点击查看摘要

Abstract:Recently, significant progress has been made in protein-ligand docking, especially with modern deep learning methods, and some benchmarks have been proposed, e.g., PoseBench, Plinder. However, these benchmarks suffer from impractical evaluation setups (e.g., blind docking, self docking) or heavy frameworks that involve training, which makes it challenging to assess docking methods efficiently. To fill this gap, we propose PoseX, an open-source benchmark focusing on self-docking and cross-docking, to evaluate algorithmic advances practically and comprehensively. Specifically, first, we curate a new evaluation dataset with 718 entries for self docking and 1,312 for cross docking; second, we incorporate 22 docking methods across three methodological categories, including (1) traditional physics-based methods (e.g., Schrödinger Glide), (2) AI docking methods (e.g., DiffDock), and (3) AI co-folding methods (e.g., AlphaFold3); third, we design a relaxation method as post-processing to minimize conformation energy and refine binding poses; fourth, we release a leaderboard to rank submitted models in real time. We draw some key insights from extensive experiments: (1) AI-based approaches have already surpassed traditional physics-based approaches in overall docking accuracy (RMSD). The longstanding generalization issues that have plagued AI molecular docking have been significantly alleviated in the latest models. (2) The stereochemical deficiencies of AI-based approaches can be greatly alleviated with post-processing relaxation. Combining AI docking methods with the enhanced relaxation method achieves the best performance to date. (3) AI co-folding methods commonly face ligand chirality issues, which cannot be resolved by relaxation. The code, curated dataset and leaderboard are released at this https URL.

[LG-54] Adaptively Point-weighting Curriculum Learning

链接: https://arxiv.org/abs/2505.01665
作者: Wensheng Li,Hao Wang,Ruifeng Zhou,Hanting Guan,Chao Zhang,Dacheng Tao
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Curriculum learning (CL) is a training strategy in which easy samples are learned first and hard samples are fitted later. It imitates the process by which humans learn knowledge and has become a promising way to train deep networks effectively. In this study, we develop the adaptively point-weighting (APW) curriculum learning algorithm, which adaptively assigns a weight to every training sample based not only on its training error but also on the current training state of the network. Specifically, in the early training phase, it increases the weights of easy samples so that the network rapidly captures the overall characteristics of the dataset; in the later training phase, the weights of hard points rise to improve the fitting performance on discrete local regions. Moreover, we present a theoretical analysis of the properties of APW, including training effectiveness, training feasibility, training stability, and generalization performance. The numerical experiments support the superiority of APW and demonstrate the validity of our theoretical findings.
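
One way to render the easy-first, hard-later weighting described above is a per-sample softmax whose sign flips with training progress; the schedule below is our illustrative assumption, not the paper's exact APW rule, which also conditions on the network's training state.

```python
# An illustrative per-sample weighting schedule in the spirit of APW.
import numpy as np

def apw_weights(losses, progress, temperature=1.0):
    """progress in [0, 1]: 0 = start of training, 1 = end."""
    sign = -1.0 + 2.0 * progress        # -1 favors easy, +1 favors hard
    logits = sign * losses / temperature
    w = np.exp(logits - logits.max())
    return w / w.sum() * len(losses)    # normalized to mean weight 1

losses = np.array([0.1, 0.5, 2.0, 4.0])
print(apw_weights(losses, progress=0.1))  # early: easy samples dominate
print(apw_weights(losses, progress=0.9))  # late: hard samples dominate
```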

[LG-55] Focal-SAM: Focal Sharpness-Aware Minimization for Long-Tailed Classification

链接: https://arxiv.org/abs/2505.01660
作者: Sicong Li,Qianqian Xu,Zhiyong Yang,Zitai Wang,Linchao Zhang,Xiaochun Cao,Qingming Huang
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Real-world datasets often follow a long-tailed distribution, making generalization to tail classes difficult. Recent methods resorted to long-tail variants of Sharpness-Aware Minimization (SAM), such as ImbSAM and CC-SAM, to improve generalization by flattening the loss landscape. However, these attempts face a trade-off between computational efficiency and control over the loss landscape. On the one hand, ImbSAM is efficient but offers only coarse control as it excludes head classes from the SAM process. On the other hand, CC-SAM provides fine-grained control through class-dependent perturbations but at the cost of efficiency due to multiple backpropagations. Seeing this dilemma, we introduce Focal-SAM, which assigns different penalties to class-wise sharpness, achieving fine-grained control without extra backpropagations, thus maintaining efficiency. Furthermore, we theoretically analyze Focal-SAM’s generalization ability and derive a sharper generalization bound. Extensive experiments on both traditional and foundation models validate the effectiveness of Focal-SAM.

[LG-56] T-REX: Vision-Based System for Autonomous Leaf Detection and Grasp Estimation

链接: https://arxiv.org/abs/2505.01654
作者: Srecharan Selvam,Abhisesh Silwal,George Kantor
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注: 11 Pages, 10 figures, 2 tables

点击查看摘要

Abstract:T-Rex (The Robot for Extracting Leaf Samples) is a gantry-based robotic system developed for autonomous leaf localization, selection, and grasping in greenhouse environments. The system integrates a 6-degree-of-freedom manipulator with a stereo vision pipeline to identify and interact with target leaves. YOLOv8 is used for real-time leaf segmentation, and RAFT-Stereo provides dense depth maps, allowing the reconstruction of 3D leaf masks. These observations are processed through a leaf grasping algorithm that selects the optimal leaf based on clutter, visibility, and distance, and determines a grasp point by analyzing local surface flatness, top-down approachability, and margin from edges. The selected grasp point guides a trajectory executed by ROS-based motion controllers, driving a custom microneedle-equipped end-effector to clamp the leaf and simulate tissue sampling. Experiments conducted with artificial plants under varied poses demonstrate that the T-Rex system can consistently detect, plan, and perform physical interactions with plant-like targets, achieving a grasp success rate of 66.6%. This paper presents the system architecture, implementation, and testing of T-Rex as a step toward plant sampling automation in Controlled Environment Agriculture (CEA).

[LG-57] Morello: Compiling Fast Neural Networks with Dynamic Programming and Spatial Compression

链接: https://arxiv.org/abs/2505.01637
作者: Samuel J. Kaufman,René Just,Rastislav Bodik
类目: Programming Languages (cs.PL); Machine Learning (cs.LG)
*备注: 13 pages, 2 figures

点击查看摘要

Abstract:High-throughput neural network inference requires coordinating many optimization decisions, including parallel tiling, microkernel selection, and data layout. The product of these decisions forms a search space of programs which is typically intractably large. Existing approaches (e.g., auto-schedulers) often address this problem by sampling this space heuristically. In contrast, we introduce a dynamic-programming-based approach to explore more of the search space by iteratively decomposing large program specifications into smaller specifications reachable from a set of rewrites, then composing a final program from each rewrite that minimizes an affine cost model. To reduce memory requirements, we employ a novel memoization table representation, which indexes specifications by coordinates in \mathbb{Z}_{\geq 0} and compresses identical, adjacent solutions. This approach can visit a much larger set of programs than prior work. To evaluate the approach, we developed Morello, a compiler which lowers specifications roughly equivalent to a few-node XLA computation graph to x86. Notably, we found that an affine cost model is sufficient to surface high-throughput programs. For example, Morello synthesized a collection of matrix multiplication benchmarks targeting a Zen 1 CPU, including a 1x2048x16384, bfloat16-to-float32 vector-matrix multiply, which was integrated into Google's this http URL.

[LG-58] A Domain Adaptation of Large Language Models for Classifying Mechanical Assembly Components

链接: https://arxiv.org/abs/2505.01627
作者: Fatemeh Elhambakhsh,Daniele Grandi,Hyunwoong Ko
类目: Machine Learning (cs.LG); Computational Engineering, Finance, and Science (cs.CE)
*备注:

点击查看摘要

Abstract:The conceptual design phase represents a critical early stage in the product development process, where designers generate potential solutions that meet predefined design specifications based on functional requirements. Functional modeling, a foundational aspect of this phase, enables designers to reason about product functions before specific structural details are determined. A widely adopted approach to functional modeling is the Function-Behavior-Structure (FBS) framework, which supports the transformation of functional intent into behavioral and structural descriptions. However, the effectiveness of function-based design is often hindered by the lack of well-structured and comprehensive functional data. This scarcity can negatively impact early design decision-making and hinder the development of accurate behavioral models. Recent advances in Large Language Models (LLMs), such as those based on GPT architectures, offer a promising avenue to address this gap. LLMs have demonstrated significant capabilities in language understanding and natural language processing (NLP), making them suitable for automated classification tasks. This study proposes a novel LLM-based domain adaptation (DA) framework using fine-tuning for the automated classification of mechanical assembly parts’ functions. By fine-tuning LLMs on domain-specific datasets, the traditionally manual and subjective process of function annotation can be improved in both accuracy and consistency. A case study demonstrates fine-tuning GPT-3.5 Turbo on data from the Oregon State Design Repository (OSDR), and evaluation on the A Big CAD (ABC) dataset shows that the domain-adapted LLM can generate high-quality functional data, enhancing the semantic representation of mechanical parts and supporting more effective design exploration in early-phase engineering.

[LG-59] Phantora: Live GPU Cluster Simulation for Machine Learning System Performance Estimation

链接: https://arxiv.org/abs/2505.01616
作者: Jianxing Qin,Jingrong Chen,Xinhao Kong,Yongji Wu,Liang Luo,Zhaodong Wang,Ying Zhang,Tingjun Chen,Alvin R. Lebeck,Danyang Zhuo
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG); Performance (cs.PF)
*备注:

点击查看摘要

Abstract:To accommodate ever-increasing model complexity, modern machine learning (ML) systems have to scale to large GPU clusters. Changes in ML model architecture, ML system implementation, and cluster configuration can significantly affect overall ML system performance. However, quantifying the performance impact before deployment is challenging. Existing performance estimation methods use performance modeling or static workload simulation. These techniques are not general: they require significant human effort and computation capacity to generate training data or a workload. It is also difficult to adapt ML systems to use these techniques. This paper introduces Phantora, a live GPU cluster simulator for performance estimation. Phantora runs minimally modified ML models and frameworks, intercepting and simulating GPU-related operations to enable high-fidelity performance estimation. Phantora overcomes several research challenges in integrating an event-driven network simulator with live system execution, and introduces a set of techniques to improve simulation speed, scalability, and accuracy. Our evaluation results show that Phantora can deliver estimation accuracy similar to the state-of-the-art workload simulation approach with only one GPU, while reducing human effort and increasing generalizability.

[LG-60] Machine Learning Fairness in House Price Prediction: A Case Study of America's Expanding Metropolises

链接: https://arxiv.org/abs/2505.01591
作者: Abdalwahab Almajed,Maryam Tabar,Peyman Najafirad
类目: Machine Learning (cs.LG)
*备注: Accepted at ACM-COMPASS2025

点击查看摘要

Abstract:As a basic human need, housing plays a key role in enhancing health, well-being, and educational outcome in society, and the housing market is a major factor for promoting quality of life and ensuring social equity. To improve the housing conditions, there has been extensive research on building Machine Learning (ML)-driven house price prediction solutions to accurately forecast the future conditions, and help inform actions and policies in the field. In spite of their success in developing high-accuracy models, there is a gap in our understanding of the extent to which various ML-driven house price prediction approaches show ethnic and/or racial bias, which in turn is essential for the responsible use of ML, and ensuring that the ML-driven solutions do not exacerbate inequity. To fill this gap, this paper develops several ML models from a combination of structural and neighborhood-level attributes, and conducts comprehensive assessments on the fairness of ML models under various definitions of privileged groups. As a result, it finds that the ML-driven house price prediction models show various levels of bias towards protected attributes (i.e., race and ethnicity in this study). Then, it investigates the performance of different bias mitigation solutions, and the experimental results show their various levels of effectiveness on different ML-driven methods. However, in general, the in-processing bias mitigation approach tends to be more effective than the pre-processing one in this problem domain. Our code is available at this https URL.

[LG-61] HoneyBee: Efficient Role-based Access Control for Vector Databases via Dynamic Partitioning

链接: https://arxiv.org/abs/2505.01538
作者: Hongbin Zhong,Matthew Lentz,Nina Narodytska,Adriana Szekeres,Kexin Rong
类目: Databases (cs.DB); Cryptography and Security (cs.CR); Information Retrieval (cs.IR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:As vector databases gain traction in enterprise applications, robust access control has become critical to safeguard sensitive data. Access control in these systems is often implemented through hybrid vector queries, which combine nearest neighbor search on vector data with relational predicates based on user permissions. However, existing approaches face significant trade-offs: creating dedicated indexes for each user minimizes query latency but introduces excessive storage redundancy, while building a single index and applying access control after vector search reduces storage overhead but suffers from poor recall and increased query latency. This paper introduces HoneyBee, a dynamic partitioning framework that bridges the gap between these approaches by leveraging the structure of Role-Based Access Control (RBAC) policies. RBAC, widely adopted in enterprise settings, groups users into roles and assigns permissions to those roles, creating a natural “thin waist” in the permission structure that is ideal for partitioning decisions. Specifically, HoneyBee produces overlapping partitions where vectors can be strategically replicated across different partitions to reduce query latency while controlling storage overhead. By introducing analytical models for the performance and recall of the vector search, HoneyBee formulates the partitioning strategy as a constrained optimization problem to dynamically balance storage, query efficiency, and recall. Evaluations on RBAC workloads demonstrate that HoneyBee reduces storage redundancy compared to role partitioning and achieves up to 6x faster query speeds than row-level security (RLS) with only 1.4x storage increase, offering a practical middle ground for secure and efficient vector search.

[LG-62] Machine Learning for Cyber-Attack Identification from Traffic Flows

链接: https://arxiv.org/abs/2505.01489
作者: Yujing Zhou,Marc L. Jacquet,Robel Dawit,Skyler Fabre,Dev Sarawat,Faheem Khan,Madison Newell,Yongxin Liu,Dahai Liu,Hongyun Chen,Jian Wang,Huihui Wang
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*备注:

点击查看摘要

Abstract:This paper presents our simulation of cyber-attacks and detection strategies on the traffic control system in Daytona Beach, FL, using Raspberry Pi virtual machines and the OPNSense firewall, along with traffic dynamics from SUMO and exploitation via the Metasploit framework. We try to answer the research question: can we identify cyber-attacks by analyzing traffic flow patterns alone? In this research, the cyber-attacks of interest are those in which adversarial attackers randomly turn all lights green or red at busy intersections. Despite challenges stemming from imbalanced data and overlapping traffic patterns, our best model shows 85% accuracy when detecting intrusions purely from traffic flow statistics. Key indicators for successful detection included occupancy, jam length, and halting durations.

[LG-63] Explainable Machine Learning for Cyberattack Identification from Traffic Flows

链接: https://arxiv.org/abs/2505.01488
作者: Yujing Zhou,Marc L. Jacquet,Robel Dawit,Skyler Fabre,Dev Sarawat,Faheem Khan,Madison Newell,Yongxin Liu,Dahai Liu,Hongyun Chen,Jian Wang,Huihui Wang
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*备注:

点击查看摘要

Abstract:The increasing automation of traffic management systems has made them prime targets for cyberattacks, disrupting urban mobility and public safety. Traditional network-layer defenses are often inaccessible to transportation agencies, necessitating a machine learning-based approach that relies solely on traffic flow data. In this study, we simulate cyberattacks in a semi-realistic environment, using a virtualized traffic network to analyze disruption patterns. We develop a deep learning-based anomaly detection system, demonstrating that Longest Stop Duration and Total Jam Distance are key indicators of compromised signals. To enhance interpretability, we apply Explainable AI (XAI) techniques, identifying critical decision factors and diagnosing misclassification errors. Our analysis reveals two primary challenges: transitional data inconsistencies, where mislabeled recovery-phase traffic misleads the model, and model limitations, where stealth attacks in low-traffic conditions evade detection. This work enhances AI-driven traffic security, improving both detection accuracy and trustworthiness in smart transportation systems.

[LG-64] LLM Watermarking Using Mixtures and Statistical-to-Computational Gaps

链接: https://arxiv.org/abs/2505.01484
作者: Pedro Abdalla,Roman Vershynin
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Given a text, can we determine whether it was generated by a large language model (LLM) or by a human? A widely studied approach to this problem is watermarking. We propose an undetectable and elementary watermarking scheme in the closed setting. Also, in the harder open setting, where the adversary has access to most of the model, we propose an unremovable watermarking scheme.

[LG-65] Enhancing the Cloud Security through Topic Modelling

链接: https://arxiv.org/abs/2505.01463
作者: Sabbir M. Saleh,Nazim Madhavji,John Steinbacher
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG); Software Engineering (cs.SE)
*备注: 6 pages, 5 figures, 28th ACIS International Winter Conference on Software Engineering, Artificial Intelligence, Networking and Parallel/Distributed Computing (SNPD 2024-Winter)

点击查看摘要

Abstract:Protecting cloud applications is crucial in an age where security constantly threatens the digital world. The inevitable cyber-attacks throughout the CI/CD pipeline make cloud security innovations necessary. This research is motivated by applying Natural Language Processing (NLP) methodologies, such as topic modelling, to analyse cloud security data and predict future attacks. This research aims to use topic modelling, specifically Latent Dirichlet Allocation (LDA) and Probabilistic Latent Semantic Analysis (pLSA). Utilising LDA and pLSA, security-related text data, such as reports, logs, and other relevant documents, will be analysed and sorted into relevant topics (such as phishing or encryption). These algorithms can be applied in Python using the Gensim framework. The topics will be utilised to detect vulnerabilities within relevant CI/CD pipeline records or log data. This application of topic modelling is expected to provide a new form of vulnerability detection, improving overall security throughout the CI/CD pipeline.
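
Since the abstract names the Gensim framework, a minimal LDA run on a toy "security log" corpus looks as follows; the corpus, topic count, and hyperparameters are placeholders, not the paper's data.

```python
# Minimal LDA topic-modelling sketch with Gensim on a toy corpus.
from gensim import corpora
from gensim.models import LdaModel

docs = [
    ["phishing", "email", "credential", "link"],
    ["tls", "encryption", "certificate", "key"],
    ["phishing", "login", "credential", "spoof"],
    ["key", "rotation", "encryption", "vault"],
]
dictionary = corpora.Dictionary(docs)
corpus = [dictionary.doc2bow(doc) for doc in docs]
lda = LdaModel(corpus, num_topics=2, id2word=dictionary,
               passes=10, random_state=0)
for topic_id, words in lda.print_topics():
    print(topic_id, words)  # e.g., a phishing-like and an encryption-like topic
```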

[LG-66] Development of an Adapter for Analyzing and Protecting Machine Learning Models from Competitive Activity in the Networks Services

链接: https://arxiv.org/abs/2505.01460
作者: Denis Parfenov,Anton Parfenov
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Due to the increasing number of tasks that are solved on remote servers, identifying and classifying traffic is an important task for reducing the load on the server. There are various methods for classifying traffic. This paper discusses machine learning models for solving this problem. However, such ML models are themselves subject to attacks that affect the classification result of network traffic. To protect the models, we propose a solution based on an autoencoder.
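
A minimal version of the autoencoder idea reads as follows: train on benign traffic features, then flag inputs whose reconstruction error exceeds a threshold. The architecture, 3-sigma threshold rule, and feature dimension are our assumptions, not the paper's adapter design.

```python
# Sketch: autoencoder reconstruction error as an anomaly signal.
import torch
import torch.nn as nn

dim = 16
ae = nn.Sequential(nn.Linear(dim, 8), nn.ReLU(), nn.Linear(8, dim))
opt = torch.optim.Adam(ae.parameters(), lr=1e-3)

benign = torch.randn(1024, dim)                   # placeholder benign traffic
for _ in range(200):                              # train to reconstruct benign
    opt.zero_grad()
    loss = ((ae(benign) - benign) ** 2).mean()
    loss.backward()
    opt.step()

with torch.no_grad():
    err = ((ae(benign) - benign) ** 2).mean(dim=1)
    tau = err.mean() + 3 * err.std()              # simple 3-sigma threshold
    suspicious = torch.randn(4, dim) * 5          # out-of-distribution inputs
    flags = ((ae(suspicious) - suspicious) ** 2).mean(dim=1) > tau
    print(flags)                                  # likely all True
```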

[LG-67] Sparsification Under Siege: Defending Against Poisoning Attacks in Communication-Efficient Federated Learning

链接: https://arxiv.org/abs/2505.01454
作者: Zhiyong Jin,Runhua Xu,Chao Li,Yizhong Liu,Jianxin Li
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Federated Learning (FL) enables collaborative model training across distributed clients while preserving data privacy, yet it faces significant challenges in communication efficiency and vulnerability to poisoning attacks. While sparsification techniques mitigate communication overhead by transmitting only critical model parameters, they inadvertently amplify security risks: adversarial clients can exploit sparse updates to evade detection and degrade model performance. Existing defense mechanisms, designed for standard FL communication scenarios, are ineffective in addressing these vulnerabilities within sparsified FL. To bridge this gap, we propose FLARE, a novel federated learning framework that integrates sparse index mask inspection and model update sign similarity analysis to detect and mitigate poisoning attacks in sparsified FL. Extensive experiments across multiple datasets and adversarial scenarios demonstrate that FLARE significantly outperforms existing defense strategies, effectively securing sparsified FL against poisoning attacks while maintaining communication efficiency.
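
The two signals FLARE combines can be sketched directly: overlap of the sparse index masks and sign agreement of the updates on shared coordinates. The Jaccard and sign statistics below are minimal stand-ins for the paper's inspection mechanisms; thresholds and scoring are illustrative assumptions.

```python
# Sketch: mask-overlap and sign-similarity checks on sparsified updates.
import numpy as np

def mask_jaccard(idx_a, idx_b):
    a, b = set(idx_a), set(idx_b)
    return len(a & b) / max(len(a | b), 1)

def sign_agreement(update_a, update_b, idx_a, idx_b):
    shared = np.intersect1d(idx_a, idx_b)
    if shared.size == 0:
        return 0.0
    return float(np.mean(np.sign(update_a[shared]) == np.sign(update_b[shared])))

d, k = 1000, 50
rng = np.random.default_rng(0)
honest = rng.standard_normal(d)
idx = np.argsort(-np.abs(honest))[:k]   # top-k sparsification mask
poisoned = -honest                      # sign-flipping adversary
print(mask_jaccard(idx, idx),           # 1.0: same coordinates targeted
      sign_agreement(honest, poisoned, idx, idx))  # 0.0: signs disagree
```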

[LG-68] Towards Film-Making Production Dialogue Narration Monologue Adaptive Moving Dubbing Benchmarks CVPR2025

链接: https://arxiv.org/abs/2505.01450
作者: Chaoyi Wang,Junjie Zheng,Zihao Chen,Shiyu Xia,Chaofan Ding,Xiaohao Zhang,Xi Tao,Xiaoming He,Xinhan Di
类目: Machine Learning (cs.LG)
*备注: 6 pages, 3 figures, accepted to the AI for Content Creation workshop at CVPR 2025 in Nashville, TN

点击查看摘要

Abstract:Movie dubbing has advanced significantly, yet assessing the real-world effectiveness of these models remains challenging. A comprehensive evaluation benchmark is crucial for two key reasons: 1) Existing metrics fail to fully capture the complexities of dialogue, narration, monologue, and actor adaptability in movie dubbing. 2) A practical evaluation system should offer valuable insights to improve movie dubbing quality and advancement in film production. To this end, we introduce Talking Adaptive Dubbing Benchmarks (TA-Dubbing), designed to improve film production by adapting to dialogue, narration, monologue, and actors in movie dubbing. TA-Dubbing offers several key advantages: 1) Comprehensive Dimensions: TA-Dubbing covers a variety of dimensions of movie dubbing, incorporating metric evaluations for both movie understanding and speech generation. 2) Versatile Benchmarking: TA-Dubbing is designed to evaluate state-of-the-art movie dubbing models and advanced multi-modal large language models. 3) Full Open-Sourcing: We fully open-source TA-Dubbing at this https URL 0a/DeepDubber-V1 including all video suites, evaluation methods, and annotations. We also continuously integrate new movie dubbing models into the TA-Dubbing leaderboard at this https URL 0a/DeepDubber-V1 to drive forward the field of movie dubbing.

[LG-69] COSMOS: Predictable and Cost-Effective Adaptation of LLMs

链接: https://arxiv.org/abs/2505.01449
作者: Jiayu Wang,Aws Albarghouthi,Frederic Sala
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Large language models (LLMs) achieve remarkable performance across numerous tasks by using a diverse array of adaptation strategies. However, optimally selecting a model and adaptation strategy under resource constraints is challenging and often requires extensive experimentation. We investigate whether it is possible to accurately predict both performance and cost without expensive trials. We formalize the strategy selection problem for LLMs and introduce COSMOS, a unified prediction framework that efficiently estimates adaptation outcomes at minimal cost. We instantiate and study the capability of our framework via a pair of powerful predictors: embedding-augmented lightweight proxy models to predict fine-tuning performance, and low-sample scaling laws to forecast retrieval-augmented in-context learning. Extensive evaluation across eight representative benchmarks demonstrates that COSMOS achieves high prediction accuracy while reducing computational costs by 92.72% on average, and up to 98.71% in resource-intensive scenarios. Our results show that efficient prediction of adaptation outcomes is not only feasible but can substantially reduce the computational overhead of LLM deployment while maintaining performance standards.

[LG-70] OpenAVS: Training-Free Open-Vocabulary Audio Visual Segmentation with Foundational Models

链接: https://arxiv.org/abs/2505.01448
作者: Shengkai Chen,Yifang Yin,Jinming Cao,Shili Xiang,Zhenguang Liu,Roger Zimmermann
类目: Machine Learning (cs.LG); Multimedia (cs.MM)
*备注:

点击查看摘要

Abstract:Audio-visual segmentation aims to separate sounding objects from videos by predicting pixel-level masks based on audio signals. Existing methods primarily concentrate on closed-set scenarios and direct audio-visual alignment and fusion, which limits their capability to generalize to new, unseen situations. In this paper, we propose OpenAVS, a novel training-free language-based approach that, for the first time, effectively aligns audio and visual modalities using text as a proxy for open-vocabulary Audio-Visual Segmentation (AVS). Equipped with multimedia foundation models, OpenAVS directly infers masks through 1) audio-to-text prompt generation, 2) LLM-guided prompt translation, and 3) text-to-visual sounding object segmentation. The objective of OpenAVS is to establish a simple yet flexible architecture that relies on the most appropriate foundation models by fully leveraging their capabilities to enable more effective knowledge transfer to the downstream AVS task. Moreover, we present a model-agnostic framework OpenAVS-ST that enables the integration of OpenAVS with any advanced supervised AVS model via pseudo-label based self-training. This approach enhances performance by effectively utilizing large-scale unlabeled data when available. Comprehensive experiments on three benchmark datasets demonstrate the superior performance of OpenAVS. It surpasses existing unsupervised, zero-shot, and few-shot AVS methods by a significant margin, achieving absolute performance gains of approximately 9.4% and 10.9% in mIoU and F-score, respectively, in challenging scenarios.

[LG-71] Perturbation Analysis of Singular Values in Concatenated Matrices

链接: https://arxiv.org/abs/2505.01427
作者: Maksym Shamrai
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 11 pages

点击查看摘要

Abstract:Concatenating matrices is a common technique for uncovering shared structures in data through singular value decomposition (SVD) and low-rank approximations. However, a fundamental question arises: how does the singular value spectrum of the concatenated matrix relate to the spectra of its individual components? In this work, we develop a perturbation framework that extends classical results such as Weyl’s inequality to concatenated matrices. We establish analytical bounds that quantify the stability of singular values under small perturbations in the submatrices. Our results show that if the matrices being concatenated are close in norm, the dominant singular values of the concatenated matrix remain stable, enabling controlled trade-offs between accuracy and compression. These insights provide a theoretical foundation for improved matrix clustering and compression strategies, with applications in numerical linear algebra, signal processing, and data-driven modeling.
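
The flavor of these perturbation bounds is easy to check numerically: by Weyl's inequality applied to X = [A, B] and Y = [A, B + E], every singular value of the concatenation moves by at most ||E||_2. A small numpy verification, with random matrices as placeholders:

```python
# Numerical check: |s_i([A, B]) - s_i([A, B + E])| <= ||E||_2 for all i.
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((50, 30))
B = rng.standard_normal((50, 20))
E = 1e-2 * rng.standard_normal((50, 20))        # small perturbation of B

s1 = np.linalg.svd(np.hstack([A, B]), compute_uv=False)
s2 = np.linalg.svd(np.hstack([A, B + E]), compute_uv=False)
print(np.max(np.abs(s1 - s2)) <= np.linalg.norm(E, 2))  # True
```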

[LG-72] Entropic Mirror Descent for Linear Systems: Polyaks Stepsize and Implicit Bias

链接: https://arxiv.org/abs/2505.02614
作者: Yura Malitsky,Alexander Posch
类目: Optimization and Control (math.OC); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 18 pages, 2 figures

点击查看摘要

Abstract:This paper focuses on applying entropic mirror descent to solve linear systems, where the main challenge for the convergence analysis stems from the unboundedness of the domain. To overcome this without imposing restrictive assumptions, we introduce a variant of Polyak-type stepsizes. Along the way, we strengthen the bound for the \ell_1-norm implicit bias, obtain sublinear and linear convergence results, and generalize the convergence result to arbitrary convex L-smooth functions. We also propose an alternative method that avoids exponentiation, resembling the original Hadamard descent, but with provable convergence.

[LG-73] Lane-Wise Highway Anomaly Detection

链接: https://arxiv.org/abs/2505.02613
作者: Mei Qiu,William Lorenz Reindl,Yaobin Chen,Stanley Chien,Shu Hu
类目: Image and Video Processing (eess.IV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This paper proposes a scalable and interpretable framework for lane-wise highway traffic anomaly detection, leveraging multi-modal time series data extracted from surveillance cameras. Unlike traditional sensor-dependent methods, our approach uses AI-powered vision models to extract lane-specific features, including vehicle count, occupancy, and truck percentage, without relying on costly hardware or complex road modeling. We introduce a novel dataset containing 73,139 lane-wise samples, annotated with four classes of expert-validated anomalies: three traffic-related anomalies (lane blockage and recovery, foreign object intrusion, and sustained congestion) and one sensor-related anomaly (camera angle shift). Our multi-branch detection system integrates deep learning, rule-based logic, and machine learning to improve robustness and precision. Extensive experiments demonstrate that our framework outperforms state-of-the-art methods in precision, recall, and F1-score, providing a cost-effective and scalable solution for real-world intelligent transportation systems.

[LG-74] Resolving Memorization in Empirical Diffusion Model for Manifold Data in High-Dimensional Spaces

链接: https://arxiv.org/abs/2505.02508
作者: Yang Lyu,Yuchun Qian,Tan Minh Nguyen,Xin T. Tong
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST)
*备注:

点击查看摘要

Abstract:Diffusion models are a popular computational tool to generate new data samples. They utilize a forward diffusion process that adds noise to the data distribution and then a reverse process to remove the noise and produce samples from the data distribution. However, when the empirical data distribution consists of n data points, the empirical diffusion model will necessarily reproduce one of the existing data points. This is often referred to as the memorization effect, which is usually resolved by sophisticated machine learning procedures in the current literature. This work shows that the memorization problem can be resolved by a simple inertia update step at the end of the empirical diffusion model simulation. Our inertial diffusion model requires only the empirical diffusion model score function and does not require any further training. We show that the inertial diffusion model sample distribution is an O\left(n^{-\frac{2}{d+4}}\right) Wasserstein-1 approximation of a data distribution lying on a C^2 manifold of dimension d. Since this estimate is significantly smaller than the Wasserstein-1 distance between the population and empirical distributions, it rigorously shows that the inertial diffusion model produces new data samples. Remarkably, this upper bound is completely free of the ambient space dimension, since no training is involved. Our analysis utilizes the fact that the inertial diffusion model samples are approximately distributed as the Gaussian kernel density estimator on the manifold. This reveals an interesting connection between diffusion models and manifold learning.

[LG-75] Deep learning of personalized priors from past MRI scans enables fast quality-enhanced point-of-care MRI with low-cost systems

链接: https://arxiv.org/abs/2505.02470
作者: Tal Oved,Beatrice Lena,Chloé F. Najac,Sheng Shen,Matthew S. Rosen,Andrew Webb,Efrat Shimron
类目: Image and Video Processing (eess.IV); Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注:

点击查看摘要

Abstract:Magnetic resonance imaging (MRI) offers superb-quality images, but its accessibility is limited by high costs, posing challenges for patients requiring longitudinal care. Low-field MRI provides affordable imaging with low-cost devices but is hindered by long scans and degraded image quality, including low signal-to-noise ratio (SNR) and tissue contrast. We propose a novel healthcare paradigm: using deep learning to extract personalized features from past standard high-field MRI scans and harnessing them to enable accelerated, enhanced-quality follow-up scans with low-cost systems. To overcome the SNR and contrast differences, we introduce ViT-Fuser, a feature-fusion vision transformer that learns features from past scans, e.g. those stored in standard DICOM CDs. We show that a single prior scan is sufficient, and this scan can come from various MRI vendors, field strengths, and pulse sequences. Experiments with four datasets, including glioblastoma data, low-field (50 mT), and ultra-low-field (6.5 mT) data, demonstrate that ViT-Fuser outperforms state-of-the-art methods, providing enhanced-quality images from accelerated low-field scans, with robustness to out-of-distribution data. Our freely available framework thus enables rapid, diagnostic-quality, low-cost imaging for wide healthcare applications.

[LG-76] Learning simple heuristic rules for classifying materials based on chemical composition

链接: https://arxiv.org/abs/2505.02361
作者: Andrew Ma,Marin Soljačić
类目: Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG); Chemical Physics (physics.chem-ph)
*备注: 10 pages, 3 figures

点击查看摘要

Abstract:In the past decade, there has been a significant interest in the use of machine learning approaches in materials science research. Conventional deep learning approaches that rely on complex, nonlinear models have become increasingly important in computational materials science due to their high predictive accuracy. In contrast to these approaches, we have shown in a recent work that a remarkably simple learned heuristic rule – based on the concept of topogivity – can classify whether a material is topological using only its chemical composition. In this paper, we go beyond the topology classification scenario by also studying the use of machine learning to develop simple heuristic rules for classifying whether a material is a metal based on chemical composition. Moreover, we present a framework for incorporating chemistry-informed inductive bias based on the structure of the periodic table. For both the topology classification and the metallicity classification tasks, we empirically characterize the performance of simple heuristic rules fit with and without chemistry-informed inductive bias across a wide range of training set sizes. We find evidence that incorporating chemistry-informed inductive bias can reduce the amount of training data required to reach a given level of test accuracy.

[LG-77] Smooth Integer Encoding via Integral Balance

链接: https://arxiv.org/abs/2505.02259
作者: Stanislav Semenov
类目: Optimization and Control (math.OC); Machine Learning (cs.LG)
*备注: 28 pages, 4 figures, submitted to arXiv

点击查看摘要

Abstract:We introduce a novel method for encoding integers using smooth real-valued functions whose integral properties implicitly reflect discrete quantities. In contrast to classical representations, where the integer appears as an explicit parameter, our approach encodes the number N \in \mathbb{N} through the cumulative balance of a smooth function f_N(t), constructed from localized Gaussian bumps with alternating and decaying coefficients. The total integral I(N) converges to zero as N tends to infinity, and the integer can be recovered as the minimal point of near-cancellation. This method enables continuous and differentiable representations of discrete states, supports recovery through spline-based or analytical inversion, and extends naturally to multidimensional tuples (N_1, N_2, ...). We analyze the structure and convergence of the encoding series, demonstrate numerical construction of the integral map I(N), and develop procedures for integer recovery via numerical inversion. The resulting framework opens a path toward embedding discrete logic within continuous optimization pipelines, machine learning architectures, and smooth symbolic computation.
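
One plausible toy instantiation of this construction: place N Gaussian bumps at k = 1..N and choose alternating, decaying coefficients so the running integral balance is S_N = (-1)^N / N, which tends to zero and can be inverted from its magnitude. The coefficient scheme and bump width below are our assumptions, not the paper's exact series.

```python
# Toy version: N bumps whose cumulative integral balance is (-1)^N / N.
import numpy as np

def f_N(t, N, width=0.1):
    # Coefficients a_k = S_k - S_{k-1} with S_k = (-1)^k / k (our assumption)
    # alternate in sign, decay like 2/k, and drive the balance to zero.
    ks = np.arange(1, N + 1)
    S = (-1.0) ** ks / ks
    a = np.diff(np.concatenate(([0.0], S)))
    bumps = np.exp(-0.5 * ((t[:, None] - ks) / width) ** 2)
    return bumps @ a

t = np.linspace(-5.0, 70.0, 40001)
dt = t[1] - t[0]
for N in (3, 10, 30):
    balance = f_N(t, N).sum() * dt / (0.1 * np.sqrt(2 * np.pi))  # ~ S_N
    N_hat = int(round(1.0 / abs(balance)))       # invert |S_N| = 1/N
    print(N, round(balance, 4), N_hat)           # recovers N from the integral
```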

[LG-78] Bayesian Federated Cause-of-Death Classification and Quantification Under Distribution Shift

链接: https://arxiv.org/abs/2505.02257
作者: Yu Zhu,Zehang Richard Li
类目: Methodology (stat.ME); Machine Learning (cs.LG); Applications (stat.AP)
*备注: 11 figures

点击查看摘要

Abstract:In regions lacking medically certified causes of death, verbal autopsy (VA) is a critical and widely used tool to ascertain the cause of death through interviews with caregivers. Data collected by VAs are often analyzed using probabilistic algorithms. The performance of these algorithms often degrades due to distributional shift across populations. Most existing VA algorithms rely on centralized training, requiring full access to training data for joint modeling. This is often infeasible due to privacy and logistical constraints. In this paper, we propose a novel Bayesian Federated Learning (BFL) framework that avoids data sharing across multiple training sources. Our method enables reliable individual-level cause-of-death classification and population-level quantification of cause-specific mortality fractions (CSMFs), in a target domain with limited or no local labeled data. The proposed framework is modular, computationally efficient, and compatible with a wide range of existing VA algorithms as candidate models, facilitating flexible deployment in real-world mortality surveillance systems. We validate the performance of BFL through extensive experiments on two real-world VA datasets under varying levels of distribution shift. Our results show that BFL significantly outperforms the base models built on a single domain and achieves comparable or better performance compared to joint modeling.

[LG-79] Heterosynaptic Circuits Are Universal Gradient Machines

链接: https://arxiv.org/abs/2505.02248
作者: Liu Ziyin,Isaac Chuang,Tomaso Poggio
类目: Neurons and Cognition (q-bio.NC); Disordered Systems and Neural Networks (cond-mat.dis-nn); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE); Populations and Evolution (q-bio.PE)
*备注: preprint

点击查看摘要

Abstract:We propose a design principle for the learning circuits of the biological brain. The principle states that almost any dendritic weights updated via heterosynaptic plasticity can implement a generalized and efficient class of gradient-based meta-learning. The theory suggests that a broad class of biologically plausible learning algorithms, together with the standard machine learning optimizers, can be grounded in heterosynaptic circuit motifs. This principle suggests that the phenomenology of (anti-)Hebbian (HBP) and heterosynaptic plasticity (HSP) may emerge from the same underlying dynamics, thus providing a unifying explanation. It also suggests an alternative perspective on neuroplasticity, where HSP is promoted to the primary learning and memory mechanism, and HBP is an emergent byproduct. We present simulations that show that (a) HSP can explain the metaplasticity of neurons, (b) HSP can explain the flexibility of biological circuits, and (c) gradient learning can arise quickly from simple evolutionary dynamics that do not compute any explicit gradient. While our primary focus is on biology, the principle also implies a new approach to designing AI training algorithms and physically learnable AI hardware. Conceptually, our result demonstrates that, contrary to the common belief, gradient computation may be extremely easy and common in nature.

[LG-80] Latent Variable Estimation in Bayesian Black-Litterman Models ICML2025

链接: https://arxiv.org/abs/2505.02185
作者: Thomas Y.L. Lin,Jerry Yao-Chieh Hu,Paul W. Chiou,Peter Lin
类目: Portfolio Management (q-fin.PM); Machine Learning (cs.LG); Econometrics (econ.EM); Methodology (stat.ME); Machine Learning (stat.ML)
*备注: Accepted at ICML 2025

点击查看摘要

Abstract:We revisit the Bayesian Black-Litterman (BL) portfolio model and remove its reliance on subjective investor views. Classical BL requires an investor “view”: a forecast vector q and its uncertainty matrix \Omega that describe how much a chosen portfolio should outperform the market. Our key idea is to treat (q,\Omega) as latent variables and learn them from market data within a single Bayesian network. Consequently, the resulting posterior estimation admits closed-form expression, enabling fast inference and stable portfolio weights. Building on these, we propose two mechanisms to capture how features interact with returns: shared-latent parametrization and feature-influenced views; both recover classical BL and Markowitz portfolios as special cases. Empirically, on 30-year Dow-Jones and 20-year sector-ETF data, we improve Sharpe ratios by 50% and cut turnover by 55% relative to Markowitz and the index baselines. This work turns BL into a fully data-driven, view-free, and coherent Bayesian framework for portfolio optimization.
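
For reference, the classical Black-Litterman posterior mean that this latent-variable treatment builds on is the standard formula below, where \pi is the market-implied prior, \Sigma the return covariance, \tau a scaling constant, and P the view-selection matrix; the paper's contribution is to learn q and \Omega from data rather than elicit them from an investor.

```latex
% Classical Black-Litterman posterior mean (the quantity the paper's
% latent-variable treatment generalizes).
\mathbb{E}[r \mid q, \Omega]
  = \left[ (\tau\Sigma)^{-1} + P^{\top}\Omega^{-1}P \right]^{-1}
    \left[ (\tau\Sigma)^{-1}\pi + P^{\top}\Omega^{-1} q \right]
```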

[LG-81] Ranked differences Pearson correlation dissimilarity with an application to electricity users time series clustering

链接: https://arxiv.org/abs/2505.02173
作者: Chutiphan Charoensuk,Nathakhun Wiroonsri
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: 17 pages

点击查看摘要

Abstract:Time series clustering is an unsupervised learning method for classifying time series data into groups with similar behavior. It is used in applications such as healthcare, finance, economics, energy, and climate science. Several time series clustering methods have been introduced and used for over four decades. Most of them focus on measuring either Euclidean distances or association dissimilarities between time series. In this work, we propose a new dissimilarity measure called ranked Pearson correlation dissimilarity (RDPC), which combines a weighted average of a specified fraction of the largest element-wise differences with the well-known Pearson correlation dissimilarity. It is incorporated into hierarchical clustering. The performance is evaluated and compared with existing clustering algorithms. The results show that the RDPC algorithm outperforms others in complicated cases involving different seasonal patterns, trends, and peaks. Finally, we demonstrate our method by clustering a random sample of customers from a Thai electricity consumption time series dataset into seven groups with unique characteristics.
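
The definition suggests a direct implementation: rank the element-wise absolute differences, average the chosen top fraction, and mix with Pearson correlation dissimilarity. The fraction and mixing weight below are illustrative assumptions; the resulting pairwise matrix can then be fed to standard hierarchical clustering (e.g., scipy's linkage), as the abstract describes.

```python
# Minimal sketch of a ranked-differences Pearson correlation dissimilarity.
import numpy as np

def rdpc(x, y, frac=0.1, weight=0.5):
    diffs = np.sort(np.abs(x - y))[::-1]          # ranked differences
    k = max(1, int(frac * len(diffs)))
    top_term = diffs[:k].mean()                   # largest element-wise gaps
    corr_term = 1.0 - np.corrcoef(x, y)[0, 1]     # Pearson dissimilarity
    return weight * top_term + (1 - weight) * corr_term

t = np.linspace(0, 4 * np.pi, 200)
x = np.sin(t)
y = np.sin(t)
y[100] += 3.0                                     # same shape, one large peak
print(rdpc(x, y), rdpc(x, np.cos(t)))             # peak vs. overall shape change
```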

[LG-82] Efficient Curvature-Aware Hypergradient Approximation for Bilevel Optimization ICML2025

Link: https://arxiv.org/abs/2505.02101
Authors: Youran Dong, Junfeng Yang, Wei Yao, Jin Zhang
Subjects: Optimization and Control (math.OC); Machine Learning (cs.LG)
Comments: Accepted by ICML 2025

Abstract:Bilevel optimization is a powerful tool for many machine learning problems, such as hyperparameter optimization and meta-learning. Estimating hypergradients (also known as implicit gradients) is crucial for developing gradient-based methods for bilevel optimization. In this work, we propose a computationally efficient technique for incorporating curvature information into the approximation of hypergradients and present a novel algorithmic framework based on the resulting enhanced hypergradient computation. We provide convergence rate guarantees for the proposed framework in both deterministic and stochastic scenarios, particularly showing improved computational complexity over popular gradient-based methods in the deterministic setting. This improvement in complexity arises from a careful exploitation of the hypergradient structure and the inexact Newton method. In addition to the theoretical speedup, numerical experiments demonstrate the significant practical performance benefits of incorporating curvature information.
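
For context, the hypergradient that such methods approximate is the standard implicit-function-theorem expression for the bilevel problem min_x F(x) := f(x, y*(x)), where y*(x) = argmin_y g(x, y). The inverse-Hessian-vector product below is exactly where curvature information and inexact Newton steps can enter; the paper's specific approximation scheme is not reproduced here:

```latex
\nabla F(x) = \nabla_x f\bigl(x, y^*(x)\bigr)
  - \nabla_{xy}^2 g\bigl(x, y^*(x)\bigr)
    \bigl[\nabla_{yy}^2 g\bigl(x, y^*(x)\bigr)\bigr]^{-1}
    \nabla_y f\bigl(x, y^*(x)\bigr)
```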

[LG-83] Learning the Simplest Neural ODE

Link: https://arxiv.org/abs/2505.02019
Authors: Yuji Okamoto, Tomoya Takeuchi, Yusuke Sakemi
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Dynamical Systems (math.DS)
Comments: Under review

Abstract:Since the advent of the "Neural Ordinary Differential Equation (Neural ODE)" paper, learning ODEs with deep learning has been applied to system identification, time-series forecasting, and related areas. The diffeomorphic nature of ODE solution maps has also enabled the use of Neural ODEs in generative modeling. Despite the rich potential to incorporate various kinds of physical information, training Neural ODEs remains challenging in practice. This study demonstrates, through the simplest one-dimensional linear model, why training Neural ODEs is difficult. We then propose a new stabilization method and provide an analytical convergence analysis. The insights and techniques presented here serve as a concise tutorial for researchers beginning work on Neural ODEs.
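
The one-dimensional linear model in question can be written as dx/dt = a x with a single scalar parameter a. A minimal sketch of the training setup, assuming forward-Euler integration and plain gradient descent (the paper's stabilization method is not reproduced here):

```python
import numpy as np

# Simplest Neural ODE: learn `a` in dx/dt = a * x from one observation.
a_true, x0, T, steps = -2.0, 1.0, 1.0, 100
dt = T / steps
target = x0 * np.exp(a_true * T)      # exact solution at time T

a, lr = 0.0, 0.1                      # initial guess, learning rate
for _ in range(500):
    xT = x0 * (1.0 + a * dt) ** steps                 # Euler solve: x(T)
    dxT_da = x0 * steps * dt * (1.0 + a * dt) ** (steps - 1)
    grad = 2.0 * (xT - target) * dxT_da               # d/da of (xT - target)^2
    a -= lr * grad
print(a)  # approaches a_true = -2.0 (up to discretization error)
```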

[LG-84] Extended Fiducial Inference for Individual Treatment Effects via Deep Neural Networks

Link: https://arxiv.org/abs/2505.01995
Authors: Sehwan Kim, Faming Liang
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST); Computation (stat.CO)

Abstract:Individual treatment effect estimation has gained significant attention in recent data science literature. This work introduces the Double Neural Network (Double-NN) method to address this problem within the framework of extended fiducial inference (EFI). In the proposed method, deep neural networks are used to model the treatment and control effect functions, while an additional neural network is employed to estimate their parameters. The universal approximation capability of deep neural networks ensures the broad applicability of this method. Numerical results highlight the superior performance of the proposed Double-NN method compared to the conformal quantile regression (CQR) method in individual treatment effect estimation. From the perspective of statistical inference, this work advances the theory and methodology for statistical inference of large models. Specifically, it is theoretically proven that the proposed method permits the model size to increase with the sample size n at a rate of O(n^\zeta) for some 0 \leq \zeta < 1, while still maintaining proper quantification of uncertainty in the model parameters. This result marks a significant improvement compared to the range 0 \leq \zeta < \frac{1}{2} required by the classical central limit theorem. Furthermore, this work provides a rigorous framework for quantifying the uncertainty of deep neural networks under the neural scaling law, representing a substantial contribution to the statistical understanding of large-scale neural network models.

[LG-85] Optimization over Trained (and Sparse) Neural Networks: A Surrogate within a Surrogate

Link: https://arxiv.org/abs/2505.01985
Authors: Hung Pham, Aiden Ren, Ibrahim Tahir, Jiatai Tong, Thiago Serra
Subjects: Optimization and Control (math.OC); Machine Learning (cs.LG)

Abstract:We can approximate a constraint or an objective function that is uncertain or nonlinear with a neural network that we embed in the optimization model. This approach, which is known as constraint learning, faces the challenge that optimization models with neural network surrogates are harder to solve. Such difficulties have motivated studies on model reformulation, specialized optimization algorithms, and - to a lesser extent - pruning of the embedded networks. In this work, we double down on the use of surrogates by applying network pruning to produce a surrogate of the neural network itself. In the context of using a Mixed-Integer Linear Programming (MILP) solver to verify neural networks, we obtained faster adversarial perturbations for dense neural networks by using sparse surrogates, especially - and surprisingly - if not taking the time to finetune the sparse network to make up for the loss in accuracy. In other words, we show that a pruned network with bad classification performance can still be a good - and more efficient - surrogate.
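
The basic ingredient, magnitude pruning a trained weight matrix to obtain a sparser surrogate, is easy to sketch; the criterion and sparsity level below are illustrative, and per the paper's finding no finetuning step is applied afterwards:

```python
import numpy as np

def magnitude_prune(W, sparsity=0.8):
    """Zero out the smallest-magnitude `sparsity` fraction of weights.

    Fewer nonzeros mean fewer variables and constraints in the MILP
    encoding of the network, which is what speeds up verification.
    """
    thresh = np.quantile(np.abs(W), sparsity)
    return np.where(np.abs(W) >= thresh, W, 0.0)

W = np.random.randn(64, 64)        # a dense layer's weights
W_sparse = magnitude_prune(W)
print((W_sparse != 0).mean())      # ~0.2 of the weights survive
```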

[LG-86] Bayesian learning of the optimal action-value function in a Markov decision process

Link: https://arxiv.org/abs/2505.01859
Authors: Jiaqi Guo, Chon Wai Ho, Sumeetpal S. Singh
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Computation (stat.CO)
Comments: 66 pages

Abstract:The Markov Decision Process (MDP) is a popular framework for sequential decision-making problems, and uncertainty quantification is an essential component of it to learn optimal decision-making strategies. In particular, a Bayesian framework is used to maintain beliefs about the optimal decisions and the unknown ingredients of the model, which are also to be learned from the data, such as the rewards and state dynamics. However, many existing Bayesian approaches for learning the optimal decision-making strategy are based on unrealistic modelling assumptions and utilise approximate inference techniques. This raises doubts whether the benefits of Bayesian uncertainty quantification are fully realised or can be relied upon. We focus on infinite-horizon and undiscounted MDPs, with finite state and action spaces, and a terminal state. We provide a full Bayesian framework, from modelling to inference to decision-making. For modelling, we introduce a likelihood function with minimal assumptions for learning the optimal action-value function based on Bellman’s optimality equations, analyse its properties, and clarify connections to existing works. For deterministic rewards, the likelihood is degenerate and we introduce artificial observation noise to relax it, in a controlled manner, to facilitate more efficient Monte Carlo-based inference. For inference, we propose an adaptive sequential Monte Carlo algorithm to both sample from and adjust the sequence of relaxed posterior distributions. For decision-making, we choose actions using samples from the posterior distribution over the optimal strategies. While commonly done, we provide new insight that clearly shows that it is a generalisation of Thompson sampling from multi-arm bandit problems. Finally, we evaluate our framework on the Deep Sea benchmark problem and demonstrate the exploration benefits of posterior sampling in MDPs.

[LG-87] Rank-One Modified Value Iteration

Link: https://arxiv.org/abs/2505.01828
Authors: Arman Sharifi Kolarijani, Tolga Ok, Peyman Mohajerin Esfahani, Mohamad Amin Sharif Kolarijani
Subjects: Optimization and Control (math.OC); Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: 24 pages, 9 figures, conference

Abstract:In this paper, we provide a novel algorithm for solving planning and learning problems of Markov decision processes. The proposed algorithm follows a policy iteration-type update by using a rank-one approximation of the transition probability matrix in the policy evaluation step. This rank-one approximation is closely related to the stationary distribution of the corresponding transition probability matrix, which is approximated using the power method. We provide theoretical guarantees for the convergence of the proposed algorithm to optimal (action-)value function with the same rate and computational complexity as the value iteration algorithm in the planning problem and as the Q-learning algorithm in the learning problem. Through our extensive numerical simulations, however, we show that the proposed algorithm consistently outperforms first-order algorithms and their accelerated versions for both planning and learning problems.
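
The stationary-distribution ingredient can be sketched as follows: the power method yields mu with mu^T P = mu^T, and P is then approximated by the rank-one matrix 1 mu^T. How exactly this term enters the modified policy-evaluation update is not reproduced here:

```python
import numpy as np

def stationary_dist(P, iters=100):
    """Power method for the stationary distribution of row-stochastic P."""
    mu = np.full(P.shape[0], 1.0 / P.shape[0])
    for _ in range(iters):
        mu = P.T @ mu      # one power-method step
        mu /= mu.sum()     # keep it a probability vector
    return mu

P = np.array([[0.9, 0.1],
              [0.2, 0.8]])
mu = stationary_dist(P)                   # ~[2/3, 1/3]
P_rank1 = np.outer(np.ones(len(mu)), mu)  # rank-one surrogate of P
```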

[LG-88] TV-SurvCaus: Dynamic Representation Balancing for Causal Survival Analysis

Link: https://arxiv.org/abs/2505.01785
Authors: Ayoub Abraich
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)

Abstract:Estimating the causal effect of time-varying treatments on survival outcomes is a challenging task in many domains, particularly in medicine where treatment protocols adapt over time. While recent advances in representation learning have improved causal inference for static treatments, extending these methods to dynamic treatment regimes with survival outcomes remains under-explored. In this paper, we introduce TV-SurvCaus, a novel framework that extends representation balancing techniques to the time-varying treatment setting for survival analysis. We provide theoretical guarantees through (1) a generalized bound for time-varying precision in estimation of heterogeneous effects, (2) variance control via sequential balancing weights, (3) consistency results for dynamic treatment regimes, (4) convergence rates for representation learning with temporal dependencies, and (5) a formal bound on the bias due to treatment-confounder feedback. Our neural architecture incorporates sequence modeling to handle temporal dependencies while balancing time-dependent representations. Through extensive experiments on both synthetic and real-world datasets, we demonstrate that TV-SurvCaus outperforms existing methods in estimating individualized treatment effects with time-varying covariates and treatments. Our framework advances the field of causal inference by enabling more accurate estimation of treatment effects in dynamic, longitudinal settings with survival outcomes.

[LG-89] A dynamic view of the double descent

Link: https://arxiv.org/abs/2505.01751
Authors: Vivek Shripad Borkar
Subjects: Optimization and Control (math.OC); Machine Learning (cs.LG)
Comments: 8 pages, 2 figures

Abstract:It has been observed by Belkin et al. that overparametrized neural networks exhibit a 'double descent' phenomenon. That is, as the model complexity, as reflected in the number of features, increases, the training error initially decreases, then increases, and then decreases again. A counterpart of this phenomenon in the time domain has been noted in the context of epoch-wise training, viz., that the training error decreases with time, then increases, then decreases again. This note presents a plausible explanation for this phenomenon by using the theory of two time scale stochastic approximation and singularly perturbed differential equations, applied to the continuous time limit of the gradient dynamics. This adds a 'dynamic' angle to an already well studied theme.

[LG-90] Easz: An Agile Transformer-based Image Compression Framework for Resource-constrained IoTs

Link: https://arxiv.org/abs/2505.01742
Authors: Yu Mao, Jingzong Li, Jun Wang, Hong Xu, Tei-Wei Kuo, Nan Guan, Chun Jason Xue
Subjects: Image and Video Processing (eess.IV); Machine Learning (cs.LG)

Abstract:Neural image compression, necessary in various machine-to-machine communication scenarios, suffers from heavy encode-decode structures and inflexibility in switching between different compression levels. Consequently, it is challenging to deploy neural image compression, which is designed for powerful servers with high computational and storage capacities, on edge devices. We take a step toward solving these challenges by proposing a new transformer-based, edge-compute-free image coding framework called Easz. Easz shifts the computational overhead to the server, and hence avoids the heavy encoding and model-switching overhead on the edge. Easz utilizes a patch-erase algorithm to selectively remove image contents using a conditional uniform-based sampler. The erased pixels are reconstructed on the receiver side through a transformer-based framework. To further reduce the computational overhead on the receiver, we then introduce a lightweight transformer-based reconstruction structure to reduce the reconstruction load on the receiver side. Extensive evaluations conducted on a real-world testbed demonstrate multiple advantages of Easz over existing compression approaches, in terms of adaptability to different compression levels, computational efficiency, and image reconstruction quality.
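
The sender-side step is easy to sketch; plain i.i.d. uniform patch sampling below stands in for the paper's conditional uniform-based sampler, and the patch size and keep probability are illustrative:

```python
import numpy as np

def patch_erase(img, patch=8, keep_prob=0.5, seed=0):
    """Randomly erase square patches on the (weak) sender side.

    Erased regions are simply not transmitted; a transformer on the
    receiver side later reconstructs them from the surviving context.
    """
    rng = np.random.default_rng(seed)
    out = img.copy()
    h, w = img.shape[:2]
    for y in range(0, h, patch):
        for x in range(0, w, patch):
            if rng.random() > keep_prob:
                out[y:y + patch, x:x + patch] = 0  # dropped patch
    return out
```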

[LG-91] Identifying Doppelganger Active Galactic Nuclei across redshifts from spectroscopic surveys

Link: https://arxiv.org/abs/2505.01642
Authors: Shreya Sareen, Swayamtrupta Panda
Subjects: Astrophysics of Galaxies (astro-ph.GA); Instrumentation and Methods for Astrophysics (astro-ph.IM); Machine Learning (cs.LG)
Comments: 4 pages, 1 figure, submitted to AAS journals

Abstract:Active Galactic Nuclei (AGNs) are among the most luminous objects in the universe, making them valuable probes for studying galaxy evolution. However, understanding how AGN properties evolve over cosmic time remains a fundamental challenge. This study investigates whether AGNs at low redshift (nearby) can serve as proxies for their high-redshift (distant) counterparts by identifying spectral ‘doppelgängers’, AGNs with remarkably similar emission line properties despite being separated by vast cosmic distances. We analyze key spectral features of bona fide AGNs using the Sloan Digital Sky Survey’s Data Release 16, including continuum and emission lines: Nitrogen (N V), Carbon (C IV), Magnesium (Mg II), Hydrogen-beta (H \beta ), and Iron (Fe II - optical and UV) emission lines. We incorporated properties such as equivalent width, velocity dispersion in the form of full width at half maximum (FWHM), and continuum luminosities (135nm, 300nm, and 510nm) closest to these prominent lines. Our initial findings suggest the existence of multiple AGNs with highly similar spectra, hinting at the possibility that local AGNs may indeed share intrinsic properties with high-redshift ones. We showcase here one of the better candidate pairs of AGNs resulting from our analyses.

[LG-92] Fast Likelihood-Free Parameter Estimation for Lévy Processes

Link: https://arxiv.org/abs/2505.01639
Authors: Nicolas Coloma, William Kleiber
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Applications (stat.AP); Computation (stat.CO)

Abstract:Lévy processes are widely used in financial modeling due to their ability to capture discontinuities and heavy tails, which are common in high-frequency asset return data. However, parameter estimation remains a challenge when associated likelihoods are unavailable or costly to compute. We propose a fast and accurate method for Lévy parameter estimation using the neural Bayes estimation (NBE) framework – a simulation-based, likelihood-free approach that leverages permutation-invariant neural networks to approximate Bayes estimators. Through extensive simulations across several Lévy models, we show that NBE outperforms traditional methods in both accuracy and runtime, while also enabling rapid bootstrap-based uncertainty quantification. We illustrate our approach on a challenging high-frequency cryptocurrency return dataset, where the method captures evolving parameter dynamics and delivers reliable and interpretable inference at a fraction of the computational cost of traditional methods. NBE provides a scalable and practical solution for inference in complex financial models, enabling parameter estimation and uncertainty quantification over an entire year of data in just seconds. We additionally investigate nearly a decade of high-frequency Bitcoin returns, requiring less than one minute to estimate parameters under the proposed approach.

[LG-93] Seasonal Prediction with Neural GCM and Simplified Boundary Forcings: Large-scale Atmospheric Variability and Tropical Cyclone Activity

Link: https://arxiv.org/abs/2505.01455
Authors: Gan Zhang, Megha Rao, Janni Yuval, Ming Zhao
Subjects: Atmospheric and Oceanic Physics (physics.ao-ph); Machine Learning (cs.LG)

Abstract:Machine learning (ML) models are successful with weather forecasting and have shown progress in climate simulations, yet leveraging them for useful climate predictions needs exploration. Here we show this feasibility using NeuralGCM, a hybrid ML-physics atmospheric model, for seasonal predictions of large-scale atmospheric variability and Northern Hemisphere tropical cyclone (TC) activity. Inspired by physical model studies, we simplify boundary conditions, assuming sea surface temperature (SST) and sea ice follow their climatological cycle but persist anomalies present at initialization. With such forcings, NeuralGCM simulates realistic atmospheric circulation and TC climatology patterns. Furthermore, this configuration yields useful seasonal predictions (July-November) for the tropical atmosphere and various TC activity metrics. Notably, the prediction skill for TC frequency in the North Atlantic and East Pacific basins is comparable to existing physical models. These findings highlight the promise of leveraging ML models with physical insights to model TC risks and deliver seamless weather-climate predictions.

Information Retrieval

[IR-0] Evaluating Contrastive Feedback for Effective User Simulations

Link: https://arxiv.org/abs/2505.02560
Authors: Andreas Konstantin Kruff (1), Timo Breuer (1), Philipp Schaer (1) ((1) TH Köln - University of Applied Sciences)
Subjects: Information Retrieval (cs.IR)

Abstract:The use of Large Language Models (LLMs) for simulating user behavior in the domain of Interactive Information Retrieval has recently gained significant popularity. However, their application and capabilities remain highly debated and understudied. This study explores whether the underlying principles of contrastive training techniques, which have been effective for fine-tuning LLMs, can also be applied beneficially in the area of prompt engineering for user simulations. Previous research has shown that LLMs possess comprehensive world knowledge, which can be leveraged to provide accurate estimates of relevant documents. This study attempts to simulate a knowledge state by enhancing the model with additional implicit contextual information gained during the simulation. This approach enables the model to refine the scope of desired documents further. The primary objective of this study is to analyze how different modalities of contextual information influence the effectiveness of user simulations. Various user configurations were tested, where models are provided with summaries of already judged relevant, irrelevant, or both types of documents in a contrastive manner. The focus of this study is the assessment of the impact of the prompting techniques on the simulated user agent performance. We hereby lay the foundations for leveraging LLMs as part of more realistic simulated users.

[IR-1] Uncertainty in Repeated Implicit Feedback as a Measure of Reliability

Link: https://arxiv.org/abs/2505.02492
Authors: Bruno Sguerra, Viet-Anh Tran, Romain Hennequin, Manuel Moussallam
Subjects: Information Retrieval (cs.IR)

Abstract:Recommender systems rely heavily on user feedback to learn effective user and item representations. Despite their widespread adoption, limited attention has been given to the uncertainty inherent in the feedback used to train these systems. Both implicit and explicit feedback are prone to noise due to the variability in human interactions, with implicit feedback being particularly challenging. In collaborative filtering, the reliability of interaction signals is critical, as these signals determine user and item similarities. Thus, deriving accurate confidence measures from implicit feedback is essential for ensuring the reliability of these signals. A common assumption in academia and industry is that repeated interactions indicate stronger user interest, increasing confidence in preference estimates. However, in domains such as music streaming, repeated consumption can shift user preferences over time due to factors like satiation and exposure. While literature on repeated consumption acknowledges these dynamics, they are often overlooked when deriving confidence scores for implicit feedback. This paper addresses this gap by focusing on music streaming, where repeated interactions are frequent and quantifiable. We analyze how repetition patterns intersect with key factors influencing user interest and develop methods to quantify the associated uncertainty. These uncertainty measures are then integrated as consistency metrics in a recommendation task. Our empirical results show that incorporating uncertainty into user preference models yields more accurate and relevant recommendations. Key contributions include a comprehensive analysis of uncertainty in repeated consumption patterns, the release of a novel dataset, and a Bayesian model for implicit listening feedback.
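
For context, the assumption under scrutiny is commonly operationalized as in classical implicit-feedback matrix factorization (Hu et al., 2008), where the confidence attached to the binarized preference p_{ui} grows linearly in the interaction count r_{ui}; it is precisely this monotonicity that satiation and exposure effects can break:

```latex
c_{ui} = 1 + \alpha\, r_{ui}, \qquad
\min_{x, y} \sum_{u,i} c_{ui} \bigl( p_{ui} - x_u^{\top} y_i \bigr)^2
  + \lambda \Bigl( \textstyle\sum_u \lVert x_u \rVert^2
  + \sum_i \lVert y_i \rVert^2 \Bigr)
```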

[IR-2] Tevatron 2.0: Unified Document Retrieval Toolkit across Scale, Language and Modality SIGIR2025

Link: https://arxiv.org/abs/2505.02466
Authors: Xueguang Ma, Luyu Gao, Shengyao Zhuang, Jiaqi Samantha Zhan, Jamie Callan, Jimmy Lin
Subjects: Information Retrieval (cs.IR)
Comments: Accepted in SIGIR 2025 (Demo)

Abstract:Recent advancements in large language models (LLMs) have driven interest in billion-scale retrieval models with strong generalization across retrieval tasks and languages. Additionally, progress in large vision-language models has created new opportunities for multimodal retrieval. In response, we have updated the Tevatron toolkit, introducing a unified pipeline that enables researchers to explore retriever models at different scales, across multiple languages, and with various modalities. This demo paper highlights the toolkit’s key features, bridging academia and industry by supporting efficient training, inference, and evaluation of neural retrievers. We showcase a unified dense retriever achieving strong multilingual and multimodal effectiveness, and conduct a cross-modality zero-shot study to demonstrate its research potential. Alongside, we release OmniEmbed, to the best of our knowledge, the first embedding model that unifies text, image document, video, and audio retrieval, serving as a baseline for future research.

[IR-3] SymbioticRAG: Enhancing Document Intelligence Through Human-LLM Symbiotic Collaboration

Link: https://arxiv.org/abs/2505.02418
Authors: Qiang Sun, Tingting Bi, Sirui Li, Eun-Jung Holden, Paul Duuring, Kai Niu, Wei Liu
Subjects: Information Retrieval (cs.IR); Human-Computer Interaction (cs.HC)

Abstract:We present SymbioticRAG, a novel framework that fundamentally reimagines Retrieval-Augmented Generation (RAG) systems by establishing a bidirectional learning relationship between humans and machines. Our approach addresses two critical challenges in current RAG systems: the inherently human-centered nature of relevance determination and users’ progression from “unconscious incompetence” in query formulation. SymbioticRAG introduces a two-tier solution where Level 1 enables direct human curation of retrieved content through interactive source document exploration, while Level 2 aims to build personalized retrieval models based on captured user interactions. We implement Level 1 through three key components: (1) a comprehensive document processing pipeline with specialized models for layout detection, OCR, and extraction of tables, formulas, and figures; (2) an extensible retriever module supporting multiple retrieval strategies; and (3) an interactive interface that facilitates both user engagement and interaction data logging. We experiment with the Level 2 implementation via a retriever strategy that incorporates LLM-summarized user intention from user interaction logs. To maintain high-quality data preparation, we develop a human-on-the-loop validation interface that improves pipeline output while advancing research in specialized extraction tasks. Evaluation across three scenarios (literature review, geological exploration, and education) demonstrates significant improvements in retrieval relevance and user satisfaction compared to traditional RAG approaches. To facilitate broader research and further advancement of SymbioticRAG Level 2 implementation, we will make our system openly accessible to the research community.

[IR-4] Minimally Supervised Hierarchical Domain Intent Learning for CRS

Link: https://arxiv.org/abs/2505.02209
Authors: Safikureshi Mondal, Subhasis Dasgupta, Amarnath Gupta
Subjects: Information Retrieval (cs.IR); Human-Computer Interaction (cs.HC)
Comments: This research is funded by the National Institute of Food and Agriculture, U.S. Department of Agriculture (USDA)

Abstract:Modeling domain intent within an evolving domain structure presents a significant challenge for domain-specific conversational recommendation systems (CRS). The conventional approach involves training an intent model using utterance-intent pairs. However, as new intents and patterns emerge, the model must be continuously updated while preserving existing relationships and maintaining efficient retrieval. This process leads to substantial growth in utterance-intent pairs, making manual labeling increasingly costly and impractical. In this paper, we propose an efficient solution for constructing a dynamic hierarchical structure that minimizes the number of user utterances required to achieve adequate domain knowledge coverage. To this end, we introduce a neural network-based, attention-driven hierarchical clustering algorithm designed to optimize intent grouping using minimal data. The proposed method builds upon and integrates concepts from two existing flat clustering algorithms, DEC and NAM, both of which utilize neural attention mechanisms. We apply our approach to a curated subset of 44,000 questions from the business food domain. Experimental results demonstrate that constructing the hierarchy using a stratified sampling strategy significantly reduces the number of questions needed to represent the evolving intent structure. Our findings indicate that this approach enables efficient coverage of dynamic domain knowledge without frequent retraining, thereby enhancing scalability and adaptability in domain-specific CRSs.

[IR-5] Embedding based retrieval for long tail search queries in ecommerce RECSYS’24

Link: https://arxiv.org/abs/2505.01946
Authors: Akshay Kekuda, Yuyang Zhang, Arun Udayashankar
Subjects: Information Retrieval (cs.IR)
Comments: Published at RecSys '24: Proceedings of the 18th ACM Conference on Recommender Systems

Abstract:In this abstract we present a series of optimizations we performed on the two-tower model architecture [14] and on the training and evaluation datasets to implement semantic product search at Best Buy. Search queries on this http URL follow the Pareto distribution, whereby a minority of them account for most searches. This leaves us with a long tail of search queries that have low frequency of issuance. The queries in the long tail suffer from very sparse interaction signals. Our current work focuses on building a model to serve the long tail queries. We present a series of optimizations we have done to this model to maximize conversion for the purpose of retrieval from the catalog. The first optimization we present is using a large language model to improve the sparsity of conversion signals. The second optimization is pretraining an off-the-shelf transformer-based model on the Best Buy catalog data. The third optimization we present is on the finetuning front. We use query-to-query pairs in addition to query-to-product pairs and combine the above strategies for finetuning the model. We also demonstrate how merging the weights of these finetuned models improves the evaluation metrics. Finally, we provide a recipe for curating an evaluation dataset for continuous monitoring of model performance with human-in-the-loop evaluation. We found that adding this recall mechanism to our current term match-based recall improved conversion by 3% in an online A/B test.

[IR-6] A Generalised and Adaptable Reinforcement Learning Stopping Method SIGIR2025

Link: https://arxiv.org/abs/2505.01907
Authors: Reem Bin-Hezam, Mark Stevenson
Subjects: Information Retrieval (cs.IR)
Comments: Accepted by SIGIR 2025

Abstract:This paper presents a Technology Assisted Review (TAR) stopping approach based on Reinforcement Learning (RL). Previous such approaches offered limited control over stopping behaviour, such as fixing the target recall and the tradeoff between maximising recall and minimising cost. These limitations are overcome by introducing a novel RL environment, GRLStop, that allows a single model to be applied to multiple target recalls, balances the recall/cost tradeoff and integrates a classifier. Experiments were carried out on six benchmark datasets (CLEF e-Health datasets 2017-9, TREC Total Recall, TREC Legal and Reuters RCV1) at multiple target recall levels. Results showed the proposed approach to be effective compared to multiple baselines, in addition to offering greater flexibility.

[IR-7] Exploring the Role of Diversity in Example Selection for In-Context Learning

Link: https://arxiv.org/abs/2505.01842
Authors: Janak Kapuriya, Manit Kaushik, Debasis Ganguly, Sumit Bhatia
Subjects: Information Retrieval (cs.IR)

Abstract:In-Context Learning (ICL) has gained prominence due to its ability to perform tasks without requiring extensive training data and its robustness to noisy labels. A typical ICL workflow involves selecting localized examples relevant to a given input using sparse or dense embedding-based similarity functions. However, relying solely on similarity-based selection may introduce topical biases in the retrieved contexts, potentially leading to suboptimal downstream performance. We posit that reranking the retrieved context to enhance topical diversity can improve downstream task performance. To achieve this, we leverage maximum marginal relevance (MMR) which balances topical similarity with inter-example diversity. Our experimental results demonstrate that diversifying the selected examples leads to consistent improvements in downstream performance across various context sizes and similarity functions. The implementation of our approach is made available at this https URL.
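
MMR itself is standard and easy to sketch; a minimal reranker over candidate example embeddings, assuming cosine similarity and the usual trade-off weight lambda (the paper's exact configuration may differ):

```python
import numpy as np

def mmr_rerank(query_vec, cand_vecs, k, lam=0.7):
    """Select k examples balancing query relevance against redundancy."""
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    rel = np.array([cos(query_vec, c) for c in cand_vecs])
    selected, remaining = [], list(range(len(cand_vecs)))
    while remaining and len(selected) < k:
        def mmr_score(i):
            redundancy = max((cos(cand_vecs[i], cand_vecs[j])
                              for j in selected), default=0.0)
            return lam * rel[i] - (1.0 - lam) * redundancy
        best = max(remaining, key=mmr_score)
        selected.append(best)
        remaining.remove(best)
    return selected  # indices of chosen in-context examples, in pick order
```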

[IR-8] SimAug: Enhancing Recommendation with Pretrained Language Models for Dense and Balanced Data Augmentation

Link: https://arxiv.org/abs/2505.01695
Authors: Yuying Zhao, Xiaodong Yang, Huiyuan Chen, Xiran Fan, Yu Wang, Yiwei Cai, Tyler Derr
Subjects: Information Retrieval (cs.IR)

Abstract:Deep Neural Networks (DNNs) are extensively used in collaborative filtering due to their impressive effectiveness. These systems depend on interaction data to learn user and item embeddings that are crucial for recommendations. However, the data often suffers from sparsity and imbalance issues: limited observations of user-item interactions can result in sub-optimal performance, and a predominance of interactions with popular items may introduce recommendation bias. To address these challenges, we employ Pretrained Language Models (PLMs) to enhance the interaction data with textual information, leading to a denser and more balanced dataset. Specifically, we propose a simple yet effective data augmentation method (SimAug) based on the textual similarity from PLMs, which can be seamlessly integrated to any systems as a lightweight, plug-and-play component in the pre-processing stage. Our experiments across nine datasets consistently demonstrate improvements in both utility and fairness when training with the augmented data generated by SimAug. The code is available at this https URL.

[IR-9] Effective Inference-Free Retrieval for Learned Sparse Representations

Link: https://arxiv.org/abs/2505.01452
Authors: Franco Maria Nardini, Thong Nguyen, Cosimo Rulli, Rossano Venturini, Andrew Yates
Subjects: Information Retrieval (cs.IR)

Abstract:Learned Sparse Retrieval (LSR) is an effective IR approach that exploits pre-trained language models for encoding text into a learned bag of words. Several efforts in the literature have shown that sparsity is key to enabling a good trade-off between the efficiency and effectiveness of the query processor. To induce the right degree of sparsity, researchers typically use regularization techniques when training LSR models. Recently, new efficient – inverted index-based – retrieval engines have been proposed, leading to a natural question: has the role of regularization changed in training LSR models? In this paper, we conduct an extended evaluation of regularization approaches for LSR where we discuss their effectiveness, efficiency, and out-of-domain generalization capabilities. We first show that regularization can be relaxed to produce more effective LSR encoders. We also show that query encoding is now the bottleneck limiting the overall query processor performance. To remove this bottleneck, we advance the state-of-the-art of inference-free LSR by proposing Learned Inference-free Retrieval (Li-LSR). At training time, Li-LSR learns a score for each token, casting the query encoding step into a seamless table lookup. Our approach yields state-of-the-art effectiveness for both in-domain and out-of-domain evaluation, surpassing Splade-v3-Doc by 1 point of MRR@10 on MS MARCO and 1.8 points of nDCG@10 on BEIR.
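
The table-lookup idea can be sketched as follows, assuming a learned per-token score table on the query side and precomputed document term weights; all names and values are hypothetical:

```python
def score(query_tokens, token_score, doc_weights):
    """Inference-free LSR scoring: no neural query encoder at query time.

    token_score : dict token -> learned query-side score (a table lookup)
    doc_weights : dict token -> weight for one document, produced offline
                  by the document encoder and stored in an inverted index
    """
    return sum(token_score.get(t, 0.0) * doc_weights.get(t, 0.0)
               for t in query_tokens)

token_score = {"neural": 1.3, "retrieval": 0.9}                  # hypothetical
doc_weights = {"neural": 0.7, "retrieval": 1.1, "sparse": 0.4}   # hypothetical
print(score(["neural", "retrieval"], token_score, doc_weights))  # 1.9
```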

[IR-10] AdSight: Scalable and Accurate Quantification of User Attention in Multi-Slot Sponsored Search

Link: https://arxiv.org/abs/2505.01451
Authors: Mario Villaizán-Vallelado, Matteo Salvatori, Kayhan Latifzadeh, Antonio Penta, Luis A. Leiva, Ioannis Arapakis
Subjects: Information Retrieval (cs.IR)

Abstract:Modern Search Engine Results Pages (SERPs) present complex layouts where multiple elements compete for visibility. Attention modelling is crucial for optimising web design and computational advertising, whereas attention metrics can inform ad placement and revenue strategies. We introduce AdSight, a method leveraging mouse cursor trajectories to quantify in a scalable and accurate manner user attention in multi-slot environments like SERPs. AdSight uses a novel Transformer-based sequence-to-sequence architecture where the encoder processes cursor trajectory embeddings, and the decoder incorporates slot-specific features, enabling robust attention prediction across various SERP layouts. We evaluate our approach on two Machine Learning tasks: (1) regression, to predict fixation times and counts; and (2) classification, to determine whether certain slot types were noticed. Our findings demonstrate the model’s ability to predict attention with unprecedented precision, offering actionable insights for researchers and practitioners.

[IR-11] LLM-Enabled EV Charging Stations Recommendation

Link: https://arxiv.org/abs/2505.01447
Authors: Zeinab Teimoori
Subjects: Information Retrieval (cs.IR); Emerging Technologies (cs.ET)
Comments: 5 pages, 4 figures, 2 tables

Abstract:Charging infrastructure is not expanding quickly enough to accommodate the increasing usage of Electric Vehicles (EVs). For this reason, EV owners experience extended waiting periods, range anxiety, and overall dissatisfaction. Challenges, such as fragmented data and the complexity of integrating factors like location, energy pricing, and user preferences, make current recommendation systems ineffective. To overcome these limitations, we propose RecomBot, a Large Language Model (LLM)-powered, prompt-based recommender system that dynamically suggests optimal Charging Stations (CSs) using real-time heterogeneous data. By leveraging natural language reasoning and fine-tuning on EV-specific datasets, RecomBot enhances personalization, improves charging efficiency, and adapts to various EV types, offering a scalable solution for intelligent EV recommendation systems. Results obtained from testing across various prompt-engineering scenarios underline the capability and efficiency of the proposed model.

[IR-12] Algorithm Performance Spaces for Strategic Dataset Selection

Link: https://arxiv.org/abs/2505.01442
Authors: Steffen Schulz
Subjects: Information Retrieval (cs.IR)
Comments: Bachelor’s thesis, 29 pages, 9 figures, 1 table

Abstract:The evaluation of new algorithms in recommender systems frequently depends on publicly available datasets, such as those from MovieLens or Amazon. Some of these datasets are being disproportionately utilized primarily due to their historical popularity as baselines rather than their suitability for specific research contexts. This thesis addresses this issue by introducing the Algorithm Performance Space, a novel framework designed to differentiate datasets based on the measured performance of algorithms applied to them. An experimental study proposes three metrics to quantify and justify dataset selection to evaluate new algorithms. These metrics also validate assumptions about datasets, such as the similarity between MovieLens datasets of varying sizes. By creating an Algorithm Performance Space and using the proposed metrics, differentiating datasets was made possible, and diverse dataset selections could be found. While the results demonstrate the framework’s potential, further research proposals and implications are discussed to develop Algorithm Performance Spaces tailored to diverse use cases.

Attachments

Click to download today's full paper list