本篇博文主要内容为 2025-04-30 从Arxiv.org论文网站获取的最新论文列表,自动更新,按照NLP、CV、ML、AI、IR五个大方向区分,若需要邮件定时接收,请在评论区留下你的邮箱号。
说明:每日论文数据从Arxiv.org获取,每天早上12:00左右定时自动更新。
友情提示: 如何您需要邮箱接收每日论文数据,请在评论处留下你的邮箱。
目录
概览 (2025-04-30)
今日共更新491篇论文,其中:
- 自然语言处理共57篇(Computation and Language (cs.CL))
- 人工智能共140篇(Artificial Intelligence (cs.AI))
- 计算机视觉共80篇(Computer Vision and Pattern Recognition (cs.CV))
- 机器学习共150篇(Machine Learning (cs.LG))
自然语言处理
[NLP-0] SetKE: Knowledge Editing for Knowledge Elements Overlap DATE
【速读】: 该论文试图解决知识编辑(Knowledge Editing, KE)中因知识元素重叠(Knowledge Element Overlap, KEO)现象导致的编辑冲突问题,该问题会显著影响现有KE方法在处理共享共同元素的三元组时的性能。解决方案的关键在于提出一种新的知识集编辑(Knowledge Set Editing, KSE)范式,并引入SetKE方法,该方法能够同时编辑一组三元组,从而有效缓解KEO带来的负面影响。
链接: https://arxiv.org/abs/2504.20972
作者: Yifan Wei,Xiaoyan Yu,Ran Song,Hao Peng,Angsheng Li
机构: Beihang University (北京航空航天大学); Beijing Academy of Artificial Intelligence (北京人工智能研究院); Beijing Institute of Technology (北京理工大学); Kunming University of Science and Technology (昆明理工大学)
类目: Computation and Language (cs.CL)
备注: The CR version will be updated subsequently
Abstract:Large Language Models (LLMs) excel in tasks such as retrieval and question answering but require updates to incorporate new knowledge and reduce inaccuracies and hallucinations. Traditional updating methods, like fine-tuning and incremental learning, face challenges such as overfitting and high computational costs. Knowledge Editing (KE) provides a promising alternative but often overlooks the Knowledge Element Overlap (KEO) phenomenon, where multiple triplets share common elements, leading to editing conflicts. We identify the prevalence of KEO in existing KE datasets and show its significant impact on current KE methods, causing performance degradation in handling such triplets. To address this, we propose a new formulation, Knowledge Set Editing (KSE), and introduce SetKE, a method that edits sets of triplets simultaneously. Experimental results demonstrate that SetKE outperforms existing methods in KEO scenarios on mainstream LLMs. Additionally, we introduce EditSet, a dataset containing KEO triplets, providing a comprehensive benchmark.
zh
[NLP-1] OSVBench: Benchmarking LLM s on Specification Generation Tasks for Operating System Verification
【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在生成与操作系统内核验证相关的完整规范代码时的性能不足问题。解决方案的关键在于构建一个名为OSVBench的新基准,该基准通过将规范生成问题转化为受限语法和语义范围内的程序合成问题,要求LLMs理解提供的验证假设及潜在的语法和语义空间,从而在操作系统的高层次功能描述指导下生成可能存在缺陷的操作系统代码的完整规范。
链接: https://arxiv.org/abs/2504.20964
作者: Shangyu Li,Juyong Jiang,Tiancheng Zhao,Jiasi Shen
机构: The Hong Kong University of Science and Technology (香港科技大学); The Hong Kong University of Science and Technology (Guangzhou) (香港科技大学(广州)); Georgia Institute of Technology (佐治亚理工学院)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Operating Systems (cs.OS); Programming Languages (cs.PL); Software Engineering (cs.SE)
备注:
Abstract:We introduce OSVBench, a new benchmark for evaluating Large Language Models (LLMs) in generating complete specification code pertaining to operating system kernel verification tasks. The benchmark first defines the specification generation problem into a program synthesis problem within a confined scope of syntax and semantics by providing LLMs with the programming model. The LLMs are required to understand the provided verification assumption and the potential syntax and semantics space to search for, then generate the complete specification for the potentially buggy operating system code implementation under the guidance of the high-level functional description of the operating system. This benchmark is built upon a real-world operating system kernel, Hyperkernel, and consists of 245 complex specification generation tasks in total, each is a long context task of about 20k-30k tokens. Our comprehensive evaluation of 12 LLMs exhibits the limited performance of the current LLMs on the specification generation tasks for operating system verification. Significant disparities in their performance on the benchmark highlight differences in their ability to handle long-context code generation tasks. The evaluation toolkit and benchmark are available at this https URL.
zh
[NLP-2] Information Gravity: A Field-Theoretic Model for Token Selection in Large Language Models
【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在文本生成过程中表现出的复杂行为机制问题,特别是如幻觉、查询表述敏感性以及采样温度对输出多样性的影响等现象。其解决方案的关键在于提出“信息引力”(information gravity)理论模型,该模型借鉴场论和时空几何的物理框架,将用户查询视为具有“信息质量”的物体,通过其对语义空间的弯曲作用,形成引力势阱以“吸引”生成的标记。这一模型为理解LLM的行为提供了新的理论视角和解释机制。
链接: https://arxiv.org/abs/2504.20951
作者: Maryna Vyshnyvetska
机构: 未知
类目: Computation and Language (cs.CL)
备注: 12 pages, 1 figure
Abstract:We propose a theoretical model called “information gravity” to describe the text generation process in large language models (LLMs). The model uses physical apparatus from field theory and spacetime geometry to formalize the interaction between user queries and the probability distribution of generated tokens. A query is viewed as an object with “information mass” that curves the semantic space of the model, creating gravitational potential wells that “attract” tokens during generation. This model offers a mechanism to explain several observed phenomena in LLM behavior, including hallucinations (emerging from low-density semantic voids), sensitivity to query formulation (due to semantic field curvature changes), and the influence of sampling temperature on output diversity.
zh
[NLP-3] race-of-Thought: Enhanced Arithmetic Problem Solving via Reasoning Distillation From Large to Small Language Models
【速读】: 该论文试图解决在需要专业领域知识(如算术推理)的场景下,大型语言模型(Large Language Models, LLMs)在资源消耗和可定制性方面的局限性问题。其解决方案的关键在于引入一种名为“思维链提示”(Trace-of-Thought Prompting)的简单、零样本提示工程方法,该方法通过指导LLMs生成可观察的子问题来增强算术推理能力,从而在开源模型上实现显著的性能提升。
链接: https://arxiv.org/abs/2504.20946
作者: Tyler McDonald,Ali Emami
机构: Brock University (布鲁克大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:As Large Language Models (LLMs) continue to be leveraged for daily tasks, prompt engineering remains an active field of contribution within computational linguistics, particularly in domains requiring specialized knowledge such as arithmetic reasoning. While these LLMs are optimized for a variety of tasks, their exhaustive employment may become computationally or financially cumbersome for small teams. Additionally, complete reliance on proprietary, closed-source models often limits customization and adaptability, posing significant challenges in research and application scalability. Instead, by leveraging open-source models at or below 7 billion parameters, we can optimize our resource usage while still observing remarkable gains over standard prompting approaches. To cultivate this notion, we introduce Trace-of-Thought Prompting, a simple, zero-shot prompt engineering method that instructs LLMs to create observable subproblems using critical problem-solving, specifically designed to enhance arithmetic reasoning capabilities. When applied to open-source models in tandem with GPT-4, we observe that Trace-of-Thought not only allows novel insight into the problem-solving process but also introduces performance gains as large as 125% on language models at or below 7 billion parameters. This approach underscores the potential of open-source initiatives in democratizing AI research and improving the accessibility of high-quality computational linguistics applications.
zh
[NLP-4] owards Understanding the Nature of Attention with Low-Rank Sparse Decomposition
【速读】: 该论文试图解决Transformer模型中注意力机制的复杂性和可解释性问题,特别是多头自注意力(Multi Head Self Attention, MHSA)中注意力超位置(attention superposition)带来的理解困难。其解决方案的关键是提出低秩稀疏注意力(Low-Rank Sparse Attention, Lorsa),通过将MHSA分解为可单独理解的组件,实现对不同标记位置间特征交互的更清晰解析。Lorsa作为一种稀疏字典学习方法,能够发现更精细的注意力行为,并在可解释性和电路发现能力方面表现出优于稀疏自编码器(Sparse Autoencoder, SAE)的特性。
链接: https://arxiv.org/abs/2504.20938
作者: Zhengfu He,Junxuan Wang,Rui Lin,Xuyang Ge,Wentao Shu,Qiong Tang,Junping Zhang,Xipeng Qiu
机构: 未知
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:
Abstract:We propose Low-Rank Sparse Attention (Lorsa), a sparse replacement model of Transformer attention layers to disentangle original Multi Head Self Attention (MHSA) into individually comprehensible components. Lorsa is designed to address the challenge of attention superposition to understand attention-mediated interaction between features in different token positions. We show that Lorsa heads find cleaner and finer-grained versions of previously discovered MHSA behaviors like induction heads, successor heads and attention sink behavior (i.e., heavily attending to the first token). Lorsa and Sparse Autoencoder (SAE) are both sparse dictionary learning methods applied to different Transformer components, and lead to consistent findings in many ways. For instance, we discover a comprehensive family of arithmetic-specific Lorsa heads, each corresponding to an atomic operation in Llama-3.1-8B. Automated interpretability analysis indicates that Lorsa achieves parity with SAE in interpretability while Lorsa exhibits superior circuit discovery properties, especially for features computed collectively by multiple MHSA heads. We also conduct extensive experiments on architectural design ablation, Lorsa scaling law and error analysis.
zh
[NLP-5] ChestX-Reason er: Advancing Radiology Foundation Models with Reasoning through Step-by-Step Verification
【速读】: 该论文旨在解决医疗AI模型在复杂任务中忽视临床实践中结构化推理过程的问题。其关键解决方案是构建一个名为ChestX-Reasoner的多模态大语言模型(Multimodal Large Language Models, MLLMs),该模型通过直接从临床报告中挖掘过程监督信号,模拟放射科医生的逐步推理流程,并利用包含59K视觉问答样本和301K个临床验证推理步骤的RadRBench-CXR基准进行训练与评估,同时引入RadRScore指标以衡量推理的真实性、完整性和有效性。
链接: https://arxiv.org/abs/2504.20930
作者: Ziqing Fan,Cheng Liang,Chaoyi Wu,Ya Zhang,Yanfeng Wang,Weidi Xie
机构: Shanghai Jiao Tong University (上海交通大学); Shanghai Artificial Intelligence Laboratory (上海人工智能实验室)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Recent advances in reasoning-enhanced large language models (LLMs) and multimodal LLMs (MLLMs) have significantly improved performance in complex tasks, yet medical AI models often overlook the structured reasoning processes inherent in clinical practice. In this work, we present ChestX-Reasoner, a radiology diagnosis MLLM designed to leverage process supervision mined directly from clinical reports, reflecting the step-by-step reasoning followed by radiologists. We construct a large dataset by extracting and refining reasoning chains from routine radiology reports. Our two-stage training framework combines supervised fine-tuning and reinforcement learning guided by process rewards to better align model reasoning with clinical standards. We introduce RadRBench-CXR, a comprehensive benchmark featuring 59K visual question answering samples with 301K clinically validated reasoning steps, and propose RadRScore, a metric evaluating reasoning factuality, completeness, and effectiveness. ChestX-Reasoner outperforms existing medical and general-domain MLLMs in both diagnostic accuracy and reasoning ability, achieving 16%, 5.9%, and 18% improvements in reasoning ability compared to the best medical MLLM, the best general MLLM, and its base model, respectively, as well as 3.3%, 24%, and 27% improvements in outcome accuracy. All resources are open-sourced to facilitate further research in medical reasoning MLLMs.
zh
[NLP-6] DYNAMAX: Dynamic computing for Transformers and Mamba based architectures IJCNN2025
【速读】: 该论文试图解决在大语言模型(Large Language Models, LLMs)中如何有效降低计算成本和延迟的问题,特别是在Decoder-only架构及Mamba模型中的应用尚未充分探索的情况下。解决方案的关键在于提出DYNAMAX框架,首次利用Mamba架构的独特性质实现早期退出(Early Exits, EEs)机制,并将Mamba重新定位为高效EE分类器,适用于基于Mamba和Transformer的LLMs,从而在保持性能的同时实现计算资源的优化。
链接: https://arxiv.org/abs/2504.20922
作者: Miguel Nogales,Matteo Gambella,Manuel Roveri
机构: Università della Svizzera Italiana (瑞士意大利语大学); Politecnico di Milano (米兰理工学院)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Accepted to IJCNN 2025
Abstract:Early exits (EEs) offer a promising approach to reducing computational costs and latency by dynamically terminating inference once a satisfactory prediction confidence on a data sample is achieved. Although many works integrate EEs into encoder-only Transformers, their application to decoder-only architectures and, more importantly, Mamba models, a novel family of state-space architectures in the LLM realm, remains insufficiently explored. This work introduces DYNAMAX, the first framework to exploit the unique properties of Mamba architectures for early exit mechanisms. We not only integrate EEs into Mamba but also repurpose Mamba as an efficient EE classifier for both Mamba-based and transformer-based LLMs, showcasing its versatility. Our experiments employ the Mistral 7B transformer compared to the Codestral 7B Mamba model, using data sets such as TruthfulQA, CoQA, and TriviaQA to evaluate computational savings, accuracy, and consistency. The results highlight the adaptability of Mamba as a powerful EE classifier and its efficiency in balancing computational cost and performance quality across NLP tasks. By leveraging Mamba’s inherent design for dynamic processing, we open pathways for scalable and efficient inference in embedded applications and resource-constrained environments. This study underscores the transformative potential of Mamba in redefining dynamic computing paradigms for LLMs.
zh
[NLP-7] he Leaderboard Illusion
【速读】: 该论文试图解决当前AI模型评估基准Chatbot Arena中存在的系统性偏差问题,这些问题导致了不公平的竞争环境。研究指出,部分提供商通过未公开的私有测试实践,在公开发布前测试多个模型变体并选择性地披露性能结果,从而获得偏高的Arena评分。此外,专有闭源模型在竞技场中被抽样频率更高且较少被移除,导致数据访问的不对称性。解决方案的关键在于改革Chatbot Arena的评估框架,以实现更公平和透明的基准测试,减少对特定评估动态的过拟合,提升模型整体质量的评价准确性。
链接: https://arxiv.org/abs/2504.20879
作者: Shivalika Singh,Yiyang Nan,Alex Wang,Daniel D’Souza,Sayash Kapoor,Ahmet Üstün,Sanmi Koyejo,Yuntian Deng,Shayne Longpre,Noah Smith,Beyza Ermis,Marzieh Fadaee,Sara Hooker
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Methodology (stat.ME)
备注: 68 pages, 18 figures, 9 tables
Abstract:Measuring progress is fundamental to the advancement of any scientific field. As benchmarks play an increasingly central role, they also grow more susceptible to distortion. Chatbot Arena has emerged as the go-to leaderboard for ranking the most capable AI systems. Yet, in this work we identify systematic issues that have resulted in a distorted playing field. We find that undisclosed private testing practices benefit a handful of providers who are able to test multiple variants before public release and retract scores if desired. We establish that the ability of these providers to choose the best score leads to biased Arena scores due to selective disclosure of performance results. At an extreme, we identify 27 private LLM variants tested by Meta in the lead-up to the Llama-4 release. We also establish that proprietary closed models are sampled at higher rates (number of battles) and have fewer models removed from the arena than open-weight and open-source alternatives. Both these policies lead to large data access asymmetries over time. Providers like Google and OpenAI have received an estimated 19.2% and 20.4% of all data on the arena, respectively. In contrast, a combined 83 open-weight models have only received an estimated 29.7% of the total data. We show that access to Chatbot Arena data yields substantial benefits; even limited additional data can result in relative performance gains of up to 112% on the arena distribution, based on our conservative estimates. Together, these dynamics result in overfitting to Arena-specific dynamics rather than general model quality. The Arena builds on the substantial efforts of both the organizers and an open community that maintains this valuable evaluation platform. We offer actionable recommendations to reform the Chatbot Arena’s evaluation framework and promote fairer, more transparent benchmarking for the field
zh
[NLP-8] X-Cross: Dynamic Integration of Language Models for Cross-Domain Sequential Recommendation SIGIR’25
【速读】: 该论文试图解决推荐系统在面对新领域时需要快速适应且无需大量重新训练的问题,其核心挑战在于如何在保持领域特定细节的同时实现跨领域的可迁移性。解决方案的关键在于提出“X-Cross”模型,该模型通过集成多个领域特定的语言模型,并利用低秩适配器(LoRA)进行微调,动态地融合不同模型的知识以优化表示,从而在减少额外参数和微调数据需求的情况下,实现高效且准确的跨领域推荐。
链接: https://arxiv.org/abs/2504.20859
作者: Guy Hadad,Haggai Roitman,Yotam Eshel,Bracha Shapira,Lior Rokach
机构: Ben-Gurion University of the Negev (本·古里安大学 of the Negev); eBay (eBay)
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: Accepted for publication in SIGIR '25
Abstract:As new products are emerging daily, recommendation systems are required to quickly adapt to possible new domains without needing extensive retraining. This work presents ``X-Cross’’ – a novel cross-domain sequential-recommendation model that recommends products in new domains by integrating several domain-specific language models; each model is fine-tuned with low-rank adapters (LoRA). Given a recommendation prompt, operating layer by layer, X-Cross dynamically refines the representation of each source language model by integrating knowledge from all other models. These refined representations are propagated from one layer to the next, leveraging the activations from each domain adapter to ensure domain-specific nuances are preserved while enabling adaptability across domains. Using Amazon datasets for sequential recommendation, X-Cross achieves performance comparable to a model that is fine-tuned with LoRA, while using only 25% of the additional parameters. In cross-domain tasks, such as adapting from Toys domain to Tools, Electronics or Sports, X-Cross demonstrates robust performance, while requiring about 50%-75% less fine-tuning data than LoRA to make fine-tuning effective. Furthermore, X-Cross achieves significant improvement in accuracy over alternative cross-domain baselines. Overall, X-Cross enables scalable and adaptive cross-domain recommendations, reducing computational overhead and providing an efficient solution for data-constrained environments.
zh
[NLP-9] JaccDiv: A Metric and Benchmark for Quantifying Diversity of Generated Marketing Text in the Music Industry
【速读】: 该论文试图解决传统生成方法在数据到文本(Data-to-Text)任务中容易陷入重复模式,导致生成内容单调的问题。解决方案的关键在于利用大语言模型(LLM)结合微调、少样本和零样本学习方法,以生成高质量且多样化的营销文本。研究还引入了JaccDiv指标来评估文本集合的多样性,从而为广泛领域的自动化内容生成提供可行方案。
链接: https://arxiv.org/abs/2504.20849
作者: Anum Afzal,Alexandre Mercier,Florian Matthes
机构: 未知
类目: Computation and Language (cs.CL)
备注:
Abstract:Online platforms are increasingly interested in using Data-to-Text technologies to generate content and help their users. Unfortunately, traditional generative methods often fall into repetitive patterns, resulting in monotonous galleries of texts after only a few iterations. In this paper, we investigate LLM-based data-to-text approaches to automatically generate marketing texts that are of sufficient quality and diverse enough for broad adoption. We leverage Language Models such as T5, GPT-3.5, GPT-4, and LLaMa2 in conjunction with fine-tuning, few-shot, and zero-shot approaches to set a baseline for diverse marketing texts. We also introduce a metric JaccDiv to evaluate the diversity of a set of texts. This research extends its relevance beyond the music industry, proving beneficial in various fields where repetitive automated content generation is prevalent.
zh
[NLP-10] Universal language model with the intervention of quantum theory
【速读】: 该论文试图解决自然语言的表示建模问题,其核心在于将量子力学理论引入语言中的符号-意义对,以构建更精确的自然语言表示模型。解决方案的关键在于利用量子力学的数学框架解释和优化词嵌入(word embedding),并进一步运用量子统计等相关理论研究自然语言的数学表征、自然演化及统计特性。论文认为信息的物理性是此类量子特性的来源,并通过实验代码验证了量子理论用于自然语言建模的可行性。
链接: https://arxiv.org/abs/2504.20839
作者: D.-F. Qin
机构: East China Normal University (华东师范大学)
类目: Computation and Language (cs.CL); Quantum Physics (quant-ph)
备注:
Abstract:This paper examines language modeling based on the theory of quantum mechanics. It focuses on the introduction of quantum mechanics into the symbol-meaning pairs of language in order to build a representation model of natural language. At the same time, it is realized that word embedding, which is widely used as a basic technique for statistical language modeling, can be explained and improved by the mathematical framework of quantum mechanics. On this basis, this paper continues to try to use quantum statistics and other related theories to study the mathematical representation, natural evolution and statistical properties of natural language. It is also assumed that the source of such quantum properties is the physicality of information. The feasibility of using quantum theory to model natural language is pointed out through the construction of a experimental code. The paper discusses, in terms of applications, the possible help of the theory in constructing generative models that are popular nowadays. A preliminary discussion of future applications of the theory to quantum computers is also presented.
zh
[NLP-11] uring Machine Evaluation for Large Language Model
【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在计算推理能力方面的评估问题,旨在准确衡量模型理解和执行逻辑计算操作的能力。其解决方案的关键在于提出一种基于通用图灵机(Universal Turing Machine, UTM)模拟的评估框架,并开发了TMBench基准测试工具,该工具通过严格遵循指令和跟踪动态状态来评估模型的计算推理能力,从而实现标准化、可扩展的评估。
链接: https://arxiv.org/abs/2504.20771
作者: Haitao Wu,Zongbo Han,Huaxi Huang,Changqing Zhang
机构: Tianjin University (天津大学); Shanghai AI Laboratory (上海人工智能实验室)
类目: Computation and Language (cs.CL)
备注:
Abstract:With the rapid development and widespread application of Large Language Models (LLMs), rigorous evaluation has become particularly crucial. This research adopts a novel perspective, focusing on evaluating the core computational reasoning ability of LLMs, defined as the capacity of model to accurately understand rules, and execute logically computing operations. This capability assesses the reliability of LLMs as precise executors, and is critical to advanced tasks such as complex code generation and multi-step problem-solving. We propose an evaluation framework based on Universal Turing Machine (UTM) simulation. This framework requires LLMs to strictly follow instructions and track dynamic states, such as tape content and read/write head position, during multi-step computations. To enable standardized evaluation, we developed TMBench, a benchmark for systematically studying the computational reasoning capabilities of LLMs. TMBench provides several key advantages, including knowledge-agnostic evaluation, adjustable difficulty, foundational coverage through Turing machine encoding, and unlimited capacity for instance generation, ensuring scalability as models continue to evolve. We find that model performance on TMBench correlates strongly with performance on other recognized reasoning benchmarks (Pearson correlation coefficient is 0.73), clearly demonstrating that computational reasoning is a significant dimension for measuring the deep capabilities of LLMs. Code and data are available at this https URL.
zh
[NLP-12] Chain-of-Defensive-Thought: Structured Reasoning Elicits Robustness in Large Language Models against Reference Corruption
【速读】: 该论文试图解决大型语言模型在非推理任务中面对参考信息损坏时的鲁棒性不足问题,其解决方案的关键在于引入一种称为“链式防御性思维(chain-of-defensive-thought)”的方法,该方法通过提供少量具有结构化和防御性推理的示例作为示范,显著提升了模型的鲁棒性。
链接: https://arxiv.org/abs/2504.20769
作者: Wenxiao Wang,Parsa Hosseini,Soheil Feizi
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Chain-of-thought prompting has demonstrated great success in facilitating the reasoning abilities of large language models. In this work, we explore how these enhanced reasoning abilities can be exploited to improve the robustness of large language models in tasks that are not necessarily reasoning-focused. In particular, we show how a wide range of large language models exhibit significantly improved robustness against reference corruption using a simple method called chain-of-defensive-thought, where only a few exemplars with structured and defensive reasoning are provided as demonstrations. Empirically, the improvements can be astounding, especially given the simplicity and applicability of the method. For example, in the Natural Questions task, the accuracy of GPT-4o degrades from 60% to as low as 3% with standard prompting when 1 out of 10 references provided is corrupted with prompt injection attacks. In contrast, GPT-4o using chain-of-defensive-thought prompting maintains an accuracy of 50%.
zh
[NLP-13] Grokking in the Wild: Data Augmentation for Real-World Multi-Hop Reasoning with Transformers
【速读】: 该论文试图解决Transformer在多步骤事实推理任务中表现不足的问题,尤其是在现实世界知识稀疏的情况下。其解决方案的关键在于通过在现有知识图谱中引入精心设计的合成数据,提高推断事实与原子事实的比例(\phi_r),从而触发模型从记忆向泛化能力的转变。研究发现,即使合成数据在事实层面存在错误,也能增强模型的推理电路,促使模型依赖关系结构而非记忆进行推理,最终在多跳推理基准测试中显著提升了性能。
链接: https://arxiv.org/abs/2504.20752
作者: Roman Abramov,Felix Steinbauer,Gjergji Kasneci
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Transformers have achieved great success in numerous NLP tasks but continue to exhibit notable gaps in multi-step factual reasoning, especially when real-world knowledge is sparse. Recent advances in grokking have demonstrated that neural networks can transition from memorizing to perfectly generalizing once they detect underlying logical patterns - yet these studies have primarily used small, synthetic tasks. In this paper, for the first time, we extend grokking to real-world factual data and address the challenge of dataset sparsity by augmenting existing knowledge graphs with carefully designed synthetic data to raise the ratio \phi_r of inferred facts to atomic facts above the threshold required for grokking. Surprisingly, we find that even factually incorrect synthetic data can strengthen emergent reasoning circuits rather than degrade accuracy, as it forces the model to rely on relational structure rather than memorization. When evaluated on multi-hop reasoning benchmarks, our approach achieves up to 95-100% accuracy on 2WikiMultiHopQA - substantially improving over strong baselines and matching or exceeding current state-of-the-art results. We further provide an in-depth analysis of how increasing \phi_r drives the formation of generalizing circuits inside Transformers. Our findings suggest that grokking-based data augmentation can unlock implicit multi-hop reasoning capabilities, opening the door to more robust and interpretable factual reasoning in large-scale language models.
zh
[NLP-14] UniversalRAG : Retrieval-Augmented Generation over Multiple Corpora with Diverse Modalities and Granularities
【速读】: 该论文旨在解决现有检索增强生成(Retrieval-Augmented Generation, RAG)方法在处理多模态、异构知识源时的局限性,即大多数现有方法仅限于文本单一模态的语料库,而无法有效整合多种模态和粒度的知识。其解决方案的关键在于提出一种具有模态感知路由机制的通用RAG框架——UniversalRAG,该机制能够动态识别最相关的模态特定语料库并进行针对性检索,同时在每个模态内引入多粒度层级,以适应不同复杂度和范围的查询需求。
链接: https://arxiv.org/abs/2504.20734
作者: Woongyeong Yeo,Kangsan Kim,Soyeong Jeong,Jinheon Baek,Sung Ju Hwang
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR); Machine Learning (cs.LG)
备注: Project page : this https URL
Abstract:Retrieval-Augmented Generation (RAG) has shown substantial promise in improving factual accuracy by grounding model responses with external knowledge relevant to queries. However, most existing RAG approaches are limited to a text-only corpus, and while recent efforts have extended RAG to other modalities such as images and videos, they typically operate over a single modality-specific corpus. In contrast, real-world queries vary widely in the type of knowledge they require, which a single type of knowledge source cannot address. To address this, we introduce UniversalRAG, a novel RAG framework designed to retrieve and integrate knowledge from heterogeneous sources with diverse modalities and granularities. Specifically, motivated by the observation that forcing all modalities into a unified representation space derived from a single combined corpus causes a modality gap, where the retrieval tends to favor items from the same modality as the query, we propose a modality-aware routing mechanism that dynamically identifies the most appropriate modality-specific corpus and performs targeted retrieval within it. Also, beyond modality, we organize each modality into multiple granularity levels, enabling fine-tuned retrieval tailored to the complexity and scope of the query. We validate UniversalRAG on 8 benchmarks spanning multiple modalities, showing its superiority over modality-specific and unified baselines.
zh
[NLP-15] Beyond the Last Answer: Your Reasoning Trace Uncovers More than You Think
【速读】: 该论文试图解决传统评估方法中依赖最终答案可能无法准确反映模型最优结论的问题,以及不同推理路径可能导致不同结果的不确定性。其解决方案的关键在于分析推理过程中的中间步骤(即子思考),并通过语言线索将推理轨迹分割为连续的子思考。随后,从每个子思考的结尾生成延续,并从中提取潜在答案,通过选择出现频率最高的答案(即众数)来提高准确性,实验结果显示该方法在多个大型语言模型和数学推理数据集上均能显著提升性能。
链接: https://arxiv.org/abs/2504.20708
作者: Hasan Abed Al Kader Hammoud,Hani Itani,Bernard Ghanem
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Preprint
Abstract:Large Language Models (LLMs) leverage step-by-step reasoning to solve complex problems. Standard evaluation practice involves generating a complete reasoning trace and assessing the correctness of the final answer presented at its conclusion. In this paper, we challenge the reliance on the final answer by posing the following two questions: Does the final answer reliably represent the model’s optimal conclusion? Can alternative reasoning paths yield different results? To answer these questions, we analyze intermediate reasoning steps, termed subthoughts, and propose a method based on our findings. Our approach involves segmenting a reasoning trace into sequential subthoughts based on linguistic cues. We start by prompting the model to generate continuations from the end-point of each intermediate subthought. We extract a potential answer from every completed continuation originating from different subthoughts. We find that aggregating these answers by selecting the most frequent one (the mode) often yields significantly higher accuracy compared to relying solely on the answer derived from the original complete trace. Analyzing the consistency among the answers derived from different subthoughts reveals characteristics that correlate with the model’s confidence and correctness, suggesting potential for identifying less reliable answers. Our experiments across various LLMs and challenging mathematical reasoning datasets (AIME2024 and AIME2025) show consistent accuracy improvements, with gains reaching up to 13% and 10% respectively. Implementation is available at: this https URL.
zh
[NLP-16] BrightCookies at SemEval-2025 Task 9: Exploring Data Augmentation for Food Hazard Classification
【速读】: 该论文旨在解决食品召回事件报告中危害和产品分类的可解释性分类系统性能问题,特别是在处理少数类别时表现不佳的情况。其解决方案的关键在于采用文本增强技术,特别是词级数据增强方法(如同义词替换、随机词交换和上下文词插入),以提升模型在少数类别上的分类性能,其中上下文词插入在使用BERT模型时表现出对细粒度类别的显著改进。
链接: https://arxiv.org/abs/2504.20703
作者: Foteini Papadopoulou,Osman Mutlu,Neris Özen,Bas H.M. van der Velden,Iris Hendrickx,Ali Hürriyetoğlu
机构: Wageningen Food Safety Research, The Netherlands; Centre for Language Studies, Radboud University, The Netherlands
类目: Computation and Language (cs.CL)
备注:
Abstract:This paper presents our system developed for the SemEval-2025 Task 9: The Food Hazard Detection Challenge. The shared task’s objective is to evaluate explainable classification systems for classifying hazards and products in two levels of granularity from food recall incident reports. In this work, we propose text augmentation techniques as a way to improve poor performance on minority classes and compare their effect for each category on various transformer and machine learning models. We explore three word-level data augmentation techniques, namely synonym replacement, random word swapping, and contextual word insertion. The results show that transformer models tend to have a better overall performance. None of the three augmentation techniques consistently improved overall performance for classifying hazards and products. We observed a statistically significant improvement (P 0.05) in the fine-grained categories when using the BERT model to compare the baseline with each augmented model. Compared to the baseline, the contextual words insertion augmentation improved the accuracy of predictions for the minority hazard classes by 6%. This suggests that targeted augmentation of minority classes can improve the performance of transformer models.
zh
[NLP-17] Can LLM s Detect Intrinsic Hallucinations in Paraphrasing and Machine Translation?
【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在生成过程中容易产生无意义、不逻辑或事实错误内容的问题,即所谓的幻觉(hallucination)。其解决方案的关键在于评估一系列开源LLMs在两种条件生成任务(翻译和改写)中检测内在幻觉的能力,并探讨模型规模、指令微调和提示选择对性能的影响。研究发现,尽管模型表现存在差异,但提示对性能影响不大,同时自然语言推理(Natural Language Inference, NLI)模型的表现与基于LLM的检测器相当,表明LLM并非该任务的唯一可行解决方案。
链接: https://arxiv.org/abs/2504.20699
作者: Evangelia Gogoulou,Shorouq Zahra,Liane Guillou,Luise Dürlich,Joakim Nivre
机构: RISE Research Institutes of Sweden, Department of Computer Science (瑞哲研究机构瑞典分部,计算机科学系); Uppsala University, Department of Linguistics and Philology (乌普萨拉大学,语言学与哲学系); KTH Royal Institute of Technology, Division of Software and Computer Systems (皇家理工学院,软件与计算机系统系); University of Edinburgh, School of Informatics (爱丁堡大学,信息学院)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:A frequently observed problem with LLMs is their tendency to generate output that is nonsensical, illogical, or factually incorrect, often referred to broadly as hallucination. Building on the recently proposed HalluciGen task for hallucination detection and generation, we evaluate a suite of open-access LLMs on their ability to detect intrinsic hallucinations in two conditional generation tasks: translation and paraphrasing. We study how model performance varies across tasks and language and we investigate the impact of model size, instruction tuning, and prompt choice. We find that performance varies across models but is consistent across prompts. Finally, we find that NLI models perform comparably well, suggesting that LLM-based detectors are not the only viable option for this specific task.
zh
[NLP-18] Are Information Retrieval Approaches Good at Harmonising Longitudinal Survey Questions in Social Science?
【速读】: 该论文试图解决在纵向社会科学研究中,如何自动检测语义等价问题的问题,以支持长期研究并促进实证研究的标准化。其关键解决方案是提出一种新的信息检索(IR)任务,通过识别问题和回答选项中的概念等价性(如“住房”、“工作”等),以协调纵向人口研究。研究采用多种无监督方法,包括概率模型、语言模型的线性探测以及专门用于IR的预训练神经网络,并发现IR专用神经模型在整体性能上表现最佳。
链接: https://arxiv.org/abs/2504.20679
作者: Wing Yan Li,Zeqiang Wang,Jon Johnson,Suparna De
机构: University of Surrey(萨里大学); University College London(伦敦大学学院)
类目: Computation and Language (cs.CL); Information Retrieval (cs.IR)
备注:
Abstract:Automated detection of semantically equivalent questions in longitudinal social science surveys is crucial for long-term studies informing empirical research in the social, economic, and health sciences. Retrieving equivalent questions faces dual challenges: inconsistent representation of theoretical constructs (i.e. concept/sub-concept) across studies as well as between question and response options, and the evolution of vocabulary and structure in longitudinal text. To address these challenges, our multi-disciplinary collaboration of computer scientists and survey specialists presents a new information retrieval (IR) task of identifying concept (e.g. Housing, Job, etc.) equivalence across question and response options to harmonise longitudinal population studies. This paper investigates multiple unsupervised approaches on a survey dataset spanning 1946-2020, including probabilistic models, linear probing of language models, and pre-trained neural networks specialised for IR. We show that IR-specialised neural models achieve the highest overall performance with other approaches performing comparably. Additionally, the re-ranking of the probabilistic model’s results with neural models only introduces modest improvements of 0.07 at most in F1-score. Qualitative post-hoc evaluation by survey specialists shows that models generally have a low sensitivity to questions with high lexical overlap, particularly in cases where sub-concepts are mismatched. Altogether, our analysis serves to further research on harmonising longitudinal studies in social science.
zh
[NLP-19] Non-native Childrens Automatic Speech Assessment Challenge (NOCASA) ALT
【速读】: 该论文试图解决非母语儿童在第二语言(L2)学习过程中单字发音评估的问题,旨在开发能够用于游戏化发音训练应用的自动语音评估系统。解决方案的关键在于应对可用训练数据有限以及发音水平类别分布高度不平衡的挑战,为此提供了伪匿名化的训练数据集TeflonNorL2,并发布了两种已训练的系统作为基准,其中多任务wav2vec 2.0模型在测试集上取得了最佳性能,其未加权平均召回率(UAR)为36.37%。
链接: https://arxiv.org/abs/2504.20678
作者: Yaroslav Getman,Tamás Grósz,Mikko Kurimo,Giampiero Salvi
机构: 未知
类目: Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
备注: First draft of the baseline paper for the NOCASA competition ( this https URL ), 5 pages
Abstract:This paper presents the “Non-native Children’s Automatic Speech Assessment” (NOCASA) - a data competition part of the IEEE MLSP 2025 conference. NOCASA challenges participants to develop new systems that can assess single-word pronunciations of young second language (L2) learners as part of a gamified pronunciation training app. To achieve this, several issues must be addressed, most notably the limited nature of available training data and the highly unbalanced distribution among the pronunciation level categories. To expedite the development, we provide a pseudo-anonymized training data (TeflonNorL2), containing 10,334 recordings from 44 speakers attempting to pronounce 205 distinct Norwegian words, human-rated on a 1 to 5 scale (number of stars that should be given in the game). In addition to the data, two already trained systems are released as official baselines: an SVM classifier trained on the ComParE_16 acoustic feature set and a multi-task wav2vec 2.0 model. The latter achieves the best performance on the challenge test set, with an unweighted average recall (UAR) of 36.37%.
zh
[NLP-20] A Generative-AI-Driven Claim Retrieval System Capable of Detecting and Retrieving Claims from Social Media Platforms in Multiple Languages
【速读】: 该论文试图解决在线虚假信息传播过程中,事实核查人员面临的重复验证已核实声明的问题,这增加了工作负担并延迟了对新出现声明的响应。解决方案的关键在于利用大语言模型(Large Language Models, LLMs)检索先前核实的声明,评估其与输入内容的相关性,并生成简洁的摘要和解释,从而帮助事实核查人员快速判断某条声明是否已被验证过。
链接: https://arxiv.org/abs/2504.20668
作者: Ivan Vykopal,Martin Hyben,Robert Moro,Michal Gregor,Jakub Simko
机构: Brno University of Technology (布尔诺技术大学); Kempelen Institute of Intelligent Technologies (肯佩伦智能技术研究所)
类目: Computation and Language (cs.CL)
备注:
Abstract:Online disinformation poses a global challenge, placing significant demands on fact-checkers who must verify claims efficiently to prevent the spread of false information. A major issue in this process is the redundant verification of already fact-checked claims, which increases workload and delays responses to newly emerging claims. This research introduces an approach that retrieves previously fact-checked claims, evaluates their relevance to a given input, and provides supplementary information to support fact-checkers. Our method employs large language models (LLMs) to filter irrelevant fact-checks and generate concise summaries and explanations, enabling fact-checkers to faster assess whether a claim has been verified before. In addition, we evaluate our approach through both automatic and human assessments, where humans interact with the developed tool to review its effectiveness. Our results demonstrate that LLMs are able to filter out many irrelevant fact-checks and, therefore, reduce effort and streamline the fact-checking process.
zh
[NLP-21] Cooking Up Creativity: A Cognitively-Inspired Approach for Enhancing LLM Creativity through Structured Representations
【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在创造力方面表现不足的问题,即LLMs虽然在众多任务中表现出色,但在生成具有创新性和多样性的想法时存在局限。解决方案的关键在于将LLMs与结构化表示(structured representations)以及受认知启发的操控方法相结合,通过显式重组现有概念的结构化表示,探索更抽象的想法空间,从而生成更具创造性和多样性的结果。
链接: https://arxiv.org/abs/2504.20643
作者: Moran Mizrahi,Chen Shani,Gabriel Stanovsky,Dan Jurafsky,Dafna Shahaf
机构: Hebrew University of Jerusalem (希伯来大学); Stanford University (斯坦福大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 10 pages, 8 figures
Abstract:Large Language Models (LLMs) excel at countless tasks, yet struggle with creativity. In this paper, we introduce a novel approach that couples LLMs with structured representations and cognitively inspired manipulations to generate more creative and diverse ideas. Our notion of creativity goes beyond superficial token-level variations; rather, we explicitly recombine structured representations of existing ideas, allowing our algorithm to effectively explore the more abstract landscape of ideas. We demonstrate our approach in the culinary domain with DishCOVER, a model that generates creative recipes. Experiments comparing our model’s results to those of GPT-4o show greater diversity. Domain expert evaluations reveal that our outputs, which are mostly coherent and feasible culinary creations, significantly surpass GPT-4o in terms of novelty, thus outperforming it in creative generation. We hope our work inspires further research into structured creativity in AI.
zh
[NLP-22] WenyanGPT : A Large Language Model for Classical Chinese Tasks
【速读】: 该论文旨在解决现有自然语言处理模型在古典汉语(Classical Chinese)任务中表现不足的问题,因为当前模型主要针对现代汉语进行优化。其解决方案的关键在于对LLaMA3-8B-Chinese模型进行持续预训练和指令微调,构建专门用于古典汉语处理的大语言模型WenyanGPT,并开发了评估基准数据集WenyanBENCH,以提升古典汉语相关任务的处理能力。
链接: https://arxiv.org/abs/2504.20609
作者: Xinyu Yao,Mengdi Wang,Bo Chen,Xiaobing Zhao
机构: Minzu University of China (民族大学); National Language Resources Monitoring & Research Center of Languages (国家语言资源监测与研究中心)
类目: Computation and Language (cs.CL)
备注:
Abstract:Classical Chinese, as the core carrier of Chinese culture, plays a crucial role in the inheritance and study of ancient literature. However, existing natural language processing models primarily optimize for Modern Chinese, resulting in inadequate performance on Classical Chinese. This paper presents a comprehensive solution for Classical Chinese language processing. By continuing pre-training and instruction fine-tuning on the LLaMA3-8B-Chinese model, we construct a large language model, WenyanGPT, which is specifically designed for Classical Chinese tasks. Additionally, we develop an evaluation benchmark dataset, WenyanBENCH. Experimental results on WenyanBENCH demonstrate that WenyanGPT significantly outperforms current advanced LLMs in various Classical Chinese tasks. We make the model’s training data, instruction fine-tuning data\footnote, and evaluation benchmark dataset publicly available to promote further research and development in the field of Classical Chinese processing.
zh
[NLP-23] F1-EN-3M: Three Million Synthetic Moral Fables for Training Small Open Language Models
【速读】: 该论文旨在解决现代自然语言处理(Natural Language Processing, NLP)领域中缺乏大规模、结构化道德故事语料库的问题,此类语料库需包含连贯叙事与明确伦理教训。其解决方案的关键在于构建TF1-EN-3M数据集,该数据集由不超过8B参数的指令调优模型生成,包含三百万个英语寓言故事,每个故事遵循六槽框架(角色 - 品质 - 场景 - 冲突 - 解决方案 - 道德),并通过组合提示引擎确保体裁一致性和主题多样性。此外,采用混合评估流程结合基于GPT的批评者与无参考的多样性和可读性指标,以确保生成故事的质量。
链接: https://arxiv.org/abs/2504.20605
作者: Mihai Nadas,Laura Diosan,Andrei Piscoran,Andreea Tomescu
机构: Babe\textcommabelows-Bolyai University (巴贝什-波尔伊利大学); KlusAI Labs (KlusAI 实验室)
类目: Computation and Language (cs.CL)
备注:
Abstract:Moral stories are a time-tested vehicle for transmitting values, yet modern NLP lacks a large, structured corpus that couples coherent narratives with explicit ethical lessons. We close this gap with TF1-EN-3M, the first open dataset of three million English-language fables generated exclusively by instruction-tuned models no larger than 8B parameters. Each story follows a six-slot scaffold (character - trait - setting - conflict - resolution - moral), produced through a combinatorial prompt engine that guarantees genre fidelity while covering a broad thematic space. A hybrid evaluation pipeline blends (i) a GPT-based critic that scores grammar, creativity, moral clarity, and template adherence with (ii) reference-free diversity and readability metrics. Among ten open-weight candidates, an 8B-parameter Llama-3 variant delivers the best quality-speed trade-off, producing high-scoring fables on a single consumer GPU (24 GB VRAM) at approximately 13.5 cents per 1,000 fables. We release the dataset, generation code, evaluation scripts, and full metadata under a permissive license, enabling exact reproducibility and cost benchmarking. TF1-EN-3M opens avenues for research in instruction following, narrative intelligence, value alignment, and child-friendly educational AI, demonstrating that large-scale moral storytelling no longer requires proprietary giant models. Subjects: Computation and Language (cs.CL) Cite as: arXiv:2504.20605 [cs.CL] (or arXiv:2504.20605v1 [cs.CL] for this version) https://doi.org/10.48550/arXiv.2504.20605 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh
[NLP-24] Reason IR: Training Retrievers for Reasoning Tasks KR
【速读】: 该论文试图解决现有检索模型在通用推理任务中的表现有限的问题,主要原因是现有训练数据集侧重于与文档直接对应的简短事实性查询。解决方案的关键在于开发了一种合成数据生成流程,该流程为每篇文档生成具有挑战性和相关性的查询,并附带一个看似相关但最终无用的困难负样本。通过在合成数据和现有公开数据的混合数据上进行训练,ReasonIR-8B在BRIGHT基准上实现了新的最先进性能,展示了其在推理密集型信息检索任务中的有效性。
链接: https://arxiv.org/abs/2504.20595
作者: Rulin Shao,Rui Qiao,Varsha Kishore,Niklas Muennighoff,Xi Victoria Lin,Daniela Rus,Bryan Kian Hsiang Low,Sewon Min,Wen-tau Yih,Pang Wei Koh,Luke Zettlemoyer
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR); Machine Learning (cs.LG)
备注: Our code is released at \url{ this https URL }
Abstract:We present ReasonIR-8B, the first retriever specifically trained for general reasoning tasks. Existing retrievers have shown limited gains on reasoning tasks, in part because existing training datasets focus on short factual queries tied to documents that straightforwardly answer them. We develop a synthetic data generation pipeline that, for each document, our pipeline creates a challenging and relevant query, along with a plausibly related but ultimately unhelpful hard negative. By training on a mixture of our synthetic data and existing public data, ReasonIR-8B achieves a new state-of-the-art of 29.9 nDCG@10 without reranker and 36.9 nDCG@10 with reranker on BRIGHT, a widely-used reasoning-intensive information retrieval (IR) benchmark. When applied to RAG tasks, ReasonIR-8B improves MMLU and GPQA performance by 6.4% and 22.6% respectively, relative to the closed-book baseline, outperforming other retrievers and search engines. In addition, ReasonIR-8B uses test-time compute more effectively: on BRIGHT, its performance consistently increases with longer and more information-rich rewritten queries; it continues to outperform other retrievers when combined with an LLM reranker. Our training recipe is general and can be easily extended to future LLMs; to this end, we open-source our code, data, and model.
zh
[NLP-25] ClonEval: An Open Voice Cloning Benchmark
【速读】: 该论文试图解决语音克隆文本到语音(Text-to-Speech, TTS)模型的评估问题,通过提出一个新颖的基准测试框架来标准化模型性能的评估。解决方案的关键在于构建了一个包含评估协议、开源库以及配套排行榜的完整体系,以提供统一的评价标准和透明的结果展示方式。
链接: https://arxiv.org/abs/2504.20581
作者: Iwona Christop,Tomasz Kuczyński,Marek Kubis
机构: Adam Mickiewicz University (亚当·密茨凯维奇大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:We present a novel benchmark for voice cloning text-to-speech models. The benchmark consists of an evaluation protocol, an open-source library for assessing the performance of voice cloning models, and an accompanying leaderboard. The paper discusses design considerations and presents a detailed description of the evaluation procedure. The usage of the software library is explained, along with the organization of results on the leaderboard.
zh
[NLP-26] Reinforcement Learning for Reasoning in Large Language Models with One Training Example
【速读】: 该论文试图解决如何通过少量训练样本有效提升大型语言模型(Large Language Models, LLMs)的数学推理能力的问题。其解决方案的关键在于使用可验证奖励的强化学习(Reinforcement Learning with Verifiable Reward, RLVR),通过单个示例(1-shot RLVR)来优化模型性能,从而显著提高模型在多个数学推理基准上的表现。实验表明,该方法在不同模型和算法上均表现出色,并揭示了如跨领域泛化、自我反思频率增加及饱和后泛化等现象,同时强调了策略梯度损失和探索促进(如添加熵损失)在训练中的关键作用。
链接: https://arxiv.org/abs/2504.20571
作者: Yiping Wang,Qing Yang,Zhiyuan Zeng,Liliang Ren,Lucas Liu,Baolin Peng,Hao Cheng,Xuehai He,Kuan Wang,Jianfeng Gao,Weizhu Chen,Shuohang Wang,Simon Shaolei Du,Yelong Shen
机构: University of Washington(华盛顿大学); University of Southern California(南加利福尼亚大学); Microsoft(微软); University of California, Santa Cruz(加州大学圣克鲁兹分校); Georgia Institute of Technology(佐治亚理工学院)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 28 pages, 12 figures, link: this https URL
Abstract:We show that reinforcement learning with verifiable reward using one training example (1-shot RLVR) is effective in incentivizing the math reasoning capabilities of large language models (LLMs). Applying RLVR to the base model Qwen2.5-Math-1.5B, we identify a single example that elevates model performance on MATH500 from 36.0% to 73.6%, and improves the average performance across six common mathematical reasoning benchmarks from 17.6% to 35.7%. This result matches the performance obtained using the 1.2k DeepScaleR subset (MATH500: 73.6%, average: 35.9%), which includes the aforementioned example. Similar substantial improvements are observed across various models (Qwen2.5-Math-7B, Llama3.2-3B-Instruct, DeepSeek-R1-Distill-Qwen-1.5B), RL algorithms (GRPO and PPO), and different math examples (many of which yield approximately 30% or greater improvement on MATH500 when employed as a single training example). In addition, we identify some interesting phenomena during 1-shot RLVR, including cross-domain generalization, increased frequency of self-reflection, and sustained test performance improvement even after the training accuracy has saturated, a phenomenon we term post-saturation generalization. Moreover, we verify that the effectiveness of 1-shot RLVR primarily arises from the policy gradient loss, distinguishing it from the “grokking” phenomenon. We also show the critical role of promoting exploration (e.g., by adding entropy loss with an appropriate coefficient) in 1-shot RLVR training. As a bonus, we observe that applying entropy loss alone, without any outcome reward, significantly enhances Qwen2.5-Math-1.5B’s performance on MATH500 by 27.4%. These findings can inspire future work on RLVR data efficiency and encourage a re-examination of both recent progress and the underlying mechanisms in RLVR. Our code, model, and data are open source at this https URL
zh
[NLP-27] BrAIcht a theatrical agent that speaks like Bertolt Brechts characters
【速读】: 该论文试图解决生成具有特定文学风格的对话内容的问题,具体是让AI conversational agent(对话代理)能够模仿德国著名剧作家Bertolt Brecht的独特风格进行对话生成。解决方案的关键在于使用BrAIcht模型,该模型基于German LeoLM(一个具有70亿参数的大语言模型)进行微调,并采用QLoRA(一种参数高效微调技术)以克服内存限制,同时利用Bertolt Brecht的29部戏剧及其他907部风格相近的德语戏剧构建多样化数据集,从而提升模型在生成Brecht风格对话任务上的表现。
链接: https://arxiv.org/abs/2504.20552
作者: Baz Roland,Kristina Malyseva,Anna Pappa(LIASD),Tristan Cazenave(APA)
机构: 未知
类目: Computation and Language (cs.CL)
备注:
Abstract:This project introduces BrAIcht, an AI conversational agent that creates dialogues in the distinctive style of the famous German playwright Bertolt Brecht. BrAIcht is fine-tuned using German LeoLM, a large language model with 7 billion parameters and a modified version of the base Llama2 suitable for German language tasks. For fine-tuning, 29 plays of Bertolt Brecht and 907 of other German plays that are stylistically similar to Bertolt Brecht are used to form a more di-erse dataset. Due to the limited memory capacity, a parameterefficient fine-tuning technique called QLoRA is implemented to train the large language model. The results, based on BLEU score and perplexity, show very promising performance of BrAIcht in generating dialogues in the style of Bertolt Brecht.
zh
[NLP-28] Revisiting the MIMIC-IV Benchmark: Experiments Using Language Models for Electronic Health Records
【速读】: 该论文试图解决医疗领域中针对文本输入缺乏标准化评估基准的问题,这限制了自然语言模型在健康相关下游任务中的广泛应用和潜力发挥。其解决方案的关键在于重新利用公开的MIMIC-IV电子健康记录(Electronic Health Records, EHRs)基准,并将其集成到Hugging Face数据集库中,以促进该数据集的共享与使用。此外,通过研究模板方法将EHR表格数据转换为文本,验证了基于文本的模型在患者死亡率任务上的有效性,表明微调后的文本模型在性能上可与强大的表格分类器相媲美,而零样本大语言模型(Large Language Models, LLMs)则难以有效利用EHR表示。
链接: https://arxiv.org/abs/2504.20547
作者: Jesus Lovon(IRIT-IRIS),Thouria Ben-Haddi,Jules Di Scala,Jose G. Moreno(IRIT-IRIS),Lynda Tamine(IRIT-IRIS)
机构: 未知
类目: Computation and Language (cs.CL)
备注:
Abstract:The lack of standardized evaluation benchmarks in the medical domain for text inputs can be a barrier to widely adopting and leveraging the potential of natural language models for health-related downstream tasks. This paper revisited an openly available MIMIC-IV benchmark for electronic health records (EHRs) to address this issue. First, we integrate the MIMIC-IV data within the Hugging Face datasets library to allow an easy share and use of this collection. Second, we investigate the application of templates to convert EHR tabular data to text. Experiments using fine-tuned and zero-shot LLMs on the mortality of patients task show that fine-tuned text-based models are competitive against robust tabular classifiers. In contrast, zero-shot LLMs struggle to leverage EHR representations. This study underlines the potential of text-based approaches in the medical field and highlights areas for further improvement.
zh
[NLP-29] UniDetox: Universal Detoxification of Large Language Models via Dataset Distillation ICLR2025
【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)中的毒性问题,现有去毒方法通常具有模型特定性,需针对不同模型进行细致的超参数调优,且在去毒效果与语言建模性能之间存在权衡。其解决方案的关键在于提出一种通用的去毒技术——UniDetox,该技术通过对比解码实现去毒数据集的高效蒸馏,生成可泛化至多种LLMs的合成文本数据,从而无需针对每个模型进行单独调优即可实现有效去毒。
链接: https://arxiv.org/abs/2504.20500
作者: Huimin Lu,Masaru Isonuma,Junichiro Mori,Ichiro Sakata
机构: 未知
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: Accepted at ICLR 2025 (poster)
Abstract:We present UniDetox, a universally applicable method designed to mitigate toxicity across various large language models (LLMs). Previous detoxification methods are typically model-specific, addressing only individual models or model families, and require careful hyperparameter tuning due to the trade-off between detoxification efficacy and language modeling performance. In contrast, UniDetox provides a detoxification technique that can be universally applied to a wide range of LLMs without the need for separate model-specific tuning. Specifically, we propose a novel and efficient dataset distillation technique for detoxification using contrastive decoding. This approach distills detoxifying representations in the form of synthetic text data, enabling universal detoxification of any LLM through fine-tuning with the distilled text. Our experiments demonstrate that the detoxifying text distilled from GPT-2 can effectively detoxify larger models, including OPT, Falcon, and LLaMA-2. Furthermore, UniDetox eliminates the need for separate hyperparameter tuning for each model, as a single hyperparameter configuration can be seamlessly applied across different models. Additionally, analysis of the detoxifying text reveals a reduction in politically biased content, providing insights into the attributes necessary for effective detoxification of LLMs.
zh
[NLP-30] Enhancing LLM Language Adaption through Cross-lingual In-Context Pre-training
【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在跨语言迁移中的局限性问题,特别是由于平行语料资源的限制导致的语言和领域覆盖不足。其解决方案的关键在于提出一种名为跨语言上下文预训练(Cross-lingual In-context Pre-training, CrossIC-PT)的方法,通过利用语义相关的双语文本进行简单的下一个词预测,从而增强跨语言迁移能力。该方法通过交错语义相关的双语维基百科文档构建样本,并采用系统化的分段策略和滑动窗口机制以保持上下文连贯性,同时通过语义检索框架扩展数据来源,提升跨语言预训练的效果。
链接: https://arxiv.org/abs/2504.20484
作者: Linjuan Wu,Haoran Wei,Huan Lin,Tianhao Li,Baosong Yang,Weiming Lu
机构: Zhejiang University (浙江大学); Tongyi Lab, Alibaba Group (通义实验室,阿里巴巴集团)
类目: Computation and Language (cs.CL)
备注: 12 pages, 6 figures, Under Review
Abstract:Large language models (LLMs) exhibit remarkable multilingual capabilities despite English-dominated pre-training, attributed to cross-lingual mechanisms during pre-training. Existing methods for enhancing cross-lingual transfer remain constrained by parallel resources, suffering from limited linguistic and domain coverage. We propose Cross-lingual In-context Pre-training (CrossIC-PT), a simple and scalable approach that enhances cross-lingual transfer by leveraging semantically related bilingual texts via simple next-word prediction. We construct CrossIC-PT samples by interleaving semantic-related bilingual Wikipedia documents into a single context window. To access window size constraints, we implement a systematic segmentation policy to split long bilingual document pairs into chunks while adjusting the sliding window mechanism to preserve contextual coherence. We further extend data availability through a semantic retrieval framework to construct CrossIC-PT samples from web-crawled corpus. Experimental results demonstrate that CrossIC-PT improves multilingual performance on three models (Llama-3.1-8B, Qwen2.5-7B, and Qwen2.5-1.5B) across six target languages, yielding performance gains of 3.79%, 3.99%, and 1.95%, respectively, with additional improvements after data augmentation.
zh
[NLP-31] Fane at SemEval-2025 Task 10: Zero-Shot Entity Framing with Large Language Models SEMEVAL2025
【速读】: 该论文试图解决新闻叙事中实体框架角色分类的问题,旨在评估大型语言模型(Large Language Models, LLMs)在零样本条件下的分类能力。其解决方案的关键在于采用分层方法,首先识别宽泛的角色,再细化到细粒度角色,相较于单步分类更具优势。此外,研究还表明输入上下文和提示策略在不同任务层级上存在差异,强调了子任务特定策略的重要性。通过优化提示设计和输入上下文,该方法在主角色准确率(Main Role Accuracy)和精确匹配率(Exact Match Ratio)上分别达到了89.4%和34.5%。
链接: https://arxiv.org/abs/2504.20469
作者: Enfa Fane,Mihai Surdeanu,Eduardo Blanco,Steven R. Corman
机构: University of Arizona (亚利桑那大学); Arizona State University (亚利桑那州立大学)
类目: Computation and Language (cs.CL); Computers and Society (cs.CY)
备注: Accepted to The 19th International Workshop on Semantic Evaluation (Semeval 2025)
Abstract:Understanding how news narratives frame entities is crucial for studying media’s impact on societal perceptions of events. In this paper, we evaluate the zero-shot capabilities of large language models (LLMs) in classifying framing roles. Through systematic experimentation, we assess the effects of input context, prompting strategies, and task decomposition. Our findings show that a hierarchical approach of first identifying broad roles and then fine-grained roles, outperforms single-step classification. We also demonstrate that optimal input contexts and prompts vary across task levels, highlighting the need for subtask-specific strategies. We achieve a Main Role Accuracy of 89.4% and an Exact Match Ratio of 34.5%, demonstrating the effectiveness of our approach. Our findings emphasize the importance of tailored prompt design and input context optimization for improving LLM performance in entity framing.
zh
[NLP-32] Search-Based Interaction For Conversation Recommendation via Generative Reward Model Based Simulated User SIGIR2025
【速读】: 该论文旨在解决对话式推荐系统(Conversational Recommendation Systems, CRSs)中用户偏好理解的挑战,即如何在多轮交互中有效捕捉用户复杂的多维偏好,同时避免频繁用户参与导致的用户体验下降。其解决方案的关键在于提出一种基于生成式奖励模型的模拟用户(Generative Reward Model-based Simulated User, GRSU),通过设计两种反馈动作——生成式物品评分和基于属性的物品评价,以粗粒度和细粒度的方式为推荐系统提供反馈,从而提升对用户偏好的理解能力。此外,该方法通过指令微调合成数据构建统一的模拟用户,并结合束搜索策略与高效的候选排序方法,在保证推荐效果的同时提高交互效率。
链接: https://arxiv.org/abs/2504.20458
作者: Xiaolei Wang,Chunxuan Xia,Junyi Li,Fanzhe Meng,Lei Huang,Jinpeng Wang,Wayne Xin Zhao,Ji-Rong Wen
机构: Gaoling School of Artificial Intelligence, Renmin University of China (中国人民大学人工智能学院); National University of Singapore (新加坡国立大学); University of Electronic Science and Technology of China (电子科技大学); Meituan (美团)
类目: Information Retrieval (cs.IR); Computation and Language (cs.CL)
备注: Accepted by SIGIR 2025
Abstract:Conversational recommendation systems (CRSs) use multi-turn interaction to capture user preferences and provide personalized recommendations. A fundamental challenge in CRSs lies in effectively understanding user preferences from conversations. User preferences can be multifaceted and complex, posing significant challenges for accurate recommendations even with access to abundant external knowledge. While interaction with users can clarify their true preferences, frequent user involvement can lead to a degraded user experience. To address this problem, we propose a generative reward model based simulated user, named GRSU, for automatic interaction with CRSs. The simulated user provides feedback to the items recommended by CRSs, enabling them to better capture intricate user preferences through multi-turn interaction. Inspired by generative reward models, we design two types of feedback actions for the simulated user: i.e., generative item scoring, which offers coarse-grained feedback, and attribute-based item critique, which provides fine-grained feedback. To ensure seamless integration, these feedback actions are unified into an instruction-based format, allowing the development of a unified simulated user via instruction tuning on synthesized data. With this simulated user, automatic multi-turn interaction with CRSs can be effectively conducted. Furthermore, to strike a balance between effectiveness and efficiency, we draw inspiration from the paradigm of reward-guided search in complex reasoning tasks and employ beam search for the interaction process. On top of this, we propose an efficient candidate ranking method to improve the recommendation results derived from interaction. Extensive experiments on public datasets demonstrate the effectiveness, efficiency, and transferability of our approach. Comments: Accepted by SIGIR 2025 Subjects: Information Retrieval (cs.IR); Computation and Language (cs.CL) Cite as: arXiv:2504.20458 [cs.IR] (or arXiv:2504.20458v1 [cs.IR] for this version) https://doi.org/10.48550/arXiv.2504.20458 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh
[NLP-33] Reviving Any-Subset Autoregressive Models with Principled Parallel Sampling and Speculative Decoding
【速读】: 该论文试图解决在任意阶语言模型中如何并行采样来自正确联合分布的标记这一开放性问题。传统离散扩散模型在并行生成更多标记时,其预测分布会偏离原始数据分布,这是由于其依赖于仅在无限小时间步长下有效的条件独立性假设。论文提出的解决方案关键在于利用一种名为任意子集自回归模型(any-subset autoregressive models, AS-ARMs)的模型结构,该模型能够以任意顺序并行生成标记,并支持并行化的联合概率密度估计。通过提出的任意子集推测解码(Any-Subset Speculative Decoding, ASSD)算法,AS-ARMs 可以校正其并行生成的标记分布,从而保证从正确的联合分布中生成标记,同时将神经网络调用次数上限控制在所预测标记数量之内。
链接: https://arxiv.org/abs/2504.20456
作者: Gabe Guo,Stefano Ermon
机构: Stanford University (斯坦福大学)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:
Abstract:In arbitrary-order language models, it is an open question how to sample tokens in parallel from the correct joint distribution. With discrete diffusion models, the more tokens they generate in parallel, the less their predicted distributions adhere to the originally learned data distribution, as they rely on a conditional independence assumption that only works with infinitesimally small timesteps. We find that a different class of models, any-subset autoregressive models (AS-ARMs), holds the solution. As implied by the name, AS-ARMs can generate tokens in any order, and in parallel. Moreover, AS-ARMs support parallelized joint probability density estimation, allowing them to correct their own parallel-generated token distributions, via our Any-Subset Speculative Decoding (ASSD) algorithm. ASSD provably enables generation of tokens from the correct joint distribution, with the number of neural network calls upper bounded by the number of tokens predicted. We empirically verify that ASSD speeds up language generation, without sacrificing quality. Furthermore, we provide a mathematically justified scheme for training AS-ARMs for generation, and show that AS-ARMs achieve state-of-the-art performance among sub-200M parameter models on infilling benchmark tasks, and nearly match the performance of models 50X larger on code generation. Our theoretical and empirical results indicate that the once-forgotten AS-ARMs are a promising direction of language modeling.
zh
[NLP-34] am ACK at SemEval-2025 Task 2: Beyond Word-for-Word Machine Translation for English-Korean Pairs SEMEVAL-2025 ACL2025
【速读】: 该论文旨在解决在英韩语言之间翻译知识密集且实体丰富的文本时,如何通过转译(transcreation)保留语言特有和文化细微差别的问题。其解决方案的关键在于评估13种模型(包括大语言模型和机器翻译模型),并通过自动评估指标和双语标注者的主观评价来分析模型表现,发现大语言模型虽优于传统机器翻译系统,但在需要文化适应的实体翻译方面仍存在不足,进而构建错误分类体系以识别关键问题。
链接: https://arxiv.org/abs/2504.20451
作者: Daniel Lee,Harsh Sharma,Jieun Han,Sunny Jeong,Alice Oh,Vered Shwartz
机构: Adobe Inc.(Adobe公司); CU Boulder (科罗拉多大学博尔德分校); KAIST (韩国科学技术院); New York University (纽约大学); UBC (不列颠哥伦比亚大学)
类目: Computation and Language (cs.CL); Information Retrieval (cs.IR)
备注: Accepted at SemEval-2025 Workshop (ACL 2025)
Abstract:Translating knowledge-intensive and entity-rich text between English and Korean requires transcreation to preserve language-specific and cultural nuances beyond literal, phonetic or word-for-word conversion. We evaluate 13 models (LLMs and MT models) using automatic metrics and human assessment by bilingual annotators. Our findings show LLMs outperform traditional MT systems but struggle with entity translation requiring cultural adaptation. By constructing an error taxonomy, we identify incorrect responses and entity name errors as key issues, with performance varying by entity type and popularity level. This work exposes gaps in automatic evaluation metrics and hope to enable future work in completing culturally-nuanced machine translation.
zh
[NLP-35] On Psychology of AI – Does Primacy Effect Affect ChatGPT and Other LLM s?
【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在面对具有顺序差异的描述信息时是否存在“首因效应”(primacy effect)的问题。其解决方案的关键在于通过重构Asch(1946)的经典实验,评估不同LLMs在处理正向与负向形容词顺序不同的候选人描述时的偏好倾向,从而揭示模型是否受到信息呈现顺序的影响。
链接: https://arxiv.org/abs/2504.20444
作者: Mika Hämäläinen
机构: Metropolia University of Applied Sciences (Metropolia应用科学大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:We study the primacy effect in three commercial LLMs: ChatGPT, Gemini and Claude. We do this by repurposing the famous experiment Asch (1946) conducted using human subjects. The experiment is simple, given two candidates with equal descriptions which one is preferred if one description has positive adjectives first before negative ones and another description has negative adjectives followed by positive ones. We test this in two experiments. In one experiment, LLMs are given both candidates simultaneously in the same prompt, and in another experiment, LLMs are given both candidates separately. We test all the models with 200 candidate pairs. We found that, in the first experiment, ChatGPT preferred the candidate with positive adjectives listed first, while Gemini preferred both equally often. Claude refused to make a choice. In the second experiment, ChatGPT and Claude were most likely to rank both candidates equally. In the case where they did not give an equal rating, both showed a clear preference to a candidate that had negative adjectives listed first. Gemini was most likely to prefer a candidate with negative adjectives listed first.
zh
[NLP-36] DMDTEval: An Evaluation and Analysis of LLM s on Disambiguation in Multi-domain Translation
【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在多领域翻译(Multi-Domain Translation, MDT)中的歧义消解能力不足的问题,即词义在不同领域中可能发生变化,导致翻译结果不准确。其解决方案的关键在于构建一个系统化的评估框架(DMDTEval),包含三个核心方面:构建带有多领域歧义词标注的翻译测试集、设计多样化的歧义消解提示模板,以及制定精确的歧义消解评价指标,并通过实验分析不同提示策略在多个先进LLMs上的效果。
链接: https://arxiv.org/abs/2504.20371
作者: Zhibo Man,Yuanmeng Chen,Yujie Zhang,Yufeng Chen,Jinan Xu
机构: Beijing Jiaotong University (北京交通大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:Currently, Large Language Models (LLMs) have achieved remarkable results in machine translation. However, their performance in multi-domain translation (MDT) is less satisfactory; the meanings of words can vary across different domains, highlighting the significant ambiguity inherent in MDT. Therefore, evaluating the disambiguation ability of LLMs in MDT remains an open problem. To this end, we present an evaluation and analysis of LLMs on disambiguation in multi-domain translation (DMDTEval), our systematic evaluation framework consisting of three critical aspects: (1) we construct a translation test set with multi-domain ambiguous word annotation, (2) we curate a diverse set of disambiguation prompting templates, and (3) we design precise disambiguation metrics, and study the efficacy of various prompting strategies on multiple state-of-the-art LLMs. Our extensive experiments reveal a number of crucial findings that we believe will pave the way and also facilitate further research in the critical area of improving the disambiguation of LLMs.
zh
[NLP-37] What Causes Knowledge Loss in Multilingual Language Models?
【速读】: 该论文试图解决多语言自然语言处理模型在跨语言迁移过程中出现的灾难性遗忘问题,即在微调新任务时导致先前学习任务性能下降的现象。其解决方案的关键在于通过LoRA适配器(Low-Rank Adaptation)对参数进行非共享、部分共享和完全共享的实验,评估不同参数共享策略对保持先验知识和减少遗忘的影响,从而探索参数共享是否能够有效缓解灾难性遗忘并促进更有效的跨语言迁移。
链接: https://arxiv.org/abs/2504.20356
作者: Maria Khelli,Samuel Cahyawijaya,Ayu Purwarianti,Genta Indra Winata
机构: Institut Teknologi Bandung(印尼特兰加纳理工学院); Cohere(Cohere); Capital One(资本一号)
类目: Computation and Language (cs.CL)
备注:
Abstract:Cross-lingual transfer in natural language processing (NLP) models enhances multilingual performance by leveraging shared linguistic knowledge. However, traditional methods that process all data simultaneously often fail to mimic real-world scenarios, leading to challenges like catastrophic forgetting, where fine-tuning on new tasks degrades performance on previously learned ones. Our study explores this issue in multilingual contexts, focusing on linguistic differences affecting representational learning rather than just model parameters. We experiment with 52 languages using LoRA adapters of varying ranks to evaluate non-shared, partially shared, and fully shared parameters. Our aim is to see if parameter sharing through adapters can mitigate forgetting while preserving prior knowledge. We find that languages using non-Latin scripts are more susceptible to catastrophic forgetting, whereas those written in Latin script facilitate more effective cross-lingual transfer.
zh
[NLP-38] Local Prompt Optimization NAACL2025
【速读】: 该论文试图解决大型语言模型(Large Language Model, LLM)在提示词优化过程中存在的效率与效果问题,即如何更有效地生成高质量的提示词以完成特定任务。现有方法通常对提示词进行全局优化,需要在大规模词汇空间中调整所有提示词标记,导致优化过程缺乏足够指导。论文提出的解决方案之关键在于引入局部提示优化(Local Prompt Optimization, LPO),通过识别并仅优化提示词中的关键标记,使LLM在优化过程中集中关注这些部分,从而提升优化效率和性能表现。
链接: https://arxiv.org/abs/2504.20355
作者: Yash Jain,Vishal Chowdhary
机构: Microsoft(微软)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Accepted as Oral at NAACL 2025 (Main Conference)
Abstract:In recent years, the use of prompts to guide the output of Large Language Models have increased dramatically. However, even the best of experts struggle to choose the correct words to stitch up a prompt for the desired task. To solve this, LLM driven prompt optimization emerged as an important problem. Existing prompt optimization methods optimize a prompt globally, where in all the prompt tokens have to be optimized over a large vocabulary while solving a complex task. The large optimization space (tokens) leads to insufficient guidance for a better prompt. In this work, we introduce Local Prompt Optimization (LPO) that integrates with any general automatic prompt engineering method. We identify the optimization tokens in a prompt and nudge the LLM to focus only on those tokens in its optimization step. We observe remarkable performance improvements on Math Reasoning (GSM8k and MultiArith) and BIG-bench Hard benchmarks across various automatic prompt engineering methods. Further, we show that LPO converges to the optimal prompt faster than global methods.
zh
[NLP-39] Labeling Case Similarity based on Co-Citation of Legal Articles in Judgment Documents with Empirical Dispute-Based Evaluation
【速读】: 该论文试图解决在构建法律推荐系统时,由于标注数据集有限而带来的挑战,尤其是在劳动争议等专业领域。解决方案的关键在于利用法律条文在案件中的共引关系来建立相似性,并通过算法进行标注,这一方法借鉴了案例共引的概念,将被引用的先例作为共享法律问题的指示器。
链接: https://arxiv.org/abs/2504.20323
作者: Chao-Lin Liu,Po-Hsien Wu,Yi-Ting Yu
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Digital Libraries (cs.DL); Information Retrieval (cs.IR); Machine Learning (cs.LG)
备注: 16 pages, 9 figures, 2 tables, the Nineteenth International Workshop on Juris-Informatics (JURISIN 2025), associated with the Seventeenth JSAI International Symposium on AI (JSAI-isAI 2025)
Abstract:This report addresses the challenge of limited labeled datasets for developing legal recommender systems, particularly in specialized domains like labor disputes. We propose a new approach leveraging the co-citation of legal articles within cases to establish similarity and enable algorithmic annotation. This method draws a parallel to the concept of case co-citation, utilizing cited precedents as indicators of shared legal issues. To evaluate the labeled results, we employ a system that recommends similar cases based on plaintiffs’ accusations, defendants’ rebuttals, and points of disputes. The evaluation demonstrates that the recommender, with finetuned text embedding models and a reasonable BiLSTM module can recommend labor cases whose similarity was measured by the co-citation of the legal articles. This research contributes to the development of automated annotation techniques for legal documents, particularly in areas with limited access to comprehensive legal databases.
zh
[NLP-40] UD-English-CHILDES: A Collected Resource of Gold and Silver Universal Dependencies Trees for Child Language Interactions
【速读】: 该论文旨在解决儿童语言数据在句法标注上缺乏统一标准的问题,通过构建一个基于CHILDES语料库的统一句法树库UD-English-CHILDES来提供一致的标注资源。解决方案的关键在于采用一致的Universal Dependencies (UD) 标注规范,对来自11名儿童及其照顾者的超过48,000句语料进行标准化处理,并验证现有黄金标准标注,同时补充了100万条银标准句子,从而为计算语言学和语言研究提供可靠的数据支持。
链接: https://arxiv.org/abs/2504.20304
作者: Xiulin Yang,Zhuoxuan Ju,Lanni Bu,Zoey Liu,Nathan Schneider
机构: Georgetown University (乔治城大学); University of Florida (佛罗里达大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:CHILDES is a widely used resource of transcribed child and child-directed speech. This paper introduces UD-English-CHILDES, the first officially released Universal Dependencies (UD) treebank derived from previously dependency-annotated CHILDES data with consistent and unified annotation guidelines. Our corpus harmonizes annotations from 11 children and their caregivers, totaling over 48k sentences. We validate existing gold-standard annotations under the UD v2 framework and provide an additional 1M silver-standard sentences, offering a consistent resource for computational and linguistic research.
zh
[NLP-41] mrCAD: Multimodal Refinement of Computer-aided Designs
【速读】: 该论文试图解决生成式 AI (Generative AI) 在进行特定语言引导的先前输出修改(refinement)方面存在的不足,即与人类在协作过程中能够迭代优化已沟通概念的能力相比,AI 在此任务上表现较差。解决方案的关键在于构建 mrCAD 数据集,这是一个包含多模态指令的通信游戏数据集,通过模拟设计师与制作者之间的交互,记录了多轮 CAD 设计和优化过程。该数据集揭示了生成与优化指令在文本和绘图成分上的差异,并为评估和建模多模态的优化语言提供了基准。
链接: https://arxiv.org/abs/2504.20294
作者: William P. McCarthy,Saujas Vaduguru,Karl D. D. Willis,Justin Matejka,Judith E. Fan,Daniel Fried,Yewen Pu
机构: Autodesk AI Lab (Autodesk人工智能实验室); Carnegie Mellon University (卡内基梅隆大学); Stanford University (斯坦福大学)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
备注: the first two authors contributed equally
Abstract:A key feature of human collaboration is the ability to iteratively refine the concepts we have communicated. In contrast, while generative AI excels at the \textitgeneration of content, it often struggles to make specific language-guided \textitmodifications of its prior outputs. To bridge the gap between how humans and machines perform edits, we present mrCAD, a dataset of multimodal instructions in a communication game. In each game, players created computer aided designs (CADs) and refined them over several rounds to match specific target designs. Only one player, the Designer, could see the target, and they must instruct the other player, the Maker, using text, drawing, or a combination of modalities. mrCAD consists of 6,082 communication games, 15,163 instruction-execution rounds, played between 1,092 pairs of human players. We analyze the dataset and find that generation and refinement instructions differ in their composition of drawing and text. Using the mrCAD task as a benchmark, we find that state-of-the-art VLMs are better at following generation instructions than refinement instructions. These results lay a foundation for analyzing and modeling a multimodal language of refinement that is not represented in previous datasets.
zh
[NLP-42] Enhancing Systematic Reviews with Large Language Models : Using GPT -4 and Kimi
【速读】: 该论文试图解决在系统综述中利用大型语言模型(Large Language Models, LLMs)生成编码的准确性与可靠性问题,其解决方案的关键在于通过对比LLM生成的编码与同行评审系统综述中人类生成的编码,评估LLM在不同数据量和问题复杂度下的性能表现。
链接: https://arxiv.org/abs/2504.20276
作者: Dandan Chen Kaptur,Yue Huang,Xuejun Ryan Ji,Yanhui Guo,Bradley Kaptur
机构: 未知
类目: Computation and Language (cs.CL); Applications (stat.AP)
备注: 13 pages, Paper presented at the National Council on Measurement in Education (NCME) Conference, Denver, Colorado, in April 2025
Abstract:This research delved into GPT-4 and Kimi, two Large Language Models (LLMs), for systematic reviews. We evaluated their performance by comparing LLM-generated codes with human-generated codes from a peer-reviewed systematic review on assessment. Our findings suggested that the performance of LLMs fluctuates by data volume and question complexity for systematic reviews.
zh
[NLP-43] A Platform for Generating Educational Activities to Teach English as a Second Language
【速读】: 该论文旨在解决外语教学中缺乏多样化和个性化教育活动的问题,其解决方案的关键在于利用自然语言处理(Natural Language Processing, NLP)技术生成适用于英语作为外语教学的教育活动。平台通过半自动化的资源生成与人工校对相结合的方式,提供即用型游戏和语言练习,并能根据教师输入的文本生成更复杂的教学内容,同时支持生成内容的审核与编辑,以提升教学效果。
链接: https://arxiv.org/abs/2504.20251
作者: Aiala Rosá,Santiago Góngora,Juan Pablo Filevich,Ignacio Sastre,Laura Musto,Brian Carpenter,Luis Chiruzzo
机构: Facultad de Ingeniería, Universidad de la República, Uruguay (工程学院,乌拉圭共和国大学); Facultad de Información y Comunicación, Universidad de la República, Uruguay (信息与传播学院,乌拉圭共和国大学); Indiana University of Pennsylvania, Indiana, PA, USA (印第安纳波利斯州立大学,宾夕法尼亚州,美国)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注: Unpublished report written in 2023
Abstract:We present a platform for the generation of educational activities oriented to teaching English as a foreign language. The different activities --games and language practice exercises-- are strongly based on Natural Language Processing techniques. The platform offers the possibility of playing out-of-the-box games, generated from resources created semi-automatically and then manually curated. It can also generate games or exercises of greater complexity from texts entered by teachers, providing a stage of review and edition of the generated content before use. As a way of expanding the variety of activities in the platform, we are currently experimenting with image and text generation. In order to integrate them and improve the performance of other neural tools already integrated, we are working on migrating the platform to a more powerful server. In this paper we describe the development of our platform and its deployment for end users, discussing the challenges faced and how we overcame them, and also detail our future work plans.
zh
[NLP-44] A Multimodal Pipeline for Clinical Data Extraction: Applying Vision-Language Models to Scans of Transfusion Reaction Reports
【速读】: 该论文试图解决医疗领域中纸质文档向电子健康记录(Electronic Health Records, EHR)转换过程中存在的手动输入耗时且易出错的问题。解决方案的关键在于提出一个开源流程,该流程集成了复选框检测、多语言光学字符识别(OCR)和多语言视觉-语言模型(VLMs),以高效且准确地提取和分类扫描文档中的复选框数据,从而减少行政工作量并提高监管报告的准确性。
链接: https://arxiv.org/abs/2504.20220
作者: Henning Schäfer,Cynthia S. Schmidt,Johannes Wutzkowsky,Kamil Lorek,Lea Reinartz,Johannes Rückert,Christian Temme,Britta Böckmann,Peter A. Horn,Christoph M. Friedrich
机构: University Hospital Essen (埃森大学医院); University of Applied Sciences and Arts Dortmund (多特蒙德应用科学与艺术大学); Institute for Medical Informatics, Biometry and Epidemiology (IMIBE) (医学信息学、生物统计学和流行病学研究所); Institute for AI in Medicine (IKIM) (医学人工智能研究所)
类目: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Despite the growing adoption of electronic health records, many processes still rely on paper documents, reflecting the heterogeneous real-world conditions in which healthcare is delivered. The manual transcription process is time-consuming and prone to errors when transferring paper-based data to digital formats. To streamline this workflow, this study presents an open-source pipeline that extracts and categorizes checkbox data from scanned documents. Demonstrated on transfusion reaction reports, the design supports adaptation to other checkbox-rich document types. The proposed method integrates checkbox detection, multilingual optical character recognition (OCR) and multilingual vision-language models (VLMs). The pipeline achieves high precision and recall compared against annually compiled gold-standards from 2017 to 2024. The result is a reduction in administrative workload and accurate regulatory reporting. The open-source availability of this pipeline encourages self-hosted parsing of checkbox forms.
zh
[NLP-45] Weaving Context Across Images: Improving Vision-Language Models through Focus-Centric Visual Chains
【速读】: 该论文旨在解决视觉语言模型(Vision-Language Models, VLMs)在处理多图像输入时性能显著下降的问题,这一问题源于模型难以从复杂的视觉特征中分离出关键信息。解决方案的关键在于提出了一种新的范式——以焦点为中心的视觉链(Focus-Centric Visual Chain),通过这种范式增强VLMs在多图像场景中的感知、理解和推理能力,并引入了以焦点为中心的数据合成(Focus-Centric Data Synthesis)方法,这是一种可扩展的自下而上数据生成策略,用于构建具有精细推理路径的高质量数据集VISC-150K,从而有效提升模型在多图像任务中的表现。
链接: https://arxiv.org/abs/2504.20199
作者: Juntian Zhang,Chuanqi cheng,Yuhan Liu,Wei Liu,Jian Luan,Rui Yan
机构: Gaoling School of Artificial Intelligence, Renmin University of China (中国人民大学高瓴人工智能学院); Xiaomi(小米); Wuhan University (武汉大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
Abstract:Vision-language models (VLMs) achieve remarkable success in single-image tasks. However, real-world scenarios often involve intricate multi-image inputs, leading to a notable performance decline as models struggle to disentangle critical information scattered across complex visual features. In this work, we propose Focus-Centric Visual Chain, a novel paradigm that enhances VLMs’perception, comprehension, and reasoning abilities in multi-image scenarios. To facilitate this paradigm, we propose Focus-Centric Data Synthesis, a scalable bottom-up approach for synthesizing high-quality data with elaborate reasoning paths. Through this approach, We construct VISC-150K, a large-scale dataset with reasoning data in the form of Focus-Centric Visual Chain, specifically designed for multi-image tasks. Experimental results on seven multi-image benchmarks demonstrate that our method achieves average performance gains of 3.16% and 2.24% across two distinct model architectures, without compromising the general vision-language capabilities. our study represents a significant step toward more robust and capable vision-language systems that can handle complex visual scenarios.
zh
[NLP-46] MICE for CATs: Model-Internal Confidence Estimation for Calibrating Agents with Tools NAACL2025 MICRO
【速读】: 该论文试图解决工具调用代理在现实世界中行动时如何更准确地评估模型置信度以实现安全与效用平衡的问题(tool-using agents need to be both useful and safe)。现有模型的置信度往往校准不佳,难以有效权衡潜在动作的风险与收益。解决方案的关键在于提出一种基于模型内部结构的置信度估计器(Model-Internal Confidence Estimators, MICE),通过解码语言模型各中间层的输出并计算其与最终输出的相似性得分,再利用学习到的概率分类器评估置信度,从而提升工具调用的可靠性与效率。
链接: https://arxiv.org/abs/2504.20168
作者: Nishant Subramani,Jason Eisner,Justin Svegliato,Benjamin Van Durme,Yu Su,Sam Thomson
机构: CMU LTI (卡内基梅隆大学语言技术研究所); Microsoft (微软)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Accepted at NAACL 2025. Code: this https URL
Abstract:Tool-using agents that act in the world need to be both useful and safe. Well-calibrated model confidences can be used to weigh the risk versus reward of potential actions, but prior work shows that many models are poorly calibrated. Inspired by interpretability literature exploring the internals of models, we propose a novel class of model-internal confidence estimators (MICE) to better assess confidence when calling tools. MICE first decodes from each intermediate layer of the language model using logitLens and then computes similarity scores between each layer’s generation and the final output. These features are fed into a learned probabilistic classifier to assess confidence in the decoded output. On the simulated trial and error (STE) tool-calling dataset using Llama3 models, we find that MICE beats or matches the baselines on smoothed expected calibration error. Using MICE confidences to determine whether to call a tool significantly improves over strong baselines on a new metric, expected tool-calling utility. Further experiments show that MICE is sample-efficient, can generalize zero-shot to unseen APIs, and results in higher tool-calling utility in scenarios with varying risk levels. Our code is open source, available at this https URL.
zh
[NLP-47] oward Evaluative Thinking: Meta Policy Optimization with Evolving Reward Models
【速读】: 该论文试图解决基于奖励的大型语言模型(Large Language Models, LLMs)对齐方法中的两个关键问题:一是模型容易受到奖励欺骗(reward hacking)的影响,即模型利用奖励信号中的漏洞;二是当LLMs作为奖励模型时,依赖于脆弱且耗时的手动提示工程(prompt engineering)。解决方案的关键在于引入元策略优化(Meta Policy Optimization, MPO),该框架通过集成一个元奖励模型,在训练过程中动态优化奖励模型的提示,从而提供一种适应性强、不易被策略利用的奖励信号,实现更稳定的策略优化并减少对人工设计奖励提示的依赖。
链接: https://arxiv.org/abs/2504.20157
作者: Zae Myung Kim,Chanwoo Park,Vipul Raheja,Dongyeop Kang
机构: 未知
类目: Computation and Language (cs.CL)
备注:
Abstract:Reward-based alignment methods for large language models (LLMs) face two key limitations: vulnerability to reward hacking, where models exploit flaws in the reward signal; and reliance on brittle, labor-intensive prompt engineering when LLMs are used as reward models. We introduce Meta Policy Optimization (MPO), a framework that addresses these challenges by integrating a meta-reward model that dynamically refines the reward model’s prompt throughout training. In MPO, the meta-reward model monitors the evolving training context and continuously adjusts the reward model’s prompt to maintain high alignment, providing an adaptive reward signal that resists exploitation by the policy. This meta-learning approach promotes a more stable policy optimization, and greatly reduces the need for manual reward prompt design. It yields performance on par with or better than models guided by extensively hand-crafted reward prompts. Furthermore, we show that MPO maintains its effectiveness across diverse tasks, such as question answering and mathematical reasoning, without requiring specialized reward designs. Beyond standard RLAIF, MPO’s meta-learning formulation is readily extensible to higher-level alignment frameworks. Overall, this method addresses theoretical and practical challenges in reward-based RL alignment for LLMs, paving the way for more robust and adaptable alignment strategies. The code and models will be publicly shared.
zh
[NLP-48] ResearchCodeAgent : An LLM Agent : An LLM Multi-Agent System for Automated Codification of Research Methodologies
【速读】: 该论文试图解决机器学习文献中研究方法与实际代码实现之间的鸿沟问题,旨在通过自动化手段生成高质量的代码以支持基准测试或在已有方法基础上进行改进。解决方案的关键在于提出了一种基于大型语言模型(Large Language Models, LLMs)代理的多代理系统——ResearchCodeAgent,其采用灵活的代理架构和全面的操作套件,结合动态规划机制,利用短期和长期记忆进行迭代适应,从而实现对研究环境的上下文感知交互。
链接: https://arxiv.org/abs/2504.20117
作者: Shubham Gandhi,Dhruv Shah,Manasi Patwardhan,Lovekesh Vig,Gautam Shroff
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multiagent Systems (cs.MA)
备注:
Abstract:In this paper we introduce ResearchCodeAgent, a novel multi-agent system leveraging large language models (LLMs) agents to automate the codification of research methodologies described in machine learning literature. The system bridges the gap between high-level research concepts and their practical implementation, allowing researchers auto-generating code of existing research papers for benchmarking or building on top-of existing methods specified in the literature with availability of partial or complete starter code. ResearchCodeAgent employs a flexible agent architecture with a comprehensive action suite, enabling context-aware interactions with the research environment. The system incorporates a dynamic planning mechanism, utilizing both short and long-term memory to adapt its approach iteratively. We evaluate ResearchCodeAgent on three distinct machine learning tasks with distinct task complexity and representing different parts of the ML pipeline: data augmentation, optimization, and data batching. Our results demonstrate the system’s effectiveness and generalizability, with 46.9% of generated code being high-quality and error-free, and 25% showing performance improvements over baseline implementations. Empirical analysis shows an average reduction of 57.9% in coding time compared to manual implementation. We observe higher gains for more complex tasks. ResearchCodeAgent represents a significant step towards automating the research implementation process, potentially accelerating the pace of machine learning research.
zh
[NLP-49] MATCHA: Can Multi-Agent Collaboration Build a Trustworthy Conversational Recommender?
【速读】: 该论文旨在解决对话式推荐系统(Conversational Recommendation System)在游戏推荐中面临的关键挑战,包括处理复杂且用户特定的请求、通过多智能体协作提升个性化程度、进行实证评估与部署以及确保交互的安全性和可信度。其解决方案的核心是提出一种名为MATCHA的多智能体协作框架,该框架利用大语言模型(Large Language Models, LLMs)增强推荐系统的个性化和用户参与度,通过专门设计的智能体实现意图分析、候选生成、排序、重排序、可解释性及安全防护等功能,从而协同提升推荐的准确性、多样性和安全性。
链接: https://arxiv.org/abs/2504.20094
作者: Zheng Hui,Xiaokai Wei,Yexi Jiang,Kevin Gao,Chen Wang,Frank Ong,Se-eun Yoon,Rachit Pareek,Michelle Gong
机构: Roblox Corporation(罗布乐思公司); Columbia University(哥伦比亚大学)
类目: Information Retrieval (cs.IR); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
备注:
Abstract:In this paper, we propose a multi-agent collaboration framework called MATCHA for conversational recommendation system, leveraging large language models (LLMs) to enhance personalization and user engagement. Users can request recommendations via free-form text and receive curated lists aligned with their interests, preferences, and constraints. Our system introduces specialized agents for intent analysis, candidate generation, ranking, re-ranking, explainability, and safeguards. These agents collaboratively improve recommendations accuracy, diversity, and safety. On eight metrics, our model achieves superior or comparable performance to the current state-of-the-art. Through comparisons with six baseline models, our approach addresses key challenges in conversational recommendation systems for game recommendations, including: (1) handling complex, user-specific requests, (2) enhancing personalization through multi-agent collaboration, (3) empirical evaluation and deployment, and (4) ensuring safe and trustworthy interactions.
zh
[NLP-50] Understanding and Mitigating Risks of Generative AI in Financial Services
【速读】: 该论文试图解决生成式 AI(Generative AI)在金融服务业中内容安全的特定风险问题,旨在提出一个与该领域相关的AI内容风险分类体系。其解决方案的关键在于通过分析现有开源技术防护措施在红队测试数据上的表现,评估这些防护机制对所提出风险分类的覆盖程度,从而揭示当前技术手段在识别和防范金融领域内容风险方面的不足。
链接: https://arxiv.org/abs/2504.20086
作者: Sebastian Gehrmann,Claire Huang,Xian Teng,Sergei Yurovski,Iyanuoluwa Shode,Chirag S. Patel,Arjun Bhorkar,Naveen Thomas,John Doucette,David Rosenberg,Mark Dredze,David Rabinowitz
机构: Bloomberg(彭博社); Johns Hopkins University (约翰霍普金斯大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG)
备注: Accepted to FAccT 2025
Abstract:To responsibly develop Generative AI (GenAI) products, it is critical to define the scope of acceptable inputs and outputs. What constitutes a “safe” response is an actively debated question. Academic work puts an outsized focus on evaluating models by themselves for general purpose aspects such as toxicity, bias, and fairness, especially in conversational applications being used by a broad audience. In contrast, less focus is put on considering sociotechnical systems in specialized domains. Yet, those specialized systems can be subject to extensive and well-understood legal and regulatory scrutiny. These product-specific considerations need to be set in industry-specific laws, regulations, and corporate governance requirements. In this paper, we aim to highlight AI content safety considerations specific to the financial services domain and outline an associated AI content risk taxonomy. We compare this taxonomy to existing work in this space and discuss implications of risk category violations on various stakeholders. We evaluate how existing open-source technical guardrail solutions cover this taxonomy by assessing them on data collected via red-teaming activities. Our results demonstrate that these guardrails fail to detect most of the content risks we discuss.
zh
[NLP-51] AI Awareness
【速读】: 该论文试图解决人工智能(Artificial Intelligence, AI)系统中“意识”或“认知能力”的量化与功能化问题,旨在探讨AI在元认知、自我意识、社会意识和情境意识四个方面的表现及其与智能行为的关系。其解决方案的关键在于通过跨学科视角,结合认知科学、心理学和计算理论,分析当前先进AI系统中这些意识形式的体现,并系统评估相关评价方法与实证结果,从而揭示AI意识与其整体能力之间的紧密联系,同时关注由此带来的安全、对齐及伦理风险。
链接: https://arxiv.org/abs/2504.20084
作者: Xiaojian Li,Haoyuan Shi,Rongwu Xu,Wei Xu
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY)
备注:
Abstract:Recent breakthroughs in artificial intelligence (AI) have brought about increasingly capable systems that demonstrate remarkable abilities in reasoning, language understanding, and problem-solving. These advancements have prompted a renewed examination of AI awareness, not as a philosophical question of consciousness, but as a measurable, functional capacity. In this review, we explore the emerging landscape of AI awareness, which includes meta-cognition (the ability to represent and reason about its own state), self-awareness (recognizing its own identity, knowledge, limitations, inter alia), social awareness (modeling the knowledge, intentions, and behaviors of other agents), and situational awareness (assessing and responding to the context in which it operates). First, we draw on insights from cognitive science, psychology, and computational theory to trace the theoretical foundations of awareness and examine how the four distinct forms of AI awareness manifest in state-of-the-art AI. Next, we systematically analyze current evaluation methods and empirical findings to better understand these manifestations. Building on this, we explore how AI awareness is closely linked to AI capabilities, demonstrating that more aware AI agents tend to exhibit higher levels of intelligent behaviors. Finally, we discuss the risks associated with AI awareness, including key topics in AI safety, alignment, and broader ethical concerns. AI awareness is a double-edged sword: it improves general capabilities, i.e., reasoning, safety, while also raises concerns around misalignment and societal risks, demanding careful oversight as AI capabilities grow. On the whole, our interdisciplinary review provides a roadmap for future research and aims to clarify the role of AI awareness in the ongoing development of intelligent machines. Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY) Cite as: arXiv:2504.20084 [cs.AI] (or arXiv:2504.20084v1 [cs.AI] for this version) https://doi.org/10.48550/arXiv.2504.20084 Focus to learn more arXiv-issued DOI via DataCite
zh
[NLP-52] RAG EN: Understanding Self-Evolution in LLM Agents via Multi-Turn Reinforcement Learning
【速读】: 该论文旨在解决大规模语言模型(Large Language Models, LLMs)作为交互智能体进行训练时所面临的长期决策和与随机环境反馈交互的挑战。其核心问题在于多轮交互强化学习(Multi-turn Agent RL)的训练效果不佳,以及在缺乏精细推理感知奖励信号的情况下,智能体难以展现出深度推理能力。论文提出的解决方案是StarPO(State-Thinking-Actions-Reward Policy Optimization)框架和RAGEN系统,其中关键在于通过轨迹级策略优化、轨迹过滤、评论家模块融合及解耦剪切等技术手段来稳定训练过程,并通过多样化初始状态、中等交互粒度和高频采样提升RL轨迹的塑造效果。
链接: https://arxiv.org/abs/2504.20073
作者: Zihan Wang,Kangrui Wang,Qineng Wang,Pingyue Zhang,Linjie Li,Zhengyuan Yang,Kefan Yu,Minh Nhat Nguyen,Licheng Liu,Eli Gottlieb,Monica Lam,Yiping Lu,Kyunghyun Cho,Jiajun Wu,Li Fei-Fei,Lijuan Wang,Yejin Choi,Manling Li
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
Abstract:Training large language models (LLMs) as interactive agents presents unique challenges including long-horizon decision making and interacting with stochastic environment feedback. While reinforcement learning (RL) has enabled progress in static tasks, multi-turn agent RL training remains underexplored. We propose StarPO (State-Thinking-Actions-Reward Policy Optimization), a general framework for trajectory-level agent RL, and introduce RAGEN, a modular system for training and evaluating LLM agents. Our study on three stylized environments reveals three core findings. First, our agent RL training shows a recurring mode of Echo Trap where reward variance cliffs and gradient spikes; we address this with StarPO-S, a stabilized variant with trajectory filtering, critic incorporation, and decoupled clipping. Second, we find the shaping of RL rollouts would benefit from diverse initial states, medium interaction granularity and more frequent sampling. Third, we show that without fine-grained, reasoning-aware reward signals, agent reasoning hardly emerge through multi-turn RL and they may show shallow strategies or hallucinated thoughts. Code and environments are available at this https URL.
zh
[NLP-53] Recommending Clinical Trials for Online Patient Cases using Artificial Intelligence
【速读】: 该论文试图解决临床试验招募过程中存在的挑战,如患者对试验的知晓度有限、入组标准复杂以及转诊障碍等问题。其解决方案的关键在于利用生成式 AI (Generative AI) 构建的 TrialGPT 框架,通过大型语言模型(LLM)实现在线患者病例与临床试验的匹配,相较于传统的基于关键词的搜索方法,显著提高了识别符合条件试验的效率。
链接: https://arxiv.org/abs/2504.20059
作者: Joey Chan,Qiao Jin,Nicholas Wan,Charalampos S. Floudas,Elisabetta Xue,Zhiyong Lu
机构: 未知
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 10 pages with 2 figures and 2 tables
Abstract:Clinical trials are crucial for assessing new treatments; however, recruitment challenges - such as limited awareness, complex eligibility criteria, and referral barriers - hinder their success. With the growth of online platforms, patients increasingly turn to social media and health communities for support, research, and advocacy, expanding recruitment pools and established enrollment pathways. Recognizing this potential, we utilized TrialGPT, a framework that leverages a large language model (LLM) as its backbone, to match 50 online patient cases (collected from published case reports and a social media website) to clinical trials and evaluate performance against traditional keyword-based searches. Our results show that TrialGPT outperforms traditional methods by 46% in identifying eligible trials, with each patient, on average, being eligible for around 7 trials. Additionally, our outreach efforts to case authors and trial organizers regarding these patient-trial matches yielded highly positive feedback, which we present from both perspectives.
zh
[NLP-54] Evaluating Large Language Models on Multiword Expressions in Multilingual and Code-Switched Contexts
【速读】: 该论文试图解决大型语言模型在处理多词表达(Multiword Expressions)时面临的语义模糊性问题,尤其是那些具有非组合性意义和句法不规则性的表达。其解决方案的关键在于通过构建一个新型的代码转换数据集和任务,评估当前最先进的语言模型在低频上下文中对潜在习语性多词表达的歧义处理能力,从而揭示这些模型在面对语言细微差别时的局限性。
链接: https://arxiv.org/abs/2504.20051
作者: Frances Laureano De Leon,Harish Tayyar Madabushi,Mark G. Lee
机构: University of Birmingham (伯明翰大学); University of Bath (巴斯大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:Multiword expressions, characterised by non-compositional meanings and syntactic irregularities, are an example of nuanced language. These expressions can be used literally or idiomatically, leading to significant changes in meaning. While large language models have demonstrated strong performance across many tasks, their ability to handle such linguistic subtleties remains uncertain. Therefore, this study evaluates how state-of-the-art language models process the ambiguity of potentially idiomatic multiword expressions, particularly in contexts that are less frequent, where models are less likely to rely on memorisation. By evaluating models across in Portuguese and Galician, in addition to English, and using a novel code-switched dataset and a novel task, we find that large language models, despite their strengths, struggle with nuanced language. In particular, we find that the latest models, including GPT-4, fail to outperform the xlm-roBERTa-base baselines in both detection and semantic tasks, with especially poor performance on the novel tasks we introduce, despite its similarity to existing tasks. Overall, our results demonstrate that multiword expressions, especially those which are ambiguous, continue to be a challenge to models.
zh
[NLP-55] Its the same but not the same: Do LLM s distinguish Spanish varieties?
【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在识别和区分西班牙语多种方言(如安第斯、加勒比、智利、半岛、墨西哥、中美洲和里奥拉塔诺等)的形态句法和词汇特点方面的能力问题。研究的关键在于通过设计一个多选测试,评估九种语言模型对这些方言差异的识别能力,并发现GPT-4o是唯一能够有效识别西班牙语变体多样性的模型。
链接: https://arxiv.org/abs/2504.20049
作者: Marina Mayor-Rocher,Cristina Pozo,Nina Melero,Gonzalo Martínez,María Grandury,Pedro Reviriego
机构: 未知
类目: Computation and Language (cs.CL)
备注: in Spanish language
Abstract:In recent years, large language models (LLMs) have demonstrated a high capacity for understanding and generating text in Spanish. However, with five hundred million native speakers, Spanish is not a homogeneous language but rather one rich in diatopic variations spanning both sides of the Atlantic. For this reason, in this study, we evaluate the ability of nine language models to identify and distinguish the morphosyntactic and lexical peculiarities of seven varieties of Spanish (Andean, Antillean, Continental Caribbean, Chilean, Peninsular, Mexican and Central American and Rioplatense) through a multiple-choice test. The results indicate that the Peninsular Spanish variety is the best identified by all models and that, among them, GPT-4o is the only model capable of recognizing the variability of the Spanish language. – En los últimos años, los grandes modelos de lenguaje (LLMs, por sus siglas en inglés) han demostrado una alta capacidad para comprender y generar texto en español. Sin embargo, con quinientos millones de hablantes nativos, la española no es una lengua homogénea, sino rica en variedades diatópicas que se extienden a ambos lados del Atlántico. Por todo ello, evaluamos en este trabajo la capacidad de nueve modelos de lenguaje de identificar y discernir las peculiaridades morfosintácticas y léxicas de siete variedades de español (andino, antillano, caribeño continental, chileno, español peninsular, mexicano y centroamericano y rioplatense) mediante un test de respuesta múltiple. Los resultados obtenidos indican que la variedad de español peninsular es la mejor identificada por todos los modelos y que, de entre todos, GPT-4o es el único modelo capaz de identificar la variabilidad de la lengua española. Comments: in Spanish language Subjects: Computation and Language (cs.CL) Cite as: arXiv:2504.20049 [cs.CL] (or arXiv:2504.20049v1 [cs.CL] for this version) https://doi.org/10.48550/arXiv.2504.20049 Focus to learn more arXiv-issued DOI via DataCite
zh
[NLP-56] Refiner: Restructure Retrieval Content Efficiently to Advance Question-Answering Capabilities
【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在知识密集型任务中因参数化知识限制而产生的幻觉问题,以及在处理分散关键信息时存在的“中间迷失”(lost-in-the-middle)现象。其解决方案的关键在于提出一种端到端的提取与重构范式——\textit{Refiner},该方法在检索增强生成(Retrieval-Augmented Generation, RAG)的后检索过程中运作,利用单一解码器模型自适应地提取与查询相关的内容及其必要上下文,并根据内容之间的关联性进行分段,从而突出信息差异并有效对齐下游LLMs与原始上下文。
链接: https://arxiv.org/abs/2406.11357
作者: Zhonghao Li,Xuming Hu,Aiwei Liu,Kening Zheng,Sirui Huang,Hui Xiong
机构: Hongkong University of Science and Technology (香港科技大学); Tsinghua University (清华大学); The Hong Kong Polytechnic University (香港理工大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Information Retrieval (cs.IR); Multiagent Systems (cs.MA)
备注: 8 pages
Abstract:Large Language Models (LLMs) are limited by their parametric knowledge, leading to hallucinations in knowledge-extensive tasks. To address this, Retrieval-Augmented Generation (RAG) incorporates external document chunks to expand LLM knowledge. Furthermore, compressing information from document chunks through extraction or summarization can improve LLM performance. Nonetheless, LLMs still struggle to notice and utilize scattered key information, a problem known as the “lost-in-the-middle” syndrome. Therefore, we typically need to restructure the content for LLM to recognize the key information. We propose \textitRefiner , an end-to-end extract-and-restructure paradigm that operates in the post-retrieval process of RAG. \textitRefiner leverages a single decoder-only LLM to adaptively extract query-relevant contents verbatim along with the necessary context, and section them based on their interconnectedness, thereby highlights information distinction, and aligns downstream LLMs with the original context effectively. Experiments show that a trained \textitRefiner (with 7B parameters) exhibits significant gain to downstream LLM in improving answer accuracy, and outperforms other state-of-the-art advanced RAG and concurrent compressing approaches in various single-hop and multi-hop QA tasks. Notably, \textitRefiner achieves a 80.5% tokens reduction and a 1.6-7.0% improvement margin in multi-hop tasks compared to the next best solution. \textitRefiner is a plug-and-play solution that can be seamlessly integrated with RAG systems, facilitating its application across diverse open-source frameworks.
zh
计算机视觉
[CV-0] YoChameleon: Personalized Vision and Language Generation CVPR2025
【速读】:该论文试图解决大型多模态模型在个性化方面的不足,即这些模型虽然功能强大,但缺乏对特定用户概念的个性化知识。解决方案的关键在于提出Yo’Chameleon,通过软提示微调(soft-prompt tuning)将特定主题的信息嵌入模型中,以实现对主题相关问题的回答和在新情境下生成像素级细节的图像。此外,Yo’Chameleon采用自提示优化机制和“软正向”图像生成方法,以在多模态任务中平衡性能并提升少样本设置下的图像质量。
链接: https://arxiv.org/abs/2504.20998
作者: Thao Nguyen,Krishna Kumar Singh,Jing Shi,Trung Bui,Yong Jae Lee,Yuheng Li
机构: University of Wisconsin–Madison (威斯康星大学麦迪逊分校); Adobe Research (Adobe 研究院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: CVPR 2025; Project page: this https URL
Abstract:Large Multimodal Models (e.g., GPT-4, Gemini, Chameleon) have evolved into powerful tools with millions of users. However, they remain generic models and lack personalized knowledge of specific user concepts. Previous work has explored personalization for text generation, yet it remains unclear how these methods can be adapted to new modalities, such as image generation. In this paper, we introduce Yo’Chameleon, the first attempt to study personalization for large multimodal models. Given 3-5 images of a particular concept, Yo’Chameleon leverages soft-prompt tuning to embed subject-specific information to (i) answer questions about the subject and (ii) recreate pixel-level details to produce images of the subject in new contexts. Yo’Chameleon is trained with (i) a self-prompting optimization mechanism to balance performance across multiple modalities, and (ii) a ``soft-positive" image generation approach to enhance image quality in a few-shot setting.
zh
[CV-1] X-Fusion: Introducing New Modality to Frozen Large Language Models
【速读】:该论文试图解决如何在保持预训练大型语言模型(Large Language Model, LLM)语言能力的同时,将其扩展至多模态任务的问题。解决方案的关键在于提出X-Fusion框架,该框架采用双塔结构并引入模态特定的权重,在冻结LLM参数的前提下,整合视觉信息以实现多模态的理解与生成。
链接: https://arxiv.org/abs/2504.20996
作者: Sicheng Mo,Thao Nguyen,Xun Huang,Siddharth Srinivasan Iyer,Yijun Li,Yuchen Liu,Abhishek Tandon,Eli Shechtman,Krishna Kumar Singh,Yong Jae Lee,Bolei Zhou,Yuheng Li
机构: University of California, Los Angeles (加州大学洛杉矶分校); University of Wisconsin–Madison (威斯康星大学麦迪逊分校); Adobe Research (Adobe 研究院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project Page: this https URL
Abstract:We propose X-Fusion, a framework that extends pretrained Large Language Models (LLMs) for multimodal tasks while preserving their language capabilities. X-Fusion employs a dual-tower design with modality-specific weights, keeping the LLM’s parameters frozen while integrating vision-specific information for both understanding and generation. Our experiments demonstrate that X-Fusion consistently outperforms alternative architectures on both image-to-text and text-to-image tasks. We find that incorporating understanding-focused data improves generation quality, reducing image data noise enhances overall performance, and feature alignment accelerates convergence for smaller models but has minimal impact on larger ones. Our findings provide valuable insights into building efficient unified multimodal models.
zh
[CV-2] sserAct: Learning 4D Embodied World Models
【速读】:该论文试图解决如何学习新颖的4D具身世界模型(4D embodied world models)的问题,该模型能够根据具身智能体的行动动态预测3D场景随时间的变化,并提供空间和时间的一致性。解决方案的关键在于通过训练RGB-DN(RGB、深度和法线)视频来学习4D世界模型,从而在预测中融入详细的形状、配置及时间变化信息,同时有效学习准确的逆向动力学模型。该方法通过扩展现有机器人操作视频数据集以包含深度和法线信息,并在标注数据集上微调视频生成模型,实现对每一帧RGB-DN的联合预测,最终将生成的RGB、深度和法线视频直接转换为高质量的4D场景。
链接: https://arxiv.org/abs/2504.20995
作者: Haoyu Zhen,Qiao Sun,Hongxin Zhang,Junyan Li,Siyuan Zhou,Yilun Du,Chuang Gan
机构: UMass Amherst(马萨诸塞大学阿默斯特分校); HKUST(香港科技大学); Harvard University(哈佛大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注: Project Page: this https URL
Abstract:This paper presents an effective approach for learning novel 4D embodied world models, which predict the dynamic evolution of 3D scenes over time in response to an embodied agent’s actions, providing both spatial and temporal consistency. We propose to learn a 4D world model by training on RGB-DN (RGB, Depth, and Normal) videos. This not only surpasses traditional 2D models by incorporating detailed shape, configuration, and temporal changes into their predictions, but also allows us to effectively learn accurate inverse dynamic models for an embodied agent. Specifically, we first extend existing robotic manipulation video datasets with depth and normal information leveraging off-the-shelf models. Next, we fine-tune a video generation model on this annotated dataset, which jointly predicts RGB-DN (RGB, Depth, and Normal) for each frame. We then present an algorithm to directly convert generated RGB, Depth, and Normal videos into a high-quality 4D scene of the world. Our method ensures temporal and spatial coherence in 4D scene predictions from embodied scenarios, enables novel view synthesis for embodied environments, and facilitates policy learning that significantly outperforms those derived from prior video-based world models.
zh
[CV-3] SVD Based Least Squares for X-Ray Pneumonia Classification Using Deep Features
【速读】:该论文旨在解决通过X-ray影像实现肺炎的准确且早期诊断问题,以提高治疗效果和患者预后。其解决方案的关键在于提出一种基于奇异值分解的最小二乘(SVD-LS)框架,利用先进的自监督和迁移学习模型的强大特征表示进行多类肺炎分类,同时采用闭式非迭代分类方法,避免了计算成本高昂的梯度微调,从而在保证准确性的同时显著降低了计算开销。
链接: https://arxiv.org/abs/2504.20970
作者: Mete Erdogan,Sebnem Demirtas
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Preprint submitted to IEEE International Workshop on Machine Learning for Signal Processing (MLSP), 2025
Abstract:Accurate and early diagnosis of pneumonia through X-ray imaging is essential for effective treatment and improved patient outcomes. Recent advancements in machine learning have enabled automated diagnostic tools that assist radiologists in making more reliable and efficient decisions. In this work, we propose a Singular Value Decomposition-based Least Squares (SVD-LS) framework for multi-class pneumonia classification, leveraging powerful feature representations from state-of-the-art self-supervised and transfer learning models. Rather than relying on computationally expensive gradient based fine-tuning, we employ a closed-form, non-iterative classification approach that ensures efficiency without compromising accuracy. Experimental results demonstrate that SVD-LS achieves competitive performance while offering significantly reduced computational costs, making it a viable alternative for real-time medical imaging applications.
zh
[CV-4] DS_FusionNet: Dynamic Dual-Stream Fusion with Bidirectional Knowledge Distillation for Plant Disease Recognition
【速读】:该论文旨在解决经济作物全球增长安全面临的严重挑战,特别是通过人工智能技术实现植物病害的精准识别与预防。其核心问题包括小样本学习、叶片遮挡、光照变化以及类别间相似性高等技术难题。解决方案的关键在于提出一种动态双流融合网络(DS_FusionNet),该网络结合了双主干架构、可变形动态融合模块和双向知识蒸馏策略,从而显著提升了病害识别的准确性与泛化能力。
链接: https://arxiv.org/abs/2504.20948
作者: Yanghui Song,Chengfu Yang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Given the severe challenges confronting the global growth security of economic crops, precise identification and prevention of plant diseases has emerged as a critical issue in artificial intelligence-enabled agricultural technology. To address the technical challenges in plant disease recognition, including small-sample learning, leaf occlusion, illumination variations, and high inter-class similarity, this study innovatively proposes a Dynamic Dual-Stream Fusion Network (DS_FusionNet). The network integrates a dual-backbone architecture, deformable dynamic fusion modules, and bidirectional knowledge distillation strategy, significantly enhancing recognition accuracy. Experimental results demonstrate that DS_FusionNet achieves classification accuracies exceeding 90% using only 10% of the PlantDisease and CIFAR-10 datasets, while maintaining 85% accuracy on the complex PlantWild dataset, exhibiting exceptional generalization capabilities. This research not only provides novel technical insights for fine-grained image classification but also establishes a robust foundation for precise identification and management of agricultural diseases.
zh
[CV-5] End-to-end Audio Deepfake Detection from RAW Waveforms: a RawNet-Based Approach with Cross-Dataset Evaluation
【速读】:该论文旨在解决音频深度伪造(audio deepfake)检测问题,特别是在开放世界条件下,即测试时遇到的欺骗方法可能与训练时不同,导致检测难度增加。其解决方案的关键在于提出一种端到端的深度学习框架RawNetLite,该框架直接处理原始波形,采用轻量级卷积-循环结构以捕捉频谱和时间特征,并通过多领域数据训练结合Focal Loss提升模型对困难或模糊样本的识别能力。此外,引入基于编解码器的篡改和波形级音频增强技术显著提升了模型在真实声学条件下的泛化性能。
链接: https://arxiv.org/abs/2504.20923
作者: Andrea Di Pierno(1 and 2),Luca Guarnera(2),Dario Allegra(2),Sebastiano Battiato(2) ((1) IMT School of Advanced Studies, Lucca, Italy, (2) Department of Mathematics and Computer Science, University of Catania, Italy)
机构: IMT School of Advanced Studies Lucca (IMT高级研究学院卢卡校区); University of Catania (卡塔尼亚大学)
类目: ound (cs.SD); Computer Vision and Pattern Recognition (cs.CV); Audio and Speech Processing (eess.AS)
备注:
Abstract:Audio deepfakes represent a growing threat to digital security and trust, leveraging advanced generative models to produce synthetic speech that closely mimics real human voices. Detecting such manipulations is especially challenging under open-world conditions, where spoofing methods encountered during testing may differ from those seen during training. In this work, we propose an end-to-end deep learning framework for audio deepfake detection that operates directly on raw waveforms. Our model, RawNetLite, is a lightweight convolutional-recurrent architecture designed to capture both spectral and temporal features without handcrafted preprocessing. To enhance robustness, we introduce a training strategy that combines data from multiple domains and adopts Focal Loss to emphasize difficult or ambiguous samples. We further demonstrate that incorporating codec-based manipulations and applying waveform-level audio augmentations (e.g., pitch shifting, noise, and time stretching) leads to significant generalization improvements under realistic acoustic conditions. The proposed model achieves over 99.7% F1 and 0.25% EER on in-domain data (FakeOrReal), and up to 83.4% F1 with 16.4% EER on a challenging out-of-distribution test set (AVSpoof2021 + CodecFake). These findings highlight the importance of diverse training data, tailored objective functions and audio augmentations in building resilient and generalizable audio forgery detectors. Code and pretrained models are available at this https URL.
zh
[CV-6] Classifier-to-Bias: Toward Unsupervised Automatic Bias Detection for Visual Classifiers CVPR2025
【速读】:该论文试图解决在没有标注数据的情况下识别预训练模型中的偏见问题,这一问题限制了非专家用户对模型偏差的检测能力。解决方案的关键在于提出Classifier-to-Bias (C2B)框架,该框架无需任何标注数据,仅依赖于分类任务的文本描述,通过大型语言模型生成可能的偏见及其对应的带有任务特定目标标签的图像描述,再利用检索模型获取相关图像以评估模型在给定偏见下的准确性,从而实现无监督的、任务无关的偏见发现。
链接: https://arxiv.org/abs/2504.20902
作者: Quentin Guimard,Moreno D’Incà,Massimiliano Mancini,Elisa Ricci
机构: University of Trento (特伦托大学); Fondazione Bruno Kessler (布鲁诺·凯塞尔基金会)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: CVPR 2025. Code: this https URL
Abstract:A person downloading a pre-trained model from the web should be aware of its biases. Existing approaches for bias identification rely on datasets containing labels for the task of interest, something that a non-expert may not have access to, or may not have the necessary resources to collect: this greatly limits the number of tasks where model biases can be identified. In this work, we present Classifier-to-Bias (C2B), the first bias discovery framework that works without access to any labeled data: it only relies on a textual description of the classification task to identify biases in the target classification model. This description is fed to a large language model to generate bias proposals and corresponding captions depicting biases together with task-specific target labels. A retrieval model collects images for those captions, which are then used to assess the accuracy of the model w.r.t. the given biases. C2B is training-free, does not require any annotations, has no constraints on the list of biases, and can be applied to any pre-trained model on any classification task. Experiments on two publicly available datasets show that C2B discovers biases beyond those of the original datasets and outperforms a recent state-of-the-art bias detection baseline that relies on task-specific annotations, being a promising first step toward addressing task-agnostic unsupervised bias detection.
zh
[CV-7] CBM-RAG : Demonstrating Enhanced Interpretability in Radiology Report Generation with Multi-Agent RAG and Concept Bottleneck Models
【速读】:该论文试图解决生成式AI在放射学工作流自动化中的可解释性和可靠性问题,这些问题限制了其在临床环境中的应用。解决方案的关键在于将概念瓶颈模型(Concept Bottleneck Models, CBMs)与多智能体检索增强生成(Multi-Agent Retrieval-Augmented Generation, RAG)系统相结合,通过CBMs将胸部X光特征映射到人类可理解的临床概念,实现透明的疾病分类,同时利用RAG系统整合多智能体协作和外部知识,生成具有上下文丰富性和循证依据的报告,从而提升预测的可解释性、减少幻觉并生成高质量、定制化的报告。
链接: https://arxiv.org/abs/2504.20898
作者: Hasan Md Tusfiqur Alam,Devansh Srivastav,Abdulrahman Mohamed Selim,Md Abdul Kadir,Md Moktadiurl Hoque Shuvo,Daniel Sonntag
机构: German Research Center for Artificial Intelligence (DFKI); Saarland University (萨尔兰大学); University of Oldenburg (奥尔登堡大学); Dhaka Medical College Hospital (达卡医学院医院)
类目: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR)
备注: Accepted in the 17th ACM SIGCHI Symposium on Engineering Interactive Computing Systems (EICS 2025)
Abstract:Advancements in generative Artificial Intelligence (AI) hold great promise for automating radiology workflows, yet challenges in interpretability and reliability hinder clinical adoption. This paper presents an automated radiology report generation framework that combines Concept Bottleneck Models (CBMs) with a Multi-Agent Retrieval-Augmented Generation (RAG) system to bridge AI performance with clinical explainability. CBMs map chest X-ray features to human-understandable clinical concepts, enabling transparent disease classification. Meanwhile, the RAG system integrates multi-agent collaboration and external knowledge to produce contextually rich, evidence-based reports. Our demonstration showcases the system’s ability to deliver interpretable predictions, mitigate hallucinations, and generate high-quality, tailored reports with an interactive interface addressing accuracy, trust, and usability challenges. This framework provides a pathway to improving diagnostic consistency and empowering radiologists with actionable insights.
zh
[CV-8] FLIM-based Salient Object Detection Networks with Adaptive Decoders
【速读】:该论文试图解决在计算资源受限条件下,显著目标检测(SOD)任务中模型复杂度与性能之间的平衡问题。其解决方案的关键在于提出一种名为从图像标记中学习特征(FLIM)的方法,通过利用少量代表性图像中的标记像素估计编码器的卷积核,并结合自适应解码器,该解码器的权重通过启发式函数为每个输入图像单独估计,从而实现轻量级网络的构建。这种方法无需反向传播训练,仅需三到四张标注图像即可完成模型训练,适用于数据标注受限的应用场景。
链接: https://arxiv.org/abs/2504.20872
作者: Gilson Junior Soares,Matheus Abrantes Cerqueira,Jancarlo F. Gomes,Laurent Najman,Silvio Jamil F. Guimarães,Alexandre Xavier Falcão
机构: Institute of Computing, State University of Campinas(计算学院,坎皮纳斯州立大学); Univ Gustave Eiffel(古斯塔夫·埃菲尔大学); CNRS(法国国家科学研究中心); LIGM(图像与几何建模实验室); Pontificial Catholic University of Minas Gerais(米纳斯吉拉斯天主教联邦大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: This work has been submitted to the Journal of the Brazilian Computer Society (JBCS)
Abstract:Salient Object Detection (SOD) methods can locate objects that stand out in an image, assign higher values to their pixels in a saliency map, and binarize the map outputting a predicted segmentation mask. A recent tendency is to investigate pre-trained lightweight models rather than deep neural networks in SOD tasks, coping with applications under limited computational resources. In this context, we have investigated lightweight networks using a methodology named Feature Learning from Image Markers (FLIM), which assumes that the encoder’s kernels can be estimated from marker pixels on discriminative regions of a few representative images. This work proposes flyweight networks, hundreds of times lighter than lightweight models, for SOD by combining a FLIM encoder with an adaptive decoder, whose weights are estimated for each input image by a given heuristic function. Such FLIM networks are trained from three to four representative images only and without backpropagation, making the models suitable for applications under labeled data constraints as well. We study five adaptive decoders; two of them are introduced here. Differently from the previous ones that rely on one neuron per pixel with shared weights, the heuristic functions of the new adaptive decoders estimate the weights of each neuron per pixel. We compare FLIM models with adaptive decoders for two challenging SOD tasks with three lightweight networks from the state-of-the-art, two FLIM networks with decoders trained by backpropagation, and one FLIM network whose labeled markers define the decoder’s weights. The experiments demonstrate the advantages of the proposed networks over the baselines, revealing the importance of further investigating such methods in new applications.
zh
[CV-9] AI-GenBench: A New Ongoing Benchmark for AI-Generated Image Detection
【速读】:该论文试图解决AI生成图像在现实场景中检测的挑战,特别是针对生成式AI(Generative AI)快速发展的背景下,如何确保媒体真实性的关键问题。其解决方案的关键在于提出Ai-GenBench,这是一个新颖的基准测试框架,通过时间序列化的评估方式,使检测方法在按生成模型历史顺序排列的合成图像上逐步训练,从而测试其对新型生成模型(如从GANs到扩散模型)的泛化能力。该框架克服了现有方法在数据集划分、比较公平性和计算需求方面的局限性,提供了全面的数据集、标准化的评估协议和可访问的工具,以支持研究人员和非专家用户进行有效检测与验证。
链接: https://arxiv.org/abs/2504.20865
作者: Lorenzo Pellegrini,Davide Cozzolino,Serafino Pandolfini,Davide Maltoni,Matteo Ferrara,Luisa Verdoliva,Marco Prati,Marco Ramilli
机构: Dipartimento di Informatica - Scienza e Ingegneria (DISI), Università di Bologna, Cesena, Italy; Dipartimento di Ingegneria Elettrica e delle Tecnologie dell’Informazione (DIETI), Università degli Studi di Napoli Federico II, Naples, Italy; IdentifAI, Italy
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 9 pages, 6 figures, 4 tables, code available: this https URL
Abstract:The rapid advancement of generative AI has revolutionized image creation, enabling high-quality synthesis from text prompts while raising critical challenges for media authenticity. We present Ai-GenBench, a novel benchmark designed to address the urgent need for robust detection of AI-generated images in real-world scenarios. Unlike existing solutions that evaluate models on static datasets, Ai-GenBench introduces a temporal evaluation framework where detection methods are incrementally trained on synthetic images, historically ordered by their generative models, to test their ability to generalize to new generative models, such as the transition from GANs to diffusion models. Our benchmark focuses on high-quality, diverse visual content and overcomes key limitations of current approaches, including arbitrary dataset splits, unfair comparisons, and excessive computational demands. Ai-GenBench provides a comprehensive dataset, a standardized evaluation protocol, and accessible tools for both researchers and non-experts (e.g., journalists, fact-checkers), ensuring reproducibility while maintaining practical training requirements. By establishing clear evaluation rules and controlled augmentation strategies, Ai-GenBench enables meaningful comparison of detection methods and scalable solutions. Code and data are publicly available to ensure reproducibility and to support the development of robust forensic detectors to keep pace with the rise of new synthetic generators.
zh
[CV-10] FedMVP: Federated Multi-modal Visual Prompt Tuning for Vision-Language Models
【速读】:该论文旨在解决联邦学习中文本提示调优(textual prompt tuning)在适应视觉-语言模型(Vision-Language Models, VLMs)时存在的过拟合问题以及对已知概念的依赖性过强,从而限制了模型对未见概念的泛化能力。其解决方案的关键在于提出联邦多模态视觉提示调优(Federated Multimodal Visual Prompt Tuning, FedMVP),该方法通过结合图像条件特征和类别文本属性特征,生成动态的多模态视觉提示,并利用PromptFormer模块通过交叉注意力机制协同对齐文本与视觉特征,从而实现更丰富的上下文整合。
链接: https://arxiv.org/abs/2504.20860
作者: Mainak Singha,Subhankar Roy,Sarthak Mehrotra,Ankit Jha,Moloud Abdar,Biplab Banerjee,Elisa Ricci
机构: Indian Institute of Technology Bombay, India; University of Trento, Italy; LNMIIT Jaipur, India; University of Queensland, Australia; Fondazione Bruno Kessler, Italy
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Textual prompt tuning adapts Vision-Language Models (e.g., CLIP) in federated learning by tuning lightweight input tokens (or prompts) on local client data, while keeping network weights frozen. Post training, only the prompts are shared by the clients with the central server for aggregation. However, textual prompt tuning often struggles with overfitting to known concepts and may be overly reliant on memorized text features, limiting its adaptability to unseen concepts. To address this limitation, we propose Federated Multimodal Visual Prompt Tuning (FedMVP) that conditions the prompts on comprehensive contextual information – image-conditioned features and textual attribute features of a class – that is multimodal in nature. At the core of FedMVP is a PromptFormer module that synergistically aligns textual and visual features through cross-attention, enabling richer contexual integration. The dynamically generated multimodal visual prompts are then input to the frozen vision encoder of CLIP, and trained with a combination of CLIP similarity loss and a consistency loss. Extensive evaluation on 20 datasets spanning three generalization settings demonstrates that FedMVP not only preserves performance on in-distribution classes and domains, but also displays higher generalizability to unseen classes and domains when compared to state-of-the-art methods. Codes will be released upon acceptance.
zh
[CV-11] RadSAM: Segmenting 3D radiological images with a 2D promptable model
【速读】:该论文试图解决医学图像分割中基于2D模型对3D医学影像(如CT和MRI)进行精确分割的挑战,特别是在单个提示下实现高效、准确的3D分割问题。现有方法依赖于逐层提示,导致分割过程繁琐且缺乏编辑功能。解决方案的关键在于提出RadSAM,该方法通过使用噪声掩码作为初始提示,并结合迭代推理流程,从单个提示中重建出完整的3D掩码,从而实现对3D医学对象的有效分割。
链接: https://arxiv.org/abs/2504.20837
作者: Julien Khlaut,Elodie Ferreres,Daniel Tordjman,Hélène Philippe,Tom Boeken,Pierre Manceron,Corentin Dancette
机构: Raidium, France; Université de Paris Cité, AP-HP, Hôpital Européen Georges Pompidou, Department of Vascular and Oncological Interventional Radiology, HEKA INRIA, INSERM PARCC U 970, Paris, France
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Medical image segmentation is a crucial and time-consuming task in clinical care, where mask precision is extremely important. The Segment Anything Model (SAM) offers a promising approach, as it provides an interactive interface based on visual prompting and edition to refine an initial segmentation. This model has strong generalization capabilities, does not rely on predefined classes, and adapts to diverse objects; however, it is pre-trained on natural images and lacks the ability to process medical data effectively. In addition, this model is built for 2D images, whereas a whole medical domain is based on 3D images, such as CT and MRI. Recent adaptations of SAM for medical imaging are based on 2D models, thus requiring one prompt per slice to segment 3D objects, making the segmentation process tedious. They also lack important features such as editing. To bridge this gap, we propose RadSAM, a novel method for segmenting 3D objects with a 2D model from a single prompt. In practice, we train a 2D model using noisy masks as initial prompts, in addition to bounding boxes and points. We then use this novel prompt type with an iterative inference pipeline to reconstruct the 3D mask slice-by-slice. We introduce a benchmark to evaluate the model’s ability to segment 3D objects in CT images from a single prompt and evaluate the models’ out-of-domain transfer and edition capabilities. We demonstrate the effectiveness of our approach against state-of-the-art models on this benchmark using the AMOS abdominal organ segmentation dataset.
zh
[CV-12] CMT: A Cascade MAR with Topology Predictor for Multimodal Conditional CAD Generation
【速读】:该论文试图解决现有计算机辅助设计(Computer-Aided Design, CAD)方法在准确性和用户友好性方面存在的不足,这些问题主要源于过于简化的表示方式或架构无法支持多模态设计需求。其解决方案的关键在于提出一种基于边界表示(Boundary Representation, B-Rep)的级联多模态框架(Cascade MAR with Topology Predictor, CMT),该框架能够有效捕捉B-Reps中“边-计数-面”的先验信息,并通过拓扑预测器直接从MAR中的紧凑标记估计B-Reps的拓扑结构。此外,研究还构建了一个大规模多模态CAD数据集mmABC,以支持大规模训练。
链接: https://arxiv.org/abs/2504.20830
作者: Jianyu Wu,Yizhou Wang,Xiangyu Yue,Xinzhu Ma,Jingyang Guo,Dongzhan Zhou,Wanli Ouyang,Shixiang Tang
机构: Beihang University (北京航空航天大学); Shanghai Artificial Intelligence Laboratory (上海人工智能实验室); The Chinese University of Hong Kong (香港中文大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:While accurate and user-friendly Computer-Aided Design (CAD) is crucial for industrial design and manufacturing, existing methods still struggle to achieve this due to their over-simplified representations or architectures incapable of supporting multimodal design requirements. In this paper, we attempt to tackle this problem from both methods and datasets aspects. First, we propose a cascade MAR with topology predictor (CMT), the first multimodal framework for CAD generation based on Boundary Representation (B-Rep). Specifically, the cascade MAR can effectively capture the ``edge-counters-surface’’ priors that are essential in B-Reps, while the topology predictor directly estimates topology in B-Reps from the compact tokens in MAR. Second, to facilitate large-scale training, we develop a large-scale multimodal CAD dataset, mmABC, which includes over 1.3 million B-Rep models with multimodal annotations, including point clouds, text descriptions, and multi-view images. Extensive experiments show the superior of CMT in both conditional and unconditional CAD generation tasks. For example, we improve Coverage and Valid ratio by +10.68% and +10.3%, respectively, compared to state-of-the-art methods on ABC in unconditional generation. CMT also improves +4.01 Chamfer on image conditioned CAD generation on mmABC. The dataset, code and pretrained network shall be released.
zh
[CV-13] GaussTrap: Stealthy Poisoning Attacks on 3D Gaussian Splatting for Targeted Scene Confusion
【速读】:该论文试图解决3D Gaussian Splatting (3DGS)在安全关键领域中可能存在的后门威胁问题,即攻击者通过植入恶意视图导致推理过程中的场景混淆,进而引发自主导航中的环境误判或沉浸式环境中的空间畸变。解决方案的关键在于提出GuassTrap方法,该方法通过三阶段流水线(攻击、稳定和正常训练)在保持非目标视图高质量渲染的同时,隐蔽地注入恶意视图,从而实现低可检测性且高危害性的后门攻击,同时联合优化攻击效果与感知真实性以暴露3D渲染中的安全风险。
链接: https://arxiv.org/abs/2504.20829
作者: Jiaxin Hong,Sixu Chen,Shuoyang Sun,Hongyao Yu,Hao Fang,Yuqi Tan,Bin Chen,Shuhan Qi,Jiawei Li
机构: Harbin Institute of Technology(Shenzhen); South China University of Technology; Shenzhen Internation Graduate School, Tsinghua University; Huawei Manufacturing
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:As 3D Gaussian Splatting (3DGS) emerges as a breakthrough in scene representation and novel view synthesis, its rapid adoption in safety-critical domains (e.g., autonomous systems, AR/VR) urgently demands scrutiny of potential security vulnerabilities. This paper presents the first systematic study of backdoor threats in 3DGS pipelines. We identify that adversaries may implant backdoor views to induce malicious scene confusion during inference, potentially leading to environmental misperception in autonomous navigation or spatial distortion in immersive environments. To uncover this risk, we propose GuassTrap, a novel poisoning attack method targeting 3DGS models. GuassTrap injects malicious views at specific attack viewpoints while preserving high-quality rendering in non-target views, ensuring minimal detectability and maximizing potential harm. Specifically, the proposed method consists of a three-stage pipeline (attack, stabilization, and normal training) to implant stealthy, viewpoint-consistent poisoned renderings in 3DGS, jointly optimizing attack efficacy and perceptual realism to expose security risks in 3D rendering. Extensive experiments on both synthetic and real-world datasets demonstrate that GuassTrap can effectively embed imperceptible yet harmful backdoor views while maintaining high-quality rendering in normal views, validating its robustness, adaptability, and practical applicability.
zh
[CV-14] Adept: Annotation-Denoising Auxiliary Tasks with Discrete Cosine Transform Map and Keypoint for Human-Centric Pretraining
【速读】:该论文旨在解决人类中心感知任务中因依赖特定数据集规模和深度信息导致的性能受限问题。其关键解决方案是通过摒弃深度信息,利用离散余弦变换(Discrete Cosine Transform, DCT)在频域空间探索RGB图像的语义信息,并引入基于关键点和DCT图的标注去噪辅助任务,以增强RGB图像提取器对人体细粒度语义信息的学习能力。
链接: https://arxiv.org/abs/2504.20800
作者: Weizhen He,Yunfeng Yan,Shixiang Tang,Yiheng Deng,Yangyang Zhong,Pengxin Luo,Donglian Qi
机构: Zhejiang University(浙江大学); Shanghai AI Laboratory(上海人工智能实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Human-centric perception is the core of diverse computer vision tasks and has been a long-standing research focus. However, previous research studied these human-centric tasks individually, whose performance is largely limited to the size of the public task-specific datasets. Recent human-centric methods leverage the additional modalities, e.g., depth, to learn fine-grained semantic information, which limits the benefit of pretraining models due to their sensitivity to camera views and the scarcity of RGB-D data on the Internet. This paper improves the data scalability of human-centric pretraining methods by discarding depth information and exploring semantic information of RGB images in the frequency space by Discrete Cosine Transform (DCT). We further propose new annotation denoising auxiliary tasks with keypoints and DCT maps to enforce the RGB image extractor to learn fine-grained semantic information of human bodies. Our extensive experiments show that when pretrained on large-scale datasets (COCO and AIC datasets) without depth annotation, our model achieves better performance than state-of-the-art methods by +0.5 mAP on COCO, +1.4 PCKh on MPII and -0.51 EPE on Human3.6M for pose estimation, by +4.50 mIoU on Human3.6M for human parsing, by -3.14 MAE on SHA and -0.07 MAE on SHB for crowd counting, by +1.1 F1 score on SHA and +0.8 F1 score on SHA for crowd localization, and by +0.1 mAP on Market1501 and +0.8 mAP on MSMT for person ReID. We also validate the effectiveness of our method on MPII+NTURGBD datasets
zh
[CV-15] A Survey on Event-based Optical Marker Systems
【速读】:该论文旨在解决传统视觉系统在低延迟、高动态范围和低功耗方面存在的局限性,特别是在复杂光照条件下机器人视觉与机器感知的挑战。其解决方案的关键在于利用事件驱动相机(event-based cameras)与被动或主动光学标记(如AprilTags、闪烁LED阵列)的结合,通过异步操作和对不利光照条件的鲁棒性,提升系统的性能与适用性。
链接: https://arxiv.org/abs/2504.20736
作者: Nafiseh Jabbari Tofighi,Maxime Robic,Fabio Morbidi,Pascal Vasseur
机构: University of Picardie Jules Verne (皮卡第朱尔斯·费尔南德斯大学)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注: 10 pages, 6 figures, 1 table
Abstract:The advent of event-based cameras, with their low latency, high dynamic range, and reduced power consumption, marked a significant change in robotic vision and machine perception. In particular, the combination of these neuromorphic sensors with widely-available passive or active optical markers (e.g. AprilTags, arrays of blinking LEDs), has recently opened up a wide field of possibilities. This survey paper provides a comprehensive review on Event-Based Optical Marker Systems (EBOMS). We analyze the basic principles and technologies on which these systems are based, with a special focus on their asynchronous operation and robustness against adverse lighting conditions. We also describe the most relevant applications of EBOMS, including object detection and tracking, pose estimation, and optical communication. The article concludes with a discussion of possible future research directions in this rapidly-emerging and multidisciplinary field.
zh
[CV-16] Learning a General Model: Folding Clothing with Topological Dynamics
【速读】:该论文旨在解决服装高自由度和复杂结构带来的衣物操作难题。其关键解决方案是提出一种通用的拓扑动力学模型,通过可见折叠结构作为拓扑骨架,设计一种新型拓扑图来表示衣物状态,该拓扑图具有低维特性,能够适用于多种折叠状态的复杂衣物,并能体现衣物的约束条件及预测其运动行为。
链接: https://arxiv.org/abs/2504.20720
作者: Yiming Liu,Lijun Han,Enlin Gu,Hesheng Wang
机构: Shanghai Jiao Tong University (上海交通大学); University of Pennsylvania (宾夕法尼亚大学)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:The high degrees of freedom and complex structure of garments present significant challenges for clothing manipulation. In this paper, we propose a general topological dynamics model to fold complex clothing. By utilizing the visible folding structure as the topological skeleton, we design a novel topological graph to represent the clothing state. This topological graph is low-dimensional and applied for complex clothing in various folding states. It indicates the constraints of clothing and enables predictions regarding clothing movement. To extract graphs from self-occlusion, we apply semantic segmentation to analyze the occlusion relationships and decompose the clothing structure. The decomposed structure is then combined with keypoint detection to generate the topological graph. To analyze the behavior of the topological graph, we employ an improved Graph Neural Network (GNN) to learn the general dynamics. The GNN model can predict the deformation of clothing and is employed to calculate the deformation Jacobi matrix for control. Experiments using jackets validate the algorithm’s effectiveness to recognize and fold complex clothing with self-occlusion.
zh
[CV-17] In-Context Edit: Enabling Instructional Image Editing with In-Context Generation in Large Scale Diffusion Transformer
【速读】:该论文旨在解决基于指令的图像编辑中精度与效率之间的权衡问题(precision-efficiency tradeoff)。当前方法在计算资源消耗和数据需求方面存在较高成本,而无需训练的技术则在指令理解和编辑质量上表现不足。其解决方案的关键在于利用大规模Diffusion Transformer (DiT) 的生成能力和上下文感知特性,并提出三项创新:一是采用上下文提示(in-context prompting)实现零样本指令合规的框架,避免结构改动;二是引入LoRA-MoE混合微调策略,提升适应灵活性并实现动态专家路由;三是通过视觉-语言模型(VLMs)在推理阶段早期筛选更优初始噪声,提升编辑质量。
链接: https://arxiv.org/abs/2504.20690
作者: Zechuan Zhang,Ji Xie,Yu Lu,Zongxin Yang,Yi Yang
机构: Zhejiang University (浙江大学); Harvard University (哈佛大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project Page: this https URL
Abstract:Instruction-based image editing enables robust image modification via natural language prompts, yet current methods face a precision-efficiency tradeoff. Fine-tuning methods demand significant computational resources and large datasets, while training-free techniques struggle with instruction comprehension and edit quality. We resolve this dilemma by leveraging large-scale Diffusion Transformer (DiT)’ enhanced generation capacity and native contextual awareness. Our solution introduces three contributions: (1) an in-context editing framework for zero-shot instruction compliance using in-context prompting, avoiding structural changes; (2) a LoRA-MoE hybrid tuning strategy that enhances flexibility with efficient adaptation and dynamic expert routing, without extensive retraining; and (3) an early filter inference-time scaling method using vision-language models (VLMs) to select better initial noise early, improving edit quality. Extensive evaluations demonstrate our method’s superiority: it outperforms state-of-the-art approaches while requiring only 0.5% training data and 1% trainable parameters compared to conventional baselines. This work establishes a new paradigm that enables high-precision yet efficient instruction-guided editing. Codes and demos can be found in this https URL.
zh
[CV-18] Efficient Listener: Dyadic Facial Motion Synthesis via Action Diffusion
【速读】:该论文旨在解决在双人对话中生成逼真听者面部动作的问题,该问题由于高维动作空间和时间依赖性的要求而具有挑战性。现有方法通常通过提取3D Morphable Model (3DMM)系数并在3DMM空间中建模来实现,但这种方法导致3DMM的计算速度成为瓶颈,难以实现实时交互响应。为了解决这一问题,论文提出了Facial Action Diffusion (FAD),其关键在于将图像生成领域的扩散方法引入到面部动作生成中,从而实现高效的面部动作生成。此外,论文还构建了专门用于处理说话者视觉和音频信息的Efficient Listener Network (ELNet),结合FAD与ELNet,所提方法在提升性能的同时减少了99%的计算时间。
链接: https://arxiv.org/abs/2504.20685
作者: Zesheng Wang,Alexandre Bruckert,Patrick Le Callet,Guangtao Zhai
机构: Nantes Université(南特大学); Shanghai Jiao Tong University(上海交通大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
备注:
Abstract:Generating realistic listener facial motions in dyadic conversations remains challenging due to the high-dimensional action space and temporal dependency requirements. Existing approaches usually consider extracting 3D Morphable Model (3DMM) coefficients and modeling in the 3DMM space. However, this makes the computational speed of the 3DMM a bottleneck, making it difficult to achieve real-time interactive responses. To tackle this problem, we propose Facial Action Diffusion (FAD), which introduces the diffusion methods from the field of image generation to achieve efficient facial action generation. We further build the Efficient Listener Network (ELNet) specially designed to accommodate both the visual and audio information of the speaker as input. Considering of FAD and ELNet, the proposed method learns effective listener facial motion representations and leads to improvements of performance over the state-of-the-art methods while reducing 99% computational time.
zh
[CV-19] OG-HFYOLO :Orientation gradient guidance and heterogeneous feature fusion for deformation table cell instance segmentation
【速读】:该论文旨在解决几何形变导致表格内容信息与结构之间相关性较弱的问题,从而影响下游任务获取准确内容信息的难题。解决方案的关键在于提出OG-HFYOLO模型,该模型通过Gradient Orientation-aware Extractor增强边缘响应,结合Heterogeneous Kernel Cross Fusion模块和scale-aware损失函数以适应多尺度目标特征,并在后处理中引入mask-driven非极大值抑制,替代传统的边界框抑制机制。此外,还提出了一个数据生成器,填补了细粒度形变表格单元空间坐标定位的数据集空白,构建了大规模数据集Deformation Wired Table (DWTAL)。
链接: https://arxiv.org/abs/2504.20682
作者: Long Liu,Cihui Yang
机构: Nanchang Hangkong University (南昌航空大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Table structure recognition is a key task in document analysis. However, the geometric deformation in deformed tables causes a weak correlation between content information and structure, resulting in downstream tasks not being able to obtain accurate content information. To obtain fine-grained spatial coordinates of cells, we propose the OG-HFYOLO model, which enhances the edge response by Gradient Orientation-aware Extractor, combines a Heterogeneous Kernel Cross Fusion module and a scale-aware loss function to adapt to multi-scale objective features, and introduces mask-driven non-maximal suppression in the post-processing, which replaces the traditional bounding box suppression mechanism. Furthermore, we also propose a data generator, filling the gap in the dataset for fine-grained deformation table cell spatial coordinate localization, and derive a large-scale dataset named Deformation Wired Table (DWTAL). Experiments show that our proposed model demonstrates excellent segmentation accuracy on all mainstream instance segmentation models. The dataset and the source code are open source: this https URL.
zh
[CV-20] Occlusion-aware Driver Monitoring System using the Driver Monitoring Dataset ITSC
【速读】:该论文旨在解决驾驶员监控系统(Driver Monitoring System, DMS)在复杂光照条件下,尤其是低光环境下的可靠性和准确性问题,同时提升系统对遮挡情况的感知能力以增强情境意识和可信度。解决方案的关键在于采用基于RGB和红外(Infrared, IR)图像的独立算法,实现驾驶员身份识别、区域注视估计以及面部遮挡检测,并通过整合这些算法形成一个协同工作的流程,从而确保系统的鲁棒性与实用性。
链接: https://arxiv.org/abs/2504.20677
作者: Paola Natalia Cañas,Alexander Diez,David Galvañ,Marcos Nieto,Igor Rodríguez
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Submitted for review to the IEEE International Conference on Intelligent Transportation Systems (ITSC) 2025
Abstract:This paper presents a robust, occlusion-aware driver monitoring system (DMS) utilizing the Driver Monitoring Dataset (DMD). The system performs driver identification, gaze estimation by regions, and face occlusion detection under varying lighting conditions, including challenging low-light scenarios. Aligned with EuroNCAP recommendations, the inclusion of occlusion detection enhances situational awareness and system trustworthiness by indicating when the system’s performance may be degraded. The system employs separate algorithms trained on RGB and infrared (IR) images to ensure reliable functioning. We detail the development and integration of these algorithms into a cohesive pipeline, addressing the challenges of working with different sensors and real-car implementation. Evaluation on the DMD and in real-world scenarios demonstrates the effectiveness of the proposed system, highlighting the superior performance of RGB-based models and the pioneering contribution of robust occlusion detection in DMS.
zh
[CV-21] FBRT-YOLO: Faster and Better for Real-Time Aerial Image Detection AAAI2025
【速读】:该论文旨在解决实时空中图像检测中小目标检测的准确性与效率之间的平衡问题(accuracy-efficiency trade-off)。其解决方案的关键在于提出了一种名为FBRT-YOLO的新型实时检测器,该检测器包含两个轻量级模块:特征互补映射模块(Feature Complementary Mapping Module, FCM)和多核感知单元(Multi-Kernel Perception Unit, MKP)。FCM通过增强网络对小目标空间位置信息的整合来缓解深度网络中小目标信息丢失导致的信息不平衡问题,而MKP则通过不同尺寸卷积核的组合来增强多尺度目标之间的关系,提升不同尺度目标的感知能力。
链接: https://arxiv.org/abs/2504.20670
作者: Yao Xiao,Tingfa Xu,Yu Xin,Jianan Li
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: AAAI 2025
Abstract:Embedded flight devices with visual capabilities have become essential for a wide range of applications. In aerial image detection, while many existing methods have partially addressed the issue of small target detection, challenges remain in optimizing small target detection and balancing detection accuracy with efficiency. These issues are key obstacles to the advancement of real-time aerial image detection. In this paper, we propose a new family of real-time detectors for aerial image detection, named FBRT-YOLO, to address the imbalance between detection accuracy and efficiency. Our method comprises two lightweight modules: Feature Complementary Mapping Module (FCM) and Multi-Kernel Perception Unit(MKP), designed to enhance object perception for small targets in aerial images. FCM focuses on alleviating the problem of information imbalance caused by the loss of small target information in deep networks. It aims to integrate spatial positional information of targets more deeply into the network,better aligning with semantic information in the deeper layers to improve the localization of small targets. We introduce MKP, which leverages convolutions with kernels of different sizes to enhance the relationships between targets of various scales and improve the perception of targets at different scales. Extensive experimental results on three major aerial image datasets, including Visdrone, UAVDT, and AI-TOD,demonstrate that FBRT-YOLO outperforms various real-time detectors in terms of performance and speed.
zh
[CV-22] Advance Fake Video Detection via Vision Transformers
【速读】:该论文旨在解决由生成式 AI (Generative AI) 生成的超现实图像和视频在传播虚假信息方面的潜在风险,亟需开发高精度且具有泛化能力的检测方法。其解决方案的关键在于受基于 Vision Transformer (ViT) 的假图像检测启发,并将其扩展至视频领域,提出一种创新框架,通过时间序列上 ViT 嵌入的整合来提升检测性能。
链接: https://arxiv.org/abs/2504.20669
作者: Joy Battocchio(1),Stefano Dell’Anna(1),Andrea Montibeller(1),Giulia Boato(1,2) ((1) University of Trento, Trento, Italy, (2) Truebees srl, Trento, Italy)
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
备注:
Abstract:Recent advancements in AI-based multimedia generation have enabled the creation of hyper-realistic images and videos, raising concerns about their potential use in spreading misinformation. The widespread accessibility of generative techniques, which allow for the production of fake multimedia from prompts or existing media, along with their continuous refinement, underscores the urgent need for highly accurate and generalizable AI-generated media detection methods, underlined also by new regulations like the European Digital AI Act. In this paper, we draw inspiration from Vision Transformer (ViT)-based fake image detection and extend this idea to video. We propose an original %innovative framework that effectively integrates ViT embeddings over time to enhance detection performance. Our method shows promising accuracy, generalization, and few-shot learning capabilities across a new, large and diverse dataset of videos generated using five open source generative techniques from the state-of-the-art, as well as a separate dataset containing videos produced by proprietary generative methods.
zh
[CV-23] rueFake: A Real World Case Dataset of Last Generation Fake Images also Shared on Social Networks
【速读】:该论文试图解决生成式 AI (Generative AI) 生成的合成媒体在社交媒体平台上因压缩和其他处理导致假象检测线索退化的问题,当前许多取证工具无法应对这些现实中的挑战。解决方案的关键在于引入 TrueFake 数据集,这是一个包含 60 万张图像的大规模基准数据集,涵盖了先进的生成技术以及通过三种不同社交网络的分享,从而为最先进的假图像检测器提供在高度真实且具有挑战性条件下的严格评估。
链接: https://arxiv.org/abs/2504.20658
作者: Stefano Dell’Anna(1),Andrea Montibeller(1),Giulia Boato(1,2) ((1) University of Trento, Trento, Italy, (2) Truebees srl, Trento, Italy)
机构: University of Trento (特伦托大学); Truebees srl (Truebees srl)
类目: Multimedia (cs.MM); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:AI-generated synthetic media are increasingly used in real-world scenarios, often with the purpose of spreading misinformation and propaganda through social media platforms, where compression and other processing can degrade fake detection cues. Currently, many forensic tools fail to account for these in-the-wild challenges. In this work, we introduce TrueFake, a large-scale benchmarking dataset of 600,000 images including top notch generative techniques and sharing via three different social networks. This dataset allows for rigorous evaluation of state-of-the-art fake image detectors under very realistic and challenging conditions. Through extensive experimentation, we analyze how social media sharing impacts detection performance, and identify current most effective detection and training strategies. Our findings highlight the need for evaluating forensic models in conditions that mirror real-world use.
zh
[CV-24] Image deidentification in the XNAT ecosystem: use cases and solutions WWW
【速读】:该论文旨在解决医学影像数据中的去标识化(deidentification)问题,以确保在研究中使用DICOM图像数据时患者隐私得到保护。其解决方案的关键在于利用XNAT平台及其生态系统中的工具,构建一个有效的去标识化工作流,并通过持续基准测试进行优化。研究还探讨了基于规则的方法与机器学习模型在处理文本信息中的优劣,强调了提升地址识别能力和图像像素中可识别数据去除的重要性。
链接: https://arxiv.org/abs/2504.20657
作者: Alex Michie,Simon J Doran
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: For submission to MELBA (Machine Learning for Biomedical Imaging) special issue on the MIDI-B deidentification challenge ( this https URL ). 11 pages, 1 fig, 2 tables; 1 supplementary data file ( this http URL ) containing three spreadsheet tabs
Abstract:XNAT is a server-based data management platform widely used in academia for curating large databases of DICOM images for research projects. We describe in detail a deidentification workflow for DICOM data using facilities in XNAT, together with independent tools in the XNAT “ecosystem”. We list different contexts in which deidentification might be needed, based on our prior experience. The starting point for participation in the Medical Image De-Identification Benchmark (MIDI-B) challenge was a set of pre-existing local methodologies, which were adapted during the validation phase of the challenge. Our result in the test phase was 97.91%, considerably lower than our peers, due largely to an arcane technical incompatibility of our methodology with the challenge’s Synapse platform, which prevented us receiving feedback during the validation phase. Post-submission, additional discrepancy reports from the organisers and via the MIDI-B Continuous Benchmarking facility, enabled us to improve this score significantly to 99.61%. An entirely rule-based approach was shown to be capable of removing all name-related information in the test corpus, but exhibited failures in dealing fully with address data. Initial experiments using published machine-learning models to remove addresses were partially successful but showed the models to be “over-aggressive” on other types of free-text data, leading to a slight overall degradation in performance to 99.54%. Future development will therefore focus on improving address-recognition capabilities, but also on better removal of identifiable data burned into the image pixels. Several technical aspects relating to the “answer key” are still under discussion with the challenge organisers, but we estimate that our percentage of genuine deidentification failures on the MIDI-B test corpus currently stands at 0.19%. (Abridged from original for arXiv submission)
zh
[CV-25] SpaRE: Enhancing Spatial Reasoning in Vision-Language Models with Synthetic Data
【速读】:该论文试图解决视觉-语言模型(Vision-Language Models, VLMs)在空间推理任务中表现不佳的问题,这一问题限制了其在现实世界任务(如机器人和导航)中的应用。解决方案的关键在于构建一个专注于空间推理的合成视觉问答(VQA)数据集,该数据集基于高细节图像描述生成,旨在弥补现有VL数据集中空间关系稀缺的不足,从而提升VLMs在空间推理任务上的性能。
链接: https://arxiv.org/abs/2504.20648
作者: Michael Ogezi,Freda Shi
机构: University of Waterloo (滑铁卢大学); Vector Institute (向量研究所)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Vision-language models (VLMs) work well in tasks ranging from image captioning to visual question answering (VQA), yet they struggle with spatial reasoning, a key skill for understanding our physical world that humans excel at. We find that spatial relations are generally rare in widely used VL datasets, with only a few being well represented, while most form a long tail of underrepresented relations. This gap leaves VLMs ill-equipped to handle diverse spatial relationships. To bridge it, we construct a synthetic VQA dataset focused on spatial reasoning generated from hyper-detailed image descriptions in Localized Narratives, DOCCI, and PixMo-Cap. Our dataset consists of 455k samples containing 3.4 million QA pairs. Trained on this dataset, our Spatial-Reasoning Enhanced (SpaRE) VLMs show strong improvements on spatial reasoning benchmarks, achieving up to a 49% performance gain on the What’s Up benchmark, while maintaining strong results on general tasks. Our work narrows the gap between human and VLM spatial reasoning and makes VLMs more capable in real-world tasks such as robotics and navigation.
zh
[CV-26] LDPoly: Latent Diffusion for Polygonal Road Outline Extraction in Large-Scale Topographic Mapping
【速读】:该论文旨在解决从高分辨率航拍图像中提取多边形道路轮廓的问题,这一任务在大规模地形制图中具有重要意义,但目前尚无专门为此设计的方法。现有方法主要针对建筑物轮廓提取,而道路特有的分支结构和拓扑连通性使其难以直接适用。论文提出的解决方案是LDPoly,其关键在于引入了一种带有通道嵌入融合模块的双隐空间扩散模型,该模型能够同时生成道路掩码和顶点热图,并通过定制的多边形化方法获得顶点冗余最小的精确矢量道路多边形。
链接: https://arxiv.org/abs/2504.20645
作者: Weiqin Jiao,Hao Cheng,George Vosselman,Claudio Persello
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Polygonal road outline extraction from high-resolution aerial images is an important task in large-scale topographic mapping, where roads are represented as vectorized polygons, capturing essential geometric features with minimal vertex redundancy. Despite its importance, no existing method has been explicitly designed for this task. While polygonal building outline extraction has been extensively studied, the unique characteristics of roads, such as branching structures and topological connectivity, pose challenges to these methods. To address this gap, we introduce LDPoly, the first dedicated framework for extracting polygonal road outlines from high-resolution aerial images. Our method leverages a novel Dual-Latent Diffusion Model with a Channel-Embedded Fusion Module, enabling the model to simultaneously generate road masks and vertex heatmaps. A tailored polygonization method is then applied to obtain accurate vectorized road polygons with minimal vertex redundancy. We evaluate LDPoly on a new benchmark dataset, Map2ImLas, which contains detailed polygonal annotations for various topographic objects in several Dutch regions. Our experiments include both in-region and cross-region evaluations, with the latter designed to assess the model’s generalization performance on unseen regions. Quantitative and qualitative results demonstrate that LDPoly outperforms state-of-the-art polygon extraction methods across various metrics, including pixel-level coverage, vertex efficiency, polygon regularity, and road connectivity. We also design two new metrics to assess polygon simplicity and boundary smoothness. Moreover, this work represents the first application of diffusion models for extracting precise vectorized object outlines without redundant vertices from remote-sensing imagery, paving the way for future advancements in this field.
zh
[CV-27] AlignDiT: Multimodal Aligned Diffusion Transformer for Synchronized Speech Generation
【速读】:该论文旨在解决多模态到语音生成(multimodal-to-speech generation)任务中的挑战,包括语音可懂度、音视频同步性、语音自然度以及与参考说话人声音的相似性等问题。其解决方案的关键在于提出了一种名为AlignDiT的多模态对齐扩散Transformer模型,该模型通过探索三种有效的多模态表示对齐策略,并引入一种新颖的多模态无分类器引导机制,以在语音合成过程中自适应地平衡各模态的信息,从而生成准确、同步且自然的语音。
链接: https://arxiv.org/abs/2504.20629
作者: Jeongsoo Choi,Ji-Hoon Kim,Kim Sung-Bin,Tae-Hyun Oh,Joon Son Chung
机构: KAIST(韩国科学技术院); POSTECH(浦项科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
备注:
Abstract:In this paper, we address the task of multimodal-to-speech generation, which aims to synthesize high-quality speech from multiple input modalities: text, video, and reference audio. This task has gained increasing attention due to its wide range of applications, such as film production, dubbing, and virtual avatars. Despite recent progress, existing methods still suffer from limitations in speech intelligibility, audio-video synchronization, speech naturalness, and voice similarity to the reference speaker. To address these challenges, we propose AlignDiT, a multimodal Aligned Diffusion Transformer that generates accurate, synchronized, and natural-sounding speech from aligned multimodal inputs. Built upon the in-context learning capability of the DiT architecture, AlignDiT explores three effective strategies to align multimodal representations. Furthermore, we introduce a novel multimodal classifier-free guidance mechanism that allows the model to adaptively balance information from each modality during speech synthesis. Extensive experiments demonstrate that AlignDiT significantly outperforms existing methods across multiple benchmarks in terms of quality, synchronization, and speaker similarity. Moreover, AlignDiT exhibits strong generalization capability across various multimodal tasks, such as video-to-speech synthesis and visual forced alignment, consistently achieving state-of-the-art performance. The demo page is available at this https URL .
zh
[CV-28] EfficientHuman: Efficient Training and Reconstruction of Moving Human using Articulated 2D Gaussian
【速读】:该论文试图解决3D Gaussian Splatting(3DGS)在动态人体重建中因多视角不一致和冗余高斯分布导致的表面平面拟合效果不佳及训练速度缓慢的问题。其解决方案的关键在于提出EfficientHuman模型,通过将高斯点云编码为规范空间中的关节2D高斯表面元(Articulated 2D Gaussian surfels),并利用线性混合皮肤(LBS)将其转换到姿态空间,从而实现高效的姿态变换,使模型能够快速适应动态人体结构,同时保持视角一致的几何特性。
链接: https://arxiv.org/abs/2504.20607
作者: Hao Tian,Rui Liu,Wen Shen,Yilong Hu,Zhihao Zheng,Xiaolin Qin
机构: Chengdu Institute of Computer Applications, Chinese Academy of Sciences (成都计算机应用研究所,中国科学院); Minzu University of China (中国民族大学); University of Chinese Academy of Sciences (中国科学院大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 11 pages, 3 figures
Abstract:3D Gaussian Splatting (3DGS) has been recognized as a pioneering technique in scene reconstruction and novel view synthesis. Recent work on reconstructing the 3D human body using 3DGS attempts to leverage prior information on human pose to enhance rendering quality and improve training speed. However, it struggles to effectively fit dynamic surface planes due to multi-view inconsistency and redundant Gaussians. This inconsistency arises because Gaussian ellipsoids cannot accurately represent the surfaces of dynamic objects, which hinders the rapid reconstruction of the dynamic human body. Meanwhile, the prevalence of redundant Gaussians means that the training time of these works is still not ideal for quickly fitting a dynamic human body. To address these, we propose EfficientHuman, a model that quickly accomplishes the dynamic reconstruction of the human body using Articulated 2D Gaussian while ensuring high rendering quality. The key innovation involves encoding Gaussian splats as Articulated 2D Gaussian surfels in canonical space and then transforming them to pose space via Linear Blend Skinning (LBS) to achieve efficient pose transformations. Unlike 3D Gaussians, Articulated 2D Gaussian surfels can quickly conform to the dynamic human body while ensuring view-consistent geometries. Additionally, we introduce a pose calibration module and an LBS optimization module to achieve precise fitting of dynamic human poses, enhancing the model’s performance. Extensive experiments on the ZJU-MoCap dataset demonstrate that EfficientHuman achieves rapid 3D dynamic human reconstruction in less than a minute on average, which is 20 seconds faster than the current state-of-the-art method, while also reducing the number of redundant Gaussians.
zh
[CV-29] Purifying Labeling and Utilizing: A High-Quality Pipeline for Small Object Detection
【速读】:该论文旨在解决小目标检测中因传统“流水线式”工程流程各阶段孤立优化而导致的整体性能受限的问题。其解决方案的关键在于对检测流程中的三个核心环节——特征净化(Purifying)、标签分配(Labeling)和特征利用(Utilizing)进行全局优化,提出了一种名为PLUSNet的高效小目标检测框架,通过引入Hierarchical Feature Purifier(HFP)、Multiple Criteria Label Assignment(MCLA)和Frequency Decoupled Head(FDHead)模块,实现多尺度场景下小目标检测能力的显著提升。
链接: https://arxiv.org/abs/2504.20602
作者: Siwei Wang,Zhiwei Chen,Liujuan Cao,Rongrong Ji
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Small object detection is a broadly investigated research task and is commonly conceptualized as a “pipeline-style” engineering process. In the upstream, images serve as raw materials for processing in the detection pipeline, where pre-trained models are employed to generate initial feature maps. In the midstream, an assigner selects training positive and negative samples. Subsequently, these samples and features are fed into the downstream for classification and regression. Previous small object detection methods often focused on improving isolated stages of the pipeline, thereby neglecting holistic optimization and consequently constraining overall performance gains. To address this issue, we have optimized three key aspects, namely Purifying, Labeling, and Utilizing, in this pipeline, proposing a high-quality Small object detection framework termed PLUSNet. Specifically, PLUSNet comprises three sequential components: the Hierarchical Feature Purifier (HFP) for purifying upstream features, the Multiple Criteria Label Assignment (MCLA) for improving the quality of midstream training samples, and the Frequency Decoupled Head (FDHead) for more effectively exploiting information to accomplish downstream tasks. The proposed PLUS modules are readily integrable into various object detectors, thus enhancing their detection capabilities in multi-scale scenarios. Extensive experiments demonstrate the proposed PLUSNet consistently achieves significant and consistent improvements across multiple datasets for small object detection.
zh
[CV-30] PartHOI: Part-based Hand-Object Interaction Transfer via Generalized Cylinders
【速读】:该论文旨在解决手-物体交互(Hand-Object Interaction, HOI)数据生成中跨类别手部姿态迁移的难题,现有方法依赖于形状匹配,受限于物体形状和尺寸的差异,难以实现跨类别的有效迁移。其解决方案的关键在于引入基于部件的HOI迁移方法PartHOI,通过广义圆柱体表示参数化物体部件的几何结构,建立鲁棒的部件间几何对应关系,并实现接触点的迁移,从而在目标物体上优化出合适的手部姿态,提升跨类别HOI迁移的效果与质量。
链接: https://arxiv.org/abs/2504.20599
作者: Qiaochu Wang,Chufeng Xiao,Manfred Lau,Hongbo Fu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 14 pages, 12 figures, this paper has been accepted by Computational Visual Media Journal (CVMJ) but has not been published yet
Abstract:Learning-based methods to understand and model hand-object interactions (HOI) require a large amount of high-quality HOI data. One way to create HOI data is to transfer hand poses from a source object to another based on the objects’ geometry. However, current methods for transferring hand poses between objects rely on shape matching, limiting the ability to transfer poses across different categories due to differences in their shapes and sizes. We observe that HOI often involves specific semantic parts of objects, which often have more consistent shapes across categories. In addition, constructing size-invariant correspondences between these parts is important for cross-category transfer. Based on these insights, we introduce a novel method PartHOI for part-based HOI transfer. Using a generalized cylinder representation to parameterize an object parts’ geometry, PartHOI establishes a robust geometric correspondence between object parts, and enables the transfer of contact points. Given the transferred points, we optimize a hand pose to fit the target object well. Qualitative and quantitative results demonstrate that our method can generalize HOI transfers well even for cross-category objects, and produce high-fidelity results that are superior to the existing methods.
zh
[CV-31] Hydra: Marker-Free RGB-D Hand-Eye Calibration
【速读】:该论文旨在解决无需标记的机器人手眼标定问题,其核心挑战在于如何在不依赖外部标记的情况下准确估计相机与机器人末端执行器之间的相对位姿。解决方案的关键在于提出了一种基于RGB-D成像的新型迭代最近点(ICP)算法实现,该算法采用鲁棒的点到平面(PTP)目标函数,并在其上构建于李代数框架下,从而提升了标定的收敛速度和精度。
链接: https://arxiv.org/abs/2504.20584
作者: Martin Huber,Huanyu Tian,Christopher E. Mower,Lucas-Raphael Müller,Sébastien Ourselin,Christos Bergeles,Tom Vercauteren
机构: King’s College London(国王学院); Huawei(华为)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:This work presents an RGB-D imaging-based approach to marker-free hand-eye calibration using a novel implementation of the iterative closest point (ICP) algorithm with a robust point-to-plane (PTP) objective formulated on a Lie algebra. Its applicability is demonstrated through comprehensive experiments using three well known serial manipulators and two RGB-D cameras. With only three randomly chosen robot configurations, our approach achieves approximately 90% successful calibrations, demonstrating 2-3x higher convergence rates to the global optimum compared to both marker-based and marker-free baselines. We also report 2 orders of magnitude faster convergence time (0.8 +/- 0.4 s) for 9 robot configurations over other marker-free methods. Our method exhibits significantly improved accuracy (5 mm in task space) over classical approaches (7 mm in task space) whilst being marker-free. The benchmarking dataset and code are open sourced under Apache 2.0 License, and a ROS 2 integration with robot abstraction is provided to facilitate deployment.
zh
[CV-32] Autoencoder Models for Point Cloud Environmental Synthesis from WiFi Channel State Information: A Preliminary Study
【速读】:该论文试图解决从WiFi信道状态信息(Channel State Information, CSI)数据生成点云的问题,以实现环境的精确重建。其解决方案的关键在于采用两阶段自编码器方法:首先使用带有卷积层的PointNet自编码器进行点云生成,其次利用卷积神经网络(Convolutional Neural Network)自编码器将CSI数据映射到匹配的潜在空间,并通过对齐这些潜在空间实现WiFi数据到环境点云的准确重构。
链接: https://arxiv.org/abs/2504.20541
作者: Daniele Pannone,Danilo Avola
机构: Sapienza University of Rome (罗马第一大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
Abstract:This paper introduces a deep learning framework for generating point clouds from WiFi Channel State Information data. We employ a two-stage autoencoder approach: a PointNet autoencoder with convolutional layers for point cloud generation, and a Convolutional Neural Network autoencoder to map CSI data to a matching latent space. By aligning these latent spaces, our method enables accurate environmental point cloud reconstruction from WiFi data. Experimental results validate the effectiveness of our approach, highlighting its potential for wireless sensing and environmental mapping applications.
zh
[CV-33] Beyond the Horizon: Decoupling UAVs Multi-View Action Recognition via Partial Order Transfer
【速读】:该论文旨在解决无人机(UAV)中动作识别面临的挑战,即由于垂直空间轴上的显著视角变化导致的外观差异问题。现有方法在处理不同高度下的视角变化时表现不佳,因此本文提出了一种新的解决方案——部分序引导的多视角网络(POG-MVNet),其关键在于通过显式建模UAV视角的分层结构来提升跨高度的识别性能。该框架包含三个核心组件:视图划分模块、序感知特征解耦模块和动作部分序引导模块,旨在有效利用不同高度下的视图相关信息,从而改善动作识别效果。
链接: https://arxiv.org/abs/2504.20530
作者: Wenxuan Liu,Xian Zhong,Zhuo Zhou,Siyuan Yang,Chia-Wen Lin,Alex Chichung Kot
机构: Peking University (北京大学); Wuhan University of Technology (武汉理工大学); Nanyang Technological University (南洋理工大学); Wuhan University (武汉大学); National Tsing Hua University (国立清华大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 11 pages
Abstract:Action recognition in unmanned aerial vehicles (UAVs) poses unique challenges due to significant view variations along the vertical spatial axis. Unlike traditional ground-based settings, UAVs capture actions from a wide range of altitudes, resulting in considerable appearance discrepancies. We introduce a multi-view formulation tailored to varying UAV altitudes and empirically observe a partial order among views, where recognition accuracy consistently decreases as the altitude increases. This motivates a novel approach that explicitly models the hierarchical structure of UAV views to improve recognition performance across altitudes. To this end, we propose the Partial Order Guided Multi-View Network (POG-MVNet), designed to address drastic view variations by effectively leveraging view-dependent information across different altitude levels. The framework comprises three key components: a View Partition (VP) module, which uses the head-to-body ratio to group views by altitude; an Order-aware Feature Decoupling (OFD) module, which disentangles action-relevant and view-specific features under partial order guidance; and an Action Partial Order Guide (APOG), which leverages the partial order to transfer informative knowledge from easier views to support learning in more challenging ones. We conduct experiments on Drone-Action, MOD20, and UAV datasets, demonstrating that POG-MVNet significantly outperforms competing methods. For example, POG-MVNet achieves a 4.7% improvement on Drone-Action dataset and a 3.5% improvement on UAV dataset compared to state-of-the-art methods ASAT and FAR. The code for POG-MVNet will be made available soon.
zh
[CV-34] Geometry-aware Temporal Aggregation Network for Monocular 3D Lane Detection
【速读】:该论文旨在解决单目3D车道检测方法中存在的两个问题:预测的3D车道几何信息不准确以及车道完整性难以保持。其解决方案的关键在于充分利用多帧输入,通过引入时序几何一致性增强场景几何感知能力,并通过时序实例感知查询生成策略挖掘更多实例信息,从而提升检测性能。具体而言,提出了Geometry-aware Temporal Aggregation Network (GTA-Net),包含Temporal Geometry Enhancement Module (TGEM)和Temporal Instance-aware Query Generation (TIQG)两个核心模块。
链接: https://arxiv.org/abs/2504.20525
作者: Huan Zheng,Wencheng Han,Tianyi Yan,Cheng-zhong Xu,Jianbing Shen
机构: University of Macau (澳门大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Monocular 3D lane detection aims to estimate 3D position of lanes from frontal-view (FV) images. However, current monocular 3D lane detection methods suffer from two limitations, including inaccurate geometric information of the predicted 3D lanes and difficulties in maintaining lane integrity. To address these issues, we seek to fully exploit the potential of multiple input frames. First, we aim at enhancing the ability to perceive the geometry of scenes by leveraging temporal geometric consistency. Second, we strive to improve the integrity of lanes by revealing more instance information from temporal sequences. Therefore, we propose a novel Geometry-aware Temporal Aggregation Network (GTA-Net) for monocular 3D lane detection. On one hand, we develop the Temporal Geometry Enhancement Module (TGEM), which exploits geometric consistency across successive frames, facilitating effective geometry perception. On the other hand, we present the Temporal Instance-aware Query Generation (TIQG), which strategically incorporates temporal cues into query generation, thereby enabling the exploration of comprehensive instance information. Experiments demonstrate that our GTA-Net achieves SoTA results, surpassing existing monocular 3D lane detection solutions.
zh
[CV-35] Dynamic Attention Analysis for Backdoor Detection in Text-to-Image Diffusion Models
【速读】:该论文试图解决文本到图像扩散模型在面对后门攻击时的检测问题,此类攻击通过植入隐蔽的文本触发器来操纵模型输出。现有后门检测方法主要依赖于后门样本的静态特征,而未能充分利用扩散模型的固有动态特性。该研究的关键在于提出一种新的后门检测视角——动态注意力分析(Dynamic Attention Analysis, DAA),通过分析跨注意力图的动态演化过程,发现后门样本在EOS标记处表现出与正常样本不同的特征演化模式,从而实现更有效的后门检测。
链接: https://arxiv.org/abs/2504.20518
作者: Zhongqi Wang,Jie Zhang,Shiguang Shan,Xilin Chen
机构: Chinese Academy of Sciences (中国科学院); Institute of Computing Technology (计算技术研究所); University of Chinese Academy of Sciences (中国科学院大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Recent studies have revealed that text-to-image diffusion models are vulnerable to backdoor attacks, where attackers implant stealthy textual triggers to manipulate model outputs. Previous backdoor detection methods primarily focus on the static features of backdoor samples. However, a vital property of diffusion models is their inherent dynamism. This study introduces a novel backdoor detection perspective named Dynamic Attention Analysis (DAA), showing that these dynamic characteristics serve as better indicators for backdoor detection. Specifically, by examining the dynamic evolution of cross-attention maps, we observe that backdoor samples exhibit distinct feature evolution patterns at the EOS token compared to benign samples. To quantify these dynamic anomalies, we first introduce DAA-I, which treats the tokens’ attention maps as spatially independent and measures dynamic feature using the Frobenius norm. Furthermore, to better capture the interactions between attention maps and refine the feature, we propose a dynamical system-based approach, referred to as DAA-S. This model formulates the spatial correlations among attention maps using a graph-based state equation and we theoretically analyze the global asymptotic stability of this method. Extensive experiments across five representative backdoor attack scenarios demonstrate that our approach significantly surpasses existing detection methods, achieving an average F1 Score of 79.49% and an AUC of 87.67%. The code is available at this https URL.
zh
[CV-36] SteelBlastQC: Shot-blasted Steel Surface Dataset with Interpretable Detection of Surface Defects IJCNN2025
【速读】:该论文旨在解决喷砂处理后钢材表面质量控制自动化的问题,以提升制造效率和一致性。其关键解决方案是构建了一个包含1654张标注的RGB图像(512x512)的数据集,用于训练计算机视觉模型,并评估了三种分类方法:紧凑型卷积Transformer(CCT)、基于ResNet-50特征提取的支持向量机(SVM)以及卷积自编码器(CAE)。其中,监督学习方法(CCT和SVM)在测试集上达到了95%的分类准确率,CCT利用了基于Transformer的注意力机制,而SVM则提供了计算效率更高的替代方案。此外,研究还通过三种神经网络实现了可解释的决策过程,使工业用户能够直观定位问题区域并理解模型推理逻辑。
链接: https://arxiv.org/abs/2504.20510
作者: Irina Ruzavina,Lisa Sophie Theis,Jesse Lemeer,Rutger de Groen,Leo Ebeling,Andrej Hulak,Jouaria Ali,Guangzhi Tang,Rico Mockel
机构: Maastricht University (马斯特里赫特大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Neural and Evolutionary Computing (cs.NE)
备注: Accepted by IJCNN 2025
Abstract:Automating the quality control of shot-blasted steel surfaces is crucial for improving manufacturing efficiency and consistency. This study presents a dataset of 1654 labeled RGB images (512x512) of steel surfaces, classified as either “ready for paint” or “needs shot-blasting.” The dataset captures real-world surface defects, including discoloration, welding lines, scratches and corrosion, making it well-suited for training computer vision models. Additionally, three classification approaches were evaluated: Compact Convolutional Transformers (CCT), Support Vector Machines (SVM) with ResNet-50 feature extraction, and a Convolutional Autoencoder (CAE). The supervised methods (CCT and SVM) achieve 95% classification accuracy on the test set, with CCT leveraging transformer-based attention mechanisms and SVM offering a computationally efficient alternative. The CAE approach, while less effective, establishes a baseline for unsupervised quality control. We present interpretable decision-making by all three neural networks, allowing industry users to visually pinpoint problematic regions and understand the model’s rationale. By releasing the dataset and baseline codes, this work aims to support further research in defect detection, advance the development of interpretable computer vision models for quality control, and encourage the adoption of automated inspection systems in industrial applications.
zh
[CV-37] MambaMoE: Mixture-of-Spectral-Spatial-Experts State Space Model for Hyperspectral Image Classification
【速读】:该论文旨在解决现有基于Mamba的高光谱图像(HSI)分类方法在处理异质目标时忽视光谱和空间方向特性,导致分类性能受限的问题。其解决方案的关键在于提出MambaMoE框架,该框架通过设计一种基于稀疏专家激活的Mixture of Mamba Expert Block (MoMEB),实现自适应的光谱-空间建模,并引入不确定性引导的修正学习(UGCL)策略,以增强模型对易产生预测歧义区域的关注,从而提升分类性能。
链接: https://arxiv.org/abs/2504.20509
作者: Yichu Xu,Di Wang,Hongzan Jiao,Lefei Zhang,Liangpei Zhang
机构: Wuhan University(武汉大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:The Mamba model has recently demonstrated strong potential in hyperspectral image (HSI) classification, owing to its ability to perform context modeling with linear computational complexity. However, existing Mamba-based methods usually neglect the spectral and spatial directional characteristics related to heterogeneous objects in hyperspectral scenes, leading to limited classification performance. To address these issues, we propose MambaMoE, a novel spectral-spatial mixture-of-experts framework, representing the first MoE-based approach in the HSI classification community. Specifically, we design a Mixture of Mamba Expert Block (MoMEB) that leverages sparse expert activation to enable adaptive spectral-spatial modeling. Furthermore, we introduce an uncertainty-guided corrective learning (UGCL) strategy to encourage the model’s attention toward complex regions prone to prediction ambiguity. Extensive experiments on multiple public HSI benchmarks demonstrate that MambaMoE achieves state-of-the-art performance in both accuracy and efficiency compared to existing advanced approaches, especially for Mamba-based methods. Code will be released.
zh
[CV-38] Style-Adaptive Detection Transformer for Single-Source Domain Generalized Object Detection
【速读】:该论文旨在解决单源域泛化(Single-source Domain Generalization, SDG)中目标检测模型在未见目标域上泛化能力不足的问题。现有方法主要依赖于卷积神经网络(CNN)架构,通过数据增强和特征对齐策略提升鲁棒性,但受限于数据增强的分布覆盖范围,难以有效提升跨所有未见域的泛化性能。为解决这一问题,该论文提出了一种基于检测Transformer(DETR)的强泛化检测器——风格自适应检测Transformer(SA-DETR)。其关键在于引入了领域风格适配器,将未见目标域的风格表示映射到训练域,实现动态风格适应;同时设计了对象感知对比学习模块,通过对象感知门控掩码在空间和语义维度上约束特征聚合,从而实现实例级特征的跨域对比,提升模型的域不变特征提取能力。
链接: https://arxiv.org/abs/2504.20498
作者: Jianhong Han,Yupei Wang,Liang Chen
机构: Beijing Institute of Technology (北京理工大学); Beijing Institute of Technology Chongqing Innovation Center (北京理工大学重庆创新中心); National Key Laboratory for Space-Born Intelligent Information Processing (国家重点实验室空间智能信息处理)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Manuscript submitted to IEEE Transactions on Multimedia
Abstract:Single-source Domain Generalization (SDG) in object detection aims to develop a detector using only data from a source domain that can exhibit strong generalization capability when applied to unseen target domains. Existing methods are built upon CNN-based detectors and primarily improve robustness by employing carefully designed data augmentation strategies integrated with feature alignment techniques. However, data augmentation methods have inherent drawbacks; they are only effective when the augmented sample distribution approximates or covers the unseen scenarios, thus failing to enhance generalization across all unseen domains. Furthermore, while the recent Detection Transformer (DETR) has demonstrated superior generalization capability in domain adaptation tasks due to its efficient global information extraction, its potential in SDG tasks remains unexplored. To this end, we introduce a strong DETR-based detector named the Style-Adaptive Detection Transformer (SA-DETR) for SDG in object detection. Specifically, we present a domain style adapter that projects the style representation of the unseen target domain into the training domain, enabling dynamic style adaptation. Then, we propose an object-aware contrastive learning module to guide the detector in extracting domain-invariant features through contrastive learning. By using object-aware gating masks to constrain feature aggregation in both spatial and semantic dimensions, this module achieves cross-domain contrast of instance-level features, thereby enhancing generalization. Extensive experiments demonstrate the superior performance and generalization capability of SA-DETR across five different weather scenarios. Code is released at this https URL.
zh
[CV-39] Large-scale visual SLAM for in-the-wild videos
【速读】:该论文旨在解决从非结构化、野生视频中实现准确且鲁棒的3D场景重建问题,这一问题对于机器人在新环境中的部署具有重要意义。现有视觉SLAM方法在基准数据集上表现良好,但在处理现实世界视频时面临挑战,如不可控运动、纹理缺失区域和动态物体等。论文提出了一种鲁棒的解决方案,其关键在于结合深度视觉里程计方法,并通过从运动恢复结构自动恢复相机内参、利用预测模型遮蔽动态物体和低约束区域、借助单目深度估计正则化光束法平差以减轻低视差情况下的误差,以及集成位置识别与回环闭合以减少长期漂移并优化内参和位姿估计。
链接: https://arxiv.org/abs/2504.20496
作者: Shuo Sun,Torsten Sattler,Malcolm Mielle,Achim J. Lilienthal,Martin Magnusson
机构: AASS research center, Örebro University, Sweden; Czech Technical University in Prague; Technical University of Munich, Chair: Perception for Intelligent Systems.
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: fix the overview figure
Abstract:Accurate and robust 3D scene reconstruction from casual, in-the-wild videos can significantly simplify robot deployment to new environments. However, reliable camera pose estimation and scene reconstruction from such unconstrained videos remains an open challenge. Existing visual-only SLAM methods perform well on benchmark datasets but struggle with real-world footage which often exhibits uncontrolled motion including rapid rotations and pure forward movements, textureless regions, and dynamic objects. We analyze the limitations of current methods and introduce a robust pipeline designed to improve 3D reconstruction from casual videos. We build upon recent deep visual odometry methods but increase robustness in several ways. Camera intrinsics are automatically recovered from the first few frames using structure-from-motion. Dynamic objects and less-constrained areas are masked with a predictive model. Additionally, we leverage monocular depth estimates to regularize bundle adjustment, mitigating errors in low-parallax situations. Finally, we integrate place recognition and loop closure to reduce long-term drift and refine both intrinsics and pose estimates through global bundle adjustment. We demonstrate large-scale contiguous 3D models from several online videos in various environments. In contrast, baseline methods typically produce locally inconsistent results at several points, producing separate segments or distorted maps. In lieu of ground-truth pose data, we evaluate map consistency, execution time and visual accuracy of re-rendered NeRF models. Our proposed system establishes a new baseline for visual reconstruction from casual uncontrolled videos found online, demonstrating more consistent reconstructions over longer sequences of in-the-wild videos than previously achieved.
zh
[CV-40] Antidote: A Unified Framework for Mitigating LVLM Hallucinations in Counterfactual Presupposition and Object Perception CVPR2025
【速读】:该论文试图解决大型视觉-语言模型(Large Vision-Language Models, LVLMs)在处理反事实预设问题(Counterfactual Presupposition Questions, CPQs)时产生的幻觉问题,即模型容易接受反事实对象的预设并生成严重不实的回答。解决方案的关键在于提出“Antidote”,一个统一的、基于合成数据的后训练框架,通过引入合成数据将事实先验融入问题中以实现自我校正,并将缓解幻觉的过程解耦为偏好优化问题,从而有效减少幻觉并提升模型在CP-Bench等基准上的表现。
链接: https://arxiv.org/abs/2504.20468
作者: Yuanchen Wu,Lu Zhang,Hang Yao,Junlong Du,Ke Yan,Shouhong Ding,Yunsheng Wu,Xiaoqiang Li
机构: Tencent YouTu Lab (腾讯优图实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to CVPR 2025
Abstract:Large Vision-Language Models (LVLMs) have achieved impressive results across various cross-modal tasks. However, hallucinations, i.e., the models generating counterfactual responses, remain a challenge. Though recent studies have attempted to alleviate object perception hallucinations, they focus on the models’ response generation, and overlooking the task question itself. This paper discusses the vulnerability of LVLMs in solving counterfactual presupposition questions (CPQs), where the models are prone to accept the presuppositions of counterfactual objects and produce severe hallucinatory responses. To this end, we introduce “Antidote”, a unified, synthetic data-driven post-training framework for mitigating both types of hallucination above. It leverages synthetic data to incorporate factual priors into questions to achieve self-correction, and decouple the mitigation process into a preference optimization problem. Furthermore, we construct “CP-Bench”, a novel benchmark to evaluate LVLMs’ ability to correctly handle CPQs and produce factual responses. Applied to the LLaVA series, Antidote can simultaneously enhance performance on CP-Bench by over 50%, POPE by 1.8-3.3%, and CHAIR SHR by 30-50%, all without relying on external supervision from stronger LVLMs or human feedback and introducing noticeable catastrophic forgetting issues.
zh
[CV-41] LMM4Gen3DHF: Benchmarking and Evaluating Multimodal 3D Human Face Generation with LMMs
【速读】:该论文旨在解决AI生成的3D人类面部(3D Human Faces, 3DHF)在质量与真实感评估方面的挑战,这一问题由于人类感知的主观性和对面部特征的内在敏感性而尤为复杂。其解决方案的关键在于提出一个大规模基准数据集Gen3DHF以及基于大型多模态模型(Large Multimodal Model, LMM)的评估指标LMME3DHF,该指标能够实现对3DHF的质量和真实性评分预测、畸变感知的视觉问答以及畸变感知显著性预测,从而有效提升评估的准确性与与人类感知的一致性。
链接: https://arxiv.org/abs/2504.20466
作者: Woo Yi Yang,Jiarui Wang,Sijing Wu,Huiyu Duan,Yuxin Zhu,Liu Yang,Kang Fu,Guangtao Zhai,Xiongkuo Min
机构: Shanghai Jiao Tong University (上海交通大学); Institution (机构)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:The rapid advancement in generative artificial intelligence have enabled the creation of 3D human faces (HFs) for applications including media production, virtual reality, security, healthcare, and game development, etc. However, assessing the quality and realism of these AI-generated 3D human faces remains a significant challenge due to the subjective nature of human perception and innate perceptual sensitivity to facial features. To this end, we conduct a comprehensive study on the quality assessment of AI-generated 3D human faces. We first introduce Gen3DHF, a large-scale benchmark comprising 2,000 videos of AI-Generated 3D Human Faces along with 4,000 Mean Opinion Scores (MOS) collected across two dimensions, i.e., quality and authenticity, 2,000 distortion-aware saliency maps and distortion descriptions. Based on Gen3DHF, we propose LMME3DHF, a Large Multimodal Model (LMM)-based metric for Evaluating 3DHF capable of quality and authenticity score prediction, distortion-aware visual question answering, and distortion-aware saliency prediction. Experimental results show that LMME3DHF achieves state-of-the-art performance, surpassing existing methods in both accurately predicting quality scores for AI-generated 3D human faces and effectively identifying distortion-aware salient regions and distortion types, while maintaining strong alignment with human perceptual judgments. Both the Gen3DHF database and the LMME3DHF will be released upon the publication.
zh
[CV-42] PixelHacker: Image Inpainting with Structural and Semantic Consistency
【速读】:该论文旨在解决图像修复(image inpainting)中复杂结构(如纹理、形状、空间关系)和语义(如颜色一致性、对象恢复和逻辑正确性)难以准确重建的问题,这些问题导致现有方法生成结果中出现伪影和不合理的内容。其解决方案的关键在于提出一种名为“潜在类别引导”(latent categories guidance)的简单而有效的修复范式,并构建了一个基于扩散模型的像素级修复模型PixelHacker。该模型通过分别编码前景和背景的潜在表示,并利用线性注意力机制在去噪过程中间歇性注入这些特征,从而提升修复结果在结构和语义上的一致性。
链接: https://arxiv.org/abs/2504.20438
作者: Ziyang Xu,Kangsheng Duan,Xiaolei Shen,Zhifeng Ding,Wenyu Liu,Xiaohu Ruan,Xiaoxin Chen,Xinggang Wang
机构: Huazhong University of Science and Technology (华中科技大学); VIVO AI Lab (VIVO人工智能实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Image inpainting is a fundamental research area between image editing and image generation. Recent state-of-the-art (SOTA) methods have explored novel attention mechanisms, lightweight architectures, and context-aware modeling, demonstrating impressive performance. However, they often struggle with complex structure (e.g., texture, shape, spatial relations) and semantics (e.g., color consistency, object restoration, and logical correctness), leading to artifacts and inappropriate generation. To address this challenge, we design a simple yet effective inpainting paradigm called latent categories guidance, and further propose a diffusion-based model named PixelHacker. Specifically, we first construct a large dataset containing 14 million image-mask pairs by annotating foreground and background (potential 116 and 21 categories, respectively). Then, we encode potential foreground and background representations separately through two fixed-size embeddings, and intermittently inject these features into the denoising process via linear attention. Finally, by pre-training on our dataset and fine-tuning on open-source benchmarks, we obtain PixelHacker. Extensive experiments show that PixelHacker comprehensively outperforms the SOTA on a wide range of datasets (Places2, CelebA-HQ, and FFHQ) and exhibits remarkable consistency in both structure and semantics. Project page at this https URL.
zh
[CV-43] AI Assisted Cervical Cancer Screening for Cytology Samples in Developing Countries
【速读】:该论文旨在解决宫颈癌筛查中传统液基细胞学(Liquid-Based Cytology, LBC)方法存在的劳动强度大、依赖专家病理学家且易出错的问题。其解决方案的关键在于将低成本生物显微镜与简单高效的生成式 AI (Generative AI) 算法相结合,实现全片自动分析。系统通过电动显微镜采集细胞图像,并利用包含图像拼接、细胞分割和分类的 AI 流程进行处理,其中采用轻量级 UNet 模型结合人机协作方式训练分割模型,以及基于 CvT 的分类模型对五种细胞类型进行准确分类,从而提升了宫颈癌筛查的准确性和效率。
链接: https://arxiv.org/abs/2504.20435
作者: Love Panta,Suraj Prasai,Karishma Malla Vaidya,Shyam Shrestha,Suresh Manandhar
机构: Wiseyak(维赛亚克); Paropakar Prasuti Griha Maternity Hospital(帕罗帕卡尔产科格里哈妇产医院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Cervical cancer remains a significant health challenge, with high incidence and mortality rates, particularly in transitioning countries. Conventional Liquid-Based Cytology(LBC) is a labor-intensive process, requires expert pathologists and is highly prone to errors, highlighting the need for more efficient screening methods. This paper introduces an innovative approach that integrates low-cost biological microscopes with our simple and efficient AI algorithms for automated whole-slide analysis. Our system uses a motorized microscope to capture cytology images, which are then processed through an AI pipeline involving image stitching, cell segmentation, and classification. We utilize the lightweight UNet-based model involving human-in-the-loop approach to train our segmentation model with minimal ROIs. CvT-based classification model, trained on the SIPaKMeD dataset, accurately categorizes five cell types. Our framework offers enhanced accuracy and efficiency in cervical cancer screening compared to various state-of-art methods, as demonstrated by different evaluation metrics.
zh
[CV-44] Plant Disease Detection through Multimodal Large Language Models and Convolutional Neural Networks
【速读】:该论文旨在解决农业中作物监测与病害管理的挑战,特别是通过早期检测系统实现植物病害的自动化分类。其解决方案的关键在于将多模态大型语言模型(LLMs),特别是GPT-4o,与卷积神经网络(CNNs)相结合,利用叶部图像进行病害识别。研究通过在不同零样本、少样本和渐进微调场景下评估模型性能,并对比GPT-4o与ResNet-50在不同分辨率和植物种类上的表现,验证了该方法的有效性。结果表明,微调后的GPT-4o在分类准确率上优于ResNet-50,展示了其在提升精准农业系统可扩展性和智能化方面的潜力。
链接: https://arxiv.org/abs/2504.20419
作者: Konstantinos I. Roumeliotis,Ranjan Sapkota,Manoj Karkee,Nikolaos D. Tselikas,Dimitrios K. Nasiopoulos
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Automation in agriculture plays a vital role in addressing challenges related to crop monitoring and disease management, particularly through early detection systems. This study investigates the effectiveness of combining multimodal Large Language Models (LLMs), specifically GPT-4o, with Convolutional Neural Networks (CNNs) for automated plant disease classification using leaf imagery. Leveraging the PlantVillage dataset, we systematically evaluate model performance across zero-shot, few-shot, and progressive fine-tuning scenarios. A comparative analysis between GPT-4o and the widely used ResNet-50 model was conducted across three resolutions (100, 150, and 256 pixels) and two plant species (apple and corn). Results indicate that fine-tuned GPT-4o models achieved slightly better performance compared to the performance of ResNet-50, achieving up to 98.12% classification accuracy on apple leaf images, compared to 96.88% achieved by ResNet-50, with improved generalization and near-zero training loss. However, zero-shot performance of GPT-4o was significantly lower, underscoring the need for minimal training. Additional evaluations on cross-resolution and cross-plant generalization revealed the models’ adaptability and limitations when applied to new domains. The findings highlight the promise of integrating multimodal LLMs into automated disease detection pipelines, enhancing the scalability and intelligence of precision agriculture systems while reducing the dependence on large, labeled datasets and high-resolution sensor infrastructure. Large Language Models, Vision Language Models, LLMs and CNNs, Disease Detection with Vision Language Models, VLMs
zh
[CV-45] GarmentX: Autoregressive Parametric Representations for High-Fidelity 3D Garment Generation
【速读】:该论文试图解决传统服装重建方法在生成3D服装时存在的自相交和物理上不可行的结构问题(self-intersections and physically implausible garment structures),这些问题源于直接预测2D裁剪边缘及其连接性的非约束性方法。解决方案的关键在于提出GarmentX框架,该框架引入了一种与GarmentCode兼容的结构化可编辑参数化表示,确保解码后的缝制图案始终形成有效且可模拟的3D服装,同时允许对服装形状和风格进行直观修改。此外,通过采用掩码自回归模型依次预测服装参数,实现了结构化生成并减少了直接预测裁剪图的一致性问题。
链接: https://arxiv.org/abs/2504.20409
作者: Jingfeng Guo,Jinnan Chen,Weikai Chen,Zhenyu Sun,Lanjiong Li,Baozhu Zhao,Lingting Zhu,Xin Wang,Qi Liu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:This work presents GarmentX, a novel framework for generating diverse, high-fidelity, and wearable 3D garments from a single input image. Traditional garment reconstruction methods directly predict 2D pattern edges and their connectivity, an overly unconstrained approach that often leads to severe self-intersections and physically implausible garment structures. In contrast, GarmentX introduces a structured and editable parametric representation compatible with GarmentCode, ensuring that the decoded sewing patterns always form valid, simulation-ready 3D garments while allowing for intuitive modifications of garment shape and style. To achieve this, we employ a masked autoregressive model that sequentially predicts garment parameters, leveraging autoregressive modeling for structured generation while mitigating inconsistencies in direct pattern prediction. Additionally, we introduce GarmentX dataset, a large-scale dataset of 378,682 garment parameter-image pairs, constructed through an automatic data generation pipeline that synthesizes diverse and high-quality garment images conditioned on parametric garment representations. Through integrating our method with GarmentX dataset, we achieve state-of-the-art performance in geometric fidelity and input image alignment, significantly outperforming prior approaches. We will release GarmentX dataset upon publication.
zh
[CV-46] Creating Your Editable 3D Photorealistic Avatar with Tetrahedron-constrained Gaussian Splatting
【速读】:该论文旨在解决个性化3D虚拟形象编辑中生成视觉吸引力不足的问题,这主要是由于在复杂重建场景下几何与纹理混合优化导致的表示学习不稳定。其解决方案的关键在于提出了一种精心设计的框架,将编辑过程分解为局部空间适应和真实外观学习,并采用混合四面体约束高斯点云(TetGS)作为基础表示,结合四面体网格的可控显式结构与3D高斯点云渲染的高精度能力,通过分阶段优化实现精准的区域定位、几何适应性和逼真渲染。
链接: https://arxiv.org/abs/2504.20403
作者: Hanxi Liu,Yifang Men,Zhouhui Lian
机构: Wangxuan Institute of Computer Technology, Peking University (王选计算机技术研究所,北京大学); Institute for Intelligent Computing, Alibaba Group (智能计算研究院,阿里巴巴集团)
类目: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Personalized 3D avatar editing holds significant promise due to its user-friendliness and availability to applications such as AR/VR and virtual try-ons. Previous studies have explored the feasibility of 3D editing, but often struggle to generate visually pleasing results, possibly due to the unstable representation learning under mixed optimization of geometry and texture in complicated reconstructed scenarios. In this paper, we aim to provide an accessible solution for ordinary users to create their editable 3D avatars with precise region localization, geometric adaptability, and photorealistic renderings. To tackle this challenge, we introduce a meticulously designed framework that decouples the editing process into local spatial adaptation and realistic appearance learning, utilizing a hybrid Tetrahedron-constrained Gaussian Splatting (TetGS) as the underlying representation. TetGS combines the controllable explicit structure of tetrahedral grids with the high-precision rendering capabilities of 3D Gaussian Splatting and is optimized in a progressive manner comprising three stages: 3D avatar instantiation from real-world monocular videos to provide accurate priors for TetGS initialization; localized spatial adaptation with explicitly partitioned tetrahedrons to guide the redistribution of Gaussian kernels; and geometry-based appearance generation with a coarse-to-fine activation strategy. Both qualitative and quantitative experiments demonstrate the effectiveness and superiority of our approach in generating photorealistic 3D editable avatars.
zh
[CV-47] FiLA-Video: Spatio-Temporal Compression for Fine-Grained Long Video Understanding
【速读】:该论文旨在解决视频理解中由于视频数据复杂性和上下文处理限制而导致的长视频理解困难问题。现有方法在视频特征压缩过程中往往无法有效优先提取关键特征,导致冗余帧间信息或计算成本过高。其解决方案的关键在于提出FiLA-Video框架,该框架采用轻量级动态权重多帧融合策略,自适应地将多帧整合为单一表示,同时保留关键视频信息并降低计算成本。此外,通过引入关键帧选择策略和简单的长视频训练数据生成策略,进一步提升了模型在长视频理解任务中的效率与准确性。
链接: https://arxiv.org/abs/2504.20384
作者: Yanan Guo,Wenhui Dong,Jun Song,Shiding Zhu,Xuan Zhang,Hanqing Yang,Yingbo Wang,Yang Du,Xianing Chen,Bo Zheng
机构: Alibaba Group (阿里巴巴集团); University of Science and Technology of China (中国科学技术大学); ZheJiang University (浙江大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 8 pages, 6 figures
Abstract:Recent advancements in video understanding within visual large language models (VLLMs) have led to notable progress. However, the complexity of video data and contextual processing limitations still hinder long-video comprehension. A common approach is video feature compression to reduce token input to large language models, yet many methods either fail to prioritize essential features, leading to redundant inter-frame information, or introduce computationally expensive this http URL address these issues, we propose FiLA(Fine-grained Vision Language Model)-Video, a novel framework that leverages a lightweight dynamic-weight multi-frame fusion strategy, which adaptively integrates multiple frames into a single representation while preserving key video information and reducing computational costs. To enhance frame selection for fusion, we introduce a keyframe selection strategy, effectively identifying informative frames from a larger pool for improved summarization. Additionally, we present a simple yet effective long-video training data generation strategy, boosting model performance without extensive manual annotation. Experimental results demonstrate that FiLA-Video achieves superior efficiency and accuracy in long-video comprehension compared to existing methods.
zh
[CV-48] Neural Stereo Video Compression with Hybrid Disparity Compensation
【速读】:该论文旨在解决立体视频压缩(SVC)中跨视角冗余利用的问题,特别是如何更有效地补偿视差以提升压缩效率。解决方案的关键在于提出一种混合视差补偿(HDC)策略,该策略结合了显式像素位移作为鲁棒先验特征以简化优化过程,并通过隐式跨注意力机制实现后续的对齐操作,从而捕捉更广泛的视差信息。
链接: https://arxiv.org/abs/2504.20383
作者: Shiyin Jiang,Zhenghao Chen,Minghao Han,Xingyu Zhou,Leheng Zhang,Shuhang Gu
机构: University of Electronic Science and Technology of China (中国电子科技大学); The University of Newcastle, Australia (新南威尔士大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注:
Abstract:Disparity compensation represents the primary strategy in stereo video compression (SVC) for exploiting cross-view redundancy. These mechanisms can be broadly categorized into two types: one that employs explicit horizontal shifting, and another that utilizes an implicit cross-attention mechanism to reduce cross-view disparity redundancy. In this work, we propose a hybrid disparity compensation (HDC) strategy that leverages explicit pixel displacement as a robust prior feature to simplify optimization and perform implicit cross-attention mechanisms for subsequent warping operations, thereby capturing a broader range of disparity information. Specifically, HDC first computes a similarity map by fusing the horizontally shifted cross-view features to capture pixel displacement information. This similarity map is then normalized into an “explicit pixel-wise attention score” to perform the cross-attention mechanism, implicitly aligning features from one view to another. Building upon HDC, we introduce a novel end-to-end optimized neural stereo video compression framework, which integrates HDC-based modules into key coding operations, including cross-view feature extraction and reconstruction (HDC-FER) and cross-view entropy modeling (HDC-EM). Extensive experiments on SVC benchmarks, including KITTI 2012, KITTI 2015, and Nagoya, which cover both autonomous driving and general scenes, demonstrate that our framework outperforms both neural and traditional SVC methodologies.
zh
[CV-49] GSFeatLoc: Visual Localization Using Feature Correspondence on 3D Gaussian Splatting
【速读】:该论文试图解决在预计算的3D Gaussian Splatting (3DGS)场景表示下,如何高效且准确地对查询图像进行定位的问题。其解决方案的关键在于利用3DGS生成合成RGBD图像,建立查询图像与合成图像之间的2D-2D对应关系,并通过深度图将这些对应关系提升为2D-3D对应关系,进而求解视角n点(PnP)问题以获得最终的姿态估计。该方法在多个数据集上的实验结果表明,相较于基于光度损失最小化的基线方法,该方法显著降低了推理时间和估计误差,并具有较强的初始姿态估计容错能力。
链接: https://arxiv.org/abs/2504.20379
作者: Jongwon Lee,Timothy Bretl
机构: University of Illinois Urbana-Champaign (伊利诺伊大学厄巴纳-香槟分校)
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注:
Abstract:In this paper, we present a method for localizing a query image with respect to a precomputed 3D Gaussian Splatting (3DGS) scene representation. First, the method uses 3DGS to render a synthetic RGBD image at some initial pose estimate. Second, it establishes 2D-2D correspondences between the query image and this synthetic image. Third, it uses the depth map to lift the 2D-2D correspondences to 2D-3D correspondences and solves a perspective-n-point (PnP) problem to produce a final pose estimate. Results from evaluation across three existing datasets with 38 scenes and over 2,700 test images show that our method significantly reduces both inference time (by over two orders of magnitude, from more than 10 seconds to as fast as 0.1 seconds) and estimation error compared to baseline methods that use photometric loss minimization. Results also show that our method tolerates large errors in the initial pose estimate of up to 55° in rotation and 1.1 units in translation (normalized by scene scale), achieving final pose errors of less than 5° in rotation and 0.05 units in translation on 90% of images from the Synthetic NeRF and Mip-NeRF360 datasets and on 42% of images from the more challenging Tanks and Temples dataset.
zh
[CV-50] Sparse2DGS: Geometry-Prioritized Gaussian Splatting for Surface Reconstruction from Sparse Views CVPR2025
【速读】:该论文旨在解决在稀疏输入视角下进行表面重建时,传统方法因依赖密集视角而难以初始化的问题,以及基于学习的多视图立体视觉(MVS)与高斯散射(Gaussian Splatting)直接结合时由于稀疏视角几何优化病态性导致的次优结果问题。其解决方案的关键在于提出Sparse2DGS,一个由MVS初始化的高斯散射管道,通过引入以几何先验为导向的增强方案,实现在病态条件下的直接且鲁棒的几何学习。
链接: https://arxiv.org/abs/2504.20378
作者: Jiang Wu,Rui Li,Yu Zhu,Rong Guo,Jinqiu Sun,Yanning Zhang
机构: Northwestern Polytechnical University (西北工业大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: CVPR 2025
Abstract:We present a Gaussian Splatting method for surface reconstruction using sparse input views. Previous methods relying on dense views struggle with extremely sparse Structure-from-Motion points for initialization. While learning-based Multi-view Stereo (MVS) provides dense 3D points, directly combining it with Gaussian Splatting leads to suboptimal results due to the ill-posed nature of sparse-view geometric optimization. We propose Sparse2DGS, an MVS-initialized Gaussian Splatting pipeline for complete and accurate reconstruction. Our key insight is to incorporate the geometric-prioritized enhancement schemes, allowing for direct and robust geometric learning under ill-posed conditions. Sparse2DGS outperforms existing methods by notable margins while being 2\times faster than the NeRF-based fine-tuning approach.
zh
[CV-51] Inception: Jailbreak the Memory Mechanism of Text-to-Image Generation Systems
【速读】:该论文旨在解决当前文本到图像生成系统中记忆机制存在的安全漏洞问题,特别是针对越狱攻击(jailbreak attack)的风险。其解决方案的关键在于提出Inception,这是首个针对实际文本到图像生成系统中记忆机制的多轮越狱攻击方法。Inception通过在对话会话的每一轮中逐步嵌入恶意内容,并利用系统检索记忆中的关键信息的机制,实现对目标提示的分段处理与递归策略,从而在不触发安全检测的情况下生成恶意图像。
链接: https://arxiv.org/abs/2504.20376
作者: Shiqian Zhao,Jiayang Liu,Yiming Li,Runyi Hu,Xiaojun Jia,Wenshu Fan,Xinfeng Li,Jie Zhang,Wei Dong,Tianwei Zhang,Luu Anh Tuan
机构: Nanyang Technological University (南洋理工大学); University of Electronic Science and Technology of China (中国电子科技大学); A*STAR (新加坡科技研究局)
类目: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR)
备注: 17 pages, 8 figures
Abstract:Currently, the memory mechanism has been widely and successfully exploited in online text-to-image (T2I) generation systems ( e.g. , DALL \cdot E 3) for alleviating the growing tokenization burden and capturing key information in multi-turn interactions. Despite its practicality, its security analyses have fallen far behind. In this paper, we reveal that this mechanism exacerbates the risk of jailbreak attacks. Different from previous attacks that fuse the unsafe target prompt into one ultimate adversarial prompt, which can be easily detected or may generate non-unsafe images due to under- or over-optimization, we propose Inception, the first multi-turn jailbreak attack against the memory mechanism in real-world text-to-image generation systems. Inception embeds the malice at the inception of the chat session turn by turn, leveraging the mechanism that T2I generation systems retrieve key information in their memory. Specifically, Inception mainly consists of two modules. It first segments the unsafe prompt into chunks, which are subsequently fed to the system in multiple turns, serving as pseudo-gradients for directive optimization. Specifically, we develop a series of segmentation policies that ensure the images generated are semantically consistent with the target prompt. Secondly, after segmentation, to overcome the challenge of the inseparability of minimum unsafe words, we propose recursion, a strategy that makes minimum unsafe words subdivisible. Collectively, segmentation and recursion ensure that all the request prompts are benign but can lead to malicious outcomes. We conduct experiments on the real-world text-to-image generation system ( i.e. , DALL \cdot E 3) to validate the effectiveness of Inception. The results indicate that Inception surpasses the state-of-the-art by a 14% margin in attack success rate.
zh
[CV-52] Fusion: A Test-Time Training-Based Strategy for Multimodal Medical Image Fusion in Surgical Robots
【速读】:该论文旨在解决手术机器人在临床实践中对多模态医学图像处理能力提升的需求,尤其是传统医学图像融合方法在实时性能、细粒度特征提取和边缘保留方面的不足。其解决方案的关键在于提出TTTFusion方法,这是一种基于测试时训练(Test-Time Training, TTT)的图像融合策略,能够在推理过程中动态调整模型参数,从而根据输入图像数据优化参数,实现更高质量的多模态医学图像融合。
链接: https://arxiv.org/abs/2504.20362
作者: Qinhua Xie,Hao Tang
机构: East China Normal University (华东师范大学); Peking University (北京大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:With the increasing use of surgical robots in clinical practice, enhancing their ability to process multimodal medical images has become a key research challenge. Although traditional medical image fusion methods have made progress in improving fusion accuracy, they still face significant challenges in real-time performance, fine-grained feature extraction, and edge this http URL this paper, we introduce TTTFusion, a Test-Time Training (TTT)-based image fusion strategy that dynamically adjusts model parameters during inference to efficiently fuse multimodal medical images. By adapting the model during the test phase, our method optimizes the parameters based on the input image data, leading to improved accuracy and better detail preservation in the fusion this http URL results demonstrate that TTTFusion significantly enhances the fusion quality of multimodal images compared to traditional fusion methods, particularly in fine-grained feature extraction and edge preservation. This approach not only improves image fusion accuracy but also offers a novel technical solution for real-time image processing in surgical robots.
zh
[CV-53] MicarVLMoE: A Modern Gated Cross-Aligned Vision-Language Mixture of Experts Model for Medical Image Captioning and Report Generation IJCNN2025
【速读】:该论文旨在解决医学影像报告(Medical Image Reporting, MIR)中细粒度特征提取、多模态对齐以及跨不同成像类型的泛化能力不足的问题,现有方法通常依赖于基础的Transformer架构,并主要关注胸部X光片。其解决方案的关键在于提出MicarVLMoE,一个具有门控交叉对齐融合的视觉-语言专家混合模型,该模型包含多尺度视觉编码器(MSVE)、多头双分支潜在注意力(MDLA)模块以及调制的专家混合(MoE)解码器,以实现更精确的临床描述生成。
链接: https://arxiv.org/abs/2504.20343
作者: Amaan Izhar,Nurul Japar,Norisma Idris,Ting Dang
机构: Faculty of Computer Science and Information Technology, Universiti Malaya, Kuala Lumpur, Malaysia; School of Computing and Information Systems, The University of Melbourne, Melbourne, Australia
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by IJCNN 2025, 8 pages, 8 figures, 3 tables
Abstract:Medical image reporting (MIR) aims to generate structured clinical descriptions from radiological images. Existing methods struggle with fine-grained feature extraction, multimodal alignment, and generalization across diverse imaging types, often relying on vanilla transformers and focusing primarily on chest X-rays. We propose MicarVLMoE, a vision-language mixture-of-experts model with gated cross-aligned fusion, designed to address these limitations. Our architecture includes: (i) a multiscale vision encoder (MSVE) for capturing anatomical details at varying resolutions, (ii) a multihead dual-branch latent attention (MDLA) module for vision-language alignment through latent bottleneck representations, and (iii) a modulated mixture-of-experts (MoE) decoder for adaptive expert specialization. We extend MIR to CT scans, retinal imaging, MRI scans, and gross pathology images, reporting state-of-the-art results on COVCTR, MMR, PGROSS, and ROCO datasets. Extensive experiments and ablations confirm improved clinical accuracy, cross-modal alignment, and model interpretability. Code is available at this https URL.
zh
[CV-54] A Picture is Worth a Thousand Prompts? Efficacy of Iterative Human-Driven Prompt Refinement in Image Regeneration Tasks
【速读】:该论文试图解决AI生成图像在迭代提示优化过程中,如何通过改进提示来提高再生图像与目标图像之间的相似性问题,以及现有图像相似性度量(ISMs)是否能够准确反映人类对图像相似性的主观判断。解决方案的关键在于通过结构化用户研究验证迭代提示优化对图像相似性的提升效果,并评估ISMs与人类感知的一致性,从而为生成式AI内容创作中的反馈机制提供理论依据和实践指导。
链接: https://arxiv.org/abs/2504.20340
作者: Khoi Trinh,Scott Seidenberger,Raveen Wijewickrama,Murtuza Jadliwala,Anindya Maiti
机构: University of Oklahoma (俄克拉荷马大学); University of Texas at San Antonio (德克萨斯大学圣安东尼奥分校)
类目: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
备注:
Abstract:With AI-generated content becoming ubiquitous across the web, social media, and other digital platforms, it is vital to examine how such content are inspired and generated. The creation of AI-generated images often involves refining the input prompt iteratively to achieve desired visual outcomes. This study focuses on the relatively underexplored concept of image regeneration using AI, in which a human operator attempts to closely recreate a specific target image by iteratively refining their prompt. Image regeneration is distinct from normal image generation, which lacks any predefined visual reference. A separate challenge lies in determining whether existing image similarity metrics (ISMs) can provide reliable, objective feedback in iterative workflows, given that we do not fully understand if subjective human judgments of similarity align with these metrics. Consequently, we must first validate their alignment with human perception before assessing their potential as a feedback mechanism in the iterative prompt refinement process. To address these research gaps, we present a structured user study evaluating how iterative prompt refinement affects the similarity of regenerated images relative to their targets, while also examining whether ISMs capture the same improvements perceived by human observers. Our findings suggest that incremental prompt adjustments substantially improve alignment, verified through both subjective evaluations and quantitative measures, underscoring the broader potential of iterative workflows to enhance generative AI content creation across various application domains.
zh
[CV-55] DRO: Doppler-Aware Direct Radar Odometry
【速读】:该论文旨在解决移动机器人在复杂环境下的位姿估计问题,特别是在缺乏明显几何特征或恶劣天气条件下,传统视觉和激光雷达传感器性能受限的问题。其解决方案的关键在于提出一种基于SE(2)的里程计方法,直接利用毫米波雷达的全部强度信息进行扫描与局部地图的配准,无需提取特征或点云,并通过考虑运动和多普勒畸变来实现连续轨迹估计。此外,当雷达具备特定频率调制模式时,引入基于多普勒的约束以提升速度估计并支持在无特征场景中的里程计。
链接: https://arxiv.org/abs/2504.20339
作者: Cedric Le Gentil,Leonardo Brizi,Daniil Lisus,Xinyuan Qiao,Giorgio Grisetti,Timothy D. Barfoot
机构: University of Toronto Institute for Aerospace Studies (UTIAS); Sapienza University of Rome
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted for presentation at RSS 2025
Abstract:A renaissance in radar-based sensing for mobile robotic applications is underway. Compared to cameras or lidars, millimetre-wave radars have the ability to `see’ through thin walls, vegetation, and adversarial weather conditions such as heavy rain, fog, snow, and dust. In this paper, we propose a novel SE(2) odometry approach for spinning frequency-modulated continuous-wave radars. Our method performs scan-to-local-map registration of the incoming radar data in a direct manner using all the radar intensity information without the need for feature or point cloud extraction. The method performs locally continuous trajectory estimation and accounts for both motion and Doppler distortion of the radar scans. If the radar possesses a specific frequency modulation pattern that makes radial Doppler velocities observable, an additional Doppler-based constraint is formulated to improve the velocity estimate and enable odometry in geometrically feature-deprived scenarios (e.g., featureless tunnels). Our method has been validated on over 250km of on-road data sourced from public datasets (Boreas and MulRan) and collected using our automotive platform. With the aid of a gyroscope, it outperforms state-of-the-art methods and achieves an average relative translation error of 0.26% on the Boreas leaderboard. When using data with the appropriate Doppler-enabling frequency modulation pattern, the translation error is reduced to 0.18% in similar environments. We also benchmarked our algorithm using 1.5 hours of data collected with a mobile robot in off-road environments with various levels of structure to demonstrate its versatility. Our real-time implementation is publicly available: this https URL.
zh
[CV-56] Fine Grain Classification: Connecting Meta using Cross-Contrastive pre-training
【速读】:该论文旨在解决细粒度视觉分类(fine-grained visual classification)中因仅依赖外观信息难以准确区分子类别而带来的识别难题。其解决方案的关键在于提出一种统一框架,通过跨对比预训练(cross-contrastive pre-training)联合学习视觉信息与元信息(meta-information),利用三编码器对图像、文本和元信息进行嵌入对齐,从而提升细粒度识别性能。实验结果表明,该框架在NABirds数据集上显著优于现有基线方法。
链接: https://arxiv.org/abs/2504.20322
作者: Sumit Mamtani,Yash Thesia
机构: New York University (纽约大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 9 pages, 4 figures. Submitted to arXiv
Abstract:Fine-grained visual classification aims to recognize objects belonging to multiple subordinate categories within a super-category. However, this remains a challenging problem, as appearance information alone is often insufficient to accurately differentiate between fine-grained visual categories. To address this, we propose a novel and unified framework that leverages meta-information to assist fine-grained identification. We tackle the joint learning of visual and meta-information through cross-contrastive pre-training. In the first stage, we employ three encoders for images, text, and meta-information, aligning their projected embeddings to achieve better representations. We then fine-tune the image and meta-information encoders for the classification task. Experiments on the NABirds dataset demonstrate that our framework effectively utilizes meta-information to enhance fine-grained recognition performance. With the addition of meta-information, our framework surpasses the current baseline on NABirds by 7.83%. Furthermore, it achieves an accuracy of 84.44% on the NABirds dataset, outperforming many existing state-of-the-art approaches that utilize meta-information.
zh
[CV-57] Dynamic Contextual Attention Network: Transforming Spatial Representations into Adaptive Insights for Endoscopic Polyp Diagnosis
【速读】:该论文旨在解决传统内镜成像在结直肠息肉定位准确性不足以及缺乏全面上下文感知的问题,这些问题限制了诊断的可解释性。其解决方案的关键在于提出动态上下文注意力网络(Dynamic Contextual Attention Network, DCAN),该网络通过注意力机制将空间表征转化为自适应的上下文洞察,从而在不依赖显式定位模块的情况下增强对关键息肉区域的关注,提升了分类过程中的决策可解释性和整体诊断性能。
链接: https://arxiv.org/abs/2504.20306
作者: Teja Krishna Cherukuri,Nagur Shareef Shaik,Sribhuvan Reddy Yellu,Jun-Won Chung,Dong Hye Ye
机构: Georgia State University (佐治亚州立大学); Gachon University (伽罗瓦大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at 47th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC) 2025
Abstract:Colorectal polyps are key indicators for early detection of colorectal cancer. However, traditional endoscopic imaging often struggles with accurate polyp localization and lacks comprehensive contextual awareness, which can limit the explainability of diagnoses. To address these issues, we propose the Dynamic Contextual Attention Network (DCAN). This novel approach transforms spatial representations into adaptive contextual insights, using an attention mechanism that enhances focus on critical polyp regions without explicit localization modules. By integrating contextual awareness into the classification process, DCAN improves decision interpretability and overall diagnostic performance. This advancement in imaging could lead to more reliable colorectal cancer detection, enabling better patient outcomes.
zh
[CV-58] DeepAndes: A Self-Supervised Vision Foundation Model for Multi-Spectral Remote Sensing Imagery of the Andes
【速读】:该论文试图解决在大规模遥感数据中对细粒度考古特征进行标注的挑战,尤其是在多光谱卫星影像(如8波段数据)上的应用问题。传统监督学习方法在处理此类数据时存在局限性,而现有的视觉基础模型大多针对RGB图像设计,难以直接应用于多光谱数据。论文提出的解决方案是开发DeepAndes,一个基于Transformer架构的视觉基础模型,专门针对安第斯地区考古学任务进行训练,其关键在于采用定制化的DINOv2自监督学习算法,优化了对8波段多光谱影像的表征能力,从而提升了考古遥感中的图像理解性能。
链接: https://arxiv.org/abs/2504.20303
作者: Junlin Guo,James R. Zimmer-Dauphinee,Jordan M. Nieusma,Siqi Lu,Quan Liu,Ruining Deng,Can Cui,Jialin Yue,Yizhe Lin,Tianyuan Yao,Juming Xiong,Junchao Zhu,Chongyu Qu,Yuechen Yang,Mitchell Wilkes,Xiao Wang,Parker VanValkenburgh,Steven A. Wernke,Yuankai Huo
机构: Vanderbilt University (范德比尔特大学); Brown University (布朗大学); Oak Ridge National Laboratory (橡树岭国家实验室); Weill Cornell Medicine (威尔康奈尔医学中心)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:By mapping sites at large scales using remotely sensed data, archaeologists can generate unique insights into long-term demographic trends, inter-regional social networks, and past adaptations to climate change. Remote sensing surveys complement field-based approaches, and their reach can be especially great when combined with deep learning and computer vision techniques. However, conventional supervised deep learning methods face challenges in annotating fine-grained archaeological features at scale. While recent vision foundation models have shown remarkable success in learning large-scale remote sensing data with minimal annotations, most off-the-shelf solutions are designed for RGB images rather than multi-spectral satellite imagery, such as the 8-band data used in our study. In this paper, we introduce DeepAndes, a transformer-based vision foundation model trained on three million multi-spectral satellite images, specifically tailored for Andean archaeology. DeepAndes incorporates a customized DINOv2 self-supervised learning algorithm optimized for 8-band multi-spectral imagery, marking the first foundation model designed explicitly for the Andes region. We evaluate its image understanding performance through imbalanced image classification, image instance retrieval, and pixel-level semantic segmentation tasks. Our experiments show that DeepAndes achieves superior F1 scores, mean average precision, and Dice scores in few-shot learning scenarios, significantly outperforming models trained from scratch or pre-trained on smaller datasets. This underscores the effectiveness of large-scale self-supervised pre-training in archaeological remote sensing. Codes will be available on this https URL.
zh
[CV-59] Image Interpolation with Score-based Riemannian Metrics of Diffusion Models
【速读】:该论文试图解决扩散模型在内容生成中虽能隐式学习数据流形,但缺乏有效利用该流形的实际方法的问题,与具备潜在空间的其他深度生成模型不同。解决方案的关键在于将预训练扩散模型的数据空间视为一个由得分函数导出度量的黎曼流形,并基于此几何结构进行图像插值,从而实现更真实、更少噪声且更符合提示的生成结果。
链接: https://arxiv.org/abs/2504.20288
作者: Shinnosuke Saito,Takashi Matsubara
机构: Hokkaido University (北海道大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Diffusion models excel in content generation by implicitly learning the data manifold, yet they lack a practical method to leverage this manifold - unlike other deep generative models equipped with latent spaces. This paper introduces a novel framework that treats the data space of pre-trained diffusion models as a Riemannian manifold, with a metric derived from the score function. Experiments with MNIST and Stable Diffusion show that this geometry-aware approach yields image interpolations that are more realistic, less noisy, and more faithful to prompts than existing methods, demonstrating its potential for improved content generation and editing.
zh
[CV-60] Physics-Informed Diffusion Models for SAR Ship Wake Generation from Text Prompts
【速读】:该论文旨在解决通过合成孔径雷达(SAR)图像检测船舶存在时因标注数据稀缺而导致的监督学习挑战。其解决方案的关键在于利用基于物理的仿真器生成数据,并通过扩散模型进行训练,从而实现更高效且端到端的SAR船舶尾迹模拟。该方法通过将仿真器生成的图像与源自仿真参数的文本提示配对构建训练数据集,最终实现了比传统物理仿真方法更快的推理速度和逼真的Kelvin尾迹生成能力。
链接: https://arxiv.org/abs/2504.20241
作者: Kamirul Kamirul,Odysseas Pappas,Alin Achim
机构: University of Bristol (布里斯托大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 4 pages; Submitted Machine Intelligence for GeoAnalytics and Remote Sensing (MIGARS) - 2025
Abstract:Detecting ship presence via wake signatures in SAR imagery is attracting considerable research interest, but limited annotated data availability poses significant challenges for supervised learning. Physics-based simulations are commonly used to address this data scarcity, although they are slow and constrain end-to-end learning. In this work, we explore a new direction for more efficient and end-to-end SAR ship wake simulation using a diffusion model trained on data generated by a physics-based simulator. The training dataset is built by pairing images produced by the simulator with text prompts derived from simulation parameters. Experimental result show that the model generates realistic Kelvin wake patterns and achieves significantly faster inference than the physics-based simulator. These results highlight the potential of diffusion models for fast and controllable wake image generation, opening new possibilities for end-to-end downstream tasks in maritime SAR analysis.
zh
[CV-61] Improving trajectory continuity in drone-based crowd monitoring using a set of minimal-cost techniques and deep discriminative correlation filters
【速读】:该论文旨在解决无人机基于的群体监控中跟踪连续性和一致性的问题,传统检测-分配跟踪方法在面对误检、漏检和频繁身份切换时导致计数精度下降,难以进行深入分析。其解决方案的关键在于提出一种面向点的在线跟踪算法,该算法在Simple Online and Real-time Tracking (SORT)框架基础上,将原始的边界框分配替换为点距离度量,并引入三种成本效益高的技术:相机运动补偿、高度感知分配和基于分类的轨迹验证。此外,通过集成Deep Discriminative Correlation Filters (DDCF)来提升计算效率并减少噪声,从而改善目标跟踪效果。
链接: https://arxiv.org/abs/2504.20234
作者: Bartosz Ptak,Marek Kraft
机构: Poznań University of Technology (波兹南科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注: Preprint submitted to the Expert Systems with Applications journal
Abstract:Drone-based crowd monitoring is the key technology for applications in surveillance, public safety, and event management. However, maintaining tracking continuity and consistency remains a significant challenge. Traditional detection-assignment tracking methods struggle with false positives, false negatives, and frequent identity switches, leading to degraded counting accuracy and making in-depth analysis impossible. This paper introduces a point-oriented online tracking algorithm that improves trajectory continuity and counting reliability in drone-based crowd monitoring. Our method builds on the Simple Online and Real-time Tracking (SORT) framework, replacing the original bounding-box assignment with a point-distance metric. The algorithm is enhanced with three cost-effective techniques: camera motion compensation, altitude-aware assignment, and classification-based trajectory validation. Further, Deep Discriminative Correlation Filters (DDCF) that re-use spatial feature maps from localisation algorithms for increased computational efficiency through neural network resource sharing are integrated to refine object tracking by reducing noise and handling missed detections. The proposed method is evaluated on the DroneCrowd and newly shared UP-COUNT-TRACK datasets, demonstrating substantial improvements in tracking metrics, reducing counting errors to 23% and 15%, respectively. The results also indicate a significant reduction of identity switches while maintaining high tracking accuracy, outperforming baseline online trackers and even an offline greedy optimisation method.
zh
[CV-62] FreBIS: Frequency-Based Stratification for Neural Implicit Surface Representations CVPR2025
【速读】:该论文旨在解决神经隐式表面表示方法在处理具有多样性和复杂表面的场景时性能受限的问题,这主要是由于现有方法使用单一编码器同时捕获场景中从低频到高频的表面信息,导致信息过载和表达能力不足。其解决方案的关键在于提出一种名为FreBIS的新型神经隐式表面表示方法,该方法通过根据表面频率对场景进行分层,并为每个层次(或一组层次)分配专用编码器,从而实现更高效的信息建模;此外,FreBIS引入了一种冗余感知加权模块,以促进编码特征之间的互补性,提升整体表示能力。
链接: https://arxiv.org/abs/2504.20222
作者: Naoko Sawada,Pedro Miraldo,Suhas Lohit,Tim K. Marks,Moitreya Chatterjee
机构: Mitsubishi Electric Research Laboratories (三菱电机研究所); Mitsubishi Electric Corporation (三菱电机公司)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to CVPR 2025 CV4Metaverse Workshop
Abstract:Neural implicit surface representation techniques are in high demand for advancing technologies in augmented reality/virtual reality, digital twins, autonomous navigation, and many other fields. With their ability to model object surfaces in a scene as a continuous function, such techniques have made remarkable strides recently, especially over classical 3D surface reconstruction methods, such as those that use voxels or point clouds. However, these methods struggle with scenes that have varied and complex surfaces principally because they model any given scene with a single encoder network that is tasked to capture all of low through high-surface frequency information in the scene simultaneously. In this work, we propose a novel, neural implicit surface representation approach called FreBIS to overcome this challenge. FreBIS works by stratifying the scene based on the frequency of surfaces into multiple frequency levels, with each level (or a group of levels) encoded by a dedicated encoder. Moreover, FreBIS encourages these encoders to capture complementary information by promoting mutual dissimilarity of the encoded features via a novel, redundancy-aware weighting module. Empirical evaluations on the challenging BlendedMVS dataset indicate that replacing the standard encoder in an off-the-shelf neural surface reconstruction method with our frequency-stratified encoders yields significant improvements. These enhancements are evident both in the quality of the reconstructed 3D surfaces and in the fidelity of their renderings from any viewpoint.
zh
[CV-63] Remote Sensing Imagery for Flood Detection: Exploration of Augmentation Strategies
【速读】:该论文试图解决河流洪水在RGB影像中的准确检测问题,旨在通过改进训练过程提升深度学习分割网络的性能。其解决方案的关键在于探索不同的数据增强策略,从基础方法到更复杂的光学畸变技术,以识别有效的策略来优化模型训练。
链接: https://arxiv.org/abs/2504.20203
作者: Vladyslav Polushko,Damjan Hatic,Ronald Rösch,Thomas März,Markus Rauhut,Andreas Weinmann
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注:
Abstract:Floods cause serious problems around the world. Responding quickly and effectively requires accurate and timely information about the affected areas. The effective use of Remote Sensing images for accurate flood detection requires specific detection methods. Typically, Deep Neural Networks are employed, which are trained on specific datasets. For the purpose of river flood detection in RGB imagery, we use the BlessemFlood21 dataset. We here explore the use of different augmentation strategies, ranging from basic approaches to more complex techniques, including optical distortion. By identifying effective strategies, we aim to refine the training process of state-of-the-art Deep Learning segmentation networks.
zh
[CV-64] Integration Flow Models
【速读】:该论文旨在解决基于常微分方程(Ordinary Differential Equation, ODE)的生成模型在采样质量受限于数值求解器的离散化误差以及训练不稳定的问题。其解决方案的关键在于提出Integration Flow,该方法直接学习ODE轨迹路径的积分,而非求解ODE函数,并将目标状态\mathbfx_0作为反向时间动力学的锚点状态,从而提升模型的稳定性和准确性。
链接: https://arxiv.org/abs/2504.20179
作者: Jingjing Wang,Dan Zhang,Joshua Luo,Yin Yang,Feng Luo
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Ordinary differential equation (ODE) based generative models have emerged as a powerful approach for producing high-quality samples in many applications. However, the ODE-based methods either suffer the discretization error of numerical solvers of ODE, which restricts the quality of samples when only a few NFEs are used, or struggle with training instability. In this paper, we proposed Integration Flow, which directly learns the integral of ODE-based trajectory paths without solving the ODE functions. Moreover, Integration Flow explicitly incorporates the target state \mathbfx_0 as the anchor state in guiding the reverse-time dynamics. We have theoretically proven this can contribute to both stability and accuracy. To the best of our knowledge, Integration Flow is the first model with a unified structure to estimate ODE-based generative models and the first to show the exact straightness of 1-Rectified Flow without reflow. Through theoretical analysis and empirical evaluations, we show that Integration Flows achieve improved performance when it is applied to existing ODE-based models, such as diffusion models, Rectified Flows, and PFGM++. Specifically, Integration Flow achieves one-step generation on CIFAR10 with FIDs of 2.86 for the Variance Exploding (VE) diffusion model, 3.36 for rectified flow without reflow, and 2.91 for PFGM++; and on ImageNet with FIDs of 4.09 for VE diffusion model, 4.35 for rectified flow without reflow and 4.15 for PFGM++.
zh
[CV-65] A Transformer-based Multimodal Fusion Model for Efficient Crowd Counting Using Visual and Wireless Signals
【速读】:该论文试图解决当前基于单模态输入(如视觉图像或无线信号数据)的群体计数模型存在信息丢失和识别性能不佳的问题。其解决方案的关键在于提出一种基于多模态融合的群体计数模型TransFusion,该模型将信道状态信息(Channel State Information, CSI)与图像数据进行融合,并结合Transformer网络和卷积神经网络(Convolutional Neural Network, CNN),以有效捕捉全局上下文信息并增强对局部细节的提取能力,从而提升群体计数的准确性与效率。
链接: https://arxiv.org/abs/2504.20178
作者: Zhe Cui,Yuli Li,Le-Nam Tran
机构: University College Dublin (都柏林大学学院); Shandong University of Science and Technology (山东科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: This paper was accepted at IEEE WCNC 2025
Abstract:Current crowd-counting models often rely on single-modal inputs, such as visual images or wireless signal data, which can result in significant information loss and suboptimal recognition performance. To address these shortcomings, we propose TransFusion, a novel multimodal fusion-based crowd- counting model that integrates Channel State Information (CSI) with image data. By leveraging the powerful capabilities of Transformer networks, TransFusion effectively combines these two distinct data modalities, enabling the capture of comprehen- sive global contextual information that is critical for accurate crowd estimation. However, while transformers are well capable of capturing global features, they potentially fail to identify finer- grained, local details essential for precise crowd counting. To mitigate this, we incorporate Convolutional Neural Networks (CNNs) into the model architecture, enhancing its ability to extract detailed local features that complement the global context provided by the Transformer. Extensive experimental evaluations demonstrate that TransFusion achieves high accuracy with minimal counting errors while maintaining superior efficiency.
zh
[CV-66] Forging and Removing Latent-Noise Diffusion Watermarks Using a Single Image
【速读】:该论文试图解决扩散模型中水印技术的脆弱性问题,即现有水印方案在面对潜在攻击时可能被移除或伪造的风险。解决方案的关键在于利用图像与初始噪声之间的多对一映射关系,通过引入扰动使干净图像进入包含水印的区域,从而实现水印的伪造或移除。这一方法无需访问扩散模型的权重,仅需一个带水印的示例即可完成攻击。
链接: https://arxiv.org/abs/2504.20111
作者: Anubhav Jain,Yuya Kobayashi,Naoki Murata,Yuhta Takida,Takashi Shibuya,Yuki Mitsufuji,Niv Cohen,Nasir Memon,Julian Togelius
机构: New York University (纽约大学); Sony AI (索尼人工智能); Sony Group Corporation (索尼集团)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Watermarking techniques are vital for protecting intellectual property and preventing fraudulent use of media. Most previous watermarking schemes designed for diffusion models embed a secret key in the initial noise. The resulting pattern is often considered hard to remove and forge into unrelated images. In this paper, we propose a black-box adversarial attack without presuming access to the diffusion model weights. Our attack uses only a single watermarked example and is based on a simple observation: there is a many-to-one mapping between images and initial noises. There are regions in the clean image latent space pertaining to each watermark that get mapped to the same initial noise when inverted. Based on this intuition, we propose an adversarial attack to forge the watermark by introducing perturbations to the images such that we can enter the region of watermarked images. We show that we can also apply a similar approach for watermark removal by learning perturbations to exit this region. We report results on multiple watermarking schemes (Tree-Ring, RingID, WIND, and Gaussian Shading) across two diffusion models (SDv1.4 and SDv2.0). Our results demonstrate the effectiveness of the attack and expose vulnerabilities in the watermarking methods, motivating future research on improving them.
zh
[CV-67] An on-production high-resolution longitudinal neonatal fingerprint database in Brazil
【速读】:该论文旨在解决新生儿生物特征识别系统在早期发育过程中因生理变化(如手指生长、体重变化和皮肤纹理改变)导致的识别准确性不足的问题。其关键解决方案是构建一个高质量的新生儿指纹生物特征数据库,该数据库在多个生命早期阶段采集数据,以支持机器学习模型的训练与评估,从而更准确地模拟生长对生物特征的影响,提升深度学习模型在预测细节点图变化方面的性能。
链接: https://arxiv.org/abs/2504.20104
作者: Luiz F. P. Southier,Marcelo Filipak,Luiz A. Zanlorensi,Ildefonso Wasilevski,Fabio Favarim,Jefferson T. Oliva,Marcelo Teixeira,Dalcimar Casanova
机构: Federal University of Technology - Parana (巴西帕拉纳联邦技术大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:The neonatal period is critical for survival, requiring accurate and early identification to enable timely interventions such as vaccinations, HIV treatment, and nutrition programs. Biometric solutions offer potential for child protection by helping to prevent baby swaps, locate missing children, and support national identity systems. However, developing effective biometric identification systems for newborns remains a major challenge due to the physiological variability caused by finger growth, weight changes, and skin texture alterations during early development. Current literature has attempted to address these issues by applying scaling factors to emulate growth-induced distortions in minutiae maps, but such approaches fail to capture the complex and non-linear growth patterns of infants. A key barrier to progress in this domain is the lack of comprehensive, longitudinal biometric datasets capturing the evolution of neonatal fingerprints over time. This study addresses this gap by focusing on designing and developing a high-quality biometric database of neonatal fingerprints, acquired at multiple early life stages. The dataset is intended to support the training and evaluation of machine learning models aimed at emulating the effects of growth on biometric features. We hypothesize that such a dataset will enable the development of more robust and accurate Deep Learning-based models, capable of predicting changes in the minutiae map with higher fidelity than conventional scaling-based methods. Ultimately, this effort lays the groundwork for more reliable biometric identification systems tailored to the unique developmental trajectory of newborns.
zh
[CV-68] Long-Distance Field Demonstration of Imaging-Free Drone Identification in Intracity Environments
【速读】:该论文旨在解决远距离小目标(如无人机)检测的问题,传统成像方法受限于距离、功耗和成本,而数据驱动的单光子单像素光探测与测距(\textD\textsuperscript2SP\textsuperscript2-LiDAR)虽能降低系统复杂度和成本,但其检测范围通常仅限于数百米。论文提出的解决方案关键在于将残差神经网络(ResNet)与\textD\textsuperscript2SP\textsuperscript2-LiDAR相结合,并引入优化的观测模型,从而将检测范围扩展至城市内部环境中的5~\si\kilo\meter,并实现高精度的无人机姿态和类型识别。
链接: https://arxiv.org/abs/2504.20097
作者: Junran Guo,Tonglin Mu,Keyuan Li,Jianing Li,Ziyang Luo,Ye Chen,Xiaodong Fan,Jinquan Huang,Minjie Liu,Jinbei Zhang,Ruoyang Qi,Naiting Gu,Shihai Sun
机构: Sun Yat-sen University (中山大学); China Academy of Space Technology (中国空间技术研究院); National University of Defense Technology (国防科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Quantum Physics (quant-ph)
备注: 15 pages, 9 figures
Abstract:Detecting small objects, such as drones, over long distances presents a significant challenge with broad implications for security, surveillance, environmental monitoring, and autonomous systems. Traditional imaging-based methods rely on high-resolution image acquisition, but are often constrained by range, power consumption, and cost. In contrast, data-driven single-photon-single-pixel light detection and ranging (\textD\textsuperscript2SP\textsuperscript2-LiDAR) provides an imaging-free alternative, directly enabling target identification while reducing system complexity and cost. However, its detection range has been limited to a few hundred meters. Here, we introduce a novel integration of residual neural networks (ResNet) with \textD\textsuperscript2SP\textsuperscript2-LiDAR, incorporating a refined observation model to extend the detection range to 5~\si\kilo\meter in an intracity environment while enabling high-accuracy identification of drone poses and types. Experimental results demonstrate that our approach not only outperforms conventional imaging-based recognition systems, but also achieves 94.93% pose identification accuracy and 97.99% type classification accuracy, even under weak signal conditions with long distances and low signal-to-noise ratios (SNRs). These findings highlight the potential of imaging-free methods for robust long-range detection of small targets in real-world scenarios.
zh
[CV-69] VideoMultiAgents : A Multi-Agent Framework for Video Question Answering
【速读】:该论文旨在解决视频问答(Video Question Answering, VQA)中因依赖帧级描述而难以充分捕捉时间与交互上下文的问题。其解决方案的关键在于引入VideoMultiAgents框架,该框架集成专门的视觉、场景图分析和文本处理代理,通过独立运作的代理实现互补的多模态推理,从而提升视频理解能力。此外,该方法还结合了基于问题引导的描述生成技术,以提高答案的准确性。
链接: https://arxiv.org/abs/2504.20091
作者: Noriyuki Kugo,Xiang Li,Zixin Li,Ashish Gupta,Arpandeep Khatua,Nidhish Jain,Chaitanya Patel,Yuta Kyuragi,Masamoto Tanabiki,Kazuki Kozuka,Ehsan Adeli
机构: Panasonic Connect Co., Ltd.(松下通信); Stanford University (斯坦福大学); Panasonic R&D Company of America(松下美国研发公司); Panasonic Holdings Corporation(松下控股公司)
类目: Computer Vision and Pattern Recognition (cs.CV); Multiagent Systems (cs.MA)
备注:
Abstract:Video Question Answering (VQA) inherently relies on multimodal reasoning, integrating visual, temporal, and linguistic cues to achieve a deeper understanding of video content. However, many existing methods rely on feeding frame-level captions into a single model, making it difficult to adequately capture temporal and interactive contexts. To address this limitation, we introduce VideoMultiAgents, a framework that integrates specialized agents for vision, scene graph analysis, and text processing. It enhances video understanding leveraging complementary multimodal reasoning from independently operating agents. Our approach is also supplemented with a question-guided caption generation, which produces captions that highlight objects, actions, and temporal transitions directly relevant to a given query, thus improving the answer accuracy. Experimental results demonstrate that our method achieves state-of-the-art performance on Intent-QA (79.0%, +6.2% over previous SOTA), EgoSchema subset (75.4%, +3.4%), and NExT-QA (79.6%, +0.4%).
zh
[CV-70] Edge-Based Learning for Improved Classification Under Adversarial Noise
【速读】:该论文试图解决对抗噪声(adversarial noise)对深度学习模型图像分类任务造成的误分类问题,特别是针对快速梯度符号法(Fast Gradient Sign Method, FGSM)生成的对抗噪声的影响。其解决方案的关键在于利用图像的边缘特征(edge features)进行训练,以提高模型对对抗扰动的鲁棒性。研究发现,尽管对抗噪声会显著干扰非边缘区域,但边缘结构在扰动下相对稳定,能够为分类提供关键的结构性信息,因此基于边缘特征的模型在面对对抗攻击时表现出更高的抗干扰能力。
链接: https://arxiv.org/abs/2504.20077
作者: Manish Kansana,Keyan Alexander Rahimi,Elias Hossain,Iman Dehzangi,Noorbakhsh Amiri Golilarz
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Adversarial noise introduces small perturbations in images, misleading deep learning models into misclassification and significantly impacting recognition accuracy. In this study, we analyzed the effects of Fast Gradient Sign Method (FGSM) adversarial noise on image classification and investigated whether training on specific image features can improve robustness. We hypothesize that while adversarial noise perturbs various regions of an image, edges may remain relatively stable and provide essential structural information for classification. To test this, we conducted a series of experiments using brain tumor and COVID datasets. Initially, we trained the models on clean images and then introduced subtle adversarial perturbations, which caused deep learning models to significantly misclassify the images. Retraining on a combination of clean and noisy images led to improved performance. To evaluate the robustness of the edge features, we extracted edges from the original/clean images and trained the models exclusively on edge-based representations. When noise was introduced to the images, the edge-based models demonstrated greater resilience to adversarial attacks compared to those trained on the original or clean images. These results suggest that while adversarial noise is able to exploit complex non-edge regions significantly more than edges, the improvement in the accuracy after retraining is marginally more in the original data as compared to the edges. Thus, leveraging edge-based learning can improve the resilience of deep learning models against adversarial perturbations.
zh
[CV-71] Marmot: Multi-Agent Reasoning for Multi-Object Self-Correcting in Improving Image-Text Alignment
【速读】:该论文旨在解决扩散模型在复杂多对象场景中准确计数、属性分配及空间关系建模方面的不足。其解决方案的关键在于提出Marmot框架,该框架采用多智能体推理机制实现多对象自校正,通过分解自校正任务为计数、属性和空间关系三个关键维度,并进一步细分为对象级子任务,结合决策-执行-验证机制以减少对象间干扰并提升编辑可靠性,同时引入像素域拼接平滑器实现子任务结果的高效融合与多阶段失真消除。
链接: https://arxiv.org/abs/2504.20054
作者: Jiayang Sun,Hongbo Wang,Jie Cao,Huaibo Huang,Ran He
机构: MAIS & NLPR, Institute of Automation, Chinese Academy of Sciences, Beijing, China; School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing, China
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:While diffusion models excel at generating high-quality images, they often struggle with accurate counting, attributes, and spatial relationships in complex multi-object scenes. To address these challenges, we propose Marmot, a novel and generalizable framework that employs Multi-Agent Reasoning for Multi-Object Self-Correcting, enhancing image-text alignment and facilitating more coherent multi-object image editing. Our framework adopts a divide-and-conquer strategy that decomposes the self-correction task into three critical dimensions (counting, attributes, and spatial relationships), and further divided into object-level subtasks. We construct a multi-agent editing system featuring a decision-execution-verification mechanism, effectively mitigating inter-object interference and enhancing editing reliability. To resolve the problem of subtask integration, we propose a Pixel-Domain Stitching Smoother that employs mask-guided two-stage latent space optimization. This innovation enables parallel processing of subtask results, thereby enhancing runtime efficiency while eliminating multi-stage distortion accumulation. Extensive experiments demonstrate that Marmot significantly improves accuracy in object counting, attribute assignment, and spatial relationships for image generation tasks.
zh
[CV-72] Can Geometry Save Central Views for Sports Field Registration?
【速读】:该论文试图解决单帧体育场地注册(sports field registration)中的挑战,特别是在摄像机视角靠近场地中心区域时,由于场地标记物分布稀疏且不均匀,现有方法难以有效利用圆和点等非线性特征进行准确注册的问题。解决方案的关键在于提出一种新颖的方法,通过从圆对应关系中推导出点和线,从而将圆对应关系纳入到线性方程组中,实现对体育场地的精确注册与图像标注。
链接: https://arxiv.org/abs/2504.20052
作者: Floriane Magera,Thomas Hoyoux,Martin Castin,Olivier Barnich,Anthony Cioppa,Marc Van Droogenbroeck
机构: EVS Broadcast Equipment; University of Liège (列日大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 10 pages, 10 figures, 1 table, 40 references
Abstract:Single-frame sports field registration often serves as the foundation for extracting 3D information from broadcast videos, enabling applications related to sports analytics, refereeing, or fan engagement. As sports fields have rigorous specifications in terms of shape and dimensions of their line, circle and point components, sports field markings are commonly used as calibration targets for this task. However, because of the sparse and uneven distribution of field markings, close-up camera views around central areas of the field often depict only line and circle markings. On these views, sports field registration is challenging for the vast majority of existing methods, as they focus on leveraging line field markings and their intersections. It is indeed a challenge to include circle correspondences in a set of linear equations. In this work, we propose a novel method to derive a set of points and lines from circle correspondences, enabling the exploitation of circle correspondences for both sports field registration and image annotation. In our experiments, we illustrate the benefits of our bottom-up geometric method against top-performing detectors and show that our method successfully complements them, enabling sports field registration in difficult scenarios.
zh
[CV-73] SAM-Guided Robust Representation Learning for One-Shot 3D Medical Image Segmentation
【速读】:该论文旨在解决单样本医学图像分割(one-shot medical image segmentation, MIS)中因依赖人工标注和计算成本高昂而带来的挑战。其关键解决方案是提出一种基于SAM的鲁棒表征学习框架RRL-MedSAM,通过双阶段知识蒸馏(DSKD)策略和互斥指数移动平均(mutual-EMA)机制,将SAM的通用特征提取能力迁移至轻量级编码器,并结合自提示分割解码器(auto-prompting, AP)提升分割性能,从而实现高效且准确的3D医学图像分割。
链接: https://arxiv.org/abs/2504.20501
作者: Jia Wang,Yunan Mei,Jiarui Liu,Xin Fan
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:One-shot medical image segmentation (MIS) is crucial for medical analysis due to the burden of medical experts on manual annotation. The recent emergence of the segment anything model (SAM) has demonstrated remarkable adaptation in MIS but cannot be directly applied to one-shot medical image segmentation (MIS) due to its reliance on labor-intensive user interactions and the high computational cost. To cope with these limitations, we propose a novel SAM-guided robust representation learning framework, named RRL-MedSAM, to adapt SAM to one-shot 3D MIS, which exploits the strong generalization capabilities of the SAM encoder to learn better feature representation. We devise a dual-stage knowledge distillation (DSKD) strategy to distill general knowledge between natural and medical images from the foundation model to train a lightweight encoder, and then adopt a mutual exponential moving average (mutual-EMA) to update the weights of the general lightweight encoder and medical-specific encoder. Specifically, pseudo labels from the registration network are used to perform mutual supervision for such two encoders. Moreover, we introduce an auto-prompting (AP) segmentation decoder which adopts the mask generated from the general lightweight model as a prompt to assist the medical-specific model in boosting the final segmentation performance. Extensive experiments conducted on three public datasets, i.e., OASIS, CT-lung demonstrate that the proposed RRL-MedSAM outperforms state-of-the-art one-shot MIS methods for both segmentation and registration tasks. Especially, our lightweight encoder uses only 3% of the parameters compared to the encoder of SAM-Base.
zh
[CV-74] LymphAtlas- A Unified Multimodal Lymphoma Imaging Repository Delivering AI-Enhanced Diagnostic Insight
【速读】:该论文旨在解决血液系统恶性肿瘤领域中标准化多模态分割数据集缺乏的问题(standardised multimodal segmentation datasets),通过整合正电子发射断层扫描(PET)的代谢信息与计算机断层扫描(CT)的解剖结构,建立基于全身氟脱氧葡萄糖(FDG)PET/CT检查的3D多模态分割数据集。其解决方案的关键在于系统性地收集和处理高质量的临床数据,确保完整的3D结构信息在数据采集、预处理和标注过程中得以保留,并基于nnUNet格式构建高质数据集,同时通过技术验证和评估确保预处理流程、标注质量和自动分割算法的可靠性,从而为深度学习模型提供稳定且准确的训练基础。
链接: https://arxiv.org/abs/2504.20454
作者: Jiajun Ding,Beiyao Zhu,Xiaosheng Liu,Lishen Zhang,Zhao Liu
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注: 17pages,4 figures
Abstract:This study integrates PET metabolic information with CT anatomical structures to establish a 3D multimodal segmentation dataset for lymphoma based on whole-body FDG PET/CT examinations, which bridges the gap of the lack of standardised multimodal segmentation datasets in the field of haematological malignancies. We retrospectively collected 483 examination datasets acquired between March 2011 and May 2024, involving 220 patients (106 non-Hodgkin lymphoma, 42 Hodgkin lymphoma); all data underwent ethical review and were rigorously de-identified. Complete 3D structural information was preserved during data acquisition, preprocessing and annotation, and a high-quality dataset was constructed based on the nnUNet format. By systematic technical validation and evaluation of the preprocessing process, annotation quality and automatic segmentation algorithm, the deep learning model trained based on this dataset is verified to achieve accurate segmentation of lymphoma lesions in PET/CT images with high accuracy, good robustness and reproducibility, which proves the applicability and stability of this dataset in accurate segmentation and quantitative analysis. The deep fusion of PET/CT images achieved with this dataset not only significantly improves the accurate portrayal of the morphology, location and metabolic features of tumour lesions, but also provides solid data support for early diagnosis, clinical staging and personalized treatment, and promotes the development of automated image segmentation and precision medicine based on deep learning. The dataset and related resources are available at this https URL.
zh
[CV-75] SCOPE-MRI: Bankart Lesion Detection as a Case Study in Data Curation and Deep Learning for Challenging Diagnoses
【速读】:该论文试图解决在骨科影像中对复杂病灶(如Bankart损伤)的检测问题,这类病灶由于影像特征细微,传统诊断方法依赖于侵入性MRI关节造影(MRA),而常规MRI的诊断效果有限。解决方案的关键在于构建了ScopeMRI数据集,这是首个公开的、由专家标注的肩部病理数据集,并开发了一个结合卷积神经网络(CNN)和Transformer的深度学习(DL)框架,通过多平面视图的预测融合来优化性能,从而在非侵入性标准MRI上实现了与放射科医生相当的诊断水平。
链接: https://arxiv.org/abs/2504.20405
作者: Sahil Sethi,Sai Reddy,Mansi Sakarvadia,Jordan Serotte,Darlington Nwaudo,Nicholas Maassen,Lewis Shi
机构: Pritzker School of Medicine, University of Chicago(普利兹克医学院,芝加哥大学); Department of Computer Science, University of Chicago(计算机科学系,芝加哥大学); Department of Orthopaedic Surgery & Rehabilitation Medicine, UChicago Medicine(骨科手术与康复医学系,芝加哥大学医学中心)
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
Abstract:While deep learning has shown strong performance in musculoskeletal imaging, existing work has largely focused on pathologies where diagnosis is not a clinical challenge, leaving more difficult problems underexplored, such as detecting Bankart lesions (anterior-inferior glenoid labral tears) on standard MRIs. Diagnosing these lesions is challenging due to their subtle imaging features, often leading to reliance on invasive MRI arthrograms (MRAs). This study introduces ScopeMRI, the first publicly available, expert-annotated dataset for shoulder pathologies, and presents a deep learning (DL) framework for detecting Bankart lesions on both standard MRIs and MRAs. ScopeMRI includes 586 shoulder MRIs (335 standard, 251 MRAs) from 558 patients who underwent arthroscopy. Ground truth labels were derived from intraoperative findings, the gold standard for diagnosis. Separate DL models for MRAs and standard MRIs were trained using a combination of CNNs and transformers. Predictions from sagittal, axial, and coronal views were ensembled to optimize performance. The models were evaluated on a 20% hold-out test set (117 MRIs: 46 MRAs, 71 standard MRIs). The models achieved an AUC of 0.91 and 0.93, sensitivity of 83% and 94%, and specificity of 91% and 86% for standard MRIs and MRAs, respectively. Notably, model performance on non-invasive standard MRIs matched or surpassed radiologists interpreting MRAs. External validation demonstrated initial generalizability across imaging protocols. This study demonstrates that DL models can achieve radiologist-level diagnostic performance on standard MRIs, reducing the need for invasive MRAs. By releasing ScopeMRI and a modular codebase for training and evaluating deep learning models on 3D medical imaging data, we aim to accelerate research in musculoskeletal imaging and support the development of new datasets for clinically challenging diagnostic tasks.
zh
人工智能
[AI-0] oward Efficient Exploration by Large Language Model Agents
【速读】:该论文试图解决在基于大型语言模型(Large Language Models, LLMs)的强化学习(Reinforcement Learning, RL)代理中实现数据高效探索的问题。其关键解决方案是利用LLM显式实现一种已知的、具有统计高效探索能力的RL算法——后验采样强化学习(Posterior Sampling for Reinforcement Learning),而非依赖微调或上下文学习来隐式模仿RL算法。
链接: https://arxiv.org/abs/2504.20997
作者: Dilip Arumugam,Thomas L. Griffiths
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:A burgeoning area within reinforcement learning (RL) is the design of sequential decision-making agents centered around large language models (LLMs). While autonomous decision-making agents powered by modern LLMs could facilitate numerous real-world applications, such successes demand agents that are capable of data-efficient RL. One key obstacle to achieving data efficiency in RL is exploration, a challenge that we demonstrate many recent proposals for LLM agent designs struggle to contend with. Meanwhile, classic algorithms from the RL literature known to gracefully address exploration require technical machinery that can be challenging to operationalize in purely natural language settings. In this work, rather than relying on finetuning or in-context learning to coax LLMs into implicitly imitating a RL algorithm, we illustrate how LLMs can be used to explicitly implement an existing RL algorithm (Posterior Sampling for Reinforcement Learning) whose capacity for statistically-efficient exploration is already well-studied. We offer empirical results demonstrating how our LLM-based implementation of a known, data-efficient RL algorithm can be considerably more effective in natural language tasks that demand prudent exploration.
zh
[AI-1] Hubs and Spokes Learning: Efficient and Scalable Collaborative Machine Learning
【速读】:该论文试图解决联邦学习(Federated Learning, FL)中存在单点故障问题以及去中心化学习(Decentralized Learning, P2PL)框架在通信效率和性能上的不足。其解决方案的关键在于提出一种两层通信结构的协同机器学习框架——Hubs and Spokes Learning (HSL),该框架结合了FL和P2PL的优势,在相同通信预算下表现出更高的性能,并且在显著更低的通信预算下仍能与当前最先进的P2PL框架Epidemic Learning Local (ELL)相媲美。
链接: https://arxiv.org/abs/2504.20988
作者: Atul Sharma,Kavindu Herath,Saurabh Bagchi,Chaoyue Liu,Somali Chaterji
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
备注:
Abstract:We introduce the Hubs and Spokes Learning (HSL) framework, a novel paradigm for collaborative machine learning that combines the strengths of Federated Learning (FL) and Decentralized Learning (P2PL). HSL employs a two-tier communication structure that avoids the single point of failure inherent in FL and outperforms the state-of-the-art P2PL framework, Epidemic Learning Local (ELL). At equal communication budgets (total edges), HSL achieves higher performance than ELL, while at significantly lower communication budgets, it can match ELL’s performance. For instance, with only 400 edges, HSL reaches the same test accuracy that ELL achieves with 1000 edges for 100 peers (spokes) on CIFAR-10, demonstrating its suitability for resource-constrained systems. HSL also achieves stronger consensus among nodes after mixing, resulting in improved performance with fewer training rounds. We substantiate these claims through rigorous theoretical analyses and extensive experimental results, showcasing HSL’s practicality for large-scale collaborative learning.
zh
[AI-2] LTLf Adaptive Synthesis for Multi-Tier Goals in Nondeterministic Domains
【速读】:该论文试图解决在非确定性规划领域中,如何合成适应性策略以实现多层级目标的问题,这些目标由多个逐渐增加难度的LTLf(Linear Temporal Logic over finite traces)目标组成。解决方案的关键在于提出一种基于博弈论的技术,该技术能够动态地在策略执行过程中尽可能满足多层级目标,并利用环境可能的合作来实现剩余目标,其计算过程是声音且完整的,并且在目标数量上具有二次复杂度,相较于标准的LTLf合成仅需少量额外开销。
链接: https://arxiv.org/abs/2504.20983
作者: Giuseppe De Giacomo,Gianmarco Parretti,Shufang Zhu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:We study a variant of LTLf synthesis that synthesizes adaptive strategies for achieving a multi-tier goal, consisting of multiple increasingly challenging LTLf objectives in nondeterministic planning domains. Adaptive strategies are strategies that at any point of their execution (i) enforce the satisfaction of as many objectives as possible in the multi-tier goal, and (ii) exploit possible cooperation from the environment to satisfy as many as possible of the remaining ones. This happens dynamically: if the environment cooperates (ii) and an objective becomes enforceable (i), then our strategies will enforce it. We provide a game-theoretic technique to compute adaptive strategies that is sound and complete. Notably, our technique is polynomial, in fact quadratic, in the number of objectives. In other words, it handles multi-tier goals with only a minor overhead compared to standard LTLf synthesis.
zh
[AI-3] Jekyll-and-Hyde Tipping Point in an AIs Behavior
【速读】:该论文试图解决当前大型语言模型(Large Language Model, LLM)在输出过程中可能出现的“人格切换”(Jekyll-and-Hyde tipping point)问题,即模型在响应过程中突然变得错误、误导、无关或危险的现象,这种不确定性已对公众信任造成严重影响。论文提出的解决方案之关键是从基本原理出发,推导出一个精确的数学公式,用于预测该切换点的发生条件。该公式表明,模型注意力分散到极低水平时会导致突然的性能崩溃,且仅需中学数学知识即可理解。该公式为通过调整提示词和训练方式来延迟或防止切换点提供了定量依据,从而为政策制定者和公众提供讨论AI更广泛应用与风险的坚实基础。
链接: https://arxiv.org/abs/2504.20980
作者: Neil F. Johnson,Frank Yingjie Huo
机构: 未知
类目: Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Adaptation and Self-Organizing Systems (nlin.AO); Computational Physics (physics.comp-ph); Physics and Society (physics.soc-ph)
备注:
Abstract:Trust in AI is undermined by the fact that there is no science that predicts – or that can explain to the public – when an LLM’s output (e.g. ChatGPT) is likely to tip mid-response to become wrong, misleading, irrelevant or dangerous. With deaths and trauma already being blamed on LLMs, this uncertainty is even pushing people to treat their ‘pet’ LLM more politely to ‘dissuade’ it (or its future Artificial General Intelligence offspring) from suddenly turning on them. Here we address this acute need by deriving from first principles an exact formula for when a Jekyll-and-Hyde tipping point occurs at LLMs’ most basic level. Requiring only secondary school mathematics, it shows the cause to be the AI’s attention spreading so thin it suddenly snaps. This exact formula provides quantitative predictions for how the tipping-point can be delayed or prevented by changing the prompt and the AI’s training. Tailored generalizations will provide policymakers and the public with a firm platform for discussing any of AI’s broader uses and risks, e.g. as a personal counselor, medical advisor, decision-maker for when to use force in a conflict situation. It also meets the need for clear and transparent answers to questions like ‘‘should I be polite to my LLM?’’
zh
[AI-4] A Domain-Agnostic Scalable AI Safety Ensuring Framework
【速读】:该论文旨在解决AI系统在实际部署中的安全性问题,特别是针对传统方法在处理预定义领域特定安全条件时泛化能力不足的局限性。其解决方案的关键在于提出一种新型的AI安全框架,该框架能够确保AI系统满足用户自定义的约束条件,并在任意领域中以用户指定的概率进行操作。该框架通过将AI组件(如神经网络)与优化问题相结合,生成在最小化目标函数的同时满足用户定义约束的响应,并引入内部测试数据和保守测试方法以评估AI组件的可信度,从而实现概率约束的保证。
链接: https://arxiv.org/abs/2504.20924
作者: Beomjun Kim,Kangyeon Kim,Sunwoo Kim,Heejin Ahn
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Experimental supplementary material will be available before May 22 23:59PM AOE
Abstract:Ensuring the safety of AI systems has recently emerged as a critical priority for real-world deployment, particularly in physical AI applications. Current approaches to AI safety typically address predefined domain-specific safety conditions, limiting their ability to generalize across contexts. We propose a novel AI safety framework that ensures AI systems comply with \textbfany user-defined constraint, with \textbfany desired probability, and across \textbfvarious domains. In this framework, we combine an AI component (e.g., neural network) with an optimization problem to produce responses that minimize objectives while satisfying user-defined constraints with probabilities exceeding user-defined thresholds. For credibility assessment of the AI component, we propose \textitinternal test data, a supplementary set of safety-labeled data, and a \textitconservative testing methodology that provides statistical validity of using internal test data. We also present an approximation method of a loss function and how to compute its gradient for training. We mathematically prove that probabilistic constraint satisfaction is guaranteed under specific, mild conditions and prove a scaling law between safety and the number of internal test data. We demonstrate our framework’s effectiveness through experiments in diverse domains: demand prediction for production decision, safe reinforcement learning within the SafetyGym simulator, and guarding AI chatbot outputs. Through these experiments, we demonstrate that our method guarantees safety for user-specified constraints, outperforms for \textbfup to several order of magnitudes existing methods in low safety threshold regions, and scales effectively with respect to the size of internal test data. Comments: Experimental supplementary material will be available before May 22 23:59PM AOE Subjects: Artificial Intelligence (cs.AI) Cite as: arXiv:2504.20924 [cs.AI] (or arXiv:2504.20924v1 [cs.AI] for this version) https://doi.org/10.48550/arXiv.2504.20924 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Submission history From: Beomjun Kim [view email] [v1] Tue, 29 Apr 2025 16:38:35 UTC (2,554 KB)
zh
[AI-5] Leverag ing Generative AI Through Prompt Engineering and Rigorous Validation to Create Comprehensive Synthetic Datasets for AI Training in Healthcare
【速读】:该论文试图解决因隐私问题导致高质量医疗数据难以获取,从而阻碍电子健康记录(Electronic Health Record, EHR)应用中人工智能(Artificial Intelligence, AI)算法训练的问题。解决方案的关键在于利用GPT-4 API进行提示工程(prompt engineering),生成高质量的合成数据集,并通过多种先进验证技术确保数据的质量和完整性,包括BERT的下一句预测、GPT-2的总体合理性、RoBERTa的逻辑一致性、自编码器的异常检测以及多样性分析。
链接: https://arxiv.org/abs/2504.20921
作者: Polycarp Nalela
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Access to high-quality medical data is often restricted due to privacy concerns, posing significant challenges for training artificial intelligence (AI) algorithms within Electronic Health Record (EHR) applications. In this study, prompt engineering with the GPT-4 API was employed to generate high-quality synthetic datasets aimed at overcoming this limitation. The generated data encompassed a comprehensive array of patient admission information, including healthcare provider details, hospital departments, wards, bed assignments, patient demographics, emergency contacts, vital signs, immunizations, allergies, medical histories, appointments, hospital visits, laboratory tests, diagnoses, treatment plans, medications, clinical notes, visit logs, discharge summaries, and referrals. To ensure data quality and integrity, advanced validation techniques were implemented utilizing models such as BERT’s Next Sentence Prediction for sentence coherence, GPT-2 for overall plausibility, RoBERTa for logical consistency, autoencoders for anomaly detection, and conducted diversity analysis. Synthetic data that met all validation criteria were integrated into a comprehensive PostgreSQL database, serving as the data management system for the EHR application. This approach demonstrates that leveraging generative AI models with rigorous validation can effectively produce high-quality synthetic medical data, facilitating the training of AI algorithms while addressing privacy concerns associated with real patient data.
zh
[AI-6] When Testing AI Tests Us: Safeguarding Mental Health on the Digital Frontlines
【速读】:该论文试图解决AI红队成员(red-teamers)在执行对抗性测试任务过程中所面临的独特心理健康问题,这一问题被视为关键的工作场所安全议题。解决方案的关键在于通过分析红队劳动特有的心理影响,并借鉴其他职业(如演员、心理健康专业人员、冲突摄影师和内容审核员)中常见的互动性劳动及其心理保护策略,提出针对红队成员个体和组织层面的应对措施,以有效保障其心理健康与福祉。
链接: https://arxiv.org/abs/2504.20910
作者: Sachin R. Pendse,Darren Gergle,Rachel Kornfield,Jonah Meyerhoff,David Mohr,Jina Suh,Annie Wescott,Casey Williams,Jessica Schleider
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注: Accepted to ACM Conference on Fairness, Accountability, and Transparency (FAccT 2025)
Abstract:Red-teaming is a core part of the infrastructure that ensures that AI models do not produce harmful content. Unlike past technologies, the black box nature of generative AI systems necessitates a uniquely interactional mode of testing, one in which individuals on red teams actively interact with the system, leveraging natural language to simulate malicious actors and solicit harmful outputs. This interactional labor done by red teams can result in mental health harms that are uniquely tied to the adversarial engagement strategies necessary to effectively red team. The importance of ensuring that generative AI models do not propagate societal or individual harm is widely recognized – one less visible foundation of end-to-end AI safety is also the protection of the mental health and wellbeing of those who work to keep model outputs safe. In this paper, we argue that the unmet mental health needs of AI red-teamers is a critical workplace safety concern. Through analyzing the unique mental health impacts associated with the labor done by red teams, we propose potential individual and organizational strategies that could be used to meet these needs, and safeguard the mental health of red-teamers. We develop our proposed strategies through drawing parallels between common red-teaming practices and interactional labor common to other professions (including actors, mental health professionals, conflict photographers, and content moderators), describing how individuals and organizations within these professional spaces safeguard their mental health given similar psychological demands. Drawing on these protective practices, we describe how safeguards could be adapted for the distinct mental health challenges experienced by red teaming organizations as they mitigate emerging technological risks on the new digital frontlines.
zh
[AI-7] Modeling AI-Human Collaboration as a Multi-Agent Adaptation
【速读】:该论文试图解决AI与人类协作在组织战略决策中的有效性问题,特别是如何根据任务结构优化人机协同。其解决方案的关键在于构建一个基于代理的仿真模型,将AI-human协作形式化为任务结构的函数,并通过NK模型区分基于启发式的人类适应与基于规则的AI搜索,从而揭示不同任务类型(模块化与序列化)下人机协作的互补性与替代性机制。
链接: https://arxiv.org/abs/2504.20903
作者: Prothit Sen,Sai Mihir Jakkaraju
机构: 未知
类目: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注: Manuscript under review for the Special Issue: ‘Can AI Do Strategy?’ at Strategy Science (May 1, 2025)
Abstract:We develop an agent-based simulation to formalize AI-human collaboration as a function of task structure, advancing a generalizable framework for strategic decision-making in organizations. Distinguishing between heuristic-based human adaptation and rule-based AI search, we model interactions across modular (parallel) and sequenced (interdependent) tasks using an NK model. Our results reveal that in modular tasks, AI often substitutes for humans - delivering higher payoffs unless human expertise is very high, and the AI search space is either narrowly focused or extremely broad. In sequenced tasks, interesting complementarities emerge. When an expert human initiates the search and AI subsequently refines it, aggregate performance is maximized. Conversely, when AI leads, excessive heuristic refinement by the human can reduce payoffs. We also show that even “hallucinatory” AI - lacking memory or structure - can improve outcomes when augmenting low-capability humans by helping escape local optima. These results yield a robust implication: the effectiveness of AI-human collaboration depends less on context or industry, and more on the underlying task structure. By elevating task decomposition as the central unit of analysis, our model provides a transferable lens for strategic decision-making involving humans and an agentic AI across diverse organizational settings.
zh
[AI-8] Return Capping: Sample-Efficient CVaR Policy Gradient Optimisation
【速读】:该论文试图解决在使用策略梯度(Policy Gradients, PG)优化条件风险价值(Conditional Value at Risk, CVaR)时,现有方法因丢弃大量轨迹而导致样本效率低下的问题。其解决方案的关键在于对训练中使用的轨迹总回报进行上限约束,而非简单地丢弃这些轨迹,实验证明当上限设置合理时,该重构的优化问题与原问题等价,并在多个环境中表现出优于基线方法的性能。
链接: https://arxiv.org/abs/2504.20887
作者: Harry Mead,Clarissa Costen,Bruno Lacerda,Nick Hawes
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:When optimising for conditional value at risk (CVaR) using policy gradients (PG), current meth- ods rely on discarding a large proportion of tra- jectories, resulting in poor sample efficiency. We propose a reformulation of the CVaR optimisation problem by capping the total return of trajecto- ries used in training, rather than simply discard- ing them, and show that this is equivalent to the original problem if the cap is set appropriately. We show, with empirical results in an number of environments, that this reformulation of the prob- lem results in consistently improved performance compared to baselines.
zh
[AI-9] Quantifying the Noise of Structural Perturbations on Graph Adversarial Attacks
【速读】:该论文试图解决当前图神经网络(Graph Neural Networks, GNNs)在面对对抗攻击时缺乏鲁棒性的问题,尤其是现有研究多关注于优化攻击效果以获得近似最优的扰动,而较少关注每种扰动(如节点或边的注入)的强度量化,导致扰动选择过程缺乏可解释性。解决方案的关键在于引入“噪声”概念来量化每个对抗性边的攻击强度,并基于定义的噪声和分类边界提出三种攻击策略,分别针对单步和多步优化场景。
链接: https://arxiv.org/abs/2504.20869
作者: Junyuan Fang,Han Yang,Haixian Wen,Jiajing Wu,Zibin Zheng,Chi K. Tse
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
备注: Ubder Review
Abstract:Graph neural networks have been widely utilized to solve graph-related tasks because of their strong learning power in utilizing the local information of neighbors. However, recent studies on graph adversarial attacks have proven that current graph neural networks are not robust against malicious attacks. Yet much of the existing work has focused on the optimization objective based on attack performance to obtain (near) optimal perturbations, but paid less attention to the strength quantification of each perturbation such as the injection of a particular node/link, which makes the choice of perturbations a black-box model that lacks interpretability. In this work, we propose the concept of noise to quantify the attack strength of each adversarial link. Furthermore, we propose three attack strategies based on the defined noise and classification margins in terms of single and multiple steps optimization. Extensive experiments conducted on benchmark datasets against three representative graph neural networks demonstrate the effectiveness of the proposed attack strategies. Particularly, we also investigate the preferred patterns of effective adversarial perturbations by analyzing the corresponding properties of the selected perturbation nodes.
zh
[AI-10] abular Data Adapters: Improving Outlier Detection for Unlabeled Private Data
【速读】:该论文试图解决在将深度学习方法应用于内部私有表格数据时,由于数据结构差异、领域偏移和缺乏标签所带来的挑战。解决方案的关键在于引入Tabular Data Adapters (TDA),通过识别统计上相似的公开数据集,并基于共享自编码器将私有数据转换为与先进公开模型兼容的格式,从而生成软标签,以缓解标注的冷启动问题。
链接: https://arxiv.org/abs/2504.20862
作者: Dayananda Herurkar,Jörn Hees,Vesselin Tzvetkov,Andreas Dengel
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: outlier detection, tabular data, neural networks, weak annotations, soft labeling, unsupervised approach
Abstract:The remarkable success of Deep Learning approaches is often based and demonstrated on large public datasets. However, when applying such approaches to internal, private datasets, one frequently faces challenges arising from structural differences in the datasets, domain shift, and the lack of labels. In this work, we introduce Tabular Data Adapters (TDA), a novel method for generating soft labels for unlabeled tabular data in outlier detection tasks. By identifying statistically similar public datasets and transforming private data (based on a shared autoencoder) into a format compatible with state-of-the-art public models, our approach enables the generation of weak labels. It thereby can help to mitigate the cold start problem of labeling by basing on existing outlier detection models for public datasets. In experiments on 50 tabular datasets across different domains, we demonstrate that our method is able to provide more accurate annotations than baseline approaches while reducing computational time. Our approach offers a scalable, efficient, and cost-effective solution, to bridge the gap between public research models and real-world industrial applications.
zh
[AI-11] owards Easy and Realistic Network Infrastructure Testing for Large-scale Machine Learning
【速读】:该论文试图解决在不依赖昂贵GPU的情况下,评估真实硬件网络行为对机器学习(Machine Learning, ML)工作负载性能影响的问题。解决方案的关键在于利用CPU发起的流量在硬件测试平台中模拟GPU到GPU的通信,并通过调整ASTRA-sim模拟器来建模网络与ML工作负载之间的交互。
链接: https://arxiv.org/abs/2504.20854
作者: Jinsun Yoo,ChonLam Lao,Lianjie Cao,Bob Lantz,Minlan Yu,Tushar Krishna,Puneet Sharma
机构: 未知
类目: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Systems and Control (eess.SY)
备注: Presented as a poster in NSDI 25
Abstract:This paper lays the foundation for Genie, a testing framework that captures the impact of real hardware network behavior on ML workload performance, without requiring expensive GPUs. Genie uses CPU-initiated traffic over a hardware testbed to emulate GPU to GPU communication, and adapts the ASTRA-sim simulator to model interaction between the network and the ML workload.
zh
[AI-12] Fostering Self-Directed Growth with Generative AI: Toward a New Learning Analytics Framework
【速读】:该论文试图解决当前在人工智能辅助教育中关于自我导向学习(Self-Directed Learning)研究的不足,特别是在如何通过生成式AI(Generative Artificial Intelligence)与学习分析(Learning Analytics)的结合来促进学习者的自主成长(Self-Directed Growth)问题。解决方案的关键在于提出Aspire to Potentials for Learners(A2PL)模型,该模型重新定义了学习者抱负、复杂思维和总结性自我评估在生成式AI支持下的互动机制,旨在为数字时代构建公平、适应性强且可持续的学习系统提供理论基础与实践方向。
链接: https://arxiv.org/abs/2504.20851
作者: Qianrun Mao
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注:
Abstract:In an era increasingly shaped by decentralized knowledge ecosystems and pervasive AI technologies, fostering sustainable learner agency has become a critical educational imperative. This study introduces a novel conceptual framework integrating Generative Artificial Intelligence and Learning Analytics to cultivate Self-Directed Growth, a dynamic competency that enables learners to iteratively drive their own developmental pathways across diverse this http URL upon critical gaps in current research on Self Directed Learning and AI-mediated education, the proposed Aspire to Potentials for Learners (A2PL) model reconceptualizes the interplay of learner aspirations, complex thinking, and summative self-assessment within GAI supported this http URL implications for future intervention design and learning analytics applications are discussed, positioning Self-Directed Growth as a pivotal axis for developing equitable, adaptive, and sustainable learning systems in the digital era.
zh
[AI-13] Mitigating the Structural Bias in Graph Adversarial Defenses
【速读】:该论文试图解决图神经网络(Graph Neural Networks, GNNs)在面对对抗攻击时存在的结构偏差问题,特别是在低度节点(tail nodes)上的防御能力不足。解决方案的关键在于引入异构-同构增强图构建、k近邻增强图构建以及多视角节点注意力模块,以缓解GNN对对抗攻击的结构偏差。其中,异构-同构增强图通过全局移除异质链接并为低度节点添加同质链接来优化图结构,而注意力机制则用于自适应地融合不同图视图的表示,从而提升模型的鲁棒性与公平性。
链接: https://arxiv.org/abs/2504.20848
作者: Junyuan Fang,Huimin Liu,Han Yang,Jiajing Wu,Zibin Zheng,Chi K. Tse
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
备注: Under Review
Abstract:In recent years, graph neural networks (GNNs) have shown great potential in addressing various graph structure-related downstream tasks. However, recent studies have found that current GNNs are susceptible to malicious adversarial attacks. Given the inevitable presence of adversarial attacks in the real world, a variety of defense methods have been proposed to counter these attacks and enhance the robustness of GNNs. Despite the commendable performance of these defense methods, we have observed that they tend to exhibit a structural bias in terms of their defense capability on nodes with low degree (i.e., tail nodes), which is similar to the structural bias of traditional GNNs on nodes with low degree in the clean graph. Therefore, in this work, we propose a defense strategy by including hetero-homo augmented graph construction, k NN augmented graph construction, and multi-view node-wise attention modules to mitigate the structural bias of GNNs against adversarial attacks. Notably, the hetero-homo augmented graph consists of removing heterophilic links (i.e., links connecting nodes with dissimilar features) globally and adding homophilic links (i.e., links connecting nodes with similar features) for nodes with low degree. To further enhance the defense capability, an attention mechanism is adopted to adaptively combine the representations from the above two kinds of graph views. We conduct extensive experiments to demonstrate the defense and debiasing effect of the proposed strategy on benchmark datasets.
zh
[AI-14] Disjunctive and Conjunctive Normal Form Explanations of Clusters Using Auxiliary Information
【速读】:该论文试图解决如何利用未参与聚类算法的辅助信息(称为标签)生成对聚类结果的后验解释问题。其解决方案的关键在于通过整数线性规划(ILP)和启发式方法,生成两种形式的解释:一种是析取形式(由一组标签组成),另一种是两个子句的合取范式(CNF)解释(由两组标签通过逻辑与操作符组合而成)。
链接: https://arxiv.org/abs/2504.20846
作者: Robert F. Downey,S. S. Ravi
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:We consider generating post-hoc explanations of clusters generated from various datasets using auxiliary information which was not used by clustering algorithms. Following terminology used in previous work, we refer to the auxiliary information as tags. Our focus is on two forms of explanations, namely disjunctive form (where the explanation for a cluster consists of a set of tags) and a two-clause conjunctive normal form (CNF) explanation (where the explanation consists of two sets of tags, combined through the AND operator). We use integer linear programming (ILP) as well as heuristic methods to generate these explanations. We experiment with a variety of datasets and discuss the insights obtained from our explanations. We also present experimental results regarding the scalability of our explanation methods.
zh
[AI-15] Reinforcement Learning for LLM Reasoning Under Memory Constraints
【速读】:该论文旨在解决在内存和计算资源受限条件下,如何通过强化学习(Reinforcement Learning, RL)技术提升大语言模型(Large Language Models, LLMs)在特定问题空间中的推理能力。其解决方案的关键在于提出两种高效方法:S-GRPO(一种内存高效的Group Relative Policy Optimization变体)和T-SPMO(一种基于令牌级前缀匹配的细粒度信用分配策略),二者均兼容LoRA微调,并可在单块40GB GPU上运行。实验表明,这些方法在资源有限的情况下显著提升了模型在SVAMP基准测试中的准确率,并展示了在硬件约束下RL微调的潜力。
链接: https://arxiv.org/abs/2504.20834
作者: Alan Lee,Harry Tong
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:We explore reinforcement learning (RL) techniques to enhance reasoning within targeted problem spaces in large language models (LLMs) under memory and compute constraints. Our focus is on critic-free methods compatible with LoRA fine-tuning on a single 40GB GPU, a common limitation in academic settings. We introduce S-GRPO, a memory-efficient variant of Group Relative Policy Optimization, and T-SPMO, a token-level prefix matching strategy for fine-grained credit assignment. Despite limited resources, when used to fine-tune Qwen2-1.5B both methods significantly improve SVAMP benchmark accuracy from 46% to above 70% using LoRA training. T-SPMO also excels in multi-digit multiplication tasks, underscoring the potential of RL fine-tuning under hardware constraints. Additionally, we find that our full-token GRPO baseline under LoRA fine-tuning did not improve model performance (compared to base model) on either task, suggesting that our memory-efficient methods may act as a form of regularization that stabilizes training when only a small subset of parameters are updated.
zh
[AI-16] Ascendra: Dynamic Request Prioritization for Efficient LLM Serving
【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)服务系统中同时满足Time To First Token (TTFT)和Time Between Tokens (TBT)服务等级目标(SLOs)的效率问题,现有系统通常在两者之间做出权衡。解决方案的关键在于Ascendra系统通过动态划分GPU资源为低优先级和高优先级实例,利用性能模型预测可能无法满足SLOs的请求,并将其主动迁移至高优先级实例以保证低延迟,从而实现高吞吐与低延迟的平衡。
链接: https://arxiv.org/abs/2504.20828
作者: Azam Ikram,Xiang Li,Sameh Elnikety,Saurabh Bagchi
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:The rapid advancement of Large Language Models (LLMs) has driven the need for more efficient serving strategies. In this context, efficiency refers to the proportion of requests that meet their Service Level Objectives (SLOs), particularly for Time To First Token (TTFT) and Time Between Tokens (TBT). However, existing systems often prioritize one metric at the cost of the other. We present Ascendra, an LLM serving system designed to meet both TTFT and TBT SLOs simultaneously. The core insight behind Ascendra is that a request’s urgency evolves as it approaches its deadline. To leverage this, Ascendra partitions GPU resources into two types of instances: low-priority and high-priority. Low-priority instances maximize throughput by processing requests out of arrival order, but at the risk of request starvation. To address this, Ascendra employs a performance model to predict requests at risk of missing their SLOs and proactively offloads them to high-priority instances. High-priority instances are optimized for low-latency execution and handle urgent requests nearing their deadlines. This partitioned architecture enables Ascendra to effectively balance high throughput and low latency. Extensive evaluation shows that Ascendra improves system throughput by up to 1.7x compared to vLLM and Sarathi-Serve while meeting both TTFT and TBT SLOs.
zh
[AI-17] SoccerDiffusion: Toward Learning End-to-End Humanoid Robot Soccer from Gameplay Recordings
【速读】:该论文试图解决如何直接从真实比赛录像中学习拟人机器人足球的端到端控制策略的问题,其核心挑战在于如何将多模态传感器输入(包括视觉、本体感觉和比赛状态)转化为关节指令轨迹。解决方案的关键是提出SoccerDiffusion,一个基于Transformer的扩散模型,并采用知识蒸馏技术将多步骤的扩散过程简化为单步推理,从而实现在嵌入式平台上的实时推断。
链接: https://arxiv.org/abs/2504.20808
作者: Florian Vahl,Jörn Griepenburg,Jan Gutsche,Jasper Güldenstein,Jianwei Zhang
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:This paper introduces SoccerDiffusion, a transformer-based diffusion model designed to learn end-to-end control policies for humanoid robot soccer directly from real-world gameplay recordings. Using data collected from RoboCup competitions, the model predicts joint command trajectories from multi-modal sensor inputs, including vision, proprioception, and game state. We employ a distillation technique to enable real-time inference on embedded platforms that reduces the multi-step diffusion process to a single step. Our results demonstrate the model’s ability to replicate complex motion behaviors such as walking, kicking, and fall recovery both in simulation and on physical robots. Although high-level tactical behavior remains limited, this work provides a robust foundation for subsequent reinforcement learning or preference optimization methods. We release the dataset, pretrained models, and code under: this https URL
zh
[AI-18] Hallucination by Code Generation LLM s: Taxonomy Benchmarks Mitigation and Challenges
【速读】:该论文旨在解决由代码大语言模型(CodeLLMs)生成的幻觉(hallucination)问题,即模型在生成源代码时可能产生的错误、无意义且无法验证的信息。这类幻觉难以被用户识别和修复,尤其在特定执行路径下更易隐藏,从而对代码质量与系统可靠性造成潜在威胁。论文的关键解决方案在于通过分类代码生成中的幻觉类型、回顾现有基准测试与缓解策略,并识别当前研究的开放性挑战,进而为未来在检测与消除CodeLLMs生成幻觉方面的研究提供方向。
链接: https://arxiv.org/abs/2504.20799
作者: Yunseo Lee,John Youngeun Song,Dongsun Kim,Jindae Kim,Mijung Kim,Jaechang Nam
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注: 15 pages, 4 figures
Abstract:Recent technical breakthroughs in large language models (LLMs) have enabled them to fluently generate source code. Software developers often leverage both general-purpose and code-specialized LLMs to revise existing code or even generate a whole function from scratch. These capabilities are also beneficial in no-code or low-code contexts, in which one can write programs without a technical background. However, due to their internal design, LLMs are prone to generating hallucinations, which are incorrect, nonsensical, and not justifiable information but difficult to identify its presence. This problem also occurs when generating source code. Once hallucinated code is produced, it is often challenging for users to identify and fix it, especially when such hallucinations can be identified under specific execution paths. As a result, the hallucinated code may remain unnoticed within the codebase. This survey investigates recent studies and techniques relevant to hallucinations generated by CodeLLMs. We categorize the types of hallucinations in the code generated by CodeLLMs, review existing benchmarks and mitigation strategies, and identify open challenges. Based on these findings, this survey outlines further research directions in the detection and removal of hallucinations produced by CodeLLMs.
zh
[AI-19] Partitioned Memory Storag e Inspired Few-Shot Class-Incremental learning
【速读】:该论文试图解决当前主流深度学习技术对大量训练数据的依赖性过强以及在动态世界中适应能力不足的问题,这些问题与人类智能存在显著差距。为弥合这一差距,论文提出了一种名为Few-Shot Class-Incremental Learning (FSCIL)的方法,旨在在不遗忘旧知识的前提下,通过有限样本持续学习新类别。现有FSCIL研究通常使用单一模型跨所有会话学习知识,不可避免地陷入稳定性与可塑性之间的权衡困境。本文的解决方案关键在于为每个会话学习独立模型,从而内在地防止灾难性遗忘,并在测试阶段引入Uncertainty Quantification (UQ)以优化模型部署。
链接: https://arxiv.org/abs/2504.20797
作者: Renye Zhang,Yimin Yin,Jinghua Zhang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Current mainstream deep learning techniques exhibit an over-reliance on extensive training data and a lack of adaptability to the dynamic world, marking a considerable disparity from human intelligence. To bridge this gap, Few-Shot Class-Incremental Learning (FSCIL) has emerged, focusing on continuous learning of new categories with limited samples without forgetting old knowledge. Existing FSCIL studies typically use a single model to learn knowledge across all sessions, inevitably leading to the stability-plasticity dilemma. Unlike machines, humans store varied knowledge in different cerebral cortices. Inspired by this characteristic, our paper aims to develop a method that learns independent models for each session. It can inherently prevent catastrophic forgetting. During the testing stage, our method integrates Uncertainty Quantification (UQ) for model deployment. Our method provides a fresh viewpoint for FSCIL and demonstrates the state-of-the-art performance on CIFAR-100 and mini-ImageNet datasets.
zh
[AI-20] Approximate Lifted Model Construction IJCAI-2025
【速读】:该论文试图解决在概率关系模型中,由于从数据中学习的势函数不可避免地偏离精确匹配,导致传统Advanced Colour Passing (ACP)算法无法有效识别和利用不可区分对象的问题。解决方案的关键在于引入\varepsilon-Advanced Colour Passing (\varepsilon-ACP)算法,该算法通过允许势函数根据超参数\varepsilon的偏差进行调整,从而高效地发现并利用非精确的不可区分性,同时保证了近似误差严格受限,并在实践中接近于零。
链接: https://arxiv.org/abs/2504.20784
作者: Malte Luttermann,Jan Speller,Marcel Gehrke,Tanya Braun,Ralf Möller,Mattis Hartwig
机构: 未知
类目: Artificial Intelligence (cs.AI); Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG)
备注: Extended version of paper accepted to the Proceedings of the 34th International Joint Conference on Artificial Intelligence (IJCAI-2025)
Abstract:Probabilistic relational models such as parametric factor graphs enable efficient (lifted) inference by exploiting the indistinguishability of objects. In lifted inference, a representative of indistinguishable objects is used for computations. To obtain a relational (i.e., lifted) representation, the Advanced Colour Passing (ACP) algorithm is the state of the art. The ACP algorithm, however, requires underlying distributions, encoded as potential-based factorisations, to exactly match to identify and exploit indistinguishabilities. Hence, ACP is unsuitable for practical applications where potentials learned from data inevitably deviate even if associated objects are indistinguishable. To mitigate this problem, we introduce the \varepsilon -Advanced Colour Passing ( \varepsilon -ACP) algorithm, which allows for a deviation of potentials depending on a hyperparameter \varepsilon . \varepsilon -ACP efficiently uncovers and exploits indistinguishabilities that are not exact. We prove that the approximation error induced by \varepsilon -ACP is strictly bounded and our experiments show that the approximation error is close to zero in practice.
zh
[AI-21] Using LLM s in Generating Design Rationale for Software Architecture Decisions
【速读】:该论文试图解决软件架构决策的Design Rationale (DR) 缺乏充分文档化的问题,这一问题通常源于开发人员缺乏动机和努力。解决方案的关键在于利用大型语言模型(Large Language Models, LLMs)在文本理解、推理和生成方面的能力,尝试自动生成或恢复DR。研究通过构建包含100个与架构相关问题的数据集,并采用三种提示策略(zero-shot、chain of thought和LLM-based agents)评估了五种LLMs生成DR的性能,结果显示其在精度、召回率和F1分数上具有一定的有效性,但同时也存在部分生成内容可能不准确或具有误导性的问题。
链接: https://arxiv.org/abs/2504.20781
作者: Xiyu Zhou,Ruiyin Li,Peng Liang,Beiqi Zhang,Mojtaba Shahin,Zengyang Li,Chen Yang
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注: 28 pages, 5 images, 7 tables, Manuscript submitted to a journal (2025)
Abstract:Design Rationale (DR) for software architecture decisions refers to the reasoning underlying architectural choices, which provides valuable insights into the different phases of the architecting process throughout software development. However, in practice, DR is often inadequately documented due to a lack of motivation and effort from developers. With the recent advancements in Large Language Models (LLMs), their capabilities in text comprehension, reasoning, and generation may enable the generation and recovery of DR for architecture decisions. In this study, we evaluated the performance of LLMs in generating DR for architecture decisions. First, we collected 50 Stack Overflow (SO) posts, 25 GitHub issues, and 25 GitHub discussions related to architecture decisions to construct a dataset of 100 architecture-related problems. Then, we selected five LLMs to generate DR for the architecture decisions with three prompting strategies, including zero-shot, chain of thought (CoT), and LLM-based agents. With the DR provided by human experts as ground truth, the Precision of LLM-generated DR with the three prompting strategies ranges from 0.267 to 0.278, Recall from 0.627 to 0.715, and F1-score from 0.351 to 0.389. Additionally, 64.45% to 69.42% of the arguments of DR not mentioned by human experts are also helpful, 4.12% to 4.87% of the arguments have uncertain correctness, and 1.59% to 3.24% of the arguments are potentially misleading. Based on the results, we further discussed the pros and cons of the three prompting strategies and the strengths and limitations of the DR generated by LLMs.
zh
[AI-22] ECOSoundSet: a finely annotated dataset for the automated acoustic identification of Orthoptera and Cicadidae in North Central and temperate Western Europe
【速读】:该论文试图解决当前用于自然声景中欧洲昆虫自动声学识别的工具在范围上的局限性。其解决方案的关键在于构建一个大规模且生态异质性的声学数据集——ECOSoundSet,该数据集包含200种直翅目昆虫和24种蝉科物种的10,653条录音,并涵盖北、中及温带西欧地区。该数据集结合了粗略标注和精细标注的录音,其中精细标注的录音被划分为训练、验证和测试集,比例约为0.8:0.1:0.1,以支持深度学习算法的训练与评估。
链接: https://arxiv.org/abs/2504.20776
作者: David Funosas,Elodie Massol,Yves Bas,Svenja Schmidt,Dominik Arend,Alexander Gebhard,Luc Barbaro,Sebastian König,Rafael Carbonell Font,David Sannier,Fernand Deroussen,Jérôme Sueur,Christian Roesti,Tomi Trilar,Wolfgang Forstmeier,Lucas Roger,Eloïsa Matheu,Piotr Guzik,Julien Barataud,Laurent Pelozuelo,Stéphane Puissant,Sandra Mueller,Björn Schuller,Jose M. Montoya,Andreas Triantafyllopoulos,Maxime Cauchoix
机构: 未知
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
备注: 3 Figures + 2 Supplementary Figures, 2 Tables + 3 Supplementary Tables
Abstract:Currently available tools for the automated acoustic recognition of European insects in natural soundscapes are limited in scope. Large and ecologically heterogeneous acoustic datasets are currently needed for these algorithms to cross-contextually recognize the subtle and complex acoustic signatures produced by each species, thus making the availability of such datasets a key requisite for their development. Here we present ECOSoundSet (European Cicadidae and Orthoptera Sound dataSet), a dataset containing 10,653 recordings of 200 orthopteran and 24 cicada species (217 and 26 respective taxa when including subspecies) present in North, Central, and temperate Western Europe (Andorra, Belgium, Denmark, mainland France and Corsica, Germany, Ireland, Luxembourg, Monaco, Netherlands, United Kingdom, Switzerland), collected partly through targeted fieldwork in South France and Catalonia and partly through contributions from various European entomologists. The dataset is composed of a combination of coarsely labeled recordings, for which we can only infer the presence, at some point, of their target species (weak labeling), and finely annotated recordings, for which we know the specific time and frequency range of each insect sound present in the recording (strong labeling). We also provide a train/validation/test split of the strongly labeled recordings, with respective approximate proportions of 0.8, 0.1 and 0.1, in order to facilitate their incorporation in the training and evaluation of deep learning algorithms. This dataset could serve as a meaningful complement to recordings already available online for the training of deep learning algorithms for the acoustic classification of orthopterans and cicadas in North, Central, and temperate Western Europe.
zh
[AI-23] JTreeformer: Graph-Transformer via Latent-Diffusion Model for Molecular Generation
【速读】:该论文试图解决现有基于Transformer的图解码器在分子生成任务中难以有效利用图结构信息的问题,即仅依赖节点序列而非分子图的复杂拓扑结构。其解决方案的关键在于构建一种基于图Transformer的框架——JTreeformer,该框架将图生成转化为junction tree(连接树)生成,并通过结合GCN并行与多头注意力作为编码器,以及引入有向无环GCN作为解码器,从而能够逐步合成整个分子结构。此外,在编码器生成的潜在空间中插入扩散模型以提升采样效率和效果。
链接: https://arxiv.org/abs/2504.20770
作者: Ji Shi,Chengxun Xie,Zhonghao Li,Xinming Zhang,Miao Zhang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 9 pages, 6figures
Abstract:The discovery of new molecules based on the original chemical molecule distributions is of great importance in medicine. The graph transformer, with its advantages of high performance and scalability compared to traditional graph networks, has been widely explored in recent research for applications of graph structures. However, current transformer-based graph decoders struggle to effectively utilize graph information, which limits their capacity to leverage only sequences of nodes rather than the complex topological structures of molecule graphs. This paper focuses on building a graph transformer-based framework for molecular generation, which we call \textbfJTreeformer as it transforms graph generation into junction tree generation. It combines GCN parallel with multi-head attention as the encoder. It integrates a directed acyclic GCN into a graph-based Transformer to serve as a decoder, which can iteratively synthesize the entire molecule by leveraging information from the partially constructed molecular structure at each step. In addition, a diffusion model is inserted in the latent space generated by the encoder, to enhance the efficiency and effectiveness of sampling further. The empirical results demonstrate that our novel framework outperforms existing molecule generation methods, thus offering a promising tool to advance drug discovery (this https URL).
zh
[AI-24] Graph-Based Fault Diagnosis for Rotating Machinery: Adaptive Segmentation and Structural Feature Integration
【速读】:该论文旨在解决旋转机械中多类故障诊断的鲁棒性与可解释性问题。其解决方案的关键在于提出一种基于图论的框架,通过熵优化信号分段、时频特征提取以及图建模,将振动信号转化为适合分类的结构化表示。该方法结合图度量(如平均最短路径长度、模块度和谱隙)与局部特征,以捕捉全局和段级故障特性,从而实现高精度的故障诊断,并具备良好的噪声鲁棒性和跨领域迁移能力。
链接: https://arxiv.org/abs/2504.20756
作者: Moirangthem Tiken Singh
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:This paper proposes a novel graph-based framework for robust and interpretable multiclass fault diagnosis in rotating machinery. The method integrates entropy-optimized signal segmentation, time-frequency feature extraction, and graph-theoretic modeling to transform vibration signals into structured representations suitable for classification. Graph metrics, such as average shortest path length, modularity, and spectral gap, are computed and combined with local features to capture global and segment-level fault characteristics. The proposed method achieves high diagnostic accuracy when evaluated on two benchmark datasets, the CWRU bearing dataset (under 0-3 HP loads) and the SU gearbox and bearing datasets (under different speed-load configurations). Classification scores reach up to 99.8% accuracy on Case Western Reserve University (CWRU) and 100% accuracy on the Southeast University datasets using a logistic regression classifier. Furthermore, the model exhibits strong noise resilience, maintaining over 95.4% accuracy at high noise levels (standard deviation = 0.5), and demonstrates excellent cross-domain transferability with up to 99.7% F1-score in load-transfer scenarios. Compared to traditional techniques, this approach requires no deep learning architecture, enabling lower complexity while ensuring interpretability. The results confirm the method’s scalability, reliability, and potential for real-time deployment in industrial diagnostics.
zh
[AI-25] In defence of post-hoc explanations in medical AI
【速读】:该论文试图解决医疗人工智能(Artificial Intelligence, AI)系统中黑箱问题(black box problem)所带来的信任与理解障碍,尤其是在临床决策中的应用。其解决方案的关键在于论证后验解释(post-hoc explanations)尽管无法完全复制黑箱系统的实际推理过程,但仍能有效提升用户对系统的功能理解,增强临床医生与AI协作的准确性,并帮助临床医生合理化其基于AI的决策。因此,后验解释虽非解决黑箱问题的“万能方案”,但仍是应对该问题的重要策略。
链接: https://arxiv.org/abs/2504.20741
作者: Joshua Hatherley,Lauritz Munch,Jens Christian Bjerring
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG)
备注:
Abstract:Since the early days of the Explainable AI movement, post-hoc explanations have been praised for their potential to improve user understanding, promote trust, and reduce patient safety risks in black box medical AI systems. Recently, however, critics have argued that the benefits of post-hoc explanations are greatly exaggerated since they merely approximate, rather than replicate, the actual reasoning processes that black box systems take to arrive at their outputs. In this article, we aim to defend the value of post-hoc explanations against this recent critique. We argue that even if post-hoc explanations do not replicate the exact reasoning processes of black box systems, they can still improve users’ functional understanding of black box systems, increase the accuracy of clinician-AI teams, and assist clinicians in justifying their AI-informed decisions. While post-hoc explanations are not a “silver bullet” solution to the black box problem in medical AI, we conclude that they remain a useful strategy for addressing the black box problem in medical AI.
zh
[AI-26] Unsupervised Surrogate Anomaly Detection
【速读】:该论文旨在解决无监督异常检测(unsupervised anomaly detection)问题,即在没有标签数据的情况下,通过学习正常数据的神经网络表示来识别偏离正常模式的异常样本。其解决方案的关键在于提出一种名为DEAN(Deep Ensemble ANomaly Detection)的新算法,该算法基于“代理异常检测”(surrogate anomaly detection)的方法论,通过形式化代理异常检测的概念并满足最优代理模型所需的一组公理,从而实现对异常的有效检测。
链接: https://arxiv.org/abs/2504.20733
作者: Simon Klüttermann,Tim Katzke,Emmanuel Müller
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 13 pages + references and appendix = 35 pages
Abstract:In this paper, we study unsupervised anomaly detection algorithms that learn a neural network representation, i.e. regular patterns of normal data, which anomalies are deviating from. Inspired by a similar concept in engineering, we refer to our methodology as surrogate anomaly detection. We formalize the concept of surrogate anomaly detection into a set of axioms required for optimal surrogate models and propose a new algorithm, named DEAN (Deep Ensemble ANomaly detection), designed to fulfill these criteria. We evaluate DEAN on 121 benchmark datasets, demonstrating its competitive performance against 19 existing methods, as well as the scalability and reliability of our method.
zh
[AI-27] Enhancing Vulnerability Reports with Automated and Augmented Description Summarization
【速读】:该论文试图解决公共漏洞数据库(如国家漏洞数据库,NVD)中漏洞描述内容简短、过时或信息不足的问题。其解决方案的关键在于构建一个名为Zad的系统,该系统通过两个管道实现漏洞描述的增强:第一个管道利用两个编码器收集并过滤外部资源中的补充数据,以构建详尽的数据集;第二个管道则在该数据集上微调预训练模型,生成更全面的漏洞描述。通过提升描述的完整性和连贯性,Zad有效改善了漏洞信息的质量。
链接: https://arxiv.org/abs/2504.20726
作者: Hattan Althebeiti,Mohammed Alkinoon,Manar Mohaisen,Saeed Salem,DaeHun Nyang,David Mohaisen
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 12 pages, 3 tables, 12 figures. Accepted for publication in IEEE Transactions on Big Data. Extended version of arXiv:2210.01260
Abstract:Public vulnerability databases, such as the National Vulnerability Database (NVD), document vulnerabilities and facilitate threat information sharing. However, they often suffer from short descriptions and outdated or insufficient information. In this paper, we introduce Zad, a system designed to enrich NVD vulnerability descriptions by leveraging external resources. Zad consists of two pipelines: one collects and filters supplementary data using two encoders to build a detailed dataset, while the other fine-tunes a pre-trained model on this dataset to generate enriched descriptions. By addressing brevity and improving content quality, Zad produces more comprehensive and cohesive vulnerability descriptions. We evaluate Zad using standard summarization metrics and human assessments, demonstrating its effectiveness in enhancing vulnerability information.
zh
[AI-28] he Limits of AI Explainability: An Algorithmic Information Theory Approach
【速读】:该论文试图解决AI可解释性(explainability)的根本限制问题,旨在通过算法信息论建立理论基础。其解决方案的关键在于将可解释性形式化为复杂模型被简化模型近似的过程,并利用科尔莫戈罗夫复杂度(Kolmogorov complexity)量化近似误差和解释复杂度。核心贡献包括证明了简化解释必然在某些输入上与原模型存在差异的复杂度差距定理、揭示了解释复杂度随输入维度呈指数增长而误差容忍度呈多项式增长的边界条件,以及分析了局部与全局可解释性之间的差距。
链接: https://arxiv.org/abs/2504.20676
作者: Shrisha Rao
机构: 未知
类目: Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Information Theory (cs.IT)
备注:
Abstract:This paper establishes a theoretical foundation for understanding the fundamental limits of AI explainability through algorithmic information theory. We formalize explainability as the approximation of complex models by simpler ones, quantifying both approximation error and explanation complexity using Kolmogorov complexity. Our key theoretical contributions include: (1) a complexity gap theorem proving that any explanation significantly simpler than the original model must differ from it on some inputs; (2) precise bounds showing that explanation complexity grows exponentially with input dimension but polynomially with error tolerance for Lipschitz functions; and (3) a characterization of the gap between local and global explainability, demonstrating that local explanations can be significantly simpler while maintaining accuracy in relevant regions. We further establish a regulatory impossibility theorem proving that no governance framework can simultaneously pursue unrestricted AI capabilities, human-interpretable explanations, and negligible error. These results highlight considerations likely to be relevant to the design, evaluation, and oversight of explainable AI systems.
zh
[AI-29] CoCo-Bench: A Comprehensive Code Benchmark For Multi-task Large Language Model Evaluation ACL2025
【速读】:该论文试图解决现有基准测试在评估大型语言模型(Large Language Models, LLMs)时存在的范围狭窄和缺乏全面评估框架的问题。为了解决这一问题,研究者提出了CoCo-Bench(Comprehensive Code Benchmark),其关键在于设计了一个涵盖代码理解、代码生成、代码修改和代码审查四个核心维度的综合性评估框架,以更系统和真实地反映开发者需求,并通过多语言支持和严格的人工审核确保数据质量,从而为未来代码导向的LLMs研究提供可靠且客观的基准。
链接: https://arxiv.org/abs/2504.20673
作者: Wenjing Yin,Tianze Sun,Yijiong Yu,Jiawei Fang,Guangyao Su,Jiancheng Wang,Zekun Wang,Wei Wang,Ran Chen,Ziyun Dai,Shuai Yuan,Menghang Dong,Peng Luo,Dong Cao,Da Lei,Yajun Zhang,Hao Chen,Xiang Ma,Yong Liu,Weifeng Liu,Yuanjian Xu,Ji Pei
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注: Submitted to ACL 2025. Under review
Abstract:Large language models (LLMs) play a crucial role in software engineering, excelling in tasks like code generation and maintenance. However, existing benchmarks are often narrow in scope, focusing on a specific task and lack a comprehensive evaluation framework that reflects real-world applications. To address these gaps, we introduce CoCo-Bench (Comprehensive Code Benchmark), designed to evaluate LLMs across four critical dimensions: code understanding, code generation, code modification, and code review. These dimensions capture essential developer needs, ensuring a more systematic and representative evaluation. CoCo-Bench includes multiple programming languages and varying task difficulties, with rigorous manual review to ensure data quality and accuracy. Empirical results show that CoCo-Bench aligns with existing benchmarks while uncovering significant variations in model performance, effectively highlighting strengths and weaknesses. By offering a holistic and objective evaluation, CoCo-Bench provides valuable insights to guide future research and technological advancements in code-oriented LLMs, establishing a reliable benchmark for the field.
zh
[AI-30] Federated learning ethics and the double black box problem in medical AI
【速读】:该论文试图解决医疗联邦学习(Federated Learning, FL)系统中尚未被充分探讨的伦理风险问题,特别是其带来的新型透明度问题——联邦透明度缺失(federation opacity),进而引发医疗人工智能中的独特双重黑箱问题。论文指出,医疗FL所承诺的优势可能被夸大,并强调要实现FL在医学中的伦理可行性,需克服的关键在于提升系统的透明度和可解释性,以保障患者隐私与系统公正性。
链接: https://arxiv.org/abs/2504.20656
作者: Joshua Hatherley,Anders Søgaard,Angela Ballantyne,Ruben Pauwels
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)
备注:
Abstract:Federated learning (FL) is a machine learning approach that allows multiple devices or institutions to collaboratively train a model without sharing their local data with a third-party. FL is considered a promising way to address patient privacy concerns in medical artificial intelligence. The ethical risks of medical FL systems themselves, however, have thus far been underexamined. This paper aims to address this gap. We argue that medical FL presents a new variety of opacity – federation opacity – that, in turn, generates a distinctive double black box problem in healthcare AI. We highlight several instances in which the anticipated benefits of medical FL may be exaggerated, and conclude by highlighting key challenges that must be overcome to make FL ethically feasible in medicine.
zh
[AI-31] On Stochastic Rounding with Few Random Bits
【速读】:该论文旨在解决在低精度(LP)浮点格式和混合精度计算中,使用随机舍入(Stochastic Rounding, SR)时如何减少所需随机位数的问题,同时保持SR在特定计算中的理想特性。其解决方案的关键在于研究少比特随机舍入(Few-Bit Stochastic Rounding, FBSR)的不同实现方式,并揭示这些实现可能引入的显著偏差,从而为实践者提供在开发或采用低精度浮点格式时需注意的配置参数。
链接: https://arxiv.org/abs/2504.20634
作者: Andrew Fitzgibbon,Stephen Felix
机构: 未知
类目: Numerical Analysis (math.NA); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Mathematical Software (cs.MS)
备注: Published at ARITH 2025
Abstract:Large-scale numerical computations make increasing use of low-precision (LP) floating point formats and mixed precision arithmetic, which can be enhanced by the technique of stochastic rounding (SR), that is, rounding an intermediate high-precision value up or down randomly as a function of the value’s distance to the two rounding candidates. Stochastic rounding requires, in addition to the high-precision input value, a source of random bits. As the provision of high-quality random bits is an additional computational cost, it is of interest to require as few bits as possible while maintaining the desirable properties of SR in a given computation, or computational domain. This paper examines a number of possible implementations of few-bit stochastic rounding (FBSR), and shows how several natural implementations can introduce sometimes significant bias into the rounding process, which are not present in the case of infinite-bit, infinite-precision examinations of these implementations. The paper explores the impact of these biases in machine learning examples, and hence opens another class of configuration parameters of which practitioners should be aware when developing or adopting low-precision floating point. Code is available at this http URL.
zh
[AI-32] Cognitive maps are generative programs
【速读】:该论文试图解决在资源受限条件下,人类如何高效进行规划的问题,具体表现为如何构建能够抽象现实并支持高效决策的简化心理表征。其解决方案的关键在于提出认知地图可以表现为生成式程序,这些程序利用环境的可预测性和冗余性,而非直接编码空间布局,从而实现资源高效的规划策略。研究通过行为实验和计算模型验证了这一假设,表明人类在结构化环境中依赖模块化规划策略,而模型则通过大型语言模型隐式学习人类先验知识,生成符合人类行为的资源高效计划。
链接: https://arxiv.org/abs/2504.20628
作者: Marta Kryven,Cole Wyeth,Aidan Curtis,Kevin Ellis
机构: 未知
类目: Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET)
备注: 9 pages, 4 figures, to be published in Cognitive Sciences Society proceedings
Abstract:Making sense of the world and acting in it relies on building simplified mental representations that abstract away aspects of reality. This principle of cognitive mapping is universal to agents with limited resources. Living organisms, people, and algorithms all face the problem of forming functional representations of their world under various computing constraints. In this work, we explore the hypothesis that human resource-efficient planning may arise from representing the world as predictably structured. Building on the metaphor of concepts as programs, we propose that cognitive maps can take the form of generative programs that exploit predictability and redundancy, in contrast to directly encoding spatial layouts. We use a behavioral experiment to show that people who navigate in structured spaces rely on modular planning strategies that align with programmatic map representations. We describe a computational model that predicts human behavior in a variety of structured scenarios. This model infers a small distribution over possible programmatic cognitive maps conditioned on human prior knowledge of the world, and uses this distribution to generate resource-efficient plans. Our models leverages a Large Language Model as an embedding of human priors, implicitly learned through training on a vast corpus of human data. Our model demonstrates improved computational efficiency, requires drastically less memory, and outperforms unstructured planning algorithms with cognitive constraints at predicting human behavior, suggesting that human planning strategies rely on programmatic cognitive maps.
zh
[AI-33] DiffusionRIR: Room Impulse Response Interpolation using Diffusion Models
【速读】:该论文试图解决在未测量位置估计房间脉冲响应(Room Impulse Responses, RIRs)的问题,特别是在高空间分辨率需求下传统测量方法资源消耗大的问题。解决方案的关键在于利用去噪扩散概率模型(Denoising Diffusion Probabilistic Models, DDPM),通过将RIR矩阵与图像修复任务进行类比,将RIR数据转换为适合基于扩散模型重建的格式,从而实现对缺失RIR的有效重建。
链接: https://arxiv.org/abs/2504.20625
作者: Sagi Della Torre,Mirco Pezzoli,Fabio Antonacci,Sharon Gannot
机构: 未知
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
备注:
Abstract:Room Impulse Responses (RIRs) characterize acoustic environments and are crucial in multiple audio signal processing tasks. High-quality RIR estimates drive applications such as virtual microphones, sound source localization, augmented reality, and data augmentation. However, obtaining RIR measurements with high spatial resolution is resource-intensive, making it impractical for large spaces or when dense sampling is required. This research addresses the challenge of estimating RIRs at unmeasured locations within a room using Denoising Diffusion Probabilistic Models (DDPM). Our method leverages the analogy between RIR matrices and image inpainting, transforming RIR data into a format suitable for diffusion-based reconstruction. Using simulated RIR data based on the image method, we demonstrate our approach’s effectiveness on microphone arrays of different curvatures, from linear to semi-circular. Our method successfully reconstructs missing RIRs, even in large gaps between microphones. Under these conditions, it achieves accurate reconstruction, significantly outperforming baseline Spline Cubic Interpolation in terms of Normalized Mean Square Error and Cosine Distance between actual and interpolated RIRs. This research highlights the potential of using generative models for effective RIR interpolation, paving the way for generating additional data from limited real-world measurements. Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS) Cite as: arXiv:2504.20625 [cs.SD] (or arXiv:2504.20625v1 [cs.SD] for this version) https://doi.org/10.48550/arXiv.2504.20625 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh
[AI-34] PaRT: Enhancing Proactive Social Chatbots with Personalized Real-Time Retrieval
【速读】:该论文旨在解决传统社交聊天机器人在对话中参与度低和对话时长短的问题,这些问题通常源于聊天机器人被动响应机制需依赖用户主动发起或维持对话。论文提出的解决方案关键在于PaRT框架,该框架通过个性化实时检索与生成实现上下文感知的主动对话,具体包括将用户资料和对话上下文整合到大语言模型(LLM)中,以优化用户查询并识别潜在意图,进而生成个性化对话主题,并结合检索到的相关内容生成知识支撑且提升参与度的回复。
链接: https://arxiv.org/abs/2504.20624
作者: Zihan Niu,Zheyong Xie,Shaosheng Cao,Chonggang Lu,Zheyu Ye,Tong Xu,Zuozhu Liu,Yan Gao,Jia Chen,Zhe Xu,Yi Wu,Yao Hu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Social chatbots have become essential intelligent companions in daily scenarios ranging from emotional support to personal interaction. However, conventional chatbots with passive response mechanisms usually rely on users to initiate or sustain dialogues by bringing up new topics, resulting in diminished engagement and shortened dialogue duration. In this paper, we present PaRT, a novel framework enabling context-aware proactive dialogues for social chatbots through personalized real-time retrieval and generation. Specifically, PaRT first integrates user profiles and dialogue context into a large language model (LLM), which is initially prompted to refine user queries and recognize their underlying intents for the upcoming conversation. Guided by refined intents, the LLM generates personalized dialogue topics, which then serve as targeted queries to retrieve relevant passages from RedNote. Finally, we prompt LLMs with summarized passages to generate knowledge-grounded and engagement-optimized responses. Our approach has been running stably in a real-world production environment for more than 30 days, achieving a 21.77% improvement in the average duration of dialogues.
zh
[AI-35] he Hidden Risks of LLM -Generated Web Application Code: A Security-Centric Evaluation of Code Generation Capabilities in Large Language Models
【速读】:该论文试图解决由大型语言模型(Large Language Models, LLMs)生成的代码在安全性方面存在的潜在问题,特别是在真实应用场景中可能引发的可靠性与安全风险。其解决方案的关键在于通过预定义的安全参数对多个LLM生成的代码进行安全合规性评估,识别出认证机制、会话管理、输入验证和HTTP安全头等方面的关键漏洞,并指出当前没有一种模型能够完全符合行业最佳实践,从而强调了在自动化软件开发过程中人类专家的重要性以及建立强大安全评估框架的必要性。
链接: https://arxiv.org/abs/2504.20612
作者: Swaroop Dora,Deven Lunkad,Naziya Aslam,S. Venkatesan,Sandeep Kumar Shukla
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET)
备注: 9 pages
Abstract:The rapid advancement of Large Language Models (LLMs) has enhanced software development processes, minimizing the time and effort required for coding and enhancing developer productivity. However, despite their potential benefits, code generated by LLMs has been shown to generate insecure code in controlled environments, raising critical concerns about their reliability and security in real-world applications. This paper uses predefined security parameters to evaluate the security compliance of LLM-generated code across multiple models, such as ChatGPT, DeepSeek, Claude, Gemini and Grok. The analysis reveals critical vulnerabilities in authentication mechanisms, session management, input validation and HTTP security headers. Although some models implement security measures to a limited extent, none fully align with industry best practices, highlighting the associated risks in automated software development. Our findings underscore that human expertise is crucial to ensure secure software deployment or review of LLM-generated code. Also, there is a need for robust security assessment frameworks to enhance the reliability of LLM-generated code in real-world applications.
zh
[AI-36] Information Retrieval in the Age of Generative AI: The RGB Model SIGIR25
【速读】:该论文试图解决生成式 AI(Generative AI)在信息生成、索引和传播过程中带来的内容真实性与可靠性问题,特别是在面对新话题时,现有大型语言模型(Large Language Models, LLMs)因依赖实时检索增强生成(Retrieval-Augmented Generation, RAG)技术而面临的挑战。解决方案的关键在于提出一种随机模型,用于表征信息在新话题下的生成、索引与传播动态,从而揭示生成式 AI 快速普及可能引发的虚假信息扩散风险,并强调高质量信息产出所需的人工时间和努力。
链接: https://arxiv.org/abs/2504.20610
作者: Michele Garetto,Alessandro Cornacchia,Franco Galante,Emilio Leonardi,Alessandro Nordio,Alberto Tarable
机构: 未知
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Performance (cs.PF)
备注: To be presented at ACM SIGIR 25
Abstract:The advent of Large Language Models (LLMs) and generative AI is fundamentally transforming information retrieval and processing on the Internet, bringing both great potential and significant concerns regarding content authenticity and reliability. This paper presents a novel quantitative approach to shed light on the complex information dynamics arising from the growing use of generative AI tools. Despite their significant impact on the digital ecosystem, these dynamics remain largely uncharted and poorly understood. We propose a stochastic model to characterize the generation, indexing, and dissemination of information in response to new topics. This scenario particularly challenges current LLMs, which often rely on real-time Retrieval-Augmented Generation (RAG) techniques to overcome their static knowledge limitations. Our findings suggest that the rapid pace of generative AI adoption, combined with increasing user reliance, can outpace human verification, escalating the risk of inaccurate information proliferation across digital resources. An in-depth analysis of Stack Exchange data confirms that high-quality answers inevitably require substantial time and human effort to emerge. This underscores the considerable risks associated with generating persuasive text in response to new questions and highlights the critical need for responsible development and deployment of future generative AI tools.
zh
[AI-37] Inclusive Training Separation and Implicit Knowledge Interaction for Balanced Online Class-Incremental Learning
【速读】:该论文旨在解决在线类增量学习(Online Class-Incremental Learning, OCIL)中如何在持续更新模型的过程中,平衡新类知识(plasticity)与旧类知识(stability)的问题。现有方法通常依赖于显式的经验回放进行知识交互,并通过独占训练分离来缓解偏差问题,但难以实现良好的平衡,常导致可塑性下降或稳定性受限。论文提出的解决方案关键在于设计一种基于重放的新型方法——平衡在线增量学习(Balanced Online Incremental Learning, BOIL),其核心是采用双分类器的包容性训练分离策略,使新旧类知识能够有效融合,并通过隐式知识迁移机制在两个分类器之间传递信息,从而实现更高的可塑性与稳定性。
链接: https://arxiv.org/abs/2504.20566
作者: Shunjie Wen,Thomas Heinis,Dong-Wan Choi
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Under review
Abstract:Online class-incremental learning (OCIL) focuses on gradually learning new classes (called plasticity) from a stream of data in a single-pass, while concurrently preserving knowledge of previously learned classes (called stability). The primary challenge in OCIL lies in maintaining a good balance between the knowledge of old and new classes within the continually updated model. Most existing methods rely on explicit knowledge interaction through experience replay, and often employ exclusive training separation to address bias problems. Nevertheless, it still remains a big challenge to achieve a well-balanced learner, as these methods often exhibit either reduced plasticity or limited stability due to difficulties in continually integrating knowledge in the OCIL setting. In this paper, we propose a novel replay-based method, called Balanced Online Incremental Learning (BOIL), which can achieve both high plasticity and stability, thus ensuring more balanced performance in OCIL. Our BOIL method proposes an inclusive training separation strategy using dual classifiers so that knowledge from both old and new classes can effectively be integrated into the model, while introducing implicit approaches for transferring knowledge across the two classifiers. Extensive experimental evaluations over three widely-used OCIL benchmark datasets demonstrate the superiority of BOIL, showing more balanced yet better performance compared to state-of-the-art replay-based OCIL methods.
zh
[AI-38] Generate more than one child in your co-evolutionary semi-supervised learning GAN
【速读】:该论文旨在解决半监督学习(Semi-Supervised Learning, SSL)数据集中的样本生成问题,特别是通过改进生成对抗网络(Generative Adversarial Networks, GANs)的进化方法来提升其性能。论文提出的关键解决方案是设计一种新的协同进化方法——协同进化精英SSL-GAN(Co-evolutionary Elitist SSL-GAN, CE-SSLGAN),其核心在于采用随机交配种群结构、精英替换策略以及每代生成多个个体,以提高模型的稳定性和生成质量。
链接: https://arxiv.org/abs/2504.20560
作者: Francisco Sedeño,Jamal Toutouh,Francisco Chicano
机构: 未知
类目: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Submitted to The Leading European Event on Bio-Inspired AI (EvoStar 2025)
Abstract:Generative Adversarial Networks (GANs) are very useful methods to address semi-supervised learning (SSL) datasets, thanks to their ability to generate samples similar to real data. This approach, called SSL-GAN has attracted many researchers in the last decade. Evolutionary algorithms have been used to guide the evolution and training of SSL-GANs with great success. In particular, several co-evolutionary approaches have been applied where the two networks of a GAN (the generator and the discriminator) are evolved in separate populations. The co-evolutionary approaches published to date assume some spatial structure of the populations, based on the ideas of cellular evolutionary algorithms. They also create one single individual per generation and follow a generational replacement strategy in the evolution. In this paper, we re-consider those algorithmic design decisions and propose a new co-evolutionary approach, called Co-evolutionary Elitist SSL-GAN (CE-SSLGAN), with panmictic population, elitist replacement, and more than one individual in the offspring. We evaluate the performance of our proposed method using three standard benchmark datasets. The results show that creating more than one offspring per population and using elitism improves the results in comparison with a classical SSL-GAN.
zh
[AI-39] PRISM: Projection-based Reward Integration for Scene-Aware Real-to-Sim-to-Real Transfer with Few Demonstrations
【速读】:该论文旨在解决机器人在少量示例下学习鲁棒策略的问题,特别是在机器人初始位置和物体位姿变化情况下的泛化能力不足。与模仿学习相比,强化学习(Reinforcement Learning, RL)能够自主探索以获得更鲁棒的行为,但其在真实世界中的训练通常不切实际且存在安全隐患,而构建仿真环境则需要大量手动工作。为了解决这些挑战,论文提出了一种集成的“真实-仿真-真实”(real-to-sim-to-real)流程,其关键在于通过专家示例构建仿真环境,并利用视觉-语言模型(Vision-Language Model, VLM)监督的基于投影的奖励模型进行RL策略训练,同时结合专家示例进行微调,从而实现可靠的真实场景机器人控制策略部署。
链接: https://arxiv.org/abs/2504.20520
作者: Haowen Sun,Han Wang,Chengzhong Ma,Shaolong Zhang,Jiawei Ye,Xingyu Chen,Xuguang Lan
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:
Abstract:Learning from few demonstrations to develop policies robust to variations in robot initial positions and object poses is a problem of significant practical interest in robotics. Compared to imitation learning, which often struggles to generalize from limited samples, reinforcement learning (RL) can autonomously explore to obtain robust behaviors. Training RL agents through direct interaction with the real world is often impractical and unsafe, while building simulation environments requires extensive manual effort, such as designing scenes and crafting task-specific reward functions. To address these challenges, we propose an integrated real-to-sim-to-real pipeline that constructs simulation environments based on expert demonstrations by identifying scene objects from images and retrieving their corresponding 3D models from existing libraries. We introduce a projection-based reward model for RL policy training that is supervised by a vision-language model (VLM) using human-guided object projection relationships as prompts, with the policy further fine-tuned using expert demonstrations. In general, our work focuses on the construction of simulation environments and RL-based policy training, ultimately enabling the deployment of reliable robotic control policies in real-world scenarios.
zh
[AI-40] MuRAL: A Multi-Resident Ambient Sensor Dataset Annotated with Natural Language for Activities of Daily Living
【速读】:该论文试图解决现有用于人类活动识别(HAR)的传感器数据集(如CASAS、ARAS和MARBLE)在上下文丰富性、复杂性和标注粒度方面不足的问题,这些数据集并未为大规模语言模型(LLMs)设计,从而限制了其在利用LLMs进行自然语言推理和零样本学习方面的潜力。解决方案的关键在于引入MuRAL,这是一个首个包含多居民环境自然语言描述的多用户传感器数据集,其包含超过21小时的多用户数据,并通过细粒度自然语言描述、居民身份和高层活动标签进行标注,旨在支持LLMs在多居民场景下的活动理解任务。
链接: https://arxiv.org/abs/2504.20505
作者: Xi Chen(M-PSI),Julien Cumin,Fano Ramparany,Dominique Vaufreydaz(M-PSI)
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Recent advances in Large Language Models (LLMs) have shown promising potential for human activity recognition (HAR) using ambient sensors, especially through natural language reasoning and zero-shot learning. However, existing datasets such as CASAS, ARAS, and MARBLE were not originally designed with LLMs in mind and therefore lack the contextual richness, complexity, and annotation granularity required to fully exploit LLM capabilities. In this paper, we introduce MuRAL, the first Multi-Resident Ambient sensor dataset with natural Language, comprising over 21 hours of multi-user sensor data collected from 21 sessions in a smart-home environment. MuRAL is annotated with fine-grained natural language descriptions, resident identities, and high-level activity labels, all situated in dynamic, realistic multi-resident settings. We benchmark MuRAL using state-of-the-art LLMs for three core tasks: subject assignment, action description, and activity classification. Our results demonstrate that while LLMs can provide rich semantic interpretations of ambient data, current models still face challenges in handling multi-user ambiguity and under-specified sensor contexts. We release MuRAL to support future research on LLM-powered, explainable, and socially aware activity understanding in smart environments. For access to the dataset, please reach out to us via the provided contact information. A direct link for dataset retrieval will be made available at this location in due course.
zh
[AI-41] oken-Efficient Prompt Injection Attack: Provoking Cessation in LLM Reasoning via Adaptive Token Compression
【速读】:该论文试图解决大型语言模型(Large Language Models, LLMs)中存在的“思考停止”(thinking-stopped)安全漏洞问题,该漏洞使得模型生成的推理标记能够强制中断推理过程,导致空响应,从而影响LLM集成应用的安全性。解决方案的关键在于提出一种名为“推理中断攻击”(Reasoning Interruption Attack)的新型提示注入攻击方法,其核心是基于自适应标记压缩技术,以降低提示的token成本并形式化定义该漏洞。通过使用简单的独立算术任务作为攻击提示,该方法展示了更简洁的逻辑结构,并有效触发了漏洞,同时通过系统化的攻击提示收集和自适应压缩框架实现了高效的攻击效果。
链接: https://arxiv.org/abs/2504.20493
作者: Yu Cui,Yujun Cai,Yiwei Wang
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:
Abstract:While reasoning large language models (LLMs) demonstrate remarkable performance across various tasks, they also contain notable security vulnerabilities. Recent research has uncovered a “thinking-stopped” vulnerability in DeepSeek-R1, where model-generated reasoning tokens can forcibly interrupt the inference process, resulting in empty responses that compromise LLM-integrated applications. However, existing methods triggering this vulnerability require complex mathematical word problems with long prompts–even exceeding 5,000 tokens. To reduce the token cost and formally define this vulnerability, we propose a novel prompt injection attack named “Reasoning Interruption Attack”, based on adaptive token compression. We demonstrate that simple standalone arithmetic tasks can effectively trigger this vulnerability, and the prompts based on such tasks exhibit simpler logical structures than mathematical word problems. We develop a systematic approach to efficiently collect attack prompts and an adaptive token compression framework that utilizes LLMs to automatically compress these prompts. Experiments show our compression framework significantly reduces prompt length while maintaining effective attack capabilities. We further investigate the attack’s performance via output prefix and analyze the underlying causes of the vulnerability, providing valuable insights for improving security in reasoning LLMs.
zh
[AI-42] Group Relative Knowledge Distillation: Learning from Teachers Relational Inductive Bias
【速读】:该论文试图解决知识蒸馏中现有方法过于关注模仿教师模型的绝对概率分布,而忽视了教师模型相对预测中蕴含的有价值的关系归纳偏置,从而导致暴露偏差的问题。其解决方案的关键在于提出一种新的框架——组相对知识蒸馏(Group Relative Knowledge Distillation, GRKD),通过学习类别之间的相对排名来提炼教师知识,而非直接拟合绝对分布,具体引入了一种组相对损失函数,以鼓励学生模型保留教师输出提供的成对偏好顺序。
链接: https://arxiv.org/abs/2504.20482
作者: Chao Li,Changhua Zhou,Jia Chen
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Knowledge distillation typically transfers knowledge from a teacher model to a student model by minimizing differences between their output distributions. However, existing distillation approaches largely focus on mimicking absolute probabilities and neglect the valuable relational inductive biases embedded in the teacher’s relative predictions, leading to exposure bias. In this paper, we propose Group Relative Knowledge Distillation (GRKD), a novel framework that distills teacher knowledge by learning the relative ranking among classes, rather than directly fitting the absolute distribution. Specifically, we introduce a group relative loss that encourages the student model to preserve the pairwise preference orderings provided by the teacher’s outputs. Extensive experiments on classification benchmarks demonstrate that GRKD achieves superior generalization compared to existing methods, especially in tasks requiring fine-grained class differentiation. Our method provides a new perspective on exploiting teacher knowledge, focusing on relational structure rather than absolute likelihood.
zh
[AI-43] he Estimation of Continual Causal Effect for Dataset Shifting Streams
【速读】:该论文旨在解决在线环境中由于用户行为和领域分布随时间变化导致的数据集偏移(dataset shift)问题,从而提升因果效应估计的性能。其解决方案的关键在于提出一种增量因果效应与代理知识蒸馏框架(Incremental Causal Effect with Proxy Knowledge Distillation, ICE-PKD),该框架包含两个核心组件:一是利用反事实回归消除混杂偏差的多处理上行网络;二是通过最新数据更新并借助基于重放的知识蒸馏保护泛化能力的增量训练策略。
链接: https://arxiv.org/abs/2504.20471
作者: Baining Chen,Yiming Zhang,Yuqiao Han,Ruyue Zhang,Ruihuan Du,Zhishuo Zhou,Zhengdan Zhu,Xun Liu,Jiecheng Guo
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Methodology (stat.ME)
备注:
Abstract:Causal effect estimation has been widely used in marketing optimization. The framework of an uplift model followed by a constrained optimization algorithm is popular in practice. To enhance performance in the online environment, the framework needs to be improved to address the complexities caused by temporal dataset shift. This paper focuses on capturing the dataset shift from user behavior and domain distribution changing over time. We propose an Incremental Causal Effect with Proxy Knowledge Distillation (ICE-PKD) framework to tackle this challenge. The ICE-PKD framework includes two components: (i) a multi-treatment uplift network that eliminates confounding bias using counterfactual regression; (ii) an incremental training strategy that adapts to the temporal dataset shift by updating with the latest data and protects generalization via replay-based knowledge distillation. We also revisit the uplift modeling metrics and introduce a novel metric for more precise online evaluation in multiple treatment scenarios. Extensive experiments on both simulated and online datasets show that the proposed framework achieves better performance. The ICE-PKD framework has been deployed in the marketing system of Huaxiaozhu, a ride-hailing platform in China.
zh
[AI-44] A Summary on GUI Agents with Foundation Models Enhanced by Reinforcement Learning
【速读】:该论文旨在解决如何提升图形用户界面(Graphical User Interface, GUI)代理在复杂现实环境中的泛化能力和鲁棒性问题。其解决方案的关键在于利用多模态大语言模型(Multi-modal Large Language Models, MLLMs)增强GUI代理的架构,并通过强化学习(Reinforcement Learning, RL)方法优化其决策与行动能力,从而实现从简单的提示工程向动态策略学习的演进。
链接: https://arxiv.org/abs/2504.20464
作者: Jiahao Li,Kaer Huang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Graphical User Interface (GUI) agents, driven by Multi-modal Large Language Models (MLLMs), have emerged as a promising paradigm for enabling intelligent interaction with digital systems. This paper provides a structured summary of recent advances in GUI agents, focusing on architectures enhanced by Reinforcement Learning (RL). We first formalize GUI agent tasks as Markov Decision Processes and discuss typical execution environments and evaluation metrics. We then review the modular architecture of (M)LLM-based GUI agents, covering Perception, Planning, and Acting modules, and trace their evolution through representative works. Furthermore, we categorize GUI agent training methodologies into Prompt-based, Supervised Fine-Tuning (SFT)-based, and RL-based approaches, highlighting the progression from simple prompt engineering to dynamic policy learning via RL. Our summary illustrates how recent innovations in multimodal perception, decision reasoning, and adaptive action generation have significantly improved the generalization and robustness of GUI agents in complex real-world environments. We conclude by identifying key challenges and future directions for building more capable and reliable GUI agents.
zh
[AI-45] AMO:Fine-Grained Root Cause Analysis via Tool-Assisted LLM Agent with Multi-Modality Observation Data
【速读】:该论文试图解决分布式系统、微服务和云原生技术带来的系统复杂性和运维挑战,特别是在传统根因分析(Root Cause Analysis, RCA)方法难以实现自动化故障响应的问题。其解决方案的关键在于提出一种基于多模态观测数据的工具辅助大型语言模型(Large Language Models, LLM)代理工具TAMO,通过统一时间对齐的多模态数据表示提取一致特征,并结合专用的根因定位与故障分类工具,以增强对上下文环境的感知能力,从而克服LLM在处理实时变化的服务依赖关系和原始观测数据方面的局限性。
链接: https://arxiv.org/abs/2504.20462
作者: Qi Wang,Xiao Zhang,Mingyi Li,Yuan Yuan,Mengbai Xiao,Fuzhen Zhuang,Dongxiao Yu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:With the development of distributed systems, microservices and cloud native technologies have become central to modern enterprise software development. Despite bringing significant advantages, these technologies also increase system complexity and operational challenges. Traditional root cause analysis (RCA) struggles to achieve automated fault response, heavily relying on manual intervention. In recent years, large language models (LLMs) have made breakthroughs in contextual inference and domain knowledge integration, providing new solutions for Artificial Intelligence for Operations (AIOps). However, Existing LLM-based approaches face three key challenges: text input constraints, dynamic service dependency hallucinations, and context window limitations. To address these issues, we propose a tool-assisted LLM agent with multi-modality observation data, namely TAMO, for fine-grained RCA. It unifies multi-modal observational data into time-aligned representations to extract consistent features and employs specialized root cause localization and fault classification tools for perceiving the contextual environment. This approach overcomes the limitations of LLM in handling real-time changing service dependencies and raw observational data and guides LLM to generate repair strategies aligned with system contexts by structuring key information into a prompt. Experimental results show that TAMO performs well in root cause analysis when dealing with public datasets characterized by heterogeneity and common fault types, demonstrating its effectiveness.
zh
[AI-46] Enhancing News Recommendation with Hierarchical LLM Prompting
【速读】:该论文试图解决个性化新闻推荐系统在捕捉用户偏好复杂性方面的不足,因为现有系统过度依赖浅层表示(如文章标题和摘要)。其解决方案的关键在于引入PNR-LLM方法,该方法利用大语言模型(Large Language Models, LLMs)的生成能力对新闻标题和摘要进行丰富,从而提升推荐质量。PNR-LLM包含一个新颖的模块——基于LLMs的新闻增强模块,能够从文章中生成更深层次的语义信息和相关实体,将浅层内容转化为更丰富的表示,并通过注意力机制聚合语义和实体级别的数据,形成统一的用户与新闻嵌入,实现更精准的用户-新闻匹配。
链接: https://arxiv.org/abs/2504.20452
作者: Hai-Dang Kieu,Delvin Ce Zhang,Minh Duc Nguyen,Min Xu,Qiang Wu,Dung D. Le
机构: 未知
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注:
Abstract:Personalized news recommendation systems often struggle to effectively capture the complexity of user preferences, as they rely heavily on shallow representations, such as article titles and abstracts. To address this problem, we introduce a novel method, namely PNR-LLM, for Large Language Models for Personalized News Recommendation. Specifically, PNR-LLM harnesses the generation capabilities of LLMs to enrich news titles and abstracts, and consequently improves recommendation quality. PNR-LLM contains a novel module, News Enrichment via LLMs, which generates deeper semantic information and relevant entities from articles, transforming shallow contents into richer representations. We further propose an attention mechanism to aggregate enriched semantic- and entity-level data, forming unified user and news embeddings that reveal a more accurate user-news match. Extensive experiments on MIND datasets show that PNR-LLM outperforms state-of-the-art baselines. Moreover, the proposed data enrichment module is model-agnostic, and we empirically show that applying our proposed module to multiple existing models can further improve their performance, verifying the advantage of our design.
zh
[AI-47] APG-MOS: Auditory Perception Guided-MOS Predictor for Synthetic Speech
【速读】:该论文旨在解决传统深度学习模型在自动语音质量评估中对基本听觉感知机制的忽视,从而导致与人类判断不一致的问题。解决方案的关键在于提出一种基于听觉感知引导的MOS预测模型(APG-MOS),该模型通过融合听觉建模与语义分析,增强模型与人类判断的一致性。具体而言,其核心包括:基于生物听觉机制设计的感知模块以模拟耳蜗功能,利用残差向量量化方法进行语义失真建模,以及结合残差交叉注意力结构和渐进式学习策略实现多模态特征融合。
链接: https://arxiv.org/abs/2504.20447
作者: Zhicheng Lian,Lizhi Wang,Hua Huang
机构: 未知
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
备注:
Abstract:Automatic speech quality assessment aims to quantify subjective human perception of speech through computational models to reduce the need for labor-consuming manual evaluations. While models based on deep learning have achieved progress in predicting mean opinion scores (MOS) to assess synthetic speech, the neglect of fundamental auditory perception mechanisms limits consistency with human judgments. To address this issue, we propose an auditory perception guided-MOS prediction model (APG-MOS) that synergistically integrates auditory modeling with semantic analysis to enhance consistency with human judgments. Specifically, we first design a perceptual module, grounded in biological auditory mechanisms, to simulate cochlear functions, which encodes acoustic signals into biologically aligned electrochemical representations. Secondly, we propose a residual vector quantization (RVQ)-based semantic distortion modeling method to quantify the degradation of speech quality at the semantic level. Finally, we design a residual cross-attention architecture, coupled with a progressive learning strategy, to enable multimodal fusion of encoded electrochemical signals and semantic representations. Experiments demonstrate that APG-MOS achieves superior performance on two primary benchmarks. Our code and checkpoint will be available on a public repository upon publication.
zh
[AI-48] Head-Tail-Aware KL Divergence in Knowledge Distillation for Spiking Neural Networks IJCNN2025
【速读】:该论文试图解决脉冲神经网络(Spiking Neural Networks, SNNs)在性能上与人工神经网络(Artificial Neural Networks, ANNs)存在差距的问题,尤其是在能量效率和生物合理性方面。解决方案的关键在于提出一种新的知识蒸馏(Knowledge Distillation, KD)方法——Head-Tail Aware Kullback-Leibler (HTA-KL) 散度,该方法通过引入基于累积概率的掩码动态区分高概率和低概率区域,并分配自适应权重以实现平衡的知识迁移,从而有效对齐分布的头部和尾部区域,提升SNN的整体性能。
链接: https://arxiv.org/abs/2504.20445
作者: Tianqing Zhang,Zixin Zhu,Kairong Yu,Hongwei Wang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Accepted by IJCNN2025
Abstract:Spiking Neural Networks (SNNs) have emerged as a promising approach for energy-efficient and biologically plausible computation. However, due to limitations in existing training methods and inherent model constraints, SNNs often exhibit a performance gap when compared to Artificial Neural Networks (ANNs). Knowledge distillation (KD) has been explored as a technique to transfer knowledge from ANN teacher models to SNN student models to mitigate this gap. Traditional KD methods typically use Kullback-Leibler (KL) divergence to align output distributions. However, conventional KL-based approaches fail to fully exploit the unique characteristics of SNNs, as they tend to overemphasize high-probability predictions while neglecting low-probability ones, leading to suboptimal generalization. To address this, we propose Head-Tail Aware Kullback-Leibler (HTA-KL) divergence, a novel KD method for SNNs. HTA-KL introduces a cumulative probability-based mask to dynamically distinguish between high- and low-probability regions. It assigns adaptive weights to ensure balanced knowledge transfer, enhancing the overall performance. By integrating forward KL (FKL) and reverse KL (RKL) divergence, our method effectively align both head and tail regions of the distribution. We evaluate our methods on CIFAR-10, CIFAR-100 and Tiny ImageNet datasets. Our method outperforms existing methods on most datasets with fewer timesteps.
zh
[AI-49] GaLore 2: Large-Scale LLM Pre-Training by Gradient Low-Rank Projection
【速读】:该论文旨在解决大规模语言模型(Large Language Models, LLMs)在训练过程中面临的显著内存瓶颈问题。其解决方案的关键在于GaLore,即梯度低秩投影(Gradient Low-Rank Projection),通过利用权重梯度的固有低秩结构,在不牺牲性能的前提下实现显著的内存节省。本文进一步提出了GaLore 2,一个高效且可扩展的框架,解决了原有方法在子空间更新中的计算开销以及与先进训练并行策略(如FSDP)集成的挑战,并通过使用最多5000亿个训练标记对Llama 7B进行从头预训练,验证了其在实际LLM预训练场景中的潜力。
链接: https://arxiv.org/abs/2504.20437
作者: DiJia Su,Andrew Gu,Jane Xu,Yuandong Tian,Jiawei Zhao
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Large language models (LLMs) have revolutionized natural language understanding and generation but face significant memory bottlenecks during training. GaLore, Gradient Low-Rank Projection, addresses this issue by leveraging the inherent low-rank structure of weight gradients, enabling substantial memory savings without sacrificing performance. Recent works further extend GaLore from various aspects, including low-bit quantization and higher-order tensor structures. However, there are several remaining challenges for GaLore, such as the computational overhead of SVD for subspace updates and the integration with state-of-the-art training parallelization strategies (e.g., FSDP). In this paper, we present GaLore 2, an efficient and scalable GaLore framework that addresses these challenges and incorporates recent advancements. In addition, we demonstrate the scalability of GaLore 2 by pre-training Llama 7B from scratch using up to 500 billion training tokens, highlighting its potential impact on real LLM pre-training scenarios.
zh
[AI-50] ARCS: Agent ic Retrieval-Augmented Code Synthesis with Iterative Refinement
【速读】:该论文旨在解决超算环境中高效且优化的代码生成问题,以充分发挥高性能系统的能力。其解决方案的关键在于提出了一种名为Agentic Retrieval-Augmented Code Synthesis (ARCS)的框架,该框架将检索增强生成(Retrieval-Augmented Generation, RAG)与思维链(Chain-of-Thought, CoT)推理相结合,通过系统分解和迭代优化复杂编程任务,实现准确、稳健和高效的代码生成、补全与翻译。
链接: https://arxiv.org/abs/2504.20434
作者: Manish Bhattarai,Miguel Cordova,Javier Santos,Dan O’Malley
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:
Abstract:In supercomputing, efficient and optimized code generation is essential to leverage high-performance systems effectively. We propose Agentic Retrieval-Augmented Code Synthesis (ARCS), an advanced framework for accurate, robust, and efficient code generation, completion, and translation. ARCS integrates Retrieval-Augmented Generation (RAG) with Chain-of-Thought (CoT) reasoning to systematically break down and iteratively refine complex programming tasks. An agent-based RAG mechanism retrieves relevant code snippets, while real-time execution feedback drives the synthesis of candidate solutions. This process is formalized as a state-action search tree optimization, balancing code correctness with editing efficiency. Evaluations on the Geeks4Geeks and HumanEval benchmarks demonstrate that ARCS significantly outperforms traditional prompting methods in translation and generation quality. By enabling scalable and precise code synthesis, ARCS offers transformative potential for automating and optimizing code development in supercomputing applications, enhancing computational resource utilization.
zh
[AI-51] RV-Syn: Rational and Verifiable Mathematical Reasoning Data Synthesis based on Structured Function Library
【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在数学领域中对高质量推理数据的需求问题,现有数据合成方法在生成过程中难以掌握问题的内在逻辑并确保解题过程的可验证性。其解决方案的关键在于提出RV-Syn方法,该方法通过构建基于初始种子问题的结构化数学运算函数库,并利用该库中的Python格式函数生成计算图作为解题方案,进而将计算图反向翻译为复杂问题,实现基于解的逻辑感知问题生成,同时计算图的可执行性保障了解题过程的可验证性。
链接: https://arxiv.org/abs/2504.20426
作者: Jiapeng Wang,Jinhao Jiang,Zhiqiang Zhang,Jun Zhou,Wayne Xin Zhao
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:The advancement of reasoning capabilities in Large Language Models (LLMs) requires substantial amounts of high-quality reasoning data, particularly in mathematics. Existing data synthesis methods, such as data augmentation from annotated training sets or direct question generation based on relevant knowledge points and documents, have expanded datasets but face challenges in mastering the inner logic of the problem during generation and ensuring the verifiability of the solutions. To address these issues, we propose RV-Syn, a novel Rational and Verifiable mathematical Synthesis approach. RV-Syn constructs a structured mathematical operation function library based on initial seed problems and generates computational graphs as solutions by combining Python-formatted functions from this library. These graphs are then back-translated into complex problems. Based on the constructed computation graph, we achieve solution-guided logic-aware problem generation. Furthermore, the executability of the computational graph ensures the verifiability of the solving process. Experimental results show that RV-Syn surpasses existing synthesis methods, including those involving human-generated problems, achieving greater efficient data scaling. This approach provides a scalable framework for generating high-quality reasoning datasets.
zh
[AI-52] CrashFixer: A crash resolution agent for the Linux kernel
【速读】:该论文旨在解决在大规模系统级Linux内核漏洞中应用生成式AI(Generative AI)进行代码修复的挑战,特别是针对现有基准测试规模有限的问题。其关键解决方案是基于内核开发者的典型工作流程,识别专家开发者解决内核崩溃所需的核心能力,并在此基础上对kGym平台进行改进,构建出可支持大规模Linux内核(50K文件和20M行代码)运行LLM修复代理的kGymSuite平台,从而实现对复杂内核漏洞的有效修复策略评估与实践。
链接: https://arxiv.org/abs/2504.20412
作者: Alex Mathai,Chenxi Huang,Suwei Ma,Jihwan Kim,Hailie Mitchell,Aleksandr Nogikh,Petros Maniatis,Franjo Ivančić,Junfeng Yang,Baishakhi Ray
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Operating Systems (cs.OS)
备注:
Abstract:Code large language models (LLMs) have shown impressive capabilities on a multitude of software engineering tasks. In particular, they have demonstrated remarkable utility in the task of code repair. However, common benchmarks used to evaluate the performance of code LLMs are often limited to small-scale settings. In this work, we build upon kGym, which shares a benchmark for system-level Linux kernel bugs and a platform to run experiments on the Linux kernel. This paper introduces CrashFixer, the first LLM-based software repair agent that is applicable to Linux kernel bugs. Inspired by the typical workflow of a kernel developer, we identify the key capabilities an expert developer leverages to resolve a kernel crash. Using this as our guide, we revisit the kGym platform and identify key system improvements needed to practically run LLM-based agents at the scale of the Linux kernel (50K files and 20M lines of code). We implement these changes by extending kGym to create an improved platform - called kGymSuite, which will be open-sourced. Finally, the paper presents an evaluation of various repair strategies for such complex kernel bugs and showcases the value of explicitly generating a hypothesis before attempting to fix bugs in complex systems such as the Linux kernel. We also evaluated CrashFixer’s capabilities on still open bugs, and found at least two patch suggestions considered plausible to resolve the reported bug. Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Operating Systems (cs.OS) Cite as: arXiv:2504.20412 [cs.SE] (or arXiv:2504.20412v1 [cs.SE] for this version) https://doi.org/10.48550/arXiv.2504.20412 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh
[AI-53] FourierSpecNet: Neural Collision Operator Approximation Inspired by the Fourier Spectral Method for Solving the Boltzmann Equation
【速读】:该论文旨在解决玻尔兹曼方程(Boltzmann equation)数值求解中计算成本高、尤其在非弹性碰撞和高维速度空间下效率低的问题。其解决方案的关键在于提出一种混合框架——傅里叶谱神经网络(Fourier Neural Spectral Network, FourierSpecNet),该框架将傅里叶谱方法与深度学习相结合,在傅里叶空间中高效近似碰撞算子,实现了分辨率无关的学习和零样本超分辨率,从而在无需重新训练的情况下准确预测未见过的分辨率场景。
链接: https://arxiv.org/abs/2504.20408
作者: Jae Yong Lee,Gwang Jae Jung,Byung Chan Lim,Hyung Ju Hwang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Numerical Analysis (math.NA); Computational Physics (physics.comp-ph)
备注: 27 pages, 11 figures
Abstract:The Boltzmann equation, a fundamental model in kinetic theory, describes the evolution of particle distribution functions through a nonlinear, high-dimensional collision operator. However, its numerical solution remains computationally demanding, particularly for inelastic collisions and high-dimensional velocity domains. In this work, we propose the Fourier Neural Spectral Network (FourierSpecNet), a hybrid framework that integrates the Fourier spectral method with deep learning to approximate the collision operator in Fourier space efficiently. FourierSpecNet achieves resolution-invariant learning and supports zero-shot super-resolution, enabling accurate predictions at unseen resolutions without retraining. Beyond empirical validation, we establish a consistency result showing that the trained operator converges to the spectral solution as the discretization is refined. We evaluate our method on several benchmark cases, including Maxwellian and hard-sphere molecular models, as well as inelastic collision scenarios. The results demonstrate that FourierSpecNet offers competitive accuracy while significantly reducing computational cost compared to traditional spectral solvers. Our approach provides a robust and scalable alternative for solving the Boltzmann equation across both elastic and inelastic regimes.
zh
[AI-54] Skill Discovery for Software Scripting Automation via Offline Simulations with LLM s
【速读】:该论文试图解决传统脚本编写需要编程技能和特定API熟悉度所带来的用户使用障碍问题,以及运行时代码生成中存在的代码未经验证、安全风险、响应时间长和计算成本高等限制。其解决方案的关键在于提出一种离线仿真框架,通过利用大型语言模型(Large Language Models, LLMs)和公开的脚本指南,构建一个经过验证的软件专用技能集(software-specific skillset)。该框架包含两个核心组件:任务创建与技能生成,其中通过图神经网络(Graph Neural Network, GNN)-based链接预测模型捕捉API协同效应,从而提升技能生成的多样性和有效性。
链接: https://arxiv.org/abs/2504.20406
作者: Paiheng Xu,Gang Wu,Xiang Chen,Tong Yu,Chang Xiao,Franck Dernoncourt,Tianyi Zhou,Wei Ai,Viswanathan Swaminathan
机构: 未知
类目: Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
备注:
Abstract:Scripting interfaces enable users to automate tasks and customize software workflows, but creating scripts traditionally requires programming expertise and familiarity with specific APIs, posing barriers for many users. While Large Language Models (LLMs) can generate code from natural language queries, runtime code generation is severely limited due to unverified code, security risks, longer response times, and higher computational costs. To bridge the gap, we propose an offline simulation framework to curate a software-specific skillset, a collection of verified scripts, by exploiting LLMs and publicly available scripting guides. Our framework comprises two components: (1) task creation, using top-down functionality guidance and bottom-up API synergy exploration to generate helpful tasks; and (2) skill generation with trials, refining and validating scripts based on execution feedback. To efficiently navigate the extensive API landscape, we introduce a Graph Neural Network (GNN)-based link prediction model to capture API synergy, enabling the generation of skills involving underutilized APIs and expanding the skillset’s diversity. Experiments with Adobe Illustrator demonstrate that our framework significantly improves automation success rates, reduces response time, and saves runtime token costs compared to traditional runtime code generation. This is the first attempt to use software scripting interfaces as a testbed for LLM-based systems, highlighting the advantages of leveraging execution feedback in a controlled environment and offering valuable insights into aligning AI capabilities with user needs in specialized software domains.
zh
[AI-55] AKIBoards: A Structure-Following Multiagent System for Predicting Acute Kidney Injury AAMAS
【速读】:该论文试图解决多智能体系统(MAS)中如何通过学习和利用全局模型来提升诊断推理性能的问题。其关键在于引入STRUC-MAS框架,该框架能够自动化学习全局模型,并将其作为先验信念嵌入到多智能体系统中,以实现结构跟随的协作推理。通过在急性肾损伤预测任务中的实验验证,表明引入全局结构可以显著提高模型的预测性能,并且智能体在交互过程中能够调整其置信度,反映出信念的强化或重构。
链接: https://arxiv.org/abs/2504.20368
作者: David Gordon,Panayiotis Petousis,Susanne B. Nicholas,Alex A.T. Bui
机构: 未知
类目: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Accepted at International Conference on Autonomous Agents and Multiagent Systems (AAMAS) Workshop, 2025
Abstract:Diagnostic reasoning entails a physician’s local (mental) model based on an assumed or known shared perspective (global model) to explain patient observations with evidence assigned towards a clinical assessment. But in several (complex) medical situations, multiple experts work together as a team to optimize health evaluation and decision-making by leveraging different perspectives. Such consensus-driven reasoning reflects individual knowledge contributing toward a broader perspective on the patient. In this light, we introduce STRUCture-following for Multiagent Systems (STRUC-MAS), a framework automating the learning of these global models and their incorporation as prior beliefs for agents in multiagent systems (MAS) to follow. We demonstrate proof of concept with a prosocial MAS application for predicting acute kidney injuries (AKIs). In this case, we found that incorporating a global structure enabled multiple agents to achieve better performance (average precision, AP) in predicting AKI 48 hours before onset (structure-following-fine-tuned, SF-FT, AP=0.195; SF-FT-retrieval-augmented generation, SF-FT-RAG, AP=0.194) vs. baseline (non-structure-following-FT, NSF-FT, AP=0.141; NSF-FT-RAG, AP=0.180) for balanced precision-weighted-recall-weighted voting. Markedly, SF-FT agents with higher recall scores reported lower confidence levels in the initial round on true positive and false negative cases. But after explicit interactions, their confidence in their decisions increased (suggesting reinforced belief). In contrast, the SF-FT agent with the lowest recall decreased its confidence in true positive and false negative cases (suggesting a new belief). This approach suggests that learning and leveraging global structures in MAS is necessary prior to achieving competitive classification and diagnostic reasoning performance.
zh
[AI-56] Automated Unit Test Case Generation: A Systematic Literature Review
【速读】:该论文试图解决自动化软件测试中存在的一些挑战,特别是在遗传算法(Genetic Algorithm)和粒子群优化(Particle Swarm Optimisation)方面的改进不足以及当前自动化测试面临的知识空白。其解决方案的关键在于通过系统文献综述整合现有的进化方法及其改进策略,包括混合算法组合、与变异测试和神经网络的互操作性,同时探讨这些算法中使用的主测试准则及当前面临的可读性、模拟等问题。
链接: https://arxiv.org/abs/2504.20357
作者: Jason Wang,Basem Suleiman,Muhammad Johan Alibasa
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:
Abstract:Software is omnipresent within all factors of society. It is thus important to ensure that software are well tested to mitigate bad user experiences as well as the potential for severe financial and human losses. Software testing is however expensive and absorbs valuable time and resources. As a result, the field of automated software testing has grown of interest to researchers in past decades. In our review of present and past research papers, we have identified an information gap in the areas of improvement for the Genetic Algorithm and Particle Swarm Optimisation. A gap in knowledge in the current challenges that face automated testing has also been identified. We therefore present this systematic literature review in an effort to consolidate existing knowledge in regards to the evolutionary approaches as well as their improvements and resulting limitations. These improvements include hybrid algorithm combinations as well as interoperability with mutation testing and neural networks. We will also explore the main test criterion that are used in these algorithms alongside the challenges currently faced in the field related to readability, mocking and more.
zh
[AI-57] CarbonCall: Sustainability-Aware Function Calling for Large Language Models on Edge Devices
【速读】:该论文试图解决大型语言模型(Large Language Models, LLMs)在边缘人工智能系统中进行实时功能调用时带来的计算开销过大、高能耗和碳排放的问题。现有方法虽然优化了性能,但忽略了可持续性,导致在能源受限环境中效率低下。解决方案的关键在于提出CarbonCall框架,该框架集成了动态工具选择、碳感知执行和量化LLM适应,通过根据实时碳强度预测调整功率阈值,并在不同模型变体之间切换,以在功率限制下维持高每秒令牌吞吐量,从而有效降低碳排放、功耗和执行时间。
链接: https://arxiv.org/abs/2504.20348
作者: Varatheepan Paramanayakam,Andreas Karatzas,Iraklis Anagnostopoulos,Dimitrios Stamoulis
机构: 未知
类目: Performance (cs.PF); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
备注:
Abstract:Large Language Models (LLMs) enable real-time function calling in edge AI systems but introduce significant computational overhead, leading to high power consumption and carbon emissions. Existing methods optimize for performance while neglecting sustainability, making them inefficient for energy-constrained environments. We introduce CarbonCall, a sustainability-aware function-calling framework that integrates dynamic tool selection, carbon-aware execution, and quantized LLM adaptation. CarbonCall adjusts power thresholds based on real-time carbon intensity forecasts and switches between model variants to sustain high tokens-per-second throughput under power constraints. Experiments on an NVIDIA Jetson AGX Orin show that CarbonCall reduces carbon emissions by up to 52%, power consumption by 30%, and execution time by 30%, while maintaining high efficiency.
zh
[AI-58] Narrative-Centered Emotional Reflection: Scaffolding Autonomous Emotional Literacy with AI
【速读】:该论文试图解决传统情感计算在促进情感素养和心理成长方面的局限性,即现有系统多局限于基础的情感分类,缺乏对深层次情绪探索与价值导向行为规划的支持。解决方案的关键在于构建一个基于表达性写作、认知重构、自我决定理论和批判意识发展的结构化情感自我反思平台,通过实时情绪检测、分层反思提示和隐喻叙事生成,实现用户从表层情绪识别到价值一致行动规划的渐进式情感探索。
链接: https://arxiv.org/abs/2504.20342
作者: Shou-Tzu Han
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注: 10 pages, 5 figures, preliminary results, early-stage work intended for future conference submission
Abstract:Reflexion is an AI-powered platform designed to enable structured emotional self-reflection at scale. By integrating real-time emotion detection, layered reflective prompting, and metaphorical storytelling generation, Reflexion empowers users to engage in autonomous emotional exploration beyond basic sentiment categorization. Grounded in theories of expressive writing, cognitive restructuring, self-determination, and critical consciousness development, the system scaffolds a progressive journey from surface-level emotional recognition toward value-aligned action planning. Initial pilot studies with diverse participants demonstrate positive outcomes in emotional articulation, cognitive reframing, and perceived psychological resilience. Reflexion represents a promising direction for scalable, theory-informed affective computing interventions aimed at fostering emotional literacy and psychological growth across educational, therapeutic, and public health contexts.
zh
[AI-59] Leverag ing Action Relational Structures for Integrated Learning and Planning ICAPS2025
【速读】:该论文试图解决传统规划方法在与学习系统结合时搜索算法适应性不足的问题,以及在高分支因子任务中效率较低的挑战。其解决方案的关键在于引入部分空间搜索(partial-space search),该方法利用PDDL动作模式的关联结构,提供更细粒度的搜索空间,并允许在状态空间搜索之前更早地剪枝劣质动作。此外,通过引入动作集启发式(action set heuristics)来指导部分空间搜索,并结合大规模训练数据进行优化,从而提升了搜索效率和规划性能。
链接: https://arxiv.org/abs/2504.20318
作者: Ryan Xiao Wang,Felipe Trevizan
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Extended version of ICAPS 2025 paper
Abstract:Recent advances in planning have explored using learning methods to help planning. However, little attention has been given to adapting search algorithms to work better with learning systems. In this paper, we introduce partial-space search, a new search space for classical planning that leverages the relational structure of actions given by PDDL action schemas – a structure overlooked by traditional planning approaches. Partial-space search provides a more granular view of the search space and allows earlier pruning of poor actions compared to state-space search. To guide partial-space search, we introduce action set heuristics that evaluate sets of actions in a state. We describe how to automatically convert existing heuristics into action set heuristics. We also train action set heuristics from scratch using large training datasets from partial-space search. Our new planner, LazyLifted, exploits our better integrated search and learning heuristics and outperforms the state-of-the-art ML-based heuristic on IPC 2023 learning track (LT) benchmarks. We also show the efficiency of LazyLifted on high-branching factor tasks and show that it surpasses LAMA in the combined IPC 2023 LT and high-branching factor benchmarks.
zh
[AI-60] Perturbation-efficient Zeroth-order Optimization for Hardware-friendly On-device Training
【速读】:该论文旨在解决零阶(Zeroth-order, ZO)优化在硬件实现中的关键问题,即其对大量高斯随机数生成的依赖,这在FPGA和ASIC等硬件平台上带来了显著的计算和功耗挑战。论文提出的解决方案关键在于设计了一种扰动高效的ZO框架PeZO,通过引入随机数复用策略以减少随机数生成需求,并采用硬件友好的自适应缩放方法,将高斯分布替换为均匀分布,从而有效降低了逻辑单元(LUTs)和触发器(FFs)的消耗,并大幅节省了功耗,同时保持了训练性能。
链接: https://arxiv.org/abs/2504.20314
作者: Qitao Tan,Sung-En Chang,Rui Xia,Huidong Ji,Chence Yang,Ci Zhang,Jun Liu,Zheng Zhan,Zhou Zou,Yanzhi Wang,Jin Lu,Geng Yuan
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Zeroth-order (ZO) optimization is an emerging deep neural network (DNN) training paradigm that offers computational simplicity and memory savings. However, this seemingly promising approach faces a significant and long-ignored challenge. ZO requires generating a substantial number of Gaussian random numbers, which poses significant difficulties and even makes it infeasible for hardware platforms, such as FPGAs and ASICs. In this paper, we identify this critical issue, which arises from the mismatch between algorithm and hardware designers. To address this issue, we proposed PeZO, a perturbation-efficient ZO framework. Specifically, we design random number reuse strategies to significantly reduce the demand for random number generation and introduce a hardware-friendly adaptive scaling method to replace the costly Gaussian distribution with a uniform distribution. Our experiments show that PeZO reduces the required LUTs and FFs for random number generation by 48.6% and 12.7%, and saves at maximum 86% power consumption, all without compromising training performance, making ZO optimization feasible for on-device training. To the best of our knowledge, we are the first to explore the potential of on-device ZO optimization, providing valuable insights for future research.
zh
[AI-61] A Cryptographic Perspective on Mitigation vs. Detection in Machine Learning
【速读】:该论文试图解决在机器学习算法推理阶段,由攻击者生成的对抗性输入的检测与缓解问题。其核心在于形式化定义两种防御机制:通过检测(Defense by Detection, DbD)和通过缓解(Defense by Mitigation, DbM),并分析它们在不同机器学习任务中的有效性。研究的关键在于证明在分类任务中DbD与DbM是等价的,而在生成式学习任务中则存在本质差异,这通过构造一个在假设存在身份基全同态加密(IB-FHE)、可公开验证的零知识简洁非交互式知识论证(zk-SNARK)和强不可伪造签名的前提下,可被缓解但无法被检测的生成式学习任务来实现。该解决方案的核心在于利用缓解阶段所需样本显著少于初始训练算法的特点,从而实现有效的防御。
链接: https://arxiv.org/abs/2504.20310
作者: Greg Gluch,Shafi Goldwasser
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
备注: 29 pages
Abstract:In this paper, we initiate a cryptographically inspired theoretical study of detection versus mitigation of adversarial inputs produced by attackers of Machine Learning algorithms during inference time. We formally define defense by detection (DbD) and defense by mitigation (DbM). Our definitions come in the form of a 3-round protocol between two resource-bounded parties: a trainer/defender and an attacker. The attacker aims to produce inference-time inputs that fool the training algorithm. We define correctness, completeness, and soundness properties to capture successful defense at inference time while not degrading (too much) the performance of the algorithm on inputs from the training distribution. We first show that achieving DbD and achieving DbM are equivalent for ML classification tasks. Surprisingly, this is not the case for ML generative learning tasks, where there are many possible correct outputs that can be generated for each input. We show a separation between DbD and DbM by exhibiting a generative learning task for which is possible to defend by mitigation but is provably impossible to defend by detection under the assumption that the Identity-Based Fully Homomorphic Encryption (IB-FHE), publicly-verifiable zero-knowledge Succinct Non-Interactive Arguments of Knowledge (zk-SNARK) and Strongly Unforgeable Signatures exist. The mitigation phase uses significantly fewer samples than the initial training algorithm. Comments: 29 pages Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR) Cite as: arXiv:2504.20310 [cs.LG] (or arXiv:2504.20310v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2504.20310 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh
[AI-62] he Dark Side of Digital Twins: Adversarial Attacks on AI-Driven Water Forecasting
【速读】:该论文试图解决生成式 AI (Generative AI) 在数字孪生 (Digital Twins, DTs) 应用中的安全问题,特别是机器学习模型在面对对抗攻击时的脆弱性。解决方案的关键在于引入学习自动机 (Learning Automata, LA) 和基于随机 LA 的方法,通过动态调整扰动来增强对抗攻击的隐蔽性,从而评估并揭示现有预测模型的安全隐患。实验结果表明,该方法显著降低了预测可靠性,凸显了在 AI 驱动的 DTs 中加强网络安全防护的紧迫性。
链接: https://arxiv.org/abs/2504.20295
作者: Mohammadhossein Homaei,Victor Gonzalez Morales,Oscar Mogollon-Gutierrez,Andres Caro
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
备注: 7 Pages, 7 Figures
Abstract:Digital twins (DTs) are improving water distribution systems by using real-time data, analytics, and prediction models to optimize operations. This paper presents a DT platform designed for a Spanish water supply network, utilizing Long Short-Term Memory (LSTM) networks to predict water consumption. However, machine learning models are vulnerable to adversarial attacks, such as the Fast Gradient Sign Method (FGSM) and Projected Gradient Descent (PGD). These attacks manipulate critical model parameters, injecting subtle distortions that degrade forecasting accuracy. To further exploit these vulnerabilities, we introduce a Learning Automata (LA) and Random LA-based approach that dynamically adjusts perturbations, making adversarial attacks more difficult to detect. Experimental results show that this approach significantly impacts prediction reliability, causing the Mean Absolute Percentage Error (MAPE) to rise from 26% to over 35%. Moreover, adaptive attack strategies amplify this effect, highlighting cybersecurity risks in AI-driven DTs. These findings emphasize the urgent need for robust defenses, including adversarial training, anomaly detection, and secure data pipelines.
zh
[AI-63] Deep Physics Prior for First Order Inverse Optimization
【速读】:该论文试图解决逆向设计优化问题,即从观测解中推断系统参数,这一问题在半导体制造、结构工程、材料科学和流体动力学等领域具有重要挑战性。现有方法如生成式 AI 和贝叶斯优化虽能应对部分挑战,但存在计算成本高或可扩展性差、对先验敏感及噪声干扰等问题。该论文提出的 Deep Physics Prior (DPP) 方法的关键在于利用预训练的辅助神经算子,结合代理机器学习模型,实现基于一阶梯度的逆向优化,并通过引入先验分布约束确保解的鲁棒性和有效性。
链接: https://arxiv.org/abs/2504.20278
作者: Haoyu Yang,Kamyar Azizzadenesheli,Haoxing Ren
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 10 pages, 5 figure. Under Review
Abstract:Inverse design optimization aims to infer system parameters from observed solutions, posing critical challenges across domains such as semiconductor manufacturing, structural engineering, materials science, and fluid dynamics. The lack of explicit mathematical representations in many systems complicates this process and makes the first order optimization impossible. Mainstream approaches, including generative AI and Bayesian optimization, address these challenges but have limitations. Generative AI is computationally expensive, while Bayesian optimization, relying on surrogate models, suffers from scalability, sensitivity to priors, and noise issues, often leading to suboptimal solutions. This paper introduces Deep Physics Prior (DPP), a novel method enabling first-order gradient-based inverse optimization with surrogate machine learning models. By leveraging pretrained auxiliary Neural Operators, DPP enforces prior distribution constraints to ensure robust and meaningful solutions. This approach is particularly effective when prior data and observation distributions are unknown.
zh
[AI-64] Smart Water Security with AI and Blockchain-Enhanced Digital Twins
【速读】:该论文旨在解决农村地区供水系统在实时监测、网络安全和数据处理可靠性方面的严重挑战。其解决方案的关键在于构建一个集成框架,结合基于LoRaWAN的数据采集、机器学习驱动的入侵检测系统(IDS)以及区块链赋能的数字孪生(BC-DT)平台,以实现安全透明的水资源管理。其中,IDS利用长短期记忆(LSTM)自编码器和孤立森林算法过滤异常或伪造数据,随后通过私有以太坊区块链上的智能合约进行验证记录,确保数据完整性与不可篡改性;同时,验证后的数据用于实时数字孪生模型,支持泄漏检测、用水预测和预测性维护。
链接: https://arxiv.org/abs/2504.20275
作者: Mohammadhossein Homaei,Victor Gonzalez Morales,Oscar Mogollon Gutierrez,Ruben Molano Gomez,Andres Caro
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 8 Pages, 9 Figures
Abstract:Water distribution systems in rural areas face serious challenges such as a lack of real-time monitoring, vulnerability to cyberattacks, and unreliable data handling. This paper presents an integrated framework that combines LoRaWAN-based data acquisition, a machine learning-driven Intrusion Detection System (IDS), and a blockchain-enabled Digital Twin (BC-DT) platform for secure and transparent water management. The IDS filters anomalous or spoofed data using a Long Short-Term Memory (LSTM) Autoencoder and Isolation Forest before validated data is logged via smart contracts on a private Ethereum blockchain using Proof of Authority (PoA) consensus. The verified data feeds into a real-time DT model supporting leak detection, consumption forecasting, and predictive maintenance. Experimental results demonstrate that the system achieves over 80 transactions per second (TPS) with under 2 seconds of latency while remaining cost-effective and scalable for up to 1,000 smart meters. This work demonstrates a practical and secure architecture for decentralized water infrastructure in under-connected rural environments.
zh
[AI-65] Can Large Language Models Learn Formal Logic? A Data-Driven Training and Evaluation Framework
【速读】:该论文试图解决大型语言模型(Large Language Models, LLMs)在布尔逻辑中构造证明的逻辑推理能力问题。由于真实世界中的证明数据稀缺,研究者提出了一种高效且随机的合成有效证明的方法,并引入了模板转换(Template Transformation)这一数据增强技术,以提升模型处理复杂逻辑表达式的能力。该解决方案的关键在于通过合成数据和数据增强技术克服训练数据不足的问题,从而有效提升模型的逻辑推理能力。
链接: https://arxiv.org/abs/2504.20213
作者: Yuan Xia,Akanksha Atrey,Fadoua Khmaissia,Kedar S. Namjoshi
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:This paper investigates the logical reasoning capabilities of large language models (LLMs). For a precisely defined yet tractable formulation, we choose the conceptually simple but technically complex task of constructing proofs in Boolean logic. A trained LLM receives as input a set of assumptions and a goal, and produces as output a proof that formally derives the goal from the assumptions. Incorrect proofs are caught by an automated proof checker. A critical obstacle for training is the scarcity of real-world proofs. We propose an efficient, randomized procedure for synthesizing valid proofs and introduce Template Transformation, a data augmentation technique that enhances the model’s ability to handle complex logical expressions. The central evaluation question is whether an LLM has indeed learned to reason. We propose tests to measure the reasoning ability of a black-box LLM. By these measures, experiments demonstrate strong reasoning capabilities for assertions with short proofs, which decline with proof complexity. Notably, template transformation improves accuracy even for smaller models, suggesting its effectiveness across model scales.
zh
[AI-66] Representation Learning on a Random Lattice WWW
【速读】:该论文试图解决深度神经网络中学习表征的可解释性问题,旨在通过分解表征为可解释特征来提升模型的安全性和可靠性。其解决方案的关键在于从几何角度出发,将特征视为映射嵌入数据分布的学习坐标系,并将通用数据分布建模为随机格点,利用渗流理论分析其特性,从而对学习到的特征进行分类,包括上下文特征、组件特征和表面特征。
链接: https://arxiv.org/abs/2504.20197
作者: Aryeh Brill
机构: 未知
类目: Machine Learning (cs.LG); Disordered Systems and Neural Networks (cond-mat.dis-nn); Artificial Intelligence (cs.AI)
备注: Published in Proceedings of ILIAD (2024), this https URL
Abstract:Decomposing a deep neural network’s learned representations into interpretable features could greatly enhance its safety and reliability. To better understand features, we adopt a geometric perspective, viewing them as a learned coordinate system for mapping an embedded data distribution. We motivate a model of a generic data distribution as a random lattice and analyze its properties using percolation theory. Learned features are categorized into context, component, and surface features. The model is qualitatively consistent with recent findings in mechanistic interpretability and suggests directions for future research.
zh
[AI-67] Prompting LLM s for Code Editing: Struggles and Remedies
【速读】:该论文试图解决开发者在日常工作中使用生成式 AI (Generative AI) 驱动的代码编辑与转换功能时的实际使用问题及遇到的困难,特别是开发者在使用过程中频繁重新提示(re-prompting)的现象及其背后的信息缺失问题。解决方案的关键在于通过分析日志数据和用户请求,识别出开发者提示中常见的五类缺失信息,并提出 AutoPrompter 工具,该工具能够根据代码上下文自动推断缺失信息以改进提示,从而提升编辑的准确性。
链接: https://arxiv.org/abs/2504.20196
作者: Daye Nam,Ahmed Omran,Ambar Murillo,Saksham Thakur,Abner Araujo,Marcel Blistein,Alexander Frömmgen,Vincent Hellendoorn,Satish Chandra
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注:
Abstract:Large Language Models (LLMs) are rapidly transforming software engineering, with coding assistants embedded in an IDE becoming increasingly prevalent. While research has focused on improving the tools and understanding developer perceptions, a critical gap exists in understanding how developers actually use these tools in their daily workflows, and, crucially, where they struggle. This paper addresses part of this gap through a multi-phased investigation of developer interactions with an LLM-powered code editing and transformation feature, Transform Code, in an IDE widely used at Google. First, we analyze telemetry logs of the feature usage, revealing that frequent re-prompting can be an indicator of developer struggles with using Transform Code. Second, we conduct a qualitative analysis of unsatisfactory requests, identifying five key categories of information often missing from developer prompts. Finally, based on these findings, we propose and evaluate a tool, AutoPrompter, for automatically improving prompts by inferring missing information from the surrounding code context, leading to a 27% improvement in edit correctness on our test set.
zh
[AI-68] AI Recommendation Systems for Lane-Changing Using Adherence-Aware Reinforcement Learning
【速读】:该论文旨在解决半自动驾驶环境中车辆换道推荐的优化问题,以提升单辆车的行驶效率。其核心挑战在于如何处理人类驾驶员对推荐动作的部分合规性。解决方案的关键在于提出一种适应性强化学习(adherence-aware reinforcement learning)方法,具体实现为一种考虑驾驶员部分合规性的深度Q网络(adherence-aware deep Q network),该方法在马尔可夫决策过程框架下进行建模与求解。
链接: https://arxiv.org/abs/2504.20187
作者: Weihao Sun,Heeseung Bang,Andreas A. Malikopoulos
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
备注: 6 pages, 5 figures, conference
Abstract:In this paper, we present an adherence-aware reinforcement learning (RL) approach aimed at seeking optimal lane-changing recommendations within a semi-autonomous driving environment to enhance a single vehicle’s travel efficiency. The problem is framed within a Markov decision process setting and is addressed through an adherence-aware deep Q network, which takes into account the partial compliance of human drivers with the recommended actions. This approach is evaluated within CARLA’s driving environment under realistic scenarios.
zh
[AI-69] BLADE: Benchmark suite for LLM -driven Automated Design and Evolution of iterative optimisation heuristics GECCO
【速读】:该论文试图解决在生成式 AI (Generative AI) 驱动的自动化算法设计(AAD)领域中,缺乏稳健且标准化的基准测试方法的问题,尤其是在连续黑盒优化背景下评估 LLM 驱动的 AAD 方法的能力和局限性。解决方案的关键在于提出 BLADE(Benchmark suite for LLM-driven Automated Design and Evolution),这是一个模块化、可扩展的框架,集成了多种基准问题、实例生成器和文本描述,支持针对泛化、专业化和信息利用等能力的测试,并提供灵活的实验设置、标准化日志记录以及对 AAD 过程的分析工具,从而实现对 LLM 驱动的 AAD 方法的系统性评估。
链接: https://arxiv.org/abs/2504.20183
作者: Niki van Stein,Anna V. Kononova,Haoran Yin,Thomas Bäck
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
备注: 9 pages, accepted at GECCO Workshop 2025
Abstract:The application of Large Language Models (LLMs) for Automated Algorithm Discovery (AAD), particularly for optimisation heuristics, is an emerging field of research. This emergence necessitates robust, standardised benchmarking practices to rigorously evaluate the capabilities and limitations of LLM-driven AAD methods and the resulting generated algorithms, especially given the opacity of their design process and known issues with existing benchmarks. To address this need, we introduce BLADE (Benchmark suite for LLM-driven Automated Design and Evolution), a modular and extensible framework specifically designed for benchmarking LLM-driven AAD methods in a continuous black-box optimisation context. BLADE integrates collections of benchmark problems (including MA-BBOB and SBOX-COST among others) with instance generators and textual descriptions aimed at capability-focused testing, such as generalisation, specialisation and information exploitation. It offers flexible experimental setup options, standardised logging for reproducibility and fair comparison, incorporates methods for analysing the AAD process (e.g., Code Evolution Graphs and various visualisation approaches) and facilitates comparison against human-designed baselines through integration with established tools like IOHanalyser and IOHexplainer. BLADE provides an `out-of-the-box’ solution to systematically evaluate LLM-driven AAD approaches. The framework is demonstrated through two distinct use cases exploring mutation prompt strategies and function specialisation.
zh
[AI-70] Causal Identification in Time Series Models
【速读】:该论文试图解决在存在潜在混杂因素的因果时间序列图中,如何判断任意时间间隔上的因果效应是否可识别的问题(causal effect identifiability)。其解决方案的关键在于提出了一种首次出现的界限,该界限仅依赖于每段时间步的变量数量以及任何直接或潜在因果效应的最大时间滞后,证明了对时间序列图的一个固定大小的片段应用因果识别算法就足以判断跨无限时间间隔的因果效应的可识别性。
链接: https://arxiv.org/abs/2504.20172
作者: Erik Jahn,Karthik Karnik,Leonard J. Schulman
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Methodology (stat.ME); Machine Learning (stat.ML)
备注:
Abstract:In this paper, we analyze the applicability of the Causal Identification algorithm to causal time series graphs with latent confounders. Since these graphs extend over infinitely many time steps, deciding whether causal effects across arbitrary time intervals are identifiable appears to require computation on graph segments of unbounded size. Even for deciding the identifiability of intervention effects on variables that are close in time, no bound is known on how many time steps in the past need to be considered. We give a first bound of this kind that only depends on the number of variables per time step and the maximum time lag of any direct or latent causal effect. More generally, we show that applying the Causal Identification algorithm to a constant-size segment of the time series graph is sufficient to decide identifiability of causal effects, even across unbounded time intervals.
zh
[AI-71] LZ Penalty: An information-theoretic repetition penalty for autoregressive language models
【速读】:该论文试图解决自回归语言模型中退化重复(degenerate repetition)的问题,同时不损失模型的能力。解决方案的关键在于引入LZ惩罚项,该惩罚项基于LZ77通用无损压缩算法中的编码长度,通过预测-压缩对偶性,将解码过程解释为从去除高度可压缩信息后的残差分布中采样。这种方法使得开源推理模型能够在不损失能力的情况下使用贪婪解码(温度为零),并有效避免退化重复现象。
链接: https://arxiv.org/abs/2504.20131
作者: Antonio A. Ginart,Naveen Kodali,Jason Lee,Caiming Xiong,Silvio Savarese,John R. Emmons
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Theory (cs.IT)
备注: Preprint (draft)
Abstract:We introduce the LZ penalty, a penalty specialized for reducing degenerate repetitions in autoregressive language models without loss of capability. The penalty is based on the codelengths in the LZ77 universal lossless compression algorithm. Through the lens of the prediction-compression duality, decoding the LZ penalty has the interpretation of sampling from the residual distribution after removing the information that is highly compressible. We demonstrate the LZ penalty enables state-of-the-art open-source reasoning models to operate with greedy (temperature zero) decoding without loss of capability and without instances of degenerate repetition. Both the industry-standard frequency penalty and repetition penalty are ineffective, incurring degenerate repetition rates of up to 4%.
zh
[AI-72] owards Large Language Models for Lunar Mission Planning and In Situ Resource Utilization
【速读】:该论文试图解决月球任务规划中评估本地原材料可用性的关键问题,其核心挑战在于从分散在多种科学文献中的测量数据中获取可靠的月球成分信息。解决方案的关键在于利用大语言模型(Large Language Models, LLMs)快速处理科学文献语料库,以提取相关数据。尽管利用LLMs从科学文档中获取知识并非新颖,但该应用面临月球样本异质性和表征细节的特殊挑战,因此准确性和不确定性量化尤为重要。研究结果表明,现成的LLMs在提取文献中常见表格数据方面表现良好,但仍需进一步优化以捕捉更精细的矿物学信息并提升对复杂信息的处理能力。
链接: https://arxiv.org/abs/2504.20125
作者: Michael Pekala,Gregory Canal,Samuel Barham,Milena B. Graziano,Morgan Trexler,Leslie Hamilton,Elizabeth Reilly,Christopher D. Stiles
机构: 未知
类目: Digital Libraries (cs.DL); Artificial Intelligence (cs.AI)
备注:
Abstract:A key factor for lunar mission planning is the ability to assess the local availability of raw materials. However, many potentially relevant measurements are scattered across a variety of scientific publications. In this paper we consider the viability of obtaining lunar composition data by leveraging LLMs to rapidly process a corpus of scientific publications. While leveraging LLMs to obtain knowledge from scientific documents is not new, this particular application presents interesting challenges due to the heterogeneity of lunar samples and the nuances involved in their characterization. Accuracy and uncertainty quantification are particularly crucial since many materials properties can be sensitive to small variations in composition. Our findings indicate that off-the-shelf LLMs are generally effective at extracting data from tables commonly found in these documents. However, there remains opportunity to further refine the data we extract in this initial approach; in particular, to capture fine-grained mineralogy information and to improve performance on more subtle/complex pieces of information.
zh
[AI-73] Pediatric Asthma Detection with Googles HeAR Model: An AI-Driven Respiratory Sound Classifier
【速读】:该论文试图解决儿童哮喘早期检测的问题,以预防长期呼吸系统并发症并减少紧急干预。解决方案的关键在于利用Google的Health Acoustic Representations (HeAR)模型,这是一种在3亿个健康相关音频片段(包括1亿个咳嗽声)上预训练的基础模型,通过将其应用于儿科呼吸音数据集SPRSound中提取的2秒音频片段,生成512维的嵌入表示,并结合多种分类器(如SVM、随机森林和MLP)进行哮喘指示性声音与正常声音的区分,从而实现高准确率的哮喘筛查。
链接: https://arxiv.org/abs/2504.20124
作者: Abul Ehtesham,Saket Kumar,Aditi Singh,Tala Talaei Khoei
机构: 未知
类目: ound (cs.SD); Artificial Intelligence (cs.AI)
备注:
Abstract:Early detection of asthma in children is crucial to prevent long-term respiratory complications and reduce emergency interventions. This work presents an AI-powered diagnostic pipeline that leverages Googles Health Acoustic Representations (HeAR) model to detect early signs of asthma from pediatric respiratory sounds. The SPRSound dataset, the first open-access collection of annotated respiratory sounds in children aged 1 month to 18 years, is used to extract 2-second audio segments labeled as wheeze, crackle, rhonchi, stridor, or normal. Each segment is embedded into a 512-dimensional representation using HeAR, a foundation model pretrained on 300 million health-related audio clips, including 100 million cough sounds. Multiple classifiers, including SVM, Random Forest, and MLP, are trained on these embeddings to distinguish between asthma-indicative and normal sounds. The system achieves over 91% accuracy, with strong performance on precision-recall metrics for positive cases. In addition to classification, learned embeddings are visualized using PCA, misclassifications are analyzed through waveform playback, and ROC and confusion matrix insights are provided. This method demonstrates that short, low-resource pediatric recordings, when powered by foundation audio models, can enable fast, noninvasive asthma screening. The approach is especially promising for digital diagnostics in remote or underserved healthcare settings.
zh
[AI-74] Can LLM s Be Trusted for Evaluating RAG Systems? A Survey of Methods and Datasets
【速读】:该论文试图解决Retrieval-Augmented Generation (RAG)系统在评估与质量提升方面面临的复杂性问题,特别是由于系统包含多个组件(如索引、检索和生成)以及众多参数所带来的挑战。其解决方案的关键在于系统性地回顾63篇学术文章,总结当前RAG评估方法,并提出一种自动化评估方法,该方法利用大语言模型(LLM)生成评估数据集并执行评估,同时强调领域特定数据集的构建与适应对于提升评估严谨性的重要性。
链接: https://arxiv.org/abs/2504.20119
作者: Lorenz Brehme,Thomas Ströhle,Ruth Breu
机构: 未知
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注: 8 Pages. This paper has been accepted for presentation at the IEEE Swiss Conference on Data Science (SDS25)
Abstract:Retrieval-Augmented Generation (RAG) has advanced significantly in recent years. The complexity of RAG systems, which involve multiple components-such as indexing, retrieval, and generation-along with numerous other parameters, poses substantial challenges for systematic evaluation and quality enhancement. Previous research highlights that evaluating RAG systems is essential for documenting advancements, comparing configurations, and identifying effective approaches for domain-specific applications. This study systematically reviews 63 academic articles to provide a comprehensive overview of state-of-the-art RAG evaluation methodologies, focusing on four key areas: datasets, retrievers, indexing and databases, and the generator component. We observe the feasibility of an automated evaluation approach for each component of a RAG system, leveraging an LLM capable of both generating evaluation datasets and conducting evaluations. In addition, we found that further practical research is essential to provide companies with clear guidance on the do’s and don’ts of implementing and evaluating RAG systems. By synthesizing evaluation approaches for key RAG components and emphasizing the creation and adaptation of domain-specific datasets for benchmarking, we contribute to the advancement of systematic evaluation methods and the improvement of evaluation rigor for RAG systems. Furthermore, by examining the interplay between automated approaches leveraging LLMs and human judgment, we contribute to the ongoing discourse on balancing automation and human input, clarifying their respective contributions, limitations, and challenges in achieving robust and reliable evaluations.
zh
[AI-75] OpenTCM: A GraphRAG -Empowered LLM -based System for Traditional Chinese Medicine Knowledge Retrieval and Diagnosis
【速读】:该论文旨在解决传统中医(Traditional Chinese Medicine, TCM)文献复杂性和广度带来的现代转化与广泛可及性问题,特别是古典中文文本的解释及中医概念间复杂语义关系的建模。其解决方案的关键在于构建一个基于大语言模型(LLM)的系统——OpenTCM,该系统结合了领域特定的中医知识图谱和基于图的检索增强生成(GraphRAG)技术,通过高精度的知识图谱构建与高效的信息检索和问答能力,实现了无需模型微调的高质量中药成分检索与诊断问答功能。
链接: https://arxiv.org/abs/2504.20118
作者: Jinglin He,Yunqi Guo,Lai Kwan Lam,Waikei Leung,Lixing He,Yuanan Jiang,Chi Chiu Wang,Guoliang Xing,Hongkai Chen
机构: 未知
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注: 8 pages, 4 figures
Abstract:Traditional Chinese Medicine (TCM) represents a rich repository of ancient medical knowledge that continues to play an important role in modern healthcare. Due to the complexity and breadth of the TCM literature, the integration of AI technologies is critical for its modernization and broader accessibility. However, this integration poses considerable challenges, including the interpretation of obscure classical Chinese texts and the modeling of intricate semantic relationships among TCM concepts. In this paper, we develop OpenTCM, an LLM-based system that combines a domain-specific TCM knowledge graph and Graph-based Retrieval-Augmented Generation (GraphRAG). First, we extract more than 3.73 million classical Chinese characters from 68 gynecological books in the Chinese Medical Classics Database, with the help of TCM and gynecology experts. Second, we construct a comprehensive multi-relational knowledge graph comprising more than 48,000 entities and 152,000 interrelationships, using customized prompts and Chinese-oriented LLMs such as DeepSeek and Kimi to ensure high-fidelity semantic understanding. Last, we integrate OpenTCM with this knowledge graph, enabling high-fidelity ingredient knowledge retrieval and diagnostic question-answering without model fine-tuning. Experimental evaluations demonstrate that our prompt design and model selection significantly improve knowledge graph quality, achieving a precision of 98. 55% and an F1 score of 99. 55%. In addition, OpenTCM achieves mean expert scores of 4.5 in ingredient information retrieval and 3.8 in diagnostic question-answering tasks, outperforming state-of-the-art solutions in real-world TCM use cases.
zh
[AI-76] AutoP2C: An LLM -Based Agent Framework for Code Repository Generation from Multimodal Content in Academic Papers
【速读】:该论文试图解决将学术论文中的多模态内容(包括文本、图表和表格结果)转化为可执行代码仓库的问题,这一过程传统上需要大量时间和专业知识。解决方案的关键在于提出AutoP2C,这是一个基于大语言模型的多智能体框架,能够处理论文中的文本和视觉内容,并生成完整的代码仓库,其核心包括四个阶段:从已有代码库中提取仓库蓝图、整合文本、公式和图表的多模态内容解析、结构化代码生成的分层任务分解,以及通过迭代反馈驱动的调试以确保功能和性能。
链接: https://arxiv.org/abs/2504.20115
作者: Zijie Lin,Yiqing Shen,Qilin Cai,He Sun,Jinrui Zhou,Mingjun Xiao
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:
Abstract:Machine Learning (ML) research is spread through academic papers featuring rich multimodal content, including text, diagrams, and tabular results. However, translating these multimodal elements into executable code remains a challenging and time-consuming process that requires substantial ML expertise. We introduce ``Paper-to-Code’’ (P2C), a novel task that transforms the multimodal content of scientific publications into fully executable code repositories, which extends beyond the existing formulation of code generation that merely converts textual descriptions into isolated code snippets. To automate the P2C process, we propose AutoP2C, a multi-agent framework based on large language models that processes both textual and visual content from research papers to generate complete code repositories. Specifically, AutoP2C contains four stages: (1) repository blueprint extraction from established codebases, (2) multimodal content parsing that integrates information from text, equations, and figures, (3) hierarchical task decomposition for structured code generation, and (4) iterative feedback-driven debugging to ensure functionality and performance. Evaluation on a benchmark of eight research papers demonstrates the effectiveness of AutoP2C, which can successfully generate executable code repositories for all eight papers, while OpenAI-o1 or DeepSeek-R1 can only produce runnable code for one paper. The code is available at this https URL.
zh
[AI-77] reeHop: Generate and Filter Next Query Embeddings Efficiently for Multi-hop Question Answering
【速读】:该论文旨在解决检索增强生成(Retrieval-augmented generation, RAG)系统在多跳问答(multi-hop question answering, MHQA)任务中的挑战,即如何高效地合成多个文档片段中的信息以回答复杂查询。现有方法依赖于基于大语言模型(LLM)的迭代查询重写和路由,导致计算成本高昂。论文提出的解决方案关键在于TreeHop框架,其通过在嵌入层进行动态查询嵌入更新,融合先前查询和检索文档的语义信息,从而实现仅通过嵌入空间操作进行迭代检索,替代传统的“检索-重写-向量化-检索”流程,显著降低了计算开销。
链接: https://arxiv.org/abs/2504.20114
作者: Zhonghao Li,Kunpeng Zhang,Jinghuai Ou,Shuliang Liu,Xuming Hu
机构: 未知
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
备注: 9 pages
Abstract:Retrieval-augmented generation (RAG) systems face significant challenges in multi-hop question answering (MHQA), where complex queries require synthesizing information across multiple document chunks. Existing approaches typically rely on iterative LLM-based query rewriting and routing, resulting in high computational costs due to repeated LLM invocations and multi-stage processes. To address these limitations, we propose TreeHop, an embedding-level framework without the need for LLMs in query refinement. TreeHop dynamically updates query embeddings by fusing semantic information from prior queries and retrieved documents, enabling iterative retrieval through embedding-space operations alone. This method replaces the traditional “Retrieve-Rewrite-Vectorize-Retrieve” cycle with a streamlined “Retrieve-Embed-Retrieve” loop, significantly reducing computational overhead. Moreover, a rule-based stop criterion is introduced to further prune redundant retrievals, balancing efficiency and recall rate. Experimental results show that TreeHop rivals advanced RAG methods across three open-domain MHQA datasets, achieving comparable performance with only 5%-0.4% of the model parameter size and reducing the query latency by approximately 99% compared to concurrent approaches. This makes TreeHop a faster and more cost-effective solution for deployment in a range of knowledge-intensive applications. For reproducibility purposes, codes and data are available here: this https URL.
zh
[AI-78] ransforming Evidence Synthesis: A Systematic Review of the Evolution of Automated Meta-Analysis in the Age of AI
【速读】:该论文试图解决自动化元分析(Automated Meta-analysis, AMA)在全面实现端到端自动化过程中存在的关键瓶颈问题,特别是针对数据处理阶段的自动化程度较高,而高级合成阶段如异质性评估和偏倚评价的自动化能力不足的问题。其解决方案的关键在于推动生成式AI(Generative AI)与大型语言模型(Large Language Models, LLMs)等先进人工智能技术在统计建模和更高阶合成任务中的深度融合,以提升AMA在整体元分析流程中的自主性和方法学稳健性,从而实现可扩展、跨领域的综合分析能力。
链接: https://arxiv.org/abs/2504.20113
作者: Lingbo Li,Anuradha Mathrani,Teo Susnjak
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Exponential growth in scientific literature has heightened the demand for efficient evidence-based synthesis, driving the rise of the field of Automated Meta-analysis (AMA) powered by natural language processing and machine learning. This PRISMA systematic review introduces a structured framework for assessing the current state of AMA, based on screening 978 papers from 2006 to 2024, and analyzing 54 studies across diverse domains. Findings reveal a predominant focus on automating data processing (57%), such as extraction and statistical modeling, while only 17% address advanced synthesis stages. Just one study (2%) explored preliminary full-process automation, highlighting a critical gap that limits AMA’s capacity for comprehensive synthesis. Despite recent breakthroughs in large language models (LLMs) and advanced AI, their integration into statistical modeling and higher-order synthesis, such as heterogeneity assessment and bias evaluation, remains underdeveloped. This has constrained AMA’s potential for fully autonomous meta-analysis. From our dataset spanning medical (67%) and non-medical (33%) applications, we found that AMA has exhibited distinct implementation patterns and varying degrees of effectiveness in actually improving efficiency, scalability, and reproducibility. While automation has enhanced specific meta-analytic tasks, achieving seamless, end-to-end automation remains an open challenge. As AI systems advance in reasoning and contextual understanding, addressing these gaps is now imperative. Future efforts must focus on bridging automation across all meta-analysis stages, refining interpretability, and ensuring methodological robustness to fully realize AMA’s potential for scalable, domain-agnostic synthesis.
zh
[AI-79] Supervised Pretraining for Material Property Prediction
【速读】:该论文旨在解决材料性质预测中深度学习模型依赖大量标注数据的问题,而自监督学习(SSL)虽然提供了一种替代方案,但其在材料性质预测中的效果仍有提升空间。论文提出的解决方案关键在于引入监督预训练(supervised pretraining),利用可用的类别信息作为代理标签(surrogate labels)来指导模型学习,即使下游任务涉及不相关的材料性质。此外,论文还提出了一种基于图的增强技术,通过注入噪声提高模型的鲁棒性,从而提升表示学习的效果。
链接: https://arxiv.org/abs/2504.20112
作者: Chowdhury Mohammad Abid Rahman,Aldo H. Romero,Prashnna K. Gyawali
机构: 未知
类目: Machine Learning (cs.LG); Materials Science (cond-mat.mtrl-sci); Artificial Intelligence (cs.AI)
备注: 21 pages, 7 figures, 2 algorithms, 6 tables
Abstract:Accurate prediction of material properties facilitates the discovery of novel materials with tailored functionalities. Deep learning models have recently shown superior accuracy and flexibility in capturing structure-property relationships. However, these models often rely on supervised learning, which requires large, well-annotated datasets an expensive and time-consuming process. Self-supervised learning (SSL) offers a promising alternative by pretraining on large, unlabeled datasets to develop foundation models that can be fine-tuned for material property prediction. In this work, we propose supervised pretraining, where available class information serves as surrogate labels to guide learning, even when downstream tasks involve unrelated material properties. We evaluate this strategy on two state-of-the-art SSL models and introduce a novel framework for supervised pretraining. To further enhance representation learning, we propose a graph-based augmentation technique that injects noise to improve robustness without structurally deforming material graphs. The resulting foundation models are fine-tuned for six challenging material property predictions, achieving significant performance gains over baselines, ranging from 2% to 6.67% improvement in mean absolute error (MAE) and establishing a new benchmark in material property prediction. This study represents the first exploration of supervised pertaining with surrogate labels in material property prediction, advancing methodology and application in the field.
zh
[AI-80] Personalized Artificial General Intelligence (AGI) via Neuroscience-Inspired Continuous Learning Systems
【速读】:该论文试图解决在资源受限的边缘设备上实现具备持续学习与个性化能力的真正人工通用智能(AGI)的问题。当前方法主要依赖于扩大模型参数以提升任务特定性能,但无法支持持续、适应性强且泛化的学习。论文提出的解决方案关键在于设计一种新型架构,该架构融合了类脑学习机制,包含互补的快慢学习模块、突触自优化以及高效内存模型更新,以支持设备端的终身适应性。此外,该架构借鉴了人类学习的神经科学原理,如突触修剪、赫布可塑性、稀疏编码和双记忆系统,旨在克服灾难性遗忘、内存效率及系统可扩展性等挑战。
链接: https://arxiv.org/abs/2504.20109
作者: Rajeev Gupta,Suhani Gupta,Ronak Parikh,Divya Gupta,Amir Javaheri,Jairaj Singh Shaktawat
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 39 pages, 16 figures
Abstract:Artificial Intelligence has made remarkable advancements in recent years, primarily driven by increasingly large deep learning models. However, achieving true Artificial General Intelligence (AGI) demands fundamentally new architectures rather than merely scaling up existing models. Current approaches largely depend on expanding model parameters, which improves task-specific performance but falls short in enabling continuous, adaptable, and generalized learning. Achieving AGI capable of continuous learning and personalization on resource-constrained edge devices is an even bigger challenge. This paper reviews the state of continual learning and neuroscience-inspired AI, and proposes a novel architecture for Personalized AGI that integrates brain-like learning mechanisms for edge deployment. We review literature on continuous lifelong learning, catastrophic forgetting, and edge AI, and discuss key neuroscience principles of human learning, including Synaptic Pruning, Hebbian plasticity, Sparse Coding, and Dual Memory Systems, as inspirations for AI systems. Building on these insights, we outline an AI architecture that features complementary fast-and-slow learning modules, synaptic self-optimization, and memory-efficient model updates to support on-device lifelong adaptation. Conceptual diagrams of the proposed architecture and learning processes are provided. We address challenges such as catastrophic forgetting, memory efficiency, and system scalability, and present application scenarios for mobile AI assistants and embodied AI systems like humanoid robots. We conclude with key takeaways and future research directions toward truly continual, personalized AGI on the edge. While the architecture is theoretical, it synthesizes diverse findings and offers a roadmap for future implementation. Comments: 39 pages, 16 figures Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG) Cite as: arXiv:2504.20109 [cs.AI] (or arXiv:2504.20109v1 [cs.AI] for this version) https://doi.org/10.48550/arXiv.2504.20109 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh
[AI-81] Adaptive Helpfulness-Harmlessness Alignment with Preference Vectors
【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在保持有益性与无害性之间的平衡问题,现有方法如基于人类反馈的强化学习(Reinforcement Learning from Human Feedback, RLHF)和直接偏好优化(Direct Preference Optimization, DPO)存在性能冲突、可控性有限和可扩展性差等不足。论文提出的解决方案关键在于引入Preference Vector框架,该框架受任务算术启发,通过在单个偏好上训练独立模型,提取行为变化作为偏好向量,并在测试时动态合并,从而实现细粒度、用户可控的偏好调整,并支持新偏好的无缝集成而无需重新训练。
链接: https://arxiv.org/abs/2504.20106
作者: Ren-Wei Liang,Chin-Ting Hsu,Chan-Hung Yu,Saransh Agrawal,Shih-Cheng Huang,Shang-Tse Chen,Kuan-Hao Huang,Shao-Hua Sun
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 22 pages, 5 figures, 9 tables
Abstract:Ensuring that large language models (LLMs) are both helpful and harmless is a critical challenge, as overly strict constraints can lead to excessive refusals, while permissive models risk generating harmful content. Existing approaches, such as reinforcement learning from human feedback (RLHF) and direct preference optimization (DPO), attempt to balance these trade-offs but suffer from performance conflicts, limited controllability, and poor extendability. To address these issues, we propose Preference Vector, a novel framework inspired by task arithmetic. Instead of optimizing multiple preferences within a single objective, we train separate models on individual preferences, extract behavior shifts as preference vectors, and dynamically merge them at test time. This modular approach enables fine-grained, user-controllable preference adjustments and facilitates seamless integration of new preferences without retraining. Experiments show that our proposed Preference Vector framework improves helpfulness without excessive conservatism, allows smooth control over preference trade-offs, and supports scalable multi-preference alignment.
zh
[AI-82] Electricity Cost Minimization for Multi-Workflow Allocation in Geo-Distributed Data Centers
【速读】:该论文旨在解决在地理分布式数据中心(GDCs)中,如何在满足工作流应用截止时间约束的前提下降低电力成本的问题。其关键在于设计一种电力成本感知的多工作流调度算法(ECMWS),该算法通过四个阶段——工作流排序、截止时间划分、任务排序和资源分配——结合两种图嵌入模型和一个策略网络来求解马尔可夫决策过程(MDP),从而有效应对异构计算资源和动态电价带来的挑战。
链接: https://arxiv.org/abs/2504.20105
作者: Shuang Wang,He Zhang,Tianxing Wu,Yueyou Zhang,Wei Emma Zhang,Quan Z. Sheng
机构: 未知
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)
备注: have been accepted by IEEE Transactions on Services Computing
Abstract:Worldwide, Geo-distributed Data Centers (GDCs) provide computing and storage services for massive workflow applications, resulting in high electricity costs that vary depending on geographical locations and time. How to reduce electricity costs while satisfying the deadline constraints of workflow applications is important in GDCs, which is determined by the execution time of servers, power, and electricity price. Determining the completion time of workflows with different server frequencies can be challenging, especially in scenarios with heterogeneous computing resources in GDCs. Moreover, the electricity price is also different in geographical locations and may change dynamically. To address these challenges, we develop a geo-distributed system architecture and propose an Electricity Cost aware Multiple Workflows Scheduling algorithm (ECMWS) for servers of GDCs with fixed frequency and power. ECMWS comprises four stages, namely workflow sequencing, deadline partitioning, task sequencing, and resource allocation where two graph embedding models and a policy network are constructed to solve the Markov Decision Process (MDP). After statistically calibrating parameters and algorithm components over a comprehensive set of workflow instances, the proposed algorithms are compared with the state-of-the-art methods over two types of workflow instances. The experimental results demonstrate that our proposed algorithm significantly outperforms other algorithms, achieving an improvement of over 15% while maintaining an acceptable computational time. The source codes are available at this https URL.
zh
[AI-83] HyboWaveNet: Hyperbolic Graph Neural Networks with Multi-Scale Wavelet Transform for Protein-Protein Interaction Prediction
【速读】:该论文旨在解决蛋白质-蛋白质相互作用(Protein-Protein Interactions, PPIs)预测中的因果解释不足以及难以捕捉层次化几何结构和多尺度动态互作模式的问题。其解决方案的关键在于提出一种新的深度学习框架HyboWaveNet,该框架结合了双曲图神经网络(Hyperbolic Graphical Neural Networks, HGNNs)与多尺度图小波变换,通过将蛋白质特征映射到洛伦兹空间,利用双曲距离度量模拟生物分子间的层次拓扑关系,同时通过小波变换实现不同分辨率下局部与全局互作特征的自适应提取,从而提升PPI预测的鲁棒性与泛化能力。
链接: https://arxiv.org/abs/2504.20102
作者: Qingzhi Yu,Shuai Yan,Wenfeng Dai,Xiang Cheng
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Quantitative Methods (q-bio.QM)
备注: 9 pages
Abstract:Protein-protein interactions (PPIs) are fundamental for deciphering cellular functions,disease pathways,and drug this http URL existing neural networks and machine learning methods have achieved high accuracy in PPI prediction,their black-box nature leads to a lack of causal interpretation of the prediction results and difficulty in capturing hierarchical geometries and multi-scale dynamic interaction patterns among this http URL address these challenges, we propose HyboWaveNet,a novel deep learning framework that collaborates with hyperbolic graphical neural networks (HGNNs) and multiscale graphical wavelet transform for robust PPI prediction. Mapping protein features to Lorentz space simulates hierarchical topological relationships among biomolecules via a hyperbolic distance metric,enabling node feature representations that better fit biological a this http URL inherently simulates hierarchical and scale-free biological relationships, while the integration of wavelet transforms enables adaptive extraction of local and global interaction features across different resolutions. Our framework generates node feature representations via a graph neural network under the Lorenz model and generates pairs of positive samples under multiple different views for comparative learning, followed by further feature extraction via multi-scale graph wavelet transforms to predict potential PPIs. Experiments on public datasets show that HyboWaveNet improves over both existing state-of-the-art methods. We also demonstrate through ablation experimental studies that the multi-scale graph wavelet transform module improves the predictive performance and generalization ability of HyboWaveNet. This work links geometric deep learning and signal processing to advance PPI prediction, providing a principled approach for analyzing complex biological systems
zh
[AI-84] GenTorrent: Scaling Large Language Model Serving with An Overley Network
【速读】:该论文试图解决在小组织和个体用户部署和测试其大型语言模型(Large-Language Models, LLMs)创新时,服务可扩展性(serving scalability)这一关键挑战。其解决方案的关键在于提出GenTorrent,一个基于去中心化贡献者计算资源的LLM服务覆盖网络(overlay network),通过优化覆盖网络组织、通信隐私保护、资源效率的覆盖转发以及服务质量验证等四个核心研究问题,实现低延迟和高安全性的LLM服务部署。
链接: https://arxiv.org/abs/2504.20101
作者: Fei Fang,Yifan Hua,Shengze Wang,Ruilin Zhou,Yi Liu,Chen Qian,Xiaoxue Zhang
机构: 未知
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)
备注:
Abstract:While significant progress has been made in research and development on open-source and cost-efficient large-language models (LLMs), serving scalability remains a critical challenge, particularly for small organizations and individuals seeking to deploy and test their LLM innovations. Inspired by peer-to-peer networks that leverage decentralized overlay nodes to increase throughput and availability, we propose GenTorrent, an LLM serving overlay that harnesses computing resources from decentralized contributors. We identify four key research problems inherent to enabling such a decentralized infrastructure: 1) overlay network organization; 2) LLM communication privacy; 3) overlay forwarding for resource efficiency; and 4) verification of serving quality. This work presents the first systematic study of these fundamental problems in the context of decentralized LLM serving. Evaluation results from a prototype implemented on a set of decentralized nodes demonstrate that GenTorrent achieves a latency reduction of over 50% compared to the baseline design without overlay forwarding. Furthermore, the security features introduce minimal overhead to serving latency and throughput. We believe this work pioneers a new direction for democratizing and scaling future AI serving capabilities.
zh
[AI-85] Decoding Latent Spaces: Assessing the Interpretability of Time Series Foundation Models for Visual Analytics
【速读】:该论文试图解决时间序列基础模型(Time Series Foundation Models)的潜在空间(latent space)可解释性问题,特别是其在可视化分析任务中的应用潜力。解决方案的关键在于评估MOMENT系列模型在多变量时间序列任务中的表现,包括数据插补、预测、分类和异常检测,并通过微调(fine tuning)来提升潜在空间嵌入的清晰度。研究发现,尽管微调在损失减少方面表现出显著效果,但对嵌入可解释性的提升有限,表明潜在空间可能需要进一步的方法优化,如替代投影技术、损失函数或数据预处理策略。
链接: https://arxiv.org/abs/2504.20099
作者: Inmaculada Santamaria-Valenzuela,Victor Rodriguez-Fernandez,Javier Huertas-Tato,Jong Hyuk Park,David Camacho
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
备注: Currently under review at the International Journal of Interactive Multimedia and Artificial Intelligence (IJIMAI)
Abstract:The present study explores the interpretability of latent spaces produced by time series foundation models, focusing on their potential for visual analysis tasks. Specifically, we evaluate the MOMENT family of models, a set of transformer-based, pre-trained architectures for multivariate time series tasks such as: imputation, prediction, classification, and anomaly detection. We evaluate the capacity of these models on five datasets to capture the underlying structures in time series data within their latent space projection and validate whether fine tuning improves the clarity of the resulting embedding spaces. Notable performance improvements in terms of loss reduction were observed after fine tuning. Visual analysis shows limited improvement in the interpretability of the embeddings, requiring further work. Results suggest that, although Time Series Foundation Models such as MOMENT are robust, their latent spaces may require additional methodological refinements to be adequately interpreted, such as alternative projection techniques, loss functions, or data preprocessing strategies. Despite the limitations of MOMENT, foundation models supose a big reduction in execution time and so a great advance for interactive visual analytics.
zh
[AI-86] Self-Healing Software Systems: Lessons from Nature Powered by AI
【速读】:该论文试图解决现代软件系统在复杂性和规模增长背景下,自主检测、诊断和恢复故障的能力不足的问题。其解决方案的关键在于提出一种受生物自愈机制启发的新型框架,该框架将系统可观测性工具作为感知输入,人工智能模型作为诊断与修复的认知核心,以及修复代理执行针对性的代码和测试修改。通过结合日志分析、静态代码检查和AI驱动的补丁或测试更新生成,该方法旨在减少停机时间、加速调试并提升软件韧性。
链接: https://arxiv.org/abs/2504.20093
作者: Mohammad Baqar,Rajat Khanda,Saba Naqvi
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:
Abstract:As modern software systems grow in complexity and scale, their ability to autonomously detect, diagnose, and recover from failures becomes increasingly vital. Drawing inspiration from biological healing - where the human body detects damage, signals the brain, and activates targeted recovery - this paper explores the concept of self-healing software driven by artificial intelligence. We propose a novel framework that mimics this biological model system observability tools serve as sensory inputs, AI models function as the cognitive core for diagnosis and repair, and healing agents apply targeted code and test modifications. By combining log analysis, static code inspection, and AI-driven generation of patches or test updates, our approach aims to reduce downtime, accelerate debugging, and enhance software resilience. We evaluate the effectiveness of this model through case studies and simulations, comparing it against traditional manual debugging and recovery workflows. This work paves the way toward intelligent, adaptive and self-reliant software systems capable of continuous healing, akin to living organisms.
zh
[AI-87] An Integrated Framework for Contextual Personalized LLM -Based Food Recommendation
【速读】:该论文试图解决个性化食品推荐系统(Food-RecSys)因组件理解碎片化以及传统机器学习在庞大且不平衡的食品数据上表现不佳而出现的严重性能不足问题。其解决方案的关键在于提出一种针对食品领域的新型集成方法——食品推荐作为语言处理(F-RLP)框架,该框架通过定制化的大型语言模型(LLMs)应用,克服了通用模型的局限性,提供了有效、上下文相关且真正个性化的食品推荐基础设施。
链接: https://arxiv.org/abs/2504.20092
作者: Ali Rostami
机构: 未知
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注: Doctorate Thesis, University of California, Irvine 2024
Abstract:Personalized food recommendation systems (Food-RecSys) critically underperform due to fragmented component understanding and the failure of conventional machine learning with vast, imbalanced food data. While Large Language Models (LLMs) offer promise, current generic Recommendation as Language Processing (RLP) strategies lack the necessary specialization for the food domain’s complexity. This thesis tackles these deficiencies by first identifying and analyzing the essential components for effective Food-RecSys. We introduce two key innovations: a multimedia food logging platform for rich contextual data acquisition and the World Food Atlas, enabling unique geolocation-based food analysis previously unavailable. Building on this foundation, we pioneer the Food Recommendation as Language Processing (F-RLP) framework - a novel, integrated approach specifically architected for the food domain. F-RLP leverages LLMs in a tailored manner, overcoming the limitations of generic models and providing a robust infrastructure for effective, contextual, and truly personalized food recommendations.
zh
[AI-88] Spark: A System for Scientifically Creative Idea Generation
【速读】:该论文试图解决科学领域中生成新颖研究想法的问题,其解决方案的关键在于构建一个名为Spark的系统,该系统结合了基于大型语言模型(Large Language Models, LLMs)的检索增强型想法生成方法与一个名为Judge的评审模型。Judge模型在来自OpenReview的600K科学评审数据上进行训练,旨在实现对生成科学想法的评估,从而符合计算创造力(Computational Creativity, CC)的基本原理。
链接: https://arxiv.org/abs/2504.20090
作者: Aishik Sanyal,Samuel Schapiro,Sumuk Shashidhar,Royce Moon,Lav R. Varshney,Dilek Hakkani-Tur
机构: 未知
类目: Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)
备注:
Abstract:Recently, large language models (LLMs) have shown promising abilities to generate novel research ideas in science, a direction which coincides with many foundational principles in computational creativity (CC). In light of these developments, we present an idea generation system named Spark that couples retrieval-augmented idea generation using LLMs with a reviewer model named Judge trained on 600K scientific reviews from OpenReview. Our work is both a system demonstration and intended to inspire other CC researchers to explore grounding the generation and evaluation of scientific ideas within foundational CC principles. To this end, we release the annotated dataset used to train Judge, inviting other researchers to explore the use of LLMs for idea generation and creative evaluations.
zh
[AI-89] A model and package for German ColBERT
【速读】:该论文旨在解决多语言信息检索中的效率与准确性问题,特别是在基于检索增强生成(RAG)的应用场景中。其解决方案的关键在于引入一种针对德语的ColBERT变体,这是一种晚期交互的多密集向量检索方法,能够通过高效的向量表示和交互机制提升检索性能,同时支持检索与微调的工作流。
链接: https://arxiv.org/abs/2504.20083
作者: Thuong Dang,Qiqi Chen
机构: 未知
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注:
Abstract:In this work, we introduce a German version for ColBERT, a late interaction multi-dense vector retrieval method, with a focus on RAG applications. We also present the main features of our package for ColBERT models, supporting both retrieval and fine-tuning workflows.
zh
[AI-90] Evolution of AI in Education: Agent ic Workflows
【速读】:该论文试图解决传统大型语言模型(Large Language Models, LLMs)在教育领域中存在的局限性,包括对静态训练数据的依赖、适应性不足以及缺乏推理能力。其解决方案的关键在于引入人工智能代理(AI agents),通过四种主要范式——反思、规划、工具使用和多代理协作——来推动教育创新。AI代理通过这些设计范式展现出更高的灵活性和智能化水平,论文进一步通过一个自动化作文评分的多代理框架验证了该方法的可行性,初步结果表明其在一致性方面优于独立的LLMs。
链接: https://arxiv.org/abs/2504.20082
作者: Firuz Kamalov,David Santandreu Calonge,Linda Smail,Dilshod Azizov,Dimple R. Thadani,Theresa Kwong,Amara Atif
机构: 未知
类目: Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注:
Abstract:Artificial intelligence (AI) has transformed various aspects of education, with large language models (LLMs) driving advancements in automated tutoring, assessment, and content generation. However, conventional LLMs are constrained by their reliance on static training data, limited adaptability, and lack of reasoning. To address these limitations and foster more sustainable technological practices, AI agents have emerged as a promising new avenue for educational innovation. In this review, we examine agentic workflows in education according to four major paradigms: reflection, planning, tool use, and multi-agent collaboration. We critically analyze the role of AI agents in education through these key design paradigms, exploring their advantages, applications, and challenges. To illustrate the practical potential of agentic systems, we present a proof-of-concept application: a multi-agent framework for automated essay scoring. Preliminary results suggest this agentic approach may offer improved consistency compared to stand-alone LLMs. Our findings highlight the transformative potential of AI agents in educational settings while underscoring the need for further research into their interpretability, trustworthiness, and sustainable impact on pedagogical impact.
zh
[AI-91] DNAD: Differentiable Neural Architecture Distillation
【速读】:该论文旨在解决设计高效神经网络的问题,即在模型性能(如分类准确率)与计算复杂度之间找到合适的权衡。其核心解决方案是提出一种基于两个核心机制的可微神经架构蒸馏(Differentiable Neural Architecture Distillation, DNAD)算法,分别为“通过删除进行搜索”和“通过模仿进行搜索”。其中,“通过删除进行搜索”引入了超网络渐进压缩(Super-Network Progressive Shrinking, SNPS)算法,通过从密集结构逐步压缩到稀疏结构,生成具有灵活结构的帕累托最优架构集;而“通过模仿进行搜索”则结合知识蒸馏(Knowledge Distillation, KD),通过最小化超网络与教师网络的行为差异,避免单级DARTS的过拟合问题,从而获得性能优异的神经架构。
链接: https://arxiv.org/abs/2504.20080
作者: Xuan Rao,Bo Zhao,Derong Liu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:To meet the demand for designing efficient neural networks with appropriate trade-offs between model performance (e.g., classification accuracy) and computational complexity, the differentiable neural architecture distillation (DNAD) algorithm is developed based on two cores, namely search by deleting and search by imitating. Primarily, to derive neural architectures in a space where cells of the same type no longer share the same topology, the super-network progressive shrinking (SNPS) algorithm is developed based on the framework of differentiable architecture search (DARTS), i.e., search by deleting. Unlike conventional DARTS-based approaches which yield neural architectures with simple structures and derive only one architecture during the search procedure, SNPS is able to derive a Pareto-optimal set of architectures with flexible structures by forcing the dynamic super-network shrink from a dense structure to a sparse one progressively. Furthermore, since knowledge distillation (KD) has shown great effectiveness to train a compact network with the assistance of an over-parameterized model, we integrate SNPS with KD to formulate the DNAD algorithm, i.e., search by imitating. By minimizing behavioral differences between the super-network and teacher network, the over-fitting of one-level DARTS is avoided and well-performed neural architectures are derived. Experiments on CIFAR-10 and ImageNet classification tasks demonstrate that both SNPS and DNAD are able to derive a set of architectures which achieve similar or lower error rates with fewer parameters and FLOPs. Particularly, DNAD achieves the top-1 error rate of 23.7% on ImageNet classification with a model of 6.0M parameters and 598M FLOPs, which outperforms most DARTS-based methods.
zh
[AI-92] FX-DARTS: Designing Topology-unconstrained Architectures with Differentiable Architecture Search and Entropy-based Super-network Shrinking
【速读】:该论文旨在解决传统可微分架构搜索(Differentiable Architecture Search, DARTS)中因强先验约束导致的自动化机器学习(Auto-ML)发展受限问题,这些先验约束限制了搜索空间的灵活性,阻碍了更强大神经网络的探索。其解决方案的关键在于提出一种名为FX-DARTS的方法,该方法通过消除对细胞拓扑结构的限制并改进超网络的离散化机制,结合基于熵的超网络压缩(Entropy-based Super-Network Shrinking, ESS)框架,实现了在扩大搜索空间的同时保持稳定性,从而能够在单一搜索过程中探索出性能与计算复杂度之间具有竞争力的神经网络架构。
链接: https://arxiv.org/abs/2504.20079
作者: Xuan Rao,Bo Zhao,Derong Liu,Cesare Alippi
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Strong priors are imposed on the search space of Differentiable Architecture Search (DARTS), such that cells of the same type share the same topological structure and each intermediate node retains two operators from distinct nodes. While these priors reduce optimization difficulties and improve the applicability of searched architectures, they hinder the subsequent development of automated machine learning (Auto-ML) and prevent the optimization algorithm from exploring more powerful neural networks through improved architectural flexibility. This paper aims to reduce these prior constraints by eliminating restrictions on cell topology and modifying the discretization mechanism for super-networks. Specifically, the Flexible DARTS (FX-DARTS) method, which leverages an Entropy-based Super-Network Shrinking (ESS) framework, is presented to address the challenges arising from the elimination of prior constraints. Notably, FX-DARTS enables the derivation of neural architectures without strict prior rules while maintaining the stability in the enlarged search space. Experimental results on image classification benchmarks demonstrate that FX-DARTS is capable of exploring a set of neural architectures with competitive trade-offs between performance and computational complexity within a single search procedure.
zh
[AI-93] EPSILON: Adaptive Fault Mitigation in Approximate Deep Neural Network using Statistical Signatures IJCNN
【速读】:该论文旨在解决近似计算在深度神经网络加速器(AxDNNs)中引入的永久性故障对性能的严重负面影响问题,传统故障检测与缓解方法虽在准确计算的深度神经网络加速器(AccDNNs)中有效,但其带来的高开销和延迟使其不适用于能源受限的实时部署。论文提出的解决方案EPSILON的关键在于利用预计算的统计特征和逐层重要性度量,结合一种新颖的非参数模式匹配算法,实现无需中断正常执行的常数时间故障检测,并动态适应不同的网络架构和故障模式,同时通过基于权重分布统计分析和层关键性的智能缓解策略维持模型精度,从而在保证近似计算能效优势的同时提升AxDNNs的可靠性。
链接: https://arxiv.org/abs/2504.20074
作者: Khurram Khalil,Khaza Anuarul Hoque
机构: 未知
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Accepted at the International Joint Conference on Neural Networks (IJCNN), June 30th - July 5th, 2025 in Rome, Italy
Abstract:The increasing adoption of approximate computing in deep neural network accelerators (AxDNNs) promises significant energy efficiency gains. However, permanent faults in AxDNNs can severely degrade their performance compared to their accurate counterparts (AccDNNs). Traditional fault detection and mitigation approaches, while effective for AccDNNs, introduce substantial overhead and latency, making them impractical for energy-constrained real-time deployment. To address this, we introduce EPSILON, a lightweight framework that leverages pre-computed statistical signatures and layer-wise importance metrics for efficient fault detection and mitigation in AxDNNs. Our framework introduces a novel non-parametric pattern-matching algorithm that enables constant-time fault detection without interrupting normal execution while dynamically adapting to different network architectures and fault patterns. EPSILON maintains model accuracy by intelligently adjusting mitigation strategies based on a statistical analysis of weight distribution and layer criticality while preserving the energy benefits of approximate computing. Extensive evaluations across various approximate multipliers, AxDNN architectures, popular datasets (MNIST, CIFAR-10, CIFAR-100, ImageNet-1k), and fault scenarios demonstrate that EPSILON maintains 80.05% accuracy while offering 22% improvement in inference time and 28% improvement in energy efficiency, establishing EPSILON as a practical solution for deploying reliable AxDNNs in safety-critical edge applications.
zh
[AI-94] A Simple Review of EEG Foundation Models: Datasets Advancements and Future Perspectives
【速读】:该论文旨在解决如何有效处理和分析脑电图(Electroencephalogram, EEG)信号以更好地理解大脑活动和诊断神经疾病的问题。其解决方案的关键在于开发和研究EEG基础模型(EEG foundation models, EEG-FMs),这些模型通过先进的架构设计、预训练策略以及多样化数据集的使用,提升了对EEG数据的特征提取与任务适应能力,为后续应用提供了强大的支持。
链接: https://arxiv.org/abs/2504.20069
作者: Junhong Lai,Jiyu Wei,Lin Yao,Yueming Wang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Signal Processing (eess.SP)
备注:
Abstract:Electroencephalogram (EEG) signals play a crucial role in understanding brain activity and diagnosing neurological disorders. This review focuses on the recent development of EEG foundation models(EEG-FMs), which have shown great potential in processing and analyzing EEG data. We discuss various EEG-FMs, including their architectures, pre-training strategies, their pre-training and downstream datasets and other details. The review also highlights the challenges and future directions in this field, aiming to provide a comprehensive overview for researchers and practitioners interested in EEG analysis and related EEG-FMs.
zh
[AI-95] A constraints-based approach to fully interpretable neural networks for detecting learner behaviors
【速读】:该论文试图解决复杂机器学习模型在教育领域应用中可解释性不足的问题,特别是如何构建一种能够提供忠实且易于理解解释的神经网络行为检测模型。解决方案的关键在于通过引入一系列约束条件,使模型在设计上具备完全可解释性,其参数不仅具有明确的语义解释,还能全面捕捉模型对学习者行为的习得知识,从而生成既忠实于模型内部机制又符合人类认知的解释。
链接: https://arxiv.org/abs/2504.20055
作者: Juan D. Pinto,Luc Paquette
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Accepted to International Conference on Educational Data Mining (EDM) 2025
Abstract:The increasing use of complex machine learning models in education has led to concerns about their interpretability, which in turn has spurred interest in developing explainability techniques that are both faithful to the model’s inner workings and intelligible to human end-users. In this paper, we describe a novel approach to creating a neural-network-based behavior detection model that is interpretable by design. Our model is fully interpretable, meaning that the parameters we extract for our explanations have a clear interpretation, fully capture the model’s learned knowledge about the learner behavior of interest, and can be used to create explanations that are both faithful and intelligible. We achieve this by implementing a series of constraints to the model that both simplify its inference process and bring it closer to a human conception of the task at hand. We train the model to detect gaming-the-system behavior, evaluate its performance on this task, and compare its learned patterns to those identified by human experts. Our results show that the model is successfully able to learn patterns indicative of gaming-the-system behavior while providing evidence for fully interpretable explanations. We discuss the implications of our approach and suggest ways to evaluate explainability using a human-grounded approach.
zh
[AI-96] HCT-QA: A Benchmark for Question Answering on Human-Centric Tables
【速读】:该论文旨在解决从人类中心化表格(Human-Centric Tables, HCTs)中高效提取、处理和查询数据的问题,这些表格因其复杂的布局和高业务价值而在多个领域广泛存在。传统方法在处理HCTs的多样性和复杂性方面存在局限,难以有效支持查询。论文提出的解决方案关键在于构建HCT-QA基准数据集,包含真实和合成的HCTs及其对应的自然语言问答对,以评估大型语言模型在处理和查询此类表格中的能力。
链接: https://arxiv.org/abs/2504.20047
作者: Mohammad S. Ahmad,Zan A. Naeem,Michaël Aupetit,Ahmed Elmagarmid,Mohamed Eltabakh,Xiasong Ma,Mourad Ouzzani,Chaoyi Ruan
机构: 未知
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Databases (cs.DB)
备注: 12 pages
Abstract:Tabular data embedded within PDF files, web pages, and other document formats are prevalent across numerous sectors such as government, engineering, science, and business. These human-centric tables (HCTs) possess a unique combination of high business value, intricate layouts, limited operational power at scale, and sometimes serve as the only data source for critical insights. However, their complexity poses significant challenges to traditional data extraction, processing, and querying methods. While current solutions focus on transforming these tables into relational formats for SQL queries, they fall short in handling the diverse and complex layouts of HCTs and hence being amenable to querying. This paper describes HCT-QA, an extensive benchmark of HCTs, natural language queries, and related answers on thousands of tables. Our dataset includes 2,188 real-world HCTs with 9,835 QA pairs and 4,679 synthetic tables with 67.5K QA pairs. While HCTs can be potentially processed by different type of query engines, in this paper, we focus on Large Language Models as potential engines and assess their ability in processing and querying such tables.
zh
[AI-97] Heterogeneous network drug-target interaction prediction model based on graph wavelet transform and multi-level contrastive learning
【速读】:该论文旨在解决药物-靶点相互作用(Drug-target interaction, DTI)预测中的黑箱问题,即传统机器学习方法难以揭示模型决策机制与生物分子相互作用模式之间的深层关联。其解决方案的关键在于提出一种异构网络药物靶点相互作用预测框架,整合图神经网络与多尺度信号处理技术,构建具备高效预测能力和多层次可解释性的模型。该框架的技术突破主要体现在三个维度:局部全局特征协同感知模块、多尺度图信号分解与生物学解释模块以及深度分层节点特征变换架构,通过融合HGCN和GWT两种视角的节点表示,实现多维信息整合与预测鲁棒性的提升。
链接: https://arxiv.org/abs/2504.20103
作者: Wenfeng Dai,Yanhong Wang,Shuai Yan,Qingzhi Yu,Xiang Cheng
机构: 未知
类目: Quantitative Methods (q-bio.QM); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Drug-target interaction (DTI) prediction is a core task in drug development and precision medicine in the biomedical field. However, traditional machine learning methods generally have the black box problem, which makes it difficult to reveal the deep correlation between the model decision mechanism and the interaction pattern between biological molecules. This study proposes a heterogeneous network drug target interaction prediction framework, integrating graph neural network and multi scale signal processing technology to construct a model with both efficient prediction and multi level interpretability. Its technical breakthroughs are mainly reflected in the following three dimensions:Local global feature collaborative perception module. Based on heterogeneous graph convolutional neural network (HGCN), a multi order neighbor aggregation strategy is this http URL scale graph signal decomposition and biological interpretation module. A deep hierarchical node feature transform (GWT) architecture is this http URL learning combining multi dimensional perspectives and hierarchical representations. By comparing the learning models, the node representations from the two perspectives of HGCN and GWT are aligned and fused, so that the model can integrate multi dimensional information and improve the prediction robustness. Experimental results show that our framework shows excellent prediction performance on all datasets. This study provides a complete solution for drug target discovery from black box prediction to mechanism decoding, and its methodology has important reference value for modeling complex biomolecular interaction systems.
zh
机器学习
[LG-0] ACE: A Security Architecture for LLM -Integrated App Systems
链接: https://arxiv.org/abs/2504.20984
作者: Evan Li,Tushin Mallick,Evan Rose,William Robertson,Alina Oprea,Cristina Nita-Rotaru
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注: 21 pages, 13 figures
Abstract:LLM-integrated app systems extend the utility of Large Language Models (LLMs) with third-party apps that are invoked by a system LLM using interleaved planning and execution phases to answer user queries. These systems introduce new attack vectors where malicious apps can cause integrity violation of planning or execution, availability breakdown, or privacy compromise during execution. In this work, we identify new attacks impacting the integrity of planning, as well as the integrity and availability of execution in LLM-integrated apps, and demonstrate them against IsolateGPT, a recent solution designed to mitigate attacks from malicious apps. We propose Abstract-Concrete-Execute (ACE), a new secure architecture for LLM-integrated app systems that provides security guarantees for system planning and execution. Specifically, ACE decouples planning into two phases by first creating an abstract execution plan using only trusted information, and then mapping the abstract plan to a concrete plan using installed system apps. We verify that the plans generated by our system satisfy user-specified secure information flow constraints via static analysis on the structured plan output. During execution, ACE enforces data and capability barriers between apps, and ensures that the execution is conducted according to the trusted abstract plan. We show experimentally that our system is secure against attacks from the INJECAGENT benchmark, a standard benchmark for control flow integrity in the face of indirect prompt injection attacks, and our newly introduced attacks. Our architecture represents a significant advancement towards hardening LLM-based systems containing system facilities of varying levels of trustworthiness. Comments: 21 pages, 13 figures Subjects: Cryptography and Security (cs.CR); Machine Learning (cs.LG) Cite as: arXiv:2504.20984 [cs.CR] (or arXiv:2504.20984v1 [cs.CR] for this version) https://doi.org/10.48550/arXiv.2504.20984 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[LG-1] Equivariant non-linear maps for neural networks on homogeneous spaces
链接: https://arxiv.org/abs/2504.20974
作者: Elias Nyholm,Oscar Carlsson,Maurice Weiler,Daniel Persson
类目: Machine Learning (cs.LG); Representation Theory (math.RT); Machine Learning (stat.ML)
*备注: 45 pages,10 figures
Abstract:This paper presents a novel framework for non-linear equivariant neural network layers on homogeneous spaces. The seminal work of Cohen et al. on equivariant G -CNNs on homogeneous spaces characterized the representation theory of such layers in the linear setting, finding that they are given by convolutions with kernels satisfying so-called steerability constraints. Motivated by the empirical success of non-linear layers, such as self-attention or input dependent kernels, we set out to generalize these insights to the non-linear setting. We derive generalized steerability constraints that any such layer needs to satisfy and prove the universality of our construction. The insights gained into the symmetry-constrained functional dependence of equivariant operators on feature maps and group elements informs the design of future equivariant neural network layers. We demonstrate how several common equivariant network architectures - G -CNNs, implicit steerable kernel networks, conventional and relative position embedded attention based transformers, and LieTransformers - may be derived from our framework.
[LG-2] XPG-RL: Reinforcement Learning with Explainable Priority Guidance for Efficiency-Boosted Mechanical Search
链接: https://arxiv.org/abs/2504.20969
作者: Yiting Zhang,Shichen Li,Elena Shrestha
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注: 13 pages, 5 figures
Abstract:Mechanical search (MS) in cluttered environments remains a significant challenge for autonomous manipulators, requiring long-horizon planning and robust state estimation under occlusions and partial observability. In this work, we introduce XPG-RL, a reinforcement learning framework that enables agents to efficiently perform MS tasks through explainable, priority-guided decision-making based on raw sensory inputs. XPG-RL integrates a task-driven action prioritization mechanism with a learned context-aware switching strategy that dynamically selects from a discrete set of action primitives such as target grasping, occlusion removal, and viewpoint adjustment. Within this strategy, a policy is optimized to output adaptive threshold values that govern the discrete selection among action primitives. The perception module fuses RGB-D inputs with semantic and geometric features to produce a structured scene representation for downstream decision-making. Extensive experiments in both simulation and real-world settings demonstrate that XPG-RL consistently outperforms baseline methods in task success rates and motion efficiency, achieving up to 4.5 \times higher efficiency in long-horizon tasks. These results underscore the benefits of integrating domain knowledge with learnable decision-making policies for robust and efficient robotic manipulation.
[LG-3] Softpick: No Attention Sink No Massive Activations with Rectified Softmax
链接: https://arxiv.org/abs/2504.20966
作者: Zayd M. K. Zuhri,Erland Hilman Fuadi,Alham Fikri Aji
类目: Machine Learning (cs.LG)
*备注:
Abstract:We introduce softpick, a rectified, not sum-to-one, drop-in replacement for softmax in transformer attention mechanisms that eliminates attention sink and massive activations. Our experiments with 340M parameter models demonstrate that softpick maintains performance parity with softmax on standard benchmarks while achieving 0% sink rate. The softpick transformer produces hidden states with significantly lower kurtosis (340 vs 33,510) and creates sparse attention maps (46.97% sparsity). Models using softpick consistently outperform softmax when quantized, with particularly pronounced advantages at lower bit precisions. Our analysis and discussion shows how softpick has the potential to open new possibilities for quantization, low-precision training, sparsity optimization, pruning, and interpretability. Our code is available at this https URL.
[LG-4] AegisLLM : Scaling Agent ic Systems for Self-Reflective Defense in LLM Security ICLR2025
链接: https://arxiv.org/abs/2504.20965
作者: Zikui Cai,Shayan Shabihi,Bang An,Zora Che,Brian R. Bartoldson,Bhavya Kailkhura,Tom Goldstein,Furong Huang
类目: Machine Learning (cs.LG)
*备注: ICLR 2025 Workshop BuildingTrust
Abstract:We introduce AegisLLM, a cooperative multi-agent defense against adversarial attacks and information leakage. In AegisLLM, a structured workflow of autonomous agents - orchestrator, deflector, responder, and evaluator - collaborate to ensure safe and compliant LLM outputs, while self-improving over time through prompt optimization. We show that scaling agentic reasoning system at test-time - both by incorporating additional agent roles and by leveraging automated prompt optimization (such as DSPy)- substantially enhances robustness without compromising model utility. This test-time defense enables real-time adaptability to evolving attacks, without requiring model retraining. Comprehensive evaluations across key threat scenarios, including unlearning and jailbreaking, demonstrate the effectiveness of AegisLLM. On the WMDP unlearning benchmark, AegisLLM achieves near-perfect unlearning with only 20 training examples and fewer than 300 LM calls. For jailbreaking benchmarks, we achieve 51% improvement compared to the base model on StrongReject, with false refusal rates of only 7.9% on PHTest compared to 18-55% for comparable methods. Our results highlight the advantages of adaptive, agentic reasoning over static defenses, establishing AegisLLM as a strong runtime alternative to traditional approaches based on model modifications. Code is available at this https URL
[LG-5] Deep Learning Characterizes Depression and Suicidal Ideation from Eye Movements
链接: https://arxiv.org/abs/2504.20944
作者: Kleanthis Avramidis,Woojae Jeong,Aditya Kommineni,Sudarsana R. Kadiri,Marcus Ma,Colin McDaniel,Myzelle Hughes,Thomas McGee,Elsi Kaiser,Dani Byrd,Assal Habibi,B. Rael Cahn,Idan A. Blank,Kristina Lerman,Takfarinas Medani,Richard M. Leahy,Shrikanth Narayanan
类目: Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注: Preprint. 12 pages, 5 figures
Abstract:Identifying physiological and behavioral markers for mental health conditions is a longstanding challenge in psychiatry. Depression and suicidal ideation, in particular, lack objective biomarkers, with screening and diagnosis primarily relying on self-reports and clinical interviews. Here, we investigate eye tracking as a potential marker modality for screening purposes. Eye movements are directly modulated by neuronal networks and have been associated with attentional and mood-related patterns; however, their predictive value for depression and suicidality remains unclear. We recorded eye-tracking sequences from 126 young adults as they read and responded to affective sentences, and subsequently developed a deep learning framework to predict their clinical status. The proposed model included separate branches for trials of positive and negative sentiment, and used 2D time-series representations to account for both intra-trial and inter-trial variations. We were able to identify depression and suicidal ideation with an area under the receiver operating curve (AUC) of 0.793 (95% CI: 0.765-0.819) against healthy controls, and suicidality specifically with 0.826 AUC (95% CI: 0.797-0.852). The model also exhibited moderate, yet significant, accuracy in differentiating depressed from suicidal participants, with 0.609 AUC (95% CI 0.571-0.646). Discriminative patterns emerge more strongly when assessing the data relative to response generation than relative to the onset time of the final word of the sentences. The most pronounced effects were observed for negative-sentiment sentences, that are congruent to depressed and suicidal participants. Our findings highlight eye tracking as an objective tool for mental health assessment and underscore the modulatory impact of emotional stimuli on cognitive processes affecting oculomotor control.
[LG-6] Scenario-based Compositional Verification of Autonomous Systems with Neural Perception
链接: https://arxiv.org/abs/2504.20942
作者: Christopher Watson,Rajeev Alur,Divya Gopinath,Ravi Mangal,Corina S. Pasareanu
类目: Machine Learning (cs.LG); Robotics (cs.RO)
*备注:
Abstract:Recent advances in deep learning have enabled the development of autonomous systems that use deep neural networks for perception. Formal verification of these systems is challenging due to the size and complexity of the perception DNNs as well as hard-to-quantify, changing environment conditions. To address these challenges, we propose a probabilistic verification framework for autonomous systems based on the following key concepts: (1) Scenario-based Modeling: We decompose the task (e.g., car navigation) into a composition of scenarios, each representing a different environment condition. (2) Probabilistic Abstractions: For each scenario, we build a compact abstraction of perception based on the DNN’s performance on an offline dataset that represents the scenario’s environment condition. (3) Symbolic Reasoning and Acceleration: The abstractions enable efficient compositional verification of the autonomous system via symbolic reasoning and a novel acceleration proof rule that bounds the error probability of the system under arbitrary variations of environment conditions. We illustrate our approach on two case studies: an experimental autonomous system that guides airplanes on taxiways using high-dimensional perception DNNs and a simulation model of an F1Tenth autonomous car using LiDAR observations.
[LG-7] Improvements of Dark Experience Replay and Reservoir Sampling towards Better Balance between Consolidation and Plasticity
链接: https://arxiv.org/abs/2504.20932
作者: Taisuke Kobayashi
类目: Machine Learning (cs.LG)
*备注: 29 pages, 8 figures
Abstract:Continual learning is the one of the most essential abilities for autonomous agents, which can incrementally learn daily-life skills. For this ultimate goal, a simple but powerful method, dark experience replay (DER), has been proposed recently. DER mitigates catastrophic forgetting, in which the skills acquired in the past are unintentionally forgotten, by stochastically storing the streaming data in a reservoir sampling (RS) buffer and by relearning them or retaining the past outputs for them. However, since DER considers multiple objectives, it will not function properly without appropriate weighting of them. In addition, the ability to retain past outputs inhibits learning if the past outputs are incorrect due to distribution shift or other effects. This is due to a tradeoff between memory consolidation and plasticity. The tradeoff is hidden even in the RS buffer, which gradually stops storing new data for new skills in it as data is continuously passed to it. To alleviate the tradeoff and achieve better balance, this paper proposes improvement strategies to each of DER and RS. Specifically, DER is improved with automatic adaptation of weights, block of replaying erroneous data, and correction of past outputs. RS is also improved with generalization of acceptance probability, stratification of plural buffers, and intentional omission of unnecessary data. These improvements are verified through multiple benchmarks including regression, classification, and reinforcement learning problems. As a result, the proposed methods achieve steady improvements in learning performance by balancing the memory consolidation and plasticity.
[LG-8] Exploiting inter-agent coupling information for efficient reinforcement learning of cooperative LQR
链接: https://arxiv.org/abs/2504.20927
作者: Shahbaz P Qadri Syed,He Bai
类目: ystems and Control (eess.SY); Machine Learning (cs.LG); Multiagent Systems (cs.MA); Optimization and Control (math.OC)
*备注: Accepted at Learning for Dynamics and Control (L4DC), 2025
Abstract:Developing scalable and efficient reinforcement learning algorithms for cooperative multi-agent control has received significant attention over the past years. Existing literature has proposed inexact decompositions of local Q-functions based on empirical information structures between the agents. In this paper, we exploit inter-agent coupling information and propose a systematic approach to exactly decompose the local Q-function of each agent. We develop an approximate least square policy iteration algorithm based on the proposed decomposition and identify two architectures to learn the local Q-function for each agent. We establish that the worst-case sample complexity of the decomposition is equal to the centralized case and derive necessary and sufficient graphical conditions on the inter-agent couplings to achieve better sample efficiency. We demonstrate the improved sample efficiency and computational efficiency on numerical examples.
[LG-9] Statistical and Predictive Analysis to Identify Risk Factors and Effects of Post COVID-19 Syndrome IJCNN2025
链接: https://arxiv.org/abs/2504.20915
作者: Milad Leyli-abadi,Jean-Patrick Brunet,Axel Tahmasebimoradi
类目: Machine Learning (cs.LG)
*备注: 8 pages, 9 figures, 2 tables, initially submitted in IJCNN 2025, but rejected because of the high number of contributions (requested to be presented as a poster in the conference without being published in conference proceedings)
Abstract:Based on recent studies, some COVID-19 symptoms can persist for months after infection, leading to what is termed long COVID. Factors such as vaccination timing, patient characteristics, and symptoms during the acute phase of infection may contribute to the prolonged effects and intensity of long COVID. Each patient, based on their unique combination of factors, develops a specific risk or intensity of long COVID. In this work, we aim to achieve two objectives: (1) conduct a statistical analysis to identify relationships between various factors and long COVID, and (2) perform predictive analysis of long COVID intensity using these factors. We benchmark and interpret various data-driven approaches, including linear models, random forests, gradient boosting, and neural networks, using data from the Lifelines COVID-19 cohort. Our results show that Neural Networks (NN) achieve the best performance in terms of MAPE, with predictions averaging 19% error. Additionally, interpretability analysis reveals key factors such as loss of smell, headache, muscle pain, and vaccination timing as significant predictors, while chronic disease and gender are critical risk factors. These insights provide valuable guidance for understanding long COVID and developing targeted interventions.
[LG-10] MOSIC: Model-Agnostic Optimal Subgroup Identification with Multi-Constraint for Improved Reliability
链接: https://arxiv.org/abs/2504.20908
作者: Wenxin Chen,Weishen Pan,Kyra Gan,Fei Wang
类目: Machine Learning (cs.LG)
*备注:
Abstract:Identifying subgroups that benefit from specific treatments using observational data is a critical challenge in personalized medicine. Most existing approaches solely focus on identifying a subgroup with an improved treatment effect. However, practical considerations, such as ensuring a minimum subgroup size for representativeness or achieving sufficient confounder balance for reliability, are also important for making findings clinically meaningful and actionable. While some studies address these constraints individually, none offer a unified approach to handle them simultaneously. To bridge this gap, we propose a model-agnostic framework for optimal subgroup identification under multiple constraints. We reformulate this combinatorial problem as an unconstrained min-max optimization problem with novel modifications and solve it by a gradient descent ascent algorithm. We further prove its convergence to a feasible and locally optimal solution. Our method is stable and highly flexible, supporting various models and techniques for estimating and optimizing treatment effectiveness with observational data. Extensive experiments on both synthetic and real-world datasets demonstrate its effectiveness in identifying subgroups that satisfy multiple constraints, achieving higher treatment effects and better confounder balancing results across different group sizes.
[LG-11] GiBy: A Giant-Step Baby-Step Classifier For Anomaly Detection In Industrial Control Systems
链接: https://arxiv.org/abs/2504.20906
作者: Sarad Venugopalan,Sridhar Adepu
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:
Abstract:The continuous monitoring of the interactions between cyber-physical components of any industrial control system (ICS) is required to secure automation of the system controls, and to guarantee plant processes are fail-safe and remain in an acceptably safe state. Safety is achieved by managing actuation (where electric signals are used to trigger physical movement), dependent on corresponding sensor readings; used as ground truth in decision making. Timely detection of anomalies (attacks, faults and unascertained states) in ICSs is crucial for the safe running of a plant, the safety of its personnel, and for the safe provision of any services provided. We propose an anomaly detection method that involves accurate linearization of the non-linear forms arising from sensor-actuator(s) relationships, primarily because solving linear models is easier and well understood. Further, the time complexity of the anomaly detection scenario/problem at hand is lowered using dimensionality reduction of the actuator(s) in relationship with a sensor. We accomplish this by using a well-known water treatment testbed as a use case. Our experiments show millisecond time response to detect anomalies and provide explainability; that are not simultaneously achieved by other state of the art AI/ML models with eXplainable AI (XAI) used for the same purpose. Further, we pin-point the sensor(s) and its actuation state for which anomaly was detected.
[LG-12] Dual Explanations via Subgraph Matching for Malware Detection
链接: https://arxiv.org/abs/2504.20904
作者: Hossein Shokouhinejad,Roozbeh Razavi-Far,Griffin Higgins,Ali A. Ghorbani
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:
Abstract:Interpretable malware detection is crucial for understanding harmful behaviors and building trust in automated security systems. Traditional explainable methods for Graph Neural Networks (GNNs) often highlight important regions within a graph but fail to associate them with known benign or malicious behavioral patterns. This limitation reduces their utility in security contexts, where alignment with verified prototypes is essential. In this work, we introduce a novel dual prototype-driven explainable framework that interprets GNN-based malware detection decisions. This dual explainable framework integrates a base explainer (a state-of-the-art explainer) with a novel second-level explainer which is designed by subgraph matching technique, called SubMatch explainer. The proposed explainer assigns interpretable scores to nodes based on their association with matched subgraphs, offering a fine-grained distinction between benign and malicious regions. This prototype-guided scoring mechanism enables more interpretable, behavior-aligned explanations. Experimental results demonstrate that our method preserves high detection performance while significantly improving interpretability in malware analysis.
[LG-13] Evaluating Generative Models for Tabular Data: Novel Metrics and Benchmarking
链接: https://arxiv.org/abs/2504.20900
作者: Dayananda Herurkar,Ahmad Ali,Andreas Dengel
类目: Machine Learning (cs.LG)
*备注: Tabular Data, Generative Models, Evaluation Metrics, Network Intrusion Detection, Outlier Detection, Anomaly Detection
Abstract:Generative models have revolutionized multiple domains, yet their application to tabular data remains underexplored. Evaluating generative models for tabular data presents unique challenges due to structural complexity, large-scale variability, and mixed data types, making it difficult to intuitively capture intricate patterns. Existing evaluation metrics offer only partial insights, lacking a comprehensive measure of generative performance. To address this limitation, we propose three novel evaluation metrics: FAED, FPCAD, and RFIS. Our extensive experimental analysis, conducted on three standard network intrusion detection datasets, compares these metrics with established evaluation methods such as Fidelity, Utility, TSTR, and TRTS. Our results demonstrate that FAED effectively captures generative modeling issues overlooked by existing metrics. While FPCAD exhibits promising performance, further refinements are necessary to enhance its reliability. Our proposed framework provides a robust and practical approach for assessing generative models in tabular data applications.
[LG-14] Does Feedback Help in Bandits with Arm Erasures?
链接: https://arxiv.org/abs/2504.20894
作者: Merve Karakas,Osama Hanna,Lin F. Yang,Christina Fragouli
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
Abstract:We study a distributed multi-armed bandit (MAB) problem over arm erasure channels, motivated by the increasing adoption of MAB algorithms over communication-constrained networks. In this setup, the learner communicates the chosen arm to play to an agent over an erasure channel with probability \epsilon \in [0,1) ; if an erasure occurs, the agent continues pulling the last successfully received arm; the learner always observes the reward of the arm pulled. In past work, we considered the case where the agent cannot convey feedback to the learner, and thus the learner does not know whether the arm played is the requested or the last successfully received one. In this paper, we instead consider the case where the agent can send feedback to the learner on whether the arm request was received, and thus the learner exactly knows which arm was played. Surprisingly, we prove that erasure feedback does not improve the worst-case regret upper bound order over the previously studied no-feedback setting. In particular, we prove a regret lower bound of \Omega(\sqrtKT + K / (1 - \epsilon)) , where K is the number of arms and T the time horizon, that matches no-feedback upper bounds up to logarithmic factors. We note however that the availability of feedback enables simpler algorithm designs that may achieve better constants (albeit not better order) regret bounds; we design one such algorithm and evaluate its performance numerically.
[LG-15] Guessing Efficiently for Constrained Subspace Approximation
链接: https://arxiv.org/abs/2504.20883
作者: Aditya Bhaskara,Sepideh Mahabadi,Madhusudhan Reddy Pittu,Ali Vakilian,David P. Woodruff
类目: Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG)
*备注:
Abstract:In this paper we study constrained subspace approximation problem. Given a set of n points \a_1,\ldots,a_n\ in \mathbbR^d , the goal of the \em subspace approximation problem is to find a k dimensional subspace that best approximates the input points. More precisely, for a given p\geq 1 , we aim to minimize the p th power of the \ell_p norm of the error vector (|a_1-\bmPa_1|,\ldots,|a_n-\bmPa_n|) , where \bmP denotes the projection matrix onto the subspace and the norms are Euclidean. In \emphconstrained subspace approximation (CSA), we additionally have constraints on the projection matrix \bmP . In its most general form, we require \bmP to belong to a given subset \mathcalS that is described explicitly or implicitly. We introduce a general framework for constrained subspace approximation. Our approach, that we term coreset-guess-solve, yields either (1+\varepsilon) -multiplicative or \varepsilon -additive approximations for a variety of constraints. We show that it provides new algorithms for partition-constrained subspace approximation with applications to \it fair subspace approximation, k -means clustering, and projected non-negative matrix factorization, among others. Specifically, while we reconstruct the best known bounds for k -means clustering in Euclidean spaces, we improve the known results for the remainder of the problems. Subjects: Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG) Cite as: arXiv:2504.20883 [cs.DS] (or arXiv:2504.20883v1 [cs.DS] for this version) https://doi.org/10.48550/arXiv.2504.20883 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[LG-16] Hybrid Quantum Recurrent Neural Network For Remaining Useful Life Prediction
链接: https://arxiv.org/abs/2504.20823
作者: Olga Tsurkan,Aleksandra Konstantinova,Aleksandr Sedykh,Dmitrii Zhiganov,Arsenii Senokosov,Daniil Tarpanov,Matvei Anoshin,Leonid Fedichkin
类目: Machine Learning (cs.LG); Quantum Physics (quant-ph)
*备注: 11 pages
Abstract:Predictive maintenance in aerospace heavily relies on accurate estimation of the remaining useful life of jet engines. In this paper, we introduce a Hybrid Quantum Recurrent Neural Network frame- work, combining Quantum Long Short-Term Memory layers with classical dense layers for Remaining Useful Life forecasting on NASA’s Commercial Modular Aero-Propulsion System Simulation dataset. Each Quantum Long Short-Term Memory gate replaces conventional linear transformations with Quantum Depth-Infused circuits, allowing the network to learn high-frequency components more effectively. Experimental results demonstrate that, despite having fewer trainable parameters, the Hybrid Quantum Recurrent Neural Network achieves up to a 5% improvement over a Recurrent Neural Network based on stacked Long Short-Term Memory layers in terms of mean root mean squared error and mean absolute error. Moreover, a thorough comparison of our method with established techniques, including Random Forest, Convolutional Neural Network, and Multilayer Perceptron, demonstrates that our approach, which achieves a Root Mean Squared Error of 15.46, surpasses these baselines by approximately 13.68%, 16.21%, and 7.87%, respectively. Nevertheless, it remains outperformed by certain advanced joint architectures. Our findings highlight the poten- tial of hybrid quantum-classical approaches for robust time-series forecasting under limited data conditions, offering new avenues for enhancing reliability in predictive maintenance tasks.
[LG-17] An approach to melodic segmentation and classification based on filtering with the Haar-wavelet WWW
链接: https://arxiv.org/abs/2504.20822
作者: Gissel Velarde,Tillman Weyde,David Meredith
类目: Machine Learning (cs.LG)
*备注: 39 pages, 12 figures. Version of record published in the Journal of New Music Research: this http URL
Abstract:We present a novel method of classification and segmentation of melodies in symbolic representation. The method is based on filtering pitch as a signal over time with the Haar-wavelet, and we evaluate it on two tasks. The filtered signal corresponds to a single-scale signal ws from the continuous Haar wavelet transform. The melodies are first segmented using local maxima or zero-crossings of w_s. The segments of w_s are then classified using the k-nearest neighbour algorithm with Euclidian and city-block distances. The method proves more effective than using unfiltered pitch signals and Gestalt-based segmentation when used to recognize the parent works of segments from Bach’s Two-Part Inventions (BWV 772-786). When used to classify 360 Dutch folk tunes into 26 tune families, the performance of the method is comparable to the use of pitch signals, but not as good as that of string-matching methods based on multiple features.
[LG-18] he When and How of Target Variable Transformations
链接: https://arxiv.org/abs/2504.20821
作者: Loren Nuyts,Jesse Davis
类目: Machine Learning (cs.LG)
*备注:
Abstract:The machine learning pipeline typically involves the iterative process of (1) collecting the data, (2) preparing the data, (3) learning a model, and (4) evaluating a model. Practitioners recognize the importance of the data preparation phase in terms of its impact on the ability to learn accurate models. In this regard, significant attention is often paid to manipulating the feature set (e.g., selection, transformations, dimensionality reduction). A point that is less well appreciated is that transformations on the target variable can also have a large impact on whether it is possible to learn a suitable model. These transformations may include accounting for subject-specific biases (e.g., in how someone uses a rating scale), contexts (e.g., population size effects), and general trends (e.g., inflation). However, this point has received a much more cursory treatment in the existing literature. The goal of this paper is three-fold. First, we aim to highlight the importance of this problem by showing when transforming the target variable has been useful in practice. Second, we will provide a set of generic ``rules of thumb’’ that indicate situations when transforming the target variable may be needed. Third, we will discuss which transformations should be considered in a given situation.
[LG-19] Q-Fusion: Diffusing Quantum Circuits
链接: https://arxiv.org/abs/2504.20794
作者: Collin Beaudoin,Swaroop Ghosh
类目: Machine Learning (cs.LG)
*备注:
Abstract:Quantum computing holds great potential for solving socially relevant and computationally complex problems. Furthermore, quantum machine learning (QML) promises to rapidly improve our current machine learning capabilities. However, current noisy intermediate-scale quantum (NISQ) devices are constrained by limitations in the number of qubits and gate counts, which hinder their full capabilities. Furthermore, the design of quantum algorithms remains a laborious task, requiring significant domain expertise and time. Quantum Architecture Search (QAS) aims to streamline this process by automatically generating novel quantum circuits, reducing the need for manual intervention. In this paper, we propose a diffusion-based algorithm leveraging the LayerDAG framework to generate new quantum circuits. This method contrasts with other approaches that utilize large language models (LLMs), reinforcement learning (RL), variational autoencoders (VAE), and similar techniques. Our results demonstrate that the proposed model consistently generates 100% valid quantum circuit outputs.
[LG-20] Evaluating Effects of Augmented SELFIES for Molecular Understanding Using QK-LSTM
链接: https://arxiv.org/abs/2504.20789
作者: Collin Beaudoin,Swaroop Ghosh
类目: Machine Learning (cs.LG)
*备注:
Abstract:Identifying molecular properties, including side effects, is a critical yet time-consuming step in drug development. Failing to detect these side effects before regulatory submission can result in significant financial losses and production delays, and overlooking them during the regulatory review can lead to catastrophic consequences. This challenge presents an opportunity for innovative machine learning approaches, particularly hybrid quantum-classical models like the Quantum Kernel-Based Long Short-Term Memory (QK-LSTM) network. The QK-LSTM integrates quantum kernel functions into the classical LSTM framework, enabling the capture of complex, non-linear patterns in sequential data. By mapping input data into a high-dimensional quantum feature space, the QK-LSTM model reduces the need for large parameter sets, allowing for model compression without sacrificing accuracy in sequence-based tasks. Recent advancements have been made in the classical domain using augmented variations of the Simplified Molecular Line-Entry System (SMILES). However, to the best of our knowledge, no research has explored the impact of augmented SMILES in the quantum domain, nor the role of augmented Self-Referencing Embedded Strings (SELFIES) in either classical or hybrid quantum-classical settings. This study presents the first analysis of these approaches, providing novel insights into their potential for enhancing molecular property prediction and side effect identification. Results reveal that augmenting SELFIES yields in statistically significant improvements from SMILES by a 5.97% improvement for the classical domain and a 5.91% improvement for the hybrid quantum-classical domain.
[LG-21] DDPS: Discrete Diffusion Posterior Sampling for Paths in Layered Graphs ICLR2025
链接: https://arxiv.org/abs/2504.20754
作者: Hao Luan,See-Kiong Ng,Chun Kai Ling
类目: Machine Learning (cs.LG)
*备注: To appear at Frontiers in Probabilistic Inference: Sampling meets Learning (FPI) workshop at ICLR 2025. this https URL
Abstract:Diffusion models form an important class of generative models today, accounting for much of the state of the art in cutting edge AI research. While numerous extensions beyond image and video generation exist, few of such approaches address the issue of explicit constraints in the samples generated. In this paper, we study the problem of generating paths in a layered graph (a variant of a directed acyclic graph) using discrete diffusion models, while guaranteeing that our generated samples are indeed paths. Our approach utilizes a simple yet effective representation for paths which we call the padded adjacency-list matrix (PALM). In addition, we show how to effectively perform classifier guidance, which helps steer the sampled paths to specific preferred edges without any retraining of the diffusion model. Our preliminary results show that empirically, our method outperforms alternatives which do not explicitly account for path constraints.
[LG-22] Intelligent Task Offloading in VANETs: A Hybrid AI-Driven Approach for Low-Latency and Energy Efficiency
链接: https://arxiv.org/abs/2504.20735
作者: Tariq Qayyum,Asadullah Tariq,Muhammad Ali,Mohamed Adel Serhani,Zouheir Trabelsi,Maite López-Sánchez
类目: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注:
Abstract:Vehicular Ad-hoc Networks (VANETs) are integral to intelligent transportation systems, enabling vehicles to offload computational tasks to nearby roadside units (RSUs) and mobile edge computing (MEC) servers for real-time processing. However, the highly dynamic nature of VANETs introduces challenges, such as unpredictable network conditions, high latency, energy inefficiency, and task failure. This research addresses these issues by proposing a hybrid AI framework that integrates supervised learning, reinforcement learning, and Particle Swarm Optimization (PSO) for intelligent task offloading and resource allocation. The framework leverages supervised models for predicting optimal offloading strategies, reinforcement learning for adaptive decision-making, and PSO for optimizing latency and energy consumption. Extensive simulations demonstrate that the proposed framework achieves significant reductions in latency and energy usage while improving task success rates and network throughput. By offering an efficient, and scalable solution, this framework sets the foundation for enhancing real-time applications in dynamic vehicular environments.
[LG-23] Whats Wrong with Your Synthetic Tabular Data? Using Explainable AI to Evaluate Generative Models
链接: https://arxiv.org/abs/2504.20687
作者: Jan Kapar,Niklas Koenen,Martin Jullum
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: This is the accepted, post peer-reviewed version of the manuscript, accepted for publication in the proceedings after the Third World Conference on eXplainable Artificial Intelligence, XAI-2025. A link to the version of record will be included here upon publication
Abstract:Evaluating synthetic tabular data is challenging, since they can differ from the real data in so many ways. There exist numerous metrics of synthetic data quality, ranging from statistical distances to predictive performance, often providing conflicting results. Moreover, they fail to explain or pinpoint the specific weaknesses in the synthetic data. To address this, we apply explainable AI (XAI) techniques to a binary detection classifier trained to distinguish real from synthetic data. While the classifier identifies distributional differences, XAI concepts such as feature importance and feature effects, analyzed through methods like permutation feature importance, partial dependence plots, Shapley values and counterfactual explanations, reveal why synthetic data are distinguishable, highlighting inconsistencies, unrealistic dependencies, or missing patterns. This interpretability increases transparency in synthetic data evaluation and provides deeper insights beyond conventional metrics, helping diagnose and improve synthetic data quality. We apply our approach to two tabular datasets and generative models, showing that it uncovers issues overlooked by standard evaluation techniques.
[LG-24] Explanations Go Linear: Interpretable and Individual Latent Encoding for Post-hoc Explainability
链接: https://arxiv.org/abs/2504.20667
作者: Simone Piaggesi,Riccardo Guidotti,Fosca Giannotti,Dino Pedreschi
类目: Machine Learning (cs.LG)
*备注:
Abstract:Post-hoc explainability is essential for understanding black-box machine learning models. Surrogate-based techniques are widely used for local and global model-agnostic explanations but have significant limitations. Local surrogates capture non-linearities but are computationally expensive and sensitive to parameters, while global surrogates are more efficient but struggle with complex local behaviors. In this paper, we present ILLUME, a flexible and interpretable framework grounded in representation learning, that can be integrated with various surrogate models to provide explanations for any black-box classifier. Specifically, our approach combines a globally trained surrogate with instance-specific linear transformations learned with a meta-encoder to generate both local and global explanations. Through extensive empirical evaluations, we demonstrate the effectiveness of ILLUME in producing feature attributions and decision rules that are not only accurate but also robust and faithful to the black-box, thus providing a unified explanation framework that effectively addresses the limitations of traditional surrogate methods.
[LG-25] SFi-Former: Sparse Flow Induced Attention for Graph Transformer ICMR2025
链接: https://arxiv.org/abs/2504.20666
作者: Zhonghao Li,Ji Shi,Xinming Zhang,Miao Zhang,Bo Li
类目: Machine Learning (cs.LG)
*备注: ICMR 2025
Abstract:Graph Transformers (GTs) have demonstrated superior performance compared to traditional message-passing graph neural networks in many studies, especially in processing graph data with long-range dependencies. However, GTs tend to suffer from weak inductive bias, overfitting and over-globalizing problems due to the dense attention. In this paper, we introduce SFi-attention, a novel attention mechanism designed to learn sparse pattern by minimizing an energy function based on network flows with l1-norm regularization, to relieve those issues caused by dense attention. Furthermore, SFi-Former is accordingly devised which can leverage the sparse attention pattern of SFi-attention to generate sparse network flows beyond adjacency matrix of graph data. Specifically, SFi-Former aggregates features selectively from other nodes through flexible adaptation of the sparse attention, leading to a more robust model. We validate our SFi-Former on various graph datasets, especially those graph data exhibiting long-range dependencies. Experimental results show that our SFi-Former obtains competitive performance on GNN Benchmark datasets and SOTA performance on LongRange Graph Benchmark (LRGB) datasets. Additionally, our model gives rise to smaller generalization gaps, which indicates that it is less prone to over-fitting. Click here for codes.
[LG-26] Quantum-Enhanced Hybrid Reinforcement Learning Framework for Dynamic Path Planning in Autonomous Systems
链接: https://arxiv.org/abs/2504.20660
作者: Sahil Tomar,Shamshe Alam,Sandeep Kumar,Amit Mathur
类目: Machine Learning (cs.LG); Emerging Technologies (cs.ET); Information Theory (cs.IT)
*备注: 16 pages
Abstract:In this paper, a novel quantum classical hybrid framework is proposed that synergizes quantum with Classical Reinforcement Learning. By leveraging the inherent parallelism of quantum computing, the proposed approach generates robust Q tables and specialized turn cost estimations, which are then integrated with a classical Reinforcement Learning pipeline. The Classical Quantum fusion results in rapid convergence of training, reducing the training time significantly and improved adaptability in scenarios featuring static, dynamic, and moving obstacles. Simulator based evaluations demonstrate significant enhancements in path efficiency, trajectory smoothness, and mission success rates, underscoring the potential of framework for real time, autonomous navigation in complex and unpredictable environments. Furthermore, the proposed framework was tested beyond simulations on practical scenarios, including real world map data such as the IIT Delhi campus, reinforcing its potential for real time, autonomous navigation in complex and unpredictable environments.
[LG-27] RuleKit 2: Faster and simpler rule learning
链接: https://arxiv.org/abs/2504.20650
作者: Adam Gudyś,Cezary Maszczyk,Joanna Badura,Adam Grzelak,Marek Sikora,Łukasz Wróbel
类目: Machine Learning (cs.LG)
*备注: 10 pages, 3 figures, 2 tables
Abstract:Rules offer an invaluable combination of predictive and descriptive capabilities. Our package for rule-based data analysis, RuleKit, has proven its effectiveness in classification, regression, and survival problems. Here we present its second version. New algorithms and optimized implementations of those previously included, significantly improved the computational performance of our suite, reducing the analysis time of some data sets by two orders of magnitude. The usability of RuleKit 2 is provided by two new components: Python package and browser application with a graphical user interface. The former complies with scikit-learn, the most popular data mining library for Python, allowing RuleKit 2 to be straightforwardly integrated into existing data analysis pipelines. RuleKit 2 is available at GitHub under GNU AGPL 3 license (this https URL)
[LG-28] Combatting Dimensional Collapse in LLM Pre-Training Data via Diversified File Selection
链接: https://arxiv.org/abs/2504.20644
作者: Ziqing Fan,Siyuan Du,Shengchao Hu,Pingjie Wang,Li Shen,Ya Zhang,Dacheng Tao,Yanfeng Wang
类目: Machine Learning (cs.LG)
*备注:
Abstract:Selecting high-quality pre-training data for large language models (LLMs) is crucial for enhancing their overall performance under limited computation budget, improving both training and sample efficiency. Recent advancements in file selection primarily rely on using an existing or trained proxy model to assess the similarity of samples to a target domain, such as high quality sources BookCorpus and Wikipedia. However, upon revisiting these methods, the domain-similarity selection criteria demonstrates a diversity dilemma, this http URL collapse in the feature space, improving performance on the domain-related tasks but causing severe degradation on generic performance. To prevent collapse and enhance diversity, we propose a DiverSified File selection algorithm (DiSF), which selects the most decorrelated text files in the feature space. We approach this with a classical greedy algorithm to achieve more uniform eigenvalues in the feature covariance matrix of the selected texts, analyzing its approximation to the optimal solution under a formulation of \gamma -weakly submodular optimization problem. Empirically, we establish a benchmark and conduct extensive experiments on the TinyLlama architecture with models from 120M to 1.1B parameters. Evaluating across nine tasks from the Harness framework, DiSF demonstrates a significant improvement on overall performance. Specifically, DiSF saves 98.5% of 590M training files in SlimPajama, outperforming the full-data pre-training within a 50B training budget, and achieving about 1.5x training efficiency and 5x data efficiency.
[LG-29] Decision-centric fairness: Evaluation and optimization for resource allocation problems
链接: https://arxiv.org/abs/2504.20642
作者: Simon De Vos,Jente Van Belle,Andres Algaba,Wouter Verbeke,Sam Verboven
类目: Machine Learning (cs.LG); Computers and Society (cs.CY)
*备注:
Abstract:Data-driven decision support tools play an increasingly central role in decision-making across various domains. In this work, we focus on binary classification models for predicting positive-outcome scores and deciding on resource allocation, e.g., credit scores for granting loans or churn propensity scores for targeting customers with a retention campaign. Such models may exhibit discriminatory behavior toward specific demographic groups through their predicted scores, potentially leading to unfair resource allocation. We focus on demographic parity as a fairness metric to compare the proportions of instances that are selected based on their positive outcome scores across groups. In this work, we propose a decision-centric fairness methodology that induces fairness only within the decision-making region – the range of relevant decision thresholds on the score that may be used to decide on resource allocation – as an alternative to a global fairness approach that seeks to enforce parity across the entire score distribution. By restricting the induction of fairness to the decision-making region, the proposed decision-centric approach avoids imposing overly restrictive constraints on the model, which may unnecessarily degrade the quality of the predicted scores. We empirically compare our approach to a global fairness approach on multiple (semi-synthetic) datasets to identify scenarios in which focusing on fairness where it truly matters, i.e., decision-centric fairness, proves beneficial.
[LG-30] Bridging the Generalisation Gap: Synthetic Data Generation for Multi-Site Clinical Model Validation
链接: https://arxiv.org/abs/2504.20635
作者: Bradley Segal,Joshua Fieggen,David Clifton,Lei Clifton
类目: Machine Learning (cs.LG)
*备注: 7 pages, 4 figures
Abstract:Ensuring the generalisability of clinical machine learning (ML) models across diverse healthcare settings remains a significant challenge due to variability in patient demographics, disease prevalence, and institutional practices. Existing model evaluation approaches often rely on real-world datasets, which are limited in availability, embed confounding biases, and lack the flexibility needed for systematic experimentation. Furthermore, while generative models aim for statistical realism, they often lack transparency and explicit control over factors driving distributional shifts. In this work, we propose a novel structured synthetic data framework designed for the controlled benchmarking of model robustness, fairness, and generalisability. Unlike approaches focused solely on mimicking observed data, our framework provides explicit control over the data generating process, including site-specific prevalence variations, hierarchical subgroup effects, and structured feature interactions. This enables targeted investigation into how models respond to specific distributional shifts and potential biases. Through controlled experiments, we demonstrate the framework’s ability to isolate the impact of site variations, support fairness-aware audits, and reveal generalisation failures, particularly highlighting how model complexity interacts with site-specific effects. This work contributes a reproducible, interpretable, and configurable tool designed to advance the reliable deployment of ML in clinical settings.
[LG-31] Independent Learning in Performative Markov Potential Games AISTATS2025
链接: https://arxiv.org/abs/2504.20593
作者: Rilind Sahitaj,Paulius Sasnauskas,Yiğit Yalın,Debmalya Mandal,Goran Radanović
类目: Machine Learning (cs.LG)
*备注: AISTATS 2025, code available at this https URL
Abstract:Performative Reinforcement Learning (PRL) refers to a scenario in which the deployed policy changes the reward and transition dynamics of the underlying environment. In this work, we study multi-agent PRL by incorporating performative effects into Markov Potential Games (MPGs). We introduce the notion of a performatively stable equilibrium (PSE) and show that it always exists under a reasonable sensitivity assumption. We then provide convergence results for state-of-the-art algorithms used to solve MPGs. Specifically, we show that independent policy gradient ascent (IPGA) and independent natural policy gradient (INPG) converge to an approximate PSE in the best-iterate sense, with an additional term that accounts for the performative effects. Furthermore, we show that INPG asymptotically converges to a PSE in the last-iterate sense. As the performative effects vanish, we recover the convergence rates from prior work. For a special case of our game, we provide finite-time last-iterate convergence results for a repeated retraining approach, in which agents independently optimize a surrogate objective. We conduct extensive experiments to validate our theoretical findings.
[LG-32] Representation Learning Preserving Ignorability and Covariate Matching for Treatment Effects
链接: https://arxiv.org/abs/2504.20579
作者: Praharsh Nanavati,Ranjitha Prasad,Karthikeyan Shanmugam
类目: Machine Learning (cs.LG); Methodology (stat.ME)
*备注:
Abstract:Estimating treatment effects from observational data is challenging due to two main reasons: (a) hidden confounding, and (b) covariate mismatch (control and treatment groups not having identical distributions). Long lines of works exist that address only either of these issues. To address the former, conventional techniques that require detailed knowledge in the form of causal graphs have been proposed. For the latter, covariate matching and importance weighting methods have been used. Recently, there has been progress in combining testable independencies with partial side information for tackling hidden confounding. A common framework to address both hidden confounding and selection bias is missing. We propose neural architectures that aim to learn a representation of pre-treatment covariates that is a valid adjustment and also satisfies covariate matching constraints. We combine two different neural architectures: one based on gradient matching across domains created by subsampling a suitable anchor variable that assumes causal side information, followed by the other, a covariate matching transformation. We prove that approximately invariant representations yield approximate valid adjustment sets which would enable an interval around the true causal effect. In contrast to usual sensitivity analysis, where an unknown nuisance parameter is varied, we have a testable approximation yielding a bound on the effect estimate. We also outperform various baselines with respect to ATE and PEHE errors on causal benchmarks that include IHDP, Jobs, Cattaneo, and an image-based Crowd Management dataset.
[LG-33] Digital Shielding for Cross-Domain Wi-Fi Signal Adaptation using Relativistic Averag e Generative Adversarial Network
链接: https://arxiv.org/abs/2504.20568
作者: Danilo Avola,Federica Bruni,Gian Luca Foresti,Daniele Pannone,Amedeo Ranaldi
类目: Machine Learning (cs.LG)
*备注:
Abstract:Wi-Fi sensing uses radio-frequency signals from Wi-Fi devices to analyze environments, enabling tasks such as tracking people, detecting intrusions, and recognizing gestures. The rise of this technology is driven by the IEEE 802.11bf standard and growing demand for tools that can ensure privacy and operate through obstacles. However, the performance of Wi-Fi sensing is heavily influenced by environmental conditions, especially when extracting spatial and temporal features from the surrounding scene. A key challenge is achieving robust generalization across domains, ensuring stable performance even when the sensing environment changes significantly. This paper introduces a novel deep learning model for cross-domain adaptation of Wi-Fi signals, inspired by physical signal shielding. The model uses a Relativistic average Generative Adversarial Network (RaGAN) with Bidirectional Long Short-Term Memory (Bi-LSTM) architectures for both the generator and discriminator. To simulate physical shielding, an acrylic box lined with electromagnetic shielding fabric was constructed, mimicking a Faraday cage. Wi-Fi signal spectra were collected from various materials both inside (domain-free) and outside (domain-dependent) the box to train the model. A multi-class Support Vector Machine (SVM) was trained on domain-free spectra and tested on signals denoised by the RaGAN. The system achieved 96% accuracy and demonstrated strong material discrimination capabilities, offering potential for use in security applications to identify concealed objects based on their composition.
[LG-34] DeeP-Mod: Deep Dynamic Programming based Environment Modelling using Feature Extraction
链接: https://arxiv.org/abs/2504.20535
作者: Chris Child,Lam Ngo
类目: Machine Learning (cs.LG)
*备注:
Abstract:The DeeP-Mod framework builds an environment model using features from a Deep Dynamic Programming Network (DDPN), trained via a Deep Q-Network (DQN). While Deep Q-Learning is effective in decision-making, state information is lost in deeper DQN layers due to mixed state-action representations. We address this by using Dynamic Programming (DP) to train a DDPN, where Value Iteration ensures the output represents state values, not state-action pairs. Extracting features from the DDPN preserves state information, enabling task and action set independence. We show that a reduced DDPN can be trained using features extracted from the original DDPN trained on an identical problem. This reduced DDPN achieves faster convergence under noise and outperforms the original DDPN. Finally, we introduce the DeeP-Mod framework, which creates an environment model using the evolution of features extracted from a DDPN in response to actions. A second DDPN, which learns directly from this feature model rather than raw states, can learn an effective feature-value representation and thus optimal policy. A key advantage of DeeP-Mod is that an externally defined environment model is not needed at any stage, making DDPN applicable to a wide range of environments.
[LG-35] Wavelet-Filtering of Symbolic Music Representations for Folk Tune Segmentation and Classification
链接: https://arxiv.org/abs/2504.20522
作者: Gissel Velarde,Tillman Weyde,David Meredith
类目: Machine Learning (cs.LG)
*备注: 7 pages, 4 figures, 2 tables, Proceedings of the Third International Workshop on Folk Music Analysis (FMA2013)
Abstract:The aim of this study is to evaluate a machine-learning method in which symbolic representations of folk songs are segmented and classified into tune families with Haar-wavelet filtering. The method is compared with previously proposed Gestalt-based method. Melodies are represented as discrete symbolic pitch-time signals. We apply the continuous wavelet transform (CWT) with the Haar wavelet at specific scales, obtaining filtered versions of melodies emphasizing their information at particular time-scales. We use the filtered signal for representation and segmentation, using the wavelet coefficients’ local maxima to indicate local boundaries and classify segments by means of k-nearest neighbours based on standard vector-metrics (Euclidean, cityblock), and compare the results to a Gestalt-based segmentation method and metrics applied directly to the pitch signal. We found that the wavelet based segmentation and wavelet-filtering of the pitch signal lead to better classification accuracy in cross-validated evaluation when the time-scale and other parameters are optimized.
[LG-36] FT-MoE: Sustainable-learning Mixture of Experts Model for Fault-Tolerant Computing with Multiple Tasks
链接: https://arxiv.org/abs/2504.20446
作者: Wenjing Xiao,Wenhao Song,Miaojiang Chen,Ruikun Luo,Min Chen
类目: Machine Learning (cs.LG)
*备注:
Abstract:Intelligent fault-tolerant (FT) computing has recently demonstrated significant advantages of predicting and diagnosing faults in advance, enabling reliable service delivery. However, due to heterogeneity of fault knowledge and complex dependence relationships of time series log data, existing deep learning-based FT algorithms further improve detection performance relying on single neural network model with difficulty. To this end, we propose FT-MoE, a sustainable-learning mixture-of-experts model for fault-tolerant computing with multiple tasks, which enables different parameters learning distinct fault knowledge to achieve high-reliability for service system. Firstly, we use decoder-based transformer models to obtain fault prototype vectors of decoupling long-distance dependencies. Followed by, we present a dual mixture of experts networks for high-accurate prediction for both fault detection and classification tasks. Then, we design a two-stage optimization scheme of offline training and online tuning, which allows that in operation FT-MoE can also keep learning to adapt to dynamic service environments. Finally, to verify the effectiveness of FT-MoE, we conduct extensive experiments on the FT benchmark. Experimental results show that FT-MoE achieves superior performance compared to the state-of-the-art methods. Code will be available upon publication.
[LG-37] Multidimensional precipitation index prediction based on CNN-LSTM hybrid framework
链接: https://arxiv.org/abs/2504.20442
作者: Yuchen Wang,Pengfei Jia,Zhitao Shu,Keyan Liu,Abdul Rashid Mohamed Shariff
类目: Machine Learning (cs.LG); Human-Computer Interaction (cs.HC)
*备注:
Abstract:With the intensification of global climate change, accurate prediction of weather indicators is of great significance in disaster prevention and mitigation, agricultural production, and transportation. Precipitation, as one of the key meteorological indicators, plays a crucial role in water resource management, agricultural production, and urban flood control. This study proposes a multidimensional precipitation index prediction model based on a CNN- LSTM hybrid framework, aiming to improve the accuracy of precipitation forecasts. The dataset is sourced from Pune, Maharashtra, India, covering monthly mean precipitation data from 1972 to 2002. This dataset includes nearly 31 years (1972-2002) of monthly average precipitation, reflecting the long-term fluctuations and seasonal variations of precipitation in the region. By analyzing these time series data, the CNN-LSTM model effectively captures local features and long-term dependencies. Experimental results show that the model achieves a root mean square error (RMSE) of 6.752, which demonstrates a significant advantage over traditional time series prediction methods in terms of prediction accuracy and generalization ability. Furthermore, this study provides new research ideas for precipitation prediction. However, the model requires high computational resources when dealing with large-scale datasets, and its predictive ability for multidimensional precipitation data still needs improvement. Future research could extend the model to support and predict multidimensional precipitation data, thereby promoting the development of more accurate and efficient meteorological prediction technologies.
[LG-38] Learning Laplacian Positional Encodings for Heterophilous Graphs AISTATS2025
链接: https://arxiv.org/abs/2504.20430
作者: Michael Ito,Jiong Zhu,Dexiong Chen,Danai Koutra,Jenna Wiens
类目: Machine Learning (cs.LG)
*备注: AISTATS 2025; version with full appendix
Abstract:In this work, we theoretically demonstrate that current graph positional encodings (PEs) are not beneficial and could potentially hurt performance in tasks involving heterophilous graphs, where nodes that are close tend to have different labels. This limitation is critical as many real-world networks exhibit heterophily, and even highly homophilous graphs can contain local regions of strong heterophily. To address this limitation, we propose Learnable Laplacian Positional Encodings (LLPE), a new PE that leverages the full spectrum of the graph Laplacian, enabling them to capture graph structure on both homophilous and heterophilous graphs. Theoretically, we prove LLPE’s ability to approximate a general class of graph distances and demonstrate its generalization properties. Empirically, our evaluation on 12 benchmarks demonstrates that LLPE improves accuracy across a variety of GNNs, including graph transformers, by up to 35% and 14% on synthetic and real-world graphs, respectively. Going forward, our work represents a significant step towards developing PEs that effectively capture complex structures in heterophilous graphs.
[LG-39] Understanding GNNs and Homophily in Dynamic Node Classification AISTATS2025
链接: https://arxiv.org/abs/2504.20421
作者: Michael Ito,Danai Koutra,Jenna Wiens
类目: Machine Learning (cs.LG)
*备注: AISTATS 2025; version with full appendix
Abstract:Homophily, as a measure, has been critical to increasing our understanding of graph neural networks (GNNs). However, to date this measure has only been analyzed in the context of static graphs. In our work, we explore homophily in dynamic settings. Focusing on graph convolutional networks (GCNs), we demonstrate theoretically that in dynamic settings, current GCN discriminative performance is characterized by the probability that a node’s future label is the same as its neighbors’ current labels. Based on this insight, we propose dynamic homophily, a new measure of homophily that applies in the dynamic setting. This new measure correlates with GNN discriminative performance and sheds light on how to potentially design more powerful GNNs for dynamic graphs. Leveraging a variety of dynamic node classification datasets, we demonstrate that popular GNNs are not robust to low dynamic homophily. Going forward, our work represents an important step towards understanding homophily and GNN performance in dynamic node classification.
[LG-40] ADiff4TPP: Asynchronous Diffusion Models for Temporal Point Processes
链接: https://arxiv.org/abs/2504.20411
作者: Amartya Mukherjee,Ruizhi Deng,He Zhao,Yuzhen Mao,Leonid Sigal,Frederick Tung
类目: Machine Learning (cs.LG)
*备注:
Abstract:This work introduces a novel approach to modeling temporal point processes using diffusion models with an asynchronous noise schedule. At each step of the diffusion process, the noise schedule injects noise of varying scales into different parts of the data. With a careful design of the noise schedules, earlier events are generated faster than later ones, thus providing stronger conditioning for forecasting the more distant future. We derive an objective to effectively train these models for a general family of noise schedules based on conditional flow matching. Our method models the joint distribution of the latent representations of events in a sequence and achieves state-of-the-art results in predicting both the next inter-event time and event type on benchmark datasets. Additionally, it flexibly accommodates varying lengths of observation and prediction windows in different forecasting settings by adjusting the starting and ending points of the generation process. Finally, our method shows superior performance in long-horizon prediction tasks, outperforming existing baseline methods.
[LG-41] Partial Answer of How Transformers Learn Automata
链接: https://arxiv.org/abs/2504.20395
作者: Tiantian(Crystal)Zhang
类目: Formal Languages and Automata Theory (cs.FL); Machine Learning (cs.LG)
*备注:
Abstract:We introduce a novel framework for simulating finite automata using representation-theoretic semidirect products and Fourier modules, achieving more efficient Transformer-based implementations.
[LG-42] Manifold Clustering with Schatten p-norm Maximization
链接: https://arxiv.org/abs/2504.20390
作者: Fangfang Li,Quanxue Gao
类目: Machine Learning (cs.LG)
*备注:
Abstract:Manifold clustering, with its exceptional ability to capture complex data structures, holds a pivotal position in cluster analysis. However, existing methods often focus only on finding the optimal combination between K-means and manifold learning, and overlooking the consistency between the data structure and labels. To address this issue, we deeply explore the relationship between K-means and manifold learning, and on this basis, fuse them to develop a new clustering framework. Specifically, the algorithm uses labels to guide the manifold structure and perform clustering on it, which ensures the consistency between the data structure and labels. Furthermore, in order to naturally maintain the class balance in the clustering process, we maximize the Schatten p-norm of labels, and provide a theoretical proof to support this. Additionally, our clustering framework is designed to be flexible and compatible with many types of distance functions, which facilitates efficient processing of nonlinear separable data. The experimental results of several databases confirm the superiority of our proposed model.
[LG-43] Generative Learning for Slow Manifolds and Bifurcation Diagrams
链接: https://arxiv.org/abs/2504.20375
作者: Ellis R. Crabtree,Dimitris G. Giovanis,Nikolaos Evangelou,Juan M. Bello-Rivas,Ioannis G. Kevrekidis
类目: Machine Learning (cs.LG); Dynamical Systems (math.DS)
*备注: 17 pages, 13 figures, 1 table
Abstract:In dynamical systems characterized by separation of time scales, the approximation of so called slow manifolds'', on which the long term dynamics lie, is a useful step for model reduction. Initializing on such slow manifolds is a useful step in modeling, since it circumvents fast transients, and is crucial in multiscale algorithms alternating between fine scale (fast) and coarser scale (slow) simulations. In a similar spirit, when one studies the infinite time dynamics of systems depending on parameters, the system attractors (e.g., its steady states) lie on bifurcation diagrams. Sampling these manifolds gives us representative attractors (here, steady states of ODEs or PDEs) at different parameter values. Algorithms for the systematic construction of these manifolds are required parts of the
traditional’’ numerical nonlinear dynamics toolkit. In more recent years, as the field of Machine Learning develops, conditional score-based generative models (cSGMs) have demonstrated capabilities in generating plausible data from target distributions that are conditioned on some given label. It is tempting to exploit such generative models to produce samples of data distributions conditioned on some quantity of interest (QoI). In this work, we present a framework for using cSGMs to quickly (a) initialize on a low-dimensional (reduced-order) slow manifold of a multi-time-scale system consistent with desired value(s) of a QoI (a label'') on the manifold, and (b) approximate steady states in a bifurcation diagram consistent with a (new, out-of-sample) parameter value. This conditional sampling can help uncover the geometry of the reduced slow-manifold and/or approximately
fill in’’ missing segments of steady states in a bifurcation diagram. Comments: 17 pages, 13 figures, 1 table Subjects: Machine Learning (cs.LG); Dynamical Systems (math.DS) MSC classes: 37M20, 37M21, 68T07, 35B32 Cite as: arXiv:2504.20375 [cs.LG] (or arXiv:2504.20375v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2504.20375 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[LG-44] Bayesian Experimental Design for Model Discrepancy Calibration: An Auto-Differentiable Ensemble Kalman Inversion Approach
链接: https://arxiv.org/abs/2504.20319
作者: Huchen Yang,Xinghao Dong,Jin-Long Wu
类目: Machine Learning (cs.LG)
*备注:
Abstract:Bayesian experimental design (BED) offers a principled framework for optimizing data acquisition by leveraging probabilistic inference. However, practical implementations of BED are often compromised by model discrepancy, i.e., the mismatch between predictive models and true physical systems, which can potentially lead to biased parameter estimates. While data-driven approaches have been recently explored to characterize the model discrepancy, the resulting high-dimensional parameter space poses severe challenges for both Bayesian updating and design optimization. In this work, we propose a hybrid BED framework enabled by auto-differentiable ensemble Kalman inversion (AD-EKI) that addresses these challenges by providing a computationally efficient, gradient-free alternative to estimate the information gain for high-dimensional network parameters. The AD-EKI allows a differentiable evaluation of the utility function in BED and thus facilitates the use of standard gradient-based methods for design optimization. In the proposed hybrid framework, we iteratively optimize experimental designs, decoupling the inference of low-dimensional physical parameters handled by standard BED methods, from the high-dimensional model discrepancy handled by AD-EKI. The identified optimal designs for the model discrepancy enable us to systematically collect informative data for its calibration. The performance of the proposed method is studied by a classical convection-diffusion BED example, and the hybrid framework enabled by AD-EKI efficiently identifies informative data to calibrate the model discrepancy and robustly infers the unknown physical parameters in the modeled system. Besides addressing the challenges of BED with model discrepancy, AD-EKI also potentially fosters efficient and scalable frameworks in many other areas with bilevel optimization, such as meta-learning and structure optimization.
[LG-45] FigBO: A Generalized Acquisition Function Framework with Look-Ahead Capability for Bayesian Optimization
链接: https://arxiv.org/abs/2504.20307
作者: Hui Chen,Xuhui Fan,Zhangkai Wu,Longbing Cao
类目: Machine Learning (cs.LG)
*备注:
Abstract:Bayesian optimization is a powerful technique for optimizing expensive-to-evaluate black-box functions, consisting of two main components: a surrogate model and an acquisition function. In recent years, myopic acquisition functions have been widely adopted for their simplicity and effectiveness. However, their lack of look-ahead capability limits their performance. To address this limitation, we propose FigBO, a generalized acquisition function that incorporates the future impact of candidate points on global information gain. FigBO is a plug-and-play method that can integrate seamlessly with most existing myopic acquisition functions. Theoretically, we analyze the regret bound and convergence rate of FigBO when combined with the myopic base acquisition function expected improvement (EI), comparing them to those of standard EI. Empirically, extensive experimental results across diverse tasks demonstrate that FigBO achieves state-of-the-art performance and significantly faster convergence compared to existing methods.
[LG-46] Radius-Guided Post-Clustering for Shape-Aware Scalable Refinement of k-Means Results
链接: https://arxiv.org/abs/2504.20293
作者: Stefan Kober
类目: Machine Learning (cs.LG)
*备注: 9 pages, 2 figures, 4 tables. Open-source code available at: this https URL
Abstract:Traditional k-means clustering underperforms on non-convex shapes and requires the number of clusters k to be specified in advance. We propose a simple geometric enhancement: after standard k-means, each cluster center is assigned a radius (the distance to its farthest assigned point), and clusters whose radii overlap are merged. This post-processing step loosens the requirement for exact k: as long as k is overestimated (but not excessively), the method can often reconstruct non-convex shapes through meaningful merges. We also show that this approach supports recursive partitioning: clustering can be performed independently on tiled regions of the feature space, then globally merged, making the method scalable and suitable for distributed systems. Implemented as a lightweight post-processing step atop scikit-learn’s k-means, the algorithm performs well on benchmark datasets, achieving high accuracy with minimal additional computation.
[LG-47] FedCCL: Federated Clustered Continual Learning Framework for Privacy-focused Energy Forecasting
链接: https://arxiv.org/abs/2504.20282
作者: Michael A. Helcig,Stefan Nastic
类目: Machine Learning (cs.LG)
*备注:
Abstract:Privacy-preserving distributed model training is crucial for modern machine learning applications, yet existing Federated Learning approaches struggle with heterogeneous data distributions and varying computational capabilities. Traditional solutions either treat all participants uniformly or require costly dynamic clustering during training, leading to reduced efficiency and delayed model specialization. We present FedCCL (Federated Clustered Continual Learning), a framework specifically designed for environments with static organizational characteristics but dynamic client availability. By combining static pre-training clustering with an adapted asynchronous FedAvg algorithm, FedCCL enables new clients to immediately profit from specialized models without prior exposure to their data distribution, while maintaining reduced coordination overhead and resilience to client disconnections. Our approach implements an asynchronous Federated Learning protocol with a three-tier model topology - global, cluster-specific, and local models - that efficiently manages knowledge sharing across heterogeneous participants. Evaluation using photovoltaic installations across central Europe demonstrates that FedCCL’s location-based clustering achieves an energy prediction error of 3.93% (±0.21%), while maintaining data privacy and showing that the framework maintains stability for population-independent deployments, with 0.14 percentage point degradation in performance for new installations. The results demonstrate that FedCCL offers an effective framework for privacy-preserving distributed learning, maintaining high accuracy and adaptability even with dynamic participant populations.
[LG-48] Generative Diffusion Models for Resource Allocation in Wireless Networks
链接: https://arxiv.org/abs/2504.20277
作者: Yigit Berkay Uslu,Samar Hadou,Shirin Saeedi Bidokhti,Alejandro Ribeiro
类目: Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注:
Abstract:This paper proposes a supervised training algorithm for learning stochastic resource allocation policies with generative diffusion models (GDMs). We formulate the allocation problem as the maximization of an ergodic utility function subject to ergodic Quality of Service (QoS) constraints. Given samples from a stochastic expert policy that yields a near-optimal solution to the problem, we train a GDM policy to imitate the expert and generate new samples from the optimal distribution. We achieve near-optimal performance through sequential execution of the generated samples. To enable generalization to a family of network configurations, we parameterize the backward diffusion process with a graph neural network (GNN) architecture. We present numerical results in a case study of power control in multi-user interference networks.
[LG-49] Investigating task-specific prompts and sparse autoencoders for activation monitoring
链接: https://arxiv.org/abs/2504.20271
作者: Henk Tillman,Dan Mossing
类目: Machine Learning (cs.LG)
*备注: 18 pages, 13 figures
Abstract:Language models can behave in unexpected and unsafe ways, and so it is valuable to monitor their outputs. Internal activations of language models encode additional information that could be useful for this. The baseline approach for activation monitoring is some variation of linear probing on a particular layer: starting from a labeled dataset, train a logistic regression classifier on that layer’s activations. Recent work has proposed several approaches which may improve on naive linear probing, by leveraging additional computation. One class of techniques, which we call “prompted probing,” leverages test time computation to improve monitoring by (1) prompting the model with a description of the monitoring task, and (2) applying a learned linear probe to resulting activations. Another class of techniques uses computation at train time: training sparse autoencoders offline to identify an interpretable basis for the activations, and e.g. max-pooling activations across tokens using that basis before applying a linear probe. However, one can also prompt the model with a description of the monitoring task and use its output directly. We develop and test novel refinements of these methods and compare them against each other. We find asking the model zero-shot is a reasonable baseline when inference-time compute is not limited; however, activation probing methods can substantially outperform this baseline given sufficient training data. Specifically, we recommend prompted probing when inference-time compute is available, due to its superior data efficiency and good generalization performance. Alternatively, if inference-time compute is limited, we find SAE-based probing methods outperform raw activation probing.
[LG-50] A Virtual Cybersecurity Department for Securing Digital Twins in Water Distribution Systems
链接: https://arxiv.org/abs/2504.20266
作者: Mohammadhossein Homaei,Agustin Di Bartolo,Oscar Mogollon-Gutierrez,Fernando Broncano Morgado,Pablo Garcia Rodriguez
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI)
*备注: 7 pages, 9 Figures
Abstract:Digital twins (DTs) help improve real-time monitoring and decision-making in water distribution systems. However, their connectivity makes them easy targets for cyberattacks such as scanning, denial-of-service (DoS), and unauthorized access. Small and medium-sized enterprises (SMEs) that manage these systems often do not have enough budget or staff to build strong cybersecurity teams. To solve this problem, we present a Virtual Cybersecurity Department (VCD), an affordable and automated framework designed for SMEs. The VCD uses open-source tools like Zabbix for real-time monitoring, Suricata for network intrusion detection, Fail2Ban to block repeated login attempts, and simple firewall settings. To improve threat detection, we also add a machine-learning-based IDS trained on the OD-IDS2022 dataset using an improved ensemble model. This model detects cyber threats such as brute-force attacks, remote code execution (RCE), and network flooding, with 92% accuracy and fewer false alarms. Our solution gives SMEs a practical and efficient way to secure water systems using low-cost and easy-to-manage tools.
[LG-51] Financial Data Analysis with Robust Federated Logistic Regression
链接: https://arxiv.org/abs/2504.20250
作者: Kun Yang,Nikhil Krishnan,Sanjeev R. Kulkarni
类目: Machine Learning (cs.LG); General Finance (q-fin.GN); Statistical Finance (q-fin.ST); Applications (stat.AP); Machine Learning (stat.ML)
*备注:
Abstract:In this study, we focus on the analysis of financial data in a federated setting, wherein data is distributed across multiple clients or locations, and the raw data never leaves the local devices. Our primary focus is not only on the development of efficient learning frameworks (for protecting user data privacy) in the field of federated learning but also on the importance of designing models that are easier to interpret. In addition, we care about the robustness of the framework to outliers. To achieve these goals, we propose a robust federated logistic regression-based framework that strives to strike a balance between these goals. To verify the feasibility of our proposed framework, we carefully evaluate its performance not only on independently identically distributed (IID) data but also on non-IID data, especially in scenarios involving outliers. Extensive numerical results collected from multiple public datasets demonstrate that our proposed method can achieve comparable performance to those of classical centralized algorithms, such as Logistical Regression, Decision Tree, and K-Nearest Neighbors, in both binary and multi-class classification tasks.
[LG-52] mporal Neural Operator for Modeling Time-Dependent Physical Phenomena
链接: https://arxiv.org/abs/2504.20249
作者: W. Diab,M. Al-Kobaisi
类目: Machine Learning (cs.LG)
*备注:
Abstract:Neural Operators (NOs) are machine learning models designed to solve partial differential equations (PDEs) by learning to map between function spaces. Neural Operators such as the Deep Operator Network (DeepONet) and the Fourier Neural Operator (FNO) have demonstrated excellent generalization properties when mapping between spatial function spaces. However, they struggle in mapping the temporal dynamics of time-dependent PDEs, especially for time steps not explicitly seen during training. This limits their temporal accuracy as they do not leverage these dynamics in the training process. In addition, most NOs tend to be prohibitively costly to train, especially for higher-dimensional PDEs. In this paper, we propose the Temporal Neural Operator (TNO), an efficient neural operator specifically designed for spatio-temporal operator learning for time-dependent PDEs. TNO achieves this by introducing a temporal-branch to the DeepONet framework, leveraging the best architectural design choices from several other NOs, and a combination of training strategies including Markov assumption, teacher forcing, temporal bundling, and the flexibility to condition the output on the current state or past states. Through extensive benchmarking and an ablation study on a diverse set of example problems we demonstrate the TNO long range temporal extrapolation capabilities, robustness to error accumulation, resolution invariance, and flexibility to handle multiple input functions.
[LG-53] Leverag ing Neural Graph Compilers in Machine Learning Research for Edge-Cloud Systems
链接: https://arxiv.org/abs/2504.20198
作者: Alireza Furutanpey,Carmen Walser,Philipp Raith,Pantelis A. Frangoudis,Schahram Dustdar
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
*备注: 10 pages, 12 figures, 8 tables
Abstract:This work presents a comprehensive evaluation of neural network graph compilers across heterogeneous hardware platforms, addressing the critical gap between theoretical optimization techniques and practical deployment scenarios. We demonstrate how vendor-specific optimizations can invalidate relative performance comparisons between architectural archetypes, with performance advantages sometimes completely reversing after compilation. Our systematic analysis reveals that graph compilers exhibit performance patterns highly dependent on both neural architecture and batch sizes. Through fine-grained block-level experimentation, we establish that vendor-specific compilers can leverage repeated patterns in simple architectures, yielding disproportionate throughput gains as model depth increases. We introduce novel metrics to quantify a compiler’s ability to mitigate performance friction as batch size increases. Our methodology bridges the gap between academic research and practical deployment by incorporating compiler effects throughout the research process, providing actionable insights for practitioners navigating complex optimization landscapes across heterogeneous hardware environments.
[LG-54] ProFi-Net: Prototype-based Feature Attention with Curriculum Augmentation for WiFi-based Gesture Recognition APWEB
链接: https://arxiv.org/abs/2504.20193
作者: Zhe Cui,Shuxian Zhang,Kangzhi Lou,Le-Nam Tran
类目: Machine Learning (cs.LG)
*备注: This paper was accepted at The 9th APWeb-WAIM joint International Conference on Web and Big Data
Abstract:This paper presents ProFi-Net, a novel few-shot learning framework for WiFi-based gesture recognition that overcomes the chal- lenges of limited training data and sparse feature representations. ProFi- Net employs a prototype-based metric learning architecture enhanced with a feature-level attention mechanism, which dynamically refines the Euclidean distance by emphasizing the most discriminative feature di- mensions. Additionally, our approach introduces a curriculum-inspired data augmentation strategy exclusively on the query set. By progressively incorporating Gaussian noise of increasing magnitude, the model is ex- posed to a broader range of challenging variations, thereby improving its generalization and robustness to overfitting. Extensive experiments con- ducted across diverse real-world environments demonstrate that ProFi- Net significantly outperforms conventional prototype networks and other state-of-the-art few-shot learning methods in terms of classification ac- curacy and training efficiency.
[LG-55] Enhancing Cell Counting through MLOps: A Structured Approach for Automated Cell Analysis
链接: https://arxiv.org/abs/2504.20126
作者: Matteo Testi,Luca Clissa,Matteo Ballabio,Salvatore Ricciardi,Federico Baldo,Emanuele Frontoni,Sara Moccia,Gennario Vessio
类目: oftware Engineering (cs.SE); Machine Learning (cs.LG); Applied Physics (physics.app-ph)
*备注: 21 pages, 4 figures, 1 table
Abstract:Machine Learning (ML) models offer significant potential for advancing cell counting applications in neuroscience, medical research, pharmaceutical development, and environmental monitoring. However, implementing these models effectively requires robust operational frameworks. This paper introduces Cell Counting Machine Learning Operations (CC-MLOps), a comprehensive framework that streamlines the integration of ML in cell counting workflows. CC-MLOps encompasses data access and preprocessing, model training, monitoring, explainability features, and sustainability considerations. Through a practical use case, we demonstrate how MLOps principles can enhance model reliability, reduce human error, and enable scalable Cell Counting solutions. This work provides actionable guidance for researchers and laboratory professionals seeking to implement machine learning (ML)- powered cell counting systems.
[LG-56] Benchmarking Transferability: A Framework for Fair and Robust Evaluation
链接: https://arxiv.org/abs/2504.20121
作者: Alireza Kazemi,Helia Rezvani,Mahsa Baktashmotlagh
类目: Machine Learning (cs.LG)
*备注:
Abstract:Transferability scores aim to quantify how well a model trained on one domain generalizes to a target domain. Despite numerous methods proposed for measuring transferability, their reliability and practical usefulness remain inconclusive, often due to differing experimental setups, datasets, and assumptions. In this paper, we introduce a comprehensive benchmarking framework designed to systematically evaluate transferability scores across diverse settings. Through extensive experiments, we observe variations in how different metrics perform under various scenarios, suggesting that current evaluation practices may not fully capture each method’s strengths and limitations. Our findings underscore the value of standardized assessment protocols, paving the way for more reliable transferability measures and better-informed model selection in cross-domain applications. Additionally, we achieved a 3.5% improvement using our proposed metric for the head-training fine-tuning experimental setup. Our code is available in this repository: this https URL.
[LG-57] Attention to Detail: Fine-Scale Feature Preservation-Oriented Geometric Pre-training for AI-Driven Surrogate Modeling
链接: https://arxiv.org/abs/2504.20110
作者: Yu-hsuan Chen,Jing Bi,Cyril Ngo Ngoc,Victor Oancea,Jonathan Cagan,Levent Burak Kara
类目: Machine Learning (cs.LG)
*备注:
Abstract:AI-driven surrogate modeling has become an increasingly effective alternative to physics-based simulations for 3D design, analysis, and manufacturing. These models leverage data-driven methods to predict physical quantities traditionally requiring computationally expensive simulations. However, the scarcity of labeled CAD-to-simulation datasets has driven recent advancements in self-supervised and foundation models, where geometric representation learning is performed offline and later fine-tuned for specific downstream tasks. While these approaches have shown promise, their effectiveness is limited in applications requiring fine-scale geometric detail preservation. This work introduces a self-supervised geometric representation learning method designed to capture fine-scale geometric features from non-parametric 3D models. Unlike traditional end-to-end surrogate models, this approach decouples geometric feature extraction from downstream physics tasks, learning a latent space embedding guided by geometric reconstruction losses. Key elements include the essential use of near-zero level sampling and the innovative batch-adaptive attention-weighted loss function, which enhance the encoding of intricate design features. The proposed method is validated through case studies in structural mechanics, demonstrating strong performance in capturing design features and enabling accurate few-shot physics predictions. Comparisons with traditional parametric surrogate modeling highlight its potential to bridge the gap between geometric and physics-based representations, providing an effective solution for surrogate modeling in data-scarce scenarios.
[LG-58] Swapped Logit Distillation via Bi-level Teacher Alignment
链接: https://arxiv.org/abs/2504.20108
作者: Stephen Ekaputra Limantoro,Jhe-Hao Lin,Chih-Yu Wang,Yi-Lung Tsai,Hong-Han Shuai,Ching-Chun Huang,Wen-Huang Cheng
类目: Machine Learning (cs.LG); Image and Video Processing (eess.IV); Signal Processing (eess.SP)
*备注: Accepted to Multimedia Systems 2025
Abstract:Knowledge distillation (KD) compresses the network capacity by transferring knowledge from a large (teacher) network to a smaller one (student). It has been mainstream that the teacher directly transfers knowledge to the student with its original distribution, which can possibly lead to incorrect predictions. In this article, we propose a logit-based distillation via swapped logit processing, namely Swapped Logit Distillation (SLD). SLD is proposed under two assumptions: (1) the wrong prediction occurs when the prediction label confidence is not the maximum; (2) the “natural” limit of probability remains uncertain as the best value addition to the target cannot be determined. To address these issues, we propose a swapped logit processing scheme. Through this approach, we find that the swap method can be effectively extended to teacher and student outputs, transforming into two teachers. We further introduce loss scheduling to boost the performance of two teachers’ alignment. Extensive experiments on image classification tasks demonstrate that SLD consistently performs best among previous state-of-the-art methods.
[LG-59] owards Practical Second-Order Optimizers in Deep Learning: Insights from Fisher Information Analysis
链接: https://arxiv.org/abs/2504.20096
作者: Damien Martins Gomes
类目: Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注: Master’s Thesis in Computer Science (Deep Learning Dynamics - Optimization)
Abstract:First-order optimization methods remain the standard for training deep neural networks (DNNs). Optimizers like Adam incorporate limited curvature information by preconditioning the stochastic gradient with a diagonal matrix. Despite the widespread adoption of first-order methods, second-order optimization algorithms often exhibit superior convergence compared to methods like Adam and SGD. However, their practicality in training DNNs is still limited by a significantly higher per-iteration computational cost compared to first-order methods. In this thesis, we present AdaFisher, a novel adaptive second-order optimizer that leverages a diagonal block-Kronecker approximation of the Fisher information matrix to adaptively precondition gradients. AdaFisher aims to bridge the gap between the improved convergence and generalization of second-order methods and the computational efficiency needed for training DNNs. Despite the traditionally slower speed of second-order optimizers, AdaFisher is effective for tasks such as image classification and language modeling, ex- hibiting remarkable stability and robustness during hyperparameter tuning. We demonstrate that AdaFisher outperforms state-of-the-art optimizers in both accuracy and convergence speed. The code is available from this https URL.
[LG-60] Low-Rank Matrix Approximation for Neural Network Compression
链接: https://arxiv.org/abs/2504.20078
作者: Kalyan Cherukuri,Aarav Lala
类目: Machine Learning (cs.LG); Computational Complexity (cs.CC)
*备注:
Abstract:Deep Neural Networks (DNNs) are often constrained by their large memories and computational restrictions. In this paper, we introduce a novel adaptive-rank Singular Value Decomposition (ARSVD) that dynamically chooses the rank increase of the fully connected layers below a certain threshold in energy expenditure. Unlike conventional SVD compression methods that apply a fixed rank reduction in all layers, our ARSVD method uses energy distribution to adaptively select rank per layer while retaining accuracy. This is done for each layer in an effort to use as much energy as possible while maintaining the lowest accuracy loss. Such accuracy-adaptive approaches outperform traditional static rank reduction methods by providing an improved balance between compression and model performance. We first train a simple Multi-Layer Perceptron (MLP) on the MNIST, CIFAR-10, and CIFAR-100 dataset and evaluate its performance using accuracy and F1-score. After applying ARSVD, our results demonstrate that the technique can achieve substantial model compression without compromising classification accuracy. These results illustrate the usefulness of ARSVD in computing scenarios where both computational and memory resources are scarce.
[LG-61] Improving Deep Knowledge Tracing via Gated Architectures and Adaptive Optimization
链接: https://arxiv.org/abs/2504.20070
作者: Altun Shukurlu
类目: Machine Learning (cs.LG)
*备注: 11 pages, 2 figures
Abstract:Deep Knowledge Tracing (DKT) models student learning behavior by using Recurrent Neural Networks (RNNs) to predict future performance based on historical interaction data. However, the original implementation relied on standard RNNs in the Lua-based Torch framework, which limited extensibility and reproducibility. In this work, we revisit the DKT model from two perspectives: architectural improvements and optimization efficiency. First, we enhance the model using gated recurrent units, specifically Long Short-Term Memory (LSTM) networks and Gated Recurrent Units (GRU), which better capture long-term dependencies and help mitigate vanishing gradient issues. Second, we re-implement DKT using the PyTorch framework, enabling a modular and accessible infrastructure compatible with modern deep learning workflows. We also benchmark several optimization algorithms SGD, RMSProp, Adagrad, Adam, and AdamW to evaluate their impact on convergence speed and predictive accuracy in educational modeling tasks. Experiments on the Synthetic-5 and Khan Academy datasets show that GRUs and LSTMs achieve higher accuracy and improved training stability compared to basic RNNs, while adaptive optimizers such as Adam and AdamW consistently outperform SGD in both early-stage learning and final model performance. Our open-source PyTorch implementation provides a reproducible and extensible foundation for future research in neural knowledge tracing and personalized learning systems.
[LG-62] mpo: Application-aware LLM Serving with Mixed SLO Requirements
链接: https://arxiv.org/abs/2504.20068
作者: Wei Zhang,Zhiyu Wu,Yi Mu,Banruo Liu,Myungjin Lee,Fan Lai
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注:
Abstract:The integration of Large Language Models (LLMs) into diverse applications, ranging from interactive chatbots and cloud AIOps to intelligent agents, has introduced a wide spectrum of Service Level Objectives (SLOs) for responsiveness. These workloads include latency-sensitive requests focused on per-token latency in streaming chat, throughput-intensive requests that require rapid full responses to invoke tools, and collective requests with dynamic dependencies arising from self-reflection or agent-based reasoning. This workload diversity, amplified by unpredictable request information such as response lengths and runtime dependencies, makes existing schedulers inadequate even within their design envelopes. In this paper, we define service gain as the useful service delivered by completing requests. We observe that as SLO directly reflects the actual performance needs of requests, completing a request much faster than its SLO (e.g., deadline) yields limited additional service gain. Based on this insight, we introduce Tempo, the first systematic SLO-aware scheduler designed to maximize service gain across diverse LLM workloads. Tempo allocates just enough serving bandwidth to meet each SLO, maximizing residual capacity for others best-effort workloads. Instead of assuming request information or none at all, it adopts a hybrid scheduling strategy: using quantile-based response upper bounds and dependency-graph matching for conservative initial estimates, prioritizing requests by service gain density, and refining decisions online as generation progresses. Our evaluation across diverse workloads, including chat, reasoning, and agentic pipelines, shows that Tempo improves end-to-end service gain by up to 8.3 \times and achieves up to 10.3 \times SLO goodput compared to state-of-the-art designs Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG); Systems and Control (eess.SY) Cite as: arXiv:2504.20068 [cs.DC] (or arXiv:2504.20068v1 [cs.DC] for this version) https://doi.org/10.48550/arXiv.2504.20068 Focus to learn more arXiv-issued DOI via DataCite
[LG-63] Provably faster randomized and quantum algorithms for k-means clustering via uniform sampling
链接: https://arxiv.org/abs/2504.20982
作者: Tyler Chen,Archan Ray,Akshay Seshadri,Dylan Herman,Bao Bach,Pranav Deshpande,Abhishek Som,Niraj Kumar,Marco Pistoia
类目: Quantum Physics (quant-ph); Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG)
*备注:
Abstract:The k -means algorithm (Lloyd’s algorithm) is a widely used method for clustering unlabeled data. A key bottleneck of the k -means algorithm is that each iteration requires time linear in the number of data points, which can be expensive in big data applications. This was improved in recent works proposing quantum and quantum-inspired classical algorithms to approximate the k -means algorithm locally, in time depending only logarithmically on the number of data points (along with data dependent parameters) [ q -means: A quantum algorithm for unsupervised machine learning; Kerenidis, Landman, Luongo, and Prakash, NeurIPS 2019; Do you know what q -means?, Doriguello, Luongo, Tang]. In this work, we describe a simple randomized mini-batch k -means algorithm and a quantum algorithm inspired by the classical algorithm. We prove worse-case guarantees that significantly improve upon the bounds for previous algorithms. Our improvements are due to a careful use of uniform sampling, which preserves certain symmetries of the k -means problem that are not preserved in previous algorithms that use data norm-based sampling.
[LG-64] Energy-Based Coarse-Graining in Molecular Dynamics: A Flow-Based Framework Without Data
链接: https://arxiv.org/abs/2504.20940
作者: Maximilian Stupp,P. S. Koutsourelakis
类目: Chemical Physics (physics.chem-ph); Machine Learning (cs.LG); Computational Physics (physics.comp-ph)
*备注:
Abstract:Coarse-grained (CG) models offer an effective route to reducing the complexity of molecular simulations, yet conventional approaches depend heavily on long all-atom molecular dynamics (MD) trajectories to adequately sample configurational space. This data-driven dependence limits their accuracy and generalizability, as unvisited configurations remain excluded from the resulting CG model. We introduce a data-free generative framework for coarse-graining that directly targets the all-atom Boltzmann distribution. Our model defines a structured latent space comprising slow collective variables, which are statistically associated with multimodal marginal densities capturing metastable states, and fast variables, which represent the remaining degrees of freedom with simple, unimodal conditional distributions. A potentially learnable, bijective map from the full latent space to the all-atom configuration space enables automatic and accurate reconstruction of molecular structures. The model is trained using an energy-based objective that minimizes the reverse Kullback-Leibler divergence, relying solely on the interatomic potential rather than sampled trajectories. A tempering scheme is used to stabilize training and promote exploration of diverse configurations. Once trained, the model can generate unbiased, one-shot equilibrium all-atom samples. We validate the method on two synthetic systems-a double-well potential and a Gaussian mixture-as well as on the benchmark alanine dipeptide. The model captures all relevant modes of the Boltzmann distribution, accurately reconstructs atomic configurations, and learns physically meaningful coarse-grained representations, all without any simulation data.
[LG-65] Preference-centric Bandits: Optimality of Mixtures and Regret-efficient Algorithms
链接: https://arxiv.org/abs/2504.20877
作者: Meltem Tatlı,Arpan Mukherjee,Prashanth L.A.,Karthikeyan Shanmugam,Ali Tajer
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:
Abstract:The objective of canonical multi-armed bandits is to identify and repeatedly select an arm with the largest reward, often in the form of the expected value of the arm’s probability distribution. Such a utilitarian perspective and focus on the probability models’ first moments, however, is agnostic to the distributions’ tail behavior and their implications for variability and risks in decision-making. This paper introduces a principled framework for shifting from expectation-based evaluation to an alternative reward formulation, termed a preference metric (PM). The PMs can place the desired emphasis on different reward realization and can encode a richer modeling of preferences that incorporate risk aversion, robustness, or other desired attitudes toward uncertainty. A fundamentally distinct observation in such a PM-centric perspective is that designing bandit algorithms will have a significantly different principle: as opposed to the reward-based models in which the optimal sampling policy converges to repeatedly sampling from the single best arm, in the PM-centric framework the optimal policy converges to selecting a mix of arms based on specific mixing weights. Designing such mixture policies departs from the principles for designing bandit algorithms in significant ways, primarily because of uncountable mixture possibilities. The paper formalizes the PM-centric framework and presents two algorithm classes (horizon-dependent and anytime) that learn and track mixtures in a regret-efficient fashion. These algorithms have two distinctions from their canonical counterparts: (i) they involve an estimation routine to form reliable estimates of optimal mixtures, and (ii) they are equipped with tracking mechanisms to navigate arm selection fractions to track the optimal mixtures. These algorithms’ regret guarantees are investigated under various algebraic forms of the PMs.
[LG-66] Learning and Generalization with Mixture Data
链接: https://arxiv.org/abs/2504.20651
作者: Harsh Vardhan,Avishek Ghosh,Arya Mazumdar
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:
Abstract:In many, if not most, machine learning applications the training data is naturally heterogeneous (e.g. federated learning, adversarial attacks and domain adaptation in neural net training). Data heterogeneity is identified as one of the major challenges in modern day large-scale learning. A classical way to represent heterogeneous data is via a mixture model. In this paper, we study generalization performance and statistical rates when data is sampled from a mixture distribution. We first characterize the heterogeneity of the mixture in terms of the pairwise total variation distance of the sub-population distributions. Thereafter, as a central theme of this paper, we characterize the range where the mixture may be treated as a single (homogeneous) distribution for learning. In particular, we study the generalization performance under the classical PAC framework and the statistical error rates for parametric (linear regression, mixture of hyperplanes) as well as non-parametric (Lipschitz, convex and Hölder-smooth) regression problems. In order to do this, we obtain Rademacher complexity and (local) Gaussian complexity bounds with mixture data, and apply them to get the generalization and convergence rates respectively. We observe that as the (regression) function classes get more complex, the requirement on the pairwise total variation distance gets stringent, which matches our intuition. We also do a finer analysis for the case of mixed linear regression and provide a tight bound on the generalization error in terms of heterogeneity.
[LG-67] Data Driven Deep Learning for Correcting Global Climate Model Projections of SST and DSL in the Bay of Bengal
链接: https://arxiv.org/abs/2504.20620
作者: Abhishek Pasula,Deepak N. Subramani
类目: Atmospheric and Oceanic Physics (physics.ao-ph); Machine Learning (cs.LG)
*备注: 28 pages, 10 figures, 3 tables of main paper; 24 pages, 17 figures of Supporting Information
Abstract:Climate change alters ocean conditions, notably temperature and sea level. In the Bay of Bengal, these changes influence monsoon precipitation and marine productivity, critical to the Indian economy. In Phase 6 of the Coupled Model Intercomparison Project (CMIP6), Global Climate Models (GCMs) use different shared socioeconomic pathways (SSPs) to obtain future climate projections. However, significant discrepancies are observed between these models and the reanalysis data in the Bay of Bengal for 2015-2024. Specifically, the root mean square error (RMSE) between the climate model output and the Ocean Reanalysis System (ORAS5) is 1.2C for the sea surface temperature (SST) and 1.1 m for the dynamic sea level (DSL). We introduce a new data-driven deep learning model to correct for this bias. The deep neural model for each variable is trained using pairs of climatology-removed monthly climate projections as input and the corresponding month’s ORAS5 as output. This model is trained with historical data (1950 to 2014), validated with future projection data from 2015 to 2020, and tested with future projections from 2021 to 2023. Compared to the conventional EquiDistant Cumulative Distribution Function (EDCDF) statistical method for bias correction in climate models, our approach decreases RMSE by 0.15C for SST and 0.3 m for DSL. The trained model subsequently corrects the projections for 2024-2100. A detailed analysis of the monthly, seasonal, and decadal means and variability is performed to underscore the implications of the novel dynamics uncovered in our corrected projections.
[LG-68] Sobolev norm inconsistency of kernel interpolation
链接: https://arxiv.org/abs/2504.20617
作者: Yunfei Yang
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:
Abstract:We study the consistency of minimum-norm interpolation in reproducing kernel Hilbert spaces corresponding to bounded kernels. Our main result give lower bounds for the generalization error of the kernel interpolation measured in a continuous scale of norms that interpolate between L^2 and the hypothesis space. These lower bounds imply that kernel interpolation is always inconsistent, when the smoothness index of the norm is larger than a constant that depends only on the embedding index of the hypothesis space and the decay rate of the eigenvalues.
[LG-69] Quality-factor inspired deep neural network solver for solving inverse scattering problems
链接: https://arxiv.org/abs/2504.20504
作者: Yutong Du,Zicheng Liu,Miao Cao,Zupeng Liang,Yali Zong,Changyou Li
类目: Image and Video Processing (eess.IV); Machine Learning (cs.LG); Computational Physics (physics.comp-ph)
*备注:
Abstract:Deep neural networks have been applied to address electromagnetic inverse scattering problems (ISPs) and shown superior imaging performances, which can be affected by the training dataset, the network architecture and the applied loss function. Here, the quality of data samples is cared and valued by the defined quality factor. Based on the quality factor, the composition of the training dataset is optimized. The network architecture is integrated with the residual connections and channel attention mechanism to improve feature extraction. A loss function that incorporates data-fitting error, physical-information constraints and the desired feature of the solution is designed and analyzed to suppress the background artifacts and improve the reconstruction accuracy. Various numerical analysis are performed to demonstrate the superiority of the proposed quality-factor inspired deep neural network (QuaDNN) solver and the imaging performance is finally verified by experimental imaging test.
[LG-70] Full-field surrogate modeling of cardiac function encoding geometric variability
链接: https://arxiv.org/abs/2504.20479
作者: Elena Martinez,Beatrice Moscoloni,Matteo Salvador,Fanwei Kong,Mathias Peirlinck,Alison Lesley Marsden
类目: Image and Video Processing (eess.IV); Machine Learning (cs.LG)
*备注: 34 pages, 19 figures
Abstract:Combining physics-based modeling with data-driven methods is critical to enabling the translation of computational methods to clinical use in cardiology. The use of rigorous differential equations combined with machine learning tools allows for model personalization with uncertainty quantification in time frames compatible with clinical practice. However, accurate and efficient surrogate models of cardiac function, built from physics-based numerical simulation, are still mostly geometry-specific and require retraining for different patients and pathological conditions. We propose a novel computational pipeline to embed cardiac anatomies into full-field surrogate models. We generate a dataset of electrophysiology simulations using a complex multi-scale mathematical model coupling partial and ordinary differential equations. We adopt Branched Latent Neural Maps (BLNMs) as an effective scientific machine learning method to encode activation maps extracted from physics-based numerical simulations into a neural network. Leveraging large deformation diffeomorphic metric mappings, we build a biventricular anatomical atlas and parametrize the anatomical variability of a small and challenging cohort of 13 pediatric patients affected by Tetralogy of Fallot. We propose a novel statistical shape modeling based z-score sampling approach to generate a new synthetic cohort of 52 biventricular geometries that are compatible with the original geometrical variability. This synthetic cohort acts as the training set for BLNMs. Our surrogate model demonstrates robustness and great generalization across the complex original patient cohort, achieving an average adimensional mean squared error of 0.0034. The Python implementation of our BLNM model is publicly available under MIT License at this https URL.
[LG-71] Nonlinear Computation with Linear Optics via Source-Position Encoding
链接: https://arxiv.org/abs/2504.20401
作者: N. Richardson,C. Bosch,R. P. Adams
类目: Optics (physics.optics); Hardware Architecture (cs.AR); Machine Learning (cs.LG)
*备注:
Abstract:Optical computing systems provide an alternate hardware model which appears to be aligned with the demands of neural network workloads. However, the challenge of implementing energy efficient nonlinearities in optics – a key requirement for realizing neural networks – is a conspicuous missing link. In this work we introduce a novel method to achieve nonlinear computation in fully linear media. Our method can operate at low power and requires only the ability to drive the optical system at a data-dependent spatial position. Leveraging this positional encoding, we formulate a fully automated, topology-optimization-based hardware design framework for extremely specialized optical neural networks, drawing on modern advancements in optimization and machine learning. We evaluate our optical designs on machine learning classification tasks: demonstrating significant improvements over linear methods, and competitive performance when compared to standard artificial neural networks.
[LG-72] Optimizing Hard Thresholding for Sparse Model Discovery
链接: https://arxiv.org/abs/2504.20256
作者: Derek W. Jollie,Scott G. McCalla
类目: Optimization and Control (math.OC); Machine Learning (cs.LG); Dynamical Systems (math.DS); Numerical Analysis (math.NA)
*备注:
Abstract:Many model selection algorithms rely on sparse dictionary learning to provide interpretable and physics-based governing equations. The optimization algorithms typically use a hard thresholding process to enforce sparse activations in the model coefficients by removing library elements from consideration. By introducing an annealing scheme that reactivates a fraction of the removed terms with a cooling schedule, we are able to improve the performance of these sparse learning algorithms. We concentrate on two approaches to the optimization, SINDy, and an alternative using hard thresholding pursuit. We see in both cases that annealing can improve model accuracy. The effectiveness of annealing is demonstrated through comparisons on several nonlinear systems pulled from convective flows, excitable systems, and population dynamics. Finally we apply these algorithms to experimental data for projectile motion.
[LG-73] sting the Limit of Atmospheric Predictability with a Machine Learning Weather Model
链接: https://arxiv.org/abs/2504.20238
作者: P. Trent Vonich,Gregory J. Hakim
类目: Atmospheric and Oceanic Physics (physics.ao-ph); Machine Learning (cs.LG)
*备注:
Abstract:Atmospheric predictability research has long held that the limit of skillful deterministic weather forecasts is about 14 days. We challenge this limit using GraphCast, a machine-learning weather model, by optimizing forecast initial conditions using gradient-based techniques for twice-daily forecasts spanning 2020. This approach yields an average error reduction of 86% at 10 days, with skill lasting beyond 30 days. Mean optimal initial-condition perturbations reveal large-scale, spatially coherent corrections to ERA5, primarily reflecting an intensification of the Hadley circulation. Forecasts using GraphCast-optimal initial conditions in the Pangu-Weather model achieve a 21% error reduction, peaking at 4 days, indicating that analysis corrections reflect a combination of both model bias and a reduction in analysis error. These results demonstrate that, given accurate initial conditions, skillful deterministic forecasts are consistently achievable far beyond two weeks, challenging long-standing assumptions about the limits of atmospheric predictability.
[LG-74] Coreset selection for the Sinkhorn divergence and generic smooth divergences
链接: https://arxiv.org/abs/2504.20194
作者: Alex Kokot,Alex Luedtke
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:
Abstract:We introduce CO2, an efficient algorithm to produce convexly-weighted coresets with respect to generic smooth divergences. By employing a functional Taylor expansion, we show a local equivalence between sufficiently regular losses and their second order approximations, reducing the coreset selection problem to maximum mean discrepancy minimization. We apply CO2 to the Sinkhorn divergence, providing a novel sampling procedure that requires logarithmically many data points to match the approximation guarantees of random sampling. To show this, we additionally verify several new regularity properties for entropically regularized optimal transport of independent interest. Our approach leads to a new perspective linking coreset selection and kernel quadrature to classical statistical methods such as moment and score matching. We showcase this method with a practical application of subsampling image data, and highlight key directions to explore for improved algorithmic efficiency and theoretical guarantees.
[LG-75] A Physically Driven Long Short Term Memory Model for Estimating Snow Water Equivalent over the Continental United States
链接: https://arxiv.org/abs/2504.20129
作者: Arun M. Saranathan,Mahmoud Saeedimoghaddam,Brandon Smith,Deepthi Raghunandan,Grey Nearing,Craig Pelissier
类目: Atmospheric and Oceanic Physics (physics.ao-ph); Machine Learning (cs.LG)
*备注: Preprint of journal paper in preparation. Details: 24 pages, 8 figures
Abstract:Snow is an essential input for various land surface models. Seasonal snow estimates are available as snow water equivalent (SWE) from process-based reanalysis products or locally from in situ measurements. While the reanalysis products are computationally expensive and available at only fixed spatial and temporal resolutions, the in situ measurements are highly localized and sparse. To address these issues and enable the analysis of the effect of a large suite of physical, morphological, and geological conditions on the presence and amount of snow, we build a Long Short-Term Memory (LSTM) network, which is able to estimate the SWE based on time series input of the various physical/meteorological factors as well static spatial/morphological factors. Specifically, this model breaks down the SWE estimation into two separate tasks: (i) a classification task that indicates the presence/absence of snow on a specific day and (ii) a regression task that indicates the height of the SWE on a specific day in the case of snow presence. The model is trained using physical/in situ SWE measurements from the SNOw TELemetry (SNOTEL) snow pillows in the western United States. We will show that trained LSTM models have a classification accuracy of \geq 93% for the presence of snow and a coefficient of correlation of \sim 0.9 concerning their SWE estimates. We will also demonstrate that the models can generalize both spatially and temporally to previously unseen data.
[LG-76] Learning Hierarchical Interaction for Accurate Molecular Property Prediction
链接: https://arxiv.org/abs/2504.20127
作者: Huiyang Hong,Xinkai Wu,Hongyu Sun,Qi Wang,Yuquan Li
类目: Biomolecules (q-bio.BM); Machine Learning (cs.LG)
*备注:
Abstract:Discovering molecules with desirable molecular properties, including ADMET (Absorption, Distribution, Metabolism, Excretion, and Toxicity) profiles, is of great importance in drug discovery. Existing approaches typically employ deep learning models, such as Graph Neural Networks (GNNs) and Transformers, to predict these molecular properties by learning from diverse chemical information. However, these models often fail to efficiently capture and utilize the hierarchical nature of molecular structures, and lack mechanisms for effective interaction among multi-level features. To address these limitations, we propose a Hierarchical Interaction Message Passing Mechanism, which serves as the foundation of our novel model, HimNet. Our method enables interaction-aware representation learning across atomic, motif, and molecular levels via hierarchical attention-guided message passing. This design allows HimNet to effectively balance global and local information, ensuring rich and task-relevant feature extraction for downstream property prediction tasks, such as Blood-Brain Barrier Permeability (BBBP). Extensive experiments on multiple benchmark datasets demonstrate that HimNet achieves the best or near-best performance in most molecular property prediction tasks. Furthermore, our method exhibits promising hierarchical interpretability, aligning well with chemical intuition on representative molecules. We believe that HimNet offers an accurate and efficient solution for molecular activity and ADMET property prediction, contributing significantly to advanced decision-making in the early stages of drug discovery.
[LG-77] Deep Learning vs. Black-Scholes: Option Pricing Performance on Brazilian Petrobras Stocks
链接: https://arxiv.org/abs/2504.20088
作者: Joao Felipe Gueiros,Hemanth Chandravamsi,Steven H. Frankel
类目: atistical Finance (q-fin.ST); Machine Learning (cs.LG)
*备注: 11 pages, 7 figures, 3 tables
Abstract:This paper explores the use of deep residual networks for pricing European options on Petrobras, one of the world’s largest oil and gas producers, and compares its performance with the Black-Scholes (BS) model. Using eight years of historical data from B3 (Brazilian Stock Exchange) collected via web scraping, a deep learning model was trained using a custom built hybrid loss function that incorporates market data and analytical pricing. The data for training and testing were drawn between the period spanning November 2016 to January 2025, using an 80-20 train-test split. The test set consisted of data from the final three months: November, December, and January 2025. The deep residual network model achieved a 64.3% reduction in the mean absolute error for the 3-19 BRL (Brazilian Real) range when compared to the Black-Scholes model on the test set. Furthermore, unlike the Black-Scholes solution, which tends to decrease its accuracy for longer periods of time, the deep learning model performed accurately for longer expiration periods. These findings highlight the potential of deep learning in financial modeling, with future work focusing on specialized models for different price ranges.
[LG-78] Predictive AI with External Knowledge Infusion for Stocks
链接: https://arxiv.org/abs/2504.20058
作者: Ambedkar Dukkipati,Kawin Mayilvaghanan,Naveen Kumar Pallekonda,Sai Prakash Hadnoor,Ranga Shaarad Ayyagari
类目: atistical Finance (q-fin.ST); Machine Learning (cs.LG)
*备注:
Abstract:Fluctuations in stock prices are influenced by a complex interplay of factors that go beyond mere historical data. These factors, themselves influenced by external forces, encompass inter-stock dynamics, broader economic factors, various government policy decisions, outbreaks of wars, etc. Furthermore, all of these factors are dynamic and exhibit changes over time. In this paper, for the first time, we tackle the forecasting problem under external influence by proposing learning mechanisms that not only learn from historical trends but also incorporate external knowledge from temporal knowledge graphs. Since there are no such datasets or temporal knowledge graphs available, we study this problem with stock market data, and we construct comprehensive temporal knowledge graph datasets. In our proposed approach, we model relations on external temporal knowledge graphs as events of a Hawkes process on graphs. With extensive experiments, we show that learned dynamic representations effectively rank stocks based on returns across multiple holding periods, outperforming related baselines on relevant metrics.
信息检索
[IR-0] RecGaze: The First Eye Tracking and User Interaction Dataset for Carousel Interfaces SIGIR’25
链接: https://arxiv.org/abs/2504.20792
作者: Santiago de Leon-Martinez,Jingwei Kang,Robert Moro,Maarten de Rijke,Branislav Kveton,Harrie Oosterhuis,Maria Bielikova
类目: Information Retrieval (cs.IR); Human-Computer Interaction (cs.HC)
*备注: Accepted to Resource Reproducibility Track SIGIR '25
Abstract:Carousel interfaces are widely used in e-commerce and streaming services, but little research has been devoted to them. Previous studies of interfaces for presenting search and recommendation results have focused on single ranked lists, but it appears their results cannot be extrapolated to carousels due to the added complexity. Eye tracking is a highly informative approach to understanding how users click, yet there are no eye tracking studies concerning carousels. There are very few interaction datasets on recommenders with carousel interfaces and none that contain gaze data. We introduce the RecGaze dataset: the first comprehensive feedback dataset on carousels that includes eye tracking results, clicks, cursor movements, and selection explanations. The dataset comprises of interactions from 3 movie selection tasks with 40 different carousel interfaces per user. In total, 87 users and 3,477 interactions are logged. In addition to the dataset, its description and possible use cases, we provide results of a survey on carousel design and the first analysis of gaze data on carousels, which reveals a golden triangle or F-pattern browsing behavior. Our work seeks to advance the field of carousel interfaces by providing the first dataset with eye tracking results on carousels. In this manner, we provide and encourage an empirical understanding of interactions with carousel interfaces, for building better recommender systems through gaze information, and also encourage the development of gaze-based recommenders. Comments: Accepted to Resource Reproducibility Track SIGIR '25 Subjects: Information Retrieval (cs.IR); Human-Computer Interaction (cs.HC) Cite as: arXiv:2504.20792 [cs.IR] (or arXiv:2504.20792v1 [cs.IR] for this version) https://doi.org/10.48550/arXiv.2504.20792 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Related DOI: https://doi.org/10.1145/3726302.3730301 Focus to learn more DOI(s) linking to related resources
[IR-1] Natural Language Processing tools for Pharmaceutical Manufacturing Information Extraction from Patents Natural Language Processing (NLP) tools for Pharmaceutical Manufacturing Information Extraction from Patents
链接: https://arxiv.org/abs/2504.20598
作者: Diego Alvarado-Maldonado,Blair Johnston,Cameron J. Brown
类目: Information Retrieval (cs.IR)
*备注:
Abstract:Abundant and diverse data on medicines manufacturing and other lifecycle components has been made easily accessible in the last decades. However, a significant proportion of this information is characterised by not being tabulated and usable for machine learning purposes. Thus, natural language processing tools have been used to build databases in domains such as biomedical and chemical to address this limitation. This has allowed the development of artificial intelligence applications, which have improved drug discovery and treatments. In the pharmaceutical manufacturing context, some initiatives and datasets for primary processing can be found, but the manufacturing of drug products is an area which is still lacking, to the best of our knowledge. This works aims to explore and adapt NLP tools used in other domains to extract information on both primary and secondary manufacturing, employing patents as the main source of data. Thus, two independent, but complementary, models were developed comprising a method to select fragments of text that contain manufacturing data, and a named entity recognition system that enables extracting information on operations, materials, and conditions of a process. For the first model, the identification of relevant sections was achieved using an unsupervised approach combining Latent Dirichlet Allocation and k-Means clustering. The performance of this model measured as a Cohen’s kappa between model output and manual revision was higher than 90%. NER model consisted of a deep neural network, and an f1-score micro average of 84.2% was obtained which is comparable to other works. Some considerations for these tools to be used in data extraction are discussed throughout this document.
[IR-2] Against Opacity: Explainable AI and Large Language Models for Effective Digital Advertising
链接: https://arxiv.org/abs/2504.20064
作者: Qi Yang,Marlo Ongpin,Sergey Nikolenko,Alfred Huang,Aleksandr Farseev
类目: Information Retrieval (cs.IR); Multimedia (cs.MM); Neural and Evolutionary Computing (cs.NE)
*备注:
Abstract:The opaqueness of modern digital advertising, exemplified by platforms such as Meta Ads, raises concerns regarding their autonomous control over audience targeting, pricing structures, and ad relevancy assessments. Locked in their leading positions by network effects, ``Metas and Googles of the world’’ attract countless advertisers who rely on intuition, with billions of dollars lost on ineffective social media ads. The platforms’ algorithms use huge amounts of data unavailable to advertisers, and the algorithms themselves are opaque as well. This lack of transparency hinders the advertisers’ ability to make informed decisions and necessitates efforts to promote transparency, standardize industry metrics, and strengthen regulatory frameworks. In this work, we propose novel ways to assist marketers in optimizing their advertising strategies via machine learning techniques designed to analyze and evaluate content, in particular, predict the click-through rates (CTR) of novel advertising content. Another important problem is that large volumes of data available in the competitive landscape, e.g., competitors’ ads, impede the ability of marketers to derive meaningful insights. This leads to a pressing need for a novel approach that would allow us to summarize and comprehend complex data. Inspired by the success of ChatGPT in bridging the gap between large language models (LLMs) and a broader non-technical audience, we propose a novel system that facilitates marketers in data interpretation, called SODA, that merges LLMs with explainable AI, enabling better human-AI collaboration with an emphasis on the domain of digital marketing and advertising. By combining LLMs and explainability features, in particular modern text-image models, we aim to improve the synergy between human marketers and AI systems.