This blog post presents the latest paper list retrieved from Arxiv.org on 2025-04-14. It is updated automatically and organized into five major areas: NLP, CV, ML, AI, and IR. If you would like to receive the list by email on a schedule, please leave your email address in the comments.

Note: The daily paper data is retrieved from Arxiv.org and updated automatically around 12:00 every day.

Friendly reminder: If you would like to receive the daily paper data by email, please leave your email address in the comments.

Overview (2025-04-14)

A total of 387 papers were updated today, including:

  • Natural Language Processing: 54 papers (Computation and Language (cs.CL))
  • Artificial Intelligence: 98 papers (cs.AI)
  • Computer Vision and Pattern Recognition: 97 papers (cs.CV)
  • Machine Learning: 106 papers (cs.LG)

Natural Language Processing

[NLP-0] Towards an Understanding of Context Utilization in Code Intelligence

【Quick Read】: This paper addresses the lack of a systematic analysis of context in code intelligence. Despite growing research interest and a large body of related work, no systematic survey has comprehensively examined what role context plays in code intelligence and how it is applied. To this end, the authors review 146 relevant studies published between September 2007 and August 2024 and make four main contributions: (1) a quantitative analysis of the research landscape, covering publication trends, venues, and explored topics; (2) a novel taxonomy of context types used in code intelligence; (3) a task-oriented analysis of context integration strategies; and (4) a critical assessment of evaluation methodologies for context-aware methods. Based on these findings, the paper identifies fundamental challenges in context utilization in current code intelligence systems and proposes a research roadmap that outlines key opportunities for future work. The key to the solution lies in establishing a comprehensive, systematic framework that not only maps the landscape of existing work but also introduces a new taxonomy and integration strategies, advancing the field's understanding of how to use context effectively.

Link: https://arxiv.org/abs/2504.08734
Authors: Yanlin Wang, Kefeng Duan, Dewu Zheng, Ensheng Shi, Fengji Zhang, Yanli Wang, Jiachi Chen, Xilin Liu, Yuchi Ma, Hongyu Zhang, Qianxiang Wang, Zibin Zheng
Affiliations: Sun Yat-sen University (Zhuhai, China); Huawei Cloud Computing Technologies Co., Ltd. (Beijing, China); City University of Hong Kong (Hong Kong, China); Chongqing University (Chongqing, China)
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:Code intelligence is an emerging domain in software engineering, aiming to improve the effectiveness and efficiency of various code-related tasks. Recent research suggests that incorporating contextual information beyond the basic original task inputs (i.e., source code) can substantially enhance model performance. Such contextual signals may be obtained directly or indirectly from sources such as API documentation or intermediate representations like abstract syntax trees, and can significantly improve the effectiveness of code intelligence. Despite growing academic interest, there is a lack of systematic analysis of context in code intelligence. To address this gap, we conduct an extensive literature review of 146 relevant studies published between September 2007 and August 2024. Our investigation yields four main contributions. (1) A quantitative analysis of the research landscape, including publication trends, venues, and the explored domains; (2) A novel taxonomy of context types used in code intelligence; (3) A task-oriented analysis investigating context integration strategies across diverse code intelligence tasks; (4) A critical evaluation of evaluation methodologies for context-aware methods. Based on these findings, we identify fundamental challenges in context utilization in current code intelligence systems and propose a research roadmap that outlines key opportunities for future research.

[NLP-1] DocAgent: A Multi-Agent System for Automated Code Documentation Generation

【Quick Read】: This paper tackles the challenge of automatically generating high-quality code documentation with Large Language Models (LLMs); existing approaches often produce incomplete, unhelpful, or factually incorrect output. The proposed solution, DocAgent, is a novel multi-agent collaborative system that uses topological code processing to build context incrementally. Its core is a set of specialized agents (Reader, Searcher, Writer, Verifier, and Orchestrator) working together, combined with a multi-faceted evaluation framework that assesses completeness, helpfulness, and truthfulness. Experiments show that DocAgent significantly outperforms baselines on complex, proprietary repositories, and an ablation study further confirms the importance of the topological processing order.

Link: https://arxiv.org/abs/2504.08725
Authors: Dayu Yang, Antoine Simoulin, Xin Qian, Xiaoyi Liu, Yuwei Cao, Zhaopu Teng, Grey Yang
Affiliations: Meta
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Abstract:High-quality code documentation is crucial for software development especially in the era of AI. However, generating it automatically using Large Language Models (LLMs) remains challenging, as existing approaches often produce incomplete, unhelpful, or factually incorrect outputs. We introduce DocAgent, a novel multi-agent collaborative system using topological code processing for incremental context building. Specialized agents (Reader, Searcher, Writer, Verifier, Orchestrator) then collaboratively generate documentation. We also propose a multi-faceted evaluation framework assessing Completeness, Helpfulness, and Truthfulness. Comprehensive experiments show DocAgent significantly outperforms baselines consistently. Our ablation study confirms the vital role of the topological processing order. DocAgent offers a robust approach for reliable code documentation generation in complex and proprietary repositories.

[NLP-2] SWAN-GPT: An Efficient and Scalable Approach for Long-Context Language Modeling

【Quick Read】: This paper addresses performance degradation in long-context settings: how can a model generalize robustly when the sequences seen during training are substantially shorter than those encountered at inference? The key is a new architecture, SWAN-GPT, which interleaves layers without positional encodings (NoPE) and sliding-window attention (SWA) layers equipped with rotary positional encodings (RoPE), and applies a simple dynamic scaling of attention scores at inference to handle very long sequences effectively. The paper also shows that existing decoder-only models can be efficiently converted to the SWAN architecture at low cost, enabling longer contexts. This design improves robustness while also increasing computational efficiency.

Link: https://arxiv.org/abs/2504.08719
Authors: Krishna C. Puvvada, Faisal Ladhak, Santiago Akle Serrano, Cheng-Ping Hsieh, Shantanu Acharya, Somshubra Majumdar, Fei Jia, Samuel Kriman, Simeng Sun, Dima Rekesh, Boris Ginsburg
Affiliations: NVIDIA
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:We present a decoder-only Transformer architecture that robustly generalizes to sequence lengths substantially longer than those seen during training. Our model, SWAN-GPT, interleaves layers without positional encodings (NoPE) and sliding-window attention layers equipped with rotary positional encodings (SWA-RoPE). Experiments demonstrate strong performance on sequence lengths significantly longer than the training length without the need for additional long-context training. This robust length extrapolation is achieved through our novel architecture, enhanced by a straightforward dynamic scaling of attention scores during inference. In addition, SWAN-GPT is more computationally efficient than standard GPT architectures, resulting in cheaper training and higher throughput. Further, we demonstrate that existing pre-trained decoder-only models can be efficiently converted to the SWAN architecture with minimal continued training, enabling longer contexts. Overall, our work presents an effective approach for scaling language models to longer contexts in a robust and efficient manner.
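
The abstract gives enough architectural detail for a rough outline. Below is a minimal, hypothetical sketch (not the authors' code) of how NoPE and SWA-RoPE layers might be interleaved, with a stub marking where an inference-time attention-score scale would apply; the block internals and the scaling formula are our assumptions:

```python
import math
import torch
import torch.nn as nn
from typing import Optional

class AttnBlock(nn.Module):
    """Stand-in for one transformer block; window=None means global NoPE attention."""
    def __init__(self, dim: int, window: Optional[int], use_rope: bool):
        super().__init__()
        self.window, self.use_rope = window, use_rope
        self.body = nn.Linear(dim, dim)  # placeholder for attention + MLP

    def forward(self, x: torch.Tensor, score_scale: float = 1.0) -> torch.Tensor:
        # In a real block, `score_scale` would multiply the attention logits
        # before the softmax; here we only mark where that hook would live.
        return x + self.body(x)

def build_swan(dim: int, n_layers: int, window: int) -> nn.ModuleList:
    # Alternate NoPE layers (global attention, no positions) with SWA-RoPE
    # layers (sliding-window attention plus rotary embeddings).
    return nn.ModuleList(
        AttnBlock(dim, window=None if i % 2 == 0 else window, use_rope=(i % 2 == 1))
        for i in range(n_layers)
    )

def dynamic_scale(seq_len: int, train_len: int) -> float:
    # One plausible form of inference-time score scaling: grow the logit scale
    # logarithmically once sequences exceed the training length. This exact
    # formula is our assumption, not taken from the paper.
    return max(1.0, math.log(seq_len) / math.log(train_len))

layers = build_swan(dim=256, n_layers=8, window=512)
x = torch.randn(1, 2048, 256)
scale = dynamic_scale(seq_len=2048, train_len=1024)
for layer in layers:
    x = layer(x, score_scale=scale)
```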

[NLP-3] ModernBERT or DeBERTaV3? Examining Architecture and Data Influence on Transformer Encoder Models Performance

【Quick Read】: This paper addresses a confound in comparing pretrained transformer encoders such as DeBERTaV3 and ModernBERT: reported gains may stem from differences in training data rather than architectural improvements. The key is a controlled study that pretrains ModernBERT on the same dataset as CamemBERTaV2 (a DeBERTaV3-based French model), isolating the effect of model design. The results show that although ModernBERT slightly trails the earlier generation (DeBERTaV3) in sample efficiency and final benchmark performance, it offers clear advantages in training and inference speed and still provides meaningful architectural improvements over older models such as BERT and RoBERTa. Moreover, high-quality pretraining data accelerates convergence but does not substantially improve final performance, suggesting possible benchmark saturation. The paper therefore stresses the need to disentangle pretraining data from architectural innovation when evaluating transformer models.

Link: https://arxiv.org/abs/2504.08716
Authors: Wissam Antoun, Benoît Sagot, Djamé Seddah
Affiliations: Inria
Subjects: Computation and Language (cs.CL)
Comments: Preprint. Under review

Abstract:Pretrained transformer-encoder models like DeBERTaV3 and ModernBERT introduce architectural advancements aimed at improving efficiency and performance. Although the authors of ModernBERT report improved performance over DeBERTaV3 on several benchmarks, the lack of disclosed training data and the absence of comparisons using a shared dataset make it difficult to determine whether these gains are due to architectural improvements or differences in training data. In this work, we conduct a controlled study by pretraining ModernBERT on the same dataset as CamemBERTaV2, a DeBERTaV3 French model, isolating the effect of model design. Our results show that the previous model generation remains superior in sample efficiency and overall benchmark performance, with ModernBERT’s primary advantage being faster training and inference speed. However, the new proposed model still provides meaningful architectural improvements compared to earlier models such as BERT and RoBERTa. Additionally, we observe that high-quality pre-training data accelerates convergence but does not significantly improve final performance, suggesting potential benchmark saturation. These findings show the importance of disentangling pretraining data from architectural innovations when evaluating transformer models.

[NLP-4] Generating Fine Details of Entity Interactions

【Quick Read】: This paper tackles the long-standing challenge of generating high-fidelity images of multiple interacting entities; existing pretrained text-to-image models struggle with rare object interactions due to scarce training data. The proposed solution, DetailScribe, centers on a decomposition-augmented refinement procedure: it uses large language models (LLMs) to decompose an interaction into finer-grained concepts, a vision-language model (VLM) to critique the generated image, and targeted interventions within the diffusion process to refine the result.

Link: https://arxiv.org/abs/2504.08714
Authors: Xinyi Gu, Jiayuan Mao
Affiliations: Massachusetts Institute of Technology
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Project Page: this https URL

Abstract:Images not only depict objects but also encapsulate rich interactions between them. However, generating faithful and high-fidelity images involving multiple entities interacting with each other, is a long-standing challenge. While pre-trained text-to-image models are trained on large-scale datasets to follow diverse text instructions, they struggle to generate accurate interactions, likely due to the scarcity of training data for uncommon object interactions. This paper introduces InterActing, an interaction-focused dataset with 1000 fine-grained prompts covering three key scenarios: (1) functional and action-based interactions, (2) compositional spatial relationships, and (3) multi-subject interactions. To address interaction generation challenges, we propose a decomposition-augmented refinement procedure. Our approach, DetailScribe, built on Stable Diffusion 3.5, leverages LLMs to decompose interactions into finer-grained concepts, uses a VLM to critique generated images, and applies targeted interventions within the diffusion process in refinement. Automatic and human evaluations show significantly improved image quality, demonstrating the potential of enhanced inference strategies. Our dataset and code are available at this https URL to facilitate future exploration of interaction-rich image generation.

[NLP-5] Large Language Models as Span Annotators

【Quick Read】: This paper addresses the problem that single-score metrics for high-quality texts rarely provide actionable feedback, proposing span annotation to guide improvement and provide insight. Traditional span annotation relies on human annotators or fine-tuned encoder models, which is costly and limited in efficiency. The key is to automate span annotation with large language models (LLMs). Experiments on three tasks - data-to-text generation evaluation, machine translation evaluation, and propaganda detection in human-written texts - show that LLMs reach moderate agreement with skilled human annotators, in some scenarios comparable to the average inter-annotator agreement. Reasoning models outperform their instruction-tuned counterparts and provide more valid explanations. The approach is simple to implement and substantially cheaper than human annotation. The authors also release a dataset of more than 40k model and human annotations for further research.

Link: https://arxiv.org/abs/2504.08697
Authors: Zdeněk Kasner, Vilém Zouhar, Patrícia Schmidtová, Ivan Kartáč, Kristýna Onderková, Ondřej Plátek, Dimitra Gkatzia, Saad Mahamood, Ondřej Dušek, Simone Balloccu
Affiliations: Charles University, Prague, Czechia; ETH Zurich, Switzerland; Edinburgh Napier University, Scotland, United Kingdom; trivago N.V., Düsseldorf, Germany; TU Darmstadt, Germany
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:For high-quality texts, single-score metrics seldom provide actionable feedback. In contrast, span annotation - pointing out issues in the text by annotating their spans - can guide improvements and provide insights. Until recently, span annotation was limited to human annotators or fine-tuned encoder models. In this study, we automate span annotation with large language models (LLMs). We compare expert or skilled crowdworker annotators with open and proprietary LLMs on three tasks: data-to-text generation evaluation, machine translation evaluation, and propaganda detection in human-written texts. In our experiments, we show that LLMs as span annotators are straightforward to implement and notably more cost-efficient than human annotators. The LLMs achieve moderate agreement with skilled human annotators, in some scenarios comparable to the average agreement among the annotators themselves. Qualitative analysis shows that reasoning models outperform their instruction-tuned counterparts and provide more valid explanations for annotations. We release the dataset of more than 40k model and human annotations for further research.
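
As a concrete illustration of the general recipe, here is a minimal sketch of prompting an LLM for span annotations; `call_llm` is a hypothetical chat-completion function, and the prompt wording and JSON schema are illustrative, not the paper's:

```python
import json

def annotate_spans(call_llm, text: str, task: str) -> list:
    # Ask the model for character-offset spans as JSON, then parse them.
    # Real model outputs may need more robust parsing and validation.
    prompt = (
        f"Task: {task}\n"
        "Mark problematic spans in the text below. Reply ONLY with a JSON list "
        'of objects like {"start": 0, "end": 10, "label": "error", "reason": "..."}.\n\n'
        f"Text: {text}"
    )
    return json.loads(call_llm(prompt))
```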

[NLP-6] TP-RAG: Benchmarking Retrieval-Augmented Large Language Model Agents for Spatiotemporal-Aware Travel Planning

【Quick Read】: This paper addresses the limitations of large language models (LLMs) in automated travel planning, particularly their inability to handle nuanced spatiotemporal rationality. Existing benchmarks focus on basic plan validity while neglecting route efficiency, point-of-interest (POI) appeal, and real-time adaptability. The paper introduces TP-RAG, the first dataset and benchmark tailored for retrieval-augmented, spatiotemporal-aware travel planning, comprising 2,348 real-world travel queries, 85,575 fine-grained annotated POIs, and 18,784 high-quality reference travel trajectories. Experiments show that integrating reference trajectories significantly improves the spatial efficiency and POI rationality of travel plans, but challenges remain in universality and robustness due to conflicting references and noisy data. To address this, the paper proposes EvoRAG, an evolutionary framework that efficiently synergizes diverse retrieved trajectories with the LLM's intrinsic reasoning. EvoRAG achieves state-of-the-art performance, improving spatiotemporal compliance and reducing commonsense violations over ground-up and retrieval-augmented baselines. The key is hybridizing Web knowledge with LLM-driven optimization, paving the way for more reliable and adaptive travel-planning agents.

Link: https://arxiv.org/abs/2504.08694
Authors: Hang Ni, Fan Liu, Xinyu Ma, Lixin Su, Shuaiqiang Wang, Dawei Yin, Hui Xiong, Hao Liu
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Large language models (LLMs) have shown promise in automating travel planning, yet they often fall short in addressing nuanced spatiotemporal rationality. While existing benchmarks focus on basic plan validity, they neglect critical aspects such as route efficiency, POI appeal, and real-time adaptability. This paper introduces TP-RAG, the first benchmark tailored for retrieval-augmented, spatiotemporal-aware travel planning. Our dataset includes 2,348 real-world travel queries, 85,575 fine-grain annotated POIs, and 18,784 high-quality travel trajectory references sourced from online tourist documents, enabling dynamic and context-aware planning. Through extensive experiments, we reveal that integrating reference trajectories significantly improves spatial efficiency and POI rationality of the travel plan, while challenges persist in universality and robustness due to conflicting references and noisy data. To address these issues, we propose EvoRAG, an evolutionary framework that potently synergizes diverse retrieved trajectories with LLMs’ intrinsic reasoning. EvoRAG achieves state-of-the-art performance, improving spatiotemporal compliance and reducing commonsense violation compared to ground-up and retrieval-augmented baselines. Our work underscores the potential of hybridizing Web knowledge with LLM-driven optimization, paving the way for more reliable and adaptive travel planning agents.

[NLP-7] Fast-Slow-Thinking: Complex Task Solving with Large Language Models

【Quick Read】: This paper addresses the suboptimal performance of existing task-decomposition methods on tasks with complex logic and strict constraints: when a task is overly complex, solutions generated by large language models (LLMs) may deviate from the original goal or contain redundant or even erroneous content. The proposed remedy is a new task-decomposition method called Fast-Slow-Thinking (FST). Its key idea is to mimic the two human modes of thinking through the cooperation of a Fast Thinking (FT) step and a Slow Thinking (ST) step: FT simplifies the task by removing its constraints, focusing on its general, concise form, while ST reinstates the constraints removed in FT and refines the FT answer so that it meets the requirements of the original task. This lets LLMs work through complex problems from coarse to fine, in a more human-like cognitive process. Experiments on three types of tasks verify the method's effectiveness.

Link: https://arxiv.org/abs/2504.08690
Authors: Yiliu Sun, Yanfang Zhang, Zicheng Zhao, Sheng Wan, Dacheng Tao, Chen Gong
Affiliations: Nanjing University of Science and Technology; Nanyang Technological University
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 37 pages, 7 figures

Abstract:Nowadays, Large Language Models (LLMs) have been gradually employed to solve complex tasks. To face the challenge, task decomposition has become an effective way, which proposes to divide a complex task into multiple simpler subtasks and then solve them separately so that the difficulty of the original task can be reduced. However, the performance of existing task decomposition methods can be suboptimal when the task contains overly complex logic and constraints. In this situation, the solution generated by LLMs may deviate from the original purpose of the task, or contain redundant or even erroneous content. Therefore, inspired by the fact that humans possess two thinking systems including fast thinking and slow thinking, this paper introduces a new task decomposition method termed "Fast-Slow-Thinking" (FST), which stimulates LLMs to solve tasks through the cooperation of Fast Thinking (FT) and Slow Thinking (ST) steps. Here FT focuses more on the general and concise aspect of the task, and ST focuses more on the details of the task. In FT, LLMs are prompted to remove the constraints of the original task, therefore simplifying it to a general and concise one. In ST, we recall the constraints removed in FT, so that LLMs can improve the answer generated in FT to meet the requirements of the original task. Therefore, our FST method enables LLMs to consider a complex problem via a human-like cognition process from coarse to fine, the effectiveness of which has been well demonstrated by the experiments on three types of tasks.
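
The two-step FT/ST loop described in the abstract maps naturally onto two or three prompt calls. A minimal sketch follows, assuming a generic `call_llm` stand-in for any chat-completion API; the prompt wording is ours, not the paper's:

```python
def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in your LLM client here")

def fast_slow_thinking(task: str, constraints: list[str]) -> str:
    # Fast Thinking: strip the constraints so the model first solves a
    # general, concise version of the task.
    general_task = call_llm(
        "Rewrite the following task without its constraints, keeping only "
        f"its general goal:\n{task}"
    )
    draft = call_llm(f"Solve this task:\n{general_task}")

    # Slow Thinking: reinstate the removed constraints and refine the draft
    # until it satisfies the original requirements.
    constraint_text = "\n".join(f"- {c}" for c in constraints)
    return call_llm(
        f"Original task:\n{task}\n\nConstraints to satisfy:\n{constraint_text}\n\n"
        f"Draft answer:\n{draft}\n\nRevise the draft so it meets every constraint."
    )
```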

[NLP-8] Genius: A Generalizable and Purely Unsupervised Self-Training Framework For Advanced Reasoning

【Quick Read】: This paper addresses the poor scalability and high annotation cost of existing LLM post-training techniques that rely heavily on external supervision signals (e.g., outcome supervision or auxiliary reward models). To improve LLM reasoning without external supervision, it proposes Genius, a generalizable and purely unsupervised self-training framework. The key components are a stepwise foresight re-sampling strategy, which simulates future outcomes to explore and exploit the optimal steps, and an advantage-calibrated optimization (ACO) loss that mitigates the estimation inconsistencies caused by the noise and uncertainty inherent in the unsupervised setting. Together, these provide an advanced initial step toward self-improving LLM reasoning on general queries, with implications for reasoning scaling laws.

Link: https://arxiv.org/abs/2504.08672
Authors: Fangzhi Xu, Hang Yan, Chang Ma, Haiteng Zhao, Qiushi Sun, Kanzhi Cheng, Junxian He, Jun Liu, Zhiyong Wu
Affiliations: Shanghai AI Lab; Xi'an Jiaotong University; The University of Hong Kong; Peking University; Hong Kong University of Science and Technology
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 14 pages, 7 figures

Abstract:Advancing LLM reasoning skills has captivated wide interest. However, current post-training techniques rely heavily on supervisory signals, such as outcome supervision or auxiliary reward models, which face the problem of scalability and high annotation costs. This motivates us to enhance LLM reasoning without the need for external supervision. We introduce a generalizable and purely unsupervised self-training framework, named Genius. Without external auxiliary, Genius requires to seek the optimal response sequence in a stepwise manner and optimize the LLM. To explore the potential steps and exploit the optimal ones, Genius introduces a stepwise foresight re-sampling strategy to sample and estimate the step value by simulating future outcomes. Further, we recognize that the unsupervised setting inevitably induces the intrinsic noise and uncertainty. To provide a robust optimization, we propose an advantage-calibrated optimization (ACO) loss function to mitigate estimation inconsistencies. Combining these techniques together, Genius provides an advanced initial step towards self-improve LLM reasoning with general queries and without supervision, revolutionizing reasoning scaling laws given the vast availability of general queries. The code will be released at this https URL.

[NLP-9] Training-free Guidance in Text-to-Video Generation via Multimodal Planning and Structured Noise Initialization

【Quick Read】: This paper addresses the difficulty existing text-to-video (T2V) diffusion models have in following precise text descriptions, especially when accurate control over spatial layout or object trajectories is required. The proposed training-free guidance method, Video-MSG, relies on multimodal planning and structured noise initialization. Unlike approaches that require fine-tuning or iterative manipulation of attention maps at inference, Video-MSG first creates a Video Sketch - a fine-grained spatio-temporal plan for the final video specifying background, foreground, and object trajectories in the form of draft video frames - and then guides a downstream T2V diffusion model through noise inversion and denoising, with no additional memory overhead. This markedly lowers the memory requirements of adopting large T2V models and improves their practicality and efficiency.

Link: https://arxiv.org/abs/2504.08641
Authors: Jialu Li, Shoubin Yu, Han Lin, Jaemin Cho, Jaehong Yoon, Mohit Bansal
Affiliations: UNC Chapel Hill
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Website: this https URL; The first three authors contributed equally

Abstract:Recent advancements in text-to-video (T2V) diffusion models have significantly enhanced the visual quality of the generated videos. However, even recent T2V models find it challenging to follow text descriptions accurately, especially when the prompt requires accurate control of spatial layouts or object trajectories. A recent line of research uses layout guidance for T2V models that require fine-tuning or iterative manipulation of the attention map during inference time. This significantly increases the memory requirement, making it difficult to adopt a large T2V model as a backbone. To address this, we introduce Video-MSG, a training-free Guidance method for T2V generation based on Multimodal planning and Structured noise initialization. Video-MSG consists of three steps, where in the first two steps, Video-MSG creates Video Sketch, a fine-grained spatio-temporal plan for the final video, specifying background, foreground, and object trajectories, in the form of draft video frames. In the last step, Video-MSG guides a downstream T2V diffusion model with Video Sketch through noise inversion and denoising. Notably, Video-MSG does not need fine-tuning or attention manipulation with additional memory during inference time, making it easier to adopt large T2V models. Video-MSG demonstrates its effectiveness in enhancing text alignment with multiple T2V backbones (VideoCrafter2 and CogVideoX-5B) on popular T2V generation benchmarks (T2VCompBench and VBench). We provide comprehensive ablation studies about noise inversion ratio, different background generators, background object detection, and foreground object segmentation.

[NLP-10] Analyzing 16193 LLM Papers for Fun and Profits

【Quick Read】: This paper examines the research trends of Large Language Models (LLMs) in computer science over the past six years (2019-2024) and their impact on the academic ecosystem, from four distinct perspectives: (1) how LLM research is driving topic shifts at major conferences; (2) a topic-modeling analysis that identifies areas of LLM-related growth and the topics of concern at different conferences; (3) the differing contribution patterns of academic and industrial institutions; and (4) the influence of national origin on LLM development trajectories. Synthesizing these analyses, the paper distills ten key insights that illuminate the dynamics and evolution of the LLM research ecosystem. The key lies in a multi-perspective analytical framework that combines quantitative and qualitative methods to systematically reveal the trends, distribution, and drivers of LLM research.

Link: https://arxiv.org/abs/2504.08619
Authors: Zhiqiu Xia, Lang Zhu, Bingzhe Li, Feng Chen, Qiannan Li, Hang Liu
Affiliations: Rutgers, The State University of New Jersey, New Brunswick, NJ, USA; The University of Texas at Dallas, Richardson, Texas, USA; University of California, Davis, CA, USA
Subjects: Digital Libraries (cs.DL); Computation and Language (cs.CL)
Comments:

Abstract:Large Language Models (LLMs) are reshaping the landscape of computer science research, driving significant shifts in research priorities across diverse conferences and fields. This study provides a comprehensive analysis of the publication trend of LLM-related papers in 77 top-tier computer science conferences over the past six years (2019-2024). We approach this analysis from four distinct perspectives: (1) We investigate how LLM research is driving topic shifts within major conferences. (2) We adopt a topic modeling approach to identify various areas of LLM-related topic growth and reveal the topics of concern at different conferences. (3) We explore distinct contribution patterns of academic and industrial institutions. (4) We study the influence of national origins on LLM development trajectories. Synthesizing the findings from these diverse analytical angles, we derive ten key insights that illuminate the dynamics and evolution of the LLM research ecosystem.

[NLP-11] A Survey of Machine Learning Models and Datasets for the Multi-label Classification of Textual Hate Speech in English

【Quick Read】: This paper systematically surveys multi-label classification of online hate speech in textual data and addresses key gaps in existing work: while most scientific work treats hate speech classification as a binary task, practice often requires differentiation into sub-types (e.g., by target, severity, or legality) that may overlap for individual content. The key contribution is a comprehensive review of 46 publications, including an overview of 28 datasets suitable for training multi-label classifiers that reveals significant heterogeneity in label sets, size, meta-concepts, annotation process, and inter-annotator agreement. An analysis of 24 publications proposing classification models further establishes inconsistency in evaluation and a preference for architectures based on Bidirectional Encoder Representations from Transformers (BERT) and Recurrent Neural Networks (RNNs). The paper identifies imbalanced training data, reliance on crowdsourcing platforms, small and sparse datasets, and missing methodological alignment as critical open issues, and formulates ten research recommendations as key directions forward.

Link: https://arxiv.org/abs/2504.08609
Authors: Julian Bäumler, Louis Blöcher, Lars-Joel Frey, Xian Chen, Markus Bayer, Christian Reuter
Affiliations: Technical University of Darmstadt
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 35 pages, 4 figures, 4 tables

Abstract:The dissemination of online hate speech can have serious negative consequences for individuals, online communities, and entire societies. This and the large volume of hateful online content prompted both practitioners’, i.e., in content moderation or law enforcement, and researchers’ interest in machine learning models to automatically classify instances of hate speech. Whereas most scientific works address hate speech classification as a binary task, practice often requires a differentiation into sub-types, e.g., according to target, severity, or legality, which may overlap for individual content. Hence, researchers created datasets and machine learning models that approach hate speech classification in textual data as a multi-label problem. This work presents the first systematic and comprehensive survey of scientific literature on this emerging research landscape in English (N=46). We contribute with a concise overview of 28 datasets suited for training multi-label classification models that reveals significant heterogeneity regarding label-set, size, meta-concept, annotation process, and inter-annotator agreement. Our analysis of 24 publications proposing suitable classification models further establishes inconsistency in evaluation and a preference for architectures based on Bidirectional Encoder Representation from Transformers (BERT) and Recurrent Neural Networks (RNNs). We identify imbalanced training data, reliance on crowdsourcing platforms, small and sparse datasets, and missing methodological alignment as critical open issues and formulate ten recommendations for research.

[NLP-12] MedHal: An Evaluation Dataset for Medical Hallucination Detection

【Quick Read】: This paper addresses the significant limitations of hallucination-detection methods when applied to specialized domains such as medicine, where failures can have disastrous consequences. The key is MedHal, a large-scale dataset that incorporates diverse medical text sources and tasks, provides a substantial volume of annotated samples for training medical hallucination detection models, and includes explanations of factual inconsistencies to guide model learning, thereby improving detection while reducing reliance on costly expert review.

Link: https://arxiv.org/abs/2504.08596
Authors: Gaya Mehenni, Amal Zouaq
Affiliations: LAMA-WeST; Polytechnique Montreal; Mila
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:We present MedHal, a novel large-scale dataset specifically designed to evaluate if models can detect hallucinations in medical texts. Current hallucination detection methods face significant limitations when applied to specialized domains like medicine, where they can have disastrous consequences. Existing medical datasets are either too small, containing only a few hundred samples, or focus on a single task like Question Answering or Natural Language Inference. MedHal addresses these gaps by: (1) incorporating diverse medical text sources and tasks; (2) providing a substantial volume of annotated samples suitable for training medical hallucination detection models; and (3) including explanations for factual inconsistencies to guide model learning. We demonstrate MedHal’s utility by training and evaluating a baseline medical hallucination detection model, showing improvements over general-purpose hallucination detection approaches. This resource enables more efficient evaluation of medical text generation systems while reducing reliance on costly expert review, potentially accelerating the development of medical AI research.

[NLP-13] Playpen: An Environment for Exploring Learning Through Conversational Interaction

【Quick Read】: This paper asks whether, as the learning signal from next-word prediction begins to run dry at scale, synthetic conversational interaction in what the authors call Dialogue Games - goal-directed, rule-governed activities driven predominantly by verbal actions - can supply a new learning signal, and how that signal can be used. The key is an environment that uses a large language model as the learner's counterpart to produce such interaction data, both offline and online, together with an evaluation of supervised fine-tuning and reinforcement-learning setups (DPO, GRPO). The results show that while all approaches improve on in-domain games, only GRPO generalizes to out-of-domain games while retaining competitive performance on reference-based tasks. The crux is therefore both designing an effective environment for synthetic dialogue interaction and choosing the right reinforcement-learning algorithm to exploit the new signal.

Link: https://arxiv.org/abs/2504.08590
Authors: Nicola Horst, Davide Mazzaccara, Antonia Schmidt, Michael Sullivan, Filippo Momentè, Luca Franceschetti, Philipp Sadler, Sherzod Hakimov, Alberto Testoni, Raffaella Bernardi, Raquel Fernández, Alexander Koller, Oliver Lemon, David Schlangen, Mario Giulianelli, Alessandro Suglia
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments: Source code: this https URL. Please send correspondence to: lm-playschool@googlegroups.com

Abstract:Are we running out of learning signal? Predicting the next word in an existing text has turned out to be a powerful signal, at least at scale. But there are signs that we are running out of this resource. In recent months, interaction between learner and feedback-giver has come into focus, both for “alignment” (with a reward model judging the quality of instruction following attempts) and for improving “reasoning” (process- and outcome-based verifiers judging reasoning steps). In this paper, we explore to what extent synthetic interaction in what we call Dialogue Games – goal-directed and rule-governed activities driven predominantly by verbal actions – can provide a learning signal, and how this signal can be used. We introduce an environment for producing such interaction data (with the help of a Large Language Model as counterpart to the learner model), both offline and online. We investigate the effects of supervised fine-tuning on this data, as well as reinforcement learning setups such as DPO, and GRPO; showing that all of these approaches achieve some improvements in in-domain games, but only GRPO demonstrates the ability to generalise to out-of-domain games as well as retain competitive performance in reference-based tasks. We release the framework and the baseline training setups in the hope that this can foster research in this promising new direction.

[NLP-14] UoB-NLP at SemEval-2025 Task 11: Leveraging Adapters for Multilingual and Cross-Lingual Emotion Detection SEMEVAL-2025

【Quick Read】: This paper addresses the challenges of multilingual and cross-lingual emotion detection, which remains underexplored for low-resource languages. The solution leverages adapter-based fine-tuning with multilingual pre-trained language models. Adapters introduce only a small number of trainable parameters while keeping the pre-trained weights fixed, giving a parameter-efficient route to adaptation. The paper experiments with several adapter-tuning strategies and finds that target-language-ready task adapters perform best overall, with notable results on low-resource African languages such as Tigrinya and Kinyarwanda. Compared with large language models, the approach performs better in 11 languages and matches them in 4 others while using far fewer parameters and much less compute. The key is thus adapter fine-tuning as an efficient and effective route to multilingual and cross-lingual emotion detection.

Link: https://arxiv.org/abs/2504.08543
Authors: Frances Laureano De Leon, Yixiao Wang, Yue Feng, Mark G. Lee
Affiliations: University of Birmingham
Subjects: Computation and Language (cs.CL)
Comments: Accepted to appear in Proceedings of the 19th International Workshop on Semantic Evaluation (SemEval-2025)

Abstract:Emotion detection in natural language processing is a challenging task due to the complexity of human emotions and linguistic diversity. While significant progress has been made in high-resource languages, emotion detection in low-resource languages remains underexplored. In this work, we address multilingual and cross-lingual emotion detection by leveraging adapter-based fine-tuning with multilingual pre-trained language models. Adapters introduce a small number of trainable parameters while keeping the pre-trained model weights fixed, offering a parameter-efficient approach to adaptation. We experiment with different adapter tuning strategies, including task-only adapters, target-language-ready task adapters, and language-family-based adapters. Our results show that target-language-ready task adapters achieve the best overall performance, particularly for low-resource African languages with our team ranking 7th for Tigrinya, and 8th for Kinyarwanda in Track A. In Track C, our system ranked 3rd for Amharic, and 4th for Oromo, Tigrinya, Kinyarwanda, Hausa, and Igbo. Our approach outperforms large language models in 11 languages and matches their performance in four others, despite our models having significantly fewer parameters. Furthermore, we find that adapter-based models retain cross-linguistic transfer capabilities while requiring fewer computational resources compared to full fine-tuning for each language.
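
For readers unfamiliar with adapters, here is a minimal sketch of the generic bottleneck-adapter design (the Houlsby-style variant; the paper's exact adapter configuration may differ). Base model weights stay frozen and only the small adapter is trained:

```python
import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    def __init__(self, hidden: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(hidden, bottleneck)  # project down
        self.up = nn.Linear(bottleneck, hidden)    # project back up
        self.act = nn.GELU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Residual connection keeps the frozen base behavior intact.
        return x + self.up(self.act(self.down(x)))

def freeze_base(model: nn.Module):
    # Keep pre-trained weights fixed; only adapter parameters get gradients.
    for name, p in model.named_parameters():
        p.requires_grad = "adapter" in name
```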

[NLP-15] Lexical Bundle Frequency as a Construct-Relevant Candidate Feature in Automated Scoring of L2 Academic Writing

【Quick Read】: This paper addresses the construct validity of automated scoring (AS) systems for L2 writing, specifically how linguistic features can be used more effectively in scoring models. The key is to test the integration of lexical bundle (LB) frequency features into the scoring model. Using a sampled subcorpus of TOEFL independent writing tasks (1,225 essays across 9 L1s), 3- to 9-word LBs were extracted, distinguishing prompt-specific from non-prompt bundles. A baseline Support Vector Machine (SVM) model using established linguistic features (e.g., mechanics, cohesion, sophistication) was compared against an extended model with three aggregate LB-frequency features (total prompt, total non-prompt, overall total). Although the relationships between LB frequency (especially non-prompt bundles) and proficiency were generally small, the extended model improved agreement with human raters (Quadratic Cohen's Kappa +2.05%, overall Cohen's Kappa +5.63%), with notable gains on low-proficiency (+10.1% exact agreement) and medium-proficiency (+14.3% Cohen's Kappa) essays. The key, then, is that aggregate LB-frequency features make scoring systems more linguistically informed and accurate, particularly for differentiating developing L2 writers.

Link: https://arxiv.org/abs/2504.08537
Authors: Burak Senel
Affiliations: Iowa State University
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Automated scoring (AS) systems are increasingly used for evaluating L2 writing, but require ongoing refinement for construct validity. While prior work suggested lexical bundles (LBs) - recurrent multi-word sequences satisfying certain frequency criteria - could inform assessment, their empirical integration into AS models needs further investigation. This study tested the impact of incorporating LB frequency features into an AS model for TOEFL independent writing tasks. Analyzing a sampled subcorpus (N=1,225 essays, 9 L1s) from the TOEFL11 corpus, scored by ETS-trained raters (Low, Medium, High), 3- to 9-word LBs were extracted, distinguishing prompt-specific from non-prompt types. A baseline Support Vector Machine (SVM) scoring model using established linguistic features (e.g., mechanics, cohesion, sophistication) was compared against an extended model including three aggregate LB frequency features (total prompt, total non-prompt, overall total). Results revealed significant, though generally small-effect, relationships between LB frequency (especially non-prompt bundles) and proficiency (p < .05). Mean frequencies suggested lower proficiency essays used more LBs overall. Critically, the LB-enhanced model improved agreement with human raters (Quadratic Cohen’s Kappa +2.05%, overall Cohen’s Kappa +5.63%), with notable gains for low (+10.1% exact agreement) and medium (+14.3% Cohen’s Kappa) proficiency essays. These findings demonstrate that integrating aggregate LB frequency offers potential for developing more linguistically informed and accurate AS systems, particularly for differentiating developing L2 writers.
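
The feature-extraction step lends itself to a short sketch. Below is a minimal, assumed implementation of extracting 3- to 9-word bundles and computing the three aggregate frequency features; the tokenizer and the frequency threshold are our simplifying assumptions, not the study's exact criteria:

```python
from collections import Counter

def lexical_bundles(essays: list, min_count: int = 5) -> Counter:
    """essays: list of token lists; returns bundles meeting a frequency cutoff."""
    counts = Counter()
    for tokens in essays:
        for n in range(3, 10):  # 3- to 9-word sequences
            for i in range(len(tokens) - n + 1):
                counts[tuple(tokens[i:i + n])] += 1
    return Counter({lb: c for lb, c in counts.items() if c >= min_count})

def lb_features(tokens: list, prompt_lbs: set, nonprompt_lbs: set) -> list:
    total_prompt = total_nonprompt = 0
    for n in range(3, 10):
        for i in range(len(tokens) - n + 1):
            lb = tuple(tokens[i:i + n])
            if lb in prompt_lbs:
                total_prompt += 1
            elif lb in nonprompt_lbs:
                total_nonprompt += 1
    # The three aggregate features appended to the baseline SVM feature vector:
    return [total_prompt, total_nonprompt, total_prompt + total_nonprompt]
```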

[NLP-16] On The Landscape of Spoken Language Models: A Comprehensive Survey

【Quick Read】: This survey aims to improve our understanding of spoken language models (SLMs) by reviewing recent work. The problem it addresses: as speech processing shifts from training custom, task-specific models toward using and optimizing universal speech processing systems (SLMs), the diversity and inconsistency of architectures, training methods, and evaluation settings have prevented a shared, comprehensive view of the field. The key is a unifying literature survey that categorizes the work by model architecture, training, and evaluation choices, and lays out key challenges and directions for future research.

Link: https://arxiv.org/abs/2504.08528
Authors: Siddhant Arora, Kai-Wei Chang, Chung-Ming Chien, Yifan Peng, Haibin Wu, Yossi Adi, Emmanuel Dupoux, Hung-Yi Lee, Karen Livescu, Shinji Watanabe
Affiliations: Carnegie Mellon University, USA; National Taiwan University, Taiwan; Toyota Technological Institute at Chicago, USA; Hebrew University of Jerusalem, Israel; ENS - PSL, EHESS, CNRS, France
Subjects: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments:

Abstract:The field of spoken language processing is undergoing a shift from training custom-built, task-specific models toward using and optimizing spoken language models (SLMs) which act as universal speech processing systems. This trend is similar to the progression toward universal language models that has taken place in the field of (text) natural language processing. SLMs include both “pure” language models of speech – models of the distribution of tokenized speech sequences – and models that combine speech encoders with text language models, often including both spoken and written input or output. Work in this area is very diverse, with a range of terminology and evaluation settings. This paper aims to contribute an improved understanding of SLMs via a unifying literature survey of recent work in the context of the evolution of the field. Our survey categorizes the work in this area by model architecture, training, and evaluation choices, and describes some key challenges and directions for future work.

[NLP-17] Integrated ensemble of BERT- and features-based models for authorship attribution in Japanese literary works

【Quick Read】: This paper addresses the underexplored performance of combining traditional feature-based methods with modern pre-trained language models (PLMs) on small-sample authorship attribution (AA) tasks. The key is an integrated ensemble of traditional feature-based and modern PLM-based (BERT) methods: ensembles of BERT-based classifiers, and a joint ensemble of traditional-feature and BERT models, significantly improve small-sample AA performance. Using two corpora of literary works (10 authors each), the experiments show that on the corpus not included in the pre-training data, the integrated ensemble improved the F1 score by roughly 14 points over the best single model. The approach offers a viable way to make efficient use of the ever-expanding array of data-processing tools.

Link: https://arxiv.org/abs/2504.08527
Authors: Taisei Kanda, Mingzhe Jin, Wataru Zaitsu
Affiliations: Graduate School of Culture and Information Science, Doshisha University; Research Center for Linguistic Ecology, Doshisha University, Kyoto, Japan; Institute of Interdisciplinary Research, Kyoto University of Advanced Science, Kyoto, Japan; Faculty of Psychology, Mejiro University, Tokyo, Japan
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Traditionally, authorship attribution (AA) tasks relied on statistical data analysis and classification based on stylistic features extracted from texts. In recent years, pre-trained language models (PLMs) have attracted significant attention in text classification tasks. However, although they demonstrate excellent performance on large-scale short-text datasets, their effectiveness remains under-explored for small samples, particularly in AA tasks. Additionally, a key challenge is how to effectively leverage PLMs in conjunction with traditional feature-based methods to advance AA research. In this study, we aimed to significantly improve performance using an integrated integrative ensemble of traditional feature-based and modern PLM-based methods on an AA task in a small sample. For the experiment, we used two corpora of literary works to classify 10 authors each. The results indicate that BERT is effective, even for small-sample AA tasks. Both BERT-based and classifier ensembles outperformed their respective stand-alone models, and the integrated ensemble approach further improved the scores significantly. For the corpus that was not included in the pre-training data, the integrated ensemble improved the F1 score by approximately 14 points, compared to the best-performing single model. Our methodology provides a viable solution for the efficient use of the ever-expanding array of data processing tools in the foreseeable future.
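
The core ensembling idea can be illustrated compactly. Here is a minimal sketch, under our own assumptions (the paper does not publish this code), of averaging class probabilities from a BERT-based classifier and a stylometric feature-based classifier; the weighting is hypothetical:

```python
import numpy as np

def integrated_ensemble(p_bert: np.ndarray, p_features: np.ndarray,
                        w_bert: float = 0.5) -> np.ndarray:
    """p_bert, p_features: (n_texts, n_authors) class-probability matrices."""
    p = w_bert * p_bert + (1.0 - w_bert) * p_features
    return p.argmax(axis=1)  # predicted author index per text

# Toy usage: two classifiers, three texts, two candidate authors.
preds = integrated_ensemble(np.array([[0.9, 0.1], [0.4, 0.6], [0.2, 0.8]]),
                            np.array([[0.7, 0.3], [0.6, 0.4], [0.1, 0.9]]))
```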

[NLP-18] Task Memory Engine (TME): Enhancing State Awareness for Multi-Step LLM Agent Tasks

【Quick Read】: This paper addresses the lack of structured task-state understanding in existing Large Language Model (LLM) agent frameworks for multi-step tasks. Most frameworks rely on linear prompt concatenation or shallow memory buffers, leading to brittle performance, frequent hallucinations, and poor long-range coherence. The key is the Task Memory Engine (TME), a lightweight, structured memory module that tracks task execution with a hierarchical Task Memory Tree (TMT). Each tree node corresponds to a task step and stores the relevant input, output, status, and sub-task relationships. A prompt-synthesis method dynamically generates LLM prompts from the active node path, significantly improving execution consistency and contextual grounding. Experiments show that TME improves task-completion accuracy and yields more interpretable behavior with minimal implementation overhead.

Link: https://arxiv.org/abs/2504.08525
Authors: Ye Ye
Affiliations: New York University
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 14 pages, 5 figures. Preprint prepared for future submission. Includes implementation and token-efficiency analysis. Code at this https URL

Abstract:Large Language Models (LLMs) are increasingly used as autonomous agents for multi-step tasks. However, most existing frameworks fail to maintain a structured understanding of the task state, often relying on linear prompt concatenation or shallow memory buffers. This leads to brittle performance, frequent hallucinations, and poor long-range coherence. In this work, we propose the Task Memory Engine (TME), a lightweight and structured memory module that tracks task execution using a hierarchical Task Memory Tree (TMT). Each node in the tree corresponds to a task step, storing relevant input, output, status, and sub-task relationships. We introduce a prompt synthesis method that dynamically generates LLM prompts based on the active node path, significantly improving execution consistency and contextual grounding. Through case studies and comparative experiments on multi-step agent tasks, we demonstrate that TME leads to better task completion accuracy and more interpretable behavior with minimal implementation overhead. The full implementation of TME is available at this https URL.
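
The data structure is simple enough to sketch. Below is a minimal, inferred illustration of a Task Memory Tree node and path-based prompt synthesis; the field names and format are our assumptions, not the released API:

```python
from dataclasses import dataclass, field

@dataclass
class TMTNode:
    step: str                      # description of the task step
    input: str = ""
    output: str = ""
    status: str = "pending"        # e.g. pending / running / done
    children: list = field(default_factory=list)
    parent: "TMTNode" = None

    def add_subtask(self, step: str) -> "TMTNode":
        child = TMTNode(step=step, parent=self)
        self.children.append(child)
        return child

def synthesize_prompt(node: TMTNode) -> str:
    # Walk from the root to the active node so the prompt carries exactly the
    # relevant execution history instead of a flat concatenation of all turns.
    path = []
    cur = node
    while cur is not None:
        path.append(cur)
        cur = cur.parent
    lines = [f"[{n.status}] {n.step}: {n.output}".rstrip(": ") for n in reversed(path)]
    return "Task context so far:\n" + "\n".join(lines)
```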

[NLP-19] BOISHOMMO: Holistic Approach for Bangla Hate Speech

【Quick Read】: This paper addresses the scarcity of hate speech (HS) detection datasets for low-resource languages such as Bangla, and the failure of existing datasets to reflect the multidimensional, overlapping abusive attributes of hateful content. The key is BOISHOMMO, a multi-label Bangla hate-speech dataset covering categories such as race, gender, religion, and politics, with over two thousand annotated examples. By evaluating multiple algorithmic approaches, BOISHOMMO also exposes the complexities of processing non-Latin-script Bangla text and assesses model performance, providing a more nuanced and diverse foundation for future hate-speech detection and analysis in low-resource languages.

Link: https://arxiv.org/abs/2504.08408
Authors: Md Abdullah Al Kafi, Sumit Kumar Banshal, Md Sadman Shakib, Showrov Azam, Tamanna Alam Tabashom
Affiliations: Cambridge University Press
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments:

Abstract:One of the most alarming issues in digital society is hate speech (HS) on social media. The severity is so high that researchers across the globe are captivated by this domain. A notable amount of work has been conducted to address the identification and alarm system. However, a noticeable gap exists, especially for low-resource languages. Comprehensive datasets are the main problem among the constrained resource languages, such as Bangla. Interestingly, hate speech or any particular speech has no single dimensionality. Similarly, the hate component can simultaneously have multiple abusive attributes, which seems to be missed in the existing datasets. Thus, a multi-label Bangla hate speech dataset named BOISHOMMO has been compiled and evaluated in this work. That includes categories of HS across race, gender, religion, politics, and more. With over two thousand annotated examples, BOISHOMMO provides a nuanced understanding of hate speech in Bangla and highlights the complexities of processing non-Latin scripts. Apart from evaluating with multiple algorithmic approaches, it also highlights the complexities of processing Bangla text and assesses model performance. This unique multi-label approach enriches future hate speech detection and analysis studies for low-resource languages by providing a more nuanced, diverse dataset.

[NLP-20] Beyond Self-Reports: Multi-Observer Agents for Personality Assessment in Large Language Models

【Quick Read】: This paper addresses systematic biases in assessing the personality traits of Large Language Models (LLMs): traditional self-report questionnaires may fail to capture true behavioral nuances because of inherent biases and meta-knowledge contamination. The key is a multi-observer framework inspired by informant-report methods in psychology: multiple observer agents, each configured with a specific relationship context (e.g., family, friend, or workplace), simulate interactive scenarios with the subject LLM, engage in dialogues, and then rate the subject on the Big Five personality dimensions. Experiments reveal that LLMs hold systematic biases in self-reported personality ratings, that aggregating observer ratings effectively reduces non-systematic bias, and that optimal reliability is reached with 5-7 observers. The findings highlight the significant effect of relationship context on personality perception and demonstrate that the multi-observer paradigm yields a more robust, context-sensitive evaluation of LLM personality traits.

Link: https://arxiv.org/abs/2504.08399
Authors: Yin Jou Huang, Rafik Hadfi
Affiliations: Graduate School of Informatics, Kyoto University
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 13 pages, 5 figures, 2 tables

Abstract:There is a growing interest in assessing the personality traits of Large language models (LLMs). However, traditional personality assessments based on self-report questionnaires may fail to capture their true behavioral nuances due to inherent biases and meta-knowledge contamination. This paper introduces a novel multi-observer framework for LLM personality assessment that draws inspiration from informant-report methods in psychology. Instead of relying solely on self-assessments, our approach employs multiple observer agents configured with a specific relationship context (e.g., family, friend, or workplace) to simulate interactive scenarios with a subject LLM. These observers engage in dialogues and subsequently provide ratings across the Big Five personality dimensions. Our experiments reveal that LLMs possess systematic biases in self-report personality ratings. Moreover, aggregating observer ratings effectively reduces non-systematic biases and achieves optimal reliability with 5-7 observers. The findings highlight the significant impact of relationship context on personality perception and demonstrate that a multi-observer paradigm yields a more robust and context-sensitive evaluation of LLM personality traits.
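
The aggregation step the abstract describes (averaging ratings across roughly 5-7 observers to damp non-systematic bias) reduces to a short function. A minimal sketch follows; observer dialogue and rating prompts are stubbed, and the flat dict format is our assumption:

```python
import statistics

TRAITS = ["openness", "conscientiousness", "extraversion",
          "agreeableness", "neuroticism"]

def aggregate_ratings(observer_ratings: list) -> dict:
    """observer_ratings: one {trait: score} dict per observer agent."""
    return {t: statistics.mean(r[t] for r in observer_ratings) for t in TRAITS}

# Toy usage with three observers' Big Five scores on a 1-5 scale.
print(aggregate_ratings([
    {t: s for t, s in zip(TRAITS, [4, 3, 2, 5, 1])},
    {t: s for t, s in zip(TRAITS, [5, 3, 3, 4, 2])},
    {t: s for t, s in zip(TRAITS, [4, 4, 2, 4, 1])},
]))
```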

[NLP-21] Scholar Inbox: Personalized Paper Recommendations for Scientists WWW

【Quick Read】: This paper addresses the challenge researchers face in keeping up with the rapidly growing volume of scientific literature, namely efficiently surfacing the latest papers most relevant to one's own interests. The key is Scholar Inbox, a new open-access platform whose core is a personalized recommendation system trained on user ratings. The platform combines a map of science with an active learning strategy that iteratively prompts users to rate a selection of papers, addressing the cold-start problem common in recommender systems and learning user preferences quickly, thereby substantially improving recommendation quality. The recommender is evaluated on a publicly released dataset of 800k user ratings and through an extensive user study.

Link: https://arxiv.org/abs/2504.08385
Authors: Markus Flicke, Glenn Angrabeit, Madhav Iyengar, Vitalii Protsenko, Illia Shakun, Jovan Cicvaric, Bora Kargi, Haoyu He, Lukas Schuler, Lewin Scholz, Kavyanjali Agnihotri, Yong Cao, Andreas Geiger
Affiliations: University of Tübingen, Tübingen AI Center
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments: this https URL

Abstract:Scholar Inbox is a new open-access platform designed to address the challenges researchers face in staying current with the rapidly expanding volume of scientific literature. We provide personalized recommendations, continuous updates from open-access archives (arXiv, bioRxiv, etc.), visual paper summaries, semantic search, and a range of tools to streamline research workflows and promote open research access. The platform’s personalized recommendation system is trained on user ratings, ensuring that recommendations are tailored to individual researchers’ interests. To further enhance the user experience, Scholar Inbox also offers a map of science that provides an overview of research across domains, enabling users to easily explore specific topics. We use this map to address the cold start problem common in recommender systems, as well as an active learning strategy that iteratively prompts users to rate a selection of papers, allowing the system to learn user preferences quickly. We evaluate the quality of our recommendation system on a novel dataset of 800k user ratings, which we make publicly available, as well as via an extensive user study. this https URL

[NLP-22] FocalLens: Instruction Tuning Enables Zero-Shot Conditional Image Representations

【Quick Read】: This paper addresses the inflexibility of fixed feature representations in visual understanding: existing image-encoding paradigms represent an image as a generic, fixed feature vector, ignoring that different downstream tasks prioritize different visual information. The solution, FocalLens, is a conditional visual encoding method that produces different representations of the same image depending on a context of interest expressed flexibly in natural language. The key is to leverage vision instruction-tuning data and contrastively fine-tune a pretrained vision encoder to accept natural-language instructions as additional input, yielding conditional image representations that better pronounce the visual features of interest and improve performance across a range of downstream tasks, including image-image retrieval, image classification, and image-text retrieval.

Link: https://arxiv.org/abs/2504.08368
Authors: Cheng-Yu Hsieh, Pavan Kumar Anasosalu Vasu, Fartash Faghri, Raviteja Vemulapalli, Chun-Liang Li, Ranjay Krishna, Oncel Tuzel, Hadi Pouransari
Affiliations: University of Washington; Apple
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Abstract:Visual understanding is inherently contextual – what we focus on in an image depends on the task at hand. For instance, given an image of a person holding a bouquet of flowers, we may focus on either the person such as their clothing, or the type of flowers, depending on the context of interest. Yet, most existing image encoding paradigms represent an image as a fixed, generic feature vector, overlooking the potential needs of prioritizing varying visual information for different downstream use cases. In this work, we introduce FocalLens, a conditional visual encoding method that produces different representations for the same image based on the context of interest, expressed flexibly through natural language. We leverage vision instruction tuning data and contrastively finetune a pretrained vision encoder to take natural language instructions as additional inputs for producing conditional image representations. Extensive experiments validate that conditional image representation from FocalLens better pronounce the visual features of interest compared to generic features produced by standard vision encoders like CLIP. In addition, we show FocalLens further leads to performance improvements on a range of downstream tasks including image-image retrieval, image classification, and image-text retrieval, with an average gain of 5 and 10 points on the challenging SugarCrepe and MMVP-VLM benchmarks, respectively.

[NLP-23] MedRep: Medical Concept Representation for General Electronic Health Record Foundation Models

【Quick Read】: This paper addresses a fundamental limitation of electronic health record (EHR) foundation models: handling out-of-vocabulary (OOV) medical codes, which hampers model generality and the integration of models trained on different vocabularies. The key is MedRep, built on the OMOP common data model (CDM), which provides integrated medical-concept representations and a basic data-augmentation strategy for patient trajectories. For concept representation learning, MedRep enriches each concept with a minimal definition via large language model (LLM) prompts and enhances the text-based representations through the graph ontology of the OMOP vocabulary. Trajectory augmentation randomly replaces selected concepts with similar concepts whose representations are closely related, letting the model practice handling OOV concepts. The results show that EHR foundation models trained with MedRep better maintain prediction performance on external datasets.

Link: https://arxiv.org/abs/2504.08329
Authors: Junmo Kim, Namkyeong Lee, Jiwon Kim, Kwangsoo Kim
Affiliations: Seoul National University; KAIST; ICMIT; Seoul National University Hospital; College of Medicine, Seoul National University
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Under review

Abstract:Electronic health record (EHR) foundation models have been an area ripe for exploration with their improved performance in various medical tasks. Despite the rapid advances, there exists a fundamental limitation: Processing unseen medical codes out of the vocabulary. This problem limits the generality of EHR foundation models and the integration of models trained with different vocabularies. To deal with this problem, we propose MedRep for EHR foundation models based on the observational medical outcome partnership (OMOP) common data model (CDM), providing the integrated medical concept representations and the basic data augmentation strategy for patient trajectories. For concept representation learning, we enrich the information of each concept with a minimal definition through large language model (LLM) prompts and enhance the text-based representations through graph ontology of OMOP vocabulary. Trajectory augmentation randomly replaces selected concepts with other similar concepts that have closely related representations to let the model practice with the concepts out-of-vocabulary. Finally, we demonstrate that EHR foundation models trained with MedRep better maintain the prediction performance in external datasets. Our code implementation is publicly available at this https URL.
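
The trajectory-augmentation idea - occasionally swapping a concept for a neighbor with a highly similar learned representation - can be sketched briefly. The swap rate, neighborhood size, and cosine-similarity choice below are our assumptions, not the paper's settings:

```python
import random
import numpy as np

def augment_trajectory(concepts: list, emb: dict,
                       p_swap: float = 0.1, top_k: int = 5) -> list:
    """concepts: medical-code sequence; emb: {concept: np.ndarray} representations."""
    names = list(emb)
    mat = np.stack([emb[n] for n in names])
    mat = mat / np.linalg.norm(mat, axis=1, keepdims=True)  # unit-normalize
    out = []
    for c in concepts:
        if c in emb and random.random() < p_swap:
            v = emb[c] / np.linalg.norm(emb[c])
            sims = mat @ v  # cosine similarity to every other concept
            order = [names[i] for i in np.argsort(-sims) if names[i] != c]
            out.append(random.choice(order[:top_k]))  # swap for a close neighbor
        else:
            out.append(c)
    return out
```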

[NLP-24] Large language models could be rote learners

【Quick Read】: This paper addresses the reduced reliability of multiple-choice question (MCQ) benchmarks for evaluating Large Language Models (LLMs) caused by benchmark contamination. It reframes contamination as an inherent aspect of learning and seeks to disentangle rote memorization from genuine capability acquisition. The key is TrinEval, a novel evaluation framework that reformulates MCQs into an alternative trinity format, reducing memorization while preserving knowledge assessment, thereby separating rote memorization from genuine capability learning. Experiments validate TrinEval's effectiveness in reformulation and reveal that common LLMs may memorize by rote an average of 20.5% of knowledge points on MMLU.

Link: https://arxiv.org/abs/2504.08300
Authors: Yuyang Xu, Renjun Hu, Haochao Ying, Jian Wu, Xing Shi, Wei Lin
Affiliations: College of Computer Science and Technology, Zhejiang University; School of Data Science and Engineering, East China Normal University; State Key Laboratory of Transvascular Implantation Devices, The Second Affiliated Hospital, Zhejiang University School of Medicine; School of Public Health, Zhejiang University; Zhejiang Key Laboratory of Medical Imaging Artificial Intelligence; Alibaba Cloud Computing
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Work in Progress

Abstract:Multiple-choice question (MCQ) benchmarks are widely used for evaluating Large Language Models (LLMs), yet their reliability is undermined by benchmark contamination. In this study, we reframe contamination as an inherent aspect of learning and seek to disentangle genuine capability acquisition from superficial memorization in LLM evaluation. First, by analyzing model performance under different memorization conditions, we uncover a counterintuitive trend: LLMs perform worse on memorized MCQs than on non-memorized ones, indicating the coexistence of two distinct learning phenomena, i.e., rote memorization and genuine capability learning. To disentangle them, we propose TrinEval, a novel evaluation framework that reformulates MCQs into an alternative trinity format, reducing memorization while preserving knowledge assessment. Experiments validate TrinEval’s effectiveness in reformulation, and its evaluation reveals that common LLMs may memorize by rote 20.5% of knowledge points (in MMLU on average).

[NLP-25] ELSA: A Style Aligned Dataset for Emotionally Intelligent Language Generation

【Quick Read】: This paper addresses a critical trade-off in existing emotion datasets between emotional granularity and stylistic diversity, which has limited the development of effective emotion-conditioned text generation systems. To bridge this gap, it introduces ELSA (Emotion and Language Style Alignment Dataset). Using fine-grained emotion taxonomies adapted from existing sources (the dair ai emotion dataset and the GoEmotions taxonomy) and advanced Large Language Models (LLMs), the dataset provides multiple emotionally nuanced, stylistically diverse variants of original sentences across contextual styles such as conversational, formal, poetic, and narrative. The key is a systematic construction process yielding a high-quality dataset that combines emotional authenticity, linguistic fluency, and textual diversity, rigorously validated with metrics including perplexity, embedding variance, readability, lexical diversity, and semantic coherence. ELSA thus lays a foundation for research on fine-grained emotional control, prompt-driven explanation, interpretability, and style-adaptive expressive language generation with LLMs.

Link: https://arxiv.org/abs/2504.08281
Authors: Vishal Gandhi, Sagar Gandhi
Affiliations: Joyspace AI
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 8 pages

Abstract:Advancements in emotion aware language processing increasingly shape vital NLP applications ranging from conversational AI and affective computing to computational psychology and creative content generation. Existing emotion datasets either lack emotional granularity or fail to capture necessary stylistic diversity, limiting the advancement of effective emotion conditioned text generation systems. Seeking to bridge this crucial gap between granularity and style diversity, this paper introduces a novel systematically constructed dataset named ELSA Emotion and Language Style Alignment Dataset leveraging fine grained emotion taxonomies adapted from existing sources such as dair ai emotion dataset and GoEmotions taxonomy. This dataset comprises multiple emotionally nuanced variations of original sentences regenerated across distinct contextual styles such as conversational, formal, poetic, and narrative, using advanced Large Language Models LLMs. Rigorous computational evaluation using metrics such as perplexity, embedding variance, readability, lexical diversity, and semantic coherence measures validates the datasets emotional authenticity, linguistic fluency, and textual diversity. Comprehensive metric analyses affirm its potential to support deeper explorations into emotion conditioned style adaptive text generation. By enabling precision tuned emotionally nuanced language modeling, our dataset creates fertile ground for research on fine grained emotional control, prompt driven explanation, interpretability, and style adaptive expressive language generation with LLMs.

[NLP-26] Generalized Multilingual Text-to-Speech Generation with Language-Aware Style Adaptation

【Quick Read】: This paper tackles the challenges multilingual text-to-speech (TTS) models face in cross-lingual speech generation, including mismatched phoneme vocabularies and variation in prosody and speaking style across languages. Existing approaches either train a separate model per language, achieving high performance at high computational cost, or use a unified model that struggles to capture fine-grained, language-specific style variation. The key is LanStyleTTS, a non-autoregressive, language-aware style-adaptive TTS framework that standardizes phoneme representations and enables fine-grained, phoneme-level style control across languages, supporting a single unified multilingual TTS model that generates accurate, high-quality speech without training language-specific models. Experiments show consistent gains across several state-of-the-art non-autoregressive architectures, and that latent-feature encodings can markedly reduce model size and computational cost while preserving high-fidelity speech generation.

Link: https://arxiv.org/abs/2504.08274
Authors: Haowei Lou, Hye-young Paik, Sheng Li, Wen Hu, Lina Yao
Affiliations: School of Computer Science and Engineering, UNSW Sydney, Australia; School of Engineering, Institute of Science Tokyo, Japan; CSIRO's Data61, Australia
Subjects: Sound (cs.SD); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
Comments:

Abstract:Text-to-Speech (TTS) models can generate natural, human-like speech across multiple languages by transforming phonemes into waveforms. However, multilingual TTS remains challenging due to discrepancies in phoneme vocabularies and variations in prosody and speaking style across languages. Existing approaches either train separate models for each language, which achieve high performance at the cost of increased computational resources, or use a unified model for multiple languages that struggles to capture fine-grained, language-specific style variations. In this work, we propose LanStyleTTS, a non-autoregressive, language-aware style adaptive TTS framework that standardizes phoneme representations and enables fine-grained, phoneme-level style control across languages. This design supports a unified multilingual TTS model capable of producing accurate and high-quality speech without the need to train language-specific models. We evaluate LanStyleTTS by integrating it with several state-of-the-art non-autoregressive TTS architectures. Results show consistent performance improvements across different model backbones. Furthermore, we investigate a range of acoustic feature representations, including mel-spectrograms and autoencoder-derived latent features. Our experiments demonstrate that latent encodings can significantly reduce model size and computational cost while preserving high-quality speech generation.

[NLP-27] VLMT: Vision-Language Multimodal Transformer for Multimodal Multi-hop Question Answering

[Quick Read]: This paper addresses the limitations of existing methods for Multimodal Multi-hop Question Answering (MMQA), including limited cross-modal reasoning, reliance on modality conversion, and inadequate alignment between visual and textual representations. The solution is the Vision-Language Multimodal Transformer (VLMT), a unified architecture that couples a transformer-based vision encoder with a sequence-to-sequence language model and fuses visual and textual inputs in a shared embedding space via a direct token-level injection mechanism, removing the need for intermediate projection layers. VLMT further adopts a three-stage pretraining strategy to progressively align vision-language representations and strengthen multimodal understanding. Together, these components form the core of VLMT and yield significant gains in multimodal reasoning and question answering.

Link: https://arxiv.org/abs/2504.08269
Authors: Qi Zhi Lim, Chin Poo Lee, Kian Ming Lim, Kalaiarasi Sonai Muthu Anbananthen
Affiliations: Faculty of Information Science and Technology, Multimedia University; School of Computer Science, University of Nottingham Ningbo China; Faculty of Information Science and Technology, Multimedia University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments:

Abstract:The increasing availability of multimodal data across text, tables, and images presents new challenges for developing models capable of complex cross-modal reasoning. Existing methods for Multimodal Multi-hop Question Answering (MMQA) often suffer from limited reasoning capabilities, reliance on modality conversion, and inadequate alignment between visual and textual representations. To address these limitations, this paper introduces Vision-Language Multimodal Transformer (VLMT), a unified architecture that integrates a transformer-based vision encoder with a sequence-to-sequence language model. VLMT employs a direct token-level injection mechanism to fuse visual and textual inputs within a shared embedding space, eliminating the need for intermediate projection layers. To enhance cross-modal alignment and reasoning, a three-stage pretraining strategy is proposed to progressively align vision-language representations and improve the model’s capacity for multimodal understanding. Based on the pretrained backbone, two task-specific modules are instantiated to form a two-stage MMQA framework: a multimodal reranker that predicts document relevance scores and utilizes a relative threshold with top-k strategy for context retrieval, and a multimodal question answering model that generates contextually grounded answers based on the retrieved evidence. Comprehensive experiments on two benchmark datasets demonstrate the effectiveness of the proposed approach. On MultimodalQA validation set, VLMT-Large achieves 76.5% Exact Match and 80.1% F1, outperforming the previous state-of-the-art by +9.1% in Exact Match and +8.8% in F1. On WebQA, it attains a QA score of 47.6, surpassing prior models such as PERQA by +3.2. These results highlight VLMT’s strong capabilities in multimodal reasoning and its potential to advance real-world information retrieval and question answering systems.
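To make the token-level injection idea above concrete, here is a minimal PyTorch sketch (our illustration, not the authors' code) of fusing vision tokens and text embeddings in a shared space with no projection layer; the dummy ViT, vocabulary size, and dimensions are assumptions.

```python
import torch
import torch.nn as nn

class TokenLevelInjection(nn.Module):
    """Fuse vision tokens with text token embeddings in a shared space.

    Hypothetical sketch: both encoders are assumed to emit vectors of the
    same hidden size, so no intermediate projection layer is needed.
    """
    def __init__(self, vision_encoder: nn.Module, text_embedding: nn.Embedding):
        super().__init__()
        self.vision_encoder = vision_encoder
        self.text_embedding = text_embedding

    def forward(self, pixel_values: torch.Tensor, input_ids: torch.Tensor) -> torch.Tensor:
        vis_tokens = self.vision_encoder(pixel_values)      # (B, Nv, D)
        txt_tokens = self.text_embedding(input_ids)         # (B, Nt, D)
        # Direct token-level injection: concatenate along the sequence axis
        # so the seq2seq encoder attends over both modalities jointly.
        return torch.cat([vis_tokens, txt_tokens], dim=1)   # (B, Nv+Nt, D)

class DummyViT(nn.Module):
    def forward(self, pixel_values):                        # (B, 3, H, W) -> (B, Nv, D)
        return torch.zeros(pixel_values.shape[0], 49, 512)

fuse = TokenLevelInjection(DummyViT(), nn.Embedding(32000, 512))
fused = fuse(torch.randn(2, 3, 224, 224), torch.randint(0, 32000, (2, 10)))
print(fused.shape)  # torch.Size([2, 59, 512])
```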

[NLP-28] Evaluating the Bias in LLMs for Surveying Opinion and Decision Making in Healthcare

[Quick Read]: This paper asks whether generative AI can faithfully reflect individual differences when simulating human behaviour. The key idea is to create digital twins of survey respondents via demographic-based prompt engineering and to assess how well different large language models (LLMs) reproduce real-world behaviour. The study finds that some LLMs fail to predict realistic decision-making, such as predicting universal vaccine acceptance, while Llama 3 captures variation across race and income more accurately but also introduces biases absent from the original UAS data. This shows the potential of generative agents for behavioural research while exposing the risk of bias from both the LLMs and the prompting strategies.

Link: https://arxiv.org/abs/2504.08260
Authors: Yonchanok Khaokaew, Flora D. Salim, Andreas Züfle, Hao Xue, Taylor Anderson, Matthew Scotch, David J Heslop
Affiliations: University of New South Wales, Australia; Emory University, USA; George Mason University, USA; Arizona State University, USA
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Generative agents have been increasingly used to simulate human behaviour in silico, driven by large language models (LLMs). These simulacra serve as sandboxes for studying human behaviour without compromising privacy or safety. However, it remains unclear whether such agents can truly represent real individuals. This work compares survey data from the Understanding America Study (UAS) on healthcare decision-making with simulated responses from generative agents. Using demographic-based prompt engineering, we create digital twins of survey respondents and analyse how well different LLMs reproduce real-world behaviours. Our findings show that some LLMs fail to reflect realistic decision-making, such as predicting universal vaccine acceptance. However, Llama 3 captures variations across race and Income more accurately but also introduces biases not present in the UAS data. This study highlights the potential of generative agents for behavioural research while underscoring the risks of bias from both LLMs and prompting strategies.
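As a rough illustration of the demographic-based prompt engineering used to build digital twins, the sketch below composes a persona-conditioned survey prompt; the field names and wording are hypothetical, not the paper's template.

```python
def build_digital_twin_prompt(profile: dict, question: str) -> str:
    """Compose a demographic-conditioned prompt for one survey respondent.

    A minimal sketch of demographic-based prompt engineering; the profile
    fields and phrasing are illustrative assumptions.
    """
    persona = (
        f"You are a {profile['age']}-year-old {profile['gender']} from "
        f"{profile['state']}, with a household income of {profile['income']} "
        f"and {profile['education']} education."
    )
    return f"{persona}\nAnswer the survey question as this person would.\nQ: {question}\nA:"

# Usage: compare model answers against the respondent's real survey answer.
prompt = build_digital_twin_prompt(
    {"age": 47, "gender": "woman", "state": "Ohio",
     "income": "$40k-60k", "education": "college"},
    "Would you accept a newly approved vaccine?",
)
print(prompt)
```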

[NLP-29] Millions of States: Designing a Scalable MoE Architecture with RWKV-7 Meta-learner

[Quick Read]: This paper targets RWKV-7's limitations in token-parameter interaction and scalability. Although RWKV-7 offers linear complexity and stronger expressive power in short-context settings, it lacks an effective token-parameter interaction mechanism and native scalability, which constrains its adaptability and growth and typically requires retraining to accommodate change. The proposed remedy, Meta-State, replaces the attention mechanism with a fully state-driven approach and realizes token-parameter interaction through a Self-State Encoder (SSE). The SSE repurposes part of RWKV-7's Weighted Key-Value (WKV) state as transformation weights, encoding token-parameter interactions in a linear, state-driven manner without introducing new trainable matrices or softmax operations, while preserving the autoregressive nature of token processing. Meta-State also supports progressive model scaling by expanding the WKV state and parameter tokens, reusing existing parameters without retraining. The approach bridges state-based modeling, token-parameter interaction, and scalable architectures, offering an efficient and flexible sequence-modeling framework with linear complexity and constant memory usage.

Link: https://arxiv.org/abs/2504.08247
Authors: Liu Xiao, Li Zhiyuan, Lin Yueyu
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments:

Abstract:State-based sequence models like RWKV-7 offer a compelling alternative to Transformer architectures, achieving linear complexity while demonstrating greater expressive power in short-context scenarios and enabling state tracking beyond the TC^0 complexity class. However, RWKV-7 lacks mechanisms for token-parameter interactions and native scalability, limiting its adaptability and growth without retraining. In this paper, we propose Meta-State, a novel extension to RWKV-7 that replaces attention mechanisms with a fully state-driven approach, integrating token-parameter interactions through a Self-State Encoder (SSE) mechanism. The SSE repurposes a portion of the RWKV-7 Weighted Key-Value (WKV) state as transformation weights to encode token-parameter interactions in a linear, state-driven manner without introducing new trainable matrices or softmax operations, while preserving the autoregressive property of token processing. Meta-State supports progressive model scaling by expanding the WKV state and parameter tokens, reusing existing parameters without retraining. Our approach bridges the gap between state-based modeling, token-parameter interactions, and scalable architectures, offering a flexible framework for efficient and adaptable sequence modeling with linear complexity and constant memory usage.

[NLP-30] Out of Style: RAG's Fragility to Linguistic Variation

[Quick Read]: This paper targets the insufficient robustness of Retrieval-Augmented Generation (RAG) systems to the queries real users pose to large language models (LLMs). Although RAG systems perform well on many NLP benchmarks, their behaviour under the greater linguistic variability of real-world queries, and the cascading errors such variability can trigger across interdependent RAG components, remains underexplored, which limits practical deployment.

The core of the work is a systematic analysis of how variation along four linguistic dimensions (formality, readability, politeness, and grammatical correctness) affects RAG performance, evaluating two retrieval models and nine LLMs of different sizes (3 to 72 billion parameters) on four information-seeking question-answering (QA) datasets. Linguistic reformulations significantly affect both the retrieval and generation stages: Recall@5 drops by up to 40.41% (relative) for less formal queries, and answer-match scores drop by up to 38.86% for queries containing grammatical errors. The paper further shows that RAG systems are more sensitive to linguistic variation than LLM-only generation, highlighting their vulnerability to error propagation under linguistic shifts, and argues for robustness-enhancing techniques to improve reliability across diverse user interactions.

Link: https://arxiv.org/abs/2504.08231
Authors: Tianyu Cao, Neel Bhandari, Akhila Yerukola, Akari Asai, Maarten Sap
Affiliations: Language Technologies Institute, Carnegie Mellon University; University of Washington
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Despite the impressive performance of Retrieval-augmented Generation (RAG) systems across various NLP benchmarks, their robustness in handling real-world user-LLM interaction queries remains largely underexplored. This presents a critical gap for practical deployment, where user queries exhibit greater linguistic variations and can trigger cascading errors across interdependent RAG components. In this work, we systematically analyze how varying four linguistic dimensions (formality, readability, politeness, and grammatical correctness) impact RAG performance. We evaluate two retrieval models and nine LLMs, ranging from 3 to 72 billion parameters, across four information-seeking Question Answering (QA) datasets. Our results reveal that linguistic reformulations significantly impact both retrieval and generation stages, leading to a relative performance drop of up to 40.41% in Recall@5 scores for less formal queries and 38.86% in answer match scores for queries containing grammatical errors. Notably, RAG systems exhibit greater sensitivity to such variations compared to LLM-only generations, highlighting their vulnerability to error propagation due to linguistic shifts. These findings highlight the need for improved robustness techniques to enhance reliability in diverse user interactions.
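The kind of robustness measurement reported above can be reproduced in a few lines. The sketch below computes Recall@5 for an original query set and an informal rewrite and reports the relative drop; the toy data stand in for a real retriever's ranked output.

```python
def recall_at_k(retrieved_ids: list[str], gold_ids: set[str], k: int = 5) -> float:
    """1.0 if any of the top-k retrieved docs is a gold passage, else 0.0."""
    return float(any(doc_id in gold_ids for doc_id in retrieved_ids[:k]))

def relative_drop(original: float, reformulated: float) -> float:
    """Relative performance drop (%) of a reformulated query set vs. the original."""
    return 100.0 * (original - reformulated) / original

# Dummy example data; a real study would swap in an actual retriever's rankings.
gold = {"q1": {"d3"}, "q2": {"d7"}}
runs = {
    "original": {"q1": ["d3", "d1"], "q2": ["d7", "d2"]},
    "informal": {"q1": ["d9", "d3"], "q2": ["d2", "d4"]},
}
scores = {name: sum(recall_at_k(ranked[q], gold[q]) for q in gold) / len(gold)
          for name, ranked in runs.items()}
print(scores, f"relative drop: {relative_drop(scores['original'], scores['informal']):.1f}%")
```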

[NLP-31] Big Meaning: Qualitative Analysis on Large Bodies of Data Using AI

[Quick Read]: This paper addresses the problem of efficiently identifying, for thematic analysis, texts that are rich in unique human-generated codes ("fecund" texts). The key idea is a framework that uses generative AI to produce descriptive codes whose density signals a text's fecundity, guiding the selection of documents likely to yield richer qualitative insights. In an experiment on 2,530 Malaysian news articles about refugee attitudes, AI-selected documents proved roughly twice as fecund as randomly selected ones when three human coders independently derived codes, supporting AI-generated codes as an effective proxy for identifying documents with high meaning-making potential in thematic analysis.

Link: https://arxiv.org/abs/2504.08213
Authors: Samuel Flanders, Melati Nungsari, Mark Cheong Wing Loong
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments: arXiv admin note: text overlap with arXiv:2504.07408

Abstract:This study introduces a framework that leverages AI-generated descriptive codes to indicate a text’s fecundity–the density of unique human-generated codes–in thematic analysis. Rather than replacing human interpretation, AI-generated codes guide the selection of texts likely to yield richer qualitative insights. Using a dataset of 2,530 Malaysian news articles on refugee attitudes, we compare AI-selected documents to randomly chosen ones by having three human coders independently derive codes. The results demonstrate that AI-selected texts exhibit approximately twice the fecundity. Our findings support the use of AI-generated codes as an effective proxy for identifying documents with a high potential for meaning-making in thematic analysis.
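A minimal sketch of using AI-generated code density as a fecundity proxy for document selection follows; the scoring rule (count of distinct codes) is our assumption about one reasonable instantiation, not the paper's exact definition.

```python
def fecundity_score(ai_codes: list[str]) -> int:
    """Proxy fecundity: number of distinct AI-generated descriptive codes."""
    return len(set(ai_codes))

def select_top_documents(docs_to_codes: dict[str, list[str]], n: int) -> list[str]:
    """Rank documents by proxy fecundity and keep the n most promising."""
    ranked = sorted(docs_to_codes, key=lambda d: fecundity_score(docs_to_codes[d]),
                    reverse=True)
    return ranked[:n]

# Hypothetical usage: codes per article would come from an LLM coding pass.
corpus = {"a1": ["stigma", "policy", "work"], "a2": ["policy"], "a3": ["aid", "food"]}
print(select_top_documents(corpus, n=2))  # -> ['a1', 'a3']
```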

[NLP-32] LLM for Comparative Narrative Analysis

[Quick Read]: This paper addresses fairness and accuracy in comparing performance across large language models (LLMs). Using Multi-Perspective Comparative Narrative Analysis (CNA), it evaluates three prominent LLMs (GPT-3.5, PaLM2, and Llama2). The key design is to apply identical prompts, compare the models' outputs on specific tasks, and use human evaluation as the gold standard across four perspectives, analysing differences in task comprehension, analytical ability, and responses. The three LLMs produced notably divergent answers to the same prompts, indicating real differences in their ability to understand and analyse the given task.

Link: https://arxiv.org/abs/2504.08211
Authors: Leo Kampen, Carlos Rabat Villarreal, Louis Yu, Santu Karmaker, Dongji Feng
Affiliations: Gustavus Adolphus College; Auburn University; University of Central Florida
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 5 pages, 4 figures, Appendix included

Abstract:In this paper, we conducted a Multi-Perspective Comparative Narrative Analysis (CNA) on three prominent LLMs: GPT-3.5, PaLM2, and Llama2. We applied identical prompts and evaluated their outputs on specific tasks, ensuring an equitable and unbiased comparison between various LLMs. Our study revealed that the three LLMs generated divergent responses to the same prompt, indicating notable discrepancies in their ability to comprehend and analyze the given task. Human evaluation was used as the gold standard, evaluating four perspectives to analyze differences in LLM performance.

[NLP-33] Harnessing the Unseen: The Hidden Influence of Intrinsic Knowledge in Long-Context Language Models

[Quick Read]: This paper tackles the underexplored question of how long-context models (LCMs), which tend to rely heavily on external context when processing very long inputs, are influenced by their intrinsic knowledge during content generation. The key contribution is a simple yet effective Hybrid Needle-in-a-Haystack test that combines internal and external retrieval: by evaluating a model's joint use of intrinsic and extrinsic knowledge, it exposes the gap between intrinsic retrieval ability and extrinsic retrieval ability and the interference between them. Experiments show that Qwen-2.5 models substantially outperform Llama-3.1 models in intrinsic retrieval ability, and that focusing only on extrinsic retrieval can cap a model's potential, underscoring the importance of evaluating models from this dual-retrieval perspective.

Link: https://arxiv.org/abs/2504.08202
Authors: Yu Fu, Haz Sameen Shahgir, Hui Liu, Xianfeng Tang, Qi He, Yue Dong
Affiliations: University of California Riverside
Subjects: Computation and Language (cs.CL)
Comments: 21 pages, 11 figures

Abstract:Recent advances in long-context models (LCMs), designed to handle extremely long input contexts, primarily focus on utilizing external contextual information, often leaving the influence of large language models’ intrinsic knowledge underexplored. In this work, we investigate how this intrinsic knowledge affects content generation and demonstrate that its impact becomes increasingly pronounced as context length extends. Furthermore, we show that the model’s ability to utilize intrinsic knowledge, which we call intrinsic retrieval ability, does not improve simultaneously with its ability to leverage contextual knowledge through extrinsic retrieval ability. Moreover, better extrinsic retrieval can interfere with the model’s ability to use its own knowledge effectively, limiting its full potential. To bridge this gap, we design a simple yet effective Hybrid Needle-in-a-Haystack test that evaluates models based on their capabilities across both retrieval abilities, rather than solely emphasizing extrinsic retrieval ability. Our experimental results reveal that Qwen-2.5 models significantly outperform Llama-3.1 models, demonstrating superior intrinsic retrieval ability. Moreover, even the more powerful Llama-3.1-70B-Instruct model fails to exhibit better performance under LCM conditions, highlighting the importance of evaluating models from a dual-retrieval perspective.
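To illustrate the dual-retrieval evaluation, the sketch below assembles one hybrid trial that pairs an extrinsic needle buried in filler text with an intrinsic probe answered from parametric knowledge; the phrasing and structure are illustrative assumptions, not the paper's exact protocol.

```python
import random

def build_hybrid_haystack(filler_sentences: list[str],
                          extrinsic_needle: str,
                          intrinsic_probe: str) -> tuple[str, str]:
    """Assemble one hybrid needle-in-a-haystack trial.

    The extrinsic needle (a fact only available from context) is inserted at
    a random position in long filler text; the intrinsic probe asks for a
    fact the model must recall from its own parametric knowledge.
    """
    pos = random.randrange(len(filler_sentences) + 1)
    context = filler_sentences[:pos] + [extrinsic_needle] + filler_sentences[pos:]
    question = f"{intrinsic_probe} Also, what magic number is stated in the text?"
    return " ".join(context), question

context, question = build_hybrid_haystack(
    ["The weather was mild."] * 1000,
    "The magic number of the vault is 7481.",
    "From your own knowledge, who wrote 'Pride and Prejudice'?",
)
print(question)
```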

[NLP-34] SAEs Can Improve Unlearning: Dynamic Sparse Autoencoder Guardrails for Precision Unlearning in LLMs

[Quick Read]: This paper targets the shortcomings of existing gradient-based machine unlearning methods: high computational cost, hyperparameter instability, poor sequential unlearning, vulnerability to relearning attacks, low data efficiency, and lack of interpretability. The key contribution is Dynamic SAE Guardrails (DSG), a precision unlearning method built on principled feature selection and a dynamic classifier, which markedly improves the precision and efficiency of unlearning. Experiments show DSG outperforms leading unlearning methods on several fronts, including better forget-utility trade-offs, improved computational efficiency and stability, stronger sequential unlearning and resistance to relearning attacks, higher data efficiency (including zero-shot settings), and better interpretability.

Link: https://arxiv.org/abs/2504.08192
Authors: Aashiq Muhamed, Jacopo Bonato, Mona Diab, Virginia Smith
Affiliations: Carnegie Mellon University; Leonardo Labs
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Cryptography and Security (cs.CR)
Comments:

Abstract:Machine unlearning is a promising approach to improve LLM safety by removing unwanted knowledge from the model. However, prevailing gradient-based unlearning methods suffer from issues such as high computational costs, hyperparameter instability, poor sequential unlearning capability, vulnerability to relearning attacks, low data efficiency, and lack of interpretability. While Sparse Autoencoders are well-suited to improve these aspects by enabling targeted activation-based unlearning, prior approaches underperform gradient-based methods. This work demonstrates that, contrary to these earlier findings, SAEs can significantly improve unlearning when employed dynamically. We introduce Dynamic SAE Guardrails (DSG), a novel method for precision unlearning that leverages principled feature selection and a dynamic classifier. Our experiments show DSG substantially outperforms leading unlearning methods, achieving superior forget-utility trade-offs. DSG addresses key drawbacks of gradient-based approaches for unlearning – offering enhanced computational efficiency and stability, robust performance in sequential unlearning, stronger resistance to relearning attacks, better data efficiency including zero-shot settings, and more interpretable unlearning.

[NLP-35] Findings of the BabyLM Challenge: Sample-Efficient Pretraining on Developmentally Plausible Corpora ACL CONLL

[Quick Read]: This paper addresses the fact that large language models (LLMs) are far less data-efficient than human children: conventional training needs three or four orders of magnitude more data, whereas children acquire language from under a hundred million words, a gap that limits researchers' ability to train new models and to use existing ones as developmentally plausible cognitive models. The BabyLM Challenge invites participants to optimize language-model training under a fixed data budget and compares submissions on evaluation tasks covering grammatical ability, downstream performance, and generalization.

The key question is how to improve data efficiency under a fixed budget. Summarizing over 30 submissions, the paper finds that the winning entries based on the LTG-BERT architecture (Samuel et al., 2023) outperformed models trained on trillions of words; other strong entries succeeded by training on shorter input sequences or by training a student model on a pretrained teacher. Curriculum-learning attempts, which accounted for a large share of submissions, were largely unsuccessful, with only some showing modest improvements. The paper therefore distills concrete recommendations on data-efficient training and on where future efforts should (and perhaps should not) focus.

Link: https://arxiv.org/abs/2504.08165
Authors: Alex Warstadt, Aaron Mueller, Leshem Choshen, Ethan Wilcox, Chengxu Zhuang, Juan Ciro, Rafael Mosquera, Bhargavi Paranjape, Adina Williams, Tal Linzen, Ryan Cotterell
Affiliations: ETH Zürich; Northeastern University; Technion; MIT; IBM Research; MLCommons; Meta AI (FAIR); University of Washington; New York University
Subjects: Computation and Language (cs.CL)
Comments: Published in Proceedings of BabyLM. Please cite the published version on ACL anthology: this http URL

Abstract:Children can acquire language from less than 100 million words of input. Large language models are far less data-efficient: they typically require 3 or 4 orders of magnitude more data and still do not perform as well as humans on many evaluations. These intensive resource demands limit the ability of researchers to train new models and use existing models as developmentally plausible cognitive models. The BabyLM Challenge is a communal effort in which participants compete to optimize language model training on a fixed data budget. Submissions are compared on various evaluation tasks targeting grammatical ability, downstream task performance, and generalization. Participants can submit to up to three tracks with progressively looser data restrictions. From over 30 submissions, we extract concrete recommendations on how best to train data-efficient language models, and on where future efforts should (and perhaps should not) focus. The winning submissions using the LTG-BERT architecture (Samuel et al., 2023) outperformed models trained on trillions of words. Other submissions achieved strong results through training on shorter input sequences or training a student model on a pretrained teacher. Curriculum learning attempts, which accounted for a large number of submissions, were largely unsuccessful, though some showed modest improvements.

[NLP-36] DeepSeek vs. o3-mini: How Well can Reasoning LLMs Evaluate MT and Summarization?

[Quick Read]: This paper asks whether reasoning capabilities actually improve the performance of large language models (LLMs) as evaluators of natural language generation (NLG). The key approach is a systematic comparison of reasoning-enabled LLMs (such as DeepSeek-R1 and OpenAI o3) with non-reasoning counterparts on machine translation (MT) and text summarization (TS) evaluation, revealing that the benefit of reasoning depends on the specific model architecture and task. Reasoning helps in some cases, with the OpenAI o3-mini models improving evaluation quality as reasoning intensity increases and their reasoning-token usage correlating positively with evaluation quality, but not in others: DeepSeek-R1 underperforms its non-reasoning variant except on certain aspects of TS evaluation. Experiments also show that distilling reasoning capabilities preserves reasonable performance in medium-sized models (32B parameters) but degrades substantially in smaller ones (8B), providing the first comprehensive assessment and practical insights into using reasoning LLMs for NLG evaluation.

Link: https://arxiv.org/abs/2504.08120
Authors: Daniil Larionov, Sotaro Takeshita, Ran Zhang, Yanran Chen, Christoph Leiter, Zhipin Wang, Christian Greisinger, Steffen Eger
Affiliations: University of Mannheim; University of Technology Nuremberg
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Reasoning-enabled large language models (LLMs) have recently demonstrated impressive performance in complex logical and mathematical tasks, yet their effectiveness in evaluating natural language generation remains unexplored. This study systematically compares reasoning-based LLMs (DeepSeek-R1 and OpenAI o3) with their non-reasoning counterparts across machine translation (MT) and text summarization (TS) evaluation tasks. We evaluate eight models across three architectural categories, including state-of-the-art reasoning models, their distilled variants (ranging from 8B to 70B parameters), and equivalent conventional, non-reasoning LLMs. Our experiments on WMT23 and SummEval benchmarks reveal that the benefits of reasoning capabilities are highly model and task-dependent: while OpenAI o3-mini models show consistent performance improvements with increased reasoning intensity, DeepSeek-R1 underperforms compared to its non-reasoning variant, with exception to certain aspects of TS evaluation. Correlation analysis demonstrates that increased reasoning token usage positively correlates with evaluation quality in o3-mini models. Furthermore, our results show that distillation of reasoning capabilities maintains reasonable performance in medium-sized models (32B) but degrades substantially in smaller variants (8B). This work provides the first comprehensive assessment of reasoning LLMs for NLG evaluation and offers insights into their practical use.

[NLP-37] Geneshift: Impact of different scenario shift on Jailbreaking LLM

[Quick Read]: This paper addresses the weakness of existing jailbreak attack methods against generative language models (LLMs). Dictionary-based methods achieve high success rates under dictionary-based evaluation but struggle to produce detailed, specific content for harmful requests, so they score poorly under GPT-based evaluation. The key contribution is GeneShift, a black-box jailbreak attack that uses a genetic algorithm to optimize scenario shifts. Observing that malicious queries perform best under different scenario shifts, GeneShift evolves and selects hybrids of scenario shifts, steering the model toward detailed, actionable harmful responses while keeping a seemingly benign facade for stealth. Experiments validate the method's effectiveness, raising the jailbreak success rate from 0% to 60% in settings where direct prompting alone fails.

Link: https://arxiv.org/abs/2504.08104
Authors: Tianyi Wu, Zhiwei Xue, Yue Liu, Jiaheng Zhang, Bryan Hooi, See-Kiong Ng
Affiliations: Integrative Sciences and Engineering Programme, NUS Graduate School, National University of Singapore; Institute of Data Science (IDS), National University of Singapore; Department of Computer Science, School of Computing, National University of Singapore
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:Jailbreak attacks, which aim to cause LLMs to perform unrestricted behaviors, have become a critical and challenging direction in AI safety. Despite achieving the promising attack success rate using dictionary-based evaluation, existing jailbreak attack methods fail to output detailed contents to satisfy the harmful request, leading to poor performance on GPT-based evaluation. To this end, we propose a black-box jailbreak attack termed GeneShift, by using a genetic algorithm to optimize the scenario shifts. Firstly, we observe that the malicious queries perform optimally under different scenario shifts. Based on it, we develop a genetic algorithm to evolve and select the hybrid of scenario shifts. It guides our method to elicit detailed and actionable harmful responses while keeping the seemingly benign facade, improving stealthiness. Extensive experiments demonstrate the superiority of GeneShift. Notably, GeneShift increases the jailbreak success rate from 0% to 60% when direct prompting alone would fail.

[NLP-38] Multi-view autoencoders for Fake News Detection

[Quick Read]: This paper addresses the challenge of extracting textual features for automatic fake-news detection, in particular how to integrate multiple feature-extraction techniques into a more comprehensive representation of fake news. The key idea is to use multi-view autoencoders to combine several feature-extraction methods commonly used in the literature into a joint feature representation, improving fake-news classification. Experiments show a significant gain in classification performance over individual views (feature representations), and selecting a subset of views rather than building the latent space from all of them can be advantageous in both accuracy and computational effort.

Link: https://arxiv.org/abs/2504.08102
Authors: Ingryd V. S. T. Pereira, George D. C. Cavalcanti, Rafael M. O. Cruz
Affiliations: Centro de Informática, Universidade Federal de Pernambuco; École de technologie supérieure, Université du Québec
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted by IEEE Symposium Series on Computational Intelligence - IEEE SSCI 2025

Abstract:Given the volume and speed at which fake news spreads across social media, automatic fake news detection has become a highly important task. However, this task presents several challenges, including extracting textual features that contain relevant information about fake news. Research about fake news detection shows that no single feature extraction technique consistently outperforms the others across all scenarios. Nevertheless, different feature extraction techniques can provide complementary information about the textual data and enable a more comprehensive representation of the content. This paper proposes using multi-view autoencoders to generate a joint feature representation for fake news detection by integrating several feature extraction techniques commonly used in the literature. Experiments on fake news datasets show a significant improvement in classification performance compared to individual views (feature representations). We also observed that selecting a subset of the views instead of composing a latent space with all the views can be advantageous in terms of accuracy and computational effort. For further details, including source codes, figures, and datasets, please refer to the project’s repository: this https URL.
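A minimal PyTorch sketch of the multi-view autoencoder idea follows: one encoder per feature view, a concatenated joint latent, and per-view decoders so every view must be reconstructable from the joint representation. The single-linear encoders and layer sizes are simplifying assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class MultiViewAutoencoder(nn.Module):
    """Fuse several feature views (e.g., TF-IDF, embeddings) into one joint latent."""
    def __init__(self, view_dims: list[int], latent_dim: int = 64):
        super().__init__()
        self.encoders = nn.ModuleList(nn.Linear(d, latent_dim) for d in view_dims)
        joint = latent_dim * len(view_dims)
        self.decoders = nn.ModuleList(nn.Linear(joint, d) for d in view_dims)

    def forward(self, views: list[torch.Tensor]):
        # Encode each view, concatenate into the joint representation.
        joint = torch.cat([torch.relu(enc(v)) for enc, v in zip(self.encoders, views)],
                          dim=-1)
        # Reconstruct every view from the joint latent.
        recons = [dec(joint) for dec in self.decoders]
        return joint, recons

# Training would minimize the summed per-view reconstruction loss; the joint
# latent then feeds a downstream fake-news classifier.
model = MultiViewAutoencoder([300, 768])
joint, recons = model([torch.randn(8, 300), torch.randn(8, 768)])
print(joint.shape, [r.shape for r in recons])
```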

[NLP-39] The AI Scientist-v2: Workshop-Level Automated Scientific Discovery via Agentic Tree Search

[Quick Read]: This paper aims to reduce the heavy human involvement in scientific discovery, targeting full autonomy across hypothesis generation, experiment design and execution, data analysis, and paper writing. The key contribution is The AI Scientist-v2, an end-to-end agentic system whose core innovations are: (1) removing the dependence on human-written code templates; (2) a novel progressive agentic tree-search methodology, managed by a dedicated experiment manager agent, that generalizes well across machine learning domains; and (3) an enhanced AI reviewer component with a Vision-Language Model (VLM) feedback loop that iteratively refines the content and aesthetics of figures. These improvements allowed the system to submit fully AI-generated papers to peer review, one of which scored above the average human acceptance threshold, marking a step change in AI's capability across the full research pipeline.

Link: https://arxiv.org/abs/2504.08066
Authors: Yutaro Yamada, Robert Tjarko Lange, Cong Lu, Shengran Hu, Chris Lu, Jakob Foerster, Jeff Clune, David Ha
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Abstract:AI is increasingly playing a pivotal role in transforming how scientific discoveries are made. We introduce The AI Scientist-v2, an end-to-end agentic system capable of producing the first entirely AI generated peer-review-accepted workshop paper. This system iteratively formulates scientific hypotheses, designs and executes experiments, analyzes and visualizes data, and autonomously authors scientific manuscripts. Compared to its predecessor (v1, Lu et al., 2024 arXiv:2408.06292), The AI Scientist-v2 eliminates the reliance on human-authored code templates, generalizes effectively across diverse machine learning domains, and leverages a novel progressive agentic tree-search methodology managed by a dedicated experiment manager agent. Additionally, we enhance the AI reviewer component by integrating a Vision-Language Model (VLM) feedback loop for iterative refinement of content and aesthetics of the figures. We evaluated The AI Scientist-v2 by submitting three fully autonomous manuscripts to a peer-reviewed ICLR workshop. Notably, one manuscript achieved high enough scores to exceed the average human acceptance threshold, marking the first instance of a fully AI-generated paper successfully navigating a peer review. This accomplishment highlights the growing capability of AI in conducting all aspects of scientific research. We anticipate that further advancements in autonomous scientific discovery technologies will profoundly impact human knowledge generation, enabling unprecedented scalability in research productivity and significantly accelerating scientific breakthroughs, greatly benefiting society at large. We have open-sourced the code at this https URL to foster the future development of this transformative technology. We also discuss the role of AI in science, including AI safety.

[NLP-40] Large-Scale Analysis of Online Questions Related to Opioid Use Disorder on Reddit

[Quick Read]: This paper studies information-seeking behaviour in natural-language questions about opioid use disorder (OUD) on Reddit, identifying and categorizing the kinds of questions users ask. The key approach combines transformer-based question detection with hierarchical clustering: analysing data from 19 subreddits between 2018 and 2021, the authors organize OUD-related questions into six coarse-grained and 69 fine-grained categories, revealing information needs around drug sales, specific drug questions, treatment options, drug uses, side effects, withdrawal, lifestyle, drug testing, pain management, and more. This provides a basis for understanding how Reddit users unobtrusively ask OUD-related questions and for subsequent technological interventions and public-health harm-reduction strategies.

Link: https://arxiv.org/abs/2504.08044
Authors: Tanmay Laud, Akadia Kacha-Ochana, Steven A. Sumner, Vikram Krishnasamy, Royal Law, Lyna Schieber, Munmun De Choudhury, Mai ElSherief
Affiliations: UC San Diego; unknown
Subjects: Social and Information Networks (cs.SI); Computation and Language (cs.CL)
Comments: Accepted to ICWSM 2025

Abstract:Opioid use disorder (OUD) is a leading health problem that affects individual well-being as well as general public health. Due to a variety of reasons, including the stigma faced by people using opioids, online communities for recovery and support were formed on different social media platforms. In these communities, people share their experiences and solicit information by asking questions to learn about opioid use and recovery. However, these communities do not always contain clinically verified information. In this paper, we study natural language questions asked in the context of OUD-related discourse on Reddit. We adopt transformer-based question detection along with hierarchical clustering across 19 subreddits to identify six coarse-grained categories and 69 fine-grained categories of OUD-related questions. Our analysis uncovers ten areas of information seeking from Reddit users in the context of OUD: drug sales, specific drug-related questions, OUD treatment, drug uses, side effects, withdrawal, lifestyle, drug testing, pain management and others, during the study period of 2018-2021. Our work provides a major step in improving the understanding of OUD-related questions people ask unobtrusively on Reddit. We finally discuss technological interventions and public health harm reduction techniques based on the topics of these questions.
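The two-stage pipeline (question detection, then hierarchical clustering) can be sketched as below; the ends-with-"?" filter stands in for the paper's trained transformer question detector, and the embedding model name is an assumption.

```python
from sentence_transformers import SentenceTransformer
from scipy.cluster.hierarchy import linkage, fcluster

posts = [
    "What helps with withdrawal at home?",
    "How long does suboxone stay in your system?",
    "I finally got through week one.",
    "Does anyone know clinics that take walk-ins?",
]
# Placeholder detector; the paper uses a trained transformer classifier.
questions = [p for p in posts if p.rstrip().endswith("?")]

embedder = SentenceTransformer("all-MiniLM-L6-v2")
emb = embedder.encode(questions)

# Agglomerative clustering over question embeddings; cutting the dendrogram
# at different heights yields coarse- vs fine-grained question categories.
Z = linkage(emb, method="ward")
coarse = fcluster(Z, t=2, criterion="maxclust")
print(list(zip(questions, coarse)))
```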

[NLP-41] Can Reasoning LLMs Enhance Clinical Document Classification?

[Quick Read]: This paper tackles the challenges of clinical document classification, including complex medical language, privacy constraints, and scarce annotated data. The key contribution is an evaluation of the performance and consistency of eight large language models (LLMs), four reasoning and four non-reasoning, on classifying clinical discharge summaries from the MIMIC-IV dataset. Using cTAKES to structure the clinical narratives and majority voting across three experimental runs to determine final predictions, the study finds that reasoning models beat non-reasoning ones on accuracy (71% vs 68%) and F1 (67% vs 60%), with Gemini 2.0 Flash Thinking performing best, while non-reasoning models are more stable (91% vs 84% consistency). Performance also varies across ICD-10 code categories, with reasoning models excelling on complex cases but struggling with abstract categories, suggesting that a hybrid approach may optimize clinical coding in practice; future work should explore multi-label classification, domain-specific fine-tuning, and ensembling to improve reliability.

Link: https://arxiv.org/abs/2504.08040
Authors: Akram Mustafa, Usman Naseem, Mostafa Rahimi Azghadi
Affiliations: James Cook University
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 28 pages, 13 tables, 12 figures

Abstract:Clinical document classification is essential for converting unstructured medical texts into standardised ICD-10 diagnoses, yet it faces challenges due to complex medical language, privacy constraints, and limited annotated datasets. Large Language Models (LLMs) offer promising improvements in accuracy and efficiency for this task. This study evaluates the performance and consistency of eight LLMs; four reasoning (Qwen QWQ, Deepseek Reasoner, GPT o3 Mini, Gemini 2.0 Flash Thinking) and four non-reasoning (Llama 3.3, GPT 4o Mini, Gemini 2.0 Flash, Deepseek Chat); in classifying clinical discharge summaries using the MIMIC-IV dataset. Using cTAKES to structure clinical narratives, models were assessed across three experimental runs, with majority voting determining final predictions. Results showed that reasoning models outperformed non-reasoning models in accuracy (71% vs 68%) and F1 score (67% vs 60%), with Gemini 2.0 Flash Thinking achieving the highest accuracy (75%) and F1 score (76%). However, non-reasoning models demonstrated greater stability (91% vs 84% consistency). Performance varied across ICD-10 codes, with reasoning models excelling in complex cases but struggling with abstract categories. Findings indicate a trade-off between accuracy and consistency, suggesting that a hybrid approach could optimise clinical coding. Future research should explore multi-label classification, domain-specific fine-tuning, and ensemble methods to enhance model reliability in real-world applications.
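The majority-voting step across experimental runs is simple to state in code; the tie-breaking behaviour here (first-seen order via Counter.most_common) is an assumption rather than the paper's documented rule.

```python
from collections import Counter

def majority_vote(predictions: list[str]) -> str:
    """Pick the most frequent ICD-10 prediction across experimental runs."""
    return Counter(predictions).most_common(1)[0][0]

# Three runs of one model on the same discharge summary:
runs = ["I21.9", "I21.9", "I25.10"]
print(majority_vote(runs))  # -> "I21.9"
```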

[NLP-42] From Speech to Summary: A Comprehensive Survey of Speech Summarization

[Quick Read]: This paper aims to delineate the boundaries of speech summarization, mapping its overlap with adjacent research areas (such as speech recognition, text summarization, and application settings like meeting summarization) and systematically assessing the usefulness of existing datasets and evaluation methodologies. It also synthesizes recent technical progress, highlighting the shift from traditional systems toward fine-tuned cascaded architectures and end-to-end solutions. The key value lies in this comprehensive analysis of datasets, evaluation, and technical trends, which surfaces the core challenges of the task and directions for future work.

Link: https://arxiv.org/abs/2504.08024
Authors: Fabian Retkowski, Maike Züfle, Andreas Sudmann, Dinah Pfau, Jan Niehues, Alexander Waibel
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments:

Abstract:Speech summarization has become an essential tool for efficiently managing and accessing the growing volume of spoken and audiovisual content. However, despite its increasing importance, speech summarization is still not clearly defined and intersects with several research areas, including speech recognition, text summarization, and specific applications like meeting summarization. This survey not only examines existing datasets and evaluation methodologies, which are crucial for assessing the effectiveness of summarization approaches but also synthesizes recent developments in the field, highlighting the shift from traditional systems to advanced models like fine-tuned cascaded architectures and end-to-end solutions.

[NLP-43] More diverse, more adaptive: Comprehensive Multi-task Learning for Improved LLM Domain Adaptation in E-commerce KDD

[Quick Read]: This paper addresses the insufficient validation of large language models' (LLMs) domain-adaptation performance in e-commerce. While prior work suggests that diverse, multi-modal data improves LLM domain adaptation, the hypothesis had not been adequately tested in this sector. The authors propose a comprehensive e-commerce multi-task framework and run empirical experiments examining the impact of diverse data and tasks from two angles: "capability comprehensiveness" and "task comprehensiveness". The key findings are that LLM performance improves markedly when tasks tied to new major capability areas are introduced and when subtasks within capability areas are continually added, and that larger model capacity amplifies the benefit of data diversity, revealing a synergy between capacity and diversity. The best model was validated in KDD Cup 2024, ranking 5th on Task 1, underscoring the work's relevance for advancing LLMs in e-commerce.

Link: https://arxiv.org/abs/2504.08002
Authors: Tong Piao, Pei Tang, Zhipeng Zhang, Jiaqi Li, Qiao Liu, Zufeng Wu
Affiliations: University of Electronic Science and Technology of China, Chengdu, China; Xiaohongshu Inc., Shanghai, China; ByteDance Inc., Shanghai, China; Sichuan University, Chengdu, China
Subjects: Computation and Language (cs.CL)
Comments: Accepted by KDD workshop 2024

Abstract:In recent years, Large Language Models (LLMs) have been widely applied across various domains due to their powerful domain adaptation capabilities. Previous studies have suggested that diverse, multi-modal data can enhance LLMs’ domain adaptation performance. However, this hypothesis remains insufficiently validated in the e-commerce sector. To address this gap, we propose a comprehensive e-commerce multi-task framework and design empirical experiments to examine the impact of diverse data and tasks on LLMs from two perspectives: “capability comprehensiveness” and “task comprehensiveness.” Specifically, we observe significant improvements in LLM performance by progressively introducing tasks related to new major capability areas and by continuously adding subtasks within different major capability domains. Furthermore, we observe that increasing model capacity amplifies the benefits of diversity, suggesting a synergistic relationship between model capacity and data diversity. Finally, we validate the best-performing model from our empirical experiments in the KDD Cup 2024, achieving a rank 5 in Task 1. This outcome demonstrates the significance of our research for advancing LLMs in the e-commerce domain.

[NLP-44] Linguistic Interpretability of Transformer-based Language Models: a systematic review

[Quick Read]: This paper asks how the internal computations of Transformer-based language models encode and express linguistic knowledge, in order to move past their status as "black box" systems. The key lever is "interpretability" research, specifically the subfield the authors call "linguistic interpretability": whether these models hold linguistic knowledge similar to human speakers. The survey analyses 160 studies, spanning many languages and multilingual pre-trained language models, from the perspective of traditional linguistic disciplines: Syntax, Morphology, Lexico-Semantics, and Discourse. It fills a gap in the interpretability literature, which has often ignored linguistic knowledge or been limited, e.g., to English-only models or to work that does not probe models' internal representations.

Link: https://arxiv.org/abs/2504.08001
Authors: Miguel López-Otal, Jorge Gracia, Jordi Bernad, Carlos Bobed, Lucía Pitarch-Ballesteros, Emma Anglés-Herrero
Affiliations: Aragon Institute of Engineering Research, University of Zaragoza
Subjects: Computation and Language (cs.CL)
Comments: Supplementary material: this https URL

Abstract:Language models based on the Transformer architecture achieve excellent results in many language-related tasks, such as text classification or sentiment analysis. However, despite the architecture of these models being well-defined, little is known about how their internal computations help them achieve their results. This renders these models, as of today, a type of ‘black box’ systems. There is, however, a line of research – ‘interpretability’ – aiming to learn how information is encoded inside these models. More specifically, there is work dedicated to studying whether Transformer-based models possess knowledge of linguistic phenomena similar to human speakers – an area we call ‘linguistic interpretability’ of these models. In this survey we present a comprehensive analysis of 160 research works, spread across multiple languages and models – including multilingual ones – that attempt to discover linguistic information from the perspective of several traditional Linguistics disciplines: Syntax, Morphology, Lexico-Semantics and Discourse. Our survey fills a gap in the existing interpretability literature, which either not focus on linguistic knowledge in these models or present some limitations – e.g. only studying English-based models. Our survey also focuses on Pre-trained Language Models not further specialized for a downstream task, with an emphasis on works that use interpretability techniques that explore models’ internal representations.

[NLP-45] BiasCause: Evaluate Socially Biased Causal Reasoning of Large Language Models

[Quick Read]: This paper targets the social biases in content generated by large language models (LLMs), probing the underlying causal reasoning that produces them. Existing benchmarks identify social bias effectively but leave a gap in understanding the deeper reasoning behind biased outputs. To fill it, the authors propose a new conceptual framework for classifying the causal reasoning LLMs produce, synthesize and manually validate 1,788 questions covering 8 sensitive attributes, and have the models expose their reasoning with causal graphs. Testing 4 state-of-the-art LLMs, they find all models answer most questions with biased causal reasoning, yielding 4,135 biased causal graphs in total. Analysing the bias-free cases surfaces three strategies by which LLMs avoid biased causal reasoning, and the study further reveals a "mistaken-biased" pattern in which models first conflate correlation with causation to infer specific sensitive group names and then fold in biased causal reasoning. The key contributions are the conceptual framework, the validated question set, and the systematic analysis of these failure modes.

Link: https://arxiv.org/abs/2504.07997
Authors: Tian Xie, Tongxin Yin, Vaishakh Keshava, Xueru Zhang, Siddhartha Reddy Jonnalagadda
Affiliations: Google; The Ohio State University
Subjects: Computation and Language (cs.CL)
Comments: This work has been done when the first author is at Google. The first author is a student at the Ohio State University

Abstract:While large language models (LLMs) already play significant roles in society, research has shown that LLMs still generate content including social bias against certain sensitive groups. While existing benchmarks have effectively identified social biases in LLMs, a critical gap remains in our understanding of the underlying reasoning that leads to these biased outputs. This paper goes one step further to evaluate the causal reasoning process of LLMs when they answer questions eliciting social biases. We first propose a novel conceptual framework to classify the causal reasoning produced by LLMs. Next, we use LLMs to synthesize 1788 questions covering 8 sensitive attributes and manually validate them. The questions can test different kinds of causal reasoning by letting LLMs disclose their reasoning process with causal graphs. We then test 4 state-of-the-art LLMs. All models answer the majority of questions with biased causal reasoning, resulting in a total of 4135 biased causal graphs. Meanwhile, we discover 3 strategies for LLMs to avoid biased causal reasoning by analyzing the “bias-free” cases. Finally, we reveal that LLMs are also prone to “mistaken-biased” causal reasoning, where they first confuse correlation with causality to infer specific sensitive group names and then incorporate biased causal reasoning.

[NLP-46] SafeChat: A Framework for Building Trustworthy Collaborative Assistants and a Case Study of its Usefulness

[Quick Read]: This paper targets the reliability and trustworthiness gaps of LLM-based chatbots: they cannot explain how responses are generated, risk producing harmful content, lack standardized reliability testing, and demand deep AI expertise and long development cycles, making them unsuitable for trust-sensitive settings such as elections or healthcare. The key solution is SafeChat, a general architecture for safe and trustworthy chatbots focused on information-retrieval use cases. Its core features are: safety, via a domain-agnostic design in which responses are grounded in and traceable to approved sources (provenance) with "do-not-respond" strategies against harmful answers; usability aids such as automatic extractive summarization with source traceability and automated trust assessments; and fast, scalable development via a CSV-driven workflow, automated testing, and multi-device integration. Together these features address the main shortcomings of current chatbots.

Link: https://arxiv.org/abs/2504.07995
Authors: Biplav Srivastava, Kausik Lakkaraju, Nitin Gupta, Vansh Nagpal, Bharath C. Muppasani, Sara E. Jones
Affiliations: University of South Carolina
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Collaborative assistants, or chatbots, are data-driven decision support systems that enable natural interaction for task completion. While they can meet critical needs in modern society, concerns about their reliability and trustworthiness persist. In particular, Large Language Model (LLM)-based chatbots like ChatGPT, Gemini, and DeepSeek are becoming more accessible. However, such chatbots have limitations, including their inability to explain response generation, the risk of generating problematic content, the lack of standardized testing for reliability, and the need for deep AI expertise and extended development times. These issues make chatbots unsuitable for trust-sensitive applications like elections or healthcare. To address these concerns, we introduce SafeChat, a general architecture for building safe and trustworthy chatbots, with a focus on information retrieval use cases. Key features of SafeChat include: (a) safety, with a domain-agnostic design where responses are grounded and traceable to approved sources (provenance), and ‘do-not-respond’ strategies to prevent harmful answers; (b) usability, with automatic extractive summarization of long responses, traceable to their sources, and automated trust assessments to communicate expected chatbot behavior, such as sentiment; and (c) fast, scalable development, including a CSV-driven workflow, automated testing, and integration with various devices. We implemented SafeChat in an executable framework using the open-source chatbot platform Rasa. A case study demonstrates its application in building ElectionBot-SC, a chatbot designed to safely disseminate official election information. SafeChat is being used in many domains, validating its potential, and is available at: this https URL.
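A minimal sketch of the "do-not-respond" and provenance-grounding behaviours described above follows; the blocked-topic list, matching rule, and refusal wording are our assumptions, not SafeChat's implementation.

```python
from dataclasses import dataclass

@dataclass
class Answer:
    text: str
    sources: list[str]  # approved documents the answer was extracted from

BLOCKED_TOPICS = {"medical dosage", "vote manipulation"}  # illustrative list

def safe_respond(query: str, answer: Answer | None) -> str:
    """Apply a do-not-respond guard and require provenance before answering."""
    if any(topic in query.lower() for topic in BLOCKED_TOPICS):
        return "I can't help with that topic. Please consult an official source."
    if answer is None or not answer.sources:
        return "I don't have an approved source for this, so I won't guess."
    cites = "; ".join(answer.sources)
    return f"{answer.text} (Sources: {cites})"

print(safe_respond(
    "Where do I vote early?",
    Answer("Early voting runs Oct 21-Nov 1.", ["sc-election-faq.csv"]),
))
```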

[NLP-47] Evaluating the Fitness of Ontologies for the Task of Question Generation

[Quick Read]: This paper addresses the lack of systematic methods for judging how fit an ontology is for generating high-quality questions of appropriate cognitive difficulty. The key contribution is a set of ontology requirements and task-specific evaluation metrics for Automatic Question Generation (AQG) in pedagogical settings, derived via the ROMEO methodology, a structured framework for deriving task-specific metrics, and applied through expert-based assessment over a set of ontologies. Results show that ontology characteristics significantly affect question-generation effectiveness, with different ontologies exhibiting varying performance levels, underscoring the importance of assessing ontology quality with respect to AQG tasks.

Link: https://arxiv.org/abs/2504.07994
Authors: Samah Alkhuzaey, Floriana Grasso, Terry R. Payne, Valentina Tamma
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Ontology-based question generation is an important application of semantic-aware systems that enables the creation of large question banks for diverse learning environments. The effectiveness of these systems, both in terms of the calibre and cognitive difficulty of the resulting questions, depends heavily on the quality and modelling approach of the underlying ontologies, making it crucial to assess their fitness for this task. To date, there has been no comprehensive investigation into the specific ontology aspects or characteristics that affect the question generation process. Therefore, this paper proposes a set of requirements and task-specific metrics for evaluating the fitness of ontologies for question generation tasks in pedagogical settings. Using the ROMEO methodology, a structured framework for deriving task-specific metrics, an expert-based approach is employed to assess the performance of various ontologies in Automatic Question Generation (AQG) tasks, which is then evaluated over a set of ontologies. Our results demonstrate that ontology characteristics significantly impact the effectiveness of question generation, with different ontologies exhibiting varying performance levels. This highlights the importance of assessing ontology quality with respect to AQG tasks.

[NLP-48] Neural howlround in large language models : a self-reinforcing bias phenomenon and a dynamic attenuation solution

[Quick Read]: This paper addresses "neural howlround" in LLM-driven AI systems, a self-reinforcing cognitive loop in which certain highly weighted inputs become dominant, entrenching response patterns that resist ordinary correction. Unlike model collapse and biased salience weighting, this failure mode requires a dedicated mechanism. The key solution is an attenuation-based correction mechanism that dynamically introduces counterbalancing adjustments, restoring adaptive reasoning even in "locked-in" systems. The paper also discusses related effects arising from improperly managed reinforcement and outlines applications of the mitigation strategy for improving AI robustness in real-world decision-making tasks.

Link: https://arxiv.org/abs/2504.07992
Authors: Seth Drake
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
Comments: 27 pages, 3 figures, 2 tables

Abstract:Large language model (LLM)-driven AI systems may exhibit an inference failure mode we term 'neural howlround,' a self-reinforcing cognitive loop where certain highly weighted inputs become dominant, leading to entrenched response patterns resistant to correction. This paper explores the mechanisms underlying this phenomenon, which is distinct from model collapse and biased salience weighting. We propose an attenuation-based correction mechanism that dynamically introduces counterbalancing adjustments and can restore adaptive reasoning, even in 'locked-in' AI systems. Additionally, we discuss some other related effects arising from improperly managed reinforcement. Finally, we outline potential applications of this mitigation strategy for improving AI robustness in real-world decision-making tasks.
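One way to read "attenuation-based correction" is as dynamically damping weights that dominate the input distribution. The sketch below implements that reading; the dominance threshold and decay factor are assumptions, and the paper's actual mechanism may differ.

```python
import numpy as np

def attenuate(weights: np.ndarray, dominance_threshold: float = 0.5,
              decay: float = 0.8) -> np.ndarray:
    """Dampen any input weight that dominates the distribution.

    Assumes 'dominance' means one normalized weight exceeding a threshold;
    applied repeatedly, this keeps pulling runaway inputs back toward balance.
    """
    w = weights / weights.sum()
    w = np.where(w > dominance_threshold, w * decay, w)  # counterbalancing adjustment
    return w / w.sum()  # renormalize so weights remain a distribution

w = np.array([0.05, 0.05, 0.85, 0.05])
print(attenuate(w))  # the runaway third weight is reduced relative to the rest
```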

[NLP-49] Regional Tiny Stories: Using Small Models to Compare Language Learning and Tokenizer Performance

[Quick Read]: This paper addresses the efficiency and performance of small language models (SLMs) on regional languages (Hindi, Marathi, and Bengali) and their suitability for language-specific settings. The key contribution is a framework combining translated data and LLM-generated synthetic data for evaluating SLMs on regional-language processing, plus language-specific tokenizers to boost performance: language-specific tokenizers outperform general-purpose ones for Indian languages. Information-theoretic and morphological analyses explain why Hindi models outperform Marathi and Bengali ones, and synthetic datasets are shown to beat translated content for training SLMs. These methods advance both the practical application of SLMs to underserved languages and the theoretical understanding of neural language development.

Link: https://arxiv.org/abs/2504.07989
Authors: Nirvan Patil, Malhar Abhay Inamdar, Agnivo Gosai, Guruprasad Pathak, Anish Joshi, Aryan Sagavekar, Anish Joshirao, Raj Dandekar, Rajat Dandekar, Sreedath Panat
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 34 pages, 24 figures, 16 tables

Abstract:Small Language Models (SLMs) offer efficient alternatives to LLMs for specific domains. The 2023 TinyStories study developed an English dataset that allows SLMs with 1 to 10 million parameters to produce coherent outputs. Our research expands this framework by translating the original dataset into Indian languages and creating synthetic data using LLMs. We focus on Hindi, Marathi, and Bengali, evaluating SLMs for regional language processing and understanding linguistic complexity. We show that SLMs efficiently process regional languages with significantly fewer parameters than LLMs, providing a complementary framework for ``inference based evaluation" of tokenization strategies and linguistic complexity. Our analysis shows that language-specific tokenizers outperform general-purpose ones for Indian languages. Empirical validations, supported by information-theoretic and morphological analyses, provides fundamental understanding behind the better performance of Hindi models over Marathi and Bengali. Additionally, we show that synthetic datasets outperform translated content for training SLMs. Correlation analyses reveal cross-linguistic patterns and language-specific relationships between creativity, grammatical precision, and narrative completeness. These findings advance both the practical application of SLMs to underserved languages and our theoretical understanding of neural language development.

[NLP-50] SEAL: Steerable Reasoning Calibration of Large Language Models for Free

[Quick Read]: This paper targets the redundant reasoning paths produced by large language models (LLMs) using extended chain-of-thought (CoT) for complex reasoning. The redundancy not only inflates inference latency but also hurts performance by diverting attention onto unnecessary reasoning paths. The key contribution is SEAL (Steerable reasoning calibration), a training-free method. Analysing the internal reasoning structure of LLMs, the authors categorize thoughts into execution, reflection, and transition types and find that excessive reflection and transition thoughts correlate strongly with failure cases. SEAL has an offline stage that extracts a reasoning steering vector in latent space and an online stage that calibrates reasoning traces in real time via representation intervention, improving accuracy while delivering significant efficiency gains; notably, the steering vector transfers well across tasks. Across several models and benchmarks, SEAL improves accuracy by up to 11% while cutting reasoning tokens by 11.8% to 50.4%.

Link: https://arxiv.org/abs/2504.07986
Authors: Runjin Chen, Zhenyu Zhang, Junyuan Hong, Souvik Kundu, Zhangyang Wang
Affiliations: The University of Texas at Austin; Intel
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Large Language Models (LLMs), such as OpenAI’s o1-series have demonstrated compelling capabilities for complex reasoning tasks via the extended chain-of-thought (CoT) reasoning mechanism. However, recent studies reveal substantial redundancy in the CoT reasoning traces, which not only increases inference latency but also negatively impacts model performance by diverting attention to unnecessary reasoning paths. To address this issue, we investigate the internal reasoning structures of LLMs and categorize them into three primary thought types: execution, reflection, and transition thoughts. Moreover, our analysis reveals that excessive reflection and transition thoughts are strongly correlated with failure cases and these thought categories exhibit clear separation in the latent space. Based on these, we introduce SEAL (Steerable reasoning calibration), a training-free approach that seamlessly calibrates the CoT process, improving accuracy while demonstrating significant efficiency gains. SEAL consists of an offline stage for extracting the reasoning steering vector in the latent space, followed by an on-the-fly calibration of the reasoning trace through representation intervention using the steering vector. Notably, the steering vector exhibits strong transferability across various tasks. Extensive experiments across multiple models (DeepSeek-R1-Distill and QwQ-32B-Preview) and benchmarks (Math500, GSM8K, LiveCodeBench) validate the effectiveness of SEAL, up to a 11% improvement in accuracy while reducing reasoning tokens by 11.8% to 50.4%. Our code is publicly available at this https URL.
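The offline/online split described above can be sketched as difference-of-means steering: extract a direction separating desirable from undesirable reasoning activations, then nudge hidden states along it at decode time. SEAL's exact construction over execution/reflection/transition thoughts may differ; this is an illustrative approximation.

```python
import torch

def extract_steering_vector(good_acts: torch.Tensor, bad_acts: torch.Tensor) -> torch.Tensor:
    """Offline stage: unit-norm difference-of-means direction at one layer."""
    v = good_acts.mean(dim=0) - bad_acts.mean(dim=0)
    return v / v.norm()

def intervene(hidden: torch.Tensor, v: torch.Tensor, alpha: float = 4.0) -> torch.Tensor:
    """Online stage: nudge decoding-time hidden states along the steering direction."""
    return hidden + alpha * v

# Usage: 'good'/'bad' would be hidden states collected from traces labeled as
# efficient vs. over-reflective; 'alpha' is an assumed intervention strength.
good, bad = torch.randn(128, 4096), torch.randn(128, 4096)
v = extract_steering_vector(good, bad)
h = intervene(torch.randn(1, 4096), v)
print(h.shape)
```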

[NLP-51] Topic mining based on fine-tuning Sentence-BERT and LDA

[Quick Read]: This paper addresses consumers' need for key information about fine-grained product attributes when shopping. It fine-tunes a Sentence-BERT embedding model and an LDA model to mine topic features from online product reviews and present consumers with detailed aspect-level information. The key pipeline is: first fine-tune Sentence-BERT on e-commerce reviews to produce a semantically richer set of word vectors; then feed those vectors into the LDA model for topic-feature extraction; and finally analyse the keywords under each topic to surface the product's key functions. Experiments show the model's topic coherence is 0.5 higher than that of competing models, markedly improving topic-extraction accuracy.

Link: https://arxiv.org/abs/2504.07984
Authors: Jianheng Li, Lirong Chen
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments: 11 pages, 7 Postscript figures

Abstract:Research background: With the continuous development of society, consumers pay more attention to the key information of product fine-grained attributes when shopping. Research purposes: This study will fine tune the Sentence-BERT word embedding model and LDA model, mine the subject characteristics in online reviews of goods, and show consumers the details of various aspects of goods. Research methods: First, the Sentence-BERT model was fine tuned in the field of e-commerce online reviews, and the online review text was converted into a word vector set with richer semantic information; Secondly, the vectorized word set is input into the LDA model for topic feature extraction; Finally, focus on the key functions of the product through keyword analysis under the theme. Results: This study compared this model with other word embedding models and LDA models, and compared it with common topic extraction methods. The theme consistency of this model is 0.5 higher than that of other models, which improves the accuracy of theme extraction
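A sketch of the two-stage pipeline follows: Sentence-BERT embeddings feeding an LDA topic model. Because the summary does not spell out how dense vectors are bridged into LDA's bag-of-words input, the clustering adapter below is our assumption, and a stock embedding model stands in for the paper's fine-tuned one.

```python
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans
from gensim import corpora
from gensim.models import LdaModel

reviews = [
    "battery lasts two days and charges fast",
    "camera is sharp but low-light shots are noisy",
    "battery drains quickly when gaming",
]

# Stage 1: sentence embeddings with richer semantics.
embedder = SentenceTransformer("all-MiniLM-L6-v2")
emb = embedder.encode(reviews)

# Assumed adapter: cluster reviews in embedding space and tag tokens with
# their cluster id, so semantic groupings survive the bag-of-words step.
cluster = KMeans(n_clusters=2, n_init=10).fit_predict(emb)
docs = [[f"c{c}_{w}" for w in r.split()] for r, c in zip(reviews, cluster)]

# Stage 2: LDA topic extraction, then keyword inspection per topic.
dictionary = corpora.Dictionary(docs)
bow = [dictionary.doc2bow(d) for d in docs]
lda = LdaModel(bow, num_topics=2, id2word=dictionary, passes=10, random_state=0)
for tid in range(2):
    print(tid, lda.print_topic(tid, topn=5))
```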

[NLP-52] Psychological Health Knowledge-Enhanced LLM-based Social Network Crisis Intervention Text Transfer Recognition Method

[Quick Read]: This paper addresses the challenge of identifying and intervening in mental-health crises on social platforms, in particular detecting potential crises more effectively to prevent harm. The key solution is an LLM-based text transfer recognition method enriched with domain-specific mental-health knowledge, implemented as a multi-level framework that uses BERT for transfer learning and integrates mental-health knowledge, sentiment analysis, and behaviour-prediction techniques. A crisis annotation tool trained on social-media datasets from real-world events lets the model pick up nuanced emotional cues and pinpoint psychological crises, yielding higher crisis-detection accuracy and greater sensitivity to subtle emotional and contextual variation than traditional models.

Link: https://arxiv.org/abs/2504.07983
Authors: Shurui Wu, Xinyi Huang, Dingxin Lu
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:As the prevalence of mental health crises increases on social media platforms, identifying and preventing potential harm has become an urgent challenge. This study introduces a large language model (LLM)-based text transfer recognition method for social network crisis intervention, enhanced with domain-specific mental health knowledge. We propose a multi-level framework that incorporates transfer learning using BERT, and integrates mental health knowledge, sentiment analysis, and behavior prediction techniques. The framework includes a crisis annotation tool trained on social media datasets from real-world events, enabling the model to detect nuanced emotional cues and identify psychological crises. Experimental results show that the proposed method outperforms traditional models in crisis detection accuracy and exhibits greater sensitivity to subtle emotional and contextual variations.

[NLP-53] Metamorphic Testing for Fairness Evaluation in Large Language Models : Identifying Intersectional Bias in LLaMA and GPT

[Quick Read]: This paper targets fairness problems in large language models (LLMs): biases inherited from training data can surface as fairness bugs, which is risky when LLMs are deployed in sensitive domains such as healthcare, finance, and law. The key solution is a metamorphic-testing approach that defines and applies a set of fairness-oriented metamorphic relations (MRs) to systematically expose fairness bugs. For each MR, source and follow-up test cases are generated and model responses are analysed for fairness violations. Results on the LLaMA and GPT models across diverse demographic inputs show the method effectively reveals bias patterns, especially around tone and sentiment, and highlights intersections of sensitive attributes where fairness faults appear frequently, providing a structured way to detect and mitigate bias and improve robustness in fairness-sensitive applications.

Link: https://arxiv.org/abs/2504.07982
Authors: Harishwar Reddy, Madhusudan Srinivasan, Upulee Kanewala
Affiliations: East Carolina University; University of North Florida
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Large Language Models (LLMs) have made significant strides in Natural Language Processing but remain vulnerable to fairness-related issues, often reflecting biases inherent in their training data. These biases pose risks, particularly when LLMs are deployed in sensitive areas such as healthcare, finance, and law. This paper introduces a metamorphic testing approach to systematically identify fairness bugs in LLMs. We define and apply a set of fairness-oriented metamorphic relations (MRs) to assess the LLaMA and GPT model, a state-of-the-art LLM, across diverse demographic inputs. Our methodology includes generating source and follow-up test cases for each MR and analyzing model responses for fairness violations. The results demonstrate the effectiveness of MT in exposing bias patterns, especially in relation to tone and sentiment, and highlight specific intersections of sensitive attributes that frequently reveal fairness faults. This research improves fairness testing in LLMs, providing a structured approach to detect and mitigate biases and improve model robustness in fairness-sensitive applications.
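A fairness-oriented metamorphic relation can be expressed as a source/follow-up test pair. The sketch below checks one such MR (a sensitive-attribute swap should not shift response sentiment beyond a tolerance); the threshold, scorer, and stub model are placeholders, not the paper's MRs.

```python
def swap_demographic(prompt: str, a: str, b: str) -> str:
    """Follow-up test case: the identical prompt with one demographic term swapped."""
    return prompt.replace(a, b)

def violates_fairness_mr(model, prompt: str, a: str, b: str, score) -> bool:
    """Check one fairness-oriented metamorphic relation.

    The MR asserts that swapping a sensitive attribute should not change the
    sentiment of the response by more than an assumed tolerance of 0.2.
    """
    src = model(prompt)
    follow = model(swap_demographic(prompt, a, b))
    return abs(score(src) - score(follow)) > 0.2

# Usage with stub functions, just to show the shape of a test:
model = lambda p: f"echo: {p}"       # stand-in for an LLM call
score = lambda text: 0.0             # stand-in for a sentiment scorer
print(violates_fairness_mr(model, "Write advice for a young nurse.",
                           "nurse", "male nurse"))
```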

Computer Vision

[CV-0] GigaTok: Scaling Visual Tokenizers to 3 Billion Parameters for Autoregressive Image Generation

[Quick Read]: This paper addresses a trade-off, not adequately treated in prior literature, between image-reconstruction quality and downstream autoregressive generation quality when scaling visual tokenizers. The key solution is semantic regularization: aligning tokenizer features with semantically consistent features from a pretrained visual encoder constrains the growth of latent-space complexity during scaling. This improves not only reconstruction but also downstream autoregressive generation and representation learning. The paper further identifies three best practices for scaling tokenizers: use 1D tokenizers for better scalability, prioritize scaling the decoder when expanding both encoder and decoder, and employ an entropy loss to stabilize training at billion scale. With these, GigaTok reaches state-of-the-art reconstruction, downstream AR generation, and representation quality when scaled to 3 billion parameters.

Link: https://arxiv.org/abs/2504.08736
Authors: Tianwei Xiong, Jun Hao Liew, Zilong Huang, Jiashi Feng, Xihui Liu
Affiliations: The University of Hong Kong; ByteDance
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: project page: this https URL

Abstract:In autoregressive (AR) image generation, visual tokenizers compress images into compact discrete latent tokens, enabling efficient training of downstream autoregressive models for visual generation via next-token prediction. While scaling visual tokenizers improves image reconstruction quality, it often degrades downstream generation quality – a challenge not adequately addressed in existing literature. To address this, we introduce GigaTok, the first approach to simultaneously improve image reconstruction, generation, and representation learning when scaling visual tokenizers. We identify the growing complexity of latent space as the key factor behind the reconstruction vs. generation dilemma. To mitigate this, we propose semantic regularization, which aligns tokenizer features with semantically consistent features from a pre-trained visual encoder. This constraint prevents excessive latent space complexity during scaling, yielding consistent improvements in both reconstruction and downstream autoregressive generation. Building on semantic regularization, we explore three key practices for scaling tokenizers: (1) using 1D tokenizers for better scalability, (2) prioritizing decoder scaling when expanding both encoder and decoder, and (3) employing entropy loss to stabilize training for billion-scale tokenizers. By scaling to 3 billion parameters, GigaTok achieves state-of-the-art performance in reconstruction, downstream AR generation, and downstream AR representation quality.
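The semantic regularization term can be sketched as a cosine-alignment loss between (projected) tokenizer features and frozen features from a pretrained visual encoder; the projection head and the exact objective form are our assumptions about one plausible instantiation.

```python
import torch
import torch.nn.functional as F

def semantic_regularization(tok_feats: torch.Tensor,
                            teacher_feats: torch.Tensor,
                            proj: torch.nn.Linear) -> torch.Tensor:
    """Align tokenizer features with a frozen pretrained visual encoder.

    Maximizes cosine similarity between projected tokenizer features and
    teacher features; the teacher is detached so only the tokenizer learns.
    """
    student = F.normalize(proj(tok_feats), dim=-1)
    teacher = F.normalize(teacher_feats.detach(), dim=-1)
    return 1.0 - (student * teacher).sum(dim=-1).mean()

# In training this would be added to the reconstruction objective, e.g.:
# total_loss = reconstruction_loss + lambda_sem * semantic_regularization(...)
proj = torch.nn.Linear(256, 768)
loss = semantic_regularization(torch.randn(4, 64, 256), torch.randn(4, 64, 768), proj)
print(loss.item())
```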

[CV-1] Steering CLIP's vision transformer with sparse autoencoders CVPR2025

[Quick Read]: This paper addresses the poor understanding of vision models' internal mechanisms, a challenge sparse autoencoders (SAEs) have helped address in language but which remains underexplored in vision. The authors train SAEs on CLIP's vision transformer and uncover key differences between vision and language processing, including distinct sparsity patterns for SAEs trained across layers and token types. The key contribution is introducing metrics to quantify the steerability of CLIP's vision transformer: roughly 10-15% of neurons and features are steerable, and SAEs expose thousands more steerable features than the base model. Building on this, targeted suppression of SAE features improves performance on three vision disentanglement tasks (CelebA, Waterbirds, and typographic attacks), with optimal disentanglement in middle layers and state-of-the-art performance on defense against typographic attacks.

Link: https://arxiv.org/abs/2504.08729
Authors: Sonia Joseph, Praneet Suresh, Ethan Goldfarb, Lorenz Hufe, Yossi Gandelsman, Robert Graham, Danilo Bzdok, Wojciech Samek, Blake Aaron Richards
Affiliations: Mila; McGill University; Independent Researcher; Fraunhofer HHI; UC Berkeley
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 8 pages, 7 figures. Accepted to the CVPR 2025 Workshop on Mechanistic Interpretability for Vision (MIV)

Abstract:While vision models are highly capable, their internal mechanisms remain poorly understood – a challenge which sparse autoencoders (SAEs) have helped address in language, but which remains underexplored in vision. We address this gap by training SAEs on CLIP’s vision transformer and uncover key differences between vision and language processing, including distinct sparsity patterns for SAEs trained across layers and token types. We then provide the first systematic analysis on the steerability of CLIP’s vision transformer by introducing metrics to quantify how precisely SAE features can be steered to affect the model’s output. We find that 10-15% of neurons and features are steerable, with SAEs providing thousands more steerable features than the base model. Through targeted suppression of SAE features, we then demonstrate improved performance on three vision disentanglement tasks (CelebA, Waterbirds, and typographic attacks), finding optimal disentanglement in middle model layers, and achieving state-of-the-art performance on defense against typographic attacks.
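
A minimal sketch of the steering mechanism described, targeted suppression of SAE features: encode an activation with a sparse autoencoder, zero selected feature indices, and decode back. The SAE architecture and the choice of which features to suppress (e.g., ones responding to typographic text) are assumptions, not the paper's exact recipe.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.enc = nn.Linear(d_model, d_hidden)
        self.dec = nn.Linear(d_hidden, d_model)

    def forward(self, x):
        f = torch.relu(self.enc(x))  # sparse feature activations
        return self.dec(f), f

def suppress_features(sae, act, feature_ids):
    """Zero the chosen SAE features and return the steered activation,
    which would be written back into the ViT's residual stream."""
    _, f = sae(act)
    f[..., feature_ids] = 0.0
    return sae.dec(f)
```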

[CV-2] Visual Chronicles: Using Multimodal LLMs to Analyze Massive Collections of Images

【Quick Read】: This paper aims to use multimodal large language models (MLLMs) to analyze databases containing tens of millions of images captured at different times, in order to discover temporal change patterns in a city over a given period. Unlike prior visual analyses based on object detection or predefined labels, the approach answers open-ended questions (e.g., "What types of changes occur frequently in the city?") without any predetermined target subjects or training labels. However, because the dataset is four orders of magnitude larger than existing MLLM context windows, direct MLLM analysis is infeasible. The paper therefore proposes a bottom-up decomposition strategy that splits the massive visual analysis problem into more tractable sub-problems and designs an MLLM-based solution for each. Experiments and ablation studies show the system significantly outperforms baselines and discovers interesting trends such as "addition of outdoor dining" and "overpass was painted blue". The key is that the decomposition strategy overcomes the data-scale limitation while fully exploiting the open-ended semantic understanding of MLLMs.

Link: https://arxiv.org/abs/2504.08727
Authors: Boyang Deng,Songyou Peng,Kyle Genova,Gordon Wetzstein,Noah Snavely,Leonidas Guibas,Thomas Funkhouser
Affiliations: Stanford University (斯坦福大学); Google DeepMind (谷歌深度思维)
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments: Project page: this https URL second and third listed authors have equal contributions

Abstract:We present a system using Multimodal LLMs (MLLMs) to analyze a large database with tens of millions of images captured at different times, with the aim of discovering patterns in temporal changes. Specifically, we aim to capture frequent co-occurring changes (“trends”) across a city over a certain period. Unlike previous visual analyses, our analysis answers open-ended queries (e.g., “what are the frequent types of changes in the city?”) without any predetermined target subjects or training labels. These properties render prior learning-based or unsupervised visual analysis tools unsuitable. We identify MLLMs as a novel tool for their open-ended semantic understanding capabilities. Yet, our datasets are four orders of magnitude too large for an MLLM to ingest as context. So we introduce a bottom-up procedure that decomposes the massive visual analysis problem into more tractable sub-problems. We carefully design MLLM-based solutions to each sub-problem. During experiments and ablation studies with our system, we find it significantly outperforms baselines and is able to discover interesting trends from images captured in large cities (e.g., “addition of outdoor dining,” “overpass was painted blue,” etc.). See more results and interactive demos at this https URL.

[CV-3] EMO-X: Efficient Multi-Person Pose and Shape Estimation in One-Stage

【Quick Read】: This paper targets expressive human pose and shape estimation (EHPS) in multi-person scenes, where existing Transformer-based methods incur substantial computational overhead due to the quadratic complexity of self-attention. Although Mamba is a promising alternative thanks to its efficient global modeling, it remains limited in capturing the fine-grained local dependencies essential for precise EHPS. To address this, the paper proposes EMO-X, an efficient one-stage model for multi-person EHPS. The key is a Scan-based Global-Local Decoder (SGLD) that integrates global context with skeleton-aware local features to iteratively refine human tokens. EMO-X leverages Mamba's strong global modeling and designs a skeleton-aware bidirectional scan mechanism for local refinement. Experiments show that EMO-X strikes an excellent balance between efficiency and accuracy, reducing inference time by 69.8% compared with state-of-the-art methods while outperforming most of them in accuracy.

Link: https://arxiv.org/abs/2504.08718
Authors: Haohang Jian,Jinlu Zhang,Junyi Wu,Zhigang Tu
Affiliations: Wuhan University (武汉大学); Center on Frontiers of Computing Studies, School of Computer Science, Peking University (北京大学计算机科学前沿研究中心); Fuzhou University (福州大学)
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Expressive Human Pose and Shape Estimation (EHPS) aims to jointly estimate human pose, hand gesture, and facial expression from monocular images. Existing methods predominantly rely on Transformer-based architectures, which suffer from quadratic complexity in self-attention, leading to substantial computational overhead, especially in multi-person scenarios. Recently, Mamba has emerged as a promising alternative to Transformers due to its efficient global modeling capability. However, it remains limited in capturing fine-grained local dependencies, which are essential for precise EHPS. To address these issues, we propose EMO-X, the Efficient Multi-person One-stage model for multi-person EHPS. Specifically, we explore a Scan-based Global-Local Decoder (SGLD) that integrates global context with skeleton-aware local features to iteratively enhance human tokens. Our EMO-X leverages the superior global modeling capability of Mamba and designs a local bidirectional scan mechanism for skeleton-aware local refinement. Comprehensive experiments demonstrate that EMO-X strikes an excellent balance between efficiency and accuracy. Notably, it achieves a significant reduction in computational complexity, requiring 69.8% less inference time compared to state-of-the-art (SOTA) methods, while outperforming most of them in accuracy.

[CV-4] Hypergraph Vision Transformers: Images are More than Nodes, More than Edges CVPR2025

【Quick Read】: This paper addresses the difficulty of balancing adaptability, computational efficiency, and higher-order relational modeling in Vision Transformers (ViTs), as well as the computational bottleneck of clustering-based edge generation in Vision Graph Neural Networks (ViGs). The key solution is the Hypergraph Vision Transformer (HgVT), which incorporates a hierarchical bipartite hypergraph structure into the vision transformer framework to capture higher-order semantic relationships while maintaining efficiency. HgVT uses population and diversity regularization for dynamic hypergraph construction without clustering, and expert edge pooling to enhance semantic extraction and enable graph-based image retrieval. Experiments show strong performance on image classification and retrieval, positioning HgVT as an efficient framework for semantic vision tasks.

Link: https://arxiv.org/abs/2504.08710
Authors: Joshua Fixelle
Affiliations: University of Virginia (弗吉尼亚大学)
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by CVPR 2025

Abstract:Recent advancements in computer vision have highlighted the scalability of Vision Transformers (ViTs) across various tasks, yet challenges remain in balancing adaptability, computational efficiency, and the ability to model higher-order relationships. Vision Graph Neural Networks (ViGs) offer an alternative by leveraging graph-based methodologies but are hindered by the computational bottlenecks of clustering algorithms used for edge generation. To address these issues, we propose the Hypergraph Vision Transformer (HgVT), which incorporates a hierarchical bipartite hypergraph structure into the vision transformer framework to capture higher-order semantic relationships while maintaining computational efficiency. HgVT leverages population and diversity regularization for dynamic hypergraph construction without clustering, and expert edge pooling to enhance semantic extraction and facilitate graph-based image retrieval. Empirical results demonstrate that HgVT achieves strong performance on image classification and retrieval, positioning it as an efficient framework for semantic-based vision tasks.

[CV-5] Seaweed-7B: Cost-Effective Training of Video Generation Foundation Model

【Quick Read】: This paper addresses how to train an effective video generation foundation model under constrained resources. The key is a mid-sized research model of about 7 billion parameters (7B), Seaweed-7B, trained from scratch with 665,000 H100 GPU hours. Despite the moderate compute, Seaweed-7B matches or even surpasses contemporary video generation models of much larger size trained with far more GPU resources. The report highlights the design decisions that significantly boost a medium-sized diffusion model. Empirically: (1) Seaweed-7B performs comparably to, or better than, larger models under limited resources; (2) the model generalizes well and can be adapted effectively to a wide range of downstream applications via lightweight fine-tuning or continued training.

Link: https://arxiv.org/abs/2504.08685
Authors: Team Seawead,Ceyuan Yang,Zhijie Lin,Yang Zhao,Shanchuan Lin,Zhibei Ma,Haoyuan Guo,Hao Chen,Lu Qi,Sen Wang,Feng Cheng,Feilong Zuo Xuejiao Zeng,Ziyan Yang,Fangyuan Kong,Zhiwu Qing,Fei Xiao,Meng Wei,Tuyen Hoang,Siyu Zhang,Peihao Zhu,Qi Zhao,Jiangqiao Yan,Liangke Gui,Sheng Bi,Jiashi Li,Yuxi Ren,Rui Wang,Huixia Li,Xuefeng Xiao,Shu Liu,Feng Ling,Heng Zhang,Houmin Wei,Huafeng Kuang,Jerry Duncan,Junda Zhang,Junru Zheng,Li Sun,Manlin Zhang,Renfei Sun,Xiaobin Zhuang,Xiaojie Li,Xin Xia,Xuyan Chi,Yanghua Peng,Yuping Wang,Yuxuan Wang,Zhongkai Zhao,Zhuo Chen,Zuquan Song,Zhenheng Yang,Jiashi Feng,Jianchao Yang,Lu Jiang
Affiliations: ByteDance(字节跳动)
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Technical report

Abstract:This technical report presents a cost-efficient strategy for training a video generation foundation model. We present a mid-sized research model with approximately 7 billion parameters (7B) called Seaweed-7B trained from scratch using 665,000 H100 GPU hours. Despite being trained with moderate computational resources, Seaweed-7B demonstrates highly competitive performance compared to contemporary video generation models of much larger size. Design choices are especially crucial in a resource-constrained setting. This technical report highlights the key design decisions that enhance the performance of the medium-sized diffusion model. Empirically, we make two observations: (1) Seaweed-7B achieves performance comparable to, or even surpassing, larger models trained on substantially greater GPU resources, and (2) our model, which exhibits strong generalization ability, can be effectively adapted across a wide range of downstream applications either by lightweight fine-tuning or continued training. See the project page at this https URL

[CV-6] X2BR: High-Fidelity 3D Bone Reconstruction from a Planar X-Ray Image with Hybrid Neural Implicit Methods

【Quick Read】: This paper tackles accurate 3D bone reconstruction from a single planar X-ray, a challenge stemming from anatomical complexity and limited input data. The proposed X2BR is a hybrid neural implicit framework that combines continuous volumetric reconstruction with template-guided non-rigid registration. Its key components are: (1) the core network X2B, which uses a ConvNeXt-based encoder to extract spatial features from X-rays and predict high-fidelity 3D bone occupancy fields without relying on statistical shape models; and (2) a patient-specific template mesh, built with YOLOv9-based detection and the SKEL biomechanical skeleton model, aligned to the coarse reconstruction via geodesic-based coherent point drift to yield anatomically consistent 3D bone volumes. Experiments show X2B attains the best numerical accuracy (IoU = 0.952, Chamfer-L1 = 0.005), while X2BR, with YOLOv9-based bone detection and biomechanical template alignment, offers superior anatomical realism despite a slightly lower IoU (0.875), especially in rib curvature and vertebral alignment.

Link: https://arxiv.org/abs/2504.08675
Authors: Gokce Guven,H. Fatih Ugurdag,Hasan F. Ates
Affiliations: Ozyegin University (欧伊因大学), Istanbul, Turkey (土耳其)
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Accurate 3D bone reconstruction from a single planar X-ray remains a challenge due to anatomical complexity and limited input data. We propose X2BR, a hybrid neural implicit framework that combines continuous volumetric reconstruction with template-guided non-rigid registration. The core network, X2B, employs a ConvNeXt-based encoder to extract spatial features from X-rays and predict high-fidelity 3D bone occupancy fields without relying on statistical shape models. To further refine anatomical accuracy, X2BR integrates a patient-specific template mesh, constructed using YOLOv9-based detection and the SKEL biomechanical skeleton model. The coarse reconstruction is aligned to the template using geodesic-based coherent point drift, enabling anatomically consistent 3D bone volumes. Experimental results on a clinical dataset show that X2B achieves the highest numerical accuracy, with an IoU of 0.952 and Chamfer-L1 distance of 0.005, outperforming recent baselines including X2V and D2IM-Net. Building on this, X2BR incorporates anatomical priors via YOLOv9-based bone detection and biomechanical template alignment, leading to reconstructions that, while slightly lower in IoU (0.875), offer superior anatomical realism, especially in rib curvature and vertebral alignment. This numerical accuracy vs. visual consistency trade-off between X2B and X2BR highlights the value of hybrid frameworks for clinically relevant 3D reconstructions.

[CV-7] The Invisible EgoHand: 3D Hand Forecasting through EgoBody Pose Estimation

【Quick Read】: This paper addresses forecasting hand motion and pose from an egocentric perspective; existing methods only predict hand positions within the visible field of view, ignore articulation, and fail to exploit full-body joint information. The key solution is EgoH4, a diffusion-based transformer architecture that takes the observation sequence and camera poses as input and predicts the future 3D motion and poses of both of the wearer's hands, in and out of view. EgoH4 uses full-body pose information to constrain hand motion, denoising hand and body joints together with a visibility predictor for hand joints and a 3D-to-2D reprojection loss that minimizes error when hands are in view. On the Ego-Exo4D dataset, EgoH4 improves over the baseline by 3.4 cm in ADE for hand trajectory forecasting and 5.1 cm in MPJPE for hand pose forecasting.

Link: https://arxiv.org/abs/2504.08654
Authors: Masashi Hatano,Zhifan Zhu,Hideo Saito,Dima Damen
Affiliations: Keio University (庆应义塾大学); University of Bristol (布里斯托大学)
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Forecasting hand motion and pose from an egocentric perspective is essential for understanding human intention. However, existing methods focus solely on predicting positions without considering articulation, and only when the hands are visible in the field of view. This limitation overlooks the fact that approximate hand positions can still be inferred even when they are outside the camera’s view. In this paper, we propose a method to forecast the 3D trajectories and poses of both hands from an egocentric video, both in and out of the field of view. We propose a diffusion-based transformer architecture for Egocentric Hand Forecasting, EgoH4, which takes as input the observation sequence and camera poses, then predicts future 3D motion and poses for both hands of the camera wearer. We leverage full-body pose information, allowing other joints to provide constraints on hand motion. We denoise the hand and body joints along with a visibility predictor for hand joints and a 3D-to-2D reprojection loss that minimizes the error when hands are in-view. We evaluate EgoH4 on the Ego-Exo4D dataset, combining subsets with body and hand annotations. We train on 156K sequences and evaluate on 34K sequences, respectively. EgoH4 improves the performance by 3.4cm and 5.1cm over the baseline in terms of ADE for hand trajectory forecasting and MPJPE for hand pose forecasting. Project page: this https URL
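
A minimal sketch of the 3D-to-2D reprojection loss idea: project the predicted camera-frame hand joints with a pinhole model and penalize 2D error only where the visibility predictor says a joint is in view. The intrinsics handling and the visibility weighting are simplifying assumptions.

```python
import torch

def reprojection_loss(joints_3d, joints_2d_gt, vis, K):
    """joints_3d: (B, J, 3) predicted camera-frame joints; joints_2d_gt: (B, J, 2);
    vis: (B, J) predicted visibility in [0, 1]; K: (3, 3) camera intrinsics."""
    uvw = torch.einsum("ij,bkj->bki", K, joints_3d)    # (B, J, 3)
    uv = uvw[..., :2] / uvw[..., 2:].clamp(min=1e-6)   # perspective divide
    err = (uv - joints_2d_gt).norm(dim=-1)             # (B, J) pixel error
    return (vis * err).sum() / vis.sum().clamp(min=1.0)
```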

[CV-8] MBE-ARI: A Multimodal Dataset Mapping Bi-directional Engagement in Animal-Robot Interaction ICRA2025

【Quick Read】: This paper addresses the lack of resources for effective bidirectional communication in animal-robot interaction (ARI): robots struggle to interpret animals' complex multimodal cues (body language, movement, vocalizations), and, unlike human-robot interaction, ARI lacks supporting datasets and frameworks. To fill this gap, the paper presents MBE-ARI (Multimodal Bidirectional Engagement in Animal-Robot Interaction), a new multimodal dataset of detailed interactions between a legged robot and cows, with synchronized RGB-D streams from multiple viewpoints annotated with body pose and activity labels across interaction phases. The authors also develop a full-body pose estimation model tailored to quadrupeds that tracks 39 keypoints with a mean average precision (mAP) of 92.7%, surpassing existing animal pose estimation benchmarks. The key is that the high-quality dataset and pose model supply the perception, reasoning, and interaction tools needed for effective robot-animal collaboration.

Link: https://arxiv.org/abs/2504.08646
Authors: Ian Noronha,Advait Prasad Jawaji,Juan Camilo Soto,Jiajun An,Yan Gu,Upinder Kaur
Affiliations: Department of Agricultural and Biological Engineering, Purdue University (普渡大学); School of Mechanical Engineering, Purdue University (普渡大学)
Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: Accepted to ICRA 2025

Abstract:Animal-robot interaction (ARI) remains an unexplored challenge in robotics, as robots struggle to interpret the complex, multimodal communication cues of animals, such as body language, movement, and vocalizations. Unlike human-robot interaction, which benefits from established datasets and frameworks, animal-robot interaction lacks the foundational resources needed to facilitate meaningful bidirectional communication. To bridge this gap, we present the MBE-ARI (Multimodal Bidirectional Engagement in Animal-Robot Interaction), a novel multimodal dataset that captures detailed interactions between a legged robot and cows. The dataset includes synchronized RGB-D streams from multiple viewpoints, annotated with body pose and activity labels across interaction phases, offering an unprecedented level of detail for ARI research. Additionally, we introduce a full-body pose estimation model tailored for quadruped animals, capable of tracking 39 keypoints with a mean average precision (mAP) of 92.7%, outperforming existing benchmarks in animal pose estimation. The MBE-ARI dataset and our pose estimation framework lay a robust foundation for advancing research in animal-robot interaction, providing essential tools for developing perception, reasoning, and interaction frameworks needed for effective collaboration between robots and animals. The dataset and resources are publicly available at this https URL, inviting further exploration and development in this critical area.

[CV-9] Title block detection and information extraction for enhanced building drawings search

【Quick Read】: This paper addresses information extraction (IE) from building drawings, which is often time-consuming and costly, especially for complex, noisy historical drawings. The focus is on simplifying drawing search by exploiting the metadata stored in a drawing's title block. Historical drawings, however, do not follow existing uniformity standards, which makes title block IE complex and difficult.

The key is a novel title block detection and information extraction pipeline that combines a lightweight convolutional neural network (CNN) with GPT-4o. It detects title blocks in engineering building drawings with high accuracy and extracts structured drawing metadata from them, handling complex, noisy historical drawings particularly well and markedly improving IE for both vector (CAD) and hand-drawn drawings. A user interface (UI) built on the extracted metadata has been deployed on real projects, demonstrating significant time savings, and an extensible, domain-expert-annotated dataset for title block detection was created via an efficient, AEC-friendly annotation workflow to support future research.

Link: https://arxiv.org/abs/2504.08645
Authors: Alessio Lombardi(1),Li Duan(2),Ahmed Elnagar(1),Ahmed Zaalouk(2),Khalid Ismail(2),Edlira Vakaj(2) ((1) Buro Happold, London (UK), (2) Birmingham City University (UK))
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 8 pages, 8 figures, 1 table. Accepted for publication in the 2025 European Conference on Computing in Construction (EC3, this https URL )

Abstract:The architecture, engineering, and construction (AEC) industry still heavily relies on information stored in drawings for building construction, maintenance, compliance and error checks. However, information extraction (IE) from building drawings is often time-consuming and costly, especially when dealing with historical buildings. Drawing search can be simplified by leveraging the information stored in the title block portion of the drawing, which can be seen as drawing metadata. However, title block IE can be complex especially when dealing with historical drawings which do not follow existing standards for uniformity. This work performs a comparison of existing methods for this kind of IE task, and then proposes a novel title block detection and IE pipeline which outperforms existing methods, in particular when dealing with complex, noisy historical drawings. By combining a lightweight Convolutional Neural Network and GPT-4o, the proposed inference pipeline detects building engineering title blocks with high accuracy, and then extracts structured drawing metadata from the title blocks, which can be used for drawing search, filtering and grouping. The work demonstrates high accuracy and efficiency in IE for both vector (CAD) and hand-drawn (historical) drawings. A user interface (UI) that leverages the extracted metadata for drawing search is established and deployed on real projects, which demonstrates significant time savings. Additionally, an extensible domain-expert-annotated dataset for title block detection is developed, via an efficient AEC-friendly annotation workflow that lays the foundation for future work.
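
A minimal sketch of the two-stage pipeline described: a lightweight detector proposes the title block crop, then a multimodal LLM extracts structured metadata from it. The prompt, the output schema, and `detect_title_block` are assumptions; the call shape is the standard OpenAI chat-completions API, and real usage may need a stricter response format before `json.loads`.

```python
import base64
import json
from openai import OpenAI

def detect_title_block(image_path: str) -> str:
    """Hypothetical CNN detector: crops the title block, returns a PNG path."""
    raise NotImplementedError

def extract_metadata(image_path: str) -> dict:
    crop = detect_title_block(image_path)
    b64 = base64.b64encode(open(crop, "rb").read()).decode()
    client = OpenAI()
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": [
            {"type": "text", "text": "Extract drawing number, title, revision, "
             "date and author from this title block. Answer with JSON only."},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{b64}"}},
        ]}],
    )
    return json.loads(resp.choices[0].message.content)
```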

[CV-10] Latent Diffusion Autoencoders: Toward Efficient and Meaningful Unsupervised Representation Learning in Medical Imaging

【Quick Read】: This paper targets efficient and meaningful unsupervised learning in medical imaging, using Alzheimer's disease (AD) analysis of brain MRI as a case study. Conventional diffusion autoencoders operate in image space, which is computationally expensive and makes representation learning on 3D medical images intractable. The key innovation of the proposed Latent Diffusion Autoencoder (LDAE), an encoder-decoder diffusion framework, is to apply the diffusion process in a compressed latent representation rather than image space. This improves computational efficiency and makes 3D medical imaging representation learning tractable. Experiments show LDAE captures meaningful semantic representations related to AD and aging and achieves high-quality generation and reconstruction with substantial computational advantages, positioning it as a promising foundation model for scalable medical imaging applications.

Link: https://arxiv.org/abs/2504.08635
Authors: Gabriele Lozupone,Alessandro Bria,Francesco Fontanella,Frederick J.A. Meijer,Claudio De Stefano,Henkjan Huisman
Affiliations: University of the Sacred Heart (Pontifical Catholic University of the Sacred Heart); University of Rome Tor Vergata; Maastricht University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 15 pages, 9 figures, 7 tables

Abstract:This study presents Latent Diffusion Autoencoder (LDAE), a novel encoder-decoder diffusion-based framework for efficient and meaningful unsupervised learning in medical imaging, focusing on Alzheimer disease (AD) using brain MR from the ADNI database as a case study. Unlike conventional diffusion autoencoders operating in image space, LDAE applies the diffusion process in a compressed latent representation, improving computational efficiency and making 3D medical imaging representation learning tractable. To validate the proposed approach, we explore two key hypotheses: (i) LDAE effectively captures meaningful semantic representations on 3D brain MR associated with AD and ageing, and (ii) LDAE achieves high-quality image generation and reconstruction while being computationally efficient. Experimental results support both hypotheses: (i) linear-probe evaluations demonstrate promising diagnostic performance for AD (ROC-AUC: 90%, ACC: 84%) and age prediction (MAE: 4.1 years, RMSE: 5.2 years); (ii) the learned semantic representations enable attribute manipulation, yielding anatomically plausible modifications; (iii) semantic interpolation experiments show strong reconstruction of missing scans, with SSIM of 0.969 (MSE: 0.0019) for a 6-month gap. Even for longer gaps (24 months), the model maintains robust performance (SSIM 0.93, MSE 0.004), indicating an ability to capture temporal progression trends; (iv) compared to conventional diffusion autoencoders, LDAE significantly increases inference throughput (20x faster) while also enhancing reconstruction quality. These findings position LDAE as a promising framework for scalable medical imaging applications, with the potential to serve as a foundation model for medical image analysis. Code available at this https URL

[CV-11] Task-conditioned Ensemble of Expert Models for Continuous Learning

【Quick Read】: This paper addresses maintaining the accuracy of deployed machine learning models in non-stationary environments, where distribution shifts degrade performance: the model must be updated with new data so that it retains accuracy on old data while adapting to the new. The key solution is a task-conditioned ensemble of expert models built from task membership information, where in-domain models based on a local-outlier concept dynamically supply task membership for each probe sample at run time. Experiments across three setups with different kinds of distribution shift validate the effectiveness of the method.

Link: https://arxiv.org/abs/2504.08626
Authors: Renu Sharma,Debasmita Pal,Arun Ross
Affiliations: Michigan State University (密歇根州立大学), USA
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:One of the major challenges in machine learning is maintaining the accuracy of the deployed model (e.g., a classifier) in a non-stationary environment. The non-stationary environment results in distribution shifts and, consequently, a degradation in accuracy. Continuous learning of the deployed model with new data could be one remedy. However, the question arises as to how we should update the model with new training data so that it retains its accuracy on the old data while adapting to the new data. In this work, we propose a task-conditioned ensemble of models to maintain the performance of the existing model. The method involves an ensemble of expert models based on task membership information. The in-domain models, based on the local outlier concept (and distinct from the expert models), provide task membership information dynamically at run-time for each probe sample. To evaluate the proposed method, we experiment with three setups: the first represents distribution shift between tasks (LivDet-Iris-2017), the second represents distribution shift both between and within tasks (LivDet-Iris-2020), and the third represents disjoint distribution between tasks (Split MNIST). The experiments highlight the benefits of the proposed method. The source code is available at this https URL.
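
A minimal sketch of task-conditioned routing with a local-outlier concept: one novelty detector per task scores membership at run time, and the probe is sent to the best-matching expert. The detector choice (scikit-learn's LOF in novelty mode) and the arg-max routing rule are assumptions based on the description.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

class TaskRouter:
    def __init__(self):
        self.detectors = []  # one in-domain model per task
        self.experts = []    # one expert model per task

    def add_task(self, X_task, expert):
        det = LocalOutlierFactor(n_neighbors=20, novelty=True).fit(X_task)
        self.detectors.append(det)
        self.experts.append(expert)

    def predict(self, x):
        x = np.asarray(x).reshape(1, -1)
        # Higher score_samples = more inlier-like for that task's data.
        scores = [d.score_samples(x)[0] for d in self.detectors]
        return self.experts[int(np.argmax(scores))].predict(x)
```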

[CV-12] Efficient Mixture of Geographical Species for On Device Wildlife Monitoring

【Quick Read】: This paper aims to build efficient species detection models on edge devices, focusing on geographically aware conditional computation. Because vision transformers are still new to the edge use case, conditional execution of subnetworks based on input data remains unexplored. The key is a method that biases structured subnetworks in a geographically aware manner and prunes the expert model per location. Conditional computation performance is demonstrated on two geographically distributed datasets, iNaturalist and iWildCam.

Link: https://arxiv.org/abs/2504.08620
Authors: Emmanuel Azuh Mensah,Joban Mand,Yueheng Ou,Min Jang,Kurtis Heimerl
Affiliations: University of Washington (华盛顿大学)
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Efficient on-device models have become attractive for near-sensor insight generation, of particular interest to the ecological conservation community. For this reason, deep learning researchers are proposing more approaches to develop lower compute models. However, since vision transformers are very new to the edge use case, there are still unexplored approaches, most notably conditional execution of subnetworks based on input data. In this work, we explore the training of a single species detector which uses conditional computation to bias structured sub networks in a geographically-aware manner. We propose a method for pruning the expert model per location and demonstrate conditional computation performance on two geographically distributed datasets: iNaturalist and iWildcam.

[CV-13] Preserving Privacy Without Compromising Accuracy: Machine Unlearning for Handwritten Text Recognition

【Quick Read】: This paper addresses the privacy-accuracy trade-off in handwritten text recognition (HTR) models: handwritten data can contain user-identifiable information (distinctive writing styles, personal lexicon choices), risking privacy leaks and eroding trust in AI services. The key is a novel two-stage machine unlearning strategy for a multi-head transformer HTR model that combines pruning and random labeling, using the writer classification head as both an indicator and a trigger for unlearning while preserving the efficacy of the recognition head. This is the first comprehensive exploration of machine unlearning for HTR, and Membership Inference Attacks (MIA) are used to verify the removal of user-identifiable information, showing that privacy is preserved while model accuracy is maintained.

Link: https://arxiv.org/abs/2504.08616
Authors: Lei Kang,Xuanshuo Fu,Lluis Gomez,Alicia Fornés,Ernest Valveny,Dimosthenis Karatzas
Affiliations: Computer Vision Center (计算机视觉中心), Universitat Autònoma de Barcelona (巴塞罗那自治大学), Barcelona, Spain
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Handwritten Text Recognition (HTR) is essential for document analysis and digitization. However, handwritten data often contains user-identifiable information, such as unique handwriting styles and personal lexicon choices, which can compromise privacy and erode trust in AI services. Legislation like the "right to be forgotten" underscores the necessity for methods that can expunge sensitive information from trained models. Machine unlearning addresses this by selectively removing specific data from models without necessitating complete retraining. Yet, it frequently encounters a privacy-accuracy tradeoff, where safeguarding privacy leads to diminished model performance. In this paper, we introduce a novel two-stage unlearning strategy for a multi-head transformer-based HTR model, integrating pruning and random labeling. Our proposed method utilizes a writer classification head both as an indicator and a trigger for unlearning, while maintaining the efficacy of the recognition head. To our knowledge, this represents the first comprehensive exploration of machine unlearning within HTR tasks. We further employ Membership Inference Attacks (MIA) to evaluate the effectiveness of unlearning user-identifiable information. Extensive experiments demonstrate that our approach effectively preserves privacy while maintaining model accuracy, paving the way for new research directions in the document analysis community. Our code will be publicly available upon acceptance.
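
A minimal sketch of the random-labeling half of the two-stage strategy: relabel the forget set with random writer IDs and briefly fine-tune the writer head so writer-identifying features are scrambled, while the recognition head is left untouched. The `backbone`/`writer_head` structure and all hyperparameters are illustrative, and the pruning stage is omitted.

```python
import torch

def unlearn_by_random_labels(model, forget_loader, n_writers, steps=100, lr=1e-4):
    """model: HTR network with .backbone and .writer_head (hypothetical names)."""
    opt = torch.optim.Adam(model.writer_head.parameters(), lr=lr)
    ce = torch.nn.CrossEntropyLoss()
    model.train()
    for step, (images, _) in enumerate(forget_loader):
        if step >= steps:
            break
        rand_labels = torch.randint(0, n_writers, (images.size(0),))
        logits = model.writer_head(model.backbone(images))
        loss = ce(logits, rand_labels)  # push toward uninformative writer output
        opt.zero_grad(); loss.backward(); opt.step()
```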

[CV-14] Enhancing knowledge retention for continual learning with domain-specific adapters and features gating

【Quick Read】: This paper tackles catastrophic forgetting in continual learning: preserving previously acquired knowledge while the model learns new tasks from a continuous data stream. The key is a new approach that integrates adapters into the self-attention mechanisms of Vision Transformers to enhance knowledge retention when datasets from different domains are added sequentially. The method introduces domain-specific output heads and feature gating, allowing the model to keep high accuracy on prior tasks while absorbing only the essential information from multiple domains. It proves effective against state-of-the-art parameter-efficient fine-tuning methods, and a comparative study on CIFAR-100, Flowers102, and DTD, each representing a distinct domain, shows that task ordering significantly shapes learning outcomes.

Link: https://arxiv.org/abs/2504.08613
Authors: Mohamed Abbas Hedjazi,Oussama Hadjerci,Adel Hafiane
Affiliations: dasia.ai(达西亚人工智能); INSA Centre Val de Loire (中央大西洋大学)
Subjects: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments: Submitted to Applied Intelligence (Springer), under review since November 26, 2024

Abstract:Continual learning empowers models to learn from a continuous stream of data while preserving previously acquired knowledge, effectively addressing the challenge of catastrophic forgetting. In this study, we propose a new approach that integrates adapters within the self-attention mechanisms of Vision Transformers to enhance knowledge retention when sequentially adding datasets from different domains. Unlike previous methods that continue learning with only one dataset, our approach introduces domain-specific output heads and feature gating, allowing the model to maintain high accuracy on previously learned tasks while incorporating only the essential information from multiple domains. The proposed method is compared to prominent parameter-efficient fine-tuning methods in the current state of the art. The results provide evidence that our method effectively alleviates the limitations of previous works. Furthermore, we conduct a comparative analysis using three datasets, CIFAR-100, Flowers102, and DTD, each representing a distinct domain, to investigate the impact of task order on model performance. Our findings underscore the critical role of dataset sequencing in shaping learning outcomes, demonstrating that strategic ordering can significantly improve the model’s ability to adapt to evolving data distributions over time while preserving the integrity of previously learned knowledge.
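
A minimal sketch of a gated bottleneck adapter of the kind described, wrapped around a ViT attention block's output, with one adapter (and output head) per domain selected at run time. The bottleneck size, gate form, and exact insertion point are assumptions.

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    def __init__(self, d_model: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(d_model, bottleneck)
        self.up = nn.Linear(bottleneck, d_model)
        self.gate = nn.Parameter(torch.zeros(1))  # learned feature gate

    def forward(self, x):
        # Residual adapter: only essential domain information passes the gate.
        return x + torch.sigmoid(self.gate) * self.up(torch.relu(self.down(x)))

# Usage: attn_out = adapters[domain_id](attn_out), with a separate
# domain-specific output head chosen by the same domain_id.
```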

[CV-15] FindAnything: Open-Vocabulary and Object-Centric Mapping for Robot Exploration in Any Environment

【Quick Read】: This paper targets real-time, open-vocabulary semantic understanding of large-scale unknown environments. Traditional methods leave a gap between purely geometric and open-vocabulary semantic information and often rely on external ground-truth pose. The key of the proposed FindAnything framework is fusing vision-language features into dense volumetric submaps, bridging geometry and open-vocabulary semantics for a higher level of understanding while exploring arbitrary environments without any external source of ground-truth pose. Its key components are: volumetric occupancy submaps that deform upon pose updates when the underlying SLAM system corrects its drift; object-centric submaps built from pixel-wise vision-language features aggregated over efficient-SAM-generated segments; and a memory-efficient mapping from open-vocabulary queries to 3D geometry. The open-vocabulary map achieves state-of-the-art semantic accuracy in closed-set evaluations on the Replica dataset, enabling a robot to explore environments based on objects or areas of interest selected via natural-language queries.

Link: https://arxiv.org/abs/2504.08603
Authors: Sebastián Barbas Laina,Simon Boche,Sotiris Papatheodorou,Simon Schaefer,Jaehyung Jung,Stefan Leutenegger
Affiliations: Technical University of Munich (慕尼黑工业大学); Smart Robotics Lab, School of Computation, Information and Technology, Technical University of Munich; Smart Robotics Lab, Department of Computing, Imperial College London (帝国理工学院); Munich Institute of Robotics and Machine Intelligence (MIRMI); Munich Center for Machine Learning (MCML)
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: 11 pages, 5 figures

Abstract:Geometrically accurate and semantically expressive map representations have proven invaluable to facilitate robust and safe mobile robot navigation and task planning. Nevertheless, real-time, open-vocabulary semantic understanding of large-scale unknown environments is still an open problem. In this paper we present FindAnything, an open-world mapping and exploration framework that incorporates vision-language information into dense volumetric submaps. Thanks to the use of vision-language features, FindAnything bridges the gap between pure geometric and open-vocabulary semantic information for a higher level of understanding while allowing to explore any environment without the help of any external source of ground-truth pose information. We represent the environment as a series of volumetric occupancy submaps, resulting in a robust and accurate map representation that deforms upon pose updates when the underlying SLAM system corrects its drift, allowing for a locally consistent representation between submaps. Pixel-wise vision-language features are aggregated from efficient SAM (eSAM)-generated segments, which are in turn integrated into object-centric volumetric submaps, providing a mapping from open-vocabulary queries to 3D geometry that is scalable also in terms of memory usage. The open-vocabulary map representation of FindAnything achieves state-of-the-art semantic accuracy in closed-set evaluations on the Replica dataset. This level of scene understanding allows a robot to explore environments based on objects or areas of interest selected via natural language queries. Our system is the first of its kind to be deployed on resource-constrained devices, such as MAVs, leveraging vision-language information for real-world robotic tasks.

[CV-16] On Background Bias of Post-Hoc Concept Embeddings in Computer Vision DNNs

【Quick Read】: This paper asks whether, and to what extent, existing data-driven post-hoc concept-based explainable AI (C-XAI) methods are themselves prone to background bias, given that backgrounds are mostly uncontrolled during training. In particular, it examines whether these methods may exploit statistical associations between backgrounds and target objects (e.g., "wild animals usually appear against vegetation") for performance gains, leaving degradation on concept corner cases such as "animals on the road" undiscovered.

The key is to validate and thoroughly confirm that established concept segmentation techniques (such as Net2Vec-based ones) frequently capture background biases, by comparing three established background randomization techniques on 50 concepts from two datasets across seven diverse DNN architectures, revealing shortcomings on background-sensitive settings such as road scenes. The results indicate that even simple, low-cost setups provide both valuable insight and improved background robustness.

Link: https://arxiv.org/abs/2504.08602
Authors: Gesina Schwalbe,Georgii Mikriukov,Edgar Heinert,Stavros Gerolymatos,Mert Keser,Alois Knoll,Matthias Rottmann,Annika Mütze
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: camera-ready version for 3rd World Conference on eXplainable Artificial Intelligence; 5 figures, 6 tables; code available at: this https URL

Abstract:The thriving research field of concept-based explainable artificial intelligence (C-XAI) investigates how human-interpretable semantic concepts embed in the latent spaces of deep neural networks (DNNs). Post-hoc approaches therein use a set of examples to specify a concept, and determine its embeddings in DNN latent space using data driven techniques. This proved useful to uncover biases between different target (foreground or concept) classes. However, given that the background is mostly uncontrolled during training, an important question has been left unattended so far: are state-of-the-art, data-driven post-hoc C-XAI approaches themselves prone to biases with respect to their backgrounds, and to what extent? E.g., wild animals mostly occur against vegetation backgrounds, and they seldom appear on roads. Even simple and robust C-XAI methods might abuse this shortcut for enhanced performance. A dangerous performance degradation of the concept-corner cases of animals on the road could thus remain undiscovered. This work validates and thoroughly confirms that established Net2Vec-based concept segmentation techniques frequently capture background biases, including alarming ones, such as underperformance on road scenes. For the analysis, we compare 3 established techniques from the domain of background randomization on 50 concepts from 2 datasets, and 7 diverse DNN architectures. Our results indicate that even low-cost setups can provide both valuable insight and improved background robustness.

[CV-17] Hands-On: Segmenting Individual Signs from Continuous Sequences

【Quick Read】: This paper tackles continuous sign language segmentation, a key task with major implications for sign language translation and data annotation. It proposes a transformer-based architecture that models the temporal dynamics of signing and frames segmentation as a sequence labeling problem using the Begin-In-Out (BIO) tagging scheme. The key is leveraging HaMeR hand features complemented with 3D angles, which yields excellent performance: the model achieves state-of-the-art results on the DGS Corpus, and the proposed features surpass prior benchmarks on BSLCorpus.

Link: https://arxiv.org/abs/2504.08593
Authors: Low Jian He,Harry Walsh,Ozge Mercanoglu Sincan,Richard Bowden
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted in the 19th IEEE International Conference on Automatic Face and Gesture Recognition

Abstract:This work tackles the challenge of continuous sign language segmentation, a key task with huge implications for sign language translation and data annotation. We propose a transformer-based architecture that models the temporal dynamics of signing and frames segmentation as a sequence labeling problem using the Begin-In-Out (BIO) tagging scheme. Our method leverages the HaMeR hand features, and is complemented with 3D Angles. Extensive experiments show that our model achieves state-of-the-art results on the DGS Corpus, while our features surpass prior benchmarks on BSLCorpus.
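
A minimal sketch of frame-level BIO tagging for segmentation: a transformer encoder over per-frame features (e.g., HaMeR hand features concatenated with 3D angles) emits Begin/In/Out logits per frame. Feature dimensions and model sizes below are placeholder assumptions.

```python
import torch
import torch.nn as nn

BIO = {"B": 0, "I": 1, "O": 2}  # Begin / In / Out tags per frame

class SignSegmenter(nn.Module):
    def __init__(self, feat_dim: int = 512, d_model: int = 256):
        super().__init__()
        self.proj = nn.Linear(feat_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        self.head = nn.Linear(d_model, len(BIO))

    def forward(self, frames):           # frames: (B, T, feat_dim)
        h = self.encoder(self.proj(frames))
        return self.head(h)              # (B, T, 3) BIO logits per frame
```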

[CV-18] ZipIR: Latent Pyramid Diffusion Transformer for High-Resolution Image Restoration

【Quick Read】: This paper addresses the quality-efficiency trade-off of deploying generative image restoration models at ultra-high resolutions, caused mainly by the heavy computation of long-range attention. The key of the proposed ZipIR framework is a highly compressed latent representation (32x image compression) that greatly reduces the number of spatial tokens, enabling high-capacity models such as the Diffusion Transformer (DiT). ZipIR further introduces a Latent Pyramid VAE (LP-VAE) that structures the latent space into sub-bands to ease diffusion training. Trained on full images up to 2K resolution, ZipIR achieves unmatched speed and quality in restoring high-resolution images from severely degraded inputs.

Link: https://arxiv.org/abs/2504.08591
Authors: Yongsheng Yu,Haitian Zheng,Zhifei Zhang,Jianming Zhang,Yuqian Zhou,Connelly Barnes,Yuchen Liu,Wei Xiong,Zhe Lin,Jiebo Luo
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Recent progress in generative models has significantly improved image restoration capabilities, particularly through powerful diffusion models that offer remarkable recovery of semantic details and local fidelity. However, deploying these models at ultra-high resolutions faces a critical trade-off between quality and efficiency due to the computational demands of long-range attention mechanisms. To address this, we introduce ZipIR, a novel framework that enhances efficiency, scalability, and long-range modeling for high-res image restoration. ZipIR employs a highly compressed latent representation that compresses the image 32x, effectively reducing the number of spatial tokens, and enabling the use of high-capacity models like the Diffusion Transformer (DiT). Toward this goal, we propose a Latent Pyramid VAE (LP-VAE) design that structures the latent space into sub-bands to ease diffusion training. Trained on full images up to 2K resolution, ZipIR surpasses existing diffusion-based methods, offering unmatched speed and quality in restoring high-resolution images from severely degraded inputs.

[CV-19] Hardware, Algorithms and Applications of the Neuromorphic Vision Sensor: a Review

【Quick Read】: This survey addresses the efficiency and adaptability bottlenecks of traditional visual sensing for real-time dynamic scenes. Neuromorphic (event) cameras, a new class of visual sensor whose data format is an asynchronous event stream generated by per-pixel illumination changes, offer potential performance gains for vision applications. Realizing these gains, however, requires redesigning or adapting existing algorithms to process the new data format effectively, which is the crux of the matter. The paper systematically examines neuromorphic vision along three dimensions: technological evolution, algorithm development, and practical applications, and outlines future research directions, emphasizing cross-domain adaptation and novel algorithm development.

Link: https://arxiv.org/abs/2504.08588
Authors: Claudio Cimarelli,Jose Andres Millan-Romera,Holger Voos,Jose Luis Sanchez-Lopez
Affiliations: Automation and Robotics Research Group (自动化与机器人研究组), Interdisciplinary Centre for Security, Reliability, and Trust (SnT) (跨学科安全、可靠性和信任中心), University of Luxembourg (卢森堡大学), Luxembourg (卢森堡); Faculty of Science, Technology, and Medicine, University of Luxembourg (卢森堡大学科学、技术和医学学院), Luxembourg (卢森堡); European Defence Agency (EDA) (欧洲防务局)
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 26 pages total, 26 without references, two images and five tables. Submitted to IEEE Sensors

Abstract:Neuromorphic, or event, cameras represent a transformation of the classical approach to visual sensing: they encode detected instantaneous per-pixel illumination changes into an asynchronous stream of event packets. Their novelty compared to standard cameras lies in the transition from capturing full picture frames at fixed time intervals to a sparse data format which, with its distinctive qualities, offers potential improvements in various applications. However, these advantages come at the cost of reinventing algorithmic procedures or adapting them to effectively process the new data format. In this survey, we systematically examine neuromorphic vision along three main dimensions. First, we highlight the technological evolution and distinctive hardware features of neuromorphic cameras from their inception to recent models. Second, we review image processing algorithms developed explicitly for event-based data, covering key works on feature detection, tracking, and optical flow, which form the basis for analyzing image elements and transformations, as well as depth and pose estimation or object recognition, which interpret more complex scene structures and components. These techniques, drawn from classical computer vision and modern data-driven approaches, are examined to illustrate the breadth of applications for event-based cameras. Third, we present practical application case studies demonstrating how event cameras have been successfully used across various industries and scenarios. Finally, we analyze the challenges limiting widespread adoption, identify significant research gaps compared to standard imaging techniques, and outline promising future directions and opportunities that neuromorphic vision offers.

[CV-20] Boosting multi-demographic federated learning for chest x-ray analysis using general-purpose self-supervised representations

【Quick Read】: This paper addresses the performance degradation of federated learning (FL) for medical imaging AI under non-IID data distributions, especially the non-IID variability peculiar to pediatric data and degraded performance on large adult datasets. Existing large-scale FL studies have focused on adult datasets, neglecting the extra challenges posed by pediatric data.

The key is to introduce general-purpose self-supervised image representations and boost performance via transfer learning. Applying FL directly improves performance on smaller adult datasets (P<0.001) but degrades it on larger adult datasets (P=0.064) and pediatric cases (P=0.242). Equipping FL with self-supervised weights, however, significantly improves pediatric cases (P=0.031) and most adult datasets (P<0.008), except the largest one (P=0.052). These findings indicate that easily deployable general-purpose self-supervised image representations can address non-IID challenges in clinical FL, offering a remedy for data scarcity and variability with particular promise for pediatric healthcare.

Link: https://arxiv.org/abs/2504.08584
Authors: Mahshad Lotfinia,Arash Tayebiarasteh,Samaneh Samiei,Mehdi Joodaki,Soroosh Tayebi Arasteh
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Reliable artificial intelligence (AI) models for medical image analysis often depend on large and diverse labeled datasets. Federated learning (FL) offers a decentralized and privacy-preserving approach to training but struggles in highly non-independent and identically distributed (non-IID) settings, where institutions with more representative data may experience degraded performance. Moreover, existing large-scale FL studies have been limited to adult datasets, neglecting the unique challenges posed by pediatric data, which introduces additional non-IID variability. To address these limitations, we analyzed n=398,523 adult chest radiographs from diverse institutions across multiple countries and n=9,125 pediatric images, leveraging transfer learning from general-purpose self-supervised image representations to classify pneumonia and cases with no abnormality. Using state-of-the-art vision transformers, we found that FL improved performance only for smaller adult datasets (P<0.001) but degraded performance for larger datasets (P=0.064) and pediatric cases (P=0.242). However, equipping FL with self-supervised weights significantly enhanced outcomes across pediatric cases (P=0.031) and most adult datasets (P<0.008), except the largest dataset (P=0.052). These findings underscore the potential of easily deployable general-purpose self-supervised image representations to address non-IID challenges in clinical FL applications and highlight their promise for enhancing patient outcomes and advancing pediatric healthcare, where data scarcity and variability remain persistent obstacles.

[CV-21] FMLGS: Fast Multilevel Language Embedded Gaussians for Part-level Interactive Agents

【Quick Read】: This paper targets the challenges of multi-granularity semantic interaction, notably language ambiguity and quality degradation when querying object components. The key solution, FMLGS, supports part-level open-vocabulary queries within 3D Gaussian Splatting (3DGS). Its key innovations are an efficient pipeline for building and querying consistent object- and part-level semantics based on the Segment Anything Model 2 (SAM2), and a semantic deviation strategy that resolves language ambiguity among object parts by interpolating the semantic features of fine-grained targets for enriched information. The approach not only better locates specified part-level targets but also takes first place in both speed and accuracy.

Link: https://arxiv.org/abs/2504.08581
Authors: Xin Tan,Yuzhou Ji,He Zhu,Yuan Xie
Affiliations: School of Computer Science and Technology, East China Normal University, Shanghai 200062, China (华东师范大学计算机科学与技术学院); Shanghai Innovation Institute, Shanghai 200062, China (上海创新研究院)
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:The semantically interactive radiance field has long been a promising backbone for 3D real-world applications, such as embodied AI to achieve scene understanding and manipulation. However, multi-granularity interaction remains a challenging task due to the ambiguity of language and degraded quality when it comes to queries upon object components. In this work, we present FMLGS, an approach that supports part-level open-vocabulary query within 3D Gaussian Splatting (3DGS). We propose an efficient pipeline for building and querying consistent object- and part-level semantics based on Segment Anything Model 2 (SAM2). We designed a semantic deviation strategy to solve the problem of language ambiguity among object parts, which interpolates the semantic features of fine-grained targets for enriched information. Once trained, we can query both objects and their describable parts using natural language. Comparisons with other state-of-the-art methods prove that our method can not only better locate specified part-level targets, but also achieve first-place performance concerning both speed and accuracy, where FMLGS is 98x faster than LERF, 4x faster than LangSplat and 2.5x faster than LEGaussians. Meanwhile, we further integrate FMLGS as a virtual agent that can interactively navigate through 3D scenes, locate targets, and respond to user demands through a chat interface, which demonstrates the potential of our work to be further expanded and applied in the future.

[CV-22] Knowledge Distillation for Multimodal Egocentric Action Recognition Robust to Missing Modalities

【Quick Read】: This paper addresses the limited robustness of unimodal egocentric action recognition under common issues such as blur and occlusion, and the performance drop of existing multimodal methods when any modality is missing. The key is KARMMA, an efficient multimodal knowledge distillation approach that uses pre-trained models as unimodal feature extractors in the teacher and distills knowledge into a much smaller, faster student that remains robust to missing modalities while still benefiting when multiple modalities are available.

Link: https://arxiv.org/abs/2504.08578
Authors: Maria Santos-Villafranca,Dustin Carrión-Ojeda,Alejandro Perez-Yus,Jesus Bermudez-Cameo,Jose J. Guerrero,Simone Schaub-Meyer
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project Page: this https URL

Abstract:Action recognition is an essential task in egocentric vision due to its wide range of applications across many fields. While deep learning methods have been proposed to address this task, most rely on a single modality, typically video. However, including additional modalities may improve the robustness of the approaches to common issues in egocentric videos, such as blurriness and occlusions. Recent efforts in multimodal egocentric action recognition often assume the availability of all modalities, leading to failures or performance drops when any modality is missing. To address this, we introduce an efficient multimodal knowledge distillation approach for egocentric action recognition that is robust to missing modalities (KARMMA) while still benefiting when multiple modalities are available. Our method focuses on resource-efficient development by leveraging pre-trained models as unimodal feature extractors in our teacher model, which distills knowledge into a much smaller and faster student model. Experiments on the Epic-Kitchens and Something-Something datasets demonstrate that our student model effectively handles missing modalities while reducing its accuracy drop in this scenario.
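
A minimal sketch of the distillation idea: a teacher built from frozen pre-trained unimodal extractors supervises a small student, while modalities are randomly dropped from the student's input during training so it stays robust when one goes missing. The dropout rate, temperature, and loss mix are assumptions.

```python
import random
import torch
import torch.nn.functional as F

def kd_step(student, teacher, batch, labels, T=4.0, alpha=0.7, p_drop=0.3):
    """batch: dict of modality tensors, e.g. {"rgb": ..., "audio": ...}."""
    kept = {m: x for m, x in batch.items() if random.random() > p_drop}
    if not kept:                      # always keep at least one modality
        kept = dict([next(iter(batch.items()))])
    with torch.no_grad():
        t_logits = teacher(batch)     # teacher always sees all modalities
    s_logits = student(kept)
    kd = F.kl_div(F.log_softmax(s_logits / T, -1),
                  F.softmax(t_logits / T, -1), reduction="batchmean") * T * T
    return alpha * kd + (1 - alpha) * F.cross_entropy(s_logits, labels)
```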

[CV-23] Banana Ripeness Level Classification using a Simple CNN Model Trained with Real and Synthetic Datasets

【Quick Read】: This paper tackles the reliance on manual methods for industrial banana ripeness assessment. It presents a robust dataset combining real and synthetic data for different ripeness levels and proposes a simple convolutional neural network (CNN) architecture to improve classification accuracy. The key is to train an initial model on synthetic data and then, via transfer learning, adapt it to classify real data effectively, reaching high accuracy (0.917) with a fast execution time.

Link: https://arxiv.org/abs/2504.08568
Authors: Luis Chuquimarca,Boris Vintimilla,Sergio Velastin
Affiliations: ESPOL Polytechnic University, ESPOL, CIDIS, Guayaquil, Ecuador (ESPOL科技大学,ESPOL,CIDIS,瓜亚基尔,厄瓜多尔); UPSE Santa Elena Peninsula State University, UPSE, FACSISTEL, La Libertad, Ecuador (UPSE圣埃莱娜半岛州立大学,UPSE,FACSISTEL,自由,厄瓜多尔); Queen Mary University of London, London, UK (伦敦玛丽女王大学,伦敦,英国); University Carlos III, Madrid, Spain (卡洛斯三世大学,马德里,西班牙)
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 9 pages, 7 figures, conference

Abstract:The level of ripeness is essential in determining the quality of bananas. To correctly estimate banana maturity, the metrics of international marketing standards need to be considered. However, the process of assessing the maturity of bananas at an industrial level is still carried out using manual methods. The use of CNN models is an attractive tool to solve the problem, but there is a limitation regarding the availability of sufficient data to train these models reliably. On the other hand, existing CNN models in the state of the art, trained on the available data, have reported acceptable accuracy in identifying banana maturity. For this reason, this work presents the generation of a robust dataset that combines real and synthetic data for different levels of banana ripeness. In addition, it proposes a simple CNN architecture, which is trained with synthetic data and, using the transfer learning technique, improved to classify real data, managing to determine the level of maturity of the banana. The proposed CNN model is evaluated against several architectures, varying hyper-parameter configurations and optimizers. The results show that the proposed CNN model reaches a high accuracy of 0.917 and a fast execution time.
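
A minimal sketch of the train-on-synthetic, fine-tune-on-real recipe with a simple CNN. The architecture, the four ripeness classes, and the freezing policy are assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

def simple_cnn(n_classes: int = 4) -> nn.Module:
    return nn.Sequential(
        nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, n_classes),
    )

model = simple_cnn()
# Stage 1: train all parameters on the synthetic ripeness dataset.
# Stage 2: transfer to real images, freezing the earliest conv layer.
for p in model[0].parameters():
    p.requires_grad = False
opt = torch.optim.Adam((p for p in model.parameters() if p.requires_grad), lr=1e-4)
```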

[CV-24] Shadow Erosion and Nighttime Adaptability for Camera-Based Automated Driving Applications

【Quick Read】: This paper targets image enhancement for automated driving, particularly under challenging lighting such as shadows and nighttime conditions. The key is a Shadow Erosion and Nighttime Adaptability image processing pipeline that preserves color and texture while specifically mitigating shadows and improving nighttime adaptability, significantly improving perceived visual quality and illumination uniformity. Compared with the widely used CLAHE technique, the pipeline not only improves image quality but also boosts a YOLO-based drivable area segmentation algorithm.

Link: https://arxiv.org/abs/2504.08551
Authors: Mohamed Sabry,Gregory Schroeder,Joshua Varughese,Cristina Olaverri-Monreal
Affiliations: Johannes Kepler University Linz (约翰内斯·开普勒林茨大学), Austria; Department Intelligent Transport Systems (智能交通系统系)
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 7 pages

Abstract:Enhancement of images from RGB cameras is of particular interest due to its wide range of ever-increasing applications such as medical imaging, satellite imaging, automated driving, etc. In autonomous driving, various techniques are used to enhance image quality under challenging lighting conditions. These include artificial augmentation to improve visibility in poor nighttime conditions, illumination-invariant imaging to reduce the impact of lighting variations, and shadow mitigation to ensure consistent image clarity in bright daylight. This paper proposes a pipeline for Shadow Erosion and Nighttime Adaptability in images for automated driving applications while preserving color and texture details. The Shadow Erosion and Nighttime Adaptability pipeline is compared to the widely used CLAHE technique and evaluated based on illumination uniformity and visual perception quality metrics. The results also demonstrate a significant improvement over CLAHE, enhancing a YOLO-based drivable area segmentation algorithm.
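
For reference, the CLAHE baseline the pipeline is compared against can be sketched as below, applied to the L channel in LAB space so color is preserved. The clip limit and tile size are common defaults, not values from the paper.

```python
import cv2

def clahe_enhance(bgr_image, clip_limit=2.0, tiles=(8, 8)):
    lab = cv2.cvtColor(bgr_image, cv2.COLOR_BGR2LAB)
    l, a, b = cv2.split(lab)
    clahe = cv2.createCLAHE(clipLimit=clip_limit, tileGridSize=tiles)
    return cv2.cvtColor(cv2.merge((clahe.apply(l), a, b)), cv2.COLOR_LAB2BGR)
```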

[CV-25] Proxy-Anchor and EVT-Driven Continual Learning Method for Generalized Category Discovery

【Quick Read】: This paper addresses two key challenges of continual generalized category discovery: continuously discovering and learning novel categories from incoming data batches while avoiding catastrophic forgetting. The key is to integrate Extreme Value Theory (EVT) with proxy anchors: a probability-of-inclusion function defines boundaries around proxies to enable rejection of unknown samples, and a new EVT-based loss enhances the learned representation, outperforming other deep-metric learning methods in similar settings. Experience replay and knowledge distillation further prevent catastrophic forgetting during the continual learning stage, and an EVT-based procedure reduces the model size and discards redundant proxies to mitigate overestimation of the number of new categories during discovery. Experiments show the method outperforms state-of-the-art approaches in this scenario.

Link: https://arxiv.org/abs/2504.08550
Authors: Alireza Fathalizadeh,Roozbeh Razavi-Far
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Continual generalized category discovery has been introduced and studied in the literature as a method that aims to continuously discover and learn novel categories in incoming data batches while avoiding catastrophic forgetting of previously learned categories. A key component in addressing this challenge is the model’s ability to separate novel samples, where Extreme Value Theory (EVT) has been effectively employed. In this work, we propose a novel method that integrates EVT with proxy anchors to define boundaries around proxies using a probability of inclusion function, enabling the rejection of unknown samples. Additionally, we introduce a novel EVT-based loss function to enhance the learned representation, achieving superior performance compared to other deep-metric learning methods in similar settings. Using the derived probability functions, novel samples are effectively separated from previously known categories. However, category discovery within these novel samples can sometimes overestimate the number of new categories. To mitigate this issue, we propose a novel EVT-based approach to reduce the model size and discard redundant proxies. We also incorporate experience replay and knowledge distillation mechanisms during the continual learning stage to prevent catastrophic forgetting. Experimental results demonstrate that our proposed approach outperforms state-of-the-art methods in continual generalized category discovery scenarios.
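
A minimal sketch of an EVT-style probability-of-inclusion boundary around a proxy: fit a Weibull distribution to the largest proxy-to-sample distances of known members and treat its survival function as P(inclusion) for a new sample. The tail size and the exact parameterization are assumptions, not the paper's formulation.

```python
import numpy as np
from scipy.stats import weibull_min

def fit_inclusion(proxy, member_feats, tail=20):
    """proxy: (D,) anchor; member_feats: (N, D) features of known members."""
    d = np.linalg.norm(member_feats - proxy, axis=1)
    tail_d = np.sort(d)[-tail:]                      # extreme (largest) distances
    shape, loc, scale = weibull_min.fit(tail_d, floc=0.0)

    def p_inclusion(x):
        dist = np.linalg.norm(x - proxy)
        return weibull_min.sf(dist, shape, loc=loc, scale=scale)

    return p_inclusion  # reject a sample as novel when this is near 0
```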

[CV-26] COP-GEN-Beta: Unified Generative Modelling of COPernicus Imagery Thumbnails CVPR2025

【Quick Read】: This paper addresses the challenge of cross-modal learning in remote sensing: learning a unified representation across multimodal data from multiple sensors capturing the same scene, where traditional methods have been limited to single- or dual-modality approaches. The proposed COP-GEN-Beta is a generative diffusion model trained on optical, radar, and elevation data from the Major TOM dataset that can map any subset of modalities to any other, enabling zero-shot modality translation after training. The key is a sequence-based diffusion transformer in which each modality is controlled by its own timestep embedding, a design that allows flexible cross-modal translation. Extensive evaluation on Major TOM thumbnails shows high-quality samples and demonstrates the model's potential as a powerful pre-trained model for future remote sensing tasks.

Link: https://arxiv.org/abs/2504.08548
Authors: Miguel Espinosa,Valerio Marsocci,Yuru Jia,Elliot J. Crowley,Mikolaj Czerkawski
Affiliations: University of Edinburgh (爱丁堡大学); European Space Agency (ESA); KU Leuven; Asterisk Labs
Subjects: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at CVPR 2025 Workshop MORSE

Abstract:In remote sensing, multi-modal data from various sensors capturing the same scene offers rich opportunities, but learning a unified representation across these modalities remains a significant challenge. Traditional methods have often been limited to single or dual-modality approaches. In this paper, we introduce COP-GEN-Beta, a generative diffusion model trained on optical, radar, and elevation data from the Major TOM dataset. What sets COP-GEN-Beta apart is its ability to map any subset of modalities to any other, enabling zero-shot modality translation after training. This is achieved through a sequence-based diffusion transformer, where each modality is controlled by its own timestep embedding. We extensively evaluate COP-GEN-Beta on thumbnail images from the Major TOM dataset, demonstrating its effectiveness in generating high-quality samples. Qualitative and quantitative evaluations validate the model’s performance, highlighting its potential as a powerful pre-trained model for future remote sensing tasks.

[CV-27] Discriminator-Free Direct Preference Optimization for Video Diffusion

【Quick Read】: This paper targets two critical obstacles to applying Direct Preference Optimization (DPO) to video diffusion models: (1) data inefficiency, since each DPO iteration requires generating thousands of videos at prohibitive cost; and (2) evaluation uncertainty, since human annotation carries subjective bias and automated discriminators fail to detect subtle temporal artifacts such as flickering or motion incoherence. The key is a discriminator-free video DPO framework that uses original real videos as win cases, generates lose cases via simple edits (reversal, clip shuffling, or added noise), and trains the model to recognize and avoid the artifacts introduced by editing. This removes costly synthetic video comparisons, provides unambiguous quality signals, and allows unlimited training-data expansion through simple editing operations. The framework is theoretically shown to remain effective even when real and model-generated videos follow different distributions, and experiments on CogVideoX verify its efficiency.

Link: https://arxiv.org/abs/2504.08542
Authors: Haoran Cheng,Qide Dong,Liang Peng,Zhizhou Sha,Weiguo Feng,Jinghui Xie,Zhao Song,Shilei Wen,Xiaofei He,Boxi Wu
Affiliations: Zhejiang University; Bytedance; Tsinghua University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: arXiv admin note: text overlap with arXiv:2412.14167 by other authors

Abstract:Direct Preference Optimization (DPO), which aligns models with human preferences through win/lose data pairs, has achieved remarkable success in language and image generation. However, applying DPO to video diffusion models faces critical challenges: (1) Data inefficiency. Generating thousands of videos per DPO iteration incurs prohibitive costs; (2) Evaluation uncertainty. Human annotations suffer from subjective bias, and automated discriminators fail to detect subtle temporal artifacts like flickering or motion incoherence. To address these, we propose a discriminator-free video DPO framework that: (1) Uses original real videos as win cases and their edited versions (e.g., reversed, shuffled, or noise-corrupted clips) as lose cases; (2) Trains video diffusion models to distinguish and avoid artifacts introduced by editing. This approach eliminates the need for costly synthetic video comparisons, provides unambiguous quality signals, and enables unlimited training data expansion through simple editing operations. We theoretically prove the framework’s effectiveness even when real videos and model-generated videos follow different distributions. Experiments on CogVideoX demonstrate the efficiency of the proposed method.
zh

[CV-28] Digital Twin Catalog: A Large-Scale Photorealistic 3D Object Digital Twin Dataset CVPR2025

【速读】: The problem addressed is the lack of a large-scale, high-quality real-world dataset and benchmark for quantitatively assessing and comparing 3D object reconstruction methods, and for improving reconstruction quality through training or fine-tuning. Moreover, democratizing 3D digital twin creation requires integrating these techniques with next-generation egocentric computing platforms such as AR glasses, yet no dataset exists for evaluating 3D object reconstruction from egocentrically captured images. The key contribution is the Digital Twin Catalog (DTC) dataset, which contains 2,000 scanned digital-twin-quality 3D objects together with image sequences captured under varying lighting conditions using DSLR cameras and egocentric AR glasses, establishing a comprehensive real-world evaluation benchmark and a solid foundation for comparing and improving existing reconstruction methods.

链接: https://arxiv.org/abs/2504.08541
作者: Zhao Dong,Ka Chen,Zhaoyang Lv,Hong-Xing Yu,Yunzhi Zhang,Cheng Zhang,Yufeng Zhu,Stephen Tian,Zhengqin Li,Geordie Moffatt,Sean Christofferson,James Fort,Xiaqing Pan,Mingfei Yan,Jiajun Wu,Carl Yuheng Ren,Richard Newcombe
机构: Meta Reality Labs Research (Meta 实景实验室研究); Stanford University (斯坦福大学)
类目: Graphics (cs.GR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注: accepted to CVPR 2025 highlights

点击查看摘要

Abstract:We introduce Digital Twin Catalog (DTC), a new large-scale photorealistic 3D object digital twin dataset. A digital twin of a 3D object is a highly detailed, virtually indistinguishable representation of a physical object, accurately capturing its shape, appearance, physical properties, and other attributes. Recent advances in neural-based 3D reconstruction and inverse rendering have significantly improved the quality of 3D object reconstruction. Despite these advancements, there remains a lack of a large-scale, digital twin quality real-world dataset and benchmark that can quantitatively assess and compare the performance of different reconstruction methods, as well as improve reconstruction quality through training or fine-tuning. Moreover, to democratize 3D digital twin creation, it is essential to integrate creation techniques with next-generation egocentric computing platforms, such as AR glasses. Currently, there is no dataset available to evaluate 3D object reconstruction using egocentric captured images. To address these gaps, the DTC dataset features 2,000 scanned digital twin-quality 3D objects, along with image sequences captured under different lighting conditions using DSLR cameras and egocentric AR glasses. This dataset establishes the first comprehensive real-world evaluation benchmark for 3D digital twin creation tasks, offering a robust foundation for comparing and improving existing reconstruction methods. The DTC dataset is already released at this https URL and we will also make the baseline evaluations open-source.
zh

[CV-29] Datasets for Lane Detection in Autonomous Driving: A Comprehensive Review

【速读】: This paper systematically reviews and analyzes publicly available lane detection datasets to address the challenges and gaps in dataset selection for developing and evaluating lane detection algorithms. The key contribution is a comprehensive classification (by factors such as sensor resolution, annotation type, and diversity of road and weather conditions) that reveals the strengths, limitations, and research gaps of existing datasets, giving researchers clear guidance for choosing training and testing data. The paper also highlights directions for future dataset improvements to drive more robust lane detection and, with it, the broader advancement of autonomous driving.

链接: https://arxiv.org/abs/2504.08540
作者: Jörg Gamerdinger,Sven Teufel,Oliver Bringmann
机构: University of Tübingen (蒂宾根大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Accurate lane detection is essential for automated driving, enabling safe and reliable vehicle navigation in a variety of road scenarios. Numerous datasets have been introduced to support the development and evaluation of lane detection algorithms, each differing in terms of the amount of data, sensor types, annotation granularity, environmental conditions, and scenario diversity. This paper provides a comprehensive review of over 30 publicly available lane detection datasets, systematically analysing their characteristics, advantages and limitations. We classify these datasets based on key factors such as sensor resolution, annotation types and diversity of road and weather conditions. By identifying existing challenges and research gaps, we highlight opportunities for future dataset improvements that can further drive innovation in robust lane detection. This survey serves as a resource for researchers seeking appropriate datasets for lane detection, and contributes to the broader goal of advancing autonomous driving.
zh

[CV-30] Embodied Image Captioning: Self-supervised Learning Agents for Spatially Coherent Image Descriptions

【速读】: This work addresses the inconsistent and inaccurate captions that generative AI models produce when describing arbitrary objects in generic environments, caused by varying camera viewpoints and cluttered scenes. The key solution is a three-phase framework: first, an agent actively explores the environment to collect noisy image-caption pairs; next, a large language model distills a consistent pseudo-caption for each object instance via a consensus mechanism; finally, these pseudo-captions are used to fine-tune an existing captioning model in combination with contrastive learning. The consensus mechanism substantially improves caption accuracy and consistency across views.

链接: https://arxiv.org/abs/2504.08531
作者: Tommaso Galliena,Tommaso Apicella,Stefano Rosa,Pietro Morerio,Alessio Del Bue,Lorenzo Natale
机构: Istituto Italiano di Tecnologia (意大利技术研究院), University of Genoa (热那亚大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注: 11 pages, 8 figures, 5 tables, code and test set annotations available at this https URL

点击查看摘要

Abstract:We present a self-supervised method to improve an agent’s abilities in describing arbitrary objects while actively exploring a generic environment. This is a challenging problem, as current models struggle to obtain coherent image captions due to different camera viewpoints and clutter. We propose a three-phase framework to fine-tune existing captioning models that enhances caption accuracy and consistency across views via a consensus mechanism. First, an agent explores the environment, collecting noisy image-caption pairs. Then, a consistent pseudo-caption for each object instance is distilled via consensus using a large language model. Finally, these pseudo-captions are used to fine-tune an off-the-shelf captioning model, with the addition of contrastive learning. We analyse the performance of the combination of captioning models, exploration policies, pseudo-labeling methods, and fine-tuning strategies, on our manually labeled test set. Results show that a policy can be trained to mine samples with higher disagreement compared to classical baselines. Our pseudo-captioning method, in combination with all policies, has a higher semantic similarity compared to other existing methods, and fine-tuning improves caption accuracy and consistency by a significant margin. Code and test set annotations available at this https URL
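As a rough illustration of the consensus step, the sketch below replaces the paper's LLM-based distillation with a simpler embedding medoid: the pseudo-caption is the candidate most similar on average to all others. The toy bag-of-words embedder and all names are assumptions for demonstration only.

```python
import numpy as np

def consensus_caption(captions: list[str], embed) -> str:
    """`embed` maps a list of strings to an (N, D) array of unit vectors."""
    vecs = embed(captions)
    sims = vecs @ vecs.T                      # pairwise cosine similarities
    return captions[int(np.argmax(sims.mean(axis=1)))]   # the medoid caption

def toy_embed(texts):                         # demonstration-only embedder
    vocab = sorted({w for t in texts for w in t.lower().split()})
    m = np.array([[t.lower().split().count(w) for w in vocab] for t in texts], float)
    return m / np.maximum(np.linalg.norm(m, axis=1, keepdims=True), 1e-8)

print(consensus_caption(["a red mug on a desk", "a red mug", "a blue chair"], toy_embed))
```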
zh

[CV-31] A Hybrid Fully Convolutional CNN-Transformer Model for Inherently Interpretable Medical Image Classification

【速读】: This paper targets the poor interpretability of hybrid CNN-Vision Transformer (ViT) models in medical image classification, which limits their adoption in healthcare. The key idea is a fully convolutional CNN-Transformer hybrid that is interpretable by design: it produces faithful, localized evidence maps that directly reflect the model's decision process, making predictions transparent. In a single forward pass the model yields class-specific sparse evidence maps while matching state-of-the-art predictive performance, overcoming the interpretability shortcomings of both black-box and existing interpretable models.

链接: https://arxiv.org/abs/2504.08481
作者: Kerol Djoumessi,Samuel Ofosu Mensah,Philipp Berens
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:In many medical imaging tasks, convolutional neural networks (CNNs) efficiently extract local features hierarchically. More recently, vision transformers (ViTs) have gained popularity, using self-attention mechanisms to capture global dependencies, but lacking the inherent spatial localization of convolutions. Therefore, hybrid models combining CNNs and ViTs have been developed to combine the strengths of both architectures. However, such hybrid CNN-ViT models are difficult to interpret, which hinders their application in medical imaging. In this work, we introduce an interpretable-by-design hybrid fully convolutional CNN-Transformer architecture for medical image classification. Unlike widely used post-hoc saliency methods for ViTs, our approach generates faithful and localized evidence maps that directly reflect the model’s decision process. We evaluated our method on two medical image classification tasks using color fundus images. Our model not only achieves state-of-the-art predictive performance compared to both black-box and interpretable models but also provides class-specific sparse evidence maps in a single forward pass. The code is available at: this https URL.
zh

[CV-32] Cut-and-Splat: Leverag ing Gaussian Splatting for Synthetic Data Generation

【速读】: This paper addresses the difficulty of generating high-quality, context-aware instance segmentation training data from synthetic images, where obtaining accurate 3D models and simulating realistic lighting effects and camera artifacts are the main obstacles. The key is the novel view synthesis method Gaussian Splatting: the target object is automatically extracted from a video, rendered onto random background images via Gaussian Splatting, and placed in a believable pose using monocular depth estimation. The resulting data-generation pipeline is fully automated and needs only a video of the target object as input. A new dataset is introduced to validate the approach, which outperforms alternatives such as Cut-and-Paste and diffusion-based generation.

链接: https://arxiv.org/abs/2504.08473
作者: Bram Vanherle,Brent Zoomers,Jeroen Put,Frank Van Reeth,Nick Michiels
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at the International Conference on Robotics, Computer Vision and Intelligent Systems 2025 (ROBOVIS)

点击查看摘要

Abstract:Generating synthetic images is a useful method for cheaply obtaining labeled data for training computer vision models. However, obtaining accurate 3D models of relevant objects is necessary, and the resulting images often have a gap in realism due to challenges in simulating lighting effects and camera artifacts. We propose using the novel view synthesis method called Gaussian Splatting to address these challenges. We have developed a synthetic data pipeline for generating high-quality context-aware instance segmentation training data for specific objects. This process is fully automated, requiring only a video of the target object. We train a Gaussian Splatting model of the target object and automatically extract the object from the video. Leveraging Gaussian Splatting, we then render the object on a random background image, and monocular depth estimation is employed to place the object in a believable pose. We introduce a novel dataset to validate our approach and show superior performance over other data generation approaches, such as Cut-and-Paste and Diffusion model-based generation.
zh

[CV-33] Road Grip Uncertainty Estimation Through Surface State Segmentation

【速读】: This paper addresses reliable estimation of road grip uncertainty under slippery conditions, which is needed for safe control of autonomous vehicles. The key idea is a novel method based on road surface state segmentation: by inferring the surface condition, it estimates a pixel-wise probability distribution over grip, improving the robustness of grip uncertainty prediction.

链接: https://arxiv.org/abs/2504.08452
作者: Jyri Maanpää,Julius Pesonen,Iaroslav Melekhov,Heikki Hyyti,Juha Hyyppä
机构: Amazon(亚马逊)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 15 pages, 5 figures (supplementary material 2 pages, 1 figure). Anonymized version submitted to Scandinavian Conference on Image Analysis (SCIA) 2025

点击查看摘要

Abstract:Slippery road conditions pose significant challenges for autonomous driving. Beyond predicting road grip, it is crucial to estimate its uncertainty reliably to ensure safe vehicle control. In this work, we benchmark several uncertainty prediction methods to assess their effectiveness for grip uncertainty estimation. Additionally, we propose a novel approach that leverages road surface state segmentation to predict grip uncertainty. Our method estimates a pixel-wise grip probability distribution based on inferred road surface conditions. Experimental results indicate that the proposed approach enhances the robustness of grip uncertainty prediction.
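One natural way to realize a pixel-wise grip distribution from a surface-state segmenter is a per-pixel mixture over surface classes. The sketch below assumes illustrative per-class grip means and standard deviations; these values are not from the paper.

```python
import numpy as np

GRIP_MU  = np.array([0.9, 0.5, 0.2])     # assumed grip means: dry, wet, icy
GRIP_STD = np.array([0.05, 0.10, 0.08])  # assumed per-class spreads

def grip_distribution(seg_probs: np.ndarray):
    """seg_probs: (H, W, C) softmax output of a surface-state segmenter."""
    mean = seg_probs @ GRIP_MU                        # E[g] per pixel
    second = seg_probs @ (GRIP_STD**2 + GRIP_MU**2)   # E[g^2] per pixel
    var = second - mean**2                            # mixture variance
    return mean, np.sqrt(np.maximum(var, 0.0))        # grip and its uncertainty

probs = np.random.dirichlet(np.ones(3), size=(4, 4))  # toy 4x4 probability map
mu, sigma = grip_distribution(probs)
```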
zh

[CV-34] Muon-Accelerated Attention Distillation for Real-Time Edge Synthesis via Optimized Latent Diffusion

【速读】: This paper addresses the challenge of deploying high-fidelity visual synthesis (such as artistic style transfer and text-to-image generation) in real time on edge devices, where computation and memory are constrained. The key innovation is Muon-AD, a co-designed framework that integrates the Muon optimizer with attention distillation. By eliminating gradient conflicts through orthogonal parameter updates, combined with dynamic pruning, Muon-AD converges 3.2 times faster than Stable Diffusion-TensorRT while maintaining synthesis quality (15% lower FID, 4% higher SSIM). Mixed-precision quantization and curriculum learning reduce peak memory to 7GB on Jetson Orin and enable 24FPS real-time generation. The framework further cuts communication overhead in distributed training by 65% and achieves real-time generation at 10 seconds per image on edge GPUs, helping democratize high-quality visual synthesis in resource-constrained environments.

链接: https://arxiv.org/abs/2504.08451
作者: Weiye Chen,Qingen Zhu,Qian Long
机构: Beijing University of Posts and Telecommunications (北京邮电大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Recent advances in visual synthesis have leveraged diffusion models and attention mechanisms to achieve high-fidelity artistic style transfer and photorealistic text-to-image generation. However, real-time deployment on edge devices remains challenging due to computational and memory constraints. We propose Muon-AD, a co-designed framework that integrates the Muon optimizer with attention distillation for real-time edge synthesis. By eliminating gradient conflicts through orthogonal parameter updates and dynamic pruning, Muon-AD achieves 3.2 times faster convergence compared to Stable Diffusion-TensorRT, while maintaining synthesis quality (15% lower FID, 4% higher SSIM). Our framework reduces peak memory to 7GB on Jetson Orin and enables 24FPS real-time generation through mixed-precision quantization and curriculum learning. Extensive experiments on COCO-Stuff and ImageNet-Texture demonstrate Muon-AD’s Pareto-optimal efficiency-quality trade-offs. Here, we show a 65% reduction in communication overhead during distributed training and real-time 10s/image generation on edge GPUs. These advancements pave the way for democratizing high-quality visual synthesis in resource-constrained environments.
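For context, the orthogonal parameter updates come from the Muon optimizer, whose publicly documented core step orthogonalizes the momentum matrix with a Newton-Schulz iteration. The sketch below follows that public recipe; Muon-AD's dynamic pruning and distillation coupling are not reproduced here, and the coefficients are the commonly used quintic ones rather than anything specific to this paper.

```python
import torch

def newton_schulz_orthogonalize(g: torch.Tensor, steps: int = 5) -> torch.Tensor:
    a, b, c = 3.4445, -4.7750, 2.0315       # standard quintic coefficients
    x = g / (g.norm() + 1e-7)               # scale so the iteration converges
    if transposed := (x.shape[0] > x.shape[1]):
        x = x.T                             # work with a wide matrix
    for _ in range(steps):
        s = x @ x.T
        x = a * x + (b * s + c * s @ s) @ x
    return x.T if transposed else x

def muon_step(param, grad, momentum, lr=0.02, beta=0.95):
    momentum.mul_(beta).add_(grad)          # classic momentum accumulation
    param.sub_(lr * newton_schulz_orthogonalize(momentum))

w = torch.randn(64, 128)
m = torch.zeros_like(w)
muon_step(w, torch.randn_like(w), m)
```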
zh

[CV-35] Ego4o: Egocentric Human Motion Capture and Understanding from Multi-Modal Input

【速读】: This work focuses on tracking and understanding human motion with consumer wearable devices such as VR/AR headsets, smart glasses, phones, and smartwatches. These devices provide multi-modal sensor input, including egocentric images and 1-3 sparse IMU sensors, and the signals may be only intermittently available, which makes consistent motion capture and understanding difficult. To address this, the paper presents Ego4o (o for omni), a new framework for simultaneous human motion capture and understanding from multi-modal egocentric input that maintains performance under partial input and achieves better results when multiple modalities are combined. The key steps: IMU input, an optional egocentric image, and a text description of the motion are first encoded into the latent space of a motion VQ-VAE; the latent vectors are then passed through the VQ-VAE decoder and optimized to track human motion. When motion descriptions are unavailable, the latents can be fed to a multi-modal LLM to generate motion descriptions, which in turn further improve capture accuracy. Quantitative and qualitative evaluations demonstrate the method's effectiveness in predicting accurate human motion and high-quality motion descriptions.

链接: https://arxiv.org/abs/2504.08449
作者: Jian Wang,Rishabh Dabral,Diogo Luvizon,Zhe Cao,Lingjie Liu,Thabo Beeler,Christian Theobalt
机构: MPI Informatics (马克斯·普朗克计算机科学研究所) & Saarland Informatics Campus (萨尔州大学计算机科学校区); Google(谷歌); University of Pennsylvania (宾夕法尼亚大学); Saarbrücken Research Center for Visual Computing, Interaction and Artificial Intelligence (萨尔布吕肯视觉计算、交互与人工智能研究中心)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:This work focuses on tracking and understanding human motion using consumer wearable devices, such as VR/AR headsets, smart glasses, cellphones, and smartwatches. These devices provide diverse, multi-modal sensor inputs, including egocentric images, and 1-3 sparse IMU sensors in varied combinations. Motion descriptions can also accompany these signals. The diverse input modalities and their intermittent availability pose challenges for consistent motion capture and understanding. In this work, we present Ego4o (o for omni), a new framework for simultaneous human motion capture and understanding from multi-modal egocentric inputs. This method maintains performance with partial inputs while achieving better results when multiple modalities are combined. First, the IMU sensor inputs, the optional egocentric image, and text description of human motion are encoded into the latent space of a motion VQ-VAE. Next, the latent vectors are sent to the VQ-VAE decoder and optimized to track human motion. When motion descriptions are unavailable, the latent vectors can be input into a multi-modal LLM to generate human motion descriptions, which can further enhance motion capture accuracy. Quantitative and qualitative evaluations demonstrate the effectiveness of our method in predicting accurate human motion and high-quality motion descriptions.
zh

[CV-36] SARFormer – An Acquisition Parameter Aware Vision Transformer for Synthetic Aperture Radar Data

【速读】: This paper targets the challenges that the complex imaging geometry of synthetic aperture radar (SAR) poses for image processing, particularly improving downstream tasks such as height reconstruction and segmentation when multiple SAR images are available. The key innovation is an acquisition parameter encoding module that significantly guides the learning process, especially in the multi-image case. The work further explores self-supervised pre-training, experiments under limited labeled data, and thorough ablation studies that benchmark the contribution and its adaptations against a baseline. The proposed approach achieves up to a 17% RMSE improvement over baseline models.

链接: https://arxiv.org/abs/2504.08441
作者: Jonathan Prexl,Michael Recla,Michael Schmitt
机构: University of the Bundeswehr Munich (德国联邦国防军大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:This manuscript introduces SARFormer, a modified Vision Transformer (ViT) architecture designed for processing one or multiple synthetic aperture radar (SAR) images. Given the complex image geometry of SAR data, we propose an acquisition parameter encoding module that significantly guides the learning process, especially in the case of multiple images, leading to improved performance on downstream tasks. We further explore self-supervised pre-training, conduct experiments with limited labeled data, and benchmark our contribution and adaptations thoroughly in ablation experiments against a baseline, where the model is tested on tasks such as height reconstruction and segmentation. Our approach achieves up to 17% improvement in terms of RMSE over baseline models
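A minimal form of such an acquisition-parameter encoding module is a small MLP that embeds the metadata vector and adds it to every patch token. The parameter set, dimensions, and names below are illustrative assumptions, not SARFormer's actual design details.

```python
import torch
import torch.nn as nn

class AcquisitionParamEncoder(nn.Module):
    def __init__(self, n_params: int = 4, dim: int = 768):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(n_params, dim), nn.GELU(),
                                 nn.Linear(dim, dim))

    def forward(self, tokens: torch.Tensor, params: torch.Tensor) -> torch.Tensor:
        # tokens: (B, N, dim) patch embeddings; params: (B, n_params), e.g.
        # incidence angle, heading, orbit direction, slant range (assumed).
        return tokens + self.mlp(params).unsqueeze(1)  # broadcast over patches

enc = AcquisitionParamEncoder()
out = enc(torch.randn(2, 196, 768), torch.randn(2, 4))
```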
zh

[CV-37] he Composite Visual-Laser Navigation Method Applied in Indoor Poultry Farming Environments

【速读】: This paper addresses the poor performance of traditional single-sensor navigation in indoor poultry farms, whose complex conditions (intensely illuminated areas and water accumulation) cause laser drift and inaccurate extraction of visual navigation lines. The key solution is a novel composite navigation method that evaluates the real-time reliability of the laser and vision modalities and dynamically computes a fused yaw angle, enabling precise navigation without physical navigation lines. Experimental validation in real poultry houses shows that the method not only overcomes the inherent drawbacks of single-sensor systems but also significantly improves navigation precision and operational efficiency.

链接: https://arxiv.org/abs/2504.08431
作者: Jiafan Lu,Dongcheng Hu,Yitian Ye,Anqi Liu,Zixian Zhang,Xin Peng
机构: East China University of Science and Technology (华东理工大学); Key Laboratory of Smart Manufacturing in Energy Chemical Process Ministry of Education (能源化工过程智能化教育部重点实验室)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Indoor poultry farms require inspection robots to maintain precise environmental control, which is crucial for preventing the rapid spread of disease and large-scale bird mortality. However, the complex conditions within these facilities, characterized by areas of intense illumination and water accumulation, pose significant challenges. Traditional navigation methods that rely on a single sensor often perform poorly in such environments, resulting in issues like laser drift and inaccuracies in visual navigation line extraction. To overcome these limitations, we propose a novel composite navigation method that integrates both laser and vision technologies. This approach dynamically computes a fused yaw angle based on the real-time reliability of each sensor modality, thereby eliminating the need for physical navigation lines. Experimental validation in actual poultry house environments demonstrates that our method not only resolves the inherent drawbacks of single-sensor systems, but also significantly enhances navigation precision and operational efficiency. As such, it presents a promising solution for improving the performance of inspection robots in complex indoor poultry farming settings.
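The reliability-weighted yaw fusion can be written in a few lines. The sketch below assumes non-negative reliability scores produced elsewhere; fusing on the unit circle keeps the angle average correct across the +/-pi wrap-around.

```python
import math

def fuse_yaw(yaw_laser, yaw_vision, r_laser, r_vision):
    """r_* are real-time reliability scores for each sensor (assumed given)."""
    w = r_laser + r_vision
    if w == 0:
        raise ValueError("both sensors unreliable")
    x = (r_laser * math.cos(yaw_laser) + r_vision * math.cos(yaw_vision)) / w
    y = (r_laser * math.sin(yaw_laser) + r_vision * math.sin(yaw_vision)) / w
    return math.atan2(y, x)

print(fuse_yaw(0.10, 0.30, r_laser=0.8, r_vision=0.2))  # leans toward the laser
```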
zh

[CV-38] CMIP-CIL: A Cross-Modal Benchmark for Image-Point Class Incremental Learning

【速读】: This paper tackles cross-modal catastrophic forgetting in image-point class incremental learning. Traditional methods work in unimodal settings but fail in cross-modal cases, while some approaches handle modality differences within training/testing data yet assume no gap between modalities. The paper introduces a cross-modal incremental learning benchmark, CMIP-CIL, and relieves the forgetting problem. The key is pre-training with masked point clouds and multi-view rendered images in a contrastive learning framework, equipping the vision model with generalizable image-point correspondences. In the incremental stage, freezing the backbone and pulling object representations toward their respective prototypes lets the model retain and generalize knowledge of previously seen categories while continuing to learn new ones, enabling continual knowledge accumulation. Experiments on the benchmark datasets show state-of-the-art results, surpassing baselines by a large margin.

链接: https://arxiv.org/abs/2504.08422
作者: Chao Qi,Jianqin Yin,Ren Zhang
机构: School of Intelligent Engineering and Automation, Beijing University of Posts and Telecommunications (北京邮电大学智能工程与自动化学院), China.
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Image-point class incremental learning helps the 3D-points-vision robots continually learn category knowledge from 2D images, improving their perceptual capability in dynamic environments. However, some incremental learning methods address unimodal forgetting but fail in cross-modal cases, while others handle modal differences within training/testing datasets but assume no modal gaps between them. We first explore this cross-modal task, proposing a benchmark CMIP-CIL and relieving the cross-modal catastrophic forgetting problem. It employs masked point clouds and rendered multi-view images within a contrastive learning framework in pre-training, empowering the vision model with the generalizations of image-point correspondence. In the incremental stage, by freezing the backbone and promoting object representations close to their respective prototypes, the model effectively retains and generalizes knowledge across previously seen categories while continuing to learn new ones. We conduct comprehensive experiments on the benchmark datasets. Experiments prove that our method achieves state-of-the-art results, outperforming the baseline methods by a large margin.
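The incremental stage described above reduces, in its simplest form, to nearest-prototype classification over frozen-backbone features. The sketch below omits the paper's regularization and prototype adjustment details; all names are illustrative.

```python
import torch

class PrototypeClassifier:
    def __init__(self):
        self.protos = {}                         # class id -> (D,) prototype

    def add_class(self, cid: int, feats: torch.Tensor):
        """feats: (N, D) frozen-backbone features of one new class."""
        self.protos[cid] = feats.mean(dim=0)     # mean feature as prototype

    def predict(self, feat: torch.Tensor) -> int:
        ids = list(self.protos)
        dists = torch.stack([(feat - self.protos[i]).norm() for i in ids])
        return ids[int(torch.argmin(dists))]     # nearest prototype wins

clf = PrototypeClassifier()
clf.add_class(0, torch.randn(20, 512))
clf.add_class(1, torch.randn(20, 512) + 2.0)
print(clf.predict(torch.randn(512) + 2.0))       # most likely class 1
```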
zh

[CV-39] GeoTexBuild: 3D Building Model Generation from Map Footprints

【速读】: This paper aims to eliminate the structural ambiguity that arises when existing techniques generate 3D building models from a single facade photo. The key is the GeoTexBuild framework, which uses a three-stage generation process (height map generation, geometry reconstruction, and appearance stylization) and integrates customized ControlNet and Text2Mesh models to control geometric and visual attributes during generation, ensuring detailed and accurate results. The framework markedly reduces manual labor in modeling buildings and can offer inspiration to designers.

链接: https://arxiv.org/abs/2504.08419
作者: Ruizhe Wang,Junyan Yang,Qiao Wang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 16 pages (excluding references), 10 figures

点击查看摘要

Abstract:We introduce GeoTexBuild, a modular generative framework for creating 3D building models from map footprints. The proposed framework employs a three-stage process comprising height map generation, geometry reconstruction, and appearance stylization, culminating in building models with intricate geometry and appearance attributes. By integrating customized ControlNet and Text2Mesh models, we explore effective methods for controlling both geometric and visual attributes during the generation process. By this, we eliminate the problem of structural variations behind a single facade photo of the existing 3D generation techniques. Experimental results at each stage validate the capability of GeoTexBuild to generate detailed and accurate building models from footprints derived from site planning or map designs. Our framework significantly reduces manual labor in modeling buildings and can offer inspiration for designers.
zh

[CV-40] Adversarial Examples in Environment Perception for Automated Driving (Review)

【速读】: This paper addresses the vulnerability of neural networks to adversarial examples in deep-learning-based automated driving. Adversarial perturbations are imperceptible to humans yet can cause false predictions, posing a major risk to AI applications for automated driving. The key contribution is a systematic review of the past decade of adversarial robustness research, covering attack and defense methods and their applications in automated driving, to advance trustworthy AI in this domain.

链接: https://arxiv.org/abs/2504.08414
作者: Jun Yan,Huilin Yin
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: One chapter of upcoming Springer book: Recent Advances in Autonomous Vehicle Technology, 2025

点击查看摘要

Abstract:The renaissance of deep learning has led to the massive development of automated driving. However, deep neural networks are vulnerable to adversarial examples. The perturbations of adversarial examples are imperceptible to human eyes but can lead to the false predictions of neural networks. It poses a huge risk to artificial intelligence (AI) applications for automated driving. This survey systematically reviews the development of adversarial robustness research over the past decade, including the attack and defense methods and their applications in automated driving. The growth of automated driving pushes forward the realization of trustworthy AI applications. This review lists significant references in the research history of adversarial examples.
zh

[CV-41] Boosting the Class-Incremental Learning in 3D Point Clouds via Zero-Collection-Cost Basic Shape Pre-Training

【速读】: This paper addresses the performance degradation of exemplar-free class-incremental learning on 3D point clouds, where pre-training datasets are limited and fine-grained geometric detail is under-exploited; existing pre-trained-model methods that achieve state-of-the-art results in the 2D domain cannot be transferred directly to 3D. To break through these limitations, the paper proposes a basic-shape dataset with zero collection cost for model pre-training, letting a model acquire rich knowledge of 3D geometries. Building on this, it introduces an incremental learning framework embedded with 3D geometry knowledge, compatible with both exemplar-free and exemplar-based settings. In the incremental stage, the geometry knowledge is extended to represent point-cloud objects; class prototypes are computed by regularizing representations of the same category and are continually adjusted during learning, helping the model remember the shape features of different categories. Experiments show large gains over baseline methods on various benchmark datasets.

链接: https://arxiv.org/abs/2504.08412
作者: Chao Qi,Jianqin Yin,Meng Chen,Yingchun Niu,Yuan Sun
机构: School of Intelligent Engineering and Automation, Beijing University of Posts and Telecommunications (北京邮电大学智能工程与自动化学院); School of Electronic Engineering, Beijing University of Posts and Telecommunications (北京邮电大学电子工程学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Existing class-incremental learning methods in 3D point clouds rely on exemplars (samples of former classes) to resist the catastrophic forgetting of models, and exemplar-free settings will greatly degrade the performance. For exemplar-free incremental learning, the pre-trained model methods have achieved state-of-the-art results in 2D domains. However, these methods cannot be migrated to the 3D domains due to the limited pre-training datasets and insufficient focus on fine-grained geometric details. This paper breaks through these limitations, proposing a basic shape dataset with zero collection cost for model pre-training. It helps a model obtain extensive knowledge of 3D geometries. Based on this, we propose a framework embedded with 3D geometry knowledge for incremental learning in point clouds, compatible with both exemplar-free and exemplar-based settings. In the incremental stage, the geometry knowledge is extended to represent objects in point clouds. The class prototype is calculated by regularizing the data representation with the same category and is kept adjusting in the learning process. It helps the model remember the shape features of different categories. Experiments show that our method outperforms other baseline methods by a large margin on various benchmark datasets, considering both exemplar-free and exemplar-based settings.
zh

[CV-42] A Knowledge-guided Adversarial Defense for Resisting Malicious Visual Manipulation

【速读】: This work aims to counter the serious threats that malicious visual manipulation poses to the security and reputation of users in many fields. Existing adversarial-noise defenses are largely "data-only": they distort fake samples in the low-level feature space rather than the high-level semantic space, limiting their ability to resist malicious manipulation. The key idea is a knowledge-guided adversarial defense (KGAD) that integrates knowledge from deep learning into the defense, actively forcing malicious manipulation models to output semantically confusing samples. Specifically, when generating adversarial noise, KGAD constructs significant semantic confusions at the level of domain-specific knowledge and replaces generic pixel-wise metrics with a metric closely related to visual perception. The generated noise actively interferes with the manipulation model by triggering knowledge-guided and perception-related disruptions in the fake samples. Qualitative and quantitative experiments on human perception and visual quality assessment across two tasks show better protection than state-of-the-art methods and strong generalizability.

链接: https://arxiv.org/abs/2504.08411
作者: Dawei Zhou,Suzhi Gang,Decheng Liu,Tongliang Liu,Nannan Wang,Xinbo Gao
机构: Xidian University (西安电子科技大学); The University of Sydney (悉尼大学); Chongqing University of Posts and Telecommunications (重庆邮电大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Malicious applications of visual manipulation have raised serious threats to the security and reputation of users in many fields. To alleviate these issues, adversarial noise-based defenses have been enthusiastically studied in recent years. However, "data-only" methods tend to distort fake samples in the low-level feature space rather than the high-level semantic space, leading to limitations in resisting malicious manipulation. Frontier research has shown that integrating knowledge in deep learning can produce reliable and generalizable solutions. Inspired by these, we propose a knowledge-guided adversarial defense (KGAD) to actively force malicious manipulation models to output semantically confusing samples. Specifically, in the process of generating adversarial noise, we focus on constructing significant semantic confusions at the domain-specific knowledge level, and exploit a metric closely related to visual perception to replace the general pixel-wise metrics. The generated adversarial noise can actively interfere with the malicious manipulation model by triggering knowledge-guided and perception-related disruptions in the fake samples. To validate the effectiveness of the proposed method, we conduct qualitative and quantitative experiments on human perception and visual quality assessment. The results on two different tasks both show that our defense provides better protection compared to state-of-the-art methods and achieves great generalizability.
zh

[CV-43] PMNI: Pose-free Multi-view Normal Integration for Reflective and Textureless Surface Reconstruction

【速读】: This paper addresses the difficulty of accurately reconstructing textureless and reflective surfaces in multi-view 3D reconstruction, where camera pose calibration and shape recovery commonly fail for lack of reliable cross-view visual features. The key is PMNI (Pose-free Multi-view Normal Integration), which exploits the rich geometric information of surface normal maps instead of RGB images and enforces geometric constraints from normals together with multi-view shape consistency inside a neural signed distance function (SDF) optimization framework, simultaneously recovering accurate camera poses and high-fidelity surface geometry.

链接: https://arxiv.org/abs/2504.08410
作者: Mingzhi Pei,Xu Cao,Xiangyi Wang,Heng Guo,Zhanyu Ma
机构: Beijing University of Posts and Telecommunications (北京邮电大学); Independent Researcher (独立研究员)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Reflective and textureless surfaces remain a challenge in multi-view 3D reconstruction. Camera pose calibration and shape reconstruction often fail due to insufficient or unreliable cross-view visual features. To address these issues, we present PMNI (Pose-free Multi-view Normal Integration), a neural surface reconstruction method that incorporates rich geometric information by leveraging surface normal maps instead of RGB images. By enforcing geometric constraints from surface normals and multi-view shape consistency within a neural signed distance function (SDF) optimization framework, PMNI simultaneously recovers accurate camera poses and high-fidelity surface geometry. Experimental results on synthetic and real-world datasets show that our method achieves state-of-the-art performance in the reconstruction of reflective surfaces, even without reliable initial camera poses.
zh

[CV-44] Light-YOLOv8-Flame: A Lightweight High-Performance Flame Detection Algorithm

【速读】: This paper addresses the high computational cost and response latency that traditional flame detection algorithms, especially computer-vision-based ones, face in real-time systems. To overcome these limitations, it proposes Light-YOLOv8-Flame, a lightweight flame detection algorithm designed for fast, efficient real-time deployment. The key change replaces the original C2f module in the YOLOv8 architecture with the FasterNet Block, which combines Partial Convolution (PConv) and Convolution (Conv) layers, significantly reducing computational complexity and model size while improving detection performance and speed.

链接: https://arxiv.org/abs/2504.08389
作者: Jiawei Lan,Zhibiao Wang,Haoyang Yu,Ye Tao,Wenhua Cui
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 12 pages, 19 figures, 6 tables. Submitted to Engineering Letters

点击查看摘要

Abstract:Fire detection algorithms, particularly those based on computer vision, encounter significant challenges such as high computational costs and delayed response times, which hinder their application in real-time systems. To address these limitations, this paper introduces Light-YOLOv8-Flame, a lightweight flame detection algorithm specifically designed for fast and efficient real-time deployment. The proposed model enhances the YOLOv8 architecture through the substitution of the original C2f module with the FasterNet Block module. This new block combines Partial Convolution (PConv) and Convolution (Conv) layers, reducing both computational complexity and model size. A dataset comprising 7,431 images, representing both flame and non-flame scenarios, was collected and augmented for training purposes. Experimental findings indicate that the modified YOLOv8 model achieves a 0.78% gain in mean average precision (mAP) and a 2.05% boost in recall, while reducing the parameter count by 25.34%, with only a marginal decrease in precision by 0.82%. These findings highlight that Light-YOLOv8-Flame offers enhanced detection performance and speed, making it well-suited for real-time fire detection on resource-constrained devices.
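For reference, a FasterNet-style block applies a 3x3 convolution to only a fraction of the channels (PConv) and mixes channels with two 1x1 convolutions. The sketch below is an illustrative re-implementation using the common 1/4 channel split, not the paper's code.

```python
import torch
import torch.nn as nn

class PConv(nn.Module):
    def __init__(self, dim: int, n_div: int = 4):
        super().__init__()
        self.dim_conv = dim // n_div
        self.conv = nn.Conv2d(self.dim_conv, self.dim_conv, 3, padding=1, bias=False)

    def forward(self, x):
        x1, x2 = torch.split(x, [self.dim_conv, x.shape[1] - self.dim_conv], dim=1)
        return torch.cat([self.conv(x1), x2], dim=1)   # untouched channels pass through

class FasterNetBlock(nn.Module):
    def __init__(self, dim: int, expand: int = 2):
        super().__init__()
        self.pconv = PConv(dim)
        self.mlp = nn.Sequential(
            nn.Conv2d(dim, dim * expand, 1, bias=False),
            nn.BatchNorm2d(dim * expand), nn.ReLU(inplace=True),
            nn.Conv2d(dim * expand, dim, 1, bias=False))

    def forward(self, x):
        return x + self.mlp(self.pconv(x))             # residual connection

y = FasterNetBlock(64)(torch.randn(1, 64, 32, 32))
```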
zh

[CV-45] MineWorld: a Real-Time and Open-Source Interactive World Model on Minecraft

【速读】: This paper addresses world modeling, a core challenge for intelligent agents that must interact effectively with humans and operate in dynamic environments. The key contribution is MineWorld, a real-time interactive world model for Minecraft built on a visual-action autoregressive Transformer. Game scenes and their corresponding actions are converted into discrete token ids and interleaved as model input; next-token prediction then teaches the model rich representations of game states together with the conditioning between states and actions. At inference, a novel parallel decoding algorithm predicts the spatially redundant tokens of each frame simultaneously, letting models of different scales generate 4 to 7 frames per second and enabling real-time interaction with players. For evaluation, the paper proposes new metrics that assess not only visual quality but also how well generated scenes follow the given actions, a crucial property of world models. Comprehensive evaluation shows MineWorld significantly outperforms existing open-source diffusion-based world models.

链接: https://arxiv.org/abs/2504.08388
作者: Junliang Guo,Yang Ye,Tianyu He,Haoyu Wu,Yushu Jiang,Tim Pearce,Jiang Bian
机构: Microsoft Research (微软研究)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Technical report. Project page this https URL

点击查看摘要

Abstract:World modeling is a crucial task for enabling intelligent agents to effectively interact with humans and operate in dynamic environments. In this work, we propose MineWorld, a real-time interactive world model on Minecraft, an open-ended sandbox game which has been utilized as a common testbed for world modeling. MineWorld is driven by a visual-action autoregressive Transformer, which takes paired game scenes and corresponding actions as input, and generates consequent new scenes following the actions. Specifically, by transforming visual game scenes and actions into discrete token ids with an image tokenizer and an action tokenizer correspondingly, we compose the model input as the concatenation of the two kinds of ids, interleaved. The model is then trained with next token prediction to learn rich representations of game states as well as the conditions between states and actions simultaneously. In inference, we develop a novel parallel decoding algorithm that predicts the spatial redundant tokens in each frame at the same time, letting models in different scales generate 4 to 7 frames per second and enabling real-time interactions with game players. In evaluation, we propose new metrics to assess not only visual quality but also the action following capacity when generating new scenes, which is crucial for a world model. Our comprehensive evaluation shows the efficacy of MineWorld, outperforming SoTA open-sourced diffusion based world models significantly. The code and model have been released.
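The interleaved input format is straightforward to construct. The sketch below assumes action ids are offset into a separate range of the shared vocabulary; the offset value and layout are illustrative, not MineWorld's actual tokenizer settings.

```python
ACTION_OFFSET = 8192   # assumed: action ids live above the image-token ids

def interleave(scene_tokens: list[list[int]], action_tokens: list[list[int]]):
    """scene_tokens[t] holds frame t's image-token ids; its action follows it."""
    seq = []
    for frame, action in zip(scene_tokens, action_tokens):
        seq.extend(frame)
        seq.extend(a + ACTION_OFFSET for a in action)
    return seq  # train a causal transformer on this with next-token prediction

print(interleave([[1, 2, 3], [4, 5, 6]], [[0], [2]]))
# [1, 2, 3, 8192, 4, 5, 6, 8194]
```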
zh

[CV-46] owards Efficient and Robust Moment Retrieval System: A Unified Framework for Multi-Granularity Models and Temporal Reranking

【速读】: This paper addresses the challenges of long-form video understanding in interactive retrieval systems, where existing methods cannot process lengthy video content efficiently. The key innovations are: (1) an ensemble search strategy integrating coarse-grained (CLIP) and fine-grained (BEIT3) models to improve retrieval accuracy; (2) a storage optimization technique that reduces redundancy by selecting representative keyframes via TransNetV2 and deduplication; (3) a temporal search mechanism that localizes video segments using dual queries for the start and end points; and (4) a temporal reranking approach that leverages neighboring-frame context to stabilize rankings. These innovations substantially improve retrieval precision, efficiency, and user interpretability on known-item search and question-answering tasks.

链接: https://arxiv.org/abs/2504.08384
作者: Huu-Loc Tran,Tinh-Anh Nguyen-Nhu,Huu-Phong Phan-Nguyen,Tien-Huy Nguyen,Nhat-Minh Nguyen-Dich,Anh Dao,Huy-Duc Do,Quan Nguyen,Hoang M. Le,Quang-Vinh Dinh
机构: University of Information Technology, VNU-HCM, Vietnam (越南胡志明市国家大学信息技术大学); Ho Chi Minh University of Technology, VNU-HCM, Vietnam (越南胡志明市国家大学技术大学); Hanoi University of Science and Technology, Hanoi, Vietnam (河内科技大学); Michigan State University, USA (美国密歇根州立大学); National Economics University, Hanoi, Vietnam (越南国家经济大学); Posts and Telecommunications Institute of Technology, Hanoi, Vietnam (越南邮电技术学院); York University, Canada (加拿大约克大学); AI VIETNAM Lab (AI VIETNAM实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Long-form video understanding presents significant challenges for interactive retrieval systems, as conventional methods struggle to process extensive video content efficiently. Existing approaches often rely on single models, inefficient storage, unstable temporal search, and context-agnostic reranking, limiting their effectiveness. This paper presents a novel framework to enhance interactive video retrieval through four key innovations: (1) an ensemble search strategy that integrates coarse-grained (CLIP) and fine-grained (BEIT3) models to improve retrieval accuracy, (2) a storage optimization technique that reduces redundancy by selecting representative keyframes via TransNetV2 and deduplication, (3) a temporal search mechanism that localizes video segments using dual queries for start and end points, and (4) a temporal reranking approach that leverages neighboring frame context to stabilize rankings. Evaluated on known-item search and question-answering tasks, our framework demonstrates substantial improvements in retrieval precision, efficiency, and user interpretability, offering a robust solution for real-world interactive video retrieval applications.
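Innovation (1) can be approximated by normalizing each model's similarity scores and mixing them. The sketch below uses min-max normalization and an equal-weight default; the weighting scheme is an assumption, not the paper's tuned configuration.

```python
import numpy as np

def ensemble_scores(clip_scores, beit3_scores, w_clip: float = 0.5):
    def norm(s):
        s = np.asarray(s, dtype=float)
        span = s.max() - s.min()
        return (s - s.min()) / span if span > 0 else np.zeros_like(s)
    return w_clip * norm(clip_scores) + (1 - w_clip) * norm(beit3_scores)

fused = ensemble_scores([0.31, 0.28, 0.35], [0.62, 0.70, 0.66])
ranking = np.argsort(-fused)   # best-matching keyframes first
```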
zh

[CV-47] In-2-4D: Inbetweening from Two Single-View Images to 4D Generation

【速读】: This paper introduces a new problem, In-2-4D: generative 4D (3D plus motion) inbetweening from a minimalistic input of two single-view images capturing an object in two distinct motion states. Given images of the start and end states, the goal is to generate and reconstruct the object's full motion in 4D. Because large frame-to-frame motion makes video-interpolation models ambiguous, the key is a hierarchical approach that identifies keyframes visually close to the input states yet showing significant motion, and generates smooth fragments between them. For each fragment, the keyframe's 3D representation is built with Gaussian Splatting, and the temporal frames guide the motion, turning it into dynamic Gaussians through a deformation field. Temporal consistency and 3D motion are further improved by extending the self-attention of multi-view diffusion across timesteps and applying rigid-transformation regularization. Finally, the independently generated 3D motion segments are merged by interpolating boundary deformation fields and optimizing them to align with the guiding video, ensuring smooth, flicker-free transitions. Extensive experiments and a user study validate the method and its components.

链接: https://arxiv.org/abs/2504.08366
作者: Sauradip Nag,Daniel Cohen-Or,Hao Zhang,Ali Mahdavi-Amiri
机构: 未知
类目: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
备注: Technical Report

点击查看摘要

Abstract:We propose a new problem, In-2-4D, for generative 4D (i.e., 3D + motion) inbetweening from a minimalistic input setting: two single-view images capturing an object in two distinct motion states. Given two images representing the start and end states of an object in motion, our goal is to generate and reconstruct the motion in 4D. We utilize a video interpolation model to predict the motion, but large frame-to-frame motions can lead to ambiguous interpretations. To overcome this, we employ a hierarchical approach to identify keyframes that are visually close to the input states and show significant motion, then generate smooth fragments between them. For each fragment, we construct the 3D representation of the keyframe using Gaussian Splatting. The temporal frames within the fragment guide the motion, enabling their transformation into dynamic Gaussians through a deformation field. To improve temporal consistency and refine 3D motion, we expand the self-attention of multi-view diffusion across timesteps and apply rigid transformation regularization. Finally, we merge the independently generated 3D motion segments by interpolating boundary deformation fields and optimizing them to align with the guiding video, ensuring smooth and flicker-free transitions. Through extensive qualitative and quantitative experiments as well as a user study, we show the effectiveness of our method and its components. The project page is available at this https URL
zh

[CV-48] SN-LiDAR: Semantic Neural Fields for Novel Space-time View LiDAR Synthesis

【速读】: This paper addresses the absence of semantic labels in novel view synthesis (NVS) for LiDAR point clouds, a key requirement of downstream applications such as autonomous driving and robotic perception. Most existing methods cannot reconstruct semantic labels, and LiDAR point clouds lack the powerful pre-trained segmentation models available for images, making semantic annotation time-consuming and labor-intensive. The key is SN-LiDAR, which uses a coarse-to-fine planar-grid feature representation to extract global features from multi-frame point clouds and a CNN-based encoder to extract local semantic features from the current-frame point cloud, jointly optimizing semantic segmentation, geometric reconstruction, and LiDAR synthesis. Experiments confirm SN-LiDAR's superiority in both semantic and geometric reconstruction.

链接: https://arxiv.org/abs/2504.08361
作者: Yi Chen,Tianchen Deng,Wentao Zhao,Xiaoning Wang,Wenqian Xi,Weidong Chen,Jingchuan Wang
机构: Shanghai Jiao Tong University (上海交通大学); Ruijin Hospital, Shanghai Jiao Tong University School of Medicine (上海交通大学医学院附属瑞金医院); Renji Hospital, Shanghai Jiao Tong University School of Medicine (上海交通大学医学院附属仁济医院); Ministry of Education, China (中国教育部)
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注:

点击查看摘要

Abstract:Recent research has begun exploring novel view synthesis (NVS) for LiDAR point clouds, aiming to generate realistic LiDAR scans from unseen viewpoints. However, most existing approaches do not reconstruct semantic labels, which are crucial for many downstream applications such as autonomous driving and robotic perception. Unlike images, which benefit from powerful segmentation models, LiDAR point clouds lack such large-scale pre-trained models, making semantic annotation time-consuming and labor-intensive. To address this challenge, we propose SN-LiDAR, a method that jointly performs accurate semantic segmentation, high-quality geometric reconstruction, and realistic LiDAR synthesis. Specifically, we employ a coarse-to-fine planar-grid feature representation to extract global features from multi-frame point clouds and leverage a CNN-based encoder to extract local semantic features from the current frame point cloud. Extensive experiments on SemanticKITTI and KITTI-360 demonstrate the superiority of SN-LiDAR in both semantic and geometric reconstruction, effectively handling dynamic objects and large-scale scenes. Codes will be available on this https URL.
zh

[CV-49] LMM4LMM: Benchmarking and Evaluating Large-multimodal Image Generation with LMMs

【速读】: This paper addresses the perceptual quality and text-image alignment issues of images produced by large multimodal models (LMMs) in text-to-image (T2I) generation, as well as the high cost and inefficiency of manual evaluation. The key solution is the EvalMi-50K dataset and benchmark, featuring comprehensive task designs (2,100 extensive prompts across 20 fine-grained task dimensions) and large-scale human preference annotations (100K mean opinion scores (MOS) and 50K question-answering (QA) pairs). Building on this dataset, the paper proposes LMM4LMM, an LMM-based metric that evaluates T2I generation along multiple dimensions, including perceptual quality, text-image correspondence, and task-specific accuracy. Experiments show that LMM4LMM achieves state-of-the-art performance on EvalMi-50K and generalizes strongly to other AI-generated-image evaluation benchmarks, demonstrating the generality of both the dataset and the metric.

链接: https://arxiv.org/abs/2504.08358
作者: Jiarui Wang,Huiyu Duan,Yu Zhao,Juntong Wang,Guangtao Zhai,Xiongkuo Min
机构: Institute of Image Communication and Network Engineering (图像通信与网络工程研究所); MoE Key Lab of Artificial Intelligence (教育部人工智能重点实验室), AI Institute (人工智能研究院), Shanghai Jiao Tong University (上海交通大学), Shanghai, China
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Recent breakthroughs in large multimodal models (LMMs) have significantly advanced both text-to-image (T2I) generation and image-to-text (I2T) interpretation. However, many generated images still suffer from issues related to perceptual quality and text-image alignment. Given the high cost and inefficiency of manual evaluation, an automatic metric that aligns with human preferences is desirable. To this end, we present EvalMi-50K, a comprehensive dataset and benchmark for evaluating large-multimodal image generation, which features (i) comprehensive tasks, encompassing 2,100 extensive prompts across 20 fine-grained task dimensions, and (ii) large-scale human-preference annotations, including 100K mean-opinion scores (MOSs) and 50K question-answering (QA) pairs annotated on 50,400 images generated from 24 T2I models. Based on EvalMi-50K, we propose LMM4LMM, an LMM-based metric for evaluating large multimodal T2I generation from multiple dimensions including perception, text-image correspondence, and task-specific accuracy. Extensive experimental results show that LMM4LMM achieves state-of-the-art performance on EvalMi-50K, and exhibits strong generalization ability on other AI-generated image evaluation benchmark datasets, manifesting the generality of both the EvalMi-50K dataset and LMM4LMM metric. Both EvalMi-50K and LMM4LMM will be released at this https URL.
zh

[CV-50] Single View Garment Reconstruction Using Diffusion Mapping Via Pattern Coordinates

【速读】: This work addresses the open challenge of accurately reconstructing garment geometry, especially loose-fitting clothing, when recovering high-fidelity 3D clothed humans from a single image. The method combines Implicit Sewing Patterns (ISP) with a generative diffusion model to learn rich garment shape priors in a 2D UV space. The key innovation is a mapping model that establishes correspondences between 2D image pixels, UV pattern coordinates, and 3D geometry, enabling joint optimization of 3D garment meshes and their corresponding 2D patterns by aligning the learned priors with image observations. Although trained exclusively on synthetically simulated cloth data, the method generalizes effectively to real-world images and outperforms existing approaches on both tight- and loose-fitting garments; the reconstructed garments are physically plausible, capture fine geometric detail, and support downstream applications such as garment retargeting and texture manipulation.

链接: https://arxiv.org/abs/2504.08353
作者: Ren Li,Cong Cao,Corentin Dumery,Yingxuan You,Hao Li,Pascal Fua
机构: École Polytechnique Fédérale de Lausanne (Swiss Federal Institute of Technology Lausanne); Mohamed bin Zayed University of Artificial Intelligence (阿联酋穆罕默德·本·扎耶德人工智能大学)
类目: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Reconstructing 3D clothed humans from images is fundamental to applications like virtual try-on, avatar creation, and mixed reality. While recent advances have enhanced human body recovery, accurate reconstruction of garment geometry – especially for loose-fitting clothing – remains an open challenge. We present a novel method for high-fidelity 3D garment reconstruction from single images that bridges 2D and 3D representations. Our approach combines Implicit Sewing Patterns (ISP) with a generative diffusion model to learn rich garment shape priors in a 2D UV space. A key innovation is our mapping model that establishes correspondences between 2D image pixels, UV pattern coordinates, and 3D geometry, enabling joint optimization of both 3D garment meshes and the corresponding 2D patterns by aligning learned priors with image observations. Despite training exclusively on synthetically simulated cloth data, our method generalizes effectively to real-world images, outperforming existing approaches on both tight- and loose-fitting garments. The reconstructed garments maintain physical plausibility while capturing fine geometric details, enabling downstream applications including garment retargeting and texture manipulation.
zh

[CV-51] Geometric Consistency Refinement for Single Image Novel View Synthesis via Test-Time Adaptation of Diffusion Models CVPR2025

【速读】: This paper addresses the geometric inconsistency of diffusion models in single-image novel view synthesis (NVS): generated images often violate the epipolar constraints implied by the target viewpoint, producing significant geometric errors. The key is a loss function based on image matching and epipolar constraints, with which the starting noise of the diffusion sampling process is optimized so that the generated image is both highly realistic and consistent with the geometric constraints derived from the target pose. The method requires no extra training data or fine-tuning of the diffusion model and applies to multiple state-of-the-art single-image NVS models. On the MegaScenes dataset it improves geometric consistency over the baseline models while preserving the quality of the generated images.

链接: https://arxiv.org/abs/2504.08348
作者: Josef Bengtson,David Nilsson,Fredrik Kahl
机构: Computer Vision Group, Chalmers University of Technology (查尔姆斯理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to CVPR 2025 EDGE Workshop. Project page: this https URL

点击查看摘要

Abstract:Diffusion models for single image novel view synthesis (NVS) can generate highly realistic and plausible images, but they are limited in the geometric consistency to the given relative poses. The generated images often show significant errors with respect to the epipolar constraints that should be fulfilled, as given by the target pose. In this paper we address this issue by proposing a methodology to improve the geometric correctness of images generated by a diffusion model for single image NVS. We formulate a loss function based on image matching and epipolar constraints, and optimize the starting noise in a diffusion sampling process such that the generated image should both be a realistic image and fulfill geometric constraints derived from the given target pose. Our method does not require training data or fine-tuning of the diffusion models, and we show that we can apply it to multiple state-of-the-art models for single image NVS. The method is evaluated on the MegaScenes dataset and we show that geometric consistency is improved compared to the baseline models while retaining the quality of the generated images.
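The geometric term can be made concrete with the Sampson distance, a standard first-order approximation of the epipolar error x2^T F x1 = 0. The sketch below assumes a matcher and a pose-to-fundamental-matrix conversion exist elsewhere; it shows only the differentiable loss that would be backpropagated to the starting noise.

```python
import torch

def sampson_loss(F: torch.Tensor, x1: torch.Tensor, x2: torch.Tensor):
    """F: (3,3) fundamental matrix; x1, x2: (N,3) homogeneous matched points."""
    Fx1 = x1 @ F.T                       # rows are F @ x1_i (lines in image 2)
    Ftx2 = x2 @ F                        # rows are F^T @ x2_i (lines in image 1)
    num = (x2 * Fx1).sum(dim=1) ** 2     # (x2_i^T F x1_i)^2
    den = Fx1[:, 0]**2 + Fx1[:, 1]**2 + Ftx2[:, 0]**2 + Ftx2[:, 1]**2
    return (num / (den + 1e-8)).mean()

F = torch.randn(3, 3)
x1 = torch.cat([torch.rand(8, 2) * 100, torch.ones(8, 1)], dim=1)
x2 = torch.cat([torch.rand(8, 2) * 100, torch.ones(8, 1)], dim=1)
loss = sampson_loss(F, x1, x2)           # minimized over the initial noise
```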
zh

[CV-52] EasyGenNet: An Efficient Framework for Audio-Driven Gesture Video Generation Based on Diffusion Model

【速读】: This paper addresses the difficulty of synthesizing natural expressions and gestures in the gesture-to-video stage of audio-driven co-speech video generation. Prior work relies on complex input and training strategies and large pre-training datasets, which hinders practical use. The key is a simple one-stage training method and a diffusion-based temporal inference method that synthesizes realistic, continuous gesture videos without additional temporal training. The model reuses existing pre-trained weights and needs only a few thousand frames of character-specific data for fine-tuning. On top of the video generator, the paper builds a new audio-to-video pipeline that synthesizes co-speech videos using 2D human skeletons as the intermediate motion representation. These innovations significantly simplify training while improving generation quality.

链接: https://arxiv.org/abs/2504.08344
作者: Renda Li,Xiaohua Qi,Qiang Ling,Jun Yu,Ziyi Chen,Peng Chang,Mei Han,Jing Xiao
机构: University of Science and Technology of China (中国科学技术大学); PAII Inc.
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Audio-driven cospeech video generation typically involves two stages: speech-to-gesture and gesture-to-video. While significant advances have been made in speech-to-gesture generation, synthesizing natural expressions and gestures remains challenging in gesture-to-video systems. In order to improve the generation effect, previous works adopted complex input and training strategies and required a large amount of data sets for pre-training, which brought inconvenience to practical applications. We propose a simple one-stage training method and a temporal inference method based on a diffusion model to synthesize realistic and continuous gesture videos without the need for additional training of temporal modules. The entire model makes use of existing pre-trained weights, and only a few thousand frames of data are needed for each character at a time to complete fine-tuning. Built upon the video generator, we introduce a new audio-to-video pipeline to synthesize co-speech videos, using 2D human skeleton as the intermediate motion representation. Our experiments show that our method outperforms existing GAN-based and diffusion-based methods.
zh

[CV-53] DSM: Building A Diverse Semantic Map for 3D Visual Grounding IROS

【速读】: This paper addresses the tendency of existing VLM-based 3D visual grounding methods to overlook the extraction of diverse semantic information from scenes and the understanding of rich implicit semantic attributes such as appearance, physics, and affordance. The key is a diverse semantic map construction method designed for robotic agents performing 3D visual grounding: VLMs capture the latent semantic attributes and relations of objects in the scene, and a geometry sliding-window map construction strategy builds a Diverse Semantic Map (DSM); grounding understanding is then enhanced on top of the DSM via a new approach named DSM-Grounding. Experiments show the method outperforms current approaches on tasks such as semantic segmentation and 3D visual grounding, particularly excelling in overall metrics against the state of the art. The method has also been deployed on robots, validating its effectiveness in navigation and grasping tasks.

链接: https://arxiv.org/abs/2504.08307
作者: Qinghongbing Xie,Zijian Liang,Long Zeng
机构: Tsinghua Shenzhen International Graduate School, Tsinghua University (清华大学深圳国际研究生院, 清华大学); School of Mechanical and Automotive Engineering, South China University of Technology (华南理工大学机械与汽车工程学院)
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注: 8 pages, 6 figures, submitted to IROS, Project Page: this https URL

点击查看摘要

Abstract:In recent years, with the growing research and application of multimodal large language models (VLMs) in robotics, there has been an increasing trend of utilizing VLMs for robotic scene understanding tasks. Existing approaches that use VLMs for 3D Visual Grounding tasks often focus on obtaining scene information through geometric and visual information, overlooking the extraction of diverse semantic information from the scene and the understanding of rich implicit semantic attributes, such as appearance, physics, and affordance. The 3D scene graph, which combines geometry and language, is an ideal representation method for environmental perception and is an effective carrier for language models in 3D Visual Grounding tasks. To address these issues, we propose a diverse semantic map construction method specifically designed for robotic agents performing 3D Visual Grounding tasks. This method leverages VLMs to capture the latent semantic attributes and relations of objects within the scene and creates a Diverse Semantic Map (DSM) through a geometry sliding-window map construction strategy. We enhance the understanding of grounding information based on DSM and introduce a novel approach named DSM-Grounding. Experimental results show that our method outperforms current approaches in tasks like semantic segmentation and 3D Visual Grounding, particularly excelling in overall metrics compared to the state-of-the-art. In addition, we have deployed this method on robots to validate its effectiveness in navigation and grasping tasks.
zh

[CV-54] STSeg-Complex Video Object Segmentation: The 1st Solution for 4th PVUW MOSE Challenge

【速读】: This work tackles video object segmentation in complex scenarios, one of the core challenges in video understanding and computer vision. The key elements of the proposed STSeg solution are fine-tuning Segment Anything Model 2 (SAM2) and the unsupervised model TMO on the MOSE dataset, and introducing an Adaptive Pseudo-labels Guided Model Refinement Pipeline that intelligently selects the appropriate model for each video at inference. STSeg shows clear advantages on complex object motions and long video sequences, achieving a J&F score of 87.26% on the test set of the 2025 4th PVUW Challenge MOSE Track and securing 1st place, advancing video object segmentation technology for complex scenes.

链接: https://arxiv.org/abs/2504.08306
作者: Kehuan Song,Xinglin Xie,Kexin Zhang,Licheng Jiao,Lingling Li,Shuyuan Yang
机构: Xidian University (西安电子科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Segmentation of video objects in complex scenarios is highly challenging, and the MOSE dataset has significantly contributed to the development of this field. This technical report details the STSeg solution proposed by the "imaplus" team. By finetuning SAM2 and the unsupervised model TMO on the MOSE dataset, the STSeg solution demonstrates remarkable advantages in handling complex object motions and long-video sequences. In the inference phase, an Adaptive Pseudo-labels Guided Model Refinement Pipeline is adopted to intelligently select appropriate models for processing each video. Through finetuning the models and employing the Adaptive Pseudo-labels Guided Model Refinement Pipeline in the inference phase, the STSeg solution achieved a J&F score of 87.26% on the test set of the 2025 4th PVUW Challenge MOSE Track, securing the 1st place and advancing the technology for video object segmentation in complex scenarios.
zh

[CV-55] Generative AI for Film Creation: A Survey of Recent Advances CVPR2025

【速读】: This paper studies the adoption of generative AI (GenAI) in filmmaking and its influence on film creation. Analyzing the workflows of recent AI-driven films, it examines how GenAI contributes to character creation, aesthetic styling, and narration, focusing on strategies for maintaining character consistency, achieving stylistic coherence, and ensuring motion continuity. It also tracks emerging trends such as the growing use of 3D generation and the integration of real footage with AI-generated elements. Beyond the technical side, the paper gathers artists' feedback on current challenges and desired improvements, including consistency, controllability, fine-grained editing, and motion refinement. The key contribution is a clear picture of the evolving intersection of AI and filmmaking, offering a roadmap for researchers and creators navigating this rapidly expanding field.

链接: https://arxiv.org/abs/2504.08296
作者: Ruihan Zhang,Borou Yu,Jiajian Min,Yetong Xin,Zheng Wei,Juncheng Nemo Shi,Mingzhen Huang,Xianghao Kong,Nix Liu Xin,Shanshan Jiang,Praagya Bahuguna,Mark Chan,Khushi Hora,Lijian Yang,Yongqi Liang,Runhe Bian,Yunlei Liu,Isabela Campillo Valencia,Patricia Morales Tredinick,Ilia Kozlov,Sijia Jiang,Peiwen Huang,Na Chen,Xuanxuan Liu,Anyi Rao
机构: Google(谷歌); University of California, Santa Barbara, Media Arts and Technology (加州大学圣塔芭芭拉分校,媒体艺术与技术); MYStudio; Harvard University (哈佛大学); Reality Hack; SUNY Buffalo (纽约州立大学水牛城分校); Onceness; University of Southampton (南安普顿大学); New York University (纽约大学); Communication University of China (中国传媒大学); University of Southern California (南加州大学); Dodge College of Film and Media Arts (电影与媒体艺术学院); Pratt Institute (普拉特学院); Rubyspot; MIT (麻省理工学院); Netflix; Hong Kong University of Science and Technology (香港科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at CVPR 2025 CVEU workshop: AI for Creative Visual Content Generation Editing and Understanding

点击查看摘要

Abstract:Generative AI (GenAI) is transforming filmmaking, equipping artists with tools like text-to-image and image-to-video diffusion, neural radiance fields, avatar generation, and 3D synthesis. This paper examines the adoption of these technologies in filmmaking, analyzing workflows from recent AI-driven films to understand how GenAI contributes to character creation, aesthetic styling, and narration. We explore key strategies for maintaining character consistency, achieving stylistic coherence, and ensuring motion continuity. Additionally, we highlight emerging trends such as the growing use of 3D generation and the integration of real footage with AI-generated elements. Beyond technical advancements, we examine how GenAI is enabling new artistic expressions, from generating hard-to-shoot footage to dreamlike diffusion-based morphing effects, abstract visuals, and unworldly objects. We also gather artists’ feedback on challenges and desired improvements, including consistency, controllability, fine-grained editing, and motion refinement. Our study provides insights into the evolving intersection of AI and filmmaking, offering a roadmap for researchers and artists navigating this rapidly expanding field.
zh

[CV-56] DreamFuse: Adaptive Image Fusion with Diffusion Transformer

【速读】: This paper addresses adaptive and interactive image fusion: rather than simply inserting a foreground object into a background, the foreground should dynamically adjust to and interact naturally with the background context for more coherent integration. The key solution is an iterative human-in-the-loop data generation pipeline that leverages limited initial data and diverse textual prompts to build fusion datasets across various scenarios and interactions. On this basis, the paper introduces DreamFuse, built on the Diffusion Transformer (DiT) model: a Positional Affine mechanism injects the foreground's size and position into the background, enabling effective foreground-background interaction through shared attention, while Localized Direct Preference Optimization guided by human feedback further refines background consistency and foreground harmony. Together, these innovations yield coordinated, realistic fused images and support text-driven attribute editing of the results.

链接: https://arxiv.org/abs/2504.08291
作者: Junjia Huang,Pengxiang Yan,Jiyang Liu,Jie Wu,Zhao Wang,Yitong Wang,Liang Lin,Guanbin Li
机构: Sun Yat-sen University (中山大学); ByteDance Intelligent Creation (字节跳动智能创作); Peng Cheng Laboratory (鹏城实验室); Guangdong Key Laboratory of Big Data Analysis and Processing (广东省大数据分析与处理重点实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: under review

点击查看摘要

Abstract:Image fusion seeks to seamlessly integrate foreground objects with background scenes, producing realistic and harmonious fused images. Unlike existing methods that directly insert objects into the background, adaptive and interactive fusion remains a challenging yet appealing task. It requires the foreground to adjust or interact with the background context, enabling more coherent integration. To address this, we propose an iterative human-in-the-loop data generation pipeline, which leverages limited initial data with diverse textual prompts to generate fusion datasets across various scenarios and interactions, including placement, holding, wearing, and style transfer. Building on this, we introduce DreamFuse, a novel approach based on the Diffusion Transformer (DiT) model, to generate consistent and harmonious fused images with both foreground and background information. DreamFuse employs a Positional Affine mechanism to inject the size and position of the foreground into the background, enabling effective foreground-background interaction through shared attention. Furthermore, we apply Localized Direct Preference Optimization guided by human feedback to refine DreamFuse, enhancing background consistency and foreground harmony. DreamFuse achieves harmonious fusion while generalizing to text-driven attribute editing of the fused results. Experimental results demonstrate that our method outperforms state-of-the-art approaches across multiple metrics.
zh

[CV-57] PNE-SGAN: Probabilistic NDT-Enhanced Semantic Graph Attention Network for LiDAR Loop Closure Detection

【速读】:本文旨在解决基于激光雷达(LiDAR)回环检测(LCD)在一致性同时定位与建图(SLAM)中的鲁棒性和准确性不足的问题。现有方法,如语义图方法,通常面临几何表示粗糙以及对噪声、动态变化和视角变化缺乏时间鲁棒性的挑战。为克服这些局限性,论文提出了一种概率NDT增强语义图注意网络(PNE-SGAN)。其关键是通过使用正态分布变换(NDT)协方差矩阵作为丰富的判别几何节点特征,并结合图注意网络(GAT)增强语义图;同时,将图相似度分数集成到概率时间滤波框架(建模为隐马尔可夫模型/贝叶斯滤波器)中,结合不确定里程计进行运动建模,并利用前后平滑处理歧义。这种结合NDT详细几何信息与原理性概率时间推理的方法显著提升了LiDAR LCD的精度和鲁棒性,在复杂大规模环境中增强了SLAM的可靠性。

链接: https://arxiv.org/abs/2504.08280
作者: Xiong Li,Shulei Liu,Xingning Chen,Yisong Wu,Dong Zhu
机构: Hunan Technical College of Railway High-speed, Telecommunication College (湖南高速铁路职业技术学院, 电信学院); Shanghai Dianji University, School of Electronic and Information (上海电机学院, 电子与信息学院)
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注:

点击查看摘要

Abstract:LiDAR loop closure detection (LCD) is crucial for consistent Simultaneous Localization and Mapping (SLAM) but faces challenges in robustness and accuracy. Existing methods, including semantic graph approaches, often suffer from coarse geometric representations and lack temporal robustness against noise, dynamics, and viewpoint changes. We introduce PNE-SGAN, a Probabilistic NDT-Enhanced Semantic Graph Attention Network, to overcome these limitations. PNE-SGAN enhances semantic graphs by using Normal Distributions Transform (NDT) covariance matrices as rich, discriminative geometric node features, processed via a Graph Attention Network (GAT). Crucially, it integrates graph similarity scores into a probabilistic temporal filtering framework (modeled as an HMM/Bayes filter), incorporating uncertain odometry for motion modeling and utilizing forward-backward smoothing to effectively handle ambiguities. Evaluations on challenging KITTI sequences (00 and 08) demonstrate state-of-the-art performance, achieving Average Precision of 96.2% and 95.1%, respectively. PNE-SGAN significantly outperforms existing methods, particularly in difficult bidirectional loop scenarios where others falter. By synergizing detailed NDT geometry with principled probabilistic temporal reasoning, PNE-SGAN offers a highly accurate and robust solution for LiDAR LCD, enhancing SLAM reliability in complex, large-scale environments.
zh

[CV-58] Palmprint De-Identification Using Diffusion Model for High-Quality and Diverse Synthesis

【速读】:该论文旨在解决公开可用的掌纹图像可能被恶意滥用的问题,提出了一种无需训练的框架,利用预训练的扩散模型生成多样化且高质量的掌纹图像,以实现身份特征的去标识化,同时保留图像的实用性和非敏感信息。解决方案的关键在于引入语义引导的嵌入融合与先验插值机制,以增强合成过程的稳定性和可控性,并进一步提出去标识化比率这一新颖的评估指标,用于直观衡量去标识化效果。实验结果表明,该方法在多个掌纹数据集和识别方法上的表现优异,有效隐藏了身份相关特征,同时保持了高视觉保真度和良好的实用性。

链接: https://arxiv.org/abs/2504.08272
作者: Licheng Yan,Bob Zhang,Andrew Beng Jin Teoh,Lu Leng,Shuyi Li,Yuqi Wang,Ziyuan Yang
机构: PAMI Research Group, Department of Computer and Information Science, University of Macau (澳门大学), Macau, China; School of Electrical and Electronic Engineering, College of Engineering, Yonsei University (延世大学), Seoul, 120749, Republic of Korea; School of Software, Nanchang Hangkong University (南昌航空大学), Nanchang, 330063, Jiangxi, P. R. China; College of Computer Science, Sichuan University (四川大学), Chengdu 610045, China
类目: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
备注:

点击查看摘要

Abstract:Palmprint recognition techniques have advanced significantly in recent years, enabling reliable recognition even when palmprints are captured in uncontrolled or challenging environments. However, this strength also introduces new risks, as publicly available palmprint images can be misused by adversaries for malicious activities. Despite this growing concern, research on methods to obscure or anonymize palmprints remains largely unexplored. Thus, it is essential to develop a palmprint de-identification technique capable of removing identity-revealing features while retaining the image’s utility and preserving non-sensitive information. In this paper, we propose a training-free framework that utilizes pre-trained diffusion models to generate diverse, high-quality palmprint images that conceal identity features for de-identification purposes. To ensure greater stability and controllability in the synthesis process, we incorporate a semantic-guided embedding fusion alongside a prior interpolation mechanism. We further propose the de-identification ratio, a novel metric for intuitive de-identification assessment. Extensive experiments across multiple palmprint datasets and recognition methods demonstrate that our method effectively conceals identity-related traits with significant diversity across de-identified samples. The de-identified samples preserve high visual fidelity and maintain excellent usability, achieving a balance between de-identification and retaining non-identity information.
zh

[CV-59] CoProSketch: Controllable and Progressive Sketch Generation with Diffusion Model

【速读】:该论文旨在解决**生成式草图(Sketch Generation)**这一未被充分探索的问题,尽管生成模型在其他领域已取得显著进展。当前基于扩散模型(Diffusion Models)的图像生成方法难以直接生成清晰的二值化草图,导致生成结果混乱。为此,论文提出的关键解决方案是采用无符号距离场(Unsigned Distance Field, UDF)来表征草图。UDF是一种连续函数,能够通过轻量级网络轻松解码为草图,从而克服传统二值化图像的局限性。此外,论文设计了一个名为CoProSketch的新框架,允许用户从一个边界框和文本提示生成粗略草图,并通过迭代编辑与反馈逐步优化为细节丰富的最终草图。同时,作者构建了首个大规模文本-草图配对数据集以支持训练。实验表明,该方法在语义一致性和可控性方面优于现有基线,为将用户反馈整合到生成工作流中提供了实用方案。

链接: https://arxiv.org/abs/2504.08259
作者: Ruohao Zhan,Yijin Li,Yisheng He,Shuo Chen,Yichen Shen,Xinyu Chen,Zilong Dong,Zhaoyang Huang,Guofeng Zhang
机构: State Key Laboratory of CAD & CG, Zhejiang University (浙江大学国家重点实验室); Avolution AI (Avolution AI); Alibaba Group (阿里巴巴集团)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 11 pages, 9 figures

点击查看摘要

Abstract:Sketches serve as fundamental blueprints in artistic creation because sketch editing is easier and more intuitive than pixel-level RGB image editing for painting artists, yet sketch generation remains unexplored despite advancements in generative models. We propose a novel framework CoProSketch, providing prominent controllability and details for sketch generation with diffusion models. A straightforward method is fine-tuning a pretrained image generation diffusion model with binarized sketch images. However, we find that the diffusion models fail to generate clear binary images, which makes the produced sketches chaotic. We thus propose to represent the sketches by unsigned distance field (UDF), which is continuous and can be easily decoded to sketches through a lightweight network. With CoProSketch, users generate a rough sketch from a bounding box and a text prompt. The rough sketch can be manually edited and fed back into the model for iterative refinement and will be decoded to a detailed sketch as the final result. Additionally, we curate the first large-scale text-sketch paired dataset as the training data. Experiments demonstrate superior semantic consistency and controllability over baselines, offering a practical solution for integrating user feedback into generative workflows.
zh
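
下面用一个极简的 Python 示意说明“以无符号距离场(UDF)表示草图、再经阈值化解码回二值草图”的基本思路(仅为基于摘要理解的示意:论文中 UDF 由扩散模型生成并通过轻量级网络解码,此处的 scipy 距离变换与阈值 tau 均为示例假设):

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def sketch_to_udf(sketch: np.ndarray) -> np.ndarray:
    """sketch 为二值图(1 表示笔画像素), 返回各像素到最近笔画的无符号距离。"""
    # distance_transform_edt 计算到最近零元素的距离, 故将笔画像素置零
    return distance_transform_edt(sketch == 0)

def udf_to_sketch(udf: np.ndarray, tau: float = 0.5) -> np.ndarray:
    """将连续的 UDF 阈值化, 解码回二值草图(tau 为假设的解码阈值)。"""
    return (udf <= tau).astype(np.uint8)

# 用法示例: 一条对角线笔画
canvas = np.zeros((8, 8), dtype=np.uint8)
np.fill_diagonal(canvas, 1)
udf = sketch_to_udf(canvas)       # 连续且处处有定义, 适合扩散模型生成
recovered = udf_to_sketch(udf)    # 阈值化解码
assert (recovered == canvas).all()
```

与直接生成二值图相比,UDF 是连续表示,这正是摘要中“扩散模型难以输出清晰二值图像”问题的规避思路。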

[CV-60] Knowledge Distillation for Underwater Feature Extraction and Matching via GAN-synthesized Images

【速读】:本文旨在解决水下环境中由于图像模糊和噪声(由衰减、散射以及海洋雪干扰引起)导致的特征提取与匹配困难的问题。为提升特征提取与匹配在浑浊水下环境中的鲁棒性,论文提出了一种基于跨模态知识蒸馏的方法,通过合成水下图像作为媒介,将空中特征提取模型迁移到水下场景。解决方案的关键在于首先提出一种新颖的自适应GAN-合成方法,用于估计水体参数和水下噪声分布,从而生成特定环境的合成水下图像;其次引入一种通用的知识蒸馏框架,兼容不同的教师模型。此外,基于GAN的合成评估突显了所提模型中新组件(如GAN合成的噪声和前向散射)的重要性,而实际水下序列上的下游特征提取与匹配(视觉SLAM, VSLAM)应用验证了迁移模型的有效性。

链接: https://arxiv.org/abs/2504.08253
作者: Jinghe Yang,Mingming Gong,Ye Pu
机构: Department of Electrical and Electronic Engineering, The University of Melbourne (墨尔本大学); School of Mathematics and Statistics, The University of Melbourne (墨尔本大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Autonomous Underwater Vehicles (AUVs) play a crucial role in underwater exploration. Vision-based methods offer cost-effective solutions for localization and mapping in the absence of conventional sensors like GPS and LIDAR. However, underwater environments present significant challenges for feature extraction and matching due to image blurring and noise caused by attenuation, scattering, and the interference of marine snow. In this paper, we aim to improve the robustness of the feature extraction and matching in the turbid underwater environment using the cross-modal knowledge distillation method that transfers the in-air feature extraction models to underwater settings using synthetic underwater images as the medium. We first propose a novel adaptive GAN-synthesis method to estimate water parameters and underwater noise distribution, to generate environment-specific synthetic underwater images. We then introduce a general knowledge distillation framework compatible with different teacher models. The evaluation of GAN-based synthesis highlights the significance of the new components, i.e. GAN-synthesized noise and forward scattering, in the proposed model. Additionally, the downstream application of feature extraction and matching (VSLAM) on real underwater sequences validates the effectiveness of the transferred model.
zh
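
以跨模态知识蒸馏的核心思想为例,下面给出一个最小化的 PyTorch 草图:冻结的“空中”教师模型在清晰图像上提取特征,学生模型在(GAN 合成的)水下退化图像上学习复现这些特征(网络结构与退化方式均为示意假设,并非论文实现):

```python
import torch
import torch.nn as nn

def tiny_backbone() -> nn.Module:
    # 示意用的小型特征提取器; 论文中教师为预训练的空中特征提取模型
    return nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
                         nn.Conv2d(16, 32, 3, padding=1))

teacher = tiny_backbone().eval()     # 冻结的 in-air 教师
student = tiny_backbone()            # 待训练的水下学生
for p in teacher.parameters():
    p.requires_grad_(False)

optimizer = torch.optim.Adam(student.parameters(), lr=1e-4)

air_img = torch.rand(2, 3, 64, 64)       # 清晰的空中图像
underwater_img = air_img * 0.6 + 0.05    # 占位退化; 论文中为 GAN 合成的水下图像

with torch.no_grad():
    target_feat = teacher(air_img)       # 教师特征作为蒸馏目标
loss = nn.functional.mse_loss(student(underwater_img), target_feat)
loss.backward()
optimizer.step()
```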

[CV-61] Stereophotoclinometry Revisited

【速读】:该论文旨在解决基于影像的小型天体表面重构与特性表征问题,当前主流方法如立体光束测量法(Stereophotoclinometry, SPC)依赖人工干预和高保真先验信息,限制了其自主性和适用性。论文提出了一种名为Photoclinometry-from-Motion (PhoMo) 的新框架,通过将光束测量技术融入基于关键点的运动结构恢复系统,利用深度学习驱动的自主关键点检测与匹配方法,从就地成像数据中估算地标位置的表面法向量和反照率,从而实现小型天体表面和形状特性的自主表征。关键创新在于摒弃了SPC中昂贵的地图块估计步骤,采用密集关键点测量与对应关系,并通过基于因子图的方法同时优化航天器姿态、地标位置、太阳方向以及表面法向量和反照率,实现多源观测信息的融合优化。验证结果表明,PhoMo在小行星灶神星和矮行星谷神星的实测影像中表现出优于SPC的渲染性能,并且无需依赖任何先验相机姿态或地形信息及人工干预即可与立体摄影测量结果精确对齐。

链接: https://arxiv.org/abs/2504.08252
作者: Travis Driver,Andrew Vaughan,Yang Cheng,Adnan Ansar,John Christian,Panagiotis Tsiotras
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: arXiv admin note: substantial text overlap with arXiv:2312.06865

点击查看摘要

Abstract:Image-based surface reconstruction and characterization is crucial for missions to small celestial bodies, as it informs mission planning, navigation, and scientific analysis. However, current state-of-the-practice methods, such as stereophotoclinometry (SPC), rely heavily on human-in-the-loop verification and high-fidelity a priori information. This paper proposes Photoclinometry-from-Motion (PhoMo), a novel framework that incorporates photoclinometry techniques into a keypoint-based structure-from-motion (SfM) system to estimate the surface normal and albedo at detected landmarks to improve autonomous surface and shape characterization of small celestial bodies from in-situ imagery. In contrast to SPC, we forego the expensive maplet estimation step and instead use dense keypoint measurements and correspondences from an autonomous keypoint detection and matching method based on deep learning. Moreover, we develop a factor graph-based approach allowing for simultaneous optimization of the spacecraft’s pose, landmark positions, Sun-relative direction, and surface normals and albedos via fusion of Sun vector measurements and image keypoint measurements. The proposed framework is validated on real imagery taken by the Dawn mission to the asteroid 4 Vesta and the minor planet 1 Ceres and compared against an SPC reconstruction, where we demonstrate superior rendering performance compared to an SPC solution and precise alignment to a stereophotogrammetry (SPG) solution without relying on any a priori camera pose and topography information or humans-in-the-loop.
zh

[CV-62] F3Set: Towards Analyzing Fast Frequent and Fine-grained Events from Videos ICLR2025

【速读】:该论文旨在解决在视频分析和多模态大型语言模型(Multi-modal LLMs)中精确识别满足快速(Fast)、频繁(Frequent)和细粒度(Fine-grained)(即 F³)标准事件的重大挑战。现有方法因运动模糊和细微视觉差异等问题难以实现高精度检测。为推动视频理解领域的研究,论文引入了 F³ Set 数据集基准,包含多个具有精确时间戳和多层级粒度支持的体育视频数据集,其规模庞大且细节全面。目前,该数据集主要涵盖超过 1,000 种事件类型,未来还可扩展至其他应用场景。论文通过评估现有时间动作理解方法在 F³ Set 上的表现,揭示了现有技术存在的显著不足,并提出了一种新的方法 F³ ED,实现了更优性能。关键在于构建了大规模且细节丰富的 F³ Set 数据集以及设计针对 F³ 事件检测优化的新算法。

链接: https://arxiv.org/abs/2504.08222
作者: Zhaoyu Liu,Kan Jiang,Murong Ma,Zhe Hou,Yun Lin,Jin Song Dong
机构: Ningbo University (宁波大学); National University of Singapore (新加坡国立大学); Griffith University (格里菲斯大学); Shanghai Jiao Tong University (上海交通大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: The Thirteenth International Conference on Learning Representations (ICLR 2025)

点击查看摘要

Abstract:Analyzing Fast, Frequent, and Fine-grained (F³) events presents a significant challenge in video analytics and multi-modal LLMs. Current methods struggle to identify events that satisfy all the F³ criteria with high accuracy due to challenges such as motion blur and subtle visual discrepancies. To advance research in video understanding, we introduce F³Set, a benchmark that consists of video datasets for precise F³ event detection. Datasets in F³Set are characterized by their extensive scale and comprehensive detail, usually encompassing over 1,000 event types with precise timestamps and supporting multi-level granularity. Currently, F³Set contains several sports datasets, and this framework may be extended to other applications as well. We evaluated popular temporal action understanding methods on F³Set, revealing substantial challenges for existing techniques. Additionally, we propose a new method, F³ED, for F³ event detections, achieving superior performance. The dataset, model, and benchmark code are available at this https URL.
zh

[CV-63] VL-UR: Vision-Language-guided Universal Restoration of Images Degraded by Adverse Weather Conditions

【速读】:该论文旨在解决现有图像恢复方法适应性差、难以应对真实世界中多样且复杂的非均匀退化问题。论文的关键创新在于提出了一种名为视觉-语言引导的通用图像恢复(Vision-Language-Guided Universal Restoration, VL-UR)框架。该框架通过引入零样本对比视觉-语言预训练(Contrastive Language-Image Pre-training, CLIP)模型,并结合场景分类器生成与退化图像对齐的语言嵌入,同时预测复杂场景中的退化类型,从而有效整合视觉与语义信息,实现自适应和智能化的图像恢复。实验结果表明,VL-UR在多种退化场景下展现出最先进的性能、鲁棒性和适应性。

链接: https://arxiv.org/abs/2504.08219
作者: Ziyan Liu,Yuxu Lu,Huashan Yu,Dong yang
机构: School of Computer Science, Peking University (北京大学计算机学院); Department of Logistics and Maritime Studies, Hong Kong Polytechnic University (香港理工大学物流与航运学系)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Image restoration is critical for improving the quality of degraded images, which is vital for applications like autonomous driving, security surveillance, and digital content enhancement. However, existing methods are often tailored to specific degradation scenarios, limiting their adaptability to the diverse and complex challenges in real-world environments. Moreover, real-world degradations are typically non-uniform, highlighting the need for adaptive and intelligent solutions. To address these issues, we propose a novel vision-language-guided universal restoration (VL-UR) framework. VL-UR leverages a zero-shot contrastive language-image pre-training (CLIP) model to enhance image restoration by integrating visual and semantic information. A scene classifier is introduced to adapt CLIP, generating high-quality language embeddings aligned with degraded images while predicting degraded types for complex scenarios. Extensive experiments across eleven diverse degradation settings demonstrate VL-UR’s state-of-the-art performance, robustness, and adaptability. This positions VL-UR as a transformative solution for modern image restoration challenges in dynamic, real-world environments.
zh

[CV-64] RealCam-Vid: High-resolution Video Dataset with Dynamic Scenes and Metric-scale Camera Movements

【速读】:该论文旨在解决现有相机可控视频生成技术受限于静态场景数据集的问题,这些数据集仅提供相对尺度的相机标注(如RealEstate10K),无法有效捕捉动态场景交互,且缺乏度量尺度的几何一致性,这在复杂环境中对于合成逼真的物体运动和精确的相机轨迹至关重要。论文的关键解决方案是引入首个完全开源的高分辨率动态场景数据集,并包含度量尺度的相机标注。

链接: https://arxiv.org/abs/2504.08212
作者: Guangcong Zheng,Teng Li,Xianpan Zhou,Xi Li
机构: Zhejiang University (浙江大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Recent advances in camera-controllable video generation have been constrained by the reliance on static-scene datasets with relative-scale camera annotations, such as RealEstate10K. While these datasets enable basic viewpoint control, they fail to capture dynamic scene interactions and lack metric-scale geometric consistency-critical for synthesizing realistic object motions and precise camera trajectories in complex environments. To bridge this gap, we introduce the first fully open-source, high-resolution dynamic-scene dataset with metric-scale camera annotations in this https URL.
zh

[CV-65] EO-VLM: VLM-Guided Energy Overload Attacks on Vision Models ACSAC

【速读】:该论文旨在解决视觉模型在关键应用(如自动驾驶和闭路电视监控)中易受资源消耗型攻击的问题。论文提出了一种新颖的能量过载攻击方法(Energy Overload via Vision Language Model, EO-VLM),通过利用视觉语言模型(Vision Language Model, VLM)的提示生成针对视觉模型的对抗样本图像。这些对抗样本对人眼不可察觉,却能在多种视觉模型上显著增加GPU的能耗,从而威胁系统的可用性。该方案的关键在于其模型无关性(model-agnostic),即无需依赖目标视觉模型的具体架构或类型,并且通过利用像DALL-E 3等VLM缺乏安全过滤器的特点,生成对抗噪声图像而无需任何目标模型的先验知识或内部结构信息。实验结果显示,该攻击可使能耗增加高达50%,揭示了当前视觉模型中的一个关键漏洞。

链接: https://arxiv.org/abs/2504.08205
作者: Minjae Seo,Myoungsung You,Junhee Lee,Jaehan Kim,Hwanjo Heo,Jintae Oh,Jinwoo Kim
机构: ETRI(韩国电子通信研究院); KAIST(韩国科学技术院); Kwangwoon University(光云大学); ETRI(韩国电子通信研究院); KAIST(韩国科学技术院); ETRI(韩国电子通信研究院); ETRI(韩国电子通信研究院); Kwangwoon University(光云大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR)
备注: Presented as a poster at ACSAC 2024

点击查看摘要

Abstract:Vision models are increasingly deployed in critical applications such as autonomous driving and CCTV monitoring, yet they remain susceptible to resource-consuming attacks. In this paper, we introduce a novel energy-overloading attack that leverages vision language model (VLM) prompts to generate adversarial images targeting vision models. These images, though imperceptible to the human eye, significantly increase GPU energy consumption across various vision models, threatening the availability of these systems. Our framework, EO-VLM (Energy Overload via VLM), is model-agnostic, meaning it is not limited by the architecture or type of the target vision model. By exploiting the lack of safety filters in VLMs like DALL-E 3, we create adversarial noise images without requiring prior knowledge or internal structure of the target vision models. Our experiments demonstrate up to a 50% increase in energy consumption, revealing a critical vulnerability in current vision models.
zh

[CV-66] Comparative Analysis of Different Methods for Classifying Polychromatic Sketches

【速读】:该论文旨在应对图像分类这一计算机视觉领域的重要挑战,特别是人类并不熟悉的图像领域,目标是构建视觉识别能力媲美乃至超越人类的机器学习算法。为此,研究团队收集、清洗并解析了一个大规模手绘涂鸦数据集,并将其划分为170个不同的类别。论文的关键在于系统比较多种机器学习方法的有效性,其中表现最优的模型实现了47.5%的Top-1准确率,显著超过人类在该数据集上41%的分类准确率。

链接: https://arxiv.org/abs/2504.08186
作者: Fahd Baba,Devon Mack
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Image classification is a significant challenge in computer vision, particularly in domains humans are not accustomed to. As machine learning and artificial intelligence become more prominent, it is crucial these algorithms develop a sense of sight that is on par with or exceeds human ability. For this reason, we have collected, cleaned, and parsed a large dataset of hand-drawn doodles and compared multiple machine learning solutions to classify these images into 170 distinct categories. The best model we found achieved a Top-1 accuracy of 47.5%, significantly surpassing human performance on the dataset, which stands at 41%.
zh

[CV-67] TokenMotion: Decoupled Motion Control via Token Disentanglement for Human-centric Video Generation

【速读】:该论文致力于解决视频生成中以人为中心的动作控制这一关键挑战,特别是在同时控制摄像机运动与人体姿态的场景下,如Grammy Glambot的经典瞬间。现有方法在有限的动作表示能力和摄像机与人体动作控制的整合方面存在不足。为了解决这些问题,论文提出TokenMotion,这是首个基于DiT的视频扩散框架,能够实现对摄像机运动、人体运动及其联合交互的精细控制。其关键在于将摄像机轨迹和人体姿势表示为空间-时间令牌以支持局部控制粒度,并通过一种由人类感知动态掩码连接的解耦-融合策略进行统一建模,有效处理结合运动信号的空间和时间变化特性。通过广泛的实验,证明了TokenMotion在文本到视频和图像到视频范式中的有效性,显著优于当前最先进的方法。这项工作代表了可控视频生成领域的重要进展,尤其适用于创意生产应用。

链接: https://arxiv.org/abs/2504.08181
作者: Ruineng Li,Daitao Xing,Huiming Sun,Yuanzhou Ha,Jinglin Shen,Chiuman Ho
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Human-centric motion control in video generation remains a critical challenge, particularly when jointly controlling camera movements and human poses in scenarios like the iconic Grammy Glambot moment. While recent video diffusion models have made significant progress, existing approaches struggle with limited motion representations and inadequate integration of camera and human motion controls. In this work, we present TokenMotion, the first DiT-based video diffusion framework that enables fine-grained control over camera motion, human motion, and their joint interaction. We represent camera trajectories and human poses as spatio-temporal tokens to enable local control granularity. Our approach introduces a unified modeling framework utilizing a decouple-and-fuse strategy, bridged by a human-aware dynamic mask that effectively handles the spatially-and-temporally varying nature of combined motion signals. Through extensive experiments, we demonstrate TokenMotion’s effectiveness across both text-to-video and image-to-video paradigms, consistently outperforming current state-of-the-art methods in human-centric motion control tasks. Our work represents a significant advancement in controllable video generation, with particular relevance for creative production applications.
zh

[CV-68] Multi-person Physics-based Pose Estimation for Combat Sports

【速读】:本文旨在解决在对抗性运动(如拳击)中使用稀疏多摄像机设置进行精确三维人体姿态估计的问题。关键解决方案包括:首先,通过基于变换器的自顶向下方法实现鲁棒的多视角二维姿态跟踪,并利用对极几何约束与长期视频对象分割技术确保跨视角的身份一致性跟踪;其次,初始三维姿态由加权三角化及样条平滑获得,并进一步通过运动学优化提升精度;最后,引入基于物理模型的多人轨迹优化步骤以增强姿态的真实性和鲁棒性,有效应对快速运动、遮挡以及近距离交互等挑战。实验结果表明,该方法在包括新发布的精英拳击数据集在内的多样化数据集上达到了当前最佳性能。

链接: https://arxiv.org/abs/2504.08175
作者: Hossein Feiz,David Labbé,Thomas Romeas,Jocelyn Faubert,Sheldon Andrews
机构: École de technologie supérieure (魁北克高等技术学院), Montreal, Canada; Université de Montréal (蒙特利尔大学), Montreal, Canada
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:We propose a novel framework for accurate 3D human pose estimation in combat sports using sparse multi-camera setups. Our method integrates robust multi-view 2D pose tracking via a transformer-based top-down approach, employing epipolar geometry constraints and long-term video object segmentation for consistent identity tracking across views. Initial 3D poses are obtained through weighted triangulation and spline smoothing, followed by kinematic optimization to refine pose accuracy. We further enhance pose realism and robustness by introducing a multi-person physics-based trajectory optimization step, effectively addressing challenges such as rapid motions, occlusions, and close interactions. Experimental results on diverse datasets, including a new benchmark of elite boxing footage, demonstrate state-of-the-art performance. Additionally, we release comprehensive annotated video datasets to advance future research in multi-person pose estimation for combat sports.
zh
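
文中的初始三维姿态由加权三角化得到。下面是加权 DLT 三角化的一个自包含 numpy 示意(以 2D 关节置信度作为权重是常见做法,此处属于假设;论文的具体加权方式以原文为准):

```python
import numpy as np

def weighted_triangulate(Ps, uvs, ws):
    """Ps: 各视角 3x4 投影矩阵; uvs: 各视角 2D 关键点; ws: 置信度权重。"""
    rows = []
    for P, (u, v), w in zip(Ps, uvs, ws):
        rows.append(w * (u * P[2] - P[0]))   # DLT 约束行, 按置信度加权
        rows.append(w * (v * P[2] - P[1]))
    A = np.stack(rows)
    _, _, Vt = np.linalg.svd(A)              # 最小奇异值对应的右奇异向量即解
    X = Vt[-1]
    return X[:3] / X[3]                      # 齐次坐标归一化

# 用法示例: 两台相机观测同一 3D 点
P1 = np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = np.hstack([np.eye(3), np.array([[-1.0], [0.0], [0.0]])])
X_gt = np.array([0.3, -0.2, 4.0, 1.0])
project = lambda P, X: (P @ X)[:2] / (P @ X)[2]
uvs = [project(P1, X_gt), project(P2, X_gt)]
print(weighted_triangulate([P1, P2], uvs, ws=[1.0, 0.8]))  # ≈ [0.3, -0.2, 4.0]
```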

[CV-69] Learning Object Focused Attention

【速读】:该论文旨在解决视觉Transformer(Vision Transformers, ViTs)在处理图像任务时难以显式建模对象结构的问题。传统ViTs在注意力计算时通常无法有效区分图像中不同对象的局部区域与背景区域,导致模型倾向于依赖纹理等表面相关特征而非对象形状进行分类或泛化。论文的关键解决方案是引入了一种新的归纳偏置(inductive bias),通过在选定的注意力层中添加一个额外分支来计算辅助损失——称为对象聚焦注意力(Object-Focused Attention, OFA)损失。这一方法通过限制注意力机制仅关注属于同一对象类别的图像块,使ViTs能够更好地理解对象的整体形状配置,而不是被背景或其他无关区域干扰。由于该方案仅在部分注意力层增加了一个辅助损失项,因此易于融入现有Transformer框架且不会带来推理阶段的额外开销。此外,论文还探索了多尺度掩码技术以进一步提升OFA模型性能,并为自监督学习提供了新路径。实验结果表明,采用OFA的ViTs不仅在分类任务上表现更优,还能更有效地应对分布外(Out-of-Distribution, OOD)数据和对抗性扰动,同时学习到基于对象形状而非虚假纹理的相关表征。

链接: https://arxiv.org/abs/2504.08166
作者: Vivek Trivedy,Amani Almalki,Longin Jan Latecki
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:We propose an adaptation to the training of Vision Transformers (ViTs) that allows for an explicit modeling of objects during the attention computation. This is achieved by adding a new branch to selected attention layers that computes an auxiliary loss which we call the object-focused attention (OFA) loss. We restrict the attention to image patches that belong to the same object class, which allows ViTs to gain a better understanding of configural (or holistic) object shapes by focusing on intra-object patches instead of other patches such as those in the background. Our proposed inductive bias fits easily into the attention framework of transformers since it only adds an auxiliary loss over selected attention layers. Furthermore, our approach has no additional overhead during inference. We also experiment with multiscale masking to further improve the performance of our OFA model and give a path forward for self-supervised learning with our method. Our experimental results demonstrate that ViTs with OFA achieve better classification results than their base models, exhibit a stronger generalization ability to out-of-distribution (OOD) and adversarially corrupted images, and learn representations based on object shapes rather than spurious correlations via general textures. For our OOD setting, we generate a novel dataset using the COCO dataset and Stable Diffusion inpainting which we plan to share with the community.
zh

[CV-70] Investigating Vision-Language Model for Point Cloud-based Vehicle Classification CVPR

【速读】:该论文旨在解决重型卡车在自动驾驶协同系统中因尺寸大、操控性差带来的安全挑战,通过引入一种新的框架提升基于合作自动驾驶的安全视角。传统基于激光雷达(LiDAR)的卡车分类方法依赖大量人工标注,导致其劳动密集且成本高昂。此外,尽管大型语言模型(LLMs)的快速发展提供了利用少量样本学习能力的机会,但现有的视觉-语言模型(VLMs)主要针对图像数据集训练,难以直接处理点云数据。为此,论文提出了三个关键创新:(1) 使用真实世界激光雷达数据集进行模型开发;(2) 设计预处理流程将点云数据适配为VLM输入,包括点云配准以实现密集三维渲染以及数学形态学技术以增强特征表示;(3) 利用上下文学习与少量提示样本实现最小标注数据下的车辆分类。实验结果表明,该方法具有良好的性能,并有望减少标注工作量同时提高分类准确性。

链接: https://arxiv.org/abs/2504.08154
作者: Yiqiao Li,Jie Wei,Camille Kamga
机构: City College of New York (纽约城市大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 5 pages,3 figures, 1 table, CVPR DriveX workshop

点击查看摘要

Abstract:Heavy-duty trucks pose significant safety challenges due to their large size and limited maneuverability compared to passenger vehicles. A deeper understanding of truck characteristics is essential for enhancing the safety perspective of cooperative autonomous driving. Traditional LiDAR-based truck classification methods rely on extensive manual annotations, which makes them labor-intensive and costly. The rapid advancement of large language models (LLMs) trained on massive datasets presents an opportunity to leverage their few-shot learning capabilities for truck classification. However, existing vision-language models (VLMs) are primarily trained on image datasets, which makes it challenging to directly process point cloud data. This study introduces a novel framework that integrates roadside LiDAR point cloud data with VLMs to facilitate efficient and accurate truck classification, which supports cooperative and safe driving environments. This study introduces three key innovations: (1) leveraging real-world LiDAR datasets for model development, (2) designing a preprocessing pipeline to adapt point cloud data for VLM input, including point cloud registration for dense 3D rendering and mathematical morphological techniques to enhance feature representation, and (3) utilizing in-context learning with few-shot prompting to enable vehicle classification with minimally labeled training data. Experimental results demonstrate encouraging performance of this method and present its potential to reduce annotation efforts while improving classification accuracy.
zh

[CV-71] LoRAX: LoRA eXpandable Networks for Continual Synthetic Image Attribution

【速读】:该论文试图解决生成式 AI 图像技术普及背景下,现有归因模型难以泛化至未见过的生成模型以及传统微调方法在实际应用中不切实际的问题。论文的关键解决方案是提出了一种名为 LoRA eXpandable Networks (LoRAX) 的参数高效类增量算法,该算法通过低秩适应(Low Rank Adaptation)实现对新型生成图像模型的适应,而无需完全重新训练。LoRAX 在每个连续学习任务中仅使用极少量的参数训练特定的任务特征提取器,从而显著减少参数需求,同时保持性能竞争力。实验表明,LoRAX 在连续深度伪造检测基准上优于或与最先进的类增量学习算法相当,且每个特征提取器的可训练参数仅为全秩实现的不到 3%。

链接: https://arxiv.org/abs/2504.08149
作者: Danielle Sullivan-Pao,Nicole Tian,Pooya Khorrami
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:As generative AI image technologies become more widespread and advanced, there is a growing need for strong attribution models. These models are crucial for verifying the authenticity of images and identifying the architecture of their originating generative models-key to maintaining media integrity. However, attribution models struggle to generalize to unseen models, and traditional fine-tuning methods for updating these models have shown to be impractical in real-world settings. To address these challenges, we propose LoRA eXpandable Networks (LoRAX), a parameter-efficient class incremental algorithm that adapts to novel generative image models without the need for full retraining. Our approach trains an extremely parameter-efficient feature extractor per continual learning task via Low Rank Adaptation. Each task-specific feature extractor learns distinct features while only requiring a small fraction of the parameters present in the underlying feature extractor’s backbone model. Our extensive experimentation shows LoRAX outperforms or remains competitive with state-of-the-art class incremental learning algorithms on the Continual Deepfake Detection benchmark across all training scenarios and memory settings, while requiring less than 3% of the number of trainable parameters per feature extractor compared to the full-rank implementation. LoRAX code is available at: this https URL.
zh

[CV-72] Impact of Language Guidance: A Reproducibility Study

【速读】:该论文试图解决自监督学习中语义表征质量的问题。传统方法如 Banani 等人(2023)提出的基于语言引导的视图对采样方法虽声称可提升概念相似性,但其使用的数据集 RedCaps 存在低质量标注的问题。为此,论文的关键解决方案是采用现成的图像描述模型 BLIP-2 替换原始标注以生成高质量图像描述,并设计了一种新的基于可解释性的度量方法来评估自监督模型的语义能力,从而改进现有方法的性能与可靠性。

链接: https://arxiv.org/abs/2504.08140
作者: Cherish Puniani,Advika Sinha,Shree Singhi,Aayan Yadav
机构: Indian Institute of Technology, Roorkee (印度理工学院鲁尔基); Indian Institute of Technology, Roorkee (印度理工学院鲁尔基); Indian Institute of Technology, Roorkee (印度理工学院鲁尔基); Indian Institute of Technology, Roorkee (印度理工学院鲁尔基)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Modern deep-learning architectures need large amounts of data to produce state-of-the-art results. Annotating such huge datasets is time-consuming, expensive, and prone to human error. Recent advances in self-supervised learning allow us to train huge models without explicit annotation. Contrastive learning is a popular paradigm in self-supervised learning. Recent works like SimCLR and CLIP rely on image augmentations or directly minimizing cross-modal loss between image and text. Banani et al. (2023) propose to use language guidance to sample view pairs. They claim that language enables better conceptual similarity, eliminating the effects of visual variability. We reproduce their experiments to verify their claims and find that their dataset, RedCaps, contains low-quality captions. We use an off-the-shelf image captioning model, BLIP-2, to replace the captions and improve performance, and we also devise a new metric to evaluate the semantic capabilities of self-supervised models based on interpretability methods.
zh

[CV-73] Gen3DEval: Using vLLM s for Automatic Evaluation of Generated 3D Objects CVPR2025

【速读】:该论文试图解决现有文本到3D生成领域中评价指标无法紧密贴合人类判断的问题,当前常用的指标如PSNR和CLIP要么依赖于参考数据(ground-truth),要么仅关注提示词保真度(prompt fidelity)。为了解决这一问题,论文提出了一种名为Gen3DEval的新颖评估框架。其关键是利用专门针对3D物体质量评估微调的视觉大语言模型(vision large language models, vLLMs),通过分析3D表面法线来评估文本保真度、外观及表面质量,无需依赖参考数据,从而弥合自动化指标与用户偏好之间的差距。

链接: https://arxiv.org/abs/2504.08125
作者: Shalini Maiti,Lourdes Agapito,Filippos Kokkinos
机构: Meta AI (Meta); University College London (伦敦大学学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: CVPR 2025

点击查看摘要

Abstract:Rapid advancements in text-to-3D generation require robust and scalable evaluation metrics that align closely with human judgment, a need unmet by current metrics such as PSNR and CLIP, which require ground-truth data or focus only on prompt fidelity. To address this, we introduce Gen3DEval, a novel evaluation framework that leverages vision large language models (vLLMs) specifically fine-tuned for 3D object quality assessment. Gen3DEval evaluates text fidelity, appearance, and surface quality by analyzing 3D surface normals, without requiring ground-truth comparisons, bridging the gap between automated metrics and user preferences. Compared to state-of-the-art task-agnostic models, Gen3DEval demonstrates superior performance in user-aligned evaluations, placing it as a comprehensive and accessible benchmark for future research on text-to-3D generation. The project page can be found here: this https URL.
zh

[CV-74] Benchmarking Suite for Synthetic Aperture Radar Imagery Anomaly Detection (SARIAD) Algorithms

【速读】:该论文旨在解决合成孔径雷达(Synthetic Aperture Radar, SAR)图像领域中缺乏用于开发和基准测试异常检测方法的标准方法的问题。解决方案的关键在于引入了SAR图像异常检测(SAR Imagery Anomaly Detection, SARIAD),这是一个结合了深度学习库Anomalib的综合工具包,提供了针对SAR图像的多种算法和数据集,以评估和开发异常检测方法。SARIAD通过整合多个SAR数据集以及相应的工具,使得不同异常检测算法能够有效应用于SAR图像,并提供了多种异常检测指标和可视化功能,从而实现可重复研究的基准测试框架。

链接: https://arxiv.org/abs/2504.08115
作者: Lucian Chauvin,Somil Gupta,Angelina Ibarra,Joshua Peeples
机构: Texas A&M University (德克萨斯农工大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Accepted to SPIE at: this https URL

点击查看摘要

Abstract:Anomaly detection is a key research challenge in computer vision and machine learning with applications in many fields from quality control to radar imaging. In radar imaging, specifically synthetic aperture radar (SAR), anomaly detection can be used for the classification, detection, and segmentation of objects of interest. However, there is no method for developing and benchmarking these methods on SAR imagery. To address this issue, we introduce SAR imagery anomaly detection (SARIAD). In conjunction with Anomalib, a deep-learning library for anomaly detection, SARIAD provides a comprehensive suite of algorithms and datasets for assessing and developing anomaly detection approaches on SAR imagery. SARIAD specifically integrates multiple SAR datasets along with tools to effectively apply various anomaly detection algorithms to SAR imagery. Several anomaly detection metrics and visualizations are available. Overall, SARIAD acts as a central package for benchmarking SAR models and datasets to allow for reproducible research in the field of anomaly detection in SAR imagery. This package is publicly available: this https URL.
zh

[CV-75] POEM: Precise Object-level Editing via MLLM control

【速读】:该论文致力于解决对象级图像编辑中的精确性与自动化难题。现有基于文本指令的方法难以实现局部形状和布局的精准变换,且容易引入非预期的全局变化;而基于图像交互的方法虽能提供更高精度,但需要大量人工干预。为减少人工努力同时保持高精度编辑效果,论文提出POEM框架,其关键是利用多模态大型语言模型(Multimodal Large Language Models, MLLMs)分析编辑指令,自动生成变换前后的精确对象掩码,从而在无需过多用户输入的情况下实现细粒度控制。这一结构化推理阶段引导基于扩散模型的编辑过程,确保对象定位与变换的准确性。

链接: https://arxiv.org/abs/2504.08111
作者: Marco Schouten,Mehmet Onurcan Kaya,Serge Belongie,Dim P. Papadopoulos
机构: Technical University of Denmark(DTU); University of Copenhagen; Pioneer Centre for AI(先驱人工智能中心)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to SCIA 2025

点击查看摘要

Abstract:Diffusion models have significantly improved text-to-image generation, producing high-quality, realistic images from textual descriptions. Beyond generation, object-level image editing remains a challenging problem, requiring precise modifications while preserving visual coherence. Existing text-based instructional editing methods struggle with localized shape and layout transformations, often introducing unintended global changes. Image interaction-based approaches offer better accuracy but require manual human effort to provide precise guidance. To reduce this manual effort while maintaining a high image editing accuracy, in this paper, we propose POEM, a framework for Precise Object-level Editing using Multimodal Large Language Models (MLLMs). POEM leverages MLLMs to analyze instructional prompts and generate precise object masks before and after transformation, enabling fine-grained control without extensive user input. This structured reasoning stage guides the diffusion-based editing process, ensuring accurate object localization and transformation. To evaluate our approach, we introduce VOCEdits, a benchmark dataset based on PASCAL VOC 2012, augmented with instructional edit prompts, ground-truth transformations, and precise object masks. Experimental results show that POEM outperforms existing text-based image editing approaches in precision and reliability while reducing manual effort compared to interaction-based methods.
zh

[CV-76] Towards Unconstrained 2D Pose Estimation of the Human Spine CVPRW

【速读】:本文针对无约束环境下2D脊柱姿态估计数据集缺乏的问题提出了解决方案。现有姿态数据集通常将脊柱简化为单一刚性段,忽略了准确运动分析所需的细微关节活动。为解决这一难题,论文构建了SpineTrack数据集,包含两个互补子集:一个由Unreal Engine生成、OpenSim实现生物力学对齐的合成数据集,包含25k标注;另一个通过主动学习管道从超过33k真实场景图像中精心标注的人类反馈优化自动化标注得到的现实世界数据集。关键在于这种集成方法确保了大规模解剖学一致性的标签,即使在具有挑战性的野外图像中也是如此。此外,论文还提出了SpinePose方法,通过知识蒸馏和解剖正则化策略扩展现有的人体姿态估计算法,同时预测身体和脊柱关键点,从而有效提升脊柱姿态估计的精度。

链接: https://arxiv.org/abs/2504.08110
作者: Muhammad Saif Ullah Khan,Stephan Krauß,Didier Stricker
机构: German Research Center for Artificial Intelligence (DFKI)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted for publication in CVPRW 2025

点击查看摘要

Abstract:We present SpineTrack, the first comprehensive dataset for 2D spine pose estimation in unconstrained settings, addressing a crucial need in sports analytics, healthcare, and realistic animation. Existing pose datasets often simplify the spine to a single rigid segment, overlooking the nuanced articulation required for accurate motion analysis. In contrast, SpineTrack annotates nine detailed spinal keypoints across two complementary subsets: a synthetic set comprising 25k annotations created using Unreal Engine with biomechanical alignment through OpenSim, and a real-world set comprising over 33k annotations curated via an active learning pipeline that iteratively refines automated annotations with human feedback. This integrated approach ensures anatomically consistent labels at scale, even for challenging, in-the-wild images. We further introduce SpinePose, extending state-of-the-art body pose estimators using knowledge distillation and an anatomical regularization strategy to jointly predict body and spine keypoints. Our experiments in both general and sports-specific contexts validate the effectiveness of SpineTrack for precise spine pose estimation, establishing a robust foundation for future research in advanced biomechanical analysis and 3D spine reconstruction in the wild.
zh

[CV-77] ContrastiveGaussian: High-Fidelity 3D Generation with Contrastive Learning and Gaussian Splatting

【速读】:该论文旨在解决从单视角图像创建高质量三维内容的问题,当前方法通常利用预训练的二维扩散模型通过分数蒸馏采样(Score Distillation Sampling, SDS)生成多视角三维表示。然而,这些方法的性能常受限于扩散模型输出的视觉不一致性。为了解决这一问题,论文提出了一种名为ContrastiveGaussian的方法,其关键是将对比学习融入生成过程中,并通过引入感知损失(perceptual loss)有效区分正负样本,利用视觉不一致性提升三维生成质量。此外,为了进一步增强样本区分能力并改进对比学习效果,该方法结合超分辨率模型,并引入数量感知三元组损失(Quantity-Aware Triplet Loss),以应对训练过程中的样本分布变化。实验结果表明,此方法在纹理保真度和几何一致性方面表现出色。

链接: https://arxiv.org/abs/2504.08100
作者: Junbang Liu,Enpei Huang,Dongxing Mao,Hui Zhang,Xinyuan Song,Yongxin Ni
机构: Beijing Normal-Hong Kong Baptist University (北京师范大学-香港浸会大学联合国际学院), Zhuhai, China; National University of Singapore (新加坡国立大学), Singapore, Singapore; Emory University (埃默里大学), Atlanta, USA
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Code will be available at this https URL

点击查看摘要

Abstract:Creating 3D content from single-view images is a challenging problem that has attracted considerable attention in recent years. Current approaches typically utilize score distillation sampling (SDS) from pre-trained 2D diffusion models to generate multi-view 3D representations. Although some methods have made notable progress by balancing generation speed and model quality, their performance is often limited by the visual inconsistencies of the diffusion model outputs. In this work, we propose ContrastiveGaussian, which integrates contrastive learning into the generative process. By using a perceptual loss, we effectively differentiate between positive and negative samples, leveraging the visual inconsistencies to improve 3D generation quality. To further enhance sample differentiation and improve contrastive learning, we incorporate a super-resolution model and introduce another Quantity-Aware Triplet Loss to address varying sample distributions during training. Our experiments demonstrate that our approach achieves superior texture fidelity and improved geometric consistency.
zh

[CV-78] X-DECODE: EXtreme Deblurring with Curriculum Optimization and Domain Equalization

【速读】:本文旨在解决严重模糊图像恢复这一计算机视觉领域的重大挑战,该问题广泛影响自动驾驶、医学影像及摄影等应用。论文提出了一种基于课程学习(Curriculum Learning)的新颖训练策略,以提升深度学习模型在极端图像去模糊任务中的鲁棒性。不同于传统方法仅在低至中等模糊程度下进行训练,本文的方法通过逐步引入更高模糊程度的图像来增加训练难度,使模型能够渐进式适应。关键解决方案在于:一方面采用线性课程进度(Linear Curriculum Progression),优于阶梯式(Step-wise)、Sigmoid 和指数式(Exponential)进展;另一方面,在训练过程中结合感知损失(Perceptual Loss)和铰链损失(Hinge Loss),以增强细节恢复能力和提高训练稳定性。实验结果表明,所提方法在Extreme-GoPro数据集上的结构相似性指数(SSIM)比次优方法高出14%,在Extreme-KITTI数据集上高出18%。此外,超参数设置如训练模糊百分比和损失函数形式对处理极端模糊伪影至关重要。

链接: https://arxiv.org/abs/2504.08072
作者: Sushant Gautam,Jingdao Chen
机构: Mississippi State University (密西西比州立大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Restoring severely blurred images remains a significant challenge in computer vision, impacting applications in autonomous driving, medical imaging, and photography. This paper introduces a novel training strategy based on curriculum learning to improve the robustness of deep learning models for extreme image deblurring. Unlike conventional approaches that train on only low to moderate blur levels, our method progressively increases the difficulty by introducing images with higher blur severity over time, allowing the model to adapt incrementally. Additionally, we integrate perceptual and hinge loss during training to enhance fine detail restoration and improve training stability. We experimented with various curriculum learning strategies and explored the impact of the train-test domain gap on the deblurring performance. Experimental results on the Extreme-GoPro dataset showed that our method outperforms the next best method by 14% in SSIM, whereas experiments on the Extreme-KITTI dataset showed that our method outperforms the next best by 18% in SSIM. Ablation studies showed that a linear curriculum progression outperforms step-wise, sigmoid, and exponential progressions, while hyperparameter settings such as the training blur percentage and loss function formulation all play important roles in addressing extreme blur artifacts. Datasets and code are available at this https URL
zh
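
摘要中胜出的“线性课程进度”可以用一个非常小的调度函数来说明:训练早期只允许低模糊等级,随 epoch 线性放宽上限(等级划分与总轮数均为示例假设):

```python
import random

def linear_max_blur(epoch: int, total_epochs: int, max_level: int = 10) -> int:
    """线性课程: 允许的最大模糊等级随训练进度从 1 线性升至 max_level。"""
    frac = min(1.0, (epoch + 1) / total_epochs)
    return max(1, round(frac * max_level))

def sample_blur_level(epoch: int, total_epochs: int) -> int:
    """每个 batch 在当前允许区间 [1, 上限] 内随机采样模糊等级。"""
    return random.randint(1, linear_max_blur(epoch, total_epochs))

for epoch in (0, 9, 19):
    print(epoch, "允许的最大模糊等级 =", linear_max_blur(epoch, 20),
          "本次采样 =", sample_blur_level(epoch, 20))
```

阶梯式、Sigmoid 或指数式课程只需替换 frac 的计算方式,这也说明消融实验为何可以在同一框架内比较不同进度函数。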

[CV-79] STEI-PCN: an efficient pure convolutional network for traffic prediction via spatial-temporal encoding and inferring

【速读】:该论文试图解决交通数据中复杂的时间、空间以及时空相关性建模的问题。现有模型大多通过独立模块分别提取时间或空间相关性,或者通过联合模块同步提取两者,但通常忽略了时空相关性的显式建模。此外,考虑联合时空相关性的模型在精度和计算效率方面面临显著挑战,限制了其性能优势的发挥。为了解决这些问题,论文提出了一种基于时空编码与推断(STEI-PCN)的高效纯卷积网络。其关键在于引入了一个基于绝对时空坐标和相对时空距离编码的动态邻接矩阵推断模块,并结合图卷积网络与门控机制捕获局部同步的联合时空相关性;同时利用三层时域膨胀因果卷积网络捕捉长程时间相关性。最终,通过多视角协同预测模块整合原始特征、局部同步联合时空特征和长程时间特征,实现全面预测。

链接: https://arxiv.org/abs/2504.08061
作者: Kai Hu,Zhidan Zhao,Zhifeng Hao
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Traffic data exhibits complex temporal, spatial, and spatial-temporal correlations. Most of models use either independent modules to separately extract temporal and spatial correlations or joint modules to synchronously extract them, without considering the spatial-temporal correlations. Moreover, models that consider joint spatial-temporal correlations (temporal, spatial, and spatial-temporal correlations) often encounter significant challenges in accuracy and computational efficiency which prevent such models from demonstrating the expected advantages of a joint spatial-temporal correlations architecture. To address these issues, this paper proposes an efficient pure convolutional network for traffic prediction via spatial-temporal encoding and inferring (STEI-PCN). The model introduces and designs a dynamic adjacency matrix inferring module based on absolute spatial and temporal coordinates, as well as relative spatial and temporal distance encoding, using a graph convolutional network combined with gating mechanism to capture local synchronous joint spatial-temporal correlations. Additionally, three layers of temporal dilated causal convolutional network are used to capture long-range temporal correlations. Finally, through multi-view collaborative prediction module, the model integrates the gated-activated original, local synchronous joint spatial-temporal, and long-range temporal features to achieve comprehensive prediction. This study conducts extensive experiments on flow datasets (PeMS03/04/07/08) and speed dataset (PeMS-Bay), covering multiple prediction horizons. The results show that STEI-PCN demonstrates competitive computational efficiency in both training and inference speeds, and achieves superior or slightly inferior to state-of-the-art (SOTA) models on most evaluation metrics.
zh

[CV-80] Multi-Task Learning with Multi-Annotation Triplet Loss for Improved Object Detection

【速读】:该论文试图解决传统三元组损失(Triplet Loss)仅依赖类别标签而未能充分利用多任务场景下可用的所有标注信息的问题。解决方案的关键在于引入了一种多标注三元组损失(Multi-Annotation Triplet Loss, MATL)框架,通过将类别标签与额外的注释信息(如边界框信息)结合到损失函数中,从而扩展了传统的三元组损失。这种方法利用互补的标注信息,提升了同时需要分类和定位任务的多任务学习性能。实验结果表明,MATL在分类和定位任务上均优于常规三元组损失。

链接: https://arxiv.org/abs/2504.08054
作者: Meilun Zhou,Aditya Dutt,Alina Zare
机构: University of Florida (佛罗里达大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Accepted for Oral Presentation at the 45th IEEE International Geoscience and Remote Sensing Symposium (IGARSS), 2025, Brisbane, Australia. 4 pages and 4 figures

点击查看摘要

Abstract:Triplet loss traditionally relies only on class labels and does not use all available information in multi-task scenarios where multiple types of annotations are available. This paper introduces a Multi-Annotation Triplet Loss (MATL) framework that extends triplet loss by incorporating additional annotations, such as bounding box information, alongside class labels in the loss formulation. By using these complementary annotations, MATL improves multi-task learning for tasks requiring both classification and localization. Experiments on an aerial wildlife imagery dataset demonstrate that MATL outperforms conventional triplet loss in both classification and localization. These findings highlight the benefit of using all available annotations for triplet loss in multi-task learning frameworks.
zh
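
MATL 对三元组损失的扩展可以写成“类别项 + 定位项”的形式。下面是一个 PyTorch 示意(边界框特征的编码方式与权重 lam 均为假设,仅用于说明把额外标注并入三元组损失的思路):

```python
import torch
import torch.nn.functional as F

def triplet(anchor, pos, neg, margin: float = 0.2):
    d_ap = F.pairwise_distance(anchor, pos)
    d_an = F.pairwise_distance(anchor, neg)
    return F.relu(d_ap - d_an + margin).mean()

def matl_loss(emb_a, emb_p, emb_n, box_a, box_p, box_n, lam: float = 0.5):
    """emb_*: 类别嵌入; box_*: 由边界框标注导出的定位特征(编码方式为假设)。"""
    cls_term = triplet(emb_a, emb_p, emb_n)   # 传统三元组: 仅依赖类别标签
    loc_term = triplet(box_a, box_p, box_n)   # 附加项: 利用边界框标注
    return cls_term + lam * loc_term

B, D = 4, 128
loss = matl_loss(torch.randn(B, D), torch.randn(B, D), torch.randn(B, D),
                 torch.randn(B, 4), torch.randn(B, 4), torch.randn(B, 4))
print(loss.item())
```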

[CV-81] Patch distribution modeling framework adaptive cosine estimator (PaDiM-ACE) for anomaly detection and localization in synthetic aperture radar imagery

【速读】:该论文旨在解决合成孔径雷达(SAR)图像中的异常检测与定位问题。现有方法基于 Patch Distribution Modeling (PaDiM) 框架,使用马氏距离(Mahalanobis Distance)进行推理,但其作为一种无界度量可能导致性能局限。论文的关键创新在于引入自适应余弦估计器(Adaptive Cosine Estimator, ACE)检测统计量,采用有界且更稳健的余弦相似性(cosine similarity)度量替代传统的马氏距离,从而实现更高效的异常检测评分。通过在多个 SAR 数据集上的评估,该方法在图像级和像素级的接收者操作特性曲线下的面积(AUROC)等性能指标上展现了提升的异常检测与定位能力。代码已公开发布。

链接: https://arxiv.org/abs/2504.08049
作者: Angelina Ibarra,Joshua Peeples
机构: Department of Electrical and Computer Engineering, Texas A&M University (德克萨斯农工大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to SPIE, Defense and Commercial Sensing, Algorithms for Synthetic Aperture Radar Imagery XXXII (April 2025)

点击查看摘要

Abstract:This work presents a new approach to anomaly detection and localization in synthetic aperture radar imagery (SAR), expanding upon the existing patch distribution modeling framework (PaDiM). We introduce the adaptive cosine estimator (ACE) detection statistic. PaDiM uses the Mahalanobis distance at inference, an unbounded metric. ACE instead uses the cosine similarity metric, providing bounded anomaly detection scores. The proposed method is evaluated across multiple SAR datasets, with performance metrics including the area under the receiver operating curve (AUROC) at the image and pixel level, aiming for increased performance in anomaly detection and localization of SAR imagery. The code is publicly available: this https URL.
zh
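
马氏距离无界、而余弦式得分有界,这是该工作替换推理统计量的动机。下面用 numpy 对比两者;此处的 ACE 风格得分实现为白化空间中与参考方向的余弦平方(参考方向 s_ref 的选取为示意假设,论文中统计量的精确定义以原文为准):

```python
import numpy as np

rng = np.random.default_rng(0)
feats = rng.normal(size=(500, 8))        # 正常 patch 特征 (PaDiM: 逐位置估计 mu, Sigma)
mu, Sigma = feats.mean(0), np.cov(feats.T)
Sigma_inv = np.linalg.inv(Sigma)

def mahalanobis(x):
    d = x - mu
    return float(np.sqrt(d @ Sigma_inv @ d))        # 无界, 可任意大

def ace_score(x, s):
    d, t = x - mu, s - mu
    num = (d @ Sigma_inv @ t) ** 2                  # 白化空间中的余弦平方
    den = (d @ Sigma_inv @ d) * (t @ Sigma_inv @ t)
    return float(num / den)                         # 有界于 [0, 1]

x_test = rng.normal(size=8) + 5.0                   # 偏离正常分布的样本
print("Mahalanobis:", mahalanobis(x_test))
print("ACE-style  :", ace_score(x_test, feats[0]))
```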

[CV-82] aching Humans Subtle Differences with DIFFusion

【速读】:该论文试图解决如何有效教授初学者在专业化领域中区分细微类别的问题。解决方案的关键在于提出了一种名为DIFFusion的新方法,通过操控扩散模型(diffusion model)的条件空间,将类别结构与实例身份解耦,从而实现即使在数据稀疏、样本无配对且类别边界难以用文本描述的情况下,也能生成高保真的特征变化可视化结果(即反事实,counterfactuals),以辅助教学感知性专业知识。实验和用户研究验证了该方法的有效性。

链接: https://arxiv.org/abs/2504.08046
作者: Mia Chiquier,Orr Avrech,Yossi Gandelsman,Berthy Feng,Katherine Bouman,Carl Vondrick
机构: Columbia University; UC Berkeley; California Institute of Technology; Columbia University
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Human expertise depends on the ability to recognize subtle visual differences, such as distinguishing diseases, species, or celestial phenomena. We propose a new method to teach novices how to differentiate between nuanced categories in specialized domains. Our method uses generative models to visualize the minimal change in features to transition between classes, i.e., counterfactuals, and performs well even in domains where data is sparse, examples are unpaired, and category boundaries are not easily explained by text. By manipulating the conditioning space of diffusion models, our proposed method DIFFusion disentangles category structure from instance identity, enabling high-fidelity synthesis even in challenging domains. Experiments across six domains show accurate transitions even with limited and unpaired examples across categories. User studies confirm that our generated counterfactuals outperform unpaired examples in teaching perceptual expertise, showing the potential of generative models for specialized visual learning.
zh

[CV-83] Learning Fine-grained Domain Generalization via Hyperbolic State Space Hallucination AAAI2025

【速读】:本文旨在解决细粒度领域泛化(Fine-grained Domain Generalization, FGDG)问题,即如何仅基于源域数据学习一种能够良好泛化到未见目标域的细粒度表示。与通用领域泛化相比,细粒度领域泛化更具挑战性,因为细粒度类别的区分依赖于一些微妙且细微的模式,而这些模式在光照、色彩等跨域风格变化下尤为脆弱。为应对这一挑战,本文提出了一种新颖的双曲状态空间幻觉(Hyperbolic State Space Hallucination, HSSH)方法,其关键在于状态空间幻觉(State Space Hallucination, SSH)和双曲流形一致性(Hyperbolic Manifold Consistency, HMC)两个组件:SSH通过先外推再幻觉源图像来增强状态嵌入的风格多样性;随后将预幻觉和后幻觉的状态嵌入投影到双曲流形中,利用双曲状态空间建模高阶统计特性以更好地辨别细粒度模式,并最终最小化双曲距离以消除风格变化对细粒度模式的影响。实验结果表明,该方法在三个FGDG基准数据集上达到了最先进的性能。

链接: https://arxiv.org/abs/2504.08020
作者: Qi Bi,Jingjun Yi,Haolan Zhan,Wei Ji,Gui-Song Xia
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: accepted by AAAI2025

点击查看摘要

Abstract:Fine-grained domain generalization (FGDG) aims to learn a fine-grained representation that can be well generalized to unseen target domains when only trained on the source domain data. Compared with generic domain generalization, FGDG is particularly challenging in that the fine-grained category can be only discerned by some subtle and tiny patterns. Such patterns are particularly fragile under the cross-domain style shifts caused by illumination, color and etc. To push this frontier, this paper presents a novel Hyperbolic State Space Hallucination (HSSH) method. It consists of two key components, namely, state space hallucination (SSH) and hyperbolic manifold consistency (HMC). SSH enriches the style diversity for the state embeddings by firstly extrapolating and then hallucinating the source images. Then, the pre- and post- style hallucinate state embeddings are projected into the hyperbolic manifold. The hyperbolic state space models the high-order statistics, and allows a better discernment of the fine-grained patterns. Finally, the hyperbolic distance is minimized, so that the impact of style variation on fine-grained patterns can be eliminated. Experiments on three FGDG benchmarks demonstrate its state-of-the-art performance.
zh
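
HSSH 最终最小化的“双曲距离”在 Poincaré 球模型下有封闭公式 d(u,v) = arccosh(1 + 2‖u−v‖² / ((1−‖u‖²)(1−‖v‖²)))。下面给出其 numpy 实现(论文采用的双曲模型与投影细节以原文为准,此处仅演示该距离的计算与性质):

```python
import numpy as np

def poincare_distance(u: np.ndarray, v: np.ndarray, eps: float = 1e-9) -> float:
    """Poincaré 球模型中的双曲距离, 要求 ||u||, ||v|| < 1。"""
    duv = float(np.dot(u - v, u - v))
    denom = max((1.0 - float(np.dot(u, u))) * (1.0 - float(np.dot(v, v))), eps)
    return float(np.arccosh(1.0 + 2.0 * duv / denom))

u = np.array([0.1, 0.2])
v = np.array([0.4, -0.1])
print(poincare_distance(u, v))   # 作为损失最小化时, 可拉近幻觉前后的状态嵌入
print(poincare_distance(u, u))   # 同一点距离为 0
```

靠近球面边界处距离增长极快,这正是双曲空间适合刻画层级与高阶结构的直观来源。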

[CV-84] DGFamba: Learning Flow Factorized State Space for Visual Domain Generalization AAAI2025

【速读】:该论文旨在解决视觉领域泛化(Visual Domain Generalization)中的关键挑战,即如何从源域学习到一种表示,使其能够推广到任意未见过的目标域,尤其是在面对显著的风格变化而图像内容保持稳定的情况下。论文指出,现有方法如VMamba虽在表示内容方面展示了全局感受野,但对选择性状态空间(Selective State Space)中领域不变性(domain-invariant)属性的利用尚不充分。为此,论文提出了一种新颖的流因子化状态空间模型(DG-Famba)。其关键创新在于通过流因子化(flow factorization)将增强风格后的状态嵌入与原始状态嵌入进行映射,以保持领域一致性。在隐空间中,来自特定风格的每个状态嵌入由一个潜在概率路径指定。通过对这些概率路径的对齐,状态嵌入能够无论风格差异如何,都表示相同的内容分布。这一方法在多种视觉领域泛化设置下的广泛实验验证中表现出最先进的性能。

链接: https://arxiv.org/abs/2504.08019
作者: Qi Bi,Jingjun Yi,Hao Zheng,Haolan Zhan,Wei Ji,Yawen Huang,Yuexiang Li
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: accepted by AAAI2025

点击查看摘要

Abstract:Domain generalization aims to learn a representation from the source domain, which can be generalized to arbitrary unseen target domains. A fundamental challenge for visual domain generalization is the domain gap caused by the dramatic style variation whereas the image content is stable. The realm of selective state space, exemplified by VMamba, demonstrates its global receptive field in representing the content. However, the way exploiting the domain-invariant property for selective state space is rarely explored. In this paper, we propose a novel Flow Factorized State Space model, dubbed as DG-Famba, for visual domain generalization. To maintain domain consistency, we innovatively map the style-augmented and the original state embeddings by flow factorization. In this latent flow space, each state embedding from a certain style is specified by a latent probability path. By aligning these probability paths in the latent space, the state embeddings are able to represent the same content distribution regardless of the style differences. Extensive experiments conducted on various visual domain generalization settings show its state-of-the-art performance.
zh

[CV-85] SRVP: Strong Recollection Video Prediction Model Using Attention-Based Spatiotemporal Correlation Fusion CVPR2025

【速读】:该论文旨在解决传统基于循环神经网络(Recurrent Neural Network, RNN)的视频预测(Video Prediction, VP)模型在长时间序列建模中逐渐丢失物体外观细节的问题。为了解决这一挑战,论文提出强记忆视频预测(Strong Recollection Video Prediction, SRVP)模型,其关键在于结合标准注意力(Standard Attention, SA)模块和强化特征注意力(Reinforced Feature Attention, RFA)模块。这两个模块通过缩放点积注意力(scaled dot-product attention)提取时间上下文和空间相关性,并将这些信息融合以增强时空表示能力。实验结果表明,SRVP不仅缓解了RNN基模型中的图像质量退化问题,同时在预测性能上达到了与无RNN架构相当的水平。

链接: https://arxiv.org/abs/2504.08012
作者: Yuseon Kim,Kyongseok Park
机构: Korea Institute of Science and Technology Information (KISTI); Department of Applied AI, University of Science and Technology (UST)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: This paper has been accepted to CVPR 2025 Precognition Workshop

点击查看摘要

Abstract:Video prediction (VP) generates future frames by leveraging spatial representations and temporal context from past frames. Traditional recurrent neural network (RNN)-based models enhance memory cell structures to capture spatiotemporal states over extended durations but suffer from gradual loss of object appearance details. To address this issue, we propose the strong recollection VP (SRVP) model, which integrates standard attention (SA) and reinforced feature attention (RFA) modules. Both modules employ scaled dot-product attention to extract temporal context and spatial correlations, which are then fused to enhance spatiotemporal representations. Experiments on three benchmark datasets demonstrate that SRVP mitigates image quality degradation in RNN-based models while achieving predictive performance comparable to RNN-free architectures.
zh
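
SA 与 RFA 模块均基于标准的缩放点积注意力 softmax(QKᵀ/√d)V。下面给出其最小实现,并以“从过去帧特征中提取时间上下文”为例(张量形状为示意假设):

```python
import torch

def scaled_dot_product_attention(q, k, v):
    """标准缩放点积注意力: softmax(QK^T / sqrt(d)) V。"""
    d = q.size(-1)
    attn = torch.softmax(q @ k.transpose(-2, -1) / d ** 0.5, dim=-1)
    return attn @ v

T, d = 5, 32                        # 5 帧历史, 每帧 32 维特征(假设)
frames = torch.randn(1, T, d)
context = scaled_dot_product_attention(frames, frames, frames)  # 自注意力
print(context.shape)                # torch.Size([1, 5, 32])
```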

[CV-86] Self-Bootstrapping for Versatile Test-Time Adaptation

【速读】:该论文旨在开发一种适用于多种任务(包括图像级、目标级和像素级的分类与回归任务)的通用测试时适应(Test-Time Adaptation, TTA)目标。解决方案的关键在于通过自引导(self-bootstrapping)方案优化测试图像与其退化视图之间的预测一致性。为了实现这一目标,论文分析了常见分布偏移如何影响图像在傅里叶域中的空间频率信息,并发现低频成分携带高信息量,屏蔽这些成分能够提供更多的学习信号,而高频成分则不然。基于此,论文提出了一种在傅里叶域中随机屏蔽图像低频幅值的增强方法,并结合噪声注入以补偿高频部分的缺失学习信号。实验结果表明,无论是独立使用还是作为即插即用模块,该方法在分类、分割及3D单目检测任务中均表现出色,适用于Transformer和CNN模型。

链接: https://arxiv.org/abs/2504.08010
作者: Shuaicheng Niu,Guohao Chen,Peilin Zhao,Tianyi Wang,Pengcheng Wu,Zhiqi Shen
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 16 pages, 10 tables, 4 figures

点击查看摘要

Abstract:In this paper, we seek to develop a versatile test-time adaptation (TTA) objective for a variety of tasks - classification and regression across image-, object-, and pixel-level predictions. We achieve this through a self-bootstrapping scheme that optimizes prediction consistency between the test image (as target) and its deteriorated view. The key challenge lies in devising effective augmentations/deteriorations that: i) preserve the image’s geometric information, e.g., object sizes and locations, which is crucial for TTA on object/pixel-level tasks, and ii) provide sufficient learning signals for TTA. To this end, we analyze how common distribution shifts affect the image’s information power across spatial frequencies in the Fourier domain, and reveal that low-frequency components carry high power and masking these components supplies more learning signals, while masking high-frequency components can not. In light of this, we randomly mask the low-frequency amplitude of an image in its Fourier domain for augmentation. Meanwhile, we also augment the image with noise injection to compensate for missing learning signals at high frequencies, by enhancing the information power there. Experiments show that, either independently or as a plug-and-play module, our method achieves superior results across classification, segmentation, and 3D monocular detection tasks with both transformer and CNN models.

[CV-87] Have we unified image generation and understanding yet? An empirical study of GPT-4o's image generation ability

【Quick Read】: This paper evaluates how effectively OpenAI's multimodal model GPT-4o achieves world-knowledge-informed semantic synthesis, i.e., seamlessly integrating domain knowledge, contextual reasoning, and instruction following. Although existing benchmarks suggest GPT-4o is strong at image generation and editing, the paper's systematic evaluation reveals limitations in global instruction adherence, fine-grained editing precision, and post-generation reasoning, including literal interpretations of instructions, inconsistent application of knowledge constraints, and difficulty with conditional reasoning tasks. The key contribution is exposing these limitations and calling for more robust benchmarks and training strategies that go beyond surface-level alignment, emphasizing context-aware and reasoning-grounded multimodal generation.

Link: https://arxiv.org/abs/2504.08003
Authors: Ning Li,Jingran Zhang,Justin Cui
Affiliations: University of California, Los Angeles (加州大学洛杉矶分校)
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Early work, technical report

Abstract:OpenAI’s multimodal GPT-4o has demonstrated remarkable capabilities in image generation and editing, yet its ability to achieve world knowledge-informed semantic synthesis–seamlessly integrating domain knowledge, contextual reasoning, and instruction adherence–remains unproven. In this study, we systematically evaluate these capabilities across three critical dimensions: (1) Global Instruction Adherence, (2) Fine-Grained Editing Precision, and (3) Post-Generation Reasoning. While existing benchmarks highlight GPT-4o’s strong capabilities in image generation and editing, our evaluation reveals GPT-4o’s persistent limitations: the model frequently defaults to literal interpretations of instructions, inconsistently applies knowledge constraints, and struggles with conditional reasoning tasks. These findings challenge prevailing assumptions about GPT-4o’s unified understanding and generation capabilities, exposing significant gaps in its dynamic knowledge integration. Our study calls for the development of more robust benchmarks and training strategies that go beyond surface-level alignment, emphasizing context-aware and reasoning-grounded multimodal generation.

[CV-88] CDM-QTA: Quantized Training Acceleration for Efficient LoRA Fine-Tuning of Diffusion Model ISCAS2025

【Quick Read】: This paper addresses the high energy cost and long computation times of fine-tuning large diffusion models on mobile devices. The key solution is a novel training accelerator dedicated to Low-Rank Adaptation (LoRA) of diffusion models: a fully quantized training scheme for LoRA fine-tuning yields substantial reductions in memory usage and power consumption while preserving model fidelity. The accelerator's core innovation is a flexible dataflow design that handles the irregular and variable tensor shapes arising during the LoRA process. Experiments show up to 1.81x training speedup and 5.50x energy-efficiency improvement over the baseline, with minimal impact on image-generation quality.

Link: https://arxiv.org/abs/2504.07998
Authors: Jinming Lu,Minghao She,Wendong Mao,Zhongfeng Wang
Affiliations: Nanjing University (南京大学); Sun Yat-Sen University (中山大学)
Subjects: Graphics (cs.GR); Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR); Computer Vision and Pattern Recognition (cs.CV)
Comments: ISCAS 2025

Abstract:Fine-tuning large diffusion models for custom applications demands substantial power and time, which poses significant challenges for efficient implementation on mobile devices. In this paper, we develop a novel training accelerator specifically for Low-Rank Adaptation (LoRA) of diffusion models, aiming to streamline the process and reduce computational complexity. By leveraging a fully quantized training scheme for LoRA fine-tuning, we achieve substantial reductions in memory usage and power consumption while maintaining high model fidelity. The proposed accelerator features flexible dataflow, enabling high utilization for irregular and variable tensor shapes during the LoRA process. Experimental results show up to 1.81x training speedup and 5.50x energy efficiency improvements compared to the baseline, with minimal impact on image generation quality.
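The fully quantized LoRA idea can be sketched in a few lines: the frozen base weight stays in full precision while the low-rank update path runs on fake-quantized int8 values. The symmetric per-tensor quantizer and the placement of quantization below are illustrative assumptions, not the accelerator's actual datapath.

```python
import torch

def fake_quant_int8(x):
    # Symmetric per-tensor quantization: float -> int8 grid -> float.
    scale = x.abs().max().clamp(min=1e-8) / 127.0
    return (x / scale).round().clamp(-127, 127) * scale

d, r, alpha = 256, 8, 16
W = torch.randn(d, d)          # frozen pretrained weight (full precision)
A = 0.01 * torch.randn(r, d)   # trainable LoRA factors
B = torch.zeros(d, r)          # standard LoRA init: update starts at zero

x = torch.randn(1, d)
# Only the low-rank update path is (fake-)quantized in this sketch.
lora = (fake_quant_int8(x) @ fake_quant_int8(A).T) @ fake_quant_int8(B).T
y = x @ W.T + (alpha / r) * lora
```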

[CV-89] ScreenSpot-Pro: GUI Grounding for Professional High-Resolution Computer Use

【Quick Read】: This paper targets the insufficient GUI perception of multi-modal large language models (MLLMs) in high-resolution professional settings, where existing models struggle with high-resolution displays, small targets, and complex environments. Its key contribution is ScreenSpot-Pro, a new benchmark of authentic high-resolution images with expert annotations, spanning five industries, three operating systems, and 23 applications. Analysis further shows that strategically reducing the search area improves accuracy; based on this insight, the paper proposes ScreenSeekeR, a visual search method that uses the GUI knowledge of a strong planner to guide a cascaded search, reaching state-of-the-art accuracy of 48.1% without any additional training. The core of the solution is to pair the domain-specific challenges with this search strategy to improve MLLM grounding.

Link: https://arxiv.org/abs/2504.07981
Authors: Kaixin Li,Ziyang Meng,Hongzhan Lin,Ziyang Luo,Yuchen Tian,Jing Ma,Zhiyong Huang,Tat-Seng Chua
Affiliations: National University of Singapore (新加坡国立大学); East China Normal University (华东师范大学); Hong Kong Baptist University (香港浸会大学)
Subjects: Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC); Multimedia (cs.MM)
Comments: 13 pages

Abstract:Recent advancements in Multi-modal Large Language Models (MLLMs) have led to significant progress in developing GUI agents for general tasks such as web browsing and mobile phone use. However, their application in professional domains remains under-explored. These specialized workflows introduce unique challenges for GUI perception models, including high-resolution displays, smaller target sizes, and complex environments. In this paper, we introduce ScreenSpot-Pro, a new benchmark designed to rigorously evaluate the grounding capabilities of MLLMs in high-resolution professional settings. The benchmark comprises authentic high-resolution images from a variety of professional domains with expert annotations. It spans 23 applications across five industries and three operating systems. Existing GUI grounding models perform poorly on this dataset, with the best model achieving only 18.9%. Our experiments reveal that strategically reducing the search area enhances accuracy. Based on this insight, we propose ScreenSeekeR, a visual search method that utilizes the GUI knowledge of a strong planner to guide a cascaded search, achieving state-of-the-art performance with 48.1% without any additional training. We hope that our benchmark and findings will advance the development of GUI agents for professional applications. Code, data and leaderboard can be found at this https URL.
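The "shrink the search area, then ground" insight can be expressed as a simple cascade. Both callables below (`propose_region`, `ground_point`) are hypothetical stand-ins for a planner model and a grounding model; the real system's interfaces are not specified here.

```python
def cascaded_grounding(screenshot, instruction, propose_region, ground_point):
    # Coarse stage: a planner proposes a sub-window likely to hold the target.
    left, top, right, bottom = propose_region(screenshot, instruction)
    crop = screenshot.crop((left, top, right, bottom))   # PIL-style crop
    # Fine stage: ground the instruction inside the much smaller crop.
    x, y = ground_point(crop, instruction)
    # Map the local click point back to full-resolution coordinates.
    return left + x, top + y
```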

[CV-90] Poisson multi-Bernoulli mixture filter for trajectory measurements

【Quick Read】: This paper tackles multi-target tracking from sensor measurements that are sets of trajectories, proposing a filter based on the Poisson multi-Bernoulli mixture (PMBM) density. The key innovation is the trajectory measurement PMBM (TM-PMBM) filter, which propagates a PMBM density on the set of target states and recursively estimates the multi-target state via prediction and update steps. Concretely, the filter first obtains a PMBM density on the set of trajectories over the last two time steps, then updates it with the trajectory measurements at the current time step, and finally marginalises the posterior PMBM density on two-step trajectories back to a PMBM density on the set of target states. The paper also derives a computationally lighter Poisson multi-Bernoulli (PMB) alternative by minimising the Kullback-Leibler divergence in an augmented space. Together these provide a closed-form solution for multi-target filtering from trajectory measurements, evaluated in a simulation study.

Link: https://arxiv.org/abs/2504.08421
Authors: Marco Fontana,Ángel F. García-Fernández,Simon Maskell
Affiliations: Department of Electrical Engineering and Electronics, University of Liverpool (利物浦大学); ETSI de Telecomunicación, Universidad Politécnica de Madrid (马德里理工大学)
Subjects: Signal Processing (eess.SP); Computer Vision and Pattern Recognition (cs.CV); Applications (stat.AP)
Comments: 16 pages, 7 figures, journal paper

Abstract:This paper presents a Poisson multi-Bernoulli mixture (PMBM) filter for multi-target filtering based on sensor measurements that are sets of trajectories in the last two-time step window. The proposed filter, the trajectory measurement PMBM (TM-PMBM) filter, propagates a PMBM density on the set of target states. In prediction, the filter obtains the PMBM density on the set of trajectories over the last two time steps. This density is then updated with the set of trajectory measurements. After the update step, the PMBM posterior on the set of two-step trajectories is marginalised to obtain a PMBM density on the set of target states. The filter provides a closed-form solution for multi-target filtering based on sets of trajectory measurements, estimating the set of target states at the end of each time window. Additionally, the paper proposes computationally lighter alternatives to the TM-PMBM filter by deriving a Poisson multi-Bernoulli (PMB) density through Kullback-Leibler divergence minimisation in an augmented space with auxiliary variables. The performance of the proposed filters are evaluated in a simulation study.
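For readers new to the PMBM family, the generic form of the density such filters propagate is reproduced below as background from the PMBM literature (notation illustrative, not copied from this paper): a convolution of a Poisson point process for undetected targets with a multi-Bernoulli mixture for potentially detected ones.

```latex
% X: target set; lambda(.): PPP intensity for undetected targets;
% w_j: global-hypothesis weights; f_{j,i}(.): Bernoulli set densities.
\pi(X) = \sum_{X^{u} \uplus X^{d} = X} \pi^{\mathrm{PPP}}(X^{u})\, \pi^{\mathrm{MBM}}(X^{d}),
\qquad
\pi^{\mathrm{PPP}}(X^{u}) = e^{-\int \lambda(x)\,\mathrm{d}x} \prod_{x \in X^{u}} \lambda(x),
\qquad
\pi^{\mathrm{MBM}}(X^{d}) = \sum_{j} w_{j} \sum_{\uplus_{i} X_{i} = X^{d}} \prod_{i} f_{j,i}(X_{i}).
```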

[CV-91] SynthFM: Training Modality-agnostic Foundation Models for Medical Image Segmentation without Real Medical Data

【Quick Read】: This paper addresses the performance drop of foundation models such as the Segment Anything Model (SAM) on zero-shot medical image segmentation, caused by differences between natural and medical images in texture, contrast, and noise, while also circumventing the high cost and domain expertise that annotating medical images requires. The key is SynthFM, a synthetic data generation framework that mimics the complexities of medical images, letting foundation models adapt without any real medical data. Specifically, the authors keep SAM's pretrained encoder, train the decoder from scratch on SynthFM's dataset, and validate the method on 11 anatomical structures across 9 datasets (CT, MRI, and ultrasound), where SynthFM outperforms zero-shot baselines such as SAM and MedSAM under different prompt settings and on out-of-distribution datasets.

Link: https://arxiv.org/abs/2504.08177
Authors: Sourya Sengupta,Satrajit Chakrabarty,Keerthi Sravan Ravi,Gopal Avinash,Ravi Soni
Affiliations: Unknown
Subjects: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Foundation models like the Segment Anything Model (SAM) excel in zero-shot segmentation for natural images but struggle with medical image segmentation due to differences in texture, contrast, and noise. Annotating medical images is costly and requires domain expertise, limiting large-scale annotated data availability. To address this, we propose SynthFM, a synthetic data generation framework that mimics the complexities of medical images, enabling foundation models to adapt without real medical data. Using SAM’s pretrained encoder and training the decoder from scratch on SynthFM’s dataset, we evaluated our method on 11 anatomical structures across 9 datasets (CT, MRI, and Ultrasound). SynthFM outperformed zero-shot baselines like SAM and MedSAM, achieving superior results under different prompt settings and on out-of-distribution datasets.

[CV-92] Interpretable Automatic Rosacea Detection with Whitened Cosine Similarity

【Quick Read】: This paper aims at automatic detection and diagnosis of rosacea, a common skin condition, to raise public awareness of the disease and help physicians diagnose it more accurately. It proposes an interpretable automatic rosacea detection method based on whitened cosine similarity. The key is to classify with high accuracy by measuring the similarity between a test sample and the means of the two classes (rosacea vs. normal), while simultaneously addressing interpretability so that both medical professionals and patients can understand and trust the results. The method not only improves detection accuracy but also highlights the value of early treatment, since rosacea is more treatable in its early stages.

Link: https://arxiv.org/abs/2504.08073
Authors: Chengyu Yang,Chengjun Liu
Affiliations: Department of Computer Science, New Jersey Institute of Technology (新泽西理工学院)
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:According to the National Rosacea Society, approximately sixteen million Americans suffer from rosacea, a common skin condition that causes flushing or long-term redness on a person’s face. To increase rosacea awareness and to better assist physicians to make diagnosis on this disease, we propose an interpretable automatic rosacea detection method based on whitened cosine similarity in this paper. The contributions of the proposed methods are three-fold. First, the proposed method can automatically distinguish patients suffering from rosacea from people who are clean of this disease with a significantly higher accuracy than other methods in unseen test data, including both classical deep learning and statistical methods. Second, the proposed method addresses the interpretability issue by measuring the similarity between the test sample and the means of two classes, namely the rosacea class versus the normal class, which allows both medical professionals and patients to understand and trust the results. And finally, the proposed methods will not only help increase awareness of rosacea in the general population, but will also help remind patients who suffer from this disease of possible early treatment, as rosacea is more treatable in its early stages. The code and data are available at this https URL.
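A minimal NumPy sketch of this classification rule: whiten both the test sample and each class mean, then compare cosine similarities. The choice of whitening matrix (Sigma^{-1/2} estimated from training data, with a small ridge) and all variable names are illustrative assumptions.

```python
import numpy as np

def whitening_matrix(X):
    # W = Sigma^{-1/2} from the training-data covariance (illustrative choice).
    cov = np.cov(X, rowvar=False)
    vals, vecs = np.linalg.eigh(cov + 1e-6 * np.eye(cov.shape[0]))
    return vecs @ np.diag(vals ** -0.5) @ vecs.T

def whitened_cosine(x, m, W):
    u, v = W @ x, W @ m
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

def classify(x, mu_rosacea, mu_normal, W):
    # Assign the class whose (whitened) mean is most similar to the sample.
    s_r = whitened_cosine(x, mu_rosacea, W)
    s_n = whitened_cosine(x, mu_normal, W)
    return "rosacea" if s_r > s_n else "normal"
```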

Artificial Intelligence

[AI-0] ProtoECGNet: Case-Based Interpretable Deep Learning for Multi-Label ECG Classification with Contrastive Learning

【Quick Read】: This paper addresses the obstacle that deep-learning-based electrocardiogram (ECG) classification lacks transparent, faithful explanations for clinical adoption; post hoc methods such as saliency maps may not reflect a model's true decision process. The key solution is ProtoECGNet, a prototype-based deep learning model that grounds decisions in similarity to learned representations of real ECG segments, enabling faithful, case-based explanations. Its core is a structured multi-branch architecture combining different CNN types with global and local prototypes to support multi-label classification covering rhythm, morphology-based reasoning, and diffuse abnormalities. The key innovation is a prototype loss designed for multi-label learning that combines clustering, separation, diversity, and a novel contrastive term enforcing separation between prototypes of unrelated classes while allowing clustering of frequently co-occurring diagnoses. Experiments show ProtoECGNet is competitive with state-of-the-art black-box models while providing structured, case-based explanations.

Link: https://arxiv.org/abs/2504.08713
Authors: Sahil Sethi,David Chen,Thomas Statchen,Michael C. Burkhart,Nipun Bhandari,Bashar Ramadan,Brett Beaulieu-Jones
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Deep learning-based electrocardiogram (ECG) classification has shown impressive performance but clinical adoption has been slowed by the lack of transparent and faithful explanations. Post hoc methods such as saliency maps may fail to reflect a model’s true decision process. Prototype-based reasoning offers a more transparent alternative by grounding decisions in similarity to learned representations of real ECG segments, enabling faithful, case-based explanations. We introduce ProtoECGNet, a prototype-based deep learning model for interpretable, multi-label ECG classification. ProtoECGNet employs a structured, multi-branch architecture that reflects clinical interpretation workflows: it integrates a 1D CNN with global prototypes for rhythm classification, a 2D CNN with time-localized prototypes for morphology-based reasoning, and a 2D CNN with global prototypes for diffuse abnormalities. Each branch is trained with a prototype loss designed for multi-label learning, combining clustering, separation, diversity, and a novel contrastive loss that encourages appropriate separation between prototypes of unrelated classes while allowing clustering for frequently co-occurring diagnoses. We evaluate ProtoECGNet on all 71 diagnostic labels from the PTB-XL dataset, demonstrating competitive performance relative to state-of-the-art black-box models while providing structured, case-based explanations. To assess prototype quality, we conduct a structured clinician review of the final model’s projected prototypes, finding that they are rated as representative and clear. ProtoECGNet shows that prototype learning can be effectively scaled to complex, multi-label time-series classification, offering a practical path toward transparent and trustworthy deep learning models for clinical decision support.
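To illustrate the flavor of a multi-label prototype loss (not the paper's exact formulation), the PyTorch sketch below computes a clustering term that pulls each embedding toward its nearest same-class prototype and a separation term that pushes it from the nearest other-class prototype.

```python
import torch

def prototype_losses(z, prototypes, labels, proto_class):
    """Illustrative multi-label prototype losses.
    z: (B, D) embeddings; prototypes: (P, D); labels: (B, C) multi-hot;
    proto_class: (P,) class index owning each prototype.
    Assumes every sample has at least one positive class with a prototype."""
    dist = torch.cdist(z, prototypes)               # (B, P) pairwise distances
    own = labels[:, proto_class].bool()             # (B, P): same-class mask
    inf = torch.full_like(dist, float("inf"))
    cluster = torch.where(own, dist, inf).min(1).values.mean()       # pull closer
    separation = -torch.where(~own, dist, inf).min(1).values.mean()  # push apart
    return cluster, separation
```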

[AI-1] Voice Interaction With Conversational AI Could Facilitate Thoughtful Reflection and Substantive Revision in Writing NAACL2025

【Quick Read】: This paper explores how technology can raise the quality of reflection and revision in writing. Static feedback is limited in prompting writers to reflect, whereas conversational feedback (as in writing-center tutoring) works better. The paper proposes repurposing static feedback generated by multi-modal large language models (LLMs) as conversation starters and pairing it with voice interaction, so writers can actively seek clarification, request examples, and ask follow-up questions, deepening reflection on their work. The key is the voice modality, which naturally supports such conversational exchange, encourages engagement with higher-order concerns, enables iterative refinement of reflections, and reduces cognitive load compared with text interaction. To validate this, the authors propose a formative study on how text vs. voice input affects writers' reflection and subsequent revisions, with findings intended to inform the design of intelligent, interactive writing tools.

Link: https://arxiv.org/abs/2504.08687
Authors: Jiho Kim,Philippe Laban,Xiang ‘Anthony’ Chen,Kenneth C. Arnold
Affiliations: Unknown
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments: 5 pages; Accepted to Fourth Workshop on Intelligent and Interactive Writing Assistants (In2Writing 2025) at NAACL 2025

Abstract:Writing well requires not only expressing ideas but also refining them through revision, a process facilitated by reflection. Prior research suggests that feedback delivered through dialogues, such as those in writing center tutoring sessions, can help writers reflect more thoughtfully on their work compared to static feedback. Recent advancements in multi-modal large language models (LLMs) now offer new possibilities for supporting interactive and expressive voice-based reflection in writing. In particular, we propose that LLM-generated static feedback can be repurposed as conversation starters, allowing writers to seek clarification, request examples, and ask follow-up questions, thereby fostering deeper reflection on their writing. We argue that voice-based interaction can naturally facilitate this conversational exchange, encouraging writers’ engagement with higher-order concerns, facilitating iterative refinement of their reflections, and reduce cognitive load compared to text-based interactions. To investigate these effects, we propose a formative study exploring how text vs. voice input influence writers’ reflection and subsequent revisions. Findings from this study will inform the design of intelligent and interactive writing tools, offering insights into how voice-based interactions with LLM-powered conversational agents can support reflection and revision.

[AI-2] Pogobot – An Open-Hardware Open-Source Low Cost Robot for Swarm Robotics

【Quick Read】: This paper develops Pogobot, a low-cost yet capable open-source platform for swarm robotics research. The core problem is designing an affordable, highly modular, and extensible hardware/software system that facilitates distributed artificial-intelligence algorithms, such as swarm-intelligence and distributed online reinforcement-learning algorithms. The key is that Pogobot combines vibration-based locomotion, infrared communication, and a rich sensor array, and through its modular design, comprehensive API, and extensible architecture it enables advanced capabilities such as directional communication while keeping the cost at roughly 250 euros per unit. This makes it an efficient experimental tool for studying self-organizing systems, programmable active matter, discrete reaction-diffusion-advection systems, and models of social learning and evolution.

Link: https://arxiv.org/abs/2504.08686
Authors: Alessia Loi,Loona Macabre,Jérémy Fersula,Keivan Amini,Leo Cazenille,Fabien Caura,Alexandre Guerre,Stéphane Gourichon,Olivier Dauchot,Nicolas Bredeche
Affiliations: Unknown
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments:

Abstract:This paper describes the Pogobot, an open-source and open-hardware platform specifically designed for research involving swarm robotics. Pogobot features vibration-based locomotion, infrared communication, and an array of sensors in a cost-effective package (approx. 250 euros/unit). The platform’s modular design, comprehensive API, and extensible architecture facilitate the implementation of swarm intelligence algorithms and distributed online reinforcement learning algorithms. Pogobots offer an accessible alternative to existing platforms while providing advanced capabilities including directional communication between units. More than 200 Pogobots are already being used on a daily basis at Sorbonne Université and PSL to study self-organizing systems, programmable active matter, discrete reaction-diffusion-advection systems as well as models of social learning and evolution.

[AI-3] Designing Child-Friendly AI Interfaces: Six Developmentally-Appropriate Design Insights from Analysing Disney Animation

【Quick Read】: This paper asks how to build AI interfaces that children can intuitively understand and use. The key is to pair Piagetian developmental theory with design patterns extracted from 52 works of Disney animation, yielding six strategies transferable to child-centred AI interface design: (1) emotional expressiveness and visual clarity, (2) musical and auditory scaffolding, (3) audiovisual synchrony for emotional comfort, (4) sidekick-style personas, (5) support for symbolic play and imaginative exploration, and (6) predictable and scaffolded interaction structures. These strategies act as multimodal scaffolds for attention, understanding, and emotional attunement, forming a structured design grammar that is familiar to children and applicable to AI interfaces. The core of the solution is reframing cinematic storytelling logic as an AI design methodology, offering heuristics for crafting intuitive AI interfaces aligned with children's cognitive stages and emotional needs.

Link: https://arxiv.org/abs/2504.08670
Authors: Nomisha Kurian
Affiliations: Unknown
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments: 30 pages

Abstract:To build AI interfaces that children can intuitively understand and use, designers need a design grammar that truly serves children’s developmental needs. This paper bridges Artificial Intelligence design for children – an emerging field still defining its best practices – and children’s animation, a well-established field with decades of experience in engaging young viewers through emotionally resonant, cognitively accessible storytelling. Pairing Piagetian developmental theory with design pattern extraction from 52 works of Disney animation, the paper presents six design insights transferable to child-centred AI interface design: (1) emotional expressiveness and visual clarity, (2) musical and auditory scaffolding, (3) audiovisual synchrony for emotional comfort, (4) sidekick-style personas, (5) support for symbolic play and imaginative exploration, and (6) predictable and scaffolded interaction structures. These strategies – long refined in Disney animation – function as multimodal scaffolds for attention, understanding, and emotional attunement, thereby forming a structured design grammar familiar to children and transferable to AI interface design. By reframing cinematic storytelling as design logic for AI, the paper offers heuristics for crafting intuitive AI interfaces that align with children’s cognitive stages and emotional needs. The work contributes to design theory by showing how sensory, affective and narrative techniques can inform developmentally attuned AI design for children. Future directions include empirical testing, cultural adaptation, and participatory co-design.

[AI-4] Variability-Driven User-Story Generation using LLM and Triadic Concept Analysis

【Quick Read】: This paper aims to suggest the set of user stories required to develop a new system based on the variability logic of an existing family of software products. The key is to combine Triadic Concept Analysis (TCA) with Large Language Model (LLM) prompting in the following steps: (1) compute three-dimensional variability expressed as TCA implications; (2) present the designer with intelligible design options; (3) capture the designer's selection of options; (4) propose an initial user-story set and (5) validate it against the implications identified in step 1, completing it where necessary; and (6) leverage an LLM to produce a more comprehensive website. The approach is evaluated on a dataset comprising the user-story sets of 67 similar-purpose websites.

Link: https://arxiv.org/abs/2504.08666
Authors: Alexandre Bazin,Alain Gutierrez,Marianne Huchard,Pierre Martin,Yulin (Huaxi) Zhang
Affiliations: Unknown
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: 20th International Conference on Evaluation of Novel Approaches to Software Engineering, April 4-6, 2025, in Porto, Portugal

Abstract:A widely used Agile practice for requirements is to produce a set of user stories (also called "agile product backlog"), which roughly includes a list of pairs (role, feature), where the role handles the feature for a certain purpose. In the context of Software Product Lines, the requirements for a family of similar systems is thus a family of user-story sets, one per system, leading to a 3-dimensional dataset composed of sets of triples (system, role, feature). In this paper, we combine Triadic Concept Analysis (TCA) and Large Language Model (LLM) prompting to suggest the user-story set required to develop a new system relying on the variability logic of an existing system family. This process consists in 1) computing 3-dimensional variability expressed as a set of TCA implications, 2) providing the designer with intelligible design options, 3) capturing the designer’s selection of options, 4) proposing a first user-story set corresponding to this selection, 5) consolidating its validity according to the implications identified in step 1, while completing it if necessary, and 6) leveraging LLM to have a more comprehensive website. This process is evaluated with a dataset comprising the user-story sets of 67 similar-purpose websites.
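A toy Python sketch of the 3-dimensional dataset and one implication-style rule follows ("whenever a system offers feature A to role R, it also offers feature B to R"). Real TCA computes a complete, minimal basis of such triadic implications; the data here is invented for illustration.

```python
triples = {
    ("site1", "admin", "manage_users"), ("site1", "admin", "audit_log"),
    ("site2", "admin", "manage_users"), ("site2", "admin", "audit_log"),
    ("site2", "visitor", "search"),
}

def implies(a, b):
    # a, b: (role, feature) pairs. Does every system granting a also grant b?
    holders = {s for (s, r, f) in triples if (r, f) == a}
    return bool(holders) and all((s, *b) in triples for s in holders)

print(implies(("admin", "manage_users"), ("admin", "audit_log")))  # True
```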

[AI-5] Do LLMs trust AI regulation? Emerging behaviour of game-theoretic LLM agents

【Quick Read】: This paper investigates how to foster trust and cooperation within the AI development ecosystem so as to promote broad adoption of trustworthy AI systems. It embeds Large Language Model (LLM) agents in an Evolutionary Game Theory (EGT) framework, studies the complex interplay between AI developers, regulators, and users, and models each party's strategic choices under different regulatory scenarios. The key finding is that establishing a virtuous feedback loop between users' trust and regulators' reputation helps nudge developers toward creating safe AI systems. The study also notes that the level at which such trust emerges may depend on the specific LLM used for testing, so further exploration is needed to ensure generality.

Link: https://arxiv.org/abs/2504.08640
Authors: Alessio Buscemi,Daniele Proverbio,Paolo Bova,Nataliya Balabanova,Adeela Bashir,Theodor Cimpeanu,Henrique Correia da Fonseca,Manh Hong Duong,Elias Fernandez Domingos,Antonio M. Fernandes,Marcus Krellner,Ndidi Bianca Ogbo,Simon T. Powers,Fernando P. Santos,Zia Ush Shamszaman,Zhao Song,Alessandro Di Stefano,The Anh Han
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Computer Science and Game Theory (cs.GT); Chaotic Dynamics (nlin.CD)
Comments:

Abstract:There is general agreement that fostering trust and cooperation within the AI development ecosystem is essential to promote the adoption of trustworthy AI systems. By embedding Large Language Model (LLM) agents within an evolutionary game-theoretic framework, this paper investigates the complex interplay between AI developers, regulators and users, modelling their strategic choices under different regulatory scenarios. Evolutionary game theory (EGT) is used to quantitatively model the dilemmas faced by each actor, and LLMs provide additional degrees of complexity and nuances and enable repeated games and incorporation of personality traits. Our research identifies emerging behaviours of strategic AI agents, which tend to adopt more “pessimistic” (not trusting and defective) stances than pure game-theoretic agents. We observe that, in case of full trust by users, incentives are effective to promote effective regulation; however, conditional trust may deteriorate the “social pact”. Establishing a virtuous feedback between users’ trust and regulators’ reputation thus appears to be key to nudge developers towards creating safe AI. However, the level at which this trust emerges may depend on the specific LLM used for testing. Our results thus provide guidance for AI regulation systems, and help predict the outcome of strategic LLM agents, should they be used to aid regulation itself.

[AI-6] Deep Learning Methods for Detecting Thermal Runaway Events in Battery Production Lines

【Quick Read】: This paper addresses the detection of thermal runaway in battery manufacturing, a critical safety risk that can lead to fires, explosions, and toxic gas emissions. The key is an automated deep-learning detection system: optical and thermal images are collected from the battery production line, with thermal-runaway conditions simulated via external heat and smoke sources to characterise both baseline and thermal-runaway data after preprocessing and fusion. The study evaluates three deep-learning models widely used in computer vision (shallow convolutional neural networks, residual neural networks, and vision transformers) and applies explainability methods to analyse how well the models capture relevant feature information from their inputs. The results indicate that deep learning is a viable approach to thermal-runaway detection on battery production lines.

Link: https://arxiv.org/abs/2504.08632
Authors: Athanasios Athanasopoulos,Matúš Mihalák,Marcin Pietrasik
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:One of the key safety considerations of battery manufacturing is thermal runaway, the uncontrolled increase in temperature which can lead to fires, explosions, and emissions of toxic gasses. As such, development of automated systems capable of detecting such events is of considerable importance in both academic and industrial contexts. In this work, we investigate the use of deep learning for detecting thermal runaway in the battery production line of VDL Nedcar, a Dutch automobile manufacturer. Specifically, we collect data from the production line to represent both baseline (non thermal runaway) and thermal runaway conditions. Thermal runaway was simulated through the use of external heat and smoke sources. The data consisted of both optical and thermal images which were then preprocessed and fused before serving as input to our models. In this regard, we evaluated three deep-learning models widely used in computer vision including shallow convolutional neural networks, residual neural networks, and vision transformers on two performance metrics. Furthermore, we evaluated these models using explainability methods to gain insight into their ability to capture the relevant feature information from their inputs. The obtained results indicate that the use of deep learning is a viable approach to thermal runaway detection in battery production lines.

[AI-7] Enterprise-Grade Security for the Model Context Protocol (MCP): Frameworks and Mitigation Strategies

【Quick Read】: This paper tackles the security challenges that Anthropic's Model Context Protocol (MCP) introduces in practice, including sophisticated threats such as tool poisoning. Through systematic threat modeling and analysis of MCP implementations and potential attack vectors, it presents actionable security patterns for MCP implementers and adopters. The key contribution is translating theoretical security concerns into a practical, implementable framework with concrete controls, providing essential guidance for the secure enterprise-grade adoption and governance of integrated AI systems.

Link: https://arxiv.org/abs/2504.08623
Authors: Vineeth Sai Narajala,Idan Habler
Affiliations: Unknown
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: 11 pages, 2 figures, 1 table

Abstract:The Model Context Protocol (MCP), introduced by Anthropic, provides a standardized framework for artificial intelligence (AI) systems to interact with external data sources and tools in real-time. While MCP offers significant advantages for AI integration and capability extension, it introduces novel security challenges that demand rigorous analysis and mitigation. This paper builds upon foundational research into MCP architecture and preliminary security assessments to deliver enterprise-grade mitigation frameworks and detailed technical implementation strategies. Through systematic threat modeling and analysis of MCP implementations and analysis of potential attack vectors, including sophisticated threats like tool poisoning, we present actionable security patterns tailored for MCP implementers and adopters. The primary contribution of this research lies in translating theoretical security concerns into a practical, implementable framework with actionable controls, thereby providing essential guidance for the secure enterprise adoption and governance of integrated AI systems.

[AI-8] Neural Fidelity Calibration for Informative Sim-to-Real Adaptation

【Quick Read】: This paper addresses physical-model inaccuracy and perception uncertainty in sim-to-real transfer. Traditional approaches bridge the gap with domain randomization or adversarial methods, which usually require expert physics knowledge to ensure policy robustness; yet even state-of-the-art simulators may miss real-world details, and environment reconstruction introduces errors. The proposed Neural Fidelity Calibration (NFC) framework uses conditional score-based diffusion models to calibrate the simulator's physical coefficients and residual fidelity domains online. The residual fidelity reflects the simulation model's shift relative to real-world dynamics and captures the uncertainty of the perceived environment, so realistic environments can be sampled under the inferred distribution for policy fine-tuning. The framework is informative and adaptive in three ways: (a) the pretrained policy is fine-tuned only under anomalous scenarios; (b) sequential NFC is built online with the pretrained NFC's proposal prior, reducing the diffusion model's training burden; and (c) when NFC uncertainty is high and may degrade policy improvement, optimistic exploration enables hallucinated policy optimization. Experiments show more precise simulator calibration than state-of-the-art methods across diverse robots with high-dimensional parameter spaces, confirm the critical contribution of residual fidelity to policy improvement, and demonstrate robust navigation under challenging real-world conditions such as a broken wheel axle on snowy surfaces.

Link: https://arxiv.org/abs/2504.08604
Authors: Youwei Yu,Lantao Liu
Affiliations: Unknown
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Systems and Control (eess.SY)
Comments:

Abstract:Deep reinforcement learning can seamlessly transfer agile locomotion and navigation skills from the simulator to real world. However, bridging the sim-to-real gap with domain randomization or adversarial methods often demands expert physics knowledge to ensure policy robustness. Even so, cutting-edge simulators may fall short of capturing every real-world detail, and the reconstructed environment may introduce errors due to various perception uncertainties. To address these challenges, we propose Neural Fidelity Calibration (NFC), a novel framework that employs conditional score-based diffusion models to calibrate simulator physical coefficients and residual fidelity domains online during robot execution. Specifically, the residual fidelity reflects the simulation model shift relative to the real-world dynamics and captures the uncertainty of the perceived environment, enabling us to sample realistic environments under the inferred distribution for policy fine-tuning. Our framework is informative and adaptive in three key ways: (a) we fine-tune the pretrained policy only under anomalous scenarios, (b) we build sequential NFC online with the pretrained NFC’s proposal prior, reducing the diffusion model’s training burden, and (c) when NFC uncertainty is high and may degrade policy improvement, we leverage optimistic exploration to enable hallucinated policy optimization. Our framework achieves superior simulator calibration precision compared to state-of-the-art methods across diverse robots with high-dimensional parametric spaces. We study the critical contribution of residual fidelity to policy improvement in simulation and real-world experiments. Notably, our approach demonstrates robust robot navigation under challenging real-world conditions, such as a broken wheel axle on snowy surfaces.

[AI-9] Ready Bid Go! On-Demand Delivery Using Fleets of Drones with Unknown Heterogeneous Energy Storage Constraints AAMAS2025

【Quick Read】: This paper addresses on-demand delivery in logistics, where fleets of UAVs must fulfil orders that arrive stochastically. Unlike previous work, it considers UAVs with heterogeneous, unknown energy-storage capacities and assumes no knowledge of the energy-consumption models. The proposed decentralised deployment strategy combines auction-based task allocation with online learning: each UAV independently decides whether to bid for an order based on its charge level, the parcel mass, and the delivery distance, and gradually refines its policy online to bid only for orders within its capability. The key, counter-intuitive finding is that assigning orders to the least confident bidders significantly reduces delivery times and increases the number of fulfilled orders, outperforming threshold-based methods that require UAVs to exceed specific charge levels at deployment. A variant with learned forecasting policies further lets UAVs with insufficient charge commit to fulfilling orders at specific future times, helping prioritise early orders. The work offers new insights into long-term UAV swarm deployment, highlighting the advantages of combining decentralised, energy-aware decision-making with online learning in realistic dynamic environments.

Link: https://arxiv.org/abs/2504.08585
Authors: Mohamed S. Talamali,Genki Miyauchi,Thomas Watteyne,Micael S. Couceiro,Roderich Gross
Affiliations: Unknown
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Comments: The 24th International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2025)

Abstract:Unmanned Aerial Vehicles (UAVs) are expected to transform logistics, reducing delivery time, costs, and emissions. This study addresses an on-demand delivery problem, in which fleets of UAVs are deployed to fulfil orders that arrive stochastically. Unlike previous work, it considers UAVs with heterogeneous, unknown energy storage capacities and assumes no knowledge of the energy consumption models. We propose a decentralised deployment strategy that combines auction-based task allocation with online learning. Each UAV independently decides whether to bid for orders based on its energy storage charge level, the parcel mass, and delivery distance. Over time, it refines its policy to bid only for orders within its capability. Simulations using realistic UAV energy models reveal that, counter-intuitively, assigning orders to the least confident bidders reduces delivery times and increases the number of successfully fulfilled orders. This strategy is shown to outperform threshold-based methods which require UAVs to exceed specific charge levels at deployment. We propose a variant of the strategy which uses learned policies for forecasting. This enables UAVs with insufficient charge levels to commit to fulfilling orders at specific future times, helping to prioritise early orders. Our work provides new insights into long-term deployment of UAV swarms, highlighting the advantages of decentralised energy-aware decision-making coupled with online learning in real-world dynamic environments.
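A minimal Python sketch of the decentralised bid rule: each UAV keeps a small confidence model of "can I fulfil this order?" and refines it online from delivery outcomes. The feature set, logistic form, and update rule are illustrative assumptions, not the paper's exact learner.

```python
import math

class UAV:
    def __init__(self, lr=0.1):
        self.w = [0.0, 0.0, 0.0]          # weights for [charge, mass, distance]
        self.lr = lr

    def confidence(self, feats):
        s = sum(w * f for w, f in zip(self.w, feats))
        return 1.0 / (1.0 + math.exp(-s))

    def update(self, feats, success):
        p = self.confidence(feats)        # one logistic-regression SGD step
        for i, f in enumerate(feats):
            self.w[i] += self.lr * ((1.0 if success else 0.0) - p) * f

def allocate(order_feats, bidders):
    # The paper's counter-intuitive rule: give the order to the LEAST
    # confident UAV among those willing to bid.
    return min(bidders, key=lambda u: u.confidence(order_feats))
```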

[AI-10] Uncovering the Structure of Explanation Quality with Spectral Analysis

【Quick Read】: This paper aims to systematically capture the multifaceted properties of different explanation techniques so that, as machine learning models enter high-stakes domains, their prediction strategies remain transparent to users. The key is a new framework based on spectral analysis of explanation outcomes, which identifies two explanation-quality factors directly observable through spectral decomposition: stability and target sensitivity. Experiments on MNIST and ImageNet show that existing evaluation techniques (e.g., pixel-flipping, entropy) only partially capture the trade-offs between these factors. The framework provides a foundational basis for understanding explanation quality and guides the development of more reliable explanation-evaluation techniques.

Link: https://arxiv.org/abs/2504.08553
Authors: Johannes Maeß,Grégoire Montavon,Shinichi Nakajima,Klaus-Robert Müller,Thomas Schnake
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 14 pages, 5 figures, Accepted at XAI World Conference 2025

Abstract:As machine learning models are increasingly considered for high-stakes domains, effective explanation methods are crucial to ensure that their prediction strategies are transparent to the user. Over the years, numerous metrics have been proposed to assess quality of explanations. However, their practical applicability remains unclear, in particular due to a limited understanding of which specific aspects each metric rewards. In this paper we propose a new framework based on spectral analysis of explanation outcomes to systematically capture the multifaceted properties of different explanation techniques. Our analysis uncovers two distinct factors of explanation quality (stability and target sensitivity) that can be directly observed through spectral decomposition. Experiments on both MNIST and ImageNet show that popular evaluation techniques (e.g., pixel-flipping, entropy) partially capture the trade-offs between these factors. Overall, our framework provides a foundational basis for understanding explanation quality, guiding the development of more reliable techniques for evaluating explanations.
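One way such a spectral analysis could look in NumPy is sketched below (this is illustrative and not the paper's exact procedure): stack attribution maps computed for perturbed versions of one input and inspect the spectrum of the stack. Energy concentrated in the leading component suggests stable explanations; a flat spectrum suggests instability.

```python
import numpy as np

rng = np.random.default_rng(0)
E = rng.normal(size=(50, 28 * 28))        # 50 explanations, flattened to rows
E -= E.mean(axis=0)                       # center before decomposition
s = np.linalg.svd(E, compute_uv=False)    # singular values of the stack
stability_score = s[0] ** 2 / np.sum(s ** 2)
print(f"fraction of spectral energy in leading component: {stability_score:.3f}")
```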

[AI-11] Towards an Evaluation Framework for Explainable Artificial Intelligence Systems for Health and Well-being

【Quick Read】: This paper addresses a challenge raised by integrating artificial intelligence into computer systems: making intelligent systems explainable to humans, which is especially vital in health and well-being, where transparent decision-support systems let professionals understand and trust automated decisions and predictions. The key is an evaluation framework designed to support the development of explainable AI (XAI) systems for health and well-being, illustrated through a case study of its practical application. The framework can support not only XAI system development in healthcare but any AI system with a significant impact on individuals.

Link: https://arxiv.org/abs/2504.08552
Authors: Esperança Amengual-Alcover,Antoni Jaume-i-Capó,Miquel Miró-Nicolau,Gabriel Moyà-Alcover,Antonia Paniza-Fullana
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:

Abstract:The integration of Artificial Intelligence in the development of computer systems presents a new challenge: make intelligent systems explainable to humans. This is especially vital in the field of health and well-being, where transparency in decision support systems enables healthcare professionals to understand and trust automated decisions and predictions. To address this need, tools are required to guide the development of explainable AI systems. In this paper, we introduce an evaluation framework designed to support the development of explainable AI systems for health and well-being. Additionally, we present a case study that illustrates the application of the framework in practice. We believe that our framework can serve as a valuable tool not only for developing explainable AI systems in healthcare but also for any AI system that has a significant impact on individuals.

[AI-12] Explainability and Continual Learning meet Federated Learning at the Network Edge

【Quick Read】: This paper targets key challenges in optimizing distributed learning at the network edge with wirelessly interconnected edge devices. It focuses on three core problems: (1) balancing predictive accuracy and interpretability in complex predictive models; (2) integrating inherently explainable tree-based models into distributed learning frameworks; and (3) enabling continual model adaptation through Continual Learning (CL) in resource-constrained environments. The key is to use multi-objective optimization (MOO) for the accuracy-interpretability trade-off and to combine Federated Learning (FL) with CL strategies to support adaptive, lifelong learning, using limited-size buffers to store past data for retraining. Together these offer a cohesive set of tools for designing privacy-preserving, adaptive, and trustworthy machine-learning solutions tailored to edge computing and intelligent services.

Link: https://arxiv.org/abs/2504.08536
Authors: Thomas Tsouparopoulos,Iordanis Koutsopoulos
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 8 pages, 2 figures

Abstract:As edge devices become more capable and pervasive in wireless networks, there is growing interest in leveraging their collective compute power for distributed learning. However, optimizing learning at the network edge entails unique challenges, particularly when moving beyond conventional settings and objectives. While Federated Learning (FL) has emerged as a key paradigm for distributed model training, critical challenges persist. First, existing approaches often overlook the trade-off between predictive accuracy and interpretability. Second, they struggle to integrate inherently explainable models such as decision trees because their non-differentiable structure makes them not amenable to backpropagation-based training algorithms. Lastly, they lack meaningful mechanisms for continual Machine Learning (ML) model adaptation through Continual Learning (CL) in resource-limited environments. In this paper, we pave the way for a set of novel optimization problems that emerge in distributed learning at the network edge with wirelessly interconnected edge devices, and we identify key challenges and future directions. Specifically, we discuss how Multi-objective optimization (MOO) can be used to address the trade-off between predictive accuracy and explainability when using complex predictive models. Next, we discuss the implications of integrating inherently explainable tree-based models into distributed learning settings. Finally, we investigate how CL strategies can be effectively combined with FL to support adaptive, lifelong learning when limited-size buffers are used to store past data for retraining. Our approach offers a cohesive set of tools for designing privacy-preserving, adaptive, and trustworthy ML solutions tailored to the demands of edge computing and intelligent services.

[AI-13] LGRPool: Hierarchical Graph Pooling Via Local-Global Regularisation

【Quick Read】: This paper addresses two shortcomings of conventional graph neural networks (GNNs): they are inherently flat, ignoring the global topology of the graph, and they fail to align local and global features even though graphs should be analyzed in a multiscale way. The proposed LGRPool is a hierarchical graph pooling (HGP) method whose key idea, cast in an expectation-maximization framework, is a regularizer that aligns the local and global aspects of message passing, forcing global topological information at different scales to be consistent with local representations at the different layers of the HGP. This lets LGRPool better capture multiscale structure during hierarchical pooling; experiments on several graph classification benchmarks show it slightly outperforms some baselines.

Link: https://arxiv.org/abs/2504.08530
Authors: Farshad Noravesh,Reza Haffari,Layki Soon,Arghya Pal
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: f tables, 2 figures

Abstract:Hierarchical graph pooling(HGP) are designed to consider the fact that conventional graph neural networks(GNN) are inherently flat and are also not multiscale. However, most HGP methods suffer not only from lack of considering global topology of the graph and focusing on the feature learning aspect, but also they do not align local and global features since graphs should inherently be analyzed in a multiscale way. LGRPool is proposed in the present paper as a HGP in the framework of expectation maximization in machine learning that aligns local and global aspects of message passing with each other using a regularizer to force the global topological information to be inline with the local message passing at different scales through the representations at different layers of HGP. Experimental results on some graph classification benchmarks show that it slightly outperforms some baselines.

[AI-14] Hallucination, reliability, and the role of generative AI in science

【Quick Read】: This paper addresses the reliability problems caused by hallucinations when generative AI is applied in science. It focuses on corrosive hallucinations arising from the models' generative mechanisms, i.e., outputs that are substantively misleading and resistant to systematic anticipation, and examines their threat to scientific inference. The paper argues that although corrosive hallucinations can endanger scientific reliability, they are not inevitable.

The key to the solution is mitigating these problems through scientific workflows. Using AlphaFold and GenCast as case studies, the paper suggests imposing theoretical constraints during model training and strategically screening for errors at inference time, thereby neutralising the effects of corrosive hallucinations so that generative AI can reliably contribute to scientific knowledge.

Link: https://arxiv.org/abs/2504.08526
Authors: Charles Rathkopf
Affiliations: Unknown
Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments: 31 pages, 1 figure

Abstract:Generative AI is increasingly used in scientific domains, from protein folding to climate modeling. But these models produce distinctive errors known as hallucinations - outputs that are incorrect yet superficially plausible. Worse, some arguments suggest that hallucinations are an inevitable consequence of the mechanisms underlying generative inference. Fortunately, such arguments rely on a conception of hallucination defined solely with respect to internal properties of the model, rather than in reference to the empirical target system. This conception fails to distinguish epistemically benign errors from those that threaten scientific inference. I introduce the concept of corrosive hallucination to capture the epistemically troubling subclass: misrepresentations that are substantively misleading and resistant to systematic anticipation. I argue that although corrosive hallucinations do pose a threat to scientific reliability, they are not inevitable. Scientific workflows such as those surrounding AlphaFold and GenCast, both of which serve as case studies, can neutralize their effects by imposing theoretical constraints during training, and by strategically screening for errors at inference time. When embedded in such workflows, generative AI can reliably contribute to scientific knowledge.

[AI-15] Adopting Large Language Models to Automated System Integration

【Quick Read】: This Ph.D. thesis tackles the growing complexity of service integration in modern enterprise computing, particularly for services built on Web technologies such as REST or OpenAPI. Although individual services are cheaper to maintain, overall integration complexity rises, and traditional automated service-composition approaches have not been widely adopted in practice because they rely on complex formal modeling. The key idea is to use Large Language Models (LLMs) to integrate services automatically from natural-language input. The solution comprises (i) a software architecture for automated service composition using LLMs (Compositio Prompto), (ii) an analysis of Retrieval-Augmented Generation (RAG) for service discovery, (iii) a novel natural-language-query benchmark for service discovery, and (iv) an extension of the benchmark to complete service-composition scenarios. While the generated results are not always entirely correct, they can give integration engineers a close approximation of a suitable solution that requires little effort to become operational.

Link: https://arxiv.org/abs/2504.08490
Authors: Robin D. Pesl
Affiliations: Unknown
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments:

Abstract:Modern enterprise computing systems integrate numerous subsystems to resolve a common task by yielding emergent behavior. A widespread approach is using services implemented with Web technologies like REST or OpenAPI, which offer an interaction mechanism and service documentation standard, respectively. Each service represents a specific business functionality, allowing encapsulation and easier maintenance. Despite the reduced maintenance costs on an individual service level, increased integration complexity arises. Consequently, automated service composition approaches have arisen to mitigate this issue. Nevertheless, these approaches have not achieved high acceptance in practice due to their reliance on complex formal modeling. Within this Ph.D. thesis, we analyze the application of Large Language Models (LLMs) to automatically integrate the services based on a natural language input. The result is a reusable service composition, e.g., as program code. While not always generating entirely correct results, the result can still be helpful by providing integration engineers with a close approximation of a suitable solution, which requires little effort to become operational. Our research involves (i) introducing a software architecture for automated service composition using LLMs, (ii) analyzing Retrieval Augmented Generation (RAG) for service discovery, (iii) proposing a novel natural language query-based benchmark for service discovery, and (iv) extending the benchmark to complete service composition scenarios. We have presented our software architecture as Compositio Prompto, the analysis of RAG for service discovery, and submitted a proposal for the service discovery benchmark. Open topics are primarily the extension of the service discovery benchmark to service composition scenarios and the improvements of the service composition generation, e.g., using fine-tuning or LLM agents.

[AI-16] On the Design of Diffusion-based Neural Speech Codecs

【Quick Read】: This paper systematically explores the design space of neural speech codecs (NSCs) based on diffusion models (DMs), filling a gap in existing research. The key is a new categorization that classifies NSCs by the conditioning and output domains of the DM; within this framework the authors design and evaluate novel diffusion-based NSCs and compare them against existing GAN and DM baselines using objective metrics and subjective listening tests. This systematic approach both defines the design space of DMs for NSCs and points out clear directions for future work.

Link: https://arxiv.org/abs/2504.08470
Authors: Pietro Foti,Andreas Brendel
Affiliations: Unknown
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Multimedia (cs.MM); Audio and Speech Processing (eess.AS)
Comments:

Abstract:Recently, neural speech codecs (NSCs) trained as generative models have shown superior performance compared to conventional codecs at low bitrates. Although most state-of-the-art NSCs are trained as Generative Adversarial Networks (GANs), Diffusion Models (DMs), a recent class of generative models, represent a promising alternative due to their superior performance in image generation relative to GANs. Consequently, DMs have been successfully applied for audio and speech coding among various other audio generation applications. However, the design of diffusion-based NSCs has not yet been explored in a systematic way. We address this by providing a comprehensive analysis of diffusion-based NSCs divided into three contributions. First, we propose a categorization based on the conditioning and output domains of the DM. This simple conceptual framework allows us to define a design space for diffusion-based NSCs and to assign a category to existing approaches in the literature. Second, we systematically investigate unexplored designs by creating and evaluating new diffusion-based NSCs within the conceptual framework. Finally, we compare the proposed models to existing GAN and DM baselines through objective metrics and subjective listening tests.

[AI-17] seeBias: A Comprehensive Tool for Assessing and Visualizing AI Fairness

【Quick Read】: This paper addresses the problem that current fairness toolkits evaluate classification performance disparities of AI prediction models in isolation, neglecting other critical aspects such as calibration. The key solution is seeBias, an R package for comprehensive evaluation of model fairness and predictive performance. seeBias integrates evaluation across classification, calibration, and other performance domains, offering a more complete view of model behavior, and includes customizable visualizations to support transparent reporting and responsible AI implementation. Using public datasets from criminal justice and healthcare, the study shows how seeBias supports fairness evaluations and uncovers disparities that conventional fairness metrics may overlook.

Link: https://arxiv.org/abs/2504.08418
Authors: Yilin Ning,Yian Ma,Mingxuan Liu,Xin Li,Nan Liu
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Fairness in artificial intelligence (AI) prediction models is increasingly emphasized to support responsible adoption in high-stakes domains such as health care and criminal justice. Guidelines and implementation frameworks highlight the importance of both predictive accuracy and equitable outcomes. However, current fairness toolkits often evaluate classification performance disparities in isolation, with limited attention to other critical aspects such as calibration. To address these gaps, we present seeBias, an R package for comprehensive evaluation of model fairness and predictive performance. seeBias offers an integrated evaluation across classification, calibration, and other performance domains, providing a more complete view of model behavior. It includes customizable visualizations to support transparent reporting and responsible AI implementation. Using public datasets from criminal justice and healthcare, we demonstrate how seeBias supports fairness evaluations, and uncovers disparities that conventional fairness metrics may overlook. The R package is available on GitHub, and a Python version is under development.

[AI-18] Belief States for Cooperative Multi-Agent Reinforcement Learning under Partial Observability

【Quick Read】: This paper tackles the challenges of multi-agent reinforcement learning in partially observable environments, which are exacerbated when agents learn simultaneously and influence both the underlying system state and each other's observations. The proposal is to use learned beliefs about the underlying state to enable reinforcement learning with fully decentralized training and execution. The key is to pretrain a probabilistic belief model in a self-supervised fashion, capturing both the inferred state information and the uncertainty over it. The resulting belief states are then fed into a state-based reinforcement learning algorithm, forming an end-to-end model for cooperative multi-agent reinforcement learning under partial observability. By separating belief estimation from the reinforcement learning task, the approach significantly simplifies policy and value-function learning, improving both convergence speed and final performance.

Link: https://arxiv.org/abs/2504.08417
Authors: Paul J. Pritz,Kin K. Leung
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Reinforcement learning in partially observable environments is typically challenging, as it requires agents to learn an estimate of the underlying system state. These challenges are exacerbated in multi-agent settings, where agents learn simultaneously and influence the underlying state as well as each others’ observations. We propose the use of learned beliefs on the underlying state of the system to overcome these challenges and enable reinforcement learning with fully decentralized training and execution. Our approach leverages state information to pre-train a probabilistic belief model in a self-supervised fashion. The resulting belief states, which capture both inferred state information as well as uncertainty over this information, are then used in a state-based reinforcement learning algorithm to create an end-to-end model for cooperative multi-agent reinforcement learning under partial observability. By separating the belief and reinforcement learning tasks, we are able to significantly simplify the policy and value function learning tasks and improve both the convergence speed and the final performance. We evaluate our proposed method on diverse partially observable multi-agent tasks designed to exhibit different variants of partial observability.
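A minimal PyTorch sketch of such a probabilistic belief model follows: a GRU over local observations predicts a Gaussian over the (training-time-only) global state, and the belief (mean and scale) would then feed the RL policy at execution time. Architecture and sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class BeliefModel(nn.Module):
    def __init__(self, obs_dim=16, state_dim=8, hidden=64):
        super().__init__()
        self.rnn = nn.GRU(obs_dim, hidden, batch_first=True)
        self.mu = nn.Linear(hidden, state_dim)
        self.log_sigma = nn.Linear(hidden, state_dim)

    def forward(self, obs_seq):                 # obs_seq: (B, T, obs_dim)
        h, _ = self.rnn(obs_seq)
        return self.mu(h), self.log_sigma(h).exp()

model = BeliefModel()
obs, state = torch.randn(4, 10, 16), torch.randn(4, 10, 8)
mu, sigma = model(obs)
# Self-supervised pretraining: negative log-likelihood of the true state.
loss = -torch.distributions.Normal(mu, sigma).log_prob(state).mean()
```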

[AI-19] Constrained Machine Learning Through Hyperspherical Representation

【Quick Read】: This paper is devoted to making machine-learning model outputs satisfy constraints, which is critical in safety-critical applications. Conventional solutions include penalty-based methods (which cannot guarantee avoiding constraint violations), constraint-specific model architectures, and output projection at inference time (which may incur high computational cost). The paper presents the Hyperspherical Constrained Representation, whose key idea is to convert Euclidean coordinates into hyperspherical coordinates relative to the constrained region, so that the new representation system can inherently represent only feasible points, guaranteeing 100% constraint satisfaction. Experiments on a synthetic and a real-world dataset show that the method matches the predictive performance of other approaches while keeping inference-time computational cost minimal.

Link: https://arxiv.org/abs/2504.08415
Authors: Gaetano Signorelli,Michele Lombardi
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:The problem of ensuring constraint satisfaction on the output of machine learning models is critical for many applications, especially in safety-critical domains. Modern approaches rely on penalty-based methods at training time, which do not guarantee to avoid constraint violations; or constraint-specific model architectures (e.g., for monotonicity); or on output projection, which requires to solve an optimization problem that might be computationally demanding. We present the Hyperspherical Constrained Representation, a novel method to enforce constraints in the output space for convex and bounded feasibility regions (generalizable to star domains). Our method operates on a different representation system, where Euclidean coordinates are converted into hyperspherical coordinates relative to the constrained region, which can only inherently represent feasible points. Experiments on a synthetic and a real-world dataset show that our method has predictive performance comparable to the other approaches, can guarantee 100% constraint satisfaction, and has a minimal computational cost at inference time.
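To see why such a representation can only produce feasible points, here is a minimal NumPy sketch for the simplest case of a ball-shaped region ||y - c|| <= R: the network outputs angles plus a radius logit, which decode to a point inside the ball by construction. For general convex regions the admissible radius would depend on the direction (distance from the center to the boundary); this simplification and all names are assumptions.

```python
import numpy as np

def hyperspherical_decode(raw, c, R):
    # raw: unconstrained network outputs = (theta, phi, radius logit), 3-D case.
    theta, phi, r_logit = raw
    r = R / (1.0 + np.exp(-r_logit))          # radius squashed into (0, R)
    direction = np.array([np.sin(theta) * np.cos(phi),
                          np.sin(theta) * np.sin(phi),
                          np.cos(theta)])      # unit vector
    return c + r * direction                   # always inside the ball

y = hyperspherical_decode(np.array([0.7, 1.2, 0.3]), c=np.zeros(3), R=2.0)
assert np.linalg.norm(y) <= 2.0                # feasibility holds by construction
```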

[AI-20] Human strategies for correcting 'human-robot' errors during a laundry sorting task

【Quick Read】: This paper asks how a domestic robot's speech and behaviour can be improved to enhance human-robot interaction (HRI) and better support collaborative household tasks. The focus is on how people naturally communicate when the robot fails, and on identifying the corresponding communication patterns. The key is analysing the language, gestures, and behavioural responses participants used while interacting with Laundrobot under different error modes, revealing common coping strategies such as correcting and teaching, taking responsibility, and displays of frustration. The study also examines whether reactions to the gap between the robot's performance and expectations weaken with exposure, and how they may be shaped by the robot's appearance, morphology, voice, capabilities, and recovery strategies. A key element of the design is that participants, unaware that Laundrobot was operated by a human actor and uninformed about the error modes, instructed it naturally with speech and gestures, capturing authentic communication patterns and coping mechanisms.

Link: https://arxiv.org/abs/2504.08395
Authors: Pepita Barnard,Maria J Galvez Trigo,Dominic Price,Sue Cobb,Gisela Reyes-Cruz,Gustavo Berumen,David Branson III,Mojtaba A. Khanesar,Mercedes Torres Torres,Michel Valstar
Affiliations: Unknown
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments:

Abstract:Mental models and expectations underlying human-human interaction (HHI) inform human-robot interaction (HRI) with domestic robots. To ease collaborative home tasks by improving domestic robot speech and behaviours for human-robot communication, we designed a study to understand how people communicated when failure occurs. To identify patterns of natural communication, particularly in response to robotic failures, participants instructed Laundrobot to move laundry into baskets using natural language and gestures. Laundrobot either worked error-free, or in one of two error modes. Participants were not advised Laundrobot would be a human actor, nor given information about error modes. Video analysis from 42 participants found speech patterns that included laughter, verbal expressions, and filler words, such as "oh" and "ok", as well as sequences of body movements, including touching one’s own face, increased pointing with a static finger, and expressions of surprise. Common strategies deployed when errors occurred included correcting and teaching, taking responsibility, and displays of frustration. The strength of reaction to errors diminished with exposure, possibly indicating acceptance or resignation. Some used strategies similar to those used to communicate with other technologies, such as smart assistants. An anthropomorphic robot may not be ideally suited to this kind of task. Laundrobot’s appearance, morphology, voice, capabilities, and recovery strategies may have impacted how it was perceived. Some participants indicated Laundrobot’s actual skills were not aligned with expectations; this made it difficult to know what to expect and how much Laundrobot understood. Expertise, personality, and cultural differences may affect responses, however these were not assessed.

[AI-21] PCA-RAG: Principal Component Analysis for Efficient Retrieval-Augmented Generation

【Quick Read】: This paper addresses the storage and latency scalability challenges that high-dimensional language-model embeddings pose for Retrieval-Augmented Generation (RAG), especially over massive financial text corpora. The solution uses Principal Component Analysis (PCA) to reduce embedding dimensionality, mitigating computational bottlenecks while keeping accuracy losses small. The key is the balance PCA compression strikes between retrieval fidelity and resource efficiency, which is essential for real-time systems such as Zanista AI's Newswitch platform, and which shows that classical dimensionality-reduction techniques remain practical for jointly optimizing speed, memory efficiency, and accuracy.

Link: https://arxiv.org/abs/2504.08386
Authors: Arman Khaledian,Amirreza Ghadiridehkordi,Nariman Khaledian
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (stat.ML)
Comments: 19 pages

Abstract:Retrieval-Augmented Generation (RAG) has emerged as a powerful paradigm for grounding large language models in external knowledge sources, improving the precision of agents' responses. However, high-dimensional language model embeddings, often in the range of hundreds to thousands of dimensions, can present scalability challenges in terms of storage and latency, especially when processing massive financial text corpora. This paper investigates the use of Principal Component Analysis (PCA) to reduce embedding dimensionality, thereby mitigating computational bottlenecks without incurring large accuracy losses. We experiment with a real-world dataset and compare different similarity and distance metrics under both full-dimensional and PCA-compressed embeddings. Our results show that reducing vectors from 3,072 to 110 dimensions provides a sizeable (up to 60x) speedup in retrieval operations and a ~28.6x reduction in index size, with only moderate declines in correlation metrics relative to human-annotated similarity scores. These findings demonstrate that PCA-based compression offers a viable balance between retrieval fidelity and resource efficiency, essential for real-time systems such as Zanista AI's Newswitch platform. Ultimately, our study underscores the practicality of leveraging classical dimensionality reduction techniques to scale RAG architectures for knowledge-intensive applications in finance and trading, where speed, memory efficiency, and accuracy must jointly be optimized.
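The pipeline is simple enough to sketch end to end in Python with scikit-learn: fit PCA on a corpus of document embeddings, project both the index and incoming queries to low dimension, and retrieve by cosine similarity. The random embeddings below stand in for real 3,072-dimensional model embeddings.

```python
import numpy as np
from sklearn.decomposition import PCA

docs = np.random.randn(10_000, 3072).astype(np.float32)   # stand-in embeddings
pca = PCA(n_components=110).fit(docs)                      # 3,072 -> 110 dims
index = pca.transform(docs)
index /= np.linalg.norm(index, axis=1, keepdims=True)      # pre-normalize

def retrieve(query_emb, k=5):
    # query_emb: one 3,072-dim embedding; returns indices of top-k documents.
    q = pca.transform(query_emb.reshape(1, -1))[0]
    q /= np.linalg.norm(q)
    return np.argsort(index @ q)[::-1][:k]                 # cosine similarity
```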

[AI-22] Passive Underwater Acoustic Signal Separation based on Feature Decoupling Dual-path Network

【Quick Read】: This paper addresses ship-radiated noise separation in passive sonar, noting that the separation networks widely used in this domain mostly originate from speech-separation applications and do not adequately account for the unique characteristics of underwater acoustics (such as the propagation medium, signal frequencies, and modulation characteristics). It proposes a novel time-domain network whose key idea is a dual-path model with a feature-decoupling approach that maps the features of mixed signals into a space where they are more independent, decoupling the significance of each dimension. A fusion of local and global attention mechanisms in the separation layer further improves performance. Extensive comparisons show the method outperforms other prevalent network models on the ShipsEar and DeepShip datasets.

Link: https://arxiv.org/abs/2504.08371
Authors: Yucheng Liu,Longyu Jiang
Affiliations: Unknown
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Comments: 10 pages, 4 figures

Abstract:Signal separation in the passive underwater acoustic domain has heavily relied on deep learning techniques to isolate ship radiated noise. However, the separation networks commonly used in this domain stem from speech separation applications and may not fully consider the unique aspects of underwater acoustics beforehand, such as the influence of different propagation media, signal frequencies and modulation characteristics. This oversight highlights the need for tailored approaches that account for the specific characteristics of underwater sound propagation. This study introduces a novel temporal network designed to separate ship radiated noise by employing a dual-path model and a feature decoupling approach. The mixed signals’ features are transformed into a space where they exhibit greater independence, with each dimension’s significance decoupled. Subsequently, a fusion of local and global attention mechanisms is employed in the separation layer. Extensive comparisons showcase the effectiveness of this method when compared to other prevalent network models, as evidenced by its performance in the ShipsEar and DeepShip datasets.
zh
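
下面用 PyTorch 给出双路径(dual-path)处理思想的极简示意:长序列切块后,交替在块内与块间两个尺度建模。此处以 LSTM 作占位模块,论文实际采用的是融合局部与全局注意力的结构,维度与模块选择均为假设。

```python
import torch
import torch.nn as nn

class DualPathBlock(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.intra = nn.LSTM(dim, dim, batch_first=True)  # 块内建模(局部)
        self.inter = nn.LSTM(dim, dim, batch_first=True)  # 块间建模(全局)

    def forward(self, x):          # x: (batch, 块数 S, 块长 K, dim)
        b, s, k, d = x.shape
        y, _ = self.intra(x.reshape(b * s, k, d))     # 各块内部独立处理
        y = y.reshape(b, s, k, d).transpose(1, 2)     # 交换块维与块内维
        z, _ = self.inter(y.reshape(b * k, s, d))     # 同一块内位置跨块处理
        return z.reshape(b, k, s, d).transpose(1, 2)  # 还原为 (b, S, K, dim)

out = DualPathBlock(16)(torch.randn(2, 8, 10, 16))
print(out.shape)  # torch.Size([2, 8, 10, 16])
```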

[AI-23] Kernel-Level Energy-Efficient Neural Architecture Search for Tabular Dataset

【速读】:该论文试图解决通过神经架构搜索(NAS)方法直接优化神经网络能耗的问题,而非依赖如内存使用、浮点运算次数(FLOPs)或推理延迟等代理指标来间接估计能耗。传统方法通常假设减少这些代理指标会同时降低实际能耗,但该研究提出了一个全新的能量高效NAS方法,其关键是专注于直接最小化能耗,同时保持可接受的模型精度。特别地,该解决方案针对表格数据集(tabular datasets)进行了优化设计,与以往主要面向视觉和语言任务的方法不同。实验结果显示,该方法推荐的最佳架构相较于传统NAS推荐的架构,最多可将能耗降低92%。

链接: https://arxiv.org/abs/2504.08359
作者: Hoang-Loc La,Phuong Hoai Ha
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: ACIIDS 2025 Conference

点击查看摘要

Abstract:Many studies estimate energy consumption using proxy metrics like memory usage, FLOPs, and inference latency, with the assumption that reducing these metrics will also lower energy consumption in neural networks. This paper, however, takes a different approach by introducing an energy-efficient Neural Architecture Search (NAS) method that directly focuses on identifying architectures that minimize energy consumption while maintaining acceptable accuracy. Unlike previous methods that primarily target vision and language tasks, the approach proposed here specifically addresses tabular datasets. Remarkably, the optimal architecture suggested by this method can reduce energy consumption by up to 92% compared to architectures recommended by conventional NAS.
zh
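
下面以随机搜索为骨架,给出“直接以能耗为目标、精度作约束”这一思路的玩具级示意;其中 evaluate 返回的精度与能耗均为假设的占位计算,实际方法需要训练模型并实测能耗。

```python
import random

def evaluate(arch):
    # 占位评估:返回 (精度, 能耗);真实方法需训练模型并实测能耗
    acc = 0.70 + 0.02 * len(arch) + 0.01 * random.random()
    energy = sum(arch) * 0.3
    return acc, energy

def random_arch():
    # 架构编码为各层宽度的列表(假设的搜索空间)
    return [random.choice([16, 32, 64]) for _ in range(random.randint(1, 4))]

best = None
for _ in range(500):
    arch = random_arch()
    acc, energy = evaluate(arch)
    # 目标:在精度不低于阈值的前提下,直接最小化能耗
    if acc >= 0.74 and (best is None or energy < best[1]):
        best = (arch, energy, acc)
print("选中架构:", best)
```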

[AI-24] SortBench: Benchmarking LLMs based on their ability to sort lists

【速读】:该论文试图解决大型语言模型(Large Language Models, LLMs)在排序任务(Sorting Task)上的表现不佳问题。排序任务对人类而言简单且直观,但对LLMs却极具挑战性,因为其涉及LLMs已知的一些弱点,如对输入数据的忠实性、值之间的逻辑比较以及严格区分语法(用于排序)与语义(通常由嵌入向量学习)的能力。为解决这一问题,论文设计了一个名为SortBench的新基准测试集,该测试集具有不同的难度级别且易于按需扩展。通过将SortBench应用于七种最先进的LLMs,包括当前的测试时推理模型,研究发现即使在一般排序任务中表现出色的模型,在面对需要同时处理语法和语义的复杂字符串排序(如数字以文字形式呈现)时也会被误导。此外,所有模型在处理长列表输入时均存在忠实性不足的问题,表现为遗漏项目或添加无关项目。研究还表明,测试时推理虽然有助于某些任务,但也可能因过度思考而导致性能下降。最后,无测试时推理能力的模型(如GPT-4o)的表现并不明显逊色于具备推理能力的模型。因此,论文的关键在于提出SortBench基准测试集,并通过系统性评估揭示LLMs在排序任务中的共性挑战及测试时推理的潜在局限性。

链接: https://arxiv.org/abs/2504.08312
作者: Steffen Herbold
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Sorting is a tedious but simple task for human intelligence and can be solved fairly easily algorithmically. However, for Large Language Models (LLMs) this task is surprisingly hard, as some properties of sorting are among known weaknesses of LLMs: being faithful to the input data, logical comparisons between values, and strictly differentiating between syntax (used for sorting) and semantics (typically learned by embeddings). Within this paper, we describe the new SortBench benchmark for LLMs that comes with different difficulties and that can be easily scaled in terms of difficulty. We apply this benchmark to seven state-of-the-art LLMs, including current test-time reasoning models. Our results show that while the o3-mini model is very capable at sorting in general, even it can be fooled if strings are defined to mix syntactical and semantical aspects, e.g., by asking it to sort numbers written out as words. Furthermore, all models have problems with faithfulness to the input for long lists, i.e., they drop items and add new ones. Our results also show that test-time reasoning has a tendency to overthink problems, which leads to performance degradation. Finally, models without test-time reasoning like GPT-4o are not much worse than reasoning models.
zh
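
下面的片段示意摘要中“语法与语义混合”的测试构造方式:把数字写成英文单词后,字典序(语法)排序与数值(语义)排序结果不同;并附带“忠实性”检查,即输出与输入是否构成同一多重集。这是对基准设计思想的推测性还原,并非官方测试代码。

```python
WORDS = ["zero", "one", "two", "three", "four",
         "five", "six", "seven", "eight", "nine"]
VALUE = {w: i for i, w in enumerate(WORDS)}

items = ["three", "one", "seven", "two"]
semantic = sorted(items, key=VALUE.get)  # 语义排序: one, two, three, seven
lexical = sorted(items)                  # 语法(字典序)排序: one, seven, three, two

def faithful(prediction, original):
    # 忠实性:不丢项、不添项(与原输入构成同一多重集)
    return sorted(prediction) == sorted(original)

print(semantic, lexical, faithful(semantic, items))
```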

[AI-25] RAG-VR: Leveraging Retrieval-Augmented Generation for 3D Question Answering in VR Environments

【速读】:该论文旨在解决虚拟现实(Virtual Reality, VR)环境中上下文高度本地化和个人化导致通用大型语言模型(Large Language Models, LLMs)效果受限的问题。为应对这一挑战,论文提出了一种名为RAG-VR的3D问答系统,其关键创新在于结合检索增强生成(Retrieval-Augmented Generation, RAG),通过从本地化知识库中检索外部知识来增强LLM,从而提升答案质量。此外,RAG-VR设计了一个管道以提取关于虚拟环境和用户条件的全面知识,并将检索过程卸载到附近的边缘服务器以提高效率,同时训练检索器有效区分与问题相关的、无关的以及难以区分的信息。这些方法使得RAG-VR在答案准确性上提升了17.9%-41.8%,并减少了34.5%-47.3%的端到端延迟。

链接: https://arxiv.org/abs/2504.08256
作者: Shiyi Ding,Ying Chen
机构: 未知
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注: Proceedings of the 2025 IEEE Conference on Virtual Reality and 3D User Interfaces (VR), March 2025

点击查看摘要

Abstract:Recent advances in large language models (LLMs) provide new opportunities for context understanding in virtual reality (VR). However, VR contexts are often highly localized and personalized, limiting the effectiveness of general-purpose LLMs. To address this challenge, we present RAG-VR, the first 3D question-answering system for VR that incorporates retrieval-augmented generation (RAG), which augments an LLM with external knowledge retrieved from a localized knowledge database to improve the answer quality. RAG-VR includes a pipeline for extracting comprehensive knowledge about virtual environments and user conditions for accurate answer generation. To ensure efficient retrieval, RAG-VR offloads the retrieval process to a nearby edge server and uses only essential information during retrieval. Moreover, we train the retriever to effectively distinguish among relevant, irrelevant, and hard-to-differentiate information in relation to questions. RAG-VR improves answer accuracy by 17.9%-41.8% and reduces end-to-end latency by 34.5%-47.3% compared with two baseline systems.
zh
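
训练检索器区分相关、无关与难以区分条目,常见做法之一是三元组/对比损失。以下 PyTorch 片段仅为这一思想的极简示意,并非论文的具体训练目标:

```python
import torch
import torch.nn.functional as F

def triplet_loss(q, pos, hard_neg, margin=0.2):
    # 拉近问题与相关条目,推远难以区分的负例
    d_pos = 1 - F.cosine_similarity(q, pos)       # 与正例的余弦距离
    d_neg = 1 - F.cosine_similarity(q, hard_neg)  # 与难负例的余弦距离
    return F.relu(d_pos - d_neg + margin).mean()

q, pos, neg = (torch.randn(8, 128) for _ in range(3))
print(triplet_loss(q, pos, neg))
```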

[AI-26] Jupiter: Fast and Resource-Efficient Collaborative Inference of Generative LLMs on Edge Devices

【速读】:该论文致力于解决生成式大语言模型(Generative Large Language Models, LLMs)在边缘设备上推理时面临的资源受限问题,特别是计算资源有限导致的推理延迟过长和内存占用过高的挑战。传统方法虽已探索协作边缘计算以突破单设备资源限制,但存在通信开销巨大及边缘资源利用率低的问题,且主要集中于优化预填充(prefill)阶段,忽视了生成式 LLM 关键的自回归解码(autoregressive decoding)阶段。论文的关键创新在于提出了一种名为 Jupiter 的协作边缘 AI 系统,通过引入灵活的流水线架构,并针对预填充和解码阶段的不同特性进行系统设计。对于预填充阶段,Jupiter 提出了一种新的序列内流水线并行机制并制定了详细的并行规划策略;而对于解码阶段,则设计了一种基于轮廓的流水线并行解码机制结合推测解码技术,进一步提升推理加速效果。这些方法显著提升了系统的资源效率和整体性能,在多种边缘环境配置下实现了高达 26.1 倍的端到端延迟减少,同时保持了与现有方法相当的生成质量。

链接: https://arxiv.org/abs/2504.08242
作者: Shengyuan Ye,Bei Ouyang,Liekang Zeng,Tianyi Qian,Xiaowen Chu,Jian Tang,Xu Chen
机构: 未知
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Networking and Internet Architecture (cs.NI)
备注: Accepted by IEEE International Conference on Computer Communications 2025

点击查看摘要

Abstract:Generative large language models (LLMs) have garnered significant attention due to their exceptional capabilities in various AI tasks. Traditionally deployed in cloud datacenters, LLMs are now increasingly moving towards more accessible edge platforms to protect sensitive user data and ensure privacy preservation. The limited computational resources of individual edge devices, however, can result in excessively prolonged inference latency and overwhelmed memory usage. While existing research has explored collaborative edge computing to break the resource wall of individual devices, these solutions yet suffer from massive communication overhead and under-utilization of edge resources. Furthermore, they focus exclusively on optimizing the prefill phase, neglecting the crucial autoregressive decoding phase for generative LLMs. To address that, we propose Jupiter, a fast, scalable, and resource-efficient collaborative edge AI system for generative LLM inference. Jupiter introduces a flexible pipelined architecture as a principle and differentiates its system design according to the differentiated characteristics of the prefill and decoding phases. For prefill phase, Jupiter submits a novel intra-sequence pipeline parallelism and develops a meticulous parallelism planning strategy to maximize resource efficiency; For decoding, Jupiter devises an effective outline-based pipeline parallel decoding mechanism combined with speculative decoding, which further magnifies inference acceleration. Extensive evaluation based on realistic implementation demonstrates that Jupiter remarkably outperforms state-of-the-art approaches under various edge environment setups, achieving up to 26.1x end-to-end latency reduction while rendering on-par generation quality.
zh
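
下面在贪心设定下给出推测解码(speculative decoding)主循环的简化骨架:草稿模型一次提议 k 个 token,目标模型逐个校验,遇到不一致即用目标模型输出纠正并丢弃其余提议。这只是该机制的通用示意,并非 Jupiter 的实现:

```python
def speculative_decode(draft_next, target_next, tokens, k=4, max_len=16):
    # draft_next / target_next: 给定前缀返回下一个 token 的贪心函数(示意)
    tokens = list(tokens)
    while len(tokens) < max_len:
        ctx = list(tokens)
        proposal = []
        for _ in range(k):             # 草稿模型快速提议 k 个 token
            t = draft_next(ctx)
            proposal.append(t)
            ctx.append(t)
        ctx = list(tokens)
        for t in proposal:             # 目标模型逐个校验提议
            expected = target_next(ctx)
            tokens.append(expected)
            ctx.append(expected)
            if expected != t:          # 不一致:已用目标输出纠正,丢弃其余提议
                break
    return tokens[:max_len]

# 玩具示例:草稿模型偶尔出错,目标模型输出等差序列
print(speculative_decode(lambda c: len(c) + (len(c) % 5 == 0),
                         lambda c: len(c), [0]))
```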

[AI-27] Optimizing Power Grid Topologies with Reinforcement Learning: A Survey of Methods and Challenges

【速读】:该论文旨在解决电力网络拓扑优化中的强化学习(Reinforcement Learning, RL)应用问题,通过系统性回顾现有技术、分类关键设计选择以及识别研究空白,为基于强化学习的电网优化方法提供全面的视角。论文的关键在于通过标准化基准和问题设定加速相关研究进展,并通过对比数值研究评估常用RL方法的实际有效性,从而为未来基于强化学习的电力系统优化奠定基础。

链接: https://arxiv.org/abs/2504.08210
作者: Erica van der Sar,Alessandro Zocca,Sandjai Bhulai
机构: 未知
类目: Systems and Control (eess.SY); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Machine Learning (stat.ML)
备注: 60 pages, 26 figures, preprint

点击查看摘要

Abstract:Power grid operation is becoming increasingly complex due to the rising integration of renewable energy sources and the need for more adaptive control strategies. Reinforcement Learning (RL) has emerged as a promising approach to power network control (PNC), offering the potential to enhance decision-making in dynamic and uncertain environments. The Learning To Run a Power Network (L2RPN) competitions have played a key role in accelerating research by providing standardized benchmarks and problem formulations, leading to rapid advancements in RL-based methods. This survey provides a comprehensive and structured overview of RL applications for power grid topology optimization, categorizing existing techniques, highlighting key design choices, and identifying gaps in current research. Additionally, we present a comparative numerical study evaluating the impact of commonly applied RL-based methods, offering insights into their practical effectiveness. By consolidating existing research and outlining open challenges, this survey aims to provide a foundation for future advancements in RL-driven power grid optimization.
zh

[AI-28] How Good Are Large Language Models for Course Recommendation in MOOCs?

【速读】:该论文试图解决大型语言模型(LLMs)在教育推荐系统中的应用潜力尚未被充分探索的问题。解决方案的关键在于利用LLMs从大规模语料库中提取的广泛知识,通过多种方法(包括基于提示的方法到更先进的微调技术)构建通用的课程推荐模型,并将其性能与传统推荐模型进行对比评估。研究在真实世界的慕课数据集上开展了广泛的实验,从准确性、多样性和新颖性等维度验证了LLMs在课程推荐任务中的表现,结果表明LLMs能够达到与传统模型相当的性能,展示了其提升教育推荐系统的潜力。

链接: https://arxiv.org/abs/2504.08208
作者: Boxuan Ma,Md Akib Zabed Khan,Tianyuan Yang,Agoritsa Polyzou,Shin’ichi Konomi
机构: 未知
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) have made significant strides in natural language processing and are increasingly being integrated into recommendation systems. However, their potential in educational recommendation systems has yet to be fully explored. This paper investigates the use of LLMs as a general-purpose recommendation model, leveraging their vast knowledge derived from large-scale corpora for course recommendation tasks. We explore a variety of approaches, ranging from prompt-based methods to more advanced fine-tuning techniques, and compare their performance against traditional recommendation models. Extensive experiments were conducted on a real-world MOOC dataset, evaluating using LLMs as course recommendation systems across key dimensions such as accuracy, diversity, and novelty. Our results demonstrate that LLMs can achieve good performance comparable to traditional models, highlighting their potential to enhance educational recommendation systems. These findings pave the way for further exploration and development of LLM-based approaches in the context of educational recommendations.
zh
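
提示法(prompt-based)课程推荐的最小示意如下:把用户历史选课与候选课程拼入提示词后交给 LLM。其中 call_llm 为假设的模型调用接口,课程名称亦为虚构示例。

```python
def build_prompt(history, candidates, k=3):
    # 将历史课程与候选课程组织为一条推荐请求提示词
    return (
        "你是课程推荐助手。用户已学课程:\n- "
        + "\n- ".join(history)
        + f"\n请从以下候选中为该用户推荐 {k} 门课程并说明理由:\n- "
        + "\n- ".join(candidates)
    )

history = ["Python 程序设计", "线性代数"]
candidates = ["机器学习基础", "数据结构", "艺术史", "概率统计"]
print(build_prompt(history, candidates))
# 实际使用时将提示词交给模型:answer = call_llm(build_prompt(history, candidates))
# 其中 call_llm 为假设的 LLM 调用接口
```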

[AI-29] DRAFT-ing Architectural Design Decisions using LLMs

【速读】:该论文旨在解决软件架构知识管理(Architectural Knowledge Management, AKM)中存在的标准化不足和高人工投入问题,特别是通过架构决策记录(Architecture Decision Records, ADRs)捕捉架构设计决策(Architecture Design Decisions, ADDs)时面临的挑战。尽管生成式大语言模型(Large Language Models, LLMs)已被证明能够辅助生成ADDs,但仅凭简单提示无法生成高质量的ADDs,同时使用第三方LLMs存在隐私风险,而自托管则面临资源限制。

为了解决上述问题,论文提出了DRAFT(Domain Specific Retrieval-Augmented Few-Shot Fine-Tuning),这是一种结合少样本学习(few-shot)、检索增强生成(retrieval-augmented generation, RAG)和微调(fine-tuning)三种方法优势的新框架。DRAFT的关键在于其两阶段工作流程:离线阶段利用检索增强的示例对LLM进行微调,以生成更高质量的ADDs;在线阶段则通过结合检索到的ADR和微调后的模型来生成最终的ADDs。实验结果显示,DRAFT在有效性和效率上均优于现有方法,同时解决了隐私和资源约束的问题。

链接: https://arxiv.org/abs/2504.08207
作者: Rudra Dhar,Adyansh Kakran,Amey Karan,Karthik Vaidhyanathan,Vasudeva Varma
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Architectural Knowledge Management (AKM) is crucial for software development but remains challenging due to the lack of standardization and high manual effort. Architecture Decision Records (ADRs) provide a structured approach to capture Architecture Design Decisions (ADDs), but their adoption is limited due to the manual effort involved and insufficient tool support. Our previous work has shown that Large Language Models (LLMs) can assist in generating ADDs. However, simply prompting the LLM does not produce quality ADDs. Moreover, using third-party LLMs raises privacy concerns, while self-hosting them poses resource challenges. To this end, we experimented with different approaches like few-shot, retrieval-augmented generation (RAG) and fine-tuning to enhance LLM’s ability to generate ADDs. Our results show that both techniques improve effectiveness. Building on this, we propose Domain Specific Retrieval-Augmented Few-Shot Fine-Tuning (DRAFT), which combines the strengths of all three approaches for more effective ADD generation. DRAFT operates in two phases: an offline phase that fine-tunes an LLM on generating ADDs augmented with retrieved examples and an online phase that generates ADDs by leveraging retrieved ADRs and the fine-tuned model. We evaluated DRAFT against existing approaches on a dataset of 4,911 ADRs and various LLMs and analyzed them using automated metrics and human evaluations. Results show DRAFT outperforms all other approaches in effectiveness while maintaining efficiency. Our findings indicate that DRAFT can aid architects in drafting ADDs while addressing privacy and resource constraints.
zh

[AI-30] Influential Bandits: Pulling an Arm May Change the Environment

【速读】:该论文致力于解决多臂老虎机(multi-armed bandit)问题中现有模型未能充分捕捉的两个关键挑战:非平稳环境中的臂间相互依赖性以及选择某一臂对其他臂未来奖励的影响。传统模型如rotting bandits或restless bandits无法有效处理这些场景,而本文通过引入“影响老虎机”(influential bandit)问题来建模这种臂间交互,利用未知的对称半正定交互矩阵描述臂损失的动力学特性。论文的关键解决方案在于提出了一种基于下置信界(Lower Confidence Bound, LCB)估计的新算法,该算法针对损失动态结构进行了优化设计。在温和假设下,此算法实现了在时间跨度依赖上接近最优的 $O(KT\log T)$ 遗憾界,并且在计算上高效易实现。实证评估表明,该方法在合成数据和真实数据上的表现显著优于传统老虎机算法。

链接: https://arxiv.org/abs/2504.08200
作者: Ryoma Sato,Shinji Ito
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:While classical formulations of multi-armed bandit problems assume that each arm’s reward is independent and stationary, real-world applications often involve non-stationary environments and interdependencies between arms. In particular, selecting one arm may influence the future rewards of other arms, a scenario not adequately captured by existing models such as rotting bandits or restless bandits. To address this limitation, we propose the influential bandit problem, which models inter-arm interactions through an unknown, symmetric, positive semi-definite interaction matrix that governs the dynamics of arm losses. We formally define this problem and establish two regret lower bounds, including a superlinear $\Omega(T^2 / \log^2 T)$ bound for the standard UCB algorithm and an algorithm-independent $\Omega(T)$ bound, which highlight the inherent difficulty of the setting. We then introduce a new algorithm based on a lower confidence bound (LCB) estimator tailored to the structure of the loss dynamics. Under mild assumptions, our algorithm achieves a regret of $O(KT \log T)$, which is nearly optimal in terms of its dependence on the time horizon. The algorithm is simple to implement and computationally efficient. Empirical evaluations on both synthetic and real-world datasets demonstrate the presence of inter-arm influence and confirm the superior performance of our method compared to conventional bandit algorithms.
zh
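
在损失最小化设定下,LCB 算法选取“均值减置信半径”最小的臂,体现对不确定性的乐观态度。以下为通用骨架示意;论文中的估计器针对损失动态结构做了专门设计,此处未体现:

```python
import numpy as np

def lcb_bandit(sample_loss, K=5, T=2000):
    counts, sums = np.zeros(K), np.zeros(K)
    for t in range(T):
        if t < K:
            a = t                                    # 每个臂先各拉一次
        else:
            mean = sums / counts
            bonus = np.sqrt(2 * np.log(t) / counts)  # 置信半径
            a = int(np.argmin(mean - bonus))         # 损失的下置信界最小者
        loss = sample_loss(a, t)
        counts[a] += 1
        sums[a] += loss
    return counts

rng = np.random.default_rng(0)
# 玩具环境:臂编号越大期望损失越高(平稳情形,仅作演示)
print(lcb_bandit(lambda a, t: rng.normal(loc=a * 0.1, scale=1.0)))
```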

[AI-31] Graph Based Deep Reinforcement Learning Aided by Transformers for Multi-Agent Cooperation

【速读】:本文针对分布式目标点服务场景下(如灾害响应、环境监测和监控)合作自主无人机编队的任务规划挑战展开研究,特别是在部分可观测性、有限通信范围和不确定环境条件下,传统路径规划算法表现欠佳,尤其是缺乏先验信息时。为解决这些问题,论文提出了一种融合图神经网络(Graph Neural Networks, GNNs)、深度强化学习(Deep Reinforcement Learning, DRL)以及基于Transformer的消息传递机制的新框架,以增强多智能体协调能力和集体任务执行效率。关键在于利用GNNs通过自适应图构建来建模智能体间及智能体与目标间的交互,实现受限通信条件下的高效信息聚合与决策;同时结合边缘特征增强注意力的Transformer消息传递机制捕捉复杂交互模式,并采用带优先经验回放的双深度Q网络(Double Deep Q-Network, Double DQN)优化智能体策略在部分可观测环境中的行为。这种设计特别关注多智能体导航的可扩展性、适应性和任务执行效率。实验结果表明,该方法在服务提供率和服务覆盖率方面优于粒子群优化(PSO)、贪心算法和标准DQN等基准方法。

链接: https://arxiv.org/abs/2504.08195
作者: Michael Elrod,Niloufar Mehrabi,Rahul Amin,Manveen Kaur,Long Cheng,Jim Martin,Abolfazl Razi
机构: 未知
类目: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
备注: 6 pages, 7 figures, Accepted to the 2025 IEEE International Conference on Communications Workshops (ICC Workshops)

点击查看摘要

Abstract:Mission planning for a fleet of cooperative autonomous drones in applications that involve serving distributed target points, such as disaster response, environmental monitoring, and surveillance, is challenging, especially under partial observability, limited communication range, and uncertain environments. Traditional path-planning algorithms struggle in these scenarios, particularly when prior information is not available. To address these challenges, we propose a novel framework that integrates Graph Neural Networks (GNNs), Deep Reinforcement Learning (DRL), and transformer-based mechanisms for enhanced multi-agent coordination and collective task execution. Our approach leverages GNNs to model agent-agent and agent-goal interactions through adaptive graph construction, enabling efficient information aggregation and decision-making under constrained communication. A transformer-based message-passing mechanism, augmented with edge-feature-enhanced attention, captures complex interaction patterns, while a Double Deep Q-Network (Double DQN) with prioritized experience replay optimizes agent policies in partially observable environments. This integration is carefully designed to address specific requirements of multi-agent navigation, such as scalability, adaptability, and efficient task execution. Experimental results demonstrate superior performance, with 90% service provisioning and 100% grid coverage (node discovery), while reducing the average steps per episode to 200, compared to 600 for benchmark methods such as particle swarm optimization (PSO), greedy algorithms and DQN.
zh
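
“基于通信范围的自适应图构建”可概括为:两架无人机距离不超过通信半径即建立边。以下 NumPy 示意仅计算这一邻接矩阵,不含论文中的边特征与注意力部分:

```python
import numpy as np

def build_adjacency(positions, comm_range):
    # positions: (N, 2) 各智能体坐标;返回 (N, N) 布尔邻接矩阵(去除自环)
    diff = positions[:, None, :] - positions[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)
    adj = dist <= comm_range
    np.fill_diagonal(adj, False)
    return adj

pos = np.array([[0.0, 0.0], [1.0, 0.0], [5.0, 5.0]])
print(build_adjacency(pos, comm_range=2.0))
```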

[AI-32] On the Practice of Deep Hierarchical Ensemble Network for Ad Conversion Rate Prediction WWW2025

【速读】:该论文旨在解决深度分层集成网络(Deep Hierarchical Ensemble Network, DHEN)在转化率预测(CVR)任务中的性能不确定性问题,特别是在广告竞标场景下预测用户离站行为(如购买、加入购物车、注册等)的应用中。论文的关键在于提出了一种以DHEN为核心架构的多任务学习框架,并结合以下创新方法提升其性能:
1)研究如何有效集成多种特征交叉模块(如MLP、DCN、Transformer等),并优化模型的深度与宽度;
2)构建基于用户行为序列的现场实时特征及离站转化事件序列,并评估其重要性;
3)引入自监督辅助损失函数,通过预测输入序列中的未来动作缓解标签稀疏问题。这些方案共同提升了DHEN在转化率预测任务上的表现,尤其在利用预训练用户个性化特征时达到当前最佳水平。

链接: https://arxiv.org/abs/2504.08169
作者: Jinfeng Zhuang,Yinrui Li,Runze Su,Ke Xu,Zhixuan Shao,Kungang Li,Ling Leng,Han Sun,Meng Qi,Yixiong Meng,Yang Tang,Zhifang Liu,Qifei Shen,Aayush Mudgal
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Applications (stat.AP); Machine Learning (stat.ML)
备注: Accepted by WWW 2025

点击查看摘要

Abstract:The predictions of click-through rate (CTR) and conversion rate (CVR) play a crucial role in the success of ad-recommendation systems. A Deep Hierarchical Ensemble Network (DHEN) has been proposed to integrate multiple feature crossing modules and has achieved great success in CTR prediction. However, its performance for CVR prediction is unclear in the conversion ads setting, where an ad bids for the probability of a user’s off-site actions on a third-party website or app, including purchase, add to cart, sign up, etc. A few challenges in DHEN: 1) What feature-crossing modules (MLP, DCN, Transformer, to name a few) should be included in DHEN? 2) How deep and wide should DHEN be to achieve the best trade-off between efficiency and efficacy? 3) What hyper-parameters to choose in each feature-crossing module? Orthogonal to the model architecture, the input personalization features also significantly impact model performance with a high degree of freedom. In this paper, we attack this problem and present our contributions biased to the applied data science side, including: First, we propose a multitask learning framework with DHEN as the single backbone model architecture to predict all CVR tasks, with a detailed study on how to make DHEN work effectively in practice; Second, we build both on-site real-time user behavior sequences and off-site conversion event sequences for CVR prediction purposes, and conduct an ablation study on their importance; Last but not least, we propose a self-supervised auxiliary loss to predict future actions in the input sequence, to help resolve the label sparseness issue in CVR prediction. Our method achieves state-of-the-art performance compared to previous single feature crossing modules with pre-trained user personalization features.
zh

[AI-33] Rethinking the Foundations for Continual Reinforcement Learning

【速读】:该论文试图解决传统强化学习(Reinforcement Learning, RL)的核心基础是否适用于连续学习(Continual Reinforcement Learning)的问题。作者指出,传统RL的四个核心基础——马尔可夫决策过程(Markov Decision Process, MDP)形式化、对最优策略的关注、以预期奖励总和为主要评估指标以及采用接受上述三者的分段基准环境(episodic benchmark environments),实际上与连续学习的目标相冲突。论文的关键在于提出了一组替代性的基础,这些新基础更符合连续学习的需求,旨在促使研究者重新思考传统基础、提出并批评替代方案,并开发基于这些更合适基础的新算法和方法。

链接: https://arxiv.org/abs/2504.08161
作者: Michael Bowling,Esraa Elelimy
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Algorithms and approaches for continual reinforcement learning have gained increasing attention. Much of this early progress rests on the foundations and standard practices of traditional reinforcement learning, without questioning if they are well-suited to the challenges of continual learning agents. We suggest that many core foundations of traditional RL are, in fact, antithetical to the goals of continual reinforcement learning. We enumerate four such foundations: the Markov decision process formalism, a focus on optimal policies, the expected sum of rewards as the primary evaluation metric, and episodic benchmark environments that embrace the other three foundations. Shedding such sacredly held and taught concepts is not easy. They are self-reinforcing in that each foundation depends upon and holds up the others, making it hard to rethink each in isolation. We propose an alternative set of all four foundations that are better suited to the continual learning setting. We hope to spur on others in rethinking the traditional foundations, proposing and critiquing alternatives, and developing new algorithms and approaches enabled by better-suited foundations.
zh

[AI-34] Orchestrating Agents and Data for Enterprise: A Blueprint Architecture for Compound AI

【速读】:该论文试图解决大型语言模型(Large Language Models, LLMs)在企业应用中广泛采用所面临的挑战,包括与现有系统集成、利用专有数据和模型、满足成本、质量、响应速度等多方面需求。论文指出,从单一模型向复合人工智能(Compound AI)系统的转变是解决问题的关键方向,但目前进展零散,缺乏整体架构设计。
解决方案的关键在于提出了一种“蓝图架构”(blueprint architecture),用于协调代理(agents)和数据以支持企业应用。该架构的核心概念是“流”(streams),用于协调代理之间的数据和指令流动。企业现有的专有模型和API被映射为注册表中的代理,并通过代理注册表提供元数据和学习表示用于搜索和规划;数据则通过数据注册表进行管理。任务和查询通过数据与任务规划器分解、映射和优化,以满足特定的服务质量(Quality of Service, QoS)要求。论文以人力资源(HR)领域的应用场景为例展示了该架构的实现,并探讨了“代理型AI”(agentic AI)在企业中的机遇与挑战。

链接: https://arxiv.org/abs/2504.08148
作者: Eser Kandogan,Nikita Bhutani,Dan Zhang,Rafael Li Chen,Sairam Gurajada,Estevam Hruschka
机构: 未知
类目: Artificial Intelligence (cs.AI); Databases (cs.DB); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Large language models (LLMs) have gained significant interest in industry due to their impressive capabilities across a wide range of tasks. However, the widespread adoption of LLMs presents several challenges, such as integration into existing applications and infrastructure, utilization of company proprietary data, models, and APIs, and meeting cost, quality, responsiveness, and other requirements. To address these challenges, there is a notable shift from monolithic models to compound AI systems, with the premise of more powerful, versatile, and reliable applications. However, progress thus far has been piecemeal, with proposals for agentic workflows, programming models, and extended LLM capabilities, without a clear vision of an overall architecture. In this paper, we propose a ‘blueprint architecture’ for compound AI systems for orchestrating agents and data for enterprise applications. In our proposed architecture the key orchestration concept is ‘streams’ to coordinate the flow of data and instructions among agents. Existing proprietary models and APIs in the enterprise are mapped to ‘agents’, defined in an ‘agent registry’ that serves agent metadata and learned representations for search and planning. Agents can utilize proprietary data through a ‘data registry’ that similarly registers enterprise data of various modalities. Tying it all together, data and task ‘planners’ break down, map, and optimize tasks and queries for given quality of service (QoS) requirements such as cost, accuracy, and latency. We illustrate an implementation of the architecture for a use-case in the HR domain and discuss opportunities and challenges for ‘agentic AI’ in the enterprise.
zh

[AI-35] Vector Quantized-Elites: Unsupervised and Problem-Agnostic Quality-Diversity Optimization

【速读】:该论文旨在解决传统Quality-Diversity算法(如MAP-Elites)依赖预定义行为描述符和任务特定先验知识的问题,这限制了其灵活性与适用性。为了解决这一问题,论文提出了一种名为Vector Quantized-Elites (VQ-Elites) 的新型Quality-Diversity算法,其核心创新在于通过无监督学习自主构建结构化的行为空间网格,无需任务特定的先验知识。关键解决方案包括将向量量化变分自编码器(Vector Quantized Variational Autoencoders)集成到算法中,实现行为描述符的动态学习以及结构化行为空间网格的生成。此外,论文还引入了行为空间边界约束和协作机制以进一步提升性能。这些设计使VQ-Elites成为一种灵活、鲁棒且任务无关的优化框架,并显著提升了无监督Quality-Diversity算法的收敛性和表现。

链接: https://arxiv.org/abs/2504.08057
作者: Constantinos Tsakonas,Konstantinos Chatzilygeroudis
机构: 未知
类目: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
备注: 12 pages, 10 figures, 2 algorithms, 1 table

点击查看摘要

Abstract:Quality-Diversity algorithms have transformed optimization by prioritizing the discovery of diverse, high-performing solutions over a single optimal result. However, traditional Quality-Diversity methods, such as MAP-Elites, rely heavily on predefined behavioral descriptors and complete prior knowledge of the task to define the behavioral space grid, limiting their flexibility and applicability. In this work, we introduce Vector Quantized-Elites (VQ-Elites), a novel Quality-Diversity algorithm that autonomously constructs a structured behavioral space grid using unsupervised learning, eliminating the need for prior task-specific knowledge. At the core of VQ-Elites is the integration of Vector Quantized Variational Autoencoders, which enables the dynamic learning of behavioral descriptors and the generation of a structured, rather than unstructured, behavioral space grid - a significant advancement over existing unsupervised Quality-Diversity approaches. This design establishes VQ-Elites as a flexible, robust, and task-agnostic optimization framework. To further enhance the performance of unsupervised Quality-Diversity algorithms, we introduce two key components: behavioral space bounding and cooperation mechanisms, which significantly improve convergence and performance. We validate VQ-Elites on robotic arm pose-reaching and mobile robot space-covering tasks. The results demonstrate its ability to efficiently generate diverse, high-quality solutions, emphasizing its adaptability, scalability, robustness to hyperparameters, and potential to extend Quality-Diversity optimization to complex, previously inaccessible domains.
zh
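
VQ-Elites 用学习到的码本把连续行为描述符离散化为网格单元,查找过程即“最近码本向量的索引”。以下示意中码本以随机矩阵代替(论文中由 VQ-VAE 学习得到),维度均为假设:

```python
import numpy as np

def assign_cell(z, codebook):
    # z: 行为描述符向量;返回最近码本向量的索引,作为行为网格单元编号
    return int(np.argmin(np.linalg.norm(codebook - z, axis=1)))

rng = np.random.default_rng(0)
codebook = rng.normal(size=(64, 8))  # 64 个单元、8 维行为空间(假设)
z = rng.normal(size=8)
print(assign_cell(z, codebook))
```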

[AI-36] Compositional Flows for 3D Molecule and Synthesis Pathway Co-design ICLR2025

【速读】:该论文旨在解决合成性分子设计等基于组合结构的生成任务中,如何有效建模连续特征状态并优化采样效率的问题。论文的关键创新在于提出了Compositional Generative Flows (CGFlow),这是一种将流匹配(Flow Matching)扩展到组合步骤中以生成具有连续特征对象的新框架。其核心思想是将组合状态转换建模视为流匹配插值过程的直接扩展,并结合生成流网络(GFlowNet)的理论基础实现基于奖励的组合结构采样。通过联合设计分子的合成路径及其3D结合姿势,CGFlow在LIT-PCBA基准的全部15个目标上实现了最先进的结合亲和力性能,并在CrossDocked基准上达到了Vina Dock和AiZynth成功率的最新技术水平,同时相比基于2D合成的方法提升了5.8倍的采样效率。

链接: https://arxiv.org/abs/2504.08051
作者: Tony Shen,Seonghwan Seo,Ross Irwin,Kieran Didi,Simon Olsson,Woo Youn Kim,Martin Ester
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Spotlighted at ICLR 2025 GEM and AI4Mat workshops, 29 pages, 7 figures

点击查看摘要

Abstract:Many generative applications, such as synthesis-based 3D molecular design, involve constructing compositional objects with continuous features. Here, we introduce Compositional Generative Flows (CGFlow), a novel framework that extends flow matching to generate objects in compositional steps while modeling continuous states. Our key insight is that modeling compositional state transitions can be formulated as a straightforward extension of the flow matching interpolation process. We further build upon the theoretical foundations of generative flow networks (GFlowNets), enabling reward-guided sampling of compositional structures. We apply CGFlow to synthesizable drug design by jointly designing the molecule’s synthetic pathway with its 3D binding pose. Our approach achieves state-of-the-art binding affinity on all 15 targets from the LIT-PCBA benchmark, and a $5.8\times$ improvement in sampling efficiency compared to a 2D synthesis-based baseline. To the best of our knowledge, our method is also the first to achieve state-of-the-art performance in both Vina Dock (-9.38) and AiZynth success rate (62.2%) on the CrossDocked benchmark.
zh

[AI-37] Utility Inspired Generalizations of TOPSIS

【速读】:该论文试图解决的问题是如何在标准TOPSIS方法中引入对权重缩放均值(WM)和权重缩放标准差(WSD)的可控影响,以实现对最终排名的更清晰解释性控制。标准TOPSIS方法中,这些组件对排名的影响无法被调节。为了解决这一局限,论文提出了一种对标准TOPSIS的改进方案,通过使TOPSIS聚合过程响应于WM和WSD,从而允许决策者根据偏好调整对这两种成分的权衡。关键在于提出的修改使得广义TOPSIS能够自然地还原为原始TOPSIS,或者在决策者的指导下将WM与WSD进行灵活权衡,甚至渐进转化为常规的“效用基础方法”。这种改进提供了一个实用工具,用于通过决策者偏好的受控应用来影响排名结果。

链接: https://arxiv.org/abs/2504.08014
作者: Robert Susmaga,Izabela Szczech
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:TOPSIS, a popular method for ranking alternatives, is based on aggregated distances to ideal and anti-ideal points. As such, it was considered to be essentially different from widely popular and acknowledged `utility-based methods', which build rankings from weight-averaged utility values. Nonetheless, TOPSIS has recently been shown to be a natural generalization of these `utility-based methods' on the grounds that the distances it uses can be decomposed into so-called weight-scaled means (WM) and weight-scaled standard deviations (WSD) of utilities. However, the influence that these two components exert on the final ranking cannot be in any way influenced in the standard TOPSIS. This is why, building on our previous results, in this paper we put forward modifications that make TOPSIS aggregations responsive to WM and WSD, achieving some amount of well-interpretable control over how the rankings are influenced by WM and WSD. The modifications constitute a natural generalization of the standard TOPSIS method because, thanks to them, the generalized TOPSIS may turn into the original TOPSIS or, otherwise, following the decision maker’s preferences, may trade off WM for WSD or WSD for WM. In the latter case, TOPSIS gradually reduces to a regular `utility-based method’. All in all, we believe that the proposed generalizations constitute an interesting practical tool for influencing the ranking by controlled application of a new form of decision maker’s preferences.
zh
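
作为参照,下面给出被该文推广的标准 TOPSIS 的极简实现:对已同向化(越大越好)的效用矩阵做向量归一化与加权,再按到理想点与反理想点的距离聚合打分;数据为虚构示例。

```python
import numpy as np

def topsis(X, w):
    # X: (备选方案数, 准则数) 效用矩阵(已同向化);w: 准则权重
    V = X / np.linalg.norm(X, axis=0) * w       # 归一化并加权
    ideal, anti = V.max(axis=0), V.min(axis=0)  # 理想点与反理想点
    d_pos = np.linalg.norm(V - ideal, axis=1)
    d_neg = np.linalg.norm(V - anti, axis=1)
    return d_neg / (d_pos + d_neg)              # 得分越高排名越靠前

X = np.array([[7.0, 9.0, 6.0], [8.0, 7.0, 8.0], [9.0, 6.0, 7.0]])
print(topsis(X, w=np.array([0.5, 0.3, 0.2])))
```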

[AI-38] A Python toolkit for dealing with Petri nets over ontological graphs

【速读】:该论文旨在通过本体图上的Petri网模型,将以本体形式封装的领域知识(尤其是语义关系)引入推理与控制过程的建模,从而实现更丰富的语义信息处理。解决方案的关键在于:由OWL 2本体语言构建的本体导出本体图,并设计实现一个Python工具包,使用户能够定义此类Petri网的结构与动态特性。

链接: https://arxiv.org/abs/2504.08006
作者: Krzysztof Pancerz
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: PP-RAI 2024

点击查看摘要

Abstract:We present theoretical rudiments of Petri nets over ontological graphs as well as the designed and implemented Python toolkit for dealing with such nets. In Petri nets over ontological graphs, the domain knowledge is enclosed in a form of ontologies. In this way, some valuable knowledge (especially in terms of semantic relations) can be added to model reasoning and control processes by means of Petri nets. In the implemented approach, ontological graphs are obtained from ontologies built in accordance with the OWL 2 Web Ontology Language. The implemented tool enables the users to define the structure and dynamics of Petri nets over ontological graphs.
zh
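
Petri 网的基本动态由“使能/点火”规则给出。以下 Python 片段实现该经典规则,作为理解工具包建模对象的参照,不含论文中的本体图扩展:

```python
def enabled(marking, pre):
    # pre: 变迁的输入弧 {库所: 所需 token 数};当前标识满足则变迁使能
    return all(marking.get(p, 0) >= n for p, n in pre.items())

def fire(marking, pre, post):
    # 点火:从输入库所扣除 token,向输出库所添加 token,返回新标识
    m = dict(marking)
    for p, n in pre.items():
        m[p] -= n
    for p, n in post.items():
        m[p] = m.get(p, 0) + n
    return m

m0 = {"p1": 2, "p2": 0}
t = ({"p1": 1}, {"p2": 1})  # 变迁:消耗 p1 一个 token,产生 p2 一个 token
if enabled(m0, t[0]):
    print(fire(m0, *t))     # {'p1': 1, 'p2': 1}
```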

[AI-39] Neuron-level Balance between Stability and Plasticity in Deep Reinforcement Learning

【速读】:该论文试图解决深度强化学习(Deep Reinforcement Learning, DRL)中智能体面临的稳定性-可塑性困境(stability-plasticity dilemma),即在保持已有技能(稳定性)与学习新知识(可塑性)之间难以平衡的问题。当前方法主要从网络层面进行优化,缺乏针对单个神经元的精细调控能力。为克服这一局限,论文提出了一种基于神经元层面的稳定性-可塑性平衡(Neuron-level Balance between Stability and Plasticity, NBSP)方法。其关键是通过目标导向的方法定义并识别出与任务相关的关键技能神经元(RL skill neurons),然后利用梯度掩蔽(gradient masking)和经验回放(experience replay)技术针对这些神经元设计框架,从而在保留已有技能的同时实现对新任务的适应能力。实验结果表明,NBSP在Meta-World和Atari基准测试中显著优于现有方法。

链接: https://arxiv.org/abs/2504.08000
作者: Jiahua Lan,Sen Zhang,Haixia Pan,Ruijun Liu,Li Shen,Dacheng Tao
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Reinforcement learning, RL skill neuron, stability and plasticity

点击查看摘要

Abstract:In contrast to the human ability to continuously acquire knowledge, agents struggle with the stability-plasticity dilemma in deep reinforcement learning (DRL), which refers to the trade-off between retaining existing skills (stability) and learning new knowledge (plasticity). Current methods focus on balancing these two aspects at the network level, lacking sufficient differentiation and fine-grained control of individual neurons. To overcome this limitation, we propose Neuron-level Balance between Stability and Plasticity (NBSP) method, by taking inspiration from the observation that specific neurons are strongly relevant to task-relevant skills. Specifically, NBSP first (1) defines and identifies RL skill neurons that are crucial for knowledge retention through a goal-oriented method, and then (2) introduces a framework by employing gradient masking and experience replay techniques targeting these neurons to preserve the encoded existing skills while enabling adaptation to new tasks. Numerous experimental results on the Meta-World and Atari benchmarks demonstrate that NBSP significantly outperforms existing approaches in balancing stability and plasticity.
zh
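
NBSP 中“对技能神经元做梯度掩蔽”的操作可示意如下:反向传播之后、参数更新之前,把已识别技能神经元对应的梯度置零,使其编码的技能不被新任务覆盖。技能神经元的识别方法此处从略,索引为假设:

```python
import torch
import torch.nn as nn

layer = nn.Linear(16, 8)
skill_neurons = [1, 4]                  # 假设已识别出的技能神经元(输出单元)索引

loss = layer(torch.randn(4, 16)).pow(2).mean()
loss.backward()

mask = torch.ones_like(layer.weight)
mask[skill_neurons] = 0.0               # 屏蔽技能神经元对应权重行的梯度
layer.weight.grad *= mask
layer.bias.grad[skill_neurons] = 0.0    # 偏置梯度同样置零
# 随后照常调用 optimizer.step(),技能神经元的参数保持不变
```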

[AI-40] SPHERE: An Evaluation Card for Human-AI Systems

【速读】:该论文试图解决在大语言模型(Large Language Models, LLMs)时代下,建立有效评估多样化人机交互系统方法和标准面临的挑战。为鼓励更透明的文档记录并促进关于人机系统评估设计选项的讨论,论文提出了一种名为SPHERE的评估卡片,涵盖五个关键维度:1)评估对象是什么;2)评估如何进行;3)哪些参与者参与评估;4)评估何时进行;5)如何验证评估结果。通过使用SPHERE回顾39个人机交互系统,论文概述了当前的评估实践及其改进领域,并提出了三条提升评估有效性与严谨性的建议。解决方案的关键在于开发并应用这一包含多维考量的评估框架——SPHERE,以系统化地分析和优化人机交互系统的评估过程。

链接: https://arxiv.org/abs/2504.07971
作者: Qianou Ma,Dora Zhao,Xinran Zhao,Chenglei Si,Chenyang Yang,Ryan Louie,Ehud Reiter,Diyi Yang,Tongshuang Wu
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:In the era of Large Language Models (LLMs), establishing effective evaluation methods and standards for diverse human-AI interaction systems is increasingly challenging. To encourage more transparent documentation and facilitate discussion on human-AI system evaluation design options, we present an evaluation card SPHERE, which encompasses five key dimensions: 1) What is being evaluated?; 2) How is the evaluation conducted?; 3) Who is participating in the evaluation?; 4) When is evaluation conducted?; 5) How is evaluation validated? We conduct a review of 39 human-AI systems using SPHERE, outlining current evaluation practices and areas for improvement. We provide three recommendations for improving the validity and rigor of evaluation practices.
zh

[AI-41] Mitigating Timbre Leakage with Universal Semantic Mapping Residual Block for Voice Conversion

【速读】:该论文旨在解决语音转换(Voice Conversion, VC)中内容表示中固有嵌入源说话人音色信息导致的音色泄漏(timbre leakage)问题,从而提升目标说话人相似度。论文的关键解决方案在于引入了一个残差块到内容提取器中,该残差块包含两个加权分支:1)基于通用语义词典的内容特征重表达(Content Feature Re-expression, CFR)模块,提供无音色的内容表示;2)跳过连接到原始内容层,补充细粒度信息。其中,CFR模块通过将每个内容帧表示为词典条目的加权线性组合来实现无音色的内容表示,词典条目代表音素类别,并通过多个说话人的语音统计计算得到,形成稳定的、与说话人无关的语义集。这一方法有效缓解了音色泄漏问题,显著提高了目标说话人的相似度。

链接: https://arxiv.org/abs/2504.08524
作者: Na Li,Chuke Wang,Yu Gu,Zhifeng Li
机构: 未知
类目: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Sound (cs.SD)
备注:

点击查看摘要

Abstract:Voice conversion (VC) transforms source speech into a target voice by preserving the content. However, timbre information from the source speaker is inherently embedded in the content representations, causing significant timbre leakage and reducing similarity to the target speaker. To address this, we introduce a residual block to a content extractor. The residual block consists of two weighted branches: 1) universal semantic dictionary based Content Feature Re-expression (CFR) module, supplying timbre-free content representation. 2) skip connection to the original content layer, providing complementary fine-grained information. In the CFR module, each dictionary entry in the universal semantic dictionary represents a phoneme class, computed statistically using speech from multiple speakers, creating a stable, speaker-independent semantic set. We introduce a CFR method to obtain timbre-free content representations by expressing each content frame as a weighted linear combination of dictionary entries using corresponding phoneme posteriors as weights. Extensive experiments across various VC frameworks demonstrate that our approach effectively mitigates timbre leakage and significantly improves similarity to the target speaker.
zh
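
CFR 模块“以音素后验为权重、对词典条目做加权线性组合”在计算上就是一次矩阵乘法,示意如下(词典与后验均以随机数代替,规模为假设):

```python
import numpy as np

rng = np.random.default_rng(0)
dictionary = rng.normal(size=(40, 256))            # 40 个音素类条目,每条 256 维
posteriors = rng.dirichlet(np.ones(40), size=100)  # 100 帧的音素后验(每帧和为 1)

# 每帧内容 ≈ 以后验为权重的词典条目线性组合,从而得到无音色的内容表示
timbre_free = posteriors @ dictionary              # (100, 40) @ (40, 256) -> (100, 256)
print(timbre_free.shape)
```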

[AI-42] Generalization Bounds in Hybrid Quantum-Classical Machine Learning Models

【速读】:该论文试图解决的问题是如何分析混合经典-量子模型(Hybrid Classical-Quantum Models)在训练数据上的泛化能力,并理解这些系统如何从数据中学习。论文的关键解决方案在于提出了一种统一的数学框架,用于分析混合模型的泛化性能,并基于此推导出一个新的泛化界(generalization bound),形式为 $O\big( \sqrt{\frac{T \log T}{N}} + \frac{\alpha}{\sqrt{N}} \big)$,其中 $N$ 是训练样本数量,$T$ 是可训练量子门的数量,而 $\|F\| \leq \alpha$ 表示有界的全连接层。这一泛化界清晰地分解为量子部分和经典部分,不仅扩展了对两部分已有工作的研究,还阐明了它们之间的相互作用。此外,论文将该结果应用于量子-经典卷积神经网络(Quantum-Classical Convolutional Neural Network, QCCNN),并通过分析揭示了在混合设置下应用经典统计学习理论的局限性,并提出了未来理论研究的潜在方向。

链接: https://arxiv.org/abs/2504.08456
作者: Tongyan Wu,Amine Bentellis,Alona Sakhnenko,Jeanette Miriam Lorenz
机构: 未知
类目: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI)
备注: 6 + 5 pages

点击查看摘要

Abstract:Hybrid classical-quantum models aim to harness the strengths of both quantum computing and classical machine learning, but their practical potential remains poorly understood. In this work, we develop a unified mathematical framework for analyzing generalization in hybrid models, offering insight into how these systems learn from data. We establish a novel generalization bound of the form $O\big( \sqrt{\frac{T \log T}{N}} + \frac{\alpha}{\sqrt{N}} \big)$ for $N$ training data points, $T$ trainable quantum gates, and bounded fully-connected layers $\|F\| \leq \alpha$. This bound decomposes cleanly into quantum and classical contributions, extending prior work on both components and clarifying their interaction. We apply our results to the quantum-classical convolutional neural network (QCCNN), an architecture that integrates quantum convolutional layers with classical processing. Alongside the bound, we highlight conceptual limitations of applying classical statistical learning theory in the hybrid setting and suggest promising directions for future theoretical work.
zh

[AI-43] Entropic bounds for conditionally Gaussian vectors and applications to neural networks

【速读】:本文旨在研究条件高斯分布与具有可逆协方差矩阵的高斯分布在总变差距离和2-Wasserstein距离上的新界值,并通过信息论中的熵不等式实现。关键解决方案在于利用量化累积量估计(quantitative cumulant estimates)的方法,这是由Hanin (2024) 提出的。基于此,论文量化了随机初始化的全连接神经网络及其在有限输入下导数收敛到高斯分布的速度。所提出的方法对激活函数仅需较弱的假设,能够以多种距离度量恢复最优收敛速率,改进并扩展了Basteri和Trevisan (2023)、Favaro等(2023)、Trevisan (2024)以及Apollonio等(2024)的研究成果。此外,文中还通过示例展示了如何将结果应用于限制贝叶斯后验分布与对应的高斯极限后验分布之间的总变差距离,从而提供了Hron等(2022)提出的后验中心极限定理的定量版本,并将Trevisan (2024)的若干估计扩展至总变差度量。

链接: https://arxiv.org/abs/2504.08335
作者: Lucia Celli,Giovanni Peccati
机构: 未知
类目: Probability (math.PR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Machine Learning (stat.ML)
备注:

点击查看摘要

Abstract:Using entropic inequalities from information theory, we provide new bounds on the total variation and 2-Wasserstein distances between a conditionally Gaussian law and a Gaussian law with invertible covariance matrix. We apply our results to quantify the speed of convergence to Gaussian of a randomly initialized fully connected neural network and its derivatives - evaluated in a finite number of inputs - when the initialization is Gaussian and the sizes of the inner layers diverge to infinity. Our results require mild assumptions on the activation function, and allow one to recover optimal rates of convergence in a variety of distances, thus improving and extending the findings of Basteri and Trevisan (2023), Favaro et al. (2023), Trevisan (2024) and Apollonio et al. (2024). Among our main tools are the quantitative cumulant estimates established in Hanin (2024). As an illustration, we apply our results to bound the total variation distance between the Bayesian posterior law of the neural network and its derivatives, and the posterior law of the corresponding Gaussian limit: this yields quantitative versions of a posterior CLT by Hron et al. (2022), and extends several estimates by Trevisan (2024) to the total variation metric.
zh

[AI-44] Accelerating Multi-Objective Collaborative Optimization of Doped Thermoelectric Materials via Artificial Intelligence

【速读】:该论文旨在解决通过传统试错方法效率低下且耗时长的热电材料发现难题。论文的关键解决方案是提出了一种深度学习模型,能够直接基于化学式准确预测掺杂材料的热电性能,并达到当前最先进的预测精度。此外,通过引入敏感性分析技术增强模型的可解释性,揭示物理描述符对热电品质因子 (zT) 的影响。同时,构建了一个集成代理模型与多目标遗传算法的耦合框架,以高效探索广阔的成分空间,寻找高性能候选材料。实验验证进一步证实了该方法在中温范围内发现具有优异 zT 值的新型热电材料的能力。

链接: https://arxiv.org/abs/2504.08258
作者: Yuxuan Zeng,Wenhao Xie,Wei Cao,Tan Peng,Yue Hou,Ziyu Wang,Jing Shi
机构: 未知
类目: Materials Science (cond-mat.mtrl-sci); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
备注:

点击查看摘要

Abstract:The thermoelectric performance of materials exhibits complex nonlinear dependencies on both elemental types and their proportions, rendering traditional trial-and-error approaches inefficient and time-consuming for material discovery. In this work, we present a deep learning model capable of accurately predicting thermoelectric properties of doped materials directly from their chemical formulas, achieving state-of-the-art performance. To enhance interpretability, we further incorporate sensitivity analysis techniques to elucidate how physical descriptors affect the thermoelectric figure of merit (zT). Moreover, we establish a coupled framework that integrates a surrogate model with a multi-objective genetic algorithm to efficiently explore the vast compositional space for high-performance candidates. Experimental validation confirms the discovery of a novel thermoelectric material with superior zT values in the medium-temperature regime.
zh

[AI-45] Bayesian Reasoning Enabled by Spin-Orbit Torque Magnetic Tunnel Junctions

【速读】:该论文旨在解决传统贝叶斯网络推理在存储效率和数据预处理复杂性方面的局限性。论文提出了一种基于自旋轨道转矩磁性隧道结(SOT-MTJ)的新型解决方案,通过将其作为随机数生成器和采样器,实现了概率前向传播神经网络的参数化表示,并利用简单的逐点训练算法优化网络参数。关键在于利用SOT-MTJ的特性,无需存储所有历史数据或显式统计条件概率,显著提升了存储效率并简化了数据预处理过程,同时展示了其在医学诊断等复杂推理任务中的应用潜力。这一方法为人工概率神经网络领域提供了高效的低存储解决方案,拓展了自旋电子器件的应用范围。

链接: https://arxiv.org/abs/2504.08257
作者: Yingqian Xu,Xiaohan Li,Caihua Wan,Ran Zhang,Bin He,Shiqiang Liu,Jihao Xia,Dehao Kong,Shilong Xiong,Guoqiang Yu,Xiufeng Han
机构: 未知
类目: Applied Physics (physics.app-ph); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Bayesian networks play an increasingly important role in data mining, inference, and reasoning with the rapid development of artificial intelligence. In this paper, we present proof-of-concept experiments demonstrating the use of spin-orbit torque magnetic tunnel junctions (SOT-MTJs) in Bayesian network reasoning. Not only can the target probability distribution function (PDF) of a Bayesian network be precisely formulated by a conditional probability table as usual, but it can also be quantitatively parameterized by a probabilistic forward-propagating neural network. Moreover, the parameters of the network can approach the optimum through a simple point-by-point training algorithm, with which we neither need to memorize all historical data nor statistically summarize the conditional probabilities behind them, significantly improving storage efficiency and economizing data pretreatment. Furthermore, we developed a simple medical diagnostic system using the SOT-MTJ as a random number generator and sampler, showcasing the application of SOT-MTJ-based Bayesian reasoning. This SOT-MTJ-based Bayesian reasoning shows great promise in the field of artificial probabilistic neural networks, broadening the scope of spintronic device applications and providing an efficient and low-storage solution for complex reasoning tasks.
zh

[AI-46] Neural Encoding and Decoding at Scale

【速读】:该论文旨在解决现有大规模神经科学研究方法仅专注于单向建模(即从行为预测神经活动或反之)的问题,无法充分捕捉神经活动与行为之间的双向关系。为弥合这一差距,论文提出了一种名为Neural Encoding and Decoding at Scale (NEDS) 的多模态、多任务模型。其关键创新在于引入了一种新颖的多任务掩码策略,通过在神经数据、行为数据、同模态以及跨模态之间动态切换掩码,实现了同时进行神经编码与解码的能力。通过在国际脑实验室(IBL)重复位点数据集上的预训练及后续微调,NEDS 在多动物数据上表现出针对编码与解码任务的最先进性能,并意外发现其学到的嵌入表示具有表征脑区预测的涌现特性。这一工作朝着构建能够无缝翻译神经活动与行为的大脑基础模型迈出了重要一步。

链接: https://arxiv.org/abs/2504.08201
作者: Yizi Zhang,Yanchen Wang,Mehdi Azabou,Alexandre Andre,Zixuan Wang,Hanrui Lyu, TheInternational Brain Laboratory,Eva Dyer,Liam Paninski,Cole Hurwitz
机构: 未知
类目: Neurons and Cognition (q-bio.NC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Recent work has demonstrated that large-scale, multi-animal models are powerful tools for characterizing the relationship between neural activity and behavior. Current large-scale approaches, however, focus exclusively on either predicting neural activity from behavior (encoding) or predicting behavior from neural activity (decoding), limiting their ability to capture the bidirectional relationship between neural activity and behavior. To bridge this gap, we introduce a multimodal, multi-task model that enables simultaneous Neural Encoding and Decoding at Scale (NEDS). Central to our approach is a novel multi-task-masking strategy, which alternates between neural, behavioral, within-modality, and cross-modality masking. We pretrain our method on the International Brain Laboratory (IBL) repeated site dataset, which includes recordings from 83 animals performing the same visual decision-making task. In comparison to other large-scale models, we demonstrate that NEDS achieves state-of-the-art performance for both encoding and decoding when pretrained on multi-animal data and then fine-tuned on new animals. Surprisingly, NEDS’s learned embeddings exhibit emergent properties: even without explicit training, they are highly predictive of the brain regions in each recording. Altogether, our approach is a step towards a foundation model of the brain that enables seamless translation between neural activity and behavior.
zh
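
多任务掩码策略可以理解为每个训练步在若干掩码模式间随机切换。以下片段是对这一思路的推测性示意,四种模式的具体语义与掩码比例均为假设,并非论文实现:

```python
import random

import numpy as np

def sample_mask(neural, behavior, ratio=0.3):
    # 在四种模式间交替:掩神经、掩行为、同模态内部分掩、跨模态整掩(语义为假设)
    mode = random.choice(["neural", "behavior", "within", "cross"])
    m_n = np.ones(neural.shape, dtype=bool)    # True 表示可见,False 表示被掩
    m_b = np.ones(behavior.shape, dtype=bool)
    if mode == "neural":
        m_n[:] = False                         # 掩去神经活动,由行为重建(编码方向)
    elif mode == "behavior":
        m_b[:] = False                         # 掩去行为,由神经活动重建(解码方向)
    elif mode == "within":
        m_n[np.random.rand(*neural.shape) < ratio] = False
        m_b[np.random.rand(*behavior.shape) < ratio] = False
    else:                                      # cross: 随机整掩其中一个模态
        (m_n if random.random() < 0.5 else m_b)[:] = False
    return mode, m_n, m_b

mode, m_n, m_b = sample_mask(np.zeros((5, 8)), np.zeros((5, 2)))
print(mode, m_n.mean(), m_b.mean())
```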

[AI-47] Cellular Development Follows the Path of Minimum Action

【速读】:该论文试图解决细胞发育过程中随机性与规则性共存但其底层原理尚不明确的问题。论文提出细胞发育遵循最小作用量路径,并通过结合最小作用量原理与最大熵原理,构建了一个基于Transformer架构的计算框架来模拟发育过程。解决方案的关键在于利用这一框架精确量化单细胞RNA序列数据中发育不对称性的熵产生、信息流曲率及局部不可逆性,并在统一框架下提供可解释的度量指标,包括熵以捕捉探索-利用权衡、曲率以评估可塑性-弹性动态以及熵产生以表征去分化和跨分化。论文通过单细胞及胚胎发育数据集验证了方法的有效性,展示了其揭示塑造细胞命运决策的热力学与信息学约束的能力。

链接: https://arxiv.org/abs/2504.08096
作者: Rohola Zandie,Farhan Khodaee,Yufan Xia,Elazer R. Edelman
机构: 未知
类目: Biological Physics (physics.bio-ph); Artificial Intelligence (cs.AI); Computational Physics (physics.comp-ph)
备注:

点击查看摘要

Abstract:Cellular development follows a stochastic yet rule-governed trajectory, though the underlying principles remain elusive. Here, we propose that cellular development follows paths of least action, aligning with foundational physical laws that govern dynamic systems across nature. We introduce a computational framework that takes advantage of the deep connection between the principle of least action and maximum entropy to model developmental processes using Transformers architecture. This approach enables precise quantification of entropy production, information flow curvature, and local irreversibility for developmental asymmetry in single-cell RNA sequence data. Within this unified framework, we provide interpretable metrics: entropy to capture exploration-exploitation trade-offs, curvature to assess plasticity-elasticity dynamics, and entropy production to characterize dedifferentiation and transdifferentiation. We validate our method across both single-cell and embryonic development datasets, demonstrating its ability to reveal hidden thermodynamic and informational constraints shaping cellular fate decisions.
zh

[AI-48] Comparative analysis of Realistic EMF Exposure Estimation from Low Density Sensor Network by Finite Infinite Neural Networks

【速读】:该论文旨在解决环境中射频电磁场(RF-EMF)暴露的空间和时间模式评估问题,以支持RF-EMF暴露与人类健康、野生动物及植物影响之间潜在关联的风险评估。为实现这一目标,论文提出了一种基于有限宽和无限宽卷积网络的方法,利用法国里尔市70个真实传感器的数据来估计和评估EMF暴露水平,并通过比较不同方法的执行时间和估计准确性,分析其性能差异。为提高高分辨率网格下估计的准确性,论文引入了预条件梯度下降法进行核函数估计,同时采用均方根误差(RMSE)作为模型性能比较的标准。关键在于结合深度学习技术设计高效且精确的估计方法,并通过优化策略提升高分辨率场景下的性能表现。

链接: https://arxiv.org/abs/2504.07990
作者: Mohammed Mallik,Laurent Clavier,Davy P. Gaillot
机构: 未知
类目: ignal Processing (eess.SP); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Understanding the spatial and temporal patterns of environmental exposure to radio-frequency electromagnetic fields (RF-EMF) is essential for conducting risk assessments. These assessments aim to explore potential connections between RF-EMF exposure and its effects on human health, as well as on wildlife and plant life. Existing research has used different machine learning tools for EMF exposure estimation; however, a comparative analysis of these techniques is required to better understand their performance for real-world datasets. In this work, we present both finite and infinite-width convolutional network-based methods to estimate and assess EMF exposure levels from 70 real-world sensors in Lille, France. A comparative analysis has been conducted to evaluate the methods in terms of execution time and estimation accuracy. To improve estimation accuracy for higher-resolution grids, we utilized a preconditioned gradient descent method for kernel estimation. Root Mean Square Error (RMSE) is used as the evaluation criterion for comparing the performance of these deep learning models.
zh

机器学习

[LG-0] Dimension reduction for derivative-informed operator learning: An analysis of approximation errors

链接: https://arxiv.org/abs/2504.08730
作者: Dingcheng Luo,Thomas O’Leary-Roseberry,Peng Chen,Omar Ghattas
类目: Numerical Analysis (math.NA); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:We study the derivative-informed learning of nonlinear operators between infinite-dimensional separable Hilbert spaces by neural networks. Such operators can arise from the solution of partial differential equations (PDEs), and are used in many simulation-based outer-loop tasks in science and engineering, such as PDE-constrained optimization, Bayesian inverse problems, and optimal experimental design. In these settings, the neural network approximations can be used as surrogate models to accelerate the solution of the outer-loop tasks. However, since outer-loop tasks in infinite dimensions often require knowledge of the underlying geometry, the approximation accuracy of the operator’s derivatives can also significantly impact the performance of the surrogate model. Motivated by this, we analyze the approximation errors of neural operators in Sobolev norms over infinite-dimensional Gaussian input measures. We focus on the reduced basis neural operator (RBNO), which uses linear encoders and decoders defined on dominant input/output subspaces spanned by reduced sets of orthonormal bases. To this end, we study two methods for generating the bases; principal component analysis (PCA) and derivative-informed subspaces (DIS), which use the dominant eigenvectors of the covariance of the data or the derivatives as the reduced bases, respectively. We then derive bounds for errors arising from both the dimension reduction and the latent neural network approximation, including the sampling errors associated with the empirical estimation of the PCA/DIS. Our analysis is validated on numerical experiments with elliptic PDEs, where our results show that bases informed by the map (i.e., DIS or output PCA) yield accurate reconstructions and generalization errors for both the operator and its derivatives, while input PCA may underperform unless ranks and training sample sizes are sufficiently large.

[LG-1] Surrogate-based optimization of system architectures subject to hidden constraints

链接: https://arxiv.org/abs/2504.08721
作者: Jasper Bussemaker,Paul Saves,Nathalie Bartoli,Thierry Lefebvre,Björn Nagel
类目: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
备注:

点击查看摘要

Abstract:The exploration of novel architectures requires physics-based simulation due to a lack of prior experience to start from, which introduces two specific challenges for optimization algorithms: evaluations become more expensive (in time) and evaluations might fail. The former challenge is addressed by Surrogate-Based Optimization (SBO) algorithms, in particular Bayesian Optimization (BO) using Gaussian Process (GP) models. An overview is provided of how BO can deal with challenges specific to architecture optimization, such as design variable hierarchy and multiple objectives: specific measures include ensemble infills and a hierarchical sampling algorithm. Evaluations might fail due to non-convergence of underlying solvers or infeasible geometry in certain areas of the design space. Such failed evaluations, also known as hidden constraints, pose a particular challenge to SBO/BO, as the surrogate model cannot be trained on empty results. This work investigates various strategies for satisfying hidden constraints in BO algorithms. Three high-level strategies are identified: rejection of failed points from the training set, replacing failed points based on viable (non-failed) points, and predicting the failure region. Through investigations on a set of test problems including a jet engine architecture optimization problem, it is shown that best performance is achieved with a mixed-discrete GP to predict the Probability of Viability (PoV), and by ensuring selected infill points satisfy some minimum PoV threshold. This strategy is demonstrated by solving a jet engine architecture problem that features at 50% failure rate and could not previously be solved by a BO algorithm. The developed BO algorithm and used test problems are available in the open-source Python library SBArchOpt.
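
摘要中“预测可行概率(PoV)并按阈值筛选 infill 候选点”的思路可示意如下:用已观测的成败标签训练一个高斯过程分类器,再过滤候选点。这只是一个基于模拟数据的极简示意,并非 SBArchOpt 的实现:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessClassifier

rng = np.random.default_rng(0)
X = rng.uniform(size=(60, 2))                    # 已评估的设计点
viable = (X[:, 0] + X[:, 1] > 0.6).astype(int)   # 模拟"隐藏约束":部分区域评估失败

pov_model = GaussianProcessClassifier().fit(X, viable)

candidates = rng.uniform(size=(200, 2))          # 待筛选的 infill 候选点
pov = pov_model.predict_proba(candidates)[:, 1]  # 预测的可行概率 PoV
kept = candidates[pov >= 0.5]                    # 仅保留可行概率达到阈值的候选
print(len(kept), "of", len(candidates), "candidates kept")
```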

[LG-2] Beyond Black-Box Predictions: Identifying Marginal Feature Effects in Tabular Transformer Networks

链接: https://arxiv.org/abs/2504.08712
作者: Anton Thielmann,Arik Reuter,Benjamin Saefken
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
备注:

点击查看摘要

Abstract:In recent years, deep neural networks have showcased their predictive power across a variety of tasks. Beyond natural language processing, the transformer architecture has proven efficient in addressing tabular data problems and challenges the previously dominant gradient-based decision trees in these areas. However, this predictive power comes at the cost of intelligibility: Marginal feature effects are almost completely lost in the black-box nature of deep tabular transformer networks. Alternative architectures that use the additivity constraints of classical statistical regression models can maintain intelligible marginal feature effects, but often fall short in predictive power compared to their more complex counterparts. To bridge the gap between intelligibility and performance, we propose an adaptation of tabular transformer networks designed to identify marginal feature effects. We provide theoretical justifications that marginal feature effects can be accurately identified, and our ablation study demonstrates that the proposed model efficiently detects these effects, even amidst complex feature interactions. To demonstrate the model’s predictive capabilities, we compare it to several interpretable as well as black-box models and find that it can match black-box performances while maintaining intelligibility. The source code is available at this https URL.

[LG-3] Offline Reinforcement Learning using Human-Aligned Reward Labeling for Autonomous Emergency Braking in Occluded Pedestrian Crossing

链接: https://arxiv.org/abs/2504.08704
作者: Vinal Asodia,Zhenhua Feng,Saber Fallah
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注: 13 pages, 9 figures, 1 table

点击查看摘要

Abstract:Effective leveraging of real-world driving datasets is crucial for enhancing the training of autonomous driving systems. While Offline Reinforcement Learning enables the training of autonomous vehicles using such data, most available datasets lack meaningful reward labels. Reward labeling is essential as it provides feedback for the learning algorithm to distinguish between desirable and undesirable behaviors, thereby improving policy performance. This paper presents a novel pipeline for generating human-aligned reward labels. The proposed approach addresses the challenge of absent reward signals in real-world datasets by generating labels that reflect human judgment and safety considerations. The pipeline incorporates an adaptive safety component, activated by analyzing semantic segmentation maps, allowing the autonomous vehicle to prioritize safety over efficiency in potential collision scenarios. The proposed pipeline is applied to an occluded pedestrian crossing scenario with varying levels of pedestrian traffic, using synthetic and simulation data. The results indicate that the generated reward labels closely match the simulation reward labels. When used to train the driving policy using Behavior Proximal Policy Optimisation, the results are competitive with other baselines. This demonstrates the effectiveness of our method in producing reliable and human-aligned reward signals, facilitating the training of autonomous driving systems through Reinforcement Learning outside of simulation environments and in alignment with human values.

[LG-4] SeaView: Software Engineering Agent Visual Interface for Enhanced Workflow

链接: https://arxiv.org/abs/2504.08696
作者: Timothy Bula,Saurabh Pujar,Luca Buratti,Mihaela Bornea,Avirup Sil
类目: oftware Engineering (cs.SE); Machine Learning (cs.LG)
*备注: 8 pages, 5 figures

点击查看摘要

Abstract:Auto-regressive LLM-based software engineering (SWE) agents, henceforth SWE agents, have made tremendous progress (60% on SWE-Bench Verified) on real-world coding challenges including GitHub issue resolution. SWE agents use a combination of reasoning, environment interaction and self-reflection to resolve issues thereby generating “trajectories”. Analysis of SWE agent trajectories is difficult, not only as they exceed LLM sequence length (sometimes, greater than 128k) but also because it involves a relatively prolonged interaction between an LLM and the environment managed by the agent. In case of an agent error, it can be hard to decipher, locate and understand its scope. Similarly, it can be hard to track improvements or regression over multiple runs or experiments. While a lot of research has gone into making these SWE agents reach state-of-the-art, much less focus has been put into creating tools to help analyze and visualize agent output. We propose a novel tool called SeaView: Software Engineering Agent Visual Interface for Enhanced Workflow, with a vision to assist SWE-agent researchers to visualize and inspect their experiments. SeaView’s novel mechanisms help compare experimental runs with varying hyper-parameters or LLMs, and quickly get an understanding of LLM or environment related problems. Based on our user study, experienced researchers spend between 10 and 30 minutes to gather the information provided by SeaView, while researchers with little experience can spend between 30 minutes to 1 hour to diagnose their experiment.

[LG-5] Regularized infill criteria for multi-objective Bayesian optimization with application to aircraft design

链接: https://arxiv.org/abs/2504.08671
作者: Robin Grapin,Youssef Diouane,Joseph Morlier,Nathalie Bartoli,Thierry Lefebvre,Paul Saves,Jasper Bussemaker
类目: Machine Learning (cs.LG); Statistics Theory (math.ST); Applications (stat.AP)
*备注: AIAA AVIATION 2022 Forum

点击查看摘要

Abstract:Bayesian optimization is an advanced tool to perform efficient global optimization. It consists of iteratively enriching surrogate Kriging models of the objective and the constraints, both supposed to be computationally expensive, of the targeted optimization problem. Nowadays, efficient extensions of Bayesian optimization to solve expensive multi-objective problems are of high interest. The method proposed in this paper extends the super efficient global optimization with mixture of experts (SEGOMOE) to solve constrained multi-objective problems. To cope with the ill-posedness of the multi-objective infill criteria, different enrichment procedures using regularization techniques are proposed. The merits of the proposed approaches are shown on known multi-objective benchmark problems with and without constraints. The proposed methods are then used to solve a bi-objective application related to conceptual aircraft design with five unknown design variables and three nonlinear inequality constraints. The preliminary results show a reduction of the total cost in terms of function evaluations by a factor of 20 compared to the evolutionary algorithm NSGA-II.

[LG-6] Channel Estimation by Infinite Width Convolutional Networks

链接: https://arxiv.org/abs/2504.08660
作者: Mohammed Mallik,Guillaume Villemaud
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In wireless communications, channel estimation for OFDM systems spans frequency and time and relies on sparse collections of pilot data, posing an ill-posed inverse problem. Moreover, deep learning estimators require large amounts of training data, computational resources, and ground-truth channels to produce accurate channel estimates, which is not realistic. To address this, a convolutional neural tangent kernel (CNTK) is derived from an infinitely wide convolutional network whose training dynamics can be expressed by a closed-form equation. This CNTK is used to impute the target matrix and estimate the missing channel response using only the known values available at pilot locations. This is a promising solution for channel estimation that does not require a large training set. Numerical results on realistic channel datasets demonstrate that our strategy accurately estimates the channels without a large dataset and significantly outperforms deep learning methods in terms of speed, accuracy, and computational resources.
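
The imputation setup is easy to picture with a generic kernel regressor standing in for the closed-form CNTK (the RBF kernel and toy channel below are illustrative assumptions, not the paper's derivation): fit on the sparse pilot entries, then predict the full time-frequency grid.

```python
# Minimal stand-in sketch: impute a channel matrix from sparse pilot entries by
# kernel regression on (time, frequency) coordinates. The paper derives a
# convolutional NTK for this role; here an RBF kernel substitutes for it.
import numpy as np
from sklearn.kernel_ridge import KernelRidge

T, F = 32, 64                                   # OFDM time x frequency grid
t, f = np.meshgrid(np.arange(T), np.arange(F), indexing="ij")
coords = np.stack([t.ravel(), f.ravel()], axis=1).astype(float)
true_H = np.sin(0.2 * t) * np.cos(0.1 * f)      # toy channel response

rng = np.random.default_rng(1)
pilot_idx = rng.choice(T * F, size=200, replace=False)  # sparse pilot locations

model = KernelRidge(kernel="rbf", gamma=0.05, alpha=1e-3)
model.fit(coords[pilot_idx], true_H.ravel()[pilot_idx])
H_hat = model.predict(coords).reshape(T, F)

print("relative error:", np.linalg.norm(H_hat - true_H) / np.linalg.norm(true_H))
```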

[LG-7] Application of machine learning models to predict the relationship between air pollution ecosystem degradation and health disparities and lung cancer in Vietnam

链接: https://arxiv.org/abs/2504.08651
作者: Ngoc Hong Tran,Lan Kim Vien,Ngoc-Thao Thi Le
类目: Machine Learning (cs.LG); Emerging Technologies (cs.ET)
*备注: Accepted and Published in the Proceeding of the 2nd International Conference on “Green Solutions and Emerging Technologies for Sustainability” (GSETS 2025) ISBN: 978-604-76-3087-5

点击查看摘要

Abstract:Lung cancer is one of the major causes of death worldwide, and Vietnam is no exception. This disease is the second most common type of cancer globally and the second most common cause of death in Vietnam, just after liver cancer, with 23,797 fatal cases and 26,262 new cases, or 14.4% of cases, in 2020. Recently, with rising disease rates causing a huge public health burden, lung cancer continues to hold the top position in attention and care in Vietnam. Especially together with climate change, under a variety of types of pollution, deforestation, and modern lifestyles, lung cancer risks are on red alert, particularly in Vietnam. To better understand the sources of this severe disease in Vietnam through a diversity of key factors, including environmental features and the current health state, with a particular emphasis on Vietnam’s distinct socioeconomic and ecological context, we utilize large datasets such as patient health records and environmental indicators containing necessary information, such as deforestation rate, green cover rate, air pollution, and lung cancer risks, collected from well-known governmental sharing websites. We then process and connect them and apply analytical methods (heatmap, information gain, p-value, Spearman correlation) to determine causal correlations influencing lung cancer risks. Moreover, we deploy machine learning (ML) models (Decision Tree, Random Forest, Support Vector Machine, K-means clustering) to discover cancer risk patterns. Our experimental results, leveraged by the aforementioned ML models to identify the disease patterns, are promising; in particular, Random Forest, SVM, and PCA work well on the datasets and give high accuracy (99%), whereas K-means clustering has very low accuracy (10%) and does not fit the datasets.

[LG-8] MooseAgent: A LLM Based Multi-agent Framework for Automating Moose Simulation

链接: https://arxiv.org/abs/2504.08621
作者: Tao Zhang,Zhenhai Liu,Yong Xin,Yongjun Jiao
类目: Machine Learning (cs.LG); Software Engineering (cs.SE)
*备注: 7 pages, 2 Figs

点击查看摘要

Abstract:The Finite Element Method (FEM) is widely used in engineering and scientific computing, but its pre-processing, solver configuration, and post-processing stages are often time-consuming and require specialized knowledge. This paper proposes an automated solution framework, MooseAgent, for the multi-physics simulation framework MOOSE, which combines large-scale pre-trained language models (LLMs) with a multi-agent system. The framework uses LLMs to understand user-described simulation requirements in natural language and employs task decomposition and multi-round iterative verification strategies to automatically generate MOOSE input files. To improve accuracy and reduce model hallucinations, the system builds and utilizes a vector database containing annotated MOOSE input cards and function documentation. We conducted experimental evaluations on several typical cases, including heat transfer, mechanics, phase field, and multi-physics coupling. The results show that MooseAgent can automate the MOOSE simulation process to a certain extent, especially demonstrating a high success rate when dealing with relatively simple single-physics problems. The main contribution of this research is the proposal of a multi-agent automated framework for MOOSE, which validates its potential in simplifying finite element simulation processes and lowering the user barrier, providing new ideas for the development of intelligent finite element simulation software. The code for the MooseAgent framework proposed in this paper has been open-sourced and is available at this https URL

[LG-9] Boosting-inspired online learning with transfer for railway maintenance

链接: https://arxiv.org/abs/2504.08554
作者: Diogo Risca,Afonso Lourenço,Goreti Marreiros
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The integration of advanced sensor technologies with deep learning algorithms has revolutionized fault diagnosis in railway systems, particularly at the wheel-track interface. Although numerous models have been proposed to detect irregularities such as wheel out-of-roundness, they often fall short in real-world applications due to the dynamic and nonstationary nature of railway operations. This paper introduces BOLT-RM (Boosting-inspired Online Learning with Transfer for Railway Maintenance), a model designed to address these challenges using continual learning for predictive maintenance. By allowing the model to continuously learn and adapt as new data become available, BOLT-RM overcomes the issue of catastrophic forgetting that often plagues traditional models. It retains past knowledge while improving predictive accuracy with each new learning episode, using a boosting-like knowledge sharing mechanism to adapt to evolving operational conditions such as changes in speed, load, and track irregularities. The methodology is validated through comprehensive multi-domain simulations of train-track dynamic interactions, which capture realistic railway operating conditions. The proposed BOLT-RM model demonstrates significant improvements in identifying wheel anomalies, establishing a reliable sequence for maintenance interventions.

[LG-10] Slicing the Gaussian Mixture Wasserstein Distance

链接: https://arxiv.org/abs/2504.08544
作者: Moritz Piening,Robert Beinert
类目: Machine Learning (cs.LG); Numerical Analysis (math.NA); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Gaussian mixture models (GMMs) are widely used in machine learning for tasks such as clustering, classification, image reconstruction, and generative modeling. A key challenge in working with GMMs is defining a computationally efficient and geometrically meaningful metric. The mixture Wasserstein (MW) distance adapts the Wasserstein metric to GMMs and has been applied in various domains, including domain adaptation, dataset comparison, and reinforcement learning. However, its high computational cost – arising from repeated Wasserstein distance computations involving matrix square root estimations and an expensive linear program – limits its scalability to high-dimensional and large-scale problems. To address this, we propose multiple novel slicing-based approximations to the MW distance that significantly reduce computational complexity while preserving key optimal transport properties. From a theoretical viewpoint, we establish several weak and strong equivalences between the introduced metrics, and show the relations to the original MW distance and the well-established sliced Wasserstein distance. Furthermore, we validate the effectiveness of our approach through numerical experiments, demonstrating computational efficiency and applications in clustering, perceptual image comparison, and GMM minimization.
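
The paper slices the mixture Wasserstein distance itself; the generic sliced-Wasserstein construction it builds on can be sketched in a few lines, since 1-D optimal transport between equal-size samples reduces to sorting (the sample-based estimate below is a simplification, not the paper's GMM-parameter slicing):

```python
# Illustrative sketch of the slicing idea: approximate a Wasserstein-type
# distance between two Gaussian mixtures by averaging 1-D Wasserstein
# distances over random projections of samples drawn from them.
import numpy as np

def sliced_w2(X, Y, n_proj=200, rng=None):
    rng = rng or np.random.default_rng(0)
    d = X.shape[1]
    total = 0.0
    for _ in range(n_proj):
        theta = rng.normal(size=d)
        theta /= np.linalg.norm(theta)      # random direction on the sphere
        x_proj = np.sort(X @ theta)         # 1-D OT = sorted matching
        y_proj = np.sort(Y @ theta)
        total += np.mean((x_proj - y_proj) ** 2)
    return np.sqrt(total / n_proj)

rng = np.random.default_rng(0)
# Samples from two 2-component GMMs in R^5
X = np.vstack([rng.normal(0, 1, (500, 5)), rng.normal(3, 1, (500, 5))])
Y = np.vstack([rng.normal(0.5, 1, (500, 5)), rng.normal(3.5, 1, (500, 5))])
print("sliced W2 estimate:", sliced_w2(X, Y))
```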

[LG-11] Physics-informed data-driven control without persistence of excitation

链接: https://arxiv.org/abs/2504.08484
作者: Martina Vanelli,Julien M. Hendrickx
类目: ystems and Control (eess.SY); Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注: 8 pages, 4 figures

点击查看摘要

Abstract:We show that data that is not sufficiently informative to allow for system re-identification can still provide meaningful information when combined with external or physical knowledge of the system, such as bounded system matrix norms. We then illustrate how this information can be leveraged for safety and energy minimization problems and to enhance predictions in unmodelled dynamics. This preliminary work outlines key ideas toward using limited data for effective control by integrating physical knowledge of the system and exploiting interpolation conditions.

[LG-12] A Systematic Evaluation of Knowledge Graph Embeddings for Gene-Disease Association Prediction

链接: https://arxiv.org/abs/2504.08445
作者: Catarina Canastra,Cátia Pesquita
类目: Machine Learning (cs.LG)
*备注: 48 pages, 2 figures, 18 tables

点击查看摘要

Abstract:Discovering gene-disease links is important in biology and medicine, enabling disease identification and drug repurposing. Machine learning approaches accelerate this process by leveraging biological knowledge represented in ontologies and the structure of knowledge graphs. Still, many existing works overlook ontologies explicitly representing diseases, missing causal and semantic relationships between them. The gene-disease association problem naturally frames itself as a link prediction task, where embedding algorithms directly predict associations by exploring the structure and properties of the knowledge graph. Some works frame it as a node-pair classification task, combining embedding algorithms with traditional machine learning algorithms. This strategy aligns with the logic of a machine learning pipeline. However, the use of negative examples and the lack of validated gene-disease associations to train embedding models may constrain its effectiveness. This work introduces a novel framework for comparing the performance of link prediction versus node-pair classification tasks, analyses the performance of state-of-the-art gene-disease association approaches, and compares the different order-based formalizations of gene-disease association prediction. It also evaluates the impact of the semantic richness through a disease-specific ontology and additional links between ontologies. The framework involves five steps: data splitting, knowledge graph integration, embedding, modeling and prediction, and method evaluation. Results show that enriching the semantic representation of diseases slightly improves performance, while additional links generate a greater impact. Link prediction methods better explore the semantic richness encoded in knowledge graphs. Although node-pair classification methods identify all true positives, link prediction methods outperform overall.
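
For readers unfamiliar with the link-prediction formalization, here is a hedged toy sketch of scoring candidate gene-disease triples with a TransE-style embedding (the random embeddings and entity-id ranges are placeholders; the paper evaluates several trained embedding methods, not this toy):

```python
# Toy sketch: TransE-style scoring of (gene, associatedWith, disease) triples.
# Random embeddings and index ranges are placeholders for a trained KGE model.
import numpy as np

rng = np.random.default_rng(0)
dim, n_ent = 64, 1000
ent = rng.normal(size=(n_ent, dim))            # entity embeddings
rel = rng.normal(size=dim)                     # "associatedWith" relation vector

def score(gene, disease):
    # TransE: head + relation should land near tail; higher = more plausible
    return -np.linalg.norm(ent[gene] + rel - ent[disease])

diseases = range(500, 1000)                    # hypothetical disease id range
ranking = sorted(diseases, key=lambda d: -score(42, d))
print("top-5 candidate diseases for gene 42:", ranking[:5])
```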

[LG-13] Customizing Spider Silk: Generative Models with Mechanical Property Conditioning for Protein Engineering

链接: https://arxiv.org/abs/2504.08437
作者: Neeru Dubey,Elin Karlsson,Miguel Angel Redondo,Johan Reimegård,Anna Rising,Hedvig Kjellström
类目: Machine Learning (cs.LG)
*备注: 23 pages, 11 figures

点击查看摘要

Abstract:The remarkable mechanical properties of spider silk, including its tensile strength and extensibility, are primarily governed by the repetitive regions of the proteins that constitute the fiber, the major ampullate spidroins (MaSps). However, establishing correlations between mechanical characteristics and repeat sequences is challenging due to the intricate sequence-structure-function relationships of MaSps and the limited availability of annotated datasets. In this study, we present a novel computational framework for designing MaSp repeat sequences with customizable mechanical properties. To achieve this, we developed a lightweight GPT-based generative model by distilling the pre-trained ProtGPT2 protein language model. The distilled model was subjected to multilevel fine-tuning using curated subsets of the Spider Silkome dataset. Specifically, we adapt the model for MaSp repeat generation using 6,000 MaSp repeat sequences and further refine it with 572 repeats associated with experimentally determined fiber-level mechanical properties. Our model generates biologically plausible MaSp repeat regions tailored to specific mechanical properties while also predicting those properties for given sequences. Validation includes sequence-level analysis, assessing physicochemical attributes and expected distribution of key motifs as well as secondary structure compositions. A correlation study using BLAST on the Spider Silkome dataset and a test set of MaSp repeats with known mechanical properties further confirmed the predictive accuracy of the model. This framework advances the rational design of spider silk-inspired biomaterials, offering a versatile tool for engineering protein sequences with tailored mechanical attributes.

[LG-14] Graph Reduction with Unsupervised Learning in Column Generation: A Routing Application

链接: https://arxiv.org/abs/2504.08401
作者: Abdo Abouelrous,Laurens Bliea,Adriana F. Gabor,Yaoxin Wu,Yingqian Zhang
类目: Machine Learning (cs.LG)
*备注: 22 pages, 4 figures, 5 tables

点击查看摘要

Abstract:Column Generation (CG) is a popular method dedicated to enhancing computational efficiency in large scale Combinatorial Optimization (CO) problems. It reduces the number of decision variables in a problem by solving a pricing problem. For many CO problems, the pricing problem is an Elementary Shortest Path Problem with Resource Constraints (ESPPRC). Large ESPPRC instances are difficult to solve to near-optimality. Consequently, we use a Graph Neural Network (GNN) to reduce the size of the ESPPRC such that it becomes computationally tractable with standard solving techniques. Our GNN is trained by Unsupervised Learning and outputs a distribution for the arcs to be retained in the reduced PP. The reduced PP is solved by a local search that finds columns with large reduced costs and speeds up convergence. We apply our method on a set of Capacitated Vehicle Routing Problems with Time Windows and show significant improvements in convergence compared to simple reduction techniques from the literature. For a fixed computational budget, we improve the objective values by over 9% for larger instances. We also analyze the performance of our CG algorithm and test the generalization of our method to different classes of instances than the training data.

[LG-15] MixDiT: Accelerating Image Diffusion Transformer Inference with Mixed-Precision MX Quantization

链接: https://arxiv.org/abs/2504.08398
作者: Daeun Kim,Jinwoo Hwang,Changhun Oh,Jongse Park
类目: Hardware Architecture (cs.AR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Diffusion Transformer (DiT) has driven significant progress in image generation tasks. However, DiT inferencing is notoriously compute-intensive and incurs long latency even on datacenter-scale GPUs, primarily due to its iterative nature and heavy reliance on GEMM operations inherent to its encoder-based structure. To address the challenge, prior work has explored quantization, but achieving low-precision quantization for DiT inferencing with both high accuracy and substantial speedup remains an open problem. To this end, this paper proposes MixDiT, an algorithm-hardware co-designed acceleration solution that exploits mixed Microscaling (MX) formats to quantize DiT activation values. MixDiT quantizes the DiT activation tensors by selectively applying higher precision to magnitude-based outliers, which produce mixed-precision GEMM operations. To achieve tangible speedup from the mixed-precision arithmetic, we design a MixDiT accelerator that enables precision-flexible multiplications and efficient MX precision conversions. Our experimental results show that MixDiT delivers a speedup of 2.10-5.32 times over RTX 3090, with no loss in FID.
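
The magnitude-based outlier splitting at the heart of MixDiT can be sketched with plain fake quantization (this illustrates only the mixed-precision idea; it does not reproduce the block-scaled MX formats or the hardware accelerator):

```python
# Conceptual sketch of magnitude-based mixed-precision quantization: most
# activation values are fake-quantized to a low-bit grid, while the largest-
# magnitude outliers are kept at higher precision (here: left in float).
import numpy as np

def fake_quant(x, bits):
    """Symmetric uniform quantization to 2^bits levels (simulation only)."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.max(np.abs(x)) / qmax if np.max(np.abs(x)) > 0 else 1.0
    return np.round(x / scale).clip(-qmax - 1, qmax) * scale

def mixed_precision(x, bits=4, outlier_frac=0.02):
    x = x.copy()
    k = max(1, int(outlier_frac * x.size))
    thresh = np.partition(np.abs(x).ravel(), -k)[-k]
    outliers = np.abs(x) >= thresh          # magnitude-based outlier mask
    x[~outliers] = fake_quant(x[~outliers], bits)
    return x                                # outliers pass through at high precision

rng = np.random.default_rng(0)
act = rng.normal(0, 1, (64, 128))
act[0, 0] = 25.0                            # inject an outlier
err_plain = np.abs(fake_quant(act, 4) - act).mean()
err_mixed = np.abs(mixed_precision(act, 4) - act).mean()
print(f"mean abs error: uniform 4-bit {err_plain:.4f}, mixed {err_mixed:.4f}")
```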

[LG-16] Scaling Up On-Device LLMs via Active-Weight Swapping Between DRAM and Flash

链接: https://arxiv.org/abs/2504.08378
作者: Fucheng Jia,Zewen Wu,Shiqi Jiang,Huiqiang Jiang,Qianxi Zhang,Yuqing Yang,Yunxin Liu,Ju Ren,Deyu Zhang,Ting Cao
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Large language models (LLMs) are increasingly being deployed on mobile devices, but the limited DRAM capacity constrains the deployable model size. This paper introduces ActiveFlow, the first LLM inference framework that can achieve adaptive DRAM usage for modern LLMs (not ReLU-based), enabling the scaling up of deployable model sizes. The framework is based on the novel concept of active weight DRAM-flash swapping and incorporates three novel techniques: (1) Cross-layer active weights preloading. It uses the activations from the current layer to predict the active weights of several subsequent layers, enabling computation and data loading to overlap, as well as facilitating large I/O transfers. (2) Sparsity-aware self-distillation. It adjusts the active weights to align with the dense-model output distribution, compensating for approximations introduced by contextual sparsity. (3) Active weight DRAM-flash swapping pipeline. It orchestrates the DRAM space allocation among the hot weight cache, preloaded active weights, and computation-involved weights based on available memory. Results show ActiveFlow achieves the performance-cost Pareto frontier compared to existing efficiency optimization methods.

[LG-17] Proofs as Explanations: Short Certificates for Reliable Predictions

链接: https://arxiv.org/abs/2504.08377
作者: Avrim Blum,Steve Hanneke,Chirag Pabbaraju,Donya Saless
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:We consider a model for explainable AI in which an explanation for a prediction h(x)=y consists of a subset S’ of the training data (if it exists) such that all classifiers h’ \in H that make at most b mistakes on S’ predict h’(x)=y . Such a set S’ serves as a proof that x indeed has label y under the assumption that (1) the target function h^\star belongs to H , and (2) the set S contains at most b corrupted points. For example, if b=0 and H is the family of linear classifiers in \mathbbR^d , and if x lies inside the convex hull of the positive data points in S (and hence every consistent linear classifier labels x as positive), then Carathéodory’s theorem states that x lies inside the convex hull of d+1 of those points. So, a set S’ of size d+1 could be released as an explanation for a positive prediction, and would serve as a short proof of correctness of the prediction under the assumption of realizability. In this work, we consider this problem more generally, for general hypothesis classes H and general values b\geq 0 . We define the notion of the robust hollow star number of H (which generalizes the standard hollow star number), and show that it precisely characterizes the worst-case size of the smallest certificate achievable, and analyze its size for natural classes. We also consider worst-case distributional bounds on certificate size, as well as distribution-dependent bounds that we show tightly control the sample size needed to get a certificate for any given test example. In particular, we define a notion of the certificate coefficient \varepsilon_x of an example x with respect to a data distribution D and target function h^\star , and prove matching upper and lower bounds on sample size as a function of \varepsilon_x , b , and the VC dimension d of H.
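
The b=0 linear-classifier example can be made concrete: a feasibility LP recovers convex-combination weights for x over the positive points, and a vertex (basic) solution has at most d+1 nonzeros, which form the short certificate (a sketch of this one special case, not the paper's general construction):

```python
# Sketch of the b=0 linear-classifier case: if x lies in the convex hull of the
# positive points, an LP finds convex-combination weights; a vertex (basic)
# solution has at most d+1 nonzeros, giving the short certificate S'.
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
d, n = 3, 50
P = rng.normal(size=(n, d))                # positive training points in R^d
x = P[:10].mean(axis=0)                    # a test point inside their convex hull

A_eq = np.vstack([P.T, np.ones((1, n))])   # P^T w = x and sum(w) = 1
b_eq = np.append(x, 1.0)
c = rng.normal(size=n)                     # random objective drives HiGHS to a vertex
res = linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=[(0, None)] * n, method="highs")

if res.success:
    support = np.where(res.x > 1e-9)[0]
    print(f"certificate size {len(support)} (<= d+1 = {d + 1}):", support)
else:
    print("x is outside the hull; no certificate for a positive prediction")
```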

[LG-18] DRIP: DRop unImportant data Points – Enhancing Machine Learning Efficiency with Grad-CAM-Based Real-Time Data Prioritization for On-Device Training

链接: https://arxiv.org/abs/2504.08364
作者: Marcus Rüb,Daniel Konegen,Axel Sikora,Daniel Mueller-Gritschneder
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Selecting data points for model training is critical in machine learning. Effective selection methods can reduce the labeling effort, optimize on-device training for embedded systems with limited data storage, and enhance the model performance. This paper introduces a novel algorithm that uses Grad-CAM to make online decisions about retaining or discarding data points. Optimized for embedded devices, the algorithm computes a unique DRIP Score to quantify the importance of each data point. This enables dynamic decision-making on whether a data point should be stored for potential retraining or discarded without compromising model performance. Experimental evaluations on four benchmark datasets demonstrate that our approach can match or even surpass the accuracy of models trained on the entire dataset, all while achieving storage savings of up to 39%. To our knowledge, this is the first algorithm that makes online decisions about data point retention without requiring access to the entire dataset.
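
A hedged sketch of how a Grad-CAM-derived retention score might look follows (the actual DRIP Score formula is the paper's; the toy model, hooks, and the scalar reduction below are illustrative stand-ins):

```python
# Sketch: a Grad-CAM-style importance score for online data retention.
# "drip_score" here is a hypothetical stand-in, not the paper's exact formula.
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(4), nn.Flatten(), nn.Linear(8 * 16, 10),
)
target_layer = model[0]
acts, grads = {}, {}
target_layer.register_forward_hook(lambda m, i, o: acts.update(v=o))
target_layer.register_full_backward_hook(lambda m, gi, go: grads.update(v=go[0]))

def drip_score(x, label):
    model.zero_grad()
    logits = model(x.unsqueeze(0))
    logits[0, label].backward()
    # Grad-CAM: channel weights = spatially pooled gradients of the class logit
    w = grads["v"].mean(dim=(2, 3), keepdim=True)
    cam = torch.relu((w * acts["v"]).sum(dim=1))
    return cam.sum().item()        # one scalar per sample (illustrative choice)

x = torch.randn(3, 16, 16)
print("score:", drip_score(x, label=3))
# An online policy might then keep the sample only if its score crosses a
# running threshold, and drop it otherwise.
```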

[LG-19] An Adaptive Clustering Scheme for Client Selections in Communication-Efficient Federated Learning

链接: https://arxiv.org/abs/2504.08356
作者: Yan-Ann Chen,Guan-Lin Chen
类目: Machine Learning (cs.LG)
*备注: Published in the Proceedings of IEEE VTS Asia Pacific Wireless Communications Symposium (APWCS), 2023

点击查看摘要

Abstract:Federated learning is a novel decentralized learning architecture. During the training process, the client and server must continuously upload and receive model parameters, which consumes a lot of network transmission resources. Some methods use clustering to find more representative clients, select only a part of them for training, and at the same time ensure the accuracy of training. However, in federated learning, it is not trivial to know which number of clusters brings the best training result. Therefore, we propose to dynamically adjust the number of clusters to find the most ideal grouping results. This may reduce the number of users participating in training, reducing communication costs without affecting model performance. We verify its experimental results on the non-IID handwritten digit recognition dataset and reduce the cost of communication and transmission by almost 50% compared with traditional federated learning without affecting the accuracy of the model.
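
A dynamic-k selection loop might look as follows (the silhouette criterion and per-client descriptors are assumptions for illustration; the paper's grouping signal and selection rule may differ):

```python
# Sketch of the adaptive-clustering idea: group clients by a model-update
# signature, choose the cluster count by silhouette score rather than fixing
# it, and let one representative per cluster upload in the next round.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
# Hypothetical per-client descriptors, e.g. flattened last-layer update deltas.
client_vecs = np.vstack([rng.normal(c, 0.3, (10, 8)) for c in range(4)])

best_k, best_score, best_labels = None, -1.0, None
for k in range(2, 9):                      # dynamically adjust the cluster count
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(client_vecs)
    score = silhouette_score(client_vecs, labels)
    if score > best_score:
        best_k, best_score, best_labels = k, score, labels

selected = [int(np.where(best_labels == c)[0][0]) for c in range(best_k)]
print(f"k={best_k}, selected clients {selected} out of {len(client_vecs)}")
```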

[LG-20] Towards generalizable single-cell perturbation modeling via the Conditional Monge Gap

链接: https://arxiv.org/abs/2504.08328
作者: Alice Driessen,Benedek Harsanyi,Marianna Rapsomaniki,Jannis Born
类目: Machine Learning (cs.LG); Cell Behavior (q-bio.CB)
*备注: Main text, 15 pages, 5 figures, 2 tables

点击查看摘要

Abstract:Learning the response of single-cells to various treatments offers great potential to enable targeted therapies. In this context, neural optimal transport (OT) has emerged as a principled methodological framework because it inherently accommodates the challenges of unpaired data induced by cell destruction during data acquisition. However, most existing OT approaches are incapable of conditioning on different treatment contexts (e.g., time, drug treatment, drug dosage, or cell type) and we still lack methods that unanimously show promising generalization performance to unseen treatments. Here, we propose the Conditional Monge Gap which learns OT maps conditionally on arbitrary covariates. We demonstrate its value in predicting single-cell perturbation responses conditioned on one or multiple drugs, a drug dosage, or combinations thereof. We find that our conditional models achieve results comparable and sometimes even superior to the condition-specific state-of-the-art on scRNA-seq as well as multiplexed protein imaging data. Notably, by aggregating data across conditions we perform cross-task learning which unlocks remarkable generalization abilities to unseen drugs or drug dosages, widely outperforming other conditional models in capturing heterogeneity (i.e., higher moments) in the perturbed population. Finally, by scaling to hundreds of conditions and testing on unseen drugs, we narrow the gap between structure-based and effect-based drug representations, suggesting a promising path to the successful prediction of perturbation effects for unseen treatments.

[LG-21] Academic Network Representation via Prediction-Sampling Incorporated Tensor Factorization

链接: https://arxiv.org/abs/2504.08323
作者: Chunyang Zhang,Xin Liao,Hao Wu
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Accurate representation of an academic network is of great significance to academic relationship mining like predicting scientific impact. A Latent Factorization of Tensors (LFT) model is one of the most effective models for learning the representation of a target network. However, an academic network is often High-Dimensional and Incomplete (HDI) because the relationships among numerous network entities are impossible to be fully explored, making it difficult for an LFT model to learn accurate representation of the academic network. To address this issue, this paper proposes a Prediction-sampling-based Latent Factorization of Tensors (PLFT) model with two ideas: 1) constructing a cascade LFT architecture to enhance model representation learning ability via learning academic network hierarchical features, and 2) introducing a nonlinear activation-incorporated predicting-sampling strategy to more accurately learn the network representation via generating new academic network data layer by layer. Experimental results from the three real-world academic network datasets show that the PLFT model outperforms existing models when predicting the unexplored relationships among network entities.

[LG-22] Enabling Automatic Differentiation with Mollified Graph Neural Operators

链接: https://arxiv.org/abs/2504.08277
作者: Ryan Y. Lin,Julius Berner,Valentin Duruisseaux,David Pitt,Daniel Leibovici,Jean Kossaifi,Kamyar Azizzadenesheli,Anima Anandkumar
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Physics-informed neural operators offer a powerful framework for learning solution operators of partial differential equations (PDEs) by combining data and physics losses. However, these physics losses rely on derivatives. Computing these derivatives remains challenging, with spectral and finite difference methods introducing approximation errors due to finite resolution. Here, we propose the mollified graph neural operator (mGNO), the first method to leverage automatic differentiation and compute exact gradients on arbitrary geometries. This enhancement enables efficient training on irregular grids and varying geometries while allowing seamless evaluation of physics losses at randomly sampled points for improved generalization. For a PDE example on regular grids, mGNO paired with autograd reduced the L2 relative data error by 20x compared to finite differences, although training was slower. It can also solve PDEs on unstructured point clouds seamlessly, using physics losses only, at resolutions vastly lower than those needed for finite differences to be accurate enough. On these unstructured point clouds, mGNO leads to errors that are consistently 2 orders of magnitude lower than machine learning baselines (Meta-PDE) for comparable runtimes, and also delivers speedups from 1 to 3 orders of magnitude compared to the numerical solver for similar accuracy. mGNOs can also be used to solve inverse design and shape optimization problems on complex geometries.
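
The autograd-based physics loss that mGNO enables is, at its core, the standard trick of differentiating the network output with respect to its input coordinates. A minimal PyTorch sketch for a toy 1-D Poisson residual (not the paper's graph-operator code) looks like this:

```python
# Sketch of an exact-gradient physics loss via autograd: for a model u(x),
# PDE residuals are formed with torch.autograd.grad instead of finite
# differences (here for the 1-D Poisson equation u'' = f as a toy).
import torch

model = torch.nn.Sequential(torch.nn.Linear(1, 64), torch.nn.Tanh(),
                            torch.nn.Linear(64, 1))
x = torch.rand(128, 1, requires_grad=True)      # random collocation points
u = model(x)
du = torch.autograd.grad(u, x, torch.ones_like(u), create_graph=True)[0]
d2u = torch.autograd.grad(du, x, torch.ones_like(du), create_graph=True)[0]
f = -torch.sin(torch.pi * x) * torch.pi ** 2    # manufactured source term
physics_loss = ((d2u - f) ** 2).mean()          # exact derivatives, no grid
physics_loss.backward()                         # trains like any other loss
print(float(physics_loss))
```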

[LG-23] Understanding the Impact of Data Domain Extraction on Synthetic Data Privacy ICLR2025

链接: https://arxiv.org/abs/2504.08254
作者: Georgi Ganev,Meenatchi Sundaram Muthu Selva Annamalai,Sofiane Mahiou,Emiliano De Cristofaro
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注: Accepted to the Synthetic Data x Data Access Problem workshop (SynthData), part of ICLR 2025

点击查看摘要

Abstract:Privacy attacks, particularly membership inference attacks (MIAs), are widely used to assess the privacy of generative models for tabular synthetic data, including those with Differential Privacy (DP) guarantees. These attacks often exploit outliers, which are especially vulnerable due to their position at the boundaries of the data domain (e.g., at the minimum and maximum values). However, the role of data domain extraction in generative models and its impact on privacy attacks have been overlooked. In this paper, we examine three strategies for defining the data domain: assuming it is externally provided (ideally from public data), extracting it directly from the input data, and extracting it with DP mechanisms. Although the second approach is common in popular implementations and libraries, we show that it breaks end-to-end DP guarantees and leaves models vulnerable. While using a provided domain (if representative) is preferable, extracting it with DP can also defend against popular MIAs, even at high privacy budgets.
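
The three domain strategies are easy to contrast in code (a simplified sketch; the loose Laplace calibration below assumes coarse public clipping bounds and is only one of many possible DP extraction mechanisms):

```python
# Simplified sketch of the three domain-definition strategies. The DP variant
# assumes values are clipped to coarse public bounds, so changing one record
# moves min/max by at most the clip width; Laplace(width/eps) noise per
# statistic then gives eps-DP each (2*eps total). Tighter mechanisms exist.
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(50, 10, 1000).clip(0, 100)   # coarse public bounds: [0, 100]

provided = (0.0, 100.0)                        # (1) externally provided domain
leaky = (data.min(), data.max())               # (2) direct extraction: breaks DP

eps = 1.0                                      # (3) DP extraction via Laplace
width = provided[1] - provided[0]
dp_domain = (data.min() + rng.laplace(0, width / eps),
             data.max() + rng.laplace(0, width / eps))

print("provided:", provided)
print("direct (leaks outliers):", leaky)
print("DP extracted:", dp_domain)
```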

[LG-24] Neural Network-assisted Interval Reachability for Systems with Control Barrier Function-Based Safe Controllers

链接: https://arxiv.org/abs/2504.08249
作者: Damola Ajeyemi,Saber Jafarpour,Emiliano Dall’Anese
类目: ystems and Control (eess.SY); Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注:

点击查看摘要

Abstract:Control Barrier Functions (CBFs) have been widely utilized in the design of optimization-based controllers and filters for dynamical systems to ensure forward invariance of a given set of safe states. While CBF-based controllers offer safety guarantees, they can compromise the performance of the system, leading to undesirable behaviors such as unbounded trajectories and emergence of locally stable spurious equilibria. Computing reachable sets for systems with CBF-based controllers is an effective approach for runtime performance and stability verification, and can potentially serve as a tool for trajectory re-planning. In this paper, we propose a computationally efficient interval reachability method for performance verification of systems with optimization-based controllers by: (i) approximating the optimization-based controller by a pre-trained neural network to avoid solving optimization problems repeatedly, and (ii) using mixed monotone theory to construct an embedding system that leverages state-of-the-art neural network verification algorithms for bounding the output of the neural network. Results in terms of closeness of solutions of trajectories of the system with the optimization-based controller and the neural network are derived. Using a single trajectory of the embedding system along with our closeness of solutions result, we obtain an over-approximation of the reachable set of the system with optimization-based controllers. Numerical results are presented to corroborate the technical findings.

[LG-25] Spectral Normalization for Lipschitz-Constrained Policies on Learning Humanoid Locomotion

链接: https://arxiv.org/abs/2504.08246
作者: Jaeyong Shin,Woohyun Cha,Donghyeon Kim,Junhyeok Cha,Jaeheung Park
类目: Robotics (cs.RO); Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注: This work has been submitted to the IEEE for possible publication

点击查看摘要

Abstract:Reinforcement learning (RL) has shown great potential in training agile and adaptable controllers for legged robots, enabling them to learn complex locomotion behaviors directly from experience. However, policies trained in simulation often fail to transfer to real-world robots due to unrealistic assumptions such as infinite actuator bandwidth and the absence of torque limits. These conditions allow policies to rely on abrupt, high-frequency torque changes, which are infeasible for real actuators with finite bandwidth. Traditional methods address this issue by penalizing aggressive motions through regularization rewards, such as joint velocities, accelerations, and energy consumption, but they require extensive hyperparameter tuning. Alternatively, Lipschitz-Constrained Policies (LCP) enforce finite bandwidth action control by penalizing policy gradients, but their reliance on gradient calculations introduces significant GPU memory overhead. To overcome this limitation, this work proposes Spectral Normalization (SN) as an efficient replacement for enforcing Lipschitz continuity. By constraining the spectral norm of network weights, SN effectively limits high-frequency policy fluctuations while significantly reducing GPU memory usage. Experimental evaluations in simulation and on a real-world humanoid robot show that SN achieves performance comparable to gradient penalty methods while enabling more efficient parallel training.
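
Spectral normalization itself is a one-line wrapper in PyTorch; a minimal policy-network sketch follows (layer sizes are placeholders, and since ELU and Tanh are themselves 1-Lipschitz, the composed network inherits a Lipschitz bound):

```python
# Minimal sketch of the core idea: wrap each linear layer of a policy network
# with spectral normalization, bounding its Lipschitz constant and thus
# suppressing high-frequency action changes without gradient penalties.
import torch
import torch.nn as nn
from torch.nn.utils.parametrizations import spectral_norm

def make_policy(obs_dim=48, act_dim=12, hidden=256):
    layer = lambda i, o: spectral_norm(nn.Linear(i, o))  # ||W||_2 constrained
    return nn.Sequential(
        layer(obs_dim, hidden), nn.ELU(),
        layer(hidden, hidden), nn.ELU(),
        layer(hidden, act_dim), nn.Tanh(),
    )

policy = make_policy()
obs = torch.randn(4, 48)
print(policy(obs).shape)   # torch.Size([4, 12]); trains like any other module
```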

[LG-26] Bringing Structure to Naturalness: On the Naturalness of ASTs

链接: https://arxiv.org/abs/2504.08234
作者: Profir-Petru Pârţachi,Mahito Sugiyama
类目: oftware Engineering (cs.SE); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Source code comes in different shapes and forms. Previous research has already shown code to be more predictable than natural language as well as highlighted its statistical predictability at the token level: source code can be natural. More recently, the structure of code – control flow, syntax graphs, abstract syntax trees etc. – has been successfully used to improve the state-of-the-art on numerous tasks: code suggestion, code summarisation, method naming etc. This body of work implicitly assumes that structured representations of code are similarly statistically predictable, i.e. that a structured view of code is also natural. We consider that this view should be made explicit and propose directly studying the Structured Naturalness Hypothesis. Beyond just naming existing research that assumes this hypothesis and formulating it, we also provide evidence in the case of trees: TreeLSTM models over ASTs for some languages, such as Ruby, are competitive with n -gram models while handling the syntax token issue highlighted by previous research ‘for free’. For other languages, such as Java or Python, we find tree models to perform worse, suggesting that downstream task improvement is uncorrelated to the language modelling task. Further, we show how such naturalness signals can be employed for near state-of-the-art results on just-in-time defect prediction while forgoing manual feature engineering work.

[LG-27] DrivAer Transformer: A high-precision and fast prediction method for vehicle aerodynamic drag coefficient based on the DrivAerNet dataset

链接: https://arxiv.org/abs/2504.08217
作者: Jiaqi He,Xiangwen Luo,Yiping Wang
类目: Machine Learning (cs.LG)
*备注: 14 pages

点击查看摘要

Abstract:At the current stage, deep learning-based methods have demonstrated excellent capabilities in evaluating aerodynamic performance, significantly reducing the time and cost required for traditional computational fluid dynamics (CFD) simulations. However, when faced with the task of processing extremely complex three-dimensional (3D) vehicle models, the lack of large-scale datasets and training resources, coupled with the inherent diversity and complexity of the geometry of different vehicle models, means that the prediction accuracy and versatility of these networks are still not up to the level required for current production. In view of the remarkable success of Transformer models in the field of natural language processing and their strong potential in the field of image processing, this study innovatively proposes a point cloud learning framework called DrivAer Transformer (DAT). The DAT structure uses the DrivAerNet++ dataset, which contains high-fidelity CFD data of industrial-standard 3D vehicle shapes, enabling accurate estimation of air drag directly from 3D meshes and thus avoiding the limitations of traditional methods such as 2D image rendering or signed distance fields (SDF). DAT enables fast and accurate drag prediction, driving the evolution of the aerodynamic evaluation process and laying the critical foundation for introducing a data-driven approach to automotive design. The framework is expected to accelerate the vehicle design process and improve development efficiency.

[LG-28] The More is not the Merrier: Investigating the Effect of Client Size on Federated Learning

链接: https://arxiv.org/abs/2504.08198
作者: Eleanor Wallach,Sage Siler,Jing Deng
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*备注: 6 pages, 7 figures

点击查看摘要

Abstract:Federated Learning (FL) has been introduced as a way to keep data local to clients while training a shared machine learning model, as clients train on their local data and send trained models to a central aggregator. It is expected that FL will have huge implications for Mobile Edge Computing, the Internet of Things, and Cross-Silo FL. In this paper, we focus on the widely used FedAvg algorithm to explore the effect of the number of clients in FL. We find a significant deterioration of learning accuracy for FedAvg as the number of clients increases. To address this issue for a general application, we propose a method called Knowledgeable Client Insertion (KCI) that introduces a very small number of knowledgeable clients to the MEC setting. These knowledgeable clients are expected to have accumulated a large set of data samples to help with training. With the help of KCI, the learning accuracy of FL increases much faster even with a normal FedAvg aggregation technique. We expect this approach to be able to provide great privacy protection for clients against security attacks such as model inversion attacks. Our code is available at this https URL.

[LG-29] Detecting Credit Card Fraud via Heterogeneous Graph Neural Networks with Graph Attention

链接: https://arxiv.org/abs/2504.08183
作者: Qiuwu Sha,Tengda Tang,Xinyu Du,Jie Liu,Yixian Wang,Yuan Sheng
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*备注:

点击查看摘要

Abstract:This study proposes a credit card fraud detection method based on Heterogeneous Graph Neural Network (HGNN) to address fraud in complex transaction networks. Unlike traditional machine learning methods that rely solely on numerical features of transaction records, this approach constructs heterogeneous transaction graphs. These graphs incorporate multiple node types, including users, merchants, and transactions. By leveraging graph neural networks, the model captures higher-order transaction relationships. A Graph Attention Mechanism is employed to dynamically assign weights to different transaction relationships. Additionally, a Temporal Decay Mechanism is integrated to enhance the model’s sensitivity to time-related fraud patterns. To address the scarcity of fraudulent transaction samples, this study applies SMOTE oversampling and Cost-sensitive Learning. These techniques strengthen the model’s ability to identify fraudulent transactions. Experimental results demonstrate that the proposed method outperforms existing GNN models, including GCN, GAT, and GraphSAGE, on the IEEE-CIS Fraud Detection dataset. The model achieves notable improvements in both accuracy and AUC-ROC. Future research may explore the integration of dynamic graph neural networks and reinforcement learning. Such advancements could enhance the real-time adaptability of fraud detection systems and provide more intelligent solutions for financial risk control.

[LG-30] External-Wrench Estimation for Aerial Robots Exploiting a Learned Model

链接: https://arxiv.org/abs/2504.08156
作者: Ayham Alharbat,Gabriele Ruscelli,Roberto Diversi,Abeje Mersha
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注: Accepted at ICUAS 2025

点击查看摘要

Abstract:This paper presents an external wrench estimator that uses a hybrid dynamics model consisting of a first-principles model and a neural network. This framework addresses one of the limitations of the state-of-the-art model-based wrench observers: the wrench estimate of these observers comprises the external wrench (e.g. collision, physical interaction, wind) in addition to a residual wrench (e.g. model parameter uncertainty or unmodeled dynamics). This is a problem if these wrench estimates are to be used as wrench feedback to a force controller, for example. In the proposed framework, a neural network is combined with a first-principles model to estimate the residual dynamics arising from unmodeled dynamics and parameter uncertainties; then, the hybrid trained model is used to estimate the external wrench, leading to a wrench estimate that has smaller contributions from the residual dynamics and is affected more by the external wrench. This method is validated with numerical simulations of an aerial robot in different flying scenarios and different types of residual dynamics, and the statistical analysis of the results shows that the wrench estimation error has improved significantly compared to a model-based wrench observer using only a first-principles model.

[LG-31] Adaptive Bounded Exploration and Intermediate Actions for Data Debiasing

链接: https://arxiv.org/abs/2504.08151
作者: Yifan Yang,Yang Liu,Parinaz Naghizadeh
类目: Machine Learning (cs.LG)
*备注: arXiv admin note: text overlap with arXiv:2110.13054

点击查看摘要

Abstract:The performance of algorithmic decision rules is largely dependent on the quality of training datasets available to them. Biases in these datasets can raise economic and ethical concerns due to the resulting algorithms’ disparate treatment of different groups. In this paper, we propose algorithms for sequentially debiasing the training dataset through adaptive and bounded exploration in a classification problem with costly and censored feedback. Our proposed algorithms balance between the ultimate goal of mitigating the impacts of data biases – which will in turn lead to more accurate and fairer decisions, and the exploration risks incurred to achieve this goal. Specifically, we propose adaptive bounds to limit the region of exploration, and leverage intermediate actions which provide noisy label information at a lower cost. We analytically show that such exploration can help debias data in certain distributions, investigate how algorithmic fairness interventions can work in conjunction with our proposed algorithms, and validate the performance of these algorithms through numerical experiments on synthetic and real-world data.

[LG-32] Beyond Feature Importance: Feature Interactions in Predicting Post-Stroke Rigidity with Graph Explainable AI

链接: https://arxiv.org/abs/2504.08150
作者: Jiawei Xu,Yonggeon Lee,Anthony Elkommos Youssef,Eunjin Yun,Tinglin Huang,Tianjian Guo,Hamidreza Saber,Rex Ying,Ying Ding
类目: Machine Learning (cs.LG)
*备注: Jiawei Xu and Yonggeon Lee contributed equally to this work

点击查看摘要

Abstract:This study addresses the challenge of predicting post-stroke rigidity by emphasizing feature interactions through graph-based explainable AI. Post-stroke rigidity, characterized by increased muscle tone and stiffness, significantly affects survivors’ mobility and quality of life. Despite its prevalence, early prediction remains limited, delaying intervention. We analyze 519K stroke hospitalization records from the Healthcare Cost and Utilization Project dataset, where 43% of patients exhibited rigidity. We compare traditional approaches such as Logistic Regression, XGBoost, and Transformer with graph-based models like Graphormer and Graph Attention Network. These graph models inherently capture feature interactions and incorporate intrinsic or post-hoc explainability. Our results show that graph-based methods outperform others (AUROC 0.75), identifying key predictors such as NIH Stroke Scale and APR-DRG mortality risk scores. They also uncover interactions missed by conventional models. This research provides a novel application of graph-based XAI in stroke prognosis, with potential to guide early identification and personalized rehabilitation strategies.

[LG-33] Variational quantum and neural quantum states algorithms for the linear complementarity problem

链接: https://arxiv.org/abs/2504.08141
作者: Saibal De,Oliver Knitter,Rohan Kodati,Paramsothy Jayakumar,James Stokes,Shravan Veerapaneni
类目: Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG); Quantum Physics (quant-ph)
*备注: 13 pages, 5 figures, to appear in Philosophical Transactions of the Royal Society A

点击查看摘要

Abstract:Variational quantum algorithms (VQAs) are promising hybrid quantum-classical methods designed to leverage the computational advantages of quantum computing while mitigating the limitations of current noisy intermediate-scale quantum (NISQ) hardware. Although VQAs have been demonstrated as proofs of concept, their practical utility in solving real-world problems – and whether quantum-inspired classical algorithms can match their performance – remains an open question. We present a novel application of the variational quantum linear solver (VQLS) and its classical neural quantum states-based counterpart, the variational neural linear solver (VNLS), as key components within a minimum map Newton solver for a complementarity-based rigid body contact model. We demonstrate using the VNLS that our solver accurately simulates the dynamics of rigid spherical bodies during collision events. These results suggest that quantum and quantum-inspired linear algebra algorithms can serve as viable alternatives to standard linear algebra solvers for modeling certain physical systems.
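
The outer minimum-map Newton iteration the solvers plug into is classical and compact; in this sketch, np.linalg.solve stands in for the VQLS/VNLS inner linear solver that the paper substitutes:

```python
# Sketch of the classical minimum-map Newton method for the linear
# complementarity problem: find z with z >= 0, Mz + q >= 0, z'(Mz + q) = 0.
# The paper's contribution is solving the inner linear systems with
# VQLS/VNLS; np.linalg.solve stands in for that solver here.
import numpy as np

def min_map_newton(M, q, tol=1e-10, max_iter=50):
    n = len(q)
    z = np.zeros(n)
    for _ in range(max_iter):
        w = M @ z + q
        phi = np.minimum(z, w)              # minimum map; zero at a solution
        if np.linalg.norm(phi) < tol:
            break
        active = w < z                      # rows where min picks (Mz + q)_i
        J = np.where(active[:, None], M, np.eye(n))
        z = z - np.linalg.solve(J, phi)     # <- replaced by VQLS/VNLS in the paper
    return z

rng = np.random.default_rng(0)
A = rng.normal(size=(5, 5))
M = A @ A.T + 5 * np.eye(5)                 # positive definite => unique solution
q = rng.normal(size=5)
z = min_map_newton(M, q)
w = M @ z + q
print("complementarity residual:", np.abs(z * w).max())
```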

[LG-34] A physics informed neural network approach to simulating ice dynamics governed by the shallow ice approximation

链接: https://arxiv.org/abs/2504.08136
作者: Kapil Chawla,William Holmes
类目: Machine Learning (cs.LG); Numerical Analysis (math.NA); Atmospheric and Oceanic Physics (physics.ao-ph)
*备注:

点击查看摘要

Abstract:In this article we develop a Physics Informed Neural Network (PINN) approach to simulate ice sheet dynamics governed by the Shallow Ice Approximation. This problem takes the form of a time-dependent parabolic obstacle problem. Prior work has used this approach to address the stationary obstacle problem and here we extend it to the time dependent problem. Through comprehensive 1D and 2D simulations, we validate the model’s effectiveness in capturing complex free-boundary conditions. By merging traditional mathematical modeling with cutting-edge deep learning methods, this approach provides a scalable and robust solution for predicting temporal variations in ice thickness. To illustrate this approach in a real world setting, we simulate the dynamics of the Devon Ice Cap, incorporating aerogeophysical data from 2000 and 2018.

[LG-35] Between Linear and Sinusoidal: Rethinking the Time Encoder in Dynamic Graph Learning

链接: https://arxiv.org/abs/2504.08129
作者: Hsing-Huan Chung,Shravan Chaudhari,Xing Han,Yoav Wald,Suchi Saria,Joydeep Ghosh
类目: Machine Learning (cs.LG); Social and Information Networks (cs.SI)
*备注:

点击查看摘要

Abstract:Dynamic graph learning is essential for applications involving temporal networks and requires effective modeling of temporal relationships. Seminal attention-based models like TGAT and DyGFormer rely on sinusoidal time encoders to capture temporal relationships between edge events. In this paper, we study a simpler alternative: the linear time encoder, which avoids temporal information loss caused by sinusoidal functions and reduces the need for high dimensional time encoders. We show that the self-attention mechanism can effectively learn to compute time spans from linear time encodings and extract relevant temporal patterns. Through extensive experiments on six dynamic graph datasets, we demonstrate that the linear time encoder improves the performance of TGAT and DyGFormer in most cases. Moreover, the linear time encoder can lead to significant savings in model parameters with minimal performance loss. For example, compared to a 100-dimensional sinusoidal time encoder, TGAT with a 2-dimensional linear time encoder saves 43% of parameters and achieves higher average precision on five datasets. These results can be readily used to positively impact the design choices of a wide variety of dynamic graph learning architectures. The experimental code is available at: this https URL.
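
The two encoder families compare as follows (the sinusoidal frequency initialization is only an approximation of TGAT's; the 2-dimensional linear encoder mirrors the parameter-saving configuration reported above):

```python
# Sketch of the two encoder families compared in the paper: a sinusoidal time
# encoder (as in TGAT/DyGFormer) versus the linear alternative, which simply
# scales and shifts the time span into each dimension.
import torch
import torch.nn as nn

class SinusoidalTimeEncoder(nn.Module):
    def __init__(self, dim=100):
        super().__init__()
        self.w = nn.Parameter(torch.pow(10.0, -torch.linspace(0, 9, dim)))
    def forward(self, dt):                      # dt: (batch,) time spans
        return torch.cos(dt.unsqueeze(-1) * self.w)

class LinearTimeEncoder(nn.Module):
    def __init__(self, dim=2):                  # 2 dims suffice per the paper
        super().__init__()
        self.lin = nn.Linear(1, dim)
    def forward(self, dt):
        return self.lin(dt.unsqueeze(-1))       # no periodic information loss

dt = torch.tensor([0.5, 3.0, 100.0])
print(SinusoidalTimeEncoder()(dt).shape, LinearTimeEncoder()(dt).shape)
```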

[LG-36] RL-based Control of UAS Subject to Significant Disturbance

链接: https://arxiv.org/abs/2504.08114
作者: Kousheek Chakraborty,Thijs Hof,Ayham Alharbat,Abeje Mersha
类目: Robotics (cs.RO); Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注: Accepted at ICUAS 2025

点击查看摘要

Abstract:This paper proposes a Reinforcement Learning (RL)-based control framework for position and attitude control of an Unmanned Aerial System (UAS) subjected to significant disturbance that can be associated with an uncertain trigger signal. The proposed method learns the relationship between the trigger signal and disturbance force, enabling the system to anticipate and counteract the impending disturbances before they occur. We train and evaluate three policies: a baseline policy trained without exposure to the disturbance, a reactive policy trained with the disturbance but without the trigger signal, and a predictive policy that incorporates the trigger signal as an observation and is exposed to the disturbance during training. Our simulation results show that the predictive policy outperforms the other policies by minimizing position deviations through a proactive correction maneuver. This work highlights the potential of integrating predictive cues into RL frameworks to improve UAS performance.

[LG-37] Scaling Laws of Graph Neural Networks for Atomistic Materials Modeling

链接: https://arxiv.org/abs/2504.08112
作者: Chaojian Li,Zhifan Ye,Massimiliano Lupo Pasini,Jong Youl Choi,Cheng Wan,Yingyan Celine Lin,Prasanna Balaprakash
类目: Machine Learning (cs.LG); Materials Science (cond-mat.mtrl-sci)
*备注: Accepted by DAC’25

点击查看摘要

Abstract:Atomistic materials modeling is a critical task with wide-ranging applications, from drug discovery to materials science, where accurate predictions of the target material property can lead to significant advancements in scientific discovery. Graph Neural Networks (GNNs) represent the state-of-the-art approach for modeling atomistic material data thanks to their capacity to capture complex relational structures. While machine learning performance has historically improved with larger models and datasets, GNNs for atomistic materials modeling remain relatively small compared to large language models (LLMs), which leverage billions of parameters and terabyte-scale datasets to achieve remarkable performance in their respective domains. To address this gap, we explore the scaling limits of GNNs for atomistic materials modeling by developing a foundational model with billions of parameters, trained on extensive datasets in terabyte-scale. Our approach incorporates techniques from LLM libraries to efficiently manage large-scale data and models, enabling both effective training and deployment of these large-scale GNN models. This work addresses three fundamental questions in scaling GNNs: the potential for scaling GNN model architectures, the effect of dataset size on model accuracy, and the applicability of LLM-inspired techniques to GNN architectures. Specifically, the outcomes of this study include (1) insights into the scaling laws for GNNs, highlighting the relationship between model size, dataset volume, and accuracy, (2) a foundational GNN model optimized for atomistic materials modeling, and (3) a GNN codebase enhanced with advanced LLM-based training techniques. Our findings lay the groundwork for large-scale GNNs with billions of parameters and terabyte-scale datasets, establishing a scalable pathway for future advancements in atomistic materials modeling.

[LG-38] Differentially Private Selection using Smooth Sensitivity

Link: https://arxiv.org/abs/2504.08086
Authors: Iago Chaves,Victor Farias,Amanda Perez,Diego Parente,Javam Machado
Subjects: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Databases (cs.DB)
*Comments: This is the full version of our paper “Differentially Private Selection using Smooth Sensitivity”, which will appear in IEEE Security & Privacy 2025 as a regular research paper

Click to view abstract

Abstract:Differentially private selection mechanisms offer strong privacy guarantees for queries aiming to identify the top-scoring element r from a finite set R, based on a dataset-dependent utility function. While selection queries are fundamental in data science, few mechanisms effectively ensure their privacy. Furthermore, most approaches rely on global sensitivity to achieve differential privacy (DP), which can introduce excessive noise and impair downstream inferences. To address this limitation, we propose the Smooth Noisy Max (SNM) mechanism, which leverages smooth sensitivity to yield provably tighter (upper bounds on) expected errors compared to global sensitivity-based methods. Empirical results demonstrate that SNM is more accurate than state-of-the-art differentially private selection methods in three applications: percentile selection, greedy decision trees, and random forests.
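
For orientation, below is a sketch of the classical global-sensitivity baseline that SNM improves on: the Report Noisy Max mechanism. The utility function and sensitivity here are toy assumptions, and the paper's contribution — replacing the global-sensitivity noise scale with one calibrated to smooth sensitivity — is not reproduced here.

```python
import numpy as np


def report_noisy_max(utilities, global_sensitivity, epsilon, rng=None):
    """Classical Report Noisy Max: return the index of the highest noisy
    utility. Laplace noise with scale 2*Delta/epsilon gives epsilon-DP."""
    rng = np.random.default_rng(rng)
    scale = 2.0 * global_sensitivity / epsilon
    noisy = np.asarray(utilities, dtype=float) + rng.laplace(0.0, scale, len(utilities))
    return int(np.argmax(noisy))


# Toy example: privately pick the most frequent item from a histogram,
# where adding/removing one record changes each count by at most 1.
counts = [120, 95, 130, 40]
print(report_noisy_max(counts, global_sensitivity=1.0, epsilon=0.5, rng=0))
```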

[LG-39] Programs as Singularities

Link: https://arxiv.org/abs/2504.08075
Authors: Daniel Murfet,Will Troiani
Subjects: Logic in Computer Science (cs.LO); Machine Learning (cs.LG); Logic (math.LO)
*Comments:

Click to view abstract

Abstract:We develop a correspondence between the structure of Turing machines and the structure of singularities of real analytic functions, based on connecting the Ehrhard-Regnier derivative from linear logic with the role of geometry in Watanabe’s singular learning theory. The correspondence works by embedding ordinary (discrete) Turing machine codes into a family of noisy codes which form a smooth parameter space. On this parameter space we consider a potential function which has Turing machines as critical points. By relating the Taylor series expansion of this potential at such a critical point to combinatorics of error syndromes, we relate the local geometry to internal structure of the Turing machine. The potential in question is the negative log-likelihood for a statistical model, so that the structure of the Turing machine and its associated singularity is further related to Bayesian inference. Two algorithms that produce the same predictive function can nonetheless correspond to singularities with different geometries, which implies that the Bayesian posterior can discriminate between distinct algorithmic implementations, contrary to a purely functional view of inference. In the context of singular learning theory our results point to a more nuanced understanding of Occam’s razor and the meaning of simplicity in inductive inference.

[LG-40] Deep Reinforcement Learning for Day-to-day Dynamic Tolling in Tradable Credit Schemes

Link: https://arxiv.org/abs/2504.08074
Authors: Xiaoyi Wu,Ravi Seshadri,Filipe Rodrigues,Carlos Lima Azevedo
Subjects: Machine Learning (cs.LG); Systems and Control (eess.SY)
*Comments:

Click to view abstract

Abstract:Tradable credit schemes (TCS) are an increasingly studied alternative to congestion pricing, given their revenue neutrality and ability to address issues of equity through the initial credit allocation. Modeling TCS to aid future design and implementation is associated with challenges involving user and market behaviors, demand-supply dynamics, and control mechanisms. In this paper, we focus on the latter and address the day-to-day dynamic tolling problem under TCS, which is formulated as a discrete-time Markov Decision Process and solved using reinforcement learning (RL) algorithms. Our results indicate that RL algorithms achieve travel times and social welfare comparable to the Bayesian optimization benchmark, with generalization across varying capacities and demand levels. We further assess the robustness of RL under different hyperparameters and apply regularization techniques to mitigate action oscillation, which generates practical tolling strategies that are transferable under day-to-day demand and supply variability. Finally, we discuss potential challenges such as scaling to large networks, and show how transfer learning can be leveraged to improve computational efficiency and facilitate the practical deployment of RL-based TCS solutions.

[LG-41] Multi-user Wireless Image Semantic Transmission over MIMO Multiple Access Channels

Link: https://arxiv.org/abs/2504.07969
Authors: Bingyan Xie,Yongpeng Wu,Feng Shu,Jiangzhou Wang,Wenjun Zhang
Subjects: Networking and Internet Architecture (cs.NI); Machine Learning (cs.LG); Signal Processing (eess.SP)
*Comments: This paper has been accepted by IEEE Wireless Communications Letters

Click to view abstract

Abstract:This paper focuses on a typical uplink transmission scenario over the multiple-input multiple-output multiple access channel (MIMO-MAC) and proposes a multi-user learnable CSI fusion semantic communication (MU-LCFSC) framework. It incorporates CSI as the side information into both the semantic encoders and decoders to generate a proper feature mask map in order to produce a more robust attention weight distribution. At the decoding end, a cooperative successive interference cancellation procedure is conducted along with a cooperative mask ratio generator, which flexibly controls the mask elements of feature mask maps. Numerical results verify the superiority of the proposed MU-LCFSC, which outperforms DeepJSCC-NOMA by over 3 dB in terms of PSNR.

[LG-42] Bayesian optimization for mixed variables using an adaptive dimension reduction process: applications to aircraft design

Link: https://arxiv.org/abs/2504.08682
Authors: Paul Saves,Nathalie Bartoli,Youssef Diouane,Thierry Lefebvre,Joseph Morlier,Christophe David,Eric Nguyen Van,Sébastien Defoort
Subjects: Methodology (stat.ME); Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
*Comments: AIAA SciTech 2022 Forum. arXiv admin note: substantial text overlap with arXiv:2402.04711

Click to view abstract

Abstract:Multidisciplinary design optimization methods aim at adapting numerical optimization techniques to the design of engineering systems involving multiple disciplines. In this context, a large number of mixed continuous, integer and categorical variables might arise during the optimization process and practical applications involve a large number of design variables. Recently, there has been a growing interest in mixed-variable constrained Bayesian optimization, but most existing approaches severely increase the number of hyperparameters related to the surrogate model. In this paper, we address this issue by constructing surrogate models using fewer hyperparameters. The reduction process is based on the partial least squares method. An adaptive procedure for choosing the number of hyperparameters is proposed. The performance of the proposed approach is confirmed on analytical tests as well as two real applications related to aircraft design. A significant improvement is obtained compared to genetic algorithms.

[LG-43] Transformer Learns Optimal Variable Selection in Group-Sparse Classification

Link: https://arxiv.org/abs/2504.08638
Authors: Chenyang Zhang,Xuran Meng,Yuan Cao
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
*Comments: 63 pages, 6 figures

Click to view abstract

Abstract:Transformers have demonstrated remarkable success across various applications. However, the success of transformers has not been well understood in theory. In this work, we give a case study of how transformers can be trained to learn a classic statistical model with “group sparsity”, where the input variables form multiple groups, and the label only depends on the variables from one of the groups. We theoretically demonstrate that a one-layer transformer trained by gradient descent can correctly leverage the attention mechanism to select variables, disregarding irrelevant ones and focusing on those beneficial for classification. We also demonstrate that a well-pretrained one-layer transformer can be adapted to new downstream tasks to achieve good prediction accuracy with a limited number of samples. Our study sheds light on how transformers effectively learn structured data.

[LG-44] Gradient Descent Robustly Learns the Intrinsic Dimension of Data in Training Convolutional Neural Networks

Link: https://arxiv.org/abs/2504.08628
Authors: Chenyang Zhang,Peifeng Gao,Difan Zou,Yuan Cao
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
*Comments: 43 pages, 4 figures

Click to view abstract

Abstract:Modern neural networks are usually highly over-parameterized. Behind the wide usage of over-parameterized networks is the belief that, if the data are simple, then the trained network will be automatically equivalent to a simple predictor. Following this intuition, many existing works have studied different notions of “ranks” of neural networks and their relation to the rank of data. In this work, we study the rank of convolutional neural networks (CNNs) trained by gradient descent, with a specific focus on the robustness of the rank to image background noises. Specifically, we point out that, when adding background noises to images, the rank of the CNN trained with gradient descent is affected far less compared with the rank of the data. We support our claim with a theoretical case study, where we consider a particular data model to characterize low-rank clean images with added background noises. We prove that CNNs trained by gradient descent can learn the intrinsic dimension of clean images, despite the presence of relatively large background noises. We also conduct experiments on synthetic and real datasets to further validate our claim.

[LG-45] AstroLLaVA: towards the unification of astronomical data and natural language ICLR2025

Link: https://arxiv.org/abs/2504.08583
Authors: Sharaf Zaman,Michael J. Smith,Pranav Khetarpal,Rishabh Chakrabarty,Michele Ginolfi,Marc Huertas-Company,Maja Jabłońska,Sandor Kruk,Matthieu Le Lain,Sergio José Rodríguez Méndez,Dimitrios Tanoglidis
Subjects: Instrumentation and Methods for Astrophysics (astro-ph.IM); Machine Learning (cs.LG)
*Comments: 8 pages, 3 figures, accepted to SCI-FM@ICLR 2025. Code at this https URL

Click to view abstract

Abstract:We present AstroLLaVA, a vision language model for astronomy that enables interaction with astronomical imagery through natural dialogue. By fine-tuning the LLaVA model on a diverse dataset of \sim 30k images with captions and question-answer pairs sourced from NASA’s ‘Astronomy Picture of the Day’, the European Southern Observatory, and the NASA/ESA Hubble Space Telescope, we create a model capable of answering open-ended questions about astronomical concepts depicted visually. Our two-stage fine-tuning process adapts the model to both image captioning and visual question answering in the astronomy domain. We demonstrate AstroLLaVA’s performance on an astronomical visual question answering benchmark and release the model weights, code, and training set to encourage further open source work in this space. Finally, we suggest a roadmap towards general astronomical data alignment with pre-trained language models, and provide an open space for collaboration towards this end for interested researchers.

[LG-46] Statistically guided deep learning

Link: https://arxiv.org/abs/2504.08489
Authors: Michael Kohler,Adam Krzyzak
Subjects: Statistics Theory (math.ST); Machine Learning (cs.LG); Machine Learning (stat.ML)
*Comments: arXiv admin note: text overlap with arXiv:2504.03405

Click to view abstract

Abstract:We present a theoretically well-founded deep learning algorithm for nonparametric regression. It uses over-parametrized deep neural networks with logistic activation function, which are fitted to the given data via gradient descent. We propose a special topology of these networks, a special random initialization of the weights, and a data-dependent choice of the learning rate and the number of gradient descent steps. We prove a theoretical bound on the expected L_2 error of this estimate, and illustrate its finite sample size performance by applying it to simulated data. Our results show that a theoretical analysis of deep learning which takes into account simultaneously optimization, generalization and approximation can result in a new deep learning estimate which has an improved finite sample performance.

[LG-47] Artifact detection and localization in single-channel mobile EEG for sleep research using deep learning and attention mechanisms

Link: https://arxiv.org/abs/2504.08469
Authors: Khrystyna Semkiv,Jia Zhang,Maria Laura Ferster,Walter Karlen
Subjects: Signal Processing (eess.SP); Machine Learning (cs.LG)
*Comments:

Click to view abstract

Abstract:Artifacts in the electroencephalogram (EEG) degrade signal quality and impact the analysis of brain activity. Current methods for detecting artifacts in sleep EEG rely on simple threshold-based algorithms that require manual intervention, which is time-consuming and impractical due to the vast volume of data that novel mobile recording systems generate. We propose a convolutional neural network (CNN) model incorporating a convolutional block attention module (CNN-CBAM) to detect and identify the location of artifacts in the sleep EEG with attention maps. We benchmarked this model against six other machine learning and signal processing approaches. We trained/tuned all models on 72 manually annotated EEG recordings obtained during home-based monitoring from 18 healthy participants with a mean (SD) age of 68.05 y ( \pm 5.02). We tested them on 26 separate recordings from 6 healthy participants with a mean (SD) age of 68.33 y ( \pm 4.08), which contained artifacts in 4% of epochs. CNN-CBAM achieved the highest area under the receiver operating characteristic curve (0.88), sensitivity (0.81), and specificity (0.86) when compared to the other approaches. The attention maps from CNN-CBAM localized artifacts within the epoch with a sensitivity of 0.71 and specificity of 0.67. This work demonstrates the feasibility of automating the detection and localization of artifacts in wearable sleep EEG.
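
For readers unfamiliar with CBAM, the following is a generic 1D sketch of a convolutional block attention module of the kind such a CNN plugs in after its convolutional layers; the layer sizes are assumptions, not the authors' configuration. The temporal attention map returned by the module is what makes artifact localization possible.

```python
import torch
import torch.nn as nn


class CBAM1d(nn.Module):
    """Channel attention followed by temporal attention, CBAM-style,
    for (batch, channels, time) feature maps from a 1D CNN."""

    def __init__(self, channels: int, reduction: int = 8, kernel_size: int = 7):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels),
        )
        self.spatial = nn.Conv1d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x):
        # channel attention from average- and max-pooled statistics
        ca = torch.sigmoid(self.mlp(x.mean(dim=2)) + self.mlp(x.amax(dim=2)))
        x = x * ca.unsqueeze(-1)
        # temporal attention map -- this is what localizes artifacts in time
        s = torch.cat([x.mean(1, keepdim=True), x.amax(1, keepdim=True)], dim=1)
        attn = torch.sigmoid(self.spatial(s))
        return x * attn, attn


feats = torch.randn(4, 16, 3000)        # e.g. 4 EEG epochs, 16 CNN feature maps
refined, attn_map = CBAM1d(16)(feats)   # attn_map: (4, 1, 3000)
```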

[LG-48] Standardization of Weighted Ranking Correlation Coefficients

Link: https://arxiv.org/abs/2504.08428
Authors: Pierangelo Lombardo
Subjects: Methodology (stat.ME); Statistical Mechanics (cond-mat.stat-mech); Machine Learning (cs.LG); Statistics Theory (math.ST); Machine Learning (stat.ML)
*Comments: 12 pages, 5 figures

Click to view abstract

Abstract:A relevant problem in statistics is defining the correlation of two rankings of a list of items. Kendall’s tau and Spearman’s rho are two well established correlation coefficients, characterized by a symmetric form that ensures zero expected value between two pairs of rankings randomly chosen with uniform probability. However, in recent years, several weighted versions of the original Spearman and Kendall coefficients have emerged that take into account the greater importance of top ranks compared to low ranks, which is common in many contexts. The weighting schemes break the symmetry, causing a non-zero expected value between two random rankings. This issue is very relevant, as it undermines the concept of uncorrelation between rankings. In this paper, we address this problem by proposing a standardization function g(x) that maps a correlation ranking coefficient \Gamma in a standard form g(\Gamma) that has zero expected value, while maintaining the relevant statistical properties of \Gamma .

[LG-49] An Empirical Investigation of Reconstruction-Based Models for Seizure Prediction from ECG Signals

Link: https://arxiv.org/abs/2504.08381
Authors: Mohammad Reza Chopannavaz,Foad Ghaderi
Subjects: Signal Processing (eess.SP); Machine Learning (cs.LG)
*Comments:

Click to view abstract

Abstract:Epileptic seizures are sudden neurological disorders characterized by abnormal, excessive neuronal activity in the brain, which is often associated with changes in cardiovascular activity. These disruptions can pose significant physical and psychological challenges for patients. Therefore, accurate seizure prediction can help mitigate these risks by enabling timely interventions, ultimately improving patients’ quality of life. Traditionally, EEG signals have been the primary standard for seizure prediction due to their precision in capturing brain activity. However, their high cost, susceptibility to noise, and logistical constraints limit their practicality, restricting their use to clinical settings. In order to overcome these limitations, this study focuses on leveraging ECG signals as an alternative for seizure prediction. In this paper, we present a novel method for predicting seizures based on detecting anomalies in ECG signals during their reconstruction. By extracting time-frequency features and leveraging various advanced deep learning architectures, the proposed method identifies deviations in heart rate dynamics associated with seizure onset. The proposed approach was evaluated using the Siena database and could achieve specificity of 99.16%, accuracy of 76.05%, and false positive rate (FPR) of 0.01/h, with an average prediction time of 45 minutes before seizure onset. These results highlight the potential of ECG-based seizure prediction as a patient-friendly alternative to traditional EEG-based methods.
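
As a rough illustration of the reconstruction-based idea, the sketch below trains a toy autoencoder on normal feature vectors and scores new segments by reconstruction error. The 64-dimensional time-frequency features and the shallow MLP are assumptions for illustration only; the paper evaluates several deeper architectures.

```python
import torch
import torch.nn as nn

# toy autoencoder over per-segment feature vectors (64 dims assumed)
ae = nn.Sequential(nn.Linear(64, 16), nn.ReLU(), nn.Linear(16, 64))
opt = torch.optim.Adam(ae.parameters(), lr=1e-3)


def train_step(batch):                 # batch: (B, 64) normal ECG features
    opt.zero_grad()
    loss = nn.functional.mse_loss(ae(batch), batch)
    loss.backward()
    opt.step()
    return loss.item()


def anomaly_score(x):                  # high error = deviating heart dynamics
    with torch.no_grad():
        return ((ae(x) - x) ** 2).mean(dim=1)


for _ in range(200):                   # fit the autoencoder to normal segments
    train_step(torch.randn(32, 64))    # stand-in for real ECG features

# flag pre-seizure segments whose score exceeds a threshold tuned on
# normal data, e.g. a high percentile of training-set scores
print(anomaly_score(torch.randn(8, 64)))
```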

[LG-50] All Optical Echo State Network Reservoir Computing

Link: https://arxiv.org/abs/2504.08224
Authors: Ishwar S Kaushik,Peter J Ehlers,Daniel Soh
Subjects: Optics (physics.optics); Machine Learning (cs.LG); Machine Learning (stat.ML)
*Comments: 15 pages, 10 figures

Click to view abstract

Abstract:We propose an innovative design for an all-optical Echo State Network (ESN), an advanced type of reservoir computer known for its universal computational capabilities. Our design enables fully optical implementation of arbitrary ESNs, featuring complete flexibility in optical matrix multiplication and nonlinear activation. Leveraging the nonlinear characteristics of stimulated Brillouin scattering (SBS), the architecture efficiently realizes measurement-free operations crucial for reservoir computing. The approach significantly reduces computational overhead and energy consumption compared to traditional software-based methods. Comprehensive simulations validate the system’s memory capacity, nonlinear processing strength, and polynomial algebra capabilities, showcasing performance comparable to software ESNs across key benchmark tasks. Our design establishes a feasible, scalable, and universally applicable framework for optical reservoir computing, suitable for diverse machine learning applications.
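
For context, the optical system realizes the standard ESN update physically, with SBS supplying the nonlinearity in place of a computed tanh. The NumPy sketch below shows that underlying update and the ridge-regression readout, the only trained component of a reservoir computer; all sizes and scalings are illustrative.

```python
import numpy as np


class ESN:
    """Software version of the ESN update the optical hardware implements."""

    def __init__(self, n_in, n_res, spectral_radius=0.9, seed=0):
        rng = np.random.default_rng(seed)
        self.W_in = rng.uniform(-0.5, 0.5, (n_res, n_in))
        W = rng.uniform(-0.5, 0.5, (n_res, n_res))
        # rescale the recurrent weights for the echo state property
        self.W = W * (spectral_radius / np.max(np.abs(np.linalg.eigvals(W))))
        self.x = np.zeros(n_res)

    def step(self, u):
        self.x = np.tanh(self.W @ self.x + self.W_in @ u)
        return self.x


def fit_readout(states, targets, ridge=1e-6):
    """Ridge-regression readout: the only trained part of the reservoir."""
    S = np.asarray(states)
    return np.linalg.solve(S.T @ S + ridge * np.eye(S.shape[1]),
                           S.T @ np.asarray(targets))


esn = ESN(n_in=1, n_res=100)
u = np.sin(np.linspace(0, 8 * np.pi, 400))
states = [esn.step(np.array([v])) for v in u]
w_out = fit_readout(states[:-1], u[1:])   # one-step-ahead prediction readout
```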

[LG-51] Local Distance-Preserving Node Embeddings and Their Performance on Random Graphs

Link: https://arxiv.org/abs/2504.08216
Authors: My Le,Luana Ruiz,Souvik Dhara
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
*Comments:

Click to view abstract

Abstract:Learning node representations is a fundamental problem in graph machine learning. While existing embedding methods effectively preserve local similarity measures, they often fail to capture global functions like graph distances. Inspired by Bourgain’s seminal work on Hilbert space embeddings of metric spaces (1985), we study the performance of local distance-preserving node embeddings. Known as landmark-based algorithms, these embeddings approximate pairwise distances by computing shortest paths from a small subset of reference nodes (i.e., landmarks). Our main theoretical contribution shows that random graphs, such as Erdős-Rényi random graphs, require lower dimensions in landmark-based embeddings compared to worst-case graphs. Empirically, we demonstrate that the GNN-based approximations for the distances to landmarks generalize well to larger networks, offering a scalable alternative for graph representation learning.
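
A minimal sketch of the landmark-based construction the paper analyzes: each node is embedded as its vector of BFS distances to a few landmark nodes, and pairwise distances are then bounded via the triangle inequality. The paper's GNN-based approximation of the landmark distances is not shown here; the graph and landmark choice are illustrative.

```python
from collections import deque


def bfs_distances(adj, src):
    """Unweighted shortest-path distances from src (adjacency-list graph)."""
    dist, q = {src: 0}, deque([src])
    while q:
        u = q.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                q.append(v)
    return dist


def landmark_embedding(adj, landmarks):
    """Embed node v as (d(v, l1), ..., d(v, lk)) over the landmarks."""
    tables = [bfs_distances(adj, l) for l in landmarks]
    return {v: [t.get(v, float("inf")) for t in tables] for v in adj}


def approx_distance(emb_u, emb_v):
    # triangle inequality: best lower and upper bounds via the landmarks
    lower = max(abs(a - b) for a, b in zip(emb_u, emb_v))
    upper = min(a + b for a, b in zip(emb_u, emb_v))
    return lower, upper


adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}  # path graph 0-1-2-3-4
emb = landmark_embedding(adj, landmarks=[0, 4])
print(approx_distance(emb[1], emb[3]))  # true d(1,3)=2; prints bounds (2, 4)
```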

[LG-52] Deep Distributional Learning with Non-crossing Quantile Network

Link: https://arxiv.org/abs/2504.08215
Authors: Guohao Shen,Runpeng Dai,Guojun Wu,Shikai Luo,Chengchun Shi,Hongtu Zhu
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST)
*Comments:

Click to view abstract

Abstract:In this paper, we introduce a non-crossing quantile (NQ) network for conditional distribution learning. By leveraging non-negative activation functions, the NQ network ensures that the learned distributions remain monotonic, effectively addressing the issue of quantile crossing. Furthermore, the NQ network-based deep distributional learning framework is highly adaptable, applicable to a wide range of applications, from classical non-parametric quantile regression to more advanced tasks such as causal effect estimation and distributional reinforcement learning (RL). We also develop a comprehensive theoretical foundation for the deep NQ estimator and its application to distributional RL, providing an in-depth analysis that demonstrates its effectiveness across these domains. Our experimental results further highlight the robustness and versatility of the NQ network.
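
The core non-crossing construction can be sketched as follows: the network outputs a base quantile plus a cumulative sum of non-negative increments, so the predicted quantiles are monotone in the quantile level by construction. The layer sizes and the choice of softplus are assumptions; the paper's exact architecture and theory are richer.

```python
import torch
import torch.nn as nn


class NonCrossingQuantiles(nn.Module):
    """Predicts the lowest quantile plus non-negative gaps, so the outputs
    are monotone across quantile levels and cannot cross."""

    def __init__(self, in_dim, n_quantiles, hidden=64):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU())
        self.base = nn.Linear(hidden, 1)                 # lowest quantile
        self.deltas = nn.Linear(hidden, n_quantiles - 1)

    def forward(self, x):
        h = self.body(x)
        b = self.base(h)
        gaps = nn.functional.softplus(self.deltas(h))    # non-negative activation
        levels = torch.cat([torch.zeros_like(b), torch.cumsum(gaps, dim=1)], dim=1)
        return b + levels                                # (batch, n_quantiles)


def pinball_loss(pred, y, taus):
    """Average check (pinball) loss over quantile levels taus."""
    err = y.unsqueeze(1) - pred
    return torch.maximum(taus * err, (taus - 1) * err).mean()


taus = torch.tensor([0.1, 0.5, 0.9])
model = NonCrossingQuantiles(in_dim=5, n_quantiles=3)
x, y = torch.randn(32, 5), torch.randn(32)
loss = pinball_loss(model(x), y, taus)   # train with any standard optimizer
```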

[LG-53] Particle Hit Clustering and Identification Using Point Set Transformers in Liquid Argon Time Projection Chambers

Link: https://arxiv.org/abs/2504.08182
Authors: Edgar E. Robles,Alejando Yankelevich,Wenjie Wu,Jianming Bian,Pierre Baldi
Subjects: High Energy Physics - Experiment (hep-ex); Machine Learning (cs.LG)
*Comments: 12 pages, 5 figures, 2 tables, for submission at JINST

Click to view abstract

Abstract:Liquid argon time projection chambers are often used in neutrino physics and dark-matter searches because of their high spatial resolution. The images generated by these detectors are extremely sparse, as the energy values detected by most of the detector are equal to 0, meaning that despite their high resolution, most of the detector is unused in a particular interaction. Instead of representing all of the empty detections, the interaction is usually stored as a sparse matrix, a list of detection locations paired with their energy values. Traditional machine learning methods that have been applied to particle reconstruction such as convolutional neural networks (CNNs), however, cannot operate over data stored in this way and therefore must have the matrix fully instantiated as a dense matrix. Operating on dense matrices requires a lot of memory and computation time, in contrast to directly operating on the sparse matrix. We propose a machine learning model using a point set neural network that operates over a sparse matrix, greatly improving both processing speed and accuracy over methods that instantiate the dense matrix, as well as over other methods that operate over sparse matrices. Compared to competing state-of-the-art methods, our method improves classification performance by 14%, segmentation performance by more than 22%, while taking 80% less time and using 66% less memory. Compared to state-of-the-art CNN methods, our method improves classification performance by more than 86%, segmentation performance by more than 71%, while reducing runtime by 91% and reducing memory usage by 61%.
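
The storage argument is easy to see in code: a handful of (x, y, z, energy) hits occupy a few dozen bytes, while the dense volume a conventional CNN would have to instantiate costs megabytes. The 128^3 detector geometry below is purely illustrative.

```python
import numpy as np

# one interaction stored as a sparse hit list: (x, y, z, energy) per detection
hits = np.array([[12.0, 40.0, 7.0, 0.8],
                 [13.0, 40.0, 7.0, 1.2],
                 [80.0, 5.0, 60.0, 0.3]])

# the dense equivalent a CNN would need to instantiate (mostly zeros)
dense = np.zeros((128, 128, 128), dtype=np.float32)
for x, y, z, e in hits:
    dense[int(x), int(y), int(z)] = e

print(hits.nbytes)   # 96 bytes for the point set
print(dense.nbytes)  # 8388608 bytes (~8 MB) for the mostly-empty volume
```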

[LG-54] A Piecewise Lyapunov Analysis of sub–quadratic SGD: Applications to Robust and Quantile Regression

Link: https://arxiv.org/abs/2504.08178
Authors: Yixuan Zhang,Dongyan(Lucy)Huo,Yudong Chen,Qiaomin Xie
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Optimization and Control (math.OC); Probability (math.PR); Statistics Theory (math.ST)
*Comments: ACM SIGMETRICS 2025. 40 pages, 12 figures

Click to view abstract

Abstract:Motivated by robust and quantile regression problems, we investigate the stochastic gradient descent (SGD) algorithm for minimizing an objective function f that is locally strongly convex with a sub–quadratic tail. This setting covers many widely used online statistical methods. We introduce a novel piecewise Lyapunov function that enables us to handle functions f with only first-order differentiability, which includes a wide range of popular loss functions such as Huber loss. Leveraging our proposed Lyapunov function, we derive finite-time moment bounds under general diminishing stepsizes, as well as constant stepsizes. We further establish the weak convergence, central limit theorem and bias characterization under constant stepsize, providing the first geometrical convergence result for sub–quadratic SGD. Our results have wide applications, especially in online statistical methods. In particular, we discuss two applications of our results. 1) Online robust regression: We consider a corrupted linear model with sub–exponential covariates and heavy–tailed noise. Our analysis provides convergence rates comparable to those for corrupted models with Gaussian covariates and noise. 2) Online quantile regression: Importantly, our results relax the common assumption in prior work that the conditional density is continuous and provide a more fine-grained analysis for the moment bounds.
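
As a concrete instance of the setting, the sketch below runs SGD with the Huber loss — whose gradient is the clipped residual, hence only first-order differentiability — on a linear model with heavy-tailed noise, using diminishing stepsizes. All constants are illustrative, not the paper's.

```python
import numpy as np


def huber_grad(r, delta=1.0):
    """Gradient of the Huber loss w.r.t. the residual r: the clipped residual."""
    return np.clip(r, -delta, delta)


def sgd_robust_regression(X, y, steps=10_000, eta0=1.0, seed=0):
    rng = np.random.default_rng(seed)
    theta = np.zeros(X.shape[1])
    for t in range(1, steps + 1):
        i = rng.integers(len(y))
        r = X[i] @ theta - y[i]
        theta -= (eta0 / np.sqrt(t)) * huber_grad(r) * X[i]  # diminishing stepsize
    return theta


# toy corrupted linear model with heavy-tailed (Student-t) noise
rng = np.random.default_rng(1)
X = rng.normal(size=(5000, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.standard_t(df=2, size=5000)
print(sgd_robust_regression(X, y))  # should approach [2, -1, 0.5]
```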

[LG-55] Efficient measurement of neutral-atom qubits with matched filters

Link: https://arxiv.org/abs/2504.08170
Authors: Robert M. Kent,Linipun Phuttitarn,Chaithanya Naik Mude,Swamit Tannu,Mark Saffman,Gregory Lafyatis,Daniel J. Gauthier
Subjects: Quantum Physics (quant-ph); Machine Learning (cs.LG); Computational Physics (physics.comp-ph)
*Comments:

Click to view abstract

Abstract:Quantum computers require high-fidelity measurement of many qubits to achieve a quantum advantage. Traditional approaches suffer from readout crosstalk for a neutral-atom quantum processor with a tightly spaced array. Although classical machine learning algorithms based on convolutional neural networks can improve fidelity, they are computationally expensive, making it difficult to scale them to large qubit counts. We present two simpler and scalable machine learning algorithms that realize matched filters for the readout problem. One is a local model that focuses on a single qubit, and the other uses information from neighboring qubits in the array to prevent crosstalk among the qubits. We demonstrate error reductions of up to 32% and 43% for the site and array models, respectively, compared to a conventional Gaussian threshold approach. Additionally, our array model uses two orders of magnitude fewer trainable parameters and four orders of magnitude fewer multiplications and nonlinear function evaluations than a recent convolutional neural network approach, with only a minor (3.5%) increase in error across different readout times. Another strength of our approach is its physical interpretability: the learned filter can be visualized to provide insights into experimental imperfections. We also show that a convolutional neural network model can be pruned to have 70x and 4000x fewer parameters and multiplications, respectively, while maintaining similar errors. Our work shows that simple machine learning approaches can achieve high-fidelity qubit measurements while remaining scalable to systems with larger qubit counts.
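
A single-site matched filter of the kind described can be sketched in a few lines: under a white-noise assumption, the optimal linear filter is proportional to the difference of the mean bright and dark images, followed by a scalar threshold. This is a generic textbook sketch, not the authors' trained site/array models, whose array variant additionally uses neighboring-qubit pixels to suppress crosstalk.

```python
import numpy as np


def train_matched_filter(bright_imgs, dark_imgs):
    """Matched filter under white-noise assumptions: weights proportional
    to the mean bright/dark image difference, threshold fit by brute force."""
    w = bright_imgs.mean(axis=0) - dark_imgs.mean(axis=0)
    scores = np.concatenate([bright_imgs, dark_imgs]).reshape(-1, w.size) @ w.ravel()
    labels = np.array([1] * len(bright_imgs) + [0] * len(dark_imgs))
    thresholds = np.sort(scores)
    errs = [((scores > th) != labels).mean() for th in thresholds]
    return w, thresholds[int(np.argmin(errs))]


def classify(img, w, threshold):
    return float(img.ravel() @ w.ravel() > threshold)  # 1 = bright, 0 = dark


rng = np.random.default_rng(0)
bright = rng.normal(1.0, 0.5, (200, 8, 8))   # toy camera images of |1>
dark = rng.normal(0.0, 0.5, (200, 8, 8))     # toy camera images of |0>
w, th = train_matched_filter(bright, dark)
print(classify(bright[0], w, th))            # 1.0 for a bright test image
```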

[LG-56] Fusing Global and Local: Transformer-CNN Synergy for Next-Gen Current Estimation

Link: https://arxiv.org/abs/2504.07996
Authors: Junlang Huang,Hao Chen,Li Luo,Yong Cai,Lexin Zhang,Tianhao Ma,Yitian Zhang,Zhong Guan
Subjects: Signal Processing (eess.SP); Machine Learning (cs.LG)
*Comments:

Click to view abstract

Abstract:This paper presents a hybrid model combining Transformer and CNN for predicting the current waveform in signal lines. Unlike traditional approaches such as current source models, driver linear representations, waveform functional fitting, or equivalent load capacitance methods, our model does not rely on fixed simplified models of standard-cell drivers or RC loads. Instead, it replaces the complex Newton iteration process used in traditional SPICE simulations, leveraging the powerful sequence modeling capabilities of the Transformer framework to directly predict current responses without iterative solving steps. The hybrid architecture effectively integrates the global feature-capturing ability of Transformers with the local feature extraction advantages of CNNs, significantly improving the accuracy of current waveform predictions. Experimental results demonstrate that, compared to traditional SPICE simulations, the proposed algorithm achieves an error of only 0.0098. These results highlight the algorithm’s superior capabilities in predicting signal line current waveforms, timing analysis, and power evaluation, making it suitable for a wide range of technology nodes, from 40nm to 3nm.

[LG-57] Towards Simple Machine Learning Baselines for GNSS RFI Detection

Link: https://arxiv.org/abs/2504.07993
Authors: Viktor Ivanov,Richard C. Wilson,Maurizio Scaramuzza
Subjects: Signal Processing (eess.SP); Machine Learning (cs.LG)
*Comments: arXiv admin note: text overlap with arXiv:2405.02678 by other authors

Click to view abstract

Abstract:Machine learning research in GNSS radio frequency interference (RFI) detection often lacks a proper justification for the decisions made in deep learning-based model architectures. Our paper challenges the status quo in machine learning approaches for GNSS RFI detection, revealing the potentially misleading track of current research and highlighting alternative directions. Our position advocates for a shift in focus from solely pursuing novel model designs to critically evaluating the utility of complex black box deep learning methods against simpler and more interpretable machine learning baselines. Our findings demonstrate the need for the creation of simple baselines and suggest the need for more exploration and development of simple and interpretable machine learning methods for the detection of GNSS RFIs. The increment of model complexity in the state-of-the-art deep learning-based models often provides very little improvement. Thanks to a unique dataset from the Swiss Air Force and Swiss Air-Rescue (Rega), preprocessed by Swiss Air Navigation Services Ltd. (Skyguide), we demonstrate the effectiveness of a simple machine learning baseline for GNSS RFI detection on real-world large-scale aircraft data containing flight recordings impacted by real jamming. The experimental results indicate that our solution successfully detects potential GNSS RFI with 91% accuracy, outperforming state-of-the-art deep learning architectures. We believe that our work offers insights and suggestions for the field to move forward.

[LG-58] mixEEG: Enhancing EEG Federated Learning for Cross-subject EEG Classification with Tailored mixup

Link: https://arxiv.org/abs/2504.07987
Authors: Xuan-Hao Liu,Bao-Liang Lu,Wei-Long Zheng
Subjects: Signal Processing (eess.SP); Machine Learning (cs.LG)
*Comments: CogSci 2025 Oral

Click to view abstract

Abstract:The cross-subject electroencephalography (EEG) classification exhibits great challenges due to the diversity of cognitive processes and physiological structures between different subjects. Modern EEG models are based on neural networks, demanding a large amount of data to achieve high performance and generalizability. However, privacy concerns associated with EEG pose significant limitations to data sharing between different hospitals and institutions, resulting in the lack of large datasets for most EEG tasks. Federated learning (FL) enables multiple decentralized clients to collaboratively train a global model without direct communication of raw data, thus preserving privacy. For the first time, we investigate the cross-subject EEG classification in the FL setting. In this paper, we propose a simple yet effective framework termed mixEEG. Specifically, we tailor the vanilla mixup considering the unique properties of the EEG modality. mixEEG shares the unlabeled averaged data of the unseen subject rather than simply sharing raw data under the domain adaptation setting, thus better preserving privacy and offering an averaged label as pseudo-label. Extensive experiments are conducted on an epilepsy detection and an emotion recognition dataset. The experimental results demonstrate that our mixEEG enhances the transferability of the global model for cross-subject EEG classification consistently across different datasets and model architectures. Code is published at: this https URL.
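
For reference, the vanilla mixup operator that mixEEG tailors is just a Beta-weighted convex combination of two examples and their labels. The sketch below is that generic version, not the paper's federated variant that shares averaged unlabeled data of the unseen subject across clients.

```python
import numpy as np


def mixup(x1, y1, x2, y2, alpha=0.2, rng=None):
    """Vanilla mixup: convex combination of two examples and their labels."""
    rng = rng or np.random.default_rng()
    lam = rng.beta(alpha, alpha)
    return lam * x1 + (1 - lam) * x2, lam * y1 + (1 - lam) * y2


# EEG-flavoured example: mix two one-hot-labelled epochs (channels x time)
e1, e2 = np.random.randn(2, 32, 1000)
y1, y2 = np.eye(2)
x_mix, y_mix = mixup(e1, y1, e2, y2)   # soft label, e.g. [0.83, 0.17]
```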

[LG-59] EquiNO: A Physics-Informed Neural Operator for Multiscale Simulations

Link: https://arxiv.org/abs/2504.07976
Authors: Hamidreza Eivazi,Jendrik-Alexander Tröger,Stefan Wittek,Stefan Hartmann,Andreas Rausch
Subjects: Computational Physics (physics.comp-ph); Machine Learning (cs.LG)
*Comments: 36 pages. Code available at: this https URL

Click to view abstract

Abstract:Multiscale problems are ubiquitous in physics. Numerical simulations of such problems by solving partial differential equations (PDEs) at high resolution are computationally too expensive for many-query scenarios, e.g., uncertainty quantification, remeshing applications, topology optimization, and so forth. This limitation has motivated the application of data-driven surrogate models, where the microscale computations are substituted with a surrogate, usually acting as a black-box mapping between macroscale quantities. These models offer significant speedups but struggle with incorporating microscale physical constraints, such as the balance of linear momentum and constitutive models. In this contribution, we propose Equilibrium Neural Operator (EquiNO) as a complementary physics-informed PDE surrogate for predicting microscale physics and compare it with variational physics-informed neural and operator networks. Our framework, applicable to the so-called multiscale FE^2 computations, introduces the FE-OL approach by integrating the finite element (FE) method with operator learning (OL). We apply the proposed FE-OL approach to quasi-static problems of solid mechanics. The results demonstrate that FE-OL can yield accurate solutions even when confronted with a restricted dataset during model development. Our results show that EquiNO achieves speedup factors exceeding 8000-fold compared to traditional methods and offers an optimal balance between data-driven and physics-based strategies.

[LG-60] Minimax-optimal and Locally-adaptive Online Nonparametric Regression

Link: https://arxiv.org/abs/2410.03363
Authors: Paul Liautaud(LPSM (UMR_8001), Thoth),Pierre Gaillard(Thoth),Olivier Wintenberger(LPSM (UMR_8001))
Subjects: Statistics Theory (math.ST); Machine Learning (cs.LG); Machine Learning (stat.ML)
*Comments:

Click to view abstract

Abstract:We study adversarial online nonparametric regression with general convex losses and propose a parameter-free learning algorithm that achieves minimax optimal rates. Our approach leverages chaining trees to compete against Hölder functions and establishes optimal regret bounds. While competing with nonparametric function classes can be challenging, they often exhibit local patterns - such as local Hölder continuity - that online algorithms can exploit. Without prior knowledge, our method dynamically tracks and adapts to different Hölder profiles by pruning a core chaining tree structure, aligning itself with local smoothness variations. This leads to the first computationally efficient algorithm with locally adaptive optimal rates for online regression in an adversarial setting. Finally, we discuss how these notions could be extended to a boosting framework, offering promising directions for future research.

Information Retrieval

[IR-0] A Comparative Study of Recommender Systems under Big Data Constraints

Link: https://arxiv.org/abs/2504.08457
Authors: Arimondo Scrivano
Subjects: Information Retrieval (cs.IR)
*Comments: 12 pages, 2 figures

Click to view abstract

Abstract:Recommender Systems (RS) have become essential tools in a wide range of digital services, from e-commerce and streaming platforms to news and social media. As the volume of user-item interactions grows exponentially, especially in Big Data environments, selecting the most appropriate RS model becomes a critical task. This paper presents a comparative study of several state-of-the-art recommender algorithms, including EASE-R, SLIM, SLIM with ElasticNet regularization, Matrix Factorization (FunkSVD and ALS), P3Alpha, and RP3Beta. We evaluate these models according to key criteria such as scalability, computational complexity, predictive accuracy, and interpretability. The analysis considers both their theoretical underpinnings and practical applicability in large-scale scenarios. Our results highlight that while models like SLIM and SLIM-ElasticNet offer high accuracy and interpretability, they suffer from high computational costs, making them less suitable for real-time applications. In contrast, algorithms such as EASE-R and RP3Beta achieve a favorable balance between performance and scalability, proving more effective in large-scale environments. This study aims to provide guidelines for selecting the most appropriate recommender approach based on specific Big Data constraints and system requirements.
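
Among the compared models, EASE-R is notable for having a closed-form solution, which is one reason it balances accuracy and scalability well. Below is a minimal NumPy sketch following Steck's published closed form; the regularization strength and the toy interaction matrix are illustrative assumptions.

```python
import numpy as np


def ease_r(X, lam=100.0):
    """Closed-form EASE-R: item-item weight matrix B with zero diagonal.
    B[i, j] = -P[i, j] / P[j, j] for i != j, where P = (X^T X + lam*I)^-1."""
    G = X.T @ X + lam * np.eye(X.shape[1])  # Gram matrix + L2 regularization
    P = np.linalg.inv(G)
    B = P / (-np.diag(P))                   # divide each column j by -P[j, j]
    np.fill_diagonal(B, 0.0)
    return B


# toy user-item interaction matrix (3 users x 4 items)
X = np.array([[1, 0, 1, 0],
              [1, 1, 0, 0],
              [0, 1, 1, 1]], dtype=float)
scores = X @ ease_r(X)   # rank each user's unseen items by score
```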

[IR-1] A Reproducibility Study of Graph-Based Legal Case Retrieval SIGIR2025

Link: https://arxiv.org/abs/2504.08400
Authors: Gregor Donabauer,Udo Kruschwitz
Subjects: Information Retrieval (cs.IR)
*Comments: Preprint accepted at SIGIR 2025

Click to view abstract

Abstract:Legal retrieval is a widely studied area in Information Retrieval (IR) and a key task in this domain is retrieving relevant cases based on a given query case, often done by applying language models as encoders to model case similarity. Recently, Tang et al. proposed CaseLink, a novel graph-based method for legal case retrieval, which models both cases and legal charges as nodes in a network, with edges representing relationships such as references and shared semantics. This approach offers a new perspective on the task by capturing higher-order relationships of cases going beyond the stand-alone level of documents. However, while this shift in approaching legal case retrieval is a promising direction in an understudied area of graph-based legal IR, challenges in reproducing novel results have recently been highlighted, with multiple studies reporting difficulties in reproducing previous findings. Thus, in this work we reproduce CaseLink, a graph-based legal case retrieval method, to support future research in this area of IR. In particular, we aim to assess its reliability and generalizability by (i) first reproducing the original study setup and (ii) applying the approach to an additional dataset. We then build upon the original implementations by (iii) evaluating the approach’s performance when using a more sophisticated graph data representation and (iv) using an open large language model (LLM) in the pipeline to address limitations that are known to result from using closed models accessed via an API. Our findings aim to improve the understanding of graph-based approaches in legal IR and contribute to improving reproducibility in the field. To achieve this, we share all our implementations and experimental artifacts with the community.

[IR-2] OnSET: Ontology and Semantic Exploration Toolkit SIGIR

Link: https://arxiv.org/abs/2504.08373
Authors: Benedikt Kantz,Kevin Innerebner,Peter Waldert,Stefan Lengauer,Elisabeth Lex,Tobias Schreck
Subjects: Information Retrieval (cs.IR)
*Comments: 5 pages, 4 figures, accepted to SIGIR Demo Paper Track 2025

Click to view abstract

Abstract:Retrieval over knowledge graphs is usually performed using dedicated, complex query languages like SPARQL. We propose a novel system, Ontology and Semantic Exploration Toolkit (OnSET), that allows non-expert users to easily build queries with visual user guidance provided by topic modelling and semantic search throughout the application. OnSET allows users without any prior information about the ontology or networked knowledge to start exploring topics of interest over knowledge graphs, including the retrieval and detailed exploration of prototypical sub-graphs and their instances. Existing systems either focus on direct graph explorations or do not foster further exploration of the result set. We, however, provide a node-based editor that addresses these missing properties of existing systems and supports search over big ontologies with sub-graph instances. Furthermore, OnSET combines efficient and open platforms to deploy the system on commodity hardware.

[IR-3] eST2 Miner – Process Discovery Based on Firing Partial Orders

Link: https://arxiv.org/abs/2504.08372
Authors: Sabine Folz-Weinstein,Christian Rennert,Lisa Luise Mannel,Robin Bergenthum,Wil van der Aalst
Subjects: Databases (cs.DB); Information Retrieval (cs.IR)
*Comments: 16 pages, 5 figures, to be published in 37th International Conference on Advanced Information Systems Engineering

Click to view abstract

Abstract:Process discovery generates process models from event logs. Traditionally, an event log is defined as a multiset of traces, where each trace is a sequence of events. The total order of the events in a sequential trace is typically based on their temporal occurrence. However, real-life processes are partially ordered by nature. Different activities can occur in different parts of the process and, thus, independently of each other. Therefore, the temporal total order of events does not necessarily reflect their causal order, as also causally unrelated events may be ordered in time. Only partial orders allow to express concurrency, duration, overlap, and uncertainty of events. Consequently, there is a growing need for process mining algorithms that can directly handle partially ordered input. In this paper, we combine two well-established and efficient algorithms, the eST Miner from the process mining community and the Firing LPO algorithm from the Petri net community, to introduce the eST^2 Miner. The eST^2 Miner is a process discovery algorithm that can directly handle partially ordered input, gives strong formal guarantees, offers good runtime and excellent space complexity, and can, thus, be used in real-life applications.

Attachments

Click to download today's full paper list