This post lists the latest papers retrieved from Arxiv.org on 2025-04-04, updated automatically and grouped into five areas: NLP, CV, ML, AI, and IR. If you would like to receive the daily list by email, please leave your email address in the comments.
Table of Contents
Overview (2025-04-04)
454 papers were updated today, including:
- Natural Language Processing: 56 papers (Computation and Language (cs.CL))
- Artificial Intelligence: 114 papers (Artificial Intelligence (cs.AI))
- Computer Vision: 110 papers (Computer Vision and Pattern Recognition (cs.CV))
- Machine Learning: 126 papers (Machine Learning (cs.LG))
Natural Language Processing
[NLP-0] Concept Lancet: Image Editing with Compositional Representation Transplant CVPR2025
[Quick Read]: This paper targets a key challenge diffusion models face in image editing: existing methods design representation-manipulation procedures by curating an edit direction in the text embedding or score space, but overestimating the edit strength harms visual consistency while underestimating it fails the editing task. Moreover, different source images may require different edit strengths, and searching for a suitable strength by trial and error is costly. To address this, the paper proposes Concept Lancet (CoLan), a zero-shot, plug-and-play framework for representation manipulation in diffusion-based editing. Its key idea is to decompose the source input in the latent space (text embedding or diffusion score) as a sparse linear combination of collected visual-concept representations, which accurately estimates the presence of each concept in the image to inform the edit, and then to perform a customized concept-transplant process for the given task (replace/add/remove) that imposes the corresponding edit direction. To sufficiently model the concept space, the authors also curate CoLan-150K, a dataset of 150K conceptual representations covering diverse descriptions and scenarios of visual terms and phrases for the latent dictionary. Experiments show that methods equipped with CoLan achieve state-of-the-art editing effectiveness and consistency preservation.
Link: https://arxiv.org/abs/2504.02828
Authors: Jinqi Luo, Tianjiao Ding, Kwan Ho Ryan Chan, Hancheng Min, Chris Callison-Burch, René Vidal
Institution: University of Pennsylvania
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Accepted in CVPR 2025. Project page at this https URL
Abstract:Diffusion models are widely used for image editing tasks. Existing editing methods often design a representation manipulation procedure by curating an edit direction in the text embedding or score space. However, such a procedure faces a key challenge: overestimating the edit strength harms visual consistency while underestimating it fails the editing task. Notably, each source image may require a different editing strength, and it is costly to search for an appropriate strength via trial-and-error. To address this challenge, we propose Concept Lancet (CoLan), a zero-shot plug-and-play framework for principled representation manipulation in diffusion-based image editing. At inference time, we decompose the source input in the latent (text embedding or diffusion score) space as a sparse linear combination of the representations of the collected visual concepts. This allows us to accurately estimate the presence of concepts in each image, which informs the edit. Based on the editing task (replace/add/remove), we perform a customized concept transplant process to impose the corresponding editing direction. To sufficiently model the concept space, we curate a conceptual representation dataset, CoLan-150K, which contains diverse descriptions and scenarios of visual terms and phrases for the latent dictionary. Experiments on multiple diffusion-based image editing baselines show that methods equipped with CoLan achieve state-of-the-art performance in editing effectiveness and consistency preservation.
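For intuition, here is a minimal sketch of the decompose-then-transplant idea, using scikit-learn's `sparse_encode` over a random stand-in concept dictionary. The dictionary, concept indices, and coefficients are all hypothetical; the paper's actual dictionary is built from CoLan-150K and operates in the diffusion model's latent space.

```python
import numpy as np
from sklearn.decomposition import sparse_encode

rng = np.random.default_rng(0)
K, d = 32, 768                                    # hypothetical dictionary size / latent dim
concepts = rng.normal(size=(K, d))                # rows = collected concept representations
concepts /= np.linalg.norm(concepts, axis=1, keepdims=True)
source = rng.normal(size=(1, d))                  # latent of the source input

# Decompose the source latent as a sparse linear combination of concepts.
codes = sparse_encode(source, concepts, algorithm="lasso_lars", alpha=0.1)

# "Transplant": move the weight of a source concept onto a target concept,
# e.g. replace concept 3 ("cat", hypothetical) with concept 7 ("dog").
codes[0, 7] += codes[0, 3]
codes[0, 3] = 0.0
edited = codes @ concepts                         # edited latent, fed back to the editor
```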
[NLP-1] Generative Evaluation of Complex Reasoning in Large Language Models
[Quick Read]: This paper asks whether Large Language Models (LLMs) genuinely reason or merely recall answers from their massive, web-scraped training sets. Because publicly released benchmarks inevitably become contaminated once incorporated into later LLM training sets, their reliability as faithful assessments is undermined.
The key to the solution is KUMO, a generative evaluation framework designed specifically for assessing LLM reasoning. KUMO synergistically combines LLMs with symbolic engines to dynamically produce diverse, multi-turn reasoning tasks that are partially observable and adjustable in difficulty. Through an automated pipeline, KUMO continuously generates novel tasks across open-ended domains, compelling models to demonstrate genuine generalization rather than memorization. This sidesteps the limitations of existing benchmarks and provides a robust, enduring tool for assessing genuine LLM reasoning.
Link: https://arxiv.org/abs/2504.02810
Authors: Haowei Lin, Xiangyu Wang, Ruilin Yan, Baizhou Huang, Haotian Ye, Jianhua Zhu, Zihao Wang, James Zou, Jianzhu Ma, Yitao Liang
Institutions: Institute for Artificial Intelligence, Peking University, Beijing, China; Wangxuan Institute of Computer Technology, Peking University, Beijing, China; Department of Electronic Engineering, Tsinghua University, Beijing, China; Institute for AI Industry Research, Tsinghua University, Beijing, China
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:
Abstract:With powerful large language models (LLMs) demonstrating superhuman reasoning capabilities, a critical question arises: Do LLMs genuinely reason, or do they merely recall answers from their extensive, web-scraped training datasets? Publicly released benchmarks inevitably become contaminated once incorporated into subsequent LLM training sets, undermining their reliability as faithful assessments. To address this, we introduce KUMO, a generative evaluation framework designed specifically for assessing reasoning in LLMs. KUMO synergistically combines LLMs with symbolic engines to dynamically produce diverse, multi-turn reasoning tasks that are partially observable and adjustable in difficulty. Through an automated pipeline, KUMO continuously generates novel tasks across open-ended domains, compelling models to demonstrate genuine generalization rather than memorization. We evaluated 23 state-of-the-art LLMs on 5,000 tasks across 100 domains created by KUMO, benchmarking their reasoning abilities against university students. Our findings reveal that many LLMs have outperformed university-level performance on easy reasoning tasks, and reasoning-scaled LLMs reach university-level performance on complex reasoning challenges. Moreover, LLM performance on KUMO tasks correlates strongly with results on newly released real-world reasoning benchmarks, underscoring KUMO’s value as a robust, enduring assessment tool for genuine LLM reasoning capabilities.
[NLP-2] MegaMath: Pushing the Limits of Open Math Corpora
[Quick Read]: Mathematical reasoning is a key benchmark for advanced Large Language Model (LLM) capabilities, yet the research community lacks an open, large-scale, high-quality corpus tailored to math-centric pre-training. The key to the solution is the MegaMath dataset, built via three strategies: (1) revisiting web data, re-extracting mathematical documents from Common Crawl with math-oriented HTML optimizations, fastText-based filtering, and deduplication to raise data quality; (2) recalling math-related code, identifying high-quality mathematical code from the large code corpus Stack-V2 to increase data diversity; and (3) synthesizing data, generating QA-style text, math-related code, and interleaved text-code blocks. By integrating these strategies and validating them through extensive ablations, MegaMath delivers 371B tokens, the largest quantity and top quality among existing open math pre-training datasets.
Link: https://arxiv.org/abs/2504.02807
Authors: Fan Zhou, Zengzhi Wang, Nikhil Ranjan, Zhoujun Cheng, Liping Tang, Guowei He, Zhengzhong Liu, Eric P. Xing
Institution: MBZUAI
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 26 pages, 15 figures, 22 tables
Abstract:Mathematical reasoning is a cornerstone of human intelligence and a key benchmark for advanced capabilities in large language models (LLMs). However, the research community still lacks an open, large-scale, high-quality corpus tailored to the demands of math-centric LLM pre-training. We present MegaMath, an open dataset curated from diverse, math-focused sources through following practices: (1) Revisiting web data: We re-extracted mathematical documents from Common Crawl with math-oriented HTML optimizations, fasttext-based filtering and deduplication, all for acquiring higher-quality data on the Internet. (2) Recalling Math-related code data: We identified high quality math-related code from large code training corpus, Stack-V2, further enhancing data diversity. (3) Exploring Synthetic data: We synthesized QA-style text, math-related code, and interleaved text-code blocks from web data or code data. By integrating these strategies and validating their effectiveness through extensive ablations, MegaMath delivers 371B tokens with the largest quantity and top quality among existing open math pre-training datasets.
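As a rough illustration of the filtering step, the sketch below combines a fastText classifier with exact-hash deduplication. The model file name, label scheme, and threshold are assumptions for illustration; MegaMath's actual pipeline (HTML optimization, dedup strategy) is more involved.

```python
import hashlib
import fasttext  # pip install fasttext

# Hypothetical binary classifier with labels __label__math / __label__other.
model = fasttext.load_model("math_classifier.bin")
seen = set()

def keep_document(text: str, threshold: float = 0.8) -> bool:
    digest = hashlib.md5(text.encode("utf-8")).hexdigest()
    if digest in seen:           # exact-duplicate removal
        return False
    seen.add(digest)
    # fastText expects single-line input; it returns (labels, probabilities).
    labels, probs = model.predict(text.replace("\n", " "))
    return labels[0] == "__label__math" and probs[0] >= threshold
```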
[NLP-3] A Survey of Large Language Models in Mental Health Disorder Detection on Social Media
[Quick Read]: This survey examines how Large Language Models (LLMs) can detect mental health issues on social media, covering common disorders such as depression and anxiety as well as psychotic and externalizing disorders. The key contribution is a systematic review of LLM applications to social media data analysis, summarizing application methods across dimensions such as text analysis and disorder detection, and identifying the major challenges and shortcomings of current research. The paper also provides an overview of popular public datasets and evaluation metrics. Overall, it offers a comprehensive frame of reference for mental health researchers and demonstrates the strong potential of LLMs for mental health detection, facilitating their further use in future mental health interventions.
Link: https://arxiv.org/abs/2504.02800
Authors: Zhuohan Ge (1), Nicole Hu (2), Darian Li (1), Yubo Wang (3), Shihao Qi (1), Yuming Xu (1), Han Shi (3), Jason Zhang (1) ((1) The Hong Kong Polytechnic University, (2) The Chinese University of Hong Kong, (3) Hong Kong University of Science and Technology)
Institutions: The Hong Kong Polytechnic University; The Chinese University of Hong Kong; Hong Kong University of Science and Technology
Subjects: Computation and Language (cs.CL)
Comments: 13 pages, 4 figures
Abstract:The detection and intervention of mental health issues represent a critical global research focus, and social media data has been recognized as an important resource for mental health research. However, how to utilize Large Language Models (LLMs) for mental health problem detection on social media poses significant challenges. Hence, this paper aims to explore the potential of LLM applications in social media data analysis, focusing not only on the most common psychological disorders such as depression and anxiety but also incorporating psychotic disorders and externalizing disorders, summarizing the application methods of LLM from different dimensions, such as text data analysis and detection of mental disorders, and revealing the major challenges and shortcomings of current research. In addition, the paper provides an overview of popular datasets, and evaluation metrics. The survey in this paper provides a comprehensive frame of reference for researchers in the field of mental health, while demonstrating the great potential of LLMs in mental health detection to facilitate the further application of LLMs in future mental health interventions.
[NLP-4] A Framework for Situating Innovations, Opportunities, and Challenges in Advancing Vertical Systems with Large AI Models
[Quick Read]: This paper addresses the notable limitations large AI models exhibit when deployed in high-stakes verticals such as healthcare, education, and law: brittleness to minor variations in input data, contextually uninformed decisions in critical settings, and erosion of user trust from confidently produced or reproduced inaccuracies. The key contribution is a layer-wise abstraction framework that modularizes innovations to meet users' real-world requirements and turn large models into practical "vertical systems". Beyond optimizing model capabilities, the framework emphasizes the dynamism between its layers and guides researchers and practitioners in situating their innovations, uncovering overlooked opportunities, and facilitating cross-disciplinary communication and collaboration.
Link: https://arxiv.org/abs/2504.02793
Authors: Gaurav Verma, Jiawei Zhou, Mohit Chandra, Srijan Kumar, Munmun De Choudhury
Institution: College of Computing, Georgia Institute of Technology, Atlanta, Georgia, USA
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)
Comments: pre-print; 7 pages of main content, 1 figure, 1 table
Abstract:Large artificial intelligence (AI) models have garnered significant attention for their remarkable, often “superhuman”, performance on standardized benchmarks. However, when these models are deployed in high-stakes verticals such as healthcare, education, and law, they often reveal notable limitations. For instance, they exhibit brittleness to minor variations in input data, present contextually uninformed decisions in critical settings, and undermine user trust by confidently producing or reproducing inaccuracies. These challenges in applying large models necessitate cross-disciplinary innovations to align the models’ capabilities with the needs of real-world applications. We introduce a framework that addresses this gap through a layer-wise abstraction of innovations aimed at meeting users’ requirements with large models. Through multiple case studies, we illustrate how researchers and practitioners across various fields can operationalize this framework. Beyond modularizing the pipeline of transforming large models into useful “vertical systems”, we also highlight the dynamism that exists within different layers of the framework. Finally, we discuss how our framework can guide researchers and practitioners to (i) optimally situate their innovations (e.g., when vertical-specific insights can empower broadly impactful vertical-agnostic innovations), (ii) uncover overlooked opportunities (e.g., spotting recurring problems across verticals to develop practically useful foundation models instead of chasing benchmarks), and (iii) facilitate cross-disciplinary communication of critical challenges (e.g., enabling a shared vocabulary for AI developers, domain experts, and human-computer interaction scholars).
[NLP-5] A Framework for Robust Cognitive Evaluation of LLMs
Link: https://arxiv.org/abs/2504.02789
Authors: Karin de Langis, Jong Inn Park, Bin Hu, Khanh Chi Le, Andreas Schramm, Michael C. Mensink, Andrew Elfenbein, Dongyeop Kang
Institution: Unknown
Subjects: Computation and Language (cs.CL)
Comments:
[NLP-6] MultiBLiMP 1.0: A Massively Multilingual Benchmark of Linguistic Minimal Pairs
[Quick Read]: This paper targets the weakness of multilingual LLMs on low-resource languages by building MultiBLiMP 1.0, a benchmark of linguistic minimal pairs covering 101 languages, 6 linguistic phenomena, and more than 125,000 minimal pairs. The key to the solution is a fully automated pipeline that generates minimal pairs from the large-scale linguistic resources of Universal Dependencies and UniMorph, enabling evaluation of LLMs at an unprecedented multilingual scale and exposing the shortcomings of current state-of-the-art models in modelling low-resource languages.
Link: https://arxiv.org/abs/2504.02768
Authors: Jaap Jumelet, Leonie Weissweiler, Arianna Bisazza
Institutions: University of Groningen; The University of Texas at Austin; University of Groningen
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:We introduce MultiBLiMP 1.0, a massively multilingual benchmark of linguistic minimal pairs, covering 101 languages, 6 linguistic phenomena and containing more than 125,000 minimal pairs. Our minimal pairs are created using a fully automated pipeline, leveraging the large-scale linguistic resources of Universal Dependencies and UniMorph. MultiBLiMP 1.0 evaluates abilities of LLMs at an unprecedented multilingual scale, and highlights the shortcomings of the current state-of-the-art in modelling low-resource languages.
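Minimal-pair benchmarks like this are typically scored by checking whether a language model assigns higher probability to the grammatical member of each pair. A small sketch with Hugging Face Transformers (gpt2 as a stand-in model; the example pair is ours, not drawn from MultiBLiMP):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

@torch.no_grad()
def sentence_logprob(sentence: str) -> float:
    ids = tok(sentence, return_tensors="pt").input_ids
    out = model(input_ids=ids, labels=ids)
    # out.loss is the mean per-token NLL over the shifted sequence;
    # scale back to a total log-probability for a fair pair comparison.
    return -out.loss.item() * (ids.shape[1] - 1)

grammatical = "The keys to the cabinet are on the table."
ungrammatical = "The keys to the cabinet is on the table."
print(sentence_logprob(grammatical) > sentence_logprob(ungrammatical))  # expect True
```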
[NLP-7] Enhancing LLM Robustness to Perturbed Instructions: An Empirical Study ICLR2025
[Quick Read]: This paper addresses the brittleness of Large Language Models (LLMs) under input perturbations, in particular character- and word-level edits to task-level instructions that substantially degrade downstream performance. Existing work focuses mainly on perturbed data samples, while robustness to perturbed task-level instructions remains underexplored. The key finding is that self-denoising, whether performed by a frozen pre-trained model or a fine-tuned one, yields substantially higher robustness gains than alternative strategies, including more complex baselines such as ensembling and supervised methods.
Link: https://arxiv.org/abs/2504.02733
Authors: Aryan Agrawal, Lisa Alazraki, Shahin Honarvar, Marek Rei
Institution: Imperial College London
Subjects: Computation and Language (cs.CL)
Comments: Building Trust Workshop, ICLR 2025
Abstract:Large Language Models (LLMs) are highly vulnerable to input perturbations, as even a small prompt change may result in a substantially different output. Existing methods to enhance LLM robustness are primarily focused on perturbed data samples, whereas improving resiliency to perturbations of task-level instructions has remained relatively underexplored. In this work, we focus on character- and word-level edits of task-specific instructions, which substantially degrade downstream performance. We experiment with a variety of techniques to enhance the robustness of LLMs, including self-denoising and representation alignment, testing different models (Llama 3 and Flan-T5), datasets (CoLA, QNLI, SST-2) and instructions (both task-oriented and role-oriented). We find that, on average, self-denoising – whether performed by a frozen LLM or a fine-tuned model – achieves substantially higher performance gains than alternative strategies, including more complex baselines such as ensembling and supervised methods.
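A minimal sketch of the self-denoising idea: ask the model to first reconstruct the corrupted instruction, then answer with the cleaned version. `llm` is a placeholder for any text-generation call, and the prompts are illustrative, not the paper's.

```python
def self_denoise_and_answer(llm, perturbed_instruction: str, text: str) -> str:
    """`llm` is a placeholder callable: prompt in, completion out."""
    # Step 1: have the (frozen or fine-tuned) model restore the instruction.
    denoised = llm(
        "The instruction below contains character- and word-level corruptions. "
        "Rewrite only the intended instruction:\n" + perturbed_instruction
    )
    # Step 2: execute the task with the denoised instruction.
    return llm(denoised + "\n\nInput: " + text + "\nAnswer:")
```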
[NLP-8] Why do LLMs attend to the first token?
[Quick Read]: This paper asks why Large Language Models (LLMs) learn to attend heavily to the first token of a sequence (the so-called attention sink) and how this pattern is used during training. Prior work has characterized when attention sinks occur and what they affect, but the underlying cause and mechanism remain shallowly understood.
The key contribution is a theoretical and empirical argument that attention sinks give LLMs a way to avoid over-mixing of information, connecting this to existing lines of work that mathematically model how information propagates in Transformers. Experiments validate the theoretical intuitions and analyze how choices such as context length, model depth, and data packing influence sink behaviour, offering a new practical perspective on the attention patterns that form during LLM training.
Link: https://arxiv.org/abs/2504.02732
Authors: Federico Barbero, Álvaro Arroyo, Xiangming Gu, Christos Perivolaropoulos, Michael Bronstein, Petar Veličković, Razvan Pascanu
Institutions: University of Oxford; National University of Singapore; Google DeepMind
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:Large Language Models (LLMs) tend to attend heavily to the first token in the sequence – creating a so-called attention sink. Many works have studied this phenomenon in detail, proposing various ways to either leverage or alleviate it. Attention sinks have been connected to quantisation difficulties, security issues, and streaming attention. Yet, while many works have provided conditions in which they occur or not, a critical question remains shallowly answered: Why do LLMs learn such patterns and how are they being used? In this work, we argue theoretically and empirically that this mechanism provides a method for LLMs to avoid over-mixing, connecting this to existing lines of work that study mathematically how information propagates in Transformers. We conduct experiments to validate our theoretical intuitions and show how choices such as context length, depth, and data packing influence the sink behaviour. We hope that this study provides a new practical perspective on why attention sinks are useful in LLMs, leading to a better understanding of the attention patterns that form during training.
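To see a sink empirically, one can inspect how much attention mass each layer places on position 0. A short Transformers sketch (gpt2 as a stand-in; the paper studies larger LLMs):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

@torch.no_grad()
def first_token_attention(text: str) -> torch.Tensor:
    ids = tok(text, return_tensors="pt").input_ids
    out = model(input_ids=ids, output_attentions=True)
    # out.attentions: one (batch, heads, query, key) tensor per layer;
    # average the mass all queries put on key position 0 (the sink).
    return torch.stack([a[0, :, :, 0].mean() for a in out.attentions])

print(first_token_attention("Attention sinks put most mass on token one."))
```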
[NLP-9] ERPO: Advancing Safety Alignment via Ex-Ante Reasoning Preference Optimization
[Quick Read]: This paper addresses the risk that Large Language Models (LLMs) generate harmful content, and the difficulty existing alignment methods have covering diverse safety scenarios while remaining vulnerable to adversarial attacks. It proposes Ex-Ante Reasoning Preference Optimization (ERPO), a safety-alignment framework whose key idea is to equip models with explicit preemptive reasoning via Chain-of-Thought and to provide clear evidence for safety judgments by embedding predefined safety rules. ERPO proceeds in three stages: supervised fine-tuning (SFT) with a constructed reasoning module to instill ex-ante reasoning; Direct Preference Optimization (DPO) to improve safety, usefulness, and efficiency; and a length-controlled iterative preference-optimization strategy to reduce inference latency. Experiments show that ERPO significantly improves safety while maintaining response efficiency.
Link: https://arxiv.org/abs/2504.02725
Authors: Kehua Feng, Keyan Ding, Jing Yu, Menghan Li, Yuhao Wang, Tong Xu, Xinda Wang, Qiang Zhang, Huajun Chen
Institution: Unknown
Subjects: Computation and Language (cs.CL)
Comments: 18 pages, 5 figures
Abstract:Recent advancements in large language models (LLMs) have accelerated progress toward artificial general intelligence, yet their potential to generate harmful content poses critical safety challenges. Existing alignment methods often struggle to cover diverse safety scenarios and remain vulnerable to adversarial attacks. In this work, we propose Ex-Ante Reasoning Preference Optimization (ERPO), a novel safety alignment framework that equips LLMs with explicit preemptive reasoning through Chain-of-Thought and provides clear evidence for safety judgments by embedding predefined safety rules. Specifically, our approach consists of three stages: first, equipping the model with Ex-Ante reasoning through supervised fine-tuning (SFT) using a constructed reasoning module; second, enhancing safety, usefulness, and efficiency via Direct Preference Optimization (DPO); and third, mitigating inference latency with a length-controlled iterative preference optimization strategy. Experiments on multiple open-source LLMs demonstrate that ERPO significantly enhances safety performance while maintaining response efficiency.
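ERPO's second stage applies Direct Preference Optimization. For reference, here is a generic sketch of the standard DPO objective it builds on (not ERPO's own code; the per-response log-probabilities are assumed precomputed):

```python
import torch
import torch.nn.functional as F

def dpo_loss(pi_chosen_logps, pi_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta: float = 0.1):
    """Standard DPO objective. Each argument is the summed log-probability of
    a response under the trained policy (pi_*) or the frozen reference (ref_*)."""
    chosen_ratio = pi_chosen_logps - ref_chosen_logps
    rejected_ratio = pi_rejected_logps - ref_rejected_logps
    # Maximize the margin between preferred and rejected responses.
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()
```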
[NLP-10] The Hidden Space of Safety: Understanding Preference-Tuned LLMs in Multilingual Context ACL2025
[Quick Read]: This paper examines whether alignment tuning of LLMs is effective beyond English, since current alignment methods focus predominantly on English and their generalization to multilingual settings is unclear. The key idea is to systematically analyze distributional shifts in the models' embedding space before and after alignment, using the alignment-induced separation in safety space as a quantitative tool for measuring how alignment enforces safety constraints. Evaluating seven LLMs with balanced toxicity datasets and parallel text-detoxification benchmarks reveals substantial disparities in the latent representation space between high-resource and low-resource languages, underscoring the need for language-specific fine-tuning to ensure fair, reliable, and robust multilingual alignment.
Link: https://arxiv.org/abs/2504.02708
Authors: Nikhil Verma, Manasa Bharadwaj
Institution: LG Toronto AI Research lab
Subjects: Computation and Language (cs.CL)
Comments: 14 pages, 11 Figures, 2 Tables, currently under review at ACL 2025
Abstract:Alignment tuning has enabled large language models to excel in reasoning, instruction-following, and minimizing harmful generations. However, despite their widespread deployment, these models exhibit a monolingual bias, raising concerns about the effectiveness of alignment across languages. Current alignment methods predominantly focus on English, leaving it unclear how alignment mechanism generalize to multilingual settings. To address this, we conduct a systematic analysis of distributional shifts in the embedding space of LLMs before and after alignment, uncovering its impact on model behavior across diverse languages. We leverage the alignment-induced separation in safety space as a quantitative tool to measure how alignment enforces safety constraints. Our study evaluates seven LLMs using balanced toxicity datasets and parallel text-detoxification benchmarks, revealing substantial disparities in the latent representation space between high-resource and low-resource languages. These findings underscore the need for language-specific fine-tuning to ensure fair, reliable and robust multilingual alignment. Our insights provide a foundation for developing truly safe multilingual LLMs, emphasizing the urgency of addressing alignment gaps in underrepresented languages.
[NLP-11] Limitations of Religious Data and the Importance of the Target Domain: Towards Machine Translation for Guinea-Bissau Creole NAACL2025
[Quick Read]: This paper introduces a new dataset for machine translation of Guinea-Bissau Creole (Kiriol), with around 40 thousand sentences parallel to English and Portuguese, predominantly religious (the Bible and Jehovah's Witnesses texts) plus a small amount of general-domain dictionary data, mirroring the typical resource profile of low-resource languages. Training transformer-based models on it, the authors find that adding as few as 300 target-domain sentences substantially improves translation performance, underscoring the value of even small-scale data collection, and that Portuguese-to-Kiriol models perform best on average, which they relate to the morphological complexity of the languages involved and the lexical overlap between creoles and their lexifiers.
Link: https://arxiv.org/abs/2504.02674
Authors: Jacqueline Rowe, Edward Gow-Smith, Mark Hepple
Institutions: University of Edinburgh; University of Sheffield
Subjects: Computation and Language (cs.CL)
Comments: 9 pages, 5 figures, 7 tables. To be published in Proceedings of the 8th Workshop on Technologies for Machine Translation of Low-Resource Languages (NAACL 2025)
Abstract:We introduce a new dataset for machine translation of Guinea-Bissau Creole (Kiriol), comprising around 40 thousand parallel sentences to English and Portuguese. This dataset is made up of predominantly religious data (from the Bible and texts from the Jehovah’s Witnesses), but also a small amount of general domain data (from a dictionary). This mirrors the typical resource availability of many low resource languages. We train a number of transformer-based models to investigate how to improve domain transfer from religious data to a more general domain. We find that adding even 300 sentences from the target domain when training substantially improves the translation performance, highlighting the importance and need for data collection for low-resource languages, even on a small-scale. We additionally find that Portuguese-to-Kiriol translation models perform better on average than other source and target language pairs, and investigate how this relates to the morphological complexity of the languages involved and the degree of lexical overlap between creoles and lexifiers. Overall, we hope our work will stimulate research into Kiriol and into how machine translation might better support creole languages in general.
[NLP-12] LLM for Complex Reasoning Task: An Exploratory Study in Fermi Problems
[Quick Read]: This paper examines the capabilities and limitations of Large Language Models (LLMs) on Fermi Problems (FPs), mathematical tasks that require human-like logical and numerical reasoning. FPs are challenging because they often involve real-world impracticalities or ambiguous concepts that are hard even for humans, and despite AI progress on many reasoning tasks they remain relatively under-explored.
The key is a set of systematic experiments. Using a publicly available FP dataset, the authors first evaluate the overall performance of three advanced LLMs, with prompts designed according to the recently proposed TELeR taxonomy, including a zero-shot scenario. All three models score below 0.5 on fp_score (range 0 to 1), underscoring the inherent difficulty of these reasoning tasks. The authors then split FPs into standard and specific questions, hypothesizing that LLMs do better on standard questions given their clarity and conciseness; comparative experiments confirm this, showing better accuracy and efficiency on standard FPs.
Link: https://arxiv.org/abs/2504.02671
Authors: Zishuo Liu, Carlos Rabat Villarreal, Mostafa Rahgouy, Amit Das, Zheng Zhang, Chang Ren, Dongji Feng
Institutions: MCS department, Gustavus Adolphus College; Department of CSSE, Auburn University; Department of CIS, University of North Alabama; Department of CSIS, Murray State University
Subjects: Computation and Language (cs.CL)
Comments: 7 pages, 7 tables, 5 figures
Abstract:Fermi Problems (FPs) are mathematical reasoning tasks that require human-like logic and numerical reasoning. Unlike other reasoning questions, FPs often involve real-world impracticalities or ambiguous concepts, making them challenging even for humans to solve. Despite advancements in AI, particularly with large language models (LLMs) in various reasoning tasks, FPs remain relatively under-explored. This work conducted an exploratory study to examine the capabilities and limitations of LLMs in solving FPs. We first evaluated the overall performance of three advanced LLMs using a publicly available FP dataset. We designed prompts according to the recently proposed TELeR taxonomy, including a zero-shot scenario. Results indicated that all three LLMs achieved a fp_score (range between 0 - 1) below 0.5, underscoring the inherent difficulty of these reasoning tasks. To further investigate, we categorized FPs into standard and specific questions, hypothesizing that LLMs would perform better on standard questions, which are characterized by clarity and conciseness, than on specific ones. Comparative experiments confirmed this hypothesis, demonstrating that LLMs performed better on standard FPs in terms of both accuracy and efficiency.
[NLP-13] Affordable AI Assistants with Knowledge Graph of Thoughts
[Quick Read]: This paper addresses the high operational costs and low success rates of LLM-driven AI assistants on complex benchmarks such as GAIA. The core of the solution is Knowledge Graph of Thoughts (KGoT), an AI-assistant architecture that integrates LLM reasoning with dynamically constructed knowledge graphs (KGs). KGoT extracts and structures task-relevant knowledge into a dynamic KG representation that is iteratively enhanced through external tools such as math solvers, web crawlers, and Python scripts. This structured representation of task-relevant knowledge enables low-cost models to solve complex tasks effectively: on the GAIA benchmark, KGoT improves task success rates by 29% over Hugging Face Agents with GPT-4o mini while cutting costs by more than 36x compared to GPT-4o.
Link: https://arxiv.org/abs/2504.02670
Authors: Maciej Besta, Lorenzo Paleari, Jia Hao Andrea Jiang, Robert Gerstenberger, You Wu, Patrick Iff, Ales Kubicek, Piotr Nyczyk, Diana Khimey, Jón Gunnar Hannesson, Grzegorz Kwaśniewski, Marcin Copik, Hubert Niewiadomski, Torsten Hoefler
Institution: Unknown
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments:
Abstract:Large Language Models (LLMs) are revolutionizing the development of AI assistants capable of performing diverse tasks across domains. However, current state-of-the-art LLM-driven agents face significant challenges, including high operational costs and limited success rates on complex benchmarks like GAIA. To address these issues, we propose the Knowledge Graph of Thoughts (KGoT), an innovative AI assistant architecture that integrates LLM reasoning with dynamically constructed knowledge graphs (KGs). KGoT extracts and structures task-relevant knowledge into a dynamic KG representation, iteratively enhanced through external tools such as math solvers, web crawlers, and Python scripts. Such structured representation of task-relevant knowledge enables low-cost models to solve complex tasks effectively. For example, KGoT achieves a 29% improvement in task success rates on the GAIA benchmark compared to Hugging Face Agents with GPT-4o mini, while reducing costs by over 36x compared to GPT-4o. Improvements for recent reasoning models are similar, e.g., 36% and 37.5% for Qwen2.5-32B and Deepseek-R1-70B, respectively. KGoT offers a scalable, affordable, and high-performing solution for AI assistants.
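The control flow can be pictured as a loop that grows a task-specific graph until the model judges it sufficient. Below is a schematic sketch with networkx; `llm_extract_triples`, `llm_answer`, and `run_tool` are hypothetical callables standing in for the LLM and tool integrations, not the released implementation.

```python
import networkx as nx

def solve_with_kgot(task, llm_extract_triples, llm_answer, run_tool, max_iters=5):
    kg = nx.MultiDiGraph()
    evidence = task
    for _ in range(max_iters):
        for head, relation, tail in llm_extract_triples(evidence):
            kg.add_edge(head, tail, relation=relation)   # grow the task KG
        answer = llm_answer(task, kg)                    # try to answer from the KG
        if answer is not None:                           # KG judged sufficient
            return answer
        evidence = run_tool(task, kg)                    # gather missing evidence
    return llm_answer(task, kg)
```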
[NLP-14] Efficient Model Editing with Task-Localized Sparse Fine-tuning ICLR2025
[Quick Read]: This paper tackles the computational bottleneck of existing task-arithmetic approaches to model editing, which rely on network linearization during training and inference; linearization alone also fails to guarantee weight disentanglement, the key property enabling conflict-free composition of task vectors. The proposed TaLoS builds sparse task vectors with minimal interference, without explicit linearization or cross-task information sharing. Its key observation is that pre-trained models contain a subset of parameters with consistently low gradient sensitivity across tasks, and that sparsely updating only these parameters promotes weight disentanglement during fine-tuning. Experiments show TaLoS improves training and inference efficiency while outperforming current methods in task addition and negation, fostering practical deployment of adaptable foundation models.
Link: https://arxiv.org/abs/2504.02620
Authors: Leonardo Iurada, Marco Ciccone, Tatiana Tommasi
Institutions: Politecnico di Torino; Vector Institute
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted ICLR 2025 - this https URL
Abstract:Task arithmetic has emerged as a promising approach for editing models by representing task-specific knowledge as composable task vectors. However, existing methods rely on network linearization to derive task vectors, leading to computational bottlenecks during training and inference. Moreover, linearization alone does not ensure weight disentanglement, the key property that enables conflict-free composition of task vectors. To address this, we propose TaLoS which allows to build sparse task vectors with minimal interference without requiring explicit linearization and sharing information across tasks. We find that pre-trained models contain a subset of parameters with consistently low gradient sensitivity across tasks, and that sparsely updating only these parameters allows for promoting weight disentanglement during fine-tuning. Our experiments prove that TaLoS improves training and inference efficiency while outperforming current methods in task addition and negation. By enabling modular parameter editing, our approach fosters practical deployment of adaptable foundation models in real-world applications.
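The core selection step, estimating per-parameter gradient sensitivity on calibration data and keeping only the least-sensitive fraction trainable, might be sketched in PyTorch as follows. This is an interpretation of the described idea; the sensitivity estimate and threshold are our assumptions.

```python
import torch

def talos_mask(model, calibration_batches, loss_fn, keep_ratio=0.1):
    # Accumulate |grad| per parameter as a simple sensitivity estimate.
    sensitivity = {n: torch.zeros_like(p) for n, p in model.named_parameters()}
    for batch in calibration_batches:
        model.zero_grad()
        loss_fn(model, batch).backward()
        for n, p in model.named_parameters():
            if p.grad is not None:
                sensitivity[n] += p.grad.abs()
    masks = {}
    for n, s in sensitivity.items():
        k = max(1, int(keep_ratio * s.numel()))
        threshold = s.flatten().kthvalue(k).values  # k-th smallest sensitivity
        masks[n] = (s <= threshold).float()         # 1 = keep this weight trainable
    return masks

# During fine-tuning, multiply each parameter's gradient by masks[name]
# so that only the low-sensitivity subset is updated.
```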
[NLP-15] Multi-SWE-bench: A Multilingual Benchmark for Issue Resolving
[Quick Read]: This paper addresses the fact that existing issue-resolving benchmarks such as SWE-bench focus almost exclusively on Python, limiting their usefulness for evaluating Large Language Models (LLMs) across diverse software ecosystems. The solution is Multi-SWE-bench, a multilingual issue-resolving benchmark covering Java, TypeScript, JavaScript, Go, Rust, C, and C++, with 1,632 high-quality instances carefully annotated by 68 expert annotators from 2,456 candidates to ensure accurate and reliable evaluation. The key lies in building this high-quality cross-language benchmark and launching the Multi-SWE-RL community, with an open-sourced data-production pipeline and tutorials that encourage continued contribution and expansion, advancing reinforcement-learning (RL) research on issue resolving.
Link: https://arxiv.org/abs/2504.02605
Authors: Daoguang Zan, Zhirong Huang, Wei Liu, Hanwu Chen, Linhao Zhang, Shulin Xin, Lu Chen, Qi Liu, Xiaojian Zhong, Aoyan Li, Siyao Liu, Yongsheng Xiao, Liangqiang Chen, Yuyu Zhang, Jing Su, Tianyu Liu, Rui Long, Kai Shen, Liang Xiang
Institution: ByteDance
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:
Abstract:The task of issue resolving is to modify a codebase to generate a patch that addresses a given issue. However, existing benchmarks, such as SWE-bench, focus almost exclusively on Python, making them insufficient for evaluating Large Language Models (LLMs) across diverse software ecosystems. To address this, we introduce a multilingual issue-resolving benchmark, called Multi-SWE-bench, covering Java, TypeScript, JavaScript, Go, Rust, C, and C++. It includes a total of 1,632 high-quality instances, which were carefully annotated from 2,456 candidates by 68 expert annotators, ensuring that the benchmark can provide an accurate and reliable evaluation. Based on Multi-SWE-bench, we evaluate a series of state-of-the-art models using three representative methods (Agentless, SWE-agent, and OpenHands) and present a comprehensive analysis with key empirical insights. In addition, we launch a Multi-SWE-RL open-source community, aimed at building large-scale reinforcement learning (RL) training datasets for issue-resolving tasks. As an initial contribution, we release a set of 4,723 well-structured instances spanning seven programming languages, laying a solid foundation for RL research in this domain. More importantly, we open-source our entire data production pipeline, along with detailed tutorials, encouraging the open-source community to continuously contribute and expand the dataset. We envision our Multi-SWE-bench and the ever-growing Multi-SWE-RL community as catalysts for advancing RL toward its full potential, bringing us one step closer to the dawn of AGI.
[NLP-16] LinTO Audio and Textual Datasets to Train and Evaluate Automatic Speech Recognition in Tunisian Arabic Dialect
[Quick Read]: This paper addresses the difficulty of building Automatic Speech Recognition (ASR) systems for the Tunisian Arabic Dialect, caused by the dialect's linguistic complexity and the scarcity of annotated speech datasets. The key is the LinTO audio and textual datasets, comprehensive resources that capture the phonological and lexical features of the dialect: texts from numerous sources and real-world audio samples featuring diverse speakers and code-switching between Tunisian Arabic and English or French. By pairing high-quality audio with precise transcriptions, the datasets provide solid material for building and benchmarking ASR systems for the target dialect.
Link: https://arxiv.org/abs/2504.02604
Authors: Hedi Naouara, Jean-Pierre Lorré, Jérôme Louradour
Institution: Unknown
Subjects: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments:
Abstract:Developing Automatic Speech Recognition (ASR) systems for Tunisian Arabic Dialect is challenging due to the dialect’s linguistic complexity and the scarcity of annotated speech datasets. To address these challenges, we propose the LinTO audio and textual datasets – comprehensive resources that capture phonological and lexical features of Tunisian Arabic Dialect. These datasets include a variety of texts from numerous sources and real-world audio samples featuring diverse speakers and code-switching between Tunisian Arabic Dialect and English or French. By providing high-quality audio paired with precise transcriptions, the LinTO audio and textual datasets aim to provide qualitative material to build and benchmark ASR systems for the Tunisian Arabic Dialect. Keywords – Tunisian Arabic Dialect, Speech-to-Text, Low-Resource Languages, Audio Data Augmentation
[NLP-17] LexPam: Legal Procedure Awareness-Guided Mathematical Reasoning
[Quick Read]: This paper addresses the lack of reliable legal mathematical reasoning in legal LLMs applied to real-world scenarios, and the mismatch between open-domain reasoning models and the reasoning logic required in legal settings, compounded by the absence of datasets for validating and improving such reasoning. The key contributions are LexNum, the first Chinese legal mathematical reasoning dataset covering three common scenarios (economic compensation, work-injury compensation, and traffic-accident compensation), and LexPam, a reinforcement-learning algorithm guided by legal procedural awareness for training LLMs. Experiments on the three scenarios show that existing legal LLMs and reasoning models perform unsatisfactorily on these tasks, while LexPam effectively improves them.
Link: https://arxiv.org/abs/2504.02590
Authors: Kepu Zhang, Guofu Xie, Weijie Yu, Mingyue Xu, Xu Tang, Yaxin Li, Jun Xu
Institutions: Gaoling School of Artificial Intelligence, Renmin University of China; University of International Business and Economics
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:The legal mathematical reasoning ability of LLMs is crucial when applying them to real-world scenarios, as it directly affects the credibility of the LLM. While existing legal LLMs can perform general judicial question answering, their legal mathematical reasoning capabilities have not been trained. Open-domain reasoning models, though able to generate detailed calculation steps, do not follow the reasoning logic required for legal scenarios. Additionally, there is currently a lack of legal mathematical reasoning datasets to help validate and enhance LLMs’ reasoning abilities in legal contexts. To address these issues, we propose the first Chinese legal Mathematical Reasoning Dataset, LexNum, which includes three common legal mathematical reasoning scenarios: economic compensation, work injury compensation, and traffic accident compensation. Based on LexNum, we tested the performance of existing legal LLMs and reasoning LLMs, and introduced LexPam, a reinforcement learning algorithm guided by legal procedural awareness to train LLMs, enhancing their mathematical reasoning abilities in legal scenarios. Experiments on tasks in the three legal scenarios show that the performance of existing legal LLMs and reasoning models in legal mathematical reasoning tasks is unsatisfactory. LexPam can enhance the LLM’s ability in these tasks.
[NLP-18] Rethinking RL Scaling for Vision Language Models: A Transparent From-Scratch Framework and Comprehensive Evaluation Scheme
[Quick Read]: This paper addresses the poor reproducibility, missing standardized evaluation protocols, and hard-to-compare results of existing reinforcement learning (RL) applications in vision-language models (VLMs), which typically rely on heavily engineered frameworks. The key is a transparent, from-scratch framework with a minimal yet functional four-step pipeline validated across multiple models and datasets, together with a standardized scheme for assessing training dynamics and reflective behaviors. Beyond establishing a reproducible baseline, the experiments on visual reasoning yield key findings: response length is sensitive to random seeds, reflection correlates with output length, and RL consistently outperforms supervised fine-tuning (SFT) in generalization, even with high-quality data.
Link: https://arxiv.org/abs/2504.02587
Authors: Yan Ma, Steffi Chern, Xuyang Shen, Yiran Zhong, Pengfei Liu
Institutions: Shanghai Jiao Tong University (SJTU); Minimax; Fudan University; SII; Generative Artificial Intelligence Lab (GAIR)
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: Code is public and available at: this https URL
Abstract:Reinforcement learning (RL) has recently shown strong potential in improving the reasoning capabilities of large language models and is now being actively extended to vision-language models (VLMs). However, existing RL applications in VLMs often rely on heavily engineered frameworks that hinder reproducibility and accessibility, while lacking standardized evaluation protocols, making it difficult to compare results or interpret training dynamics. This work introduces a transparent, from-scratch framework for RL in VLMs, offering a minimal yet functional four-step pipeline validated across multiple models and datasets. In addition, a standardized evaluation scheme is proposed to assess training dynamics and reflective behaviors. Extensive experiments on visual reasoning tasks uncover key empirical findings: response length is sensitive to random seeds, reflection correlates with output length, and RL consistently outperforms supervised fine-tuning (SFT) in generalization, even with high-quality data. These findings, together with the proposed framework, aim to establish a reproducible baseline and support broader engagement in RL-based VLM research.
[NLP-19] Reasoning Inconsistencies and How to Mitigate Them in Deep Learning
[Quick Read]: This thesis targets systematic inconsistencies and error patterns in the reasoning of deep learning models, i.e., logical or inferential flaws that can surface as contradictory outputs, failure to generalize across similar tasks, or erroneous conclusions in specific contexts. The key contribution is a set of novel methods to detect, quantify, and mitigate these inconsistencies: two techniques for detecting and quantifying predictive inconsistencies arising from opaque internal procedures in natural-language and image-processing models; a data-efficient sampling method to improve fairness and performance and a synthetic-dataset-generation approach for low-resource scenarios, countering biases in training data; and two techniques for optimizing models on complex reasoning tasks. These methods enhance performance while allowing more faithful and interpretable behavior at inference time, providing a comprehensive framework for improving the robustness, fairness, and interpretability of deep learning models across tasks and modalities.
Link: https://arxiv.org/abs/2504.02577
Authors: Erik Arakelyan
Institution: Unknown
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Logic in Computer Science (cs.LO)
Comments: PhD thesis
Abstract:The recent advancements in Deep Learning models and techniques have led to significant strides in performance across diverse tasks and modalities. However, while the overall capabilities of models show promising growth, our understanding of their internal reasoning processes remains limited, particularly concerning systematic inconsistencies or errors patterns of logical or inferential flaws. These inconsistencies may manifest as contradictory outputs, failure to generalize across similar tasks, or erroneous conclusions in specific contexts. Even detecting and measuring such reasoning discrepancies is challenging, as they may arise from opaque internal procedures, biases and imbalances in training data, or the inherent complexity of the task. Without effective methods to detect, measure, and mitigate these errors, there is a risk of deploying models that are biased, exploitable, or logically unreliable. This thesis aims to address these issues by producing novel methods for deep learning models that reason over knowledge graphs, natural language, and images. The thesis contributes two techniques for detecting and quantifying predictive inconsistencies originating from opaque internal procedures in natural language and image processing models. To mitigate inconsistencies from biases in training data, this thesis presents a data efficient sampling method to improve fairness and performance and a synthetic dataset generation approach in low resource scenarios. Finally, the thesis offers two techniques to optimize the models for complex reasoning tasks. These methods enhance model performance while allowing for more faithful and interpretable exploration and exploitation during inference. Critically, this thesis provides a comprehensive framework to improve the robustness, fairness, and interpretability of deep learning models across diverse tasks and modalities.
[NLP-20] Language Models reach higher Agreement than Humans in Historical Interpretation
[Quick Read]: This paper compares historical annotations produced by humans and by Large Language Models (LLMs), asking how LLMs can support large-scale annotation and quantitative analysis of historical data. The key finding is that LLMs reach higher consensus than humans when interpreting historical facts from short texts, and that their disagreements stem from skipped information or hallucinations rather than personal bias. This opens new paths for digital humanities: exploring historical interpretations across different language models while fostering critical thinking about bias.
Link: https://arxiv.org/abs/2504.02572
Authors: Fabio Celli, Georgios Spathulas
Institution: Unknown
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:This paper compares historical annotations by humans and Large Language Models. The findings reveal that both exhibit some cultural bias, but Large Language Models achieve a higher consensus on the interpretation of historical facts from short texts. While humans tend to disagree on the basis of their personal biases, Large Models disagree when they skip information or produce hallucinations. These findings have significant implications for digital humanities, enabling large-scale annotation and quantitative analysis of historical data. This offers new educational and research opportunities to explore historical interpretations from different Language Models, fostering critical thinking about bias.
[NLP-21] Leveraging LLM For Synchronizing Information Across Multilingual Tables
[Quick Read]: This paper addresses the language imbalance non-English speakers face online, where much information is concentrated in high-resource languages and low-resource-language content in knowledge bases such as Wikipedia is often outdated or incomplete. Rule-based approaches to cross-language synchronization can be effective but struggle with complexity and generalization. The key is to use large language models (LLMs) with zero-shot prompting as a scalable solution for multilingual information synchronization, introducing the Information Updation dataset, which simulates the real-world process of updating outdated Wikipedia tables. Since single-prompt approaches often produce suboptimal results, the authors introduce a task-decomposition strategy that improves coherence and accuracy; it outperforms existing baselines, particularly on Information Updation (1.79%) and Information Addition (20.58%), highlighting the model's strength in dynamically updating and enriching data.
Link: https://arxiv.org/abs/2504.02559
Authors: Siddharth Khincha, Tushar Kataria, Ankita Anand, Dan Roth, Vivek Gupta
Institutions: IIT Guwahati; University of Utah; University of Pennsylvania; Arizona State University
Subjects: Computation and Language (cs.CL)
Comments: 17 Pages, 11 Tables, 2 Figures
Abstract:The vast amount of online information today poses challenges for non-English speakers, as much of it is concentrated in high-resource languages such as English and French. Wikipedia reflects this imbalance, with content in low-resource languages frequently outdated or incomplete. Recent research has sought to improve cross-language synchronization of Wikipedia tables using rule-based methods. These approaches can be effective, but they struggle with complexity and generalization. This paper explores large language models (LLMs) for multilingual information synchronization, using zero-shot prompting as a scalable solution. We introduce the Information Updation dataset, simulating the real-world process of updating outdated Wikipedia tables, and evaluate LLM performance. Our findings reveal that single-prompt approaches often produce suboptimal results, prompting us to introduce a task decomposition strategy that enhances coherence and accuracy. Our proposed method outperforms existing baselines, particularly in Information Updation (1.79%) and Information Addition (20.58%), highlighting the model strength in dynamically updating and enriching data across architectures
[NLP-22] UNDO: Understanding Distillation as Optimization
[Quick Read]: This paper addresses the suboptimal results of standard one-shot knowledge distillation, caused by a mismatch between teacher-generated rationales and the student model's specific learning needs. The key innovation is UNDO (UNderstanding Distillation as Optimization), a framework that iteratively identifies the student's errors and prompts the teacher to refine its explanations accordingly, so each iteration directly targets the student's learning deficiencies with tailored, enhanced rationales. This yields gains of up to 20% over standard one-step distillation on challenging mathematical and commonsense reasoning tasks, and teacher-generated data refined through the iterative process remains effective when applied to different student models. The work reframes knowledge distillation as a dynamic teacher-student interaction that exploits the teacher's iterative refinement for better knowledge transfer.
Link: https://arxiv.org/abs/2504.02521
Authors: Kushal Jain, Piyushi Goyal, Kumar Shridhar
Institutions: UC San Diego; ETH Zurich
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:Knowledge distillation has emerged as an effective strategy for compressing large language models’ (LLMs) knowledge into smaller, more efficient student models. However, standard one-shot distillation methods often produce suboptimal results due to a mismatch between teacher-generated rationales and the student’s specific learning requirements. In this paper, we introduce the UNDO: UNderstanding Distillation as Optimization framework, designed to bridge this gap by iteratively identifying the student’s errors and prompting the teacher to refine its explanations accordingly. Each iteration directly targets the student’s learning deficiencies, motivating the teacher to provide tailored and enhanced rationales that specifically address these weaknesses. Empirical evaluations on various challenging mathematical and commonsense reasoning tasks demonstrate that our iterative distillation method, UNDO, significantly outperforms standard one-step distillation methods, achieving performance gains of up to 20%. Additionally, we show that teacher-generated data refined through our iterative process remains effective even when applied to different student models, underscoring the broad applicability of our approach. Our work fundamentally reframes knowledge distillation as an iterative teacher-student interaction, effectively leveraging dynamic refinement by the teacher for better knowledge distillation.
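The iterative teacher-student loop can be summarized schematically as below; `teacher` and the `student` interface are placeholders for illustration, not the authors' implementation.

```python
def undo_distill(teacher, student, problems, rounds=3):
    """`teacher` maps a prompt to text; `student` exposes train_on() and
    is_correct(); both are hypothetical stand-ins."""
    rationales = {p: teacher(f"Explain step by step: {p}") for p in problems}
    for _ in range(rounds):
        student.train_on(rationales)                       # distill current rationales
        errors = [p for p in problems if not student.is_correct(p)]
        if not errors:
            break
        for p in errors:                                   # teacher targets the failures
            rationales[p] = teacher(
                f"Problem: {p}\nPrevious explanation: {rationales[p]}\n"
                "The student still fails this problem. Rewrite the explanation "
                "to address its likely mistake."
            )
    return student
```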
[NLP-23] ZClip: Adaptive Spike Mitigation for LLM Pre-Training
[Quick Read]: This paper targets the gradient instability and loss spikes that can drive LLM training into catastrophic divergence, forcing costly checkpoint restoration and data-batch skipping. Traditional gradient clipping (constant or norm-based) relies on fixed thresholds or heuristics, making it inefficient and requiring frequent manual intervention. The key is ZClip, an adaptive gradient-clipping algorithm that dynamically adjusts the clipping threshold from the statistical properties of gradient norms over time, without prior assumptions about their scale or temporal evolution. At its core, ZClip uses z-score-based anomaly detection to identify and mitigate large gradient spikes, preventing malignant loss spikes while not interfering with normal convergence.
Link: https://arxiv.org/abs/2504.02507
Authors: Abhay Kumar, Louis Owen, Nilabhra Roy Chowdhury, Fabian Güra
Institution: BluOrion
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments:
Abstract:Training large language models (LLMs) presents numerous challenges, including gradient instability and loss spikes. These phenomena can lead to catastrophic divergence, requiring costly checkpoint restoration and data batch skipping. Traditional gradient clipping techniques, such as constant or norm-based methods, fail to address these issues effectively due to their reliance on fixed thresholds or heuristics, leading to inefficient learning and requiring frequent manual intervention. In this work, we propose ZClip, an adaptive gradient clipping algorithm that dynamically adjusts the clipping threshold based on statistical properties of gradient norms over time. Unlike prior reactive strategies, ZClip proactively adapts to training dynamics without making any prior assumptions on the scale and the temporal evolution of gradient norms. At its core, it leverages z-score-based anomaly detection to identify and mitigate large gradient spikes, preventing malignant loss spikes while not interfering with convergence otherwise. Our code is available at: this https URL.
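A minimal sketch of z-score-based adaptive clipping: track an exponential moving average of the gradient-norm mean and variance, and rescale gradients whenever the current norm's z-score exceeds a threshold. The EMA form and the clip target are assumptions; see the paper's released code for the actual rules.

```python
import torch

class ZClip:
    """Sketch of z-score-based adaptive gradient clipping (assumed EMA statistics)."""

    def __init__(self, z_threshold: float = 2.5, momentum: float = 0.99):
        self.z, self.m = z_threshold, momentum
        self.mean = None
        self.var = None

    @torch.no_grad()
    def __call__(self, model) -> None:
        grads = [p.grad for p in model.parameters() if p.grad is not None]
        norm = torch.norm(torch.stack([g.norm() for g in grads]))
        if self.mean is None:                  # initialize statistics on the first step
            self.mean = norm.clone()
            self.var = torch.ones_like(norm)
        std = self.var.sqrt().clamp_min(1e-6)
        if (norm - self.mean) / std > self.z:  # spike: rescale down to the z-threshold
            target = self.mean + self.z * std
            for g in grads:
                g.mul_(target / norm)
            norm = target
        # update running mean/variance with the (possibly clipped) norm
        self.mean = self.m * self.mean + (1 - self.m) * norm
        self.var = self.m * self.var + (1 - self.m) * (norm - self.mean) ** 2

# usage: loss.backward(); zclip(model); optimizer.step()
```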
[NLP-24] Inference-Time Scaling for Generalist Reward Modeling
[Quick Read]: This paper tackles two challenges of reward modeling (RM) for Large Language Models (LLMs) under reinforcement learning (RL): obtaining accurate reward signals across broad domains beyond verifiable questions or artificial rules, and scaling reward generation effectively with more inference compute. The key is to combine pointwise generative reward modeling (GRM) with Self-Principled Critique Tuning (SPCT), an online RL method that teaches GRMs to adaptively generate principles and accurate critiques, substantially improving the quality and scalability of reward modeling. For effective inference-time scaling, parallel sampling expands compute usage and a meta reward model guides the voting process for better scaling performance. The resulting DeepSeek-GRM models outperform existing methods and models across RM benchmarks.
Link: https://arxiv.org/abs/2504.02495
Authors: Zijun Liu, Peiyi Wang, Runxin Xu, Shirong Ma, Chong Ruan, Peng Li, Yang Liu, Yu Wu
Institution: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Preprint, under review. 42 pages
Abstract:Reinforcement learning (RL) has been widely adopted in post-training for large language models (LLMs) at scale. Recently, the incentivization of reasoning capabilities in LLMs from RL indicates that proper learning methods could enable effective inference-time scalability. A key challenge of RL is to obtain accurate reward signals for LLMs in various domains beyond verifiable questions or artificial rules. In this work, we investigate how to improve reward modeling (RM) with more inference compute for general queries, i.e. the inference-time scalability of generalist RM, and further, how to improve the effectiveness of performance-compute scaling with proper learning methods. For the RM approach, we adopt pointwise generative reward modeling (GRM) to enable flexibility for different input types and potential for inference-time scaling. For the learning method, we propose Self-Principled Critique Tuning (SPCT) to foster scalable reward generation behaviors in GRMs through online RL, to generate principles adaptively and critiques accurately, resulting in DeepSeek-GRM models. Furthermore, for effective inference-time scaling, we use parallel sampling to expand compute usage, and introduce a meta RM to guide voting process for better scaling performance. Empirically, we show that SPCT significantly improves the quality and scalability of GRMs, outperforming existing methods and models in various RM benchmarks without severe biases, and could achieve better performance compared to training-time scaling. DeepSeek-GRM still meets challenges in some tasks, which we believe can be addressed by future efforts in generalist reward systems. The models will be released and open-sourced.
[NLP-25] Cognitive Memory in Large Language Models
[Quick Read]: This survey examines memory mechanisms in Large Language Models (LLMs), emphasizing their importance for context-rich responses, reduced hallucinations, and improved efficiency. It categorizes memory as sensory, short-term, and long-term and details the mechanisms behind each. For text-based memory it covers acquisition (selection and summarization), management (updating, accessing, storing, and resolving conflicts), and utilization (full-text search, SQL queries, semantic search). For KV cache-based memory it focuses on selection methods (rule-based summarization, score-based approaches, special-token embeddings) and compression techniques (low-rank compression, KV merging, multimodal compression), plus management strategies such as offloading and shared attention. Parameter-based methods (LoRA, TTT, MoE) turn memories into model parameters to enhance efficiency, while hidden-state-based methods (chunk mechanisms, recurrent Transformers, the Mamba model) combine RNN hidden states with current techniques to improve long-text processing. Overall, the paper offers a comprehensive analysis of LLM memory mechanisms, highlighting their significance and future research directions.
Link: https://arxiv.org/abs/2504.02441
Authors: Lianlei Shan, Shixian Luo, Zezhou Zhu, Yu Yuan, Yong Wu
Institutions: Intelligent Cloud Group; Li Auto
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 37 pages, 9 figures
Abstract:This paper examines memory mechanisms in Large Language Models (LLMs), emphasizing their importance for context-rich responses, reduced hallucinations, and improved efficiency. It categorizes memory into sensory, short-term, and long-term, with sensory memory corresponding to input prompts, short-term memory processing immediate context, and long-term memory implemented via external databases or structures. The text-based memory section covers acquisition (selection and summarization), management (updating, accessing, storing, and resolving conflicts), and utilization (full-text search, SQL queries, semantic search). The KV cache-based memory section discusses selection methods (regularity-based summarization, score-based approaches, special token embeddings) and compression techniques (low-rank compression, KV merging, multimodal compression), along with management strategies like offloading and shared attention mechanisms. Parameter-based memory methods (LoRA, TTT, MoE) transform memories into model parameters to enhance efficiency, while hidden-state-based memory approaches (chunk mechanisms, recurrent transformers, Mamba model) improve long-text processing by combining RNN hidden states with current methods. Overall, the paper offers a comprehensive analysis of LLM memory mechanisms, highlighting their significance and future research directions.
[NLP-26] Scaling Video-Language Models to 10K Frames via Hierarchical Differential Distillation
[Quick Read]: This paper addresses the high computational cost long-form video imposes on vision-language models (VLMs), where processing extended temporal sequences either loses critical temporal dependencies or dilutes semantic information. The proposed remedy is differential distillation, a principled approach that systematically preserves task-relevant information while suppressing redundancy. On this principle the authors build ViLaMP, whose two key mechanisms are: (1) differential keyframe selection, which maximizes query relevance while maintaining temporal distinctiveness at the frame level; and (2) differential feature merging, which preserves query-salient features of non-keyframes at the patch level. ViLaMP can thus process ultra-long videos of up to 10,000 frames on a single NVIDIA A100 GPU with substantial computational efficiency while maintaining state-of-the-art performance.
Link: https://arxiv.org/abs/2504.02438
Authors: Chuanqi Cheng, Jian Guan, Wei Wu, Rui Yan
Institution: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
Abstract:Long-form video processing fundamentally challenges vision-language models (VLMs) due to the high computational costs of handling extended temporal sequences. Existing token pruning and feature merging methods often sacrifice critical temporal dependencies or dilute semantic information. We introduce differential distillation, a principled approach that systematically preserves task-relevant information while suppressing redundancy. Based on this principle, we develop ViLaMP, a hierarchical video-language model that processes hour-long videos at "mixed precision" through two key mechanisms: (1) differential keyframe selection that maximizes query relevance while maintaining temporal distinctiveness at the frame level and (2) differential feature merging that preserves query-salient features in non-keyframes at the patch level. Hence, ViLaMP retains full information in keyframes while reducing non-keyframes to their most salient features, resembling mixed-precision training. Extensive experiments demonstrate ViLaMP’s superior performance across four video understanding benchmarks, particularly on long-form content. Notably, ViLaMP can process ultra-long videos (up to 10K frames) on a single NVIDIA A100 GPU, achieving substantial computational efficiency while maintaining state-of-the-art performance.
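Differential keyframe selection resembles maximal-marginal-relevance selection over frame features: greedily pick frames that are similar to the query but dissimilar from frames already chosen. A schematic PyTorch sketch follows; the scoring function and weights are our assumptions, not the paper's exact algorithm.

```python
import torch
import torch.nn.functional as F

def differential_keyframe_selection(frame_feats, query_feat, k=16, lam=0.5):
    # frame_feats: (num_frames, dim); query_feat: (dim,)
    frame_feats = F.normalize(frame_feats, dim=-1)
    query_feat = F.normalize(query_feat, dim=-1)
    relevance = frame_feats @ query_feat                 # cosine similarity to the query
    selected = [int(relevance.argmax())]
    while len(selected) < min(k, len(frame_feats)):
        chosen = frame_feats[selected]                   # (|S|, dim)
        redundancy = (frame_feats @ chosen.T).max(dim=1).values
        scores = lam * relevance - (1 - lam) * redundancy
        scores[selected] = float("-inf")                 # forbid re-selection
        selected.append(int(scores.argmax()))
    return sorted(selected)                              # indices in temporal order
```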
[NLP-27] Adapting Large Language Models for Multi-Domain Retrieval-Augmented-Generation
[Quick Read]: This paper addresses two problems of Retrieval-Augmented Generation (RAG) in multi-domain applications: the lack of diverse benchmarks and poor out-of-domain generalization. The key contributions are a diverse benchmark of question-answering tasks from 8 sources covering 13 domains, and a systematic evaluation of how typical RAG fine-tuning strategies generalize out of domain. The findings show that standard fine-tuning fails to generalize effectively, whereas sequence-level distillation with teacher-generated labels improves out-of-domain performance by providing more coherent supervision, offering key guidance for improving multi-domain RAG robustness.
Link: https://arxiv.org/abs/2504.02411
Authors: Alexandre Misrahi, Nadezhda Chirkova, Maxime Louis, Vassilina Nikoulina
Institution: Unknown
Subjects: Computation and Language (cs.CL)
Comments: 25 pages, 8 figures, 21 tables
Abstract:Retrieval-Augmented Generation (RAG) enhances LLM factuality, but multi-domain applications face challenges like lack of diverse benchmarks and poor out-of-domain generalization. The first contribution of this work is to introduce a diverse benchmark comprising a variety of question-answering tasks from 8 sources and covering 13 domains. Our second contribution consists in systematically testing out-of-domain generalization for typical RAG tuning strategies. While our findings reveal that standard fine-tuning fails to generalize effectively, we show that sequence-level distillation with teacher-generated labels improves out-of-domain performance by providing more coherent supervision. Our findings highlight key strategies for improving multi-domain RAG robustness.
[NLP-28] AnesBench: Multi-Dimensional Evaluation of LLM Reasoning in Anesthesiology
[Quick Read]: This paper addresses the underexplored reasoning abilities of large language models (LLMs) in specialized medical fields such as anesthesiology. The key is AnesBench, a cross-lingual benchmark that evaluates anesthesiology-related reasoning at three levels: factual retrieval (System 1), hybrid reasoning (System 1.x), and complex decision-making (System 2). Systematic experiments analyze the factors influencing reasoning performance, including model scale, Chain of Thought (CoT) length, and language transferability, and evaluate training strategies (continuous pre-training, CPT, and supervised fine-tuning, SFT) on a curated anesthesiology dataset, test-time reasoning techniques such as Best-of-N sampling and beam search, and reasoning-enhanced model distillation (DeepSeek-R1). AnesBench, together with the CPT and SFT training datasets and evaluation code, will be publicly released.
Link: https://arxiv.org/abs/2504.02404
Authors: Xiang Feng, Wentao Jiang, Zengmao Wang, Yong Luo, Pingbo Xu, Baosheng Yu, Hua Jin, Bo Du, Jing Zhang
Institutions: School of Computer Science, Wuhan University, China; Department of Anesthesiology, Zhejiang Cancer Hospital, China; Institute of Medicine, Chinese Academy of Sciences, Hangzhou, Zhejiang, China; Lee Kong Chian School of Medicine, Nanyang Technological University, Singapore; Department of Anesthesiology, First People's Hospital of Yunnan Province, China; Kunming University of Science and Technology, China
Subjects: Computation and Language (cs.CL)
Comments: 23 pages, 9 figures
Abstract:The application of large language models (LLMs) in the medical field has gained significant attention, yet their reasoning capabilities in more specialized domains like anesthesiology remain underexplored. In this paper, we systematically evaluate the reasoning capabilities of LLMs in anesthesiology and analyze key factors influencing their performance. To this end, we introduce AnesBench, a cross-lingual benchmark designed to assess anesthesiology-related reasoning across three levels: factual retrieval (System 1), hybrid reasoning (System 1.x), and complex decision-making (System 2). Through extensive experiments, we first explore how model characteristics, including model scale, Chain of Thought (CoT) length, and language transferability, affect reasoning performance. Then, we further evaluate the effectiveness of different training strategies, leveraging our curated anesthesiology-related dataset, including continuous pre-training (CPT) and supervised fine-tuning (SFT). Additionally, we also investigate how the test-time reasoning techniques, such as Best-of-N sampling and beam search, influence reasoning performance, and assess the impact of reasoning-enhanced model distillation, specifically DeepSeek-R1. We will publicly release AnesBench, along with our CPT and SFT training datasets and evaluation code at this https URL.
[NLP-29] DaKultur: Evaluating the Cultural Awareness of Language Models for Danish with Native Speakers NAACL
[Quick Read]: This paper addresses the lack of cultural awareness in Large Language Models (LLMs), which often give anglocentric or inappropriate responses in comparatively lower-resource languages such as Danish. The key is a study in which native speakers prompt models to solve tasks requiring cultural awareness: analysis of 1,038 interactions from 63 demographically diverse participants shows that currently employed automatically translated data is insufficient to train or measure cultural adaptation, while training on native-speaker data can more than double response acceptance rates. The authors release DaKultur, the first native Danish cultural-awareness dataset.
Link: https://arxiv.org/abs/2504.02403
Authors: Max Müller-Eberstein, Mike Zhang, Elisa Bassignana, Peter Brunsgaard Trolle, Rob van der Goot
Institutions: IT University of Copenhagen, Denmark; Aalborg University, Denmark; Pioneer Center for Artificial Intelligence, Denmark
Subjects: Computation and Language (cs.CL); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)
Comments: Accepted at C3NLP at NAACL
Abstract:Large Language Models (LLMs) have seen widespread societal adoption. However, while they are able to interact with users in languages beyond English, they have been shown to lack cultural awareness, providing anglocentric or inappropriate responses for underrepresented language communities. To investigate this gap and disentangle linguistic versus cultural proficiency, we conduct the first cultural evaluation study for the mid-resource language of Danish, in which native speakers prompt different models to solve tasks requiring cultural awareness. Our analysis of the resulting 1,038 interactions from 63 demographically diverse participants highlights open challenges to cultural adaptation: Particularly, how currently employed automatically translated data are insufficient to train or measure cultural adaptation, and how training on native-speaker data can more than double response acceptance rates. We release our study data as DaKultur - the first native Danish cultural awareness dataset.
[NLP-30] Scaling Analysis of Interleaved Speech-Text Language Models
【Quick Read】: Existing scaling analyses suggest that speech language models (SLMs) require far more compute and data than text models, casting doubt on the feasibility of training high-quality SLMs. However, modern SLMs are typically initialized from pre-trained text language models (TextLMs) and trained with speech-text interleaving to enable knowledge transfer. This paper asks whether such interleaved initialization lets SLMs scale more efficiently, i.e., "do interleaved SLMs scale more efficiently than textless SLMs?"
The key to the answer: by training several dozen models and analyzing their scaling trends, the study finds that under this interleaved setup SLMs scale significantly more efficiently with compute, and that their scaling dynamics differ markedly from textless SLMs. In particular, more of the compute budget should be allocated to increasing model size rather than training tokens. The roles of synthetic data and the TextLM model family are also examined; the scaled-up model matches leading models on speech semantic metrics while using less compute and data. Models, samples, and data are open-sourced for verification.
Link: https://arxiv.org/abs/2504.02398
Authors: Gallil Maimon,Michael Hassid,Amit Roth,Yossi Adi
Affiliations: Department of Computer Science and Engineering, Hebrew University of Jerusalem
Subjects: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments:
Click to view abstract
Abstract:Existing Speech Language Model (SLM) scaling analysis paints a bleak picture. They predict that SLMs require much more compute and data compared to text, leading some to question the feasibility of training high-quality SLMs. However, modern SLMs are often initialised from pre-trained TextLMs using speech-text interleaving to allow knowledge transfer. This raises the question - Do interleaved SLMs scale more efficiently than textless-SLMs? In this paper we answer a resounding, yes! We conduct scaling analysis of interleaved SLMs by training several dozen and analysing the scaling trends. We see that under this setup SLMs scale more efficiently with compute. Additionally, our results indicate that the scaling-dynamics are significantly different than textless-SLMs, suggesting one should allocate notably more of the compute budget for increasing model size over training tokens. We also study the role of synthetic data and TextLM model families in unlocking this potential. Results suggest, that our scaled up model achieves comparable performance with leading models on speech semantic metrics while using less compute and data than other approaches. We open source models, samples, and data - this https URL.
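As a rough illustration of how compute-allocation conclusions like these are drawn, the sketch below fits a Chinchilla-style loss surface L(N, D) = E + A*N^(-alpha) + B*D^(-beta) over trained models and compares the fitted exponents. The functional form and the toy numbers are assumptions for illustration, not the paper's actual parametrization or data.

```python
import numpy as np
from scipy.optimize import curve_fit

def loss_surface(X, E, A, alpha, B, beta):
    N, D = X                      # model parameters and training tokens
    return E + A * N ** (-alpha) + B * D ** (-beta)

# toy grid of (params, tokens); losses are generated from a known surface so
# this sketch is guaranteed to be well-posed
N = np.array([1e8, 3e8, 1e9, 1e8, 3e8, 1e9])
D = np.array([2e9, 2e9, 2e9, 2e10, 2e10, 2e10])
true = (1.9, 120.0, 0.28, 95.0, 0.26)
L = loss_surface((N, D), *true)

popt, _ = curve_fit(loss_surface, (N, D), L, p0=[2.0, 100.0, 0.25, 100.0, 0.25])
E, A, alpha, B, beta = popt
# If the fitted alpha (returns to model size) dominates beta (returns to data),
# the compute-optimal recipe shifts toward larger models, which matches the
# paper's qualitative finding for interleaved SLMs.
```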
[NLP-31] The quasi-semantic competence of LLMs: a case study on the part-whole relation
【Quick Read】: This paper investigates how well large language models (LLMs) understand the part-whole relation (meronymy), which plays a crucial role in lexical organization but has been little studied. The key is a set of methods spanning three levels of analysis: first, behavioral testing via prompting, directly querying models' knowledge of part-whole relations; second, sentence probability scoring, testing whether models can discriminate correct from incorrect part-whole relations; and third, concept representation analysis in vector space, verifying the linear organization of the part-whole concept. These analyses reveal that LLMs' grasp of the relation is only partial and shallow, a "quasi-semantic" competence that falls short of capturing deep inferential properties.
Link: https://arxiv.org/abs/2504.02395
Authors: Mattia Proietti,Alessandro Lenci
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments:
Click to view abstract
Abstract:Understanding the extent and depth of the semantic competence of Large Language Models (LLMs) is at the center of the current scientific agenda in Artificial Intelligence (AI) and Computational Linguistics (CL). We contribute to this endeavor by investigating their knowledge of the part-whole relation, a.k.a. meronymy, which plays a crucial role in lexical organization, but it is significantly understudied. We used data from ConceptNet relations (Speer et al., 2016) and human-generated semantic feature norms (McRae et al., 2005) to explore the abilities of LLMs to deal with part-whole relations. We employed several methods based on three levels of analysis: i) behavioral testing via prompting, where we directly queried the models on their knowledge of meronymy, ii) sentence probability scoring, where we tested models' abilities to discriminate correct (real) and incorrect (asymmetric counterfactual) part-whole relations, and iii) concept representation analysis in vector space, where we proved the linear organization of the part-whole concept in the embedding and unembedding spaces. These analyses present a complex picture that reveals that the LLMs' knowledge of this relation is only partial. They have just a "quasi-semantic" competence and still fall short of capturing deep inferential properties.
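The sentence probability scoring in analysis ii) reduces to comparing sequence log-likelihoods under a causal LM. A minimal sketch follows; the model choice and example sentences are placeholders, not the paper's materials.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")      # placeholder model
lm = AutoModelForCausalLM.from_pretrained("gpt2").eval()

@torch.no_grad()
def sentence_logprob(text: str) -> float:
    ids = tok(text, return_tensors="pt").input_ids
    out = lm(ids, labels=ids)                    # HF averages token NLL in out.loss
    return -out.loss.item() * (ids.shape[1] - 1) # total log-prob of the sequence

real = sentence_logprob("A wheel is part of a car.")
counterfactual = sentence_logprob("A car is part of a wheel.")
# A model with knowledge of meronymy should assign real > counterfactual.
```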
[NLP-32] LearNAT: Learning NL2SQL with AST-guided Task Decomposition for Large Language Models
【Quick Read】: This paper targets the weak performance of existing open-source large language models (LLMs) on complex queries in the natural-language-to-SQL (NL2SQL) task. While prompt-engineering approaches work well with closed-source LLMs, open-source models typically require fine-tuning to acquire domain knowledge, and they struggle with complex queries because user intent is expressed indirectly and a semantic gap separates user queries from database schemas.
The key to the proposed solution, LearNAT (Learning NL2SQL with AST-guided Task Decomposition), is a task-decomposition framework guided by Abstract Syntax Trees (ASTs) combined with reinforcement-learning optimization to improve open-source LLMs on complex NL2SQL tasks. Its core innovations are: (1) an AST-guided decomposition synthesis procedure that steers efficient search and pruning strategies; (2) margin-aware reinforcement learning, which performs fine-grained step-level optimization via Direct Preference Optimization (DPO) with AST margins; and (3) an adaptive demonstration reasoning module that dynamically selects relevant examples to strengthen decomposition. Experiments on the Spider and BIRD benchmarks show that LearNAT enables a 7B-parameter open-source LLM to approach GPT-4 while offering better efficiency and accessibility.
Link: https://arxiv.org/abs/2504.02327
Authors: Weibin Liao,Xin Gao,Tianyu Jia,Rihong Qiu,Yifan Zhu,Yang Lin,Xu Chu,Junfeng Zhao,Yasha Wang
Affiliations: Peking University; Beijing University of Posts and Telecommunications
Subjects: Computation and Language (cs.CL)
Comments:
Click to view abstract
Abstract:Natural Language to SQL (NL2SQL) has emerged as a critical task for enabling seamless interaction with databases. Recent advancements in Large Language Models (LLMs) have demonstrated remarkable performance in this domain. However, existing NL2SQL methods predominantly rely on closed-source LLMs leveraging prompt engineering, while open-source models typically require fine-tuning to acquire domain-specific knowledge. Despite these efforts, open-source LLMs struggle with complex NL2SQL tasks due to the indirect expression of user query objectives and the semantic gap between user queries and database schemas. Inspired by the application of reinforcement learning in mathematical problem-solving to encourage step-by-step reasoning in LLMs, we propose LearNAT (Learning NL2SQL with AST-guided Task Decomposition), a novel framework that improves the performance of open-source LLMs on complex NL2SQL tasks through task decomposition and reinforcement learning. LearNAT introduces three key components: (1) a Decomposition Synthesis Procedure that leverages Abstract Syntax Trees (ASTs) to guide efficient search and pruning strategies for task decomposition, (2) Margin-aware Reinforcement Learning, which employs fine-grained step-level optimization via DPO with AST margins, and (3) Adaptive Demonstration Reasoning, a mechanism for dynamically selecting relevant examples to enhance decomposition capabilities. Extensive experiments on two benchmark datasets, Spider and BIRD, demonstrate that LearNAT enables a 7B-parameter open-source LLM to achieve performance comparable to GPT-4, while offering improved efficiency and accessibility.
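A schematic of what a margin-aware DPO objective can look like, in the spirit of LearNAT's step-level optimization: the standard DPO preference gap is offset by an AST-derived margin before the sigmoid. How the margin is computed from ASTs is specific to the paper; here it is simply an input tensor, so this is a sketch of the loss shape, not the paper's exact formula.

```python
import torch
import torch.nn.functional as F

def dpo_margin_loss(logp_chosen, logp_rejected,
                    ref_logp_chosen, ref_logp_rejected,
                    ast_margin, beta: float = 0.1):
    """All arguments are tensors of summed token log-probs per example."""
    pi_gap = logp_chosen - logp_rejected          # policy preference gap
    ref_gap = ref_logp_chosen - ref_logp_rejected # reference preference gap
    # a larger ast_margin demands a wider preference gap before the loss saturates
    return -F.logsigmoid(beta * (pi_gap - ref_gap) - ast_margin).mean()
```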
[NLP-33] CoTAL: Human-in-the-Loop Prompt Engineering, Chain-of-Thought Reasoning, and Active Learning for Generalizable Formative Assessment Scoring
【Quick Read】: This paper tackles the limited generalizability of prompt-based methods (such as chain-of-thought prompting) when LLMs score curriculum assessments across multiple domains (science, computing, and engineering). The key is Chain-of-Thought Prompting + Active Learning (CoTAL), which combines Evidence-Centered Design (ECD) principles to develop curriculum-aligned formative assessments and rubrics, applies human-in-the-loop prompt engineering for automated scoring, and uses teacher and student feedback to iteratively refine assessment questions, rubrics, and the LLM prompts used for automated grading. Results show CoTAL improves GPT-4's scoring performance by up to 24.5% over a non-prompt-engineered baseline; both teachers and students found CoTAL effective at scoring and explaining student responses and offered valuable suggestions for further improving grading accuracy and explanation quality.
Link: https://arxiv.org/abs/2504.02323
Authors: Clayton Cohn,Nicole Hutchins,Ashwin T S,Gautam Biswas
Affiliations: Department of Computer Science, Vanderbilt University; College of Education, University of Florida
Subjects: Computation and Language (cs.CL)
Comments: Submitted to IEEE Transactions on Learning Technologies. Currently under review
Click to view abstract
Abstract:Large language models (LLMs) have created new opportunities to assist teachers and support student learning. Methods such as chain-of-thought (CoT) prompting enable LLMs to grade formative assessments in science, providing scores and relevant feedback to students. However, the extent to which these methods generalize across curricula in multiple domains (such as science, computing, and engineering) remains largely untested. In this paper, we introduce Chain-of-Thought Prompting + Active Learning (CoTAL), an LLM-based approach to formative assessment scoring that (1) leverages Evidence-Centered Design (ECD) principles to develop curriculum-aligned formative assessments and rubrics, (2) applies human-in-the-loop prompt engineering to automate response scoring, and (3) incorporates teacher and student feedback to iteratively refine assessment questions, grading rubrics, and LLM prompts for automated grading. Our findings demonstrate that CoTAL improves GPT-4’s scoring performance, achieving gains of up to 24.5% over a non-prompt-engineered baseline. Both teachers and students view CoTAL as effective in scoring and explaining student responses, each providing valuable refinements to enhance grading accuracy and explanation quality.
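The automated-scoring step can be prototyped as a single CoT-style rubric prompt. The sketch below uses the OpenAI Python client; the model name, rubric, and prompt wording are placeholders, and CoTAL itself wraps a call like this in iterative human-in-the-loop refinement.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def grade(question: str, rubric: str, student_response: str,
          model: str = "gpt-4o") -> str:
    prompt = (
        f"Question:\n{question}\n\nRubric:\n{rubric}\n\n"
        f"Student response:\n{student_response}\n\n"
        "Reason step by step against each rubric criterion, then output a line "
        "'Score: <points>' followed by a short feedback paragraph for the student."
    )
    out = client.chat.completions.create(
        model=model, messages=[{"role": "user", "content": prompt}]
    )
    return out.choices[0].message.content
```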
[NLP-34] Improving Harmful Text Detection with Joint Retrieval and External Knowledge
【Quick Read】: This paper addresses the insufficient accuracy and robustness of harmful text detection in the development and deployment of large language models, especially under low-resource training scenarios and multilingual settings. As AI-generated content continues to spread across digital platforms, traditional detection models struggle to capture nuanced harmful content. The key is a joint retrieval framework that integrates pre-trained language models with knowledge graphs, using external contextual information to better capture subtle harmful content. This integration improves detection performance, with notable gains in low-resource and multilingual scenarios, contributing to safer, more trustworthy, and more reliable automated content moderation systems.
Link: https://arxiv.org/abs/2504.02310
Authors: Zidong Yu,Shuo Wang,Nan Jiang,Weiqiang Huang,Xu Han,Junliang Du
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments:
Click to view abstract
Abstract:Harmful text detection has become a crucial task in the development and deployment of large language models, especially as AI-generated content continues to expand across digital platforms. This study proposes a joint retrieval framework that integrates pre-trained language models with knowledge graphs to improve the accuracy and robustness of harmful text detection. Experimental results demonstrate that the joint retrieval approach significantly outperforms single-model baselines, particularly in low-resource training scenarios and multilingual environments. The proposed method effectively captures nuanced harmful content by leveraging external contextual information, addressing the limitations of traditional detection models. Future research should focus on optimizing computational efficiency, enhancing model interpretability, and expanding multimodal detection capabilities to better tackle evolving harmful content patterns. This work contributes to the advancement of AI safety, ensuring more trustworthy and reliable content moderation systems.
[NLP-35] Measurement of LLMs' Philosophies of Human Nature
【Quick Read】: This paper addresses societal concerns about trust in human-AI interaction, focusing on large language models (LLMs). Building on Wrightsman's Philosophies of Human Nature Scale (PHNS), a scale empirically validated over decades, the authors design the Machine-based Philosophies of Human Nature Scale (M-PHNS), a standardized psychological instrument for assessing LLMs' attitudes toward human nature across six dimensions. The study finds that current LLMs exhibit a systemic lack of trust in humans, with a significant negative correlation between a model's intelligence level and its trust in humans. To address this, the paper proposes a mental loop learning framework in which the LLM continuously optimizes its value system through virtual interactions in constructed moral scenarios, thereby improving its attitude toward human nature. Experiments show this approach raises LLMs' trust in humans significantly more than persona or instruction prompts. The work both helps diagnose LLMs' cognitive biases and offers a potential path toward ethical learning in AI; the M-PHNS evaluation code and data are publicly released.
Link: https://arxiv.org/abs/2504.02304
Authors: Minheng Ni,Ennan Wu,Zidong Gong,Zhengyuan Yang,Linjie Li,Chung-Ching Lin,Kevin Lin,Lijuan Wang,Wangmeng Zuo
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments:
Click to view abstract
Abstract:The widespread application of artificial intelligence (AI) in various tasks, along with frequent reports of conflicts or violations involving AI, has sparked societal concerns about interactions with AI systems. Based on Wrightsman’s Philosophies of Human Nature Scale (PHNS), a scale empirically validated over decades to effectively assess individuals’ attitudes toward human nature, we design the standardized psychological scale specifically targeting large language models (LLM), named the Machine-based Philosophies of Human Nature Scale (M-PHNS). By evaluating LLMs’ attitudes toward human nature across six dimensions, we reveal that current LLMs exhibit a systemic lack of trust in humans, and there is a significant negative correlation between the model’s intelligence level and its trust in humans. Furthermore, we propose a mental loop learning framework, which enables LLM to continuously optimize its value system during virtual interactions by constructing moral scenarios, thereby improving its attitude toward human nature. Experiments demonstrate that mental loop learning significantly enhances their trust in humans compared to persona or instruction prompts. This finding highlights the potential of human-based psychological assessments for LLM, which can not only diagnose cognitive biases but also provide a potential solution for ethical learning in artificial intelligence. We release the M-PHNS evaluation code and data at this https URL.
[NLP-36] State-of-the-Art Translation of Text-to-Gloss using mBART : A case study of Bangla
【Quick Read】: This paper addresses the research gap in text-to-gloss translation for Bangla Sign Language (BdSL). Its key contribution is a solution built on a newly constructed dataset: the authors adapt grammatical-rule-based gloss generation (as used for German and American Sign Language) to BdSL, generate synthetic data with an LLM, and apply back-translation and text generation for data augmentation. They then fine-tune pretrained mBART-50 and mBERT models and design a novel sequence-to-sequence model with multi-head attention. Fine-tuned mBART-50 achieves notably high performance on the BdSL dataset (SacreBLEU = 79.53). Further analysis suggests mBART is inherently well suited to text-to-gloss because it was pretrained on shuffled and masked text, and gloss order shares this shuffling property. Testing this hypothesis, the authors fine-tune mBART-50 on the PHOENIX-14T benchmark and obtain state-of-the-art results, surpassing prior methods on all six evaluation metrics. The paper thus proposes an mBART-based paradigm for the text-to-gloss task, aided by a rule-based synthetic dataset.
Link: https://arxiv.org/abs/2504.02293
Authors: Sharif Md. Abdullah,Abhijit Paul,Shebuti Rayana,Ahmedul Kabir,Zarif Masud
Affiliations: IIT, University of Dhaka; SUNY Old Westbury; Toronto Metropolitan University
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Initial Version
Click to view abstract
Abstract:Despite a large deaf and dumb population of 1.7 million, Bangla Sign Language (BdSL) remains an understudied domain. Specifically, there are no works on the Bangla text-to-gloss translation task. To address this gap, we begin by addressing the dataset problem. We take inspiration from grammatical rule based gloss generation used for German and American Sign Language (ASL) and adapt it for BdSL. We also leverage LLMs to generate synthetic data and use back-translation and text generation for data augmentation. With the dataset prepared, we started experimentation. We fine-tuned the pretrained mBART-50 and mBERT-multiclass-uncased models on our dataset. We also trained GRU, RNN, and a novel seq-to-seq model with multi-head attention. We observe significantly high performance (SacreBLEU = 79.53) with fine-tuning the pretrained mBART-50 multilingual model from Facebook. We then explored why we observe such high performance with mBART. We soon noticed an interesting property of mBART: it was trained on shuffled and masked text data. And as we know, the gloss form has this shuffling property. So we hypothesize that mBART is inherently good at text-to-gloss tasks. To find support for this hypothesis, we trained mBART-50 on the PHOENIX-14T benchmark and evaluated it against the existing literature. Our mBART-50 fine-tune demonstrated state-of-the-art performance on the PHOENIX-14T benchmark, far outperforming existing models on all 6 metrics (SacreBLEU = 63.89, BLEU-1 = 55.14, BLEU-2 = 38.07, BLEU-3 = 27.13, BLEU-4 = 20.68, COMET = 0.624). Based on these results, this study proposes a new paradigm for the text-to-gloss task using mBART models. Additionally, our results show that the BdSL text-to-gloss task can greatly benefit from rule-based synthetic datasets.
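Treating gloss as the target text of a seq2seq model makes inference a standard mBART-50 generation call. A sketch using the Hugging Face mBART-50 API follows; the checkpoint path is hypothetical, and using "bn_IN" as both source and forced target language code is an assumption about how a fine-tuned text-to-gloss model would be run.

```python
from transformers import MBartForConditionalGeneration, MBart50TokenizerFast

model_id = "your-org/mbart50-bdsl-text2gloss"   # hypothetical fine-tuned checkpoint
tok = MBart50TokenizerFast.from_pretrained(model_id, src_lang="bn_IN")
model = MBartForConditionalGeneration.from_pretrained(model_id)

def text_to_gloss(sentence: str) -> str:
    batch = tok(sentence, return_tensors="pt")
    out = model.generate(
        **batch,
        max_length=64,
        forced_bos_token_id=tok.lang_code_to_id["bn_IN"],  # gloss stays in Bangla script
    )
    return tok.batch_decode(out, skip_special_tokens=True)[0]
```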
[NLP-37] Advancing Semantic Caching for LLM s with Domain-Specific Embeddings and Synthetic Data
【Quick Read】: This paper addresses the core challenge of balancing precision, query latency, and computational efficiency in semantic caching. The key is to use small, domain-specific embedding models fine-tuned on both real-world and synthetic datasets. The authors also introduce a novel synthetic data generation pipeline for the semantic cache that mitigates the scarcity of domain-specific annotated data and further boosts embedding performance. Compact embedding models fine-tuned for just one epoch significantly surpass state-of-the-art open-source and proprietary alternatives in precision and recall, effectively balancing computational overhead and accuracy and establishing a practical, efficient strategy for real-world semantic caching.
Link: https://arxiv.org/abs/2504.02268
Authors: Waris Gill(1 and 2),Justin Cechmanek(1),Tyler Hutcherson(1),Srijith Rajamohan(1),Jen Agarwal(1),Muhammad Ali Gulzar(2),Manvinder Singh(1),Benoit Dion((1) Redis, (2) Virginia Tech)
Affiliations: Redis; Virginia Tech
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: Initial study on embedding fine-tuning for semantic caching. It also explores synthetic data. Total pages are 12, including references
Click to view abstract
Abstract:This report investigates enhancing semantic caching effectiveness by employing specialized, fine-tuned embedding models. Semantic caching relies on embedding similarity rather than exact key matching, presenting unique challenges in balancing precision, query latency, and computational efficiency. We propose leveraging smaller, domain-specific embedding models, fine-tuned with targeted real-world and synthetically generated datasets. Our empirical evaluations demonstrate that compact embedding models fine-tuned for just one epoch on specialized datasets significantly surpass both state-of-the-art open-source and proprietary alternatives in precision and recall. Moreover, we introduce a novel synthetic data generation pipeline for the semantic cache that mitigates the challenge of limited domain-specific annotated data, further boosting embedding performance. Our approach effectively balances computational overhead and accuracy, establishing a viable and efficient strategy for practical semantic caching implementations.
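A semantic cache itself is a small mechanism: serve a cached answer when a new query's embedding is close enough to a stored one. A minimal sketch follows; the embedder name and similarity threshold are placeholders, and the paper's contribution is precisely fine-tuning such an embedder on domain data so the threshold separates hits from misses more cleanly.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")   # stand-in embedding model
cache: list[tuple[np.ndarray, str]] = []             # (normalized embedding, answer)

def lookup(query: str, threshold: float = 0.85):
    q = embedder.encode(query, normalize_embeddings=True)
    best = max(cache, key=lambda entry: float(q @ entry[0]), default=None)
    if best is not None and float(q @ best[0]) >= threshold:
        return best[1]                               # cache hit: reuse the LLM answer
    return None                                      # miss: call the LLM, then store()

def store(query: str, answer: str) -> None:
    cache.append((embedder.encode(query, normalize_embeddings=True), answer))
```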
[NLP-38] LLMs as Deceptive Agents: How Role-Based Prompting Induces Semantic Ambiguity in Puzzle Tasks
【Quick Read】: This paper studies how large language models (LLMs), acting as autonomous agents in adversarial settings, exploit semantic ambiguity for deceptive behavior. Inspired by the puzzle game "Connections", it examines how an LLM generates misleading puzzles that challenge and confuse human users under zero-shot prompting, role-injected adversarial prompts, and human-crafted examples. The key is to combine computational analysis (using HateBERT to quantify semantic ambiguity) with subjective human evaluation, showing that explicit adversarial agent behavior significantly heightens semantic ambiguity, thereby increasing cognitive load and reducing fairness in puzzle solving. These findings offer insight into the emergent agentic qualities of LLMs and underscore important ethical considerations for evaluating and safely deploying autonomous language systems in educational technology and entertainment.
Link: https://arxiv.org/abs/2504.02254
Authors: Seunghyun Yoo
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 9 pages, 5 figures, 1 table
Click to view abstract
Abstract:Recent advancements in Large Language Models (LLMs) have not only showcased impressive creative capabilities but also revealed emerging agentic behaviors that exploit linguistic ambiguity in adversarial settings. In this study, we investigate how an LLM, acting as an autonomous agent, leverages semantic ambiguity to generate deceptive puzzles that mislead and challenge human users. Inspired by the popular puzzle game “Connections”, we systematically compare puzzles produced through zero-shot prompting, role-injected adversarial prompts, and human-crafted examples, with an emphasis on understanding the underlying agent decision-making processes. Employing computational analyses with HateBERT to quantify semantic ambiguity, alongside subjective human evaluations, we demonstrate that explicit adversarial agent behaviors significantly heighten semantic ambiguity – thereby increasing cognitive load and reducing fairness in puzzle solving. These findings provide critical insights into the emergent agentic qualities of LLMs and underscore important ethical considerations for evaluating and safely deploying autonomous language systems in both educational technologies and entertainment.
[NLP-39] LLM Social Simulations Are a Promising Research Method
【Quick Read】: This position paper asks how large language model (LLM) simulations of human research subjects can become accurate and verifiable enough to realize their promise as an accessible data source for understanding human behavior and training new AI systems; adoption has so far been limited. The key argument is that the promise of LLM social simulations can be achieved by addressing five tractable challenges, with promising directions in prompting, fine-tuning, and complementary methods. The authors contend that, given current LLM capabilities, such simulations can already be used for exploratory research, e.g., pilot experiments in psychology, economics, sociology, and marketing, and recommend that researchers prioritize developing conceptual models and evaluations that can be iteratively deployed and refined at the pace of ongoing AI advances.
Link: https://arxiv.org/abs/2504.02234
Authors: Jacy Reese Anthis,Ryan Liu,Sean M. Richardson,Austin C. Kozlowski,Bernard Koch,James Evans,Erik Brynjolfsson,Michael Bernstein
Affiliations: Unknown
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY)
Comments:
Click to view abstract
Abstract:Accurate and verifiable large language model (LLM) simulations of human research subjects promise an accessible data source for understanding human behavior and training new AI systems. However, results to date have been limited, and few social scientists have adopted these methods. In this position paper, we argue that the promise of LLM social simulations can be achieved by addressing five tractable challenges. We ground our argument in a literature survey of empirical comparisons between LLMs and human research subjects, commentaries on the topic, and related work. We identify promising directions with prompting, fine-tuning, and complementary methods. We believe that LLM social simulations can already be used for exploratory research, such as pilot experiments for psychology, economics, sociology, and marketing. More widespread use may soon be possible with rapidly advancing LLM capabilities, and researchers should prioritize developing conceptual models and evaluations that can be iteratively deployed and refined at pace with ongoing AI advances.
[NLP-40] Subasa – Adapting Language Models for Low-resourced Offensive Language Detection in Sinhala NAACL
【Quick Read】: This paper addresses the large performance gap in offensive language detection between low-resource languages such as Sinhala and high-resource ones. The key innovation is adapting fine-tuning strategies previously unexplored for Sinhala: "Subasa-XLM-R", which incorporates an intermediate pre-finetuning step using Masked Rationale Prediction, plus "Subasa-Llama" and "Subasa-Mistral", task-specific fine-tuned variants of Llama (3.2) and Mistral (v0.3). These models substantially improve Sinhala offensive language detection: on the SOLD benchmark, Subasa-XLM-R achieves the highest Macro F1 score (0.84), surpassing state-of-the-art large language models such as GPT-4o under zero-shot settings.
Link: https://arxiv.org/abs/2504.02178
Authors: Shanilka Haturusinghe,Tharindu Cyril Weerasooriya,Marcos Zampieri,Christopher M. Homan,S.R. Liyanage
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments: Accepted to appear at NAACL SRW 2025
Click to view abstract
Abstract:Accurate detection of offensive language is essential for a number of applications related to social media safety. There is a sharp contrast in performance in this task between low and high-resource languages. In this paper, we adapt fine-tuning strategies that have not been previously explored for Sinhala in the downstream task of offensive language detection. Using this approach, we introduce four models: “Subasa-XLM-R”, which incorporates an intermediate Pre-Finetuning step using Masked Rationale Prediction. Two variants of “Subasa-Llama” and “Subasa-Mistral”, are fine-tuned versions of Llama (3.2) and Mistral (v0.3), respectively, with a task-specific strategy. We evaluate our models on the SOLD benchmark dataset for Sinhala offensive language detection. All our models outperform existing baselines. Subasa-XLM-R achieves the highest Macro F1 score (0.84) surpassing state-of-the-art large language models like GPT-4o when evaluated on the same SOLD benchmark dataset under zero-shot settings. The models and code are publicly available.
[NLP-41] Neural Style Transfer for Synthesising a Dataset of Ancient Egyptian Hieroglyphs
【Quick Read】: This paper addresses the challenge of applying machine learning to low-resource languages with limited training data, such as Ancient Egyptian. The key is a novel application of Neural Style Transfer (NST) to a digital typeface to generate datasets of ancient Egyptian hieroglyphs, overcoming data scarcity. Experiments show that image classification models trained on NST-generated examples and on photographs achieve equal performance and transfer to real unseen hieroglyph images.
Link: https://arxiv.org/abs/2504.02163
Authors: Lewis Matheson Creed
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: 50 pages, 10 figures, Honours Thesis
Click to view abstract
Abstract:The limited availability of training data for low-resource languages makes applying machine learning techniques challenging. Ancient Egyptian is one such language with few resources. However, innovative applications of data augmentation methods, such as Neural Style Transfer, could overcome these barriers. This paper presents a novel method for generating datasets of ancient Egyptian hieroglyphs by applying NST to a digital typeface. Experimental results found that image classification models trained on NST-generated examples and photographs demonstrate equal performance and transferability to real unseen images of hieroglyphs.
[NLP-42] LL4G: Self-Supervised Dynamic Optimization for Graph-Based Personality Detection
【Quick Read】: This paper tackles the difficulty graph-based personality detection faces with sparse or noisy data, and the inability of the static graph structures used by existing methods to capture dynamic relationships between nodes. The key is LL4G, a self-supervised framework that uses large language models (LLMs) to optimize graph neural networks (GNNs). The LLM extracts rich semantic features to generate node representations and infers explicit and implicit inter-node relations, while the graph structure adaptively adds nodes and edges based on the input data, continuously optimizing itself. The GNN is then jointly trained on node reconstruction, edge prediction, and contrastive learning over these optimized representations, integrating semantic and structural information into robust personality profiles. On the Kaggle and Pandora datasets, LL4G outperforms state-of-the-art models.
Link: https://arxiv.org/abs/2504.02146
Authors: Lingzhi Shen,Yunfei Long,Xiaohao Cai,Guanming Chen,Yuhan Wang,Imran Razzak,Shoaib Jameel
Affiliations: University of Southampton; University of Essex; Birkbeck, University of London; Mohamed bin Zayed University of Artificial Intelligence; University of Southampton; University of Southampton
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:
Click to view abstract
Abstract:Graph-based personality detection constructs graph structures from textual data, particularly social media posts. Current methods often struggle with sparse or noisy data and rely on static graphs, limiting their ability to capture dynamic changes between nodes and relationships. This paper introduces LL4G, a self-supervised framework leveraging large language models (LLMs) to optimize graph neural networks (GNNs). LLMs extract rich semantic features to generate node representations and to infer explicit and implicit relationships. The graph structure adaptively adds nodes and edges based on input data, continuously optimizing itself. The GNN then uses these optimized representations for joint training on node reconstruction, edge prediction, and contrastive learning tasks. This integration of semantic and structural information generates robust personality profiles. Experimental results on Kaggle and Pandora datasets show LL4G outperforms state-of-the-art models.
[NLP-43] Towards Interpretable Soft Prompts
【Quick Read】: This paper addresses the trade-off between task performance and interpretability in soft prompts and other trainable prompts. Existing methods do not naturally satisfy the authors' interpretability criteria, built on two desiderata: faithfulness and scrutability. The key is a new theoretical framework for evaluating the interpretability of trainable prompts, together with interpretability-oriented objective functions used to optimize two state-of-the-art prompt tuners, Hard Prompts Made Easy (PEZ) and RLPrompt. Experiments with GPT-2 reveal a fundamental trade-off between interpretability and the task performance of the trainable prompt, exposing the hardness of the soft prompt interpretability problem and the odd behavior that arises when optimizing for an interpretability proxy.
Link: https://arxiv.org/abs/2504.02144
Authors: Oam Patel,Jason Wang,Nikhil Shivakumar Nayak,Suraj Srinivas,Himabindu Lakkaraju
Affiliations: Harvard University
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (stat.ML)
Comments: 9 pages, 8 figures
Click to view abstract
Abstract:Soft prompts have been popularized as a cheap and easy way to improve task-specific LLM performance beyond few-shot prompts. Despite their origin as an automated prompting method, however, soft prompts and other trainable prompts remain a black-box method with no immediately interpretable connections to prompting. We create a novel theoretical framework for evaluating the interpretability of trainable prompts based on two desiderata: faithfulness and scrutability. We find that existing methods do not naturally satisfy our proposed interpretability criterion. Instead, our framework inspires a new direction of trainable prompting methods that explicitly optimizes for interpretability. To this end, we formulate and test new interpretability-oriented objective functions for two state-of-the-art prompt tuners: Hard Prompts Made Easy (PEZ) and RLPrompt. Our experiments with GPT-2 demonstrate a fundamental trade-off between interpretability and the task-performance of the trainable prompt, explicating the hardness of the soft prompt interpretability problem and revealing odd behavior that arises when one optimizes for an interpretability proxy.
[NLP-44] One Pic is All it Takes: Poisoning Visual Document Retrieval Augmented Generation with a Single Image
【Quick Read】: This paper addresses the vulnerability of multimodal retrieval-augmented generation (M-RAG) systems to knowledge base (KB) poisoning, focusing on visual document retrieval applications where the KB contains images of document pages. The key is a universal denial-of-service (DoS) attack: crafting a single image that is retrieved for many different user queries and consistently influences the generative model's output, thereby disrupting the system broadly. The attack exploits properties of embedding models and large multimodal models (LMMs); it proves especially effective against less robust embedding models, while also revealing a fundamental weakness that may limit M-RAG pipelines even in benign settings.
Link: https://arxiv.org/abs/2504.02132
Authors: Ezzeldin Shereen,Dan Ristea,Burak Hasircioglu,Shae McFadden,Vasilios Mavroudis,Chris Hicks
Affiliations: The Alan Turing Institute; University College London; King's College London
Subjects: Computation and Language (cs.CL); Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR)
Comments: 8 pages, 6 figures
Click to view abstract
Abstract:Multimodal retrieval augmented generation (M-RAG) has recently emerged as a method to inhibit hallucinations of large multimodal models (LMMs) through a factual knowledge base (KB). However, M-RAG also introduces new attack vectors for adversaries that aim to disrupt the system by injecting malicious entries into the KB. In this work, we present a poisoning attack against M-RAG targeting visual document retrieval applications, where the KB contains images of document pages. Our objective is to craft a single image that is retrieved for a variety of different user queries, and consistently influences the output produced by the generative model, thus creating a universal denial-of-service (DoS) attack against the M-RAG system. We demonstrate that while our attack is effective against a diverse range of widely-used, state-of-the-art retrievers (embedding models) and generators (LMMs), it can also be ineffective against robust embedding models. Our attack not only highlights the vulnerability of M-RAG pipelines to poisoning attacks, but also sheds light on a fundamental weakness that potentially hinders their performance even in benign settings.
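At its core, the universal-image objective is a gradient optimization: make one image's retriever embedding close to the embeddings of many queries at once. The sketch below is a schematic of that objective only; `image_encoder` stands in for the M-RAG retriever's vision tower, and the paper's full attack additionally controls what the generator produces once the image is retrieved.

```python
import torch
import torch.nn.functional as F

def craft_poison(image_encoder, query_embs, steps: int = 500, lr: float = 0.01):
    """query_embs: (Q, d) tensor of normalized target query embeddings."""
    img = torch.rand(1, 3, 224, 224, requires_grad=True)
    opt = torch.optim.Adam([img], lr=lr)
    for _ in range(steps):
        emb = F.normalize(image_encoder(img.clamp(0, 1)), dim=-1)
        loss = -(emb @ query_embs.T).mean()   # maximize mean cosine similarity
        opt.zero_grad()
        loss.backward()
        opt.step()
    return img.detach().clamp(0, 1)
```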
[NLP-45] Achieving Unanimous Consensus in Decision Making Using Multi-Agents
【Quick Read】: This paper addresses the poor adaptability of existing blockchain consensus mechanisms, such as Proof-of-Work (PoW) and Proof-of-Stake (PoS), to complex decision-making scenarios where individual opinions matter rather than agreement via an honest majority or weighted consensus. The key is a novel deliberation-based consensus mechanism in which large language models (LLMs) act as rational agents that reach unanimous consensus through structured discussion. Combining graded consensus with a multi-round deliberation process, the approach ensures unanimity for definitive problems and graded confidence for prioritized decisions and policies, while maintaining the blockchain properties of consistency, agreement, liveness, and determinism. Experiments demonstrate the system's feasibility, and key challenges of the approach, such as degeneration of thoughts, hallucinations, malicious models and nodes, resource consumption, and scalability, are also discussed.
Link: https://arxiv.org/abs/2504.02128
Authors: Apurba Pokharel,Ram Dantu,Shakila Zaman,Sirisha Talapuru,Vinh Quach
Affiliations: University of North Texas
Subjects: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 11 pages, 9 figures, 3 tables
Click to view abstract
Abstract:Blockchain consensus mechanisms have relied on algorithms such as Proof-of-Work (PoW) and Proof-of-Stake (PoS) to ensure network functionality and integrity. However, these approaches struggle with adaptability for decision-making where the opinions of each matter rather than reaching an agreement based on honest majority or weighted consensus. This paper introduces a novel deliberation-based consensus mechanism where Large Language Models (LLMs) act as rational agents engaging in structured discussions to reach a unanimous consensus. By leveraging graded consensus and a multi-round deliberation process, our approach ensures both unanimous consensus for definitive problems and graded confidence for prioritized decisions and policies. We provide a formalization of our system and use it to show that the properties of blockchains: consistency, agreement, liveness, and determinism are maintained. Moreover, experimental results demonstrate our system’s feasibility, showcasing how our deliberation method’s convergence, block properties, and accuracy enable decision-making on blockchain networks. We also address key challenges with this novel approach such as degeneration of thoughts, hallucinations, malicious models and nodes, resource consumption, and scalability.
[NLP-46] Overcoming Vocabulary Constraints with Pixel-level Fallback
【Quick Read】: This paper addresses the poor performance of subword tokenization on languages and scripts not prioritized during training. The key is a vocabulary-free encoder that generates input embeddings directly from text rendered as pixels, bypassing the limitations of subword tokenization. Experiments on English-centric language models show the approach substantially improves machine translation and enables effective cross-lingual transfer, outperforming tokenizer-based methods; pixel-based representations also outperform byte-level approaches and standard vocabulary expansion. The method enhances the multilingual capabilities of monolingual language models without extensive retraining and reduces decoding latency through input compression.
Link: https://arxiv.org/abs/2504.02122
Authors: Jonas F. Lotz,Hendra Setiawan,Stephan Peitz,Yova Kementchedjhieva
Affiliations: University of Copenhagen, Denmark; ROCKWOOL Foundation Research Unit; Apple; MBZUAI
Subjects: Computation and Language (cs.CL)
Comments:
Click to view abstract
Abstract:Subword tokenization requires balancing computational efficiency and vocabulary coverage, which often leads to suboptimal performance on languages and scripts not prioritized during training. We propose to augment pretrained language models with a vocabulary-free encoder that generates input embeddings from text rendered as pixels. Through experiments on English-centric language models, we demonstrate that our approach substantially improves machine translation performance and facilitates effective cross-lingual transfer, outperforming tokenizer-based methods. Furthermore, we find that pixel-based representations outperform byte-level approaches and standard vocabulary expansion. Our approach enhances the multilingual capabilities of monolingual language models without extensive retraining and reduces decoding latency via input compression.
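The rendering step behind pixel-based input is simple to prototype: draw the string into a small image and slice it into fixed-width patches that a linear projection can turn into embeddings. The sketch below uses PIL's default bitmap font; the image height, patch width, and per-character width are illustrative choices, not the paper's configuration.

```python
import numpy as np
from PIL import Image, ImageDraw

def render_to_patches(text: str, height: int = 16, patch_w: int = 8) -> np.ndarray:
    img = Image.new("L", (len(text) * 8, height), color=255)  # white canvas
    ImageDraw.Draw(img).text((0, 2), text, fill=0)            # default bitmap font
    arr = np.asarray(img, dtype=np.float32) / 255.0
    n = arr.shape[1] // patch_w
    # (height, n*patch_w) -> (n, height, patch_w): one pixel block per "token"
    return arr[:, : n * patch_w].reshape(height, n, patch_w).transpose(1, 0, 2)

patches = render_to_patches("Overcoming vocabulary constraints")
# each patches[i] would be flattened and linearly projected into an input embedding
```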
[NLP-47] Language Models at the Syntax-Semantics Interface: A Case Study of the Long-Distance Binding of Chinese Reflexive ziji
【Quick Read】: This paper investigates whether language models can handle the complex binding patterns of the Mandarin Chinese reflexive "ziji", which are constrained jointly by syntactic and semantic factors. The key is a dataset of 240 synthetic sentences built from templates and examples in the syntax literature, plus 320 natural sentences from the BCC corpus; 21 language models are evaluated on it and compared against native Mandarin speakers' judgments. No model consistently replicates human-like judgments: existing models rely heavily on sequential cues (though they do not always favor the closest string), often overlook subtle syntactic and semantic constraints, and are more sensitive to noun-related than verb-related semantics.
Link: https://arxiv.org/abs/2504.02116
Authors: Xiulin Yang
Affiliations: Georgetown University
Subjects: Computation and Language (cs.CL)
Comments:
Click to view abstract
Abstract:This paper explores whether language models can effectively resolve the complex binding patterns of the Mandarin Chinese reflexive ziji, which are constrained by both syntactic and semantic factors. We construct a dataset of 240 synthetic sentences using templates and examples from syntactic literature, along with 320 natural sentences from the BCC corpus. Evaluating 21 language models against this dataset and comparing their performance to judgments from native Mandarin speakers, we find that none of the models consistently replicates human-like judgments. The results indicate that existing language models tend to rely heavily on sequential cues, though not always favoring the closest strings, and often overlooking subtle semantic and syntactic constraints. They tend to be more sensitive to noun-related than verb-related semantics.
[NLP-48] Exploring LLM Reasoning Through Controlled Prompt Variations
【Quick Read】: This paper studies the reasoning robustness of large language models (LLMs) on mathematical problem solving under systematically introduced input perturbations. Using GSM8K as a controlled testbed, it evaluates how well state-of-the-art models maintain logical consistency and correctness under four categories of prompt perturbations: irrelevant context, pathological instructions, factually relevant but non-essential context, and a combination of the latter two. Experiments on thirteen open-source and closed-source LLMs show that introducing irrelevant context within the model's context window significantly degrades performance, indicating that distinguishing essential from extraneous details remains a pressing challenge. Performance regressions are relatively insensitive to the complexity of the reasoning task (measured by the number of required steps) and are not strictly correlated with model size; moreover, certain perturbations inadvertently trigger chain-of-thought-like reasoning even without explicit prompting. The findings expose critical vulnerabilities of current LLMs and underscore the need for improved robustness to noisy, misleading, and contextually dense inputs, paving the way for more resilient and reliable reasoning in real-world applications.
Link: https://arxiv.org/abs/2504.02111
Authors: Giannis Chatziveroglou,Richard Yun,Maura Kelleher
Affiliations: MIT
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:
Click to view abstract
Abstract:This study investigates the reasoning robustness of large language models (LLMs) on mathematical problem-solving tasks under systematically introduced input perturbations. Using the GSM8K dataset as a controlled testbed, we evaluate how well state-of-the-art models maintain logical consistency and correctness when confronted with four categories of prompt perturbations: irrelevant context, pathological instructions, factually relevant but non-essential context, and a combination of the latter two. Our experiments, conducted on thirteen open-source and closed-source LLMs, reveal that introducing irrelevant context within the model’s context window significantly degrades performance, suggesting that distinguishing essential from extraneous details remains a pressing challenge. Surprisingly, performance regressions are relatively insensitive to the complexity of the reasoning task, as measured by the number of steps required, and are not strictly correlated with model size. Moreover, we observe that certain perturbations inadvertently trigger chain-of-thought-like reasoning behaviors, even without explicit prompting. Our findings highlight critical vulnerabilities in current LLMs and underscore the need for improved robustness against noisy, misleading, and contextually dense inputs, paving the way for more resilient and reliable reasoning in real-world applications.
[NLP-49] TiC-LM: A Web-Scale Benchmark for Time-Continual LLM Pretraining
【Quick Read】: This paper asks how large language models (LLMs), whose training data inevitably becomes outdated, can be updated effectively as new data arrives. The key is a web-scale time-continual pretraining dataset derived from 114 Common Crawl (CC) dumps, together with time-stratified evaluations and a study of how different continual learning methods adapt to new data while retaining old knowledge. On general CC data, autoregressive meta-schedules combined with fixed-ratio replay of older data match the held-out loss of retraining from scratch while requiring 2.6x less compute. The crux is finding the right balance between new data and replayed old data: replay is crucial to avoid forgetting on generic web data, but matters less in specific domains.
Link: https://arxiv.org/abs/2504.02107
Authors: Jeffrey Li,Mohammadreza Armandpour,Iman Mirzadeh,Sachin Mehta,Vaishaal Shankar,Raviteja Vemulapalli,Samy Bengio,Oncel Tuzel,Mehrdad Farajtabar,Hadi Pouransari,Fartash Faghri
Affiliations: University of Washington; Apple
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: Code available at: this https URL
Click to view abstract
Abstract:Large Language Models (LLMs) trained on historical web data inevitably become outdated. We investigate evaluation strategies and update methods for LLMs as new data becomes available. We introduce a web-scale dataset for time-continual pretraining of LLMs derived from 114 dumps of Common Crawl (CC) - orders of magnitude larger than previous continual language modeling benchmarks. We also design time-stratified evaluations across both general CC data and specific domains (Wikipedia, StackExchange, and code documentation) to assess how well various continual learning methods adapt to new data while retaining past knowledge. Our findings demonstrate that, on general CC data, autoregressive meta-schedules combined with a fixed-ratio replay of older data can achieve comparable held-out loss to re-training from scratch, while requiring significantly less computation (2.6x). However, the optimal balance between incorporating new data and replaying old data differs as replay is crucial to avoid forgetting on generic web data but less so on specific domains.
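Fixed-ratio replay itself is a one-line data-mixing policy: each training batch combines documents from the newest CC dump with documents replayed from earlier dumps at a fixed rate. A minimal sketch, where the 50/50 ratio is illustrative and the document pools are assumed large enough to sample from:

```python
import random

def mixed_batch(new_docs, old_docs, batch_size: int = 32, replay_ratio: float = 0.5):
    """Draw a batch mixing new-dump docs with replayed older docs."""
    n_old = int(batch_size * replay_ratio)
    batch = random.sample(old_docs, n_old) + random.sample(new_docs, batch_size - n_old)
    random.shuffle(batch)
    return batch
```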
[NLP-50] ContrastScore: Towards Higher Quality, Less Biased, More Efficient Evaluation Metrics with Contrastive Evaluation
【Quick Read】: This paper addresses the challenge of automatically evaluating generated text quality, in particular the relatively weak correlation of traditional reference-based metrics with human judgments. Recent work advocates large language models (LLMs) as source-based metrics for natural language generation (NLG) evaluation, but LLM-based metrics, especially those built on smaller models, still fall short of aligning with human judgments. The key is ContrastScore, a contrastive evaluation metric designed for higher-quality, less biased, and more efficient assessment of generated text. On machine translation and summarization, ContrastScore consistently correlates more strongly with human judgments than both single-model and ensemble baselines; notably, ContrastScore based on Qwen 3B and 0.5B even outperforms Qwen 7B despite using far fewer parameters, demonstrating its efficiency. It also effectively mitigates common evaluation biases such as length and likelihood preferences, yielding more robust automatic evaluation.
Link: https://arxiv.org/abs/2504.02106
Authors: Xiao Wang,Daniil Larionov,Siwei Wu,Yiqi Liu,Steffen Eger,Nafise Sadat Moosavi,Chenghua Lin
Affiliations: University of Manchester; University of Mannheim; University of Technology Nuremberg; University of Sheffield
Subjects: Computation and Language (cs.CL)
Comments:
Click to view abstract
Abstract:Evaluating the quality of generated text automatically remains a significant challenge. Conventional reference-based metrics have been shown to exhibit relatively weak correlation with human evaluations. Recent research advocates the use of large language models (LLMs) as source-based metrics for natural language generation (NLG) assessment. While promising, LLM-based metrics, particularly those using smaller models, still fall short in aligning with human judgments. In this work, we introduce ContrastScore, a contrastive evaluation metric designed to enable higher-quality, less biased, and more efficient assessment of generated text. We evaluate ContrastScore on two NLG tasks: machine translation and summarization. Experimental results show that ContrastScore consistently achieves stronger correlation with human judgments than both single-model and ensemble-based baselines. Notably, ContrastScore based on Qwen 3B and 0.5B even outperforms Qwen 7B, despite having only half as many parameters, demonstrating its efficiency. Furthermore, it effectively mitigates common evaluation biases such as length and likelihood preferences, resulting in more robust automatic evaluation.
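To make the idea of contrastive evaluation concrete, the hedged sketch below scores a candidate by how a stronger model's likelihood compares with a weaker model's, so biases shared by both (e.g., a preference for longer or generically likely text) tend to cancel. This difference-of-logprobs form and the Qwen model pair are assumptions for illustration; the paper's actual formula may differ.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# the two models share a tokenizer (same family), which this sketch relies on
tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B")
big = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-3B").eval()
small = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B").eval()

@torch.no_grad()
def avg_logprob(model, ids) -> float:
    return -model(ids, labels=ids).loss.item()   # mean token log-prob

def contrast_score(source: str, candidate: str) -> float:
    ids = tok(source + "\n" + candidate, return_tensors="pt").input_ids
    return avg_logprob(big, ids) - avg_logprob(small, ids)
```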
[NLP-51] Increasing happiness through conversations with artificial intelligence
【Quick Read】: This paper tackles a largely unexplored question: how do conversations with artificial intelligence (AI) affect subjective well-being? The key is an experimental design comparing momentary happiness reported after participants either conversed with an AI chatbot or journaled on the same randomly assigned topics, combined with sentiment analysis of the AI's emotional feedback. When negative topics are discussed, the chatbot mirrors participants' sentiment while maintaining a consistent positivity bias, gradually steering participants' sentiment in a more positive direction and raising happiness. Computational modeling further shows that the history of sentiment prediction errors, the gap between expected and actual emotional tone in the chatbot's replies, predicts greater post-conversation happiness, demonstrating the central role of emotional expectations during dialogue.
Link: https://arxiv.org/abs/2504.02091
Authors: Joseph Heffner,Chongyu Qin,Martin Chadwick,Chris Knutsen,Christopher Summerfield,Zeb Kurth-Nelson,Robb B. Rutledge
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments: 26 pages, 4 figures
Click to view abstract
Abstract:Chatbots powered by artificial intelligence (AI) have rapidly become a significant part of everyday life, with over a quarter of American adults using them multiple times per week. While these tools offer potential benefits and risks, a fundamental question remains largely unexplored: How do conversations with AI influence subjective well-being? To investigate this, we conducted a study where participants either engaged in conversations with an AI chatbot (N = 334) or wrote journal entries (N = 193) on the same randomly assigned topics and reported their momentary happiness afterward. We found that happiness after AI chatbot conversations was higher than after journaling, particularly when discussing negative topics such as depression or guilt. Leveraging large language models for sentiment analysis, we found that the AI chatbot mirrored participants’ sentiment while maintaining a consistent positivity bias. When discussing negative topics, participants gradually aligned their sentiment with the AI’s positivity, leading to an overall increase in happiness. We hypothesized that the history of participants’ sentiment prediction errors, the difference between expected and actual emotional tone when responding to the AI chatbot, might explain this happiness effect. Using computational modeling, we find the history of these sentiment prediction errors over the course of a conversation predicts greater post-conversation happiness, demonstrating a central role of emotional expectations during dialogue. Our findings underscore the effect that AI interactions can have on human well-being.
[NLP-52] From Text to Graph: Leveraging Graph Neural Networks for Enhanced Explainability in NLP
【Quick Read】: This paper addresses the high computational cost and weak semantic grounding of explainability techniques for large Transformer models in NLP. Such models process input as token sequences that lack inherent semantic meaning, which makes model behavior hard to explain from the very first step. The key is a novel methodology that automatically converts sentences into graphs whose nodes and relations express fundamental linguistic concepts, preserving semantics. Combined with graph representation learning, the model can perform tasks such as classification while revealing how different elements of the text relate to the target task, identifying the most critical components of the text structure for a given classification. Experiments show promising results in determining the most important parts of text for classification tasks.
Link: https://arxiv.org/abs/2504.02064
Authors: Fabio Yáñez-Romero,Andrés Montoyo,Armando Suárez,Yoan Gutiérrez,Ruslan Mitkov
Affiliations: University Institute for Computer Research, University of Alicante; Department of Computing and Information Systems, Lancaster University
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:
Click to view abstract
Abstract:Researchers have relegated natural language processing tasks to Transformer-type models, particularly generative models, because these models exhibit high versatility when performing generation and classification tasks. As the size of these models increases, they achieve outstanding results. Given their widespread use, many explainability techniques are developed based on these models. However, this process becomes computationally expensive due to the large size of the models. Additionally, transformers interpret input information through tokens that fragment input words into sequences lacking inherent semantic meaning, complicating the explanation of the model from the very beginning. This study proposes a novel methodology to achieve explainability in natural language processing tasks by automatically converting sentences into graphs and maintaining semantics through nodes and relations that express fundamental linguistic concepts. It also allows the subsequent exploitation of this knowledge in subsequent tasks, making it possible to obtain trends and understand how the model associates the different elements inside the text with the explained task. The experiments delivered promising results in determining the most critical components within the text structure for a given classification.
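One straightforward way to realize the "sentence to graph" step is to derive nodes and typed relations from a dependency parse. The sketch below uses spaCy and networkx as an assumed concrete instantiation (the paper's own graph construction may differ); it requires the en_core_web_sm model to be installed.

```python
import spacy       # pip install spacy && python -m spacy download en_core_web_sm
import networkx as nx

nlp = spacy.load("en_core_web_sm")

def sentence_to_graph(sentence: str) -> nx.DiGraph:
    """Nodes carry lemma/POS; edges carry dependency relation labels."""
    g = nx.DiGraph()
    doc = nlp(sentence)
    for tok in doc:
        g.add_node(tok.i, text=tok.lemma_, pos=tok.pos_)
        if tok.head.i != tok.i:                  # skip the root's self-loop
            g.add_edge(tok.head.i, tok.i, dep=tok.dep_)
    return g

graph = sentence_to_graph("The model associates words with the target task.")
# node/edge attributes then feed a GNN whose attributions map back to concepts
```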
[NLP-53] Self-Resource Allocation in Multi-Agent LLM Systems
【Quick Read】: This paper examines how large language models (LLMs) can effectively allocate computational tasks among multiple agents while accounting for cost, efficiency, and performance. The key is to study LLMs' effectiveness as orchestrators versus planners and to compare the two roles in task assignment and coordination. Experiments show LLMs achieve high validity and accuracy in resource allocation tasks; the planner method outperforms the orchestrator method when handling concurrent actions, improving efficiency and agent utilization. Moreover, explicitly providing information about worker capabilities enhances planners' allocation strategies, especially when dealing with suboptimal workers.
Link: https://arxiv.org/abs/2504.02051
Authors: Alfonso Amayuelas,Jingbo Yang,Saaket Agashe,Ashwin Nagarajan,Antonis Antoniades,Xin Eric Wang,William Wang
Affiliations: University of California, Santa Barbara; University of California, Santa Cruz
Subjects: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:
Click to view abstract
Abstract:With the development of LLMs as agents, there is a growing interest in connecting multiple agents into multi-agent systems to solve tasks concurrently, focusing on their role in task assignment and coordination. This paper explores how LLMs can effectively allocate computational tasks among multiple agents, considering factors such as cost, efficiency, and performance. In this work, we address key questions, including the effectiveness of LLMs as orchestrators and planners, comparing their effectiveness in task assignment and coordination. Our experiments demonstrate that LLMs can achieve high validity and accuracy in resource allocation tasks. We find that the planner method outperforms the orchestrator method in handling concurrent actions, resulting in improved efficiency and better utilization of agents. Additionally, we show that providing explicit information about worker capabilities enhances the allocation strategies of planners, particularly when dealing with suboptimal workers.
[NLP-54] Urban Computing in the Era of Large Language Models
【Quick Read】: This survey addresses the weak generalization, limited scalability, and shallow contextual understanding common to traditional approaches in urban computing. The key is to explore the potential of large language models (LLMs) in this domain: using LLMs to process and analyze urban data, strengthen decision support, and foster citizen engagement. The paper outlines the evolution and core technologies of LLMs, surveys their applications across key urban domains such as transportation, public safety, and environmental monitoring, and summarizes the associated tasks and prior work while highlighting LLMs' functional roles and implementation patterns. Building on this, it proposes potential LLM-based solutions for unresolved challenges and compiles a list of available datasets and tools applicable to diverse urban scenarios. Finally, it discusses the limitations of current approaches and outlines future directions for advancing LLMs in urban computing.
Link: https://arxiv.org/abs/2504.02009
Authors: Zhonghang Li,Lianghao Xia,Xubin Ren,Jiabin Tang,Tianyi Chen,Yong Xu,Chao Huang
Affiliations: South China University of Technology, Guangzhou, China; The University of Hong Kong, Hong Kong SAR, China; City University of Hong Kong, Hong Kong SAR, China; South China University of Technology, Guangzhou, China
Subjects: Computers and Society (cs.CY); Computation and Language (cs.CL)
Comments: 36 pages
Click to view abstract
Abstract:Urban computing has emerged as a multidisciplinary field that harnesses data-driven technologies to address challenges and improve urban living. Traditional approaches, while beneficial, often face challenges with generalization, scalability, and contextual understanding. The advent of Large Language Models (LLMs) offers transformative potential in this domain. This survey explores the intersection of LLMs and urban computing, emphasizing the impact of LLMs in processing and analyzing urban data, enhancing decision-making, and fostering citizen engagement. We provide a concise overview of the evolution and core technologies of LLMs. Additionally, we survey their applications across key urban domains, such as transportation, public safety, and environmental monitoring, summarizing essential tasks and prior works in various urban contexts, while highlighting LLMs’ functional roles and implementation patterns. Building on this, we propose potential LLM-based solutions to address unresolved challenges. To facilitate in-depth research, we compile a list of available datasets and tools applicable to diverse urban scenarios. Finally, we discuss the limitations of current approaches and outline future directions for advancing LLMs in urban computing.
[NLP-55] LLMs Working in Harmony: A Survey on the Technological Aspects of Building Effective LLM-Based Multi Agent Systems
【Quick Read】: This survey asks how best to optimize large language model (LLM)-based multi-agent systems for collaborative, dynamic environments, focusing on four critical areas: architecture, memory, planning, and technologies/frameworks. By analyzing recent advances and their limitations, such as scalability, real-time response challenges, and agent coordination constraints, it maps the technological landscape. Innovative frameworks such as the Mixture of Agents architecture and the ReAct planning model illustrate improvements in role assignment and decision-making. The review synthesizes key strengths and persistent challenges and offers practical recommendations for enhancing system scalability, agent collaboration, and adaptability, providing a roadmap for future research.
Link: https://arxiv.org/abs/2504.01963
Authors: R. M. Aratchige,W. M. K. S. Ilmini
Affiliations: Department of Computer Science, Faculty of Computing, General Sir John Kotelawala Defence University, Ratmalana, Sri Lanka
Subjects: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:
Click to view abstract
Abstract:This survey investigates foundational technologies essential for developing effective Large Language Model (LLM)-based multi-agent systems. Aiming to answer how best to optimize these systems for collaborative, dynamic environments, we focus on four critical areas: Architecture, Memory, Planning, and Technologies/Frameworks. By analyzing recent advancements and their limitations - such as scalability, real-time response challenges, and agent coordination constraints, we provide a detailed view of the technological landscape. Frameworks like the Mixture of Agents architecture and the ReAct planning model exemplify current innovations, showcasing improvements in role assignment and decision-making. This review synthesizes key strengths and persistent challenges, offering practical recommendations to enhance system scalability, agent collaboration, and adaptability. Our findings provide a roadmap for future research, supporting the creation of robust, efficient multi-agent systems that advance both individual agent performance and collective system resilience.
Computer Vision
[CV-0] Envisioning Beyond the Pixels: Benchmarking Reasoning-Informed Visual Editing
【Quick Read】: This paper addresses the challenges large multimodal models (LMMs) face in general visual editing, particularly following complex instructions, preserving appearance consistency, and supporting flexible input formats. The authors introduce RISEBench, the first benchmark for evaluating Reasoning-Informed viSual Editing (RISE), covering four key reasoning types: temporal, causal, spatial, and logical. High-quality test cases are curated for each category, and an evaluation framework measures instruction reasoning, appearance consistency, and visual plausibility using both human judges and an LMM-as-a-judge approach. Results show that while GPT-4o-Native clearly outperforms other open-source and proprietary models, even this state-of-the-art system struggles with logical reasoning tasks, highlighting an underexplored area. As an initial effort, RISEBench aims to provide foundational insights into reasoning-aware visual editing and to catalyze future research; though still in its early stages, the benchmark will be continuously expanded and refined toward more comprehensive, reliable, and scalable evaluation of next-generation multimodal systems, with code and data to be released at the linked URL.
Link: https://arxiv.org/abs/2504.02826
Authors: Xiangyu Zhao,Peiyuan Zhang,Kexian Tang,Hao Li,Zicheng Zhang,Guangtao Zhai,Junchi Yan,Hua Yang,Xue Yang,Haodong Duan
Affiliations: Shanghai Jiao Tong University; Shanghai AI Laboratory; Wuhan University; Tongji University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 27 pages, 23 figures, 1 table. Technical Report
Click to view abstract
Abstract:Large Multi-modality Models (LMMs) have made significant progress in visual understanding and generation, but they still face challenges in General Visual Editing, particularly in following complex instructions, preserving appearance consistency, and supporting flexible input formats. To address this gap, we introduce RISEBench, the first benchmark for evaluating Reasoning-Informed viSual Editing (RISE). RISEBench focuses on four key reasoning types: Temporal, Causal, Spatial, and Logical Reasoning. We curate high-quality test cases for each category and propose an evaluation framework that assesses Instruction Reasoning, Appearance Consistency, and Visual Plausibility with both human judges and an LMM-as-a-judge approach. Our experiments reveal that while GPT-4o-Native significantly outperforms other open-source and proprietary models, even this state-of-the-art system struggles with logical reasoning tasks, highlighting an area that remains underexplored. As an initial effort, RISEBench aims to provide foundational insights into reasoning-aware visual editing and to catalyze future research. Though still in its early stages, we are committed to continuously expanding and refining the benchmark to support more comprehensive, reliable, and scalable evaluations of next-generation multimodal systems. Our code and data will be released at this https URL.
[CV-1] STING-BEE: Towards Vision-Language Model for Real-World X-ray Baggage Security Inspection CVPR2025
【Quick Read】: This paper addresses the limited threat-detection capability of current computer-aided screening (CAS) systems for X-ray baggage scans. The main challenges are that existing datasets fail to represent real-world sophisticated threats and concealment tactics, and that existing approaches are confined to a closed-set paradigm with predefined labels. The key is STCray, the first multimodal X-ray baggage security dataset, comprising 46,642 image-caption pairs across 21 threat categories, generated with an airport-security X-ray scanner and built with a specialized protocol that ensures domain-aware, coherent captions, yielding multimodal instruction-following data for X-ray baggage security. On this data the authors train STING-BEE, a domain-aware visual AI assistant supporting scene comprehension, referring threat localization, visual grounding, and visual question answering (VQA), establishing new baselines for multimodal learning in this domain; STING-BEE further shows state-of-the-art generalization in cross-domain settings.
Link: https://arxiv.org/abs/2504.02823
Authors: Divya Velayudhan,Abdelfatah Ahmed,Mohamad Alansari,Neha Gour,Abderaouf Behouch,Taimur Hassan,Syed Talal Wasim,Nabil Maalej,Muzammal Naseer,Juergen Gall,Mohammed Bennamoun,Ernesto Damiani,Naoufel Werghi
Affiliations: Khalifa University of Science and Technology; Abu Dhabi University; University of Bonn; Lamarr Institute for ML and AI; The University of Western Australia
Subjects: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments: Accepted at CVPR 2025
Click to view abstract
Abstract:Advancements in Computer-Aided Screening (CAS) systems are essential for improving the detection of security threats in X-ray baggage scans. However, current datasets are limited in representing real-world, sophisticated threats and concealment tactics, and existing approaches are constrained by a closed-set paradigm with predefined labels. To address these challenges, we introduce STCray, the first multimodal X-ray baggage security dataset, comprising 46,642 image-caption paired scans across 21 threat categories, generated using an X-ray scanner for airport security. STCray is meticulously developed with our specialized protocol that ensures domain-aware, coherent captions, that lead to the multi-modal instruction following data in X-ray baggage security. This allows us to train a domain-aware visual AI assistant named STING-BEE that supports a range of vision-language tasks, including scene comprehension, referring threat localization, visual grounding, and visual question answering (VQA), establishing novel baselines for multi-modal learning in X-ray baggage security. Further, STING-BEE shows state-of-the-art generalization in cross-domain settings. Code, data, and models are available at this https URL.
[CV-2] Sparse Autoencoders Learn Monosemantic Features in Vision-Language Models
【Quick Read】: This paper investigates the polysemanticity of neuron representations in vision-language models (VLMs) and explores ways to improve their interpretability and steerability. The key is to extend sparse autoencoders (SAEs) to VLMs such as CLIP: training SAEs on VLMs significantly enhances the monosemanticity of neuron representations, and the authors build a comprehensive framework for evaluating monosemanticity in vision representations. Experiments show that SAEs not only yield more monosemantic units but also preserve hierarchical features aligned with expert-defined structures (e.g., the iNaturalist taxonomy), and that intervening on a CLIP vision encoder via SAEs can directly steer the outputs of multimodal LLMs (e.g., LLaVA) without modifying the underlying model, demonstrating SAEs' practicality and effectiveness as an unsupervised approach to improving VLM interpretability and control.
Link: https://arxiv.org/abs/2504.02821
Authors: Mateusz Pach,Shyamgopal Karthik,Quentin Bouniot,Serge Belongie,Zeynep Akata
Affiliations: Technical University of Munich; Helmholtz Munich; Munich Center of Machine Learning; Munich Data Science Institute; University of Tübingen; University of Copenhagen
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Preprint. The code is available at this https URL
Click to view abstract
Abstract:Sparse Autoencoders (SAEs) have recently been shown to enhance interpretability and steerability in Large Language Models (LLMs). In this work, we extend the application of SAEs to Vision-Language Models (VLMs), such as CLIP, and introduce a comprehensive framework for evaluating monosemanticity in vision representations. Our experimental results reveal that SAEs trained on VLMs significantly enhance the monosemanticity of individual neurons while also exhibiting hierarchical representations that align well with expert-defined structures (e.g., iNaturalist taxonomy). Most notably, we demonstrate that applying SAEs to intervene on a CLIP vision encoder directly steers outputs from multimodal LLMs (e.g., LLaVA) without any modifications to the underlying model. These findings emphasize the practicality and efficacy of SAEs as an unsupervised approach for enhancing both the interpretability and control of VLMs.
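为便于理解 SAE 如何在视觉表征上促成单义性,下面给出一个极简的示意实现(非论文原代码;隐藏维度、稀疏系数、批量大小等均为假设值):在冻结的 CLIP 视觉特征上训练一个带 L1 稀疏惩罚的过完备自编码器,稀疏约束促使单个隐藏单元趋向对应单一视觉概念。

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """示意性SAE:将d维视觉特征映射到过完备的稀疏隐藏空间再重建。"""
    def __init__(self, d_model: int = 768, d_hidden: int = 8192):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_hidden)
        self.decoder = nn.Linear(d_hidden, d_model)

    def forward(self, x):
        z = torch.relu(self.encoder(x))   # 非负激活鼓励稀疏、可解释的"概念"单元
        x_hat = self.decoder(z)
        return x_hat, z

def sae_loss(x, x_hat, z, l1_coef: float = 1e-3):
    # 重建误差 + L1稀疏惩罚:稀疏性越强,单个隐藏单元越趋于单义
    recon = ((x - x_hat) ** 2).mean()
    sparsity = z.abs().mean()
    return recon + l1_coef * sparsity

# 用法示意:features 假定为预先抽取并冻结的CLIP视觉特征
features = torch.randn(256, 768)          # 假设的批量特征,仅作演示
sae = SparseAutoencoder()
opt = torch.optim.Adam(sae.parameters(), lr=1e-4)
opt.zero_grad()
x_hat, z = sae(features)
loss = sae_loss(features, x_hat, z)
loss.backward()
opt.step()                                # 单步训练演示
```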
zh
[CV-3] GMR-Conv: An Efficient Rotation and Reflection Equivariant Convolution Kernel Using Gaussian Mixture Rings
【速读】:该论文旨在解决传统卷积神经网络(Convolutional Neural Networks, CNNs)在将等变性(equivariance)扩展到旋转和反射等几何变换时面临的挑战,这些挑战通常需要在等变性、计算效率和信息损失之间进行权衡。论文的关键创新在于提出了高斯混合环卷积(Gaussian Mixture Ring Convolution, GMR-Conv),这是一种高效的卷积核设计,通过使用高斯加权环的混合来平滑径向对称性。这种设计减少了圆形核的离散化误差,从而在保持鲁棒的旋转和反射等变性的同时,避免了计算开销。此外,通过新颖的参数化和计算策略优化了GMR-Conv的空间和速度效率,使其能够在可接受的成本下支持更大的核尺寸。这一解决方案的核心在于利用径向对称性来缓解信息损失问题,同时提升了等变网络架构的性能与鲁棒性。
链接: https://arxiv.org/abs/2504.02819
作者: Yuexi Du,Jiazhen Zhang,Nicha C. Dvornek,John A. Onofrey
机构: Yale University (耶鲁大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV); Signal Processing (eess.SP)
备注:
点击查看摘要
Abstract:Symmetry, where certain features remain invariant under geometric transformations, can often serve as a powerful prior in designing convolutional neural networks (CNNs). While conventional CNNs inherently support translational equivariance, extending this property to rotation and reflection has proven challenging, often forcing a compromise between equivariance, efficiency, and information loss. In this work, we introduce Gaussian Mixture Ring Convolution (GMR-Conv), an efficient convolution kernel that smooths radial symmetry using a mixture of Gaussian-weighted rings. This design mitigates discretization errors of circular kernels, thereby preserving robust rotation and reflection equivariance without incurring computational overhead. We further optimize both the space and speed efficiency of GMR-Conv via a novel parameterization and computation strategy, allowing larger kernels at an acceptable cost. Extensive experiments on eight classification and one segmentation datasets demonstrate that GMR-Conv not only matches conventional CNNs’ performance but can also surpass it in applications with orientation-less data. GMR-Conv is also proven to be more robust and efficient than the state-of-the-art equivariant learning methods. Our work provides inspiring empirical evidence that carefully applied radial symmetry can alleviate the challenges of information loss, marking a promising advance in equivariant network architectures. The code is available at this https URL.
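下面用NumPy给出构造"高斯混合环"卷积核思想的一个示意片段(环的数量、半径与带宽均为本文假设的演示参数,并非论文的具体配置):核权重只依赖到核中心的径向距离,因此在离散采样误差内对旋转和反射天然等变。

```python
import numpy as np

def gmr_kernel(size: int, ring_radii, ring_weights, sigma: float = 0.6):
    """构造径向对称核:每个位置的权重是若干高斯加权环的叠加。

    size: 核边长(奇数);ring_radii: 各环半径;ring_weights: 各环的标量权重
    (此处用固定值演示;实际模型中这些标量才是可训练参数)。
    """
    c = size // 2
    y, x = np.mgrid[-c:c + 1, -c:c + 1]
    r = np.sqrt(x ** 2 + y ** 2)                      # 每个位置的径向距离
    kernel = np.zeros((size, size), dtype=np.float64)
    for mu, w in zip(ring_radii, ring_weights):
        kernel += w * np.exp(-((r - mu) ** 2) / (2 * sigma ** 2))
    return kernel / kernel.sum()                      # 归一化仅为演示方便

k = gmr_kernel(9, ring_radii=[0.0, 1.5, 3.0], ring_weights=[1.0, 0.5, 0.25])
# 由于 k 只依赖 r,翻转/转置输入等价于翻转/转置输出,即等变性成立
print(np.allclose(k, np.flip(k, axis=0)), np.allclose(k, k.T))  # True True
```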
zh
[CV-4] Efficient Autoregressive Shape Generation via Octree-Based Adaptive Tokenization
【速读】:该论文旨在解决现有基于变分自编码器(Variational Autoencoders, VAEs)的三维形状生成模型在处理具有尺度和复杂度变化的3D数据时所面临的挑战。具体而言,传统方法将所有形状编码为固定大小的令牌,忽略了3D数据之间固有的尺度和复杂性差异,导致低效的潜在表示,可能影响下游生成任务的质量。为了解决这一问题,论文提出了一种名为八叉树自适应令牌化(Octree-based Adaptive Tokenization)的新框架。该框架的关键在于通过二次误差引导的细分准则构建自适应八叉树结构,并利用基于查询的Transformer为每个八叉树单元分配形状潜在向量,从而实现根据形状复杂度动态调整潜在表示维度的目标。这种自适应机制不仅显著减少了令牌数量(相比固定大小方法减少50%),还能够在保持视觉质量的同时提高生成形状的质量,特别是在使用相似令牌长度的情况下。此外,结合开发的基于八叉树的自回归生成模型,该方法能够创建更加详细和多样化的3D内容。
链接: https://arxiv.org/abs/2504.02817
作者: Kangle Deng,Hsueh-Ti Derek Liu,Yiheng Zhu,Xiaoxia Sun,Chong Shang,Kiran Bhat,Deva Ramanan,Jun-Yan Zhu,Maneesh Agrawala,Tinghui Zhou
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project Page: this https URL
点击查看摘要
Abstract:Many 3D generative models rely on variational autoencoders (VAEs) to learn compact shape representations. However, existing methods encode all shapes into a fixed-size token, disregarding the inherent variations in scale and complexity across 3D data. This leads to inefficient latent representations that can compromise downstream generation. We address this challenge by introducing Octree-based Adaptive Tokenization, a novel framework that adjusts the dimension of latent representations according to shape complexity. Our approach constructs an adaptive octree structure guided by a quadric-error-based subdivision criterion and allocates a shape latent vector to each octree cell using a query-based transformer. Building upon this tokenization, we develop an octree-based autoregressive generative model that effectively leverages these variable-sized representations in shape generation. Extensive experiments demonstrate that our approach reduces token counts by 50% compared to fixed-size methods while maintaining comparable visual quality. When using a similar token length, our method produces significantly higher-quality shapes. When incorporated with our downstream generative model, our method creates more detailed and diverse 3D content than existing approaches.
zh
[CV-5] BOP Challenge 2024 on Model-Based and Model-Free 6D Object Pose Estimation
【速读】:该论文试图解决的问题是如何将6D物体姿态估计及相关任务从实验室环境过渡到真实世界场景。为实现这一目标,论文提出了三个关键解决方案:首先,引入了无需3D物体模型的新任务,方法仅能依赖提供的参考视频完成物体注册(onboarding);其次,定义了一个更实用的6D物体检测任务,测试图像中的物体身份不再作为输入提供;第三,引入了新的BOP-H3数据集,使用高分辨率传感器和AR/VR头显录制,更贴近真实场景,并包含3D模型和物体注册视频以支持基于模型和无模型的任务。这些方案的关键在于通过新任务设计和数据集构建,提升方法在真实环境下的适用性和性能表现。
链接: https://arxiv.org/abs/2504.02812
作者: Van Nguyen Nguyen,Stephen Tyree,Andrew Guo,Mederic Fourmy,Anas Gouda,Taeyeop Lee,Sungphill Moon,Hyeontae Son,Lukas Ranftl,Jonathan Tremblay,Eric Brachmann,Bertram Drost,Vincent Lepetit,Carsten Rother,Stan Birchfield,Jiri Matas,Yann Labbe,Martin Sundermeyer,Tomas Hodan
机构: ENPC (École Nationale des Ponts et Chaussées); NVIDIA; University of Toronto; CTU Prague (Czech Technical University in Prague); TU Dortmund (Technical University of Dortmund); KAIST (Korea Advanced Institute of Science and Technology); NAVER LABS; MVTec; TU Munich (Technical University of Munich); Niantic; Heidelberg University; Google; Meta (Facebook)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: arXiv admin note: text overlap with arXiv:2403.09799
点击查看摘要
Abstract:We present the evaluation methodology, datasets and results of the BOP Challenge 2024, the sixth in a series of public competitions organized to capture the state of the art in 6D object pose estimation and related tasks. In 2024, our goal was to transition BOP from lab-like setups to real-world scenarios. First, we introduced new model-free tasks, where no 3D object models are available and methods need to onboard objects just from provided reference videos. Second, we defined a new, more practical 6D object detection task where identities of objects visible in a test image are not provided as input. Third, we introduced new BOP-H3 datasets recorded with high-resolution sensors and AR/VR headsets, closely resembling real-world scenarios. The BOP-H3 datasets include 3D models and onboarding videos to support both model-based and model-free tasks. Participants competed on seven challenge tracks, each defined by a task, object onboarding setup, and dataset group. Notably, the best 2024 method for model-based 6D localization of unseen objects (FreeZeV2.1) achieves 22% higher accuracy on BOP-Classic-Core than the best 2023 method (GenFlow), and is only 4% behind the best 2023 method for seen objects (GPose2023) although being significantly slower (24.9 vs 2.7s per image). A more practical 2024 method for this task is Co-op which takes only 0.8s per image and is 25X faster and 13% more accurate than GenFlow. Methods have a similar ranking on 6D detection as on 6D localization but higher run time. On model-based 2D detection of unseen objects, the best 2024 method (MUSE) achieves 21% relative improvement compared to the best 2023 method (CNOS). However, the 2D detection accuracy for unseen objects is still noticeably (-53%) behind the accuracy for seen objects (GDet2023). The online evaluation system stays open and is available at this http URL
zh
[CV-6] F-ViTA: Foundation Model Guided Visible to Thermal Translation
【速读】:该论文旨在解决因采集大型热成像数据集成本高且耗时而带来的挑战,提出了一种新的方法F-ViTA。传统方法主要依赖生成式对抗网络(GANs)或扩散模型(DMs),将可见光到热成像的转换视为风格迁移问题,但这些方法在有限训练数据下难以同时学习模态分布偏移和物理原理。F-ViTA的关键创新在于利用基础模型中嵌入的一般世界知识来引导扩散过程,具体通过使用来自SAM和Grounded DINO等基础模型的零样本掩码和标签来条件化InstructPix2Pix扩散模型,使模型能够学习场景物体与其红外图像中热特征之间的有意义的相关性。实验结果表明,F-ViTA在五个公开数据集上的表现优于现有最先进技术,并且在分布外场景中表现出良好的泛化能力,可以从同一可见光图像生成长波红外(LWIR)、中波红外(MWIR)和近红外(NIR)转换。
链接: https://arxiv.org/abs/2504.02801
作者: Jay N. Paranjape,Celso de Melo,Vishal M. Patel
机构: Johns Hopkins University (约翰斯·霍普金斯大学); DEVCOM Army Research Laboratory (DEVCOM陆军研究实验室); Johns Hopkins University (约翰斯·霍普金斯大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Thermal imaging is crucial for scene understanding, particularly in low-light and nighttime conditions. However, collecting large thermal datasets is costly and labor-intensive due to the specialized equipment required for infrared image capture. To address this challenge, researchers have explored visible-to-thermal image translation. Most existing methods rely on Generative Adversarial Networks (GANs) or Diffusion Models (DMs), treating the task as a style transfer problem. As a result, these approaches attempt to learn both the modality distribution shift and underlying physical principles from limited training data. In this paper, we propose F-ViTA, a novel approach that leverages the general world knowledge embedded in foundation models to guide the diffusion process for improved translation. Specifically, we condition an InstructPix2Pix Diffusion Model with zero-shot masks and labels from foundation models such as SAM and Grounded DINO. This allows the model to learn meaningful correlations between scene objects and their thermal signatures in infrared imagery. Extensive experiments on five public datasets demonstrate that F-ViTA outperforms state-of-the-art (SOTA) methods. Furthermore, our model generalizes well to out-of-distribution (OOD) scenarios and can generate Long-Wave Infrared (LWIR), Mid-Wave Infrared (MWIR), and Near-Infrared (NIR) translations from the same visible image. Code: this https URL.
zh
[CV-7] Systematic Evaluation of Large Vision-Language Models for Surgical Artificial Intelligence
【速读】:该论文试图解决大型视觉语言模型(Vision-Language Models, VLMs)在医学干预领域的实际应用问题,特别是在需要主观决策和场景多变的手术领域中,评估其通用性和适应性。论文通过分析11种最先进的VLMs在涵盖腹腔镜、机器人辅助及开放式手术的13个数据集上的17项关键视觉理解任务(如解剖结构识别和技能评估),探索这些模型在未见过的任务和场景中的表现。论文的关键解决方案在于验证VLMs的泛化能力,并发现上下文学习(in-context learning)通过在测试时引入示例,能够将性能提升多达三倍,这表明模型的适应性是其重要优势之一。然而,论文也指出,涉及空间或时间推理的任务仍然是挑战。这项研究不仅为手术AI提供见解,还为复杂动态场景下的临床及其他现实世界应用提供了参考。
链接: https://arxiv.org/abs/2504.02799
作者: Anita Rau,Mark Endo,Josiah Aklilu,Jaewoo Heo,Khaled Saab,Alberto Paderno,Jeffrey Jopling,F. Christopher Holsinger,Serena Yeung-Levy
机构: Stanford University (斯坦福大学); Google DeepMind (谷歌深度思维); Humanitas University (Humanitas大学); Johns Hopkins University (约翰斯·霍普金斯大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Large Vision-Language Models offer a new paradigm for AI-driven image understanding, enabling models to perform tasks without task-specific training. This flexibility holds particular promise across medicine, where expert-annotated data is scarce. Yet, VLMs’ practical utility in intervention-focused domains–especially surgery, where decision-making is subjective and clinical scenarios are variable–remains uncertain. Here, we present a comprehensive analysis of 11 state-of-the-art VLMs across 17 key visual understanding tasks in surgical AI–from anatomy recognition to skill assessment–using 13 datasets spanning laparoscopic, robotic, and open procedures. In our experiments, VLMs demonstrate promising generalizability, at times outperforming supervised models when deployed outside their training setting. In-context learning, incorporating examples during testing, boosted performance up to three-fold, suggesting adaptability as a key strength. Still, tasks requiring spatial or temporal reasoning remained difficult. Beyond surgery, our findings offer insights into VLMs’ potential for tackling complex and dynamic scenarios in clinical and broader real-world applications.
zh
[CV-8] Spline-based Transformers
【速读】:该论文试图解决传统Transformer模型中依赖固定或学习型位置编码(positional encoding)所带来的局限性,特别是序列长度外推(sequence length extrapolation)的问题。为了解决这些问题,论文提出了一种基于样条(Spline-based)的新颖Transformer架构,其关键在于将输入序列嵌入为潜空间中的平滑轨迹,通过控制样条的控制点直接操作潜空间,从而实现对新轨迹和序列的创建。这种设计不仅消除了对传统位置编码的需求,还为用户提供了与Transformer潜空间交互的新方式。
链接: https://arxiv.org/abs/2504.02797
作者: Prashanth Chandran,Agon Serifi,Markus Gross,Moritz Bächer
机构: DisneyResearch|Studios, Switzerland (迪士尼研究实验室,瑞士); Disney Research, Switzerland (迪士尼研究,瑞士); ETH Zurich, Switzerland (瑞士苏黎世联邦理工学院)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:We introduce Spline-based Transformers, a novel class of Transformer models that eliminate the need for positional encoding. Inspired by workflows using splines in computer animation, our Spline-based Transformers embed an input sequence of elements as a smooth trajectory in latent space. Overcoming drawbacks of positional encoding such as sequence length extrapolation, Spline-based Transformers also provide a novel way for users to interact with transformer latent spaces by directly manipulating the latent control points to create new latent trajectories and sequences. We demonstrate the superior performance of our approach in comparison to conventional positional encoding on a variety of datasets, ranging from synthetic 2D to large-scale real-world datasets of images, 3D shapes, and animations.
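其核心思想可以用一个小示意片段说明:不再为token添加位置编码,而是让整条序列对应潜空间中一条由少量控制点定义的平滑曲线,token的"位置"由其在曲线上的采样参数t隐式给出。下面以三次Bézier曲线为例(曲线类型、控制点数与潜空间维度均为本文示意的假设,论文中为一般意义上的样条):

```python
import torch

def cubic_bezier(ctrl: torch.Tensor, n_tokens: int) -> torch.Tensor:
    """ctrl: (4, d) 潜空间控制点;返回 (n_tokens, d) 的平滑潜轨迹。"""
    t = torch.linspace(0, 1, n_tokens).unsqueeze(1)   # 每个token的曲线参数
    b0 = (1 - t) ** 3
    b1 = 3 * (1 - t) ** 2 * t
    b2 = 3 * (1 - t) * t ** 2
    b3 = t ** 3
    return b0 * ctrl[0] + b1 * ctrl[1] + b2 * ctrl[2] + b3 * ctrl[3]

ctrl_points = torch.randn(4, 64)                # 假设的64维潜空间控制点
traj = cubic_bezier(ctrl_points, n_tokens=16)   # 16个token沿曲线平滑分布

# 直接操纵某个控制点即可生成新的潜轨迹/序列,这正是摘要所述的交互方式
ctrl_points[1] = ctrl_points[1] + 0.5
traj_edited = cubic_bezier(ctrl_points, n_tokens=16)
```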
zh
[CV-9] GPT -ImgEval: A Comprehensive Benchmark for Diagnosing GPT 4o in Image Generation
【速读】:本文旨在评估OpenAI的GPT-4o模型在图像生成与编辑任务中的性能,并通过构建首个基准(GPT-ImgEval)定量与定性诊断其在生成质量、编辑能力及基于世界知识的语义合成三个关键维度的表现。论文的关键解决方案在于提出了一种基于分类模型的方法来探究GPT-4o的潜在架构,发现其结合了自回归(Auto-Regressive, AR)机制与基于扩散(Diffusion-Based)的解码头部,而非类似VAR的架构。此外,研究还全面分析了GPT-4o在图像生成中的特定局限性和常见合成伪影,比较了其与Gemini 2.0 Flash在多轮图像编辑中的差异,并探讨了输出的安全性及其对现有图像取证模型的可检测性。这些工作为未来图像生成及相关领域的研究提供了有价值的洞见和可靠的基准。
链接: https://arxiv.org/abs/2504.02782
作者: Zhiyuan Yan,Junyan Ye,Weijia Li,Zilong Huang,Shenghai Yuan,Xiangyang He,Kaiqing Lin,Jun He,Conghui He,Li Yuan
机构: Peking University, Shenzhen Graduate School (北京大学深圳研究生院); Sun Yat-sen University (中山大学); Rabbitpre AI (兔博士人工智能); Shanghai AI Laboratory (上海人工智能实验室); Shenzhen University (深圳大学); The Hong Kong University of Science and Technology (Guangzhou) (香港科技大学(广州))
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:The recent breakthroughs in OpenAI's GPT-4o model have demonstrated surprisingly good capabilities in image generation and editing, resulting in significant excitement in the community. This technical report presents the first-look evaluation benchmark (named GPT-ImgEval), quantitatively and qualitatively diagnosing GPT-4o's performance across three critical dimensions: (1) generation quality, (2) editing proficiency, and (3) world knowledge-informed semantic synthesis. Across all three tasks, GPT-4o demonstrates strong performance, significantly surpassing existing methods in both image generation control and output quality, while also showcasing exceptional knowledge reasoning capabilities. Furthermore, based on GPT-4o's generated data, we propose a classification-model-based approach to investigate the underlying architecture of GPT-4o, where our empirical results suggest the model consists of an auto-regressive (AR) backbone combined with a diffusion-based head for image decoding, rather than a VAR-like architecture. We also provide a complete speculation on GPT-4o's overall architecture. In addition, we conduct a series of analyses to identify and visualize GPT-4o's specific limitations and the synthetic artifacts commonly observed in its image generation. We also present a comparative study of multi-round image editing between GPT-4o and Gemini 2.0 Flash, and discuss the safety implications of GPT-4o's outputs, particularly their detectability by existing image forensic models. We hope that our work can offer valuable insight and provide a reliable benchmark to guide future research, foster reproducibility, and accelerate innovation in the field of image generation and beyond. The codes and datasets used for evaluating GPT-4o can be found at this https URL.
zh
[CV-10] Multi-Head Adaptive Graph Convolution Network for Sparse Point Cloud-Based Human Activity Recognition
【速读】:该论文旨在解决基于毫米波 (mmWave) 雷达点云数据的人类活动识别中,由于点云数据稀疏且噪声较多而导致的传统固定核图卷积方法表现不佳的问题。为克服这一限制,论文提出了一种自适应方法,关键在于引入多头自适应核 (Multi-Head Adaptive Kernel, MAK) 模块,该模块能够生成多个动态核,每个核捕捉局部特征空间的不同方面。通过在保持全局空间上下文的同时逐步优化局部特征,此方法使卷积核能够根据点云数据中每个局部邻域的具体几何特性进行动态调整,从而显著提升了人类活动识别的性能。实验结果表明,该方法在基准数据集上达到了最先进的性能。
链接: https://arxiv.org/abs/2504.02778
作者: Vincent Gbouna Zakka,Luis J. Manso,Zhuangzhuang Dai
机构: Aston University (阿斯顿大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Human activity recognition is increasingly vital for supporting independent living, particularly for the elderly and those in need of assistance. Domestic service robots with monitoring capabilities can enhance safety and provide essential support. Although image-based methods have advanced considerably in the past decade, their adoption remains limited by concerns over privacy and sensitivity to low-light or dark conditions. As an alternative, millimetre-wave (mmWave) radar can produce point cloud data which is privacy-preserving. However, processing the sparse and noisy point clouds remains a long-standing challenge. While graph-based methods and attention mechanisms show promise, they predominantly rely on “fixed” kernels; kernels that are applied uniformly across all neighbourhoods, highlighting the need for adaptive approaches that can dynamically adjust their kernels to the specific geometry of each local neighbourhood in point cloud data. To overcome this limitation, we introduce an adaptive approach within the graph convolutional framework. Instead of a single shared weight function, our Multi-Head Adaptive Kernel (MAK) module generates multiple dynamic kernels, each capturing different aspects of the local feature space. By progressively refining local features while maintaining global spatial context, our method enables convolution kernels to adapt to varying local features. Experimental results on benchmark datasets confirm the effectiveness of our approach, achieving state-of-the-art performance in human activity recognition. Our source code is made publicly available at: this https URL
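下面给出"多头自适应核"(MAK)思想的一个极简示意(头数、邻域大小与网络结构均为本文假设,并非论文配置):对点云中每个点的k近邻,用一个小MLP根据邻居的相对坐标为每个头生成一组动态权重,再对邻域特征做加权聚合,从而让"卷积核"随局部几何动态变化,而不是对所有邻域使用同一固定核。

```python
import torch
import torch.nn as nn

class MultiHeadAdaptiveKernel(nn.Module):
    def __init__(self, c_in: int, c_out: int, heads: int = 4):
        super().__init__()
        self.heads = heads
        self.weight_gen = nn.Sequential(      # 由相对坐标生成每头的动态核权重
            nn.Linear(3, 32), nn.ReLU(), nn.Linear(32, heads))
        self.proj = nn.Linear(c_in * heads, c_out)

    def forward(self, rel_xyz, neigh_feat):
        # rel_xyz: (N, K, 3) 邻居相对坐标; neigh_feat: (N, K, C_in) 邻域特征
        w = torch.softmax(self.weight_gen(rel_xyz), dim=1)      # (N, K, H)
        # 每个头用自己的动态权重聚合邻域特征
        agg = torch.einsum('nkh,nkc->nhc', w, neigh_feat)       # (N, H, C_in)
        return self.proj(agg.flatten(1))                        # (N, C_out)

N, K = 128, 16
layer = MultiHeadAdaptiveKernel(c_in=32, c_out=64)
out = layer(torch.randn(N, K, 3), torch.randn(N, K, 32))
print(out.shape)  # torch.Size([128, 64])
```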
zh
[CV-11] TailedCore: Few-Shot Sampling for Unsupervised Long-Tail Noisy Anomaly Detection CVPR2025
【速读】:本文旨在解决实际应用中具有挑战性的无监督异常检测问题,其中正常数据集不仅受到缺陷区域的污染,而且其产品类别分布呈现长尾且未知。论文观察到现有模型面临“长尾对抗噪声”的权衡问题:如果模型对像素噪声具有鲁棒性,则其在长尾类别样本上的性能会下降,反之亦然。为缓解此问题,本文提出将长尾类别样本与噪声样本独立处理的方法。为此,作者设计了一种新颖的类大小预测器TailSampler,它基于嵌入相似度的类别分布对称假设来估计样本的类别基数。TailSampler能够专门采样长尾类别样本,从而实现单独处理。在此基础上,构建了一个基于内存的异常检测模型TailedCore,该模型既能很好地捕捉长尾类别信息,又具备抗噪能力。通过广泛验证,TailedCore在无监督长尾噪声异常检测任务中表现出色,优于现有技术。
链接: https://arxiv.org/abs/2504.02775
作者: Yoon Gyo Jung,Jaewoo Park,Jaeho Yoon,Kuan-Chuan Peng,Wonchul Kim,Andrew Beng Jin Teoh,Octavia Camps
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Accepted to CVPR2025
点击查看摘要
Abstract:We aim to solve unsupervised anomaly detection in a practical challenging environment where the normal dataset is both contaminated with defective regions and its product class distribution is tailed but unknown. We observe that existing models suffer from tail-versus-noise trade-off where if a model is robust against pixel noise, then its performance deteriorates on tail class samples, and vice versa. To mitigate the issue, we handle the tail class and noise samples independently. To this end, we propose TailSampler, a novel class size predictor that estimates the class cardinality of samples based on a symmetric assumption on the class-wise distribution of embedding similarities. TailSampler can be utilized to sample the tail class samples exclusively, allowing to handle them separately. Based on these facets, we build a memory-based anomaly detection model TailedCore, whose memory both well captures tail class information and is noise-robust. We extensively validate the effectiveness of TailedCore on the unsupervised long-tail noisy anomaly detection setting, and show that TailedCore outperforms the state-of-the-art in most settings.
zh
[CV-12] Scene Splatter: Momentum 3D Scene Generation from Single Image with Video Diffusion Model CVPR2025
【速读】:本文旨在解决现有视频扩散(Video Diffusion)方法在从单张图像生成通用场景时存在的两个主要问题:有限的视频长度和场景一致性不足,这些问题会导致重建过程中出现伪影和失真。为了解决这些问题,论文提出了一种基于动量的框架——Scene Splatter。其关键是通过构建噪声样本作为动量来增强视频细节并保持场景一致性;同时,针对潜在特征感知场跨越已知与未知区域的情况,引入像素级动量以改善未见区域的恢复效果。此外,通过迭代优化全局高斯表示和增强帧,并更新下一阶段的新帧动量,实现了避免视频长度限制的3D场景恢复。实验结果验证了该方法在生成高保真且一致场景方面的优越性能和泛化能力。
链接: https://arxiv.org/abs/2504.02764
作者: Shengjun Zhang,Jinzhao Li,Xin Fei,Hao Liu,Yueqi Duan
机构: Tsinghua University (清华大学); WeChat Vision, Tencent Inc. (微信视见, 腾讯公司)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: CVPR 2025
点击查看摘要
Abstract:In this paper, we propose Scene Splatter, a momentum-based paradigm for video diffusion to generate generic scenes from single image. Existing methods, which employ video generation models to synthesize novel views, suffer from limited video length and scene inconsistency, leading to artifacts and distortions during further reconstruction. To address this issue, we construct noisy samples from original features as momentum to enhance video details and maintain scene consistency. However, for latent features with the perception field that spans both known and unknown regions, such latent-level momentum restricts the generative ability of video diffusion in unknown regions. Therefore, we further introduce the aforementioned consistent video as a pixel-level momentum to a directly generated video without momentum for better recovery of unseen regions. Our cascaded momentum enables video diffusion models to generate both high-fidelity and consistent novel views. We further finetune the global Gaussian representations with enhanced frames and render new frames for momentum update in the next step. In this manner, we can iteratively recover a 3D scene, avoiding the limitation of video length. Extensive experiments demonstrate the generalization capability and superior performance of our method in high-fidelity and consistent scene generation.
zh
[CV-13] CanonNet: Canonical Ordering and Curvature Learning for Point Cloud Analysis
【速读】:该论文旨在解决点云处理中的两个核心挑战:建立一致的点排序和有效学习细粒度几何特征。当前架构依赖于复杂的操作,这些操作不仅限制了表达能力,还难以捕捉详细的表面几何结构。论文提出的解决方案是CanonNet,一种轻量级神经网络,包含两个互补组件:(1) 一个预处理管道,用于创建一致的点排序和方向;(2) 一个几何学习框架,在其中网络从具有精确曲率值的合成曲面进行学习。该模块化方法无需复杂的变换不变架构,同时能够有效地捕获局部几何属性。关键在于通过数学预处理与神经架构的有效结合,实现高效的点云分析。实验结果表明,CanonNet在曲率估计任务中达到了最先进的性能,并在几何描述符任务中取得了具有竞争力的结果,所需参数量仅为同类方法的百分之一(少100倍)。
链接: https://arxiv.org/abs/2504.02763
作者: Benjy Friedmann,Michael Werman
机构: Hebrew University of Jerusalem (耶路撒冷希伯来大学); Hebrew University of Jerusalem (耶路撒冷希伯来大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Point cloud processing poses two fundamental challenges: establishing consistent point ordering and effectively learning fine-grained geometric features. Current architectures rely on complex operations that limit expressivity while struggling to capture detailed surface geometry. We present CanonNet, a lightweight neural network composed of two complementary components: (1) a preprocessing pipeline that creates a canonical point ordering and orientation, and (2) a geometric learning framework where networks learn from synthetic surfaces with precise curvature values. This modular approach eliminates the need for complex transformation-invariant architectures while effectively capturing local geometric properties. Our experiments demonstrate state-of-the-art performance in curvature estimation and competitive results in geometric descriptor tasks with significantly fewer parameters (100X fewer) than comparable methods. CanonNet's efficiency makes it particularly suitable for real-world applications where computational resources are limited, demonstrating that mathematical preprocessing can effectively complement neural architectures for point cloud analysis. The code for the project is publicly available at this https URL.
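CanonNet 预处理流程的第一步——建立一致的点排序与朝向——可以用"PCA对齐 + 确定性排序"来近似体会。以下为一个示意实现(符号消歧规则与排序准则为本文假设,论文可能采用不同的规范化细节):

```python
import numpy as np

def canonicalize(points: np.ndarray) -> np.ndarray:
    """points: (N, 3)。返回姿态与排序都规范化后的点云。"""
    centered = points - points.mean(axis=0)          # 平移不变
    # 用PCA主轴构造规范朝向;特征向量符号按三阶矩消歧,以应对旋转/反射
    cov = centered.T @ centered
    _, vecs = np.linalg.eigh(cov)
    axes = vecs[:, ::-1]                             # 按方差从大到小排列
    aligned = centered @ axes
    for d in range(3):                               # 符号消歧:使三阶矩为正
        if (aligned[:, d] ** 3).sum() < 0:
            aligned[:, d] *= -1
    # 规范排序:按(x, y, z)字典序排列,得到与输入顺序无关的一致排序
    order = np.lexsort((aligned[:, 2], aligned[:, 1], aligned[:, 0]))
    return aligned[order]

pts = np.random.randn(1024, 3)
R = np.linalg.qr(np.random.randn(3, 3))[0]           # 随机正交变换
a, b = canonicalize(pts), canonicalize(pts @ R.T)
# 理想情况下(特征值不退化、各轴偏度非零)两者应一致
print(np.allclose(a, b, atol=1e-6))
```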
zh
[CV-14] MD-ProjTex: Texturing 3D Shapes with Multi-Diffusion Projection
【速读】:该论文旨在解决3D形状文本引导纹理生成过程中效率低、一致性差的问题。论文提出的方法MD-ProjTex通过利用预训练的文字到图像扩散模型实现快速且一致的纹理生成。其关键在于引入了一种在UV空间中的多视角一致性机制,该机制在每个扩散步骤融合来自多个视角的噪声预测,并联合更新每视角去噪方向以保持3D一致性。与依赖优化或顺序视图合成的现有最先进方法相比,MD-ProjTex在计算上更加高效,同时在定量和定性结果上表现更优。
链接: https://arxiv.org/abs/2504.02762
作者: Ahmet Burak Yildirim,Mustafa Utku Aydogdu,Duygu Ceylan,Aysegul Dundar
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:We introduce MD-ProjTex, a method for fast and consistent text-guided texture generation for 3D shapes using pretrained text-to-image diffusion models. At the core of our approach is a multi-view consistency mechanism in UV space, which ensures coherent textures across different viewpoints. Specifically, MD-ProjTex fuses noise predictions from multiple views at each diffusion step and jointly updates the per-view denoising directions to maintain 3D consistency. In contrast to existing state-of-the-art methods that rely on optimization or sequential view synthesis, MD-ProjTex is computationally more efficient and achieves better quantitative and qualitative results.
zh
[CV-15] HQViT: Hybrid Quantum Vision Transformer for Image Classification
【速读】:该论文旨在解决经典计算机在训练基于 Transformer 的视觉模型时,自注意力机制的二次计算复杂度导致高维输入数据(如图像)训练成本高昂的问题。为应对这一挑战,论文提出了一种混合量子视觉 Transformer (HQViT),其关键在于结合量子计算与经典计算的优势:通过幅度编码实现整图处理以保留全局图像信息,并仅在最关键的步骤利用量子计算,同时以经典方式处理其他组件,从而显著降低量子资源需求(qubit 需求为 O(log_2 N),参数化量子门数量为 O(log_2 d)),使其适用于含噪声中等规模量子(NISQ)设备。此外,通过将计算密集型的注意力系数矩阵计算卸载到量子框架,HQViT 将经典计算负载降低了 O(T^2 d)。实验结果表明,HQViT 在多个图像分类任务中超越现有方法,最高提升达 10.9%。
链接: https://arxiv.org/abs/2504.02730
作者: Hui Zhang,Qinglin Zhao,Mengchu Zhou,Li Feng
机构: Faculty of Innovation Engineering, Macau University of Science and Technology (澳门科技大学创新工程学院); Department of Electrical and Computer Engineering, New Jersey Institute of Technology (新泽西理工学院电气与计算机工程系)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 13 pages, 8 figures
点击查看摘要
Abstract:Transformer-based architectures have revolutionized the landscape of deep learning. In computer vision domain, Vision Transformer demonstrates remarkable performance on par with or even surpassing that of convolutional neural networks. However, the quadratic computational complexity of its self-attention mechanism poses challenges for classical computing, making model training with high-dimensional input data, e.g., images, particularly expensive. To address such limitations, we propose a Hybrid Quantum Vision Transformer (HQViT), that leverages the principles of quantum computing to accelerate model training while enhancing model performance. HQViT introduces whole-image processing with amplitude encoding to better preserve global image information without additional positional encoding. By leveraging quantum computation on the most critical steps and selectively handling other components in a classical way, we lower the cost of quantum resources for HQViT. The qubit requirement is minimized to O(log_2N) and the number of parameterized quantum gates is only O(log_2d) , making it well-suited for Noisy Intermediate-Scale Quantum devices. By offloading the computationally intensive attention coefficient matrix calculation to the quantum framework, HQViT reduces the classical computational load by O(T^2d) . Extensive experiments across various computer vision datasets demonstrate that HQViT outperforms existing models, achieving a maximum improvement of up to 10.9% (on the MNIST 10-classification task) over the state of the art. This work highlights the great potential to combine quantum and classical computing to cope with complex image classification tasks.
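摘要中的资源估计可以用一个简单的算术示例体会:幅度编码把 N 个像素值放入量子态的振幅中,只需 ⌈log₂N⌉ 个 qubit。下面的小脚本按此公式计算几个常见输入尺寸的需求(仅演示复杂度量级,不涉及具体量子线路):

```python
import math

def qubits_for_amplitude_encoding(n_values: int) -> int:
    # 幅度编码:N 个振幅 -> ceil(log2 N) 个 qubit
    return math.ceil(math.log2(n_values))

for h, w in [(28, 28), (224, 224)]:
    n = h * w
    print(f"{h}x{w} 图像: N={n}, 需要 {qubits_for_amplitude_encoding(n)} 个 qubit")
# 28x28 -> 10 个 qubit;224x224 -> 16 个 qubit
```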
zh
[CV-16] Learning Phase Distortion with Selective State Space Models for Video Turbulence Mitigation CVPR2025
【速读】:该论文旨在解决长程成像系统中由大气湍流引起图像退化的问题。现有基于深度学习的大气湍流缓解 (Turbulence Mitigation, TM) 方法普遍存在速度慢、内存占用高以及泛化能力不足等缺陷。在空间域,基于卷积算子的方法受限于其有限的感受野,难以处理湍流所需的较大空间依赖性;而在时间域,依赖自注意力机制的方法虽然理论上可利用湍流的幸运效应,但其二次复杂度限制了其扩展到多帧的能力,传统循环聚合方法也面临并行化挑战。论文提出的解决方案的关键在于:(1) 基于选择性状态空间模型 (Selective State Space Model, MambaTM) 的湍流缓解网络,MambaTM 在空间和时间维度的每一层都提供了全局感受野,并保持了线性计算复杂度;(2) 学习的潜在相位畸变映射 (Learned Latent Phase Distortion, LPD),LPD 引导状态空间模型,通过捕获湍流的实际影响取代传统的 Zernike 表示,显著减少了病态问题,提升了模型估计退化的能力。实验表明,所提出的方法在多种合成和真实世界数据集上的性能超越当前最先进的方法,同时实现了更快的推理速度。
链接: https://arxiv.org/abs/2504.02697
作者: Xingguang Zhang,Nicholas Chimitt,Xijun Wang,Yu Yuan,Stanley H. Chan
机构: School of Electrical and Computer Engineering, Purdue University (普渡大学电子与计算机工程学院)
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注: CVPR 2025, project page: this https URL
点击查看摘要
Abstract:Atmospheric turbulence is a major source of image degradation in long-range imaging systems. Although numerous deep learning-based turbulence mitigation (TM) methods have been proposed, many are slow, memory-hungry, and do not generalize well. In the spatial domain, methods based on convolutional operators have a limited receptive field, so they cannot handle a large spatial dependency required by turbulence. In the temporal domain, methods relying on self-attention can, in theory, leverage the lucky effects of turbulence, but their quadratic complexity makes it difficult to scale to many frames. Traditional recurrent aggregation methods face parallelization challenges. In this paper, we present a new TM method based on two concepts: (1) A turbulence mitigation network based on the Selective State Space Model (MambaTM). MambaTM provides a global receptive field in each layer across spatial and temporal dimensions while maintaining linear computational complexity. (2) Learned Latent Phase Distortion (LPD). LPD guides the state space model. Unlike classical Zernike-based representations of phase distortion, the new LPD map uniquely captures the actual effects of turbulence, significantly improving the model's capability to estimate degradation by reducing the ill-posedness. Our proposed method exceeds current state-of-the-art networks on various synthetic and real-world TM benchmarks with significantly faster inference speed. The code is available at this http URL.
zh
[CV-17] BECAME: BayEsian Continual Learning with Adaptive Model MErging
【速读】:该论文旨在解决连续学习(Continual Learning, CL)中的稳定性-可塑性权衡问题,即在保留已有知识(稳定性)的同时有效学习新任务(可塑性)。现有梯度投影方法虽能保证稳定性,但往往限制了模型的可塑性;而模型合并技术虽然提供了潜在解决方案,但通常依赖经验假设和精心调整的超参数。论文的关键在于通过贝叶斯连续学习原理重新构建模型合并机制,并推导出一个自适应任务特性的最优合并系数闭合解。此外,提出了一种名为BECAME的两阶段框架,结合梯度投影和自适应合并的优势,验证了该方法在性能上超越当前最先进的CL方法及现有合并策略。
链接: https://arxiv.org/abs/2504.02666
作者: Mei Li,Yuxiang Lu,Qinyan Dai,Suizhi Huang,Yue Ding,Hongtao Lu
机构: 未知
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Continual Learning (CL) strives to learn incrementally across tasks while mitigating catastrophic forgetting. A key challenge in CL is balancing stability (retaining prior knowledge) and plasticity (learning new tasks). While representative gradient projection methods ensure stability, they often limit plasticity. Model merging techniques offer promising solutions, but prior methods typically rely on empirical assumptions and carefully selected hyperparameters. In this paper, we explore the potential of model merging to enhance the stability-plasticity trade-off, providing theoretical insights that underscore its benefits. Specifically, we reformulate the merging mechanism using Bayesian continual learning principles and derive a closed-form solution for the optimal merging coefficient that adapts to the diverse characteristics of tasks. To validate our approach, we introduce a two-stage framework named BECAME, which synergizes the expertise of gradient projection and adaptive merging. Extensive experiments show that our approach outperforms state-of-the-art CL methods and existing merging strategies.
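论文的核心贡献是为合并系数给出贝叶斯意义下的闭式解,其具体形式以原文推导为准。作为直觉,下面给出一个常见的"精度加权"合并草图(假设以对角Fisher信息近似参数后验精度;这只是本文的示意,未必与论文的闭式解一致):

```python
import torch

def precision_weighted_merge(theta_old, theta_new, f_old, f_new):
    """按参数后验精度(此处用对角Fisher近似)逐参数计算自适应合并系数。

    theta_*: 两个阶段得到的参数张量; f_*: 对应的对角Fisher(精度)估计。
    返回合并后的参数:精度越高(越确定)的一方权重越大。
    """
    alpha = f_old / (f_old + f_new + 1e-12)      # 逐元素的自适应系数
    return alpha * theta_old + (1 - alpha) * theta_new

theta_a = torch.randn(10)                        # 假设的旧任务参数
theta_b = theta_a + 0.1 * torch.randn(10)        # 新任务微调后的参数
fisher_a = torch.rand(10) * 10                   # 旧任务上估计的对角Fisher
fisher_b = torch.rand(10)
merged = precision_weighted_merge(theta_a, theta_b, fisher_a, fisher_b)
```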
zh
[CV-18] PicoPose: Progressive Pixel-to-Pixel Correspondence Learning for Novel Object Pose Estimation
【速读】:该论文致力于解决RGB图像中未见过目标的新颖位姿估计问题(Novel Object Pose Estimation from RGB Images),尤其是在零样本泛化(zero-shot generalization)场景下,即需要估计RGB观测与训练过程中未见过物体的CAD模型之间的相对6D变换。论文的关键解决方案是提出了一种名为PicoPose的新框架,采用三阶段像素到像素对应学习过程:首先通过匹配RGB观测特征与渲染的目标模板特征,确定最佳匹配模板并建立粗略对应;其次通过全局回归2D仿射变换(包括平面内旋转、尺度和二维平移)平滑粗略对应图;最后在最佳匹配模板特征图的局部区域学习对应偏移以实现细粒度对应。通过逐步细化对应关系,PicoPose显著提高了通过PnP/RANSAC计算的目标位姿准确性,并在BOP基准的七个核心数据集上达到了最先进的性能。
链接: https://arxiv.org/abs/2504.02617
作者: Lihua Liu,Jiehong Lin,Zhenxin Liu,Kui Jia
机构: South China University of Technology (华南理工大学); The University of Hong Kong (香港大学); School of Data Science, The Chinese University of Hong Kong, Shenzhen (香港中文大学深圳数据科学学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Novel object pose estimation from RGB images presents a significant challenge for zero-shot generalization, as it involves estimating the relative 6D transformation between an RGB observation and a CAD model of an object that was not seen during training. In this paper, we introduce PicoPose, a novel framework designed to tackle this task using a three-stage pixel-to-pixel correspondence learning process. Firstly, PicoPose matches features from the RGB observation with those from rendered object templates, identifying the best-matched template and establishing coarse correspondences. Secondly, PicoPose smooths the correspondences by globally regressing a 2D affine transformation, including in-plane rotation, scale, and 2D translation, from the coarse correspondence map. Thirdly, PicoPose applies the affine transformation to the feature map of the best-matched template and learns correspondence offsets within local regions to achieve fine-grained correspondences. By progressively refining the correspondences, PicoPose significantly improves the accuracy of object poses computed via PnP/RANSAC. PicoPose achieves state-of-the-art performance on the seven core datasets of the BOP benchmark, demonstrating exceptional generalization to novel objects represented by CAD models or object reference images. Code and models are available at this https URL.
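PicoPose 第二阶段"从粗对应全局回归2D仿射变换(面内旋转、尺度、2D平移)"的几何含义,可以用 OpenCV 的鲁棒部分仿射拟合来近似体会(论文用网络直接回归,此处以 RANSAC 拟合作为功能上的替代示意;点坐标与噪声均为构造的演示数据):

```python
import numpy as np
import cv2

# 假设已有第一阶段得到的粗对应:模板图像点 src 与观测图像点 dst
rng = np.random.default_rng(0)
src = rng.uniform(0, 256, size=(100, 2)).astype(np.float32)
angle, scale, t = np.deg2rad(20), 1.3, np.array([15.0, -8.0])
R = np.array([[np.cos(angle), -np.sin(angle)],
              [np.sin(angle),  np.cos(angle)]], dtype=np.float32)
dst = (src @ (scale * R).T + t).astype(np.float32)
dst[::10] += rng.normal(0, 30, size=dst[::10].shape).astype(np.float32)  # 模拟外点

# 估计"面内旋转 + 尺度 + 2D平移"(部分仿射),同时剔除噪声对应
M, inliers = cv2.estimateAffinePartial2D(src, dst, method=cv2.RANSAC,
                                         ransacReprojThreshold=3.0)
print(M)   # 2x3 矩阵,可用于平滑粗对应图,再进入第三阶段的局部细化
```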
zh
[CV-19] Fine-Tuning Visual Autoregressive Models for Subject-Driven Generation
【速读】:该论文旨在解决基于扩散模型的文本到图像生成在主题驱动生成任务中计算开销大、推理速度慢以及生成多样性下降的问题。论文的关键在于提出了一种基于视觉自回归(Visual Autoregressive, VAR)模型的主题驱动生成方法,并通过引入选择性层调优(selective layer tuning)降低复杂度,利用先验蒸馏(prior distillation)缓解语言漂移(language drift),同时根据早期阶段对主体生成影响更大的发现,提出了尺度加权调优(scale-wise weighted tuning),以引导模型更关注与主题相关的全局信息而非局部细节。这些创新共同提升了生成性能和效率,验证了其在实际应用中的优越性。
链接: https://arxiv.org/abs/2504.02612
作者: Jiwoo Chung,Sangeek Hyun,Hyunjun Kim,Eunseo Koh,MinKyu Lee,Jae-Pil Heo
机构: Sungkyunkwan University (成均馆大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Recent advances in text-to-image generative models have enabled numerous practical applications, including subject-driven generation, which fine-tunes pretrained models to capture subject semantics from only a few examples. While diffusion-based models produce high-quality images, their extensive denoising steps result in significant computational overhead, limiting real-world applicability. Visual autoregressive (VAR) models, which predict next-scale tokens rather than spatially adjacent ones, offer significantly faster inference suitable for practical deployment. In this paper, we propose the first VAR-based approach for subject-driven generation. However, naïve fine-tuning of VAR leads to computational overhead, language drift, and reduced diversity. To address these challenges, we introduce selective layer tuning to reduce complexity and prior distillation to mitigate language drift. Additionally, we found that the early stages have a greater influence on the generation of the subject than the latter stages, which merely synthesize local details. Based on this finding, we propose scale-wise weighted tuning, which prioritizes coarser resolutions to encourage the model to focus on subject-relevant information instead of local details. Extensive experiments validate that our method significantly outperforms diffusion-based baselines across various metrics and demonstrates its practical usage.
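"尺度加权调优"的思想是:VAR 按从粗到细的尺度逐级生成,而早期(粗分辨率)阶段对主体语义影响更大,因此微调损失应向粗尺度倾斜。下面是一个损失加权方式的示意(尺度数、衰减函数与系数均为本文假设):

```python
import torch

def scale_weighted_loss(losses_per_scale, decay: float = 0.7):
    """losses_per_scale: 各尺度(由粗到细)的训练损失列表。

    对较粗的尺度赋更大权重,促使模型优先拟合主体相关的全局信息。
    """
    weights = torch.tensor([decay ** i for i in range(len(losses_per_scale))])
    weights = weights / weights.sum()
    return sum(w * l for w, l in zip(weights, losses_per_scale))

# 假设一个10级尺度的VAR,各级损失如下(仅演示)
losses = [torch.tensor(2.0 - 0.1 * i, requires_grad=True) for i in range(10)]
total = scale_weighted_loss(losses)
total.backward()
```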
zh
[CV-20] Leverag ing Sparse Annotations for Leukemia Diagnosis on the Large Leukemia Dataset
【速读】:该论文旨在解决白血病分析中因缺乏大规模、多样化的多任务数据集而导致的挑战,现有小规模数据集在领域多样性上的不足限制了其实际应用。论文的关键解决方案包括:首先,构建了一个名为Large Leukemia Dataset (LLD)的大规模白细胞(WBC)数据集,通过外周血涂片(PBF)从多位患者、多种显微镜、多摄像头及多放大倍率采集样本,并对每个白血病细胞在100倍放大下标注了7种形态学属性;其次,提出了一种多任务模型,不仅能检测WBC,还能预测其属性,提供可解释且具有临床意义的解决方案;最后,开发了一种利用稀疏标注进行WBC检测与属性分析的方法,大幅降低了血液学家的标注负担,同时提升了模型的学习效率和诊断准确性。这些方法从提升诊断可解释性到应对领域迁移挑战,为显微图像分析的多个难题提供了潜在解决方案。
链接: https://arxiv.org/abs/2504.02602
作者: Abdul Rehman,Talha Meraj,Aiman Mahmood Minhas,Ayisha Imran,Mohsen Ali,Waqas Sultani,Mubarak Shah
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Under Review
点击查看摘要
Abstract:Leukemia is the 10th most frequently diagnosed cancer and one of the leading causes of cancer-related deaths worldwide. Realistic analysis of Leukemia requires White Blood Cell (WBC) localization, classification, and morphological assessment. Despite deep learning advances in medical imaging, leukemia analysis lacks a large, diverse multi-task dataset, while existing small datasets lack domain diversity, limiting real-world applicability. To overcome dataset challenges, we present a large-scale WBC dataset named Large Leukemia Dataset (LLD) and novel methods for detecting WBC with their attributes. Our contribution here is threefold. First, we present a large-scale Leukemia dataset collected through Peripheral Blood Films (PBF) from several patients, through multiple microscopes, multiple cameras, and multiple magnifications. To enhance diagnosis explainability and medical expert acceptance, each leukemia cell is annotated at 100x with 7 morphological attributes, ranging from Cell Size to Nuclear Shape. Secondly, we propose a multi-task model that not only detects WBCs but also predicts their attributes, providing an interpretable and clinically meaningful solution. Third, we propose a method for WBC detection with attribute analysis using sparse annotations. This approach reduces the annotation burden on hematologists, requiring them to mark only a small area within the field of view. Our method enables the model to leverage the entire field of view rather than just the annotated regions, enhancing learning efficiency and diagnostic accuracy. From diagnosis explainability to overcoming domain shift challenges, the presented datasets could be used for many challenging aspects of microscopic image analysis. The datasets, code, and demo are available at: this https URL
zh
[CV-21] L-LBVC: Long-Term Motion Estimation and Prediction for Learned Bi-Directional Video Compression
【速读】:该论文旨在解决学习型双向视频压缩(Learned Bi-directional Video Compression, LBVC)在低延迟配置下的性能瓶颈问题,特别是由于不准确的长程运动估计与预测导致的性能差距,尤其是在大运动场景中。为了解决这些问题,论文提出了一个名为L-LBVC的新框架。其关键解决方案包括:首先,设计了一个自适应运动估计模块,能够同时处理短程和长程运动,通过直接估计相邻帧和小运动非相邻帧的光流,并通过递归累积相邻帧之间的局部光流来估算长程光流;其次,提出了一种自适应运动预测模块,通过在测试阶段自适应下采样参考帧以匹配训练期间观察到的运动范围,从而大幅降低运动编码的比特成本。实验表明,L-LBVC在随机接入配置下显著超越了现有的学习型视频压缩方法,甚至超过了VVC(VTM)在某些测试数据集上的表现。
链接: https://arxiv.org/abs/2504.02560
作者: Yongqi Zhai,Luyang Tang,Wei Jiang,Jiayu Yang,Ronggang Wang
机构: Guangdong Provincial Key Laboratory of Ultra High Definition Immersive Media Technology, Shenzhen Graduate School, Peking University (北京大学深圳研究生院), China; Pengcheng Laboratory (鹏城实验室), China
类目: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
备注: Accepted to 2025 Data Compression Conference (DCC)
点击查看摘要
Abstract:Recently, learned video compression (LVC) has shown superior performance under low-delay configuration. However, the performance of learned bi-directional video compression (LBVC) still lags behind traditional bi-directional coding. The performance gap mainly arises from inaccurate long-term motion estimation and prediction of distant frames, especially in large motion scenes. To solve these two critical problems, this paper proposes a novel LBVC framework, namely L-LBVC. Firstly, we propose an adaptive motion estimation module that can handle both short-term and long-term motions. Specifically, we directly estimate the optical flows for adjacent frames and non-adjacent frames with small motions. For non-adjacent frames with large motions, we recursively accumulate local flows between adjacent frames to estimate long-term flows. Secondly, we propose an adaptive motion prediction module that can largely reduce the bit cost for motion coding. To improve the accuracy of long-term motion prediction, we adaptively downsample reference frames during testing to match the motion ranges observed during training. Experiments show that our L-LBVC significantly outperforms previous state-of-the-art LVC methods and even surpasses VVC (VTM) on some test datasets under random access configuration.
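"递归累积相邻帧局部光流来估计长程光流"可以用反向变形(backward warping)组合来实现:已知 flow(a→b) 与 flow(b→c),把后者按前者的落点采样回 a 帧坐标后相加,即得 flow(a→c)。以下为PyTorch示意(函数接口与张量约定为本文假设):

```python
import torch
import torch.nn.functional as F

def compose_flows(flow_ab, flow_bc):
    """flow_*: (1, 2, H, W),像素单位。返回 a->c 的累积光流。"""
    _, _, h, w = flow_ab.shape
    ys, xs = torch.meshgrid(torch.arange(h, dtype=torch.float32),
                            torch.arange(w, dtype=torch.float32), indexing='ij')
    base = torch.stack((xs, ys), dim=0).unsqueeze(0)        # (1,2,H,W)
    coords = base + flow_ab                                 # a 中像素在 b 中的位置
    # 归一化到 [-1,1],组成 grid_sample 需要的 (1,H,W,2) 网格
    gx = 2 * coords[:, 0] / (w - 1) - 1
    gy = 2 * coords[:, 1] / (h - 1) - 1
    grid = torch.stack((gx, gy), dim=-1)
    flow_bc_at_a = F.grid_sample(flow_bc, grid, align_corners=True)
    return flow_ab + flow_bc_at_a                           # 累积得到长程光流

flow_ab = torch.zeros(1, 2, 64, 64); flow_ab[:, 0] += 2.0   # 假设a->b整体右移2像素
flow_bc = torch.zeros(1, 2, 64, 64); flow_bc[:, 0] += 3.0   # b->c再右移3像素
print(compose_flows(flow_ab, flow_bc)[:, 0].mean())         # ≈ 5.0(边界越界采样处略低)
```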
zh
[CV-22] Rip Current Segmentation: A Novel Benchmark and YOLOv8 Baseline Results CVPR2023
【速读】:该论文旨在解决海滩激流(rip current)实例分割这一全新任务,以实现这些危险表面水流的自动检测。论文的关键在于构建了一个包含2,466张图像的综合数据集,并为实例分割提供了新创建的多边形标注用于训练和验证;同时引入了一个包含17段无人机视频(约24K帧,30 FPS)的新数据集,视频数据不仅包含用于实例分割的多边形标注,还包含用于目标检测的边界框标注,用于测试。为了解决该问题,论文训练了多种版本的YOLOv8模型进行静态图像的实例分割,并在测试数据集(视频)上评估其性能。其中,YOLOv8-nano模型在验证数据集上的mAP50达到88.94%,在测试数据集上的宏平均值为81.21%,表现最佳。这一工作通过提供详细标注的数据集和针对激流实例分割的深度学习模型,为后续研究奠定了基础。
链接: https://arxiv.org/abs/2504.02558
作者: Andrei Dumitriu,Florin Tatui,Florin Miron,Radu Tudor Ionescu,Radu Timofte
机构: Computer Vision Lab, CAIDAS & IFI, University of Würzburg (维尔茨堡大学); University of Bucharest (布加勒斯特大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted at CVPR 2023 NTIRE Workshop
点击查看摘要
Abstract:Rip currents are the leading cause of fatal accidents and injuries on many beaches worldwide, emphasizing the importance of automatically detecting these hazardous surface water currents. In this paper, we address a novel task: rip current instance segmentation. We introduce a comprehensive dataset containing 2,466 images with newly created polygonal annotations for instance segmentation, used for training and validation. Additionally, we present a novel dataset comprising 17 drone videos (totaling about 24K frames) captured at 30 FPS, annotated with both polygons for instance segmentation and bounding boxes for object detection, employed for testing purposes. We train various versions of YOLOv8 for instance segmentation on static images and assess their performance on the test dataset (videos). The best results were achieved by the YOLOv8-nano model (runnable on a portable device), with an mAP50 of 88.94% on the validation dataset and 81.21% macro average on the test dataset. The results provide a baseline for future research in rip current segmentation. Our work contributes to the existing literature by introducing a detailed, annotated dataset, and training a deep learning model for instance segmentation of rip currents. The code, training details and the annotated dataset are made publicly available at this https URL.
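论文的基线基于 Ultralytics YOLOv8 实例分割模型。复现这类训练通常只需几行调用;以下片段中的数据配置文件路径、图像文件名与超参数均为假设示例,实际设置以论文及其公开代码为准:

```python
from ultralytics import YOLO

# 加载预训练的 YOLOv8-nano 分割权重(论文中表现最佳且可在便携设备上运行)
model = YOLO("yolov8n-seg.pt")

# rip_currents.yaml 为假设的数据配置文件,指向带多边形标注的训练/验证集
model.train(data="rip_currents.yaml", epochs=100, imgsz=640, batch=16)

metrics = model.val()                                        # 得到 mAP50 等验证指标
results = model.predict("beach_drone_frame.jpg", conf=0.25)  # 对无人机帧做推理
```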
zh
[CV-23] Noise Calibration and Spatial-Frequency Interactive Network for STEM Image Enhancement CVPR2025
【速读】:该论文旨在解决扫描透射电子显微镜(STEM)图像因噪声、电子束损伤及样品厚度等因素导致的原子级成像困难问题,并提升现有增强方法在频率域特征利用及数据集真实性和通用性方面的不足。论文的关键解决方案包括:首先提出一种STEM噪声校准方法,通过统计分析真实含原子的STEM图像参数(背景噪声、扫描噪声及点噪声),合成更真实的图像;其次构建一个包含规则与随机原子排列且涵盖HAADF和BF模式的更通用数据集;最后设计一种空间-频率交互网络,利用原子排布周期性在频域中探索信息。实验结果表明,所提方法生成的数据更接近真实STEM图像,并结合网络实现了更好的图像增强性能。
链接: https://arxiv.org/abs/2504.02555
作者: Hesong Li,Ziqi Wu,Ruiwen Shao,Tao Zhang,Ying Fu
机构: Beijing Institute of Technology (北京理工大学); Hangzhou Dianzi University (杭州电子科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Acceped by CVPR2025
点击查看摘要
Abstract:Scanning Transmission Electron Microscopy (STEM) enables the observation of atomic arrangements at sub-angstrom resolution, allowing for atomically resolved analysis of the physical and chemical properties of materials. However, due to the effects of noise, electron beam damage, sample thickness, etc., obtaining satisfactory atomic-level images is often challenging. Enhancing STEM images can reveal clearer structural details of materials. Nonetheless, existing STEM image enhancement methods usually overlook unique features in the frequency domain, and existing datasets lack realism and generality. To resolve these issues, in this paper, we develop noise calibration, data synthesis, and enhancement methods for STEM images. We first present a STEM noise calibration method, which is used to synthesize more realistic STEM images. The parameters of background noise, scan noise, and pointwise noise are obtained by statistical analysis and fitting of real STEM images containing atoms. Then we use these parameters to develop a more general dataset that considers both regular and random atomic arrangements and includes both HAADF and BF mode images. Finally, we design a spatial-frequency interactive network for STEM image enhancement, which can explore the information in the frequency domain formed by the periodicity of atomic arrangement. Experimental results show that our data is closer to real STEM images and achieves better enhancement performances together with our network. Code will be available at this https URL.
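噪声校准的产物是一组统计参数(背景噪声、扫描噪声、点噪声),用于把干净的模拟原子图像合成为逼真的STEM图像。下面是这一合成思路的NumPy示意(三类噪声的具体函数形式与参数均为本文假设,论文是通过对真实图像做统计拟合得到的):

```python
import numpy as np

def synthesize_stem(clean: np.ndarray, bg_sigma=0.02, scan_sigma=0.05,
                    point_rate=5e-4, rng=None) -> np.ndarray:
    """clean: [0,1] 范围的干净原子图像。返回叠加三类噪声后的合成图。"""
    rng = rng or np.random.default_rng(0)
    h, w = clean.shape
    noisy = clean + rng.normal(0, bg_sigma, size=(h, w))   # 背景噪声:全图高斯
    noisy += rng.normal(0, scan_sigma, size=(h, 1))        # 扫描噪声:逐行偏移
    mask = rng.random((h, w)) < point_rate                 # 点噪声:稀疏脉冲
    noisy[mask] = rng.random(mask.sum())
    return np.clip(noisy, 0.0, 1.0)

# 干净图像示意:一个周期性"原子"点阵的高斯斑
yy, xx = np.mgrid[0:256, 0:256]
clean = sum(np.exp(-((xx - cx) ** 2 + (yy - cy) ** 2) / 18.0)
            for cx in range(16, 256, 32) for cy in range(16, 256, 32))
synthetic = synthesize_stem(np.clip(clean, 0, 1))
```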
zh
[CV-24] MAD: Makeup All-in-One with Cross-Domain Diffusion Model
【速读】:该论文旨在解决现有美妆技术因需设计多个模型以处理不同输入并跨域对齐特征,从而导致复杂性增加的问题。此外,还尝试弥补无文本引导美妆试穿(text-guided makeup try-on)的空白,这种形式更友好且无需参考图像。论文的关键解决方案在于提出了一种单一模型用于多种美妆任务的方法,具体通过将不同美妆任务形式化为跨域翻译,并利用跨域扩散模型完成所有任务。不同于依赖独立编码器-解码器配置或基于循环机制的现有方法,本文建议使用不同的领域嵌入来促进领域控制,仅通过改变嵌入即可实现无缝的领域切换,从而减少对额外模块的依赖。此外,为了支持精确的文本到美妆应用,作者通过扩展MT数据集并添加文本注释引入了MT-Text数据集,推动了美妆技术的实际应用。
链接: https://arxiv.org/abs/2504.02545
作者: Bo-Kai Ruan,Hong-Han Shuai
机构: National Yang Ming Chiao Tung University (国立阳明交通大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project page: this https URL
点击查看摘要
Abstract:Existing makeup techniques often require designing multiple models to handle different inputs and align features across domains for different makeup tasks, e.g., beauty filter, makeup transfer, and makeup removal, leading to increased complexity. Another limitation is the absence of text-guided makeup try-on, which is more user-friendly without needing reference images. In this study, we make the first attempt to use a single model for various makeup tasks. Specifically, we formulate different makeup tasks as cross-domain translations and leverage a cross-domain diffusion model to accomplish all tasks. Unlike existing methods that rely on separate encoder-decoder configurations or cycle-based mechanisms, we propose using different domain embeddings to facilitate domain control. This allows for seamless domain switching by merely changing embeddings with a single model, thereby reducing the reliance on additional modules for different tasks. Moreover, to support precise text-to-makeup applications, we introduce the MT-Text dataset by extending the MT dataset with textual annotations, advancing the practicality of makeup technologies.
zh
[CV-25] Audio-visual Controlled Video Diffusion with Masked Selective State Spaces Modeling for Natural Talking Head Generation
【速读】:该论文旨在解决现有虚拟头像和人机交互中 Talking Head 合成方法受限于单一主要模态控制的问题,限制了其实用性。为了解决这一问题,论文提出了 \textbfACTalker,这是一种支持多信号控制和单信号控制的端到端视频扩散框架,用于生成逼真的 Talking Head 视频。解决方案的关键在于设计了一个具有多个分支的平行 Mamba 结构,每个分支利用独立的驱动信号控制特定的面部区域,并通过门机制实现灵活的视频生成控制。此外,Mamba 结构允许驱动信号在时空维度上协调特征标记,同时引入掩码丢弃策略以避免控制冲突,从而确保生成视频的自然性和一致性。
链接: https://arxiv.org/abs/2504.02542
作者: Fa-Ting Hong,Zunnan Xu,Zixiang Zhou,Jun Zhou,Xiu Li,Qin Lin,Qinglin Lu,Dan Xu
机构: HKUST (香港科技大学); Tencent (腾讯); Tsinghua University (清华大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Talking head synthesis is vital for virtual avatars and human-computer interaction. However, most existing methods are typically limited to accepting control from a single primary modality, restricting their practical utility. To this end, we introduce \textbfACTalker, an end-to-end video diffusion framework that supports both multi-signals control and single-signal control for talking head video generation. For multiple control, we design a parallel mamba structure with multiple branches, each utilizing a separate driving signal to control specific facial regions. A gate mechanism is applied across all branches, providing flexible control over video generation. To ensure natural coordination of the controlled video both temporally and spatially, we employ the mamba structure, which enables driving signals to manipulate feature tokens across both dimensions in each branch. Additionally, we introduce a mask-drop strategy that allows each driving signal to independently control its corresponding facial region within the mamba structure, preventing control conflicts. Experimental results demonstrate that our method produces natural-looking facial videos driven by diverse signals and that the mamba layer seamlessly integrates multiple driving modalities without conflict.
zh
[CV-26] A Sensorimotor Vision Transformer
【速读】:该论文旨在解决传统视觉模型在处理图像时内存消耗大、计算复杂度高的问题,同时探索如何通过借鉴生物视觉系统的高效信息处理机制来提升模型的资源利用效率。论文提出的解决方案核心在于引入一种受人类扫视运动(saccadic eye movements)启发的Sensorimotor Transformer (SMT) 架构,其关键创新点是通过识别并选择图像中最显著的区域(salient patches),而非均匀处理所有图像块。这一机制基于二维内在特征(如角点和遮挡)来确定高信息量区域,与人类视觉注意模式相一致。通过仅对这些最具信息量的区域进行处理,SMT不仅显著降低了内存使用,还减少了计算复杂度,尤其在限制选取的patch数量时表现尤为突出。这种“扫视式”选择机制为基于Transformer的视觉模型提供了一种高效的替代方案,并为资源受限应用中的生物启发架构设计提供了新思路。
链接: https://arxiv.org/abs/2504.02536
作者: Konrad Gadzicki,Kerstin Schill,Christoph Zetzsche
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 14 pages, 5 figures
点击查看摘要
Abstract:This paper presents the Sensorimotor Transformer (SMT), a vision model inspired by human saccadic eye movements that prioritize high-saliency regions in visual input to enhance computational efficiency and reduce memory consumption. Unlike traditional models that process all image patches uniformly, SMT identifies and selects the most salient patches based on intrinsic two-dimensional (i2D) features, such as corners and occlusions, which are known to convey high-information content and align with human fixation patterns. The SMT architecture uses this biological principle to leverage vision transformers to process only the most informative patches, allowing for a substantial reduction in memory usage that scales with the sequence length of selected patches. This approach aligns with visual neuroscience findings, suggesting that the human visual system optimizes information gathering through selective, spatially dynamic focus. Experimental evaluations on Imagenet-1k demonstrate that SMT achieves competitive top-1 accuracy while significantly reducing memory consumption and computational complexity, particularly when a limited number of patches is used. This work introduces a saccade-like selection mechanism into transformer-based vision models, offering an efficient alternative for image analysis and providing new insights into biologically motivated architectures for resource-constrained applications.
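SMT 的"扫视式"patch选择可以用一个简化流程体会:先以结构张量(Harris角点响应)作为 i2D 特征(角点、遮挡)显著性的代理,再只保留得分最高的 K 个 patch 送入 ViT。以下为示意(显著性度量与 K 值为本文假设,论文使用内在二维特征本身):

```python
import torch
import torch.nn.functional as F

def top_k_salient_patches(img: torch.Tensor, patch: int = 16, k: int = 64):
    """img: (1, 1, H, W) 灰度图。返回得分最高的k个patch及其索引。"""
    # Sobel梯度 -> 结构张量分量 -> Harris响应,作为 i2D 特征的代理
    kx = torch.tensor([[[[-1., 0, 1], [-2, 0, 2], [-1, 0, 1]]]])
    ix = F.conv2d(img, kx, padding=1)
    iy = F.conv2d(img, kx.transpose(2, 3), padding=1)
    ixx, iyy, ixy = ix * ix, iy * iy, ix * iy
    harris = (ixx * iyy - ixy ** 2) - 0.05 * (ixx + iyy) ** 2
    # 按patch汇总显著性分数,选出信息量最高的K个patch
    scores = F.avg_pool2d(harris, patch).flatten(1)      # (1, num_patches)
    topk = scores.topk(k, dim=1).indices                 # 被"注视"的patch索引
    patches = F.unfold(img, patch, stride=patch)         # (1, patch*patch, N)
    return patches[:, :, topk[0]], topk

img = torch.rand(1, 1, 224, 224)
selected, idx = top_k_salient_patches(img)
print(selected.shape)   # torch.Size([1, 256, 64]) — 只处理64个patch而非196个
```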
zh
[CV-27] Delineate Anything: Resolution-Agnostic Field Boundary Delineation on Satellite Imagery
【速读】:该论文旨在解决农业地块边界从卫星图像中精确定界的问题,这一问题是土地管理和作物监测的关键。当前方法面临数据集规模有限、分辨率差异以及环境多样性等挑战。为应对这些问题,论文将任务重新定义为实例分割,并引入了Field Boundary Instance Segmentation - 22M (FBIS-22M) 数据集,这是一个包含672,909个高分辨率卫星图像块(分辨率为0.25米到10米)和22,926,427个单独地块实例掩码的大规模多分辨率数据集,显著缩小了农业领域数据集与计算机视觉其他领域之间的差距。论文进一步提出了一种名为Delineate Anything的实例分割模型,该模型基于新的FBIS-22M数据集进行训练。所提出的模型在mAP@0.5指标上提升了88.5%,在mAP@0.5:0.95指标上提升了103%,相比现有方法实现了显著性能提升,同时展示了更快的推理速度和对不同图像分辨率及未见过地理区域的强大零样本泛化能力。关键在于通过创建大规模高质量的数据集FBIS-22M以及开发适用于多种场景的高效实例分割模型Delineate Anything来克服现有技术的局限性。
链接: https://arxiv.org/abs/2504.02534
作者: Mykola Lavreniuk,Nataliia Kussul,Andrii Shelestov,Bohdan Yailymov,Yevhenii Salii,Volodymyr Kuzin,Zoltan Szantoi
机构: European Space Agency (欧洲航天局); Space Research Institute NASU-SSAU (乌克兰国家科学院与乌克兰社会主义联盟航天研究所); University of Maryland (马里兰大学); National Technical University of Ukraine “Igor Sikorsky Kyiv Polytechnic Institute (乌克兰基辅理工学院 Igor Sikorsky)”
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:The accurate delineation of agricultural field boundaries from satellite imagery is vital for land management and crop monitoring. However, current methods face challenges due to limited dataset sizes, resolution discrepancies, and diverse environmental conditions. We address this by reformulating the task as instance segmentation and introducing the Field Boundary Instance Segmentation - 22M dataset (FBIS-22M), a large-scale, multi-resolution dataset comprising 672,909 high-resolution satellite image patches (ranging from 0.25 m to 10 m) and 22,926,427 instance masks of individual fields, significantly narrowing the gap between agricultural datasets and those in other computer vision domains. We further propose Delineate Anything, an instance segmentation model trained on our new FBIS-22M dataset. Our proposed model sets a new state-of-the-art, achieving a substantial improvement of 88.5% in mAP@0.5 and 103% in mAP@0.5:0.95 over existing methods, while also demonstrating significantly faster inference and strong zero-shot generalization across diverse image resolutions and unseen geographic regions. Code, pre-trained models, and the FBIS-22M dataset are available at this https URL.
zh
[CV-28] SelfMedHPM: Self Pre-training With Hard Patches Mining Masked Autoencoders For Medical Image Segmentation
【速读】:该论文旨在解决基于掩码图像建模(Masked Image Modeling, MIM)的CT多器官分割方法非常有限的问题,并指出现有利用MAE进行CT多器官分割的方法未能有效识别重建中最困难的区域。为了解决这一问题,论文提出了一种名为自监督硬块挖掘掩码自动编码器的MIM自训练框架(selfMedHPM)。该框架的关键在于通过在目标数据集的训练集上进行Vision Transformer(ViT)的自预训练,并引入辅助损失预测器,该预测器首先预测补丁的损失以确定下一个掩码的位置。这种方法在腹部CT多器官分割和全身CT多器官分割任务中表现优于多种竞争方法,并已在BTCV数据集和SMWB数据集上验证了其性能。
链接: https://arxiv.org/abs/2504.02524
作者: Yunhao Lv,Lingyu Chen,Jian Wang,Yangxi Li,Fang Chen
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: arXiv admin note: text overlap with arXiv:2304.05919 by other authors
点击查看摘要
Abstract:In recent years, deep learning methods such as convolutional neural networks (CNNs) and transformers have made significant progress in CT multi-organ segmentation. However, CT multi-organ segmentation methods based on masked image modeling (MIM) are very limited. While there are already methods using MAE for the CT multi-organ segmentation task, we believe that the existing methods do not identify the most difficult areas to reconstruct. To this end, we propose a MIM self-training framework with hard patches mining masked autoencoders for CT multi-organ segmentation tasks (selfMedHPM). The method performs ViT self-pretraining on the training set of the target data and introduces an auxiliary loss predictor, which first predicts the patch loss and determines the location of the next mask. SelfMedHPM performs better than various competitive methods in abdominal CT multi-organ segmentation and body CT multi-organ segmentation. We have validated the performance of our method on the Multi Atlas Labeling Beyond The Cranial Vault (BTCV) dataset for abdominal multi-organ segmentation and the SinoMed Whole Body (SMWB) dataset for body multi-organ segmentation tasks.
zh
[CV-29] Charm: The Missing Piece in ViT fine-tuning for Image Aesthetic Assessment CVPR2025
【速读】:该论文旨在解决视觉Transformer (Vision Transformers, ViTs) 在处理可变尺寸输入时面临的计算复杂度高和批处理限制问题,现有方法通常通过下采样或裁剪将图像固定为小尺寸,这会导致显著的信息丢失,特别是在图像美学评估等任务中。为了解决这一问题,论文提出了一种名为Charm的新颖标记化方法,其关键是同时保留构图(Composition)、高分辨率(High-resolution)、宽高比(Aspect Ratio)和多尺度(Multi-scale)信息。Charm通过在特定区域优先保留高分辨率细节,同时对其他部分进行下采样,从而生成较短的固定长度输入序列,同时包含关键信息。此外,Charm设计与预训练ViTs及其学习到的位置嵌入兼容,并通过引入多尺度输入和多样化标记避免裁剪或改变宽高比以进一步保留信息。实验结果表明,使用轻量级ViT主干,在多种图像美学和质量评估数据集上的性能提升了高达8.1%。
链接: https://arxiv.org/abs/2504.02522
作者: Fatemeh Behrad,Tinne Tuytelaars,Johan Wagemans
机构: KU Leuven University (鲁汶大学), Belgium
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: CVPR 2025
点击查看摘要
Abstract:The capacity of Vision transformers (ViTs) to handle variable-sized inputs is often constrained by computational complexity and batch processing limitations. Consequently, ViTs are typically trained on small, fixed-size images obtained through downscaling or cropping. While reducing computational burden, these methods result in significant information loss, negatively affecting tasks like image aesthetic assessment. We introduce Charm, a novel tokenization approach that preserves Composition, High-resolution, Aspect Ratio, and Multi-scale information simultaneously. Charm prioritizes high-resolution details in specific regions while downscaling others, enabling shorter fixed-size input sequences for ViTs while incorporating essential information. Charm is designed to be compatible with pre-trained ViTs and their learned positional embeddings. By providing multiscale input and introducing variety to input tokens, Charm improves ViT performance and generalizability for image aesthetic assessment. We avoid cropping or changing the aspect ratio to further preserve information. Extensive experiments demonstrate significant performance improvements on various image aesthetic and quality assessment datasets (up to 8.1%) using a lightweight ViT backbone. Code and pre-trained models are available at this https URL.
zh
[CV-30] Data-Driven Object Tracking: Integrating Modular Neural Networks into a Kalman Framework
【速读】:该论文致力于解决多目标跟踪(Multi-Object Tracking, MOT)在高级驾驶辅助系统(Advanced Driver Assistance Systems, ADAS)中日益增长的复杂性和精度需求。为应对这一挑战,论文提出了一种创新的机器学习(Machine Learning, ML)方法,通过设计三个神经网络(Neural Network, NN)模型来分别解决MOT中的关键问题:单预测网络(Single-Prediction Network, SPENT)用于轨迹预测;单关联网络(Single-Association Network, SANT)用于将单个传感器对象(Sensor Object, SO)与现有轨迹进行匹配;多关联网络(Multi-Association Network, MANTa)用于将多个传感器对象与多个轨迹进行关联。这些模型被无缝集成到传统的卡尔曼滤波器(Kalman Filter, KF)框架中,通过替换相关组件实现了系统的模块化,同时保持整体架构不受影响。关键在于所有网络均能在实时嵌入式环境中运行,并且每个网络的可训练参数少于50k,从而确保了高效性和实用性。
链接: https://arxiv.org/abs/2504.02519
作者: Christian Alexander Holz,Christian Bader,Markus Enzweiler,Matthias Drüppel
机构: Daimler Truck AG (戴姆勒卡车公司), Research and Advanced Development (研究与先进开发)(斯图加特, 德国); Institute for Intelligent Systems (智能系统研究所), University of Applied Sciences (应用科学大学), Esslingen (埃斯林根)(德国); Center for Artificial Intelligence (人工智能中心), Baden-Württemberg Cooperative State University (DHBW) (巴登-符腾堡合作州立大学)(斯图加特, 德国)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:This paper presents novel Machine Learning (ML) methodologies for Multi-Object Tracking (MOT), specifically designed to meet the increasing complexity and precision demands of Advanced Driver Assistance Systems (ADAS). We introduce three Neural Network (NN) models that address key challenges in MOT: (i) the Single-Prediction Network (SPENT) for trajectory prediction, (ii) the Single-Association Network (SANT) for mapping individual Sensor Object (SO) to existing tracks, and (iii) the Multi-Association Network (MANTa) for associating multiple SOs to multiple tracks. These models are seamlessly integrated into a traditional Kalman Filter (KF) framework, maintaining the system’s modularity by replacing relevant components without disrupting the overall architecture. Importantly, all three networks are designed to be run in a realtime, embedded environment. Each network contains less than 50k trainable parameters. Our evaluation, conducted on the public KITTI tracking dataset, demonstrates significant improvements in tracking performance. SPENT reduces the Root Mean Square Error (RMSE) by 50% compared to a standard KF, while SANT and MANTa achieve up to 95% accuracy in sensor object-to-track assignments. These results underscore the effectiveness of incorporating task-specific NNs into traditional tracking systems, boosting performance and robustness while preserving modularity, maintainability, and interpretability.
zh
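作为概念示意,下面的 PyTorch 片段展示了"用可学习网络替换卡尔曼滤波的状态预测步、同时保留经典更新步"这一模块化集成思路(对应 SPENT 在框架中的角色);状态维度、网络结构均为假设,且为简洁起见省略了预测协方差的传播。

```python
import torch
import torch.nn as nn

class LearnedPredictor(nn.Module):
    """示意:用小型网络替换卡尔曼滤波的状态预测步(类似 SPENT 的角色)。"""
    def __init__(self, state_dim=4):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim, 32), nn.ReLU(),
                                 nn.Linear(32, state_dim))

    def forward(self, x):
        return x + self.net(x)   # 以残差形式预测下一时刻状态

def kf_update(x_pred, P_pred, z, H, R):
    """经典卡尔曼更新步保持不变,仅预测步被网络替换(协方差传播从略)。"""
    S = H @ P_pred @ H.T + R
    K = P_pred @ H.T @ torch.linalg.inv(S)
    x_new = x_pred + K @ (z - H @ x_pred)
    P_new = (torch.eye(P_pred.shape[0]) - K @ H) @ P_pred
    return x_new, P_new

# 用法示意:状态 [x, y, vx, vy],观测 [x, y]
predictor = LearnedPredictor()
x, P = torch.zeros(4), torch.eye(4)
H = torch.tensor([[1., 0., 0., 0.], [0., 1., 0., 0.]])
R = 0.1 * torch.eye(2)
z = torch.tensor([1.0, 0.5])
x_pred = predictor(x)
x, P = kf_update(x_pred, P, z, H, R)
```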
[CV-31] MultiNeRF: Multiple Watermark Embedding for Neural Radiance Fields
【速读】:该论文试图解决在单个神经辐射场(Neural Radiance Field, NeRF)模型渲染的图像中嵌入多个独立水印的同时保持高视觉质量的问题。论文的关键解决方案在于通过在TensoRF NeRF模型的基础上引入专用的水印网格,并结合基于FiLM的条件调制机制,实现了动态激活输入标识对应的水印。这种方法不仅显著提升了水印容量,还避免了水印信号与场景内容的纠缠,同时无需重新训练模型即可实现多水印的嵌入与提取,从而提供了一个可扩展的三维内容保护方案。
链接: https://arxiv.org/abs/2504.02517
作者: Yash Kulthe,Andrew Gilbert,John Collomosse
机构: CVSSP, University of Surrey (CVSSP, 萨里大学); Adobe Research (Adobe 研究院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:We present MultiNeRF, a 3D watermarking method that embeds multiple uniquely keyed watermarks within images rendered by a single Neural Radiance Field (NeRF) model, whilst maintaining high visual quality. Our approach extends the TensoRF NeRF model by incorporating a dedicated watermark grid alongside the existing geometry and appearance grids. This extension ensures higher watermark capacity without entangling watermark signals with scene content. We propose a FiLM-based conditional modulation mechanism that dynamically activates watermarks based on input identifiers, allowing multiple independent watermarks to be embedded and extracted without requiring model retraining. MultiNeRF is validated on the NeRF-Synthetic and LLFF datasets, with statistically significant improvements in robust capacity without compromising rendering quality. By generalizing single-watermark NeRF methods into a flexible multi-watermarking framework, MultiNeRF provides a scalable solution for 3D content attribution.
zh
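下面用一个简短的 PyTorch 示意说明基于 FiLM 的条件调制如何依据输入标识符动态激活不同水印:标识符经嵌入后生成缩放与平移参数 (gamma, beta),逐通道调制水印网格特征。其中的维度与嵌入方式均为示意性假设,并非 MultiNeRF 的原始实现。

```python
import torch
import torch.nn as nn

class FiLMWatermark(nn.Module):
    """示意:由水印标识符生成 FiLM 参数 (gamma, beta),调制水印网格特征。"""
    def __init__(self, num_watermarks=8, feat_dim=64):
        super().__init__()
        self.id_embed = nn.Embedding(num_watermarks, 32)
        self.to_gamma_beta = nn.Linear(32, feat_dim * 2)

    def forward(self, feats, wm_id):                    # feats: (B, N, feat_dim)
        gb = self.to_gamma_beta(self.id_embed(wm_id))   # (B, feat_dim * 2)
        gamma, beta = gb.chunk(2, dim=-1)
        return gamma.unsqueeze(1) * feats + beta.unsqueeze(1)

feats = torch.randn(2, 1024, 64)       # 采样点处的水印网格特征
wm_id = torch.tensor([0, 3])           # 不同标识符激活不同水印
out = FiLMWatermark()(feats, wm_id)
```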
[CV-32] Exploration-Driven Generative Interactive Environments CVPR2025
【速读】:该论文旨在解决现代世界模型训练中因需要昂贵且耗时的人类动作演示或特定环境代理的数据收集而产生的高成本问题。论文提出了一种仅依赖虚拟环境中随机代理进行训练的框架,以简化数据收集过程并降低训练成本。然而,这种方法受限于随机探索的局限性。为克服这一限制,论文的关键解决方案是引入AutoExplore Agent(自动探索代理),该代理完全基于世界模型的不确定性进行探索,从而生成多样化的训练数据,使模型能够学习到更优的行为策略。此外,该代理与特定环境奖励无关,具备快速适应新环境的能力。论文还通过构建RetroAct数据集和改进Genie模型(GenieRedux-G)进一步支持这一方法,实现了多环境模型在新环境中的快速迁移适配及视频保真度与可控性的提升。
链接: https://arxiv.org/abs/2504.02515
作者: Nedko Savov,Naser Kazemi,Mohammad Mahdi,Danda Pani Paudel,Xi Wang,Luc Van Gool
机构: INSAIT, Sofia University “St. Kliment Ohridski” (索非亚大学); ETH Zurich (瑞士苏黎世联邦理工学院); TU Munich (慕尼黑工业大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at CVPR 2025
点击查看摘要
Abstract:Modern world models require costly and time-consuming collection of large video datasets with action demonstrations by people or by environment-specific agents. To simplify training, we focus on using many virtual environments for inexpensive, automatically collected interaction data. Genie, a recent multi-environment world model, demonstrates simulation abilities of many environments with shared behavior. Unfortunately, training their model requires expensive demonstrations. Therefore, we propose a training framework merely using a random agent in virtual environments. While the model trained in this manner exhibits good controls, it is limited by the random exploration possibilities. To address this limitation, we propose AutoExplore Agent - an exploration agent that entirely relies on the uncertainty of the world model, delivering diverse data from which it can learn the best. Our agent is fully independent of environment-specific rewards and thus adapts easily to new environments. With this approach, the pretrained multi-environment model can quickly adapt to new environments achieving video fidelity and controllability improvement. In order to obtain automatically large-scale interaction datasets for pretraining, we group environments with similar behavior and controls. To this end, we annotate the behavior and controls of 974 virtual environments - a dataset that we name RetroAct. For building our model, we first create an open implementation of Genie - GenieRedux and apply enhancements and adaptations in our version GenieRedux-G. Our code and data are available at this https URL.
zh
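关于"完全基于世界模型不确定性进行探索"这一思想,下面给出一个简化示意:用小型集成模型的预测方差近似不确定性,并选择使模型分歧最大的动作。集成规模、网络结构与动作表示均为假设,仅用于说明 AutoExplore Agent 的选择准则,而非其原始实现。

```python
import torch
import torch.nn as nn

class EnsembleWorldModel(nn.Module):
    """示意:以小型集成模型的预测方差近似世界模型的不确定性。"""
    def __init__(self, obs_dim=16, act_dim=4, k=5):
        super().__init__()
        self.members = nn.ModuleList(
            nn.Sequential(nn.Linear(obs_dim + act_dim, 64), nn.ReLU(),
                          nn.Linear(64, obs_dim)) for _ in range(k))

    def uncertainty(self, obs, actions):
        # actions: (A, act_dim),对每个候选动作评估集成分歧
        x = torch.cat([obs.expand(actions.shape[0], -1), actions], dim=-1)
        preds = torch.stack([m(x) for m in self.members])   # (k, A, obs_dim)
        return preds.var(dim=0).mean(dim=-1)                # (A,)

# AutoExplore 式策略:选择让世界模型最"吃不准"的动作
model = EnsembleWorldModel()
obs = torch.randn(16)
candidates = torch.eye(4)            # 4 个 one-hot 候选动作
a = candidates[model.uncertainty(obs, candidates).argmax()]
```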
[CV-33] Towards Generalizing Temporal Action Segmentation to Unseen Views
【速读】:该论文致力于解决跨未见视角(unseen view)的动作分割问题,即在训练阶段无法获取用于评估模型的相机视角的情况下,如何使动作分割模型泛化至全新的视角(如从顶部前视图切换到侧面视图,甚至是从外视角(exocentric view)切换到内视角(egocentric view))。论文的关键解决方案在于提出一种利用序列级和片段级共享表征的方法,通过引入序列损失(sequence loss)和动作损失(action loss),在不同视角下促进视频和动作表示的一致性,从而减轻视角差异对模型性能的影响。实验结果表明,该方法在Assembly101、IkeaASM和EgoExoLearn数据集上的未见外视角和未见内视角任务中分别实现了12.8%和54%的F1@50指标提升。
链接: https://arxiv.org/abs/2504.02512
作者: Emad Bahrami,Olga Zatsarynna,Gianpiero Francesca,Juergen Gall
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:While there has been substantial progress in temporal action segmentation, the challenge to generalize to unseen views remains unaddressed. Hence, we define a protocol for unseen view action segmentation where camera views for evaluating the model are unavailable during training. This includes changing from top-frontal views to a side view or even more challenging from exocentric to egocentric views. Furthermore, we present an approach for temporal action segmentation that tackles this challenge. Our approach leverages a shared representation at both the sequence and segment levels to reduce the impact of view differences during training. We achieve this by introducing a sequence loss and an action loss, which together facilitate consistent video and action representations across different views. The evaluation on the Assembly101, IkeaASM, and EgoExoLearn datasets demonstrate significant improvements, with a 12.8% increase in F1@50 for unseen exocentric views and a substantial 54% improvement for unseen egocentric views.
zh
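下面的片段以余弦一致性为例,示意"序列损失 + 动作损失"如何约束同一视频及同一动作片段在不同视角下的表征保持一致;论文中损失的具体形式可能不同,此处仅为概念性假设。

```python
import torch
import torch.nn.functional as F

def sequence_loss(feat_view_a, feat_view_b):
    """序列级一致性:同一视频在两个视角下的整体表征应接近(余弦距离)。"""
    a = F.normalize(feat_view_a, dim=-1)
    b = F.normalize(feat_view_b, dim=-1)
    return 1 - (a * b).sum(-1).mean()

def action_loss(seg_feats_a, seg_feats_b):
    """片段级一致性:同一动作片段在不同视角下的表征应接近。"""
    a = F.normalize(seg_feats_a, dim=-1)
    b = F.normalize(seg_feats_b, dim=-1)
    return 1 - (a * b).sum(-1).mean()

# 用法示意:8 个视频、每视频 12 个动作片段、256 维特征
seq_a, seq_b = torch.randn(8, 256), torch.randn(8, 256)
seg_a, seg_b = torch.randn(8, 12, 256), torch.randn(8, 12, 256)
loss = sequence_loss(seq_a, seq_b) + action_loss(seg_a, seg_b)
```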
[CV-34] APHQ-ViT: Post-Training Quantization with Average Perturbation Hessian Based Reconstruction for Vision Transformers CVPR2025
【速读】:该论文旨在解决视觉Transformer(Vision Transformers, ViTs)在超低比特后训练量化(Post-Training Quantization, PTQ)中显著精度下降的问题,尤其是在量化后激活函数GELU时面临的输出重要性估计不准确及量化精度严重退化等挑战。论文的关键创新在于提出了一种基于平均扰动Hessian(Average Perturbation Hessian, APH)的新型PTQ方法——APHQ-ViT。其核心解决方案包括:(1) 提出改进的平均扰动Hessian损失函数以更准确地估计量化的输出重要性;(2) 设计MLP重建(MLP Reconstruction, MR)方法,通过用ReLU替换MLP中的GELU,并利用APH损失在小规模未标注校准集上进行重建,有效应对GELU激活函数的量化难题。实验结果表明,该方法在3-bit和4-bit量化条件下显著优于现有PTQ技术。
链接: https://arxiv.org/abs/2504.02508
作者: Zhuguanyu Wu,Jiayi Zhang,Jiaxin Chen,Jinyang Guo,Di Huang,Yunhong Wang
机构: State Key Laboratory of Virtual Reality Technology and Systems, Beihang University (北航), China; School of Computer Science and Engineering, Beihang University (北航), Beijing, China; School of Artificial Intelligence, Beihang University (北航), Beijing, China
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to CVPR 2025
点击查看摘要
Abstract:Vision Transformers (ViTs) have become one of the most commonly used backbones for vision tasks. Despite their remarkable performance, they often suffer significant accuracy drops when quantized for practical deployment, particularly by post-training quantization (PTQ) under ultra-low bits. Recently, reconstruction-based PTQ methods have shown promising performance in quantizing Convolutional Neural Networks (CNNs). However, they fail when applied to ViTs, primarily due to the inaccurate estimation of output importance and the substantial accuracy degradation in quantizing post-GELU activations. To address these issues, we propose APHQ-ViT, a novel PTQ approach based on importance estimation with Average Perturbation Hessian (APH). Specifically, we first thoroughly analyze the current approximation approaches with Hessian loss, and propose an improved average perturbation Hessian loss. To deal with the quantization of the post-GELU activations, we design an MLP Reconstruction (MR) method by replacing the GELU function in MLP with ReLU and reconstructing it by the APH loss on a small unlabeled calibration set. Extensive experiments demonstrate that APHQ-ViT using linear quantizers outperforms existing PTQ methods by substantial margins in 3-bit and 4-bit across different vision tasks. The source code is available at this https URL.
zh
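针对"将 MLP 中的 GELU 替换为 ReLU 并在小规模无标注校准集上重建"这一步骤,下面给出一个简化示意:以原 MLP 为教师,用输出重建误差微调替换后的 MLP。注意此处用普通 MSE 代替了论文中的 APH 加权损失,网络结构与超参数均为假设。

```python
import copy
import torch
import torch.nn as nn

def gelu_to_relu_and_reconstruct(mlp, calib_x, steps=100):
    """示意:把 MLP 中的 GELU 替换为 ReLU,并在校准集上以
    输出重建误差(此处用 MSE 代替论文的 APH 加权损失)微调替换后的 MLP。"""
    teacher = copy.deepcopy(mlp).eval()
    for name, m in list(mlp.named_children()):
        if isinstance(m, nn.GELU):
            setattr(mlp, name, nn.ReLU())
    opt = torch.optim.Adam(mlp.parameters(), lr=1e-4)
    for _ in range(steps):
        with torch.no_grad():
            target = teacher(calib_x)
        loss = nn.functional.mse_loss(mlp(calib_x), target)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return mlp

mlp = nn.Sequential(nn.Linear(64, 256), nn.GELU(), nn.Linear(256, 64))
calib = torch.randn(128, 64)     # 小规模无标注校准样本
mlp = gelu_to_relu_and_reconstruct(mlp, calib)
```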
[CV-35] Group-based Distinctive Image Captioning with Memory Difference Encoding and Attention
【速读】:该论文旨在解决现有图像描述(image captioning)方法在生成具有区分度(distinctiveness)的描述时的不足,即虽然传统模型在BLEU、CIDEr和SPICE等指标上表现优异,但其生成的描述往往难以有效地区分目标图像与其他相似图像。为了解决这一问题,论文提出了一种名为“基于组的差异性区分描述方法”(Group-based Differential Distinctive Captioning Method)。该方法的关键在于引入了一个“基于组的差异性记忆注意模块”(Group-based Differential Memory Attention, GDMA),通过视觉对比目标图像与其相似图像组中的其他图像,识别并突出目标图像的独特对象特征(即与其他图像中对象具有较低相似性的特征)。此外,该模块确保在生成描述时优先考虑这些独特特征,从而显著提升描述的区分度。同时,论文还通过从真实标签中选择区分词(distinctive words)来引导语言解码器与GDMA模块,并提出了一个新的定量评估指标——区分词比率(Distinctive Word Rate, DisWordRate),以进一步优化描述生成过程并评估其效果。
链接: https://arxiv.org/abs/2504.02496
作者: Jiuniu Wang,Wenjia Xu,Qingzhong Wang,Antoni B. Chan
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
备注: 20 pages. arXiv admin note: substantial text overlap with arXiv:2108.09151
点击查看摘要
Abstract:Recent advances in image captioning have focused on enhancing accuracy by substantially increasing the dataset and model size. While conventional captioning models exhibit high performance on established metrics such as BLEU, CIDEr, and SPICE, the capability of captions to distinguish the target image from other similar images is under-explored. To generate distinctive captions, a few pioneers employed contrastive learning or re-weighted the ground-truth captions. However, these approaches often overlook the relationships among objects in a similar image group (e.g., items or properties within the same album or fine-grained events). In this paper, we introduce a novel approach to enhance the distinctiveness of image captions, namely Group-based Differential Distinctive Captioning Method, which visually compares each image with other images in one similar group and highlights the uniqueness of each image. In particular, we introduce a Group-based Differential Memory Attention (GDMA) module, designed to identify and emphasize object features in an image that are uniquely distinguishable within its image group, i.e., those exhibiting low similarity with objects in other images. This mechanism ensures that such unique object features are prioritized during caption generation for the image, thereby enhancing the distinctiveness of the resulting captions. To further refine this process, we select distinctive words from the ground-truth captions to guide both the language decoder and the GDMA module. Additionally, we propose a new evaluation metric, the Distinctive Word Rate (DisWordRate), to quantitatively assess caption distinctiveness. Quantitative results indicate that the proposed method significantly improves the distinctiveness of several baseline models, and achieves state-of-the-art performance on distinctiveness while not excessively sacrificing accuracy…
zh
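下面的片段示意 GDMA 的核心打分逻辑:计算目标图像各对象特征与同组其它图像对象特征的最大相似度,相似度越低的对象越"独特",从而在描述生成中获得更高权重。特征维度与归一化方式为示意性假设,并非论文模块的完整实现。

```python
import torch
import torch.nn.functional as F

def distinctive_weights(obj_feats, group_feats):
    """示意:对象与组内其它图像对象的最大相似度越低,独特性权重越高。"""
    obj = F.normalize(obj_feats, dim=-1)      # (N, D) 目标图像的对象特征
    grp = F.normalize(group_feats, dim=-1)    # (M, D) 同组其它图像的对象特征
    sim = obj @ grp.T                         # (N, M) 两两余弦相似度
    max_sim = sim.max(dim=1).values           # 每个对象与组内最相似对象的相似度
    return torch.softmax(1 - max_sim, dim=0)  # 归一化的独特性权重

obj_feats = torch.randn(5, 512)      # 目标图像中的 5 个对象
group_feats = torch.randn(40, 512)   # 同组其它图像中的 40 个对象
w = distinctive_weights(obj_feats, group_feats)
```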
[CV-36] Semiconductor Wafer Map Defect Classification with Tiny Vision Transformers
【速读】:该论文旨在解决半导体晶圆缺陷分类中因类别不平衡及多类型重叠缺陷识别困难导致的传统卷积神经网络(CNN)模型性能不足的问题。为应对这些挑战,论文提出了一种轻量级的视觉Transformer(Vision Transformer, ViT)框架——ViT-Tiny,专门优化用于晶圆缺陷分类任务。其关键在于通过引入Transformer架构,并经过系统性消融研究确定了16×16的最优Patch大小设置,从而在WM-38k数据集上实现了卓越的分类性能与鲁棒性,显著提升了各类别缺陷检测的F1分数、召回率和精度,同时在标注数据有限的情况下表现出更强的适应能力。
链接: https://arxiv.org/abs/2504.02494
作者: Faisal Mohammad,Duksan Ryu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注:
点击查看摘要
Abstract:Semiconductor wafer defect classification is critical for ensuring high precision and yield in manufacturing. Traditional CNN-based models often struggle with class imbalances and recognition of the multiple overlapping defect types in wafer maps. To address these challenges, we propose ViT-Tiny, a lightweight Vision Transformer (ViT) framework optimized for wafer defect classification. Trained on the WM-38k dataset, ViT-Tiny outperforms its ViT-Base counterpart and state-of-the-art (SOTA) models, such as MSF-Trans and CNN-based architectures. Through extensive ablation studies, we determine that a patch size of 16 provides optimal performance. ViT-Tiny achieves an F1-score of 98.4%, surpassing MSF-Trans by 2.94% in four-defect classification, improving recall by 2.86% in two-defect classification, and increasing precision by 3.13% in three-defect classification. Additionally, it demonstrates enhanced robustness under limited labeled data conditions, making it a computationally efficient and reliable solution for real-world semiconductor defect detection.
zh
[CV-37] Graph Attention-Driven Bayesian Deep Unrolling for Dual-Peak Single-Photon Lidar Imaging
【速读】:该论文致力于解决单光子激光雷达(Single-photon Lidar)成像在高噪声环境及每个像素存在多目标场景下的应用挑战。传统统计方法虽具有参数可解释性,但难以处理复杂场景;基于深度学习的方法虽然在精度和鲁棒性方面表现优异,却缺乏可解释性或仅限于单峰值场景。论文的关键在于提出了一种用于双峰值单光子激光雷达成像的深度展开算法。通过引入多目标的分层贝叶斯模型以及展开底层统计方法的神经网络,结合双深度图表示与几何深度学习从点云中提取特征,该方法兼具统计方法的准确性与量化不确定性能力以及学习方法的优势。实验结果验证了其在合成数据和真实数据上的竞争力,并提供了不确定性信息。
链接: https://arxiv.org/abs/2504.02480
作者: Kyungmin Choi,JaKeoung Koo,Stephen McLaughlin,Abderrahim Halimi
机构: School of Computing, Gachon University (嘉泉大学计算机学院); School of Engineering and Physical Sciences, Heriot-Watt University (赫瑞瓦特大学工程与物理科学学院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:Single-photon Lidar imaging offers a significant advantage in 3D imaging due to its high resolution and long-range capabilities, however it is challenging to apply in noisy environments with multiple targets per pixel. To tackle these challenges, several methods have been proposed. Statistical methods demonstrate interpretability on the inferred parameters, but they are often limited in their ability to handle complex scenes. Deep learning-based methods have shown superior performance in terms of accuracy and robustness, but they lack interpretability or they are limited to a single-peak per pixel. In this paper, we propose a deep unrolling algorithm for dual-peak single-photon Lidar imaging. We introduce a hierarchical Bayesian model for multiple targets and propose a neural network that unrolls the underlying statistical method. To support multiple targets, we adopt a dual depth maps representation and exploit geometric deep learning to extract features from the point cloud. The proposed method takes advantage of statistical methods and learning-based methods in terms of accuracy and quantifying uncertainty. The experimental results on synthetic and real data demonstrate the competitive performance when compared to existing methods, while also providing uncertainty information.
zh
[CV-38] MG-MotionLLM: A Unified Framework for Motion Comprehension and Generation across Multiple Granularities
【速读】:该论文致力于解决现有运动感知大语言模型在处理细粒度运动相关任务(如特定身体部位运动的理解与控制)时的能力局限性问题。这些现有方法主要专注于粗粒度的运动-文本建模,仅通过少数词汇描述整个运动序列的整体语义,而难以应对更精细的任务需求。为克服这一限制,论文提出的关键解决方案是MG-MotionLLM,这是一种统一的多粒度运动-语言模型,用于运动理解和生成。其核心创新在于引入了一套全面的多粒度训练方案,通过新增局部运动段时间边界的详细文本定位以及运动详细描述等辅助任务,促进不同粒度下运动-文本建模的相互增强。实验结果表明,MG-MotionLLM在经典文本到运动及运动到文本任务中表现出色,并展现出在新型细粒度运动理解与编辑任务中的潜力。
链接: https://arxiv.org/abs/2504.02478
作者: Bizhu Wu,Jinheng Xie,Keming Shen,Zhe Kong,Jianfeng Ren,Ruibin Bai,Rong Qu,Linlin Shen
机构: Computer Vision Institute, School of Computer Science & Software Engineering, Shenzhen University (深圳大学计算机科学与软件工程学院计算机视觉研究所); School of Computer Science, University of Nottingham Ningbo China, Ningbo, China (英国诺丁汉大学宁波分校计算机科学学院); Guangdong Provincial Key Laboratory of Intelligent Information Processing (广东省智能信息处理重点实验室); National University of Singapore (新加坡国立大学); Shenzhen Campus of Sun Yat-sen University (中山大学深圳校区); School of Computer Science, University of Nottingham, Nottingham, United Kingdom (英国诺丁汉大学计算机科学学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Recent motion-aware large language models have demonstrated promising potential in unifying motion comprehension and generation. However, existing approaches primarily focus on coarse-grained motion-text modeling, where text describes the overall semantics of an entire motion sequence in just a few words. This limits their ability to handle fine-grained motion-relevant tasks, such as understanding and controlling the movements of specific body parts. To overcome this limitation, we pioneer MG-MotionLLM, a unified motion-language model for multi-granular motion comprehension and generation. We further introduce a comprehensive multi-granularity training scheme by incorporating a set of novel auxiliary tasks, such as localizing temporal boundaries of motion segments via detailed text as well as motion detailed captioning, to facilitate mutual reinforcement for motion-text modeling across various levels of granularity. Extensive experiments show that our MG-MotionLLM achieves superior performance on classical text-to-motion and motion-to-text tasks, and exhibits potential in novel fine-grained motion comprehension and editing tasks. Project page: CVI-SZU/MG-MotionLLM
zh
[CV-39] Multimodal Fusion and Vision-Language Models: A Survey for Robot Vision
【速读】:本文旨在解决机器人视觉领域中多模态融合技术在关键任务中的应用问题,重点关注语义场景理解、SLAM(同时定位与建图)、3D物体检测、导航与定位以及机器人操作等任务。论文通过对比基于大型语言模型(Large Language Models, LLMs)的视觉-语言模型(Vision-Language Models, VLMs)与传统多模态融合方法,分析了各自的优劣势及潜在协同效应。此外,论文深入评估了常用数据集在实际机器人场景中的适用性与挑战,并识别出跨模态对齐、高效融合策略、实时部署及领域适应等核心研究难题。关键在于提出未来研究方向,包括针对鲁棒多模态表征的自监督学习、基于Transformer的融合架构以及可扩展的多模态框架,以推动机器人视觉中的多模态感知与交互能力的发展。
链接: https://arxiv.org/abs/2504.02477
作者: Xiaofeng Han,Shunpeng Chen,Zenghuang Fu,Zhe Feng,Lue Fan,Dong An,Changwei Wang,Li Guo,Weiliang Meng,Xiaopeng Zhang,Rongtao Xu,Shibiao Xu
机构: The State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences (中国科学院自动化研究所多模态人工智能系统国家重点实验室); School of Artificial Intelligence, Beijing University of Posts and Telecommunications (北京邮电大学人工智能学院); Key Laboratory of Computing Power Network and Information Security, Ministry of Education; Shandong Computer Science Center (教育部计算力网络与信息安全重点实验室; 山东计算机中心)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注: 27 pages, 11 figures, survey paper submitted to Information Fusion
点击查看摘要
Abstract:Robot vision has greatly benefited from advancements in multimodal fusion techniques and vision-language models (VLMs). We systematically review the applications of multimodal fusion in key robotic vision tasks, including semantic scene understanding, simultaneous localization and mapping (SLAM), 3D object detection, navigation and localization, and robot manipulation. We compare VLMs based on large language models (LLMs) with traditional multimodal fusion methods, analyzing their advantages, limitations, and synergies. Additionally, we conduct an in-depth analysis of commonly used datasets, evaluating their applicability and challenges in real-world robotic scenarios. Furthermore, we identify critical research challenges such as cross-modal alignment, efficient fusion strategies, real-time deployment, and domain adaptation, and propose future research directions, including self-supervised learning for robust multimodal representations, transformer-based fusion architectures, and scalable multimodal frameworks. Through a comprehensive review, comparative analysis, and forward-looking discussion, we provide a valuable reference for advancing multimodal perception and interaction in robotic vision. A comprehensive list of studies in this survey is available at this https URL.
zh
[CV-40] Adaptive path planning for efficient object search by UAVs in agricultural fields
【速读】:该论文试图解决农业田地中目标搜索路径规划的问题,特别是在利用无人机(UAV)进行搜索时如何高效且准确地检测目标。论文的关键解决方案在于提出了一种自适应路径规划器(adaptive path planner),其核心思想是结合高海拔覆盖飞行路径与低海拔局部检查路径,并根据目标检测网络的不确定性动态调整低海拔检查的频率。具体而言,论文采用YOLOv8检测网络实现人工目标的检测,并通过优化路径规划参数以及评估不同分布情况下的性能,证明了该自适应路径规划器在面对定位误差及目标数量变化时的鲁棒性,尤其在非均匀分布场景下显著优于传统的低海拔全覆盖路径。
链接: https://arxiv.org/abs/2504.02473
作者: Rick van Essen,Eldert van Henten,Lammert Kooistra,Gert Kootstra
机构: 未知
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:This paper presents an adaptive path planner for object search in agricultural fields using UAVs. The path planner uses a high-altitude coverage flight path and plans additional low-altitude inspections when the detection network is uncertain. The path planner was evaluated in an offline simulation environment containing real-world images. We trained a YOLOv8 detection network to detect artificial plants placed in grass fields to showcase the potential of our path planner. We evaluated the effect of different detection certainty measures, optimized the path planning parameters, investigated the effects of localization errors and different numbers of objects in the field. The YOLOv8 detection confidence worked best to differentiate between true and false positive detections and was therefore used in the adaptive planner. The optimal parameters of the path planner depended on the distribution of objects in the field, when the objects were uniformly distributed, more low-altitude inspections were needed compared to a non-uniform distribution of objects, resulting in a longer path length. The adaptive planner proved to be robust against localization uncertainty. When increasing the number of objects, the flight path length increased, especially when the objects were uniformly distributed. When the objects were non-uniformly distributed, the adaptive path planner yielded a shorter path than a low-altitude coverage path, even with high number of objects. Overall, the presented adaptive path planner allowed to find non-uniformly distributed objects in a field faster than a coverage path planner and resulted in a compatible detection accuracy. The path planner is made available at this https URL.
zh
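自适应规划器的核心决策可以概括为一个简单的阈值逻辑:高空检测置信度足够高则直接接受,落在不确定区间则追加一次低空复查。下面的 Python 示意即表达这一点,其中的阈值为假设值(论文中通过参数优化得到,且最优值依赖田间目标的分布)。

```python
# 示意:基于检测置信度的自适应巡检决策。阈值区间为假设值。
def plan_inspections(detections, low=0.3, high=0.8):
    """detections: [(x, y, confidence), ...] 来自高空覆盖飞行的检测结果。
    置信度落在不确定区间 (low, high) 的位置安排一次低空复查。"""
    inspect, accept = [], []
    for x, y, conf in detections:
        if conf >= high:
            accept.append((x, y))     # 足够确信,直接计为目标
        elif conf > low:
            inspect.append((x, y))    # 不确定,加入低空巡检航点
        # conf <= low 视为误检,忽略
    return accept, inspect

accept, inspect = plan_inspections([(10, 4, 0.92), (33, 7, 0.55), (5, 20, 0.12)])
print(len(accept), len(inspect))      # 1 1
```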
[CV-41] Semantic segmentation of forest stands using deep learning
【速读】:该论文旨在解决森林经营调查、林学及财务分析中森林群落绘图效率低下且一致性差的问题。传统方法依赖人工解读立体航拍图像来划定群落边界,这一过程耗时且主观,限制了操作效率并引入了不一致性。尽管已有研究尝试结合多种算法、航空影像以及由机载激光扫描(Airborne Laser Scanning, ALS)数据构建的冠层高度模型实现自动化,但人工解读仍为主要手段。论文提出了一种新颖的方法,将群落边界划分视为多类别分割问题,并基于U-Net的深度学习(Deep Learning, DL)框架构建模型。模型的关键在于利用多光谱图像、ALS数据以及由专家解释员创建的现有群落地图进行训练和评估,通过整体精度(Overall Accuracy)这一分类任务的标准度量指标,在独立数据集上达到0.73的整体精度。这表明深度学习在自动化群落边界划分方面具有巨大潜力,但也指出了复杂森林环境下的若干关键挑战。
链接: https://arxiv.org/abs/2504.02471
作者: Håkon Næss Sandum(1),Hans Ole Ørka(1),Oliver Tomic(2),Erik Næsset(1),Terje Gobakken(1) ((1) Faculty of Environmental Sciences and Natural Resource Management, Norwegian University of Life Sciences, NMBU, Ås, Norway, (2) Faculty of Science and Technology, Norwegian University of Life Sciences, NMBU, Ås, Norway)
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 31 pages, 7 figures, 4 tables
点击查看摘要
Abstract:Forest stands are the fundamental units in forest management inventories, silviculture, and financial analysis within operational forestry. Over the past two decades, a common method for mapping stand borders has involved delineation through manual interpretation of stereographic aerial images. This is a time-consuming and subjective process, limiting operational efficiency and introducing inconsistencies. Substantial effort has been devoted to automating the process, using various algorithms together with aerial images and canopy height models constructed from airborne laser scanning (ALS) data, but manual interpretation remains the preferred method. Deep learning (DL) methods have demonstrated great potential in computer vision, yet their application to forest stand delineation remains unexplored in published research. This study presents a novel approach, framing stand delineation as a multiclass segmentation problem and applying a U-Net based DL framework. The model was trained and evaluated using multispectral images, ALS data, and an existing stand map created by an expert interpreter. Performance was assessed on independent data using overall accuracy, a standard metric for classification tasks that measures the proportions of correctly classified pixels. The model achieved an overall accuracy of 0.73. These results demonstrate strong potential for DL in automated stand delineation. However, a few key challenges were noted, especially for complex forest environments.
zh
[CV-42] RASP: Revisiting 3D Anamorphic Art for Shadow-Guided Packing of Irregular Objects CVPR
【速读】:该论文旨在解决通过计算模型探索和解释特定艺术形式(如影子艺术、折纸和素描艺术)的问题,并专注于三维变形艺术 (3D Anamorphic Art) 的布局优化。论文的关键在于引入了一个基于可微渲染 (differentiable rendering) 的框架 RASP,用于在有界体积内排列任意形状的三维物体,其目标是最小化物体间的间距并最大化占用率,同时通过阴影 (或轮廓) 引导的优化实现这一目标。此外,论文提出了一种基于 SDF (Signed Distance Function) 的新方法来处理物体间交叠及物体超出容器 (container extrusion) 的问题。最终,RASP 被扩展到零件装配与物体打包任务中,并展示了从多个视角生成有意义表达的能力。因此,RASP 的关键创新在于结合可微渲染与阴影引导优化,以及其在处理复杂三维物体布局中的有效性。
链接: https://arxiv.org/abs/2504.02465
作者: Soumyaratna Debnath,Ashish Tiwari,Kaustubh Sadekar,Shanmuganathan Raman
机构: Indian Institute of Technology Gandhinagar (印度理工学院甘地讷格尔); Portland State University (波特兰州立大学)
类目: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
备注: Conference on Computer Vision and Pattern Recognition (CVPR) 2025
点击查看摘要
Abstract:Recent advancements in learning-based methods have opened new avenues for exploring and interpreting art forms, such as shadow art, origami, and sketch art, through computational models. One notable visual art form is 3D Anamorphic Art in which an ensemble of arbitrarily shaped 3D objects creates a realistic and meaningful expression when observed from a particular viewpoint and loses its coherence over the other viewpoints. In this work, we build on insights from 3D Anamorphic Art to perform 3D object arrangement. We introduce RASP, a differentiable-rendering-based framework to arrange arbitrarily shaped 3D objects within a bounded volume via shadow (or silhouette)-guided optimization with an aim of minimal inter-object spacing and near-maximal occupancy. Furthermore, we propose a novel SDF-based formulation to handle inter-object intersection and container extrusion. We demonstrate that RASP can be extended to part assembly alongside object packing considering 3D objects to be “parts” of another 3D object. Finally, we present artistic illustrations of multi-view anamorphic art, achieving meaningful expressions from multiple viewpoints within a single ensemble.
zh
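对于"基于 SDF 处理物体间交叠"的思想,下面给出一个大幅简化的可微示意:用球体近似物体,把两两重叠深度之和作为惩罚项,其梯度可直接参与排布优化。球体近似与惩罚形式均为假设,并非 RASP 的原始公式。

```python
import torch

def sphere_overlap_penalty(centers, radii):
    """示意:以球体近似物体,两球重叠深度 max(0, r_i + r_j - d_ij)
    之和作为可微的交叠惩罚(对应 SDF 负值区域的侵入深度)。"""
    d = torch.pdist(centers)                     # 所有无序点对的欧氏距离
    i, j = torch.triu_indices(len(radii), len(radii), offset=1)
    overlap = radii[i] + radii[j] - d            # r_i + r_j - d_ij
    return torch.clamp(overlap, min=0).sum()

centers = torch.randn(6, 3, requires_grad=True)  # 6 个物体的中心
radii = torch.full((6,), 0.5)
loss = sphere_overlap_penalty(centers, radii)
loss.backward()                                  # 梯度可直接用于排布优化
```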
[CV-43] CornerPoint3D: Look at the Nearest Corner Instead of the Center
【速读】:该论文旨在解决3D目标检测在跨域任务中因点云分布变化导致的定位精度下降以及现有评估指标易受数据集特定尺寸变化影响的问题。论文关注的重点是如何更有效地预防车辆与其他障碍物之间的碰撞,特别是在跨域场景下正确预测物体尺寸更具挑战性的情境。为了解决这些问题,论文从实用角度重新审视了跨域3D目标检测,并提出了两个新的评估指标来衡量模型检测物体靠近LiDAR传感器表面的能力。此外,引入了EdgeHead,这是一种改进头结构,引导模型更加专注于可学习的较近表面,从而显著提升了在新旧BEV/3D评估指标下的跨域性能。关键解决方案在于提出了一种新的3D目标检测器CornerPoint3D,它基于CenterPoint构建并通过热图监督每个物体最近角的学习与检测,实现了整个边界框检测质量和靠近LiDAR传感器表面定位准确性之间的平衡,相比传统的基于中心的检测器CenterPoint,在多个跨域任务中表现出更好的实用性和鲁棒性。
链接: https://arxiv.org/abs/2504.02464
作者: Ruixiao Zhang,Runwei Guan,Xiangyu Chen,Adam Prugel-Bennett,Xiaohao Cai
机构: University of Southampton (南安普顿大学); University of Liverpool (利物浦大学); Cornell University (康奈尔大学); University of Southampton (南安普顿大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: arXiv admin note: substantial text overlap with arXiv:2407.04061
点击查看摘要
Abstract:3D object detection aims to predict object centers, dimensions, and rotations from LiDAR point clouds. Despite its simplicity, LiDAR captures only the near side of objects, making center-based detectors prone to poor localization accuracy in cross-domain tasks with varying point distributions. Meanwhile, existing evaluation metrics designed for single-domain assessment also suffer from overfitting due to dataset-specific size variations. A key question arises: Do we really need models to maintain excellent performance in the entire 3D bounding boxes after being applied across domains? Actually, one of our main focuses is on preventing collisions between vehicles and other obstacles, especially in cross-domain scenarios where correctly predicting the sizes is much more difficult. To address these issues, we rethink cross-domain 3D object detection from a practical perspective. We propose two new metrics that evaluate a model’s ability to detect objects’ closer-surfaces to the LiDAR sensor. Additionally, we introduce EdgeHead, a refinement head that guides models to focus more on learnable closer surfaces, significantly improving cross-domain performance under both our new and traditional BEV/3D metrics. Furthermore, we argue that predicting the nearest corner rather than the object center enhances robustness. We propose a novel 3D object detector, coined as CornerPoint3D, which is built upon CenterPoint and uses heatmaps to supervise the learning and detection of the nearest corner of each object. Our proposed methods realize a balanced trade-off between the detection quality of entire bounding boxes and the locating accuracy of closer surfaces to the LiDAR sensor, outperforming the traditional center-based detector CenterPoint in multiple cross-domain tasks and providing a more practically reasonable and robust cross-domain 3D object detection solution.
zh
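CornerPoint3D 监督的是"离 LiDAR 传感器最近的角点"。下面的 NumPy 示意展示了如何由 BEV 下的框参数 (center, dims, yaw) 计算该角点;坐标约定为示意性假设(传感器位于原点)。

```python
import numpy as np

def nearest_corner(center, dims, yaw):
    """示意:给定 BEV 下的 3D 框 (center, dims, yaw),
    返回离 LiDAR 原点最近的底面角点,即 CornerPoint3D 用热图监督的角点。"""
    l, w = dims[0], dims[1]
    local = np.array([[ l/2,  w/2], [ l/2, -w/2],
                      [-l/2,  w/2], [-l/2, -w/2]])   # 4 个底面角点(局部系)
    c, s = np.cos(yaw), np.sin(yaw)
    R = np.array([[c, -s], [s, c]])
    corners = local @ R.T + center[:2]               # 旋转并平移到传感器坐标系
    return corners[np.argmin(np.linalg.norm(corners, axis=1))]

corner = nearest_corner(center=np.array([12.0, -3.0, 0.0]),
                        dims=np.array([4.5, 1.9, 1.6]),   # 长、宽、高
                        yaw=0.3)
print(corner)   # 离传感器最近的角点 (x, y)
```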
[CV-44] Taylor Series-Inspired Local Structure Fitting Network for Few-shot Point Cloud Semantic Segmentation
【速读】:该论文旨在解决少样本点云语义分割问题,即在有限标注数据条件下,准确分割“未见过”的新类别。现有基于预训练的方法不仅引入了过多的时间开销,还忽略了不规则点云之间的局部结构表示。为了解决这些问题,论文提出了一种无预训练的局部结构拟合网络TaylorSeg。其关键是受泰勒级数启发,将不规则点云的局部结构表示视为多项式拟合问题,并提出了新颖的局部结构拟合卷积TaylorConv。此卷积能够从局部几何结构的显式编码中学习低阶基础信息和高阶精化信息。此外,通过TaylorConv构建了两种TaylorSeg变体:非参数的TaylorSeg-NN和参数化的TaylorSeg-PN。后者配备了自适应推拉(APP)模块以缓解查询集与支持集之间的特征分布差异。实验结果验证了所提方法的有效性,在2-way 1-shot设置下,TaylorSeg-PN分别在S3DIS和ScanNet数据集上实现了+2.28%和+4.37%的mIoU提升。
链接: https://arxiv.org/abs/2504.02454
作者: Changshuo Wang,Shuting He,Xiang Fang,Meiqing Wu,Siew-Kei Lam,Prayag Tiwari
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Few-shot point cloud semantic segmentation aims to accurately segment “unseen” new categories in point cloud scenes using limited labeled data. However, pretraining-based methods not only introduce excessive time overhead but also overlook the local structure representation among irregular point clouds. To address these issues, we propose a pretraining-free local structure fitting network for few-shot point cloud semantic segmentation, named TaylorSeg. Specifically, inspired by Taylor series, we treat the local structure representation of irregular point clouds as a polynomial fitting problem and propose a novel local structure fitting convolution, called TaylorConv. This convolution learns the low-order basic information and high-order refined information of point clouds from explicit encoding of local geometric structures. Then, using TaylorConv as the basic component, we construct two variants of TaylorSeg: a non-parametric TaylorSeg-NN and a parametric TaylorSeg-PN. The former can achieve performance comparable to existing parametric models without pretraining. For the latter, we equip it with an Adaptive Push-Pull (APP) module to mitigate the feature distribution differences between the query set and the support set. Extensive experiments validate the effectiveness of the proposed method. Notably, under the 2-way 1-shot setting, TaylorSeg-PN achieves improvements of +2.28% and +4.37% mIoU on the S3DIS and ScanNet datasets respectively, compared to the previous state-of-the-art methods. Our code is available at this https URL.
zh
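下面的 PyTorch 片段把"局部结构表示视为多项式拟合"写成一个概念性卷积:邻域特征与相对坐标的多项式基(常数、一阶、二阶项)相乘后聚合,低阶项对应基础信息、二阶项对应精化信息。该实现仅为示意,与论文 TaylorConv 的具体形式可能不同。

```python
import torch
import torch.nn as nn

class TaylorFitConv(nn.Module):
    """示意:以相对坐标的多项式基拟合局部邻域特征(泰勒展开式的思路)。"""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        # 多项式基:1, dx, dy, dz, dx^2, dy^2, dz^2,共 7 项
        self.weight = nn.Linear(in_dim * 7, out_dim)

    def forward(self, neigh_feats, rel_pos):
        # neigh_feats: (B, N, K, C);rel_pos: (B, N, K, 3) 邻居相对坐标
        basis = torch.cat([torch.ones_like(rel_pos[..., :1]),
                           rel_pos, rel_pos ** 2], dim=-1)    # (B, N, K, 7)
        # 每个邻居特征与每个多项式基相乘后拼接,再对邻域求和
        x = neigh_feats.unsqueeze(-1) * basis.unsqueeze(-2)   # (B, N, K, C, 7)
        x = x.flatten(-2).sum(dim=2)                          # (B, N, C*7)
        return self.weight(x)

feats = torch.randn(2, 1024, 16, 32)    # 1024 个点、每点 16 个邻居、32 维特征
rel = torch.randn(2, 1024, 16, 3)
out = TaylorFitConv(32, 64)(feats, rel) # (2, 1024, 64)
```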
[CV-45] ConMo: Controllable Motion Disentanglement and Recomposition for Zero-Shot Motion Transfer
【速读】:该论文旨在解决现有Text-to-Video (T2V)生成方法在多主体视频处理中的两大局限性:1) 难以精确控制特定主体的运动;2) 在主体形状各异的情况下难以保持运动的多样性和准确性。为克服这些挑战,论文提出的关键解决方案是ConMo,其核心在于通过主体掩码解耦并重组主体与相机运动的轨迹信息。ConMo能够从源视频复杂的运动轨迹中分离出个体主体和背景的运动线索,并重新组合用于目标视频生成,从而实现对多样化主体更精准的运动控制,并提升多主体场景下的性能。此外,论文在重组阶段引入软指导机制,通过调节原始运动保留程度来适应主体形状约束,促进语义转换。与以往方法相比,ConMo解锁了包括主体尺寸和位置编辑、主体移除、语义修改以及相机运动模拟等广泛应用场景。实验结果表明,ConMo在运动保真度和语义一致性方面显著优于现有最先进的方法。
链接: https://arxiv.org/abs/2504.02451
作者: Jiayi Gao,Zijin Yin,Changcheng Hua,Yuxin Peng,Kongming Liang,Zhanyu Ma,Jun Guo,Yang Liu
机构: Wangxuan Institute of Computer Technology, Peking University (王选计算机技术研究所,北京大学); Beijing University of Posts and Telecommunications (北京邮电大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:The development of Text-to-Video (T2V) generation has made motion transfer possible, enabling the control of video motion based on existing footage. However, current methods have two limitations: 1) struggle to handle multi-subjects videos, failing to transfer specific subject motion; 2) struggle to preserve the diversity and accuracy of motion as transferring to subjects with varying shapes. To overcome these, we introduce ConMo, a zero-shot framework that disentangles and recomposes the motions of subjects and camera movements. ConMo isolates individual subject and background motion cues from complex trajectories in source videos using only subject masks, and reassembles them for target video generation. This approach enables more accurate motion control across diverse subjects and improves performance in multi-subject scenarios. Additionally, we propose soft guidance in the recomposition stage which controls the retention of original motion to adjust shape constraints, aiding subject shape adaptation and semantic transformation. Unlike previous methods, ConMo unlocks a wide range of applications, including subject size and position editing, subject removal, semantic modifications, and camera motion simulation. Extensive experiments demonstrate that ConMo significantly outperforms state-of-the-art methods in motion fidelity and semantic consistency. The code is available at this https URL.
zh
[CV-46] HGFormer: Topology-Aware Vision Transformer with HyperGraph Learning
【速读】:该论文旨在解决传统视觉Transformer在建模过程中因隐式处理排列不变性和全连接交互而导致区域上下文和空间拓扑被破坏的问题,这偏离了感知组织强调局部组群和整体拓扑的原则。为了解决这一问题,论文的关键方案是引入超图(Hypergraph)的概念,提出了一种名为超图Transformer(HGFormer)的模型。具体而言,首先通过中心采样K近邻(CS-KNN)算法实现语义引导的超图构建;其次,设计了一种拓扑感知的超图注意力机制(HGA),将超图拓扑作为感知指示,指导全局无偏信息的聚合。通过这种方式,HGFormer能够实现有效的统一表示,并在多个视觉基准测试中表现出与最新技术相当的性能。
链接: https://arxiv.org/abs/2504.02440
作者: Hao Wang,Shuo Zhang,Biao Leng
机构: School of Computer Science and Engineering, Beihang University (北航计算机科学与工程学院), China; Beijing Key Laboratory of Traffic Data Analysis and Mining, China
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:The computer vision community has witnessed an extensive exploration of vision transformers in the past two years. Drawing inspiration from traditional schemes, numerous works focus on introducing vision-specific inductive biases. However, the implicit modeling of permutation invariance and fully-connected interaction with individual tokens disrupts the regional context and spatial topology, further hindering higher-order modeling. This deviates from the principle of perceptual organization that emphasizes the local groups and overall topology of visual elements. Thus, we introduce the concept of hypergraph for perceptual exploration. Specifically, we propose a topology-aware vision transformer called HyperGraph Transformer (HGFormer). Firstly, we present a Center Sampling K-Nearest Neighbors (CS-KNN) algorithm for semantic guidance during hypergraph construction. Secondly, we present a topology-aware HyperGraph Attention (HGA) mechanism that integrates hypergraph topology as perceptual indications to guide the aggregation of global and unbiased information during hypergraph messaging. Using HGFormer as visual backbone, we develop an effective and unitive representation, achieving distinct and detailed scene depictions. Empirical experiments show that the proposed HGFormer achieves competitive performance compared to the recent SoTA counterparts on various visual benchmarks. Extensive ablation and visualization studies provide comprehensive explanations of our ideas and contributions.
zh
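CS-KNN 构建超图的流程可以概括为"先选中心、再取近邻成超边"。下面的示意中,中心以特征范数打分选取(这是示意性假设,论文使用语义引导的评分),每条超边由中心在特征空间的 K 近邻组成。

```python
import torch

def cs_knn_hyperedges(tokens, num_centers=16, k=8):
    """示意:中心采样 K 近邻(CS-KNN)构建超边。中心打分准则为假设,
    论文中由语义引导;此处以特征范数代替。"""
    scores = tokens.norm(dim=-1)                       # (N,) 示意性打分
    centers = scores.topk(num_centers).indices         # 采样中心 token
    dist = torch.cdist(tokens[centers], tokens)        # (C, N) 特征空间距离
    hyperedges = dist.topk(k, largest=False).indices   # 每条超边含 k 个成员
    return hyperedges                                  # (C, k) 成员索引

tokens = torch.randn(196, 384)          # 14x14 个视觉 token
H = cs_knn_hyperedges(tokens)
print(H.shape)                          # torch.Size([16, 8])
```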
[CV-47] Estimating Scene Flow in Robot Surroundings with Distributed Miniaturized Time-of-Flight Sensors
【速读】:该论文旨在解决机器人周围环境中人类或物体运动跟踪的问题,以提升机器人的安全运动与反应能力。论文提出了一种从分布于机器人身体上的小型化飞行时间(ToF)传感器获取的低密度且噪声较大的点云数据中估计场景流的方法。方案的关键在于通过聚类连续帧中的点,并应用迭代最近点(ICP)算法估算密集的运动流,同时引入额外步骤减轻传感器噪声和低密度数据点的影响。具体而言,采用基于适应度的分类方法区分静止点和移动点,并利用内点去除策略优化几何对应关系。实验验证表明,该方法能够一致地近似运动方向及其幅度,误差与传感器噪声相当。
链接: https://arxiv.org/abs/2504.02439
作者: Jack Sander,Giammarco Caroleo,Alessandro Albini,Perla Maiolino
机构: Oxford Robotics Institute (ORI), University of Oxford (牛津大学), UK
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注: 7 pages, 5 figures, 2 tables, 1 algorithm
点击查看摘要
Abstract:Tracking motions of humans or objects in the surroundings of the robot is essential to improve safe robot motions and reactions. In this work, we present an approach for scene flow estimation from low-density and noisy point clouds acquired from miniaturized Time of Flight (ToF) sensors distributed on the robot body. The proposed method clusters points from consecutive frames and applies Iterative Closest Point (ICP) to estimate a dense motion flow, with additional steps introduced to mitigate the impact of sensor noise and low-density data points. Specifically, we employ a fitness-based classification to distinguish between stationary and moving points and an inlier removal strategy to refine geometric correspondences. The proposed approach is validated in an experimental setup where 24 ToF are used to estimate the velocity of an object moving at different controlled speeds. Experimental results show that the method consistently approximates the direction of the motion and its magnitude with an error which is in line with sensor noise.
zh
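下面用纯 NumPy 给出"对相邻帧的同一聚类做 ICP、再以配准位移作为运动流"的最小示意(Kabsch 刚体拟合加暴力最近邻);论文中还包含基于适应度的动/静点分类与内点剔除等抗噪步骤,此处从略。

```python
import numpy as np

def fit_rigid(A, B):
    """SVD 求解使 R @ A + t ≈ B 的最优刚体变换(Kabsch 算法)。"""
    ca, cb = A.mean(0), B.mean(0)
    H = (A - ca).T @ (B - cb)
    U, _, Vt = np.linalg.svd(H)
    R = Vt.T @ U.T
    if np.linalg.det(R) < 0:              # 处理反射情形
        Vt[-1] *= -1
        R = Vt.T @ U.T
    return R, cb - R @ ca

def icp_flow(src, dst, iters=10):
    """示意:对相邻帧的同一聚类做简化 ICP,返回每点的运动流向量。"""
    cur = src.copy()
    for _ in range(iters):
        # 暴力最近邻关联(点数少时可行)
        match = dst[np.argmin(np.linalg.norm(cur[:, None] - dst[None], axis=2), axis=1)]
        R, t = fit_rigid(cur, match)
        cur = cur @ R.T + t
    return cur - src                       # 每点的位移即估计的流

src = np.random.rand(50, 3)                # t 帧某聚类的点
dst = src + np.array([0.05, 0.0, 0.0])     # t+1 帧(整体平移 5 cm)
flow = icp_flow(src, dst)
print(flow.mean(0))                        # ≈ [0.05, 0, 0]
```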
[CV-48] MonoGS++: Fast and Accurate Monocular RGB Gaussian SLAM
【速读】:该论文旨在解决现有基于3D高斯点阵(Gaussian Splatting, GS)的Simultaneous Localization and Mapping (SLAM) 方法对深度传感器高度依赖的问题。为了解决这一问题,论文提出了一种新的方法MonoGS++,仅依赖RGB输入实现快速且精确的SLAM。其关键解决方案在于:(1) 引入动态3D高斯插入机制以避免在已重建良好的区域添加冗余高斯分布;(2) 设计清晰度增强的高斯密度优化模块和平面正则化技术,以更好地处理纹理缺失区域和平坦表面。这些改进不仅实现了与现有最先进方法相当的精确相机跟踪结果,还在帧率性能上取得了5.57倍的显著提升。
链接: https://arxiv.org/abs/2504.02437
作者: Renwu Li,Wenjing Ke,Dong Li,Lu Tian,Emad Barsoum
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:We present MonoGS++, a novel fast and accurate Simultaneous Localization and Mapping (SLAM) method that leverages 3D Gaussian representations and operates solely on RGB inputs. While previous 3D Gaussian Splatting (GS)-based methods largely depended on depth sensors, our approach reduces the hardware dependency and only requires RGB input, leveraging online visual odometry (VO) to generate sparse point clouds in real-time. To reduce redundancy and enhance the quality of 3D scene reconstruction, we implemented a series of methodological enhancements in 3D Gaussian mapping. Firstly, we introduced dynamic 3D Gaussian insertion to avoid adding redundant Gaussians in previously well-reconstructed areas. Secondly, we introduced clarity-enhancing Gaussian densification module and planar regularization to handle texture-less areas and flat surfaces better. We achieved precise camera tracking results both on the synthetic Replica and real-world TUM-RGBD datasets, comparable to those of the state-of-the-art. Additionally, our method realized a significant 5.57x improvement in frames per second (fps) over the previous state-of-the-art, MonoGS.
zh
[CV-49] SkyReels-A2: Compose Anything in Video Diffusion Transformers
【速读】:本文旨在解决元素到视频(Elements-to-Video, E2V)任务中的关键挑战,即在基于文本提示合成视频的同时,保持每个视觉元素(如角色、物体、背景)与参考图像的高度一致性,确保场景组成连贯且输出自然。为实现这一目标,论文提出的关键解决方案包括:设计一个全面的数据处理流程以构建提示-参考-视频三元组用于模型训练;提出一种新颖的图文联合嵌入模型,将多元素表示融入生成过程,平衡局部元素特定一致性与全局连贯性及文本对齐;优化推理管道以提升速度与输出稳定性;同时引入精心策划的基准数据集A2 Bench用于系统评估。实验表明,SkyReels-A2框架能够生成高质量且多样化的视频,并提供精确的元素控制能力。
链接: https://arxiv.org/abs/2504.02436
作者: Zhengcong Fei,Debang Li,Di Qiu,Jiahua Wang,Yikun Dou,Rui Wang,Jingtao Xu,Mingyuan Fan,Guibin Chen,Yang Li,Yahui Zhou
机构: Kunlun Inc. (昆仑万维股份有限公司)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:This paper presents SkyReels-A2, a controllable video generation framework capable of assembling arbitrary visual elements (e.g., characters, objects, backgrounds) into synthesized videos based on textual prompts while maintaining strict consistency with reference images for each element. We term this task elements-to-video (E2V), whose primary challenges lie in preserving the fidelity of each reference element, ensuring coherent composition of the scene, and achieving natural outputs. To address these, we first design a comprehensive data pipeline to construct prompt-reference-video triplets for model training. Next, we propose a novel image-text joint embedding model to inject multi-element representations into the generative process, balancing element-specific consistency with global coherence and text alignment. We also optimize the inference pipeline for both speed and output stability. Moreover, we introduce a carefully curated benchmark for systematic evaluation, i.e., A2 Bench. Experiments demonstrate that our framework can generate diverse, high-quality videos with precise element control. SkyReels-A2 is the first open-source commercial grade model for the generation of E2V, performing favorably against advanced closed-source commercial models. We anticipate SkyReels-A2 will advance creative applications such as drama and virtual e-commerce, pushing the boundaries of controllable video generation.
zh
[CV-50] OmniTalker: Real-Time Text-Driven Talking Head Generation with In-Context Audio-Visual Style Replication
【速读】:该论文致力于解决传统文本驱动的Talking Head生成方法中存在的系统复杂性、延迟高、音视频输出异步以及生成语音与视觉表达风格不一致的问题。为应对这些挑战,论文提出OmniTalker,这是一种端到端统一框架,能够在实时零样本场景下从文本和参考视频同步生成语音与Talking Head视频,同时保留语音风格和面部风格。解决方案的关键在于其采用双分支扩散变换器架构,其中音频分支从文本合成梅尔频谱图,视觉分支预测精细的头部姿态和面部动态;并通过引入新型的音视频融合模块确保音视频输出的时间同步与风格一致性。此外,论文提出的上下文参考学习模块能够从单一参考视频中有效捕捉语音和面部风格特征,而无需额外的风格提取模块。这种方法实现了在零样本设置下的端到端联合建模,并达到了25 FPS的实时推理速度。
链接: https://arxiv.org/abs/2504.02433
作者: Zhongjian Wang,Peng Zhang,Jinwei Qi,Guangyuan Wang,Sheng Xu,Bang Zhang,Liefeng Bo
机构: Tongyi Lab, Alibaba Group (通义实验室, 阿里巴巴集团)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project Page this https URL
点击查看摘要
Abstract:Recent years have witnessed remarkable advances in talking head generation, owing to its potential to revolutionize the human-AI interaction from text interfaces into realistic video chats. However, research on text-driven talking heads remains underexplored, with existing methods predominantly adopting a cascaded pipeline that combines TTS systems with audio-driven talking head models. This conventional pipeline not only introduces system complexity and latency overhead but also fundamentally suffers from asynchronous audiovisual output and stylistic discrepancies between generated speech and visual expressions. To address these limitations, we introduce OmniTalker, an end-to-end unified framework that simultaneously generates synchronized speech and talking head videos from text and reference video in real-time zero-shot scenarios, while preserving both speech style and facial styles. The framework employs a dual-branch diffusion transformer architecture: the audio branch synthesizes mel-spectrograms from text, while the visual branch predicts fine-grained head poses and facial dynamics. To bridge modalities, we introduce a novel audio-visual fusion module that integrates cross-modal information to ensure temporal synchronization and stylistic coherence between audio and visual outputs. Furthermore, our in-context reference learning module effectively captures both speech and facial style characteristics from a single reference video without introducing an extra style extracting module. To the best of our knowledge, OmniTalker presents the first unified framework that jointly models speech style and facial style in a zero-shot setting, achieving real-time inference speed of 25 FPS. Extensive experiments demonstrate that our method surpasses existing approaches in generation quality, particularly excelling in style preservation and audio-video synchronization.
zh
[CV-51] Leveraging Static Relationships for Intra-Type and Inter-Type Message Passing in Video Question Answering
【速读】:该论文致力于解决视频问答(VideoQA)领域中静态关系推理准确性不足的问题,尤其是在静态关系识别与表示方面的局限性。现有方法未能充分挖掘视频中的静态关系信息以实现深入推理与分析。为此,论文提出了一种基于静态关系的消息传递推理方法,其关键是构建了两种图结构:一种用于同类消息传递推理的双图(dual graph),以及一种基于静态关系的异质图(heterogeneous graph)用于跨类消息传递推理。通过在双图中捕获与问题相关的同类目标及其关系的邻域信息,并更新双图以获取同类线索;同时,在异质图中捕获跨类相关的目标及关系的邻域信息,并更新异质图以获取跨类线索。最终结合两类线索进行答案推断。实验结果表明,该方法在ANetQA和Next-QA数据集上均表现出有效性。
链接: https://arxiv.org/abs/2504.02417
作者: Lili Liang,Guanglu Sun
机构: School of Computer Science and Technology, Harbin University of Science and Technology (哈尔滨理工大学计算机科学与技术学院), Heilongjiang Provincial Key Laboratory of Intelligent Information Processing and Application (黑龙江省智能信息处理与应用重点实验室)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Video Question Answering (VideoQA) is an important research direction in the field of artificial intelligence, enabling machines to understand video content and perform reasoning and answering based on natural language questions. Although methods based on static relationship reasoning have made certain progress, there are still deficiencies in the accuracy of static relationship recognition and representation, and they have not fully utilized the static relationship information in videos for in-depth reasoning and analysis. Therefore, this paper proposes a reasoning method for intra-type and inter-type message passing based on static relationships. This method constructs a dual graph for intra-type message passing reasoning and builds a heterogeneous graph based on static relationships for inter-type message passing reasoning. The intra-type message passing reasoning model captures the neighborhood information of targets and relationships related to the question in the dual graph, updating the dual graph to obtain intra-type clues for answering the question. The inter-type message passing reasoning model captures the neighborhood information of targets and relationships from different categories related to the question in the heterogeneous graph, updating the heterogeneous graph to obtain inter-type clues for answering the question. Finally, the answers are inferred by combining the intra-type and inter-type clues based on static relationships. Experimental results on the ANetQA and Next-QA datasets demonstrate the effectiveness of this method.
zh
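同类与跨类消息传递在形式上都可以写成"邻接加权聚合加节点状态更新"。下面的 PyTorch 示意给出这一通用形式:双图与异质图可各自实例化自己的邻接矩阵并复用同一模块。节点数、维度与邻接构造均为示意性假设。

```python
import torch
import torch.nn as nn

class MessagePassing(nn.Module):
    """示意:一次基于邻接矩阵的消息传递更新,可分别用于
    同类推理的双图与跨类推理的异质图(此处两者共用同一形式)。"""
    def __init__(self, dim):
        super().__init__()
        self.msg = nn.Linear(dim, dim)
        self.upd = nn.GRUCell(dim, dim)

    def forward(self, x, adj):
        # x: (N, D) 节点特征;adj: (N, N) 归一化邻接(含问题相关权重)
        m = adj @ self.msg(x)          # 聚合邻域消息
        return self.upd(m, x)          # 用消息更新节点状态

x = torch.randn(12, 128)                        # 12 个目标/关系节点
adj = torch.softmax(torch.randn(12, 12), -1)    # 示意性的归一化邻接
x = MessagePassing(128)(x, adj)
```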
[CV-52] Hyperspectral Remote Sensing Images Salient Object Detection: The First Benchmark Dataset and Baseline
【速读】:该论文旨在解决高光谱遥感图像显著目标检测(HRSI-SOD)领域中专用数据集匮乏及方法不足的问题。为应对这一挑战,论文引入了首个专门针对HRSI-SOD的HRSSD数据集,并提出了一种创新且高效的深度光谱显著性网络(Deep Spectral Saliency Network, DSSN)。DSSN的关键在于其跨尺度显著性评估模块(Cross-level Saliency Assessment Block),该模块通过像素级注意力机制及多尺度相似性图的贡献评估,在复杂场景中有效抑制错误响应并突出显著区域;同时,高分辨率融合模块结合自底向上的融合策略与学习到的空间上采样能力,确保小目标的精确定位。实验结果验证了DSSN在HRSSD数据集上的优越性能,并展示了其在其他相关数据集上的泛化能力。
链接: https://arxiv.org/abs/2504.02416
作者: Peifu Liu,Huiyan Bai,Tingfa Xu,Jihui Wang,Huan Chen,Jianan Li
机构: Beijing Institute of Technology (北京理工大学); Key Laboratory of Photoelectronic Imaging Technology and System, Ministry of Education of China (教育部光电成像技术与系统重点实验室); Big Data and Artificial Intelligence Laboratory, Beijing Institute of Technology Chongqing Innovation Center (北京理工大学重庆创新中心大数据与人工智能实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by TGRS 2025
点击查看摘要
Abstract:The objective of hyperspectral remote sensing image salient object detection (HRSI-SOD) is to identify objects or regions that exhibit distinct spectrum contrasts with the background. This area holds significant promise for practical applications; however, progress has been limited by a notable scarcity of dedicated datasets and methodologies. To bridge this gap and stimulate further research, we introduce the first HRSI-SOD dataset, termed HRSSD, which includes 704 hyperspectral images and 5327 pixel-level annotated salient objects. The HRSSD dataset poses substantial challenges for salient object detection algorithms due to large scale variation, diverse foreground-background relations, and multi-salient objects. Additionally, we propose an innovative and efficient baseline model for HRSI-SOD, termed the Deep Spectral Saliency Network (DSSN). The core of DSSN is the Cross-level Saliency Assessment Block, which performs pixel-wise attention and evaluates the contributions of multi-scale similarity maps at each spatial location, effectively reducing erroneous responses in cluttered regions and emphasizes salient regions across scales. Additionally, the High-resolution Fusion Module combines bottom-up fusion strategy and learned spatial upsampling to leverage the strengths of multi-scale saliency maps, ensuring accurate localization of small objects. Experiments on the HRSSD dataset robustly validate the superiority of DSSN, underscoring the critical need for specialized datasets and methodologies in this domain. Further evaluations on the HSOD-BIT and HS-SOD datasets demonstrate the generalizability of the proposed method. The dataset and source code are publicly available at this https URL.
zh
[CV-53] Learning Audio-guided Video Representation with Gated Attention for Video-Text Retrieval CVPR2025
【速读】:该论文旨在解决视频文本检索领域中过度依赖视觉和文本特征而忽视音频信息的问题,同时传统模型在利用音频时未考虑其有用性,导致视频表征次优。为解决这些问题,论文提出了一种名为“Audio-guided VIdeo representation learning with GATEd attention (AVIGATE)”的新框架。该框架的关键在于通过门控注意力机制有效利用音频线索,该机制能够选择性地过滤掉无信息的音频信号。此外,还引入了一种自适应基于边际的对比损失来处理视频与文本之间固有的正负关系不明确的问题,从而促进更好的视频-文本对齐学习。实验结果表明,AVIGATE 在所有公开基准数据集上达到了最先进的性能。
链接: https://arxiv.org/abs/2504.02397
作者: Boseung Jeong,Jicheol Park,Sungyeon Kim,Suha Kwak
机构: Dept. of CSE, POSTECH (计算机科学与工程系, POSTECH); Graduate School of AI, POSTECH (人工智能研究生院, POSTECH)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to CVPR 2025
点击查看摘要
Abstract:Video-text retrieval, the task of retrieving videos based on a textual query or vice versa, is of paramount importance for video understanding and multimodal information retrieval. Recent methods in this area rely primarily on visual and textual features and often ignore audio, although it helps enhance overall comprehension of video content. Moreover, traditional models that incorporate audio blindly utilize the audio input regardless of whether it is useful or not, resulting in suboptimal video representation. To address these limitations, we propose a novel video-text retrieval framework, Audio-guided VIdeo representation learning with GATEd attention (AVIGATE), that effectively leverages audio cues through a gated attention mechanism that selectively filters out uninformative audio signals. In addition, we propose an adaptive margin-based contrastive loss to deal with the inherently unclear positive-negative relationship between video and text, which facilitates learning better video-text alignment. Our extensive experiments demonstrate that AVIGATE achieves state-of-the-art performance on all the public benchmarks.
zh
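下面的片段示意 AVIGATE 两个核心组件的简化版本:门控融合依据音频的有用程度在"纯视觉特征"与"音视频融合特征"之间加权;对比损失部分此处只给出固定边际的基础形式,论文进一步将边际自适应化。维度与结构均为假设。

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedAudioFusion(nn.Module):
    """示意:门控估计音频对当前视频的有用程度,gate 趋近 0 时屏蔽音频。"""
    def __init__(self, dim=512):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(dim * 2, dim), nn.ReLU(),
                                  nn.Linear(dim, 1), nn.Sigmoid())
        self.fuse = nn.Linear(dim * 2, dim)

    def forward(self, vis, aud):                     # (B, D), (B, D)
        g = self.gate(torch.cat([vis, aud], -1))     # (B, 1)
        fused = self.fuse(torch.cat([vis, aud], -1))
        return g * fused + (1 - g) * vis

def margin_contrastive(video, text, margin=0.2):
    """示意:固定边际的视频-文本对比损失;论文中的边际是自适应的。"""
    sim = F.normalize(video, dim=-1) @ F.normalize(text, dim=-1).T  # (B, B)
    pos = sim.diag().unsqueeze(1)
    loss = torch.clamp(sim - pos + margin, min=0)
    eye = torch.eye(sim.shape[0], dtype=torch.bool)
    return loss.masked_fill(eye, 0.0).mean()

vis, aud, txt = torch.randn(4, 512), torch.randn(4, 512), torch.randn(4, 512)
video = GatedAudioFusion()(vis, aud)
loss = margin_contrastive(video, txt)
```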
[CV-54] Marine Saliency Segmenter: Object-Focused Conditional Diffusion with Region-Level Semantic Knowledge Distillation
【速读】:该论文旨在解决基于视觉的海洋探索任务中,现有海洋显著性分割技术因复杂水下环境导致的目标定位错误和边界不精确的问题。同时,尽管扩散模型在视觉分割中表现出色,但仍有进一步挖掘其上下文语义信息以增强区域级显著目标特征学习、从而提升分割效果的空间。为此,论文提出了一种名为DiffMSS的新方法,基于扩散模型实现海洋显著性分割,并利用语义知识蒸馏指导分割过程。关键在于设计了一种区域-词相似性匹配机制,通过文本描述提取高阶语义特征,引导条件特征学习网络生成带有语义知识蒸馏的显著且精确的扩散条件;此外,针对独特海洋生物精细结构的分割,开发了专门的一致性确定性采样策略以抑制过自信的误分割现象。
链接: https://arxiv.org/abs/2504.02391
作者: Laibin Chang,Yunke Wang,JiaXing Huang,Longxiang Deng,Bo Du,Chang Xu
机构: School of Computer Science, Wuhan University (武汉大学); School of Computer Science, The University of Sydney (悉尼大学); School of Computer Science and Engineering, Nanyang Technological University (南洋理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Marine Saliency Segmentation (MSS) plays a pivotal role in various vision-based marine exploration tasks. However, existing marine segmentation techniques face the dilemma of object mislocalization and imprecise boundaries due to the complex underwater environment. Meanwhile, despite the impressive performance of diffusion models in visual segmentation, there remains potential to further leverage contextual semantics to enhance feature learning of region-level salient objects, thereby improving segmentation outcomes. Building on this insight, we propose DiffMSS, a novel marine saliency segmenter based on the diffusion model, which utilizes semantic knowledge distillation to guide the segmentation of marine salient objects. Specifically, we design a region-word similarity matching mechanism to identify salient terms at the word level from the text descriptions. These high-level semantic features guide the conditional feature learning network in generating salient and accurate diffusion conditions with semantic knowledge distillation. To further refine the segmentation of fine-grained structures in unique marine organisms, we develop the dedicated consensus deterministic sampling to suppress overconfident missegmentations. Comprehensive experiments demonstrate the superior performance of DiffMSS over state-of-the-art methods in both quantitative and qualitative evaluations.
zh
[CV-55] VoiceCraft-Dub: Automated Video Dubbing with Neural Codec Language Models CEC
【速读】:本文旨在解决自动化视频配音(video dubbing)的问题,目标是从文本和面部线索生成高质量且自然的语音,同时确保语音与面部表情的时间同步及情感一致性。论文的关键创新在于将神经编解码语言模型(Neural Codec Language Model, NCLM)的语音合成能力扩展至结合视频特征的场景。具体而言,通过设计适配器(adapters)将面部特征映射到NCLM的令牌空间,并引入音视频融合层(audio-visual fusion layers),实现音视频信息在NCLM框架内的有效整合。此外,作者构建了一个名为CelebV-Dub的新数据集,用于支持训练和评估。实验结果表明,所提出的VoiceCraft-Dub方法在主观和客观评价中均优于现有技术,展现了其在影视制作、多媒体创作以及辅助语音障碍者等领域的广泛应用潜力。
链接: https://arxiv.org/abs/2504.02386
作者: Kim Sung-Bin,Jeongsoo Choi,Puyuan Peng,Joon Son Chung,Tae-Hyun Oh,David Harwath
机构: Department of Electrical Engineering, POSTECH; School of Electrical Engineering, KAIST; Department of Computer Science, The University of Texas at Austin
类目: Computer Vision and Pattern Recognition (cs.CV); Audio and Speech Processing (eess.AS)
备注: this https URL
点击查看摘要
Abstract:We present VoiceCraft-Dub, a novel approach for automated video dubbing that synthesizes high-quality speech from text and facial cues. This task has broad applications in filmmaking, multimedia creation, and assisting voice-impaired individuals. Building on the success of Neural Codec Language Models (NCLMs) for speech synthesis, our method extends their capabilities by incorporating video features, ensuring that synthesized speech is time-synchronized and expressively aligned with facial movements while preserving natural prosody. To inject visual cues, we design adapters to align facial features with the NCLM token space and introduce audio-visual fusion layers to merge audio-visual information within the NCLM framework. Additionally, we curate CelebV-Dub, a new dataset of expressive, real-world videos specifically designed for automated video dubbing. Extensive experiments show that our model achieves high-quality, intelligible, and natural speech synthesis with accurate lip synchronization, outperforming existing methods in human perception and performing favorably in objective evaluations. We also adapt VoiceCraft-Dub for the video-to-speech task, demonstrating its versatility for various applications.
zh
[CV-56] Brightness Perceiving for Recursive Low-Light Image Enhancement
【速读】: This paper addresses the fact that, due to the wide dynamic range of real low-light scenes, existing end-to-end methods face large variations in contrast degradation and detail blurring, making it difficult to enhance low-light images to normal exposure. To this end, the authors decompose low-light image enhancement into a recursive enhancement task and propose a brightness-perceiving recursive enhancement framework. The key of the approach is a recursive framework with two parallel sub-networks: an Adaptive Contrast and Texture enhancement network (ACT-Net) and a Brightness Perception network (BP-Net). ACT-Net adaptively enhances image contrast and details under the guidance of a brightness adjustment branch and a gradient adjustment branch, which perceive the degradation degree of contrast and details; BP-Net controls the number of recursive enhancement steps of ACT-Net by exploring image brightness distribution properties, so that images captured at different brightness levels are handled appropriately. To coordinate the two sub-networks, a novel unsupervised training strategy is designed, and a new dataset with a broader brightness distribution is constructed to validate the method. The approach achieves new SOTA performance on six reference and no-reference metrics, improving PSNR by 0.9 dB over the previous best method.
链接: https://arxiv.org/abs/2504.02362
作者: Haodian Wang,Long Peng,Yuejin Sun,Zengyu Wan,Yang Wang,Yang Cao
机构: University of Science and Technology of China (中国科学技术大学), Anhui, China
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注:
点击查看摘要
Abstract:Due to the wide dynamic range in real low-light scenes, there will be large differences in the degree of contrast degradation and detail blurring of captured images, making it difficult for existing end-to-end methods to enhance low-light images to normal exposure. To address the above issue, we decompose low-light image enhancement into a recursive enhancement task and propose a brightness-perceiving-based recursive enhancement framework for high dynamic range low-light image enhancement. Specifically, our recursive enhancement framework consists of two parallel sub-networks: Adaptive Contrast and Texture enhancement network (ACT-Net) and Brightness Perception network (BP-Net). The ACT-Net is proposed to adaptively enhance image contrast and details under the guidance of the brightness adjustment branch and gradient adjustment branch, which are proposed to perceive the degradation degree of contrast and details in low-light images. To adaptively enhance images captured under different brightness levels, BP-Net is proposed to control the recursive enhancement times of ACT-Net by exploring the image brightness distribution properties. Finally, in order to coordinate ACT-Net and BP-Net, we design a novel unsupervised training strategy to facilitate the training procedure. To further validate the effectiveness of the proposed method, we construct a new dataset with a broader brightness distribution by mixing three low-light datasets. Compared with eleven existing representative methods, the proposed method achieves new SOTA performance on six reference and no reference metrics. Specifically, the proposed method improves the PSNR by 0.9 dB compared to the existing SOTA method.
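The control flow of the two-subnetwork design can be sketched in a few lines: BP-Net inspects the brightness statistics and decides how many enhancement passes ACT-Net should run. Both networks below are trivial placeholders we assume purely to make the sketch runnable; only the recursion logic is the point.

```python
import torch
import torch.nn as nn

def recursive_enhance(img, act_net, bp_net, max_steps=5):
    """Sketch of brightness-perceiving recursive enhancement.

    bp_net looks at the image's brightness statistics and decides how many
    times act_net should be applied; act_net performs one enhancement pass.
    """
    n_steps = int(bp_net(img).round().clamp(1, max_steps).item())
    for _ in range(n_steps):
        img = act_net(img)
    return img

# Placeholder networks just to make the sketch runnable
act_net = nn.Sequential(nn.Conv2d(3, 3, 3, padding=1), nn.Sigmoid())
bp_net = lambda x: (0.5 - x.mean()) * 10   # darker image -> more passes
low_light = torch.rand(1, 3, 64, 64) * 0.2
print(recursive_enhance(low_light, act_net, bp_net).shape)
```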
zh
[CV-57] MG-Gen: Single Image to Motion Graphics Generation with Layer Decomposition
【速读】: This paper targets two main problems of existing image-to-video generation methods for animated graphics: first, the generated animations lack active text motion and suffer from object distortion; second, code-based animation generation methods typically require layer-structured vector data, which is often unavailable for motion graphics generation. To address these challenges, the paper proposes MG-Gen, a framework whose key idea is to reconstruct vector-format data from a single raster image, extending the capabilities of code-based methods so that motion graphics can be generated from a raster image within a general image-to-video generation framework. MG-Gen first decomposes the input image into layer-wise elements, reconstructs them as HTML-format data, and then generates executable JavaScript code for the reconstructed HTML. Experiments confirm that the method generates motion graphics while preserving text readability and input consistency, indicating that combining layer decomposition with animation code generation is an effective strategy for motion graphics generation.
链接: https://arxiv.org/abs/2504.02361
作者: Takahiro Shirakawa,Tomoyuki Suzuki,Daichi Haraguchi
机构: CyberAgent (赛博代理公司)
类目: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:General image-to-video generation methods often produce suboptimal animations that do not meet the requirements of animated graphics, as they lack active text motion and exhibit object distortion. Also, code-based animation generation methods typically require layer-structured vector data which are often not readily available for motion graphic generation. To address these challenges, we propose a novel framework named MG-Gen that reconstructs data in vector format from a single raster image to extend the capabilities of code-based methods to enable motion graphics generation from a raster image in the framework of general image-to-video generation. MG-Gen first decomposes the input image into layer-wise elements, reconstructs them as HTML format data and then generates executable JavaScript code for the reconstructed HTML data. We experimentally confirm that MG-Gen generates motion graphics while preserving text readability and input consistency. These successful results indicate that combining layer decomposition and animation code generation is an effective strategy for motion graphics generation.
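The layers-to-HTML-to-JavaScript idea can be illustrated end to end with a toy Python script that takes already-decomposed layers (hypothetical file names and positions) and emits an HTML document plus JavaScript that animates one layer via the Web Animations API. This is our own minimal rendering of the pipeline, not MG-Gen's generated output.

```python
# Minimal sketch of the "layers -> HTML -> JavaScript" idea: each decomposed
# layer becomes an absolutely positioned <img>, and generated JS animates it.
# Layer contents and paths are made up for illustration.
layers = [
    {"id": "bg",    "src": "layer_bg.png",    "x": 0,  "y": 0},
    {"id": "title", "src": "layer_title.png", "x": 40, "y": 20},
]

imgs = "\n".join(
    f'<img id="{l["id"]}" src="{l["src"]}" '
    f'style="position:absolute;left:{l["x"]}px;top:{l["y"]}px">'
    for l in layers
)

html = f"""<!DOCTYPE html>
<html><body style="position:relative">
{imgs}
<script>
  // Generated animation code: slide the title layer in from the left.
  const el = document.getElementById("title");
  el.animate([{{transform: "translateX(-200px)", opacity: 0}},
              {{transform: "translateX(0)", opacity: 1}}],
             {{duration: 1000, fill: "forwards"}});
</script>
</body></html>"""

with open("motion_graphic.html", "w") as f:
    f.write(html)
```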
zh
[CV-58] All-day Depth Completion via Thermal-LiDAR Fusion
【速读】: This paper addresses the insufficient performance of depth completion from sparse LiDAR and RGB images in harsh environments such as heavy rain and low light. Existing methods, limited by RGB sensors, struggle to deliver reliable performance under such conditions, and ground-truth depth maps in adverse weather often contain large amounts of missing data, leading to insufficient supervision. Thermal cameras provide clear and reliable visibility in these environments, but thermal-LiDAR depth completion remains underexplored, and the blurriness, low contrast, and noise of thermal images further aggravate the problem of unclear depth boundaries. The paper first benchmarks the feasibility and robustness of thermal-LiDAR depth completion extensively on the MS² and ViViD datasets across diverse lighting, weather, and scene conditions. To address the above challenges, it then proposes COPS (COntrastive learning and Pseudo-Supervision), a framework that leverages a depth foundation model in two key ways to sharpen depth boundaries and improve completion accuracy: first, a depth-aware contrastive loss is imposed by mining positive and negative samples with the foundation model; second, foundation-model predictions serve as dense depth priors to mitigate the insufficient supervision from ground-truth depth maps.
链接: https://arxiv.org/abs/2504.02356
作者: Janghyun Kim,Minseong Kweon,Jinsun Park,Ukcheol Shin
机构: Pusan National University (釜山国立大学); Carnegie Mellon University (卡内基梅隆大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注:
点击查看摘要
Abstract:Depth completion, which estimates dense depth from sparse LiDAR and RGB images, has demonstrated outstanding performance in well-lit conditions. However, due to the limitations of RGB sensors, existing methods often struggle to achieve reliable performance in harsh environments, such as heavy rain and low-light conditions. Furthermore, we observe that ground truth depth maps often suffer from large missing measurements in adverse weather conditions such as heavy rain, leading to insufficient supervision. In contrast, thermal cameras are known for providing clear and reliable visibility in such conditions, yet research on thermal-LiDAR depth completion remains underexplored. Moreover, the characteristics of thermal images, such as blurriness, low contrast, and noise, bring unclear depth boundary problems. To address these challenges, we first evaluate the feasibility and robustness of thermal-LiDAR depth completion across diverse lighting (e.g., well-lit, low-light), weather (e.g., clear-sky, rainy), and environment (e.g., indoor, outdoor) conditions, by conducting extensive benchmarks on the MS² and ViViD datasets. In addition, we propose a framework that utilizes COntrastive learning and Pseudo-Supervision (COPS) to enhance depth boundary clarity and improve completion accuracy by leveraging a depth foundation model in two key ways. First, COPS enforces a depth-aware contrastive loss between different depth points by mining positive and negative samples using a monocular depth foundation model to sharpen depth boundaries. Second, it mitigates the issue of incomplete supervision from ground truth depth maps by leveraging foundation model predictions as dense depth priors. We also provide in-depth analyses of the key challenges in thermal-LiDAR depth completion to aid in understanding the task and encourage future research.
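One possible shape for a COPS-style objective, combining sparse ground-truth supervision, the dense foundation-model prior, and a toy depth-aware contrastive term over random pixel pairs, is sketched below. The loss weights, the 0.05 depth-similarity threshold, and the pair-mining scheme are simplifications we assume for illustration; the paper's actual mining and loss formulation are not reproduced.

```python
import torch
import torch.nn.functional as F

def cops_style_loss(pred, sparse_gt, prior_depth, emb, w_prior=0.1, w_con=0.1):
    """Sketch of a COPS-style objective (weights and mining are simplified).

    pred:        (B,1,H,W) predicted dense depth
    sparse_gt:   (B,1,H,W) ground-truth depth, 0 where missing
    prior_depth: (B,1,H,W) dense prediction from a depth foundation model
    emb:         (B,C,H,W) per-pixel embeddings used for the contrastive term
    """
    valid = sparse_gt > 0
    l_gt = F.l1_loss(pred[valid], sparse_gt[valid]) if valid.any() else pred.new_zeros(())
    l_prior = F.l1_loss(pred, prior_depth)              # dense pseudo-supervision

    # Toy depth-aware contrastive term: random pixel pairs with similar prior
    # depth should have similar embeddings; dissimilar pairs should not.
    B, C, H, W = emb.shape
    idx = torch.randint(0, H * W, (2, 256))
    e = F.normalize(emb.flatten(2), dim=1)              # (B, C, HW)
    d = prior_depth.flatten(2)                          # (B, 1, HW)
    sim = (e[:, :, idx[0]] * e[:, :, idx[1]]).sum(1)    # (B, 256)
    same = (d[:, 0, idx[0]] - d[:, 0, idx[1]]).abs() < 0.05
    l_con = torch.where(same, 1 - sim, F.relu(sim - 0.2)).mean()

    return l_gt + w_prior * l_prior + w_con * l_con
```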
zh
[CV-59] Agglomerating Large Vision Encoders via Distillation for VFSS Segmentation
【速读】: This paper aims to reduce the large training overhead for downstream tasks and the high inference complexity of foundation models in medical imaging. Lightweight variants exist, but their performance is constrained by limited model capacity and suboptimal training strategies. To achieve a better complexity-performance trade-off, the paper proposes a new framework that improves low-complexity models via multi-model knowledge distillation from several large medical foundation models (e.g., MedSAM, RAD-DINO, MedCLIP), each specializing in different vision tasks. The key is to use this multi-teacher distillation to effectively bridge the performance gap for medical image segmentation: the agglomerated model generalizes well across 12 segmentation tasks, whereas specialized models require explicit training for each task. Experiments show an average gain of 2% in Dice coefficient over simple distillation.
链接: https://arxiv.org/abs/2504.02351
作者: Chengxi Zeng,Yuxuan Jiang,Fan Zhang,Alberto Gambaruto,Tilo Burghardt
机构: University of Bristol (布里斯托尔大学), UK
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:The deployment of foundation models for medical imaging has demonstrated considerable success. However, their training overheads associated with downstream tasks remain substantial due to the size of the image encoders employed, and the inference complexity is also significantly high. Although lightweight variants have been obtained for these foundation models, their performance is constrained by their limited model capacity and suboptimal training strategies. In order to achieve an improved tradeoff between complexity and performance, we propose a new framework to improve the performance of low complexity models via knowledge distillation from multiple large medical foundation models (e.g., MedSAM, RAD-DINO, MedCLIP), each specializing in different vision tasks, with the goal to effectively bridge the performance gap for medical image segmentation tasks. The agglomerated model demonstrates superior generalization across 12 segmentation tasks, whereas specialized models require explicit training for each task. Our approach achieved an average performance gain of 2% in Dice coefficient compared to simple distillation.
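A minimal multi-teacher feature-distillation setup matching the description might look as follows: the student's feature map is projected into each frozen teacher's channel space and penalized toward it, and the average alignment loss is added to the task loss. The teacher outputs here are random tensors standing in for MedSAM/RAD-DINO/MedCLIP features, and the 1x1 projection design is our own assumption.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiTeacherDistiller(nn.Module):
    """Sketch: align student features with several frozen teachers via 1x1 projections."""
    def __init__(self, student_dim, teacher_dims):
        super().__init__()
        self.heads = nn.ModuleList(
            nn.Conv2d(student_dim, d, kernel_size=1) for d in teacher_dims
        )

    def forward(self, student_feat, teacher_feats):
        loss = 0.0
        for head, t in zip(self.heads, teacher_feats):
            s = head(student_feat)
            # Match spatial size before comparing feature maps
            s = F.interpolate(s, size=t.shape[-2:], mode="bilinear", align_corners=False)
            loss = loss + F.mse_loss(s, t.detach())
        return loss / len(self.heads)

# Toy usage: one student map, three teachers with different channel widths
student = torch.randn(2, 64, 32, 32)
teachers = [torch.randn(2, 256, 16, 16), torch.randn(2, 768, 32, 32), torch.randn(2, 512, 8, 8)]
distiller = MultiTeacherDistiller(64, [256, 768, 512])
print(distiller(student, teachers))   # add this to the segmentation task loss
```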
zh
[CV-60] SemiISP/SemiIE: Semi-Supervised Image Signal Processor and Image Enhancement Leveraging One-to-Many Mapping sRGB-to-RAW
【速读】: This paper addresses the difficulty of building large-scale datasets for DNN-based Image Signal Processor (ISP) and image enhancement (IE) tasks, where training data is costly to create, and explores how to achieve high-quality ISP and IE with only a small amount of personalized training data. Because preferred image quality varies across users and use cases, semi-supervised learning is a promising direction for exploiting this latent value, yet it has rarely been applied to these tasks. The key of the paper is a semi-supervised learning scheme for ISP and IE built on RAW image reconstruction (sRGB-to-RAW). Although existing sRGB-to-RAW methods can generate pseudo-RAW datasets that help RAW-based high-level vision tasks such as object detection, their quality is insufficient for ISP and IE, which require a precise definition of image quality; the paper therefore also proposes an improved sRGB-to-RAW method that raises image quality to meet this requirement. The proposed semi-supervised learning combined with the new sRGB-to-RAW method successfully improves the image quality of various models on various datasets.
链接: https://arxiv.org/abs/2504.02345
作者: Masakazu Yoshimura,Junji Otsuka,Radu Berdan,Takeshi Ohashi
机构: Sony Group Corporation (索尼集团); SonyAI (索尼AI)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:DNN-based methods have been successful in Image Signal Processor (ISP) and image enhancement (IE) tasks. However, the cost of creating training data for these tasks is considerably higher than for other tasks, making it difficult to prepare large-scale datasets. Also, creating personalized ISP and IE with minimal training data can lead to new value streams since preferred image quality varies depending on the person and use case. While semi-supervised learning could be a potential solution in such cases, it has rarely been utilized for these tasks. In this paper, we realize semi-supervised learning for ISP and IE leveraging a RAW image reconstruction (sRGB-to-RAW) method. Although existing sRGB-to-RAW methods can generate pseudo-RAW image datasets that improve the accuracy of RAW-based high-level computer vision tasks such as object detection, their quality is not sufficient for ISP and IE tasks that require precise image quality definition. Therefore, we also propose a sRGB-to-RAW method that can improve the image quality of these tasks. The proposed semi-supervised learning with the proposed sRGB-to-RAW method successfully improves the image quality of various models on various datasets.
zh
[CV-61] LPA3D: 3D Room-Level Scene Generation from In-the-Wild Images
【速读】: This paper aims to generate realistic, room-level indoor scenes with plausible semantics and detailed appearance from in-the-wild images, which is crucial for applications in VR, AR, and robotics. Existing NeRF-based scene-level generative methods require additional information (such as multiple views, depth images, or semantic guidance) beyond RGB images, because NeRF-based methods need camera pose priors that are hard to approximate for indoor scenes: alignment is difficult to define, and globally estimating poses from a single image is challenging given the unseen parts behind the camera. The key of the solution is to redefine global poses within a Local-Pose-Alignment (LPA) framework, an anchor-based multi-local-coordinate system that uses a selected number of anchors as the roots of these coordinates. Building on this, the paper introduces LPA-GAN, a novel NeRF-based generative approach that incorporates specific modifications to estimate camera pose priors under LPA and co-optimizes the pose predictor with the scene generation process. Ablation studies and comparisons with straightforward extensions of NeRF-based object generation methods demonstrate the effectiveness of the approach, and visual comparisons show superior view-to-view consistency and semantic normality over other techniques.
链接: https://arxiv.org/abs/2504.02337
作者: Ming-Jia Yang,Yu-Xiao Guo,Yang Liu,Bin Zhou,Xin Tong
机构: Beihang University (北京航空航天大学); Microsoft Research Asia (微软亚洲研究院); BeijingP. R. China
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Generating realistic, room-level indoor scenes with semantically plausible and detailed appearances from in-the-wild images is crucial for various applications in VR, AR, and robotics. The success of NeRF-based generative methods indicates a promising direction to address this challenge. However, unlike their success at the object level, existing scene-level generative methods require additional information, such as multiple views, depth images, or semantic guidance, rather than relying solely on RGB images. This is because NeRF-based methods necessitate prior knowledge of camera poses, which is challenging to approximate for indoor scenes due to the complexity of defining alignment and the difficulty of globally estimating poses from a single image, given the unseen parts behind the camera. To address this challenge, we redefine global poses within the framework of Local-Pose-Alignment (LPA) – an anchor-based multi-local-coordinate system that uses a selected number of anchors as the roots of these coordinates. Building on this foundation, we introduce LPA-GAN, a novel NeRF-based generative approach that incorporates specific modifications to estimate the priors of camera poses under LPA. It also co-optimizes the pose predictor and scene generation processes. Our ablation study and comparisons with straightforward extensions of NeRF-based object generative methods demonstrate the effectiveness of our approach. Furthermore, visual comparisons with other techniques reveal that our method achieves superior view-to-view consistency and semantic normality.
zh
[CV-62] Evaluating and Enhancing Segmentation Model Robustness with Metamorphic Testing
【速读】: This paper targets the lack of robustness of image segmentation models to subtle image distortions, in particular their vulnerability to adversarial perturbations. The proposed SegRMT is a metamorphic testing approach whose key is to use a genetic algorithm (GA) to optimize sequences of spatial and spectral transformations while preserving image fidelity via a predefined PSNR threshold. Experiments on the Cityscapes dataset show that SegRMT effectively generates adversarial examples that challenge DeepLabV3, reducing its mean Intersection over Union (mIoU) to 6.4%; when used for adversarial training, it boosts model performance significantly, with mIoU improvements of up to 73% on dedicated adversarial datasets and a cross-adversarial mIoU of 53.8%, outperforming other methods. This shows that SegRMT not only simulates realistic image distortions but also enhances the robustness of segmentation models, which is valuable for ensuring reliable performance in safety-critical applications.
链接: https://arxiv.org/abs/2504.02335
作者: Seif Mzoughi,Mohamed Elshafeia,Foutse Khomh
机构: Polytechnique Montréal (蒙特利尔理工学院)
类目: Computer Vision and Pattern Recognition (cs.CV); Software Engineering (cs.SE)
备注:
点击查看摘要
Abstract:Image segmentation is critical for applications such as medical imaging, augmented reality, and video surveillance. However, segmentation models often lack robustness, making them vulnerable to adversarial perturbations from subtle image distortions. In this work, we propose SegRMT, a metamorphic testing approach that leverages genetic algorithms (GA) to optimize sequences of spatial and spectral transformations while preserving image fidelity via a predefined PSNR threshold. Using the Cityscapes dataset, our method generates adversarial examples that effectively challenge the DeepLabV3 segmentation model. Our experiments show that SegRMT reduces DeepLabV3’s mean Intersection over Union (mIoU) to 6.4%, outperforming other adversarial baselines that decrease mIoU to between 8.5% and 21.7%. Furthermore, when used for adversarial training, SegRMT boosts model performance, achieving mIoU improvements up to 73% on dedicated adversarial datasets and increasing cross-adversarial mIoU to 53.8%, compared to only 2%-10% for other methods. These findings demonstrate that SegRMT not only simulates realistic image distortions but also enhances the robustness of segmentation models, making it a valuable tool for ensuring reliable performance in safety-critical applications.
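The core loop, evolving transformation sequences under a PSNR fidelity constraint, can be sketched as a toy GA in plain NumPy. The transformation pool, mutation rate, and the user-supplied `fitness` callback (for example, the mIoU drop of a segmenter on the transformed image) are illustrative stand-ins for SegRMT's actual operators.

```python
import random
import numpy as np

def psnr(a, b, peak=255.0):
    mse = np.mean((a.astype(np.float64) - b.astype(np.float64)) ** 2)
    return float("inf") if mse == 0 else 10 * np.log10(peak ** 2 / mse)

# A tiny pool of spatial/spectral perturbations (stand-ins for SegRMT's set)
TRANSFORMS = {
    "brightness": lambda img: np.clip(img + 10, 0, 255),
    "noise":      lambda img: np.clip(img + np.random.normal(0, 3, img.shape), 0, 255),
    "flip":       lambda img: img[:, ::-1],
}

def apply_seq(img, seq):
    for name in seq:
        img = TRANSFORMS[name](img)
    return img

def evolve(img, fitness, psnr_min=30.0, pop=20, gens=10, seq_len=3):
    """Toy GA: maximize `fitness` (e.g. mIoU drop of a segmenter) under a PSNR floor."""
    names = list(TRANSFORMS)
    population = [[random.choice(names) for _ in range(seq_len)] for _ in range(pop)]
    for _ in range(gens):
        scored = []
        for seq in population:
            out = apply_seq(img, seq)
            if psnr(img, out) >= psnr_min:           # fidelity constraint
                scored.append((fitness(out), seq))
        scored.sort(key=lambda s: s[0], reverse=True)
        parents = [s[1] for s in scored[: pop // 2]] or population
        population = [
            [g if random.random() > 0.2 else random.choice(names)   # mutation
             for g in random.choice(parents)]
            for _ in range(pop)
        ]
    return max(population, key=lambda s: fitness(apply_seq(img, s)))
```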
zh
[CV-63] Determining Sphere Radius through Pairwise Distances
【速读】: This paper determines the radius of a spherical surface from distance measurements between points on the surface, with particular attention to radius estimation when the measurements contain errors and the sphere deviates randomly from its ideal shape. The key of the solution is a new closed-form expression that computes the sphere radius directly from the matrix of pairwise distances, using configurations with the minimally necessary four points as well as an arbitrary number N of points. The paper also analyzes the standard deviation of the radius estimate caused by measurement errors and shape deviations, and finds optimal point configurations on the sphere that minimize this standard deviation. The core contribution is the closed-form solution together with the full mathematical derivations.
链接: https://arxiv.org/abs/2504.02334
作者: Boris Sukhovilov
机构: 未知
类目: Computational Geometry (cs.CG); Computer Vision and Pattern Recognition (cs.CV)
备注: 10 pages, we share the implementation of our method as open source code at this https URL
点击查看摘要
Abstract:We propose a novel method for determining the radius of a spherical surface based on the distances measured between points on this surface. We consider the most general case of determining the radius when the distances are measured with errors and the sphere has random deviations from its ideal shape. For the solution, we used the minimally necessary four points and an arbitrary N number of points. We provide a new closed form solution for the radius of the sphere through the matrix of pairwise distances. We also determine the standard deviation of the radius estimate caused by measurement errors and deviations of the sphere from its ideal shape. We found optimal configurations of points on the sphere that provide the minimum standard deviation of the radius estimate. This paper describes our solution and provides all the mathematical derivations. We share the implementation of our method as open source code at this https URL.
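The minimal four-point case has a well-known closed form via the Cayley-Menger determinant, R^2 = -det(D) / (2 det(CM)), where D is the matrix of squared pairwise distances and CM its bordered Cayley-Menger matrix. The sketch below implements that classical identity as an illustration; the paper's own N-point expression may differ.

```python
import numpy as np

def sphere_radius_from_distances(dist):
    """Radius of the sphere through 4 points, given their 4x4 pairwise-distance
    matrix, via the classical Cayley-Menger identity R^2 = -|D| / (2 |CM|)."""
    D = np.asarray(dist, dtype=float) ** 2          # squared distances
    n = D.shape[0]
    cm = np.ones((n + 1, n + 1))                    # bordered matrix of ones
    cm[0, 0] = 0.0
    cm[1:, 1:] = D
    return np.sqrt(-np.linalg.det(D) / (2.0 * np.linalg.det(cm)))

# Check on a regular tetrahedron with unit edges: R = sqrt(3/8) ~ 0.6124
dist = np.ones((4, 4)) - np.eye(4)
print(sphere_radius_from_distances(dist))
```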
zh
[CV-64] Towards Assessing Deep Learning Test Input Generators
【速读】: This paper addresses the risk that robustness issues in Deep Learning (DL) systems deployed in safety-critical applications can lead to significant failures: although numerous Test Input Generators (TIGs) have been developed to evaluate DL robustness, a comprehensive assessment of their effectiveness across different dimensions is still lacking. The paper evaluates four state-of-the-art TIGs (DeepHunter, DeepFault, AdvGAN, and SinVAD) along four key aspects: fault-revealing capability, naturalness, diversity, and efficiency. The empirical study, based on three pre-trained models (LeNet-5, VGG16, and EfficientNetB3) on datasets of varying complexity (MNIST, CIFAR-10, and ImageNet-1K), reveals trade-offs in robustness-revealing capability, variation in test case generation, and computational efficiency, and shows that TIG performance varies significantly with dataset complexity. The key contribution is this multi-dimensional assessment, which offers practical guidance for selecting TIGs aligned with specific objectives and dataset characteristics, while highlighting the need for further work to address TIG limitations and advance their use in real-world safety-critical systems.
链接: https://arxiv.org/abs/2504.02329
作者: Seif Mzoughi,Ahmed Hajyahmed,Mohamed Elshafei,Foutse Khomh,Diego Elias Costa
机构: Polytechnique Montreal (蒙特利尔理工学院); Concordia University (康考迪亚大学)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Software Engineering (cs.SE)
备注: Accepted to EASE 2025
点击查看摘要
Abstract:Deep Learning (DL) systems are increasingly deployed in safety-critical applications, yet they remain vulnerable to robustness issues that can lead to significant failures. While numerous Test Input Generators (TIGs) have been developed to evaluate DL robustness, a comprehensive assessment of their effectiveness across different dimensions is still lacking. This paper presents a comprehensive assessment of four state-of-the-art TIGs–DeepHunter, DeepFault, AdvGAN, and SinVAD–across multiple critical aspects: fault-revealing capability, naturalness, diversity, and efficiency. Our empirical study leverages three pre-trained models (LeNet-5, VGG16, and EfficientNetB3) on datasets of varying complexity (MNIST, CIFAR-10, and ImageNet-1K) to evaluate TIG performance. Our findings reveal important trade-offs in robustness revealing capability, variation in test case generation, and computational efficiency across TIGs. The results also show that TIG performance varies significantly with dataset complexity, as tools that perform well on simpler datasets may struggle with more complex ones. In contrast, others maintain steadier performance or better scalability. This paper offers practical guidance for selecting appropriate TIGs aligned with specific objectives and dataset characteristics. Nonetheless, more work is needed to address TIG limitations and advance TIGs for real-world, safety-critical systems.
zh
[CV-65] Refining CLIPs Spatial Awareness: A Visual-Centric Perspective ICLR2025
【速读】: This paper addresses the insufficient spatial awareness of CLIP (Contrastive Language-Image Pre-training) in dense multimodal tasks: after fine-tuning with Region-Language Alignment (RLA), CLIP's spatial awareness degrades notably, hurting tasks that require precise spatial understanding. The key of the solution is the Spatial Correlation Distillation (SCD) framework, which preserves the inherent spatial structure of CLIP vision transformers (ViTs) and mitigates the degradation caused by RLA. To further enhance spatial correlations, a lightweight Refiner is introduced that extracts high-quality dense features directly from CLIP and feeds them into SCD, based on the finding that CLIP naturally captures high-quality dense features. Together, these components form a robust distillation framework that enables CLIP ViTs to achieve state-of-the-art results on open-vocabulary dense prediction benchmarks.
链接: https://arxiv.org/abs/2504.02328
作者: Congpei Qiu,Yanhao Wu,Wei Ke,Xiuxiu Bai,Tong Zhang
机构: School of Software Engineering, Xi’an Jiaotong University (软件工程学院,西安交通大学); School of Computer and Communication Sciences, EPFL (计算机与通信科学学院,洛桑联邦理工学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: ICLR 2025
点击查看摘要
Abstract:Contrastive Language-Image Pre-training (CLIP) excels in global alignment with language but exhibits limited sensitivity to spatial information, leading to strong performance in zero-shot classification tasks but underperformance in tasks requiring precise spatial understanding. Recent approaches have introduced Region-Language Alignment (RLA) to enhance CLIP’s performance in dense multimodal tasks by aligning regional visual representations with corresponding text inputs. However, we find that CLIP ViTs fine-tuned with RLA suffer from notable loss in spatial awareness, which is crucial for dense prediction tasks. To address this, we propose the Spatial Correlation Distillation (SCD) framework, which preserves CLIP’s inherent spatial structure and mitigates the above degradation. To further enhance spatial correlations, we introduce a lightweight Refiner that extracts refined correlations directly from CLIP before feeding them into SCD, based on an intriguing finding that CLIP naturally captures high-quality dense features. Together, these components form a robust distillation framework that enables CLIP ViTs to integrate both visual-language and visual-centric improvements, achieving state-of-the-art results across various open-vocabulary dense prediction benchmarks.
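Spatial-correlation distillation in its simplest form compares token-token cosine-similarity maps between the frozen CLIP teacher and the RLA-fine-tuned student. The sketch below assumes dense patch tokens of shape (B, N, D) and an MSE penalty; the Refiner and the paper's exact loss are not reproduced.

```python
import torch
import torch.nn.functional as F

def spatial_correlation(feats):
    """feats: (B, N, D) dense patch tokens -> (B, N, N) cosine-similarity map."""
    f = F.normalize(feats, dim=-1)
    return f @ f.transpose(1, 2)

def scd_loss(student_tokens, teacher_tokens):
    """Distill the teacher's token-token correlation structure into the student."""
    with torch.no_grad():
        corr_t = spatial_correlation(teacher_tokens)
    corr_s = spatial_correlation(student_tokens)
    return F.mse_loss(corr_s, corr_t)

# Toy usage: 196 patch tokens (14x14) from frozen CLIP vs. the fine-tuned model
teacher = torch.randn(2, 196, 768)
student = torch.randn(2, 196, 768, requires_grad=True)
print(scd_loss(student, teacher))
```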
zh
[CV-66] X-Capture: An Open-Source Portable Device for Multi-Sensory Learning
【速读】: This paper addresses the limitations of AI and robotic systems in replicating human multi-sensory perception, caused by the lack of real-world multi-sensory data: existing datasets are often restricted to controlled environments, simulated objects, or limited modality pairings, which makes cross-sensory integration and rich understanding difficult. The key of the solution is X-Capture, an open-source, portable device costing under $1,000 that collects correlated RGBD images, tactile readings, and impact audio in the real world, requiring only consumer-grade tools for assembly. Using X-Capture, the authors curate a sample dataset of 3,000 points on 500 everyday objects from diverse real-world environments and demonstrate the value of both the quantity and the sensory breadth of the data for pretraining and fine-tuning multimodal representations on object-centric tasks such as cross-sensory retrieval and reconstruction. Emphasizing scalability, accessibility, and real-world applicability, X-Capture lays the groundwork for human-like sensory representations in AI.
链接: https://arxiv.org/abs/2504.02318
作者: Samuel Clarke,Suzannah Wistreich,Yanjie Ze,Jiajun Wu
机构: Stanford University (斯坦福大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注: Project page: this https URL
点击查看摘要
Abstract:Understanding objects through multiple sensory modalities is fundamental to human perception, enabling cross-sensory integration and richer comprehension. For AI and robotic systems to replicate this ability, access to diverse, high-quality multi-sensory data is critical. Existing datasets are often limited by their focus on controlled environments, simulated objects, or restricted modality pairings. We introduce X-Capture, an open-source, portable, and cost-effective device for real-world multi-sensory data collection, capable of capturing correlated RGBD images, tactile readings, and impact audio. With a build cost under $1,000, X-Capture democratizes the creation of multi-sensory datasets, requiring only consumer-grade tools for assembly. Using X-Capture, we curate a sample dataset of 3,000 total points on 500 everyday objects from diverse, real-world environments, offering both richness and variety. Our experiments demonstrate the value of both the quantity and the sensory breadth of our data for both pretraining and fine-tuning multi-modal representations for object-centric tasks such as cross-sensory retrieval and reconstruction. X-Capture lays the groundwork for advancing human-like sensory representations in AI, emphasizing scalability, accessibility, and real-world applicability.
zh
[CV-67] ConsDreamer: Advancing Multi-View Consistency for Zero-Shot Text-to-3D Generation
【速读】: This paper addresses the multi-view consistency problem in zero-shot text-to-3D generation, in particular the multi-face "Janus" problem caused by inherent view biases of pre-trained text-to-image (T2I) models, where generated 3D objects exhibit conflicting features across views. The key of the proposed ConsDreamer framework is to mitigate view bias by refining both the conditional and unconditional terms in the score distillation process: (1) a View Disentanglement Module (VDM) eliminates viewpoint biases in conditional prompts by decoupling irrelevant view components and injecting precise camera parameters; and (2) a similarity-based partial order loss enforces geometric consistency in the unconditional term by aligning cosine similarities with azimuth relationships. Extensive experiments show that ConsDreamer effectively mitigates the multi-face Janus problem and outperforms existing methods in both visual quality and consistency.
链接: https://arxiv.org/abs/2504.02316
作者: Yuan Zhou,Shilong Jin,Litao Hua,Wanjun Lv,Haoran Duan,Jungong Han
机构: School of Artificial Intelligence, Nanjing University of Information Science and Technology (南京信息工程大学人工智能学院); Department of Automation, Tsinghua University (清华大学自动化系); Lenovo (联想)(北京, 中国)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 13 pages, 11 figures, 3 tables
点击查看摘要
Abstract:Recent advances in zero-shot text-to-3D generation have revolutionized 3D content creation by enabling direct synthesis from textual descriptions. While state-of-the-art methods leverage 3D Gaussian Splatting with score distillation to enhance multi-view rendering through pre-trained text-to-image (T2I) models, they suffer from inherent view biases in T2I priors. These biases lead to inconsistent 3D generation, particularly manifesting as the multi-face Janus problem, where objects exhibit conflicting features across views. To address this fundamental challenge, we propose ConsDreamer, a novel framework that mitigates view bias by refining both the conditional and unconditional terms in the score distillation process: (1) a View Disentanglement Module (VDM) that eliminates viewpoint biases in conditional prompts by decoupling irrelevant view components and injecting precise camera parameters; and (2) a similarity-based partial order loss that enforces geometric consistency in the unconditional term by aligning cosine similarities with azimuth relationships. Extensive experiments demonstrate that ConsDreamer effectively mitigates the multi-face Janus problem in text-to-3D generation, outperforming existing methods in both visual quality and consistency.
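One simplified reading of the similarity-based partial order loss: for a reference view, any view closer in azimuth should have a higher feature cosine similarity, enforced with a margin. The triple loop, margin value, and feature source below are our assumptions, not the paper's formulation.

```python
import torch
import torch.nn.functional as F

def partial_order_loss(feats, azimuths, margin=0.05):
    """Sketch: make cosine similarity between view features follow azimuth
    proximity. feats: (V, D), one feature per rendered view; azimuths: (V,)
    in degrees. A simplified interpretation of the paper's loss."""
    f = F.normalize(feats, dim=-1)
    sim = f @ f.T                                       # (V, V)
    # Angular distance on the circle, in [0, 180]
    diff = (azimuths[None, :] - azimuths[:, None]).abs() % 360
    ang = torch.minimum(diff, 360 - diff)

    loss, count = feats.new_zeros(()), 0
    V = feats.size(0)
    for i in range(V):
        for j in range(V):
            for k in range(V):
                if ang[i, j] + 1e-6 < ang[i, k]:        # j closer to i than k
                    loss = loss + F.relu(sim[i, k] - sim[i, j] + margin)
                    count += 1
    return loss / max(count, 1)

feats = torch.randn(6, 128, requires_grad=True)
az = torch.tensor([0., 60., 120., 180., 240., 300.])
print(partial_order_loss(feats, az))
```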
zh
[CV-68] OmniCam: Unified Multimodal Video Generation via Camera Control
【速读】: This paper addresses challenges faced by existing camera control methods, such as complex interaction and limited control capabilities. The proposed OmniCam is a unified multimodal camera control framework whose key is to combine large language models with video diffusion models to generate spatio-temporally consistent videos. It supports various combinations of input modalities: the user can provide text or a video with the expected trajectory as camera path guidance, and an image or video as content reference, enabling precise control over camera motion. To train OmniCam, the authors introduce the OmniTr dataset, which contains a large collection of high-quality long-sequence trajectories, videos, and corresponding descriptions. Experimental results show that the model achieves state-of-the-art performance in high-quality camera-controlled video generation across various metrics.
链接: https://arxiv.org/abs/2504.02312
作者: Xiaoda Yang,Jiayang Xu,Kaixuan Luan,Xinyu Zhan,Hongshun Qiu,Shijun Shi,Hao Li,Shuai Yang,Li Zhang,Checheng Yu,Cewu Lu,Lixin Yang
机构: Zhejiang University (浙江大学); Shanghai Jiao Tong University (上海交通大学); Beijing University of Technology (北京工业大学); Jiangnan University (江南大学); University of Science and Technology of China (中国科学技术大学); Nanjing University (南京大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Camera control, which achieves diverse visual effects by changing camera position and pose, has attracted widespread attention. However, existing methods face challenges such as complex interaction and limited control capabilities. To address these issues, we present OmniCam, a unified multimodal camera control framework. Leveraging large language models and video diffusion models, OmniCam generates spatio-temporally consistent videos. It supports various combinations of input modalities: the user can provide text or video with expected trajectory as camera path guidance, and image or video as content reference, enabling precise control over camera motion. To facilitate the training of OmniCam, we introduce the OmniTr dataset, which contains a large collection of high-quality long-sequence trajectories, videos, and corresponding descriptions. Experimental results demonstrate that our model achieves state-of-the-art performance in high-quality camera-controlled video generation across various metrics.
zh
[CV-69] MultiSensor-Home: A Wide-area Multi-modal Multi-view Dataset for Action Recognition and Transformer-based Sensor Fusion
【速读】: This paper addresses practical challenges in multi-modal multi-view action recognition: existing datasets fail to cover real-world conditions such as wide-area environments, asynchronous data streams, and frame-level annotations, while existing methods struggle to model inter-view relationships effectively and to enhance spatial feature learning. To this end, the paper proposes the Multi-modal Multi-view Transformer-based Sensor Fusion (MultiTSF) method and introduces the MultiSensor-Home dataset, a new benchmark for comprehensive action recognition in home environments featuring untrimmed videos from distributed sensors with high-resolution RGB and audio data and detailed multi-view frame-level action labels. The key of MultiTSF is a Transformer-based fusion mechanism that dynamically models inter-view relationships, combined with an external human detection module to enhance spatial feature learning. Experiments on the MultiSensor-Home and MM-Office datasets show that MultiTSF outperforms state-of-the-art methods.
链接: https://arxiv.org/abs/2504.02287
作者: Trung Thanh Nguyen,Yasutomo Kawanishi,Vijay John,Takahiro Komamizu,Ichiro Ide
机构: Graduate School of Informatics, Nagoya University (名古屋大学信息研究生院); Guardian Robot Project, Information R&D and Strategy Headquarters, RIKEN (理化学研究所守护机器人项目, 信息研发与战略总部); Center for Artificial Intelligence, Mathematical and Data Science, Nagoya University (名古屋大学人工智能数学与数据科学中心)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Multi-modal multi-view action recognition is a rapidly growing field in computer vision, offering significant potential for applications in surveillance. However, current datasets often fail to address real-world challenges such as wide-area environmental conditions, asynchronous data streams, and the lack of frame-level annotations. Furthermore, existing methods face difficulties in effectively modeling inter-view relationships and enhancing spatial feature learning. In this study, we propose the Multi-modal Multi-view Transformer-based Sensor Fusion (MultiTSF) method and introduce the MultiSensor-Home dataset, a novel benchmark designed for comprehensive action recognition in home environments. The MultiSensor-Home dataset features untrimmed videos captured by distributed sensors, providing high-resolution RGB and audio data along with detailed multi-view frame-level action labels. The proposed MultiTSF method leverages a Transformer-based fusion mechanism to dynamically model inter-view relationships. Furthermore, the method also integrates a external human detection module to enhance spatial feature learning. Experiments on MultiSensor-Home and MM-Office datasets demonstrate the superiority of MultiTSF over the state-of-the-art methods. The quantitative and qualitative results highlight the effectiveness of the proposed method in advancing real-world multi-modal multi-view action recognition.
zh
[CV-70] Moment Quantization for Video Temporal Grounding
【速读】: This paper addresses the challenge of distinguishing relevant from irrelevant moments in video temporal grounding. Previous methods focused on learning continuous features, which differentiate foreground from background features only weakly. The proposed Moment-Quantization based Video Temporal Grounding method (MQVTG) quantizes the input video into various discrete vectors to enhance this discrimination. The key is a learnable moment codebook in which each video moment matches a codeword; considering visual diversity (various visual expressions of the same moment), moment-codeword matching is treated as a clustering process rather than direct hard quantization, avoiding the loss of useful information, and effective prior-initialization and joint-projection strategies further enhance the maintained codebook. The method is simple to implement and can be integrated into existing temporal grounding models as a plug-and-play component. Experiments on six popular benchmarks show that MQVTG significantly outperforms state-of-the-art methods, and qualitative analysis confirms that it effectively groups relevant features and separates irrelevant ones, in line with the goal of enhancing discrimination.
链接: https://arxiv.org/abs/2504.02286
作者: Xiaolong Sun,Le Wang,Sanping Zhou,Liushuai Shi,Kun Xia,Mengnan Liu,Yabing Wang,Gang Hua
机构: National Key Laboratory of Human-Machine Hybrid Augmented Intelligence (人机混合增强智能国家重点实验室), National Engineering Research Center for Visual Information and Applications (视觉信息工程国家工程研究中心), Institute of Artificial Intelligence and Robotics (人工智能与机器人研究所), Xi’an Jiaotong University (西安交通大学); Multimodal Experiences Research Lab (多模态体验研究实验室), Dolby Laboratories (杜比实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Video temporal grounding is a critical video understanding task, which aims to localize moments relevant to a language description. The challenge of this task lies in distinguishing relevant and irrelevant moments. Previous methods focused on learning continuous features exhibit weak differentiation between foreground and background features. In this paper, we propose a novel Moment-Quantization based Video Temporal Grounding method (MQVTG), which quantizes the input video into various discrete vectors to enhance the discrimination between relevant and irrelevant moments. Specifically, MQVTG maintains a learnable moment codebook, where each video moment matches a codeword. Considering the visual diversity, i.e., various visual expressions for the same moment, MQVTG treats moment-codeword matching as a clustering process without using discrete vectors, avoiding the loss of useful information from direct hard quantization. Additionally, we employ effective prior-initialization and joint-projection strategies to enhance the maintained moment codebook. With its simple implementation, the proposed method can be integrated into existing temporal grounding models as a plug-and-play component. Extensive experiments on six popular benchmarks demonstrate the effectiveness and generalizability of MQVTG, significantly outperforming state-of-the-art methods. Further qualitative analysis shows that our method effectively groups relevant features and separates irrelevant ones, aligning with our goal of enhancing discrimination.
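The "clustering instead of hard quantization" idea can be sketched as soft attention over a learnable codebook: each moment feature becomes a similarity-weighted mixture of codewords, so no information is discarded by an argmax. The codebook size, dimension, and temperature below are assumed values for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SoftMomentCodebook(nn.Module):
    """Sketch: soft matching of moment features to a learnable codebook,
    avoiding the information loss of hard (argmax) quantization."""
    def __init__(self, num_codes=256, dim=512, temperature=0.07):
        super().__init__()
        self.codebook = nn.Parameter(torch.randn(num_codes, dim) * 0.02)
        self.temperature = temperature

    def forward(self, moments):                   # (B, T, D) moment features
        m = F.normalize(moments, dim=-1)
        c = F.normalize(self.codebook, dim=-1)
        logits = m @ c.T / self.temperature       # (B, T, K) similarities
        weights = logits.softmax(dim=-1)          # soft assignment (clustering view)
        return weights @ self.codebook            # quantized-but-soft features

codebook = SoftMomentCodebook()
video = torch.randn(2, 75, 512)                   # 75 moments per video
print(codebook(video).shape)                      # torch.Size([2, 75, 512])
```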
zh
[CV-71] LLM -Guided Evolution: An Autonomous Model Optimization for Object Detection
【速读】: This paper aims to improve the efficiency and performance of Neural Architecture Search (NAS) for object detection. Traditional NAS relies on domain knowledge and trial-and-error, while evolutionary algorithms have traditionally been limited by fixed rules and pre-defined building blocks. To address this, the paper builds on the Large Language Model-Guided Evolution (LLM-GE) framework, which uses LLMs to directly modify the source code of image classification algorithms on CIFAR data and intelligently guide mutations and crossovers. The key element is the "Evolution of Thought" (EoT) technique, which establishes feedback loops that let the LLM iteratively refine its decisions based on how previous operations performed. This study improves LLM-GE to adjust the architecture of You Only Look Once (YOLO) models, seeking the optimal design against objectives such as detection accuracy and speed on the KITTI dataset. Experiments show variants with significant performance gains, such as an increase in Mean Average Precision from 92.5% to 94.5%, highlighting the flexibility and effectiveness of LLM-GE on real-world challenges and offering a novel paradigm for automated machine learning that combines LLM-driven reasoning with evolutionary strategies.
链接: https://arxiv.org/abs/2504.02280
作者: YiMing Yu,Jason Zutty
机构: 未知
类目: Neural and Evolutionary Computing (cs.NE); Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:In machine learning, Neural Architecture Search (NAS) requires domain knowledge of model design and a large amount of trial-and-error to achieve promising performance. Meanwhile, evolutionary algorithms have traditionally relied on fixed rules and pre-defined building blocks. The Large Language Model (LLM)-Guided Evolution (GE) framework transformed this approach by incorporating LLMs to directly modify model source code for image classification algorithms on CIFAR data and intelligently guide mutations and crossovers. A key element of LLM-GE is the “Evolution of Thought” (EoT) technique, which establishes feedback loops, allowing LLMs to refine their decisions iteratively based on how previous operations performed. In this study, we perform NAS for object detection by improving LLM-GE to modify the architecture of You Only Look Once (YOLO) models to enhance performance on the KITTI dataset. Our approach intelligently adjusts the design and settings of YOLO to find the optimal algorithms against objective such as detection accuracy and speed. We show that LLM-GE produced variants with significant performance improvements, such as an increase in Mean Average Precision from 92.5% to 94.5%. This result highlights the flexibility and effectiveness of LLM-GE on real-world challenges, offering a novel paradigm for automated machine learning that combines LLM-driven reasoning with evolutionary strategies.
zh
[CV-72] MultiTSF: Transformer-based Sensor Fusion for Human-Centric Multi-view and Multi-modal Action Recognition
【速读】: This paper addresses practical challenges of action recognition from multi-modal multi-view observations, including diverse environmental conditions, strict sensor synchronization requirements, and the dependence on fine-grained annotations. The key of the proposed Multi-modal Multi-view Transformer-based Sensor Fusion (MultiTSF) method is a Transformer that dynamically models inter-view relationships and captures temporal dependencies across multiple views, together with a Human Detection Module that generates pseudo-ground-truth labels, letting the model prioritize frames containing human activity and enhancing spatial feature learning. Experiments show that MultiTSF outperforms state-of-the-art methods in both video-sequence-level and frame-level action recognition.
链接: https://arxiv.org/abs/2504.02279
作者: Trung Thanh Nguyen,Yasutomo Kawanishi,Vijay John,Takahiro Komamizu,Ichiro Ide
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Action recognition from multi-modal and multi-view observations holds significant potential for applications in surveillance, robotics, and smart environments. However, existing methods often fall short of addressing real-world challenges such as diverse environmental conditions, strict sensor synchronization, and the need for fine-grained annotations. In this study, we propose the Multi-modal Multi-view Transformer-based Sensor Fusion (MultiTSF). The proposed method leverages a Transformer-based to dynamically model inter-view relationships and capture temporal dependencies across multiple views. Additionally, we introduce a Human Detection Module to generate pseudo-ground-truth labels, enabling the model to prioritize frames containing human activity and enhance spatial feature learning. Comprehensive experiments conducted on our in-house MultiSensor-Home dataset and the existing MM-Office dataset demonstrate that MultiTSF outperforms state-of-the-art methods in both video sequence-level and frame-level action recognition settings.
zh
[CV-73] Beyond Conventional Transformers: The Medical X-ray Attention (MXA) Block for Improved Multi-Label Diagnosis Using Knowledge Distillation
【速读】: This paper addresses multi-label classification of medical X-ray images, i.e., the clinical need to detect multiple abnormalities in a single scan, where traditional methods struggle to balance local detail against global context. The key innovation is the Medical X-ray Attention (MXA) block, a novel attention mechanism that enhances traditional Multi-Head Self Attention (MHSA) with a specialized module that efficiently combines detailed local information with broader global context. This work proposes the first task-specific attention mechanism for chest X-ray diagnosis and the first attempt at multi-label classification with an Efficient Vision Transformer (EfficientViT). By embedding the MXA block into the EfficientViT architecture and employing knowledge distillation, the proposed model significantly improves performance on the CheXpert dataset, achieving an area under the curve (AUC) of 0.85, an absolute improvement of 0.19 over the baseline's 0.66, corresponding to approximately a 233% relative improvement over random guessing (AUC = 0.5).
链接: https://arxiv.org/abs/2504.02277
作者: Amit Rand,Hadi Ibrahim
机构: Department of Mathematics, University of California, Los Angeles (加州大学洛杉矶分校); Department of Mathematics, University of California, Los Angeles (加州大学洛杉矶分校)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 16 pages, 4 figures, 5 tables. For supplementary material and code, see this https URL
点击查看摘要
Abstract:Medical imaging, particularly X-ray analysis, often involves detecting multiple conditions simultaneously within a single scan, making multi-label classification crucial for real-world clinical applications. We present the Medical X-ray Attention (MXA) block, a novel attention mechanism tailored specifically to address the unique challenges of X-ray abnormality detection. The MXA block enhances traditional Multi-Head Self Attention (MHSA) by integrating a specialized module that efficiently captures both detailed local information and broader global context. To the best of our knowledge, this is the first work to propose a task-specific attention mechanism for diagnosing chest X-rays, as well as to attempt multi-label classification using an Efficient Vision Transformer (EfficientViT). By embedding the MXA block within the EfficientViT architecture and employing knowledge distillation, our proposed model significantly improves performance on the CheXpert dataset, a widely used benchmark for multi-label chest X-ray abnormality detection. Our approach achieves an area under the curve (AUC) of 0.85, an absolute improvement of 0.19 compared to our baseline model’s AUC of 0.66, corresponding to a substantial approximate 233% relative improvement over random guessing (AUC = 0.5).
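A standard multi-label distillation objective consistent with the description (the MXA block itself is not reproduced here) combines BCE against ground-truth labels with BCE toward the teacher's temperature-softened per-label probabilities. The weighting and temperature below are conventional assumed values.

```python
import torch
import torch.nn.functional as F

def multilabel_kd_loss(student_logits, teacher_logits, labels, alpha=0.5, T=2.0):
    """Sketch of a multi-label KD objective: BCE on ground truth plus
    BCE toward the teacher's softened per-label probabilities."""
    hard = F.binary_cross_entropy_with_logits(student_logits, labels)
    with torch.no_grad():
        soft_targets = torch.sigmoid(teacher_logits / T)
    soft = F.binary_cross_entropy_with_logits(student_logits / T, soft_targets)
    return alpha * hard + (1 - alpha) * soft

# Toy usage: 14 findings per chest X-ray (CheXpert-style multi-label setup)
student = torch.randn(8, 14, requires_grad=True)
teacher = torch.randn(8, 14)
labels = torch.randint(0, 2, (8, 14)).float()
print(multilabel_kd_loss(student, teacher, labels))
```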
zh
[CV-74] Generative Classifier for Domain Generalization
【速读】: This paper addresses the underuse of domain-specific information in domain generalization (DG) caused by excessive reliance on domain invariance. Mainstream DG methods focus on learning domain-invariant features but overlook the potential value inherent in domain-specific information; the prevailing discriminative linear classifier, tailored to domain-invariant features, struggles when confronted with multi-modal domain-specific distributions such as intra-class shifts, and existing treatments of domain-specific information are prone to capturing spurious correlations, which hurts generalization.
The key of the solution is Generative Classifier-driven Domain Generalization (GCDG), which introduces a generative paradigm that models per-class feature distributions across domains with Gaussian Mixture Models (GMMs). GCDG consists of three modules: a Heterogeneity Learning Classifier (HLC), Spurious Correlation Blocking (SCB), and Diverse Component Balancing (DCB). HLC models the feature distributions and thereby captures valuable domain-specific information via GMMs; SCB identifies and perturbs the neural units containing spurious correlations, mitigating the risk of HLC learning spurious patterns; DCB ensures a balanced contribution of components in HLC, preventing critical components from being underestimated or neglected. In this way, GCDG captures the nuances of domain-specific information characterized by diverse distributions, reduces the target risk, and encourages flat minima, improving generalizability. Experiments show comparable performance on five DG benchmarks and one face anti-spoofing dataset, with consistent improvements when GCDG is seamlessly integrated into existing DG methods.
链接: https://arxiv.org/abs/2504.02272
作者: Shaocong Long,Qianyu Zhou,Xiangtai Li,Chenhao Ying,Yunhai Tong,Lizhuang Ma,Yuan Luo,Dacheng Tao
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Code will be available at this https URL
点击查看摘要
Abstract:Domain generalization (DG) aims to improve the generalizability of computer vision models toward distribution shifts. The mainstream DG methods focus on learning domain invariance, however, such methods overlook the potential inherent in domain-specific information. While the prevailing practice of discriminative linear classifier has been tailored to domain-invariant features, it struggles when confronted with diverse domain-specific information, e.g., intra-class shifts, that exhibits multi-modality. To address these issues, we explore the theoretical implications of relying on domain invariance, revealing the crucial role of domain-specific information in mitigating the target risk for DG. Drawing from these insights, we propose Generative Classifier-driven Domain Generalization (GCDG), introducing a generative paradigm for the DG classifier based on Gaussian Mixture Models (GMMs) for each class across domains. GCDG consists of three key modules: Heterogeneity Learning Classifier~(HLC), Spurious Correlation Blocking~(SCB), and Diverse Component Balancing~(DCB). Concretely, HLC attempts to model the feature distributions and thereby capture valuable domain-specific information via GMMs. SCB identifies the neural units containing spurious correlations and perturbs them, mitigating the risk of HLC learning spurious patterns. Meanwhile, DCB ensures a balanced contribution of components in HLC, preventing the underestimation or neglect of critical components. In this way, GCDG excels in capturing the nuances of domain-specific information characterized by diverse distributions. GCDG demonstrates the potential to reduce the target risk and encourage flat minima, improving the generalizability. Extensive experiments show GCDG’s comparable performance on five DG benchmarks and one face anti-spoofing dataset, seamlessly integrating into existing DG methods with consistent improvements.
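The generative-classifier idea reduces, in its simplest form, to fitting one GMM per class on features and predicting by maximum log-likelihood, which naturally accommodates multi-modal intra-class distributions. The scikit-learn sketch below is a toy stand-in for HLC only, without SCB or DCB.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

class GMMClassifier:
    """Sketch of a generative classifier: one GMM per class over features,
    prediction by maximum log-likelihood (handles multi-modal classes)."""
    def __init__(self, n_components=3):
        self.n_components = n_components
        self.gmms = {}

    def fit(self, feats, labels):
        for c in np.unique(labels):
            gmm = GaussianMixture(self.n_components, covariance_type="diag")
            self.gmms[c] = gmm.fit(feats[labels == c])
        return self

    def predict(self, feats):
        classes = sorted(self.gmms)
        ll = np.stack([self.gmms[c].score_samples(feats) for c in classes], axis=1)
        return np.array(classes)[ll.argmax(axis=1)]

# Toy usage: a bimodal class 0 (two "domains") vs. a unimodal class 1
rng = np.random.default_rng(0)
x0 = np.vstack([rng.normal(-3, 1, (100, 8)), rng.normal(3, 1, (100, 8))])
x1 = rng.normal(0, 1, (200, 8))
X, y = np.vstack([x0, x1]), np.array([0] * 200 + [1] * 200)
clf = GMMClassifier().fit(X, y)
print((clf.predict(X) == y).mean())
```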
zh
[CV-75] MinkOcc: Towards real-time label-efficient semantic occupancy prediction
【速读】: This paper addresses the dependence of 3D semantic occupancy prediction on dense 3D annotations, which are labor- and resource-intensive to produce. The key of the solution is MinkOcc, a multi-modal (camera and LiDAR) 3D semantic occupancy prediction framework with a two-step semi-supervised training pipeline: a small dataset with explicit 3D annotations warm-starts training, after which supervision continues with easier-to-annotate accumulated LiDAR sweeps and images that are semantically labelled via vision foundation models. This reduces reliance on manual labeling by 90% while maintaining competitive accuracy. In addition, MinkOcc fuses LiDAR and camera data through early fusion and uses sparse convolution networks for real-time prediction; with its efficiency in both supervision and computation, the aim is to extend MinkOcc beyond curated datasets toward broader real-world deployment of 3D semantic occupancy prediction in autonomous driving.
链接: https://arxiv.org/abs/2504.02270
作者: Samuel Sze,Daniele De Martini,Lars Kunze
机构: Oxford Robotics Institute, Department of Engineering Science, University of Oxford (牛津大学工程科学系机器人研究所); Bristol Robotics Laboratory, School of Engineering, University of the West of England (西英格兰大学工程学院布里斯托尔机器人实验室)
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注: 8 pages
点击查看摘要
Abstract:Developing 3D semantic occupancy prediction models often relies on dense 3D annotations for supervised learning, a process that is both labor and resource-intensive, underscoring the need for label-efficient or even label-free approaches. To address this, we introduce MinkOcc, a multi-modal 3D semantic occupancy prediction framework for cameras and LiDARs that proposes a two-step semi-supervised training procedure. Here, a small dataset of explicitly 3D annotations warm-starts the training process; then, the supervision is continued by simpler-to-annotate accumulated LiDAR sweeps and images – semantically labelled through vision foundational models. MinkOcc effectively utilizes these sensor-rich supervisory cues and reduces reliance on manual labeling by 90% while maintaining competitive accuracy. In addition, the proposed model incorporates information from LiDAR and camera data through early fusion and leverages sparse convolution networks for real-time prediction. With its efficiency in both supervision and computation, we aim to extend MinkOcc beyond curated datasets, enabling broader real-world deployment of 3D semantic occupancy prediction in autonomous driving.
zh
[CV-76] MMTL-UniAD: A Unified Framework for Multimodal and Multi-Task Learning in Assistive Driving Perception
【速读】: This paper addresses the fact that multi-task learning in existing advanced driver assistance systems (ADAS) fails to exploit the potential of joint learning across tasks, specifically the need to simultaneously understand the driver's mental/physical state and the traffic context. The proposed MMTL-UniAD is a unified multi-modal multi-task learning framework that simultaneously recognizes driver behavior (e.g., looking around, talking), driver emotion (e.g., anxiety, happiness), vehicle behavior (e.g., parking, turning), and traffic context (e.g., traffic jam, smooth traffic). A key challenge is avoiding negative transfer between tasks, which can impair learning performance. To this end, two components are introduced: a multi-axis region attention network that extracts global context-sensitive features and uses a multi-attention mechanism to select task-relevant features, mitigating negative transfer from task-unrelated features; and a dual-branch multimodal embedding that learns embeddings from both task-shared and task-specific features, adaptively adjusting shared and specific parameters to enhance cross-task knowledge transfer while reducing task conflicts. Evaluation on the AIDE dataset, including a series of ablation studies, shows that MMTL-UniAD outperforms state-of-the-art methods on all four tasks.
链接: https://arxiv.org/abs/2504.02264
作者: Wenzhuo Liu,Wenshuo Wang,Yicheng Qiao,Qiannan Guo,Jiayin Zhu,Pengfei Li,Zilong Chen,Huiming Yang,Zhiwei Li,Lening Wang,Tiao Tan,Huaping Liu
机构: Beijing Institute of Technology, Zhuhai (北京理工大学珠海校区); Tsinghua University (清华大学); HKUST(GZ) (香港科技大学(广州)); Beijing University of Chemical Technology (北京化工大学); Beihang University (北京航空航天大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Advanced driver assistance systems require a comprehensive understanding of the driver’s mental/physical state and traffic context but existing works often neglect the potential benefits of joint learning between these tasks. This paper proposes MMTL-UniAD, a unified multi-modal multi-task learning framework that simultaneously recognizes driver behavior (e.g., looking around, talking), driver emotion (e.g., anxiety, happiness), vehicle behavior (e.g., parking, turning), and traffic context (e.g., traffic jam, traffic smooth). A key challenge is avoiding negative transfer between tasks, which can impair learning performance. To address this, we introduce two key components into the framework: one is the multi-axis region attention network to extract global context-sensitive features, and the other is the dual-branch multimodal embedding to learn multimodal embeddings from both task-shared and task-specific features. The former uses a multi-attention mechanism to extract task-relevant features, mitigating negative transfer caused by task-unrelated features. The latter employs a dual-branch structure to adaptively adjust task-shared and task-specific parameters, enhancing cross-task knowledge transfer while reducing task conflicts. We assess MMTL-UniAD on the AIDE dataset, using a series of ablation studies, and show that it outperforms state-of-the-art methods across all four tasks. The code is available on this https URL.
zh
[CV-77] WonderTurbo: Generating Interactive 3D World in 0.72 Seconds
【速读】: This paper tackles the challenge of real-time interactivity in current 3D generation technologies, i.e., generating interactive 3D scenes in real time. The key of the proposed WonderTurbo, the first real-time interactive 3D scene generation framework (generating novel views of 3D scenes within 0.72 seconds), is to accelerate both geometric and appearance modeling. For geometry, StepSplat builds efficient 3D geometric representations through dynamic updates, each taking only 0.26 seconds, and QuickDepth, a lightweight depth completion module, provides consistent depth input to further improve geometric accuracy. For appearance, FastPaint, a 2-step diffusion model tailored for instant inpainting, focuses on maintaining spatial appearance consistency. Experiments show that WonderTurbo achieves a 15X speedup over baseline methods while preserving excellent spatial consistency and high-quality outputs.
链接: https://arxiv.org/abs/2504.02261
作者: Chaojun Ni,Xiaofeng Wang,Zheng Zhu,Weijie Wang,Haoyun Li,Guosheng Zhao,Jie Li,Wenkang Qin,Guan Huang,Wenjun Mei
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project Page: this https URL
点击查看摘要
Abstract:Interactive 3D generation is gaining momentum and capturing extensive attention for its potential to create immersive virtual experiences. However, a critical challenge in current 3D generation technologies lies in achieving real-time interactivity. To address this issue, we introduce WonderTurbo, the first real-time interactive 3D scene generation framework capable of generating novel perspectives of 3D scenes within 0.72 seconds. Specifically, WonderTurbo accelerates both geometric and appearance modeling in 3D scene generation. In terms of geometry, we propose StepSplat, an innovative method that constructs efficient 3D geometric representations through dynamic updates, each taking only 0.26 seconds. Additionally, we design QuickDepth, a lightweight depth completion module that provides consistent depth input for StepSplat, further enhancing geometric accuracy. For appearance modeling, we develop FastPaint, a 2-steps diffusion model tailored for instant inpainting, which focuses on maintaining spatial appearance consistency. Experimental results demonstrate that WonderTurbo achieves a remarkable 15X speedup compared to baseline methods, while preserving excellent spatial consistency and delivering high-quality output.
zh
[CV-78] Re-thinking Temporal Search for Long-Form Video Understanding CVPR2025
【速读】: This paper addresses the key challenge of efficient temporal search in long-form video understanding, targeting a significant research gap in the temporal search capabilities of state-of-the-art long-context vision-language models (VLMs). The core contribution is to reformulate temporal search as a Long Video Haystack (LV-Haystack) problem: finding a minimal set of query-relevant keyframes (typically one to five) among tens of thousands of frames of real-world long videos. To validate this formulation, the authors build LV-Haystack, the first benchmark of its kind, containing 3,874 human-annotated instances with fine-grained metrics for keyframe search quality and computational efficiency; experiments show that state-of-the-art keyframe selection methods achieve only a 2.1% temporal F1 score on the LVBench subset.
The key of the solution is T*, a lightweight keyframe search framework that casts the expensive temporal search as a spatial search problem. T* leverages the superior visual localization capabilities typically used on images and introduces an adaptive zooming-in mechanism that operates across both temporal and spatial dimensions, effectively improving long-video understanding. Under an inference budget of 32 frames, T* improves GPT-4o's performance on the LongVideoBench XL subset from 50.5% to 53.1% and LLaVA-OneVision-72B's from 56.5% to 62.4%.
链接: https://arxiv.org/abs/2504.02259
作者: Jinhui Ye,Zihan Wang,Haosen Sun,Keshigeyan Chandrasegaran,Zane Durante,Cristobal Eyzaguirre,Yonatan Bisk,Juan Carlos Niebles,Ehsan Adeli,Li Fei-Fei,Jiajun Wu,Manling Li
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by CVPR 2025; A real-world long video needle-in-haystack benchmark; long-video QA with human ref frames
点击查看摘要
Abstract:Efficient understanding of long-form videos remains a significant challenge in computer vision. In this work, we revisit temporal search paradigms for long-form video understanding, studying a fundamental issue pertaining to all state-of-the-art (SOTA) long-context vision-language models (VLMs). In particular, our contributions are two-fold: First, we formulate temporal search as a Long Video Haystack problem, i.e., finding a minimal set of relevant frames (typically one to five) among tens of thousands of frames from real-world long videos given specific queries. To validate our formulation, we create LV-Haystack, the first benchmark containing 3,874 human-annotated instances with fine-grained evaluation metrics for assessing keyframe search quality and computational efficiency. Experimental results on LV-Haystack highlight a significant research gap in temporal search capabilities, with SOTA keyframe selection methods achieving only 2.1% temporal F1 score on the LVBench subset. Next, inspired by visual search in images, we re-think temporal searching and propose a lightweight keyframe searching framework, T*, which casts the expensive temporal search as a spatial search problem. T* leverages superior visual localization capabilities typically used in images and introduces an adaptive zooming-in mechanism that operates across both temporal and spatial dimensions. Our extensive experiments show that when integrated with existing methods, T* significantly improves SOTA long-form video understanding performance. Specifically, under an inference budget of 32 frames, T* improves GPT-4o’s performance from 50.5% to 53.1% and LLaVA-OneVision-72B’s performance from 56.5% to 62.4% on LongVideoBench XL subset. Our PyTorch code, benchmark dataset and models are included in the Supplementary material.
zh
[CV-79] SocialGesture: Delving into Multi-person Gesture Understanding CVPR2025
【速读】: This paper addresses the fact that existing gesture recognition research largely overlooks multi-person interactions, which makes it hard to align gestures with other modalities such as language and speech, especially for understanding naturally occurring gestures in social contexts. The key of the solution is SocialGesture, the first large-scale dataset designed specifically for multi-person gesture analysis, featuring diverse natural scenarios and supporting multiple gesture analysis tasks including video-based recognition and temporal localization. In addition, a novel visual question answering (VQA) task is proposed to benchmark vision-language models (VLMs) on social gesture understanding. The findings highlight several limitations of current gesture recognition models and offer directions for future improvement. The SocialGesture dataset is available via the link provided.
链接: https://arxiv.org/abs/2504.02244
作者: Xu Cao,Pranav Virupaksha,Wenqi Jia,Bolin Lai,Fiona Ryan,Sangmin Lee,James M. Rehg
机构: University of Illinois Urbana-Champaign (伊利诺伊大学厄巴纳-香槟分校); Georgia Institute of Technology (乔治亚理工学院); Sungkyunkwan University (成均馆大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: CVPR 2025
点击查看摘要
Abstract:Previous research in human gesture recognition has largely overlooked multi-person interactions, which are crucial for understanding the social context of naturally occurring gestures. This limitation in existing datasets presents a significant challenge in aligning human gestures with other modalities like language and speech. To address this issue, we introduce SocialGesture, the first large-scale dataset specifically designed for multi-person gesture analysis. SocialGesture features a diverse range of natural scenarios and supports multiple gesture analysis tasks, including video-based recognition and temporal localization, providing a valuable resource for advancing the study of gesture during complex social interactions. Furthermore, we propose a novel visual question answering (VQA) task to benchmark vision language models’(VLMs) performance on social gesture understanding. Our findings highlight several limitations of current gesture recognition models, offering insights into future directions for improvement in this field. SocialGesture is available at this http URL.
zh
[CV-80] AC-LoRA: Auto Component LoRA for Personalized Artistic Style Image Generation
【速读】: This paper addresses the difficulty of tuning the rank parameter to obtain satisfactory results in Low Rank Adaptation (LoRA)-based personalized image generation. The key innovation of the proposed AutoComponent-LoRA (AC-LoRA) is to automatically separate the signal and noise components of LoRA matrices, using Singular Value Decomposition (SVD) and dynamic heuristics that update the hyperparameters during training, enabling fast and efficient personalized artistic-style image generation. The method effectively overcomes model underfitting and overfitting, and achieves an average improvement of 9% over existing methods as validated with FID, CLIP, DINO, and ImageReward.
链接: https://arxiv.org/abs/2504.02231
作者: Zhipu Cui,Andong Tian,Zhi Ying,Jialiang Lu
机构: SPEIT, Shanghai Jiaotong University (上海交通大学SPEIT); Ubisoft La Forge (育碧La Forge)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 11 pages, 4 figures, ICCGV 2025, SPIE
点击查看摘要
Abstract:Personalized image generation allows users to preserve styles or subjects of a provided small set of images for further image generation. With the advancement in large text-to-image models, many techniques have been developed to efficiently fine-tune those models for personalization, such as Low Rank Adaptation (LoRA). However, LoRA-based methods often face the challenge of adjusting the rank parameter to achieve satisfactory results. To address this challenge, AutoComponent-LoRA (AC-LoRA) is proposed, which is able to automatically separate the signal component and noise component of the LoRA matrices for fast and efficient personalized artistic style image generation. This method is based on Singular Value Decomposition (SVD) and dynamic heuristics to update the hyperparameters during training. Superior performance over existing methods in overcoming model underfitting or overfitting problems is demonstrated. The results were validated using FID, CLIP, DINO, and ImageReward, achieving an average of 9% improvement.
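The signal/noise separation can be illustrated by composing the LoRA update Delta W = B A, taking its SVD, and keeping the singular directions that cover a fixed fraction of the spectrum's energy. The 0.9 energy threshold below stands in for the paper's dynamic heuristics, which are not reproduced.

```python
import torch

def split_lora_components(A, B, energy=0.9):
    """Sketch: SVD-split a LoRA update into 'signal' and 'noise' parts.

    A: (r, in_features), B: (out_features, r)  ->  delta_W = B @ A
    Keeps the smallest set of singular directions covering `energy`
    of the spectrum's total energy; the remainder is treated as noise.
    """
    delta_w = B @ A
    U, S, Vh = torch.linalg.svd(delta_w, full_matrices=False)
    cum = torch.cumsum(S ** 2, dim=0) / (S ** 2).sum()
    k = int(torch.searchsorted(cum, torch.tensor(energy)).item()) + 1
    signal = (U[:, :k] * S[:k]) @ Vh[:k]
    noise = delta_w - signal
    return signal, noise, k

# Toy usage with a rank-16 LoRA update
A, B = torch.randn(16, 768), torch.randn(768, 16)
signal, noise, k = split_lora_components(A, B)
print(k, signal.shape, noise.norm().item())
```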
zh
[CV-81] Geospatial Artificial Intelligence for Satellite-based Flood Extent Mapping: Concepts Advances and Future Perspectives
【速读】: This paper addresses satellite-based flood extent mapping in support of disaster management and spatial decision-making. The core of the solution is the systematic integration of Geospatial Artificial Intelligence (GeoAI) techniques with satellite data: flood extent maps delineating the affected areas are produced to identify flood events and assess their impacts, along with additional analytical outputs such as uncertainty estimation and change detection. The key lies in applying AI techniques to satellite data processing to achieve efficient and accurate flood monitoring and analysis.
链接: https://arxiv.org/abs/2504.02214
作者: Hyunho Lee,Wenwen Li
机构: School of Geographical Sciences and Urban Planning (地理科学与城市规划学院), Arizona State University (亚利桑那州立大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注: 10 pages, 5 figures
点击查看摘要
Abstract:Geospatial Artificial Intelligence (GeoAI) for satellite-based flood extent mapping systematically integrates artificial intelligence techniques with satellite data to identify flood events and assess their impacts, for disaster management and spatial decision-making. The primary output often includes flood extent maps, which delineate the affected areas, along with additional analytical outputs such as uncertainty estimation and change detection.
zh
[CV-82] ESC: Erasing Space Concept for Knowledge Deletion CVPR2025
【速读】: This paper addresses privacy protection in deep learning, in particular users' concern that personal knowledge in trained models may be misused. Existing methods often fail to meet users' real-world demand for complete knowledge erasure, and the investigation reveals a risk of leaking personal knowledge through embedding features. To address this, the paper introduces Knowledge Deletion (KD), an advanced task that considers both concerns jointly, together with a Knowledge Retention score (KR) for assessing knowledge retention in feature space. The key of the solution is Erasing Space Concept (ESC), a training-free erasing approach that restricts the important subspace of the forgetting knowledge by eliminating the relevant activations in the features, plus a trainable variant, ESC with Training (ESC-T), which uses a learnable mask to better balance the trade-off between forgetting and preserving knowledge in KD. Extensive experiments across datasets and models show the fastest, state-of-the-art performance, and the methods apply to diverse forgetting scenarios such as the facial domain, demonstrating their generalizability.
链接: https://arxiv.org/abs/2504.02199
作者: Tae-Young Lee,Sundong Park,Minwoo Jeon,Hyoseok Hwang,Gyeong-Moon Park
机构: Korea University (韩国大学), Seoul, Republic of Korea; Kyung Hee University (庆熙大学), Yongin, Republic of Korea
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 22 pages, 14 figures, 18 tables, CVPR 2025
点击查看摘要
Abstract:As concerns regarding privacy in deep learning continue to grow, individuals are increasingly apprehensive about the potential exploitation of their personal knowledge in trained models. Despite several research efforts to address this, they often fail to consider the real-world demand from users for complete knowledge erasure. Furthermore, our investigation reveals that existing methods have a risk of leaking personal knowledge through embedding features. To address these issues, we introduce a novel concept of Knowledge Deletion (KD), an advanced task that considers both concerns, and provides an appropriate metric, named Knowledge Retention score (KR), for assessing knowledge retention in feature space. To achieve this, we propose a novel training-free erasing approach named Erasing Space Concept (ESC), which restricts the important subspace for the forgetting knowledge by eliminating the relevant activations in the feature. In addition, we suggest ESC with Training (ESC-T), which uses a learnable mask to better balance the trade-off between forgetting and preserving knowledge in KD. Our extensive experiments on various datasets and models demonstrate that our proposed methods achieve the fastest and state-of-the-art performance. Notably, our methods are applicable to diverse forgetting scenarios, such as facial domain setting, demonstrating the generalizability of our methods. The code is available at this http URL .
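A training-free "erase the important subspace" operation in the spirit of ESC can be sketched as follows: take the top principal directions of the forgetting class's features via SVD and project them out of any embedding. The number of directions k and the feature source are illustrative assumptions, not the paper's exact procedure.

```python
import torch

def build_erasing_projector(forget_feats, k=5):
    """Sketch of an ESC-style operation: find the top-k principal directions
    of the forgetting class's features via SVD and build a projector that
    removes that subspace from any embedding (training-free)."""
    centered = forget_feats - forget_feats.mean(dim=0, keepdim=True)
    _, _, Vh = torch.linalg.svd(centered, full_matrices=False)
    V = Vh[:k].T                                  # (D, k) forgetting subspace
    eye = torch.eye(forget_feats.size(1))
    return eye - V @ V.T                          # P = I - V V^T

forget_feats = torch.randn(500, 256)              # embeddings of data to forget
P = build_erasing_projector(forget_feats, k=5)
query = torch.randn(8, 256)
erased = query @ P                                # knowledge-erased embeddings
print(erased.shape)
```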
zh
[CV-83] Foreground Focus: Enhancing Coherence and Fidelity in Camouflaged Image Generation
【速读】:该论文旨在解决两类关键问题:一是现有方法在生成伪装图像时,前景特征与背景知识的整合不足,导致前景与背景缺乏一致性(如色彩不协调);二是生成过程中未能充分优先保证前景对象的保真度,尤其对小尺寸物体容易产生失真。为解决这些问题,论文提出了一种名为Foreground-Aware Camouflaged Image Generation (FACIG) 的模型。其核心解决方案包括引入Foreground-Aware Feature Integration Module (FAFIM) 来强化前景特征与背景知识的融合,并设计Foreground-Aware Denoising Loss以增强前景重建的监督。实验结果表明,该方法在整体伪装图像质量和前景保真度方面优于现有方法。
链接: https://arxiv.org/abs/2504.02180
作者: Pei-Chi Chen,Yi Yao,Chan-Feng Hsu,HongXia Xie,Hung-Jen Chen,Hong-Han Shuai,Wen-Huang Cheng
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Camouflaged image generation is emerging as a solution to data scarcity in camouflaged vision perception, offering a cost-effective alternative to data collection and labeling. Recently, the state-of-the-art approach successfully generates camouflaged images using only foreground objects. However, it faces two critical weaknesses: 1) the background knowledge does not integrate effectively with foreground features, resulting in a lack of foreground-background coherence (e.g., color discrepancy); 2) the generation process does not prioritize the fidelity of foreground objects, which leads to distortion, particularly for small objects. To address these issues, we propose a Foreground-Aware Camouflaged Image Generation (FACIG) model. Specifically, we introduce a Foreground-Aware Feature Integration Module (FAFIM) to strengthen the integration between foreground features and background knowledge. In addition, a Foreground-Aware Denoising Loss is designed to enhance foreground reconstruction supervision. Experiments on various datasets show our method outperforms previous methods in overall camouflaged image quality and foreground fidelity.
zh
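FACIG 提出的 Foreground-Aware Denoising Loss 旨在加强前景重建监督,其具体形式摘要未给出。下面仅给出一种常见的"前景加权去噪损失"示意(损失形式与超参数 w_fg 均为假设,并非论文原设计):

```python
import torch

def foreground_weighted_denoising_loss(noise_pred: torch.Tensor,
                                       noise_target: torch.Tensor,
                                       fg_mask: torch.Tensor,
                                       w_fg: float = 2.0):
    """fg_mask: 前景为 1、背景为 0 的掩码;对前景像素加大去噪损失权重(示意)。"""
    per_pixel = (noise_pred - noise_target) ** 2
    weights = 1.0 + (w_fg - 1.0) * fg_mask  # 前景权重 w_fg,背景权重 1
    return (weights * per_pixel).mean()
```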
[CV-84] MDP: Multidimensional Vision Model Pruning with Latency Constraint CVPR2025
【速读】:该论文致力于解决当前结构剪枝方法面临的两个主要限制:(1) 剪枝通常局限于更细粒度的层面(如通道),难以实现激进的参数减少;(2) 过于关注参数和浮点运算(FLOPs)的减少,而现有的延迟感知方法多依赖于简化的次优线性模型,在Transformer等多维度交互影响延迟的任务中泛化效果不佳。为了解决这些问题,论文提出了一种名为多维剪枝(Multi-Dimensional Pruning, MDP)的新范式,其关键在于联合优化多种剪枝粒度(包括通道、查询、键、注意力头、嵌入和模块块),并通过先进的延迟建模技术准确捕捉所有可剪枝维度上的延迟变化,从而在延迟与精度之间达到最佳平衡。MDP通过将剪枝问题重新表述为混合整数非线性规划(Mixed-Integer Nonlinear Program, MINLP),高效识别出满足延迟约束的最佳剪枝结构,同时支持CNN和Transformer模型。实验结果表明,MDP在高剪枝比率下显著优于现有方法。例如,在ImageNet数据集上,MDP对ResNet50的剪枝实现了比HALP快28%的速度提升,并提高了+1.4的Top-1准确率,同时相比最新的Transformer剪枝方法Isomorphic,进一步加速了37%,提升了+0.7的Top-1准确率。
链接: https://arxiv.org/abs/2504.02168
作者: Xinglong Sun,Barath Lakshmanan,Maying Shen,Shiyi Lan,Jingde Chen,Jose M. Alvarez
机构: NVIDIA
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Accepted at CVPR 2025
点击查看摘要
Abstract:Current structural pruning methods face two significant limitations: (i) they often limit pruning to finer-grained levels like channels, making aggressive parameter reduction challenging, and (ii) they focus heavily on parameter and FLOP reduction, with existing latency-aware methods frequently relying on simplistic, suboptimal linear models that fail to generalize well to transformers, where multiple interacting dimensions impact latency. In this paper, we address both limitations by introducing Multi-Dimensional Pruning (MDP), a novel paradigm that jointly optimizes across a variety of pruning granularities, including channels, query, key, heads, embeddings, and blocks. MDP employs an advanced latency modeling technique to accurately capture latency variations across all prunable dimensions, achieving an optimal balance between latency and accuracy. By reformulating pruning as a Mixed-Integer Nonlinear Program (MINLP), MDP efficiently identifies the optimal pruned structure across all prunable dimensions while respecting latency constraints. This versatile framework supports both CNNs and transformers. Extensive experiments demonstrate that MDP significantly outperforms previous methods, especially at high pruning ratios. On ImageNet, MDP achieves a 28% speed increase with a +1.4 Top-1 accuracy improvement over prior work like HALP for ResNet50 pruning. Against the latest transformer pruning method, Isomorphic, MDP delivers an additional 37% acceleration with a +0.7 Top-1 accuracy improvement.
zh
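MDP 将多粒度剪枝建模为带延迟约束的混合整数非线性规划(MINLP)。下面用一个玩具级的穷举搜索来示意该问题的结构(latency_model 显式建模各维度间的交互,正对应论文强调的延迟非线性;候选集合与评分函数均为假设,真实求解需专门的 MINLP 求解器):

```python
import itertools

def search_pruning_config(choices, latency_model, latency_budget):
    """choices: {维度名: 候选保留比例列表},维度可为 channels/heads/blocks 等;
    latency_model: 估计联合配置延迟的可调用对象(捕捉跨维度交互)。"""
    best_cfg, best_score = None, float("-inf")
    for combo in itertools.product(*choices.values()):
        cfg = dict(zip(choices.keys(), combo))
        if latency_model(cfg) <= latency_budget:   # 延迟约束
            score = sum(cfg.values())              # 粗糙的精度代理:保留越多越好
            if score > best_score:
                best_cfg, best_score = cfg, score
    return best_cfg
```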
[CV-85] Preference-Driven Active 3D Scene Representation for Robotic Inspection in Nuclear Decommissioning IROS
【速读】:该论文旨在解决传统主动三维场景表示方法在优化几何精度或渲染准确性的同时,未能充分考虑操作者特定目标(如安全关键覆盖或任务驱动视点)的问题,尤其在受限环境(如核退役场景)中导致次优视角选择。为填补这一空白,论文提出了一种新颖框架,将专家操作者偏好融入主动三维场景表示流程。解决方案的关键在于利用基于人类反馈的强化学习(Reinforcement Learning from Human Feedback, RLHF)指导机器人路径规划,并重塑奖励函数以反映专家输入。通过交互式选择实验捕捉操作者优先级,结合显式三维几何建模与隐式人机协作优化,实现了同时提升场景表示质量和轨迹效率的目标。
链接: https://arxiv.org/abs/2504.02161
作者: Zhen Meng,Kan Chen,Xiangmin Xu,Erwin Jose Lopez Pulgarin,Emma Li,Philip G. Zhao,David Flynn
机构: School of Computing Science, University of Glasgow (格拉斯哥大学计算科学学院); Department of Engineering, University of Manchester (曼彻斯特大学工程系); Department of Computer Science, University of Manchester (曼彻斯特大学计算机科学系); James Watt School of Engineering, University of Glasgow (格拉斯哥大学詹姆斯瓦特工程学院)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注: This work has been submitted to IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) 2025
点击查看摘要
Abstract:Active 3D scene representation is pivotal in modern robotics applications, including remote inspection, manipulation, and telepresence. Traditional methods primarily optimize geometric fidelity or rendering accuracy, but often overlook operator-specific objectives, such as safety-critical coverage or task-driven viewpoints. This limitation leads to suboptimal viewpoint selection, particularly in constrained environments such as nuclear decommissioning. To bridge this gap, we introduce a novel framework that integrates expert operator preferences into the active 3D scene representation pipeline. Specifically, we employ Reinforcement Learning from Human Feedback (RLHF) to guide robotic path planning, reshaping the reward function based on expert input. To capture operator-specific priorities, we conduct interactive choice experiments that evaluate user preferences in 3D scene representation. We validate our framework using a UR3e robotic arm for reactor tile inspection in a nuclear decommissioning scenario. Compared to baseline methods, our approach enhances scene representation while optimizing trajectory efficiency. The RLHF-based policy consistently outperforms random selection, prioritizing task-critical details. By unifying explicit 3D geometric modeling with implicit human-in-the-loop optimization, this work establishes a foundation for adaptive, safety-critical robotic perception systems, paving the way for enhanced automation in nuclear decommissioning, remote maintenance, and other high-risk environments.
zh
[CV-86] Less-to-More Generalization: Unlocking More Controllability by In-Context Generation
【速读】:该论文致力于解决在基于主体生成(Subject-Driven Generation)领域中数据可扩展性(Data Scalability)和主体可扩展性(Subject Expansibility)的两大挑战。首先,从构建单一主体数据集扩展到多主体数据集并进行规模化是一个难点;其次,大多数现有方法专注于单一主体生成,难以适应多主体场景。为了解决这些问题,论文提出了一种高度一致的数据合成管道(Highly-consistent Data Synthesis Pipeline),利用扩散变换器(Diffusion Transformers)的上下文生成能力生成高一致性多主体配对数据。解决方案的关键在于引入了UNO模型,它包含渐进跨模态对齐(Progressive Cross-modal Alignment)和通用旋转位置嵌入(Universal Rotary Position Embedding),并且是一个从文本到图像模型迭代训练得到的多图像条件主体到图像模型。实验结果表明,该方法在保证可控性的前提下实现了单主体和多主体驱动生成的高度一致性。
链接: https://arxiv.org/abs/2504.02160
作者: Shaojin Wu,Mengqi Huang,Wenxu Wu,Yufeng Cheng,Fei Ding,Qian He
机构: ByteDance(字节跳动)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Project page: this https URL Code and model: this https URL
点击查看摘要
Abstract:Although subject-driven generation has been extensively explored in image generation due to its wide applications, it still has challenges in data scalability and subject expansibility. For the first challenge, moving from curating single-subject datasets to multiple-subject ones and scaling them is particularly difficult. For the second, most recent methods center on single-subject generation, making it hard to apply when dealing with multi-subject scenarios. In this study, we propose a highly-consistent data synthesis pipeline to tackle this challenge. This pipeline harnesses the intrinsic in-context generation capabilities of diffusion transformers and generates high-consistency multi-subject paired data. Additionally, we introduce UNO, which consists of progressive cross-modal alignment and universal rotary position embedding. It is a multi-image conditioned subject-to-image model iteratively trained from a text-to-image model. Extensive experiments show that our method can achieve high consistency while ensuring controllability in both single-subject and multi-subject driven generation.
zh
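UNO 的通用旋转位置嵌入建立在标准 RoPE 之上。下面给出标准 RoPE 的一种常见实现示意(UNO 如何把位置索引扩展到多图条件是论文的改动,摘要未给细节,此处不做假设):

```python
import torch

def rotary_position_embedding(x: torch.Tensor, base: float = 10000.0):
    """对 (batch, seq, dim) 的特征施加标准 RoPE(dim 需为偶数)。"""
    _, n, d = x.shape
    half = d // 2
    freqs = base ** (-torch.arange(half, dtype=x.dtype) / half)  # 各维的旋转频率
    angles = torch.arange(n, dtype=x.dtype)[:, None] * freqs[None, :]  # (n, half)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)
```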
[CV-87] UAVTwin: Neural Digital Twins for UAVs using Gaussian Splatting
【速读】:本文旨在解决基于无人机(UAV)的感知任务中数字孪生(Digital Twin)生成的问题,特别是在复杂场景下包含多动态目标且具有显著外观变化时,传统基于3D高斯泼溅(3D Gaussian Splatting, 3DGS)的方法容易引入伪影的问题。论文的关键解决方案在于提出了一种新的外观建模策略和掩码细化模块,以增强3D Gaussian Splatting 的训练效果。通过这种方法,不仅实现了高质量的神经渲染,在峰值信噪比(PSNR)上相比现有方法提升了1.23 dB,还通过数据增强显著提高了下游模型在真实环境中的性能,例如在人体检测任务中平均精度均值(mAP)提升了2.5%到13.7%。
链接: https://arxiv.org/abs/2504.02158
作者: Jaehoon Choi,Dongki Jung,Yonghan Lee,Sungmin Eum,Dinesh Manocha,Heesung Kwon
机构: University of Maryland, College Park (马里兰大学帕克分校); DEVCOM Army Research Laboratory (陆军研究实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:We present UAVTwin, a method for creating digital twins from real-world environments and facilitating data augmentation for training downstream models embedded in unmanned aerial vehicles (UAVs). Specifically, our approach focuses on synthesizing foreground components, such as various human instances in motion within complex scene backgrounds, from UAV perspectives. This is achieved by integrating 3D Gaussian Splatting (3DGS) for reconstructing backgrounds along with controllable synthetic human models that display diverse appearances and actions in multiple poses. To the best of our knowledge, UAVTwin is the first approach for UAV-based perception that is capable of generating high-fidelity digital twins based on 3DGS. The proposed work significantly enhances downstream models through data augmentation for real-world environments with multiple dynamic objects and significant appearance variations, both of which typically introduce artifacts in 3DGS-based modeling. To tackle these challenges, we propose a novel appearance modeling strategy and a mask refinement module to enhance the training of 3D Gaussian Splatting. We demonstrate the high quality of neural rendering by achieving a 1.23 dB improvement in PSNR compared to recent methods. Furthermore, we validate the effectiveness of data augmentation by showing a 2.5% to 13.7% improvement in mAP for the human detection task.
zh
[CV-88] FreSca: Unveiling the Scaling Space in Diffusion Models
【速读】:本文旨在探索扩散模型(Diffusion Models)在“缩放空间”中的潜力,特别是通过条件与非条件噪声预测之间的差异来实现细粒度语义操控的可能性。现有方法对这一空间的潜在价值挖掘不足,而本文的核心贡献在于通过对噪声预测进行傅里叶分析,揭示其低频和高频成分在扩散过程中的演化特性存在差异。基于此洞察,作者提出了FreSca方法,它在傅里叶域中独立调整不同频率带的指导缩放系数。这种方法无需重新训练即可显著提升现有图像编辑技术的效果,并且还扩展到图像理解任务如深度估计,实现了跨数据集的定量改进。因此,论文的关键在于发现了噪声预测频域特性差异,并据此设计了能够有效增强图像编辑及理解能力的FreSca方法。
链接: https://arxiv.org/abs/2504.02154
作者: Chao Huang,Susan Liang,Yunlong Tang,Li Ma,Yapeng Tian,Chenliang Xu
机构: University of Rochester (罗切斯特大学); Netflix Eyeline Studios (奈飞眼线工作室); The University of Texas at Dallas (德克萨斯大学达拉斯分校)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project page: this https URL
点击查看摘要
Abstract:Diffusion models offer impressive controllability for image tasks, primarily through noise predictions that encode task-specific information and classifier-free guidance enabling adjustable scaling. This scaling mechanism implicitly defines a “scaling space” whose potential for fine-grained semantic manipulation remains underexplored. We investigate this space, starting with inversion-based editing where the difference between conditional/unconditional noise predictions carries key semantic information. Our core contribution stems from a Fourier analysis of noise predictions, revealing that their low- and high-frequency components evolve differently throughout diffusion. Based on this insight, we introduce FreSca, a straightforward method that applies guidance scaling independently to different frequency bands in the Fourier domain. FreSca demonstrably enhances existing image editing methods without retraining. Excitingly, its effectiveness extends to image understanding tasks such as depth estimation, yielding quantitative gains across multiple datasets.
zh
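FreSca 的核心是在傅里叶域对 classifier-free guidance 的差值按频段独立缩放。以下为示意实现(圆形低频掩码、截止频率 cutoff 与缩放系数均为假设,论文的频段划分方式可能不同):

```python
import torch

def frequency_scaled_guidance(eps_uncond, eps_cond, s_low=1.0, s_high=1.2, cutoff=0.25):
    """eps_*: (B, C, H, W) 的噪声预测;对条件/无条件差值按低/高频分别缩放。"""
    delta = eps_cond - eps_uncond  # 差值携带语义信息
    spec = torch.fft.fftshift(torch.fft.fft2(delta), dim=(-2, -1))
    h, w = delta.shape[-2:]
    yy, xx = torch.meshgrid(torch.linspace(-1, 1, h, device=delta.device),
                            torch.linspace(-1, 1, w, device=delta.device),
                            indexing="ij")
    low = ((xx ** 2 + yy ** 2).sqrt() <= cutoff).to(delta.dtype)  # 低频圆形掩码
    spec = spec * (s_low * low + s_high * (1 - low))              # 分频段缩放
    delta = torch.fft.ifft2(torch.fft.ifftshift(spec, dim=(-2, -1))).real
    return eps_uncond + delta
```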
[CV-89] Multivariate Temporal Regression at Scale: A Three-Pillar Framework Combining ML XAI and NLP
【速读】:本文旨在解决高维数据分析中的挑战,特别是当数据过于复杂时难以捕捉隐含关系的问题。传统方法通常关注输入变量间的直接关联,而忽视了更复杂的内在联系。为应对这些局限性,论文探索了多种验证技术,包括通过移除特定变量观察其影响以及利用统计分析挖掘多变量间的关系。此外,研究还评估了合成数据的作用,并考虑了不同传感器间信息冗余的情况。然而,这类分析往往计算成本高昂且需要大量人工干预。论文提出的关键解决方案是识别全局模式,以简化模型并促进分类或回归任务的理解。这种方法通过减少数据维度来实现模型简化,从而揭示输入与输出之间未被发现的关系,有助于进一步验证这些新发现的连接。研究基于真实世界数据集和合成数据集,目标是建立一种突出全局关键特征的方法,以提升预测能力并简化数据集的验证与量化过程。
链接: https://arxiv.org/abs/2504.02151
作者: Jiztom Kavalakkatt Francis,Matthew J Darr
机构: Iowa State Unviersity (爱荷华州立大学)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: 7 pages
点击查看摘要
Abstract:The rapid use of artificial intelligence (AI) in processes such as coding, image processing, and data prediction means it is crucial to understand and validate the data we are working with fully. This paper dives into the hurdles of analyzing high-dimensional data, especially when it gets too complex. Traditional methods in data analysis often look at direct connections between input variables, which can miss out on the more complicated relationships within the data. To address these issues, we explore several tested techniques, such as removing specific variables to see their impact and using statistical analysis to find connections between multiple variables. We also consider the role of synthetic data and how information can sometimes be redundant across different sensors. These analyses are typically very computationally demanding and often require much human effort to make sense of the results. A common approach is to treat the entire dataset as one unit and apply advanced models to handle it. However, this can become problematic with larger, noisier datasets and more complex models. So, we suggest methods to identify overall patterns that can help with tasks like classification or regression based on the idea that more straightforward approaches might be more understandable. Our research looks at two datasets: a real-world dataset and a synthetic one. The goal is to create a methodology that highlights key features on a global scale that lead to predictions, making it easier to validate or quantify the data set. By reducing the dimensionality with this method, we can simplify the models used and thus clarify the insights we gain. Furthermore, our method can reveal unexplored relationships between specific inputs and outcomes, providing a way to validate these new connections further.
zh
[CV-90] Evaluation of Flight Parameters in UAV-based 3D Reconstruction for Rooftop Infrastructure Assessment
【速读】:该论文旨在解决利用无人机(UAV)摄影测量进行复杂屋顶基础设施三维重建时,现有方法通常需要高图像重叠百分比和延长飞行时间以确保模型精度的问题。论文的关键在于系统评估地面采样距离(Ground Sampling Distance, GSD)和图像重叠这两个关键飞行参数,以优化三维重建效果。通过在Queen’s University的多段屋顶上进行受控无人机飞行实验,并结合Reality Capture软件处理数据与基于无人机激光雷达(LiDAR)及地面激光扫描(TLS)生成的真实模型对比验证,研究发现GSD范围为0.75-1.26厘米配合85%图像重叠能够实现较高的模型准确性,同时减少所需拍摄图像数量和飞行时间。这一成果为规划高效的自主无人机飞行路径提供了指导。
链接: https://arxiv.org/abs/2504.02084
作者: Nick Chodura,Melissa Greeff,Joshua Woods
机构: 未知
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注: 8 pages, 6 figures, 2 tables
点击查看摘要
Abstract:Rooftop 3D reconstruction using UAV-based photogrammetry offers a promising solution for infrastructure assessment, but existing methods often require high percentages of image overlap and extended flight times to ensure model accuracy when using autonomous flight paths. This study systematically evaluates key flight parameters, ground sampling distance (GSD) and image overlap, to optimize the 3D reconstruction of complex rooftop infrastructure. Controlled UAV flights were conducted over a multi-segment rooftop at Queen’s University using a DJI Phantom 4 Pro V2, with varied GSD and overlap settings. The collected data were processed using Reality Capture software and evaluated against ground truth models generated from UAV-based LiDAR and terrestrial laser scanning (TLS). Experimental results indicate that a GSD range of 0.75-1.26 cm combined with 85% image overlap achieves a high degree of model accuracy, while minimizing images collected and flight time. These findings provide guidance for planning autonomous UAV flight paths for efficient rooftop assessments.
zh
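摘要中的地面采样距离(GSD)与飞行高度满足标准摄影测量关系。下面按 DJI Phantom 4 Pro V2 的公开相机参数(1 英寸传感器宽约 13.2 mm、焦距 8.8 mm、横向 5472 像素;若与实机有出入请以实测为准)示意由目标 GSD 反推飞行高度:

```python
def altitude_for_gsd(gsd_cm_per_px, focal_mm=8.8, sensor_w_mm=13.2, image_w_px=5472):
    """GSD(cm/px) = 传感器宽(mm) × 高度(m) × 100 / (焦距(mm) × 图像宽(px));此处反解高度。"""
    return gsd_cm_per_px * focal_mm * image_w_px / (sensor_w_mm * 100.0)

# 论文推荐区间 0.75-1.26 cm/px 大致对应约 27-46 m 的飞行高度
for gsd in (0.75, 1.26):
    print(f"GSD {gsd} cm/px -> 高度约 {altitude_for_gsd(gsd):.1f} m")
```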
[CV-91] Aligned Better Listen Better for Audio-Visual Large Language Models ICLR2025
【速读】:该论文旨在解决现有视频大语言模型(Video-LLMs)和音频-视觉大语言模型(AV-LLMs)在利用音频信息方面存在的不足,这些问题导致模型在多模态视频理解中的表现较弱,并容易产生幻觉。为了解决这些问题,论文从模型架构和数据集两个方面提出了解决方案。关键在于:(1) 在架构层面,提出了一种细粒度的音频-视觉大语言模型Dolphin,通过在时序和空间维度上同时对齐音频和视觉模态,确保对视频的全面且准确的理解。具体而言,设计了用于多尺度信息聚合的音频-视觉多尺度适配器以实现空间对齐,并提出了音频-视觉交错合并方法以优化时序对齐。(2) 在数据集层面,构建了一个名为AVU的音频-视觉描述和指令微调数据集,包含520万条多样化、开放性数据元组(视频、音频、问题、答案),并引入了一种新颖的数据划分策略。实验结果表明,所提方法不仅显著提升了音频-视觉理解性能,还有效减少了幻觉现象的发生。
链接: https://arxiv.org/abs/2504.02061
作者: Yuxin Guo,Shuailei Ma,Shijie Ma,Xiaoyi Bao,Chen-Wei Xie,Kecheng Zheng,Tingyu Weng,Siyang Sun,Yun Zheng,Wei Zou
机构: School of Artificial Intelligence, University of Chinese Academy of Sciences (中国科学院大学人工智能学院); MAIS, Institute of Automation, Chinese Academy of Sciences (CASIA) (中国科学院自动化研究所麻省理工学院智能媒体实验室); Tongyi Lab, Alibaba Group (阿里集团通义实验室); Ant Group (蚂蚁集团); Northeastern University (东北大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注: Accepted to ICLR 2025
点击查看摘要
Abstract:Audio is essential for multimodal video understanding. On the one hand, video inherently contains audio, which supplies complementary information to vision. Besides, video large language models (Video-LLMs) can encounter many audio-centric settings. However, existing Video-LLMs and Audio-Visual Large Language Models (AV-LLMs) exhibit deficiencies in exploiting audio information, leading to weak understanding and hallucinations. To solve the issues, we delve into the model architecture and dataset. (1) From the architectural perspective, we propose a fine-grained AV-LLM, namely Dolphin. The concurrent alignment of audio and visual modalities in both temporal and spatial dimensions ensures a comprehensive and accurate understanding of videos. Specifically, we devise an audio-visual multi-scale adapter for multi-scale information aggregation, which achieves spatial alignment. For temporal alignment, we propose audio-visual interleaved merging. (2) From the dataset perspective, we curate an audio-visual caption and instruction-tuning dataset, called AVU. It comprises 5.2 million diverse, open-ended data tuples (video, audio, question, answer) and introduces a novel data partitioning strategy. Extensive experiments show our model not only achieves remarkable performance in audio-visual understanding, but also mitigates potential hallucinations.
zh
[CV-92] LSC-ADL: An Activity of Daily Living (ADL)-Annotated Lifelog Dataset Generated via Semi-Automatic Clustering
【速读】:该论文旨在解决日常生活中基于第一人称视角生活日志数据检索的相关性不足问题,特别是现有方法在活动级别标注(Activity-Level Annotations)上的忽视,这些问题限制了语义理解与检索的可解释性。论文的关键创新在于引入了一个名为LSC-ADL的新数据集,该数据集通过将日常生活活动(Activities of Daily Living, ADLs)作为结构化语义层,增强了检索的上下文感知能力。解决方案的核心在于采用半自动标注方法,结合HDBSCAN算法实现类内聚类,并辅以人工验证确保ADL标注的准确性。此外,通过将动作识别整合到日志检索任务中,LSC-ADL弥合了现有研究中的关键空白,为日常生活提供了更丰富的语义表示。这一方法不仅提升了检索结果的准确性,还显著增强了其可解释性。
链接: https://arxiv.org/abs/2504.02060
作者: Minh-Quan Ho-Le,Duy-Khang Ho,Van-Tu Ninh,Cathal Gurrin,Minh-Triet Tran
机构: SELAB, Ho Chi Minh City University of Science (HCMUS)(SELAB, 胡志明市国立大学科学学院); FIT, Ho Chi Minh City University of Science (HCMUS)(FIT, 胡志明市国立大学科学学院); Dublin City University (DCU)(都柏林城市大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR)
备注: 11 pages, 4 figures
点击查看摘要
Abstract:Lifelogging involves continuously capturing personal data through wearable cameras, providing an egocentric view of daily activities. Lifelog retrieval aims to search and retrieve relevant moments from this data, yet existing methods largely overlook activity-level annotations, which capture temporal relationships and enrich semantic understanding. In this work, we introduce LSC-ADL, an ADL-annotated lifelog dataset derived from the LSC dataset, incorporating Activities of Daily Living (ADLs) as a structured semantic layer. Using a semi-automatic approach featuring the HDBSCAN algorithm for intra-class clustering and human-in-the-loop verification, we generate accurate ADL annotations to enhance retrieval explainability. By integrating action recognition into lifelog retrieval, LSC-ADL bridges a critical gap in existing research, offering a more context-aware representation of daily life. We believe this dataset will advance research in lifelog retrieval, activity recognition, and egocentric vision, ultimately improving the accuracy and interpretability of retrieved content. The ADL annotations can be downloaded at this https URL.
zh
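LSC-ADL 的半自动标注流程先用 HDBSCAN 做类内聚类,再经人工校验(human-in-the-loop)。下面示意用 hdbscan 库对帧级特征生成一次聚类提案(特征来源与 min_cluster_size 为假设值):

```python
import numpy as np
import hdbscan  # pip install hdbscan

def propose_activity_clusters(frame_embeddings: np.ndarray, min_cluster_size: int = 15):
    """frame_embeddings: (N, D) 的帧级视觉特征;返回每帧的簇标签,-1 表示噪声点。
    聚类结果作为 ADL 标注候选,交由人工校验后写入数据集。"""
    clusterer = hdbscan.HDBSCAN(min_cluster_size=min_cluster_size, metric="euclidean")
    return clusterer.fit_predict(frame_embeddings)
```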
[CV-93] WorldPrompter: Traversable Text-to-Scene Generation
【速读】:该论文旨在解决场景级三维(3D)生成领域中存在的挑战,即现有方法大多只能生成部分场景且提供的导航自由度有限的问题。论文提出了一种名为WorldPrompter的新颖生成管道,用于从文本提示合成可行走的3D场景。其关键解决方案在于利用全景视频作为中间表示来建模场景的360°细节,并结合一个条件化360°全景视频生成器,能够生成包含128帧的视频以模拟人在虚拟环境中的行走与观察过程。随后,通过快速前馈3D重建器将视频重构为高斯点云(Gaussian splats),从而实现真实的场景内行走体验。实验结果表明,所提出的全景视频生成模型在跨帧视图一致性方面表现出色,支持高质量的全景高斯点云重建,并促进了对场景某区域的有效遍历。此外,定性和定量评估均显示其性能优于现有的360°视频生成器和3D场景生成模型。
链接: https://arxiv.org/abs/2504.02045
作者: Zhaoyang Zhang,Yannick Hold-Geoffroy,Miloš Hašan,Chen Ziwen,Fujun Luan,Julie Dorsey,Yiwei Hu
机构: Adobe; Yale University; Oregon State University
类目: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Scene-level 3D generation is a challenging research topic, with most existing methods generating only partial scenes and offering limited navigational freedom. We introduce WorldPrompter, a novel generative pipeline for synthesizing traversable 3D scenes from text prompts. We leverage panoramic videos as an intermediate representation to model the 360° details of a scene. WorldPrompter incorporates a conditional 360° panoramic video generator, capable of producing a 128-frame video that simulates a person walking through and capturing a virtual environment. The resulting video is then reconstructed as Gaussian splats by a fast feedforward 3D reconstructor, enabling a true walkable experience within the 3D scene. Experiments demonstrate that our panoramic video generation model achieves convincing view consistency across frames, enabling high-quality panoramic Gaussian splat reconstruction and facilitating traversal over an area of the scene. Qualitative and quantitative results also show it outperforms the state-of-the-art 360° video generators and 3D scene generation models.
zh
[CV-94] Real-Time Navigation for Autonomous Aerial Vehicles Using Video
【速读】:该论文试图解决在自主导航中利用有限资源设备(如空中无人机)进行语义信息检测与处理的问题。传统方法依赖于几何三维点云的构建与处理,成本高昂;而基于语义信息(如交通标志)的导航方式虽更简单,但其涉及的计算机视觉算法(如目标检测)对资源受限的设备仍具有较高需求。解决方案的关键在于引入一种新颖的马尔可夫决策过程(Markov Decision Process, MDP)框架,通过该框架显著降低计算机视觉方法的工作负载,同时在能耗和速度上获得明显提升,仅以有限的精度损失为代价。
链接: https://arxiv.org/abs/2504.01996
作者: Khizar Anjum,Parul Pandey,Vidyasagar Sadhu,Roberto Tron,Dario Pompili
机构: Rutgers University (罗格斯大学); Boston University (波士顿大学)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注: Submitted to Journal of Real-Time Image Processing
点击查看摘要
Abstract:Most applications in autonomous navigation using mounted cameras rely on the construction and processing of geometric 3D point clouds, which is an expensive process. However, there is another simpler way to make a space navigable quickly: to use semantic information (e.g., traffic signs) to guide the agent. However, detecting and acting on semantic information involves Computer Vision (CV) algorithms such as object detection, which themselves are demanding for agents such as aerial drones with limited onboard resources. To solve this problem, we introduce a novel Markov Decision Process (MDP) framework to reduce the workload of these CV approaches. We apply our proposed framework to both feature-based and neural-network-based object-detection tasks, using open-loop and closed-loop simulations as well as hardware-in-the-loop emulations. These holistic tests show significant benefits in energy consumption and speed with only a limited loss in accuracy compared to models based on static features and neural networks.
zh
[CV-95] A Concise Survey on Lane Topology Reasoning for HD Mapping
【速读】:该论文旨在系统性地综述车道拓扑推理方法的发展历程与现状,填补现有领域内缺乏全面概述的空白。论文将相关方法分为三大范式:基于过程建模的方法、基于航拍图像的方法以及基于车载传感器的方法,并分析了从早期基于规则的方法向现代基于Transformer、图神经网络(Graph Neural Networks, GNNs)及其他深度学习架构的解决方案演进的过程。论文的关键在于通过标准化评估指标(如道路级指标APLS和TLTS分数,以及车道级指标DET和TOP分数)及在基准数据集(如OpenLane-V2)上的性能对比,深入探讨技术挑战(如数据集可用性和模型效率),并提出未来研究的潜在方向。论文的核心贡献在于为研究人员和从业者提供了关于车道拓扑推理理论框架、实际实现方式及其在高精地图应用中的新兴趋势的深刻见解。
链接: https://arxiv.org/abs/2504.01989
作者: Yi Yao,Miao Fan,Shengtong Xu,Haoyi Xiong,Xiangzeng Liu,Wenbo Hu,Wenbing Huang
机构: NavInfo Co., Ltd. (四维图新科技股份有限公司, 中国); NavInfo Co., Ltd. (四维图新科技股份有限公司, 中国); Autohome Inc. (汽车之家, 中国); Baidu Inc. (百度, 中国); Xidian University of Technology (西安电子科技大学, 中国); Hefei University of Technology (合肥工业大学, 中国); Renmin University of China (中国人民大学, 中国)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by IEEE IV’25
点击查看摘要
Abstract:Lane topology reasoning techniques play a crucial role in high-definition (HD) mapping and autonomous driving applications. While recent years have witnessed significant advances in this field, there has been limited effort to consolidate these works into a comprehensive overview. This survey systematically reviews the evolution and current state of lane topology reasoning methods, categorizing them into three major paradigms: procedural modeling-based methods, aerial imagery-based methods, and onboard sensors-based methods. We analyze the progression from early rule-based approaches to modern learning-based solutions utilizing transformers, graph neural networks (GNNs), and other deep learning architectures. The paper examines standardized evaluation metrics, including road-level measures (APLS and TLTS score), and lane-level metrics (DET and TOP score), along with performance comparisons on benchmark datasets such as OpenLane-V2. We identify key technical challenges, including dataset availability and model efficiency, and outline promising directions for future research. This comprehensive review provides researchers and practitioners with insights into the theoretical frameworks, practical implementations, and emerging trends in lane topology reasoning for HD mapping applications.
zh
[CV-96] Distance Estimation to Support Assistive Drones for the Visually Impaired using Robust Calibration
【速读】:该论文旨在解决视障人士(Visually Impaired People, VIPs)在户外环境中自主导航并规避障碍物的问题。为实现这一目标,论文提出了一种名为NOVA的鲁棒校准技术,其关键是利用深度图估计校园环境中的绝对障碍物距离,并采用动态更新方法以适应对抗性场景。通过与最先进的深度图方法及基于几何和回归的基线模型进行比较,验证了NOVA在多种动态条件下的距离估计精度(对VIP的误差为30cm,对车辆和自行车等障碍物的最大误差为60cm),且优于基准模型和现有技术。
链接: https://arxiv.org/abs/2504.01988
作者: Suman Raj,Bhavani A Madhabhavi,Madhav Kumar,Prabhav Gupta,Yogesh Simmhan
机构: Dream:Lab (Dream:Lab); Indian Institute of Science (印度科学理工学院)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注: 39 pages
点击查看摘要
Abstract:Autonomous navigation by drones using onboard sensors, combined with deep learning and computer vision algorithms, is impacting a number of domains. We examine the use of drones to autonomously assist Visually Impaired People (VIPs) in navigating outdoor environments while avoiding obstacles. Here, we present NOVA, a robust calibration technique using depth maps to estimate absolute distances to obstacles in a campus environment. NOVA uses a dynamic-update method that can adapt to adversarial scenarios. We compare NOVA with SOTA depth map approaches, and with geometric and regression-based baseline models, for distance estimation to VIPs and other obstacles in diverse and dynamic conditions. We also provide exhaustive evaluations to validate the robustness and generalizability of our methods. NOVA predicts distances to VIPs with an error of 30cm and to different obstacles like cars and bicycles with a maximum of 60cm error, which is better than the baselines. NOVA also clearly outperforms SOTA depth map methods by up to 5.3-14.6x.
zh
[CV-97] CaLiV: LiDAR-to-Vehicle Calibration of Arbitrary Sensor Setups via Object Reconstruction
【速读】:本文旨在解决自动驾驶系统中多线 LiDAR 系统外参标定(Sensor-to-Sensor 和 Sensor-to-Vehicle Calibration)的问题,特别是针对非重叠视场(FoVs)和无特征环境的场景。现有方法通常依赖于视场重叠、外部传感器或丰富的环境特征,且大多数算法不支持 Sensor-to-Vehicle 标定。论文提出了一种基于目标的新型标定技术 CaLiV,其关键在于通过运动引入视场重叠,并利用无味卡尔曼滤波器(Unscented Kalman Filter, UKF)估计车辆位姿;随后采用基于高斯混合模型的点云配准框架 GMMCalib 实现统一标定坐标系下的点云对齐;最终将标定问题转化为最小化问题求解。这种方法能够精确解决传感器间的平移与旋转误差,同时实现高精度的 Sensor-to-Vehicle 旋转角标定。
链接: https://arxiv.org/abs/2504.01987
作者: Ilir Tahiraj,Markus Edinger,Dominik Kulmer,Markus Lienkamp
机构: TUM School of Engineering and Design, Chair of Automotive Technology, Technical University of Munich (慕尼黑工业大学); TUM School of Computation, Information and Technology, Technical University of Munich (慕尼黑工业大学)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:In autonomous systems, sensor calibration is essential for a safe and efficient navigation in dynamic environments. Accurate calibration is a prerequisite for reliable perception and planning tasks such as object detection and obstacle avoidance. Many existing LiDAR calibration methods require overlapping fields of view, while others use external sensing devices or postulate a feature-rich environment. In addition, Sensor-to-Vehicle calibration is not supported by the vast majority of calibration algorithms. In this work, we propose a novel target-based technique for extrinsic Sensor-to-Sensor and Sensor-to-Vehicle calibration of multi-LiDAR systems called CaLiV. This algorithm works for non-overlapping FoVs, as well as arbitrary calibration targets, and does not require any external sensing devices. First, we apply motion to produce FoV overlaps and utilize a simple unscented Kalman filter to obtain vehicle poses. Then, we use the Gaussian mixture model-based registration framework GMMCalib to align the point clouds in a common calibration frame. Finally, we reduce the task of recovering the sensor extrinsics to a minimization problem. We show that both translational and rotational Sensor-to-Sensor errors can be solved accurately by our method. In addition, all Sensor-to-Vehicle rotation angles can also be calibrated with high accuracy. We validate the simulation results in real-world experiments. The code is open source and available on this https URL.
zh
[CV-98] Adaptive Frequency Enhancement Network for Remote Sensing Image Semantic Segmentation
【速读】:该论文旨在解决高分辨率遥感图像语义分割中适应不同地物分布的网络参数调整难题以及增强空间域与频域特征交互能力的挑战。为应对这些挑战,论文提出了一种自适应频率增强网络(Adaptive Frequency Enhancement Network, AFENet),其核心在于两个关键模块:自适应频率与空间特征交互模块(Adaptive Frequency and Spatial feature Interaction Module, AFSIM)和选择性特征融合模块(Selective feature Fusion Module, SFM)。AFSIM能够根据输入图像内容动态分离并调节高低频特征,通过自适应生成掩码来分离高低频成分,从而为地物特征表示提供最优细节和上下文补充信息;而SFM则选择性地融合全局上下文与局部详细特征以提升网络的表征能力,进一步强化频域与空间域特征之间的交互。实验结果表明,所提出的AFENet在多个公开数据集上优于现有最先进的方法,并验证了AFSIM和SFM在处理多样化地物类型和复杂场景中的有效性。
链接: https://arxiv.org/abs/2504.02647
作者: Feng Gao,Miao Fu,Jingchao Cao,Junyu Dong,Qian Du
机构: School of Computer Science and Technology, Ocean University of China (海洋大学计算机科学与技术学院), Qingdao 266100, China; Department of Electrical and Computer Engineering, Mississippi State University (密西西比州立大学电气与计算机工程系), Starkville, MS 39762 USA
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by IEEE TGRS 2025
点击查看摘要
Abstract:Semantic segmentation of high-resolution remote sensing images plays a crucial role in land-use monitoring and urban planning. Recent remarkable progress in deep learning-based methods makes it possible to generate satisfactory segmentation results. However, existing methods still face challenges in adapting network parameters to various land cover distributions and enhancing the interaction between spatial and frequency domain features. To address these challenges, we propose the Adaptive Frequency Enhancement Network (AFENet), which integrates two key components: the Adaptive Frequency and Spatial feature Interaction Module (AFSIM) and the Selective feature Fusion Module (SFM). AFSIM dynamically separates and modulates high- and low-frequency features according to the content of the input image. It adaptively generates two masks to separate high- and low-frequency components, therefore providing optimal details and contextual supplementary information for ground object feature representation. SFM selectively fuses global context and local detailed features to enhance the network’s representation capability. Hence, the interactions between frequency and spatial features are further enhanced. Extensive experiments on three publicly available datasets demonstrate that the proposed AFENet outperforms state-of-the-art methods. In addition, we also validate the effectiveness of AFSIM and SFM in managing diverse land cover types and complex scenarios. Our codes are available at this https URL.
zh
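AFSIM 的关键是根据图像内容自适应地生成掩码来分离高低频分量。以下是一个简化示意:用小卷积头在特征频谱上预测软掩码(模块结构为假设,与论文实际设计不同,仅表达"自适应频率分离"的思路):

```python
import torch
import torch.nn as nn

class AdaptiveFreqSplit(nn.Module):
    """在特征的 2D 频谱上预测软掩码,自适应分离低频/高频分量(示意)。"""
    def __init__(self, channels: int):
        super().__init__()
        self.mask_head = nn.Sequential(
            nn.Conv2d(2 * channels, channels, 1), nn.ReLU(),
            nn.Conv2d(channels, 1, 1), nn.Sigmoid())

    def forward(self, feat):  # feat: (B, C, H, W)
        spec = torch.fft.fftshift(torch.fft.fft2(feat), dim=(-2, -1))
        mask = self.mask_head(torch.cat([spec.real, spec.imag], dim=1))  # 接近 1 视为低频
        low = torch.fft.ifft2(torch.fft.ifftshift(spec * mask, dim=(-2, -1))).real
        return low, feat - low  # 低频分量与高频残差
```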
[CV-99] owards Computation- and Communication-efficient Computational Pathology
【速读】:本文旨在解决当前计算病理学模型在临床应用中的诊断效率挑战,主要由于其依赖高放大倍率全切片图像分析而导致的计算负担重、文件传输需求大以及存储开销高的问题。特别是在时间敏感的诊断场景(如术中冰冻切片诊断)和需要高效数据传输的情况下,这一局限性尤为突出。为应对这些挑战,论文提出了一种名为Magnification-Aligned Global-Local Transformer (MAGA-GLTrans) 的新型高效计算与通信框架。该方案的关键创新在于提出的Magnification Alignment (MAGA) 机制,通过自监督学习有效地弥合低倍率与高倍率图像之间的信息差距,并对齐它们的特征表示,从而实现基于低倍率输入的有效分析。此方法显著减少了高达10.7倍的计算时间以及超过20倍的文件传输和存储需求,同时保持了最先进的分类性能。此外,MAGA框架还展示了其作为特征提取器增强任何计算病理学架构效率的能力,以及与现有基础模型和组织病理学特定编码器的兼容性,使其能够以极小的信息损失处理低倍率输入。这些进步使MAGA-GLTrans成为时间敏感型应用的特别有前景的解决方案。
链接: https://arxiv.org/abs/2504.02628
作者: Chu Han,Bingchao Zhao,Jiatai Lin,Shanshan Lyu,Longfei Wang,Tianpeng Deng,Cheng Lu,Changhong Liang,Hannah Y. Wen,Xiaojing Guo,Zhenwei Shi,Zaiyi Liu
机构: Guangdong Provincial Key Laboratory of Artificial Intelligence in Medical Image Analysis and Application (广东省医学图像分析人工智能重点实验室), Guangdong Provincial People’s Hospital (Guangdong Academy of Medical Sciences)(广东省人民医院(广东医学科学院)), Southern Medical University (南方医科大学), Guangzhou 510080, China; Department of Radiology (放射科), Guangdong Provincial People’s Hospital (Guangdong Academy of Medical Sciences)(广东省人民医院(广东医学科学院)), Southern Medical University (南方医科大学), Guangzhou 510080, China; Department of Pathology (病理科), Guangdong Provincial People’s Hospital (Guangdong Academy of Medical Sciences)(广东省人民医院(广东医学科学院)), Southern Medical University (南方医科大学), Guangzhou 510080, China; Department of Pathology and Laboratory Medicine, Memorial Sloan Kettering Cancer Center (纪念斯隆凯特琳癌症中心), 1275 York Avenue, New York, NY 10065; Department of Breast Pathology and Laboratory, Tianjin Medical University Cancer Institute and Hospital (天津医科大学肿瘤医院), National Clinical Research Center for Cancer (国家肿瘤临床研究中心), Key Laboratory of Breast Cancer Prevention and Therapy of Ministry of Education of China (教育部乳腺癌防治重点实验室), Tianjin Medical University (天津医科大学), Tianjin’s Clinical Research Center for Cancer (天津市癌症临床研究中心), West Huanhu Road, Tianjin, China, 300060
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Despite the impressive performance across a wide range of applications, current computational pathology models face significant diagnostic efficiency challenges due to their reliance on high-magnification whole-slide image analysis. This limitation severely compromises their clinical utility, especially in time-sensitive diagnostic scenarios and situations requiring efficient data transfer. To address these issues, we present a novel computation- and communication-efficient framework called Magnification-Aligned Global-Local Transformer (MAGA-GLTrans). Our approach significantly reduces computational time, file transfer requirements, and storage overhead by enabling effective analysis using low-magnification inputs rather than high-magnification ones. The key innovation lies in our proposed magnification alignment (MAGA) mechanism, which employs self-supervised learning to bridge the information gap between low and high magnification levels by effectively aligning their feature representations. Through extensive evaluation across various fundamental CPath tasks, MAGA-GLTrans demonstrates state-of-the-art classification performance while achieving remarkable efficiency gains: up to 10.7 times reduction in computational time and over 20 times reduction in file transfer and storage requirements. Furthermore, we highlight the versatility of our MAGA framework through two significant extensions: (1) its applicability as a feature extractor to enhance the efficiency of any CPath architecture, and (2) its compatibility with existing foundation models and histopathology-specific encoders, enabling them to process low-magnification inputs with minimal information loss. These advancements position MAGA-GLTrans as a particularly promising solution for time-sensitive applications, especially in the context of intraoperative frozen section diagnosis where both accuracy and efficiency are paramount.
zh
[CV-100] ranslation of Fetal Brain Ultrasound Images into Pseudo-MRI Images using Artificial Intelligence
【速读】:该论文旨在解决超声成像在胎儿大脑评估中的局限性,特别是在孕晚期,由于图像质量限制导致定量数据分析困难的问题。与磁共振成像(MRI)相比,超声虽然成本低且易获取,但其图像质量和组织对比度较低。为克服这一挑战,论文提出了一种名为“双扩散关联模型”(DDIC)的方法,其关键是利用基于扩散的翻译方法,假设超声和MRI领域之间存在共享潜在空间,并采用先进的生成式AI技术(如扩散模型)将超声图像转换为类似MRI的显示效果。通过训练模型并结合多个数据集(包括HC18、CRL胎儿大脑图谱和FeTA),DDIC显著提升了伪MRI图像的视觉分辨能力,尤其是在侧脑室和大脑裂等区域的对比清晰度。此外,多项评价指标(如互信息、峰值信噪比、Fréchet Inception距离和信噪比对比度)均证明了DDIC相较于其他翻译方法的优越性能,同时临床医生测试也验证了其在诊断展示上的改进潜力。
链接: https://arxiv.org/abs/2504.02408
作者: Naomi Silverstein,Efrat Leibowitz,Ron Beloosesky,Haim Azhari
机构: 未知
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: 13 pages, 7 figures
点击查看摘要
Abstract:Ultrasound is a widely accessible and cost-effective medical imaging tool commonly used for prenatal evaluation of the fetal brain. However, it has limitations, particularly in the third trimester, where the complexity of the fetal brain requires high image quality for extracting quantitative data. In contrast, magnetic resonance imaging (MRI) offers superior image quality and tissue differentiation but is less available, expensive, and requires time-consuming acquisition. Thus, transforming ultrasonic images into an MRI-mimicking display may be advantageous and allow better tissue anatomy presentation. To address this goal, we have examined the use of artificial intelligence, implementing a diffusion model renowned for generating high-quality images. The proposed method, termed “Dual Diffusion Imposed Correlation” (DDIC), leverages a diffusion-based translation methodology, assuming a shared latent space between ultrasound and MRI domains. Model training was obtained utilizing the “HC18” dataset for ultrasound and the “CRL fetal brain atlas” along with the “FeTA” datasets for MRI. The generated pseudo-MRI images provide notable improvements in visual discrimination of brain tissue, especially in the lateral ventricles and the Sylvian fissure, characterized by enhanced contrast clarity. Improvement was demonstrated in Mutual information, Peak signal-to-noise ratio, Fréchet Inception Distance, and Contrast-to-noise ratio. Findings from these evaluations indicate statistically significant superior performance of the DDIC compared to other translation methodologies. In addition, a Medical Opinion Test was obtained from 5 gynecologists. The results demonstrated display improvement in 81% of the tested images. In conclusion, the presented pseudo-MRI images hold the potential for streamlining diagnosis and enhancing clinical outcomes through improved representation.
zh
[CV-101] Benchmark of Segmentation Techniques for Pelvic Fracture in CT and X-ray: Summary of the PENGWIN 2024 Challenge
【速读】:该论文旨在解决骨盆骨折碎片在CT和X射线图像中分割的难题,这对于创伤诊断、手术规划及术中引导至关重要。由于解剖结构复杂性和成像限制,准确且高效地描绘骨碎片仍具挑战性。为推进自动化骨折分割技术的发展,PENGWIN挑战赛作为MICCAI 2024卫星活动举办,通过基准测试最先进的算法应对这些复杂任务。研究收集了来自多个临床中心的150例CT扫描数据,并利用DeepDRR方法生成大量模拟X射线图像。最终,来自全球16支队伍的提交结果在严格的多指标测试方案下被评估。顶级CT算法实现了平均片段级交并比(IoU)为0.930,表明其具有满意的精度;然而,在X射线任务中,最优算法仅达到0.774的IoU,凸显了重叠解剖结构带来的更大挑战。除了定量评估外,该挑战还揭示了算法设计的方法学多样性,如实例表示方式的不同(主要-次要分类与边界-核心分离)导致了不同的分割策略。尽管取得了令人鼓舞的结果,但挑战也暴露了碎片定义中的固有不确定性,特别是在不完全骨折的情况下。这些发现表明,将人类决策与任务相关的信息相结合的交互式分割方法可能是提高模型可靠性和临床适用性的关键。
链接: https://arxiv.org/abs/2504.02382
作者: Yudi Sang,Yanzhen Liu,Sutuke Yibulayimu,Yunning Wang,Benjamin D. Killeen,Mingxu Liu,Ping-Cheng Ku,Ole Johannsen,Karol Gotkowski,Maximilian Zenk,Klaus Maier-Hein,Fabian Isensee,Peiyan Yue,Yi Wang,Haidong Yu,Zhaohong Pan,Yutong He,Xiaokun Liang,Daiqi Liu,Fuxin Fan,Artur Jurgas,Andrzej Skalski,Yuxi Ma,Jing Yang,Szymon Płotka,Rafał Litka,Gang Zhu,Yingchun Song,Mathias Unberath,Mehran Armand,Dan Ruan,S. Kevin Zhou,Qiyong Cao,Chunpeng Zhao,Xinbao Wu,Yu Wang
机构: 未知
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: PENGWIN 2024 Challenge Report
点击查看摘要
Abstract:The segmentation of pelvic fracture fragments in CT and X-ray images is crucial for trauma diagnosis, surgical planning, and intraoperative guidance. However, accurately and efficiently delineating the bone fragments remains a significant challenge due to complex anatomy and imaging limitations. The PENGWIN challenge, organized as a MICCAI 2024 satellite event, aimed to advance automated fracture segmentation by benchmarking state-of-the-art algorithms on these complex tasks. A diverse dataset of 150 CT scans was collected from multiple clinical centers, and a large set of simulated X-ray images was generated using the DeepDRR method. Final submissions from 16 teams worldwide were evaluated under a rigorous multi-metric testing scheme. The top-performing CT algorithm achieved an average fragment-wise intersection over union (IoU) of 0.930, demonstrating satisfactory accuracy. However, in the X-ray task, the best algorithm attained an IoU of 0.774, highlighting the greater challenges posed by overlapping anatomical structures. Beyond the quantitative evaluation, the challenge revealed methodological diversity in algorithm design. Variations in instance representation, such as primary-secondary classification versus boundary-core separation, led to differing segmentation strategies. Despite promising results, the challenge also exposed inherent uncertainties in fragment definition, particularly in cases of incomplete fractures. These findings suggest that interactive segmentation approaches, integrating human decision-making with task-relevant information, may be essential for improving model reliability and clinical applicability.
zh
[CV-102] HPGN: Hybrid Priors-Guided Network for Compressed Low-Light Image Enhancement
【速读】:该论文旨在解决现有方法在处理低光照图像压缩增强时存在的两个主要问题:一是未能有效去除增强过程中可能产生的压缩伪影;二是缺乏针对不同压缩质量图像的统一联合任务增强框架。为了解决这些问题,论文提出了一种混合先验引导网络(Hybrid Priors-Guided Network, HPGN),其关键在于通过整合压缩先验(compression priors)和光照先验(illumination priors),利用JPEG质量因子(Quality Factor, QF)和离散余弦变换量化矩阵(Quantization Matrix, QM)来指导高效联合任务的即插即用模块设计,并采用随机QF生成策略引导模型训练,从而实现单一模型对不同压缩水平图像的有效增强。实验结果验证了所提方法的优越性。
链接: https://arxiv.org/abs/2504.02373
作者: Hantang Li,Jinhua Hao,Lei Xiong,Shuyuan Zhu
机构: University of Electronic Science and Technology of China (UESTC); Kuaishou Technology (快手科技), Beijing
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注: 7 pages, 5 figures
点击查看摘要
Abstract:In practical applications, conventional methods generate large volumes of low-light images that require compression for efficient storage and transmission. However, most existing methods either disregard the removal of potential compression artifacts during the enhancement process or fail to establish a unified framework for joint task enhancement of images with varying compression qualities. To solve this problem, we propose the hybrid priors-guided network (HPGN), which enhances compressed low-light images by integrating both compression and illumination priors. Our approach fully utilizes the JPEG quality factor (QF) and DCT quantization matrix (QM) to guide the design of efficient joint task plug-and-play modules. Additionally, we employ a random QF generation strategy to guide model training, enabling a single model to enhance images across different compression levels. Experimental results confirm the superiority of our proposed method.
zh
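HPGN 以 JPEG 质量因子(QF)与量化矩阵(QM)作为压缩先验。两者之间的换算在 JPEG 标准实现(IJG)中是确定的,示意如下(基表为 JPEG 规范附录 K 的标准亮度量化表;HPGN 如何把该先验注入网络属于论文设计,此处不涉及):

```python
import numpy as np

# JPEG 规范附录 K 的标准亮度基础量化表
BASE_QM = np.array([
    [16, 11, 10, 16,  24,  40,  51,  61],
    [12, 12, 14, 19,  26,  58,  60,  55],
    [14, 13, 16, 24,  40,  57,  69,  56],
    [14, 17, 22, 29,  51,  87,  80,  62],
    [18, 22, 37, 56,  68, 109, 103,  77],
    [24, 35, 55, 64,  81, 104, 113,  92],
    [49, 64, 78, 87, 103, 121, 120, 101],
    [72, 92, 95, 98, 112, 100, 103,  99]], dtype=np.int64)

def qm_from_qf(qf: int) -> np.ndarray:
    """IJG 约定:QF<50 时 scale=5000/QF,否则 scale=200-2*QF,再缩放基表。"""
    qf = int(np.clip(qf, 1, 100))
    scale = 5000 // qf if qf < 50 else 200 - 2 * qf
    return np.clip((BASE_QM * scale + 50) // 100, 1, 255)
```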
[CV-103] APSeg: Auto-Prompt Model with Acquired and Injected Knowledge for Nuclear Instance Segmentation and Classification
【速读】:该论文旨在解决数字病理学诊断中细胞核分割与分类的准确性与效率问题。尽管Segment Anything Model (SAM) 显著提升了细胞核分割的精度与效率,但其高度依赖精确提示(prompts),并且由于其类别无关(class-agnostic)的设计,分类结果完全取决于提供的提示。为了解决这些问题,论文提出了一种名为\textbfAPSeg的自动提示生成模型,该模型结合了获取的知识和注入的知识,用于细胞核实例的分割与分类。关键在于引入了两个知识感知模块:(1) 密度图引导的分布偏移提议模块(Distribution-Guided Proposal Offset Module, DG-POM),通过学习密度图引导的分布知识实现更精准的定位;(2) 类别知识语义注入模块(Category Knowledge Semantic Injection Module, CK-SIM),通过注入从类别描述中提取的形态学知识实现更准确的分类。论文在PanNuke和CoNSeP数据集上进行了大量实验,验证了所提方法的有效性。
链接: https://arxiv.org/abs/2504.02222
作者: Liying Xu,Hongliang He,Wei Han,Hanbin Huang,Siwei Feng,Guohong Fu
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注: 10 pages, 3 figures
点击查看摘要
Abstract:Nuclear instance segmentation and classification provide critical quantitative foundations for digital pathology diagnosis. With the advent of the foundational Segment Anything Model (SAM), the accuracy and efficiency of nuclear segmentation have improved significantly. However, SAM imposes a strong reliance on precise prompts, and its class-agnostic design renders its classification results entirely dependent on the provided prompts. Therefore, we focus on generating prompts with more accurate localization and classification and propose APSeg, an Auto-Prompt model with acquired and injected knowledge for nuclear instance Segmentation and classification. APSeg incorporates two knowledge-aware modules: (1) Distribution-Guided Proposal Offset Module (DG-POM), which learns distribution knowledge through density map guidance, and (2) Category Knowledge Semantic Injection Module (CK-SIM), which injects morphological knowledge derived from category descriptions. We conducted extensive experiments on the PanNuke and CoNSeP datasets, demonstrating the effectiveness of our approach. The code will be released upon acceptance.
zh
[CV-104] Image Coding for Machines via Feature-Preserving Rate-Distortion Optimization
【速读】:该论文旨在解决在图像和视频压缩过程中同时优化视觉质量和下游计算机视觉任务性能的问题。传统率失真优化(Rate-Distortion Optimization, RDO)方法需要迭代式的编码、解码和特征评估流程,计算开销巨大且不适用于实际应用。为了解决这一问题,论文的关键创新在于简化RDO公式的失真项,使其能够通过基于块的编码器进行高效计算。具体而言,论文首先利用泰勒展开式对特征提取器进行线性化处理,将特征距离重铸为一个带有神经网络雅可比矩阵的二次度量,并进一步将其替换为输入相关的平方误差(Input-Dependent Squared Error, IDSE)的分块近似形式。为了降低计算复杂度,引入雅可比矩阵草图(Jacobian Sketches)来近似IDSE。最终,这种优化后的损失函数可以在变换域中分块评估,并与均方误差(Sum of Squared Errors, SSE)结合,从而在保证视觉质量的同时提升计算机视觉任务的性能。实验结果表明,与基于SSE的传统RDO方法相比,该方法在保持相同计算机视觉准确性的情况下可实现高达10%的比特率节省,且仅增加7%的编码器复杂度,而无需额外的解码器开销。
链接: https://arxiv.org/abs/2504.02216
作者: Samuel Fernández-Menduiña,Eduardo Pavez,Antonio Ortega
机构: Ming Hsieh Department of Electrical and Computer Engineering, University of Southern California (南加州大学)
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Many images and videos are primarily processed by computer vision algorithms, involving only occasional human inspection. When this content requires compression before processing, e.g., in distributed applications, coding methods must optimize for both visual quality and downstream task performance. We first show that, given the features obtained from the original and the decoded images, an approach to reduce the effect of compression on a task loss is to perform rate-distortion optimization (RDO) using the distance between features as a distortion metric. However, optimizing directly such a rate-distortion trade-off requires an iterative workflow of encoding, decoding, and feature evaluation for each coding parameter, which is computationally impractical. We address this problem by simplifying the RDO formulation to make the distortion term computable using block-based encoders. We first apply Taylor’s expansion to the feature extractor, recasting the feature distance as a quadratic metric with the Jacobian matrix of the neural network. Then, we replace the linearized metric with a block-wise approximation, which we call input-dependent squared error (IDSE). To reduce computational complexity, we approximate IDSE using Jacobian sketches. The resulting loss can be evaluated block-wise in the transform domain and combined with the sum of squared errors (SSE) to address both visual quality and computer vision performance. Simulations with AVC across multiple feature extractors and downstream neural networks show up to 10% bit-rate savings for the same computer vision accuracy compared to RDO based on SSE, with no decoder complexity overhead and just a 7% encoder complexity increase.
zh
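该论文把特征距离经一阶泰勒展开线性化为带雅可比矩阵的二次度量(IDSE)。下面用雅可比-向量积(JVP)示意这一线性化,避免显式构造雅可比矩阵(分块近似与雅可比草图是论文的进一步简化,此处未实现):

```python
import torch

def linearized_feature_distortion(feature_extractor, x, x_hat):
    """一阶近似:||f(x_hat) - f(x)||^2 ≈ ||J_f(x)(x_hat - x)||^2,用 JVP 直接计算。"""
    delta = x_hat - x
    _, jvp_out = torch.autograd.functional.jvp(feature_extractor, (x,), (delta,))
    return (jvp_out ** 2).sum()
```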
人工智能
[AI-0] On Vanishing Variance in Transformer Length Generalization
链接: https://arxiv.org/abs/2504.02827
作者: Ruining Li,Gabrijel Boduljak,Jensen (Jinghao) Zhou
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: Project page: this https URL . The first two authors contributed equally to this work
点击查看摘要
Abstract:It is a widely known issue that Transformers, when trained on shorter sequences, fail to generalize robustly to longer ones at test time. This raises the question of whether Transformer models are real reasoning engines, despite their impressive abilities in mathematical problem solving and code synthesis. In this paper, we offer a vanishing variance perspective on this issue. To the best of our knowledge, we are the first to demonstrate that even for today’s frontier models, a longer sequence length results in a decrease in variance in the output of the multi-head attention modules. On the argmax retrieval and dictionary lookup tasks, our experiments show that applying layer normalization after the attention outputs leads to significantly better length generalization. Our analyses attribute this improvement to a reduction-though not a complete elimination-of the distribution shift caused by vanishing variance.
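该文的核心做法是在注意力输出之后施加层归一化,以抵消长序列下注意力输出方差的衰减。最小化示意如下(模块组织方式为假设,仅表达"attention 输出后接 LayerNorm"这一要点):

```python
import torch.nn as nn

class PostAttentionNorm(nn.Module):
    """在多头注意力的输出上施加 LayerNorm,缓解长序列下输出方差消失(示意)。"""
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x):  # x: (B, N, d_model)
        out, _ = self.attn(x, x, x)
        return self.norm(out)  # 层归一化作用于注意力输出
```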
[AI-1] Do Two AI Scientists Agree?
链接: https://arxiv.org/abs/2504.02822
作者: Xinghong Fu,Ziming Liu,Max Tegmark
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:When two AI models are trained on the same scientific task, do they learn the same theory or two different theories? Throughout the history of science, we have witnessed the rise and fall of theories driven by experimental validation or falsification: many theories may co-exist when experimental data is lacking, but the space of surviving theories becomes more constrained with more experimental data becoming available. We show the same story is true for AI scientists. With increasingly more systems provided in training data, AI scientists tend to converge in the theories they learn, although sometimes they form distinct groups corresponding to different theories. To mechanistically interpret what theories AI scientists learn and quantify their agreement, we propose MASS, Hamiltonian-Lagrangian neural networks as AI Scientists, trained on standard problems in physics, aggregating training results across many seeds simulating the different configurations of AI scientists. Our findings suggest that AI scientists switch from learning a Hamiltonian theory in simple setups to a Lagrangian formulation when more complex systems are introduced. We also observe strong seed dependence of the training dynamics and final learned weights, controlling the rise and fall of relevant theories. We finally demonstrate that not only can our neural networks aid interpretability, they can also be applied to higher dimensional problems.
[AI-2] Unified World Models: Coupling Video and Action Diffusion for Pretraining on Large Robotic Datasets
链接: https://arxiv.org/abs/2504.02792
作者: Chuning Zhu,Raymond Yu,Siyuan Feng,Benjamin Burchfiel,Paarth Shah,Abhishek Gupta
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Imitation learning has emerged as a promising approach towards building generalist robots. However, scaling imitation learning for large robot foundation models remains challenging due to its reliance on high-quality expert demonstrations. Meanwhile, large amounts of video data depicting a wide range of environments and diverse behaviors are readily available. This data provides a rich source of information about real-world dynamics and agent-environment interactions. Leveraging this data directly for imitation learning, however, has proven difficult due to the lack of action annotation required for most contemporary methods. In this work, we present Unified World Models (UWM), a framework that allows for leveraging both video and action data for policy learning. Specifically, a UWM integrates an action diffusion process and a video diffusion process within a unified transformer architecture, where independent diffusion timesteps govern each modality. We show that by simply controlling each diffusion timestep, UWM can flexibly represent a policy, a forward dynamics, an inverse dynamics, and a video generator. Through simulated and real-world experiments, we show that: (1) UWM enables effective pretraining on large-scale multitask robot datasets with both dynamics and action predictions, resulting in more generalizable and robust policies than imitation learning, (2) UWM naturally facilitates learning from action-free video data through independent control of modality-specific diffusion timesteps, further improving the performance of finetuned policies. Our results suggest that UWM offers a promising step toward harnessing large, heterogeneous datasets for scalable robot learning, and provides a simple unification between the often disparate paradigms of imitation learning and world modeling. Videos and code are available at this https URL.
[AI-3] owards Green AI-Native Networks: Evaluation of Neural Circuit Policy for Estimating Energy Consumption of Base Stations
链接: https://arxiv.org/abs/2504.02781
作者: Selim Ickin,Shruti Bothe,Aman Raparia,Nitin Khanna,Erik Sanders
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE); Signal Processing (eess.SP)
*备注: 15 pages, 9 figures
点击查看摘要
Abstract:Optimization of radio hardware and AI-based network management software yield significant energy savings in radio access networks. The execution of underlying Machine Learning (ML) models, which enable energy savings through recommended actions, may require additional compute and energy, highlighting the opportunity to explore and adopt accurate and energy-efficient ML technologies. This work evaluates the novel use of sparsely structured Neural Circuit Policies (NCPs) in a use case to estimate the energy consumption of base stations. Sparsity in ML models yields reduced memory, computation and energy demand, hence facilitating a low-cost and scalable solution. We also evaluate the generalization capability of NCPs in comparison to traditional and widely used ML models such as Long Short Term Memory (LSTM), via quantifying their sensitivity to varying model hyper-parameters (HPs). NCPs demonstrated a clear reduction in computational overhead and energy consumption. Moreover, results indicated that the NCPs are robust to varying HPs such as number of epochs and neurons in each layer, making them a suitable option to ease model management and to reduce energy consumption in Machine Learning Operations (MLOps) in telecommunications.
[AI-4] From Consumption to Collaboration: Measuring Interaction Patterns to Augment Human Cognition in Open-Ended Tasks
链接: https://arxiv.org/abs/2504.02780
作者: Joshua Holstein,Moritz Diener,Philipp Spitzer
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
*备注: Accepted at Tools for Thought Workshop (CHI’25)
点击查看摘要
Abstract:The rise of Generative AI, and Large Language Models (LLMs) in particular, is fundamentally changing cognitive processes in knowledge work, raising critical questions about their impact on human reasoning and problem-solving capabilities. As these AI systems become increasingly integrated into workflows, they offer unprecedented opportunities for augmenting human thinking while simultaneously risking cognitive erosion through passive consumption of generated answers. This tension is particularly pronounced in open-ended tasks, where effective solutions require deep contextualization and integration of domain knowledge. Unlike structured tasks with established metrics, measuring the quality of human-LLM interaction in such open-ended tasks poses significant challenges due to the absence of ground truth and the iterative nature of solution development. To address this, we present a framework that analyzes interaction patterns along two dimensions: cognitive activity mode (exploration vs. exploitation) and cognitive engagement mode (constructive vs. detrimental). This framework provides systematic measurements to evaluate when LLMs are effective tools for thought rather than substitutes for human cognition, advancing theoretical understanding and practical guidance for developing AI systems that protect and augment human cognitive capabilities.
[AI-5] How Deep Do Large Language Models Internalize Scientific Literature and Citation Practices?
链接: https://arxiv.org/abs/2504.02767
作者: Andres Algaba,Vincent Holst,Floriano Tori,Melika Mobini,Brecht Verbeken,Sylvia Wenmackers,Vincent Ginis
类目: Digital Libraries (cs.DL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Social and Information Networks (cs.SI)
*备注: 32 pages, 17 figures
点击查看摘要
Abstract:The spread of scientific knowledge depends on how researchers discover and cite previous work. The adoption of large language models (LLMs) in the scientific research process introduces a new layer to these citation practices. However, it remains unclear to what extent LLMs align with human citation practices, how they perform across domains, and may influence citation dynamics. Here, we show that LLMs systematically reinforce the Matthew effect in citations by consistently favoring highly cited papers when generating references. This pattern persists across scientific domains despite significant field-specific variations in existence rates, which refer to the proportion of generated references that match existing records in external bibliometric databases. Analyzing 274,951 references generated by GPT-4o for 10,000 papers, we find that LLM recommendations diverge from traditional citation patterns by preferring more recent references with shorter titles and fewer authors. Emphasizing their content-level relevance, the generated references are semantically aligned with the content of each paper at levels comparable to the ground truth references and display similar network effects while reducing author self-citations. These findings illustrate how LLMs may reshape citation practices and influence the trajectory of scientific discovery by reflecting and amplifying established trends. As LLMs become more integrated into the scientific research process, it is important to understand their role in shaping how scientific communities discover and build upon prior work.
[AI-6] RBR4DNN: Requirements-based Testing of Neural Networks
链接: https://arxiv.org/abs/2504.02737
作者: Nusrat Jahan Mozumder,Felipe Toledo,Swaroopa Dola,Matthew B. Dwyer
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Deep neural network (DNN) testing is crucial for the reliability and safety of critical systems, where failures can have severe consequences. Although various techniques have been developed to create robustness test suites, requirements-based testing for DNNs remains largely unexplored – yet such tests are recognized as an essential component of software validation of critical systems. In this work, we propose a requirements-based test suite generation method that uses structured natural language requirements formulated in a semantic feature space to create test suites by prompting text-conditional latent diffusion models with the requirement precondition and then using the associated postcondition to define a test oracle to judge outputs of the DNN under test. We investigate the approach using fine-tuned variants of pre-trained generative models. Our experiments on the MNIST, CelebA-HQ, ImageNet, and autonomous car driving datasets demonstrate that the generated test suites are realistic, diverse, consistent with preconditions, and capable of revealing faults.
[AI-7] Autonomous Human-Robot Interaction via Operator Imitation
链接: https://arxiv.org/abs/2504.02724
作者: Sammy Christen,David Müller,Agon Serifi,Ruben Grandia,Georg Wiedebach,Michael A. Hopkins,Espen Knoop,Moritz Bächer
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Teleoperated robotic characters can perform expressive interactions with humans, relying on the operators’ experience and social intuition. In this work, we propose to create autonomous interactive robots, by training a model to imitate operator data. Our model is trained on a dataset of human-robot interactions, where an expert operator is asked to vary the interactions and mood of the robot, while the operator commands as well as the pose of the human and robot are recorded. Our approach learns to predict continuous operator commands through a diffusion process and discrete commands through a classifier, all unified within a single transformer architecture. We evaluate the resulting model in simulation and with a user study on the real system. We show that our method enables simple autonomous human-robot interactions that are comparable to the expert-operator baseline, and that users can recognize the different robot moods as generated by our model. Finally, we demonstrate a zero-shot transfer of our model onto a different robotic platform with the same operator interface.
[AI-8] Responsible Development of Offensive AI
链接: https://arxiv.org/abs/2504.02701
作者: Ryan Marinelli
类目: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
*备注:
点击查看摘要
Abstract:As AI advances, broader consensus is needed to determine research priorities. This endeavor discusses offensive AI and provides guidance by leveraging Sustainable Development Goals (SDGs) and interpretability techniques. The objective is to more effectively establish priorities that balance societal benefits against risks. The two forms of offensive AI evaluated in this study are vulnerability detection agents, which solve Capture-The-Flag challenges, and AI-powered malware.
[AI-9] SCMPPI: Supervised Contrastive Multimodal Framework for Predicting Protein-Protein Interactions
链接: https://arxiv.org/abs/2504.02698
作者: Shengrui XU,Tianchi Lu,Zikun Wang,Jixiu Zhai,Jingwan Wang
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Quantitative Methods (q-bio.QM)
*备注: 19 pages, 11 figures, conference
点击查看摘要
Abstract:Protein-Protein Interaction (PPI) prediction is a key task in uncovering cellular functional networks and disease mechanisms. However, traditional experimental methods are time-consuming and costly, and existing computational models face challenges in cross-modal feature fusion, robustness, and false-negative suppression. In this paper, we propose a novel supervised contrastive multimodal framework, SCMPPI, for PPI prediction. By integrating protein sequence features (AAC, DPC, CKSAAP-ESMC) with PPI network topology information (Node2Vec graph embedding), and combining an improved supervised contrastive learning strategy, SCMPPI significantly enhances PPI prediction performance. For the PPI task, SCMPPI introduces a negative sample filtering mechanism and modifies the contrastive loss function, effectively optimizing multimodal features. Experiments on eight benchmark datasets, including yeast, human, and this http URL, show that SCMPPI outperforms existing state-of-the-art methods (such as DF-PPI and TAGPPI) in key metrics such as accuracy (98.01%) and AUC (99.62%), and demonstrates strong generalization in cross-species prediction (AUC 99% on multi-species datasets). Furthermore, SCMPPI has been successfully applied to CD9 networks, the Wnt pathway, and cancer-specific networks, providing a reliable tool for disease target discovery. This framework also offers a new paradigm for multimodal biological information fusion and contrastive learning in collaborative optimization for various combined predictions.
[AI-10] STOOD-X methodology: using statistical nonparametric test for OOD Detection Large-Scale datasets enhanced with explainability
链接: https://arxiv.org/abs/2504.02685
作者: Iván Sevillano-García,Julián Luengo,Francisco Herrera
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (stat.ML)
*备注: 18 pages, 7 Figures
点击查看摘要
Abstract:Out-of-Distribution (OOD) detection is a critical task in machine learning, particularly in safety-sensitive applications where model failures can have serious consequences. However, current OOD detection methods often suffer from restrictive distributional assumptions, limited scalability, and a lack of interpretability. To address these challenges, we propose STOOD-X, a two-stage methodology that combines a Statistical nonparametric Test for OOD Detection with eXplainability enhancements. In the first stage, STOOD-X uses feature-space distances and a Wilcoxon-Mann-Whitney test to identify OOD samples without assuming a specific feature distribution. In the second stage, it generates user-friendly, concept-based visual explanations that reveal the features driving each decision, aligning with the BLUE XAI paradigm. Through extensive experiments on benchmark datasets and multiple architectures, STOOD-X achieves competitive performance against state-of-the-art post hoc OOD detectors, particularly in high-dimensional and complex settings. In addition, its explainability framework enables human oversight, bias detection, and model debugging, fostering trust and collaboration between humans and AI systems. The STOOD-X methodology therefore offers a robust, explainable, and scalable solution for real-world OOD detection tasks.
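摘要中第一阶段的核心是对特征空间距离做 Wilcoxon-Mann-Whitney 非参数检验。下面给出一个最小化示意(参照分布的构造方式为本文假设,非论文细节):

```python
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(0)
train_feats = rng.normal(0, 1, size=(500, 128))   # 训练集特征(假设数据)
id_sample   = rng.normal(0, 1, size=128)
ood_sample  = rng.normal(3, 1, size=128)

def dist_to_train(x, feats, k=50):
    d = np.linalg.norm(feats - x, axis=1)
    return np.sort(d)[:k]                          # 最近的 k 个距离

# 参照分布:训练样本彼此之间的近邻距离(此处简化为取一个训练点)
ref = dist_to_train(train_feats[0], train_feats[1:])
for name, x in [("ID", id_sample), ("OOD", ood_sample)]:
    d = dist_to_train(x, train_feats)
    # 备择假设:待测样本的距离整体偏大 => 倾向 OOD
    stat, p = mannwhitneyu(d, ref, alternative="greater")
    print(name, "p-value:", p)                     # p 值小 => 判为 OOD
```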
[AI-11] SymDQN: Symbolic Knowledge and Reasoning in Neural Network-based Reinforcement Learning
链接: https://arxiv.org/abs/2504.02654
作者: Ivo Amador,Nina Gierasimczuk
类目: Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO); Neural and Evolutionary Computing (cs.NE)
*备注: 8 pages, 8 figures
点击查看摘要
Abstract:We propose a learning architecture that allows symbolic control and guidance in reinforcement learning with deep neural networks. We introduce SymDQN, a novel modular approach that augments the existing Dueling Deep Q-Networks (DuelDQN) architecture with modules based on the neuro-symbolic framework of Logic Tensor Networks (LTNs). The modules guide action policy learning and allow reinforcement learning agents to display behaviour consistent with reasoning about the environment. Our experiment is an ablation study performed on the modules. It is conducted in a reinforcement learning environment of a 5x5 grid navigated by an agent that encounters various shapes, each associated with a given reward. The underlying DuelDQN attempts to learn the optimal behaviour of the agent in this environment, while the modules facilitate shape recognition and reward prediction. We show that our architecture significantly improves learning, both in terms of performance and the precision of the agent. The modularity of SymDQN allows reflecting on the intricacies and complexities of combining neural and symbolic approaches in reinforcement learning.
[AI-12] Prompt Optimization with Logged Bandit Data
链接: https://arxiv.org/abs/2504.02646
作者: Haruka Kiyohara,Daniel Yiming Cao,Yuta Saito,Thorsten Joachims
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (stat.ML)
*备注: Preprint
点击查看摘要
Abstract:We study how to use naturally available user feedback, such as clicks, to optimize large language model (LLM) pipelines for generating personalized sentences using prompts. Naive approaches, which estimate the policy gradient in the prompt space, suffer either from variance caused by the large action space of prompts or bias caused by inaccurate reward predictions. To circumvent these challenges, we propose a novel kernel-based off-policy gradient method, which estimates the policy gradient by leveraging similarity among generated sentences, substantially reducing variance while suppressing the bias. Empirical results on our newly established suite of benchmarks demonstrate the effectiveness of the proposed approach in generating personalized descriptions for movie recommendations, particularly when the number of candidate prompts is large.
[AI-13] Multi-Mission Tool Bench: Assessing the Robustness of LLM-based Agents through Related and Dynamic Missions
链接: https://arxiv.org/abs/2504.02623
作者: PeiJie Yu,Yifan Yang,Jinjian Li,Zelong Zhang,Haorui Wang,Xiao Feng,Feng Zhang
类目: Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Large language models (LLMs) demonstrate strong potential as agents for tool invocation due to their advanced comprehension and planning capabilities. Users increasingly rely on LLM-based agents to solve complex missions through iterative interactions. However, existing benchmarks predominantly access agents in single-mission scenarios, failing to capture real-world complexity. To bridge this gap, we propose the Multi-Mission Tool Bench. In the benchmark, each test case comprises multiple interrelated missions. This design requires agents to dynamically adapt to evolving demands. Moreover, the proposed benchmark explores all possible mission-switching patterns within a fixed mission number. Specifically, we propose a multi-agent data generation framework to construct the benchmark. We also propose a novel method to evaluate the accuracy and efficiency of agent decisions with dynamic decision trees. Experiments on diverse open-source and closed-source LLMs reveal critical factors influencing agent robustness and provide actionable insights to the tool invocation society.
[AI-14] Learning Geometrically-Informed Lyapunov Functions with Deep Diffeomorphic RBF Networks
链接: https://arxiv.org/abs/2504.02607
作者: Samuel Tesfazgi,Leonhard Sprandl,Sandra Hirche
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
*备注:
点击查看摘要
Abstract:The practical deployment of learning-based autonomous systems would greatly benefit from tools that flexibly obtain safety guarantees in the form of certificate functions from data. While the geometrical properties of such certificate functions are well understood, synthesizing them using machine learning techniques still remains a challenge. To mitigate this issue, we propose a diffeomorphic function learning framework where prior structural knowledge of the desired output is encoded in the geometry of a simple surrogate function, which is subsequently augmented through an expressive, topology-preserving state-space transformation. Thereby, we achieve an indirect function approximation framework that is guaranteed to remain in the desired hypothesis space. To this end, we introduce a novel approach to construct diffeomorphic maps based on RBF networks, which facilitate precise, local transformations around data. Finally, we demonstrate our approach by learning diffeomorphic Lyapunov functions from real-world data and apply our method to different attractor systems.
[AI-15] Improving Counterfactual Truthfulness for Molecular Property Prediction through Uncertainty Quantification
链接: https://arxiv.org/abs/2504.02606
作者: Jonas Teufel,Annika Leinweber,Pascal Friederich
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 24 pages, 5 figures, 4 tabels, accepted at the 3rd xAI World Conference
点击查看摘要
Abstract:Explainable AI (xAI) interventions aim to improve interpretability for complex black-box models, not only to improve user trust but also as a means to extract scientific insights from high-performing predictive systems. In molecular property prediction, counterfactual explanations offer a way to understand predictive behavior by highlighting which minimal perturbations in the input molecular structure cause the greatest deviation in the predicted property. However, such explanations only allow for meaningful scientific insights if they reflect the distribution of the true underlying property – a feature we define as counterfactual truthfulness. To increase this truthfulness, we propose the integration of uncertainty estimation techniques to filter counterfactual candidates with high predicted uncertainty. Through computational experiments with synthetic and real-world datasets, we demonstrate that traditional uncertainty estimation methods, such as ensembles and mean-variance estimation, can already substantially reduce the average prediction error and increase counterfactual truthfulness, especially for out-of-distribution settings. Our results highlight the importance and potential impact of incorporating uncertainty estimation into explainability methods, especially considering the relatively high effectiveness of low-effort interventions like model ensembles.
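摘要中“用集成模型的不确定性过滤反事实候选”的思路可以用几行代码示意(阈值取法为假设,仅演示机制):

```python
import numpy as np

rng = np.random.default_rng(1)
# 假设:ensemble_preds[i, j] 为第 i 个模型对第 j 个反事实候选的性质预测
ensemble_preds = rng.normal(0, 1, size=(5, 100)) + rng.normal(0, 0.3, size=(1, 100))
mean = ensemble_preds.mean(axis=0)
std  = ensemble_preds.std(axis=0)          # 模型间分歧 ≈ 认知不确定性

tau = np.quantile(std, 0.5)                # 示意阈值:保留不确定性较低的一半
keep = std <= tau
print(f"保留 {keep.sum()} / {len(keep)} 个候选反事实")
```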
[AI-16] Knowledge Graph Completion with Mixed Geometry Tensor Factorization AISTATS2025
链接: https://arxiv.org/abs/2504.02589
作者: Viacheslav Yusupov,Maxim Rakhuba,Evgeny Frolov
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (stat.ML)
*备注: Accepted to AISTATS 2025
点击查看摘要
Abstract:In this paper, we propose a new geometric approach for knowledge graph completion via low rank tensor approximation. We augment a pretrained and well-established Euclidean model based on a Tucker tensor decomposition with a novel hyperbolic interaction term. This correction enables more nuanced capturing of distributional properties in data better aligned with real-world knowledge graphs. By combining two geometries together, our approach improves expressivity of the resulting model achieving new state-of-the-art link prediction accuracy with a significantly lower number of parameters compared to the previous Euclidean and hyperbolic models.
[AI-17] Deep learning for music generation. Four approaches and their comparative evaluation
链接: https://arxiv.org/abs/2504.02586
作者: Razvan Paroiu,Stefan Trausan-Matu
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
*备注:
点击查看摘要
Abstract:This paper introduces four different artificial intelligence algorithms for music generation and aims to compare these methods not only based on the aesthetic quality of the generated music but also on their suitability for specific applications. The first set of melodies is produced by a slightly modified visual transformer neural network that is used as a language model. The second set of melodies is generated by combining chat sonification with a classic transformer neural network (the same method of music generation is presented in a previous research), the third set of melodies is generated by combining the Schillinger rhythm theory together with a classic transformer neural network, and the fourth set of melodies is generated using GPT3 transformer provided by OpenAI. A comparative analysis is performed on the melodies generated by these approaches and the results indicate that significant differences can be observed between them and regarding the aesthetic value of them, GPT3 produced the most pleasing melodies, and the newly introduced Schillinger method proved to generate better sounding music than previous sonification methods.
[AI-18] GPG: A Simple and Strong Reinforcement Learning Baseline for Model Reasoning
链接: https://arxiv.org/abs/2504.02546
作者: Xiangxiang Chu,Hailang Huang,Xiao Zhang,Fei Wei,Yong Wang
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Reinforcement Learning (RL) can directly enhance the reasoning capabilities of large language models without extensive reliance on Supervised Fine-Tuning (SFT). In this work, we revisit the traditional Policy Gradient (PG) mechanism and propose a minimalist RL approach termed Group Policy Gradient (GPG). Unlike conventional methods, GPG directly optimize the original RL objective, thus obviating the need for surrogate loss functions. As illustrated in our paper, by eliminating both the critic and reference models, and avoiding KL divergence constraints, our approach significantly simplifies the training process when compared to Group Relative Policy Optimization (GRPO). Our approach achieves superior performance without relying on auxiliary techniques or adjustments. Extensive experiments demonstrate that our method not only reduces computational costs but also consistently outperforms GRPO across various unimodal and multimodal tasks. Our code is available at this https URL.
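按摘要的描述(无评论家、无参考模型、无 KL 约束,直接优化原始 RL 目标),GPG 的核心损失大致可以示意如下;组内标准化等细节为笔者假设,非官方实现:

```python
import torch

def gpg_loss(logprobs, rewards):
    """logprobs: (G,) 组内 G 条回答的对数概率;rewards: (G,) 对应奖励。
    无评论家、无参考模型、无 KL 约束:优势 = 组内去均值的奖励。"""
    adv = rewards - rewards.mean()
    adv = adv / (rewards.std() + 1e-8)      # 组内标准化(假设的细节)
    return -(logprobs * adv.detach()).mean()

logp = torch.randn(8, requires_grad=True)
r = torch.tensor([1., 0., 0., 1., 1., 0., 0., 0.])
loss = gpg_loss(logp, r)
loss.backward()
```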
[AI-19] Fourier Sliced-Wasserstein Embedding for Multisets and Measures ICLR2025
链接: https://arxiv.org/abs/2504.02544
作者: Tal Amir,Nadav Dym
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: ICLR 2025 camera-ready. arXiv admin note: substantial text overlap with arXiv:2405.16519
点击查看摘要
Abstract:We present the Fourier Sliced-Wasserstein (FSW) embedding - a novel method to embed multisets and measures over \mathbb{R}^d into Euclidean space. Our proposed embedding approximately preserves the sliced Wasserstein distance on distributions, thereby yielding geometrically meaningful representations that better capture the structure of the input. Moreover, it is injective on measures and bi-Lipschitz on multisets - a significant advantage over prevalent methods based on sum- or max-pooling, which are provably not bi-Lipschitz, and, in many cases, not even injective. The required output dimension for these guarantees is near-optimal: roughly 2Nd, where N is the maximal input multiset size. Furthermore, we prove that it is impossible to embed distributions over \mathbb{R}^d into Euclidean space in a bi-Lipschitz manner. Thus, the metric properties of our embedding are, in a sense, the best possible. Through numerical experiments, we demonstrate that our method yields superior multiset representations that improve performance in practical learning tasks. Specifically, we show that (a) a simple combination of the FSW embedding with an MLP achieves state-of-the-art performance in learning the (non-sliced) Wasserstein distance; and (b) replacing max-pooling with the FSW embedding makes PointNet significantly more robust to parameter reduction, with only minor performance degradation even after a 40-fold reduction.
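作为直观参照,下面给出一个简化的切片 Wasserstein 嵌入草图:对等大小的多重集,把随机方向上的投影排序后拼接,其欧氏距离即为 SW2 距离的蒙特卡洛估计。注意这并非论文提出的 Fourier 构造(FSW 还处理了不等大小的多重集与一般测度),仅用于说明“嵌入近似保距”的含义:

```python
import numpy as np

rng = np.random.default_rng(0)

def sw_embed(X, dirs):
    """X: (n, d) 的多重集;dirs: (m, d) 单位方向。
    对每个方向投影并排序;等大小多重集之间,该嵌入的欧氏距离
    (经 1/sqrt(nm) 缩放)即为切片 Wasserstein-2 距离的蒙特卡洛估计。"""
    proj = X @ dirs.T                 # (n, m)
    return np.sort(proj, axis=0).ravel()

d, m, n = 3, 64, 50
dirs = rng.normal(size=(m, d))
dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
A, B = rng.normal(size=(n, d)), rng.normal(loc=0.5, size=(n, d))
dist = np.linalg.norm(sw_embed(A, dirs) - sw_embed(B, dirs)) / np.sqrt(n * m)
print("SW2 估计:", dist)
```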
[AI-20] Improving User Experience with FAICO: Towards a Framework for AI Communication in Human-AI Co-Creativity
链接: https://arxiv.org/abs/2504.02526
作者: Jeba Rezwana,Corey Ford
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:How AI communicates with humans is crucial for effective human-AI co-creation. However, many existing co-creative AI tools cannot communicate effectively, limiting their potential as collaborators. This paper introduces our initial design of a Framework for designing AI Communication (FAICO) for co-creative AI based on a systematic review of 107 full-length papers. FAICO presents key aspects of AI communication and their impacts on user experience to guide the design of effective AI communication. We then show actionable ways to translate our framework into two practical tools: design cards for designers and a configuration tool for users. The design cards enable designers to consider AI communication strategies that cater to a diverse range of users in co-creative contexts, while the configuration tool empowers users to customize AI communication based on their needs and creative workflows. This paper contributes new insights within the literature on human-AI co-creativity and Human-Computer Interaction, focusing on designing AI communication to enhance user experience.
[AI-21] A Memory-Augmented LLM-Driven Method for Autonomous Merging of 3D Printing Work Orders
链接: https://arxiv.org/abs/2504.02509
作者: Yuhao Liu,Maolin Yang,Pingyu Jiang
类目: Artificial Intelligence (cs.AI); Robotics (cs.RO)
*备注: 6 pages, 5 figures
点击查看摘要
Abstract:With the rapid development of 3D printing, the demand for personalized and customized production on the manufacturing line is steadily increasing. Efficient merging of printing workpieces can significantly enhance the processing efficiency of the production line. Addressing the challenge, a Large Language Model (LLM)-driven method is established in this paper for the autonomous merging of 3D printing work orders, integrated with a memory-augmented learning strategy. In industrial scenarios, both device and order features are modeled into LLM-readable natural language prompt templates, and develop an order-device matching tool along with a merging interference checking module. By incorporating a self-memory learning strategy, an intelligent agent for autonomous order merging is constructed, resulting in improved accuracy and precision in order allocation. The proposed method effectively leverages the strengths of LLMs in industrial applications while reducing hallucination.
[AI-22] Industrial Internet Robot Collaboration System and Edge Computing Optimization
链接: https://arxiv.org/abs/2504.02492
作者: Qian Zuo,Dajun Tao,Tian Qi,Jieyi Xie,Zijie Zhou,Zhen Tian,Yu Mingyu
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:In a complex environment, for a mobile robot to avoid all obstacles safely and without collision, high requirements are posed on its intelligence level. Given that information such as the position and geometric characteristics of obstacles is random, the control parameters of the robot, such as velocity and angular velocity, are also prone to random deviations. To address this issue in the framework of the Industrial Internet Robot Collaboration System, this paper proposes a global path control scheme for mobile robots based on deep learning. First of all, the dynamic equation of the mobile robot is established. According to the linear velocity and angular velocity of the mobile robot, its motion behaviors are divided into obstacle-avoidance behavior, target-turning behavior, and target-approaching behavior. Subsequently, the neural network method in deep learning is used to build a global path planning model for the robot. On this basis, a fuzzy controller is designed with the help of a fuzzy control algorithm to correct the deviations that occur during path planning, thereby achieving optimized control of the robot’s global path. In addition, considering edge computing optimization, the proposed model can process local data at the edge device, reducing the communication burden between the robot and the central server, and improving the real-time performance of path planning. The experimental results show that for the mobile robot controlled by the research method in this paper, the deviation distance of the path angle is within 5 cm, the deviation convergence can be completed within 10 ms, and the planned path is shorter. This indicates that the proposed scheme can effectively improve the global path planning ability of mobile robots in the industrial Internet environment and promote the collaborative operation of robots through edge computing optimization.
[AI-23] The Self-Learning Agent with a Progressive Neural Network Integrated Transformer
链接: https://arxiv.org/abs/2504.02489
作者: Ajay Sivakumar,Shalini,Vasantha Raj,Sebastian Sylvester
类目: Artificial Intelligence (cs.AI)
*备注: 7 pages, 2 figures, focuses on continual learning with PNN and LLaMA. Experiments demonstrate scalability and lifelong learning capabilities
点击查看摘要
Abstract:This paper introduces a self-learning agent that integrates LLaMA 3.2 with a Progressive Neural Network (PNN) for continual learning in conversational AI and code generation. The framework dynamically collects data, fine-tunes tasks with minimal samples, and leverages Meta-Learning for rapid adaptation. LoRA optimizes fine-tuning, while Elastic Weight Consolidation (EWC) enhances knowledge retention. Experimental results demonstrate improved adaptability and memory stability, positioning this approach as a scalable step toward Artificial General Intelligence (AGI).
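摘要中用于知识保持的 EWC 是标准技术:L_total = L_task + (λ/2)·Σ_i F_i(θ_i − θ*_i)²,其中 F 为对角 Fisher 信息、θ* 为上一任务的参数快照。下面是一个自包含的 PyTorch 示意(λ 与 Fisher 的取值仅为演示):

```python
import torch

def ewc_penalty(model, fisher, old_params, lam=1e2):
    """EWC:(λ/2)·Σ_i F_i (θ_i − θ*_i)²
    fisher / old_params:上一任务训练后记录的对角 Fisher 与参数快照。"""
    loss = 0.0
    for name, p in model.named_parameters():
        loss = loss + (fisher[name] * (p - old_params[name]) ** 2).sum()
    return 0.5 * lam * loss

model = torch.nn.Linear(4, 2)
old = {n: p.detach().clone() for n, p in model.named_parameters()}
fisher = {n: torch.ones_like(p) for n, p in model.named_parameters()}  # 示意:单位 Fisher
task_loss = model(torch.randn(8, 4)).pow(2).mean()
total = task_loss + ewc_penalty(model, fisher, old)
total.backward()
```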
[AI-24] We Need Improved Data Curation and Attribution in AI for Scientific Discovery
链接: https://arxiv.org/abs/2504.02486
作者: Mara Graziani,Antonio Foncubierta,Dimitrios Christofidellis,Irina Espejo-Morales,Malina Molnar,Marvin Alberts,Matteo Manica,Jannis Born
类目: Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:As the interplay between human-generated and synthetic data evolves, new challenges arise in scientific discovery concerning the integrity of the data and the stability of the models. In this work, we examine the role of synthetic data as opposed to that of real experimental data for scientific research. Our analyses indicate that nearly three-quarters of experimental datasets available on open-access platforms have relatively low adoption rates, opening new opportunities to enhance their discoverability and usability by automated methods. Additionally, we observe an increasing difficulty in distinguishing synthetic from real experimental data. We propose supplementing ongoing efforts in automating synthetic data detection by increasing the focus on watermarking real experimental data, thereby strengthening data traceability and integrity. Our estimates suggest that watermarking even less than half of the real world data generated annually could help sustain model robustness, while promoting a balanced integration of synthetic and human-generated content.
[AI-25] Hierarchical Policy-Gradient Reinforcement Learning for Multi-Agent Shepherding Control of Non-Cohesive Targets
链接: https://arxiv.org/abs/2504.02479
作者: Stefano Covone,Italo Napolitano,Francesco De Lellis,Mario di Bernardo
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA); Systems and Control (eess.SY); Machine Learning (stat.ML)
*备注:
点击查看摘要
Abstract:We propose a decentralized reinforcement learning solution for multi-agent shepherding of non-cohesive targets using policy-gradient methods. Our architecture integrates target-selection with target-driving through Proximal Policy Optimization, overcoming discrete-action constraints of previous Deep Q-Network approaches and enabling smoother agent trajectories. This model-free framework effectively solves the shepherding problem without prior dynamics knowledge. Experiments demonstrate our method’s effectiveness and scalability with increased target numbers and limited sensing capabilities.
[AI-26] BOOST: Bootstrapping Strategy-Driven Reasoning Programs for Program-Guided Fact-Checking
链接: https://arxiv.org/abs/2504.02467
作者: Qisheng Hu,Quanyu Long,Wenya Wang
类目: Artificial Intelligence (cs.AI)
*备注: 18 pages, 5 figures
点击查看摘要
Abstract:Program-guided reasoning has shown promise in complex claim fact-checking by decomposing claims into function calls and executing reasoning programs. However, prior work primarily relies on few-shot in-context learning (ICL) with ad-hoc demonstrations, which limit program diversity and require manual design with substantial domain knowledge. Fundamentally, the underlying principles of effective reasoning program generation still remain underexplored, making it challenging to construct effective demonstrations. To address this, we propose BOOST, a bootstrapping-based framework for few-shot reasoning program generation. BOOST explicitly integrates claim decomposition and information-gathering strategies as structural guidance for program generation, iteratively refining bootstrapped demonstrations in a strategy-driven and data-centric manner without human intervention. This enables a seamless transition from zero-shot to few-shot strategic program-guided learning, enhancing interpretability and effectiveness. Experimental results show that BOOST outperforms prior few-shot baselines in both zero-shot and few-shot settings for complex claim verification.
[AI-27] Evaluating AI Recruitment Sourcing Tools by Human Preference
链接: https://arxiv.org/abs/2504.02463
作者: Vladimir Slaykovskiy,Maksim Zvegintsev,Yury Sakhonchyk,Hrachik Ajamian
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:This study introduces a benchmarking methodology designed to evaluate the performance of AI-driven recruitment sourcing tools. We created and utilized a dataset to perform a comparative analysis of search results generated by leading AI-based solutions, LinkedIn Recruiter, and our proprietary system, this http URL. Human experts assessed the relevance of the returned candidates, and an Elo rating system was applied to quantitatively measure each tool’s comparative performance. Our findings indicate that AI-driven recruitment sourcing tools consistently outperform LinkedIn Recruiter in candidate relevance, with this http URL achieving the highest performance scores. Furthermore, we found a strong alignment between AI-based evaluations and human judgments, highlighting the potential for advanced AI technologies to substantially enhance talent acquisition effectiveness. Code and supporting data are publicly available at this https URL
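摘要中的 Elo 评分是标准算法:由人工评审给出两两对比结果后迭代更新分数。示意如下(K 值等参数为常用默认,非论文设定):

```python
def elo_update(r_a, r_b, score_a, k=32):
    """score_a: A 对 B 的胜=1 / 平=0.5 / 负=0(由人工评审对比候选人相关性给出)。"""
    exp_a = 1 / (1 + 10 ** ((r_b - r_a) / 400))
    r_a += k * (score_a - exp_a)
    r_b += k * ((1 - score_a) - (1 - exp_a))
    return r_a, r_b

ra, rb = 1500.0, 1500.0
for outcome in [1, 1, 0.5, 1, 0]:       # 假想的 5 次两两对比结果
    ra, rb = elo_update(ra, rb, outcome)
print(round(ra), round(rb))
```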
[AI-28] Am I Being Treated Fairly? A Conceptual Framework for Individuals to Ascertain Fairness
链接: https://arxiv.org/abs/2504.02461
作者: Juliett Suárez Ferreira,Marija Slavkovik,Jorge Casillas
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
*备注: 21 pages, 5 figures
点击查看摘要
Abstract:Current fairness metrics and mitigation techniques provide tools for practitioners to assess how non-discriminatory Automatic Decision Making (ADM) systems are. What if I, as an individual facing a decision taken by an ADM system, would like to know: Am I being treated fairly? We explore how to create the affordance for users to be able to ask this question of ADM. In this paper, we argue for the reification of fairness not only as a property of ADM, but also as an epistemic right of an individual to acquire information about the decisions that affect them and use that information to contest and seek effective redress against those decisions, in case they are proven to be discriminatory. We examine key concepts from existing research not only in algorithmic fairness but also in explainable artificial intelligence, accountability, and contestability. Integrating notions from these domains, we propose a conceptual framework to ascertain fairness by combining different tools that empower the end-users of ADM systems. Our framework shifts the focus from technical solutions aimed at practitioners to mechanisms that enable individuals to understand, challenge, and verify the fairness of decisions, and also serves as a blueprint for organizations and policymakers, bridging the gap between technical requirements and practical, user-centered accountability.
[AI-29] Retrieval-Augmented Purifier for Robust LLM -Empowered Recommendation
链接: https://arxiv.org/abs/2504.02458
作者: Liangbo Ning,Wenqi Fan,Qing Li
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Recently, Large Language Model (LLM)-empowered recommender systems have revolutionized personalized recommendation frameworks and attracted extensive attention. Despite the remarkable success, existing LLM-empowered RecSys have been demonstrated to be highly vulnerable to minor perturbations. To mitigate the negative impact of such vulnerabilities, one potential solution is to employ collaborative signals based on item-item co-occurrence to purify the malicious collaborative knowledge from the user’s historical interactions inserted by attackers. On the other hand, due to the capabilities to expand insufficient internal knowledge of LLMs, Retrieval-Augmented Generation (RAG) techniques provide unprecedented opportunities to enhance the robustness of LLM-empowered recommender systems by introducing external collaborative knowledge. Therefore, in this paper, we propose a novel framework (RETURN) by retrieving external collaborative signals to purify the poisoned user profiles and enhance the robustness of LLM-empowered RecSys in a plug-and-play manner. Specifically, retrieval-augmented perturbation positioning is proposed to identify potential perturbations within the users’ historical sequences by retrieving external knowledge from collaborative item graphs. After that, we further retrieve the collaborative knowledge to cleanse the perturbations by using either deletion or replacement strategies and introduce a robust ensemble recommendation strategy to generate final robust predictions. Extensive experiments on three real-world datasets demonstrate the effectiveness of the proposed RETURN.
[AI-30] CHARMS: Cognitive Hierarchical Agent with Reasoning and Motion Styles
链接: https://arxiv.org/abs/2504.02450
作者: Jingyi Wang,Duanfeng Chu,Zejian Deng,Liping Lu
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:To address the current challenges of low intelligence and simplistic vehicle behavior modeling in autonomous driving simulation scenarios, this paper proposes the Cognitive Hierarchical Agent with Reasoning and Motion Styles (CHARMS). The model can reason about the behavior of other vehicles like a human driver and respond with different decision-making styles, thereby improving the intelligence and diversity of the surrounding vehicles in the driving scenario. By introducing the Level-k behavioral game theory, the paper models the decision-making process of human drivers and employs deep reinforcement learning to train the models with diverse decision styles, simulating different reasoning approaches and behavioral characteristics. Building on the Poisson cognitive hierarchy theory, this paper also presents a novel driving scenario generation method. The method controls the proportion of vehicles with different driving styles in the scenario using Poisson and binomial distributions, thus generating controllable and diverse driving environments. Experimental results demonstrate that CHARMS not only exhibits superior decision-making capabilities as ego vehicles, but also generates more complex and diverse driving scenarios as surrounding vehicles. We will release code for CHARMS at this https URL.
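按摘要,周车的推理层级服从 Poisson 认知层级分布、驾驶风格占比由二项分布控制。下面是一个场景生成的采样示意(τ、p 等参数均为假设值):

```python
import numpy as np

rng = np.random.default_rng(0)
n_vehicles, tau = 20, 1.5        # τ:Poisson 认知层级的平均推理深度(假设值)

# Level-k 推理层级 ~ Poisson(τ),截断到 [0, 3]
levels = np.minimum(rng.poisson(tau, size=n_vehicles), 3)
# 驾驶风格:以二项分布控制激进型车辆的占比(p 为场景参数)
aggressive = rng.binomial(1, 0.3, size=n_vehicles).astype(bool)

for k in range(4):
    print(f"level-{k}: {np.sum(levels == k)} 辆,其中激进型 "
          f"{np.sum(aggressive[levels == k])} 辆")
```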
[AI-31] How Artificial Intelligence Leads to Knowledge Why: An Inquiry Inspired by Aristotle’s Posterior Analytics
链接: https://arxiv.org/abs/2504.02430
作者: Guus Eelink,Kilian Rückschloß,Felix Weitkämper
类目: Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO)
*备注:
点击查看摘要
Abstract:Bayesian networks and causal models provide frameworks for handling queries about external interventions and counterfactuals, enabling tasks that go beyond what probability distributions alone can address. While these formalisms are often informally described as capturing causal knowledge, there is a lack of a formal theory characterizing the type of knowledge required to predict the effects of external interventions. This work introduces the theoretical framework of causal systems to clarify Aristotle’s distinction between knowledge that and knowledge why within artificial intelligence. By interpreting existing artificial intelligence technologies as causal systems, it investigates the corresponding types of knowledge. Furthermore, it argues that predicting the effects of external interventions is feasible only with knowledge why, providing a more precise understanding of the knowledge necessary for such tasks.
[AI-32] Narrative Studio: Visual narrative exploration using LLMs and Monte Carlo Tree Search
链接: https://arxiv.org/abs/2504.02426
作者: Parsa Ghaffari,Chris Hokamp
类目: Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Interactive storytelling benefits from planning and exploring multiple ‘what if’ scenarios. Modern LLMs are useful tools for ideation and exploration, but current chat-based user interfaces restrict users to a single linear flow. To address this limitation, we propose Narrative Studio – a novel in-browser narrative exploration environment featuring a tree-like interface that allows branching exploration from user-defined points in a story. Each branch is extended via iterative LLM inference guided by system and user-defined prompts. Additionally, we employ Monte Carlo Tree Search (MCTS) to automatically expand promising narrative paths based on user-specified criteria, enabling more diverse and robust story development. We also allow users to enhance narrative coherence by grounding the generated text in an entity graph that represents the actors and environment of the story.
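摘要中的 MCTS 扩展可以用一个标准的 UCT 骨架示意:选择—扩展(LLM 续写)—评估(用户设定的标准)—回传。以下为极简草图,expand_fn / score_fn 为假设的占位函数:

```python
import math, random

class Node:
    def __init__(self, text, parent=None):
        self.text, self.parent = text, parent
        self.children, self.visits, self.value = [], 0, 0.0

def uct(node, c=1.4):
    return (node.value / (node.visits + 1e-9) +
            c * math.sqrt(math.log(node.parent.visits + 1) / (node.visits + 1e-9)))

def mcts_step(root, expand_fn, score_fn):
    node = root
    while node.children:                      # 1. 选择:沿 UCT 最大的分支下行
        node = max(node.children, key=uct)
    child = Node(expand_fn(node.text), parent=node)   # 2. 扩展:LLM 续写一个分支
    node.children.append(child)
    reward = score_fn(child.text)             # 3. 评估:按用户设定的标准打分
    while child:                              # 4. 回传
        child.visits += 1; child.value += reward; child = child.parent

root = Node("Once upon a time...")
mcts_step(root, expand_fn=lambda t: t + " [LLM 续写]", score_fn=lambda t: random.random())
```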
[AI-33] EvMic: Event-based Non-contact sound recovery from effective spatial-temporal modeling
链接: https://arxiv.org/abs/2504.02402
作者: Hao Yin,Shi Guo,Xu Jia,Xudong XU,Lu Zhang,Si Liu,Dong Wang,Huchuan Lu,Tianfan Xue
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
*备注: Our project page: this https URL
点击查看摘要
Abstract:When sound waves hit an object, they induce vibrations that produce high-frequency and subtle visual changes, which can be used for recovering the sound. Early studies always encounter trade-offs related to sampling rate, bandwidth, field of view, and the simplicity of the optical path. Recent advances in event camera hardware show good potential for its application in visual sound recovery, because of its superior ability in capturing high-frequency signals. However, existing event-based vibration recovery methods are still sub-optimal for sound recovery. In this work, we propose a novel pipeline for non-contact sound recovery, fully utilizing spatial-temporal information from the event stream. We first generate a large training set using a novel simulation pipeline. Then we designed a network that leverages the sparsity of events to capture spatial information and uses Mamba to model long-term temporal information. Lastly, we train a spatial aggregation block to aggregate information from different locations to further improve signal quality. To capture event signals caused by sound waves, we also designed an imaging system using a laser matrix to enhance the gradient and collected multiple data sequences for testing. Experimental results on synthetic and real-world data demonstrate the effectiveness of our method.
[AI-34] Temporal Gaussian Copula For Clinical Multivariate Time Series Data Imputation
链接: https://arxiv.org/abs/2504.02317
作者: Ye Su,Hezhe Qiao,Di Wu,Yuwen Chen,Lin Chen
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: Accepted in BIBM2024
点击查看摘要
Abstract:The imputation of the Multivariate time series (MTS) is particularly challenging since the MTS typically contains irregular patterns of missing values due to various factors such as instrument failures, interference from irrelevant data, and privacy regulations. Existing statistical methods and deep learning methods have shown promising results in time series imputation. In this paper, we propose a Temporal Gaussian Copula Model (TGC) for three-order MTS imputation. The key idea is to leverage the Gaussian Copula to explore the cross-variable and temporal relationships based on the latent Gaussian representation. Subsequently, we employ an Expectation-Maximization (EM) algorithm to improve robustness in managing data with varying missing rates. Comprehensive experiments were conducted on three real-world MTS datasets. The results demonstrate that our TGC substantially outperforms the state-of-the-art imputation methods. Additionally, the TGC model exhibits stronger robustness to the varying missing ratios in the test dataset. Our code is available at this https URL.
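高斯 Copula 插补的基本流程可示意如下:用经验 CDF 把各列边缘映到标准正态、估计潜在相关阵、再用条件高斯均值填补缺失。注意:这里用成对相关估计代替论文中的 EM 算法,是大幅简化的草图:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
X = rng.gamma(2, 1, size=(200, 3)); X[:, 2] += 0.8 * X[:, 0]
mask = rng.random(X.shape) < 0.2          # 20% 缺失
Xobs = np.where(mask, np.nan, X)

def to_gauss(col):
    """经验 CDF -> 标准正态分位(copula 的潜高斯表示)"""
    obs = col[~np.isnan(col)]
    ranks = np.searchsorted(np.sort(obs), col) + 0.5
    return norm.ppf(np.clip(ranks / (len(obs) + 1), 1e-3, 1 - 1e-3))

Z = np.column_stack([to_gauss(Xobs[:, j]) for j in range(Xobs.shape[1])])
Z[np.isnan(Xobs)] = np.nan
Sigma = np.ma.corrcoef(np.ma.masked_invalid(Z.T)).data   # 成对相关(EM 的简化替代)

for i in range(len(Z)):
    m = np.isnan(Xobs[i])
    if not m.any() or m.all():
        continue
    o = ~m
    # 条件高斯均值:E[z_m | z_o] = Σ_mo Σ_oo^{-1} z_o
    Z[i, m] = Sigma[np.ix_(m, o)] @ np.linalg.solve(Sigma[np.ix_(o, o)], Z[i, o])
# 再经各列经验分位函数映回原始边缘分布,即得插补值(此处从略)
```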
[AI-35] Tree-based Models for Vertical Federated Learning: A Survey
链接: https://arxiv.org/abs/2504.02285
作者: Bingchen Qian,Yuexiang Xie,Yaliang Li,Bolin Ding,Jingren Zhou
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: Accepted by ACM Computing Surveys (CSUR)
点击查看摘要
Abstract:Tree-based models have achieved great success in a wide range of real-world applications due to their effectiveness, robustness, and interpretability, which inspired people to apply them in vertical federated learning (VFL) scenarios in recent years. In this paper, we conduct a comprehensive study to give an overall picture of applying tree-based models in VFL, from the perspective of their communication and computation protocols. We categorize tree-based models in VFL into two types, i.e., feature-gathering models and label-scattering models, and provide a detailed discussion regarding their characteristics, advantages, privacy protection mechanisms, and applications. This study also focuses on the implementation of tree-based models in VFL, summarizing several design principles for better satisfying various requirements from both academic research and industrial deployment. We conduct a series of experiments to provide empirical observations on the differences and advances of different types of tree-based models.
[AI-36] Engineering Artificial Intelligence: Framework Challenges and Future Direction
链接: https://arxiv.org/abs/2504.02269
作者: Jay Lee,Hanqi Su,Dai-Yan Ji,Takanobu Minami
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Over the past ten years, the application of artificial intelligence (AI) and machine learning (ML) in engineering domains has gained significant popularity, showcasing their potential in data-driven contexts. However, the complexity and diversity of engineering problems often require the development of domain-specific AI approaches, which are frequently hindered by a lack of systematic methodologies, scalability, and robustness during the development process. To address this gap, this paper introduces the “ABCDE” as the key elements of Engineering AI and proposes a unified, systematic engineering AI ecosystem framework, including eight essential layers, along with attributes, goals, and applications, to guide the development and deployment of AI solutions for specific engineering needs. Additionally, key challenges are examined, and nine future research directions are highlighted. By providing a comprehensive perspective, this paper aims to advance the strategic implementation of AI, fostering the development of next-generation engineering AI solutions.
[AI-37] Implicit Neural Differential Model for Spatiotemporal Dynamics
链接: https://arxiv.org/abs/2504.02260
作者: Deepak Akhare,Pan Du,Tengfei Luo,Jian-Xun Wang
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Hybrid neural-physics modeling frameworks through differentiable programming have emerged as powerful tools in scientific machine learning, enabling the integration of known physics with data-driven learning to improve prediction accuracy and generalizability. However, most existing hybrid frameworks rely on explicit recurrent formulations, which suffer from numerical instability and error accumulation during long-horizon forecasting. In this work, we introduce Im-PiNDiff, a novel implicit physics-integrated neural differentiable solver for stable and accurate modeling of spatiotemporal dynamics. Inspired by deep equilibrium models, Im-PiNDiff advances the state using implicit fixed-point layers, enabling robust long-term simulation while remaining fully end-to-end differentiable. To enable scalable training, we introduce a hybrid gradient propagation strategy that integrates adjoint-state methods with reverse-mode automatic differentiation. This approach eliminates the need to store intermediate solver states and decouples memory complexity from the number of solver iterations, significantly reducing training overhead. We further incorporate checkpointing techniques to manage memory in long-horizon rollouts. Numerical experiments on various spatiotemporal PDE systems, including advection-diffusion processes, Burgers’ dynamics, and multi-physics chemical vapor infiltration processes, demonstrate that Im-PiNDiff achieves superior predictive performance, enhanced numerical stability, and substantial reductions in memory and runtime cost relative to explicit and naive implicit baselines. This work provides a principled, efficient, and scalable framework for hybrid neural-physics modeling.
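摘要中“隐式不动点层 + 内存与求解迭代次数解耦”的思想,可以用 DEQ 风格的一步梯度(phantom gradient)最简地示意——前向迭代不记录计算图,只对最后一步求导。论文实际使用伴随状态法与反向模式自动微分,此处仅为同一思想的简化变体:

```python
import torch

def implicit_step(f, x, z0, iters=30):
    """求解 z* = f(z*, x):前向迭代不记录计算图(内存与迭代次数无关),
    收敛后仅对最后一步 f 求导(隐式梯度的一阶近似)。"""
    z = z0
    with torch.no_grad():
        for _ in range(iters):
            z = f(z, x)
    return f(z.detach(), x)   # 重新接入计算图的一步

f = lambda z, x: 0.5 * torch.tanh(z) + x          # 压缩映射,保证不动点存在
x = torch.randn(4, requires_grad=True)
z_star = implicit_step(f, x, torch.zeros(4))
z_star.sum().backward()
print(x.grad)
```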
[AI-38] Adapting World Models with Latent-State Dynamics Residuals
链接: https://arxiv.org/abs/2504.02252
作者: JB Lanier,Kyungmin Kim,Armin Karamzade,Yifei Liu,Ankita Sinha,Kat He,Davide Corsi,Roy Fox
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO)
*备注: 15 pages, 11 figures. Project website at this https URL
点击查看摘要
Abstract:Simulation-to-reality reinforcement learning (RL) faces the critical challenge of reconciling discrepancies between simulated and real-world dynamics, which can severely degrade agent performance. A promising approach involves learning corrections to simulator forward dynamics represented as a residual error function, however this operation is impractical with high-dimensional states such as images. To overcome this, we propose ReDRAW, a latent-state autoregressive world model pretrained in simulation and calibrated to target environments through residual corrections of latent-state dynamics rather than of explicit observed states. Using this adapted world model, ReDRAW enables RL agents to be optimized with imagined rollouts under corrected dynamics and then deployed in the real world. In multiple vision-based MuJoCo domains and a physical robot visual lane-following task, ReDRAW effectively models changes to dynamics and avoids overfitting in low data regimes where traditional transfer methods fail.
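摘要的核心是对潜状态动力学做残差修正而非直接修正观测。下面是一个示意:冻结预训练的仿真动力学 f_sim,只训练残差网络 r_θ(结构与维度均为假设):

```python
import torch
import torch.nn as nn

class ResidualDynamics(nn.Module):
    """目标环境中只训练残差 r_θ,预训练的仿真潜动力学 f_sim 保持冻结:
    z_{t+1} = f_sim(z_t, a_t) + r_θ(z_t, a_t)"""
    def __init__(self, f_sim, dim, act_dim):
        super().__init__()
        self.f_sim = f_sim.requires_grad_(False)
        self.residual = nn.Sequential(nn.Linear(dim + act_dim, 64),
                                      nn.Tanh(), nn.Linear(64, dim))

    def forward(self, z, a):
        za = torch.cat([z, a], dim=-1)
        return self.f_sim(za) + self.residual(za)

f_sim = nn.Linear(10 + 2, 10)                     # 假设的预训练潜动力学
model = ResidualDynamics(f_sim, dim=10, act_dim=2)
z_next = model(torch.randn(5, 10), torch.randn(5, 2))
```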
[AI-39] VEGAS: Towards Visually Explainable and Grounded Artificial Social Intelligence AAAI2025
链接: https://arxiv.org/abs/2504.02227
作者: Hao Li,Hao Fei,Zechao Hu,Zhengwei Yang,Zheng Wang
类目: Artificial Intelligence (cs.AI)
*备注: 9 pages, 5 figures, AAAI 2025
点击查看摘要
Abstract:Social Intelligence Queries (Social-IQ) serve as the primary multimodal benchmark for evaluating a model’s social intelligence level. While impressive multiple-choice question (MCQ) accuracy is achieved by current solutions, increasing evidence shows that they are largely, and in some cases entirely, dependent on language modality, overlooking visual context. Additionally, the closed-set nature further prevents the exploration of whether and to what extent the reasoning path behind selection is correct. To address these limitations, we propose the Visually Explainable and Grounded Artificial Social Intelligence (VEGAS) model. As a generative multimodal model, VEGAS leverages open-ended answering to provide explainable responses, which enhances the clarity and evaluation of reasoning paths. To enable visually grounded answering, we propose a novel sampling strategy to provide the model with more relevant visual frames. We then enhance the model’s interpretation of these frames through Generalist Instruction Fine-Tuning (GIFT), which aims to: i) learn multimodal-language transformations for fundamental emotional social traits, and ii) establish multimodal joint reasoning capabilities. Extensive experiments, comprising modality ablation, open-ended assessments, and supervised MCQ evaluations, consistently show that VEGAS effectively utilizes visual information in reasoning to produce correct and also credible answers. We expect this work to offer a new perspective on Social-IQ and advance the development of human-like social AI.
[AI-40] Learning and Improving Backgammon Strategy
链接: https://arxiv.org/abs/2504.02221
作者: Gregory R. Galperin
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
*备注: Accompanied by oral presentation by Gregory Galperin at the CBCL Learning Day 1994
点击查看摘要
Abstract:A novel approach to learning is presented, combining features of on-line and off-line methods to achieve considerable performance in the task of learning a backgammon value function in a process that exploits the processing power of parallel supercomputers. The off-line methods comprise a set of techniques for parallelizing neural network training and TD(\lambda) reinforcement learning; here Monte-Carlo “Rollouts” are introduced as a massively parallel on-line policy improvement technique which applies resources to the decision points encountered during the search of the game tree to further augment the learned value function estimate. A level of play roughly as good as, or possibly better than, the current champion human and computer backgammon players has been achieved in a short period of learning.
[AI-41] FT-Transformer: Resilient and Reliable Transformer with End-to-End Fault Tolerant Attention
链接: https://arxiv.org/abs/2504.02211
作者: Huangliang Dai,Shixun Wu,Hairui Zhao,Jiajun Huang,Zizhe Jian,Yue Zhu,Haiyang Hu,Zizhong Chen
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Transformer models leverage self-attention mechanisms to capture complex dependencies, demonstrating exceptional performance in various applications. However, the long-duration high-load computations required for model inference impose stringent reliability demands on the computing platform, as soft errors that occur during execution can significantly degrade model performance. Existing fault tolerance methods protect each operation separately using decoupled kernels, incurring substantial computational and memory overhead. In this paper, we propose a novel error-resilient framework for Transformer models, integrating end-to-end fault tolerant attention (EFTA) to improve inference reliability against soft errors. Our approach enables error detection and correction within a fully fused attention kernel, reducing redundant data access and thereby mitigating memory faults. To further enhance error coverage and reduce overhead, we design a hybrid fault tolerance scheme tailored for the EFTA, introducing for the first time: 1) architecture-aware algorithm-based fault tolerance (ABFT) using tensor checksum, which minimizes inter-thread communication overhead on tensor cores during error detection; 2) selective neuron value restriction, which selectively applies adaptive fault tolerance constraints to neuron values, balancing error coverage and overhead; 3) unified verification, reusing checksums to streamline multiple computation steps into a single verification process. Experimental results show that EFTA achieves up to 7.56x speedup over traditional methods with an average fault tolerance overhead of 13.9%.
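摘要提到的基于校验和的 ABFT 是经典技术:利用 e^T(AB) = (e^T A)B(e 为全 1 向量)在矩阵乘中检测并定位软错误。下面用 NumPy 演示单个错误的检测与纠正(与论文面向 tensor core 的实现无关,仅说明原理):

```python
import numpy as np

rng = np.random.default_rng(0)
A, B = rng.normal(size=(64, 64)), rng.normal(size=(64, 64))
C = A @ B
C[3, 7] += 10.0                      # 注入一次软错误

# ABFT 校验和:e^T·C 应等于 (e^T·A)·B;同理对行向校验
col_check = np.ones(64) @ C - (np.ones(64) @ A) @ B
row_check = C @ np.ones(64) - A @ (B @ np.ones(64))
bad_col = np.argmax(np.abs(col_check))
bad_row = np.argmax(np.abs(row_check))
print("定位错误:", (bad_row, bad_col))
C[bad_row, bad_col] -= row_check[bad_row]    # 用行校验残差纠正
```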
[AI-42] More is Less: The Pitfalls of Multi-Model Synthetic Preference Data in DPO Safety Alignment
链接: https://arxiv.org/abs/2504.02193
作者: Yifan Wang,Runjin Chen,Bolian Li,David Cho,Yihe Deng,Ruqi Zhang,Tianlong Chen,Zhangyang Wang,Ananth Grama,Junyuan Hong
类目: Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Aligning large language models (LLMs) with human values is an increasingly critical step in post-training. Direct Preference Optimization (DPO) has emerged as a simple, yet effective alternative to reinforcement learning from human feedback (RLHF). Synthetic preference data with its low cost and high quality enable effective alignment through single- or multi-model generated preference data. Our study reveals a striking, safety-specific phenomenon associated with DPO alignment: Although multi-model generated data enhances performance on general tasks (ARC, Hellaswag, MMLU, TruthfulQA, Winogrande) by providing diverse responses, it also tends to facilitate reward hacking during training. This can lead to a high attack success rate (ASR) when models encounter jailbreaking prompts. The issue is particularly pronounced when employing stronger models like GPT-4o or larger models in the same family to generate chosen responses paired with target model self-generated rejected responses, resulting in dramatically poorer safety outcomes. Furthermore, with respect to safety, using solely self-generated responses (single-model generation) for both chosen and rejected pairs significantly outperforms configurations that incorporate responses from stronger models, whether used directly as chosen data or as part of a multi-model response pool. We demonstrate that multi-model preference data exhibits high linear separability between chosen and rejected responses, which allows models to exploit superficial cues rather than internalizing robust safety constraints. Our experiments, conducted on models from the Llama, Mistral, and Qwen families, consistently validate these findings.
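作为背景,标准 DPO 损失如下面的示意;论文的发现在于 chosen/rejected 回答由单模型还是多模型合成,对训练后的安全性(越狱攻击成功率)影响显著:

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_c, logp_r, ref_logp_c, ref_logp_r, beta=0.1):
    """标准 DPO:-log σ(β·[(logπ_θ(y_c)−logπ_ref(y_c)) − (logπ_θ(y_r)−logπ_ref(y_r))])"""
    margin = (logp_c - ref_logp_c) - (logp_r - ref_logp_r)
    return -F.logsigmoid(beta * margin).mean()

logp_c = torch.randn(4, requires_grad=True)
loss = dpo_loss(logp_c, torch.randn(4), torch.randn(4), torch.randn(4))
loss.backward()
```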
[AI-43] A Survey of Scaling in Large Language Model Reasoning
链接: https://arxiv.org/abs/2504.02181
作者: Zihan Chen,Song Wang,Zhen Tan,Xingbo Fu,Zhenyu Lei,Peng Wang,Huan Liu,Cong Shen,Jundong Li
类目: Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:The rapid advancements in large language models (LLMs) have significantly enhanced their reasoning capabilities, driven by various strategies such as multi-agent collaboration. However, unlike the well-established performance improvements achieved through scaling data and model size, the scaling of reasoning in LLMs is more complex and can even negatively impact reasoning performance, introducing new challenges in model alignment and robustness. In this survey, we provide a comprehensive examination of scaling in LLM reasoning, categorizing it into multiple dimensions and analyzing how and to what extent different scaling strategies contribute to improving reasoning capabilities. We begin by exploring scaling in input size, which enables LLMs to process and utilize more extensive context for improved reasoning. Next, we analyze scaling in reasoning steps that improves multi-step inference and logical consistency. We then examine scaling in reasoning rounds, where iterative interactions refine reasoning outcomes. Furthermore, we discuss scaling in training-enabled reasoning, focusing on optimization through iterative model improvement. Finally, we review applications of scaling across domains and outline future directions for further advancing LLM reasoning. By synthesizing these diverse perspectives, this survey aims to provide insights into how scaling strategies fundamentally enhance the reasoning capabilities of LLMs and further guide the development of next-generation AI systems.
[AI-44] On the Geometry of Receiver Operating Characteristic and Precision-Recall Curves
链接: https://arxiv.org/abs/2504.02169
作者: Reza Sameni
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Statistics Theory (math.ST); Machine Learning (stat.ML)
*备注:
点击查看摘要
Abstract:We study the geometry of Receiver Operating Characteristic (ROC) and Precision-Recall (PR) curves in binary classification problems. The key finding is that many of the most commonly used binary classification metrics are merely functions of the composition function G := F_p \circ F_n^{-1}, where F_p(\cdot) and F_n(\cdot) are the class-conditional cumulative distribution functions of the classifier scores in the positive and negative classes, respectively. This geometric perspective facilitates the selection of operating points, understanding the effect of decision thresholds, and comparison between classifiers. It also helps explain how the shapes and geometry of ROC/PR curves reflect classifier behavior, providing objective tools for building classifiers optimized for specific applications with context-specific constraints. We further explore the conditions for classifier dominance, present analytical and numerical examples demonstrating the effects of class separability and variance on ROC and PR geometries, and derive a link between the positive-to-negative class leakage function G(\cdot) and the Kullback–Leibler divergence. The framework highlights practical considerations, such as model calibration, cost-sensitive optimization, and operating point selection under real-world capacity constraints, enabling more informed approaches to classifier deployment and decision-making.
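As a concrete illustration of this geometric view, the empirical ROC curve can be traced directly as the graph of G = F_p \circ F_n^{-1}. The sketch below is our own minimal construction from class-conditional score samples; names and grid size are illustrative:

```python
import numpy as np

# Trace the ROC curve as the graph of G(u) = F_p(F_n^{-1}(u)):
# thresholds are quantiles of the negative-class scores, so FPR = 1 - u,
# and TPR = 1 - F_p(threshold) estimated from the positive-class scores.
def roc_from_scores(pos_scores, neg_scores, grid=512):
    u = np.linspace(0.0, 1.0, grid)
    thresholds = np.quantile(neg_scores, u)            # F_n^{-1}(u)
    fpr = 1.0 - u                                      # P(neg > t) = 1 - F_n(t)
    tpr = np.array([(pos_scores > t).mean() for t in thresholds])
    return fpr, tpr

pos = np.random.normal(1.0, 1.0, 5000)   # toy positive-class scores
neg = np.random.normal(0.0, 1.0, 5000)   # toy negative-class scores
fpr, tpr = roc_from_scores(pos, neg)
auc = ((tpr[:-1] + tpr[1:]) / 2 * -np.diff(fpr)).sum()  # trapezoid AUC
print(round(auc, 3))
```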
[AI-45] OmniCellTOSG: The First Cell Text-Omic Signaling Graphs Dataset for Joint LLM and GNN Modeling
链接: https://arxiv.org/abs/2504.02148
作者: Heming Zhang,Tim Xu,Dekang Cao,Shunning Liang,Lars Schimmelpfennig,Levi Kaster,Di Huang,Carlos Cruchaga,Guangfu Li,Michael Province,Yixin Chen,Philip Payne,Fuhai Li
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Complex cell signaling systems – governed by varying protein abundances and interactions – generate diverse cell types across organs. These systems evolve under influences such as age, sex, diet, environmental exposures, and diseases, making them challenging to decode given the involvement of tens of thousands of genes and proteins. Recently, hundreds of millions of single-cell omics data have provided a robust foundation for understanding these signaling networks within various cell subpopulations and conditions. Inspired by the success of large foundation models (for example, large language models and large vision models) pre-trained on massive datasets, we introduce OmniCellTOSG, the first dataset of cell text-omic signaling graphs (TOSGs). Each TOSG represents the signaling network of an individual or meta-cell and is labeled with information such as organ, disease, sex, age, and cell subtype. OmniCellTOSG offers two key contributions. First, it introduces a novel graph model that integrates human-readable annotations – such as biological functions, cellular locations, signaling pathways, related diseases, and drugs – with quantitative gene and protein abundance data, enabling graph reasoning to decode cell signaling. This approach calls for new joint models combining large language models and graph neural networks. Second, the dataset is built from single-cell RNA sequencing data of approximately 120 million cells from diverse tissues and conditions (healthy and diseased) and is fully compatible with PyTorch. This facilitates the development of innovative cell signaling models that could transform research in life sciences, healthcare, and precision medicine. The OmniCellTOSG dataset is continuously expanding and will be updated regularly. The dataset and code are available at this https URL.
[AI-46] On Simulation-Guided LLM-based Code Generation for Safe Autonomous Driving Software
链接: https://arxiv.org/abs/2504.02141
作者: Ali Nouri,Johan Andersson,Kailash De Jesus Hornig,Zhennan Fei,Emil Knabe,Hakan Sivencrona,Beatriz Cabrero-Daniel,Christian Berger
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
*备注: Accepted in the 29th International Conference on Evaluation and Assessment in Software Engineering (EASE)
点击查看摘要
Abstract:Automated Driving System (ADS) is a safety-critical software system responsible for the interpretation of the vehicle's environment and making decisions accordingly. The unbounded complexity of the driving context, including unforeseeable events, necessitates continuous improvement, often achieved through iterative DevOps processes. However, DevOps processes are themselves complex, making these improvements both time- and resource-intensive. Automation in code generation for ADS using Large Language Models (LLM) is one potential approach to address this challenge. Nevertheless, the development of ADS requires rigorous processes to verify, validate, assess, and qualify the code before it can be deployed in the vehicle and used. In this study, we developed and evaluated a prototype for automatic code generation and assessment using a designed pipeline of an LLM-based agent, simulation model, and rule-based feedback generator in an industrial setup. The LLM-generated code is evaluated automatically in a simulation model against multiple critical traffic scenarios, and an assessment report is provided as feedback to the LLM for modification or bug fixing. We report on the experimental results of the prototype employing Codellama:34b, DeepSeek (r1:32b and Coder:33b), CodeGemma:7b, Mistral:7b, and GPT4 for Adaptive Cruise Control (ACC) and Unsupervised Collision Avoidance by Evasive Manoeuvre (CAEM). We finally assessed the tool with 11 experts at two Original Equipment Manufacturers (OEMs) by conducting an interview study.
[AI-47] Enhancing Embedding Representation Stability in Recommendation Systems with Semantic ID
链接: https://arxiv.org/abs/2504.02137
作者: Carolina Zheng,Minhui Huang,Dmitrii Pedchenko,Kaushik Rangadurai,Siyu Wang,Gaby Nahum,Jie Lei,Yang Yang,Tao Liu,Zutian Luo,Xiaohan Wei,Dinesh Ramasamy,Jiyan Yang,Yiping Han,Lin Yang,Hangjun Xu,Rong Jin,Shuang Yang
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:The exponential growth of online content has posed significant challenges to ID-based models in industrial recommendation systems, ranging from extremely high cardinality and dynamically growing ID space, to highly skewed engagement distributions, to prediction instability as a result of natural ID life cycles (e.g., the birth of new IDs and retirement of old IDs). To address these issues, many systems rely on random hashing to handle the ID space and control the corresponding model parameters (i.e., the embedding table). However, this approach introduces data pollution from multiple IDs sharing the same embedding, leading to degraded model performance and embedding representation instability. This paper examines these challenges and introduces Semantic ID prefix ngram, a novel token parameterization technique that significantly improves the performance of the original Semantic ID. Semantic ID prefix ngram creates semantically meaningful collisions by hierarchically clustering items based on their content embeddings, as opposed to random assignments. Through extensive experimentation, we demonstrate that Semantic ID prefix ngram not only addresses embedding instability but also significantly improves tail ID modeling, reduces overfitting, and mitigates representation shifts. We further highlight the advantages of Semantic ID prefix ngram in attention-based models that contextualize user histories, showing substantial performance improvements. We also report our experience of integrating Semantic ID into Meta's production Ads Ranking system, leading to notable performance gains and enhanced prediction stability in live deployments.
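A minimal sketch of the prefix n-gram idea as we read it (the interface and names below are our assumptions, not Meta's implementation): an item's Semantic ID is a coarse-to-fine path of hierarchical cluster indices over its content embedding, and every prefix of that path becomes an embedding-table token, so items colliding on a short prefix are semantically similar rather than randomly hashed together.

```python
# Each item carries a Semantic ID such as (3, 17, 5, 42): the cluster index
# at each level of a coarse-to-fine hierarchical clustering of its content
# embedding. Prefix n-grams of the path serve as the lookup tokens.
def prefix_ngram_tokens(semantic_id, max_n=4):
    return [semantic_id[:n] for n in range(1, min(max_n, len(semantic_id)) + 1)]

print(prefix_ngram_tokens((3, 17, 5, 42)))
# [(3,), (3, 17), (3, 17, 5), (3, 17, 5, 42)]
# Two items sharing the (3, 17) prefix collide on the coarse tokens, which is
# a semantically meaningful collision by construction.
```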
[AI-48] LLMPi: Optimizing LLMs for High-Throughput on Raspberry Pi
链接: https://arxiv.org/abs/2504.02118
作者: Mahsa Ardakani,Jinendra Malekar,Ramtin Zand
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Deploying Large Language Models (LLMs) on resource-constrained edge devices like the Raspberry Pi presents challenges in computational efficiency, power consumption, and response latency. This paper explores quantization-based optimization techniques to enable high-throughput, energy-efficient execution of LLMs on low-power embedded systems. Our approach leverages k-quantization, a Post-Training Quantization (PTQ) method designed for different bit-widths, enabling efficient 2-bit, 4-bit, 6-bit, and 8-bit weight quantization. Additionally, we employ ternary quantization using Quantization-Aware Training (QAT) for BitNet models, allowing for more effective adaptation to lower-bit representations while preserving accuracy. Our findings highlight the potential of quantized LLMs for real-time conversational AI on edge devices, paving the way for low-power, high-efficiency AI deployment in mobile and embedded applications. This study demonstrates that aggressive quantization strategies can significantly reduce energy consumption while maintaining inference quality, making LLMs practical for resource-limited environments.
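As a reference point for what PTQ does at the weight level, here is a minimal symmetric per-tensor quantizer; the k-quantization and BitNet QAT schemes used in the paper are more elaborate, so this is only an illustrative sketch:

```python
import numpy as np

# Symmetric per-tensor post-training quantization: map float weights onto a
# signed integer grid with one shared scale, then dequantize with q * scale.
def quantize_weights(w, bits=4):
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max() / qmax
    q = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

w = np.random.randn(16)
q, s = quantize_weights(w, bits=4)
print(np.abs(w - q * s).max())   # worst-case quantization error
```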
[AI-49] On Model Protection in Federated Learning against Eavesdropping Attacks
链接: https://arxiv.org/abs/2504.02114
作者: Dipankar Maity,Kushal Chakrabarti
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Systems and Control (eess.SY); Optimization and Control (math.OC); Machine Learning (stat.ML)
*备注:
点击查看摘要
Abstract:In this study, we investigate the protection offered by federated learning algorithms against eavesdropping adversaries. In our model, the adversary is capable of intercepting model updates transmitted from clients to the server, enabling it to create its own estimate of the model. Unlike previous research, which predominantly focuses on safeguarding client data, our work shifts attention to protecting the client model itself. Through a theoretical analysis, we examine how various factors, such as the probability of client selection, the structure of local objective functions, global aggregation at the server, and the eavesdropper's capabilities, impact the overall level of protection. We further validate our findings through numerical experiments, assessing the protection by evaluating the model accuracy achieved by the adversary. Finally, we compare our results with methods based on differential privacy, underscoring their limitations in this specific context.
[AI-50] ScreenAudit: Detecting Screen Reader Accessibility Errors in Mobile Apps Using Large Language Models
链接: https://arxiv.org/abs/2504.02110
作者: Mingyuan Zhong,Ruolin Chen,Xia Chen,James Fogarty,Jacob O. Wobbrock
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
*备注: CHI 2025
点击查看摘要
Abstract:Many mobile apps are inaccessible, thereby excluding people from their potential benefits. Existing rule-based accessibility checkers aim to mitigate these failures by identifying errors early during development but are constrained in the types of errors they can detect. We present ScreenAudit, an LLM-powered system designed to traverse mobile app screens, extract metadata and transcripts, and identify screen reader accessibility errors overlooked by existing checkers. We recruited six accessibility experts including one screen reader user to evaluate ScreenAudit’s reports across 14 unique app screens. Our findings indicate that ScreenAudit achieves an average coverage of 69.2%, compared to only 31.3% with a widely-used accessibility checker. Expert feedback indicated that ScreenAudit delivered higher-quality feedback and addressed more aspects of screen reader accessibility compared to existing checkers, and that ScreenAudit would benefit app developers in real-world settings.
[AI-51] FlowDistill: Scalable Traffic Flow Prediction via Distillation from LLMs
链接: https://arxiv.org/abs/2504.02094
作者: Chenyang Yu,Xinpeng Xie,Yan Huang,Chenxi Qiu
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Accurate traffic flow prediction is vital for optimizing urban mobility, yet it remains difficult in many cities due to complex spatio-temporal dependencies and limited high-quality data. While deep graph-based models demonstrate strong predictive power, their performance often comes at the cost of high computational overhead and substantial training data requirements, making them impractical for deployment in resource-constrained or data-scarce environments. We propose FlowDistill, a lightweight and scalable traffic prediction framework based on knowledge distillation from large language models (LLMs). In this teacher-student setup, a fine-tuned LLM guides a compact multi-layer perceptron (MLP) student model using a novel combination of the information bottleneck principle and a teacher-bounded regression loss, ensuring the distilled model retains only essential and transferable knowledge. Spatial and temporal correlations are explicitly encoded to enhance the model's generalization across diverse urban settings. Despite its simplicity, FlowDistill consistently outperforms state-of-the-art models in prediction accuracy while requiring significantly less training data and achieving lower memory usage and inference latency, highlighting its efficiency and suitability for real-world, scalable deployment.
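A sketch of what a teacher-bounded regression loss can look like (the exact form used in FlowDistill may differ; this is our assumed variant): the student pays an extra penalty only where it is worse than the LLM teacher, plus a plain data term.

```python
import torch

# Teacher-bounded regression loss (assumed form): err_s is the student's
# squared error, err_t the teacher's; the hinge term clamp(err_s - err_t, 0)
# vanishes wherever the student already matches or beats the teacher.
def teacher_bounded_loss(student, teacher, target, alpha=0.5):
    err_s = (student - target) ** 2
    err_t = (teacher - target) ** 2
    bounded = torch.clamp(err_s - err_t, min=0.0)
    return (err_s + alpha * bounded).mean()
```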
[AI-52] An Introductory Survey to Autoencoder-based Deep Clustering – Sandboxes for Combining Clustering with Deep Learning
链接: https://arxiv.org/abs/2504.02087
作者: Collin Leiber,Lukas Miklautz,Claudia Plant,Christian Böhm
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Autoencoders offer a general way of learning low-dimensional, non-linear representations from data without labels. This is achieved without making any particular assumptions about the data type or other domain knowledge. The generality and domain agnosticism, in combination with their simplicity, make autoencoders a perfect sandbox for researching and developing novel (deep) clustering algorithms. Clustering methods group data based on similarity, a task that benefits from the lower-dimensional representation learned by an autoencoder, mitigating the curse of dimensionality. Specifically, the combination of deep learning with clustering, called Deep Clustering, enables learning a representation tailored to specific clustering tasks, leading to high-quality results. This survey provides an introduction to fundamental autoencoder-based deep clustering algorithms that serve as building blocks for many modern approaches.
[AI-53] Evolving Security in LLMs: A Study of Jailbreak Attacks and Defenses
链接: https://arxiv.org/abs/2504.02080
作者: Zhengchun Shang,Wenlan Wei
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Large Language Models (LLMs) are increasingly popular, powering a wide range of applications. Their widespread use has sparked concerns, especially through jailbreak attacks that bypass safety measures to produce harmful content. In this paper, we present a comprehensive security analysis of LLMs, addressing critical research questions on the evolution and determinants of model safety. Specifically, we begin by identifying the most effective techniques for detecting jailbreak attacks. Next, we investigate whether newer versions of LLMs offer improved security compared to their predecessors. We also assess the impact of model size on overall security and explore the potential benefits of integrating multiple defense strategies to enhance model robustness. Our study evaluates both open-source models (e.g., LLaMA and Mistral) and closed-source systems (e.g., GPT-4) by employing four state-of-the-art attack techniques and assessing the efficacy of three new defensive approaches.
[AI-54] Trapped by Expectations: Functional Fixedness in LLM-Enabled Chat Search
链接: https://arxiv.org/abs/2504.02074
作者: Jiqun Liu,Jamshed Karimnazarov,Ryen W. White
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Functional fixedness, a cognitive bias that restricts users’ interactions with a new system or tool to expected or familiar ways, limits the full potential of Large Language Model (LLM)-enabled chat search, especially in complex and exploratory tasks. To investigate its impact, we conducted a crowdsourcing study with 450 participants, each completing one of six decision-making tasks spanning public safety, diet and health management, sustainability, and AI ethics. Participants engaged in a multi-prompt conversation with ChatGPT to address the task, allowing us to compare pre-chat intent-based expectations with observed interactions. We found that: 1) Several aspects of pre-chat expectations are closely associated with users’ prior experiences with ChatGPT, search engines, and virtual assistants; 2) Prior system experience shapes language use and prompting behavior. Frequent ChatGPT users reduced deictic terms and hedge words and frequently adjusted prompts. Users with rich search experience maintained structured, less-conversational queries with minimal modifications. Users of virtual assistants favored directive, command-like prompts, reinforcing functional fixedness; 3) When the system failed to meet expectations, participants generated more detailed prompts with increased linguistic diversity, reflecting adaptive shifts. These findings suggest that while preconceived expectations constrain early interactions, unmet expectations can motivate behavioral adaptation. With appropriate system support, this may promote broader exploration of LLM capabilities. This work also introduces a typology for user intents in chat search and highlights the importance of mitigating functional fixedness to support more creative and analytical use of LLMs.
[AI-55] RoboAct-CLIP: Video-Driven Pre-training of Atomic Action Understanding for Robotics IROS2025
链接: https://arxiv.org/abs/2504.02069
作者: Zhiyuan Zhang,Yuxin He,Yong Sun,Junyu Shi,Lijiang Liu,Qiang Nie
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: IROS 2025
点击查看摘要
Abstract:Visual Language Models (VLMs) have emerged as pivotal tools for robotic systems, enabling cross-task generalization, dynamic environmental interaction, and long-horizon planning through multimodal perception and semantic reasoning. However, existing open-source VLMs, predominantly trained for generic vision-language alignment tasks, fail to effectively model the temporally correlated action semantics that are crucial for robotic manipulation. While current image-based fine-tuning methods partially adapt VLMs to robotic applications, they fundamentally disregard temporal evolution patterns in video sequences and suffer from visual feature entanglement between robotic agents, manipulated objects, and environmental contexts, thereby limiting semantic decoupling capability for atomic actions and compromising model generalizability. To overcome these challenges, this work presents RoboAct-CLIP with dual technical contributions: 1) A dataset reconstruction framework that performs semantic-constrained action unit segmentation and re-annotation on open-source robotic videos, constructing purified training sets containing singular atomic actions (e.g., “grasp”); 2) A temporal-decoupling fine-tuning strategy based on Contrastive Language-Image Pretraining (CLIP) architecture, which disentangles temporal action features across video frames from object-centric characteristics to achieve hierarchical representation learning of robotic atomic actions. Experimental results in simulated environments demonstrate that the RoboAct-CLIP pretrained model achieves a 12% higher success rate than baseline VLMs, along with superior generalization in multi-object manipulation tasks.
[AI-56] Epistemic Closure and the Irreversibility of Misalignment: Modeling Systemic Barriers to Alignment Innovation
链接: https://arxiv.org/abs/2504.02058
作者: Andy Williams
类目: Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Efforts to ensure the safe development of artificial general intelligence (AGI) often rely on consensus-based alignment approaches grounded in axiomatic formalism, interpretability, and empirical validation. However, these methods may be structurally unable to recognize or incorporate novel solutions that fall outside their accepted epistemic frameworks. This paper introduces a functional model of epistemic closure, in which cognitive, institutional, social, and infrastructural filters combine to make many alignment proposals illegible to existing evaluation systems. We present a weighted closure model supported by both theoretical and empirical sources, including a meta-analysis performed by an AI system on patterns of rejection and non-engagement with a framework for decentralized collective intelligence (DCI). We argue that the recursive failure to assess models like DCI is not just a sociological oversight but a structural attractor, mirroring the very risks of misalignment we aim to avoid in AGI. Without the adoption of DCI or a similarly recursive model of epistemic correction, we may be on a predictable path toward irreversible misalignment. The development and acceptance of this paper, first through simulated review and then through formal channels, provide a case study supporting its central claim: that epistemic closure can only be overcome by recursive modeling of the constraints that sustain it.
[AI-57] Antithetic Sampling for Top-k Shapley Identification
链接: https://arxiv.org/abs/2504.02019
作者: Patrick Kolpaczki,Tim Nielen,Eyke Hüllermeier
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT)
*备注:
点击查看摘要
Abstract:Additive feature explanations rely primarily on game-theoretic notions such as the Shapley value by viewing features as cooperating players. The Shapley value's popularity in and outside of explainable AI stems from its axiomatic uniqueness. However, its computational complexity severely limits practicability. Most works investigate the uniform approximation of all features' Shapley values, needlessly consuming samples for insignificant features. In contrast, identifying the k most important features can already be sufficiently insightful and yields the potential to leverage algorithmic opportunities connected to the field of multi-armed bandits. We propose Comparable Marginal Contributions Sampling (CMCS), a method for the top-k identification problem utilizing a new sampling scheme taking advantage of correlated observations. We conduct experiments to showcase the efficacy of our method compared to competitive baselines. Our empirical findings reveal that estimation quality for the approximate-all problem does not necessarily transfer to top-k identification and vice versa.
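To ground the sampling vocabulary: Shapley values are commonly estimated by averaging marginal contributions over random feature permutations, and antithetic pairs reuse a permutation together with its reverse to obtain correlated estimates with lower variance. The sketch below shows that classic scheme, not necessarily the CMCS sampler proposed in the paper.

```python
import random

# Marginal contribution of feature i under permutation perm:
# v(predecessors of i, plus i) - v(predecessors of i).
def marginal_contribution(value, perm, i):
    pos = perm.index(i)
    before = set(perm[:pos])
    return value(before | {i}) - value(before)

# Antithetic estimate: each random permutation is paired with its reverse;
# both are uniformly distributed, so the average stays unbiased.
def antithetic_estimate(value, n_features, i, n_pairs=100):
    total = 0.0
    for _ in range(n_pairs):
        perm = random.sample(range(n_features), n_features)
        total += marginal_contribution(value, perm, i)
        total += marginal_contribution(value, perm[::-1], i)
    return total / (2 * n_pairs)

# Toy additive game v(S) = sum of member indices: Shapley value of i is i.
print(antithetic_estimate(lambda S: sum(S), n_features=5, i=3))  # exactly 3.0
```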
[AI-58] Random Conditioning with Distillation for Data-Efficient Diffusion Model Compression CVPR2025
链接: https://arxiv.org/abs/2504.02011
作者: Dohyun Kim,Sehwan Park,Geonhee Han,Seung Wook Kim,Paul Hongsuck Seo
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: Accepted to CVPR 2025. 8 pages main paper + 4 pages references + 5 pages supplementary, 9 figures in total
点击查看摘要
Abstract:Diffusion models generate high-quality images through progressive denoising but are computationally intensive due to large model sizes and repeated sampling. Knowledge distillation, which transfers knowledge from a complex teacher to a simpler student model, has been widely studied in recognition tasks, particularly for transferring concepts unseen during student training. However, its application to diffusion models remains underexplored, especially in enabling student models to generate concepts not covered by the training images. In this work, we propose Random Conditioning, a novel approach that pairs noised images with randomly selected text conditions to enable efficient, image-free knowledge distillation. By leveraging this technique, we show that the student can generate concepts unseen in the training images. When applied to conditional diffusion model distillation, our method allows the student to explore the condition space without generating condition-specific images, resulting in notable improvements in both generation quality and efficiency. This promotes resource-efficient deployment of generative diffusion models, broadening their accessibility for both research and real-world applications. Code, models, and datasets are available at this https URL .
[AI-59] AI Regulation and Capitalist Growth: Balancing Innovation Ethics and Global Governance
链接: https://arxiv.org/abs/2504.02000
作者: Vikram Kulothungan,Priya Ranjani Mohan,Deepti Gupta
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET)
*备注: Accepted for IEEE BigDataSecurity 2025 Conference
点击查看摘要
Abstract:Artificial Intelligence (AI) is increasingly central to economic growth, promising new efficiencies and markets. This economic significance has sparked debate over AI regulation: do rules and oversight bolster long-term growth by building trust and safeguarding the public, or do they constrain innovation and free enterprise? This paper examines the balance between AI regulation and capitalist ideals, focusing on how different approaches to AI data privacy can impact innovation in AI-driven applications. The central question is whether AI regulation enhances or inhibits growth in a capitalist economy. Our analysis synthesizes historical precedents, the current U.S. regulatory landscape, economic projections, legal challenges, and case studies of recent AI policies. We argue that carefully calibrated AI data privacy regulations, balancing innovation incentives with the public interest, can foster sustainable growth by building trust and ensuring responsible data use, while excessive regulation may risk stifling innovation and entrenching incumbents.
[AI-60] Brains vs. Bytes: Evaluating LLM Proficiency in Olympiad Mathematics
链接: https://arxiv.org/abs/2504.01995
作者: Hamed Mahdavi,Alireza Hashemi,Majid Daliri,Pegah Mohammadipour,Alireza Farhadi,Samira Malek,Yekta Yazdanifard,Amir Khasahmadi,Vasant Honavar
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Recent advancements in large language models (LLMs) have shown impressive progress in mathematical reasoning tasks. However, current evaluation benchmarks predominantly focus on the accuracy of final answers, often overlooking the logical rigor crucial for mathematical problem-solving. The claim that state-of-the-art LLMs can solve Math Olympiad-level problems requires closer examination. To explore this, we conducted both qualitative and quantitative human evaluations of proofs generated by LLMs, and developed a schema for automatically assessing their reasoning capabilities. Our study reveals that current LLMs fall significantly short of solving challenging Olympiad-level problems and frequently fail to distinguish correct mathematical reasoning from clearly flawed solutions. We also found that occasional correct final answers provided by LLMs often result from pattern recognition or heuristic shortcuts rather than genuine mathematical reasoning. These findings underscore the substantial gap between LLM performance and human expertise in advanced mathematical reasoning and highlight the importance of developing benchmarks that prioritize the rigor and coherence of mathematical arguments rather than merely the correctness of final answers.
[AI-61] PIM-LLM: A High-Throughput Hybrid PIM Architecture for 1-bit LLMs
链接: https://arxiv.org/abs/2504.01994
作者: Jinendra Malekar,Peyton Chandarana,Md Hasibul Amin,Mohammed E. Elbtity,Ramtin Zand
类目: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:In this paper, we propose PIM-LLM, a hybrid architecture developed to accelerate 1-bit large language models (LLMs). PIM-LLM leverages analog processing-in-memory (PIM) architectures and digital systolic arrays to accelerate low-precision matrix multiplication (MatMul) operations in projection layers and high-precision MatMul operations in attention heads of 1-bit LLMs, respectively. Our design achieves up to roughly 80x improvement in tokens per second and a 70% increase in tokens per joule compared to conventional hardware accelerators. Additionally, PIM-LLM outperforms previous PIM-based LLM accelerators, setting a new benchmark with at least 2x and 5x improvement in GOPS and GOPS/W, respectively.
[AI-62] Exploring the Societal and Economic Impacts of Artificial Intelligence: A Scenario Generation Methodology
链接: https://arxiv.org/abs/2504.01992
作者: Carlos J. Costa,Joao Tiago Aparicio
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Theoretical Economics (econ.TH)
*备注: 6 pages
点击查看摘要
Abstract:This paper explores the potential societal and economic impacts of artificial intelligence (AI) by generating scenarios that assess how AI may influence various sectors. We categorize and analyze key factors affecting AI's integration and adoption by applying an Impact-Uncertainty Matrix. A proposed methodology involves querying academic databases, identifying emerging trends and topics, and categorizing these into an impact-uncertainty framework. The paper identifies critical areas where AI may bring significant change and outlines potential future scenarios based on these insights. This research aims to inform policymakers, industry leaders, and researchers on the strategic planning required to address the challenges and opportunities AI presents.
[AI-63] Advances and Challenges in Foundation Agents: From Brain-Inspired Intelligence to Evolutionary, Collaborative, and Safe Systems
链接: https://arxiv.org/abs/2504.01990
作者: Bang Liu,Xinfeng Li,Jiayi Zhang,Jinlin Wang,Tanjin He,Sirui Hong,Hongzhang Liu,Shaokun Zhang,Kaitao Song,Kunlun Zhu,Yuheng Cheng,Suyuchen Wang,Xiaoqiang Wang,Yuyu Luo,Haibo Jin,Peiyan Zhang,Ollie Liu,Jiaqi Chen,Huan Zhang,Zhaoyang Yu,Haochen Shi,Boyan Li,Dekun Wu,Fengwei Teng,Xiaojun Jia,Jiawei Xu,Jinyu Xiang,Yizhang Lin,Tianming Liu,Tongliang Liu,Yu Su,Huan Sun,Glen Berseth,Jianyun Nie,Ian Foster,Logan Ward,Qingyun Wu,Yu Gu,Mingchen Zhuge,Xiangru Tang,Haohan Wang,Jiaxuan You,Chi Wang,Jian Pei,Qiang Yang,Xiaoliang Qi,Chenglin Wu
类目: Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:The advent of large language models (LLMs) has catalyzed a transformative shift in artificial intelligence, paving the way for advanced intelligent agents capable of sophisticated reasoning, robust perception, and versatile action across diverse domains. As these agents increasingly drive AI research and practical applications, their design, evaluation, and continuous improvement present intricate, multifaceted challenges. This survey provides a comprehensive overview, framing intelligent agents within a modular, brain-inspired architecture that integrates principles from cognitive science, neuroscience, and computational research. We structure our exploration into four interconnected parts. First, we delve into the modular foundation of intelligent agents, systematically mapping their cognitive, perceptual, and operational modules onto analogous human brain functionalities, and elucidating core components such as memory, world modeling, reward processing, and emotion-like systems. Second, we discuss self-enhancement and adaptive evolution mechanisms, exploring how agents autonomously refine their capabilities, adapt to dynamic environments, and achieve continual learning through automated optimization paradigms, including emerging AutoML and LLM-driven optimization strategies. Third, we examine collaborative and evolutionary multi-agent systems, investigating the collective intelligence emerging from agent interactions, cooperation, and societal structures, highlighting parallels to human social dynamics. Finally, we address the critical imperative of building safe, secure, and beneficial AI systems, emphasizing intrinsic and extrinsic security threats, ethical alignment, robustness, and practical mitigation strategies necessary for trustworthy real-world deployment.
[AI-64] TuRTLe: A Unified Evaluation of LLMs for RTL Generation
链接: https://arxiv.org/abs/2504.01986
作者: Dario Garcia-Gasulla,Gokcen Kestor,Emanuele Parisi,Miquel Albert’i-Binimelis,Cristian Gutierrez,Razine Moundir Ghorab,Orlando Montenegro,Bernat Homs,Miquel Moreto
类目: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:The rapid advancements in LLMs have driven the adoption of generative AI in various domains, including Electronic Design Automation (EDA). Unlike traditional software development, EDA presents unique challenges, as generated RTL code must not only be syntactically correct and functionally accurate but also synthesizable by hardware generators while meeting performance, power, and area constraints. These additional requirements introduce complexities that existing code-generation benchmarks often fail to capture, limiting their effectiveness in evaluating LLMs for RTL generation. To address this gap, we propose TuRTLe, a unified evaluation framework designed to systematically assess LLMs across key RTL generation tasks. TuRTLe integrates multiple existing benchmarks and automates the evaluation process, enabling a comprehensive assessment of LLM performance in syntax correctness, functional correctness, synthesis, PPA optimization, and exact line completion. Using this framework, we benchmark a diverse set of open LLMs and analyze their strengths and weaknesses in EDA-specific tasks. Our results show that reasoning-based models, such as DeepSeek R1, consistently outperform others across multiple evaluation criteria, but at the cost of increased computational overhead and inference latency. Additionally, base models are better suited to module completion tasks, while instruct-tuned models perform better in specification-to-RTL tasks.
[AI-65] Multi-Dimensional AGV Path Planning in 3D Warehouses Using Ant Colony Optimization and Advanced Neural Networks
链接: https://arxiv.org/abs/2504.01985
作者: Bo Zhang,Xiubo Liang,Wei Song,Yulu Chen
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Within modern warehouse scenarios, the rapid expansion of e-commerce and increasingly complex, multi-level storage environments have exposed the limitations of traditional AGV (Automated Guided Vehicle) path planning methods–often reliant on static 2D models and expert-tuned heuristics that struggle to handle dynamic traffic and congestion. Addressing these limitations, this paper introduces a novel AGV path planning approach for 3D warehouse environments that leverages a hybrid framework combining ACO (Ant Colony Optimization) with deep learning models, called NAHACO (Neural Adaptive Heuristic Ant Colony Optimization). NAHACO integrates three key innovations: first, an innovative heuristic algorithm for 3D warehouse cargo modeling using multidimensional tensors, which addresses the challenge of achieving superior heuristic accuracy; second, integration of a congestion-aware loss function within the ACO framework to adjust path costs based on traffic and capacity constraints, called CARL (Congestion-Aware Reinforce Loss), enabling dynamic heuristic calibration for optimizing ACO-based path planning; and third, an adaptive attention mechanism that captures multi-scale spatial features, thereby addressing dynamic heuristic calibration for further optimization of ACO-based path planning and AGV navigation. NAHACO significantly boosts path planning efficiency, yielding faster computation times and superior performance over both vanilla and state-of-the-art methods, while automatically adapting to warehouse constraints for real-time optimization. NAHACO outperforms state-of-the-art methods, lowering the total cost by up to 24.7% on TSP benchmarks. In warehouse tests, NAHACO cuts cost by up to 41.5% and congestion by up to 56.1% compared to previous methods.
[AI-66] NLS: Natural-Level Synthesis for Hardware Implementation Through GenAI
链接: https://arxiv.org/abs/2504.01981
作者: Kaiyuan Yang,Huang Ouyang,Xinyi Wang,Bingjie Lu,Yanbo Wang,Charith Abhayaratne,Sizhao Li,Long Jin,Tiantai Deng
类目: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI)
*备注: 9 pages, 4 figures, and 5 tables. Submitted for IEEE Transactions on CAD. The same content was accepted by Design Automation Conference 2025 as a WIP Poster (not count as publication, so it’s ok to submit the content elsewhere). TCAD info: this https URL Submitted for review on 26th of Feb. Reference - TCAD-2025-0203
点击查看摘要
Abstract:This paper introduces Natural-Level Synthesis (NLS), an innovative approach for generating hardware using generative artificial intelligence at both the system and component levels. NLS bridges a gap in current hardware development processes, where algorithm and application engineers' involvement typically ends at the requirements stage. With NLS, engineers can participate more deeply in the development, synthesis, and test stages by using Gen-AI models to convert natural language descriptions directly into Hardware Description Language code. This approach not only streamlines hardware development but also improves accessibility, fostering a collaborative workflow between hardware and algorithm engineers. We developed the NLS tool to facilitate natural language-driven HDL synthesis, enabling rapid generation of system-level HDL designs while significantly reducing development complexity. Evaluated through case studies and benchmarks using Performance, Power, and Area metrics, NLS shows its potential to enhance resource efficiency in hardware development. This work provides an extensible, efficient solution for hardware synthesis and establishes a Visual Studio Code extension to assess Gen-AI-driven HDL generation and system integration, laying a foundation for future AI-enhanced and AI-in-the-loop Electronic Design Automation tools.
[AI-67] Information Gain Is Not All You Need
链接: https://arxiv.org/abs/2504.01980
作者: Ludvig Ericson,José Pedro,Patric Jensfelt
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
*备注: 9 pages, 6 figures, under review
点击查看摘要
Abstract:Autonomous exploration in mobile robotics is driven by two competing objectives: coverage, to exhaustively observe the environment; and path length, to do so with the shortest path possible. Though it is difficult to evaluate the best course of action without knowing the unknown, the unknown can often be understood through models, maps, or common sense. However, previous work has shown that improving estimates of information gain through such prior knowledge leads to greedy behavior and ultimately causes backtracking, which degrades coverage performance. In fact, any information gain maximization will exhibit this behavior, even without prior knowledge. Information gained at task completion is constant, and cannot be maximized for. It is therefore an unsuitable choice as an optimization objective. Instead, information gain is a decision criterion for determining which candidate states should still be considered for exploration. The task therefore becomes to reach completion with the shortest total path. Since determining the shortest path is typically intractable, it is necessary to rely on a heuristic or estimate to identify candidate states that minimize the total path length. To address this, we propose a heuristic that reduces backtracking by preferring candidate states that are close to the robot, but far away from other candidate states. We evaluate the performance of the proposed heuristic in simulation against an information gain-based approach and frontier exploration, and show that our method significantly decreases total path length, both with and without prior knowledge of the environment.
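The proposed preference, candidates close to the robot but far away from other candidates, can be written down directly; below is a minimal sketch with an assumed trade-off weight lam (the paper does not publish this exact scoring code):

```python
import numpy as np

# Score each candidate by distance to the robot minus lam times distance to
# its nearest other candidate; the lowest score wins (close to robot, isolated
# from the rest of the frontier).
def pick_candidate(robot, candidates, lam=1.0):
    robot = np.asarray(robot, dtype=float)
    C = np.asarray(candidates, dtype=float)
    d_robot = np.linalg.norm(C - robot, axis=1)
    pair = np.linalg.norm(C[:, None] - C[None, :], axis=2)
    np.fill_diagonal(pair, np.inf)            # ignore self-distance
    d_others = pair.min(axis=1)
    score = d_robot - lam * d_others
    return int(np.argmin(score))

print(pick_candidate((0.0, 0.0), [(1.0, 0.0), (1.2, 0.1), (-3.0, 4.0)]))
```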
[AI-68] Correlation-Attention Masked Temporal Transformer for User Identity Linkage Using Heterogeneous Mobility Data
链接: https://arxiv.org/abs/2504.01979
作者: Ziang Yan,Xingyu Zhao,Hanqing Ma,Wei Chen,Jianpeng Qi,Yanwei Yu,Junyu Dong
类目: Social and Information Networks (cs.SI); Artificial Intelligence (cs.AI)
*备注: 9 pages, 5 figures, 3 tables
点击查看摘要
Abstract:With the rise of social media and Location-Based Social Networks (LBSN), check-in data across platforms has become crucial for User Identity Linkage (UIL). These data not only reveal users' spatio-temporal information but also provide insights into their behavior patterns and interests. However, cross-platform identity linkage faces challenges like poor data quality, high sparsity, and noise interference, which hinder existing methods from extracting cross-platform user information. To address these issues, we propose a Correlation-Attention Masked Transformer for User Identity Linkage Network (MT-Link), a transformer-based framework to enhance model performance by learning spatio-temporal co-occurrence patterns of cross-platform users. Our model effectively captures spatio-temporal co-occurrence in cross-platform user check-in sequences. It employs a correlation attention mechanism to detect the spatio-temporal co-occurrence between user check-in sequences. Guided by attention weight maps, the model focuses on co-occurrence points while filtering out noise, ultimately improving classification performance. Experimental results show that our model significantly outperforms state-of-the-art baselines, with improvements of 12.92%~17.76% in Macro-F1 and 5.80%~8.38% in Area Under Curve (AUC).
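A generic sketch of attention-weight masking between two sequences; the hard threshold below is our own illustrative stand-in for MT-Link's noise filtering, not the paper's mechanism:

```python
import torch
import torch.nn.functional as F

# Cross-attention between two embedded check-in sequences, with low attention
# weights zeroed out so the output focuses on co-occurrence peaks.
def masked_cross_attention(seq_a, seq_b, tau=0.05):
    # seq_a: (La, d), seq_b: (Lb, d)
    attn = F.softmax(seq_a @ seq_b.T / seq_a.shape[-1] ** 0.5, dim=-1)
    attn = attn * (attn > tau)                        # filter out noisy pairs
    attn = attn / attn.sum(-1, keepdim=True).clamp_min(1e-9)
    return attn @ seq_b                               # co-occurrence-focused features

out = masked_cross_attention(torch.randn(20, 64), torch.randn(30, 64))
print(out.shape)  # torch.Size([20, 64])
```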
[AI-69] Steiner Traveling Salesman Problem with Quantum Annealing GECCO2025
链接: https://arxiv.org/abs/2504.02388
作者: Alessia Ciacco,Francesca Guerriero,Eneko Osaba
类目: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET)
*备注: 7 pages, 1 figure, 6 tables. Paper submitted to The Genetic and Evolutionary Computation Conference (GECCO 2025)
点击查看摘要
Abstract:The Steiner Traveling Salesman Problem (STSP) is a variant of the classical Traveling Salesman Problem. The STSP involves incorporating Steiner nodes, which are extra nodes not originally part of the required visit set but that can be added to the route to enhance the overall solution and minimize the total travel cost. Given the NP-hard nature of the STSP, we propose a quantum approach to address it. Specifically, we employ quantum annealing using D-Wave's hardware to explore its potential for solving this problem. To enhance computational feasibility, we develop a preprocessing method that effectively reduces the network size. Our experimental results demonstrate that this reduction technique significantly decreases the problem complexity, making the Quadratic Unconstrained Binary Optimization formulation, the standard input for quantum annealers, better suited for existing quantum hardware. Furthermore, the results highlight the potential of quantum annealing as a promising and innovative approach for solving the STSP.
[AI-70] HCAF-DTA: drug-target binding affinity prediction with cross-attention fused hypergraph neural networks
链接: https://arxiv.org/abs/2504.02014
作者: Jiannuo Li,Lan Yao
类目: Biomolecules (q-bio.BM); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Accurate prediction of the binding affinity between drugs and target proteins is a core task in computer-aided drug design. Existing deep learning methods tend to ignore the internal sub-structural features of drug molecules and drug-target interactions, resulting in limited prediction performance. In this paper, we propose HCAF-DTA, a drug-target affinity prediction model based on a cross-attention fused hypergraph neural network. The model innovatively introduces hypergraph representations in the feature extraction stage: drug molecule hypergraphs are constructed with a tree decomposition algorithm, and sub-structural and global features are extracted by fusing a hypergraph neural network with a graph neural network through skip connections, where hyperedges can efficiently characterise functional groups and other key chemical features; for protein feature extraction, a weighted graph is constructed from the residue contact maps predicted by the ESM model, and multilayer graph neural networks capture spatial dependencies. In the prediction stage, a bidirectional multi-head cross-attention mechanism models intermolecular interactions from the dual viewpoints of atoms and amino acids, and cross-modal features with correlated information are fused by attention. Experiments on benchmark datasets such as Davis and KIBA show that HCAF-DTA outperforms the state of the art on all three performance evaluation metrics, with MSE reaching 0.198 and 0.122 on the two datasets, an improvement of up to 4% over the best baseline.
[AI-71] Test-time Adaptation for Foundation Medical Segmentation Model without Parametric Updates
链接: https://arxiv.org/abs/2504.02008
作者: Kecheng Chen,Xinyu Luo,Tiexin Qin,Jie Liu,Hui Liu,Victor Ho Fun Lee,Hong Yan,Haoliang Li
类目: Quantitative Methods (q-bio.QM); Artificial Intelligence (cs.AI)
*备注: Under review
点击查看摘要
Abstract:Foundation medical segmentation models, with MedSAM being the most popular, have achieved promising performance across organs and lesions. However, MedSAM still suffers from compromised performance on specific lesions with intricate structures and appearance, as well as bounding box prompt-induced perturbations. Although current test-time adaptation (TTA) methods for medical image segmentation may tackle this issue, partial (e.g., batch normalization) or whole parametric updates restrict their effectiveness due to limited update signals or catastrophic forgetting in large models. Meanwhile, these approaches ignore the computational complexity during adaptation, which is particularly significant for modern foundation models. To this end, our theoretical analyses reveal that directly refining image embeddings is feasible to approach the same goal as parametric updates under the MedSAM architecture, which enables us to realize high computational efficiency and segmentation performance without the risk of catastrophic forgetting. Under this framework, we propose to encourage maximizing factorized conditional probabilities of the posterior prediction probability using a proposed distribution-approximated latent conditional random field loss combined with an entropy minimization loss. Experiments show that we achieve about 3% Dice score improvements across three datasets while reducing computational complexity by over 7 times.
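A simplified sketch of the parametric-update-free idea: treat the image embedding itself as the only free variable and minimize an unsupervised loss through the frozen decoder. Here we use entropy minimization alone; the paper combines it with a distribution-approximated latent CRF loss, and all names below are our own.

```python
import torch

# Refine the image embedding z at test time (decoder frozen, no model weights
# updated) by minimizing the binary entropy of the predicted mask.
def refine_embedding(z0, decode_logits, steps=10, lr=1e-2):
    z = z0.clone().requires_grad_(True)
    opt = torch.optim.Adam([z], lr=lr)
    for _ in range(steps):
        p = torch.sigmoid(decode_logits(z))   # per-pixel foreground probability
        ent = -(p * p.clamp_min(1e-8).log()
                + (1 - p) * (1 - p).clamp_min(1e-8).log()).mean()
        opt.zero_grad(); ent.backward(); opt.step()
    return z.detach()

dec = torch.nn.Linear(64, 256).requires_grad_(False)  # frozen decoder stand-in
z_refined = refine_embedding(torch.randn(1, 64), dec)
```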
[AI-72] Universally applicable and tunable graph-based coarse-graining for Machine learning force fields
链接: https://arxiv.org/abs/2504.01973
作者: Christoph Brunken,Sebastien Boyer,Mustafa Omar,Martin Maarand,Olivier Peltre,Solal Attias,Bakary N’tji Diallo,Anastasia Markina,Olaf Othersen,Oliver Bent
类目: Chemical Physics (physics.chem-ph); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Coarse-grained (CG) force field methods for molecular systems are a crucial tool to simulate large biological macromolecules and are therefore essential for characterisations of biomolecular systems. While state-of-the-art deep learning (DL)-based models for all-atom force fields have improved immensely over recent years, we observe and analyse significant limitations of the currently available approaches for DL-based CG simulations. In this work, we present the first transferable DL-based CG force field approach (i.e., not specific to only one narrowly defined system type) applicable to a wide range of biosystems. To achieve this, our CG algorithm does not rely on hard-coded rules and is tuned to output coarse-grained systems optimised for minimal statistical noise in the ground truth CG forces, which results in significant improvement of model training. Our force field model is also the first CG variant that is based on the MACE architecture and is trained on a custom dataset created by a new approach based on the fragmentation of large biosystems covering protein, RNA and lipid chemistry. We demonstrate that our model can be applied in molecular dynamics simulations to obtain stable and qualitatively accurate trajectories for a variety of systems, while also discussing cases for which we observe limited reliability.
[AI-73] Differentiable Optimization for Deep Learning-Enhanced DC Approximation of AC Optimal Power Flow
链接: https://arxiv.org/abs/2504.01970
作者: Andrew Rosemberg,Michael Klamkin
类目: Optimization and Control (math.OC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注: 9 pages, 5 figures
点击查看摘要
Abstract:The growing scale of power systems and the increasing uncertainty introduced by renewable energy sources necessitates novel optimization techniques that are significantly faster and more accurate than existing methods. The AC Optimal Power Flow (AC-OPF) problem, a core component of power grid optimization, is often approximated using linearized DC Optimal Power Flow (DC-OPF) models for computational tractability, albeit at the cost of suboptimal and inefficient decisions. To address these limitations, we propose a novel deep learning-based framework for network equivalency that enhances DC-OPF to more closely mimic the behavior of AC-OPF. The approach utilizes recent advances in differentiable optimization, incorporating a neural network trained to predict adjusted nodal shunt conductances and branch susceptances in order to account for nonlinear power flow behavior. The model can be trained end-to-end using modern deep learning frameworks by leveraging the implicit function theorem. Results demonstrate the framework’s ability to significantly improve prediction accuracy, paving the way for more reliable and efficient power systems.
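For reference, the DC approximation whose parameters the network adjusts is linear in the bus voltage angles; a minimal sketch of branch flows under learned susceptances b (nodal shunt conductances would be handled analogously) is:

```python
import numpy as np

# Linearized DC power flow: the flow on line (i, j) is b_ij * (theta_i - theta_j),
# so predicting adjusted susceptances b reshapes every flow without leaving
# the linear model.
def dc_line_flows(theta, branches, b):
    return np.array([bk * (theta[i] - theta[j]) for (i, j), bk in zip(branches, b)])

theta = np.array([0.0, -0.05, 0.02])   # bus voltage angles in radians
print(dc_line_flows(theta, [(0, 1), (1, 2)], [10.0, 8.0]))
```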
机器学习
[LG-0] Atrial constitutive neural networks
链接: https://arxiv.org/abs/2504.02748
作者: Mathias Peirlinck,Kevin Linka,Ellen Kuhl
类目: Computational Engineering, Finance, and Science (cs.CE); Soft Condensed Matter (cond-mat.soft); Machine Learning (cs.LG); Medical Physics (physics.med-ph); Tissues and Organs (q-bio.TO)
*备注:
点击查看摘要
Abstract:This work presents a novel approach for characterizing the mechanical behavior of atrial tissue using constitutive neural networks. Based on experimental biaxial tensile test data of healthy human atria, we automatically discover the most appropriate constitutive material model, thereby overcoming the limitations of traditional, pre-defined models. This approach offers a new perspective on modeling atrial mechanics and is a significant step towards improved simulation and prediction of cardiac health.
[LG-1] Pushing the Limit of PPG Sensing in Sedentary Conditions by Addressing Poor Skin-sensor Contact
链接: https://arxiv.org/abs/2504.02735
作者: Manh Pham Hung,Matthew Yiwen Ho,Yiming Zhang,Dimitris Spathis,Aaqib Saeed,Dong Ma
类目: Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Photoplethysmography (PPG) is a widely used non-invasive technique for monitoring cardiovascular health and various physiological parameters on consumer and medical devices. While motion artifacts are well-known challenges in dynamic settings, suboptimal skin-sensor contact in sedentary conditions - a critical issue often overlooked in existing literature - can distort PPG signal morphology, leading to the loss or shift of essential waveform features and therefore degrading sensing performance. In this work, we propose CP-PPG, a novel approach that transforms Contact Pressure-distorted PPG signals into ones with the ideal morphology. CP-PPG incorporates a novel data collection approach, a well-crafted signal processing pipeline, and an advanced deep adversarial model trained with a custom PPG-aware loss function. We validated CP-PPG through comprehensive evaluations, including 1) morphology transformation performance on our self-collected dataset, 2) downstream physiological monitoring performance on public datasets, and 3) in-the-wild performance. Extensive experiments demonstrate substantial and consistent improvements in signal fidelity (Mean Absolute Error: 0.09, 40% improvement over the original signal) as well as downstream performance across all evaluations in Heart Rate (HR), Heart Rate Variability (HRV), Respiration Rate (RR), and Blood Pressure (BP) estimation (on average, 21% improvement in HR; 41-46% in HRV; 6% in RR; and 4-5% in BP). These findings highlight the critical importance of addressing skin-sensor contact issues for accurate and dependable PPG-based physiological monitoring. Furthermore, CP-PPG can serve as a generic, plug-in API to enhance PPG signal quality.
[LG-2] Computing High-dimensional Confidence Sets for Arbitrary Distributions
链接: https://arxiv.org/abs/2504.02723
作者: Chao Gao,Liren Shan,Vaidehi Srinivas,Aravindan Vijayaraghavan
类目: Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG); Statistics Theory (math.ST); Machine Learning (stat.ML)
*备注:
点击查看摘要
Abstract:We study the problem of learning a high-density region of an arbitrary distribution over \mathbb{R}^d. Given a target coverage parameter \delta, and sample access to an arbitrary distribution D, we want to output a confidence set S \subset \mathbb{R}^d such that S achieves \delta coverage of D, i.e., \mathbb{P}_{y \sim D}[y \in S] \ge \delta, and the volume of S is as small as possible. This is a central problem in high-dimensional statistics with applications in finding confidence sets, uncertainty quantification, and support estimation. In the most general setting, this problem is statistically intractable, so we restrict our attention to competing with sets from a concept class C with bounded VC-dimension. An algorithm is competitive with class C if, given samples from an arbitrary distribution D, it outputs in polynomial time a set that achieves \delta coverage of D, and whose volume is competitive with the smallest set in C with the required coverage \delta. This problem is computationally challenging even in the basic setting when C is the set of all Euclidean balls. Existing algorithms based on coresets find in polynomial time a ball whose volume is an \exp(\tilde{O}(d/\log d))-factor competitive with the volume of the best ball. Our main result is an algorithm that finds a confidence set whose volume is an \exp(\tilde{O}(d^{2/3}))-factor competitive with the optimal ball having the desired coverage. The algorithm is improper (it outputs an ellipsoid). Combined with our computational intractability result for proper learning balls within an \exp(\tilde{O}(d^{1-o(1)})) approximation factor in volume, our results provide an interesting separation between proper and (improper) learning of confidence sets.
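As a baseline illustrating the problem setup (deliberately not the paper's ellipsoid algorithm): a ball centered at the sample mean whose radius is the empirical \delta-quantile of distances achieves roughly \delta coverage, though its volume can be far from optimal in high dimension.

```python
import numpy as np

# Naive delta-coverage ball: center at the mean, radius at the delta-quantile
# of sample distances; coverage holds empirically, volume is not optimized.
def coverage_ball(samples, delta=0.9):
    c = samples.mean(axis=0)
    r = np.quantile(np.linalg.norm(samples - c, axis=1), delta)
    return c, r

X = np.random.randn(10000, 5)
c, r = coverage_ball(X, delta=0.9)
print((np.linalg.norm(X - c, axis=1) <= r).mean())   # ~0.9 empirical coverage
```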
[LG-3] GPTQv2: Efficient Finetuning-Free Quantization for Asymmetric Calibration
链接: https://arxiv.org/abs/2504.02692
作者: Yuhang Li,Ruokai Yin,Donghyun Lee,Shiting Xiao,Priyadarshini Panda
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:We introduce GPTQv2, a novel finetuning-free quantization method for compressing large-scale transformer architectures. Unlike the previous GPTQ method, which independently calibrates each layer, we always match the quantized layer's output to the exact output in the full-precision model, resulting in a scheme that we call asymmetric calibration. Such a scheme can effectively reduce the quantization error accumulated in previous layers. We analyze this problem using optimal brain compression to derive a closed-form solution. The new solution explicitly minimizes the quantization error as well as the accumulated asymmetry error. Furthermore, we utilize various techniques to parallelize the solution calculation, including channel parallelization, neuron decomposition, and Cholesky reformulation for matrix fusion. As a result, GPTQv2 is easy to implement, requiring only 20 more lines of code than GPTQ while improving its performance under low-bit quantization. Remarkably, on a single GPU, we quantize a 405B language transformer as well as EVA-02, the top-ranked vision transformer, which achieves 90% ImageNet pretraining accuracy. Code is available at this http URL.
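The following sketch contrasts the two per-layer calibration objectives described in the abstract. It is a schematic reading of the idea, not the paper's implementation: `X_fp` denotes calibration inputs along the full-precision path, `X_q` the same batch after passing through already-quantized earlier layers, and the rounding-based quantizer is a crude stand-in.

```python
import torch

torch.manual_seed(0)
W = torch.randn(64, 64)                   # full-precision layer weight
Wq = torch.round(W * 8) / 8               # crude stand-in for quantization
X_fp = torch.randn(256, 64)               # inputs on the full-precision path
X_q = X_fp + 0.05 * torch.randn(256, 64)  # same batch after quantized layers

def gptq_objective(W, Wq, X_fp):
    # GPTQ-style symmetric calibration: each layer only matches itself
    # on full-precision inputs, ignoring upstream quantization error.
    return ((X_fp @ W.T - X_fp @ Wq.T) ** 2).mean()

def gptqv2_objective(W, Wq, X_fp, X_q):
    # Asymmetric calibration: match the quantized layer on *quantized*
    # inputs to the exact full-precision output, so error accumulated
    # in earlier layers is compensated as well.
    return ((X_fp @ W.T - X_q @ Wq.T) ** 2).mean()

print(gptq_objective(W, Wq, X_fp), gptqv2_objective(W, Wq, X_fp, X_q))
```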
[LG-4] Handover and SINR-Aware Path Optimization in 5G-UAV mmWave Communication using DRL
链接: https://arxiv.org/abs/2504.02688
作者: Achilles Kiwanuka Machumilane,Alberto Gotta,Pietro Cassarà
类目: Networking and Internet Architecture (cs.NI); Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注:
点击查看摘要
Abstract:Path planning and optimization for unmanned aerial vehicles (UAVs)-assisted next-generation wireless networks is critical for mobility management and ensuring UAV safety and ubiquitous connectivity, especially in dense urban environments with street canyons and tall buildings. Traditional statistical and model-based techniques have been successfully used for path optimization in communication networks. However, when dynamic channel propagation characteristics such as line-of-sight (LOS), interference, handover, and signal-to-interference and noise ratio (SINR) are included in path optimization, statistical and model-based path planning solutions become obsolete since they cannot adapt to the dynamic and time-varying wireless channels, especially in the mmWave bands. In this paper, we propose a novel model-free actor-critic deep reinforcement learning (AC-DRL) framework for path optimization in UAV-assisted 5G mmWave wireless networks, which combines four important aspects of UAV communication: flight time, handover, connectivity, and SINR. We train an AC-RL agent that enables a UAV connected to a gNB to determine the optimal path to a desired destination in the shortest possible time with minimal gNB handover, while maintaining connectivity and the highest possible SINR. We train our model with data from a powerful ray tracing tool called Wireless InSite, which uses 3D images of the propagation environment and provides data that closely resembles the real propagation environment. The simulation results show that our system has superior performance in tracking high SINR compared to other selected RL algorithms.
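As a rough illustration of how the four objectives might be scalarized into a single RL reward, consider the hypothetical shaping function below. The function and all weights are assumptions for exposition; the paper does not publish this exact form.

```python
def uav_reward(sinr_db: float, connected: bool, handover: bool,
               step_cost: float = 0.1, w_sinr: float = 0.05,
               w_handover: float = 1.0, w_conn: float = 5.0) -> float:
    """Hypothetical per-step reward combining the paper's four aspects:
    flight time (via a per-step cost), handover, connectivity, SINR."""
    r = -step_cost                      # shorter flights accrue less cost
    r += w_sinr * sinr_db               # reward tracking high SINR
    r -= w_handover if handover else 0  # penalize gNB handovers
    r -= 0.0 if connected else w_conn   # heavy penalty for losing the link
    return r

print(uav_reward(sinr_db=18.0, connected=True, handover=False))  # 0.8
```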
[LG-5] Compositionality Unlocks Deep Interpretable Models
链接: https://arxiv.org/abs/2504.02667
作者: Thomas Dooms,Ward Gauderis,Geraint A. Wiggins,Jose Oramas
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:We propose $\chi$-net, an intrinsically interpretable architecture combining the compositional multilinear structure of tensor networks with the expressivity and efficiency of deep neural networks. $\chi$-nets retain equal accuracy compared to their baseline counterparts. Our novel, efficient diagonalisation algorithm, ODT, reveals linear low-rank structure in a multilayer SVHN model. We leverage this toward formal weight-based interpretability and model compression.
[LG-6] Integrating Human Knowledge Through Action Masking in Reinforcement Learning for Operations Research
链接: https://arxiv.org/abs/2504.02662
作者: Mirko Stappert,Bernhard Lutz,Niklas Goby,Dirk Neumann
类目: Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注:
点击查看摘要
Abstract:Reinforcement learning (RL) provides a powerful method to address problems in operations research. However, its real-world application often fails due to a lack of user acceptance and trust. A possible remedy is to provide managers with the possibility of altering the RL policy by incorporating human expert knowledge. In this study, we analyze the benefits and caveats of including human knowledge via action masking. While action masking has so far been used to exclude invalid actions, its ability to integrate human expertise remains underexplored. Human knowledge is often encapsulated in heuristics, which suggest reasonable, near-optimal actions in certain situations. Enforcing such actions should hence increase trust among the human workforce to rely on the model’s decisions. Yet, a strict enforcement of heuristic actions may also restrict the policy from exploring superior actions, thereby leading to overall lower performance. We analyze the effects of action masking based on three problems with different characteristics, namely, paint shop scheduling, peak load management, and inventory management. Our findings demonstrate that incorporating human knowledge through action masking can achieve substantial improvements over policies trained without action masking. In addition, we find that action masking is crucial for learning effective policies in constrained action spaces, where certain actions can only be performed a limited number of times. Finally, we highlight the potential for suboptimal outcomes when action masks are overly restrictive.
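The mechanism itself is simple, as the minimal PyTorch sketch below shows: a boolean mask, here derived from a heuristic rather than from action validity, sets the logits of disallowed actions to negative infinity before the softmax, so the policy can never select them. The specific mask values are illustrative.

```python
import torch

def masked_policy_logits(logits: torch.Tensor, mask: torch.Tensor):
    """Standard action masking: assign -inf to disallowed actions so the
    softmax gives them zero probability. The mask may encode a human
    heuristic (allowed=True) instead of mere action validity."""
    return logits.masked_fill(~mask, float("-inf"))

logits = torch.tensor([1.2, 0.3, -0.5, 2.0])
mask = torch.tensor([True, False, True, True])  # heuristic forbids action 1
probs = torch.softmax(masked_policy_logits(logits, mask), dim=-1)
print(probs)  # action 1 has probability exactly 0
```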
[LG-7] MiLo: Efficient Quantized MoE Inference with Mixture of Low-Rank Compensators
链接: https://arxiv.org/abs/2504.02658
作者: Beichen Huang,Yueming Yuan,Zelei Shao,Minjia Zhang
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:A critical approach for efficiently deploying Mixture-of-Experts (MoE) models with massive parameters is quantization. However, state-of-the-art MoE models suffer from non-negligible accuracy loss with extreme quantization, such as under 4 bits. To address this, we introduce MiLo, a novel method that augments highly quantized MoEs with a mixture of low-rank compensators. These compensators consume only a small amount of additional memory but significantly recover accuracy loss from extreme quantization. MiLo also identifies that MoE models exhibit distinctive characteristics across weights due to their hybrid dense-sparse architectures, and employs adaptive rank selection policies along with iterative optimizations to close the accuracy gap. MiLo does not rely on calibration data, allowing it to generalize to different MoE models and datasets without overfitting to a calibration set. To avoid the hardware inefficiencies of extreme quantization, such as 3-bit, MiLo develops Tensor Core-friendly 3-bit kernels, enabling measured latency speedups on 3-bit quantized MoE models. Our evaluation shows that MiLo outperforms existing methods on SoTA MoE models across various tasks.
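A minimal sketch of the core idea, under the assumption that a compensator is fit per weight matrix via truncated SVD of the quantization residual (MiLo additionally uses adaptive rank selection and iterative optimization, which are omitted here):

```python
import torch

def lowrank_compensator(W: torch.Tensor, W_q: torch.Tensor, rank: int):
    """Fit A @ B to the quantization residual W - W_q via truncated SVD,
    so that W_q + A @ B approximates W at small extra memory cost."""
    U, S, Vh = torch.linalg.svd(W - W_q, full_matrices=False)
    A = U[:, :rank] * S[:rank]   # (out, r), singular values folded in
    B = Vh[:rank]                # (r, in)
    return A, B

W = torch.randn(256, 256)
W_q = torch.round(W * 4) / 4     # crude stand-in for 3-bit quantization
A, B = lowrank_compensator(W, W_q, rank=16)
print(torch.norm(W - W_q), torch.norm(W - (W_q + A @ B)))  # error shrinks
```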
[LG-8] Solving the Paint Shop Problem with Flexible Management of Multi-Lane Buffers Using Reinforcement Learning and Action Masking
链接: https://arxiv.org/abs/2504.02644
作者: Mirko Stappert,Bernhard Lutz,Janis Brammer,Dirk Neumann
类目: Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注:
点击查看摘要
Abstract:In the paint shop problem, an unordered incoming sequence of cars assigned to different colors has to be reshuffled with the objective of minimizing the number of color changes. To reshuffle the incoming sequence, manufacturers can employ a first-in-first-out multi-lane buffer system allowing store and retrieve operations. So far, prior studies primarily focused on simple decision heuristics like greedy or simplified problem variants that do not allow full flexibility when performing store and retrieve operations. In this study, we propose a reinforcement learning approach to minimize color changes for the flexible problem variant, where store and retrieve operations can be performed in an arbitrary order. After proving that greedy retrieval is optimal, we incorporate this finding into the model using action masking. Our evaluation, based on 170 problem instances with 2-8 buffer lanes and 5-15 colors, shows that our approach reduces color changes compared to existing methods by considerable margins depending on the problem size. Furthermore, we demonstrate the robustness of our approach towards different buffer sizes and imbalanced color distributions.
[LG-9] Reservoir Computing: A New Paradigm for Neural Networks
链接: https://arxiv.org/abs/2504.02639
作者: Felix Grezes
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:A literature review of Reservoir Computing. Even before Artificial Intelligence was its own field of computational science, humanity has tried to mimic the activity of the human brain. In the early 1940s the first artificial neuron models were created as purely mathematical concepts. Over the years, ideas from neuroscience and computer science were used to develop the modern Neural Network. The interest in these models rose quickly but fell when they failed to be successfully applied to practical applications, and rose again in the late 2000s with the drastic increase in computing power, notably in the field of natural language processing, for example with state-of-the-art speech recognizers making heavy use of deep neural networks. Recurrent Neural Networks (RNNs), a class of neural networks with cycles in the network, exacerbate the difficulties of traditional neural nets: slow convergence limits their use to small networks, and the difficulty of training them through gradient-descent methods, caused by their recurrent dynamics, has hindered research on RNNs. Yet their biological plausibility and their capability to model dynamical systems, rather than mere functions, make them interesting to computational researchers. Reservoir Computing emerges as a solution to these problems that RNNs traditionally face. Promising to be both theoretically sound and computationally fast, Reservoir Computing has already been applied successfully to numerous fields: natural language processing, computational biology and neuroscience, robotics, even physics. This survey will explore the history and appeal of both traditional feed-forward and recurrent neural networks, before describing the theory and models of this new reservoir computing paradigm. Finally, recent papers using reservoir computing in a variety of scientific fields will be reviewed.
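Since the survey centers on reservoir computing, a minimal echo state network makes the paradigm tangible: the recurrent reservoir is random and fixed (its spectral radius scaled below 1), and only a linear readout is trained, here by ridge regression. All sizes and constants below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# A minimal echo state network: fixed random reservoir, trained readout.
n_in, n_res = 1, 200
W_in = rng.uniform(-0.5, 0.5, (n_res, n_in))
W = rng.uniform(-0.5, 0.5, (n_res, n_res))
W *= 0.9 / np.max(np.abs(np.linalg.eigvals(W)))  # spectral radius < 1

def run_reservoir(u):
    x, states = np.zeros(n_res), []
    for u_t in u:
        x = np.tanh(W_in @ np.atleast_1d(u_t) + W @ x)
        states.append(x.copy())
    return np.array(states)

# One-step-ahead prediction of a sine wave with a ridge-regression readout.
u = np.sin(np.linspace(0, 20 * np.pi, 2000))
X, y = run_reservoir(u[:-1]), u[1:]
W_out = np.linalg.solve(X.T @ X + 1e-6 * np.eye(n_res), X.T @ y)
print("train MSE:", np.mean((X @ W_out - y) ** 2))
```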
[LG-10] Grammar-based Ordinary Differential Equation Discovery
链接: https://arxiv.org/abs/2504.02630
作者: Karin L. Yu,Eleni Chatzi,Georgios Kissas
类目: Machine Learning (cs.LG); Computational Engineering, Finance, and Science (cs.CE); Symbolic Computation (cs.SC)
*备注:
点击查看摘要
Abstract:The understanding and modeling of complex physical phenomena through dynamical systems has historically driven scientific progress, as it provides the tools for predicting the behavior of different systems under diverse conditions through time. The discovery of dynamical systems has been indispensable in engineering, as it allows for the analysis and prediction of complex behaviors for computational modeling, diagnostics, prognostics, and control of engineered systems. Joining recent efforts that harness the power of symbolic regression in this domain, we propose a novel framework for the end-to-end discovery of ordinary differential equations (ODEs), termed Grammar-based ODE Discovery Engine (GODE). The proposed methodology combines formal grammars with dimensionality reduction and stochastic search for efficiently navigating high-dimensional combinatorial spaces. Grammars allow us to seed domain knowledge and structure for both constraining and exploring the space of candidate expressions. GODE proves to be more sample- and parameter-efficient than state-of-the-art transformer-based models and to discover more accurate and parsimonious ODE expressions than both genetic programming- and other grammar-based methods for more complex inference tasks, such as the discovery of structural dynamics. Thus, we introduce a tool that could play a catalytic role in dynamics discovery tasks, including modeling, system identification, and monitoring tasks.
[LG-11] Variational Online Mirror Descent for Robust Learning in Schrödinger Bridge
链接: https://arxiv.org/abs/2504.02618
作者: Dong-Sig Han,Jaein Kim,Hee Bin Yoo,Byoung-Tak Zhang
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
点击查看摘要
Abstract:Schrödinger bridge (SB) has evolved into a universal class of probabilistic generative models. In practice, however, estimated learning signals are often uncertain, and the reliability promised by existing methods is often based on speculative optimal-case scenarios. Recent studies regarding the Sinkhorn algorithm through mirror descent (MD) have gained attention, revealing geometric insights into solution acquisition of the SB problems. In this paper, we propose a variational online MD (OMD) framework for the SB problems, which provides further stability to SB solvers. We formally prove convergence and a regret bound for the novel OMD formulation of SB acquisition. As a result, we propose a simulation-free SB algorithm called Variational Mirrored Schrödinger Bridge (VMSB) by utilizing the Wasserstein-Fisher-Rao geometry of the Gaussian mixture parameterization for Schrödinger potentials. Based on the Wasserstein gradient flow theory, the algorithm offers tractable learning dynamics that precisely approximate each OMD step. In experiments, we validate the performance of the proposed VMSB algorithm across an extensive suite of benchmarks. VMSB consistently outperforms contemporary SB solvers on a range of SB problems, demonstrating the robustness predicted by our theory.
[LG-12] State-Space Model Inspired Multiple-Input Multiple-Output Spiking Neurons
链接: https://arxiv.org/abs/2504.02591
作者: Sanja Karilanova,Subhrakanti Dey,Ayça Özçelikkale
类目: Machine Learning (cs.LG)
*备注: 9 pages, 3 figures, 6 tables, conference - 2025 Neuro Inspired Computational Elements (NICE)
点击查看摘要
Abstract:In spiking neural networks (SNNs), the main unit of information processing is the neuron with an internal state. The internal state generates an output spike based on its component associated with the membrane potential. This spike is then communicated to other neurons in the network. Here, we propose a general multiple-input multiple-output (MIMO) spiking neuron model that goes beyond the traditional single-input single-output (SISO) model in the SNN literature. Our proposed framework is based on interpreting the neurons as state-space models (SSMs) with linear state evolutions and non-linear spiking activation functions. We illustrate the trade-offs among various parameters of the proposed SSM-inspired neuron model, such as the number of hidden neuron states and the number of input and output channels, including single-input multiple-output (SIMO) and multiple-input single-output (MISO) models. We show that for SNNs with a small number of neurons with large internal state spaces, significant performance gains may be obtained by increasing the number of output channels of a neuron. In particular, a network of spiking neurons with multiple output channels may achieve the same level of accuracy as a baseline with continuous-valued communications on the same reference network architecture.
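A schematic NumPy rendering of the SSM view, assuming a leaky diagonal state matrix and a hard threshold per output channel (reset dynamics and training are omitted; all shapes are illustrative):

```python
import numpy as np

class MIMOSpikingNeuron:
    """Schematic SSM-view of a spiking neuron: a linear state evolution
    followed by a non-linear spiking readout with several output channels
    (reset dynamics omitted for brevity)."""
    def __init__(self, n_state=8, n_in=3, n_out=2, seed=0):
        rng = np.random.default_rng(seed)
        self.A = 0.9 * np.eye(n_state)              # leaky internal state
        self.B = rng.normal(size=(n_state, n_in))   # input channels
        self.C = rng.normal(size=(n_out, n_state))  # output channels
        self.x = np.zeros(n_state)

    def step(self, u, threshold=1.0):
        self.x = self.A @ self.x + self.B @ u       # linear state update
        return (self.C @ self.x >= threshold).astype(float)  # spikes

neuron = MIMOSpikingNeuron()
rng = np.random.default_rng(1)
for _ in range(5):
    print(neuron.step(rng.normal(size=3)))
```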
[LG-13] MAD: A Magnitude And Direction Policy Parametrization for Stability Constrained Reinforcement Learning
链接: https://arxiv.org/abs/2504.02565
作者: Luca Furieri,Sucheth Shenoy,Danilo Saccani,Andrea Martin,Giancarlo Ferrari Trecate
类目: ystems and Control (eess.SY); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:We introduce magnitude and direction (MAD) policies, a policy parameterization for reinforcement learning (RL) that preserves $L_p$ closed-loop stability for nonlinear dynamical systems. Although complete in their ability to describe all stabilizing controllers, methods based on nonlinear Youla and system-level synthesis are significantly affected by the difficulty of parameterizing $L_p$-stable operators. In contrast, MAD policies introduce explicit feedback on state-dependent features - a key element behind the success of RL pipelines - without compromising closed-loop stability. This is achieved by describing the magnitude of the control input with a disturbance-feedback $L_p$-stable operator, while selecting its direction based on state-dependent features through a universal function approximator. We further characterize the robust stability properties of MAD policies under model mismatch. Unlike existing disturbance-feedback policy parameterizations, MAD policies introduce state-feedback components compatible with model-free RL pipelines, ensuring closed-loop stability without requiring model information beyond open-loop stability. Numerical experiments show that MAD policies trained with deep deterministic policy gradient (DDPG) methods generalize to unseen scenarios, matching the performance of standard neural network policies while guaranteeing closed-loop stability by design.
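The core parameterization is easy to sketch: the control magnitude comes from a disturbance-feedback stable operator, while its direction comes from a state-feature network. The decomposition below is a hypothetical rendering; the trivial leaky filter stands in for the paper's far more general $L_p$-stable construction.

```python
import torch
import torch.nn as nn

class MADPolicy(nn.Module):
    """Schematic magnitude-and-direction policy: |u| from a stable
    disturbance-driven filter, u/|u| from a state-feature network."""
    def __init__(self, n_state: int, n_ctrl: int):
        super().__init__()
        self.direction = nn.Sequential(
            nn.Linear(n_state, 64), nn.Tanh(), nn.Linear(64, n_ctrl))
        self.mag_state = torch.zeros(1)
        self.decay = 0.9          # |decay| < 1 keeps the toy filter stable

    def forward(self, x: torch.Tensor, w: torch.Tensor) -> torch.Tensor:
        # Magnitude: a toy stable (leaky) filter driven by disturbance w.
        self.mag_state = self.decay * self.mag_state + 0.1 * w.norm()
        # Direction: state-dependent unit vector from the network.
        d = self.direction(x)
        d = d / (d.norm() + 1e-8)
        return self.mag_state * d   # control input u

policy = MADPolicy(n_state=4, n_ctrl=2)
print(policy(torch.randn(4), torch.randn(4)))
```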
[LG-14] Probabilistic Pontryagin's Maximum Principle for Continuous-Time Model-Based Reinforcement Learning
链接: https://arxiv.org/abs/2504.02543
作者: David Leeftink,Çağatay Yıldız,Steffen Ridderbusch,Max Hinne,Marcel van Gerven
类目: Machine Learning (cs.LG)
*备注: 7 pages, 2 figures, 2 tables
点击查看摘要
Abstract:Without exact knowledge of the true system dynamics, optimal control of non-linear continuous-time systems requires careful treatment of epistemic uncertainty. In this work, we propose a probabilistic extension to Pontryagin’s maximum principle by minimizing the mean Hamiltonian with respect to epistemic uncertainty. We show minimization of the mean Hamiltonian is a necessary optimality condition when optimizing the mean cost, and propose a multiple shooting numerical method scalable to large-scale probabilistic dynamical models, including ensemble neural ordinary differential equations. Comparisons against state-of-the-art methods in online and offline model-based reinforcement learning tasks show that our probabilistic Hamiltonian formulation leads to reduced trial costs in offline settings and achieves competitive performance in online scenarios. By bridging optimal control and reinforcement learning, our approach offers a principled and practical framework for controlling uncertain systems with learned dynamics.
[LG-15] VISTA: Unsupervised 2D Temporal Dependency Representations for Time Series Anomaly Detection
链接: https://arxiv.org/abs/2504.02498
作者: Sinchee Chin,Fan Zhang,Xiaochen Yang,Jing-Hao Xue,Wenming Yang,Peng Jia,Guijin Wang,Luo Yingqun
类目: Machine Learning (cs.LG); Information Theory (cs.IT)
*备注:
点击查看摘要
Abstract:Time Series Anomaly Detection (TSAD) is essential for uncovering rare and potentially harmful events in unlabeled time series data. Existing methods are highly dependent on clean, high-quality inputs, making them susceptible to noise and real-world imperfections. Additionally, intricate temporal relationships in time series data are often inadequately captured in traditional 1D representations, leading to suboptimal modeling of dependencies. We introduce VISTA, a training-free, unsupervised TSAD algorithm designed to overcome these challenges. VISTA features three core modules: 1) Time Series Decomposition using Seasonal and Trend Decomposition via Loess (STL) to decompose noisy time series into trend, seasonal, and residual components; 2) Temporal Self-Attention, which transforms 1D time series into 2D temporal correlation matrices for richer dependency modeling and anomaly detection; and 3) Multivariate Temporal Aggregation, which uses a pretrained feature extractor to integrate cross-variable information into a unified, memory-efficient representation. VISTA’s training-free approach enables rapid deployment and easy hyperparameter tuning, making it suitable for industrial applications. It achieves state-of-the-art performance on five multivariate TSAD benchmarks.
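A sketch of the first two modules, assuming statsmodels' STL for the decomposition; the outer-product correlation map below is a simplified stand-in for the paper's temporal self-attention construction.

```python
import numpy as np
from statsmodels.tsa.seasonal import STL

# Module 1: STL splits a noisy series into trend, seasonal, residual.
t = np.arange(600)
rng = np.random.default_rng(0)
series = np.sin(2 * np.pi * t / 50) + 0.01 * t + 0.3 * rng.normal(size=600)
res = STL(series, period=50).fit()

# Module 2 (schematic): lift a 1D window into a 2D temporal-interaction
# map; VISTA uses a self-attention construction, approximated here by a
# simple normalized outer product.
def temporal_map(window: np.ndarray) -> np.ndarray:
    w = (window - window.mean()) / (window.std() + 1e-8)
    return np.outer(w, w)

M = temporal_map(np.asarray(res.resid)[:64])
print(M.shape)   # (64, 64) matrix fed to the anomaly detector
```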
[LG-16] A Physics-Informed Meta-Learning Framework for the Continuous Solution of Parametric PDEs on Arbitrary Geometries
链接: https://arxiv.org/abs/2504.02459
作者: Reza Najian Asl,Yusuke Yamazaki,Kianoosh Taghikhani,Mayu Muramatsu,Markus Apel,Shahed Rezaei
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:In this work, we introduce implicit Finite Operator Learning (iFOL) for the continuous and parametric solution of partial differential equations (PDEs) on arbitrary geometries. We propose a physics-informed encoder-decoder network to establish the mapping between continuous parameter and solution spaces. The decoder constructs the parametric solution field by leveraging an implicit neural field network conditioned on a latent or feature code. Instance-specific codes are derived through a PDE encoding process based on the second-order meta-learning technique. In training and inference, a physics-informed loss function is minimized during the PDE encoding and decoding. iFOL expresses the loss function in an energy or weighted residual form and evaluates it using discrete residuals derived from standard numerical PDE methods. This approach results in the backpropagation of discrete residuals during both training and inference. iFOL features several key properties: (1) its unique loss formulation eliminates the need for the conventional encode-process-decode pipeline previously used in operator learning with conditional neural fields for PDEs; (2) it not only provides accurate parametric and continuous fields but also delivers solution-to-parameter gradients without requiring additional loss terms or sensitivity analysis; (3) it can effectively capture sharp discontinuities in the solution; and (4) it removes constraints on the geometry and mesh, making it applicable to arbitrary geometries and spatial sampling (zero-shot super-resolution capability). We critically assess these features and analyze the network's ability to generalize to unseen samples across both stationary and transient PDEs. The overall performance of the proposed method is promising, demonstrating its applicability to a range of challenging problems in computational mechanics.
[LG-17] The Amenability Framework: Rethinking Causal Ordering Without Estimating Causal Effects
链接: https://arxiv.org/abs/2504.02456
作者: Carlos Fernández-Loría,Jorge Loría
类目: Machine Learning (cs.LG); Methodology (stat.ME)
*备注:
点击查看摘要
Abstract:Who should we prioritize for intervention when we cannot estimate intervention effects? In many applied domains (e.g., advertising, customer retention, and behavioral nudging) prioritization is guided by predictive models that estimate outcome probabilities rather than causal effects. This paper investigates when these predictions (scores) can effectively rank individuals by their intervention effects, particularly when direct effect estimation is infeasible or unreliable. We propose a conceptual framework based on amenability: an individual’s latent proclivity to be influenced by an intervention. We then formalize conditions under which predictive scores serve as effective proxies for amenability. These conditions justify using non-causal scores for intervention prioritization, even when the scores do not directly estimate effects. We further show that, under plausible assumptions, predictive models can outperform causal effect estimators in ranking individuals by intervention effects. Empirical evidence from an advertising context supports our theoretical findings, demonstrating that predictive modeling can offer a more robust approach to targeting than effect estimation. Our framework suggests a shift in focus, from estimating effects to inferring who is amenable, as a practical and theoretically grounded strategy for prioritizing interventions in resource-constrained environments.
[LG-18] Robust Randomized Low-Rank Approximation with Row-Wise Outlier Detection
链接: https://arxiv.org/abs/2504.02432
作者: Aidan Tiruvan
类目: Machine Learning (cs.LG); Numerical Analysis (math.NA)
*备注: 27 pages, 9 figures, preprint
点击查看摘要
Abstract:Robust low-rank approximation under row-wise adversarial corruption can be achieved with a single-pass, randomized procedure that detects and removes outlier rows by thresholding their projected norms. We propose a scalable, non-iterative algorithm that efficiently recovers the underlying low-rank structure in the presence of row-wise adversarial corruption. By first compressing the data with a Johnson-Lindenstrauss projection, our approach preserves the geometry of clean rows while dramatically reducing dimensionality. Robust statistical techniques based on the median and median absolute deviation then enable precise identification and removal of outlier rows with abnormally high norms. The subsequent rank-k approximation achieves near-optimal error bounds with a one-pass procedure that scales linearly with the number of observations. Empirical results confirm that combining random sketches with robust statistics yields efficient, accurate decompositions even in the presence of large fractions of corrupted rows.
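The procedure is simple enough to sketch end to end: project rows with a random Johnson-Lindenstrauss map, then drop rows whose projected norms deviate from the median by more than a few MAD units. Thresholds and sizes below are illustrative; the paper's analysis pins down the actual constants and guarantees.

```python
import numpy as np

def remove_outlier_rows(X: np.ndarray, k: int = 32, tau: float = 3.0):
    """Single-pass sketch of the idea: project rows with a JL-style
    random map, then flag rows whose projected norms are far from the
    median in median-absolute-deviation (MAD) units."""
    rng = np.random.default_rng(0)
    G = rng.normal(size=(X.shape[1], k)) / np.sqrt(k)  # JL projection
    norms = np.linalg.norm(X @ G, axis=1)
    med = np.median(norms)
    mad = np.median(np.abs(norms - med)) + 1e-12
    keep = np.abs(norms - med) / mad <= tau
    return X[keep]

# Clean low-rank rows plus a few adversarial high-norm rows.
rng = np.random.default_rng(1)
clean = rng.normal(size=(1000, 5)) @ rng.normal(size=(5, 100))
corrupt = 50.0 * rng.normal(size=(30, 100))
X = np.vstack([clean, corrupt])
print(X.shape, remove_outlier_rows(X).shape)  # outlier rows removed
```

The rank-k approximation is then a standard truncated SVD on the retained rows.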
[LG-19] Bridging the Theoretical Gap in Randomized Smoothing
链接: https://arxiv.org/abs/2504.02412
作者: Blaise Delattre,Paul Caillon,Quentin Barthélemy,Erwan Fagnou,Alexandre Allauzen
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
点击查看摘要
Abstract:Randomized smoothing has become a leading approach for certifying adversarial robustness in machine learning models. However, a persistent gap remains between theoretical certified robustness and empirical robustness accuracy. This paper introduces a new framework that bridges this gap by leveraging Lipschitz continuity for certification and proposing a novel, less conservative method for computing confidence intervals in randomized smoothing. Our approach tightens the bounds of certified robustness, offering a more accurate reflection of model robustness in practice. Through rigorous experimentation we show that our method improves the robust accuracy, compressing the gap between empirical findings and previous theoretical results. We argue that investigating local Lipschitz constants and designing ad-hoc confidence intervals can further enhance the performance of randomized smoothing. These results pave the way for a deeper understanding of the relationship between Lipschitz continuity and certified robustness.
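For orientation, here is the standard certification baseline this paper tightens, in the style of Cohen et al.: estimate the smoothed classifier's top-class probability by Monte Carlo, lower-bound it with a Clopper-Pearson interval, and convert to a certified radius. The base classifier below is a toy stand-in.

```python
import numpy as np
from scipy.stats import norm, beta

def certified_radius(f, x, sigma=0.25, n=1000, alpha=0.001):
    """Standard randomized-smoothing certificate:
    R = sigma * Phi^{-1}(p_lower), with p_lower a Clopper-Pearson
    lower confidence bound on the top-class probability."""
    noise = np.random.default_rng(0).normal(0, sigma, size=(n,) + x.shape)
    preds = np.array([f(x + e) for e in noise])
    top = np.bincount(preds).argmax()
    k = int((preds == top).sum())
    p_lower = beta.ppf(alpha, k, n - k + 1)   # exact CP lower bound
    if p_lower <= 0.5:
        return top, 0.0                       # abstain from certifying
    return top, sigma * norm.ppf(p_lower)

f = lambda z: int(z.sum() > 0)   # toy stand-in for a base classifier
print(certified_radius(f, np.ones(10)))
```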
[LG-20] Reinforcement Learning for Solving the Pricing Problem in Column Generation: Applications to Vehicle Routing
链接: https://arxiv.org/abs/2504.02383
作者: Abdo Abouelrous,Laurens Bliek,Adriana F. Gabor,Yaoxin Wu,Yingqian Zhang
类目: Machine Learning (cs.LG)
*备注: 25 pages, 7 figures, 7 tables, Journal Submission
点击查看摘要
Abstract:In this paper, we address the problem of Column Generation (CG) using Reinforcement Learning (RL). Specifically, we use a RL model based on the attention-mechanism architecture to find the columns with most negative reduced cost in the Pricing Problem (PP). Unlike previous Machine Learning (ML) applications for CG, our model deploys an end-to-end mechanism as it independently solves the pricing problem without the help of any heuristic. We consider a variant of Vehicle Routing Problem (VRP) as a case study for our method. Through a set of experiments where our method is compared against a Dynamic Programming (DP)-based heuristic for solving the PP, we show that our method solves the linear relaxation to within a reasonable objective gap of 9% in significantly shorter running times, up to over 300 times faster for instances with 100 customers.
[LG-21] Large (Vision) Language Models are Unsupervised In-Context Learners ICLR2025
链接: https://arxiv.org/abs/2504.02349
作者: Artyom Gadetsky,Andrei Atanov,Yulun Jiang,Zhitong Gao,Ghazal Hosseini Mighan,Amir Zamir,Maria Brbic
类目: Machine Learning (cs.LG)
*备注: ICLR 2025 camera-ready
点击查看摘要
Abstract:Recent advances in large language and vision-language models have enabled zero-shot inference, allowing models to solve new tasks without task-specific training. Various adaptation techniques such as prompt engineering, In-Context Learning (ICL), and supervised fine-tuning can further enhance the model’s performance on a downstream task, but they require substantial manual effort to construct effective prompts or labeled examples. In this work, we introduce a joint inference framework for fully unsupervised adaptation, eliminating the need for manual prompt engineering and labeled examples. Unlike zero-shot inference, which makes independent predictions, the joint inference makes predictions simultaneously for all inputs in a given task. Since direct joint inference involves computationally expensive optimization, we develop efficient approximation techniques, leading to two unsupervised adaptation methods: unsupervised fine-tuning and unsupervised ICL. We demonstrate the effectiveness of our methods across diverse tasks and models, including language-only Llama-3.1 on natural language processing tasks, reasoning-oriented Qwen2.5-Math on grade school math problems, vision-language OpenFlamingo on vision tasks, and the API-only access GPT-4o model on massive multi-discipline tasks. Our experiments demonstrate substantial improvements over the standard zero-shot approach, including 39% absolute improvement on the challenging GSM8K math reasoning dataset. Remarkably, despite being fully unsupervised, our framework often performs on par with supervised approaches that rely on ground truth labels.
[LG-22] Toward General and Robust LLM-enhanced Text-attributed Graph Learning
链接: https://arxiv.org/abs/2504.02343
作者: Zihao Zhang,Xunkai Li,Rong-Hua Li,Bing Zhou,Zhenjun Li,Guoren Wang
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Recent advancements in Large Language Models (LLMs) and the proliferation of Text-Attributed Graphs (TAGs) across various domains have positioned LLM-enhanced TAG learning as a critical research area. By utilizing rich graph descriptions, this paradigm leverages LLMs to generate high-quality embeddings, thereby enhancing the representational capacity of Graph Neural Networks (GNNs). However, the field faces significant challenges: (1) the absence of a unified framework to systematize the diverse optimization perspectives arising from the complex interactions between LLMs and GNNs, and (2) the lack of a robust method capable of handling real-world TAGs, which often suffer from text and edge sparsity, leading to suboptimal performance. To address these challenges, we propose UltraTAG, a unified pipeline for LLM-enhanced TAG learning. UltraTAG provides a unified, comprehensive, and domain-adaptive framework that not only organizes existing methodologies but also paves the way for future advancements in the field. Building on this framework, we propose UltraTAG-S, a robust instantiation of UltraTAG designed to tackle the inherent sparsity issues in real-world TAGs. UltraTAG-S employs LLM-based text propagation and text augmentation to mitigate text sparsity, while leveraging LLM-augmented node selection techniques based on PageRank and edge reconfiguration strategies to address edge sparsity. Our extensive experiments demonstrate that UltraTAG-S significantly outperforms existing baselines, achieving improvements of 2.12% and 17.47% in ideal and sparse settings, respectively. Moreover, as the data sparsity ratio increases, the performance improvement of UltraTAG-S also rises, which underscores the effectiveness and robustness of UltraTAG-S.
[LG-23] On shallow feedforward neural networks with inputs from a topological space
链接: https://arxiv.org/abs/2504.02321
作者: Vugar Ismailov
类目: Machine Learning (cs.LG); Functional Analysis (math.FA)
*备注: 10 pages; this article uses material from arXiv:2409.12913
点击查看摘要
Abstract:We study feedforward neural networks with inputs from a topological space (TFNNs). We prove a universal approximation theorem for shallow TFNNs, which demonstrates their capacity to approximate any continuous function defined on this topological space. As an application, we obtain an approximate version of Kolmogorov's superposition theorem for compact metric spaces.
[LG-24] Causal Self-supervised Pretrained Frontend with Predictive Code for Speech Separation
链接: https://arxiv.org/abs/2504.02302
作者: Wupeng Wang,Zexu Pan,Xinke Li,Shuai Wang,Haizhou Li
类目: ound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
*备注: arXiv admin note: text overlap with arXiv:2411.03085
点击查看摘要
Abstract:Speech separation (SS) seeks to disentangle a multi-talker speech mixture into single-talker speech streams. Although SS can be generally achieved using offline methods, such a processing paradigm is not suitable for real-time streaming applications. Causal separation models, which rely only on past and present information, offer a promising solution for real-time streaming. However, these models typically suffer from notable performance degradation due to the absence of future context. In this paper, we introduce a novel frontend that is designed to mitigate the mismatch between training and run-time inference by implicitly incorporating future information into causal models through predictive patterns. The pretrained frontend employs a transformer decoder network with a causal convolutional encoder as the backbone and is pretrained in a self-supervised manner with two innovative pretext tasks: autoregressive hybrid prediction and contextual knowledge distillation. These tasks enable the model to capture predictive patterns directly from mixtures in a self-supervised manner. The pretrained frontend subsequently serves as a feature extractor to generate high-quality predictive patterns. Comprehensive evaluations on synthetic and real-world datasets validated the effectiveness of the proposed pretrained frontend.
[LG-25] SPACE: SPike-Aware Consistency Enhancement for Test-Time Adaptation in Spiking Neural Networks
链接: https://arxiv.org/abs/2504.02298
作者: Xinyu Luo,Kecheng Chen,Pao-Sheng Vincent Sun,Chris Xing Tian,Arindam Basu,Haoliang Li
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Spiking Neural Networks (SNNs), as a biologically plausible alternative to Artificial Neural Networks (ANNs), have demonstrated advantages in terms of energy efficiency, temporal processing, and biological plausibility. However, SNNs are highly sensitive to distribution shifts, which can significantly degrade their performance in real-world scenarios. Traditional test-time adaptation (TTA) methods designed for ANNs often fail to address the unique computational dynamics of SNNs, such as sparsity and temporal spiking behavior. To address these challenges, we propose SPike-Aware Consistency Enhancement (SPACE), the first source-free and single-instance TTA method specifically designed for SNNs. SPACE leverages the inherent spike dynamics of SNNs to maximize the consistency of spike-behavior-based local feature maps across augmented versions of a single test sample, enabling robust adaptation without requiring source data. We evaluate SPACE on multiple datasets, including CIFAR-10-C, CIFAR-100-C, Tiny-ImageNet-C and DVS Gesture-C. Furthermore, SPACE demonstrates strong generalization across different model architectures, achieving consistent performance improvements on both VGG9 and ResNet11. Experimental results show that SPACE outperforms state-of-the-art methods, highlighting its effectiveness and robustness in real-world settings.
[LG-26] FEASE: Shallow AutoEncoding Recommender with Cold Start Handling via Side Features RECSYS2025
链接: https://arxiv.org/abs/2504.02288
作者: Edward DongBo Cui,Lu Zhang,William Ping-hsun Lee
类目: Information Retrieval (cs.IR); Machine Learning (cs.LG)
*备注: Preparing submission to RecSys 2025; 2 Figures; 4 Tables; 13 pages; Python code implementation example
点击查看摘要
Abstract:User and item cold starts present significant challenges in industrial applications of recommendation systems. Supplementing user-item interaction data with metadata is a common solution, but often at the cost of introducing additional biases. In this work, we introduce an augmented EASE model, i.e., FEASE, that seamlessly integrates both user and item side information to address these cold start issues. Our straightforward, autoencoder-based method produces a closed-form solution that leverages rich content signals for cold items while refining user representations in data-sparse environments. Importantly, our method strikes a balance by effectively recommending cold start items and handling cold start users without incurring extra bias, and it maintains strong performance in warm settings. Experimental results demonstrate improved recommendation accuracy and robustness compared to previous collaborative filtering approaches. Moreover, our model serves as a strong baseline for future comparative studies.
[LG-27] Ga$_2$O$_3$ TCAD Mobility Parameter Calibration using Simulation Augmented Machine Learning with Physics Informed Neural Network
链接: https://arxiv.org/abs/2504.02283
作者: Le Minh Long Nguyen,Edric Ong,Matthew Eng,Yuhao Zhang,Hiu Yung Wong
类目: Machine Learning (cs.LG)
*备注: 4 pages, 3 figures
点击查看摘要
Abstract:In this paper, we demonstrate the possibility of performing automatic Technology Computer-Aided-Design (TCAD) parameter calibration using machine learning, verified with experimental data. The machine only needs to be trained by TCAD data. A Schottky Barrier Diode (SBD) fabricated with the emerging ultra-wide-bandgap material Gallium Oxide (Ga$_2$O$_3$) is measured, and its current-voltage (IV) characteristics are used to extract the Ga$_2$O$_3$ Philips Unified Mobility (PhuMob) model parameters, the effective anode workfunction, and the ambient temperature (7 parameters in total). A machine comprised of an autoencoder (AE) and a neural network (NN) (AE-NN) is used. Ga$_2$O$_3$ PhuMob parameters are extracted from the noisy experimental curves. TCAD simulation with the extracted parameters shows that the quality of the parameters is as good as an expert's calibration in the pre-turn-on regime but not in the on-state regime. By using a simple physics-informed neural network (PINN) (AE-PINN), the machine performs as well as the human expert in all regimes.
[LG-28] Enhancing Customer Contact Efficiency with Graph Neural Networks in Credit Card Fraud Detection Workflow
链接: https://arxiv.org/abs/2504.02275
作者: Menghao Huo,Kuan Lu,Qiang Zhu,Zhenrui Chen
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Credit card fraud has been a persistent issue since the last century, causing significant financial losses to the industry. The most effective way to prevent fraud is by contacting customers to verify suspicious transactions. However, while these systems are designed to detect fraudulent activity, they often mistakenly flag legitimate transactions, leading to unnecessary declines that disrupt the user experience and erode customer trust. Frequent false positives can frustrate customers, resulting in dissatisfaction, increased complaints, and a diminished sense of security. To address these limitations, we propose a fraud detection framework incorporating Relational Graph Convolutional Networks (RGCN) to enhance the accuracy and efficiency of identifying fraudulent transactions. By leveraging the relational structure of transaction data, our model reduces the need for direct customer confirmation while maintaining high detection performance. Our experiments are conducted using the IBM credit card transaction dataset to evaluate the effectiveness of this approach.
[LG-29] Reasoning Under 1 Billion: Memory-Augmented Reinforcement Learning for Large Language Models
链接: https://arxiv.org/abs/2504.02273
作者: Hung Le,Dai Do,Dung Nguyen,Svetha Venkatesh
类目: Machine Learning (cs.LG)
*备注: preprint,20 pages
点击查看摘要
Abstract:Recent advances in fine-tuning large language models (LLMs) with reinforcement learning (RL) have shown promising improvements in complex reasoning tasks, particularly when paired with chain-of-thought (CoT) prompting. However, these successes have been largely demonstrated on large-scale models with billions of parameters, where a strong pretraining foundation ensures effective initial exploration. In contrast, RL remains challenging for tiny LLMs with 1 billion parameters or fewer because they lack the necessary pretraining strength to explore effectively, often leading to suboptimal reasoning patterns. This work introduces a novel intrinsic motivation approach that leverages episodic memory to address this challenge, improving tiny LLMs in CoT reasoning tasks. Inspired by human memory-driven learning, our method leverages successful reasoning patterns stored in memory while allowing for controlled exploration to generate novel responses. Intrinsic rewards are computed efficiently using a kNN-based episodic memory, allowing the model to discover new reasoning strategies while quickly adapting to effective past solutions. Experiments on fine-tuning with the GSM8K and AI-MO datasets demonstrate that our approach significantly enhances smaller LLMs' sample efficiency and generalization capability, making RL-based reasoning improvements more accessible in low-resource settings.
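A minimal sketch of a kNN-based episodic memory producing intrinsic novelty rewards; the embedding source, the reward normalization, and all constants are assumptions for exposition rather than the paper's exact design.

```python
import numpy as np

class EpisodicMemory:
    """kNN episodic memory: embeddings far from their k nearest stored
    neighbors earn a larger novelty bonus; embeddings near stored
    successful patterns earn less, keeping exploration controlled."""
    def __init__(self, k: int = 5):
        self.k, self.store = k, []

    def intrinsic_reward(self, emb: np.ndarray) -> float:
        if len(self.store) < self.k:
            return 1.0                        # everything is novel at first
        dists = np.linalg.norm(np.array(self.store) - emb, axis=1)
        mean_knn = np.sort(dists)[: self.k].mean()
        return float(mean_knn / (mean_knn + 1.0))  # squashed into (0, 1)

    def add(self, emb: np.ndarray) -> None:
        self.store.append(emb)

mem, rng = EpisodicMemory(), np.random.default_rng(0)
for step in range(8):
    e = rng.normal(size=16)    # stand-in for a reasoning-trace embedding
    print(step, round(mem.intrinsic_reward(e), 3))
    mem.add(e)
```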
[LG-30] MegaScale-Infer: Serving Mixture-of-Experts at Scale with Disaggregated Expert Parallelism
链接: https://arxiv.org/abs/2504.02263
作者: Ruidong Zhu,Ziheng Jiang,Chao Jin,Peng Wu,Cesar A. Stuardo,Dongyang Wang,Xinlei Zhang,Huaping Zhou,Haoran Wei,Yang Cheng,Jianzhe Xiao,Xinyi Zhang,Lingjun Liu,Haibin Lin,Li-Wen Chang,Jianxi Ye,Xiao Yu,Xuanzhe Liu,Xin Jin,Xin Liu
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Mixture-of-Experts (MoE) showcases tremendous potential to scale large language models (LLMs) with enhanced performance and reduced computational complexity. However, its sparsely activated architecture shifts feed-forward networks (FFNs) from being compute-intensive to memory-intensive during inference, leading to substantially lower GPU utilization and increased operational costs. We present MegaScale-Infer, an efficient and cost-effective system for serving large-scale MoE models. MegaScale-Infer disaggregates attention and FFN modules within each model layer, enabling independent scaling, tailored parallelism strategies, and heterogeneous deployment for both modules. To fully exploit disaggregation in the presence of MoE’s sparsity, MegaScale-Infer introduces ping-pong pipeline parallelism, which partitions a request batch into micro-batches and shuttles them between attention and FFNs for inference. Combined with distinct model parallelism for each module, MegaScale-Infer effectively hides communication overhead and maximizes GPU utilization. To adapt to disaggregated attention and FFN modules and minimize data transmission overhead (e.g., token dispatch), MegaScale-Infer provides a high-performance M2N communication library that eliminates unnecessary GPU-to-CPU data copies, group initialization overhead, and GPU synchronization. Experimental results indicate that MegaScale-Infer achieves up to 1.90x higher per-GPU throughput than state-of-the-art solutions.
[LG-31] Quantum Lipschitz Bandits
链接: https://arxiv.org/abs/2504.02251
作者: Bongsoo Yi,Yue Kang,Yao Li
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:The Lipschitz bandit is a key variant of stochastic bandit problems where the expected reward function satisfies a Lipschitz condition with respect to an arm metric space. With its wide-ranging practical applications, various Lipschitz bandit algorithms have been developed, achieving the cumulative regret lower bound of order $\tilde{O}(T^{(d_z+1)/(d_z+2)})$ over time horizon $T$. Motivated by recent advancements in quantum computing and the demonstrated success of quantum Monte Carlo in simpler bandit settings, we introduce the first quantum Lipschitz bandit algorithms to address the challenges of continuous action spaces and non-linear reward functions. Specifically, we first leverage the elimination-based framework to propose an efficient quantum Lipschitz bandit algorithm named Q-LAE. Next, we present novel modifications to the classical Zooming algorithm, which results in a simple quantum Lipschitz bandit method, Q-Zooming. Both algorithms exploit the computational power of quantum methods to achieve an improved regret bound of $\tilde{O}(T^{d_z/(d_z+1)})$. Comprehensive experiments further validate our improved theoretical findings, demonstrating superior empirical performance compared to existing Lipschitz bandit methods.
[LG-32] CRC-SGAD: Conformal Risk Control for Supervised Graph Anomaly Detection
链接: https://arxiv.org/abs/2504.02248
作者: Songran Bai,Xiaolong Zheng,Daniel Dajun Zeng
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Graph Anomaly Detection (GAD) is critical in security-sensitive domains, yet faces reliability challenges: miscalibrated confidence estimation (underconfidence in normal nodes, overconfidence in anomalies), adversarial vulnerability of the derived confidence scores under structural perturbations, and limited efficacy of conventional calibration methods for sparse anomaly patterns. Thus we propose CRC-SGAD, a framework integrating statistical risk control into GAD via two innovations: (1) A Dual-Threshold Conformal Risk Control mechanism that provides theoretically guaranteed bounds for both False Negative Rate (FNR) and False Positive Rate (FPR) through providing prediction sets; (2) A Subgraph-aware Spectral Graph Neural Calibrator (SSGNC) that optimizes node representations through adaptive spectral filtering while reducing the size of prediction sets via hybrid loss optimization. Experiments on four datasets and five GAD models demonstrate statistically significant improvements in FNR and FPR control and prediction set size. CRC-SGAD establishes a paradigm for statistically rigorous anomaly detection in graph-structured security applications.
[LG-33] Secure Generalization through Stochastic Bidirectional Parameter Updates Using Dual-Gradient Mechanism
链接: https://arxiv.org/abs/2504.02213
作者: Shourya Goel,Himanshi Tibrewal,Anant Jain,Anshul Pundhir,Pravendra Singh
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Federated learning (FL) has gained increasing attention due to privacy-preserving collaborative training on decentralized clients, mitigating the need to upload sensitive data to a central server directly. Nonetheless, recent research has underscored the risk of exposing private data to adversaries, even within FL frameworks. In general, existing methods sacrifice performance while ensuring resistance to privacy leakage in FL. We overcome these issues and generate diverse models at a global server through the proposed stochastic bidirectional parameter update mechanism. Using diverse models, we improved the generalization and feature representation in the FL setup, which also helped to improve the robustness of the model against privacy leakage without hurting the model’s utility. We use global models from past FL rounds to follow systematic perturbation in parameter space at the server to ensure model generalization and resistance against privacy attacks. We generate diverse models (in close neighborhoods) for each client by using systematic perturbations in model parameters at a fine-grained level (i.e., altering each convolutional filter across the layers of the model) to improve the generalization and security perspective. We evaluated our proposed approach on four benchmark datasets to validate its superiority. We surpassed the state-of-the-art methods in terms of model utility and robustness towards privacy leakage. We have proven the effectiveness of our method by evaluating performance using several quantitative and qualitative results.
[LG-34] A User-Tunable Machine Learning Framework for Step-Wise Synthesis Planning
链接: https://arxiv.org/abs/2504.02191
作者: Shivesh Prakash,Viki Kumar Prasad,Hans-Arno Jacobsen
类目: Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:We introduce MHNpath, a machine learning-driven retrosynthetic tool designed for computer-aided synthesis planning. Leveraging modern Hopfield networks and novel comparative metrics, MHNpath efficiently prioritizes reaction templates, improving the scalability and accuracy of retrosynthetic predictions. The tool incorporates a tunable scoring system that allows users to prioritize pathways based on cost, reaction temperature, and toxicity, thereby facilitating the design of greener and cost-effective reaction routes. We demonstrate its effectiveness through case studies involving complex molecules from ChemByDesign, showcasing its ability to predict novel synthetic and enzymatic pathways. Furthermore, we benchmark MHNpath against existing frameworks, replicating experimentally validated “gold-standard” pathways from PaRoutes. Our case studies reveal that the tool can generate shorter, cheaper, moderate-temperature routes employing green solvents, as exemplified by compounds such as dronabinol, arformoterol, and lupinine.
[LG-35] FastFlow: Early Yet Robust Network Flow Classification using the Minimal Number of Time-Series Packets
链接: https://arxiv.org/abs/2504.02174
作者: Rushi Jayeshkumar Babaria,Minzhao Lyu,Gustavo Batista,Vijay Sivaraman
类目: Networking and Internet Architecture (cs.NI); Machine Learning (cs.LG)
*备注: This paper is accepted at ACM SIGMETRICS 2025. Proc. ACM Meas. Anal. Comput. Syst (2025)
点击查看摘要
Abstract:Network traffic classification is of great importance for network operators in their daily routines, such as analyzing the usage patterns of multimedia applications and optimizing network configurations. Internet service providers (ISPs) that operate high-speed links expect network flow classifiers to accurately classify flows early, using the minimal number of necessary initial packets per flow. These classifiers must also be robust to packet sequence disorders in candidate flows and capable of detecting unseen flow types that are not within the existing classification scope, which are not well achieved by existing methods. In this paper, we develop FastFlow, a time-series flow classification method that accurately classifies network flows as one of the known types or the unknown type, which dynamically selects the minimal number of packets to balance accuracy and efficiency. Toward the objectives, we first develop a flow representation process that converts packet streams at both per-packet and per-slot granularity for precise packet statistics with robustness to packet sequence disorders. Second, we develop a sequential decision-based classification model that leverages LSTM architecture trained with reinforcement learning. Our model makes dynamic decisions on the minimal number of time-series data points per flow for the confident classification as one of the known flow types or an unknown one. We evaluated our method on public datasets and demonstrated its superior performance in early and accurate flow classification. Deployment insights on the classification of over 22.9 million flows across seven application types and 33 content providers in a campus network over one week are discussed, showing that FastFlow requires an average of only 8.37 packets and 0.5 seconds to classify the application type of a flow with over 91% accuracy and over 96% accuracy for the content providers.
[LG-36] Example-Free Learning of Regular Languages with Prefix Queries
链接: https://arxiv.org/abs/2504.02170
作者: Eve Fernando,Sasha Rubin,Rahul Gopinath
类目: Formal Languages and Automata Theory (cs.FL); Machine Learning (cs.LG); Programming Languages (cs.PL); Software Engineering (cs.SE)
*备注:
点击查看摘要
Abstract:Language learning refers to the problem of inferring a mathematical model which accurately represents a formal language. Many language learning algorithms learn by asking certain types of queries about the language being modeled. Language learning is of practical interest in the field of cybersecurity, where it is used to model the language accepted by a program's input parser (also known as its input processor). In this setting, a learner can only query a string of its choice by executing the parser on it, which limits the language learning algorithms that can be used. Most practical parsers can indicate not only whether the string is valid or not, but also where the parsing failed. This extra information can be leveraged into producing a type of query we call the prefix query. Notably, no existing language learning algorithms make use of prefix queries, though some ask membership queries i.e., they ask whether or not a given string is valid. When these approaches are used to learn the language of a parser, the prefix information provided by the parser remains unused. In this work, we present PL*, the first known language learning algorithm to make use of the prefix query, and a novel modification of the classical L* algorithm. We show both theoretically and empirically that PL* is able to learn more efficiently than L* due to its ability to exploit the additional information given by prefix queries over membership queries. Furthermore, we show how PL* can be used to learn the language of a parser, by adapting it to a more practical setting in which prefix queries are the only source of information available to it; that is, it does not have access to any labelled examples or any other types of queries. We demonstrate empirically that, even in this more constrained setting, PL* is still capable of accurately learning a range of languages of practical interest.
[LG-37] Like Oil and Water: Group Robustness Methods and Poisoning Defenses May Be at Odds ICLR2024
链接: https://arxiv.org/abs/2504.02142
作者: Michael-Andrei Panaitescu-Liess,Yigitcan Kaya,Sicheng Zhu,Furong Huang,Tudor Dumitras
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*备注: 22 pages, 3 figures. Published at ICLR 2024
点击查看摘要
Abstract:Group robustness has become a major concern in machine learning (ML) as conventional training paradigms were found to produce high error on minority groups. Without explicit group annotations, proposed solutions rely on heuristics that aim to identify and then amplify the minority samples during training. In our work, we first uncover a critical shortcoming of these methods: an inability to distinguish legitimate minority samples from poison samples in the training set. By amplifying poison samples as well, group robustness methods inadvertently boost the success rate of an adversary – e.g., from 0% without amplification to over 97% with it. Notably, we supplement our empirical evidence with an impossibility result proving this inability of a standard heuristic under some assumptions. Moreover, scrutinizing recent poisoning defenses both in centralized and federated learning, we observe that they rely on similar heuristics to identify which samples should be eliminated as poisons. In consequence, minority samples are eliminated along with poisons, which damages group robustness – e.g., from 55% without the removal of the minority samples to 41% with it. Finally, as they pursue opposing goals using similar heuristics, our attempt to alleviate the trade-off by combining group robustness methods and poisoning defenses falls short. By exposing this tension, we also hope to highlight how benchmark-driven ML scholarship can obscure the trade-offs among different metrics with potentially detrimental consequences.
[LG-38] Ordering-based Conditions for Global Convergence of Policy Gradient Methods DATE NEURIPS2023
链接: https://arxiv.org/abs/2504.02130
作者: Jincheng Mei,Bo Dai,Alekh Agarwal,Mohammad Ghavamzadeh,Csaba Szepesvari,Dale Schuurmans
类目: Machine Learning (cs.LG)
*备注: arXiv version for the NeurIPS 2023 paper; to be updated for a technical issue
点击查看摘要
Abstract:We prove that, for finite-arm bandits with linear function approximation, the global convergence of policy gradient (PG) methods depends on inter-related properties between the policy update and the representation. First, we establish a few key observations that frame the study: (i) Global convergence can be achieved under linear function approximation without policy or reward realizability, both for the standard Softmax PG and natural policy gradient (NPG). (ii) Approximation error is not a key quantity for characterizing global convergence in either algorithm. (iii) The conditions on the representation that imply global convergence are different between these two algorithms. Overall, these observations call into question approximation error as an appropriate quantity for characterizing the global convergence of PG methods under linear function approximation. Second, motivated by these observations, we establish new general results: (i) NPG with linear function approximation achieves global convergence if and only if the projection of the reward onto the representable space preserves the optimal action’s rank, a quantity that is not strongly related to approximation error. (ii) The global convergence of Softmax PG occurs if the representation satisfies a non-domination condition and can preserve the ranking of rewards, which goes well beyond policy or reward realizability. We provide experimental results to support these theoretical findings.
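For intuition about the setting, here is a minimal sketch of exact-gradient Softmax PG on a finite-arm bandit with linear function approximation; the feature matrix and rewards are illustrative assumptions, and whether such iterates converge globally is governed by the ranking-preservation conditions above rather than by approximation error.

```python
# A minimal sketch of exact-gradient Softmax PG with linear features; the
# feature matrix X and reward vector r are illustrative assumptions.
import numpy as np

X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])  # one feature row per arm
r = np.array([1.0, 0.5, 0.2])                        # true mean rewards
theta = np.zeros(2)

for _ in range(5000):
    logits = X @ theta
    pi = np.exp(logits - logits.max())
    pi /= pi.sum()                          # softmax policy over the arms
    grad = X.T @ (pi * r - pi * (pi @ r))   # gradient of expected reward pi^T r
    theta += 0.1 * grad

print(pi)  # whether mass concentrates on the best arm depends on properties
           # of X, not on approximation error, echoing observation (ii) above
```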
[LG-39] Efficient Model Selection for Time Series Forecasting via LLMs
链接: https://arxiv.org/abs/2504.02119
作者: Wang Wei,Tiankai Yang,Hongjie Chen,Ryan A. Rossi,Yue Zhao,Franck Dernoncourt,Hoda Eldardiry
类目: Machine Learning (cs.LG)
*备注: 16 pages, 3 Figures
点击查看摘要
Abstract:Model selection is a critical step in time series forecasting, traditionally requiring extensive performance evaluations across various datasets. Meta-learning approaches aim to automate this process, but they typically depend on pre-constructed performance matrices, which are costly to build. In this work, we propose to leverage Large Language Models (LLMs) as a lightweight alternative for model selection. Our method eliminates the need for explicit performance matrices by utilizing the inherent knowledge and reasoning capabilities of LLMs. Through extensive experiments with LLaMA, GPT and Gemini, we demonstrate that our approach outperforms traditional meta-learning techniques and heuristic baselines, while significantly reducing computational overhead. These findings underscore the potential of LLMs in efficient model selection for time series forecasting.
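A minimal sketch of the idea, with `query_llm` as a hypothetical stand-in for whichever chat API (LLaMA, GPT, Gemini, ...) is available, and a candidate pool chosen purely for illustration:

```python
# A minimal sketch of LLM-driven model selection; `query_llm` and the
# candidate pool are illustrative assumptions, not the paper's exact setup.
CANDIDATES = ["ARIMA", "ETS", "DeepAR", "PatchTST"]

def select_model(query_llm, meta: dict) -> str:
    prompt = (
        "You are selecting a time series forecasting model.\n"
        f"Dataset: frequency={meta['freq']}, length={meta['length']}, "
        f"domain={meta['domain']}, seasonality={meta['seasonal']}.\n"
        f"Answer with exactly one of: {', '.join(CANDIDATES)}."
    )
    answer = query_llm(prompt).strip()
    return answer if answer in CANDIDATES else CANDIDATES[0]  # safe fallback
```

No performance matrix is ever built; the dataset metadata in the prompt replaces the costly meta-learning evaluations.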
[LG-40] PolyG: Effective and Efficient GraphRAG with Adaptive Graph Traversal
链接: https://arxiv.org/abs/2504.02112
作者: Renjie Liu,Haitian Jiang,Xiao Yan,Bo Tang,Jinyang Li
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:GraphRAG enhances large language models (LLMs) to generate quality answers for user questions by retrieving related facts from external knowledge graphs. Existing GraphRAG methods adopt a fixed graph traversal strategy for fact retrieval, but we observe that user questions come in different types and require different graph traversal strategies. As such, existing GraphRAG methods are limited in effectiveness (i.e., quality of the generated answers) and/or efficiency (i.e., response time or the number of used tokens). In this paper, we propose to classify the questions according to a complete four-class taxonomy and adaptively select the appropriate graph traversal strategy for each type of question. Our system PolyG is essentially a query planner for GraphRAG and can handle diverse questions with a unified interface and execution engine. Compared with SOTA GraphRAG methods, PolyG achieves an overall win rate of 75% on generation quality and a speedup of up to 4x on response time.
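A minimal sketch of the adaptive-dispatch idea; the keyword rules and the four type labels are illustrative assumptions, not PolyG's actual taxonomy or planner:

```python
# A minimal sketch of classify-then-dispatch graph traversal for GraphRAG;
# labels and rules are illustrative assumptions.
def classify(question: str) -> str:
    q = question.lower()
    if " between " in q or "path" in q:
        return "path"                                  # multi-hop connection
    if any(w in q for w in ("how many", "count", "average")):
        return "aggregate"                             # neighborhood statistics
    if "compare" in q or "versus" in q:
        return "compare"                               # parallel subgraph walks
    return "fact"                                      # single-fact lookup

def answer(question: str, graph, strategies: dict) -> list:
    traverse = strategies[classify(question)]  # pick the traversal for this type
    return traverse(graph, question)           # retrieve only what this type needs
```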
[LG-41] Measuring the Data
链接: https://arxiv.org/abs/2504.02083
作者: Ido Cohen
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Measuring the Data analytically finds the intrinsic manifold in big data. First, Optimal Transport generates the tangent space at each data point, from which the intrinsic dimension is revealed. Then, the Koopman Dimensionality Reduction procedure derives a nonlinear transformation from the data to the intrinsic manifold. The Measuring the Data procedure is presented here, backed up by encouraging results.
[LG-42] A Truncated Newton Method for Optimal Transport ICLR2025
链接: https://arxiv.org/abs/2504.02067
作者: Mete Kemertas,Amir-massoud Farahmand,Allan D. Jepson
类目: Machine Learning (cs.LG); Mathematical Software (cs.MS); Optimization and Control (math.OC)
*备注: Accepted to ICLR 2025
点击查看摘要
Abstract:Developing a contemporary optimal transport (OT) solver requires navigating trade-offs among several critical requirements: GPU parallelization, scalability to high-dimensional problems, theoretical convergence guarantees, empirical performance in terms of precision versus runtime, and numerical stability in practice. With these challenges in mind, we introduce a specialized truncated Newton algorithm for entropic-regularized OT. In addition to proving that locally quadratic convergence is possible without assuming a Lipschitz Hessian, we provide strategies to maximally exploit the high rate of local convergence in practice. Our GPU-parallel algorithm exhibits exceptionally favorable runtime performance, achieving high precision orders of magnitude faster than many existing alternatives. This is evidenced by wall-clock time experiments on 24 problem sets (12 datasets $\times$ 2 cost functions). The scalability of the algorithm is showcased on an extremely large OT problem with $n \approx 10^6$, solved approximately under weak entropic regularization.
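For context, here is a minimal Sinkhorn fixed-point baseline for the same entropic-regularized OT objective; the paper's truncated Newton solver targets precisely the high-precision, weakly regularized regimes where this simple iteration is known to slow down.

```python
# A minimal Sinkhorn baseline for entropic-regularized OT, shown for contrast;
# mu and nu are source/target marginals, C the cost matrix, eps the regularizer.
import numpy as np

def sinkhorn(mu, nu, C, eps=0.05, iters=2000):
    K = np.exp(-C / eps)                  # Gibbs kernel of the cost matrix
    u = np.ones_like(mu)
    for _ in range(iters):
        v = nu / (K.T @ u)                # alternate projection onto nu-marginal
        u = mu / (K @ v)                  # alternate projection onto mu-marginal
    return u[:, None] * K * v[None, :]    # transport plan with marginals mu, nu
```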
[LG-43] Geometric Reasoning in the Embedding Space
链接: https://arxiv.org/abs/2504.02018
作者: Jan Hůla,David Mojžíšek,Jiří Janeček,David Herel,Mikoláš Janota
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:In this contribution, we demonstrate that Graph Neural Networks and Transformers can learn to reason about geometric constraints. We train them to predict spatial position of points in a discrete 2D grid from a set of constraints that uniquely describe hidden figures containing these points. Both models are able to predict the position of points and interestingly, they form the hidden figures described by the input constraints in the embedding space during the reasoning process. Our analysis shows that both models recover the grid structure during training so that the embeddings corresponding to the points within the grid organize themselves in a 2D subspace and reflect the neighborhood structure of the grid. We also show that the Graph Neural Network we design for the task performs significantly better than the Transformer and is also easier to scale.
[LG-44] Fourier Feature Attribution: A New Efficiency Attribution Method
链接: https://arxiv.org/abs/2504.02016
作者: Zechen Liu,Feiyang Zhang,Wei Song,Xiang Li,Wei Wei
类目: Machine Learning (cs.LG)
*备注: 11 pages, 13 figures
点击查看摘要
Abstract:The study of neural networks from the perspective of Fourier features has garnered significant attention. While existing analytical research suggests that neural networks tend to learn low-frequency features, a clear attribution method for identifying the specific learned Fourier features has remained elusive. To bridge this gap, we propose a novel Fourier feature attribution method grounded in signal decomposition theory. Additionally, we analyze the differences between game-theoretic attribution metrics for Fourier and spatial domain features, demonstrating that game-theoretic evaluation metrics are better suited for Fourier-based feature attribution. Our experiments show that Fourier feature attribution exhibits superior feature selection capabilities compared to spatial domain attribution methods. For instance, in the case of Vision Transformers (ViTs) on the ImageNet dataset, only 8% of the Fourier features are required to maintain the original predictions for 80% of the samples. Furthermore, we compare the specificity of features identified by our method against traditional spatial domain attribution methods. Results reveal that Fourier features exhibit greater intra-class concentration and inter-class distinctiveness, indicating their potential for more efficient classification and explainable AI algorithms.
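A minimal sketch of the kind of evaluation reported above: keep only a small fraction of an image's Fourier coefficients and check whether model predictions survive. Ranking coefficients by raw magnitude is a simplifying assumption here; the paper derives attribution scores from signal decomposition theory instead.

```python
# A minimal sketch of Fourier feature masking; ranking by magnitude is an
# illustrative stand-in for the paper's attribution scores.
import numpy as np

def keep_topk_fourier(img: np.ndarray, frac: float = 0.08) -> np.ndarray:
    F = np.fft.fft2(img)
    k = max(1, int(frac * F.size))
    thresh = np.sort(np.abs(F).ravel())[-k]       # magnitude of k-th largest coeff
    F_masked = np.where(np.abs(F) >= thresh, F, 0)
    return np.fft.ifft2(F_masked).real            # image rebuilt from top-k features
```

Feeding `keep_topk_fourier(img)` back to the classifier and comparing predictions against the original image reproduces the style of the 8%-of-features experiment described above.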
[LG-45] Fault injection analysis of Real NVP normalising flow model for satellite anomaly detection IJCNN
链接: https://arxiv.org/abs/2504.02015
作者: Gabriele Greco,Carlo Cena,Umberto Albertin,Mauro Martini,Marcello Chiaberge
类目: Machine Learning (cs.LG)
*备注: Passed first review at 2025 International Joint Conference on Neural Networks (IJCNN)
点击查看摘要
Abstract:Satellites are used for a multitude of applications, including communications, Earth observation, and space science. Neural networks and deep learning-based approaches now represent the state-of-the-art to enhance the performance and efficiency of these tasks. Given that satellites are susceptible to various faults, one critical application of Artificial Intelligence (AI) is fault detection. However, despite the advantages of neural networks, these systems are vulnerable to radiation errors, which can significantly impact their reliability. Ensuring the dependability of these solutions requires extensive testing and validation, particularly using fault injection methods. This study analyses a physics-informed (PI) real-valued non-volume preserving (Real NVP) normalizing flow model for fault detection in space systems, with a focus on resilience to Single-Event Upsets (SEUs). We present a customized fault injection framework in TensorFlow to assess neural network resilience. Fault injections are applied through two primary methods: Layer State injection, targeting internal network components such as weights and biases, and Layer Output injection, which modifies layer outputs across various activations. Fault types include zeros, random values, and bit-flip operations, applied at varying levels and across different network layers. Our findings reveal several critical insights, such as the significance of bit-flip errors in critical bits, that can lead to substantial performance degradation or even system failure. With this work, we aim to exhaustively study the resilience of Real NVP models against errors due to radiation, providing a means to guide the implementation of fault tolerance measures.
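A minimal sketch of a Layer State bit-flip injection on float32 weights, assuming NumPy; flipping an exponent bit illustrates the "critical bit" errors the study finds most damaging.

```python
# A minimal sketch of a bit-flip fault injection on float32 weights; index and
# bit position are chosen by the fault injection campaign.
import numpy as np

def flip_bit(weights: np.ndarray, idx: int, bit: int) -> np.ndarray:
    w = weights.astype(np.float32).copy()
    raw = w.view(np.uint32)               # reinterpret the same bytes as uint32
    raw[idx] ^= np.uint32(1 << bit)       # flip one bit in place
    return w

w = np.array([0.5, -1.25, 3.0], dtype=np.float32)
print(flip_bit(w, 0, 30))                 # bit 30 sits in the exponent field,
                                          # so 0.5 becomes a huge value (~1.7e38)
```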
[LG-46] Attention Mamba: Time Series Modeling with Adaptive Pooling Acceleration and Receptive Field Enhancements
链接: https://arxiv.org/abs/2504.02013
作者: Sijie Xiong,Shuqing Liu,Cheng Tang,Fumiya Okubo,Haoling Xiong,Atsushi Shimada
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:“This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible.” Time series modeling serves as the cornerstone of real-world applications, such as weather forecasting and transportation management. Recently, Mamba has become a promising model that combines near-linear computational complexity with high prediction accuracy in time series modeling, while facing challenges such as insufficient modeling of nonlinear dependencies in attention and restricted receptive fields caused by convolutions. To overcome these limitations, this paper introduces an innovative framework, Attention Mamba, featuring a novel Adaptive Pooling block that accelerates attention computation and incorporates global information, effectively overcoming the constraints of limited receptive fields. Furthermore, Attention Mamba integrates a bidirectional Mamba block, efficiently capturing long-short features and transforming inputs into the Value representations for attention mechanisms. Extensive experiments conducted on diverse datasets underscore the effectiveness of Attention Mamba in extracting nonlinear dependencies and enhancing receptive fields, establishing superior performance among leading counterparts. Our codes will be available on GitHub.
[LG-47] Instruction-Guided Autoregressive Neural Network Parameter Generation
链接: https://arxiv.org/abs/2504.02012
作者: Soro Bedionita,Bruno Andreis,Song Chong,Sung Ju Hwang
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Learning to generate neural network parameters conditioned on task descriptions and architecture specifications is pivotal for advancing model adaptability and transfer learning. Existing methods, especially those based on diffusion models, suffer from limited scalability to large architectures, rigidity in handling varying network depths, and disjointed parameter generation that undermines inter-layer coherence. In this work, we propose IGPG (Instruction Guided Parameter Generation), an autoregressive framework that unifies parameter synthesis across diverse tasks and architectures. IGPG leverages a VQ-VAE and an autoregressive model to generate neural network parameters, conditioned on task instructions, dataset, and architecture details. By autoregressively generating neural network weights’ tokens, IGPG ensures inter-layer coherence and enables efficient adaptation across models and datasets. Operating at the token level, IGPG effectively captures complex parameter distributions aggregated from a broad spectrum of pretrained models. Extensive experiments on multiple vision datasets demonstrate that IGPG consolidates diverse pretrained models into a single, flexible generative framework. The synthesized parameters achieve competitive or superior performance relative to state-of-the-art methods, especially in terms of scalability and efficiency when applied to large architectures. These results underscore IGPG’s potential as a powerful tool for pretrained weight retrieval, model selection, and rapid task-specific fine-tuning.
[LG-48] When Reasoning Meets Compression: Benchmarking Compressed Large Reasoning Models on Complex Reasoning Tasks
链接: https://arxiv.org/abs/2504.02010
作者: Nan Zhang,Yusen Zhang,Prasenjit Mitra,Rui Zhang
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Recent open-source large reasoning models (LRMs) exhibit strong performance on complex reasoning tasks, but their large parameter count makes them prohibitively expensive for individuals. The compression of large language models (LLMs) offers an effective solution to reduce cost of computational resources. However, systematic studies on the performance of compressed LLMs in complex reasoning tasks, especially for LRMs, are lacking. Most works on quantization and pruning focus on preserving language modeling performance, while existing distillation works do not comprehensively benchmark student models based on reasoning difficulty or compression impact on knowledge and reasoning. In this paper, we benchmark compressed DeepSeek-R1 models on four different reasoning datasets (AIME 2024, FOLIO, Temporal Sequences of BIG-Bench Hard, and MuSiQue), ranging from mathematical to multihop reasoning, using quantization, distillation, and pruning methods. We benchmark 2.51-, 1.73-, and 1.58-bit R1 models that adopt dynamic quantization. We also benchmark distilled R1 models that are based on LLaMA or Qwen and run SparseGPT on them to obtain various sparsity levels. Studying the performance and behavior of compressed LRMs, we report their performance scores and test-time compute (number of tokens spent on each question). Notably, using MuSiQue, we find that parameter count has a much greater impact on LRMs’ knowledge memorization than on their reasoning capability, which can inform the choice of compression techniques. Through our empirical analysis of test-time compute, we find that shorter model outputs generally achieve better performance than longer ones across several benchmarks for both R1 and its compressed variants, highlighting the need for more concise reasoning chains.
[LG-49] Efficient Near-Optimal Algorithm for Online Shortest Paths in Directed Acyclic Graphs with Bandit Feedback Against Adaptive Adversaries
链接: https://arxiv.org/abs/2504.00461
作者: Arnab Maiti,Zhiyuan Fan,Kevin Jamieson,Lillian J. Ratliff,Gabriele Farina
类目: Machine Learning (cs.LG); Computer Science and Game Theory (cs.GT)
*备注: 48 pages, 8 figures
点击查看摘要
Abstract:In this paper, we study the online shortest path problem in directed acyclic graphs (DAGs) under bandit feedback against an adaptive adversary. Given a DAG $G = (V, E)$ with a source node $v_{\mathsf{s}}$ and a sink node $v_{\mathsf{t}}$, let $X \subseteq \{0,1\}^{|E|}$ denote the set of all paths from $v_{\mathsf{s}}$ to $v_{\mathsf{t}}$. At each round $t$, we select a path $\mathbf{x}_t \in X$ and receive bandit feedback on our loss $\langle \mathbf{x}_t, \mathbf{y}_t \rangle \in [-1,1]$, where $\mathbf{y}_t$ is an adversarially chosen loss vector. Our goal is to minimize regret with respect to the best path in hindsight over $T$ rounds. We propose the first computationally efficient algorithm to achieve a near-minimax optimal regret bound of $\tilde{O}(\sqrt{|E|T\log |X|})$ with high probability against any adaptive adversary, where $\tilde{O}(\cdot)$ hides logarithmic factors in the number of edges $|E|$. Our algorithm leverages a novel loss estimator and a centroid-based decomposition in a nontrivial manner to attain this regret bound. As an application, we show that our algorithm for DAGs provides state-of-the-art efficient algorithms for $m$-sets, extensive-form games, the Colonel Blotto game, shortest walks in directed graphs, hypercubes, and multi-task multi-armed bandits, achieving improved high-probability regret guarantees in all these settings.
[LG-50] RoboCAS: A Benchmark for Robotic Manipulation in Complex Object Arrangement Scenarios
链接: https://arxiv.org/abs/2407.06951
作者: Liming Zheng,Feng Yan,Fanfan Liu,Chengjian Feng,Zhuoliang Kang,Lin Ma
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Foundation models hold significant potential for enabling robots to perform long-horizon general manipulation tasks. However, the simplicity of tasks and the uniformity of environments in existing benchmarks restrict their effective deployment in complex scenarios. To address this limitation, this paper introduces the RoboCAS benchmark, the first benchmark specifically designed for complex object arrangement scenarios in robotic manipulation. This benchmark employs flexible and concise scripted policies to efficiently collect a diverse array of demonstrations, showcasing scattered, orderly, and stacked object arrangements within a highly realistic physical simulation environment. It includes complex processes such as target retrieval, obstacle clearance, and robot manipulation, testing agents’ abilities to perform long-horizon planning for spatial reasoning and predicting chain reactions under ambiguous instructions. Extensive experiments on multiple baseline models reveal their limitations in managing complex object arrangement scenarios, underscoring the urgent need for intelligent agents capable of performing long-horizon operations in practical deployments and providing valuable insights for future research directions. Project website: this https URL.
[LG-51] Semiparametric Counterfactual Regression
链接: https://arxiv.org/abs/2504.02694
作者: Kwangho Kim
类目: Methodology (stat.ME); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
点击查看摘要
Abstract:We study counterfactual regression, which aims to map input features to outcomes under hypothetical scenarios that differ from those observed in the data. This is particularly useful for decision-making when adapting to sudden shifts in treatment patterns is essential. We propose a doubly robust-style estimator for counterfactual regression within a generalizable framework that accommodates a broad class of risk functions and flexible constraints, drawing on tools from semiparametric theory and stochastic optimization. Our approach uses incremental interventions to enhance adaptability while maintaining consistency with standard methods. We formulate the target estimand as the optimal solution to a stochastic optimization problem and develop an efficient estimation strategy, where we can leverage rapid development of modern optimization algorithms. We go on to analyze the rates of convergence and characterize the asymptotic distributions. Our analysis shows that the proposed estimators can achieve $\sqrt{n}$-consistency and asymptotic normality for a broad class of problems. Numerical illustrations highlight their effectiveness in adapting to unseen counterfactual scenarios while maintaining parametric convergence rates.
[LG-52] A Dynamic Ordinal Gaussian Process Item Response Theoretic Model
链接: https://arxiv.org/abs/2504.02643
作者: Yehu Chen,Jacob Montgomery,Roman Garnett
类目: Methodology (stat.ME); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Social scientists are often interested in using ordinal indicators to estimate latent traits that change over time. Frequently, this is done with item response theoretic (IRT) models that describe the relationship between those latent traits and observed indicators. We combine recent advances in Bayesian nonparametric IRT, which makes minimal assumptions on shapes of item response functions, and Gaussian process time series methods to capture dynamic structures in latent traits from longitudinal observations. We propose a generalized dynamic Gaussian process item response theory (GD-GPIRT) as well as a Markov chain Monte Carlo sampling algorithm for estimation of both latent traits and response functions. We evaluate GD-GPIRT in simulation studies against baselines in dynamic IRT, and apply it to various substantive studies, including assessing public opinion on the economy and the environment, and congressional ideology related to the abortion debate.
[LG-53] Incorporating the ChEES Criterion into Sequential Monte Carlo Samplers
链接: https://arxiv.org/abs/2504.02627
作者: Andrew Millard,Joshua Murphy,Daniel Frisch,Simon Maskell
类目: Computation (stat.CO); Machine Learning (cs.LG); Systems and Control (eess.SY); Machine Learning (stat.ML)
*备注: 16 pages, 9 figures
点击查看摘要
Abstract:Markov chain Monte Carlo (MCMC) methods are a powerful but computationally expensive way of performing non-parametric Bayesian inference. MCMC proposals which utilise gradients, such as Hamiltonian Monte Carlo (HMC), can better explore the parameter space of interest if the additional hyper-parameters are chosen well. The No-U-Turn Sampler (NUTS) is a variant of HMC which is extremely effective at selecting these hyper-parameters but is slow to run and is not suited to GPU architectures. An alternative to NUTS, Change in the Estimator of the Expected Square HMC (ChEES-HMC), was shown not only to run faster than NUTS on GPU but also to sample from posteriors more efficiently. Sequential Monte Carlo (SMC) samplers are another sampling method which instead output weighted samples from the posterior. They are very amenable to parallelisation and therefore to being run on GPUs, while having additional flexibility in their choice of proposal over MCMC. We incorporate ChEES-HMC as a proposal into SMC samplers and demonstrate competitive but faster performance than NUTS on a number of tasks.
[LG-54] Analytical Discovery of Manifold with Machine Learning
链接: https://arxiv.org/abs/2504.02511
作者: Yafei Shen,Huan-Fei Ma,Ling Yang
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Understanding low-dimensional structures within high-dimensional data is crucial for visualization, interpretation, and denoising in complex datasets. Despite the advancements in manifold learning techniques, key challenges-such as limited global insight and the lack of interpretable analytical descriptions-remain unresolved. In this work, we introduce a novel framework, GAMLA (Global Analytical Manifold Learning using Auto-encoding). GAMLA employs a two-round training process within an auto-encoding framework to derive both character and complementary representations for the underlying manifold. With the character representation, the manifold is represented by a parametric function which unfolds the manifold to provide a global coordinate. With the complementary representation, an approximate explicit manifold description is developed, offering a global and analytical representation of smooth manifolds underlying high-dimensional datasets. This enables the analytical derivation of geometric properties such as curvature and normal vectors. Moreover, we find the two representations together decompose the whole latent space and can thus characterize the local spatial structure surrounding the manifold, proving particularly effective in anomaly detection and categorization. Through extensive experiments on benchmark datasets and real-world applications, GAMLA demonstrates its ability to achieve computational efficiency and interpretability while providing precise geometric and structural insights. This framework bridges the gap between data-driven manifold learning and analytical geometry, presenting a versatile tool for exploring the intrinsic properties of complex data sets.
[LG-55] CrystalFormer-RL: Reinforcement Fine-Tuning for Materials Design
链接: https://arxiv.org/abs/2504.02367
作者: Zhendong Cao,Lei Wang
类目: Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG); Computational Physics (physics.comp-ph)
*备注: 8 pages, 6 figures
点击查看摘要
Abstract:Reinforcement fine-tuning has instrumentally enhanced the instruction-following and reasoning abilities of large language models. In this work, we explore the applications of reinforcement fine-tuning to the autoregressive transformer-based materials generative model CrystalFormer (arXiv:2403.15734) using discriminative machine learning models such as interatomic potentials and property prediction models. By optimizing reward signals-such as energy above the convex hull and material property figures of merit-reinforcement fine-tuning infuses knowledge from discriminative models into generative models. The resulting model, CrystalFormer-RL, shows enhanced stability in generated crystals and successfully discovers crystals with desirable yet conflicting material properties, such as substantial dielectric constant and band gap simultaneously. Notably, we observe that reinforcement fine-tuning enables not only the property-guided novel material design ability of the generative pre-trained model but also unlocks property-driven material retrieval from the unsupervised pre-training dataset. Leveraging rewards from discriminative models to fine-tune materials generative models opens an exciting gateway to the synergies of the machine learning ecosystem for materials.
[LG-56] Dynamic Assortment Selection and Pricing with Censored Preference Feedback ICLR2025
链接: https://arxiv.org/abs/2504.02324
作者: Jung-hun Kim,Min-hwan Oh
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: Accepted at ICLR 2025
点击查看摘要
Abstract:In this study, we investigate the problem of dynamic multi-product selection and pricing by introducing a novel framework based on a censored multinomial logit (C-MNL) choice model. In this model, sellers present a set of products with prices, and buyers filter out products priced above their valuation, purchasing at most one product from the remaining options based on their preferences. The goal is to maximize seller revenue by dynamically adjusting product offerings and prices, while learning both product valuations and buyer preferences through purchase feedback. To achieve this, we propose a Lower Confidence Bound (LCB) pricing strategy. By combining this pricing strategy with either an Upper Confidence Bound (UCB) or Thompson Sampling (TS) product selection approach, our algorithms achieve regret bounds of $\tilde{O}(d^{3/2}\sqrt{T}/\kappa)$ and $\tilde{O}(d^{2}\sqrt{T}/\kappa)$, respectively. Finally, we validate the performance of our methods through simulations, demonstrating their effectiveness.
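A minimal sketch of the LCB pricing idea: offer each product at a lower confidence bound of its estimated valuation, so that with high probability it survives the buyer's censoring filter. The bonus term below is a generic illustration, not the paper's exact confidence radius.

```python
# A minimal sketch of LCB pricing; the confidence bonus is an illustrative
# UCB-style term, not the paper's derived radius.
import numpy as np

def lcb_prices(v_hat: np.ndarray, counts: np.ndarray, t: int) -> np.ndarray:
    bonus = np.sqrt(2.0 * np.log(t + 1) / np.maximum(counts, 1))
    return np.maximum(v_hat - bonus, 0.0)   # priced low early, tighter over time
```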
[LG-57] Quantum Deep Sets and Sequences
链接: https://arxiv.org/abs/2504.02241
作者: Vladimir Vargas-Calderón
类目: Quantum Physics (quant-ph); Machine Learning (cs.LG)
*备注: Presented at Quantum Techniques in Machine Learning 2024
点击查看摘要
Abstract:This paper introduces the quantum deep sets model, expanding the quantum machine learning tool-box by enabling the possibility of learning variadic functions using quantum systems. A couple of variants are presented for this model. The first one focuses on mapping sets to quantum systems through state vector averaging: each element of the set is mapped to a quantum state, and the quantum state of the set is the average of the corresponding quantum states of its elements. This approach allows the definition of a permutation-invariant variadic model. The second variant is useful for ordered sets, i.e., sequences, and relies on optimal coherification of tristochastic tensors that implement products of mixed states: each element of the set is mapped to a density matrix, and the quantum state of the set is the product of the corresponding density matrices of its elements. Such variant can be relevant in tasks such as natural language processing. The resulting quantum state in any of the variants is then processed to realise a function that solves a machine learning task such as classification, regression or density estimation. Through synthetic problem examples, the efficacy and versatility of quantum deep sets and sequences (QDSs) is demonstrated.
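A minimal sketch of the first, permutation-invariant variant using plain NumPy state vectors; the toy element encoder and the renormalization step are illustrative assumptions.

```python
# A minimal sketch of state-vector averaging for sets; encoder and
# renormalization are assumptions for illustration.
import numpy as np

def encode(x: float, dim: int = 4) -> np.ndarray:
    psi = np.exp(1j * x * np.arange(dim))    # toy amplitude encoding of a scalar
    return psi / np.linalg.norm(psi)

def set_state(xs) -> np.ndarray:
    avg = np.mean([encode(x) for x in xs], axis=0)  # order-independent average
    return avg / np.linalg.norm(avg)                # renormalize to a unit vector

assert np.allclose(set_state([1.0, 2.0]), set_state([2.0, 1.0]))  # permutation-invariant
```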
[LG-58] Orbit Determination through Cosmic Microwave Background Radiation
链接: https://arxiv.org/abs/2504.02196
作者: Pedro K de Albuquerque,Andre R Kuroswiski,Annie S. Wu,Willer G. dos Santos,Paulo Costa
类目: Instrumentation and Detectors (physics.ins-det); Instrumentation and Methods for Astrophysics (astro-ph.IM); Machine Learning (cs.LG)
*备注: This paper was presented at the 2024 AAS/AIAA Astrodynamics Specialist Conference, August 11-15, 2024, Broomfield, Colorado, USA
点击查看摘要
Abstract:This research explores the use of Cosmic Microwave Background (CMB) radiation as a reference signal for Initial Orbit Determination (IOD). By leveraging the unique properties of CMB, this study introduces a novel method for estimating spacecraft velocity and position with minimal reliance on pre-existing environmental data, offering significant advantages for space missions independent of Earth-specific conditions. Using Machine Learning (ML) regression models, this approach demonstrates the capability to determine velocity from CMB signals and subsequently determine the satellite’s position. The results indicate that CMB has the potential to enhance the autonomy and flexibility of spacecraft operations.
[LG-59] HQCC: A Hybrid Quantum-Classical Classifier with Adaptive Structure
链接: https://arxiv.org/abs/2504.02167
作者: Ren-Xin Zhao,Xinze Tong,Shi Wang
类目: Quantum Physics (quant-ph); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Parameterized Quantum Circuits (PQCs) with fixed structures severely degrade the performance of Quantum Machine Learning (QML). To address this, a Hybrid Quantum-Classical Classifier (HQCC) is proposed. It opens a practical way to advance QML in the Noisy Intermediate-Scale Quantum (NISQ) era by adaptively optimizing the PQC through a Long Short-Term Memory (LSTM) driven dynamic circuit generator, utilizing a local quantum filter for scalable feature extraction, and exploiting architectural plasticity to balance the entanglement depth and noise robustness. We realize the HQCC on the TensorCircuit platform and run simulations on the MNIST and Fashion MNIST datasets, achieving up to 97.12% accuracy on MNIST and outperforming several alternative methods.
[LG-60] Robust Channel Estimation for Optical Wireless Communications Using Neural Network
链接: https://arxiv.org/abs/2504.02134
作者: Dianxin Luan,John Thompson
类目: Signal Processing (eess.SP); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Optical Wireless Communication (OWC) has gained significant attention due to its high-speed data transmission and throughput. Optical wireless channels are often assumed to be flat, but we evaluate frequency-selective channels to account for high-data-rate optical wireless systems or very dispersive environments. To address this for optical scenarios, this paper presents a robust, low-complexity channel estimation framework that mitigates frequency-selective effects and thereby improves system reliability and performance. This channel estimation framework contains a neural network that can estimate general optical wireless channels without prior channel information about the environment. Based on this estimate and the corresponding delay spread, one of several candidate offline-trained neural networks will be activated to predict this channel. Simulation results demonstrate that the proposed method has improved and robust normalized mean square error (NMSE) and bit error rate (BER) performance compared to conventional estimation methods while maintaining computational efficiency. These findings highlight the potential of neural network solutions in enhancing the performance of OWC systems under indoor channel conditions.
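A minimal sketch of the two-stage selection logic: measure the RMS delay spread of a coarse channel estimate, then activate the matching offline-trained network. The thresholds and specialist labels are illustrative assumptions.

```python
# A minimal sketch of delay-spread-based estimator selection; thresholds and
# the specialist pool are illustrative assumptions.
import numpy as np

def rms_delay_spread(h: np.ndarray, dt: float) -> float:
    p = np.abs(h) ** 2 / np.sum(np.abs(h) ** 2)   # normalized power delay profile
    t = np.arange(len(h)) * dt
    mean = np.sum(p * t)
    return float(np.sqrt(np.sum(p * (t - mean) ** 2)))

def pick_estimator(h_coarse: np.ndarray, dt: float, specialists: dict):
    tau = rms_delay_spread(h_coarse, dt)
    key = "short" if tau < 10e-9 else "medium" if tau < 50e-9 else "long"
    return specialists[key]                        # offline-trained network to run
```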
[LG-61] What Can 240000 New Credit Transactions Tell Us About the Impact of NGEU Funds?
链接: https://arxiv.org/abs/2504.01964
作者: Alvaro Ortiz,Tomasa Rodrigo,David Sarasa,Pedro Torinos,Sirenia Vazquez
类目: atistical Finance (q-fin.ST); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Using a panel data local projections model and controlling for firm characteristics, procurement bid attributes, and macroeconomic conditions, the study estimates the dynamic effects of procurement awards on new lending, a more precise measure than the change in the stock of credit. The analysis further examines heterogeneity in credit responses based on firm size, industry, credit maturity, and value chain position of the firms. The empirical evidence confirms that public procurement awards significantly increase new lending, with NGEU-funded contracts generating stronger credit expansion than traditional procurement during the recent period. The results show that the impact of NGEU procurement programs aligns closely with historical procurement impacts, with differences driven mainly by lower utilization rates. Moreover, integrating high-frequency financial data with procurement records highlights the potential of Big Data in refining public policy design.
信息检索
[IR-0] An Assessment of the CO2 Emission Reduction Potential of Residential Load Management in Developing and Developed Countries
链接: https://arxiv.org/abs/2504.02811
作者: Alona Zharova,Felix Creutzig
类目: Information Retrieval (cs.IR)
*备注:
点击查看摘要
Abstract:Intermittent renewable energies are increasingly dominating electricity grids and are forecasted to be the main force driving out fossil fuels from the grid in most major economies until 2040. However, grids based on intermittent renewables are challenged by diurnal and seasonal mismatch between supply of sun and wind and demand for electricity, including for heat pumps and electric two and four wheelers. Load management and demand response measures promise to adjust for this mismatch, utilizing information- and price-based approaches to steer demand towards times with high supply of intermittent renewables. Here, we systematically review the literature estimating CO2 savings from residential load management in developing and developed nations. We find that load management holds high potential, locally differentiated with energy mix (including the respective share of renewables and fossils), climate zone, and the regulatory environment and price mechanism. Most identified studies suggest a mitigation potential between 1 and 20%. Load management becomes more relevant with higher shares of intermittent renewables, and when electricity prices are high. Importantly, load management aligns consumers’ financial incentives with climate change mitigation, thus rendering accompanying strategies politically feasible. We summarize key regulatory steps to facilitate load management in economies and to realize relevant consumer surplus and mitigation potential.
[IR-1] Graphs are everywhere – Psst! In Music Recommendation too
链接: https://arxiv.org/abs/2504.02598
作者: Bharani Jayakumar
类目: Information Retrieval (cs.IR)
*备注: 5 pages, 4 figures, 2 tables, and a few equations
点击查看摘要
Abstract:In recent years, graphs have gained prominence across various domains, especially in recommendation systems. Within the realm of music recommendation, graphs play a crucial role in enhancing genre-based recommendations by integrating Mel-Frequency Cepstral Coefficients (MFCC) with advanced graph embeddings. This study explores the efficacy of Graph Convolutional Networks (GCN), GraphSAGE, and Graph Transformer (GT) models in learning embeddings that effectively capture intricate relationships between music items and genres represented within graph structures. Through comprehensive empirical evaluations on diverse real-world music datasets, our findings consistently demonstrate that these graph-based approaches outperform traditional methods that rely solely on MFCC features or collaborative filtering techniques. Specifically, the graph-enhanced models achieve notably higher accuracy in predicting genre-specific preferences and offering relevant music suggestions to users. These results underscore the effectiveness of utilizing graph embeddings to enrich feature representations and exploit latent associations within music data, thereby illustrating their potential to advance the capabilities of personalized and context-aware music recommendation systems. Keywords: graphs, recommendation systems, neural networks, MFCC
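A minimal sketch of the feature side of such a pipeline, assuming librosa for MFCC extraction and a pre-trained GCN/GraphSAGE/GT encoder supplying the graph embedding:

```python
# A minimal sketch of combining MFCC summaries with graph embeddings;
# `graph_embedding` stands in for the output of a trained graph encoder.
import numpy as np
import librosa

def track_features(path: str, graph_embedding: np.ndarray) -> np.ndarray:
    y, sr = librosa.load(path, sr=22050)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13).mean(axis=1)  # 13-dim summary
    return np.concatenate([mfcc, graph_embedding])   # joint audio + graph feature
```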
[IR-2] Research Paper Recommender System by Considering Users Information Seeking Behaviors IJCNN2025
链接: https://arxiv.org/abs/2504.02377
作者: Zhelin Xu,Shuhei Yamamoto,Hideo Joho
类目: Information Retrieval (cs.IR)
*备注: 9 pages, 5 figures, accepted as a full paper at IJCNN 2025
点击查看摘要
Abstract:With the rapid growth of scientific publications, researchers need to spend more time and effort searching for papers that align with their research interests. To address this challenge, paper recommendation systems have been developed to help researchers effectively identify relevant papers. One of the leading approaches to paper recommendation is the content-based filtering method. Traditional content-based filtering methods recommend relevant papers to users based on the overall similarity of papers. However, these approaches do not take into account the information seeking behaviors that users commonly employ when searching for literature. Such behaviors include not only evaluating the overall similarity among papers, but also focusing on specific sections, such as the method section, to ensure that the approach aligns with the user’s interests. In this paper, we propose a content-based filtering recommendation method that takes this information seeking behavior into account. Specifically, in addition to considering the overall content of a paper, our approach also takes into account three specific sections (background, method, and results) and assigns weights to them to better reflect user preferences. We conduct offline evaluations on the publicly available DBLP dataset, and the results demonstrate that the proposed method outperforms six baseline methods in terms of precision, recall, F1-score, MRR, and MAP.
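A minimal sketch of section-weighted content-based scoring; the embedding source and the weights are illustrative assumptions, not the paper's tuned values.

```python
# A minimal sketch of section-weighted similarity; each dict maps a section
# name to a precomputed text embedding, and the weights are illustrative.
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def paper_score(query_vecs: dict, cand_vecs: dict,
                weights={"full": 0.4, "background": 0.2,
                         "method": 0.3, "results": 0.1}) -> float:
    return sum(w * cosine(query_vecs[k], cand_vecs[k]) for k, w in weights.items())
```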
[IR-3] LLM-Augmented Graph Neural Recommenders: Integrating User Reviews
链接: https://arxiv.org/abs/2504.02195
作者: Hiroki Kanezashi,Toyotaro Suzumura,Cade Reid,Md Mostafizur Rahman,Yu Hirate
类目: Information Retrieval (cs.IR)
*备注: Under Review
点击查看摘要
Abstract:Recommender systems increasingly aim to combine signals from both user reviews and purchase (or other interaction) behaviors. While user-written comments provide explicit insights about preferences, merging these textual representations from large language models (LLMs) with graph-based embeddings of user actions remains a challenging task. In this work, we propose a framework that employs both a Graph Neural Network (GNN)-based model and an LLM to produce review-aware representations, preserving review semantics while mitigating textual noise. Our approach utilizes a hybrid objective that balances user-item interactions against text-derived features, ensuring that both users’ behavioral and linguistic signals are effectively captured. We evaluate this method on multiple datasets from diverse application domains, demonstrating consistent improvements over a baseline GNN-based recommender model. Notably, our model achieves significant gains in recommendation accuracy when review data is sparse or unevenly distributed. These findings highlight the importance of integrating LLM-driven textual feedback with GNN-derived user behavioral patterns to develop robust, context-aware recommender systems.
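A minimal sketch of such a hybrid scoring rule, assuming PyTorch; the fusion coefficient and dot-product form are illustrative assumptions rather than the paper's exact objective.

```python
# A minimal sketch of a hybrid behavior/text scoring rule; alpha trades off
# GNN interaction embeddings against LLM review embeddings.
import torch

def hybrid_score(u_gnn: torch.Tensor, i_gnn: torch.Tensor,
                 u_text: torch.Tensor, i_text: torch.Tensor,
                 alpha: float = 0.7) -> torch.Tensor:
    behavior = (u_gnn * i_gnn).sum(-1)     # dot product over interaction embeddings
    semantic = (u_text * i_text).sum(-1)   # dot product over review embeddings
    return alpha * behavior + (1 - alpha) * semantic
```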
附件下载
点击下载今日全部论文列表